Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
WITHDRAWN: Generating Structurally Diverse Therapeutic Peptides with GFlowNet
Wijaya, E.
Abstract
The authors have withdrawn this manuscript because the submitter did not have the rights to agree to the distribution license at the time of submission. Therefore, the authors do not wish this work to be cited as reference for the project. If you have any questions, please contact the corresponding author.
bioinformatics 2026-06-22 v5
WITHDRAWN: Distilling Protein Language Models with Complementary Regularizers
Wijaya, E.
Abstract
The authors have withdrawn this manuscript because, at the time of submission, the submitter did not have the rights required to agree to the distribution license. Accordingly, the authors request that this work not be cited as a reference for the project. Please contact the corresponding author with any questions.
bioinformatics 2026-06-22 v4
ATLAS: a scverse-compatible package for multi-omic single-cell trajectory inference integration
Leclercq, A.; Martini, L.; Bardini, R.; Savino, A.; Di Carlo, S.
Abstract
Single-cell trajectory inference is widely used to study cellular differentiation and fate decisions, yet most existing approaches rely on transcriptomic information alone, limiting their ability to capture the regulatory processes underlying cell-state transitions. This work presents ATLAS (Advanced Trajectory Learning from multi-omics At Single-cell resolution), a scverse-compatible framework for trajectory inference in paired single-cell RNA-seq and ATAC-seq data. ATLAS integrates transcriptomic and chromatin accessibility information through Weighted Nearest Neighbor graphs, enabling both molecular layers to jointly inform pseudotime estimation, terminal-state identification, and fate probability inference within a unified multi-omic representation. Across synthetic and real datasets, ATLAS reconstructs coherent developmental trajectories, captures progressive fate commitment, and resolves biologically meaningful lineage structures, demonstrating the effectiveness of multi-omic integration for characterizing cellular dynamics. In addition, ATLAS enables the joint exploration of transcription factor expression and target gene activity along pseudotime, providing direct access to regulatory programs and chromatin-associated transitions that are not detectable from transcriptomic data alone. Overall, ATLAS provides a scalable and biologically informative framework for studying dynamic cellular processes in single-cell multi-omics experiments.
bioinformatics 2026-06-22 v2
WITHDRAWN: Preprint Commons: A platform for the systematic tracking of preprint trends and impact
Behera, B. P.; panda, B.
Abstract
The authors have withdrawn their manuscript because it was posted without the consent of all authors. Therefore, the authors do not wish this work to be cited as reference for the project. If you have any questions, please contact the corresponding author.
bioinformatics 2026-06-22 v2
Proteomics-constrained deconvolution reveals spatial cell-type programs in tumours
Isik, E. B.; Haley, M. J.; Anbaki, A. A.; Bere, L.; Roncaroli, F.; Piper Hanley, K.; Couper, K.; Wedge, D. C.; Sellers, R.; Baker, A.; Oliveira, P.; Ashton, J.; Bristow, R. G.; Alvarez, M. A.; Georgaka, S.; Rattray, M.
Abstract
Accurately resolving cell-type mixtures in spatial transcriptomics remains challenging, particularly in heterogeneous tumours where cell populations are intermixed and matched single-cell references may be unavailable or poorly aligned. Current deconvolution approaches either require high-quality scRNA-seq references, suffer from scalability limitations, or lack interpretability. We introduce PISTACHIO, a proteomics-informed spatial transcriptomics deconvolution framework based on constrained non-negative matrix factorization with a negative-binomial likelihood. Rather than using probabilistic priors, PISTACHIO incorporates spatial cell-type constraints derived from paired Imaging Mass Cytometry, enforcing biologically grounded sparsity and explicit spatial feasibility of cell-type presence. PISTACHIO improved recovery of spatial cell-type distributions compared with Cell2location and STdeconvolve across synthetic and real tumour datasets. Our approach remains robust under cell-type assignment errors, maintaining high correlation with ground-truth under moderate noise, and achieves fast runtime on standard hardware, enabling practical large-scale deployment.
bioinformatics 2026-06-22 v2
WITHDRAWN: Agent-Guided Ranking Policy Improvement for Peptide Drug Candidate Prioritization
Wijaya, E.
Abstract
The authors have withdrawn this manuscript because the submitter did not have the rights to agree to the distribution license at the time of submission. Therefore, the authors do not wish this work to be cited as reference for the project. If you have any questions, please contact the corresponding author.
bioinformatics 2026-06-22 v2
Benchmarking cell type annotation in spatial transcriptomics: resolving cellular hierarchies, biological fidelity, and dynamic cell states
Zhu, Y.; Hu, Y.; Xie, M. B.; Qin, H.; Szul, Z. J.; Young, D. M.; Yuan, W.; Wang, Q.; Liu, Y. H.; Shen, W.; Meltzer, S.; Zhou, X. M.
Abstract
Spatial transcriptomics enables the quantification of gene expression within its native tissue context, providing unprecedented insight into tissue architecture, cellular ecosystems, and local cell-cell interactions at regional and single-cell resolution. Accurate cell type annotation is a critical prerequisite for interpreting these data and is often the first and most essential step in downstream analysis. Despite rapid advances in computational methods, cell type annotation remains challenging and frequently requires extensive expert-driven manual curation based on marker-gene expression, spatial context, and prior biological knowledge. While early approaches relied primarily on transcriptional similarity, newer methods increasingly incorporate spatial information, histological features, and multimodal data to improve annotation accuracy. Nevertheless, reliable annotation remains difficult when biological interpretation requires fine-grained subtype resolution, particularly for platforms with limited gene panels, tissues undergoing dynamic cellular state transitions, and studies in which reference and query datasets differ substantially in biological context or technical modality. Here, we present a systematic benchmark of 20 state-of-the-art cell type annotation methods across four spatial transcriptomics datasets spanning diverse technologies, experimental conditions, cell numbers, and gene panel sizes. Importantly, all benchmark datasets contain expert-curated cell type labels, including well-resolved cell populations and subtype annotations, providing high-quality biological ground truth for evaluation. The benchmark encompasses both reference-based and reference-free methods representing a broad range of computational frameworks. 
Performance was assessed using conventional classification metrics, including accuracy and F1-based measures, together with structure-aware metrics that evaluate both cell-level annotation accuracy and preservation of higher-order biological organization. Across datasets, annotation performance varied substantially according to tissue context, reference-query similarity, and annotation granularity. Fine-grained subtype annotation and recovery of rare cell populations remained challenging for many methods, particularly in datasets capturing injury, repair, developmental, and regenerative processes characterized by continuous cellular state transitions. Notably, high classification accuracy did not necessarily correspond to preservation of global cellular relationships or biologically coherent downstream pathway and gene-set enrichment analyses. Overall, scANVI, Seurat, and TACCO consistently ranked among the top-performing methods, although their relative advantages were context dependent. Together, our results provide a comprehensive assessment of current annotation strategies for spatial transcriptomics and offer practical guidance for selecting methods that best align with specific biological questions, dataset characteristics, and analytical priorities.
bioinformatics 2026-06-22 v1
When Less Is Not More: DICEPro Mitigates the Impact of Incomplete Reference Matrices on Cellular Frequency Deconvolution.
BA, K.; Thiebaut, R.; Hinaut, X.; Hejblum, B. P.
Abstract
Cellular deconvolution aims to estimate the frequencies of different cell populations from gene expression measurements in a biological sample. Supervised approaches, such as CIBERSORTx and DISSECT, critically depend on the reference signature matrix, which encodes the gene expression profiles of cell-types based on prior knowledge. Despite numerous deconvolution methods, the impact of missing cell populations in the reference matrix remains understudied. Here, we evaluate the robustness of state-of-the-art deconvolution approaches using simulations based on real dataset examples combined with statistical modeling, validated against published data, and multiple real benchmark datasets. Results show that deconvolution performance remains stable when the reference matrix includes most cell-types, but declines sharply as the matrix becomes incomplete, especially for abundant cell populations. To address the limitations of incomplete reference matrices, we introduce DICEPro, an optimization-based framework designed to enhance existing deconvolution methods. By systematically adjusting the reference signatures, DICEPro better accounts for missing or underrepresented cell populations, leading to improved precision and robustness. We show that DICEPro consistently boosts deconvolution performance across both simulated datasets, derived from real data examples, and multiple real biological datasets, offering a practical solution when standard methods are hindered by incomplete references.
bioinformatics 2026-06-22 v1
PhaseWY: A pipeline for haplotype phasing, sex chromosome identification and extraction of sex-limited sequences
Ellerstrand, S. J.; Churcher, A. M. J.; Kutschera, V. E.; Hansson, B.
Abstract
Sex chromosomes are central to many ecological and evolutionary processes. Evidence has accumulated that sex chromosome systems vary extensively in age, turnover and transitions, motivating renewed efforts to study the diversity of sex chromosome systems across the tree of life. However, successful genomic detection of sex chromosomes depends on several factors, including the size and divergence time, background genetic diversity, and the number of sequenced females and males. In addition, technical challenges associated with sequencing and analysing the sex-limited Y/W chromosome remain. Here, we present PhaseWY, an automated Snakemake pipeline that uses whole-genome sequencing data from multiple female and male individuals to identify sex-chromosomal regions and extract the corresponding Y/W sequences. PhaseWY (i) detects sex differences in alignment depth, (ii) applies read-based and statistical haplotype phasing, (iii) identifies sex-linked regions using haplotype clustering, and (iv) subsets autosomal, X/Z- and Y/W-linked variants for downstream analyses. We applied PhaseWY to simulated data to benchmark factors influencing sex-linkage detection and successful extraction of Y/W-linked variants. To demonstrate its practical utility, we further applied PhaseWY to the neo-sex chromosome system in Alauda larks (Alaudidae) and performed a range of downstream analyses demonstrating the scope of applications of the PhaseWY output. We conclude that PhaseWY provides an easy-to-use and reproducible tool for population-genomic analyses in non-model organisms, with particular importance for advancing our understanding of sex-chromosome evolution.
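Step (i) of the pipeline described above, detecting sex differences in alignment depth, can be illustrated with a minimal sketch. It assumes an XY system, per-window depths already normalized by each sample's genome-wide mean, and hypothetical thresholds; the function names `depth_ratio` and `classify_window` are illustrative and are not part of PhaseWY.

```python
from statistics import mean

def depth_ratio(female_depths, male_depths):
    """Female-to-male ratio of mean normalized depth in one genomic window."""
    f, m = mean(female_depths), mean(male_depths)
    if m == 0:
        # No male coverage: the ratio is undefined or unbounded.
        return float("inf") if f > 0 else float("nan")
    return f / m

def classify_window(ratio, tol=0.3):
    """Label a window under an assumed XY system (thresholds are hypothetical)."""
    if ratio <= tol:                      # little or no female coverage
        return "Y-like"
    if abs(ratio - 2.0) <= 2.0 * tol:     # females at about twice the male depth
        return "X-like"
    if abs(ratio - 1.0) <= tol:           # equal depth in both sexes
        return "autosomal-like"
    return "ambiguous"

# Toy windows: (female depths, male depths), normalized per sample.
windows = {
    "win1": ([1.0, 1.1, 0.9], [0.5, 0.55, 0.45]),
    "win2": ([0.0, 0.0], [0.5, 0.5]),
    "win3": ([1.0, 1.0], [1.0, 1.0]),
}
labels = {name: classify_window(depth_ratio(f, m)) for name, (f, m) in windows.items()}
```

For a ZW system the expected ratios are mirrored, so the labels would need to be swapped accordingly.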
bioinformatics 2026-06-22 v1
Reference-guided immune recovery matching prioritizes traditional Chinese medicine ingredients
Hu, C.; Xiao, B.; Chen, C. Y.-C.
Abstract
Therapeutic prioritization from single-cell transcriptomes requires a target that is closer to treatment response than disease-signature reversal. In immune diseases, post-treatment recovery may follow patient- and cell-type-specific trajectories rather than a simple return along the pretreatment disease axis. We developed ImmuneNavi, a healthy-reference-anchored recovery-matching workflow for ranking traditional Chinese medicine ingredients from paired PBMC data. The workflow maps heterogeneous PBMC cohorts to a common healthy immune coordinate system, constructs patient-cell-type disease and recovery states, and processes ITCM treated-control profiles into a fixed ingredient perturbation bank. Patient and ingredient states are represented in matched gene, pathway and transcription-factor views, allowing the model to combine local transcriptional direction with more stable program-level features. A matcher trained on one paired treatment cohort preserved recovery-aligned ingredient rankings in independent PBMC cohorts without redefining the feature space, candidate set or preprocessing procedure. This provides a reusable transcriptomic pipeline for moving from paired immune-state measurements to prioritized natural-product candidates for experimental follow-up.
bioinformatics 2026-06-22 v1
From hotspot dependence to distributed robustness in resistance-aware lead optimization
Wang, Y.; Xiao, B.; Kang, J.; Cui, H.; Fu, Y.; Li, W.; Perea, S. E.; Han, W.
Abstract
Drug resistance remains a recurrent failure mode in targeted anticancer and antiviral therapy, and resistance evidence often enters only after compound selection. ResistAgent is an evidence-constrained framework that converts mutational liabilities into design-time objectives through site- and combo-aware resistance mapping, deterministic mechanism diagnosis and robust counter-design. In EGFR-Erlotinib and HIV-RT-Rilpivirine, the framework separated residue-level liabilities from observed HIV combination liabilities and linked prioritized mutations to anchor loss, pocket rearrangement, electrostatic shifts and contact redistribution. Same-budget paired searches showed that robust objectives changed lower-tail mutant-panel behavior and interaction-dependence profiles while prioritizing robustness over average-affinity behavior. Under predefined liability panels, selected robust-best trajectories shifted support away from mutable hotspot contacts toward more distributed interaction networks. Supplementary physical summaries and ranking-first benchmarks support the scope of this resistance-aware design strategy while preserving clear boundaries for prospective validation.
bioinformatics 2026-06-22 v1
EventHorizon: A Foundation Model for Clinical Flow Cytometry
Medina Grespan, M.; Morrison, M.; O'Fallon, B.; Shean, R.; Spies, N. C.; Ng, D.
Abstract
Flow cytometry is an essential tool for diagnosis of hematologic malignancies, but existing clinical workflows are highly dependent on expert manual interpretation. Existing machine learning approaches typically require extensive labeled data and are sensitive to variability in panel design, instrumentation, and laboratory workflows, limiting their generalizability. We present EventHorizon, a self-supervised foundation model for clinical flow cytometry that produces unified specimen-level representations from heterogeneous multi-panel data. EventHorizon employs a two-stage hierarchical transformer architecture with marker-aware tokenization, enabling seamless integration of cells measured across different antibody panels into a single shared latent space. We pre-train the model using a DINO-inspired self-distillation strategy with a variety of flow cytometry-specific augmentations on a dataset of more than 100,000 clinical specimens across 17 distinct panels. We evaluate the resulting embeddings on three clinically relevant classification tasks spanning common and rare panels, demonstrating that simple k-nearest neighbor probing of frozen EventHorizon embeddings achieves performance comparable to a fully supervised baseline model and a prior panel-specific self-supervised model. To ensure EventHorizon is not simply shortcut learning on features such as the markers/panels run for a given specimen, we perform a graph-theoretic analysis of EventHorizon's latent space which argues that specimen embeddings are organized primarily by biological diagnosis. Taken together, these results demonstrate that EventHorizon produces biologically meaningful, panel-agnostic specimen representations from clinical flow cytometry data which, with further development and validation, could provide a potential basis for scalable, reproducible diagnostic support across diverse clinical laboratory settings.
bioinformatics 2026-06-22 v1
Complex-valued representations of time-series gene expression profiles for network analysis
Sun, J.; Cao, W.; Ikumi, K.; Shimizu, K. K.; Sese, J.
Abstract
Time-series RNA sequencing provides a powerful framework for studying dynamic gene regulation, yet conventional analyses usually represent gene expression profiles as real-valued vectors in Euclidean space and quantify similarity using correlation or distance. Inspired by quantum information theory, we present a framework for encoding time-series gene expression profiles as complex-valued vectors comprising amplitude and phase components in Hilbert space. We designed multiple encoding models to represent gene expression in the amplitude of complex-valued vectors, encode temporal differences in the phase, and extend the phase representation to incorporate the direction of local expression changes. Gene-gene similarity was then quantified using fidelity, which measures the overlap between two encoded vectors. Evaluation using time-series RNA-seq datasets across diverse species and biological contexts showed that different encoding models produced distinct fidelity distributions that were related to, but distinct from, conventional correlation measures. We then constructed gene-gene networks using pairwise fidelity values and detected communities containing genes with similar temporal profiles. Although fidelity distributions differed across encoding models, the resulting communities captured major temporal expression programs, and functional annotations based on gene ontology and Kyoto encyclopedia of genes and genomes pathway analyses provided exploratory biological context. The detected communities were comparable to those obtained using conventional methods, including weighted correlation network analysis and fuzzy c-means clustering. Furthermore, as a proof-of-concept, we performed SWAP-test circuit simulations to mimic fidelity computation on a quantum computer; under noise-aware conditions, these simulations produced less accurate fidelity estimates with higher computational cost than classical computation. 
As a proof-of-concept, this study provides a complementary view of temporal transcriptome organization, rather than a uniformly superior alternative to conventional methods.
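The encoding idea in the abstract above can be sketched in a few lines: expression sets the amplitude of a unit-norm complex vector, temporal change sets the phase, and fidelity is the squared magnitude of the inner product. The particular amplitude and phase formulas below, and the names `encode` and `fidelity`, are assumptions made for illustration; they are not the authors' encoding models.

```python
import cmath
import math

def encode(profile):
    """Encode a non-negative expression time series as a unit-norm complex vector.

    Assumed scheme (illustrative only): amplitude is the square root of the
    expression share at each time point; phase is the change from the previous
    time point, scaled into [-pi/2, pi/2].
    """
    total = sum(profile)
    if total <= 0:
        raise ValueError("profile must contain positive expression")
    amplitudes = [math.sqrt(x / total) for x in profile]
    diffs = [0.0] + [b - a for a, b in zip(profile, profile[1:])]
    scale = max(abs(d) for d in diffs) or 1.0  # avoid dividing by zero for flat profiles
    phases = [0.5 * math.pi * d / scale for d in diffs]
    return [a * cmath.exp(1j * p) for a, p in zip(amplitudes, phases)]

def fidelity(u, v):
    """Fidelity between two pure states: |<u|v>|^2, a value in [0, 1]."""
    inner = sum(a.conjugate() * b for a, b in zip(u, v))
    return abs(inner) ** 2

rising = encode([1, 2, 4, 8])
rising_scaled = encode([2, 4, 8, 16])   # same shape, twice the magnitude
falling = encode([8, 4, 2, 1])

same_shape = fidelity(rising, rising_scaled)
opposite_shape = fidelity(rising, falling)
```

Under this assumed scheme, profiles with the same shape but different overall magnitude get a fidelity of 1, while rising and falling profiles score well below 1. Pairwise fidelity values of this kind could then serve as edge weights in a gene-gene network.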
bioinformatics 2026-06-22 v1
CellTosg2Sequence: A Unified Text-Omics-Signaling-Graph Large Language Model for Single-Cell Analysis
chen, w.; Ye, M.; Xu, T.; Huang, D.; Zhang, H.; Li, H.; Li, W.; Chen, Y.; Payne, P. R.; Li, F.
Abstract
In single-cell (sc)-based scientific discovery, text-formatted biomedical prior knowledge and signaling graphs are essential for annotating and interpreting numeric sc-omics data and for generating novel testable hypotheses. A major limitation of existing single-cell large language models (scLLMs) is that they rely on numeric expression data with gene names as the only textual signal, while comprehensive biomedical priors (cellular localization, gene function, disease associations, and signaling interaction patterns) remain absent from the model input. We introduce CellTosg2Sequence, a textual-prior- and signaling-graph-augmented cell-omics-sentence language model. A lightweight heterogeneous graph encoder maps a curated 62,507-node biomedical knowledge graph (KG) into compact virtual tokens that are prepended to each cell sentence, allowing the language model to condition on biological structure with minimal sequence-length overhead. We train CellTosg2Sequence with a three-stage objective: Stage I anchors the KG channel under autoregressive language-model pretraining, leveraging Qwen2.5-32B's own language reasoning for rapid KG alignment; Stage II aligns labels via supervised fine-tuning with KG-anchored InfoNCE; Stage III applies Group Relative Policy Optimization (GRPO) with an ontology-hierarchy reward, enabling free-generation cell-type prediction that generalizes beyond the closed training vocabulary. Across multiple benchmarks and ablation experiments, CellTosg2Sequence outperforms strong baselines. All results are achieved with lightweight LoRA training and a single unified checkpoint.
bioinformatics 2026-06-22 v1
πDIA-CLIP: efficient identification of highly heterogeneous proteomics data via a generalized zero-shot framework
Liao, Y.; Li, Y.; Xiao, Z.; Miao, C.; Yi, T.; Zhao, X.; Zhang, Y.; Wen, H.; E, W.; Chang, C.; Zhang, W.
Abstract
Data-independent acquisition (DIA) mass spectrometry has increasingly emerged as a cornerstone for characterizing highly heterogeneous biological systems, such as single-cell proteomics, metaproteomics, and spatial proteomics, offering unparalleled identification depth and quantification reproducibility. Current DIA analysis frameworks, however, require semi-supervised training within each run for peptide-spectrum match (PSM) re-scoring, which is prone to overfitting and lacks generalizability across diverse species and experimental conditions. Here, we present πDIA-CLIP, a generalized framework that shifts the DIA analysis strategy from semi-supervised training to zero-shot cross-modal representation learning by integrating dual-encoder contrastive learning and encoder-decoder architectures to establish a unified, high-precision representation for spectral features and peptides. Notably, the generalized zero-shot nature of πDIA-CLIP facilitates an inference-only architecture, streamlining the analysis to achieve exceptional computational efficiency. Extensive evaluations across five distinct benchmarks demonstrate that πDIA-CLIP consistently outperforms existing tools, yielding up to a 44.6% increase in protein identification alongside a reduction in entrapment identifications of up to 52.5%. Furthermore, the enhanced identification depth facilitates the discovery of novel biomarkers and the elucidation of intricate cellular mechanisms.
bioinformatics 2026-06-21 v4
SIEVEseq: One-stop differential expression, variability, and skewness analyses using RNA-Seq data
Li, H.; Khang, T. F.
Abstract
RNA-Seq data analysis is commonly biased towards detecting differentially expressed genes and insufficiently conveys the complexity of gene expression changes between biological conditions. This bias arises because discrete count models cannot fully and independently parameterize the mean, variance, and skewness of gene expression distributions. Therefore, a unified statistical framework that simultaneously tests differential expression, variability, and skewness is needed. We present SIEVEseq, a statistical methodology that provides such a framework. SIEVEseq embraces a compositional data analysis strategy to transform discrete RNA-Seq counts into continuous form with a distribution well-fitted by the skew-normal distribution. Both parametric and nonparametric simulations show that SIEVEseq better controls the false discovery rate and Type II error than existing differential expression methods. Analysis of the Mayo RNA-Seq dataset for Alzheimer's disease demonstrates that gene sets with significant differences in mean, variance, and skewness between control and disease groups strongly predict disease state. Furthermore, functional enrichment analysis indicates that relying solely on differentially expressed genes identifies only part of the biological spectrum, whereas incorporating genes with differential variability and skewness reveals additional disease-related aspects. Cross-data and cross-methodology validation suggest the detected biological signals are genuine. The SIEVEseq R package and source codes are available at: https://github.com/Divo-Lee/SIEVEseq.
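To make the idea of a compositional transform concrete, the sketch below applies a centered log-ratio (CLR) transform to one sample's counts and computes the sample skewness of a gene's transformed values across samples. CLR with a pseudocount is a standard compositional data analysis transform chosen here as an assumption; the abstract does not state which transform SIEVEseq uses, and the function names and toy data are illustrative.

```python
import math

def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of one sample's gene counts.

    The pseudocount (an assumed value) keeps zero counts finite under the log.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    center = sum(logs) / len(logs)
    return [v - center for v in logs]

def skewness(values):
    """Population skewness: third central moment over variance to the 3/2 power."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    if m2 == 0:
        raise ValueError("skewness is undefined for constant values")
    return m3 / m2 ** 1.5

# Toy data: rows are samples, columns are genes.
samples = [[10, 200, 30], [12, 180, 25], [9, 220, 400]]
transformed = [clr(row) for row in samples]
gene3_skew = skewness([row[2] for row in transformed])
```

Each transformed sample sums to zero, which is the defining property of CLR values; mean, variance, and skewness can then be compared between conditions on this continuous scale.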
bioinformatics 2026-06-21 v3
Hierarchical classification of immune cell transcriptomes at population-scale
Beltz, C.; Qiu, Z.; Sadowski, L.; Kraske, J. A.; Aggarwal, A.; Quintanal-Villalonga, A.; Manoj, P.; Littbarski, A.; Bajaj, S.; Meskauskaite, B.; Umeda, S.; Mazutis, L.; Rose, S. A.; Chan, J. M.; Nawy, T.; Nainys, J.; Chaligne, R.; de Stanchina, E.; Kaelber, K. A.; Cussigh, C. S.; Kallenberger, S. M.; Williams, A.; Jenzer, M.; Pompecki, T.; Kahle, S.; Hohmann, N.; Nussbaum, D. P.; Moss, N. S.; Ziv, E.; Berger, A. K.; Haag, G. M.; Springfeld, C.; Zschaebitz, S.; Hassel, J. C.; Debus, J.; Jaeger, D.; Iacobuzio-Donahue, C. A.; Ganesh, K.; Peer, D.; Ungerechts, G.; Rudin, C. M.; Huber, P. E.; Walle
Abstract
Accurate immune cell classification is essential for interpreting single-cell RNA sequencing (scRNA-seq) data. However, progress in automating cell type annotation is constrained by the lack of independent, high-resolution benchmarks, as routine data integration introduces statistical dependencies that inflate model generalizability. Here, we present the single-cell universal classification omnibus (Suco), a resource of independent, uniform expert annotations, and Compocyte, a modular hierarchical classifier. Together, they establish a framework that substantially outperforms existing classifiers while facilitating expert review of ambiguous annotations. Applying Compocyte across 50 studies, including three newly generated datasets, we classified 15.6 million leukocytes from 3,965 patients. Within this cohort, we identified a new tumor-associated resorptive macrophage phenotype, a non-canonical monocyte subtype in subclinical cytokine release syndrome, and the programmatic erosion of T cell memory stemness across metastatic sites. Suco and Compocyte thus provide a generalizable framework to uncover the principles governing human immunity at population scale.
bioinformatics 2026-06-21 v2
Antibody-Antigen Affinity Prediction with Chain-Aware Protein Language Modeling
Singh, H.; Malhotra, A.; Srivastava, S. P.; SINGH, R. K.; Gorantla, R.
Abstract
Motivation: Antibody-antigen affinity determines which antibodies advance in therapeutic discovery, repertoire analysis and affinity maturation, but experimental measurements are sparse relative to the scale of sequence libraries. Structure-based predictors can exploit interface geometry when reliable complexes are available, yet early discovery often requires ranking many heavy-light chain pairs against antigens for which no complex structure exists. Existing sequence-based models are scalable, but frequently compress heavy and light chains into a single antibody representation or concatenate antibody and antigen features obscuring the chain-specific and epitope-specific signals that drive binding. Results: We present AbAffinity, a sequence-only chain-aware three-stream architecture that maintains heavy chain, light chain and antigen as distinct streams. It integrates frozen ESM-2 embeddings with heavy-chain CDR-focused pooling, heavy-light self-attention, adaptive fusion gating and gated cross-attention, training only a compact interaction module. On the SAAINT-DB benchmark, AbAffinity achieves strong predictive performance under ten-fold cross-validation and maintains robust accuracy on novel antigens. It consistently outperforms recent sequence-based models across external benchmarks including SAbDab, AB-Bind and SKEMPI 2.0. Ablation studies highlight the contributions of chain-specific representations, CDR-focused pooling and the gated interaction pathway. Integrated Gradients attributions recover known paratope and epitope residues at structurally validated interfaces. AbAffinity provides a lightweight, explainable sequence-first framework for antibody triage and prioritisation when structural information is limited or unavailable.
bioinformatics 2026-06-21 v1
Fast Multi-objective RNA Optimization with Autoregressive Reinforcement Learning
Huang, J.; Feng, N.; Bai, H.; Fang, Y.; Liu, X.; Wang, S.; Yan, J.; Shen, H.-B.; Qiu, Z.; Yuan, Y.; Hu, R.; Pan, X.
Abstract
Codon optimization is essential in mRNA vaccine development, but existing tools are limited in computational efficiency, sequence diversity, and universality. To address these challenges, we develop RNAJog (RNA Joint Optimization with autoregressive Generative model), a framework integrating autoregressive generation with reinforcement learning to optimize codon sequences for minimum free energy (MFE), codon adaptation index (CAI) and GC content, enabling sequence design even without annotated training data. Both in silico and wet-lab experiments confirmed RNAJog's effectiveness and efficiency: it runs two orders of magnitude faster than a traditional algorithm (LinearDesign) on long RNA sequences and yields about a 10-fold increase in antibody titer over the wild-type mRNA in influenza virus hemagglutinin (HA) mRNA vaccine design in mice. RNAJog also supports biological constraints for sequence optimization. Using this feature, we minimized m6A modification motifs in the Bmp2 coding sequence to enhance translational efficiency and RNA stability, which was validated in cell-based experiments.
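Two of the three objectives named above, GC content and the codon adaptation index, are simple to compute and are sketched below. CAI is taken in its usual form as the geometric mean of per-codon relative adaptiveness values. The weight table `TOY_WEIGHTS` is hypothetical and covers only a few codons; a real table would be derived from a reference set of highly expressed genes. MFE is omitted because it requires an RNA folding library. None of this is RNAJog code.

```python
import math

def gc_content(seq):
    """Fraction of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return sum(base in "GC" for base in seq) / len(seq)

def cai(seq, weights):
    """Codon adaptation index: geometric mean of relative adaptiveness (w).

    Codons missing from `weights` are skipped; trailing bases that do not
    form a complete codon are ignored.
    """
    seq = seq.upper().replace("T", "U")
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    logs = [math.log(weights[c]) for c in codons if c in weights]
    if not logs:
        raise ValueError("no scorable codons")
    return math.exp(sum(logs) / len(logs))

# Hypothetical relative-adaptiveness values for two amino acids (illustrative only).
TOY_WEIGHTS = {"GCC": 1.0, "GCU": 0.5, "AAG": 1.0, "AAA": 0.25}

optimal = "GCCAAG"      # Ala-Lys using the preferred codons
suboptimal = "GCTAAA"   # same peptide, less preferred codons

scores = {
    "optimal": (gc_content(optimal), cai(optimal, TOY_WEIGHTS)),
    "suboptimal": (gc_content(suboptimal), cai(suboptimal, TOY_WEIGHTS)),
}
```

Both sequences encode the same dipeptide, yet they differ in GC content and CAI, which is why synonymous codon choice can be treated as a multi-objective search.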
bioinformatics 2026-06-20 v2
A network approach to DNA methylation clocks
Carcedo, A.; Yang, S.-G.; Smiljanic, J.; Neunman, M.; Wennstedt, S.; Degerman, S.; Lizana, L.
Abstract
Biological age predicts health and lifespan better than chronological age, but remains difficult to measure. One leading molecular proxy for biological age is DNA methylation, which underlies age predictors known as "clocks". These clocks use penalized linear regression to predict chronological age from methylation levels using selected cytosine-guanine pairs (CpGs) along DNA. Although they predict chronological age within a few years and track mortality risk, there are several issues. Different clocks share a vanishingly small number of CpG sites, many of which show weak associations with age. Also, the clocks often do not transfer across methylation array platforms. This paper takes a network approach to better understand these issues. By using 12 public datasets from human blood, we build a co-methylation network of the sites that show the strongest age correlation. After pruning weak links, we find that it has a small number of large modules of covarying CpGs surrounded by many small modules and singleton sites. These modules are biologically interpretable, as they are associated with CpG island contexts and enriched for distinct Gene Ontology functions. We also map five established clocks onto this network (Horvath, Hannum, AltumAge, Skin & Blood, and Han) and find that they select some CpGs from the same module. This suggests that they are more similar than they appear. The network structure also suggests new ways to build clocks. A simple clock that retains one CpG per module matches the performance of established clocks. A second one, built from module-level principal components, outperforms all five established clocks in three validation cohorts and is transferable across array platforms (Illumina Infinium Methylation 450K or EPIC arrays). Overall, the network perspective shifts attention from individual CpG sites to modules of covarying sites.
This perspective helps explain why DNA methylation clocks perform so well despite their differences and provides a more systematic approach for developing the next generation of aging biomarkers.
bioinformatics 2026-06-20 v1
The recount3 Python package for programmatic access to uniformly processed RNA-seq data
Alsalihi, A.; Flight, R. M.; Moseley, H. N. B.
Abstract
The recount3 online resource provides tens of thousands of uniformly processed RNA-seq samples across human and mouse from major sequencing repositories like the Sequence Read Archive. While access to these datasets has traditionally been centered in the R/Bioconductor ecosystem, the growing prominence of Python in bioinformatics and machine learning necessitates native, efficient tooling for Python users. Therefore, we present the recount3 Python package with a robust application programming interface (API) and command-line interface (CLI) for discovering, downloading, and materializing recount3 resources. The software orchestrates uniform resource locator (URL) resolution, persistent on-disk caching, and the automatic parsing of data into analysis-ready data structures, including Pandas DataFrames and BiocPy RangedSummarizedExperiment objects. The recount3 Python package drastically lowers the barrier to entry for large-scale utilization of RNA-seq data in Python-based computational pipelines, bridging the gap between massive public transcriptomic data and modern machine learning ecosystems.
bioinformatics2026-06-20v1Ribosomes are covered by a coat of flexible protein fragments
McGrath, H.; Kvasnovsky, R.; Kolar, M.Abstract
Ribosomal proteins contain flexible terminal regions that are averaged out during electron density reconstructions, rendering them absent from experimental models derived by X-ray crystallography or cryogenic electron microscopy. These flexible protein fragments (FPFs) collectively form an invisible coat on the ribosome surface whose presence has been systematically overlooked. Here we analysed FPFs from 36 ribosomes spanning bacteria, eukaryotes, and mitochondria. We found that mitoribosomes harbour the most numerous and longest FPFs. Structural predictions confirmed that FPFs are predominantly disordered across all ribosome classes. Comparison of FPF amino acid composition against proteome-wide background frequencies revealed strong and domain-specific compositional biases. The balance between arginine and lysine content tracks the cardiolipin content of the membrane each ribosome class contacts. The arginine enrichment in mitoribosomal FPFs may additionally reflect selection arising from the RNA-rich environment of mitochondrial RNA granules, membraneless condensates where mitoribosomes are assembled. FPFs are uniformly depleted in aromatic residues, arguing against protein-driven liquid-liquid phase separation propensity. Our findings suggest that the flexibly tethered coat is a highly functional intrinsic part of all ribosomes.
bioinformatics2026-06-20v1Finding stable clusterings of single-cell RNA-seq data
Klebanoff, V. F.Abstract
Run a UMI count matrix through a pipeline to obtain n cell clusters. Suppose that counts for an equal number of additional cells from the same experiment become available. Would including them change the result? Form the matrix containing both sets of counts, obtain n clusters, restrict this clustering to the initial cells and compare it with the initial clustering. If they are not consistent, conclude that the initial clustering is unstable. This is unrealistic, but reverse the perspective: given a clustering, process samples of half of the cells. If their clusters are consistent with those of all cells restricted to the samples, conclude that the clustering is stable. We use divisive hierarchical spectral clustering and define what may be a novel mapping of the dendrogram to nested clusterings. Counts are transformed to points in low-dimensional Euclidean space. Positive affinities are defined for points that are k-nearest neighbors. The affinity equals the inverse of the distance between points. Ng, Jordan, and Weiss' algorithm divides the points into two clusters. The normalized cut measures the clusters' separation. Recursion generates a dendrogram. Set the length of the branch between a node and its daughters to the normalized cut. Nodes' distances from the root define the mapping to nested clusterings. Analysis is performed for all cells and for multiple pairs of complementary samples of cells. For a given number of clusters, each sample's clustering and clusters are compared with those of the full data set (restricted to the sample). This provides measures of the stability of the clustering and its clusters. For three large data sets, this yielded clusterings compatible with published results, though with fewer clusters. Clusterings of two were judged to be stable. We conclude that it is feasible to identify stable clusterings of as many as 100,000 cells. Future research should explore using differential expression for validation.
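Two ingredients named in the abstract, the inverse-distance affinity between k-nearest neighbors and the normalized cut of a two-way split, can be sketched in a few lines. This is a minimal illustration, not the author's pipeline: the symmetrization rule (keep a pair if either point lists the other as a neighbor), the assumption of distinct points, and the toy coordinates are assumptions, and the spectral split by the Ng, Jordan, and Weiss algorithm is not reproduced here.

```python
from math import dist


def knn_affinity(points, k):
    """Symmetric affinity matrix: 1/distance for k-nearest-neighbor pairs, else 0.

    Assumes all points are distinct, so no distance is zero.
    """
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Indices of the k nearest neighbors of point i, excluding i itself.
        nearest = sorted((j for j in range(n) if j != i),
                         key=lambda j: dist(points[i], points[j]))[:k]
        for j in nearest:
            affinity = 1.0 / dist(points[i], points[j])
            w[i][j] = w[j][i] = affinity  # keep the pair if either is a neighbor
    return w


def normalized_cut(w, cluster_a):
    """Ncut(A, B) = cut(A, B)/assoc(A, V) + cut(A, B)/assoc(B, V)."""
    n = len(w)
    a = set(cluster_a)
    b = set(range(n)) - a
    cut = sum(w[i][j] for i in a for j in b)
    assoc_a = sum(w[i][j] for i in a for j in range(n))
    assoc_b = sum(w[i][j] for i in b for j in range(n))
    return cut / assoc_a + cut / assoc_b


# Toy data (made up): two well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
w = knn_affinity(points, k=2)
good_split = normalized_cut(w, [0, 1, 2])
bad_split = normalized_cut(w, [0, 1, 3])
```

A split that follows the two groups cuts no edges and scores zero, while a split that mixes them scores higher. That is the sense in which the normalized cut measures separation, and why it can serve as a branch length in the dendrogram.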
bioinformatics2026-06-19v5damidBind: an R/Bioconductor package for differential DamID analysis and data exploration
Marshall, O. J.Abstract
DamID and its cell-type-specific adaptations, including Targeted DamID (TaDa) and Chromatin Accessibility TaDa (CATaDa), are now widely adopted techniques for the genome-wide profiling of DNA-binding proteins. Despite this popularity, no dedicated software solution exists for identifying differentially bound or accessible loci, or differentially transcribed genes, between cell types using DamID. The R/Bioconductor package damidBind provides these functions, allowing an end-user to move from processed binding profiles to identifying differentially bound loci in a reproducible, statistically appropriate and straightforward workflow. Availability and Implementation: damidBind is an open-source R/Bioconductor package, freely available from Bioconductor at https://bioconductor.org/packages/damidBind/ and from GitHub at https://github.com/marshall-lab/damidBind. It is released under the GPLv3 licence.
bioinformatics2026-06-19v3PLncFire enables genome wide identification and annotation of plant long noncoding RNAs from RNA sequencing data
Mistry, S. D.; Saxena, S.; Rizvi, A. Z.Abstract
Long non-coding RNAs (lncRNAs) are key regulators of plant biology, yet their discovery is hindered by low sequence conservation and a lack of comprehensive annotations. To overcome these challenges, we developed PLncFire, a modular computational pipeline that automates the genome-wide identification and annotation of lncRNAs from standard RNA-seq data. PLncFire integrates quality control, transcript assembly, and a robust consensus coding-potential assessment using CPC2, PlantLncPipe, and FEELnc to generate high-confidence predictions. It classifies lncRNAs as known or novel, facilitates their prioritisation through differential expression analysis, and is designed for scalability and reproducibility across diverse plant species. PLncFire provides a standardised framework to empower large-scale lncRNA discovery and advance comparative functional genomics. The source code is available at https://github.com/ahsan-rizvi/PLncFire.git.
bioinformatics2026-06-19v2From Scarce Functional Labels to Label-Aware Generation in Homologous Protein Families
Rosset, L.; Weigt, M.; Zamponi, F.Abstract
Accurately annotating and controlling protein function from sequence data remains a major challenge in protein engineering, especially when functional labels are scarce within large homologous families. Here, we study a two-stage light-supervision strategy for fine-grained functional annotation and label-aware sequence generation. First, we compare several sequence representations, including one-hot encodings, Restricted Boltzmann Machines (RBMs), and ESM2-based protein language model embeddings, for predicting intra-family specificity labels from limited supervision. By using train/test splits that explicitly reduce phylogenetic leakage, we show that ESM2-based representations do not systematically outperform family-specific RBM embeddings or even simple one-hot baselines in this regime. Second, we use the inferred annotations to train an annotation-aware RBM capable of generating artificial homologs conditioned on prescribed labels. Across several protein families, we quantify how the number and quality of available labels determine the reliability of conditional generation. Our results show that scarce annotations can support label-aware protein design when they are accurately propagated, while also highlighting the importance of phylogeny-aware evaluation for assessing functional annotation methods within homologous families.
bioinformatics2026-06-19v2FeatureMSEA: Metabolic Feature-based Metabolite Set Enrichment Analysis
Liu, Y.; Wang, Y.; Huan, T.; Shen, X.Abstract
Liquid chromatography-mass spectrometry (LC-MS) untargeted metabolomics detects thousands of metabolic features, but converting these chemical signals into metabolite set-level biological knowledge remains challenging. This is because most features lack unambiguous metabolite identities. Conventional metabolite set enrichment analysis (MSEA) generally requires identified metabolites and metabolite-level ranked inputs, leaving much of the untargeted feature space unused. Here, we present FeatureMSEA, a feature rank-based framework for metabolite set enrichment directly from metabolic features with ambiguous annotations. FeatureMSEA integrates multi-evidence feature-to-metabolite annotation, feature rank-based enrichment scoring, permutation-based inference, and iterative leading-edge-guided annotation refinement, with an optional LLM-assisted module for post-enrichment interpretation. In null comparisons of randomly split healthy samples, FeatureMSEA detected no significant metabolite sets, whereas metabolite-set spike-in simulations showed recovery of implanted signals. In a cerebrospinal fluid metabolomics study of Huntington's disease, FeatureMSEA identified dysregulated metabolite sets related to amino acid metabolism, mitochondrial energy metabolism, and neuroactive signaling. MS/MS-based annotation analysis further showed that FeatureMSEA refinement reduced annotation ambiguity and prioritized chemically consistent candidate metabolites. In summary, FeatureMSEA provides a general framework for extracting metabolite set-level biological insights from LC-MS untargeted metabolomics in which confident metabolite identification remains incomplete.
bioinformatics2026-06-19v1Perturbation Curve models continuous transcriptional response trajectories and improves prediction of genetic modulations
Zhong, Y.; wang, l.; Yang, G.; Yu, L.; Qi, X.; Jiang, H.Abstract
Single-cell CRISPR screens, Perturb-seq, have revolutionized functional genomics by revealing biological causality. However, although perturbation assignments are typically represented as discrete labels, the cell-level effective strength of perturbations is often continuous and diverse. Current analytical frameworks struggle to decouple the variability in perturbation strength from the diversity of downstream responses. Here, we present Perturbation Curve (PertCurve), a nonlinear, curve-based computational framework that models the trajectories of transcriptomic responses by explicitly incorporating diverse perturbation magnitudes and strengths. By ordering cells by perturbation strength, we demonstrate that PertCurve accurately recapitulates the response magnitudes and reveals the distinct modularity and asynchrony patterns of downstream gene behaviors. These patterns are categorized into archetypes, including proportional, sensitive, and threshold responses. By applying this framework across CRISPRi/a modalities, we identify universal response patterns in viral infection, apoptosis, and proliferation genes, and reveal previously overlooked context-specific regulatory features in cell differentiation. Finally, incorporating PertCurve into perturbation prediction models and evaluation metrics enhances predictive performance, delivering actionable insights for refining established models.
bioinformatics2026-06-19v1SteerAF: Distogram-based Steering of AlphaFold2 toward Alternative Conformations
Tang, J.; Zhu, Z.; Yang, S.; Song, C.Abstract
End-to-end structure predictors, such as AlphaFold2, typically output only the dominant conformational state of a given protein, which is biased by the training data set. Existing strategies for recovering alternative conformations are often computationally expensive and offer limited biological interpretability. Here, we present SteerAF, an inference-time optimization framework based on AlphaFold2 that leverages information encoded in the distogram derived from deep multiple sequence alignments (MSAs) to predict alternative protein conformations. Across four benchmark datasets, SteerAF matches or surpasses existing methods in predicting alternative conformations for the majority of systems. Sparse MSA-feature modifications generated via block gradient ascent exhibit a strong correlation with experimentally characterized functional residues, recovering them with approximately 50% precision in the tested proteins. Furthermore, SteerAF enables effective decoy selection in the absence of experimental structures, and its predictions can serve as seed structures for molecular dynamics simulations to map conformational landscapes. Thus, SteerAF provides an efficient and interpretable approach for predicting alternative conformations, offering a framework that can be extended to other similar predictors and problems.
bioinformatics2026-06-19v1OmniPath Metabo: chemical structures, interactions and mechanisms to study the metabolome
Schaul, J.; Bai, Y.; Franken, J.; Lawrence, T.; Palacio-Escat, N.; Bottazzi, D.; Carreno, E.; Daley, M.; Gul, L.; Sahin, A.; Mananes, D.; Bohar, B.; Dugourd, A.; Korcsmaros, T.; Turei, D.; Schmidt, C.; Saez-Rodriguez, J.Abstract
Mechanistic and functional analysis of omics data largely relies on the incorporation of prior knowledge; however, connecting metabolomics data and knowledge is a major methodological challenge. This is largely driven by the diverse prior knowledge being fragmented across many databases requiring the merging of different database records across chemical structures, identifiers, and varying levels of structural specificity. Hence, this limits mechanistic interpretation and functional characterisation of the metabolome. Here, we present OmniPath Metabo, a comprehensive, harmonized, metabolome-centric database covering metabolites, lipids, food-derived compounds, and small molecule drugs, along with their associated receptors, transporters, enzymes, reactions, allosteric regulators, and disease associations. OmniPath Metabo harmonizes attributes using controlled vocabularies and ontologies, structures and built-in cheminformatics to map identifiers and track ambiguity. OmniPath Metabo is built directly from 40+ original resources and is freely accessible via an interactive web app and API at metabo.omnipathdb.org. OmniPath Metabo enables dynamic, context-specific construction of subnetworks to serve dedicated purposes, such as cell-cell communication or integrated multi-omics metabolite-driven regulation, connecting reactions, allosteric regulation, metabolite-receptor and metabolite-transporter interactions. Combining it with the over 170 other resources in OmniPath, it can be used for integrated networks of signaling, gene regulation, and metabolism. We showcase the application of OmniPath Metabo by analysing publicly available metabolomics data of lung cancer cell lines and metabolic footprints to mutational patterns. In summary, OmniPath Metabo transforms fragmented resources into a harmonised prior knowledge framework for a mechanistic and functional analysis of the metabolome.
bioinformatics2026-06-19v1Simulation-based Bayesian deep learning enables uncertainty-aware tumor fraction estimation in cell-free DNA
Volkov, H.; Raitses-Gurevich, M.; Grad, M.; Shlayem, R.; Danilevsky, A.; Rubinek, T.; Gorfine, M.; Shomron, N.Abstract
Background: Estimating tumor fraction from whole-genome cell-free DNA sequencing is critical for liquid biopsy, but is hampered by weak signals and baseline noise at low tumor fractions. Existing computational methods often require matched controls or large labeled datasets for training and lack uncertainty quantification. To address these gaps, we developed purNPE, a Bayesian deep-learning framework trained without labeled cancer cell-free DNA samples. Specifically, purNPE leverages a two-part generative model: one component simulates diverse tumor copy-number profiles based on evolutionary genealogies, while a second, data-driven component learns and replicates realistic sequencing background patterns from cancer-free cell-free DNA. By training a Neural Posterior Estimator on synthetic tumor profiles augmented with learned noise, purNPE performs amortized inference in milliseconds without needing a reference sample set at inference. Results: In a real-world pan-cancer cohort, purNPE achieved comparable performance with existing methods against orthogonal mutant-allele-fraction validation (MAE = 0.066). In silico and semi-synthetic experiments suggested analytical sensitivity around 1% tumor fraction under the evaluated conditions and showed strong classification accuracy in low tumor fractions (AUC = 0.98 for TF ≤ 3% versus controls). Conclusions: This work provides a framework for using simulation-based inference to derive calibrated, uncertainty-aware TF estimates, offering a potential alternative to traditional data-dependent methods.
bioinformatics2026-06-19v1ContinuumCellAgent: A Framework-Guided Agent for Long-Horizon Scientific Research
Li, H.; Lu, Y.; Fang, K.; Xu, Z.; Li, F.Abstract
AI-scientist systems are beginning to automate parts of scientific research. We present ContinuumCellAgent, an autonomous agent that executes literature review, hypothesis formation, computational experimentation, manuscript drafting, and adversarial peer review as a single unattended run. Existing AI scientist systems remain difficult to diagnose because they lack modularity, systematic prompt grounding, and observability into long-running behavior. ContinuumCellAgent addresses these gaps with a modular supernode architecture for stage-wise backend swapping, protocols grounded in curated research-method checklists that also define reviewer rubrics, and a diagnostics layer that records file-based artifacts, message traces, and state transitions. We evaluate the system on open-domain QA benchmarks and biomedical/longevity case studies, showing that it can produce checkable research artifacts while exposing pipeline dynamics for rigorous AI co-scientist research.
bioinformatics2026-06-19v1Nickel-Driven Dynamics of Urease in Sporosarcina pasteurii: Integrated Computational and Experimental Insights
Al-Thawadi, S. M.Abstract
Urease is a nickel-dependent enzyme that plays an important role in urea hydrolysis and in microbial-induced calcium carbonate precipitation (MICP), a process widely used in sustainable environmental biotechnology. Beyond its ecological importance, urease powers Biogrout (biocementation), a promising green technology for soil stabilization and infrastructure repair. Yet, the relationship between nickel availability, enzyme activation, and bacterial fitness remains poorly understood. In this study, we reveal a striking dual effect of nickel on Sporosarcina pasteurii: while high Ni2+ concentrations strongly inhibit growth (IC50 ≈ 637.7 µM), they simultaneously boost specific urease activity up to six-fold. This uncoupling between biomass and enzymatic efficiency highlights a previously overlooked adaptive strategy under metal stress. Using structural bioinformatics and molecular docking, we show that Ure1, the catalytic subunit, exhibits the strongest nickel affinity (-4.3 kcal·mol-1), supported by highly conserved active-site residues, whereas accessory proteins UreE and UreG display moderate and weak binding, consistent with their roles in metal delivery and GTP-dependent maturation. In addition, microscopic observations confirmed that calcium carbonate precipitation was most pronounced at intermediate nickel concentrations (approximately 400-1000 µM), whereas higher concentrations (≥1000-1300 µM) led to reduced mineral formation due to loss of viable cells. Taken together, these results indicate that nickel availability controls both urease activation and bacterial fitness, and that an optimal balance is required to maximize biomineralization efficiency in environmental applications, particularly in biocementation technology.
bioinformatics2026-06-19v1StickForStats: automated statistical assumption validation for reproducible computational biology
Bharti, V.; Chakraborty, D.Abstract
Reproducible computational biology depends on statistical decisions that routine workflows often skip: verifying that a differential-expression test's assumptions hold across all genes, that a strategy-comparison ANOVA is robust to non-normality, or that a meta-analysis is not distorted by publication bias. Surveys consistently find that fewer than 20% of published biomedical studies report checking these assumptions, and existing statistical software leaves validation to the analyst as an optional step. We present StickForStats, an open-source web platform that reframes assumption validation as a default precondition for every analysis. Its Guardian system, a middleware pipeline of eight validators (normality, variance homogeneity, independence, outliers, sample size, modality, linearity, homoscedasticity), checks assumptions before execution and, on critical violations, reroutes to an appropriate nonparametric alternative with a documented decision trail. At genome scale, applying Guardian to a 91-sample synovial-sarcoma RNA-seq study (GSE271517) cascaded 90.6% of 27,221 genes to a rank-based test and flipped the differential-expression verdict for 553 genes (479 rescued from an under-powered t-test and 74 outlier-driven false positives rejected), materially changing the gene list a biologist would act on. The same automatic validation generalizes across domains: a CRISPR editing-strategy comparison (ANOVA F = 1122, with Guardian recommending Kruskal-Wallis H = 36.6), an ordinal correlation (Pearson r = 0.476 corrected to Spearman ρ = 0.479), and a sixteen-trial clinical meta-analysis revealing severe publication bias (Egger's t = -5.78, p < 0.001); a complementary module extends the same validators to published manuscripts, checking claims against CONSORT, STROBE, ICH-E9, and JARS-Quant reporting standards. By making assumption validation automatic and transparent, StickForStats targets a tractable, under-served contributor to irreproducibility.
The platform is MIT-licensed, validated against SciPy and R, and freely available at https://github.com/visvikbharti/stickforstats_new.
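The "check assumptions first, reroute on violation, record why" pattern can be illustrated with a single validator. The sketch below is not StickForStats code and does not use its API: the function names, the Tukey outlier rule, and the choice to fall back from Pearson to Spearman are all assumptions picked to keep the example small and dependency-free.

```python
import statistics
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def has_outliers(values, fence=1.5):
    """Tukey rule: any value more than `fence` IQRs outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return any(v < q1 - fence * iqr or v > q3 + fence * iqr for v in values)


def guarded_correlation(x, y):
    """Return (method, coefficient, decision_trail) after one assumption check."""
    trail = []
    if has_outliers(x) or has_outliers(y):
        trail.append("outlier check failed: rerouted Pearson -> Spearman")
        # Spearman's rho is Pearson's r computed on the ranks.
        return "spearman", pearson(ranks(x), ranks(y)), trail
    trail.append("outlier check passed: Pearson retained")
    return "pearson", pearson(x, y), trail


# Toy data (made up): the second y series contains one extreme value.
x = [1, 2, 3, 4, 5, 6, 7, 8]
clean = guarded_correlation(x, [2, 4, 6, 8, 10, 12, 14, 16])
rerouted = guarded_correlation(x, [1, 2, 3, 4, 5, 6, 7, 100])
```

Returning the decision trail alongside the statistic keeps the choice of test auditable, which is the property the abstract emphasizes.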
bioinformatics2026-06-19v1Accurate detection of tumor clonality and ongoing expansion mode from genomic data
Chen, Y.; Jaksik, R.; Terranova, P.; El Baghdadi, S.; Koval, A.; Kurpas, M. K.; Tavare, S.; Kimmel, M.; Dinh, K. N.Abstract
Recent evidence shows that despite considerable effort, currently available algorithms for estimating intra-tumor heterogeneity (ITH) remain limited. We developed DECODE (Deciphering Cancer Origin from DNA Evolution), a novel mutation clustering method that incorporates the impact of sample-specific sequencing coverage and mutation calling biases. On synthetic data, DECODE outperformed existing methods across multiple clonality metrics and accurately detected and characterized the neutral tail in the site frequency spectrum (SFS), which encodes the tumor's ongoing expansion mode. In acute myeloid leukemia, accounting for the neutral tail enabled DECODE to yield more parsimonious clonal decompositions that align more closely with known subclonal dynamics that drive relapse. Applied to data from The Cancer Genome Atlas, DECODE not only detected a neutral SFS tail in most samples across tumor types but also uncovered a clinically meaningful link between ITH and survival in low-grade glioma. By jointly inferring clonality and expansion mode, DECODE provides two complementary and prognostically relevant readouts of tumor evolution from single tumor genomic samples.
bioinformatics2026-06-19v1HTS-Oracle v2: Prospective AI-Guided Discovery and Experimental Validation of Small Molecule Modulators Across Multiple Targets
Abdel-Rahman, S.; Gabr, M.Abstract
High-throughput screening (HTS) remains the cornerstone of early-phase small molecule discovery yet consistently underperforms against immunotherapy targets, yielding validated hit rates below 0.1%. Here we introduce HTS-Oracle v2, which features rigorous cross-validation that ensures honest performance estimates. HTS-Oracle v2 was trained and validated across four clinically significant immune checkpoint targets (CD28, ICOS, LAG-3, and TIGIT) achieving ROC-AUC values of 0.968, 0.969, 0.875, 0.928 respectively under rigorous cross-validation. For prospective experimental validation, HTS-Oracle v2 was applied to an 8,960-compound Enamine Protein Mimetic Library, selecting only 25 compounds per target for experimental testing using temperature-related intensity change (TRIC) technology, a 99.7% reduction in screening burden. HTS-Oracle v2 identified 4, 5, 4, and 6 validated binders from 25 prospectively selected compounds per target, corresponding to validated hit rates of 16%, 20%, 16%, and 24%, respectively. Notably, 67-80% of all experimentally confirmed hits across the full 8,960-compound library were captured within just 25 model-selected compounds per target. For CD28, this represents a 28-fold improvement over HTS-Oracle v1 (239x versus 8.4x), establishing HTS-Oracle v2 as an efficient platform for AI-guided prospective hit discovery across immunotherapy targets.
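The headline percentages in this abstract follow directly from the reported counts, and a short check makes the arithmetic explicit. Only numbers stated in the abstract are used; the variable names are illustrative, and the binder counts are kept in the order they are reported.

```python
library_size = 8960          # compounds in the screened library
selected_per_target = 25     # model-selected compounds tested per target
validated_binders = [4, 5, 4, 6]  # reported counts, in the order given

# Validated hit rate per target, as a percentage of the 25 tested compounds.
hit_rates = [100 * n / selected_per_target for n in validated_binders]

# Share of the library that did not need to be tested, as a percentage.
screening_reduction = 100 * (1 - selected_per_target / library_size)

# Ratio of the two enrichment factors quoted for CD28 (239x versus 8.4x).
fold_improvement = 239 / 8.4

print(hit_rates)
print(round(screening_reduction, 1))
print(round(fold_improvement))
```

The computed values match the reported 16%, 20%, 16%, and 24% hit rates, the 99.7% reduction in screening burden, and the roughly 28-fold improvement over v1.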
bioinformatics2026-06-19v1Children's DNA Methylation and Family Dynamics in a Congo Basin Subsistence Community: Links with Parental Conflict and Fathers' Caregiving
Chan, M. H.-M.; Merrill, S. S.; Zhuang, B. C.; Lin, D. T. S.; Macisaac, J. L.; Miegakanda, V.; Lew-Levy, S.; Boyette, A. H.; Kobor, M. S.; Gettler, L. T.Abstract
Family environments may contribute to children's long-term health through biological processes, including epigenetic regulation such as DNA methylation (DNAm). However, most studies in this area focus on Euro-American populations and rarely include fathering data. The current study investigated associations between children's blood DNAm and positive (father caregiving) and negative (parental conflict) family dynamics in a smaller-scale subsistence society living in the Congo Basin rainforest. We measured DNAm from dried blood spots of 54 children (mean age = 8.48 years) and conducted three epigenome-wide association studies aimed at discovering differential co-methylated regions (CMRs) associated with family dynamics. Using path models, we investigated the health implications of the identified CMRs and the shared contribution of family factors to them. Differential DNAm associated with family dynamics was localized to genes related to stress, immunology, development, and aging, possibly linking it to children's physical health, and was simultaneously connected to other family factors such as number of siblings. Our findings suggest similarities in the biological embedding of family factors across socio-ecologically diverse contexts.
bioinformatics2026-06-19v1VaxjoGNN: A Graph Neural Network for Ontology-Grounded Vaccine Adjuvant Recommendation
He, Y.; Zheng, Y.Abstract
Selecting an effective adjuvant remains a bottleneck in vaccine development, but most computational efforts have targeted antigen discovery rather than adjuvant prioritization. We frame disease-adjuvant matching as a top-k recommendation task on a heterogeneous knowledge graph grounded in biomedical ontologies, integrating curated facts, mechanistic pathways, and textual evidence. We introduce VaxjoGNN, a graph neural network trained with a listwise ranking objective. On a public benchmark, VaxjoGNN achieves NDCG@10 of 0.59 on seen diseases and 0.27 on previously unseen diseases (a 5.4 times improvement over a random baseline). The framework provides an ontology-anchored approach to adjuvant prioritization that complements existing antigen-focused tools.
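For readers unfamiliar with the metric, NDCG@k rewards rankings that place relevant items near the top and is normalized so that a perfect ordering scores 1. The sketch below uses the common linear-gain form with a log2 position discount and computes the ideal ordering from the same candidate list. Whether VaxjoGNN's benchmark uses this exact variant is not stated, so treat it as an assumption.

```python
from math import log2


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items (linear gain)."""
    return sum(rel / log2(position + 2)
               for position, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# Toy rankings (made up): 1 marks a correct adjuvant, 0 an incorrect one.
perfect = ndcg_at_k([1, 0, 0, 0], k=10)
second_place = ndcg_at_k([0, 1, 0, 0], k=10)
```

A single relevant item ranked second scores 1/log2(3), about 0.63. For scale, the reported 0.27 on unseen diseases at 5.4 times the random baseline implies a baseline of roughly 0.05.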
bioinformatics2026-06-18v3Impact of the N-glycosylation on full-length IgG2 and IgG4 antibodies: a comparative study using molecular dynamics simulations.
LEON FOUN LIN, R.; Bellaiche, A.; Diharce, J.; Etchebest, C.Abstract
Like other proteins, monoclonal antibodies, which are important biodrugs, are subject to post-translational modifications, especially N-glycosylation. However, the effect of N-glycosylation remains poorly studied, and atomistic details about its influence are rarely available. Moreover, the few existing studies focus on the prevalent immunoglobulin G1. To further understand the impact of glycosylation, we carried out a comparative exploration of the effect of N-glycosylation on two different subclasses of antibodies, namely Mab231, an IgG2, and pembrolizumab, an IgG4. The two antibodies differ in their sequences, length, and 3D structure, but also in the location and composition of their glycans. In the present work, detailed information was gained through molecular dynamics simulations in which both monoclonal antibodies were studied with and without their glycans. The results of 1.5 microseconds of sampling for each system show that glycosylation does not drastically alter the overall conformational landscape of either antibody, whatever the metric considered. However, it measurably modulates local flexibility, inter-domain correlated motions, and the relative orientation of the Fab arms with respect to the Fc domain, with statistically significant shifts in key geometric descriptors. Importantly, contact analysis reveals that glycan interactions extend beyond the Fc region to reach Fab residues. The allosteric network calculations demonstrate that the influence of Fc-bound glycans propagates as far as the Fab framework regions in both mAbs, which could impact antigen binding. The nature and magnitude of these effects are subclass-dependent, reflecting differences in glycan composition, hinge architecture, and three-dimensional organization. Our findings challenge the prevailing view that Fc glycosylation uniformly promotes CH2 domain opening.
More importantly, they underscore the necessity of considering full-length structures and IgG subclass diversity in glyco-engineering strategies.
bioinformatics2026-06-18v3Global StationaryOT: Trajectory inference for aging time courses of single-cell snapshots
Boyle, C.; Ventre, E.; Schiebinger, G.Abstract
Trajectory inference (TI) methods for single-cell snapshots of developmental systems have yielded numerous insights into the gene regulatory networks (GRNs) that control cell differentiation. Many TI algorithms have been proposed for recovering cell trajectories from single samples containing cells spanning a spectrum of differentiation states; however, these methods cannot leverage temporal information when a time course of such diverse samples is available. As interest grows in understanding how the regulation of GRNs changes as an organism ages, current TI theory and methods must be adapted to take advantage of all information in aging time courses of single-cell data. In this paper, we present our novel age-conscious method, global StationaryOT, which exploits the temporal information in aging time courses to simultaneously reconstruct debiased cell trajectories at all ages. We demonstrate that this first-of-its-kind method achieves more accurate, biologically consistent trajectories in synthetic and real biological contexts where data sparsity produces significant noise in the outputs of current TI methods when they are applied to time course samples independently.
bioinformatics2026-06-18v2Multiple Fault Analysis and Drug Therapy on Signaling Pathways Using Dynamic Bayesian Network-based Model
Chowdhury, T.; Majumder, S.; Lodh, E.; Maitra, A.; Agarwal, A.; Sur, A.; Sarkar, S.Abstract
Cell growth is an intricate biological phenomenon that is closely regulated by the interplay between various growth factors and transcription factors. Signaling pathways are the main mediators in this event, which provide the driving force for mitosis or sometimes meiosis. However, when malfunctions occur within the biological network, they can cause uncontrolled cell division, regardless of external stimuli. By employing Dynamic Bayesian Networks (DBNs), these malfunctions can be explicitly simulated, offering insights into their effects on cellular behavior and growth regulation. To a significant extent, the resultant outcomes can be mitigated through the use of reduced drug combinations. This study delves into the intricacies of signaling pathway behavior under the influence of concurrent malfunctions. Initially, we replicate the effects of these dysfunctions within DBNs. Subsequently, drug therapy is applied to alleviate their impact. Our methodology introduces a parameter known as efficiency_score, enabling the identification of optimized drug combinations without prior knowledge of specific dysfunctions. Particularly relevant in the context of realistic cancer conditions, these tailored drug inhibition points demonstrate enhanced efficacy compared to conventional treatments. Leveraging GPU acceleration throughout the modeling process accelerates the analysis of multiple faults within the biological networks, rendering our approach notably faster and more efficient.
bioinformatics2026-06-18v2Cross-platform nanopore benchmarking reveals methylation-associated substitution errors in bacterial reads
Liu, X.; Ding, Q.; Shao, Y.; GUO, Z.; Ni, Y.; Fan, L.; Yang, Y.; Chen, K.; Yang, M.; Li, R.Abstract
Nanopore sequencing enables long-read genome assembly and direct detection of DNA modifications, but emerging platforms require systematic evaluation against established technologies. We benchmarked CycloneSEQ against Oxford Nanopore Technologies R9.4.1 and R10.4.1 using matched native whole-genome sequencing and methylation-free whole-genome amplification libraries from six bacterial species. Updated CycloneSEQ chemistry and basecalling improved mean observed read accuracy to 96.0%, approaching R10.4.1. Across platforms, error spectra were non-random, with adenine-to-guanine and guanine-to-adenine substitutions consistently overrepresented. Comparisons with methylation-free controls showed that bacterial DNA methylation contributes substantially to these substitution patterns, highlighting a source of systematic nanopore error relevant to variant analysis. CycloneSEQ reads, when combined with short-read polishing, produced near-finished bacterial assemblies. We further show that CycloneSEQ supports bacterial methylation profiling: strand-specific basecalling errors enabled de novo discovery of 12 methylation-associated motifs, and two signal-to-reference alignment strategies enabled raw-signal comparison between native and amplification-derived reads. These results establish a cross-platform framework for nanopore benchmarking and extend bacterial epigenomic analysis to CycloneSEQ.
bioinformatics2026-06-18v2Trajectory inference of epithelial-centered neighborhood profiles reconstructs a pseudo-temporal continuum in idiopathic pulmonary fibrosis
Nakamura, S.; Tsubouchi, K.; Yamamoto, Y.; Takano, T.; Nakatsuru, K.; Takenaka, T.; Hashisako, M.; Oda, Y.; Okamoto, I.Abstract
Idiopathic pulmonary fibrosis (IPF) is characterized by complex lung architecture and spatially heterogeneous remodeling, which have hindered integrated analysis of cell-intrinsic activity and intercellular communication during disease progression. Here we profiled six IPF lung specimens comprising more than 630,000 cells using the Xenium 5k panel and developed an epithelial-centered neighborhood profiling framework based on the local cellular composition around each epithelial cell. This approach captured fibrosis-associated variation in epithelial niches without requiring predefined histological regions. Pseudo-temporal continuum inference of these profiles reconstructed a continuous axis that reflected the spatial progression of fibrotic remodeling from relatively preserved alveolar regions to fibrotic and airway-like remodeled regions. Within this spatial dataset, we mapped coordinated changes in epithelial states, local microenvironments, epithelial intracellular pathway activities, and directional interactions with neighboring cell types along the same axis. Our findings provide a spatial framework that generates testable hypotheses for progressive epithelial niche remodeling in IPF.
bioinformatics2026-06-18v1Elucidating the Design Space of Generative Models for Single-Cell Perturbation Prediction
Bhattacharya, S.; Gensbigler, C.; Karim, S.; Lees, J.Abstract
Next-token prediction has produced predictable scaling in language, but the recipe presumes a sequence of tokens with a meaningful order. Single-cell RNA-seq counts have no natural gene ordering, so applying the recipe directly to raw expression fails under an ill-suited left-to-right bias. We instead ask whether a learned latent can supply the structure the recipe needs. We introduce ExpressionVAE (eVAE), a discrete-latent perturbation model that compresses each cell into a short sequence of discrete codes through a finite-scalar-quantization (FSQ) bottleneck and trains a perturbation-conditioned discrete prior over those codes. On Replogle and Parse 1M, eVAE sets a new state of the art on every distributional metric and leads on most cell-eval perturbation metrics, with Fréchet distance and MMD² roughly 3 to 20× lower than the strongest continuous-latent baseline. Swapping the prior between autoregressive and masked discrete diffusion leaves performance near-identical, isolating the gain to the discrete latent itself rather than the prior family. A decoder-head ablation then exposes a single design axis, the richness of the predictive distribution at inference, that splits the standard metrics into two groups, variance-sensitive and mean-sensitive, which move in opposite directions along the axis. Finally, on a held-out CRISPRi reversion benchmark of 1,732 perturbations under inflammatory cytokine stress, the frozen eVAE encoder outperforms UMAP and differential expression and matches scGPT on perturbation ranking at a fraction of the data.
bioinformatics2026-06-18v1A unified smoothing framework for protein domain bigram model
Cui, X.; Iyer, G.; Durand, D.Abstract
Biomolecular sequences can be represented as strings over an alphabet, an analogy that has motivated many applications of computational linguistic techniques to biological problems. However, such methods must be adapted to the characteristic scale and organization of biomolecular data. Here, we consider the problem of bigram smoothing for multidomain protein architectures, where domain bigram frequency data is extremely sparse and differs from textual data in alphabet size, string length distribution, the relationship between bigram and unigram frequencies, tandem repeat lengths, and the distribution of domain adjacencies. Moreover, some domain combinations are unobserved because they are biologically incompatible, others because the data are incomplete. A smoothing method that distinguishes these two cases is required. We propose a unified smoothing framework based on interpolation that can be tuned to accommodate different bigram data characteristics. Within this framework, we design specific model variants suited to protein domain bigram data: these assign low adjusted counts to pairs that are likely incompatible, while making appropriate adjustments for undersampled pairs. We demonstrate empirically that this approach distinguishes the two cases while preserving the characteristic signatures of multidomain data.
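The abstract does not give the framework's formulas. As a rough illustration of what interpolation-based bigram smoothing means in general, the sketch below implements plain Jelinek-Mercer interpolation over toy domain architectures. The domain names, the weight `lam`, and the function name are hypothetical, and this is a textbook baseline, not the authors' model.

```python
from collections import Counter

def interpolated_bigram_prob(architectures, lam=0.7):
    """Build p(b | a) = lam * ML bigram estimate + (1 - lam) * unigram estimate.

    `architectures` is a list of domain lists; `lam` is an assumed,
    hand-picked interpolation weight (not taken from the paper).
    """
    unigrams = Counter()
    bigrams = Counter()
    left_totals = Counter()
    for arch in architectures:
        unigrams.update(arch)
        for a, b in zip(arch, arch[1:]):
            bigrams[(a, b)] += 1
            left_totals[a] += 1
    total = sum(unigrams.values())

    def prob(a, b):
        p_uni = unigrams[b] / total
        if left_totals[a] == 0:
            # `a` was never seen as a left neighbour: use the unigram estimate only
            return p_uni
        p_ml = bigrams[(a, b)] / left_totals[a]
        return lam * p_ml + (1 - lam) * p_uni

    return prob

# Toy architectures with hypothetical domain names
archs = [["SH3", "SH2", "Kinase"], ["SH2", "Kinase"], ["PH", "SH2"]]
p = interpolated_bigram_prob(archs)
vocab = ["SH3", "SH2", "Kinase", "PH"]
row_sum = sum(p("SH2", b) for b in vocab)
```

This plain scheme gives every unseen pair a share of probability based only on unigram frequency, so on its own it cannot separate incompatible pairs from undersampled ones, which is the distinction the abstract says a suitable method must make.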
bioinformatics2026-06-18v1novelBGC: An interactive dual-score framework for biosynthetic gene cluster novelty assessment and candidate prioritisation
Shukla, G.; Merugu, B.; Sharma, G.Abstract
Genome mining now yields tens of thousands of putative biosynthetic gene clusters (BGCs) per project, yet separating genuinely novel candidates from rediscoveries of known compounds remains the rate-limiting step before experimental validation. Single-axis prioritisation tools, namely antiSMASH similarity, BiG-FAM GCF distance, and self-resistance-enzyme (SRE) filters such as ARTS, each surface a different facet of evidence, yet their isolated use systematically over-ranks rediscovery-prone BGCs and overlooks genuinely orphan clusters. We present novelBGC, a web-hosted framework that converts these disparate outputs into two deliberately non-inverse continuous metrics per BGC, a Novelty (N) score and a Reference Similarity (RS) score, which together define a 2D decision plane that resolves rediscoveries, divergent family members, contig-edge artefacts, and uncharted chemistry with interactive visualisations, with all component weights user-tuneable at submission. Retrospective validation across three independent experimental datasets demonstrates the utility of the framework for candidate prioritisation. Within the first study, a 186-BGC SRE-guided cloning study, every confirmed bioactive product fell within the low-to-mid N band, whereas 55 high-N (N ≥ 0.50) BGCs were never selected. Moreover, in the other two studies, it correctly prioritised the fully orphan lariocidin BGC of Paenibacillus sp. M2 and the divergent within-family indanopyrrole-A idp BGC of Streptomyces sp. CNX-425. Together, these case studies demonstrate that, from identical input data, the joint (N, RS) space facilitates prioritisation decisions that are difficult to achieve using any single criterion alone. novelBGC requires no command-line expertise, no local tool installation, and no manual integration of intermediate output formats, addressing a well-documented accessibility barrier for wet-laboratory researchers engaging with genome-mining workflows.
novelBGC is freely available at https://project.iith.ac.in/sharmaglab/novelbgc/.
bioinformatics2026-06-18v1Calculation of sequence space coverage in a mutagenesis library
Florez Prada, A.; Uguzzoni, G.; Hart, D. J.Abstract
Directed evolution requires screening of large mutagenesis libraries, but accurate calculation of library sizes needed to discover functional variants remains challenging. Existing models provide baseline estimates, yet current computational approaches for finding the best variants scale poorly with library complexity. Here, we introduce a scalable algorithmic framework to compute exact discovery probabilities in saturation mutagenesis libraries with no requirement for explicit sequence enumeration. By aggregating variants into a composition log-sum distribution and applying log-space convolution across randomisation blocks, it is possible to extend this to massive sequence spaces and mixed codon schemes. By inverting these calculations, absolute mathematical ceilings for experimental design are established. Ultimately, this framework provides a rapid, quantitative tool to balance the statistical coverage-diversity trade-off within the limitations of laboratory screening. Finally, this is implemented as an open-source web application (SSCC) that allows researchers to construct heterogeneous library designs and compute required sampling depths, coverage probabilities, and absolute randomisation limits.
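For orientation, the baseline calculation behind such coverage estimates can be sketched for a single target variant. This is the standard independent-sampling formula, not the SSCC algorithm described in the abstract. It assumes each screened clone is an independent draw and that all 32 NNK codons are equally likely, and the function names are hypothetical.

```python
import math

def variant_log_prob(per_position_probs):
    """Log-probability that one clone encodes the target at every randomised position."""
    return sum(math.log(q) for q in per_position_probs)

def discovery_probability(log_p, n_clones):
    """P(target appears at least once in n_clones draws) = 1 - (1 - p)^n."""
    p = math.exp(log_p)
    # expm1/log1p keep precision when p is tiny
    return -math.expm1(n_clones * math.log1p(-p))

def clones_needed(log_p, target=0.95):
    """Smallest n with discovery probability >= target (the inverse calculation)."""
    p = math.exp(log_p)
    return math.ceil(math.log1p(-target) / math.log1p(-p))

# Three NNK positions: Leu (3 of 32 codons), Trp (1 of 32), Ala (2 of 32)
log_p = variant_log_prob([3 / 32, 1 / 32, 2 / 32])
n_95 = clones_needed(log_p, target=0.95)
coverage = discovery_probability(log_p, n_95)
```

Working with the log-probability keeps the product over positions numerically stable as the number of randomised positions grows.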
bioinformatics2026-06-18v1Looking beyond stereotyped neuron structures reveals links between beading and morphological rearrangements in aging phenotypes.
Gomez, K.; Nguyen, K.; Lagergren, J.; Flores, K.; San Miguel, A.Abstract
Understanding how neuronal morphology changes during aging and acute stress is essential for elucidating mechanisms of neurodegeneration. The highly branched PVD neuron of Caenorhabditis elegans provides a powerful model for studying dendritic remodeling and degeneration-associated phenotypes such as dendritic beading. However, the complexity of this arbor presents substantial challenges for automated segmentation and quantitative analysis. In this study, we adapted a convolutional neural network (CNN)-guided region growing framework for automated dendrite tracing, coupled with two topology-based algorithms for categorizing dendritic segments by branching degree. The segmentation algorithm achieved high accuracy relative to manual tracing, with a median Dice coefficient of 0.82, while reducing analysis time by approximately tenfold. Automated dendrite categorization demonstrated strong agreement with manual annotations across branching orders, though position-based mapping performance declined with age due to progressive morphological distortion. Leveraging this platform, we investigated mechanistic differences in dendritic beading patterns observed during aging and cold shock. Consistent with prior work, aging was associated with decreased inter-bead spacing, whereas cold shock produced increased bead dispersion with stress severity. Structural analysis revealed that these trends were not driven by dendritic pruning or reduced arbor complexity. Instead, while a traditional, anatomically inflexible paradigm falsely implicated lower-degree dendrites as highly vulnerable, our branching-informed framework revealed that age-dependent beading is fundamentally dictated by a segment's history of successive branching events. Conversely, acute cold shock triggered systemic beading that expanded across all dendritic orders in a severity-dependent manner.
Together, these findings demonstrate that chronic aging and acute stress engage distinct degenerative pathways (compartment-specific lineage vulnerability versus global architectural collapse) rather than gross morphological loss, and they highlight the need for paradigms that enable reliable analysis of changing morphologies.
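The Dice coefficient quoted for segmentation accuracy is 2|A ∩ B| / (|A| + |B|) for two foreground pixel sets A and B. A minimal sketch on toy binary masks follows. The masks and the function name are illustrative only, and treating two empty masks as a perfect match is an assumed convention.

```python
def dice_coefficient(mask_a, mask_b):
    """Dice overlap between two binary masks given as nested lists of 0/1."""
    a = {(r, c) for r, row in enumerate(mask_a) for c, v in enumerate(row) if v}
    b = {(r, c) for r, row in enumerate(mask_b) for c, v in enumerate(row) if v}
    if not a and not b:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2 * len(a & b) / (len(a) + len(b))

# Toy automated vs. manual tracings (hypothetical data)
automated = [[1, 1, 0],
             [0, 1, 0]]
manual = [[1, 0, 0],
          [0, 1, 1]]
score = dice_coefficient(automated, manual)  # 2 shared pixels, 3 + 3 foreground pixels
```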
bioinformatics2026-06-18v1Bioinf-Farma: supervised integration of epitope prediction and recombinant protein developability for automated vaccine candidate prioritization
Bondi, H.; Crespi, M.; Orlando, M.; Lescai, F.; Serapian, S. A.; Colombo, G.; Fasano, M.; Pollegioni, L.; Molla, G.Abstract
Vaccine antigen discovery requires prioritizing protein candidates according to both immunogenic potential and recombinant expression feasibility. These properties are typically evaluated using separate computational tools, requiring researchers to integrate heterogeneous outputs through ad hoc workflows. Here, we present BIOINF-farma, a modular platform integrating epitope prediction and developability assessment for rational antigen selection within a unified environment. Candidates can be submitted as amino acid sequences or three-dimensional structures. When experimental structures are unavailable, BIOINF-farma automatically searches for models in AlphaFold DB or performs structure prediction using Boltz-2, ensuring a standardized structural representation for downstream analyses. Antigenicity is quantified by combining structure-based conformational epitope signals (MLCE/REBELOT-BEPPE) and sequence-based linear epitope propensity scores (BepiPred 3.0) into a protein-level Antigenicity Score, with a classification threshold optimized on a manually curated validation dataset. Developability is evaluated through two supervised Random Forest meta-learners that integrate three solubility predictors (DeepSoluE, SoluProt, Protein-Sol) and three thermal stability predictors (TemStaPro, ProLaTherm, BertThermo), whose outputs are combined into an Expression Efficiency Score (EES). By integrating complementary predictive signals, the meta-learning framework achieves greater accuracy and robustness than individual predictors while maintaining performance across a broad range of sequence identities. The Antigenicity Score effectively discriminates antigenic from non-antigenic proteins with a large effect size, whereas EES successfully distinguishes soluble from insoluble outcomes on an independent panel of recombinant proteins expressed in Escherichia coli. BIOINF-farma jointly assesses antigenicity and expression feasibility within a single framework. 
Its modular architecture facilitates the incorporation of future predictive methods, while its web-based interface makes the full pipeline accessible to users without programming expertise, supporting rapid candidate triage in vaccine research and emerging pathogen responses.
bioinformatics2026-06-18v1A data-driven rediscovery of the specificity-conferring code of adenylation domains in nonribosomal peptide synthetases
Li, Z.; Bozhuyuk, K. A. J.; Kalinina, O. V.; Klakow, D.Abstract
Nonribosomal peptide synthetases (NRPSs) are large modular enzymes that assemble structurally diverse peptides, many of pharmacological importance, including antibiotics and immunosuppressants. Within each NRPS module, the adenylation (A) domain selects the substrate to be incorporated, a choice governed by a small set of residues lining the binding pocket. For two decades, computational prediction of A-domain substrate specificity has relied on residue sets (most prominently the Stachelhaus code and the 34-residue "8 Angstrom code") that were defined by spatial proximity to the substrate rather than by demonstrated predictive value. Here we revisit which residues govern substrate specificity from a purely data-driven perspective. We assembled a non-redundant dataset of 5,366 A-domain sequences (4,693 bacterial and 673 fungal) and used information-theoretic measures to rank alignment positions by their statistical association with substrate identity, without restricting candidate positions to any predefined structural shell. This procedure yielded two compact, kingdom-specific codes: IG15B (15 positions) for bacterial and IG13F (13 positions) for fungal A-domains. Both match or exceed the predictive accuracy of the 34-residue 8 Angstrom code while using fewer than half its positions, and both independently recover the majority of the classical Stachelhaus positions. Notably, our analysis identifies four positions (242, 280, 281, and 284) that lie outside all conventional codes yet carry non-redundant specificity information and co-localize with classical determinants on two helices flanking the binding pocket. These positions provide new candidate sites for the rational engineering of A-domain specificity.
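The abstract names only "information-theoretic measures". One common choice is information gain, the reduction in substrate-label entropy after conditioning on the residue at a position. The sketch below ranks alignment columns that way on toy data. The use of information gain, the sequences, the labels, and the function names are all assumptions for illustration, not the authors' pipeline.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(column, labels):
    """H(labels) - H(labels | residue at this alignment column)."""
    n = len(labels)
    groups = {}
    for residue, label in zip(column, labels):
        groups.setdefault(residue, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

def rank_positions(aligned_seqs, labels):
    """Return (gain, column index) pairs, highest gain first."""
    n_cols = len(aligned_seqs[0])
    scores = [
        (information_gain([s[i] for s in aligned_seqs], labels), i)
        for i in range(n_cols)
    ]
    return sorted(scores, key=lambda t: (-t[0], t[1]))

# Toy aligned sequences and substrate labels (hypothetical)
seqs = ["DAK", "DAK", "GAK", "GVK"]
substrates = ["Phe", "Phe", "Val", "Val"]
ranking = rank_positions(seqs, substrates)
```

Raw information gain tends to favour columns with many distinct residues, so a practical ranking would likely need a correction or held-out validation.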
bioinformatics2026-06-18v1