Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
Calibrated analysis framework for nanopore direct RNA sequencing uncovers cell-specific m⁶A stoichiometry at conserved sites
Ohnezeit, D.; Loliashvili, E.; Putzel, G.; Verstraten, R.; Liu, J.; Nicholson, L. S.; Pironti, A.; Jaffrey, S. R.; Depledge, D. P.; Wilson, A. C.
Abstract
Nanopore direct RNA sequencing (DRS) coupled with Dorado modification-aware basecalling enables mapping of epitranscriptomic modifications including N6-methyladenosine (m6A) at the level of individual RNAs. However, a lack of systematic benchmarking continues to raise questions regarding the sensitivity, specificity, and reproducibility of this method. To address this and to establish a best-practice workflow, we evaluated multiple Dorado versions using in vitro transcribed RNA and an m6A methyltransferase inhibitor as specificity controls. We established that stringent filtering is necessary to reduce false-positive calls and found strong concordance at high-stoichiometry sites when compared to an orthogonal m6A mapping method (GLORI). Further, by applying DRS to primary human fibroblasts and HD10.6 neurons, we uncovered cell type-specific differences in m6A stoichiometry, indicating a finely tuned epitranscriptomic regulation. Our study thus presents the first systematic comparison of Dorado and GLORI from the same input RNA and expands characterization of the m6A epitranscriptome to fibroblasts and neurons.
bioinformatics 2026-04-30 v4
SpatialQuery: scalable discovery and molecular characterization of multicellular motifs from spatial omics data
An, S.; Keller, M.; Gehlenborg, N.; Hemberg, M.
Abstract
Spatially resolved single-cell technologies enable profiling of cells in situ, yet computational approaches that jointly discover multicellular spatial patterns and characterize their molecular programs remain limited. Here we introduce SpatialQuery, a framework that can both identify cellular motifs, i.e. recurrent multicellular co-localization patterns, and perform molecular analyses focused on the motifs. It uncovers genes modulated by spatial contexts through differential expression analysis, and detects coordinated expression changes through covariation analysis. SpatialQuery can identify functional tissue units, and goes beyond pairwise analyses to characterize multicellular interactions. Applications to both spatial transcriptomics and proteomics data uncover cross-germ-layer signaling in gut tube patterning, disease-specific fibrotic and immunosuppressive niches in kidney and colon, and regional determinants of motif-associated transcriptional programs in a mouse brain atlas. SpatialQuery is available as a Python package, and we demonstrate how its light computational footprint enables integration into web-based cell atlas portals for interactive visualization and exploration.
bioinformatics 2026-04-30 v3
Praxis-BGM: Clustering of Omics Data Using Semi-Supervised Transfer Learning for Gaussian Mixture Models via Natural-Gradient Variational Inference
Jia, Q.; Goodrich, J. A.; Conti, D. V.
Abstract
High-dimensional omics data are typically measured on limited sample sizes, which challenges model-based clustering methods such as Gaussian mixture models, often leading to instability and poor generalization under complex mixture structures. To address these limitations, we developed Praxis-BGM, a natural-gradient variational inference framework for Gaussian mixture models that enables semi-supervised transfer learning by incorporating an informative prior Gaussian mixture model derived from large-scale reference data with robust cluster structures. This prior can encode cluster-specific means, covariance structures, and structural connectivity patterns, and is updated using the target data with variational inference to improve clustering in small-sample settings. We derived natural-gradient updates for standard parameters and assessed feature-level contributions to posterior clustering via Bayes Factors. Implemented with the Python library JAX for accelerator-oriented computation, Praxis-BGM is computationally efficient and scalable. Across extensive simulations and two real-world applications (breast cancer bulk transcriptomics for subtype recovery and single-cell transcriptomics for cross-platform label transfer), Praxis-BGM improves posterior clustering performance, stability, and biological interpretability, even when priors are partially mismatched.
bioinformatics 2026-04-30 v3
Systematic evaluation and benchmarking of text summarization methods for biomedical literature: From word-frequency methods to language models
Baumgärtel, F.; Bono, E.; Fillinger, L.; Galou, L.; Keska-Izworska, K.; Walter, S.; Andorfer, P.; Kratochwill, K.; Perco, P.; Ley, M.
Abstract
The rapid expansion of biomedical literature demands automated summarization tools that can reliably condense research articles into concise, accurate overviews. We benchmarked 62 text summarization methods - ranging from frequency-based and TextRank extractors to modern encoder-decoder models (EDMs) and large language models (LLMs) - on a set of 1,000 biomedical abstracts for which author-generated highlights sections were available as reference summaries. Models were evaluated using a composite suite of metrics covering lexical overlap (ROUGE-1/2/L, BLEU, METEOR), embedding-based semantic similarity (RoBERTa, DeBERTa, all-mpnet-base-v2), and factual consistency (AlignScore). Our results indicate that general-purpose language models (LMs) achieve the highest overall scores across both lexical and semantic metrics, outperforming both reasoning-oriented and domain-specific models. Within the general-purpose group, medium-sized models, typically runnable on a single node, often outperform frontier-scale counterparts, suggesting an optimal balance between model capacity and computational efficiency. Statistical extractive methods lag behind all neural approaches. These findings provide a systematic reference for selecting summarization tools in biomedical research and highlight that broad pretraining remains more effective than narrow domain adaptation for generating high-quality scientific summaries.
bioinformatics 2026-04-30 v3
Comprehensive top-down mass spectral repository enables pan-dataset analysis and top-down spectral prediction
Li, K.; Liu, K.; Fulcher, J. M.; Kaulich, P. T.; Tang, H.; Liu, X.
Abstract
Mass spectral libraries have become essential resources for training deep learning (DL) models for spectral prediction and de novo sequencing in bottom-up mass spectrometry (BU-MS). Compared with BU-MS, top-down MS (TD-MS) offers unique advantages for characterizing intact proteoforms by analyzing proteoforms without enzymatic digestion. Despite these advantages, large-scale spectral libraries for TD-MS are currently lacking. Here we present TopRepo, the first comprehensive and publicly available repository of TD-MS spectra, comprising more than 12 million spectra acquired from 12 species across seven types of mass spectrometers. Using TopRepo, we constructed a large-scale top-down spectral library containing over 5 million spectra with proteoform and fragment-ion annotations. We demonstrate that TopRepo enables pan-dataset analyses of N-terminal processing, mass shifts, and other proteoform characteristics identified by TD-MS. Furthermore, we show that the TopRepo spectral library substantially improves proteoform identification through spectral library searching and supports the training of DL models for high-accuracy top-down spectral prediction.
bioinformatics 2026-04-30 v2
Harnessing AI to Build Virtual Cells
Cheng, X.; Li, P.; Guo, H.; Liang, Y.; Gong, J.; de Vazelhes, W.; Gou, C.; Xie, P.; Song, L.; Xing, E. P.
Abstract
A virtual cell is a world model of a cell: a computational system that predicts, simulates and programs cellular processes across modalities and scales. An important path toward this goal is to model how genetic and chemical perturbations give rise to transcriptional responses, a core capability for disease understanding and drug discovery. However, current approaches remain expert-intensive, relying on iterative manual model design, training and debugging over months. Here we present VCHarness, an autonomous AI system that constructs perturbation-response models by combining an AI coding agent with multimodal biological foundation models. The system explores large spaces of architectures and training pipelines with minimal human intervention, iteratively generating, evaluating and refining candidate models. Across multiple perturbation-response benchmarks, VCHarness identifies architectures that outperform expert-designed approaches while reducing development time from months to days. It further uncovers non-obvious architectural patterns associated with improved performance, indicating that automated search can extend beyond conventional design strategies. These results suggest a shift from manually engineered models toward autonomous systems for constructing components of virtual cell world models, enabling scalable and data-driven exploration of cellular systems.
bioinformatics 2026-04-30 v2
On the consistency of duplication, loss, and deep coalescence gene tree parsimony costs under the multispecies coalescent
Sapoval, N.; Nakhleh, L.
Abstract
Gene tree parsimony (GTP) is a common approach for efficient reconciliation of multiple discordant gene tree phylogenies for the inference of a single species tree. However, despite the popularity of GTP methods due to their low computational costs, prior work has shown that some commonly employed parsimony costs are statistically inconsistent under the multispecies coalescent process. Furthermore, a fine-grained analysis of the inconsistency has indicated potentially complementary behavior of duplication and deep coalescence costs for symmetric and asymmetric species trees. In this work, we prove inconsistency of GTP estimators for all linear combinations of duplication, loss and deep coalescence scores. We also explore the empirical implications of this result by evaluating the inference results of several GTP cost schemes under varying levels of incomplete lineage sorting.
bioinformatics 2026-04-30 v2
Cross-Species Adaptation of RETFound for Rodent OCT Age Estimation Reveals Strong CNN Baselines in Data-Scarce Space Biology
Hayati, A.; Gong, J.; Nagesh, V.; Avci, P.; Ong, A. Y.; Masalkhi, M.; Engelmann, J.; Karouia, F.; Keane, P. A.; Costes, S. V.; Sanders, L. M.
Abstract
Space-biology imaging studies are often constrained by severe data scarcity, limiting the development of robust machine-learning biomarkers. Rodent spaceflight and space-analog datasets provide an important preclinical setting for testing transfer-learning strategies, but the extent to which human retinal foundation models can generalize to rodent optical coherence tomography (OCT) remains unclear. Here, we benchmark cross-species adaptation of RETFound, a human retinal Vision Transformer pretrained on 1.6 million retinal images, for chronological age prediction from Brown Norway rat OCT B-scans in the NASA Open Science Data Repository dataset OSD-679. We adapted RETFound using Low-Rank Adaptation (LoRA) and evaluated performance on control animals under matched 3-fold rat-level cross-validation. We compared RETFound+LoRA with a strong ImageNet-pretrained Xception baseline under matched protocols and included a scratch/random ViT as a negative-control architecture check. Metrics included mean absolute error (MAE), R2, and inter-eye mean absolute difference (MAD). RETFound+LoRA achieved MAE = 26.20 +/- 5.03 days with R2 = 0.744 +/- 0.049. However, Xception performed better in the primary benchmark (MAE = 19.01 +/- 7.67 days, R2 = 0.853 +/- 0.082), and the matched-fold comparison favored Xception, although this result should be interpreted cautiously given the small number of folds. Inter-eye consistency was maintained across the matched control evaluation, and saliency maps localized model attention to anatomically plausible inner retinal regions. Together, these results show that human retinal foundation models can transfer to rodent OCT in a scientifically useful way, but also that strong CNN baselines may outperform transformer-based models in small-sample cross-species settings. This preprint provides a reproducible benchmark and baseline framework for future retinal biomarker development in space biology.
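The three metrics named in this abstract can be computed as in the minimal sketch below. The ages and predictions are made-up illustration values, not data from OSD-679, the function names are hypothetical, and inter-eye MAD is assumed here to mean the mean absolute difference between the predictions for the left and right eye of the same animal.

```python
from statistics import mean

def mean_absolute_error(y_true, y_pred):
    """Average absolute deviation between true and predicted ages (days)."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def inter_eye_mad(pred_left, pred_right):
    """Mean absolute difference between paired left/right eye predictions
    (assumed definition, see the note above)."""
    return mean(abs(l - r) for l, r in zip(pred_left, pred_right))

# Hypothetical ages in days, for illustration only.
true_age = [100, 200, 300, 400]
pred_age = [110, 190, 320, 380]

mae = mean_absolute_error(true_age, pred_age)
r2 = r_squared(true_age, pred_age)
mad = inter_eye_mad([110, 190], [114, 184])
```

In a cross-validated setting such as the 3-fold protocol described above, these functions would be applied per fold and the fold values summarized as mean and standard deviation.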
bioinformatics 2026-04-30 v2
On the use of variational autoencoders for biomedical data integration
Pielies Avelli, M.; Hernandez Medina, R.; Webel, H. E.; Rasmussen, S.
Abstract
Variational Autoencoders (VAEs) are a widely used framework to integrate diverse biomedical data modalities, create representations that capture the underlying structure of the datasets, and obtain insights about the relations between variables. Here we describe how this is achieved from an empirical point of view in our previously developed VAE-based framework MOVE, providing an intuitive perspective on the inner workings of multimodal VAEs in biomedical contexts. We explore how the models' emerging dynamics shape their performance and how in silico perturbations can be leveraged to identify potential associations between variables. To do that, we extend our framework to handle perturbations of continuous variables, introduce a new approach to better capture associations between them, and create synthetic datasets to benchmark the proposed methods against well-defined ground truth associations. We finally showcase our findings in real biomedical scenarios, namely a multimodal dataset of inflammatory bowel disease and a dataset containing genetic knockdowns in K562 and RPE1 cells.
bioinformatics 2026-04-30 v2
Spartan: activation-aware framework for spatial domain and variable gene discovery
Faiz, M. F. I.; Jokl, E.; Jennings, R.; Piper Hanley, K.; Sharrocks, A.; Iqbal, M.; Baker, S. M.
Abstract
Spatial transcriptomics is rapidly advancing toward single-cell-level resolution, revealing complex tissue architectures organized across continuous anatomical gradients. However, accurate identification of spatial domains remains a central computational challenge, as many existing clustering approaches blur anatomical boundaries, merge transitional zones, or fail to resolve localized microstructures. Here we introduce Spartan, an activation-aware multiplex graph framework for high-resolution domain discovery. Spartan integrates spatial topology and Local Spatial Activation (LSA), a neighborhood deviation signal that captures localized transcriptional heterogeneity often attenuated by similarity-based clustering. By jointly modeling cohesion within domains and localized activation structure, Spartan recovers anatomically aligned partitions across spatially resolved transcriptomics technologies including Visium HD, MERFISH, Stereo-seq, and STARmap. We further demonstrate its utility in a high-resolution Visium HD section of developing human esophagus and stomach, where activation-aware graph integration enables precise delineation of complex transitional regions such as the gastroesophageal junction and supports stable multi-scale domain recovery without fragile hyperparameter tuning. Beyond domain identification, Spartan leverages activation-aware structure to detect spatially variable genes associated with localized tissue remodeling. Spartan scales near-linearly with dataset size, providing a robust and interpretable framework for spatial systems-level analysis.
bioinformatics 2026-04-30 v2
PanVariants: Best Practice for Pangenome-based Variant Calling Pipeline and Framework
Yi, H.; Wang, L.; Chen, X.; Ding, Y.; Carroll, A.; Chang, P.-C.; Shafin, K.; Xu, L.; Zeng, X.; Zhao, X.; Gong, M.; Wei, X.; Hou, Y.; Ni, M.
Abstract
Background: Although pangenome references offer richer population diversity compared to linear references, current mainstream pangenome-based variant callers are limited to detecting only known variants stored in the graph. To address this limitation, we developed PanVariants, a novel pipeline designed to improve the detection of both known and novel variants accurately. We systematically evaluated its performance against the traditional linear alignment solution (BWA+GATK/Manta) and the existing pangenome-aware solution (DRAGEN/PanGenie) in three contexts: small variants (SNVs/indels) and structural variants (SVs) accuracy in Genome in a Bottle samples, clinical detection on positive samples, and application in cohort-based joint calling. Results: By integrating k-mer-based and mapping-based methods, PanVariants significantly reduced variant errors (FPs + FNs), achieving a 73% reduction compared to BWA+GATK and a 45% reduction compared to DRAGEN for SNVs. Retraining the DeepVariant model with high-quality DNBSEQ data further decreased errors by 15%. For SVs detection, PanVariants attained an F1-score of 89.39%, markedly outperforming DRAGEN (68.18%) and BWA+Manta (58.33%), approaching long-read sequencing performance (95.22%). In validation using clinical positive samples, PanVariants successfully detected all expected pathogenic variants while PanGenie failed. In the cohort joint-calling analysis, PanVariants detected more variants, made fewer Mendelian inheritance errors, and gave better per-sample accuracy than GATK. Conclusions: PanVariants establishes a robust framework and best-practice pipeline for pangenome-based variant detection, achieving both sensitive novel variant discovery and high accuracy for SNVs, indels and SVs. Our systematic evaluation of optional processing steps and input variables offers practical guidance for users. 
Validated across diagnostic and population-based applications, our findings strongly support the transition from linear to pangenome references in future genomics.
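For readers comparing the figures quoted in this abstract, the sketch below shows how an F1-score and a relative reduction in variant errors (FPs + FNs) are usually derived from true-positive, false-positive and false-negative counts. The counts are invented for illustration and the function names are hypothetical; nothing here reproduces the PanVariants evaluation itself.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def error_reduction(errors_baseline, errors_new):
    """Fractional reduction in total errors (FP + FN) relative to a baseline."""
    return 1.0 - errors_new / errors_baseline

# Hypothetical counts for a new caller and a baseline caller.
tp, fp, fn = 900, 100, 100
baseline_errors = 400          # baseline FP + FN (made up)
new_errors = fp + fn           # 200

f1 = f1_score(tp, fp, fn)
reduction = error_reduction(baseline_errors, new_errors)
```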
bioinformatics 2026-04-30 v2
Survey of the human proteostasis network: the ubiquitin-proteasome system
Elsasser, S.; Powers, E.; Stoeger, T.; Sui, X.; Kurtzbard, R. D.; Martinez-Botia, P.; Wangaline, M. A.; Gama, A. R.; Huttlin, E. L.; Elia, L. P.; Kelly, J. W.; Gestwicki, J. E.; Frydman, J. E.; Finkbeiner, S.; Clerico, E. M.; Morimoto, R.; Prado, M. A.; Vertegaal, A. C. O.; Hofmann, K.; Finley, D.
Abstract
Modification by ubiquitination governs the half-lives of thousands of proteins that are fated for elimination by either the proteasome or autophagy pathways, depending on the intricate architectures of ubiquitin modification. This system mediates quality control for individual proteins, protein complexes, and organelles, as well as myriad purely regulatory functions. Here we provide a comprehensive survey of the ubiquitin-proteasome system (UPS), the scope of which is at present poorly defined. The UPS, with the inclusion of pathways involving ubiquitin-like modifiers, comprises in our estimate over 1400 distinct proteins in humans, a vast set of activities whose collective impact on the biology of the cell is pervasive. The UPS is an integral component of the proteostasis network (PN), the remainder of which we have also surveyed in recent studies. With the addition of molecular chaperones, proteins from autophagy-lysosome pathway, and related activities, the PN includes in total over 3100 components by our estimates. Comprehensive and systematic definition of these pathways should support a range of ongoing investigations in the areas of genomics, proteomics, biochemistry, cell biology, and disease research.
bioinformatics 2026-04-30 v2
A gene program dictionary of human cells
Xu, Y.; Wang, Y.; Geng, Z.; Qin, Y.; Ma, S.
Abstract
Defining all human cell types and their roles in health and disease is a central goal of biology. Single-cell RNA sequencing has enabled the construction of organ-specific cell atlases, but building a comprehensive organism-wide atlas spanning multiple organs remains challenging due to batch effects, study biases, and inter-organ complexity. Here, we present Gene Program Dictionary (GPD), a framework that leverages robust gene co-expression programs, rather than direct cell integration, to overcome these barriers. Using SpacGPA, a partial correlation-based network method, we analyzed 466 scRNA-seq datasets, generating 1,975 independent networks and 90,701 gene co-expression modules, which were consolidated into 1,534 consensus gene programs representing a wide range of human tissues and cell types. Each program serves as a composite marker, capturing both cell-type-specific and shared biological processes. We demonstrate their utility by mapping endothelial cell subtypes across tissues to reveal their heterogeneity (including tumor-specific programs), annotating colorectal cancer spatial transcriptomes, and linking programs and their corresponding cell types to disease loci, revealing hotspots such as neuronal programs in psychiatric disorders and a proximal tubule program in kidney diseases. GPD provides an organism-wide reference for studying cellular diversity and disease mechanisms.
bioinformatics 2026-04-30 v2
Overcoming systematic data biases enables accurate prediction of enzyme kcat fold-changes for computational protein design
Rousset, Y.; Kroll, A.; Lercher, M.
Abstract
Machine learning is increasingly used to guide protein engineering by predicting how mutations affect desired properties. Recent models for the turnover number (kcat) of enzymes report high accuracy, suggesting that mutation effects can be inferred directly from protein sequence. However, these approaches are typically evaluated on heterogeneous datasets of enzyme variants, where closely related sequences and systematic reporting patterns may confound model performance. A central challenge is therefore to determine whether current models truly capture mutation-specific effects or instead exploit statistical regularities in the data. Here we show that much of the reported accuracy in mutant kcat prediction arises from two pervasive biases: variants of the same enzyme occupy a narrow activity range, and mutations within a group often share a common direction of change. Simple baselines that exploit these biases match or exceed the performance of existing models, indicating that high apparent accuracy does not imply mechanistic understanding. To address this limitation, we introduce a bias-aware framework that reformulates prediction as a pairwise fold-change task and evaluates performance on unseen mutant-mutant pairs, thereby isolating mutation-specific signal. A proof-of-principle implementation explains approximately one-third of the variance under these conditions and outperforms existing models on leakage-controlled benchmarks. More broadly, this work establishes a general strategy for evaluating and modeling mutation effects in biochemical datasets, with implications for protein engineering and related fields.
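The pairwise fold-change reformulation described in this abstract can be pictured with the sketch below: rather than treating each variant's absolute kcat as the target, every pair of mutants of the same enzyme yields one log10 fold-change target. The enzyme names, mutation labels and kcat values are hypothetical, and this is only an illustration of the target construction, not the authors' model or benchmark.

```python
import math
from itertools import combinations

# Hypothetical kcat measurements (s^-1) for mutants of two enzymes.
kcat = {
    "enzA": {"A12G": 100.0, "L45F": 50.0, "K88R": 25.0},
    "enzB": {"T7S": 0.1, "D30N": 0.2},
}

def pairwise_log_fold_changes(variants):
    """Map each unordered mutant pair (m1, m2) to log10(kcat_m2 / kcat_m1)."""
    return {
        (m1, m2): math.log10(variants[m2] / variants[m1])
        for m1, m2 in combinations(variants, 2)
    }

# One dictionary of pair targets per enzyme; pairs never cross enzymes,
# so the enzyme-level activity range cancels out of the target.
pairs = {enzyme: pairwise_log_fold_changes(v) for enzyme, v in kcat.items()}
```

Because both members of a pair come from the same enzyme, a baseline that only knows the enzyme's typical activity carries no information about the target, which is the sense in which the pairwise task isolates mutation-specific signal.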
bioinformatics 2026-04-30 v2
Hierarchical Breakdown of RNA Structure Prediction in CASP16: From Reliable Local Features to Speculative Multimer Assembly
Nithin, C.; Pilla, S. P.; Kmiecik, S.
Abstract
CASP16 provided a community-wide benchmark for assessing RNA structure prediction, including the first large-scale blind assessment of RNA-RNA multimer prediction. The results showed that achieving high atomic precision remains a major challenge across the field. In this work, we use the performance of our group (LCBio) as a diagnostic case study to examine the current limits of RNA structure prediction. Our workflow ranked first in the RNA multimer category and remained competitive for monomers. We combine hierarchical analysis with representative case studies to identify a pattern of predictive breakdown, in which modeling fidelity degrades from reliable local features to increasingly speculative global architectures. Multi-helix junctions appear to mark a major transition boundary where 2D topological success often fails to translate into 3D geometric realism, leading to cascading errors in global architecture. This hierarchical breakdown is especially pronounced in RNA multimers, where limitations in the recovery of junction geometry and tertiary interactions propagate directly into errors in higher-order assembly, making multimer prediction increasingly speculative. By placing benchmark performance in a direct structural context, this case study helps define the current limits of RNA structure prediction and highlights priorities for improving predictive accuracy.
bioinformatics 2026-04-30 v2
Distilling Direct Effects via Conditional Differential Gene Expression Analysis
Gu, J.; Skelton, A.; Staley, J.; Popson, P. O.; Peng, L.; Song, X.; Knowles, J. K.; He, Z.
Abstract
Differential gene expression (DGE) analysis is foundational for interpreting RNA sequencing data, but it conflates direct biological effects with correlations propagated through gene co-expression. Across three RNA sequencing datasets (including a genome-scale perturb-seq experiment), we find that only a small fraction of differentially expressed genes have direct effects on the trait of interest, while the majority are undirected or passengers whose associations are mediated through other genes. To distinguish direct effect genes, we introduce conditional differential gene expression (CDGE) analysis, a framework that tests for conditional rather than marginal association between each gene and the trait of interest. Implemented via the GhostKnockoff procedure with lasso regression, CDGE delivers false discovery rate control, operates on summary statistics from existing DGE pipelines, and accommodates batch effects. The genes identified by CDGE mediate the effects of most other differentially expressed genes and show stronger enrichment for known protein-protein interactions and biological pathways than DGE-identified genes. These results suggest that the field has been systematically over-interpreting DGE outputs, and that distinguishing direct from mediated effects is essential for prioritizing genes for functional follow-up and therapeutic development.
bioinformatics 2026-04-30 v2
CountESS: a flexible, graphical pipeline tool for deep mutational scanning analysis
Moore, N.; Sargeant, C. J.; Wakefield, M. J.; Popp, N. A.; Fowler, D. M.; Rubin, A. F.
Abstract
Deep Mutational Scanning (DMS) experiments generate large volumes of sequencing data that must be processed through multi-step computational pipelines to yield interpretable variant scores. At least twelve dedicated tools have been published for this purpose, yet the diversity of experimental designs, scoring strategies, and software implementations has produced a fragmented landscape in which no single tool accommodates the full range of workflows encountered in practice. Here we present CountESS (Count-based Experiment Scoring and Statistics), an open-source pipeline tool that provides a modular, graphical interface for constructing flexible DMS analysis workflows. CountESS supports a wide range of input formats, barcode translation, HGVS variant calling, and user-defined scoring functions, enabling it to accommodate diverse experimental designs including selection assays, time-series experiments, and bin-based assays such as VAMP-seq. Implemented in Python with DuckDB as a computational backend, the software provides high-performance, memory-efficient processing suitable for large datasets. CountESS is freely available at https://github.com/CountESS-Project/CountESS under the 3-Clause BSD Licence. Supplementary data, including demonstration pipelines and example datasets, are available at https://github.com/CountESS-Project/countess-demo.
bioinformatics 2026-04-30 v1
EnzCast: Prediction of Patient-Specific Enzymatic Kinetics through Multi-Modal Deep Learning and Isoform-Resolved Bayesian Inference based on Single-Cell Transcriptomics
Mu, X.; Yang, Y.; Wang, Q.; Chen, Z.; Luo, B.; Huang, Z.; Lin, X.; Xu, L.; Li, X.; Qu, Y.; Xiao, J.; Wang, Z.; Shi, B.; Ou, Q.; Yao, B.; Yan, J.; Zhuang, Y.; Zhang, Y.; Shi, R.; Xu, Y.
Abstract
Enzyme kinetic parameters underpin mechanistic biology but remain sparse in physiological context. We present EnzCast, a multi-modal framework jointly predicting Km, kcat, kcat/Km, and Ki from protein sequence, 3D structure, substrate chemistry, and experimental conditions, paired with IsoKin, an isoform-resolved Bayesian framework converting EnzCast priors into patient-specific in vivo kinetics. Trained on KinBench, the largest curated kinetics database, task-adaptive EnzCast achieved R2 = 0.413, 0.455, 0.227, and 0.105 for Km, Ki, kcat, and kcat/Km, surpassing all baselines on catalytic tasks. Systematic condition scans recovered compartment-specific pH direction inversion and pathway-level temperature responses. In a 20-patient colorectal cancer single-cell cohort, IsoKin reduced posterior uncertainty by 73.3% and 77.3%, revealing cell-type-specific rewiring. Orthogonal validation (scFEA flux, DepMap essentiality with permutation P = 0.0008, and TCGA survival) provided mixed but directionally consistent support. Together, EnzCast and IsoKin bridge in vitro prediction, condition-aware biochemical interrogation and patient-resolved in vivo inference.
bioinformatics 2026-04-30 v1
Understanding the bias of compositional microbiome differential abundance estimation
Calle, M. L.; Pujolassos, M.; Susin, A.
Abstract
One of the most relevant objectives in microbiome studies is the identification of microbial species that are differentially abundant across conditions. However, the compositional nature of microbiome data complicates this task. Interdependence among components leads to spurious associations when the abundances of each component are analyzed separately. Due to the growing awareness of the challenges of compositional data analysis (CoDA), log-ratio transformations, such as the additive log-ratio (alr) or the centered log-ratio (clr) transformations, have become increasingly popular in microbiome studies. Several studies have compared the performance of compositional and non-compositional methods through simulations. However, the debate between these two frameworks remains unresolved, creating confusion among researchers. Rather than relying on simulation-based results, this work provides theoretical results that enable a more rigorous and conclusive analysis of the problem, contributing to a better understanding of differential abundance estimation. We provide theoretical expressions of the bias of differential abundance estimation related to the use of proportions (total sum scaling) and log-ratio transformations (alr and clr) when estimates are interpreted as absolute rather than relative to a reference. The factors that most strongly influence the bias are the magnitude and direction of the effects, the dimension of the composition, the proportion of differentially abundant variables, and the distribution of relative abundances. The findings of this work strongly support the use of CoDA transformations; however, they also highlight that even when log-ratio transformations are applied, interpreting the results outside of a CoDA framework can still lead to biased conclusions. 
Among CoDA transformations, alr has several advantages over clr: its reference is more explicit, which reduces the risk of interpreting estimates as absolute rather than relative, and it facilitates the replication of results in independent studies, as it only requires assessing changes relative to the same reference rather than reconstructing the full composition. In this work, we propose a heuristic method for selecting a suitable alr reference component, which will enable a more widespread use of this transformation.
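A small numerical sketch can make the point about total sum scaling, clr and alr concrete. The abundances below are hypothetical: only taxon 0 changes (tenfold) between the two conditions, and taxon 3 is chosen as the alr reference because it is known to be unchanged in this toy setup. The sketch does not implement the paper's theoretical bias expressions or its heuristic for choosing a reference.

```python
import math

def closure(x):
    """Total sum scaling: convert abundances to proportions."""
    total = sum(x)
    return [v / total for v in x]

def clr(x):
    """Centered log-ratio: log of each part minus the mean log (log geometric mean)."""
    logs = [math.log(v) for v in x]
    center = sum(logs) / len(logs)
    return [l - center for l in logs]

def alr(x, ref):
    """Additive log-ratio: log of each part relative to the reference part."""
    return [math.log(v / x[ref]) for i, v in enumerate(x) if i != ref]

# Hypothetical absolute abundances; only taxon 0 differs between conditions.
abs_a = [100.0, 100.0, 100.0, 100.0]
abs_b = [1000.0, 100.0, 100.0, 100.0]
prop_a, prop_b = closure(abs_a), closure(abs_b)

# Log change in proportions: unchanged taxa appear to decrease.
tss_diff = [math.log(b / a) for a, b in zip(prop_a, prop_b)]
# clr differences sum to zero, so unchanged taxa get a negative estimate.
clr_diff = [b - a for a, b in zip(clr(prop_a), clr(prop_b))]
# alr differences are relative to taxon 3 and recover log(10), 0, 0 here.
alr_diff = [b - a for a, b in zip(alr(prop_a, ref=3), alr(prop_b, ref=3))]
```

Reading `tss_diff` or `clr_diff` as absolute effects would suggest that taxa 1 to 3 decreased, while `alr_diff` is correct only as a statement relative to the chosen reference, which is the interpretation the abstract argues for.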
bioinformatics 2026-04-30 v1
Data-driven prioritization of mouse strains for improved preclinical modeling of rare and common disease
Ball, R. L.; Klein, A.; Gerring, M. W.; Berger-Liedtka, A. K.; Kim, M. J.; Berry, M. A.; Gargano, M. A.; Mukherjee, G.; Fisher, H. S.; Nichols-Meade, T.; Castellanos, F.; Smith, C. L.; Karlebach, G.; Murray, S. A.; Bult, C. J.; Robinson, P. N.; Chesler, E. J.
Abstract
Choosing an appropriate mouse genetic background is a persistent challenge for successful translation of preclinical disease modeling. We present Strain Recommender, a genomic framework that prioritizes inbred mouse strains as relatively vulnerable or resilient to a disease state using disease-associated gene signatures and strain-specific transcriptome predictions. The method represents disease states as weighted gene scores, ranks 657 strains based on resemblance to the disease state, and estimates uncertainty via a permutation-derived false positive rate (FPR). In a prospective validation of connective tissue disorder predictions, vulnerable and resilient Collaborative Cross strains showed significantly different cardiovascular abnormalities. In a global retrospective validation predicting previously reported strain background effects, Strain Recommender achieved ≥90% sensitivity for 86.6% of diseases with 94.4% mean sensitivity (95% CI: 94.0-94.8%) across 5,890 diseases, including 92.3% (95% CI: 91.6-93.0%) for 2,598 rare diseases, demonstrating its potential to improve the validity of mouse models of human disease.
bioinformatics2026-04-30v1linearPOA: A parallel, memory-efficient framework for Partial Order Alignment with linear space complexity
Wei, Y.; Huang, Z.; Zhang, P.; Tian, Q.; Li, Y.; Zou, Q.; Yu, L.Abstract
Multiple sequence alignment (MSA) is a fundamental problem in computational bioinformatics, playing a critical role in genome biology, especially in long-read sequencing and assembly. One solution for representing and solving MSA is Partial Order Alignment (POA), which employs Directed Acyclic Graphs (DAGs) to represent sequence relationships. However, for ultra-long, error-prone reads (e.g., >100 kbp), existing POA algorithms with quadratic space complexity become impractical due to excessive memory consumption. This paper introduces linearPOA, which uses a divide-and-conquer strategy to solve POA and saves memory compared to quadratic-space algorithms such as SPOA, abPOA and TSTA. Notably, it reduces memory usage by up to 102.74-fold when aligning 100 kbp reads, compared to abPOA run without heuristics. The algorithm is implemented in the linearPOA library, which provides POA functionality and foundational support for sequencing analyses such as read error correction. linearPOA thus provides memory-efficient alignment for long-read sequencing, especially for directly assembling long reads of around 100 kbp.
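The divide-and-conquer idea behind linear-space alignment can be sketched for the simpler pairwise case. The example below is a Hirschberg-style split step for two plain sequences with a linear gap penalty; it is not linearPOA's graph-based algorithm, and the function names and scoring parameters are assumptions chosen for illustration. Only two score rows are held in memory at a time, and the best split column recovers the same optimal score as the full dynamic program.

```python
def nw_last_row(a, b, match=1, mismatch=-1, gap=-1):
    """Last row of the Needleman-Wunsch score matrix, keeping only two rows."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + sub,  # align a[i-1] with b[j-1]
                         prev[j] + gap,      # gap in b
                         cur[j - 1] + gap)   # gap in a
        prev = cur
    return prev

def split_point(a, b):
    """Find where an optimal alignment crosses the middle row of a."""
    mid = len(a) // 2
    # Scores of aligning the first half of a against every prefix of b.
    fwd = nw_last_row(a[:mid], b)
    # Scores of aligning the second half of a against every suffix of b,
    # computed on the reversed sequences.
    rev = nw_last_row(a[mid:][::-1], b[::-1])
    totals = [fwd[j] + rev[len(b) - j] for j in range(len(b) + 1)]
    j_best = max(range(len(b) + 1), key=totals.__getitem__)
    return mid, j_best, totals[j_best]

a, b = "GATTACA", "GCATGCU"
mid, j_best, score = split_point(a, b)
# The split score equals the optimal global alignment score.
print(mid, j_best, score, nw_last_row(a, b)[-1])
```

A full aligner would recurse on `(a[:mid], b[:j_best])` and `(a[mid:], b[j_best:])` to recover the alignment itself while keeping memory linear in the sequence length.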
bioinformatics2026-04-30v1TxConformal: Controlling False Discoveries in AI-Driven Therapeutic Discovery
Jin, Y.; Huang, K.; Diamant, N.; Buchholz, K. R.; Rutherford, S. T.; Skelton, N.; Biancalani, T.; Scalia, G.; Leskovec, J.; Candes, E. J.Abstract
Artificial Intelligence (AI) is transforming therapeutic discovery by scoring a large set of promising candidates and prioritizing a shortlist for further investigation. Quantifying the reliability of AI scores and preventing false positives among selected candidates is key to the efficiency of the discovery process. Conformal prediction (CP) has emerged as a popular tool for guiding such prioritization, especially via the conformal selection framework to control false discovery rates (FDR) in selecting top-ranked candidates under distributional shift. However, deploying these advances in real-world therapeutic discovery remains challenging: distribution shifts are difficult to quantify and correct in high-dimensional biomedical data, and practical workflows often require flexible error metrics. Here, we present TxConformal, a general framework for trustworthy decision making when building shortlists using AI scores. TxConformal adjusts for distribution shift by balancing the hidden representations in AI models and then provides confidence measures for true discoveries of target biological properties. These confidence measures, interpretable as p-values, can be used in conjunction with statistical multiple testing procedures to derive selection decisions with limited false positives or to estimate the errors in given selection decisions. TxConformal controls the false positive rate in six real-world tasks spanning various therapeutic discovery stages, modalities, and AI models with realistic data splits. When selecting promising combinatorial genetic perturbations, TxConformal nearly halves false-positive selections compared to baseline methods, substantially reducing unnecessary experimental costs by tens of thousands of dollars. 
When selecting stable protein structures under mutant shifts, TxConformal identifies about 10 times more proteins than baseline methods at stringent thresholds when running at a target FDR level of 10%, recovering over 90% of valuable candidates that baseline methods miss due to unaccounted distribution shifts. Furthermore, we demonstrate that TxConformal robustly supports various alternative error metrics suitable for resource-constrained settings. Finally, in a prospective fixed-budget virtual screening campaign for novel antibiotic discovery, TxConformal predicted false positives in close agreement with experimental outcomes, with substantial improvements over simple baselines.
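To make the link between conformal p-values and multiple testing concrete, here is a minimal NumPy sketch of a generic recipe: p-values are computed from the scores of calibration examples known to lack the target property, and the Benjamini-Hochberg procedure then selects candidates at a target FDR. This is a textbook construction that assumes exchangeable data; it does not include TxConformal's representation balancing for distribution shift, and the function names and scores are hypothetical.

```python
import numpy as np

def conformal_pvalues(calib_null_scores, test_scores):
    """p-value per candidate: rank of its score among null calibration scores.

    Higher scores are assumed to indicate the target property more strongly.
    """
    calib = np.sort(np.asarray(calib_null_scores, dtype=float))
    test = np.asarray(test_scores, dtype=float)
    n = calib.size
    # Number of null calibration scores greater than or equal to each test score.
    n_ge = n - np.searchsorted(calib, test, side="left")
    return (1.0 + n_ge) / (n + 1.0)

def benjamini_hochberg(pvals, alpha=0.1):
    """Boolean mask of candidates selected by the BH procedure at level alpha."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    passed = p[order] <= thresholds
    selected = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])  # largest rank meeting its threshold
        selected[order[: k + 1]] = True
    return selected

# Hypothetical AI scores for inactive calibration molecules and new candidates.
calib_null = np.array([0.1, 0.2, 0.3, 0.4])
candidates = np.array([0.5, 0.25])
pvals = conformal_pvalues(calib_null, candidates)
shortlist = benjamini_hochberg(pvals, alpha=0.1)
print(pvals, shortlist)
```

With only four calibration scores the smallest attainable p-value is 1/5, so nothing can be selected at a 10% level; larger calibration sets give finer p-values.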
bioinformatics2026-04-30v1Molecular Translators as a Computational Primitive for Biomarker Discovery: Learnability Gains Under Conserved Information Ceilings
Saisan, P. A.; Patel, S. P.Abstract
Virtual molecular mapping systems such as MISO and GigaTIME introduce a potentially transformative primitive in computational pathology: translation of H&E whole-slide images into biologically structured molecular representations, learned on paired cohorts and deployed as an inference-time map. Despite sustained progress in machine learning, H&E-to-molecular-biomarker (e.g., gene mutation) prediction continues to exhibit recurrent field-level performance plateaus whose drivers remain poorly resolved. It remains unclear whether continued optimization targets a removable methodological limitation or instead presses against an intrinsic ceiling imposed by morphology. We develop a formal framework characterizing what deterministic translators can and cannot change. Histology-based biomarker modeling is governed by two constraints: method-limited gaps (finite labels, weak supervision, structured nuisance) and modality-limited ceilings (intrinsic slide-specific information in morphology). Because deterministic translation introduces no new slide-level measurements at inference, H&E information ceilings are conserved; however, translation can still improve finite-sample learnability, yielding an apparent information-performance paradox that we formalize as learnability gains under conserved information ceilings. We derive falsifiable signatures distinguishing these regimes and characterize them in controlled analytical experiments anchored to representative systems, including MISO and GigaTIME. We introduce an open-source toolkit comprising learning regime diagnosis, information-ceiling estimations, phase analyses, fidelity perturbation tests, and shortcut-confounding stress tests as an operational rubric for identifying and overcoming removable performance plateaus in translator-assisted molecular biomarker discovery and computational pathology.
bioinformatics2026-04-30v1Unlocking Multi-Sample Differential Expression for Spatial Transcriptomics Data with TESSERA
Constantine, F.; Laszik, Z.; Dudoit, S.; Purdom, E.Abstract
Spatial transcriptomics allows the unprecedented examination of gene expression levels at the resolution of spatially-situated single cells in a high-throughput manner. As the technology is adopted more broadly, studies frequently collect data from multiple tissue samples, which leads to unique challenges that traditional spatial statistical methods are not equipped to handle. In particular, factors that differ across samples, such as different coordinate systems, different numbers and types of cells, different underlying tissue architectures, among others, preclude the application of traditional methods to our new setting. In this work, we propose a novel method, TESSERA, based on a spatial generalized linear model, for analyzing multi-sample spatial transcriptomics count data. Importantly, we provide a mathematical and computational framework for efficient and scalable model fitting and statistical inference to accompany the specification of our model. Our method for fitting the model enables the estimation of a common set of fixed effects across samples. This allows us to address a variety of differential expression questions, such as identification of which genes are differentially expressed between conditions (e.g., diseases, treatments), while accounting for spatial correlation between cells within a sample. We benchmark our proposed method on simulated data and apply it to a spatial transcriptomics dataset of human kidney samples. We find that our method provides a hitherto nonexistent extension to the multi-sample setting while remaining competitive with or outperforming existing algorithms in the single-sample setting.
bioinformatics2026-04-30v1A Conditional Variational Autoencoder with QSAR-Guided Surrogate-Weighted Fine-Tuning and Cross-Entropy Optimization for Targeted Antimicrobial Peptide Generation
Castanon, I.; Wan, F.; de la Fuente, C.; Pini, A.; Falciani, C.Abstract
Machine learning frameworks have emerged as a promising tool for antimicrobial peptide design; however, generative models remain limited by two persistent problems: the limited availability of experimentally validated peptides and the circular dependency of the models. In this work we present a conditional variational autoencoder pipeline that addresses both limitations through a modular architecture that combines binary and quantitative experimental data and implements a multimodal approach to externally guide the generation. A transformer-based encoder successfully generated a discriminative 64-dimensional latent space (test AUROC 0.968, F1 0.919) separating antimicrobial from non-antimicrobial sequences. This latent representation conditions a species-specific LoRA fine-tuned ProtGPT2 decoder through a scalar gating function, which generates balanced antimicrobial peptides in two different modes, prior and perturb, depending on their generation starting points. We introduced a Surrogate Weighted Fine-Tuning (SWF) ensemble to eliminate the circular dependency and a Cross-Entropy Method to explore and exploit the latent space, leading to successful antimicrobial peptide generation. The best candidates exhibited competitive physicochemical characteristics, a mean helical fraction of 0.874 (mean pLDDT 83.7), and externally predicted efficacy evaluated by APEX.
bioinformatics2026-04-30v1Systems Pharmacology Reveals Type I Interferon and Myeloid-Like B Cell Reprogramming as Druggable Axes in Antiphospholipid Syndrome
Sun, B.; Lu, Y.; Liu, W.; Wang, C.Abstract
Antiphospholipid syndrome (APS) lacks targeted therapies beyond anticoagulation, and its molecular heterogeneity remains poorly characterized. We employed an integrative systems pharmacology approach combining weighted gene co-expression network analysis (WGCNA), single-cell RNA sequencing, Connectivity Map (CMap) screening, and molecular docking to identify druggable targets in APS. WGCNA of bulk RNA-seq data from neutrophils (n = 18) and whole blood (n = 88) identified two disease-associated modules: ME10 (176 genes, r = 0.77, interferon-I signaling) and ME2 (3409 genes, r = 0.79, degranulation/innate activation). Single-cell analysis of 26,936 B cells revealed transitional B cells with elevated ME2 scores and aberrant SPI1 expression, suggesting myeloid-like transcriptional reprogramming. CMap analysis ranked chloroquine, a first-line APS therapy, among top ME2 candidates (NCS = -2.07), validating the computational approach. DrugBank mapping identified 14 FDA-approved drugs targeting module genes, and a 3-gene machine learning signature (CORO1A, ANKRD22, IFITM1) achieved cross-tissue validation AUC of 0.802. External validation confirmed ME2 pathway modulation by NAPc2 intervention and cross-tissue module conservation in platelets. Patient-level ME10 x ME2 stratification revealed four molecular subtypes with distinct pathway activation profiles. This framework nominates druggable targets across both IFN-I and degranulation pathways, providing a foundation for pathway-guided precision medicine in APS.
bioinformatics2026-04-30v1Species-specific transformer models of bacterial gene order and content for genomic surveillance tasks
Horsfield, S. T.; Wiatrak, M.; McInerney, J. O.; Bentley, S. D.; Colijn, C.; Lees, J. A.Abstract
Transformer models enable functionally meaningful representation of complex biological data, such as nucleotide or protein sequences. Existing foundation transformer models are trained on large multi-domain corpora of unlabelled DNA or protein data, showing unmatched task generalisation. However, these foundation models are often outperformed on domain-specific tasks by models trained on taxonomically-constrained data, such as gene classification in prokaryotes. By extension, species-specific transformer models hold promise for targeted analyses, provided that sufficient training data are available. Epidemiological analysis of bacterial pathogens exemplifies the use-case of species-specific transformers, due to the wealth of genome data available, coupled with pathogen-specific analyses carried out during routine and outbreak surveillance. Here, we trained a transformer model, PanBART, on the gene content and gene order of two important and biologically distinct bacterial pathogens, Escherichia coli and Streptococcus pneumoniae, benchmarking against state-of-the-art non-transformer approaches for genomic epidemiology. We show PanBART learns representations of population structure in an unsupervised manner, and can be used to accurately assign genomes to biologically-meaningful sequence clusters. PanBART is also able to identify emergent lineages, differentiating them from pre-existing lineages, and can accurately predict genomes likely to acquire genes involved in antibiotic resistance before a transfer event has occurred. Finally, PanBART can be used to conduct co-selection analysis to identify pairs of genes likely to be found together. Our work demonstrates that species-specific transformer models can be employed in many critical public health scenarios. We lay the groundwork for wider application of such models in epidemiological analysis, and provide scenarios where such models excel.
bioinformatics2026-04-30v1Advances in protein function prediction from the fifth CAFA challenge
De Paolis Kaluza, M. C.; Ramola, R.; Joshi, P.; Piovesan, D.; Reade, W.; Orchard, S.; Martin, M. J.; Ignatchenko, A.; Kaggle Competition Participants, ; Rost, B.; Orengo, C. A.; Robinson-Rechavi, M.; Durand, D.; Brenner, S. E.; Greene, C. S.; Mooney, S. D.; Friedberg, I.; Radivojac, P.Abstract
The Critical Assessment of Function Annotation (CAFA) is a long-standing community effort to independently assess computational methods for protein function prediction, to highlight well-performing methodologies, to identify bottlenecks in the field, and to provide a forum for the dissemination of results and exchange of ideas. In its fifth round (CAFA 5) of triennial challenges, a partnership with Kaggle Inc. facilitated participation from a large community of data scientists and computational biologists through a competitive prospective challenge on the crowdsourcing platform. In this work, we present an in-depth analysis of the submitted predictions and report improvements in accuracy over all methods from the previous CAFA challenges. We further introduce a new evaluation setting for proteins with pre-existing (incomplete) annotations and identify the need for methods that better leverage existing annotations to predict those that will be discovered later. Finally, we characterize the prospective evaluation framework by examining performance on a strict set of unpublished annotations and across intermediate database releases. Our results indicate that recent developments in the field, such as the availability of protein language models and accurately predicted 3D structures, as well as the growth of experimental annotations through biocuration, have all contributed to performance improvements.
bioinformatics2026-04-30v1CAPHEINE, or everything and the kitchen sink: a workflow for automating selection analyses using HyPhy
Verdonk, H. E.; Callan, D.; Kosakovsky Pond, S. L.Abstract
Here we present CAPHEINE, a computational workflow that starts with a set of unaligned pathogen sequences and a reference genome and performs a comprehensive exploratory evolutionary analysis of the input data. CAPHEINE pairs nicely with studies of site-level selection dynamics, gene-level positive selection, and lineage-specific shifts in selective pressure. Our workflow is portable across Mac OS, Windows, and Linux, allowing researchers to focus on results. CAPHEINE is freely available at https://github.com/veg/CAPHEINE, along with a set of usage instructions.
bioinformatics2026-04-29v3Multi-Modal Deep Learning Integrates Spatial Topologies and Sequential Motifs to Identify Class I HDAC Inhibitors as Pan-Cancer Therapeutics
Tong, S.; Zhang, W.; Ji, S.Abstract
The molecular characterization of human solid tumors has revealed immense genomic complexity and intra-tumoral heterogeneity. Converting these detailed, multi-omic profiles into actionable, broad-spectrum therapeutics remains a formidable bottleneck in precision oncology. Traditional computational drug repurposing strategies largely rely on single-modality chemical descriptors, which frequently fail to capture the systemic transcriptomic interactions within the highly dynamic tumor microenvironment. Here, this study presents a robust multi-modal deep learning framework that synergistically integrates two-dimensional (2D) molecular graphs via Graph Neural Networks (GNNs) and chemical functional group patterns via self-attention Transformers. By mapping this dual-stream chemical feature space to the perturbational transcriptomic signatures (LINCS L1000) of 22 distinct cancer types from The Cancer Genome Atlas (TCGA), a vast library of over 28,000 small-molecule compounds was computationally screened. The developed multi-modal architecture achieved state-of-the-art predictive accuracy, significantly outperforming traditional single-modality baseline models. Strikingly, our comprehensive pan-cancer transcriptomic reversal landscape identified a persistent convergence of non-oncology drugs exhibiting potent broad-spectrum anti-tumor potential. Specifically, Class I Histone Deacetylase (HDAC) inhibitors, most notably TC-H-106, RG2833, and Tianeptinaline (agents originally developed to penetrate the blood-brain barrier for neurodegenerative and psychiatric disorders), emerged as top therapeutic candidates across lung adenocarcinoma (LUAD), bladder urothelial carcinoma (BLCA), and rectum adenocarcinoma (READ).
Subsequent high-dimensional network pharmacology and functional enrichment analyses confirmed that these agents robustly suppress essential oncogenic pathways, specifically collapsing the G1/S phase transition and DNA damage repair machineries. Furthermore, structural validation via molecular docking and force-field thermodynamics confirmed the highly stable physical binding affinity (Vina score: -7.0 kcal/mol, MMFF94 Energy: 64.76 kcal/mol) of TC-H-106 to the HDAC1 catalytic pocket. Kaplan-Meier survival analysis based on TCGA gene expression stratification underscored the significant prognostic benefit of targeting this epigenetic axis. Collectively, these findings introduce a powerful multi-modal AI framework for systems-level drug repurposing and highlight brain-penetrant Class I HDAC inhibitors as highly promising candidates for pan-cancer epigenetic therapy.
bioinformatics2026-04-29v2Fast and haplotype-aware assembly of high-fidelity reads based on MSR sketching: the Alice assembler
Faure, R.; Hilaire, B.; Flot, J.-F.; Lavenier, D.Abstract
Background: Long-read metagenomic assembly is becoming a critical bottleneck in microbiome analysis, as deep sequencing generates massive datasets that existing methods struggle to assemble while maintaining strain resolution. Results: We present Alice, a lightweight long-read assembler that achieves orders-of-magnitude speedups through a new sequence sketching technique, MSR sketching, compatible with classical assembly methods. Alice assembles a 235 Gbp soil metagenome in 5 hours using only 84 GB RAM - a task that causes most competing methods to exhaust our computational resources (500 GB RAM and 7 days runtime). Across diverse benchmarks, Alice delivered strain-resolved assemblies an order of magnitude faster than state-of-the-art approaches, while producing the most complete assemblies in some cases. Conclusions: MSR sketching overcomes computational barriers in metagenomic assembly, enabling fast, memory-efficient strain-resolved analysis of massive datasets. While Alice's assemblies were more fragmented than with other assemblers, this approach establishes a promising paradigm for scalable metagenomic analysis.
bioinformatics2026-04-29v2Learning the All-Atom Equilibrium Distribution of Biomolecular Interactions at Scale
Wang, Y.; Xu, Y.; Li, W.; Yu, H.; Tan, W.; Li, S.; Huang, Q.; Chen, N.; Wu, X.; Wu, Q.; Liu, K.Abstract
Biomolecular functions are governed by dynamic conformational ensembles rather than static structures. While models like AlphaFold have revolutionized static structure prediction, accurately capturing the equilibrium distribution of all-atom biomolecular interactions remains a significant challenge due to the high computational cost of molecular dynamics (MD). We present AnewSampling, a transferable generative foundation framework designed for the high-fidelity sampling of all-atom equilibrium distributions, which is the first model to faithfully reproduce MD at the all-atom level. It uses a quotient-space generative framework to ensure mathematical consistency and leverages the largest self-curated database of protein-ligand trajectories to date, with over 15 million conformations. Statistically, AnewSampling consistently outperforms all prior generative methods on the ATLAS monomer benchmark, and the all-atom capabilities of AnewSampling enable close statistical alignment with ground-truth MD for evaluating atomic biomolecular interactions in protein-ligand dynamics. Furthermore, AnewSampling successfully recovers coupled ligand and side-chain motions in CDK2 systems, overcoming a major sampling hurdle inherent to conventional MD. AnewSampling enables rapid exploration of conformational landscapes prior to intensive simulations, elucidating fundamental biophysical mechanisms and accelerating the broader design of functional biomolecules.
bioinformatics2026-04-29v2Using AI to Build AI: AIDO.Builder Enables Autonomous Machine Learning Model Building for Biomedicine
Guo, H.; Liang, Y.; Cheng, X.; Ellington, C.; Xie, P.; Song, L.; Xing, E.Abstract
Machine learning accelerates biomedical discovery, but creating effective predictive models requires specialized human expertise and demanding manual effort. Researchers must iteratively design pipelines, select architectures, and debug code. This challenge is particularly severe in biomedicine because of the heterogeneous datasets, sparse annotations, and complex evaluation protocols that are common in the domain. We present AIDO.Builder, an agentic artificial intelligence system that fully automates the entire life-cycle of biomedical model development. Provided only with a natural language task description and a target metric, AIDO.Builder autonomously constructs executable training and evaluation pipelines. The system selects suitable modeling strategies, executes experiments, and uses an automated feedback loop to iteratively revise its own code, configurations, and training procedures. It flexibly adapts to new tasks by training specialized models de novo or by using pretrained foundation models to build predictive models through task-appropriate adaptation. We show that across diverse biomedical benchmarks, AIDO.Builder produces highly competitive solutions against human alternatives, while eliminating the manual iteration previously required for robust model development. By automating the translation of raw data into reliable AI models, AIDO.Builder demonstrates how AI itself can be used to accelerate AI for biomedical research.
bioinformatics2026-04-29v2Robust metabolomics data normalization across scales and experimental designs
Vynck, M.; Vangeenderhuysen, P.; De Paepe, E.; Nawrot, T.; Plekhova, V.; Vanhaecke, L.Abstract
Metabolomics studies employing liquid chromatography-mass spectrometry are affected by signal drift and batch effects, introducing technical variance that impedes biological knowledge discovery. Quality control (QC) sample-based normalization strategies are widely implemented but remain vulnerable to outliers, thereby reducing normalization performance. We introduce rLOESS, rGAM and tGAM, three robust normalization methods that improve resistance to outliers by downweighting or accommodating them. Leveraging additive models, the rGAM and tGAM methods allow flexible non-linear modeling, differential sample weighting, and data-driven QC representativeness evaluation. Implementations of these methods are gathered in the Metanorm R package, integrating robust normalization with visualization for performance verification, while supporting efficient parallel processing. In in silico and/or experimental datasets, the robust methods, relative to several popular existing strategies, improved replicate concordance and reduced drift and batch effects. The robust methods, with improved recovery of the underlying signal demonstrated in simulation, produced distinct differential abundance results, highlighting the impact of normalization on downstream statistical inference. Overall, tGAM-based normalization showed the best performance across scenarios and is proposed as the default choice. Metanorm is versatile, supporting normalization in metabolomics studies across scales and experimental setups.
bioinformatics2026-04-29v2Identification of different intrinsic sequence patterns between HIV-1 DNA and RNA across subtypes using the k-mer-based approach
Chen, H.-C.; Wisniewski, J.; Serwin, K.; Parczewski, M.; Kula-Pacurar, A.; Skums, P.; Kirpich, A.; Yakovlev, S.Abstract
Advanced analytical tools that enable mining of the masked features hidden in intricate datasets and strengthening the biological interpretation of multigenomic outputs hold paramount importance. At present, HIV-1 subtyping remains a challenging task in a great part due to analytical tool discordance. To tackle this issue, in this study, we present an updated version of a k-mer-based approach, PORT-EK-v2, a streamlined bioinformatic pipeline, allowing for a comparison of multiple genomic datasets and identification of over-represented genomic regions, k-mers, related to specific origins of datasets. Using PORT-EK-v2, we exemplified that intrinsic sequence patterns between HIV-1 DNA and RNA are distinct across group M HIV-1 subtypes. Furthermore, we showcased that "isolate k-mer count", a predictive variable computed in this work, could serve as a default choice in classifying the HIV-1 DNA versus RNA sequences across subtypes. Lastly, results based on network-based analyses and Markov chain Monte Carlo modeling unveiled a clear discontinuation of a random walk throughout the network properties corresponding to each tested group of HIV-1 subtypes, confirming the specificity of enriched k-mer retrieved by PORT-EK-v2 and the genomic diversity across group M HIV-1 subtypes. Source code for PORT-EK-v2 is at https://github.com/Quantitative-Virology-Research-Group/PORT-EK-version-2 and is freely available.
bioinformatics2026-04-29v2Deterministic retrieval recovers biomedical associations lost by language models
Halder, A.; Singh, M.; Kesarwani, R.; Mathew, B.; Bhattacharya, N.; Chikhaliya, O.; Motwani, D.; Peela, S. C. M.; Samanta, S.; Muddemmanavar, P.; Farooq, M.; Ahuja, G.; Sengupta, D.Abstract
Large language model (LLM)-based retrieval systems miss biomedical associations through output truncation, synonym mismatch and run-to-run variability, but the magnitude of this loss remains unclear. We present BioChirp, an open-source framework that uses LLMs for query interpretation and candidate filtering, combining multi-source consensus entity resolution with deterministic graph-based retrieval. Across four major biomedical databases, BioChirp recovered more associations with higher reproducibility than conventional LLM-based retrieval approaches.
bioinformatics2026-04-29v1Structure-Guided Biochemical Design of DNA Tweezers As A Dual Target of the Primary Glioblastoma Biomarkers S100A4 and Midkine
Foo, H.; Sharma, G.Abstract
Glioblastoma multiforme (GBM) is among the most aggressive malignant brain tumors originating from glial cells and characterized by severe infiltration into surrounding brain tissue, rendering early detection difficult with current diagnostic imaging methods. S100A4 has been identified as a biomarker protein associated with glioblastoma invasiveness due to its role in cell motility and tumor metastasis. Similarly, midkine (MDK) poses an optimal biomedical target for identifying GBM invasive phenotypes because of its connection to the tumor microenvironment and infiltrative proliferation. Both proteins notably possess a positive charge that interacts electrostatically with the negatively charged phosphate backbone of DNA. It has been established that early molecular detection remains a critical unmet need. This study investigates a promising strategy for GBM diagnosis based on how S100A4 and MDK can selectively bind with DNA tweezer nanostructures. Computationally predicting eight distinct nucleotide sequences yielded three-stranded, hinge-scaffolded tweezer conformations for each candidate. The target protein and DNA structures, derived from AlphaFold, were paired together by molecular docking simulations conducted with HDOCK. Docking analyses evaluated binding affinity, structural complementarity, and conformational stability of the complexes formed. Among the evaluated candidates, DT3_8 computationally established the most biochemically robust interaction with both biomarker proteins. Selectivity is especially important because many S100 proteins share similar electrostatic profiles, yet DT3_8 indicates stronger selectivity for S100A4 and MDK over other S100 family proteins. These findings establish a biomechanical basis for the development of nanoscale DNA biosensors, which suggests the potential for detecting invasive GBM phenotypes, preceding radiographic manifestation and pending experimental validation.
bioinformatics2026-04-29v1Diagnosing protein sequence search in the era of language models
Zhou, H.; Yang, Y.; Lu, Y. Y.Abstract
Protein language model (PLM) based search is rapidly emerging as a successor to classical sequence alignment, with recent high-profile studies reporting substantial improvements in speed and remote homology detection. However, success on standard benchmarks does not guarantee that similarity derived from PLM embeddings constitutes reliable biological evidence. Here, we introduce PLM-GUARD, a diagnostic framework designed to interrogate the underlying meaning of protein search scores and assess their biological trustworthiness. PLM-GUARD comprises six sanity checks spanning biological fidelity, semantic validity, and manipulation safety. Across eight representative search methods, classical alignment-based systems demonstrate remarkable robustness, whereas current PLM-based methods fail broadly across all three dimensions. Notably, hybrid methods show intermediate results, indicating that alignment is still critical for ensuring biologically grounded correspondence. Our findings provide a timely clarification for the field and underscore the necessity of diagnostic evaluation as protein search enters the era of language models.
bioinformatics2026-04-29v1GRNPred: A Multimodal Graph Transformer with Masked Gene Expression Pretraining for Gene Regulatory Network Inference
Nguyen, T. M.; Hegde, A.; Cheng, J.Abstract
Gene regulatory network (GRN) inference is a key problem in systems biology that aims to identify transcription factor (TF)-target gene interactions from high-dimensional gene expression data, but it remains challenging due to limited labeled data, class imbalance, and complex nonlinear regulatory relationships. To address this, we propose GRNPred, a multimodal graph transformer framework that integrates gene expression, functional annotations, semantic gene descriptions, regulatory motif priors, and co-expression network topology. GRNPred uses a two-stage training strategy: first, a self-supervised pretraining phase where a graph transformer learns transcriptional context through masked gene-expression reconstruction on TF-centered subgraphs, and second, a supervised fine-tuning phase for TF-target edge prediction using known regulatory annotations. By leveraging transformer-based attention, the model captures long-range and context-dependent interactions that traditional methods struggle to model. Extensive evaluation across seven benchmark datasets and three regulatory network constructions shows that GRNPred outperforms state-of-the-art approaches, achieving up to 0.94 AUROC and 0.93 AUPRC while maintaining strong robustness across diverse biological settings.
bioinformatics2026-04-29v1Whole-Proteome ESM-2 Embeddings Recover Taxonomy and Enable Geometry-Aware Triage of Foodborne Bacterial Genomes
Gutierrez, J.; Correa Alvarez, J.Abstract
Whole-genome sequencing (WGS) has transformed foodborne pathogen surveillance, yet time-sensitive decision-making remains constrained by computationally expensive alignment-centric workflows that scale poorly to outbreak volumes and lack built-in confidence signals. Using 21,657 GenomeTrakr-derived assemblies spanning nine food safety relevant taxa, we represent each genome by mean-pooling per-protein embeddings from ESM-2 (480 dimensions). The resulting embedding space is dominated by taxonomic structure, exhibiting near-perfect neighborhood consistency for both species and a coarse species/pathotype-derived pathogenicity prior (mean homophily >0.99). Density-based clustering recovered species-coherent structure with high purity and bootstrap stability, while external agreement with the binary pathogenicity prior was only moderate, which is consistent with phylogenetic entanglement by design rather than embedding failure. As a within-genus stress test, kNN separates E. coli O157:H7 from non-pathogenic E. coli with ~98% accuracy (5-fold CV), demonstrating that known pathotype annotations are preserved in the embedding geometry even among closely related genomes. We position this mean-pooling baseline relative to contextual genome language models that retain protein order or operon-scale context, and outline how embedding geometry (homophily, purity, outliers) can serve as a principled confidence layer in bio-surveillance-oriented triage pipelines.
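The two ideas in this abstract, mean-pooling per-protein embeddings into one genome vector and scoring neighborhood consistency (homophily), can be illustrated with a minimal sketch. It is not the authors' pipeline: the toy 2-dimensional vectors stand in for 480-dimensional ESM-2 embeddings, the function names are hypothetical, and the homophily definition used here (mean fraction of the k nearest neighbours, by Euclidean distance, that share a genome's label) is one common choice that the abstract does not spell out.

```python
import math

def mean_pool(protein_embeddings):
    """Average a genome's per-protein vectors into one genome-level vector."""
    n = len(protein_embeddings)
    dim = len(protein_embeddings[0])
    return [sum(vec[d] for vec in protein_embeddings) / n for d in range(dim)]

def knn_homophily(genome_vectors, labels, k):
    """Mean fraction of each genome's k nearest neighbours sharing its label.

    Assumption: Euclidean distance and a simple unweighted average.
    """
    fractions = []
    for i, vec in enumerate(genome_vectors):
        others = [j for j in range(len(genome_vectors)) if j != i]
        others.sort(key=lambda j: math.dist(vec, genome_vectors[j]))
        neighbours = others[:k]
        same = sum(1 for j in neighbours if labels[j] == labels[i])
        fractions.append(same / len(neighbours))
    return sum(fractions) / len(fractions)

# Toy data (invented for illustration): six genomes, two proteins each.
toy_proteomes = [
    [[0.0, 0.1], [0.2, 0.3]],
    [[0.1, 0.0], [0.3, 0.2]],
    [[0.0, 0.0], [0.4, 0.2]],
    [[5.0, 5.1], [5.2, 5.3]],
    [[5.1, 5.0], [5.3, 5.2]],
    [[5.0, 5.0], [5.4, 5.2]],
]
toy_labels = ["taxon_A"] * 3 + ["taxon_B"] * 3

toy_genome_vectors = [mean_pool(p) for p in toy_proteomes]
toy_homophily = knn_homophily(toy_genome_vectors, toy_labels, k=2)
```

A per-genome version of the same fraction (before averaging) is the kind of quantity that could act as a confidence signal: a genome whose neighbours disagree with its label would be flagged for closer inspection.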
bioinformatics2026-04-29v1An Open-Source Reproducible Workflow for Pocket-Oriented Virtual Screening and ADME-Integrated Chemoinformatics: A Multi-Target Flavivirus Case Study
Teixeira, J. P.; Bajay, M. M.; Freire, C. C. d. M.; Bettin, L. B. F.; Soares, A. P.; de Lima Neto, D. F.Abstract
Zika virus (ZIKV), yellow fever virus (YFV), West Nile virus (WNV), Usutu virus (USUV), and Saint Louis encephalitis virus (SLEV) remain major public health concerns, yet broad-spectrum antiviral options are limited. Here, we present an open-source, reproducible software workflow for pocket-oriented virtual screening and ADME-integrated chemoinformatics, designed to support standardized multi-target compound prioritization. As a case study, the workflow was applied to structural and nonstructural proteins from clinically relevant flaviviruses. Automated pocket detection using Concavity reduces site-selection bias by generating docking boxes from surface concavity clusters, while standardized downstream scripts parse docking logs, convert docking-derived binding energies into Kd-related metrics, integrate SwissADME descriptors, and compute LE, LLE, FQ, and drug-likeness rules. The framework also supports retrospective validation and comparative benchmarking using literature-supported reference compounds and target-specific plausibility checks. Rather than proposing experimentally validated antiviral candidates, this study provides a reusable computational framework for hypothesis generation, benchmarking, and downstream experimental prioritization in structure-based drug discovery. The workflow is modular and adaptable to other multi-target screening campaigns where integrated ranking across binding, physicochemical, and ADME dimensions is required.
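The conversion of docking energies into Kd-related metrics and efficiency indices mentioned above rests on standard relations, sketched here under stated assumptions. This is not the workflow's own code: the function names are hypothetical, the temperature default of 298.15 K is an assumption, and pKd is used as a stand-in for the potency term in LLE (which is more commonly defined with pIC50 or pKi). FQ is left out of the sketch.

```python
import math

# Gas constant in kcal/(mol*K), matching docking energies reported in kcal/mol.
R_KCAL = 1.987204e-3

def kd_from_delta_g(delta_g_kcal, temperature_k=298.15):
    """Dissociation constant (molar) from Delta G = R * T * ln(Kd)."""
    return math.exp(delta_g_kcal / (R_KCAL * temperature_k))

def ligand_efficiency(delta_g_kcal, heavy_atoms):
    """LE: binding free energy per heavy atom (kcal/mol per atom)."""
    return -delta_g_kcal / heavy_atoms

def lipophilic_ligand_efficiency(kd_molar, logp):
    """LLE approximated as pKd - logP (assumption: pKd replaces pIC50)."""
    return -math.log10(kd_molar) - logp

# Hypothetical ligand: docking score -8.0 kcal/mol, 20 heavy atoms, logP 2.5.
example_kd = kd_from_delta_g(-8.0)
example_le = ligand_efficiency(-8.0, 20)
example_lle = lipophilic_ligand_efficiency(example_kd, 2.5)
```

Docking scores are approximations of binding free energy, so a Kd derived this way is best read as a ranking aid rather than a predicted measurement.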
bioinformatics2026-04-29v1A Bayesian approach for identifying similar transcript dynamics using curve registration
Kristianingsih, R.; Calderwood, A.; Sidhu, G.; Woodhouse, S.; Woolfenden, H. C.; Kurup, S.; Wells, R.; Morris, R. J.Abstract
Changes in gene expression over time can provide valuable insights into developmental processes and responses to the environment. Differences in expression may be indicative of potential differences in regulation. Comparing transcript dynamics may help identify correspondences between developmental stages within and between species, differences in the timing of key events during development, and transcriptional response to treatments or perturbations. A straightforward comparison between the dynamics is, however, hindered by measurements that were taken at different time points and over different timescales. To address this, we developed a statistical approach that seeks the optimal alignment between two time series as a function of a temporal shift and stretch. We validated our approach using simulated data and applied it to several transcriptome datasets, including comparisons between different plant species. Our method facilitates knowledge transfer from model systems to less studied species, the identification of modules of co-regulated genes, and the discovery of condition-specific, temporally differentially-expressed genes. The method is freely available as an R package.
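The idea of aligning two time series through a temporal shift and stretch can be sketched as follows. This is a deliberately simplified stand-in, not the paper's method or its R package: it swaps the Bayesian formulation for a plain least-squares grid search, and every function name, the linear interpolation, and the minimum-overlap rule are assumptions made for illustration.

```python
import math

def interpolate(times, values, t):
    """Linear interpolation; returns None outside the observed time range."""
    if t < times[0] or t > times[-1]:
        return None
    for k in range(1, len(times)):
        if t <= times[k]:
            weight = (t - times[k - 1]) / (times[k] - times[k - 1])
            return values[k - 1] + weight * (values[k] - values[k - 1])
    return None

def registration_error(ref_t, ref_y, qry_t, qry_y, shift, stretch, min_overlap=3):
    """Mean squared error after mapping query time t to stretch * t + shift."""
    squared = []
    for t, y in zip(qry_t, qry_y):
        ref_value = interpolate(ref_t, ref_y, stretch * t + shift)
        if ref_value is not None:
            squared.append((ref_value - y) ** 2)
    if len(squared) < min_overlap:
        return float("inf")  # too little overlap to judge the alignment
    return sum(squared) / len(squared)

def register(ref_t, ref_y, qry_t, qry_y, shifts, stretches):
    """Grid search; returns (error, shift, stretch) for the best candidate."""
    return min(
        (registration_error(ref_t, ref_y, qry_t, qry_y, s, a), s, a)
        for s in shifts
        for a in stretches
    )

# Synthetic example: the query follows the reference at half the speed, offset by 3.
ref_t = [float(t) for t in range(21)]
ref_y = [math.exp(-((t - 10.0) / 3.0) ** 2) for t in ref_t]
qry_t = [float(t) for t in range(9)]
qry_y = [math.exp(-((2.0 * t + 3.0 - 10.0) / 3.0) ** 2) for t in qry_t]

best_error, best_shift, best_stretch = register(
    ref_t, ref_y, qry_t, qry_y,
    shifts=[0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
    stretches=[1.0, 1.5, 2.0, 2.5],
)
```

A grid search of this kind gives only a point estimate and can favour alignments with little overlap, which is one reason a probabilistic treatment that also compares registered against unregistered models is attractive.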
bioinformatics2026-04-29v1GenPept-Curated-2025: A Benchmark Dataset for Antimicrobial Peptide Prediction with Homology-Controlled Partitioning
Pham, H. T.; Huynh, B.; Nguyen-Vo, T.-H.Abstract
Antimicrobial peptides (AMPs) are promising therapeutic candidates against rising antimicrobial resistance, yet progress in AMP prediction is hampered by the lack of benchmark datasets that address homology leakage, negative set reliability, and distributional diversity. Existing AMP databases, designed as biological repositories, do not enforce the controlled partitioning required for rigorous machine learning evaluation. We present GenPept-Curated-2025, a curated, class-balanced benchmark of 11,000 peptide sequences (5,500 AMP / 5,500 non-AMP) derived from Bacteria, Archaea, and Fungi, and sourced exclusively from GenPept/NCBI Protein. The dataset was constructed through a reproducible pipeline comprising taxonomic scoping, quality control, precursor handling, annotation-based labeling, and Identical Protein Groups (IPG)-based deduplication, with sequence length restricted to 10-200 aa. The AMP proportion varies substantially across length bins (14.2% in [10, 50] aa to 77.1% in [101, 150] aa), identifying length-dependent class imbalance as a distribution shift that benchmarking must account for. The dataset is openly released to support standardized, reproducible, and leakage-free evaluation of AMP prediction models.
bioinformatics2026-04-29v1Scalable machine learning improves resistance prediction and identifies novel determinants in Mycobacterium tuberculosis
Serajian, M.; Lotfollahi, M.; Green, O.; Smith, K.; Marini, S.; Prosperi, M.; Boucher, C.Abstract
Multidrug-resistant and extensively drug-resistant Mycobacterium tuberculosis (MTB) represents a growing global health crisis, characterized by limited treatment options and high mortality rates. Rapid and accurate prediction of resistance profiles is critical to guide effective therapy and curb transmission. Whole-genome sequencing (WGS) offers promise for individualized resistance profiling, yet existing computational tools remain constrained by predefined mutation catalogs and prohibitive resource requirements for large-scale analyses. Here, we present AURA, a GPU-accelerated, pangenome-scale machine learning framework for de novo resistance prediction. Trained on 12,185 globally diverse MTB isolates, AURA predicts resistance to 13 first-line, second-line, and repurposed antibiotics with high precision and identifies 59 novel resistance-associated loci, including variants in katG, pncA, rpoC, and members of the PE/PGRS gene family. By enabling model training on an unprecedented genomic scale, AURA provides new insights into the genetic architecture of resistance and establishes a scalable platform for precision-guided therapy and global surveillance of MTB.
bioinformatics2026-04-29v1Topology-driven classification of time series
Bernadotte, A.Abstract
Time series analysis is fundamentally limited by the lack of representations that reflect the underlying generative mechanisms of observed signals. Existing approaches, ranging from spectral decompositions to modern machine learning, primarily operate on signal values or frequency content, and therefore fail to capture the intrinsic structure of the dynamics that produce the data. In this work, we introduce a geometric framework that establishes a direct correspondence between the generative structure of a time series and the topology of its delay embedding. We show that broad classes of signals (including exponential, harmonic, and exponentially modulated oscillatory processes) induce invariant low-dimensional subspaces in Hankel embedding space, whose dimension is determined solely by the number and type of latent dynamical components. This leads to a unifying principle: the intrinsic dimension and geometry of delay embeddings act as invariants of the underlying dynamics. Building on this result, we reformulate time series classification as the problem of separating equivalence classes defined by ε-neighborhoods of subspaces on a Grassmann manifold. This yields a topological classifier that is interpretable, data-efficient, and provably robust, where noise admits a natural geometric interpretation as bounded perturbations of subspaces. We demonstrate that the proposed framework distinguishes signals with indistinguishable spectral signatures and consistently recovers the latent structure of complex, noisy, multi-component processes. On benchmark EEG data, the method achieves state-of-the-art performance without feature engineering or large-scale training. These results suggest a shift from feature-based and statistical representations toward a geometric theory of time series, in which structure and classification are governed by the topology of embeddings.
An interactive web-based demonstration is available to facilitate exploration of the geometric structure of delay embeddings and the proposed classification approach.
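The claim that the dimension of the Hankel embedding depends only on the number and type of latent components can be checked on noise-free toy signals. The sketch below is not the author's classifier: it only builds a Hankel (delay) matrix and estimates its numerical rank by Gaussian elimination, with hypothetical function names and an assumed relative tolerance of 1e-9. It relies on the standard fact that a real exponential contributes one dimension, while a sinusoid or an exponentially damped sinusoid contributes two.

```python
import math

def hankel(signal, rows):
    """Delay-embedding matrix with entries H[i][j] = signal[i + j]."""
    cols = len(signal) - rows + 1
    return [[signal[i + j] for j in range(cols)] for i in range(rows)]

def numerical_rank(matrix, rel_tol=1e-9):
    """Rank via Gaussian elimination with partial pivoting.

    Assumption: pivots below rel_tol * (largest entry) count as zero, which
    suits noise-free signals; noisy data would call for an SVD-based threshold.
    """
    m = [row[:] for row in matrix]
    n_rows, n_cols = len(m), len(m[0])
    tol = rel_tol * max(abs(v) for row in m for v in row)
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) <= tol:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
    return rank

steps = range(12)
two_exponentials = [0.9 ** t + 0.5 ** t for t in steps]
harmonic = [math.cos(0.7 * t) for t in steps]
damped_oscillation = [0.95 ** t * math.cos(0.7 * t) for t in steps]
exponential_plus_harmonic = [0.9 ** t + math.cos(0.7 * t) for t in steps]

ranks = {
    "two_exponentials": numerical_rank(hankel(two_exponentials, 6)),
    "harmonic": numerical_rank(hankel(harmonic, 6)),
    "damped_oscillation": numerical_rank(hankel(damped_oscillation, 6)),
    "exponential_plus_harmonic": numerical_rank(hankel(exponential_plus_harmonic, 6)),
}
```

The harmonic and the damped oscillation share the same rank but span different subspaces, which is where a comparison of subspaces rather than of dimensions alone becomes necessary.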
bioinformatics2026-04-29v1Pan-cancer virtual spatial transcriptomics from routine histology with Phoenix
Tran, M.; Gindra, R. H.; Putze, P.; Senbai, K.; Palla, G.; Kos, T.; Falcomata, C.; Wang, C.; Guo, R.; Boxberg, M.; Berclaz, L. M.; Lindner, L. H.; Bergmayr, L.; Knoesel, T.; Jurmeister, P.; Klauschen, F.; Homicsko, K.; Gottardo, R.; Eckstein, M.; Matek, C.; Mock, A.; Theis, F. J.; Saur, D.; Peng, T.Abstract
Spatial transcriptomics links gene expression to tissue architecture, providing a mechanistic view of cellular organization. Yet existing datasets cover few donors and miss the complexity of human disease. Experimental costs remain prohibitive, and large-scale profiling is impractically slow for population-level studies. Accurate computational methods are urgently needed. Predicting gene expression from standard histology, however, remains an open problem, as current approaches transfer poorly to unseen cohorts and diseases. Here, we present Phoenix, a latent flow matching generative model that infers pan-cancer spatially resolved single-cell gene expression with high accuracy. Phoenix analyzes treatment response in silico: Applied to 763 head and neck cancer patients, it identified three new spatial biomarkers that we validated across two cancers (breast cancer, n = 84; ovarian cancer, n = 157) and treatment regimens (platinum, trastuzumab). Phoenix generalizes beyond carcinomas: In a large sarcoma cohort (802 tissue microarray cores), it accurately predicted cell-type-specific signatures in held-out samples and captured chemotherapy-induced immune remodeling. Phoenix also extends across species: In a mouse model, it accurately predicted the expression of pancreatic cancer lineage markers and the mutant mKrasG12D allele in silico. Together, Phoenix establishes virtual spatial transcriptomics from routine histology as a scalable framework for studying tissue organization, therapeutic response, and disease mechanisms.
bioinformatics2026-04-29v1Advancing ab initio genome annotation with OrionGeno
Liu, L.; Cai, X.; Wang, S.; Deng, Y.; Wu, Y.; Pan, Y.; Wang, J.; Zhang, C.; Xia, H.; Tan, N.; Su, K.; Liu, Y.; Zhou, X.; Liu, L.; Wei, T.; Zhang, Y.; Li, Q.; Li, Y.; Yin, P.; Xu, X.Abstract
The rapid expansion of eukaryotic genome sequencing has created an urgent demand for scalable and accurate gene annotation, particularly for large-scale genomic initiatives such as the Earth BioGenome Project (EBP). Existing ab initio methods often struggle with complex gene architectures and exhibit limited cross-lineage generalizability. Moreover, these frameworks typically treat repetitive DNA sequences (repeats) as genomic noise to be pre-masked, leaving the joint modeling of genes and repeats largely unexplored. Here we present OrionGeno, a multispecies phylogeny-aware deep learning framework for end-to-end eukaryotic genome annotation. By integrating phylogenetic context into model learning, OrionGeno resolves complex gene structure variations across divergent lineages, jointly predicting exon-intron architectures, UTRs, and repeats directly from genomic sequences. Across Vertebrates, Invertebrates, Viridiplantae and Fungi, OrionGeno consistently outperforms state-of-the-art methods, achieving a 37.2% relative improvement in protein-level F1 score over the existing best-performing method. Beyond benchmarking, OrionGeno identifies novel loci within well-curated model genomes and generates high-confidence annotations for ~1,200 previously uncharacterized species, expanding NCBI's family-level coverage by 40.5%. As an evidence-independent approach, OrionGeno bridges the gap between genome sequencing and functional discovery, holding promise for large-scale biodiversity initiatives like the EBP.
bioinformatics2026-04-29v1Metacontam: A Negative Control-Free Decontamination Method for Metagenomic Analysis
Jo, J.; Lee, H.; Baek, J. W.; Lee, S.; Singh, V.; Shoaie, S.; Mardinoglu, A.; Choi, J.; Lee, S.Abstract
Shotgun metagenomic sequencing enables high-resolution profiling of host-associated microbial communities. However, contaminant DNA can substantially distort biological interpretations, especially in low-biomass samples. Here, we introduce Metacontam, a control-free method for species-level decontamination of shotgun metagenomic data. Metacontam integrates blacklist-guided community detection within a species correlation network with average nucleotide identity (ANI) to identify contaminants arising from shared sources. Across diverse low-biomass and mixed-biomass datasets, Metacontam outperformed existing approaches, improving the detection of low-abundance and low-prevalence contaminants while retaining biologically plausible taxa. It also reduces kit-specific biases in skin metagenomes and improves downstream analyses of tissue microbiome data. Together, these results demonstrate that Metacontam enables accurate identification of contaminant taxa across diverse metagenomic datasets, even in the absence of negative controls.
bioinformatics2026-04-29v1TissueFormer: Extending single-cell foundation models to predict population-level phenotypes
Benjamin, A. S.; Zador, A.Abstract
Single-cell RNA sequencing technologies have enabled unprecedented insights into gene expression and opened new pathways for diagnostics and tissue annotation. At present, most computational approaches for interpreting single-cell data predict labels or properties based on isolated single-cell transcriptomic profiles. This approach overlooks the cellular composition within a sample, which is often critical for inferring tissue identity or other sample-level phenotypes. To address this limitation, we introduce TissueFormer, a Transformer-based neural network that infers population-level labels from groups of single-cell RNA profiles while retaining single-cell resolution. We applied TissueFormer to two tasks: predicting COVID-19 severity from single-cell RNA sequencing of blood samples, and predicting cortical area identity from spatial transcriptomic data in mouse brains. TissueFormer outperformed single-cell foundation models and machine learning methods applied to pseudobulk and cell type composition. TissueFormer's higher performance promises more accurate diagnostics and enables the automated construction of high-resolution brain region maps in individual mice directly from spatial transcriptomic data. Applied to mice with developmental perturbations to visual input, these maps revealed a significant reduction in predicted visual cortex area, illustrating how individual differences in neuroanatomy can be quantified. More broadly, TissueFormer provides a framework for predicting any population-level phenotype that is influenced by cellular diversity and tissue-level organization.
bioinformatics2026-04-28v2Accurate ab initio gene prediction in eukaryotes with Tiberius in multiple clades
Gabriel, L.; Bruna, T.; Kaur, A.; Krishnan, A.; Ortmann, F.; Salamov, A.; Talbot, S.; Becker, F.; Krieg, R.; Wheat, C. W.; Grigoriev, I. V.; Stanke, M.; Hoff, K. J.Abstract
Eukaryotic genome annotation is currently bottlenecked by limitations in the generality, scalability and accuracy of computational methods. Deep learning approaches have recently achieved large improvements in ab initio gene prediction accuracy. We extend the deep learning-based ab initio gene predictor Tiberius beyond mammals by training lineage-specific models for Mesangiospermae, Fungi, Vertebrata, Insecta, Chlorophyta and Bacillariophyta. Across a benchmark of 33 species, Tiberius consistently achieves higher accuracy than the other evaluated ab initio methods, Helixer and ANNEVO, while also having the fastest runtimes overall. Compared with BRAKER3, which incorporates RNA-Seq and protein evidence, Tiberius approaches state-of-the-art accuracy in Mesangiospermae, Fungi, Bacillariophyta and Chlorophyta, while being on average 80 times faster when using a GPU. Availability and implementation: https://github.com/Gaius-Augustus/Tiberius
bioinformatics2026-04-28v1