Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
CAPHEINE, or everything and the kitchen sink: a workflow for automating selection analyses using HyPhy
Verdonk, H. E.; Callan, D.; Kosakovsky Pond, S. L.
Here we present CAPHEINE, a computational workflow that starts with a set of unaligned pathogen sequences and a reference genome and performs a comprehensive exploratory evolutionary analysis of the input data. CAPHEINE pairs nicely with studies of site-level selection dynamics, gene-level positive selection, and lineage-specific shifts in selective pressure. Our workflow is portable across Mac OS, Windows, and Linux, allowing researchers to focus on results. CAPHEINE is freely available at https://github.com/veg/CAPHEINE, along with a set of usage instructions.
bioinformatics 2026-04-29 v3
Multi-Modal Deep Learning Integrates Spatial Topologies and Sequential Motifs to Identify Class I HDAC Inhibitors as Pan-Cancer Therapeutics
Tong, S.; Zhang, W.; Ji, S.
The molecular characterization of human solid tumors has revealed immense genomic complexity and intra-tumoral heterogeneity. Converting these detailed multi-omic profiles into actionable, broad-spectrum therapeutics remains a formidable bottleneck in precision oncology. Traditional computational drug repurposing strategies largely rely on single-modality chemical descriptors, which frequently fail to capture the systemic transcriptomic interactions within the highly dynamic tumor microenvironment. Here, this study presents a robust multi-modal deep learning framework that synergistically integrates two-dimensional (2D) molecular graphs via Graph Neural Networks (GNNs) and chemical functional group patterns via self-attention Transformers. By mapping this dual-stream chemical feature space to the perturbational transcriptomic signatures (LINCS L1000) of 22 distinct cancer types from The Cancer Genome Atlas (TCGA), a vast library of over 28,000 small-molecule compounds was computationally screened. The developed multi-modal architecture achieved state-of-the-art predictive accuracy, significantly outperforming traditional single-modality baseline models. Strikingly, our comprehensive pan-cancer transcriptomic reversal landscape identified a persistent convergence of non-oncology drugs exhibiting potent broad-spectrum anti-tumor potential. Specifically, Class I Histone Deacetylase (HDAC) inhibitors (most notably TC-H-106, RG2833, and Tianeptinaline, agents originally developed to penetrate the blood-brain barrier for neurodegenerative and psychiatric disorders) emerged as top therapeutic candidates across lung adenocarcinoma (LUAD), bladder urothelial carcinoma (BLCA), and rectum adenocarcinoma (READ).
Subsequent high-dimensional network pharmacology and functional enrichment analyses confirmed that these agents robustly suppress essential oncogenic pathways, specifically collapsing the G1/S phase transition and DNA damage repair machineries. Furthermore, structural validation via molecular docking and force-field thermodynamics confirmed the highly stable physical binding affinity (Vina score: -7.0 kcal/mol, MMFF94 Energy: 64.76 kcal/mol) of TC-H-106 to the HDAC1 catalytic pocket. Kaplan-Meier survival analysis based on TCGA gene expression stratification underscored the significant prognostic benefit of targeting this epigenetic axis. Collectively, these findings introduce a powerful multi-modal AI framework for systems-level drug repurposing and highlight brain-penetrant Class I HDAC inhibitors as highly promising candidates for pan-cancer epigenetic therapy.
bioinformatics 2026-04-29 v2
Fast and haplotype-aware assembly of high-fidelity reads based on MSR sketching: the Alice assembler
Faure, R.; Hilaire, B.; Flot, J.-F.; Lavenier, D.
Background: Long-read metagenomic assembly is becoming a critical bottleneck in microbiome analysis, as deep sequencing generates massive datasets that existing methods struggle to assemble while maintaining strain resolution. Results: We present Alice, a lightweight long-read assembler that achieves orders-of-magnitude speedups through a new sequence sketching technique, MSR sketching, compatible with classical assembly methods. Alice assembles a 235 Gbp soil metagenome in 5 hours using only 84 GB RAM - a task that causes most competing methods to exhaust our computational resources (500 GB RAM and 7 days runtime). Across diverse benchmarks, Alice delivered strain-resolved assemblies an order of magnitude faster than state-of-the-art approaches, while producing the most complete assemblies in some cases. Conclusions: MSR sketching overcomes computational barriers in metagenomic assembly, enabling fast, memory-efficient strain-resolved analysis of massive datasets. While Alice's assemblies were more fragmented than with other assemblers, this approach establishes a promising paradigm for scalable metagenomic analysis.
bioinformatics 2026-04-29 v2
Learning the All-Atom Equilibrium Distribution of Biomolecular Interactions at Scale
Wang, Y.; Xu, Y.; Li, W.; Yu, H.; Tan, W.; Li, S.; Huang, Q.; Chen, N.; Wu, X.; Wu, Q.; Liu, K.
Biomolecular functions are governed by dynamic conformational ensembles rather than static structures. While models like AlphaFold have revolutionized static structure prediction, accurately capturing the equilibrium distribution of all-atom biomolecular interactions remains a significant challenge due to the high computational cost of molecular dynamics (MD). We present AnewSampling, a transferable generative foundation framework designed for the high-fidelity sampling of all-atom equilibrium distributions, which is the first model to faithfully reproduce MD at the all-atom level. It uses a quotient-space generative framework to ensure mathematical consistency and leverages the largest self-curated database of protein-ligand trajectories to date, with over 15 million conformations. Statistically, AnewSampling consistently outperforms all prior generative methods on the ATLAS monomer benchmark, and the all-atom capabilities of AnewSampling enable close statistical alignment with ground-truth MD for evaluating atomic biomolecular interactions in protein-ligand dynamics. Furthermore, AnewSampling successfully recovers coupled ligand and side-chain motions in CDK2 systems, overcoming a major sampling hurdle inherent to conventional MD. AnewSampling enables rapid exploration of conformational landscapes prior to intensive simulations, elucidating fundamental biophysical mechanisms and accelerating the broader design of functional biomolecules.
bioinformatics 2026-04-29 v2
Using AI to Build AI: AIDO.Builder Enables Autonomous Machine Learning Model Building for Biomedicine
Guo, H.; Liang, Y.; Cheng, X.; Ellington, C.; Xie, P.; Song, L.; Xing, E.
Machine learning accelerates biomedical discovery, but creating effective predictive models requires specialized human expertise and demanding manual effort. Researchers must iteratively design pipelines, select architectures, and debug code. This challenge is particularly severe in biomedicine because of the heterogeneous datasets, sparse annotations, and complex evaluation protocols that are common in the domain. We present AIDO.Builder, an agentic artificial intelligence system that fully automates the entire life-cycle of biomedical model development. Provided only with a natural language task description and a target metric, AIDO.Builder autonomously constructs executable training and evaluation pipelines. The system selects suitable modeling strategies, executes experiments, and uses an automated feedback loop to iteratively revise its own code, configurations, and training procedures. It flexibly adapts to new tasks by training specialized models de novo or by using pretrained foundation models to build predictive models through task-appropriate adaptation. We show that across diverse biomedical benchmarks, AIDO.Builder produces highly competitive solutions against human alternatives, while eliminating the manual iteration previously required for robust model development. By automating the translation of raw data into reliable AI models, AIDO.Builder demonstrates how AI itself can be used to accelerate AI for biomedical research.
bioinformatics 2026-04-29 v2
Robust metabolomics data normalization across scales and experimental designs
Vynck, M.; Vangeenderhuysen, P.; De Paepe, E.; Nawrot, T.; Plekhova, V.; Vanhaecke, L.
Metabolomics studies employing liquid chromatography-mass spectrometry are affected by signal drift and batch effects, introducing technical variance that impedes biological knowledge discovery. Quality control (QC) sample-based normalization strategies are widely implemented but remain vulnerable to outliers, thereby reducing normalization performance. We introduce rLOESS, rGAM and tGAM, three robust normalization methods that improve resistance to outliers by downweighting or accommodating them. Leveraging additive models, the rGAM and tGAM methods allow flexible non-linear modeling, differential sample weighting, and data-driven QC representativeness evaluation. Implementations of these methods are gathered in the Metanorm R package, integrating robust normalization with visualization for performance verification, while supporting efficient parallel processing. In in silico and/or experimental datasets, the robust methods, relative to several popular existing strategies, improved replicate concordance and reduced drift and batch effects. The robust methods, with improved recovery of the underlying signal demonstrated in simulation, produced distinct differential abundance results, highlighting the impact of normalization on downstream statistical inference. Overall, tGAM-based normalization showed the best performance across scenarios and is proposed as the default choice. Metanorm is versatile, supporting normalization in metabolomics studies across scales and experimental setups.
bioinformatics 2026-04-29 v2
Identification of different intrinsic sequence patterns between HIV-1 DNA and RNA across subtypes using the k-mer-based approach
Chen, H.-C.; Wisniewski, J.; Serwin, K.; Parczewski, M.; Kula-Pacurar, A.; Skums, P.; Kirpich, A.; Yakovlev, S.
Advanced analytical tools that enable mining of the masked features hidden in intricate datasets and strengthening the biological interpretation of multigenomic outputs are of paramount importance. At present, HIV-1 subtyping remains a challenging task, in large part because of analytical tool discordance. To tackle this issue, in this study, we present an updated version of a k-mer-based approach, PORT-EK-v2, a streamlined bioinformatic pipeline, allowing for a comparison of multiple genomic datasets and identification of over-represented genomic regions, k-mers, related to specific origins of datasets. Using PORT-EK-v2, we showed that intrinsic sequence patterns between HIV-1 DNA and RNA are distinct across group M HIV-1 subtypes. Furthermore, we showed that "isolate k-mer count", a predictive variable computed in this work, could serve as a default choice in classifying the HIV-1 DNA versus RNA sequences across subtypes. Lastly, results based on network-based analyses and Markov chain Monte Carlo modeling unveiled a clear discontinuation of a random walk throughout the network properties corresponding to each tested group of HIV-1 subtypes, confirming the specificity of enriched k-mers retrieved by PORT-EK-v2 and the genomic diversity across group M HIV-1 subtypes. Source code for PORT-EK-v2 is at https://github.com/Quantitative-Virology-Research-Group/PORT-EK-version-2 and is freely available.
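The idea of k-mers that are over-represented in one dataset relative to another can be illustrated with a minimal sketch. This is not the PORT-EK-v2 algorithm, whose statistics the abstract does not describe; the function names, the presence-fraction score, and the `min_diff` threshold are all assumptions made for illustration.

```python
from collections import Counter

def kmer_presence(seqs, k):
    """Fraction of sequences in which each k-mer occurs at least once."""
    counts = Counter()
    for s in seqs:
        # A set ensures each k-mer is counted once per sequence
        counts.update({s[i:i + k] for i in range(len(s) - k + 1)})
    return {kmer: c / len(seqs) for kmer, c in counts.items()}

def enriched_kmers(group_a, group_b, k, min_diff=0.5):
    """k-mers whose presence fraction differs between groups by >= min_diff.

    Positive values mean enrichment in group_a, negative in group_b.
    The score and threshold are illustrative assumptions.
    """
    fa, fb = kmer_presence(group_a, k), kmer_presence(group_b, k)
    diffs = {km: fa.get(km, 0.0) - fb.get(km, 0.0) for km in set(fa) | set(fb)}
    return {km: d for km, d in diffs.items() if abs(d) >= min_diff}

# Toy sequences standing in for two datasets (e.g. DNA- vs RNA-derived)
group_a = ["ACGTAC", "ACGTTT"]
group_b = ["TTTTAC", "GGGTAC"]
result = enriched_kmers(group_a, group_b, k=3, min_diff=1.0)
```

With these toy inputs, only the 3-mers present in every group A sequence and in no group B sequence pass the strict threshold.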
bioinformatics 2026-04-29 v2
Deterministic retrieval recovers biomedical associations lost by language models
Halder, A.; Singh, M.; Kesarwani, R.; Mathew, B.; Bhattacharya, N.; Chikhaliya, O.; Motwani, D.; Peela, S. C. M.; Samanta, S.; Muddemmanavar, P.; Farooq, M.; Ahuja, G.; Sengupta, D.
Large language model (LLM)-based retrieval systems miss biomedical associations through output truncation, synonym mismatch and run-to-run variability, but the magnitude of this loss remains unclear. We present BioChirp, an open-source framework that uses LLMs for query interpretation and candidate filtering, combining multi-source consensus entity resolution with deterministic graph-based retrieval. Across four major biomedical databases, BioChirp recovered more associations with higher reproducibility than conventional LLM-based retrieval approaches.
bioinformatics 2026-04-29 v1
Structure-Guided Biochemical Design of DNA Tweezers As A Dual Target of the Primary Glioblastoma Biomarkers S100A4 and Midkine
Foo, H.; Sharma, G.
Glioblastoma multiforme (GBM) is among the most aggressive malignant brain tumors originating from glial cells and characterized by severe infiltration into surrounding brain tissue, rendering early detection difficult with current diagnostic imaging methods. S100A4 has been identified as a biomarker protein associated with glioblastoma invasiveness due to its role in cell motility and tumor metastasis. Similarly, midkine (MDK) poses an optimal biomedical target for identifying GBM invasive phenotypes because of its connection to the tumor microenvironment and infiltrative proliferation. Both proteins notably possess a positive charge that interacts electrostatically with the negatively charged phosphate backbone of DNA. It has been established that early molecular detection remains a critical unmet need. This study investigates a promising strategy for GBM diagnosis based on how S100A4 and MDK can selectively bind with DNA tweezer nanostructures. Computationally predicting eight distinct nucleotide sequences yielded three-stranded, hinge-scaffolded tweezer conformations for each candidate. The target protein and DNA structures, derived from AlphaFold, were paired together by molecular docking simulations conducted with HDOCK. Docking analyses evaluated binding affinity, structural complementarity, and conformational stability of the complexes formed. Among the evaluated candidates, DT3_8 computationally established the most biochemically robust interaction with both biomarker proteins. Selectivity is especially important because many S100 proteins share similar electrostatic profiles, yet DT3_8 indicates stronger selectivity for S100A4 and MDK over other S100 family proteins. These findings establish a biomechanical basis for the development of nanoscale DNA biosensors and suggest that, pending experimental validation, invasive GBM phenotypes could be detected before radiographic manifestation.
bioinformatics 2026-04-29 v1
Diagnosing protein sequence search in the era of language models
Zhou, H.; Yang, Y.; Lu, Y. Y.
Protein language model (PLM) based search is rapidly emerging as a successor to classical sequence alignment, with recent high-profile studies reporting substantial improvements in speed and remote homology detection. However, success on standard benchmarks does not guarantee that similarity derived from PLM embeddings constitutes reliable biological evidence. Here, we introduce PLM-GUARD, a diagnostic framework designed to interrogate the underlying meaning of protein search scores and assess their biological trustworthiness. PLM-GUARD comprises six sanity checks spanning biological fidelity, semantic validity, and manipulation safety. Across eight representative search methods, classical alignment-based systems demonstrate remarkable robustness, whereas current PLM-based methods fail broadly across all three dimensions. Notably, hybrid methods show intermediate results, indicating that alignment is still critical for ensuring biologically grounded correspondence. Our findings provide a timely clarification for the field and underscore the necessity of diagnostic evaluation as protein search enters the era of language models.
bioinformatics 2026-04-29 v1
GRNPred: A Multimodal Graph Transformer with Masked Gene Expression Pretraining for Gene Regulatory Network Inference
Nguyen, T. M.; Hegde, A.; Cheng, J.
Gene regulatory network (GRN) inference is a key problem in systems biology that aims to identify transcription factor (TF)-target gene interactions from high-dimensional gene expression data, but it remains challenging due to limited labeled data, class imbalance, and complex nonlinear regulatory relationships. To address this, we propose GRNPred, a multimodal graph transformer framework that integrates gene expression, functional annotations, semantic gene descriptions, regulatory motif priors, and co-expression network topology. GRNPred uses a two-stage training strategy: first, a self-supervised pretraining phase where a graph transformer learns transcriptional context through masked gene-expression reconstruction on TF-centered subgraphs, and second, a supervised fine-tuning phase for TF-target edge prediction using known regulatory annotations. By leveraging transformer-based attention, the model captures long-range and context-dependent interactions that traditional methods struggle to model. Extensive evaluation across seven benchmark datasets and three regulatory network constructions shows that GRNPred outperforms state-of-the-art approaches, achieving up to 0.94 AUROC and 0.93 AUPRC while maintaining strong robustness across diverse biological settings.
bioinformatics 2026-04-29 v1
Whole-Proteome ESM-2 Embeddings Recover Taxonomy and Enable Geometry-Aware Triage of Foodborne Bacterial Genomes
Gutierrez, J.; Correa Alvarez, J.
Whole-genome sequencing (WGS) has transformed foodborne pathogen surveillance, yet time-sensitive decision-making remains constrained by computationally expensive alignment-centric workflows that scale poorly to outbreak volumes and lack built-in confidence signals. Using 21,657 GenomeTrakr-derived assemblies spanning nine food safety relevant taxa, we represent each genome by mean-pooling per-protein embeddings from ESM-2 (480 dimensions). The resulting embedding space is dominated by taxonomic structure, exhibiting near-perfect neighborhood consistency for both species and a coarse species/pathotype-derived pathogenicity prior (mean homophily >0.99). Density-based clustering recovered species-coherent structure with high purity and bootstrap stability, while external agreement with the binary pathogenicity prior was only moderate, which is consistent with phylogenetic entanglement by design rather than embedding failure. As a within-genus stress test, kNN separates E. coli O157:H7 from non-pathogenic E. coli with ~98% accuracy (5-fold CV), demonstrating that known pathotype annotations are preserved in the embedding geometry even among closely related genomes. We position this mean-pooling baseline relative to contextual genome language models that retain protein order or operon-scale context, and outline how embedding geometry (homophily, purity, outliers) can serve as a principled confidence layer in bio-surveillance-oriented triage pipelines.
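The two core operations named in this abstract, mean-pooling per-protein embeddings into a genome vector and measuring neighborhood homophily, can be sketched as follows. The data here are synthetic stand-ins (random vectors with 8 dimensions rather than the 480 ESM-2 dimensions), and the function names, Euclidean metric, and choice of `k` are assumptions rather than the authors' implementation.

```python
import numpy as np

def genome_embedding(protein_embeddings):
    """Mean-pool per-protein vectors (n_proteins x dim) into one genome vector."""
    return np.asarray(protein_embeddings, dtype=float).mean(axis=0)

def knn_homophily(X, labels, k=5):
    """Mean fraction of each point's k nearest neighbours that share its label."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise squared Euclidean distances (fine for small toy data)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbour
    nn = np.argsort(d2, axis=1)[:, :k]
    return float((labels[nn] == labels[:, None]).mean())

rng = np.random.default_rng(0)
genomes, labels = [], []
# Two synthetic "species", 20 genomes each, 30 proteins per genome
for species, centre in enumerate([0.0, 5.0]):
    for _ in range(20):
        proteins = rng.normal(loc=centre, scale=1.0, size=(30, 8))
        genomes.append(genome_embedding(proteins))
        labels.append(species)

X = np.vstack(genomes)
homophily = knn_homophily(X, labels, k=5)
```

Because the two synthetic groups are well separated, the homophily is close to 1. On real data the value depends on how strongly taxonomy dominates the embedding space, which is the quantity the abstract reports.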
bioinformatics 2026-04-29 v1
An Open-Source Reproducible Workflow for Pocket-Oriented Virtual Screening and ADME-Integrated Chemoinformatics: A Multi-Target Flavivirus Case Study
Teixeira, J. P.; Bajay, M. M.; Freire, C. C. d. M.; Bettin, L. B. F.; Soares, A. P.; de Lima Neto, D. F.
Zika virus (ZIKV), yellow fever virus (YFV), West Nile virus (WNV), Usutu virus (USUV), and Saint Louis encephalitis virus (SLEV) remain major public health concerns, yet broad-spectrum antiviral options are limited. Here, we present an open-source, reproducible software workflow for pocket-oriented virtual screening and ADME-integrated chemoinformatics, designed to support standardized multi-target compound prioritization. As a case study, the workflow was applied to structural and nonstructural proteins from clinically relevant flaviviruses. Automated pocket detection using Concavity reduces site-selection bias by generating docking boxes from surface concavity clusters, while standardized downstream scripts parse docking logs, convert docking-derived binding energies into Kd-related metrics, integrate SwissADME descriptors, and compute LE, LLE, FQ, and drug-likeness rules. The framework also supports retrospective validation and comparative benchmarking using literature-supported reference compounds and target-specific plausibility checks. Rather than proposing experimentally validated antiviral candidates, this study provides a reusable computational framework for hypothesis generation, benchmarking, and downstream experimental prioritization in structure-based drug discovery. The workflow is modular and adaptable to other multi-target screening campaigns where integrated ranking across binding, physicochemical, and ADME dimensions is required.
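The abstract does not state the formulas its scripts use, so the following is only a sketch based on the standard textbook definitions: the thermodynamic relation ΔG = RT ln Kd, ligand efficiency LE = -ΔG / (number of heavy atoms), and lipophilic ligand efficiency LLE = pKd - logP. The ligand values (25 heavy atoms, logP of 2.0) are hypothetical, and FQ is omitted because its scaling function is not given in the text.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)

def kd_from_dg(dg_kcal, temp_k=298.15):
    """Dissociation constant (mol/L) from a binding free energy in kcal/mol."""
    return math.exp(dg_kcal / (R_KCAL * temp_k))

def ligand_efficiency(dg_kcal, heavy_atoms):
    """LE in kcal/mol per heavy atom."""
    return -dg_kcal / heavy_atoms

def lipophilic_ligand_efficiency(dg_kcal, logp, temp_k=298.15):
    """LLE = pKd - logP, with pKd derived from the docking energy."""
    pkd = -math.log10(kd_from_dg(dg_kcal, temp_k))
    return pkd - logp

# Hypothetical ligand: docking score of -7.0 kcal/mol, 25 heavy atoms, logP 2.0
dg = -7.0
kd = kd_from_dg(dg)
le = ligand_efficiency(dg, heavy_atoms=25)
lle = lipophilic_ligand_efficiency(dg, logp=2.0)
```

Docking scores are approximations of binding free energy, so a Kd derived this way is best treated as a ranking metric rather than a predicted measurement.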
bioinformatics 2026-04-29 v1
A Bayesian approach for identifying similar transcript dynamics using curve registration
Kristianingsih, R.; Calderwood, A.; Sidhu, G.; Woodhouse, S.; Woolfenden, H. C.; Kurup, S.; Wells, R.; Morris, R. J.
Changes in gene expression over time can provide valuable insights into developmental processes and responses to the environment. Differences in expression may be indicative of potential differences in regulation. Comparing transcript dynamics may help identify correspondences between developmental stages within and between species, differences in the timing of key events during development, and transcriptional response to treatments or perturbations. A straightforward comparison between the dynamics is, however, hindered by measurements that were taken at different time points and over different timescales. To address this, we developed a statistical approach that seeks the optimal alignment between two time series as a function of a temporal shift and stretch. We validated our approach using simulated data and applied it to several transcriptome datasets, including comparisons between different plant species. Our development facilitates knowledge transfer from model systems to less studied species, the identification of modules of co-regulated genes, and the discovery of condition-specific, temporally differentially-expressed genes. The method is freely available as an R package.
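Aligning two time series through a temporal shift and stretch can be illustrated with a simple sketch. The paper's method is Bayesian; the version below swaps in a plain least-squares grid search to show the registration idea only, and the function name, grids, and toy curves are all assumptions.

```python
import numpy as np

def register(t_ref, y_ref, t_query, y_query, shifts, stretches):
    """Grid-search the (shift, stretch) mapping query times onto the reference.

    Query time t is mapped to stretch * t + shift; the score is the mean
    squared error against the linearly interpolated reference curve.
    """
    best = (None, None, np.inf)
    for stretch in stretches:
        for shift in shifts:
            t_mapped = stretch * t_query + shift
            # Compare only where the mapped query overlaps the reference range
            mask = (t_mapped >= t_ref.min()) & (t_mapped <= t_ref.max())
            if mask.sum() < 3:
                continue
            y_interp = np.interp(t_mapped[mask], t_ref, y_ref)
            mse = float(np.mean((y_interp - y_query[mask]) ** 2))
            if mse < best[2]:
                best = (float(shift), float(stretch), mse)
    return best

def expression(t):
    """Toy expression profile: a single peak at t = 10."""
    return np.exp(-((t - 10.0) ** 2) / 8.0)

t_ref = np.linspace(0.0, 20.0, 201)
y_ref = expression(t_ref)

# The query shows the same dynamics, sampled on a compressed, offset clock
t_query = np.linspace(0.0, 8.0, 9)
y_query = expression(2.0 * t_query + 3.0)

shift, stretch, mse = register(
    t_ref, y_ref, t_query, y_query,
    shifts=np.arange(0.0, 6.5, 0.5),
    stretches=np.arange(1.0, 3.25, 0.25),
)
```

A Bayesian treatment would additionally compare the registered model against a non-registered one to decide whether the two profiles are genuinely similar, which this sketch does not attempt.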
bioinformatics 2026-04-29 v1
GenPept-Curated-2025: A Benchmark Dataset for Antimicrobial Peptide Prediction with Homology-Controlled Partitioning
Pham, H. T.; Huynh, B.; Nguyen-Vo, T.-H.
Antimicrobial peptides (AMPs) are promising therapeutic candidates against rising antimicrobial resistance, yet progress in AMP prediction is hampered by the lack of benchmark datasets that address homology leakage, negative set reliability, and distributional diversity. Existing AMP databases, designed as biological repositories, do not enforce the controlled partitioning required for rigorous machine learning evaluation. We present GenPept-Curated-2025, a curated, class-balanced benchmark of 11,000 peptide sequences (5,500 AMP / 5,500 non-AMP) derived from Bacteria, Archaea, and Fungi, and sourced exclusively from GenPept/NCBI Protein. The dataset was constructed through a reproducible pipeline comprising taxonomic scoping, quality control, precursor handling, annotation-based labeling, and Identical Protein Groups (IPG)-based deduplication, with sequence length restricted to 10-200 aa. The AMP proportion varies substantially across length bins (14.2% in [10, 50] aa to 77.1% in [101, 150] aa), identifying length-dependent class imbalance as a distribution shift that benchmarking must account for. The dataset is openly released to support standardized, reproducible, and leakage-free evaluation of AMP prediction models.
bioinformatics 2026-04-29 v1
Scalable machine learning improves resistance prediction and identifies novel determinants in Mycobacterium tuberculosis
Serajian, M.; Lotfollahi, M.; Green, O.; Smith, K.; Marini, S.; Prosperi, M.; Boucher, C.
Multidrug-resistant and extensively drug-resistant Mycobacterium tuberculosis (MTB) represents a growing global health crisis, characterized by limited treatment options and high mortality rates. Rapid and accurate prediction of resistance profiles is critical to guide effective therapy and curb transmission. Whole-genome sequencing (WGS) offers promise for individualized resistance profiling, yet existing computational tools remain constrained by predefined mutation catalogs and prohibitive resource requirements for large-scale analyses. Here, we present AURA, a GPU-accelerated, pangenome-scale machine learning framework for de novo resistance prediction. Trained on 12,185 globally diverse MTB isolates, AURA predicts resistance to 13 first-line, second-line, and repurposed antibiotics with high precision and identifies 59 novel resistance-associated loci, including variants in katG, pncA, rpoC, and members of the PE/PGRS gene family. By enabling model training on an unprecedented genomic scale, AURA provides new insights into the genetic architecture of resistance and establishes a scalable platform for precision-guided therapy and global surveillance of MTB.
bioinformatics 2026-04-29 v1
Topology-driven classification of time series
Bernadotte, A.
Time series analysis is fundamentally limited by the lack of representations that reflect the underlying generative mechanisms of observed signals. Existing approaches, ranging from spectral decompositions to modern machine learning, primarily operate on signal values or frequency content, and therefore fail to capture the intrinsic structure of the dynamics that produce the data. In this work, we introduce a geometric framework that establishes a direct correspondence between the generative structure of a time series and the topology of its delay embedding. We show that broad classes of signals (including exponential, harmonic, and exponentially modulated oscillatory processes) induce invariant low-dimensional subspaces in Hankel embedding space, whose dimension is determined solely by the number and type of latent dynamical components. This leads to a unifying principle: the intrinsic dimension and geometry of delay embeddings act as invariants of the underlying dynamics. Building on this result, we reformulate time series classification as the problem of separating equivalence classes defined by ε-neighborhoods of subspaces on a Grassmann manifold. This yields a topological classifier that is interpretable, data-efficient, and provably robust, where noise admits a natural geometric interpretation as bounded perturbations of subspaces. We demonstrate that the proposed framework distinguishes signals with indistinguishable spectral signatures and consistently recovers the latent structure of complex, noisy, multi-component processes. On benchmark EEG data, the method achieves state-of-the-art performance without feature engineering or large-scale training. These results suggest a shift from feature-based and statistical representations toward a geometric theory of time series, in which structure and classification are governed by the topology of embeddings.
An interactive web-based demonstration is available to facilitate exploration of the geometric structure of delay embeddings and the proposed classification approach.
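The claim that the dimension of the Hankel embedding subspace depends only on the number and type of latent components can be checked numerically on noise-free toy signals. The sketch below is not the paper's classifier: it only builds a Hankel (delay) matrix and counts its significant singular values, and the window length, tolerance, and test signals are assumptions.

```python
import numpy as np

def hankel_embed(x, window):
    """Delay embedding: each row is a sliding window of length `window`."""
    x = np.asarray(x, dtype=float)
    n_rows = len(x) - window + 1
    return np.stack([x[i:i + window] for i in range(n_rows)])

def numerical_rank(H, tol=1e-8):
    """Number of singular values above tol times the largest one."""
    s = np.linalg.svd(H, compute_uv=False)
    return int((s > tol * s[0]).sum())

t = np.arange(200) * 0.05
signals = {
    # One real exponential spans a 1-dimensional subspace
    "exponential": np.exp(-0.3 * t),
    # One sinusoid is a pair of complex exponentials: dimension 2
    "harmonic": np.sin(2.0 * t),
    # An exponentially modulated oscillation is also dimension 2
    "damped oscillation": np.exp(-0.2 * t) * np.cos(3.0 * t),
    # Two sinusoids at different frequencies: dimension 4
    "two harmonics": np.sin(2.0 * t) + 0.5 * np.sin(5.0 * t),
}
ranks = {name: numerical_rank(hankel_embed(x, window=40))
         for name, x in signals.items()}
```

With noisy data the trailing singular values no longer vanish, so a fixed relative tolerance would need to be replaced by a noise-aware threshold.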
bioinformatics 2026-04-29 v1
Pan-cancer virtual spatial transcriptomics from routine histology with Phoenix
Tran, M.; Gindra, R. H.; Putze, P.; Senbai, K.; Palla, G.; Kos, T.; Falcomata, C.; Wang, C.; Guo, R.; Boxberg, M.; Berclaz, L. M.; Lindner, L. H.; Bergmayr, L.; Knoesel, T.; Jurmeister, P.; Klauschen, F.; Homicsko, K.; Gottardo, R.; Eckstein, M.; Matek, C.; Mock, A.; Theis, F. J.; Saur, D.; Peng, T.
Spatial transcriptomics links gene expression to tissue architecture, providing a mechanistic view of cellular organization. Yet existing datasets cover few donors and miss the complexity of human disease. Experimental costs remain prohibitive, and large-scale profiling is impractically slow for population-level studies. Accurate computational methods are urgently needed. Predicting gene expression from standard histology, however, remains an open problem, as current approaches transfer poorly to unseen cohorts and diseases. Here, we present Phoenix, a latent flow matching generative model that infers pan-cancer spatially resolved single-cell gene expression with high accuracy. Phoenix analyzes treatment response in silico: Applied to 763 head and neck cancer patients, it identified three new spatial biomarkers that we validated across two cancers (breast cancer, n = 84; ovarian cancer, n = 157) and treatment regimens (platinum, trastuzumab). Phoenix generalizes beyond carcinomas: In a large sarcoma cohort (802 tissue microarray cores), it accurately predicted cell-type-specific signatures in held-out samples and captured chemotherapy-induced immune remodeling. Phoenix also extends across species: In a mouse model, it accurately predicted the expression of pancreatic cancer lineage markers and the mutant mKrasG12D allele in silico. Together, Phoenix establishes virtual spatial transcriptomics from routine histology as a scalable framework for studying tissue organization, therapeutic response, and disease mechanisms.
bioinformatics 2026-04-29 v1
Advancing ab initio genome annotation with OrionGeno
Liu, L.; Cai, X.; Wang, S.; Deng, Y.; Wu, Y.; Pan, Y.; Wang, J.; Zhang, C.; Xia, H.; Tan, N.; Su, K.; Liu, Y.; Zhou, X.; Liu, L.; Wei, T.; Zhang, Y.; Li, Q.; Li, Y.; Yin, P.; Xu, X.
The rapid expansion of eukaryotic genome sequencing has created an urgent demand for scalable and accurate gene annotation, particularly for large-scale genomic initiatives such as the Earth BioGenome Project (EBP). Existing ab initio methods often struggle with complex gene architectures and exhibit limited cross-lineage generalizability. Moreover, these frameworks typically treat repetitive DNA sequences (repeats) as genomic noise to be pre-masked, leaving the joint modeling of genes and repeats largely unexplored. Here we present OrionGeno, a multispecies phylogeny-aware deep learning framework for end-to-end eukaryotic genome annotation. By integrating phylogenetic context into model learning, OrionGeno resolves complex gene structure variations across divergent lineages, jointly predicting exon-intron architectures, UTRs, and repeats directly from genomic sequences. Across Vertebrates, Invertebrates, Viridiplantae and Fungi, OrionGeno consistently outperforms state-of-the-art methods, achieving a 37.2% relative improvement in protein-level F1 score over the existing best-performing method. Beyond benchmarking, OrionGeno identifies novel loci within well-curated model genomes and generates high-confidence annotations for ~1,200 previously uncharacterized species, expanding NCBI's family-level coverage by 40.5%. As an evidence-independent approach, OrionGeno bridges the gap between genome sequencing and functional discovery, holding promise for large-scale biodiversity initiatives like the EBP.
bioinformatics 2026-04-29 v1
Metacontam: A Negative Control-Free Decontamination Method for Metagenomic Analysis
Jo, J.; Lee, H.; Baek, J. W.; Lee, S.; Singh, V.; Shoaie, S.; Mardinoglu, A.; Choi, J.; Lee, S.
Shotgun metagenomic sequencing enables high-resolution profiling of host-associated microbial communities. However, contaminant DNA can substantially distort biological interpretations, especially in low-biomass samples. Here, we introduce Metacontam, a control-free method for species-level decontamination of shotgun metagenomic data. Metacontam integrates blacklist-guided community detection within a species correlation network with average nucleotide identity (ANI) to identify contaminants arising from shared sources. Across diverse low-biomass and mixed-biomass datasets, Metacontam outperformed existing approaches, improving the detection of low-abundance and low-prevalence contaminants while retaining biologically plausible taxa. It also reduces kit-specific biases in skin metagenomes and improves downstream analyses of tissue microbiome data. Together, these results demonstrate that Metacontam enables accurate identification of contaminant taxa across diverse metagenomic datasets, even in the absence of negative controls.
bioinformatics2026-04-29v1TissueFormer: Extending single-cell foundation models to predict population-level phenotypes
Benjamin, A. S.; Zador, A.Abstract
Single-cell RNA sequencing technologies have enabled unprecedented insights into gene expression and opened new pathways for diagnostics and tissue annotation. At present, most computational approaches for interpreting single-cell data predict labels or properties based on isolated single-cell transcriptomic profiles. This approach overlooks the cellular composition within a sample, which is often critical for inferring tissue identity or other sample-level phenotypes. To address this limitation, we introduce TissueFormer, a Transformer-based neural network that infers population-level labels from groups of single-cell RNA profiles while retaining single-cell resolution. We applied TissueFormer to two tasks: predicting COVID-19 severity from single-cell RNA sequencing of blood samples, and predicting cortical area identity from spatial transcriptomic data in mouse brains. TissueFormer outperformed single-cell foundation models and machine learning methods applied to pseudobulk and cell type composition. TissueFormer's higher performance promises more accurate diagnostics and enables the automated construction of high-resolution brain region maps in individual mice directly from spatial transcriptomic data. Applied to mice with developmental perturbations to visual input, these maps revealed a significant reduction in predicted visual cortex area, illustrating how individual differences in neuroanatomy can be quantified. More broadly, TissueFormer provides a framework for predicting any population-level phenotypes which are influenced by cellular diversity and tissue-level organization.
bioinformatics2026-04-28v2A Multi-modal LLM-Knowledge Fusion Framework for Predicting Single-cell Genetic Perturbation Effects
LU, M.; YOU, N.; ZHANG, H.; ZHENG, L.; LI, B.; JIANG, W.; ZHANG, Y.; SUN, H.; ZHOU, Y.Abstract
Understanding cellular responses to genetic perturbations is fundamental for drug discovery, yet experimental approaches face significant limitations in coverage and cost that prevent comprehensive mapping of cellular behavior. This has motivated the development of virtual cells, computational models that learn the relationship between cell state and function to predict the consequences of perturbations across diverse contexts. However, current computational methods suffer from limited accuracy in complex genetic interactions, poor biological interpretability, and inadequate generalization to unseen genes, severely constraining virtual cell capabilities. We present scPert, a multi-modal framework based on Transformer architecture that integrates large language model embeddings with structured biological knowledge to predict single-cell transcriptomic responses to genetic perturbations. Through hierarchical fusion of knowledge graph representations, contextual embeddings from foundation models, and gene-specific encodings, scPert achieves significant performance improvements in both single-gene and combinatorial perturbations over existing methods. In cancer-relevant applications, scPert demonstrates the capability to reveal p53 pathway dynamics and immune checkpoint regulatory mechanisms. Systematic evaluation on 42 cancer dependency genes demonstrates scPert's ability to identify critical potential therapeutic targets. Our framework establishes a powerful computational foundation for virtual cell construction and accelerates drug target discovery.
bioinformatics2026-04-28v1ActSeekN: A Structural-Motif-Based Pipeline for Interpretable Enzyme Function Annotation
Castillo, S.; Gu, C.; Jouhten, P.; Peddinti, G.; Ollila, S. O. H.Abstract
Accurate enzyme function annotation remains a major bottleneck in genome analysis despite the rapid expansion of available protein sequence and structure data. Most existing methods rely on sequence similarity or machine-learning representations, which often perform poorly for proteins with low sequence identity or convergent evolutionary histories. Because enzymatic activity is determined by the three-dimensional arrangement of catalytic and binding-site residues, structure-based approaches offer a mechanistically grounded alternative. However, their broader application has been constrained by the limited size and coverage of curated active-site reference databases. To address this challenge, we developed ActSeekN, a structural-motif-based functional annotation pipeline that combines the ActSeek active-site search algorithm with a newly constructed large-scale reference database derived from AlphaFold-predicted structures, UniProt annotations, and curated catalytic residue information. This framework enables rapid and scalable identification of conserved catalytic motifs across structurally related proteins, allowing function to be transferred on the basis of local three-dimensional catalytic geometry rather than global sequence similarity. In this way, ActSeekN overcomes a central limitation of previous structure-based methods by expanding the searchable space of catalytic motifs while retaining mechanistic interpretability. Benchmarking against state-of-the-art machine-learning approaches demonstrates competitive or superior performance. Applications to yeast, human, and Trichoderma reesei proteomes refine existing annotations, complete partial EC assignments, and identify previously unrecognized enzymatic functions, highlighting ActSeekN as a powerful tool for genome annotation and biotechnology.
bioinformatics2026-04-28v1Expanding the options for therapeutic exon skipping as a future treatment for USH2A-associated disease by 3D structural modeling of newly formed hybrid domains
Malinar, L.; Broekman, S.; Rademaker, D. T.; Le, A. Q.; Peters, T.; de Vrieze, E.; 't Hoen, P. A. C.; van Wijk, E.; Venselaar, H.Abstract
Usher syndrome, the leading cause of hereditary deaf-blindness affecting approximately 1 in 15,000 individuals worldwide, is currently still untreatable. Antisense oligonucleotide-based exon skipping has shown significant therapeutic promise for USH2A-associated retinal dysfunction. Selection of (combinations of) exons suitable for therapeutic exon skipping within the fibronectin type 3 (FN3) domain-encoding region of USH2A currently requires that skipped exons exactly align with complete protein domains. However, only a few exon combinations meet this criterion, which significantly restricts the therapeutic potential of this strategy. Our study addresses this limitation by incorporating AlphaFold2 structural modelling into the exon skipping target selection pipeline. Following this adjusted framework, we can predict exon skipping combinations that allow remaining domain fragments to form structurally viable hybrid domains. As a proof-of-concept, we examined and confirmed the functionality of usherinΔexon54-58 that contains a hybrid FN3 domain, using zebrafish as a model. This highlights the potential of the newly developed paradigm for identifying exon skipping targets with potential therapeutic relevance. Our results emphasize the value of structural modeling in identifying new therapeutic exon skipping targets, aiming to improve precision, efficiency, applicability, and cost-effectiveness in the development of genetic therapies for hereditary diseases such as Usher syndrome.
bioinformatics2026-04-28v1Accurate ab initio gene prediction in eukaryotes with Tiberius in multiple clades
Gabriel, L.; Bruna, T.; Kaur, A.; Krishnan, A.; Ortmann, F.; Salamov, A.; Talbot, S.; Becker, F.; Krieg, R.; Wheat, C. W.; Grigoriev, I. V.; Stanke, M.; Hoff, K. J.Abstract
Eukaryotic genome annotation is currently bottlenecked by limitations in the generality, scalability and accuracy of computational methods. Deep learning approaches have recently achieved large improvements in ab initio gene prediction accuracy. We extend the deep learning-based ab initio gene predictor Tiberius beyond mammals by training lineage-specific models for Mesangiospermae, Fungi, Vertebrata, Insecta, Chlorophyta and Bacillariophyta. Across a benchmark of 33 species, Tiberius consistently achieves higher accuracy than the other evaluated ab initio methods, Helixer and ANNEVO, while also having the fastest runtimes overall. Compared with BRAKER3, which incorporates RNA-Seq and protein evidence, Tiberius approaches state-of-the-art accuracy in Mesangiospermae, Fungi, Bacillariophyta and Chlorophyta, while being on average 80 times faster when using a GPU. Availability and implementation: https://github.com/Gaius-Augustus/Tiberius
bioinformatics2026-04-28v1Learning dynamics of unsupervised deep learning reveal epoch-specific genetic architectures of brain morphology
ISLAM, S. M. S.; Xia, T.; Zhao, X.; Xie, Z.; Zhi, D.Abstract
Representation learning is an emerging paradigm for deriving phenotypes from complex measurements (e.g., imaging) for genetic discovery. However, the learning dynamics of deep neural networks, especially the evolution of representations during training, while of interest in representation learning, have been insufficiently investigated in the context of genetic discovery. In this study, using a 3D convolutional autoencoder trained on T1-weighted brain MRIs of UK Biobank participants, we show that its learning trajectory forms an epoch-stratified landscape of brain morphology heritability. Different training epochs capture distinct genetic architectures at comparable heritability levels. Overall, ensembling across informative checkpoints identifies more genomic risk loci than the conventional single-checkpoint approach. Interpretability analysis reveals that epoch-specific loci, including MAPT and MCPH1, map onto biologically coherent and distinct neuroanatomical signatures, identified at different stages of the training process. Our results establish learning dynamics as a novel axis for genetic discovery using unsupervised deep learning and have practical implications for any architecture that saves multiple checkpoints during training.
bioinformatics2026-04-28v1A transcriptomic-driven segmentation and cell simulation framework for high-resolution spatial transcriptomics and cell-cell communication
Wanchai, V.; Bustamante-Gomez, N. C.; Kurilung, A.; Beenken, K. E.; Cortes, S.; Smeltzer, M. S.; Leung, Y.-K.; Xiong, J.; Almeida, M.; O'Brien, C. A.; Nookaew, I.Abstract
The Visium HD spatial transcriptomics platform enables transcriptome-wide profiling at near-single-cell resolution. However, accurate segmentation of cells to define spatial boundaries relies heavily on histological images. Previous approaches struggle to define cells when the tissues have high cell density, are inflamed, or are mineralized, leading to transcriptomic bleed-through and inaccurate clustering. To address this, we developed TENGU (Transcript-signal Enrichment and Grouping Unit), a comprehensive end-to-end bioinformatic software package. Unlike existing tools, TENGU employs a transcript-first segmentation approach, prioritizing transcript-signal density as the primary modality and utilizing histological images only as a secondary supplement in unresolved regions. These initial boundaries are further optimized through a novel transcriptomic-driven cell simulation algorithm. Iterative refinement of boundaries based on localized gene expression probabilities effectively minimizes spatial scattering and preserves biologically distinct molecular signatures. The pipeline seamlessly integrates tissue segmentation, high-resolution cell-type annotation, and basic spatially aware cell-cell communication (CCC) analysis. We rigorously benchmarked TENGU against the 10X Genomics and Bin2cell pipelines for cell segmentation across diverse and technically challenging microenvironments. TENGU demonstrated superior transcriptomic distinctness in the murine brain, successfully captured matrix-embedded osteocytes, and localized critical osteoimmune CCC networks (Tgfb and Il1a) in a murine model of osteomyelitis. TENGU also resolved species-specific, pro-tumorigenic signaling hubs (MDK-SDC4) within a highly compacted human colorectal cancer xenograft. 
By mitigating the constraints of traditional image-dependent segmentation, TENGU provides a highly adaptable and robust computational framework that empowers researchers to accurately decode the complex functional micro-anatomy of both healthy and pathological tissues.
bioinformatics2026-04-28v1Automated generation of personalized trajectories of aging phenotypes with DyViA-GAN
Pyne, S.; Ray, D.; Ray, M. S.Abstract
With a general increase in human lifespan, the need for technological advances to develop strategies for healthy aging has assumed great importance. In the present study, our goal is to predict the progression of selected aging phenotypes in a given healthy individual as one continues aging past 65 years. Therefore, we developed a novel framework called Dynamic Views of Aging with conditional Generative Adversarial Networks (or DyViA-GAN) which is capable of predicting the plausible personalized trajectories of a selected aging phenotype conditioned on the available measurements of the phenotype at a few initial time instances, and additional covariates. Given the prevalence of osteoporosis in the aging population, we selected femoral neck Bone Mineral Density (BMD) of a healthy individual as the phenotype of interest, and baseline individual Body Mass Index (BMI) as covariate. We trained DyViA-GAN on a publicly available longitudinal dataset of a large cohort of mostly white women in the United States of age 65 years or above. Thus, it generated, for each individual, continuous phenotype trajectories, along with a corresponding region of acceptable predictions, for an age range of 66 to 98 years, for eight different combinations both with and without involving the covariate. The prediction results were subjected to rigorous quality-control and multiple comparative analyses. Our results clearly demonstrate the potential of generative deep learning frameworks in healthspan research.
bioinformatics2026-04-27v5A Robust and Integrated Framework for Cross-platform Adaptation of Epigenetic Clocks in Cell-free DNA Sequencing
Li, G.; Huang, W.; Zhao, X.; Wu, J.; Guo, Y.; Chen, L.; Cao, X.; Yang, Z.; Jiang, S.; Hu, B.; Wang, Y.; Tan, D.; Tong, V.; Tang, C.; Feng, X.; Hu, X.; Ouyang, C.; Zhou, G.Abstract
Epigenetic clocks trained on methylation arrays generalize poorly to high-throughput sequencing (HTS) of cell-free DNA (cfDNA). Using paired array and HTS replicates, we systematically identified requirements to bridge this platform gap and developed a standardized, model-agnostic adaptation framework. Optimal performance requires maintaining at least 10x mean target depth, utilizing L2-weighted regularization, and implementing targeted beta-value imputation and transfer learning. A combined framework using these strategies significantly enhanced legacy clock performance across independent aging and disease cohorts, enabling robust, minimally invasive biological age assessment without compromising biological interpretability.
bioinformatics2026-04-27v4Uniform pre-processing of bacterial single-cell RNA-seq
Oakes, C. G.; Beilinson, V.; McFall-Ngai, M. J.; Pachter, L. G.Abstract
Bacteria are highly heterogeneous, even under controlled conditions, making single-cell RNA sequencing (scRNA-seq) essential for studying microbial diversity and symbiosis. Since its first application in 2015, bacterial scRNA-seq has expanded, but different assays depend on distinct, custom, in-house preprocessing, making it difficult to analyze data as part of a unified workflow. The kallisto-bustools suite of tools has enabled uniform pre-processing of eukaryotic scRNA-seq while also reducing time and resource demands for pre-processing, but is not optimized for bacterial scRNA-seq. We adapt kallisto-bustools to be suitable for reads generated from operons, as well as for a much shorter gene length distribution, and show that it can efficiently and accurately quantify bacterial scRNA-seq. Our work provides a scalable foundation for uniform pre-processing of microbial single-cell transcriptomics.
bioinformatics2026-04-27v3Combining AI structure prediction and integrative modelling for nanobody-antigen complexes
Sanchez-Marin, M.; Giulini, M.; Bonvin, A.Abstract
Nanobodies exhibit antigen binding affinities of the same order as those of antibodies, which, along with their small size and unique structural characteristics, makes them well-suited for therapeutic and diagnostic applications. The lack of coevolutionary signals in nanobody-antigen complexes together with the broad complementarity-determining region 3 loop (CDR3) conformational space poses a challenge for predicting the 3D structure of those complexes with computational modelling and artificial intelligence-based methods. In this context, physics-based information-driven docking can provide an alternative solution. This study evaluates the state-of-the-art of machine learning-based methods for nanobody structure prediction and benchmarks various HADDOCK workflows to model their interaction with antigens using different input nanobody ensembles and information scenarios. We propose an ensemble docking pipeline that achieves high success rates starting from nanobody structural models predicted by AlphaFold2 and ImmuneBuilder. Our results highlight the effectiveness of physics-based complex prediction of immune proteins when accurate input structures and sufficient information to guide the modelling are available.
bioinformatics2026-04-27v2SpatialQuery: scalable discovery and molecular characterization of multicellular motifs from spatial omics data
An, S.; Keller, M.; Gehlenborg, N.; Hemberg, M.Abstract
Spatially resolved single-cell technologies enable profiling of cells in situ, yet computational approaches that jointly discover multicellular spatial patterns and characterize their molecular programs remain limited. Here we introduce SpatialQuery, a framework that can both identify cellular motifs, i.e. recurrent multicellular co-localization patterns, and perform molecular analyses focused on the motifs. It uncovers genes modulated by spatial contexts through differential expression analysis, and detects coordinated expression changes through covariation analysis. SpatialQuery can identify functional tissue units, and goes beyond pairwise analyses to characterize multicellular interactions. Applications to both spatial transcriptomics and proteomics data uncover cross-germ-layer signaling in gut tube patterning, disease-specific fibrotic and immunosuppressive niches in kidney and colon, and regional determinants of motif-associated transcriptional programs in a mouse brain atlas. SpatialQuery is available as a Python package, and we demonstrate how its light computational footprint enables integration into web-based cell atlas portals for interactive visualization and exploration.
bioinformatics2026-04-27v2Risk Based Prediction of Novel AMR Variants Using Protein Language Models
Wood, J. J.; Portelli, S.; Ascher, D. B.; Furnham, N.Abstract
Antimicrobial resistance (AMR) is among the most pressing global health threats of the 21st century, with the potential to thrust modern medicine back into a pre-antibiotic era. Resistance can arise through diverse mechanisms, including genomic mutations that prevent antibiotics from reaching or acting on their targets. To limit the spread of AMR, surveillance systems must detect both known and emerging resistance markers. Here we present AMRscope, a model trained on ESM2 protein language model embeddings of single mutations for prediction of resistance likelihood, combined with a rigorous evaluation framework. This tool is applied across antibiotic-interacting proteins of different bacterial species, including WHO priority pathogens, such as rifampicin-resistant M. tuberculosis and carbapenem-resistant P. aeruginosa. Performance on random splits achieves a competitive accuracy, F1 and MCC of 0.88, 0.87 and 0.75, respectively, while additional splitting strategies demonstrate transfer of predictive power to unseen organisms or genes. Moreover, in silico deep mutational scanning and structural mapping across these targets reveal that the tool can recover known resistance-associated regions and highlight new candidates. The risk-based outputs complement database matching and resistance element detection tools, providing clinicians and public health agencies with an interpretable and scalable system for AMR surveillance and proactive response.
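For readers less familiar with the three metrics quoted in this abstract, the following minimal sketch shows how accuracy, F1 and the Matthews correlation coefficient (MCC) are derived from a binary confusion matrix. The function name and the counts are hypothetical and chosen purely for illustration; they are not taken from AMRscope or its evaluation.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Return (accuracy, F1, MCC) from binary confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total

    # F1 is the harmonic mean of precision and recall.
    f1_denominator = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denominator if f1_denominator else 0.0

    # MCC uses all four cells, so it stays informative under class imbalance.
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0
    return accuracy, f1, mcc


# Hypothetical counts for illustration only (not results from the paper).
accuracy, f1, mcc = classification_metrics(tp=40, fp=10, tn=45, fn=5)
print(f"accuracy={accuracy:.2f} F1={f1:.2f} MCC={mcc:.2f}")
```

MCC ranges from -1 to 1, whereas accuracy and F1 range from 0 to 1, which is why an MCC of 0.75 can sit alongside accuracy and F1 values near 0.88.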
bioinformatics2026-04-27v2Unified imputation of missing data modalities and features in multi-omic data via shared representation learning
Nambiar, A.; Melendez, C.; Noble, W. S.Abstract
Multi-omic studies promise a more comprehensive view of biological systems by jointly measuring multiple molecular layers. In practice, however, such datasets are rarely complete: entire molecular modalities may be missing for many samples, and observed modalities often contain substantial feature-level missingness. Existing imputation approaches typically address only one of these two problems, relying either on feature-level imputation within a single modality or on pairwise translation models that cannot accommodate arbitrary combinations of missing modalities. We present MIMIR, a deep learning framework for unified multi-omic imputation of bulk data that addresses both missing modalities and missing values through shared representation learning. MIMIR first learns modality-specific representations using masked autoencoders and then projects these representations into a common latent space, enabling reconstruction from any subset of observed modalities. Evaluated on pan-cancer multi-omic data from The Cancer Genome Atlas, MIMIR consistently outperforms baseline methods across a range of missing-modality and missing-value scenarios, including missing completely at random and missing not at random settings. Analysis of the learned shared space reveals structured cross-modal dependencies that explain modality-specific differences in imputation accuracy, with transcriptional and epigenetic modalities forming a strongly aligned core and copy number variation contributing more distinct signal. Together, these results demonstrate that shared representation learning provides an effective and flexible foundation for multi-omic imputation under heterogeneous patterns of missingness.
bioinformatics2026-04-27v2Semi supervised GAN for smart microscopy, fast and data efficient cell cycle classification
Manick, R.; El Habouz, Y.; Guillout, M.; Martin, C.; Bonnet-gelebart, J.; Ruel, L.; Pastezeur, S.; Chanteux, O.; Bouchareb, O.; Tramier, M.; Pecreaux, J.Abstract
Modern optical microscopes are fully motorised; however, transforming them into truly smart systems requires real-time adjustment of acquisition settings in response to detected objects and dynamic biological events. At the core are classification algorithms that commonly depend on customised software and are generally designed for narrowly-defined biological applications. In addition, they often require substantial annotated datasets for effective training. We introduce a semi-supervised generative adversarial network (SGAN) for robust cell-cycle stage classification under low-resource conditions, adaptable to diverse cellular structures. The framework combines unlabelled microscopy images with synthetically generated samples to mitigate limited annotation, while preserving stable performance even when the unlabelled subset is class-imbalanced. Tested on the Mitocheck dataset, which features five mitosis classes, the model achieved 93 ± 2% accuracy using only 80 labelled images per class and 600 unlabelled images. The proposed algorithm is generic and can be readily adapted to new labelling schemes, classification targets, cell lines, or microscopy modalities through transfer learning. SGAN is well suited for integration into automated microscopes, enabling efficient and adaptable image analysis across diverse biological and microscopy applications.
bioinformatics2026-04-27v1SynCom101: A web-based platform for the standardized design of functionally tailored synthetic microbial communities
Jing, J.; Rockx, S.; Liu, A.; Melkonian, C.; Raaijmakers, J. M.; Garbeva, P.; Medema, M. H.Abstract
Background Synthetic microbial communities (SynComs) are essential tools for dissecting the causal mechanisms in host-microbiota interactions. To date, however, SynCom design suffers from a lack of standardization, typically oscillating between arbitrary strain selection and computational pipelines that misalign with experimental design. As microbiome research transitions toward functionally defined community systems with reproducible experimental outcomes, there is a strong need for a user-friendly platform that integrates multi-dimensional genomic and/or biological data into a standardized and tailored SynCom design. Results Here, we present SynCom101, a web-based platform that democratizes the design of reproducible, hypothesis-driven SynComs. SynCom101 accommodates diverse input formats including genomic annotations and laboratory-obtained phenotypic traits, allowing users to customize their design criteria with high flexibility. The platform utilizes a parsimony algorithm to ensure computational scalability for large datasets, complemented by an optional correlation-aware mode to account for microbial compatibility and co-occurrence patterns when ecological interactions among strains are available. A core innovation of SynCom101 is its suite of trait-weighting modules, which empowers researchers to strategically guide the selection algorithm toward maximal functional trait coverage, the emulation of natural community architectures, or the enrichment of positively correlated microbial assemblages to enhance community stability. We showcase the functionalities of the platform by in silico design of communities from different datasets, demonstrating its capacity to generate concise, functionally prioritized SynComs aligned with targeted design objectives. Conclusion By providing a transparent, parameter-documented workflow, SynCom101 ensures that community design is no longer a "black box" but a reproducible scientific record.
This platform establishes a necessary standard for in silico community assembly, facilitating the transition from descriptive microbiome studies toward high-throughput, predictive functional screening and cross-study comparability. Availability SynCom101 can be accessed via the web interface (https://syncom101.bioinformatics.nl/). The datasets used for case studies are available on Zenodo (https://doi.org/10.5281/zenodo.18310451). The source code is available at Git (https://git.wur.nl/jiayi.jing/syncom101).
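The abstract does not spell out the parsimony algorithm, so the sketch below is only one plausible reading: a greedy set-cover heuristic that keeps adding the strain covering the most still-uncovered target traits. The function name, strain names and traits are all hypothetical, and SynCom101's actual selection logic and trait weighting may differ.

```python
def greedy_syncom(strain_traits, target_traits):
    """Greedily pick strains until the target traits are covered.

    Assumption: a plain greedy set cover, not SynCom101's documented method.
    Returns the selected strains and any traits no strain could cover.
    """
    uncovered = set(target_traits)
    selected = []
    while uncovered:
        # Pick the strain covering the most uncovered traits;
        # ties go to the name that sorts first.
        best = max(sorted(strain_traits),
                   key=lambda s: len(strain_traits[s] & uncovered))
        gained = strain_traits[best] & uncovered
        if not gained:
            break  # the remaining traits are absent from every strain
        selected.append(best)
        uncovered -= gained
    return selected, uncovered


# Hypothetical strains and traits, for illustration only.
strains = {
    "strain_A": {"nitrogen_fixation", "siderophore"},
    "strain_B": {"siderophore", "phosphate_solubilization", "IAA_production"},
    "strain_C": {"chitinase"},
    "strain_D": {"nitrogen_fixation", "chitinase"},
}
targets = {"nitrogen_fixation", "siderophore", "phosphate_solubilization",
           "IAA_production", "chitinase"}

selected, missing = greedy_syncom(strains, targets)
print(selected, missing)
```

A greedy cover is not guaranteed to find the smallest possible community, but it scales to large strain collections, which matches the scalability motivation stated in the abstract.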
bioinformatics2026-04-27v1An Extended Clade Framework for Annotated Trees in the Context of Phylogeography and Transmission Tree Inference
Berling, L.; Colijn, C.Abstract
Bayesian phylogenetic inference produces large samples from a posterior distribution over phylogenetic trees that represents uncertainty in both tree topology and associated variables. Such a collection of trees is hard to interpret and it is common practice to summarize such samples into a single representative tree. Methods for constructing representative trees have largely been restricted to plain tree topologies, encoding only relationships among taxa. Inference with more sophisticated models produces annotated tree objects. These have additional information representing nodes' locations in the case of phylogeography, host information when inferring transmission trees, or sampled ancestor status when incorporating fossil information. Nevertheless, these annotated representations are reduced to a single representative tree, typically using methods developed for plain tree topologies and without accounting for the resulting methodological mismatch. Here, we introduce the concept of an extended clade and investigate an extension of the conditional clade distribution (CCD) model. Through motivating examples and case studies in discrete trait phylogeography and transmission tree reconstruction, we demonstrate limitations of standard summary tree approaches and show how these can be addressed using an extended CCD framework that explicitly incorporates the annotated tree structure.
bioinformatics2026-04-27v1MycorrhizaTracer: A bioinformatic pipeline for fungi and plant classification of Sanger DNA sequences
Brekke, T. D.; Weeks, T.; Barber, R. A.; Thomson, I.; Gooda, R.; Gargiulo, R.; Delhaye, G.; Andrew, C.; Kowal, J.; Bidartondo, M.; Martinez-Suz, L.Abstract
Processing Sanger DNA sequences remains a routine yet technically demanding step in many biodiversity and ecological studies, particularly when barcoding large numbers of environmental samples. Manual inspection and editing of trace files, DNA sequence alignment, and classification using taxonomic reference databases are time-consuming, inconsistent, and prone to error. These challenges are compounded in studies involving degraded samples, in-house DNA sequencing, under-described taxa, or when investigators have limited access to computational tools. We present MycorrhizaTracer, an open-source, fully automated pipeline for processing and taxonomically classifying large batches of Sanger sequencing chromatograms. We have optimized it for fungal and plant taxa, but it is adaptable across the tree of life. The pipeline performs quality trimming, consensus generation from bidirectional reads, taxonomic classification via BLAST, clustering, optional salvaging of low-quality sequences, and functional annotation of fungal taxa. Designed for scalability and ease of use, MycorrhizaTracer can process thousands of DNA chromatograms in a matter of hours without the need for an HPC. Accuracy and ecological relevance are ensured by features such as gene region-specific taxonomic filtering and sequence-based clustering of unclassified reads. By streamlining trace-to-taxon workflows, MycorrhizaTracer reduces the burden of manual curation, supports reproducibility, and enables efficient recovery of biodiversity data from Sanger sequences - particularly in field-based or resource-limited research contexts.
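To make the first pipeline step concrete, here is a minimal sketch of quality trimming as it is commonly done for Sanger reads: bases are stripped from both ends while their Phred score is below a threshold. The function name, the threshold of 20 and the toy read are assumptions for illustration; the abstract does not describe MycorrhizaTracer's actual trimming rule.

```python
def trim_low_quality_ends(sequence, phred_scores, min_quality=20):
    """Strip bases from both ends while their Phred score is below min_quality.

    Assumption: simple per-base end trimming, not the pipeline's own rule.
    Returns an empty string if no base reaches the threshold.
    """
    if len(sequence) != len(phred_scores):
        raise ValueError("sequence and phred_scores must have equal length")

    start, end = 0, len(sequence)
    # Advance past the low-quality bases at the 5' end.
    while start < end and phred_scores[start] < min_quality:
        start += 1
    # Retreat past the low-quality bases at the 3' end.
    while end > start and phred_scores[end - 1] < min_quality:
        end -= 1
    return sequence[start:end]


# Toy read with hypothetical scores: poor quality at both ends, as is typical.
read = "ACGTACGTACGT"
scores = [5, 8, 10, 30, 32, 35, 36, 34, 33, 12, 9, 6]
trimmed = trim_low_quality_ends(read, scores)
print(trimmed)
```

Low-quality bases inside the retained span are left untouched here; production tools often use sliding windows or error-probability sums instead of a per-base cutoff.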
bioinformatics2026-04-27v1Unraveling protein conformational plasticity with PROTEUS
Caparelli Piochi, L. F.; Karami, Y.; Khakzad, H.Abstract
Protein conformational plasticity underpins allosteric regulation, fold switching, and post-translational modification accessibility, yet no existing method can probe this property at the proteome scale without simulation. Here we show that SimpleFold, a flow-matching protein structure predictor, implicitly encodes conformational plasticity in its internal representations. By comparing per-residue embeddings between the sequence-only regime and the structure-converged regime of the denoising trajectory, we define a zero-shot conformational plasticity score, PROTEUS (PROtein TrajEctory Uncertainty Score), that requires no experimental dynamics data. PROTEUS correctly orders five independent protein classes spanning the full flexibility spectrum, from rigid de novo designed scaffolds to intrinsically disordered proteins that fold upon binding. Per-residue PROTEUS profiles correlate with atomic fluctuations from 1,290 independent molecular dynamics trajectories, and this signal persists after controlling for structure prediction confidence (pLDDT) and sequence-based disorder predictions. At the protein level, PROTEUS achieves AUROC = 0.77 for fold-switch detection, 0.81 for open/closed state discrimination, and 0.93 for identifying proteins with buried phosphorylation sites. Proteome-wide analysis of 4,188 Escherichia coli K-12 proteins reveals that fimbrial adhesins and the type II secretion machinery rank among the most conformationally plastic functional classes, consistent with the structural demands of chaperone-mediated secretion and receptor engagement, while ribosomal proteins score systematically lower. These results establish that PROTEUS provides unsupervised, proteome-scale probing of structural dynamics directly from a generative model.
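The core idea of comparing per-residue embeddings between two regimes can be sketched as follows. This is a toy illustration under stated assumptions: cosine distance as the comparison and a simple mean as the protein-level summary are both guesses, the function names are hypothetical, and the embeddings are made-up 2D vectors rather than SimpleFold representations. The abstract does not give the actual PROTEUS formula.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return 1.0 - dot / (norm_u * norm_v)


def per_residue_divergence(early_embeddings, late_embeddings):
    """Compare each residue's embedding between two regimes.

    Assumption: cosine distance stands in for the unspecified comparison.
    """
    if len(early_embeddings) != len(late_embeddings):
        raise ValueError("both regimes must cover the same residues")
    return [cosine_distance(e, l)
            for e, l in zip(early_embeddings, late_embeddings)]


# Made-up embeddings for three residues (one row per residue).
early = [[1.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
late = [[2.0, 0.0], [0.0, 3.0], [0.0, -1.0]]

profile = per_residue_divergence(early, late)
protein_score = sum(profile) / len(profile)  # assumed aggregation: plain mean
print(profile, protein_score)
```

In this toy case the first residue's embedding keeps its direction (distance 0), while the other two rotate, which is the kind of per-residue contrast a plasticity profile would be expected to surface.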
bioinformatics2026-04-27v1MOSAIC: a longitudinal phenotypic clock to dissect organismal aging trajectories in C. elegans
Vaudano, A. P.; Pierron, M.; Stojkovic, L.; Membrez, M.; Bourgeois, M.; Neal, C.; Chimen, M.; Verbakel, L.; Cornaglia, M.; Solari, F.; Mouchiroud, L.Abstract
Interventions that extend lifespan do not necessarily preserve healthspan, the portion of life spent in good health. This disconnect has intensified interest in biological aging clocks as quantitative proxies of organismal health. However, most existing clocks rely on invasive or endpoint measurements, providing static estimates that capture biological age at a single time point and offer limited insight into aging trajectories - the dynamic patterns through which physiological resilience and functional capacity change within individuals over time. Here we combine standardized, high-frequency imaging of individual Caenorhabditis elegans across the lifespan with machine learning to develop MOSAIC (Modular Organismal Signature of Aging In C. elegans), a non-invasive phenotypic clock that estimates biological age longitudinally at single-organism resolution. Leveraging ~3'750 animals, ~230'000 observations and 29 phenotypic features, MOSAIC predicts biological age with high accuracy and resolves organism-wide aging trajectories at high temporal resolution. Beyond age prediction, MOSAIC decomposes biological age into contributions from distinct physiological modules, enabling mechanistic interpretation of organismal decline. Applying MOSAIC to natural lifespan variation, dietary restriction, longevity mutants and pharmacological interventions reveals that lifespan extension can emerge through distinct, time-dependent phenotypic trajectories rather than a uniform slowing of aging. Interventions with similar effects on longevity produce divergent biological-age trajectories and distinct combinations of younger and older traits, highlighting context-dependent physiological trade-offs. MOSAIC provides a scalable, non-invasive framework to repeatedly quantify biological age across the lifespan and to compare interventions based on how they reshape aging trajectories.
bioinformatics2026-04-27v1Integrative Bioinformatics Approach to Identify Prognostic Gene Signatures for Risk Stratification in Thyroid Carcinoma
Malik, S.; Raghava, G. P. S.Abstract
Thyroid cancer is a heterogeneous malignancy with variable outcomes, highlighting the need for reliable biomarkers and effective risk stratification. In this study, we implemented a multi-step integrative framework to identify distinct prognostic biomarker sets using transcriptomic data from 572 thyroid cancer patients. Correlation analysis followed by false discovery rate (FDR) correction revealed significant gene associations. Notably, MAFF (r = 0.25, p = 1.34e-9, FDR = 2.46e-7), NR4A3 (r = 0.24, p = 1.26e-8, FDR = 9.25e-7), and SRF showed strong positive correlations, whereas LOC728264 (r = -0.21, p = 7.39e-7, FDR = 6.36e-6) and VAMP1 (r = -0.20, p = 1.20e-6, FDR = 1.3e-4) exhibited negative correlations with overall survival (OS). Univariate Cox regression identified several survival-associated genes, including TMEM90B (HR = 10.66, p = 2.88e-5) and PTH1R (HR = 9.88, p = 5.55e-5). LASSO regression further identified 31 key prognostic genes, including 13 potential drug targets predominantly functioning as inhibitors. Machine learning models based on seven independent 20-gene biomarker sets effectively predicted Class 0 (0-1 years), Class 1 (1-3 years), Class 2 (3-5 years), and Class 3 (>5 years), achieving AUC values of 0.91-0.94 and Kappa up to 0.76. An ensemble model further improved prediction (AUC = 0.95, Kappa = 0.72). Incorporating clinical variables (age, gender, stage) enhanced model performance (AUC = 0.96, Kappa = 0.80). Reduced 10- and 5-gene subsets demonstrated consistent yet slightly lower performance (AUC = 0.90 and 0.86, respectively). Collectively, the 20-gene set exhibited the strongest predictive and prognostic potential, highlighting the importance of integrating molecular and clinical features for risk stratification in thyroid cancer. All data and code are openly available (https://github.com/raghavagps/THCA_prognostic_biomarkers), supporting future research in thyroid cancer prediction.
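The FDR correction step mentioned in the abstract above can be illustrated with a minimal sketch. The abstract does not say which FDR procedure was used, so the Benjamini-Hochberg step-up adjustment shown here is an assumption, and the function name `benjamini_hochberg` and the toy p-values are hypothetical, not taken from the study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order.

    Assumption: BH is used here only as a common example of FDR
    correction; the source abstract does not name its procedure.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # stay monotone (a smaller p never gets a larger adjusted value).
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy p-values for illustration only.
toy_p = [0.01, 0.04, 0.03, 0.20]
toy_q = benjamini_hochberg(toy_p)
```

A gene would then be called significant when its adjusted value falls below the chosen FDR level, for example 0.05.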
bioinformatics2026-04-27v1Integrative Clinical-Molecular Modeling Identifies LRRN4CL as a Determinant of Structural and Functional Myocardial Improvement
Johnson, E.; Visker, J. R.; Brintz, B. J.; Kyriakopoulos, C. P.; Jeong, J.; Zhang, Y.; Shankar, T. S.; Hillas, Y.; Taleb, I.; Badolia, R.; Amrute, J. M.; Stubben, C. J.; Cedeno-Rosario, L.; Kyriakoulis, I.; Sideris, K.; Ling, J.; Hamouche, R.; Tseliou, E.; Navankasattusas, S.; Ducker, G. S.; Rutter, J.; Holland, W. L.; Summers, S. A.; Hong, T.; Koenig, S. C.; Hanff, T. C.; Lavine, K. J.; Greene, T.; Bailey, S.; Alharethi, R.; Selzman, C. H.; Shah, P.; Guo, H.; Slaughter, M. S.; Kanwar, M. K.; Drakos, S. G.Abstract
Background: Mechanical ventricular unloading and systemic circulatory support with left ventricular assist devices (LVADs) enable myocardial recovery in a subset of advanced heart failure (HF) patients, but predictors and mechanisms of recovery are not well understood. Integrating clinical and molecular data may improve identification of patients most likely to recover and uncover biologically relevant targets in HF. Methods: We collected and analyzed left ventricular apical myocardial tissue and clinical data from 208 patients undergoing LVAD implantation across five centers. Pre-implant transcriptomic profiles (22,373 mRNA transcripts) were integrated with 59 clinical variables using supervised machine learning with repeated cross-validation to identify and prioritize features associated with myocardial recovery, defined as a binary outcome based on improvement in left ventricular ejection fraction (LVEF ≥40%) and left ventricular end-diastolic diameter (LVEDD ≤5.9 cm). We also modeled functional (LVEF) and structural (LVEDD) improvement as a continuous outcome without any predefined LVEF and LVEDD pathological thresholds. Feature prioritization was followed by validation in human myocardial tissue and mechanistic interrogation in human induced pluripotent stem cell-derived cardiomyocytes (iPSC-CMs). Results: Integrative models achieved modest discrimination for myocardial recovery as a binary categorical outcome (maximum mean cross-validated area under the curve 0.73 ± 0.15), identifying clinical features such as HF duration, LVEDD, HF pharmacologic therapy, and device configuration. Leucine-rich repeat neuronal 4C-like (LRRN4CL), measured in human myocardium, consistently emerged as a top transcriptomic predictor across both binary and continuous metric models (functional and structural). Higher pre-LVAD LRRN4CL expression was associated with reduced likelihood of myocardial recovery and localized primarily to cardiomyocytes. 
In iPSC-CMs, LRRN4CL overexpression localized to the sarcoplasmic reticulum, induced transcriptional remodeling characterized by suppression of contractile pathways and activation of stress programs, impaired calcium handling, impaired contraction-relaxation kinetics, and diminished mitochondrial respiratory reserve capacity. Conclusions: Integration of clinical and myocardial transcriptomic data identifies LRRN4CL as a novel marker associated with impaired myocardial recovery following LVAD-mediated ventricular unloading and systemic circulatory support. These findings move beyond predictive modeling, linking integrative computational discovery to cardiomyocyte dysfunction and providing a translational framework for biologically informed risk stratification and therapeutic targeting for myocardial recovery.
bioinformatics2026-04-26v1Are Current AI Virtual Cell Models Useful for Scientific Discovery?
Bereket, M. D.; Leskovec, J.Abstract
AI models are increasingly developed to predict the effect of perturbations on gene expression, but current benchmarks fail to reliably measure model performance. Here, we argue that new benchmarks that directly measure the value of model predictions for specific scientific discovery outcomes are needed to address this gap. We present PerturbHD, an evaluation framework for AI-enabled hit discovery, to demonstrate the benefits of our proposed approach.
bioinformatics2026-04-25v1AI-readiness criteria for biomedical data
Clark, T.; Caufield, H.; Parker, J. A.; Al Manir, S.; Amorim, E.; Eddy, J.; Gim, N.; Gow, B.; Goar, W.; Hansen, J. N.; Harris, N.; Hermjakob, H.; Joachimiak, M.; Jordan, G.; Lee, I.-H.; McWeeney, S. K.; Nebeker, C.; Nikolov, M.; Reese, J.; Shaffer, J.; Sheffield, N.; Sheynkman, G.; Stevenson, J.; Chen, J. Y.; Mungall, C.; Wagner, A.; Kong, S. W.; Ghosh, S. S.; Patel, B.; Williams, A.; Munoz-Torres, M. C.Abstract
Biomedical research is rapidly adopting artificial intelligence (AI). Yet the inherent complexity of biomedical data preparation requires implementing actionable, robust criteria for ethical and explainable AI (XAI) at the "pre-model" stage, encompassing data acquisition, detailed transformations, and ethical governance. Simple conformance to FAIR (Findable, Accessible, Interoperable, Reusable) Principles is insufficient. Here, we define criteria and practices for reliable AI-readiness of biomedical data, developed by the NIH Bridge to Artificial Intelligence (Bridge2AI) Standards Working Group across seven core dimensions of dataset AI-readiness: FAIRness, Provenance, Characterization, Ethics, Pre-model Explainability, Sustainability, and Computability. Conformance to these criteria provides a basis for pre-model scientific rigor and ethical integrity, mitigating downstream risks of bias and error before AI modeling. We apply and evaluate these standards across all four Bridge2AI flagship datasets, spanning functional genomics to clinical medicine, and encode them in machine-actionable metadata bound to the datasets. This framework sets a benchmark for preparing ethical, reusable datasets in biomedical AI and provides standardized methods for reliable pre-model data evaluation.
bioinformatics2026-04-24v6Characterization of selective pressures acting on protein sites with Deep Learning
Bergiron, E.; Nesterenko, L.; Barnier, J.; Veber, P.; Boussau, B.Abstract
It is often useful, in the field of molecular evolution, to identify the selective pressures acting on a particular site of a protein to better understand its function. This is typically done with likelihood-based approaches applied to codon sequences in a phylogenetic context. However, these approaches are computationally costly. Here we adapt a linear transformer neural network architecture, which has been shown to be able to reconstruct accurate pairwise distances from sequence alignments, to identify selective pressures acting on individual amino acid sites. We design different versions of the architecture and train and test them on simulations. We compare the results of one of our best models to state-of-the-art likelihood-based methods and find that it outperforms them when it is applied to data that resemble its training data, but that it performs less well when applied to datasets that do not resemble the ones the model has been trained on. In all cases, our approach operates at a fraction of the computational cost of likelihood-based methods. These results suggest that such a neural network architecture can compare very favorably to state-of-the-art approaches to characterize selection pressures acting on coding sequences, but that it must be trained on datasets representative of empirical data.
bioinformatics2026-04-24v4Modeling causal signal propagation in multi-omic factor space with COSMOS
Dugourd, A.; Lafrenz, P.; Mananes, D.; Paton, V.; Fallegger, R.; Bai, Y.; Kroger, A.-C.; Turei, D.; Li, Y.; Trogdon, M.; Nager, D.; Deng, S.; Shen, C.; Lapek, J. D.; Shtylla, B.; Saez-Rodriguez, J.Abstract
Understanding complex diseases requires approaches that jointly analyze omics data across multiple biological layers, including signaling, gene regulation, and metabolism. Existing data-driven multi-omics analysis methods, such as multi-omics factor analysis (MOFA), can identify associations between molecular features and phenotypes, but they are not designed to integrate existing mechanistic molecular knowledge, which can provide further actionable insights. We introduce an approach that connects data-driven analysis of multi-omics data with systematic integration of mechanistic prior knowledge using COSMOS+ (Causal Oriented Search of Multi-Omics Space). We show how factor analysis output can be used to estimate activities of transcription factors and kinases as well as ligand-receptor interactions, which in turn are integrated with network-level prior knowledge to generate mechanistic hypotheses about paths connecting deregulated molecular features. We apply this approach to a novel multi-omics dataset of cell line models of breast cancer resistance, as well as to a breast cancer patient cohort, to evaluate the ability of such mechanistic hypotheses to identify resistance drivers. Our approach offers an interpretable framework to generate actionable insights from multi-omic data and is particularly suited for high-dimensional datasets.
bioinformatics2026-04-24v3MetaTree: an interactive web platform for aligned hierarchical data visualization and multi-group comparison
Wu, Q.; Zhang, A.; Ning, Z.; Figeys, D.Abstract
Background: Hierarchical quantitative profiles are widely used in microbiome studies and other domains. However, comparing multiple samples and experimental groups while preserving hierarchical structure remains challenging. Many existing workflows require extensive manual figure assembly or do not support aligned comparisons across conditions on a shared hierarchy. Results: We developed MetaTree, an open-source platform that runs in a web browser for interactive visualization and comparative analysis of hierarchical quantitative data. MetaTree anchors samples, groups, and contrasts between groups to a shared reference hierarchy, preserving one-to-one node correspondence so that the same clade is compared in the same position across views. In addition to visualization, MetaTree integrates statistical testing for comparisons between two groups with false discovery rate (FDR) control, enabling users to identify clades with consistent differences between conditions and interpret them in hierarchical context. MetaTree also provides user-configurable controls for visual encoding, filtering thresholds, label density, and layout, allowing figures to be adapted to different datasets and reporting needs. The interface remains usable for large hierarchies through interactive navigation, adaptive label handling, and branch collapsing. Conclusions: MetaTree is an installation-free web platform (https://byemaxx.github.io/MetaTree) for topology-consistent visualization and comparison of hierarchical profiles, supporting coordinated multi-panel exploration and automated comparison matrices to enable rapid generation of publication-ready figures for microbiome and other hierarchical datasets.
bioinformatics2026-04-24v3Adaptive prediction intervals for polygenic risk scores reveal individual variation in genetic predictability
Wang, C.; Wang, F.; Bogdan, M.; Masala, M.; Fiorillo, E.; Devoto, M.; Cucca, F.; Belsky, D.; Ionita-Laza, I.Abstract
Polygenic risk scores (PRS) are widely used in post-GWAS analyses to predict complex traits across humans, animals, and plants, yet the uncertainty of these predictions is rarely quantified at the individual level. Here, we introduce a framework for individualized uncertainty quantification based on quantile regression and conformal prediction, enabling the construction of prediction intervals with guaranteed coverage under minimal assumptions. Quantile regression enables adaptive, individual-specific prediction intervals that capture asymmetry and allow interval widths to vary substantially across individuals based on genetic information alone. Applying this framework to 62 traits in the UK Biobank and the ProgeNIA/SardiNIA studies, we show that these intervals maintain valid coverage and reduce uncertainty in risk stratification compared to existing methods, driven by their adaptive construction. Prediction interval width correlates positively with age and BMI, indicating reduced genetic predictability in subsets of the population where genetic effects interact with environmental factors. Our results demonstrate that incorporating uncertainty is essential for interpreting polygenic predictions and provide a principled approach to distinguish individuals whose phenotypes are well explained by genetic predictors from those in whom non-genetic influences dominate.
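To make the combination of quantile regression and conformal prediction concrete, here is a minimal sketch of the calibration step in split conformalized quantile regression. The abstract above does not describe the authors' exact procedure, so this is a generic textbook variant under stated assumptions: the lower and upper quantile predictions are taken as given (from any fitted quantile model), and the names `cqr_margin` and `conformal_interval`, along with all toy numbers, are hypothetical.

```python
import math


def cqr_margin(cal_lo, cal_hi, cal_y, alpha):
    """Conformal margin computed on a held-out calibration set.

    cal_lo / cal_hi: predicted lower and upper quantiles per individual.
    cal_y: observed phenotypes for the same individuals.
    alpha: target miscoverage (e.g. 0.25 for 75% intervals).
    """
    # Conformity score: distance by which y falls outside [lo, hi];
    # negative when y lies inside the predicted interval.
    scores = sorted(
        max(lo - y, y - hi) for lo, hi, y in zip(cal_lo, cal_hi, cal_y)
    )
    n = len(scores)
    # Finite-sample corrected rank of the score to use as the margin.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points to certify this coverage level.
        return math.inf
    return scores[k - 1]


def conformal_interval(lo, hi, margin):
    """Widen (or shrink, if margin < 0) a predicted quantile interval."""
    return lo - margin, hi + margin


# Toy calibration data for illustration only.
toy_lo = [0.0] * 7
toy_hi = [1.0] * 7
toy_y = [0.5, 1.25, -0.125, 0.75, 1.5, 0.25, 1.0]
toy_margin = cqr_margin(toy_lo, toy_hi, toy_y, alpha=0.25)
toy_interval = conformal_interval(2.0, 3.0, toy_margin)
```

Because the margin is added to each individual's own quantile predictions, interval widths still vary across individuals, which is the adaptive behavior the abstract describes. The coverage guarantee is marginal and relies on calibration and test individuals being exchangeable.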
bioinformatics2026-04-24v3A De Novo Algorithm for Allele Reconstruction from Oxford Nanopore Amplicon Reads, with Application to CYP2D6
Brown, S. D.; Dreolini, L.; Minor, A.; Mozel, M.; Wong, N.; Mar, S.; Lieu, A.; Khan, M.; Carlson, A.; Hrynchak, M.; Holt, R. A.; Missirlis, P. I.Abstract
The Oxford Nanopore Technologies' sequencing platform offers a path towards bedside genomics, producing long reads that can completely cover a gene of interest, and thus detect any known or novel variant the gene contains. However, the analysis of these long reads to identify actionable genotypes remains challenging and typically requires customization depending on the target gene. Here, we describe a generic algorithm to accurately reconstruct allele sequences derived from long-reads of genomic-amplicon origin. Rather than calling variants directly from these long-reads, our method takes a "sequence-first" approach, performing an unbiased reconstruction of the underlying amplicon sequences to generate high-confidence reconstructed allele sequences. This is done without user input of the expected target gene, allowing for any source amplicon to be reconstructed. These high-confidence reconstructed allele sequences are then compared to the genomic reference sequence of the gene to infer the specific diplotype present in the sample. This approach is agnostic towards the number of genes and alleles present and readily detects novel variants. We demonstrate our approach using three independent data sets for CYP2D6, a diverse and complex gene with over 175 known alleles of clinical significance affecting drug dosing. We show how our approach can accurately recover validated CYP2D6 diplotypes from 20 Coriell samples sequenced using different primer sets, on different Oxford Nanopore Technologies flow cell versions, and to different depths. This includes inferring occurrences of copy number variation from relative abundances of each allele, a critical factor for ascribing functional effects to a diplotype. Further, we demonstrate our approach's utility for other genomic regions, including HLA.
bioinformatics2026-04-24v3Ancestra: A lineage-explicit simulator for benchmarking B-cell receptor repertoire and lineage inference methods
Hassanzadeh, R.; Abdollahi, N.; Kossida, S.; Giudicelli, V.; Eslahchi, C.Abstract
High-throughput B-cell receptor sequencing has transformed the analysis of adaptive immunity, but benchmarking clonal grouping and lineage reconstruction methods remains limited by the absence of datasets with known evolutionary histories. Here we present Ancestra, a lineage-explicit simulator of B-cell receptor heavy-chain affinity maturation. Ancestra models stochastic V(D)J recombination, context-dependent somatic hypermutation, affinity-based selection and clonal expansion while recording complete parent-child relationships and mutation events. The framework generates BCR heavy-chain sequence datasets together with their corresponding ground-truth lineage trees, enabling direct benchmarking of lineage-aware analytical methods. Across simulations, Ancestra recapitulates key properties of human repertoires, including complementarity-determining region 3 length distributions, amino-acid usage patterns, junctional mutation patterns consistent with IMGT criteria and heterogeneous branching topologies. Simulated lineages also reveal multi-label lineage trees, in which identical nucleotide sequences can arise independently along distinct evolutionary paths. Ancestra provides a practical foundation for rigorous benchmarking of lineage-aware immune repertoire analysis.
bioinformatics2026-04-24v2