Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
SVPG: A pangenome-based structural variant detection approach and rapid augmentation of pangenome graphs with new samples
Jiang, T.; Hu, H.; Gao, R.; Jiang, Z.; Zhou, M.; Gao, W.; Zhou, S.; Wang, G.
Abstract
Breakthrough advances in long-read sequencing technologies have opened unprecedented opportunities to study genetic variations through comprehensive pangenome analysis. However, the availability of structural variant (SV) calling tools that can effectively leverage pangenome information is limited. In addition, efficient construction of pangenome graphs becomes increasingly challenging with the acquisition of larger numbers of samples. In this study, we present SVPG, an approach that leverages a haplotype-resolved pangenome reference for accurate SV detection and rapid pangenome graph augmentation from long-read sequencing data. Compared to state-of-the-art SV callers, SVPG maintained superior overall performance across different coverages and sequencing technologies. SVPG also achieves notable improvements in calling rare and individual-specific SVs on both simulated and real somatic datasets. Furthermore, in a benchmark involving 20 samples, SVPG accelerated pangenome graph augmentation by nearly 10-fold compared to traditional augmentation strategies. We believe that this novel SVPG method has the potential to revolutionize SV detection and serve as an effective and essential tool, offering new possibilities for advancing pangenomic research.
bioinformatics 2026-03-20 v4
SVPG: A pangenome-based structural variant detection approach and rapid augmentation of pangenome graphs with new samples
Jiang, T.; Hu, H.; Gao, R.; Cao, S.; Jiang, Z.; Liu, Y.; Zhou, M.; Gao, W.; Zhou, S.; Wang, G.
Abstract
Breakthrough advances in long-read sequencing technologies have opened unprecedented opportunities to study genetic variations through comprehensive pangenome analysis. However, the availability of structural variant (SV) calling tools that can effectively leverage pangenome information is limited. In addition, efficient construction of pangenome graphs becomes increasingly challenging with the acquisition of larger numbers of samples. In this study, we present SVPG, an approach that leverages a haplotype-resolved pangenome reference for accurate SV detection and rapid pangenome graph augmentation from long-read sequencing data. Compared to state-of-the-art SV callers, SVPG maintained superior overall performance across different coverages and sequencing technologies. SVPG also achieves notable improvements in calling rare and individual-specific SVs on both simulated and real somatic datasets. Furthermore, in a benchmark involving 20 samples, SVPG accelerated pangenome graph augmentation by nearly 10-fold compared to traditional augmentation strategies. We believe that this novel SVPG method has the potential to revolutionize SV detection and serve as an effective and essential tool, offering new possibilities for advancing pangenomic research.
bioinformatics 2026-03-20 v3
Coupling codon and protein constraints decouples drivers of variant pathogenicity
Chen, R.; Palpant, N.; Foley, G.; Boden, M.
Abstract
Predicting the functional impact of genetic variants remains a fundamental challenge in genomics. Existing models focus on protein-intrinsic defects yet overlook regulatory constraints embedded within coding sequences. Here, we couple a codon language model (CaLM) with a protein language model (ESM-2) to dissect the drivers of variant pathogenicity. On ClinVar data, both modalities contribute near-equally to distinguishing pathogenic from benign variants. Evaluation across Deep Mutational Scanning and CRISPR-Based Genome Editing platforms in ClinMAVE reveals that loss-of-function variants are governed primarily by residue-level features, whereas gain-of-function variants show a greater relative contribution from codon-level constraints, albeit in a gene-specific manner. A controlled comparison of identical variants in BRCA1 and TP53 further suggests that codon-level signals are elevated in the endogenous genomic context. Together, these findings indicate that pathogenicity reflects both the "product" and the "process," and that the experimental platform may influence which dimension is observable.
bioinformatics 2026-03-20 v3
ChiMER: Integrating chromatin architecture into splicing graphs for chimeric enhancer RNAs detection
Xiang, Y.; Xiao, X.; Zhou, B.; Xie, L.
Abstract
Motivation: Enhancer-derived RNAs (eRNAs) and their fusion with protein coding genes represent a crucial yet understudied layer of transcriptional regulation. eRNAs are typically expressed at low levels, which makes fusion events difficult to detect with conventional fusion detection tools. In addition, these tools are not designed to capture fusion transcripts arising from spatial proximity between distal regulatory elements and gene loci. Reads spanning such regions are also frequently filtered as mapping artifacts. As a result, computational approaches for systematically identifying spatially mediated enhancer-exon fusion transcripts remain lacking. Methods: We developed ChiMER, a graph-based framework for detecting ChiMeric Enhancer RNAs from short-read RNA-seq data. ChiMER constructs splice graphs with chromatin contact information to introduce enhancer-exon edges and uses graph alignment to search for potential transcriptional paths. A ranking-based scoring module then prioritizes high-confidence events. Evaluations on simulated and real RNA-seq datasets show that ChiMER achieves higher sensitivity than conventional linear fusion detection methods while maintaining low false-positive rates. Results: Applied to cancer cell line RNA-seq datasets, ChiMER identified multiple enhancer-exon chimeric transcripts, several associated with super-enhancer regions. Multi-omics analysis further shows that fusion transcripts occur in transcriptionally active regulatory environments and frequently coincide with strong R-loop signals, suggesting a potential role of RNA-DNA hybrid structures in facilitating long-range transcriptional joining events.
bioinformatics 2026-03-20 v2
Integrative transcriptome-based drug repurposing in tuberculosis
Samart, K.; Thang, L.; Buskirk, L. R.; Tonielli, A. P.; Krishnan, A.; Ravi, J.
Abstract
Tuberculosis (TB) remains the leading cause of infectious disease mortality worldwide, killing over one million people annually. Rising antibiotic resistance has added urgency to the need for host-directed therapeutics (HDTs) that modulate host immune responses alongside directly targeting the pathogen. Repurposing FDA-approved drugs is particularly attractive for this purpose because their safety profiles are already well-established, substantially reducing development time and cost. Transcriptomic methods have successfully identified repurposable therapeutics for TB based on 'connectivity mapping,' which identifies drugs that reverse disease gene expression patterns. However, these applications are limited to a small subset of data belonging to a specific data platform and a few connectivity methods. Expanding beyond these constrained settings introduces substantial challenges, including dataset heterogeneity across transcriptomics platforms and biological conditions, uncertainty about optimal scoring methods, and the lack of systematic approaches to identify robust disease signatures. We developed a computational workflow that integrates 28 TB gene expression signatures and multiple connectivity scoring methods to capture dominant TB signals regardless of variation in microarray and RNAseq platforms, cell types, and infection conditions. We systematically identified 64 FDA-approved drugs as promising TB host-directed therapeutics. These high-confidence drug candidates include known HDTs, such as statins (rosuvastatin, fluvastatin, lovastatin) and tamoxifen, recently validated in experimental TB models. Our prioritized candidate drugs reveal enrichment for therapeutically TB-relevant mechanisms, e.g., cholesterol metabolism inhibition and immune modulation pathways. Network analysis of disease-drug interactions identified 12 key bridging genes (including IL-8, CXCR2) that represent potential novel druggable targets for TB host-directed therapy. 
This work establishes transcriptome-based connectivity mapping as a viable approach for systematic HDT discovery in bacterial infections and provides a robust computational framework applicable to other infectious diseases. Our findings offer immediate opportunities for experimental validation of prioritized drug candidates and mechanistic investigation of identified druggable targets in TB pathogenesis.
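The connectivity-mapping idea in the abstract above (ranking drugs by how strongly they reverse a disease expression signature) can be illustrated with a minimal sketch. This is not the authors' workflow: the score used here is simply the negated Pearson correlation over shared genes, a stand-in for the several connectivity scoring methods the abstract mentions, and the gene and drug names are hypothetical placeholders.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant signature")
    return cov / math.sqrt(var_x * var_y)


def reversal_score(disease_sig, drug_sig):
    """Assumed score: negated correlation over genes present in both signatures.

    A positive value means the drug tends to push expression in the
    direction opposite to the disease signature.
    """
    shared = sorted(set(disease_sig) & set(drug_sig))
    if len(shared) < 3:
        raise ValueError("need at least three shared genes")
    disease = [disease_sig[g] for g in shared]
    drug = [drug_sig[g] for g in shared]
    return -pearson(disease, drug)


def rank_drugs(disease_sig, drug_sigs):
    """Return (drug, score) pairs sorted from strongest reversal to weakest."""
    scores = {name: reversal_score(disease_sig, sig) for name, sig in drug_sigs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical log fold-change signatures keyed by placeholder gene names.
disease = {"G1": 2.0, "G2": 1.0, "G3": -1.0, "G4": -2.0}
drugs = {
    "drug_x": {"G1": -2.0, "G2": -1.0, "G3": 1.0, "G4": 2.0},  # reverses the signature
    "drug_y": {"G1": 2.0, "G2": 1.0, "G3": -1.0, "G4": -2.0},  # mimics the signature
}
ranking = rank_drugs(disease, drugs)
```

Integrating many signatures, as the abstract describes, would amount to computing such scores per signature and per scoring method and then aggregating the rankings; how that aggregation is done is not specified in the abstract and is left out here.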
bioinformatics 2026-03-20 v2
PyrMol: A Knowledge-Structured Pyramid Graph Framework for Generalizable Molecular Property Prediction
Li, Y.; Zhao, Q.; Wang, J.
Abstract
Expert pharmaceutical chemists interpret molecular structures through a sophisticated cognitive hierarchy, transitioning from local functional moieties to spatial pharmacophores and, ultimately, to macroscopic pharmacological and physicochemical profiles. However, conventional Graph Neural Networks frequently overlook this high-level chemical intuition by treating molecules as single-scale atomic topology. To bridge this gap between human expertise and computational inference, we propose PyrMol, a knowledge-structured pyramid representation learning framework. By constructing heterogeneous hierarchical graphs, PyrMol orchestrates information flow across atomic, subgraph, and molecular levels. Crucially, the subgraph level systematically integrates three complementary expert views comprising functional groups, pharmacophores, and retrosynthetic fragments. To harmonize these explicit domain priors with implicit computational semantics, we introduce an adaptive Multi-source Knowledge Enhancement and Fusion module that dynamically balances their complementarity and redundancy. A Hierarchical Contrastive Learning strategy further ensures cross-scale semantic consistency. Empirical evaluations across ten benchmark datasets demonstrate that PyrMol outperforms 12 state-of-the-art baselines. Furthermore, its "plug-and-play" versatility provides a framework-agnostic performance boost for existing GNN architectures. PyrMol thus establishes a principled data-knowledge dual-driven paradigm for AI-aided Drug Discovery, effectively leveraging domain knowledge to catalyze advances in molecular property prediction.
bioinformatics 2026-03-20 v2
A new pipeline for cross-validation fold-aware machine learning prediction of clinical outcomes addresses hidden data-leakage in omics based 'predictors'.
Hurtado, M.; Pancaldi, V.
Abstract
Motivation: Machine learning (ML) approaches are increasingly applied to high-dimensional biological data in which features are often dataset-dependent. In many omics workflows, features are computed using information derived from the entire dataset, such as correlations between variables, clustering structures, or enrichment scores. We refer to these as global dataset features, defined as features whose computation depends on properties of the full dataset. In such cases, standard validation strategies can fail, especially when evaluating on independent datasets, due to information leakage that leads to overly optimistic performance estimates. Results: To address this challenge, we present pipeML, a flexible and modular machine learning framework designed to support leakage-free model training through custom cross-validation (CV) fold construction. pipeML enables users to recompute global dataset features independently within each CV fold, ensuring strict separation between training and test data, while preserving compatibility with a wide range of ML algorithms for both classification and survival tasks. Using real-world biological datasets, we demonstrate that pipeML enables leakage-free model evaluation when global dataset features are used. We argue that overestimation of model performance during CV can lead to overoptimistic expectations for validation on independent datasets. By explicitly addressing data leakage and offering a transparent, modular workflow, pipeML provides a robust solution for developing and validating ML models in complex biological settings. Availability: The pipeML R package and a tutorial are available at https://github.com/VeraPancaldiLab/pipeML Contact: vera.pancaldi@inserm.fr or marcelo.hurtado@inserm.fr Supplementary information: Available at Bioinformatics online.
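The fold-aware principle described above can be sketched in a few lines. This is not pipeML's API (pipeML is an R package, and none of its functions are reproduced here); it is a generic Python illustration in which the "global dataset feature" is assumed to be a z-score whose mean and standard deviation are learned from the training part of each fold only and then applied to the held-out samples. All function names are hypothetical.

```python
import random
import statistics


def kfold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and deal them into k disjoint test folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_global_feature(train_values):
    """Learn the dataset-dependent parameters from training samples only."""
    mean = statistics.fmean(train_values)
    sd = statistics.pstdev(train_values) or 1.0  # guard against zero spread
    return mean, sd


def apply_global_feature(values, params):
    """Transform samples with parameters learned elsewhere (no refitting)."""
    mean, sd = params
    return [(v - mean) / sd for v in values]


def fold_aware_features(values, k=5, seed=0):
    """For each fold, fit on the training part and transform the test part."""
    results = []
    for test_idx in kfold_indices(len(values), k, seed):
        held_out = set(test_idx)
        train = [values[i] for i in range(len(values)) if i not in held_out]
        params = fit_global_feature(train)
        test = [values[i] for i in test_idx]
        results.append((test_idx, params, apply_global_feature(test, params)))
    return results


values = [float(v) for v in range(10)]
results = fold_aware_features(values, k=5, seed=0)
```

The leaky alternative would call `fit_global_feature(values)` once on all samples before splitting, so that every held-out sample has already influenced the parameters used to transform it.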
bioinformatics 2026-03-20 v2
Enhancing non-local interaction modeling for ab initio biomolecular calculations and simulations with ViSNet-PIMA
Cui, T.; Wang, Z.; Wang, T.
Abstract
AI-based molecular dynamics simulation brings ab initio calculations to biomolecules efficiently, with the machine learning force field (MLFF) playing the central role by accurately predicting molecular energies and forces. Most existing MLFFs assume localized interatomic interactions, limiting their ability to accurately model non-local interactions, which are crucial in biomolecular dynamics. In this study, we introduce ViSNet-PIMA, which efficiently learns non-local interactions through a physics-informed multipole aggregator (PIMA) and accurately encodes molecular geometric information. ViSNet-PIMA outperforms all state-of-the-art MLFFs for energy and force predictions of different kinds of biomolecules and various conformations on the MD22 and AIMD-Chig datasets, while adapting the PIMA blocks into other MLFFs further achieves 55.1% performance gains, demonstrating the superiority of ViSNet-PIMA and the universality of the model design. Furthermore, we propose AI2BMD-PIMA to incorporate ViSNet-PIMA into the AI2BMD simulation program by introducing a "Transfer Learning-Pretraining-Finetuning" scheme and replacing molecular mechanics-based non-local calculations among protein fragments with ViSNet-PIMA, which reduces AI2BMD's energy and force calculation errors by more than 50% for different protein conformations and protein folding and unfolding processes. ViSNet-PIMA advances ab initio calculation for entire biomolecules, amplifying the application value of AI-based molecular dynamics simulations and property calculations in biochemical research.
bioinformatics 2026-03-20 v1
RNAGAN: Train One and Get Four, Multipurpose Human RNA-Seq Analysis Tool with Enhanced Interpretability and Small Data Size Capability
HOU, Z.; Lee, V. H.-F.; Kwong, D. L.-W.; Guan, X.; Liu, Z.; Dai, W.
Abstract
The advent of artificial intelligence (AI) has brought revolutionary tools for biomedical transcriptomic (RNA-level) research. However, there are persistent constraints including limited interpretations with biomedical concepts such as functional pathways, small sample sizes and substantial time and computing power requirements for AI training. To overcome these limitations, we developed RNAGAN (https://github.com/ZhaozhengHou-HKU/RNAGAN-1.0.git), an AI tool with a generative adversarial network (GAN) structure with the objective of enhancing transcriptomic analysis. The network was established based on public human datasets comprising 4.6 million single cells from multiple organs and 5,900 sequenced samples of various cancer types with normal references. A specialized pathway neural layer was embedded to extract activities of predefined pathways from the Human Molecular Signatures Database (MSigDB), or newly learned pathways from single-cell data. The structure of RNAGAN (generator and discriminator) enables four applications after one shared training procedure: 1. single-cell and bulk-level patient stratification or differential diagnosis; 2. analysis of the gene and pathway markers in a selected disease; 3. pseudo data generation when sample size is limited for downstream analysis; 4. vectorization with gene and pathway-level features learned from multiple data sets. RNAGAN contributes to the efficient utilization of limited data for transcriptomic studies.
bioinformatics 2026-03-20 v1
TriGraphQA: a triple graph learning framework for model quality assessment of protein complexes
Liang, L.; Zhao, K.
Abstract
Accurate quality assessment of predicted protein-protein complex structures remains a major challenge. Existing graph-based quality assessment methods often treat the entire complex as a homogeneous graph, which obscures the physical distinction between intra-chain folding stability and inter-chain binding specificity. In this study, we introduce TriGraphQA, a novel triple graph learning framework designed for model quality assessment of protein complexes. TriGraphQA explicitly decouples monomeric and interfacial representations by constructing three geometric views: two residue-node graphs capturing the local folding environments of individual chains, and a dedicated contact-node graph representing the binding interface. Crucially, we propose an interface context aggregation module to project context-rich embeddings from the monomers onto the interface, effectively fusing multi-scale structural features. We conducted comprehensive tests on several challenging benchmark datasets, including Dimer50, DBM55-AF2, and HAF2. The results show that TriGraphQA significantly outperforms state-of-the-art single-model methods. TriGraphQA consistently achieves the highest global scoring correlations and lower top-ranking losses. Consequently, TriGraphQA provides a powerful evaluation tool for protein-protein docking, facilitating the reliable identification of near-native assemblies in large-scale structural modeling and molecular recognition studies.
bioinformatics 2026-03-20 v1
ECHO: a nanopore sequencing-based workflow for (epi)genetic profiling of the human repeatome
Poggiali, B.; Putzeys, L.; Andersen, J. D.; Vidaki, A.
Abstract
The human genome is dominated by repetitive DNA, whose genetic and epigenetic variation plays a key role in gene regulation, genome stability, and disease. Recent advances in long-read sequencing now enable large-scale, haplotype-resolved, and DNA methylation-informative analysis of the human genome, including in previously inaccessible complex and repetitive regions. However, the comprehensive, simultaneous characterisation of the "human repeatome" remains challenging, largely due to the lack of comprehensive tools integrated in a single pipeline that can capture the full spectrum of variation across diverse types of DNA repeats. Here, we present ECHO, a user-friendly, Snakemake-based pipeline for the "(Epi)genomic Characterisation of Human Repetitive Elements using Oxford Nanopore Sequencing". ECHO provides a reproducible and scalable framework for end-to-end analysis of whole-genome nanopore sequencing data, enabling integrative but also tailored (epi)genetic analyses of the human repeatome.
bioinformatics 2026-03-20 v1
CliPepPI: Scalable prediction of domain-peptide specificity using contrastive learning
Hochner-Vilk, T.; Stein, D.; Schueler-Furman, O.; Raveh, B.; Chook, Y. M.; Schneidman-Duhovny, D.
Abstract
Domain-peptide interactions mediate a significant fraction of cellular protein networks, yet accurately predicting their specificity remains challenging. Peptide motifs typically have short, fuzzy sequence profiles, and their interactions are often weak and transient, limiting the size, coverage, and quality of experimentally validated domain-peptide datasets. Since true non-binders are rarely known, constructing negative examples often introduces bias. While structure-based prediction methods can achieve high accuracy, they are computationally demanding and difficult to scale to the proteome level. We introduce CLIPepPI, a dual-encoder model that leverages contrastive learning to embed domains and peptides into a shared space directly from sequence. Both encoders are initialized from a protein language model (ESM-C) and fine-tuned using lightweight LoRA adapters, enabling parameter-efficient training on positive pairs alone. To overcome data scarcity, we augment ~3K protein-peptide complexes from PPI3D with ~150K domain-peptide pairs derived from protein-protein interfaces. CLIPepPI further injects structural information by marking interface residues in the domain sequence, thus guiding the encoders toward binding regions and linking sequence-level learning with structural context. Competitive performance is achieved across three independent benchmarks: domain-peptide complexes from PPI3D, large-scale phage-library data from ProP-PD, and a curated dataset of nuclear export signal (NES) sequences. We demonstrate scalability and generalization through two applications: (i) proteome-wide NES scanning, and (ii) variant-effect prediction, where score changes in domain-peptide interactions between wild-type and mutant sequences discriminate pathogenic from benign variants. Together, CLIPepPI offers a scalable, structure-informed model for predicting domain-peptide specificity and generating meaningful embeddings suited for large-scale proteomic analyses. 
CLIPepPI is available at: https://bio3d.cs.huji.ac.il/webserver/clipeppi/.
bioinformatics 2026-03-20 v1
RNASTOP: A Deep Learning Framework for mRNA Chemical Stability Prediction and Optimization
Lin, S.; Chen, J.; Sun, H.; Zhang, Y.; Yang, W.; tan, h.; Wei, D.-Q.; Jiang, Q.; Xiong, Y.
Abstract
Messenger RNA (mRNA) vaccines offer promising therapeutics for combating various diseases, yet their inherent chemical instability hampers their long-term efficacy. Although several methods have been developed to predict mRNA degradation, they exhibit limited accuracy and lack the capability for rational sequence optimization. Here, we propose RNASTOP, a novel framework integrating deep learning with heuristic search to simultaneously predict and optimize mRNA chemical stability. RNASTOP achieves a 13% accuracy improvement over the top-performing model on the Stanford OpenVaccine competition dataset and demonstrates robust generalization in predicting full-length mRNA degradation. Applied to mRNA codon optimization, RNASTOP reduces the minimum free energy of the Varicella-Zoster Virus vaccine sequence by 75.73% while maintaining high translation efficiency. Overall, RNASTOP serves as a powerful tool for predicting and optimizing mRNA chemical stability, poised to expedite the development of mRNA therapeutics. The source code of RNASTOP can be accessed at https://github.com/xlab-BioAI/RNASTOP.
bioinformatics 2026-03-20 v1
Computational Prediction of Plasmodium falciparum Antigen-T-cell Receptor Interactions via Molecular Docking: Implications for Malaria Vaccine Design
Kipkoech, G.; Kanda, W.; Irungu, B.; Nyangi, M.; Kimani, C.; Nyangacha, R.; Keter, L.; Atieno, D.; Gathirwa, J.; Kigondu, E.; Murungi, E.
Abstract
Malaria is one of the deadliest diseases in sub-Saharan Africa and Southeast Asia. Most fatalities occur in children under 5 years and pregnant women and are due to infection by Plasmodium spp., of which Plasmodium falciparum is the most virulent and is responsible for most of the morbidity and mortality. Despite various public health interventions such as the use of insecticide-treated bed nets, spraying of homes with insecticides, and use of WHO-recommended artemisinin-based combination therapies (ACT), malaria prevention still faces major setbacks due to drug and insecticide resistance by P. falciparum and mosquitoes, respectively. The study uses molecular docking and immunoinformatics to screen various Plasmodium spp. antigens and evaluate their antigenicity and suitability as vaccine candidates. The P. falciparum antigens and T-cell receptor (TCR) structures were obtained from the Protein Data Bank (PDB) based on a range of factors related to their role in the lifecycle of the parasite and their status as vaccine targets. Protein structures not available in the PDB were predicted using AlphaFold. The 3D structures of selected P. falciparum antigens and TCR structures were downloaded in PDB format, then all water molecules, HETATM records, and bound ligands were deleted from the protein structures using BIOVIA Discovery Studio Visualizer. Subsequently, molecular docking was done using the ClusPro v2.0 server and docked complexes were compared. The findings of this study gave valuable insights into the interaction of the human immune response with P. falciparum antigens. The three best-ranked antigen complexes are PfCyRPA, PfMSP10 and PfCSP, and this confirms their use as potential candidates for vaccine development. This study highlights the usefulness of computational docking in identifying P. falciparum antigens of excellent immunogenic potential as vaccine candidates.
bioinformatics 2026-03-20 v1
Dingent: An Easily Deployable Database Retrieval and Integration Agent framework
Kong, D.; Bei, S.; Wu, Y.; Tang, B.; Zhao, W.
Abstract
AI-driven data search and integration represent an emerging research direction. Although several LLM-based backend frameworks and agentic frameworks have emerged, a significant gap remains in developing a one-stop, configurable agent framework that supports various data sources and provides a web interface for efficient data retrieval using natural language. To address this, we present Dingent, a novel and configurable agent framework that facilitates data access from various resources and enables the flexible construction of agent applications. We demonstrate its capabilities across three distinct application scenarios, achieving promising results. The Dingent framework can be readily applied to other fields, such as earth sciences and ecology, to facilitate data discovery.
bioinformatics 2026-03-20 v1
A Multi-Dataset Transcriptomic Analysis Unravels Core Mechanisms Involving Vitamin D Metabolism and Inflammatory Pathways for Frailty Diagnosis.
Hu, X.; Zheng, W.; Li, Y.; Zhou, D.
Abstract
Frailty is a prevalent geriatric syndrome, and the shortage of objective biomarkers restricts its early diagnosis and intervention. This study aimed to identify robust molecular signatures and diagnostic markers for frailty using bioinformatics analyses of multiple independent datasets. Two transcriptome datasets (GSE144304, n=80; GSE287726, n=70) were obtained from the GEO database. We performed differential gene expression analysis, GO, KEGG and GSEA enrichment, and machine learning (70% training / 30% validation) to screen and validate core biomarkers. Numerous shared differentially expressed genes were identified. Vitamin D metabolism, ABC transporter, and inflammatory/immune pathways were consistently enriched and confirmed by GSEA. Machine learning models based on these signatures showed favorable diagnostic performance. Our study demonstrates that vitamin D metabolic disorders and chronic inflammation are core molecular features of frailty. The identified biomarkers provide new strategies for basic research, early clinical diagnosis, and therapeutic target development for frailty.
bioinformatics 2026-03-20 v1
Pareto optimization of masked superstrings improves compression of pan-genome k-mer sets
Plachy, J.; Sladky, O.; Brinda, K.; Vesely, P.
Abstract
The growing interest in k-mer-based methods across bioinformatics calls for compact k-mer set representations that can be optimized for specific downstream applications. Recently, masked superstrings have provided such flexibility by moving beyond de Bruijn graph paths to general k-mer superstrings equipped with a binary mask, thereby subsuming Spectrum-Preserving String Sets and achieving compactness on arbitrary k-mer sets. However, existing methods optimize superstring length and mask properties in two separate steps, possibly missing solutions where a small increase in superstring length yields a substantial reduction in mask complexity. Here, we introduce the first method for Pareto optimization of k-mer superstrings and masks, and apply it to the problem of compressing pan-genome k-mer sets. We model the compressibility of masked superstrings using an objective that combines superstring length and the number of runs in the mask. We prove that the resulting optimization problem is NP-hard and develop a heuristic based on iterative deepening search in the Aho-Corasick automaton. Using microbial pan-genome datasets, we characterize the Pareto front in the superstring-length/mask-run space and show that the front contains points that Pareto-dominate simplitigs and matchtigs, while nearly encompassing the previously studied greedy masked superstrings. Finally, we demonstrate that Pareto-optimized masked superstrings improve pan-genome k-mer set compressibility by 12-19% when combined with neural-network compressors.
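A toy sketch may help make the two objectives in the abstract above concrete. It is not the authors' heuristic (no Aho-Corasick automaton or iterative deepening is involved). It assumes that a mask bit of 1 at position i marks the k-mer starting at i as a member of the set, that a "run" is a maximal block of identical mask bits, and that reverse complements are ignored. All function names are hypothetical.

```python
def kmers_from_masked_superstring(superstring, mask, k):
    """Decode the k-mer set: a '1' at position i selects superstring[i:i+k]."""
    if len(superstring) != len(mask):
        raise ValueError("superstring and mask must have equal length")
    return {
        superstring[i:i + k]
        for i, bit in enumerate(mask)
        if bit == "1" and i + k <= len(superstring)
    }


def count_runs(mask):
    """Number of maximal blocks of identical bits (assumed definition of a run)."""
    return sum(1 for i, bit in enumerate(mask) if i == 0 or bit != mask[i - 1])


def objectives(superstring, mask):
    """The two quantities traded off: (superstring length, mask runs)."""
    return len(superstring), count_runs(mask)


def pareto_front(points):
    """Keep the points that no other point matches or beats in both objectives."""
    return [
        p for p in points
        if not any(q != p and q[0] <= p[0] and q[1] <= p[1] for q in points)
    ]


# Toy example with k = 3: the mask "11100" selects ACG, CGT and GTT.
kmer_set = kmers_from_masked_superstring("ACGTT", "11100", 3)
point = objectives("ACGTT", "11100")

# Hypothetical (length, runs) candidates representing the same k-mer set.
front = pareto_front([(5, 2), (6, 1), (6, 3), (7, 1)])
```

In this toy front, (6, 1) survives next to (5, 2) because it trades one extra superstring character for one fewer mask run, which is the kind of solution the abstract says a two-step optimization can miss.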
bioinformatics 2026-03-20 v1
GenBio-PathFM: A State-of-the-Art Foundation Model for Histopathology
Kapse, S.; Aygün, M.; Cole, E.; Lundberg, E.; Song, L.; Xing, E. P.
Abstract
Recent advancements in histopathology foundation models (FMs) have largely been driven by scaling the training data, often utilizing massive proprietary datasets. However, the long-tailed distribution of morphological features in whole-slide images (WSIs) makes simple scaling inefficient, as common morphologies dominate the learning signal. We introduce GenBio-PathFM, a 1.1B-parameter FM that achieves state-of-the-art performance on public benchmarks while using a fraction of the training data required by current leading models. The efficiency of GenBio-PathFM is underpinned by two primary innovations: an automated data curation pipeline that prioritizes morphological diversity and a novel dual-stage learning strategy which we term JEDI (JEPA + DINO). Across the THUNDER, HEST, and PathoROB benchmarks, GenBio-PathFM demonstrates state-of-the-art accuracy and robustness. GenBio-PathFM is the strongest open-weight model to date and the only state-of-the-art model trained exclusively on public data.
bioinformatics 2026-03-20 v1
WITHDRAWN: Beyond Binding Affinity: The Kinetic-Compatibility Hypothesis for Nipah Virus Neutralization
Bozkurt, C.
Abstract
The authors have withdrawn their manuscript because of a fundamental error in the identification of the biological target protein. The analysis was originally framed around the mechanical transitions of the Nipah virus Fusion (F) protein; however, the empirical functional data utilized (from the 2025 AdaptyvBio competition) was directed toward the Attachment (G) glycoprotein. While the sequence-level characterization of the binders remains internally consistent, the mechanical analogies used are not applicable to the Attachment (G) protein architecture. Therefore, the authors do not wish this work to be cited as reference for the project. If you have any questions, please contact the corresponding author.
bioinformatics 2026-03-19 v2
Frequency-domain kernels enable atlas-scale detection of spatially variable genes
Yang, C.; Zhang, X.; Chen, J.
Abstract
Identifying spatially variable genes in spatial transcriptomics requires methods that are accurate, well calibrated and scalable, yet current approaches trade expressive kernels for tractable computation. We present FlashS, which moves spatial testing to the frequency domain: Random Fourier Features and sparse sketching enable multi-scale kernel testing on zero-inflated data without constructing distance matrices, and a kurtosis-corrected null preserves calibration. Across 50 datasets from 9 platforms, FlashS achieves a mean Kendall τ of 0.935, exceeding the next-best method by 0.049. On the Allen Brain MERFISH atlas of 3.94 million cells, it completes in 12.6 minutes using 21.5 GB memory and maintains near-nominal false-positive rates under permutation. In human cardiac tissue, this improved ranking recovers a ventricular cardiomyocyte-associated mitochondrial biogenesis program that largely eludes parametric alternatives and replicates in an independent cohort.
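Random Fourier Features, named in the abstract above, approximate a shift-invariant kernel with an explicit feature map, so kernel values become dot products and no pairwise distance matrix is needed. The sketch below shows only that textbook construction for a Gaussian kernel on 2D coordinates. It is not FlashS: the sparse sketching, multi-scale testing and kurtosis-corrected null are not reproduced, and the bandwidth, feature count and function names are assumptions chosen for illustration.

```python
import math
import random


def make_rff(dim, n_features, sigma, seed=0):
    """Build a feature map z with z(x) . z(y) ~ exp(-||x - y||^2 / (2 sigma^2))."""
    rng = random.Random(seed)
    # Frequencies are drawn from the Fourier transform of the Gaussian kernel.
    freqs = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)] for _ in range(n_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def features(x):
        return [
            scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
            for w, b in zip(freqs, phases)
        ]

    return features


def gaussian_kernel(x, y, sigma):
    """Exact Gaussian kernel, used here only to check the approximation."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


# Two hypothetical spot coordinates.
features = make_rff(dim=2, n_features=4000, sigma=1.0, seed=0)
x, y = (0.0, 0.0), (0.5, 0.5)
zx, zy = features(x), features(y)
approx = sum(a * b for a, b in zip(zx, zy))
exact = gaussian_kernel(x, y, sigma=1.0)
```

The approximation error shrinks roughly with the inverse square root of the number of features, so the feature count controls the trade-off between accuracy and the per-cell cost of the feature map.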
bioinformatics 2026-03-19 v2
Composition and higher-order structure in nucleic acids sequenced from a chondrite
Farage, C.; Church, G. M.; Bachelet, I.
Abstract
The known tree of life occupies an infinitesimal region of the space of all mathematically possible evolutionary histories, yet our sequence analysis frameworks are implicitly calibrated to it and to its associated compositional and grammatical regularities. Here we analyze nucleic acid molecules sequenced from the Zag meteorite as part of a broader effort to understand how nucleic acid sequence composition and higher-order structure are shaped under chemically divergent environments. We characterize these sequences across multiple analytical layers, and show that they lack signatures of protein-coding organization, translational periodicity, or known biological grammar. At the same time, they deviate significantly from random or composition-only null models, displaying constrained complexity and low-dimensional structure in k-mer frequency space. Multiple tests place amplification and sequencing-driven artifacts and metagenomic contaminants at a low likelihood. Taken together, these findings indicate that the Zag sequences occupy an unusual region of sequence space that is not readily accounted for by known biological or technical models, thereby narrowing, but not resolving, the range of plausible explanations and motivating independent replication and further investigation.
bioinformatics2026-03-19v2Using Variable Window Sizes for Phylogenomic Analyses of Whole Genome Alignments
Ivan, J.; Lanfear, R.Abstract
Many phylogenomic studies have used non-overlapping windows to address gene tree discordance across a set of aligned genomes. Recently, Ivan et al. (2025) proposed an information theoretic approach to choose an optimal window size given the alignment. However, this approach selects only a single fixed window size per chromosome, which is a useful first step but fails to account for variation in the size of non-recombining regions along each chromosome. Such variation is expected to occur due to the stochastic nature of recombination as well as the variation in recombination rates along chromosomes. In this study, we extend the approach of Ivan et al. (2025) to allow window sizes to vary across the chromosome, using a splitting-and-merging strategy that allows for each window to be of an arbitrary length. We showed that the new method outperformed the fixed-window approach in recovering gene tree topologies on a wide range of simulated datasets. Applying the new method to the genomes of seven Heliconius butterflies, we found that the average window sizes for the group ranged from 538 to 808 bp, but with a very similar distribution of gene tree topologies compared to previous studies that used fixed window sizes. For the genomes of great apes, the average window sizes ranged from 4.2 kb to 6.2 kb, with the proportion of the major topology (i.e., grouping human and chimpanzee together) reaching approximately 80%. In conclusion, our study highlights the limitations of using a fixed window size when recombination rates vary across the chromosomes, and proposes a splitting-and-merging approach that allows for variable window sizes across whole genome alignments.
bioinformatics2026-03-19v2ChiMER: Integrating chromatin architecture into splicing graphs for chimeric enhancer RNAs detection
Xiang, Y.; Xiao, X.; Zhou, B.; Xie, L.Abstract
Motivation: Enhancer-derived RNAs (eRNAs) and their fusion with protein coding genes represent a crucial yet understudied layer of transcriptional regulation. eRNAs are typically expressed at low levels, which makes fusion events difficult to detect with conventional fusion detection tools. In addition, these tools are not designed to capture fusion transcripts arising from spatial proximity between distal regulatory elements and gene loci. Reads spanning such regions are also frequently filtered as mapping artifacts. As a result, computational approaches for systematically identifying spatially mediated enhancer-exon fusion transcripts remain lacking. Methods: We developed ChiMER, a graph-based framework for detecting ChiMeric Enhancer RNAs from short-read RNA-seq data. ChiMER constructs splice graphs with chromatin contact information to introduce enhancer-exon edges and uses graph alignment to search for potential transcriptional paths. A ranking-based scoring module then prioritizes high-confidence events. Evaluations on simulated and real RNA-seq datasets show that ChiMER achieves higher sensitivity than conventional linear fusion detection methods while maintaining low false-positive rates. Results: Applied to cancer cell line RNA-seq datasets, ChiMER identified multiple enhancer-exon chimeric transcripts, several associated with super-enhancer regions. Multi-omics analyses further show that fusion transcripts occur in transcriptionally active regulatory environments and frequently coincide with strong R-loop signals, suggesting a potential role of RNA-DNA hybrid structures in facilitating long-range transcriptional joining events.
bioinformatics2026-03-19v1SNMF: Ultrafast, Spatially-Aware Deconvolution for Spatial Transcriptomics
Alonso, L.; Ochoa, I.; Rubio, A.Abstract
Sequencing-based spatial transcriptomics has revolutionized the study of tissue architecture, but its 'spots' often contain multiple cells, creating a key computational challenge, termed deconvolution, to decipher each spot's cell-type composition. Reference-free deconvolution methods avoid the need for a matched single-cell RNA-seq dataset, but typically neglect the spatial correlation between neighboring spots and do not leverage modern hardware for efficient computation. Here, we propose SNMF (Spatial Non-negative Matrix Factorization): a rapid, accurate, and reference-free deconvolution method. SNMF extends the standard NMF framework with a spatial mixing matrix that models neighborhood influences, guiding the factorization toward spatially coherent solutions. Our R package is, to our knowledge, the first spatial transcriptomics deconvolution tool to natively support GPU execution, completing benchmark analyses in under one minute (over two orders of magnitude faster than the slowest competing methods) with moderate memory requirements. On synthetic and real benchmark datasets, SNMF significantly outperforms state-of-the-art methods in deconvolution accuracy, and on a human melanoma dataset it recovers biologically meaningful cell-type signatures, including a tumor-boundary transition zone, without any reference input. The proposed method is publicly available at https://github.com/ML4BM-Lab/SNMF.
bioinformatics2026-03-19v1ABAG-Rank: Improving Model Selection of AlphaFold Antibody-Antigen Complexes by Learning to Rank
Tadiello, M.; Ludaic, M.; Viliuga, V.; Elofsson, A.Abstract
Motivation: AlphaFold has transformed structural biology with an unprecedented accuracy in modeling protein structures and their interactions with biomolecules, with AlphaFold3 (AF3) achieving state-of-the-art performance. However, AF3 and other methods often struggle to accurately predict the structure of protein complexes that lack strong co-evolutionary information, such as antibody-antigen (Ab-Ag) complexes. One of the fundamental issues is that AF3 often generates accurate predictions, but fails to reliably distinguish them from the much larger set of incorrect ones. Results: To address this, we propose ABAG-Rank, a deep neural network that provides an efficient and robust solution for model selection of Ab-Ag interactions from a pool of structural ensembles predicted with AlphaFold. Built on the permutation-invariant DeepSets architecture, ABAG-Rank can process variable-sized ensembles of structural decoys and is directly applicable to prediction settings in which the number of candidates may vary. We train a model on a redundancy-reduced set of all known antibody-antigen complexes and find that simple geometric descriptors, along with confidence scores from AlphaFold, provide rich information about interface quality without requiring intensive physics-based calculations. Our experiments demonstrate that ABAG-Rank significantly outperforms AF3 internal scoring and the ranking performance of existing deep learning baselines. Implementation: Source code can be found at: https://github.com/tadteo/ABAG-Rank
bioinformatics2026-03-19v1GAP-MS: Automated validation of gene predictions using integrated mass spectrometry evidence
Abbas, Q.; Wilhelm, M.; Kuster, B.; Frischman, D.Abstract
Accurate genome annotation is fundamental to modern biology, yet distinguishing authentic protein-coding sequences from prediction artifacts remains challenging, particularly in complex plant genomes where automated methods are error-prone and manual curation is rarely feasible due to prohibitive time and costs. Here, we present GAP-MS (Gene model Assessment using Peptides from Mass Spectrometry), an automated proteogenomic pipeline that leverages mass spectrometry evidence to systematically validate the protein-level accuracy of predicted gene models. Applied across 9 major crop species, GAP-MS consistently improved prediction precision for four widely used gene prediction tools. In addition to filtering erroneous models, the pipeline identified hundreds of previously missing gene models from current standard reference annotations. These peptide-supported loci were further verified by transcriptional evidence, well-supported functional annotations, and high coding-potential scores. Together, these results demonstrate that direct proteomic evidence provides a robust framework for resolving annotation ambiguities, defining high-confidence reference proteomes, and uncovering overlooked protein-coding genes, while facilitating the identification of sequences that may require further investigation. GAP-MS is freely available as a web interface at https://webclu.bio.wzw.tum.de/gapms/.
bioinformatics2026-03-19v1A Cross-Study Multi-Organ Cell Atlas of Macaca fascicularis Informed by Human Foundation Model Annotation: A Resource for Translational Target Assessment
Souza, T. M.; Gamse, J. T.; Moreno, L.; van Rumpt, M.; Nunez-Moreno, G.; Khatri, I.; van Asten, S. D.; Khusial, N. V.; Baltasar-Perez, E.; Adhav, R.; Abdelaal, T.; Wojtuszkiewicz, A.; Calis, J. J. A.; Csala, A.; Dahlman, A.; Fuller, C. L.; Thalhauser, C. J.; Kolder, I. C. R. M.Abstract
Non-human primates (NHPs), particularly Macaca fascicularis (cynomolgus macaque), represent an essential model for preclinical assessment of biologics due to their high genetic and physiological similarity to humans. However, mounting regulatory pressure to reduce NHP use and the lack of a unified, well-annotated single-cell atlas currently limit both target qualification and mechanistic interpretation of toxicity in this species. To address this gap, we assembled and harmonized the largest single-cell transcriptomic atlas of M. fascicularis to date, integrating 30 publicly available studies spanning 57 anatomical regions, 43 organs and 14 physiological systems. We implemented a scalable framework for cross-species cell type annotation by embedding both cynomolgus monkey and human (Tabula Sapiens V2) datasets into a shared reference space using Universal Cell Embeddings (UCE), enabling consistent harmonization of cell identities. In total, 27 organs were annotated using human reference labels, while the remaining sets retained author-provided annotations or labels transferred from other cynomolgus studies with available annotations. The resulting atlas comprises over 2.5 million high-quality cells and demonstrates strong concordance in cell-type-specific expression patterns between cynomolgus and humans, including tissue-specific markers and targets relevant for biologics development. Through multiple translational use cases, we illustrate how this resource can be applied to assess target expression in tissues affected by concordant human-NHP toxicities, investigate ocular adverse events associated with antibody-drug conjugates (ADCs), and identify species-specific features of immune cell subtypes with known safety implications.
By enabling scalable, high-resolution, cross-species comparisons of gene expression across organs, tissues, and cell states, this atlas supports improved target qualification, more mechanistic interpretation of toxicities, and evidence-based decisions on the relevance and design of NHP studies. Collectively, this work provides a unified cross-species single-cell resource for cynomolgus monkey and a modular computational framework that advances new approach methodologies and contributes to the refinement and reduction of NHP use in preclinical research.
bioinformatics2026-03-19v1ProteinSage: From implicit learning to explicit structural constraints for efficient protein language modeling
Shen, L.; Chao, L.; Liu, T.; Liu, Q.; Zhou, G.; Wang, H.; Dong, X.; Li, T.; Zhang, X.; Ni, J.Abstract
While protein language models typically rely on sequence-only pretraining objectives, this approach often fails to capture structural regularities and demands large datasets. To address this, we introduce ProteinSage, a pretraining framework that learns protein representations under explicit structural constraints. ProteinSage incorporates structural signals via structure-guided masking and a causal objective designed to model long-range dependencies. This structure-constrained pretraining equips ProteinSage with transferable representations using less data and computation, yet achieves competitive or superior performance across diverse structure-aware and general protein modeling benchmarks. To determine whether these gains stem from genuine structural generalization rather than task-specific fitting, we applied ProteinSage to a structure-driven protein discovery task, focusing on proteins with multi-pass transmembrane helical architectures such as distantly related microbial rhodopsins. The model successfully identified six previously unannotated microbial rhodopsin homologs. Together, our work establishes structure-constrained pretraining as an effective pathway toward data-efficient and structurally faithful protein representation learning.
bioinformatics2026-03-19v1evedesign: accessible biosequence design with a unified framework
Hopf, T. A.; Gazizov, A.; Garcia Busto, S.; Eschbach, E.; Lee, S.; Mirdita, M.; Orenbuch, R.; Belahsen, K.; Ross, D.; Sander, C.; Steinegger, M.; d'Oelsnitz, S.; Marks, D.Abstract
Machine learning methods for protein engineering are rarely interoperable, require bespoke workflows, and remain inaccessible to non-experts. Yet the design problems that matter most - conditional design subject to real-world constraints, multi-objective optimization, and iterative lab-in-the-loop workflows where experimental data continuously refines successive design rounds - demand exactly the kind of flexible, composable infrastructure that no single tool provides. We present evedesign, a unified open-source framework that formalizes conditional biosequence design in a method-agnostic way, enabling complex multi-objective workflows combining supervised and unsupervised models from standardized specifications, and built from the outset to support iterative experimental integration. An interactive web interface facilitates end-to-end design for a broad scientific audience at https://evedesign.bio. We demonstrate evedesign's utility in antibody engineering, enzyme design, and natural enzyme discovery, and invite open-source community contributions.
bioinformatics2026-03-19v1SELFormerMM: multimodal molecular representation learning via SELFIES, structure, text, and knowledge graph integration
Ulusoy, E.; Bostanci, S.; Deniz, B. E.; Dogan, T.Abstract
Motivation: Molecular representation learning is central to computational drug discovery. However, most existing models rely on single-modality inputs, such as molecular sequences or graphs, which capture only limited aspects of molecular behaviour. Yet unifying these modalities with complementary resources such as textual descriptions and biological interaction networks into a coherent multimodal framework remains non-trivial, hindering more informative and biologically grounded representations. Results: We introduce SELFormerMM, a multimodal molecular representation learning framework that integrates SELFIES notations with structural graphs, textual descriptions, and knowledge graph-derived biological interaction data. By aligning these heterogeneous views, SELFormerMM effectively captures complementary signals that unimodal approaches often overlook. Our performance evaluation has revealed that SELFormerMM outperforms structure-, sequence-, and knowledge-based models on multiple molecular property prediction tasks. Ablation analyses further indicate that effective cross-modal alignment and modality coverage improve the model's ability to exploit complementary information. Overall, integrating SELFIES with structural, textual, and biological context enables richer molecular representations and provides a promising framework for hypothesis-driven drug discovery. Availability: SELFormerMM is available as a programmatic tool, together with datasets, pretrained models, and precomputed embeddings at https://github.com/HUBioDataLab/SELFormerMM. Contact: tuncadogan@gmail.com
bioinformatics2026-03-19v1PhyloRNA: a database of RNA secondary structures with associated phylogenies
Quadrini, M.; Tesei, L.Abstract
The ability to access, search, and analyse large collections of RNA molecules together with their secondary structure and evolutionary context is essential for comparative and phylogeny-driven studies. Although RNA secondary structure is known to be more conserved than primary sequence, no existing resource systematically associates individual RNA molecules with curated phylogenetic classifications. Here, we introduce PhyloRNA, a curated meta-database that provides large-scale access to RNA secondary structures collected from public resources or derived from experimentally resolved 3D structures. PhyloRNA allows users to search, select, and download extensive sets of RNA molecules in multiple textual formats, each entry being explicitly linked to phylogenetic annotations derived from five curated taxonomy systems. In addition to taxonomic information, each RNA molecule is accompanied by a rich set of descriptors, including pseudoknot order, genus, and three levels of structural abstraction - Core, Core Plus, and Shape - which facilitate comparative analyses across sets of molecules. PhyloRNA is publicly available at https://bdslab.unicam.it/phylorna/ and is regularly updated to incorporate newly available data and revised taxonomic annotations.
bioinformatics2026-03-19v1NOHIC: A PIPELINE FOR PLANT CONTIG SCAFFOLDING USING PERSONALIZED REFERENCES FROM PANGENOME GRAPHS
Nguyen-Hoang, A.; Arslan, K.; Kopalli, V.; Windpassinger, S.; Perovic, D.; Stahl, A.; Golicz, A.Abstract
Hi-C data is commonly used for reference-free de novo scaffolding. However, with the rapid increase in high-quality reference genomes, reference-guided workflows are now more practical for assembling large numbers of target genomes without relying on costly and labor-intensive Hi-C sequencing. Recently, a pangenome graph-based haplotype sampling algorithm was introduced to generate personalized graphs for target genomes. Such graphs have strong potential as references for reference-guided contig scaffolding. Here, we present noHiC, a reference-guided scaffolding pipeline supporting key steps of plant contig scaffolding. A distinctive feature of noHiC is the nohic-refpick script, generating a best-fit synthetic reference (synref) from a pangenome graph that is genetically close to the target contigs. This enables the integration of genetic information from many references (up to 48 in our tests) without using them separately during scaffolding. Synrefs showed advantages over highly contiguous conventional references in reducing false contig breaking during reference-based correction. Additionally, nohic-refpick can be combined with fast scaffolders (ntJoin) to rapidly produce highly contiguous assemblies using synrefs derived from pangenome graphs. The noHiC pipeline, used alone or in combination with ntJoin, can generally produce assemblies that are structurally consistent with public Hi-C-based or manually curated genomes. The pipeline is publicly available at https://github.com/andyngh/noHiC.
bioinformatics2026-03-19v1Translating Histopathology Foundation Model Embeddings into Cellular and Molecular Features for Clinical Studies
Cui, S.; Sui, Z.; Li, Z.; Matkowskyj, K. A.; Yu, M.; Grady, W. M.; Sun, W.Abstract
AI-powered pathology foundation models provide general-purpose representations of histopathological images by encoding image tiles into numerical embeddings. However, these embeddings are not directly interpretable in biological or clinical terms and must be translated into biologically meaningful features, such as cell-type composition or gene expression, to enable downstream clinical applications. To bridge this gap, we developed STpath, a framework that integrates histopathology image embeddings derived from existing pathology foundation models with matched, spatially resolved transcriptomics data. STpath consists of cancer-specific XGBoost models trained to infer cell-type compositions and gene expression from histopathology image tiles. We evaluated STpath in colorectal and breast cancer datasets and showed that it provides accurate estimates of the composition of major cell types and the expression of a subset of genes, with further performance gains achieved by combining embeddings from multiple foundation models. Finally, we demonstrated that STpath-inferred features can be used in downstream studies to evaluate their associations with clinical outcomes.
bioinformatics2026-03-19v1G-VEP: GPU-Accelerated Variant Effect Prediction for Clinical Whole-Genome Sequencing Analysis
Green, E.; Mardinoglu, A.Abstract
Whole-genome sequencing (WGS) has transformed clinical diagnostics, yet variant annotation remains a computational bottleneck. The Variant Effect Predictor (VEP) integrates pathogenicity predictors and population databases essential for ACMG/AMP variant classification, but these annotation plugins are fundamentally I/O-bound, consuming over 70% of total pipeline runtime. Here, we present G-VEP, a GPU-accelerated annotation framework built on a custom CUDA kernel that replaces sequential per-variant database lookups with massively parallel binary search across precomputed indices. By executing annotation lookups for all input variants simultaneously, G-VEP reduces plugin runtime from 72 minutes to 4 minutes (17-fold acceleration) and total annotation runtime from 100 minutes to 33 minutes (3-fold acceleration), while maintaining complete concordance with standard VEP output. Benchmarking across 75 clinical WGS samples demonstrated consistent performance, with no annotation discrepancies; validation on samples containing known pathogenic variants confirmed the preservation of all clinically significant findings. The 8.8 GB index footprint fits within consumer-grade 16 GB GPUs. G-VEP addresses an unmet need in clinical WGS analysis: while GPU suites such as NVIDIA Parabricks accelerate alignment and variant calling, they do not provide the Ensembl VEP plugin ecosystem used in clinical interpretation. G-VEP removes this final bottleneck and enables accelerated WGS interpretation. G-VEP is freely available through a web-based user interface with REST API documentation at https://www.phenomeportal.org/gvep, and source code for local installation and deployment at https://github.com/Phenome-Longevity/G-VEP.
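The lookup pattern this abstract describes, binary search over a precomputed sorted index, can be sketched on the CPU as follows. This is a minimal illustration and not G-VEP's CUDA kernel: the key layout `(chrom, pos, ref, alt)`, the function names, and the scores are all hypothetical. The point is that each variant's lookup is independent of the others, which is what makes the batch easy to parallelize (for example, one lookup per GPU thread).

```python
from bisect import bisect_left

def build_index(records):
    """Sort (key, annotation) pairs once so lookups can use binary search."""
    ordered = sorted(records, key=lambda r: r[0])
    keys = [k for k, _ in ordered]
    values = [v for _, v in ordered]
    return keys, values

def lookup(keys, values, variant):
    """Binary-search one variant key; return its annotation or None."""
    i = bisect_left(keys, variant)
    if i < len(keys) and keys[i] == variant:
        return values[i]
    return None

def annotate(keys, values, variants):
    """Annotate a batch; every lookup is independent of the others."""
    return [lookup(keys, values, v) for v in variants]

# Hypothetical index entries: (chrom, pos, ref, alt) -> pathogenicity score.
records = [
    (("1", 1000, "A", "G"), 0.91),
    (("2", 500, "C", "T"), 0.12),
    (("1", 250, "G", "A"), 0.40),
]
keys, values = build_index(records)
print(annotate(keys, values, [("1", 250, "G", "A"), ("3", 1, "A", "C")]))
```

Each lookup costs O(log n) comparisons against the index, compared with a database round trip per variant in a sequential plugin.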
bioinformatics2026-03-19v1ST-PARM: Pareto-Complete Inference-Time Alignment for Multi-Objective Protein Design
Yin, R.; Shen, Y.Abstract
Motivation: Protein engineering is inherently multi-objective: improving one property can degrade others, so practical workflows require generating non-dominated (Pareto-optimal) candidates spanning a trade-off surface. Linear objective scalarization and deterministic pairwise preference learning can under-explore non-convex Pareto regions and amplify noise from uncertain evaluators, limiting Pareto coverage and trade-off controllability. Results: We introduce Smooth Tchebycheff Preference-Aware Reward Model (ST-PARM), an inference-time alignment framework that steers a frozen protein language model along user-specified trade-offs with a lightweight reward model trained only once. ST-PARM combines (i) a reward-calibrated pairwise preference loss that is uncertainty-aware by down-weighting ambiguous comparisons under noise, (ii) a smooth Tchebycheff scalarization that is Pareto-complete in principle and improves empirical trade-off coverage, and (iii) latent-space pair-construction strategies. On GFP fluorescence-stability (full-length design) and IL-6 nanobody stability-solubility (CDR3+suffix design), ST-PARM delivers broader Pareto coverage and stronger preference tracking than baselines PARM and MosPro. For GFP, a conservative structural screen for local confidence and global fold preservation retains a broad frontier and strong controllability, yielding an actionable cohort for downstream assays. We also provide cross-evaluator robustness checks, a three-objective extension, and a natural-language alignment generality check in the Supplement, establishing a practical foundation for controllable sequence generation under competing objectives and noisy measurements. Availability and Implementation: https://github.com/Shen-Lab/ST-PARM.
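For readers unfamiliar with smooth Tchebycheff scalarization, the sketch below shows the commonly used log-sum-exp form, written for objectives to be minimized. The abstract does not give ST-PARM's exact formulation, so the function names, the ideal point, and the smoothing parameter `mu` are assumptions for illustration; properties to be maximized (such as stability) would be negated first.

```python
import math

def tchebycheff(f, weights, ideal):
    """Classical (non-smooth) Tchebycheff: worst weighted gap to the ideal point."""
    return max(w * (fi - zi) for fi, w, zi in zip(f, weights, ideal))

def smooth_tchebycheff(f, weights, ideal, mu=0.1):
    """Smooth surrogate: mu * log(sum(exp(w_i * (f_i - z_i) / mu)))."""
    terms = [w * (fi - zi) / mu for fi, w, zi in zip(f, weights, ideal)]
    m = max(terms)  # subtract the max before exponentiating for numerical stability
    return mu * (m + math.log(sum(math.exp(t - m) for t in terms)))

# Hypothetical two-objective candidate with equal preference weights.
f = [0.3, 0.8]
weights = [0.5, 0.5]
ideal = [0.0, 0.0]
hard = tchebycheff(f, weights, ideal)
soft = smooth_tchebycheff(f, weights, ideal, mu=0.1)
print(hard, soft)
```

The smooth value always lies between the classical Tchebycheff value and that value plus `mu * log(m)` for `m` objectives, so shrinking `mu` tightens the approximation while keeping the function differentiable.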
bioinformatics2026-03-19v1RiboBA: a bias-aware probabilistic framework for robust ORF identification across diverse ribosome profiling protocols
BAI, J.; Yang, R.Abstract
By mapping ribosome-protected fragments (RPFs) genome-wide, ribosome profiling (Ribo-seq) has uncovered extensive translation beyond conventional coding sequences, revealing non-canonical ORFs (ncORFs) with emerging roles in diverse biological processes. However, protocol-induced biases introduced during library construction can substantially distort RPF signals. Most existing ORF callers are not designed to explicitly account for such artifacts, limiting robust ncORF identification. Here, we present RiboBA, a bias-aware probabilistic framework to address this challenge. RiboBA consists of two main components: a generative module that recovers protocol-induced biases and codon-level ribosome occupancy, and a supervised module that identifies translated ORFs and initiation sites using the resulting bias-adjusted profiles. Evaluated through simulations and on a range of Ribo-seq datasets, particularly those supported by cell-type-specific immunopeptidomics, RiboBA robustly recovers protocol-induced parameters and achieves superior accuracy and sensitivity in ncORF identification. Notably, RiboBA performs particularly well on RNase I libraries with attenuated three-nucleotide periodicity, as well as on MNase and nuclease P1 libraries, while maintaining competitive runtimes. In a Drosophila case study, RiboBA identifies conserved ncORFs with coding potential, including recurrent upstream translation of ThrRS and Mettl2 that suggests a potential threonine-specific translational control axis.
bioinformatics2026-03-19v1Semantic-Aware Energy-Efficient Operation in Smart Capsule Endoscopy
Zoofaghari, M.; Rahaimifard, A.; Chatterjee, S.; Balasingham, I.Abstract
Goal-oriented semantic communication has recently emerged in wireless sensor-actuator networks, emphasizing the meaning and relevance of information over raw data delivery, thereby enabling resource-efficient telecommunication. This paradigm offers significant benefits for intra-body or implantable sensor-actuator networks, including dramatic reductions in bandwidth requirements, latency, and power consumption. In this paper, we present a patch-based energy-efficient anomaly detection method for smart capsule endoscopy. We propose a deep learning-based algorithm that employs the similarity between features extracted from measured images and a reference (normal) image as the detection metric. The algorithm is evaluated using a clinical dataset of capsule-captured images, combined with a simulated intra-body channel model. The results demonstrate that even with only 60% of the transmission power (relative to a standard link design for QPSK modulation) and 65% of the light intensity, the probability of anomaly detection remains above 85%, and it gradually improves as power and illumination levels increase. This improvement translates into a potential battery life extension of over 43%. The findings highlight the potential of semantic-aware, energy-efficient intra-body devices for more sustainable and effective medical interventions.
bioinformatics2026-03-19v1Identification and classification of all Cytochrome P450 deposits in the Protein Data Bank
Smieja, P.; Zadrozna, M.; Syed, K.; Nelson, D.; Gront, D.Abstract
Cytochrome P450 monooxygenases (CYPs/P450s) form a highly diverse enzyme superfamily central to biotechnology, pharmacology, and environmental science. Despite the large number of available structures, identifying and comparing P450 entries in structural repositories remains challenging due to their extreme sequence divergence and inconsistent annotation practices. In particular, many deposits lack the standardized nomenclature (CYPid) and rather rely on legacy or author-defined common names (like P450cam, P450BM-3 and P450-PCN1), which are often inconsistent in formatting and specificity. This is particularly difficult for a superfamily as sequentially diverse as P450s. This hinders reliable retrieval and cross-referencing, making even the identification of all P450 structures in the database nontrivial. To overcome these obstacles, we developed a structure-guided discovery and validation workflow combining keyword search, Hidden Markov Models, and structural alignment, enabling robust detection and annotation. This strategy identified 1,513 deposits representing 674 unique sequences. All sequences were reannotated using the P450Atlas server and manually verified, confirming high assignment accuracy. In the process, we have also identified five new CYP subfamilies. The resulting dataset constitutes the first rigorously curated, structure-linked registry of P450 enzymes, integrated into a publicly accessible resource and supported by an automated pipeline that periodically scans newly released entries. By unifying structurally validated identification with standardized CYP nomenclature, this work establishes a reliable framework for accurate retrieval, comparison, and future large-scale analyses of P450 enzymes.
bioinformatics2026-03-19v1Super Bloom: Fast and precise filter for streaming k-mer queries
Conchon-Kerjan, E.; Rouze, T.; Robidou, L.; Ingels, F.; Limasset, A.Abstract
Approximate membership query structures are used throughout sequence bioinformatics, from read screening and metagenomic classification to assembly, indexing, and error correction. Among them, Bloom filters remain the default choice. They are not the most efficient structures in either time or memory, but they provide an effective compromise between compactness, speed, simplicity, and dynamic insertions, which explains their widespread adoption in practice. Their main drawback is poor cache locality, since each query typically requires several random memory accesses. Blocked Bloom filters alleviate this issue by restricting accesses for any given element to a single memory block, but this usually comes with a loss in accuracy at fixed memory. In this work, we introduce the Super Bloom Filter, a Bloom filter variant designed for streaming k-mer queries on biological sequences. Super Bloom uses minimizers to group adjacent k-mers into super-k-mers and assigns all k-mers of a group to the same memory block, thereby amortizing random accesses over consecutive k-mer queries and improving cache efficiency. We further combine this layout with the findere scheme, which reduces false positives by requiring consistent evidence across overlapping subwords. We provide a theoretical analysis of the construction of Super Bloom filters, showing how minimizer density controls the expected reduction in memory transfers, and derive a practical parameterization strategy linking memory budget, block size, collision overhead, and the number of hash functions to robust false-positive control. Across a broad range of memory budgets and numbers of hash functions, Super Bloom consistently outperforms existing Bloom filter implementations, with several-fold time improvements. 
As a practical validation, we integrated it into a Rust reimplementation of BioBloom Tools, a sequence screening tool that builds filters from reference genomes and classifies reads through k-mer membership queries for applications such as host removal and contamination filtering. This replacement yields substantially faster indexing and querying than both the original C++ implementation and Rust variants based on Bloom filters and blocked Bloom filters. The findere scheme also reduces false positives by several orders of magnitude, with some configurations yielding no observed false positives among 10^9 random queried k-mers. Code is available at https://github.com/EtienneC-K/SuperBloom and https://github.com/Malfoy/SBB
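The layout idea in this abstract (k-mers that share a minimizer are routed to the same memory block, so consecutive queries touch one block) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the Super Bloom implementation: the class and function names, the parameters, and the hashing choices are hypothetical, canonical k-mers are ignored, and the findere scheme is not implemented.

```python
import hashlib

def h(s, seed=0):
    """Deterministic 64-bit hash of a string (reproducible across runs)."""
    digest = hashlib.blake2b(s.encode(), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")

def minimizer(kmer, m):
    """Return the m-mer of `kmer` with the smallest hash value."""
    return min((kmer[i:i + m] for i in range(len(kmer) - m + 1)), key=h)

def super_kmers(seq, k, m):
    """Group consecutive k-mers sharing a minimizer into (minimizer, count) runs."""
    runs = []
    for i in range(len(seq) - k + 1):
        mz = minimizer(seq[i:i + k], m)
        if runs and runs[-1][0] == mz:
            runs[-1][1] += 1
        else:
            runs.append([mz, 1])
    return [(mz, n) for mz, n in runs]

class BlockedKmerFilter:
    """Blocked Bloom filter whose block is chosen by the k-mer's minimizer."""

    def __init__(self, n_blocks=1024, block_bits=512, n_hashes=3, k=15, m=7):
        self.n_blocks, self.block_bits = n_blocks, block_bits
        self.n_hashes, self.k, self.m = n_hashes, k, m
        self.blocks = [0] * n_blocks  # each block is a Python int used as a bitset

    def _positions(self, kmer):
        # All k-mers with the same minimizer land in the same block.
        block = h(minimizer(kmer, self.m)) % self.n_blocks
        bits = [h(kmer, seed=i + 1) % self.block_bits for i in range(self.n_hashes)]
        return block, bits

    def add_sequence(self, seq):
        for i in range(len(seq) - self.k + 1):
            block, bits = self._positions(seq[i:i + self.k])
            for b in bits:
                self.blocks[block] |= 1 << b

    def contains(self, kmer):
        block, bits = self._positions(kmer)
        return all((self.blocks[block] >> b) & 1 for b in bits)

seq = "ACGTTGCATGCCGATAGCTAGCTAGGATCCGATCGATTAGC"
f = BlockedKmerFilter()
f.add_sequence(seq)
print(super_kmers(seq, 15, 7))
```

Because a run of adjacent k-mers usually shares one minimizer, a streaming query over a read revisits the same block many times before moving on, which is the cache-locality benefit the abstract describes.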
bioinformatics2026-03-19v1STiLE: Automated Tissue Microarray Dearraying for Spatial Transcriptomics
Sinha, H.; Das, A.; Chiu, Y.-C.; Gao, S.-J.; Huang, Y.Abstract
Tissue microarrays (TMAs) enable high-throughput spatial transcriptomic profiling of dozens of tissue cores on a single slide. However, existing dearraying methods operate on histological images and do not support the coordinate-based outputs of spatial transcriptomics platforms. Therefore, the task of assigning cells to their respective cores (dearraying) remains a manual bottleneck. We present STiLE, a tool for automated TMA dearraying that operates solely on cell centroid coordinates. By eliminating dependence on image data, STiLE is robust to artifacts such as variable staining quality and uneven illumination. The algorithm combines connectivity-based component detection, density-based clustering (HDBSCAN), component-guided cluster merging, and optional grid-based peak detection. Validation on eleven public TMA samples (50-150 cores, three platforms) achieved ARI > 0.99, while systematic benchmarking on 396 synthetic datasets with realistic artifacts demonstrated consistently robust performance (mean ARI = 0.992). STiLE accepts standard formats (AnnData, CSV) and is platform-agnostic, supporting diverse platforms including Vizgen MERSCOPE, 10x Xenium, and NanoString CosMx. An interactive Streamlit interface enables parameter tuning, visual inspection, and region-based processing for large slides.
bioinformatics2026-03-19v1High Resolution Solvated Models Reveal Mechanisms of Allosteric Activation of mTORC1 by RHEB
Ghosh, P.; Maity, A.; Kutti, V. R.; Venkatramani, R.Abstract
The mechanistic target of rapamycin complex 1 (mTORC1) is a ~1.2 MDa dimeric assembly comprising mTOR, mLST8, and RAPTOR that integrates nutrient, energy, and stress signals to regulate cell growth. While cryo-EM structures have provided insights into allosteric activation of the complex by the small GTPase RHEB, their limited resolution has constrained a full mechanistic understanding. Here, we combine deep learning-based AlphaFold-3 models with Molecular Dynamics Flexible Fitting and simulations to generate refined, atomistic solvated models of mTORC1±ATP±RHEB. Simulations reveal a global remodelling of the complex by RHEB, which strengthens mTOR-RAPTOR interactions while weakening mTOR-mLST8 contacts. These drive the reorganization of kinase N- and C-lobes into a catalytically competent state in which ATP binding is stabilized enthalpically with improved magnesium ion coordination. Our studies present structural, energetic and dynamic changes induced by RHEB binding which collectively cause allosteric preorganization of mTORC1 for catalysis prior to substrate binding.
bioinformatics2026-03-19v1StrucTTY: An Interactive, Terminal-Native Protein Structure Viewer
Jang, L. S.-e.; Cha, S.; Steinegger, M.Abstract
Terminal-based workflows are central to large-scale structural biology, particularly in high-performance computing (HPC) environments and SSH sessions. Yet no existing tool enables real-time, interactive visualization of protein backbone structures directly within a text-only terminal. To address this gap, we present StrucTTY, a fully interactive, terminal-native protein structure viewer. StrucTTY is a single self-contained executable that loads multiple PDB and mmCIF files, normalizes three-dimensional coordinates, and renders protein structures as ASCII graphics. Users can rotate, translate, and zoom in on structures, adjust visualization modes, inspect chain-level features and view secondary structure assignments. The tool supports simultaneous visualization of up to nine protein structures and can directly display structural alignments using Foldseek's output, enabling rapid comparative analysis in headless environments. The source code is available at https://github.com/steineggerlab/StrucTTY
bioinformatics2026-03-19v1DOTSeq enables genome-wide detection of differential ORF usage
Lim, C. S.; Chieng, G. S. W.Abstract
Protein synthesis is regulated by multiple cis-regulatory elements, including small ORFs, yet current differential translation methods assume uniform changes at the gene level. We present DOTSeq, a Differential ORF Translation statistical framework that resolves ORF-level regulation in bulk ribosome profiling (Ribo-seq) experiments and provides ORF-level read summarisation for single-cell Ribo-seq. DOTSeq's core module, Differential ORF Usage (DOU), quantifies changes in an ORF's relative contribution to a gene's translation output, using a beta-binomial GLM with flexible dispersion modelling. DOTSeq also implements ORF-level Differential Translation Efficiency (DTE) using a standard approach to complement DOU. Benchmarks show that DOU achieves superior sensitivity with near-nominal FDR across effect sizes, while DTE and some existing methods excel when technical noise is low. DOTSeq introduces an ORF-aware, quantitative framework for ribosome profiling, delivering end-to-end workflows for ORF annotation, read summarisation, contrast estimation, and visualisation to uncover translational control events at scale.
bioinformatics2026-03-18v3plsMD: A plasmid reconstruction tool from short-read assemblies
Lotfi, M.; Jalal, D.; Sayed, A. A.Abstract
While whole genome sequencing (WGS) has become a cornerstone of antimicrobial resistance (AMR) surveillance, the reconstruction of plasmid sequences from short-read WGS data remains a challenge due to repetitive sequences and assembly fragmentation. Current computational tools for plasmid identification and binning, such as PlasmidFinder, cBAR, PlasmidSPAdes, and MOB-recon, have limitations in reconstructing full plasmid sequences, hindering downstream analyses like phylogenetic studies and AMR gene tracking. To address this gap, we present plsMD, a tool designed for full plasmid reconstruction from short-read assemblies. plsMD integrates Unicycler assemblies with replicon and full plasmid sequence databases (PlasmidFinder, MOB-typer and PLSDB) to guide plasmid reconstruction through a series of contig manipulations. Using two datasets, one established benchmark dataset used in previous benchmarking studies and another novel dataset consisting of newly sequenced bacterial isolates, plsMD outperformed existing tools in both. In the benchmark dataset, it achieved excellent recall, precision, and F1 scores of 91.3%, 95.5%, and 92.0%, respectively. In the novel dataset, it achieved good recall, precision, and F1 scores of 77.6%, 88.9%, and 74.5%, respectively. plsMD supports two usage modalities: single-sample analysis for plasmid reconstruction and gene annotation, and batch-sample analysis for phylogenetic investigations of plasmid transmission. This computational tool represents a significant advancement in plasmid analysis, offering a robust solution for utilizing existing short-read WGS data to study plasmid-mediated AMR spread and evolution.
bioinformatics2026-03-18v2PREMISE: A Quality-Aware Probabilistic Framework for Pathogen Resolution and Source Assignment in Viral mNGS
Vijendran, S.; Dorman, K.; Anderson, T. K.; Eulenstein, O.Abstract
The circulation of Influenza A viruses (IAVs) in wildlife and livestock presents a significant public health threat due to their zoonotic potential and rapid genomic diversification. Accurate classification of viral subtypes and characterization of within-host diversity are crucial for risk assessment and vaccine development. Although metagenomic sequencing facilitates early detection, prevalent memory-efficient k-mer-based pipelines often discard critical linkage information. This loss of information can result in missed or imprecise pathogen identification, potentially delaying clinical and public health responses. We introduce premise (Pathogen Resolution via Expectation Maximization In Sequencing Experiments), a probabilistic, alignment-based framework implemented in Rust for high-resolution viral genome identification. By integrating advanced string data structures for efficient alignment with a quality-score-aware Expectation-Maximization algorithm, premise accurately identifies source strains, estimates relative abundances, and performs precise read assignments. This framework provides superior source estimation with statistical confidence, enabling the identification of mixed infections, recombination, and IAV-reassortment directly from raw data. Validated against simulated and empirical datasets, premise outperforms state-of-the-art k-mer methods. Ultimately, this framework represents a significant advancement in viral identification, providing a foundation for novel approaches that can automatically flag reassorted viruses or recombination events in the future, thereby improving the detection of emerging pathogens with zoonotic potential. Availability: https://github.com/sriram98v/premise under an MIT license. Contact: sriramv@iastate.edu
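To make the quality-score-aware Expectation-Maximization step concrete, here is a generic Python sketch of EM for mixture proportions. It is not the premise implementation. The function names are hypothetical, and the per-base error model (Phred score Q gives error probability 10^(-Q/10), with errors spread evenly over the three other bases) is a common textbook assumption, not taken from the abstract.

```python
def read_likelihood(read: str, ref: str, phred: list) -> float:
    """Likelihood of an ungapped read given a reference segment (assumed error model)."""
    lik = 1.0
    for base, ref_base, q in zip(read, ref, phred):
        err = 10 ** (-q / 10)  # Phred score to error probability
        lik *= (1 - err) if base == ref_base else err / 3
    return lik


def em_abundances(likelihoods, n_iter=100, tol=1e-9):
    """Estimate source proportions from a reads-by-sources likelihood matrix."""
    n_sources = len(likelihoods[0])
    pi = [1.0 / n_sources] * n_sources  # start from uniform abundances
    for _ in range(n_iter):
        totals = [0.0] * n_sources
        for row in likelihoods:
            # E-step: posterior probability that this read came from each source.
            weighted = [p * l for p, l in zip(pi, row)]
            norm = sum(weighted)
            if norm == 0.0:
                continue  # read is incompatible with every source; skip it
            for s, w in enumerate(weighted):
                totals[s] += w / norm
        total = sum(totals)
        if total == 0.0:
            return pi  # no informative reads
        # M-step: new abundances are the normalized expected read counts.
        new_pi = [t / total for t in totals]
        converged = max(abs(a - b) for a, b in zip(pi, new_pi)) < tol
        pi = new_pi
        if converged:
            break
    return pi
```

The posterior weights computed in the E-step also provide a soft read-to-source assignment, which is one way to obtain read assignments alongside abundance estimates.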
bioinformatics2026-03-18v1Hierarchical genomic feature annotation with variable-length queries
Alanko, J. N.; Ranallo-Benavidez, T. R.; Barthel, F. P.; Puglisi, S. J.; Marchet, C.Abstract
K-mer-based methods are widely used for sequence classification in metagenomics, pangenomics, and RNA-seq analysis, but existing tools face important limitations: they typically require a fixed k-mer length chosen at index construction time, handle multi-matching k-mers (whose origin in the indexed data is ambiguous) in ad-hoc ways, and some resort to lossy approximations, complicating interpretation. We present HKS, a data structure for exact hierarchical variable-length k-mer annotation. Building on the Spectral Burrows-Wheeler Transform (SBWT), a single HKS index is constructed for a specified maximum query length s, and supports queries at any length k ≤ s. HKS associates each k-mer with exactly one label from a user-defined category hierarchy, where multi-matching k-mers are resolved to their most specific common node in the hierarchy. We formalize a feature assignment framework that partitions indexed k-mers into disjoint sets according to a user-defined category hierarchy. To recover specificity lost to multi-matching and novel k-mers, we introduce a hierarchy-aware smoothing algorithm that makes use of flanking sequence context. We validate the approach by assigning each query k-mer to a specific chromosome across human genome assemblies, including the T2T-CHM13v2.0 reference as a positive control and two diploid genomes of different ancestries (HG002, NA19185). Smoothing increases overall concordance from ~81% to ~97%, with residual errors attributable to known biological phenomena including acrocentric short-arm recombination and subtelomeric duplications. In performance benchmarks against Kraken2, HKS provides comparable query throughput while providing exact, lossless annotation across all k-mer lengths simultaneously from a single index. A prototype implementation is available at https://github.com/jnalanko/HKS.
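Resolving a multi-matching k-mer to its most specific common node amounts to a lowest-common-ancestor lookup in the category tree. The Python sketch below illustrates that step only. It is not the HKS index: the function names and the example hierarchy are hypothetical, and it assumes the hierarchy is a tree with a single root, given as a child-to-parent mapping.

```python
def ancestors(node, parent):
    """Path from `node` up to the root, most specific node first."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def most_specific_common_node(labels, parent):
    """Deepest hierarchy node that is an ancestor of (or equal to) every label."""
    labels = list(labels)
    common = ancestors(labels[0], parent)
    for label in labels[1:]:
        other = set(ancestors(label, parent))
        # Keep the specific-to-root order while dropping non-shared nodes.
        common = [n for n in common if n in other]
    return common[0]


# Hypothetical hierarchy: chromosomes grouped under two intermediate categories.
PARENT = {
    "chr1": "autosome",
    "chr2": "autosome",
    "chrX": "sex_chromosome",
    "autosome": "genome",
    "sex_chromosome": "genome",
}
```

With this mapping, a k-mer found on both `chr1` and `chr2` resolves to `autosome`, while one found on `chr1` and `chrX` resolves to the root `genome`.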
bioinformatics2026-03-18v1HARVEST: Unlocking the Dark Bioactivity Data of Pharmaceutical Patents via Agentic AI
Shepard, V.; Musin, A.; Chebykina, K.; Zeninskaya, N. A.; Mistryukova, L.; Avchaciov, K.; Fedichev, P. O.Abstract
Pharmaceutical patents contain vast Structure-Activity Relationship tables documenting protein-ligand binding data that are technically public yet computationally inaccessible, rendering this wealth of data effectively dark - trapped in unstructured archives no existing database has systematically captured. We present HARVEST, a multi-agent large language model pipeline that autonomously extracts structured bioactivity records from USPTO patent archives at $0.11 per document. Applied to 164,877 patents, HARVEST produced 3.36 million activity records, recovering 365,713 unique scaffolds and 1,108 protein targets absent from BindingDB - completing in under a week a task requiring over 55 years of continuous expert labor. Automated extraction achieves 91% agreement with human curators while exhibiting lower unit-conversion error rates. We further introduce H-Bench, a structurally guaranteed held-out benchmark built from this recovered data. Evaluation of the leading open-source model Boltz-2 on H-Bench reveals a two-dimensional generalization gap: performance degrades both on novel chemical scaffolds and on uncharacterized protein targets, exposing fundamental limitations of models trained on existing public repositories.
bioinformatics2026-03-18v1Sex Checking by Zygosity Distributions
Molina-Sedano, O.; Mas Montserrat, D.; Ioannidis, A. G.Abstract
Motivation: In genomic and clinical studies, verifying concordance between self-reported and genotype-inferred sex is a crucial quality control step, since mismatches arising from mislabeling or aneuploidies can bias downstream analyses and affect diagnostic accuracy. Existing approaches typically require substantial auxiliary data, and often require manual threshold tuning. There remains a need for a streamlined, reference-free method that generalizes across different data modalities, including whole-genome, single-sample and array, without requiring additional files or parameter tuning. Results: We present Zigo, a novel ML-based sex-checking method that operates solely on a standard VCF file, designed using X-chromosome genotype class distributions across sexes. Our model was trained on synthetic data incorporating standard demographic models and empirical recombination maps to ensure realistic genetic architecture and population structure. We simulate WGS, array, and single-sample files for broad applicability. Unlike traditional methods, we eliminate manual thresholding by distilling learned discriminative patterns into a single polynomial equation that determines genetic sex directly from normalized genotype counts. We validated Zigo on independent datasets, including 1000 Genomes, UK Biobank, and HGDP. Additional experiments assessed robustness under reduced variant availability through random SNP subsampling and allele-frequency filtering. Across all evaluations, the model achieved state-of-the-art accuracy, high time efficiency, and strong generalization, even with severely limited variant sets.
bioinformatics2026-03-18v1usiGrabber: Automating the curation of proteomics spectra data at scale, making large datasets ready for use in machine learning systems
Auge, G.; Clausen, M.; Ketterer, K.; Schaefer, J.; Schmitt, N.; Altenburg, T.; Hartmaring, Y.; Raetz, H.; Schlaffner, C. N.; Renard, B. Y.Abstract
Motivation: An unprecedented amount of mass spectrometry-based proteomics data is publicly available through repositories such as the PRoteomics IDEntifications Database (PRIDE), and the field is increasingly leveraging machine learning approaches. However, the available data is not ready to be reused in a scalable way beyond the original acquisition purpose. Existing machine learning models commonly rely on a few manually curated datasets that require deep domain expertise and tedious technical work to construct. Importantly, these datasets have not been updated in recent years, so that newly published data remains inaccessible. We present usiGrabber, a scalable framework for assembling large proteomic datasets. usiGrabber is designed around portability and extensibility. It extracts spectra identification data from mzIdentML files, stores additional project-level metadata retrieved through the PRIDE API, indexes raw spectra using Universal Spectrum Identifiers (USIs), and offers download utilities to retrieve spectra data at scale. Results: Within 49 hours, we parsed over 800 million peptide spectrum matches and corresponding USIs from over 1,200 projects. As a proof of concept, we used usiGrabber to construct a phosphorylation-specific training dataset of nearly 11 million spectra in under two days and used it to retrain a binary phosphorylation classifier based on the AHLF model architecture. With a balanced accuracy of 0.78, our model achieves comparable performance to the original model on an independent test set, showing that automated data extraction is an alternative to manual curation of static datasets. Availability: All code is available at https://github.com/usiGrabber/usiGrabber; the data is available at https://zenodo.org/records/18853258.
bioinformatics2026-03-18v1Interpolating and Extrapolating Node Counts in Colored Compacted de Bruijn Graphs for Pangenome Diversity
Parmigiani, L.; Peterlongo, P.Abstract
A pangenome is a collection of taxonomically related genomes, often from the same species, serving as a representation of their genomic diversity. The study of pangenomes, or pangenomics, aims to quantify and compare this diversity, which has significant relevance in fields such as medicine and biology. Originally conceptualized as sets of genes, pangenomes are now commonly represented as pangenome graphs. These graphs consist of nodes representing genomic sequences and edges connecting consecutive sequences within a genome. Among possible pangenome graphs, a common option is the compacted de Bruijn graph. In our work, we focus on the colored compacted de Bruijn graph, where each node is associated with a set of colors that indicate the genomes traversing it. In response to the evolution of pangenome representation, we introduce a novel method for comparing pangenomes by their node counts, addressing two main challenges: the variability in node counts arising from graphs constructed with different numbers of genomes, and the large influence of rare genomic sequences. We propose an approach for interpolating and extrapolating node counts in colored compacted de Bruijn graphs, adjusting for the number of genomes. To tackle the influence of rare genomic sequences, we apply Hill numbers, a well-established diversity index previously utilized in ecology and metagenomics for similar purposes, to proportionally weight both rare and common nodes according to the frequency of genomes traversing them.
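For reference, the Hill number of order q for relative abundances p_i is (sum of p_i^q)^(1/(1-q)), with the q = 1 case defined as the exponential of the Shannon entropy. The Python sketch below computes this standard definition. How the authors derive the abundances from node and genome frequencies is not specified in the abstract, so the `counts` input here is an assumption.

```python
import math


def hill_number(counts, q: float) -> float:
    """Hill number (effective number of types) of order q for non-negative counts."""
    total = sum(counts)
    props = [c / total for c in counts if c > 0]
    if abs(q - 1.0) < 1e-12:
        # Limit as q approaches 1: exponential of the Shannon entropy.
        return math.exp(-sum(p * math.log(p) for p in props))
    return sum(p ** q for p in props) ** (1.0 / (1.0 - q))
```

Order q = 0 simply counts the types present, so rare items weigh as much as common ones. Larger q puts progressively more weight on common items, which is how the index dampens the influence of rare sequences.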
bioinformatics2026-03-18v1