Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
rnaends: an R package to study exact RNA ends at nucleotide resolution
Caetano, T.; Redder, P.; Fichant, G.; Barriot, R.
Abstract
5' and 3' RNA-end sequencing protocols have unlocked new opportunities to study aspects of RNA metabolism such as synthesis, maturation and degradation, by enabling the quantification of exact ends of RNA molecules in vivo. From RNA-Seq data generated with one of these specialized protocols, it is possible to identify transcription start sites (TSS) and/or endoribonucleolytic cleavage sites, and even, in some cases, co-translational 5' to 3' degradation dynamics. Furthermore, post-transcriptional addition of ribonucleotides at the 3' end of RNA can be studied at nucleotide resolution. While different RNA-end sequencing library protocols exist, each adapted to a specific organism (prokaryote or eukaryote) or biological question, the generated RNA-Seq data are very similar and share common processing steps. Most importantly, in RNA-end sequencing only the mapped 5' or 3' end location is of interest, in contrast to conventional RNA sequencing, which considers genomic ranges for gene expression analysis. This translates to a simple representation of the quantitative data as a count matrix of RNA-end locations on the reference sequences. This representation seems under-exploited and is, to our knowledge, not available in a generic package focused on the analysis of exact transcriptome ends. Here, we present the rnaends R package, which is dedicated to RNA-end sequencing analysis. It offers functions for raw read pre-processing, RNA-end mapping and quantification, RNA-end count matrix post-processing, and further downstream count matrix analyses such as TSS identification, fast Fourier transform analysis of periodic signal patterns, and differential RNA-end proportion analysis.
The use of rnaends is illustrated here with applications in RNA metabolism studies through selected rnaends workflows on published RNA-end datasets: (i) TSS identification, (ii) ribosome translation speed and co-translational degradation, (iii) post-transcriptional modification analysis and differential proportion analysis.
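The count-matrix representation central to rnaends can be sketched minimally in Python (hypothetical `(reference, strand, position)` tuples stand in for BAM-derived alignments; this illustrates the idea only and is not the rnaends API):

```python
from collections import Counter

def end_count_matrix(alignments):
    """Tally exact RNA 5'-end positions per (reference, strand, position).

    alignments: iterable of (reference, strand, position) tuples, as would
    be extracted upstream from mapped reads.
    """
    counts = Counter()
    for ref, strand, pos in alignments:
        counts[(ref, strand, pos)] += 1
    return counts

# Toy reads: three molecules sharing one exact 5' end suggest a TSS-like peak.
reads = [("chr1", "+", 100), ("chr1", "+", 100),
         ("chr1", "+", 100), ("chr1", "+", 104)]
matrix = end_count_matrix(reads)
```

Unlike a gene-level expression matrix, every key here is a single nucleotide position, which is what makes downstream analyses such as TSS calling possible.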
bioinformatics · 2026-04-11 · v4
COMPASS: A Web-Based COMPosite Activity Scoring System to Navigate Health and Disease Through Deterministic Digital Biomarkers
Sinha, S.; Ghosh, P.
Abstract
Quantifying pathway activity in a reproducible and interpretable manner remains a central challenge in systems biology and precision medicine. Here, we introduce COMPASS (COMPosite Activity Scoring System), a deterministic, ontology-free, threshold-based framework that converts gene expression into per-sample pathway activity scores without reliance on permutation or reference cohorts. COMPASS derives gene-specific activation thresholds directly from data, standardizes deviations from these boundaries, and integrates directionally opposing genes into a single composite score using closed-form logic. Implemented as an accessible web application, COMPASS enables users to upload expression matrices, define gene signatures, and perform activity scoring, statistical comparisons, and survival analyses without coding. Across diverse biological and clinical datasets, COMPASS generates stable and transferable digital biomarkers that quantify cellular states, benchmark the humanness and relevance of model systems, and enable outcome stratification. In head-to-head comparisons with widely used single-sample enrichment methods (GSVA and ssGSEA), COMPASS shows consistent performance across multi-cohort datasets, with improved discrimination when integrating bidirectional gene programs. Stratified bootstrap analyses further demonstrate reduced variability and increased robustness. By directly linking expression thresholds, deviation, and gene directionality, COMPASS provides a transparent and generalizable framework for ontology-free pathway activity quantification and outcome modeling.
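The scoring logic described for COMPASS (threshold-derived deviations combined across directionally opposing genes) might look roughly like the following sketch; the function and the equal-weight up-minus-down combination are assumptions for illustration, not the published closed-form:

```python
def composite_score(expr, thresholds, scale, up_genes, down_genes):
    """Hypothetical per-sample composite: standardized deviation from each
    gene's activation threshold, combined as mean(up) - mean(down)."""
    z = {g: (expr[g] - thresholds[g]) / scale[g] for g in expr}
    up = sum(z[g] for g in up_genes) / len(up_genes)
    down = sum(z[g] for g in down_genes) / len(down_genes)
    return up - down

# Toy sample: gene A sits above its threshold, gene B below it.
score = composite_score(
    expr={"A": 2.0, "B": 0.0},
    thresholds={"A": 1.0, "B": 1.0},
    scale={"A": 1.0, "B": 1.0},
    up_genes=["A"], down_genes=["B"],
)
```

Because every quantity is a deterministic function of the input matrix, the same sample always yields the same score, which is the property the abstract contrasts with permutation-based methods.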
bioinformatics · 2026-04-11 · v3
Coherent Cross-modal Generation of Synthetic Biomedical Data to Advance Multimodal Precision Medicine
Marchesi, R.; Lazzaro, N.; Endrizzi, W.; Leonardi, G.; Pozzi, M.; Ragni, F.; Bovo, S.; Moroni, M.; Osmani, V.; Jurman, G.
Abstract
Integration of multimodal, multi-omics data is critical for advancing precision medicine, yet its application is frequently limited by incomplete datasets where one or more modalities are missing. To address this challenge, we developed a generative framework capable of synthesizing any missing modality from an arbitrary subset of available modalities. We introduce Coherent Denoising, a novel ensemble-based generative diffusion method that aggregates predictions from multiple specialized, single-condition models and enforces consensus during the sampling process. We compare this approach against a multi-condition generative model that uses a flexible masking strategy to handle arbitrary subsets of inputs. The results show that our architectures successfully generate high-fidelity data that preserve the complex biological signals required for downstream tasks. We demonstrate that the generated synthetic data can be used to maintain the performance of predictive models on incomplete patient profiles and can leverage counterfactual analysis to guide the prioritization of diagnostic tests. We validated the framework's efficacy on a large-scale multimodal, multi-omics cohort from The Cancer Genome Atlas (TCGA) of over 10,000 samples spanning 20 tumor types, using data modalities such as copy-number alterations (CNA), transcriptomics (RNA-Seq), proteomics (RPPA), and histopathology (WSI). This work establishes a robust and flexible generative framework to address sparsity in multimodal datasets, providing a key step toward improving precision oncology.
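The consensus step of an ensemble of single-condition denoisers could, in outline, amount to averaging their per-step estimates; the `coherent_step` function below is a hypothetical toy, not the authors' implementation:

```python
def coherent_step(x, models, t):
    """One sampling step: each single-condition model proposes an estimate
    for the missing modality; the consensus is their element-wise mean."""
    estimates = [m(x, t) for m in models]
    n = len(estimates)
    return [sum(vals) / n for vals in zip(*estimates)]

# Two toy "models" that disagree symmetrically; the consensus averages
# their proposals, here recovering the input exactly.
models = [lambda x, t: [v * 0.5 for v in x],
          lambda x, t: [v * 1.5 for v in x]]
consensus = coherent_step([2.0, 4.0], models, t=0)
```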
bioinformatics · 2026-04-11 · v3
AEGIS: an annotation extraction and genomic integration resource
Navarro-Paya, D.; Santiago, A.; Velt, A.; Moretto, M.; Rustenholz, C.; Matus, J. T.
Abstract
The GTF/GFF3 formats are the standard for storing and exchanging genome annotations. However, their flexibility often results in inconsistent and poorly formatted files across different sources, creating a major bottleneck for downstream bioinformatics analyses. Here, we present Annotation Extraction and Genomic Integration Suite (AEGIS), a comprehensive and user-friendly command-line toolkit designed to parse, validate and standardise genome annotation files. AEGIS robustly corrects common structural and formatting errors, ensuring interoperability with downstream tools. Beyond standardisation, the suite provides advanced modules for analysis, such as flexible sequence extraction (e.g. genes, CDS, proteins) with isoform handling, customisable promoter region definitions and targeted DNA motif searches. A key feature of AEGIS is its integrated workflow for comparative genomics, which combines multiple lines of evidence (i.e., sequence homology, synteny and coordinate-based lift-overs) to enable a robust gene ID correspondence and orthology assessment. We demonstrate the utility of AEGIS by comparing two major Arabidopsis thaliana annotations (TAIR10 vs. Araport11), successfully identifying and quantifying complex structural changes such as gene splits and fusions. AEGIS provides a unified solution for annotation quality control, feature extraction and comparative genomic analysis, simplifying complex workflows and enhancing reliability in bioinformatic research. The software is open-source, implemented in Python and is available on GitHub, PyPI, and as a Docker container to ensure accessibility and reproducibility.
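The kind of structural validation AEGIS performs on GFF3 records can be illustrated with a minimal checker over the standard nine-column format (the function name and error messages are invented for this sketch):

```python
def validate_gff3_line(line):
    """Check one GFF3 record against basic structural rules."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return False, "expected 9 tab-separated fields"
    try:
        start, end = int(fields[3]), int(fields[4])
    except ValueError:
        return False, "non-integer coordinates"
    if start > end:
        return False, "start greater than end"
    if fields[6] not in {"+", "-", ".", "?"}:
        return False, "invalid strand"
    return True, "ok"

good = "chr1\tsrc\tgene\t100\t200\t.\t+\t.\tID=g1"
bad = "chr1\tsrc\tgene\t300\t200\t.\t+\t.\tID=g2"
```

Real-world files routinely fail even these elementary checks, which is the bottleneck the abstract describes.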
bioinformatics · 2026-04-11 · v2
scLongTree: an accurate computational tool to infer the longitudinal tree for scDNAseq data
Khan, R.; Bhattarai, P.; Zhang, L.; Zhou, X. M.; Mallory, X.
Abstract
Longitudinal single-cell DNA sequencing (scDNA-seq) refers to single-cell data sequenced at different time points, providing more information on the order of mutations than scDNA-seq taken at a single time point. The technique can facilitate the inference of subclonal trees that depict the evolution of cancer cells and improve understanding of how cancer grows, with implications for prognosis and treatment. There is currently a scarcity of tools that can infer subclonal trees based on longitudinal scDNA-seq, and existing tools are limited in accuracy and scale. We therefore introduce scLongTree, a computational tool that can accurately infer a subclonal tree based on longitudinal scDNA-seq. ScLongTree is scalable to hundreds of mutations, and outperforms state-of-the-art tools such as LACE, SCITE, and SiCloneFit on a comprehensive simulated dataset. Tests on a real dataset, SA501, showed that scLongTree can more accurately interpret the progressive growth of the tumor than LACE, and is more robust to different numbers of mutations being used. Tests on a large AML dataset, AML107, which has 4,617 cells, show that scLongTree is scalable to thousands of cells. ScLongTree is freely available at https://github.com/compbio-mallory/sc_longitudinal_infer.
bioinformatics · 2026-04-11 · v2
PRIZM: Combining Low-N Data and Zero-shot Models to Design Enhanced Protein Variants
Harding-Larsen, D.; Lax, B. M.; Garcia, M. E.; Mendonca, C.; Mejia-Otalvaro, F.; Welner, D. H.; Mazurenko, S.
Abstract
Machine learning has repeatedly shown the ability to accelerate protein engineering, but many approaches demand large amounts of robust, high-quality training data as well as substantial computational expertise. While large pre-trained models can function as zero-shot proxies for predicting variant effects, selecting the best model for a given protein property is often non-trivial. Here, we introduce Protein Ranking using Informed Zero-shot Modelling (PRIZM), a two-phase workflow that first uses a high-quality low-N dataset to identify the most suitable pre-trained zero-shot model for a target protein property and then applies that model to rank and prioritize an in silico variant library for experimental testing. Across diverse benchmark datasets spanning multiple protein properties, PRIZM reliably separated low- from high-performing models using datasets of ~20 labelled variants. We further demonstrate PRIZM in enzyme engineering case studies targeting sucrose synthase thermostability and glycosyltransferase activity, where PRIZM-guided selection identified improved variants, including gains of ~3 °C in apparent melting temperature and ~20% higher relative activity. PRIZM provides an accessible, data-efficient route to leverage foundation models for protein design while requiring minimal experimental data.
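The first phase of a PRIZM-style workflow, picking the zero-shot model whose scores best rank a small labelled set, can be sketched with a plain Spearman correlation (tie handling omitted; `pick_model` is a hypothetical helper, not the PRIZM code):

```python
def ranks(xs):
    # simple ranking, ignoring ties
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = float(rank)
    return r

def spearman(a, b):
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / (va * vb) ** 0.5

def pick_model(low_n_labels, model_scores):
    # choose the zero-shot model whose scores best rank the labelled variants
    return max(model_scores,
               key=lambda m: spearman(low_n_labels, model_scores[m]))

labels = [0.1, 0.5, 0.7, 0.9]  # toy low-N fitness measurements
scores = {"model_a": [1.0, 2.0, 3.0, 4.0],   # perfectly concordant
          "model_b": [4.0, 3.0, 2.0, 1.0]}   # perfectly discordant
best = pick_model(labels, scores)
```

The selected model is then applied, in phase two, to rank the full in silico library.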
bioinformatics · 2026-04-11 · v2
DyGraphTrans: A temporal graph representation learning framework for modeling disease progression from Electronic Health Records
Rahman, M. T.; Al Olaimat, M.; Bozdag, S.; Alzheimer's Disease Neuroimaging Initiative
Abstract
Motivation: Electronic Health Records (EHRs) contain vast amounts of longitudinal patient medical history data, making them highly informative for early disease prediction. Numerous computational methods have been developed to leverage EHR data; however, many process multiple patient records simultaneously, resulting in high memory consumption and computational cost. Moreover, these models often lack interpretability, limiting insight into the factors driving their predictions. Efficiently handling large-scale EHR data while maintaining predictive accuracy and interpretability therefore remains a critical challenge. To address this gap, we propose DyGraphTrans, a dynamic graph representation learning framework that represents patient EHR data as a sequence of temporal graphs. In this representation, nodes correspond to patients, node features encode temporal clinical attributes, and edges capture patient similarity. DyGraphTrans models both local temporal dependencies and long-range global trends, while a sliding-window mechanism reduces memory consumption without sacrificing essential temporal context. Unlike existing dynamic graph models, DyGraphTrans jointly captures patient similarity and temporal evolution in a memory-efficient and interpretable manner. Results: We evaluated DyGraphTrans on the Alzheimer's Disease Neuroimaging Initiative (ADNI) and National Alzheimer's Coordinating Center (NACC) cohorts for disease progression prediction, as well as on the Medical Information Mart for Intensive Care (MIMIC-IV) dataset for early mortality prediction. We further assessed the model on multiple benchmark dynamic graph datasets to evaluate its generalizability. DyGraphTrans achieved strong predictive performance across diverse datasets. We also demonstrated that the interpretability of DyGraphTrans aligns with known clinical risk factors.
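Two ingredients of the described representation, similarity edges between patients and a sliding window over temporal graph snapshots, can be sketched as follows (Euclidean distance on toy feature vectors; both functions are illustrative assumptions, not the DyGraphTrans code):

```python
def similarity_edges(features, threshold):
    """Connect patient pairs whose Euclidean feature distance is below threshold."""
    edges = []
    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            dist = sum((a - b) ** 2
                       for a, b in zip(features[i], features[j])) ** 0.5
            if dist < threshold:
                edges.append((i, j))
    return edges

def sliding_windows(snapshots, size):
    """Yield overlapping windows of consecutive temporal graph snapshots,
    so the model only ever holds `size` snapshots in memory."""
    for i in range(len(snapshots) - size + 1):
        yield snapshots[i:i + size]

edges = similarity_edges([[0.0], [0.1], [5.0]], threshold=1.0)
windows = list(sliding_windows(["t0", "t1", "t2", "t3"], size=2))
```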
bioinformatics · 2026-04-11 · v2
Identification, evolutionary history and characteristics of orphan genes in root-knot nematodes
Seckin, E.; Colinet, D.; Bailly-Bechet, M.; Seassau, A.; Bottini, S.; Sarti, E.; Danchin, E. G.
Abstract
Orphan genes, lacking homologs in other species, are systematically found across genomes. Their presence may result from extensive divergence from pre-existing genes or from de novo gene birth, which occurs when a gene emerges from a previously non-genic region. In this study, we identified orphan genes in the genomes of globally distributed plant-parasitic nematodes of the genus Meloidogyne and investigated their origins, evolution, and characteristics. Using a comparative genomics framework across 85 nematode species, we found that 18% of Meloidogyne genes are genus-specific, transcriptionally supported orphans. By combining ancestral sequence reconstruction and synteny-based approaches, we inferred that 20% of these orphan genes originated through high divergence, while 18% likely emerged de novo. Proteomic and translatomic evidence confirmed the translation of a subset of these genes, and feature analyses revealed distinctive molecular signatures, including shorter length, signal peptide enrichment, and a tendency for extracellular localization. These findings highlight orphan genes as a substantial and previously underexplored component of the Meloidogyne genome, with potential roles in their worldwide parasitism.
bioinformatics · 2026-04-11 · v2
A structure-informed deep learning framework for modeling TCR-peptide-HLA interactions
Cao, K.; Li, R.; Strazar, M.; Brown, E. M.; Nguyen, P. N. U.; Pust, M.-M.; Park, J.; Graham, D. B.; Ashenberg, O.; Uhler, C.; Xavier, R.
Abstract
The interaction between T cell receptors (TCRs), peptides, and human leukocyte antigens (HLAs) underlies antigen-specific T cell immunity. Despite substantial advances in peptide-HLA presentation prediction, accurate modeling of coupled TCR-peptide-HLA recognition remains underdeveloped, limiting applications such as TCR and neoepitope prioritization in cancer and antigen identification in autoimmunity. Here we present StriMap, a unified framework for predicting TCR-peptide-HLA interactions by integrating physicochemical, sequence-context, and structural features at recognition interfaces. StriMap achieves state-of-the-art performance with improved generalizability and enables applications in both cancer and autoimmunity. As a case study in ankylosing spondylitis (AS), we screened 13 million peptides derived from 43,241 bacterial proteins and identified candidate molecular mimics that were experimentally validated to activate T cells expressing an AS-associated TCR. Notably, a top validated peptide was enriched in patients with inflammatory bowel disease (IBD), suggesting potential shared microbial triggers between AS and IBD. Overall, StriMap provides a generalizable framework for rational immunotherapy design and for dissecting antigenic drivers of autoimmunity.
bioinformatics · 2026-04-11 · v2
Large-Scale Statistical Dissection of Sequence-Derived Biochemical Features Distinguishing Soluble and Insoluble Proteins
Vu, N. H. H.; Nguyen Bao, L.
Abstract
Protein solubility critically influences recombinant expression efficiency and downstream biotechnological applications. While deep learning models have improved predictive accuracy, the intrinsic magnitude, redundancy, and interpretability of classical sequence-derived determinants remain insufficiently characterized. We performed a statistically rigorous large-scale univariate analysis on a curated dataset of 78,031 proteins (46,450 soluble; 31,581 insoluble). Thirty-six biochemical descriptors were evaluated using Mann-Whitney U tests with Benjamini-Hochberg false discovery rate correction. Effect sizes were quantified using Cliff's δ, and discriminative performance was assessed by ROC-AUC. Although 34 features remained significant after correction, most exhibited small effect sizes and substantial class overlap, consistent with a weak-signal regime. The strongest effects were associated with size-related features (sequence length and molecular weight; δ ≈ -0.21), whereas charge-related descriptors, particularly the proportion of negatively charged residues (δ = 0.150; AUC = 0.575), showed consistent but modest shifts. Spearman correlation analysis revealed near-complete redundancy among major size-related variables (ρ up to 0.998). Applying a redundancy threshold (|ρ| ≥ 0.85), we derived a parsimonious composite integrating sequence length and negative charge proportion, achieving AUC = 0.624 (MCC = 0.1746). These findings demonstrate that sequence-level solubility information is intrinsically low-dimensional and governed by coordinated weak effects, establishing a transparent statistical baseline for large-scale solubility characterization.
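Cliff's δ, the effect-size measure used above, has a simple direct definition: the proportion of cross-group pairs where one group's value exceeds the other's, minus the reverse. A naive O(n·m) sketch:

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs.
    Ranges from -1 (all x below y) to +1 (all x above y); ties count as 0."""
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))
```

For the dataset sizes quoted above, a production implementation would use a rank-based formulation rather than this quadratic loop.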
bioinformatics · 2026-04-11 · v2
scMultiPreDICT: A single-cell predictive framework with transcriptomic and epigenetic signatures
Manful, E.-E.; Uzun, Y.
Abstract
Cellular responses to genetic perturbations depend on both transcriptional programs and the epigenetic landscape. While single-cell multiomics technologies enable simultaneous profiling of gene expression and chromatin accessibility, the relative contribution of each regulatory layer to gene expression remains unclear. Existing computational approaches focus on data integration and gene regulatory network inference but do not systematically compare the predictive performance of transcriptional versus epigenetic features on a gene-by-gene basis. We present scMultiPreDICT, a computational framework for comparative predictive modeling of gene expression using single-cell multiomics data. scMultiPreDICT benchmarks RNA-only, ATAC-only and multimodal feature sets across six machine learning models including regression, tree-based learning and deep learning using multiple biological datasets. We show that RNA-derived features generally provide strong predictive power, whereas chromatin accessibility alone yields modest performance. Surprisingly, multimodal integration does not uniformly improve prediction accuracy; instead, its benefit is gene-specific and context-dependent. Feature importance analysis reveals that transcriptional features dominate for most genes, whereas chromatin accessibility contributes meaningfully for a subset of genes in specific cellular contexts. Overall, the results demonstrate that regulatory layers contribute differently to gene expression. scMultiPreDICT provides a systematic framework for identifying the relative contributions of transcriptional and epigenetic regulation across genes and cellular contexts, guiding the design of targeted perturbation studies and the prioritization of regulatory layers for therapeutic interventions. scMultiPreDICT is implemented in R and available at https://github.com/UzunLab/scMultiPreDICT/.
bioinformatics · 2026-04-11 · v1
Structural Connectome Analysis using a Graph-based Deep Model for Age and Dementia Prediction
Kazi, A.; Mora, J.; Fischl, B.; Dalca, A.; Aganj, I.
Abstract
We address the prediction of non-imaging variables based on structural brain connectivity derived from diffusion magnetic resonance images, using graph-based machine learning. We predict age and the mini-mental state examination (MMSE) score as examples of a demographic and a clinical variable. We propose a machine-learning model inspired by graph convolutional networks (GCNs), which takes a brain connectivity graph as input and processes the data separately through a parallel GCN mechanism with multiple branches. The novelty of our work lies in the model architecture, especially the Connectivity Attention Block, which learns an embedding representation of brain graphs while providing graph-level attention. We show experiments on the publicly available PREVENT-AD and OASIS3 datasets. The proposed network is a simple design that employs different heads involving graph convolutions focused on edges and nodes, a linear branch, and skip connections, thoroughly capturing representations from the input data. To test the ability of our model to extract complementary and representative features from brain connectivity data, we chose the task of sex classification. We validate our model by comparing it to existing methods and via ablations. This quantifies the degree to which the connectome varies depending on the task, which is important for improving our understanding of health and disease across the population. The proposed model generally demonstrates higher performance, especially for age prediction, compared to the existing machine-learning algorithms we tested, including classical methods and (graph and non-graph) deep learning.
bioinformatics · 2026-04-10 · v3
MHCXGraph: A Graph-Based approach to detecting T cell receptor cross-reactivity
Simoes, C. D. M. S.; Maidana, R. L. B. R.; De Assis, S. C.; Guerra, J. V. d. S.; Ribeiro-Filho, H. V.
Abstract
The T cell receptor (TCR) recognition of multiple peptides presented by the major histocompatibility complex (MHC) is a key natural phenomenon, enabling the T cell repertoire to respond to a broad array of antigens. Despite its importance to the immune response, T cell cross-reactivity poses a major challenge for the development of novel T cell-based therapies. In this study, we present MHCXGraph, a graph-based computational approach for identifying conserved and immunologically relevant regions across multiple structures of peptides bound to MHC molecules (pMHC). Our approach provides three operational modes with user-defined parameters, allowing flexible configuration according to specific scientific needs while delivering fully interpretable results through user-friendly interfaces. We evaluated MHCXGraph across three case studies, including peptides bound to classical MHC Class I, MHC Class II, and unbound HLA alleles, demonstrating its ability to capture conserved structural determinants beyond sequence similarity. By integrating structural information with efficient graph-based analysis, MHCXGraph addresses key limitations of sequence-based methods while maintaining computational scalability. Collectively, these results indicate that MHCXGraph can be readily integrated into computational pipelines for T cell cross-reactivity discovery, especially in the context of de novo pMHC engager design and T cell-based vaccine development.
bioinformatics · 2026-04-10 · v1
Benchmarking ambient RNA removal across droplet and well-plate platforms reveals artificial count generation as a critical failure mode of scAR and CellClear
Schroeder, L.; Gerber, S.; Ruffini, N.
Abstract
Background: Ambient RNA contamination is a pervasive artifact of single-cell and single-nucleus RNA sequencing (sxRNA-seq), yet no consensus exists on which computational removal tool performs best across experimental platforms. Results: We present a systematic benchmark of six tools (CellBender, DecontX, SoupX, scCDC, scAR, and CellClear) evaluated across six human-mouse cell line mixing (hgmm) datasets (1k-20k cells) providing partial ground truth, two droplet-based complex tissue datasets (PBMC scRNA-seq; prefrontal cortex snRNA-seq), and a well-plate-based dataset (BD Rhapsody WBC). Using inter-species counts as partial ground truth, we quantify sensitivity, specificity, precision, and removal consistency per tool. We further apply a count-integrity criterion quantifying gene-cell positions where corrected values exceed raw counts. This reveals that scAR and CellClear do not merely denoise but fundamentally restructure count matrices: CellClear replaces >93% of counts with values derived from matrix factorization, while scAR generates spurious cell types absent from uncorrected data, including three spurious coarse cell types in the BD Rhapsody dataset and up to eight novel cell types in the prefrontal cortex. CellBender and SoupX exhibit reliable contamination removal with minimal count distortion. DecontX and scCDC are the only tools operable on non-droplet platforms without raw count matrix access. Runtime benchmarking at atlas scale (up to 172,000 nuclei) further demonstrates that CellClear fails to scale. Conclusions: Count matrix integrity, not removal sensitivity alone, must be a primary criterion when selecting ambient RNA correction tools. We provide platform-specific recommendations and a decision framework to guide tool selection across experimental contexts.
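The count-integrity criterion described, flagging gene-cell positions where corrected values exceed raw counts, is straightforward to compute; a sketch over flattened matrices (the function name is ours, not from the paper):

```python
def count_integrity_violations(raw, corrected):
    """Fraction of gene-cell positions where the 'corrected' count exceeds
    the raw count. Ambient-RNA removal should only ever subtract, so any
    position with corrected > raw indicates artificially generated counts."""
    violations = sum(1 for r, c in zip(raw, corrected) if c > r)
    return violations / len(raw)

# Toy flattened matrices: one position gains a count it never had.
rate = count_integrity_violations(raw=[5, 0, 2, 3], corrected=[4, 1, 2, 3])
```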
bioinformatics · 2026-04-10 · v1
TCMCard: A High-Confidence Digital Infrastructure for Traditional Chinese Medicine Quantified by Multi-Dimensional Evidence Integration
Wang, Y.; Dong, W.; Yao, J.; Wang, K.; Zhang, L.; Wang, Y.; Guo, S.; Li, H.; Cai, H.; Wang, X.; Li, Y.
Abstract
Network pharmacology has become a widely used approach for deciphering multi-component, multi-target mechanisms of traditional Chinese medicine (TCM). Here we introduce TCMCard, a high-confidence digital infrastructure built on a Multi-Dimensional Evidence Integration (MDEI) framework. The framework integrates experimental activity data from authoritative chemical databases, literature-derived evidence, and structure-based similarity inference. Preprocessing steps include chemical structure normalization, species-specific filtering, and target quality scoring. Applied to conventional interaction datasets, this pipeline leads to the removal of over 60% of low-confidence noise. TCMCard supports network pharmacology exploration through an interactive visualization platform, and module analysis identifies functionally relevant communities that offer insights into the synergistic actions of TCM formulas. Overall, TCMCard may help move the field beyond simple data aggregation toward evidence-informed curation and quality-driven analysis. As an interactive and publicly accessible platform, it reveals an organized backbone within complex interaction networks, offering a more reliable basis for understanding multi-component synergy in TCM.
bioinformatics · 2026-04-10 · v1
Generating, curating, and evaluating trnL reference sequence databases: Benchmarking OBITools3/ecoPCR, RESCRIPt, and MetaCurator
KUDDAR, O. S.; Meiklejohn, K. A.; Callahan, B. J.
Abstract
Plant DNA metabarcoding enables the identification of plant taxa in mixed samples, with the trnL (UAA) intron and its P6 loop mini-barcode region performing as well as or better than other commonly used markers. Reliable metabarcoding requires high-quality reference databases, yet a regularly maintained trnL resource is currently lacking. Consequently, most studies use uncurated sequences downloaded directly from public repositories without essential validation. We address these gaps by providing guidance through a systematic comparison of three database curation tools (OBITools3/ecoPCR, RESCRIPt, and MetaCurator) to generate three trnL reference sequence databases and evaluate their classification performance across commonly sequenced trnL regions (CD, CH, and GH). Reference trnL sequences and taxonomy files were retrieved from public sequence repositories and curated using standardized filtering steps to reduce taxonomic errors, sequence ambiguity, and redundancy. Four simulated query datasets (two base sets and their mutated counterparts) were constructed to assess classification performance of the databases using the Naive Bayesian Classifier implemented in DADA2. The evaluation showed that performance differed by trnL region: MetaCurator and RESCRIPt yielded higher and similar metrics for trnL CD; OBITools3/ecoPCR and RESCRIPt were comparable for trnL CH; and MetaCurator attained the highest performance for the trnL GH region. All reference databases, taxonomy, and evaluation files are available at Zenodo (https://doi.org/10.5281/zenodo.17969450). The complete computational workflow and scripts are available on GitHub (https://github.com/oskuddar/trnL_DB). Although evaluation was focused on plant taxa in the United States, the resulting databases are suitable for use as global trnL reference databases.
bioinformatics · 2026-04-10 · v1
SimpleFold-Turbo: Adaptive Inference Caching Yields 14-fold Acceleration of Flow-Matching Protein Structure Prediction
Taghon, G.
Abstract
We apply TeaCache, an adaptive caching technique from video diffusion, to SimpleFold's flow-matching protein structure prediction and achieve 9- to 14-fold inference speedups with negligible quality loss. We determine that flow matching's near-linear generative trajectories make consecutive neural-network evaluations highly redundant. At a low redundancy threshold, SimpleFold-Turbo (SF-T) skips ≈93% of forward passes while preserving near-baseline template modeling (TM) scores across 300 structurally diverse CATH domains and all six SimpleFold model sizes (100 million to 3 billion parameters), at compute budgets where log-uniform step-skipping collapses. Speedup scales with model size because caching overhead is constant while per-step cost grows, and a general three-phase skip pattern emerges independent of protein size or fold. SF-T requires no retraining, no weight modification, and no MSA server dependencies. We release SF-T as fully open-source software enabling thousands of structure predictions per hour on commodity hardware.
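The caching idea, reusing the last network output whenever a cheap proxy predicts little change between steps, can be caricatured on a scalar trajectory (this toy is neither TeaCache nor SimpleFold code; `change_proxy` stands in for TeaCache's learned redundancy estimate):

```python
def cached_trajectory(x0, step_fn, change_proxy, steps, threshold):
    """Integrate a trajectory, calling the expensive step_fn only when a
    cheap proxy predicts the update will change; otherwise reuse the
    cached update. Returns the final state and the number of real calls."""
    x, cached_update, calls = x0, 0.0, 0
    for t in range(steps):
        if t == 0 or change_proxy(x, t) > threshold:
            cached_update = step_fn(x, t)
            calls += 1
        x = x + cached_update
    return x, calls

# Near-linear trajectory: the proxy never exceeds the threshold after
# step 0, so a single network call is amortized over all ten steps.
x_final, n_calls = cached_trajectory(
    x0=0.0,
    step_fn=lambda x, t: 1.0,       # stands in for the expensive network
    change_proxy=lambda x, t: 0.0,  # stands in for the redundancy estimate
    steps=10, threshold=0.5,
)
```

This is why near-linear flow-matching trajectories cache so well: the reused update stays close to what the network would have produced anyway.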
bioinformatics · 2026-04-10 · v1
Structure-Based and Stability-Validated Prioritization of BACE1 Inhibitors Integrating Meta-Ensemble QSAR and Molecular Dynamics
Chowdhury, T. D.; Shafoyat, M. U.; Hemel, N. H.; Nizam, D.; Sajib, J. H.; Toha, T. I.; Nyeem, T. A.; Farzana, M.; Haque, S. R.; Hasan, M.; Siddiquee, K. N. e. A.; Mannoor, K.
Abstract
Alzheimer's disease remains an unmet therapeutic challenge, and no β-secretase (BACE1) inhibitor has achieved clinical approval. A major limitation of prior discovery efforts is reliance on single-parameter optimization, often yielding computational hits with poor translational potential. Here, we present a stability-validated, biology-informed computational framework that integrates meta-ensemble QSAR (five tree-based classifiers with ECFP4 fingerprints), structure-based docking, Protein Language Model (ESM-1b)-guided hybrid residue interaction weighting, and comprehensive ADMET profiling within a normalized composite ranking scheme. Model robustness was confirmed through external validation and Y-randomization (n = 100; empirical p = 0.009). Heuristic weighting was quantitatively stress-tested using global ±10% perturbation analysis (mean Spearman's ρ = 0.998; mean Kendall's τ = 0.970), demonstrating exceptional ranking stability under controlled parameter uncertainty. Screening of 16,196 structurally diverse compounds, including CNS-active molecules, phytochemicals, approved drugs, and investigational agents, identified 153 predicted actives (accuracy 0.852; ROC-AUC 0.920), which were refined to 111 drug-like candidates and seven prioritized leads. Two-hundred-nanosecond molecular dynamics simulations confirmed stable binding within the BACE1 catalytic pocket and sustained interaction networks over time. Mol-2 exhibited the most favorable profile, characterized by low ligand RMSD (1.2-1.6 Å), persistent catalytic dyad interactions (ASP32 98%, ASP228 99%), predicted BBB permeability, acceptable efflux profile, and balanced ADMET characteristics consistent with CNS drug-like space.
Collectively, this integrative, interpretable, and robustness-validated framework provides a systematic strategy for multi-criteria lead prioritization and may serve as a transferable platform for structure-guided discovery of therapeutics targeting complex neurodegenerative pathways.
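The ±10% weight-perturbation stress test described in this abstract is a broadly reusable pattern: perturb the heuristic weights, recompute the composite ranking, and measure rank correlation against the baseline. A minimal sketch with illustrative stand-in scores and weights (not the authors' code, criteria, or data):

```python
import numpy as np
from scipy.stats import spearmanr, kendalltau

rng = np.random.default_rng(0)

# Stand-in data: 50 candidate compounds scored on 4 normalized criteria
# (e.g. QSAR probability, docking score, interaction weight, ADMET score)
scores = rng.random((50, 4))
weights = np.array([0.4, 0.3, 0.2, 0.1])  # hypothetical baseline weights

def rank_compounds(w):
    """Rank compounds by weighted composite score (rank 0 = best)."""
    composite = scores @ w
    return np.argsort(np.argsort(-composite))

baseline = rank_compounds(weights)

# Global +/-10% perturbation: scale every weight by a random factor in
# [0.9, 1.1], renormalize, and compare the perturbed ranking to baseline
rhos, taus = [], []
for _ in range(100):
    w = weights * rng.uniform(0.9, 1.1, size=weights.shape)
    w /= w.sum()
    perturbed = rank_compounds(w)
    rhos.append(spearmanr(baseline, perturbed)[0])
    taus.append(kendalltau(baseline, perturbed)[0])

mean_rho, mean_tau = np.mean(rhos), np.mean(taus)
```

Mean correlations near 1 indicate that the ranking is insensitive to the exact weight choice, which is the stability property the abstract reports.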
bioinformatics · 2026-04-10 · v1
PERREO: An integrated pipeline for repetitive elements analysis enables the repeatome expression profiling in cancer
Rodriguez-Martin, F.; Masero-Leon, M.; Gomez-Cabello, D.
Abstract:
Transcriptome-wide profiling of repetitive element expression reveals transposable element-derived transcripts that are deregulated in diverse biological contexts including cancer. However, most RNA-seq pipelines are optimized for annotated genes and substantially undercount repeat RNA molecules, limiting their discovery and characterization. Here we present PERREO, a comprehensive, user-friendly pipeline for analyzing repetitive RNA elements from short- and long-read sequencing data. PERREO performs quality control, repeat-aware alignment and quantification, differential expression analysis, co-expression network analysis, and de novo transcript assembly with minimal computational expertise required. We validate PERREO across cell lines, tumor tissues and liquid biopsies, demonstrating superior sensitivity to repetitive RNA signatures compared with standard RNA-seq approaches. PERREO integrates predictive modelling to identify biological associations and generates publication-ready visualizations. By removing the bioinformatic barrier to repetitive RNA discovery, this pipeline enables broader investigation of the repeatome's role in cellular biology and disease, yielding valuable results that, for specific analytical objectives, outperform certain existing tools and pipelines.
bioinformatics · 2026-04-10 · v1
BrightEyes-FFS: an open-source platform for comprehensive analysis of fluorescence fluctuation spectroscopy experiments with small detector arrays
Slenders, E.; Perego, E.; Zappone, S.; Vicidomini, G.
Abstract:
Fluorescence fluctuation spectroscopy (FFS) is an ensemble of techniques for quantitative measurement of molecular dynamics and interactions. Recently, the introduction of small-format array detectors has opened up a new range of spatiotemporal information, allowing for more detailed analysis of system kinetics. However, there is currently no open-source software available for analyzing the high-dimensional FFS data sets. We present BrightEyes-FFS, an open-source Python-based environment for FFS analysis with array detectors. The environment includes a Python package for reading raw FFS data, computing auto- and cross-correlations using various algorithms, and fitting the correlations to several models. A graphical user interface (GUI), available as a standalone executable, makes the analysis fast and user-friendly. An automated Jupyter Notebook writing tool enables transition from the GUI to Jupyter Notebook for custom analysis. We believe that BrightEyes-FFS will enable a wider community to study diffusion, flow, and interaction dynamics.
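The core FFS computation is an intensity autocorrelation. A minimal direct estimator on a simulated photon-count trace (an illustration of the general technique only; BrightEyes-FFS's actual correlation algorithms and API are not shown here):

```python
import numpy as np

def autocorrelation(intensity, max_lag):
    """Normalized fluorescence autocorrelation
    G(tau) = <dI(t) dI(t+tau)> / <I>^2 for lags 1..max_lag."""
    mean_i = intensity.mean()
    d = intensity - mean_i
    n = len(intensity)
    g = np.empty(max_lag)
    for lag in range(1, max_lag + 1):
        g[lag - 1] = np.mean(d[:n - lag] * d[lag:]) / mean_i**2
    return g

# Simulated trace: Poisson shot noise around a slowly varying signal
rng = np.random.default_rng(1)
signal = 100 + 20 * np.sin(np.linspace(0, 20 * np.pi, 10_000))
trace = rng.poisson(signal).astype(float)

g = autocorrelation(trace, max_lag=50)
```

Fitting such curves to diffusion or flow models is then the standard route to extracting kinetic parameters; production tools typically use multiple-tau schemes for efficiency over wide lag ranges.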
bioinformatics · 2026-04-10 · v1
Statistical Principles Define an Open-Source Differential Analysis Workflow for Mass Spectrometry Imaging Experiments with Complex Designs
Rogers, E. B. T.; Lakkimsetty, S. S.; Bemis, K. A.; Schurman, C. A.; Angel, P. A.; Schilling, B.; Vitek, O.
Abstract:
Mass spectrometry imaging (MSI) characterizes the spatial heterogeneity of molecular abundances in biological samples. Experiments with complex designs, involving multiple conditions and multiple samples, provide particularly useful insight into differential abundance of analytes. However, analyses of these experiments require attention to details such as signal processing, selection of regions of interest, and statistical methodology. This manuscript contributes a statistical analysis workflow for detecting differentially abundant analytes in MSI experiments with complex designs. Using a case study of histologic samples of human tibial plateaus from knees of osteoarthritis patients and cadaveric controls, as well as simulated datasets, we illustrate the impact of the analysis decisions. We illustrate the importance of signal processing and feature aggregation for preserving biological relevance and alleviating the stringency of multiple testing. We further demonstrate the importance of selecting regions of interest in ways that are compatible with differential analysis. Finally, we contrast several common statistical models for differential analysis, showcase the appropriate use of replication, and demonstrate model-based calculation of sample size for follow-up investigations. The discussion is accompanied by detailed recommendations and an open-source R-based implementation that can be followed by other investigations.
bioinformatics · 2026-04-10 · v1
Deep learning enables direct HLA typing from immunopeptidomics data
Pilz, M.; Scheid, J.; Bauer, A.; Lemke, S.; Sachsenberg, T.; Bauer, J.; Nelde, A.; Stadelmaier, J.; Walter, A.; Rammensee, H.-G.; Nahnsen, S.; Kohlbacher, O.; Walz, J. S.
Abstract:
The immune system eliminates malignant and infected cells through T-cell-mediated recognition of peptides presented by human leukocyte antigen (HLA) molecules. Mass spectrometry-based immunopeptidomics enables unbiased identification of naturally presented HLA-restricted peptides and has become central to the development of T-cell-based immunotherapies. However, immunopeptidomics data reflects the combined peptide presentation of multiple HLA alleles, and determining which allotypes are represented in this multi-allelic complexity remains an unmet computational challenge. Here, we introduce immunotype, a deep learning-based ensemble predictor for HLA class I allotyping directly from immunopeptidomics data. Immunotype integrates peptide and HLA sequence information through transformer encoders and a graph neural network, complemented by a curated mono-allelic reference of known peptide-HLA binding preferences. Immunotype achieves an overall accuracy of 87.2% at protein-level resolution across diverse tissues and thereby enables rapid, cost-effective HLA typing of large-scale immunopeptidomics datasets.
bioinformatics · 2026-04-10 · v1
A computational model for quantifying instability of tandem repeats across the genome
Dolzhenko, E.; English, A.; Mokveld, T.; de Sena Brandine, G.; Kronenberg, Z.; Wright, G.; Drogemoller, B.; Rowell, W. J.; Wenger, A. M.; Bennett, M. F.; Weisburd, B.; Erwin, G. S.; Jin, P.; Nelson, D. L.; Dashnow, H.; Sedlazeck, F.; Eberle, M. A.
Abstract:
Tandem repeats (TRs) exhibit high levels of somatic mosaicism, which is increasingly recognized as an important modifier of repeat expansion disorders. Long-read sequencing can capture full-length repeat alleles, yet robust frameworks for quantifying instability across TRs genome-wide are still needed. Here, we introduce a general-purpose model for quantifying TR instability in a given long-read sequencing dataset, without explicitly distinguishing biological mosaicism from technical noise, and which is broadly applicable to both simple and structurally complex loci. This model accurately characterizes allelic instability at each TR locus by representing the distribution of read-to-consensus deviations for each allele. Using HiFi sequencing data from 256 HPRC cell line samples, we fitted models for 617,007 TR loci, including known pathogenic repeats. We observe that instability levels are generally low, but vary substantially across individual TRs, and are driven more strongly by repeat composition than overall repeat length. Furthermore, we applied our method to targeted PureTarget long-read data from samples with known repeat expansions and identified significant mosaicism in the majority of expanded alleles. Our model offers a practical way to quantify instability of tandem repeats across the genome and to detect unusually unstable repeat alleles.
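The underlying idea, summarizing per-read deviations from an allele consensus, can be illustrated with a toy proxy (mean absolute deviation of read-level repeat lengths from the consensus length; a deliberate simplification, not the authors' statistical model):

```python
import statistics

def instability_score(read_lengths, consensus_length):
    """Mean absolute deviation of per-read repeat lengths from the
    allele consensus, as a simple instability proxy."""
    deviations = [abs(r - consensus_length) for r in read_lengths]
    return sum(deviations) / len(deviations)

# Toy alleles: a stable allele and a mosaic (unstable) allele,
# each summarized by repeat lengths observed in individual reads
stable_reads = [30, 30, 31, 29, 30, 30]
mosaic_reads = [30, 34, 27, 41, 30, 22]

stable = instability_score(stable_reads, statistics.median(stable_reads))
mosaic = instability_score(mosaic_reads, statistics.median(mosaic_reads))
```

A real model must additionally separate technical noise from biological mosaicism (or, as here, quantify the two jointly) and handle structurally complex loci where a single repeat length is not well defined.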
bioinformatics · 2026-04-10 · v1
Structure-aware geometric graph learning for modeling protease-substrate specificity at scale
Guo, X.; Bi, Y.; Ran, Z.; Pan, T.; Sun, H.; Hao, Y.; Jia, R.; Wang, C.; Zhang, Q.; Kurgan, L.; Song, J.; Li, F.
Abstract:
Protease-substrate specificity is central to cellular regulation and disease pathogenesis, and accurately modeling its structural determinants remains challenging. Substrate recognition is governed by spatial constraints and higher-order relationships that extend beyond local sequence motifs. Most computational approaches rely predominantly on motif-centric or sequence-based representations, limiting their ability to capture the geometric and relational structure underlying enzymatic specificity. Here, we introduce OmniCleave, a structure-aware geometric graph learning framework for modeling protease-substrate specificity at scale. OmniCleave is trained on 57,278 structure-informed protease-substrate pairs derived from 9,651 substrates spanning over 100 proteases across six distinct families. The framework integrates multi-scale structural graphs with higher-order protease relational topology, explicitly encoding spatial context and inter-protease dependencies within a unified geometric representation. This formulation moves beyond local pattern recognition and enables transferable modeling across six protease families. Across large-scale benchmarks, the framework consistently outperforms existing approaches and reveals interpretable geometric determinants underlying substrate recognition. Experimental validation confirms three novel caspase-3 substrates and 21 cleavage sites predicted by OmniCleave, supporting the biological relevance of the learned representations. Together, OmniCleave provides a scalable geometric framework for modeling protease-substrate specificity, with practical utility for systematic analysis of protease biology.
bioinformatics · 2026-04-10 · v1
MTB-KB: A Curated Knowledgebase of Mycobacterium tuberculosis Related Studies
Li, P.; Li, C.; Zhu, R.; Sun, W.; Zhou, H.; Fan, Z.; Yue, L.; Zhang, S.; Jiang, X.; Luo, Q.; Han, J.; Huang, H.; Shen, A.; Bahetibieke, T.; Wang, J.; Zhang, W.; Wen, H.; Niu, H.; Bu, C.; Zhang, Z.; Xiao, J.; Gao, R.; Chen, F.
Abstract:
Tuberculosis (TB), caused by Mycobacterium tuberculosis (MTB), has regained its position as the world's leading killer among infectious diseases. Despite extensive research progress across epidemiology, diagnosis, drug development, treatment regimens, vaccines, drug resistance, virulence factors, and immune mechanisms, MTB-related knowledge remains fragmented across thousands of publications, limiting its effective use. To address this gap, we present MTB-KB, a literature-curated knowledgebase that systematically integrates high-impact findings from eight major sections of TB research. The current release contains 75,170 associations from 1,246 publications, covering 18,439 entities standardized using authoritative databases and WHO-endorsed classifications. A central feature is the interactive knowledge graph, which links cross-section associations to reveal and infer MTB-host interactions, treatment strategies, and vaccine development opportunities. MTB-KB also provides a user-friendly interface with browsing, advanced search, and statistical visualization. Overall, by consolidating dispersed MTB knowledge into a structured and accessible platform, MTB-KB provides a valuable resource for researchers, clinicians, and policymakers, supporting both basic and clinical TB research, enabling evidence-based TB prevention, diagnosis, and treatment, and contributing to global elimination efforts. MTB-KB is accessible at https://ngdc.cncb.ac.cn/mtbkb/.
bioinformatics · 2026-04-10 · v1
Synolog: A Scalable Synteny-Based Framework for Genome Architecture Characterization
Madrigal, G.; Catchen, J. M.
Abstract:
Characterizing genomic architecture across multiple organisms has been an ongoing task for decades. The continuing growth of genomic datasets not only serves as a resource for studying genome evolution but also warrants the availability of scalable and user-friendly software for processing these datasets. Here, we present Synolog, a bioinformatic toolkit that can automatically identify orthologs for both protein-coding and non-coding genes, synteny clusters across two or more genomes, retrogenes, and segmental duplications. Applying Synolog, we illustrate cases of local gene expansions in ecologically disparate turtle species, identify synteny clusters across hundreds of millions of years of metazoan evolution, and reconstruct chromosome-level assemblies in teleosts using the inferred synteny clusters, all using its integrated visual features. In parallel, we compare our orthogroup method to that of commonly used software and note the tradeoffs of making inferences solely based on sequence similarity versus a synteny-based approach.
bioinformatics · 2026-04-10 · v1
Impact of Regularization Methods and Outlier Removal on Unsupervised Sample Classification
Heckman, C. A.
Abstract:
Background: High-content assays have problems distinguishing biologically significant effects from the incidental effects of non-repeatable technical factors. Non-repeatable results are attributed to variations in the cell culture environment and the numerous, heterogeneous descriptors evaluated. The aim here was to determine whether preprocessing operations impacted the reproducibility of class assignments of experimental data. Methods: Batch effects that could affect reproducibility, i.e., signal/noise ratio, instrumental conditions, and segmentation, were controlled variables. The remaining batch effects (variations in materials, personnel, and culture environment) could not be controlled. The values of descriptors were measured directly from images. Exploratory factor analysis was used to extract an identifiable and interpretable feature, factor 4. In each of five trials, one sample was treated with the same chemical mixture (EXP) and another with the solvent vehicle alone (CON). Results: Repeated CON and EXP samples showed significant differences among factor 4 means in data regularized within each trial. The mean of Trial 3 CON differed significantly from all other CON samples. These differences disappeared upon regularization to comprehensive databases. Among repeated EXPs, the Trial 2 mean differed from three other EXPs, but regularization to comprehensive databases had little effect. However, classification patterns were unchanged after regularization to any comprehensive database derived by the same protocol. After regularization to datasets derived by two different protocols, the classification pattern differed but only reflected elevation of differences that had been marginal to statistical significance. Outlier removal was deleterious. Even with the most sparing definition of outliers, over 3% of the contents of a single sample were removed from most trials. Elimination based on the overall within-trial distributions caused type I and type II errors.
Conclusions: Non-repeatable factor 4 means in repeated trials had negligible influence on classification outcomes, so repeatability may not be a good indicator of assay quality. Irreducible batch effects, combined with small sample sizes and skewed distributions of the descriptor values, may account for non-repeatability. As the current results are based on real-world data, they suggest that non-repeatability is an uncorrectable feature of these assays. Classification patterns are not affected by several irreducible technical factors, namely materials, personnel, and non-repeatable environmental variables.
bioinformatics · 2026-04-10 · v1
Multi-scale spatial testing recovers gene programs missed by existing detection methods
Yang, C.; Zhang, X.; Chen, J.
Abstract:
Identifying spatially variable genes (SVGs) is the first analytical step in spatial transcriptomics, determining which genes and pathways are prioritized for downstream validation. Yet the restricted spatial models of current detection methods create systematic blind spots that can exclude biologically coherent programs from discovery. Here we present FlashS, which reformulates kernel-based spatial testing in the frequency domain to detect arbitrary multi-scale expression patterns while scaling to millions of cells. In human cardiac tissue, this broader detection capacity recovers a coherent PGC-1α-regulated mitochondrial biogenesis program, with 40 of 49 pathway genes spatially associated with ventricular cardiomyocytes, that PreTSA, a leading parametric alternative, largely misses (1 of 49 genes), a finding replicated in an independent cohort. Across 50 benchmark datasets spanning 9 platforms, FlashS achieves state-of-the-art ranking accuracy (mean Kendall τ = 0.935) and completes on the Allen Brain MERFISH atlas (3.94 million cells) in 12.6 minutes with 21.5 GB memory.
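The frequency-domain idea can be loosely illustrated: a spatially organized expression pattern concentrates spectral energy at low spatial frequencies, while unstructured noise spreads energy uniformly, so an FFT-based score separates the two. This is a toy single-gene illustration of the general principle on a regular grid, not the FlashS test statistic:

```python
import numpy as np

def low_frequency_score(expr_grid, cutoff=4):
    """Fraction of (mean-removed) spectral energy at low spatial
    frequencies; high values indicate smooth, spatially organized
    expression."""
    f = np.fft.fft2(expr_grid - expr_grid.mean())
    power = np.abs(f) ** 2
    ny, nx = expr_grid.shape
    ky = np.minimum(np.arange(ny), ny - np.arange(ny))  # folded freqs
    kx = np.minimum(np.arange(nx), nx - np.arange(nx))
    low = (ky[:, None] <= cutoff) & (kx[None, :] <= cutoff)
    return power[low].sum() / power.sum()

rng = np.random.default_rng(2)
y, x = np.mgrid[0:64, 0:64]
# "Gene 1": smooth stripe pattern plus noise; "gene 2": pure noise
patterned = np.sin(2 * np.pi * x / 32) + 0.3 * rng.standard_normal((64, 64))
noise = rng.standard_normal((64, 64))

s_pat = low_frequency_score(patterned)
s_noise = low_frequency_score(noise)
```

A real SVG test must also calibrate a null distribution for significance and handle irregular cell coordinates rather than a dense grid.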
bioinformatics · 2026-04-09 · v3
A Grid-Search Framework for Dataset-Specific Calibration of Actigraphy Sleep Detection Algorithms
Rahjouei, A.
Abstract:
Actigraphy is widely used for long-term sleep monitoring, but established sleep-wake scoring algorithms often require parameter tuning, which is commonly performed manually and can reduce reproducibility. In this study, a grid-search-based calibration framework for established actigraphy algorithms is presented and evaluated as a practical alternative to manual tuning. The method was evaluated using two datasets: a multi-subject polysomnography-validated actigraphy dataset and a self-collected dual-device dataset. In the polysomnography-validated dataset, grid-search optimization produced performance patterns similar to manual parameter selection, while slightly improving detection of sleep onset and sleep offset and yielding modest gains in wake-sensitive metrics. In the dual-device dataset, consensus and majority voting were useful for reducing the influence of brief wake episodes occurring within the main sleep period, including micro-awakenings that can fragment sleep predictions across individual algorithms. Overall, these findings show that grid-search can replace manual parameter tuning with a more explicit and reproducible procedure while providing small improvements in sleep timing estimation and benefiting ensemble-based handling of within-sleep wakefulness.
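Grid-search calibration of a scoring algorithm against reference labels follows a standard pattern: enumerate parameter combinations exhaustively and keep the one maximizing agreement with the reference. A minimal sketch with a toy epoch scorer and simulated data (the scorer, parameter names, and grid values are hypothetical, not the paper's algorithms or datasets):

```python
import itertools
import numpy as np

def score_sleep(activity, threshold, smooth):
    """Toy actigraphy scorer: moving-average the activity counts and
    call an epoch 'sleep' (1) when the smoothed value is below threshold."""
    kernel = np.ones(smooth) / smooth
    smoothed = np.convolve(activity, kernel, mode="same")
    return (smoothed < threshold).astype(int)

def accuracy(pred, truth):
    return float((pred == truth).mean())

# Simulated night: wake - sleep - wake, with reference (e.g. PSG) labels
rng = np.random.default_rng(3)
truth = np.array([0] * 40 + [1] * 80 + [0] * 40)
activity = np.where(truth == 1, 5, 50) + rng.poisson(3, truth.size)

# Exhaustive grid search over the two tunable parameters
grid = {"threshold": [10, 20, 30, 40], "smooth": [1, 3, 5, 9]}
best = max(
    (dict(zip(grid, combo)) for combo in itertools.product(*grid.values())),
    key=lambda p: accuracy(score_sleep(activity, **p), truth),
)
best_acc = accuracy(score_sleep(activity, **best), truth)
```

Because every evaluated combination and its score can be logged, the procedure is fully reproducible, which is the advantage over ad hoc manual tuning.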
bioinformatics · 2026-04-09 · v1
gbdraw: a genome diagram generator for microbes and organelles
Kawato, S.
Abstract:
Motivation: Generating graphical diagrams of microbial and organellar genomes is a common and essential task in bioinformatics. Existing tools often present a trade-off: powerful programming libraries require coding skills, while graphical applications require server processing or local installation with complex dependencies. This highlights the need for a tool that offers both programmatic control for batch processing and graphical accessibility for ease of use. Results: To fill this gap, I developed gbdraw, a web application that generates circular and linear genome diagrams from self-contained GenBank or DDBJ files or combinations of GFF3 annotation and FASTA sequence files. Its core functions include visualizing annotated features, plotting GC content/skew tracks, and optionally generating pairwise sequence comparisons for comparative genomics. It is available as both a GUI web application and a command-line utility. Unlike existing web-based tools that require data upload to a remote server, gbdraw operates entirely within the user's web browser. This serverless architecture ensures that sensitive sequence data never leaves the local machine, providing a secure environment for visualizing unpublished genomic data. Availability and Implementation: gbdraw is implemented in Python 3 (version 3.10+) and is freely available under the MIT license. The web app is available at https://gbdraw.app/. Source code and documentation are available at https://github.com/satoshikawato/gbdraw. The local version can be installed from the Bioconda channel using a conda-compatible package manager.
bioinformatics · 2026-04-09 · v1
GMIP-PLSR: A Nextflow Pipeline for GWAS and Multi-Omics Integration in Gene Prioritization Using PLSR
Kanchwala, M. S.; Xing, C.; Xuan, Z.
Abstract:
Genome-wide association studies (GWAS) have significantly advanced our understanding of complex traits and diseases, but their interpretive power remains limited due to challenges in identifying causal genes and pathways. Integrating GWAS with multi-omics data, such as gene expression, protein-protein interactions, and gene-pathway networks, has the potential to enhance biological insights and improve gene prioritization. To address this need, we developed the GWAS & Multi-omics Integration Pipeline (GMIP), a flexible and scalable framework that incorporates widely used tools such as PoPS, MAGMA, and benchmarker to enrich GWAS findings. However, PoPS suffers from multicollinearity in its features, which can impact performance. To overcome this, we introduce GMIP-PLSR, an extension of GMIP that uses Partial Least Squares Regression (PLSR) to manage multicollinearity effectively. We applied GMIP-PLSR across multiple GWAS datasets, demonstrating superior performance over PoPS in most cases. In a case study on NAFLD, GMIP-PLSR, using features derived from both disease-specific scRNA-seq and general PoPS features, identified gene sets with higher heritability and stronger enrichment in known NAFLD pathways, confirming its ability to enhance GWAS findings. Built on Nextflow, GMIP is computationally efficient, adaptable to diverse research environments, and provides a robust solution for gene reprioritization in post-GWAS analyses. GMIP-PLSR is available at https://github.com/mohammedmsk/GMIP.
bioinformatics · 2026-04-09 · v1
Spectral Graph Features for Reference-free RNA 3D Quality Assessment
Zhu, Y.; Zhang, H.; Calhoun, V. D.; Bi, Y.
Abstract:
Motivation: Existing RNA 3D structure quality assessment (QA) methods rely on local geometric descriptors or statistical potentials that evaluate atomic-level contacts but are blind to global topological coherence. This creates a critical failure mode (structures that are "locally correct but globally wrong") where well-formed local helices mask misplaced domains and incorrect overall packing. Results: We introduce SpecRNA-QA, a lightweight method that scores RNA 3D models using multi-scale spectral features derived from the graph Laplacian of inter-nucleotide contact networks. By computing eigenvalue distributions, heat-kernel traces, and spectral entropy across four distance scales with binary and Gaussian kernels, SpecRNA-QA captures global structural coherence inaccessible to conventional descriptors. In leave-one-out cross-validation on CASP16 (42 targets, 7368 models), spectral features achieve median per-target Spearman ρ = 0.69 [95% CI: 0.64-0.73], significantly outperforming an internal geometry baseline (ρ = 0.47, Δρ = +0.22, Wilcoxon p = 1.2 × 10^-10). Compared against established unsupervised statistical potentials, which require no labeled data unlike the supervised spectral model, rsRNASP outperforms on small-to-medium RNAs (ρ = 0.67 vs. 0.57, ≤200 nt). However, rsRNASP times out on most large RNAs (>200 nt), where SpecRNA-QA provides the strongest available quality signal (ρ = 0.72 vs. DFIRE 0.52), revealing clear complementarity between global-topological and local-energy scoring. A training-free heuristic using only three spectral statistics enables quality estimation without any labeled data.
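The spectral descriptors named here (Laplacian eigenvalues, heat-kernel traces, and spectral entropy of a contact network) can be sketched on toy coordinates. This is a simplified single-scale, binary-kernel version for illustration, not the SpecRNA-QA implementation:

```python
import numpy as np

def spectral_features(coords, cutoff=8.0, t=1.0):
    """Toy spectral descriptors of a 3D structure: build a binary
    contact graph at `cutoff` (angstroms), then compute Laplacian
    eigenvalues, the heat-kernel trace sum(exp(-t * lambda_i)), and
    the spectral entropy of the heat-kernel weights."""
    d = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    adj = ((d < cutoff) & (d > 0)).astype(float)
    lap = np.diag(adj.sum(axis=1)) - adj          # graph Laplacian
    eig = np.linalg.eigvalsh(lap)                 # symmetric -> real
    weights = np.exp(-t * eig)
    heat_trace = weights.sum()
    p = weights / heat_trace
    entropy = -(p * np.log(p + 1e-12)).sum()
    return eig, heat_trace, entropy

# Toy "structure": 20 points along a helix-like curve
theta = np.linspace(0, 4 * np.pi, 20)
coords = np.stack(
    [5 * np.cos(theta), 5 * np.sin(theta), np.linspace(0, 30, 20)], axis=1
)
eig, heat_trace, entropy = spectral_features(coords)
```

Because the eigenvalue spectrum reflects the whole connectivity pattern at once, such features respond to global rearrangements (e.g. a misplaced domain) that leave local contacts intact.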
bioinformatics · 2026-04-09 · v1
Quantifying Scientific Consensus in Biomedical Hypotheses via LLM-Assisted Literature Screening
Kim, U.; Kwon, O.; Lee, D.
Abstract:
Systematic literature reviews are labor-intensive tasks in biomedical research. While Large Language Models (LLMs) using Retrieval-Augmented Generation (RAG) techniques have enhanced information accessibility, the inherent complexity of biological systems (characterized by high context dependency and conflicting data) remains a primary driver of LLM hallucinations. This imposes a structural constraint that limits the precision of evidence synthesis. To address these limitations, we propose an automated framework designed for the exhaustive identification of supporting and contradictory evidence within a target literature set. Rather than relying on a model's pre-trained knowledge, our system requires the LLM to review each paper individually to determine its alignment with a specific research hypothesis. By evaluating semantic context, the framework captures subtle contradictions that are often overgeneralized by conventional methods. The framework's performance was validated using the BioNLI task, where it demonstrated high classification accuracy in distinguishing whether evidence supports or contradicts a given hypothesis. Notably, the implementation of an ensemble approach provided superior stability and slightly higher precision compared to individual models. Furthermore, the framework exhibited robust performance across several well-established biological hypotheses, confirming its practical utility and reliability in real-world research. This approach provides a rigorous basis for biomedical discovery by enabling the precise, systematic analysis of biological literature and the robust collection of evidence.
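The ensemble step can be illustrated with a simple majority vote over per-paper model verdicts. The verdict labels, tie-breaking rule, and paper IDs below are hypothetical, sketching one conservative aggregation scheme rather than the paper's exact procedure:

```python
from collections import Counter

def ensemble_verdict(votes):
    """Majority vote across model verdicts ('support', 'contradict',
    'neutral'); ties fall back to 'neutral' to stay conservative."""
    top = Counter(votes).most_common()
    if len(top) > 1 and top[0][1] == top[1][1]:
        return "neutral"
    return top[0][0]

# Hypothetical per-paper verdicts from three independent models
papers = {
    "paper_1": ["support", "support", "contradict"],
    "paper_2": ["contradict", "support", "neutral"],
    "paper_3": ["support", "support", "support"],
}
verdicts = {pid: ensemble_verdict(v) for pid, v in papers.items()}

# Aggregate per-paper verdicts into a consensus profile for the hypothesis
consensus = Counter(verdicts.values())
```

Per-paper aggregation before corpus-level counting is what lets disagreement between models surface as "neutral" instead of being silently averaged away.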
bioinformatics · 2026-04-09 · v1
Germline VCF Annotator: a lightweight pipeline for processing germline VCFs with robust variant extraction and read evidence quality control
Manojlovic, Z.
Abstract:
Raw variant calls are typically distributed as VCF files and are not well-suited for direct human review. They are intended for programmatic parsing, and spreadsheet import can distort data through automatic type conversion. Furthermore, variants in VCF are commonly annotated to add gene context and predicted functional consequences. Ensembl VEP, a widely used standard for transcript-aware variant annotation, was adapted in this study to generate standardized consequence fields across genomic features. Using a colon crypt whole-genome sequencing cohort as the motivating dataset, this study examined whether variation at DNA damage response and repair (DDR) loci could contribute to mutation-burden patterns in normal colon crypts, including patterns associated with age and potential treatment-related exposure. To make this question testable in a reproducible table-based format, the Germline VCF Annotator was developed as a two-step workflow that normalizes germline VCFs, generates VEP tabular annotations with explicit allele fields, and then extracts variants of interest and appends read-evidence metrics to assign a rules-based QC class. Within-patient concordance across technical repeats at predefined DDR loci was near-perfect after filtering for nonsilent SNVs with read depth ≥ 15, with discordance concentrated among Low-QC loci. Bulk and crypt-derived samples showed no age-related trend in DDR burden. Although the demonstration centers on DDR and aging, the Germline VCF Annotator is applicable to other gene sets that require human-readable locus-level summaries with retained allele provenance and read evidence.
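A rules-based QC class of the kind described can be sketched as a pure function over read-evidence metrics. The depth ≥ 15 cutoff comes from the abstract; the allele-fraction rule, class labels, and variant records below are hypothetical illustrations, not the tool's actual rules:

```python
def qc_class(depth, alt_reads, min_depth=15, min_vaf=0.3):
    """Toy rules-based QC: require read depth >= min_depth and a
    germline-consistent variant allele fraction; otherwise Low-QC."""
    if depth < min_depth:
        return "Low-QC"
    vaf = alt_reads / depth
    return "High-QC" if vaf >= min_vaf else "Low-QC"

# Hypothetical locus-level records with read-evidence metrics appended
variants = [
    {"id": "ATM_site_1", "depth": 42, "alt_reads": 20},   # well supported
    {"id": "ATM_site_2", "depth": 9, "alt_reads": 5},     # too shallow
    {"id": "BRCA2_site_1", "depth": 60, "alt_reads": 6},  # low fraction
]
classes = {v["id"]: qc_class(v["depth"], v["alt_reads"]) for v in variants}
```

Encoding the rules as an explicit function keeps the QC decision auditable: every class assignment can be reproduced from the table's own depth and allele-count columns.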
bioinformatics · 2026-04-09 · v1
PoolParty: streamlined design of DNA sequence libraries in Python
Liu, Z.; Cordero, A.; Kinney, J. B.
Abstract:
Computationally designed DNA sequence libraries are essential components of many high-throughput assays. They are also increasingly used in silico to analyze genomic AI models. Designing these libraries, however, remains tedious and error-prone. Here we describe PoolParty, a Python package that streamlines the design of complex oligo pools using a simple but flexible API. In PoolParty, each library is represented by a computational graph that can be specified in just a few lines of code. Over 50 built-in operations cover nucleotide- and codon-level mutagenesis, motif insertion, barcode generation, and more. PoolParty also provides "design cards" detailing how each sequence was generated.
bioinformatics · 2026-04-09 · v1
STAnalyzer: Transparent Spatial Transcriptomics Analysis via an Agentic Architecture
Luo, H. H.; Liu, L.; Xing, Z.; Li, X.; Zhang, X.; Du, W.; Liu, B.; Wang, J.; Yu, G.
Abstract:
Spatial transcriptomics enables high-resolution profiling of gene expression within spatial contexts, yet its potential is often hindered by fragmented toolchains, intricate parameters, and cognitive bottlenecks of interpreting high-dimensional data. While recent Large Language Model agents have attempted to automate this process, they remain constrained by rigid execution logic, lack multimodal feedback for self-correction, and operate in epistemic isolation from established biological knowledge. Here, we present STAnalyzer, an intelligent multi-agent framework designed to automate the end-to-end analytical lifecycle from raw data processing to biological hypothesis generation. Transcending traditional pipelines, STAnalyzer employs a collaborative intelligence architecture to achieve three core capabilities: (1) Intent-Driven Orchestration, which dynamically translates natural language queries into rigorous bioinformatics workflows; (2) Multi-Modal Self-Refinement, which autonomously ensures analytical robustness through closed-loop synthesis of evidence from visual patterns and statistical metrics; and (3) Evidence-Based Cross-Validation, which bridges the gap between data-driven correlations and biological causation by anchoring findings in ground-truth literature and structured databases. By eliminating manual analytical bottlenecks and ensuring rigorous evidentiary traceability and transparency, STAnalyzer makes high-resolution spatial omics more accessible to a broader research community. It provides a robust and scalable framework for cross-platform automated analysis and accelerated biological discovery, translating massive spatial datasets into verifiable biological insights.
bioinformatics · 2026-04-09 · v1
IEKB: a comprehensive knowledge base for inner ear genetics integrating curated associations, cochlear interactions, Bayesian candidate prioritisation, explainable dark-gene support relations, and a scientific entity network
Wang, H.; Chen, W.; Ning, H.; Cai, Y.; Xu, Y.; Hou, X.; Pang, L.; Luo, Z.; Tian, C.
Abstract:
Inner-ear genetics has expanded rapidly, yet the supporting evidence remains dispersed across a vast literature and across resources that typically emphasise loci, variants, or expression data rather than integrated biological interpretation. Here we present the Inner Ear Knowledge Base (IEKB; https://earkb.org), an open database that unifies curated associations, cochlear interaction evidence, candidate prioritisation, explainable support relations, and network exploration for inner-ear research. IEKB was built with an automated agent-assisted curation workflow that combines schema-constrained literature extraction, continuous human monitoring, and final expert review by inner-ear genetics researchers. By systematically analysing 250,696 PubMed-indexed records retrieved across 16,563 screened genes, IEKB curates 6,051 gene-phenotype-disease associations from 2,494 genes across 43 phenotype categories and 4,102 cochlear gene-gene interactions with pathway, cell-type, and experimental context. IEKB further includes a Bayesian "dark matter" module that prioritises 243,071 candidate gene-phenotype associations for 13,229 genes across all 43 phenotypes (global AUC-ROC = 0.8603; global AUC-PR = 0.1674), together with a supervised dark-relation layer that ranks phenotype-specific known-gene support for each candidate and a multi-entity scientific network containing nearly 4,000 entities, 28,616 deterministic edges, and 83,712 literature-derived relational links. The web resource supports interactive search, multi-parameter filtering, gene-detail pages, bibliometric exploration, domain-specific enrichment against IEKB phenotype and disease gene sets, network visualisation, bulk download in CSV, JSON, SQLite, and XLSX formats, and natural-language evidence-grounded question answering through a companion conversational interface (IEKB QA).
To our knowledge, IEKB is the first openly accessible inner-ear resource that integrates curated associations, cochlear interactions, probabilistic candidate prioritisation, auditable known-gene support relations for novel candidates, and a multi-entity scientific network within a single database. All data are released without registration under the CC BY 4.0 license.
bioinformatics | 2026-04-09 | v1
Agentic systems are adept at solving well-scoped, verifiable problems in computational biology
Nair, S.; Gunsalus, L.; Orcutt-Jahns, B.; Rossen, J.; Lal, A.; Donno, C. D.; Celik, M. H.; Fletez-Brant, K.; Xie, X.; Bravo, H. C.; Eraslan, G.
Abstract
We introduce CompBioBench, a benchmark of 100 diverse tasks for evaluating agentic systems in computational biology. Unlike mathematics and programming, which more readily admit systematic verification, biological data are inherently noisy and open to interpretation. To enable objective evaluation without reducing tasks to prescriptive checklists, we propose a new benchmark construction strategy based on synthetic/augmented data and metadata scrambling/scrubbing of real datasets to create challenging problems with a single ground-truth answer that require multi-step reasoning, tool use, bespoke code, and interaction with real-world external resources. The benchmark spans genomics, transcriptomics, epigenomics, single-cell analysis, human genetics, and machine learning workflows. Questions are curated by domain experts to cover a broad range of skills with varying difficulty. We evaluate leading general-purpose agentic systems starting from a bare-minimum environment, requiring them to fetch data and tools as needed to solve each problem. We find strong end-to-end performance, with Codex CLI (GPT 5.4) reaching 83% accuracy and Claude Code (Opus 4.6) reaching 81%. On the hardest questions, Codex CLI (GPT 5.4) reaches 59%, while Claude Code (Opus 4.6) reaches 69%. CompBioBench provides a practical testbed for measuring the progress of agentic systems in computational biology and for guiding future benchmark design.
bioinformatics | 2026-04-09 | v1
Quaternion Spectral Fingerprinting of DNA: GPU-Accelerated Multi-Channel Fourier Analysis for Alignment-Free Genomics
Bergach, M. A.
Abstract
Spectral methods for DNA sequence analysis (treating genomic data as a discrete signal and computing its Fourier transform) were proposed over three decades ago but remained impractical for whole-genome analysis due to computational cost. We present a quaternion Fourier transform framework that encodes DNA as a quaternion-valued signal q[n] ∈ {1, i, j, k} mapping to the four nucleotides {A, T, G, C}, and prove that the full quaternion spectrum is computable from exactly two standard complex FFTs: Q(k) = Z_1(k) + Z_2(N-k)·j, where Z_1 = FFT(u_A + i·u_T) and Z_2 = FFT(u_G + i·u_C). We establish that the resulting spectral fingerprint F(k) = (|Z_1(k)|^2, |Z_2(k)|^2) is invariant under both cyclic shift and reverse complement, the two fundamental symmetries of double-stranded DNA. Building on this theoretical foundation, we develop three computational tools: (i) a 4x4 Hermitian cross-spectral matrix with inter-channel coherence analysis, (ii) a genome spectrogram via sliding-window short-time Fourier transform, and (iii) an alignment-free spectral variant detection algorithm with O(N log N) complexity. Applying Welch's cross-spectral coherence analysis to E. coli K-12, we discover that the DNA helical repeat (~11 bp) is invisible to the standard power spectrum but clearly detected through the cross-spectral matrix condition number (κ = 6.5), demonstrating that multi-channel analysis reveals structural periodicities that single-channel methods miss. Phase spectrum analysis recovers the characteristic nucleotide ordering within codons (A → T → G → C), while three distinct frequency regimes of inter-nucleotide coupling emerge: complementary-dominated (long-range), purine/pyrimidine-dominated (structural), and codon-position-dominated (coding). Cross-species validation on 18 genomes spanning all three domains of life (Bacteria: 5, Archaea: 3, Eukarya: 10) with GC content from 19.6% (P. falciparum) to 69.5% (T. thermophilus) confirms the universality of these findings. The helical repeat is detected via cross-spectral coherence in 18/18 organisms (100%). All 10 eukaryotes show A-T dominance at the helical repeat, a spectral signature of nucleosome wrapping absent from prokaryotes. Non-complementary pairs (A-C, T-G) dominate the coding frequency in 17/18 organisms. Validation on human chromosome 21 (46.7 Mb, processed in 5.0 s on Apple M1) reveals eukaryote-specific spectral signatures (nucleosome positioning at 10.67 bp, nucleosome spacing at 170.7 bp, and Alu repeat dominance at 341 bp) absent from prokaryotic spectra. A proof-of-concept spectral variant detection experiment achieves 100% read-matching accuracy (100/100 reads) and statistically significant discrimination of SNPs from sequencing errors (t = 14.80, p < 0.001, Cohen's d = 1.64), scaling to d = 8.96 at 30x coverage. The full human genome can be spectrally analyzed in approximately 3-4 seconds on an M1 GPU and under 1 second on M4 Max, enabling interactive spectral genomics on commodity hardware.
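The two-FFT identity in this abstract is concrete enough to sketch numerically. A minimal NumPy illustration (function names are mine, not the paper's) builds the four indicator signals, computes Z_1 = FFT(u_A + i·u_T) and Z_2 = FFT(u_G + i·u_C), returns the fingerprint F(k) = (|Z_1(k)|^2, |Z_2(k)|^2), and checks the two claimed invariances:

```python
import numpy as np

def spectral_fingerprint(seq: str):
    """Two-channel spectral fingerprint from two complex FFTs.

    Builds 0/1 indicator signals u_A, u_T, u_G, u_C for the sequence,
    forms Z1 = FFT(u_A + i*u_T) and Z2 = FFT(u_G + i*u_C), and returns
    F(k) = (|Z1(k)|^2, |Z2(k)|^2) as two real arrays.
    """
    u = {b: np.array([1.0 if c == b else 0.0 for c in seq]) for b in "ATGC"}
    z1 = np.fft.fft(u["A"] + 1j * u["T"])
    z2 = np.fft.fft(u["G"] + 1j * u["C"])
    return np.abs(z1) ** 2, np.abs(z2) ** 2

def reverse_complement(seq: str) -> str:
    return seq.translate(str.maketrans("ATGC", "TACG"))[::-1]

seq = "ATGCGGATCCATGAACGT"
f1, f2 = spectral_fingerprint(seq)
# Cyclic shift only multiplies each FFT bin by a unit-modulus phase,
# so the power pair is unchanged.
s1, s2 = spectral_fingerprint(seq[7:] + seq[:7])
# Reverse complement swaps A<->T and G<->C and reverses time; for real
# indicator signals the magnitudes again survive bin by bin.
r1, r2 = spectral_fingerprint(reverse_complement(seq))
print(np.allclose(f1, s1), np.allclose(f2, s2),
      np.allclose(f1, r1), np.allclose(f2, r2))
```

Both invariances follow from elementary DFT identities (a cyclic shift contributes only a phase; reversal maps X(k) to a phase times X(-k), and conjugate symmetry of real-signal spectra does the rest), so the check passes exactly up to floating-point tolerance.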
bioinformatics | 2026-04-09 | v1
End-to-end evaluation of pipelines for metagenome-assembled genomes reveals hidden performance gaps
Coleman, I.; Ma, J.; Qian, G.; Jiang, Y.; Brown Kav, A.; Korem, T.
Abstract
The generation of Metagenome-Assembled Genomes (MAGs) has become a standard and basic step in the analysis of metagenomic data. This multi-step process, which includes assembly, binning, refinement, and quality control, has many alternative approaches, algorithms, and parameters. Determining the ideal approach for a given ecosystem and study, or highlighting algorithmic gaps in need of additional research and development, requires rigorous benchmarking. We present MAG-E (MAG pipeline Evaluator), a generalizable and expandable framework for end-to-end evaluation of entire MAG pipelines: from assembly, through binning, to quality control and filtering. MAG-E relies on simulations that are built to match an ecosystem of interest and provide a ground truth for accurate evaluation. To demonstrate the capabilities of MAG-E, we benchmark two assemblers, six binning algorithms, three binning modes, and three quality control and refinement methods in the context of the human gut microbiome. Our findings offer multiple insights into optimal MAG generation in this context. We find that metaSPAdes consistently outperforms MEGAHIT in terms of recall (completeness), and that COMEBin overall outperforms alternative binning algorithms, but has lower precision than SemiBin2. While multi-sample binning results in higher precision, as previously shown, single-sample binning has higher recall and leads to better overall performance with modern binners. Binning refinement, which combines bins from multiple different algorithms, leads to reduced performance. We further show that CheckM2 systematically overestimates completeness and underestimates contamination, and that this is partially ameliorated when using GUNC. Finally, we analyze performance at the contig level, and demonstrate that binning algorithms systematically underperform for prophages and fail to bin contigs that are shared between genomes.
Overall, MAG-E offers deep insights into successes and gaps in this important analytic process.
bioinformatics | 2026-04-09 | v1
Near perfect identification of half sibling versus niece/nephew avuncular pairs without pedigree information or genotyped relatives
Sapin, E.; Kelly, K.; Keller, M. C.
Abstract
Motivation: Large-scale genomic biobanks contain thousands of second-degree relatives with missing pedigree metadata. Accurately distinguishing half-sibling (HS) from niece/nephew-avuncular (N/A) pairs, both sharing approximately 25% of the genome, remains a significant challenge. Current SNP-based methods rely on Identical-By-Descent (IBD) segment counts and age differences, but substantial distributional overlap leads to high misclassification rates. There is a critical need for a scalable, genotype-only method that can resolve these "half-degree" ambiguities without requiring observed pedigrees or extensive relative information. Results: We present a novel computational framework that achieves near-complete separation of HS and N/A pairs using only genotype data. Our approach utilizes across-chromosome phasing to derive haplotype-level sharing features that summarize how IBD is distributed across parental homologues. By modeling these features with a Gaussian mixture model (GMM), we demonstrate near-perfect classification accuracy (> 98%) in biobank-scale data. Furthermore, we show that these high-confidence relationship labels can serve as long-range phasing anchors, providing structural constraints that improve the accuracy of across-chromosome homologue assignment. This method provides a robust, scalable solution for pedigree reconstruction and the control of cryptic relatedness in large-scale genomic studies.
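The classification step described here, fitting a Gaussian mixture model to a haplotype-level sharing feature, can be sketched with a minimal two-component EM on synthetic data. Everything below (the feature values, separation, and function names) is illustrative only; the paper's actual features summarize how IBD is distributed across parental homologues and are not reproduced here:

```python
import numpy as np

def fit_gmm_1d(x, n_iter=100):
    """Minimal EM for a two-component 1-D Gaussian mixture.

    Returns mixture weights, means, standard deviations, and the
    per-point responsibility of component 1.
    """
    mu = np.quantile(x, [0.25, 0.75])   # crude but robust initialization
    sd = np.array([x.std(), x.std()])
    w = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each point
        d = x[:, None] - mu
        pdf = w * np.exp(-0.5 * (d / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
        r = pdf / pdf.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances
        nk = r.sum(axis=0)
        w = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        d = x[:, None] - mu
        sd = np.sqrt((r * d ** 2).sum(axis=0) / nk)
    return w, mu, sd, r[:, 1]

rng = np.random.default_rng(0)
# Hypothetical sharing feature: HS and N/A pairs drawn from two
# well-separated distributions (values made up for illustration).
hs = rng.normal(0.0, 1.0, 500)
na = rng.normal(6.0, 1.0, 500)
x = np.concatenate([hs, na])
w, mu, sd, resp = fit_gmm_1d(x)
labels = (resp > 0.5).astype(int)
truth = np.concatenate([np.zeros(500, int), np.ones(500, int)])
acc = max((labels == truth).mean(), (labels != truth).mean())  # label switching
print(round(float(acc), 3))
```

On well-separated synthetic clusters the EM converges in a few iterations and classification is near-perfect; the hard part in the real setting is engineering features whose HS and N/A distributions separate this cleanly.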
bioinformatics | 2026-04-08 | v6
TPCAV: Interpreting deep learning genomics models via concept attribution
Yang, J.; Mahony, S.
Abstract
Interpreting genomics deep learning models remains challenging. Existing feature attribution methods are largely restricted to one-hot DNA inputs and therefore cannot assess the influence of more general genomic features such as chromatin states or genomic repeats. Concept attribution methods offer an input-agnostic global interpretation framework, yet they have not been systematically applied to interpret neural network applications in genomics. We present the first application of concept attribution to interpret genomics deep learning models by adapting the Testing with Concept Activation Vectors (TCAV) method. We improve upon the original TCAV method by incorporating a PCA-based decorrelation transformation to address correlated and redundant embedding features commonly observed in genomics deep learning models, resulting in the Testing with PCA-projected Concept Activation Vectors (TPCAV) approach. We also introduce a strategy for extracting concept-specific input attribution maps. We evaluate our approach by interpreting influential biological concepts across a diverse set of genomics models spanning multiple input representations and prediction tasks. We demonstrate that TPCAV provides comparable motif feature interpretation to TF-MoDISco on one-hot encoded DNA-based transcription factor binding prediction models. TPCAV also enables robust interpretive analysis of how more general biological concepts such as repetitive elements and chromatin state annotations contribute towards predictions. TPCAV uniquely generalizes to interpret features learned by tokenized foundation models as well as models incorporating chromatin signals as inputs. We further show that TPCAV can identify representative regions associated with specific concepts, motivating downstream investigation of distinct regulatory mechanisms. TPCAV provides a flexible and robust complement to existing model interpretation techniques.
bioinformatics | 2026-04-08 | v3
A longitudinal data framework for context-specific genotype-to-phenotype mapping
Veith, T.; Beck, R. J.; Tagal, V.; Li, T.; Alahmari, S.; Cole, J.; Hannaby, D.; Kyei, J.; Yu, X.; Maksin, K.; Schultz, A.; Lee, H.; Diaz, A.; Lupo, J.; El Naqa, I.; Eschrich, S. A.; Ji, H.; Andor, N.
Abstract
Molecular assays can resolve clonal structure, but they are expensive and typically sparse in time, whereas phenotypic observations such as imaging can be collected frequently but often are not preserved in the context needed for later interpretation. We present CLONEID, an event-based framework for organizing clone-resolved phenotypic, molecular, and specimen-context records so that genotype-to-phenotype interpretation can be maintained across time. CLONEID links time-stamped Events, assay-specific Perspectives, and reconciled Identities through structured ingestion, provenance-aware retrieval, and reproducible export, complementing upstream clone-calling methods. In a long-term gastric cancer density-selection experiment, CLONEID linked repeated culture events, growth measurements, and late karyotypic profiling within a shared record, supporting longitudinal interpretation of phenotypic adaptation together with underlying chromosomal state.
bioinformatics | 2026-04-08 | v3
Local and Global Patterns Support Medical Imaging as a Biomarker of Ageing
Mueller, T. T.; Starck, S.; Llalloshi, R.; Kaissis, G.; Ziller, A.; Graf, R.; Schlett, C.; Ringhof, S.; Bamberg, MD, MPH, F.; Wielpuetz, M.; Völzke, H.; Leitzmann, M.; Niendorf, T.; Keil, T.; Krist, L.; Pischon, T.; Karch, A.; Berger, K.; Kirschke, J.; Rueckert, D.; Braren, R.
Abstract
Background: Understanding human ageing across multiple organs is essential for characterising individual health trajectories and identifying abnormal ageing processes. Multi-organ imaging provides an opportunity to quantify biological ageing beyond chronological age. The aim of this study is to assess organ-specific and whole-body ageing patterns and their associations with disease and lifestyle factors. Methods: In this large-scale study, we evaluate biological ageing patterns using 70,000 MRI scans from the UK Biobank and the German National Cohort. We employ 3D ResNet-18 models to predict chronological age from various body regions (brain, heart, liver, spine, lungs, muscle, and intestine) and the whole body. From these predictions, we derive age gaps relative to a strictly healthy reference cohort, which enables the identification of accelerated ageing patterns. We then evaluate associations with chronic diseases and lifestyle factors, and develop a virtual ageing framework to explore counterfactual scenarios by substituting anatomical regions across subjects, quantifying local impacts on global biological age. Results: Here we show significant associations between detected accelerated ageing and specific chronic diseases, including multiple sclerosis and chronic obstructive pulmonary disease, as well as lifestyle factors such as smoking and physical activity. Virtual substitution of anatomical regions demonstrates that local substitutions can influence global ageing patterns. Conclusions: This study demonstrates that multi-organ imaging enables the detection of abnormal ageing patterns at both local and global levels. The presented framework provides a foundation for improved risk stratification and supports the development of personalised approaches to health assessment and disease prevention.
bioinformatics | 2026-04-08 | v3
Reconstructing biologically coherent cellular profiles from imaging-based spatial transcriptomics
Yuan, L.; Zheng, Y.; Zhang, S.; Beroukhim, R.; Deshpande, A.
Abstract
In imaging-based spatial transcriptomics, transcript-to-cell assignment shapes downstream biological interpretation including cell typing, ligand-receptor inference, and niche characterization. However, two-dimensional segmentation of volumetric tissue often yields mixed cellular profiles, while cells without detected nuclei are missed entirely, distorting the aforementioned downstream analyses. We present TRACER, which refines cellular representations in imaging-based transcriptomics by leveraging gene-gene coherence and spatial co-localization of transcripts observed directly in the data, without requiring external annotations or reference atlases. TRACER resolves mixed cellular profiles and reconstructs partial cells whose nuclei are not detected, enabling more complete representation of cells within the tissue section. We also introduce coherence-based metrics that quantify transcriptional purity and conflict, enabling platform-agnostic benchmarking of segmentation quality. Across diverse platforms, tissues, and segmentation methodologies, TRACER consistently and reproducibly improves the coherence of cellular profiles and the quality of downstream analyses.
bioinformatics | 2026-04-08 | v2
Genetic demultiplexing and transcript start site identification from nanopore sequencing of 10x Genomics multiome libraries
Mears, J.; Orchard, P.; Varshney, A.; Bose, M. L.; Robertson, C. C.; Piper, M.; Pashos, E.; Dolgachev, V.; Manickam, N.; Jean, P.; Kitzman, D. W.; Fauman, E.; Damilano, F.; Roth Flach, R. J.; Nicklas, B.; Parker, S. C.
Abstract
Short-read Illumina sequencing of 10x Genomics single-nucleus multiome libraries captures only the 3' end of RNA transcripts, losing transcription start site (TSS) information. Here we demonstrate nanopore sequencing of 10x multiome libraries, which enables the profiling of full-length transcripts. We show concordance with common short-read sequencing-based workflows, including successful genetic demultiplexing of nanopore data despite its higher error rate. We compare TSS identified using nanopore sequencing of multiome cDNA to those identified using a short-read 5' assay, and provide an optimized approach for the preprocessing of nanopore reads prior to TSS identification. We find that nanopore sequencing of multiome cDNA captures a median of 63% of the TSS detected by the 5' assay.
bioinformatics | 2026-04-08 | v2
Horse, not zebra: accounting for lineage abundance in maximum likelihood phylogenetics
De Maio, N.
Abstract
Maximum likelihood phylogenetic methods are popular approaches for estimating evolutionary histories from genome data. These methods do not make prior assumptions regarding the strategies used for deciding which genomes were sequenced. However, in genomic epidemiology the sequencing rate is often agnostic to the specific pathogen strain considered. In this scenario, a pathogen strain's prevalence should be reflected in its relative abundance in the genome data. Here, I show that this simple assumption, when appropriate and incorporated within maximum likelihood phylogenetics, greatly improves the accuracy of phylogenetic inference. I introduce and assess two separate approaches to achieve this. The first approach rescales the likelihood of a phylogenetic tree by the number of distinct binary topologies obtainable by arbitrarily resolving multifurcations in the tree. This approach interprets multifurcations as the result of a lack of signal for resolving a bifurcating topology rather than as an instantaneous multifurcating event. The second approach instead includes a tree prior that assumes that genomes are sequenced at a rate proportional to their abundance. Both approaches favor phylogenetic placement at abundant lineages, and dramatically improve the accuracy of phylogenetic inference in scenarios like SARS-CoV-2 phylogenetics, where large multifurcations are common. This considerable impact is also observed in real pandemic-scale SARS-CoV-2 genome data, where accounting for lineage prevalence reduces phylogenetic uncertainty by around one order of magnitude. Both approaches were implemented in the open source phylogenetic software MAPLE v0.7.5.4 (https://github.com/NicolaDM/MAPLE).
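The rescaling factor in the first approach has a standard closed form: a polytomy with c child lineages admits (2c - 3)!! rooted binary resolutions. A short sketch (function names are mine; MAPLE's internal implementation may differ) computes this count and the resulting log-correction for a tree with several multifurcations:

```python
from math import log

def n_resolutions(children: int) -> int:
    """Number of rooted binary topologies that resolve a single polytomy
    with `children` child lineages: the double factorial (2c - 3)!!.
    Intuition: when the k-th child is added to a partial binary tree,
    it can attach to any of the 2k - 3 existing branches."""
    count = 1
    for k in range(3, children + 1):
        count *= 2 * k - 3
    return count

def log_resolution_factor(polytomy_sizes) -> float:
    """Log of the product of resolution counts over all multifurcations,
    i.e. the (log) factor by which the tree likelihood is rescaled."""
    return sum(log(n_resolutions(c)) for c in polytomy_sizes)

print(n_resolutions(3), n_resolutions(4), n_resolutions(5))
```

A bifurcation (c = 2) contributes a factor of 1, so fully resolved trees are unaffected; large multifurcations, common in SARS-CoV-2 trees, contribute rapidly growing factors, which is why working in log space is the practical choice.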
bioinformatics | 2026-04-08 | v2
GAP-MS: Automated validation of gene predictions using integrated mass spectrometry evidence
Abbas, Q.; Wilhelm, M.; Kuster, B.; Frischman, D.
Abstract
Accurate genome annotation is fundamental to modern biology, yet distinguishing authentic protein-coding sequences from prediction artifacts remains challenging, particularly in complex plant genomes where automated methods are error-prone and manual curation is rarely feasible due to prohibitive time and costs. Here, we present GAP-MS (Gene model Assessment using Peptides from Mass Spectrometry), an automated proteogenomic pipeline that leverages mass spectrometry evidence to systematically validate the protein-level accuracy of predicted gene models. Applied across 9 major crop species, GAP-MS consistently improved prediction precision for four widely used gene prediction tools. In addition to filtering erroneous models, the pipeline identified hundreds of previously missing gene models from current standard reference annotations. These peptide-supported loci were further verified by transcriptional evidence, well-supported functional annotations, and high coding-potential scores. Together, these results demonstrate that direct proteomic evidence provides a robust framework for resolving annotation ambiguities, defining high-confidence reference proteomes, and uncovering overlooked protein-coding genes, while facilitating the identification of sequences that may require further investigation.
bioinformatics | 2026-04-08 | v2
Sampling protein structural token space enables accurate prediction of multiple conformations
Wang, Z.; Yu, Y.; Yu, C.; Bu, D.
Abstract
Protein function is fundamentally mediated by ensembles of distinct metastable states. However, existing methods, such as AlphaFold 3, typically exhibit a bias toward predicting a single dominant state, failing to capture alternative conformations or provide robust metrics for identifying high-quality multi-state conformations. Here, we present MultiStateFold (MSFold), a framework that integrates Parallel Tempering into the discrete structure token space of the ESM3 protein language model. By conceptualizing the model's latent space as an implicit energy landscape, MSFold enables global exploration and barrier crossing, thereby overcoming the local sampling limitations inherent in base generative models. Across a benchmark of 313 multi-conformation pairs, MSFold sets a new performance standard: it achieves the highest success rate in modeling native states and substantially outperforms leading methods, including AlphaFold 3, on challenging alternative conformations, while maintaining competitive accuracy for primary structures. Furthermore, we propose Sequence Log-Likelihood (SLL), a novel confidence metric derived from sequence-structure consistency. Our results demonstrate that SLL offers a modest improvement over standard metrics such as pTM and pLDDT. This work establishes a new paradigm for conformational sampling, bridging classical statistical physics with protein language models.
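The Parallel Tempering component admits a compact illustration. In generic replica exchange, chains run at different inverse temperatures and periodically attempt to swap states with a Metropolis acceptance probability; how MSFold defines the energy over ESM3's structure tokens is not specified in the abstract, so the sketch below (including the replica dict layout) is the textbook exchange rule, not the paper's implementation:

```python
import math
import random

def swap_acceptance(beta_i: float, beta_j: float, e_i: float, e_j: float) -> float:
    """Metropolis acceptance probability for exchanging the states of two
    replicas at inverse temperatures beta_i, beta_j with energies e_i, e_j.
    Derived from detailed balance on the product target
    exp(-beta_i * E_i) * exp(-beta_j * E_j)."""
    return min(1.0, math.exp((beta_i - beta_j) * (e_i - e_j)))

def maybe_swap(replicas, i, j, rng) -> bool:
    """Attempt an exchange between replicas i and j.
    Each replica is a dict with 'beta', 'state', and 'energy' keys
    (an illustrative layout, not MSFold's data structure)."""
    a, b = replicas[i], replicas[j]
    p = swap_acceptance(a["beta"], b["beta"], a["energy"], b["energy"])
    if rng.random() < p:
        # Swap states and energies; temperatures stay attached to the slots.
        a["state"], b["state"] = b["state"], a["state"]
        a["energy"], b["energy"] = b["energy"], a["energy"]
        return True
    return False

# A cold replica stuck in a high-energy state always accepts taking the
# hot replica's lower-energy state: (1.0 - 0.5) * (5.0 - 2.0) > 0.
reps = [{"beta": 1.0, "state": "s0", "energy": 5.0},
        {"beta": 0.5, "state": "s1", "energy": 2.0}]
print(swap_acceptance(1.0, 0.5, 5.0, 2.0))
swapped = maybe_swap(reps, 0, 1, random.Random(0))
```

Swaps of this kind are what let the cold chain inherit states discovered by hot chains that roam freely across energy barriers, which is the "global exploration and barrier crossing" the abstract attributes to the method.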
bioinformatics | 2026-04-08 | v2
A geometric criterion links HIV-1 capsid topography to its biophysical properties and function
Li, W.; Peeples, C. A.; Rey, J. S.; Perilla, J. R.; Twarock, R.
Abstract
Mathematical models of virus capsid structure are pillars of modern virology, aiding the understanding of viral mechanisms and the design of antiviral interventions. Traditionally, the HIV-1 capsid core geometry is represented as a fullerene lattice, akin to the icosahedral models of spherical viruses in Caspar-Klug theory. However, recent studies revealed that many viral capsids deviate from such idealised lattices, with important functional implications. Here we demonstrate that this is also the case for the conical HIV-1 core geometries, in which the hexamer and pentamer boundaries form a pseudo-tiling rather than a perfectly aligned fullerene network. We introduce a triangular geometric criterion that quantifies local deviations of an HIV-1 atomic model from its idealised fullerene backbone. Using this criterion, we demonstrate that this difference in geometric organisation between the idealised (fullerene) and actual (data-derived) capsid models has implications for the capsid's biophysical properties. We also discuss the use of the geometric criterion as a predictive tool regarding cofactor binding and implied geometric changes in the capsid surface coupled to the interfacial frustration response. Our results establish a quantitative framework linking capsid geometry, curvature, and biophysical function, offering new perspectives for assembly inhibitor design and lentiviral vector engineering.
bioinformatics | 2026-04-08 | v2