Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
Structural Connectome Analysis using a Graph-based Deep Model for Age and Dementia Prediction
Kazi, A.; Mora, J.; Fischl, B.; Dalca, A.; Aganj, I.
Abstract
We address the prediction of non-imaging variables based on structural brain connectivity derived from diffusion magnetic resonance images, using graph-based machine learning. We predict age and the mini-mental state examination (MMSE) score as examples of a demographic and a clinical variable. We propose a machine-learning model inspired by graph convolutional networks (GCNs), which takes a brain connectivity graph as input and processes the data separately through a parallel GCN mechanism with multiple branches. The novelty of our work lies in the model architecture, especially the Connectivity Attention Block, which learns an embedding representation of brain graphs while providing graph-level attention. We show experiments on the publicly available PREVENT-AD and OASIS3 datasets. The proposed network is a simple design that employs different heads involving graph convolutions focused on edges and nodes, a linear branch, and skip connections, capturing representations from the input data thoroughly. To test the ability of our model to extract complementary and representative features from brain connectivity data, we chose the task of sex classification. We validate our model by comparing it to existing methods and via ablations. This quantifies the degree to which the connectome varies depending on the task, which is important for improving our understanding of health and disease across the population. The proposed model generally demonstrates higher performance, especially for age prediction, than the existing machine-learning algorithms we tested, including classical methods and (graph and non-graph) deep learning.
bioinformatics 2026-04-10 v3
Benchmarking ambient RNA removal across droplet and well-plate platforms reveals artificial count generation as a critical failure mode of scAR and CellClear
Schroeder, L.; Gerber, S.; Ruffini, N.
Abstract
Background: Ambient RNA contamination is a pervasive artifact of single-cell and single-nucleus RNA sequencing (sxRNA-seq), yet no consensus exists on which computational removal tool performs best across experimental platforms. Results: We present a systematic benchmark of six tools (CellBender, DecontX, SoupX, scCDC, scAR, and CellClear), evaluated across six human-mouse cell line mixing (hgmm) datasets (1k-20k cells) providing partial ground truth, two droplet-based complex tissue datasets (PBMC scRNA-seq; prefrontal cortex snRNA-seq), and a well-plate-based dataset (BD Rhapsody WBC). Using inter-species counts as partial ground truth, we quantify sensitivity, specificity, precision, and removal consistency per tool. We further apply a count-integrity criterion quantifying gene-cell positions where corrected values exceed raw counts. This reveals that scAR and CellClear do not merely denoise but fundamentally restructure count matrices: CellClear replaces >93% of counts with values derived from matrix factorization, while scAR generates spurious cell types absent from uncorrected data, including three spurious coarse cell types in the BD Rhapsody dataset and up to eight novel cell types in the prefrontal cortex. CellBender and SoupX exhibit reliable contamination removal with minimal count distortion. DecontX and scCDC are the only tools operable on non-droplet platforms without raw count matrix access. Runtime benchmarking at atlas scale (up to 172,000 nuclei) further demonstrates that CellClear fails to scale. Conclusions: Count matrix integrity, not removal sensitivity alone, must be a primary criterion when selecting ambient RNA correction tools. We provide platform-specific recommendations and a decision framework to guide tool selection across experimental contexts.
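The count-integrity criterion mentioned in this abstract (gene-cell positions where corrected values exceed raw counts) can be illustrated with a minimal sketch. It assumes the raw and corrected counts are available as equally shaped gene-by-cell matrices, held here as plain nested lists; the function name `count_integrity_violations`, the tolerance, and the toy values are hypothetical and not taken from the benchmark.

```python
def count_integrity_violations(raw, corrected, tol=1e-9):
    """Count gene-cell positions where the corrected value exceeds the raw count.

    Returns (number of violating positions, fraction of all positions).
    `tol` is an assumed tolerance for floating-point corrected values.
    """
    if len(raw) != len(corrected):
        raise ValueError("matrices must have the same number of rows")
    violations = 0
    total = 0
    for raw_row, corr_row in zip(raw, corrected):
        if len(raw_row) != len(corr_row):
            raise ValueError("matrices must have the same number of columns")
        for r, c in zip(raw_row, corr_row):
            total += 1
            # Removing ambient RNA should only lower counts, so any
            # increase is flagged as artificial count generation.
            if c > r + tol:
                violations += 1
    fraction = violations / total if total else 0.0
    return violations, fraction


# Toy 2-gene x 3-cell example (hypothetical values).
raw = [[5, 0, 3], [2, 1, 0]]
corrected = [[4, 0, 3], [2, 2, 1]]
n_bad, frac_bad = count_integrity_violations(raw, corrected)
print(n_bad, round(frac_bad, 3))
```

In practice the same comparison would typically run on sparse matrices; the nested-list form is used only to keep the sketch dependency-free.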
bioinformatics 2026-04-10 v1
MHCXGraph: A Graph-Based approach to detecting T cell receptor cross-reactivity
Simoes, C. D. M. S.; Maidana, R. L. B. R.; De Assis, S. C.; Guerra, J. V. d. S.; Ribeiro-Filho, H. V.
Abstract
The T cell receptor (TCR) recognition of multiple peptides presented by the major histocompatibility complex (MHC) is a key natural phenomenon, enabling the T cell repertoire to respond to a broad array of antigens. Despite its importance to the immune response, T cell cross-reactivity poses a major challenge for the development of novel T cell-based therapies. In this study, we present MHCXGraph, a graph-based computational approach for identifying conserved and immunologically relevant regions across multiple structures of peptides bound to MHC molecules (pMHC). Our approach provides three operational modes with user-defined parameters, allowing flexible configuration according to specific scientific needs while delivering fully interpretable results through user-friendly interfaces. We evaluated MHCXGraph across three case studies, including peptides bound to classical MHC Class I, MHC Class II, and unbound HLA alleles, demonstrating its ability to capture conserved structural determinants beyond sequence similarity. By integrating structural information with efficient graph-based analysis, MHCXGraph addresses key limitations of sequence-based methods while maintaining computational scalability. Collectively, these results indicate that MHCXGraph can be readily integrated into computational pipelines for T cell cross-reactivity discovery, especially in the context of de novo pMHC engager design and T cell-based vaccine development.
bioinformatics 2026-04-10 v1
Impact of Regularization Methods and Outlier Removal on Unsupervised Sample Classification
Heckman, C. A.
Abstract
Background: High-content assays have problems distinguishing biologically significant effects from the incidental effects of non-repeatable technical factors. Non-repeatable results are attributed to variations in the cell culture environment and the numerous, heterogeneous descriptors evaluated. The aim here was to determine whether preprocessing operations impacted the reproducibility of class assignments of experimental data. Methods: Batch effects that could affect reproducibility, i.e., signal/noise ratio, instrumental conditions, and segmentation, were controlled variables. The remaining batch effects, variations in materials, personnel, and culture environment could not be controlled. The values of descriptors were measured directly from images. Exploratory factor analysis was used to solve the identifiable and interpretable feature, factor 4. In each of five trials, one sample was treated with the same chemical mixture (EXP) and another with the solvent vehicle alone (CON). Results: Repeated CON and EXP samples showed significant differences among factor 4 means in data regularized within each trial. The mean of Trial 3 CON differed significantly from all other CON samples. These differences disappeared upon regularization to comprehensive databases. Among repeated EXPs, the Trial 2 mean differed from three other EXPs, but regularization to comprehensive databases had little effect. However, classification patterns were unchanged after regularization to any comprehensive database derived by the same protocol. After regularization to datasets derived by two different protocols, the classification pattern differed but only reflected elevation of differences that had been marginal to statistical significance. Outlier removal was deleterious. Even with the most sparing definition of outliers, over 3% of the contents of a single sample were removed from most trials. Elimination based on the overall within-trial distributions caused type I and type II errors. 
Conclusions: Non-repeatable factor 4 means in repeated trials had negligible influence on classification outcomes, so repeatability may not be a good indicator of assay quality. Irreducible batch effects, combined with small sample sizes and skewed distributions of the descriptor values, may account for non-repeatability. As the current results are based on real-world data, they suggest that non-repeatability is an uncorrectable feature of these assays. Classification patterns are not affected by several irreducible technical factors, namely materials, personnel, and non-repeatable environmental variables.
bioinformatics 2026-04-10 v1
Synolog: A Scalable Synteny-Based Framework for Genome Architecture Characterization
Madrigal, G.; Catchen, J. M.
Abstract
Detailing the genomic architecture across multiple organisms has been a task performed for decades. The continuing growth of genomic datasets not only serves as a resource for studying genome evolution but warrants the availability of scalable and user-friendly software for processing these datasets. Here, we present Synolog, a bioinformatic toolkit that can automatically identify orthologs for both protein-coding and non-coding genes, synteny clusters across two or more genomes, as well as retrogenes, and segmental duplications. Applying Synolog, we illustrate cases of local gene expansions in ecologically disparate turtle species, identify synteny clusters across hundreds of millions of years of metazoan evolution, and reconstruct chromosome-level assemblies in teleosts using the inferred synteny clusters; all using its integrated visual features. In parallel, we compare our orthogroup method to that of commonly used software and note the tradeoffs of making inferences solely based on sequence similarity versus a synteny-based approach.
bioinformatics 2026-04-10 v1
Generating, curating, and evaluating trnL reference sequence databases: Benchmarking OBITools3/ecoPCR, RESCRIPt, and MetaCurator
KUDDAR, O. S.; Meiklejohn, K. A.; Callahan, B. J.
Abstract
Plant DNA metabarcoding enables the identification of plant taxa in mixed samples, with the trnL (UAA) intron and its P6 loop mini-barcode region performing as well as or better than other commonly used markers. Reliable metabarcoding requires high-quality reference databases, yet a regularly maintained trnL resource is currently lacking. Consequently, most studies use uncurated sequences downloaded directly from public repositories without essential validation. We address these gaps by systematically comparing three database curation tools (OBITools3/ecoPCR, RESCRIPt, and MetaCurator), using each to generate a trnL reference sequence database and evaluating classification performance across commonly sequenced trnL regions (CD, CH, and GH). Reference trnL sequences and taxonomy files were retrieved from public sequence repositories and curated using standardized filtering steps to reduce taxonomic errors, sequence ambiguity, and redundancy. Four simulated query datasets (two base sets and their mutated counterparts) were constructed to assess classification performance of the databases using the Naive Bayesian Classifier implemented in DADA2. The evaluation showed that performance differed by trnL region: MetaCurator and RESCRIPt yielded higher and similar metrics for trnL CD; OBITools3/ecoPCR and RESCRIPt were comparable for trnL CH; and MetaCurator attained the highest performance for the trnL GH region. All reference databases, taxonomy, and evaluation files are available at Zenodo (https://doi.org/10.5281/zenodo.17969450). The complete computational workflow and scripts are available on GitHub (https://github.com/oskuddar/trnL_DB). Although evaluation was focused on plant taxa in the United States, the resulting databases are suitable for use as global trnL reference databases.
bioinformatics 2026-04-10 v1
MTB-KB: A Curated Knowledgebase of Mycobacterium tuberculosis Related Studies
Li, P.; Li, C.; Zhu, R.; Sun, W.; Zhou, H.; Fan, Z.; Yue, L.; Zhang, S.; Jiang, X.; Luo, Q.; Han, J.; Huang, H.; Shen, A.; Bahetibieke, T.; Wang, J.; Zhang, W.; Wen, H.; Niu, H.; Bu, C.; Zhang, Z.; Xiao, J.; Gao, R.; Chen, F.
Abstract
Tuberculosis (TB), caused by Mycobacterium tuberculosis (MTB), has regained its position as the world's leading killer among infectious diseases. Despite extensive research progress across epidemiology, diagnosis, drug development, treatment regimens, vaccines, drug resistance, virulence factors, and immune mechanisms, MTB-related knowledge remains fragmented across thousands of publications, limiting its effective use. To address this gap, we present MTB-KB, a literature-curated knowledgebase that systematically integrates high-impact findings from eight major sections of TB research. The current release contains 75,170 associations from 1,246 publications, covering 18,439 entities standardized using authoritative databases and WHO-endorsed classifications. A central feature is the interactive knowledge graph, which links cross-section associations to reveal and infer MTB-host interactions, treatment strategies, and vaccine development opportunities. MTB-KB also provides a user-friendly interface with browsing, advanced search, and statistical visualization. Overall, by consolidating dispersed MTB knowledge into a structured and accessible platform, MTB-KB provides a valuable resource for researchers, clinicians, and policymakers, supporting both basic and clinical TB research, enabling evidence-based TB prevention, diagnosis, and treatment, and contributing to global elimination efforts. MTB-KB is accessible at https://ngdc.cncb.ac.cn/mtbkb/.
bioinformatics 2026-04-10 v1
Structure-aware geometric graph learning for modeling protease-substrate specificity at scale
Guo, X.; Bi, Y.; Ran, Z.; Pan, T.; Sun, H.; Hao, Y.; Jia, R.; Wang, C.; Zhang, Q.; Kurgan, L.; Song, J.; Li, F.
Abstract
Protease-substrate specificity is central to cellular regulation and disease pathogenesis, and accurately modeling its structural determinants remains challenging. Substrate recognition is governed by spatial constraints and higher-order relationships that extend beyond local sequence motifs. Most computational approaches rely predominantly on motif-centric or sequence-based representations, limiting their ability to capture the geometric and relational structure underlying enzymatic specificity. Here, we introduce OmniCleave, a structure-aware geometric graph learning framework for modeling protease-substrate specificity at scale. OmniCleave is trained on 57,278 structure-informed protease-substrate pairs derived from 9,651 substrates spanning over 100 proteases across six distinct families. The framework integrates multi-scale structural graphs with higher-order protease relational topology, explicitly encoding spatial context and inter-protease dependencies within a unified geometric representation. This formulation moves beyond local pattern recognition and enables transferable modelling across six protease families. Across large-scale benchmarks, the framework consistently outperforms existing approaches and reveals interpretable geometric determinants underlying substrate recognition. Experimental validation confirms three novel caspase-3 substrates and 21 cleavage sites predicted by OmniCleave, supporting the biological relevance of the learned representations. Together, OmniCleave provides a scalable geometric framework for modeling protease-substrate specificity, with practical utility for systematic analysis of protease biology.
bioinformatics 2026-04-10 v1
A computational model for quantifying instability of tandem repeats across the genome
Dolzhenko, E.; English, A.; Mokveld, T.; de Sena Brandine, G.; Kronenberg, Z.; Wright, G.; Drogemoller, B.; Rowell, W. J.; Wenger, A. M.; Bennett, M. F.; Weisburd, B.; Erwin, G. S.; Jin, P.; Nelson, D. L.; Dashnow, H.; Sedlazeck, F.; Eberle, M. A.
Abstract
Tandem repeats (TRs) exhibit high levels of somatic mosaicism, which is increasingly recognized as an important modifier of repeat expansion disorders. Long-read sequencing can capture full-length repeat alleles, yet robust frameworks for quantifying instability across TRs genome-wide are still needed. Here, we introduce a general-purpose model for quantifying TR instability in a given long-read sequencing dataset, without explicitly distinguishing biological mosaicism from technical noise, and which is broadly applicable to both simple and structurally complex loci. This model accurately characterizes allelic instability at each TR locus by representing the distribution of read-to-consensus deviations for each allele. Using HiFi sequencing data from 256 HPRC cell line samples, we fitted models for 617,007 TR loci, including known pathogenic repeats. We observe that instability levels are generally low, but vary substantially across individual TRs, and are driven more strongly by repeat composition than overall repeat length. Furthermore, we applied our method to targeted PureTarget long-read data from samples with known repeat expansions and identified significant mosaicism in the majority of expanded alleles. Our model offers a practical way to quantify instability of tandem repeats across the genome and to detect unusually unstable repeat alleles.
bioinformatics 2026-04-10 v1
Deep learning enables direct HLA typing from immunopeptidomics data
Pilz, M.; Scheid, J.; Bauer, A.; Lemke, S.; Sachsenberg, T.; Bauer, J.; Nelde, A.; Stadelmaier, J.; Walter, A.; Rammensee, H.-G.; Nahnsen, S.; Kohlbacher, O.; Walz, J. S.
Abstract
The immune system eliminates malignant and infected cells through T-cell-mediated recognition of peptides presented by human leukocyte antigen molecules. Mass spectrometry-based immunopeptidomics enables unbiased identification of naturally presented HLA-restricted peptides and has become central to the development of T-cell-based immunotherapies. However, immunopeptidomics data reflects the combined peptide presentation of multiple HLA alleles, and determining which allotypes are represented in this multi-allelic complexity remains an unmet computational challenge. Here, we introduce immunotype, a deep learning-based ensemble predictor for HLA class I allotyping directly from immunopeptidomics data. Immunotype integrates peptide and HLA sequence information through transformer encoders and a graph neural network, complemented by a curated mono-allelic reference of known peptide-HLA binding preferences. Immunotype achieves an overall accuracy of 87.2% at protein-level resolution across diverse tissues and thereby enables rapid, cost-effective HLA typing of large-scale immunopeptidomics datasets.
bioinformatics 2026-04-10 v1
Statistical Principles Define an Open-Source Differential Analysis Workflow for Mass Spectrometry Imaging Experiments with Complex Designs
Rogers, E. B. T.; Lakkimsetty, S. S.; Bemis, K. A.; Schurman, C. A.; Angel, P. A.; Schilling, B.; Vitek, O.
Abstract
Mass spectrometry imaging (MSI) characterizes the spatial heterogeneity of molecular abundances in biological samples. Experiments with complex designs, involving multiple conditions and multiple samples, provide particularly useful insight into differential abundance of analytes. However, analyses of these experiments require attention to details such as signal processing, selection of regions of interest, and statistical methodology. This manuscript contributes a statistical analysis workflow for detecting differentially abundant analytes in MSI experiments with complex designs. Using a case study of histologic samples of human tibial plateaus from knees of osteoarthritis patients and cadaveric controls, as well as simulated datasets, we illustrate the impact of the analysis decisions. We illustrate the importance of signal processing and feature aggregation for preserving biological relevance and alleviating the stringency of multiple testing. We further demonstrate the importance of selecting regions of interest in ways that are compatible with differential analysis. Finally, we contrast several common statistical models for differential analysis, showcase the appropriate use of replication, and demonstrate model-based calculation of sample size for followup investigations. The discussion is accompanied by detailed recommendations and an open-source R-based implementation that can be followed by other investigations.
bioinformatics 2026-04-10 v1
BrightEyes-FFS: an open-source platform for comprehensive analysis of fluorescence fluctuation spectroscopy experiments with small detector arrays
Slenders, E.; Perego, E.; Zappone, S.; Vicidomini, G.
Abstract
Fluorescence fluctuation spectroscopy (FFS) is an ensemble of techniques for quantitative measurement of molecular dynamics and interactions. Recently, the introduction of small-format array detectors has opened up a new range of spatiotemporal information, allowing for more detailed analysis of system kinetics. However, there is currently no open-source software available for analyzing the high-dimensional FFS data sets. We present BrightEyes-FFS, an open-source Python-based environment for FFS analysis with array detectors. The environment includes a Python package for reading raw FFS data, computing auto- and cross-correlations using various algorithms, and fitting the correlations to several models. A graphical user interface (GUI), available as a standalone executable, makes the analysis fast and user-friendly. An automated Jupyter Notebook writing tool enables transition from the GUI to Jupyter Notebook for custom analysis. We believe that BrightEyes-FFS will enable a wider community to study diffusion, flow, and interaction dynamics.
bioinformatics 2026-04-10 v1
PERREO: An integrated pipeline for repetitive elements analysis enables the repeatome expression profiling in cancer
Rodriguez-Martin, F.; Masero-Leon, M.; Gomez-Cabello, D.
Abstract
Transcriptome-wide profiling of repetitive elements expression reveals transposable element-derived transcripts that are deregulated in diverse biological contexts including cancer. However, most RNA-seq pipelines are optimized for annotated genes and substantially undercount repeat RNA molecules, limiting their discovery and characterization. Here we present PERREO, a comprehensive, user-friendly pipeline for analyzing repetitive RNA elements from short- and long-read sequencing data. PERREO performs quality control, repeat-aware alignment and quantification, differential expression analysis, co-expression network analysis, and de novo transcript assembly with minimal computational expertise required. We validate PERREO across cell lines, tumor tissues and liquid biopsies, demonstrating superior sensitivity to repetitive RNA signatures compared with standard RNA-seq approaches. PERREO integrates predictive modelling to identify biological associations and generates publication-ready visualizations. By removing the bioinformatic barrier to repetitive RNA discovery, this pipeline enables broader investigation of the repeatome's role in cellular biology and disease, yielding valuable results that, for specific analytical objectives, outperform certain existing tools and pipelines.
bioinformatics 2026-04-10 v1
Structure-Based and Stability-Validated Prioritization of BACE1 Inhibitors Integrating Meta-Ensemble QSAR and Molecular Dynamics
Chowdhury, T. D.; Shafoyat, M. U.; Hemel, N. H.; Nizam, D.; Sajib, J. H.; Toha, T. I.; Nyeem, T. A.; Farzana, M.; Haque, S. R.; Hasan, M.; Siddiquee, K. N. e. A.; Mannoor, K.
Abstract
Alzheimer's disease remains an unmet therapeutic challenge, and no β-secretase (BACE1) inhibitor has achieved clinical approval. A major limitation of prior discovery efforts is reliance on single-parameter optimization, often yielding computational hits with poor translational potential. Here, we present a stability-validated, biology-informed computational framework that integrates meta-ensemble QSAR (five tree-based classifiers with ECFP4 fingerprints), structure-based docking, Protein Language Model (ESM-1b)-guided hybrid residue interaction weighting, and comprehensive ADMET profiling within a normalized composite ranking scheme. Model robustness was confirmed through external validation and Y-randomization (n = 100; empirical p = 0.009). Heuristic weighting was quantitatively stress-tested using global ±10% perturbation analysis (mean Spearman ρ = 0.998; mean Kendall's τ = 0.970), demonstrating exceptional ranking stability under controlled parameter uncertainty. Screening of 16,196 structurally diverse compounds, including CNS-active molecules, phytochemicals, approved drugs, and investigational agents, identified 153 predicted actives (accuracy 0.852; ROC-AUC 0.920), which were refined to 111 drug-like candidates and seven prioritized leads. Two-hundred-nanosecond molecular dynamics simulations confirmed stable binding within the BACE1 catalytic pocket and sustained interaction networks over time. Mol-2 exhibited the most favorable profile, characterized by low ligand RMSD (1.2-1.6 Å), persistent catalytic dyad interactions (ASP32 98%, ASP228 99%), predicted BBB permeability, acceptable efflux profile, and balanced ADMET characteristics consistent with CNS drug-like space.
Collectively, this integrative, interpretable, and robustness-validated framework provides a systematic strategy for multi-criteria lead prioritization and may serve as a transferable platform for structure-guided discovery of therapeutics targeting complex neurodegenerative pathways.
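The ±10% perturbation analysis of the composite ranking can be pictured with a small sketch. This is one plausible reading, not the authors' procedure: each weight is scaled by a random factor in [0.9, 1.1], the weights are renormalized, and the perturbed ranking is compared with the baseline via Spearman's ρ. The criteria values, weights, trial count, and all function names are hypothetical, and the ρ formula used assumes no tied scores.

```python
import random


def composite_scores(features, weights):
    """Weighted sum of already-normalized criteria, one score per compound."""
    return [sum(w * x for w, x in zip(weights, row)) for row in features]


def ranks(scores):
    """Rank 1 = highest score (ties broken by input order; assumed rare)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    rank = [0] * len(scores)
    for position, idx in enumerate(order, start=1):
        rank[idx] = position
    return rank


def spearman_rho(rank_a, rank_b):
    """Spearman correlation for two tie-free rankings."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


def perturbation_stability(features, weights, n_trials=200, delta=0.10, seed=0):
    """Mean Spearman rho between baseline and weight-perturbed rankings."""
    rng = random.Random(seed)
    base = ranks(composite_scores(features, weights))
    rhos = []
    for _ in range(n_trials):
        perturbed = [w * rng.uniform(1 - delta, 1 + delta) for w in weights]
        total = sum(perturbed)
        perturbed = [w / total for w in perturbed]  # keep weights summing to 1
        rhos.append(spearman_rho(base, ranks(composite_scores(features, perturbed))))
    return sum(rhos) / len(rhos)


# Hypothetical data: 50 compounds scored on 4 normalized criteria.
data_rng = random.Random(1)
features = [[data_rng.random() for _ in range(4)] for _ in range(50)]
weights = [0.4, 0.3, 0.2, 0.1]
mean_rho = perturbation_stability(features, weights)
print(f"mean Spearman rho: {mean_rho:.3f}")
```

A mean ρ close to 1 indicates that the ordering of candidates barely depends on the exact weight values within the tested range.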
bioinformatics 2026-04-10 v1
SimpleFold-Turbo: Adaptive Inference Caching Yields 14-fold Acceleration of Flow-Matching Protein Structure Prediction
Taghon, G.
Abstract
We apply TeaCache, an adaptive caching technique from video diffusion, to SimpleFold's flow-matching protein structure prediction and achieve 9- to 14-fold inference speedups with negligible quality loss. We determine that flow matching's near-linear generative trajectories make consecutive neural-network evaluations highly redundant. At a low redundancy threshold, SimpleFold-Turbo (SF-T) skips ≈93% of forward passes while preserving near-baseline template modeling (TM)-scores across 300 structurally diverse CATH domains and all six SimpleFold model sizes (100 million to 3 billion parameters), at compute budgets where log-uniform step-skipping collapses. Speedup scales with model size because caching overhead is constant while per-step cost grows, and a general three-phase skip pattern emerges independent of protein size or fold. SF-T requires no retraining, no weight modification, and no MSA server dependencies. We release SF-T as fully open-source software enabling thousands of structure predictions per hour on commodity hardware.
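The idea that near-linear trajectories make consecutive evaluations redundant can be shown on a toy problem. The sketch below is a simplified, threshold-based caching loop in the spirit of the abstract, not TeaCache or SF-T code: an Euler integrator reuses the last computed velocity until the accumulated relative change of its input exceeds a threshold. The velocity field, threshold value, and all names are hypothetical.

```python
def rel_l1(current, previous):
    """Relative L1 change between two states."""
    num = sum(abs(c - p) for c, p in zip(current, previous))
    den = sum(abs(p) for p in previous) or 1e-12
    return num / den


def cached_euler(velocity_fn, x0, n_steps, threshold):
    """Euler integration from t=0 to t=1 that skips velocity evaluations.

    The velocity is recomputed only when the accumulated relative input
    change since the last evaluation reaches `threshold`.
    """
    x = list(x0)
    dt = 1.0 / n_steps
    cached_v = None
    last_input = None
    accumulated = 0.0
    n_evals = 0
    for step in range(n_steps):
        t = step * dt
        if cached_v is not None:
            accumulated += rel_l1(x, last_input)
        if cached_v is None or accumulated >= threshold:
            cached_v = velocity_fn(x, t)  # the expensive "forward pass"
            n_evals += 1
            accumulated = 0.0
        last_input = list(x)
        x = [xi + dt * vi for xi, vi in zip(x, cached_v)]
    return x, n_evals


TARGET = [3.0, 1.0, -1.0]


def straight_line_velocity(x, t):
    """Toy velocity field whose trajectories are straight lines to TARGET."""
    return [(ti - xi) / (1.0 - t) for ti, xi in zip(TARGET, x)]


start = [1.0, -2.0, 0.5]
x_full, evals_full = cached_euler(straight_line_velocity, start, 50, threshold=0.0)
x_cached, evals_cached = cached_euler(straight_line_velocity, start, 50, threshold=0.2)
print(evals_full, evals_cached)
print([round(v, 6) for v in x_cached])
```

Because this toy trajectory is exactly linear, the cached run reaches the same endpoint with far fewer evaluations; for a real model the trajectory is only approximately linear, so the threshold trades speed against accuracy.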
bioinformatics 2026-04-10 v1
TCMCard: A High-Confidence Digital Infrastructure for Traditional Chinese Medicine Quantified by Multi-Dimensional Evidence Integration
Wang, Y.; Dong, W.; Yao, J.; Wang, K.; Zhang, L.; Wang, Y.; Guo, S.; Li, H.; Cai, H.; Wang, X.; Li, Y.
Abstract
Network pharmacology has become a widely used approach for deciphering multi-component, multi-target mechanisms of traditional Chinese medicine (TCM). Here we introduce TCMCard, a high-confidence digital infrastructure built on a Multi-Dimensional Evidence Integration (MDEI) framework. The framework integrates experimental activity data from authoritative chemical databases, literature-derived evidence, and structure-based similarity inference. Preprocessing steps include chemical structure normalization, species-specific filtering, and target quality scoring. Applied to conventional interaction datasets, this pipeline leads to the removal of over 60% of low-confidence noise. TCMCard supports network pharmacology exploration through an interactive visualization platform, and module analysis identifies functionally relevant communities that offer insights into the synergistic actions of TCM formulas. Overall, TCMCard may help move the field beyond simple data aggregation toward evidence-informed curation and quality-driven analysis. As an interactive and publicly accessible platform, it reveals an organized backbone within complex interaction networks, offering a more reliable basis for understanding multi-component synergy in TCM.
bioinformatics 2026-04-10 v1
Multi-scale spatial testing recovers gene programs missed by existing detection methods
Yang, C.; Zhang, X.; Chen, J.
Abstract
Identifying spatially variable genes (SVGs) is the first analytical step in spatial transcriptomics, determining which genes and pathways are prioritized for downstream validation. Yet the restricted spatial models of current detection methods create systematic blind spots that can exclude biologically coherent programs from discovery. Here we present FlashS, which reformulates kernel-based spatial testing in the frequency domain to detect arbitrary multi-scale expression patterns while scaling to millions of cells. In human cardiac tissue, this broader detection capacity recovers a coherent PGC-1α-regulated mitochondrial biogenesis program, with 40 of 49 pathway genes spatially associated with ventricular cardiomyocytes, that PreTSA, a leading parametric alternative, largely misses (1 of 49 genes), a finding replicated in an independent cohort. Across 50 benchmark datasets spanning 9 platforms, FlashS achieves state-of-the-art ranking accuracy (mean Kendall tau = 0.935) and completes on the Allen Brain MERFISH atlas (3.94 million cells) in 12.6 minutes with 21.5 GB memory.
bioinformatics 2026-04-09 v3
Quantifying Scientific Consensus in Biomedical Hypotheses via LLM-Assisted Literature Screening
Kim, U.; Kwon, O.; Lee, D.
Abstract
Systematic literature reviews are labor-intensive tasks in biomedical research. While Large Language Models (LLMs) using Retrieval-Augmented Generation (RAG) techniques have enhanced information accessibility, the inherent complexity of biological systems, characterized by high context dependency and conflicting data, remains a primary driver of LLM hallucinations. This imposes a structural constraint that limits the precision of evidence synthesis. To address these limitations, we propose an automated framework designed for the exhaustive identification of supporting and contradictory evidence within a target literature set. Rather than relying on a model's pre-trained knowledge, our system requires the LLM to review each paper individually to determine its alignment with a specific research hypothesis. By evaluating semantic context, the framework captures subtle contradictions that are often overgeneralized by conventional methods. The framework's performance was validated using the BioNLI task, where it demonstrated high classification accuracy in distinguishing whether evidence supports or contradicts a given hypothesis. Notably, the implementation of an ensemble approach provided superior stability and slightly higher precision compared to individual models. Furthermore, the framework exhibited robust performance across several well-established biological hypotheses, confirming its practical utility and reliability in real-world research. This approach provides a rigorous basis for biomedical discovery by enabling the precise, systematic analysis of biological literature and the robust collection of evidence.
bioinformatics 2026-04-09 v1
Quaternion Spectral Fingerprinting of DNA: GPU-Accelerated Multi-Channel Fourier Analysis for Alignment-Free Genomics
Bergach, M. A.
Abstract
Spectral methods for DNA sequence analysis, which treat genomic data as a discrete signal and compute its Fourier transform, were proposed over three decades ago but remained impractical for whole-genome analysis due to computational cost. We present a quaternion Fourier transform framework that encodes DNA as a quaternion-valued signal q[n] ∈ {1, i, j, k} mapping to the four nucleotides {A, T, G, C}, and prove that the full quaternion spectrum is computable from exactly two standard complex FFTs: Q(k) = Z_1(k) + Z_2(N-k) · j, where Z_1 = FFT(u_A + i · u_T) and Z_2 = FFT(u_G + i · u_C). We establish that the resulting spectral fingerprint F(k) = (|Z_1(k)|^2, |Z_2(k)|^2) is invariant under both cyclic shift and reverse complement, the two fundamental symmetries of double-stranded DNA. Building on this theoretical foundation, we develop three computational tools: (i) a 4x4 Hermitian cross-spectral matrix with inter-channel coherence analysis, (ii) a genome spectrogram via sliding-window short-time Fourier transform, and (iii) an alignment-free spectral variant detection algorithm with O(N log N) complexity. Applying Welch's cross-spectral coherence analysis to E. coli K-12, we discover that the DNA helical repeat (~11 bp) is invisible to the standard power spectrum but clearly detected through the cross-spectral matrix condition number (κ = 6.5), demonstrating that multi-channel analysis reveals structural periodicities that single-channel methods miss. Phase spectrum analysis recovers the characteristic nucleotide ordering within codons (A → T → G → C), while three distinct frequency regimes of inter-nucleotide coupling emerge: complementary-dominated (long-range), purine/pyrimidine-dominated (structural), and codon-position-dominated (coding). Cross-species validation on 18 genomes spanning all three domains of life (5 Bacteria, 3 Archaea, and 10 Eukarya), with GC content from 19.6% (P. falciparum) to 69.5% (T. thermophilus), confirms the universality of these findings. The helical repeat is detected via cross-spectral coherence in 18/18 organisms (100%). All 10 eukaryotes show A-T dominance at the helical repeat, a spectral signature of nucleosome wrapping absent from prokaryotes. Non-complementary pairs (A-C, T-G) dominate the coding frequency in 17/18 organisms. Validation on human chromosome 21 (46.7 Mb, processed in 5.0 s on Apple M1) reveals eukaryote-specific spectral signatures (nucleosome positioning at 10.67 bp, nucleosome spacing at 170.7 bp, and Alu repeat dominance at 341 bp) absent from prokaryotic spectra. A proof-of-concept spectral variant detection experiment achieves 100% read-matching accuracy (100/100 reads) and statistically significant discrimination of SNPs from sequencing errors (t = 14.80, p < 0.001, Cohen's d = 1.64), scaling to d = 8.96 at 30x coverage. The full human genome can be spectrally analyzed in approximately 3-4 seconds on an M1 GPU and under 1 second on M4 Max, enabling interactive spectral genomics on commodity hardware.
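The two-FFT fingerprint described in this abstract can be illustrated with a small sketch. It assumes that u_A, u_T, u_G, and u_C are 0/1 indicator sequences for each base, and it uses a plain recursive radix-2 FFT (so the sequence length must be a power of two). The function names `fft`, `fingerprint`, and `reverse_complement` are hypothetical and are not taken from the authors' software.

```python
import cmath

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def fingerprint(seq):
    """Return F(k) = (|Z_1(k)|^2, |Z_2(k)|^2) for a DNA string over A, T, G, C."""
    # Z_1 = FFT(u_A + i*u_T), Z_2 = FFT(u_G + i*u_C); u_X assumed to be indicators
    z1 = fft([complex(b == "A", b == "T") for b in seq])
    z2 = fft([complex(b == "G", b == "C") for b in seq])
    return [(abs(a) ** 2, abs(b) ** 2) for a, b in zip(z1, z2)]

def reverse_complement(seq):
    """Complement each base (A<->T, G<->C) and reverse the string."""
    return seq.translate(str.maketrans("ATGC", "TACG"))[::-1]
```

On a short test sequence, comparing `fingerprint(seq)` with the fingerprints of a cyclically shifted copy and of `reverse_complement(seq)` is a quick way to check the two invariances numerically (up to floating-point tolerance).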
bioinformatics2026-04-09v1Agentic systems are adept at solving well-scoped, verifiable problems in computational biology
Nair, S.; Gunsalus, L.; Orcutt-Jahns, B.; Rossen, J.; Lal, A.; Donno, C. D.; Celik, M. H.; Fletez-Brant, K.; Xie, X.; Bravo, H. C.; Eraslan, G.Abstract
We introduce CompBioBench, a benchmark of 100 diverse tasks for evaluating agentic systems in computational biology. Unlike mathematics and programming, which more readily admit systematic verification, biological data are inherently noisy and open to interpretation. To enable objective evaluation without reducing tasks to prescriptive checklists, we propose a new benchmark construction strategy based on synthetic/augmented data and metadata scrambling/scrubbing of real datasets to create challenging problems with a single ground-truth answer that require multi-step reasoning, tool use, bespoke code, and interaction with real-world external resources. The benchmark spans genomics, transcriptomics, epigenomics, single-cell analysis, human genetics, and machine learning workflows. Questions are curated by domain experts to cover a broad range of skills with varying difficulty. We evaluate leading general-purpose agentic systems starting from a bare-minimum environment, requiring them to fetch data and tools as needed to solve each problem. We find strong end-to-end performance, with Codex CLI (GPT 5.4) reaching 83% accuracy and Claude Code (Opus 4.6) reaching 81%. On the hardest questions, Codex CLI (GPT 5.4) reaches 59%, while Claude Code (Opus 4.6) reaches 69%. CompBioBench provides a practical testbed for measuring the progress of agentic systems in computational biology and for guiding future benchmark design.
bioinformatics2026-04-09v1IEKB: a comprehensive knowledge base for inner ear genetics integrating curated associations, cochlear interactions, Bayesian candidate prioritisation, explainable dark-gene support relations, and a scientific entity network
Wang, H.; Chen, W.; Ning, H.; Cai, Y.; Xu, Y.; Hou, X.; Pang, L.; Luo, Z.; Tian, C.Abstract
Inner-ear genetics has expanded rapidly, yet the supporting evidence remains dispersed across a vast literature and across resources that typically emphasise loci, variants, or expression data rather than integrated biological interpretation. Here we present the Inner Ear Knowledge Base (IEKB; https://earkb.org), an open database that unifies curated associations, cochlear interaction evidence, candidate prioritisation, explainable support relations, and network exploration for inner-ear research. IEKB was built with an automated agent-assisted curation workflow that combines schema-constrained literature extraction, continuous human monitoring, and final expert review by inner-ear genetics researchers. By systematically analysing 250,696 PubMed-indexed records retrieved across 16,563 screened genes, IEKB curates 6,051 gene-phenotype-disease associations from 2,494 genes across 43 phenotype categories and 4,102 cochlear gene-gene interactions with pathway, cell-type, and experimental context. IEKB further includes a Bayesian "dark matter" module that prioritises 243,071 candidate gene-phenotype associations for 13,229 genes across all 43 phenotypes (global AUC-ROC = 0.8603; global AUC-PR = 0.1674), together with a supervised dark-relation layer that ranks phenotype-specific known-gene support for each candidate and a multi-entity scientific network containing nearly 4,000 entities, 28,616 deterministic edges, and 83,712 literature-derived relational links. The web resource supports interactive search, multi-parameter filtering, gene-detail pages, bibliometric exploration, domain-specific enrichment against IEKB phenotype and disease gene sets, network visualisation, bulk download in CSV, JSON, SQLite, and XLSX formats, and natural-language evidence-grounded question answering through a companion conversational interface (IEKB QA). To our knowledge, IEKB is the first openly accessible inner-ear resource that integrates curated associations, cochlear interactions, probabilistic candidate prioritisation, auditable known-gene support relations for novel candidates, and a multi-entity scientific network within a single database. All data are released without registration under the CC BY 4.0 license.
bioinformatics2026-04-09v1STAnalyzer: Transparent Spatial Transcriptomics Analysis via an Agentic Architecture
Luo, H. H.; Liu, L.; Xing, Z.; Li, X.; Zhang, X.; Du, W.; Liu, B.; Wang, J.; Yu, G.Abstract
Spatial transcriptomics enables high resolution profiling of gene expression within spatial contexts, yet its potential is often hindered by fragmented toolchains, intricate parameters, and cognitive bottlenecks of interpreting high dimensional data. While recent Large Language Model agents have attempted to automate this process, they remain constrained by rigid execution logic, lack multimodal feedback for self correction, and operate in epistemic isolation from established biological knowledge. Here, we present STAnalyzer, an intelligent multiagent framework designed to automate the end to end analytical lifecycle from raw data processing to biological hypothesis generation. Transcending traditional pipelines, STAnalyzer employs a collaborative intelligence architecture to achieve three core capabilities: (1) Intent Driven Orchestration, which dynamically translates natural language queries into rigorous bioinformatics workflows; (2) Multi Modal Self Refinement, which autonomously ensures analytical robustness through closed loop synthesis of evidence from visual patterns and statistical metrics; and (3) Evidence based Cross Validation, which bridges the gap between data driven correlations and biological causation by anchoring findings in ground truth literature and structured databases. By eliminating manual analytical bottlenecks and ensuring rigorous evidentiary traceability and transparency, STAnalyzer makes high resolution spatial omics more accessible to a broader research community. It provides a robust and scalable framework for cross platform automated analysis and accelerated biological discovery, translating massive spatial datasets into verifiable biological insights.
bioinformatics2026-04-09v1PoolParty: streamlined design of DNA sequence libraries in Python
Liu, Z.; Cordero, A.; Kinney, J. B.Abstract
Computationally designed DNA sequence libraries are essential components of many high-throughput assays. They are also increasingly used in silico to analyze genomic AI models. Designing these libraries, however, remains tedious and error-prone. Here we describe PoolParty, a Python package that streamlines the design of complex oligo pools using a simple but flexible API. In PoolParty, each library is represented by a computational graph that can be specified in just a few lines of code. Over 50 built-in operations cover nucleotide- and codon-level mutagenesis, motif insertion, barcode generation, and more. PoolParty also provides "design cards" detailing how each sequence was generated.
bioinformatics2026-04-09v1End-to-end evaluation of pipelines for metagenome-assembled genomes reveals hidden performance gaps
Coleman, I.; Ma, J.; Qian, G.; Jiang, Y.; Brown Kav, A.; Korem, T.Abstract
The generation of Metagenome Assembled Genomes (MAGs) has become a standard and basic step in the analysis of metagenomic data. This multi-step process, which includes assembly, binning, refinement, and quality control, has many alternative approaches, algorithms, and parameters. Determining the ideal approach for a given ecosystem and study, or highlighting algorithmic gaps in need of additional research and development, requires rigorous benchmarking. We present MAG-E (MAG pipeline Evaluator), a generalizable and expandable framework for end-to-end evaluation of entire MAG pipelines: from assembly, through binning, to quality control and filtering. MAG-E relies on simulations that are built to match an ecosystem of interest and provide a ground truth for accurate evaluation. To demonstrate the capabilities of MAG-E, we benchmark two assemblers, six binning algorithms, three binning modes, and three quality control and refinement methods in the context of the human gut microbiome. Our findings offer multiple insights into optimal MAG generation in this context. We find that metaSPAdes consistently outperforms MEGAHIT in terms of recall (completeness), and that COMEBin overall outperforms alternative binning algorithms, but has lower precision than SemiBin2. While multi-sample binning results in higher precision, as previously shown, single-sample binning has higher recall and leads to better overall performance with modern binners. Binning refinement, which combines bins from multiple different algorithms, leads to reduced performance. We further show that CheckM2 systematically overestimates completeness and underestimates contamination, and that this is partially ameliorated when using GUNC. Finally, we analyze performance at the contig level, and demonstrate that binning algorithms systematically underperform for prophages and fail to bin contigs that are shared between genomes. 
Overall, MAG-E offers deep insights into successes and gaps in this important analytic process.
bioinformatics2026-04-09v1A Grid-Search Framework for Dataset-Specific Calibration of Actigraphy Sleep Detection Algorithms
Rahjouei, A.Abstract
Actigraphy is widely used for long-term sleep monitoring, but established sleep-wake scoring algorithms often require parameter tuning, which is commonly performed manually and can reduce reproducibility. In this study, we present a grid-search-based calibration framework for established actigraphy algorithms and evaluate whether it can serve as a practical alternative to manual tuning. The method was evaluated using two datasets: a multi-subject polysomnography-validated actigraphy dataset and a self-collected dual-device dataset. In the polysomnography-validated dataset, grid-search optimization produced performance patterns similar to manual parameter selection, while slightly improving detection of sleep onset and sleep offset and yielding modest gains in wake-sensitive metrics. In the dual-device dataset, consensus and majority voting were useful for reducing the influence of brief wake episodes occurring within the main sleep period, including micro-awakenings that can fragment sleep predictions across individual algorithms. Overall, these findings show that grid search can replace manual parameter tuning with a more explicit and reproducible procedure while providing small improvements in sleep timing estimation and benefiting ensemble-based handling of within-sleep wakefulness.
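As a rough illustration of grid-search calibration, the sketch below tunes two parameters of a deliberately simple sleep-wake scorer against reference labels. The moving-average threshold scorer, the accuracy objective, and all function names are hypothetical stand-ins; they are not the established algorithms or metrics evaluated in the study.

```python
from itertools import product

def score_sleep(counts, window, threshold):
    """Label an epoch as sleep (1) when the centered moving average
    of activity counts falls below the threshold, else wake (0)."""
    half = window // 2
    labels = []
    for i in range(len(counts)):
        segment = counts[max(0, i - half): i + half + 1]
        labels.append(1 if sum(segment) / len(segment) < threshold else 0)
    return labels

def accuracy(pred, truth):
    """Fraction of epochs where the prediction matches the reference."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

def grid_search(counts, truth, windows, thresholds):
    """Evaluate every (window, threshold) pair and return the best one
    together with its accuracy against the reference labels."""
    best = max(product(windows, thresholds),
               key=lambda p: accuracy(score_sleep(counts, *p), truth))
    return best, accuracy(score_sleep(counts, *best), truth)
```

Because the grid and the objective are written down explicitly, the same search can be rerun on another dataset without hand-tuning, which is the reproducibility argument the abstract makes.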
bioinformatics2026-04-09v1gbdraw: a genome diagram generator for microbes and organelles
Kawato, S.Abstract
Motivation: Generating graphical diagrams of microbial and organellar genomes is a common and essential task in bioinformatics. Existing tools often present a trade-off: powerful programming libraries require coding skills, while graphical applications require server processing or local installation with complex dependencies. This highlights the need for a tool that offers both programmatic control for batch processing and graphical accessibility for ease of use. Results: To fill this gap, I developed gbdraw, a web application that generates circular and linear genome diagrams from self-contained GenBank or DDBJ files or combinations of GFF3 annotation and FASTA sequence files. Its core functions include visualizing annotated features, plotting GC content/skew tracks, and optionally generating pairwise sequence comparisons for comparative genomics. It is available as both a GUI web application and a command-line utility. Unlike existing web-based tools that require data upload to a remote server, gbdraw operates entirely within the user's web browser. This serverless architecture ensures that sensitive sequence data never leaves the local machine, providing a secure environment for visualizing unpublished genomic data. Availability and Implementation: gbdraw is implemented in Python 3 (version 3.10+) and is freely available under the MIT license. The web app is available at https://gbdraw.app/. Source code and documentation are available at https://github.com/satoshikawato/gbdraw. The local version can be installed from the Bioconda channel using a conda-compatible package manager.
bioinformatics2026-04-09v1GMIP-PLSR: A Nextflow Pipeline for GWAS and Multi-Omics Integration in Gene Prioritization Using PLSR
Kanchwala, M. S.; Xing, C.; Xuan, Z.Abstract
Genome-wide association studies (GWAS) have significantly advanced our understanding of complex traits and diseases, but their interpretive power remains limited due to challenges in identifying causal genes and pathways. Integrating GWAS with multi-omics data, such as gene expression, protein-protein interactions, and gene-pathway networks, has the potential to enhance biological insights and improve gene prioritization. To fulfill this potential, we developed the GWAS & Multi-omics Integration Pipeline (GMIP), a flexible and scalable framework that incorporates widely used tools such as PoPS, MAGMA, and benchmarker to enrich GWAS findings. However, PoPS suffers from multicollinearity in its features, which can impact performance. To overcome this, we introduce GMIP-PLSR, an extension of GMIP that uses Partial Least Squares Regression (PLSR) to manage multicollinearity effectively. We applied GMIP-PLSR across multiple GWAS datasets, demonstrating superior performance over PoPS in most cases. In a case study on NAFLD, GMIP-PLSR, using features derived from both disease-specific scRNA-seq and general PoPS features, identified gene sets with higher heritability and stronger enrichment in known NAFLD pathways, confirming its ability to enhance GWAS findings. Built on Nextflow, GMIP is computationally efficient, adaptable to diverse research environments, and provides a robust solution for gene reprioritization in post-GWAS analyses. GMIP-PLSR is available at https://github.com/mohammedmsk/GMIP.
bioinformatics2026-04-09v1Spectral Graph Features for Reference-free RNA 3D Quality Assessment
Zhu, Y.; Zhang, H.; Calhoun, V. D.; Bi, Y.Abstract
Motivation: Existing RNA 3D structure quality assessment (QA) methods rely on local geometric descriptors or statistical potentials that evaluate atomic-level contacts but are blind to global topological coherence. This creates a critical failure mode, structures that are "locally correct but globally wrong", where well-formed local helices mask misplaced domains and incorrect overall packing. Results: We introduce SpecRNA-QA, a lightweight method that scores RNA 3D models using multi-scale spectral features derived from the graph Laplacian of inter-nucleotide contact networks. By computing eigenvalue distributions, heat-kernel traces, and spectral entropy across four distance scales with binary and Gaussian kernels, SpecRNA-QA captures global structural coherence inaccessible to conventional descriptors. In leave-one-out cross-validation on CASP16 (42 targets, 7368 models), spectral features achieve median per-target Spearman rho = 0.69 [95% CI: 0.64-0.73], significantly outperforming an internal geometry baseline (rho = 0.47, Delta_rho = +0.22, Wilcoxon p = 1.2 x 10^{-10}). Compared against established unsupervised statistical potentials (which require no labeled data, unlike the supervised spectral model), rsRNASP outperforms on small-to-medium RNAs (rho = 0.67 vs. 0.57, ≤200 nt). However, rsRNASP times out on most large RNAs (>200 nt), where SpecRNA-QA provides the strongest available quality signal (rho = 0.72 vs. DFIRE 0.52), revealing clear complementarity between global-topological and local-energy scoring. A training-free heuristic using only three spectral statistics enables quality estimation without any labeled data.
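For readers unfamiliar with the spectral quantities named in this abstract, the block below writes out standard textbook definitions for a contact graph with adjacency matrix A and degree matrix D. These are assumptions made for illustration: the abstract does not state whether SpecRNA-QA uses the combinatorial or a normalized Laplacian, nor which entropy definition it adopts, and the symbols Z(t), p_i, and H are introduced here.

```latex
\begin{align}
L &= D - A, \qquad 0 = \lambda_1 \le \lambda_2 \le \dots \le \lambda_n
  && \text{(graph Laplacian and its eigenvalues)} \\
Z(t) &= \operatorname{tr}\, e^{-tL} = \sum_{i=1}^{n} e^{-t\lambda_i}
  && \text{(heat-kernel trace at diffusion time } t\text{)} \\
p_i &= \frac{\lambda_i}{\sum_{j=1}^{n} \lambda_j}, \qquad
H = -\sum_{i:\, p_i > 0} p_i \log p_i
  && \text{(one common spectral entropy)}
\end{align}
```

All three depend on the whole eigenvalue spectrum rather than on any single contact, which is consistent with the abstract's claim that such features reflect global rather than local structure.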
bioinformatics2026-04-09v1Germline VCF Annotator: a lightweight pipeline for processing germline VCFs with robust variant extraction and read evidence quality control
Manojlovic, Z.Abstract
Raw variant calls are typically distributed as VCF files and are not well-suited for direct human review. They are intended for programmatic parsing, and spreadsheet import can distort data through automatic type conversion. Furthermore, variants in VCF are commonly annotated to add gene context and predicted functional consequences. Ensembl VEP, a widely used standard for transcript-aware variant annotation, was adapted in this study to generate standardized consequence fields across genomic features. Using a colon crypt whole-genome sequencing cohort as the motivating dataset, this study examined whether variation at DNA damage response and repair (DDR) loci could contribute to mutation-burden patterns in normal colon crypts, including patterns associated with age and potential treatment-related exposure. To make this question testable in a reproducible table-based format, the Germline VCF Annotator was developed as a two-step workflow that normalizes germline VCFs, generates VEP tabular annotations with explicit allele fields, and then extracts variants of interest and appends read-evidence metrics to assign a rules-based QC class. Within-patient concordance across technical repeats at predefined DDR loci was near-perfect after filtering for nonsilent SNVs with read depth ≥ 15, with discordance concentrated among Low-QC loci. Bulk and crypt-derived samples showed no age-related trend in DDR burden. Although the demonstration centers on DDR and aging, the Germline VCF Annotator is applicable to other gene sets that require human-readable locus-level summaries with retained allele provenance and read evidence.
bioinformatics2026-04-09v1Near perfect identification of half sibling versus niece/nephew avuncular pairs without pedigree information or genotyped relatives
Sapin, E.; Kelly, K.; Keller, M. C.Abstract
Motivation: Large-scale genomic biobanks contain thousands of second-degree relatives with missing pedigree metadata. Accurately distinguishing half-sibling (HS) from niece/nephew-avuncular (N/A) pairs, both of which share approximately 25% of the genome, remains a significant challenge. Current SNP-based methods rely on Identical-By-Descent (IBD) segment counts and age differences, but substantial distributional overlap leads to high misclassification rates. There is a critical need for a scalable, genotype-only method that can resolve these "half-degree" ambiguities without requiring observed pedigrees or extensive relative information. Results: We present a novel computational framework that achieves near-complete separation of HS and N/A pairs using only genotype data. Our approach utilizes across-chromosome phasing to derive haplotype-level sharing features that summarize how IBD is distributed across parental homologues. By modeling these features with a Gaussian mixture model (GMM), we demonstrate near-perfect classification accuracy (> 98%) in biobank-scale data. Furthermore, we show that these high-confidence relationship labels can serve as long-range phasing anchors, providing structural constraints that improve the accuracy of across-chromosome homologue assignment. This method provides a robust, scalable solution for pedigree reconstruction and the control of cryptic relatedness in large-scale genomic studies.
bioinformatics2026-04-08v6TPCAV: Interpreting deep learning genomics models via concept attribution
Yang, J.; Mahony, S.Abstract
Interpreting genomics deep learning models remains challenging. Existing feature attribution methods are largely restricted to one-hot DNA inputs and therefore cannot assess the influence of more general genomic features such as chromatin states or genomic repeats. Concept attribution methods offer an input-agnostic global interpretation framework, yet they have not been systematically applied to interpret neural network applications in genomics. We present the first application of concept attribution to interpret genomics deep learning models by adapting the Testing with Concept Activation Vectors (TCAV) method. We improve upon the original TCAV method by incorporating a PCA-based decorrelation transformation to address correlated and redundant embedding features commonly observed in genomics deep learning models, resulting in the Testing with PCA-projected Concept Activation Vectors (TPCAV) approach. We also introduce a strategy for extracting concept-specific input attribution maps. We evaluate our approach by interpreting influential biological concepts across a diverse set of genomics models spanning multiple input representations and prediction tasks. We demonstrate that TPCAV provides comparable motif feature interpretation to TF-MoDISco on one-hot encoded DNA-based transcription factor binding prediction models. TPCAV also enables robust interpretive analysis of how more general biological concepts such as repetitive elements and chromatin state annotations contribute towards predictions. TPCAV uniquely generalizes to interpret features learned by tokenized foundation models as well as models incorporating chromatin signals as inputs. We further show that TPCAV can identify representative regions associated with specific concepts, motivating downstream investigation of distinct regulatory mechanisms. TPCAV provides a flexible and robust complement to existing model interpretation techniques.
bioinformatics2026-04-08v3A longitudinal data framework for context-specific genotype-to-phenotype mapping
Veith, T.; Beck, R. J.; Tagal, V.; Li, T.; Alahmari, S.; Cole, J.; Hannaby, D.; Kyei, J.; Yu, X.; Maksin, K.; Schultz, A.; Lee, H.; Diaz, A.; Lupo, J.; El Naqa, I.; Eschrich, S. A.; Ji, H.; Andor, N.Abstract
Molecular assays can resolve clonal structure, but they are expensive and typically sparse in time, whereas phenotypic observations such as imaging can be collected frequently but often are not preserved in the context needed for later interpretation. We present CLONEID, an event-based framework for organizing clone-resolved phenotypic, molecular, and specimen-context records so that genotype-to-phenotype interpretation can be maintained across time. CLONEID links time-stamped Events, assay-specific Perspectives, and reconciled Identities through structured ingestion, provenance-aware retrieval, and reproducible export, complementing upstream clone-calling methods. In a long-term gastric cancer density-selection experiment, CLONEID linked repeated culture events, growth measurements, and late karyotypic profiling within a shared record, supporting longitudinal interpretation of phenotypic adaptation together with underlying chromosomal state.
bioinformatics2026-04-08v3Local and Global Patterns Support Medical Imaging as a Biomarker of Ageing
Mueller, T. T.; Starck, S.; Llalloshi, R.; Kaissis, G.; Ziller, A.; Graf, R.; Schlett, C.; Ringhof, S.; Bamberg, F.; Wielpuetz, M.; Völzke, H.; Leitzmann, M.; Niendorf, T.; Keil, T.; Krist, L.; Pischon, T.; Karch, A.; Berger, K.; Kirschke, J.; Rueckert, D.; Braren, R.Abstract
Background: Understanding human ageing across multiple organs is essential for characterising individual health trajectories and identifying abnormal ageing processes. Multi-organ imaging provides an opportunity to quantify biological ageing beyond chronological age. The aim of this study is to assess organ-specific and whole-body ageing patterns and their associations with disease and lifestyle factors. Methods: In this large-scale study, we evaluate biological ageing patterns using 70,000 MRI scans from the UK Biobank and the German National Cohort. We employ 3D ResNet-18 models to predict chronological age from various body regions (brain, heart, liver, spine, lungs, muscle, and intestine) and the whole body. From these predictions, we derive age gaps relative to a strictly healthy reference cohort, which enables the identification of accelerated ageing patterns. We then evaluate associations with chronic diseases and lifestyle factors, and a virtual ageing framework was developed to explore counterfactual scenarios by substituting anatomical regions across subjects, quantifying local impacts on global biological age. Results: Here we show significant associations between detected accelerated ageing and specific chronic diseases, including multiple sclerosis and chronic obstructive pulmonary disease, as well as lifestyle factors such as smoking and physical activity. Virtual substitution of anatomical regions demonstrates that local substitutions can influence global ageing patterns. Conclusions: This study demonstrates that multi-organ imaging enables the detection of abnormal ageing patterns at both local and global levels. The presented framework provides a foundation for improved risk stratification and supports the development of personalised approaches to health assessment and disease prevention.
bioinformatics2026-04-08v3Reconstructing biologically coherent cellular profiles from imaging-based spatial transcriptomics
Yuan, L.; Zheng, Y.; Zhang, S.; Beroukhim, R.; Deshpande, A.Abstract
In imaging-based spatial transcriptomics, transcript-to-cell assignment shapes downstream biological interpretation including cell typing, ligand-receptor inference, and niche characterization. However, two-dimensional segmentation of volumetric tissue often yields mixed cellular profiles, while cells without detected nuclei are missed entirely, distorting the aforementioned downstream analyses. We present TRACER, which refines cellular representations in imaging-based transcriptomics by leveraging gene-gene coherence and spatial co-localization of transcripts observed directly in the data, without requiring external annotations or reference atlases. TRACER resolves mixed cellular profiles and reconstructs partial cells whose nuclei are not detected, enabling more complete representation of cells within the tissue section. We also introduce coherence-based metrics that quantify transcriptional purity and conflict, enabling platform-agnostic benchmarking of segmentation quality. Across diverse platforms, tissues, and segmentation methodologies, TRACER consistently and reproducibly improves the coherence of cellular profiles and the quality of downstream analyses.
bioinformatics2026-04-08v2Genetic demultiplexing and transcript start site identification from nanopore sequencing of 10x Genomics multiome libraries
Mears, J.; Orchard, P.; Varshney, A.; Bose, M. L.; Robertson, C. C.; Piper, M.; Pashos, E.; Dolgachev, V.; Manickam, N.; Jean, P.; Kitzman, D. W.; Fauman, E.; Damilano, F.; Roth Flach, R. J.; Nicklas, B.; Parker, S. C.Abstract
Short-read Illumina sequencing of 10x Genomics single-nucleus multiome libraries captures only the 3' end of RNA transcripts, losing transcription start site (TSS) information. Here we demonstrate nanopore sequencing of 10x multiome libraries, which enables the profiling of full length transcripts. We show concordance with common short-read sequencing based workflows including successful genetic demultiplexing of nanopore data despite its higher error rate. We compare TSS identified using nanopore sequencing of multiome cDNA to those identified using a short-read 5' assay, and provide an optimized approach for the preprocessing of nanopore reads prior to TSS identification. We find that nanopore sequencing of multiome cDNA captures a median of 63% of the TSS detected by the 5' assay.
bioinformatics2026-04-08v2Horse, not zebra: accounting for lineage abundance in maximum likelihood phylogenetics
De Maio, N.Abstract
Maximum likelihood phylogenetic methods are popular approaches for estimating evolutionary histories from genome data. These methods do not make prior assumptions regarding strategies used for deciding which genomes were sequenced. However, in genomic epidemiology the sequencing rate is often agnostic to the specific pathogen strain considered. In this scenario, a pathogen strain's prevalence should be reflected in its relative abundance in the genome data. Here, I show that this simple assumption, when appropriate and incorporated within maximum likelihood phylogenetics, greatly improves the accuracy of phylogenetic inference. I introduce and assess two separate approaches to achieve this. The first approach rescales the likelihood of a phylogenetic tree by the number of distinct binary topologies obtainable by arbitrarily resolving multifurcations in the tree. This approach interprets multifurcations as the result of lack of signal for resolving a bifurcating topology rather than as an instantaneous multifurcating event. The second approach instead includes a tree prior that assumes that genomes are sequenced at a rate proportional to their abundance. Both approaches favor phylogenetic placement at abundant lineages, and dramatically improve the accuracy of phylogenetic inference in scenarios like SARS-CoV-2 phylogenetics, where large multifurcations are common. This considerable impact is also observed in real pandemic-scale SARS-CoV-2 genome data, where accounting for lineage prevalence reduces phylogenetic uncertainty by around one order of magnitude. Both approaches were implemented in the open source phylogenetic software MAPLE v0.7.5.4 (https://github.com/NicolaDM/MAPLE).
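The count used by the first approach can be made concrete with a small sketch. It assumes a rooted tree, where a node with k children can be resolved into (2k-3)!! distinct binary subtrees and the counts multiply across nodes. The nested-tuple tree encoding and the function names are hypothetical; how MAPLE computes this quantity and combines it with the likelihood is not specified in the abstract.

```python
import math

def double_factorial_odd(k):
    """(2k-3)!!: number of rooted binary topologies on k labelled leaves."""
    result = 1
    for m in range(3, 2 * k - 2, 2):
        result *= m
    return result

def log_resolutions(tree):
    """Log of the number of binary resolutions of a rooted tree.

    A tree is either a leaf name (str) or a tuple of child subtrees.
    Each multifurcation is resolved independently, so the per-node
    counts multiply (their logs add)."""
    if isinstance(tree, str):
        return 0.0
    return math.log(double_factorial_odd(len(tree))) + sum(
        log_resolutions(child) for child in tree
    )
```

Working in log space keeps the count manageable for the very large multifurcations that the abstract describes as common in SARS-CoV-2 trees.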
bioinformatics2026-04-08v2GAP-MS: Automated validation of gene predictions using integrated mass spectrometry evidence
Abbas, Q.; Wilhelm, M.; Kuster, B.; Frischman, D.Abstract
Accurate genome annotation is fundamental to modern biology, yet distinguishing authentic protein-coding sequences from prediction artifacts remains challenging, particularly in complex plant genomes where automated methods are error-prone and manual curation is rarely feasible due to prohibitive time and costs. Here, we present GAP-MS (Gene model Assessment using Peptides from Mass Spectrometry), an automated proteogenomic pipeline that leverages mass spectrometry evidence to systematically validate the protein-level accuracy of predicted gene models. Applied across 9 major crop species, GAP-MS consistently improved prediction precision for four widely used gene prediction tools. In addition to filtering erroneous models, the pipeline identified hundreds of previously missing gene models from current standard reference annotations. These peptide-supported loci were further verified by transcriptional evidence, well-supported functional annotations, and high coding-potential scores. Together, these results demonstrate that direct proteomic evidence provides a robust framework for resolving annotation ambiguities, defining high-confidence reference proteomes, and uncovering overlooked protein-coding genes, while facilitating the identification of sequences that may require further investigation.
bioinformatics2026-04-08v2Sampling protein structural token space enables accurate prediction of multiple conformations
Wang, Z.; Yu, Y.; Yu, C.; Bu, D.Abstract
Protein function is fundamentally mediated by ensembles of distinct metastable states. However, existing methods, such as AlphaFold 3, typically exhibit a bias toward predicting a single dominant state, failing to capture alternative conformations or provide robust metrics for identifying high-quality multi-state conformations. Here, we present MultiStateFold (MSFold), a framework that integrates Parallel Tempering into the discrete structure token space of the ESM3 protein language model. By conceptualizing the model's latent space as an implicit energy landscape, MSFold enables global exploration and barrier crossing, thereby overcoming the local sampling limitations inherent in base generative models. Across a benchmark of 313 multi-conformation pairs, MSFold sets a new performance standard: it achieves the highest success rate in modeling native states and substantially outperforms leading methods, including AlphaFold 3, on challenging alternative conformations, while maintaining competitive accuracy for primary structures. Furthermore, we propose Sequence Log-Likelihood (SLL), a novel confidence metric derived from sequence-structure consistency. Our results demonstrate that SLL offers a modest improvement over standard metrics such as pTM and pLDDT. This work establishes a new paradigm for conformational sampling, bridging classical statistical physics with protein language models.
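For readers unfamiliar with Parallel Tempering, the generic replica-exchange scheme can be sketched as follows. This is a minimal illustration on a toy one-dimensional double-well energy, not MSFold's implementation: the energy function, temperature ladder, step size, and all names are assumptions made for the example, whereas MSFold operates on ESM3 structure tokens.

```python
import math
import random


def energy(x: float) -> float:
    """Toy double-well landscape standing in for an implicit model energy."""
    return (x * x - 1.0) ** 2


def metropolis_step(x, temperature, rng, step=0.5):
    """One local Metropolis move at a fixed temperature."""
    proposal = x + rng.uniform(-step, step)
    delta = energy(proposal) - energy(x)
    if delta <= 0 or rng.random() < math.exp(-delta / temperature):
        return proposal
    return x


def swap_probability(e_i, e_j, t_i, t_j):
    """Replica-exchange acceptance: min(1, exp((1/t_i - 1/t_j) * (e_i - e_j)))."""
    exponent = (1.0 / t_i - 1.0 / t_j) * (e_i - e_j)
    return 1.0 if exponent >= 0 else math.exp(exponent)


def parallel_tempering(temperatures, n_sweeps=2000, seed=0):
    """Run one replica per temperature and return the coldest replica's samples."""
    rng = random.Random(seed)
    states = [-1.0 for _ in temperatures]  # all replicas start in the left well
    cold_samples = []
    for _ in range(n_sweeps):
        # Local exploration within each replica.
        states = [metropolis_step(x, t, rng) for x, t in zip(states, temperatures)]
        # Attempt swaps between neighbouring temperatures.
        for i in range(len(temperatures) - 1):
            p = swap_probability(
                energy(states[i]), energy(states[i + 1]),
                temperatures[i], temperatures[i + 1],
            )
            if rng.random() < p:
                states[i], states[i + 1] = states[i + 1], states[i]
        cold_samples.append(states[0])
    return cold_samples


samples = parallel_tempering([0.05, 0.2, 0.8, 3.2])
```

The hot replicas cross the barrier between wells easily, and accepted swaps pass those configurations down to the cold replica, which is the barrier-crossing behaviour the abstract attributes to the method.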
bioinformatics2026-04-08v2A geometric criterion links HIV-1 capsid topography to its biophysical properties and function
Li, W.; Peeples, C. A.; Rey, J. S.; Perilla, J. R.; Twarock, R.Abstract
Mathematical models of virus capsid structure are pillars of modern virology, aiding the understanding of viral mechanisms and the design of antiviral interventions. Traditionally, the HIV-1 capsid core geometry is represented as a fullerene lattice, akin to the icosahedral models of spherical viruses in Caspar-Klug theory. However, recent studies revealed that many viral capsids deviate from such idealised lattices, with important functional implications. Here we demonstrate that this is the case also for the conical HIV-1 core geometries, in which the hexamer and pentamer boundaries form a pseudo-tiling rather than a perfectly aligned fullerene network. We introduce a triangular geometric criterion that quantifies local deviations of an HIV-1 atomic model from its idealised fullerene backbone. Using this criterion, we demonstrate that this difference in geometric organisation between idealised (fullerene) and actual (data-derived) capsid models has implications for the capsid's biophysical properties. We also discuss the use of the geometric criterion as a predictive tool regarding cofactor binding and implied geometric changes in the capsid surface coupled to the interfacial frustration response. Our results establish a quantitative framework linking capsid geometry, curvature, and biophysical function, offering new perspectives for assembly inhibitor design and lentiviral vector engineering.
bioinformatics2026-04-08v2A mathematical model for inflammation and demyelination in multiple sclerosis
Jenner, A. L.; Weatherley, G. R.; Frascoli, F.Abstract
Multiple sclerosis (MS) is an incurable life-long disease caused by the demyelination of neurons in the brain and spine. MS is often characterised by relapses of inflammation and demyelination that are then followed by periods of remission. Symptoms can be highly debilitating and there are still many open questions about the origin and progression of the disease. Mathematical modelling is well-placed to capture the dynamics of MS and provide insight into disease aetiology. In this work, we present a minimal model for MS disease onset and progression driven by inflammation and demyelination. The model dynamics are capable of describing a typical evolution of the illness, with changes from a healthy state to a diseased scenario captured by certain ranges of parameter values. Our model also describes the non-uniform oscillatory nature of the disease, born from a Hopf bifurcation due to the strength of the inflammatory response. In particular, using experimental data for Contrast Enhancing Lesions obtained from MS patients, we are able to reproduce some of the typical relapsing-remitting behaviours of this disease. We hope that the model presented here can serve as a baseline for more complex approaches and as a tool to predict possible evolutions of the disease.
bioinformatics2026-04-08v2Spatially Anchored Regulatory State Inference in Melanoma
Dwarampudi, J. M. R.; Kochat, V.; Satpati, S.; Mahmud, M. I.; Anzum, H.; Wani, K.; Lazar, A.; Saw, A. K.; Malke, J.; Nguyen, H. V.; Rai, K.; Banerjee, T.Abstract
Spatial transcriptomics (ST) captures gene expression within tissue architecture but lacks direct regulatory information, while single-cell multiome assays profile transcriptional and chromatin states without spatial context. We present a framework for spatially anchored regulatory inference that integrates Visium ST with single-cell multiome data to infer spatially resolved regulatory programs. Building upon GraphST, we introduce spatially regularized cell-to-spot mapping and propagate chromatin accessibility and transcription factor motif activity into tissue space. Regulatory analysis is performed at the spatial domain level via joint differential expression and accessibility testing, along with quantitative concordance assessment. Applied to melanoma tissue sections, the framework reveals spatially localized regulatory programs and shows that assignment strategy substantially affects downstream regulatory stability. This modular approach enables interpretable gene-, peak-, and transcription factor-level outputs for multimodal spatial analysis.
bioinformatics2026-04-08v1UBL3 UBL domain exhibits distinct helix-centered dynamic control among ubiquitin-like proteins
Matsuda, K.; Moriya, Y.; Xu, L.; Ohmagari, R.; Aramaki, S.; Zhang, C.; Baba, A.; Hirayama, S.; Kahyo, T.; Setou, M.Abstract
Ubiquitin-like protein 3 (UBL3) is a post-translational modifier that sorts proteins into small extracellular vesicles and regulates the trafficking of disease-associated proteins such as α-synuclein. The structural and dynamic features of the UBL domain that underlie these functions, however, remain poorly understood. Here we performed in silico structural dynamics analysis of the UBL3 UBL domain using an NMR structure ensemble combined with anisotropic network modeling (ANM) and perturbation response scanning (PRS). Principal component analysis and residue-wise fluctuation analysis consistently revealed high flexibility in the C-terminal region of UBL3. Comparative ANM analysis across 20 ubiquitin-like proteins (UBLs) further showed that C-terminal flexibility is a conserved yet variable property within the UBL family. PRS analysis demonstrated that residues forming the central α-helix of the β-grasp fold exert greater dynamic control over collective motions than β-sheet residues. Notably, UBL3 displayed the highest helix/sheet PRS effectiveness ratio among all UBLs analyzed, highlighting the prominent dynamic contribution of helix residues in this domain. Together, these results provide a structural basis for understanding UBL3-dependent protein interactions and disease-related mechanisms, and suggest that helix-centered dynamic control in the UBL domain may represent a potential target for modulating UBL3 function.
bioinformatics2026-04-08v1Geometry-aware ligand-receptor analysis distinguishes interface association from spatial localization and reveals a continuum of tumor communication
Yepes, S.Abstract
Spatial transcriptomics enables inference of cell-cell communication through ligand-receptor (LR) interactions, but current prioritization strategies often rely on expression strength or interface-associated enrichment without explicitly modeling tissue geometry. As a result, interactions associated with population interfaces are frequently interpreted as spatially localized even when their underlying expression is broadly distributed. Here, we present a geometry-aware framework for LR prioritization that explicitly separates interface structure from spatial localization within a locked and reproducible analysis pipeline. We quantify interface-associated communication using a distance-weighted boundary score defined on a spatial neighbor graph, evaluate interface specificity using a label-permutation null model that preserves spatial geometry, and compute an LR-specific localization score that captures the proximity of ligand and receptor expression to the corresponding interface. This framework distinguishes interface-associated compatibility from interaction-level spatial concentration. Across spatial transcriptomics datasets from breast cancer, colorectal cancer, melanoma, and pancreatic ductal adenocarcinoma, interface-aware ranking consistently recovers pathway families associated with extracellular matrix, adhesion, inflammatory, and immune-related processes. However, interface enrichment frequently shows limited separation from the null model, indicating that interface structure alone does not establish spatial specificity. Incorporating geometric localization substantially alters LR prioritization, distinguishing interactions that are concentrated near interfaces from those that are more diffusely distributed. Under a fixed, deterministic pipeline applied identically across datasets without parameter tuning, discrete spatial communication regimes were not reproducibly recovered. 
Instead, variation across samples is more consistently captured as continuous differences in geometry-aware attenuation, reflecting the degree to which inferred interactions are spatially constrained by tissue architecture. Together, these results demonstrate that interface-associated enrichment and spatial localization are distinct properties of inferred LR interactions, and that accurate interpretation of spatial communication requires explicit modeling of tissue geometry. Under this framework, tumor communication is more consistently described as a continuum of spatial constraint.
bioinformatics2026-04-08v1Geometry-enhanced protein language modeling enables discovery of novel antibiotic resistance genes
Lin, X.; Guan, J.; Hong, Y.; Guo, Y.; Yang, Y.; Xie, P.; Zhao, Z.; Liu, X.; Huang, Y.; Ye, Y.; Tang, Y.; Lee, T.-Y.; Chiang, Y.-C.; Wei, L.; Liu, X.; Wang, J.; Pan, Y.; Tang, J.; Pei, Y.; Yao, L.Abstract
The global antibiotic resistome remains largely unexplored, not because antibiotic resistance genes (ARGs) are rare in the environment, but because many are evolutionarily distant from known ARGs. Current computational approaches primarily rely on sequence homology, and thus miss distant homologues. We develop GeoARG, a geometry-enhanced framework that integrates structural features with protein language models through knowledge distillation, enabling efficient large-scale screening using sequence input alone. Across multiple benchmarks, GeoARG substantially improves the detection of remotely homologous ARGs, particularly under low sequence identity and fragmented conditions. Large-scale metagenomic analysis uncovers 1,485 high-confidence ARG candidates that are highly divergent from known ARGs, expanding the phylogenetic and functional landscape of the resistome. Structural analyses further show that these candidates preserve active-site geometry and maintain stable ligand-binding configurations consistent with known resistance mechanisms. These results demonstrate that geometric constraints enable systematic expansion of the resistome and facilitate the discovery of evolutionarily distant yet functionally conserved genes. A public web server is available at https://ycclab.cuhk.edu.cn/GeoARG/
bioinformatics2026-04-08v1Adaptive Integration of Heterogeneous Foundation Models to Find Histologically Predictable Genes in Breast Cancer
Nguyen, H.; Li, C.; Peng, C.; Simpson, P.; Ye, N.; Nguyen, Q.Abstract
Foundation models for computational pathology have rapidly emerged as powerful tools for extracting rich biological and morphological representations from histopathology images. However, variations in model architecture, pre-training data, and optimization objectives often lead to task-dependent performance, rather than universal generalization. As a result, effective strategies for integrating their complementary strengths are essential to fully realize the potential of foundation models for robust histopathology analysis. Meanwhile, recent breakthroughs such as spatial transcriptomics provide an unprecedented opportunity to integrate genetic and histopathology information from the same patient sample, thereby maximizing both molecular and anatomical pathology insights. Specifically, each model's embedding is first mapped to gene-level predictions via a dedicated prediction head, enabling model-specific feature utilization. A lightweight weighting network then adaptively aggregates these predictions to produce a unified and robust output at gene and spatial location levels. Across multiple spatial transcriptomics datasets, our approach consistently outperforms both individual foundation models and classical ensembling methods. Focusing on breast cancer, we observe substantial gains in prediction accuracy for clinically relevant PAM50 subtype markers and drug-target genes. Moreover, the proposed framework improves interpretability by revealing model-specific contributions and specialization at the gene level. Overall, our work presents an effective solution to integrating multiple foundation models for enhancing the genetic analyses of histopathology images.
bioinformatics2026-04-08v1Analysis of multicellular anatomical structures from spatial omics data using sosta
Gunz, S.; Crowell, H. L.; Robinson, M. D.Abstract
Spatial omics technologies enable high-resolution, large-scale quantification of molecular features while preserving the spatial context within tissues. Existing analysis methods largely focus on spatial arrangements of single cells, whereas biological function often emerges from multicellular arrangements. Here, we introduce structure-based analysis of spatial omics data, which focuses on the direct analysis of multicellular, anatomical structures. We illustrate this type of analysis using two publicly available datasets and provide sosta, an open-source Bioconductor package for broad community use.
bioinformatics2026-04-07v3Flow molecular dynamics simulations reveal mechanosensitive regulation of von Willebrand factor through glycan-modulated autoinhibitory modules
Richard Louis, N. E. L.; Zhao, Y. C.; Ju, L. A.Abstract
Force-induced protein conformational changes govern many essential biological processes, yet their molecular mechanisms remain difficult to resolve. Von Willebrand factor (VWF), a central regulator of haemostasis, is activated by hydrodynamic forces in blood flow, but how mechanical signals propagate across its multidomain architecture is poorly understood. Here, we use flow molecular dynamics (FMD), a simulation framework that applies fluid forces via controlled solvent flow to interrogate mechanosensitive proteins. Using VWF as a model system, we reconstructed the complete mechanomodule (DD3A1A2A3; 1,109 residues) with native glycosylation by integrating crystallographic data and AlphaFold predictions. FMD simulations capture a force-driven transition from a compact, autoinhibited bird-nest ensemble to an extended, activated state, revealing asymmetric autoinhibitory strengths within the NAIM and CAIM modules of the A1 domain. By directly linking static structures to dynamic, force-regulated behaviour, this work establishes a generalizable platform for dissecting protein mechanosensitivity and enabling the rational design of force-responsive therapeutics.
bioinformatics2026-04-07v1FunctionaL Assigning Sequence Homing (FLASH) maps phenotype to sequence with deep and machine learning
Cotter, D. J.; Harrison, M.-C.; Rustagi, A.; Wang, P. L.; Kokot, M.; Carey, A. F.; Deorowicz, S.; Salzman, J.Abstract
Genome-wide association studies (GWAS) map genetic variation to a reference genome and correlate variants to phenotypes. Yet, GWAS and similar procedures have limitations, including an inability to predict phenotype on variants never seen during the discovery phase and difficulty integrating structural variants. Deep and machine learning alternatives have not been successful at consistent prediction of resistance phenotypes (Hu et al. 2024). Here, we introduce FLASH: a new interpretable, statistically based deep learning framework that operates directly on raw sequencing reads. In over 35,000 isolates of bacteria, fungi and viruses, FLASH achieves uniformly high accuracy on independent test data, including on variation never seen in training, meeting or exceeding bespoke state-of-the-art methods. FLASH identifies canonical drug targets ab initio and new pan-species predictors of virulence, including those lacking annotation and those only partially aligned to NCBI reference databases. Further, FLASH can predict phenotypes beyond the possibility of GWAS, such as bacterial host range of phage, a task that to our knowledge is impossible today. FLASH is simple to run, highly efficient and constitutes a new approach for predicting gene function and phenotype across the tree of life. It is especially valuable when bioethical concerns and the vast genetic complexity of pathogenic microbes limit the feasibility of experimental validation.
bioinformatics2026-04-07v1Estimation of metabolite levels in cheese from microbial gene expression
Mansouri, A.; Mekuli, R.; Swennen, D.; Durazzi, F.; Remondini, D.Abstract
Characterizing the aromas and flavours generated during cheese production is of high relevance for the food industry. A deeper comprehension of flavour generation can be achieved by understanding the role of the microbial populations governing milk processing, and in particular their metabolic activity as governed by gene expression. In this work we considered two independent experiments in which gene expression of the microbial population involved in cheese processing was sampled, together with quantification of the final volatile products. We estimated the final volatile compound profile from the measured metatranscriptomic expression by using machine learning with two different strategies for model training and validation, and we were able to associate specific biochemical pathways with the identified gene signatures.
bioinformatics2026-04-07v1Integrative AlphaFold Modeling, Fragment Mapping, and Microsecond Molecular Dynamics Reveal Ligand-Specific Structural Plasticity at the Human Urotensin II Receptor
Torbey, A. G.Abstract
The peptide ligands urotensin II (hUII, human) and hUII-related peptide (URP), together with their cognate human receptor (hUT), are known for their implications in cardiovascular pathophysiology, yet the lack of experimentally resolved hUT structures has limited a deep mechanistic understanding of ligand binding and receptor activation. Here, we leverage recent breakthroughs in multistate AlphaFold predictions, long-timescale molecular dynamics (MD) simulations, and site identification by ligand competitive saturation (SILCS)-based pocket mapping and ligand-bound conformation solving to illuminate the dynamic interaction of hUII and URP with hUT. By analyzing hUT dynamics in its intracellular transducer binding pocket, and residue-level interaction probabilities in each simulation, we capture subtle distinctions in the way hUII and URP anchor key pocket residues and modulate transmembrane (TM) domain tilts. Results indicate that hUII imposes stronger conformational constraints on TM5 and TM6 relative to URP, both potentially stabilizing different active-like receptor configurations. At the same time, interaction maps highlight unique aromatic and polar networks that each ligand exploits. These findings reinforce the concept that relatively small differences in GPCR peptide ligand structure may lead to large effects on receptor-state selection and signal specificity, ultimately reflecting different clinical outcomes. By integrating computational modeling with per-residue dynamics, this work not only reconciles prior mutagenesis and docking data but also provides validated 3D models and MD simulations of the endogenous ligands bound to hUT, offering new opportunities to selectively harness ligand-dependent signaling in the urotensinergic system.
bioinformatics2026-04-07v1