Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
Multi-scale spatial testing recovers gene programs missed by existing detection methods
Yang, C.; Zhang, X.; Chen, J.
Abstract
Identifying spatially variable genes (SVGs) is the first analytical step in spatial transcriptomics, determining which genes and pathways are prioritized for downstream validation. Yet the restricted spatial models of current detection methods create systematic blind spots that can exclude biologically coherent programs from discovery. Here we present FlashS, which reformulates kernel-based spatial testing in the frequency domain to detect arbitrary multi-scale expression patterns while scaling to millions of cells. In human cardiac tissue, this broader detection capacity recovers a coherent PGC-1alpha-regulated mitochondrial biogenesis program, with 40 of 49 pathway genes spatially associated with ventricular cardiomyocytes, that PreTSA, a leading parametric alternative, largely misses (1 of 49 genes), a finding replicated in an independent cohort. Across 50 benchmark datasets spanning 9 platforms, FlashS achieves state-of-the-art ranking accuracy (mean Kendall tau = 0.935) and completes on the Allen Brain MERFISH atlas (3.94 million cells) in 12.6 minutes with 21.5 GB memory.
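The ranking accuracy quoted above is a Kendall tau between a method's gene ranking and a reference ranking. As a minimal sketch of what that number measures, the following computes the tau-a variant (ties counted as neither concordant nor discordant). The abstract does not state which variant the benchmark uses, and the gene scores below are hypothetical.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall tau-a between two equal-length score lists.

    Pairs tied in either list count as neither concordant nor discordant.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores for five genes: reference spatial signal vs. a method's statistic.
truth = [0.9, 0.7, 0.5, 0.3, 0.1]
method = [12.0, 9.5, 10.1, 4.2, 1.3]
tau = kendall_tau_a(truth, method)  # one swapped pair out of ten
```

A value of 1 means the two rankings agree on every gene pair, so a mean of 0.935 indicates that only a small fraction of pairs are ordered differently from the reference.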
bioinformatics 2026-04-09 v3
Quaternion Spectral Fingerprinting of DNA: GPU-Accelerated Multi-Channel Fourier Analysis for Alignment-Free Genomics
Bergach, M. A.
Abstract
Spectral methods for DNA sequence analysis, which treat genomic data as a discrete signal and compute its Fourier transform, were proposed over three decades ago but remained impractical for whole-genome analysis due to computational cost. We present a quaternion Fourier transform framework that encodes DNA as a quaternion-valued signal q[n] ∈ {1, i, j, k} mapping to the four nucleotides {A, T, G, C}, and prove that the full quaternion spectrum is computable from exactly two standard complex FFTs: Q(k) = Z_1(k) + Z_2(N-k) · j, where Z_1 = FFT(u_A + i · u_T) and Z_2 = FFT(u_G + i · u_C). We establish that the resulting spectral fingerprint F(k) = (|Z_1(k)|^2, |Z_2(k)|^2) is invariant under both cyclic shift and reverse complement, the two fundamental symmetries of double-stranded DNA. Building on this theoretical foundation, we develop three computational tools: (i) a 4x4 Hermitian cross-spectral matrix with inter-channel coherence analysis, (ii) a genome spectrogram via sliding-window short-time Fourier transform, and (iii) an alignment-free spectral variant detection algorithm with O(N log N) complexity. Applying Welch's cross-spectral coherence analysis to E. coli K-12, we discover that the DNA helical repeat (~11 bp) is invisible to the standard power spectrum but clearly detected through the cross-spectral matrix condition number (κ = 6.5), demonstrating that multi-channel analysis reveals structural periodicities that single-channel methods miss. Phase spectrum analysis recovers the characteristic nucleotide ordering within codons (A → T → G → C), while three distinct frequency regimes of inter-nucleotide coupling emerge: complementary-dominated (long-range), purine/pyrimidine-dominated (structural), and codon-position-dominated (coding). Cross-species validation on 18 genomes spanning all three domains of life, namely Bacteria (5), Archaea (3), and Eukarya (10), with GC content from 19.6% (P. falciparum) to 69.5% (T. thermophilus), confirms the universality of these findings. The helical repeat is detected via cross-spectral coherence in 18/18 organisms (100%). All 10 eukaryotes show A-T dominance at the helical repeat, a spectral signature of nucleosome wrapping absent from prokaryotes. Non-complementary pairs (A-C, T-G) dominate the coding frequency in 17/18 organisms. Validation on human chromosome 21 (46.7 Mb, processed in 5.0 s on Apple M1) reveals eukaryote-specific spectral signatures (nucleosome positioning at 10.67 bp, nucleosome spacing at 170.7 bp, and Alu repeat dominance at 341 bp) absent from prokaryotic spectra. A proof-of-concept spectral variant detection experiment achieves 100% read-matching accuracy (100/100 reads) and statistically significant discrimination of SNPs from sequencing errors (t = 14.80, p < 0.001, Cohen's d = 1.64), scaling to d = 8.96 at 30x coverage. The full human genome can be spectrally analyzed in approximately 3-4 seconds on an M1 GPU and under 1 second on M4 Max, enabling interactive spectral genomics on commodity hardware.
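The two-transform fingerprint defined in this abstract can be illustrated with a small, standard-library sketch. It is not the authors' GPU implementation: a naive O(N^2) DFT stands in for the FFT, and the helper names (`dft`, `fingerprint`, `close`) and the test sequence are illustrative assumptions. It builds the indicator signals u_A, u_T, u_G, u_C, forms Z_1 and Z_2 as stated, and checks numerically that F(k) is unchanged by a cyclic shift and by reverse complementation.

```python
import cmath

BASES = "ATGC"
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def dft(signal):
    """Naive O(N^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def fingerprint(seq):
    """Return F(k) = (|Z_1(k)|^2, |Z_2(k)|^2) for a DNA string over A, T, G, C."""
    # Indicator signals: u[b][n] is 1.0 where seq[n] == b, else 0.0.
    u = {b: [1.0 if c == b else 0.0 for c in seq] for b in BASES}
    z1 = dft([a + 1j * t for a, t in zip(u["A"], u["T"])])
    z2 = dft([g + 1j * c for g, c in zip(u["G"], u["C"])])
    return [(abs(a) ** 2, abs(b) ** 2) for a, b in zip(z1, z2)]


def reverse_complement(seq):
    return "".join(COMPLEMENT[c] for c in reversed(seq))


def close(f1, f2, tol=1e-9):
    """True if two fingerprints agree component-wise within tol."""
    return all(abs(a - b) < tol for p, q in zip(f1, f2) for a, b in zip(p, q))


seq = "ATGCGTACGTTAGC"  # arbitrary illustrative sequence
shifted = seq[5:] + seq[:5]
same_after_shift = close(fingerprint(seq), fingerprint(shifted))
same_after_revcomp = close(fingerprint(seq), fingerprint(reverse_complement(seq)))
```

Both checks hold because a cyclic shift or a reversal changes only the phase of each Fourier coefficient, and swapping A with T (or G with C) turns u_A + i · u_T into i times its complex conjugate, which leaves the magnitude untouched.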
bioinformatics 2026-04-09 v1
Agentic systems are adept at solving well-scoped, verifiable problems in computational biology
Nair, S.; Gunsalus, L.; Orcutt-Jahns, B.; Rossen, J.; Lal, A.; Donno, C. D.; Celik, M. H.; Fletez-Brant, K.; Xie, X.; Bravo, H. C.; Eraslan, G.
Abstract
We introduce CompBioBench, a benchmark of 100 diverse tasks for evaluating agentic systems in computational biology. Unlike mathematics and programming, which more readily admit systematic verification, biological data are inherently noisy and open to interpretation. To enable objective evaluation without reducing tasks to prescriptive checklists, we propose a new benchmark construction strategy based on synthetic/augmented data and metadata scrambling/scrubbing of real datasets to create challenging problems with a single ground-truth answer that require multi-step reasoning, tool use, bespoke code, and interaction with real-world external resources. The benchmark spans genomics, transcriptomics, epigenomics, single-cell analysis, human genetics, and machine learning workflows. Questions are curated by domain experts to cover a broad range of skills with varying difficulty. We evaluate leading general-purpose agentic systems starting from a bare-minimum environment, requiring them to fetch data and tools as needed to solve each problem. We find strong end-to-end performance, with Codex CLI (GPT 5.4) reaching 83% accuracy and Claude Code (Opus 4.6) reaching 81%. On the hardest questions, Codex CLI (GPT 5.4) reaches 59%, while Claude Code (Opus 4.6) reaches 69%. CompBioBench provides a practical testbed for measuring the progress of agentic systems in computational biology and for guiding future benchmark design.
bioinformatics 2026-04-09 v1
IEKB: a comprehensive knowledge base for inner ear genetics integrating curated associations, cochlear interactions, Bayesian candidate prioritisation, explainable dark-gene support relations, and a scientific entity network
Wang, H.; Chen, W.; Ning, H.; Cai, Y.; Xu, Y.; Hou, X.; Pang, L.; Luo, Z.; Tian, C.
Abstract
Inner-ear genetics has expanded rapidly, yet the supporting evidence remains dispersed across a vast literature and across resources that typically emphasise loci, variants, or expression data rather than integrated biological interpretation. Here we present the Inner Ear Knowledge Base (IEKB; https://earkb.org), an open database that unifies curated associations, cochlear interaction evidence, candidate prioritisation, explainable support relations, and network exploration for inner-ear research. IEKB was built with an automated agent-assisted curation workflow that combines schema-constrained literature extraction, continuous human monitoring, and final expert review by inner-ear genetics researchers. By systematically analysing 250,696 PubMed-indexed records retrieved across 16,563 screened genes, IEKB curates 6,051 gene-phenotype-disease associations from 2,494 genes across 43 phenotype categories and 4,102 cochlear gene-gene interactions with pathway, cell-type, and experimental context. IEKB further includes a Bayesian "dark matter" module that prioritises 243,071 candidate gene-phenotype associations for 13,229 genes across all 43 phenotypes (global AUC-ROC = 0.8603; global AUC-PR = 0.1674), together with a supervised dark-relation layer that ranks phenotype-specific known-gene support for each candidate and a multi-entity scientific network containing nearly 4,000 entities, 28,616 deterministic edges, and 83,712 literature-derived relational links. The web resource supports interactive search, multi-parameter filtering, gene-detail pages, bibliometric exploration, domain-specific enrichment against IEKB phenotype and disease gene sets, network visualisation, bulk download in CSV, JSON, SQLite, and XLSX formats, and natural-language evidence-grounded question answering through a companion conversational interface (IEKB QA). To our knowledge, IEKB is the first openly accessible inner-ear resource that integrates curated associations, cochlear interactions, probabilistic candidate prioritisation, auditable known-gene support relations for novel candidates, and a multi-entity scientific network within a single database. All data are released without registration under the CC BY 4.0 license.
bioinformatics 2026-04-09 v1
STAnalyzer: Transparent Spatial Transcriptomics Analysis via an Agentic Architecture
Luo, H. H.; Liu, L.; Xing, Z.; Li, X.; Zhang, X.; Du, W.; Liu, B.; Wang, J.; Yu, G.
Abstract
Spatial transcriptomics enables high resolution profiling of gene expression within spatial contexts, yet its potential is often hindered by fragmented toolchains, intricate parameters, and cognitive bottlenecks of interpreting high dimensional data. While recent Large Language Model agents have attempted to automate this process, they remain constrained by rigid execution logic, lack multimodal feedback for self correction, and operate in epistemic isolation from established biological knowledge. Here, we present STAnalyzer, an intelligent multiagent framework designed to automate the end to end analytical lifecycle from raw data processing to biological hypothesis generation. Transcending traditional pipelines, STAnalyzer employs a collaborative intelligence architecture to achieve three core capabilities: (1) Intent Driven Orchestration, which dynamically translates natural language queries into rigorous bioinformatics workflows; (2) Multi Modal Self Refinement, which autonomously ensures analytical robustness through closed loop synthesis of evidence from visual patterns and statistical metrics; and (3) Evidence based Cross Validation, which bridges the gap between data driven correlations and biological causation by anchoring findings in ground truth literature and structured databases. By eliminating manual analytical bottlenecks and ensuring rigorous evidentiary traceability and transparency, STAnalyzer makes high resolution spatial omics more accessible to a broader research community. It provides a robust and scalable framework for cross platform automated analysis and accelerated biological discovery, translating massive spatial datasets into verifiable biological insights.
bioinformatics 2026-04-09 v1
PoolParty: streamlined design of DNA sequence libraries in Python
Liu, Z.; Cordero, A.; Kinney, J. B.
Abstract
Computationally designed DNA sequence libraries are essential components of many high-throughput assays. They are also increasingly used in silico to analyze genomic AI models. Designing these libraries, however, remains tedious and error-prone. Here we describe PoolParty, a Python package that streamlines the design of complex oligo pools using a simple but flexible API. In PoolParty, each library is represented by a computational graph that can be specified in just a few lines of code. Over 50 built-in operations cover nucleotide- and codon-level mutagenesis, motif insertion, barcode generation, and more. PoolParty also provides "design cards" detailing how each sequence was generated.
bioinformatics 2026-04-09 v1
End-to-end evaluation of pipelines for metagenome-assembled genomes reveals hidden performance gaps
Coleman, I.; Ma, J.; Qian, G.; Jiang, Y.; Brown Kav, A.; Korem, T.
Abstract
The generation of Metagenome Assembled Genomes (MAGs) has become a standard and basic step in the analysis of metagenomic data. This multi-step process, which includes assembly, binning, refinement, and quality control, has many alternative approaches, algorithms, and parameters. Determining the ideal approach for a given ecosystem and study, or highlighting algorithmic gaps in need of additional research and development, requires rigorous benchmarking. We present MAG-E (MAG pipeline Evaluator), a generalizable and expandable framework for end-to-end evaluation of entire MAG pipelines: from assembly, through binning, to quality control and filtering. MAG-E relies on simulations that are built to match an ecosystem of interest and provide a ground truth for accurate evaluation. To demonstrate the capabilities of MAG-E, we benchmark two assemblers, six binning algorithms, three binning modes, and three quality control and refinement methods in the context of the human gut microbiome. Our findings offer multiple insights into optimal MAG generation in this context. We find that metaSPAdes consistently outperforms MEGAHIT in terms of recall (completeness), and that COMEBin overall outperforms alternative binning algorithms, but has lower precision than SemiBin2. While multi-sample binning results in higher precision, as previously shown, single-sample binning has higher recall and leads to better overall performance with modern binners. Binning refinement, which combines bins from multiple different algorithms, leads to reduced performance. We further show that CheckM2 systematically overestimates completeness and underestimates contamination, and that this is partially ameliorated when using GUNC. Finally, we analyze performance at the contig level, and demonstrate that binning algorithms systematically underperform for prophages and fail to bin contigs that are shared between genomes. Overall, MAG-E offers deep insights into successes and gaps in this important analytic process.
bioinformatics 2026-04-09 v1
A Grid-Search Framework for Dataset-Specific Calibration of Actigraphy Sleep Detection Algorithms
Rahjouei, A.
Abstract
Actigraphy is widely used for long-term sleep monitoring, but established sleep-wake scoring algorithms often require parameter tuning, which is commonly performed manually and can reduce reproducibility. In this study, a grid-search-based calibration framework is presented for established actigraphy algorithms and evaluated as a practical alternative to manual tuning. The method was evaluated using two datasets: a multi-subject polysomnography-validated actigraphy dataset and a self-collected dual-device dataset. In the polysomnography-validated dataset, grid-search optimization produced performance patterns similar to manual parameter selection, while slightly improving detection of sleep onset and sleep offset and yielding modest gains in wake-sensitive metrics. In the dual-device dataset, consensus and majority voting were useful for reducing the influence of brief wake episodes occurring within the main sleep period, including micro-awakenings that can fragment sleep predictions across individual algorithms. Overall, these findings show that grid search can replace manual parameter tuning with a more explicit and reproducible procedure while providing small improvements in sleep timing estimation and benefiting ensemble-based handling of within-sleep wakefulness.
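The calibration idea can be sketched as an exhaustive search over a small parameter grid, scored against reference labels. Everything in the sketch is an assumption made for illustration: the moving-average threshold scorer is a toy stand-in rather than one of the established algorithms the study calibrates, and the activity counts, reference labels, and grid values are invented.

```python
from itertools import product


def score_sleep(counts, window, threshold):
    """Label an epoch as sleep (1) if the centered moving average of
    activity counts is below threshold, else wake (0)."""
    half = window // 2
    labels = []
    for i in range(len(counts)):
        lo, hi = max(0, i - half), min(len(counts), i + half + 1)
        mean = sum(counts[lo:hi]) / (hi - lo)
        labels.append(1 if mean < threshold else 0)
    return labels


def accuracy(pred, ref):
    """Fraction of epochs where prediction and reference agree."""
    return sum(p == r for p, r in zip(pred, ref)) / len(ref)


def grid_search(counts, ref, windows, thresholds):
    """Return (best_accuracy, window, threshold) over the full grid.

    Ties keep the first combination encountered, so results are reproducible.
    """
    best = None
    for w, t in product(windows, thresholds):
        acc = accuracy(score_sleep(counts, w, t), ref)
        if best is None or acc > best[0]:
            best = (acc, w, t)
    return best


# Toy data: activity counts per epoch and reference (e.g. PSG) sleep labels.
counts = [50, 40, 5, 2, 0, 30, 1, 0, 3, 45, 60]
ref = [0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0]
best_acc, best_window, best_threshold = grid_search(
    counts, ref, windows=[1, 3, 5], thresholds=[5, 10, 20, 30]
)
```

In this toy case the single-epoch scorer (window 1) always misclassifies the brief activity burst inside the sleep period, while a three-epoch window smooths it out, which mirrors the abstract's point about brief wake episodes fragmenting sleep predictions.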
bioinformatics 2026-04-09 v1
gbdraw: a genome diagram generator for microbes and organelles
Kawato, S.
Abstract
Motivation: Generating graphical diagrams of microbial and organellar genomes is a common and essential task in bioinformatics. Existing tools often present a trade-off: powerful programming libraries require coding skills, while graphical applications require server processing or local installation with complex dependencies. This highlights the need for a tool that offers both programmatic control for batch processing and graphical accessibility for ease of use. Results: To fill this gap, I developed gbdraw, a web application that generates circular and linear genome diagrams from self-contained GenBank or DDBJ files or combinations of GFF3 annotation and FASTA sequence files. Its core functions include visualizing annotated features, plotting GC content/skew tracks, and optionally generating pairwise sequence comparisons for comparative genomics. It is available as both a GUI web application and a command-line utility. Unlike existing web-based tools that require data upload to a remote server, gbdraw operates entirely within the user's web browser. This serverless architecture ensures that sensitive sequence data never leaves the local machine, providing a secure environment for visualizing unpublished genomic data. Availability and Implementation: gbdraw is implemented in Python 3 (version 3.10+) and is freely available under the MIT license. The web app is available at https://gbdraw.app/. Source code and documentation are available at https://github.com/satoshikawato/gbdraw. The local version can be installed from the Bioconda channel using a conda-compatible package manager.
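For readers unfamiliar with the GC content/skew tracks mentioned above, here is a minimal sketch of how such tracks are conventionally computed over sliding windows: GC content is (G + C) / window length and GC skew is (G - C) / (G + C). This is not gbdraw's code or API; the function name, window size, step, and sequence are assumptions for illustration.

```python
def gc_tracks(seq, window, step):
    """Return (start, gc_content, gc_skew) for each full sliding window.

    gc_content = (G + C) / window
    gc_skew    = (G - C) / (G + C), defined here as 0.0 when the window has no G or C.
    """
    seq = seq.upper()
    tracks = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        content = (g + c) / window
        skew = (g - c) / (g + c) if g + c else 0.0
        tracks.append((start, content, skew))
    return tracks


# Three non-overlapping windows: G-rich, AT-only, C-rich.
tracks = gc_tracks("GGGCATATCCCG", window=4, step=4)
```

The sign of the skew flips between the G-rich and C-rich windows, which is the feature such tracks make visible along a genome diagram.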
bioinformatics 2026-04-09 v1
GMIP-PLSR: A Nextflow Pipeline for GWAS and Multi-Omics Integration in Gene Prioritization Using PLSR
Kanchwala, M. S.; Xing, C.; Xuan, Z.
Abstract
Genome-wide association studies (GWAS) have significantly advanced our understanding of complex traits and diseases, but their interpretive power remains limited due to challenges in identifying causal genes and pathways. Integrating GWAS with multi-omics data, such as gene expression, protein-protein interactions, and gene-pathway networks, has the potential to enhance biological insights and improve gene prioritization. To realize this potential, we developed the GWAS & Multi-omics Integration Pipeline (GMIP), a flexible and scalable framework that incorporates widely used tools such as PoPS, MAGMA, and benchmarker to enrich GWAS findings. However, PoPS suffers from multicollinearity in its features, which can impact performance. To overcome this, we introduce GMIP-PLSR, an extension of GMIP that uses Partial Least Squares Regression (PLSR) to manage multicollinearity effectively. We applied GMIP-PLSR across multiple GWAS datasets, demonstrating superior performance over PoPS in most cases. In a case study on NAFLD, GMIP-PLSR, using features derived from both disease-specific scRNA-seq and general PoPS features, identified gene sets with higher heritability and stronger enrichment in known NAFLD pathways, confirming its ability to enhance GWAS findings. Built on Nextflow, GMIP is computationally efficient, adaptable to diverse research environments, and provides a robust solution for gene reprioritization in post-GWAS analyses. GMIP-PLSR is available at https://github.com/mohammedmsk/GMIP.
bioinformatics 2026-04-09 v1
Spectral Graph Features for Reference-free RNA 3D Quality Assessment
Zhu, Y.; Zhang, H.; Calhoun, V. D.; Bi, Y.
Abstract
Motivation: Existing RNA 3D structure quality assessment (QA) methods rely on local geometric descriptors or statistical potentials that evaluate atomic-level contacts but are blind to global topological coherence. This creates a critical failure mode, structures that are "locally correct but globally wrong", where well-formed local helices mask misplaced domains and incorrect overall packing. Results: We introduce SpecRNA-QA, a lightweight method that scores RNA 3D models using multi-scale spectral features derived from the graph Laplacian of inter-nucleotide contact networks. By computing eigenvalue distributions, heat-kernel traces, and spectral entropy across four distance scales with binary and Gaussian kernels, SpecRNA-QA captures global structural coherence inaccessible to conventional descriptors. In leave-one-out cross-validation on CASP16 (42 targets, 7368 models), spectral features achieve median per-target Spearman rho = 0.69 [95% CI: 0.64-0.73], significantly outperforming an internal geometry baseline (rho = 0.47, Delta_rho = +0.22, Wilcoxon p = 1.2 × 10^-10). Compared against established unsupervised statistical potentials (which require no labeled data, unlike the supervised spectral model), rsRNASP outperforms on small-to-medium RNAs (rho = 0.67 vs. 0.57, ≤200 nt). However, rsRNASP times out on most large RNAs (>200 nt), where SpecRNA-QA provides the strongest available quality signal (rho = 0.72 vs. DFIRE 0.52), revealing clear complementarity between global-topological and local-energy scoring. A training-free heuristic using only three spectral statistics enables quality estimation without any labeled data.
bioinformatics 2026-04-09 v1
Quantifying Scientific Consensus in Biomedical Hypotheses via LLM-Assisted Literature Screening
Kim, U.; Kwon, O.; Lee, D.
Abstract
Systematic literature reviews are labor-intensive tasks in biomedical research. While Large Language Models (LLMs) using Retrieval-Augmented Generation (RAG) techniques have enhanced information accessibility, the inherent complexity of biological systems, characterized by high context dependency and conflicting data, remains a primary driver of LLM hallucinations. This imposes a structural constraint that limits the precision of evidence synthesis. To address these limitations, we propose an automated framework designed for the exhaustive identification of supporting and contradictory evidence within a target literature set. Rather than relying on a model's pre-trained knowledge, our system requires the LLM to review each paper individually to determine its alignment with a specific research hypothesis. By evaluating semantic context, the framework captures subtle contradictions that are often overgeneralized by conventional methods. The framework's performance was validated using the BioNLI task, where it demonstrated high classification accuracy in distinguishing whether evidence supports or contradicts a given hypothesis. Notably, the implementation of an ensemble approach provided superior stability and slightly higher precision compared to individual models. Furthermore, the framework exhibited robust performance across several well-established biological hypotheses, confirming its practical utility and reliability in real-world research. This approach provides a rigorous basis for biomedical discovery by enabling the precise, systematic analysis of biological literature and the robust collection of evidence.
bioinformatics 2026-04-09 v1
Germline VCF Annotator: a lightweight pipeline for processing germline VCFs with robust variant extraction and read evidence quality control
Manojlovic, Z.
Abstract
Raw variant calls are typically distributed as VCF files and are not well-suited for direct human review. They are intended for programmatic parsing, and spreadsheet import can distort data through automatic type conversion. Furthermore, variants in VCF are commonly annotated to add gene context and predicted functional consequences. Ensembl VEP, a widely used standard for transcript-aware variant annotation, was adapted in this study to generate standardized consequence fields across genomic features. Using a colon crypt whole-genome sequencing cohort as the motivating dataset, this study examined whether variation at DNA damage response and repair (DDR) loci could contribute to mutation-burden patterns in normal colon crypts, including patterns associated with age and potential treatment-related exposure. To make this question testable in a reproducible table-based format, the Germline VCF Annotator was developed as a two-step workflow that normalizes germline VCFs, generates VEP tabular annotations with explicit allele fields, and then extracts variants of interest and appends read-evidence metrics to assign a rules-based QC class. Within-patient concordance across technical repeats at predefined DDR loci was near-perfect after filtering for nonsilent SNVs with read depth ≥ 15, with discordance concentrated among Low-QC loci. Bulk and crypt-derived samples showed no age-related trend in DDR burden. Although the demonstration centers on DDR and aging, the Germline VCF Annotator is applicable to other gene sets that require human-readable locus-level summaries with retained allele provenance and read evidence.
bioinformatics 2026-04-09 v1
Near perfect identification of half sibling versus niece/nephew avuncular pairs without pedigree information or genotyped relatives
Sapin, E.; Kelly, K.; Keller, M. C.
Abstract
Motivation: Large-scale genomic biobanks contain thousands of second-degree relatives with missing pedigree metadata. Accurately distinguishing half-sibling (HS) from niece/nephew-avuncular (N/A) pairs, both of which share approximately 25% of the genome, remains a significant challenge. Current SNP-based methods rely on Identical-By-Descent (IBD) segment counts and age differences, but substantial distributional overlap leads to high misclassification rates. There is a critical need for a scalable, genotype-only method that can resolve these "half-degree" ambiguities without requiring observed pedigrees or extensive relative information. Results: We present a novel computational framework that achieves near-complete separation of HS and N/A pairs using only genotype data. Our approach utilizes across-chromosome phasing to derive haplotype-level sharing features that summarize how IBD is distributed across parental homologues. By modeling these features with a Gaussian mixture model (GMM), we demonstrate near-perfect classification accuracy (> 98%) in biobank-scale data. Furthermore, we show that these high-confidence relationship labels can serve as long-range phasing anchors, providing structural constraints that improve the accuracy of across-chromosome homologue assignment. This method provides a robust, scalable solution for pedigree reconstruction and the control of cryptic relatedness in large-scale genomic studies.
bioinformatics 2026-04-08 v6
Local and Global Patterns Support Medical Imaging as a Biomarker of Ageing
Mueller, T. T.; Starck, S.; Llalloshi, R.; Kaissis, G.; Ziller, A.; Graf, R.; Schlett, C.; Ringhof, S.; Bamberg, F.; Wielpuetz, M.; Völzke, H.; Leitzmann, M.; Niendorf, T.; Keil, T.; Krist, L.; Pischon, T.; Karch, A.; Berger, K.; Kirschke, J.; Rueckert, D.; Braren, R.
Abstract
Background: Understanding human ageing across multiple organs is essential for characterising individual health trajectories and identifying abnormal ageing processes. Multi-organ imaging provides an opportunity to quantify biological ageing beyond chronological age. The aim of this study is to assess organ-specific and whole-body ageing patterns and their associations with disease and lifestyle factors. Methods: In this large-scale study, we evaluate biological ageing patterns using 70,000 MRI scans from the UK Biobank and the German National Cohort. We employ 3D ResNet-18 models to predict chronological age from various body regions (brain, heart, liver, spine, lungs, muscle, and intestine) and the whole body. From these predictions, we derive age gaps relative to a strictly healthy reference cohort, which enables the identification of accelerated ageing patterns. We then evaluate associations with chronic diseases and lifestyle factors, and a virtual ageing framework was developed to explore counterfactual scenarios by substituting anatomical regions across subjects, quantifying local impacts on global biological age. Results: Here we show significant associations between detected accelerated ageing and specific chronic diseases, including multiple sclerosis and chronic obstructive pulmonary disease, as well as lifestyle factors such as smoking and physical activity. Virtual substitution of anatomical regions demonstrates that local substitutions can influence global ageing patterns. Conclusions: This study demonstrates that multi-organ imaging enables the detection of abnormal ageing patterns at both local and global levels. The presented framework provides a foundation for improved risk stratification and supports the development of personalised approaches to health assessment and disease prevention.
bioinformatics 2026-04-08 v3
A longitudinal data framework for context-specific genotype-to-phenotype mapping
Veith, T.; Beck, R. J.; Tagal, V.; Li, T.; Alahmari, S.; Cole, J.; Hannaby, D.; Kyei, J.; Yu, X.; Maksin, K.; Schultz, A.; Lee, H.; Diaz, A.; Lupo, J.; El Naqa, I.; Eschrich, S. A.; Ji, H.; Andor, N.
Abstract
Molecular assays can resolve clonal structure, but they are expensive and typically sparse in time, whereas phenotypic observations such as imaging can be collected frequently but often are not preserved in the context needed for later interpretation. We present CLONEID, an event-based framework for organizing clone-resolved phenotypic, molecular, and specimen-context records so that genotype-to-phenotype interpretation can be maintained across time. CLONEID links time-stamped Events, assay-specific Perspectives, and reconciled Identities through structured ingestion, provenance-aware retrieval, and reproducible export, complementing upstream clone-calling methods. In a long-term gastric cancer density-selection experiment, CLONEID linked repeated culture events, growth measurements, and late karyotypic profiling within a shared record, supporting longitudinal interpretation of phenotypic adaptation together with underlying chromosomal state.
bioinformatics 2026-04-08 v3
TPCAV: Interpreting deep learning genomics models via concept attribution
Yang, J.; Mahony, S.
Abstract
Interpreting genomics deep learning models remains challenging. Existing feature attribution methods are largely restricted to one-hot DNA inputs and therefore cannot assess the influence of more general genomic features such as chromatin states or genomic repeats. Concept attribution methods offer an input-agnostic global interpretation framework, yet they have not been systematically applied to interpret neural network applications in genomics. We present the first application of concept attribution to interpret genomics deep learning models by adapting the Testing with Concept Activation Vectors (TCAV) method. We improve upon the original TCAV method by incorporating a PCA-based decorrelation transformation to address correlated and redundant embedding features commonly observed in genomics deep learning models, resulting in the Testing with PCA-projected Concept Activation Vectors (TPCAV) approach. We also introduce a strategy for extracting concept-specific input attribution maps. We evaluate our approach by interpreting influential biological concepts across a diverse set of genomics models spanning multiple input representations and prediction tasks. We demonstrate that TPCAV provides comparable motif feature interpretation to TF-MoDISco on one-hot encoded DNA-based transcription factor binding prediction models. TPCAV also enables robust interpretive analysis of how more general biological concepts such as repetitive elements and chromatin state annotations contribute towards predictions. TPCAV uniquely generalizes to interpret features learned by tokenized foundation models as well as models incorporating chromatin signals as inputs. We further show that TPCAV can identify representative regions associated with specific concepts, motivating downstream investigation of distinct regulatory mechanisms. TPCAV provides a flexible and robust complement to existing model interpretation techniques.
bioinformatics 2026-04-08 v3
A mathematical model for inflammation and demyelination in multiple sclerosis
Jenner, A. L.; Weatherley, G. R.; Frascoli, F.
Abstract
Multiple sclerosis (MS) is an incurable life-long disease caused by the demyelination of neurons in the brain and spine. MS is often characterised by relapses of inflammation and demyelination that are followed by periods of remission. Symptoms can be highly debilitating and there are still many open questions about the origin and progression of the disease. Mathematical modelling is well-placed to capture the dynamics of MS and provide insight into disease aetiology. In this work, we present a minimal model for MS disease onset and progression driven by inflammation and demyelination. The model dynamics are capable of describing a typical evolution of the illness, with changes from a healthy state to a diseased scenario captured by certain ranges of parameter values. Our model also describes the non-uniform oscillatory nature of the disease, born from a Hopf bifurcation due to the strength of the inflammatory response. In particular, using experimental data for Contrast Enhancing Lesions obtained from MS patients, we are able to reproduce some of the typical relapsing-remitting behaviours of this disease. We hope that the model presented here can serve as a baseline for more complex approaches and as a tool to predict possible evolutions of the disease.
bioinformatics 2026-04-08 v2
Genetic demultiplexing and transcript start site identification from nanopore sequencing of 10x Genomics multiome libraries
Mears, J.; Orchard, P.; Varshney, A.; Bose, M. L.; Robertson, C. C.; Piper, M.; Pashos, E.; Dolgachev, V.; Manickam, N.; Jean, P.; Kitzman, D. W.; Fauman, E.; Damilano, F.; Roth Flach, R. J.; Nicklas, B.; Parker, S. C.
Abstract
Short-read Illumina sequencing of 10x Genomics single-nucleus multiome libraries captures only the 3' end of RNA transcripts, losing transcription start site (TSS) information. Here we demonstrate nanopore sequencing of 10x multiome libraries, which enables the profiling of full length transcripts. We show concordance with common short-read sequencing based workflows including successful genetic demultiplexing of nanopore data despite its higher error rate. We compare TSS identified using nanopore sequencing of multiome cDNA to those identified using a short-read 5' assay, and provide an optimized approach for the preprocessing of nanopore reads prior to TSS identification. We find that nanopore sequencing of multiome cDNA captures a median of 63% of the TSS detected by the 5' assay.
bioinformatics 2026-04-08 v2
Horse, not zebra: accounting for lineage abundance in maximum likelihood phylogenetics
De Maio, N.
Abstract
Maximum likelihood phylogenetic methods are popular approaches for estimating evolutionary histories from genome data. These methods do not make prior assumptions regarding the strategies used for deciding which genomes were sequenced. However, in genomic epidemiology the sequencing rate is often agnostic to the specific pathogen strain considered. In this scenario, a pathogen strain's prevalence should be reflected in its relative abundance in the genome data. Here, I show that this simple assumption, when appropriate and incorporated within maximum likelihood phylogenetics, greatly improves the accuracy of phylogenetic inference. I introduce and assess two separate approaches to achieve this. The first approach rescales the likelihood of a phylogenetic tree by the number of distinct binary topologies obtainable by arbitrarily resolving multifurcations in the tree. This approach interprets multifurcations as the result of a lack of signal for resolving a bifurcating topology rather than as an instantaneous multifurcating event. The second approach instead includes a tree prior that assumes that genomes are sequenced at a rate proportional to their abundance. Both approaches favor phylogenetic placement at abundant lineages, and dramatically improve the accuracy of phylogenetic inference in scenarios like SARS-CoV-2 phylogenetics, where large multifurcations are common. This considerable impact is also observed in real pandemic-scale SARS-CoV-2 genome data, where accounting for lineage prevalence reduces phylogenetic uncertainty by around one order of magnitude. Both approaches were implemented in the open source phylogenetic software MAPLE v0.7.5.4 (https://github.com/NicolaDM/MAPLE).
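As a rough illustration of the first approach described in this abstract, the number of binary topologies obtainable by resolving multifurcations can be counted node by node. The sketch below assumes rooted trees, where a node with k children can be resolved in (2k-3)!! ways, and uses a hypothetical nested-tuple tree representation. The abstract does not specify how MAPLE counts topologies or applies the rescaling, so this is not its implementation.

```python
import math

def rooted_resolutions(k):
    """Number of rooted binary topologies on k labelled subtrees: (2k-3)!!."""
    result = 1
    n = 2 * k - 3
    while n > 1:          # multiply the odd numbers 3, 5, ..., 2k-3
        result *= n
        n -= 2
    return result

def count_binary_resolutions(tree):
    """Count binary resolutions of a rooted tree.

    Hypothetical representation: a leaf is a string, an internal node is a
    tuple of child subtrees. Counts multiply across nodes because each
    multifurcation is resolved independently of the others.
    """
    if isinstance(tree, str):
        return 1
    total = rooted_resolutions(len(tree))
    for child in tree:
        total *= count_binary_resolutions(child)
    return total

# Root with three children; the first child is itself a trifurcation.
example = (("A", "B", "C"), "D", ("E", "F"))
n_topologies = count_binary_resolutions(example)
# On the log scale, a rescaling term would be based on log(n_topologies).
log_n_topologies = math.log(n_topologies)
```

In this example the root and the first child each contribute a factor of 3 and the bifurcating node contributes 1, so larger multifurcations contribute rapidly growing factors.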
bioinformatics 2026-04-08 v2
GAP-MS: Automated validation of gene predictions using integrated mass spectrometry evidence
Abbas, Q.; Wilhelm, M.; Kuster, B.; Frischman, D.
Abstract
Accurate genome annotation is fundamental to modern biology, yet distinguishing authentic protein-coding sequences from prediction artifacts remains challenging, particularly in complex plant genomes where automated methods are error-prone and manual curation is rarely feasible due to prohibitive time and costs. Here, we present GAP-MS (Gene model Assessment using Peptides from Mass Spectrometry), an automated proteogenomic pipeline that leverages mass spectrometry evidence to systematically validate the protein-level accuracy of predicted gene models. Applied across 9 major crop species, GAP-MS consistently improved prediction precision for four widely used gene prediction tools. In addition to filtering erroneous models, the pipeline identified hundreds of previously missing gene models from current standard reference annotations. These peptide-supported loci were further verified by transcriptional evidence, well-supported functional annotations, and high coding-potential scores. Together, these results demonstrate that direct proteomic evidence provides a robust framework for resolving annotation ambiguities, defining high-confidence reference proteomes, and uncovering overlooked protein-coding genes, while facilitating the identification of sequences that may require further investigation.
bioinformatics 2026-04-08 v2
Reconstructing biologically coherent cellular profiles from imaging-based spatial transcriptomics
Yuan, L.; Zheng, Y.; Zhang, S.; Beroukhim, R.; Deshpande, A.
Abstract
In imaging-based spatial transcriptomics, transcript-to-cell assignment shapes downstream biological interpretation including cell typing, ligand-receptor inference, and niche characterization. However, two-dimensional segmentation of volumetric tissue often yields mixed cellular profiles, while cells without detected nuclei are missed entirely, distorting the aforementioned downstream analyses. We present TRACER, which refines cellular representations in imaging-based transcriptomics by leveraging gene-gene coherence and spatial co-localization of transcripts observed directly in the data, without requiring external annotations or reference atlases. TRACER resolves mixed cellular profiles and reconstructs partial cells whose nuclei are not detected, enabling more complete representation of cells within the tissue section. We also introduce coherence-based metrics that quantify transcriptional purity and conflict, enabling platform-agnostic benchmarking of segmentation quality. Across diverse platforms, tissues, and segmentation methodologies, TRACER consistently and reproducibly improves the coherence of cellular profiles and the quality of downstream analyses.
bioinformatics 2026-04-08 v2
Sampling protein structural token space enables accurate prediction of multiple conformations
Wang, Z.; Yu, Y.; Yu, C.; Bu, D.
Abstract
Protein function is fundamentally mediated by ensembles of distinct metastable states. However, existing methods, such as AlphaFold 3, typically exhibit a bias toward predicting a single dominant state, failing to capture alternative conformations or provide robust metrics for identifying high-quality multi-state conformations. Here, we present MultiStateFold (MSFold), a framework that integrates Parallel Tempering into the discrete structure token space of the ESM3 protein language model. By conceptualizing the model's latent space as an implicit energy landscape, MSFold enables global exploration and barrier crossing, thereby overcoming the local sampling limitations inherent in base generative models. Across a benchmark of 313 multi-conformation pairs, MSFold sets a new performance standard: it achieves the highest success rate in modeling native states and substantially outperforms leading methods, including AlphaFold 3, on challenging alternative conformations, while maintaining competitive accuracy for primary structures. Furthermore, we propose Sequence Log-Likelihood (SLL), a novel confidence metric derived from sequence-structure consistency. Our results demonstrate that SLL offers a modest improvement over standard metrics such as pTM and pLDDT. This work establishes a new paradigm for conformational sampling, bridging classical statistical physics with protein language models.
bioinformatics 2026-04-08 v2
A geometric criterion links HIV-1 capsid topography to its biophysical properties and function
Li, W.; Peeples, C. A.; Rey, J. S.; Perilla, J. R.; Twarock, R.
Abstract
Mathematical models of virus capsid structure are pillars of modern virology, aiding the understanding of viral mechanisms and the design of antiviral interventions. Traditionally, the HIV-1 capsid core geometry is represented as a fullerene lattice, akin to the icosahedral models of spherical viruses in Caspar-Klug theory. However, recent studies revealed that many viral capsids deviate from such idealised lattices, with important functional implications. Here we demonstrate that this is also the case for the conical HIV-1 core geometries, in which the hexamer and pentamer boundaries form a pseudo-tiling rather than a perfectly aligned fullerene network. We introduce a triangular geometric criterion that quantifies local deviations of an HIV-1 atomic model from its idealised fullerene backbone. Using this criterion, we demonstrate that this difference in geometric organisation between idealised (fullerene) and actual (data-derived) capsid models has implications for the capsid's biophysical properties. We also discuss the use of the geometric criterion as a predictive tool regarding cofactor binding and implied geometric changes in the capsid surface coupled to the interfacial frustration response. Our results establish a quantitative framework linking capsid geometry, curvature, and biophysical function, offering new perspectives for assembly inhibitor design and lentiviral vector engineering.
bioinformatics 2026-04-08 v2
Spatially Anchored Regulatory State Inference in Melanoma
Dwarampudi, J. M. R.; Kochat, V.; Satpati, S.; Mahmud, M. I.; Anzum, H.; Wani, K.; Lazar, A.; Saw, A. K.; Malke, J.; Nguyen, H. V.; Rai, K.; Banerjee, T.
Abstract
Spatial transcriptomics (ST) captures gene expression within tissue architecture but lacks direct regulatory information, while single-cell multiome assays profile transcriptional and chromatin states without spatial context. We present a framework for spatially anchored regulatory inference that integrates Visium ST with single-cell multiome data to infer spatially resolved regulatory programs. Building upon GraphST, we introduce spatially regularized cell-to-spot mapping and propagate chromatin accessibility and transcription factor motif activity into tissue space. Regulatory analysis is performed at the spatial domain level via joint differential expression and accessibility testing, along with quantitative concordance assessment. Applied to melanoma tissue sections, the framework reveals spatially localized regulatory programs and shows that assignment strategy substantially affects downstream regulatory stability. This modular approach enables interpretable gene-, peak-, and transcription factor-level outputs for multimodal spatial analysis.
bioinformatics 2026-04-08 v1
UBL3 UBL domain exhibits distinct helix-centered dynamic control among ubiquitin-like proteins
Matsuda, K.; Moriya, Y.; Xu, L.; Ohmagari, R.; Aramaki, S.; Zhang, C.; Baba, A.; Hirayama, S.; Kahyo, T.; Setou, M.
Abstract
Ubiquitin-like protein 3 (UBL3) is a post-translational modifier that sorts proteins into small extracellular vesicles and regulates the trafficking of disease-associated proteins such as α-synuclein. The structural and dynamic features of the UBL domain that underlie these functions, however, remain poorly understood. Here we performed in silico structural dynamics analysis of the UBL3 UBL domain using an NMR structure ensemble combined with anisotropic network modeling (ANM) and perturbation response scanning (PRS). Principal component analysis and residue-wise fluctuation analysis consistently revealed high flexibility in the C-terminal region of UBL3. Comparative ANM analysis across 20 ubiquitin-like proteins (UBLs) further showed that C-terminal flexibility is a conserved yet variable property within the UBL family. PRS analysis demonstrated that residues forming the central α-helix of the β-grasp fold exert greater dynamic control over collective motions than β-sheet residues. Notably, UBL3 displayed the highest helix/sheet PRS effectiveness ratio among all UBLs analyzed, highlighting the prominent dynamic contribution of helix residues in this domain. Together, these results provide a structural basis for understanding UBL3-dependent protein interactions and disease-related mechanisms, and suggest that helix-centered dynamic control in the UBL domain may represent a potential target for modulating UBL3 function.
bioinformatics 2026-04-08 v1
Geometry-enhanced protein language modeling enables discovery of novel antibiotic resistance genes
Lin, X.; Guan, J.; Hong, Y.; Guo, Y.; Yang, Y.; Xie, P.; Zhao, Z.; Liu, X.; Huang, Y.; Ye, Y.; Tang, Y.; Lee, T.-Y.; Chiang, Y.-C.; Wei, L.; Liu, X.; Wang, J.; Pan, Y.; Tang, J.; Pei, Y.; Yao, L.
Abstract
The global antibiotic resistome remains largely unexplored, not because antibiotic resistance genes (ARGs) are rare in the environment, but because many are evolutionarily distant from known ARGs. Current computational approaches primarily rely on sequence homology, and thus miss distant homologues. We develop GeoARG, a geometry-enhanced framework that integrates structural features with protein language models through knowledge distillation, enabling efficient large-scale screening using sequence input alone. Across multiple benchmarks, GeoARG substantially improves the detection of remotely homologous ARGs, particularly under low sequence identity and fragmented conditions. Large-scale metagenomic analysis uncovers 1,485 high-confidence ARG candidates that are highly divergent from known ARGs, expanding the phylogenetic and functional landscape of the resistome. Structural analyses further show that these candidates preserve active-site geometry and maintain stable ligand-binding configurations consistent with known resistance mechanisms. These results demonstrate that geometric constraints enable systematic expansion of the resistome and facilitate the discovery of evolutionarily distant yet functionally conserved genes. A public web server is available at https://ycclab.cuhk.edu.cn/GeoARG/
bioinformatics 2026-04-08 v1
Geometry-aware ligand-receptor analysis distinguishes interface association from spatial localization and reveals a continuum of tumor communication
Yepes, S.
Abstract
Spatial transcriptomics enables inference of cell-cell communication through ligand-receptor (LR) interactions, but current prioritization strategies often rely on expression strength or interface-associated enrichment without explicitly modeling tissue geometry. As a result, interactions associated with population interfaces are frequently interpreted as spatially localized even when their underlying expression is broadly distributed. Here, we present a geometry-aware framework for LR prioritization that explicitly separates interface structure from spatial localization within a locked and reproducible analysis pipeline. We quantify interface-associated communication using a distance-weighted boundary score defined on a spatial neighbor graph, evaluate interface specificity using a label-permutation null model that preserves spatial geometry, and compute an LR-specific localization score that captures the proximity of ligand and receptor expression to the corresponding interface. This framework distinguishes interface-associated compatibility from interaction-level spatial concentration. Across spatial transcriptomics datasets from breast cancer, colorectal cancer, melanoma, and pancreatic ductal adenocarcinoma, interface-aware ranking consistently recovers pathway families associated with extracellular matrix, adhesion, inflammatory, and immune-related processes. However, interface enrichment frequently shows limited separation from the null model, indicating that interface structure alone does not establish spatial specificity. Incorporating geometric localization substantially alters LR prioritization, distinguishing interactions that are concentrated near interfaces from those that are more diffusely distributed. Under a fixed, deterministic pipeline applied identically across datasets without parameter tuning, discrete spatial communication regimes were not reproducibly recovered. 
Instead, variation across samples is more consistently captured as continuous differences in geometry-aware attenuation, reflecting the degree to which inferred interactions are spatially constrained by tissue architecture. Together, these results demonstrate that interface-associated enrichment and spatial localization are distinct properties of inferred LR interactions, and that accurate interpretation of spatial communication requires explicit modeling of tissue geometry. Under this framework, tumor communication is more consistently described as a continuum of spatial constraint.
bioinformatics 2026-04-08 v1
Adaptive Integration of Heterogeneous Foundation Models to Find Histologically Predictable Genes in Breast Cancer
Nguyen, H.; Li, C.; Peng, C.; Simpson, P.; Ye, N.; Nguyen, Q.
Abstract
Foundation models for computational pathology have rapidly emerged as powerful tools for extracting rich biological and morphological representations from histopathology images. However, variations in model architecture, pre-training data, and optimization objectives often lead to task-dependent performance, rather than universal generalization. As a result, effective strategies for integrating their complementary strengths are essential to fully realize the potential of foundation models for robust histopathology analysis. Meanwhile, recent breakthroughs such as spatial transcriptomics provide an unprecedented opportunity to integrate genetic and histopathology information from the same patient sample, thereby maximizing both molecular and anatomical pathology insights. Specifically, each model's embedding is first mapped to gene-level predictions via a dedicated prediction head, enabling model-specific feature utilization. A lightweight weighting network then adaptively aggregates these predictions to produce a unified and robust output at gene and spatial location levels. Across multiple spatial transcriptomics datasets, our approach consistently outperforms both individual foundation models and classical ensembling methods. Focusing on breast cancer, we observe substantial gains in prediction accuracy for clinically relevant PAM50 subtype markers and drug-target genes. Moreover, the proposed framework improves interpretability by revealing model-specific contributions and specialization at the gene level. Overall, our work presents an effective solution to integrating multiple foundation models for enhancing the genetic analyses of histopathology images.
bioinformatics 2026-04-08 v1
Analysis of multicellular anatomical structures from spatial omics data using sosta
Gunz, S.; Crowell, H. L.; Robinson, M. D.
Abstract
Spatial omics technologies enable high-resolution, large-scale quantification of molecular features while preserving the spatial context within tissues. Existing analysis methods largely focus on spatial arrangements of single cells, whereas biological function often emerges from multicellular arrangements. Here, we introduce structure-based analysis of spatial omics data, which focuses on the direct analysis of multicellular, anatomical structures. We illustrate this type of analysis using two publicly available datasets and provide sosta, an open-source Bioconductor package for broad community use.
bioinformatics 2026-04-07 v3
Integrative AlphaFold Modeling, Fragment Mapping, and Microsecond Molecular Dynamics Reveal Ligand-Specific Structural Plasticity at the Human Urotensin II Receptor
Torbey, A. G.
Abstract
The peptide ligands urotensin II (hUII, human) and hUII-related peptide (URP), together with their cognate human receptor (hUT), are implicated in cardiovascular pathophysiology, yet the lack of experimentally resolved hUT structures has limited a deep mechanistic understanding of ligand binding and receptor activation. Here, we leverage recent breakthroughs in multistate AlphaFold predictions, long-timescale molecular dynamics (MD) simulations, and site identification by ligand competitive saturation (SILCS)-based pocket mapping, together with solving the ligand-bound conformations, to illuminate the dynamic interaction of hUII and URP with hUT. By analyzing hUT dynamics in its intracellular transducer binding pocket, and residue-level interaction probabilities in each simulation, we capture subtle distinctions in the way hUII and URP anchor key pocket residues and modulate transmembrane (TM) domain tilts. Results indicate that hUII imposes stronger conformational constraints on TM5 and TM6 relative to URP, with the two ligands potentially stabilizing different active-like receptor configurations. At the same time, interaction maps highlight unique aromatic and polar networks that each ligand exploits. These findings reinforce the concept that relatively small differences in GPCR peptide ligand structure may lead to large effects on receptor-state selection and signal specificity, ultimately reflecting different clinical outcomes. By integrating computational modeling with per-residue dynamics, this work not only reconciles prior mutagenesis and docking data but also provides validated 3D models and MD simulations of the endogenous ligands bound to hUT, offering new opportunities to selectively harness ligand-dependent signaling in the urotensinergic system.
bioinformatics 2026-04-07 v1
Flow molecular dynamics simulations reveal mechanosensitive regulation of von Willebrand factor through glycan-modulated autoinhibitory modules
Richard Louis, N. E. L.; Zhao, Y. C.; Ju, L. A.
Abstract
Force-induced protein conformational changes govern many essential biological processes, yet their molecular mechanisms remain difficult to resolve. Von Willebrand factor (VWF), a central regulator of haemostasis, is activated by hydrodynamic forces in blood flow, but how mechanical signals propagate across its multidomain architecture is poorly understood. Here, we use flow molecular dynamics (FMD), a simulation framework that applies fluid forces via controlled solvent flow to interrogate mechanosensitive proteins. Using VWF as a model system, we reconstructed the complete mechanomodule (DD3A1A2A3; 1,109 residues) with native glycosylation by integrating crystallographic data and AlphaFold predictions. FMD simulations capture a force-driven transition from a compact, autoinhibited bird-nest ensemble to an extended, activated state, revealing asymmetric autoinhibitory strengths within the NAIM and CAIM modules of the A1 domain. By directly linking static structures to dynamic, force-regulated behaviour, this work establishes a generalizable platform for dissecting protein mechanosensitivity and enabling the rational design of force-responsive therapeutics.
bioinformatics 2026-04-07 v1
FunctionaL Assigning Sequence Homing (FLASH) maps phenotype to sequence with deep and machine learning
Cotter, D. J.; Harrison, M.-C.; Rustagi, A.; Wang, P. L.; Kokot, M.; Carey, A. F.; Deorowicz, S.; Salzman, J.
Abstract
Genome-wide association studies (GWAS) map genetic variation to a reference genome and correlate variants to phenotypes. Yet, GWAS and similar procedures have limitations, including an inability to predict phenotype on variants never seen during the discovery phase and difficulty integrating structural variants. Deep and machine learning alternatives have not been successful at consistent prediction of resistance phenotypes (Hu et al. 2024). Here, we introduce FLASH: a new interpretable, statistically-based deep learning framework that operates directly on raw sequencing reads. In over 35,000 isolates of bacteria, fungi and viruses, FLASH achieves uniformly high accuracy on independent test data, including on variation never seen in training, meeting or exceeding bespoke state of the art methods. FLASH identifies canonical drug targets ab initio and new pan-species predictors of virulence, including those lacking annotation and those only partially aligned to NCBI reference databases. Further, FLASH can predict phenotypes beyond the possibility of GWAS, such as bacterial host range of phage, a task that to our knowledge is impossible today. FLASH is simple to run, highly efficient and constitutes a new approach for predicting gene function and phenotype across the tree of life. It is especially valuable when bioethical concerns and the vast genetic complexity of pathogenic microbes limit the feasibility of experimental validation.
bioinformatics 2026-04-07 v1
Estimation of metabolite levels in cheese from microbial gene expression
Mansouri, A.; Mekuli, R.; Swennen, D.; Durazzi, F.; Remondini, D.
Abstract
Characterizing the aromas and flavours generated during cheese production is of high relevance for the food industry. A deeper comprehension of flavour generation can be achieved by understanding the role of the microbial populations governing milk processing, and in particular their metabolic activity as governed by gene expression. In this work we considered two independent experiments in which the gene expression of the microbial population involved in cheese processing was sampled, together with quantification of the final volatile products. We estimated the final volatile compound profile from the measured metatranscriptomic expression using machine learning with two different strategies for model training and validation, and we were able to associate specific biochemical pathways with the identified gene signatures.
bioinformatics 2026-04-07 v1
MitoChontrol: Adaptive mitochondrial filtering for robust single-cell RNA sequencing quality control
Strassburg, C.; Pitlor, D.; Singhi, A. D.; Gottschalk, R.; Uttam, S.
Abstract
Mitochondrial transcript abundance is a standard quality control metric in single-cell RNA sequencing, but fixed percentage thresholds fail to account for the substantial variation in mitochondrial content across cell types and tissues, risking both retention of compromised cells and exclusion of transcriptionally active viable cell populations. We present MitoChontrol, a cell-type-aware probabilistic framework for mitochondrial quality control that models the mitochondrial transcript fraction within transcriptionally coherent clusters as a Gaussian mixture distribution. Compromised-cell components are identified from the upper tail of each cluster-specific distribution, and filtering thresholds are defined as the point at which the posterior probability of cellular compromise exceeds a user-defined confidence value. Applied to controlled perturbation experiments and a pancreatic ductal adenocarcinoma single-cell dataset, MitoChontrol selectively removes transcriptionally compromised cells while preserving biologically elevated but viable populations, outperforming fixed-threshold and outlier-based approaches.
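The thresholding rule in this abstract can be sketched for a single cluster as follows. This is a minimal illustration, not the MitoChontrol implementation: it assumes a two-component one-dimensional Gaussian mixture whose parameters have already been fitted, treats the higher-mean component as the compromised one, and uses made-up parameter values and hypothetical function names throughout.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posterior_compromised(x, healthy, compromised):
    """Posterior probability that a cell with mitochondrial fraction x belongs
    to the compromised component. Each component is (weight, mean, sd)."""
    h = healthy[0] * normal_pdf(x, healthy[1], healthy[2])
    c = compromised[0] * normal_pdf(x, compromised[1], compromised[2])
    if h + c == 0.0:      # both densities underflowed; no evidence either way
        return 0.0
    return c / (h + c)

def mito_threshold(healthy, compromised, confidence=0.95, step=0.001):
    """Scan upward from the healthy mean and return the first mitochondrial
    fraction at which the posterior reaches the confidence value
    (None if it is never reached below a fraction of 1.0)."""
    n_steps = int(round((1.0 - healthy[1]) / step))
    for i in range(n_steps + 1):
        x = healthy[1] + i * step
        if posterior_compromised(x, healthy, compromised) >= confidence:
            return x
    return None

# Illustrative (assumed) mixture parameters for one cluster.
healthy = (0.8, 0.05, 0.02)       # weight, mean, sd
compromised = (0.2, 0.30, 0.08)
threshold = mito_threshold(healthy, compromised, confidence=0.95)
```

Because the mixture is cluster-specific, a cluster with a higher healthy mean would get a higher threshold under the same confidence value, which is the behaviour a fixed percentage cutoff cannot provide.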
bioinformatics 2026-04-07 v1
Accurate estimation of canine inbreeding using ultra low-coverage whole genome sequencing
Pellegrini, M.; Kim, R.; Rubbi, L.; Kislik, G.; Smith, D.
Abstract
The measurement of inbreeding has gained significance across diverse fields, including population and conservation genetics, agricultural genetics, breeding programs for animals and plants, and wildlife management. This is because inbreeding leads to increased homozygosity and results in lower genetic diversity, rendering populations more vulnerable to environmental changes, diseases, and other stressors. High or mid-coverage whole genome sequencing (WGS) has been widely used for inbreeding estimation, but it is resource-intensive. We aimed to investigate the use of ultra low-coverage whole genome sequencing (ulcWGS) as a cost-effective alternative for inbreeding analysis. Domestic dogs were used for our study as their extensive breeding histories lead to populations with a wide range of inbreeding levels. We constructed a multi-breed reference panel from high-coverage WGS samples. Inbreeding in independent ulcWGS samples was then estimated using runs of homozygosity (RoH) and inbreeding coefficients (F). We modeled the relationship between these measures and sequencing depth using nonlinear regression, to generate inbreeding estimates relative to sequencing depth. Resulting relative RoH and F measurements were significantly correlated, with purebred dogs exhibiting more runs of homozygosity and higher inbreeding coefficients compared to mixed-breed dogs. Our findings demonstrate that ulcWGS can provide reliable and economical estimations of inbreeding, expanding accessibility to genetic monitoring.
bioinformatics 2026-04-07 v1
Correlation Between Information Entropy and Functions of Gene Sequences in the Evolutionary Context: A New Way to Construct Gene Regulatory Networks from Sequence
Pan, L.; Chen, M.; Tanik, M.
Abstract
The information encoded in DNA sequences can be rigorously quantified using Shannon entropy and related measures. When placed in an evolutionary context, this quantification offers a principled yet underexplored route to constructing gene regulatory networks (GRNs) directly from sequence data. While most GRN inference methods rely exclusively on gene expression profiles, the regulatory code is ultimately written in the DNA sequence itself. Here we review the mathematical foundations of information theory as applied to gene sequences, survey existing computational methods for GRN inference with emphasis on information-theoretic and sequence-based approaches, and examine how evolutionary conservation constrains sequence entropy to preserve biological function. We then propose a four-layer integrative framework that combines per-position Shannon entropy profiles, evolutionary conservation scoring via Jensen-Shannon divergence, expression-based mutual information and transfer entropy, and DNA foundation model embeddings to construct GRNs from sequence. Through worked examples on the Escherichia coli SOS regulatory sub-network, we demonstrate how conservation-weighted mutual information improves edge discrimination and how transfer entropy resolves regulatory directionality. The framework generates testable predictions: edges supported by low-entropy regulatory regions should show higher experimental validation rates, and cross-species entropy profile conservation should predict GRN topology conservation. This work bridges three scales of biological information (nucleotide-level entropy, evolutionary constraint patterns, and network-level regulatory logic), establishing information entropy as the natural mathematical language for sequence-to-network regulatory inference.
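The first layer named in this abstract, a per-position Shannon entropy profile, can be illustrated with the standard definition H = sum over symbols of p * log2(1/p), computed column by column over aligned sequences. The sketch below is a generic textbook calculation under that assumption; the function names and the toy alignment are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (bits) of the symbols observed at one alignment column."""
    counts = Counter(column)
    total = len(column)
    # Sum p * log2(1/p) over observed symbols; absent symbols contribute 0.
    return sum((n / total) * math.log2(total / n) for n in counts.values())

def entropy_profile(aligned_sequences):
    """Per-position entropy for equal-length, pre-aligned sequences."""
    lengths = {len(s) for s in aligned_sequences}
    if len(lengths) != 1:
        raise ValueError("sequences must be aligned to the same length")
    # zip(*...) transposes the alignment so each tuple is one column.
    return [column_entropy(col) for col in zip(*aligned_sequences)]

# Toy alignment (made up): position 0 is fully conserved, position 3 is not.
toy_alignment = ["ACGT", "ACGA", "ACTC", "AGTG"]
profile = entropy_profile(toy_alignment)
```

For DNA the profile ranges from 0 bits at a fully conserved position to 2 bits when all four nucleotides are equally frequent, so low values flag candidate constrained positions.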
bioinformatics 2026-04-07 v1
DrugPlayGround: Benchmarking Large Language Models and Embeddings for Drug Discovery
Liu, T.; Jiang, S.; Zhang, F.; Sun, K.; Head-Gordon, T.; Zhao, H.
Abstract
Large language models (LLMs) are in the ascendancy for research in drug discovery, offering unprecedented opportunities to reshape drug research by accelerating hypothesis generation, optimizing candidate prioritization, and enabling more scalable and cost-effective drug discovery pipelines. However, there is currently a lack of objective assessments of LLM performance to ascertain their advantages and limitations over traditional drug discovery platforms. To tackle this emergent problem, we have developed DrugPlayGround, a framework to evaluate and benchmark LLM performance for generating meaningful text-based descriptions of physicochemical drug characteristics, drug synergism, drug-protein interactions, and the physiological response to perturbations introduced by drug molecules. Moreover, DrugPlayGround is designed to work with domain experts to provide detailed explanations for justifying the predictions of LLMs, thereby testing LLMs for chemical and biological reasoning capabilities to push their greater use at the frontier of drug discovery at all of its stages.
bioinformatics 2026-04-07 v1
A Context-Aware Single-Cell Proteomics Analysis pipeline.
Salomo Coll, C.; Makar, A. N.; Brenes, A. J.; Inns, J.; Trost, M.; Rajan, N.; Wilkinson, S.; von Kriegsheim, A.
Abstract
Single-cell proteomics (SCP) by mass spectrometry can now quantify hundreds to thousands of proteins per cell, but the field still lacks standardised analytical pipelines that accommodate the diversity of instruments, sample preparation workflows and biological contexts encountered in practice. Existing workflows, largely adapted from single-cell transcriptomics, do not account for the informative missingness, pervasive ambient protein contamination and limited feature space that distinguish proteomic from transcriptomic data. In addition, cell type annotation remains a manual bottleneck that is subjective, difficult to reproduce and hard to scale. Here we present an end-to-end pipeline that integrates adaptive quality control, entropy-guided iterative batch correction, multi-modal marker discovery that exploits detection patterns unique to proteomics, and context-aware annotation by large language models (LLMs) coupled to structured contradiction reasoning and orthogonal data-driven validation. Benchmarking on published single-cell proteomic datasets from developing human brain and glioblastoma-associated neutrophils revealed systematic LLM failure modes, including context-insensitive marker vocabulary and misinterpretation of phagocytic or lytic cell states. We addressed these errors using a three-round prompt architecture that combines general biological principles with auto-generated dataset-specific constraints. In held-out validation on a skin tumour dataset acquired, the pipeline showed high concordance with FACS-sorted ground truth. In the caerulein-injured pancreas, orthogonal immunohistochemistry further supported annotations of macrophage, stellate and immune populations. 
The pipeline is fully automated under fixed settings, and available as Context-Aware Single-Cell Proteomics Analysis (CASPA), providing SCP laboratories and facilities with a reproducible workflow that delivers interpretable, confidence-quantified annotations suitable for downstream expert review.
bioinformatics2026-04-07v1STDrug enables spatially informed personalized drug repurposing from spatial transcriptomics
Yang, Y.; Unjitwattana, T.; Zhou, S.; Kadomoto, S.; Yang, X.; Chen, T.; Karaaslanli, A.; Du, Y.; Zhang, W.; Liang, H.; Guo, X.; Keller, E. T.; Garmire, L. X.Abstract
Drug repurposing offers a scalable route to accelerate therapeutic discovery, yet existing approaches based on single-cell RNA sequencing (scRNA-seq) often overlook spatial tissue context, limiting their ability to capture microenvironment-dependent drug responses. Here we present STDrug, a spatially informed computational framework that integrates spatial transcriptomics, graph-based modeling, and multimodal learning to enable patient-specific therapeutic prioritization. STDrug identifies and aligns disease and control spatial domains using graph convolutional networks and coherent point drift, and prioritizes candidate drugs through an integrative scoring scheme combining tumor-reversible gene signatures, perturbation-based reversal scores, and knowledge-guided gene weighting within a machine learning framework. By modeling spatial domain interactions alongside predicted drug efficacy and toxicity, STDrug generates robust patient-level drug scores. Across hepatocellular carcinoma and prostate cancer datasets, STDrug outperforms existing single-cell and spatial transcriptomics-based drug repurposing methods, achieving significantly improved predictive accuracy (AUCs=0.81-0.82) across patients. Validation using large-scale electronic health records and in vitro assays further supports the translational relevance of top-ranked candidates. Taken together, STDrug establishes a generalizable framework for incorporating spatial omics into therapeutic discovery, advancing spatially informed and personalized drug repurposing.
bioinformatics2026-04-07v1Locat: Joint enrichment and depletion testing identifies localized marker genes in single-cell transcriptomics
Lewis, W. R.; Aizenbud, Y.; Strino, F.; Kluger, Y.; Parisi, F.Abstract
Several methods have been developed to identify marker genes that delineate cell populations in single-cell transcriptomic data, yet most emphasize enrichment within candidate populations without testing whether expression is significantly reduced outside those populations. We present Locat, a framework for identifying highly specific localized genes by testing whether expression is concentrated within compact regions of the cellular embedding and depleted elsewhere. For each gene, Locat fits weighted Gaussian mixture models to gene-specific and background densities, computes test statistics for concentration within compact regions and depletion outside those regions, and integrates the results into a unified localization score. Across synthetic benchmarks with controlled ground truth, Locat detects localized genes spanning uni-modal, multi-modal, and sparse expression patterns, and appropriately loses significance when simulated expression becomes indistinguishable from background structure. In biological datasets spanning developmental, perturbation, and differentiation contexts, Locat identifies compact marker sets that capture lineage organization, condition-specific programs, and temporal regulatory dynamics. Localized gene sets are often smaller than conventional feature selections such as highly variable genes, and embeddings constructed from localized gene sets tend to preserve separation of major cell populations and developmental programs. In murine dermis, embeddings computed using localized genes preserve differentiation and cell-cycle trajectories observed in the full dataset. In interferon-β-treated PBMCs, independent localization analysis of control and stimulated samples reveals stimulus-responsive programs and markers of shared immune populations without requiring batch correction or data integration. In retinoic acid-induced embryonic stem cell differentiation, localized genes exhibit reproducible stage-specific patterns across time points.
Together, these results demonstrate that jointly assessing concentration and depletion yields specific, interpretable marker genes that enable direct cross-condition and multi-sample comparisons across diverse biological settings.
bioinformatics2026-04-07v1REBEL, Reproducible Environment Builder for Explicit Library resolution
Martelli, E.; Ratto, M. L.; Nuvolari, B.; Arigoni, M.; Tao, J.; Micocci, F. M. A.; Alessandri, L.Abstract
Background: Achieving FAIR-compliant computational research in bioinformatics is systematically undermined by two compounding challenges that existing tools leave unresolved: long-term reproducibility and accessibility. Standard package managers re-download dependencies from live repositories at every build, making environments vulnerable to library disappearance and version drift, and pinning a package version does not pin the versions of its transitive dependencies, causing divergences between builds performed at different points in time. Compounding this, packages from repositories such as CRAN, Bioconductor, and PyPI frequently omit critical system-level dependencies from their installation metadata, leaving users to manually discover which underlying library is missing or which version is required. Beyond these technical failures, constructing a truly reproducible environment demands expertise in containerization, making reproducibility in practice a privilege rather than a standard. Findings: We present REBEL (Reproducible Environment Builder for Explicit Library Resolution), a framework that addresses both challenges through three dependency inference heuristics: (i) Deep Inspection of source code, (ii) Fuzzy Matching against a manually curated knowledge base, and (iii) Conservative Dependency Locking. The resolved dependency stack is then archived into a self-contained local store, enabling offline and deterministic rebuilds at any future time. We compared the installation of 1,000 randomly sampled CRAN packages in isolated Docker containers using the standard package manager and using REBEL: REBEL resolved 149 of the 328 standard installation failures (45.4%). Moreover, through its DockerBuilder component, REBEL generates fully reproducible Docker images from a plain text requirements file, making deterministic environment construction accessible without expertise in containerization.
Conclusions: REBEL provides a practical foundation for FAIR-compliant, long-term reproducible bioinformatics analyses, making deterministic environment construction accessible to researchers regardless of their technical background. REBEL is freely available at https://github.com/Rebel-Project-Core. Keywords: reproducibility, bioinformatics, dependency resolution, Docker, FAIR, software environments, package management
bioinformatics2026-04-07v1Representation Methods of Transcriptomics with Applications in Neuroimmune Biology
Abbasi, M.; Ochoa Zermeno, S.; Spendlove, M. D.; Tashi, Z.; Plaisier, C. L.; Bartelle, B. B.Abstract
Interpretable representations of gene expression are used to define cellular identities and the molecular programs active within cells, two related but distinct phenomena. In the case of microglia, a cell type with high transcriptomic, functional, and morphological heterogeneity, the predominant representation of transcriptomic data presumes the adoption of distinct molecular identities, despite a lack of easily separable transcriptional states. Here, we explore alternative transcriptomic representations by comparing two single-cell analysis methods: differential expression analysis for identities and co-expression network analysis for molecular programs. For microglia, co-expression network analysis identifies highly significant functional ontologies not resolved by differential expression analysis. The identified co-expression modules are preserved across transcriptomic datasets and suggest reducible functional programs that activate and modulate depending on context. We conclude that co-expression analysis constitutes a best practice for single-cell analysis of an individual cell type, and that describing microglia function as concurrent molecular programs offers a more parsimonious model of microglia function.
bioinformatics2026-04-07v1Domain classification of archaeal proteomes reveals conserved fold repertoire
Schaeffer, R. D.; Pei, J.; Guo, R.; Zhang, J.; Medvedev, K.; Cong, Q.; Grishin, N.Abstract
Archaea represent one of the three domains of cellular life and yet account for fewer than 1% of experimentally determined protein structures, leaving the extent of their structural novelty unknown. Here we present a systematic domain-level classification of 124,075 proteins from 65 archaeal classes spanning 21 phyla and all major lineages, using both AFDB and newly predicted AlphaFold3 structures classified against the Evolutionary Classification of protein Domains (ECOD). We assigned 204,758 domains, of which 76.8% received high-confidence classifications, spanning 987 ECOD X-groups, or 40% of known structural diversity, within a single domain of life. Clustering by Foldseek recovered structural relationships for 63% of domains that are singletons by sequence comparison. To characterize the 21% of proteins lacking high-confidence classification, we applied successive filters for structure prediction confidence, protein length, and structural cluster context, reducing 8,452 domain-free proteins to a small number of well-folded structural orphans (less than 0.1% of the dataset). The unclassified fraction is dominated by sub-threshold matches to known folds (14% of all proteins) and low-confidence structure predictions (5%), not by novel structures. These results demonstrate that the protein fold repertoire at the single-domain level is broadly conserved across the deepest phylogenetic distances in cellular life, and that the gap between archaeal and well-characterized proteomes reflects classification sensitivity for divergent sequences rather than unexplored structural diversity.
bioinformatics2026-04-06v1Multimodal Fusion of Circular Functional Data on High-resolution Neuroretinal Phenotypes
Pyne, S.; Wainwright, B.; Ali, M. H.; Lee, H.; Ray, M. S.; Senthil, S.; Jammalamadaka, S. R.Abstract
Progressive optic neuropathies, particularly glaucoma, represent a significant global health challenge, and the need for precise understanding of the heterogeneous neurodegenerative phenotypes cannot be overstated. Here, we brought together two complementary sources of unstructured yet clinically relevant information about neuroretinal rim (NRR) thinning, a common clinical marker of such decay. These are based on a new dataset of digital fundus images and a corresponding one of optical coherence tomography, both collected from a large clinical cohort of healthy eyes. First, we represented them using a common data structure that imposed a high-resolution scale of 180 equally spaced and registered measurements on a 360° circular axis. We modeled such NRR data-points of each eye as circular curves, and aligned these multimodal curves to obtain a fused NRR curve for each eye. Unsupervised clustering of these fused curves identified 4 clusters of eyes with structural heterogeneity, which were also found to have distinctive clinical covariates. The computation of functional derivatives revealed the troughs in the curves of each cluster. Using circular statistics, we estimated the directional distributions of such troughs as potentially clinically relevant regions of NRR decay. We also demonstrated that multimodal fusion leads to improvement in the robustness of baseline NRR data obtained from fundus imaging.
bioinformatics2026-04-06v1NovoTax: prokaryotic strain identification from mass spectrometry-based proteomics data
Svedberg, D.; Mateus, A.Abstract
Traditional mass spectrometry-based proteomics typically requires prior knowledge of sample composition to match spectra to peptides. Yet, novel de novo peptide sequencing approaches can provide peptide sequences to identify the organism. Here, we introduce an end-to-end pipeline (NovoTax) to identify the closest prokaryotic genome directly from raw bottom-up proteomics data. The approach combines peptide sequencing tools with an optimized implementation of peptide searching through an extensive genome database. On a benchmark dataset of species isolates, we identified the reported species and strain in the majority of cases, and showed that in discordant cases NovoTax was likely correct. Interestingly, NovoTax was also able to identify contaminating species in some samples. The algorithm also identified the most abundant species in bacterial communities. In summary, NovoTax provides strain-level identification of microbial samples, enabling the downstream use of traditional proteomics search engines for a deeper proteome analysis.
bioinformatics2026-04-06v1EV-Net: A computational framework to model extracellular vesicles-mediated communication
Torrejon, E.; Sleegers, J.; Matthiesen, R.; Macedo, M. P.; Baudot, A.; Machado de Oliveira, R.Abstract
Summary: Extracellular vesicles (EVs) are bilayer vesicles that carry a diverse cargo of molecules, such as nucleic acids, proteins and metabolites. These EVs can be transported throughout the organism to specific recipient tissues. For this reason, EVs have been recognized as pivotal mediators of cell-to-cell communication (CCC). Importantly, alterations in EV-mediated communication have been linked to pathological processes, further highlighting their biological relevance. However, the in silico exploration of the functional effects of EV cargo in recipient tissues remains limited due to the lack of dedicated tools that can be applied to EV omics datasets. Most current bioinformatics tools for assessing CCC rely on ligand-mediated communication and therefore cannot be used to explore EV-mediated communication. To address this gap, we developed EV-Net, a bioinformatics tool designed to explore the effects of EV cargo on recipient tissues. EV-Net was built by adapting NicheNet, a CCC bioinformatics tool that relies on ligand-receptor mediated communication, for the analysis of EV proteomics and RNA-seq data. The EV-Net framework enables the identification and prioritization of EV cargo molecules with high regulatory potential in a recipient tissue of interest. This prioritization facilitates the systematic translation of EV cargo profiles into testable biological hypotheses. Availability and documentation: The source code of EV-Net is available on GitHub at https://github.com/torrejoNia/EV-Net alongside installation instructions. Comprehensive tutorials and additional documentation are available at https://torrejonia.github.io/EV-Net/. The datasets used in the use cases were deposited in Zenodo. The corresponding Zenodo links are provided in the tutorials for each use case. This software is distributed under a GPL3 licence.
bioinformatics2026-04-06v1Statistical signals indicate a dependence between amino acid backbone conformation and the translated synonymous codon
Rosenberg, A.; Marx, A.; Bronstein, A. M.Abstract
Synonymous codons encode the same amino acid but can differ in their usage and translational properties. In previous work we reported statistical differences in backbone dihedral angle distributions associated with synonymous codons in the Escherichia coli proteome. This finding has been questioned due to concerns regarding the statistical methodology used. Here we revisit the dataset using corrected statistical procedures and alternative statistical tests. Across multiple frameworks, the real dataset consistently shows an excess of small p-values relative to randomized controls, indicating detectable codon-associated differences in backbone conformation.
bioinformatics2026-04-06v1From Parametric Guessing to Graph-Grounded Answers: Building Reliable ChatGPT-like tools for Plant Science
Itharajula, M.; Lim, S. C.; Mutwil, M.Abstract
Large language models (LLMs) are increasingly used by plant biologists to summarize literature, generate hypotheses, and interpret experimental results. However, LLMs are unreliable sources of exhaustive, source-attributed facts, a critical limitation for the list-style queries that pervade plant biology (e.g., "list all transcription factors regulating secondary cell wall (SCW) biosynthesis in Arabidopsis"). Here, we pose such queries to ChatGPT, Claude, and Gemini and demonstrate that none return complete gene lists with reliable citations. We trace these failures to how LLMs store knowledge: as statistical patterns distributed across billions of internal parameters, with no mechanism to guarantee completeness, provenance, or reproducibility. We also review fine-tuning mitigation strategies, including multi-task instruction tuning, parameter-efficient methods, and context engineering, that alleviate but do not resolve these limitations. We then discuss retrieval-augmented generation (RAG), which feeds relevant documents to the LLM at query time, and argue that while it improves source attribution, it remains impractical when answers require synthesizing information scattered across hundreds of papers. As an alternative, we advocate graph retrieval-augmented generation (GraphRAG), in which the LLM serves as a reasoning and language interface over a structured, provenance-linked knowledge graph (KG) that returns complete result sets reproducibly. We outline a practical GraphRAG architecture and survey existing plant KG resources. Finally, we discuss open challenges, including entity disambiguation, relation normalization and evidence grading, and propose a roadmap for building open, continuously updated plant KGs that can turn "read 1,000 papers" into a single reproducible query.
bioinformatics2026-04-06v1MolClaw: An Autonomous Agent with Hierarchical Skills for Drug Molecule Evaluation, Screening, and Optimization
Zhang, L.; Wang, L.; Sun, X.; Tang, W.; Su, H.; Qian, Y.; Yang, Q.; Li, Q.; Tang, Z.; Sun, H.; Han, Y.; Jiang, Y.; Lou, W.; Zhou, B.; Wang, X.; Bai, L.; Xie, Z.Abstract
Computational drug discovery, particularly drug molecule screening and optimization, requires orchestrating dozens of specialized tools in multi-step workflows, yet current AI agents struggle to maintain robust performance and consistently underperform in these high-complexity scenarios. Here we present MolClaw, an autonomous agent that leads drug molecule evaluation, screening, and optimization. It unifies over 30 specialized domain resources through a three-tier hierarchical skill architecture (70 skills in total) that facilitates long-term agent interaction at runtime: tool-level skills standardize atomic operations, workflow-level skills compose them into validated pipelines with quality checks and reflection, and a discipline-level skill supplies scientific principles governing planning and verification across all scenarios in the field. Additionally, we introduce MolBench, a benchmark comprising molecular screening, optimization, and end-to-end discovery challenges spanning 8 to 50+ sequential tool calls. MolClaw achieves state-of-the-art performance across all metrics, and ablation studies confirm that gains concentrate on tasks that demand structured workflows while vanishing on those solvable with ad hoc scripting, establishing workflow orchestration competence as the primary capability bottleneck for AI-driven drug discovery.
bioinformatics2026-04-06v1