Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
Hybrid molecular dynamics-deep generative framework expands apo RNA ensembles toward cryptic ligand-binding conformations: application to HIV-1 TAR
Kurisaki, I.; Hamada, M.
RNA plays vital roles in diverse biological processes and represents an attractive class of therapeutic targets. In particular, cryptic ligand-binding sites, absent in apo structures but formed upon conformational rearrangement, offer high specificity for RNA-ligand recognition, yet remain rare among experimentally resolved RNA-ligand complex structures and difficult to predict in silico. RNA-targeted structure-based drug design (SBDD) is therefore limited by challenges in sampling cryptic states. Here, we apply Molearn, a hybrid molecular dynamics-deep generative framework, to expand apo RNA conformational ensembles toward cryptic states. Focusing on the paradigmatic HIV-1 TAR-MV2003 system, Molearn was trained exclusively on apo TAR conformations and used to generate a diverse ensemble of TAR structures. Candidate cryptic MV2003-binding conformations were subsequently identified using post-generation geometric analyses. Docking simulations of these conformations with MV2003 yielded binding poses with RNA-ligand interaction scores comparable to those of NMR-derived complexes. Notably, this work provides the first demonstration that a generative model can access cryptic RNA conformations that are ligand-binding competent and have not been recovered in prior molecular dynamics and deep generative modeling studies. Finally, we discuss current limitations in scalability and systematic detection, including application to the Internal Ribosome Entry Site, and outline future directions toward RNA-targeted SBDD.
bioinformatics · 2026-03-06 · v8
Mutation Reporter: Protein-Level Identification of Single and Compound Mutations in NGS Data
Teodoro, M.; das Chagas, R. V.; Yunes, J. A.; Migita, N. A.; Meidanis, J.
Next-generation sequencing (NGS) has accelerated precision medicine by enabling simultaneous analysis of multiple genes and detection of low-frequency mutations. However, few open-source tools allow non-specialized users to transparently adjust quality parameters during mutation analysis. Mutation Reporter was developed to identify both single and compound amino acid alterations directly from raw FASTQ files derived from RNA or exon sequencing. The software provides full parameter control, including alignment e-value, minimum read length, minimum read depth, and minimum variant allele frequency (VAF). It is freely available under a GNU license on GitHub (https://github.com/meidanis-lab/mutation-reporter) and as a Code Ocean capsule (https://codeocean.com/capsule/0121109/tree).
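The depth and VAF thresholds described above reduce to a simple per-variant filter. A minimal sketch of that idea in Python (the function name and default cutoffs here are hypothetical illustrations, not Mutation Reporter's actual interface):

```python
def passes_filters(ref_reads, alt_reads, min_depth=20, min_vaf=0.05):
    """Hypothetical sketch of depth/VAF filtering as exposed by tools
    like Mutation Reporter: keep a variant call only if total read depth
    and variant allele frequency (alt / total) both clear user-set minima."""
    depth = ref_reads + alt_reads
    if depth < min_depth:
        return False
    return alt_reads / depth >= min_vaf

# e.g. 10 alt reads out of 100 total: depth 100 and VAF 0.10 both pass
```

Exposing these two knobs directly is what lets non-specialist users trade sensitivity against false-positive calls without editing a pipeline.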
bioinformatics · 2026-03-06 · v2
scExploreR: a flexible platform for democratized analysis of multimodal single-cell data by non-programmers
Showers, W.; Desai, J.; Gipson, S. R.; Engel, K. L.; Smith, C.; Jordan, C. T.; Gillen, A. E.
Single-cell sequencing has revolutionized biomedical research by uncovering cellular heterogeneity in disease mechanisms, with significant potential for advancing personalized medicine. However, participation in single-cell data analysis is limited by the programming experience required to access data. Several existing browsers allow the interrogation of single-cell data through a point-and-click interface accessible to non-programmers, but many of these browsers are limited in the depth of analysis that can be performed, or the flexibility of input data formats accepted. Thus, programming experience is still required for comprehensive data analysis. We developed scExploreR to address these limitations and extend the range of analysis tasks that can be performed by non-programmers. scExploreR is implemented as a packaged R Shiny app that can be run locally or easily deployed for multiple users on a server. scExploreR offers extensive customization options for plots, allowing users to generate publication quality figures. Leveraging our SCUBA package, scExploreR seamlessly handles multimodal data, providing identical plotting capabilities regardless of input format. By empowering researchers to directly explore and analyze single-cell data, scExploreR bridges communication gaps between biological and computational scientists, streamlining insight generation.
bioinformatics · 2026-03-06 · v2
Multi-omics Profiling Identifies Molecular and Cellular Signatures of Regular Physical Activity in Human Peripheral Blood
Song, X.; Lv, J.; Ge, S.; Xu, S.; Wu, Y.; Zheng, Y.; Zhou, W.; Li, L.; Zhang, Y.; Zhang, J.; Gao, P.; Chen, Z.; Yin, P.; Yin, J.; Liu, C.
Regular physical activity is well established to protect against metabolic disorders and bolster immunity, yet the underlying molecular and cellular mechanisms remain incompletely understood. We integrated plasma metabolomics and lipidomics with single-cell transcriptomic and chromatin accessibility profiles to decode the systemic impact of physical activity on human immunity and metabolism. Our data reveal that regular physical activity is linked to a coordinated metabolic signature marked by enhanced fatty acid oxidation and antioxidant defense. In circulating immune cells, regularly active individuals exhibited synchronous enhancement at both the chromatin accessibility and transcriptional levels for antigen presentation-related genes in antigen-presenting cells (APCs), particularly in classical monocytes, naive B cells, and switched memory B cells. Meanwhile, cytotoxic programs in CD8+ cytotoxic T and mature NK cells showed epigenetic pre-activation of effector function regulators. Intercellular communication analysis further revealed that regular exercise enhanced MHC-I/II signaling between APCs and T cells and suppressed inflammatory signaling networks. Together, these findings elucidate molecular mechanisms underlying the health benefits of regular exercise and offer a theoretical basis for enhancing public health and preventing chronic diseases.
bioinformatics · 2026-03-06 · v2
Getting over ANOVA: Estimation graphics for multi-group comparisons
Lu, Z.; Anns, J.; Mai, Y.; Zhang, R.; Lian, K.; Lee, N. M.; Hashir, S.; Wang Zhouyu, L.; Li, Y.; Gonzalez, A. R. C.; Ho, J.; Choi, H.; Xu, S.; Claridge-Chang, A.
Data analysis in experimental science mainly relies on null-hypothesis significance testing, despite its well-known limitations. A powerful alternative is estimation statistics, which focuses on effect-size quantification. However, current estimation tools struggle with the complex, multi-group comparisons common in biological research. Here we introduce DABEST 2.0, an estimation framework for complex experimental designs, including shared-control, repeated-measures, two-way factorial experiments, and meta-analysis of replicates.
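The core quantity behind estimation graphics is an effect size reported with a resampling-based confidence interval rather than a p-value. A minimal NumPy sketch of that underlying computation (this is not DABEST's API, just an illustration of the idea):

```python
import numpy as np

def bootstrap_mean_diff(control, test, n_boot=5000, ci=95, seed=0):
    """Effect size (mean difference) with a percentile bootstrap CI --
    the quantity an estimation plot draws instead of a significance test."""
    rng = np.random.default_rng(seed)
    control = np.asarray(control, dtype=float)
    test = np.asarray(test, dtype=float)
    observed = test.mean() - control.mean()
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        c = rng.choice(control, size=control.size, replace=True)
        t = rng.choice(test, size=test.size, replace=True)
        diffs[i] = t.mean() - c.mean()
    half = (100 - ci) / 2
    lo, hi = np.percentile(diffs, [half, 100 - half])
    return observed, (lo, hi)
```

Shared-control and multi-group designs of the kind DABEST 2.0 targets repeat this comparison against a common control group, one effect-size curve per test group.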
bioinformatics · 2026-03-06 · v2
Rigidity-Aware Geometric Pretraining for Protein Design and Conformational Ensembles
Ni, Z.; Li, Y.; Qiu, Z.; Schölkopf, B.; Guo, H.; Liu, W.; Liu, S.
Generative models have recently advanced de novo protein design by learning the statistical regularities of natural structures. However, current approaches face three key limitations: (1) Existing methods cannot jointly learn protein geometry and design tasks, where pretraining can be a solution; (2) Current pretraining methods mostly rely on local, non-rigid atomic representations for downstream property-prediction tasks, limiting global geometric understanding for protein generation tasks; and (3) Existing approaches have yet to effectively model the rich dynamic and conformational information of protein structures. To overcome these issues, we introduce RigidSSL (Rigidity-Aware Self-Supervised Learning), a geometric pretraining framework that front-loads geometry learning prior to generative finetuning. Phase I (RigidSSL-Perturb) learns geometric priors from 432K structures from the AlphaFold Protein Structure Database with simulated perturbations. Phase II (RigidSSL-MD) refines these representations on 1.3K molecular dynamics trajectories to capture physically realistic transitions. Underpinning both phases is a bi-directional, rigidity-aware flow matching objective that jointly optimizes translational and rotational dynamics to maximize mutual information between conformations. Empirically, RigidSSL variants improve designability by up to 43% while enhancing novelty and diversity in unconditional generation. Furthermore, RigidSSL-Perturb improves the success rate by 5.8% in zero-shot motif scaffolding and RigidSSL-MD captures more biophysically realistic conformational ensembles in G protein-coupled receptor modeling. The code is available at: https://github.com/ZhanghanNi/RigidSSL.git.
bioinformatics · 2026-03-06 · v2
Automated Cell Type Annotation with Reference Cluster Mapping
Galanti, V.; Shi, L.; Azizi, E.; Liu, Y.; Blumberg, A. J.
Single-cell RNA sequencing has transformed the field of cellular biology by providing unprecedented insights into cellular heterogeneity. However, characterizing scRNA-seq datasets remains a significant challenge. We introduce RefCM, a novel computational method that combines optimal transport and integer programming to enhance the annotation of scRNA clusters using established reference datasets. Our method produces highly accurate cross-technology, cross-tissue, and cross-species mappings while remaining tractable at atlas scale, outperforming existing methods across all these tasks. By providing precise annotations, RefCM can enable the discovery of new cell types, states, and relationships in single-cell transcriptomic data.
bioinformatics · 2026-03-06 · v2
DPGT: A Spark-based high-performance joint variant calling tool for large cohort sequencing
Gong, C.; Yang, Q.; Wan, R.; Li, S.; Zhang, Y.; Li, Y.
Background: Joint variant calling is a crucial step in population-scale sequencing analysis. While population-scale sequencing is a powerful tool for genetic studies, achieving fast and accurate joint variant calling on large cohorts remains computationally challenging. Findings: To meet this challenge, we developed the Distributed Population Genetics Tool (DPGT), an efficient computing framework and a robust tool for joint variant calling on large cohorts based on Apache Spark. DPGT simplifies joint calling tasks for large cohorts with a single command on a local computer or a computing cluster, eliminating the need for users to create complex parallel workflows. We evaluated the performance of DPGT against existing methods using 2,504 1000 Genomes Project (1KGP) samples, 6 Genome in a Bottle (GIAB) samples, and 9,158 internal whole-genome sequencing (WGS) samples. DPGT produced results comparable in accuracy to existing methods, in less time and with better scalability. Conclusions: DPGT is a fast, scalable, and accurate tool for joint variant calling. The source code, implemented in Java and C++, is available under a GPLv3 license at https://github.com/BGI-flexlab/DPGT. Keywords: SNP/INDEL, joint calling, computational performance optimization, parallel computing
bioinformatics · 2026-03-06 · v2
Reliable prediction of short linear motifs in the human proteome
Pancsa, R.; Ficho, E.; Kalman, Z. E.; Gerdan, C.; Remenyi, I.; Zeke, A.; Tusnady, G. E.; Dobson, L.
Short linear motifs (SLiMs) are small, often transient interaction modules within intrinsically disordered regions (IDRs) of proteins that interact with particular domains and thereby regulate numerous biological processes. The limited sequence information within these short peptides leads to frequent false positive hits in both computational and experimental SLiM identification methods. This makes the description of novel SLiMs challenging and has limited the number of known cases to a few thousand, even though SLiMs play widespread roles in cellular functions. We present SLiMMine, a deep learning-based method to identify SLiMs in the human proteome. By refining the annotations of known, annotated motif classes, we created a high-quality dataset for model training. Using protein embeddings and neural networks, SLiMMine reliably predicts novel SLiM candidates in known classes and eliminates ~80% of the pattern matching-based motif hits as false positives; furthermore, it can also be used as a discovery tool to find uncharacterized SLiMs based on an optimal sequence environment. In addition, we narrowed the highly general interactor-domain definitions of known SLiM classes to specific human proteins, enabling more precise prediction of a wide range of potential protein-protein interactions (PPIs) in the human interactome. SLiMMine is available in the form of a user-friendly, multi-purpose web server at https://slimmine.pbrg.hu/.
bioinformatics · 2026-03-06 · v1
TFBSpedia: a comprehensive human and mouse transcription factor binding sites database
Li, S.; Chou, E.; Wang, K.; Boyle, A. P.; Sartor, M. A.
Mapping the genomic locations and patterns of transcription factor binding sites (TFBS) is essential for understanding gene regulation and advancing treatments for diseases driven by DNA modifications, including epigenetic changes and sequence variants. Although several TFBS databases exist, no study has systematically benchmarked these databases across different sequencing technologies and computational algorithms. In this study, we addressed this gap by constructing a TFBS database that integrates all available ENCODE cell line ATAC-seq and Cistrome Data Browser ChIP-seq datasets, comprising 11.3 million human and 1.87 million mouse TFBS. We also integrated previously published TFBS resources (Factorbook, Unibind, RegulomeDB, and ENCODE_footprint) and found each contains a substantial fraction of unique TFBS predictions, highlighting significant discrepancies among existing resources. To assess the accuracy of the combined TFBS regions, we assembled ten independent genomic annotation datasets for evaluation and found that TFBS regions predicted by multiple databases are more likely to represent true and biologically meaningful binding sites. For each predicted TFBS region, we define two scores: the confidence score reflects prediction reliability, while the importance score represents biological functional relevance. Finally, we introduce TFBSpedia, a lightweight and efficient search engine that enables rapid retrieval of TFBS regions and comprehensive annotation information across the integrated databases.
bioinformatics · 2026-03-06 · v1
A latent space thermodynamic model of cell differentiation
Poursina, A.; Hajhashemi, S.; Mikaeili Namini, A.; Saberi, A.; Emad, A.; Najafabadi, H. S.
Inferring the governing dynamics of differentiation that capture cell state evolution remains a central challenge in single-cell biology. We present Latent Space Dynamics (LSD), a thermodynamics-inspired framework that models cell differentiation as evolution on a learned Waddington landscape in latent space. LSD jointly infers a low-dimensional cell state, a differentiable potential function governing developmental flow, and a local entropy term that quantifies cellular plasticity. Using a neural ordinary differential equation, LSD reconstructs continuous differentiation trajectories from time-ordered single-cell data. Across diverse developmental systems, LSD accurately recovers lineage hierarchies, predicts fate commitment for unseen cell types, and outperforms existing trajectory inference approaches in directional accuracy. Moreover, in silico gene perturbations reveal how individual regulators reshape the landscape, and entropy provides a quantitative measure of plasticity in development and cancer.
bioinformatics · 2026-03-06 · v1
RNA-seq analysis in seconds using GPUs
Melsted, P.; Guthnyjarson, E. M.; Nordal, J.
We present a GPU implementation of kallisto for RNA-seq transcript quantification. By redesigning the core algorithms (pseudoalignment, equivalence-class intersection, and the EM algorithm) for massively parallel execution on GPUs, we achieve a 30-50x speedup over multithreaded CPU kallisto. On a benchmark of 100 Geuvadis samples from human cell lines, the GPU version processes paired-end reads at a rate of 3.6 million per second, completing a typical sample in seconds rather than minutes. For a large dataset of 295 million reads, runtime drops from 40 minutes to 50 seconds. Our implementation demonstrates that careful algorithmic redesign, rather than naive porting of software, is necessary to fully exploit the computing power of GPUs in sequence analysis.
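The EM step at the heart of this kind of quantification is simple to state: each equivalence class's read count is split across its compatible transcripts in proportion to the current abundance estimates, and the abundances are then renormalized. A toy CPU sketch in NumPy (ignoring effective transcript lengths, which kallisto also accounts for; this illustrates the algorithm, not the GPU code):

```python
import numpy as np

def em_abundances(eq_counts, eq_classes, n_tx, n_iter=100):
    """Toy EM for transcript abundances from equivalence-class counts.
    eq_classes[i] lists the transcript indices compatible with class i,
    and eq_counts[i] is its read count. E-step: split each class's count
    proportionally to current abundances; M-step: renormalize."""
    alpha = np.full(n_tx, 1.0 / n_tx)
    total = float(sum(eq_counts))
    for _ in range(n_iter):
        new = np.zeros(n_tx)
        for count, txs in zip(eq_counts, eq_classes):
            w = alpha[txs]                  # abundances of compatible transcripts
            new[txs] += count * w / w.sum() # distribute reads proportionally
        alpha = new / total
    return alpha
```

With two transcripts and classes {t0}, {t1}, {t0, t1} carrying 30, 10, and 20 reads, the fixed point is (0.75, 0.25): the multi-mapped reads are resolved by the uniquely mapping evidence.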
bioinformatics · 2026-03-06 · v1
From expansion to consolidation: two decades of Gene Ontology evolution
Pitarch, B.; Pazos, F.; Chagoyen, M.
The Gene Ontology (GO) is a long-standing, community-maintained knowledge resource that underpins the functional annotation of gene products across numerous biological databases. Released regularly, GO and its associated annotations form a large, continuously evolving dataset whose temporal dynamics have direct consequences for data reuse, versioning, and reproducibility. Because analytical results derived from GO are inherently tied to specific ontology and annotation releases, a systematic understanding of how GO changes over time is essential for transparent interpretation and long-term reuse of GO-based analyses. Here, we present a comprehensive temporal characterization of the Gene Ontology and its annotations spanning 21 years of publicly available releases. Treating successive ontology and annotation versions as longitudinal research data, we quantify changes in ontology structure, term composition, relationships, and annotation content across time and across three representative annotation resources. Our analysis reveals sustained growth of GO over its lifetime, accompanied by marked structural reorganization, particularly affecting high-level, general ontology terms. Notably, across multiple structural and annotation metrics, we identify a transition toward increased stability beginning around 2017, consistent with a maturation phase of the resource. This work provides a reference framework for researchers who rely on GO releases for data integration, benchmarking, and reproducible functional analysis.
bioinformatics · 2026-03-06 · v1
What Do Biological Foundation Models Compute? Sparse Autoencoders from Feature Recovery to Mechanistic Interpretability
Orlov, A. V.; Makus, Y. V.; Ashniev, G. A.; Orlova, N. N.; Nikitin, P. I.
Foundation models trained on protein and DNA sequences are increasingly deployed for variant interpretation, drug design, and gene regulation prediction, yet their internal representations remain opaque, limiting both biological insight and trust in model-guided decisions. Existing interpretation approaches establish what these models encode but cannot reveal how biological knowledge is internally organized and computed. Sparse autoencoders (SAEs) offer a complementary approach by decomposing model activations into interpretable features, each capturing a distinct biological concept. Over the past year, SAEs have been applied to protein language models, genomic language models, pathology vision transformers, single-cell foundation models, and protein structure generators. Here we provide a systematic review of sparse dictionary learning across biological foundation models. We find that independent studies using different architectures and evaluation strategies consistently recover features spanning biological scales (from secondary structure elements and functional domains in proteins to transcription factor binding sites and regulatory elements in genomes), providing convergent evidence that these models learn interpretable representations accessible through sparse decomposition. However, we identify a critical gap: validation relies almost exclusively on matching features against existing annotations, risking circularity when those annotations derive from the same sequence databases used for model training. We propose a three-level interpretability framework (representational, computational, and causal mechanistic) and argue that the field's most distinctive opportunity lies in experimental validation through deep mutational scanning, massively parallel reporter assays, and structural characterization, which can establish whether these models have learned genuine biological mechanisms rather than training set statistics.
bioinformatics · 2026-03-06 · v1
In silico drug repurposing and in vitro validation of cestode fatty acid binding proteins
Rodriguez, S.; Alberca, L. N.; Gavernet, L.; Franchini, G. R.; Talevi, A.
Echinococcosis is a Neglected Tropical Disease (NTD) caused by Echinococcus granulosus and Echinococcus multilocularis, the etiological agents of cystic and alveolar echinococcosis, respectively. These infections pose a significant public health burden, particularly in endemic regions. Cestodes lack key enzymes involved in lipid metabolism and must acquire lipids from their hosts. Fatty Acid Binding Proteins (FABPs), which mediate lipid trafficking and intracellular transport, have therefore emerged as essential and potentially druggable targets. In this study, we implemented an integrated virtual screening strategy combining ligand-based and structure-based approaches to identify novel FABP binders as potential therapeutic agents against Echinococcus spp. High-specificity screening of approximately 435,000 compounds yielded a limited number of prioritized in silico hits. Four compounds (hydrochlorothiazide, naratriptan, fenticonazole, and montelukast) were selected for experimental validation, prioritizing repurposing candidates. Fluorescence displacement assays confirmed that hydrochlorothiazide binds to three cestode FABPs (EgFABP1, EmFABP1, and EmFABP3), validating the predictive performance of the computational workflow. These findings support the value of parallel in silico screening strategies and drug repurposing approaches for the discovery of new therapeutic candidates against neglected tropical diseases. Keywords: Drug discovery; Drug repositioning; Drug repurposing; Echinococcus spp.; FABP; Medicinal chemistry; Neglected tropical diseases; Virtual screening
bioinformatics · 2026-03-06 · v1
CLAMP: Curated Latent-variable Analysis with Molecular Priors
Subirana-Granes, M.; Nandi, S.; Zhang, H.; Chikina, M.; Pividori, M.
Gene expression analysis has long been fundamental for elucidating molecular pathways and gene-disease relationships, but traditional single-gene approaches cannot capture the coordinated regulatory networks underlying complex phenotypes; although unsupervised matrix factorization methods (e.g., PCA, NMF) reveal coexpression patterns, they lack the ability to incorporate prior biological knowledge and often struggle with interpretability and technical noise correction. Semi-supervised strategies such as PLIER have improved interpretability by integrating pathway annotations during latent variable extraction, yet the original PLIER implementation is prohibitively slow and memory-intensive, making it impractical for modern large-scale resources like ARCHS4 or recount3. Here, we introduce CLAMP, which overcomes these constraints through a two-phase algorithmic design (an unsupervised CLAMPbase initialization followed by a CLAMPfull regression that incorporates priors via glmnet), rigorous internal cross-validation to tune regularization parameters for each latent variable, and efficient on-disk data handling using memory-mapped matrices from the bigstatsr package. Benchmarking on GTEx, recount2, and ARCHS4 demonstrates that CLAMP achieves 7x-41x speedups over PLIER, succeeds in modeling hundreds of thousands of samples that PLIER cannot handle, and maintains or improves biological specificity of latent variables as shown by tissue-alignment and pathway enrichment analyses. By filling the gap in scalable, biologically informed latent variable extraction, CLAMP enables comprehensive analysis of modern transcriptomic compendia and paves the way for deeper insights into gene regulatory networks and downstream applications in translational genomics.
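For contrast with the prior-informed factorization described above, the generic unsupervised baseline can be sketched in a few lines: Lee-Seung multiplicative updates for NMF, one of the matrix factorization methods the abstract mentions. This is a textbook sketch of NMF, not CLAMP's glmnet-based two-phase algorithm:

```python
import numpy as np

def nmf(X, k, n_iter=300, eps=1e-9, seed=0):
    """Lee-Seung multiplicative updates for X ~ W @ H with W, H >= 0.
    Each update rescales the factors by ratios of nonnegative terms, so
    nonnegativity is preserved and the Frobenius error never increases.
    Rows of H play the role of coexpression patterns / latent variables."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

Methods like PLIER and CLAMP extend this picture by penalizing latent variables toward known pathway gene sets, which is what makes the resulting factors interpretable.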
bioinformatics · 2026-03-05 · v2
GraTools, a user-friendly tool for exploring and manipulating pangenome variation graphs
Ravel, S.; Marthe, N.; Carrette, C.; Mohamed, M.; Sabot, F.; Tranchant-Dubreuil, C.
Background: Pangenome variation graphs (PVGs), which represent genomic diversity through multiple-genome alignment, are powerful tools for studying genomic variation in populations. However, current tools often lack integration or efficiency, or require format conversions, hindering their usability. Results: Here, we introduce GraTools, a fast and user-friendly command-line tool for manipulating PVGs directly from the original GFA file. After a one-time graph import, GraTools enables rapid subgraph extraction, FASTA sequence retrieval, and comprehensive analyses, including core/dispensable genome ratio calculation and group-specific segment identification. The import step converts the graph into standard data formats (BAM/BED), enabling the reuse of well-optimized existing tools and allowing efficient storage and querying of the large, complex data structures of PVGs. Scalability is ensured by a modular architecture supporting parallel processing and asynchronous I/O operations. GraTools supports coordinates defined on the primary reference as well as on alternative genomes within the graph without re-importing, and its outputs can be easily visualized or manipulated using external tools. Using an Asian rice pangenome graph (13 accessions), we demonstrate its ability to easily extract subgraphs, compute depth statistics, and identify subspecies-specific segments. An intuitive command-line interface, real-time execution feedback, and a detailed logging system make this tool suitable for a wide range of applications, from population genetics to breeding and genomic medicine, for both biologists and bioinformaticians. Conclusions: Through its unified graph manipulation interface, GraTools offers a compelling alternative to the few existing tools for manipulating PVGs, facilitating rapid, efficient, and flexible downstream analyses.
It is available as an open-source tool (GNU GPLv3), with its documentation available at https://gratools.readthedocs.io.
bioinformatics · 2026-03-05 · v2
Machine Learning Ensemble Reveals Distinct Molecular Pathways of Retinal Damage in Spaceflown Mice
Casaletto, J. A.; Scott, R. T.; Rathod, A.; Jain, A.; Chandar, A.; Adapala, A.; Prajapati, A.; Nautiyal, A.; Jayaraman, A.; Boddu, A.; Kelam, A.; Jain, A.; Pham, B.; Shastry, D.; Narayanan, D.; Kosaraju, E.; Paley, E.; Uribe, F. P.; Shahid, I.; Ye, I.; Wu, J.; Lin, J.; Srinivas, K.; Della Monica, M. P.; Hitt, M.; Lin, M.; Volkan, M.; Kharya, M.; Kaul, M.; Jaffer, M. A.; Ali, M.; Chang, N. Z.; Ashri, N.; Couderc, N. B.; Paladugu, P.; Sood, R.; Hiremath, R.; Pathak, R.; Dogra, S.; Srinivas, S.; Samaddar, S.; Gopinath, S.; Sawant, S.; Cai, S.; Pala, V.; Nair, V.; Shi, Z.; Narayanan, S.; Mundackal
Background: Spaceflight-associated neuro-ocular syndrome (SANS) poses significant risks to astronaut visual health during long-duration missions, yet its underlying molecular mechanisms remain incompletely understood. Oxidative stress and apoptosis are candidate molecular drivers, but their transcriptomic signatures in spaceflight-exposed retinal tissue have not been systematically characterized. Methods: We applied a machine learning ensemble of linear regression models to predict two ocular phenotypes: 4-hydroxynonenal (4-HNE) immunostaining as a marker of lipid peroxidation-mediated oxidative damage, and TUNEL positivity as a marker of apoptotic cell death. In this observational study, we used bulk retinal gene expression data from a controlled experiment with ground-control and spaceflown mice to predict these phenotypes. Gene Ontology pathway enrichment was performed on the most predictive gene sets for each phenotype. Results: The 4-HNE phenotype was predicted by genes that converge on membrane-associated pathways, photoreceptor protein modification, synaptic dysfunction, and extracellular matrix dysregulation, including B2m, Tf, Cnga1, mt-Nd1, Snap25, and Efemp1. The genes predicting the TUNEL phenotype revealed a distinct signature emphasizing stress-induced apoptosis, rod photoreceptor degeneration, and endoplasmic reticulum dysfunction, with Ddit4, Nrl, Rom1, Reep6, and Gabarapl1 emerging as central regulators. Conclusions: Oxidative lipid peroxidation and apoptotic cell death represent complementary and molecularly distinct pathological mechanisms in spaceflight-exposed murine retinal tissue. The gene signatures provide a putative molecular framework for developing noninvasive biomarkers and therapeutic targets to monitor and protect astronaut visual health during long-duration and deep-space missions.
bioinformatics · 2026-03-05 · v2
Nested birth-death processes are competitive with parameter-heavy neural networks as time-dependent models of protein evolution
Large, A.; Holmes, I.
Most statistical phylogenetics analyses use relatively simple continuous-time finite-state Markov models of point substitution to describe molecular evolution, often keeping sequence length fixed, ignoring insertions and deletions (indels) entirely, and making little (if any) allowance for variations in selection pressure due to interactions between amino acids. The simplistic assumptions of these models limit the realism of phylogenetics. We extend the TKF92 model - the canonical hierarchical model combining an outer birth-death process for indels with an inner finite-state Markov chain for substitutions - by introducing additional nesting and latent states, allowing for structural heterogeneity. We compare these TKF92 extensions (which are derived as exact solutions of instantaneous processes, and in which evolutionary time naturally appears as a matrix exponential coefficient) to two classes of neural seq2seq model that are not derived in such a way, but instead take evolutionary time as an input feature: the first class of model being constrained to enforce a TKF92-like structure, and the second lacking any such constraint. We evaluate the per-character perplexities of all models on splits of the Pfam database of aligned protein domains. A nested TKF-based model with only 32,000 parameters is highly competitive with neural networks containing tens of millions of parameters, outperforming all but two of the neural architectures tested. Our results indicate that approaches grounded in molecular evolutionary theory may be more parameter-efficient and provide a better fit to real alignments than unconstrained alternatives, supporting the incorporation of CTMC-based model structure within future neural phylogenetic approaches.
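The remark that evolutionary time "naturally appears as a matrix exponential coefficient" refers to the standard CTMC identity P(t) = exp(Qt) for a rate matrix Q. A minimal NumPy sketch on a Jukes-Cantor-style 4-state substitution matrix (illustrating the general identity only, not the TKF92 extensions themselves):

```python
import numpy as np

# Jukes-Cantor-style rate matrix: each off-diagonal rate mu/3,
# diagonal -mu, so every row sums to zero (a valid CTMC generator).
mu = 1.0
Q = np.full((4, 4), mu / 3)
np.fill_diagonal(Q, -mu)

def transition_probs(Q, t):
    """P(t) = expm(Q t), computed here via eigendecomposition, which is
    exact for a symmetric Q: entry (i, j) is the probability of being in
    state j after time t given state i at time 0."""
    w, V = np.linalg.eigh(Q)
    return V @ np.diag(np.exp(w * t)) @ V.T

P = transition_probs(Q, 0.5)  # each row of P is a probability distribution
```

Because time enters only through exp(Qt), branch lengths compose exactly (P(s + t) = P(s) P(t)), a property the neural models that merely take t as an input feature are not guaranteed to satisfy.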
bioinformatics · 2026-03-05 · v2
CAPHEINE, or everything and the kitchen sink: a workflow for automating selection analyses using HyPhy
Verdonk, H. E.; Callan, D.; Kosakovsky Pond, S. L.
Here we present CAPHEINE, a computational workflow that starts with a set of unaligned pathogen sequences and a reference genome and performs a comprehensive exploratory evolutionary analysis of the input data. CAPHEINE pairs nicely with studies of site-level selection dynamics, gene-level positive selection, and lineage-specific shifts in selective pressure. Our workflow is portable across macOS, Windows, and Linux, allowing researchers to focus on results. CAPHEINE is freely available at https://github.com/veg/CAPHEINE, along with a set of usage instructions.
bioinformatics · 2026-03-05 · v2
Towards building a World Model to simulate perturbation-induced cellular dynamics by AlphaCell
Chuai, G.; Chen, X.; Yang, X.; Zhang, C.; Qu, K.; Wang, Y.; Li, W.; Yang, J.; Si, D.; Xing, F.; Gao, Y.; Wu, S.; Fu, S.; He, B.; Liu, Q.
Predicting cellular responses to perturbations is crucial for therapeutic discovery, yet experimental screening is severely constrained by the combinatorial vastness of biological space. While computational simulations offer a scalable alternative, current models are limited by incomplete latent representations that rely mainly on highly variable genes, poor genome-wide reconstruction fidelity, and dynamic laws that do not generalize across diverse contexts. Consequently, they fail to mechanistically transfer learned dynamics to unseen cellular contexts. To address these systemic flaws, we introduce AlphaCell, a generative Virtual Cell World Model that unifies genome-wise representation with continuous state transition modeling. AlphaCell achieves three synergistic innovations: (1) Latent Manifold Rectification, processing the full protein-coding transcriptome to construct a differentiable Virtual Cell Space, effectively filtering noise while preserving intrinsic cellular topology; (2) Biological Reality Reconstruction, utilizing a massive, knowledge-rich decoder to translate abstract latent states back into high-fidelity, genome-wide expression profiles; and (3) Universal State Transition, applying Optimal Transport Conditional Flow Matching to model perturbations as continuous, deterministic vector fields. By abstracting perturbation mechanisms into generalized dynamic laws, AlphaCell robustly predicts perturbation responses in compositional generalization scenarios and enables zero-shot prediction of cellular dynamics in entirely unseen cellular contexts, providing a foundational engine for cellular-context-generalizable perturbation prediction and perturbation-induced cellular dynamics simulation.
bioinformatics · 2026-03-05 · v1 · A Resolution-Agnostic Geometric Transformer for Chromosome Modeling Using Inertial Frame
Zhou, Y.; Li, H.; Liu, S.
Abstract
Chromosomes are the carriers of genetic information. Further understanding of their 3D structure can help reveal gene-regulatory mechanisms and cellular functions. However, high-resolution 3D structures are often missing due to the high cost and inherent noise of experimental screening. A standard pipeline for reconstructing chromosome 3D structure first applies the single-cell Hi-C high-throughput screening method to measure pairwise interactions between DNA fragments at different resolutions; it then adopts computational methods to reconstruct the 3D structures from these contacts. These include traditional numerical methods and deep learning models, which struggle with limited model expressiveness and poor generalization across resolutions. To handle this issue, we propose InertialGenome, a novel transformer-based framework for robust and resolution-agnostic chromosome reconstruction. InertialGenome first adopts the inertial frame for pose canonicalization. Then, based on this invariant pose, it employs a Transformer with geometry-aware positional encoding, leveraging Nystrom estimation. To verify the effectiveness of InertialGenome, we conduct experiments on two single-cell 3D reconstruction datasets at four resolutions, achieving superior performance over all four computational baselines. Additionally, we observe that the 3D structures reconstructed by InertialGenome are more consistent with real experimental results on two functional verification tasks. Finally, we leverage InertialGenome for cross-resolution transfer learning, yielding up to a 5% improvement from low to high resolution. The source code is available at https://github.com/yize1203/InertialGenome.
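The inertial-frame pose canonicalization step can be illustrated in two dimensions, where the principal axis of the second-moment (inertia-like) tensor has a closed form. A hedged sketch, a 2D stand-in rather than InertialGenome's actual 3D procedure:

```python
import math

def canonicalize_2d(points):
    """Center a 2D point cloud and rotate it so the principal
    axis of its second-moment tensor lies along the x-axis,
    making the pose invariant to translation and rotation."""
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    centered = [(x - cx, y - cy) for x, y in points]
    sxx = sum(x * x for x, _ in centered)
    syy = sum(y * y for _, y in centered)
    sxy = sum(x * y for x, y in centered)
    # closed-form principal-axis angle of the 2x2 moment tensor
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    # rotate by -theta so the principal axis maps to the x-axis
    return [(c * x + s * y, -s * x + c * y) for x, y in centered]

# Collinear points at 45 degrees end up flattened onto the x-axis.
canon = canonicalize_2d([(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)])
```

Any rotated or translated copy of the same point cloud maps to the same canonical pose (up to axis sign), which is what makes a downstream model resolution- and orientation-agnostic.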
bioinformatics · 2026-03-05 · v1 · DPGT: A spark based high-performance joint variant calling tool for large cohort sequencing
Gong, C.; Yang, Q.; Wan, R.; Li, S.; Zhang, Y.; Li, Y.
Abstract
Background: Joint variant calling is a crucial step in population-scale sequencing analysis. While population-scale sequencing is a powerful tool for genetic studies, achieving fast and accurate joint variant calling on large cohorts remains computationally challenging. Findings: To meet this challenge, we developed the Distributed Population Genetics Tool (DPGT), an efficient computing framework and robust tool for joint variant calling on large cohorts, built on Apache Spark. DPGT reduces joint calling for large cohorts to a single command on a local computer or a computing cluster, eliminating the need for users to create complex parallel workflows. We evaluated the performance of DPGT against existing methods using 2,504 1000 Genomes Project (1KGP), 6 Genome in a Bottle (GIAB), and 9,158 internal whole-genome sequencing (WGS) samples. DPGT produced results comparable in accuracy to existing methods, in less time and with better scalability. Conclusions: DPGT is a fast, scalable, and accurate tool for joint variant calling. The source code is available
bioinformatics · 2026-03-05 · v1 · PAMG-AT: A Physiological Attention Multi-Graph Model with Adaptive Topology for Stress Detection using Wearable Devices
YILDIZ, O.; Subasi, A.
Abstract
Stress detection with wearable physiological sensors is vital in digital health and affective computing. Conventional machine learning techniques usually examine physiological signals separately, missing the intricate inter-signal connections involved in the human stress response. While deep neural networks offer high accuracy, they function as black boxes, providing minimal insight into the physiological processes behind stress detection. This study introduces a hierarchical graph neural network framework for WESAD stress detection, establishing a methodology for affective computing that emphasizes interpretability and extensibility while maintaining strong predictive performance. We propose PAMG-AT (Physiological Attention Multi-Graph with Adaptive Topology), a hierarchical graph neural network architecture for stress detection using multimodal physiological signals. In this framework, physiological features serve as nodes within a knowledge-driven graph, while edges represent established physiological relationships, including cardiac-electrodermal coupling and cardio-respiratory interaction. The architecture employs a three-level attention mechanism: spatial encoding via Graph Attention Networks (GAT) to assess feature importance, temporal modeling with a Transformer to capture dynamics across time windows, and global pooling for classification. The model is evaluated using three sensor configurations (chest-only, wrist-only, and hybrid) on the WESAD dataset, employing rigorous Leave-One-Subject-Out (LOSO) cross-validation. PAMG-AT achieves competitive performance, with 94.59% accuracy (+/- 6.8%) for chest sensors, 91.76% accuracy (+/- 9.2%) for wrist sensors, and 92.80% accuracy (+/- 8.33%) for the hybrid configuration. The proposed method provides interpretability via attention weights, revealing that ECG-EDA relationships (cardiac-electrodermal coupling) are most predictive of stress.
Three low-responder subjects (S2, S3, S9) with atypical physiological stress patterns demonstrate lower accuracy (81-87%), offering clinically valuable insights for personalized stress management. The effective wrist-only configuration, achieving 91.76% accuracy, supports practical deployment in consumer wearables.
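The Leave-One-Subject-Out protocol used in the evaluation above can be sketched generically; the tuple layout and dummy data below are illustrative, not the WESAD pipeline:

```python
def loso_folds(samples):
    """Leave-One-Subject-Out cross-validation: each fold holds
    out all windows from one subject for testing and trains on
    the rest, so reported accuracy reflects generalization to
    people the model has never seen."""
    subjects = sorted({s for s, _, _ in samples})
    for held_out in subjects:
        train = [x for x in samples if x[0] != held_out]
        test = [x for x in samples if x[0] == held_out]
        yield held_out, train, test

# samples are (subject_id, feature_vector, label) tuples
data = [("S2", [0.1], 0), ("S2", [0.9], 1),
        ("S3", [0.2], 0), ("S4", [0.8], 1)]
folds = list(loso_folds(data))
# 3 subjects -> 3 folds; no subject appears in both splits
```

This subject-level split is why low-responder subjects such as S2, S3, and S9 show up as distinct low-accuracy folds rather than being averaged away.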
bioinformatics · 2026-03-05 · v1 · Genome-wide maps of transcription factor footprints identify noncoding variants rewiring gene regulatory networks with varTFBridge
Lin, J.; Dong, W.; Zhang, J.; Xie, C.; Jing, X.; Zhao, J.; Ma, K.; Kang, H.; Jiang, Y.; Xie, X. S.; Zhao, Y.
Abstract
Common-variant genome-wide association studies have identified thousands of noncoding loci associated with human diseases and complex traits; however, interpreting their functional mechanisms remains a major challenge. In recent years, the decreasing cost of high-throughput sequencing has enabled large population cohorts to generate whole-genome sequencing data for hundreds of thousands of individuals, providing an opportunity to comprehensively explore the functional impact of the full spectrum of genetic variants. However, unlike common-variant GWAS or rare-variant burden testing in coding regions, robust and systematic strategies for interpreting rare variants in noncoding regions remain limited. Meanwhile, many causal variants in noncoding regions are thought to act by perturbing transcription factor (TF) binding and rewiring gene regulatory networks, but progress has been limited by the lack of accurate, high-resolution maps of TF footprints. Recent single-molecule deaminase footprinting (FOODIE) technologies enable precise genome-wide detection of TF footprints at near single-base resolution. We discovered that K562 FOODIE footprints, although comprising less than 0.5% of the genome, exhibit approximately 70-fold enrichment for erythroid trait heritability, establishing an unprecedented resource for interrogating TF-mediated regulatory mechanisms. We further introduce varTFBridge, an integrative framework that combines both common and rare noncoding variant association analyses, footprint-gene linking models, and AlphaGenome to prioritize causal noncoding variants that rewire gene regulatory networks. Across 13 erythrocyte traits, varTFBridge linked 209 credible common and 18 driver rare variants altering TF binding affinity to modulate 207 unique target genes. 
We successfully recapitulated a known causal variant associated with erythrocyte traits and further elucidated its molecular mechanism, revealing how specific disruption of TF co-binding alters CCND3 regulation to drive variation in red blood cell count. Together, our approach enables genome-wide functional predictions of trait-associated noncoding variants by linking transcription factor binding events to their target genes.
bioinformatics · 2026-03-05 · v1 · Relation Extraction for Diet, Non-Communicable Disease and Biomarker Associations (RECoDe): A CoDiet study
Choi, D.; Gu, Y.; Zong, K.; Lain, A. D.; Zaikis, D.; Rowlands, T.; Rei, M.; CoDiet Consortium; Beck, T.; Posma, J. M.
Abstract
Diet plays a critical role in human health, with growing evidence linking dietary habits to disease outcomes. However, extracting structured dietary knowledge from biomedical literature remains challenging due to the lack of dedicated relation extraction datasets. To address this gap, we introduce RECoDe, a novel relation extraction (RE) dataset designed specifically for diet, disease, and related biomedical entities. RECoDe captures a diverse set of relation types, including a broad spectrum of positive association patterns and explicit negative examples, with over 5,000 human-annotated instances validated by up to five independent annotators. Furthermore, we benchmark various natural language processing (NLP) RE models, including BERT-based architectures and enhanced prompting techniques with locally deployed large language models (LLMs), to improve classification performance on underrepresented relation types. The best-performing model, gpt-oss-20B (a local LLM), achieved an F1-score of 64% for multi-class classification and 92% for binary classification using a hierarchical prompting strategy with a separate reflection step built in. To demonstrate the practical utility of RECoDe, we introduce the Contextual Co-occurrence Summarisation (CoCoS) framework, which aggregates sentence-level relation extractions into document-level summaries and further integrates evidence across multiple documents. CoCoS produces effect estimates consistent with established dietary knowledge, demonstrating its validity as a general framework for systematic evidence synthesis. Availability: The code, models, and data will be made freely available following peer review.
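The hierarchical prompting strategy described above (binary relation detection, then multi-class typing, then a reflection step) amounts to a routing scheme. A sketch with stub classifiers standing in for the LLM calls; the stub rules and labels are invented for illustration and are not RECoDe's relation taxonomy:

```python
def classify_relation(sentence, binary_clf, multi_clf, reflect):
    """Hierarchical routing: first decide whether any diet-disease
    relation is present (binary stage), then assign a relation
    type (multi-class stage), and finally let a reflection step
    accept or override the label. The three callables stand in
    for separate LLM prompt calls."""
    if not binary_clf(sentence):
        return "no_relation"
    label = multi_clf(sentence)
    return reflect(sentence, label)

# Stub classifiers for illustration only.
binary = lambda s: "increases" in s or "reduces" in s
multi = lambda s: ("positive_association" if "increases" in s
                   else "negative_association")
reflect = lambda s, lab: lab  # a real reflection step re-checks the evidence

out = classify_relation("Fiber intake reduces colorectal cancer risk.",
                        binary, multi, reflect)
```

Splitting detection from typing is what lets the binary stage reach a much higher F1 (92%) than the harder multi-class stage (64%).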
bioinformatics · 2026-03-05 · v1 · CellTypeAI: Automated cell identification for scRNA-seq using local generative-AI
Daw, R. H.; Deijnen, H. R.; Rattray, M.; Grainger, J. R.
Abstract
Single-cell RNA sequencing (scRNA-seq) cell annotation techniques rely on the matching of known defining marker genes to a given cell population. However, these methods may lack robustness to dynamic fluctuations in cell marker expression between patients, samples and pathologies. The advent of easy-to-implement predictive technologies, like generative-AI (gen-AI), has facilitated the introduction of computational workflows that improve otherwise inaccurate context-dependent cell type annotation. Here, we introduce CellTypeAI, a streamlined, scalable program developed for tissue context-dependent cell annotation of scRNA-seq datasets using modern gen-AI models, enhanced by retrieval augmented generation methods. Our implementation builds upon local gen-AI hosting technologies and directly integrates into scRNA-seq analysis pipelines. We show that CellTypeAI provides improved annotation accuracy compared to current conventional annotation methods and nascent cloud-based gen-AI approaches. As CellTypeAI leverages locally-run AI models, it can be applied to sensitive datasets, unlike approaches utilising online gen-AI tools such as ChatGPT, DeepSeek, or Claude. CellTypeAI presents a novel solution for tissue-specific cell type identification, overcoming traditional marker-based limitations via locally-deployed generative-AI models.
bioinformatics · 2026-03-05 · v1 · The FAIRSCAPE AI-readiness Framework for Biomedical Research
Al Manir, S.; Levinson, M. A.; Niestroy, J.; Churas, C.; Sheffield, N. C.; Sullivan, B.; Fairchild, K.; Torres, M. M.; Ratcliffe, S. J.; Parker, J. A.; Ideker, T.; Clark, T.
Abstract
Objective: Biomedical datasets intended for use in AI applications require packaging with rich pre-model metadata to support model development that is explainable, ethical, epistemically grounded and FAIR (Findable, Accessible, Interoperable, Reusable). Methods: We developed FAIRSCAPE, a digital commons environment, using agile methods, in close alignment with the team developing the AI-readiness criteria and Bridge2AI data production teams. Work was initially based on an existing provenance-aware framework for clinical machine learning. We incrementally added RO-Crate data+metadata packaging and exchange methods, client-side packaging support, provenance visualization, and support metadata mapped to the AI-readiness criteria, with automated AI-readiness evaluation. LinkML semantic enrichment and Croissant ML-ecosystem translations were also incorporated. Results: The FAIRSCAPE framework generates, packages, evaluates, and manages critical pre-model AI-readiness and explainability information with descriptive metadata and deep provenance graphs for biomedical datasets. It provides ethical, schema, statistical, and semantic characterization of dataset releases, licensing and availability information, and an automated AI-readiness evaluation across all 28 AI-readiness criteria. We applied this framework to successive, large-scale releases of multimodal datasets, progressively increasing dataset AI-readiness to full compliance. Conclusion: FAIRSCAPE enables AI-readiness in biomedical datasets using standard metadata components and has been used to establish this pattern across a major, multimodal NIH data generation program. It eliminates early-stage opacity apparent in many biomedical AI applications and provides a basis for establishing end-to-end AI explainability.
bioinformatics · 2026-03-04 · v4 · PopGenAgent: Tool-Aware, Reproducible, Report-Oriented Workflows for Population Genomics
Su, H.; Long, W.; Feng, J.; Hou, Y.; Zhang, Y.
Abstract
Population-genetic inference routinely requires coordinating many specialized tools, managing brittle file formats, iterating through diagnostics, and converting intermediate results into interpretable figures and written summaries. Although workflow frameworks improve reproducibility, substantial last-mile effort remains for parameterization, troubleshooting, and report preparation. Here we present PopGenAgent, a turnkey, report-oriented delivery system that packages a curated library of population-genetics toolchains into validated execution and visualization templates with standardized I/O contracts and full provenance capture. PopGenAgent separates retrieval-grounded user assistance for interpretation and write-up from conservative, template-driven execution that emphasizes auditable commands, artefact integrity checks, and report-ready figure generation. To control operating cost, an economical language model is used for template selection, parameter instantiation, and minor repairs, while higher-capacity models can be invoked selectively for narrative report generation grounded in recorded artefacts. We evaluate PopGenAgent on a broad panel of routine and advanced tasks spanning preprocessing, population structure analysis, and allele-sharing statistics, and we further demonstrate end-to-end replication of standard analyses on 26 populations from the 1000 Genomes Project, reproducing canonical summaries including ROH/heterozygosity profiles, LD decay, PCA, ADMIXTURE structure, TreeMix diagnostics, and f-statistics. Together, these results indicate that a validated template library coupled with provenance-aware reporting can substantially reduce manual scripting and coordination overhead while preserving reproducibility and step-level inspectability for population-genomic studies.
bioinformatics · 2026-03-04 · v1 · Uncovering Latent Structure in Gliomas Using Multi-Omics Factor Analysis
Carvalho, C. G.; Carvalho, A. M.; Vinga, S.
Abstract
Background: Gliomas are the most common malignant brain tumors in adults, characterized by a poor prognosis. Although the current World Health Organization (WHO) classification provides clear guidelines for classifying oligodendroglioma, astrocytoma, and glioblastoma patients, significant heterogeneity persists within each class, limiting the effectiveness of current treatment strategies. With the increase of large-scale multi-omics datasets due to advancements in sequencing technologies, and online databases that provide them, such as The Cancer Genome Atlas (TCGA), it is now possible to investigate these tumors at multiple molecular levels. Methods: In this work, we apply integrative multi-omics analysis to explore the interplay between genomic (mutations), epigenomic (DNA methylation), and transcriptomic (mRNA and miRNA) layers. Our approach relies on Multi-Omics Factor Analysis (MOFA), a Bayesian latent factor analysis model designed to capture sources of variation across different omics types. Results: Our results highlight distinct molecular profiles across the three glioma types and identify potential relationships between methylation and genetic expression. In particular, we uncover novel candidate biomarkers with prognostic value, as well as a transcriptional profile associated with neural system development. Conclusions: These findings may contribute to more personalized therapeutic strategies, potentially enhancing treatment effectiveness and improving survival outcomes for this disease.
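The latent-factor idea behind MOFA can be caricatured with a single factor fit to one matrix by alternating least squares; MOFA itself is a Bayesian model inferring multiple factors jointly across several omics matrices, so this is only a toy:

```python
def rank1_factor(Y, iters=50):
    """Fit Y[i][j] ~ w[i] * z[j] by alternating least squares:
    w holds per-sample factor values and z per-feature loadings,
    a one-factor caricature of latent factor models like MOFA."""
    n, m = len(Y), len(Y[0])
    z = [1.0] * m
    for _ in range(iters):
        zz = sum(v * v for v in z)
        w = [sum(Y[i][j] * z[j] for j in range(m)) / zz for i in range(n)]
        ww = sum(v * v for v in w)
        z = [sum(Y[i][j] * w[i] for i in range(n)) / ww for j in range(m)]
    return w, z

# Exact rank-1 data: Y[i][j] = a[i] * b[j], so the fit is perfect.
a, b = [1.0, 2.0, 3.0], [2.0, 0.5]
Y = [[ai * bj for bj in b] for ai in a]
w, z = rank1_factor(Y)
residual = sum((Y[i][j] - w[i] * z[j]) ** 2
               for i in range(3) for j in range(2))
```

In the multi-omics setting, the per-sample factor values w are shared across the mutation, methylation, mRNA, and miRNA matrices while each matrix gets its own loadings, which is what lets the factors capture cross-omics sources of variation.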
bioinformatics · 2026-03-04 · v1 · LLMsFold: Integrating Large Language Models and Biophysical Simulations for De Novo Drug Design
Waththe Liyanage, W. W.; Bove, F.; Righelli, D.; Romano, S.; Visone, R.; Iorio, M. V.; Lio, P.; Taccioli, C.
Abstract
The discovery of novel small molecules is challenging because of the vastness of chemical space and the complexity of protein-ligand interactions, leading to low success rates and time-consuming workflows. Here, we present LLMsFold, a computational framework that combines Large Language Models (LLMs) and biophysical foundation tools to design and validate new small molecules targeting pathogenic proteins. The pipeline starts by identifying viable binding pockets on a target protein through geometry-based pocket detection. A 70-billion-parameter transformer model from the LlaMA family then generates candidate molecules as SMILES strings under prompt constraints that enforce drug-likeness. Each molecule is evaluated by Boltz-2, a diffusion-based model for protein-ligand co-folding that predicts bound 3D structure and binding affinity. Promising candidates are iteratively optimized through a reinforcement learning loop that prioritizes high predicted affinity and synthetic accessibility. We demonstrate the approach on two challenging targets: ACVR1 (Activin A Receptor Type 1), implicated in fibrodysplasia ossificans progressiva (FOP), and CD19, a surface antigen expressed on most B-cell lymphoma and leukemia cells. Top candidates show strong in silico binding predictions and favorable drug-like profiles. All code and models are made available to support reproducibility and further development.
bioinformatics · 2026-03-04 · v1 · Deciphering the links between metabolism and health by building small-scale knowledge graphs: application to endometriosis and persistent pollutants
Mathe, M.; Laisney, G.; Filangi, O.; Giacomoni, F.; Delmas, M.; Cano-Sancho, G.; Jourdan, F.; Frainay, C.
Abstract
Knowledge graphs (KGs) are a robust formalism for structuring biomedical knowledge, but large-scale KGs often require complex queries, are difficult for non-experts to explore, and lack real-world context (such as experimental data, clinical conditions, and patient symptoms). This limits their usability for addressing specific research questions. We present Kg4j, a computational framework built on FORVM (a large-scale KG containing 82 million compound-biological concept associations), that constructs local, keyword-based sub-graphs tailored to address biomedical research questions. The resulting graphs support hypothetical relationships and can integrate experimental datasets, enabling the discovery of plausible but as-yet-unknown connections. Starting from a conceptual definition of a research field of interest (e.g., disease, symptoms, exposure), the framework extracts relevant associations from FORVM and identifies potential biological mechanisms and chemical compounds. We applied this approach to endometriosis, exploring links between exposure to Persistent Organic Pollutants (POPs) and disease risk. We propose a novel validation strategy comparing the resulting sub-graph (2,706 nodes and 23,243 edges, 0.002% of FORVM) with recent scientific literature, showing consistency with known findings while also revealing new hypothetical associations requiring further investigation. We also show that removing duplicated nodes and edges from the KG improves the proportion of validated nodes (from 8.4% to 16%) and doubles the precision (from 0.085 to 0.197) while maintaining the recall (0.954 to 0.952), illustrating a trade-off between the loss of potentially relevant but redundant information and the reliability of the remaining associations. By combining automated knowledge mining with experimental data integration, this framework supports reproducible, context-based exploration of biomedical knowledge and systematic hypothesis generation.
Applied to endometriosis, it highlights potential mechanisms linking exposure to POPs to the aetiology of the disease, offering a scalable strategy for constructing disease-specific KGs.
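The deduplication effect reported above (precision roughly doubling while recall holds) can be reproduced in miniature; the gene names below are invented placeholders, not nodes from the actual sub-graph:

```python
def precision_recall(predicted, validated):
    """Node-level precision/recall of a knowledge sub-graph
    against a literature-validated reference set. Duplicate
    predicted nodes dilute precision but leave recall untouched,
    since recall only asks which validated nodes were recovered."""
    hits = sum(1 for p in predicted if p in validated)
    precision = hits / len(predicted)
    recall = len(set(predicted) & validated) / len(validated)
    return precision, recall

validated = {"TNF", "IL6", "ESR1"}                      # hypothetical reference
raw = ["TNF", "TNF", "IL6", "CYP1A1", "CYP1A1", "CYP1A1"]  # duplicated KG nodes
dedup = list(dict.fromkeys(raw))                        # duplicates removed
p_raw, r_raw = precision_recall(raw, validated)
p_dd, r_dd = precision_recall(dedup, validated)
# precision rises after dedup; recall is unchanged
```

The same arithmetic explains the paper's trade-off: dedup discards redundant (possibly informative) entries, but every unit of discarded redundancy that was unvalidated raises the reliability of what remains.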
bioinformatics · 2026-03-04 · v1 · Formalized scientific methodology enables rigorous AI-conducted research across domains
Zhang, Y.; Zhao, J.
Abstract
We formalize scientific methodology, the end-to-end process from question formulation to evidence-grounded writing, as a phase-gated research protocol with explicit return paths and persistent constraints, and instantiate it for general-purpose language models as executable protocol specifications. The formalization decomposes methodology into three complementary layers: a procedural workflow, an integrity discipline, and project governance. Encoded as protocol and activated across the lifecycle, these constraints externalize planning and verification artifacts and make integrity-relevant interventions auditable. We validate the approach in six end-to-end projects, including a matched controlled study, where the same agent produced two complete papers with and without the protocol. Across domains, the protocol-constrained agent produced evidence-backed, auditable research outputs - including closed-form derivations, quantitative ablations that resolve modeling design choices, and algorithmic refactors that preserve the objective while changing the computational primitive. In population-genomic applications, it also recovered well-studied biological signals as validity checks, including known admixture targets in the 1000 Genomes Project and Neanderthal-introgressed immune loci on chromosome 21 consistent with prior catalogs. In the controlled study, the protocol-free baseline could still produce a complete manuscript, but integrity-relevant risks were easier to introduce and harder to detect when constraints and artifacts were absent.
bioinformatics · 2026-03-04 · v1 · T cell-Macrophage Interactions Potentially Influence Chemotherapeutic Response in Ovarian Cancer Patients
Hameed, S. A.; Kolch, W.; Zhernovkov, V.
Abstract
Tumor development and progression involve complex cell-cell interactions and dynamic co-evolution between cancer cells, immune cells, and stromal cells in the tumour microenvironment, and this may influence therapeutic resistance. A large proportion of this network relies on direct physical interactions between cells, particularly T-cell mediated interactions. Cell-cell communication inference has now become routine in downstream scRNAseq analysis, but it mostly fails to capture physical cell-cell interactions due to tissue dissociation. Doublets occur naturally in scRNA-seq and are usually excluded from analysis. However, they may represent directly interacting cells that remained undissociated during library preparation. In the present study, we uncover the physical interaction landscape of the ovarian tumour microenvironment using scRNAseq datasets from 13 treatment-naive ovarian cancer patients. Focusing on T-cell-Macrophage (T-Mac) interaction doublets, we reveal the modulatory effect of macrophages on T cells and the potential influence of this interaction on therapeutic response. Our findings show that T-Macs from resistant patients are functionally polarized to the M2 phenotype and engage T cells to induce T-cell exhaustion. In contrast, T-Macs from sensitive patients are predominantly of the M1 polarized phenotype, physically engaging T cells that lack exhaustion signatures. We also demonstrate that T cells and macrophages in T-Mac doublets interact primarily for the purpose of antigen presentation, with the enrichment of several ligand-receptor pairs involved in TCR-MHC interactions and immune synapse formation. We partly validated some of these findings using a spatial transcriptomics dataset of ovarian cancer patients from a separate cohort.
bioinformatics · 2026-03-04 · v1 · Direct pathway enrichment prediction from histopathological whole slide images and comparison with gene expression mediated models
Jabin, A.; Ahmad, S.
Abstract
Molecular profiling of tumours via RNA sequencing (RNA-seq) enables clinically actionable stratification but remains costly, tissue-intensive, and time-consuming. Recent advances in computational pathology suggest that routine H&E whole-slide images (WSIs) can be utilized to estimate transcriptomic states of cancer cells. Given that WSI-derived predictions of transcriptional signatures are noisy, their use for accurate biological interpretation faces challenges. On the other hand, pathway enrichment analysis has routinely been used to describe biologically meaningful cellular states from noisy gene expression data, and some studies have evaluated the ability of WSI-predicted gene expression profiles to reconstruct enriched pathways in experiments where the two data modalities were concurrently available. However, it remains unclear whether a model designed to predict enriched pathways directly from WSI samples would outperform the current approach of first predicting gene expression. Here, we develop and evaluate these two complementary approaches for predicting pathway enrichment profiles from WSIs in TCGA Breast Invasive Carcinoma (TCGA-BRCA), training parallel models: those which predict pathway enrichment directly from image features, and those which rely on predicted gene expression profiles, the current state of the art. Our results suggest that under controlled experiments, direct prediction of a selected pool of enriched pathways outperforms models trained to predict gene expression and then infer enrichment from the predicted expression values. These findings will be helpful in prioritizing the goals of predictive modeling of WSIs and improving diagnostic outcomes for cancer patients.
bioinformatics · 2026-03-04 · v1 · Towards Useful and Private Synthetic Omics: Community Benchmarking of Generative Models for Transcriptomics Data
Öztürk, H.; Afonja, T.; Jälkö, J.; Binkyte, R.; Rodriguez-Mier, P.; Lobentanzer, S.; Wicks, A.; Kreuer, J.; Ouaari, S.; Pfeifer, N.; Menzies, S.; Pentyala, S.; Filienko, D.; Golob, S.; McKeever, P.; Banerjee, J.; Foschini, L.; De Cock, M.; Saez-Rodriguez, J.; Fritz, M.; Stegle, O.; Honkela, A.
Abstract
Background: The synthesis of anonymized data derived from real-world cohorts offers a promising strategy for regulatory-compliant and privacy-preserving biological data sharing, potentially facilitating model development that can improve predictive performance. However, the extent to which generative models can preserve biological signals while remaining resilient to adversarial privacy attacks in high-dimensional omics contexts remains underexplored. To address this gap, the CAMDA 2025 Health Privacy Challenge launched a community-driven effort to systematically benchmark synthetic and privacy-preserving data generation for bulk RNA-seq cohorts. Results: Building on this initiative, we systematically benchmarked 11 generative methods across two cancer cohorts (~1,000 and ~5,000 patients) over 978 landmark genes. Methods were evaluated across complementary axes of distributional fidelity, downstream utility, biological plausibility and empirical privacy risk, with emphasis on trade-offs between vulnerability to membership inference attacks (MIA) and other evaluation dimensions. Expressive deep generative models achieved strong predictive utility and differential expression recovery, but were often more vulnerable to membership inference risk. Differentially private methods improved resistance to attacks at the cost of reduced utility, while simpler statistical approaches offered competitive utility with moderate privacy risk and fast training. Conclusions: Synthetic bulk RNA-seq quality is inherently multi-dimensional and shaped by trade-offs between utility, biological preservation and privacy. Our results indicate that differences in model architecture drive distinct trade-offs across these axes, suggesting that model choice should align with dataset characteristics, intended downstream use and privacy requirements. Privacy risk should also be assessed using multiple complementary attack methods and, where possible, formal differential privacy protection.
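A common membership inference attack in this setting scores each record by its distance to the closest synthetic sample; a minimal sketch of that general attack idea, not any specific method used in the challenge:

```python
def mia_scores(records, synthetic):
    """Distance-to-closest-record membership inference: a record
    whose nearest synthetic sample is unusually close is flagged
    as a likely training member. Generators that overfit (or
    memorize) their training data leak membership this way."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    # higher score = more member-like (negated nearest distance)
    return [-min(dist(r, s) for s in synthetic) for r in records]

# Contrived leak: the "generator" memorized its training set.
train = [[0.0, 0.0], [1.0, 1.0]]
holdout = [[5.0, 5.0], [6.0, 4.0]]
synthetic = list(train)
scores = mia_scores(train + holdout, synthetic)
# training members score 0 (exact matches); holdouts score below 0
```

In the benchmark, attack success like this is traded off against distributional fidelity and downstream utility, which is why differentially private generators score better on the privacy axis and worse on the others.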
bioinformatics · 2026-03-04 · v1 · A comprehensive benchmark of discrepancies across microbial genome reference databases
Boldirev, G.; Aguma, P.; Munteanu, V.; Koslicki, D.; Alser, M.; Zelikovsky, A.; Mangul, S.
Abstract
Metagenomic analysis of microbial communities relies significantly on the quality and completeness of reference genomes, which allow researchers to compare sequencing reads against reference genome collections to reveal essential community characteristics. However, the reliability of these analyses is often compromised by substantial discrepancies across existing reference resources, including differences in genome content, assembly fragmentation, taxonomic representation, and metadata completeness. While these inconsistencies are known to introduce bias, the extent of divergence between major databases remains largely unknown. Here, we present a comprehensive benchmark of discrepancies across multiple widely used microbial genome reference resources. We developed the Cross-DB Genomic Comparator (CDGC), which utilizes reference genome alignments to systematically capture discrepancies in genome assemblies across reference databases. Applying this framework, we found that 99% of viral genomes were identical across databases, indicating strong consistency in viral reference resources. In contrast, fungal genomes showed substantially greater variability: although 82% of assemblies exhibited at least 90% similarity, only 7% were identical across databases. More concerning, we identified a subset of 461 assemblies with less than 50% similarity, suggesting the presence of technical artifacts, incomplete assemblies, or damaged genome files that require closer examination. Collectively, these results demonstrate that systematic cross-database benchmarking provides a critical mechanism for refining the accuracy of individual reference databases and advancing efforts towards more unified and reliable universal reference genomes.
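Assembly similarity can also be approximated without alignment via k-mer set overlap; a hedged sketch using the Jaccard index as a cheap proxy (CDGC itself relies on whole-genome alignments):

```python
def kmer_jaccard(seq_a, seq_b, k=8):
    """Jaccard index over k-mer sets: an alignment-free proxy
    for how similar two genome assemblies are. Identical
    sequences score 1.0; unrelated sequences score near 0."""
    def kmers(s):
        return {s[i:i + k] for i in range(len(s) - k + 1)}
    a, b = kmers(seq_a), kmers(seq_b)
    return len(a & b) / len(a | b) if a | b else 1.0

genome_1 = "ACGTACGTTTGCAACGTACGGA"
genome_2 = "ACGTACGTTTGCAACGTACGGA"          # identical copy, as for viral refs
genome_3 = "ACGTACGTTTGCAACGTACGGA"[::-1]    # scrambled copy: few shared k-mers
same = kmer_jaccard(genome_1, genome_2)      # 1.0
diff = kmer_jaccard(genome_1, genome_3)      # strictly below 1.0
```

Sweeping such a similarity score across databases is what surfaces the paper's buckets: 99% identical viral genomes, 82% of fungal assemblies above 90% similarity, and the 461 assemblies below 50%.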
bioinformatics · 2026-03-04 · v1 · gSV: a general structural variant detector using the third-generation sequencing data
HAO, J.; Shi, J.; Lian, S.; Zhang, Z.; Luo, Y.; Hu, T.; Ishibashi, T.; Wang, D.; Wang, S.; Fan, X.; Yu, W.
Abstract
Structural variants (SVs) are of increasing significance with the advancement of the third-generation sequencing technologies, but detecting complex SVs is still challenging for existing SV detection tools. In this paper, we propose gSV, a general SV detector that integrates alignment-based and assembly-based approaches with the maximum exact match (MEM) strategy. Without predefined assumptions about SV types, gSV captures all potential variant signals, enabling the detection of SVs with complex alignment patterns that are usually missed by other tools. Evaluations using both simulated and real datasets demonstrate that gSV outperforms state-of-the-art tools in detecting both simple and complex SVs. Unique SV discoveries in four breast cancer cell lines, particularly in cancer-associated genes, validate the clinical utility of gSV. The application in a breast cancer cohort from the Chinese population further illustrates the usefulness of our new tool in genomic studies.
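The maximum exact match (MEM) strategy mentioned above can be illustrated with a naive enumerator; real callers such as gSV use indexed MEM queries rather than this quadratic scan, and the deletion example is invented:

```python
def maximal_exact_matches(a, b, min_len=4):
    """Naive O(len(a) * len(b)) enumeration of maximal exact
    matches (MEMs): exact matches that cannot be extended left
    or right. Gaps between consecutive MEMs on a read are the
    raw signal from which SV breakpoints are inferred."""
    mems = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue  # not left-maximal: inside a longer match
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k >= min_len:
                mems.append((i, j, k))
    return mems

# A toy "deletion": the read is the reference with its middle removed.
ref = "AAACCCGGGTTT"
read = "AAACCCTTT"
mems = maximal_exact_matches(ref, read)
```

Here only the shared prefix survives as a MEM of the minimum length; the reference bases it fails to cover hint at the deleted segment, which is the kind of alignment pattern an SV caller then classifies.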
bioinformatics2026-03-04v1Rigidity-Aware Geometric Pretraining for Protein Design and Conformational Ensembles
Ni, Z.; Li, Y.; Qiu, Z.; Schölkopf, B.; Guo, H.; Liu, W.; Liu, S.Abstract
Generative models have recently advanced de novo protein design by learning the statistical regularities of natural structures. However, current approaches face three key limitations: (1) Existing methods cannot jointly learn protein geometry and design tasks, where pretraining can be a solution; (2) Current pretraining methods mostly rely on local, non-rigid atomic representations for downstream property-prediction tasks, limiting global geometric understanding for protein generation tasks; and (3) Existing approaches have yet to effectively model the rich dynamic and conformational information of protein structures. To overcome these issues, we introduce RigidSSL (Rigidity-Aware Self-Supervised Learning), a geometric pretraining framework that front-loads geometry learning prior to generative finetuning. Phase I (RigidSSL-Perturb) learns geometric priors from 432K structures from the AlphaFold Protein Structure Database with simulated perturbations. Phase II (RigidSSL-MD) refines these representations on 1.3K molecular dynamics trajectories to capture physically realistic transitions. Underpinning both phases is a bi-directional, rigidity-aware flow matching objective that jointly optimizes translational and rotational dynamics to maximize mutual information between conformations. Empirically, RigidSSL variants improve designability by up to 43% while enhancing novelty and diversity in unconditional generation. Furthermore, RigidSSL-Perturb improves the success rate by 5.8% in zero-shot motif scaffolding and RigidSSL-MD captures more biophysically realistic conformational ensembles in G protein-coupled receptor modeling. The code is available at: https://github.com/ZhanghanNi/RigidSSL.git.
bioinformatics2026-03-04v1EvoStructCLIP: A Mutation-Centered Multimodal Embedding Model for CAGI7 Variant Effect Prediction
Chung, K.; Lee, J.; Kim, Y.; Lee, J.; Park, J.; Lee, H.Abstract
We present EvoStructCLIP, a mutation-centered multimodal embedding model that integrates local 3D structural windows and evolutionary constraints to predict missense variant effects. EvoStructCLIP combines two encoders: a structure voxel encoder derived from AlphaFold residue neighborhoods and an MSA-based evolutionary encoder. It aligns the modalities through CLIP-style contrastive learning, with FuseMix regularization and an auxiliary pathogenicity loss trained on 153,787 ClinVar variants. Evaluations using lightweight regressors demonstrate that EvoStructCLIP embeddings capture highly transferable predictive signals across diverse phenotypes, including gene-specific functional readouts of BRCA1, KCNQ4, and PTEN/TPMT. This transferability is further supported in the CAGI7 blind competition setting, where models generalized to predicting different gene-specific readouts for BARD1, FGFR, and TSC2 without target-specific retraining and achieved competitive performance across heterogeneous biological tasks.
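The CLIP-style contrastive alignment mentioned above can be illustrated with a minimal sketch. This is purely illustrative — the embedding dimensions, noise levels, and function names are assumptions, not EvoStructCLIP's actual implementation: both modality embeddings are L2-normalized and scored by cosine similarity, so that matched structure/evolution pairs of the same variant score higher than mismatched pairs.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit sphere so dot products equal cosine similarity.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def contrastive_scores(struct_emb, evo_emb, temperature=0.07):
    # Pairwise cosine-similarity logits between two modality batches,
    # as used in CLIP-style contrastive objectives.
    s = l2_normalize(struct_emb)
    e = l2_normalize(evo_emb)
    return (s @ e.T) / temperature

rng = np.random.default_rng(0)
base = rng.normal(size=(4, 16))
struct = base + 0.05 * rng.normal(size=(4, 16))   # "structure view" of 4 variants
evo = base + 0.05 * rng.normal(size=(4, 16))      # "evolutionary view" of the same 4
logits = contrastive_scores(struct, evo)
# Matched pairs lie on the diagonal and should dominate each row.
print(logits.argmax(axis=1))
```

In a CLIP-style objective, these logits would feed a symmetric cross-entropy loss so that each row's diagonal entry is pushed above its off-diagonal entries.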
bioinformatics2026-03-04v1MiGenPro: A linked data workflow for phenotype-genotype prediction of microbial traits using machine learning.
Loomans, M.; Suarez-Diez, M.; Schaap, P. J.; Saccenti, E.; Koehorst, J. J.Abstract
The availability of microbial genomic data and the development of machine learning methods have created a unique opportunity to establish associations between genetic information and phenotypes. Here, we introduce a computational workflow for Microbial Genome Prospecting (MiGenPro) that combines phenotypic and genomic information. MiGenPro is a workflow for training machine learning models that predict microbial traits from annotated genomes. Microbial genomes were consistently annotated, and the resulting features were stored in a semantic framework that is easy to query using SPARQL. These data were used to train machine learning models that successfully predicted microbial traits such as motility, Gram stain, optimal temperature range, and sporulation capability. To ensure robustness, a hyperparameter halving grid search was used to determine optimal parameter settings, followed by five-fold cross-validation, which demonstrated consistent model performance across iterations without overfitting. Effectiveness was further validated through comparison with existing models, showing comparable accuracy, with modest variations attributed to differences in datasets rather than methodology. Classification results can be further explored using feature importance characterisation to identify biologically relevant genomic features. MiGenPro provides an easy-to-use, interoperable workflow to build and validate models that predict phenotypes of microbes based on their annotated genomes.
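The halving grid search mentioned above can be sketched in pure Python. This is a simplified illustration of the successive-halving idea, not MiGenPro's implementation (which presumably uses a standard library routine): all candidate parameter settings are scored on a small resource budget, the worse half is discarded, and survivors are re-scored on progressively larger budgets.

```python
def successive_halving(candidates, evaluate, budgets):
    """Keep halving the candidate pool, re-evaluating survivors
    on progressively larger resource budgets."""
    pool = list(candidates)
    for budget in budgets:
        scored = sorted(pool, key=lambda c: evaluate(c, budget), reverse=True)
        keep = max(1, len(scored) // 2)   # retain the better half
        pool = scored[:keep]
    return pool[0]

# Toy objective: score peaks at max_depth=8; larger budgets merely
# sharpen the estimate (here, a small additive bonus).
def toy_score(params, budget):
    return -abs(params["max_depth"] - 8) + 0.01 * budget

grid = [{"max_depth": d} for d in (2, 4, 6, 8, 10, 12, 14, 16)]
best = successive_halving(grid, toy_score, budgets=[10, 100, 1000])
print(best)   # → {'max_depth': 8}
```

The appeal over a plain grid search is that most of the evaluation budget is spent on the few configurations that survive the early rounds.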
bioinformatics2026-03-03v2Characterizing and Mitigating Protocol-Dependent Gene Expression Bias in 3' and 5' Single-Cell RNA Sequencing
Shydlouskaya, V.; Haeryfar, S. M. M.; Andrews, T. S.Abstract
Single-cell RNA sequencing (scRNA-seq) has enabled large-scale characterization of cellular heterogeneity; yet, integrating datasets generated through different library preparation protocols remains challenging. For instance, comparisons between 10X Genomics 3' and 5' chemistries are complicated by protocol-dependent technical biases imposed by differences in transcript end capture and amplification. While normalization, and often batch correction, is an integral step in preprocessing scRNA-seq datasets, it remains unclear which correction is most appropriate, or even necessary, for reliable cross-protocol comparisons. Here, we systematically characterize protocol-related expression differences using 35 matched donors across six tissues profiled with both 3' and 5' scRNA-seq approaches. We find that gene expression discrepancies are not pervasive across the whole transcriptome, but driven instead by a relatively small, reproducible subset of protocol-biased genes. Excluding these genes improves cross-protocol concordance, indicating that most genes are directly comparable without aggressive correction. We then benchmark commonly employed normalization approaches and show that while several methods, such as fastMNN, improve statistical alignment when cell populations are well matched, they can distort gene-level signals and inflate differential expression in biologically realistic settings with incomplete cell-type overlap. Taken together, our results demonstrate that protocol bias between 3' and 5' scRNA-seq is limited in scope and that targeted handling of a small set of biased genes presents an alternative approach to normalization or batch correction strategies. This work provides a practical guideline for integrating 3' and 5' scRNA-seq data and highlights the importance of matching normalization strategies to the structure of technical variation and the intended downstream analyses.
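The targeted-exclusion strategy described above can be sketched as follows. This is an illustrative toy with made-up thresholds and simulated counts, not the study's actual pipeline: compute a per-gene log fold-change between protocol-level pseudobulk profiles and drop the small set of genes exceeding a bias threshold before cross-protocol comparison.

```python
import numpy as np

def protocol_biased_genes(expr_3p, expr_5p, lfc_threshold=1.0, eps=1.0):
    """Flag genes whose mean expression differs strongly between
    3' and 5' pseudobulk profiles (|log2 fold-change| > threshold)."""
    mean_3p = expr_3p.mean(axis=0)
    mean_5p = expr_5p.mean(axis=0)
    lfc = np.log2(mean_3p + eps) - np.log2(mean_5p + eps)
    return np.abs(lfc) > lfc_threshold

rng = np.random.default_rng(1)
n_cells, n_genes = 200, 50
base = rng.poisson(20, size=(n_cells, n_genes)).astype(float)
expr_3p = base.copy()
expr_5p = base.copy()
expr_5p[:, :5] *= 8            # simulate 5 protocol-biased genes
biased = protocol_biased_genes(expr_3p, expr_5p)
print(biased.sum())            # → 5
```

Downstream comparisons would then use `expr_3p[:, ~biased]` and `expr_5p[:, ~biased]`, leaving the bulk of the transcriptome untouched rather than applying a global correction.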
bioinformatics2026-03-03v1selscape: A Snakemake Workflow for Investigating Genomic Landscapes of Natural Selection
Chen, S.; Huang, X.Abstract
Analyzing natural selection is a central task in evolutionary genomics, yet applying multiple tools across populations in a reproducible and scalable manner is often complicated by heterogeneous input formats, parameter settings, and tool dependencies. Here, we present selscape, a Snakemake workflow that automates end-to-end genome-wide selection analysis--from input preparation and statistic calculation to functional annotation, downstream visualization, and summary reporting. We demonstrate selscape on high-coverage genomes from the 1000 Genomes Project, illustrating how the workflow enables efficient, large-scale analyses and streamlined comparisons across populations. By unifying diverse tools with Snakemake, selscape lowers the barrier to robust genome-wide analyses and provides a flexible framework for future extensions and integration with complementary population genetic analyses.
bioinformatics2026-03-03v1The limits of Bayesian estimates of divergence times in measurably evolving populations
Ivanov, S.; Fosse, S.; dos Reis, M.; Duchene, S.Abstract
Bayesian inference of divergence times for extant species using molecular data is an unconventional statistical problem: divergence times and molecular rates are confounded, and only their product, the molecular branch length, is statistically identifiable. This means we must use priors on times and rates to resolve this non-identifiability. As a consequence, there is a lower bound on the uncertainty that can be attained under infinite data for estimates of evolutionary timescales using the molecular clock. With infinite data (i.e., an infinite number of sites and loci in the alignment), uncertainty in the ages of nodes in phylogenies increases proportionally with their mean age, such that older nodes have higher uncertainty than younger nodes. On the other hand, if extinct taxa are present in the phylogeny, and if their sampling times are known (i.e., 'heterochronous' data), then times and rates are identifiable and uncertainties of inferred times and rates go to zero with infinite data. However, in real heterochronous datasets (such as viruses and bacteria), alignments tend to be small, and how much uncertainty is present and how it can be reduced as a function of data size are questions that have not been explored. This is clearly important for our understanding of the tempo and mode of microbial evolution using the molecular clock. Here we conducted extensive simulation experiments and analyses of empirical data to develop the infinite-sites theory for heterochronous data. Contrary to expectations, we find that uncertainty in the ages of internal nodes scales positively with the distance to their closest tip with known age (i.e., calibration age), not their absolute age. Our results also demonstrate that estimation uncertainty decreases with calibration age more slowly in data sets with more, rather than fewer, site patterns, although overall uncertainty is lower in the former.
Our statistical framework establishes the minimum uncertainty that can be attained with perfect calibrations and sequence data that are effectively infinitely informative. Finally, we discuss the implications for viral sequence data sets. In a vast majority of cases viral data from outbreaks is not sufficiently informative to display infinite-sites behaviour and thus all estimates of evolutionary timescales will be associated with a degree of uncertainty that will depend on the size of the data set, its information content, and the complexity of the model. We anticipate that our framework is useful to determine such theoretical limits in empirical analyses of microbial outbreaks.
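The rate-time confounding described above can be stated compactly. This is a standard formulation given here for orientation; the notation is ours, not the paper's: the likelihood depends on the rate r and time t of a branch only through its expected length b = rt, so for any c > 0 the rescaled pair (cr, t/c) fits the data exactly as well, and only the priors can separate the two factors.

```latex
% Likelihood depends on (r, t) only through the branch length b = rt:
L(D \mid r, t) = L(D \mid b), \qquad b = r\,t .

% Hence the likelihood is invariant under rescaling, for any c > 0:
L(D \mid c\,r,\; t/c) = L(D \mid r, t),

% and the posterior separates the identifiable and prior-driven parts:
p(r, t \mid D) \;\propto\; L(D \mid r\,t)\, p(r, t).
```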
bioinformatics2026-03-03v1A comprehensive assessment of tandem repeat genotyping methods for Nanopore long-read genomes
Aliyev, E.; Avvaru, A.; De Coster, W.; Arner, G. M.; Nyaga, D. M.; Gibson, S. B.; Weisburd, B.; Gu, B.; Gonzaga-Jauregui, C.; 1000 Genomes Long-Read Sequencing Consortium, ; Chaisson, M. J. P.; Miller, D. E.; Ostrowski, E.; Dashnow, H.Abstract
Background Tandem repeats (TRs) play critical roles in human disease and phenotypic diversity but are among the most challenging classes of genomic variation to measure accurately. While it is possible to identify TR expansions using short-read sequencing, these methods are limited because they often cannot accurately determine repeat length or sequence composition. Long-read sequencing (LRS) has the potential to accurately characterize long TRs, including the identification of non-canonical motifs and complex structures. However, while there are an increasing number of genotyping methods available, no systematic effort has been undertaken to evaluate their length and sequence-level accuracy, performance across motifs from STRs to VNTRs and across allele lengths, and, critically, how usable these tools are in practice. Results We reviewed 25 available bioinformatic tools, and selected seven that are actively maintained for benchmarking using publicly available Oxford Nanopore genome sequencing data from more than 100 individuals. Our benchmarking catalog included ~43k TR loci genome-wide, selected to represent a range of simple and challenging TR loci. As no "truth" exists for this purpose, we used four complementary strategies to assess accuracy: concordance with high-quality haplotype-resolved Human Pangenome Reference Consortium (HPRC) assemblies, Mendelian consistency in Genome in a Bottle trios, cross-tool consistency, and sensitivity in individuals with pathogenic TR expansions confirmed by molecular methods. For all comparisons, we assess both total allele length and full sequence similarity using the Levenshtein distance. We also evaluated installation, documentation, computational requirements, and output characteristics to reflect real-world use. We provide a complete analysis workflow for all tools to support community reuse. Tool performance varied substantially across both accuracy and usability. 
Most methods achieved high concordance with HPRC assemblies, with higher accuracy when using the R10 ONT pore chemistry. Accuracy generally declined with increasing allele length, and most tools performed worse on homopolymers, likely reflecting underlying sequencing accuracy. Tools generally performed worse at heterozygous loci and at alleles that differed from the reference genome. Interestingly, concordance with assembly in population samples did not predict sensitivity to pathogenic expansions, with different genotypers performing best in each category. Similarly, Mendelian consistency was highest in the tool that performed worst in assembly concordance. Conclusions No single genotyper emerged as consistently best across all assessments, but strong contenders emerged in each. Our results demonstrate that length accuracy (a typical benchmarking approach) alone overestimates TR genotyping performance. Sequence-level benchmarking is essential for selecting tools best-suited for population studies and clinical diagnostics. This work provides practical guidance for tool selection and highlights key priorities for future long-read TR genotyping method development.
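The sequence-level comparison used in this benchmark relies on the Levenshtein (edit) distance. A minimal sketch of the standard dynamic-programming computation (illustrative only, not any benchmarked tool's implementation):

```python
def levenshtein(a, b):
    """Minimum number of single-character edits (insertions,
    deletions, substitutions) turning string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution
        prev = curr
    return prev[-1]

# Two tandem-repeat alleles differing by one CAG unit:
print(levenshtein("CAGCAGCAG", "CAGCAG"))   # → 3
```

Unlike comparing total allele lengths, this distance penalizes motif substitutions and interruptions even when two alleles happen to have the same length, which is why the authors argue length accuracy alone overestimates genotyping performance.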
bioinformatics2026-03-03v1iGS: A Zero-Code Dual-Engine Graphical Software for Polygenic Trait Prediction
Zhang, J.; Chen, F.Abstract
Genomic selection (GS) has become the core driving force in modern plant and animal breeding. However, state-of-the-art comprehensive GS tools often rely on complex underlying environment configurations and command-line operations, posing significant technical barriers for breeders lacking programming expertise. To address this critical pain point, this study developed a fully "zero-code" graphical user interface (GUI) decision support system for genomic selection. The platform innovatively employs a "portable dual-engine architecture" (R-Portable and Python-Portable) to achieve completely dependency-free, "out-of-the-box" deployment, and integrates a standardized six-step end-to-end workflow from data quality control to result export. Furthermore, the platform comprehensively integrates 33 cutting-edge prediction models across four major paradigms (linear, Bayesian, machine learning, and deep learning) and features an original intelligent parameter configuration system that dynamically renders algorithm parameters to provide a minimalist UI interaction experience. Benchmark testing on the Wheat2000 dataset across six complex agronomic and quality traits, including thousand-kernel weight (TKW) and grain protein content (PROT), demonstrated that classic linear models remain highly robust for polygenic additive traits, while tree-based machine learning and hybrid deep learning architectures exhibit superior predictive potential and noise resilience when resolving complex epistatic effects and low-heritability traits. The successful deployment of this platform fundamentally liberates biologists from the constraints of computational science, providing robust digital infrastructure to accelerate the popularization and practical application of GS technologies in agricultural production.
bioinformatics2026-03-03v1Phenotypic Bioactivity Prediction as Open-set Biological Assay Querying
Sun, Y.; Zhang, X.; Zheng, Q.; Li, H.; Zhang, J.; Hong, L.; Wang, Y.; Zhang, Y.; Xie, W.Abstract
The traditional drug discovery pipeline is severely bottlenecked by the need to design and execute bespoke biological assays for every new target and compound, which is both time-consuming and prohibitively expensive. While machine learning has accelerated virtual screening, current models remain confined to "closed-set" paradigms, unable to generalize to entirely novel biological assays without target-specific experimental data. Here, we present OpenPheno, a groundbreaking multimodal foundation model that fundamentally redefines bioactivity prediction as an open-set, visual-language question-answering (QA) task. By integrating chemical structures (SMILES), universal phenotypic profiles (Cell Painting images), and natural language descriptions of biological assays, OpenPheno unlocks the highly coveted "profile once, predict many" paradigm. Instead of conducting countless target-specific wet-lab experiments, researchers only need to capture a single, low-cost Cell Painting image of a novel compound. OpenPheno then evaluates this universal phenotypic "fingerprint" against the text-based description of any unseen assay, predicting bioactivity in a zero-shot manner. On 54 entirely unseen assays, it achieves strong zero-shot performance (mean AUROC 0.75), exceeding supervised baselines trained with full labeled data, and few-shot adaptation further improves predictions. In the most stringent setting where both compounds and assays are novel, OpenPheno maintains robust generalization (mean AUROC 0.66), opening up a new paradigm for a highly scalable, cost-effective, and universal engine for next-generation drug discovery.
bioinformatics2026-03-03v1Large-Scale Statistical Dissection of Sequence-Derived Biochemical Features Distinguishing Soluble and Insoluble Proteins
Vu, N. H. H.; Nguyen Bao, L.Abstract
Protein solubility critically influences recombinant expression efficiency and downstream biotechnological applications. While deep learning models have improved predictive accuracy, the intrinsic magnitude, redundancy, and interpretability of classical sequence-derived determinants remain insufficiently characterized. We performed a statistically rigorous large-scale univariate analysis on a curated dataset of 78,031 proteins (46,450 soluble; 31,581 insoluble). Thirty-six biochemical descriptors were evaluated using Mann-Whitney U tests with Benjamini-Hochberg false discovery rate correction. Effect sizes were quantified using Cliff's δ, and discriminative performance was assessed by ROC-AUC. Although 34 features remained significant after correction, most exhibited small effect sizes and substantial class overlap, consistent with a weak-signal regime. The strongest effects were associated with size-related features (sequence length and molecular weight; δ ≈ -0.21), whereas charge-related descriptors, particularly the proportion of negatively charged residues (δ = 0.150; AUC = 0.575), showed consistent but modest shifts. Spearman correlation analysis revealed near-complete redundancy among major size-related variables (ρ up to 0.998). Applying a redundancy threshold (|ρ| ≥ 0.85), we derived a parsimonious composite integrating sequence length and negative charge proportion, achieving AUC = 0.624 (MCC = 0.1746). These findings demonstrate that sequence-level solubility information is intrinsically low-dimensional and governed by coordinated weak effects, establishing a transparent statistical baseline for large-scale solubility characterization.
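Cliff's δ, the effect size used above, has a simple pairwise definition that a short sketch makes concrete (illustrative code, not the study's implementation): the difference between the probability that a random value from one class exceeds a random value from the other, and the reverse.

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross-class pairs.
    Ranges from -1 (every x below every y) to +1 (every x above every y)."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))

print(cliffs_delta([1, 2, 3], [4, 5, 6]))   # → -1.0 (complete separation)
print(cliffs_delta([1, 2], [1, 2]))         # → 0.0  (complete overlap)
```

Being rank-based, δ is insensitive to monotone transformations of the descriptors, which suits the heavy-tailed distributions of features such as sequence length; the reported |δ| ≈ 0.15-0.21 values correspond to small effects on conventional scales.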
bioinformatics2026-03-03v1h5adify: neuro-symbolic metadata harmonization enables scalable AnnData integration with local large language models
Rincon de la Rosa, L.; Mouazer, A.; Navidi, M.; Degroodt, E.; Künzle, T.; Geny, S.; Idbaih, A.; Verrault, M.; Labreche, K.; Hernandez-Verdin, I.; Alentorn, A.Abstract
Background: The rapid growth of public single-cell and spatial transcriptomics repositories has shifted the main bottleneck for atlas-scale integration from data generation to metadata heterogeneity. Even when datasets are released in the AnnData H5AD format, inconsistent column naming, partial annotations, and mixed gene identifier conventions frequently prevent reproducible merging, downstream benchmarking, and reuse in foundation model training. Automated approaches that resolve semantic inconsistency while preserving biological validity are therefore essential for scalable data reuse. Results: We present h5adify, a neuro-symbolic toolkit that combines deterministic biological inference with locally deployed large language models to transform heterogeneous AnnData objects into schema-normalized, integration-ready representations. The framework performs metadata field discovery, gene identifier harmonization, optional paper-aware extraction, and consensus resolution with explicit uncertainty logging. Benchmarking four open-weight model families deployed through Ollama (Gemma, Llama, Mistral, and Qwen) demonstrates that small local models achieve high semantic accuracy in metadata resolution with low hallucination rates and modest computational requirements. In controlled simulations introducing annotation noise into single-cell and Visium-like datasets, harmonization improves integration benchmarking and reduces spurious batch effects. Application to sex-annotated glioblastoma datasets recovers biologically coherent microenvironmental patterns and cell type-specific genomic differences not explained by differential expression alone. Conclusions: Together, h5adify provides a reproducible framework for evaluating LLM-assisted biocuration and enables scalable, privacy-preserving metadata harmonization for modern single-cell atlases and foundation model pipelines. 
These results demonstrate that modular neuro-symbolic integration of deterministic biological inference and small local language models can effectively resolve semantic heterogeneity while remaining computationally accessible.
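The deterministic half of such a neuro-symbolic pipeline can be sketched as a symbolic synonym map with an explicit fallback for unresolved fields. This is a toy illustration with hypothetical field names, not h5adify's actual schema or code: columns the rules can resolve are renamed deterministically, and only the remainder would be handed to the local language model stage.

```python
# Hypothetical target schema and synonym rules for AnnData .obs columns.
CANONICAL = {
    "cell_type": {"celltype", "cell.type", "annotation", "cluster_label"},
    "sex": {"gender", "donor_sex"},
    "tissue": {"organ", "tissue_type"},
}

def harmonize_fields(columns):
    """Resolve known synonyms deterministically; return the mapping
    plus the unresolved fields that would need LLM-assisted curation."""
    lookup = {syn.lower(): canon
              for canon, syns in CANONICAL.items() for syn in syns}
    resolved, unresolved = {}, []
    for col in columns:
        key = col.lower()
        if key in lookup:
            resolved[col] = lookup[key]
        elif key in CANONICAL:
            resolved[col] = key          # already canonical
        else:
            unresolved.append(col)       # defer to the LLM stage
    return resolved, unresolved

mapping, todo = harmonize_fields(["celltype", "donor_sex", "batch_xyz"])
print(mapping)   # {'celltype': 'cell_type', 'donor_sex': 'sex'}
print(todo)      # ['batch_xyz']
```

Keeping the rule-based pass first is what makes the approach cheap and auditable: the expensive, less deterministic model is only consulted for the residue the rules cannot resolve.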
bioinformatics2026-03-03v1snputils: A High-Performance Python Library for Genetic Variation and Population Structure
Bonet, D.; Comajoan Cara, M.; Barrabes, M.; Smeriglio, R.; Agrawal, D.; Aounallah, K.; Geleta, M.; Dominguez Mantes, A.; Thomassin, C.; Shanks, C.; Huang, E. C.; Franquesa Mones, M.; Luis, A.; Saurina, J.; Perera, M.; Lopez, C.; Sabat, B. O.; Abante, J.; Moreno-Grau, S.; Mas Montserrat, D.; Ioannidis, A. G.Abstract
The increasing size and resolution of genomic and population genetic datasets offer unprecedented opportunities to study population structure and uncover the genetic basis of complex traits and diseases. The collection of existing analytical tools, however, is characterized by format incompatibilities, limited functionality, and computational inefficiencies, forcing researchers to construct fragile pipelines that chain together fragmented command-line utilities and ad hoc scripts. These are difficult to maintain, scale, and reproduce. To address such limitations, we present snputils, a Python library that unifies high-performance I/O, transformation, and analysis of genotype, ancestry, and phenotypic information within a single framework suitable for biobank-scale research. The library provides efficient tools for essential operations, including querying, cleaning, merging, and statistical analysis. In addition, it offers classical population genetic statistics with optional ancestry-specific masking. An identity-by-descent module supports reading of multiple formats, filtering and ancestry-restricted segment trimming for relatedness and demographic inference. snputils also incorporates ancestry-masking and multi-array functionalities for dimensionality reduction methods, as well as efficient implementations of admixture simulation, admixture mapping, and advanced visualization capabilities. With support for the most commonly used file formats, snputils integrates smoothly with existing tools and clinical databases. At the same time, its modular and optimized design reduces technical overhead, facilitating reproducible workflows that accelerate discoveries in population genetics, genomic research, and precision medicine. Benchmarking demonstrates a significant reduction in genotype data loading time compared to existing Python libraries. 
The open-source library is available at https://github.com/AI-sandbox/snputils, with full documentation and tutorials at https://snputils.org.
bioinformatics2026-03-03v1