Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
AI-readiness for Biomedical Data
Clark, T.; Caufield, H.; Parker, J. A.; Al Manir, S.; Amorim, E.; Eddy, J.; Gim, N.; Gow, B.; Goar, W.; Hansen, J. N.; Harris, N.; Hermjakob, H.; Joachimiak, M.; Jordan, G.; Lee, I.-H.; McWeeney, S. K.; Nebeker, C.; Nikolov, M.; Reese, J.; Shaffer, J.; Sheffield, N.; Sheynkman, G.; Stevenson, J.; Chen, J. Y.; Mungall, C.; Wagner, A.; Kong, S. W.; Ghosh, S. S.; Patel, B.; Williams, A.; Munoz-Torres, M. C.
Abstract
Biomedical research is rapidly adopting artificial intelligence (AI). Yet the inherent complexity of biomedical data preparation requires implementing actionable, robust criteria for ethical and explainable AI (XAI) at the "pre-model" stage, encompassing data acquisition, detailed transformations, and ethical governance. Simple conformance to FAIR (Findable, Accessible, Interoperable, Reusable) Principles is insufficient. Here, we define criteria and practices for reliable AI-readiness of biomedical data, developed by the NIH Bridge to Artificial Intelligence (Bridge2AI) Standards Working Group across seven core dimensions of dataset AI-readiness: FAIRness, Provenance, Characterization, Ethics, Pre-model Explainability, Sustainability, and Computability. Conformance to these criteria provides a basis for pre-model scientific rigor and ethical integrity, mitigating downstream risks of bias and error before AI modeling. We apply and evaluate these standards across all four Bridge2AI flagship datasets, spanning functional genomics to clinical medicine, and encode them in machine-actionable metadata bound to the datasets. This framework sets a benchmark for preparing ethical, reusable datasets in biomedical AI and provides standardized methods for reliable pre-model data evaluation.
bioinformatics · 2026-03-23 · v5
Variable performance of widely used bisulfite sequencing methods and read mapping software for DNA methylation
Kerns, E. V.; Weber, J. N.
Abstract
DNA methylation (DNAm) is the most commonly studied marker in ecological epigenetics, yet the performance of library preparation strategies and bioinformatic tools is seldom assessed in genetically variable natural populations. We profiled DNAm in threespine stickleback (Gasterosteus aculeatus) liver tissue using reduced representation bisulfite sequencing (RRBS) and whole genome bisulfite sequencing (WGBS) across technical and biological replicates. We additionally collated publicly available RRBS and WGBS data from taxonomically diverse organisms and then compared how the most commonly used methylation software (Bismark) performed relative to alternative pipelines (BWA-meth, BiSulfite Bolt, and Biscuit). Even after choosing parameters to maximize Bismark's mapping efficiency, it was still outperformed by all other methods. Surprisingly, newer tools overrepresented DNAm compared to older methods, highlighting the importance of testing methods on nonmodel organisms. There were also distinct differences in DNAm profiles produced across library preparation methods, with large impacts of population and read depth filters. Methylated sites unique to WGBS predominantly mapped to introns and intergenic regions, while sites unique to RRBS primarily overlapped with promoters and exons. Moreover, the prevalence of nucleotides with intermediate methylation (within individuals) was greatly reduced in RRBS. Together, this suggests that RRBS may be more useful for detecting functionally relevant methylation differences. Based on these results, we provide methodological recommendations for improving the reliability and utility of DNAm profiles, particularly concerning the detection of functionally relevant DNAm differences in genetically diverse natural populations.
bioinformatics · 2026-03-23 · v4
Identification of Distinct Topological Structures From High-Dimensional Data
Xu, B.; Braun, R.
Abstract
Single-cell RNA sequencing allows the direct measurement of the expression of tens of thousands of genes, providing an unprecedented view of the transcriptomic state of a cell. Within each cell, different biological processes such as differentiation or cell cycle take place simultaneously, each providing a different characterization of cell state. To identify gene sets that govern these processes for the purpose of disentangling convolved biological processes, we present "Identification of Distinct topological structures" (ID). ID works by constructing an alternative low-dimensional parametrization of the high-dimensional system, applying a finite perturbation to this alternative parametrization, and looking for genes that respond similarly. With this approach, we demonstrate that ID is capable of identifying structures within the data that would otherwise be missed. We further demonstrate the utility of ID in scRNA-seq datasets collected under various backgrounds, delineating cellular differentiation, characterizing cellular response to external perturbation, and dissecting the effect of genetic knock-outs.
bioinformatics · 2026-03-23 · v3
AI-Enhanced Adaptive Virtual Screening Platform Enabling Exploration of 69 Billion Molecules Discovers Structurally Validated FSP1 Inhibitors
Cecchini, D.; Nigam, A.; Tang, M.; Reis, J.; Koop, M.; Gottinger, A.; Nicoll, C. R.; Wang, Y.; Jayaraj, A.; Cinaroglu, S. S.; Törner, R.; Malets, Y.; Gehev, M.; Padmanabha Das, K. M.; Churion, K.; Kim, J.; Thomas, N.; Li, Y.; Seo, H.-S.; Dhe-Paganon, S.; Secker, C.; Haddadnia, M.; Hasson, A.; Li, M.; Kumar, A.; Levin-Konigsberg, R.; Choi, E.-B.; Shapiro, G. I.; Cox, H.; Sebastian, L.; Braithwaite, C.; Bashyal, P.; Radchenko, D. S.; Kumar, A.; Yang, L.; Aquilanti, P.-Y.; Gabb, H.; Alhossary, A.; Wagner, G.; Aspuru-Guzik, A.; Moroz, Y. S.; Kalodimos, C. G.; Fackeldey, K.; Schuetz, J. D.; Mattev
Abstract
Identifying potent lead molecules for specific targets remains a major bottleneck in drug discovery. As structural information about proteins becomes increasingly available, ultra-large virtual screenings (ULVSs), which computationally evaluate billions of molecules, offer a powerful way to accelerate early-stage drug discovery. Here, we introduce AdaptiveFlow, an open-source platform designed to make ULVSs more accessible, scalable, and efficient. AdaptiveFlow provides free access to a screening-ready version of the Enamine REAL Space, the largest library of ready-to-dock, drug-like molecules, containing 69 billion compounds that we prepared using the ligand preparation module of the platform. A key innovation of the platform is its use of a multi-dimensional grid of molecular properties, which helps researchers explore and prioritize chemical space more effectively and reduces computational costs by a factor of approximately 1000. This grid forms the basis of a new method for identifying promising regions of chemical space, enabling systematic exploration and prioritization of compound libraries. An optional active learning component can further accelerate this process by adaptively steering the search toward molecules most likely to bind a given target. To support a broad range of applications, AdaptiveFlow is compatible with over 1,500 docking methods. The platform achieves near-linear scaling on up to 5.6 million CPUs in the AWS Cloud, setting a new benchmark for large-scale cloud computing in drug discovery. Using this approach, we identified nanomolar inhibitors of two disease-relevant targets: ferroptosis suppressor protein 1 (FSP1) and poly(ADP-ribose) polymerase 1 (PARP-1).
By leveraging newly solved crystal structures of FSP1 in complex with NAD+, FAD, and coenzyme Q1, we validated these hits experimentally and determined the first co-crystal structures of FSP1 bound to small-molecule inhibitors, providing previously unavailable insights into inhibitor binding mechanisms. With its high scalability, flexibility, and open accessibility, AdaptiveFlow offers a powerful new resource for discovering and optimizing drug candidates at an unprecedented scale and speed.
bioinformatics · 2026-03-23 · v3
Breaking the Extraction Bottleneck: A Single AI Agent Achieves Statistical Equivalence with Human-Extracted Meta-Analysis Data Across Five Agricultural Datasets
Halpern, M.
Abstract
Background: Data extraction is the primary bottleneck in meta-analysis, consuming weeks of researcher time with single-extractor error rates of 17.7%. Existing LLM-based systems achieve only 26-36% accuracy on continuous outcomes, and no study has validated AI-extracted continuous data against multiple independent datasets using formal equivalence testing. Methods: A single AI agent (Claude Opus 4.6) extracted treatment means, control means, sample sizes, and variance measures from source PDFs across five published agricultural meta-analyses spanning zinc biofortification, biostimulant efficacy, biochar amendments, predator biocontrol, and elevated CO2 effects on plant mineral nutrition. Observations were matched to reference standards using an LLM-driven alignment method. Validation employed proportional TOST equivalence testing, ICC(3,1), Bland-Altman analysis, and source-type stratification. Results: Across five datasets, the agent produced 1,149 matched observations from 136 papers. Pearson correlations ranged from 0.984 to 0.999. Proportional TOST confirmed statistical equivalence for all five datasets (all p < 0.05). Table-sourced observations achieved 5.5x lower median error than figure-sourced observations. Aggregate effects were reproduced within 0.01-1.61 percentage points (pp) of published values. Independent duplicate runs confirmed extraction stability (within 0.09-0.23 pp). Conclusions: A single AI agent achieves statistical equivalence with human-extracted meta-analysis data across five independent agricultural datasets. The approach reduces extraction cost by approximately one to two orders of magnitude while maintaining accuracy sufficient for aggregate meta-analytic pooling.
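The validation above hinges on equivalence testing (TOST) rather than ordinary significance testing. As a minimal illustration of the TOST idea only (not the paper's proportional-bound implementation; the additive margin and large-sample z approximation are assumptions here), a paired version can be sketched as:

```python
from statistics import NormalDist, mean, stdev

def tost_paired(extracted, reference, margin, alpha=0.05):
    """Two one-sided tests (TOST) for equivalence of paired values.

    Declares the extracted values equivalent to the reference when the
    mean paired difference lies inside (-margin, +margin).  Uses a
    large-sample z approximation; a t distribution is more exact for
    small n.  Illustrative sketch, not the study's proportional TOST.
    """
    diffs = [a - r for a, r in zip(extracted, reference)]
    n = len(diffs)
    m = mean(diffs)
    se = stdev(diffs) / n ** 0.5
    nd = NormalDist()
    p_lower = 1 - nd.cdf((m + margin) / se)  # H0: mean diff <= -margin
    p_upper = nd.cdf((m - margin) / se)      # H0: mean diff >= +margin
    p = max(p_lower, p_upper)                # reject both to claim equivalence
    return p, p < alpha
```

Note the inverted logic relative to a standard t-test: a non-significant result says nothing here, and equivalence is claimed only when both one-sided nulls are rejected.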
bioinformatics · 2026-03-23 · v2
REMAG: recovery of eukaryotic genomes from metagenomic data using contrastive learning
Gomez-Perez, D.; Raguideau, S.; Warring, S.; James, R.; Hildebrand, F.; Quince, C.
Abstract
Metagenome-assembled genomes (MAGs) are central to exploring microbial communities. Yet, despite the relevance of protists and fungi to diverse ecosystems, eukaryotic MAG recovery lags behind that of prokaryotes. A major bottleneck is that most state-of-the-art binning pipelines exclusively rely on prokaryotic single-copy core gene reference databases and are optimized for smaller genomes. To address this gap, we present REMAG (Recovery of Eukaryotic MAGs), a tool designed to recover high-quality eukaryotic genomes suited for long-read metagenomic data. REMAG leverages fine-tuned HyenaDNA genomic foundation models to efficiently filter eukaryotic contigs. It then employs a dual-encoder Siamese network trained with Barlow Twins contrastive loss to learn a shared embedding space by integrating contig composition and differential coverage. Finally, high-quality bins are extracted using greedy iterative Leiden clustering optimized with eukaryotic single-copy core gene constraints. In benchmarks based on simulated mixed prokaryotic/eukaryotic communities and real datasets of varying sizes and origin, we demonstrate REMAG's ability to recover more near-complete eukaryotic genomes than existing state-of-the-art tools, which often produce highly fragmented eukaryotic bins. REMAG provides an automated eukaryotic binning method that scales effectively with the increasing size and sequencing depth of metagenomic datasets.
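REMAG's dual-encoder Siamese network is trained with the Barlow Twins objective, which makes the two views of a contig agree while decorrelating embedding dimensions. A minimal NumPy sketch of that loss (the λ weight and batch shapes are illustrative assumptions, not REMAG's training code):

```python
import numpy as np

def barlow_twins_loss(z1, z2, lam=0.005):
    """Barlow Twins redundancy-reduction loss on two embedding views.

    z1, z2: (batch, dim) arrays from the two encoder branches.
    Standardizes each dimension, forms the cross-correlation matrix,
    then pulls its diagonal toward 1 (invariance across views) and its
    off-diagonal entries toward 0 (decorrelated dimensions).
    """
    n = z1.shape[0]
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-8)
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-8)
    c = z1.T @ z2 / n                           # cross-correlation matrix
    on_diag = ((np.diag(c) - 1.0) ** 2).sum()   # invariance term
    off_diag = (c ** 2).sum() - (np.diag(c) ** 2).sum()  # redundancy term
    return on_diag + lam * off_diag
```

In training, z1 and z2 would be embeddings of the same contig from the two branches (e.g. composition and differential coverage); identical views drive the loss toward zero.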
bioinformatics · 2026-03-23 · v2
VINE: Variational inference for scalable Bayesian reconstruction of species and cell-lineage phylogenies
Siepel, A.; Hassett, R.; Staklinski, S. J.
Abstract
Bayesian methods are now widely used in reconstructing both species and cell-lineage phylogenies, but they remain heavily reliant on computationally intensive Markov chain Monte Carlo sampling. Phylogenetic variational inference (VI) circumvents this dependency but so far has been limited in speed and scalability. Here we introduce Variational Inference with Node Embeddings (VINE), a computational method that combines an embedding of taxa in a high-dimensional space and a distance-based "decoder" with several algorithmic innovations to dramatically improve phylogenetic VI. VINE supports both standard DNA substitution models and CRISPR barcode-mutation models for inference of cell-lineage trees and tissue-migration histories. In extensive simulation experiments, we show that VINE is comparable in accuracy to the best available Bayesian methods with speeds orders of magnitude faster. We then apply VINE to ~1,000 complete SARS-CoV-2 genomes and ~900 lung-cancer cell barcodes, showing reductions in compute time from days to hours or minutes.
bioinformatics · 2026-03-23 · v2
ChEA-KG: Human Transcription Factor Regulatory Network with a Knowledge Graph Interactive User Interface
Byrd, A. I.; Evangelista, J. E.; Lachmann, A.; Chung, H.-Y.; Jenkins, S. L.; Ma'ayan, A.
Abstract
Gene expression is controlled by transcription factors (TFs) that selectively bind and unbind to DNA to regulate mRNA expression of all human genes. TFs control the expression of other TFs, forming a complex gene regulatory network (GRN) with switches, feedback loops, and other regulatory motifs. Many experimental and computational methods have been developed to reconstruct the human intracellular GRN. Here we present a different approach. By submitting thousands of up and down gene sets from the RummaGEO resource for TF enrichment analysis with ChEA3, we distill signed and directed edges that connect human TFs to construct a high quality human GRN. The GRN has 131,581 signed and directed edges connecting 701 source TF nodes to 1,559 target TF nodes. The GRN is accessible via the ChEA-KG web server application, which provides interactive network visualization and analysis tools. Users may query the GRN for single or pairs of TFs or submit gene sets to perform TF enrichment analysis with ChEA3, placing the enriched TFs within the GRN. To demonstrate the utility of ChEA-KG, several TF-centric atlases are also made available via the ChEA-KG website. These atlases host TF subnetworks that regulate 131 major normal human cell-types (Cell Type Atlas); 69 tumour subtypes from 10 cancers (Cancer Atlas); 30 consensus perturbation response signatures for common mechanisms of action (MoA Atlas); and 24 aging signatures from tissues profiled by GTEx. Overall, ChEA-KG is an interactive web-server application that presents to users a new method of exploring the human gene regulatory network through both network visualization and transcription factor enrichment analysis. The ChEA-KG application is available from: https://chea-kg.maayanlab.cloud/.
bioinformatics · 2026-03-23 · v2
Translating Histopathology Foundation Model Embeddings into Cellular and Molecular Features for Clinical Studies
Cui, S.; Sui, Z.; Li, Z.; Matkowskyj, K. A.; Yu, M.; Grady, W. M.; Sun, W.
Abstract
AI-powered pathology foundation models provide general-purpose representations of histopathological images by encoding image tiles into numerical embeddings. However, these embeddings are not directly interpretable in biological or clinical terms and must be translated into biologically meaningful features, such as cell-type composition or gene expression, to enable downstream clinical applications. To bridge this gap, we developed STpath, a framework that integrates histopathology image embeddings derived from existing pathology foundation models with matched, spatially resolved transcriptomics data. STpath consists of cancer-specific XGBoost models trained to infer cell-type compositions and gene expression from histopathology image tiles. We evaluated STpath in colorectal and breast cancer datasets and showed that it provides accurate estimates of the composition of major cell types and the expression of a subset of genes, with further performance gains achieved by combining embeddings from multiple foundation models. Finally, we demonstrated that STpath-inferred features can be used in downstream studies to evaluate their associations with clinical outcomes.
bioinformatics · 2026-03-23 · v2
Single-cell spatial multi-omics molecular pathology enabled by SuperFocus
Lu, Y.; Tian, X.; Vicari, M.; Enninful, A.; Bao, S.; Bai, Z.; Liu, C.; Zhang, X.; Andren, P.; Lundeberg, J.; Xu, M. L.; Fan, R.; Xiao, Y.; Ma, Z.
Abstract
Histopathology and molecular pathology are currently distinct diagnostic modalities for the most part, one revealing tissue morphology at cellular resolution and the other providing molecular measurements with limited or no spatial context. Projecting genome-scale molecular information onto histopathology images at single-cell resolution across whole tissue sections represents a long-sought goal for next-generation pathology. Here we present SuperFocus, a modality-agnostic computational platform that generates histopathology-integrated single-cell spatial multi-omics from spot-based spatial measurements acquired on the same or an adjacent section without requiring external reference data. SuperFocus combines constrained cascading imputation with feature-level and cell-level quality-control scores to reduce spurious predictions and quantify confidence. On a ground-truth spatial transcriptomics benchmark dataset, SuperFocus improves key accuracy metrics by 28-73% over existing methods. Across Patho-DBiT, spatial ATAC-RNA, spatial CITE-seq and Visium-MALDI-MSI (SMA) datasets, SuperFocus enables cell-resolved analyses of MALT lymphoma microenvironments, gene regulatory programs in human hippocampus, lipotoxic hepatocyte states in human MASH, and transcriptomic-metabolomic states linked to neurotransmission and neuroinflammation in Parkinsonian mouse brain. Overall, SuperFocus enables scalable whole-slide single-cell spatial multi-omics integrated with histopathology, bridging histology and genome-scale molecular profiling for next-generation molecular pathology.
bioinformatics · 2026-03-23 · v2
Structurally Restricted Message-Passing within Shallow Architectures for Explainable Network-Level Brain Decoding on Small Cohorts
Marques dos Santos, J. D.; Ramos, M. B.; Reis, L. P.; Marques dos Santos, J. P.; Direito, B.
Abstract
The application of artificial intelligence (AI) to functional magnetic resonance imaging (fMRI) has gained increasing attention due to its ability to model complex, high-dimensional brain data and capture nonlinear patterns of neural activity. However, deep learning architectures, such as Graph Neural Networks (GNNs), typically require large sample sizes to achieve stable convergence, limiting their applicability in neuroimaging contexts where data are often scarce. This challenge highlights the need for compact, data-efficient models that maintain predictive performance and interpretability. Shallow neural networks (SNNs) have demonstrated robustness in low-sample settings but commonly rely on region-level features that treat brain areas independently, overlooking the brain's intrinsically network-based organization. To address this limitation, we propose a structurally constrained message-passing framework that integrates diffusion tensor imaging (DTI)-derived structural connectivity with region-level fMRI signals within a shallow architecture. This approach enables network-level modeling while preserving the stability and data efficiency of SNNs. The method is evaluated on 30 subjects performing a Theory of Mind (ToM) task from the Human Connectome Project Young Adult dataset. A baseline SNN achieved global accuracies of 88.2% (fully connected), 80.0% (pruned), and 84.7% (retrained), while the proposed model achieved 87.1%, 77.6%, and 84.7%, respectively. Although structural constraints led to a more pronounced performance decrease after pruning, retraining restored accuracy to baseline levels, demonstrating that biological constraints can be incorporated without compromising predictive validity. Model interpretability was assessed using SHAP (Shapley Additive Explanations). 
While the baseline model primarily identified isolated regions as key contributors, the proposed framework revealed distributed, structurally coherent networks as the main drivers of classification. These networks showed correspondence with established ToM regions, including the temporo-parietal junction, superior temporal sulcus, and inferior frontal gyrus. Importantly, the findings suggest that groups of moderately informative regions can collectively form highly relevant subnetworks. Overall, the proposed framework achieves competitive performance in a limited dataset while incorporating graph-inspired message passing into a shallow architecture. Its explainability provides insight into how structurally constrained networks support stimulus-driven responses in ToM and demonstrates potential for investigating network dysfunction in disorders such as Alzheimer's disease, ADHD, autism spectrum disorder, bipolar disorder, mild cognitive impairment, and schizophrenia.
bioinformatics · 2026-03-23 · v1
Epidemiology of Legionella: Genome-bAsed Typing (el_gato) - a new bioinformatic tool for identifying sequence-based types of Legionella pneumophila from whole genome sequencing data
Collins, A. J.; Mashruwala, D.; Chivukula, V.; Kozak-Muiznieks, N. A.; Rishishwar, L.; Norris, E. T.; Willby, M. J.; Hamlin, J.; Overholt, W. A.
Abstract
Sequence-based typing (SBT) via Sanger sequencing has been the standard for describing Legionella pneumophila relatedness for two decades. SBT involves sequencing seven loci, identifying alleles using the United Kingdom Health Security Agency (UKHSA) database, and inferring the corresponding sequence type (ST). While similar SBT approaches for other organisms can be easily adapted to whole-genome sequencing (WGS), L. pneumophila presents several challenges for this adaptation: multiple copies of one locus (mompS) and extensive heterogeneity in a second locus (neuA/neuAh). Although several computational methods have been proposed to address these issues, a WGS-based replacement with equal resolution to traditional SBT has been elusive. To address this gap, we developed el_gato (Epidemiology of Legionella: Genome-bAsed Typing; https://github.com/CDCgov/el_gato), which offers several advantages over existing methods: (1) a novel approach for resolving multiple mompS alleles identified in the same isolate, (2) the ability to capture diverse neuA/neuAh alleles, (3) fast single-threaded execution with an average of 27.7 seconds per sample, (4) easy installation via Bioconda or Docker, and (5) an updated database as of March 2025. el_gato works with either paired-end short reads or genome assemblies, performing more accurately with paired-end short reads at least 250 base pairs (bp) in length. We compared el_gato against two other in silico SBT tools (mompS, hereafter referred to as the mompS tool, and legsta) using a dataset of 441 isolates with sequence types (STs) previously determined by Sanger-based sequencing. el_gato correctly identified the ST for 98.9% of the test isolates, compared to 95.2% for the mompS tool and 42.2% for legsta, demonstrating a significant improvement compared to the mompS tool (adjusted p = 1.06e-3) and legsta (adjusted p = 4.24e-55) in ST identification.
Furthermore, el_gato's determination of ST was not significantly different from Sanger sequencing (adjusted p = 0.442). In summary, el_gato significantly improves in silico SBT and, given its growing adoption, is poised to support the public health community.
bioinformatics · 2026-03-23 · v1
CoPISA: Combinatorial Proteome Integral Solubility/Stability Alteration analysis
zangene, e.; gholizadeh, e.; Vadadokhau, U.; Ritz, D.; Saei, A.; JAFARI, M.
Abstract
Combination therapies are widely used in acute myeloid leukemia (AML), but systematic datasets capturing proteome-wide responses to multi-drug perturbations remain limited. Here we present CoPISA (Combinatorial Proteome Integral Solubility/Stability Alteration), a quantitative proteomics assay designed to profile protein solubility and stability responses to single and combined drug treatments. The dataset includes two AML drug pairs (LY3009120-sapanisertib and ruxolitinib-ulixertinib) applied to four AML cell lines (MOLM-13, MOLM-16, SKM-1, and NOMO-1) under control, single-agent, and combination conditions in both lysate and intact-cell formats. Thermal solubility profiling coupled with TMT-based multiplexed LC-MS/MS generated 16 TMT16-plex experiments comprising 192 LC-MS/MS raw files, providing deep proteome coverage across treatments and biological contexts. The resource includes raw and processed proteomics data, detailed experimental metadata in Sample and Data Relationship Format (SDRF), and reproducible analysis scripts for reporter normalization, protein-level aggregation, statistical modeling, and classification of combinatorial response patterns. The experimental design enables identification of proteins responding uniquely to combination treatments as well as overlapping single-agent effects. Technical validation demonstrates reproducible quantification across multiplex experiments and assay formats. All data are publicly available through the PRIDE repository (PXD066812) together with analysis code, enabling independent reanalysis and method development. This dataset provides a benchmark resource for studying proteome responses to drug combinations, comparing lysate and intact-cell perturbation profiles, developing computational approaches for combinatorial target inference, and supporting training in computational proteomics.
bioinformatics · 2026-03-23 · v1
NLCD: A method to discover nonlinear causal relations among genes
Easwar, A.; Narayanan, M.
Abstract
Distinguishing correlation from causation is a fundamental challenge in many scientific fields, including biology, especially when interventions like randomized controlled trials are infeasible and only observational data are available. Methods based on statistical tests of conditional independence within the Mendelian Randomization framework can detect causality between two observed variables that are each associated with a third instrumental variable. However, these methods for detecting causal relationships between traits (e.g., two gene expression or clinical traits associated with a genetic variant, all observed in the same population) often assume a linear relationship, thereby hindering the discovery of causal gene networks from genomics data. We have developed NLCD, a method for NonLinear Causal Discovery from genomics data based on nonlinear regression modeling and conditional feature importance scoring. NLCD uses these techniques to extend the statistical tests in an existing linear causal discovery method called the Causal Inference Test (CIT). We benchmarked NLCD against current state-of-the-art methods: CIT, Findr, and MRPC. On simulated datasets, NLCD performs comparably to most methods in detecting linear relations (Average AUPRC (Area Under the Precision-Recall Curve) of NLCD=0.94, CIT=0.94, Findr=0.94, and MRPC=0.99), and outperforms them in detecting nonlinear (sine and sawtooth type) relations between two genes (Average AUPRC of NLCD=0.76, CIT=0.60, Findr=0.56, and MRPC=0.73). When tested on a nonlinear subset of a yeast genomic dataset to recover known causal relations involving transcription factors, NLCD and CIT performed comparably to each other and slightly better than Findr and MRPC (Average AUPRC of NLCD=0.82, CIT=0.81, Findr=0.71, and MRPC=0.54).
On application to a human genomic dataset, NLCD revealed active causal gene pairs (IRF1→PSME1 and HLA-C→HLA-T) in muscle tissue and highlighted the promises and challenges of discovering causal gene networks in human tissues in vivo.
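As a toy illustration of the instrument-based logic behind NLCD (this is not the authors' method: the quantile-binned regressor and eta-squared residual score below are crude stand-ins for their nonlinear regression and conditional feature importance machinery), the idea is that if trait T1 causes trait T2, then after a nonlinear fit of T2 on T1 the genetic instrument L should explain almost none of the remaining variance:

```python
import numpy as np

def binned_fit(x, y, bins=30):
    """Crude nonlinear regression of y on x: predict the mean of y
    within quantile bins of x."""
    edges = np.quantile(x, np.linspace(0, 1, bins + 1))
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, bins - 1)
    fallback = y.mean()
    means = np.array([y[idx == b].mean() if np.any(idx == b) else fallback
                      for b in range(bins)])
    return means[idx]

def residual_association(L, x, y, bins=30):
    """Share of residual variance in y ~ f(x) explained by instrument L
    (eta-squared).  A value near zero is consistent with x -> y."""
    resid = y - binned_fit(x, y, bins)
    grand = resid.mean()
    ss_between = sum((L == g).sum() * (resid[L == g].mean() - grand) ** 2
                     for g in np.unique(L))
    ss_total = ((resid - grand) ** 2).sum()
    return float(ss_between / ss_total)
```

On simulated data with L → T1 → T2 and a sine-shaped link, the residual association in the causal direction is much smaller than the raw association of L with T2, which is the kind of contrast the formal conditional independence tests exploit.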
bioinformatics · 2026-03-23 · v1
MHCBind: A Pan- and Allele-Specific Model for Predicting Class I MHC-Peptide Binding Affinity
Peddi, N.; Bijjula, D. R.; Gogte, S.; Kondaparthi, V.
Abstract
Major Histocompatibility Complex (MHC) molecules are essential to the immune system because they bind and present peptide antigens to T cells, enabling immune recognition and response. The specificity of MHC-peptide interactions is crucial for understanding immune-related diseases, developing personalized immunotherapies, and designing effective vaccines. Current computational methods, while powerful, often rely on a single type of molecular information, usually sequence, and implicitly model the interaction between the two molecules. To address these limitations, we introduce MHCBind, a novel deep learning framework that captures a more comprehensive and biologically relevant view of the binding event. MHCBind's architecture employs a dual-view feature extraction strategy for both the MHC and the peptide. A Graph Attention Network (GAT) learns topological features from predicted residue contact maps, while a parallel 1D Convolutional Neural Network (CNN) captures multi-scale patterns from sequence embeddings. These four distinct feature sets are then integrated in a cross-fusion module that uses an attention mechanism to model interactions between the two molecules. Finally, a multi-layer perceptron (MLP) regression head maps the fused interaction signature to a precise binding affinity score. In rigorous comparative benchmarks against established methods such as NetMHCpan, MHCFlurry, and MHCnuggets, MHCBind demonstrates superior performance, achieving a significantly lower average prediction error (RMSE: 0.1485) and a higher correlation (PCC: 0.7231) in allele-specific contexts. For pan-allele tasks, it excels at correctly ranking peptides with a superior Spearman's Correlation (SCC: 0.7102), a crucial advantage for practical applications. The framework's design is inherently flexible, excelling in both allele-specific and pan-allele prediction tasks.
bioinformatics · 2026-03-23 · v1
General-purpose embeddings for long-read metagenomic sequences via β-VAE on multi-scale k-mer frequencies
Nielsen, T. N.; Lui, L. M.
Abstract
Long-read metagenomics routinely produces millions of assembled contigs, creating a need for methods that organize sequences into biologically meaningful groups across samples and environments. We present a general-purpose compositional embedding for metagenomic sequences based on a β-variational autoencoder (β-VAE) trained on multi-scale k-mer frequencies (1-mers through 6-mers; 2,772 features with centered log-ratio transformation). The embedding compresses each contig into a 384-dimensional vector that preserves local compositional similarity, enabling similarity search and graph-based clustering from sequence composition alone. Through systematic comparison of fifteen models trained on up to 17.4 million contigs (525.5 Gbp) from brackish, terrestrial, and reference genome sources, we find that a small set of curated prokaryotic reference genomes (656,000 contigs) outperforms ten-fold larger domain-specific training sets, and that neither reconstruction loss nor Spearman correlation reliably predicts downstream clustering quality. On nearest-neighbor graphs, flow-based clustering (MCL) markedly outperforms modularity-based methods (Leiden), yielding 12,123 clusters from 154,041 contigs (≥100 kbp) with 99.2% phylum-level purity confirmed by independent marker gene phylogenetics. Multi-method taxonomic annotation achieves 87% coverage and reveals that 16.4% of contigs are eukaryotic, the single largest component invisible to standard prokaryotic annotation tools. The embedding provides a sample-independent coordinate system for organizing metagenomic sequence space at scale.
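The 2,772-feature input corresponds to the canonical (reverse-complement-collapsed) k-mer vocabulary for k = 1..6: 2 + 10 + 32 + 136 + 512 + 2080 = 2,772. A stdlib sketch of such a featurization (the pseudocount and per-k CLR centering are assumptions; the paper's exact preprocessing may differ):

```python
from itertools import product
from math import log

COMP = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Collapse a k-mer with its reverse complement to one representative."""
    rc = kmer.translate(COMP)[::-1]
    return min(kmer, rc)

def kmer_clr_features(seq, kmax=6, pseudo=0.5):
    """Multi-scale canonical k-mer frequencies with a centered log-ratio
    (CLR) transform applied separately at each k.  For kmax=6 this yields
    2,772 features.  Pseudocount and per-k centering are assumptions."""
    feats = []
    alphabet = set("ACGT")
    for k in range(1, kmax + 1):
        vocab = sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})
        counts = dict.fromkeys(vocab, pseudo)          # pseudocount avoids log(0)
        for i in range(len(seq) - k + 1):
            km = seq[i:i + k]
            if set(km) <= alphabet:                    # skip ambiguous bases
                counts[canonical(km)] += 1
        total = sum(counts.values())
        logs = [log(c / total) for c in counts.values()]
        gm = sum(logs) / len(logs)                     # log geometric mean
        feats.extend(v - gm for v in logs)             # CLR: center each k-block
    return feats
```

Each k-level block of the output sums to zero, the defining property of the CLR transform.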
bioinformatics · 2026-03-23 · v1
Single-cell Landscape of T Cell Heterogeneity in Kawasaki Disease: STAT3/JAK Axis Regulates the Lineage Differentiation Bias of Th17 Cells
Song, S.; Zong, Y.; Xu, Y.; Chen, L.; Zhou, Y.; Chen, L.; Li, G.; Xiao, T.; Huang, M.
Abstract
Background: Kawasaki disease (KD) is a pediatric systemic vasculitis in which T-cell-mediated immune responses play a pivotal role. However, the precise dynamic evolution of T-cell subsets during disease progression remains poorly understood. Methods: Single-cell RNA sequencing (scRNA-seq) was employed to perform high-resolution annotation of peripheral blood mononuclear cells (PBMCs) from healthy controls and KD patients, both pre- and post-IVIG treatment. T-cell developmental trajectories were reconstructed via Monocle3-based pseudotime analysis. Furthermore, the functional significance of the identified pathway was validated in a CAWS-induced KD murine model. Results: A high-resolution single-cell landscape identified 13 distinct T-cell subtypes. Pseudotime analysis revealed a significant lineage commitment of CD4+ T cells toward a Th17 phenotype during the acute phase of KD, synchronized with the transcriptional upregulation of the STAT3/JAK signaling axis. Animal experiments further demonstrated that pharmacological inhibition of this pathway substantially attenuated inflammatory infiltration in the cardiac vasculature of KD mice. Conclusion: This study identifies the STAT3/JAK-mediated Th17 differentiation bias as a potential regulatory program associated with acute inflammation in Kawasaki disease, thereby highlighting the STAT3/JAK axis as a potential therapeutic target.
bioinformatics2026-03-23v1A harmonized benchmarking framework for implementation-aware evaluation of 46 polygenic risk score tools across binary and continuous phenotypes
Muneeb, M.; Ascher, D.Abstract
Polygenic risk score (PRS) tools differ substantially in statistical assumptions, input requirements, and implementation complexity, making direct comparison difficult. We developed a harmonized, implementation-aware benchmarking framework to evaluate 46 PRS tools across seven binary UK Biobank phenotypes and one continuous trait under three model configurations: null, PRS-only, and PRS plus covariates. The framework integrates standardized preprocessing, tool-specific execution, hyperparameter exploration, and unified downstream evaluation using five-fold cross-validation on high-performance computing infrastructure. In addition to predictive performance, we assessed runtime, memory use, input dependencies, and failure modes. A Friedman test across 40 phenotype-fold combinations confirmed significant differences in tool rankings (χ² = 102.29, p = 2.57 × 10⁻¹¹), with no single method universally optimal. These findings provide a reproducible framework for comparative PRS evaluation and demonstrate that tool performance is shaped not only by statistical methodology but also by phenotype architecture, preprocessing choices, covariate structure, computational demands, software robustness, and practical implementation constraints.
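The Friedman test used here ranks each tool within every phenotype-fold block and asks whether mean ranks differ across tools. A pure-Python sketch of the statistic (ignoring tied ranks, which a full analysis would correct for; this is an illustration, not the authors' pipeline):

```python
def friedman_statistic(blocks):
    """Friedman chi-square over within-block ranks.

    blocks: list of n tuples, one performance value per tool (k tools per tuple).
    Lower value = better rank here; ties are ignored for simplicity.
    """
    n, k = len(blocks), len(blocks[0])
    rank_sums = [0.0] * k
    for row in blocks:
        order = sorted(range(k), key=lambda j: row[j])
        for rank, j in enumerate(order, start=1):
            rank_sums[j] += rank
    # chi2 = 12/(n*k*(k+1)) * sum(R_j^2) - 3*n*(k+1)
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
```

With k = 46 tools and n = 40 phenotype-fold blocks, the resulting value is referred to a chi-square distribution with k - 1 = 45 degrees of freedom.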
bioinformatics2026-03-23v1Learning gene interactions from tabular gene expression data using Graph Neural Networks
Boulougouri, M.; Nallapareddy, M. V.; Vandergheynst, P.Abstract
Gene interactions form complex networks underlying disease susceptibility and therapeutic response. While bulk transcriptomic datasets offer rich resources for studying these interactions, applying Graph Neural Networks (GNNs) to such data remains limited by a lack of methodological guidance, especially for constructing gene interaction graphs. We present REGEN (REconstruction of GEne Networks), a GNN-based framework that simultaneously learns latent gene interaction networks from bulk transcriptomic profiles and predicts patient vital status. Evaluated across seven cancer types in the TCGA cohort, REGEN outperforms baseline models in five datasets and provides robust network inference. By systematically comparing strategies for initializing gene-gene adjacency matrices, we derive practical guidelines for GNN application to bulk transcriptomics. Analysis of the learned kidney cancer gene network reveals cancer-related pathways and biomarkers, validating the model's biological relevance. Together, we establish a principled approach for applying GNNs to bulk transcriptomics, enabling improved phenotype prediction and meaningful gene network discovery.
bioinformatics2026-03-23v1Solving the Diagnostic Odyssey with Synthetic Phenotype Data
Colangelo, G.; Marti, M.Abstract
The space of possible phenotype profiles over the Human Phenotype Ontology (HPO) is combinatorially vast, whereas the space of candidate disease genes is far smaller. Phenotype-driven diagnosis is therefore highly non-bijective: many distinct symptom profiles can correspond to the same gene, but only a small fraction of the theoretical phenotype space is biologically and clinically plausible. When a structured ontology exists, this constraint can be exploited to generate realistic synthetic cases. We introduce GraPhens, a simulation framework that uses gene-local HPO structure together with two empirically motivated soft priors (over the number of observed phenotypes per case and over phenotype specificity) to generate synthetic phenotype-gene pairs that are novel yet clinically plausible. We use these synthetic cases to train GenPhenia, a graph neural network that reasons over patient-specific phenotype subgraphs rather than flat phenotype sets. Despite being trained entirely on synthetic data, GenPhenia generalizes to real, previously unseen clinical cases and outperforms existing phenotype-driven gene-prioritization methods on two real-world datasets. These results show that when patient-level data are scarce but a structured ontology is available, principled simulation can provide effective training data for end-to-end neural diagnosis models.
bioinformatics2026-03-23v1FuzzyClusTeR: a web server for analysis of tandem and diffuse DNA repeat clusters with application to telomeric-like repeats
Aksenova, A. Y.; Zhuk, A. S.; Lada, A. G.; Sergeev, A. V.; Volkov, K. V.; Batagov, A.Abstract
DNA repeats constitute a large fraction of eukaryotic genomes and play important roles in genome stability and evolution. While tandem repeats such as microsatellites have been extensively studied, the genomic organization and potential functions of dispersed or loosely organized repeat patterns remain poorly understood. Here we present FuzzyClusTeR, a web server for the identification, visualization and enrichment analysis of DNA repeat clusters in genomic sequences. Using parameterized metrics, FuzzyClusTeR detects both classical tandem repeats and regions where related motifs occur in proximity without forming perfect tandem arrays, which we term diffuse (or fuzzy) repeat clusters. The server supports analysis of user-defined sequences as well as genome-scale datasets, including the T2T-CHM13 and GRCh38 human genome assemblies, and provides interactive visualization and statistical tools for assessing the genomic distribution of repetitive motifs and corresponding clusters. As a demonstration, we analyzed telomeric-like repeats in the T2T-CHM13v2.0 genome and identified families of diffuse clusters enriched in these motifs. Comparison with simulated sequences suggests that these clusters represent non-random genomic patterns with potential evolutionary and functional significance. FuzzyClusTeR enables systematic exploration of repeat clustering across genomic regions or entire genomes. It is available at https://utils.researchpark.ru/bio/fuzzycluster
bioinformatics2026-03-23v1EvoMut: A Computational Framework for Engineering Oxidative Stability in Proteins
Arab, S. S.; Lewis, N. E.Abstract
Amino acid oxidation is a major cause of protein instability and loss of function in therapeutic and industrial settings. Although methionine, cysteine, tyrosine, and tryptophan residues are widely recognized as oxidation-prone, only a subset of such residues are dominant functional hotspots, and not all are suitable targets for mutation. Identifying these vulnerable yet engineerable sites remains a major challenge. Here, we present EvoMut, a residue-level analytical framework for evaluating both oxidative vulnerability and mutation feasibility. EvoMut estimates oxidation risk by integrating structural features, local functional context, intrinsic chemical susceptibility, and evolutionary conservation. A central feature of the framework is the explicit separation of oxidation risk from mutation feasibility: candidate substitutions are evaluated only after high-risk residues are identified and ranked by evolutionary substitution patterns. Application of EvoMut to multiple proteins, and evaluation with experimental data, showed that oxidation-prone residues differ markedly in their engineering potential. EvoMut distinguishes residues that are both oxidation-sensitive and evolutionarily permissive from those that are chemically vulnerable but functionally constrained. By providing residue-level mechanistic insight, EvoMut offers a practical framework for the rational design of oxidation-resistant proteins. EvoMut is freely available as a web server at https://evomut.org.
bioinformatics2026-03-23v1Time-Resolved Phosphoproteomics-Guided BFS Beam Search Reveals Cell-Type-Specific EGFR Signaling Architectures and SHP2 Inhibitor-Induced Pathway Rewiring
Lee, H.; Lee, G.Abstract
Background: The epidermal growth factor receptor (EGFR) orchestrates highly context-dependent intracellular signaling networks whose architecture varies across cell types and is frequently rewired by targeted therapeutics. Systems-level reconstruction of these networks from phosphoproteomic data remains challenging because phosphorylation measurements identify signaling nodes but do not reveal the interaction paths that propagate signals between proteins. Results: We developed a computational framework integrating time-resolved phosphoproteomics with graph traversal algorithms to reconstruct EGFR-initiated signaling pathways across three cellular contexts. A sign-assignment preprocessing procedure converts quantitative phosphorylation measurements into binary activation states across time points, defining a condition-specific active node set that filters the protein-protein interaction network. Breadth-First Search combined with interaction-weighted Beam Search is then applied to the STRING interaction database (v11.5) to enumerate candidate signaling paths. Applying this framework to phosphoproteomic datasets from EGF-stimulated HeLa cells, EGF-stimulated MDA-MB-468 triple-negative breast cancer (TNBC) cells, and EGF-stimulated MDA-MB-468 cells pretreated with the SHP2 inhibitor SHP099 yielded 260 paths in HeLa cells (117 unique topologies), 293 paths in MDA-MB-468 cells (155 unique), and 292 paths under SHP2 inhibition (85 unique). HeLa cells displayed a SRC-centered architecture dominated by ERBB2 and SHC1 first-hop effectors, converging on focal adhesion, HSP90 chaperone, CRKL adaptor, and integrin signaling arms. In contrast, MDA-MB-468 cells showed a PIK3CA/PTPN11 dual-axis architecture integrating direct PI3K engagement with SHP2-mediated GRB2-IRS1-ABL1 signaling.
SHP2 inhibition abolished PTPN11-mediated pathways and induced PIK3CA dominance (69.2% first-hop), accompanied by compensatory ERBB3 engagement and a computationally predicted SYK/VAV1/LCP2 node set whose biological role warrants experimental validation. Conclusions: Time-resolved phosphoproteomics-guided BFS Beam Search over STRING interaction networks captures cell-type-specific EGFR signaling architectures and drug-induced pathway rewiring. This framework provides a systematic approach for transforming phosphoproteomic measurements into mechanistically interpretable, cell-type-specific signaling hypotheses, directly applicable to drug resistance modeling.
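The BFS-plus-beam-search idea (enumerate paths outward from EGFR through phosphorylation-active nodes, pruning to the highest-weight partial paths at each depth) can be sketched as follows; the toy graph and edge weights below are invented for illustration and are not taken from STRING:

```python
import heapq

def beam_search_paths(graph, source, active, beam_width=3, max_depth=4):
    """Enumerate high-weight simple paths from `source` through `active` nodes.

    graph: {node: [(neighbor, edge_weight), ...]}; a path's score is the product
    of its edge weights; at each depth only the top `beam_width` partial paths survive.
    """
    beams = [(1.0, [source])]
    completed = []
    for _ in range(max_depth):
        candidates = []
        for score, path in beams:
            extended = False
            for nbr, w in graph.get(path[-1], []):
                if nbr in active and nbr not in path:   # stay in active set, no cycles
                    candidates.append((score * w, path + [nbr]))
                    extended = True
            if not extended:
                completed.append((score, path))          # dead end: record the path
        beams = heapq.nlargest(beam_width, candidates)
        if not beams:
            break
    completed.extend(beams)
    return sorted(completed, reverse=True)
```

On a toy EGFR network, the top-scoring path follows the strongest chain of edges between active nodes.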
bioinformatics2026-03-23v1The Risk of Gulf Birds Functional Diversity Loss with Climate Change Uncovered Using Deep Learning Population Models
Li, L.; Bai, J.; Sun, S.; Zuzuarregui, M.; Wang, Z.Abstract
Climate change and sea-level rise (SLR) pose increasing threats to coastal ecosystems and biodiversity in the Gulf of America. Most efforts to anticipate these threats focus on species counts or range shifts, while changes in species' functional diversity remain largely unexplored. We estimated the impacts of climate change and sea-level rise on hundreds of bird species' populations and the corresponding shifts in functional diversity, using a generative deep learning method, the Gaussian Mixture Variational Autoencoder (GMVAE), together with Trait Probability Density analysis. We found that the generative GMVAE model uncovered species' unobserved ranges, and that climate change reduced coastal ecosystem resilience and caused biodiversity loss across multiple dimensions, including functional richness, redundancy, evenness, and divergence. Surprisingly, the most impacted areas are not the exposed shoreline but the landward coastal transition zones. Specifically, shoreline functional diversity increased with climate change and sea-level rise, whereas uplands showed declining functional diversity and increasing redundancy, indicating contraction of functional trait space. Furthermore, avian biodiversity expanded in coastal protected areas, which serve as refugia embedded in a surrounding landscape where unique combinations of species traits are lost.
bioinformatics2026-03-23v1Rastair: an integrated variant and methylation caller
Etzioni, Z.; Zhao, L.; Hertleif, P.; Schuster-Boeckler, B.Abstract
Cytosine methylation is a crucial epigenetic mark that impacts tissue-specific chromatin conformation and gene expression. For many years, bisulfite sequencing (BS-seq), which converts all non-methylated cytosines (C) to thymine (T), remained the only approach to measure cytosine methylation at base resolution. Recently, however, several new methods that convert only methylated cytosines to thymine (mC→T) have become widely available. Here we present rastair, an integrated software toolkit for simultaneous SNP detection and methylation calling from mC→T sequencing data such as those created with Watchmaker's TAPS+ and Illumina's 5-Base chemistries. Rastair combines machine-learning-based variant detection with genotype-aware methylation estimation. Using NA12878 benchmark datasets, we show that rastair outperforms existing methylation-aware SNP callers and achieves F1 scores exceeding 0.99 for datasets above 30x depth, matching the accuracy of state-of-the-art tools run on whole-genome sequencing data. At the same time, rastair is significantly faster than other genetic variant callers: processing a 30x depth file takes less than 30 minutes on 32 CPU cores of an Intel Xeon, and half that when a GPU is available. By integrating genotyping with methylation calling, rastair reports an additional 500,000 positions in NA12878 where a SNP turns a non-CpG reference position into a "de-novo" CpG. Conversely, rastair also identifies positions where a variant disrupts a CpG and corrects their reported methylation levels. Rastair produces standard-compliant outputs in vcf, bam and bed formats, facilitating integration into downstream analysis pipelines. Rastair is open-source and available via conda, Dockerhub, and as pre-compiled binaries from https://www.rastair.com.
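The "de-novo" CpG case reduces to a simple sequence check: does a substitution create a CG dinucleotide that the reference lacked at that site? A minimal sketch (the function name and window convention are our own, not rastair's API):

```python
def creates_cpg(ref, pos, alt):
    """True if substituting `alt` at 0-based `pos` in `ref` creates a CpG
    dinucleotide overlapping that position where the reference had none."""
    var = ref[:pos] + alt + ref[pos + 1:]
    lo, hi = max(pos - 1, 0), pos + 2   # window of dinucleotides touching pos
    return "CG" in var[lo:hi] and "CG" not in ref[lo:hi]
```

The symmetric case (a variant disrupting an existing CpG) follows by swapping the two membership tests.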
bioinformatics2026-03-23v1TogoMCP: Natural Language Querying of Life-Science Knowledge Graphs via Schema-Guided LLMs and the Model Context Protocol
Kinjo, A. R.; Yamamoto, Y.; Bustamante-Larriet, S.; Labra-Gayo, J. E.; Fujisawa, T.Abstract
Querying the RDF Portal knowledge graph maintained by DBCLS, which aggregates more than 70 life-science databases, requires proficiency in both SPARQL and database-specific RDF schemas, placing this resource beyond the reach of most researchers. Large Language Models (LLMs) can, in principle, translate natural-language questions into executable SPARQL, but without schema-level context, they frequently fabricate non-existent predicates or fail to resolve entity names to database-specific identifiers. We present TogoMCP, a system that recasts the LLM as a protocol-driven inference engine orchestrating specialized tools via the Model Context Protocol (MCP). Two mechanisms are essential to its design: (i) the MIE (Metadata-Interoperability-Exchange) file, a concise YAML document that dynamically supplies the LLM with each target database's structural and semantic context at query time; and (ii) a two-stage workflow separating entity resolution via external REST APIs from schema-guided SPARQL generation. On a benchmark of 50 biologically grounded questions spanning five types and 23 databases, TogoMCP achieved a large improvement over an unaided baseline (Cohen's d = 0.92, Wilcoxon p < 10⁻⁶), with win rates exceeding 80% for question types with precise, verifiable answers. An ablation study identified MIE files as the single indispensable component: removing them reduced the effect to a non-significant level (d = 0.08), while a one-line instruction to load the relevant MIE file recovered the full benefit of an elaborate behavioral protocol. These results suggest a general design principle: concise, dynamically delivered schema context is more valuable than complex orchestration logic.
bioinformatics2026-03-23v1aaKomp: Alignment-free amino acid k-mer matching for genome completeness assessment at scale
Wong, J.; Coombe, L.; Warren, R. L.; Birol, I.Abstract
In de novo sequencing projects, genome assembly optimization requires evaluating a number of candidate assemblies to identify optimal tool parameters. Yet, current completeness assessment tools like BUSCO and compleasm require 10-80 minutes per evaluation for gigabase-scale genomes, transforming what should be rapid iteration into time-intensive processes. These tools rely on alignment-based approaches and fixed ortholog databases, limiting their scalability across the tree of life. We present aaKomp, a scalable alignment-free tool that leverages amino acid k-mer matching and multi-index Bloom filters for rapid genome completeness assessment. Unlike current utilities, aaKomp supports user-defined reference databases, enabling customized assessments for any organism. In benchmarking against state-of-the-art tools using simulated T2T-CHM13 datasets, aaKomp achieved 68-fold faster execution and 15-fold lower memory consumption while maintaining accuracy. Testing on 94 Human Pangenome Reference Consortium assemblies and a European Eel assembly, aaKomp maintained one-minute runtimes (1.2 ± 0.35 min) and low memory usage (<13.64 GB). aaKomp's scoring system provides nuanced estimates rather than threshold-based classifications, offering increased resolution for tracking incremental improvements during iterative workflows. aaKomp's speed, memory efficiency, and flexible database generation make it well-suited for modern and biodiverse projects requiring the evaluation of hundreds of assemblies.
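The core idea (alignment-free membership queries of amino-acid k-mers against a reference index) can be sketched with a plain Bloom filter; aaKomp's multi-index Bloom filters are more elaborate, so treat this as a conceptual stand-in rather than the tool's data structure:

```python
import hashlib

class BloomFilter:
    """Plain Bloom filter for amino-acid k-mer membership (a simplified
    stand-in for the multi-index Bloom filters aaKomp uses)."""
    def __init__(self, n_bits=1 << 20, n_hashes=3):
        self.n_bits, self.n_hashes = n_bits, n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, kmer):
        for i in range(self.n_hashes):
            digest = hashlib.blake2b((kmer + "#" + str(i)).encode(), digest_size=8).digest()
            yield int.from_bytes(digest, "big") % self.n_bits

    def add(self, kmer):
        for p in self._positions(kmer):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, kmer):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(kmer))

def kmer_hit_fraction(protein, bloom, k=9):
    """Fraction of a protein's k-mers found in the reference filter."""
    kmers = [protein[i:i + k] for i in range(len(protein) - k + 1)]
    return sum(km in bloom for km in kmers) / len(kmers) if kmers else 0.0
```

A completeness score then aggregates hit fractions over a set of reference proteins; Bloom filters guarantee no false negatives, so fully represented proteins always score 1.0.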
bioinformatics2026-03-22v1ATHILAfinder: a tool to detect ATHILA LTR retrotransposons in plant genomes
Bousios, A.; Primetis, E.Abstract
Motivation: The ATHILA lineage of LTR retrotransposons has colonised all branches of the plant tree of life. In Arabidopsis thaliana and A. lyrata, ATHILA elements have invaded centromeres, influencing their genetic and epigenetic organisation and driving satellite evolution. To assess the broader significance of ATHILA across plants, a computational pipeline is needed to identify ATHILA elements with high efficiency. Existing tools lack this ability because they are optimised for broad transposon classification at the expense of precise annotation at lower taxonomic levels. Results: We present ATHILAfinder, a pipeline for accurate and large-scale discovery of ATHILA elements. ATHILAfinder uses lineage-specific sequence motifs as seeds and additional filters to build de novo intact elements. Homology-based steps rescue intact ATHILA and identify soloLTRs. A detailed identity card includes coordinates, LTR identity, coding capacity, length and other sequence features for every ATHILA. We validate ATHILAfinder in the A. thaliana Col-CEN assembly and five additional Brassicaceae species, covering four supertribes and ~30 million years of evolution. ATHILAfinder has very low false positive rates and outperforms widely used tools such as EDTA and the deep-learning-based Inpactor2 software for both recovery and precision of ATHILA. To demonstrate its usefulness, we generate insights into ATHILA dynamics across Brassicaceae. Outlook: Few computational pipelines target specific transposon lineages, yet such tools can empower their identification and downstream analyses. Our tailored approach can be adapted to other LTR retrotransposon lineages, offering new ways for high-resolution analysis of transposons.
bioinformatics2026-03-22v1Helicase: Vectorized parsing and bitpacking of genomic sequences
Martayan, I.; Lobet, L.; Marchet, C.; Paperman, C.Abstract
Modern sequencing pipelines routinely produce billions of reads, yet the dominant storage formats (FASTQ and FASTA) are text-based and sequential, making high-throughput parsing a persistent bottleneck in bioinformatics. Their regular, line-oriented structure makes them well-suited to SIMD vectorization, but existing libraries do not fully exploit it. We present vectorized algorithms for high-throughput FASTA/Q parsing, with on-the-fly handling of non-ACTG characters and built-in bitpacking of DNA sequences into multiple compact representations. The parsing logic is expressed as a finite state machine, compiled into efficient SIMD programs targeting both x86 and ARM CPUs. These algorithms are implemented in Helicase, a Rust library exposing a tunable interface that retrieves only caller-requested fields, minimizing unnecessary work. Exhaustive benchmarks across a wide range of CPUs show that Helicase meets or exceeds the throughput of all evaluated state-of-the-art libraries, making it the fastest general-purpose FASTA/Q parser to our knowledge. Availability: https://github.com/imartayan/helicase
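The bitpacking Helicase performs (2 bits per base, four bases per byte) can be shown as a scalar transform; Helicase itself is a Rust library with SIMD implementations, so this Python sketch only illustrates the logical encoding, not the vectorized algorithm:

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack_2bit(seq):
    """Pack an ACGT string into bytes, 4 bases per byte (2 bits per base)."""
    out = bytearray()
    acc = bits = 0
    for base in seq:
        acc = (acc << 2) | CODE[base]
        bits += 2
        if bits == 8:
            out.append(acc)
            acc = bits = 0
    if bits:
        out.append(acc << (8 - bits))   # left-align the final partial byte
    return bytes(out)

def unpack_2bit(buf, n):
    """Recover the first n bases from a 2-bit-packed buffer."""
    bases = "ACGT"
    seq = []
    for i in range(n):
        shift = 6 - 2 * (i % 4)
        seq.append(bases[(buf[i // 4] >> shift) & 3])
    return "".join(seq)
```

A SIMD parser does the same transform 16-64 bases at a time, which is why the line-oriented regularity of FASTA/Q matters so much for throughput.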
bioinformatics2026-03-22v1miRBind2 enables sequence-only prediction of miRNA binding and transcript repression
Cechak, D.; Tzimotoudis, D.; Sammut, S.; Gresova, K.; Marsalkova, E.; Farrugia, D.; Alexiou, P.Abstract
Motivation: MicroRNAs (miRNAs) regulate gene expression by guiding Argonaute proteins to partially complementary sites on target RNAs. While classical prediction methods rely on engineered features such as seed match categories, evolutionary conservation, and site context, recent advances in deep learning offer the potential to learn targeting rules directly from sequence. We developed a sequence-based deep learning model that improves miRNA target site prediction, and further validated the learned target site representations by extending the model to gene-level functional repression prediction. Results: We introduce miRBind2, a deep learning method for miRNA target site prediction that incorporates a novel pairwise nucleotide representation capturing all possible miRNA-target nucleotide interactions, with a CNN-based architecture. miRBind2 outperforms previous state-of-the-art models across four independent datasets from the debiased miRBench benchmark, while using 92% fewer parameters. We show that the convolutional features and weights learned by miRBind2 can be transferred to transcript-level prediction by extending the miRBind2 architecture and fine-tuning it on miRNA perturbation experiments. This miRBind2-3UTR model predicts gene repression from sequence alone. On a dataset of 50,549 miRNA-gene pairs, miRBind2-3UTR significantly outperforms TargetScan. These results show that deep models pretrained on target site data can capture regulatory signals and predict functional repression without requiring conventional engineered biological features. Availability: Models and source code are freely available via GitHub (https://github.com/BioGeMT/miRBind_2.0). A publicly available web tool for novel predictions and visualization is available at https://huggingface.co/spaces/dimostzim/BioGeMT-miRBind2. Contact: panagiotis.alexiou@um.edu.mt
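A pairwise nucleotide representation can be pictured as a 2-D grid with one channel per ordered nucleotide pair (16 channels for a 4-letter alphabet); this construction is our reading of the idea, not necessarily miRBind2's exact encoding:

```python
def pairwise_encoding(mirna, target, alphabet="ACGU"):
    """Build a len(mirna) x len(target) grid; each cell one-hot encodes the
    ordered (miRNA base, target base) pair into one of 16 channels."""
    grid = [[[0] * 16 for _ in target] for _ in mirna]
    for i, a in enumerate(mirna):
        for j, b in enumerate(target):
            grid[i][j][alphabet.index(a) * 4 + alphabet.index(b)] = 1
    return grid
```

A CNN over such a grid can, in principle, pick up any base-pairing pattern between the two sequences rather than only canonical seed matches.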
bioinformatics2026-03-21v1SVPG: A pangenome-based structural variant detection approach and rapid augmentation of pangenome graphs with new samples
Jiang, T.; Hu, H.; Gao, R.; Jiang, Z.; Zhou, M.; Gao, W.; Zhou, S.; Wang, G.Abstract
Breakthrough advances in long-read sequencing technologies have opened unprecedented opportunities to study genetic variations through comprehensive pangenome analysis. However, the availability of structural variant (SV) calling tools that can effectively leverage pangenome information is limited. In addition, efficient construction of pangenome graphs becomes increasingly challenging as larger numbers of samples are acquired. In this study, we present SVPG, an approach that leverages a haplotype-resolved pangenome reference for accurate SV detection and rapid pangenome graph augmentation from long-read sequencing data. Compared to state-of-the-art SV callers, SVPG maintained superior overall performance across different coverages and sequencing technologies. SVPG also achieves notable improvements in calling rare and individual-specific SVs on both simulated and real somatic datasets. Furthermore, in a benchmark involving 20 samples, SVPG accelerated pangenome graph augmentation by nearly 10-fold compared to traditional augmentation strategies. We believe that SVPG has the potential to revolutionize SV detection and serve as an effective and essential tool, offering new possibilities for advancing pangenomic research.
bioinformatics2026-03-20v4Coupling codon and protein constraints decouples drivers of variant pathogenicity
Chen, R.; Palpant, N.; Foley, G.; Boden, M.Abstract
Predicting the functional impact of genetic variants remains a fundamental challenge in genomics. Existing models focus on protein-intrinsic defects yet overlook regulatory constraints embedded within coding sequences. Here, we couple a codon language model (CaLM) with a protein language model (ESM-2) to dissect the drivers of variant pathogenicity. On ClinVar data, both modalities contribute near-equally to distinguishing pathogenic from benign variants. Evaluation across Deep Mutational Scanning and CRISPR-Based Genome Editing platforms in ClinMAVE reveals that loss-of-function variants are governed primarily by residue-level features, whereas gain-of-function variants show a greater relative contribution from codon-level constraints, albeit in a gene-specific manner. A controlled comparison of identical variants in BRCA1 and TP53 further suggests that codon-level signals are elevated in the endogenous genomic context. Together, these findings indicate that pathogenicity reflects both the "product" and the "process," and that the experimental platform may influence which dimension is observable.
bioinformatics2026-03-20v3SVPG: A pangenome-based structural variant detection approach and rapid augmentation of pangenome graphs with new samples
Jiang, T.; Hu, H.; Gao, R.; Cao, S.; Jiang, Z.; Liu, Y.; Zhou, M.; Gao, W.; Zhou, S.; Wang, G.Abstract
Breakthrough advances in long-read sequencing technologies have opened unprecedented opportunities to study genetic variations through comprehensive pangenome analysis. However, the availability of structural variant (SV) calling tools that can effectively leverage pangenome information is limited. In addition, efficient construction of pangenome graphs becomes increasingly challenging as larger numbers of samples are acquired. In this study, we present SVPG, an approach that leverages a haplotype-resolved pangenome reference for accurate SV detection and rapid pangenome graph augmentation from long-read sequencing data. Compared to state-of-the-art SV callers, SVPG maintained superior overall performance across different coverages and sequencing technologies. SVPG also achieves notable improvements in calling rare and individual-specific SVs on both simulated and real somatic datasets. Furthermore, in a benchmark involving 20 samples, SVPG accelerated pangenome graph augmentation by nearly 10-fold compared to traditional augmentation strategies. We believe that SVPG has the potential to revolutionize SV detection and serve as an effective and essential tool, offering new possibilities for advancing pangenomic research.
bioinformatics2026-03-20v3PyrMol: A Knowledge-Structured Pyramid Graph Framework for Generalizable Molecular Property Prediction
Li, Y.; Zhao, Q.; Wang, J.Abstract
Expert pharmaceutical chemists interpret molecular structures through a sophisticated cognitive hierarchy, transitioning from local functional moieties to spatial pharmacophores and, ultimately, to macroscopic pharmacological and physicochemical profiles. However, conventional Graph Neural Networks frequently overlook this high-level chemical intuition by treating molecules as single-scale atomic topology. To bridge this gap between human expertise and computational inference, we propose PyrMol, a knowledge-structured pyramid representation learning framework. By constructing heterogeneous hierarchical graphs, PyrMol orchestrates information flow across atomic, subgraph, and molecular levels. Crucially, the subgraph level systematically integrates three complementary expert views comprising functional groups, pharmacophores, and retrosynthetic fragments. To harmonize these explicit domain priors with implicit computational semantics, we introduce an adaptive Multi-source Knowledge Enhancement and Fusion module that dynamically balances their complementarity and redundancy. A Hierarchical Contrastive Learning strategy further ensures cross-scale semantic consistency. Empirical evaluations across ten benchmark datasets demonstrate that PyrMol outperforms 12 state-of-the-art baselines. Furthermore, its "plug-and-play" versatility provides a framework-agnostic performance boost for existing GNN architectures. PyrMol thus establishes a principled data-knowledge dual-driven paradigm for AI-aided Drug Discovery, effectively leveraging domain knowledge to catalyze advances in molecular property prediction.
bioinformatics2026-03-20v2A new pipeline for cross-validation fold-aware machine learning prediction of clinical outcomes addresses hidden data-leakage in omics-based 'predictors'.
Hurtado, M.; Pancaldi, V.Abstract
Motivation: Machine learning (ML) approaches are increasingly applied to high-dimensional biological data in which features are often dataset-dependent. In many omics workflows, features are computed using information derived from the entire dataset, such as correlations between variables, clustering structures, or enrichment scores. We refer to these as global dataset features, defined as features whose computation depends on properties of the full dataset. In such cases, standard validation strategies can fail, especially when evaluating on independent datasets, due to information leakage that leads to overly optimistic performance estimates. Results: To address this challenge, we present pipeML, a flexible and modular machine learning framework designed to support leakage-free model training through custom cross-validation (CV) fold construction. pipeML enables users to recompute global dataset features independently within each CV fold, ensuring strict separation between training and test data, while preserving compatibility with a wide range of ML algorithms for both classification and survival tasks. Using real-world biological datasets, we demonstrate that pipeML enables leakage-free model evaluation when global dataset features are used. We argue that overestimation of model performance during CV can lead to overoptimistic expectations for validation on independent datasets. By explicitly addressing data leakage and offering a transparent, modular workflow, pipeML provides a robust solution for developing and validating ML models in complex biological settings. Availability: The pipeML R package as well as a tutorial are available at https://github.com/VeraPancaldiLab/pipeML Contact: vera.pancaldi@inserm.fr or marcelo.hurtado@inserm.fr Supplementary information: Available at Bioinformatics online.
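The leakage pattern the authors target, and its fold-aware remedy, can be illustrated in a few lines: a global dataset feature (here, simple per-column centering standing in for correlations or enrichment scores) is recomputed from the training fold only, never from the full dataset. The function names below are illustrative, not pipeML's API (pipeML is an R package):

```python
import random

def fold_indices(n, k=5, seed=0):
    """Shuffle sample indices and deal them into k cross-validation folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fold_aware_scaling(X, k=5, seed=0):
    """For each fold: recompute the 'global' feature (per-column means) on the
    training samples only, then apply it to both training and test samples."""
    folds = fold_indices(len(X), k, seed)
    ncol = len(X[0])
    splits = []
    for test in folds:
        train = [j for f in folds if f is not test for j in f]
        # the global dataset feature is derived from training data only
        means = [sum(X[j][c] for j in train) / len(train) for c in range(ncol)]
        center = lambda row: [row[c] - means[c] for c in range(ncol)]
        splits.append(([center(X[j]) for j in train],
                       [center(X[j]) for j in test]))
    return splits
```

Computing `means` once over all of `X` before splitting is exactly the leakage the abstract describes: test samples would then influence the features they are evaluated on.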
bioinformatics2026-03-20v2ChiMER: Integrating chromatin architecture into splicing graphs for chimeric enhancer RNAs detection
Xiang, Y.; Xiao, X.; Zhou, B.; Xie, L.Abstract
Motivation: Enhancer-derived RNAs (eRNAs) and their fusion with protein coding genes represent a crucial yet understudied layer of transcriptional regulation. eRNAs are typically expressed at low levels, which makes fusion events difficult to detect with conventional fusion detection tools. In addition, these tools are not designed to capture fusion transcripts arising from spatial proximity between distal regulatory elements and gene loci. Reads spanning such regions are also frequently filtered as mapping artifacts. As a result, computational approaches for systematically identifying spatially mediated enhancer-exon fusion transcripts remain lacking. Methods: We developed ChiMER, a graph-based framework for detecting ChiMeric Enhancer RNAs from short-read RNA-seq data. ChiMER constructs splice graphs with chromatin contact information to introduce enhancer-exon edges and uses graph alignment to search for potential transcriptional paths. A ranking-based scoring module then prioritizes high-confidence events. Evaluations on simulated and real RNA-seq datasets show that ChiMER achieves higher sensitivity than conventional linear fusion detection methods while maintaining low false-positive rates. Results: Applied to cancer cell line RNA-seq datasets, ChiMER identified multiple enhancer-exon chimeric transcripts, several associated with super-enhancer regions. Multi-omics analysis further shows that fusion transcripts occur in transcriptionally active regulatory environments and frequently coincide with strong R-loop signals, suggesting a potential role of RNA-DNA hybrid structures in facilitating long-range transcriptional joining events.
bioinformatics · 2026-03-20 · v2
Integrative transcriptome-based drug repurposing in tuberculosis
Samart, K.; Thang, L.; Buskirk, L. R.; Tonielli, A. P.; Krishnan, A.; Ravi, J.
Abstract
Tuberculosis (TB) remains the leading cause of infectious disease mortality worldwide, killing over one million people annually. Rising antibiotic resistance has added urgency to the need for host-directed therapeutics (HDTs) that modulate host immune responses as a complement to therapies that directly target the pathogen. Repurposing FDA-approved drugs is particularly attractive for this purpose because their safety profiles are already well-established, substantially reducing development time and cost. Transcriptomic methods have successfully identified repurposable therapeutics for TB based on 'connectivity mapping,' which identifies drugs that reverse disease gene expression patterns. However, these applications have been limited to small subsets of data from a single platform and a few connectivity methods. Expanding beyond these constrained settings introduces substantial challenges, including dataset heterogeneity across transcriptomics platforms and biological conditions, uncertainty about optimal scoring methods, and the lack of systematic approaches to identify robust disease signatures. We developed a computational workflow that integrates 28 TB gene expression signatures and multiple connectivity scoring methods to capture dominant TB signals regardless of variation in microarray and RNAseq platforms, cell types, and infection conditions. We systematically identified 64 FDA-approved drugs as promising TB host-directed therapeutics. These high-confidence drug candidates include known HDTs, such as statins (rosuvastatin, fluvastatin, lovastatin) and tamoxifen, recently validated in experimental TB models. Our prioritized candidate drugs are enriched for TB-relevant therapeutic mechanisms, e.g., cholesterol metabolism inhibition and immune modulation pathways. Network analysis of disease-drug interactions identified 12 key bridging genes (including IL-8, CXCR2) that represent potential novel druggable targets for TB host-directed therapy.
This work establishes transcriptome-based connectivity mapping as a viable approach for systematic HDT discovery in bacterial infections and provides a robust computational framework applicable to other infectious diseases. Our findings offer immediate opportunities for experimental validation of prioritized drug candidates and mechanistic investigation of identified druggable targets in TB pathogenesis.
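The core connectivity-mapping computation such workflows build on can be illustrated simply. The sketch below is our own toy example, not the authors' code, and the gene names and values are invented for illustration: a drug whose expression signature anti-correlates with the disease signature receives a strongly negative connectivity score, flagging it as a candidate reverser.

```python
# Toy connectivity-mapping sketch (not the authors' workflow): score a
# drug by how strongly its signature anti-correlates with the disease
# signature over their shared genes.
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

def connectivity_score(disease_sig, drug_sig):
    """Negative correlation => the drug reverses the disease pattern."""
    genes = sorted(set(disease_sig) & set(drug_sig))
    return pearson([disease_sig[g] for g in genes],
                   [drug_sig[g] for g in genes])

# Hypothetical log-fold-change signatures (values invented).
disease = {"IL8": 2.1, "CXCR2": 1.7, "HMGCR": 1.2, "TNF": 0.9}
statin_like = {"IL8": -1.9, "CXCR2": -1.5, "HMGCR": -1.0, "TNF": -0.7}
score = connectivity_score(disease, statin_like)  # strongly negative
```

Real connectivity methods use rank-based enrichment statistics rather than raw correlation, but the reversal logic is the same.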
bioinformatics · 2026-03-20 · v2
A Multi-Dataset Transcriptomic Analysis Unravels Core Mechanisms Involving Vitamin D Metabolism and Inflammatory Pathways for Frailty Diagnosis.
Hu, X.; Zheng, W.; Li, Y.; Zhou, D.
Abstract
Frailty is a prevalent geriatric syndrome, and the shortage of objective biomarkers restricts its early diagnosis and intervention. This study aimed to identify robust molecular signatures and diagnostic markers for frailty using bioinformatics analyses of multiple independent datasets. Two transcriptome datasets (GSE144304, n=80; GSE287726, n=70) were obtained from the GEO database. We performed differential gene expression analysis, GO, KEGG and GSEA enrichment, and machine learning (70% training / 30% validation) to screen and validate core biomarkers. Numerous shared differentially expressed genes were identified. Vitamin D metabolism, ABC transporter, and inflammatory/immune pathways were consistently enriched and confirmed by GSEA. Machine learning models based on these signatures showed favorable diagnostic performance. Our study demonstrates that vitamin D metabolic disorders and chronic inflammation are core molecular features of frailty. The identified biomarkers provide new strategies for basic research, early clinical diagnosis, and therapeutic target development for frailty.
bioinformatics · 2026-03-20 · v1
Enhancing non-local interaction modeling for ab initio biomolecular calculations and simulations with ViSNet-PIMA
Cui, T.; Wang, Z.; Wang, T.
Abstract
AI-based molecular dynamics simulation brings ab initio calculations to biomolecules in an efficient way, with the machine learning force field (MLFF) at its core, accurately predicting molecular energies and forces. Most existing MLFFs assume localized interatomic interactions, limiting their ability to accurately model non-local interactions, which are crucial in biomolecular dynamics. In this study, we introduce ViSNet-PIMA, which efficiently learns non-local interactions through a physics-informed multipole aggregator (PIMA) and accurately encodes molecular geometric information. ViSNet-PIMA outperforms all state-of-the-art MLFFs for energy and force prediction across different kinds of biomolecules and various conformations on the MD22 and AIMD-Chig datasets, while adapting the PIMA blocks into other MLFFs yields further performance gains of 55.1%, demonstrating the superiority of ViSNet-PIMA and the universality of the model design. Furthermore, we propose AI2BMD-PIMA, which incorporates ViSNet-PIMA into the AI2BMD simulation program through a "Transfer Learning-Pretraining-Finetuning" scheme and replaces molecular mechanics-based non-local calculations among protein fragments with ViSNet-PIMA, reducing AI2BMD's energy and force calculation errors by more than 50% across different protein conformations and protein folding and unfolding processes. ViSNet-PIMA advances ab initio calculation for entire biomolecules, amplifying the value of AI-based molecular dynamics simulations and property calculations in biochemical research.
bioinformatics · 2026-03-20 · v1
Pareto optimization of masked superstrings improves compression of pan-genome k-mer sets
Plachy, J.; Sladky, O.; Brinda, K.; Vesely, P.
Abstract
The growing interest in k-mer-based methods across bioinformatics calls for compact k-mer set representations that can be optimized for specific downstream applications. Recently, masked superstrings have provided such flexibility by moving beyond de Bruijn graph paths to general k-mer superstrings equipped with a binary mask, thereby subsuming Spectrum-Preserving String Sets and achieving compactness on arbitrary k-mer sets. However, existing methods optimize superstring length and mask properties in two separate steps, possibly missing solutions where a small increase in superstring length yields a substantial reduction in mask complexity. Here, we introduce the first method for Pareto optimization of k-mer superstrings and masks, and apply it to the problem of compressing pan-genome k-mer sets. We model the compressibility of masked superstrings using an objective that combines superstring length and the number of runs in the mask. We prove that the resulting optimization problem is NP-hard and develop a heuristic based on iterative deepening search in the Aho-Corasick automaton. Using microbial pan-genome datasets, we characterize the Pareto front in the superstring-length/mask-run space and show that the front contains points that Pareto-dominate simplitigs and matchtigs, while nearly encompassing the previously studied greedy masked superstrings. Finally, we demonstrate that Pareto-optimized masked superstrings improve pan-genome k-mer set compressibility by 12-19% when combined with neural-network compressors.
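The combined objective and the notion of Pareto dominance described above are easy to make concrete. The following is a minimal illustration in Python (the function names and the linear weighting are ours, not the paper's): the objective sums superstring length and the number of runs in the binary mask, and one candidate dominates another when it is no worse on both coordinates and strictly better on at least one.

```python
# Illustrative sketch of the two optimization coordinates used for
# masked superstrings: superstring length and mask run count.
def mask_runs(mask):
    """Number of maximal runs of equal symbols in a binary mask."""
    return 1 + sum(1 for a, b in zip(mask, mask[1:]) if a != b) if mask else 0

def objective(superstring, mask, alpha=1.0):
    """Toy compressibility proxy: length plus weighted run count."""
    return len(superstring) + alpha * mask_runs(mask)

def pareto_dominates(a, b):
    """a, b: (superstring_length, mask_runs) pairs.
    True when a is no worse in both coordinates and differs from b."""
    return a[0] <= b[0] and a[1] <= b[1] and a != b
```

A Pareto front in this space is then the set of candidates that no other candidate dominates; the paper searches that front with an NP-hardness-motivated heuristic rather than exhaustively.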
bioinformatics · 2026-03-20 · v1
GenBio-PathFM: A State-of-the-Art Foundation Model for Histopathology
Kapse, S.; Aygün, M.; Cole, E.; Lundberg, E.; Song, L.; Xing, E. P.
Abstract
Recent advancements in histopathology foundation models (FMs) have largely been driven by scaling the training data, often utilizing massive proprietary datasets. However, the long-tailed distribution of morphological features in whole-slide images (WSIs) makes simple scaling inefficient, as common morphologies dominate the learning signal. We introduce GenBio-PathFM, a 1.1B-parameter FM that achieves state-of-the-art performance on public benchmarks while using a fraction of the training data required by current leading models. The efficiency of GenBio-PathFM is underpinned by two primary innovations: an automated data curation pipeline that prioritizes morphological diversity and a novel dual-stage learning strategy which we term JEDI (JEPA + DINO). Across the THUNDER, HEST, and PathoROB benchmarks, GenBio-PathFM demonstrates state-of-the-art accuracy and robustness. GenBio-PathFM is the strongest open-weight model to date and the only state-of-the-art model trained exclusively on public data.
bioinformatics · 2026-03-20 · v1
RNAGAN: Train One and Get Four, Multipurpose Human RNA-Seq Analysis Tool with Enhanced Interpretability and Small Data Size Capability
HOU, Z.; Lee, V. H.-F.; Kwong, D. L.-W.; Guan, X.; Liu, Z.; Dai, W.
Abstract
The advent of artificial intelligence (AI) has brought revolutionary tools to biomedical transcriptomic (RNA-level) research. However, persistent constraints remain, including limited interpretability in terms of biomedical concepts such as functional pathways, small sample sizes, and substantial time and computing-power requirements for AI training. To overcome these limitations, we developed RNAGAN (https://github.com/ZhaozhengHou-HKU/RNAGAN-1.0.git), an AI tool with a generative adversarial network (GAN) structure aimed at enhancing transcriptomic analysis. The network was built from public human datasets comprising 4.6 million single cells from multiple organs and 5,900 sequenced samples of various cancer types with normal references. A specialized pathway neural layer was embedded to extract activities of predefined pathways from the Human Molecular Signatures Database (MSigDB), or newly learned pathways from single-cell data. The structure of RNAGAN (generator and discriminator) enables four applications after one shared training procedure: 1. single-cell and bulk-level patient stratification or differential diagnosis; 2. analysis of gene and pathway markers in a selected disease; 3. pseudo-data generation when sample size is limited for downstream analysis; 4. vectorization with gene- and pathway-level features learned from multiple datasets. RNAGAN thus contributes to the efficient utilization of limited data for transcriptomic studies.
bioinformatics · 2026-03-20 · v1
TriGraphQA: a triple graph learning framework for model quality assessment of protein complexes
Liang, L.; Zhao, K.
Abstract
Accurate quality assessment of predicted protein-protein complex structures remains a major challenge. Existing graph-based quality assessment methods often treat the entire complex as a homogeneous graph, which obscures the physical distinction between intra-chain folding stability and inter-chain binding specificity. In this study, we introduce TriGraphQA, a novel triple graph learning framework designed for model quality assessment of protein complexes. TriGraphQA explicitly decouples monomeric and interfacial representations by constructing three geometric views: two residue-node graphs capturing the local folding environments of individual chains, and a dedicated contact-node graph representing the binding interface. Crucially, we propose an interface context aggregation module to project context-rich embeddings from the monomers onto the interface, effectively fusing multi-scale structural features. We conducted comprehensive tests on several challenging benchmark datasets, including Dimer50, DBM55-AF2, and HAF2. The results show that TriGraphQA significantly outperforms state-of-the-art single-model methods. TriGraphQA consistently achieves the highest global scoring correlations and lower top-ranking losses. Consequently, TriGraphQA provides a powerful evaluation tool for protein-protein docking, facilitating the reliable identification of near-native assemblies in large-scale structural modeling and molecular recognition studies.
bioinformatics · 2026-03-20 · v1
ECHO: a nanopore sequencing-based workflow for (epi)genetic profiling of the human repeatome
Poggiali, B.; Putzeys, L.; Andersen, J. D.; Vidaki, A.
Abstract
The human genome is dominated by repetitive DNA, whose genetic and epigenetic variation plays a key role in gene regulation, genome stability, and disease. Recent advances in long-read sequencing now enable large-scale, haplotype-resolved, and DNA methylation-informative analysis of the human genome, including previously inaccessible complex and repetitive regions. However, the comprehensive, simultaneous characterisation of the "human repeatome" remains challenging, largely due to the lack of tools integrated into a single pipeline that can capture the full spectrum of variation across diverse types of DNA repeats. Here, we present ECHO, a user-friendly, Snakemake-based pipeline for the "(Epi)genomic Characterisation of Human Repetitive Elements using Oxford Nanopore Sequencing". ECHO provides a reproducible and scalable framework for end-to-end analysis of whole-genome nanopore sequencing data, enabling integrative but also tailored (epi)genetic analyses of the human repeatome.
bioinformatics · 2026-03-20 · v1
CliPepPI: Scalable prediction of domain-peptide specificity using contrastive learning
Hochner-Vilk, T.; Stein, D.; Schueler-Furman, O.; Raveh, B.; Chook, Y. M.; Schneidman-Duhovny, D.
Abstract
Domain-peptide interactions mediate a significant fraction of cellular protein networks, yet accurately predicting their specificity remains challenging. Peptide motifs typically have short, fuzzy sequence profiles, and their interactions are often weak and transient, limiting the size, coverage, and quality of experimentally validated domain-peptide datasets. Since true non-binders are rarely known, constructing negative examples often introduces bias. While structure-based prediction methods can achieve high accuracy, they are computationally demanding and difficult to scale to the proteome level. We introduce CLIPepPI, a dual-encoder model that leverages contrastive learning to embed domains and peptides into a shared space directly from sequence. Both encoders are initialized from a protein language model (ESM-C) and fine-tuned using lightweight LoRA adapters, enabling parameter-efficient training on positive pairs alone. To overcome data scarcity, we augment ~3K protein-peptide complexes from PPI3D with ~150K domain-peptide pairs derived from protein-protein interfaces. CLIPepPI further injects structural information by marking interface residues in the domain sequence, thus guiding the encoders toward binding regions and linking sequence-level learning with structural context. Competitive performance is achieved across three independent benchmarks: domain-peptide complexes from PPI3D, large-scale phage-library data from ProP-PD, and a curated dataset of nuclear export signal (NES) sequences. We demonstrate scalability and generalization through two applications: (i) proteome-wide NES scanning, and (ii) variant-effect prediction, where score changes in domain-peptide interactions between wild-type and mutant sequences discriminate pathogenic from benign variants. Together, CLIPepPI offers a scalable, structure-informed model for predicting domain-peptide specificity and generating meaningful embeddings suited for large-scale proteomic analyses. 
CLIPepPI is available at: https://bio3d.cs.huji.ac.il/webserver/clipeppi/.
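The dual-encoder retrieval step can be sketched independently of the model itself. Below is a toy illustration (our simplification; the embeddings are invented, whereas the real model derives them from fine-tuned ESM-C encoders): once domains and peptides live in a shared embedding space, candidate peptides are ranked by cosine similarity to a domain embedding, which is what makes proteome-wide scanning cheap.

```python
# Toy sketch of CLIP-style dual-encoder retrieval (not the CLIPepPI
# model): rank peptides by cosine similarity to a domain embedding
# in a shared space.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def rank_peptides(domain_emb, peptide_embs):
    """peptide_embs: {peptide_id: embedding}. Returns ids sorted by
    similarity to the domain, best first."""
    return sorted(peptide_embs,
                  key=lambda p: cosine(domain_emb, peptide_embs[p]),
                  reverse=True)

# Hypothetical 2-D embeddings for illustration only.
domain = [1.0, 0.0]
peptides = {"binder": [0.9, 0.1], "decoy": [-1.0, 0.2]}
ranking = rank_peptides(domain, peptides)  # "binder" ranks first
```

The contrastive training objective pulls true domain-peptide pairs together in this space, so similarity at inference time doubles as a binding-specificity score.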
bioinformatics · 2026-03-20 · v1
RNASTOP: A Deep Learning Framework for mRNA Chemical Stability Prediction and Optimization
Lin, S.; Chen, J.; Sun, H.; Zhang, Y.; Yang, W.; tan, h.; Wei, D.-Q.; Jiang, Q.; Xiong, Y.
Abstract
Messenger RNA (mRNA) vaccines offer promising therapeutics for combating various diseases, yet their inherent chemical instability hampers their long-term efficacy. Although several methods have been developed to predict mRNA degradation, they exhibit limited accuracy and lack the capability for rational sequence optimization. Here, we propose RNASTOP, a novel framework integrating deep learning with heuristic search to simultaneously predict and optimize mRNA chemical stability. RNASTOP achieves a 13% accuracy improvement over the top-performing model on the Stanford OpenVaccine competition dataset and demonstrates robust generalization in predicting full-length mRNA degradation. Applied to mRNA codon optimization, RNASTOP reduces the minimum free energy of the Varicella-Zoster Virus vaccine sequence by 75.73% while maintaining high translation efficiency. Overall, RNASTOP serves as a powerful tool for predicting and optimizing mRNA chemical stability, poised to expedite the development of mRNA therapeutics. The source code of RNASTOP can be accessed at https://github.com/xlab-BioAI/RNASTOP.
bioinformatics · 2026-03-20 · v1
Computational Prediction of Plasmodium falciparum Antigen-T-cell Receptor Interactions via Molecular Docking: Implications for Malaria Vaccine Design
Kipkoech, G.; Kanda, W.; Irungu, B.; Nyangi, M.; Kimani, C.; Nyangacha, R.; Keter, L.; Atieno, D.; Gathirwa, J.; Kigondu, E.; Murungi, E.
Abstract
Malaria is one of the deadliest diseases in sub-Saharan Africa and Southeast Asia. Most fatalities occur in children under 5 years of age and in pregnant women, and are due to infection by Plasmodium spp., of which Plasmodium falciparum is the most virulent and is responsible for most of the morbidity and mortality. Despite various public health interventions, such as the use of insecticide-treated bed nets, indoor spraying of insecticides, and WHO-recommended artemisinin-based combination therapies (ACTs), malaria prevention still faces major setbacks due to drug resistance in P. falciparum and insecticide resistance in mosquitoes. This study uses molecular docking and immunoinformatics to screen various Plasmodium spp. antigens and evaluate their antigenicity and suitability as vaccine candidates. The P. falciparum antigen and T-cell receptor (TCR) structures were obtained from the Protein Data Bank (PDB) based on a range of factors related to their role in the lifecycle of the parasite and their status as vaccine targets. Protein structures not available in the PDB were predicted using AlphaFold. The 3D structures of selected P. falciparum antigens and TCRs were downloaded in PDB format, and all water molecules, HETATM records, and bound ligands were removed using BIOVIA Discovery Studio Visualizer. Molecular docking was then performed using the ClusPro v2.0 server, and the docked complexes were compared. The findings of this study provide valuable insights into the interaction of the human immune response with P. falciparum antigens. The three best-ranked antigen complexes involve PfCyRPA, PfMSP10, and PfCSP, supporting their potential as candidates for vaccine development. This study highlights the usefulness of computational docking in identifying P. falciparum antigens with excellent immunogenic potential as vaccine candidates.
bioinformatics · 2026-03-20 · v1
Dingent: An Easily Deployable Database Retrieval and Integration Agent framework
Kong, D.; Bei, S.; Wu, Y.; Tang, B.; Zhao, W.
Abstract
AI-driven data search and integration represent an emerging research direction. Although several LLM-based backend frameworks and agentic frameworks have emerged, a significant gap remains in developing a one-stop, configurable agent framework that supports various data sources and provides a web interface for efficient data retrieval using natural language. To address this, we present Dingent, a novel and configurable agent framework that facilitates data access across diverse resources and enables the flexible construction of agent applications. We demonstrate its capabilities across three distinct application scenarios, achieving promising results. The Dingent framework can be readily applied to other fields, such as earth sciences and ecology, to facilitate data discovery.
bioinformatics · 2026-03-20 · v1
WITHDRAWN: Beyond Binding Affinity: The Kinetic-Compatibility Hypothesis for Nipah Virus Neutralization
Bozkurt, C.
Abstract
The authors have withdrawn their manuscript because of a fundamental error in the identification of the biological target protein. The analysis was originally framed around the mechanical transitions of the Nipah virus Fusion (F) protein; however, the empirical functional data utilized (from the 2025 AdaptyvBio competition) was directed toward the Attachment (G) glycoprotein. While the sequence-level characterization of the binders remains internally consistent, the mechanical analogies used are not applicable to the Attachment (G) protein architecture. Therefore, the authors do not wish this work to be cited as a reference for the project. If you have any questions, please contact the corresponding author.
bioinformatics · 2026-03-19 · v2
Using Variable Window Sizes for Phylogenomic Analyses of Whole Genome Alignments
Ivan, J.; Lanfear, R.
Abstract
Many phylogenomic studies use non-overlapping windows to address gene tree discordance across a set of aligned genomes. Recently, Ivan et al. (2025) proposed an information-theoretic approach to choose an optimal window size given the alignment. However, this approach selects only a single fixed window size per chromosome, which is a useful first step but fails to account for variation in the size of non-recombining regions along each chromosome. Such variation is expected due to the stochastic nature of recombination as well as variation in recombination rates along chromosomes. In this study, we extend the approach of Ivan et al. (2025) to allow window sizes to vary across the chromosome, using a splitting-and-merging strategy that allows each window to be of arbitrary length. We show that the new method outperforms the fixed-window approach in recovering gene tree topologies on a wide range of simulated datasets. Applying the new method to the genomes of seven Heliconius butterflies, we found that average window sizes for the group ranged between 538-808 bp, but with a very similar distribution of gene tree topologies compared to previous studies that used fixed window sizes. For the genomes of great apes, average window sizes ranged from 4.2 kb to 6.2 kb, with the proportion of the major topology (i.e., grouping humans and chimpanzees together) reaching approximately 80%. In conclusion, our study highlights the limitations of using a fixed window size when recombination rates vary across chromosomes, and proposes a splitting-and-merging approach that allows for variable window sizes across whole-genome alignments.
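The merging half of the splitting-and-merging strategy can be caricatured with a toy step. The sketch below is our own simplification (the actual method uses an information-theoretic criterion over the alignment, not label equality): starting from minimal windows, adjacent windows are merged whenever they support the same gene-tree topology, yielding variable-length windows along the chromosome.

```python
# Toy merge step (our simplification, not the paper's criterion):
# collapse adjacent minimal windows that support the same gene-tree
# topology into one variable-length window.
def merge_windows(topologies):
    """topologies: per-window topology labels along a chromosome.
    Returns (start, end, label) tuples with end exclusive."""
    merged = []
    for i, label in enumerate(topologies):
        if merged and merged[-1][2] == label:
            s, _, lab = merged[-1]
            merged[-1] = (s, i + 1, lab)  # extend the current window
        else:
            merged.append((i, i + 1, label))
    return merged
```

In the real method a split step also tests whether a long window should be subdivided, so window boundaries can track local recombination structure in both directions.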
bioinformatics · 2026-03-19 · v2