Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
Metastatic Site Prediction in Breast Cancer using Kirchhoff's Law and Omics Knowledge Graph
Jha, A.; Khan, Y.; Sahay, R.; d'Aquin, M.
Abstract
Predicting the anatomical site of metastasis from a primary tumour remains an unsolved problem in breast cancer (BRCA) and metastatic disease more broadly. The difficulty is structural: metastatic biology is multi-site (bone, lung, liver, brain), multi-omics (genomics, proteomics, methylomics, drug response), and multi-modal (CNV, gene expression, DNA methylation, pathways, clinical associations). Existing classifiers either collapse this heterogeneity into a single feature vector or rely on a single omics layer, both of which discard the mechanistic structure that drives metastatic tropism. We introduce Kirchhoff Knowledge Graphs (K-KG), a framework that imports the conservation laws of electrical-circuit theory into knowledge graph reasoning. Our contributions are: (1) a layered RDF Cancer Decision Network integrating 36 polyomics datasets across mutations, pathways, drugs, diseases, and reactions; (2) two novel conservation laws - the Knowledge-Graph Voltage Law (KGVL) and Knowledge-Graph Current Law (KGCL) - that govern information flow during traversal and yield a principled measure of graph completeness; (3) topological motif mining on the conserved graph, replacing expression-based feature selection by identifying triangular sub-structures whose rewiring marks metastatic transition; (4) a Graph Convolutional Neural Network whose hidden layers are the omics layers themselves, predicting site-specific metastasis as a continuous percentage rather than a binary label. On TCGA-BRCA training plus one validation and four independent test cohorts from GEO, K-KG achieves 83.8% AUC for relapse prediction and up to 0.87 AUC / 0.91 F1 for brain-site-specific prediction, outperforming Random Forest, Neural Network, and SVM baselines by 8-20 AUC points. To our knowledge this is the first application of Kirchhoff's laws (1845, 1847) to graph-based machine learning, and the first metastasis predictor that returns a per-site contribution profile rather than a single label.
bioinformatics 2026-05-07 v3
PanVariants: Best Practice for Pangenome-based Variant Calling Pipeline and Framework
Yi, H.; Wang, L.; Chen, X.; Ding, Y.; Carroll, A.; Chang, P.-C.; Shafin, K.; Xu, L.; Zeng, X.; Zhao, X.; Gong, M.; Wei, X.; Hou, Y.; Ni, M.
Abstract
Background: Although pangenome references offer richer population diversity compared to linear references, current mainstream pangenome-based variant callers are limited to detecting only known variants stored in the graph. To address this limitation, we developed PanVariants, a novel pipeline designed to detect both known and novel variants accurately. We systematically evaluated its performance against the traditional linear alignment solution (BWA+GATK/Manta) and the existing pangenome-aware solution (DRAGEN/PanGenie) in three contexts: small-variant (SNV/indel) and structural-variant (SV) accuracy in Genome in a Bottle samples, clinical detection on positive samples, and application in cohort-based joint calling. Results: By integrating k-mer-based and mapping-based methods, PanVariants significantly reduced variant errors (FPs + FNs), achieving a 73% reduction compared to BWA+GATK and a 45% reduction compared to DRAGEN for SNVs. Retraining the DeepVariant model with high-quality DNBSEQ data further decreased errors by 15%. For SV detection, PanVariants attained an F1-score of 89.39%, markedly outperforming DRAGEN (68.18%) and BWA+Manta (58.33%), approaching long-read sequencing performance (95.22%). In validation using clinical positive samples, PanVariants successfully detected all expected pathogenic variants while PanGenie failed. In the cohort joint-calling analysis, PanVariants detected more variants, made fewer Mendelian inheritance errors, and gave better per-sample accuracy than GATK. Conclusions: PanVariants establishes a robust framework and best-practice pipeline for pangenome-based variant detection, achieving both sensitive novel variant discovery and high accuracy for SNVs, indels and SVs. Our systematic evaluation of optional processing steps and input variables offers practical guidance for users.
Validated across diagnostic and population-based applications, our findings strongly support the transition from linear to pangenome references in future genomics.
bioinformatics 2026-05-07 v3
STAT: A multi-agent framework for integrated and interactive spatial transcriptomics analysis
Chen, Y.; Han, S.; Chao, Z.; Liu, Y.; Zhang, F.; Chen, H.; Wang, J.; Xiao, J.; Yang, C.
Abstract
Spatial transcriptomics analysis often involves a myriad of computational methods across diverse platforms, leading analysts to spend excessive time on data assembly rather than deriving biological insights. Current AI solutions tend to either oversimplify spatial data into generic single-cell tables or operate autonomously without opportunities for intermediate review, thus hindering the visual and iterative analyses essential for spatial biology. In response to these challenges, we introduce STAT, a multi-agent framework, designed to make spatial analysis more conversational and user-friendly while maintaining transparency and control. STAT integrates a persistent session, a shared interactive tissue viewer, and a staged skill-aware pipeline, enabling a more intuitive analytical experience. In a comprehensive benchmark evaluation encompassing eleven analytical task categories across three spatial platforms and both cell- and spot-resolution data, STAT demonstrated superior performance compared to a baseline large language model and existing autonomous spatial analysis agents, excelling in task completion, analytical quality, and token efficiency. Notably, STAT enables multi-task spatial analysis of a mixed-resolution breast cancer cohort, successfully reproducing key findings from a published Visium HD colorectal cancer study based solely on natural language prompts. STAT thus facilitates trustworthy and scientifically rigorous spatial transcriptomics analysis, allowing researchers to focus more on biological interpretation.
bioinformatics 2026-05-07 v2
Better antibodies engineered with a GLIMPSE of human data
Hepler, N. L.; Hill, A. J.; Jaffe, D. B.; Gibbons, M. C.; Pfeiffer, K. A.; Hilton, D. M.; Freeman, M.; McDonnell, W. J.
Abstract
GLIMPSE-1 is a protein language model trained solely on paired human antibody sequences. It captures immunological features and achieves best-in-class performance in humanization benchmarks. We demonstrate the utility of GLIMPSE-1 in humanization; engineering of antibodies for affinity, species cross-reactivity, and key developability parameters; and the creation of highly divergent functional variants with <90% sequence identity to a marketed antibody. Learning exclusively from human antibody data enables GLIMPSE-1 to enhance therapeutics and native antibodies based on patterns in the human repertoire.
bioinformatics 2026-05-07 v2
immuneKG: An Immune-Cell-Aware Knowledge Graph Framework for Target Discovery in Immune-Mediated Diseases
Ye, Y.; PB-IDD Department, Pharmablock Sciences Inc.
Abstract
Biomedical knowledge graphs have emerged as foundational infrastructure for AI-driven drug discovery, yet their translational impact on novel target identification in immune-mediated diseases remains limited. Here we present immuneKG, a multimodal knowledge graph centred on autoimmune diseases, constructed through biologically meaningful feature reprogramming of disease nodes to enable deep mechanistic modelling of immune-related disorders. immuneKG introduces a new entity class immune_cell, and four original directed relation types, together adding 9,105 novel triples absent from all existing biomedical KG schemas. Disease nodes are endowed with three novel modal feature sets quantifying immune homeostatic imbalance: autoantibody profiles, cytokine signatures, and HLA genotypes, complemented by systemic involvement scores and genetic features. The graph encompasses over 407,000 training triples across 7,287 entities and 32 relation types. Applied to inflammatory bowel disease (IBD), immuneKG combined with a HeteroPNA-Attn graph neural network achieves a Hits@100 of 0.99 against a Clarivate Phase II+ clinical pipeline, while a novelty-penalised scoring function surfaces high-potential dark targets. The framework shifts from conventional candidate-space screening to a development-oriented decision-support paradigm, providing actionable and interpretable guidance for downstream drug discovery. The immuneKG project is publicly available now on GitHub at https://github.com/YaowenYe/immuneKG.
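The Hits@100 figure quoted above is a ranking metric: the fraction of known positives that a model places within its top 100 predictions. A minimal sketch of that computation follows. The function name and the example ranks are hypothetical and are not taken from the immuneKG code base.

```python
def hits_at_k(ranks, k):
    """Fraction of known positives whose 1-based rank is at most k."""
    if not ranks:
        raise ValueError("ranks must be non-empty")
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical ranks a model assigned to five known targets (illustration only).
example_ranks = [3, 57, 99, 100, 412]
score = hits_at_k(example_ranks, k=100)
print(f"Hits@100 = {score:.2f}")
```

Under this definition a Hits@100 of 0.99 would mean that nearly every reference target appears among the top 100 ranked candidates.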
bioinformatics 2026-05-07 v2
Scalable subclonal reconstruction of cancer cells in DNA sequencing data using a penalized likelihood model
Jiang, Y.; Montierth, M. D.; Ding, Y.; Yu, K.; Tran, Q.; Wu, A.; Li, R.; Ji, S.; Liu, X.; Shin, S. J.; Cao, S.; Tang, Y.; Lesluyes, T.; Kimmel, M.; Wang, J. R.; Tarabichi, M.; Zhu, H.; Van Loo, P.; Wang, W.
Abstract
Tumor subclonal architecture shapes cancer evolution, yet subclonal reconstruction from bulk sequencing remains difficult to scale due to computational cost and model complexity. We present CliPP, a penalized-likelihood framework that jointly estimates cellular prevalence with pairwise fusion penalties, automatically identifying subclones without requiring extensive priors. Across simulations and 2,778 whole-genome tumors with external consensus reconstructions, CliPP performs consistently well compared with state-of-the-art approaches while substantially reducing runtime. Applied to 7,000+ tumors across >30 cancer types, CliPP quantifies pervasive subclonality and delineates cohort-level subclone landscapes. CliPP enables fast, reproducible large-scale subclonal analysis and is freely available to the community through GitHub and a Shiny app.
bioinformatics 2026-05-07 v2
Pan-cancer virtual spatial transcriptomics from routine histology with Phoenix
Tran, M.; Gindra, R. H.; Putze, P.; Senbai, K.; Palla, G.; Kos, T.; Falcomata, C.; Wang, C.; Guo, R.; Boxberg, M.; Berclaz, L. M.; Lindner, L. H.; Bergmayr, L.; Knoesel, T.; Jurmeister, P.; Klauschen, F.; Homicsko, K.; Gottardo, R.; Eckstein, M.; Matek, C.; Mock, A.; Theis, F. J.; Saur, D.; Peng, T.
Abstract
Spatial transcriptomics links gene expression to tissue architecture, providing a mechanistic view of cellular organization. Yet existing datasets cover few donors and miss the complexity of human disease. Experimental costs remain prohibitive, and large-scale profiling is impractically slow for population-level studies. Accurate computational methods are urgently needed. Predicting gene expression from standard histology, however, remains an open problem, as current approaches transfer poorly to unseen cohorts and diseases. Here, we present Phoenix, a latent flow matching generative model that infers pan-cancer spatially resolved single-cell gene expression with high accuracy. Phoenix analyzes treatment response in silico: Applied to 763 head and neck cancer patients, it identified three new spatial biomarkers that we validated across two cancers (breast cancer, n = 84; ovarian cancer, n = 157) and treatment regimens (platinum, trastuzumab). Phoenix generalizes beyond carcinomas: In a large sarcoma cohort (802 tissue microarray cores), it accurately predicted cell-type-specific signatures in held-out samples and captured chemotherapy-induced immune remodeling. Phoenix also extends across species: In a mouse model, it accurately predicted the expression of pancreatic cancer lineage markers and the mutant mKrasG12D allele in silico. Together, Phoenix establishes virtual spatial transcriptomics from routine histology as a scalable framework for studying tissue organization, therapeutic response, and disease mechanisms.
bioinformatics 2026-05-07 v2
SLiMNet: a deep learning model to detect short linear motifs using protein large language model representations and paired inputs
McFee, M. C.; Kim, P. M.
Abstract
Short linear motifs (SLiMs) are short (3-15 amino acids in length) segments within intrinsically disordered regions (IDRs) that mediate transient protein-protein interactions as well as other functions such as stability and subcellular localization. Only a few thousand out of likely hundreds of thousands have been experimentally validated. SLiMs can be detected as conserved regions inside IDRs using local alignments, though current approaches have limited sensitivity and specificity and are unable to functionally annotate their hits. Assigning function is hence a major outstanding issue in SLiM biology. Here we present SLiMNet, a deep learning model inspired by siamese networks and contrastive learning that predicts functional similarity in pairs of SLiMs. SLiMNet uses protein large language model embeddings and is trained on annotated sets of SLiMs. We show that it detects shared function in unseen, non-redundant motif pairs, and its scores correlate with experimental binding strengths from deep mutational scanning of cyclin-binding motifs. Using SLiMNet we provide repositories of putative SLiM pairs derived from annotated IDRs to help with hypothesis generation for the functional annotation of SLiMs. This includes an atlas generated from all-by-all scoring of 16-mers from tiled IDRs from the DisProt database. We show that it captures a new nuclear localization motif recently added to MoMaP and a PRMT1 methylation motif in the literature. We also provide a repository of all IDRs scored with SLiMNet against all MoMaP instances, and an atlas of potential functional pairs for 256 known orphan motifs (motifs with only a single known instance with essential function). Collectively, these atlases are useful resources for the SLiM biology community.
bioinformatics 2026-05-07 v1
scLASER: a robust framework for simulating and detecting time-dependent single-cell dynamics in longitudinal studies
Vanderlinden, L. A.; Vargas, J.; Inamo, J.; Young, J.; Wang, C.; Zhang, F.
Abstract
Longitudinal single-cell clinical studies enable tracking within-individual cellular dynamics, but methods for modeling temporal phenotypic changes and estimating power remain limited. We present scLASER, a framework detecting time-dependent cellular neighborhood dynamics and simulating longitudinal single-cell datasets for power estimation. Across benchmark experiments, scLASER shows consistently higher sensitivity than traditional cluster-based approaches, with particularly pronounced gains in rare cell types and non-linear temporal patterns. Applications to inflammatory bowel disease (95,813 cells, 38 patients) reveal treatment-responsive NOTCH3+ stromal trajectories with high cell type discrimination (AUC > 0.92), while analysis of COVID-19 data (188,181 cells, 84 patients) identifies three distinct axes of T cell activity (cytotoxic effector, NK immunoreceptor signaling, and interferon-stimulated gene programs) over disease progression. scLASER enables robust longitudinal single-cell analysis and optimization of study design.
bioinformatics 2026-05-07 v1
A lightweight codon-based DNA Transformer for Regulatory Region Identification in the Genome
Karthik, A. S. P.; Das, A. B.
Abstract
We developed a lightweight codon-based DNA Transformer with multi-head self-attention and an adaptive classifier head, which we named ExIT (Exon-Intron Transformer). The model uses codon tokenization. It classifies exons versus introns with high accuracy and achieves moderate accuracy in CDS classification and splice-site recognition. ExIT was validated on the human genome, with external validation on the chimpanzee genome. Further benchmarking indicates that it outperforms existing models on these tasks.
bioinformatics 2026-05-07 v1
Bridging genomes and peptidomes: hybrid sequencing reveals conserved bioactive peptides in crustaceans
Fields, L.; Qin, J.; Ibarra, A. E.; Selby, K. G.; Gao, T.; Dang, T. C.; Lu, H.; Li, L.
Abstract
Endogenous peptides are critical regulators of signaling and immunity but remain difficult to characterize in organisms with incomplete genomic annotation. We developed a hybrid discovery platform that integrates transformer-based de novo sequencing (Casanovo), neuropeptide-focused database searching (EndoGenius), and empirical false discovery rate estimation via NovoBoard. This pipeline enables confident identification of endogenous peptides while expanding coverage beyond conventional database-only or de novo-only approaches. Applied to neuroendocrine tissues from Callinectes sapidus and Cancer borealis, the workflow revealed numerous high-abundance novel peptides and provided structural and genomic support for their biological relevance. Notably, we report the first histone-2A-derived antimicrobial peptide in C. sapidus and characterize naturally occurring sequence variants. We also identified unexpected peptide homologies between crustaceans and Rattus norvegicus, enabling annotation of conserved housekeeping proteins in sparsely annotated genomes. This hybrid platform establishes a scalable, open-source strategy for advancing neuropeptidomics and endogenous peptide discovery in emerging model organisms.
bioinformatics 2026-05-07 v1
ORBIT: Orthogonal Rotation for Biological Inter-species Transfer
Wissenberg, P.; Lee, J. M.; Mutwil, M.
Abstract
Motivation. Cross-species gene embeddings are central to transferring functional annotations between species. A recent method demonstrated that species-specific STRING (PPI) network embeddings can be aligned across 1322 eukaryotes with autoencoders (FedCoder), but this approach is computationally expensive, depends on careful hyperparameter selection, leaves substantial room for improvement in cross-species retrieval quality, and has not been demonstrated on coexpression networks. Results. We introduce an alignment pipeline for cross-species coexpression network embeddings based on orthogonal Procrustes rotation. Species-specific Node2Vec embeddings of coexpression networks are aligned to a shared space using ortholog anchors from OrthoFinder, solved in closed form via Singular Value Decomposition (SVD). Applied to 153 plant species and 5.7 million genes, Procrustes alignment achieves four-fold higher cross-species Spearman correlation and consistently higher retrieval metrics than the SPACE autoencoder, while leaving within-species coexpression structure invariant (preservation ratio 1.000 against the unaligned baseline). The full alignment completes in under three minutes on a single CPU, and on downstream tasks, Procrustes embeddings improve within-species GO term prediction and outperform SPACE for cross-species GO transfer. Procrustes and sequence embeddings remain complementary for biological-process prediction, consistent with observations from SPACE. Availability. Code for producing the embeddings is made available at https://github.com/pwissenberg/orbit
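The closed-form alignment described above (orthogonal Procrustes solved by SVD) can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the ORBIT implementation: the function name, the random "anchor" matrices, and the embedding dimension are all hypothetical, and rows of the two matrices are assumed to be paired ortholog anchors.

```python
import numpy as np


def procrustes_rotation(source, target):
    """Orthogonal R minimising ||source @ R - target||_F (rows are paired anchors)."""
    # SVD of the cross-covariance between the two anchor sets.
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt


# Synthetic anchors (illustration only): species B is an exact rotation of species A.
rng = np.random.default_rng(0)
anchors_a = rng.normal(size=(50, 8))
true_q, _ = np.linalg.qr(rng.normal(size=(8, 8)))
anchors_b = anchors_a @ true_q

rotation = procrustes_rotation(anchors_a, anchors_b)
aligned_a = anchors_a @ rotation
```

Because the rotation is orthogonal, inner products and distances among the rows of `anchors_a` are unchanged after alignment, which is consistent with the within-species structure being left invariant.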
bioinformatics 2026-05-07 v1
Image-Conditioned Diffusion for Privacy-Preserving Synthetic Medical Images
Yaya-Stupp, D.; Lutsker, G.; Spiegel-Yerushalmi, O.; Segal, E.
Abstract
Medical imaging models depend on large, shareable datasets, yet privacy constraints limit data dissemination. Current text-conditioned diffusion models fail to preserve subtle, distributed clinical signals, such as continuous physiological biomarkers, rendering synthetic data insufficient for robust downstream physiological modeling. Here, we evaluate image-to-image (I2I) diffusion as a tunable, privacy-preserving transformation that produces a synthetic counterpart of real images while preserving downstream-relevant information. We fine-tune Stable Diffusion with low-rank adapters on retinal fundus photographs and chest radiographs, assessing fidelity, clinical signal preservation, cross-site transfer, and empirical re-identification risk. I2I consistently outperforms text-to-image generation in image fidelity and in preserving biomarker information. In cross-cohort transfer to an external retinal dataset from the UK Biobank, pretraining on I2I synthetic data performs comparably to real-image pretraining and surpasses it in the smallest fine-tuning sets. Varying I2I strength reveals that the privacy-utility tradeoff is highly modality-dependent: while retinal images achieve practical de-identification, chest X-rays exhibit structural combinatorics that leave them substantially re-identifiable even at high noise strengths, exposing critical boundaries for diffusion-based anonymization. These results position image-conditioned diffusion as a practical approach for generating shareable medical images with tunable de-identification.
bioinformatics 2026-05-07 v1
ProtSpace: Protein Universe in Your Browser
Senoner, T.; Vahidi, P.; Olenyi, T.; Senoner, F.; Sisman, G.; Kahl, E.; Rost, B.; Koludarov, I.
Abstract
Protein Language Models (pLMs) generate per-protein embeddings that encode functional, structural, and evolutionary information, yet the relationships captured in these representations remain difficult to explore systematically. ProtSpace (https://protspace.app) is a web application for interactive visualization of pLM embedding spaces, enabling hypothesis generation directly in the browser without installation. Unlike traditional network-based tools that exclusively visualize amino acid sequence similarity, ProtSpace explores embedding spaces, revealing relationships often not captured by traditional comparisons. Users provide protein sequences or pre-computed embeddings through a Google Colab notebook or the Python CLI; the pipeline applies dimensionality reduction, retrieves 38 annotation types spanning UniProt, InterPro, NCBI Taxonomy, TED structural domains, and sequence-based predictors served via Biocentral, and produces a portable binary file for the browser-based viewer. WebGL-accelerated rendering supports interactive exploration of over 570,000 proteins. Distinctive features include per-point pie charts for multi-label annotations and integrated 3D structure viewing through AlphaFold2 predictions. All computation happens on the user's machine, ensuring data privacy. We demonstrate the utility of ProtSpace through a progressive zoom-in across biological scales: from global proteome organization of Swiss-Prot, through cross-species comparison revealing conserved and lineage-specific families, to functional hypothesis generation within the beta-lactamase superfamily. ProtSpace is freely available at https://protspace.app under the Apache 2.0 license.
bioinformatics 2026-05-07 v1
metaJAM: a Nextflow integrated metagenomic workflow for sedimentary ancient DNA
Johnson, E.; Jin, C.; Guinet, B.; Alumbaugh, J.; Martin, N. L.
Abstract
The application of metagenomics in ancient DNA (aDNA) research is rapidly expanding, driven in particular by advances in sedimentary aDNA research and sequencing technologies. Although many ancient DNA studies rely on broadly similar bioinformatic strategies, there is still no single standardized, widely adopted workflow. These differences can directly affect how efficiently past biodiversity can be reconstructed and authenticated from the various archives analyzed using ancient metagenomic approaches. Although a few pipelines tackle the processing of ancient DNA data from shotgun sequencing, the ones applied to metagenomic datasets are scarce and often resource-intensive or challenging to install, update, or extend with new tools and parameters. metaJAM, a scalable and user-friendly pipeline, is presented here to specifically address the challenges of metagenomic aDNA analyses of eukaryotes. The pipeline has been designed in Nextflow to ensure continuous development and can be used on different high-performance computing (HPC) clusters. metaJAM integrates all key steps required for ancient DNA metagenomic analyses, from raw sequencing data pre-processing to microbial filtering, taxonomic assignment via competitive iterative mapping against Bowtie 2 reference indexes and reassignment using lowest common ancestor (LCA) inference. Validation and authentication are performed using the post-LCA toolkit bamdam together with alignment to an exhaustive reference database using MMseqs2. It allows users to choose among alternative tools and generates a series of plots to support data visualization and taxon authentication. 
metaJAM differs from existing pipelines through its implementation of rigorous filtering of microbial-like reads by Kraken 2 classification and masking microbial-like regions, iterative or parallel Bowtie 2 mapping, validation of the detected taxa and integration of up-to-date tools for ancient metagenomic analysis, along with diagnostic plots that help users assess the reliability of taxonomic assignments and visualize their data. It copes well with limited computational resources, supports customised databases for taxonomic groups, and provides an accessible workflow to support the investigation of metagenomic ancient DNA datasets. Its applications span a range of contexts, from ecosystem reconstructions in environmental aDNA archives such as sediments, to metagenomic studies on archaeological artefacts and even taxonomic identification of undiagnosed biological materials.
bioinformatics 2026-05-07 v1
BGC-QUAST: a quality assessment tool for genome mining software
Kushnareva, A.; Tupikina, D.; Almessady, H.; McHardy, A.; Gurevich, A.
Abstract
Summary: Biosynthetic gene clusters (BGCs) encode microbial natural products, many of which have important ecological and biomedical roles. Genome mining tools enable large-scale BGC prediction, but their outputs differ substantially, complicating comparison and interpretation. We present BGC-QUAST, a framework for evaluating and comparing BGC predictions across three analysis modes: comparison across samples, assessment of BGC recovery in draft assemblies relative to reference genomes, and comparison of predictions from different tools using overlap analysis. BGC-QUAST provides standardized metrics, interactive visualizations, and integrated outputs for joint inspection of predictions, enabling the comprehensive comparison of genome mining results and facilitating sample prioritisation based on biosynthetic potential. Availability and implementation: BGC-QUAST is publicly available at https://github.com/gurevichlab/bgc-quast
bioinformatics 2026-05-07 v1
geneSync: Gene Symbol Harmonization for Large-scale RNA-seq Data Integration
Feng, Z.; Li, T.
Abstract
Cross-cohort integration of transcriptomic data is a routine strategy for boosting statistical power and enhancing generalizability. However, gene nomenclature inconsistencies across datasets (arising from annotation version updates, historical renaming, and synonym reassignment) introduce silent mismatches during feature alignment, causing genes to be falsely classified as absent or split into duplicate features. Here, we present geneSync, an R package that performs gene symbol harmonization as a quality-control (QC) step prior to data integration. geneSync uses a hierarchical matching strategy, prioritizing exact matches to authoritative gene symbols, then exact matches to National Center for Biotechnology Information (NCBI) gene symbols, and finally synonym-based fallback. It includes built-in offline databases for human, mouse, and rat, and supports auditable conflict resolution, cross-species ortholog mapping, and native integration with Seurat and SingleCellExperiment objects. Benchmarking across six mouse hippocampus scRNA-seq datasets spanning 2020-2025 and five CellRanger versions shows that 1.41%-6.22% of features require synonym resolution, and harmonization improves pairwise gene overlap by up to 13.14 percentage points, rescuing 707-1,098 genes per dataset pair. Notably, CellRanger annotation version, rather than data collection year, was identified as the primary driver of nomenclature discrepancy. geneSync is freely available at https://github.com/xiaoqqjun/geneSync.
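The three-tier matching order described above (authoritative symbol, then NCBI symbol, then synonym fallback) can be illustrated with a small lookup cascade. This Python sketch is not the geneSync R API: the function name, the return labels, and the placeholder symbols in the toy tables are all hypothetical.

```python
def harmonize_symbol(symbol, authoritative, ncbi, synonyms):
    """Resolve one gene symbol using a fixed priority order.

    Returns (resolved_symbol, tier), where tier records which rule matched
    so that every decision can be audited afterwards.
    """
    if symbol in authoritative:      # tier 1: exact authoritative symbol
        return symbol, "authoritative"
    if symbol in ncbi:               # tier 2: exact NCBI symbol
        return symbol, "ncbi"
    if symbol in synonyms:           # tier 3: synonym mapped to a current symbol
        return synonyms[symbol], "synonym"
    return None, "unmatched"


# Toy lookup tables with placeholder names (not real gene records).
authoritative = {"GeneA", "GeneB"}
ncbi = {"GeneA", "GeneC"}
synonyms = {"OldA": "GeneA", "OldC": "GeneC"}

results = {
    s: harmonize_symbol(s, authoritative, ncbi, synonyms)
    for s in ["GeneA", "GeneC", "OldA", "Mystery"]
}
for symbol, (resolved, tier) in results.items():
    print(symbol, "->", resolved, f"({tier})")
```

Checking the tiers in a fixed order means a symbol that is both a current name and an old synonym of another gene always resolves to itself, which is one way such a cascade can avoid silently merging distinct features.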
bioinformatics 2026-05-07 v1
Gene-Modulated Network Diffusion for Improved Modeling of Amyloid-β Spread in Alzheimer's Disease
Xu, F. H.; Duong-Tran, D.; Huang, H.; Saykin, A. J.; Thompson, P. M.; Davatzikos, C.; Zhao, Y.; Shen, L.
Abstract
Understanding the pathogenesis of amyloid-β pathology in Alzheimer's Disease (AD) proves to be a challenge. In this work, we expand upon the application of network diffusion models (NDM) to study pathophysiological spread of amyloid-β throughout white matter structural brain networks. We found that the NDM successfully recaptures subpopulation-level spatial patterns (Pearson's R=0.45-0.48, PFDR < 0.01) of amyloid-β deposition in the Alzheimer's Disease Neuroimaging Cohort at a regional level, but with drawbacks in mechanism interpretability. We then moved to an extended NDM framework (eNDM), including a protein synthesis term to better reflect the role of amyloid-β metabolism, as well as including regional vulnerability using spatial transcriptomics from the Allen Human Brain Atlas to modulate the region-level rate parameters of the synthesis term. The novel gene eNDMs exhibited significant performance increases in Pearson's correlation (Steiger's Z, PFDR < 0.10) over baseline NDM performance in mild cognitive impairment and AD groups using APOE, SORL1, and FGL2 for gene modulation. The results were robust and replicable when testing on an external cohort of the Alzheimer's Disease Sequencing Project. The study thus demonstrates the importance of regional genetic vulnerability, in conjunction with network diffusion mechanisms, in improving the modelling and prediction of amyloid-β pathophysiological spread.
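For readers unfamiliar with network diffusion models, the sketch below shows one common baseline formulation, assumed here rather than taken from the abstract: pathology x evolves as dx/dt = -beta * L * x on the graph Laplacian L of a symmetric connectivity matrix, giving x(t) = exp(-beta * L * t) x(0). The function name, the three-region toy network, and the parameter values are hypothetical, and the gene-modulated synthesis term of the eNDM is not modelled.

```python
import numpy as np


def network_diffusion(adjacency, x0, beta, t):
    """Solve dx/dt = -beta * L @ x for a symmetric adjacency matrix."""
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    # A symmetric Laplacian has a real eigendecomposition L = V diag(w) V^T,
    # so the matrix exponential reduces to scaling each eigenmode.
    eigvals, eigvecs = np.linalg.eigh(laplacian)
    decay = np.exp(-beta * eigvals * t)
    return eigvecs @ (decay * (eigvecs.T @ x0))


# Toy network: three regions connected in a chain, pathology seeded in region 0.
adjacency = np.array([[0.0, 1.0, 0.0],
                      [1.0, 0.0, 1.0],
                      [0.0, 1.0, 0.0]])
seed = np.array([1.0, 0.0, 0.0])

early = network_diffusion(adjacency, seed, beta=1.0, t=0.1)
late = network_diffusion(adjacency, seed, beta=1.0, t=50.0)
print(early, late)
```

In this pure-diffusion form the total burden is conserved and eventually spreads evenly across a connected network, which is one motivation for adding a source term such as the synthesis term mentioned in the abstract.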
bioinformatics 2026-05-07 v1
Solid Tumors Pan Cancer Transcriptome Tissue/Cancer specific expression groups at the Isoform-Level
Surana, P.; Obusan, M.; Davuluri, R. V.
Abstract
Most of the human genome is transcribed into diverse isoforms whose tissue specificity is profoundly disrupted in cancer, yet isoform-level dysregulation remains poorly characterized across solid tumors. Here, we introduce STPCaT (Solid Tumors Pan-Cancer Transcriptome), an isoform-centric analysis extending TransTEx to systematically classify transcript expression across TCGA solid tumors and GTEx normal tissues. STPCaT reveals a striking collapse of normal tissue-specific programs in cancer, accompanied by the emergence of two dominant expression groups: cancer-high (CanHigh) and normal-high (NorHigh) isoforms. We uncover a large repertoire of previously unannotated Cancer-Testis Antigens (CTAs), the majority of which are absent from existing CTA databases, with broad relevance across multiple cancers, including gliomas. In pan-gliomas, consensus clustering and random-forest feature selection identify compact, highly discriminative isoform signatures that robustly stratify low-grade and glioblastomas with up to 97 to 98% accuracy using as few as five transcripts. These signatures recapitulate canonical glioma biology and highlight pathways linked to migration, development, and vesicle trafficking. Independent validation in the GLASS consortium cohort demonstrates cohort-specific trends that partially recapitulate primary findings, reflecting known biological heterogeneity across patient populations. Together, STPCaT provides a scalable, isoform-resolved resource for tumor stratification, CTA discovery, and precision oncology applications across solid tumors.
bioinformatics 2026-05-07 v1
A vaccine for global eradication of TB - A novel conceptual framework and design of a potent peptide-based vaccine with universal coverage through advanced computational vaccinology
Pawar, P.; Samarasinghe, S.
Abstract
Tuberculosis (TB) remains a formidable global health challenge, exacerbated by the emergence of drug-resistant Mycobacterium tuberculosis strains that threaten to render existing drug therapies and vaccines ineffective. Despite the availability of the Bacillus Calmette-Guerin (BCG) vaccine, its limited efficacy, primarily in infants and young children, falls short of reducing TB prevalence or offering adequate protection to adults. Therefore, developing a new TB vaccine with enhanced efficacy and the capability to generate a robust reservoir of memory cells is essential. Addressing the challenge of drug-resistant tuberculosis requires a deep understanding of bacterial evolution and developing robust countermeasures. This study aims to design a next-generation TB vaccine that provides broad-spectrum protection against various Mycobacterium tuberculosis strains, including drug-resistant ones. By conducting an in-depth investigation into pathogen-human interactions, the research proposes a holistic framework that leverages computational vaccinology to tackle challenges posed by pathogen polymorphism and overcome the limitations of conventional vaccines. By targeting conserved proteins across diverse TB strains and enhancing both humoral and cell-mediated immunity, this study proposes a new strategy for an epitope-based vaccine that provides long-lasting, universal coverage. An extensive proteomic, reverse vaccinology and immunoinformatics analysis of 159 TB strains yielded 27 highly conserved, immunogenic, non-toxic, and non-allergenic epitopes. These epitopes, consisting of 14 cytotoxic T-lymphocyte (CTL), 5 helper T-lymphocyte (HTL), and 8 B-cell epitopes, were used to construct a three-dimensional, multi-epitope TB vaccine designed based on a new concept introduced in this research for maximising vaccine efficacy.
Molecular docking and immune simulation studies demonstrated a significant affinity between the vaccine constructs and toll-like receptors, indicating a strong potential for effective immune system engagement. The crucial features of the epitope-based TB vaccine constructed in this research include sequence conservancy, robust antigenicity, exclusion of self-peptides and potential for diverse allelic interactions. The proposed epitope-based vaccine is poised to be highly effective, safe, and capable of providing universal coverage, potentially paving the way for global TB eradication. Validation in laboratory and clinical settings will be essential to confirm its efficacy and real-world applicability.
bioinformatics2026-05-07v1Steering Sequence Generation in Protein Language Models through Iterative Lookback Monte Carlo Sampling
Calvanese, F.; Lombardi, G.; Weigt, M.; Fernandez-de-Cossio-Diaz, J.Abstract
Protein language models (pLMs) leverage large-scale evolutionary data to generate novel sequences, but steering generation toward desired physicochemical properties without sacrificing diversity remains a major challenge. Existing approaches often induce severe diversity loss or require computationally expensive retraining. We introduce Iterative Lookback Monte Carlo (ILMC), a training-free inference-time sampling strategy that interleaves autoregressive elongation with Metropolis-Hastings refinement to approximate sampling from a maximum-entropy target distribution balancing generative quality and steering objectives. We show theoretically that this target distribution is entropy-maximizing under fixed generative quality and steering constraints, and empirically that ILMC produces more diverse samples than standard autoregressive baselines at matched generative quality. Using simple steering potentials, ILMC improves desired molecular properties, including generating proteins with up to 12 °C higher predicted melting temperature than compute-matched alternative strategies. ILMC naturally applies to classifier-guided steering, where it outperforms purely autoregressive guidance in diversity while maintaining comparable enrichment of target properties. We validate ILMC on family-specific pLMs and on the multi-family model ProGen3.
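The idea of interleaving autoregressive elongation with Metropolis-Hastings refinement can be illustrated with a toy sketch. It is not the authors' implementation: the four-letter alphabet, the first-order toy model in `log_p_next`, the steering potential `steering`, the function name `ilmc_like_sample`, and the assumed target form p_model(x) * exp(beta * steering(x)) are all hypothetical choices made for illustration.

```python
import math
import random

ALPHABET = "ACDE"  # toy alphabet (assumption, not a real amino-acid set)

def log_p_next(prev, tok):
    """Toy autoregressive model: mildly favours repeating the previous token."""
    if prev is None:
        return math.log(1.0 / len(ALPHABET))
    stay = 0.4
    other = (1.0 - stay) / (len(ALPHABET) - 1)
    return math.log(stay if tok == prev else other)

def log_p_model(seq):
    """Log-probability of a whole sequence under the toy model."""
    prev, total = None, 0.0
    for tok in seq:
        total += log_p_next(prev, tok)
        prev = tok
    return total

def steering(seq):
    """Hypothetical steering potential: number of 'E' tokens."""
    return list(seq).count("E")

def log_target(seq, beta):
    """Assumed target: log p_model(x) + beta * steering(x), up to a constant."""
    return log_p_model(seq) + beta * steering(seq)

def ilmc_like_sample(length, beta, lookback=5, mh_steps=10, seed=0):
    rng = random.Random(seed)
    seq = []
    while len(seq) < length:
        # Elongation: sample the next token from the autoregressive model.
        prev = seq[-1] if seq else None
        weights = [math.exp(log_p_next(prev, t)) for t in ALPHABET]
        seq.append(rng.choices(ALPHABET, weights=weights, k=1)[0])
        # Refinement: symmetric single-site proposals within the lookback window.
        start = max(0, len(seq) - lookback)
        for _ in range(mh_steps):
            pos = rng.randrange(start, len(seq))
            proposal = list(seq)
            proposal[pos] = rng.choice(ALPHABET)
            log_ratio = log_target(proposal, beta) - log_target(seq, beta)
            # Metropolis acceptance rule for a symmetric proposal.
            if rng.random() < math.exp(min(0.0, log_ratio)):
                seq = proposal
    return "".join(seq)
```

Calling `ilmc_like_sample(40, beta=3.0)` should yield sequences enriched in the steered token relative to `beta=0.0`, which reduces to sampling guided by the toy model alone.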
bioinformatics2026-05-07v1Synthetic Data Generation and Nonparametric Techniques for Assessing Multivariate Similarity to Address Small-Sample Size Challenges
Heine, J.; Fowler, E.; Eschrich, S. A.; Schell, M.Abstract
Data modeling in biomedical research often operates in the small-sample regime, where the number of observations is small relative to the data dimensionality; the detrimental effects of limited sample sizes are well documented in cancer studies. Synthetic data offers a potential solution to data shortfalls provided that the data generated is an adequate facsimile of the underlying distribution; the adequacy of such synthetic data remains an open-ended problem. In this work, we evaluate a synthetic generator proposed previously. The generator applies a series of transformations to the observed data to accommodate the small-sample size resulting in an uncoupled representation, where uncorrelated marginal distributions are modeled with optimized univariate kernel density estimation. In this report, (1) we develop a nonparametric method for assessing multivariate similarity based on the Cramer-Wold theorem and random projection testing, (2) investigate when the absence of bivariate correlation approximates independence in a non-normal setting, and (3) evaluate artifacts induced by data compression. The presentation is primarily methodological; low-dimensional data were used so each stage of the generation process could be analyzed explicitly. A formal testing framework was developed by comparing random projection level outcomes with a two-sample test, modeling these outcomes as Bernoulli trials, aggregating replicate outcomes within each projection direction, and pooling outcomes across many directions, yielding a scalable standardized normal test-statistic. The key innovation was decoupling the two-sample test significance level from that governing finalized normal inference. We showed the same projection framework also evaluates the full multivariate covariance structure. 
The generator produced high-fidelity multivariate synthetic data when the bivariate correlation approximates independence in the non-normal setting; in highly compressed data, residual modes were best modeled as normally distributed regardless of their intrinsic distributional form. Ongoing work includes applying these methods to higher-dimensional, diverse data.
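A minimal sketch of random-projection two-sample testing, motivated by the Cramer-Wold theorem, is given here under simplifying assumptions. It uses the Kolmogorov-Smirnov test as the per-direction two-sample test, omits the replicate aggregation described in the abstract, and treats directions as independent Bernoulli trials when standardizing, which is only approximate because all directions reuse the same samples. The function name `projection_test` and its parameters are hypothetical.

```python
import numpy as np
from scipy.stats import ks_2samp

def projection_test(real, synth, n_dir=200, alpha=0.05, seed=0):
    """Pool per-direction two-sample test outcomes into one z-like statistic."""
    rng = np.random.default_rng(seed)
    dim = real.shape[1]
    rejections = 0
    for _ in range(n_dir):
        # Draw a random unit direction and project both samples onto it.
        u = rng.normal(size=dim)
        u /= np.linalg.norm(u)
        p_value = ks_2samp(real @ u, synth @ u).pvalue
        # Each direction contributes one Bernoulli outcome (reject or not).
        rejections += int(p_value < alpha)
    # Under the null, roughly n_dir * alpha directions are expected to reject.
    expected = n_dir * alpha
    z = (rejections - expected) / np.sqrt(n_dir * alpha * (1 - alpha))
    return rejections / n_dir, float(z)
```

A large positive statistic suggests that the two samples differ along many directions; a value near zero is consistent with matching distributions at the chosen per-direction level.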
bioinformatics2026-05-07v1Striping artifact removal in VisiumHD data through nuclear counts modeling
Malsot, P.; Londschien, M.; Boeva, V.; Raetsch, G.Abstract
Motivation: 10x Genomics VisiumHD enables spatial transcriptomics at 2 µm x 2 µm resolution but exhibits slide-specific, non-periodic striping artifacts due to lane-width variability. These multiplicative row/column effects distort bin total counts and can bias downstream analyses. The state-of-the-art destriping approach is the normalization procedure used as a preprocessing step in bin2cell; it applies sequential high-quantile row- then column-wise normalization, which is asymmetric and can introduce edge effects/macro-stripes and distortions of large-scale total-count structure. Results: We propose a statistical destriping approach that leverages nuclei segmentation from the co-registered H&E image. Assuming transcript abundance is constant within each nucleus, we model bin counts with a negative binomial distribution whose mean is a product of a nucleus-specific concentration and row- and column-specific stripe-factors reflecting lane-width variation. We fit all parameters in a generalized linear modeling framework with cross-validated regularization on stripe-factors and iterative dispersion estimation, and use the fitted parameters to correct the observed counts into a destriped image. On synthetic data with known ground truth, our method improves stripe-factor estimation accuracy and reduces error in corrected counts relative to bin2cell and bin2cell-derived baselines. Across four public VisiumHD slides, it consistently lowers striping intensity while substantially better preserving biological signal present in the large-scale global count structure and avoiding the artifacts introduced by other methods. Availability and Implementation: All source code and links to publicly available data used for this study are available at https://github.com/paolamalsot/destriping-GLM. 
Contact: paola.malsot@inf.ethz.ch, raetsch@inf.ethz.ch Note: This manuscript extends the version submitted to Intelligent Systems for Molecular Biology (ISMB) 2026 by describing a new optimization algorithm that yields an approximately tenfold speedup. All plots and benchmarks in this manuscript use the updated implementation.
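The multiplicative model in the abstract (bin mean = nucleus concentration x row factor x column factor) can be sketched on synthetic data. This is a simplified illustration, not the destriping-GLM code: it swaps the negative binomial for a Poisson likelihood, drops regularization and dispersion estimation, and uses diamond-shaped toy "nuclei" so that neighbouring rows and columns share nuclei. The function names `simulate` and `fit_stripes` are hypothetical.

```python
import numpy as np

def simulate(h=32, w=32, seed=0):
    """Synthetic counts with known row/column stripe factors (toy setup)."""
    rng = np.random.default_rng(seed)
    ii, jj = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Diamond-shaped toy nuclei, each covering several rows and columns.
    raw = ((ii + jj) // 4) * (h + w) + (ii - jj + w) // 4
    _, inverse = np.unique(raw.ravel(), return_inverse=True)
    nucleus = inverse.reshape(h, w)
    conc = rng.gamma(shape=5.0, scale=40.0, size=nucleus.max() + 1)
    row = np.exp(rng.normal(0.0, 0.3, size=h))
    col = np.exp(rng.normal(0.0, 0.3, size=w))
    counts = rng.poisson(conc[nucleus] * row[:, None] * col[None, :])
    return counts, nucleus, row, col

def fit_stripes(counts, nucleus, n_iter=500):
    """Alternating Poisson maximum-likelihood updates for the three factors."""
    h, w = counts.shape
    n_nuc = nucleus.max() + 1
    flat = nucleus.ravel()
    total = np.bincount(flat, weights=counts.ravel(), minlength=n_nuc)
    row, col = np.ones(h), np.ones(w)
    for _ in range(n_iter):
        # Nucleus concentration given the current stripe factors.
        stripe = np.outer(row, col).ravel()
        conc = total / np.bincount(flat, weights=stripe, minlength=n_nuc)
        c_map = conc[nucleus]
        # Row and column factors given the concentrations.
        row = counts.sum(axis=1) / (c_map * col[None, :]).sum(axis=1)
        col = counts.sum(axis=0) / (c_map * row[:, None]).sum(axis=0)
        # Fix the scale ambiguity: geometric mean of each factor set is 1.
        row /= np.exp(np.log(row).mean())
        col /= np.exp(np.log(col).mean())
    return row, col

def destripe(counts, row, col):
    """Divide out the estimated multiplicative stripe factors."""
    return counts / np.outer(row, col)
```

In this toy setting the factors are identifiable only up to an overall scale, which is why both sets are normalized to a geometric mean of 1 before any comparison with the simulated truth.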
bioinformatics2026-05-07v1DupyliCate: mining, classifying, and characterizing gene duplications
Natarajan, S.; Pucker, B.Abstract
Paralogs, copies of a gene, form an important basis for novelty during evolution. Analysis of such gene duplications is important to understand the emergence of novel traits during evolution. DupyliCate is a Python tool that has been developed for this purpose. With the ability to process multiple datasets concurrently, flexible features, and parameters to set species-specific thresholds, DupyliCate offers a high-throughput method for gene copy identification and analysis. The different available parameters and modes are explored in detail based on the Arabidopsis thaliana datasets. Proof of concept for the tool is presented by characterizing well known duplications in different plants, and its broad applicability is demonstrated by running it on diverse datasets including complex plant genome sequences with high heterozygosity. Further, two case studies are presented as exemplar use cases: the evolution of flavonol synthase (FLS) genes in Brassicales, and the evolution of the flavonol synthesis regulating myeloblastosis (MYB) transcription factors MYB12 and MYB111 across a large number of plant species. The tool's applicability beyond plants is demonstrated on Escherichia coli, Saccharomyces cerevisiae, and Caenorhabditis elegans datasets. DupyliCate is available at: https://github.com/ShakNat/DupyliCate.
bioinformatics2026-05-06v4Large-Scale Statistical Dissection of Sequence-Derived Biochemical Features Distinguishing Soluble and Insoluble Proteins
Vu, N. H. H.; Nguyen Bao, L.Abstract
Protein solubility critically influences recombinant expression efficiency and downstream biotechnological applications. While deep learning models have improved predictive accuracy, the intrinsic magnitude, redundancy, and interpretability of classical sequence-derived determinants remain insufficiently characterized. We performed a statistically rigorous large-scale univariate analysis on a curated dataset of 78,031 proteins (46,450 soluble; 31,581 insoluble). Thirty-six biochemical descriptors were evaluated using Mann-Whitney U tests with Benjamini-Hochberg false discovery rate correction. Effect sizes were quantified using Cliff's δ, and discriminative performance was assessed by ROC-AUC. Although 34 features remained significant after correction, most exhibited small effect sizes and substantial class overlap, consistent with a weak-signal regime. The strongest effects were associated with size-related features (sequence length and molecular weight; δ ≈ -0.21), whereas charge-related descriptors, particularly the proportion of negatively charged residues (δ = 0.150; AUC = 0.575), showed consistent but modest shifts. Spearman correlation analysis revealed near-complete redundancy among major size-related variables (ρ up to 0.998). Applying a redundancy threshold (|ρ| ≥ 0.85), we derived a parsimonious composite integrating sequence length and negative charge proportion, achieving AUC = 0.624 (MCC = 0.1746). These findings demonstrate that sequence-level solubility information is intrinsically low-dimensional and governed by coordinated weak effects, establishing a transparent statistical baseline for large-scale solubility characterization.
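Cliff's δ and ROC-AUC for a single feature are linked by AUC = (δ + 1) / 2, which is consistent with the reported pair δ = 0.150 and AUC = 0.575. The sketch below computes both by direct pairwise comparison; the function names are hypothetical, the sign of δ depends on which class is passed first, and the quadratic-memory approach is meant for small illustrative arrays rather than the full 78,031-protein dataset.

```python
import numpy as np

def cliffs_delta(x, y):
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs."""
    x = np.asarray(x, dtype=float)[:, None]
    y = np.asarray(y, dtype=float)[None, :]
    greater = np.count_nonzero(x > y)
    less = np.count_nonzero(x < y)
    return (greater - less) / (x.size * y.size)

def auc_from_delta(delta):
    """Single-feature ROC-AUC implied by Cliff's delta (ties count as 0.5)."""
    return (delta + 1.0) / 2.0
```

For example, `auc_from_delta(0.150)` gives 0.575, matching the value quoted for the proportion of negatively charged residues.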
bioinformatics2026-05-06v3Identification, evolutionary history and characteristics of orphan genes in root-knot nematodes
Seckin, E.; Colinet, D.; Bailly-Bechet, M.; Seassau, A.; Bottini, S.; Sarti, E.; Danchin, E. G.Abstract
Orphan genes, lacking homologs in other species, are systematically found across genomes. Their presence may result from extensive divergence from pre-existing genes or from de novo gene birth, which occurs when a gene emerges from a previously non-genic region. In this study, we identified orphan genes in the genomes of globally distributed plant-parasitic nematodes of the genus Meloidogyne and investigated their origins, evolution, and characteristics. Using a comparative genomics framework across 85 nematode species, we found that 18% of Meloidogyne genes are genus-specific, transcriptionally supported orphans. By combining ancestral sequence reconstruction and synteny-based approaches, we inferred that 20% of these orphan genes originated through high divergence, while 18% likely emerged de novo. Proteomic and translatomic evidence confirmed the translation of a subset of these genes, and feature analyses revealed distinctive molecular signatures, including shorter length, signal peptide enrichment, and a tendency for extracellular localization. These findings highlight orphan genes as a substantial and previously underexplored component of the Meloidogyne genome, with potential roles in their worldwide parasitism.
bioinformatics2026-05-06v3Advancing in silico drug design with Bayesian refinement of AlphaFold models
Sen, S.; Hoff, S. E.; Morozova, T. I.; Schnapka, V.; Bonomi, M.Abstract
Virtual screening has become an indispensable tool in modern structure-based drug discovery, enabling the identification of candidate molecules by computationally evaluating their potential to bind target proteins. The accuracy of such screenings critically depends on the quality of the target structures employed. Recent advances in protein structure prediction, particularly AlphaFold2, have revolutionized this field with unprecedented accuracy. However, AlphaFold2 models often exhibit limitations in local structural details, especially within binding pockets, which limit their utility for small molecule docking. In contrast, molecular dynamics simulations with accurate atomistic force fields can refine protein structures, but lack the ability to leverage the structural information provided by deep learning approaches. Here, we introduce bAIes, an integrative method that bridges this gap by combining physics-based force fields with data-driven predictions through Bayesian inference. Crucially, bAIes demonstrates a superior ability to discriminate between binders and non-binders in virtual screening campaigns, outperforming both AlphaFold2 and molecular dynamics-refined models. By enhancing the usability of AlphaFold2 models without requiring extensive experimental or computational resources, bAIes offers a convenient solution to a longstanding challenge in structure-based drug design, potentially accelerating the early phases of drug discovery.
bioinformatics2026-05-06v2ArchaicSeeker 3.0: A deep-learning framework for scalable, haplotype-resolved inference of archaic introgression
Wang, B.; Lei, C.; Lin, H.; Shi, S.; Ma, X.; Zeng, W.; Yuan, K.; Ni, X.; Xu, S.Abstract
Archaic introgression has left a significant mark on human genetic diversity, but reliably identifying introgressed segments remains a major challenge, especially with complex demographic histories and limited sample sizes. Existing methods often rely on demographic assumptions or cohort-specific parameter fitting, which compromises robustness and scalability. We introduce ArchaicSeeker 3.0 (AS3), a deep-learning framework designed for haplotype-resolved detection of archaic introgression. AS3 integrates a tract-scale sequence model with an overlap-aware reassembly approach and boundary refinement, enabling accurate, boundary-coherent reconstruction of introgressed segments across diverse genomic contexts. By leveraging a simulation-trained model, AS3 avoids inference-time recalibration, offering stable performance across unrepresented demographic scenarios and small cohorts. In extensive simulations, AS3 outperforms existing methods in precision, recall, and F1 score, while providing more continuous segments with accurate boundary localization. It demonstrates robustness in small-target regimes and varying marker densities. Applied to 3,453 genomes from 209 populations, AS3 shows strong concordance with existing introgression callers and identifies additional introgressed regions, including high-frequency AS3-specific introgressed segments supported by locus-level haplotype and phylogenetic analyses. AS3 provides a scalable, robust solution for detecting archaic introgression from single individuals to large biobank datasets, marking a significant advancement in the field of local ancestry inference and opening new possibilities for the study of human evolutionary genetics. ArchaicSeeker 3.0 is available at https://github.com/Shuhua-Group/ArchaicSeeker3.0.
bioinformatics2026-05-06v1First Survey of Publicly Available Metagenomic Sequencing Data Across 24 Middle Eastern and North African Countries: The MENA Microbiome Database
Mathlouthi, N. E. H.; Gdoura-Ben Amor, M.; Belguith, I.; Derouich, R.; Ammar Keskes, L.; Gdoura, R.Abstract
Microbiome research has expanded globally, yet the Middle East and North Africa (MENA) region remains severely under-represented in international sequencing repositories. Here we present the MENA Microbiome Database, the first systematically harmonized catalog of publicly available metagenomic sequencing data from 24 MENA countries, consolidating 60,126 runs across 51,365 biological samples and 2,373 BioProjects deposited between 2008 and 2026. Records were retrieved from ENA, NCBI SRA, and PubMed, enriched with BioSample and study-level metadata, and classified into microbiome subtypes using a 73-rule keyword-based harmonization framework. Amplicon sequencing accounted for 80.6% of runs, with Illumina platforms dominating at 92.7%. Geographic coverage is highly skewed: Saudi Arabia and Turkey together contribute over half of all records, while five countries (Libya, Syria, Palestine, Yemen, and South Sudan) remain critically under-sampled. Metadata completeness averaged 73.97% under a MIxS-MIMS proxy framework, with geographic coordinates available for fewer than 15% of runs. Ecological analyses revealed that country-level factors significantly structure environmental, animal-associated, and plant-associated microbiomes, but not human-associated microbiomes. Spatial autocorrelation confirmed non-random clustering of sampling effort around Red Sea coastal and eastern Mediterranean hotspots. This open, reproducible resource, comprising harmonized data files, analysis code, and an interactive browsing platform, establishes a foundational infrastructure for regional microbiome science and equitable global comparative studies. Keywords: MENA; microbiome; metagenomics; public repository; SRA; ENA; database; harmonization; Middle East; North Africa
bioinformatics2026-05-06v1Tumor cell specific total mRNA expression informed neural networks predicts cancer progression
Paul, A.; Lal, J. C.; Ji, S.; Fong, C.; Chen, K.; Ding, Y.; Li, R.; Dai, Y.; Tran, Q.; Montierth, M.; Alberti, S.; Kopetz, S.; Wang, W.Abstract
Inferring tumor molecular phenotypes from high-dimensional multi-omic data is a fundamental challenge in computational biology. Current methods for estimating tumor cell-specific total mRNA expression (TmS) require matched DNA and RNA sequencing data and rely on computationally intensive deconvolution pipelines. We present TmSNet, a deep learning framework that predicts TmS using mRNA, DNA methylation, miRNA, and immune cell proportions as input features. TmSNet integrates structured feature selection (gradient boosting, LASSO, elastic net) with specialized neural architectures to predict continuous TmS. Across 12 TCGA cancer types, TmSNet achieved cross-validated performance up to concordance correlation coefficient (CCC) = 0.93 and correlation R-squared = 0.88 and generalized to external cohorts with correlations of 0.54 (SCAN-B) and 0.43 (FUSCC). Predicted TmS values effectively stratify patients by risk and preserve known transcriptional profiles across tumor subtypes. These results demonstrate that TmSNet can infer biologically meaningful phenotypes from multi-omic data and provide a scalable framework for modeling tumor transcriptional activity in heterogeneous cohorts.
bioinformatics2026-05-06v1Pharmacological proximities in the GPCR family discovered using contact-informed amino-acid and binding pocket similarities
So, S. S.; Ngo, T.; Ilatovskiy, A. V.; Finch, A. M.; Riek, R. P.; Abagyan, R.; Smith, N. J.; Kufareva, I.Abstract
Understanding protein proximities in the theoretical ligand space is essential for developing therapeutics with desirable polypharmacology, predicting off-targets, and discovering surrogate ligands for poorly characterized proteins. This is especially important for G protein-coupled receptors (GPCRs) - a major class of drug targets, many of which still lack known ligands. Circumventing this limitation, we present GPCR-CoINPocket v2, a contact-informed metric for detecting GPCR pharmacological similarities from amino-acid sequences alone. We first establish a "gold standard" of pharmacological relatedness using ChEMBL-derived ligand sets. We then replace traditional evolutionary amino acid similarity matrices with a chemically-informed matrix derived from protein:ligand interaction patterns across 3,306 structures, significantly improving early detection of shared pharmacology between distantly homologous receptors. An additional unconstrained, contact-informed matrix further enhances predictive performance. Pilot application of the method revealed previously unrecognized similarities between the β2 adrenoceptor and three Class A peptide GPCRs, which we confirmed experimentally by demonstrating the binding of select ligands of these receptors to the β2. Dimensionality reduction of similarity scores recapitulates known receptor relationships and predicts neighbors of orphan GPCRs later confirmed experimentally. Overall, GPCR-CoINPocket v2 provides a powerful sequence-based framework to prioritize ligand space, predict polypharmacology, and accelerate GPCR drug discovery and deorphanization.
bioinformatics2026-05-06v1Learning the Language of the Microbiome with Transformers
Treloar, N. J.; Ur-Rehman, S.; Yang, J.Abstract
Self-supervised pretraining has become central to biological machine learning, yet microbiome data remains comparatively underexplored in terms of both modeling approaches and evaluation frameworks. To address this gap, we present Atlas, a pretraining dataset of over 539,000 microbiome datapoints from the MGnify database. Using Atlas, we train the Waypoint family of microbiome foundation models: a series of GPT-2 style causal language models ranging from 6M to 170M parameters. We also introduce Compass, a curated benchmark of eight predictive tasks spanning biome classification, drug-microbiome interactions, drug degradation, and infant gut development. Using this benchmark, we compare the performance of Waypoint models against classical baselines and the existing MGM foundation model. Our results show that pretraining leads to consistent and significant improvements in downstream task performance, that both dataset scale and tokenization strategy impact model quality, and that pretraining is essential for achieving favorable scaling behavior. Furthermore, pretrained transformer models begin to reliably outperform classical methods once training data exceeds roughly 10,000 examples - a threshold that is attainable for modern microbiome studies. Finally, we demonstrate that the Waypoint models achieve state-of-the-art performance among microbiome foundation models. Overall, our work highlights the importance of large-scale self-supervised pretraining in this domain and establishes Atlas, Compass, and the Waypoint models as valuable resources for the research community in this emerging field.
bioinformatics2026-05-06v1Bridging LLM Reasoning and Chemical Knowledge via an Evolutionary Multi-Agent Framework for Molecular Synthesis
Chen, Y.; Rao, J.; Xie, J.; Sun, Y.; Yang, Y.Abstract
Molecular design faces the dual challenge of navigating a vast chemical space while ensuring experimental synthesizability. Traditional models are constrained by small datasets, restricting their scalability and broader chemical context. In contrast, Large Language Models (LLMs) encapsulate extensive synthesis protocols derived from vast scientific literature, yet they struggle to leverage this potential due to severe hallucinations and a superficial grasp of rigorous chemical logic. We propose EvoSyn, an evolutionary multi-agent framework that synergizes LLM reasoning with domain experts for preference-aware molecular synthesis. EvoSyn orchestrates a dual-process evolutionary paradigm: a co-evolving process that collaboratively aligns linguistic capabilities with multi-objective constraints, and a self-evolving process formulated as a Markov Game. Through evolution and reinforcement learning, agents actively learn from mistakes, utilizing domain feedback to penalize invalid proposals and ground generation in feasible reaction pathways. Extensive evaluations on comprehensive benchmarks demonstrate that EvoSyn significantly outperforms state-of-the-art baselines. These results highlight that by integrating LLM-guided self-evolution with rigorous domain validation to mitigate hallucinations, EvoSyn effectively yields molecules that are both bioactive and synthetically actionable.
bioinformatics2026-05-06v1UNKAI: A protein functional identity prediction model based on ESM-C latent representations and the attention mechanism
Ukai, K.; Fujita, S.; Terada, T.Abstract
The rapid advancement of genome sequencing technologies has led to the accumulation of a vast number of protein sequences in public databases. However, a significant proportion of these proteins remain functionally uncharacterized. Concurrently, the expansion of protein sequence data has enabled the development of protein language models (pLMs). By distilling billions of years of evolutionary history into a latent representational space, these models have acquired an unprecedented capacity to predict both the tertiary structures and functions of proteins. In this study, we developed a deep learning-based method to predict whether two proteins catalyze the same enzymatic reaction. Our approach leverages latent representations generated by ESM Cambrian (ESM C), a state-of-the-art pLM, which are then processed through a neural network architecture integrating an attention mechanism. Our method outperformed existing approaches, including those based solely on full-length sequence similarity. Notably, it also surpassed our previous LightGBM-based model, which relied on structural similarity scores derived from AlphaFold-predicted models. Analysis of the attention weights reveals that our model autonomously highlights biologically significant sites, such as catalytic and binding residues. This demonstrates that integrating pLMs with attention mechanisms can enhance the accuracy and interpretability of protein function prediction while eliminating the need for manual feature engineering.
bioinformatics2026-05-06v1Integrated Multi-Omics Analysis for the Identification of Disease-Associated Variations and Prognostic Biomarkers in Triple-Negative Breast Cancer (TNBC)
Mannekunta, N.; Natrajan, E.Abstract
Background: Triple-negative breast cancer (TNBC) exhibits substantial molecular heterogeneity and lacks targeted receptor therapies. Single-omic approaches inadequately capture its regulatory complexity, necessitating integrated multi-omic frameworks to identify stable prognostic signatures. Methods: Matched transcriptomic and DNA methylation data from the TCGA-BRCA cohort were normalised and mathematically integrated to isolate disease-associated variations. A calibrated machine learning voting ensemble (comprising LightGBM, Random Forest, and Logistic Regression) was trained to predict clinical survival. Model generalisability was tested on an independent microarray cohort (GSE58812) using independent quantile normalisation. SHAP (SHapley Additive exPlanations) values provided biological interpretability. Results: Differential and integrative analyses identified a 47-gene master prognostic signature. The ensemble classifier achieved an external validation accuracy of 74.77% (AUC 0.590) on unseen clinical patients. SHAP analysis confirmed the biological directionality of these specific biomarkers in driving mortality. Hypergeometric pathway enrichment highlighted targetable metabolic and signalling networks. Conclusions: This multi-omic machine learning pipeline identifies a highly prognostic 47-gene signature for TNBC. The model demonstrates strong cross-platform generalisability and offers interpretable clinical utility for stratifying patient risk and guiding future therapeutic target development.
bioinformatics2026-05-06v1Simple baselines rival protein language models in mutation-dense design tasks
Talpir, I.; Fleishman, S. J.Abstract
Computational protein design demands generally applicable models that reliably predict or generate unmeasured variants with superior functional properties. Recent studies have proposed protein language models (pLMs) for design tasks, including zero-shot scoring and transfer learning from limited experimental data. Although pLMs have been used in zero-shot and transfer-learning studies, they have generally not been assessed in benchmarks that explicitly test combinatorial extrapolation from lower- to higher-order variants. Here we benchmark widely used pLMs against conventional baseline methods in recently described dense, experimentally validated multi-mutant landscapes. We find that regardless of architecture and parameter count, pLMs are statistically similar to one another, and none consistently outperforms conventional baseline methods. Furthermore, their ability to distinguish functional from non-functional variants in zero-shot prediction is comparable to that of conventional homology-based methods. We suggest that to contribute to the design of protein function, pLMs may need to encode biophysical and structural priors or be combined with structure-based approaches.
bioinformatics2026-05-06v1PhenotypeToGeneDownloaderR: automated multi-source retrieval and validation of phenotype-associated genes
Muneeb, M.; Ascher, D. B.Abstract
Identifying phenotype-associated genes is a common first step in polygenic risk score construction, enrichment testing, target prioritisation and variant interpretation, but relevant evidence is distributed across heterogeneous databases with different interfaces, formats and evidence models. Here, we present PhenotypeToGeneDownloaderR, a phenotype-guided R/Python pipeline for automated gene retrieval, harmonisation, symbol validation and cross-source summary analysis. Given a phenotype term, the pipeline queries integrated biological databases, standardises per-source outputs, combines gene lists, validates retrieved symbols against the NCBI human gene reference and generates summary tables and visualisations. Across 13 clinically relevant phenotypes and 13 databases, PhenotypeToGeneDownloaderR generated 136,487 raw gene retrievals, with at least one source returning genes for every phenotype. Across all 13 phenotypes, 100,175 of 114,345 combined input symbols were retained after direct or synonym-based validation, corresponding to an 87.6% validation rate. Cross-source overlap was low, supporting the complementarity of integrated evidence sources. Against an HPO/ClinVar/OMIM-derived gold standard, the pipeline recovered 1,039 of 1,056 known phenotype-associated genes, corresponding to 98.4% recall. PhenotypeToGeneDownloaderR provides a lightweight, reproducible upstream framework for generating candidate gene sets for downstream prioritisation and interpretation. The pipeline is implemented in R and Python, released under the MIT licence, and available at https://github.com/MuhammadMuneeb007/PhenotypeToGeneDownloaderR.
bioinformatics2026-05-06v1An LLM-driven pipeline for proteomics-based detection and structural modeling of post-translational modifications
George, A.; Mejia-Rodriguez, D.; Li, X.; Rigor, P.; Cheung, M. S.; Bilbao, A.Abstract
Post-translational modifications (PTMs) on proteins dynamically regulate their functions and subsequently cellular physiology. Significant advances have been made in their detection and modeling: mass spectrometry-based proteomics has become the cornerstone for PTM detection in complex samples, while emerging structure-prediction frameworks enable modeling of PTM-dependent conformational changes. However, the biological significance of many PTMs remains largely unexplored, in part because integrated pipelines that bridge PTM detection with structural modeling remain limited. We present a generative AI-driven pipeline that integrates PTM detection with structural modeling of their effects on protein dynamics and interactions. The pipeline comprises two complementary tools: PTMdiscoverer and PTM-Psi. First, PTMdiscoverer leverages large language models to identify, annotate, and interpret candidate PTMs from open-search proteomics results, addressing limitations of conventional proteomics tools. Next, PTM-Psi models the structural, functional, and dynamic consequences of these spatially aware modifications on protein dynamics. These two components bridge PTM discovery with mechanistic interpretation at the structural level. We demonstrate our pipeline by using cyanobacterial proteomics data to study potential molecular mechanisms of redox-regulated "dark complex" formation in carbon metabolism, advancing our ability to interpret PTM-mediated regulation in microbial systems.
bioinformatics 2026-05-06 v1
A novel metric reveals previously unrecognized distortion in dimensionality reduction of scRNA-Seq data
Hamilton, T.; Sparta, B.; Cooley, S. M.; Aragones, S. D.; Ray, J. C. J.; Deeds, E. J.
Abstract
High-dimensional data are becoming increasingly common in nearly all areas of science. Developing approaches to analyze these data and understand their meaning is a pressing issue. This is particularly true for single-cell RNA-seq (scRNA-seq), a technique that simultaneously measures the expression of tens of thousands of genes in thousands to millions of single cells. Popular analysis pipelines significantly reduce the dimensionality of the dataset before performing downstream analysis. One problem with this approach is that dimensionality reduction can introduce substantial distortion into the data, particularly by disrupting the local neighborhoods of certain points. Since many scRNA-seq analyses like cell type clustering or trajectory inference rely on these near-neighbor relationships, distortion in this aspect of the data could significantly influence the outcomes of these analyses. Here, we introduce a straightforward approach to quantifying this distortion by comparing the local neighborhoods of points before and after dimensionality reduction. We found that popular techniques like t-SNE and UMAP introduce substantial distortion even for simple simulated data sets. For scRNA-seq data, we found the distortion in local neighborhoods was often greater than 95%, and that there was no consistent set of neighborhoods across the various steps in the consensus scRNA-seq analysis pipeline. We also found that this distortion had profound impacts on the outcomes of cell type clustering and other downstream analyses. Our findings suggest that caution must be applied when interpreting results in terms of 2-D visualizations produced by tools like UMAP, and that there is a critical need for new dimensionality reduction tools that more effectively preserve the local topological structure of the data.
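Comparing local neighborhoods before and after dimensionality reduction can be sketched as follows. The abstract does not name the exact metric, so the Jaccard distance between k-nearest-neighbour sets used here is an assumption, as are the function names; the "embedding" is a crude stand-in (keeping two coordinates), not t-SNE or UMAP.

```python
import math
import random


def knn_sets(points, k):
    """For each point, return the set of indices of its k nearest neighbours."""
    neighbours = []
    for i, p in enumerate(points):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        neighbours.append({j for _, j in dists[:k]})
    return neighbours


def mean_jaccard_distance(high_dim, low_dim, k):
    """Average Jaccard distance between each point's neighbourhoods in the two spaces."""
    before = knn_sets(high_dim, k)
    after = knn_sets(low_dim, k)
    total = 0.0
    for a, b in zip(before, after):
        total += 1.0 - len(a & b) / len(a | b)
    return total / len(before)


random.seed(0)
# Synthetic data: 100 points in 20 dimensions.
data = [[random.gauss(0, 1) for _ in range(20)] for _ in range(100)]
# Stand-in "embedding": keep only the first two coordinates.
embedding = [row[:2] for row in data]

print(mean_jaccard_distance(data, data, k=10))       # 0.0 (no distortion)
print(mean_jaccard_distance(data, embedding, k=10))  # between 0 and 1; larger means more distortion
```

A value of 0 means every point keeps exactly the same k neighbours; a value of 1 means no neighbour is shared.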
bioinformatics 2026-05-05 v7
Topology Matters: The Trade-off Between Wasserstein Critics and Discriminators in Single-Cell Data Integration
Reid, K.; Stein-O'Brien, G.; Guven, E.
Abstract
Motivation: Integrating single-cell RNA sequencing experiments (scRNA-seq) across technologies is hindered by severe technical batch effects that confound analysis and mask biological variation. Adversarial autoencoders are a popular solution to correct for these confounding effects, often relying on discriminator networks that approximate the Jensen-Shannon divergence. Previous research has established that the Jensen-Shannon divergence suffers from vanishing gradients when distributions do not overlap, a common phenomenon when datasets come from different sequencing technologies, leading to failed training. In contrast, the Wasserstein distance remains a valid metric with informative gradients even for disjoint distributions. While both approaches appear in the literature, no study has rigorously isolated the adversarial objective to systematically evaluate its impact on batch alignment, biological conservation, and scalability across varying dataset complexities. Results: We introduce a multi-class reference-based Wasserstein critic to systematically benchmark adversarial objectives. We find that the Wasserstein critic yields superior mixing; however, extensive reference sensitivity analysis reveals that the Wasserstein critic is prone to over-correction resulting in collapsed cellular representations; that its integrative performance is dependent on a topologically dense reference batch; and that it scales poorly with the number of batches. In contrast, we find that the "weak" integration characteristic of discriminators acts as a protective measure against over-correction. By highlighting the trade-offs between these methods, we aim to empower researchers to choose the correct method for their specific needs. Availability and Implementation: Source code is available at https://github.com/kreid415/wasserstein-critic-deconfounding. 
Data are available at https://figshare.com/articles/dataset/Benchmarking_atlas-level_data_integration_in_single-cell_genomics_-_integration_task_datasets_Immune_and_pancreas_/12420968/1. Contact: kreid20@jh.edu.
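The contrast between the two divergences on non-overlapping distributions can be shown with a small numerical sketch. It uses two point masses on a discrete grid, a toy setting chosen for illustration and not the authors' benchmark; the function names are hypothetical.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def wasserstein_1d(p, q):
    """Wasserstein-1 distance on a unit-spaced grid: sum of absolute CDF differences."""
    total, cdf_gap = 0.0, 0.0
    for x, y in zip(p, q):
        cdf_gap += x - y
        total += abs(cdf_gap)
    return total


def point_mass(position, size=50):
    """Distribution with all probability at one grid position."""
    return [1.0 if i == position else 0.0 for i in range(size)]


base = point_mass(0)
for shift in (1, 10, 40):
    other = point_mass(shift)
    print(shift, round(js_divergence(base, other), 4), wasserstein_1d(base, other))
# 1 0.6931 1.0
# 10 0.6931 10.0
# 40 0.6931 40.0
```

The Jensen-Shannon divergence stays at ln 2 however far apart the two masses are, so it carries no gradient signal about their separation, whereas the Wasserstein distance grows with the shift.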
bioinformatics 2026-05-05 v3
cellNexus: Quality control, annotation, aggregation and analytical layers for the Human Cell Atlas data
Shen, M.; Gao, Y.; Liu, N.; Bhuva, D.; Milton, M.; Henao, J.; Andrews, J.; Yang, E.; Zhan, C.; Liu, N.; Si, S.; Hutchison, W. J.; Shakeel, M. H.; Morgan, M.; Papenfuss, A. T.; Iskander, J.; Polo, J. M.; Mangiola, S.
Abstract
Large-scale single-cell atlases such as the Human Cell Atlas have transformed our understanding of human biology. Yet the lack of a robust framework that standardises quality control, expands cellular annotation, and adds normalisation and analytical layers limits multi-study analyses and the usefulness of this resource. Here we present cellNexus, a comprehensive tool and resource that converts the Human Cell Atlas collection into analysis-ready data by linking quality control layers, metadata enrichment, expression normalisation, analysis and data aggregation. These enhancements enable robust statistical modelling across studies, exemplified by a multi-tissue map of immune cell communication during ageing, which reveals macrophage-muscle axes as among the most depleted regenerative interactions with age. All harmonised layers, including pseudobulk and cell-cell communication summaries, are accessible via a public web interface and with R and Python APIs. By providing continuous integration with CELLxGENE releases, cellNexus transforms large cell atlas corpora into an accessible, reproducible, interoperable foundation for large-scale biological discovery and the next generation of single-cell foundation models.
bioinformatics 2026-05-05 v3
Exploring per-base quality scores as a surrogate marker of cell-free DNA fragmentome
Volkov, H. H. V.; Raitses-Gurevich, M.; Grad, M.; Shlayem, R.; Leibowitz, D.; Rubinek, T.; Golan, T.; Shomron, N.
Abstract
Per-base quality scores are widely treated as technical metadata in next-generation sequencing. Here, we show that in rigorously controlled whole-genome sequencing of cell-free DNA, quality profiles may encode fragmentomic signals that enable classification of cancer samples against matched controls. Analyzing four independent batches (23 cancer samples: pancreatic and breast; 22 matched controls) sequenced in a within-lane regime and further normalized per flow-cell tile to reduce technical confounders, we demonstrate through unsupervised analysis that boundary-enriched dynamics captured in these quality scores consistently separate cancer from control samples. A leave-one-batch-out classifier trained on quality-derived scores achieved a pooled area under the curve of 0.81. Furthermore, we show that the quality-derived metric correlates with short-fragment enrichment and tumor-associated 5'-end motifs, performing comparably to established, motif-based orthogonal methods. These results provide initial evidence that quality scores could serve as a low-cost, alignment-free biomarker for cfDNA-based cancer detection.
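Leave-one-batch-out evaluation with a pooled area under the curve can be sketched as below. The data are invented toy values, the midpoint classifier is a deliberately simple stand-in, and the function names are hypothetical; none of this reproduces the authors' features or model.

```python
from collections import defaultdict


def leave_one_batch_out(batches):
    """Yield (held_out_batch, train_indices, test_indices) for each batch."""
    by_batch = defaultdict(list)
    for index, batch in enumerate(batches):
        by_batch[batch].append(index)
    for held_out, test in by_batch.items():
        train = [i for b, idx in by_batch.items() if b != held_out for i in idx]
        yield held_out, train, test


def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: one quality-derived feature per sample, four batches.
feature = [0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6, 0.1]
label = [1, 0, 1, 0, 1, 0, 1, 0]
batch = ["b1", "b1", "b2", "b2", "b3", "b3", "b4", "b4"]

pooled_scores = [None] * len(feature)
for held_out, train, test in leave_one_batch_out(batch):
    # Stand-in classifier: signed distance to the midpoint of the training class means.
    pos_mean = sum(feature[i] for i in train if label[i] == 1) / sum(label[i] for i in train)
    neg_mean = sum(feature[i] for i in train if label[i] == 0) / sum(1 - label[i] for i in train)
    midpoint = (pos_mean + neg_mean) / 2
    for i in test:
        pooled_scores[i] = feature[i] - midpoint

# Pooled AUC: one AUC over all held-out predictions, not an average of per-batch AUCs.
print(auc(label, pooled_scores))  # 1.0 on this cleanly separable toy data
```

Pooling the held-out scores before computing a single AUC matters because scores from different folds must then be comparable across batches, which is a stricter test than averaging per-batch AUCs.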
bioinformatics 2026-05-05 v2
Preferential CDR masking in paired antibody language models improves binding affinity prediction
Talaei, M.; Walker, K. C.; Hao, B.; Jolley, E.; Jin, Y.; Kozakov, D.; Misasi, J.; Vajda, S.; Paschalidis, I. C.; Joseph-McCarthy, D.
Abstract
Background: Therapeutic antibodies are a leading class of biologics, yet their unique architecture poses challenges for computational modeling. Each antibody comprises paired heavy and light variable domains with conserved framework regions that maintain structure and hypervariable complementarity-determining regions (CDRs) that directly contact antigens. This functional asymmetry, where CDRs determine binding specificity while frameworks provide scaffolding, suggests that region-aware training strategies could yield superior representations. Existing protein language models treat all regions uniformly, potentially missing critical features present in CDRs. Methods: We developed a region-aware pretraining strategy for paired variable domain sequences using two protein language models: a 3 billion parameter model (ESM2) and a compact 600 million parameter model (ESM C). We compared three masking approaches: uniform whole-chain masking, CDR-focused masking, and a hybrid strategy. Final models were trained on over 1.6 million paired antibody sequences and evaluated on binding affinity datasets with over 90,000 antibody variants across six antigens, including single-mutant panels and combinatorial libraries. Results: Here we show that CDR-focused training produces embeddings with superior predictive performance for antibody-antigen binding. Our approach achieves up to 27% improvements in binding affinity prediction compared to benchmarked antibody models. Remarkably, training exclusively on paired sequences proves sufficient; pretraining on billions of unpaired sequences provides no measurable benefit. Our compact model matches or exceeds larger antibody-specific baselines. Conclusions: These findings establish that prioritizing paired sequences with CDR-aware supervision over scale and complex training schemes achieves both computational efficiency and predictive accuracy, providing a practical framework for next generation antibody language models.
bioinformatics 2026-05-05 v2
Systematic contextual biases in SegmentNT potentially relevant to other nucleotide transformer models
Ebbert, M. T. W.; Ho, A.; Page, M. L.; Dutch, B.; Byer, B. K.; Hankins, K. L.; Sabra, H.; Aguzzoli Heberle, B.; Wadsworth, M. E.; Fox, G. A.; Karki, B.; Hickey, C.; Fardo, D. W.; Bumgardner, C.; Jakubek, Y. A.; Steely, C. J.; Miller, J. B.
Abstract
Recent advances in large language models (LLMs) have extended to genomic applications, yet model robustness relative to context is unclear. Here, we demonstrate two intrinsic biases (input sequence length and nucleotide position) affecting SegmentNT results, a model included with the Nucleotide Transformer that provides nucleotide-level predictions of biological features. We demonstrate that nucleotide position within the input sequence (beginning, middle, or end) alters the nature of SegmentNT's raw prediction probabilities, which can be standardized to improve prediction consistency. While longer input sequence length improves model performance, diminishing returns suggest a surprisingly small input length of ~3,072 nucleotides might be sufficient for many applications. We further identify a 24-nucleotide periodic oscillation in SegmentNT's prediction probabilities, revealing an intrinsic bias potentially linked to the model's training tokenization (6-mers) and architecture. We identify potential approaches to account for these biases and provide generalizable insights for utilizing nucleotide-resolution functional prediction models.
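One way a periodic oscillation in per-nucleotide probabilities could be detected is by comparing spectral power across candidate periods. The sketch below is an assumption about method, not the authors' analysis: the signal is synthetic (a 24-nt sine plus noise over 3,072 positions), and `power_at_period` and `dominant_period` are hypothetical helper names.

```python
import cmath
import math
import random


def power_at_period(signal, period):
    """Squared magnitude of the mean-centred signal's Fourier component at a given period."""
    mean = sum(signal) / len(signal)
    coeff = sum(
        (value - mean) * cmath.exp(-2j * math.pi * i / period)
        for i, value in enumerate(signal)
    )
    return abs(coeff) ** 2 / len(signal)


def dominant_period(signal, candidates):
    """Candidate period with the largest spectral power."""
    return max(candidates, key=lambda period: power_at_period(signal, period))


random.seed(1)
# Synthetic per-nucleotide probabilities: a 24-nt oscillation plus Gaussian noise.
probs = [
    0.5 + 0.1 * math.sin(2 * math.pi * i / 24) + random.gauss(0, 0.05)
    for i in range(3072)
]
print(dominant_period(probs, range(2, 101)))  # 24
```

On real model output the same scan would be run on the raw prediction probabilities for a given feature; a sharp peak at one period relative to its neighbours would indicate a systematic oscillation rather than noise.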
bioinformatics 2026-05-05 v2
multiVIB: A unified probabilistic contrastive learning framework for atlas-scale integration of single-cell multi-omics data
Xu, Y.; Fleming, S. J.; Wang, B.; Schoenbeck, E. G.; Babadi, M.; Huo, B.-X.
Abstract
Comprehensive brain cell atlases are essential for understanding neural functions and enabling translational insights. As single-cell technologies proliferate across experimental platforms, species, and modalities, these atlases must scale accordingly, calling for a data integration framework that aligns heterogeneous datasets without erasing biologically meaningful variations. Existing tools typically address narrow integration settings, forcing researchers to assemble ad hoc workflows that may generate artifacts. Here, we introduce multiVIB, a unified probabilistic contrastive learning framework that handles diverse integration scenarios. We show that multiVIB achieves state-of-the-art performance while mitigating spurious alignments. Applied to atlas-scale datasets from the BRAIN Initiative, multiVIB demonstrates robust and scalable integration, including integration of diverse data modalities and reliable preservation of species-specific variations in cross-species integration. These capabilities position multiVIB as a scalable, biologically faithful foundation for constructing next-generation brain cell atlases with the growing landscape of single-cell data.
bioinformatics 2026-05-05 v2
MolGene-E: Inverse Molecular Design to Modulate Single Cell Transcriptomics
Ohlan, R.; Murugan, R.; Xie, L.; Nallabolu, V.; Mottaqi, M.; Zhang, S.; Xie, L.
Abstract
Designing drugs that can restore a diseased cell to its healthy state is an emerging approach in systems pharmacology to address medical needs that conventional target-based drug discovery paradigms have failed to meet. Single-cell transcriptomics can comprehensively map the differences between diseased and healthy cellular states, making it a valuable technique for systems pharmacology. However, single-cell omics data is noisy, heterogeneous, scarce, and high-dimensional. As a result, no machine learning methods currently exist to use single-cell omics data to design new drug molecules. We have developed a new deep generative framework named MolGene-E to tackle this challenge. MolGene-E combines two novel models: 1) a cross-modal model that can harmonize and denoise chemical-perturbed bulk and single-cell transcriptomics data, and 2) a contrastive learning-based generative model that can generate new molecules based on the transcriptomics data. MolGene-E consistently outperforms baseline methods in generating high-quality, hit-like molecules on gene expression profiles from two evaluation settings: CRISPR knock-out perturbation profiles from the L1000toRNAseq dataset, and single-cell gene expression profiles from the Sciplex-3 dataset, both in a zero-shot molecule generation setting. This superior performance is demonstrated across diverse de novo molecule generation metrics. Extensive evaluations demonstrate that MolGene-E achieves state-of-the-art performance for zero-shot molecular generation. This makes MolGene-E a potentially powerful new tool for drug discovery.
bioinformatics 2026-05-05 v2
Sequence-dependent transferability of the LRLLR membrane translocation motif: A computational study of smacN and NR2B9c peptides.
Munoz-Gacitua, D.; Blamey, J.
Abstract
The LRLLR cell-penetrating motif can be transferred to confer membrane translocation activity, but only to compatible recipient peptides. Using umbrella sampling molecular dynamics simulations, we demonstrate that C-terminal LRLLR addition to the pro-apoptotic smacN peptide eliminates its translocation barrier entirely, transforming a +65 kJ/mol barrier into a -50 kJ/mol energy well. In contrast, N-terminal LRLLR addition to the neuroprotective NR2B9c peptide increases the translocation barrier from +85 to +100 kJ/mol, demonstrating that motif transfer can prove counterproductive for incompatible sequences. Cell-penetrating peptides offer promising strategies for intracellular delivery of therapeutic cargo, yet the sequence determinants governing their activity remain incompletely understood. The LRLLR motif, identified through systematic screening as essential for spontaneous membrane translocation, represents a minimal penetrating element whose transferability has not been previously evaluated. We appended this motif to two clinically relevant peptides: smacN, a tetrapeptide targeting inhibitor of apoptosis proteins in chemotherapy-resistant cancers, and NR2B9c, a nonapeptide that disrupts excitotoxic signaling in ischemic stroke. Potential of mean force profiles calculated across a POPC/POPG bilayer, combined with analysis of hydrogen bonding patterns, secondary structure propensity, and conformational dynamics, reveal the structural basis for these divergent outcomes. Successful transfer to smacN results from favorable complementarity: the hydrophobic, neutral smacN provides an ideal platform for the charged, amphipathic LRLLR motif, yielding a chimera capable of simultaneous interaction with both membrane leaflets. Transfer failure with NR2B9c stems from conformational rigidity induced by intramolecular hydrogen bonding, which prevents optimal membrane insertion, combined with unfavorable positioning of internal polar residues at the bilayer center. 
These findings establish that cell-penetrating motif transfer requires compatibility in charge distribution, hydrophobicity, and conformational flexibility between the motif and recipient sequence. The smacN-LRLLR chimera emerges as a promising candidate for experimental validation as a membrane-permeable therapeutic for survivin-positive tumors. More broadly, this work demonstrates the value of computational screening to identify compatible motif-cargo pairings prior to experimental investment.
bioinformatics 2026-05-05 v2
IMAS enables target-aware integration of tumour multiomics to resolve communication-guided regulatory mechanisms
Deyang, W.; Yamashiro, T.; Inubushi, T.
Abstract
Tumour multiomic datasets are often sparse, heterogeneous and limited in size, hindering robust and interpretable discovery of regulatory mechanisms. Here we present IMAS, a target-aware integrative framework for multiomic data augmentation and mechanism prioritization that leverages a pan-cancer single-cell multiomic resource to contextualize new tumour datasets and identify reliable sample-specific mechanistic hypotheses. IMAS combines shared latent-space modelling with target-domain adaptation to improve correspondence between predicted and observed RNA and TF profiles while concentrating explanatory predictive supports within the target dataset. Building on this adapted representation, IMAS reconstructs structured RNA-TF coupling networks, refines intercellular signaling through ligand-informed communication modelling, and organizes regulatory programs along communication-associated ordering. In independent colon cancer data, IMAS improved cluster-resolved correspondence and revealed communication-guided regulatory cascades across malignant epithelial states. A LAMB1-centred analysis further demonstrates how the framework supports progressive reinforcement of local regulatory structure and enables perturbation-based probing of context-specific dependencies. Rather than exhaustively predicting all possible outcomes, IMAS provides a target-aware and interpretable strategy to construct consistent mechanism-discovery scaffolds and prioritize regulatory dependencies in data-limited tumour systems.
bioinformatics 2026-05-05 v2
Cross-assay RNA modeling reveals cancer biomarkers
Townsend, H. A.; Jordan, K. R.; Wolsky, R. J.; Van Kleunen, L. B.; Davidson, N. R.; Behbakht, K.; Sikora, M. J.; Dowell, R. D.; Clauset, A.; Bitler, B. G.
Abstract
The clinical heterogeneity of cancer poses a major challenge for precision medicine. Limited cohort sizes across evolving assay platforms impede reliable biomarker discovery. Here, we systematically evaluate how to integrate data from four transcriptomics platforms: bulk and single-cell (sc) RNA sequencing (RNA-seq), NanoString, and microarray for predictive modeling in cancer. We use high-grade serous carcinoma (HGSC) of tubo-ovarian origin as a model system, as it is highly heterogeneous in both biology and assay data. We find that using fold-change of gene expression in patients with matched pre- and post-neoadjuvant chemotherapy samples reduces inter-patient and inter-assay variability but is insufficient to overcome platform-specific biases. Microarray and scRNA-seq data exhibit systematic biases, while RNA-seq and NanoString show the most promise for combination into a single training cohort. To mitigate inter-assay limitations, we generate a new data set of HGSC tumor samples profiled with both RNA-seq and NanoString, and use it to identify the limits of detection and optimal harmonization strategies. Our approaches enable integration of cohorts for separate and combined RNA-seq and NanoString predictive models of disease recurrence (test-set AUROCs > 0.8), validated in external microarray cohorts. We leverage single-cell and bulk RNA-seq network-based analyses to provide mechanistic context for genes in the predictive models. Our models indicate that GBP4 expression is a key predictor of recurrence and marks immune remodeling towards cytotoxicity. We provide an interactive web portal to facilitate exploration of data and results. These findings guide cross-assay harmonization of transcriptomic data and enable improved predictive modeling in heterogeneous cancers.
bioinformatics 2026-05-05 v1
Massively parallel reporter assay-informed modeling improves prediction of context-specific enhancer-gene regulatory interactions
DeGroat, W.; Kreimer, A.
Abstract
Enhancers are cis-regulatory elements that drive context-specific gene expression, yet their target genes and modes of action remain largely unresolved. Because most disease-associated variants lie in non-coding regulatory DNA, accurate, cell type-specific enhancer-gene (E-G) mapping is essential for understanding genetic risk. However, current E-G prediction frameworks lack the resolution to capture such context-specific interactions. Massively parallel reporter assays (MPRAs) provide measurement of cis-regulatory activity, but their integration into genome-scale E-G models has been limited. Here, we introduce MPRabc, an MPRA-informed model that improves E-G interaction prediction. MPRabc integrates predicted MPRA activity, sequence-derived regulatory features, epigenomic signals, and three-dimensional chromatin contact maps with CRISPR-based perturbation training data. Benchmarking against validated regulatory interactions shows that MPRabc outperforms state-of-the-art models. We generated high-resolution E-G networks for K562, HepG2, and hiPSC cell lines and applied a graph-based framework to identify regulatory architecture, map trait-associated variants and expression quantitative trait loci, and resolve transcription factor drivers of enhancer activity. Across contexts, we accurately recovered lineage-defining regulatory programs, including GATA1::TAL1 in K562, HNF1A/B in HepG2, and POU factor circuits in hiPSCs. Together, these results establish MPRA-informed modeling as a scalable strategy for decoding enhancer function and linking non-coding variants to gene regulatory mechanisms across cellular contexts.
bioinformatics 2026-05-05 v1