Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
Pathway redistribution reveals a shared signaling backbone and context-dependent regulatory modules in RNA-binding protein networks
Osato, N.; Sato, K.
Abstract
Understanding how regulatory architectures are reorganized across cellular contexts remains a central challenge in functional genomics. Here, we integrate co-expression-derived candidate regulatory interactions with interpretable deep learning to generate gene-level contribution scores and introduce delta NES (normalized enrichment score difference) to quantify pathway redistribution between cellular states. Because gene expression reflects the combined effects of multiple regulatory inputs, contribution scores capture relative regulatory influence rather than transcriptional abundance itself. Applying this framework to neural progenitor cells and K562 leukemia cells, we identify systematic redistribution of functional modules across multiple RNA-binding proteins, including PKM, HNRNPK, and NELFE. Neural System- and Immune System-associated modules are differentially positioned along the delta NES spectrum, indicating context-dependent redistribution of regulatory influence rather than isolated pathway activation events. At the pathway level, Signal Transduction consistently forms a shared signaling backbone across proteins and cellular contexts, while modules related to neuronal functions, immune responses, and developmental processes exhibit context-dependent redistribution. Subpathway analysis further reveals convergence on receptor-mediated signaling processes, including FGFR/RTK-, IRS-, and MAPK-related pathways. These redistribution patterns are preserved under alternative DeepLIFT background settings despite polarity changes in contribution-expression correlations, indicating that pathway-level contrasts arise from stable rank-structure differences rather than background-dependent score artifacts. Together, our findings demonstrate that contribution score-based pathway ranking reveals a conserved signaling backbone alongside context-dependent functional modules, providing a framework for interpreting regulatory architecture beyond expression-centric analyses.
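As a rough illustration of the delta NES idea described above, the sketch below subtracts per-pathway normalized enrichment scores between two cellular states and sorts pathways by the difference. The function names `delta_nes` and `rank_by_delta` are hypothetical, the NES values are made-up placeholders rather than results from the paper, and the direction of subtraction (state A minus state B) is an assumption, since the abstract does not state it.

```python
def delta_nes(nes_a, nes_b):
    """Return {pathway: NES_a - NES_b} for pathways scored in both states."""
    shared = nes_a.keys() & nes_b.keys()
    return {pathway: nes_a[pathway] - nes_b[pathway] for pathway in shared}


def rank_by_delta(deltas):
    """Sort pathways from most state-A-shifted to most state-B-shifted."""
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)


# Placeholder scores for illustration only (not taken from the paper).
nes_npc = {"Signal Transduction": 2.1, "Neuronal System": 1.8, "Immune System": -0.4}
nes_k562 = {"Signal Transduction": 2.0, "Neuronal System": -0.6, "Immune System": 1.5}

ranked = rank_by_delta(delta_nes(nes_npc, nes_k562))
for pathway, delta in ranked:
    print(f"{pathway}: {delta:+.2f}")
```

In this toy input, a pathway with similar NES in both states lands near zero, which is how a shared backbone would look on such a spectrum, while context-dependent modules fall toward the extremes.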
bioinformatics 2026-04-16 v11
An Explainable Knowledge Graph-Driven Approach to Decipher the Link Between Brain Disorders and the Gut Microbiome
Aamer, N.; Asim, M. N.; Vollmer, S.; Dengel, A.
Abstract
Motivation: The communication between the gut microbiome and the brain, known as the microbiome-gut-brain axis (MGBA), is emerging as a critical factor in neurological and psychiatric disorders. This communication involves complex pathways including neural, hormonal, and immune interactions that enable gut microbes to modulate brain function and behavior. However, the specific mechanisms through which gut microbes influence brain function remain poorly understood, and existing computational efforts to understand these mechanisms are simplistic or have limited scope. Results: This work presents a comprehensive approach for elucidating the interactions that allow gut microbes to influence brain disorders. We construct a large curated biomedical knowledge graph comprising 586,318 nodes across 16 entity types and 3,573,936 edges spanning 103 relation types, integrating ontological and experimental data relevant to the MGBA. On this graph, we train GNN-GBA, a GraphSAGE-based graph neural network with a DistMult relation-aware decoder, achieving an AUC-ROC of 0.997 and an F1-score of 0.981 on link prediction, outperforming nine baseline methods across four categories. Using GNNExplainer, we extract and rank mechanistic pathways connecting gut microbes to brain disorders, and demonstrate their stability across multiple random initializations. GNN-GBA successfully identified pathways for 125 brain disorders, revealing shared metabolite hubs (including flavonoids, bile acids, and short-chain fatty acids) that mediate gut-brain communication across diverse neurological conditions. Furthermore, we show that the top pathways are consistent with existing literature for three common disorders. Lastly, we develop an interactive dashboard (GutBrainExplorer) to explore thousands of potential mechanistic pathways across 125 brain disorders, which is publicly available. Availability: Code and data are available at https://github.com/naafey-aamer/GNN-GBA.
Contact: naafey.aamer@cs.rptu.de
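For readers unfamiliar with the DistMult decoder named in the abstract, the following is a minimal sketch of its standard scoring function: the score of a (head, relation, tail) triple is the sum of the element-wise product of the three embedding vectors. The toy vectors are invented, the function names are hypothetical, and squashing the score with a sigmoid is an assumption; in GNN-GBA the node embeddings would come from the GraphSAGE encoder, which is not reproduced here.

```python
import math


def distmult_score(head, relation, tail):
    """DistMult score: sum_i head[i] * relation[i] * tail[i]."""
    if not (len(head) == len(relation) == len(tail)):
        raise ValueError("head, relation and tail must have the same dimension")
    return sum(h * r * t for h, r, t in zip(head, relation, tail))


def link_probability(head, relation, tail):
    """Map the raw score to (0, 1) with a sigmoid (an assumed choice)."""
    return 1.0 / (1.0 + math.exp(-distmult_score(head, relation, tail)))


# Toy 3-dimensional embeddings, for illustration only.
microbe = [0.9, -0.2, 0.4]
produces = [1.0, 0.5, -1.0]   # relation embedding (a diagonal matrix stored as a vector)
metabolite = [0.8, 0.1, -0.3]

print(distmult_score(microbe, produces, metabolite))
print(link_probability(microbe, produces, metabolite))
```

One property worth keeping in mind is that this score is symmetric in head and tail, so DistMult on its own cannot distinguish the direction of a relation.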
bioinformatics 2026-04-16 v4
EGGS: Empirical Genotype Generalizer for Samples
Smith, T. Q.; Rahman, A.; Szpiech, Z. A.
Abstract
Summary: We introduce Empirical Genotype Generalizer for Samples (EGGS) which accepts empirical genotypes with missing data and replicates the distribution of missing genotypes along the empirical segment in other replicates. The empirical segment must have a number of sites less than the replicate. In addition, EGGS can remove phase, remove polarization, simulate deamination, simulate sequencing error, create pseudohaploids, and convert between Variant Call Format (VCF), ms-style replicates, and EIGENSTRAT/ANCESTRYMAP. When producing VCF files, EGGS is not limited to biallelic sites and assumes all samples are diploid. Availability and Implementation: EGGS is written in the C programming language. Precompiled executables, source code, the manual, and the analysis conducted in the paper are available at https://github.com/TQ-Smith/EGGS
bioinformatics 2026-04-16 v3
DIA-CLIP: a universal representation learning framework for zero-shot DIA proteomics
Liao, Y.; Wen, H.; E, W.; Zhang, W.
Abstract
Data-independent acquisition mass spectrometry (DIA-MS) has established itself as a cornerstone of proteomic profiling and large-scale systems biology, offering unparalleled depth and reproducibility. Current DIA analysis frameworks, however, require semi-supervised training within each run for peptide-spectrum match (PSM) re-scoring. This approach is prone to overfitting and lacks generalizability across diverse species and experimental conditions. Here, we present DIA-CLIP, a pre-trained model shifting the DIA analysis paradigm from semi-supervised training to universal cross-modal representation learning. By integrating a dual-encoder contrastive learning framework with an encoder-decoder architecture, DIA-CLIP establishes a unified cross-modal representation for peptides and corresponding spectral features, achieving high-precision, zero-shot PSM inference. Extensive evaluations across diverse benchmarks demonstrate that DIA-CLIP consistently outperforms state-of-the-art tools, yielding up to a 45% increase in protein identification while achieving a 12% reduction in entrapment identifications. Moreover, DIA-CLIP holds immense potential for diverse practical applications, such as single-cell and spatial proteomics, where its enhanced identification depth facilitates the discovery of novel biomarkers and the elucidation of intricate cellular mechanisms.
bioinformatics 2026-04-16 v3
evo3D R package: a spatial haplotype framework for structure-informed analysis of molecular evolution
Broyles, B. K.; He, Q.
Abstract
At the molecular level, selection pressures often act on protein structural features, yet most evolutionary analyses remain confined to linear sequences. Early structure-informed approaches improved interpretability by mapping single-site metrics onto protein structures, and later methods introduced 3D sliding windows to capture spatially clustered signals missed by linear window approaches. These frameworks, however, are restricted to predefined statistics and narrowly defined 3D window types, limiting the scope of questions that can be addressed. We developed an R package, evo3D, as a new framework for structure-informed evolutionary analysis that supports a wide range of downstream statistics and scales from simple to complex structures. evo3D extracts structure-informed multiple sequence alignment subsets (spatial haplotypes), making the structure-informed unit of analysis directly available to users. The framework supports fixed-count and fixed-distance spatial windows, introduces residue and codon analysis modes, and extends to multimers, interfaces, and multiple structural models through a single wrapper, run_evo3d(). We demonstrate evo3D's utility by performing an epitope-level diversity scan of Hepatitis C virus E1/E2 complex, identifying conserved spatial neighbourhoods missed by linear sliding windows, and by evaluating evo3D's scalability on the octameric Chikungunya virus E1/E2 assembly. Importantly, evo3D formalises the core components of structure-informed analysis of molecular evolution and removes technical barriers. As a result, the framework streamlines the evaluation of evolutionary patterns directly within 3D structural contexts, and we anticipate its wide application in molecular evolution studies. The package is available at github.com/bbroyle/evo3D.
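The two window types mentioned in the abstract can be pictured with a small sketch: a fixed-distance window collects all residues within a radius of a focal residue, a fixed-count window collects its k nearest residues, and the matching alignment columns form a "spatial haplotype" per sequence. This is written in Python for illustration only and is not the evo3D API (evo3D is an R package); every function name, the toy coordinates, and the toy alignment are assumptions.

```python
import math


def fixed_distance_window(coords, center, radius):
    """Indices of residues within `radius` of the focal residue (inclusive)."""
    return [i for i, xyz in enumerate(coords)
            if math.dist(xyz, coords[center]) <= radius]


def fixed_count_window(coords, center, k):
    """Indices of the k residues nearest to the focal residue (itself included)."""
    by_distance = sorted(range(len(coords)),
                         key=lambda i: math.dist(coords[i], coords[center]))
    return sorted(by_distance[:k])


def spatial_haplotypes(alignment, window):
    """Extract the alignment columns in `window` from every sequence."""
    return ["".join(seq[i] for i in window) for seq in alignment]


# Toy residue coordinates (one point per residue) and a toy alignment.
coords = [(0, 0, 0), (1, 0, 0), (5, 0, 0), (0, 1.5, 0), (9, 9, 9)]
alignment = ["ACDEF", "ACDGF"]

window = fixed_distance_window(coords, center=0, radius=2.0)
print(window, spatial_haplotypes(alignment, window))
print(fixed_count_window(coords, center=0, k=2))
```

Note that residue 3 joins the window of residue 0 even though it is not adjacent in sequence, which is the kind of signal a linear sliding window would miss.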
bioinformatics 2026-04-16 v2
Antimicrobial Resistance Prediction in Salmonella enterica Using Frequency Chaos Game Representation and ResNet-18
Ismail, S. M.; Fayed, S. H.
Abstract
Antimicrobial resistance (AMR) prediction from bacterial genomes remains a major challenge for clinical microbiology and surveillance. We developed a deep learning model based on Frequency Chaos Game Representation (FCGR) and a ResNet-18 architecture to classify resistance phenotypes directly from whole-genome assemblies. Using homology-aware clustering to prevent genomic data leakage, we trained and evaluated models for Salmonella enterica (seven antibiotics) and Staphylococcus aureus (five antibiotics). The Salmonella model achieved high predictive accuracy, particularly for cephalosporins, while performance was lower for tetracycline and ampicillin. The Staphylococcus aureus model demonstrated that the pipeline generalizes to Gram-positive bacteria, with strong results for methicillin (Balanced Accuracy = 0.85). Benchmarking against the gene-based tool ResFinder showed that the FCGR-based model did not match the performance of ResFinder on most antibiotics, but achieved competitive results for cephalosporins. This study demonstrates the feasibility of applying FCGR-based deep learning to AMR prediction across bacterial species, though substantial improvements would be needed before clinical application.
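To make the input representation concrete, here is a minimal sketch of a Frequency Chaos Game Representation: every k-mer in a sequence is mapped to one cell of a 2^k by 2^k grid, and the grid stores normalized k-mer frequencies that an image model such as ResNet-18 could consume. The corner assignment and bit order below are one convention chosen for illustration (published layouts differ), skipping k-mers with ambiguous bases is an assumption, and the function name `fcgr` is hypothetical rather than taken from the authors' code.

```python
# Assumed corner layout as (x_bit, y_bit); other conventions exist.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def fcgr(sequence, k):
    """Return a 2^k x 2^k grid of k-mer frequencies for a DNA sequence."""
    size = 2 ** k
    counts = [[0] * size for _ in range(size)]
    total = 0
    seq = sequence.upper()
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        if any(base not in CORNERS for base in kmer):
            continue  # skip k-mers containing N or other ambiguous symbols
        x = y = 0
        for base in kmer:
            x_bit, y_bit = CORNERS[base]
            x = (x << 1) | x_bit
            y = (y << 1) | y_bit
        counts[y][x] += 1
        total += 1
    if total == 0:
        return [[0.0] * size for _ in range(size)]
    return [[cell / total for cell in row] for row in counts]


grid = fcgr("ACGTACGTNACGT", k=2)
for row in grid:
    print(" ".join(f"{cell:.2f}" for cell in row))
```

For whole-genome assemblies a larger k would be used in practice, giving a fixed-size image regardless of genome length.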
bioinformatics 2026-04-16 v2
Highly Accurate Estimation of the Fold Accuracy of Protein Structural Models
Xie, L.; Ye, E.; Wang, H.; Zhang, T.; Zhen, Q.; Liang, F.; Liu, D.; Zhang, G.
Abstract
The function of a protein is intrinsically linked to its three-dimensional fold, and deep learning has revolutionized the field by enabling high-accuracy structure prediction at an unprecedented scale. Nevertheless, the growing deployment of these predictive pipelines in drug discovery and structural biology reveals a critical bottleneck that lies in the lack of independent and rigorous model accuracy estimation (EMA) methodologies. Here we present DeepUMQA-Global, a single-model deep learning framework for estimating accuracy of protein structural models. Our method employs a structure-sequence cross-consistency mechanism to quantify the bidirectional compatibility between the input sequence and the predicted three-dimensional structure, enabling a comprehensive characterization of fold accuracy. DeepUMQA-Global outperforms the self-assessment confidence scores of AlphaFold3, achieving improvements of 57.8% in the Pearson correlation and 49.0% in the Spearman correlation. With respect to the CASP16 retrospective benchmark, DeepUMQA-Global outperforms all single-model accuracy estimation methods that participated in CASP16 and achieves performance comparable to that of the top consensus-based methods. A lightweight consensus strategy built upon DeepUMQA-Global ranks first among all CASP16 participants, surpassing all other methods, including consensus approaches, and highlighting the strength of our method. Remarkably, DeepUMQA-Global demonstrates a strong ability to discriminate between alternative conformational states of proteins, as evidenced on the CASP unique alternative conformation protein complex target and the CoDNaS benchmark. Our results indicate that DeepUMQA-Global can be extended to broader protein modeling tasks, moving beyond static evaluation to offer a foundation for dynamic conformation EMA, where it accurately discriminates alternative conformational states and demonstrates reliable predictive fidelity in model accuracy estimation.
bioinformatics 2026-04-16 v1
Thermoadaptation of EndoG proteins in the Xenopus frog genus
Tokmakov, A. A.
Abstract
Xenopus is a genus of entirely aquatic frogs found in sub-Saharan Africa. Currently, the complete genomes of two species within the Xenopus genus, Xenopus laevis and Xenopus tropicalis, have been fully sequenced, annotated, and made publicly available. The two species inhabit markedly different environments: X. tropicalis lives in the hot, equatorial regions of Africa, whereas X. laevis resides in the cooler climates of southern Africa. In the present study, mutational profiling, comparative homology modeling, and computational bioinformatics were used to identify the features of adaptive evolution in Xenopus endonuclease G (EndoG) proteins. The multiple characteristics of EndoG isozymes were discovered to vary considerably between the two Xenopus species dwelling in different locations. Most notably, EndoG proteins from the psychrophilic X. laevis exhibit the increased contents of charged and polar residues, elevated pI, higher intramolecular interaction energies, B factors, molecular void volumes, and solvent accessibilities, but the decreased contents of nonpolar and aromatic amino acids, lower hydrophobicity, buried surface area, and molecular packing density compared to those from the thermophilic X. tropicalis. The observed differences strongly suggest that temperature plays a dominant role in EndoG diversification. Evaluation of intramolecular interaction energies appears to be a particularly sensitive and discriminative framework for assessing protein divergence at the structural level. Overall, this study highlights the diversification of homologous proteins in ectothermic vertebrate eukaryotes and provides mechanistic insight into protein adaptation to contrasting environments.
bioinformatics 2026-04-16 v1
Three-dimensional Virtual Adult Cardiomyocyte Transcriptomics
Luo, C.; Lyu, Y.; Guo, X.; Cheng, L.; Liang, Q.; Wang, S.; Wang, Y.; Zhang, S.; Wang, S.; Liu, T.; Luo, Y.; Lu, F.; Ran, B.; Zhang, Y.; Liu, X.; Wang, Y.; Qin, G.; Wu, J.; Lyu, Q. R.
Abstract
Adult cardiomyocytes are large, rod-shaped, and often multinucleated, which makes them challenging for current single-cell or single-nucleus RNA-sequencing platforms. Current spatial transcriptomics (ST) relies on nuclear-based cell segmentation, which performs poorly when identifying adult cardiomyocytes. Moreover, single-section ST of adult myocardium is insufficient to capture the cellular transcriptomic information of intact cardiomyocytes. Thus, there is an urgent need for novel technology that accurately profiles the transcriptome of adult cardiomyocytes in situ at the single-cell level. Here, we report the first three-dimensional virtual cardiomyocyte (3D-VirtualCM) transcriptome atlas by reconstructing multi-layer ST spanning a 100 µm depth of the adult mouse heart. Using membrane-based cell segmentation and similarity-guided cross-sectional contour matching, 3D-VirtualCM delineates individual cardiomyocyte 3D contours and integrates in situ transcriptome. 3D-VirtualCM identifies cardiomyocytes in the cell cycle using proliferative markers in the context of myocardial infarction (MI) and reveals the asymmetric intracellular RNA distribution along the longitudinal axis of cardiomyocytes. Using 3D RNA fluorescence in situ hybridization (FISH), we validated the longitudinal asymmetry of Glul and Gja1 mRNA in adult cardiomyocytes. In summary, 3D-VirtualCM provides a workflow that advances the study of cardiac pathophysiology at a bona fide single-cell level while preserving spatial context.
bioinformatics 2026-04-16 v1
scDisent: disentangled representation learning with causal structure for multi-omic single-cell analysis
Xi, G.
Abstract
Single-cell multi-omic technologies measure complementary aspects of cellular identity and regulatory state, yet most integration models compress these signals into one entangled latent space. Such representations are useful for clustering but poorly suited for mechanistic interpretation or perturbation-oriented analysis. We present scDisent (https://github.com/xiguoren/scDisent), a generative framework for disentangled representation learning that separates expression-associated variables (zexpr) from regulation-associated variables (zreg) and links them through a sparse directed mapping. scDisent combines modality-specific encoding, variational disentanglement with total-correlation and orthogonality constraints, and a Gumbel-gated causal module protected by detach-based gradient isolation. Evaluated on benchmark datasets with matched modalities, scDisent achieved best-in-benchmark integration performance while exposing regulatory structure that competing integration methods do not model explicitly. The learned causal atlas remained sparse, perturbation analyses recovered biologically coherent lineage-associated programs, and cross-dataset discovery analyses highlighted interpretable immune, neural and developmental signatures. Quantitative branch-separation analyses further showed that benchmark-label information concentrated in zexpr rather than zreg. Together, these results position scDisent as a computational method that improves not only integration quality but also biological interpretability, making single-cell multi-omic representations better suited to biological question answering and in silico hypothesis generation.
bioinformatics 2026-04-16 v1
Multiscale transcriptomic organization of the human brain with DigitalBrain
An, J.; Hu, X.; Jiang, Y.; Jiang, M.; Qiu, S.; Liu, G.; Wei, X.; Wang, Y.; Lin, J. Q.; Wang, C.; Lu, M.
Abstract
The human brain varies across anatomical regions, cell types, development, ageing and disease states, yet existing single-cell transcriptomic resources remain fragmented and difficult to integrate into a unified biological model. Here we present DigitalBrain, a human brain-specific atlas and foundation-model framework for organizing diverse and fragmented human brain transcriptomic data across scales. We first built DigitalBrain-Atlas, a harmonized whole-brain single-cell resource comprising 16.35 million transcriptomes from 2,143 donors across 165 brain regions, spanning the human lifespan and multiple neurological and clinical conditions. We then developed DigitalBrain-M1, a Transformer-based model that jointly encodes gene identity and expression magnitude to learn a shared embedding space for cells and genes. Across held-out datasets, DigitalBrain supported robust single-cell integration, clustering and cell-type annotation while preserving major biological structure and reducing technical fragmentation. Beyond these benchmarks, the learned embeddings revealed emergent large-scale hierarchical organization of the human brain, linking anatomically distinct regions into higher-order patterns consistent with known functional systems. Applied to human hippocampal aging, DigitalBrain identified cell-type-specific aging sensitive gene sets, identified dentate gyrus granule cells as a particularly age-sensitive population, and discovered selective reorganization of gene programs related to synaptic transmission, postsynaptic structure, membrane excitability and axon guidance during aging. Cross-dataset convergence was strongest at the level of functional modules and recurrent aging sensitive genes. Together, these results demonstrate DigitalBrain as a brain-specific framework for mapping human brain organization across scales, and as an early step towards a complete virtual organ for the human brain.
bioinformatics 2026-04-16 v1
MISSTE: a multiscale integrative spatial simulator for understanding the mechanisms underlying tissue ecosystems
Su, Z.; Yin, S.; Wu, Y.
Abstract
Multiscale tissue ecosystems are governed by coupled intracellular decision-making, cell-cell interactions, and spatially structured microenvironmental signals, yet these scales are often studied separately. Here we present MISSTE, a modular framework that integrates Boolean intracellular state logic, agent-based modeling, and partial differential equation fields within a unified spatial simulation architecture. As a proof of concept, we applied MISSTE to CAR-T therapy in a solid tumor microenvironment. The model recapitulated emergent features of CAR-T behavior, including limited tumor penetration, stromal suppression, localized cytokine remodeling, hypoxia-associated constraint, and progressive functional exhaustion. Comparison of baseline and optimized conditions showed that coordinated enhancement of interaction range, migration, and cytotoxic function improved immune persistence and partial tumor control. Systematic parameter scans further identified effective immune-tumor contact as a stronger determinant of outcome than killing strength alone, highlighting spatial access as the dominant bottleneck. Guided by these results, we designed sequential intervention strategies and found that time-ordered enhancement of infiltration, killing, and late functional protection outperformed a static optimized regime. Together, these results establish MISSTE as a generalizable multiscale methodology for dissecting tissue ecosystems and for generating mechanistically grounded strategies for engineered cellular therapy design.
bioinformatics 2026-04-16 v1
vcfilt: A Zero-Allocation Streaming Filter for High-Throughput VCF Processing
KP, M. M.
Abstract
Variant Call Format (VCF) files are the dominant interchange format for genomic variant data, but their size - routinely exceeding tens of gigabytes for population-scale studies - creates a significant computational bottleneck at the quality-filtering stage. Existing tools such as bcftools and vcftools provide broad functionality through general-purpose expression engines, but incur substantial per-record overhead from dynamic field lookup, type resolution, and heap allocation. We present vcfilt, a streaming, batch-parallel VCF filter implemented in Go that restricts its scope to three high-frequency filter criteria (INFO/DP, INFO/AF, and QUAL) and applies them via a zero-allocation byte-scan parser. Benchmarked on real 1000 Genomes Project data (chromosome 20, 1,811,146 variants), vcfilt achieves 147,000 variants/second on an 18 GB plain-text VCF file using a single thread - a 12.2x speedup over bcftools 1.18 under identical conditions. On gzip-compressed input, the speedup is 7.9x. Output is byte-for-byte identical to bcftools across all tested filter combinations. vcfilt is distributed as a self-contained static binary, a Docker image, and a Singularity-compatible container. The source code and all benchmark scripts are openly available under the MIT licence.
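The three filter criteria named in the abstract (QUAL, INFO/DP, INFO/AF) can be sketched as a line-by-line streaming filter. This Python sketch only illustrates the filtering logic; it does not reproduce vcfilt's Go implementation, its zero-allocation byte-scan parser, or its command-line interface. The function names, the threshold defaults, the rule that a multi-allelic record passes if any AF value meets the threshold, and the choice to drop records with missing or unparsable fields are all assumptions.

```python
def parse_info(info):
    """Parse key=value pairs from a VCF INFO column; valueless flags are skipped."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        if sep:
            fields[key] = value
    return fields


def filter_vcf(lines, min_qual=30.0, min_dp=10, min_af=0.01):
    """Yield header lines unchanged and records passing all three thresholds."""
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 8:
            continue  # not a complete VCF record
        try:
            qual = float(cols[5])                      # column 6: QUAL
            info = parse_info(cols[7])                 # column 8: INFO
            depth = int(info["DP"])
            allele_freqs = [float(v) for v in info["AF"].split(",")]
        except (KeyError, ValueError):
            continue  # assumed policy: drop records with missing values
        if qual >= min_qual and depth >= min_dp and max(allele_freqs) >= min_af:
            yield line


example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "20\t100\t.\tA\tG\t50\tPASS\tDP=20;AF=0.5\n",
    "20\t200\t.\tC\tT\t10\tPASS\tDP=20;AF=0.5\n",
]
for kept in filter_vcf(example):
    print(kept, end="")
```

Because `filter_vcf` is a generator, it can be fed an open file handle (or a `gzip.open(..., "rt")` handle) and never holds more than one record in memory.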
bioinformatics 2026-04-16 v1
Generative design of intrinsically disordered proteins based on conditioned protein language models: Data is the limit
Carriere, L.; Huyghe, A.; Pajkos, M.; Bernado, P.; Cortes, J.
Abstract
Intrinsically disordered proteins and regions (IDRs) are central to a multitude of biological processes. Despite extensive studies of their structural and physicochemical properties, the rational design of IDRs with defined conformational behavior remains challenging due to their ensemble nature. Here we present a generative framework for designing disordered protein sequences conditioned on target conformational ensemble descriptors using protein language models (pLMs). We formulate IDR design as the task of generating amino acid sequences predicted to realize specified biophysical properties and implement a Transformer encoder-decoder architecture that maps numerical descriptors to protein sequences. By training models on datasets spanning two orders of magnitude in size, we show that accurate control of conformational and physicochemical properties is achieved only at large data scale. These results demonstrate the feasibility of conditioning generative models on ensemble-level descriptors for IDR design. More broadly, these results support a data-centric paradigm for protein engineering, in which data availability emerges as a key limiting factor for the accurate design of IDRs.
bioinformatics 2026-04-16 v1
Interpretable Biological Sequence Clustering with iClust
Zhang, S.; Liu, X.; Lou, J.; Jiang, M.; He, Z.
Abstract
Biological sequence clustering is a fundamental problem in bioinformatics, yet most existing methods mainly optimize clustering quality or efficiency while offering limited insight into why sequences are grouped together. This restricts their usefulness in downstream analysis, where representative sequences and clear cluster boundaries are often needed. To address this issue, we propose iClust, an interpretable clustering method that characterizes each cluster by a representative prototype and an adaptive radius. By adapting to local sequence structure rather than relying on a single global threshold, iClust produces clusters that are both meaningful and explainable. A final consolidation step further reduces tiny fragments and improves structural stability. Experiments on simulated and real biological sequence datasets show that iClust achieves competitive clustering performance while providing clearer cluster-level explanations than conventional threshold-based methods. In addition to its empirical impact as a practical clustering method for biological sequences, this article opens up new avenues for developing biological sequence clustering approaches from the viewpoint of interpretable machine learning.
bioinformatics 2026-04-16 v1
Sampling antibody conformational ensembles with ABodyBuilder4-STEROIDS
Spoendlin, F. C.; Cagiada, M.; Ifashe, K.; Vavourakis, O.; Deane, C. M.
Abstract
Conformational flexibility is fundamental to the function of many proteins and in the case of antibodies can impact key properties such as affinity and specificity. While it is possible to predict single, static protein structures with high accuracy, predicting conformational ensembles remains challenging. Molecular dynamics simulations suffer from high computational costs, while deep learning methods are yet to achieve the same level of accuracy. Here, we introduce ABB4-STEROIDS, a generative structure prediction model that samples conformational ensembles of antibodies. We trained our model on 4.2 million structural frames derived from ~136,000 coarse-grained and a set of 83 new all-atom antibody MD simulations. We benchmarked our model on reproducing MD ensembles and evaluated the diversity of sampled structures and the covered conformational space against experimental evidence. ABB4-STEROIDS achieves state-of-the-art accuracy, particularly within the experimental benchmarks. The model is openly available and provides a robust resource for large-scale investigations of antibody conformational ensembles.
bioinformatics 2026-04-16 v1
LinkLlama: Enabling Large Language Model for Chemically Reasonable Linker Design
Sun, K.; Wang, Y. E.; Purnomo, J. C.; Cavanagh, J. M.; Alteri, G. B.; Head-Gordon, T.
Abstract
Fragment-based drug discovery (FBDD) relies heavily on the design of chemically viable linkers to connect fragments binding to different pocket regions into potent lead molecules. While recent generative models have advanced spatial fragment linking, they frequently produce linkers characterized by high torsional strain and non-drug-like motifs. In this work, we present LinkLlama, a fine-tuned Meta Llama 3 model that bridges the gap between text-based generation and 3D spatial awareness. By accepting natural language prompts that specify geometric constraints, such as distances and angles, alongside physicochemical targets like Lipinski's rules and rotatable bond limits, LinkLlama generates highly tailored molecules for the input fragments. Leveraging the inherent chemical grammar captured through supervised fine-tuning on a curated corpus of drug-like molecules from ChEMBL, the model prioritizes chemical validity without requiring complex reinforcement learning loops. Benchmarking on the ZINC and HiQBind datasets demonstrates that LinkLlama maintains competitive geometric fidelity compared to strictly 3D-aware models while achieving a two-fold increase in the proportion of chemically reasonable designs. This rising success rate, jumping from ~35% to over 80%, is defined by strict adherence to comprehensive structural filters including PAINS, non-drug-like chemical patterns and complex ring systems. We further illustrate the model's versatility through prospective case studies in novel small-molecule scaffold hopping and PROTAC linker design, validated via molecular docking and molecular dynamics simulations against known crystal poses. Ultimately, LinkLlama demonstrates that large language models can overcome the structural pitfalls of purely 3D-generative methods, offering a highly controllable and chemically robust framework to accelerate linker design and drug discovery in general.
bioinformatics 2026-04-16 v1
Impact of the N-glycosylation on full-length IgG2 and IgG4 antibodies: a comparative study using molecular dynamics simulations.
LEON FOUN LIN, R.; Bellaiche, A.; Diharce, J.; Etchebest, C.
Abstract
Like other proteins, monoclonal antibodies (important biodrugs) are subject to post-translational modifications, especially N-glycosylation. However, the effect of N-glycosylation remains poorly studied, and atomistic details about its influence are rarely available. Moreover, the few existing studies focus on the prevalent immunoglobulin G1. To better understand the impact of glycosylation, we have carried out a comparative exploration of the effect of N-glycosylation on two different subclasses of antibodies, namely Mab231, an IgG2, and pembrolizumab, an IgG4. The two antibodies differ in their sequences, their length, and their 3D structure, but also in the location and composition of their glycans. In the present work, detailed information was gained through molecular dynamics simulations in which both monoclonal antibodies were studied with and without their glycans. The results of 1.5 microseconds of sampling for each system show that glycosylation does not drastically alter the overall conformational landscape of either antibody, whatever the metrics considered. However, it measurably modulates local flexibility, inter-domain correlated motions, and the relative orientation of the Fab arms with respect to the Fc domain, with statistically significant shifts in key geometric descriptors. Importantly, contact analysis reveals that glycan interactions extend beyond the Fc region to reach Fab residues. The allosteric network calculations demonstrate that the influence of Fc-bound glycans propagates as far as the Fab framework regions in both mAbs, which could impact antigen binding. The nature and magnitude of these effects are subclass-dependent, reflecting differences in glycan composition, hinge architecture, and three-dimensional organization. Our findings challenge the prevailing view that Fc glycosylation uniformly promotes CH2 domain opening. More importantly, they underscore the necessity of considering full-length structures and IgG subclass diversity in glyco-engineering strategies.
bioinformatics 2026-04-16 v1
Track Display Jockey (trackDJ): a user-friendly R package for visualization of epigenomic data
Bokil, N. V.; Page, D. C.
Abstract
Background: Visualization of epigenomic data such as coverage tracks, peak calls, and chromatin interactions is a critical task in genomic data analysis. Although genome browsers such as the Integrative Genomics Viewer (IGV) and the UCSC Genome Browser permit user-friendly exploration of genomic tracks, they are not optimized for fully programmatic and reproducible generation of publication-quality figures. In contrast, existing programmatic tools lack a user-friendly interface and require extensive configuration. Results: We present trackDJ (Track Display Jockey), an R package for visualization of epigenomic data. trackDJ prioritizes usability by favoring convention over configuration; it provides high-level plotting functions with sensible defaults, allowing users with minimal programming experience to generate clear, publication-quality figures with relatively little coding. Within a unified plotting framework, users can stack and align multiple data types, including coverage tracks, peak annotations, chromatin loops, and gene annotations. trackDJ allows users to select plotted genomic regions by coordinates or by gene name, enabling rapid visualization without knowledge of precise locus boundaries. Conclusions: trackDJ provides a user-friendly and reproducible alternative to interactive genome browsers for epigenomic visualization, filling a critical gap in currently available epigenomics toolkits. By enabling scripted generation of clean, customizable genomic illustrations, trackDJ integrates naturally into R-based analysis workflows to streamline the creation of publication-quality figures.
bioinformatics 2026-04-16 v1
FlyPredictome: A structural atlas of predicted protein-protein interactions in Drosophila
Kim, A.-R.; Comjean, A.; Veal, A.; Rodiger, J.; Han, M.; Hu, Y.; Perrimon, N.
Abstract
Protein-protein interactions (PPIs) are fundamental to cellular function. Yet most Drosophila PPIs remain structurally uncharacterized despite the wealth of genetic and biochemical data available for this organism. Here we present FlyPredictome, a structural interactome based on 1.5 million pairwise AlphaFold-Multimer predictions. Using a local confidence metric that performs robustly for interactions involving flexible and disordered proteins, we systematically assess experimentally reported Drosophila PPIs and predict direct binding interfaces at residue-level resolution. Testing their functional relevance, we find that phenotype-associated missense mutations are enriched at predicted interaction interfaces. Building on these validated predictions, we construct an evidence-supported PPI network, revealing modular organization from signaling pathways to individual protein complexes. FlyPredictome is available as an open database, providing a structural foundation for interaction discovery in Drosophila.
bioinformatics 2026-04-16 v1
ORION: An agentic reasoning construct for the analysis of complex human immune profiling
Dayao, M. T.; Kim, K.; Khor, B.; Jaech, A.; van Opheusden, B.; Bodansky, A.; DeRisi, J.
Abstract
The capacity to generate high-dimensional biological datasets has outpaced the ability to interpret them. Technologies such as phage immunoprecipitation and sequencing (PhIP-seq) enable proteome-scale profiling of antibody repertoires, but interpreting thousands of enriched peptides into mechanistic hypotheses remains a labor-intensive bottleneck requiring expert synthesis of statistics, literature, and domain knowledge. Here we describe ORION (Omics Reasoning & Interpretation Orchestrator), a multi-agent framework that uses reasoning-capable large language models to perform end-to-end analysis of complex immune profiling data. ORION integrates statistical analysis, machine learning, and automated literature review into a single structured workflow, producing results that are reproducible and fully traceable. Applied to a published PhIP-seq dataset from autoimmune polyendocrine syndrome type 1 (APS-1), ORION recovered the canonical autoantibody signature in approximately two hours, closely recapitulating an analysis that originally required one to two months of manual effort. To test hypothesis-generation capacity on previously unseen data, we applied ORION to a novel PhIP-seq dataset from individuals with Down syndrome, for which no proteome-wide autoantibody reference exists. ORION distinguished disease from control samples with high accuracy, prioritized candidate autoantibody targets, and organized them into biologically coherent groups spanning immune, gut, and neuronal programs, generating testable hypotheses for experimental follow-up. These results demonstrate that agentic AI systems can compress the analysis of complex immune profiling data from weeks to hours, allowing scientists to redirect their time toward the fundamental biology.
bioinformatics 2026-04-16 v1
Longevity Bench: Are SotA LLMs ready for aging research?
Zhavoronkov, A.; Sidorenko, D.; Naumov, V.; Pushkov, S.; Zagirova, D.; Aladinskiy, V.; Unutmaz, D.; Aliper, A.; Galkin, F.
Abstract
Aging is a core biological process observed in most species and tissues, and it is studied with a vast array of technologies. We argue that the abilities of AI systems to emulate aging and to accurately interpret biodata in its context are the key criteria for judging an LLM's utility in biomedical research. Here, we present LongevityBench, a collection of tasks designed to assess whether foundation models grasp the fundamental principles of aging biology and can use low-level biodata to arrive at phenotype-level conclusions. The benchmark covers a variety of prediction targets, including human time-to-death, the effect of mutations on lifespan, and age-dependent omics patterns. It spans all common biodata types used in longevity research: transcriptomes, DNA methylation profiles, proteomes, genomes, clinical blood tests and biometrics, as well as natural language annotations. After ranking state-of-the-art foundation models using LongevityBench, we highlight their weaknesses and outline procedures to maximize their utility in aging research and the life sciences.
bioinformatics 2026-04-15 v3
Exploring molecular signatures of senescence with markeR, an R toolkit for evaluating gene sets as phenotypic markers
Martins-Silva, R.; Kaizeler, A.; Barbosa-Morais, N. L.
Abstract
Many biological processes, including cellular senescence, manifest as diverse phenotypes that vary across cell types and conditions. In the absence of single, definitive markers, researchers often rely on the expression of sets of genes to identify these complex states. However, there are multiple ways to summarise gene set expression into quantitative metrics (i.e., signatures), each with its own strengths and limitations, and we know of no consensus framework to systematically evaluate their performance across datasets. We therefore developed markeR (https://bioconductor.org/packages/markeR), an open-source, modular R package that evaluates gene sets as phenotypic markers using various scoring and enrichment-based approaches. markeR generates interpretable metrics and intuitive visualisations that enable benchmarking of gene signatures and exploration of their associations with chosen study variables. As a case study, we applied markeR to 9 published senescence-related gene sets across 25 RNA-seq datasets, covering 6 human cell types and 12 senescence-inducing conditions. There was wide variability in gene set performance: some signatures (e.g., SenMayo) were robust senescence markers across contexts, while others (e.g., those from MSigDB) performed poorly. We also used markeR to analyse gene expression in 49 GTEx tissues, revealing tissue- and age-related differences in senescence-associated signals. Together, these findings emphasise the difficulty of characterising molecular phenotypes and demonstrate the potential of markeR in facilitating the systematic evaluation of gene sets in various biological contexts.
bioinformatics 2026-04-15 v3
KyDab - a comprehensive database of antibody discovery selection campaigns.
Zhou, Q.; Chomicz, D.; Melvin, D.; Griffiths, M.; Yahiya, S.; Reece, S.; Le Pannerer, M.-M.; Krawczyk, K.
Abstract
Preclinical antibody discovery relies on progressive screening and down-selection of candidate antibodies from large immune repertoires, yet this critical process is poorly represented in existing public databases. Here we introduce KyDab (Kymouse Antibody Database), a well-curated database of antibody discovery selection data generated using standardized workflows on the Kymouse humanized mouse platform. The current release includes 11 immunisation studies in Kymouse platform mice covering 51 immunogens, more than 120,000 paired heavy-light chain sequences, and binding measurements for a selected subset of experimentally characterized clones. By capturing full-funnel selection data with consistent metadata and both positive and negative experimental outcomes, KyDab provides a valuable data resource for the development and evaluation of artificial intelligence models for antibody discovery. KyDab is accessible at https://kydab.naturalantibody.com, and the database will be continuously updated as new datasets become available.
bioinformatics 2026-04-15 v2
Fast and accurate resolution of ecDNA sequence using Cycle-Extractor
Faizrahnemoon, M.; Luebeck, J.; Hung, K. L.; Rao, S.; Prasad, G.; Wong, I. T.-L.; Jones, M. G.; Mischel, P. S.; Chang, H. Y.; Zhu, K.; Bafna, V.
Abstract
Extrachromosomal DNA (ecDNA) plays a key role in cancer pathology. EcDNAs mediate high oncogene amplification and expression and are associated with worse patient outcomes. Accurately determining the structure of these circular molecules is essential for understanding their function, yet reconstructing ecDNA cycles from sequencing data remains challenging. We introduce Cycle-Extractor (CE) for ecDNA reconstruction. CE accepts a breakpoint graph derived from either short- or long-read sequencing data as input and extracts a cycle with the maximum length-weighted copy number. CE utilizes a mixed-integer linear program (MILP) and a separate traversal procedure, enabling fast optimization and compatibility with free solvers. We evaluated CE against CoRAL (long-read-based quadratic optimization), Decoil (long reads), and AmpliconArchitect (AA; short reads) on both simulated data and real cancer cell lines. On simulated ecDNA, CE achieves performance comparable to CoRAL across three accuracy metrics and consistently outperforms AA and Decoil. On cancer cell lines, CE produces longer and heavier cycles than AA, and achieves performance similar to CoRAL. Moreover, CE is, on average, 40x faster than CoRAL. These results demonstrate that CE accurately reconstructs ecDNA from both short- and long-read sequencing data, while long-read inputs allow CE to recover more complete and higher-confidence ecDNA structures. CE improved the prediction of many ecDNA structures. On a PC3 ecDNA containing MYC, CE uses ONT data to reconstruct a substantially larger and higher-copy sequence (4.2 Mbp) compared to the short-read-derived reconstruction (690 kbp). CRISPR-CATCH experiments confirm the presence of a large ecDNA molecule, validating the long-read-based CE reconstruction.
bioinformatics 2026-04-15 v2
DIA-CLIP: a universal representation learning framework for zero-shot DIA proteomics
Liao, Y.; Wen, H.; E, W.; Zhang, W.
Abstract
Data-independent acquisition mass spectrometry (DIA-MS) has established itself as a cornerstone of proteomic profiling and large-scale systems biology, offering unparalleled depth and reproducibility. Current DIA analysis frameworks, however, require semi-supervised training within each run for peptide-spectrum match (PSM) re-scoring. This approach is prone to overfitting and lacks generalizability across diverse species and experimental conditions. Here, we present DIA-CLIP, a pre-trained model shifting the DIA analysis paradigm from semi-supervised training to universal cross-modal representation learning. By integrating a dual-encoder contrastive learning framework with an encoder-decoder architecture, DIA-CLIP establishes a unified cross-modal representation for peptides and corresponding spectral features, achieving high-precision, zero-shot PSM inference. Extensive evaluations across diverse benchmarks demonstrate that DIA-CLIP consistently outperforms state-of-the-art tools, yielding up to a 45% increase in protein identification while achieving a 12% reduction in entrapment identifications. Moreover, DIA-CLIP holds immense potential for diverse practical applications, such as single-cell and spatial proteomics, where its enhanced identification depth facilitates the discovery of novel biomarkers and the elucidation of intricate cellular mechanisms.
bioinformatics 2026-04-15 v2
A little longer, a lot better: simulation-guided exploration of extended-length single-end barcoded reads for structural variant detection
Luo, C.; Liu, Y. H.; Liu, H.; Zhang, Z.; Zhang, L.; Peters, B. A.; Zhou, X. M.
Abstract
Accurate detection of genetic variants, including single nucleotide polymorphisms (SNPs), small insertions and deletions (INDELs), and structural variants (SVs), is essential for comprehensive genomic analysis. While short-read sequencing performs well for SNP and INDEL detection, it remains limited in resolving SVs, particularly in complex genomic regions, due to its short read length. Linked-read sequencing technologies, such as single-tube Long Fragment Read (stLFR), partially address this limitation by incorporating molecular barcodes to provide long-range information. In this study, we evaluate conventional paired-end linked reads (PE100_stLFR) and explore a conceptual extension: long single-end barcoded reads of 500 bp (SE500_stLFR) and 1000 bp (SE1000_stLFR). We developed stLFR-sim, a Python-based simulator that reproduces the stLFR workflow and enables realistic benchmarking. Using a high-quality T2T assembly of HG002, we generated multiple datasets across 12 sequencing configurations. SVs were called using Aquila_stLFR (v2) and benchmarked against the Genome in a Bottle (GIAB) HG002 SV truth set with Truvari. We show that simulated PE100_stLFR closely matches real data, validating the simulation framework. Increasing read length consistently improves SV detection accuracy, with SE1000_stLFR achieving the best performance and approaching long-read methods while outperforming short-read and pangenome-based approaches. Collectively, our results highlight the strong potential of long single-end barcoded reads for improving SV detection, and suggest that even modest increases in read length, when combined with barcode information, can provide a cost-effective and practical strategy for enhancing future sequencing technologies and SV discovery.
bioinformatics 2026-04-15 v2
TFBindFormer: A Cross-Attention Transformer for Transcription Factor-DNA Binding Prediction
Liu, P.; Wang, L.; Basnet, S.; Cheng, J.
Abstract
Transcription factors (TFs) are central regulators of gene expression, and their selective recognition of genomic DNA underlies various biological processes. Experimental profiling of TF-DNA interactions using chromatin immunoprecipitation followed by sequencing (ChIP-seq) provides high-resolution maps of in vivo TF-DNA binding but remains costly, labor-intensive, and inherently low-throughput, limiting its scalability across different transcription factors, cell types, and regulatory conditions. Computational modeling therefore plays an essential role in inferring TF-DNA interactions at genome scale. However, most existing computational models rely solely on DNA sequence and chromatin features to predict TF-DNA binding, neglecting TF-specific protein information. This omission limits their ability to capture protein-dependent binding specificity. Here, we present TFBindFormer, a hybrid cross-attention transformer that explicitly integrates genomic DNA features with TF-specific representations derived from protein sequences and structures. By modeling protein-conditioned, position-specific TF-DNA interactions, TFBindFormer enables direct learning of molecular determinants underlying DNA recognition. Evaluated across hundreds of cell-type-specific TFs and hundreds of millions of genome-wide DNA bins, TFBindFormer consistently outperforms DNA-only baselines, achieving substantial gains in both area under the precision-recall curve (AUPRC) and area under the receiver operating characteristic curve (AUROC). Together, these results demonstrate that integrating TF and DNA features via cross-attention enables TFBindFormer to serve as an effective and scalable framework for large-scale TF-DNA binding prediction.
bioinformatics 2026-04-15 v2
Functional-space alignment resolves the eco-evolutionary landscape of siderophore biosynthesis across bacteria
Shao, J.; Wu, Y.; Tian, S.; Xu, R.; Luo, H.; He, R.; Shao, Y.; Yu, L.; Xiong, G.; Guo, P.; Nan, R.; Wei, Z.; Gu, S.; Li, Z.
Abstract
Siderophores are central mediators of microbial iron acquisition, competition, and ecological adaptation, yet their biosynthetic diversity remains difficult to resolve across species because existing sequence-based comparison of biosynthetic gene clusters (BGCs) is strongly constrained by phylogenetic background. Here we combine large-language-model-assisted literature mining, functional-space comparison, and genome-scale analysis to resolve the global organization of siderophore biosynthesis across bacteria. We first built SideroBank, a manually curated cross-species benchmark of siderophore BGCs, and used it to show that many identical products recur across distant taxa whereas the corresponding BGCs often fail to cluster in sequence space. We then developed BGC Block Aligner, which compares BGCs as ordered systems of functionally meaningful blocks and thereby converts comparison from sequence space to functional space. Applied to 97,432 bacterial genomes, this framework produced the Siderophore Atlas, revealing that siderophore synthesis is a remarkably pervasive trait encoded by over 60% of the analyzed genomes, with certain clusters being the most widely disseminated secondary metabolites across the bacterial domain. This global landscape suggests that the adoption of specific biosynthetic strategies is predominantly driven by ecological lifestyle rather than strict phylogenetic relatedness. Furthermore, a stark macro-evolutionary dichotomy was observed between the continuous structural diversification of NRPS pathways and the standardized, HGT-driven dissemination of NIS systems, linking functional-space genomics to the global ecology and evolution of siderophore biosynthesis.
bioinformatics 2026-04-15 v2
PoolParty: streamlined design of DNA sequence libraries in Python
Liu, Z.; Cordero, A.; Kinney, J. B.
Abstract
Motivation: Computationally designed DNA sequence libraries are essential components of massively parallel reporter assays (MPRAs), deep mutational scanning (DMS) experiments, and other multiplex assays of variant effect (MAVEs). They are also increasingly used in silico to analyze genomic AI models. Designing these libraries, however, remains tedious and error-prone due to the lack of purpose-built software. Results: Here we describe PoolParty, a Python package that streamlines the design of complex oligo pools using a simple but flexible API. In PoolParty, each library is represented by a computational graph that can be specified in just a few lines of code. Over 50 built-in operations cover nucleotide- and codon-level mutagenesis, motif insertion, barcode generation, and more. PoolParty automatically generates informative names for each sequence and provides "design cards" detailing how each sequence was generated. Visualization methods let users quickly audit library content and inspect the underlying graph. PoolParty thus transforms oligo pool design from a tedious task requiring custom functions and scripts into a structured, transparent, and reproducible process. Availability and implementation: PoolParty is freely available and can be installed using pip. It is compatible with Python ≥ 3.10. Documentation is provided at https://poolparty.readthedocs.io; source code is available at https://github.com/jbkinney/poolparty-statetracker. A static release is archived at DOI 10.5281/zenodo.19445098.
bioinformatics 2026-04-15 v2
RapCluster: Bridging the Reproducibility Gap in Clustering Analysis
Lutfi, A.; Warneke, R.; Fischer, L.; Rappsilber, J.
Abstract
Clustering is ubiquitous across science, yet a text-mining audit of 736,399 open-access articles identified as using clustering (2000-2025) reveals that common practice leaves key parameters undocumented or untuned, contributing to the reproducibility crisis in science. We developed an interactive web platform featuring 11 widely adopted clustering algorithms to enable transparent clustering analysis and reporting, aligning practical use with best practices in computational research.
bioinformatics 2026-04-15 v1
CROssBARv2: A Unified Computational Framework for Heterogeneous Biomedical Data Representation and LLM-Driven Exploration
Sen, B.; Ulusoy, E.; Darcan, M.; Ergun, M.; Lobentanzer, S.; Rifaioglu, A. S.; Turei, D.; Saez-Rodriguez, J.; Dogan, T.
Abstract
Biomedical discovery is hindered by fragmented, modality-specific repositories and uneven metadata, limiting integrative analysis, accessibility, and reproducibility. To address these challenges, we present CROssBARv2, a provenance-rich biomedical data-and-knowledge integration platform that unifies heterogeneous sources into a maintainable, scalable system. By consolidating diverse data types into an extensive knowledge graph enriched with standardised ontologies, rich metadata, and deep learning-based vector embeddings, CROssBARv2 alleviates the need for researchers to navigate multiple siloed databases and can facilitate downstream tasks, including predictive modelling and mechanistic reasoning, enabling applications such as drug repurposing and protein function prediction. The platform offers interactive graph exploration and embedding-based semantic search with CROssBAR-LLM, an intuitive natural language question-answering system that grounds large language model (LLM) outputs in the underlying knowledge graph to mitigate hallucinations. We assess CROssBARv2 through (i) multiple use-case analyses to test biological coherence and relational validity; (ii) knowledge-augmented biomedical question-answering benchmarks comparing CROssBAR-LLM against generalist LLMs; and (iii) a deep learning-based predictive modelling experiment for protein function prediction leveraging the heterogeneous structure of CROssBARv2. Collectively, CROssBARv2 provides a scalable, AI-ready, and user-friendly foundation that facilitates hypothesis generation, knowledge discovery, and translational research.
bioinformatics 2026-04-15 v1
Decoding Single-Cell Omics of Perturbation Responses Using DeSCOPE
Wu, P.; Wei, H.; Li, Y.; Zheng, X.; Zhou, C.; Hu, X.; Wang, C.
Abstract
Deciphering cellular responses to genetic perturbations is fundamental to modeling gene regulatory networks and understanding mechanisms that change cellular phenotypes. However, current computational approaches often fail to outperform simple baseline models, highlighting a critical bottleneck in their generalizability and robustness. Here, we present DeSCOPE, a lightweight conditional variational autoencoder framework for predicting genetic perturbation responses spanning transcriptomic, epigenomic, and broader multi-modal landscapes. We systematically benchmarked DeSCOPE across diverse datasets under two challenging out-of-distribution settings: unseen genes and unseen cell types. DeSCOPE uniquely surpasses simple baselines in the unseen gene scenario, and achieves substantially improved performance for unseen cell types while requiring fine-tuning with far fewer perturbed genes. Finally, DeSCOPE demonstrates superior performance in predicting combinatorial multi-gene perturbations. Overall, DeSCOPE serves as a versatile multi-modal virtual cell model that can effectively guide the design of therapeutic targets that change cellular phenotypes. DeSCOPE is available at https://github.com/wanglabtongji/DeSCOPE.
bioinformatics 2026-04-15 v1
Sex-biased gene expression shapes sex differences in gene essentiality
Rocca, C.; DeCasien, A. R.
Abstract
Sex differences in disease incidence and progression are well documented, yet their underlying molecular mechanisms remain poorly understood. Multiple models suggest that baseline gene expression levels shape the impact of gene disruption, raising the possibility that sex-biased expression itself contributes to sex differences in cellular vulnerability. Here, we test this hypothesis by integrating sex-biased transcriptomic profiles with large-scale CRISPR loss-of-function screens to determine whether sex-biased expression predicts sex-biased gene essentiality across the genome. We find that gene expression level and sex chromosome dosage each explain a modest fraction of variance in essentiality, with substantially larger effects for sex chromosome genes than for autosomes. Across genes, sex effects on expression and essentiality are small in magnitude but directionally aligned, suggesting that sex differences in transcription can influence functional dependency. To resolve how these relationships arise, we applied gene-level mediation analyses to decompose sex effects on essentiality into expression-mediated and expression-independent components. This approach revealed multiple mechanistic architectures. On autosomes, most genes exhibited either sex-biased essentiality from direct sex effects (independent of expression) or sex-biased expression without functional consequence, while expression-mediated sex differences accounted for a smaller but substantial fraction of genes. In contrast, X chromosome genes were dominated by direct, expression-independent sex differences, consistent with strong effects of sex chromosome dosage, but also showed enrichment of expression-mediated architectures, particularly among X gametologs. Together, our results demonstrate that while sex-biased expression can generate sex-biased gene essentiality, this mechanism is not the default. 
Instead, sex-biased functional dependency is often driven by direct, expression-independent effects, particularly on the X chromosome, where dosage and compensatory mechanisms play a dominant role.
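The decomposition of a sex effect into expression-mediated and expression-independent parts can be illustrated with a minimal product-of-coefficients sketch. This is not the authors' pipeline: the linear models, the synthetic data, the effect sizes, and the function names `ols` and `mediation` are all assumptions made for illustration.

```python
import numpy as np

def ols(y, *columns):
    """Least-squares fit of y on an intercept plus the given columns; returns the slopes only."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1:]

def mediation(sex, expression, essentiality):
    """Split the total sex effect on essentiality into direct and expression-mediated parts."""
    (total,) = ols(essentiality, sex)                # total effect of sex
    (a,) = ols(expression, sex)                      # effect of sex on expression
    direct, b = ols(essentiality, sex, expression)   # direct sex effect; expression effect
    return {"total": total, "direct": direct, "mediated": a * b}

# Synthetic single-gene example (hypothetical effect sizes).
rng = np.random.default_rng(0)
n = 500
sex = rng.integers(0, 2, size=n).astype(float)
expression = 0.8 * sex + rng.normal(size=n)
essentiality = 0.3 * sex + 0.5 * expression + rng.normal(size=n)

effects = mediation(sex, expression, essentiality)
print(effects)
```

For ordinary least squares fitted on the same samples, the total effect equals the direct effect plus the mediated effect, so a gene can be labelled by which component dominates. Real analyses would also need uncertainty estimates for each component, which this sketch omits.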
bioinformatics 2026-04-15 v1
Benchmarking precision matrix estimation methods for differential co-expression network analysis
Overmann, M.; Grabert, G.; Kacprowski, T.
Abstract
Background: Gene expression profiling is widely used to investigate disease mechanisms, but classical approaches such as differential expression or pairwise correlation analyses provide limited interpretability. Network-based differential co-expression methods that model conditional dependencies through partial correlations offer richer insights, yet their application in high-dimensional settings requires estimation of precision matrices. Numerous precision matrix estimation methods (PMEMs) have been proposed, but their relative performance under various conditions remains unclear. Results: Simulated gene expression datasets with known ground truth correlation structures were used to benchmark a broad set of PMEMs. Performance was strongly affected by data characteristics, including covariance structure, matrix density, covariance values, sample size-to-dimension ratio, and sampling distribution. Among the evaluated methods, GLassoElnetFast consistently showed the highest accuracy in recovering differential edges, although high signal-to-noise ratios and sufficient sample sizes remain essential for reliable inference. Conclusions: Evaluation across diverse simulation conditions demonstrated that no single metric or condition was sufficient to assess PMEM performance. Therefore, previous less extensive evaluations risked misleading conclusions. Our simulation and benchmarking framework supports future method development and ensures reproducible evaluation of newly developed approaches.
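To make the link between precision matrices and differential edges concrete, the sketch below converts a precision matrix into partial correlations using the standard identity rho_ij = -theta_ij / sqrt(theta_ii * theta_jj) and then differences two conditions. The toy matrices, the `threshold` cutoff, and the function names are assumptions for illustration; the estimation step itself (for example a graphical lasso variant) is not shown and is not taken from the benchmark.

```python
import numpy as np

def partial_correlations(precision):
    """Convert a precision matrix into a partial correlation matrix."""
    d = np.sqrt(np.diag(precision))
    pcor = -precision / np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)
    return pcor

def differential_edges(precision_a, precision_b, threshold=0.1):
    """List gene pairs whose partial correlation differs between two conditions.

    `threshold` is a hypothetical cutoff on the absolute difference.
    """
    delta = partial_correlations(precision_a) - partial_correlations(precision_b)
    rows, cols = np.triu_indices_from(delta, k=1)
    return [
        (int(i), int(j), float(delta[i, j]))
        for i, j in zip(rows, cols)
        if abs(delta[i, j]) > threshold
    ]

# Toy precision matrices: genes 0 and 1 are conditionally dependent only in condition A.
theta_a = np.array([[2.0, -1.0, 0.0],
                    [-1.0, 2.0, 0.0],
                    [0.0, 0.0, 1.0]])
theta_b = np.eye(3)

edges = differential_edges(theta_a, theta_b)
print(edges)  # a single edge between genes 0 and 1, with a difference of about 0.5
```

In a high-dimensional setting the two precision matrices would come from a regularised estimator, and the quality of that estimate is exactly what the benchmark above evaluates.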
bioinformatics 2026-04-15 v1
Beyond Structure and Affinity: Context-Dependent Signals for de novo Binder Success
Bozkurt, C.
Abstract
De novo protein binder design has advanced rapidly, yet most designs fail experimentally and current structure- and affinity-centred evaluation does not reliably predict which candidates will succeed. Here we show that biology-informed sequence features, derived from models trained on natural proteins, identify transferable and context-dependent associations with binder expression and binding that are not captured by structural scoring alone. We re-analysed two public benchmarks - the Bits to Binders CAR-T CD20 competition (11,984 designs; expression, proliferation, and T cell function gates) and the Adaptyv EGFR competition (603 designs; expression and binding affinity) - using five biology-informed ML models predicting disorder, amyloidogenicity, topology, PTM sites, and protein classification. Every feature was tested at each gate with FDR-corrected statistics. We identify three layers of signal. Transferable: lower aggregation propensity is the most robust cross-benchmark signal; PTM-site density recurs univariately but is partly length-confounded in EGFR. Architecture-dependent: topology, disorder, and disulfide-related descriptors are significant in both datasets but flip direction, consistent with the different requirements of CAR extracellular domains versus standalone binders. Context-specific: phosphorylation-related associations with CAR-T depletion and low-disorder dominance in EGFR binding are tied to individual assay or format contexts. In the CAR-T benchmark, stacking biology-informed filters raises the enrichment hit rate from 13.8% to 38.6% (2.8x lift) after controlling for known sequence-level predictors. These results suggest that pre-synthesis screening of de novo binders may benefit from being multi-gate and context-aware, using biology-informed sequence descriptors not only to rank candidates but also to help flag likely failure modes earlier and reduce wasted synthesis and testing.
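The reported lift follows directly from the two hit rates quoted in the abstract. The short check below only reproduces that arithmetic; the function name `lift` is an illustrative choice and no design-level data from the benchmark is used.

```python
def lift(filtered_rate, baseline_rate):
    """Ratio of the hit rate after filtering to the baseline hit rate."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return filtered_rate / baseline_rate

baseline = 0.138   # hit rate before filtering, as quoted in the abstract
filtered = 0.386   # hit rate after stacking the filters, as quoted in the abstract

print(f"{lift(filtered, baseline):.1f}x")  # 2.8x
```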
bioinformatics 2026-04-15 v1
π-MSNet: A billion-scale, AI-ready living proteomics data portal
Dai, C.; Liu, Y.; Ling, T.; Qiu, Y.; Xu, H.; Zhang, Q.; Huang, X.; Zhu, Y.; Sachsenberg, T.; Bai, M.; He, F.; Perez-Riverol, Y.; Xie, L.; Chang, C.
Abstract
Artificial intelligence (AI) is reshaping proteomics workflows, delivering remarkable gains in both peptide identification sensitivity and quantitative performance. However, the potential of deep learning models in proteomics has not been fully exploited due to the scarcity of large-scale, high-quality and consistently labeled datasets. Here, we present π-MSNet, a billion-scale, AI-ready living mass spectrometry (MS) data portal. Using a uniform identification and quality control workflow, it comprises over 1.66 billion MS/MS spectra, 501 million peptide-spectrum matches (PSMs), and 9 million precursors from 36,356 LC-MS/MS runs across ten instrument types and 55 diverse species. Through community collaboration, the data are shared via international, interactive, and living web resources. Enabled by the built-in MSNetLoader Python API for seamless and scalable data access (with native support for PyTorch and TensorFlow), π-MSNet provides an AI-ready data framework for efficient training and systematic benchmarking of multiple models across three representative tasks (MS/MS spectrum prediction, retention time prediction, and de novo peptide sequencing). In particular, by retraining multiple models on π-MSNet, we achieved consistent performance improvements over their original versions. These improved models were subsequently integrated into the π-MSNet agent to enable interactive, deployment-free use. Through SDRF (Sample and Data Relationship Format) metadata, an open-source cloud analysis workflow, and a community-driven interactive data portal that supports continuous data submission, π-MSNet serves as a living, AI-ready resource for reproducible benchmarking, robust model training, and accelerated AI innovation in proteomics.
bioinformatics 2026-04-15 v1
Differential co-localisation analysis of multi-sample and multi-condition experiments with spatialFDA
Emons, M.; Scheipl, F.; Gunz, S.; Purdom, E.; Robinson, M. D.
Abstract
Advances in spatial omics data generation have led to an explosion in new datasets that record the spatial location of transcripts and proteins. However, challenges remain in the analysis of spatial omics data. One important analysis is differential cellular co-localisation (CCoL): the quantification of the clustering, or spacing, of one or more cell types across multiple conditions. Our framework spatialFDA combines methodology from spatial statistics with functional data analysis to accurately quantify and test for differences between conditions in CCoL across spatial scales. Using two simulation studies, we show that spatialFDA performs well in controlled settings. Furthermore, spatialFDA recovers known biological processes in type-1 diabetes and adds insights about the CCoL strength in space. spatialFDA is readily available as an open source Bioconductor R package.
bioinformatics 2026-04-15 v1
U-Probe: universal agentic probe design for imaging-based spatial-omics
Zhang, Q.; Cai, H.; Zhang, J.; Zhang, L.; Wu, X.; Wei, Y.; Chen, Y.; Wu, X.; Su, W.; Qi, W.; Qiu, X.; Cao, G.; Xu, W.
Abstract
Probe design for fluorescence in situ hybridization (FISH) underpins spatial transcriptomics, three-dimensional genome studies, and clinical diagnostics, yet remains constrained by two challenges: dependence on expert knowledge for parameter selection and quality evaluation, and the inability of existing tools to accommodate the diverse probe architectures introduced by rapidly emerging methods. Here we present U-Probe, a universal and agentic probe design platform. U-Probe employs a declarative configuration system with a directed acyclic graph (DAG)-based assembly engine that supports arbitrary probe structures from established protocols such as MERFISH, seqFISH, and MiP-seq to entirely novel architectures without code modifications. Integrated LLM-based AI agents enable conversational design workflows, allowing users to specify experimental goals in natural language and receive synthesis-ready probe sequences. We validated U-Probe in three scenarios: agent-driven MiP-seq panel design for influenza-infected mouse lung tissue, genome-tiling DNA-FISH for herpesvirus detection, and a novel RCA-based ligation probe for single-nucleotide mutation discrimination. U-Probe is available as an open-source tool with CLI, Web, and agent interfaces.
bioinformatics2026-04-15v1Discovery of Selective Nrf2 Activators from Natural Products: A Computational Screening Approach to Minimize Off-Target Effects on PXR and CYP2D6
Wang, Y.; Gong, Y.; Li, R.; Li, Z.; Cai, H.; Fan, L.; Ma, H.Abstract
Nuclear factor erythroid 2-related factor 2 (Nrf2) is a central regulator of cellular antioxidant responses and a highly promising therapeutic target for a range of oxidative stress-related diseases. However, the clinical translation of Nrf2 activators has been hampered by significant off-target effects, notably unintended activation of the pregnane X receptor (PXR) and inhibition of cytochrome P450 2D6 (CYP2D6), which can lead to dangerous drug-drug interactions and metabolic complications. To overcome this critical barrier, we conducted the first large-scale computational screening of 628,898 natural products from the COCONUT database, integrating molecular docking with a rigorous three-tier selectivity strategy designed to prioritize compounds that strongly bind KEAP1 (the primary Nrf2 repressor) while minimizing interactions with PXR and CYP2D6. Our approach identified 10 ultraselective candidates that demonstrate potent KEAP1 affinity, negligible PXR engagement, and only moderate CYP2D6 binding, achieving up to 12.29-fold selectivity for Nrf2 pathway activation. These top hits are structurally novel, enriched in lipid-like and nucleoside-inspired scaffolds, and exhibit promising drug-like properties. By providing both a curated set of chemically diverse, selectivity-optimized leads and a publicly accessible screening dataset, this work establishes a new foundation for the rational development of safer, more precise Nrf2-targeted therapies, bridging a crucial gap between target potential and clinical viability. By prioritizing compounds with minimal off-target effects on PXR and CYP2D6, our approach offers a scalable template for reducing drug development failures and advancing safer therapeutics for oxidative stress-related diseases.
bioinformatics2026-04-15v1TPCAV: Interpreting deep learning genomics models via concept attribution
Yang, J.; Mahony, S.Abstract
Interpreting genomics deep learning models remains challenging. Existing feature attribution methods are largely restricted to one-hot DNA inputs and therefore cannot assess the influence of more general genomic features such as chromatin states or genomic repeats. Concept attribution methods offer an input-agnostic global interpretation framework, yet they have not been systematically applied to interpret neural network applications in genomics. We present the first application of concept attribution to interpret genomics deep learning models by adapting the Testing with Concept Activation Vectors (TCAV) method. We improve upon the original TCAV method by incorporating a PCA-based decorrelation transformation to address correlated and redundant embedding features commonly observed in genomics deep learning models, resulting in the Testing with PCA-projected Concept Activation Vectors (TPCAV) approach. We also introduce a strategy for extracting concept-specific input attribution maps. We evaluate our approach by interpreting influential biological concepts across a diverse set of genomics models spanning multiple input representations and prediction tasks. We demonstrate that TPCAV provides comparable motif feature interpretation to TF-MoDISco on one-hot encoded DNA-based transcription factor binding prediction models. TPCAV also enables robust interpretive analysis of how more general biological concepts such as repetitive elements and chromatin state annotations contribute towards predictions. TPCAV uniquely generalizes to interpret features learned by tokenized foundation models as well as models incorporating chromatin signals as inputs. We further show that TPCAV can identify representative regions associated with specific concepts, motivating downstream investigation of distinct regulatory mechanisms. TPCAV provides a flexible and robust complement to existing model interpretation techniques.
bioinformatics2026-04-14v4TSvelo: Comprehensive RNA velocity by modeling cascade of gene regulation, transcription and splicing
Li, J.; Wang, Z.; Shen, H.-B.; Yuan, Y.Abstract
RNA velocity approaches fit gene dynamics and infer cell fate by modeling the splicing process using single-cell RNA sequencing (scRNA-seq) data. However, due to the short time scale of splicing, high noise, and the complexity of the data, existing RNA velocity methods often fail to precisely capture the complex velocity dynamics of individual genes and single cells, which makes downstream analysis less reliable and less robust. We propose TSvelo, a comprehensive mathematical framework for RNA velocity that models the cascade of gene regulation, Transcription and Splicing using highly interpretable neural Ordinary Differential Equations (ODEs). TSvelo can precisely capture the transcription-unspliced-spliced 3D dynamics of all genes simultaneously, infer a unified latent time shared by genes within each cell, and be applied to multi-lineage datasets. Experiments on six scRNA-seq datasets, including two multi-lineage datasets, demonstrate TSvelo's superiority.
bioinformatics2026-04-14v4GRASP: Gene-relation adaptive soft prompt for scalable and generalizable gene network inference with large language models
Feng, Y.; Deng, K.; Guan, Y.Abstract
Gene networks (GNs) encode diverse molecular relationships and are central to interpreting cellular function and disease. The heterogeneity of interaction types has led to computational methods specialized for particular network contexts. Large language models (LLMs) offer a unified, language-based formulation of GN inference by leveraging biological knowledge from large-scale text corpora, yet their effectiveness remains sensitive to prompt design. Here, we introduce Gene-Relation Adaptive Soft Prompt (GRASP), a parameter-efficient and trainable framework that conditions inference on each gene pair through only three virtual tokens. Using factorized gene-specific and relation-aware components, GRASP learns to map each pair's biological context into compact soft prompts that combine pair-specific signals with shared interaction patterns. Across diverse GN inference tasks, GRASP consistently outperforms alternative prompting strategies. It also shows a stronger ability to recover unannotated interactions from synthetic negative sets, suggesting its capacity to identify biologically meaningful relationships beyond existing databases. Together, these results establish GRASP as a scalable and generalizable prompting framework for LLM-based GN inference.
bioinformatics2026-04-14v2Beyond Single Algorithms: A Framework for Validating and Aggregating Active Modules in Genetic Interaction Networks
Liu, J.; Xu, M.; Xing, J.Abstract
High-throughput sequencing methods have generated vast amounts of genetic data for candidate gene studies. However, the complexity of the disease genetic structure often results in a large number of candidate genes and poses a significant challenge for these studies. To explore the multi-gene interactions and elucidate the genetic mechanism, candidate genes are often analyzed through Gene-Gene interaction (GGI) networks. These networks can become very large, necessitating efficient methods to reduce their complexity. Active Module Identification (AMI) is a common method to analyze GGI networks by identifying enriched subnetworks representing relevant biological processes. Multiple AMI algorithms have been developed for biological datasets, and a comparative analysis of their behaviors across a variety of datasets is crucial to their application. In this study, we introduce a framework to compare and aggregate the modules produced by multiple AMI algorithms. We first used a modified Empirical Pipeline to validate the output of four AMI algorithms -- PAPER, DOMINO, FDRnet, and HotNet2 -- and find that no single algorithm performs well across the different datasets. Using the Earth Mover's Distance to measure pairwise module similarity, we find that the outputs of different algorithms are structurally distinct, suggesting that each captures different aspects of the underlying biology. These findings suggest that a comprehensive analysis requires the aggregation of outputs from multiple algorithms. We propose two methods to this end: a spectral clustering approach for module aggregation, and an algorithm that combines modules with similar network structures called Greedy Conductance-based Merging (GCM). The merging algorithm not only allows researchers to obtain a set of cohesive modules from multiple algorithms, it also has the potential of identifying "hidden" genes that are not present in the original input data from the network. 
Overall, our results advance our understanding of AMI algorithms and how they should be applied. Tools and workflows developed in this study will facilitate researchers working with GGI and AMI algorithms to enhance their analyses. Our code is freely available at https://github.com/LiuJ0/AMI-Benchmark/.
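The abstract names conductance as the quantity behind Greedy Conductance-based Merging but does not spell out the procedure. As a minimal sketch, assuming the standard graph-theoretic definition (cut edges divided by the smaller of the two volumes), the conductance of a candidate module could be computed as below. The toy graph and node names are hypothetical, and this is not the authors' merging algorithm.

```python
def conductance(adjacency, module):
    """Conductance of a node set in an undirected, unweighted graph.

    adjacency: dict mapping each node to the set of its neighbours.
    module: iterable of nodes forming the candidate module.
    """
    module = set(module)
    cut = 0      # edges with exactly one endpoint inside the module
    vol_in = 0   # sum of degrees of nodes inside the module
    vol_out = 0  # sum of degrees of nodes outside the module
    for node, neighbours in adjacency.items():
        degree = len(neighbours)
        if node in module:
            vol_in += degree
            cut += sum(1 for n in neighbours if n not in module)
        else:
            vol_out += degree
    denom = min(vol_in, vol_out)
    return cut / denom if denom else 0.0


# Hypothetical toy network: two triangles joined by the single edge c-d.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e", "f"},
    "e": {"d", "f"},
    "f": {"d", "e"},
}

tight = conductance(graph, {"a", "b", "c"})  # one cut edge, volume 7
loose = conductance(graph, {"a", "b"})       # two cut edges, volume 4
```

A lower value indicates a more cohesive module: here the full triangle scores 1/7, while the partial set scores 0.5.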
bioinformatics2026-04-14v2From Movement to METs: A Validation of ActTrust(R) for Energy Expenditure Estimation and Physical Activity Classification in Young Adults
dos Santos Batista, E.; Basilio Gomes, S. R.; Bruno de Morais Ferreira, A.; Franca, L. G. S.; Fontenele Araujo, J.; Mortatti, A. L.; Leocadio-Miguel, M. A.Abstract
Estimating physical activity (PA) levels is a challenging and expensive task. An alternative could be the use of actigraphy devices to estimate PA. This has previously been done for a number of devices, including ActiGraph(R) GT3X+. In this study, we validated ActTrust(R) against the widely used GT3X+ and compared activity counts to metabolic equivalents (METs) derived from indirect calorimetry during treadmill walking and running. Fifty-six young adults (34 men, 22 women) participated in controlled effort exercises including light, moderate, vigorous, and very vigorous activity intensities. We developed a linear model to estimate energy expenditure (EE) from movement counts of combinations of devices placed at the hip or wrist. We then estimated cut-off points for each intensity range. Our results showed correlations between treadmill speed and both METs (r = 0.95, p < 0.05) and movement counts from both GT3X+ and ActTrust devices placed either on the hip (r = 0.94, p < 0.05; r = 0.93, p < 0.05) or on the wrist (r = 0.88, p < 0.05; r = 0.88, p < 0.05), respectively. Our proposed model performed well, with balanced accuracies above 0.77 for all intensity ranges and over 0.9 for light and moderate activity. This is the first study to model, estimate, and validate PA intensity thresholds on ActTrust(R) devices. Our findings support the use of ActTrust(R) devices as a simple, cost-effective tool for 24-hour assessments of EE.
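To make the evaluation concrete, here is a minimal sketch of classifying movement counts into intensity ranges with cut-off points and scoring the result with balanced accuracy (the mean of per-class recall). The cut-off values, counts, and labels are hypothetical placeholders, not the thresholds reported in the study, and the study may have computed balanced accuracy per intensity range rather than across all classes as done here.

```python
from bisect import bisect_right
from collections import defaultdict

LABELS = ["light", "moderate", "vigorous", "very vigorous"]
CUTOFFS = [100, 300, 600]  # hypothetical count thresholds between ranges


def classify_intensity(count, cutoffs=CUTOFFS, labels=LABELS):
    """Map a movement count to an intensity label using sorted cut-offs."""
    return labels[bisect_right(cutoffs, count)]


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so rare classes weigh as much as common ones."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical counts and reference labels (e.g. derived from METs).
counts = [50, 150, 350, 700, 90, 320]
reference = ["light", "moderate", "vigorous", "very vigorous",
             "moderate", "vigorous"]

predicted = [classify_intensity(c) for c in counts]
score = balanced_accuracy(reference, predicted)
```

In this toy data one "moderate" sample falls below the first cut-off, so the recall for that class is 0.5 and the balanced accuracy is 0.875.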
bioinformatics2026-04-14v2found: Inferring cell-level perturbation from structured label noise in single-cell data
Afanasiev, E.; Goeva, A.Abstract
Recent work by Goeva et al. introduced HiDDEN, a method for refining batch-level labels to infer cell-level perturbation without prior knowledge of affected populations, addressing the mismatch between sample-level labels and heterogeneous perturbation effects across cells. Here, we present found, a Python and R implementation of HiDDEN, supporting pipeline customization, by-factor grouping, hyperparameter selection, and visualization. Through benchmarking across diverse datasets, we show that performance depends strongly on modeling choices, particularly regression, grouping, and embedding dimensionality. found provides a practical, flexible, and accessible framework for robust cell-level perturbation analysis.
bioinformatics2026-04-14v1Identification of the novel inhibitors against M. tuberculosis ESX-1 secretion system EccA1 enzyme using virtual screening, docking and dynamics simulation techniques
Kumar, R.; Saxena, A. K.Abstract
The M. tuberculosis ESX-1 secretion system EccA1 enzyme is involved in the secretion of virulence factors and is essential for virulence and bacterial survival within the phagosome. Development of small-molecule inhibitors that abolish EccA1 function can yield new antivirulence drugs. In this study, we modeled the full-length EccA1 (573 residues, Mw ~62.4 kDa) structure, which contains an N-terminal TPR domain and a C-terminal CbxX/CfqX-type ATPase domain. Through virtual screening of ZINC compounds targeting the C-terminal ATPase pocket of EccA1, we identified five compounds with the following binding energies: Z1 (ZINC000004513760, -43.45 kcal/mol), Z2 (ZINC000000001793, -49.56 kcal/mol), Z3 (ZINC000005390388, -55.83 kcal/mol), Z4 (ZINC000257294577, -52.33 kcal/mol), and Z5 (ZINC000004824264, -44.44 kcal/mol). The Z1-Z5 compounds were compared, against EccA1, with the ADP substrate (adenosine diphosphate, -35.00 kcal/mol) and with the p97 ATPase inhibitors NMS873 (3-[3-cyclopentylsulfanyl-5-[[3-methyl-4-(4-methylsulfonylphenyl)phenoxy]methyl]-1,2,4-triazol-4-yl]pyridine, -48.68 kcal/mol) and CB5083 (1-[4-(benzylamino)-5H,7H,8H-pyrano[4,3-d]pyrimidin-2-yl]-2-methyl-1H-indole-4-carboxamide, -50.88 kcal/mol). The Z1-Z5 compounds exhibited good Absorption, Distribution, Metabolism, and/or Excretion properties (ADMET). Pharmacokinetic analysis and Lipinski's rule of five indicated that the Z1-Z5 compounds have drug-like properties. 100 ns dynamics simulations of EccA1 complexed with (i) the Z1-Z5 compounds, (ii) the ADP substrate, and (iii) the NMS873 and CB5083 inhibitors showed high stability and biologically relevant conformations. These data indicate that Z1-Z5 compounds may act as potential inhibitors against EccA1 and provide avenues for new antivirulence drug development after in vitro and in vivo clinical trials.
bioinformatics2026-04-14v1A Machine Learning Approach for Physiological Role Prediction in Protein Contact Networks: a large-scale analysis on the human proteome
Cervellini, M.; Martino, A.Abstract
Proteins are fundamental macromolecules involved in virtually all biological processes. Their physiological roles are tightly linked to their three-dimensional structure, which can be naturally abstracted as Protein Contact Networks (PCNs), i.e., graphs where residues are nodes and edges encode spatial proximity. This representation enables the application of Graph Machine Learning to address the protein functional annotation gap at proteome scale. In this work, protein function prediction is studied on the majority of the human proteome, focusing on enzymatic activity and enzyme class assignment as well-defined and biologically meaningful targets. A large-scale supervised analysis was conducted on PCNs derived from experimentally resolved human protein structures. Multiple graph-based learning paradigms were systematically compared under a unified evaluation protocol, including handcrafted graph embeddings, kernel methods, and end-to-end Graph Neural Networks (GNNs). Feature engineering approaches comprised (i) spectral density embeddings of the normalized graph Laplacian and (ii) higher-order topological representations based on simplicial complexes, with optional INDVAL-based feature selection. These representations were paired with linear, ensemble, and kernel classifiers, while GNNs were trained directly on raw PCNs exploiting a diverse set of message-passing architectures. Two tasks were considered: binary classification of enzymatic versus non-enzymatic proteins and multiclass prediction of first-level Enzyme Commission (EC) classes. Performance was assessed using repeated stratified splits to ensure robust and variance-aware evaluation. In the binary enzymatic classification task, the Jaccard-based graph kernel achieved the best performance with an adjusted balanced accuracy of 0.90, closely followed by GNNs trained end-to-end on PCNs. 
In the multiclass EC prediction task, GNNs demonstrated superior discriminative power, reaching an adjusted balanced accuracy of 0.92 and outperforming all explicit embedding and kernel-based approaches. Overall, results indicate that EC class prediction is intrinsically more complex than binary enzymatic discrimination and benefits from the higher expressivity of deep message-passing architectures. The findings demonstrate that graph-based representations of protein structure support competitive functional prediction at proteome scale, with classical kernel methods and modern GNNs offering complementary strengths in terms of accuracy, scalability, and flexibility.
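The Protein Contact Network abstraction described above (residues as nodes, edges for spatial proximity) can be illustrated with a short sketch. It assumes one representative 3D coordinate per residue and a distance cutoff of 8 Å. Both the cutoff and the toy coordinates are assumptions for illustration, since the abstract does not state how proximity was defined.

```python
import math
from itertools import combinations


def contact_network(coords, cutoff=8.0):
    """Build the edge set of a protein contact network.

    coords: list of (x, y, z) tuples, one representative atom per residue
            (for example the alpha carbon), in ångströms.
    cutoff: assumed distance threshold below which two residues are in contact.
    Returns a set of (i, j) residue index pairs with i < j.
    """
    edges = set()
    for (i, a), (j, b) in combinations(enumerate(coords), 2):
        if math.dist(a, b) <= cutoff:
            edges.add((i, j))
    return edges


# Hypothetical coordinates: three residues along a line, one far away.
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
toy_edges = contact_network(toy_coords)
```

The resulting edge set can be handed to any graph library or used directly to derive features such as the graph Laplacian; the fourth residue stays isolated because it lies beyond the cutoff from all others.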
bioinformatics2026-04-14v1MAJEC: unified gene, isoform, and locus-level transposable element quantification from RNA-seq
Lim, T.-Y.; Firestone, A. J.Abstract
Background: The study of transposable elements (TEs) has become increasingly central to fields such as cancer biology, immunology, and aging. Accurately quantifying disease- or laboratory-mediated perturbations in these elements is critical to support this expanding research, yet current RNA-seq pipelines struggle with the pervasive overlap between TEs and protein-coding genes. Existing tools either aggregate to the subfamily level with no locus resolution (TEtranscripts), or provide locus-level quantification without modeling gene overlap (Telescope), with the latter attributing over 40% of TE signal to the 1.1% of loci that overlap gene exons. Results: We present MAJEC (Momentum Accelerated Junction Enhanced Counting), a unified Expectation-Maximization (EM) framework that jointly quantifies genes, transcript isoforms, and individual TE loci from BAM alignments in a single pass. Splice junction evidence informs transcript-level priors, enabling MAJEC to probabilistically distinguish genic from TE-derived reads. This approach was independently validated against Salmon and RSEM on isoform quantification benchmarks. The joint feature space reduces exon-overlap contamination of locus-level TE estimates from 43% of total signal (Telescope) to 5% (MAJEC), while preserving subfamily-level accuracy (differential expression r = 0.987 vs TEtranscripts). Using paired biological vignettes, we demonstrate that MAJEC correctly resolves both the false TE reactivation artifacts endemic to TE-only models, and the false gene upregulation artifacts that occur when heuristic rules misassign genuine intragenic TE transcription. Conclusion: MAJEC simultaneously produces the isoform and locus-level resolution that TEtranscripts lacks, with greater accuracy than Telescope, and runs faster than either.
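The core idea of resolving reads shared between a gene and an overlapping TE locus with Expectation-Maximization can be sketched in a few lines. This is a generic, simplified EM for ambiguous read assignment, not MAJEC itself: it omits the momentum acceleration, splice-junction priors, and feature-length effects mentioned in the abstract, and the feature names and read counts are hypothetical.

```python
def em_quantify(read_compat, features, n_iter=100):
    """Estimate relative feature abundances from ambiguously assigned reads.

    read_compat: list of sets, each holding the features a read is compatible with.
    features: list of all feature names.
    Returns a dict mapping each feature to its estimated share of reads.
    """
    theta = {f: 1.0 / len(features) for f in features}  # uniform start
    for _ in range(n_iter):
        counts = {f: 0.0 for f in features}
        # E-step: split each read across compatible features by current theta.
        for compat in read_compat:
            z = sum(theta[f] for f in compat)
            if z == 0:
                continue  # no compatible feature has support; skip the read
            for f in compat:
                counts[f] += theta[f] / z
        # M-step: renormalise fractional counts into abundances.
        total = sum(counts.values())
        theta = {f: c / total for f, c in counts.items()}
    return theta


# Hypothetical example: 6 reads unique to a gene, 1 unique to a TE locus,
# and 3 reads falling where the two overlap.
features = ["GENE_A", "TE_locus_1"]
reads = [{"GENE_A"}] * 6 + [{"TE_locus_1"}] * 1 + [{"GENE_A", "TE_locus_1"}] * 3
abundance = em_quantify(reads, features)
```

In this toy case the shared reads are apportioned mostly to the gene, because its unique reads give it the larger abundance estimate; the estimates converge to 6/7 and 1/7.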
bioinformatics2026-04-14v1A correlational study of ABCA3 and SCN4B as exercise-related biomarkers of patients with Stanford type A aortic dissection
Qiao, S.; Chen, T.; Xie, B.; Han, Y.; Wang, B.; Li, Y.; Jia, B.; Wu, N.Abstract
Background: Accumulating evidence indicates that moderate exercise may reduce the incidence of Stanford type A aortic dissection (TAAD), but the specific mechanisms remain unclear. This study aims to identify exercise-related biomarkers in TAAD patients and to investigate their underlying mechanisms. Methods: Transcriptome data related to TAAD and exercise-related genes were obtained from publicly available databases. Candidate biomarkers for TAAD were identified through an integrative approach incorporating differential expression analysis, machine learning, and expression level assessment, leading to the construction of a diagnostic model. Subsequently, functional enrichment, immune infiltration, regulatory network analysis, and computational drug prediction were conducted to systematically investigate the pathological mechanisms and translational potential of the identified biomarkers. Results: ABCA3 and SCN4B were identified as exercise-related biomarkers in TAAD progression. A nomogram incorporating these two biomarkers exhibited strong diagnostic performance for identifying the disease. Functional enrichment analysis revealed potential involvement of these biomarkers in disease progression through pathways including circadian rhythm regulation and ribosome biosynthesis. Additionally, immune cells such as M1 macrophages and naive B cells, as well as regulatory factors including hsa-miR-1343-3p and XIST, were found to be involved in this process. Finally, zonisamide and MRS1097 were identified through computational prediction as potential therapeutic drugs. Conclusion: ABCA3 and SCN4B were identified as exercise-related biomarkers associated with TAAD and represent potentially valuable targets for both diagnosis and treatment strategies.
bioinformatics2026-04-14v1