Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
A new iterative framework for simulation-based population genetic inference with improved coverage properties of confidence intervals
Rousset, F.; Leblois, R.; Estoup, A.; Marin, J.-M.
Abstract
Simulation-based methods such as approximate Bayesian computation (ABC) are widely used to infer the evolutionary history of populations from molecular genetic data. We describe and evaluate a new iterative method of statistical inference about model parameters, which revisits the idea of inferring a likelihood surface using simulation when the likelihood function cannot be evaluated. It is based on combining the random forest machine learning method, and multivariate Gaussian mixture (MGM) models, in an effective inference workflow, here used to fit models with up to 15 variable parameters. In addition to the traditional assessment of precision in terms of bias and mean square error, we also evaluate the coverage of confidence intervals. The method is compared with approximate Bayesian computation using random forests (ABC-RF), a non-iterative method sharing some technical features with the proposed approach, across scenarios of historical demographic inference from population genetic data. It is also compared to another iterative method, sequential neural likelihood estimation (SNLE). These comparisons highlight the importance of an iterative workflow for exploring the parameter space efficiently. For equivalent simulation effort of the data-generating process, the new summary-likelihood method provides intervals whose coverage is better controlled than the marginal coverage of intervals provided by ABC with random forests, and than generally reported for ABC methods. The iterative workflow can also yield greater improvements in estimator precision when larger datasets are used.
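The coverage criterion mentioned in this abstract can be illustrated with a minimal sketch: over repeated simulated datasets, coverage is the fraction of confidence intervals that contain the true parameter value. The function name and the interval values below are hypothetical and are not taken from the paper.

```python
def empirical_coverage(intervals, true_value):
    """Fraction of (lower, upper) intervals that contain the true parameter value."""
    hits = sum(1 for lower, upper in intervals if lower <= true_value <= upper)
    return hits / len(intervals)


# Hypothetical intervals from four simulated replicates; the true value is 1.0.
intervals = [(0.8, 1.3), (0.9, 1.4), (1.05, 1.6), (0.7, 1.1)]
coverage = empirical_coverage(intervals, 1.0)  # three of the four intervals contain 1.0
```

A nominal 95% interval is well calibrated when this fraction approaches 0.95 as the number of replicates grows.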
bioinformatics 2026-04-17 v5
Interpretable models for scRNA-seq data embedding with multi-scale structure preservation
Novak, D.; de Bodt, C.; Lambert, P.; Lee, J. A.; Van Gassen, S.; Saeys, Y.
Abstract
The ability to explore high-dimensional single-cell transcriptomics data efficiently is crucial in many biological studies. Dimensionality reduction techniques have therefore emerged as a basic building block of analytical workflows. They generate low-dimensional embeddings that capture important structures in the data, and are often used in discovery, quality control, and downstream analysis. However, the trustworthiness of current methods and the rigour of popular evaluation criteria are limited. We tackle this in an empirical study of structure-preserving data embeddings, delivering two tools. First, we introduce ViScore: a robust scoring framework that improves both unsupervised and supervised quality metrics, with emphasis on scalability and fairness. Second, we introduce ViVAE: a deep learning model that achieves better multi-scale structure preservation and is equipped with new tools for interpretability. We demonstrate the potential of these contributions to advance the trustworthiness of single-cell dimensionality reduction in a quantitative comparison and focused case studies.
bioinformatics 2026-04-17 v5
NetSyn: prokaryotic genomic context exploration of protein families
Stam, M.; Langlois, J.; Chevalier, C.; Mainguy, J.; Reboul, G.; Bastard, K.; Medigue, C.; Vallenet, D.
Abstract
Background. The growing availability of large prokaryotic genomic datasets presents an opportunity to discover new metabolic pathways and enzymatic reactions useful for industrial or synthetic biological applications. Efforts to identify new enzyme functions in this vast number of sequences cannot succeed without bioinformatics tools and the development of new strategies. Standard methods for assigning a biological function to a gene are based on sequence similarity. However, complementary approaches rely on mining databases to identify conserved gene clusters (i.e. syntenies). In prokaryotic genomes, genes involved in the same pathway are frequently encoded in a single locus with an operonic organisation. This genomic context conservation is considered a reliable indicator of functional relationships, and is therefore a promising approach for improving gene function prediction. Methods. Here we present NetSyn (Network Synteny), a tool to group protein sequences based on the conservation of their genomic context rather than solely on sequence similarity. From a list of protein sequence identifiers, NetSyn searches corresponding genome entries to retrieve neighboring genes. Corresponding protein sequences are grouped into families to define homology relationships and compute a synteny conservation score between the different extracted genomic contexts. A network is then created in which the nodes represent the input proteins and the edges indicate that two proteins share a conserved synteny. Finally, the network is partitioned into clusters grouping proteins with similar genomic contexts, using a community detection algorithm. Results. As a proof of concept, we used NetSyn on two different datasets. The first one is the BKACE protein family (formerly named DUF849) which has previously been divided into isofunctional sub-families. NetSyn was able to go a step further by providing additional sub-families beyond those already described.
The second dataset corresponds to a set of non-homologous proteins belonging to three different glycoside hydrolase (GH) families. These GHs are known to work cooperatively in a Polysaccharide Utilization Locus (PUL) and are therefore grouped together in the same genomic contexts. NetSyn was able to identify a locus grouping 3 GHs, involved in the degradation of xyloglucan, in 162 prokaryotic genomes. Discussion. By highlighting conserved synteny in distantly related prokaryotic species, NetSyn enables functional links between proteins to be established beyond sequence similarity alone. We showed that NetSyn is efficient for exploring large prokaryotic protein families, enabling the definition of isofunctional groups and the identification of functional interactions between non-homologous enzymes. These features enable the prediction of new genomic structures that have not yet been experimentally characterized. Finally, NetSyn is also useful for pinpointing annotation errors that have been propagated across databases, and for suggesting annotations on proteins lacking functional prediction. NetSyn is freely available at https://github.com/labgem/netsyn.
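The workflow described in this abstract (score shared genomic context, link proteins into a network, partition the network) can be sketched in a few lines. This is a toy illustration and not NetSyn's implementation: the score is a simple count of shared protein families, the `min_score` threshold is an assumption, and connected components stand in for the community detection step, whose algorithm the abstract does not name. All identifiers below are hypothetical.

```python
from itertools import combinations


def synteny_score(context_a, context_b):
    """Toy score: number of protein families shared by two genomic contexts."""
    return len(set(context_a) & set(context_b))


def build_network(contexts, min_score=2):
    """Link proteins whose contexts share at least min_score families (assumed threshold)."""
    graph = {protein: set() for protein in contexts}
    for a, b in combinations(contexts, 2):
        if synteny_score(contexts[a], contexts[b]) >= min_score:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Stand-in for community detection: group nodes that are reachable from one another."""
    seen, clusters = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters


# Hypothetical input: each target protein maps to the families of its neighboring genes.
contexts = {
    "protA": ["fam1", "fam2", "fam3"],
    "protB": ["fam2", "fam3", "fam4"],
    "protC": ["fam7", "fam8"],
    "protD": ["fam7", "fam8", "fam9"],
}
clusters = connected_components(build_network(contexts))
```

Here `protA` and `protB` share two neighboring families and end up in one cluster, while `protC` and `protD` form another, even though no sequence comparison between the target proteins was made.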
bioinformatics 2026-04-17 v4
METRIN-KG: A knowledge graph integrating plant metabolites, traits, and biotic interactions
Tandon, D.; Mendes de Farias, T.; Allard, P.-M.; Defossez, E.
Abstract
Background: In recent years, biodiversity data management has emerged as a critical pillar in global conservation efforts. Today, the ability to efficiently collect, structure, and analyze biodiversity data is central to breakthroughs in conservation, drug development, disease monitoring, ecological forecasting, and agri-tech innovation. However, due to the vastness and heterogeneity of biodiversity data, it is often confined to databases for specific research areas in isolated formats and disconnected from other relevant resources. Crucial components of such data in kingdom Plantae comprise metabolomes - the vast array of compounds produced by plants; traits - measurable characteristics of plants that influence their growth, survival, and reproduction, and that affect ecosystem processes; and biotic interactions - relationships of plants with other living organisms, affecting ecosystem functions. Results: In this work, we present METRIN-KG (MEtabolomes, TRaits, and INteractions-Knowledge Graph), a powerful data resource simplifying the integration of diverse and heterogeneous data resources such as plant metabolomes, traits, and biotic interactions. Conclusions: The proposed knowledge graph provides an interface to interactively search for data relating plant metabolomes, traits, and interactions. This, in turn, will facilitate the development of research questions in the life sciences. In this context, we provide representative case studies on how to frame queries that can be used to search for relevant data in the knowledge graph.
bioinformatics 2026-04-17 v3
Mechanistic insights into CFTR function from molecular dynamics analysis of electrostatic interactions
ELBAHNSI, A.; Mornon, J.-P.; Callebaut, I.
Abstract
The Cystic Fibrosis Transmembrane Conductance Regulator (CFTR) is an ATP-gated anion channel whose function is tightly linked to its conformational dynamics and is influenced by the composition of its membrane lipid environment. Despite high-resolution three-dimensional (3D) structures, the molecular determinants that stabilize specific CFTR conformations and enable ion conduction remain incompletely understood. Here, we performed all-atom molecular dynamics (MD) simulations of the human CFTR 3D structure in both the apo and VX-770 (ivacaftor)-bound states, embedded in a heterogeneous lipid bilayer, in order to systematically analyze electrostatic interactions, linking amino acids to each other as well as to anions and membrane lipids. We identified 557 electrostatic interactions between charged and polar amino acid side chains, which we systematically mapped across the CFTR 3D structure. They are organized into specific regions, with a subset showing high frequency and conservation across simulations, suggesting a structural role in stabilizing CFTR architecture. In contrast, more transient electrostatic interactions were detected in dynamic regions potentially linked to conformational transitions or other functional roles. Irregularities in transmembrane (TM) helices often incorporate amino acids involved in electrostatic interactions. Many basic and polar residues involved in electrostatic interactions also engaged in anion coordination, underscoring their contribution to ion conduction. In addition, some showed selective interactions with cholesterol and phosphatidylserine, revealing spatially organized lipid binding, particularly at the level of the lasso and in the vicinity of the VX-770 binding site, which may mark regions important for allosteric communication. VX-770 binding preserved the global architecture of the electrostatic interaction networks but induced subtle shifts, acting on specific salt bridges. 
Regardless of whether VX-770 is present or not, a secondary portal located between TM10/TM12 emerged from these MD simulations, in addition to the main TM4/TM6 portal, whose morphology and diameter are controlled by a fluctuating salt bridge. Two routes also appeared for the exit of anions towards the extracellular milieu. Altogether, our integrative analysis highlights how dynamic electrostatic networks, together with ion and lipid interactions, support CFTR's structural plasticity and functional modulation, offering molecular insights into potentiation mechanisms and into the specific evolution of CFTR in the ABC transporter superfamily.
bioinformatics 2026-04-17 v3
Scaling SMILES-Based Chemical Language Models for Therapeutic Peptide Engineering
Feller, A. L.; Secor, M.; Swanson, S.; Wilke, C. O.; Deibler, K.
Abstract
Therapeutic peptides occupy a unique middle ground in drug discovery, offering the high specificity of protein interactions with the chemical diversity of small molecules, yet they currently fall in a computational blind spot. Existing foundation models cannot handle them effectively: protein models are restricted to natural amino acids, while chemical models struggle to process large, polymer-like sequences. This disconnect has forced the field to rely on static chemical descriptors that fail to capture subtle chemical details or on complex multi-embedding pipelines that are custom tailored to specific datasets. To bridge this gap, we present PeptideCLM-2, a suite of chemical language models trained on over 100 million molecules to natively represent complex peptide chemistry. This modeling approach both simplifies the application of machine learning to therapeutic peptides and results in improved performance over alternative approaches for predicting development endpoints including membrane diffusion, tumor homing, and half life.
bioinformatics 2026-04-17 v3
Testing and Estimating Causal Treatment Effect Heterogeneity in Observational Studies via Revised Deep Semiparametric Regression: A Lung Transplant Case Study
Yuan, S.; Zou, F.; Zou, B.
Abstract
Lung transplantation programs must decide when bilateral lung transplantation (BLT) offers meaningful functional benefit over single lung transplantation (SLT). Because donor and recipient characteristics jointly shape outcomes, the BLT-SLT contrast may differ across patients. However, observational registries pose a key statistical challenge: apparent subgroup differences can be artifacts of complex confounding, while true heterogeneity can be missed or poorly quantified. Using a large national lung transplant registry, we study whether the BLT effect varies across recipients and identify clinically relevant profiles of benefit using post-transplant lung function measured by forced expiratory volume in 1 second (FEV1). We develop deepHTL, an analysis framework that first tests whether treatment effect heterogeneity is supported by the data and then estimates how the BLT-SLT effect changes with patient features when heterogeneity is present. In extensive simulations designed to resemble registry-like confounding, deepHTL controls false positives for detecting heterogeneity and yields more accurate individualized effect estimates than common machine learning methods. In the lung transplant cohort, we find strong evidence of heterogeneity in the BLT-SLT effect on FEV1: younger, lower risk recipients with better baseline status show the largest FEV1 gains from BLT, whereas older, higher risk candidates exhibit diminished marginal benefit. These findings provide statistically grounded guidance for patient selection and allocation of scarce donor organs.
bioinformatics 2026-04-17 v2
Impact of the N-glycosylation on full-length IgG2 and IgG4 antibodies: a comparative study using molecular dynamics simulations.
LEON FOUN LIN, R.; Bellaiche, A.; Diharce, J.; Etchebest, C.
Abstract
Like other proteins, monoclonal antibodies (important biodrugs) are subject to post-translational modifications, especially N-glycosylation. However, the effect of N-glycosylation remains poorly studied and atomistic details about its influence are rarely available. Moreover, the few existing studies focus on the prevalent immunoglobulin G1. To better understand the impact of glycosylation, we have carried out a comparative exploration of the effect of N-glycosylation on two different classes of antibodies, namely Mab231, an IgG2, and pembrolizumab, an IgG4. The two antibodies differ in their sequences, their length, and their 3D structure, but also in the location and composition of the glycans. In the present work, detailed and important information was gained through molecular dynamics simulations in which both monoclonal antibodies were studied with and without their glycans. The results of 1.5 microseconds of sampling for each system show that glycosylation does not drastically alter the overall conformational landscape of either antibody, whatever the metrics considered. However, it measurably modulates local flexibility, inter-domain correlated motions, and the relative orientation of the Fab arms with respect to the Fc domain, with statistically significant shifts in key geometric descriptors. Importantly, contact analysis reveals that glycan interactions extend beyond the Fc region to reach Fab residues. The allosteric network calculations demonstrate that the influence of Fc-bound glycans propagates as far as the Fab framework regions in both mAbs, which could impact antigen binding. The nature and magnitude of these effects are subclass-dependent, reflecting differences in glycan composition, hinge architecture, and three-dimensional organization. Our findings challenge the prevailing view that Fc glycosylation uniformly promotes CH2 domain opening.
More importantly, they underscore the necessity of considering full-length structures and IgG subclass diversity in glyco-engineering strategies.
bioinformatics 2026-04-17 v2
Deep Learning Enables Automated Segmentation and Quantification of Ultrastructure from Transmission Electron Microscopy Images
Zou, A.; Tan, W.; Ji, J.; Rojas-Miguez, F.; Dodd, L.; Oei, E.; Vargas, S. R.; Yang, H.; Berasi, S. P.; Chen, H.; Henderson, J. M.; Fan, X.; Lu, W.; Zhang, C.
Abstract
Transmission electron microscopy (TEM) has become an essential technique for observing subcellular ultrastructure, and is widely used in both clinical diagnosis and biomedical research. However, analysis of TEM data remains extremely labor-intensive and often inconsistent across operators due to the lack of dedicated computational methods. Here, we present TEAMKidney, a deep learning framework for accurate and scalable measurement of ultrastructures in TEM images across species, magnifications, and instrument platforms. We collected 12,991 TEM images from patients with multiple kidney diseases and from different animal models. By combining a self-training-based semantic segmentation stage with a TEM-tailored panoptic segmentation model, we address two major challenges in TEM data analysis: the lack of accurately labeled training data and the difficulty of achieving high segmentation accuracy for complex ultrastructure. Application of TEAMKidney to both human and animal images successfully reveals disease-associated changes in two critical glomerular ultrastructures: the glomerular basement membrane and podocyte foot processes. In addition to significantly outperforming existing tools, TEAMKidney shows close agreement with pathological expert measurements used in clinical assessment protocols. By reducing dependence on manual tracing while preserving expert-level accuracy, TEAMKidney demonstrates that deep learning can substantially reduce the burden of image analysis in both clinical pathology and biomedical research settings.
bioinformatics 2026-04-17 v2
Methylation-aware long-read phasing significantly improves genome-wide haplotype reconstruction
Pfennig, A.; Akey, J. M.
Abstract
Haplotypes are linear sequences of co-inherited alleles along individual chromosomes and are central to genetic mapping, clinical variant interpretation, and inference of population history. However, accurate genome-wide haplotype reconstruction remains challenging. Long-read sequencing has the potential to dramatically improve haplotype inference, but existing methods do not directly leverage all the information embedded in these data. Here, we present LongHap, a read-based phasing method that integrates sequence and 5-methylcytosine (5mC) information in a unified probabilistic framework. By leveraging differentially methylated sites, LongHap resolves phase relationships between variants that are inaccessible to sequence-based approaches alone. Across multiple datasets and sequencing platforms, LongHap increases phase block lengths by up to 30% while substantially reducing switch error rates. LongHap rigorously embeds complex structural variants into the broader haplotype context using loopy belief propagation, enabling improved phasing of INDELs and other variant classes that are inherently difficult to resolve. Methylation-aware phasing also improves the accuracy and contiguity of haplotypes spanning rare variants and structurally complex, medically relevant genes across diverse ancestries, facilitating the interpretation of compound heterozygosity and haplotype-specific regulatory architectures. These results establish methylation-aware phasing as a general framework for improving genome-wide haplotype reconstruction, with broad applications across genetics and genomics.
bioinformatics 2026-04-17 v2
HARVEST: Unlocking the Dark Bioactivity Data of Pharmaceutical Patents via Agentic AI
Shepard, V.; Musin, A.; Chebykina, K.; Zeninskaya, N. A.; Mistryukova, L.; Avchaciov, K.; Fedichev, P. O.
Abstract
Pharmaceutical patents contain vast Structure-Activity Relationship tables documenting protein-ligand binding data. While technically public, this information remains computationally inaccessible and effectively dark, trapped in bulky documents that no existing database has systematically captured. We present HARVEST, a multi-agent large language model pipeline that autonomously extracts structured bioactivity records from USPTO patent archives at $0.11 per document. Applied to 164,877 patents, HARVEST produced 3.15 million activity records, recovering 326,342 unique scaffolds and 967 protein targets absent from BindingDB. This pipeline completed in under a week a task that would otherwise require over 55 years of continuous expert labor. Automated extraction achieves 80% agreement with a human-curated corpus of US patents from BindingDB, a conservative lower bound given identified errors within the reference data. We further introduce H-Bench, a structurally guaranteed held-out benchmark built from this recovered data. Evaluation of the leading open-source model Boltz-2 on H-Bench reveals a two-dimensional generalization gap: performance degrades both on novel scaffolds and on uncharacterized protein targets, exposing fundamental limitations of models trained on existing public repositories.
bioinformatics 2026-04-17 v2
GraphPop: graph-native computation decouples population genomics complexity from sample count
Estaji, E.; Zhao, S.-W.; Chen, Z.-Y.; Nie, S.; Mao, J.-F.
Abstract
Matrix-based population genomics tools scale as O(V x N), re-reading the full genotype matrix for every analysis. Here we present GraphPop, a graph database engine that reduces summary statistic complexity to O(V x K), where K is the population count and is independent of sample count, by computing on pre-aggregated allele-count arrays stored as graph node properties. The same architecture enables annotation-conditioned queries via edge traversal, persistent analytical records, and multi-statistic composition. Applied to rice 3K (29.6M SNPs, 3,024 accessions) and human 1000 Genomes (3,202 samples, 22 autosomes), GraphPop reveals that all 12 rice subpopulations show πN/πS > 1.0, uncovers opposite consequence-level Fst regimes between species, and identifies KCNE1 as a candidate pre-Out-of-Africa sweep via convergence of five stored statistics. GraphPop achieves 146-327x query-time speedup for pre-aggregated statistics and 63-179x for bit-packed haplotype computation, at constant ~160 MB memory. This complexity reduction makes systematic, annotation-integrated population genomics practical for the crop, livestock, conservation, and ecological datasets that constitute the majority of the field.
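The complexity claim can be made concrete with a small sketch: once each variant stores one (alternate count, total alleles) pair per population, a statistic such as nucleotide diversity needs only those K pairs per variant and never revisits per-sample genotypes. The data layout and function names below are assumptions for illustration, not GraphPop's schema or API; the per-site estimator is the standard unbiased heterozygosity for a biallelic site.

```python
def site_diversity(alt_count, total_alleles):
    """Unbiased per-site heterozygosity for a biallelic site: 2p(1-p) * n / (n - 1)."""
    if total_alleles < 2:
        return 0.0
    p = alt_count / total_alleles
    return 2.0 * p * (1.0 - p) * total_alleles / (total_alleles - 1)


def population_pi(variants, population):
    """Sum per-site diversity for one population using only its stored allele counts."""
    return sum(site_diversity(*variant[population]) for variant in variants)


# Hypothetical pre-aggregated records: population -> (alternate allele count, total alleles).
variants = [
    {"popA": (2, 4), "popB": (0, 6)},
    {"popA": (1, 4), "popB": (3, 6)},
]
pi_a = population_pi(variants, "popA")
pi_b = population_pi(variants, "popB")
```

Adding more samples to a population changes the stored counts but not the number of values read per variant, which is the sense in which cost depends on K and not on N.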
bioinformatics 2026-04-17 v2
GraphMana: graph-native data management for population genomics projects
Estaji, E.; Zhao, S.-W.; Chen, Z.-Y.; Nie, S.; Mao, J.-F.
Abstract
Population genomics projects rely on fragmented file-based workflows that lose provenance and require full reprocessing when samples are added. GraphMana stores variant data in a graph database as packed genotype arrays with pre-computed population statistics, enabling incremental sample addition, provenance tracking, cohort management, and export to 17 formats. On the human 1000 Genomes Project (3,202 samples, 70.7 million variants), GraphMana completed a 46-operation project lifecycle in 98 minutes from a single persistent database, replacing the ad hoc scripting otherwise required across multiple disconnected tools.
bioinformatics 2026-04-17 v2
FairTCR: Equity-Aware TCR-pMHC Binding Prediction Across HLA Alleles and Cohort Strata
Nowak, P.; Kowalski, J.; Lewandowski, T.
Abstract
Public TCR-pMHC binding databases are heavily skewed toward a handful of well-studied HLA alleles (most prominently HLA-A*02:01, which covers ~45% of curated records) and toward patients from European-ancestry cohorts. Standard empirical risk minimization (ERM) trained on such data achieves strong pooled accuracy but routinely underperforms on rare alleles and underrepresented cohorts, creating systematic disparities that are invisible in single-metric benchmarks. We introduce FairTCR, a group distributionally robust optimization (GDRO) framework that minimizes worst-group loss across HLA supertypes and cohort strata via online exponentiated gradient updates. FairTCR reduces the average-worst-group AUPRC disparity from 0.190 (ERM) to 0.098 on a curated VDJdb-IEDB benchmark, achieving a 48.4% disparity reduction while maintaining competitive average AUPRC (0.432 vs. 0.431 for ERM). Per-HLA analysis shows that rare allele groups (B*08:01, B*44:02) gain up to 0.062 AUPRC points, directly improving the equity of computational pre-screening for underrepresented patient populations.
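The exponentiated gradient step at the core of group DRO can be sketched as follows: each group's weight is multiplied by the exponential of its current loss and the weights are renormalized, so training pressure shifts toward the worst-performing groups. This is a generic illustration of the technique, not FairTCR's code; the step size, the two groups, and the fixed loss values are assumptions chosen to keep the example small.

```python
import math


def update_group_weights(weights, group_losses, step_size=0.1):
    """One exponentiated-gradient step: upweight groups with higher loss, then renormalize."""
    scaled = [w * math.exp(step_size * loss) for w, loss in zip(weights, group_losses)]
    total = sum(scaled)
    return [s / total for s in scaled]


def robust_loss(weights, group_losses):
    """Weighted objective the model would minimize at this step."""
    return sum(w * loss for w, loss in zip(weights, group_losses))


# Hypothetical setting: group 0 is a common allele (low loss), group 1 a rare allele (high loss).
weights = [0.5, 0.5]
for _ in range(50):
    # Losses are held fixed here for illustration; in training they change every step.
    weights = update_group_weights(weights, [0.2, 0.9])
objective = robust_loss(weights, [0.2, 0.9])
```

With the losses held fixed, nearly all weight moves to the high-loss group, so the objective approaches the worst-group loss. In real training the model's parameter updates would lower that loss in turn.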
bioinformatics 2026-04-17 v1
Integrating glycosylation in de novo protein design with ReGlyco Binder Design Filter
Singh, O.; Fadda, E.
Abstract
Artificial Intelligence (AI)-based methods for 3D protein structure prediction are revolutionising structural biology, providing novel templates for experimental data refinement and an on-demand 3D perspective on any molecular architecture and protein-protein interaction (PPI). Regardless of the inherent limitations of the various approaches available to date, the continuous improvement of the algorithms and the broad availability of open access (OA) web servers, software packages and databases are bound to accelerate the discovery and optimisation of novel biopharmaceuticals. Within this context, the development of computational pipelines for the de novo design of target-specific protein binders is especially exciting. As it stands, these processes are still rather inefficient and expensive, quickly outputting thousands of designs that translate into meagre yields. Here we show how the explicit integration of glycosylation as a filter in the 3D de novo design pipeline can significantly improve efficiency and reduce laboratory costs with minimal additional computational resources. As a proof-of-concept, we used the GlycoShape database and ReGlyco tools to filter the results of a recent open competition launched by Adaptyv Bio for the design of binders as inhibitors against the heavily glycosylated Nipah virus glycoprotein (NiV-G). Screening of the 1,201 selected designs in block with ReGlyco, refined with the new ReGlyco Rotamer tool, flagged 11% of non-binders prior to experiment in approximately 3 hours on a dual-core CPU. We complement this analysis with a demo Colab notebook to illustrate our workflow. In this demo users can design mini-binders against human erythropoietin (hEPO) by integrating GlycoShape resources with the RFdiffusion3 (RFD3) pipeline from the Institute for Protein Design (IPD).
bioinformatics 2026-04-17 v1
Agent-Guided De Novo Design of Nanobody Binders Against a Novel Cancer Target
Zhao, Y.; Yilmaz, M.; Lee, E.; Teh, C.; Guo, L.; Sonmez, K.; Giancardo, L.; Trang, G.; Xu, F.; Espinosa-Cotton, M.; Cheung, N.-K.; Kim, J.; Cheng, X.
Abstract
Therapeutic antibody discovery remains slow and resource-intensive, with traditional methods providing limited control over epitope selection. We present a workflow for de novo nanobody design applied to a novel Desmoplastic Small Round Cell Tumor target encompassing four stages: (1) epitope identification guided by our hotspot recommendation agent using physical chemistry-based structure and sequence analysis tools with two curated databases (IEDB, PFAM), (2) de novo nanobody generation using three independent methods (RFantibody, IgGM, mBER) across multiple predicted antigen structures and nanobody frameworks, (3) multi-metric scoring including structural metrics from folding models, and in silico binding affinity from our sequence-based predictor, (4) high-throughput yeast surface display (YSD) screening followed by surface plasmon resonance (SPR) characterization of the specific binders. We generated 288,000 nanobody designs spanning eight target epitope regions and three variable domains of heavy chain-only antibody (VHH) frameworks. Multi-objective Pareto filtering with our candidate selection agent yielded 100,000 candidates for YSD screening with fluorescence-activated cell sorting (FACS). Of 116 enriched candidates advanced to SPR characterization, 46/116 (39.7%) produced reliable kinetic fits with Rmax ≥ 30 RU, yielding KD values from 0.66 nM to 305 nM (median 31.7 nM). These results show that an agent-guided computational workflow can design nanomolar to sub-nanomolar nanobody binders against a novel target without experimental structure or prior antibody information.
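Multi-objective Pareto filtering, as mentioned in this abstract, keeps every design that no other design beats on all metrics at once. The sketch below shows the general technique only; the two metrics, their "higher is better" orientation, and the design names are assumptions, and the authors' selection agent may apply further criteria.

```python
def dominates(a, b):
    """True if a is at least as good as b on every metric and strictly better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates):
    """Names of candidates that no other candidate dominates (higher scores are better)."""
    return [
        name
        for name, scores in candidates.items()
        if not any(
            dominates(other_scores, scores)
            for other_name, other_scores in candidates.items()
            if other_name != name
        )
    ]


# Hypothetical designs scored on (structural confidence, predicted affinity score).
designs = {
    "vhh_1": (0.90, 0.40),
    "vhh_2": (0.70, 0.80),
    "vhh_3": (0.60, 0.30),
    "vhh_4": (0.80, 0.60),
}
front = pareto_front(designs)
```

Only `vhh_3` is dropped, because `vhh_1` is better on both metrics. The other three each win on at least one metric against every rival. This pairwise check is quadratic in the number of candidates, so a pool of hundreds of thousands of designs would need a sort-based or batched variant.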
bioinformatics 2026-04-17 v1
Uncertainty-aware benchmarking reveals ambiguous transcripts in mRNA-lncRNA classification
Garcia-Ruano, D.; Georges, M.; Mohanty, S. K.; Baaziz, R.; Makova, K. D.; Nikolski, M.; Chalopin, D.
Abstract
Background. Long non-coding RNAs (lncRNAs) have gained significant attention in recent years, yet distinguishing them from protein-coding transcripts remains challenging. Indeed, many lncRNAs share mRNA-like processing and existing sequence-derived signals do not fully capture the coding/non-coding boundary. Recent GENCODE annotation efforts revealed tens of thousands of novel lncRNA sequences as well as the reclassification of some lncRNAs into the protein-coding class, highlighting the need to better characterize transcript features associated with classification uncertainty and errors. Results. We performed uncertainty-aware benchmarking by retraining and evaluating eight transcript classifiers under a controlled protocol on a label-stable GENCODE v46-v47 subset. Beyond conventional model evaluation metrics, we quantified inter-tool agreement and entropy-based uncertainty to stratify transcripts into consensus, discordant, and consensus-error groups. To expand standard sequence and ORF-derived signals, we incorporated repeat-derived features from mature transcripts and non-B DNA motif features across gene bodies. Although aggregate performance was high, ~45% of transcripts showed inter-tool discordance, particularly among lncRNAs. Feature analyses linked low-uncertainty predictions to strong coding-like signals, whereas high-uncertainty profiles exhibited mixed signatures. Alongside classical predictors in global importance analyses, repeat-derived features appear as main contributors. Conclusions. By combining controlled benchmarking with transcript-level agreement and uncertainty stratification, together with extended feature profiling, we identified patterns associated with classifier disagreement and misclassification. This novel framework provides practical guidance for interpreting predictions, motivating the development of more robust coding/non-coding classifiers, while also shedding light on the sequence properties that distinguish lncRNA sequences.
bioinformatics 2026-04-17 v1
Active Learning for Budget-Constrained TCR-pMHC Wet-Lab Validation
Mazur, K.; Piotrowska, M.; Kowalski, J.
Abstract
Wet-lab validation of TCR-pMHC binding hypotheses is the rate-limiting step in T-cell therapy discovery: a single binding assay round can cost thousands of dollars and weeks of turnaround time, yet computational models generate thousands of candidate pairs per run. We frame this as a pool-based active learning problem: given a fixed annotation budget B, which unlabeled pairs should be sent to the assay to maximally improve a predictive model that will guide the next screening round? We introduce UDAL (Uncertainty-Diversity Active Learning), a batch acquisition strategy that combines BALD-based uncertainty estimation via MC Dropout with greedy core-set diversity selection in the encoder feature space. Evaluated on a curated VDJdb-IEDB benchmark under epitope-held-out and distance-aware protocols, UDAL achieves AUPRC 0.487 with only 5,000 queried labels, matching the performance of a model trained on 3× more randomly sampled labels. At a budget of 2,000 labels, UDAL improves AUPRC by 16.7% over random acquisition, translating directly to fewer wasted assay slots. These results demonstrate that principled active query strategies can substantially reduce the wet-lab cost of building reliable TCR specificity models.
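The two ingredients named in this abstract can be sketched generically. BALD scores a candidate by how much stochastic forward passes (MC Dropout) disagree with each other, and greedy core-set selection then spreads the batch out in feature space. How UDAL actually combines them is not stated in the abstract; the shortlist-then-diversify order, the function names, and all numbers below are assumptions.

```python
import math


def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli distribution with success probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def bald_score(mc_probs):
    """BALD: entropy of the mean prediction minus the mean entropy of the individual passes."""
    mean_p = sum(mc_probs) / len(mc_probs)
    mean_entropy = sum(binary_entropy(p) for p in mc_probs) / len(mc_probs)
    return binary_entropy(mean_p) - mean_entropy


def greedy_core_set(features, candidates, batch_size):
    """Greedy k-center: repeatedly add the candidate farthest from everything already selected."""
    selected = [candidates[0]]
    while len(selected) < min(batch_size, len(candidates)):
        remaining = [c for c in candidates if c not in selected]
        farthest = max(
            remaining,
            key=lambda c: min(math.dist(features[c], features[s]) for s in selected),
        )
        selected.append(farthest)
    return selected


def select_batch(mc_probs, features, shortlist_size, batch_size):
    """Assumed combination: shortlist by BALD, then diversify within the shortlist."""
    ranked = sorted(mc_probs, key=lambda k: bald_score(mc_probs[k]), reverse=True)
    return greedy_core_set(features, ranked[:shortlist_size], batch_size)


# Hypothetical pool: binding probabilities from four dropout passes, plus 2-D encoder features.
mc_probs = {
    "pair_a": [0.1, 0.9, 0.2, 0.8],
    "pair_b": [0.15, 0.85, 0.1, 0.9],
    "pair_c": [0.5, 0.5, 0.5, 0.5],
    "pair_d": [0.3, 0.7, 0.3, 0.7],
}
features = {
    "pair_a": (0.0, 0.0),
    "pair_b": (0.1, 0.0),
    "pair_c": (9.0, 9.0),
    "pair_d": (5.0, 5.0),
}
batch = select_batch(mc_probs, features, shortlist_size=3, batch_size=2)
```

`pair_c` is maximally uncertain on average but its passes agree, so its BALD score is zero and it never reaches the shortlist. `pair_a` scores highly but sits next to `pair_b` in feature space, so the diversity step picks the more distant `pair_d` instead.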
bioinformatics2026-04-17v1PathwaySeeker: Evidence-Grounded AI Reasoning over Organism-Specific Metabolic Networks
Oliveira Monteiro, L. M.; Chowdhury, N. B.; Oostrom, M.; McDermott, J. E.; Stratton, K. G.; Choudhury, S.; Bardhan, J. P.Abstract
Metabolic activity is not an intrinsic property of an organism, but an emergent state shaped by environmental and experimental context. Despite recent advances in large language models (LLMs) and multi-omics profiling, current computational frameworks struggle to represent and reason over metabolism in a condition-specific manner. General-purpose AI systems operate on static, public biochemical knowledge, while multi-omics datasets capture dynamic measurements without a structured framework for mechanistic interpretation. As a result, metabolic network analysis remains disconnected from the experimental states that define biological function. Here, we introduce PathwaySeeker, an evidence-grounded AI system for organism-specific metabolic network reasoning. PathwaySeeker reconstructs sample-specific metabolic graphs from integrated proteomic and metabolomic data, fine-tunes an LLM on the resulting graph structure, and verifies each reasoning step against the experimental graph through iterative hypothesis search, an approach we term Oracle-in-the-Loop inference. Every output claim carries explicit evidence provenance, distinguishing experimentally confirmed relationships from biochemically plausible hypotheses requiring validation. We demonstrate the system using multi-omics data from the non-model white-rot fungus Trametes versicolor, where PathwaySeeker recovers branched phenylpropanoid pathways and transparently stratifies confirmed reactions from testable extensions. Post-hoc thermodynamic analysis and condition-specific metabolite dynamics support the biological feasibility of the reconstructed routes. By embedding experimental evidence provenance directly into language model-guided metabolic network reasoning, PathwaySeeker enables systematic differentiation between experimentally grounded knowledge and structured hypothesis, bridging frontier AI capabilities with organism-specific experimental evidence.
bioinformatics2026-04-17v1Virtual multiplex staining of the pancreatic islets across type 1 diabetes progression using a Schroedinger bridge
Shen, Y.; Cho, W. J.; Joshi, S.; Wen, B.; Naganathanhalli, S.; Beery, M.; Grubel, C. R.; Sivasubramanian, A.; Forjaz, A.; Grahn, M. P.; Dequiedt, L.; Huang, Y.; Han, K. S.; Wu, F.; Pedro, B. A.; Wood, L. D.; Chen, T.; Hruban, R. H.; Kusmartseva, I.; Atkinson, M. A.; Wirtz, D.; Kiemen, A. L.Abstract
Classical hematoxylin and eosin (H&E) staining enables review of tissue morphology but lacks information regarding the molecular state of cells. Immunohistochemical (IHC) techniques label specific proteins in tissue, allowing differentiation of relevant structures that may go undetected in H&E. However, the IHC process is complex, expensive, and time-consuming, especially for multiplex IHC (mIHC), limiting its use in large cohorts. Stain conversion of H&E to IHC using generative artificial intelligence models such as generative adversarial networks (GANs) represents one solution to this problem. However, GANs are unstable during out-of-distribution sampling and are prone to hallucinations or mode collapse, limiting their accuracy in challenging image conversion tasks. To address this, the field has recently turned to diffusion models. Here, we introduce Schroedinger-bridge for Multiplex ImmunoLabel Estimation (SMILE). Unlike conventional diffusion models that map from source to target through an intermediate Gaussian noise, Schroedinger-bridge diffusion models skip this step and have been shown to better preserve structures during image translation. To test the performance of SMILE, we generated a large cohort of high-fidelity H&E-mIHC image pairs from pancreatic organ donors, targeting insulin, glucagon, and CD3. Our dataset is well-sampled across type-1 diabetes status, pancreas anatomical location, age, and sex. Using this cohort, we demonstrate the superiority of SMILE compared to GANs via a comprehensive evaluation framework incorporating texture, distribution, and antibody-specific metrics, as well as blinded pathologist reviews. We further confirmed the ability of SMILE to generate accurate mIHC images from H&Es generated at an external site, to perform whole slide image conversion, and to generate realistic three-dimensional maps of the pancreatic islets in non-diabetic, auto-antibody positive, and type-1 diabetic donor tissue. 
Finally, we performed stain conversion of paired H&E to HER2 and Ki67 images in breast cancer, confirming the superiority of SMILE in diverse stain conversion applications. Collectively, this framework provides a scalable pipeline for high-throughput proteomic inference from archival H&Es, providing transformative potential for pancreatic research and digital pathology.
bioinformatics2026-04-17v1Recursive Repeat Extender (RRE): A recursive approach to automatically extend repeat element models
Falcon, F.; Tanaka, E. M.; Rodriguez-Terrones, D.Abstract
Repetitive elements, including transposable elements (TEs), are integral structural components of eukaryotic genomes; consequently, their identification and classification are crucial to their study. Several approaches have been developed to perform de novo genome-wide repeat identification through pairwise sequence comparisons; however, they often generate truncated repeat models due to their sampling strategies and the substantial fragmentation of many of the older repeat copies in the genome. To improve repeat models generated de novo, several algorithms have been developed that increase model length via the BEEA (BLAST-Extend-Extract-Align) approach, in which genomic instances of each repeat are identified with BLAST, their coordinates are extended, and a refined model is generated by aligning the extended sequences. Nevertheless, these extension algorithms exhibit two key limitations that hinder the reconstruction of highly degenerate and fragmented repeats: the use of BLAST as a search algorithm - which limits their sensitivity in detecting highly diverged sequences - and the use of a single search step, which precludes the reconstruction of extensively fragmented repeat models. In this work, we present a novel approach to extend repeat models, called RRE (Recursive Repeat Extender), which uses profile hidden Markov models (HMMs) to search for repeat elements with high sensitivity and employs a recursive extension strategy that iteratively searches and extends the repeat model, using the extended model from each round as input for the next and continuing until no additional sequence can be incorporated. We apply RRE to repeat libraries generated de novo from five model organisms, and our results show that RRE-generated repeat libraries contain fewer but longer repeat models and can identify a larger proportion of the genomes as repetitive than RepeatModeler2-generated repeat libraries. 
Notably, RRE can reconstruct highly degenerate repeats such as CR1_Mam, producing a model that achieves similar coverage to the reference Dfam model while extending it by an additional 131 bp that were not captured in the reference model. Overall, RRE enables the automatic improvement of de novo repeat libraries and the reconstruction of highly degenerate and fragmented repeats.
bioinformatics2026-04-17v1Hybrid Gated Fusion: A Multimodal Deep Learning Framework for Protein Function Annotation
Zhou, Z.; Buchan, D. W.Abstract
Protein function annotation requires integrating diverse biological signals, yet existing multimodal methods often struggle with missing inputs and redundant information. We present Hybrid Gated Fusion, a multimodal architecture that combines intrinsic protein features, including sequence and structure, with extrinsic functional context from text and interaction networks. Rather than weighting all modalities equally, the model uses bilinear gating to assess both the informativeness of each modality and its agreement with the others, while auxiliary supervision reduces modality dominance and preserves useful signal in weaker modalities. On the CAFA3 benchmark, a single Hybrid Gated Fusion model achieves state-of-the-art performance in Biological Process (F_max = 0.601) and Cellular Component (F_max = 0.706), while remaining competitive in Molecular Function (F_max = 0.702). Analysis of the learned gates shows that interaction networks and text often provide complementary functional signals, whereas structural features are down-weighted when redundant but remain valuable under sparse-input settings. These results establish Hybrid Gated Fusion as a robust and scalable framework for genome-scale protein function annotation.
bioinformatics2026-04-17v1Pathway redistribution reveals a shared signaling backbone and context-dependent regulatory modules in RNA-binding protein networks
Osato, N.; Sato, K.Abstract
Understanding how regulatory architectures are reorganized across cellular contexts remains a central challenge in functional genomics. Here, we integrate co-expression-derived candidate regulatory interactions with interpretable deep learning to generate gene-level contribution scores and introduce delta NES (normalized enrichment score difference) to quantify pathway redistribution between cellular states. Because gene expression reflects the combined effects of multiple regulatory inputs, contribution scores capture relative regulatory influence rather than transcriptional abundance itself. Applying this framework to neural progenitor cells and K562 leukemia cells, we identify systematic redistribution of functional modules across multiple RNA-binding proteins, including PKM, HNRNPK, and NELFE. Neural System- and Immune System-associated modules are differentially positioned along the delta NES spectrum, indicating context-dependent redistribution of regulatory influence rather than isolated pathway activation events. At the pathway level, Signal Transduction consistently forms a shared signaling backbone across proteins and cellular contexts, while modules related to neuronal functions, immune responses, and developmental processes exhibit context-dependent redistribution. Subpathway analysis further reveals convergence on receptor-mediated signaling processes, including FGFR/RTK-, IRS-, and MAPK-related pathways. These redistribution patterns are preserved under alternative DeepLIFT background settings despite polarity changes in contribution-expression correlations, indicating that pathway-level contrasts arise from stable rank-structure differences rather than background-dependent score artifacts. Together, our findings demonstrate that contribution score-based pathway ranking reveals a conserved signaling backbone alongside context-dependent functional modules, providing a framework for interpreting regulatory architecture beyond expression-centric analyses.
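As a small illustration of the delta NES idea, the sketch below subtracts per-pathway normalized enrichment scores between two cellular states and ranks the result. The sign convention (state A minus state B), the restriction to shared pathways, and the function name are assumptions, not details taken from the abstract.

```python
def delta_nes(nes_a, nes_b):
    """Per-pathway NES difference (state A minus state B), largest first.

    nes_a, nes_b: dicts mapping pathway name -> normalized enrichment score.
    Only pathways scored in both states are compared.
    """
    shared = nes_a.keys() & nes_b.keys()
    diffs = {pathway: nes_a[pathway] - nes_b[pathway] for pathway in shared}
    return dict(sorted(diffs.items(), key=lambda kv: kv[1], reverse=True))
```

Pathways near zero in this ranking are enriched similarly in both states, which is where a shared backbone such as Signal Transduction would be expected to sit, while pathways at either end are redistributed between contexts.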
bioinformatics2026-04-16v11An Explainable Knowledge Graph-Driven Approach to Decipher the Link Between Brain Disorders and the Gut Microbiome
Aamer, N.; Asim, M. N.; Vollmer, S.; Dengel, A.Abstract
Motivation: The communication between the gut microbiome and the brain, known as the microbiome-gut-brain axis (MGBA), is emerging as a critical factor in neurological and psychiatric disorders. This communication involves complex pathways including neural, hormonal, and immune interactions that enable gut microbes to modulate brain function and behavior. However, the specific mechanisms through which gut microbes influence brain function remain poorly understood, and existing computational efforts to understand these mechanisms are simplistic or have limited scope. Results: This work presents a comprehensive approach for elucidating the interactions that allow gut microbes to influence brain disorders. We construct a large curated biomedical knowledge graph comprising 586,318 nodes across 16 entity types and 3,573,936 edges spanning 103 relation types, integrating ontological and experimental data relevant to the MGBA. On this graph, we train GNN-GBA, a GraphSAGE-based graph neural network with a DistMult relation-aware decoder, achieving an AUC-ROC of 0.997 and an F1-score of 0.981 on link prediction, outperforming nine baseline methods across four categories. Using GNNExplainer, we extract and rank mechanistic pathways connecting gut microbes to brain disorders, and demonstrate their stability across multiple random initializations. GNN-GBA successfully identified pathways for 125 brain disorders, revealing shared metabolite hubs (including flavonoids, bile acids, and short-chain fatty acids) that mediate gut-brain communication across diverse neurological conditions. Furthermore, we show that the top pathways are consistent with existing literature for three common disorders. Lastly, we develop an interactive dashboard (GutBrainExplorer) to explore thousands of potential mechanistic pathways across 125 brain disorders, which is publicly available. Availability: Code and data are available at https://github.com/naafey-aamer/GNN-GBA. 
Contact: naafey.aamer@cs.rptu.de
bioinformatics2026-04-16v4EGGS: Empirical Genotype Generalizer for Samples
Smith, T. Q.; Rahman, A.; Szpiech, Z. A.Abstract
Summary: We introduce Empirical Genotype Generalizer for Samples (EGGS), which accepts empirical genotypes with missing data and replicates the distribution of missing genotypes along the empirical segment in other replicates. The empirical segment must contain fewer sites than the replicate. In addition, EGGS can remove phase, remove polarization, simulate deamination, simulate sequencing error, create pseudohaploids, and convert between Variant Call Format (VCF), ms-style replicates, and EIGENSTRAT/ANCESTRYMAP. When producing VCF files, EGGS is not limited to biallelic sites and assumes all samples are diploid. Availability and Implementation: EGGS is written in the C programming language. Precompiled executables, source code, the manual, and the analysis conducted in the paper are available at https://github.com/TQ-Smith/EGGS
bioinformatics2026-04-16v3DIA-CLIP: a universal representation learning framework for zero-shot DIA proteomics
Liao, Y.; Wen, H.; E, W.; Zhang, W.Abstract
Data-independent acquisition mass spectrometry (DIA-MS) has established itself as a cornerstone of proteomic profiling and large-scale systems biology, offering unparalleled depth and reproducibility. Current DIA analysis frameworks, however, require semi-supervised training within each run for peptide-spectrum match (PSM) re-scoring. This approach is prone to overfitting and lacks generalizability across diverse species and experimental conditions. Here, we present DIA-CLIP, a pre-trained model shifting the DIA analysis paradigm from semi-supervised training to universal cross-modal representation learning. By integrating a dual-encoder contrastive learning framework with an encoder-decoder architecture, DIA-CLIP establishes a unified cross-modal representation for peptides and corresponding spectral features, achieving high-precision, zero-shot PSM inference. Extensive evaluations across diverse benchmarks demonstrate that DIA-CLIP consistently outperforms state-of-the-art tools, yielding up to a 45% increase in protein identification while achieving a 12% reduction in entrapment identifications. Moreover, DIA-CLIP holds immense potential for diverse practical applications, such as single-cell and spatial proteomics, where its enhanced identification depth facilitates the discovery of novel biomarkers and the elucidation of intricate cellular mechanisms.
bioinformatics2026-04-16v3evo3D R package: a spatial haplotype framework for structure-informed analysis of molecular evolution
Broyles, B. K.; He, Q.Abstract
At the molecular level, selection pressures often act on protein structural features, yet most evolutionary analyses remain confined to linear sequences. Early structure-informed approaches improved interpretability by mapping single-site metrics onto protein structures, and later methods introduced 3D sliding windows to capture spatially clustered signals missed by linear window approaches. These frameworks, however, are restricted to predefined statistics and narrowly defined 3D window types, limiting the scope of questions that can be addressed. We developed an R package, evo3D, as a new framework for structure-informed evolutionary analysis that supports a wide range of downstream statistics and scales from simple to complex structures. evo3D extracts structure-informed multiple sequence alignment subsets (spatial haplotypes), making the structure-informed unit of analysis directly available to users. The framework supports fixed-count and fixed-distance spatial windows, introduces residue and codon analysis modes, and extends to multimers, interfaces, and multiple structural models through a single wrapper, run_evo3d(). We demonstrate evo3D's utility by performing an epitope-level diversity scan of Hepatitis C virus E1/E2 complex, identifying conserved spatial neighbourhoods missed by linear sliding windows, and by evaluating evo3D's scalability on the octameric Chikungunya virus E1/E2 assembly. Importantly, evo3D formalises the core components of structure-informed analysis of molecular evolution and removes technical barriers. As a result, the framework streamlines the evaluation of evolutionary patterns directly within 3D structural contexts, and we anticipate its wide application in molecular evolution studies. The package is available at github.com/bbroyle/evo3D.
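To make the two window types concrete, here is a rough sketch of fixed-count and fixed-distance spatial windows and of pulling the matching alignment columns as a spatial haplotype. It is written in Python for illustration and is not the evo3D R API; the function names, the use of one coordinate per residue, and the one-to-one mapping between residues and alignment columns are all assumptions.

```python
import numpy as np


def spatial_windows(coords, k=None, radius=None):
    """Return, for each residue, the indices of residues in its 3D window.

    coords: (N, 3) array with one representative coordinate per residue.
    k:      fixed-count window (the k nearest residues, including itself).
    radius: fixed-distance window (all residues within `radius`).
    """
    if (k is None) == (radius is None):
        raise ValueError("give exactly one of k or radius")
    coords = np.asarray(coords, dtype=float)
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    if k is not None:
        return [sorted(np.argsort(row, kind="stable")[:k].tolist()) for row in dists]
    return [np.flatnonzero(row <= radius).tolist() for row in dists]


def spatial_haplotypes(msa, window):
    """Subset every aligned sequence to the columns listed in `window`."""
    return ["".join(seq[i] for i in window) for seq in msa]
```

Any per-window statistic (for example, the number of distinct haplotypes) can then be computed on the output of `spatial_haplotypes`, which is the sense in which the window becomes the unit of analysis.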
bioinformatics2026-04-16v2Antimicrobial Resistance Prediction in Salmonella enterica Using Frequency Chaos Game Representation and ResNet-18
Ismail, S. M.; Fayed, S. H.Abstract
Antimicrobial resistance (AMR) prediction from bacterial genomes remains a major challenge for clinical microbiology and surveillance. We developed a deep learning model based on Frequency Chaos Game Representation (FCGR) and a ResNet-18 architecture to classify resistance phenotypes directly from whole-genome assemblies. Using homology-aware clustering to prevent genomic data leakage, we trained and evaluated models for Salmonella enterica (seven antibiotics) and Staphylococcus aureus (five antibiotics). The Salmonella model achieved high predictive accuracy, particularly for cephalosporins, while performance was lower for tetracycline and ampicillin. The Staphylococcus aureus model demonstrated that the pipeline generalizes to Gram-positive bacteria, with strong results for methicillin (Balanced Accuracy = 0.85). Benchmarking against the gene-based tool ResFinder showed that the FCGR-based model did not match the performance of ResFinder on most antibiotics, but achieved competitive results for cephalosporins. This study demonstrates the feasibility of applying FCGR-based deep learning to AMR prediction across bacterial species, though substantial improvements would be needed before clinical application.
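For readers unfamiliar with FCGR, the sketch below counts k-mers into a 2^k by 2^k matrix whose cell is determined by the bases of the k-mer, giving an image-like input for a convolutional network. Corner assignment and cell-indexing conventions differ between implementations; the ones used here are an assumption, as is the choice to skip k-mers containing ambiguous bases.

```python
import numpy as np

# Assumed corner layout as (x, y) bits: A bottom-left, C top-left, G top-right, T bottom-right.
CORNER = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def fcgr(seq, k):
    """Count every k-mer of `seq` into a (2**k, 2**k) integer matrix."""
    size = 2 ** k
    mat = np.zeros((size, size), dtype=np.int64)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if any(base not in CORNER for base in kmer):
            continue  # skip k-mers with N or other ambiguity codes
        x = y = 0
        for base in kmer:
            bx, by = CORNER[base]
            # Each base contributes one bit to the column and one to the row.
            x = (x << 1) | bx
            y = (y << 1) | by
        mat[y, x] += 1
    return mat
```

Dividing the matrix by its sum turns counts into frequencies, which keeps inputs comparable across assemblies of different lengths.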
bioinformatics2026-04-16v2Highly Accurate Estimation of the Fold Accuracy of Protein Structural Models
Xie, L.; Ye, E.; Wang, H.; Zhang, T.; Zhen, Q.; Liang, F.; Liu, D.; Zhang, G.Abstract
The function of a protein is intrinsically linked to its three-dimensional fold, and deep learning has revolutionized the field by enabling high-accuracy structure prediction at an unprecedented scale. Nevertheless, the growing deployment of these predictive pipelines in drug discovery and structural biology reveals a critical bottleneck that lies in the lack of independent and rigorous model accuracy estimation (EMA) methodologies. Here we present DeepUMQA-Global, a single-model deep learning framework for estimating accuracy of protein structural models. Our method employs a structure-sequence cross-consistency mechanism to quantify the bidirectional compatibility between the input sequence and the predicted three-dimensional structure, enabling a comprehensive characterization of fold accuracy. DeepUMQA-Global outperforms the self-assessment confidence scores of AlphaFold3, achieving improvements of 57.8% in the Pearson correlation and 49.0% in the Spearman correlation. With respect to the CASP16 retrospective benchmark, DeepUMQA-Global outperforms all single-model accuracy estimation methods that participated in CASP16 and achieves performance comparable to that of the top consensus-based methods. A lightweight consensus strategy built upon DeepUMQA-Global ranks first among all CASP16 participants, surpassing all other methods, including consensus approaches, and highlighting the strength of our method. Remarkably, DeepUMQA-Global demonstrates a strong ability to discriminate between alternative conformational states of proteins, as evidenced on the CASP unique alternative conformation protein complex target and the CoDNaS benchmark. Our results indicate that DeepUMQA-Global can be extended to broader protein modeling tasks, moving beyond static evaluation to offer a foundation for dynamic conformation EMA, where it accurately discriminates alternative conformational states and demonstrates reliable predictive fidelity in model accuracy estimation.
bioinformatics2026-04-16v1Thermoadaptation of EndoG proteins in the Xenopus frog genus
Tokmakov, A. A.Abstract
Xenopus is a genus of entirely aquatic frogs found in sub-Saharan Africa. Currently, the complete genomes of two species within the Xenopus genus, Xenopus laevis and Xenopus tropicalis, have been fully sequenced, annotated, and made publicly available. The two species inhabit markedly different environments: X. tropicalis lives in the hot, equatorial regions of Africa, whereas X. laevis resides in the cooler climates of southern Africa. In the present study, mutational profiling, comparative homology modeling, and computational bioinformatics were used to identify the features of adaptive evolution in Xenopus endonuclease G (EndoG) proteins. Multiple characteristics of EndoG isozymes were found to vary considerably between the two Xenopus species dwelling in different locations. Most notably, EndoG proteins from the psychrophilic X. laevis exhibit increased contents of charged and polar residues, elevated pI, higher intramolecular interaction energies, B factors, molecular void volumes, and solvent accessibilities, but decreased contents of nonpolar and aromatic amino acids, lower hydrophobicity, buried surface area, and molecular packing density compared to those from the thermophilic X. tropicalis. The observed differences strongly suggest that temperature plays a dominant role in EndoG diversification. Evaluation of intramolecular interaction energies appears to be a particularly sensitive and discriminative framework for assessing protein divergence at the structural level. Overall, this study highlights the diversification of homologous proteins in ectothermic vertebrate eukaryotes and provides mechanistic insight into protein adaptation to contrasting environments.
bioinformatics2026-04-16v1Three-dimensional Virtual Adult Cardiomyocyte Transcriptomics
Luo, C.; Lyu, Y.; Guo, X.; Cheng, L.; Liang, Q.; Wang, S.; Wang, Y.; Zhang, S.; Wang, S.; Liu, T.; Luo, Y.; Lu, F.; Ran, B.; Zhang, Y.; Liu, X.; Wang, Y.; Qin, G.; Wu, J.; Lyu, Q. R.Abstract
Adult cardiomyocytes are large, rod-shaped, and often multinucleated, which makes them challenging for current single-cell or single-nucleus RNA-sequencing platforms. Current spatial transcriptomics (ST) relies on nuclear-based cell segmentation, which performs poorly when identifying adult cardiomyocytes. Moreover, single-section ST of adult myocardium is insufficient to capture the cellular transcriptomic information of intact cardiomyocytes. Thus, there is an urgent need for novel technology that accurately profiles the transcriptome of adult cardiomyocytes in situ at the single-cell level. Here, we report the first three-dimensional virtual cardiomyocyte (3D-VirtualCM) transcriptome atlas by reconstructing multi-layer ST spanning a 100 µm depth of the adult mouse heart. Using membrane-based cell segmentation and similarity-guided cross-sectional contour matching, 3D-VirtualCM delineates individual cardiomyocyte 3D contours and integrates in situ transcriptome. 3D-VirtualCM identifies cardiomyocytes in the cell cycle using proliferative markers in the context of myocardial infarction (MI) and reveals the asymmetric intracellular RNA distribution along the longitudinal axis of cardiomyocytes. Using 3D RNA fluorescence in situ hybridization (FISH), we validated the longitudinal asymmetry of Glul and Gja1 mRNA in adult cardiomyocytes. In summary, 3D-VirtualCM provides a workflow that advances the study of cardiac pathophysiology at a bona fide single-cell level while preserving spatial context.
bioinformatics2026-04-16v1scDisent: disentangled representation learning with causal structure for multi-omic single-cell analysis
Xi, G.Abstract
Single-cell multi-omic technologies measure complementary aspects of cellular identity and regulatory state, yet most integration models compress these signals into one entangled latent space. Such representations are useful for clustering but poorly suited for mechanistic interpretation or perturbation-oriented analysis. We present scDisent (https://github.com/xiguoren/scDisent), a generative framework for disentangled representation learning that separates expression-associated variables (zexpr) from regulation-associated variables (zreg) and links them through a sparse directed mapping. scDisent combines modality-specific encoding, variational disentanglement with total-correlation and orthogonality constraints, and a Gumbel-gated causal module protected by detach-based gradient isolation. Evaluated on benchmark datasets with matched modalities, scDisent achieved best-in-benchmark integration performance while exposing regulatory structure that competing integration methods do not model explicitly. The learned causal atlas remained sparse, perturbation analyses recovered biologically coherent lineage-associated programs, and cross-dataset discovery analyses highlighted interpretable immune, neural and developmental signatures. Quantitative branch-separation analyses further showed that benchmark-label information concentrated in zexpr rather than zreg. Together, these results position scDisent as a computational method that improves not only integration quality but also biological interpretability, making single-cell multi-omic representations better suited to biological question answering and in silico hypothesis generation.
bioinformatics2026-04-16v1Multiscale transcriptomic organization of the human brain with DigitalBrain
An, J.; Hu, X.; Jiang, Y.; Jiang, M.; Qiu, S.; Liu, G.; Wei, X.; Wang, Y.; Lin, J. Q.; Wang, C.; Lu, M.Abstract
The human brain varies across anatomical regions, cell types, development, ageing and disease states, yet existing single-cell transcriptomic resources remain fragmented and difficult to integrate into a unified biological model. Here we present DigitalBrain, a human brain-specific atlas and foundation-model framework for organizing diverse and fragmented human brain transcriptomic data across scales. We first built DigitalBrain-Atlas, a harmonized whole-brain single-cell resource comprising 16.35 million transcriptomes from 2,143 donors across 165 brain regions, spanning the human lifespan and multiple neurological and clinical conditions. We then developed DigitalBrain-M1, a Transformer-based model that jointly encodes gene identity and expression magnitude to learn a shared embedding space for cells and genes. Across held-out datasets, DigitalBrain supported robust single-cell integration, clustering and cell-type annotation while preserving major biological structure and reducing technical fragmentation. Beyond these benchmarks, the learned embeddings revealed emergent large-scale hierarchical organization of the human brain, linking anatomically distinct regions into higher-order patterns consistent with known functional systems. Applied to human hippocampal aging, DigitalBrain identified cell-type-specific aging sensitive gene sets, identified dentate gyrus granule cells as a particularly age-sensitive population, and discovered selective reorganization of gene programs related to synaptic transmission, postsynaptic structure, membrane excitability and axon guidance during aging. Cross-dataset convergence was strongest at the level of functional modules and recurrent aging sensitive genes. Together, these results demonstrate DigitalBrain as a brain-specific framework for mapping human brain organization across scales, and as an early step towards a complete virtual organ for the human brain.
bioinformatics2026-04-16v1MISSTE: a multiscale integrative spatial simulator for understanding the mechanisms underlying tissue ecosystems
Su, Z.; Yin, S.; Wu, Y.Abstract
Multiscale tissue ecosystems are governed by coupled intracellular decision-making, cell-cell interactions, and spatially structured microenvironmental signals, yet these scales are often studied separately. Here we present MISSTE, a modular framework that integrates Boolean intracellular state logic, agent-based modeling, and partial differential equation fields within a unified spatial simulation architecture. As a proof of concept, we applied MISSTE to CAR-T therapy in a solid tumor microenvironment. The model recapitulated emergent features of CAR-T behavior, including limited tumor penetration, stromal suppression, localized cytokine remodeling, hypoxia-associated constraint, and progressive functional exhaustion. Comparison of baseline and optimized conditions showed that coordinated enhancement of interaction range, migration, and cytotoxic function improved immune persistence and partial tumor control. Systematic parameter scans further identified effective immune-tumor contact as a stronger determinant of outcome than killing strength alone, highlighting spatial access as the dominant bottleneck. Guided by these results, we designed sequential intervention strategies and found that time-ordered enhancement of infiltration, killing, and late functional protection outperformed a static optimized regime. Together, these results establish MISSTE as a generalizable multiscale methodology for dissecting tissue ecosystems and for generating mechanistically grounded strategies for engineered cellular therapy design.
bioinformatics2026-04-16v1vcfilt: A Zero-Allocation Streaming Filter for High-Throughput VCF Processing
KP, M. M.Abstract
Variant Call Format (VCF) files are the dominant interchange format for genomic variant data, but their size - routinely exceeding tens of gigabytes for population-scale studies - creates a significant computational bottleneck at the quality-filtering stage. Existing tools such as bcftools and vcftools provide broad functionality through general-purpose expression engines, but incur substantial per-record overhead from dynamic field lookup, type resolution, and heap allocation. We present vcfilt, a streaming, batch-parallel VCF filter implemented in Go that restricts its scope to three high-frequency filter criteria (INFO/DP, INFO/AF, and QUAL) and applies them via a zero-allocation byte-scan parser. Benchmarked on real 1000 Genomes Project data (chromosome 20, 1,811,146 variants), vcfilt achieves 147,000 variants/second on an 18 GB plain-text VCF file using a single thread - a 12.2x speedup over bcftools 1.18 under identical conditions. On gzip-compressed input, the speedup is 7.9x. Output is byte-for-byte identical to bcftools across all tested filter combinations. vcfilt is distributed as a self-contained static binary, a Docker image, and a Singularity-compatible container. The source code and all benchmark scripts are openly available under the MIT licence.
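The three filter criteria can be illustrated with a short streaming sketch that works on raw bytes and never builds a full record object. It is written in Python for readability and is not zero-allocation; it is not vcfilt's code. The threshold parameters, the handling of missing values (records lacking QUAL, DP, or AF are dropped), and the rule that a multi-allelic site passes if any allele's AF passes are all assumptions.

```python
def info_value(info, key):
    """Return the raw bytes value of `key` in a VCF INFO field, or None."""
    prefix = key + b"="
    for item in info.split(b";"):
        if item.startswith(prefix):
            return item[len(prefix):]
    return None


def keep_record(line, min_qual, min_dp, min_af):
    """Decide whether one VCF data line passes QUAL, INFO/DP, and INFO/AF cutoffs."""
    fields = line.rstrip(b"\r\n").split(b"\t", 8)
    if len(fields) < 8:
        return False
    qual, info = fields[5], fields[7]
    if qual == b"." or float(qual) < min_qual:
        return False
    dp = info_value(info, b"DP")
    if dp is None or int(dp) < min_dp:
        return False
    af = info_value(info, b"AF")
    if af is None:
        return False
    # AF is comma-separated at multi-allelic sites.
    return any(float(a) >= min_af for a in af.split(b",") if a != b".")


def filter_vcf(lines, min_qual, min_dp, min_af):
    """Yield header lines unchanged and data lines that pass all three cutoffs."""
    for line in lines:
        if line.startswith(b"#") or keep_record(line, min_qual, min_dp, min_af):
            yield line
```

Passing a file opened in binary mode as `lines` streams the input one record at a time, so memory use stays flat regardless of file size.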
bioinformatics2026-04-16v1Generative design of intrinsically disordered proteins based on conditioned protein language models: Data is the limit
Carriere, L.; Huyghe, A.; Pajkos, M.; Bernado, P.; Cortes, J.Abstract
Intrinsically disordered proteins and regions (IDRs) are central to a multitude of biological processes. Despite extensive studies of their structural and physicochemical properties, the rational design of IDRs with defined conformational behavior remains challenging due to their ensemble nature. Here we present a generative framework for designing disordered protein sequences conditioned on target conformational ensemble descriptors using protein language models (pLMs). We formulate IDR design as the task of generating amino acid sequences predicted to realize specified biophysical properties and implement a Transformer encoder-decoder architecture that maps numerical descriptors to protein sequences. By training models on datasets spanning two orders of magnitude in size, we show that accurate control of conformational and physicochemical properties is achieved only at large data scale. These results demonstrate the feasibility of conditioning generative models on ensemble-level descriptors for IDR design. More broadly, these results support a data-centric paradigm for protein engineering, in which data availability emerges as a key limiting factor for the accurate design of IDRs.
bioinformatics2026-04-16v1Interpretable Biological Sequence Clustering with iClust
Zhang, S.; Liu, X.; Lou, J.; Jiang, M.; He, Z.Abstract
Biological sequence clustering is a fundamental problem in bioinformatics, yet most existing methods mainly optimize clustering quality or efficiency while offering limited insight into why sequences are grouped together. This restricts their usefulness in downstream analysis, where representative sequences and clear cluster boundaries are often needed. To address this issue, we propose iClust, an interpretable clustering method that characterizes each cluster by a representative prototype and an adaptive radius. By adapting to local sequence structure rather than relying on a single global threshold, iClust produces clusters that are both meaningful and explainable. A final consolidation step further reduces tiny fragments and improves structural stability. Experiments on simulated and real biological sequence datasets show that iClust achieves competitive clustering performance while providing clearer cluster-level explanations than conventional threshold-based methods. In addition to its empirical impact as a practical clustering method for biological sequences, this article opens up new avenues for developing biological sequence clustering approaches from the viewpoint of interpretable machine learning.
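The idea of describing each cluster by a prototype and its own radius can be sketched as follows. This is an illustrative assumption, not the iClust algorithm: the abstract does not say how prototypes are chosen or how radii are adapted, so both are supplied by hand here, and the mismatch-fraction distance on equal-length sequences is a stand-in for whatever sequence distance the method uses.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    prototype: str   # representative sequence of the cluster
    radius: float    # per-cluster distance threshold (hypothetical values)


def mismatch_fraction(a: str, b: str) -> float:
    """Fraction of differing positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length in this sketch")
    if not a:
        return 0.0
    return sum(x != y for x, y in zip(a, b)) / len(a)


def explain_assignment(seq: str, clusters: list):
    """Return (cluster index, distance) for the nearest prototype whose radius
    covers `seq`, or None when no cluster's radius covers it."""
    best = None
    for idx, cluster in enumerate(clusters):
        d = mismatch_fraction(seq, cluster.prototype)
        if d <= cluster.radius and (best is None or d < best[1]):
            best = (idx, d)
    return best


clusters = [Cluster("ACGTACGT", 0.25), Cluster("TTTTTTTT", 0.5)]
hit = explain_assignment("ACGTACGA", clusters)   # within radius of cluster 0
miss = explain_assignment("GGGGGGGG", clusters)  # outside every radius
```

Each assignment comes with its own explanation: the sequence lies at distance `d` from the named prototype, inside that cluster's radius.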
bioinformatics2026-04-16v1Sampling antibody conformational ensembles with ABodyBuilder4-STEROIDS
Spoendlin, F. C.; Cagiada, M.; Ifashe, K.; Vavourakis, O.; Deane, C. M.Abstract
Conformational flexibility is fundamental to the function of many proteins and, in the case of antibodies, can impact key properties such as affinity and specificity. While it is possible to predict single, static protein structures with high accuracy, predicting conformational ensembles remains challenging. Molecular dynamics simulations suffer from high computational costs, while deep learning methods are yet to achieve the same level of accuracy. Here, we introduce ABB4-STEROIDS, a generative structure prediction model that samples conformational ensembles of antibodies. We trained our model on 4.2 million structural frames derived from ~136,000 coarse-grained and a set of 83 new all-atom antibody MD simulations. We benchmarked our model on reproducing MD ensembles and evaluated the diversity of sampled structures and the covered conformational space against experimental evidence. ABB4-STEROIDS achieves state-of-the-art accuracy, particularly within the experimental benchmarks. The model is openly available and provides a robust resource for large-scale investigations of antibody conformational ensembles.
bioinformatics2026-04-16v1LinkLlama: Enabling Large Language Model for Chemically Reasonable Linker Design
Sun, K.; Wang, Y. E.; Purnomo, J. C.; Cavanagh, J. M.; Alteri, G. B.; Head-Gordon, T.Abstract
Fragment-based drug discovery (FBDD) relies heavily on the design of chemically viable linkers to connect fragments binding to different pocket regions into potent lead molecules. While recent generative models have advanced spatial fragment linking, they frequently produce linkers characterized by high torsional strain and non-drug-like motifs. In this work, we present LinkLlama, a fine-tuned Meta Llama 3 model that bridges the gap between text-based generation and 3D spatial awareness. By accepting natural language prompts that specify geometric constraints, such as distances and angles, alongside physicochemical targets like Lipinski's rules and rotatable bond limits, LinkLlama generates highly tailored molecules for the input fragments. Leveraging the inherent chemical grammar captured through supervised fine-tuning on a curated corpus of drug-like molecules from ChEMBL, the model prioritizes chemical validity without requiring complex reinforcement learning loops. Benchmarking on the ZINC and HiQBind datasets demonstrates that LinkLlama maintains competitive geometric fidelity compared to strictly 3D-aware models while achieving a two-fold increase in the proportion of chemically reasonable designs. This success rate, which rises from ~35% to over 80%, is defined by strict adherence to comprehensive structural filters covering PAINS, non-drug-like chemical patterns, and complex ring systems. We further illustrate the model's versatility through prospective case studies in novel small-molecule scaffold hopping and PROTAC linker design, validated via molecular docking and molecular dynamics simulations against known crystal poses. Ultimately, LinkLlama demonstrates that large language models can overcome the structural pitfalls of purely 3D-generative methods, offering a highly controllable and chemically robust framework to accelerate linker design and drug discovery in general.
bioinformatics2026-04-16v1Impact of the N-glycosylation on full-length IgG2 and IgG4 antibodies: a comparative study using molecular dynamics simulations.
LEON FOUN LIN, R.; Bellaiche, A.; Diharce, J.; Etchebest, C.Abstract
Like other proteins, monoclonal antibodies, which are important biodrugs, are subject to post-translational modifications, especially N-glycosylation. However, the effect of N-glycosylation remains poorly studied, and atomistic details about its influence are rarely available. Moreover, the few existing studies focus on the prevalent immunoglobulin G1. To further understand the impact of glycosylation, we carried out a comparative exploration of the effect of N-glycosylation on two different classes of antibodies, namely Mab231, an IgG2, and pembrolizumab, an IgG4. The two antibodies differ in their sequences, lengths, and 3D structures, and also in the location and composition of their glycans. In the present work, detailed information was gained through molecular dynamics simulations in which both monoclonal antibodies were studied with and without their glycans. The results of 1.5 microseconds of sampling for each system show that glycosylation does not drastically alter the overall conformational landscape of either antibody, whatever the metrics considered. However, it measurably modulates local flexibility, inter-domain correlated motions, and the relative orientation of the Fab arms with respect to the Fc domain, with statistically significant shifts in key geometric descriptors. Importantly, contact analysis reveals that glycan interactions extend beyond the Fc region to reach Fab residues. The allosteric network calculations demonstrate that the influence of Fc-bound glycans propagates as far as the Fab framework regions in both mAbs, which could impact antigen binding. The nature and magnitude of these effects are subclass-dependent, reflecting differences in glycan composition, hinge architecture, and three-dimensional organization. Our findings challenge the prevailing view that Fc glycosylation uniformly promotes CH2 domain opening.
More importantly, they underscore the necessity of considering full-length structures and IgG subclass diversity in glyco-engineering strategies.
bioinformatics2026-04-16v1Track Display Jockey (trackDJ): a user-friendly R package for visualization of epigenomic data
Bokil, N. V.; Page, D. C.Abstract
Background Visualization of epigenomic data such as coverage tracks, peak calls, and chromatin interactions is a critical task in genomic data analysis. Although genome browsers such as the Integrative Genomics Viewer (IGV) and the UCSC Genome Browser permit user-friendly exploration of genomic tracks, they are not optimized for fully programmatic and reproducible generation of publication-quality figures. In contrast, existing programmatic tools lack a user-friendly interface and require extensive configuration. Results We present trackDJ (Track Display Jockey), an R package for visualization of epigenomic data. trackDJ prioritizes usability by favoring convention over configuration; it provides high-level plotting functions with sensible defaults, allowing users with minimal programming experience to generate clear, publication-quality figures with relatively little coding. Within a unified plotting framework, users can stack and align multiple data types, including coverage tracks, peak annotations, chromatin loops, and gene annotations. trackDJ allows users to select plotted genomic regions by coordinates or by gene name, enabling rapid visualization without knowledge of precise locus boundaries. Conclusions trackDJ provides a user-friendly and reproducible alternative to interactive genome browsers for epigenomic visualization, filling a critical gap in currently available epigenomics toolkits. By enabling scripted generation of clean, customizable genomic illustrations, trackDJ integrates naturally into R-based analysis workflows to streamline the creation of publication-quality figures.
bioinformatics2026-04-16v1FlyPredictome: A structural atlas of predicted protein-protein interactions in Drosophila
Kim, A.-R.; Comjean, A.; Veal, A.; Rodiger, J.; Han, M.; Hu, Y.; Perrimon, N.Abstract
Protein-protein interactions (PPIs) are fundamental to cellular function. Yet most Drosophila PPIs remain structurally uncharacterized despite the wealth of genetic and biochemical data available for this organism. Here we present FlyPredictome, a structural interactome based on 1.5 million pairwise AlphaFold-Multimer predictions. Using a local confidence metric that performs robustly for interactions involving flexible and disordered proteins, we systematically assess experimentally reported Drosophila PPIs and predict direct binding interfaces at residue-level resolution. Testing their functional relevance, we find that phenotype-associated missense mutations are enriched at predicted interaction interfaces. Building on these validated predictions, we construct an evidence-supported PPI network, revealing modular organization from signaling pathways to individual protein complexes. FlyPredictome is available as an open database, providing a structural foundation for interaction discovery in Drosophila.
bioinformatics2026-04-16v1ORION: An agentic reasoning construct for the analysis of complex human immune profiling
Dayao, M. T.; Kim, K.; Khor, B.; Jaech, A.; van Opheusden, B.; Bodansky, A.; DeRisi, J.Abstract
The capacity to generate high-dimensional biological datasets has outpaced the ability to interpret them. Technologies such as phage immunoprecipitation and sequencing (PhIP-seq) enable proteome-scale profiling of antibody repertoires, but interpreting thousands of enriched peptides into mechanistic hypotheses remains a labor-intensive bottleneck requiring expert synthesis of statistics, literature, and domain knowledge. Here we describe ORION (Omics Reasoning & Interpretation Orchestrator), a multi-agent framework that uses reasoning-capable large language models to perform end-to-end analysis of complex immune profiling data. ORION integrates statistical analysis, machine learning, and automated literature review into a single structured workflow, producing results that are reproducible and fully traceable. Applied to a published PhIP-seq dataset from autoimmune polyendocrine syndrome type 1 (APS-1), ORION recovered the canonical autoantibody signature in approximately two hours, closely recapitulating an analysis that originally required one to two months of manual effort. To test hypothesis-generation capacity on previously unseen data, we applied ORION to a novel PhIP-seq dataset from individuals with Down syndrome, for which no proteome-wide autoantibody reference exists. ORION distinguished disease from control samples with high accuracy, prioritized candidate autoantibody targets, and organized them into biologically coherent groups spanning immune, gut, and neuronal programs, generating testable hypotheses for experimental follow-up. These results demonstrate that agentic AI systems can compress the analysis of complex immune profiling data from weeks to hours, allowing scientists to redirect their time toward the fundamental biology.
bioinformatics2026-04-16v1Longevity Bench: Are SotA LLMs ready for aging research?
Zhavoronkov, A.; Sidorenko, D.; Naumov, V.; Pushkov, S.; Zagirova, D.; Aladinskiy, V.; Unutmaz, D.; Aliper, A.; Galkin, F.Abstract
Aging is a core biological process observed in most species and tissues, which is studied with a vast array of technologies. We argue that the abilities of AI systems to emulate aging and to accurately interpret biodata in its context are the key criteria to judge an LLM's utility in biomedical research. Here, we present LongevityBench -- a collection of tasks designed to assess whether foundation models grasp the fundamental principles of aging biology and can use low-level biodata to arrive at phenotype-level conclusions. The benchmark covers a variety of prediction targets including human time-to-death, mutations' effect on lifespan, and age-dependent omics patterns. It spans all common biodata types used in longevity research: transcriptomes, DNA methylation profiles, proteomes, genomes, clinical blood tests and biometrics, as well as natural language annotations. After ranking state-of-the-art foundation models using LongevityBench, we highlight their weaknesses and outline procedures to maximize their utility in aging research and life sciences.
bioinformatics2026-04-15v3Exploring molecular signatures of senescence with markeR, an R toolkit for evaluating gene sets as phenotypic markers
Martins-Silva, R.; Kaizeler, A.; Barbosa-Morais, N. L.Abstract
Many biological processes, including cellular senescence, manifest as diverse phenotypes that vary across cell types and conditions. In the absence of single, definitive markers, researchers often rely on the expression of sets of genes to identify these complex states. However, there are multiple ways to summarise gene set expression into quantitative metrics (i.e., signatures), each with its own strengths and limitations, and we know of no consensus framework to systematically evaluate their performance across datasets. We therefore developed markeR (https://bioconductor.org/packages/markeR), an open-source, modular R package that evaluates gene sets as phenotypic markers using various scoring and enrichment-based approaches. markeR generates interpretable metrics and intuitive visualisations that enable benchmarking of gene signatures and exploration of their associations with chosen study variables. As a case study, we applied markeR to 9 published senescence-related gene sets across 25 RNA-seq datasets, covering 6 human cell types and 12 senescence-inducing conditions. There was wide variability in gene set performance, as some signatures (e.g., SenMayo) were robust senescence markers across contexts, while others (e.g., those from MSigDB) performed poorly as such. We also used markeR to analyse gene expression in 49 GTEx tissues, revealing tissue- and age-related differences in senescence-associated signals. Together, these findings emphasise the difficulty of characterising molecular phenotypes and demonstrate the potential of markeR in facilitating the systematic evaluation of gene sets in various biological contexts.
bioinformatics2026-04-15v3Fast and accurate resolution of ecDNA sequence using Cycle-Extractor
Faizrahnemoon, M.; Luebeck, J.; Hung, K. L.; Rao, S.; Prasad, G.; Tsz-Lo Wong, I.; G. Jones, M.; S. Mischel, P.; Y. Chang, H.; Zhu, K.; Bafna, V.Abstract
Extrachromosomal DNA (ecDNA) plays a key role in cancer pathology. EcDNAs mediate high oncogene amplification and expression and worse patient outcomes. Accurately determining the structure of these circular molecules is essential for understanding their function, yet reconstructing ecDNA cycles from sequencing data remains challenging. We introduce Cycle-Extractor (CE) for reconstruction. CE accepts a breakpoint graph derived from either short or long read sequencing data as input and extracts a cycle with the maximum length-weighted-copy-number. CE utilizes a mixed-integer linear program (MILP) and a separate traversal procedure, enabling fast optimization and compatibility with free solvers. We evaluated CE against CoRAL (long-read-based quadratic optimization), Decoil (long-reads), and AmpliconArchitect (AA for short reads) on both simulated data and real cancer cell lines. On simulated ecDNA, CE achieves performance comparable to CoRAL across three accuracy metrics and consistently outperforms AA and Decoil. On cancer cell lines, CE produces longer and heavier cycles than AA, and achieves performance similar to CoRAL. Moreover, CE is, on average, 40x faster than CoRAL. These results demonstrate that CE accurately reconstructs ecDNA from both short- and long-read sequencing data, while long-read inputs allow CE to recover more complete and higher-confidence ecDNA structures. CE improved the prediction of many ecDNA structures. On a PC3 ecDNA containing MYC, CE uses ONT data to reconstruct a substantially larger and higher-copy sequence (4.2 Mbp) compared to the short-read-derived reconstruction (690 Kbp). CRISPR-CATCH experiments confirm the presence of a large ecDNA molecule, validating the long-read-based CE reconstruction.
bioinformatics2026-04-15v2PoolParty: streamlined design of DNA sequence libraries in Python
Liu, Z.; Cordero, A.; Kinney, J. B.Abstract
Motivation: Computationally designed DNA sequence libraries are essential components of massively parallel reporter assays (MPRAs), deep mutational scanning (DMS) experiments, and other multiplex assays of variant effect (MAVEs). They are also increasingly used in silico to analyze genomic AI models. Designing these libraries, however, remains tedious and error-prone due to the lack of purpose-built software. Results: Here we describe PoolParty, a Python package that streamlines the design of complex oligo pools using a simple but flexible API. In PoolParty, each library is represented by a computational graph that can be specified in just a few lines of code. Over 50 built-in operations cover nucleotide- and codon-level mutagenesis, motif insertion, barcode generation, and more. PoolParty automatically generates informative names for each sequence and provides "design cards" detailing how each sequence was generated. Visualization methods let users quickly audit library content and inspect the underlying graph. PoolParty thus transforms oligo pool design from a tedious task requiring custom functions and scripts into a structured, transparent, and reproducible process. Availability and implementation: PoolParty is freely available and can be installed using pip. It is compatible with Python ≥ 3.10. Documentation is provided at https://poolparty.readthedocs.io; source code is available at https://github.com/jbkinney/poolparty-statetracker. A static release is archived at DOI 10.5281/zenodo.19445098.
bioinformatics2026-04-15v2TFBindFormer: A Cross-Attention Transformer for Transcription Factor-DNA Binding Prediction
Liu, P.; Wang, L.; Basnet, S.; Cheng, J.Abstract
Transcription factors (TFs) are central regulators of gene expression, and their selective recognition of genomic DNA underlies various biological processes. Experimental profiling of TF-DNA interactions using chromatin immunoprecipitation followed by sequencing (ChIP-seq) provides high-resolution maps of in vivo TF-DNA binding but remains costly, labor-intensive, and inherently low-throughput, limiting its scalability across different transcription factors, cell types, and regulatory conditions. Computational modeling therefore plays an essential role in inferring TF-DNA interactions at genome scale. However, most existing computational models rely solely on DNA sequence and chromatin features to predict TF-DNA binding, neglecting TF-specific protein information. This omission limits their ability to capture protein-dependent binding specificity. Here, we present TFBindFormer, a hybrid cross-attention transformer that explicitly integrates genomic DNA features with TF-specific representations derived from protein sequences and structures. By modeling protein-conditioned, position-specific TF-DNA interactions, TFBindFormer enables direct learning of molecular determinants underlying DNA recognition. Evaluated across hundreds of cell-type-specific TFs and hundreds of millions of genome-wide DNA bins, TFBindFormer consistently outperforms DNA-only baselines, achieving substantial gains in both area under precision-recall curve (AUPRC) and area under receiver operating characteristic curve (AUROC). Together, these results demonstrate that integrating TF and DNA features via cross-attention enables TFBindFormer to serve as an effective and scalable framework for large-scale TF-DNA binding prediction.
bioinformatics2026-04-15v2DIA-CLIP: a universal representation learning framework for zero-shot DIA proteomics
Liao, Y.; Wen, H.; E, W.; Zhang, W.Abstract
Data-independent acquisition mass spectrometry (DIA-MS) has established itself as a cornerstone of proteomic profiling and large-scale systems biology, offering unparalleled depth and reproducibility. Current DIA analysis frameworks, however, require semi-supervised training within each run for peptide-spectrum match (PSM) re-scoring. This approach is prone to overfitting and lacks generalizability across diverse species and experimental conditions. Here, we present DIA-CLIP, a pre-trained model shifting the DIA analysis paradigm from semi-supervised training to universal cross-modal representation learning. By integrating a dual-encoder contrastive learning framework with an encoder-decoder architecture, DIA-CLIP establishes a unified cross-modal representation for peptides and corresponding spectral features, achieving high-precision, zero-shot PSM inference. Extensive evaluations across diverse benchmarks demonstrate that DIA-CLIP consistently outperforms state-of-the-art tools, yielding up to a 45% increase in protein identification while achieving a 12% reduction in entrapment identifications. Moreover, DIA-CLIP holds immense potential for diverse practical applications, such as single-cell and spatial proteomics, where its enhanced identification depth facilitates the discovery of novel biomarkers and the elucidation of intricate cellular mechanisms.
bioinformatics2026-04-15v2KyDab - a comprehensive database of antibody discovery selection campaigns.
Zhou, Q.; Chomicz, D.; Melvin, D.; Griffiths, M.; Yahiya, S.; Reece, S.; Le Pannerer, M.-M.; Krawczyk, K.Abstract
Preclinical antibody discovery relies on progressive screening and down-selection of candidate antibodies from large immune repertoires, yet this critical process is poorly represented in existing public databases. Here we introduce KyDab (Kymouse Antibody Database), a well-curated database of antibody discovery selection data generated using standardized workflows on the Kymouse humanized mouse platform. The current release includes 11 mouse immunisation studies on the Kymouse platform covering 51 immunogens, more than 120,000 paired heavy-light chain sequences, and binding measurements for a selected subset of experimentally characterized clones. By capturing full-funnel selection data with consistent metadata and both positive and negative experimental outcomes, KyDab provides a valuable data resource for the development and evaluation of artificial intelligence models for antibody discovery. KyDab is accessible at https://kydab.naturalantibody.com, and the database will be continuously updated as new datasets become available.
bioinformatics2026-04-15v2