Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
AI-readiness criteria for biomedical data
Clark, T.; Caufield, H.; Parker, J. A.; Al Manir, S.; Amorim, E.; Eddy, J.; Gim, N.; Gow, B.; Goar, W.; Hansen, J. N.; Harris, N.; Hermjakob, H.; Joachimiak, M.; Jordan, G.; Lee, I.-H.; McWeeney, S. K.; Nebeker, C.; Nikolov, M.; Reese, J.; Shaffer, J.; Sheffield, N.; Sheynkman, G.; Stevenson, J.; Chen, J. Y.; Mungall, C.; Wagner, A.; Kong, S. W.; Ghosh, S. S.; Patel, B.; Williams, A.; Munoz-Torres, M. C.
Abstract
Biomedical research is rapidly adopting artificial intelligence (AI). Yet the inherent complexity of biomedical data preparation requires implementing actionable, robust criteria for ethical and explainable AI (XAI) at the "pre-model" stage, encompassing data acquisition, detailed transformations, and ethical governance. Simple conformance to FAIR (Findable, Accessible, Interoperable, Reusable) Principles is insufficient. Here, we define criteria and practices for reliable AI-readiness of biomedical data, developed by the NIH Bridge to Artificial Intelligence (Bridge2AI) Standards Working Group across seven core dimensions of dataset AI-readiness: FAIRness, Provenance, Characterization, Ethics, Pre-model Explainability, Sustainability, and Computability. Conformance to these criteria provides a basis for pre-model scientific rigor and ethical integrity, mitigating downstream risks of bias and error before AI modeling. We apply and evaluate these standards across all four Bridge2AI flagship datasets, spanning functional genomics to clinical medicine, and encode them in machine-actionable metadata bound to the datasets. This framework sets a benchmark for preparing ethical, reusable datasets in biomedical AI and provides standardized methods for reliable pre-model data evaluation.
bioinformatics | 2026-04-24 | v6

Characterization of selective pressures acting on protein sites with Deep Learning
Bergiron, E.; Nesterenko, L.; Barnier, J.; Veber, P.; Boussau, B.
Abstract
It is often useful, in the field of molecular evolution, to identify the selective pressures acting on a particular site of a protein to better understand its function. This is typically done with likelihood-based approaches applied to codon sequences in a phylogenetic context. However, these approaches are computationally costly. Here we adapt a linear transformer neural network architecture, previously shown to reconstruct accurate pairwise distances from sequence alignments, to identify selective pressures acting on individual amino acid sites. We design different versions of the architecture and train and test them on simulations. We compare one of our best models to state-of-the-art likelihood-based methods and find that it outperforms them when applied to data that resemble its training data, but performs less well on datasets that differ from those it was trained on. In all cases, our approach operates at a fraction of the computational cost of likelihood-based methods. These results suggest that such a neural network architecture can compare very favorably to state-of-the-art approaches for characterizing selection pressures acting on coding sequences, but that it must be trained on datasets representative of empirical data.
bioinformatics | 2026-04-24 | v4

Modeling causal signal propagation in multi-omic factor space with COSMOS
Dugourd, A.; Lafrenz, P.; Mananes, D.; Paton, V.; Fallegger, R.; Bai, Y.; Kroger, A.-C.; Turei, D.; Li, Y.; Trogdon, M.; Nager, D.; Deng, S.; Shen, C.; Lapek, J. D.; Shtylla, B.; Saez-Rodriguez, J.
Abstract
Understanding complex diseases requires approaches that jointly analyze omics data across multiple biological layers, including signaling, gene regulation, and metabolism. Existing data-driven multi-omics analysis methods, such as multi-omics factor analysis (MOFA), can identify associations between molecular features and phenotypes, but they are not designed to integrate existing mechanistic molecular knowledge, which can provide further actionable insights. We introduce an approach that connects data-driven analysis of multi-omics data with systematic integration of mechanistic prior knowledge using COSMOS+ (Causal Oriented Search of Multi-Omics Space). We show how factor analysis output can be used to estimate activities of transcription factors and kinases as well as ligand-receptor interactions, which in turn are integrated with network-level prior knowledge to generate mechanistic hypotheses about paths connecting deregulated molecular features. We apply this approach to a novel multi-omics dataset of cell line models of breast cancer resistance, to evaluate the ability of such mechanistic hypotheses to identify resistance drivers, as well as to a breast cancer patient cohort. Our approach offers an interpretable framework to generate actionable insights from multi-omic data, particularly suited for high-dimensional datasets.
bioinformatics | 2026-04-24 | v3

MetaTree: an interactive web platform for aligned hierarchical data visualization and multi-group comparison
Wu, Q.; Zhang, A.; Ning, Z.; Figeys, D.
Abstract
Background: Hierarchical quantitative profiles are widely used in microbiome studies and other domains. However, comparing multiple samples and experimental groups while preserving hierarchical structure remains challenging. Many existing workflows require extensive manual figure assembly or do not support aligned comparisons across conditions on a shared hierarchy. Results: We developed MetaTree, an open-source platform that runs in a web browser for interactive visualization and comparative analysis of hierarchical quantitative data. MetaTree anchors samples, groups, and contrasts between groups to a shared reference hierarchy, preserving one-to-one node correspondence so that the same clade is compared in the same position across views. In addition to visualization, MetaTree integrates statistical testing for comparisons between two groups with false discovery rate (FDR) control, enabling users to identify clades with consistent differences between conditions and interpret them in hierarchical context. MetaTree also provides user-configurable controls for visual encoding, filtering thresholds, label density, and layout, allowing figures to be adapted to different datasets and reporting needs. The interface remains usable for large hierarchies through interactive navigation, adaptive label handling, and branch collapsing. Conclusions: MetaTree is an installation-free web platform (https://byemaxx.github.io/MetaTree) for topology-consistent visualization and comparison of hierarchical profiles, supporting coordinated multi-panel exploration and automated comparison matrices to enable rapid generation of publication-ready figures for microbiome and other hierarchical datasets.
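Two-group comparisons with FDR control, as described above, typically rest on the standard Benjamini-Hochberg step-up procedure; a minimal sketch of that procedure (illustrative, not MetaTree's actual code):

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure: returns a boolean mask of
    hypotheses rejected at FDR level alpha."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order]
    # Find the largest k with p_(k) <= (k/m) * alpha; reject 1..k.
    thresh = alpha * np.arange(1, m + 1) / m
    below = ranked <= thresh
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        reject[order[: k + 1]] = True
    return reject
```

Each clade's p-value would enter this procedure once per two-group contrast, with rejected clades then interpreted in their hierarchical context.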
bioinformatics | 2026-04-24 | v3

Adaptive prediction intervals for polygenic risk scores reveal individual variation in genetic predictability
Wang, C.; Wang, F.; Bogdan, M.; Masala, M.; Fiorillo, E.; Devoto, M.; Cucca, F.; Belsky, D.; Ionita-Laza, I.
Abstract
Polygenic risk scores (PRS) are widely used in post-GWAS analyses to predict complex traits across humans, animals, and plants, yet the uncertainty of these predictions is rarely quantified at the individual level. Here, we introduce a framework for individualized uncertainty quantification based on quantile regression and conformal prediction, enabling the construction of prediction intervals with guaranteed coverage under minimal assumptions. Quantile regression enables adaptive, individual-specific prediction intervals that capture asymmetry and allow interval widths to vary substantially across individuals based on genetic information alone. Applying this framework to 62 traits in the UK Biobank and the ProgeNIA/SardiNIA studies, we show that these intervals maintain valid coverage and reduce uncertainty in risk stratification compared to existing methods, driven by their adaptive construction. Prediction interval width correlates positively with age and BMI, indicating reduced genetic predictability in subsets of the population where genetic effects interact with environmental factors. Our results demonstrate that incorporating uncertainty is essential for interpreting polygenic predictions and provide a principled approach to distinguish individuals whose phenotypes are well explained by genetic predictors from those in whom non-genetic influences dominate.
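Conformal prediction's coverage guarantee is easiest to see in its simplest split-conformal form; a fixed-width sketch (the paper's quantile-regression variant instead yields adaptive, individual-specific interval widths):

```python
import numpy as np

def split_conformal_interval(cal_pred, cal_true, test_pred, alpha=0.1):
    """Split conformal prediction with absolute-residual scores: returns
    (lower, upper) arrays with ~(1 - alpha) marginal coverage under
    exchangeability. Illustrative only; conformalized quantile regression
    replaces the constant q with per-individual quantile estimates."""
    scores = np.abs(cal_true - cal_pred)          # nonconformity scores
    n = scores.size
    # Finite-sample-corrected empirical quantile of calibration scores.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(scores, level, method="higher")
    return test_pred - q, test_pred + q
```

The guarantee is distribution-free: it needs only that calibration and test points are exchangeable, matching the "minimal assumptions" framing above.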
bioinformatics | 2026-04-24 | v3

A De Novo Algorithm for Allele Reconstruction from Oxford Nanopore Amplicon Reads, with Application to CYP2D6
Brown, S. D.; Dreolini, L.; Minor, A.; Mozel, M.; Wong, N.; Mar, S.; Lieu, A.; Khan, M.; Carlson, A.; Hrynchak, M.; Holt, R. A.; Missirlis, P. I.
Abstract
Oxford Nanopore Technologies' sequencing platform offers a path towards bedside genomics, producing long reads that can completely cover a gene of interest and thus detect any known or novel variant the gene contains. However, the analysis of these long reads to identify actionable genotypes remains challenging and typically requires customization depending on the target gene. Here, we describe a generic algorithm to accurately reconstruct allele sequences derived from long reads of genomic-amplicon origin. Rather than calling variants directly from these long reads, our method takes a "sequence-first" approach, performing an unbiased reconstruction of the underlying amplicon sequences to generate high-confidence reconstructed allele sequences. This is done without user input of the expected target gene, allowing any source amplicon to be reconstructed. These high-confidence reconstructed allele sequences are then compared to the genomic reference sequence of the gene to infer the specific diplotype present in the sample. This approach is agnostic to the number of genes and alleles present and readily detects novel variants. We demonstrate our approach using three independent datasets for CYP2D6, a diverse and complex gene with over 175 known alleles of clinical significance affecting drug dosing. We show how our approach can accurately recover validated CYP2D6 diplotypes from 20 Coriell samples sequenced using different primer sets, on different Oxford Nanopore Technologies flow cell versions, and to different depths. This includes inferring occurrences of copy number variation from relative abundances of each allele, a critical factor for ascribing functional effects to a diplotype. Further, we demonstrate our approach's utility for other genomic regions, including HLA.
bioinformatics | 2026-04-24 | v3

Gene-First Identity Construction for Robust Cell Identification in Single-Cell Transcriptomics
Yang, L.; Huang, Z.; Cai, J.; Xin, H.
Abstract
The precise delineation of cell types is fundamental to single-cell transcriptomics, yet current clustering pipelines often violate an axiomatic principle: hierarchical consistency. Existing methods measure cell-to-cell distances within a fixed global feature space, disregarding the fact that biological distinctions are inherently context-dependent: lineage separation requires different gene programs than subtype resolution does. Mathematically, this implies that the similarity metric itself should not be a static functional, but a pair-dependent energy functional evaluated within a specific Hilbert subspace determined by the biological comparison at hand. The challenge lies in the fact that allowing pair-dependent metrics typically destroys the global geometric consistency required for downstream analysis, unless the family of Hilbert subspaces is given strong biological structure. To resolve this geometric dilemma, we introduce GeCCo (Gene Co-expression Constructed identity), which constructs identities by projecting cells onto a rigorously derived hierarchy of gene programs. To construct this hierarchy, GeCCo first quantifies Boolean regulatory logic via the φ coefficient, and subsequently employs a greedy topological inference to organize genes based on their synergistic and antagonistic relationships. Benchmarking on human immune atlases demonstrates that GeCCo achieves superior hierarchical consistency, ensuring that globally inferred cell identities rigorously match locally refined subtypes. Furthermore, in pancreatic endocrine progenitors, GeCCo resolves a hidden mitotic bridge state, suggesting a concentrated division phase prior to differentiation. Ultimately, GeCCo shifts the paradigm from ad hoc clustering to programmatic cell typing, offering a mathematically grounded framework for scalable atlases of cellular discovery.
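The φ coefficient used here to quantify Boolean co-expression logic is the standard Pearson correlation of two binary variables; a minimal sketch on binarized expression calls (illustrative, not GeCCo's implementation):

```python
import math

def phi_coefficient(a, b):
    """Phi (mean square contingency) coefficient between two binary
    vectors: +1 for perfect co-occurrence (synergy), -1 for perfect
    mutual exclusion (antagonism), 0 for independence."""
    n11 = sum(1 for x, y in zip(a, b) if x and y)
    n10 = sum(1 for x, y in zip(a, b) if x and not y)
    n01 = sum(1 for x, y in zip(a, b) if not x and y)
    n00 = sum(1 for x, y in zip(a, b) if not x and not y)
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return 0.0 if denom == 0 else (n11 * n00 - n10 * n01) / denom
```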
bioinformatics | 2026-04-24 | v2

Ancestra: A lineage-explicit simulator for benchmarking B-cell receptor repertoire and lineage inference methods
Hassanzadeh, R.; Abdollahi, N.; Kossida, S.; Giudicelli, V.; Eslahchi, C.
Abstract
High-throughput B-cell receptor sequencing has transformed the analysis of adaptive immunity, but benchmarking clonal grouping and lineage reconstruction methods remains limited by the absence of datasets with known evolutionary histories. Here we present Ancestra, a lineage-explicit simulator of B-cell receptor heavy-chain affinity maturation. Ancestra models stochastic V(D)J recombination, context-dependent somatic hypermutation, affinity-based selection and clonal expansion while recording complete parent-child relationships and mutation events. The framework generates BCR heavy-chain sequence datasets together with their corresponding ground-truth lineage trees, enabling direct benchmarking of lineage-aware analytical methods. Across simulations, Ancestra recapitulates key properties of human repertoires, including complementarity-determining region 3 length distributions, amino-acid usage patterns, junctional mutation patterns consistent with IMGT criteria and heterogeneous branching topologies. Simulated lineages also reveal multi-label lineage trees, in which identical nucleotide sequences can arise independently along distinct evolutionary paths. Ancestra provides a practical foundation for rigorous benchmarking of lineage-aware immune repertoire analysis.
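A lineage-explicit simulation of this kind boils down to recording every parent-child edge and mutation event as sequences are copied; a toy sketch with uniform point mutations (Ancestra additionally models V(D)J recombination, context-dependent hotspots, and affinity-based selection):

```python
import random

def simulate_lineage(root_seq, generations=3, mut_rate=0.05, offspring=2,
                     alphabet="ACGT", seed=0):
    """Toy affinity-maturation lineage: each cell spawns `offspring`
    children with per-base point mutations; every parent-child edge and
    mutation event is recorded as ground truth."""
    rng = random.Random(seed)
    nodes = {0: root_seq}
    edges = []  # (parent_id, child_id, [(pos, old_base, new_base), ...])
    frontier, next_id = [0], 1
    for _ in range(generations):
        new_frontier = []
        for parent in frontier:
            for _ in range(offspring):
                seq = list(nodes[parent])
                muts = []
                for pos, base in enumerate(seq):
                    if rng.random() < mut_rate:
                        new = rng.choice([b for b in alphabet if b != base])
                        muts.append((pos, base, new))
                        seq[pos] = new
                nodes[next_id] = "".join(seq)
                edges.append((parent, next_id, muts))
                new_frontier.append(next_id)
                next_id += 1
        frontier = new_frontier
    return nodes, edges
```

Because the recorded edges fully determine each child sequence, any inferred tree can be scored directly against this ground truth, which is the benchmarking use case the abstract describes.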
bioinformatics | 2026-04-24 | v2

Turep: Detecting cross-cancer tumor-reactive T cells in single-cell and spatial transcriptomics data
Liu, W.; Tung, C.-H.; Sevick-Muraca, E. M.; Zhao, Z.
Abstract
Tumor-infiltrating lymphocytes are essential for anti-tumor immunity, yet distinguishing tumor-reactive T cells from non-reactive bystander cells remains a significant challenge. Existing signatures, often derived from single cohorts, lack robustness in cross-cancer prediction. We present Turep, a deep learning method designed for robust, cross-cancer prediction of tumor-reactive T cells using single-cell or spatial transcriptomics data. By integrating paired single-cell RNA and T cell receptor sequencing data from seven human malignancies, we identified a pan-cancer tumor-reactive gene signature and leveraged generative data augmentation to address data imbalance. Turep consistently outperformed existing biomarkers, achieving a mean area under the receiver operating characteristic curve of 0.870 across cancer types. In validation across diverse cohorts, we found that Turep-predicted tumor-reactive T cell proportions could predict clinical response to immunotherapy. Furthermore, extending Turep to spatial transcriptomics revealed that tumor-reactive T cells preferentially resided in spatial niches where target cells exhibited elevated antigen presentation. Overall, Turep provides a powerful, generalizable tool for identifying tumor-reactive T cells and their spatial architectures, facilitating personalized cancer immunotherapy strategies.
bioinformatics | 2026-04-24 | v1

CellPulse: A Foundation Model of Coordinated Gene Dynamics Simulating Viral Infectious Diseases
Liu, D.; Zhu, X.; Zhang, L.; Xu, D.; Lou, J.; Xiong, X.; Ren, Y.; Wu, Y.; Zhou, X.
Abstract
Understanding how cells respond to perturbations like viral infections requires models capturing coordinated gene dynamics. However, current gene expression foundation models rely predominantly on single-cell data and static gene expression, limiting their applicability in real clinical scenarios. We present CellPulse, a direction-aware foundation model trained on the Virus Stimulated Atlas (VISTA), a newly curated atlas of over 23 million bulk RNA-sequencing differential expression profiles from viral infections. CellPulse models the direction and magnitude of gene expression changes via a structured representation of differential expression and a direction-aware attention mechanism, enabling the learning of coherent regulatory programs. It shows powerful diagnostic capability, accurately classifying 31 distinct virus types across diverse clinical and laboratory samples solely from host transcriptional signatures. Crucially, without prior knowledge injection, CellPulse's interpretability reveals virus-associated host factors that mediate infection. Using a selection of host factors for in silico drug screening yielded numerous compounds with confirmed efficacies in wet-lab assays, while cell-based and animal experiments further verified the causal relationship between host targets and viral infections. Overall, CellPulse represents a generalizable foundation model for deciphering coordinated gene dynamics from bulk transcriptomics, bridging host response modeling with clinical relevance and therapeutic discovery for infectious diseases and beyond.
bioinformatics | 2026-04-24 | v1

A Systematic Evaluation of Single-Cell Batch Integration Metrics and sBEE: A Robust New Metric
Myradov, M.; HOUDJEDJ, A.; Tastan, O.; Kazan, H.
Abstract
Single-cell RNA sequencing (scRNA-seq) datasets generated across laboratories and experimental conditions often exhibit batch effects that obscure biological variation. Numerous computational methods for batch integration have been developed, making rigorous benchmarking critical. Evaluation metrics are central to assessing method performance; however, existing metrics capture only partial aspects of integration quality and often rely on implicit assumptions about cell distributions in the embedding space. Consequently, benchmarking studies frequently report discordant rankings of batch integration methods across metrics, complicating interpretation and method selection. Here, we systematically evaluate widely used metrics under controlled scenarios that isolate common integration challenges, including imbalanced batch composition, partial cell-type overlap, and varying cluster geometries. By stress-testing metrics under these scenarios, we identify the conditions under which each metric succeeds or fails. Based on these observations, we introduce sBEE (single-cell Batch Effect Evaluator), a unified metric that jointly evaluates cross-batch distance relationships and local neighborhood batch composition. Across diverse scenarios, sBEE provides stable assessments of mixing quality and remains robust to failure modes that affect existing metrics. Together, our work provides a systematic evaluation of batch integration metrics and introduces a unified metric for a more reliable assessment of integration quality.
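Metrics built on "local neighborhood batch composition" typically score the batch-label entropy of each cell's k-nearest neighbors; a brute-force sketch of that common building block (illustrative only, not sBEE itself):

```python
import numpy as np

def local_batch_entropy(X, batches, k=10):
    """Mean normalized Shannon entropy of batch labels among each cell's
    k nearest neighbors in the embedding: 1 = perfectly mixed batches,
    0 = no local mixing. Brute-force kNN for clarity; real pipelines use
    approximate neighbor search."""
    X = np.asarray(X, dtype=float)
    batches = np.asarray(batches)
    labels = np.unique(batches)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)           # exclude self from neighbors
    nn = np.argsort(d, axis=1)[:, :k]
    ents = []
    for row in nn:
        counts = np.array([(batches[row] == b).sum() for b in labels], float)
        p = counts / counts.sum()
        p = p[p > 0]
        ents.append(-(p * np.log(p)).sum() / np.log(len(labels)))
    return float(np.mean(ents))
```

As the abstract's stress tests illustrate, such a score alone can be misled by imbalanced batches or partial cell-type overlap, which is the motivation for combining it with cross-batch distance relationships.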
bioinformatics | 2026-04-24 | v1

MSAgent: An Evidence Grounded Agentic Framework for LLM-driven Scientific Exploration in Mass Spectrometry-based Metabolomics
Li, Y.; Zhong, Y.; Liu, P.; Yusheng, T.; Zhan, H.; Xia, J.
Abstract
Mass spectrometry (MS) is a cornerstone high-throughput technology for molecular discovery, yet the reliable elucidation of chemical structures remains a formidable, expert-dependent bottleneck. Currently, achieving reliable molecular identification from raw mass spectra requires manual assembly: a labor-intensive ordeal of heuristic reasoning and tedious integration of siloed computational tools, perpetuating a profound throughput gap between rapid data acquisition and the glacial pace of structural annotation. Here we present MSAgent, an autonomous agentic framework that bridges the gap between computational automation and expert intuition by emulating the cognitive logic of human specialists. By orchestrating an MSToolbox of over 50 domain-specific tools via Large Language Models (LLMs), MSAgent dynamically unifies the analytical pipeline into a scalable, evidence-grounded workflow, allowing for intent-aware planning, cross-resource output synthesis, and visual mechanistic interpretation within traceable reasoning chains and evidence-backed analytical reports. We evaluated MSAgent across multiple open benchmarks, including the established community challenges Critical Assessment of Small Molecule Identification (CASMI) 2016/2022 and CANOPUS, as well as LLM-oriented test cases. On CASMI, MSAgent consistently boosts retrieval performance by over 10% MRR across diverse benchmarks while ensuring high reliability, improving or preserving ranks in 95% of cases. For the more challenging de novo molecular tasks on CANOPUS, MSAgent builds on the outputs of baseline models with consistent refinement, yielding over a 40% average gain in Tanimoto similarity for ground-truth recovery. In addition, MSAgent shows marked advantages in eliminating hallucination relative to LLMs without domain tool support, producing better-calibrated confidence (Pearson r = 0.438 vs. -0.219 for gpt-4o). It improves the exact-match rate by 38.8% over gpt-4o in candidate discrimination tasks and achieves a 64% success rate in recommending high-quality candidate structures with Tanimoto similarity above 0.7, where gpt-4o predominantly selected candidates with similarity below 0.3. Our work enables high-throughput mass spectrometry data to be analyzed in an intent-driven, automated manner, lowering the analysis barrier for non-experts to obtain molecular identifications with a transparent analytical process, and accelerating discovery in metabolism and related fields by bridging the gap between experimental data acquisition and computational interpretation.
bioinformatics | 2026-04-24 | v1

SpatialQuery: scalable discovery and molecular characterization of multicellular motifs from spatial omics data
Hemberg, M.; An, S.; Gehlenborg, N.; Keller, M.
Abstract
Spatially resolved single-cell technologies enable profiling of cells in situ, yet computational approaches that jointly discover multicellular spatial patterns and characterize their molecular programs remain limited. Here we introduce SpatialQuery, a framework that can both identify cellular motifs, i.e. recurrent multicellular co-localization patterns, and perform molecular analyses focused on the motifs. It uncovers genes modulated by spatial contexts through differential expression analysis, and detects coordinated expression changes through covariation analysis. SpatialQuery can identify functional tissue units, and goes beyond pairwise analyses to characterize multicellular interactions. Applications to both spatial transcriptomics and proteomics data uncover cross-germ-layer signaling in gut tube patterning, disease-specific fibrotic and immunosuppressive niches in kidney and colon, and regional determinants of motif-associated transcriptional programs in a mouse brain atlas. SpatialQuery is available as a Python package, and we demonstrate how its light computational footprint enables integration into web-based cell atlas portals for interactive visualization and exploration.
bioinformatics | 2026-04-24 | v1

GenNA: Conditional generation of nucleotide sequences guided by natural-language annotations
Shen, Y.; Cao, G.; Wu, J.; Chen, D.; Feng, C.; Chen, M.
Abstract
Deciphering the mapping between linear biomolecular sequences and complex biological functions remains a central challenge in genomics. Although existing generative nucleotide language models have made substantial progress in modeling sequence distributions, they generally lack explicit access to high-level biological semantics, limiting their ability to support semantics-guided conditional generation. To address this limitation, we present GenNA, a generative nucleotide foundation model guided by natural-language annotations. GenNA is pretrained on a multimodal nucleotide-text corpus spanning 2,221 eukaryotic species and comprising approximately 416 billion characters, and learns the relationships between sequence patterns and functional annotations within a unified autoregressive framework. Systematic evaluations show that, even without explicit supervision from biological rules, GenNA yields distinguishable perplexity scores in response to semantic mismatches between sequences and functional annotations, to different mutation types, and to perturbations of species labels. Moreover, across a range of natural-language-guided nucleotide generation tasks, the model produces sequences consistent with both target semantics and species context. Overall, GenNA provides a unified framework for natural-language-guided nucleotide modeling and conditional generation, and offers a feasible route toward integrating high-level functional descriptions with low-level sequence design.
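The perplexity comparisons described above reduce to exponentiating the mean negative log-likelihood the model assigns per token, so that a mismatched annotation should raise the score relative to a matched one; a minimal sketch:

```python
import math

def perplexity(probs):
    """Perplexity of a sequence given the model's per-token probabilities:
    exp of the mean negative log-likelihood. Lower means the model is
    less 'surprised' by the sequence under the given conditioning."""
    nll = -sum(math.log(p) for p in probs) / len(probs)
    return math.exp(nll)
```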
bioinformatics | 2026-04-24 | v1

PanVariants: Best Practice for Pangenome-based Variant Calling Pipeline and Framework
Yi, H.; Wang, L.; Chen, X.; Ding, Y.; Carroll, A.; Chang, P.-C.; Shafin, K.; Xu, L.; Zeng, X.; Zhao, X.; Gong, M.; Wei, X.; Hou, Y.; Ni, M.
Abstract
Background: Although pangenome references offer richer population diversity than linear references, current mainstream pangenome-based variant callers are limited to detecting only known variants stored in the graph. To address this limitation, we developed PanVariants, a novel pipeline designed to accurately detect both known and novel variants. We systematically evaluated its performance against the traditional linear alignment solution (BWA+GATK/Manta) and the existing pangenome-aware solution (DRAGEN/PanGenie) in three contexts: small variant (SNV/indel) and structural variant (SV) accuracy in Genome in a Bottle samples, clinical detection on positive samples, and application in cohort-based joint calling. Results: By integrating k-mer-based and mapping-based methods, PanVariants significantly reduced variant errors (FPs + FNs), achieving a 73% reduction compared to BWA+GATK and a 45% reduction compared to DRAGEN for SNVs. Retraining the DeepVariant model with high-quality DNBSEQ data further decreased errors by 15%. For SV detection, PanVariants attained an F1-score of 89.39%, markedly outperforming DRAGEN (68.18%) and BWA+Manta (58.33%) and approaching long-read sequencing performance (95.22%). In validation using clinical positive samples, PanVariants successfully detected all expected pathogenic variants while PanGenie failed. In the cohort joint-calling analysis, PanVariants detected more variants, made fewer Mendelian inheritance errors, and gave better per-sample accuracy than GATK. Conclusions: PanVariants establishes a robust framework and best-practice pipeline for pangenome-based variant detection, achieving both sensitive novel variant discovery and high accuracy for SNVs, indels, and SVs. Our systematic evaluation of optional processing steps and input variables offers practical guidance for users. Validated across diagnostic and population-based applications, our findings strongly support the transition from linear to pangenome references in future genomics.
bioinformatics | 2026-04-24 | v1

Small Area Estimation of Forest Volume Using Mixed Effects Random Forests and Multi-Source Remote Sensing Data
Vangi, E.
Abstract
Accurate estimation of forest growing stock volume (GSV) at fine spatial scales is essential for sustainable forest management, carbon accounting, and local decision-making. However, traditional forest inventories often lack sufficient sampling density to provide reliable estimates for small areas. This study evaluates two small area estimation (SAE) approaches for estimating GSV at the forest stand level using multi-source remote sensing data: the Empirical Best Predictor (EBP), based on a nested-error linear regression model, and the Mixed-Effects Random Forest (MERF). The analysis was conducted in the Vallombrosa Nature Reserve (Italy), integrating field measurements from 101 plots with auxiliary variables derived from Sentinel-2 imagery and airborne LiDAR. Both methods were applied to estimate the mean and total GSV across 658 forest stands, many of which lacked direct observations. Model performance was assessed using spatial cross-validation, and uncertainty was quantified using root-mean-square error (RMSE). Results show that MERF outperformed EBP in predictive accuracy, achieving higher R² (0.67 vs. 0.37) and lower RMSE (151 m³ ha⁻¹ vs. 202 m³ ha⁻¹). MERF also produced more stable and precise uncertainty estimates, with improved coverage of observed values. While both methods yielded comparable total GSV estimates, EBP exhibited greater variability and sensitivity to model assumptions. In contrast, MERF effectively captured non-linear relationships and handled multicollinearity among predictors, though at the cost of reduced interpretability and higher computational demand. Overall, the findings highlight the advantages of integrating machine learning with mixed-effects modeling for SAE in forestry, particularly under conditions of sparse sampling and complex ecological variability.
bioinformatics | 2026-04-24 | v1

scConcept enables concept-level exploration of single-cell transcriptomic data
Chen, H.; Li, Y.
Abstract
Interpreting high-dimensional single-cell transcriptomic data remains challenging, as existing methods rely on latent representations or prior knowledge that require extensive post hoc analysis to derive biologically meaningful insights. Topic models provide interpretable gene-level signals but often produce redundant and coarse-grained programs that are difficult to translate into coherent biological concepts. While recent foundation models and large language models (LLMs) show promise, they are not readily applicable to large-scale single-cell data or fail to provide structured, cell-level interpretations. Here we present scConcept, a framework that introduces concept-level representation by transforming gene-level topic representations into structured, human-interpretable biological concepts. By integrating neural topic modeling with LLMs, scConcept distills fragmented gene programs into semantically coherent concepts defined by a biological label, description, and gene set, and quantitatively maps them back to individual cells. Across 16 single-cell datasets, scConcept improves clustering performance by 27.1% and interpretability by 50.7% over state-of-the-art methods. These concept-level representations enable interpretable cell-state annotation and capture gene programs that generalize across datasets. In cancer applications, scConcept identifies clinically relevant programs associated with tumor progression and patient survival, and links them to candidate therapeutic targets. Together, scConcept establishes concept-level representation as a general and scalable abstraction for interpretable single-cell analysis.
bioinformatics | 2026-04-24 | v1

Probabilistic coupling of cellular and microenvironmental heterogeneity by masked self-supervised learning
Kojima, Y.; Tanaka, Y.; Hirose, H.; Chiwaki, F.; Nishimura, K.; Hayashi, S.; Itahashi, K.; Ishikawa, M.; Shimamura, T.; Mano, H.
Abstract
Spatial omics technologies have advanced to single-cell resolution, enabling systematic analysis of tissue microenvironments alongside cellular-state heterogeneity. However, computationally defining microenvironmental states at single-cell resolution and identifying representations most informative for biological discovery remain major challenges. Here we present Mievformer, a Transformer-based masked self-supervised framework that learns microenvironmental embeddings by encoding neighboring cellular states and relative spatial configurations to parameterize the conditional distribution of continuous cell states at central spatial positions. Through InfoNCE optimization, Mievformer learns representations that capture the relative enrichment of cell states across microenvironments, formalized as a conditional density ratio, thereby enabling probabilistic inference of the coupling between microenvironmental and cellular heterogeneity. Mievformer outperformed existing methods in niche clustering on simulated spatial transcriptomics data and achieved the highest average performance across five real datasets spanning three spatial transcriptomics platforms when evaluated using DREC, a ground-truth-free metric that most strongly correlated with ground-truth performance in simulations. Beyond conventional clustering, Mievformer enables identification of cellular subpopulations based on their microenvironmental distribution and detection of gene-expression signatures associated with colocalization of specific cell populations. Together, these results establish Mievformer as a quantitatively robust and biologically informative framework for learning microenvironment representations in spatial omics.
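The InfoNCE objective mentioned above contrasts each anchor's true positive against in-batch negatives under a softmax over similarities; a generic NumPy sketch (the standard objective, not Mievformer's exact parameterization):

```python
import numpy as np

def info_nce(anchor, positives, temperature=0.1):
    """InfoNCE loss for paired embeddings: anchor i must identify
    positives[i] among all positives in the batch (in-batch negatives).
    Rows are L2-normalized so similarity is cosine / temperature."""
    a = anchor / np.linalg.norm(anchor, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                 # (n, n) similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))     # -log p(correct pair)
```

Minimizing this loss drives matched pairs together relative to mismatched pairs, which is what lets the learned scores be read as a conditional density ratio.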
bioinformatics · 2026-04-24 · v1
Efficient and scalable modelling of cotranscriptional RNA folding with deterministic and iterative RNA structure sampling
Courtney, E.; Choi, E.; Ward, M.; Lucks, J. B.
Abstract
RNA structure sampling is central to modelling RNA ensembles, yet stochastic sampling methods are non-exhaustive, scale poorly, and are biased towards low-free-energy structures, while current suboptimal folding approaches generate an unpredictable exponential number of structures. These limitations are particularly problematic for modelling cotranscriptional folding, where vectorial synthesis continuously reshapes the energy landscape during transcription, stabilising transient out-of-equilibrium structures. Here we introduce iterative sampling, a deterministic framework that enumerates unique RNA secondary structures in strict order of increasing free energy, enabling progressive and exhaustive exploration of the structure space up to an arbitrary stopping criterion. To implement this approach, we developed two scalable algorithms, iterative deepening and a persistent data structure approach, that incrementally traverse the expansion tree by evolving partial structures in place, avoiding redundant recomputation and fixed energy windows. Implemented in memerna, this approach achieves orders-of-magnitude speedups over existing tools (10x over ViennaRNA; 100x over RNAstructure). Integration within the sample-and-select framework (R2D2) improves structural diversity and identifies conformations with greater agreement with experimental data. Comprehensive sampling further enables direct comparison of equilibrium and cotranscriptionally restrained ensembles. Analysis of the resulting structural probability distributions uncovers kinetic traps and putative transcriptional pause sites, supporting an intuitive cotranscriptional folding mechanism in which local 3'-hairpin formation transiently stabilises upstream structure to delay large-scale rearrangement. Together, these results establish iterative sampling as a scalable and general framework for resolving out-of-equilibrium RNA cotranscriptional folding.
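Enumerating structures in strict order of increasing free energy is, in spirit, a best-first traversal of an expansion tree. A deliberately generic toy version (not memerna's algorithm or data structures) that lazily yields subsets of independent energy increments in nondecreasing total energy:

```python
import heapq

def subsets_by_energy(deltas):
    """Lazily yield (total_energy, chosen_indices) over all subsets of
    independent nonnegative energy increments, in nondecreasing total.
    Each subset is produced exactly once via a best-first heap."""
    d = sorted(deltas)
    yield 0, ()                       # the empty (ground-state) subset
    if not d:
        return
    heap = [(d[0], (0,))]
    while heap:
        s, subset = heapq.heappop(heap)
        yield s, subset
        j = subset[-1]
        if j + 1 < len(d):
            # extend the subset with the next increment ...
            heapq.heappush(heap, (s + d[j + 1], subset + (j + 1,)))
            # ... or replace the last increment with the next one
            heapq.heappush(heap, (s - d[j] + d[j + 1], subset[:-1] + (j + 1,)))
```

Because the generator is lazy, exploration can stop at an arbitrary energy cutoff or structure count, mirroring the "arbitrary stopping criterion" described above.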
bioinformatics · 2026-04-24 · v1
Verticall: A fast and robust tool for recombination detection in large-scale bacterial genomic datasets
Odih, E. E.; Wick, R. R.; Holt, K. E.
Abstract
The inference and removal of horizontally acquired genomic regions is a crucial step in phylogenomic analyses for evolutionary studies. Existing tools perform well on clonal lineage-focused datasets on the scale of hundreds of genomes, but are limited in their ability to analyse larger or more diverse datasets. Here we present Verticall, a tool to identify recombinant regions in bacterial assemblies and generate recombination-free phylogenies, which scales to thousands of genomes from clonal to genus-level diversity. Verticall uses a non-parametric approach to assign genomic regions as horizontally or vertically related based on the distribution of pairwise genetic distances between genomes. Recombination-free phylogenetic trees may be inferred either by calculating a pairwise genetic distance matrix from vertical-only regions (distance-tree approach) or by pairwise comparison of all genomes to a reference and then masking horizontally acquired regions in a pseudo-alignment to the reference (alignment-tree approach). We demonstrate Verticall's performance using four publicly available whole-genome sequence datasets of varying sample sizes (154 to 4,857 genomes) and evolutionary scales (ranging from within-lineage to genus-wide diversity). Across all four datasets, Verticall showed comparable or superior performance to the established tools Gubbins and ClonalFrameML in terms of computational efficiency, plausibility of inferred phylogenetic trees, and recovery of temporal signal for molecular dating. Our results show that Verticall is a useful tool to more efficiently and accurately detect recombination, particularly when applied to datasets for which existing tools are limited, including large datasets with hundreds to thousands of genomes and those that span entire species or genera. Verticall is available free and open source at https://github.com/rrwick/Verticall.
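The intuition behind distance-based recombination detection can be caricatured in a few lines: windows whose pairwise SNP distance sits far above the genome-wide bulk are candidates for horizontal acquisition. This is a toy stand-in under that assumption, not Verticall's actual non-parametric procedure:

```python
import statistics

def classify_windows(window_distances, fold=4.0):
    """Toy vertical/horizontal assignment: flag alignment windows whose
    pairwise SNP distance greatly exceeds the genome-wide median as
    putatively recombinant (horizontal)."""
    med = statistics.median(window_distances)
    cutoff = fold * max(med, 1.0)   # guard against an all-zero median
    return ["horizontal" if d > cutoff else "vertical"
            for d in window_distances]
```

Masking the "horizontal" windows before distance or alignment tree building is the general shape of both approaches described above.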
bioinformatics · 2026-04-24 · v1
Systematic Evaluation of AlphaFold2 and OpenFold3 on Protein-Peptide Complexes
Fayetorbay, R.; Timucin, A. C.; Timucin, E.
Abstract
Protein-peptide interactions are important mediators of diverse biological processes. While deep learning has revolutionized protein structure prediction, comparative evaluation of these methods, specifically for protein-peptide complexes, remains an area of active investigation. Here, we present a systematic benchmarking of AlphaFold2 (AF2) and OpenFold3 (OF3) on a curated, non-redundant dataset of 271 protein-peptide complexes evaluated under CAPRI peptide criteria, partitioned into disordered (IDR) and structured (Non-IDR) peptide subsets. Results show that AF2 consistently outperformed OF3 across both subsets in overall success rate and proportion of high-quality models, while both methods exhibited comparable global fold prediction accuracy. We further demonstrate that AF2 exhibited memorization on a large set of protein-peptide complexes that were in its training data. Analysis of built-in and post-hoc confidence scores demonstrated that PAE-derived metrics, particularly pDockQ2, LIS, and ipSAE, provided the most reliable proxies for structural accuracy in AF2 predictions, whereas OF3's PAE distributions substantially diminished the discriminative power of its derived scores. Furthermore, we find that canonical DockQ threshold cutoffs for protein-protein complexes are not directly transferable to protein-peptide complexes, underscoring the need for method- and dataset-specific calibration. Peptide sequence composition and length were identified as potential modulators of prediction success, with glycine-rich short peptides and long receptors posing challenges to both methods. Collectively, these findings establish a peptide-specific evaluation framework and highlight the need for dataset/method-calibrated metrics to support the continued development of structure prediction tools for protein-peptide interactions.
bioinformatics · 2026-04-24 · v1
H2O: A Foundation Model Bridging Histopathology to Spatial Multi-Omics Profiling
Gu, Y.; Wu, Z.; Yan, R.; Wang, Z.; Li, Y.; Lin, S.; Cui, Y.; Lai, H.; Luo, X.; Zhou, S. K.; Yuan, Z.; Yao, J.
Abstract
Spatial omics technologies have revolutionized the molecular profiling of tissues but remain constrained by high costs and limited scalability. While hematoxylin and eosin (H&E) staining is ubiquitous, it lacks molecular specificity. Here, we present H2O, a generalist AI framework that bridges the modality gap between histopathology and spatial multi-omics, enabling the direct inference of spatial transcriptomics (ST) and proteomics (SP) landscapes from routine H&E images. H2O integrates Vision Transformers (ViTs) with Large Language Models (LLMs) via contrastive learning to align histological morphology with semantic molecular knowledge. This cross-modal approach allows the model to incorporate spatial expression profiles into histological pattern recognition, effectively decoding the molecular heterogeneity underlying tissue morphology. Trained on a pan-tissue dataset of 1.3 million paired H&E-spatial patches across 25 organs and cancer types, H2O predicts spatial omics expression from histology with high concordance to sequenced measurements and consistently outperforms state-of-the-art models across three cancer benchmarks. Notably, H2O recovers the MIF-CD74/CD44 signaling axis directly from H&E images, highlighting its capacity to infer biologically meaningful cell-cell communication without molecular profiling. Applied to three additional public cohorts covering fetal and paediatric thymus tissues, human metastatic lymph node, and breast cancer, encompassing human development, 3D spatial frameworks, and integrative multi-omics, H2O yields biologically concordant insights, demonstrating superior accuracy, robustness, and generalizability across real-world applications in diverse scenarios. H2O converts routine histopathology into a portal for spatially resolved multi-omics profiling by computationally generating transcriptomic and proteomic landscapes, thereby enhancing tissue phenotyping and enabling scalable, integrative tissue-atlas construction.
bioinformatics · 2026-04-24 · v1
Genomic dialects: How amino acid properties and the second codon base shape the informational accents of life
Martinez, O.; Ochoa-Alejo, N.
Abstract
Codon Usage Bias (CUB) is a fundamental feature of genomic architecture, reflecting a balance between mutational pressure and natural selection. We propose a "genomic dialects" framework, where species-specific CUB profiles represent "informational accents" constrained by biochemical and structural requirements. Utilizing a normalized informational index based on Shannon's entropy, we analyzed CUB profiles for 18 amino acids across 1,406 species from the three domains of life. Linear models were employed to investigate the relationship between CUB and physicochemical properties, including Saier's second-codon-base classification, molecular volume, hydrophobicity, aliphatic/aromatic status, and dissociation constants. CUB distributions are highly skewed, with >52% of values below 0.1, suggesting a near-optimal use of the genetic code's potential. We demonstrate that amino acid properties significantly influence CUB, with Saier's classification explaining up to 69% of variance in Archaea and ~47% across all taxa. Hydrophobic amino acids (Q1 class) consistently exhibit higher average CUB than hydrophilic ones, particularly in microbes. Individual species models reveal extreme correlations; for example, in the alga Chlamydomonas reinhardtii, Saier classes explain >95% of CUB variance. Finally, we show that CUB-based dendrograms represent phenetic similarity ("genomic accents") rather than reliable phylogenetic reconstructions, as they rarely coincide with the true Tree of Life. Our findings indicate that the "rules" of genomic dialects are largely anchored in the dual requirements of translational fidelity and protein stability. The observed "informational accents" are proximately governed by the metabolic and genomic machinery under the constraints of the drift-barrier hypothesis. This study provides a robust framework for understanding how the physical realities of amino acids have shaped the evolution of the genetic code's informational use across the tree of life.
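The abstract does not spell out the normalized informational index; a common entropy-based formulation (an assumption here) is 1 - H/Hmax computed over an amino acid's synonymous codons:

```python
import math

def cub_index(codon_counts):
    """Normalized codon usage bias for one amino acid: 1 - H/Hmax over
    its synonymous codons. 0 = perfectly even usage, 1 = a single codon
    used exclusively. Undefined (returned as 0) for one-codon amino acids."""
    total = sum(codon_counts.values())
    k = len(codon_counts)
    if k < 2 or total == 0:
        return 0.0      # Met and Trp have a single codon; no bias definable
    h = -sum((c / total) * math.log2(c / total)
             for c in codon_counts.values() if c > 0)
    return 1.0 - h / math.log2(k)
```

Under this index, the reported skew (>52% of values below 0.1) corresponds to codon usage close to the maximum-entropy, even-usage extreme.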
bioinformatics · 2026-04-24 · v1
SNooPy: a statistical framework for long-read metagenomic variant calling
Faure, R.; Faure, U.; Truong, T. M. K.; Derzelle, A.; Lavenier, D.; Flot, J.-F.; Quince, C.
Abstract
Current long-read single-nucleotide variant callers were designed primarily for genomic data, particularly human genomes. While some have been used on metagenomic data, their underlying assumptions and training procedures fail to account for the inherent complexity of metagenomic samples. To date, no long-read variant caller has been purpose-built for metagenomic applications. To address this gap, we present SNooPy, a SNP-calling tool that implements a new statistical framework tailored to long-read metagenomic data. Unlike previous genomic methods, our approach makes no assumptions about the number of haplotypes present, their evolutionary relationships, or their sequence divergence. We demonstrate that SNooPy outperforms both traditional statistical and deep learning-based SNP callers. Our results suggest that future integration of this framework with deep learning approaches could further enhance variant calling performance. SNooPy is freely available at github.com/rolandfaure/snoopy.
bioinformatics · 2026-04-20 · v2
Evaluation of deep learning tools for chromatin contact prediction
Nguyen, T. H. T.; Vermeirssen, V.
Abstract
Background: Three-dimensional chromatin organization plays a central role in gene regulation and is commonly measured using Hi-C technology. Recently, deep learning models have been developed to predict Hi-C contact maps from genomic and epigenomic features, offering a computational alternative to costly experimental assays. However, the performance, robustness, and biological interpretability of these models remain unclear due to the absence of systematic benchmarking. Results: We present a comprehensive benchmark to evaluate five Hi-C prediction models, C.Origami, Epiphany, ChromaFold, HiCDiffusion, and GRACHIP, across multiple evaluation criteria, including accuracy, visual fidelity and loop detection. Among all models, Epiphany achieved the strongest overall performance, combining high accuracy, cell-type generalization, realistic image quality and reliable loop detection. Moreover, we evaluated predicted contact maps using four different loop-callers to assess the impact of model choice on loop detection performance. Despite the coarse resolution, many models could recover biologically relevant interactions. Notably, structural map quality was more critical than the choice of loop-caller for reliable detection. Finally, ablation analyses revealed that epigenomic signals are influential features for accurate Hi-C prediction. Despite the use of multiple input modalities in many models, only a limited subset contributed substantially to predictive performance. Conclusions: This study provides a systematic comparison of deep learning models for Hi-C prediction and highlights the importance of specific regulatory signals in reconstructing 3D chromatin organization. The proposed evaluation framework clarifies model behaviours and offers guidance for the development and interpretation of Hi-C prediction methods.
bioinformatics · 2026-04-20 · v2
LagCI Enables Inference of Temporal Causal Relationships from Dense Multi-Omic Time Series
Ge, Y.; Bai, S.; Qiang, Z.; Liu, Y.; Wu, Y.; Shen, X.
Abstract
Inferring causal relationships from time-series data is critical for uncovering the dynamics of biological regulation. However, in multi-omics studies, this task is often hampered by sparse temporal sampling and the limitations of existing methods. To address this, we developed Lagged-Correlation Based Causal Inference (lagCI), a computational framework designed to identify time-lagged associations by combining comprehensive lag-correlation profiling with a robust statistical filtering scheme. Rather than relying on simple cross-correlation, lagCI analyzes the entire correlation profile and applies a quality-scoring system to filter out spurious associations that often plague high-dimensional datasets. We first tested lagCI on wearable physiological data, where it successfully captured the well-known causal link between physical activity and heart rate, even accounting for variations in lag times between individuals. Moving to high-frequency human multi-omics, we used lagCI to build a directed network of 1,624 molecules connected by over 157,000 predicted interactions. This network didn't just mirror established biology (such as cytokine-hormone crosstalk); it also pointed to specific molecular hubs that seem to orchestrate the timing of metabolic and immune responses. Overall, lagCI provides a data-driven way to extract temporal insights from dense longitudinal omics. We've made the tool available as an R package with multiple interfaces to ensure it's accessible for both bioinformaticians and clinicians.
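The core idea of lag-correlation profiling, correlating one series against time-shifted copies of another and inspecting the whole profile, can be illustrated with a minimal NumPy sketch (hypothetical helper names; lagCI's actual scoring and filtering are richer):

```python
import numpy as np

def lag_correlation_profile(x, y, max_lag=10):
    """Pearson correlation of x(t) with y(t + lag) for each lag in
    [-max_lag, max_lag]. A strong peak at a positive lag suggests
    x leads y by that many time steps."""
    n = len(x)
    profile = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = x[:n - lag], y[lag:]
        else:
            a, b = x[-lag:], y[:n + lag]
        profile[lag] = float(np.corrcoef(a, b)[0, 1])
    return profile

def best_lag(profile):
    """Lag with the largest absolute correlation in the profile."""
    return max(profile, key=lambda lag: abs(profile[lag]))
```

A framework like lagCI then has to decide which peaks are trustworthy, which is where the quality scoring and statistical filtering described above come in.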
bioinformatics · 2026-04-20 · v2
Natively entangled proteins are linked to human disease and pathogenic mutations likely due to a greater misfolding propensity
Anglero Mendez, M. F.; Sitarik, I.; Vu, Q. V.; Totoo, P.; Stephenson, J. D.; Song, H.; O'Brien, E. P.
Abstract
A recently discovered class of protein misfolding involving native entanglements could be a widespread mechanism by which loss-of-function diseases arise. Here, we test that hypothesis by examining if there is any statistical association between proteins predisposed to misfold in this way and a database of gene-disease relationships. We find that globular proteins containing non-covalent lasso entanglements (NCLEs) in their native structure, which are more prone to misfolding, are 61% more likely to be associated with disease, 68% more likely to harbor pathogenic missense mutations, and their misfolding-prone entangled regions are 64% more likely to harbor pathogenic missense mutations. Protein refolding simulations indicate that these disease-associated, natively entangled proteins are 2.5-times more likely to misfold than comparable non-disease proteins that lack native NCLEs. These results indicate that native entanglement misfolding, especially in the presence of missense mutations, has the potential to contribute to a wide variety of diseases. More broadly, these findings open an entirely new space of therapeutic targets in which drugs are designed to avoid these misfolded states and increase the amount of folded, functional protein.
bioinformatics · 2026-04-20 · v1
Genome-wide identification and characterization of the NAC transcription factor family in Cynodon dactylon and their expression during abiotic stresses
Poudel, A.; Wu, Y.
Abstract
Common bermudagrass (Cynodon dactylon) is a highly resilient and cosmopolitan grass widely used for turf, forage, and soil stabilization. Although its genome has been sequenced, few studies have focused on characterizing the genes underlying its resilience, including the NAC transcription factor (TF) family, which is well known for its physiological and stress-related functions. This study aimed to systematically characterize NAC TF genes in the bermudagrass genome and assess their potential roles in abiotic stress tolerance. A total of 237 CdNAC genes were identified and phylogenetically classified into 14 groups, including 40 members in the NAM/NAC1 class, which is associated with plant growth and development, and 23 members in the SNAC class, which is associated with stress responses. Tissue-specific RNA-seq analysis indicated that about one-fourth of CdNAC genes were expressed across all tissues, whereas 13 genes showed relatively higher expression in roots and 9 in inflorescence, suggesting both essential and specialized functions. Stress-responsive expression profiling revealed that 35 CdNAC genes were upregulated in response to drought, 43 to heat, 10 to salt, and 42 to submergence stress. Notably, CdNAC122, 149, and 155, members of the SNAC class, were consistently upregulated across all stress conditions, while others exhibited stress-specific expression, such as CdNAC37, 130, 145, and 199 in drought, CdNAC7, 12, 18, and 29 in heat, CdNAC46 and 151 in salt, and CdNAC9 and 31 in submergence. In contrast, 53 genes were downregulated during different stresses, with most belonging to the NAM/NAC1, TERN, or OsNAC7 classes, possibly reflecting suppression of photosynthesis- and development-related processes under stress.
These results provide the first comprehensive characterization of CdNAC genes, reveal their distinct regulatory roles in abiotic stress responses, and establish a foundation for future functional validation and applications in breeding of stress-resilient bermudagrass.
bioinformatics · 2026-04-20 · v1
KIR*BLOOM: Accurate KIR genotyping using a new copy number-aware integrated genotype likelihood framework
Gohar, Y.; Garcia, A. D.; Kichula, K. M.; Norman, P. J.; Dilthey, A. T.
Abstract
Killer-cell immunoglobulin-like receptor (KIR) genes, key modulators of natural killer (NK) cell activity, play critical roles in immune response and disease susceptibility. Accurate KIR genotyping from short-read sequencing data remains challenging because of high sequence similarity among genes, extensive copy number variation, and substantial allelic diversity. Here, we present KIR*BLOOM, a likelihood-based approach for KIR genotyping from short-read data that models read depth and sequencing error across alternative genotype configurations. KIR*BLOOM first identifies KIR-relevant read pairs, maps them to a KIR allele database, and reduces the candidate allele space by excluding alleles unlikely to be present. It then infers gene copy number and selects alleles under the inferred copy-number constraints. Finally, variant calling is used to refine CDS sequences and identify potential novel alleles. We evaluated performance on 45 whole-genome sequencing samples with haplotype-resolved assemblies from the HPRC or HGSVC, using Immuannot-derived annotations as ground truth. KIR*BLOOM achieved 99.85% precision, 99.92% recall, and a Jaccard index of 99.77% for copy-number inference. At five-digit allele resolution, it achieved 92.73% precision, 92.69% recall, and an 87.29% Jaccard index, outperforming T1K, GraphKIR, and Geny. Together, these results demonstrate that KIR*BLOOM enables highly accurate KIR genotyping from short-read sequencing data.
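The read-depth component of such a likelihood model can be caricatured with a Poisson sketch. This is an illustrative assumption, not KIR*BLOOM's actual model: observed depth over a gene is modeled as Poisson with mean equal to copy number times haploid coverage, and the maximum-likelihood copy number is selected:

```python
import math

def copy_number_loglik(depth, haploid_mean, max_cn=4):
    """Log-likelihood of an observed read depth under each candidate copy
    number, modeling depth ~ Poisson(cn * haploid coverage). A small
    floor on the mean handles cn = 0 (gene absence with noise reads)."""
    logliks = {}
    for cn in range(max_cn + 1):
        lam = max(cn * haploid_mean, 0.1)
        logliks[cn] = depth * math.log(lam) - lam - math.lgamma(depth + 1)
    return logliks

def map_copy_number(depth, haploid_mean):
    """Copy number with the highest likelihood for the observed depth."""
    ll = copy_number_loglik(depth, haploid_mean)
    return max(ll, key=ll.get)
```

In a full framework this per-gene depth term would be combined with allele-level read mapping evidence and sequencing-error terms across alternative genotype configurations.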
bioinformatics · 2026-04-20 · v1
Longitudinal Phylogenetic Inference of Copy Number Alterations and Single Nucleotide Variants from Single-Cell Sequencing
Kulman, E.; Kuang, R.; Morris, Q.
Abstract
Longitudinal phylogenetic reconstruction reveals how cancers evolve over time and respond to treatments. Advances in targeted single-cell sequencing, combined with longitudinal sampling, now enable detailed longitudinal tracking of single nucleotide variants (SNVs) and copy number alterations (CNAs) at single-cell resolution. Here, we introduce LoPhy, the first method designed to reconstruct the evolution of SNVs and CNAs from these new longitudinal single-cell data. LoPhy is a sequential tree-building algorithm that reconstructs longitudinally-consistent phylogenies of SNVs and CNAs by maximizing a new factorized tree reconstruction objective. The algorithm incrementally grows a clone tree, adding SNVs and CNAs in the order they are observed across time points. Applied to a cohort of 15 acute myeloid leukemias (AMLs) and 4 TP53-mutated AMLs, LoPhy produced phylogenies that are biologically and temporally consistent with clinical observations, with many inferred CNAs validated by orthogonal bulk sequencing from the same cancer. These reconstructions highlight the role of CNAs in disease progression and resistance, revealing that AML clones selected after therapy are often defined by both large-scale CNAs and SNVs. More broadly, LoPhy can help uncover how SNVs and CNAs jointly shape the evolutionary trajectories of individual cancers at single-cell resolution. The LoPhy source code is available under a CC-BY-ND license at https://github.com/ethanumn/LoPhy.
bioinformatics · 2026-04-19 · v3
DOME Copilot: Making transparency and reproducibility for artificial intelligence methods simple
Farrell, G.; Attafi, O. A.; Fragkouli, S.-C.; Heredia, I.; Fernandez Tobias, S.; Harrison, M.; Hermjakob, H.; Jeffryes, M.; Obregon Ruiz, M.; Pearce, M.; Pechlivanis, N.; Lopez Garcia, A.; Psomopoulos, F.; Tosatto, S. C. E.
Abstract
Unprecedented breakthroughs are being made in life science research through the application of artificial intelligence (AI). However, adherence to method reporting guidelines is necessary to support their reusability and reproducibility. The DOME Copilot solution extracts structured reports of AI methods using a large language model to help interpret manuscripts. It is a fast and efficient resource capable of scaling to annotate the corpus of global AI literature, unlocking value and trust in published methods.
bioinformatics · 2026-04-19 · v1
Pan-cancer survival modeling reveals structural limits of genomic feature integration in immunotherapy outcomes
Hassan, W.; Adeleke, S.
Abstract
Background: Immune checkpoint inhibitors (ICIs) have improved outcomes across multiple cancer types, yet reliable predictors of survival remain limited. While genomic features such as tumor mutational burden (TMB) are widely used, their contribution to predictive modeling in heterogeneous real-world cohorts remains unclear. We evaluated the relative contributions of clinical and whole-genome sequencing (WGS) features in pan-cancer survival modeling. Methods: We analyzed 658 ICI-treated patients with matched WGS data from Genomics England. Using a leakage-controlled machine learning framework with strict train-test separation, we compared four models: TMB-only, clinical-only, clinical+TMB, and an integrated 11-feature clinico-genomic XGBoost survival model. Model performance was assessed using Harrell's concordance index (C-index) with bootstrap confidence intervals. Results: TMB alone demonstrated near-random discrimination (C-index 0.50; 95% CI 0.44-0.56). Clinical variables substantially improved predictive performance (0.59; 95% CI 0.53-0.64), with marginal gain from adding TMB (0.59). The integrated model achieved a C-index of 0.60 (95% CI 0.55-0.65). While improvement over TMB alone was significant, the incremental gain beyond optimized clinical models was modest. Feature attribution analysis showed that model performance was dominated by clinical variables, with genomic features contributing limited additional signal. Conclusions: These findings suggest that, in heterogeneous pan-cancer cohorts, predictive performance is constrained by the underlying data structure, in which dominant clinical signals overshadow genome-scale features. This study highlights fundamental limitations in integrating genomic data into survival models across diverse cancer types and provides a benchmark for future computational approaches.
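Harrell's C-index used throughout this study has a simple pairwise definition: the fraction of comparable patient pairs in which the model assigns the higher risk to the patient who failed first. A minimal reference implementation for right-censored data:

```python
def harrell_c_index(times, events, risk_scores):
    """Harrell's concordance index. A pair (i, j) is comparable when the
    subject with the shorter time experienced an event (events[i] == 1);
    the pair is concordant when that subject also has the higher risk score.
    Tied risk scores count as half-concordant. 0.5 = random, 1.0 = perfect."""
    concordant, ties, comparable = 0, 0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1
                elif risk_scores[i] == risk_scores[j]:
                    ties += 1
    return (concordant + 0.5 * ties) / comparable
```

The abstract's headline numbers map directly onto this scale: a C-index of 0.50 for TMB alone means the feature orders comparable pairs no better than a coin flip.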
bioinformatics · 2026-04-18 · v1
Unsupervised Machine Learning for Adaptive Immune Receptors with immuneML
Pavlovic, M.; Wurtzen, C.; Kanduri, C.; Mamica, M.; Scheffer, L.; Lund-Andersen, C.; Gubatan, J. M.; Ullmann, T.; Greiff, V.; Sandve, G. K.
Abstract
Machine learning (ML) enables adaptive immune receptor repertoire (AIRR) analyses for biomarker identification and therapeutic development. With the majority of AIRR data partially or imperfectly labeled, unsupervised ML is essential for motif discovery, biologically meaningful clustering, and generation of novel receptor sequences. However, no unified framework for unsupervised ML exists in the AIRR field, hindering the assessment of model robustness and generalizability. Here, we present an immuneML release advancing unsupervised ML in the AIRR field through unified clustering workflows, interpretable generative modeling, integration with protein language model embeddings, dimensionality reduction, and visualization. We demonstrate immuneML's utility in three use cases: (i) benchmarking generative models for epitope-specific sequence generation, assessing specificity and novelty, (ii) systematic evaluation of clustering approaches on experimental receptor sequences against biological properties, such as epitope specificity and MHC, and (iii) unsupervised analysis of an experimental AIRR dataset to examine potential confounding, a practice widespread in related fields but unexplored in AIRR analyses.
bioinformatics · 2026-04-18 · v1
Calibration of in-frame indel variant effect predictors for clinical variant classification
Abderrazzaq, H.; Singh, M.; Babb, L.; Bergquist, T.; Brenner, S. E.; Pejaver, V.; O'Donnell-Luria, A.; Radivojac, P.; ClinGen Computational Working Group; ClinGen Variant Classification Working Group
Abstract
Insertions and deletions (indels) represent a substantial source of genetic variation in humans and are associated with a diverse array of functional consequences. Despite their prevalence and clinical importance, indels, particularly short in-frame indels, remain critically understudied compared to single nucleotide variants and are challenging to interpret clinically. While many computational predictors for missense variants have been rigorously evaluated and calibrated for clinical use, the clinical utility of tools for in-frame indels remains uncertain. To address this gap, we have calibrated in-frame indel prediction tools for clinical variant classification. We constructed a high-confidence dataset of in-frame indel variants (≤50 bp) from clinical and population databases and estimated the prior probability of pathogenicity of a rare in-frame indel observed in a disease-associated gene, and of an insertion and deletion separately. Using a previously developed statistical framework based on local posterior probabilities, we then established score thresholds for eight computational tools, corresponding to distinct evidence levels for pathogenic and benign classification according to ACMG/AMP guidelines. All in-frame indel predictors evaluated here reached multiple evidence levels of pathogenicity and/or benignity, demonstrating measurable clinical value. However, these models consistently exhibited lower performance levels compared to missense predictors, highlighting the need for improved computational approaches for indel classification.
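The calibration logic described above rests on Bayes' rule: a score interval's likelihood ratio combines with the class prior to yield a posterior probability of pathogenicity, and ACMG/AMP evidence strengths correspond to likelihood-ratio thresholds. The exponential point scale with overall odds O = 350 (from Tavtigian et al.) is shown here as an illustrative assumption, not necessarily the exact constants used in this study:

```python
def posterior_pathogenic(lr, prior):
    """Posterior probability of pathogenicity from a score's likelihood
    ratio (pathogenic vs. benign) and the prior for the variant class."""
    odds = lr * prior / (1.0 - prior)
    return odds / (1.0 + odds)

# ACMG/AMP evidence strengths as likelihood-ratio cutoffs under the
# commonly used exponential point scale with overall odds O = 350
O = 350.0
EVIDENCE_LR = {
    "supporting":  O ** (1 / 8),
    "moderate":    O ** (1 / 4),
    "strong":      O ** (1 / 2),
    "very_strong": O,
}
```

A predictor "reaches" an evidence level when some score interval's estimated local likelihood ratio clears the corresponding cutoff, which is why estimating the prior carefully (as the authors do for insertions and deletions separately) matters.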
bioinformatics · 2026-04-18 · v1
Interpretable models for scRNA-seq data embedding with multi-scale structure preservation
Novak, D.; de Bodt, C.; Lambert, P.; Lee, J. A.; Van Gassen, S.; Saeys, Y.
Abstract
The ability to explore high-dimensional single-cell transcriptomics data efficiently is crucial in many biological studies. Dimensionality reduction techniques have therefore emerged as a basic building block of analytical workflows. They generate low-dimensional embeddings that capture important structures in the data, and are often used in discovery, quality control, and downstream analysis. However, the trustworthiness of current methods and the rigour of popular evaluation criteria are limited. We tackle this in an empirical study of structure-preserving data embeddings, delivering two tools. First, we introduce ViScore: a robust scoring framework that improves both unsupervised and supervised quality metrics, with emphasis on scalability and fairness. Second, we introduce ViVAE: a deep learning model that achieves better multi-scale structure preservation and is equipped with new tools for interpretability. We demonstrate the potential of these contributions to advance the trustworthiness of single-cell dimensionality reduction in a quantitative comparison and focused case studies.
bioinformatics · 2026-04-17 · v5
A new iterative framework for simulation-based population genetic inference with improved coverage properties of confidence intervals
Rousset, F.; Leblois, R.; Estoup, A.; Marin, J.-M.
Abstract
Simulation-based methods such as approximate Bayesian computation (ABC) are widely used to infer the evolutionary history of populations from molecular genetic data. We describe and evaluate a new iterative method of statistical inference about model parameters, which revisits the idea of inferring a likelihood surface using simulation when the likelihood function cannot be evaluated. It is based on combining the random forest machine learning method and multivariate Gaussian mixture (MGM) models in an effective inference workflow, here used to fit models with up to 15 variable parameters. In addition to the traditional assessment of precision in terms of bias and mean square error, we also evaluate the coverage of confidence intervals. The method is compared with approximate Bayesian computation using random forests (ABC-RF), a non-iterative method sharing some technical features with the proposed approach, across scenarios of historical demographic inference from population genetic data. It is also compared to another iterative method, sequential neural likelihood estimation (SNLE). These comparisons highlight the importance of an iterative workflow for exploring the parameter space efficiently. For equivalent simulation effort of the data-generating process, the new summary-likelihood method provides intervals whose coverage is better controlled than the marginal coverage of intervals provided by ABC with random forests, and better than generally reported for ABC methods. The iterative workflow can also yield greater improvements in estimator precision when larger datasets are used.
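For readers unfamiliar with the baseline being improved upon, the simplest member of this family is rejection ABC: draw parameters from the prior, simulate, and keep draws whose simulated summary lands near the observed one. The toy model and tolerance below are assumptions for illustration; the paper's iterative summary-likelihood method goes well beyond this.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(theta, n=200):
    """Toy data-generating process: the sample mean of Normal(theta, 1) draws."""
    return rng.normal(theta, 1.0, size=n).mean()

def abc_rejection(s_obs, n_sims=5000, eps=0.1, prior=(-5.0, 5.0)):
    """Basic rejection ABC: accept prior draws whose simulated summary
    falls within eps of the observed summary."""
    accepted = []
    for _ in range(n_sims):
        theta = rng.uniform(*prior)
        if abs(simulate(theta) - s_obs) < eps:
            accepted.append(theta)
    return np.array(accepted)

post = abc_rejection(s_obs=1.0)   # approximate posterior sample for theta
```

Rejection ABC wastes most simulations on implausible parameter values; the iterative workflow in the paper addresses exactly this by concentrating simulation effort where the inferred likelihood surface is high.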
bioinformatics · 2026-04-17 · v5
NetSyn: prokaryotic genomic context exploration of protein families
Stam, M.; Langlois, J.; Chevalier, C.; Mainguy, J.; Reboul, G.; Bastard, K.; Medigue, C.; Vallenet, D.
Abstract
Background: The growing availability of large prokaryotic genomic datasets presents an opportunity to discover new metabolic pathways and enzymatic reactions useful for industrial or synthetic biology applications. Efforts to identify new enzyme functions in this vast number of sequences cannot succeed without bioinformatics tools and the development of new strategies. Standard methods for assigning a biological function to a gene are based on sequence similarity. However, complementary approaches rely on mining databases to identify conserved gene clusters (i.e. syntenies). In prokaryotic genomes, genes involved in the same pathway are frequently encoded in a single locus with an operonic organisation. This genomic context conservation is considered a reliable indicator of functional relationships, and is therefore a promising approach for improving gene function prediction. Methods: Here we present NetSyn (Network Synteny), a tool to group protein sequences based on the conservation of their genomic context rather than solely on sequence similarity. From a list of protein sequence identifiers, NetSyn searches the corresponding genome entries to retrieve neighboring genes. The corresponding protein sequences are grouped into families to define homology relationships and compute a synteny conservation score between the different extracted genomic contexts. A network is then created in which the nodes represent the input proteins and the edges indicate that two proteins share a conserved synteny. Finally, the network is partitioned into clusters grouping proteins with similar genomic contexts, using a community detection algorithm. Results: As a proof of concept, we used NetSyn on two different datasets. The first is the BKACE protein family (formerly named DUF849), which has previously been divided into isofunctional sub-families. NetSyn was able to go a step further by providing additional sub-families beyond those already described.
The second dataset corresponds to a set of non-homologous proteins belonging to three different glycoside hydrolase (GH) families. These GHs are known to work cooperatively in a polysaccharide utilization locus (PUL) and are therefore grouped together in the same genomic contexts. NetSyn was able to identify a locus grouping three GHs involved in the degradation of xyloglucan in 162 prokaryotic genomes. Discussion: By highlighting conserved synteny in distantly related prokaryotic species, NetSyn enables functional links between proteins to be established beyond sequence similarity alone. We showed that NetSyn is efficient for exploring large prokaryotic protein families, enabling the definition of isofunctional groups and the identification of functional interactions between non-homologous enzymes. These features enable the prediction of new genomic structures that have not yet been experimentally characterized. Finally, NetSyn is also useful for pinpointing annotation errors that have been propagated across databases, and for suggesting annotations for proteins lacking functional predictions. NetSyn is freely available at https://github.com/labgem/netsyn.
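The network-partitioning step described in the Methods can be sketched as follows: build a graph whose edges carry synteny-conservation scores, keep edges above a threshold, and group the proteins. Connected components stand in here for NetSyn's community detection algorithm, and the protein names, scores, and threshold are hypothetical.

```python
def synteny_clusters(edges, threshold=0.5):
    """Group proteins whose pairwise synteny-conservation score meets the
    threshold. Connected components are used as a simple stand-in for the
    community detection step NetSyn actually performs."""
    adj = {}
    for a, b, score in edges:
        adj.setdefault(a, set())
        adj.setdefault(b, set())
        if score >= threshold:
            adj[a].add(b)
            adj[b].add(a)
    seen, clusters = set(), []
    for node in adj:
        if node in seen:
            continue
        stack, comp = [node], set()
        while stack:                      # depth-first traversal
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(adj[n] - comp)
        seen |= comp
        clusters.append(comp)
    return clusters

# Hypothetical conservation scores between five proteins:
edges = [("p1", "p2", 0.9), ("p2", "p3", 0.7),
         ("p3", "p4", 0.2), ("p4", "p5", 0.8)]
```

With these scores, the weak p3-p4 link is dropped, splitting the proteins into two putative isofunctional groups.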
bioinformatics · 2026-04-17 · v4
METRIN-KG: A knowledge graph integrating plant metabolites, traits, and biotic interactions
Tandon, D.; Mendes de Farias, T.; Allard, P.-M.; Defossez, E.
Abstract
Background: In recent years, biodiversity data management has emerged as a critical pillar in global conservation efforts. Today, the ability to efficiently collect, structure, and analyze biodiversity data is central to breakthroughs in conservation, drug development, disease monitoring, ecological forecasting, and agri-tech innovation. However, due to the vastness and heterogeneity of biodiversity data, it is often confined to databases for specific research areas, in isolated formats and disconnected from other relevant resources. Crucial components of such data in the kingdom Plantae comprise metabolomes, the vast array of compounds produced by plants; traits, measurable characteristics of plants that influence their growth, survival, and reproduction, and that affect ecosystem processes; and biotic interactions, relationships of plants with other living organisms that affect ecosystem functions. Results: In this work, we present METRIN-KG (MEtabolomes, TRaits, and INteractions Knowledge Graph), a data resource simplifying the integration of diverse and heterogeneous data resources such as plant metabolomes, traits, and biotic interactions. Conclusions: The proposed knowledge graph provides an interface to interactively search for data relating plant metabolomes, traits, and interactions. This, in turn, will facilitate the development of research questions in the life sciences. In this context, we provide representative case studies on how to frame queries that can be used to search for relevant data in the knowledge graph.
bioinformatics · 2026-04-17 · v3
Mechanistic insights into CFTR function from molecular dynamics analysis of electrostatic interactions
ELBAHNSI, A.; Mornon, J.-P.; Callebaut, I.
Abstract
The Cystic Fibrosis Transmembrane Conductance Regulator (CFTR) is an ATP-gated anion channel whose function is tightly linked to its conformational dynamics and is influenced by the composition of its membrane lipid environment. Despite high-resolution three-dimensional (3D) structures, the molecular determinants that stabilize specific CFTR conformations and enable ion conduction remain incompletely understood. Here, we performed all-atom molecular dynamics (MD) simulations of the human CFTR 3D structure in both the apo and VX-770 (ivacaftor)-bound states, embedded in a heterogeneous lipid bilayer, in order to systematically analyze electrostatic interactions, linking amino acids to each other as well as to anions and membrane lipids. We identified 557 electrostatic interactions between charged and polar amino acid side chains, which we systematically mapped across the CFTR 3D structure. They are organized into specific regions, with a subset showing high frequency and conservation across simulations, suggesting a structural role in stabilizing CFTR architecture. In contrast, more transient electrostatic interactions were detected in dynamic regions potentially linked to conformational transitions or other functional roles. Irregularities in transmembrane (TM) helices often incorporate amino acids involved in electrostatic interactions. Many basic and polar residues involved in electrostatic interactions also engaged in anion coordination, underscoring their contribution to ion conduction. In addition, some showed selective interactions with cholesterol and phosphatidylserine, revealing spatially organized lipid binding, particularly at the level of the lasso and in the vicinity of the VX-770 binding site, which may mark regions important for allosteric communication. VX-770 binding preserved the global architecture of the electrostatic interaction networks but induced subtle shifts, acting on specific salt bridges. 
Regardless of whether VX-770 is present, a secondary portal between TM10 and TM12 emerged from these MD simulations, in addition to the main TM4/TM6 portal, whose morphology and diameter are controlled by a fluctuating salt bridge. Two routes for anion exit towards the extracellular milieu also appeared. Altogether, our integrative analysis highlights how dynamic electrostatic networks, together with ion and lipid interactions, support CFTR's structural plasticity and functional modulation, offering molecular insights into potentiation mechanisms and into the specific evolution of CFTR in the ABC transporter superfamily.
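The salt-bridge component of such an electrostatic-interaction analysis is, at its simplest, a distance check between oppositely charged groups over the trajectory. The minimal single-frame detector below is a generic sketch (the coordinates, charges, and 4 Å cutoff are assumptions), not the paper's full analysis of 557 interactions.

```python
import numpy as np

def salt_bridges(coords, charges, cutoff=4.0):
    """Return index pairs of oppositely charged side-chain centroids lying
    within `cutoff` angstroms: a minimal distance-based salt-bridge check
    for one frame. A trajectory analysis would tally these per frame to
    get interaction frequencies."""
    pairs = []
    n = len(charges)
    for i in range(n):
        for j in range(i + 1, n):
            if charges[i] * charges[j] < 0:      # opposite formal charges
                d = np.linalg.norm(np.asarray(coords[i], float)
                                   - np.asarray(coords[j], float))
                if d <= cutoff:
                    pairs.append((i, j))
    return pairs
```

Classifying bridges as stable versus transient, as the abstract describes, amounts to thresholding the per-pair occupancy (fraction of frames in which the pair is detected).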
bioinformatics · 2026-04-17 · v3
Scaling SMILES-Based Chemical Language Models for Therapeutic Peptide Engineering
Feller, A. L.; Secor, M.; Swanson, S.; Wilke, C. O.; Deibler, K.
Abstract
Therapeutic peptides occupy a unique middle ground in drug discovery, offering the high specificity of protein interactions with the chemical diversity of small molecules, yet they currently fall in a computational blind spot. Existing foundation models cannot handle them effectively: protein models are restricted to natural amino acids, while chemical models struggle to process large, polymer-like sequences. This disconnect has forced the field to rely on static chemical descriptors that fail to capture subtle chemical details, or on complex multi-embedding pipelines that are custom-tailored to specific datasets. To bridge this gap, we present PeptideCLM-2, a suite of chemical language models trained on over 100 million molecules to natively represent complex peptide chemistry. This modeling approach both simplifies the application of machine learning to therapeutic peptides and results in improved performance over alternative approaches for predicting development endpoints, including membrane diffusion, tumor homing, and half-life.
bioinformatics · 2026-04-17 · v3
Impact of the N-glycosylation on full-length IgG2 and IgG4 antibodies: a comparative study using molecular dynamics simulations
LEON FOUN LIN, R.; Bellaiche, A.; Diharce, J.; Etchebest, C.
Abstract
Like other proteins, monoclonal antibodies, important biodrugs, are subject to post-translational modifications, especially N-glycosylation. However, the effect of N-glycosylation remains poorly studied, and atomistic details about its influence are rarely available. Moreover, the few existing studies focus on the prevalent immunoglobulin G1. To further our understanding of the impact of glycosylation, we carried out a comparative exploration of the effect of N-glycosylation on two different classes of antibodies, namely Mab231, an IgG2, and pembrolizumab, an IgG4. The two antibodies differ in their sequences, length, and 3D structure, but also in the location and composition of their glycans. In the present work, detailed and important information was gained through molecular dynamics simulations in which both monoclonal antibodies were studied with and without their glycans. The results of 1.5 microseconds of sampling for each system show that glycosylation does not drastically alter the overall conformational landscape of either antibody, whatever the metrics considered. However, it measurably modulates local flexibility, inter-domain correlated motions, and the relative orientation of the Fab arms with respect to the Fc domain, with statistically significant shifts in key geometric descriptors. Importantly, contact analysis reveals that glycan interactions extend beyond the Fc region to reach Fab residues. Allosteric network calculations demonstrate that the influence of Fc-bound glycans propagates as far as the Fab framework regions in both mAbs, which could impact antigen binding. The nature and magnitude of these effects are subclass-dependent, reflecting differences in glycan composition, hinge architecture, and three-dimensional organization. Our findings challenge the prevailing view that Fc glycosylation uniformly promotes CH2 domain opening.
More importantly, they underscore the necessity of considering full-length structures and IgG subclass diversity in glyco-engineering strategies.
bioinformatics · 2026-04-17 · v2
Testing and Estimating Causal Treatment Effect Heterogeneity in Observational Studies via Revised Deep Semiparametric Regression: A Lung Transplant Case Study
Yuan, S.; Zou, F.; Zou, B.
Abstract
Lung transplantation programs must decide when bilateral lung transplantation (BLT) offers meaningful functional benefit over single lung transplantation (SLT). Because donor and recipient characteristics jointly shape outcomes, the BLT-SLT contrast may differ across patients. However, observational registries pose a key statistical challenge: apparent subgroup differences can be artifacts of complex confounding, while true heterogeneity can be missed or poorly quantified. Using a large national lung transplant registry, we study whether the BLT effect varies across recipients and identify clinically relevant profiles of benefit using post-transplant lung function measured by forced expiratory volume in 1 second (FEV1). We develop deepHTL, an analysis framework that first tests whether treatment effect heterogeneity is supported by the data and then estimates how the BLT-SLT effect changes with patient features when heterogeneity is present. In extensive simulations designed to resemble registry-like confounding, deepHTL controls false positives for detecting heterogeneity and yields more accurate individualized effect estimates than common machine learning methods. In the lung transplant cohort, we find strong evidence of heterogeneity in the BLT-SLT effect on FEV1: younger, lower risk recipients with better baseline status show the largest FEV1 gains from BLT, whereas older, higher risk candidates exhibit diminished marginal benefit. These findings provide statistically grounded guidance for patient selection and allocation of scarce donor organs.
bioinformatics · 2026-04-17 · v2
Deep Learning Enables Automated Segmentation and Quantification of Ultrastructure from Transmission Electron Microscopy Images
Zou, A.; Tan, W.; Ji, J.; Rojas-Miguez, F.; Dodd, L.; Oei, E.; Vargas, S. R.; Yang, H.; Berasi, S. P.; Chen, H.; Henderson, J. M.; Fan, X.; Lu, W.; Zhang, C.
Abstract
Transmission electron microscopy (TEM) has become an essential technique for observing subcellular ultrastructure, and is widely used in both clinical diagnosis and biomedical research. However, analysis of TEM data remains extremely labor-intensive and often inconsistent across operators due to the lack of dedicated computational methods. Here, we present TEAMKidney, a deep learning framework for accurate and scalable measurement of ultrastructures in TEM images across species, magnifications, and instrument platforms. We collected 12,991 TEM images from patients with multiple kidney diseases and from different animal models. By combining a self-training-based semantic segmentation stage with a TEM-tailored panoptic segmentation model, we address two major challenges in TEM data analysis: the lack of accurately labeled training data and the difficulty of achieving high segmentation accuracy for complex ultrastructure. Application of TEAMKidney to both human and animal images successfully reveals disease-associated changes in two critical glomerular ultrastructures: the glomerular basement membrane and podocyte foot processes. In addition to significantly outperforming existing tools, TEAMKidney shows close agreement with pathological expert measurements used in clinical assessment protocols. By reducing dependence on manual tracing while preserving expert-level accuracy, TEAMKidney demonstrates that deep learning can substantially reduce the burden of image analysis in both clinical pathology and biomedical research settings.
bioinformatics · 2026-04-17 · v2
Methylation-aware long-read phasing significantly improves genome-wide haplotype reconstruction
Pfennig, A.; Akey, J. M.
Abstract
Haplotypes are linear sequences of co-inherited alleles along individual chromosomes and are central to genetic mapping, clinical variant interpretation, and inference of population history. However, accurate genome-wide haplotype reconstruction remains challenging. Long-read sequencing has the potential to dramatically improve haplotype inference, but existing methods do not directly leverage all the information embedded in these data. Here, we present LongHap, a read-based phasing method that integrates sequence and 5-methylcytosine (5mC) information in a unified probabilistic framework. By leveraging differentially methylated sites, LongHap resolves phase relationships between variants that are inaccessible to sequence-based approaches alone. Across multiple datasets and sequencing platforms, LongHap increases phase block lengths by up to 30% while substantially reducing switch error rates. LongHap rigorously embeds complex structural variants into the broader haplotype context using loopy belief propagation, enabling improved phasing of INDELs and other variant classes that are inherently difficult to resolve. Methylation-aware phasing also improves the accuracy and contiguity of haplotypes spanning rare variants and structurally complex, medically relevant genes across diverse ancestries, facilitating the interpretation of compound heterozygosity and haplotype-specific regulatory architectures. These results establish methylation-aware phasing as a general framework for improving genome-wide haplotype reconstruction, with broad applications across genetics and genomics.
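The intuition behind methylation-aware read assignment can be shown with a toy vote-based scheme: a read supports whichever haplotype matches more of its observed states, where a state may be a sequence allele at a heterozygous site or a 5mC call at a differentially methylated position. LongHap itself uses a unified probabilistic model with loopy belief propagation; the function and encodings below are simplified assumptions.

```python
def assign_read(read_obs, hap1, hap2):
    """Vote-based read-to-haplotype assignment using both sequence alleles
    and methylation states. Dicts map position -> state, where a state is
    an allele ('A', 'G', ...) or a methylation call ('meth'/'unmeth')."""
    s1 = sum(read_obs[p] == hap1[p] for p in read_obs if p in hap1)
    s2 = sum(read_obs[p] == hap2[p] for p in read_obs if p in hap2)
    if s1 == s2:
        return None          # ambiguous read: equal support for both
    return 1 if s1 > s2 else 2

# Hypothetical haplotypes: one het SNV at 100, one differentially
# methylated site at 500.
hap1 = {100: "A", 500: "meth"}
hap2 = {100: "G", 500: "unmeth"}
```

The key point the toy example captures: a read covering only position 500 carries no heterozygous variant, so sequence-based phasing cannot place it, yet its methylation state still links it to a haplotype.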
bioinformatics · 2026-04-17 · v2
HARVEST: Unlocking the Dark Bioactivity Data of Pharmaceutical Patents via Agentic AI
Shepard, V.; Musin, A.; Chebykina, K.; Zeninskaya, N. A.; Mistryukova, L.; Avchaciov, K.; Fedichev, P. O.
Abstract
Pharmaceutical patents contain vast Structure-Activity Relationship tables documenting protein-ligand binding data. While technically public, this information remains computationally inaccessible and effectively dark, trapped in bulky documents that no existing database has systematically captured. We present HARVEST, a multi-agent large language model pipeline that autonomously extracts structured bioactivity records from USPTO patent archives at $0.11 per document. Applied to 164,877 patents, HARVEST produced 3.15 million activity records, recovering 326,342 unique scaffolds and 967 protein targets absent from BindingDB. This pipeline completed in under a week a task that would otherwise require over 55 years of continuous expert labor. Automated extraction achieves 80% agreement with a human-curated corpus of US patents from BindingDB, a conservative lower bound given identified errors within the reference data. We further introduce H-Bench, a structurally guaranteed held-out benchmark built from this recovered data. Evaluation of the leading open-source model Boltz-2 on H-Bench reveals a two-dimensional generalization gap: performance degrades both on novel scaffolds and on uncharacterized protein targets, exposing fundamental limitations of models trained on existing public repositories.
bioinformatics · 2026-04-17 · v2
GraphPop: graph-native computation decouples population genomics complexity from sample count
Estaji, E.; Zhao, S.-W.; Chen, Z.-Y.; Nie, S.; Mao, J.-F.
Abstract
Matrix-based population genomics tools scale as O(V × N), re-reading the full genotype matrix for every analysis. Here we present GraphPop, a graph database engine that reduces summary statistic complexity to O(V × K), where K is the population count (independent of sample count), by computing on pre-aggregated allele-count arrays stored as graph node properties. The same architecture enables annotation-conditioned queries via edge traversal, persistent analytical records, and multi-statistic composition. Applied to rice 3K (29.6M SNPs, 3,024 accessions) and human 1000 Genomes (3,202 samples, 22 autosomes), GraphPop reveals that all 12 rice subpopulations show πN/πS > 1.0, uncovers opposite consequence-level Fst regimes between species, and identifies KCNE1 as a candidate pre-Out-of-Africa sweep via convergence of five stored statistics. GraphPop achieves 146-327× query-time speedup for pre-aggregated statistics and 63-179× for bit-packed haplotype computation, at constant ~160 MB memory. This complexity reduction makes systematic, annotation-integrated population genomics practical for the crop, livestock, conservation, and ecological datasets that constitute the majority of the field.
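The complexity argument can be made concrete: once per-population allele counts are pre-aggregated, a summary statistic touches a V × K array instead of the V × N genotype matrix. A minimal sketch of that contrast (array names and shapes are assumptions for illustration, not GraphPop's API):

```python
import numpy as np

def allele_freqs_from_counts(alt_counts, allele_totals):
    """Per-population alternate-allele frequencies from pre-aggregated
    allele counts: O(V x K) work, with no dependence on sample count N.
    Both arguments are (V variants x K populations) arrays."""
    return np.asarray(alt_counts, float) / np.asarray(allele_totals, float)

# Two variants, two populations; totals = 2 alleles per diploid sample,
# so 100 corresponds to 50 samples per population (hypothetical numbers).
alt = np.array([[10, 40],
                [5,  0]])
tot = np.array([[100, 100],
                [100, 100]])
freqs = allele_freqs_from_counts(alt, tot)
```

Adding samples changes only the stored counts, not the shape of this computation, which is why per-query cost stays flat as N grows.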
bioinformatics · 2026-04-17 · v2
GraphMana: graph-native data management for population genomics projects
Estaji, E.; Zhao, S.-W.; Chen, Z.-Y.; Nie, S.; Mao, J.-F.
Abstract
Population genomics projects rely on fragmented file-based workflows that lose provenance and require full reprocessing when samples are added. GraphMana stores variant data in a graph database as packed genotype arrays with pre-computed population statistics, enabling incremental sample addition, provenance tracking, cohort management, and export to 17 formats. On the human 1000 Genomes Project (3,202 samples, 70.7 million variants), GraphMana completed a 46-operation project lifecycle in 98 minutes from a single persistent database, replacing the ad hoc scripting otherwise required across multiple disconnected tools.
bioinformatics · 2026-04-17 · v2
Integrating glycosylation in de novo protein design with ReGlyco Binder Design Filter
Singh, O.; Fadda, E.
Abstract
Artificial Intelligence (AI)-based methods for 3D protein structure prediction are revolutionising structural biology, providing novel templates for experimental data refinement and an on-demand 3D perspective on any molecular architecture and protein-protein interaction (PPI). Regardless of the inherent limitations of the various approaches available to date, the continuous improvement of the algorithms and the broad availability of open-access (OA) web servers, software packages, and databases are bound to accelerate the discovery and optimisation of novel biopharmaceuticals. Within this context, the development of computational pipelines for the de novo design of target-specific protein binders is especially exciting. As it stands, these pipelines are still rather inefficient and expensive: they rapidly output thousands of designs, which translate into meagre experimental yields. Here we show how the explicit integration of glycosylation as a filter in the 3D de novo design pipeline can significantly improve efficiency and reduce laboratory costs with minimal additional computational resources. As a proof of concept, we used the GlycoShape database and ReGlyco tools to filter the results of a recent open competition launched by Adaptyv Bio for the design of binders as inhibitors against the heavily glycosylated Nipah virus glycoprotein (NiV-G). Screening the 1,201 selected designs in a block with ReGlyco, refined with the new ReGlyco Rotamer tool, flagged 11% as non-binders prior to experiment in approximately 3 hours on a dual-core CPU. We complement this analysis with a demo Colab notebook to illustrate our workflow. In this demo, users can design mini-binders against human erythropoietin (hEPO) by integrating GlycoShape resources with the RFdiffusion3 (RFD3) pipeline from the Institute for Protein Design (IPD).
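The logic of a glycosylation filter can be caricatured in one dimension: a design is a likely non-binder if its target epitope lies too close to a glycosylation site for the binder to reach past the glycan. ReGlyco performs this check in 3D with explicit glycan rotamers; the function, residue numbering, and radius below are hypothetical simplifications.

```python
def flag_glycan_occluded(design_epitope, glyco_sites, occlusion_radius=5):
    """Flag a design as a likely non-binder if any residue of its target
    epitope lies within `occlusion_radius` sequence positions of an
    N-glycosylation site: a crude 1-D stand-in for a 3-D occlusion check."""
    return any(abs(r - g) <= occlusion_radius
               for r in design_epitope
               for g in glyco_sites)

# Hypothetical example: epitope residues 100-101, glycosylation at 103.
occluded = flag_glycan_occluded({100, 101}, {103})
```

Applied before synthesis, even a cheap filter like this prunes designs whose targeted surface is inaccessible in the glycosylated protein, which is the efficiency gain the abstract reports.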
bioinformatics · 2026-04-17 · v1
Agent-Guided De Novo Design of Nanobody Binders Against a Novel Cancer Target
Zhao, Y.; Yilmaz, M.; Lee, E.; Teh, C.; Guo, L.; Sonmez, K.; Giancardo, L.; Trang, G.; Xu, F.; Espinosa-Cotton, M.; Cheung, N.-K.; Kim, J.; Cheng, X.
Abstract
Therapeutic antibody discovery remains slow and resource-intensive, with traditional methods providing limited control over epitope selection. We present a workflow for de novo nanobody design applied to a novel Desmoplastic Small Round Cell Tumor target, encompassing four stages: (1) epitope identification guided by our hotspot recommendation agent using physical chemistry-based structure and sequence analysis tools with two curated databases (IEDB, PFAM); (2) de novo nanobody generation using three independent methods (RFantibody, IgGM, mBER) across multiple predicted antigen structures and nanobody frameworks; (3) multi-metric scoring including structural metrics from folding models and in silico binding affinity from our sequence-based predictor; (4) high-throughput yeast surface display (YSD) screening followed by surface plasmon resonance (SPR) characterization of the specific binders. We generated 288,000 nanobody designs spanning eight target epitope regions and three variable domains of heavy chain-only antibody (VHH) frameworks. Multi-objective Pareto filtering with our candidate selection agent yielded 100,000 candidates for YSD screening with fluorescence-activated cell sorting (FACS). Of 116 enriched candidates advanced to SPR characterization, 46/116 (39.7%) produced reliable kinetic fits with Rmax ≥ 30 RU, yielding KD values from 0.66 nM to 305 nM (median 31.7 nM). These results show that an agent-guided computational workflow can design nanomolar to sub-nanomolar nanobody binders against a novel target without experimental structure or prior antibody information.
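The multi-objective Pareto filtering step used for candidate selection keeps exactly the designs that no other design beats on every score. A minimal version of that idea (function name and example scores are hypothetical; the agent's actual criteria are not described here):

```python
def pareto_front(points):
    """Return indices of non-dominated candidates when every objective is
    maximized. A point is dominated if some other point is at least as
    good on all objectives and strictly better on at least one."""
    front = []
    for i, p in enumerate(points):
        dominated = any(
            q != p and all(qk >= pk for qk, pk in zip(q, p))
            for q in points
        )
        if not dominated:
            front.append(i)
    return front

# Hypothetical (structure score, predicted affinity) pairs for 4 designs:
scores = [(0.9, 0.2), (0.5, 0.5), (0.3, 0.9), (0.4, 0.4)]
```

Here design 3 is dominated by design 1 (worse on both objectives), so the front keeps the other three; applied to 288,000 designs over several scores, the same rule yields the screening pool.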
bioinformatics · 2026-04-17 · v1