Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
Navigating the peptide sequence space in search for peptide binders with BoPep
Hartman, E.; Samsudin, F.; Siljehag Alencar, M.; Tang, D.; Bond, P. J.; Schmidtchen, A.; Malmstrom, J.
AI Summary
- The study developed BoPep, a framework using Bayesian optimization to efficiently explore peptide sequence space for protein binders, reducing the need for extensive docking evaluations.
- BoPep was applied to peptides from clinical wound fluids, the human proteome, and de novo designs, identifying novel peptide classes that bind CD14 and neutralize pneumolysin's hemolytic activity.
Abstract
Peptides are short amino-acid chains that mediate essential biological processes, including antimicrobial defence, immune modulation and cell signalling. Their high degree of modularity, biocompatibility and capacity to bind proteins with high specificity make them attractive therapeutic candidates. However, identifying peptides that bind and modulate the function of specific proteins remains challenging due to the immense size of the peptide sequence space. To address this challenge, we developed BoPep (Bayesian Optimization for Peptides), an end-to-end modular framework that effectively navigates the landscape of peptide-protein interactions by directing the search toward informative regions of sequence space and prioritizing candidates with high binding potential. By focusing computational effort where it is most informative and using calibrated uncertainty to balance exploration and exploitation, BoPep reduces the number of expensive docking evaluations by orders of magnitude. We demonstrate the utility of BoPep by applying it to three sources of peptides: endogenous proteolytic fragments from clinical wound fluids, the complete human proteome, and a de novo design peptide landscape generated by diffusion-based backbone sampling. Using these sources, we uncover novel encrypted peptide classes that bind CD14 and identify peptides that neutralize the hemolytic activity of pneumolysin, a major bacterial virulence factor. Together, these findings show that BoPep accelerates the identification of testable therapeutic leads from large and diverse peptide collections. BoPep is available at GitHub.
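The exploration/exploitation trade-off mentioned in the abstract can be illustrated with an upper-confidence-bound (UCB) acquisition rule. This is a minimal sketch under stated assumptions, not BoPep's implementation: the abstract does not say which acquisition function BoPep uses, and the function name `select_next_batch`, the `kappa` parameter, the peptide labels and the surrogate outputs are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def select_next_batch(
    predictions: Dict[str, Tuple[float, float]],
    already_docked: Set[str],
    batch_size: int = 2,
    kappa: float = 1.0,
) -> List[str]:
    """Rank undocked peptides by UCB = mean + kappa * std; return the top batch.

    kappa = 0 is pure exploitation (trust the surrogate mean); larger kappa
    favours peptides whose predicted score is still uncertain.
    """
    scored = [
        (mean + kappa * std, peptide)
        for peptide, (mean, std) in predictions.items()
        if peptide not in already_docked
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [peptide for _, peptide in scored[:batch_size]]


# Hypothetical surrogate output: label -> (predicted binding score, predictive std).
toy_predictions = {
    "pep_A": (0.60, 0.05),
    "pep_B": (0.55, 0.30),
    "pep_C": (0.40, 0.10),
    "pep_D": (0.70, 0.02),  # already docked, so never proposed again
}

exploit = select_next_batch(toy_predictions, {"pep_D"}, kappa=0.0)
explore = select_next_batch(toy_predictions, {"pep_D"}, kappa=2.0)
```

With `kappa=0.0` the confident candidate `pep_A` ranks first; with `kappa=2.0` the uncertain `pep_B` overtakes it. Only the selected batch would be sent to the expensive docking step, which is how a loop of this kind keeps the number of docking evaluations low.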
bioinformatics 2026-03-02 v2
A Query-to-Dashboard Framework for Reproducible PubMed-Scale Bibliometrics and Trend Intelligence
Kidder, B. L.
AI Summary
- The study introduces PubMed Atlas, a platform for conducting topic-specific bibliometric analyses using PubMed E-utilities, which retrieves and organizes metadata into a SQLite database for analysis.
- An interactive Streamlit dashboard allows for the exploration of publication trends, journal distributions, MeSH term frequencies, and author geography.
- The framework was applied to cancer stem cell biology and stem cell transcriptional regulatory networks, demonstrating its utility in identifying research trends and gaps.
Abstract
The rapid expansion of biomedical literature necessitates computational approaches for systematic analysis of publication patterns, identification of emerging scientific themes, and characterization of field evolution. We present PubMed Atlas, an integrated command-line and web-based platform for conducting topic-specific bibliometric analyses through programmatic access to PubMed E-utilities. This workflow retrieves PubMed identifiers matching user-defined queries, downloads comprehensive metadata in batch mode, extracts structured information including titles, abstracts, author affiliations, Medical Subject Headings, publication classifications, funding acknowledgments, and digital object identifiers, then organizes these data within a local SQLite relational database optimized for rapid queries and visualization. An accompanying Streamlit-based interactive dashboard enables exploration of temporal publication patterns, journal distribution profiles, MeSH term frequencies, geographic author distributions, and direct linking to recent publications. We demonstrate the application of PubMed Atlas to cancer stem cell biology and stem cell transcriptional regulatory network research, providing a framework for reproducible bibliometric investigation and systematic identification of research gaps within dynamically evolving scientific domains.
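The local-database step can be pictured with a small SQLite example. This is a sketch only: the table and column names are assumptions rather than the PubMed Atlas schema, and the inserted rows are placeholders, not real PubMed records.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical schema: one row per article, one row per (article, MeSH term).
    CREATE TABLE articles (
        pmid     INTEGER PRIMARY KEY,
        title    TEXT NOT NULL,
        journal  TEXT,
        pub_year INTEGER
    );
    CREATE TABLE mesh_terms (
        pmid INTEGER REFERENCES articles(pmid),
        term TEXT NOT NULL
    );
    CREATE INDEX idx_articles_year ON articles(pub_year);
    CREATE INDEX idx_mesh_term ON mesh_terms(term);
    """
)

# Placeholder records for illustration only.
conn.executemany(
    "INSERT INTO articles VALUES (?, ?, ?, ?)",
    [
        (1, "Placeholder article A", "Journal X", 2023),
        (2, "Placeholder article B", "Journal X", 2024),
        (3, "Placeholder article C", "Journal Y", 2024),
    ],
)
conn.executemany(
    "INSERT INTO mesh_terms VALUES (?, ?)",
    [
        (1, "Neoplastic Stem Cells"),
        (2, "Neoplastic Stem Cells"),
        (3, "Gene Regulatory Networks"),
    ],
)


def publications_per_year(connection):
    """Temporal publication pattern: (year, article count), oldest first."""
    query = "SELECT pub_year, COUNT(*) FROM articles GROUP BY pub_year ORDER BY pub_year"
    return connection.execute(query).fetchall()


def mesh_term_frequencies(connection):
    """MeSH term frequencies: (term, count), most frequent first."""
    query = (
        "SELECT term, COUNT(*) AS n FROM mesh_terms "
        "GROUP BY term ORDER BY n DESC, term"
    )
    return connection.execute(query).fetchall()
```

Aggregations like these are the kind of query a dashboard panel would issue; indexing the grouped columns is one plausible reading of "optimized for rapid queries".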
bioinformatics 2026-03-02 v1
Density-guided AlphaFold3 uncovers unmodelled conformations in β2-microglobulin
Maddipatla, S. A.; Vedula, S.; Bronstein, A. M.; Marx, A.
AI Summary
- The study uses density-guided AlphaFold3 to model alternative backbone conformations of β2-microglobulin from crystallographic maps, which are typically obscured in standard X-ray crystallography models.
- Findings show that the approach can reveal conformational heterogeneity influenced by electron density quality, crystallization conditions, and lattice packing.
- This method enhances the ability to capture the full structural landscape of proteins, improving macromolecular crystallography interpretation.
Abstract
Although X-ray crystallography captures the ensemble of conformations present within the crystal lattice, models typically depict only the most dominant conformation, obscuring the existence of alternative states. Applying the electron density-guided AlphaFold3 approach to β2-microglobulin highlights how ensembles of alternate backbone conformations can be systematically modeled directly from crystallographic maps. This study also highlights how the detection of conformational ensembles is affected by the local quality of electron density and subtle variations in crystallization conditions and lattice packing. These results demonstrate that density-guided AlphaFold3 can uncover conformational heterogeneity missed by conventional refinement, offering a robust, systematic framework to capture the full structural landscape of proteins in crystals and enhancing the interpretive power of macromolecular crystallography.
bioinformatics 2026-03-02 v1
Synora: vector-based boundary detection for spatial omics
Li, J.-T.; Liang, Z.; Fu, Z.; Chen, H.; Liang, Y.-L.; Liu, N.; Wu, Q.-N.; Liu, Z.; Zheng, Y.; Huo, J.; Li, X.; Zuo, Z.; Zhao, Q.; Liu, Z.-X.
AI Summary
- Synora is a computational framework for detecting tumor-stroma boundaries in spatial omics data, using only cell coordinates and binary annotations.
- It introduces 'orientedness' to differentiate true boundary cells from infiltrated regions, integrating this with diversity measures into a BoundaryScore.
- Synora effectively identifies boundaries in synthetic and real datasets, revealing gene signatures and spatial patterns, and performs well under data perturbations.
Abstract
Tumor-stroma boundaries are critical microenvironmental niches where malignant and non-malignant cells exchange signals that shape invasion, immune modulation and therapeutic response. Spatial omics platforms now resolve these interfaces at single-cell scale, but computational boundary detection remains challenging because heterogeneous neighborhoods can arise either from true compartment interfaces or from unstructured immune infiltration. Here we present Synora, a modality-agnostic computational framework that identifies tumor boundaries using only cell coordinates and binary tumor/non-tumor annotations, making it readily applicable across a broad range of spatial omics modalities. Synora introduces 'orientedness', a novel metric that quantifies directional neighborhood asymmetry and distinguishes true boundary cells, where neighbors are spatially segregated by type, from infiltrated regions where cell types intermingle randomly. By integrating orientedness with traditional diversity measures into a unified BoundaryScore, Synora achieves robust boundary identification across synthetic datasets with ground-truth boundaries, maintaining performance under realistic perturbations including 50% missing cells and 25% infiltration. Application to 15 Visium HD spatial transcriptomic datasets across multiple cancer types reveals consistent boundary-enriched gene signatures and cell-type spatial gradients. Validation on a CODEX multiplexed protein dataset demonstrates that Synora's precise boundary identification enables discovery of clinically relevant cellular neighborhoods and disease-associated spatial patterns missed by frequency-based approaches. Synora enables boundary-aware spatial analyses by making tissue interfaces quantifiable from minimal inputs, helping to standardize interface detection and comparison across spatial omics platforms and biological contexts.
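One way to picture a directional-asymmetry measure is sketched below. The abstract does not give Synora's formulas, so every definition here is an assumption made for illustration: `orientedness` is taken as half the distance between the mean unit vectors pointing at tumor and non-tumor neighbors, diversity is taken as binary Shannon entropy, and `boundary_score` is taken as their product.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def mean_unit_vector(center: Point, points: Sequence[Point]) -> Point:
    """Average of the unit vectors pointing from center to each point."""
    sx = sy = 0.0
    count = 0
    for x, y in points:
        dx, dy = x - center[0], y - center[1]
        norm = math.hypot(dx, dy)
        if norm > 0:
            sx += dx / norm
            sy += dy / norm
            count += 1
    return (sx / count, sy / count) if count else (0.0, 0.0)


def orientedness(center: Point, neighbors: Sequence[Point], is_tumor: Sequence[bool]) -> float:
    """Hypothetical metric in [0, 1]: high when the two types lie on opposite sides."""
    tumor = [p for p, t in zip(neighbors, is_tumor) if t]
    other = [p for p, t in zip(neighbors, is_tumor) if not t]
    if not tumor or not other:
        return 0.0
    tx, ty = mean_unit_vector(center, tumor)
    ox, oy = mean_unit_vector(center, other)
    return math.hypot(tx - ox, ty - oy) / 2.0


def binary_entropy(is_tumor: Sequence[bool]) -> float:
    """Shannon entropy (bits) of the tumor/non-tumor mix in the neighborhood."""
    if not is_tumor:
        return 0.0
    p = sum(is_tumor) / len(is_tumor)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def boundary_score(center: Point, neighbors: Sequence[Point], is_tumor: Sequence[bool]) -> float:
    """Assumed combination: diversity alone is not enough, so scale it by orientedness."""
    return orientedness(center, neighbors, is_tumor) * binary_entropy(is_tumor)


# Segregated neighborhood: tumor cells on the left, non-tumor cells on the right.
segregated_pts: List[Point] = [(-1, 0), (-1, 1), (-1, -1), (1, 0), (1, 1), (1, -1)]
segregated_lab = [True, True, True, False, False, False]

# Intermingled neighborhood: same 50/50 mix, but no preferred direction.
mixed_pts: List[Point] = [(1, 0), (-1, 0), (0, 1), (0, -1)]
mixed_lab = [True, True, False, False]

segregated = boundary_score((0, 0), segregated_pts, segregated_lab)
mixed = boundary_score((0, 0), mixed_pts, mixed_lab)
```

Both toy neighborhoods have maximal diversity (entropy of 1 bit), yet only the segregated one receives a high score, which is the distinction between interfaces and infiltration that the abstract describes.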
bioinformatics 2026-03-02 v1
STCS: A Platform-Agnostic Framework for Cell-Level Reconstruction in Sequencing-Based Spatial Transcriptomics
Chen Wu, L.; Hu, X.; Zhan, F.; Sun, C.; Gonzales, J.; Ofer, R.; Tran, T.; Verzi, M. P.; Liu, L.; Yang, J.
AI Summary
- The study introduces STCS, a platform-agnostic framework for reconstructing single-cell expression profiles from sequencing-based spatial transcriptomics data by integrating transcriptomic and spatial data from H&E images.
- STCS uses two interpretable parameters for optimization, selected via internal metrics, and outperforms existing methods in reconstructing cell-level data from Visium HD and Stereo-seq datasets.
Abstract
Sequencing-based spatial transcriptomics platforms such as Visium HD and Stereo-seq achieve transcriptome-wide coverage at subcellular resolution, yet their measurements are defined over spatially barcoded units rather than biologically segmented cells. Reconstructing coherent cell-level expression profiles from these data remains a central computational challenge. Here, we introduce Spatial Transcriptomics Cell Segmentation (STCS), a platform-agnostic framework that reconstructs single-cell expression profiles by assigning spatial units to nuclei segmented from paired H&E images, using a combined transcriptomic and spatial distance. STCS is governed by two interpretable parameters that can be selected using reference-free internal metrics. On both Visium HD human lung cancer data with matched Xenium references and Stereo-seq mouse brain data, STCS achieves consistent improvements over existing methods across multiple evaluation dimensions. STCS is fully open-source and designed for broad applicability across sequencing-based spatial transcriptomics technologies.
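Assigning spatial units to nuclei with a combined distance could look like the sketch below. The abstract names neither the two parameters nor the distance formula, so `alpha` (weight on spatial versus transcriptomic distance), `max_radius` (search cut-off), the cosine distance and the per-nucleus reference profiles are all assumptions for illustration, not the STCS definitions.

```python
import math
from typing import Dict, Optional, Sequence, Tuple

Point = Tuple[float, float]


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def assign_unit(
    unit_xy: Point,
    unit_expr: Sequence[float],
    nuclei: Dict[str, Tuple[Point, Sequence[float]]],
    alpha: float = 0.5,
    max_radius: float = 10.0,
) -> Optional[str]:
    """Return the nucleus with the lowest combined cost, or None if none is in range."""
    best_id, best_cost = None, float("inf")
    for nucleus_id, (nucleus_xy, profile) in nuclei.items():
        spatial = math.dist(unit_xy, nucleus_xy)
        if spatial > max_radius:
            continue  # too far away to belong to this cell
        cost = alpha * (spatial / max_radius) + (1 - alpha) * cosine_distance(unit_expr, profile)
        if cost < best_cost:
            best_id, best_cost = nucleus_id, cost
    return best_id


# Hypothetical nuclei: centroid from the H&E segmentation plus a reference expression profile.
toy_nuclei = {
    "nucleus_A": ((0.0, 0.0), [10.0, 0.0, 1.0]),
    "nucleus_B": ((8.0, 0.0), [0.0, 10.0, 1.0]),
}

# Equidistant unit: expression breaks the tie in favour of nucleus_B.
tie_broken = assign_unit((4.0, 0.0), [0.0, 5.0, 1.0], toy_nuclei)
# alpha = 1.0 ignores expression, so the spatially closer nucleus_A wins.
spatial_only = assign_unit((3.0, 0.0), [0.0, 5.0, 1.0], toy_nuclei, alpha=1.0)
# A unit outside every search radius stays unassigned.
unassigned = assign_unit((100.0, 100.0), [0.0, 5.0, 1.0], toy_nuclei)
```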
bioinformatics 2026-03-02 v1
STEQ: A statistically consistent quartet distance based species tree estimation method
Saha, P.; Saha, A.; Roddur, M. S.; Sikdar, S.; Anik, N. H.; Reaz, R.; Bayzid, M. S.
AI Summary
- The study introduces STEQ, a new method for estimating species trees from multi-locus data using a quartet-based distance metric, which is statistically consistent under the multi-species coalescent model.
- STEQ offers faster computation, with a running time of $\mathcal{O}(kn^2 \log n)$ for $n$ taxa and $k$ genes, which is asymptotically faster than methods like ASTRAL.
- Evaluations on simulated and empirical datasets show STEQ maintains competitive accuracy with leading methods like ASTRAL and wQFM-TREE while significantly reducing inference time.
Abstract
Accurate estimation of large-scale species trees from multi-locus data in the presence of gene tree discordance remains a major challenge in phylogenomics. Although maximum likelihood, Bayesian, and statistically consistent summary methods can infer species trees with high accuracy, most of these methods are slow and do not scale to large numbers of taxa and genes. Distance-based estimation is one promising route to large-scale phylogeny estimation. Here, we present STEQ, a new statistically consistent, fast, and accurate distance-based method to estimate species trees from a collection of gene trees. We used a quartet-based distance metric which is statistically consistent under the multi-species coalescent (MSC) model. The running time of STEQ scales as $\mathcal{O}(kn^2 \log n)$ for $n$ taxa and $k$ genes, which is asymptotically faster than the leading summary-based methods such as ASTRAL. We evaluated the performance of STEQ in comparison with ASTRAL and wQFM-TREE, two of the most popular and accurate coalescent-based methods. Experimental findings on a collection of simulated and empirical datasets suggest that STEQ enables significantly faster inference of species trees while maintaining competitive accuracy with the best current methods. STEQ is publicly available at https://github.com/prottoysaha99/STEQ.
bioinformatics 2026-03-02 v1
Genomic language models improve cross-species gene expression prediction and accurately capture regulatory variant effects in Brachypodium mutant lines
Vahedi Torghabeh, B.; Moslemi, C.; Dybdal Jensen, J.; Hentrup, S.; Li, T.; Yu, X.; Wang, H.; Asp, T.; Ramstein, G. P.
AI Summary
- This study developed deep learning sequence-to-expression (S2E) models using context-aware sequence embeddings from PlantCaduceus to predict gene expression across 17 plant species, incorporating chromatin accessibility data.
- The models showed superior performance over PhytoExpr in predicting gene expression across species (Pearson R=0.82 vs. R=0.74) and in Brachypodium mutant lines for between-gene expression differences (β=0.78 vs. β=0.57).
- Notably, the models accurately predicted single-nucleotide mutation effects on within-gene expression, outperforming existing models (β=0.38 vs. β=0.08).
Abstract
Predicting gene expression from cis-regulatory DNA sequences at the promoter and terminator regions is a central challenge in plant genomics. This capability is also a prerequisite for assessing the effects of regulatory mutations on gene expression. Here, we developed deep learning sequence-to-expression (S2E) models that leverage context-aware sequence embeddings from the PlantCaduceus genomic language model, instead of one-hot encoding of sequences, to predict gene expression across 17 plant species. To further improve predictions, we integrated chromatin accessibility data as auxiliary regulatory features. First, we evaluated our models' ability to predict gene expression on unseen gene families via cross-validation, demonstrating that their prediction accuracy across all species exceeds that of PhytoExpr, the current state-of-the-art (SOTA) S2E model in plants (Pearson R=0.82 vs. R=0.74). We then validated variant effect predictions using an experimental dataset across 796 Brachypodium mutant lines, specifically designed to test predictions at single-base resolution. Our models outperformed SOTA S2E models in predicting between-gene expression differences (regression coefficient β=0.78 vs. β=0.57). Remarkably, they also accurately predicted the effects of single-nucleotide mutations on within-gene expression, while SOTA S2E models showed only weak associations (regression coefficient β=0.38 vs. β=0.08). Our results demonstrate the value of context-aware DNA sequence embeddings for predicting regulatory variant effects in plants. They also reveal a persistent accuracy gap in S2E models when moving from between-gene to allelic variation, a challenge that needs to be addressed in future S2E studies.
bioinformatics 2026-03-02 v1
DNA fragment length analysis using machine learning assisted vibrational spectroscopy
Fatayer, R.; Ahmed, W.; Szeto, I.; Sammut, S.-J.; Senthil Murugan, G.
AI Summary
- This study introduces a rapid, label-free method using ATR-FTIR and Raman spectroscopy combined with machine learning to quantify DNA fragment lengths from 50-300 bp.
- Machine learning models achieved high accuracy in predicting DNA length (R²=0.92-0.96), with multimodal fusion enhancing performance.
- The approach requires minimal sample (4 µL), short processing time (15 minutes), and allows full sample recovery, making it a scalable alternative for DNA length analysis.
Abstract
DNA length analysis is essential for genomic workflows including next-generation sequencing and fragmentomics-based diagnostics. Conventional approaches typically require large, expensive instrumentation and sample-destructive protocols with long processing times. Here we present a rapid, label-free approach integrating vibrational spectroscopy with deep learning to quantify DNA fragment length distributions. We demonstrate that ATR-FTIR and Raman spectroscopy capture length-dependent spectral features arising from phosphate backbone, nucleobase, and structural vibrations. Machine learning models trained on spectra acquired from purified monodisperse DNA (50-300 bp) predicted DNA length with high accuracy (R²=0.92-0.94), with multimodal fusion improving performance to R²=0.96. A convolutional neural network trained on 35 DNA mixtures comprising molecules of different lengths also successfully deconvoluted their fragment length profile. Transfer learning enabled adaptation to biological samples, achieving low prediction error (RMSE=0.3-7.2%, Δ=12 bp). Importantly, the method requires only 4 µL of sample and 15 minutes of passive drying, with no consumables beyond cleaning materials, and allows full sample recovery. This establishes vibrational spectroscopy as a scalable alternative for DNA length quantification.
bioinformatics 2026-03-02 v1
Evaluation of deep learning tools for chromatin contact prediction
Nguyen, T. H. T.; Vermeirssen, V.
AI Summary
- This study evaluates five deep learning models (C.Origami, Epiphany, ChromaFold, HiCDiffusion, GRACHIP) for predicting Hi-C contact maps from genomic and epigenomic data.
- Epiphany was found to have the best performance in terms of accuracy, generalization across cell types, and biological relevance.
- Key findings include the importance of CTCF binding and chromatin co-accessibility in prediction accuracy, with only a subset of omics inputs significantly contributing to model performance.
Abstract
Three-dimensional chromatin organization is essential for gene regulation and is commonly measured using Hi-C contact maps. Recent deep learning models have been developed to predict Hi-C maps from genomic and epigenomic features. However, their relative performance and biological interpretability remain poorly understood due to the lack of systematic evaluation. Here, we present a comprehensive benchmarking framework that evaluates five Hi-C prediction models: C.Origami, Epiphany, ChromaFold, HiCDiffusion, and GRACHIP, across predictive accuracy, visual fidelity, and downstream biological analyses. Among them, Epiphany consistently achieved the best overall performance, combining high accuracy, cross-cell-type generalization, realistic map quality, and reliable loop recovery. The framework further shows that epigenomic features, particularly CTCF binding and chromatin co-accessibility, are the primary drivers of accurate Hi-C pattern prediction. Notably, although many models incorporate multiple omics inputs, only a limited subset substantially contributes to performance. This manuscript clarifies model behaviour and provides guidance for developing and interpreting Hi-C prediction methods.
bioinformatics 2026-03-02 v1
miREA: a network-based tool for microRNA-oriented enrichment analysis
Zhang, Z.; Lai, X.
AI Summary
- miREA is a network-based tool designed for miRNA-oriented enrichment analysis, focusing on miRNA-gene interactions (MGIs) to interpret miRNA function at the pathway level.
- It employs five edge-based enrichment methods, integrating expression and interactome data with pathway networks, outperforming traditional node-based methods in sensitivity and biological interpretability.
- Benchmarking in various cancer types, including bladder cancer, demonstrated miREA's effectiveness in identifying relevant pathways and generating mechanistic hypotheses for experimental validation.
Abstract
MicroRNAs (miRNAs) regulate gene expression at the post-transcriptional level. To interpret the function of miRNAs at the pathway level, it is necessary to use enrichment analysis tools that employ gene regulatory networks. However, existing network node-centric methods focus predominantly on gene expression profiles, neglecting the role of regulatory information encoded in miRNA-gene interactions (MGIs) that constitute network edges. This omission introduces analytical bias and limits the methods' biological interpretability. Here, we present miREA, a network-based tool for miRNA enrichment analysis that leverages MGIs to characterize miRNA function at the pathway level. miREA implements five edge-based enrichment methods spanning over-representation, scoring-based, topology-aware, and network propagation approaches by integrating expression and interactome profiles with pathway networks. Benchmarking across multiple cancer types shows that the edge-based methods outperform node-based methods in improving sensitivity to identify relevant pathways and biological interpretability while maintaining controlled false positive rates. We further demonstrate the utility of miREA in elucidating miRNA-gene-pathway regulatory mechanisms in bladder cancer. miREA is a versatile enrichment analysis tool that provides pathway-level interpretation of human miRNA function and facilitates mechanistic hypothesis generation for experimental validation.
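Edge-based over-representation can be illustrated with a one-sided hypergeometric test in which the sampling units are miRNA-gene interactions rather than genes. This is a sketch of the general technique under assumptions: the abstract does not specify which statistical test miREA's over-representation method uses, and the function names and the miRNA and gene labels below are hypothetical.

```python
from math import comb
from typing import Set, Tuple

Edge = Tuple[str, str]  # (miRNA, target gene)


def hypergeom_upper_tail(k: int, population: int, successes: int, draws: int) -> float:
    """P(X >= k) when drawing `draws` items without replacement from `population`
    items of which `successes` are pathway edges."""
    upper = min(successes, draws)
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(population, draws)


def edge_over_representation(
    universe: Set[Edge], pathway_edges: Set[Edge], selected_edges: Set[Edge]
) -> float:
    """p-value for enrichment of pathway edges among the selected (e.g. dysregulated) edges."""
    pathway = pathway_edges & universe
    selected = selected_edges & universe
    overlap = len(pathway & selected)
    return hypergeom_upper_tail(overlap, len(universe), len(pathway), len(selected))


# Hypothetical interactome of 10 miRNA-gene interactions.
toy_universe = {(f"miR-x{i}", f"gene{i}") for i in range(10)}
toy_pathway = {(f"miR-x{i}", f"gene{i}") for i in range(3)}
toy_selected = set(toy_pathway)  # every selected edge lies in the pathway

p_value = edge_over_representation(toy_universe, toy_pathway, toy_selected)
```

Here all three selected edges fall inside a three-edge pathway, so the p-value is 1/C(10, 3) = 1/120. A node-based test on the same data would count genes and ignore which miRNA regulates them.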
bioinformatics 2026-03-02 v1
Evaluating genome assemblies with HMM-Flagger
Asri, M.; Eizenga, J. M.; Hebbar, P.; Real, T. D.; Lucas, J.; Loucks, H.; Calicchio, A.; Diekhans, M.; Eichler, E. E.; Salama, S.; Miga, K. H.; Paten, B.
AI Summary
- HMM-Flagger uses a hidden Markov model with a Gaussian autoregressive process to detect structural errors in genome assemblies by analyzing read coverage.
- It achieved F1 scores of 78.4% and 60.4% for synthetic errors with Pacific Biosciences HiFi and Oxford Nanopore Technologies R10 data, respectively.
- Applied to real assemblies, it identified large misassemblies in HG002 and showed significant error rate reduction from 0.94% to 0.38% between HPRC releases, validating NOTCH2NL assemblies.
Abstract
HMM-Flagger is a reference-free tool for detecting structural errors in haplotype-resolved genome assemblies based upon the coverage of mapped reads. It models read coverage with a hidden Markov model augmented by a Gaussian autoregressive process, which enables classifying coverage anomalies as erroneous blocks, false duplications, or collapsed blocks. Trained and tested on synthetic misassemblies, it detected synthetic errors using Pacific Biosciences HiFi and Oxford Nanopore Technologies R10 data with F1 scores of 78.4% and 60.4%, respectively. When applied to six HG002 assemblies it revealed multiple large misassemblies including false duplications and collapse events in human satellites. Applied to assemblies from the Human Pangenome Reference Consortium (HPRC), HMM-Flagger demonstrated substantial improvements from release 1 (0.94% error rate) to release 2 (0.38%), reflecting technological advances. HMM-Flagger also validated NOTCH2NL assemblies in HPRC release 2 and confirmed the correctness of three novel structural configurations.
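The idea of labelling assembly blocks from read coverage with a hidden Markov model can be sketched with plain Viterbi decoding. This is a deliberately simplified illustration, not HMM-Flagger's model: it omits the Gaussian autoregressive component, and the state coverage multipliers, `stay_prob` and `rel_std` are assumed values chosen for the toy example.

```python
import math
from typing import Dict, List, Sequence

# Assumed mean coverage of each state, as a multiple of the expected coverage.
STATE_MULTIPLIER: Dict[str, float] = {
    "erroneous": 0.1,
    "duplicated": 0.5,
    "haploid": 1.0,
    "collapsed": 2.0,
}


def log_gaussian(x: float, mean: float, std: float) -> float:
    """Log density of a normal distribution."""
    return -0.5 * math.log(2 * math.pi * std * std) - (x - mean) ** 2 / (2 * std * std)


def viterbi(
    coverage: Sequence[float],
    expected: float,
    stay_prob: float = 0.99,
    rel_std: float = 0.15,
) -> List[str]:
    """Most likely state path for a coverage track (Gaussian emissions, sticky transitions)."""
    if not coverage:
        return []
    names = list(STATE_MULTIPLIER)
    log_stay = math.log(stay_prob)
    log_switch = math.log((1 - stay_prob) / (len(names) - 1))
    std = rel_std * expected

    def emit(state: str, x: float) -> float:
        return log_gaussian(x, STATE_MULTIPLIER[state] * expected, std)

    def step(prev: str, cur: str) -> float:
        return log_stay if prev == cur else log_switch

    score = {s: math.log(1 / len(names)) + emit(s, coverage[0]) for s in names}
    backpointers = []
    for x in coverage[1:]:
        new_score, pointers = {}, {}
        for cur in names:
            best_prev = max(names, key=lambda prev: score[prev] + step(prev, cur))
            new_score[cur] = score[best_prev] + step(best_prev, cur) + emit(cur, x)
            pointers[cur] = best_prev
        score = new_score
        backpointers.append(pointers)

    # Trace back from the best final state.
    state = max(names, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Toy track: expected coverage 30x with a doubled-coverage stretch in the middle.
toy_coverage = [30, 31, 29, 30, 60, 61, 59, 62, 30, 29, 31]
toy_path = viterbi(toy_coverage, expected=30.0)
```

The sticky transition probability keeps single noisy windows from flipping the label, while a sustained doubling of coverage is decoded as a collapsed block.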
bioinformatics 2026-03-02 v1
Benchmarking niche identification via domain segmentation for spatial transcriptomics data
Wang, Y.; Chen, Y.; Yang, L.; Wang, C.; Cai, J.; Xin, H.
AI Summary
- This study benchmarks 16 domain segmentation algorithms on high-resolution CosMx ST data from a human lymph node to identify tissue niches, revealing that most algorithms fail to accurately define niche boundaries in their default settings.
- The primary challenge identified is the reduction in spatial signal-to-noise ratio due to stochastic infiltration of peripheral cell types, which obscures key functional lineage distributions.
- Strategic weighting of core functional lineages improved niche resolution, highlighting the need for specialized computational methods for functional microenvironment analysis.
Abstract
Tissue niches are spatially organized microenvironments in which coordinated multicellular interactions shape cellular states and biological functions. Currently, niche identification is routinely performed using domain segmentation frameworks. While interrelated, spatial domains and niches are not fundamentally equivalent. The former emphasizes intra-domain compositional consistency and transcriptomic homogeneity, whereas the latter is defined by the emergent properties of localized signaling gradients and the functional reciprocity between key cell lineages. Here, we present a high-resolution reference by thoroughly annotating single-cell resolution CosMx ST data of a human follicular lymphoid hyperplasia lymph node, a dynamic, non-compartmentalized tissue containing several critical immune niches defined by specific lineage architectures. We systematically benchmarked 16 contemporary domain segmentation algorithms, demonstrating that most methods in their default configurations fail to recapitulate biologically defined niche boundaries. Our analysis reveals that the definitive, disjoint spatial distributions of key functional lineages are frequently obscured by the stochastic infiltration of peripheral cell types. Such reduction in the spatial signal-to-noise ratio represents a primary bottleneck for existing algorithms, which prioritize local transcriptomic variance over global architectural logic. Following this observation, we demonstrate that strategic weighting of core functional lineages can restore the resolution of spatial niches in select domain segmentation frameworks. Cross-comparison against compartmentalized tissues further underscores the unique challenges of niche identification in non-mechanically separated environments and clarifies the fundamental divergence between structural domain segmentation and functional niche discovery. 
Our work delineates the limitations of current paradigms and advocates for the development of specialized computational approaches tailored specifically to the complexity of functional microenvironments.
bioinformatics 2026-03-02 v1
GTA-5: A Unified Graph Transformer Framework for Ligands and Protein Binding Sites - Part I: Constructing the PDB Pocket and Ligand Space
Ciambur, B. C.; Pageau, R.; Sperandio, O.
AI Summary
- GTA-5 is a graph transformer auto-encoder framework that integrates ligands and protein binding sites into a unified latent space by representing them as 3D point clouds with Tripos atom type labels.
- Trained on 64,124 liganded pockets and 23,133 unique ligands, GTA-5 clusters functional protein families coherently while capturing physicochemical properties like volume and hydrophobicity.
- The framework supports applications like scaffold hopping, QSAR/QSPR modeling, and drug repurposing by enabling structural reasoning based on spatial context rather than bond connectivity.
Abstract
Structural recognition between a protein target and a ligand underpins therapeutic innovation, yet computational representations of protein binding sites and small molecules remain largely disjoint. Here we introduce GTA-5, a unified graph transformer auto-encoder framework designed to capture the geometric structure and chemical composition of ligands and protein binding pockets, embedding them into multidimensional latent spaces where proximity reflects functional compatibility. Ligands and pockets are represented as three-dimensional point clouds annotated with Tripos atom type labels, omitting explicit bond connectivity to enable structural reasoning based on spatial context rather than predefined connectivity graphs. By not enforcing bond topology, GTA-5 maintains representational flexibility across molecular modalities while preserving chemically meaningful local environments. The model was trained on a curated dataset from the Protein Data Bank comprising 64,124 liganded pockets and 23,133 unique ligands spanning 2,257 protein families. We find that functional protein families cluster coherently in both pocket and ligand latent spaces while retaining biologically meaningful heterogeneity. The model captures physicochemical pocket properties such as volume, exposure, and hydrophobicity directly from raw structural data, while ligands with distinct scaffolds co-localise when occupying similar binding environments. This provides a basis for several downstream applications including scaffold hopping in ligand-based virtual screening, QSAR/QSPR modelling using embedding-derived descriptors, and drug repurposing via pocket similarity. More broadly, the GTA-5 framework establishes a foundation for structural reasoning across molecular modalities in drug discovery.
bioinformatics 2026-03-02 v1
ProPrep: An Interactive and Instructional Interface for Proper Protein Preparation with AMBER
Walker, A.; Guberman-Pfeffer, M. J.
AI Summary
- ProPrep is an interactive interface designed to guide users through the process of preparing proteins for molecular dynamics (MD) simulations using AMBER, addressing the need for accessible yet expert-quality preparation.
- It integrates multiple functions including structure downloading, homology searches, alignment, structural repair, mutation application, and simulation setup, all within a single workspace.
- The tool was demonstrated on a 64-heme cytochrome 'nanowire' bundle, completing the preparation from a PDB file to energy minimization in 18 minutes of user interaction, with the whole process recorded in a shareable, replayable session log.
Abstract
Millions of experimental and AI-predicted protein structures are now available, and the biosynthetic promise of bespoke proteins is increasingly within reach. The functional characterization challenge thus posed cannot be addressed by experimental techniques alone. Molecular dynamics (MD) simulations offer functional screening with atomic resolution, yet accessibility remains limited. Existing computational chemistry software presents stark trade-offs whereby powerful tools require extensive expertise and manual effort, or user-friendly programs function as black boxes that obscure critical preparation decisions. Herein, we present ProPrep, an interactive workflow manager that guides users through expert-quality MD preparation by showing the 'what, why, and how' of each step while automating tedious manual operations. Within a single workspace, ProPrep integrates (1) downloading structures from multiple sources (PDB, AlphaFold, AlphaFill), (2) performing homology searches, (3) aligning structures, (4) curating and repairing structural issues, (5) applying mutations, (6) parameterizing specialized residues, (7) converting redox-active sites to forcefield-compatible forms, (8) generating topology and coordinate files, and (9) configuring, executing, and analyzing simulations with active monitoring of key quantities via ASCII visualizations. A key innovation is ProPrep's extensible transformer framework for detecting, defining, and transforming redox-active sites (including mono- and polynuclear metal centers, organic cofactors, and redox-active amino acids) for forcefield compatibility. We demonstrate the full workflow on a 64-heme cytochrome 'nanowire' bundle (PDB: 9YUQ), proceeding from a PDB file to energy minimization of the solvated system (467,635 atoms) for constant pH molecular dynamics, a process demanding 4,819 PDB record modifications and 610 bond definitions, in 18 minutes of user interaction.
The entire process is recorded in an interactive session log that can be shared and replayed for reproducibility, making simulation setup a fully transparent process that relies on what was done instead of what was remembered and reported.
bioinformatics 2026-03-02 v1
Assessment of Generative De Novo Peptide Design Methods for G Protein-Coupled Receptors
Junker, H.; Schoeder, C. T.
AI Summary
- The study assessed the effectiveness of deep learning methods (AlphaFold2 Initial Guess, Boltz-2, RosettaFold3) in designing de novo peptides for G protein-coupled receptors (GPCRs) by validating 124 known GPCR-peptide complexes.
- Generative methods (BindCraft, BoltzGen, RFdiffusion3) were evaluated for their peptide sampling capabilities, revealing issues with confidence overestimation and memorization in both prediction and generation.
- While backbone sampling was adequate, sequence generation was less effective, though improved by ProteinMPNN.
Abstract
G protein-coupled receptors (GPCRs) play a ubiquitous role in the transduction of extracellular stimuli into intracellular responses and therefore represent a major target for the development of novel peptide-based therapeutics. In fact, approximately 30% of all non-sensory GPCRs are peptide-targeted, representing a blueprint for the design of de novo peptides, both as pharmacological tools and therapeutics. The recent advances of deep learning-based protein structure generation and structure prediction offer a multitude of peptide design strategies for GPCRs, yet confidence metrics rarely correlate with experimental success. In the context of peptides, this problem is exacerbated by the lack of elaborate tertiary structures in peptides, raising the question of whether this is due to inadequate sampling or insufficient scoring. In this two-part benchmark, we addressed this question by first simulating the validation process of 124 unique known GPCR-peptide complexes using AlphaFold2 Initial Guess, Boltz-2 and RosettaFold3. We then assessed the peptide sampling capabilities of the respective generative methods BindCraft, BoltzGen and RFdiffusion3. Our results indicate that current design pipelines primarily suffer from significant confidence overestimation for misplaced peptides in the validation phase across all three prediction methods. We further highlight occurrences of significant memorization in both prediction and generation of peptides. While all generative methods sample backbone space sufficiently, their simultaneous sequence generation remains subpar and can be partially recovered through the use of ProteinMPNN. Taken together, our benchmark offers guidance for the design of peptides specifically using deep learning-based pipelines.
bioinformatics 2026-03-02 v1
SPATIALLY PATTERNED PODOCYTE STATE TRANSITIONS COORDINATE AGING OF THE GLOMERULUS
Chaney, C.; Pippin, J. W.; Tran, U.; Eng, D.; Wang, J.; Carroll, T. J.; Shankland, S. J.; Wessely, O.
AI Summary
- The study investigated how aging affects the glomerulus by analyzing single nuclei transcriptomics from kidneys of mice at different ages, focusing on regional and cell type-specific responses.
- Results showed that aging in podocytes is characterized by a transition from expressing canonical podocyte genes to showing inflammatory and senescent signatures, predominantly in the juxtamedullary region.
- Unlike podocytes, other glomerular cell types showed minimal age-related changes, indicating that podocyte aging is selective and coordinated rather than a universal degeneration.
Abstract
Background: As the US population lives longer, the risk, incidence, prevalence and severity of chronic kidney diseases are increasing. Glomerular diseases are the leading cause of chronic and end-stage kidney disease. Yet, the cellular responses and the underlying mechanisms of progressive glomerular disease, which ultimately leads to glomerulosclerosis and loss of kidney function with advancing age, are poorly understood. Methods: Kidneys of young (4-month-old), middle-aged (20-month-old) and aged (24-month-old) mice were separated into outer cortex and juxtamedullary region and processed for single-nuclei transcriptomics. Focusing on the aging glomerulus, data were analyzed using a state-of-the-art analysis pipeline dissecting out the cellular age- and kidney region-specific responses. Results: Global analysis of the transcriptome reveals region-specific differences that are detectable across multiple cell types, exemplified by the expression of Napsa as a bona fide juxtamedullary marker. In contrast, aging led to rather cell type-specific responses. In the glomerulus, healthy podocytes were characterized by expression of canonical podocyte genes; conversely, the senescent, aged podocytes were characterized by the down-regulation of canonical podocyte genes and the emergence of inflammatory and senescent signatures. Interestingly, these senescent podocytes were primarily located in the juxtamedullary region, suggesting that juxtamedullary podocytes are more sensitive. Yet, instead of aging being defined by distinct cell states, the profiles, as well as ligand-receptor and pseudotime analyses, suggest that podocyte aging is selective and coordinated rather than a universal degeneration. This differed from the other glomerular cell types: parietal epithelial cells, glomerular endothelial cells and mesangial cells. While they also existed in different subpopulations, they exhibited few region- or age-dependent changes.
Finally, proximal tubular aging manifested itself as discrete cellular states. Conclusions: The single-nuclei transcriptomics of the aging kidney provides a mechanistic explanation for regional susceptibility of nephrons and suggests that future therapeutic strategies need to consider the cellular and spatial complexity of the glomerulus.
bioinformatics 2026-03-02 v1
Detecting Extrachromosomal DNA from Routine Histopathology
Khalid, M. A.; Gratius, M.; Brown, C.; Younis, R.; Ahmadi, Z.; Chavez, L.
AI Summary
- This study developed a deep learning framework to detect extrachromosomal DNA (ecDNA) from standard histopathology images across twelve cancer types.
- The approach successfully distinguished ecDNA-amplified tumors from chromosomally amplified or non-amplified ones, with notable results in glioblastoma.
- The method identified histomorphologic changes associated with ecDNA, correlating with poor survival outcomes, suggesting potential for routine diagnostic integration.
Abstract
Extrachromosomal DNA (ecDNA) is a major driver of oncogene amplification, tumour heterogeneity and poor clinical outcomes [1-3], yet its detection relies on specialised genomic assays that are not integrated into routine diagnostics. Here, we show that ecDNA status can be inferred directly from standard haematoxylin and eosin-stained whole-slide pathology images. We develop an end-to-end, weakly supervised deep learning framework that aggregates thousands of high-magnification patches per slide with slide-level augmentation and interpretable attention. Across twelve cancer types from The Cancer Genome Atlas, the approach identifies tumours with genomic amplifications and, critically, distinguishes ecDNA-amplified from chromosomally amplified or non-amplified tumours, with the strongest signal in glioblastoma. Attention maps localise regions enriched for nuclei with altered chromatin intensity and texture, and predicted ecDNA status recapitulates its adverse association with survival. These results indicate that ecDNA amplifications leave reproducible histomorphologic footprints detectable by routine pathology, enabling scalable screening to prioritise tumours for confirmatory molecular testing.
bioinformatics 2026-03-02 v1
Scalable mass-spectrometry-based molecular phylogeny with TreeMS2
Dierckx, M.; Adams, C.; Gauglitz, J. M.; Bittremieux, W.
AI Summary
- TreeMS2 extends molecular phylogeny to proteomic and metabolomic data by comparing MS/MS spectra, bypassing annotation for rapid analysis.
- The tool constructs phenotype-derived trees that can be compared with genetic trees, revealing where molecular phenotypes align or diverge from evolutionary history.
- Across various datasets, TreeMS2 effectively reconstructs biological relationships, distinguishing cell types in single-cell proteomics and resolving biochemical structures in metabolomics.
Abstract
Molecular phylogeny is a well-established method for inferring evolutionary relationships from DNA and RNA sequences. Here, we extend this concept beyond genetic information by applying phylogeny-like analysis to proteomic and metabolomic mass spectrometry data, capturing relationships based on the realized molecular phenotype. The resulting phenotype-derived trees can be directly compared with conventional genetic-based trees to identify where molecular phenotypes reflect evolutionary history and where they diverge due to functional adaptation, regulation, or environmental influence. To enable this analysis, we introduce TreeMS2, a computational tool that constructs similarity matrices by directly comparing tandem mass spectrometry (MS/MS) spectra between samples. By bypassing spectrum annotation, TreeMS2 enables rapid, unbiased comparisons. Across diverse datasets, TreeMS2 reconstructs biologically meaningful relationships. In proteomics, phenotype-derived trees recapitulate established taxonomy, with deviations pinpointing sample handling errors. In single-cell proteomics, our method distinguishes cell types despite sparse and noisy measurements, and in metabolomics, it resolves major biochemical divisions and fine-scale compositional structure. Together, these results establish TreeMS2 as a scalable, annotation-independent framework for deriving molecular relationships from raw MS/MS data.
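The idea of an annotation-free, sample-by-sample comparison of MS/MS spectra can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from TreeMS2: the function names, the 1.0 m/z bin width, the cosine threshold of 0.8, and the definition of distance as the fraction of spectra without a match in the other sample.

```python
import math
from collections import defaultdict

def bin_spectrum(peaks, bin_width=1.0):
    """Sum (m/z, intensity) peaks into fixed-width bins and L2-normalise."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz // bin_width)] += intensity
    norm = math.sqrt(sum(v * v for v in binned.values()))
    return {k: v / norm for k, v in binned.items()} if norm else {}

def cosine(a, b):
    """Cosine similarity of two L2-normalised sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b.get(k, 0.0) for k, v in a.items())

def sample_distance(spectra_a, spectra_b, threshold=0.8):
    """1 minus the fraction of spectra that have a match in the other sample."""
    def matched(src, dst):
        return sum(1 for s in src if any(cosine(s, d) >= threshold for d in dst))
    total = len(spectra_a) + len(spectra_b)
    if total == 0:
        return 0.0
    hits = matched(spectra_a, spectra_b) + matched(spectra_b, spectra_a)
    return 1.0 - hits / total

def distance_matrix(samples, threshold=0.8):
    """Return sorted sample names and the pairwise distance matrix."""
    names = sorted(samples)
    binned = {n: [bin_spectrum(p) for p in samples[n]] for n in names}
    matrix = [[sample_distance(binned[a], binned[b], threshold) for b in names]
              for a in names]
    return names, matrix

# Hypothetical toy input: each sample is a list of spectra,
# each spectrum a list of (m/z, intensity) peaks.
EXAMPLE_SAMPLES = {
    "A": [[(100.0, 1.0), (200.0, 0.5)], [(150.2, 1.0)]],
    "B": [[(100.1, 1.0), (200.3, 0.5)], [(300.0, 1.0)]],
    "C": [[(400.0, 1.0)], [(500.0, 1.0)]],
}
```

Calling `distance_matrix(EXAMPLE_SAMPLES)` yields a symmetric matrix in which A and B are closer to each other than either is to C. A matrix of this kind could then be passed to any standard distance-based tree method such as UPGMA or neighbor joining. The all-against-all comparison here is quadratic in the number of spectra, so it is only meant for toy data.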
bioinformatics 2026-03-02 v1
Prediction and analysis of new HisKA-like domains
Silly, L.; Perriere, G.; Ortet, P.
AI Summary
- This study analyzed 869,964 sequences of incomplete histidine kinases (iHKs) with HATPase but lacking HisKA domains to identify new HisKA-like domains.
- 18 HisKA-like profiles were identified, with their 3D structures matching known HisKA domains and genomic contexts indicating involvement in signal transduction.
- The findings were cross-validated with curated annotations and a negative dataset, suggesting potential improvements in annotating prokaryotic regulation pathways.
Abstract
Histidine kinases (HKs) take part in many signaling pathways through their role in two-component systems (TCS). Using autophosphorylation and phosphotransfer to a response regulator (RR), they enable organisms to adapt to their environment. Most HKs are transmembrane proteins with a sensing domain outside of the cell and two catalytic domains called HisKA and HATPase. HATPase is required for interaction with ATP and HisKA contains the phosphorylated histidine residue. HKs are involved in various environmental adaptation mechanisms, such as sensing light or biochemical changes. Studying their diversity is therefore important to better understand how cells interact with their environment. There exist incomplete HKs (iHKs) lacking either the HisKA or HATPase domain. Some iHKs with an HATPase domain possess a section of their sequence where a HisKA domain could be expected. These iHKs may include "true" HKs with an unknown HisKA domain that could fill gaps in various signaling pathways. In this study we analyzed 869,964 sequences of iHKs having an HATPase domain but lacking a HisKA domain. We identified 18 HisKA-like profiles and performed multiple meta-studies to assess their HisKA-like characteristics. We found that their 3D structures matched the structure of known HisKA domains. We saw that the genomic context of the genes associated with these profiles contained genes implicated in signal transduction pathways. We cross-validated some of our profiles with curated annotations, as well as with a "negative dataset" made of non-HK proteins. We believe that our work could help improve the annotation of regulation pathways in prokaryotes.
bioinformatics 2026-03-02 v1
Atlas-scale spatially aware clustering with support for 3D and multimodal data using SpatialLeiden
Müller-Bötticher, N.; Malt, A.; Kiessling, P.; Eils, R.; Kuppe, C.; Ishaque, N.
AI Summary
- SpatialLeiden was extended to handle atlas-scale, multi-sample, 3D, and multimodal spatial omics data through neighbor-graph multiplexing on batch-corrected latent spaces.
- The algorithm demonstrated superior performance in creating coherent domains aligned with brain atlases, reconstructing 3D cancer tissue structures, and integrating multimodal features, surpassing specialized tools in modularity and scalability.
Abstract
Here we extend SpatialLeiden, our spatial clustering algorithm, to enable generalised atlas-scale multi-sample, 3D serial-section, and multimodal spatial omics via flexible neighbour-graph multiplexing on batch-corrected latent spaces. It delivers coherent domains aligning with brain atlases across >100 samples, stable 3D reconstruction of cancer tissue structures, and integrated multimodal features, outperforming specialized tools in modularity and scalability on standard hardware. SpatialLeiden is compatible with scverse for broad and intuitive adoption.
bioinformatics 2026-03-02 v1
t2pmhc: A Structure-Informed Graph Neural Network to predict TCR-pMHC Binding
Polster, M.; Stadelmaier, J.; Ball, E.; Scheid, J.; Bauer, J.; Nelde, A.; Claassen, M.; Dubbelaar, M. L.; Walz, J. S.; Nahnsen, S.
AI Summary
- The study introduces t2pmhc, a structure-based graph neural network framework to predict TCR-pMHC binding, utilizing predicted structures of the entire TCR-pMHC complex.
- t2pmhc, incorporating Graph Convolutional Network (GCN) and Graph Attention Network, showed enhanced generalization to unseen peptides over sequence-based methods.
- Analysis revealed that t2pmhc-GCN assigns high attention to biologically relevant regions, like the peptide and CDR3, with specific weighting within the peptide sequence.
Abstract
Mapping of T cell receptors (TCRs) to their cognate MHC-presented peptides (pMHC) is central for the development of precision immunotherapies and vaccine design. However, accurate prediction of TCR affinity to peptide antigens remains an open challenge. Most approaches rely solely on sequence information, although increasing evidence suggests that TCR-pMHC binding is primarily determined by three-dimensional structural interactions within the entire TCR-pMHC complex. Consequently, sequence-based methods often fail to generalize to peptides not included in the training data (unseen peptides). Here we introduce t2pmhc, a structure-based graph neural network framework for predicting TCR-pMHC binding using predicted structures of the entire TCR-pMHC complex. We evaluated a Graph Convolutional Network (GCN) and a Graph Attention Network, both demonstrating improved generalization to unseen peptides compared to state-of-the-art models across a variety of public datasets. Evaluation with crystallographic structures yields high-confidence predictions, indicating that current limitations of structure-based models are largely driven by the accuracy of structure prediction. Analysis of node attention patterns in t2pmhc-GCN reveals biologically consistent patterns, assigning high attention to the peptide and the CDR3 regions. Within the peptide sequence, canonical MHC anchor residues are consistently downweighted, whereas potential TCR-binding residues are upweighted. These findings establish t2pmhc as a structure-informed framework for robust TCR-pMHC binding prediction, enabling improved generalization to unseen antigens and providing a foundation for integrating TCR repertoire sequencing into vaccine design and immunotherapy.
bioinformatics 2026-03-02 v1
Multiscale Symbolic Morpho-Barcoding Reveals Region-Specific and Scale-Dependent Neuronal Organization
Zhao, S.; Li, Y.; Liu, Y.; Peng, H.
AI Summary
- The study introduces Multiscale Morpho-Barcoding (MMB), a framework for encoding whole-brain neuronal morphology into symbolic representations.
- By applying MMB to 1,876 reconstructed mouse neurons, the research identified region-specific and scale-dependent neuronal organization patterns.
- MMB effectively distinguishes major brain divisions and specific thalamic circuit classes, enhancing understanding beyond traditional projection strength analysis.
Abstract
Neuronal morphology is a central determinant of circuit organization, yet its multiscale complexity has hindered systematic, brain-wide analysis and integration with anatomical context. Here we introduce Multiscale Morpho-Barcoding (MMB), a framework that encodes whole-brain neuronal morphology into symbolic representations spanning cellular geometry, axonal tract routing, arbor organization, and predicted synaptic distributions. Applying MMB to 1,876 fully reconstructed mouse neurons, comprising 3,776 arbors and 2.63 million predicted presynaptic sites, we identify distinct multiscale morpho-patterns that reveal region-specific and scale-dependent principles of neuronal organization across the brain. MMB robustly discriminates major anatomical divisions and resolves canonical thalamic circuit classes beyond what can be achieved using projection strength alone. By transforming complex neuronal geometry into interpretable multiscale representations, MMB provides a general framework for systematic comparison of neuronal structure and for integrating morphology with connectivity and function at whole-brain scale.
bioinformatics 2026-03-02 v1
Explainable AI for end-to-end pathogen target discovery and molecular design
Polonio, A.; Perez-Garcia, A.; Fernandez-Ortuno, D.; Jimenez-Castro, L.
AI Summary
- The study introduces APEX, an explainable AI framework for identifying pathogen targets and designing molecules across species.
- APEX uses ESM-2 embeddings, graph attention networks, and a multilayer perceptron to predict essentiality, virulence, and druggability, recovering known fungal targets and proposing new ones like GmrSD and YadV.
- It also guides the design of inhibitors by highlighting key residues and pockets, demonstrating its utility in both known and novel target sites.
Abstract
Drug discovery is often constrained by target identification, a bottleneck especially acute in antimicrobial development and the fight against emerging fungicide resistance. We present APEX (Attention-based Protein EXplainer), an explainable AI framework for cross-species, proteome-scale target discovery and pocket-guided molecular design. APEX combines ESM-2 evolutionary embeddings, graph attention networks, and a multilayer perceptron to train pathogen-specific essentiality and virulence predictors (APEX-Tar) alongside a universal druggability model (APEX-Drug). Attention maps and GNNExplainer-derived subgraphs highlight residues and pockets driving predictions, enabling direct conditioning of structure-based diffusion models for inhibitor generation. APEX-Tar recovers known fungal targets (endopolygalacturonase 1, Hog1 MAPK) and proposes new candidates, including fungal GmrSD and bacterial YadV. APEX-Drug recapitulates established fungicide sites (β-tubulin, cytochrome b), guides putative inhibitor design for GmrSD, and identifies in YadV a previously undescribed pocket distinct from known pilicide sites. Together, APEX offers a kingdom-agnostic pipeline for explainable target prioritization and guided molecular design.
bioinformatics 2026-03-02 v1
Exploring the mechanism of Panax Notoginseng in the treatment of skin wound based on network pharmacology and experimental verification
Li, Y.-b.; Li, Q.-l.; Liu, J.; Li, J.-c.; Geng, H.-m.; Li, G.-k.; Jin, C.; Luo, J.; Zhang, Z.
AI Summary
- This study used network pharmacology to identify 8 active components, 156 targets, and 115 pathways of Panax notoginseng (PN) in treating skin wounds, focusing on core targets like TNF, IL-6, and IL-10.
- Experimental validation in rats showed that PN treatment significantly reduced wound size, inflammation, and cytokine expression (TNF, IL-6, IL-10) compared to controls at various post-injury time points.
- PN promotes skin healing by modulating multiple signaling pathways, enhancing fibroblast proliferation, and optimizing the healing process from inflammation to tissue remodeling.
Abstract
Background: How to shorten the healing cycle and reduce the incidence of infection is a difficult problem faced by clinicians. Panax notoginseng (PN), a traditional Chinese medicine, can promote the absorption of inflammatory exudates, granulation tissue formation and epidermal proliferation, effectively inhibit the inflammatory reaction of wounds and promote the healing of skin wounds, but its molecular mechanism has not been fully clarified so far. Based on network pharmacology and animal experiments, this study explored the targets and molecular mechanism of PN in the treatment of skin wounds. Methods: Through network pharmacology, we screened the active components of PN and the common targets related to skin wounds, constructed a target protein-protein interaction (PPI) network, and performed GO and KEGG enrichment analysis. Using the MCODE and CytoHubba plugins, we explored core functional modules and key targets, ultimately constructing a visual network of PN components-targets-pathways. In the experimental section, forty-eight male Sprague-Dawley (SD) rats were randomly divided into a control group and a PN group, with 24 rats in each group, and underwent full-thickness skin excision. Postoperatively, the PN group received intraperitoneal injections of drugs, while the control group received an equal amount of saline. Data were collected on postoperative days 1, 4, and 7, and hematoxylin and eosin (HE) staining, immunohistochemical staining, quantitative real-time polymerase chain reaction (qRT-PCR), and enzyme-linked immunosorbent assay (ELISA) were used to evaluate skin healing and detect changes in the expression of TNF-α, IL-6, and IL-10 in the tissues. Results: This study identified 8 major active components, 156 targets, and 115 signaling pathways involved in the treatment of skin wounds in rats using PN. The top 10 core target genes included TNF, IL-6, and IL-10, primarily enriched in signaling pathways such as NF-κB, MAPK, and JAK-STAT.
Animal experiments revealed that at 4 and 7 days post-injury, the wound area in the PN group was significantly smaller than that in the control group (P<0.05). HE staining showed reduced infiltration of neutrophils and inflammatory cells in the injury area at 7 days in the PN group, accompanied by more pronounced fibroblast proliferation and collagen secretion. Molecular detection indicated that TNF-α, IL-6, and IL-10 positive reactants were mainly distributed in the cytoplasm and matrix of epidermal cells, inflammatory cells, and fibroblasts in the skin. qRT-PCR and ELISA results showed that TNF-α expression in the PN group was significantly lower than that in the control group at 4 and 7 days (P<0.01). IL-6 expression was lower than that in the control group at all time points, peaking at 4 days and then decreasing (P<0.01). IL-10 expression was significantly lower than that in the control group at 1 and 7 days (P<0.01). Conclusion: PN treatment for skin wounds exhibits characteristics such as multi-component, multi-target, multi-pathway synergistic effects, and various regulatory pathways. It can reshape the dynamic balance of the cytokine network, optimize the temporal progression of "inflammation initiation - repair transition - tissue remodeling", and improve skin wound healing.
bioinformatics 2026-03-02 v1
ExoFILT: Transfer learning for robust and accelerated analysis of exocytosis single-particle tracking data
Kramer, E.; Betancur, L. I.; Meek, S.; Tosi, S.; Manzo, C.; Oliva, B.; Gallego, O.
AI Summary
- ExoFILT uses transfer learning to classify exocytic events in single-particle tracking data, reducing manual annotation time by ten-fold and enhancing consistency.
- Applied to dual-color time-lapse movies, ExoFILT quantified temporal relationships between exocytic proteins.
- The tool revealed distinct subpopulations of exocytic events with different molecular compositions, providing insights into exocytosis mechanisms.
Abstract
Motivation: Understanding constitutive exocytosis at the molecular level requires quantitative characterization of protein dynamics during the process. Single-particle tracking allows the measurement of protein dynamics in living cells. However, identifying bona fide exocytic events requires extensive manual annotation, limiting throughput and introducing personal biases that affect reproducibility. Results: We present ExoFILT, a deep learning-based classifier designed to identify exocytic events in single-particle tracking data, using the exocyst complex as a reference. Trained via transfer learning on simulated and experimental data, ExoFILT reduces the time required for manual annotation by ten-fold while improving measurement consistency across researchers. When applied to simultaneous dual-color time-lapse movies, ExoFILT enabled the systematic quantification of temporal relationships between exocytic proteins. The increased throughput uncovered distinct subpopulations of exocytic events with differential molecular composition (e.g., events with and without detectable levels of Sec1), underscoring the potential of ExoFILT to reveal mechanistic insights into exocytosis.
bioinformatics 2026-03-02 v1
ToxiVerse: A Public Platform for Chemical Toxicity Data Sharing and Customizable Predictive Modeling
Durai, P.; Russo, D. P.; Shen, Y.; Wang, T.; Chung, E.; Li, L.; Zhu, H.
AI Summary
- ToxiVerse is a public platform developed to provide user-friendly machine learning tools for computational toxicology, addressing the need for efficient chemical toxicity assessment.
- It features three modules: Bioprofiler for chemical descriptor generation, Database with 50,000 curated chemicals, and Cheminformatics for dataset management and QSAR model generation.
- The platform allows researchers to perform bioprofiling, access toxicity data, and predict chemical toxicity without programming expertise, available at www.toxiverse.com.
Abstract
Chemical toxicity assessment is critical for drug development and environmental safety. Computational models have emerged as a promising alternative to animal testing and now play a significant role in efficiently evaluating new chemicals. To address the urgent need for providing user-friendly machine learning tools in computational toxicology, we developed ToxiVerse, a public web-based platform. It provides curated toxicity datasets, automatic chemical bioprofiling, and a predictive modeling interface designed for researchers who lack programming expertise. The platform comprises three integrated modules: (i) the Bioprofiler module, which provides chemical descriptors by combining chemical-bioactivity data from PubChem assay with a machine learning-based data gap-filling procedure; (ii) the Database module, which hosts around 50,000 curated unique chemicals covering diverse toxicity endpoints; and (iii) the Cheminformatics module, which allows users to upload their own datasets, use datasets from ToxiVerse, or retrieve existing data from PubChem; perform chemical curation; and automatically generate Quantitative Structure-Activity Relationship (QSAR) models to predict chemicals of interest. ToxiVerse enables researchers to carry out bioprofiling, access curated toxicity datasets, and evaluate chemical toxicity through machine learning-based modeling and prediction. The platform is supported by sample files and a detailed tutorial, and it is freely accessible at www.toxiverse.com.
bioinformatics 2026-03-02 v1
scProfiterole: Clustering of Single-Cell Proteomic Data Using Graph Contrastive Learning via Spectral Filters
Coskun, M.; Lopes, F. B.; Kubilay Tolunay, P.; Chance, M. R.; Koyuturk, M.
AI Summary
- The study addresses the challenge of clustering single-cell proteomic data by introducing scProfiterole, which uses graph contrastive learning (GCL) with spectral filters to improve cell type identification.
- scProfiterole employs three types of homophilic filters (random walks, heat kernels, beta kernels) and uses Arnoldi orthonormalization for efficient polynomial interpolation of these filters.
- Key findings show that GCL with learnable polynomial coefficients, along with heat and beta kernels, enhances clustering performance, with polynomial interpolation outperforming traditional methods.
Abstract
Novel technologies for the acquisition of protein expression data at the single cell level are emerging rapidly. Although there exists a substantial body of computational algorithms and tools for the analysis of single cell gene expression (scRNAseq) data, tools for even basic tasks such as clustering or cell type identification for single cell proteomic (scProteomics) data are relatively scarce. Adoption of algorithms that have been developed for scRNAseq into scProteomics is challenged by the larger number of drop-outs, missing data, and noise in single cell proteomic data. Graph contrastive learning (GCL) on cell-to-cell similarity graphs derived from single cell protein expression profiles shows promise in cell type identification. However, missing edges and noise in the cell-to-cell similarity graph require careful design of convolution matrices to overcome the imperfections in these graphs. Here, we introduce scProfiterole (Single Cell Proteomics Clustering via Spectral Filters), a computational framework to facilitate effective use of spectral graph filters in GCL-based clustering of single cell proteomic data. Since clustering assumes a homophilic network topology, we consider three types of homophilic filters: (i) random walks, (ii) heat kernels, (iii) beta kernels. Since direct implementation of these filters is computationally prohibitive, the filters are either truncated or approximated in practice. To overcome this limitation, scProfiterole uses Arnoldi orthonormalization to implement polynomial interpolations of any given spectral graph filter.
Our results on comprehensive single cell proteomic data show that (i) graph contrastive learning with learnable polynomial coefficients that are carefully initialized improves the effectiveness and robustness of cell type identification, (ii) heat kernels and beta kernels improve clustering performance over adjacency matrices or random walks, and (iii) polynomial interpolation of complex filters outperforms approximation or truncation. The source code for scProfiterole is available at https://github.com/mustafaCoskunAgu/scProfiterole
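To make the notion of a truncated spectral filter concrete, the sketch below applies a heat kernel exp(-tL) to a node signal using a truncated Taylor polynomial in the normalized Laplacian L. This is the simple truncation baseline that the abstract contrasts with interpolation, not scProfiterole's Arnoldi-based method. The function names, the dense list-of-lists graph representation, and the default values of `t` and `order` are assumptions made for illustration.

```python
import math

def normalized_laplacian(adj):
    """L = I - D^{-1/2} A D^{-1/2} for a dense symmetric adjacency matrix."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in deg]
    return [[(1.0 if i == j else 0.0) - inv_sqrt[i] * adj[i][j] * inv_sqrt[j]
             for j in range(n)] for i in range(n)]

def matvec(matrix, vector):
    """Dense matrix-vector product."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def heat_filter_truncated(adj, signal, t=1.0, order=10):
    """Approximate exp(-t L) @ signal by sum_{k=0}^{order} (-t L)^k signal / k!."""
    lap = normalized_laplacian(adj)
    term = list(signal)   # holds (-t L)^k signal / k!, starting at k = 0
    out = list(signal)
    for k in range(1, order + 1):
        term = [(-t / k) * v for v in matvec(lap, term)]
        out = [o + v for o, v in zip(out, term)]
    return out
```

For a two-node graph with a single edge, the Laplacian has eigenvalues 0 and 2, so filtering the signal `[1.0, 0.0]` should approach `[0.5 * (1 + exp(-2t)), 0.5 * (1 - exp(-2t))]` as `order` grows. A low `order` gives a visibly worse approximation for large `t`, which is the kind of truncation error the abstract refers to.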
bioinformatics 2026-02-28 v1
LRSomatic: a highly scalable and robust pipeline for somatic variant calling in long-read sequencing data
Forsyth, R. A.; Harbers, L.; Verhasselt, A.; Iraizos, A.-L. R.; Yang, S.; Vande Velde, J.; Davies, C.; Pillay, N.; Lambrechts, L.; Demeulemeester, J.
AI Summary
- LRSomatic is a Nextflow-based pipeline for somatic variant calling from long-read sequencing data, supporting SNV, indel, structural variant, and copy number analysis in both PacBio HiFi and ONT platforms.
- It accommodates paired tumor-normal and tumor-only designs, with the option for epigenetic integration via Fiber-seq.
- Benchmarking on COLO829 and HG008 showed high performance, and application to a clear cell sarcoma case identified all driver alterations, including the EWSR1::ATF1 fusion.
Abstract
Motivation: Long-read sequencing is increasingly used in cancer research and clinical genomics due to its ability to resolve complex genomic variation and previously inaccessible regions of the genome. However, dedicated workflows for comprehensive somatic variant analysis from long-read whole-genome data remain scarce, limiting uptake in cancer genomics. Results: We present LRSomatic, a Nextflow-based, nf-core-compliant pipeline supporting somatic SNV, indel, structural variant, and copy number calling from PacBio HiFi and ONT data. LRSomatic supports paired tumor-normal and tumor-only designs, as well as epigenetic integration via Fiber-seq. Benchmarked on COLO829 and HG008 reference cell lines, LRSomatic achieves state-of-the-art performance across both platforms and variant types. Applied to a case of clear cell sarcoma, it recovers all identified driver alterations, including the pathognomonic EWSR1::ATF1 fusion, and resolves haplotype-specific chromatin accessibility via Fiber-seq. Availability and Implementation: Freely available at https://github.com/intgenomicslab/lrsomatic, implemented in Nextflow DSL2, supported via Docker and Singularity.
bioinformatics 2026-02-28 v1
Arborist: Prioritizing Bulk DNA Inferred Tumor Phylogenies via Low-pass Single-cell DNA Sequencing Data
Weber, L. L.; Ching, C. Y.; Ly, C.; Pan, Y.; Cheng, Y.; Gao, C.; Van Loo, P.
AI Summary
- The study introduces ARBORIST, a method that integrates bulk DNA sequencing with low-pass single-cell DNA sequencing to improve tumor phylogeny reconstruction.
- ARBORIST uses variational inference to prioritize tumor phylogenies by approximating the marginal likelihood of candidate trees.
- Testing on simulated and biological data showed ARBORIST outperforms existing methods, resolving evolutionary relationships in a malignant peripheral nerve sheath tumor.
Abstract
Cancer arises from an evolutionary process that can be reconstructed from DNA sequencing and modeled by tumor phylogenies. High coverage bulk DNA sequencing (bulk DNA-seq) is widely available, but tumor phylogeny inference requires deconvolution, often resulting in non-uniqueness in the solution space. Single-cell DNA sequencing (scDNA-seq) holds potential to yield higher resolution tumor phylogenies, but the sparsity of emerging low-pass sequencing technologies poses challenges for the study of single-nucleotide variants. Increasing availability of data sequenced with both modalities provides an opportunity to capitalize on the advantages of these technologies. While inference methods exist for bulk DNA-seq and for low-pass scDNA-seq, no joint inference methods currently exist. As a first step, we propose a method named ARBORIST that prioritizes tumor phylogenies inferred via bulk DNA-seq using low-pass scDNA-seq data. ARBORIST takes as input a candidate set of trees with corresponding SNV clustering, along with variant and total read count data from scDNA-seq and uses variational inference to approximate a lower bound on the marginal likelihood of each tree in the candidate set. On simulated data, matching characteristics of current scDNA-seq data, ARBORIST outperforms both bulk and low-pass single-cell reconstruction methods. On a biological dataset, ARBORIST conclusively resolves the evolutionary relationship between different SNV clusters on a malignant peripheral nerve sheath tumor, which is supported by orthogonal validation via a proxy for copy number. ARBORIST provides a principled framework for integrating bulk DNA-seq and low-pass scDNA-seq data, improving confidence in tumor phylogeny reconstruction. Availability: https://github.com/VanLoo-lab/Arborist
bioinformatics2026-02-28v1Random Matrix Theory-guided sparse PCA for single-cell RNA-seq data
Chardes, V.AI Summary
- The study introduces a Random Matrix Theory (RMT)-guided sparse PCA method for single-cell RNA-seq data to address noise and variability issues, using a novel biwhitening algorithm to estimate noise per gene.
- This approach automatically selects sparsity levels, making sparse PCA nearly parameter-free, and retains PCA's interpretability.
- Across various technologies and algorithms, the method improved principal subspace reconstruction and outperformed traditional PCA, autoencoders, and diffusion methods in cell-type classification.
Abstract
Single-cell RNA-seq provides detailed molecular snapshots of individual cells but is notoriously noisy. Variability stems from biological differences and technical factors, such as amplification bias and limited RNA capture efficiency, making it challenging to adapt computational pipelines to heterogeneous datasets or evolving technologies. As a result, most studies still rely on principal component analysis (PCA) for dimensionality reduction, valued for its interpretability and robustness, in spite of its known bias in high dimensions. Here, we improve upon PCA with a Random Matrix Theory (RMT)-based approach that guides the inference of sparse principal components using existing sparse PCA algorithms. We first introduce a novel biwhitening algorithm which self-consistently estimates the magnitude of transcriptomic noise affecting each gene in individual cells, without assuming a specific noise distribution. This enables the use of an RMT-based criterion to automatically select the sparsity level, rendering sparse PCA nearly parameter-free. Our mathematically grounded approach retains the interpretability of PCA while enabling robust, hands-off inference of sparse principal components. Across seven single-cell RNA-seq technologies and four sparse PCA algorithms, we show that this method systematically improves the reconstruction of the principal subspace and consistently outperforms PCA-, autoencoder-, and diffusion-based methods in cell-type classification tasks.
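To make the role of Random Matrix Theory concrete, here is a minimal sketch of the standard Marchenko-Pastur bulk edge, which separates eigenvalues compatible with pure noise from those carrying signal. It assumes the data have already been rescaled so that the noise has a known common variance (the purpose of a whitening step). The function names are illustrative, and this is not the paper's biwhitening algorithm or its sparsity-selection criterion, neither of which is specified in the abstract.

```python
import math


def mp_upper_edge(n_cells, n_genes, noise_var=1.0):
    """Upper edge of the Marchenko-Pastur bulk.

    Applies to the eigenvalues of the sample covariance X^T X / n_cells,
    where X is an (n_cells x n_genes) matrix of i.i.d. noise with
    variance noise_var.
    """
    gamma = n_genes / n_cells  # aspect ratio of the data matrix
    return noise_var * (1.0 + math.sqrt(gamma)) ** 2


def count_signal_components(eigenvalues, n_cells, n_genes, noise_var=1.0):
    """Count eigenvalues lying above the noise bulk edge."""
    edge = mp_upper_edge(n_cells, n_genes, noise_var)
    return sum(1 for lam in eigenvalues if lam > edge)


if __name__ == "__main__":
    # Toy spectrum: with 1000 cells and 250 genes the edge is (1 + 0.5)^2 = 2.25.
    spectrum = [9.0, 4.1, 2.3, 2.2, 1.0, 0.3]
    print(mp_upper_edge(1000, 250))
    print(count_signal_components(spectrum, 1000, 250))
```

Eigenvalues below the edge are indistinguishable from noise under this null model, which is why a reliable per-gene noise estimate matters before any such threshold is applied.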
bioinformatics2026-02-28v1Benchmarking computational tools for locus-specific analysis of transposable elements in single-cell RNA-seq datasets
Finazzi, V.; Vallejos, C. A.; Scialdone, A.AI Summary
- This study benchmarks tools for locus-specific analysis of transposable elements (TEs) in single-cell RNA-seq, using real and simulated datasets to assess performance.
- Findings indicate that older TEs can be quantified with high accuracy, while young TEs are challenging due to multi-mapping reads.
- SoloTE and Stellarscope performed comparably, with unique-mapper strategies recommended for precision, and subfamily aggregation suggested for young TEs.
Abstract
Background: Transposable elements (TEs) are increasingly recognized as regulators of gene expression and cellular identity in development and disease. Single-cell RNA-sequencing (scRNA-seq) enables the analysis of their transcription at cellular resolution, but the repetitive nature of TEs and their frequent overlap with genes create substantial mapping ambiguity. Although several tools quantify TE expression, few support locus-specific analysis, and their performance in single-cell data has not been systematically evaluated. Results: We present a comprehensive benchmarking framework for locus-level TE quantification in short-read scRNA-seq, combining real datasets with simulations that provide read-level ground truth. TE-derived reads constitute a considerable fraction of the transcriptome and capture meaningful biological structure. Our simulations reveal that older, sequence-diverged insertions can be quantified with relatively high accuracy, whereas young TEs remain intrinsically difficult to resolve due to unreliable assignment of multi-mapping reads. We observe pronounced family-specific biases and identify gene-TE disambiguation as a major unresolved challenge. Among evaluated methods, SoloTE (unique-mapper mode) and Stellarscope (with an expectation-maximization-based reallocation of multi-mappers) showed comparable performance, while including multi-mappers generally increased false positives without substantially improving locus-level accuracy. Conclusions: Our benchmark delineates the fundamental limits imposed by short-read scRNA-seq on locus-specific TE quantification, providing practical guidance for prospective users. Suggested best practices include focusing locus-level analyses on older insertions, applying unique-mapper strategies to improve precision, aggregating counts at the subfamily level for young TEs, and explicitly checking for gene-TE overlaps. Our workflow is fully reproducible and extensible, providing a foundation for evaluating emerging methods aimed at resolving TE transcription at single-locus resolution.
bioinformatics2026-02-28v1ESMRank reveals a transferable axis of protein mutational constraint from overlapping variant effect assays
Arnese, R.; Gambardella, G.AI Summary
- The study introduces ESMRank, a method that uses overlapping variant effect assays to derive a unified measure of protein mutational constraint, termed variant soundness.
- By analyzing over 1,100 MAVEdb score sets, ESMRank identifies a coherent constraint landscape related to structural stability, outperforming existing predictors in various protein assays.
- The approach, without clinical data, aligns with pathogenic variant data and disease mechanisms, demonstrating its utility in predicting protein folding and function in CFTR.
Abstract
Proteome-wide interpretation of missense variation is constrained not only by predictive model performance but also by the absence of principled methods to reconcile heterogeneous multiplexed assays of variant effect (MAVEs) into a unified representation of mutational constraint. We show that redundancy among partially overlapping deep mutational scanning experiments encodes a reproducible ordinal signal that can be recovered despite differences in assay scale and readout. We introduce variant soundness, an overlap-aware framework that aligns within-assay rankings and aggregates them across experiments to derive an assay-agnostic, within-protein measure of mutational tolerance. Applying this approach to about 1,100 MAVEdb score sets spanning >2M variants reveals a coherent constraint landscape enriched for structural stability determinants, including residue burial, packing perturbation magnitude, and domain architecture. By aligning learning objectives with this intrinsic ordering, we develop ESMRank, a sequence-based learning-to-rank predictor integrating protein language model representations with physicochemical descriptors. Under strict protein-level partitioning, ESMRank outperforms widely used stability and fitness predictors across the Human Domainome, ProteinGym stability assays, and VariBench folding kinetics. Without clinical supervision, the reconstructed constraint axis is enriched for ClinVar pathogenic variants and stratifies genes by mechanistic disease classes. In CFTR, predicted constraint tracks folding efficiency, channel activity, and pharmacological rescue. These findings establish experimental overlap as a scalable resource for extracting transferable mutational ordering and for building mechanistically interpretable, proteome-wide variant effect predictors.
bioinformatics2026-02-28v1SpatialCompassV (SCOMV): De novo cell and gene spatial pattern classification and spatially differential gene identification
Nomura, R.; Sakai, S. A.; Kageyama, S.-I.; Tsuchihara, K.; Yamashita, R.AI Summary
- SCOMV is a computational tool designed to cluster genes and cell types based on their spatial relationships in tissues, without relying on prior biological knowledge.
- It quantifies the spatial positioning of genes and cells relative to regions of interest, like tumors, by encoding distance and direction into feature representations.
- In breast and lung cancer datasets, SCOMV identified tumor-associated spatial patterns, classified genes by distribution types, and detected immune cell signatures in CAF-low regions, also identifying spatially differential genes.
Abstract
Spatial omics technologies enable the detection of gene expression together with spatial information in tissues. However, many existing analytical methods rely on prior biological knowledge or predefined annotations, while being limited in their ability to systematically characterize spatial distribution patterns. Here, we developed SpatialCompassV (SCOMV), a computational tool that clusters genes and cell types based on vectorial relationships between transcript locations and regions of interest, such as tumors. This tool quantifies the spatial positioning of genes and cells relative to a defined reference region by encoding their distance and direction into structured feature representations. SCOMV captured tumor-associated spatial patterns and enabled the unsupervised classification of genes into internal, peripheral, partially peripheral, and ubiquitous distribution types in Xenium spatial transcriptomic datasets of breast and lung cancer. Notably, SCOMV detected immune cell-related signatures that were preferentially localized in CAF-low regions. Extending the analysis to multiple regions of interest further enabled malignant state discrimination. Moreover, SCOMV identifies genes that differ not only in expression level but also in spatial distribution pattern, which we term spatially differential genes (spatially DEGs).
bioinformatics2026-02-28v1Achieving spatial multi-omics integration from unaligned serial sections with DIME
Sun, P.; Huang, X.; Mou, T.; Zheng, X.AI Summary
- The study addresses the challenge of integrating spatial multi-omics data from unaligned serial sections by introducing DIME, a deep learning framework that uses graph contrastive learning and cross-modal correspondence.
- DIME employs a hybrid alignment strategy combining Coherent Point Drift with Linear Assignment and Optimal Transport to establish global correspondence across tissue sections.
- Experiments on simulated and real human tissue datasets show DIME's effectiveness in robust data fusion, denoising, and identifying biologically significant spatial domains with high clustering accuracy.
Abstract
Learning integrated representations from spatial multi-omics data is a fundamental challenge, particularly in the context of "diagonal integration", where data are collected from serial tissue sections across distinct omics modalities. Existing methods typically rely on the assumption of feature intersection to construct a common metric space, a prerequisite that is absent in this setting. To address this, we propose the Diagonal Integration Model for Spatial Multi-omics Embedding (DIME), a novel deep learning framework that couples a graph contrastive learning objective with cross-modal correspondence. This global correspondence is established by a hybrid alignment strategy: it first anchors high-confidence regions using Coherent Point Drift with Linear Assignment, and then extends matching to the entire tissue manifold via an Optimal Transport formulation encoding relative geodesic distances. Designed to balance inter-modal guidance with intra-modal structure preservation, DIME enables robust fusion and denoising. Experiments on simulated and real human tissue datasets demonstrate DIME's superior robustness and versatility, where its learned representations achieve outstanding clustering accuracy and unlock the identification of biologically meaningful spatial domains.
bioinformatics2026-02-28v1Counting-based inference of mutant growth rates from pooled sequencing across growth regimes
Sezer, D.; Toprak, E.AI Summary
- The study addresses quantifying variant growth rates from time-resolved sequencing data of pooled mutants by modeling growth dynamics.
- It compares weighted least-squares fitting with non-linear fitting using softmax transformation, favoring the latter for exponential growth.
- The research extends to logistic and Gompertz growth models, employing variational Bayesian inference for uncertainty quantification, enhancing high-throughput biochemical parameter estimation.
Abstract
Time-resolved sequencing of pooled mutants is widely used to track their frequencies under selection pressure, thereby revealing variants that are enriched or depleted. Here, we address how to quantify variant growth rates by analyzing the temporal dimension of the counts data through a model of growth. For exponential growth, we first study weighted least-squares fitting and show that non-linear fitting based on the softmax transformation exhibits more favorable properties than the currently employed linear regression. We then argue that direct maximization of the likelihood of the noise model should be preferred over least-squares fitting. For a multinomial model of counting noise, we adopt variational Bayesian inference to additionally quantify uncertainties in the estimated growth rates. We provide closed-form expressions for the experimentally practical case of sequencing only at the beginning and at the end of the experiment. Finally, we extend maximum-likelihood estimation and variational Bayesian inference to logistic and Gompertz growth, which serve as illustrative examples of general, non-exponential growth models formulated in terms of a small number of parameters per variant. The ability to incorporate arbitrary growth models within the developed inference framework opens new opportunities for high-throughput estimation of diverse biochemical parameters that influence growth.
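For the two-time-point case mentioned above, a minimal sketch of the classical log-ratio estimator may help. Under exponential growth, variant abundances follow n_i(t) = n_i(0) exp(r_i t), so variant frequencies are a softmax of log n_i(0) + r_i t and only differences between growth rates are identifiable from counts. The function name, the choice of reference variant, and the pseudocount are assumptions made for illustration; this is not claimed to be the paper's closed-form expression.

```python
import math


def relative_growth_rates(counts_start, counts_end, elapsed, ref=0, pseudocount=0.5):
    """Growth rate of each variant relative to the variant at index `ref`.

    Uses (1 / elapsed) * [log(c_i(T) / c_ref(T)) - log(c_i(0) / c_ref(0))].
    Sequencing depth cancels in the ratios, so the two samples may have
    different total read counts. The pseudocount guards against log(0).
    """
    ref_start = counts_start[ref] + pseudocount
    ref_end = counts_end[ref] + pseudocount
    rates = []
    for c0, c1 in zip(counts_start, counts_end):
        log_ratio_end = math.log((c1 + pseudocount) / ref_end)
        log_ratio_start = math.log((c0 + pseudocount) / ref_start)
        rates.append((log_ratio_end - log_ratio_start) / elapsed)
    return rates


if __name__ == "__main__":
    start = [100, 100, 100]
    end = [100, 200, 50]
    # Variant 1 doubles and variant 2 halves relative to the reference.
    print(relative_growth_rates(start, end, elapsed=2.0, pseudocount=0.0))
```

The estimator treats every count as equally reliable, which is one reason a likelihood-based treatment of counting noise, as the abstract argues, is preferable when counts are low.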
bioinformatics2026-02-27v3Deep genomic models of allele-specific measurements
Mostafavi, S.; Tue, X.; Sasse, A.; Chowdhary, K.; Spiro, A.; Wang, L.; Chikina, M.; Benoist, C.AI Summary
- The study introduces DeepAllele, a deep learning model designed to predict allele-specific gene regulation changes using paired allele-specific input, particularly effective for datasets with few individuals like F1 hybrids.
- Applied to immune cells from F1 hybrid mice, DeepAllele effectively predicts regulatory changes across increasing biological complexity from TF binding to gene expression.
- The model identifies a broader range of genomic regions with known regulatory mechanisms compared to baseline models, enhancing causal discovery in genomics.
Abstract
Allele-specific quantification of sequencing data, such as gene expression, allows for a causal investigation of how DNA sequence variations influence cis gene regulation. Current methods for analyzing allele-specific measurements for causal analysis rely on statistical associations between genetic variation across individuals and allelic imbalance. Instead, we propose DeepAllele, a novel deep learning sequence-to-function model using paired allele-specific input, designed to learn sequence features that predict subtle changes in gene regulation between alleles. Our approach is especially suited for datasets with few individuals with unambiguous phasing, such as F1 hybrids and other controlled genetic crosses. We apply our framework to three types of allele-specific measurements in immune cells from F1 hybrid mice, illustrating that as the complexity of the underlying biological mechanism increases from TF binding to gene expression, the relative effectiveness of the model's architecture becomes more pronounced. Furthermore, we show that the model's learned cis-regulatory grammar aligns with known biological mechanisms across a significantly larger number of genomic regions compared to baseline models. In summary, our work presents a computational framework to leverage genetic variation to uncover functionally relevant regulatory motifs, enhancing causal discovery in genomics.
bioinformatics2026-02-27v2ITSxRust: ITS region extraction with partial-chain recovery and structured diagnostics for long-read amplicon sequencing
O'Brien, A.; Lagos, C.; Fernandez, K.; Ojeda, B.; Parada, P.AI Summary
- ITSxRust is a Rust-based tool designed for extracting ITS regions from long-read amplicon sequencing data, addressing throughput and robustness issues in fungal metabarcoding.
- It uses HMMER searches with efficient Rust-native processing, includes dereplication, and provides structured diagnostics and QC summaries.
- On an Oxford Nanopore dataset, ITSxRust extracted full ITS from 75.3% of reads, outperforming ITSx (69.9%) and ITSxpress v2 (41.4%), and was 4.6x faster than ITSx, with an additional 10,725 reads recovered via partial-chain fallback.
Abstract
As long-read amplicon sequencing (e.g., Oxford Nanopore and PacBio HiFi) becomes routine for fungal metabarcoding, identifying and extracting ITS subregions at scale has become a throughput and robustness bottleneck. The nuclear ribosomal internal transcribed spacer (ITS) region is the formal DNA barcode for fungi and is widely used for taxonomic profiling of fungal communities [Schoch et al., 2012]. Standard preprocessing locates conserved ribosomal flanks with hidden Markov profile models (profile-HMMs) to extract ITS1, 5.8S, ITS2, or the full ITS, as implemented in ITSx [Bengtsson-Palme et al., 2013] and ITSxpress [Rivers et al., 2018, Einarsson and Rivers, 2024]. Here we describe ITSxRust, a Rust-based ITS extractor designed for long-read scale. ITSxRust coordinates HMMER searches with efficient Rust-native I/O and sequence processing, optionally reduces redundant searches via dereplication, provides ONT and HiFi parameter pre-sets, and emits structured failure diagnostics and QC summaries. On an Oxford Nanopore ITS dataset (54,659 reads), ITSxRust extracted the full ITS region from 75.3% of reads, exceeding both ITSx (69.9%) and ITSxpress v2 (41.4%), while running 4.6x faster than ITSx. In addition, a partial-chain fallback strategy that extracts subregions using two-anchor pairs when the full four-anchor chain is unavailable recovered an additional 10,725 reads that would otherwise be discarded.
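The partial-chain idea can be illustrated with a short sketch. It assumes the four anchors are the end of the SSU flank, the start and end of 5.8S, and the start of the LSU flank, given as 0-based coordinates on the read. The anchor names, status labels, and function are hypothetical and written in Python for brevity; they do not reproduce ITSxRust's actual Rust implementation or output format.

```python
# Ordered chain of anchors along the read (hypothetical names).
CHAIN = ("ssu_end", "s58_start", "s58_end", "lsu_start")

# Each region is delimited by one pair of anchors.
REGIONS = {
    "full_ITS": ("ssu_end", "lsu_start"),
    "ITS1": ("ssu_end", "s58_start"),
    "5.8S": ("s58_start", "s58_end"),
    "ITS2": ("s58_end", "lsu_start"),
}


def extract_regions(read, anchors):
    """Return (status, regions) for one read.

    `anchors` maps anchor names to coordinates; missing anchors are
    absent or None. Any region whose two delimiting anchors were both
    found, in the right order, is extracted even if the chain is broken.
    """
    pos = {name: anchors.get(name) for name in CHAIN}
    complete = all(pos[name] is not None for name in CHAIN)
    ordered = complete and all(
        pos[a] < pos[b] for a, b in zip(CHAIN, CHAIN[1:])
    )
    regions = {}
    for region, (left, right) in REGIONS.items():
        lo, hi = pos[left], pos[right]
        if lo is not None and hi is not None and lo < hi:
            regions[region] = read[lo:hi]
    if ordered:
        status = "full_chain"
    elif regions:
        status = "partial_chain"
    else:
        status = "no_region"
    return status, regions


if __name__ == "__main__":
    read = "AAAAA" + "CCCC" + "GGG" + "TTTT" + "AA"
    # The SSU anchor was not detected, so only 5.8S and ITS2 are recoverable.
    print(extract_regions(read, {"s58_start": 9, "s58_end": 12, "lsu_start": 16}))
```

Returning a status alongside the extracted regions mirrors the abstract's emphasis on structured diagnostics: a read that fails full extraction still reports what was recoverable.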
bioinformatics2026-02-27v2Integrative Multi-Scale Sequence-Structure Modeling for Antimicrobial Peptide Prediction and Design
Li, J.; Shao, Y.; Li, Y.; Yu, Q.AI Summary
- The study introduces MultiAMP, a framework that integrates multi-scale sequence and structure information to predict antimicrobial peptides (AMPs), addressing the limitations of current single-scale approaches.
- MultiAMP significantly outperforms existing methods by over 10% in MCC, particularly in identifying AMPs with low sequence identity to known peptides.
- Applied to marine organisms, MultiAMP identified 484 novel AMPs and was used to design AMPs with specific motifs, enhancing understanding of AMP mechanisms.
Abstract
Antimicrobial resistance (AMR) is accelerating worldwide, undermining frontline antibiotics and making the need for novel agents more urgent than ever. Antimicrobial peptides (AMPs) are promising therapeutics against multidrug-resistant pathogens, as they are less prone to inducing resistance. However, current AMP prediction approaches often treat sequence and structure in isolation and at a single scale, leading to mediocre performance. Here, we propose MultiAMP, a framework that integrates multi-level information for predicting AMPs. The model captures evolutionary and contextual information from sequences alongside global and fine-grained information from structures, synergistically combining these features to enhance predictive power. MultiAMP achieves state-of-the-art performance, outperforming existing methods by over 10% in MCC when identifying distant AMPs sharing less than 40% sequence identity with known AMPs. To discover novel AMPs, we applied MultiAMP to marine organism data, discovering 484 high-confidence peptides with sequences that are highly divergent from known AMPs. Notably, MultiAMP accurately recognizes various structural types of peptides. In addition, our approach reveals functional patterns of AMPs, providing interpretable insights into their mechanisms. Building on these findings, we employed a gradient-based strategy and achieved the design of AMPs with specific motifs. We believe MultiAMP empowers both the rational discovery and mechanistic understanding of AMPs, facilitating future experimental validation and therapeutic design. The codebase is available at https://github.com/jiayili11/multi-amp.
bioinformatics2026-02-27v2MOSAIC: A Spectral Framework for Integrative Phenotypic Characterization Using Population-Level Single-Cell Multi-Omics
Lu, C.; Kluger, Y.; Ma, R.AI Summary
- MOSAIC is a spectral framework designed to analyze population-scale single-cell multi-omics data by learning a joint feature x sample embedding.
- It constructs sample-specific coupling matrices and uses spectral decomposition to enable applications like Differential Connectivity analysis, unsupervised subgroup detection, and clinical outcome prediction.
- Key findings include identifying regulatory network rewiring in activated T cells, discovering a stress-driven neuronal subtype in HIV+ patients, and enhancing COVID-19 severity classification.
Abstract
Population-scale single-cell multi-omics offers unprecedented opportunities to link molecular variation to human health and disease. However, existing methods for single-cell multi-omics analysis are either cell-centric, prioritizing batch-corrected cell embeddings that neglect feature relationships, or feature-centric, imposing global feature representations that overlook inter-sample heterogeneity. To address these limitations, we present MOSAIC, a spectral framework that learns a high-resolution feature x sample joint embedding from population-scale single-cell multi-omics data. For each individual, MOSAIC constructs a sample-specific coupling matrix capturing complete intra- and cross-modality feature interactions, then projects these into a shared latent space via spectral decomposition. The joint feature x sample embedding defines each feature's connectivity profile per sample, enabling three downstream applications. Differential Connectivity analysis identifies features with regulatory network rewiring across conditions even when their abundance remains unchanged, revealing rewiring of proliferation programs in activated T cells from a vaccination cohort. Unsupervised subgroup detection isolates coherent feature modules to discover hidden patient subtypes, uncovering a stress-driven neuronal subtype within an HIV+ cohort. Clinical outcome prediction using connectivity-derived features complements abundance-based analysis, improving COVID-19 severity classification when integrated. MOSAIC provides a general-purpose framework for systems-level phenotypic characterization, bridging network-level discovery with clinical outcome prediction in population-scale single-cell studies.
bioinformatics2026-02-27v2ProChoreo: de novo Binder Design from Conformational Ensembles with Generative Deep Learning
Ding, S.; Zhang, Y.AI Summary
- ProChoreo is a framework for de novo binder design that uses generative deep learning to incorporate conformational ensembles, unlike traditional methods that focus on static conformations.
- It employs multimodal contrastive learning to align protein sequences with molecular dynamics-derived ensembles, creating a shared latent representation for both sequence and dynamic structure.
- ProChoreo-designed binders for TAS1R2 and FGFR2 receptors were evaluated for structure and interaction quality, demonstrating the effectiveness of dynamics-informed design.
Abstract
Deep learning has transformed protein structure prediction and de novo protein design; however, most existing frameworks operate on a single static conformation and underutilize the conformational heterogeneity that governs protein binding and function. We introduce ProChoreo, a generalizable framework for de novo binder design that explicitly incorporates conformational ensembles. ProChoreo is pretrained with multimodal contrastive learning to align protein sequences with corresponding molecular dynamics (MD)-derived ensembles, producing a shared latent representation that captures both sequence-level and dynamic structural information. This representation is then integrated into an autoregressive generator to design protein binders conditioned on receptor sequences. Designed binders are evaluated using Boltz 1 for complex structure and interaction quality, followed by MD simulations of complexes with two representative receptors: the human sweet taste receptor TAS1R2 and FGFR2. ProChoreo designs binders that encode conformational features, highlighting dynamics-informed design as a route to protein design.
bioinformatics2026-02-27v2CycleGRN: Inferring Gene Regulatory Networks from Cyclic Flow Dynamics in Single-Cell RNA-seq
Zhao, W.; Fertig, E. J.; Stein-O'Brien, G. L.AI Summary
- CycleGRN is a new framework for inferring gene regulatory networks (GRNs) from single-cell RNA-seq data, focusing on the dynamic nature of oscillatory processes like the cell cycle.
- It uses a stochastic differential equation approach to model gene expression dynamics, constructing a directed graph to estimate gene interactions via Lie derivatives and time-lagged correlations.
- Evaluations on synthetic and real datasets showed CycleGRN effectively recovers oscillatory and directional interactions, ranking among top methods.
Abstract
Oscillatory processes such as the cell cycle play critical roles in cell fate determination and disease development, yet existing gene regulatory network (GRN) inference methods often fail to account for their dynamic nature. We propose CycleGRN, a novel framework that treats cell cycle gene expression observations as an invariant measure of a stochastic differential equation and learns from data a dynamical system that fits cycling biological processes. Using a directed graph constructed along the inferred flow field in the cell space, we estimate Lie derivatives for all genes, enabling velocity inference beyond the cell cycle subspace. To quantify regulatory interactions, we introduce a time-lagged correlation operator between any pair of genes supported on the flow-aligned directed graph, which respects the intrinsic geometry of the data manifold and allows temporal ordering consistent with the underlying oscillatory process. The method requires only raw gene expression data at single-cell resolution and a list of cycle genes, without temporal binning or splicing dynamics. We evaluate our method on four synthetic datasets generated from mechanistic models with known network structures with oscillatory subnetworks, and on a mouse retinal progenitor single-cell RNA-seq dataset spanning three cell types and a knockout condition. Across all settings, our method consistently ranks among the top-performing approaches and demonstrates strong recovery of oscillatory and directional interactions.
bioinformatics2026-02-27v2POTTR: Identifying Recurrent Trajectories in Evolutionary and Developmental Processes using Posets
Käufler, S. C.; Schmidt, H.; Jürgens, M.; Klau, G. W.; Sashittal, P.; Raphael, B.AI Summary
- The study addresses the identification of recurrent mutation trajectories in cancer evolution and organismal development by formalizing the problem using incomplete partially ordered sets (posets) to account for phylogenetic uncertainty.
- A novel algorithm, POTTR, was developed to solve the NP-hard problem of finding the largest recurrent trajectories shared in at least k phylogenies, modeled through a conflict graph.
- Application of POTTR to lung cancer, leukemia, and an in vitro embryoid model data revealed significant, previously unreported trajectories and conserved differentiation routes, demonstrating its utility in resolving mutation clusters and understanding developmental changes.
Abstract
Multiple biological processes, including cancer evolution and organismal development, are described as a sequence of events with a temporal ordering. While cancer evolves independently in each patient, DNA sequencing studies have shown that in some cancers different patients share specific orders of mutations and these correlate with distinct morphology, drug response, and treatment outcomes. Several methods have been developed to identify such recurrent trajectories of genetic events from phylogenetic trees, but this is complicated by high intra- and inter-tumor heterogeneity as well as uncertainty in the inferred tumor phylogenies including the ambiguous orders between some mutations. We formalize the problem of finding recurrent mutation trajectories using a novel framework of incomplete partially ordered sets (posets), which generalize representations used in previous works and explicitly account for the uncertainty in tumor phylogenies. We define the problem of identifying the largest recurrent trajectories shared in at least k input phylogenies as the maximum k-common induced incomplete subposet (MkCIIS) problem, which we show is NP-hard. We present a combinatorial algorithm, POsets for Temporal Trajectory Resolution (POTTR), to solve the MkCIIS problem using a conflict graph that models recurrent trajectories as independent sets. Thereby we identify maximum recurrent trajectories while resolving multiple sources of uncertainty, like mutation clusters, in the phylogenetic data. We apply POTTR to TRACERx non-small cell lung cancer bulk sequencing and acute myeloid leukemia single-cell sequencing data and through resolution of mutation clusters discover previously unreported trajectories of high statistical significance. On lineage tracing data of an in vitro embryoid model, POTTR identifies conserved differentiation routes across biological replicates and how these routes change in response to chemical perturbations.
bioinformatics 2026-02-27 v2
EGGS: Empirical Genotype Generalizer for Samples
Smith, T. Q.; Rahman, A.; Szpiech, Z. A.
AI Summary
- EGGS is a tool designed to handle empirical genotypes with missing data by replicating the distribution of missing genotypes across replicates.
- It offers functionalities like removing phase, polarization, simulating deamination and sequencing errors, creating pseudohaploids, and converting between various genetic data formats.
- EGGS assumes diploidy when producing VCF files and is implemented in C, with resources available on GitHub.
Abstract
Summary: We introduce Empirical Genotype Generalizer for Samples (EGGS), which accepts empirical genotypes with missing data and replicates the distribution of missing genotypes along the empirical segment in other replicates. The empirical segment must contain fewer sites than the replicate. In addition, EGGS can remove phase, remove polarization, simulate deamination, simulate sequencing error, create pseudohaploids, and convert between Variant Call Format (VCF), ms-style replicates, and EIGENSTRAT. When producing VCF files, EGGS assumes all samples are diploid. Availability and Implementation: EGGS is written in the C programming language. Precompiled executables, source code, and the manual are available at https://github.com/TQ-Smith/EGGS
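The two operations named in the abstract, copying missingness from an empirical segment and creating pseudohaploids, can be illustrated with a minimal Python sketch. This is not EGGS's C implementation or its command-line interface. The function names, the in-memory genotype layout (a list of sites, each a list of per-sample allele pairs, with `None` for missing), and the choice to align the empirical mask with the first sites of the replicate are all assumptions made for illustration.

```python
import random

def apply_missing_mask(replicate, empirical):
    """Copy missing calls (None) from an empirical segment onto a replicate.

    Both inputs are lists of sites; each site is a list of per-sample
    genotypes, either an allele pair such as (0, 1) or None if missing.
    Assumption: the mask is aligned to the first len(empirical) sites and
    both inputs have the same number of samples.
    """
    if len(empirical) >= len(replicate):
        raise ValueError("empirical segment must have fewer sites than the replicate")
    masked = [list(site) for site in replicate]  # copy so the input is untouched
    for i, emp_site in enumerate(empirical):
        for j, genotype in enumerate(emp_site):
            if genotype is None:
                masked[i][j] = None
    return masked

def pseudohaploid(sites, rng):
    """Replace each diploid genotype with one allele drawn at random."""
    return [
        [None if gt is None else rng.choice(gt) for gt in site]
        for site in sites
    ]

# Hypothetical toy data: 3 replicate sites, 2 empirical sites, 2 samples.
replicate = [[(0, 1), (1, 1)], [(0, 0), (0, 1)], [(1, 1), (0, 0)]]
empirical = [[None, (0, 0)], [(1, 1), None]]
masked = apply_missing_mask(replicate, empirical)
haploid = pseudohaploid(masked, random.Random(0))
```

Passing a seeded `random.Random` instance keeps the pseudohaploid draw reproducible across runs.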
bioinformatics 2026-02-27 v2
Graph Lens Lite: An interactive biological network viewer for displaying, exploring, and sharing disease pathobiology and drug mechanism of action models
Ley, M.; Keska-Izworska, K.; Fillinger, L.; Walter, S. M.; Baumgärtel, F.; Bono, E.; Galou, L.; Andorfer, P.; Hauser, P.; Leierer, J.; Kratochwill, K.; Perco, P.
AI Summary
- Graph Lens Lite is a browser-based tool designed for visualizing and exploring biological networks to understand disease pathobiology and drug mechanisms.
- It features an expressive query language, topological analysis, GUI-based filtering, visual grouping, customizable layouts, a data-editor, and detailed styling options.
- The tool is available on GitHub for sharing and collaborative research in systems biology and network medicine.
Abstract
Motivation: Biological network visualization together with graph-based analyses are key techniques in systems biology and network medicine to detect patterns and generate new hypotheses regarding disease pathobiology, drug target identification, biomarker prioritization, or digital drug discovery. Network representations are also a way to communicate research findings and share results with colleagues and coworkers. Results: We have developed Graph Lens Lite, a browser-based tool that combines rich visualization capabilities with a streamlined interface for exploring and sharing biological networks. It offers an expressive query language, topological network analysis, GUI-based filtering, visual grouping, customizable layouts, a data-editor, and fine-grained property-based styling options, particularly suited for visualizing molecular models of disease pathobiology or drug mechanism of action. Availability: Graph Lens Lite is available at GitHub (https://github.com/Delta4AI/GraphLensLite).
bioinformatics 2026-02-27 v1
Uncertainty-aware synthetic lethality prediction with pretrained foundation models
Hua, K.; Haber, E.; Ma, J.
AI Summary
- CILANTRO-SL uses pretrained biological foundation models to predict synthetic lethality (SL) gene pairs, incorporating a two-stage process to generate context-aware embeddings from RNA-seq data and perform in silico gene knockouts.
- The framework employs a lightweight classifier in the second stage to differentiate SL from non-SL pairs, using features derived from the embeddings.
- Key findings include improved performance through viability pretraining and gene priors, with the ability to generalize to unseen genes and gene pairs, enhancing the discovery of therapeutic targets with calibrated uncertainty.
Abstract
Synthetic lethality (SL) offers a promising paradigm for targeted cancer therapy, yet experimental identification of SL gene pairs remains costly, context-dependent, and biased toward well-studied genes. Existing computational approaches often rely on curated protein-protein interaction (PPI) networks and Gene Ontology (GO) annotations, which limit their ability to generalize to novel genes. Here we introduce CILANTRO-SL, a two-stage, graph-free framework that leverages pretrained biological foundation models to predict SL pairs with calibrated uncertainty. In Stage 1, we apply a pretrained single-cell foundation model to bulk RNA-seq profiles of cancer cell lines to obtain context-aware embeddings and perform in silico gene knockouts to generate delta embeddings. These perturbation signals are further conditioned on a data-driven gene prior and supervised with CRISPR viability readouts to learn knockout-aware viability embeddings. In Stage 2, we derive pairwise features from these embeddings and train a lightweight classifier to distinguish SL from non-SL pairs. To enable reliable experimental prioritization, CILANTRO-SL incorporates conformal prediction, producing calibrated and interpretable prediction sets that highlight high-confidence SL candidates. Across two evaluation settings, including zero-shot generalization to unseen gene pairs and to unseen genes, ablation analyses show that viability pretraining and the gene prior substantially improve performance while avoiding reliance on PPI and GO features. CILANTRO-SL therefore transforms pretrained biological representations into practical, uncertainty-aware hypotheses that support robust and scalable discovery of therapeutic targets.
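The conformal prediction step mentioned in the abstract can be illustrated with standard split conformal classification. The sketch below is a generic version of that technique, not the CILANTRO-SL code. The function names, the two-class probability layout `[p_non_sl, p_sl]`, and the nonconformity score (one minus the probability of the true class) are assumptions.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha):
    """Score threshold from a held-out calibration set.

    cal_probs: rows of [p_non_sl, p_sl] from any probabilistic classifier.
    cal_labels: true labels (0 = non-SL, 1 = SL).
    alpha: target miscoverage rate, e.g. 0.1 for roughly 90% coverage.
    """
    # Nonconformity score: one minus the probability given to the true class.
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank of the quantile.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points: include every label
    return scores[rank - 1]

def prediction_set(probs, threshold):
    """All labels whose nonconformity score is within the threshold."""
    return [label for label, p in enumerate(probs) if 1.0 - p <= threshold]

# Hypothetical calibration data: seven SL pairs with varying confidence.
cal_probs = [[1 - p, p] for p in (0.875, 0.75, 0.625, 0.5, 0.375, 0.25, 0.125)]
cal_labels = [1] * 7
threshold = conformal_threshold(cal_probs, cal_labels, alpha=0.25)

confident = prediction_set([0.125, 0.875], threshold)  # single-label set
ambiguous = prediction_set([0.25, 0.75], threshold)    # both labels retained
```

A single-label set flags a pair the classifier is confident about, while a two-label set marks a pair that would need experimental follow-up before being prioritized.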
bioinformatics 2026-02-27 v1
Spatial Mechanomics for Tissue-Scale Biomechanical Mapping and Multi-omics Integration
Xie, W.; Wang, Z.; Shan, Q.; Zhao, Q.; Ye, X.
AI Summary
- The study introduces spatial mechanomics, a method for mapping biomechanical properties across tissues by integrating BioAFM-based spatial sampling with multi-protocol microrheology.
- This approach extracts viscoelastic parameters at specific tissue locations, creating mechanomic feature vectors and tissue-scale atlases.
- Applied to murine myocardial tissue, spatial mechanomics identified distinct mechanical states and condition-dependent remodeling, demonstrating its utility in multi-modal tissue analysis.
Abstract
Tissue mechanical properties are spatially heterogeneous and tightly coupled to cellular function, developmental patterning, and disease progression, yet spatially resolved characterization of viscoelastic and microrheological behavior across intact tissues remains limited. Here we introduce spatial mechanomics, a framework for tissue-wide acquisition, quantitative extraction, and computational representation of location-resolved mechanical states. Using BioAFM-based spatial sampling with multi-protocol microrheology, we acquire force responses at defined tissue coordinates and fit physically interpretable viscoelastic models to extract elastic, viscous, and frequency-dependent parameters at each position. These parameters are assembled into per-niche mechanomic feature vectors and reconstructed into tissue-scale mechanomic atlases that resolve heterogeneous mechanical organization. We implement these capabilities in MechScape, an open-source computational platform that supports force curve fitting, spatial feature matrix construction, unsupervised domain discovery, and cross-modal alignment with histological and molecular measurements. Application to murine myocardial tissue reveals that spatial mechanomics identifies distinct mechanical states, quantifies condition-dependent remodeling across all measured parameters, and resolves spatially coherent mechanical domains. This work establishes spatial mechanomics as a quantitative approach for tissue-scale biomechanical mapping and provides a generalizable framework for integrating mechanics as an omics layer in multi-modal tissue analysis.
bioinformatics 2026-02-27 v1
DENcode: A model for haplotype-informed transmission probability of dengue virus
Maduranga, S.; Arroyo, B. M. V.; Sigera, C.; Weeratunga, P.; Fernando, D.; Rajapakse, S.; Lloyd, A. R.; Bull, R. A.; Stone, H.; Rodrigo, C.
AI Summary
- DENcode is a model that estimates the transmission probability of dengue virus by integrating epidemiological factors with genetic similarity between viral haplotypes.
- Validation with 90 dengue cases from Colombo, Sri Lanka showed stable estimates with narrow credible intervals, highlighting the importance of both genetic and epidemiological components.
- Haplotype-informed networks were more informative than consensus-based networks (3.6 and 1.6 times more edges for DENV2 and DENV3, respectively), enhancing the understanding of transmission dynamics within the community.
Abstract
Dengue virus transmission networks are often only partially resolved because of gaps in sampling, unobserved mosquito-mediated transmission, and reliance on methods (phylogenetics) that describe evolutionary relatedness but not explicit, probabilistic transmission links between individual infections. We developed DENcode, a framework to estimate the relative likelihood of vector-mediated transmission between pairs of dengue cases by combining a temperature- and time-modulated epidemiological kernel, which captures the extrinsic incubation period and human infectiousness, with a phylogenetically informed genetic similarity kernel derived from patristic distances between viral haplotypes or consensus sequences. In validation with a real-world dataset of 90 dengue infections sampled from Colombo, Sri Lanka between 2017 and 2020 and sequenced to resolve within-host haplotypes, DENcode estimates were stable across 100 Monte Carlo iterations, yielding narrow credible intervals (median width <0.001) and consistent top-ranked transmission pairs. Sensitivity analyses using ablation experiments showed that removing either the genetic or epidemiological component substantially altered the distribution of linkage probabilities, indicating that both contribute meaningfully to the inferred transmission structure. Serotype-specific transmission networks constructed from pairwise linkage probabilities from DENcode were analysed using degree- and path-based centrality measures at probability thresholds of 0.1 and 0.5, revealing the relative importance of cases to disease transmission within the community. Haplotype-derived networks were more informative than consensus-based networks (3.6 and 1.6 times more edges for DENV2 and DENV3, respectively). DENcode is a robust framework for exploring dengue transmission within a community; it outputs a network of transmission probabilities informed by pathogen genetic similarity and clinical epidemiological parameters.
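The network step described in the abstract, thresholding pairwise linkage probabilities and ranking cases by degree, can be sketched in a few lines of Python. This is an illustration rather than the DENcode implementation. The function names, the case identifiers, the probabilities, the undirected treatment of edges, and the inclusive `>=` comparison are all assumptions.

```python
from collections import defaultdict

def threshold_network(pair_probs, threshold):
    """Build an undirected adjacency map from pairwise linkage probabilities.

    pair_probs: dict mapping (case_a, case_b) to a probability in [0, 1].
    Pairs with probability >= threshold become edges (assumed convention).
    """
    adjacency = defaultdict(set)
    for (a, b), prob in pair_probs.items():
        if prob >= threshold:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return dict(adjacency)

def degree_centrality(adjacency, n_cases):
    """Degree of each connected case divided by the maximum possible degree."""
    return {case: len(nbrs) / (n_cases - 1) for case, nbrs in adjacency.items()}

# Hypothetical linkage probabilities among three cases.
pair_probs = {("A", "B"): 0.6, ("A", "C"): 0.2, ("B", "C"): 0.05}

loose = threshold_network(pair_probs, 0.1)   # edges A-B and A-C
strict = threshold_network(pair_probs, 0.5)  # edge A-B only
loose_centrality = degree_centrality(loose, n_cases=3)
```

Comparing the two thresholds shows how a stricter cut-off prunes weakly supported links while leaving the best-supported pairs in place.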
bioinformatics 2026-02-27 v1
PantheonOS: An Evolvable Multi-Agent Framework for Automatic Genomics Discovery
Xu, W.; Poussi, E.; Zhong, Q.; Zeng, Z.; Zou, C.; Wang, X.; Lu, Y.; Cui, M.; Okamura, D.; Huang, C.; Ding, J.; Zhao, Z.; Yang, Y.; Pan, X.; Vijay, V.; Konno, N.; Liu, N.; Li, L.; Ma, X. R.; Conley, S. D.; Kern, C.; Goodyer, W. R.; Bintu, B.; Zhu, Q.; Chi, N. C.; He, J.; Rognoni, L.; Zhang, X.; Wu, J.; Ellison, D.; Rabinovitch, M.; Engreitz, J. M.; Qiu, X.
AI Summary
- PantheonOS is introduced as an evolvable, privacy-preserving multi-agent framework for genomics discovery, aiming to balance generality with domain specificity.
- It utilizes agentic code evolution to enhance batch correction and gene panel selection, achieving super-human performance.
- Key findings include discovering asymmetric paracrine inhibition in mouse embryo development, integrating multi-omics data for heart disease insights, and predicting cardiac effects with virtual cell models.
Abstract
The convergence of large language model-powered autonomous agent systems and single-cell biology promises a paradigm shift in biomedical discovery. However, existing biological agent systems, built on single-agent architectures, are either narrowly specialized or overly general, limiting applications to routine analyses. We introduce PantheonOS (PantheonOS.stanford.edu), an evolvable, privacy-preserving multi-agent framework designed to reconcile generality with domain specificity. Critically, PantheonOS enables agentic code evolution, allowing state-of-the-art batch correction algorithms and our reinforcement-learning augmented gene panel selection algorithms to evolve and achieve super-human performance. PantheonOS drives biological discoveries across systems: uncovering asymmetric paracrine Cer1-Nodal inhibition in proximal-distal axis formation from novel early mouse embryo 3D data; integrating human fetal heart multi-omics with whole-heart data to reveal molecular programs that underpin heart diseases; and adaptively selecting virtual cell models to predict cardiac regulatory and perturbation effects. Together, PantheonOS points towards a future where scientific discoveries are increasingly driven by self-evolving AI systems across biology and beyond.
bioinformatics 2026-02-27 v1
MAP: A Knowledge-driven Framework for Predicting Single-cell Responses for Unprofiled Drugs
Feng, J.; Zhao, Z.; Zhang, X.; Liu, M.; Chen, J.; Quan, X.; Zhang, J.; Wang, Y.; Zhang, Y.; Xie, W.
AI Summary
- The study introduces MAP, a framework that integrates biological knowledge into cellular perturbation modeling to predict responses to unprofiled drugs.
- MAP uses a knowledge graph (MAP-KG) and a knowledge-driven pre-training strategy to create unified embeddings for molecular structures, protein sequences, and mechanistic descriptions.
- Evaluations showed MAP improved prediction accuracy by up to +13.3% for unseen cell type-drug combinations and +12.2% for unprofiled drugs, with pathway analysis confirming mechanism consistency in drug screening.
Abstract
Predicting how cells respond to chemical perturbations is one of the goals for building virtual cells, yet experimentally profiled compounds cover only a small fraction of this space. Existing models struggle to generalize to unprofiled compounds, as they typically treat drugs as isolated identifiers without encoding their mechanistic relationships. We present MAP, a framework that integrates structured biological knowledge into cellular perturbation modeling and supports zero-shot prediction for small molecules with scarce or absent perturbation profiles. Specifically: (i) we construct MAP-KG, a large-scale knowledge graph tailored for cellular perturbation modeling that unifies 14 public resources, spanning 187k drugs, 23k genes, and 694k mechanistic relationships; (ii) we propose a knowledge-driven pre-training strategy that aligns molecular structures, protein sequence features, and textual mechanistic descriptions into a unified embedding space via contrastive learning, producing mechanism-aware and transferable gene and compound embeddings. The resulting knowledge-informed gene and drug representations are then coupled with a pretrained single-cell foundation model to condition perturbation response prediction; (iii) we evaluate MAP under two zero-shot generalization regimes: unseen cell type-drug combinations and the stricter setting of unprofiled drugs, where it improves top-50 DEG Pearson delta correlation by up to +13.3% and +12.2%, respectively, over the strongest baselines across three benchmarks. We further perform pathway-level functional analysis via GSEA for in-silico screening, where MAP predicts coherent, mechanism-consistent programs on unprofiled candidate drugs, and prioritizes 4 of 5 approved anti-cancer drugs in A-549 (non-small cell lung cancer).
bioinformatics 2026-02-27 v1
Topological Data Analysis of Spatial Protein Expression in Multiplexed Spatial Proteomics Studies
Samorodnitsky, S. N.; Wu, M.
AI Summary
- The study introduces TOASTER, a method using topological data analysis to assess the association between continuous spatial protein expression and patient outcomes, bypassing traditional cell segmentation and phenotyping.
- TOASTER characterizes topological features of protein expression and uses adapted statistical methods to link these features with outcomes.
- Simulations and application to triple-negative breast cancer data show TOASTER improves power and controls type I error, revealing associations with immunotherapy response.
Abstract
Multiplexed spatial proteomics platforms generate high-resolution images capturing the spatial expression of proteins in tissue. Images are often fed through a complex pre-processing pipeline to identify individual cells (termed segmentation) and then to predict their phenotypes. It is common to test if the inferred spatial arrangement of cells associates with patient-level outcomes. However, cell segmentation and phenotyping are prone to error and this approach neglects the measured protein levels. Further, new research suggests topological analysis of spatial proteomics may yield more power than alternative approaches. We propose a method, TOASTER, that circumvents reliance on segmentation and phenotyping and instead tests the association between continuous spatial protein expression and a patient-level response variable. TOASTER uses topological data analysis to first characterize the presence of topological features within univariate and bivariate spatial protein expression. The topological structure is summarized using an adaptation of the Nelson-Aalen cumulative hazard function. We can then associate this summary with an outcome using either a functional data analytic approach, a gridwise testing approach, or using kernel association testing. We show via simulation that our approach improves power and controls type I error, even in the presence of gaps or tears in the image which may arise during tissue handling. We apply our approach to a study in triple-negative breast cancer and demonstrate topological features of protein expression associated with immunotherapy response.
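For readers unfamiliar with the Nelson-Aalen cumulative hazard function mentioned in the abstract, the classical estimator sums, at each event time, the number of events divided by the number still at risk. The Python sketch below implements only that standard survival-analysis form. How TOASTER adapts it to topological features is not described in the abstract and is not attempted here. The function name and the toy inputs are assumptions.

```python
from collections import Counter

def nelson_aalen(times, events):
    """Classical Nelson-Aalen estimate of the cumulative hazard.

    times: observed time for each subject.
    events: 1 if the event was observed at that time, 0 if censored.
    Returns a list of (time, cumulative_hazard) at each distinct event time.
    """
    event_counts = Counter(t for t, e in zip(times, events) if e)
    leaving = Counter(times)  # subjects leaving the risk set at each time
    at_risk = len(times)
    cumulative = 0.0
    curve = []
    for t in sorted(leaving):
        if event_counts[t]:
            # Increment: events at t divided by subjects at risk just before t.
            cumulative += event_counts[t] / at_risk
            curve.append((t, cumulative))
        at_risk -= leaving[t]
    return curve

# Hypothetical toy data: four subjects, one censored at time 2.
curve = nelson_aalen(times=[1, 2, 2, 3], events=[1, 1, 0, 1])
```

The result is a step function: its value only changes at times where at least one event is observed, and censored observations affect it only by shrinking the risk set.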
bioinformatics 2026-02-27 v1