Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
Counting-based inference of mutant growth rates from pooled sequencing across growth regimes
Sezer, D.; Toprak, E.
AI Summary
- The study addresses quantifying variant growth rates from time-resolved sequencing data of pooled mutants by modeling growth dynamics.
- It compares weighted least-squares fitting with non-linear fitting using softmax transformation, favoring the latter for exponential growth.
- The research extends to logistic and Gompertz growth models, employing variational Bayesian inference for uncertainty quantification, enhancing high-throughput biochemical parameter estimation.
Abstract
Time-resolved sequencing of pooled mutants is widely used to track their frequencies under selection pressure, thereby revealing variants that are enriched or depleted. Here, we address how to quantify variant growth rates by analyzing the temporal dimension of the counts data through a model of growth. For exponential growth, we first study weighted least-squares fitting and show that non-linear fitting based on the softmax transformation exhibits more favorable properties than the currently employed linear regression. We then argue that direct maximization of the likelihood of the noise model should be preferred over least-squares fitting. For a multinomial model of counting noise, we adopt variational Bayesian inference to additionally quantify uncertainties in the estimated growth rates. We provide closed-form expressions for the experimentally practical case of sequencing only at the beginning and at the end of the experiment. Finally, we extend maximum-likelihood estimation and variational Bayesian inference to logistic and Gompertz growth, which serve as illustrative examples of general, non-exponential growth models formulated in terms of a small number of parameters per variant. The ability to incorporate arbitrary growth models within the developed inference framework opens new opportunities for high-throughput estimation of diverse biochemical parameters that influence growth.
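For the two-time-point case mentioned in the abstract, a minimal sketch in Python may help fix ideas. It uses the standard log-ratio estimator, which follows from the softmax form of the frequencies under exponential growth; it is not necessarily the closed-form expression derived in the paper. The function names, the reference-variant convention, and the pseudocount are assumptions made for illustration. Only growth-rate differences are identifiable from frequencies, so rates are reported relative to a chosen reference variant.

```python
import math


def relative_growth_rates(counts_start, counts_end, elapsed, ref=0, pseudocount=0.5):
    """Log-ratio estimate of each variant's growth rate relative to variant `ref`.

    Assumes exponential growth and sequencing at only two time points.
    `pseudocount` (an illustrative choice) keeps zero counts finite.
    """
    if len(counts_start) != len(counts_end):
        raise ValueError("count vectors must have the same length")
    start = [c + pseudocount for c in counts_start]
    end = [c + pseudocount for c in counts_end]
    rates = []
    for s, e in zip(start, end):
        # Change in log-ratio to the reference variant, divided by elapsed time.
        log_fold = math.log(e / end[ref]) - math.log(s / start[ref])
        rates.append(log_fold / elapsed)
    return rates


def softmax_frequencies(log_abundance0, rates, t):
    """Expected variant frequencies at time t: softmax of log N_i(0) + r_i * t."""
    logits = [a + r * t for a, r in zip(log_abundance0, rates)]
    m = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(x - m) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]
```

For example, `relative_growth_rates([100, 100], [100, 200], elapsed=1.0, pseudocount=0.0)` assigns the second variant a rate of ln 2 relative to the first, since its share doubles relative to the reference over one time unit.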
bioinformatics 2026-02-27 v3
EGGS: Empirical Genotype Generalizer for Samples
Smith, T. Q.; Rahman, A.; Szpiech, Z. A.
AI Summary
- EGGS is a tool designed to handle empirical genotypes with missing data by replicating the distribution of missing genotypes across replicates.
- It offers functionalities like removing phase, polarization, simulating deamination and sequencing errors, creating pseudohaploids, and converting between various genetic data formats.
- EGGS assumes diploidy when producing VCF files and is implemented in C, with resources available on GitHub.
Abstract
Summary: We introduce Empirical Genotype Generalizer for Samples (EGGS), which accepts empirical genotypes with missing data and replicates the distribution of missing genotypes along the empirical segment in other replicates. The empirical segment must contain fewer sites than the replicate. In addition, EGGS can remove phase, remove polarization, simulate deamination, simulate sequencing error, create pseudohaploids, and convert between Variant Call Format (VCF), ms-style replicates, and EIGENSTRAT. When producing VCF files, EGGS assumes all samples are diploid. Availability and Implementation: EGGS is written in the C programming language. Precompiled executables, source code, and the manual are available at https://github.com/TQ-Smith/EGGS
bioinformatics 2026-02-27 v2
Deep genomic models of allele-specific measurements
Mostafavi, S.; Tue, X.; Sasse, A.; Chowdhary, K.; Spiro, A.; Wang, L.; Chikina, M.; Benoist, C.
AI Summary
- The study introduces DeepAllele, a deep learning model designed to predict allele-specific gene regulation changes using paired allele-specific input, particularly effective for datasets with few individuals like F1 hybrids.
- Applied to immune cells from F1 hybrid mice, DeepAllele effectively predicts regulatory changes across increasing biological complexity from TF binding to gene expression.
- The model identifies a broader range of genomic regions with known regulatory mechanisms compared to baseline models, enhancing causal discovery in genomics.
Abstract
Allele-specific quantification of sequencing data, such as gene expression, allows for a causal investigation of how DNA sequence variations influence cis gene regulation. Current methods for analyzing allele-specific measurements for causal analysis rely on statistical associations between genetic variation across individuals and allelic imbalance. Instead, we propose DeepAllele, a novel deep learning sequence-to-function model using paired allele-specific input, designed to learn sequence features that predict subtle changes in gene regulation between alleles. Our approach is especially suited for datasets with few individuals with unambiguous phasing, such as F1 hybrids and other controlled genetic crosses. We apply our framework to three types of allele-specific measurements in immune cells from F1 hybrid mice, illustrating that as the complexity of the underlying biological mechanism increases from TF binding to gene expression, the relative effectiveness of the model's architecture becomes more pronounced. Furthermore, we show that the model's learned cis-regulatory grammar aligns with known biological mechanisms across a significantly larger number of genomic regions compared to baseline models. In summary, our work presents a computational framework to leverage genetic variation to uncover functionally relevant regulatory motifs, enhancing causal discovery in genomics.
bioinformatics 2026-02-27 v2
CycleGRN: Inferring Gene Regulatory Networks from Cyclic Flow Dynamics in Single-Cell RNA-seq
Zhao, W.; Fertig, E. J.; Stein-O'Brien, G. L.
AI Summary
- CycleGRN is a new framework for inferring gene regulatory networks (GRNs) from single-cell RNA-seq data, focusing on the dynamic nature of oscillatory processes like the cell cycle.
- It uses a stochastic differential equation approach to model gene expression dynamics, constructing a directed graph to estimate gene interactions via Lie derivatives and time-lagged correlations.
- Evaluations on synthetic and real datasets showed CycleGRN effectively recovers oscillatory and directional interactions, ranking among top methods.
Abstract
Oscillatory processes such as the cell cycle play critical roles in cell fate determination and disease development, yet existing gene regulatory network (GRN) inference methods often fail to account for their dynamic nature. We propose CycleGRN, a novel framework that treats cell cycle gene expression observations as an invariant measure of a stochastic differential equation and learns from data a dynamical system that fits cycling biological processes. Using a directed graph constructed along the inferred flow field in the cell space, we estimate Lie derivatives for all genes, enabling velocity inference beyond the cell cycle subspace. To quantify regulatory interactions, we introduce a time-lagged correlation operator between any pair of genes supported on the flow-aligned directed graph, which respects the intrinsic geometry of the data manifold and allows temporal ordering consistent with the underlying oscillatory process. The method requires only raw gene expression data at single-cell resolution and a list of cycle genes, without temporal binning or splicing dynamics. We evaluate our method on four synthetic datasets generated from mechanistic models with known network structures with oscillatory subnetworks, and on a mouse retinal progenitor single-cell RNA-seq dataset spanning three cell types and a knockout condition. Across all settings, our method consistently ranks among the top-performing approaches and demonstrates strong recovery of oscillatory and directional interactions.
bioinformatics 2026-02-27 v2
ProChoreo: de novo Binder Design from Conformational Ensembles with Generative Deep Learning
Ding, S.; Zhang, Y.
AI Summary
- ProChoreo is a framework for de novo binder design that uses generative deep learning to incorporate conformational ensembles, unlike traditional methods that focus on static conformations.
- It employs multimodal contrastive learning to align protein sequences with molecular dynamics-derived ensembles, creating a shared latent representation for both sequence and dynamic structure.
- ProChoreo-designed binders for TAS1R2 and FGFR2 receptors were evaluated for structure and interaction quality, demonstrating the effectiveness of dynamics-informed design.
Abstract
Deep learning has transformed protein structure prediction and de novo protein design; however, most existing frameworks operate on a single static conformation and underutilize the conformational heterogeneity that governs protein binding and function. We introduce ProChoreo, a generalizable framework for de novo binder design that explicitly incorporates conformational ensembles. ProChoreo is pretrained with multimodal contrastive learning to align protein sequences with corresponding molecular dynamics (MD)-derived ensembles, producing a shared latent representation that captures both sequence-level and dynamic structural information. This representation is then integrated into an autoregressive generator to design protein binders conditioned on receptor sequences. Designed binders are evaluated using Boltz 1 for complex structure and interaction quality, followed by MD simulations of complexes with two representative receptors: the human sweet taste receptor TAS1R2 and FGFR2. ProChoreo designs binders that encode conformational features, highlighting dynamics-informed design as a route to protein design.
bioinformatics 2026-02-27 v2
MOSAIC: A Spectral Framework for Integrative Phenotypic Characterization Using Population-Level Single-Cell Multi-Omics
Lu, C.; Kluger, Y.; Ma, R.
AI Summary
- MOSAIC is a spectral framework designed to analyze population-scale single-cell multi-omics data by learning a joint feature × sample embedding.
- It constructs sample-specific coupling matrices and uses spectral decomposition to enable applications like Differential Connectivity analysis, unsupervised subgroup detection, and clinical outcome prediction.
- Key findings include identifying regulatory network rewiring in activated T cells, discovering a stress-driven neuronal subtype in HIV+ patients, and enhancing COVID-19 severity classification.
Abstract
Population-scale single-cell multi-omics offers unprecedented opportunities to link molecular variation to human health and disease. However, existing methods for single-cell multi-omics analysis are either cell-centric, prioritizing batch-corrected cell embeddings that neglect feature relationships, or feature-centric, imposing global feature representations that overlook inter-sample heterogeneity. To address these limitations, we present MOSAIC, a spectral framework that learns a high-resolution feature × sample joint embedding from population-scale single-cell multi-omics data. For each individual, MOSAIC constructs a sample-specific coupling matrix capturing complete intra- and cross-modality feature interactions, then projects these into a shared latent space via spectral decomposition. The joint feature × sample embedding defines each feature's connectivity profile per sample, enabling three downstream applications. Differential Connectivity analysis identifies features with regulatory network rewiring across conditions even when their abundance remains unchanged, revealing rewiring of proliferation programs in activated T cells from a vaccination cohort. Unsupervised subgroup detection isolates coherent feature modules to discover hidden patient subtypes, uncovering a stress-driven neuronal subtype within an HIV+ cohort. Clinical outcome prediction using connectivity-derived features complements abundance-based analysis, improving COVID-19 severity classification when integrated. MOSAIC provides a general-purpose framework for systems-level phenotypic characterization, bridging network-level discovery with clinical outcome prediction in population-scale single-cell studies.
bioinformatics 2026-02-27 v2
Integrative Multi-Scale Sequence-Structure Modeling for Antimicrobial Peptide Prediction and Design
Li, J.; Shao, Y.; Li, Y.; Yu, Q.
AI Summary
- The study introduces MultiAMP, a framework that integrates multi-scale sequence and structure information to predict antimicrobial peptides (AMPs), addressing the limitations of current single-scale approaches.
- MultiAMP significantly outperforms existing methods by over 10% in MCC, particularly in identifying AMPs with low sequence identity to known peptides.
- Applied to marine organisms, MultiAMP identified 484 novel AMPs and was used to design AMPs with specific motifs, enhancing understanding of AMP mechanisms.
Abstract
Antimicrobial resistance (AMR) is accelerating worldwide, undermining frontline antibiotics and making the need for novel agents more urgent than ever. Antimicrobial peptides (AMPs) are promising therapeutics against multidrug-resistant pathogens, as they are less prone to inducing resistance. However, current AMP prediction approaches often treat sequence and structure in isolation and at a single scale, leading to mediocre performance. Here, we propose MultiAMP, a framework that integrates multi-level information for predicting AMPs. The model captures evolutionary and contextual information from sequences alongside global and fine-grained information from structures, synergistically combining these features to enhance predictive power. MultiAMP achieves state-of-the-art performance, outperforming existing methods by over 10% in MCC when identifying distant AMPs sharing less than 40% sequence identity with known AMPs. To discover novel AMPs, we applied MultiAMP to marine organism data, discovering 484 high-confidence peptides with sequences that are highly divergent from known AMPs. Notably, MultiAMP accurately recognizes various structural types of peptides. In addition, our approach reveals functional patterns of AMPs, providing interpretable insights into their mechanisms. Building on these findings, we employed a gradient-based strategy and achieved the design of AMPs with specific motifs. We believe MultiAMP empowers both the rational discovery and mechanistic understanding of AMPs, facilitating future experimental validation and therapeutic design. The codebase is available at https://github.com/jiayili11/multi-amp.
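The headline comparison above is stated in terms of the Matthews correlation coefficient (MCC). As a reminder of what that metric measures, here is a short Python sketch of the standard definition from a binary confusion matrix. The function name and the convention of returning 0.0 when a marginal is empty are illustrative choices, not taken from the MultiAMP codebase.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Ranges from -1 (total disagreement) through 0 (no better than chance)
    to +1 (perfect prediction). Returns 0.0 when any marginal total is zero,
    a common convention for the otherwise undefined case.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which makes it a common choice when AMP and non-AMP classes are imbalanced.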
bioinformatics 2026-02-27 v2
ITSxRust: ITS region extraction with partial-chain recovery and structured diagnostics for long-read amplicon sequencing
O'Brien, A.; Lagos, C.; Fernandez, K.; Ojeda, B.; Parada, P.
AI Summary
- ITSxRust is a Rust-based tool designed for extracting ITS regions from long-read amplicon sequencing data, addressing throughput and robustness issues in fungal metabarcoding.
- It uses HMMER searches with efficient Rust-native processing, includes dereplication, and provides structured diagnostics and QC summaries.
- On an Oxford Nanopore dataset, ITSxRust extracted full ITS from 75.3% of reads, outperforming ITSx (69.9%) and ITSxpress v2 (41.4%), and was 4.6x faster than ITSx, with an additional 10,725 reads recovered via partial-chain fallback.
Abstract
As long-read amplicon sequencing (e.g., Oxford Nanopore and PacBio HiFi) becomes routine for fungal metabarcoding, identifying and extracting ITS subregions at scale has become a throughput and robustness bottleneck. The nuclear ribosomal internal transcribed spacer (ITS) region is the formal DNA barcode for fungi and is widely used for taxonomic profiling of fungal communities [Schoch et al., 2012]. Standard preprocessing locates conserved ribosomal flanks with profile hidden Markov models (profile HMMs) to extract ITS1, 5.8S, ITS2, or the full ITS, as implemented in ITSx [Bengtsson-Palme et al., 2013] and ITSxpress [Rivers et al., 2018; Einarsson and Rivers, 2024]. Here we describe ITSxRust, a Rust-based ITS extractor designed for long-read scale. ITSxRust coordinates HMMER searches with efficient Rust-native I/O and sequence processing, optionally reduces redundant searches via dereplication, provides ONT and HiFi parameter presets, and emits structured failure diagnostics and QC summaries. On an Oxford Nanopore ITS dataset (54,659 reads), ITSxRust extracted the full ITS region from 75.3% of reads, exceeding both ITSx (69.9%) and ITSxpress v2 (41.4%), while running 4.6x faster than ITSx. In addition, a partial-chain fallback strategy, which extracts subregions using two-anchor pairs when the full four-anchor chain is unavailable, recovered an additional 10,725 reads that would otherwise be discarded.
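The partial-chain fallback described above can be illustrated with a small Python sketch. This is a toy model of the idea only, not ITSxRust's actual logic (the tool itself is written in Rust). The anchor names, the half-open coordinate convention, and the rule for which anchor pairs define which subregion are all assumptions made for the example.

```python
from typing import Dict, Optional, Tuple

# Hypothetical anchor names: end of the SSU flank, start and end of 5.8S,
# and start of the LSU flank, as positions on the read.
ANCHORS = ("ssu_end", "s58_start", "s58_end", "lsu_start")


def extract_regions(anchors: Dict[str, Optional[int]]) -> Dict[str, Tuple[int, int]]:
    """Return half-open (start, end) spans for the ITS regions that can be delimited.

    With all four anchors in increasing order, the full chain is reported.
    Otherwise, fall back to any two-anchor pair that still delimits a subregion.
    """
    pos = {name: anchors.get(name) for name in ANCHORS}
    chain = [pos[name] for name in ANCHORS]

    # Full four-anchor chain: every anchor found and strictly ordered along the read.
    if all(p is not None for p in chain) and all(a < b for a, b in zip(chain, chain[1:])):
        return {
            "full_ITS": (pos["ssu_end"], pos["lsu_start"]),
            "ITS1": (pos["ssu_end"], pos["s58_start"]),
            "5.8S": (pos["s58_start"], pos["s58_end"]),
            "ITS2": (pos["s58_end"], pos["lsu_start"]),
        }

    # Partial-chain fallback: keep subregions bounded by a valid anchor pair.
    regions = {}
    pairs = {"ITS1": ("ssu_end", "s58_start"), "ITS2": ("s58_end", "lsu_start")}
    for region, (left, right) in pairs.items():
        a, b = pos[left], pos[right]
        if a is not None and b is not None and a < b:
            regions[region] = (a, b)
    return regions
```

A read whose LSU flank was not detected would still yield ITS1 under this scheme, and a read with no usable pair returns an empty dictionary, which a caller could turn into a structured failure record.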
bioinformatics 2026-02-27 v2
POTTR: Identifying Recurrent Trajectories in Evolutionary and Developmental Processes using Posets
Käufler, S. C.; Schmidt, H.; Jürgens, M.; Klau, G. W.; Sashittal, P.; Raphael, B.
AI Summary
- The study addresses the identification of recurrent mutation trajectories in cancer evolution and organismal development by formalizing the problem using incomplete partially ordered sets (posets) to account for phylogenetic uncertainty.
- A novel algorithm, POTTR, was developed to solve the NP-hard problem of finding the largest recurrent trajectories shared in at least k phylogenies, modeled through a conflict graph.
- Application of POTTR to lung cancer, leukemia, and an in vitro embryoid model data revealed significant, previously unreported trajectories and conserved differentiation routes, demonstrating its utility in resolving mutation clusters and understanding developmental changes.
Abstract
Multiple biological processes, including cancer evolution and organismal development, are described as a sequence of events with a temporal ordering. While cancer evolves independently in each patient, DNA sequencing studies have shown that in some cancers different patients share specific orders of mutations and these correlate with distinct morphology, drug response, and treatment outcomes. Several methods have been developed to identify such recurrent trajectories of genetic events from phylogenetic trees, but this is complicated by high intra- and inter-tumor heterogeneity as well as uncertainty in the inferred tumor phylogenies including the ambiguous orders between some mutations. We formalize the problem of finding recurrent mutation trajectories using a novel framework of incomplete partially ordered sets (posets), which generalize representations used in previous works and explicitly account for the uncertainty in tumor phylogenies. We define the problem of identifying the largest recurrent trajectories shared in at least k input phylogenies as the maximum k-common induced incomplete subposet (MkCIIS) problem, which we show is NP-hard. We present a combinatorial algorithm, POsets for Temporal Trajectory Resolution (POTTR), to solve the MkCIIS problem using a conflict graph that models recurrent trajectories as independent sets. Thereby we identify maximum recurrent trajectories while resolving multiple sources of uncertainty, like mutation clusters, in the phylogenetic data. We apply POTTR to TRACERx non-small cell lung cancer bulk sequencing and acute myeloid leukemia single-cell sequencing data and through resolution of mutation clusters discover previously unreported trajectories of high statistical significance. On lineage tracing data of an in vitro embryoid model, POTTR identifies conserved differentiation routes across biological replicates and how these routes change in response to chemical perturbations.
bioinformatics 2026-02-27 v2
Uncertainty-aware synthetic lethality prediction with pretrained foundation models
Hua, K.; Haber, E.; Ma, J.
AI Summary
- CILANTRO-SL uses pretrained biological foundation models to predict synthetic lethality (SL) gene pairs, incorporating a two-stage process to generate context-aware embeddings from RNA-seq data and perform in silico gene knockouts.
- The framework employs a lightweight classifier in the second stage to differentiate SL from non-SL pairs, using features derived from the embeddings.
- Key findings include improved performance through viability pretraining and gene priors, with the ability to generalize to unseen genes and gene pairs, enhancing the discovery of therapeutic targets with calibrated uncertainty.
Abstract
Synthetic lethality (SL) offers a promising paradigm for targeted cancer therapy, yet experimental identification of SL gene pairs remains costly, context-dependent, and biased toward well-studied genes. Existing computational approaches often rely on curated protein-protein interaction (PPI) networks and Gene Ontology (GO) annotations, which limit their ability to generalize to novel genes. Here we introduce CILANTRO-SL, a two-stage, graph-free framework that leverages pretrained biological foundation models to predict SL pairs with calibrated uncertainty. In Stage 1, we apply a pretrained single-cell foundation model to bulk RNA-seq profiles of cancer cell lines to obtain context-aware embeddings and perform in silico gene knockouts to generate delta embeddings. These perturbation signals are further conditioned on a data-driven gene prior and supervised with CRISPR viability readouts to learn knockout-aware viability embeddings. In Stage 2, we derive pairwise features from these embeddings and train a lightweight classifier to distinguish SL from non-SL pairs. To enable reliable experimental prioritization, CILANTRO-SL incorporates conformal prediction, producing calibrated and interpretable prediction sets that highlight high-confidence SL candidates. Across two evaluation settings, including zero-shot generalization to unseen gene pairs and to unseen genes, ablation analyses show that viability pretraining and the gene prior substantially improve performance while avoiding reliance on PPI and GO features. CILANTRO-SL therefore transforms pretrained biological representations into practical, uncertainty-aware hypotheses that support robust and scalable discovery of therapeutic targets.
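To make the conformal prediction step concrete, the following Python sketch shows split conformal prediction for a binary SL / non-SL classifier. The abstract does not say which conformal procedure CILANTRO-SL uses, so this is one standard variant chosen for illustration: the nonconformity score is one minus the predicted probability of the true label, and the threshold is the finite-sample-corrected quantile of calibration scores. All function names and labels are hypothetical.

```python
import math


def conformal_threshold(cal_probs_true, alpha):
    """Calibrate a score threshold from held-out examples.

    cal_probs_true: predicted probability of the TRUE label for each
    calibration example. alpha: target miscoverage rate (e.g. 0.1).
    """
    scores = sorted(1.0 - p for p in cal_probs_true)
    n = len(scores)
    # Finite-sample correction: take the ceil((n + 1) * (1 - alpha))-th smallest score.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: every label is admitted.
        return float("inf")
    return scores[k - 1]


def prediction_set(prob_sl, threshold):
    """Labels whose nonconformity score falls at or below the threshold."""
    probs = {"SL": prob_sl, "non-SL": 1.0 - prob_sl}
    return {label for label, p in probs.items() if 1.0 - p <= threshold}
```

Under this scheme a confident prediction yields a single-label set, while an ambiguous pair can yield an empty set or both labels depending on the calibrated threshold. Sets containing only "SL" would be the natural candidates to prioritize for experimental follow-up.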
bioinformatics 2026-02-27 v1
AbiOmics: An End-to-End Pipeline to Train Machine Learning Models for Discrimination of Plant Abiotic Stresses Using Transcriptomic Profiling Data
Park, M.; Oh, Y.; Choi, W.; Jo, Y. D.
AI Summary
- The study developed AbiOmics, a machine learning pipeline to identify and discriminate among abiotic stresses in plants using transcriptomic data.
- Using 1,243 Arabidopsis transcriptomes, 320 stress-specific marker genes were identified for salt, cold, heat, and drought stresses.
- A single-layer perceptron model trained on these markers achieved 91% accuracy in cross-validation and 93% on an independent test set, showing effectiveness in multi-stress conditions.
Abstract
Abiotic stresses are primary constraints on global crop productivity, reducing yields by up to 80%. While traditional phenotypic sensing detects stress only after physiological symptoms emerge and often fails to discriminate specific stressor types, transcriptomic profiling offers a high-dimensional solution, capturing rapid and sensitive molecular shifts. In this study, we developed AbiOmics, the first end-to-end machine learning pipeline specifically designed to identify and discriminate among multiple stressors. This approach represents a previously undocumented method for stress specification using large-scale transcriptomic data. We identified 320 stress-specific marker genes using a curated collection of 1,243 transcriptomes of Arabidopsis samples treated with four major abiotic stresses: salt, cold, heat, and drought. A single-layer perceptron model trained on these features achieved 91% accuracy during five-fold cross-validation and 93% accuracy on an independent test set. The model demonstrated an unprecedented capacity to generalize to multi-stress conditions, identifying concurrent signatures in combinatorial salt-and-heat treatments. By integrating marker identification with SHAP-based biological interpretation, AbiOmics provides a rigorously validated diagnostic tool superior to conventional sensing. This framework establishes a high-confidence labeling strategy for AI-driven crop management and precision breeding to mitigate climate change impacts.
bioinformatics 2026-02-27 v1
Spatial Mechanomics for Tissue-Scale Biomechanical Mapping and Multi-omics Integration
Xie, W.; Wang, Z.; Shan, Q.; Zhao, Q.; Ye, X.
AI Summary
- The study introduces spatial mechanomics, a method for mapping biomechanical properties across tissues by integrating BioAFM-based spatial sampling with multi-protocol microrheology.
- This approach extracts viscoelastic parameters at specific tissue locations, creating mechanomic feature vectors and tissue-scale atlases.
- Applied to murine myocardial tissue, spatial mechanomics identified distinct mechanical states and condition-dependent remodeling, demonstrating its utility in multi-modal tissue analysis.
Abstract
Tissue mechanical properties are spatially heterogeneous and tightly coupled to cellular function, developmental patterning, and disease progression, yet spatially resolved characterization of viscoelastic and microrheological behavior across intact tissues remains limited. Here we introduce spatial mechanomics, a framework for tissue-wide acquisition, quantitative extraction, and computational representation of location-resolved mechanical states. Using BioAFM-based spatial sampling with multi-protocol microrheology, we acquire force responses at defined tissue coordinates and fit physically interpretable viscoelastic models to extract elastic, viscous, and frequency-dependent parameters at each position. These parameters are assembled into per-niche mechanomic feature vectors and reconstructed into tissue-scale mechanomic atlases that resolve heterogeneous mechanical organization. We implement these capabilities in MechScape, an open-source computational platform that supports force curve fitting, spatial feature matrix construction, unsupervised domain discovery, and cross-modal alignment with histological and molecular measurements. Application to murine myocardial tissue reveals that spatial mechanomics identifies distinct mechanical states, quantifies condition-dependent remodeling across all measured parameters, and resolves spatially coherent mechanical domains. This work establishes spatial mechanomics as a quantitative approach for tissue-scale biomechanical mapping and provides a generalizable framework for integrating mechanics as an omics layer in multi-modal tissue analysis.
bioinformatics 2026-02-27 v1
Topological Data Analysis of Spatial Protein Expression in Multiplexed Spatial Proteomics Studies
Samorodnitsky, S. N.; Wu, M.
AI Summary
- The study introduces TOASTER, a method using topological data analysis to assess the association between continuous spatial protein expression and patient outcomes, bypassing traditional cell segmentation and phenotyping.
- TOASTER characterizes topological features of protein expression and uses adapted statistical methods to link these features with outcomes.
- Simulations and application to triple-negative breast cancer data show TOASTER improves power and controls type I error, revealing associations with immunotherapy response.
Abstract
Multiplexed spatial proteomics platforms generate high-resolution images capturing the spatial expression of proteins in tissue. Images are often fed through a complex pre-processing pipeline to identify individual cells (termed segmentation) and then to predict their phenotypes. It is common to test if the inferred spatial arrangement of cells associates with patient-level outcomes. However, cell segmentation and phenotyping are prone to error and this approach neglects the measured protein levels. Further, new research suggests topological analysis of spatial proteomics may yield more power than alternative approaches. We propose a method, TOASTER, that circumvents reliance on segmentation and phenotyping and instead tests the association between continuous spatial protein expression and a patient-level response variable. TOASTER uses topological data analysis to first characterize the presence of topological features within univariate and bivariate spatial protein expression. The topological structure is summarized using an adaptation of the Nelson-Aalen cumulative hazard function. We can then associate this summary with an outcome using either a functional data analytic approach, a gridwise testing approach, or using kernel association testing. We show via simulation that our approach improves power and controls type I error, even in the presence of gaps or tears in the image which may arise during tissue handling. We apply our approach to a study in triple-negative breast cancer and demonstrate topological features of protein expression associated with immunotherapy response.
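The summary statistic above is described as an adaptation of the Nelson-Aalen cumulative hazard function. The abstract does not spell out the adaptation, so the Python sketch below shows only the standard estimator from survival analysis as a point of reference. Treating each input value as a generic "lifetime" (for instance, how long a topological feature persists) is an assumption made for illustration, as are the function and argument names.

```python
from collections import Counter


def nelson_aalen(durations, observed=None):
    """Standard Nelson-Aalen estimate of the cumulative hazard.

    durations: lifetime of each item. observed: True where the end of the
    lifetime was actually seen, False where it was censored (defaults to all
    True). Returns (time, cumulative_hazard) pairs at each distinct event time.
    """
    if observed is None:
        observed = [True] * len(durations)
    events = Counter(t for t, seen in zip(durations, observed) if seen)
    curve = []
    cumulative = 0.0
    for t in sorted(events):
        # Items still at risk just before time t, censored ones included.
        at_risk = sum(1 for d in durations if d >= t)
        cumulative += events[t] / at_risk
        curve.append((t, cumulative))
    return curve
```

The result is a step function of time, which is the kind of object that functional data analysis, gridwise testing, or kernel association tests can then relate to a patient-level outcome.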
bioinformatics 2026-02-27 v1
Graph Lens Lite: An interactive biological network viewer for displaying, exploring, and sharing disease pathobiology and drug mechanism of action models
Ley, M.; Keska-Izworska, K.; Fillinger, L.; Walter, S. M.; Baumgärtel, F.; Bono, E.; Galou, L.; Andorfer, P.; Hauser, P.; Leierer, J.; Kratochwill, K.; Perco, P.
AI Summary
- Graph Lens Lite is a browser-based tool designed for visualizing and exploring biological networks to understand disease pathobiology and drug mechanisms.
- It features an expressive query language, topological analysis, GUI-based filtering, visual grouping, customizable layouts, a data-editor, and detailed styling options.
- The tool is available on GitHub for sharing and collaborative research in systems biology and network medicine.
Abstract
Motivation: Biological network visualization together with graph-based analyses are key techniques in systems biology and network medicine to detect patterns and generate new hypotheses regarding disease pathobiology, drug target identification, biomarker prioritization, or digital drug discovery. Network representations are also a way to communicate research findings and share results with colleagues and coworkers. Results: We have developed Graph Lens Lite, a browser-based tool that combines rich visualization capabilities with a streamlined interface for exploring and sharing biological networks. It offers an expressive query language, topological network analysis, GUI-based filtering, visual grouping, customizable layouts, a data-editor, and fine-grained property-based styling options, particularly suited for visualizing molecular models of disease pathobiology or drug mechanism of action. Availability: Graph Lens Lite is available at GitHub (https://github.com/Delta4AI/GraphLensLite).
bioinformatics 2026-02-27 v1
MAP: A Knowledge-driven Framework for Predicting Single-cell Responses for Unprofiled Drugs
Feng, J.; Zhao, Z.; Zhang, X.; Liu, M.; Chen, J.; Quan, X.; Zhang, J.; Wang, Y.; Zhang, Y.; Xie, W.
AI Summary
- The study introduces MAP, a framework that integrates biological knowledge into cellular perturbation modeling to predict responses to unprofiled drugs.
- MAP uses a knowledge graph (MAP-KG) and a knowledge-driven pre-training strategy to create unified embeddings for molecular structures, protein sequences, and mechanistic descriptions.
- Evaluations showed MAP improved prediction accuracy by up to +13.3% for unseen cell type-drug combinations and +12.2% for unprofiled drugs, with pathway analysis confirming mechanism consistency in drug screening.
Abstract
Predicting how cells respond to chemical perturbations is one of the goals for building virtual cells, yet experimentally profiled compounds cover only a small fraction of this space. Existing models struggle to generalize to unprofiled compounds, as they typically treat drugs as isolated identifiers without encoding their mechanistic relationships. We present MAP, a framework that integrates structured biological knowledge into cellular perturbation modeling and supports zero-shot prediction for small molecules with scarce or absent perturbation profiles. Specifically: (i) we construct MAP-KG, a large-scale knowledge graph tailored for cellular perturbation modeling that unifies 14 public resources, spanning 187k drugs, 23k genes, and 694k mechanistic relationships; (ii) we propose a knowledge-driven pre-training strategy that aligns molecular structures, protein sequence features, and textual mechanistic descriptions into a unified embedding space via contrastive learning, producing mechanism-aware and transferable gene and compound embeddings. The resulting knowledge-informed gene and drug representations are then coupled with a pretrained single-cell foundation model to condition perturbation response prediction; (iii) we evaluate MAP under two zero-shot generalization regimes: unseen cell type-drug combinations and the stricter setting of unprofiled drugs, where it improves top-50 DEG Pearson delta correlation by up to +13.3% and +12.2%, respectively, over the strongest baselines across three benchmarks. We further perform pathway-level functional analysis via GSEA for in-silico screening, where MAP predicts coherent, mechanism-consistent programs on unprofiled candidate drugs, and prioritizes 4 of 5 approved anti-cancer drugs in A-549 (non-small cell lung cancer).
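The evaluation metric named above, the top-50 DEG Pearson delta correlation, can be sketched in Python as follows. The exact definition used in the paper's benchmarks may differ. This version assumes that "delta" means the expression change relative to control, and that the top-k genes are those with the largest absolute observed change. Function and argument names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / math.sqrt(var_x * var_y)


def topk_delta_pearson(control, observed, predicted, k=50):
    """Correlate predicted and observed expression changes on the top-k genes.

    All three inputs are per-gene mean expression vectors in the same gene
    order. Genes are ranked by the absolute OBSERVED change from control.
    """
    true_delta = [o - c for o, c in zip(observed, control)]
    pred_delta = [p - c for p, c in zip(predicted, control)]
    ranked = sorted(range(len(true_delta)), key=lambda g: abs(true_delta[g]), reverse=True)
    top = ranked[:k]
    return pearson([true_delta[g] for g in top], [pred_delta[g] for g in top])
```

Correlating deltas rather than raw expression removes the baseline profile shared by control and treated cells, so a model cannot score well simply by reproducing the control state.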
bioinformatics 2026-02-27 v1
PantheonOS: An Evolvable Multi-Agent Framework for Automatic Genomics Discovery
Xu, W.; Poussi, E.; Zhong, Q.; Zeng, Z.; Zou, C.; Wang, X.; Lu, Y.; Cui, M.; Okamura, D.; Huang, C.; Ding, J.; Zhao, Z.; Yang, Y.; Pan, X.; Vijay, V.; Konno, N.; Liu, N.; Li, L.; Ma, X. R.; Conley, S. D.; Kern, C.; Goodyer, W. R.; Bintu, B.; Zhu, Q.; Chi, N. C.; He, J.; Rognoni, L.; Zhang, X.; Wu, J.; Ellison, D.; Rabinovitch, M.; Engreitz, J. M.; Qiu, X.
AI Summary
- PantheonOS is introduced as an evolvable, privacy-preserving multi-agent framework for genomics discovery, aiming to balance generality with domain specificity.
- It utilizes agentic code evolution to enhance batch correction and gene panel selection, achieving super-human performance.
- Key findings include discovering asymmetric paracrine inhibition in mouse embryo development, integrating multi-omics data for heart disease insights, and predicting cardiac effects with virtual cell models.
Abstract
The convergence of large language model-powered autonomous agent systems and single-cell biology promises a paradigm shift in biomedical discovery. However, existing biological agent systems, which build upon single-agent architectures, are either narrowly specialized or overly general, limiting applications to routine analyses. We introduce PantheonOS (PantheonOS.stanford.edu), an evolvable, privacy-preserving multi-agent framework designed to reconcile generality with domain specificity. Critically, PantheonOS enables agentic code evolution, allowing state-of-the-art batch correction and our reinforcement-learning-augmented gene panel selection algorithms to evolve and achieve super-human performance. PantheonOS drives biological discoveries across systems: uncovering asymmetric paracrine Cer1-Nodal inhibition in proximal-distal axis formation in novel early mouse embryo 3D data; integrating human fetal heart multi-omics with whole-heart data to reveal molecular programs that underpin heart diseases; and adaptively selecting virtual cell models to predict cardiac regulatory and perturbation effects. Together, PantheonOS points towards a future where scientific discoveries are increasingly driven by self-evolving AI systems across biology and beyond.
bioinformatics 2026-02-27 v1
DENcode: A model for haplotype-informed transmission probability of dengue virus
Maduranga, S.; Arroyo, B. M. V.; Sigera, C.; Weeratunga, P.; Fernando, D.; Rajapakse, S.; Lloyd, A. R.; Bull, R. A.; Stone, H.; Rodrigo, C.
AI Summary
- DENcode is a model that estimates the transmission probability of dengue virus by integrating epidemiological factors with genetic similarity between viral haplotypes.
- Validation with 90 dengue cases from Colombo, Sri Lanka showed stable estimates with narrow credible intervals, highlighting the importance of both genetic and epidemiological components.
- Haplotype-informed networks were more informative than consensus-based networks (3.6 and 1.6 times more edges for DENV2 and DENV3, respectively), enhancing the understanding of transmission dynamics within the community.
Abstract
Dengue virus transmission networks are often only partially resolved, owing to gaps in sampling, unobserved mosquito-mediated transmission, and reliance on phylogenetic methods that describe evolutionary relatedness but not explicit, probabilistic transmission links between individual infections. We developed DENcode, a framework to estimate the relative likelihood of vector-mediated transmission between pairs of dengue cases by combining a temperature- and time-modulated epidemiological kernel, which captures the extrinsic incubation period and human infectiousness, with a phylogenetically informed genetic similarity kernel derived from patristic distances between viral haplotypes or consensus sequences. In validation with a real-world dataset of 90 dengue infections sampled from Colombo, Sri Lanka between 2017 and 2020 and sequenced to resolve within-host haplotypes, DENcode estimates were stable across 100 Monte Carlo iterations, yielding narrow credible intervals (median width <0.001) and consistent top-ranked transmission pairs. Sensitivity analyses using ablation experiments showed that removing either the genetic or epidemiological component substantially altered the distribution of linkage probabilities, indicating that both contribute meaningfully to the inferred transmission structure. Serotype-specific transmission networks constructed from DENcode pairwise linkage probabilities were analysed using degree- and path-based centrality measures at probability thresholds of 0.1 and 0.5, revealing the relative importance of cases to disease transmission within the community. Haplotype-derived networks were more informative than consensus-based networks (3.6 and 1.6 times more edges for DENV2 and DENV3, respectively). DENcode is a robust framework for exploring dengue transmission within a community; its output is a network of transmission probabilities informed by pathogen genetic similarity and clinical epidemiological parameters.
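The thresholding step described in the abstract can be illustrated with a minimal sketch. It assumes the pairwise linkage probabilities are available as a symmetric matrix and that links are treated as undirected; the matrix values, the function name `threshold_network`, and the undirected treatment are illustrative assumptions, not details taken from DENcode.

```python
import numpy as np

def threshold_network(prob, threshold):
    """Build an undirected network from a symmetric matrix of pairwise
    linkage probabilities, keeping pairs with probability >= threshold.
    Returns the edge list and the degree of every case."""
    prob = np.asarray(prob, dtype=float)
    # Upper triangle only, so each unordered pair is counted once.
    adjacency = np.triu(prob >= threshold, k=1)
    edges = [(int(i), int(j)) for i, j in zip(*np.nonzero(adjacency))]
    # Symmetrize to count every edge at both of its endpoints.
    degree = (adjacency | adjacency.T).sum(axis=1)
    return edges, degree

# Hypothetical linkage probabilities for four cases (not real data).
prob = np.array([
    [0.00, 0.60, 0.20, 0.05],
    [0.60, 0.00, 0.15, 0.00],
    [0.20, 0.15, 0.00, 0.55],
    [0.05, 0.00, 0.55, 0.00],
])

edges_01, degree_01 = threshold_network(prob, 0.1)
edges_05, degree_05 = threshold_network(prob, 0.5)
```

Raising the threshold from 0.1 to 0.5 prunes the weaker links, so degree-based rankings of cases can change between the two networks.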
bioinformatics 2026-02-27 v1
Quartet-based species tree methods enable fast and consistent tree of blobs reconstruction under network multispecies coalescent
Dai, J.; Han, Y.; Molloy, E.
AI Summary
- The study addresses the challenge of reconstructing the tree of blobs (TOB) under the network multispecies coalescent model by introducing a framework that uses quartet-based methods for faster and consistent TOB estimation.
- The framework involves refining the TOB using Weighted Quartet Consensus and then contracting edges based on hypothesis tests, resulting in the method TOB-QMC with a time complexity of O(n^3k).
- TOB-QMC was found to be at least as accurate as TINNiK, scalable to larger datasets, and useful for exploring hyperparameters, with practical implications for interpreting species trees in the context of gene flow.
Abstract
Gene flow between species or populations is an important force in evolution, modeled by the network multispecies coalescent (NMSC). Reconstructing evolutionary histories, called species networks, under this model is notoriously challenging, with the leading methods scaling to just tens of species. Divide-and-conquer is a promising path forward; however, methods with statistical consistency guarantees require the tree of blobs (TOB), which displays only the tree-like parts of the network, to perform subset decomposition. TOB reconstruction under the NMSC is challenging in its own right, with the only available method, TINNiK, having time complexity O(n^5 + n^4k), where k is the number of input gene trees and n is the number of species. Here, we present a framework for TOB reconstruction that operates by (1) seeking a refinement of the TOB and then (2) contracting edges in it. For step (1), we show that an optimal solution to Weighted Quartet Consensus is a TOB refinement almost surely, as the number of gene trees increases, motivating the use of fast quartet-based methods for species tree estimation such as ASTRAL or TREE-QMC. For step (2), we contract edges in the refinement tree based on the same hypothesis tests as TINNiK, which are applicable to subsets of four taxa. We show that sampling just O(n) four-taxon subsets around each edge enables statistically consistent TOB estimation, with asymptotic runtime dominated by tree reconstruction. Leveraging TREE-QMC for this step gives our method a time complexity of O(n^3k) and its name, TOB-QMC. On simulated data sets, TOB-QMC is at least as accurate as, and often more accurate than, TINNiK. Moreover, TOB-QMC scales to larger data sets and enables fast and interpretable exploration of hyperparameters used in hypothesis testing. We demonstrate the importance of this feature on phylogenomic data sets. Lastly, our framework is related to ad hoc analyses performed by biologists, as network methods do not scale. Our theoretical results provide justification for such approaches as well as context for interpreting species trees estimated with quartet-based methods in the presence of gene flow; this is critical given the recent result that tree-based network inference with ASTRAL can be positively misleading.
bioinformatics 2026-02-26 v3
A Framework for Autonomous AI-Driven Drug Discovery
Selinger, D. W.; Wall, T. R.; Stylianou, E.; Khalil, E. M.; Gaetz, J.; Levy, O.
AI Summary
- The study introduces a framework for autonomous AI-driven drug discovery that integrates knowledge graphs with large language models to manage vast biomedical data.
- The framework uses a focal graph to distill data into hypotheses, enhancing drug discovery processes like target prediction.
- Small-scale applications of this scalable approach demonstrated novel insights across multiple drug discovery stages, including autonomous execution of a multi-step target discovery workflow.
Abstract
The exponential increase in biomedical data offers unprecedented opportunities for drug discovery, yet overwhelms traditional data analysis methods, limiting the pace of new drug development. Here we introduce a framework for autonomous artificial intelligence (AI)-driven drug discovery that integrates knowledge graphs with large language models (LLMs). It is capable of planning and carrying out automated drug discovery programs at a massive scale while providing details of its research strategy, progress, and all supporting data. At the heart of this framework lies the focal graph - a novel construct that harnesses centrality algorithms to distill vast, noisy datasets into concise, transparent, data-driven hypotheses. We demonstrate that even small-scale applications of this highly scalable approach can yield novel, transparent insights relevant to multiple stages of the drug discovery process, including chemical structure-based target prediction, and present the implementation of a system which autonomously plans and executes a multi-step target discovery workflow.
bioinformatics 2026-02-26 v3
MANTIS: Analytics toolkit for spatial metabolomics with matching spatial transcriptomics data
Hao, Y.; Kim, Y.; Aggarwal, B.; Sinha, S.
AI Summary
- MANTIS is a statistical framework designed to analyze co-registered spatial metabolomics (SM) and spatial transcriptomics (ST) data at single cell or spot resolution, incorporating spatial domain or cell type information.
- It uses an autocorrelation-preserving permutation strategy for statistical significance, employing spatial cross-correlation and partial correlation to explore gene-metabolite relationships.
- Across various datasets, MANTIS offers more specific and interpretable results by modeling confounding structures, surpassing existing methods in statistical rigor.
Abstract
Motivation: Joint Spatial Metabolomics (SM) and Spatial Transcriptomics (ST) profiling is a powerful approach to fine-mapping of metabolic states associated with tissue function. Current computational tools for analysis of "SM+ST" data focus primarily on alignment and integration of the two modalities, with limited support for probing biological relationships between the two molecular layers. Results: We present MANTIS, a statistical framework for analyzing co-registered SM+ST profiles at single cell or spot resolution, along with spatial domain or cell type information, to discover metabolite spatial patterns and gene-metabolite relationships. It employs an autocorrelation-preserving permutation strategy to assess statistical significance, yielding calibrated inference under spatial dependence. It disentangles different sources of spatial patterns and correlations, viz., those arising from regional preferences, cell type associations, or other unknown factors. It introduces the use of spatial cross-correlation and spatial partial correlation statistics for quantifying gene-metabolite associations. Across data sets spanning different spatial technologies, tissues and species, MANTIS provides more specific and interpretable discoveries than existing methods through rigorous statistical testing and explicitly modeling confounding structure. To our knowledge, MANTIS is the first toolkit to unify spatial metabolomics, spatial transcriptomics, cell type information and spatial domains within a single framework that emphasizes spatial statistics, hypothesis testing and confounder correction. Availability and Implementation: Freely available on the web at https://github.com/yuhaotuo/MANTIS.
bioinformatics 2026-02-26 v2
keju: powerful and accurate inference in Massively Parallel Reporter Assays
Xue, A.; Zahm, A. M.; English, J.; Sankararaman, S.; Pimentel, H.
AI Summary
- The study addresses the challenge of uncertainty in Massively Parallel Reporter Assays (MPRAs) by introducing keju, a hierarchical statistical model.
- keju accounts for differences in uncertainty between DNA and RNA counts and between batches, improving inference accuracy.
- In simulations, keju reached 59% sensitivity (versus 31% for MPRAnalyze and 9% for BCalm); in real data, it called only 6.8% of unlabeled negative controls significant (versus 34% for MPRAnalyze and 12% for BCalm).
Abstract
Massively Parallel Reporter Assays (MPRAs) interrogate the regulatory function of thousands of designed genetic elements in parallel through linked DNA and RNA readouts using an engineered construct and attached minimal reporter. Given the complexity of MPRA experimental designs, several different sources of uncertainty complicate inference. We show that previous methods do not account for substantial differences in uncertainty levels between the DNA and RNA counts and between batches. Accordingly, we present keju, a hierarchical statistical model that estimates candidate transcription rate, differential activity between conditions, and effects from promoter composition for MPRA data. To maximize statistical power and improve false positive rate control, keju conditions on the DNA counts to model batch-specific and modality-specific uncertainty in the RNA counts. keju shows vastly improved sensitivity (59%) in simulations compared to previous methods (31% for MPRAnalyze and 9% for BCalm), and also has lower, more robust false positive rates, calling only 6.8% of unlabeled negative controls significant in real data (compared to 34% for MPRAnalyze and 12% for BCalm).
bioinformatics 2026-02-26 v1
SpaMOAL: A spatial multi-omics graph contrastive learning method for spatial domains identification
Wang, J.; Huo, Y.; Zhao, R.; Pan, Y.; Wang, H.; Li, X.
AI Summary
- SpaMOAL is a graph-based contrastive learning method designed to integrate spatial coordinates, histological images, and molecular profiles for identifying spatial domains in tissues.
- The method was benchmarked on multiple spatial multi-omics datasets, where it outperformed existing methods in accurately delineating spatial tissue domains.
Abstract
Recent advances in spatial multi-omics technologies have opened new avenues for characterizing tissue architecture and function in situ, by simultaneously providing multimodal and complementary information such as spatially resolved transcriptomic, epigenomic, and proteomic features. Current computational approaches face substantial challenges such as effective integration of multi-omics molecular information with spatial information and corresponding high-resolution histology images. To address this challenge, we proposed SpaMOAL (Spatially Multi-Omics graph contrAstive Learning), a graph-based contrastive learning approach for spatial domain identification. SpaMOAL learns clustering-friendly representations from spatial multi-omics data by integrating spatial coordinates, histological image features and molecular profiles, enabling accurate delineation of spatial tissue domains. Benchmarking across multiple recent paired spatial multi-omics datasets demonstrated that SpaMOAL consistently outperforms existing methods. By enabling accurate spatial domain delineation, SpaMOAL provides a powerful framework for interpreting tissue organization and cellular microenvironments.
bioinformatics 2026-02-26 v1
Identification of different sequence properties between HIV-1 DNA and RNA across subtypes using the k-mer-based approach
Chen, H.-C.; Wisniewski, J.; Serwin, K.; Parczewski, M.; Kula-Pacurar, A.; Skums, P.; Kirpich, A.; Yakovlev, S.
AI Summary
- This study used an updated k-mer-based approach, PORT-EK-v2, to compare DNA and RNA sequence properties across HIV-1 subtypes.
- Findings indicated distinct sequence properties between DNA and RNA, with "isolate k-mer count" useful for classification.
- Markov chain Monte Carlo modeling showed discontinuous k-mer frequency patterns, suggesting significant subtype-specific differences.
Abstract
Advanced analytical tools that mine features hidden in intricate datasets and strengthen the biological interpretation of multigenomic outputs are of paramount importance. In this study, we present an updated version of a k-mer-based approach, PORT-EK-v2, which allows comparison of multiple genomic datasets and identification of over-represented genomic regions (k-mers) related to specific organisms. Using PORT-EK-v2, we show that DNA and RNA sequence properties are most likely distinct across HIV-1 subtypes. Furthermore, we show that "isolate k-mer count" could serve as a default choice for classifying the DNA versus RNA sequence property. Lastly, results based on Markov chain Monte Carlo modeling revealed a discontinuous nature of the sequence property in terms of k-mer frequencies across HIV-1 subtypes. Altogether, we propose that the sequence property (DNA versus RNA) is distinct across HIV-1 subtypes and has a consequential impact on identifying new and emerging subtypes in the future.
bioinformatics 2026-02-26 v1
Transcriptome-based lead generation, ligand- and structure-based prioritization and experimental validation of TLR5-activating molecules
Jain, A.; Hungharla, H.; Subbarao, N.; Tandon, V.; Ahmad, S.
AI Summary
- This study used a transcriptome-based approach with the connectivity map (CMAP) library to generate leads for TLR5 activation, integrating cellular context early in drug discovery.
- Leads were prioritized using ligand- and structure-based methods, and the top nine were experimentally validated with ELISA, showing dose-dependent TLR5 activation.
- The framework suggests potential complex interactions with the TLR signaling pathway and is scalable for other drug discovery applications.
Abstract
Current in silico drug discovery protocols ubiquitously depend on lead generation using a ligand-based approach in which novel leads are generated by fragment-signature matching or by a structure-based search involving molecular docking and conformational dynamics. None of them incorporates the cellular contexts in which these drugs ultimately operate, leaving the task to a later stage of optimization and leading to a high failure rate. Incorporating systems-level responses to drugs at an early stage of lead generation can significantly address this concern but has not been sufficiently explored. In this work, we employ a systems-level approach using the connectivity map (CMAP) library to generate leads against a challenging system, the TLR pathway. Starting with gene expression data of TLR5 activation by its natural ligand, we generated molecular leads using CMAP and rigorously analyzed their validity using ligand- and structure-based approaches, which helped to prioritize top hits. Experimental validation using an ELISA-based antibody assay confirmed the activation of TLR5 by each of the top nine prioritized leads, with their dose-dependent patterns suggesting that some of them may interact with the TLR signaling pathway in a complex manner. Although demonstrated on TLR5, the proposed framework is intuitively scalable to other lead generation and optimization tasks.
bioinformatics 2026-02-26 v1
Optimal transport fate mapping resolves T cell differentiation dynamics across tissues
Plotkin, A. L.; Mullins, G. N.; Green, W. D.; Shi, H.; Chung, H. K.; Yi, H.; Stanley, N.; Milner, J. J.
AI Summary
- The study introduces an optimal transport-based fate mapping framework to reconstruct continuous CD8 T cell differentiation trajectories across time and tissues using single-cell RNA-seq data from mice with acute viral infection.
- This approach accurately depicts population dynamics and identifies distinct migration waves into the small intestine, leading to different tissue-resident memory (Trm) fates.
- The analysis revealed CD52 as a marker for recent tissue entrants and AP4 as a regulator distinguishing circulating from tissue-resident T cells.
Abstract
Immune responses evolve across time and tissues through coordinated programs of proliferation, differentiation, and migration, yet most single-cell measurements capture only static molecular snapshots. As a result, reconstructing how immune cells transition between alternative fates remains challenging, particularly for CD8 T cells, whose differentiation is highly dynamic and shaped by rapid expansion, contraction, and tissue trafficking. Here, we introduce an optimal transport-based fate mapping framework that reconstructs continuous CD8 T cell trajectories across time and tissues. Applied to longitudinal single-cell RNA-seq data from CD8 T cells responding to acute viral infection in mice, this approach accurately recapitulates population dynamics and resolves coherent effector and memory T cell differentiation trajectories. Extending the model to multiple tissues, we identify and experimentally validate temporally distinct waves of migration into the small intestine that give rise to divergent tissue-resident memory (Trm) fates, long-lived T cells crucial in immunosurveillance. By integrating optimal transport inference with time-resolved in vivo labeling, we demonstrate that CD52 marks recent tissue entrants and distinguishes them from Trm precursors. Finally, trajectory-guided analysis of transcription factor regulons reveals both shared and context-specific gene regulatory programs and identifies AP4 as a key regulator of circulating versus tissue-resident specification. Together, these results establish optimal transport as a principled framework for reconstructing immune cell fate dynamics and provide a quantitative map of early events governing antiviral CD8 T cell differentiation across tissues.
bioinformatics 2026-02-26 v1
A pocket-centric framework for selective targeting of amyloid fibril polymorphs
Ossard, G.; Ciambur, C. B.; Melki, R.; Sperandio, O.; Romero, E.
AI Summary
- The study analyzed 97 cryo-EM structures of amyloid-beta, tau, and alpha-synuclein fibrillar polymorphs to understand binding pocket distribution.
- Findings indicate that most pockets are shared across different amyloid proteins, explaining the lack of ligand selectivity.
- A small subset of unique pockets was identified, suggesting potential for selective ligand design under specific structural conditions.
Abstract
The rapid expansion of high-resolution cryo-EM structures of amyloid fibrils has not yet translated into the rational design of selective or specific ligands of protein aggregates involved in Alzheimer's and Parkinson's diseases. This persistent limitation suggests that the obstacle lies in a certain degree of commonality in the organization of the fibrillar polymorph surfaces available for small-molecule binding. Here, we present a systematic and global analysis of binding pockets across 97 cryo-EM structures of amyloid-beta, tau, and alpha-synuclein protein fibrillar polymorphs. Using a unified pocket similarity index and minimum spanning tree representations, we construct global and protein-specific pocketomes that reveal how surface cavities are distributed across different amyloid-forming proteins and the fibrillar polymorphs they form. We show that most detectable pockets are shared across multiple fibrillar folds and, in many cases, across different amyloid-forming proteins, providing a structural explanation for the widespread lack of ligand selectivity. Conversely, a limited subset of pockets forms isolated clusters associated with specific proteins or polymorphs, delineating the rare structural conditions under which selective or specific ligand design is feasible. Together, these results reframe amyloid targeting as a problem of constrained pocket diversity within the amyloid polymorphic landscape, and provide a conceptual framework to guide both the design of future ligands and the strategic avoidance of intrinsically non-discriminatory binding sites.
bioinformatics 2026-02-26 v1
CellPace: A temporal diffusion-forcing framework for simulation, interpolation and forecasting of single-cell dynamics
Su, C.; Emad, A.
AI Summary
- CellPace is a generative model using a transformer-based temporal diffusion approach to simulate, interpolate, and forecast single-cell dynamics, addressing the challenge of irregularly sampled or missing developmental stages.
- It excels in modeling continuous developmental dynamics across various mouse lineages, preserving fine biological details and accurately mapping to spatial transcriptomics.
- CellPace also handles multi-modal data, integrating RNA and chromatin dynamics, even when temporal ordering is inferred from pseudotime.
Abstract
Single-cell omics technologies resolve cellular heterogeneity at high resolution but provide only static snapshots of continuous developmental processes. This makes it difficult to recover coherent temporal dynamics when developmental stages are irregularly sampled or missing. While recent generative models can simulate observed cell states, they often treat timepoints as discrete categories, hindering interpolation across gaps and extrapolation to unobserved future stages. We present CellPace, a generative model that learns and generates developmental dynamics by leveraging a transformer-based temporal diffusion backbone conditioned on continuous, gap-aware temporal encodings. Across diverse mouse developmental lineages, CellPace achieves state-of-the-art performance in simulation, interpolation, and forecasting tasks. Beyond recovering global population statistics, generated cells preserve fine-grained biological structure, retaining dynamic gene regulatory programs and mapping accurately to anatomical regions in spatial transcriptomics data. Furthermore, CellPace extends naturally to multi-modal data, modeling joint RNA-chromatin dynamics even when temporal ordering is inferred from pseudotime. Together, these results position CellPace as a robust framework for modeling and generating continuous developmental dynamics from sparse, cross-sectional single-cell data.
bioinformatics 2026-02-26 v1
Exploring differences across pangenome-graph representations using Escherichia coli O157:H7 as a model
Liu, P.; Hu, K.; Mughini-Gras, L.; Zomer, A. L.; Brouwer, M. S. M.; Dallman, T. J.; Paganini, J. A.
AI Summary
- The study benchmarked six methods for constructing pangenome graphs of Escherichia coli O157:H7, comparing gene-cluster, ccDBG, multiple sequence alignment, and hybrid approaches.
- Results showed significant differences in graph size, fragmentation, and computational cost, with assembly completeness being a major factor influencing graph structure.
- The analysis highlighted that pangenome graph representation affects bacterial diversity modeling, with varying accuracy at specific loci like Shiga toxin genes.
Abstract
Pangenome graphs are increasingly used to represent population-scale bacterial diversity, yet construction methods span fundamentally different representation paradigms whose outputs and sensitivities to assembly quality remain poorly quantified. We systematically reviewed microbial pangenome graph tools and benchmarked six representative methods spanning gene-cluster, compacted coloured de Bruijn graph (ccDBG), multiple sequence alignment, and hybrid approaches. Using a repeat-rich Escherichia coli O157:H7 dataset with complete genomes and matched short-read data, we constructed graphs from identical inputs and observed orders-of-magnitude differences in graph size and fragmentation, indicating that global topology is driven by representation strategy. Varying completeness composition revealed that assembly fragmentation is a first-order determinant of graph structure: gene-cluster graphs contracted as draft assemblies replaced complete genomes, whereas unitig graphs expanded, with distinct degree-prevalence fingerprints across tools. Computational cost mirrored these shifts and depended strongly on completeness composition, including a pronounced runtime penalty for one ccDBG implementation on all-draft inputs. Finally, analysis of Shiga toxin loci showed that pangenome-level reconciliation does not reliably correct assembly artefacts at challenging multi-copy genes and that performance varies by locus. Together, these findings show that pangenome graphs are representation-dependent models of bacterial diversity, and that assembly completeness is a primary determinant of their topology, scalability, and locus-level accuracy.
bioinformatics 2026-02-26 v1
Molecular Thermodynamics of KRAS Activation
Ciftci, F. S.; Erman, B.
AI Summary
- The study investigates the structural basis of KRAS activation by comparing the residue-contact networks of its GTP-bound active (6GOD) and GDP-bound inactive (4OBE) states using a statistical-mechanical framework.
- Key findings include the active state having higher mean contact energy and conformational entropy, with a thermodynamic balance at kT ≈ 2.41.
- Switch I (residues 25-40) was identified as the primary allosteric locus due to significant changes in contact essentiality between states.
Abstract
The GTPase KRAS executes a conformational switch between a GTP-bound active state and a GDP-bound inactive state, which is central to oncogenic signalling, yet the structural basis of this switching at the level of residue-contact network organization remains incompletely characterised by conventional pairwise analyses. Here we apply a rigorous statistical-mechanical framework, grounded in the weighted Kirchhoff Laplacian and the Matrix-Tree Theorem, to construct spanning-tree partition functions for residue contact graphs derived from two crystallographic structures: 6GOD (active, GTP-analog-bound; 172 residues, 830 contacts) and 4OBE (inactive, GDP-bound; 169 residues, 809 contacts). The log-partition function log Z, the network free energy F = -kT log Z, the mean contact energy <E>, the heat capacity Cv, and the thermodynamic entropy S are computed across an effective temperature sweep from kT = 0.3 to 6.0. Edge marginal inclusion probabilities, P, obtained via effective-resistance theory, serve as topology-aware measures of contact essentiality. Differential analysis reveals that the active state consistently carries a higher mean contact energy (Δ<E> > 0) yet also a higher conformational entropy (ΔS > 0), with the free energy crossover ΔF = 0 occurring at kT ≈ 2.41, an intrinsic thermodynamic balance independent of any arbitrary additive reference. Switch I (residues 25-40) exhibits the largest state-dependent ΔP, identifying it as the primary allosteric locus of nucleotide-driven network reorganisation.
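The quantities named in the abstract can be sketched on a toy graph. The sketch below assumes Boltzmann edge weights w_e = exp(-E_e / kT), which is a natural choice but is not stated in the abstract; the function name `spanning_tree_thermo` and the triangle example are likewise illustrative. It relies on two standard results: the Matrix-Tree Theorem (the spanning-tree partition function Z equals any cofactor of the weighted Laplacian) and the identity that an edge's inclusion probability equals its weight times its effective resistance.

```python
import numpy as np

def spanning_tree_thermo(n_nodes, edges, energies, kT):
    """Spanning-tree thermodynamics of a connected weighted graph.
    edges: list of (i, j) node pairs; energies: one energy per edge.
    Returns log Z, free energy F, mean energy <E>, and edge
    inclusion probabilities P."""
    energies = np.asarray(energies, dtype=float)
    weights = np.exp(-energies / kT)  # assumed Boltzmann weighting
    laplacian = np.zeros((n_nodes, n_nodes))
    for (i, j), w in zip(edges, weights):
        laplacian[i, i] += w
        laplacian[j, j] += w
        laplacian[i, j] -= w
        laplacian[j, i] -= w
    # Matrix-Tree Theorem: Z is the determinant of the Laplacian
    # with one row and the matching column removed.
    _, log_z = np.linalg.slogdet(laplacian[1:, 1:])
    # Effective resistance from the Laplacian pseudoinverse.
    pinv = np.linalg.pinv(laplacian)
    inclusion = np.array([
        w * (pinv[i, i] + pinv[j, j] - 2.0 * pinv[i, j])
        for (i, j), w in zip(edges, weights)
    ])
    free_energy = -kT * log_z
    mean_energy = float(inclusion @ energies)  # <E> = sum_e P_e E_e
    return log_z, free_energy, mean_energy, inclusion

# Toy example: a triangle with zero-energy contacts has 3 spanning trees,
# and each edge appears in 2 of them.
edges = [(0, 1), (1, 2), (0, 2)]
log_z, free_energy, mean_energy, inclusion = spanning_tree_thermo(
    3, edges, [0.0, 0.0, 0.0], kT=1.0
)
```

The inclusion probabilities always sum to the number of nodes minus one, since every spanning tree has exactly that many edges; this makes a convenient sanity check on real contact graphs.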
bioinformatics 2026-02-26 v1
Protein Compositional Ratio Representation (PCRR) Systematically Improves Human Disease Prediction
Madduri, A. V.; Ellis, R. J.; Patel, C. J.
AI Summary
- The study proposes using pairwise protein ratios (log(A)-log(B)) in machine learning to better capture the compositional nature of proteomic data for disease prediction.
- Applied to the ROSMAP cohort, this approach improved Alzheimer's subtype classification with an average AUROC increase of +0.1274.
- In the UK Biobank dataset, the ratio-based model outperformed raw protein level models in 95.1% of diseases, with significant improvements in 56.7%.
Abstract
Plasma proteomics captures a functional snapshot of human physiology; yet, most machine learning models treat protein abundances as independent variables, ignoring the fact that biological systems and proteomic measurements are inherently compositional. Many molecular processes depend not on absolute concentrations but on relative balances: receptor-ligand stoichiometry, enzyme-substrate ratios, and homeostatic feedbacks that govern signaling and metabolism. We propose that these relationships are best captured through pairwise protein ratios, which more faithfully reflect underlying biochemical constraints than raw expression values. We evaluate a machine learning framework that models pairwise log-ratios of proteins (log(A)-log(B)) as features, thereby encoding compositional structure directly into the learning space. Applied to the ROSMAP plasma proteomics cohort (n = 871), this approach substantially improved the classification of Alzheimer's subtypes (NCI, MCI, AD, AD+) with an average AUROC gain of +0.1274 over a strong baseline that incorporated raw proteomics and demographics. The top-ranked ratios (e.g., SEMA3C:TMEM70, IDUA:NPTXR) captured converging pathogenic pillars of Alzheimer's disease, including microglial activation, proteostasis dysregulation, and lipid-clearance imbalance, highlighting that ratio-based features recover biologically coherent axes of disease. To assess generality, we scaled the method to the UK Biobank proteomic dataset (n > 53,000; 587 phenotypes). The ratio-based model outperformed raw-level models in 95.1% of diseases, with statistically significant (FDR < 0.05) gains in 56.7%. Together, these results suggest that proteomic data should be viewed and modeled as compositional systems, where relative protein abundances carry the accurate functional signal. This insight supports the broader utility of ratio-based representations for disease prediction and biomarker discovery.
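The pairwise log-ratio features log(A)-log(B) described in the abstract can be built as in the following minimal sketch. The function name `pairwise_log_ratios`, the toy abundance matrix, and the protein labels are illustrative assumptions; the sketch also assumes strictly positive abundances and does not reproduce any feature selection the authors may apply.

```python
import numpy as np
from itertools import combinations

def pairwise_log_ratios(abundance, names):
    """Turn a samples-by-proteins abundance matrix into one feature
    per unordered protein pair: log(A) - log(B)."""
    log_x = np.log(abundance)  # requires strictly positive values
    pairs = list(combinations(range(len(names)), 2))
    first = [i for i, _ in pairs]
    second = [j for _, j in pairs]
    features = log_x[:, first] - log_x[:, second]
    labels = [f"{names[i]}:{names[j]}" for i, j in pairs]
    return features, labels

# Toy data: the second sample is the first scaled by a factor of 2,
# as might happen with a global difference in measured signal.
abundance = np.array([
    [2.0, 1.0, 4.0],
    [4.0, 2.0, 8.0],
])
features, labels = pairwise_log_ratios(abundance, ["A", "B", "C"])
```

Both toy samples yield identical features, which illustrates why ratios suit compositional data: a sample-wide scaling factor cancels out. Note that p proteins give p(p-1)/2 ratios, so the feature count grows quadratically.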
bioinformatics 2026-02-25 v4
Generating Structurally Diverse Therapeutic Peptides with GFlowNet
Wijaya, E.
AI Summary
- The study addresses mode collapse in reinforcement learning for therapeutic peptide generation by introducing GFlowNet, which samples sequences proportionally to reward.
- GFlowNet was compared with GRPO, showing GFlowNet achieves greater sequence diversity without explicit diversity penalties.
- When diversity penalties were removed, GRPO failed while GFlowNet maintained diversity, highlighting GFlowNet's robustness in drug discovery applications.
Abstract
Reinforcement learning approaches for therapeutic peptide generation suffer from mode collapse, converging to narrow regions of sequence space even when explicit diversity penalties are applied. Fine-grained analysis reveals persistent mode-seeking behavior invisible to standard diversity metrics. We propose GFlowNet for peptide generation, which samples sequences proportionally to reward rather than maximizing expected reward. This provides intrinsic diversity without diversity penalties. Comparing against GRPO with explicit diversity enforcement, GFlowNet achieves substantially more uniform sequence sampling and fewer repetitive motifs. Critically, when diversity mechanisms are removed from the reward, GRPO collapses completely while GFlowNet maintains natural diversity. These results demonstrate that proportional sampling is inherently robust to reward function design, offering a key advantage for drug discovery pipelines requiring diverse candidates.
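The difference between maximizing reward and sampling in proportion to reward can be shown on a toy example. This is not a GFlowNet implementation: it only contrasts the two target behaviours on a fixed candidate set, and the peptide strings and reward values are made up for illustration.

```python
import random
from collections import Counter

# Hypothetical peptide candidates with made-up rewards.
rewards = {"AAKL": 4.0, "GRKW": 3.0, "LLKK": 2.0, "WKRG": 1.0}

def greedy_samples(rewards, n):
    """A reward maximizer returns the single best candidate every time."""
    best = max(rewards, key=rewards.get)
    return [best] * n

def proportional_samples(rewards, n, seed=0):
    """Draw each candidate with probability proportional to its reward."""
    rng = random.Random(seed)
    seqs = list(rewards)
    return rng.choices(seqs, weights=[rewards[s] for s in seqs], k=n)

greedy_counts = Counter(greedy_samples(rewards, 1000))
prop_counts = Counter(proportional_samples(rewards, 1000))
```

In this toy setting the greedy sampler collapses onto one sequence, while the proportional sampler keeps visiting lower-reward candidates at a lower rate, which is the intrinsic diversity the abstract refers to.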
bioinformatics 2026-02-25 v4
Distilling Protein Language Models with Complementary Regularizers
Wijaya, E.
AI Summary
- The study distills a large 738M-parameter protein language model into smaller models using uncertainty-aware position weighting and calibration-aware label smoothing, which together improve performance despite individual degradation.
- The distilled models offer up to 5x faster inference, use less memory, and maintain natural amino acid distributions, making them suitable for consumer-grade hardware.
- When fine-tuned on small protein family datasets, these models outperform the original in generating family-matching sequences, showing higher efficiency and Pfam hit rates.
Abstract
Large autoregressive protein language models generate novel sequences de novo, but their size limits throughput and precludes rapid domain adaptation on scarce proprietary data. We distill a 738M-parameter protein language model into compact students using two protein-specific enhancements, uncertainty-aware position weighting and calibration-aware label smoothing, that individually degrade quality yet combine for substantial improvement. We trace this complementary-regularizer effect to information theory: smoothing denoises teacher distributions while weighting amplifies the cleaned signal at biologically variable positions. Students achieve up to 5x inference speedup, preserve natural amino acid distributions, and require as little as 170 MB of GPU memory, enabling deployment on consumer-grade hardware. When fine-tuned on protein families with as few as 50 sequences, students generate more family-matching sequences than the teacher, achieving higher sample efficiency and Pfam hit rates despite their smaller capacity. These results establish distilled protein language models as superior starting points for domain adaptation on scarce data.
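The abstract does not give formulas for the two regularizers, so the following is only one plausible reading, written as a sketch: label smoothing mixes each teacher distribution with a uniform distribution, and position weighting up-weights positions where the teacher is more uncertain (higher entropy). The function names, the weighting rule `1 + H / log(V)`, and the default `eps` are all assumptions.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0)

def smooth(p, eps):
    """Mix a distribution with the uniform distribution over its vocabulary."""
    v = len(p)
    return [(1 - eps) * x + eps / v for x in p]

def kl(p, q):
    """KL(p || q); q must be strictly positive wherever p is positive."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)

def distill_loss(teacher, student, eps=0.1):
    """Weighted mean over positions of KL(smoothed teacher || student).

    Assumed weighting: 1 + normalized teacher entropy, so that variable
    positions contribute more. This rule is illustrative, not the paper's.
    """
    v = len(teacher[0])
    weights = [1.0 + entropy(p) / math.log(v) for p in teacher]
    terms = [w * kl(smooth(p, eps), q) for w, p, q in zip(weights, teacher, student)]
    return sum(terms) / sum(weights)

# Two positions over a toy 4-letter vocabulary: one confident, one uniform.
teacher = [[0.7, 0.1, 0.1, 0.1], [0.25, 0.25, 0.25, 0.25]]
matched = distill_loss(teacher, [smooth(p, 0.1) for p in teacher])
mismatched = distill_loss(teacher, [[0.25] * 4, [0.25] * 4])
```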
bioinformatics 2026-02-25 v3
OriGene: A Self-Evolving Virtual Disease Biologist Automating Therapeutic Target Discovery
Zhang, Z.; Qiu, Z.; Wu, Y.; Li, S.; Wang, D.; Liu, Y.; Zhou, Z.; Hu, Y.; Chen, Y.; An, D.; Wang, Y.; Li, Y.; Zhong, Z.; Ou, C.; Wang, Z.; Tang, F.; Chen, J. X.; Ma, R.; Li, J.; Wang, X.; Lu, W.; Xue, H.; Zhang, W.; Wei, Z.; Ma, R.; Shi, Z.; Wang, K.; Liu, Q.; Dong, B.; He, Y.; Liu, T.; Gu, J.; Song, S.; Feng, Q.; Zhang, J.; Zhang, B.; Tian, L.; Bai, L.; Gao, Q.; Sun, S.; Zheng, S.
AI Summary
- OriGene is a self-evolving multi-agent system designed to automate therapeutic target discovery by integrating diverse biomedical data.
- It outperforms human experts and other AI models in accuracy, recall, and robustness, particularly with sparse or noisy data.
- OriGene identified novel therapeutic targets for liver (GPR160) and colorectal cancer (ARG2), showing significant anti-tumor activity in preclinical models.
Abstract
Therapeutic target discovery remains a critical yet intuition-driven bottleneck in drug development, typically relying on disease biologists to laboriously integrate diverse biomedical data into testable hypotheses for experimental validation. Here, we present OriGene, a self-evolving multi-agent system that functions as a virtual disease biologist, systematically identifying original and mechanistically grounded therapeutic targets at scale. OriGene coordinates specialized agents that reason over diverse modalities, including genetic data, protein networks, pharmacological profiles, clinical records, and literature evidence, to generate and prioritize target discovery hypotheses. Through a self-evolving framework, OriGene continuously integrates human and experimental feedback to iteratively refine its core thinking templates, tool composition, and analytical protocols, thereby enhancing both accuracy and adaptability over time. To comprehensively evaluate its performance, we established TRQA, a benchmark comprising over 1,900 expert-level question-answer pairs spanning a wide range of diseases and target classes. OriGene consistently outperforms human experts, leading research agents, and state-of-the-art large language models in accuracy, recall, and robustness, particularly under conditions of data sparsity or noise. Critically, OriGene nominated previously underexplored therapeutic targets for liver (GPR160) and colorectal cancer (ARG2), which demonstrated significant anti-tumor activity in patient-derived organoid and tumor fragment models mirroring human clinical exposures. These findings demonstrate OriGene's potential as a scalable and adaptive platform for AI-driven discovery of mechanistically grounded therapeutic targets, offering a new paradigm to accelerate drug development.
bioinformatics 2026-02-25 v2
KuPID: Kmer-based Upstream Preprocessing of Long Reads for Isoform Discovery
Borowiak, M.; Yu, Y. W.
AI Summary
- KuPID is introduced as a method for preprocessing long RNAseq reads to enhance novel isoform discovery by using kmer sketching to pseudo-align reads to known isoforms.
- This approach reduces the need for full alignment to only relevant reads, speeding up the process by 2-3x and improving the f1 accuracy of isoform discovery by up to 16.7 points.
- An optional mode allows KuPID to be used for both isoform discovery and transcript quantification.
Abstract
Eukaryotic genes can encode multiple protein isoforms based on alternative splicing of their transcribed regions. Most modern novel isoform discovery methods function by identifying and assembling exon splice junctions from an RNAseq sample. However, splice junctions can only be accurately annotated with time-intensive dynamic programming alignment. This manuscript introduces KuPID, a method for preprocessing long RNAseq reads with the goal of better identifying novel isoform transcripts. KuPID utilizes kmer sketching as a pre-filter to quickly pseudo-align reads to known reference isoforms. Full alignment need only then be applied to reads that are most relevant to isoform discovery. Not only does KuPID speed up the discovery pipeline, it also increases downstream accuracy by filtering out extraneous reads. KuPID preprocessing simultaneously increases the f1 accuracy of isoform discovery pipelines by up to 16.7 points while decreasing the runtime by a factor of 2-3x. An optional mode permits a KuPID sample to be paired with both isoform discovery and transcript quantification. Code availability: https://github.com/mboro2000/KuPID.git
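The pre-filtering idea can be sketched with plain k-mer containment. This is not KuPID's code: it uses full k-mer sets rather than a subsampled sketch, and it assumes that reads poorly covered by every known isoform are the ones forwarded to full alignment. The function names, `k`, the threshold, and the sequences are hypothetical.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def best_containment(read, reference_kmers, k):
    """Largest fraction of the read's k-mers found in any single reference."""
    read_kmers = kmers(read, k)
    if not read_kmers:
        return 0.0
    return max(len(read_kmers & ref) / len(read_kmers)
               for ref in reference_kmers.values())

def split_reads(reads, references, k=5, threshold=0.9):
    """Separate reads that look like known isoforms from discovery candidates."""
    ref_kmers = {name: kmers(seq, k) for name, seq in references.items()}
    known, candidates = [], []
    for read in reads:
        if best_containment(read, ref_kmers, k) >= threshold:
            known.append(read)       # well explained by a known isoform
        else:
            candidates.append(read)  # forwarded to full alignment
    return known, candidates

# Toy data: the first read is a substring of the reference, the second is not.
references = {"iso1": "ACGTACGGTCATTGCA"}
reads = ["ACGTACGGTCAT", "ACGTACTTTTTGCA"]
known, candidates = split_reads(reads, references)
```

Set intersection avoids dynamic-programming alignment for the reads it filters out, which is where the reported runtime saving would come from under this reading.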
bioinformatics 2026-02-25 v2
RNA foundation models enable generalizable endometriosis disease classification and stable gene-level interpretation
McConnell, N.; Kelly, J.; Tadikonda, R.; Bettencourt-Silva, J.; Mulligan, N.; Madgwick, M.; Krishna, R.; Strudwick, J.; Evans, A.; Checkley, S.; Carrieri, A. P.; Smyrnakis, M.; Knowles, C. H.; Gardiner, L.-J.
AI Summary
- Researchers investigated whether foundation models (FMs) pretrained on large-scale transcriptomic data could improve endometriosis classification across different patient cohorts.
- Using a 12-cohort RNA-seq benchmark, FM embeddings significantly outperformed traditional TPM baselines, achieving a weighted F1-score of 0.83 versus 0.68.
- A new interpretability method, classified-aligned integrated gradients (CA-IG), identified a stable set of predictive genes across cohorts, highlighting novel candidates involved in endometriosis pathways.
Abstract
Endometriosis is a chronic inflammatory condition with significant diagnostic delays impacting one in ten reproductive-age women worldwide. While machine learning (ML) models trained on transcriptomic data show promise for disease prediction, limited generalizability across independent patient cohorts has hindered clinical translation. Foundation models (FMs) pretrained on large-scale transcriptomic data offer promise to learn transferrable, biologically meaningful representations that could support cross-cohort predictions. We assembled a 12-cohort bulk RNA-seq benchmark (334 samples) and developed a computationally efficient pipeline to test whether FMs improve endometriosis classification, an approach not previously applied to this disease. Using AutoXAI4Omics with cohort-aware validation, we compared embeddings derived from five state-of-the-art RNA FMs against TPM baselines. In cross-cohort prediction, FM embeddings significantly improved performance, achieving a weighted F1-score of 0.83 vs. 0.68 for the baseline. To allow gene-level interpretation of FM embedding models, we introduce classified-aligned integrated gradients (CA-IG), an interpretability approach aligning gene-level attributions to the downstream classifier without end-to-end fine-tuning. CA-IG revealed a conserved set of predictive genes from FM embeddings across cohort-validation regimes, contrasting with unstable baseline explainability, suggesting that FM embeddings prioritized transferable disease-related signal over cohort-specific effects. These genes include novel candidates that converge on biologically plausible pathways for endometriosis.
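For readers unfamiliar with the metric reported above, a weighted F1-score is the mean of per-class F1 scores weighted by each class's number of true samples. The sketch below computes it from scratch; the labels and predictions are invented and are unrelated to the paper's data.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit.
        f1 = 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)
        total += n * f1
    return total / len(y_true)

# Invented labels: three endometriosis samples and one control.
y_true = ["endo", "endo", "endo", "ctrl"]
y_pred = ["endo", "endo", "ctrl", "ctrl"]
score = weighted_f1(y_true, y_pred)
```

Weighting by support keeps a large class from being drowned out by a small one, which matters when cohorts differ in their case-to-control ratio.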
bioinformatics 2026-02-25 v1
Integrative Multi-Scale Sequence-Structure Modeling for Antimicrobial Peptide Prediction and Design
Li, J.; Shao, Y.; Li, Y.; Yu, Q.
AI Summary
- The study introduces MultiAMP, a framework that integrates multi-scale sequence and structure information to predict antimicrobial peptides (AMPs), addressing the limitations of current methods that treat these aspects in isolation.
- MultiAMP significantly outperforms existing methods by over 10% in MCC, particularly in identifying AMPs with low sequence identity to known peptides.
- Applied to marine organisms, MultiAMP identified 484 novel high-confidence AMPs and provided insights into AMP mechanisms, aiding in the design of peptides with specific motifs.
Abstract
Antimicrobial resistance (AMR) is accelerating worldwide, undermining frontline antibiotics and making the need for novel agents more urgent than ever. Antimicrobial peptides (AMPs) are promising therapeutics against multidrug-resistant pathogens, as they are less prone to inducing resistance. However, current AMP prediction approaches often treat sequence and structure in isolation and at a single scale, leading to mediocre performance. Here, we propose MultiAMP, a framework that integrates multi-level information for predicting AMPs. The model captures evolutionary and contextual information from sequences alongside global and fine-grained information from structures, synergistically combining these features to enhance predictive power. MultiAMP achieves state-of-the-art performance, outperforming existing methods by over 10% in MCC when identifying distant AMPs sharing less than 40% sequence identity with known AMPs. To discover novel AMPs, we applied MultiAMP to marine organism data, discovering 484 high-confidence peptides with sequences that are highly divergent from known AMPs. Notably, MultiAMP accurately recognizes various structural types of peptides. In addition, our approach reveals functional patterns of AMPs, providing interpretable insights into their mechanisms. Building on these findings, we employed a gradient-based strategy and achieved the design of AMPs with specific motifs. We believe MultiAMP empowers both the rational discovery and mechanistic understanding of AMPs, facilitating future experimental validation and therapeutic design. The codebase is available at https://github.com/jiayili11/multi-amp.
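The MCC quoted above is the Matthews correlation coefficient, a single score for binary classification computed from the four confusion-matrix counts. A minimal sketch follows; the counts in the examples are illustrative only.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Ranges from -1 (total disagreement) through 0 (chance level) to +1
    (perfect prediction). Returns 0.0 when any marginal total is zero.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

perfect = mcc(tp=50, tn=50, fp=0, fn=0)
chance = mcc(tp=25, tn=25, fp=25, fn=25)
inverted = mcc(tp=0, tn=0, fp=50, fn=50)
```

Unlike accuracy, MCC stays informative when positives are rare, which is the usual situation when screening for AMPs among many non-AMP peptides.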
bioinformatics 2026-02-25 v1
Quantified duplications of proteins within complexes across eukaryotes
Francis, O.
AI Summary
- The study integrates orthology and protein interaction data to map proteins of verified complexes across 31 diverse eukaryotes, identifying 184 universal orthogroups with components from all species.
- It developed the PCOC suite to analyze duplications and reductions in these orthogroups, revealing both multi-copy and single-copy proteins in various complexes.
- Case studies on Naegleria gruberi and Guillardia theta demonstrated taxon-specific expansions, enhancing understanding of eukaryotic protein-complex evolution.
Abstract
Protein complexes are central to cell biology and typically verified via a combination of interaction data, complete genome sequencing and comprehensive protein-coding gene predictions for reference eukaryotes. However, this data is lacking for non-reference eukaryotes. Protein complexes can be predicted in species for which no interaction data is available by mapping orthology of verified protein complex components from reference eukaryotes to predicted proteomes. Studies that map conservation of protein complex components by orthology often cover only a small number of protein queries, under-represent non-reference, microbial eukaryotes, and are scattered across the literature. Here, I integrate orthology and protein interaction data by mapping proteins of experimentally verified complexes to orthogroups of proteins spanning 31 diverse eukaryotes. Proteins within complex-harbouring orthogroups are retained and distributed more evenly across taxa than non-complex orthogroups. I identified 184 universal orthogroups that included orthologs of known protein complex components from all 31 eukaryotes, consistent with a conserved core repertoire, likely present in the last eukaryotic common ancestor (LECA). I generated the protein complex orthology cartographer (PCOC) suite to find significant duplications and reductions of proteins in universal orthogroups across and between eukaryotes. This revealed both multi-copy and notably single-copy proteins, in all queried species, from the exosome, spliceosome, proteasome, small-ribosomal processome, tRNA synthetases, MCM complexes and RNA polymerase III. Case analyses of Naegleria gruberi and Guillardia theta highlight taxon-specific expansions and show how broader protist inclusion improves domain-wide inference of eukaryotic protein-complex evolution.
bioinformatics 2026-02-25 v1
Bioactivity-driven discovery of repurposable antivirals as OSCAR inhibitors that promote cartilage protection via transcriptomic reprogramming
Ryu, G.; Kim, J.; Kim, S.; Lee, S. Y.; Kim, W.
AI Summary
- Researchers used sBEAR to identify adefovir (ADV) and brivudine (BRV) as inhibitors of the OSCAR-collagen interaction in chondrocytes, targeting OA treatment.
- Molecular docking showed both compounds bind to the OSCAR D2 domain's collagen-recognition pocket.
- In a mouse model, ADV and BRV reduced OA progression, promoted chondrocyte regeneration, and BRV reversed inflammatory and matrix-degrading gene expression.
Abstract
Osteoarthritis (OA) is a progressive degenerative joint disorder characterized by cartilage degradation, chronic pain, and impaired joint function. The avascular nature of cartilage isolates chondrocytes from systemic circulation, presenting significant challenges for therapeutic intervention. Despite extensive efforts, no clinically effective disease-modifying osteoarthritis drugs (DMOADs) are currently available. Targeting chondrocyte-specific receptors has therefore emerged as a promising strategy. The osteoclast-associated receptor (OSCAR), expressed on chondrocytes, has been implicated in the regulation of cartilage homeostasis and OA pathogenesis. Here, we applied sBEAR (Structurally similar Bioactive compound Enrichment by Assay Repositioning), a bioactivity-driven virtual screening framework independent of target structural information, to identify small-molecule inhibitors of the OSCAR-collagen interaction. By mining large-scale bioactivity profiles, we identified adefovir (ADV) and brivudine (BRV) as candidate OSCAR inhibitors. Molecular docking analyses indicated that both compounds occupy the collagen-recognition pocket within the OSCAR D2 domain. Intra-articular administration of these compounds in a post-traumatic OA mouse model significantly attenuated OA progression and enhanced chondrocyte regeneration. Both compounds increased Sox9 expression, and transcriptomic analyses revealed that BRV reverses inflammatory and extracellular matrix-degrading transcriptional programs. Together, these findings establish OSCAR as a therapeutically actionable target in OA and highlight ADV and BRV as potential DMOAD candidates.
bioinformatics 2026-02-25 v1
A Comprehensive Analysis of the Electrolytic Hydrogen Water Mechanism via a Feedforward Loop and its Functional Role in Intestinal Cells In Vitro
LI, J.
AI Summary
- This study investigated the molecular mechanisms of electrolytic hydrogen water (EHW) in Caco-2 cells using next-generation sequencing to analyze mRNA and miRNA expression.
- EHW was found to modulate oxidative stress response and tight junction formation, with bioinformatics revealing its impact on the HIF1 signaling pathway and the expression of genes like CUL5 and GOLGA7.
- EHW treatment reduced miR-429 and miR-200c-3p levels, enhancing CUL5 and GOLGA7 expression, and promoted cell differentiation, highlighting EHW's regulatory role via feed-forward loops.
Abstract
Electrolytic hydrogen water (EHW) plays a critical role in modulating cellular metabolism; yet, the underlying molecular mechanisms remain unclear. This study utilized next-generation sequencing (NGS) to assess mRNA and miRNA expression in EHW-treated Caco-2 cells. Bioinformatics analysis identified differentially expressed genes (DEGs) and pathways influenced by EHW and highlighted its involvement in the oxidative stress response and tight junction formation. Protein-protein interaction (PPI) network analysis of the DEGs identified first neighbor genes, supporting the role of EHW in suppressing oxidative stress-related genes while also enhancing the expression of the TCEB2-CUL5-COMMD8 (ECS complex) genes, both of which converged on the HIF1 signaling pathway. We also constructed an mRNA and miRNA competing endogenous RNA (ceRNA) network, which revealed four hub genes, two non-coding RNAs (miR-429 and miR-200c-3p) and two protein-coding RNAs (CUL5 and GOLGA7). These genes co-target the transcription factor KLF4 in Caco-2 cells, forming a TF-miRNA-gene network (TMGN). EHW treatment significantly decreased the levels of miR-429 and miR-200c-3p and stabilized CUL5 and GOLGA7 transcripts post-transcriptionally as compared to ACW. Concurrently, reduced miRNA expression weakened their pretranscriptional competition with mRNAs for KLF4 binding, further enhancing CUL5 and GOLGA7 expression. Phenotypic assays confirmed that continuous EHW treatment promotes Caco-2 cell differentiation. This study underscores the regulatory role of EHW in intestinal cells via feed-forward loops (FFLs), offering novel insights into the molecular mechanisms and functions of EHW.
bioinformatics 2026-02-25 v1
STRATA: Spatial Regulon Field Theory Reveals Coupling Architecture of Human Skin and Its Homogenization in Melanoma
Tjiu, J.-W.
AI Summary
- STRATA, a new differential-geometric framework, analyzes spatial transcriptomics by constructing continuous regulon activity fields and quantifying local co-regulation in tissues.
- Applied to human skin melanoma data, STRATA identifies coupling phase boundaries that align with histological tissue architecture.
- The analysis shows that melanoma homogenizes regulon coupling, reducing variance by 28% and phase boundary intensity by 18% compared to normal epidermal zones.
Abstract
Spatial transcriptomics captures gene expression in tissue context, yet current analyses reduce continuous regulatory landscapes to discrete cell clusters, discarding the geometry of intercellular regulation. Here we introduce STRATA (Spatial Transcription-factor Regulatory Architecture of Tissue Analysis), a differential-geometric framework that constructs continuous regulon activity fields from transcript coordinates, computes their coupling tensor to quantify local co-regulation between transcription factor programs, and derives a Regulon Stability Index from the Jacobian singular value decomposition. Applied to Xenium in situ data from human skin melanoma (382 genes, 13.7 million transcripts), STRATA identifies coupling phase boundaries -- positions where the regulatory logic of tissue changes -- that track histological tissue architecture (Pearson r = 0.32 with the dermal-epidermal junction marker KRT-diff, r = 0.51 with maximum principal stretch; P < 10^-10). Within-tissue comparison reveals that the melanoma microenvironment does not abolish regulon coupling but homogenizes it: coupling variance decreases 28% and phase boundary intensity drops 18% relative to the epidermal zone. STRATA transforms spatial transcriptomics from cell cataloguing to continuous field analysis of regulatory tissue architecture.
bioinformatics 2026-02-25 v1
scDesignPop generates realistic population-scale single-cell RNA-seq for power analysis, benchmarking, and privacy protection
Dong, C. Y.; Cen, Y.; Song, D.; Li, J. J.
AI Summary
- scDesignPop is introduced as a statistical simulator for generating realistic population-scale scRNA-seq data with genetic effects, addressing cost, method selection, and privacy issues in large cohort studies.
- It models cell- and individual-level covariates, cell-type-specific eQTLs, and uses real or synthetic genotypes, validated against OneK1K and CLUES cohorts.
- Compared to splatPop, scDesignPop better preserves eQTL effects and gene-gene dependencies, enabling power analysis, method benchmarking, and privacy protection through synthetic data.
Abstract
Single-cell RNA sequencing (scRNA-seq) combined with genotyping in large cohorts has enabled the discovery of genetic associations with molecular traits (e.g., eQTLs) at cell-type resolution. However, generating population-scale data remains cost-prohibitive, selecting appropriate analysis methods lacks consensus, and sharing eQTL results alongside scRNA-seq data raises privacy risks. To address these challenges, we introduce scDesignPop, a flexible statistical simulator for generating realistic population-scale scRNA-seq data with genetic effects. scDesignPop models cell- and individual-level covariates, putative cell-type-specific eQTLs (cts-eQTLs), and either real or synthetic genotypes. We validated scDesignPop using the OneK1K and CLUES cohorts across 4 qualitative and 16 quantitative metrics. Unlike splatPop, the only existing population-scale simulator, scDesignPop better preserves eQTL effects and gene-gene dependencies within cell types, closely recapitulating characteristics of the reference data. Leveraging its generative framework, scDesignPop enables power analysis in cell types under multiple eQTL model specifications to guide experimental design; facilitates benchmarking of single-cell eQTL mapping methods through user-defined ground truths; and mitigates re-identification risk using synthetic data while retaining cts-eQTL effects.
bioinformatics 2026-02-25 v1
ARCH3D: A foundation model for global genome architecture
Galioto, N.; Stansbury, C.; Gorodetsky, A. A.; Rajapakse, I.
AI Summary
- ARCH3D is introduced as a foundation model for global genome architecture, utilizing a novel masked locus modeling task to incorporate genome-wide contact profiles.
- The model's embeddings preserve genomic spatial structure, reconstruct interchromosomal interactions under sparse conditions, and identify multi-way interactions.
- ARCH3D aims to serve as a structural foundation for developing a virtual genome model to simulate genome behavior and dynamics.
Abstract
Biological foundation models are transforming scientific discovery by creating information-rich representations that enable inference in low-data settings. Progress on these models has mainly been achieved by increasing input contextual information, e.g., base pairs or genes. Most work, however, focuses on DNA, RNA, and protein, leaving genome architecture, a fundamental component regulating processes like the cell cycle and cell-fate determination, underexplored. Here, we introduce ARCH3D: a foundation model for global genome architecture. ARCH3D uses a novel masked locus modeling task that increases input contextual information to include genome-wide contact profiles of loci spread across the entirety of the genome. We demonstrate this strategy captures global genome structure by showing ARCH3D embeddings preserve genomic spatial structure, reconstruct interchromosomal interactions under extreme sparsity, and enable identification of multi-way interactions. Ultimately, ARCH3D provides a potential structural foundation for building the virtual genome, an artificial intelligence-based model capable of simulating genome behavior and dynamics.
bioinformatics 2026-02-25 v1
Improved multimodal protein language model-driven universal biomolecules-binding protein design with EiRA
Zeng, W.; Zou, H.; Li, X.; Dou, Y.; Wang, X.; Peng, S.
AI Summary
- The study introduces EiRA, a generative model for designing proteins that bind to various biomolecules, using a two-stage post-training process on a multimodal protein language model.
- EiRA showed state-of-the-art performance in structural confidence, diversity, novelty, and designability across 8 test sets for 6 biomolecule types, and improved downstream task predictions.
- Experimental validation confirmed a 100% success rate in expressing variants, and EiRA designed a Glucagon peptide binder with micromolar affinity.
Abstract
The interactions between proteins and biomolecules form a complex system that supports life activities. Designing proteins capable of targeted biomolecular binding is therefore critical for protein engineering and gene therapy. Here, we propose a new generative model, EiRA, specifically designed for universal biomolecular-binding protein design, which undergoes two-stage post-training, i.e., domain-adaptive masking training and binding site-informed preference optimization, based on a general multimodal protein language model. A systematic evaluation reveals the SOTA performance of EiRA, including structural confidence, diversity, novelty, and designability on 8 test sets across 6 biomolecule types. Meanwhile, EiRA provides a better characterization for biomolecular-binding proteins than a generic model, thereby improving the predictive performance of various downstream tasks. We also mitigate severe repetition generation in the original language model by optimizing training strategies and loss. Additionally, we introduced DNA information into EiRA to support DNA-conditioned binder design, further expanding the boundaries of the design paradigm. Experimental validation yielded a 100% success rate (20/20) in expressing highly divergent variants. Remarkably, EiRA achieved the one-shot design of a Glucagon peptide binder with SPR-confirmed micromolar affinity.
bioinformatics 2026-02-24 v3
Transcriptomic analysis reveals immune signatures associated with specific cutaneous manifestations of lupus in systemic lupus erythematosus
Lee, E. Y.; Patterson, S.; Cutts, Z.; Lanata, C. M.; Dall'Era, M.; Yazdany, J.; Criswell, L. A.; Haemel, A.; Katz, P.; Ye, C. J.; Langelier, C.; Sirota, M.
AI Summary
- This study used transcriptomics from a large cohort of SLE patients to identify molecular pathways associated with ten distinct cutaneous manifestations of SLE.
- Specific immune signatures were found, such as upregulation of type I interferon, TNF-α, and IL6-JAK-STAT3 pathways in subacute cutaneous lupus, suggesting potential therapeutic targets.
- Unexpected findings included the absence of interferon signaling in patients with skin and mucosal ulcers, and roles for CD14+ monocytes in photosensitivity and NK cells in alopecia, mucosal ulceration, and livedo reticularis.
Abstract
Systemic lupus erythematosus (SLE) presents with diverse and heterogeneous cutaneous manifestations. However, the molecular and immunologic pathways driving specific cutaneous manifestations of SLE are poorly understood. Here, we leverage transcriptomics from a large well-phenotyped longitudinal cohort of SLE patients to map molecular pathways linked to ten distinct SLE-related rashes. Through whole blood and immune cell-sorted bulk RNA sequencing, we identified immune signatures specific to cutaneous subtypes of SLE. Subacute cutaneous lupus (SCLE) exhibited broad upregulation of type I interferon, TNF-α, and IL6-JAK-STAT3 pathways, suggesting potential unique therapeutic responses to JAK and type I interferon inhibition. While interferon signaling is prominent in SCLE, discoid lupus, and acute lupus, it is unexpectedly absent in patients with skin and mucosal ulcers. Pathway and cell-type enrichment analysis revealed unexpected roles for CD14+ monocytes in photosensitivity of SLE and NK cells in alopecia, mucosal ulceration, and livedo reticularis. These findings illuminate the immune heterogeneity of rashes in SLE, highlighting subtype-specific mechanistic targets, and presenting opportunities to identify precision therapies for SLE-associated skin phenotypes.
bioinformatics 2026-02-24 v2
The phylodynamic threshold of measurably evolving populations
Weber, A.; Kende, J.; Duitama Gonzalez, C.; Oeversti, S.; Duchene, S.
AI Summary
- This study investigates the concepts of measurably evolving populations and the phylodynamic threshold, crucial for molecular clock calibration using sampling times.
- Through simulations and empirical data analysis, it was found that determining these thresholds depends on model assumptions, sampling strategies, and the sensitivity of priors in Bayesian analyses.
- The study emphasizes the importance of assessing prior sensitivity over tests of temporal signal to enhance molecular clock inferences and highlights sampling limitations.
Abstract
The molecular clock is a fundamental tool for understanding the time and pace of evolution, requiring calibration information alongside molecular data. Sampling times are often used for calibration since some organisms accumulate enough mutations over the course of their sampling period. This practice ties together two key concepts: measurably evolving populations and the phylodynamic threshold. Our current understanding suggests that populations meeting these criteria are suitable for molecular clock calibration via sampling times. However, the definitions and implications of these concepts remain unclear. Using Hepatitis B virus-like simulations and analyses of empirical data, this study shows that determining whether a population is measurably evolving or has reached the phylodynamic threshold does not only depend on the data, but also on model assumptions and sampling strategies. In Bayesian applications, a lack of temporal signal due to a narrow sampling window results in a prior that is overly informative relative to the data, such that a prior that is potentially misleading typically requires a wider sampling window than one that is reasonable. In our analyses we demonstrate that assessing prior sensitivity is more important than the outcome of tests of temporal signal. Our results offer guidelines to improve molecular clock inferences and highlight limitations in molecular sequence sampling procedures.
bioinformatics 2026-02-24 v2
Condensate-Driven Transcriptional Reprogramming Defines Core Vulnerabilities in Esophageal and Gastric Cancers
Alvarez-Carrion, L.; R. Tejedor, A.; Ardura, J. A.; Alonso, V.; Alonso-Moreno, C.; Collepardo-Guevara, R.; Gutierrez-Rojas, I.; Privat, C.; Moreno, V.; Calvo, E.; Gyorffy, B.; Espinosa, J. R.; Ocana, A.
AI Summary
- The study investigates how biomolecular condensates contribute to esophageal and gastric cancers using multi-omics profiling, functional genomics, and simulations.
- Findings show these cancers share a condensate-driven transcriptional program with upregulation of genes like TOPBP1 and CHERP, essential for tumor cell survival.
- Simulations confirmed that TOPBP1 and CHERP form condensates through phase separation, suggesting these proteins as potential therapeutic targets.
Abstract
Biomolecular condensates organize key nuclear functions by compartmentalizing biomolecules, yet their contribution to gastrointestinal tumorigenesis remains poorly defined. Integrating multi-omics profiling, functional genomics, and molecular dynamics simulations, we reveal that esophageal and gastric cancers share a condensate-enriched transcriptional program driven by intrinsically disordered proteins involved in transcription, RNA processing, and replication stress. Transcriptomic analyses identify a hyperactive transcriptional state with upregulation of condensate-associated genes, including TOPBP1 and CHERP. Dependency mapping demonstrates that these proteins are essential for tumor cell viability, defining a conserved condensate core across different tumor types. Machine-learned predictions and residue-resolution coarse-grained simulations confirm that TOPBP1 and CHERP undergo phase separation through homotypic interactions mediated by intrinsically disordered regions, with saturation concentrations below 2 µM, consistent with spontaneous condensate formation observed in vitro. Together, these findings establish condensate organization as a fundamental mesoscale principle in upper gastrointestinal cancers and nominate condensate scaffolds as tractable therapeutic vulnerabilities.
bioinformatics 2026-02-24 v1
A partition-based spatial entropy for co-occurrence analysis with broad application.
Otto, T.; Nemri, A.; Claessens, A.; Radulescu, O.
AI Summary
- The study introduces Regional Co-occurrence Entropy (RCE), a new spatial entropy measure to analyze how categorical co-occurrences relate to specific environments.
- RCE was applied to various fields: it identified interactions between immune cells in Alzheimer's Disease, analyzed building diversity in town neighborhoods, and examined bird species distribution in a natural reserve.
- Key findings include novel interactions in Alzheimer's, potential drivers of social mixing, and vegetation-driven changes in bird community composition.
Abstract
Despite the advent of spatial data science, including spatial biology, there exist few methods that study the distribution of points (e.g., cells or individuals) while accounting for both their own characteristics and environmental factors. We propose a new spatial entropy measure, termed the Regional Co-occurrence Entropy (RCE), that detects when categorical co-occurrences happen preferentially in specific environments. We demonstrate its use over a broad range of application fields. As examples, we study brain cell dynamics in Alzheimer's Disease, identifying both known and likely novel interactions between immune cells around beta-amyloid plaques. We also investigate the diversity of buildings across a town's neighborhoods, to detect potential drivers of social mixing at the local scale. Finally, we dissect bird species distribution across a natural reserve, identifying potential vegetation-driven changes in community composition. Altogether, the proposed RCE enables rapid insights into interactions with an environmental component, making it a useful addition to the spatial data science toolbox.
bioinformatics 2026-02-24 v1
A functional annotation based integration of different similarity measures for gene expressions
Misra, S.; Roy, S.; Ray, S. S.
AI Summary
- The study developed an integrated similarity score (ISS) by combining various gene expression similarity measures, weighted by biological information, to enhance gene similarity prediction.
- A fitness function (FFFAG) was used to optimize the weights in ISS by minimizing the difference between functional similarity and ISS.
- ISS outperformed individual measures in identifying similar gene pairs and predicted functional categories for 40 unclassified yeast genes with high significance (p-value < 10^(-10)).
Abstract
Genes with similar expression profiles often exhibit similar functional properties. An integrated similarity score (ISS) is developed by combining different expression similarity measures through weights, obtained using biological information, for improving gene similarity. The expression similarity measures are converted to the common framework of positive predictive value using functional annotation. A fitness function, called fitness function using functional annotation of genes (FFFAG), is also developed by minimizing the difference between the functional similarity value and the ISS. The FFFAG is used to determine the weight combination of different similarity measures in ISS. In addition, an existing similarity measure, called TMJ (integrated similarity measure by multiplying Triangle and Jaccard similarity), is also modified to incorporate biological knowledge involving functional annotation. The results demonstrate that ISS is superior to individual similarity measures for finding similar gene pairs. Further, the ISS predicts the functional categories of 40 unclassified yeast genes at a p-value cutoff of 10^(-10) from 12 clusters. The associated code is accessible at http://www.isical.ac.in/~shubhra/ISS.html.
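The weighting idea in this abstract can be illustrated with a minimal sketch. It assumes the ISS is a plain weighted sum of per-pair similarity measures and that the weights are chosen by ordinary least squares against a functional similarity value; the abstract does not specify the optimizer behind FFFAG, so least squares is a stand-in here. The function names and the toy data are hypothetical and are not taken from the authors' code, and the conversion of measures to positive predictive values is omitted.

```python
import numpy as np

def fit_iss_weights(measures, functional_sim):
    """Fit one weight per similarity measure by least squares.

    measures: array of shape (n_pairs, n_measures), one column per
        expression similarity measure (assumed already on a common scale).
    functional_sim: array of shape (n_pairs,), the annotation-based
        similarity that the combined score should approximate.
    Least squares is an assumption standing in for the FFFAG optimization.
    """
    weights, *_ = np.linalg.lstsq(measures, functional_sim, rcond=None)
    return weights

def integrated_similarity(measures, weights):
    """Combine the measures into a single score as a weighted sum."""
    return measures @ weights

# Toy data (synthetic, for illustration only): 200 gene pairs, 3 measures.
rng = np.random.default_rng(0)
measures = rng.random((200, 3))
true_w = np.array([0.5, 0.3, 0.2])
functional_sim = measures @ true_w  # noiseless target built from known weights

weights = fit_iss_weights(measures, functional_sim)
iss = integrated_similarity(measures, weights)
```

On this noiseless toy target the fitted weights recover the weights used to build it; with real annotation data the fit would only be approximate, and constraints such as non-negative weights might be appropriate.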
bioinformatics 2026-02-24 v1
Graph-based RNA structural representation reveals determinants of subcellular localization
Hao, Y.; Sun, H.; Ran, Z.; Guo, X.; Liu, M.; Bi, Y.; Polo, J.; Liu, N.; Li, F.
AI Summary
- The study introduces GRASP, a graph neural network framework for predicting RNA subcellular localization, using a graph representation that includes nucleotide and substructure nodes.
- GRASP models both base-level interactions and structural context, incorporating multi-label dependency learning for co-localization patterns.
- It outperforms existing methods in accuracy, F1 score, and AUC across various RNA types, offering insights into structural determinants of RNA localization.
Abstract
RNA subcellular localization is a key determinant of RNA function and regulation, yet existing computational approaches rely primarily on sequence or simplified structural descriptors, limiting their scalability to long transcripts, their ability to model inter-label dependencies, and their applicability across RNA types. Here, we present GRASP, a unified graph neural network framework for predicting RNA subcellular localization using a heterogeneous graph representation that is RNA substructure-aware. GRASP represents each RNA as a multi-scale graph comprising nucleotide nodes and secondary-structure-derived substructure nodes, connected by relational edges, enabling joint modeling of base-level interactions and regional structural context. The model further incorporates multi-label dependency learning to capture co-localization patterns across cellular compartments within a unified framework. Across multiple benchmark datasets and RNA types, GRASP consistently outperforms state-of-the-art sequence-based and structure-informed methods, achieving substantial improvements in accuracy, F1 score, and AUC while maintaining strong scalability to long transcripts. In addition, the graph-based representation provides biologically interpretable insights into structural determinants of RNA localization. The source code and data are available at https://github.com/ABILiLab/GRASP, and the web server is accessible at http://grasp.biotools.bio.
bioinformatics 2026-02-24 v1
CAPHEINE, or everything and the kitchen sink: a workflow for automating selection analyses using HyPhy
Verdonk, H. E.; Callan, D.; Kosakovsky Pond, S. L.
AI Summary
- CAPHEINE is a workflow designed to automate evolutionary analysis from unaligned pathogen sequences, using a reference genome.
- It facilitates studies on site-level selection dynamics, gene-level positive selection, and lineage-specific selective pressure changes.
- The workflow is compatible with Mac OS, Windows, and Linux, enhancing accessibility for researchers.
Abstract
Here we present CAPHEINE, a computational workflow that starts with a set of unaligned pathogen sequences and a reference genome and performs a comprehensive exploratory evolutionary analysis of the input data. CAPHEINE pairs nicely with studies of site-level selection dynamics, gene-level positive selection, and lineage-specific shifts in selective pressure. Our workflow is portable across Mac OS, Windows, and Linux, allowing researchers to focus on results.
bioinformatics 2026-02-24 v1