Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
Quartet-based species tree methods enable fast and consistent tree of blobs reconstruction under network multispecies coalescent
Dai, J.; Han, Y.; Molloy, E.
AI Summary
- The study addresses the challenge of reconstructing the tree of blobs (TOB) under the network multispecies coalescent model by introducing a framework that uses quartet-based methods for faster and consistent TOB estimation.
- The framework first estimates a refinement of the TOB with a quartet-based method (motivated by Weighted Quartet Consensus) and then contracts edges based on hypothesis tests, resulting in the method TOB-QMC with a time complexity of O(n^3k).
- TOB-QMC was found to be at least as accurate as TINNiK, scalable to larger datasets, and useful for exploring hyperparameters, with practical implications for interpreting species trees in the context of gene flow.
Abstract
Gene flow between species or populations is an important force in evolution, modeled by the network multispecies coalescent (NMSC). Reconstructing evolutionary histories, called species networks, under this model is notoriously challenging, with the leading methods scaling to just tens of species. Divide-and-conquer is a promising path forward; however, methods with statistical consistency guarantees require the tree of blobs (TOB), which displays only the tree-like parts of the network, to perform subset decomposition. TOB reconstruction under the NMSC is challenging in its own right, with the only available method, TINNiK, having time complexity O(n^5 + n^4k), where k is the number of input gene trees and n is the number of species. Here, we present a framework for TOB reconstruction that operates by (1) seeking a refinement of the TOB and then (2) contracting edges in it. For step (1), we show that an optimal solution to Weighted Quartet Consensus is a TOB refinement almost surely, as the number of gene trees increases, motivating the use of fast quartet-based methods for species tree estimation such as ASTRAL or TREE-QMC. For step (2), we contract edges in the refinement tree based on the same hypothesis tests as TINNiK, which are applicable to subsets of four taxa. We show that sampling just O(n) four-taxon subsets around each edge enables statistically consistent TOB estimation, with asymptotic runtime dominated by tree reconstruction. Leveraging TREE-QMC for this step gives our method a time complexity of O(n^3k) and its name TOB-QMC. On simulated data sets, TOB-QMC is at least as accurate as, and often more accurate than, TINNiK. Moreover, TOB-QMC scales to larger data sets and enables fast and interpretable exploration of hyperparameters used in hypothesis testing. We demonstrate the importance of this feature on phylogenomic data sets. Lastly, our framework is related to ad hoc analyses performed by biologists, as network methods do not scale. Our theoretical results provide justification for such approaches as well as context for interpreting species trees estimated with quartet-based methods in the presence of gene flow; this is critical given the recent result that tree-based network inference with ASTRAL can be positively misleading.
bioinformatics 2026-02-26 v3
A Framework for Autonomous AI-Driven Drug Discovery
Selinger, D. W.; Wall, T. R.; Stylianou, E.; Khalil, E. M.; Gaetz, J.; Levy, O.
AI Summary
- The study introduces a framework for autonomous AI-driven drug discovery that integrates knowledge graphs with large language models to manage vast biomedical data.
- The framework uses a focal graph to distill data into hypotheses, enhancing drug discovery processes like target prediction.
- Small-scale applications of this scalable approach demonstrated novel insights across multiple drug discovery stages, including autonomous execution of a multi-step target discovery workflow.
Abstract
The exponential increase in biomedical data offers unprecedented opportunities for drug discovery, yet overwhelms traditional data analysis methods, limiting the pace of new drug development. Here we introduce a framework for autonomous artificial intelligence (AI)-driven drug discovery that integrates knowledge graphs with large language models (LLMs). It is capable of planning and carrying out automated drug discovery programs at a massive scale while providing details of its research strategy, progress, and all supporting data. At the heart of this framework lies the focal graph - a novel construct that harnesses centrality algorithms to distill vast, noisy datasets into concise, transparent, data-driven hypotheses. We demonstrate that even small-scale applications of this highly scalable approach can yield novel, transparent insights relevant to multiple stages of the drug discovery process, including chemical structure-based target prediction, and present the implementation of a system which autonomously plans and executes a multi-step target discovery workflow.
bioinformatics 2026-02-26 v3
MANTIS: Analytics toolkit for spatial metabolomics with matching spatial transcriptomics data
Hao, Y.; Kim, Y.; Aggarwal, B.; Sinha, S.
AI Summary
- MANTIS is a statistical framework designed to analyze co-registered spatial metabolomics (SM) and spatial transcriptomics (ST) data at single cell or spot resolution, incorporating spatial domain or cell type information.
- It uses an autocorrelation-preserving permutation strategy for statistical significance, employing spatial cross-correlation and partial correlation to explore gene-metabolite relationships.
- Across various datasets, MANTIS offers more specific and interpretable results by modeling confounding structures, surpassing existing methods in statistical rigor.
Abstract
Motivation: Joint Spatial Metabolomics (SM) and Spatial Transcriptomics (ST) profiling is a powerful approach to fine-mapping of metabolic states associated with tissue function. Current computational tools for analysis of "SM+ST" data focus primarily on alignment and integration of the two modalities, with limited support for probing biological relationships between the two molecular layers. Results: We present MANTIS, a statistical framework for analyzing co-registered SM+ST profiles at single cell or spot resolution, along with spatial domain or cell type information, to discover metabolite spatial patterns and gene-metabolite relationships. It employs an autocorrelation-preserving permutation strategy to assess statistical significance, yielding calibrated inference under spatial dependence. It disentangles different sources of spatial patterns and correlations, viz., those arising from regional preferences, cell type associations, or other unknown factors. It introduces the use of spatial cross-correlation and spatial partial correlation statistics for quantifying gene-metabolite associations. Across data sets spanning different spatial technologies, tissues and species, MANTIS provides more specific and interpretable discoveries than existing methods through rigorous statistical testing and explicitly modeling confounding structure. To our knowledge, MANTIS is the first toolkit to unify spatial metabolomics, spatial transcriptomics, cell type information and spatial domains within a single framework that emphasizes spatial statistics, hypothesis testing and confounder correction. Availability and Implementation: Freely available on the web at https://github.com/yuhaotuo/MANTIS.
bioinformatics 2026-02-26 v2
A pocket-centric framework for selective targeting of amyloid fibril polymorphs
Ossard, G.; Ciambur, C. B.; Melki, R.; Sperandio, O.; Romero, E.
AI Summary
- The study analyzed 97 cryo-EM structures of amyloid-beta, tau, and alpha-synuclein fibrillar polymorphs to understand binding pocket distribution.
- Findings indicate that most pockets are shared across different amyloid proteins, explaining the lack of ligand selectivity.
- A small subset of unique pockets was identified, suggesting potential for selective ligand design under specific structural conditions.
Abstract
The rapid expansion of high-resolution cryo-EM structures of amyloid fibrils has not yet translated into the rational design of selective or specific ligands of protein aggregates involved in Alzheimer's and Parkinson's diseases. This persistent limitation suggests that the obstacle lies in a certain degree of commonality in the organization of the fibrillar polymorph surfaces available for small-molecule binding. Here, we present a systematic and global analysis of binding pockets across 97 cryo-EM structures of amyloid-beta, tau, and alpha-synuclein protein fibrillar polymorphs. Using a unified pocket similarity index and minimum spanning tree representations, we construct global and protein-specific pocketomes that reveal how surface cavities are distributed across different amyloid-forming proteins and the fibrillar polymorphs they form. We show that most detectable pockets are shared across multiple fibrillar folds and, in many cases, across different amyloid-forming proteins, providing a structural explanation for the widespread lack of ligand selectivity. Conversely, a limited subset of pockets forms isolated clusters associated with specific proteins or polymorphs, delineating the rare structural conditions under which selective or specific ligand design is feasible. Together, these results reframe amyloid targeting as a problem of constrained pocket diversity within the amyloid polymorphic landscape, and provide a conceptual framework to guide both the design of future ligands and the strategic avoidance of intrinsically non-discriminatory binding sites.
bioinformatics 2026-02-26 v1
CellPace: A temporal diffusion-forcing framework for simulation, interpolation and forecasting of single-cell dynamics
Su, C.; Emad, A.
AI Summary
- CellPace is a generative model using a transformer-based temporal diffusion approach to simulate, interpolate, and forecast single-cell dynamics, addressing the challenge of irregularly sampled or missing developmental stages.
- It excels in modeling continuous developmental dynamics across various mouse lineages, preserving fine biological details and accurately mapping to spatial transcriptomics.
- CellPace also handles multi-modal data, integrating RNA and chromatin dynamics, even when temporal ordering is inferred from pseudotime.
Abstract
Single-cell omics technologies resolve cellular heterogeneity at high resolution but provide only static snapshots of continuous developmental processes. This makes it difficult to recover coherent temporal dynamics when developmental stages are irregularly sampled or missing. While recent generative models can simulate observed cell states, they often treat timepoints as discrete categories, hindering interpolation across gaps and extrapolation to unobserved future stages. We present CellPace, a generative model that learns and generates developmental dynamics by leveraging a transformer-based temporal diffusion backbone conditioned on continuous, gap-aware temporal encodings. Across diverse mouse developmental lineages, CellPace achieves state-of-the-art performance in simulation, interpolation, and forecasting tasks. Beyond recovering global population statistics, generated cells preserve fine-grained biological structure, retaining dynamic gene regulatory programs and mapping accurately to anatomical regions in spatial transcriptomics data. Furthermore, CellPace extends naturally to multi-modal data, modeling joint RNA-chromatin dynamics even when temporal ordering is inferred from pseudotime. Together, these results position CellPace as a robust framework for modeling and generating continuous developmental dynamics from sparse, cross-sectional single-cell data.
bioinformatics 2026-02-26 v1
SpaMOAL: A spatial multi-omics graph contrastive learning method for spatial domains identification
Wang, J.; Huo, Y.; Zhao, R.; Pan, Y.; Wang, H.; Li, X.
AI Summary
- SpaMOAL is a graph-based contrastive learning method designed to integrate spatial coordinates, histological images, and molecular profiles for identifying spatial domains in tissues.
- The method was benchmarked on multiple spatial multi-omics datasets, where it outperformed existing methods in accurately delineating spatial tissue domains.
Abstract
Recent advances in spatial multi-omics technologies have opened new avenues for characterizing tissue architecture and function in situ, by simultaneously providing multimodal and complementary information such as spatially resolved transcriptomic, epigenomic, and proteomic features. Current computational approaches face substantial challenges, such as effectively integrating multi-omics molecular information with spatial information and the corresponding high-resolution histology images. To address this challenge, we propose SpaMOAL (Spatially Multi-Omics graph contrAstive Learning), a graph-based contrastive learning approach for spatial domain identification. SpaMOAL learns clustering-friendly representations from spatial multi-omics data by integrating spatial coordinates, histological image features and molecular profiles, enabling accurate delineation of spatial tissue domains. Benchmarking across multiple recent paired spatial multi-omics datasets demonstrated that SpaMOAL consistently outperforms existing methods. By enabling accurate spatial domain delineation, SpaMOAL provides a powerful framework for interpreting tissue organization and cellular microenvironments.
bioinformatics 2026-02-26 v1
Transcriptome-based lead generation, ligand- and structure-based prioritization and experimental validation of TLR5-activating molecules
Jain, A.; Hungharla, H.; Subbarao, N.; Tandon, V.; Ahmad, S.
AI Summary
- This study used a transcriptome-based approach with the connectivity map (CMAP) library to generate leads for TLR5 activation, integrating cellular context early in drug discovery.
- Leads were prioritized using ligand- and structure-based methods, and the top nine were experimentally validated with ELISA, showing dose-dependent TLR5 activation.
- The framework suggests potential complex interactions with the TLR signaling pathway and is scalable for other drug discovery applications.
Abstract
Current in silico drug discovery protocols ubiquitously depend on lead generation using a ligand-based approach in which novel leads are generated by fragment-signature matching or by a structure-based search involving molecular docking and conformational dynamics. None of them incorporates the cellular contexts in which these drugs ultimately operate, leaving the task to a later stage of optimization, which leads to a high failure rate. Incorporating systems-level responses of drugs in an early stage of lead generation can significantly address this concern but has not been sufficiently explored. In this work, we employ a systems-level approach using the connectivity map (CMAP) library to generate leads against a challenging system, a TLR pathway. Starting with gene expression data of TLR5 activation by its natural ligand, we generated molecular leads using CMAP and rigorously analyzed their validity using ligand- and structure-based approaches, helping to prioritize top hits. Experimental validation using an ELISA-based antibody assay confirmed the activation of TLR5 by each of the top nine prioritized leads, with their dose-dependent patterns suggesting that some of them may actually interact with the TLR signaling pathway in a complex manner. Although demonstrated on TLR5, the proposed framework is intuitively scalable to other lead generation and optimization tasks.
bioinformatics 2026-02-26 v1
Optimal transport fate mapping resolves T cell differentiation dynamics across tissues
Plotkin, A. L.; Mullins, G. N.; Green, W. D.; Shi, H.; Chung, H. K.; Yi, H.; Stanley, N.; Milner, J. J.
AI Summary
- The study introduces an optimal transport-based fate mapping framework to reconstruct continuous CD8 T cell differentiation trajectories across time and tissues using single-cell RNA-seq data from mice with acute viral infection.
- This approach accurately depicts population dynamics and identifies distinct migration waves into the small intestine, leading to different tissue-resident memory (Trm) fates.
- The analysis revealed CD52 as a marker for recent tissue entrants and AP4 as a regulator distinguishing circulating from tissue-resident T cells.
Abstract
Immune responses evolve across time and tissues through coordinated programs of proliferation, differentiation, and migration, yet most single-cell measurements capture only static molecular snapshots. As a result, reconstructing how immune cells transition between alternative fates remains challenging, particularly for CD8 T cells, whose differentiation is highly dynamic and shaped by rapid expansion, contraction, and tissue trafficking. Here, we introduce an optimal transport-based fate mapping framework that reconstructs continuous CD8 T cell trajectories across time and tissues. Applied to longitudinal single-cell RNA-seq data from CD8 T cells responding to acute viral infection in mice, this approach accurately recapitulates population dynamics and resolves coherent effector and memory T cell differentiation trajectories. Extending the model to multiple tissues, we identify and experimentally validate temporally distinct waves of migration into the small intestine that give rise to divergent tissue-resident memory (Trm) fates, long-lived T cells crucial in immunosurveillance. By integrating optimal transport inference with time-resolved in vivo labeling, we demonstrate that CD52 marks recent tissue entrants and distinguishes them from Trm precursors. Finally, trajectory-guided analysis of transcription factor regulons reveals both shared and context-specific gene regulatory programs and identifies AP4 as a key regulator of circulating versus tissue-resident specification. Together, these results establish optimal transport as a principled framework for reconstructing immune cell fate dynamics and provide a quantitative map of early events governing antiviral CD8 T cell differentiation across tissues.
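The core idea of optimal transport between two time points can be sketched with a plain entropy-regularized Sinkhorn iteration. This is a minimal illustration, not the authors' model: the abstract does not specify the solver, cost function, or any growth or tissue terms, so the 1-D "expression" coordinates, the squared-distance cost, `epsilon`, and the function name `sinkhorn` are all assumptions made for the example.

```python
import math

def sinkhorn(cost, a, b, epsilon=1.0, n_iter=200):
    """Entropy-regularized OT plan between histograms a (rows) and b (columns)."""
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    n, m = len(a), len(b)
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iter):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

# Hypothetical 1-D cell states at an early and a later time point.
early = [0.0, 1.0]
late = [0.1, 0.9, 1.1]
cost = [[(x - y) ** 2 for y in late] for x in early]
a = [1.0 / len(early)] * len(early)  # uniform mass on early cells
b = [1.0 / len(late)] * len(late)    # uniform mass on late cells

plan = sinkhorn(cost, a, b)
# plan[i][j] is the mass moved from early cell i to late cell j;
# normalizing row i gives a fate distribution for early cell i.
fates = [[p / sum(row) for p in row] for row in plan]
```

Each row of `fates` can be read as the probability that an early cell gives rise to each later cell. A smaller `epsilon` gives a sharper coupling at the cost of slower convergence.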
bioinformatics 2026-02-26 v1
keju: powerful and accurate inference in Massively Parallel Reporter Assays
Xue, A.; Zahm, A. M.; English, J.; Sankararaman, S.; Pimentel, H.
AI Summary
- The study addresses the challenge of uncertainty in Massively Parallel Reporter Assays (MPRAs) by introducing keju, a hierarchical statistical model.
- keju accounts for differences in uncertainty between DNA and RNA counts and between batches, improving inference accuracy.
- In simulations, keju reached a sensitivity of 59%, compared to 31% for MPRAnalyze and 9% for BCalm; in real data, it called only 6.8% of unlabeled negative controls significant, compared to 34% for MPRAnalyze and 12% for BCalm.
Abstract
Massively Parallel Reporter Assays (MPRAs) interrogate the regulatory function of thousands of designed genetic elements in parallel through linked DNA and RNA readouts using an engineered construct and attached minimal reporter. Given the complexity of MPRA experimental designs, several different sources of uncertainty complicate inference. We show that previous methods do not account for substantial differences in uncertainty levels between the DNA and RNA counts and between batches. Accordingly, we present keju, a hierarchical statistical model that estimates candidate transcription rate, differential activity between conditions, and effects from promoter composition for MPRA data. To maximize statistical power and improve false positive rate control, keju conditions on the DNA counts to model batch-specific and modality-specific uncertainty in the RNA counts. keju shows vastly improved sensitivity (59%) in simulations compared to previous methods (31% for MPRAnalyze and 9% for BCalm), and also has lower, more robust false positive rates, calling only 6.8% of unlabeled negative controls significant in real data (compared to 34% for MPRAnalyze and 12% for BCalm).
bioinformatics 2026-02-26 v1
Exploring differences across pangenome-graph representations using Escherichia coli O157:H7 as a model
Liu, P.; Hu, K.; Mughini-Gras, L.; Zomer, A. L.; Brouwer, M. S. M.; Dallman, T. J.; Paganini, J. A.
AI Summary
- The study benchmarked six methods for constructing pangenome graphs of Escherichia coli O157:H7, comparing gene-cluster, ccDBG, multiple sequence alignment, and hybrid approaches.
- Results showed significant differences in graph size, fragmentation, and computational cost, with assembly completeness being a major factor influencing graph structure.
- The analysis highlighted that pangenome graph representation affects bacterial diversity modeling, with varying accuracy at specific loci like Shiga toxin genes.
Abstract
Pangenome graphs are increasingly used to represent population-scale bacterial diversity, yet construction methods span fundamentally different representation paradigms whose outputs and sensitivities to assembly quality remain poorly quantified. We systematically reviewed microbial pangenome graph tools and benchmarked six representative methods spanning gene-cluster, compacted coloured de Bruijn graph (ccDBG), multiple sequence alignment, and hybrid approaches. Using a repeat-rich Escherichia coli O157:H7 dataset with complete genomes and matched short-read data, we constructed graphs from identical inputs and observed orders-of-magnitude differences in graph size and fragmentation, indicating that global topology is driven by representation strategy. Varying completeness composition revealed that assembly fragmentation is a first-order determinant of graph structure: gene-cluster graphs contracted as draft assemblies replaced complete genomes, whereas unitig graphs expanded, with distinct degree-prevalence fingerprints across tools. Computational cost mirrored these shifts and depended strongly on completeness composition, including a pronounced runtime penalty for one ccDBG implementation on all-draft inputs. Finally, analysis of Shiga toxin loci showed that pangenome-level reconciliation does not reliably correct assembly artefacts at challenging multi-copy genes and that performance varies by locus. Together, these findings show that pangenome graphs are representation-dependent models of bacterial diversity, and that assembly completeness is a primary determinant of their topology, scalability, and locus-level accuracy.
bioinformatics 2026-02-26 v1
Molecular Thermodynamics of KRAS Activation
Ciftci, F. S.; Erman, B.
AI Summary
- The study investigates the structural basis of KRAS activation by comparing the residue-contact networks of its GTP-bound active (6GOD) and GDP-bound inactive (4OBE) states using a statistical-mechanical framework.
- Key findings include the active state having higher mean contact energy and conformational entropy, with a thermodynamic balance at kT ≈ 2.41.
- Switch I (residues 25-40) was identified as the primary allosteric locus due to significant changes in contact essentiality between states.
Abstract
The GTPase KRAS executes a conformational switch between GTP-bound active and GDP-bound inactive states, which are central to oncogenic signalling, yet the structural basis of this switching at the level of residue-contact network organization remains incompletely characterised by conventional pairwise analyses. Here we apply a rigorous statistical-mechanical framework, grounded in the weighted Kirchhoff Laplacian and the Matrix-Tree Theorem, to construct spanning-tree partition functions for residue contact graphs derived from two crystallographic structures: 6GOD (active, GTP-analog-bound; 172 residues, 830 contacts) and 4OBE (inactive, GDP-bound; 169 residues, 809 contacts). The log-partition function log Z, the network free energy F = -kT log Z, the mean contact energy <E>, the heat capacity Cv, and the thermodynamic entropy S are computed across an effective temperature sweep from kT = 0.3 to 6.0. Edge marginal inclusion probabilities, P, obtained via effective-resistance theory, serve as topology-aware measures of contact essentiality. Differential analysis reveals that the active state consistently carries a higher mean contact energy (Δ<E> > 0) yet also a higher conformational entropy (ΔS > 0), with the free energy crossover ΔF = 0 occurring at kT ≈ 2.41, an intrinsic thermodynamic balance independent of any arbitrary additive reference. Switch I (residues 25-40) exhibits the largest state-dependent ΔP, identifying it as the primary allosteric locus of nucleotide-driven network reorganisation.
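The Matrix-Tree Theorem step can be illustrated on a toy graph. The sketch below assumes Boltzmann edge weights w = exp(-E/kT), which the abstract does not state explicitly, and uses a three-node triangle rather than a real contact map. Instead of the effective-resistance route named in the abstract, it computes edge marginals with the equivalent deletion identity P_e = 1 - Z(G without e)/Z(G). All function names are hypothetical.

```python
import math

def reduced_laplacian(n, weighted_edges):
    """Weighted Kirchhoff Laplacian with the last row and column removed."""
    lap = [[0.0] * n for _ in range(n)]
    for i, j, w in weighted_edges:
        lap[i][i] += w
        lap[j][j] += w
        lap[i][j] -= w
        lap[j][i] -= w
    return [row[:-1] for row in lap[:-1]]

def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    det = 1.0
    for col in range(len(m)):
        pivot = max(range(col, len(m)), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:  # treated as singular (toy tolerance)
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, len(m)):
            factor = m[r][col] / m[col][col]
            for c in range(col, len(m)):
                m[r][c] -= factor * m[col][c]
    return det

def spanning_tree_partition(n, energies, kT):
    """Z = sum over spanning trees of the product of edge weights (Matrix-Tree)."""
    edges = [(i, j, math.exp(-e / kT)) for i, j, e in energies]  # assumed weighting
    return determinant(reduced_laplacian(n, edges))

def edge_marginals(n, energies, kT):
    """P_e = 1 - Z(G without e) / Z(G): probability that edge e is in a random tree."""
    z = spanning_tree_partition(n, energies, kT)
    return [1.0 - spanning_tree_partition(n, energies[:k] + energies[k + 1:], kT) / z
            for k in range(len(energies))]

# Toy contact graph: a triangle with zero contact energies (unit weights).
triangle = [(0, 1, 0.0), (1, 2, 0.0), (0, 2, 0.0)]
kT = 1.0
Z = spanning_tree_partition(3, triangle, kT)
F = -kT * math.log(Z)  # network free energy, F = -kT log Z
P = edge_marginals(3, triangle, kT)
```

For the triangle, Z counts its three spanning trees, and the marginals sum to n - 1 = 2, as they must for any connected graph. On a real contact graph, a log-determinant would be the numerically safer choice than the raw determinant used here.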
bioinformatics 2026-02-26 v1
Identification of different sequence properties between HIV-1 DNA and RNA across subtypes using the k-mer-based approach
Chen, H.-C.; Wisniewski, J.; Serwin, K.; Parczewski, M.; Kula-Pacurar, A.; Skums, P.; Kirpich, A.; Yakovlev, S.
AI Summary
- This study used an updated k-mer-based approach, PORT-EK-v2, to compare DNA and RNA sequence properties across HIV-1 subtypes.
- Findings indicated distinct sequence properties between DNA and RNA, with "isolate k-mer count" useful for classification.
- Markov chain Monte Carlo modeling showed discontinuous k-mer frequency patterns, suggesting significant subtype-specific differences.
Abstract
Advanced analytical tools that mine features hidden in intricate datasets and strengthen the biological interpretation of multigenomic outputs are of paramount importance. In this study, we present an updated version of a k-mer-based approach, PORT-EK-v2, allowing for a comparison of multiple genomic datasets and identification of over-represented genomic regions (k-mers) related to specific organisms. Using PORT-EK-v2, we show that DNA and RNA sequence properties are most likely distinct across HIV-1 subtypes. Furthermore, we show that "isolate k-mer count" could serve as a default choice in classifying the DNA versus RNA sequence property. Lastly, results based on Markov chain Monte Carlo modeling unveiled a discontinuous nature of the sequence property in terms of k-mer frequencies across HIV-1 subtypes. Altogether, we propose that the sequence property (DNA versus RNA) is distinct across HIV-1 subtypes and has a consequential impact on identifying new and emerging subtypes in the future.
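To make the idea of over-represented k-mers concrete, here is a minimal sketch that compares how often each k-mer occurs in two groups of sequences. It is not PORT-EK-v2 and does not reproduce its statistics. The function names, the prevalence-difference criterion, the `min_diff` threshold, and the toy sequences are all assumptions made for illustration.

```python
from collections import Counter

def kmer_counts(sequence, k):
    """Count every overlapping k-mer in one sequence."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kmer_prevalence(sequences, k):
    """Fraction of sequences that contain each k-mer at least once."""
    present = Counter()
    for seq in sequences:
        present.update(set(kmer_counts(seq, k)))
    return {kmer: n / len(sequences) for kmer, n in present.items()}

def enriched_kmers(group_a, group_b, k, min_diff):
    """k-mers whose prevalence differs between groups by at least min_diff."""
    prev_a = kmer_prevalence(group_a, k)
    prev_b = kmer_prevalence(group_b, k)
    diffs = {kmer: prev_a.get(kmer, 0.0) - prev_b.get(kmer, 0.0)
             for kmer in set(prev_a) | set(prev_b)}
    return {kmer: d for kmer, d in diffs.items() if abs(d) >= min_diff}

# Toy stand-ins for two sequence sets (for example, DNA-derived vs RNA-derived).
group_a = ["ACGTAC", "ACGTTT"]
group_b = ["TTTTAC", "GGGTAC"]
hits = enriched_kmers(group_a, group_b, k=3, min_diff=1.0)
```

Positive values mark k-mers enriched in `group_a` and negative values those enriched in `group_b`. A real analysis would add a statistical test and handle ambiguous bases.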
bioinformatics 2026-02-26 v1
Protein Compositional Ratio Representation (PCRR) Systematically Improves Human Disease Prediction
Madduri, A. V.; Ellis, R. J.; Patel, C. J.
AI Summary
- The study proposes using pairwise protein ratios (log(A)-log(B)) in machine learning to better capture the compositional nature of proteomic data for disease prediction.
- Applied to the ROSMAP cohort, this approach improved Alzheimer's subtype classification with an average AUROC increase of +0.1274.
- In the UK Biobank dataset, the ratio-based model outperformed raw protein level models in 95.1% of diseases, with significant improvements in 56.7%.
Abstract
Plasma proteomics captures a functional snapshot of human physiology; yet, most machine learning models treat protein abundances as independent variables, ignoring the fact that biological systems and proteomic measurements are inherently compositional. Many molecular processes depend not on absolute concentrations but on relative balances: receptor-ligand stoichiometry, enzyme-substrate ratios, and homeostatic feedbacks that govern signaling and metabolism. We propose that these relationships are best captured through pairwise protein ratios, which more faithfully reflect underlying biochemical constraints than raw expression values. We evaluate a machine learning framework that models pairwise log-ratios of proteins (log(A)-log(B)) as features, thereby encoding compositional structure directly into the learning space. Applied to the ROSMAP plasma proteomics cohort (n = 871), this approach substantially improved the classification of Alzheimer's subtypes (NCI, MCI, AD, AD+) with an average AUROC gain of +0.1274 over a strong baseline that incorporated raw proteomics and demographics. The top-ranked ratios (e.g., SEMA3C:TMEM70, IDUA:NPTXR) captured converging pathogenic pillars of Alzheimer's disease, including microglial activation, proteostasis dysregulation, and lipid-clearance imbalance, highlighting that ratio-based features recover biologically coherent axes of disease. To assess generality, we scaled the method to the UK Biobank proteomic dataset (n > 53,000; 587 phenotypes). The ratio-based model outperformed raw-level models in 95.1% of diseases, with statistically significant (FDR < 0.05) gains in 56.7%. Together, these results suggest that proteomic data should be viewed and modeled as compositional systems, where relative protein abundances carry the accurate functional signal. This insight supports the broader utility of ratio-based representations for disease prediction and biomarker discovery.
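The log(A)-log(B) feature construction described above can be sketched in a few lines. This is only an illustration of the representation, not the authors' pipeline. The protein names, abundance values, and the function name `pairwise_log_ratios` are hypothetical, and abundances are assumed to be strictly positive.

```python
import math
from itertools import combinations

def pairwise_log_ratios(abundances):
    """Return {"A:B": log(A) - log(B)} for every unordered protein pair."""
    names = sorted(abundances)
    logs = {name: math.log(abundances[name]) for name in names}  # needs values > 0
    return {f"{a}:{b}": logs[a] - logs[b] for a, b in combinations(names, 2)}

# Hypothetical abundances for one sample.
sample = {"P1": 8.0, "P2": 2.0, "P3": 4.0}
features = pairwise_log_ratios(sample)

# Multiplying every abundance by the same factor leaves the features unchanged,
# which is the compositional property the ratio representation relies on.
rescaled = pairwise_log_ratios({name: 10.0 * value for name, value in sample.items()})
```

With p proteins this produces p(p-1)/2 features, so a large panel would need feature selection or regularization downstream.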
bioinformatics 2026-02-25 v4
Generating Structurally Diverse Therapeutic Peptides with GFlowNet
Wijaya, E.
AI Summary
- The study addresses mode collapse in reinforcement learning for therapeutic peptide generation by introducing GFlowNet, which samples sequences proportionally to reward.
- GFlowNet was compared with GRPO, showing GFlowNet achieves greater sequence diversity without explicit diversity penalties.
- When diversity penalties were removed, GRPO failed while GFlowNet maintained diversity, highlighting GFlowNet's robustness in drug discovery applications.
Abstract
Reinforcement learning approaches for therapeutic peptide generation suffer from mode collapse, converging to narrow regions of sequence space even when explicit diversity penalties are applied. Fine-grained analysis reveals persistent mode-seeking behavior invisible to standard diversity metrics. We propose GFlowNet for peptide generation, which samples sequences proportionally to reward rather than maximizing expected reward. This provides intrinsic diversity without diversity penalties. Comparing against GRPO with explicit diversity enforcement, GFlowNet achieves substantially more uniform sequence sampling and fewer repetitive motifs. Critically, when diversity mechanisms are removed from the reward, GRPO collapses completely while GFlowNet maintains natural diversity. These results demonstrate that proportional sampling is inherently robust to reward function design, offering a key advantage for drug discovery pipelines requiring diverse candidates.
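The contrast between sampling proportionally to reward and maximizing expected reward can be shown on a tiny discrete example. The sketch below is not a GFlowNet and involves no training. It only compares the two target distributions a converged sampler would aim for. The peptide strings and reward values are made up for illustration.

```python
import math

def reward_proportional(rewards):
    """Target of proportional sampling: p(x) = R(x) / sum of all rewards."""
    total = sum(rewards.values())
    return {seq: r / total for seq, r in rewards.items()}

def reward_maximizing(rewards):
    """Limit of pure reward maximization: all mass on the single best sequence."""
    best = max(rewards, key=rewards.get)
    return {seq: (1.0 if seq == best else 0.0) for seq in rewards}

def entropy_bits(dist):
    """Shannon entropy in bits, used here as a simple diversity measure."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Hypothetical peptides with hypothetical rewards.
rewards = {"KLAKLAK": 4.0, "GIGAVLK": 3.0, "RRWWRF": 2.0, "FLPLIG": 1.0}

proportional = reward_proportional(rewards)
greedy = reward_maximizing(rewards)
diversity_proportional = entropy_bits(proportional)
diversity_greedy = entropy_bits(greedy)
```

The proportional target keeps every rewarded sequence in play, while the maximizing target has zero entropy. That is the mode-collapse behavior the abstract attributes to reward maximization.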
bioinformatics 2026-02-25 v4
Distilling Protein Language Models with Complementary Regularizers
Wijaya, E.
AI Summary
- The study distills a 738M-parameter protein language model into smaller models using uncertainty-aware position weighting and calibration-aware label smoothing, two enhancements that each degrade quality on their own but yield a substantial improvement when combined.
- The distilled models offer up to 5x faster inference, use less memory, and maintain natural amino acid distributions, making them suitable for consumer-grade hardware.
- When fine-tuned on small protein family datasets, these models outperform the original in generating family-matching sequences, showing higher efficiency and Pfam hit rates.
Abstract
Large autoregressive protein language models generate novel sequences de novo, but their size limits throughput and precludes rapid domain adaptation on scarce proprietary data. We distill a 738M-parameter protein language model into compact students using two protein-specific enhancements, uncertainty-aware position weighting and calibration-aware label smoothing, that individually degrade quality yet combine for substantial improvement. We trace this complementary-regularizer effect to information theory: smoothing denoises teacher distributions while weighting amplifies the cleaned signal at biologically variable positions. Students achieve up to 5x inference speedup, preserve natural amino acid distributions, and require as little as 170 MB of GPU memory, enabling deployment on consumer-grade hardware. When fine-tuned on protein families with as few as 50 sequences, students generate more family-matching sequences than the teacher, achieving higher sample efficiency and Pfam hit rates despite their smaller capacity. These results establish distilled protein language models as superior starting points for domain adaptation on scarce data.
bioinformatics 2026-02-25 v3
OriGene: A Self-Evolving Virtual Disease Biologist Automating Therapeutic Target Discovery
Zhang, Z.; Qiu, Z.; Wu, Y.; Li, S.; Wang, D.; Liu, Y.; Zhou, Z.; Hu, Y.; Chen, Y.; An, D.; Wang, Y.; Li, Y.; Zhong, Z.; Ou, C.; Wang, Z.; Tang, F.; Chen, J. X.; Ma, R.; Li, J.; Wang, X.; Lu, W.; Xue, H.; Zhang, W.; Wei, Z.; Ma, R.; Shi, Z.; Wang, K.; Liu, Q.; Dong, B.; He, Y.; Liu, T.; Gu, J.; Song, S.; Feng, Q.; Zhang, J.; Zhang, B.; Tian, L.; Bai, L.; Gao, Q.; Sun, S.; Zheng, S.
AI Summary
- OriGene is a self-evolving multi-agent system designed to automate therapeutic target discovery by integrating diverse biomedical data.
- It outperforms human experts and other AI models in accuracy, recall, and robustness, particularly with sparse or noisy data.
- OriGene identified novel therapeutic targets for liver (GPR160) and colorectal cancer (ARG2), showing significant anti-tumor activity in preclinical models.
Abstract
Therapeutic target discovery remains a critical yet intuition-driven bottleneck in drug development, typically relying on disease biologists to laboriously integrate diverse biomedical data into testable hypotheses for experimental validation. Here, we present OriGene, a self-evolving multi-agent system that functions as a virtual disease biologist, systematically identifying original and mechanistically grounded therapeutic targets at scale. OriGene coordinates specialized agents that reason over diverse modalities, including genetic data, protein networks, pharmacological profiles, clinical records, and literature evidence, to generate and prioritize target discovery hypotheses. Through a self-evolving framework, OriGene continuously integrates human and experimental feedback to iteratively refine its core thinking templates, tool composition, and analytical protocols, thereby enhancing both accuracy and adaptability over time. To comprehensively evaluate its performance, we established TRQA, a benchmark comprising over 1,900 expert-level question-answer pairs spanning a wide range of diseases and target classes. OriGene consistently outperforms human experts, leading research agents, and state-of-the-art large language models in accuracy, recall, and robustness, particularly under conditions of data sparsity or noise. Critically, OriGene nominated previously underexplored therapeutic targets for liver (GPR160) and colorectal cancer (ARG2), which demonstrated significant anti-tumor activity in patient-derived organoid and tumor fragment models mirroring human clinical exposures. These findings demonstrate OriGene's potential as a scalable and adaptive platform for AI-driven discovery of mechanistically grounded therapeutic targets, offering a new paradigm to accelerate drug development.
bioinformatics 2026-02-25 v2
KuPID: Kmer-based Upstream Preprocessing of Long Reads for Isoform Discovery
Borowiak, M.; Yu, Y. W.
AI Summary
- KuPID is introduced as a method for preprocessing long RNAseq reads to enhance novel isoform discovery by using kmer sketching to pseudo-align reads to known isoforms.
- Full alignment is then applied only to the reads most relevant to isoform discovery, which speeds up the pipeline by a factor of 2-3 and improves the F1 score of isoform discovery by up to 16.7 points.
- An optional mode allows KuPID to be used for both isoform discovery and transcript quantification.
Abstract
Eukaryotic genes can encode multiple protein isoforms based on alternative splicing of their transcribed regions. Most modern novel isoform discovery methods function by identifying and assembling exon splice junctions from an RNAseq sample. However, splice junctions can only be accurately annotated with time-intensive dynamic programming alignment. This manuscript introduces KuPID, a method for preprocessing long RNAseq reads with the goal of better identifying novel isoform transcripts. KuPID utilizes kmer sketching as a pre-filter to quickly pseudo-align reads to known reference isoforms. Full alignment then need only be applied to the reads that are most relevant to isoform discovery. Not only does KuPID speed up the discovery pipeline, it also increases downstream accuracy by filtering out extraneous reads. KuPID preprocessing increases the F1 score of isoform discovery pipelines by up to 16.7 points while decreasing the runtime by a factor of 2-3. An optional mode permits a KuPID sample to be paired with both isoform discovery and transcript quantification. Code availability: https://github.com/mboro2000/KuPID.git
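The pre-filter idea described above can be sketched in a few lines. This is a generic k-mer sketching filter, not KuPID's implementation: it uses a FracMinHash-style sketch (keep k-mers whose hash is divisible by `scale`), scores each read by its containment in known isoforms, and routes only poorly explained reads to full alignment. The function names, the 0.9 threshold, and the routing rule are assumptions for illustration.

```python
import hashlib

def stable_hash(kmer):
    """Deterministic 64-bit hash of a k-mer (Python's built-in hash is salted)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def frac_sketch(seq, k=5, scale=1):
    """Keep k-mers whose hash is divisible by `scale` (scale=1 keeps all)."""
    kms = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return {km for km in kms if stable_hash(km) % scale == 0}

def containment(read_sketch, ref_sketch):
    """Fraction of the read's sketched k-mers that also occur in the reference."""
    if not read_sketch:
        return 0.0
    return len(read_sketch & ref_sketch) / len(read_sketch)

def needs_full_alignment(read, references, threshold=0.9, k=5, scale=1):
    """True if no known isoform explains the read well (hypothetical rule)."""
    read_sk = frac_sketch(read, k, scale)
    best = max(
        (containment(read_sk, frac_sketch(ref, k, scale)) for ref in references),
        default=0.0,
    )
    return best < threshold

if __name__ == "__main__":
    known = ["ACGTACGTTGCAAGCTTAGC"]
    print(needs_full_alignment("GTACGTTGCAAG", known))     # read from a known isoform
    print(needs_full_alignment("GGGGGCCCCCGGGGG", known))  # read unlike any reference
```

A larger `scale` subsamples the k-mers and shrinks the sketches at the cost of a noisier containment estimate; very short reads can then end up with an empty sketch, which this sketch treats as unexplained.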
bioinformatics 2026-02-25 v2
Quantified duplications of proteins within complexes across eukaryotes
Francis, O.
AI Summary
- The study integrates orthology and protein interaction data to map proteins of verified complexes across 31 diverse eukaryotes, identifying 184 universal orthogroups with components from all species.
- It developed the PCOC suite to analyze duplications and reductions in these orthogroups, revealing both multi-copy and single-copy proteins in various complexes.
- Case studies on Naegleria gruberi and Guillardia theta demonstrated taxon-specific expansions, enhancing understanding of eukaryotic protein-complex evolution.
Abstract
Protein complexes are central to cell biology and are typically verified via a combination of interaction data, complete genome sequencing and comprehensive protein-coding gene predictions for reference eukaryotes. However, these data are lacking for non-reference eukaryotes. Protein complexes can be predicted in species for which no interaction data are available by mapping orthology of verified protein complex components from reference eukaryotes to predicted proteomes. Studies that map conservation of protein complex components by orthology are often limited to a small number of protein queries, under-represent non-reference microbial eukaryotes, and are scattered across the literature. Here, I integrate orthology and protein interaction data by mapping proteins of experimentally verified complexes to orthogroups of proteins spanning 31 diverse eukaryotes. Proteins within complex-harbouring orthogroups are retained and distributed more evenly across taxa than those in non-complex orthogroups. I identified 184 universal orthogroups that included orthologs of known protein complex components from all 31 eukaryotes, consistent with a conserved core repertoire, likely present in the last eukaryotic common ancestor (LECA). I generated the protein complex orthology cartographer (PCOC) suite to find significant duplications and reductions of proteins in universal orthogroups across and between eukaryotes. This revealed both multi-copy and, notably, single-copy proteins in all queried species, from the exosome, spliceosome, proteasome, small-ribosomal processome, tRNA synthetases, MCM complexes and RNA polymerase III. Case analyses of Naegleria gruberi and Guillardia theta highlight taxon-specific expansions and show how broader protist inclusion improves domain-wide inference of eukaryotic protein-complex evolution.
bioinformatics 2026-02-25 v1
scDesignPop generates realistic population-scale single-cell RNA-seq for power analysis, benchmarking, and privacy protection
Dong, C. Y.; Cen, Y.; Song, D.; Li, J. J.
AI Summary
- scDesignPop is introduced as a statistical simulator for generating realistic population-scale scRNA-seq data with genetic effects, addressing cost, method selection, and privacy issues in large cohort studies.
- It models cell- and individual-level covariates, cell-type-specific eQTLs, and uses real or synthetic genotypes, validated against OneK1K and CLUES cohorts.
- Compared to splatPop, scDesignPop better preserves eQTL effects and gene-gene dependencies, enabling power analysis, method benchmarking, and privacy protection through synthetic data.
Abstract
Single-cell RNA sequencing (scRNA-seq) combined with genotyping in large cohorts has enabled the discovery of genetic associations with molecular traits (e.g., eQTLs) at cell-type resolution. However, generating population-scale data remains cost-prohibitive, selecting appropriate analysis methods lacks consensus, and sharing eQTL results alongside scRNA-seq data raises privacy risks. To address these challenges, we introduce scDesignPop, a flexible statistical simulator for generating realistic population-scale scRNA-seq data with genetic effects. scDesignPop models cell- and individual-level covariates, putative cell-type-specific eQTLs (cts-eQTLs), and either real or synthetic genotypes. We validated scDesignPop using the OneK1K and CLUES cohorts across 4 qualitative and 16 quantitative metrics. Unlike splatPop, the only existing population-scale simulator, scDesignPop better preserves eQTL effects and gene-gene dependencies within cell types, closely recapitulating characteristics of the reference data. Leveraging its generative framework, scDesignPop enables power analysis in cell types under multiple eQTL model specifications to guide experimental design; facilitates benchmarking of single-cell eQTL mapping methods through user-defined ground truths; and mitigates re-identification risk using synthetic data while retaining cts-eQTL effects.
bioinformatics 2026-02-25 v1
ARCH3D: A foundation model for global genome architecture
Galioto, N.; Stansbury, C.; Gorodetsky, A. A.; Rajapakse, I.
AI Summary
- ARCH3D is introduced as a foundation model for global genome architecture, utilizing a novel masked locus modeling task to incorporate genome-wide contact profiles.
- The model's embeddings preserve genomic spatial structure, reconstruct interchromosomal interactions under sparse conditions, and identify multi-way interactions.
- ARCH3D aims to serve as a structural foundation for developing a virtual genome model to simulate genome behavior and dynamics.
Abstract
Biological foundation models are transforming scientific discovery by creating information-rich representations that enable inference in low-data settings. Progress on these models has mainly been achieved by increasing input contextual information, e.g., base pairs or genes. Most work, however, focuses on DNA, RNA, and protein, leaving genome architecture, a fundamental component regulating processes like the cell cycle and cell-fate determination, underexplored. Here, we introduce ARCH3D: a foundation model for global genome architecture. ARCH3D uses a novel masked locus modeling task that increases input contextual information to include genome-wide contact profiles of loci spread across the entirety of the genome. We demonstrate this strategy captures global genome structure by showing ARCH3D embeddings preserve genomic spatial structure, reconstruct interchromosomal interactions under extreme sparsity, and enable identification of multi-way interactions. Ultimately, ARCH3D provides a potential structural foundation for building the virtual genome, an artificial intelligence-based model capable of simulating genome behavior and dynamics.
bioinformatics 2026-02-25 v1
RNA foundation models enable generalizable endometriosis disease classification and stable gene-level interpretation
McConnell, N.; Kelly, J.; Tadikonda, R.; Bettencourt-Silva, J.; Mulligan, N.; Madgwick, M.; Krishna, R.; Strudwick, J.; Evans, A.; Checkley, S.; Carrieri, A. P.; Smyrnakis, M.; Knowles, C. H.; Gardiner, L.-J.
AI Summary
- Researchers investigated whether foundation models (FMs) pretrained on large-scale transcriptomic data could improve endometriosis classification across different patient cohorts.
- Using a 12-cohort RNA-seq benchmark, FM embeddings significantly outperformed traditional TPM baselines, achieving a weighted F1-score of 0.83 versus 0.68.
- A new interpretability method, classified-aligned integrated gradients (CA-IG), identified a stable set of predictive genes across cohorts, highlighting novel candidates involved in endometriosis pathways.
Abstract
Endometriosis is a chronic inflammatory condition with significant diagnostic delays, affecting one in ten reproductive-age women worldwide. While machine learning (ML) models trained on transcriptomic data show promise for disease prediction, limited generalizability across independent patient cohorts has hindered clinical translation. Foundation models (FMs) pretrained on large-scale transcriptomic data offer promise to learn transferable, biologically meaningful representations that could support cross-cohort predictions. We assembled a 12-cohort bulk RNA-seq benchmark (334 samples) and developed a computationally efficient pipeline to test whether FMs improve endometriosis classification, an approach not previously applied to this disease. Using AutoXAI4Omics with cohort-aware validation, we compared embeddings derived from five state-of-the-art RNA FMs against TPM baselines. In cross-cohort prediction, FM embeddings significantly improved performance, achieving a weighted F1-score of 0.83 vs. 0.68 for the baseline. To allow gene-level interpretation of FM embedding models, we introduce classified-aligned integrated gradients (CA-IG), an interpretability approach aligning gene-level attributions to the downstream classifier without end-to-end fine-tuning. CA-IG revealed a conserved set of predictive genes from FM embeddings across cohort-validation regimes, contrasting with unstable baseline explainability, suggesting that FM embeddings prioritized transferable disease-related signal over cohort-specific effects. These genes include novel candidates that converge on biologically plausible pathways for endometriosis.
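For readers unfamiliar with the two evaluation ingredients named above, the sketch below shows a support-weighted F1 score and a leave-one-cohort-out split. It is a generic illustration with invented labels and cohort names; the abstract does not specify how AutoXAI4Omics implements its cohort-aware validation, so the split function is an assumption.

```python
from collections import Counter

def f1_for_class(y_true, y_pred, label):
    """F1 for one class, treating `label` as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def weighted_f1(y_true, y_pred):
    """Average of per-class F1 scores, weighted by each class's support."""
    support = Counter(y_true)
    n = len(y_true)
    return sum(f1_for_class(y_true, y_pred, c) * cnt / n
               for c, cnt in support.items())

def leave_one_cohort_out(cohorts):
    """Yield (held_out, train_idx, test_idx), holding out one cohort at a time."""
    for held_out in sorted(set(cohorts)):
        train = [i for i, c in enumerate(cohorts) if c != held_out]
        test = [i for i, c in enumerate(cohorts) if c == held_out]
        yield held_out, train, test

if __name__ == "__main__":
    y_true = [1, 1, 1, 0, 0]
    y_pred = [1, 1, 0, 0, 1]
    print(weighted_f1(y_true, y_pred))
    for cohort, train, test in leave_one_cohort_out(["A", "A", "B", "B", "C"]):
        print(cohort, train, test)
```

Holding out whole cohorts keeps samples from the test cohort out of training, which is what makes the reported score a cross-cohort estimate.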
bioinformatics 2026-02-25 v1
Integrative Multi-Scale Sequence-Structure Modeling for Antimicrobial Peptide Prediction and Design
Li, J.; Shao, Y.; Li, Y.; Yu, Q.
AI Summary
- The study introduces MultiAMP, a framework that integrates multi-scale sequence and structure information to predict antimicrobial peptides (AMPs), addressing the limitations of current methods that treat these aspects in isolation.
- MultiAMP significantly outperforms existing methods by over 10% in MCC, particularly in identifying AMPs with low sequence identity to known peptides.
- Applied to marine organisms, MultiAMP identified 484 novel high-confidence AMPs and provided insights into AMP mechanisms, aiding in the design of peptides with specific motifs.
Abstract
Antimicrobial resistance (AMR) is accelerating worldwide, undermining frontline antibiotics and making the need for novel agents more urgent than ever. Antimicrobial peptides (AMPs) are promising therapeutics against multidrug-resistant pathogens, as they are less prone to inducing resistance. However, current AMP prediction approaches often treat sequence and structure in isolation and at a single scale, leading to mediocre performance. Here, we propose MultiAMP, a framework that integrates multi-level information for predicting AMPs. The model captures evolutionary and contextual information from sequences alongside global and fine-grained information from structures, synergistically combining these features to enhance predictive power. MultiAMP achieves state-of-the-art performance, outperforming existing methods by over 10% in MCC when identifying distant AMPs sharing less than 40% sequence identity with known AMPs. To discover novel AMPs, we applied MultiAMP to marine organism data, discovering 484 high-confidence peptides with sequences that are highly divergent from known AMPs. Notably, MultiAMP accurately recognizes various structural types of peptides. In addition, our approach reveals functional patterns of AMPs, providing interpretable insights into their mechanisms. Building on these findings, we employed a gradient-based strategy and achieved the design of AMPs with specific motifs. We believe MultiAMP empowers both the rational discovery and mechanistic understanding of AMPs, facilitating future experimental validation and therapeutic design. The codebase is available at https://github.com/jiayili11/multi-amp.
bioinformatics 2026-02-25 v1
Bioactivity-driven discovery of repurposable antivirals as OSCAR inhibitors that promote cartilage protection via transcriptomic reprogramming
Ryu, G.; Kim, J.; Kim, S.; Lee, S. Y.; Kim, W.
AI Summary
- Researchers used sBEAR to identify adefovir (ADV) and brivudine (BRV) as inhibitors of the OSCAR-collagen interaction in chondrocytes, targeting OA treatment.
- Molecular docking showed both compounds bind to the OSCAR D2 domain's collagen-recognition pocket.
- In a mouse model, ADV and BRV reduced OA progression, promoted chondrocyte regeneration, and BRV reversed inflammatory and matrix-degrading gene expression.
Abstract
Osteoarthritis (OA) is a progressive degenerative joint disorder characterized by cartilage degradation, chronic pain, and impaired joint function. The avascular nature of cartilage isolates chondrocytes from systemic circulation, presenting significant challenges for therapeutic intervention. Despite extensive efforts, no clinically effective disease-modifying osteoarthritis drugs (DMOADs) are currently available. Targeting chondrocyte-specific receptors has therefore emerged as a promising strategy. The osteoclast-associated receptor (OSCAR), expressed on chondrocytes, has been implicated in the regulation of cartilage homeostasis and OA pathogenesis. Here, we applied sBEAR (Structurally similar Bioactive compound Enrichment by Assay Repositioning), a bioactivity-driven virtual screening framework independent of target structural information, to identify small-molecule inhibitors of the OSCAR-collagen interaction. By mining large-scale bioactivity profiles, we identified adefovir (ADV) and brivudine (BRV) as candidate OSCAR inhibitors. Molecular docking analyses indicated that both compounds occupy the collagen-recognition pocket within the OSCAR D2 domain. Intra-articular administration of these compounds in a post-traumatic OA mouse model significantly attenuated OA progression and enhanced chondrocyte regeneration. Both compounds increased Sox9 expression, and transcriptomic analyses revealed that BRV reverses inflammatory and extracellular matrix-degrading transcriptional programs. Together, these findings establish OSCAR as a therapeutically actionable target in OA and highlight ADV and BRV as potential DMOAD candidates.
bioinformatics 2026-02-25 v1
A Comprehensive Analysis of the Electrolytic Hydrogen Water Mechanism via a Feedforward Loop and its Functional Role in Intestinal Cells In Vitro
LI, J.
AI Summary
- This study investigated the molecular mechanisms of electrolytic hydrogen water (EHW) in Caco-2 cells using next-generation sequencing to analyze mRNA and miRNA expression.
- EHW was found to modulate oxidative stress response and tight junction formation, with bioinformatics revealing its impact on the HIF1 signaling pathway and the expression of genes like CUL5 and GOLGA7.
- EHW treatment reduced miR-429 and miR-200c-3p levels, enhancing CUL5 and GOLGA7 expression, and promoted cell differentiation, highlighting EHW's regulatory role via feed-forward loops.
Abstract
Electrolytic hydrogen water (EHW) plays a critical role in modulating cellular metabolism; yet, the underlying molecular mechanisms remain unclear. This study utilized next-generation sequencing (NGS) to assess mRNA and miRNA expression in EHW-treated Caco-2 cells. Bioinformatics analysis identified differentially expressed genes (DEGs) and pathways influenced by EHW and highlighted its involvement in the oxidative stress response and tight junction formation. Protein-protein interaction (PPI) network analysis of the DEGs identified first neighbor genes, supporting the role of EHW in suppressing oxidative stress-related genes while also enhancing the expression of the TCEB2-CUL5-COMMD8 (ECS complex) genes, both of which converged on the HIF1 signaling pathway. We also constructed an mRNA and miRNA competing endogenous RNA (ceRNA) network, which revealed four hub genes, two non-coding RNAs (miR-429 and miR-200c-3p) and two protein-coding RNAs (CUL5 and GOLGA7). These genes co-target the transcription factor KLF4 in Caco-2 cells, forming a TF-miRNA-gene network (TMGN). EHW treatment significantly decreased the levels of miR-429 and miR-200c-3p and stabilized CUL5 and GOLGA7 transcripts post-transcriptionally as compared to ACW. Concurrently, reduced miRNA expression weakened their pretranscriptional competition with mRNAs for KLF4 binding, further enhancing CUL5 and GOLGA7 expression. Phenotypic assays confirmed that continuous EHW treatment promotes Caco-2 cell differentiation. This study underscores the regulatory role of EHW in intestinal cells via feed-forward loops (FFLs), offering novel insights into the molecular mechanisms and functions of EHW.
bioinformatics 2026-02-25 v1
STRATA: Spatial Regulon Field Theory Reveals Coupling Architecture of Human Skin and Its Homogenization in Melanoma
Tjiu, J.-W.
AI Summary
- STRATA, a new differential-geometric framework, analyzes spatial transcriptomics by constructing continuous regulon activity fields and quantifying local co-regulation in tissues.
- Applied to human skin melanoma data, STRATA identifies coupling phase boundaries that align with histological tissue architecture.
- The analysis shows that melanoma homogenizes regulon coupling, reducing variance by 28% and phase boundary intensity by 18% compared to normal epidermal zones.
Abstract
Spatial transcriptomics captures gene expression in tissue context, yet current analyses reduce continuous regulatory landscapes to discrete cell clusters, discarding the geometry of intercellular regulation. Here we introduce STRATA (Spatial Transcription-factor Regulatory Architecture of Tissue Analysis), a differential-geometric framework that constructs continuous regulon activity fields from transcript coordinates, computes their coupling tensor to quantify local co-regulation between transcription factor programs, and derives a Regulon Stability Index from the Jacobian singular value decomposition. Applied to Xenium in situ data from human skin melanoma (382 genes, 13.7 million transcripts), STRATA identifies coupling phase boundaries -- positions where the regulatory logic of tissue changes -- that track histological tissue architecture (Pearson r = 0.32 with the dermal-epidermal junction marker KRT-diff, r = 0.51 with maximum principal stretch; P < 10^-10). Within-tissue comparison reveals that the melanoma microenvironment does not abolish regulon coupling but homogenizes it: coupling variance decreases 28% and phase boundary intensity drops 18% relative to the epidermal zone. STRATA transforms spatial transcriptomics from cell cataloguing to continuous field analysis of regulatory tissue architecture.
bioinformatics 2026-02-25 v1
Improved multimodal protein language model-driven universal biomolecules-binding protein design with EiRA
Zeng, W.; Zou, H.; Li, X.; Dou, Y.; Wang, X.; Peng, S.
AI Summary
- The study introduces EiRA, a generative model for designing proteins that bind to various biomolecules, using a two-stage post-training process on a multimodal protein language model.
- EiRA showed state-of-the-art performance in structural confidence, diversity, novelty, and designability across 8 test sets for 6 biomolecule types, and improved downstream task predictions.
- Experimental validation confirmed a 100% success rate in expressing variants, and EiRA designed a Glucagon peptide binder with micromolar affinity.
Abstract
The interactions between proteins and biomolecules form a complex system that supports life activities. Designing proteins capable of targeted biomolecular binding is therefore critical for protein engineering and gene therapy. Here, we propose a new generative model, EiRA, specifically designed for universal biomolecular-binding protein design, which undergoes two-stage post-training, i.e., domain-adaptive masking training and binding site-informed preference optimization, based on a general multimodal protein language model. A systematic evaluation reveals the state-of-the-art performance of EiRA, including structural confidence, diversity, novelty, and designability on 8 test sets across 6 biomolecule types. Meanwhile, EiRA provides a better characterization of biomolecular-binding proteins than a generic model, thereby improving the predictive performance of various downstream tasks. We also mitigate the severe repetitive generation of the original language model by optimizing training strategies and loss. Additionally, we introduced DNA information into EiRA to support DNA-conditioned binder design, further expanding the boundaries of the design paradigm. Experimental validation yielded a 100% success rate (20/20) in expressing highly divergent variants. Remarkably, EiRA achieved the one-shot design of a Glucagon peptide binder with SPR-confirmed micromolar affinity.
bioinformatics 2026-02-24 v3
Transcriptomic analysis reveals immune signatures associated with specific cutaneous manifestations of lupus in systemic lupus erythematosus
Lee, E. Y.; Patterson, S.; Cutts, Z.; Lanata, C. M.; Dall'Era, M.; Yazdany, J.; Criswell, L. A.; Haemel, A.; Katz, P.; Ye, C. J.; Langelier, C.; Sirota, M.
AI Summary
- This study used transcriptomics from a large cohort of SLE patients to identify molecular pathways associated with ten distinct cutaneous manifestations of SLE.
- Specific immune signatures were found, such as upregulation of type I interferon, TNF-α, and IL6-JAK-STAT3 pathways in subacute cutaneous lupus, suggesting potential therapeutic targets.
- Unexpected findings included the absence of interferon signaling in patients with skin and mucosal ulcers, and roles for CD14+ monocytes in photosensitivity and NK cells in alopecia, mucosal ulceration, and livedo reticularis.
Abstract
Systemic lupus erythematosus (SLE) presents with diverse and heterogeneous cutaneous manifestations. However, the molecular and immunologic pathways driving specific cutaneous manifestations of SLE are poorly understood. Here, we leverage transcriptomics from a large, well-phenotyped longitudinal cohort of SLE patients to map molecular pathways linked to ten distinct SLE-related rashes. Through whole blood and immune cell-sorted bulk RNA sequencing, we identified immune signatures specific to cutaneous subtypes of SLE. Subacute cutaneous lupus (SCLE) exhibited broad upregulation of type I interferon, TNF-α, and IL6-JAK-STAT3 pathways, suggesting potential unique therapeutic responses to JAK and type I interferon inhibition. While interferon signaling is prominent in SCLE, discoid lupus, and acute lupus, it is unexpectedly absent in patients with skin and mucosal ulcers. Pathway and cell-type enrichment analysis revealed unexpected roles for CD14+ monocytes in photosensitivity of SLE and NK cells in alopecia, mucosal ulceration, and livedo reticularis. These findings illuminate the immune heterogeneity of rashes in SLE, highlighting subtype-specific mechanistic targets, and presenting opportunities to identify precision therapies for SLE-associated skin phenotypes.
bioinformatics 2026-02-24 v2
The phylodynamic threshold of measurably evolving populations
Weber, A.; Kende, J.; Duitama Gonzalez, C.; Oeversti, S.; Duchene, S.
AI Summary
- This study investigates the concepts of measurably evolving populations and the phylodynamic threshold, crucial for molecular clock calibration using sampling times.
- Through simulations and empirical data analysis, it was found that determining these thresholds depends on model assumptions, sampling strategies, and the sensitivity of priors in Bayesian analyses.
- The study emphasizes the importance of assessing prior sensitivity over tests of temporal signal to enhance molecular clock inferences and highlights sampling limitations.
Abstract
The molecular clock is a fundamental tool for understanding the time and pace of evolution, requiring calibration information alongside molecular data. Sampling times are often used for calibration since some organisms accumulate enough mutations over the course of their sampling period. This practice ties together two key concepts: measurably evolving populations and the phylodynamic threshold. Our current understanding suggests that populations meeting these criteria are suitable for molecular clock calibration via sampling times. However, the definitions and implications of these concepts remain unclear. Using Hepatitis B virus-like simulations and analyses of empirical data, this study shows that determining whether a population is measurably evolving or has reached the phylodynamic threshold does not only depend on the data, but also on model assumptions and sampling strategies. In Bayesian applications, a lack of temporal signal due to a narrow sampling window results in a prior that is overly informative relative to the data, such that a prior that is potentially misleading typically requires a wider sampling window than one that is reasonable. In our analyses we demonstrate that assessing prior sensitivity is more important than the outcome of tests of temporal signal. Our results offer guidelines to improve molecular clock inferences and highlight limitations in molecular sequence sampling procedures.
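As background for the "tests of temporal signal" mentioned above, one widely used informal check is root-to-tip regression: regress each sample's genetic distance from the root on its sampling time, read the slope as a rate estimate and the x-intercept as the root date. The sketch below is a plain least-squares version with invented data; it is not the Bayesian analysis used in the study, and the function name and return format are assumptions.

```python
def root_to_tip_regression(times, distances):
    """Least-squares fit of root-to-tip distance against sampling time.

    Returns (rate, root_time, r_squared), where rate is the slope in
    substitutions/site/time unit and root_time is the x-intercept.
    """
    n = len(times)
    mean_t = sum(times) / n
    mean_d = sum(distances) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        # All samples share one date: sampling times carry no clock information.
        raise ValueError("sampling times have no spread")
    sxy = sum((t - mean_t) * (d - mean_d) for t, d in zip(times, distances))
    syy = sum((d - mean_d) ** 2 for d in distances)
    rate = sxy / sxx
    intercept = mean_d - rate * mean_t
    root_time = -intercept / rate if rate != 0 else float("nan")
    r_squared = (sxy ** 2) / (sxx * syy) if syy else 0.0
    return rate, root_time, r_squared

if __name__ == "__main__":
    # Invented example: perfectly clock-like data with rate 0.001 and root at 1990.
    times = [2000, 2005, 2010, 2015]
    distances = [0.010, 0.015, 0.020, 0.025]
    print(root_to_tip_regression(times, distances))
```

A narrow sampling window shrinks the spread of `times`, which makes the slope poorly determined; this is the regime in which the abstract reports that the prior dominates the data.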
bioinformatics 2026-02-24 v2
Condensate-Driven Transcriptional Reprogramming Defines Core Vulnerabilities in Esophageal and Gastric Cancers
Alvarez-Carrion, L.; R. Tejedor, A.; Ardura, J. A.; Alonso, V.; Alonso-Moreno, C.; Collepardo-Guevara, R.; Gutierrez-Rojas, I.; Privat, C.; Moreno, V.; Calvo, E.; Gyorffy, B.; Espinosa, J. R.; Ocana, A.
AI Summary
- The study investigates how biomolecular condensates contribute to esophageal and gastric cancers using multi-omics profiling, functional genomics, and simulations.
- Findings show these cancers share a condensate-driven transcriptional program with upregulation of genes like TOPBP1 and CHERP, essential for tumor cell survival.
- Simulations confirmed that TOPBP1 and CHERP form condensates through phase separation, suggesting these proteins as potential therapeutic targets.
Abstract
Biomolecular condensates organize key nuclear functions by compartmentalizing biomolecules, yet their contribution to gastrointestinal tumorigenesis remains poorly defined. Integrating multi-omics profiling, functional genomics, and molecular dynamics simulations, we reveal that esophageal and gastric cancers share a condensate-enriched transcriptional program driven by intrinsically disordered proteins involved in transcription, RNA processing, and replication stress. Transcriptomic analyses identify a hyperactive transcriptional state with upregulation of condensate-associated genes, including TOPBP1 and CHERP. Dependency mapping demonstrates that these proteins are essential for tumor cell viability, defining a conserved condensate core across different tumor types. Machine-learned predictions and residue-resolution coarse-grained simulations confirm that TOPBP1 and CHERP undergo phase separation through homotypic interactions mediated by intrinsically disordered regions, with saturation concentrations below 2 µM, consistent with spontaneous condensate formation observed in vitro. Together, these findings establish condensate organization as a fundamental mesoscale principle in upper gastrointestinal cancers and nominate condensate scaffolds as tractable therapeutic vulnerabilities.
bioinformatics 2026-02-24 v1
A partition-based spatial entropy for co-occurrence analysis with broad application.
Otto, T.; Nemri, A.; Claessens, A.; Radulescu, O.
AI Summary
- The study introduces Regional Co-occurrence Entropy (RCE), a new spatial entropy measure to analyze how categorical co-occurrences relate to specific environments.
- RCE was applied to various fields: it identified interactions between immune cells in Alzheimer's Disease, analyzed building diversity in town neighborhoods, and examined bird species distribution in a natural reserve.
- Key findings include novel interactions in Alzheimer's, potential drivers of social mixing, and vegetation-driven changes in bird community composition.
Abstract
Despite the advent of spatial data science, including spatial biology, few methods study the distribution of points (e.g., cells or individuals) while accounting for both their own characteristics and environmental factors. We propose a new spatial entropy measure, termed the Regional Co-occurrence Entropy (RCE), that detects when categorical co-occurrences happen preferentially in specific environments. We demonstrate its use over a broad range of application fields. As examples, we study brain cell dynamics in Alzheimer's Disease, identifying both known and likely novel interactions between immune cells around beta-amyloid plaques. We also investigate the diversity of buildings across a town's neighborhoods, to detect potential drivers of social mixing at the local scale. Finally, we dissect bird species distribution across a natural reserve, identifying potential vegetation-driven changes in community composition. Altogether, the proposed RCE enables rapid insights into interactions with an environmental component, making it a useful addition to the spatial data science toolbox.
bioinformatics 2026-02-24 v1
A functional annotation based integration of different similarity measures for gene expressions
Misra, S.; Roy, S.; Ray, S. S.
AI Summary
- The study developed an integrated similarity score (ISS) by combining various gene expression similarity measures, weighted by biological information, to enhance gene similarity prediction.
- A fitness function (FFFAG) was used to optimize the weights in ISS by minimizing the difference between functional similarity and ISS.
- ISS outperformed individual measures in identifying similar gene pairs and predicted functional categories for 40 unclassified yeast genes with high significance (p-value < 10^(-10)).
Abstract
Genes with similar expression profiles often exhibit similar functional properties. An integrated similarity score (ISS) is developed by combining different expression similarity measures through weights, obtained using biological information, for improving gene similarity. The expression similarity measures are converted to the common framework of positive predictive value using functional annotation. A fitness function, called fitness function using functional annotation of genes (FFFAG), is also developed by minimizing the difference between functional similarity value and the ISS. The FFFAG is used to determine the weight combination of different similarity measures in ISS. In addition, an existing similarity measure, called TMJ (integrated similarity measure by multiplying Triangle and Jaccard similarity), is also modified to incorporate biological knowledge involving functional annotation. The results demonstrate that ISS is superior to individual similarity measures in finding similar gene pairs. Further, the ISS predicts the functional categories of 40 unclassified yeast genes at a p-value cutoff of 10^(-10) from 12 clusters. The associated code is accessible at http://www.isical.ac.in/~shubhra/ISS.html.
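The weighting idea in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the mean-absolute-difference fitness, the grid search over weights, and the toy numbers are all assumptions made for illustration, and the abstract's conversion of each measure to positive predictive value is replaced here by the assumption that every measure is already on a common 0-to-1 scale.

```python
import numpy as np

def integrated_similarity(measures, weights):
    """Weighted combination of per-gene-pair similarity scores.

    measures: shape (n_measures, n_pairs), each row assumed to be on a 0..1 scale.
    weights:  shape (n_measures,), non-negative; normalized here to sum to 1.
    """
    measures = np.asarray(measures, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return weights @ measures

def fitness(measures, weights, functional_similarity):
    """Mean absolute gap between functional similarity and the combined score.

    Lower is better. This particular form is an illustrative assumption.
    """
    combined = integrated_similarity(measures, weights)
    target = np.asarray(functional_similarity, dtype=float)
    return float(np.mean(np.abs(target - combined)))

# Toy data: two expression-similarity measures scored on three gene pairs.
measures = np.array([[0.9, 0.2, 0.6],
                     [0.7, 0.4, 0.1]])
functional = np.array([0.8, 0.3, 0.35])

# Coarse grid search over the weight of the first measure (hypothetical optimizer).
grid = [(w, 1.0 - w) for w in np.linspace(0.0, 1.0, 11)]
best = min(grid, key=lambda w: fitness(measures, w, functional))
print(best, fitness(measures, best, functional))
```

A grid search is used only because it is easy to read; any optimizer that minimizes the same gap would fill the same role.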
bioinformatics 2026-02-24 v1
Graph-based RNA structural representation reveals determinants of subcellular localization
Hao, Y.; Sun, H.; Ran, Z.; Guo, X.; Liu, M.; Bi, Y.; Polo, J.; Liu, N.; Li, F.
AI Summary
- The study introduces GRASP, a graph neural network framework for predicting RNA subcellular localization, using a graph representation that includes nucleotide and substructure nodes.
- GRASP models both base-level interactions and structural context, incorporating multi-label dependency learning for co-localization patterns.
- It outperforms existing methods in accuracy, F1 score, and AUC across various RNA types, offering insights into structural determinants of RNA localization.
Abstract
RNA subcellular localization is a key determinant of RNA function and regulation, yet existing computational approaches rely primarily on sequence or simplified structural descriptors, limiting their scalability to long transcripts, their ability to model inter-label dependencies, and their applicability across RNA types. Here, we present GRASP, a unified graph neural network framework for predicting RNA subcellular localization using a heterogeneous graph representation that is RNA substructure-aware. GRASP represents each RNA as a multi-scale graph comprising nucleotide nodes and secondary-structure-derived substructure nodes, connected by relational edges, enabling joint modeling of base-level interactions and regional structural context. The model further incorporates multi-label dependency learning to capture co-localization patterns across cellular compartments within a unified framework. Across multiple benchmark datasets and RNA types, GRASP consistently outperforms state-of-the-art sequence-based and structure-informed methods, achieving substantial improvements in accuracy, F1 score, and AUC while maintaining strong scalability to long transcripts. In addition, the graph-based representation provides biologically interpretable insights into structural determinants of RNA localization. The source code and data are available at https://github.com/ABILiLab/GRASP, and the web server is accessible at http://grasp.biotools.bio.
bioinformatics 2026-02-24 v1
CAPHEINE, or everything and the kitchen sink: a workflow for automating selection analyses using HyPhy
Verdonk, H. E.; Callan, D.; Kosakovsky Pond, S. L.
AI Summary
- CAPHEINE is a workflow designed to automate evolutionary analysis from unaligned pathogen sequences, using a reference genome.
- It facilitates studies on site-level selection dynamics, gene-level positive selection, and lineage-specific selective pressure changes.
- The workflow is compatible with Mac OS, Windows, and Linux, enhancing accessibility for researchers.
Abstract
Here we present CAPHEINE, a computational workflow that starts with a set of unaligned pathogen sequences and a reference genome and performs a comprehensive exploratory evolutionary analysis of the input data. CAPHEINE pairs nicely with studies of site-level selection dynamics, gene-level positive selection, and lineage-specific shifts in selective pressure. Our workflow is portable across Mac OS, Windows, and Linux, allowing researchers to focus on results.
bioinformatics 2026-02-24 v1
Systematic identification of DNA methylation biomarkers for tumor-type-specific detection
Arbona, J. S.; Garcia Samartino, C.; Angeloni, A. R.; Vaquer, C. C.; Wetten, P. A.; Bocanegra, V.; Militello, R. D.; Sanguinetti, G.; Correa, A.; Pellegrini, P.; Carlen, M.; Minatti, W. R.; Vaschalde, G. A.; Perez, R.; Manzino, R. N.; Rodriguez, J. D.; Valdemoros, P.; Sarrio, L.; Ledesma, A.; Campoy, E. M.
AI Summary
- The study developed a browser-based platform integrating genome-wide methylomes with transcriptomes to identify DNA methylation biomarkers, addressing issues like shared epigenetic programs and mixed cellular composition in cancer diagnostics.
- Validation using MSRE-qPCR in colorectal cancer cohorts confirmed effective biomarkers with AUCs of 0.81-1.00.
- The approach also successfully distinguished hepatocellular carcinoma from cirrhotic liver and identified subtype-specific markers in lung cancers.
Abstract
DNA methylation biomarkers for cancer diagnostics often underperform when tumor and background tissues share epigenetic programs, or when complex specimens with mixed cellular composition dilute tumor-derived signals and increase variability. To address these limitations, we developed a gene-centric, browser-based discovery platform that integrates genome-wide methylomes with matched transcriptomes and reference layers spanning pan-cancer tissues and leukocytes, enabling background-aware filtering beyond binary tumor-normal contrasts. Candidate loci are prioritized using combined thresholds on methylation effect size and intra-group variability to penalize stochastic and heterogeneous variation. In colorectal cancer, methylation-sensitive restriction enzyme quantitative PCR (MSRE-qPCR) validation in independent tissue cohorts confirmed multiple candidate loci with AUCs of 0.81-1.00. Using the same framework, MSRE-qPCR validation distinguished hepatocellular carcinoma from cirrhotic liver, and analysis of public tumor methylomes identified subtype-specific markers in lung adenocarcinoma and squamous-cell carcinoma. This resource bridges genome-scale epigenomic discovery with clinically accessible PCR-based methylation assays.
bioinformatics 2026-02-24 v1
OligoGraph: A novel geometric graph-based approach for siRNA efficacy prediction
Saligram, S. S.; Kasturi, V. V.; Surkanti, S. R.; Basangari, B. C.; Kondaparthi, V.
AI Summary
- The study introduces OligoGraph, a graph-based deep learning model for predicting siRNA efficacy against mRNA, addressing the limitations of traditional models by handling variable siRNA lengths.
- OligoGraph uses RiNALMo embeddings, GATconv, Transformerconv layers, and self-supervised pretraining, showing superior performance on both seen and unseen datasets.
- Specialized versions for 19- and 21-nucleotide siRNAs outperformed existing models, with significant improvements in AUC-ROC and PCC on various datasets.
Abstract
RNA interference (RNAi) is a biological process in which a small interfering RNA (siRNA) prevents the translation of a messenger RNA (mRNA) into a protein by cleaving the mRNA before translation. We exploit this process to prevent the formation of harmful proteins by using an effective siRNA on the target mRNA. The rapidly emerging RNAi-based drugs show immense potential for therapeutic applications. Traditionally, designing a potent siRNA for an mRNA requires extensive lab experimentation and trials; a model that reliably predicts an siRNA's efficacy against an mRNA would therefore save both cost and time. Designing such models is challenging, however, as the available data are either scarce or biased. Currently available models exhibit limited generalization and are restricted to fixed siRNA lengths of either 19 or 21 nucleotides, limiting flexible use. To address these challenges, we introduce OligoGraph, a graph-based deep learning architecture that operates on the siRNA-mRNA duplex. It leverages RiNALMo embeddings, multiple GATconv and Transformerconv layers, and self-supervised pretraining, and outperforms all other existing models in our testing on seen and unseen data. We implemented specialized OligoGraph variants for 19- and 21-nucleotide siRNAs, both of which outperformed the current state-of-the-art models on unseen data. The 19-nucleotide model yielded AUC-ROC and PCC increases of 1.1% and 4.6%, respectively, on the Mixset, and of 19.07% and 127.3% on the Takayuki dataset. Furthermore, the 21-nucleotide model improved predictive performance on the Simone dataset by 2.62% (AUC-ROC) and 6.65% (PCC).
bioinformatics 2026-02-24 v1
RevelioPlots: An Interactive Web Application for Fast AI-Based Protein Models Quality Assessment
Fernandes, L. L. d. S.; Azevedo, A. H. D. d.; Franca, J. V. S. d.; Lima, J. P. M. S.
AI Summary
- RevelioPlots is an interactive web application designed to assess the quality of AI-predicted protein structures by integrating statistical pLDDT score analysis with confidence-colored Ramachandran plots.
- It supports both individual and batch model uploads, using B-factors as a fallback for pLDDT values.
- Testing with example proteins showed that RevelioPlots effectively highlights correlations between low pLDDT scores and sterically disallowed regions, aiding non-expert researchers in model quality assessment.
Abstract
High-accuracy protein structure prediction by deep learning requires rigorous model quality assessment, a process currently hampered by fragmented, non-interactive tools designed for older experimental data formats. We present RevelioPlots, an open-source, interactive web application (Python/Streamlit) that simplifies and streamlines the assessment of AI-predicted protein structure quality. Its key feature is the combination of statistical pLDDT score analysis (mean, median, box plots) with an interactive, confidence-colored Ramachandran plot. This integration establishes a direct visual link between a model's predicted local reliability (pLDDT) and its stereochemical feasibility (backbone geometry). RevelioPlots handles both individual and batch-uploaded models, intelligently falling back to B-factors as a proxy for pLDDT values. Using example model proteins, we demonstrated the tool's effectiveness, revealing differences in reliability and a clear visual correlation between regions of low pLDDT scores and residues in sterically disallowed regions. By unifying these critical metrics, RevelioPlots empowers inexperienced researchers to quickly and intuitively assess, compare, and interpret structural model quality, enabling a more confident and integrated use of predicted data.
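The B-factor fallback mentioned in this abstract relies on a common convention: AlphaFold-style PDB files store per-residue pLDDT in the temperature-factor column of each ATOM record. The sketch below shows that idea with the standard library only. It is not RevelioPlots code; the function names, the choice to read one value per C-alpha atom, and the cutoff of 50 for "low confidence" are assumptions for illustration.

```python
import statistics

def atom_line(serial, name, res_name, res_seq, b_factor):
    """Format a fixed-column PDB ATOM record (coordinates zeroed for the demo)."""
    return "ATOM  %5d %-4s %3s A%4d    %8.3f%8.3f%8.3f%6.2f%6.2f" % (
        serial, name, res_name, res_seq, 0.0, 0.0, 0.0, 1.0, b_factor)

def plddt_per_residue(pdb_text):
    """Read the B-factor field (columns 61-66) of C-alpha atoms as pLDDT."""
    scores = []
    for line in pdb_text.splitlines():
        # Atom name occupies columns 13-16; keep one value per residue via CA.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    return scores

def summarize(scores, low_cutoff=50.0):
    """Mean, median, and fraction of residues below an assumed low-confidence cutoff."""
    return {
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "fraction_low": sum(s < low_cutoff for s in scores) / len(scores),
    }

# Tiny synthetic structure: three residues, plus one backbone N atom that is skipped.
demo_pdb = "\n".join([
    atom_line(1, " N  ", "MET", 1, 90.0),
    atom_line(2, " CA ", "MET", 1, 92.5),
    atom_line(3, " CA ", "ALA", 2, 40.0),
    atom_line(4, " CA ", "GLY", 3, 71.0),
])
scores = plddt_per_residue(demo_pdb)
summary = summarize(scores)
print(scores, summary)
```

Fixed-column slicing is used because the PDB format is column-based; whitespace splitting can fail when adjacent fields run together.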
bioinformatics 2026-02-24 v1
Beyond alignment: synergistic integration is required for multimodal cell foundation models
Richter, T.; Zimmermann, E.; Hall, J.; Theis, F. J.; Raghavan, S.; Winter, P. S.; Amini, A. P.; Crawford, L.
AI Summary
- The study introduces the Synergistic Information Score (SIS) to measure the information gain from cross-modal interactions in multimodal cell foundation models, addressing the limitation of alignment-based fusion methods which only detect linear redundancies.
- Benchmarking on spatial transcriptomics datasets showed that tasks with linear redundancies are well-handled by unimodal models, while complex tasks benefit from synergy-aware integration.
- The analysis suggests that for standard tasks, fine-tuning a dominant unimodal model is sample-efficient, but multimodal frameworks are advantageous when tasks require information from multiple modalities.
Abstract
The vision of a "virtual cell" - a computational model that simulates biological function across modalities and scales - has become a defining goal in computational biology. While powerful unimodal foundation models exist, the lack of large-scale paired data prohibits the joint training of multimodal approaches. This scarcity favors compositional foundation models (CFMs): architectures that fuse frozen unimodal experts via a learned interface. However, it remains unclear when this multimodal fusion adds task-relevant information beyond the strongest unimodal representation and when it merely aggregates redundant signal. Here, we introduce the Synergistic Information Score (SIS), a metric grounded in partial information decomposition (PID), that quantifies the information gain achievable only through cross-modal interactions. Extending theoretical results from self-supervised learning, we show that standard alignment-based fusion objectives on frozen encoders inherently collapse to detecting linear redundancies, limiting their ability to capture nonlinear synergistic states. This distinction is directly relevant for tasks aiming to link tissue morphology and gene expression. Benchmarking ten fusion methods on spatial transcriptomics datasets, we use SIS to demonstrate that tasks dominated by linear redundancies are sufficiently served by unimodal baselines, whereas complex niche definitions benefit from synergy-aware integration objectives that enable cross-modal interactions beyond linear alignment. Finally, we perform a scaling analysis which highlights that fine-tuning a dominant unimodal expert is the most sample-efficient path for standard tasks, suggesting that the benefits of multimodal frameworks only emerge when tasks depend on information distributed across modalities. 
Together, these results establish that building towards a virtual cell will require a fundamental shift from alignment objectives that emphasize shared structure to synergy-maximizing integration that preserves and exploits complementary cross-modal signal.
bioinformatics 2026-02-24 v1
RSTG: Robust Generation of High Quality Spatial Transcriptomics Data using Beta Divergence Based AutoEncoder
Halder, A.; Ghosh, A.; Bandyopadhyay, S.
AI Summary
- The study addresses the challenge of insufficient data in spatial transcriptomics by proposing RSTG, an autoencoder with a β-ELBO loss, to generate high-quality synthetic data.
- RSTG uses variational inference to uncover the intrinsic structure of the data, enhancing interpretability and robustness.
- Validation on datasets from the dorsolateral cortex and brain showed RSTG's superior performance in recovering cellular positions and its robustness to data contamination like noise and outliers.
Abstract
One of the key challenges in spatial transcriptomics data analysis is the lack of sufficient data to train the models. To address this shortcoming, multiple generative models have been developed to generate synthetic spatial transcriptomics samples in a controlled environment. However, these often fail at out-of-the-box generation in the presence of noise (such as outliers). To tackle this challenge, we propose RSTG (Robust Spatial Transcriptomic Generator), an autoencoder incorporating a β-ELBO loss, to generate high-quality realistic spatial transcriptomic sequences. Our model uncovers the data's intrinsic structure by approximating its underlying distribution through variational inference, resulting in more interpretable and robust density estimation. We validate the effectiveness of RSTG across multiple tasks, including the recovery of cellular positions in both the 2D spatial and location domains. Our method shows improved performance both qualitatively and quantitatively on multiple datasets from the dorsolateral cortex and the brain using MERFISH and Visium technologies. We further illustrate the robustness of our model to outliers by contaminating a portion of the data with possible anomalies (such as white noise, batch effects, and dropouts). Promising results show that our proposal maintains high quality and stability even when the training data are contaminated, across a variety of experimental settings and in comparison with existing approaches.
bioinformatics 2026-02-24 v1
SPrOUT: A computational and targeted sequencing approach for mixed plant DNA identification with Angiosperms353
Hu, N.; Bullock, M. R.; Jackson, C.; Miller, C.; Hunter, E.; Huff, C.; Chen, Y.; Handy, S.; Johnson, M.
AI Summary
- This study introduces SPrOUT, a method using the Angiosperms353 target sequencing kit to identify plant species in mixed samples through phylogenetic inference.
- The approach achieves high accuracy (98.1-99.6%) and precision (92.9-100%) for in-silico mixes, and 90.7% accuracy with 98.0% precision for mock supplement mixtures.
- The method effectively identifies taxa in mixed plant DNA samples, providing a practical framework for various applications.
Abstract
Premise: The identification of plant species from mixed samples is crucial in various fields, including ecological surveys, conservation efforts, and food and dietary supplement safety. Traditional methods face potential challenges due to the high costs of DNA sequencing, inefficiencies in computational workflows, and incomplete sequence databases. Methods and Results: This study introduces a novel approach using the Angiosperms353 target sequencing kit for efficient taxonomic identification of angiosperm DNA in mixed samples. Our method assembles short paired-end reads for each mixed sample. Using gene sets of Angiosperms353 from 871 species, we apply phylogenetic inference to categorize the variance in phylogenetic distance across genes to identify the presence of taxa in mixed plant samples. The pipeline reaches 98.1 to 99.6% accuracy, 92.9 to 100% precision for identifying unknown taxa in in-silico mixes, and 90.7% accuracy and 98.0% precision for mock supplement mixtures. We explored the parameter cutoffs of the pipeline to offer an empirical range for different applications. Conclusions: The Angiosperms353 and HybPiper assembly proved effective in sorting mixed plant DNA samples. Our method offers a framework for scientific and practical applications in plant species identification in both single and mixed samples.
bioinformatics 2026-02-23 v1
Comprehensive top-down mass spectral repository enables pan-dataset analysis and top-down spectral prediction
Li, K.; Liu, K.; Fulcher, J. M.; Tang, H.; Liu, X.
AI Summary
- The study introduces TopRepo, a comprehensive repository of over 18 million top-down mass spectrometry (TD-MS) spectra from 12 species, creating a large-scale spectral library with over 5 million annotated spectra.
- TopRepo facilitates pan-dataset analyses of proteoform characteristics like N-terminal processing and mass shifts.
- The repository enhances proteoform identification via spectral library searching and supports training deep learning models for accurate TD-MS spectral prediction.
Abstract
Mass spectral libraries have become essential resources for training deep learning (DL) models for spectral prediction and de novo sequencing in bottom-up mass spectrometry (BU-MS). Compared with BU-MS, top-down MS (TD-MS) offers unique advantages for characterizing intact proteoforms by analyzing proteoforms without enzymatic digestion. Despite these advantages, large-scale spectral libraries for TD-MS are currently lacking. Here we present TopRepo, the first comprehensive repository of TD-MS spectra, comprising more than 18 million spectra acquired from 12 species across eight types of mass spectrometers. Using TopRepo, we constructed a large-scale top-down spectral library containing over 5 million spectra with curated proteoform and fragment-ion annotations. We demonstrate that TopRepo enables pan-dataset analyses of N-terminal processing, mass shifts, and other proteoform characteristics identified by TD-MS. Furthermore, we show that the TopRepo spectral library substantially improves proteoform identification through spectral library searching and supports the training of DL models for high-accuracy top-down spectral prediction.
bioinformatics 2026-02-23 v1
Inference of cancer driver mutations from tumor microenvironment composition: a pan-cancer study with cross-platform external validation
Baker, E. A.; Mehaffy, N. S.
AI Summary
- This study investigated whether tumor microenvironment (TME) composition can predict cancer driver mutations across glioblastoma, breast, lung adenocarcinoma, and colorectal cancers using machine learning models trained on RNA-seq data from TCGA.
- The models were externally validated on independent cohorts, achieving AUC >0.65 for 14 out of 15 driver-cancer pairs, with top performance for ERBB2 in breast cancer (AUC=0.980).
- TME-predicted ERBB2 status was associated with overall survival in breast cancer, and the study highlighted the complexity of predicting KRAS mutations in lung adenocarcinoma due to co-mutant profiles.
Abstract
Cancer driver mutations shape the tumor microenvironment (TME), yet whether TME composition alone can predict genotype has not been systematically evaluated across cancers with external validation. We trained machine learning models to predict driver mutation status from TME cell-type composition signatures derived from bulk transcriptomes. Tissue-specific TME signatures (22-28 programs per cancer) were scored from RNA-seq data in TCGA for glioblastoma (GBM, n=157 total; n=90 EGFR-amplification evaluable), breast cancer (BRCA, n=1,082 total; n=994 evaluable), lung adenocarcinoma (LUAD, n=510 total; n=502 evaluable), and colorectal cancer (CRC, n=592 total; n=524 evaluable), then externally validated on independent cohorts spanning different platforms: CPTAC (GBM, n=65), METABRIC (BRCA, n=1,859), GSE72094 (LUAD, n=442), and GSE39582 (CRC, n=585). Of 15 driver-cancer pairs tested, 14 achieved external AUC >0.65, with top performance for ERBB2 amplification in BRCA (AUC=0.980), BRAF mutation in CRC (0.899), and TP53 mutation in BRCA (0.871). TME-predicted ERBB2 status stratified overall survival in METABRIC (Cox HR=1.73, p=7.95x10^-8). Marginal KRAS performance in LUAD (AUC=0.615) reflected opposing TME profiles in KRAS+STK11 versus KRAS+TP53 co-mutant tumors. These results demonstrate that TME composition encodes sufficient information to infer driver mutations across cancers.
bioinformatics 2026-02-23 v1
GlycoForge generates realistic glycomics data under known ground truth for rigorous method benchmarking
Hu, S.; Bojar, D.
AI Summary
- GlycoForge is introduced as a tool for simulating realistic glycomics data with known ground truths, addressing the challenge of simulating data with controlled effects and biases.
- It supports the creation of synthetic data with specified motif-level effects, batch effects, and realistic missing data scenarios.
- The utility of GlycoForge was demonstrated by evaluating batch effect correction algorithms, providing guidelines for their application in real-world glycomics data analysis.
Abstract
Quantifying all complex carbohydrates in a sample produces glycomics data, which constitutes compositional data and is stymied by biosynthetic dependencies between glycans, requiring dedicated analytic workflows. Properly assessing such methods frequently requires simulated data with known ground truths and injectable effects. However, simulating glycomics data, especially with control over effects and biases, is still unsolved. Here, we present GlycoForge, a feature-complete solution for simulating comparative glycomics data. GlycoForge supports simulating fully synthetic glycomics data, with specified motif-level effects, drawn from Dirichlet distributions, and templated simulations based on real-world data. We further support the injection of batch effects, both mean and variance shifts, via centered log-ratio transformations to maintain compositional closure, and realistic missing data simulation. We showcase the utility of GlycoForge by evaluating batch effect correction algorithms for glycomics data, with automated guidelines for when to use such methods on real-world data. GlycoForge is available as an open-access Python package at https://github.com/BojarLab/GlycoForge.
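The combination of Dirichlet sampling and log-ratio batch shifts described in this abstract can be sketched with NumPy. This is not the GlycoForge API: the function names, the concentration parameters, and the shift vector are hypothetical, and only a mean shift is shown (the variance shift mentioned in the abstract is omitted). The point is that adding an offset in centered log-ratio (CLR) space and mapping back keeps every sample summing to 1.

```python
import numpy as np

def clr(x):
    """Centered log-ratio transform of strictly positive compositions (rows)."""
    logx = np.log(x)
    return logx - logx.mean(axis=-1, keepdims=True)

def inverse_clr(z):
    """Map CLR coordinates back to the simplex (a numerically stable softmax)."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def inject_batch_shift(compositions, shift):
    """Add a per-feature mean shift in CLR space, then close back to proportions."""
    return inverse_clr(clr(compositions) + shift)

rng = np.random.default_rng(0)
alpha = np.array([5.0, 3.0, 1.5, 0.5])        # hypothetical Dirichlet concentrations
samples = rng.dirichlet(alpha, size=6)        # 6 samples x 4 glycans, rows sum to 1
shift = np.array([0.4, -0.1, -0.1, -0.2])     # hypothetical batch effect (sums to 0)
shifted = inject_batch_shift(samples, shift)

print(shifted.sum(axis=1))                    # each row is still a closed composition
```

Because the example shift sums to zero, the difference between the CLR coordinates before and after injection equals the shift itself; a shift with nonzero mean would be re-centered by the closure step.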
bioinformatics 2026-02-23 v1
Skip-Zeros Variational Inference in the Million-Cell Era of Single-Cell Transcriptomics
Shimamura, T.; Yuki, S.; Abe, K.
AI Summary
- The study introduces UNISON, a scalable framework for matrix factorization in single-cell RNA sequencing, using skip-zeros variational inference to handle the sparsity of large datasets.
- UNISON performs inference using only nonzero elements, improving efficiency and scalability, and was tested on over one million cells from the Mouse Organogenesis Cell Atlas.
- Application to cross-species analysis showed UNISON's ability to distinguish conserved and species-specific transcriptional programs, enhancing understanding of biological processes like glaucoma.
Abstract
Combinatorial indexing-based single-cell RNA sequencing methods such as sci-RNA-seq and sci-RNA-seq3 now enable the profiling of millions of cells, producing expression matrices that are both extremely sparse and high-dimensional. Conventional nonnegative matrix factorization (NMF) provides an interpretable framework for uncovering latent biological structures but is computationally prohibitive at this scale, as it requires explicit access to the vast number of zero entries. We introduce UNISON (Unified Sparse-Optimized Nonnegative factorization), a scalable framework for matrix and tensor factorization based on skip-zeros variational inference. By reformulating stochastic variational Bayes updates in terms of sufficient statistics, UNISON performs inference using only nonzero elements, while implicitly accounting for zeros through geometric sampling. This strategy enables efficient parameter estimation without matrix expansion and naturally accommodates multiple experimental contexts. Simulation studies show that UNISON is robust to diverse learning-rate schedules and mini-batch sizes, providing practical guidelines for optimization. Application to the Mouse Organogenesis Cell Atlas demonstrates scalability to over one million cells, yielding latent factors that capture developmental trajectories and lineage-specific signatures with improved interpretability compared to existing methods. Cross-species analysis of aqueous humor outflow pathways across five vertebrate species further highlights UNISON's ability to disentangle conserved from species-specific transcriptional programs and to recover biologically meaningful gene-gene and gene-phenotype relationships relevant to glaucoma. By efficiently exploiting sparsity while preserving interpretability, UNISON establishes a principled and practical solution for integrative, large-scale single-cell transcriptomics.
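Why zero entries need not be visited explicitly can be seen in a simpler setting than the one this abstract describes. The sketch below is not UNISON and does not implement its variational updates or geometric sampling; it shows a standard identity for a plain Poisson factorization, where the log-likelihood splits into a sum over nonzero counts plus a total-rate term that factorizes over the two factor matrices. All names and the toy data are assumptions.

```python
import numpy as np
from math import lgamma

def poisson_loglik_sparse(rows, cols, vals, W, H):
    """Poisson log-likelihood of X ~ Poisson(W @ H) using only nonzero entries.

    rows, cols, vals: COO triplets of the nonzero counts.
    W: (n_cells, k) and H: (k, n_genes), both strictly positive.
    """
    # Rates at the nonzero positions only: sum_k W[i, k] * H[k, j].
    rates_nz = np.einsum("ik,ki->i", W[rows], H[:, cols])
    nz_term = np.sum(vals * np.log(rates_nz)) - sum(lgamma(v + 1.0) for v in vals)
    # Sum of all rates, zeros included, without forming W @ H:
    # sum_ij sum_k W[i, k] H[k, j] = sum_k (sum_i W[i, k]) * (sum_j H[k, j]).
    total_rate = W.sum(axis=0) @ H.sum(axis=1)
    return float(nz_term - total_rate)

def poisson_loglik_dense(X, W, H):
    """Reference computation that touches every entry, for comparison only."""
    rates = W @ H
    log_fact = np.vectorize(lambda v: lgamma(v + 1.0))(X)
    return float(np.sum(X * np.log(rates) - rates - log_fact))

rng = np.random.default_rng(1)
W = rng.gamma(2.0, 0.5, size=(8, 3))
H = rng.gamma(2.0, 0.1, size=(3, 10))
X = rng.poisson(W @ H).astype(float)          # toy count matrix with many zeros
rows, cols = np.nonzero(X)
vals = X[rows, cols]

sparse_ll = poisson_loglik_sparse(rows, cols, vals, W, H)
dense_ll = poisson_loglik_dense(X, W, H)
print(sparse_ll, dense_ll)
```

The sparse version costs time proportional to the number of nonzeros plus the size of the factor matrices, which is the property that makes million-cell matrices tractable in this family of models.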
bioinformatics 2026-02-23 v1
What makes a banana false? How the genome of Ethiopian orphan staple Ensete ventricosum differs from the banana A and B sub-genomes
Muzemil, S.; Paul, P.; Baxter, L.; Dominguez-Ferreras, A.; Sahu, S. K.; Van Deynze, A.; Mai, G.; Yemataw, Z.; Tesfaye, K.; Ntoukakis, V.; Studholme, D. J.; Grant, M.
AI Summary
- The study sequenced the genome of the Ensete ventricosum landrace Mazia, identifying 38,940 protein-coding genes, with the assembly being more complete than the previously published Bedadeti genome.
- Comparative analysis showed about 25% of the Mazia genome is unique to enset, with distinct functional signatures related to DNA integration, carbohydrate metabolism, disease resistance, and transcriptional regulation.
- The research highlights the potential for marker-assisted breeding in enset, providing a foundation for improving agronomically important traits through comparative genomics within the Musaceae family.
Abstract
Background: Ensete ventricosum, also known as the "tree against hunger", plays a key role in Ethiopian food security and farming systems, feeding more than 20 million people. Since domestication via clonal selection in the south-west Ethiopian highlands, today's diverse enset landraces contribute multiple benefits including food, fibre by-product, animal bedding and cattle fodder to farmers and local communities. Improved genomic resources for this highly drought-tolerant plant are essential to supplement the conventional clonal selection-based breeding programme and pave the way towards targeted breeding. Results: We sequenced the genome of enset landrace Mazia, which is partially resistant/tolerant to Xanthomonas wilt, and predicted 38,940 protein-coding genes. The Mazia assembly (540.14 Mb) is more complete than the previously published genome assembly of landrace Bedadeti (451.28 Mb) and displayed 1.41% heterozygosity and 64.64% repetitive DNA content. Comparative analyses with the Bedadeti assembly and chromosome-level genome sequences of the two main banana progenitors (Musa acuminata, AA genome; Musa balbisiana, BB genome) unexpectedly revealed that ~25% of the Mazia genome is unique to enset. Gene Ontology (GO) and sequence similarity search analysis of enset-specific protein-coding genes identified distinct functional signatures that underpin the lifestyle, adaptation, and corm productive quality of enset, including functions related to DNA integration, carbohydrate metabolism, disease resistance and transcriptional regulation. In contrast, Musa-specific genes showed enrichment for defence response, protein phosphorylation and fruit development pathways. Focusing on the classical nucleotide binding site leucine rich repeat (NLR) disease resistance genes, we identified and characterised NLRs in enset and Musa species genomes, revealing a considerable expansion in the Musa acuminata genome. 
We also identified unique genes in enset and banana genomes whose functional and evolutionary roles are yet to be determined. Conclusions: Here, we report a de novo genome assembly for the enset (Ensete ventricosum) landrace Mazia and provide a high-quality annotation of both Mazia and the previously published assembly of the landrace Bedadeti. Collectively, these genomic resources provide a valuable foundation for comparative genomics within the Musaceae family and open new opportunities for the development of marker-assisted breeding strategies to accelerate the improvement of agronomically important traits in enset. Keywords: Ensete ventricosum, Musa, gene families, nucleotide binding site leucine rich repeat (NLR), orphan crop.
bioinformatics 2026-02-23 v1
An Integrated and Configurable End-to-End Pipeline for Longitudinal Cell Painting Analysis
Zhao, G.
AI Summary
- The study introduces SCALE, an end-to-end pipeline for analyzing longitudinal cell painting data, addressing challenges like imaging variability and time consistency.
- SCALE integrates nucleus-centered segmentation, quality control, feature extraction, and signal aggregation in a modular framework.
- The pipeline's effectiveness was demonstrated with a chronic radiation exposure dataset, showing its capability for consistent longitudinal analysis.
Abstract
Cell painting assays generate high-dimensional, multi-channel imaging data that enable systematic characterization of cellular phenotypes. Increasingly, such assays are performed in longitudinal settings and under chronic perturbations, introducing additional challenges related to imaging variability, focus-field heterogeneity, and consistency across time points. Existing analysis workflows often require substantial manual adaptation to handle these complexities, limiting scalability and reproducibility. In this paper, we propose SCALE (Stable Cell painting Analysis for Longitudinal Experiments), an integrated, end-to-end analysis pipeline designed for robust longitudinal analysis of cell painting data. The pipeline combines nucleus-centered segmentation, automated quality control, feature extraction, and signal aggregation within a modular and configurable framework. Once assay-specific configurations are specified, the pipeline executes in a fully automated manner from raw images to downstream summary statistics and analysis-ready outputs. We demonstrate the utility of the pipeline using a chronic radiation exposure cell painting dataset, illustrating its ability to support consistent longitudinal comparisons across conditions and time points.
bioinformatics 2026-02-23 v1
LinkDTI: Drug-Target Interactions prediction through a Link Prediction framework on Biomedical Knowledge Graph
Mondal, M.; Arunachalam, S.; Wu, S.; Datta, A.
AI Summary
- LinkDTI is a computational framework that predicts drug-target interactions (DTIs) by analyzing connections in a heterogeneous biomedical knowledge graph using a modified GraphSAGE model.
- It employs negative sampling to balance data and outperforms baseline methods by at least 2.5% in AUROC and AUPRC.
- The framework identified 945 new potential DTIs, a 49.14% increase over known interactions.
Abstract
Computational drug-target interaction (DTI) prediction serves as a valuable tool for drug discovery and repurposing by cost-effectively narrowing down the potential drug-target space. This paper presents LinkDTI, a computational framework that predicts DTIs by identifying connections within a heterogeneous knowledge graph of drugs, proteins, diseases, and side effects. Unlike methods that rely on mathematical techniques like matrix completion or similarity-based scoring, LinkDTI uses a graph-based approach to capture relationships between biomedical entities. Specifically, LinkDTI applies a modified version of the multilayer GraphSAGE model that learns from the heterogeneous knowledge graph and predicts potential drug-target interactions. Our model incorporates negative sampling to balance the data, addressing the problem that negative interactions outnumber positive ones. Our results show that LinkDTI consistently outperforms baseline methods in AUROC and AUPRC by at least 2.5% across different sampling ratios and conditions. Subsequently, it identifies approximately 945 new potential DTIs, marking a 49.14% increase over known DTIs. Overall, LinkDTI offers a simple yet effective method for integrating diverse biomedical data to identify potential drug-target interactions.
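As a rough illustration of the negative-sampling idea in this abstract, the sketch below draws unobserved drug-target pairs at a chosen negative-to-positive ratio. The function name `sample_negatives`, the toy identifiers, and the uniform sampling scheme are assumptions made for illustration; the abstract does not specify how LinkDTI samples negatives.

```python
import random


def sample_negatives(positives, drugs, targets, ratio=1, seed=0):
    """Draw unobserved (drug, target) pairs, `ratio` negatives per positive.

    Hypothetical helper: uniform sampling over all unobserved pairs.
    """
    rng = random.Random(seed)
    known = set(positives)
    # Every pair not listed as a known interaction is a candidate negative.
    candidates = [(d, t) for d in drugs for t in targets if (d, t) not in known]
    # Never request more negatives than there are candidates.
    n_wanted = min(len(candidates), ratio * len(known))
    return rng.sample(candidates, n_wanted)


# Toy identifiers, not taken from any real dataset.
drugs = ["D1", "D2", "D3"]
targets = ["T1", "T2", "T3", "T4"]
positives = [("D1", "T1"), ("D2", "T3"), ("D3", "T4")]
negatives = sample_negatives(positives, drugs, targets, ratio=2)
```

Varying `ratio` mimics the "different sampling ratios" mentioned in the abstract. Separately, the reported figures imply a known-interaction count of roughly 945 / 0.4914 ≈ 1,923, although the abstract does not state that number directly.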
bioinformatics 2026-02-23 v1
Structure-Based TCR-pMHC Binding Prediction and Generalization to Unseen Peptides
Abeer, A. N. M. N.; Roy, R. S.; Qian, X.; Yoon, B.-J.
AI Summary
- This study investigates the generalization performance of graph neural network (GNN)-based classifiers for predicting TCR-pMHC binding, focusing on their accuracy with unseen peptides.
- The research assesses factors like interaction features and structural uncertainty that affect classifier performance.
- By designing classifier architecture with auxiliary training objectives, the study shows improved generalization to novel peptides.
Abstract
The interaction between T-cell receptors (TCRs) and the peptide-bound major histocompatibility complex (pMHC) shapes the functional specificity of the T-cell-mediated adaptive immune response. Its implications for immunotherapy have driven a growing body of computational methods for TCR recognition, which now include structure-based approaches thanks to advances in protein structure modeling. Despite access to structural information of the predicted binding interface, graph neural network (GNN)-based TCR-pMHC binding specificity classifiers tend to show poor accuracy for samples with unseen peptides. In this work, we comprehensively assess the potential factors that critically impact the generalization performance of classifiers trained with computationally predicted structures. Specifically, our experiments analyze the sensitivity of such predictors to the interaction features in the TCR-pMHC interface and to structural uncertainty. Building on this analysis, we demonstrate how designing the classifier architecture with auxiliary training objectives can improve generalization to novel peptides not seen during model training. Overall, our work highlights the challenges of unseen-peptide generalization from different perspectives of the GNN-based classifier paradigm, showcasing the strengths and weaknesses of current state-of-the-art approaches.
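To make the "unseen peptides" setting concrete, the sketch below shows one common way to build such an evaluation: hold out whole peptides so that no test peptide appears in training. The helper `split_by_peptide`, the record layout, and the placeholder identifiers are assumptions for illustration; the abstract does not describe the authors' actual splitting protocol.

```python
import random


def split_by_peptide(samples, test_fraction=0.25, seed=0):
    """Split samples so that no peptide appears in both train and test."""
    rng = random.Random(seed)
    # Shuffle the distinct peptides, then hold out a fraction of them entirely.
    peptides = sorted({s["peptide"] for s in samples})
    rng.shuffle(peptides)
    n_test = max(1, round(test_fraction * len(peptides)))
    test_peptides = set(peptides[:n_test])
    train = [s for s in samples if s["peptide"] not in test_peptides]
    test = [s for s in samples if s["peptide"] in test_peptides]
    return train, test


# Placeholder records; identifiers and labels are made up for the example.
samples = [
    {"tcr": "TCR_A", "peptide": "PEP_1", "label": 1},
    {"tcr": "TCR_B", "peptide": "PEP_1", "label": 0},
    {"tcr": "TCR_C", "peptide": "PEP_2", "label": 1},
    {"tcr": "TCR_D", "peptide": "PEP_3", "label": 0},
    {"tcr": "TCR_E", "peptide": "PEP_3", "label": 1},
    {"tcr": "TCR_F", "peptide": "PEP_4", "label": 0},
]
train, test = split_by_peptide(samples)
```

A random per-sample split would instead let the same peptide appear on both sides, which tends to overstate how well a classifier generalizes to novel peptides.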
bioinformatics 2026-02-23 v1
A Spatio-Temporal Analysis Framework for Characterizing Radiation-Induced Genomic Instability
Chopra, K.; Cucinell, C.; Weinberg, R.; Forrester, S.; Brettin, T.; Kilic, O.; Yoon, B.-J.
AI Summary
- This study developed an analytical framework to investigate the coupling between structural variants and point mutations in human endothelial cells exposed to chronic low-dose gamma radiation.
- The framework revealed a 7.13-fold enrichment of doublet base substitutions (DBS) near inversion breakpoints, with this enrichment diminishing with distance.
- Temporal analysis showed inversions were transient, while DBS persisted, affecting genes critical for genomic stability like DNA damage response and chromatin regulation.
Abstract
Chronic low-dose ionizing radiation induces complex genomic instability encompassing both structural variants and point mutations, yet these alterations are typically analyzed as independent events, limiting detection of mechanistic coupling between rearrangement formation and localized mutagenesis at breakpoint junctions. This gap is particularly consequential given the widespread occupational and environmental exposure contexts (nuclear energy, medical imaging, and environmental contamination), where coupled genomic alterations may contribute to cancer risk through mechanisms invisible to type-agnostic analyses. We developed an integrated analytical framework combining temporal pattern tracking, breakpoint-proximal mutation enrichment analysis, and systematic testing across all structural variant types to resolve these coupled dynamics across dose and time. Applying this framework to whole-genome sequencing data from primary human endothelial cells (HUVEC) exposed to chronic low-dose gamma radiation (0.001-2 mGy/hr) over three weeks, we discovered a 7.13-fold enrichment of doublet base substitutions (DBS) within 10 bp of inversion breakpoints, a signal absent from other structural variant types. This enrichment decayed sharply with distance (to ~1.9-fold at 100 bp), indicating localized mutagenesis at these junctions. Temporal analysis revealed divergent fates: inversions appeared transiently (100% single-timepoint) while DBS showed greater persistence (9.0% multi-timepoint). Among the INV-DBS events identified, affected genes include 16 high-constraint loci (pLI ≥ 0.9) involved in DNA damage response, signal transduction, and chromatin regulation, pathways critical for maintaining genomic stability. Our framework provides a generalizable approach for investigating structural variant-mutation relationships, with applications to radiation biology, cancer genomics, and mechanistic studies of DNA repair fidelity.
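One simple way to read a figure such as "7.13-fold enrichment within 10 bp" is as the mutation density inside windows around breakpoints divided by the genome-wide mutation density. The sketch below implements that definition on a single toy chromosome. The function `fold_enrichment`, the genome-wide background, and the toy coordinates are assumptions for illustration; the abstract does not state which background model the authors use.

```python
def fold_enrichment(mutations, breakpoints, window, genome_length):
    """Mutation density within `window` bp of any breakpoint vs. genome-wide density."""
    if not mutations or not breakpoints:
        raise ValueError("need at least one mutation and one breakpoint")
    # Half-open windows [b - window, b + window + 1), clipped to the genome.
    intervals = sorted(
        (max(0, b - window), min(genome_length, b + window + 1))
        for b in breakpoints
    )
    # Merge overlapping windows so shared bases are counted once.
    merged = []
    for start, end in intervals:
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    covered = sum(end - start for start, end in merged)
    near = sum(any(s <= m < e for s, e in merged) for m in mutations)
    local_density = near / covered
    background_density = len(mutations) / genome_length
    return local_density / background_density


# Toy coordinates on a 10 kb "genome"; not real data.
mutations = [995, 1003, 5008, 2000, 3000, 7000, 8000, 9000]
breakpoints = [1000, 5000]
fe = fold_enrichment(mutations, breakpoints, window=10, genome_length=10_000)
```

Repeating the calculation with a larger `window` (for example 100) gives the kind of distance-decay comparison the abstract describes, under the same simplifying assumptions.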
bioinformatics 2026-02-23 v1
art_modern: An Accelerated ART Simulator of Diverse Next-Generation Sequencing Reads
YU, Z.
AI Summary
- The study introduces art_modern, an accelerated version of the ART simulator for next-generation sequencing (NGS) data, enhanced with updated algorithms, SIMD instructions, and parallel processing.
- art_modern supports simulation of transcriptome profiling with contig-specific coverage and strand information.
- Benchmarking showed art_modern reduces CPU time by 75-77% and accelerates wall-clock time by 15-24 times compared to the original ART on multi-core systems.
Abstract
Fast simulation of next-generation sequencing (NGS) data is vital for software development and benchmarking. Here we describe art_modern, an accelerated ART simulator that can simulate various NGS data. We accelerated ART using updated sampling algorithms, single-instruction multiple-data (SIMD) instruction-set extensions (ISEs), thread- and node-level parallelism, and an asynchronous output writer, while enabling simulation of transcriptome profiling data by supporting contig-specific coverage with strand information. The new implementation was benchmarked against popular performance-oriented NGS simulators, revealing a 75-77% reduction in CPU time and a 15-24 times acceleration in wall-clock time on a multi-core machine compared to the original implementation. With this simulator, the development and benchmarking of NGS sequence analysis algorithms can be substantially accelerated. Availability and Implementation: The software is implemented in C++17 with CMake as the build system. It can be built and executed on a modern GNU/Linux operating system with Boost, Zlib, and a C++17 compiler, with further acceleration available using Intel oneAPI C++/DPC++ compilers and Intel oneAPI MKL random generators. The software is available at https://github.com/YU-Zhejian/art_modern under the GNU General Public License v3.
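The two benchmark figures use different units: a percentage reduction in CPU time and a multiplicative speedup in wall-clock time. The short sketch below converts the former into the latter's units so the two can be compared. The helper name `speedup_from_reduction` is hypothetical and the calculation is plain arithmetic, not part of art_modern.

```python
def speedup_from_reduction(reduction):
    """Convert a fractional time reduction (0.75 means 75%) to a speedup factor."""
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    # If the new time is (1 - reduction) of the old, speedup is its reciprocal.
    return 1.0 / (1.0 - reduction)


low = speedup_from_reduction(0.75)   # 4.0
high = speedup_from_reduction(0.77)  # roughly 4.35
```

A 75-77% CPU-time reduction therefore corresponds to about a 4.0-4.3 times reduction in total CPU work. The larger 15-24 times wall-clock figure presumably also reflects the thread-level parallelism and asynchronous output mentioned in the abstract, since it was measured on a multi-core machine.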
bioinformatics 2026-02-23 v1
Universal physical principles govern the deterministic genesis of protein structure
Chuanyang, L.; Liu, J.; Qiu, X.; Wu, X.; Li, W.; Min, L.; Zhang, G.; Zhang, S.; Zhu, L.
AI Summary
- The study introduces ProtGenesis, a framework that models protein genesis as a deterministic process within a discrete structural space, governed by three universal principles: Assembly, Emergence, and Phase-Transition.
- These principles describe how amino acids form fractal-like structures, how peptides follow spatial trajectories, and how mutations lead to topological phase shifts in protein structure.
- ProtGenesis provides a mathematical foundation to interpret deep learning models and offers a basis for engineering protein structures.
Abstract
The origin of functional proteins remains a fundamental biological enigma. Although Anfinsen's dogma established sequence as the determinant of structure, and deep learning models can predict structures with high fidelity, the physical principles governing protein genesis itself, from prebiotic condensation to functional protein emergence, remain unresolved. This gap leaves a critical disconnect between mechanistic biological insights and artificial intelligence. Herein, we introduce ProtGenesis, a unified methodological framework that recasts the genesis of proteins as structured, deterministic navigation within a discrete structural space. We identify three universal principles governing this hierarchical organization: the Assembly Principle directs the condensation of amino acids into multilayer fractal-like architectures; the Emergence Principle ensures that the emergence of nascent peptides follows deterministic spatial trajectories; and the Phase-Transition Principle describes how incremental residue accrual or mutation drives precise topological phase shifts from short-range to long-range order. By quantifying these trajectories with novel tripartite spatial metrics, we reveal that protein genesis is not an abstract continuum but a principle-governed physical process with measurable coordinates. ProtGenesis thus provides a universal, interpretable mathematical foundation for decoding the "black box" of deep learning models and establishes a rigorous basis for exploring, understanding, and engineering the molecular blueprint of life.
bioinformatics 2026-02-23 v1