Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
Improved multimodal protein language model-driven universal biomolecules-binding protein design with EiRA
Zeng, W.; Zou, H.; Li, X.; Dou, Y.; Wang, X.; Peng, S.
AI Summary
- The study introduces EiRA, a generative model for designing proteins that bind to various biomolecules, using a two-stage post-training process on a multimodal protein language model.
- EiRA showed state-of-the-art performance in structural confidence, diversity, novelty, and designability across 8 test sets for 6 biomolecule types, and improved downstream task predictions.
- Experimental validation confirmed a 100% success rate in expressing variants, and EiRA designed a Glucagon peptide binder with micromolar affinity.
Abstract
The interactions between proteins and biomolecules form a complex system that supports life activities. Designing proteins capable of targeted biomolecular binding is therefore critical for protein engineering and gene therapy. Here, we propose a new generative model, EiRA, specifically designed for universal biomolecular-binding protein design, which undergoes two-stage post-training, i.e., domain-adaptive masking training and binding site-informed preference optimization, based on a general multimodal protein language model. A systematic evaluation reveals the state-of-the-art performance of EiRA, including structural confidence, diversity, novelty, and designability, on 8 test sets across 6 biomolecule types. Meanwhile, EiRA provides a better characterization of biomolecular-binding proteins than the generic model, thereby improving the predictive performance of various downstream tasks. We also mitigate severe repetition generation in the original language model by optimizing training strategies and loss. Additionally, we introduced DNA information into EiRA to support DNA-conditioned binder design, further expanding the boundaries of the design paradigm. Experimental validation yielded a 100% success rate (20/20) in expressing highly divergent variants. Remarkably, EiRA achieved the one-shot design of a Glucagon peptide binder with SPR-confirmed micromolar affinity.
bioinformatics 2026-02-24 v3
The phylodynamic threshold of measurably evolving populations
Weber, A.; Kende, J.; Duitama Gonzalez, C.; Oeversti, S.; Duchene, S.
AI Summary
- This study investigates the concepts of measurably evolving populations and the phylodynamic threshold, crucial for molecular clock calibration using sampling times.
- Through simulations and empirical data analysis, it was found that determining these thresholds depends on model assumptions, sampling strategies, and the sensitivity of priors in Bayesian analyses.
- The study emphasizes the importance of assessing prior sensitivity over tests of temporal signal to enhance molecular clock inferences and highlights sampling limitations.
Abstract
The molecular clock is a fundamental tool for understanding the time and pace of evolution, requiring calibration information alongside molecular data. Sampling times are often used for calibration since some organisms accumulate enough mutations over the course of their sampling period. This practice ties together two key concepts: measurably evolving populations and the phylodynamic threshold. Our current understanding suggests that populations meeting these criteria are suitable for molecular clock calibration via sampling times. However, the definitions and implications of these concepts remain unclear. Using Hepatitis B virus-like simulations and analyses of empirical data, this study shows that determining whether a population is measurably evolving or has reached the phylodynamic threshold does not only depend on the data, but also on model assumptions and sampling strategies. In Bayesian applications, a lack of temporal signal due to a narrow sampling window results in a prior that is overly informative relative to the data, such that a prior that is potentially misleading typically requires a wider sampling window than one that is reasonable. In our analyses we demonstrate that assessing prior sensitivity is more important than the outcome of tests of temporal signal. Our results offer guidelines to improve molecular clock inferences and highlight limitations in molecular sequence sampling procedures.
bioinformatics 2026-02-24 v2
Transcriptomic analysis reveals immune signatures associated with specific cutaneous manifestations of lupus in systemic lupus erythematosus
Lee, E. Y.; Patterson, S.; Cutts, Z.; Lanata, C. M.; Dall'Era, M.; Yazdany, J.; Criswell, L. A.; Haemel, A.; Katz, P.; Ye, C. J.; Langelier, C.; Sirota, M.
AI Summary
- This study used transcriptomics from a large cohort of SLE patients to identify molecular pathways associated with ten distinct cutaneous manifestations of SLE.
- Specific immune signatures were found, such as upregulation of type I interferon, TNF-α, and IL6-JAK-STAT3 pathways in subacute cutaneous lupus, suggesting potential therapeutic targets.
- Unexpected findings included the absence of interferon signaling in patients with skin and mucosal ulcers, and roles for CD14+ monocytes in photosensitivity and NK cells in alopecia, mucosal ulceration, and livedo reticularis.
Abstract
Systemic lupus erythematosus (SLE) presents with diverse and heterogeneous cutaneous manifestations. However, the molecular and immunologic pathways driving specific cutaneous manifestations of SLE are poorly understood. Here, we leverage transcriptomics from a large well-phenotyped longitudinal cohort of SLE patients to map molecular pathways linked to ten distinct SLE-related rashes. Through whole blood and immune cell-sorted bulk RNA sequencing, we identified immune signatures specific to cutaneous subtypes of SLE. Subacute cutaneous lupus (SCLE) exhibited broad upregulation of type I interferon, TNF-α, and IL6-JAK-STAT3 pathways, suggesting potential unique therapeutic responses to JAK and type I interferon inhibition. While interferon signaling is prominent in SCLE, discoid lupus, and acute lupus, it is unexpectedly absent in patients with skin and mucosal ulcers. Pathway and cell-type enrichment analysis revealed unexpected roles for CD14+ monocytes in photosensitivity of SLE and NK cells in alopecia, mucosal ulceration, and livedo reticularis. These findings illuminate the immune heterogeneity of rashes in SLE, highlighting subtype-specific mechanistic targets, and presenting opportunities to identify precision therapies for SLE-associated skin phenotypes.
bioinformatics 2026-02-24 v2
Systematic identification of DNA methylation biomarkers for tumor-type-specific detection
Arbona, J. S.; Garcia Samartino, C.; Angeloni, A. R.; Vaquer, C. C.; Wetten, P. A.; Bocanegra, V.; Militello, R. D.; Sanguinetti, G.; Correa, A.; Pellegrini, P.; Carlen, M.; Minatti, W. R.; Vaschalde, G. A.; Perez, R.; Manzino, R. N.; Rodriguez, J. D.; Valdemoros, P.; Sarrio, L.; Ledesma, A.; Campoy, E. M.
AI Summary
- The study developed a browser-based platform integrating genome-wide methylomes with transcriptomes to identify DNA methylation biomarkers, addressing issues like shared epigenetic programs and mixed cellular composition in cancer diagnostics.
- Validation using MSRE-qPCR in colorectal cancer cohorts confirmed effective biomarkers with AUCs of 0.81-1.00.
- The approach also successfully distinguished hepatocellular carcinoma from cirrhotic liver and identified subtype-specific markers in lung cancers.
Abstract
DNA methylation biomarkers for cancer diagnostics often underperform when tumor and background tissues share epigenetic programs, or when complex specimens with mixed cellular composition dilute tumor-derived signals and increase variability. To address these limitations, we developed a gene-centric, browser-based discovery platform that integrates genome-wide methylomes with matched transcriptomes and reference layers spanning pan-cancer tissues and leukocytes, enabling background-aware filtering beyond binary tumor-normal contrasts. Candidate loci are prioritized using combined thresholds on methylation effect size and intra-group variability to penalize stochastic and heterogeneous variation. In colorectal cancer, methylation-sensitive restriction enzyme quantitative PCR (MSRE-qPCR) validation in independent tissue cohorts confirmed multiple candidate loci with AUCs of 0.81-1.00. Using the same framework, MSRE-qPCR validation distinguished hepatocellular carcinoma from cirrhotic liver, and analysis of public tumor methylomes identified subtype-specific markers in lung adenocarcinoma and squamous-cell carcinoma. This resource bridges genome-scale epigenomic discovery with clinically accessible PCR-based methylation assays.
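The prioritization step described in this abstract (combined thresholds on methylation effect size and intra-group variability) can be illustrated with a minimal Python sketch. The abstract does not give the actual statistics or cutoffs, so the choices here are assumptions for illustration only: effect size is taken as the difference in mean beta values, variability as the larger of the two within-group standard deviations, and `min_effect` and `max_sd` are hypothetical thresholds.

```python
from statistics import mean, stdev

def prioritize_loci(tumor, background, min_effect=0.3, max_sd=0.1):
    """Keep loci with a large tumor-vs-background methylation difference
    and low variability inside each group.

    tumor, background: dict mapping locus name -> list of beta values (0..1).
    min_effect, max_sd: hypothetical cutoffs, not taken from the paper.
    """
    selected = []
    for locus in tumor:
        t, b = tumor[locus], background[locus]
        effect = mean(t) - mean(b)        # effect size as difference of means
        spread = max(stdev(t), stdev(b))  # worst-case intra-group variability
        if abs(effect) >= min_effect and spread <= max_sd:
            selected.append((locus, round(effect, 3)))
    # Largest absolute effect first
    return sorted(selected, key=lambda item: -abs(item[1]))

# Toy data: locus_A is consistently hypermethylated in tumor,
# locus_B differs on average but is too heterogeneous,
# locus_C barely differs between groups.
tumor = {
    "locus_A": [0.82, 0.85, 0.80, 0.84],
    "locus_B": [0.90, 0.30, 0.95, 0.40],
    "locus_C": [0.22, 0.25, 0.21, 0.24],
}
background = {
    "locus_A": [0.10, 0.12, 0.08, 0.11],
    "locus_B": [0.10, 0.12, 0.09, 0.11],
    "locus_C": [0.20, 0.22, 0.19, 0.21],
}
ranked = prioritize_loci(tumor, background)
print(ranked)
```

In this toy example only `locus_A` passes both filters: `locus_B` is rejected for its high within-group spread and `locus_C` for its small effect, which is the kind of penalty on heterogeneous variation the abstract refers to.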
bioinformatics 2026-02-24 v1
OligoGraph: A novel geometric graph-based approach for siRNA efficacy prediction
Saligram, S. S.; Kasturi, V. V.; Surkanti, S. R.; Basangari, B. C.; Kondaparthi, V.
AI Summary
- The study introduces OligoGraph, a graph-based deep learning model for predicting siRNA efficacy against mRNA, addressing the limitations of traditional models by handling variable siRNA lengths.
- OligoGraph uses RiNALMo embeddings, GATconv, Transformerconv layers, and self-supervised pretraining, showing superior performance on both seen and unseen datasets.
- Specialized versions for 19- and 21-nucleotide siRNAs outperformed existing models, with significant improvements in AUC-ROC and PCC on various datasets.
Abstract
RNA interference (RNAi) is a biological process in which a small interfering RNA (siRNA) prevents the translation of a messenger RNA (mRNA) into a protein by cleaving the mRNA before translation. We exploit this process to prevent the formation of harmful proteins by using an effective siRNA on the target mRNA. The current rapidly emerging RNAi-based drugs show immense potential for therapeutic applications. Traditionally, designing a potent siRNA for an mRNA requires extensive lab experimentation and trials; therefore, there is a need to develop a model that reliably predicts a siRNA's efficacy against mRNA. This saves both cost and time. However, designing such models is challenging, as the available data are either scarce or biased. The currently available models exhibit limited generalization and are restricted to fixed siRNA lengths of either 19 or 21 nucleotides, limiting flexible use. To address these challenges, we introduce OligoGraph, a graph-based deep learning architecture that operates on the siRNA-mRNA duplex. It leverages RiNALMo embeddings, multiple GATconv and Transformerconv layers, and self-supervised pretraining, and outperforms all other existing models in our testing on seen and unseen data. We implemented specialized OligoGraph variants for 19- and 21-nucleotide siRNAs, both of which outperformed the current state-of-the-art models on unseen data. The 19-nucleotide model yielded AUC-ROC and PCC increases of 1.1% and 4.6% on the Mixset and of 19.07% and 127.3% on the Takayuki dataset, respectively. Furthermore, the 21-nucleotide model improved predictive performance on the Simone dataset by 2.62% (AUC-ROC) and 6.65% (PCC).
bioinformatics 2026-02-24 v1
RevelioPlots: An Interactive Web Application for Fast AI-Based Protein Models Quality Assessment
Fernandes, L. L. d. S.; Azevedo, A. H. D. d.; Franca, J. V. S. d.; Lima, J. P. M. S.
AI Summary
- RevelioPlots is an interactive web application designed to assess the quality of AI-predicted protein structures by integrating statistical pLDDT score analysis with confidence-colored Ramachandran plots.
- It supports both individual and batch model uploads, using B-factors as a fallback for pLDDT values.
- Testing with example proteins showed that RevelioPlots effectively highlights correlations between low pLDDT scores and sterically disallowed regions, aiding non-expert researchers in model quality assessment.
Abstract
High-accuracy protein structure prediction by deep learning requires rigorous model quality assessment, a process currently hampered by fragmented, non-interactive tools designed for older experimental data formats. We present RevelioPlots, an open-source, interactive web application (Python/Streamlit) that simplifies and streamlines the assessment of AI-predicted protein structure quality. Its key feature is the combination of statistical pLDDT score analysis (mean, median, box plots) with an interactive, confidence-colored Ramachandran plot. This integration establishes a direct visual link between a model's predicted local reliability (pLDDT) and its stereochemical feasibility (backbone geometry). RevelioPlots handles both individual and batch-uploaded models, intelligently falling back to B-factors as a proxy for pLDDT values. Using example model proteins, we demonstrated the tool's effectiveness, revealing differences in reliability and a clear visual correlation between regions of low pLDDT scores and residues in sterically disallowed regions. By unifying these critical metrics, RevelioPlots empowers non-experienced researchers to quickly and intuitively assess, compare, and interpret structural model quality, enabling a more confident and integrated use of predicted data.
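The B-factor fallback and the summary statistics mentioned in this abstract can be sketched in a few lines of Python. This is not RevelioPlots code: it only assumes the common convention that AlphaFold-style PDB files store per-residue pLDDT in the B-factor column (columns 61-66 of ATOM records), and the helper `atom_line` and the cutoff of 50 for "low confidence" are illustrative assumptions.

```python
from statistics import mean, median

def atom_line(serial, name, resname, chain, resseq, x, y, z, bfactor):
    """Build a fixed-width PDB ATOM record (toy helper for this example)."""
    return ("ATOM  %5d %-4s %3s %1s%4d    %8.3f%8.3f%8.3f%6.2f%6.2f"
            % (serial, name, resname, chain, resseq, x, y, z, 1.00, bfactor))

def plddt_from_pdb(pdb_text):
    """Read per-residue confidence from the B-factor column of C-alpha atoms."""
    scores = []
    for line in pdb_text.splitlines():
        # Atom name occupies columns 13-16, B-factor columns 61-66 (1-based)
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    return scores

# Toy four-residue model; the N atom is included to show it is ignored
pdb_text = "\n".join([
    atom_line(1, "N",  "MET", "A", 1, 0.0, 0.0, 0.0, 91.00),
    atom_line(2, "CA", "MET", "A", 1, 1.5, 0.0, 0.0, 92.50),
    atom_line(3, "CA", "ALA", "A", 2, 3.0, 1.0, 0.0, 88.00),
    atom_line(4, "CA", "GLY", "A", 3, 4.5, 2.0, 0.0, 45.00),
    atom_line(5, "CA", "SER", "A", 4, 6.0, 3.0, 0.0, 30.50),
])

scores = plddt_from_pdb(pdb_text)
low_confidence = [i + 1 for i, s in enumerate(scores) if s < 50]  # assumed cutoff
print(mean(scores), median(scores), low_confidence)
```

Residues flagged as low confidence by a step like this are the ones a confidence-colored Ramachandran plot would let a user check against sterically disallowed regions.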
bioinformatics 2026-02-24 v1
Beyond alignment: synergistic integration is required for multimodal cell foundation models
Richter, T.; Zimmermann, E.; Hall, J.; Theis, F. J.; Raghavan, S.; Winter, P. S.; Amini, A. P.; Crawford, L.
AI Summary
- The study introduces the Synergistic Information Score (SIS) to measure the information gain from cross-modal interactions in multimodal cell foundation models, addressing the limitation of alignment-based fusion methods which only detect linear redundancies.
- Benchmarking on spatial transcriptomics datasets showed that tasks with linear redundancies are well-handled by unimodal models, while complex tasks benefit from synergy-aware integration.
- The analysis suggests that for standard tasks, fine-tuning a dominant unimodal model is sample-efficient, but multimodal frameworks are advantageous when tasks require information from multiple modalities.
Abstract
The vision of a "virtual cell" - a computational model that simulates biological function across modalities and scales - has become a defining goal in computational biology. While powerful unimodal foundation models exist, the lack of large-scale paired data prohibits the joint training of multimodal approaches. This scarcity favors compositional foundation models (CFMs): architectures that fuse frozen unimodal experts via a learned interface. However, it remains unclear when this multimodal fusion adds task-relevant information beyond the strongest unimodal representation and when it merely aggregates redundant signal. Here, we introduce the Synergistic Information Score (SIS), a metric grounded in partial information decomposition (PID), that quantifies the information gain achievable only through cross-modal interactions. Extending theoretical results from self-supervised learning, we show that standard alignment-based fusion objectives on frozen encoders inherently collapse to detecting linear redundancies, limiting their ability to capture nonlinear synergistic states. This distinction is directly relevant for tasks aiming to link tissue morphology and gene expression. Benchmarking ten fusion methods on spatial transcriptomics datasets, we use SIS to demonstrate that tasks dominated by linear redundancies are sufficiently served by unimodal baselines, whereas complex niche definitions benefit from synergy-aware integration objectives that enable cross-modal interactions beyond linear alignment. Finally, we perform a scaling analysis which highlights that fine-tuning a dominant unimodal expert is the most sample-efficient path for standard tasks, suggesting that the benefits of multimodal frameworks only emerge when tasks depend on information distributed across modalities. 
Together, these results establish that building towards a virtual cell will require a fundamental shift from alignment objectives that emphasize shared structure to synergy-maximizing integration that preserves and exploits complementary cross-modal signal.
bioinformatics 2026-02-24 v1
RSTG: Robust Generation of High Quality Spatial Transcriptomics Data using Beta Divergence Based AutoEncoder
Halder, A.; Ghosh, A.; Bandyopadhyay, S.
AI Summary
- The study addresses the challenge of insufficient data in spatial transcriptomics by proposing RSTG, an autoencoder with a β-ELBO loss, to generate high-quality synthetic data.
- RSTG uses variational inference to uncover the intrinsic structure of the data, enhancing interpretability and robustness.
- Validation on datasets from the dorsolateral cortex and brain showed RSTG's superior performance in recovering cellular positions and its robustness to data contamination like noise and outliers.
Abstract
One of the key challenges in spatial transcriptomics data analysis is the lack of sufficient data to train the models. To address this shortcoming, multiple generative models have been developed to generate synthetic spatial transcriptomics samples in a controlled environment. However, these often fail at out-of-the-box generation in the presence of noise (such as outliers). To tackle this challenge, we propose RSTG (Robust Spatial Transcriptomic Generator), an autoencoder incorporating a β-ELBO loss, to generate high-quality realistic spatial transcriptomic sequences. Our model uncovers the data's intrinsic structure by approximating its underlying distribution through variational inference, resulting in more interpretable and robust density estimation. We validate the effectiveness of RSTG across multiple tasks, including the recovery of cellular positions in both the 2D spatial and location domains. Our method shows improved performance both qualitatively and quantitatively on multiple datasets from the dorsolateral cortex and the brain using MERFISH and Visium technologies. We further illustrate the robustness of our model to outliers by contaminating a portion of the data with possible anomalies (such as white noise, batch effects, and dropouts). Promising results show that our proposal maintains high quality and stability even when the training data are contaminated, across a variety of experimental settings and in comparison with existing approaches.
bioinformatics 2026-02-24 v1
Condensate-Driven Transcriptional Reprogramming Defines Core Vulnerabilities in Esophageal and Gastric Cancers
Alvarez-Carrion, L.; R. Tejedor, A.; Ardura, J. A.; Alonso, V.; Alonso-Moreno, C.; Collepardo-Guevara, R.; Gutierrez-Rojas, I.; Privat, C.; Moreno, V.; Calvo, E.; Gyorffy, B.; Espinosa, J. R.; Ocana, A.
AI Summary
- The study investigates how biomolecular condensates contribute to esophageal and gastric cancers using multi-omics profiling, functional genomics, and simulations.
- Findings show these cancers share a condensate-driven transcriptional program with upregulation of genes like TOPBP1 and CHERP, essential for tumor cell survival.
- Simulations confirmed that TOPBP1 and CHERP form condensates through phase separation, suggesting these proteins as potential therapeutic targets.
Abstract
Biomolecular condensates organize key nuclear functions by compartmentalizing biomolecules, yet their contribution to gastrointestinal tumorigenesis remains poorly defined. Integrating multi-omics profiling, functional genomics, and molecular dynamics simulations, we reveal that esophageal and gastric cancers share a condensate-enriched transcriptional program driven by intrinsically disordered proteins involved in transcription, RNA processing, and replication stress. Transcriptomic analyses identify a hyperactive transcriptional state with upregulation of condensate-associated genes, including TOPBP1 and CHERP. Dependency mapping demonstrates that these proteins are essential for tumor cell viability, defining a conserved condensate core across different tumor types. Machine-learned predictions and residue-resolution coarse-grained simulations confirm that TOPBP1 and CHERP undergo phase separation through homotypic interactions mediated by intrinsically disordered regions, with saturation concentrations below 2 M, consistent with spontaneous condensate formation observed in vitro. Together, these findings establish condensate organization as a fundamental mesoscale principle in upper gastrointestinal cancers and nominate condensate scaffolds as tractable therapeutic vulnerabilities.
bioinformatics 2026-02-24 v1
A functional annotation based integration of different similarity measures for gene expressions
Misra, S.; Roy, S.; Ray, S. S.
AI Summary
- The study developed an integrated similarity score (ISS) by combining various gene expression similarity measures, weighted by biological information, to enhance gene similarity prediction.
- A fitness function (FFFAG) was used to optimize the weights in ISS by minimizing the difference between functional similarity and ISS.
- ISS outperformed individual measures in identifying similar gene pairs and predicted functional categories for 40 unclassified yeast genes with high significance (p-value < 10^(-10)).
Abstract
Genes with similar expression profiles often exhibit similar functional properties. An integrated similarity score (ISS) is developed by combining different expression similarity measures through weights, obtained using biological information, to improve gene similarity estimation. The expression similarity measures are converted to the common framework of positive predictive value using functional annotation. A fitness function, called fitness function using functional annotation of genes (FFFAG), is also developed by minimizing the difference between the functional similarity value and the ISS. The FFFAG is used to determine the weight combination of different similarity measures in ISS. In addition, an existing similarity measure, called TMJ (integrated similarity measure by multiplying Triangle and Jaccard similarity), is also modified to incorporate biological knowledge involving functional annotation. The results demonstrate that ISS is superior to individual similarity measures for finding similar gene pairs. Further, the ISS predicts the functional categories of 40 unclassified yeast genes at a p-value cutoff of 10^(-10) from 12 clusters. The associated code is accessible at http://www.isical.ac.in/~shubhra/ISS.html.
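The idea of choosing weights so that a combined score tracks functional similarity can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' FFFAG: the linear weighted sum, the mean-absolute-difference loss, the constraint that weights sum to 1, and the coarse grid search are all assumptions, and the function names are hypothetical.

```python
from itertools import product

def iss(measures, weights):
    """Weighted combination of similarity values for one gene pair."""
    return sum(w * m for w, m in zip(weights, measures))

def fitness(pairs, functional, weights):
    """Mean absolute gap between functional similarity and the combined score."""
    gaps = [abs(f - iss(m, weights)) for m, f in zip(pairs, functional)]
    return sum(gaps) / len(gaps)

def best_weights(pairs, functional, steps=10):
    """Coarse grid search over non-negative weight vectors that sum to 1."""
    k = len(pairs[0])
    best = None
    for combo in product(range(steps + 1), repeat=k):
        if sum(combo) != steps:
            continue
        weights = tuple(c / steps for c in combo)
        score = fitness(pairs, functional, weights)
        if best is None or score < best[0]:
            best = (score, weights)
    return best

# Toy data: two similarity measures per gene pair, already on a common scale.
pairs = [(0.9, 0.2), (0.4, 0.8), (0.6, 0.5), (0.1, 0.3)]
# Functional similarity constructed so that weights (0.7, 0.3) fit exactly.
functional = [0.7 * a + 0.3 * b for a, b in pairs]

score, weights = best_weights(pairs, functional)
print(weights, score)
```

Because the toy functional similarity was built from weights (0.7, 0.3), the search recovers them; with real annotation-derived similarities the minimum would generally be nonzero.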
bioinformatics 2026-02-24 v1
A partition-based spatial entropy for co-occurrence analysis with broad application.
Otto, T.; Nemri, A.; Claessens, A.; Radulescu, O.
AI Summary
- The study introduces Regional Co-occurrence Entropy (RCE), a new spatial entropy measure to analyze how categorical co-occurrences relate to specific environments.
- RCE was applied to various fields: it identified interactions between immune cells in Alzheimer's Disease, analyzed building diversity in town neighborhoods, and examined bird species distribution in a natural reserve.
- Key findings include novel interactions in Alzheimer's, potential drivers of social mixing, and vegetation-driven changes in bird community composition.
Abstract
Despite the advent of spatial data science, including spatial biology, there exist few methods that study the distribution of points (e.g., cells or individuals) while accounting for both their own characteristics and environmental factors. We propose a new spatial entropy measure, termed the Regional Co-occurrence Entropy (RCE), that detects when categorical co-occurrences happen preferentially in specific environments. We demonstrate its use over a broad range of application fields. As examples, we study brain cell dynamics in Alzheimer's Disease, identifying both known and likely novel interactions between immune cells around beta-amyloid plaques. We also investigate the diversity of buildings across a town's neighborhoods, to detect potential drivers of social mixing at local scale. Finally, we dissect bird species distribution across a natural reserve, identifying potential vegetation-driven changes in community composition. Altogether, the proposed RCE enables rapid insights into interactions with an environmental component, making it a useful addition to the spatial data science toolbox.
bioinformatics 2026-02-24 v1
Graph-based RNA structural representation reveals determinants of subcellular localization
Hao, Y.; Sun, H.; Ran, Z.; Guo, X.; Liu, M.; Bi, Y.; Polo, J.; Liu, N.; Li, F.
AI Summary
- The study introduces GRASP, a graph neural network framework for predicting RNA subcellular localization, using a graph representation that includes nucleotide and substructure nodes.
- GRASP models both base-level interactions and structural context, incorporating multi-label dependency learning for co-localization patterns.
- It outperforms existing methods in accuracy, F1 score, and AUC across various RNA types, offering insights into structural determinants of RNA localization.
Abstract
RNA subcellular localization is a key determinant of RNA function and regulation, yet existing computational approaches rely primarily on sequence or simplified structural descriptors, limiting their scalability to long transcripts, their ability to model inter-label dependencies, and their applicability across RNA types. Here, we present GRASP, a unified graph neural network framework for predicting RNA subcellular localization using a heterogeneous graph representation that is RNA substructure-aware. GRASP presents each RNA as a multi-scale graph comprising nucleotide nodes and secondary-structure-derived substructure nodes, connected by relational edges, enabling joint modeling of base-level interactions and regional structural context. The model further incorporates multi-label dependency learning to capture co-localization patterns across cellular compartments within a unified framework. Across multiple benchmark datasets and RNA types, GRASP consistently outperforms state-of-the-art sequence-based and structure-informed methods, achieving substantial improvements in accuracy, F1 score, and AUC while maintaining strong scalability to long transcripts. In addition, the graph-based representation provides biologically interpretable insights into structural determinants of RNA localization. The source code and data are available at https://github.com/ABILiLab/GRASP, and the web server is accessible at http://grasp.biotools.bio.
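As a rough illustration of the nucleotide-level part of such a graph, the Python sketch below turns a sequence and a dot-bracket secondary structure into nucleotide nodes with backbone and base-pair edges. It is not GRASP's implementation: the dot-bracket input, the edge labels, and the function name `rna_graph` are assumptions, and the substructure nodes and relational edges described in the abstract are not reproduced.

```python
def rna_graph(sequence, dot_bracket):
    """Build a simple graph with one node per nucleotide, 'backbone' edges
    between sequence neighbours, and 'pair' edges between paired bases."""
    if len(sequence) != len(dot_bracket):
        raise ValueError("sequence and structure must have equal length")

    # Backbone edges connect consecutive nucleotides
    edges = [(i, i + 1, "backbone") for i in range(len(sequence) - 1)]

    # Match brackets with a stack to recover base pairs
    stack = []
    for i, symbol in enumerate(dot_bracket):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            edges.append((stack.pop(), i, "pair"))
    if stack:
        raise ValueError("unbalanced structure")

    nodes = list(enumerate(sequence))
    return nodes, edges

# Toy hairpin: a three-base-pair stem closing a three-nucleotide loop
nodes, edges = rna_graph("GGGAAACCC", "(((...)))")
pairs = sorted((a, b) for a, b, kind in edges if kind == "pair")
backbone = [e for e in edges if e[2] == "backbone"]
print(pairs)
```

A multi-scale representation of the kind the abstract describes would add a second layer of nodes for stems, loops, and similar substructures, linked to the nucleotides they contain.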
bioinformatics 2026-02-24 v1
CAPHEINE, or everything and the kitchen sink: a workflow for automating selection analyses using HyPhy
Verdonk, H. E.; Callan, D.; Kosakovsky Pond, S. L.
AI Summary
- CAPHEINE is a workflow designed to automate evolutionary analysis from unaligned pathogen sequences, using a reference genome.
- It facilitates studies on site-level selection dynamics, gene-level positive selection, and lineage-specific selective pressure changes.
- The workflow is compatible with Mac OS, Windows, and Linux, enhancing accessibility for researchers.
Abstract
Here we present CAPHEINE, a computational workflow that starts with a set of unaligned pathogen sequences and a reference genome and performs a comprehensive exploratory evolutionary analysis of the input data. CAPHEINE pairs nicely with studies of site-level selection dynamics, gene-level positive selection, and lineage-specific shifts in selective pressure. Our workflow is portable across Mac OS, Windows, and Linux, allowing researchers to focus on results.
bioinformatics 2026-02-24 v1
CellAwareGNN: Single-Cell Enhanced Knowledge Graph Foundation Model for Drug Indication Prediction
Zhang, X.; Jeong, E.; Yan, C.; Feng, Y.; Lyu, L.; Guo, X.; Chen, Y.
AI Summary
- The study introduces CellAwareGNN, a model integrating single-cell genomics into a biomedical knowledge graph (scPrimeKG) to enhance drug indication prediction.
- CellAwareGNN was evaluated on all diseases in the graph, achieving an AUPRC of 0.826, surpassing TxGNN-U (0.816) and TxGNN (0.799).
- For autoimmune diseases, CellAwareGNN showed significant improvement, with an AUPRC of 0.864, and suggested specific drug repurposing candidates like Ocrelizumab for Pemphigus.
Abstract
Graph foundation models have emerged as powerful tools for drug repurposing by enabling the prediction of novel drug-disease indications from large biomedical knowledge graphs. A representative example is TxGNN, which was previously developed and trained on PrimeKG, a comprehensive biomedical knowledge graph covering over 17,000 diseases. While TxGNN demonstrates strong performance, existing biomedical knowledge graphs largely lack fine-grained, cell-type-specific genomic context. It limits their ability to capture disease mechanisms driven by dysregulated cellular programs, such as immune cell-specific pathways in autoimmune diseases. Moreover, prior evaluations typically test only randomly selected subsets of diseases, leaving many diseases unexamined and limiting conclusions about model performance across the full disease spectrum. To address these limitations, we first update PrimeKG to PrimeKG-U by incorporating expanded and curated biomedical knowledge and then develop TxGNN-U as a stronger graph-based baseline. Building on this foundation, we introduce CellAwareGNN, a graph foundation model that integrates single-cell genomics into PrimeKG-U. We construct a single-cell-enhanced knowledge graph, scPrimeKG, by incorporating cell-type-specific genetic associations from the OneK1K dataset, expanding PrimeKG from approximately 8.1 million edges and 129k nodes to over 14 million edges and 140k nodes. CellAwareGNN is pre-trained on all relation types in scPrimeKG and evaluated on drug indication prediction with explicit coverage of all diseases in the knowledge graph. CellAwareGNN consistently outperforms TxGNN and TxGNN-U. For drug indication prediction, CellAwareGNN achieves an AUPRC of 0.826, representing a 1.2% improvement over TxGNN-U (0.816) and a 3.4% improvement over TxGNN (0.799). Notably, for autoimmune diseases, CellAwareGNN attains an AUPRC of 0.864, improving by 2.0% over TxGNN-U (0.847) and 6.0% over TxGNN (0.815). 
Importantly, CellAwareGNN prioritizes promising repurposing candidates, including Ocrelizumab for Pemphigus via CD20-expressing B cells, Methotrexate for Pemphigus through DHFR and ATIC activity in T and B cells, and Rosiglitazone for Rheumatoid Arthritis through PPAR-γ activation. These results demonstrate the value of incorporating cell-type-specific genomic context to improve both predictive performance and biological interpretability in graph-based drug repurposing.
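The improvement percentages quoted in this abstract are consistent with relative gains in AUPRC rather than absolute percentage-point differences. The short Python check below recomputes them from the reported scores; the function name `relative_gain` and the dictionary labels are illustrative.

```python
def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100

# (CellAwareGNN AUPRC, baseline AUPRC) as reported in the abstract
reported = {
    "all diseases vs TxGNN-U": (0.826, 0.816),
    "all diseases vs TxGNN": (0.826, 0.799),
    "autoimmune vs TxGNN-U": (0.864, 0.847),
    "autoimmune vs TxGNN": (0.864, 0.815),
}

gains = {name: round(relative_gain(new, old), 1)
         for name, (new, old) in reported.items()}
for name, gain in gains.items():
    print(f"{name}: {gain}%")
```

For comparison, the absolute difference between 0.826 and 0.816 is 1.0 percentage point, so the quoted 1.2% only matches when the gain is taken relative to the baseline.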
bioinformatics 2026-02-23 v1
Hierarchical Multi-Omics Trajectory Prediction for Fecal Microbiota Transplantation: A Novel Machine Learning Framework for Small-Sample Longitudinal Multi-Omics Integration
Zhou, Y.-H.; Sun, G.
AI Summary
- The study introduces Hierarchical Multi-Omics Trajectory Prediction (HMOTP), a machine learning framework for predicting patient trajectories post-fecal microbiota transplantation (FMT) using small-sample, longitudinal multi-omics data.
- HMOTP integrates lipidomics and metagenomics data through hierarchical feature construction and multi-level attention, achieving 96.67% accuracy in predicting treatment outcomes in a cohort of 15 patients.
- The framework identified key biomarkers and cross-omics associations, demonstrating its utility in personalized medicine and biological discovery in FMT applications.
Abstract
Fecal microbiota transplantation (FMT) has emerged as a highly effective treatment for recurrent Clostridioides difficile infection and is being actively investigated for numerous other conditions. While multi-omics studies have revealed dynamic changes in microbial communities and host metabolism following FMT, existing approaches are primarily descriptive and lack the ability to predict individual patient trajectories or identify early biomarkers of treatment response. Small-sample, multi-omics, longitudinal prediction problems present unique computational challenges: high dimensionality, multi-omics integration, temporal dynamics, and interpretability. Here, we present Hierarchical Multi-Omics Trajectory Prediction (HMOTP), a novel machine learning framework specifically designed for small-sample, multi-omics, longitudinal prediction that addresses these challenges through hierarchical feature construction using domain knowledge, multi-level attention mechanisms, and patient-specific trajectory prediction. HMOTP integrates multi-omics data at multiple biological levels (raw features, aggregated classes/categories, and cross-level interactions) while preserving biological interpretability. The framework employs multi-head attention to learn feature importance at different hierarchy levels and integrates information across omics layers. Patient-specific trajectory prediction enables personalized predictions despite limited sample sizes through transfer learning. We evaluated HMOTP on a cohort of 15 patients with recurrent Clostridioides difficile infection who underwent fecal microbiota transplantation, with comprehensive lipidomics (397 features) and metagenomics (10,634 pathways) profiling at four timepoints spanning six months. Using leave-one-patient-out cross-validation, HMOTP achieved 96.67% ± 10.54% accuracy, outperforming baseline methods including Random Forest (91.33% ± 21.33%) and Logistic Regression (86.33% ± 24.67%).
The framework demonstrated robust generalization across timepoints. Through hierarchical interpretability, HMOTP identified key biomarkers and revealed mechanistically informative cross-omics associations, including 324 strong correlations (|r| > 0.7) involving top-predictive biomarkers, demonstrating its utility for both prediction and biological discovery in FMT applications. HMOTP provides a generalizable framework applicable to other small-sample multi-omics problems, offering a powerful tool for personalized medicine applications.
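The leave-one-patient-out scheme named in the HMOTP abstract can be illustrated with a minimal Python sketch. The 15-patient, 4-timepoint layout follows the abstract; the function name and data layout are assumptions made for illustration, and this is not HMOTP's own code.

```python
from collections import defaultdict

def leave_one_patient_out(patient_ids):
    """Yield (held_out_patient, train_idx, test_idx), one fold per patient.

    patient_ids[i] is the patient that sample i belongs to. All samples
    from the held-out patient go to the test fold, so no patient
    contributes to both training and testing within a fold.
    """
    by_patient = defaultdict(list)
    for idx, pid in enumerate(patient_ids):
        by_patient[pid].append(idx)
    for held_out in sorted(by_patient):
        test_idx = by_patient[held_out]
        train_idx = [i for i, pid in enumerate(patient_ids) if pid != held_out]
        yield held_out, train_idx, test_idx

# Hypothetical layout: 15 patients, 4 timepoints each (as in the abstract).
patient_ids = [p for p in range(15) for _ in range(4)]
folds = list(leave_one_patient_out(patient_ids))
```

Each fold here tests on only four samples, so per-fold accuracy can take only a few distinct values. That may partly explain why standard deviations across folds can be large in designs of this kind.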
bioinformatics 2026-02-23 v1
A PLUM Job: Peptide modeLs for Understanding and engineering antiMicrobial therapeutics
Banerjee, P.; Friedberg, I.; Rued, B. E.; Eulenstein, O.
AI Summary
- The study addresses the need for new antimicrobial strategies by developing PLUM, a structured conditional Variational Autoencoder for designing antimicrobial peptides (AMPs).
- PLUM allows for controlled generation of AMPs by disentangling sequence, function, and length, producing peptides 5-35 amino acids long.
- Compared to HydrAMP, PLUM achieved a 7% higher AMP yield, 14% increased diversity, and 37% more AMPs in prototype-conditioned generation, with low predicted toxicity.
Abstract
Motivation: Antibiotic-resistant infections in humans and animals are rising, creating an urgent need for new antimicrobial strategies. This challenge extends beyond clinical settings to food production systems; the Centers for Disease Control and Prevention estimates that foodborne pathogens cause over 48 million illnesses annually in the U.S. alone. Antimicrobial peptides (AMPs) are a promising alternative due to their broad activity and lower risk of resistance. However, rational design remains challenging, particularly when simultaneously controlling sequence, function, and peptide length. Results: We introduce Peptide modeLs for Understanding and engineering antiMicrobial therapeutics (PLUM), a structured conditional Variational Autoencoder for controlled AMP generation. PLUM disentangles sequence, function, and length within its latent space, enabling both de novo and prototype-conditioned generation of peptides 5-35 amino acids long, thereby capturing larger functional domains. Across 45,000 generated peptides, PLUM achieved the highest AMP yield (0.885), 7% higher than HydrAMP; increased AMP diversity by 14% compared with HydrAMP; and maintained the highest non-AMP sequence yield (0.895), 19% higher than HydrAMP. For prototype-conditioned generation, PLUM produced 37% more AMPs than HydrAMP, generating sequences that closely matched real peptide compositions while exhibiting low predicted toxicity. Integrated AMP classifiers enabled robust evaluation of identity and potency across diverse bacterial targets. These results establish PLUM as a scalable and versatile platform for designing antimicrobial peptides and next-generation therapeutics. Availability: https://github.com/priyamayur/PLUM
bioinformatics 2026-02-23 v1
EES-Transformer: A Dual-Path Transformer for Tissue Classification and Gene Representation Learning from Extreme Expression Sets
Park, J.-S.; Lee, Y.; Kang, Y. J.
AI Summary
- The EES-Transformer V2 uses a dual-path architecture to classify tissues and learn gene representations from extreme gene expression sets in Arabidopsis thaliana RNA-seq data.
- It achieves 91-92% accuracy in classifying 47 tissue types, significantly outperforming the 2.1% random baseline.
- Analysis revealed tissue-specific gene relationships, with attention highlighting 3,524 gene-tissue associations and showing that tissue identity is encoded in gene pairs rather than single genes.
Abstract
Accurate tissue classification from gene expression data is fundamental to transcriptomic analysis. Here we introduce EES-Transformer V2, a dual-path transformer architecture that learns from Extreme Expression Sets (EES): sequences of genes at expression extremes (above the 95th or below the 5th percentile). The architecture separates tissue classification from masked language modeling through independent branches: the classification branch operates without tissue label information, while the generative branch receives tissue conditioning. This design enables fair evaluation of classification performance while learning tissue-specific gene relationships. Applied to 12,212 Arabidopsis thaliana RNA-seq samples spanning 47 tissue types, EES-Transformer achieves 91-92% classification accuracy (varying across evaluation runs due to stochastic input masking), substantially above the 2.1% random baseline. Attention-based analysis identifies 3,524 gene-tissue high-attention associations whose importance patterns reflect known biology. Critically, while individual high-attention genes appear broadly across tissues, gene pairs from attention-derived regulatory networks show higher tissue specificity: pollen gene pairs show a 2.7-fold enrichment over single-gene rates, and root and leaf pairs each show 1.5-fold enrichment. This finding reveals that tissue identity is encoded in combinatorial gene expression patterns rather than individual genes. Attention-derived gene regulatory networks exhibit scale-free topology and biologically coherent hub gene programs, with pollen networks consisting entirely of DOWN-DOWN interactions among silenced vegetative genes. EES-Transformer provides accurate tissue classification, interpretable gene importance scores, and attention-derived regulatory networks for biological discovery.
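As a rough illustration of how an extreme expression set could be extracted, the Python sketch below keeps genes above the 95th or below the 5th percentile of one sample. The abstract does not say whether percentiles are taken per sample or per gene across samples; the per-sample choice, the function names, and the toy data are all assumptions.

```python
def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty list."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def extreme_expression_set(expr, low_q=5.0, high_q=95.0):
    """Return (up_genes, down_genes) for one sample.

    expr maps gene id -> expression value. Genes strictly above the
    high_q percentile are 'UP'; genes strictly below low_q are 'DOWN'.
    """
    vals = list(expr.values())
    lo_cut, hi_cut = percentile(vals, low_q), percentile(vals, high_q)
    up = sorted(g for g, v in expr.items() if v > hi_cut)
    down = sorted(g for g, v in expr.items() if v < lo_cut)
    return up, down

# Toy sample with 101 hypothetical genes whose expression equals their index.
sample = {f"gene{i:03d}": float(i) for i in range(101)}
up, down = extreme_expression_set(sample)

# Chance accuracy for a uniform guess over 47 tissue types, in percent.
random_baseline_pct = 100.0 / 47
```

The last line reproduces the 2.1% random baseline quoted in the abstract, which corresponds to guessing uniformly among 47 classes.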
bioinformatics 2026-02-23 v1
Pro-GAT: Reconnecting Fragmented PROTACs Using Graph Attention Transformer
Vemuri, S.; Bijigiri, L. P.; Gogte, S.; Kondaparthi, V.
AI Summary
- The study addresses the issue of chemically invalid or disconnected linker structures in PROTACs generated by diffusion-based models, which fail to maintain local bonding requirements.
- Pro-GAT, a graph attention-based framework, was developed to repair these disconnected PROTAC candidates by predicting coordinate corrections and atom-type modifications.
- When integrated with DiffPROTACs and DiffLinker, Pro-GAT increased the percentage of chemically valid PROTAC candidates from 76.70% to 83.92% and 63.16% to 68.73%, respectively, while maintaining high uniqueness levels.
Abstract
PROTACs work by bringing together a protein-of-interest ligand and an E3 ligase recruiter to trigger targeted degradation. However, diffusion-based generative models frequently produce chemically invalid or disconnected linker structures that satisfy global geometric constraints but violate local bonding requirements. These models operate in continuous coordinate space and therefore lack explicit mechanisms for enforcing discrete chemical connectivity under fixed-anchor constraints. Invalid, disconnected outputs recur rather than being a rare exception, so naive resampling is not an effective way to obtain valid chimeras. Pro-GAT is a graph attention-based framework for geometry-preserving molecular graph repair. It operates on chemically disconnected diffusion-generated PROTAC candidates by predicting bounded coordinate corrections and constrained atom-type modifications using geometry-aware graph attention network (GAT) layers. The proposed model is trained on PROTAC datasets with added disconnections to overcome systematic connectivity failures in diffusion-based PROTAC generation with fixed anchors. When combined with DiffPROTACs and DiffLinker, Pro-GAT improves the percentage of chemically valid candidates in the aggregated output from 76.70% to 83.92% and from 63.16% to 68.73%, while maintaining 80.18% and 63.80% uniqueness levels of valid candidates, respectively, thus facilitating the generation of usable PROTAC candidates from invalid diffusion samples. Pro-GAT was used in a case study of the 7Z76 ternary complex to repair DiffPROTACs- and DiffLinker-generated samples, which gave rise to connected chimeras whose docking scores were comparable to those of the original 7Z76 structure.
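The notion of a "disconnected" candidate can be made concrete with a small graph check. The Python sketch below counts connected components of a molecular graph by breadth-first search. It only tests connectivity and ignores valence and other chemistry that a full validity check would presumably include; the atom indexing and toy molecules are hypothetical, and this is not Pro-GAT's repair procedure.

```python
from collections import deque

def connected_components(n_atoms, bonds):
    """Return a list of components (sorted atom-index lists) of a molecular graph.

    bonds is an iterable of (i, j) atom-index pairs. A single component
    means every atom is reachable from every other through bonds.
    """
    adj = {i: set() for i in range(n_atoms)}
    for i, j in bonds:
        adj[i].add(j)
        adj[j].add(i)
    seen, components = set(), []
    for start in range(n_atoms):
        if start in seen:
            continue
        queue, comp = deque([start]), []
        seen.add(start)
        while queue:
            node = queue.popleft()
            comp.append(node)
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        components.append(sorted(comp))
    return components

# Hypothetical 6-atom candidate: the bond between atoms 2 and 3 is missing,
# so the two ligand-side fragments are not joined by the linker.
broken = connected_components(6, [(0, 1), (1, 2), (3, 4), (4, 5)])
repaired = connected_components(6, [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)])
```

A candidate with more than one component is fragmented in the sense the abstract describes; a repaired candidate should collapse to a single component.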
bioinformatics 2026-02-23 v1
LinkDTI: Drug-Target Interactions prediction through a Link Prediction framework on Biomedical Knowledge Graph
Mondal, M.; Arunachalam, S.; Wu, S.; Datta, A.
AI Summary
- LinkDTI is a computational framework that predicts drug-target interactions (DTIs) by analyzing connections in a heterogeneous biomedical knowledge graph using a modified GraphSAGE model.
- It employs negative sampling to balance data and outperforms baseline methods by at least 2.5% in AUROC and AUPRC.
- The framework identified 945 new potential DTIs, a 49.14% increase over known interactions.
Abstract
Computational drug-target interaction (DTI) prediction serves as a valuable tool for drug discovery and repurposing by cost-effectively narrowing down the potential drug-target space. This paper presents LinkDTI, a computational framework that predicts DTIs by identifying connections within a heterogeneous knowledge graph of drugs, proteins, diseases, and side effects. Unlike methods that rely on mathematical techniques like matrix completion or similarity-based scoring, LinkDTI uses an advanced graph-based approach to capture relationships between biomedical entities. Specifically, LinkDTI applies a modified version of the multilayer GraphSAGE model that learns from the heterogeneous knowledge graph and predicts potential drug-target interactions. Our model incorporates negative sampling to balance the data, addressing the problem of having more negative than positive interactions. Our results show that LinkDTI consistently outperforms baseline methods in AUROC and AUPRC by at least 2.5% across different sampling ratios and conditions. It also identifies approximately 945 new potential DTIs, a 49.14% increase over known DTIs. Overall, LinkDTI offers a simple yet effective method for integrating diverse biomedical data to identify potential drug-target interactions.
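Negative sampling at a fixed ratio, as mentioned in the LinkDTI abstract, might look like the Python sketch below. The sampling strategy (uniform, without replacement, from all unlabeled drug-target pairs), the function name, and the toy identifiers are assumptions; the abstract does not describe LinkDTI's actual sampler.

```python
import random

def sample_negatives(drugs, targets, positives, ratio=1, seed=0):
    """Draw ratio * len(positives) drug-target pairs with no known interaction.

    Pairs are sampled uniformly without replacement from the complement of
    the known-positive set. Unobserved pairs are treated as negatives, which
    is an assumption: some may be undiscovered true interactions.
    """
    positive_set = set(positives)
    candidates = [(d, t) for d in drugs for t in targets
                  if (d, t) not in positive_set]
    k = ratio * len(positive_set)
    if k > len(candidates):
        raise ValueError("not enough unlabeled pairs for the requested ratio")
    return random.Random(seed).sample(candidates, k)

# Hypothetical toy graph: 3 drugs, 4 targets, 3 known interactions.
drugs = ["D1", "D2", "D3"]
targets = ["T1", "T2", "T3", "T4"]
positives = [("D1", "T1"), ("D2", "T3"), ("D3", "T4")]
negatives = sample_negatives(drugs, targets, positives, ratio=2)

# If 945 new DTIs are a 49.14% increase, the implied number of known DTIs is:
implied_known = round(945 / 0.4914)
```

The last line is simple arithmetic on the abstract's figures and assumes the percentage is taken relative to the count of known DTIs.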
bioinformatics 2026-02-23 v1
SPrOUT: A computational and targeted sequencing approach for mixed plant DNA identification with Angiosperms353
Hu, N.; Bullock, M. R.; Jackson, C.; Miller, C.; Hunter, E.; Huff, C.; Chen, Y.; Handy, S.; Johnson, M.
AI Summary
- This study introduces SPrOUT, a method using the Angiosperms353 target sequencing kit to identify plant species in mixed samples through phylogenetic inference.
- The approach achieves high accuracy (98.1-99.6%) and precision (92.9-100%) for in-silico mixes, and 90.7% accuracy with 98.0% precision for mock supplement mixtures.
- The method effectively identifies taxa in mixed plant DNA samples, providing a practical framework for various applications.
Abstract
Premise: The identification of plant species from mixed samples is crucial in various fields, including ecological surveys, conservation efforts, and food and dietary supplement safety. Traditional methods face potential challenges due to the high costs of DNA sequencing, inefficiencies in computational workflows, and incomplete sequence databases. Methods and Results: This study introduces a novel approach using the Angiosperms353 target sequencing kit for efficient taxonomic identification of angiosperm DNA in mixed samples. Our method assembles short paired-end reads for each mixed sample. Using gene sets of Angiosperms353 from 871 species, we apply phylogenetic inference to categorize the variance in phylogenetic distance across genes to identify the presence of taxa in mixed plant samples. The pipeline reaches 98.1 to 99.6% accuracy and 92.9 to 100% precision for identifying unknown taxa in in-silico mixes, and 90.7% accuracy and 98.0% precision for mock supplement mixtures. We explored the parameter cutoffs of the pipeline to offer an empirical range for different applications. Conclusions: The Angiosperms353 and HybPiper assembly proved effective in sorting mixed plant DNA samples. Our method offers a framework for scientific and practical applications in plant species identification in both single and mixed samples.
bioinformatics 2026-02-23 v1
Comprehensive top-down mass spectral repository enables pan-dataset analysis and top-down spectral prediction
Li, K.; Liu, K.; Fulcher, J. M.; Tang, H.; Liu, X.
AI Summary
- The study introduces TopRepo, a comprehensive repository of over 18 million top-down mass spectrometry (TD-MS) spectra from 12 species, creating a large-scale spectral library with over 5 million annotated spectra.
- TopRepo facilitates pan-dataset analyses of proteoform characteristics like N-terminal processing and mass shifts.
- The repository enhances proteoform identification via spectral library searching and supports training deep learning models for accurate TD-MS spectral prediction.
Abstract
Mass spectral libraries have become essential resources for training deep learning (DL) models for spectral prediction and de novo sequencing in bottom-up mass spectrometry (BU-MS). Compared with BU-MS, top-down MS (TD-MS) offers unique advantages for characterizing intact proteoforms by analyzing proteoforms without enzymatic digestion. Despite these advantages, large-scale spectral libraries for TD-MS are currently lacking. Here we present TopRepo, the first comprehensive repository of TD-MS spectra, comprising more than 18 million spectra acquired from 12 species across eight types of mass spectrometers. Using TopRepo, we constructed a large-scale top-down spectral library containing over 5 million spectra with curated proteoform and fragment-ion annotations. We demonstrate that TopRepo enables pan-dataset analyses of N-terminal processing, mass shifts, and other proteoform characteristics identified by TD-MS. Furthermore, we show that the TopRepo spectral library substantially improves proteoform identification through spectral library searching and supports the training of DL models for high-accuracy top-down spectral prediction.
bioinformatics 2026-02-23 v1
Inference of cancer driver mutations from tumor microenvironment composition: a pan-cancer study with cross-platform external validation
Baker, E. A.; Mehaffy, N. S.
AI Summary
- This study investigated whether tumor microenvironment (TME) composition can predict cancer driver mutations across glioblastoma, breast, lung adenocarcinoma, and colorectal cancers using machine learning models trained on RNA-seq data from TCGA.
- The models were externally validated on independent cohorts, achieving AUC >0.65 for 14 out of 15 driver-cancer pairs, with top performance for ERBB2 in breast cancer (AUC=0.980).
- TME-predicted ERBB2 status was associated with overall survival in breast cancer, and the study highlighted the complexity of predicting KRAS mutations in lung adenocarcinoma due to co-mutant profiles.
Abstract
Cancer driver mutations shape the tumor microenvironment (TME), yet whether TME composition alone can predict genotype has not been systematically evaluated across cancers with external validation. We trained machine learning models to predict driver mutation status from TME cell-type composition signatures derived from bulk transcriptomes. Tissue-specific TME signatures (22-28 programs per cancer) were scored from RNA-seq data in TCGA for glioblastoma (GBM, n=157 total; n=90 EGFR-amplification evaluable), breast cancer (BRCA, n=1,082 total; n=994 evaluable), lung adenocarcinoma (LUAD, n=510 total; n=502 evaluable), and colorectal cancer (CRC, n=592 total; n=524 evaluable), then externally validated on independent cohorts spanning different platforms: CPTAC (GBM, n=65), METABRIC (BRCA, n=1,859), GSE72094 (LUAD, n=442), and GSE39582 (CRC, n=585). Of 15 driver-cancer pairs tested, 14 achieved external AUC >0.65, with top performance for ERBB2 amplification in BRCA (AUC=0.980), BRAF mutation in CRC (0.899), and TP53 mutation in BRCA (0.871). TME-predicted ERBB2 status stratified overall survival in METABRIC (Cox HR=1.73, p=7.95x10^-8). Marginal KRAS performance in LUAD (AUC=0.615) reflected opposing TME profiles in KRAS+STK11 versus KRAS+TP53 co-mutant tumors. These results demonstrate that TME composition encodes sufficient information to infer driver mutations across cancers.
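For readers less familiar with the AUC values reported above, the Python sketch below computes AUC in its rank-based form: the probability that a randomly chosen mutated sample scores higher than a randomly chosen wild-type sample. The labels and scores are made-up toy values, not data from the study.

```python
def roc_auc(labels, scores):
    """Rank-based AUC: probability that a random positive outscores a random
    negative, counting ties as one half (Mann-Whitney U / (n_pos * n_neg))."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy data: 1 = mutated, 0 = wild type; scores are hypothetical model outputs.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.2, 0.1]
auc = roc_auc(labels, scores)
```

Under this reading, an AUC of 0.5 corresponds to chance ranking and 1.0 to perfect separation, which gives context for the 0.65 threshold and the marginal 0.615 for KRAS in LUAD.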
bioinformatics 2026-02-23 v1
GlycoForge generates realistic glycomics data under known ground truth for rigorous method benchmarking
Hu, S.; Bojar, D.
AI Summary
- GlycoForge is introduced as a tool for simulating realistic glycomics data with known ground truths, addressing the challenge of simulating data with controlled effects and biases.
- It supports the creation of synthetic data with specified motif-level effects, batch effects, and realistic missing data scenarios.
- The utility of GlycoForge was demonstrated by evaluating batch effect correction algorithms, providing guidelines for their application in real-world glycomics data analysis.
Abstract
Quantifying all complex carbohydrates in a sample produces glycomics data, which constitutes compositional data and is stymied by biosynthetic dependencies between glycans, requiring dedicated analytic workflows. Properly assessing such methods frequently requires simulated data with known ground truths and injectable effects. However, simulating glycomics data, especially with control over effects and biases, is still unsolved. Here, we present GlycoForge, a feature-complete solution for simulating comparative glycomics data. GlycoForge supports simulating fully synthetic glycomics data, with specified motif-level effects, drawn from Dirichlet distributions, and templated simulations based on real-world data. We further support the injection of batch effects, both mean and variance shifts, via center-log ratio transformations to maintain compositional closure, and realistic missing data simulation. We showcase the utility of GlycoForge by evaluating batch effect correction algorithms for glycomics data, with automated guidelines for when to use such methods on real-world data. GlycoForge is available as an open-access Python package at https://github.com/BojarLab/GlycoForge.
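The idea of injecting a batch effect through the center-log ratio (CLR) transformation while keeping compositional closure can be sketched in a few lines of Python. This is a generic illustration of a mean shift in CLR space, written without reference to the GlycoForge code base; the function names, abundances, and shift vector are assumptions, and variance shifts are not shown.

```python
import math

def closure(x):
    """Rescale positive values so they sum to 1 (a composition)."""
    total = sum(x)
    return [v / total for v in x]

def clr(x):
    """Centered log-ratio: log of each part minus the mean log."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [l - mean_log for l in logs]

def clr_inverse(z):
    """Map CLR coordinates back to a composition (softmax)."""
    m = max(z)
    return closure([math.exp(v - m) for v in z])

def inject_batch_shift(composition, shift):
    """Add a per-glycan mean shift in CLR space, then return to the simplex."""
    z = clr(composition)
    return clr_inverse([a + b for a, b in zip(z, shift)])

# Hypothetical relative abundances of four glycans and a hypothetical shift.
sample = closure([40.0, 30.0, 20.0, 10.0])
shifted = inject_batch_shift(sample, [0.5, 0.0, -0.5, 0.0])
```

Because the shift is applied in log-ratio space and the result is re-closed, the shifted abundances still sum to 1, and the ratio between two glycans with equal shifts is unchanged. Zero abundances would need a pseudocount before taking logs.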
bioinformatics 2026-02-23 v1
Skip-Zeros Variational Inference in the Million-Cell Era of Single-Cell Transcriptomics
Shimamura, T.; Yuki, S.; Abe, K.
AI Summary
- The study introduces UNISON, a scalable framework for matrix factorization in single-cell RNA sequencing, using skip-zeros variational inference to handle the sparsity of large datasets.
- UNISON performs inference using only nonzero elements, improving efficiency and scalability, and was tested on over one million cells from the Mouse Organogenesis Cell Atlas.
- Application to cross-species analysis showed UNISON's ability to distinguish conserved and species-specific transcriptional programs, enhancing understanding of biological processes like glaucoma.
Abstract
Combinatorial indexing-based single-cell RNA sequencing methods such as sci-RNA-seq and sci-RNA-seq3 now enable the profiling of millions of cells, producing expression matrices that are both extremely sparse and high-dimensional. Conventional nonnegative matrix factorization (NMF) provides an interpretable framework for uncovering latent biological structures but is computationally prohibitive at this scale, as it requires explicit access to the vast number of zero entries. We introduce UNISON (Unified Sparse-Optimized Nonnegative factorization), a scalable framework for matrix and tensor factorization based on skip-zeros variational inference. By reformulating stochastic variational Bayes updates in terms of sufficient statistics, UNISON performs inference using only nonzero elements, while implicitly accounting for zeros through geometric sampling. This strategy enables efficient parameter estimation without matrix expansion and naturally accommodates multiple experimental contexts. Simulation studies show that UNISON is robust to diverse learning-rate schedules and mini-batch sizes, providing practical guidelines for optimization. Application to the Mouse Organogenesis Cell Atlas demonstrates scalability to over one million cells, yielding latent factors that capture developmental trajectories and lineage-specific signatures with improved interpretability compared to existing methods. Cross-species analysis of aqueous humor outflow pathways across five vertebrate species further highlights UNISON's ability to disentangle conserved from species-specific transcriptional programs and to recover biologically meaningful gene-gene and gene-phenotype relationships relevant to glaucoma. By efficiently exploiting sparsity while preserving interpretability, UNISON establishes a principled and practical solution for integrative, large-scale single-cell transcriptomics.
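To give some intuition for why inference can avoid visiting zero entries at all, the Python sketch below uses a standard identity for Poisson matrix factorization: zeros contribute only through the total rate, which factorizes into column sums of W and row sums of H. This is not UNISON's algorithm. The abstract describes geometric sampling within stochastic variational Bayes, which is not implemented here, and the Poisson likelihood itself is an assumption made for illustration.

```python
import math

def poisson_loglik_dense(X, W, H):
    """Reference: Poisson log-likelihood visiting every entry of X."""
    K = len(H)
    total = 0.0
    for i, row in enumerate(X):
        for j, x in enumerate(row):
            lam = sum(W[i][k] * H[k][j] for k in range(K))
            total += x * math.log(lam) - lam - math.lgamma(x + 1)
    return total

def poisson_loglik_skip_zeros(nonzeros, W, H):
    """Same value, touching only nonzero entries.

    nonzeros is a list of (i, j, x) with x > 0. A zero entry contributes
    only -lambda_ij, and the sum of lambda_ij over ALL entries factorizes as
    sum_k (sum_i W[i][k]) * (sum_j H[k][j]), so zeros never need visiting.
    """
    K = len(H)
    total_rate = sum(
        sum(W[i][k] for i in range(len(W))) * sum(H[k]) for k in range(K)
    )
    data_term = 0.0
    for i, j, x in nonzeros:
        lam = sum(W[i][k] * H[k][j] for k in range(K))
        data_term += x * math.log(lam) - math.lgamma(x + 1)
    return data_term - total_rate

# Tiny hypothetical count matrix (cells x genes) and positive factors.
X = [[0, 3, 0], [1, 0, 0], [0, 0, 2]]
W = [[0.5, 1.0], [1.5, 0.2], [0.3, 0.8]]
H = [[0.4, 0.9, 0.1], [0.2, 0.6, 1.1]]
nonzeros = [(i, j, x) for i, row in enumerate(X) for j, x in enumerate(row) if x > 0]
dense_value = poisson_loglik_dense(X, W, H)
sparse_value = poisson_loglik_skip_zeros(nonzeros, W, H)
```

The two functions agree up to floating-point error, but the second scales with the number of nonzero entries rather than with the full matrix size, which is the property that matters for million-cell matrices.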
bioinformatics 2026-02-23 v1
What makes a banana false? How the genome of Ethiopian orphan staple Ensete ventricosum differs from the banana A and B sub-genomes
Muzemil, S.; Paul, P.; Baxter, L.; Dominguez-Ferreras, A.; Sahu, S. K.; Van Deynze, A.; Mai, G.; Yemataw, Z.; Tesfaye, K.; Ntoukakis, V.; Studholme, D. J.; Grant, M.
AI Summary
- The study sequenced the genome of the Ensete ventricosum landrace Mazia, identifying 38,940 protein-coding genes, with the assembly being more complete than the previously published Bedadeti genome.
- Comparative analysis showed about 25% of the Mazia genome is unique to enset, with distinct functional signatures related to DNA integration, carbohydrate metabolism, disease resistance, and transcriptional regulation.
- The research highlights the potential for marker-assisted breeding in enset, providing a foundation for improving agronomically important traits through comparative genomics within the Musaceae family.
Abstract
Background: Ensete ventricosum, also known as the "tree against hunger", plays a key role in Ethiopian food security and farming systems, feeding more than 20 million people. Since domestication via clonal selection in the south-west Ethiopian highlands, today's diverse enset landraces contribute multiple benefits including food, fibre by-product, animal bedding and cattle fodder to farmers and local communities. Improved genomic resources for this highly drought-tolerant plant are essential to supplement the conventional clonal selection-based breeding programme and pave the way towards targeted breeding. Results: We sequenced the genome of enset landrace Mazia, which is partially resistant/tolerant to Xanthomonas wilt, and predicted 38,940 protein-coding genes. The Mazia assembly (540.14 Mb) is more complete than the previously published genome assembly of landrace Bedadeti (451.28 Mb) and displayed 1.41% heterozygosity and 64.64% repetitive DNA content. Comparative analyses with the Bedadeti assembly and chromosome-level genome sequences of the two main banana progenitors (Musa acuminata, AA genome; Musa balbisiana, BB genome) unexpectedly revealed that ~25% of the Mazia genome is unique to enset. Gene Ontology (GO) and sequence similarity search analysis of enset-specific protein-coding genes identified distinct functional signatures that underpin the lifestyle, adaptation, and corm productive quality of enset, including functions related to DNA integration, carbohydrate metabolism, disease resistance and transcriptional regulation. In contrast, Musa-specific genes showed enrichment for defence response, protein phosphorylation and fruit development pathways. Focusing on the classical nucleotide binding site leucine rich repeat (NLR) disease resistance genes, we identified and characterised NLRs in enset and Musa species genomes, revealing a considerable expansion in the Musa acuminata genome. We also identified unique genes in enset and banana genomes whose functional and evolutionary roles are yet to be determined. Conclusions: Here, we report a de novo genome assembly for the enset (Ensete ventricosum) landrace Mazia and provide a high-quality annotation of both Mazia and the previously published assembly of the landrace Bedadeti. Collectively, these genomic resources provide a valuable foundation for comparative genomics within the Musaceae family and open new opportunities for the development of marker-assisted breeding strategies to accelerate the improvement of agronomically important traits in enset. Keywords: Ensete ventricosum, Musa, gene families, nucleotide binding site leucine rich repeat (NLR), orphan crop.
bioinformatics 2026-02-23 v1
An Integrated and Configurable End-to-End Pipeline for Longitudinal Cell Painting Analysis
Zhao, G.
AI Summary
- The study introduces SCALE, an end-to-end pipeline for analyzing longitudinal cell painting data, addressing challenges like imaging variability and time consistency.
- SCALE integrates nucleus-centered segmentation, quality control, feature extraction, and signal aggregation in a modular framework.
- The pipeline's effectiveness was demonstrated with a chronic radiation exposure dataset, showing its capability for consistent longitudinal analysis.
Abstract
Cell painting assays generate high-dimensional, multi-channel imaging data that enable systematic characterization of cellular phenotypes. Increasingly, such assays are performed in longitudinal settings and under chronic perturbations, introducing additional challenges related to imaging variability, focus-field heterogeneity, and consistency across time points. Existing analysis workflows often require substantial manual adaptation to handle these complexities, limiting scalability and reproducibility. In this paper, we propose SCALE (Stable Cell painting Analysis for Longitudinal Experiments), an integrated, end-to-end analysis pipeline designed for robust longitudinal analysis of cell painting data. The pipeline combines nucleus-centered segmentation, automated quality control, feature extraction, and signal aggregation within a modular and configurable framework. Once assay-specific configurations are specified, the pipeline executes in a fully automated manner from raw images to downstream summary statistics and analysis-ready outputs. We demonstrate the utility of the pipeline using a chronic radiation exposure cell painting dataset, illustrating its ability to support consistent longitudinal comparisons across conditions and time points.
bioinformatics 2026-02-23 v1
Structure-Based TCR-pMHC Binding Prediction and Generalization to Unseen Peptides
Abeer, A. N. M. N.; Roy, R. S.; Qian, X.; Yoon, B.-J.
AI Summary
- This study investigates the generalization performance of graph neural network (GNN)-based classifiers for predicting TCR-pMHC binding, focusing on their accuracy with unseen peptides.
- The research assesses factors like interaction features and structural uncertainty that affect classifier performance.
- By designing classifier architecture with auxiliary training objectives, the study shows improved generalization to novel peptides.
Abstract
The interaction between T-cell receptors (TCRs) and the peptide-bound major histocompatibility complex (MHC) intricately impacts the functional specificity of the T-cell-mediated adaptive immune response. Consequently, its implications for immunotherapy have driven a growing body of computational methods for TCR recognition, which have recently come to include structure-based approaches thanks to advances in protein structure modeling. Despite access to structural information of the predicted binding interface, graph neural network (GNN)-based TCR-pMHC binding specificity classifiers tend to show poor accuracy for samples with unseen peptides. In this work, we comprehensively assess the potential factors that critically impact the generalization performance of classifiers trained with computationally predicted structures. Specifically, our experiments focus on analyzing the sensitivity of such predictors to the interaction features in the TCR-pMHC interface and to structural uncertainty. Building on this analysis, we demonstrate how designing the classifier architecture with auxiliary training objectives can improve generalization to novel peptides not seen during model training. Overall, our work highlights the challenges of unseen peptide generalization from different perspectives of the GNN-based classifier paradigm, showcasing the strengths and weaknesses of the current state-of-the-art approaches in the generalization landscape.
bioinformatics 2026-02-23 v1
A Spatio-Temporal Analysis Framework for Characterizing Radiation-Induced Genomic Instability
Chopra, K.; Cucinell, C.; Weinberg, R.; Forrester, S.; Brettin, T.; Kilic, O.; Yoon, B.-J.
AI Summary
- This study developed an analytical framework to investigate the coupling between structural variants and point mutations in human endothelial cells exposed to chronic low-dose gamma radiation.
- The framework revealed a 7.13-fold enrichment of doublet base substitutions (DBS) near inversion breakpoints, with this enrichment diminishing with distance.
- Temporal analysis showed inversions were transient, while DBS persisted, affecting genes critical for genomic stability like DNA damage response and chromatin regulation.
Abstract
Chronic low-dose ionizing radiation induces complex genomic instability encompassing both structural variants and point mutations, yet these alterations are typically analyzed as independent events, limiting detection of mechanistic coupling between rearrangement formation and localized mutagenesis at breakpoint junctions. This gap is particularly consequential given the widespread occupational and environmental exposure contexts (nuclear energy, medical imaging, and environmental contamination), where coupled genomic alterations may contribute to cancer risk through mechanisms invisible to type-agnostic analyses. We developed an integrated analytical framework combining temporal pattern tracking, breakpoint-proximal mutation enrichment analysis, and systematic testing across all structural variant types to resolve these coupled dynamics across dose and time. Applying this framework to whole-genome sequencing data from primary human endothelial cells (HUVEC) exposed to chronic low-dose gamma radiation (0.001-2 mGy/hr) over three weeks, we discovered 7.13-fold enrichment of doublet base substitutions (DBS) within 10 bp of inversion breakpoints, a signal absent from other structural variant types. This enrichment decayed sharply with distance (to ~1.9-fold at 100 bp), indicating localized mutagenesis at these junctions. Temporal analysis revealed divergent fates: inversions appeared transiently (100% single-timepoint) while DBS showed greater persistence (9.0% multi-timepoint). Among the INV-DBS events identified, affected genes include 16 high-constraint loci (pLI ≥ 0.9) involved in DNA damage response, signal transduction, and chromatin regulation, pathways critical for maintaining genomic stability. Our framework provides a generalizable approach for investigating structural variant-mutation relationships, with applications to radiation biology, cancer genomics, and mechanistic studies of DNA repair fidelity.
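One simple way to read a "fold enrichment near breakpoints" figure is as a ratio of mutation densities inside and outside a window around each breakpoint. The Python sketch below implements that reading on a single toy contig. The background model and statistical testing used in the study are not described in the abstract, so the density-ratio definition, the function name, and all coordinates here are assumptions.

```python
def fold_enrichment(mutations, breakpoints, window, genome_length):
    """Mutation density within +/- window bp of breakpoints, divided by the
    density outside those windows.

    Positions are 0-based coordinates on a single contig of genome_length bp.
    Overlapping windows are merged so no base is counted twice.
    """
    intervals = sorted(
        (max(0, b - window), min(genome_length, b + window + 1))
        for b in breakpoints
    )
    merged = []
    for start, end in intervals:
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    near_bp = sum(end - start for start, end in merged)
    near_hits = sum(
        1 for m in mutations if any(s <= m < e for s, e in merged)
    )
    far_bp = genome_length - near_bp
    far_hits = len(mutations) - near_hits
    if near_bp == 0 or far_bp == 0 or far_hits == 0:
        raise ValueError("enrichment undefined for this input")
    return (near_hits / near_bp) / (far_hits / far_bp)

# Hypothetical contig of 10,000 bp with two inversion breakpoints.
breakpoints = [1000, 5000]
mutations = [995, 1005, 3000, 7000, 8000, 9000]
enrichment_10bp = fold_enrichment(mutations, breakpoints, 10, 10_000)
```

Repeating the call with a wider window (for example 100 bp) is one way to examine how a signal decays with distance, in the spirit of the 10 bp versus 100 bp comparison in the abstract.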
bioinformatics 2026-02-23 v1
art_modern: An Accelerated ART Simulator of Diverse Next-Generation Sequencing Reads
YU, Z.
AI Summary
- The study introduces art_modern, an accelerated version of the ART simulator for next-generation sequencing (NGS) data, enhanced with updated algorithms, SIMD instructions, and parallel processing.
- art_modern supports simulation of transcriptome profiling with contig-specific coverage and strand information.
- Benchmarking showed art_modern reduces CPU time by 75-77% and accelerates wall-clock time by 15-24 times compared to the original ART on multi-core systems.
Abstract
Fast simulation of next-generation sequencing (NGS) data is vital for software development and benchmarking. Here we describe art_modern, an accelerated ART simulator that can simulate various NGS data. We accelerated ART using updated sampling algorithms, single-instruction multiple-data (SIMD) instruction-set extensions (ISEs), thread- and node-level parallelism, and an asynchronous output writer, while enabling simulation of transcriptome profiling data by supporting contig-specific coverage with strand information. The new implementation was benchmarked against popular performance-oriented NGS simulators, revealing a 75-77% reduction in CPU time and a 15-24 times acceleration in wall-clock time on a multi-core machine compared to the original implementation. With this simulator, the development and benchmarking of NGS sequence analysis algorithms can be substantially accelerated. Availability and Implementation: The software is implemented in C++17 with CMake as the build system. It can be built and executed on a modern GNU/Linux operating system with Boost, Zlib, and a C++17 compiler, with further acceleration available using Intel oneAPI C++/DPC++ compilers and Intel oneAPI MKL random generators. The software is available at https://github.com/YU-Zhejian/art_modern under the GNU General Public License v3.
bioinformatics2026-02-23v1Universal physical principles govern the deterministic genesis of protein structure
Chuanyang, L.; Liu, J.; Qiu, X.; Wu, X.; Li, W.; Min, L.; Zhang, G.; Zhang, S.; Zhu, L.AI Summary
- The study introduces ProtGenesis, a framework that models protein genesis as a deterministic process within a discrete structural space, governed by three universal principles: Assembly, Emergence, and Phase-Transition.
- These principles describe how amino acids form fractal-like structures, how peptides follow spatial trajectories, and how mutations lead to topological phase shifts in protein structure.
- ProtGenesis provides a mathematical foundation to interpret deep learning models and offers a basis for engineering protein structures.
Abstract
The origin of functional proteins remains a fundamental biological enigma. Although Anfinsen's dogma established sequence as the determinant of structure, and deep learning models can predict structures with high fidelity, the physical principles governing protein genesis itself, from prebiotic condensation to functional protein emergence, remain unresolved. This gap leaves a critical disconnect between mechanistic biological insights and artificial intelligence. Herein, we introduce a unified methodological framework, ProtGenesis, that recasts the genesis of proteins as a structured, deterministic navigation within a discrete structural space. We identify three universal principles governing this hierarchical organization: the Assembly Principle directs amino acid condensation into multilayer fractal-like architectures; the Emergence Principle ensures that the emergence of nascent peptides follows deterministic spatial trajectories; and the Phase-Transition Principle describes how incremental residue accrual or mutation drives precise topological phase shifts from short-range to long-range order. By quantifying these trajectories with novel tripartite spatial metrics, we reveal that protein genesis is not an abstract continuum but a principle-governed physical process with measurable coordinates. ProtGenesis thus provides a universal, interpretable mathematical foundation for decoding the "black box" of deep learning models and establishes a rigorous basis for exploring, understanding, and engineering the molecular blueprint of life.
bioinformatics2026-02-23v1MetaTracer: A nucleotide alignment-based framework for high-resolution taxonomic and transcript assignment in metatranscriptomic data
Furstenau, T.; Shaffer, I.; Hsu, K.-L. C.; Pearson, T.; Ernst, R. K.; Fofanov, V.AI Summary
- MetaTracer is a tool that performs nucleotide alignment for metatranscriptomic analysis, assigning reads to both taxonomic groups and genes in one pass.
- It offers improved accuracy and species-level resolution compared to k-mer or protein-based methods by mapping reads to annotated genomic features.
- Testing on simulated and real dental plaque data showed high accuracy in taxonomic and gene assignment, revealing species-specific transcriptional differences in children with early childhood caries versus healthy controls.
Abstract
Summary: MetaTracer is a nucleotide alignment-based tool for metatranscriptomic analysis of complex bacterial communities that assigns sequence reads to both taxonomic groups and expressed genes in a single pass. Full nucleotide-level alignment improves accuracy relative to k-mer-based classifiers and preserves species-level resolution that is often lost in protein-based approaches. By retaining alignment coordinates and mapping reads directly to annotated genomic features, MetaTracer enables direct attribution of gene expression to specific microbial species. On simulated datasets, MetaTracer achieves high accuracy for both taxonomic and gene assignment. Applied to real dental plaque metatranscriptomic datasets, MetaTracer resolves species-specific transcriptional activity and detects reproducible differences in microbial gene expression between children with early childhood caries and healthy controls. Availability and implementation: MetaTracer is implemented as a Python-based workflow wrapper (metatracer v0.1.0) that depends on the mtsv-tools core engine (v2.1.0), which is written in Rust. The required functionality is supported by the v2.1.0 release of mtsv-tools. Both packages are open source under the MIT license and are available at github.com/FofanovLab/metatracer and github.com/FofanovLab/mtsv-tools. Versioned releases are archived at Zenodo (DOI: 10.5281/zenodo.18665766 and DOI: 10.5281/zenodo.18718002). Installation is supported via Bioconda.
bioinformatics2026-02-23v1BacTaxID: A universal framework for standardized bacterial classification
Fernandez-de-Bobadilla, M. D.; Lanza, V. F.AI Summary
- BacTaxID is a universal framework for bacterial classification using whole-genome k-mer-based sketches, which organizes strains into hierarchical clusters based on user-defined similarity thresholds.
- It provides a direct quantitative link to Average Nucleotide Identity (ANI), showing universal concordance with species and sub-species classifications across 2.3 million genomes from 67 genera.
- BacTaxID outperforms traditional methods in capturing strain-level diversity and replicates SNP and cgMLST-based definitions in surveillance and outbreak scenarios.
Abstract
Bacterial strain typing is key to surveillance, outbreak investigation and microbial ecology, yet current systems remain species-specific, reference-dependent and lack a universal, interpretable metric of genomic relatedness. Here, we introduce BacTaxID, a fully configurable, whole-genome k-mer-based framework that encodes each genome as a numeric sketch and organizes strains into hierarchical clusters with user-defined similarity thresholds. BacTaxID distances are strictly proportional to Average Nucleotide Identity (ANI), providing a direct quantitative link between vectorial typing and genome-wide divergence. Applied to 2.3 million genomes from All the Bacteria across 67 genera, BacTaxID demonstrates universal concordance with species and sub-species classification systems, while capturing finer strain-level diversity than traditional reference-based approaches. In simulated surveillance and real outbreak datasets, BacTaxID reproduces SNP and cgMLST-based definitions while enabling rapid, scalable screening. Precomputed genus-level schemes and an open implementation provide a practical, genus-agnostic alternative to classical typing systems for standardized bacterial classification.
bioinformatics2026-02-22v3Protenix-v1: Toward High-Accuracy Open-Source Biomolecular Structure Prediction
Zhang, Y.; Gong, C.; Zhang, H.; Ma, W.; Liu, Z.; Chen, X.; Guan, J.; Wang, L.; Yang, Y.; Xia, Y.; Xiao, W.AI Summary
- Protenix-v1 (PX-v1) is an open-source biomolecular structure prediction model that outperforms AlphaFold3 with the same constraints, showing improved accuracy with increased sampling budget.
- It includes features like protein template integration and RNA MSA support, with a variant, Protenix-v1-20250630, trained on a larger dataset for enhanced accuracy.
- The study also addresses benchmarking limitations by providing updated evaluation tools and year-stratified benchmarks for more reliable assessments.
Abstract
We introduce Protenix-v1 (PX-v1), the first open-source structure prediction model to attain superior performance to AlphaFold3 while strictly adhering to the same training data cutoff, model size, and inference budget. Beyond standard evaluations, we highlight the effectiveness of inference-time scaling behavior, demonstrating that increasing the sampling budget yields consistent improvements in prediction quality--a behavior previously seen in AlphaFold3 but not in other open-source models. In addition to improved accuracy, Protenix-v1 incorporates key capabilities including protein template integration and RNA MSA support. Furthermore, to better support real-world applications such as drug discovery, we additionally release Protenix-v1-20250630, a variant trained on a larger dataset (cutoff: June 30, 2025), delivering further improved prediction accuracy. Finally, we identify the limitations of current benchmarking tools and we provide updated evaluation tools and year-stratified benchmarks to facilitate more reliable and transparent assessment within the community. Collectively, these contributions provide a robust foundation for the Protenix series and the broader field.
bioinformatics2026-02-22v3Deciphering Features of Metalloprotease Cleavage Targets Using Protein Structure Prediction
Chung, D. S.; Park, J.; Choi, W.; Hong, D.AI Summary
- The study developed a computational model using protein structure prediction to identify and classify substrates of ADAM10, a metalloprotease, and predict its cleavage sites.
- The model analyzed predicted protein complexes, focusing on protein-protein interactions, structural information of cleavage sites, and spatial relationships with metal ions.
- Results showed the model's effectiveness in substrate classification and cleavage site prediction, with potential applications to other ADAM family members and metalloproteases.
Abstract
Metalloproteases are a class of enzymes that utilize metal ions within their active sites to catalyze the hydrolysis of peptide bonds in proteins. Among these, ADAM10 (A Disintegrin and Metalloproteinase 10), a member of the ADAM family, plays a crucial role in mediating intracellular signaling by cleaving specific substrates, thereby influencing a variety of physiological and pathological processes. The mechanisms underlying the activity of ADAM10 present significant opportunities for the development of novel therapeutic strategies aimed at disease intervention. However, the information available to identify the substrates and cleavage sites of ADAM10 is still insufficient. It is therefore essential to identify and classify the features of substrates and to elucidate cleavage sites, but doing so experimentally across numerous proteins presents significant challenges. To address this, we developed a model that leverages protein structure prediction to decipher substrate features, classify substrates, and predict cleavage sites. By analyzing predicted protein complexes between ADAM10 and its substrates using PDB files, we evaluated protein-protein interaction (PPI) data, the structural information of cleavage sites, and the spatial relationships between the cleavage sites and metal ions. Finally, we present a computational model that effectively classifies substrates and accurately predicts cleavage sites. Our study demonstrates the potential for application not only to ADAM10 but also to other members of the ADAM family and, more broadly, to additional metalloproteases. By leveraging computed protein structural information, our approach offers a novel perspective for substrate classification.
bioinformatics2026-02-22v2Bias in genome-wide association test statistics due to omitted interactions
Yelmen, B.; Güler, M. N.; Estonian Biobank Research Team; Kollo, T.; Möls, M.; Charpiat, G.; Jay, F.AI Summary
- This study investigates how omitting interaction terms in linear models used for GWAS can bias test statistics, specifically by altering the mean and variance of the null statistic.
- Through mathematical derivation and simulation using Estonian Biobank data, the study shows that ignoring epistasis can lead to an anti-conservative regime, inflating significance.
- The findings suggest that GWAS results based on linear models might include spurious associations, urging caution in their interpretation.
Abstract
Over the past two decades, genome-wide association studies (GWAS) enabled the discovery of thousands of variants associated with many complex human traits. However, conventional GWAS are still widely performed with linear models with the assumption that the genetic effects are predominantly additive. In this work, we investigate the test statistic behavior when linear models are used to obtain significant genotype-phenotype associations without accounting for epistasis. We first algebraically derive mean and variance shift in the null statistic due to the omitted interaction term, and define the boundary between conservative (i.e., deflated statistic tail) and anti-conservative (i.e., inflated statistic tail) regimes for the common GWAS significance threshold. We then perform phenotype simulation analyses using the Estonian Biobank genotypes and validate the mathematical model. We demonstrate that the anti-conservative regime is plausible under realistic parameter settings and models omitting interaction terms can produce spurious significance. Our findings suggest caution when interpreting statistically significant signals reported in the literature based on linear models, especially for large-scale GWAS.
bioinformatics2026-02-22v2Cellects, a software to quantify cell expansion and motion
Boussard, A.; Petit, M.; Arrufat, P.; Dussutour, A.; Perez-Escudero, A.AI Summary
- Cellects is an open-source software designed for automated quantification of cell expansion, motion, and morphology from 2D and time-lapse images across various biological systems.
- It features a graphical interface for interactive parameter tuning, visualization, and batch processing, with customization options via a Python API.
- The software outputs quantitative data like area and trajectory in .csv format, facilitating statistical analysis and integration into existing research workflows.
Abstract
Cellects is a user-friendly and open-source software for automated quantification of biological growth, motion, and morphology from 2D image data and time-lapse sequences (2D + t), acquired under a wide range of experimental conditions and biological systems (from fungal colonies to unicellular branching networks). The software is available as a stand-alone version, featuring a graphical interface that supports interactive parameter tuning, visualization, validation, and batch processing. The analysis pipeline can be extended and customized using a dedicated Python API. The typical inputs and outputs are as follows. Cellects is designed to process grayscale or color images originating from standard microscopy, macroscopic imaging setups, or camera-based platforms. The software supports single or multiple organisms growing or moving in one or several arenas and can analyze multiple folders sequentially. All quantitative results (area, circularity, orientation axes, centroid trajectories, oscillations, network topology) are exported as standardized .csv files suitable for downstream statistical analysis, ensuring reproducibility and integration into existing workflows.
bioinformatics2026-02-22v2Protenix-v1: Toward High-Accuracy Open-Source Biomolecular Structure Prediction
Zhang, Y.; Gong, C.; Zhang, H.; Ma, W.; Liu, Z.; Chen, X.; Guan, J.; Wang, L.; Yang, Y.; Xia, Y.; Xiao, W.AI Summary
- Protenix-v1 (PX-v1) is introduced as an open-source biomolecular structure prediction model that outperforms AlphaFold3 with the same constraints.
- It shows improved prediction quality with increased sampling budget and includes features like protein template integration and RNA MSA support.
- A variant, Protenix-v1-20250630, trained on a larger dataset, offers enhanced accuracy, and new evaluation tools are provided for better benchmarking.
Abstract
We introduce Protenix-v1 (PX-v1), the first open-source structure prediction model to attain superior performance to AlphaFold3 while strictly adhering to the same training data cutoff, model size, and inference budget. Beyond standard evaluations, we highlight the effectiveness of inference-time scaling behavior, demonstrating that increasing the sampling budget yields consistent improvements in prediction quality--a behavior previously seen in AlphaFold3 but not in other open-source models. In addition to improved accuracy, Protenix-v1 incorporates key capabilities including protein template integration and RNA MSA support. Furthermore, to better support real-world applications such as drug discovery, we additionally release Protenix-v1-20250630, a variant trained on a larger dataset (cutoff: June 30, 2025), delivering further improved prediction accuracy. Finally, we identify the limitations of current benchmarking tools and we provide updated evaluation tools and year-stratified benchmarks to facilitate more reliable and transparent assessment within the community. Collectively, these contributions provide a robust foundation for the Protenix series and the broader field.
bioinformatics2026-02-22v2STELAR-X: Scaling Coalescent-Based Species Tree Inference to 100,000 Species and Beyond
Saha, A.; Bayzid, M. S.AI Summary
- STELAR-X is introduced as a scalable, triplet-based phylogenetic inference algorithm for species tree reconstruction under the multispecies coalescent model, optimized for ultra-large datasets.
- It achieves O(nk) memory complexity for n species and k gene trees, significantly reducing both running time and memory usage compared to existing methods like ASTRAL-MP.
- Experiments showed STELAR-X analyzed datasets with 100,000 taxa in 8.5 hours and 100,000 genes in 4 minutes, demonstrating unprecedented scalability.
Abstract
Summary methods reconstruct species trees from collections of gene trees by accounting for gene tree discordance, and offer a statistically consistent framework for phylogenomic inference under the multispecies coalescent model. While existing triplet- and quartet-based approaches such as ASTRAL and STELAR have provable statistical consistency, their running time and memory usage restrict their applicability to ultra-large datasets. We introduce STELAR-X, a statistically consistent and highly scalable triplet-based phylogenetic inference algorithm that achieves an asymptotically optimal memory complexity of O(nk) for n species and k gene trees--essentially matching the input size and allowing analyses to remain feasible as long as the input trees fit in memory--while also substantially reducing running time. STELAR-X achieves this by a comprehensive re-engineering of the underlying data structures and algorithms. We introduce a novel, compact integer tuple-based encoding of tree bipartitions and efficient procedures for rapid pre-computation of bipartition weights. We further leverage GPU parallelism for fast pre-computation of necessary weights. This improved and redesigned computational framework underpins a dynamic programming algorithm with substantially reduced computational overhead. Extensive experiments demonstrate that STELAR-X achieves unprecedented scalability. On simulated datasets with 10,000 taxa and 1,000 gene trees, STELAR-X runs 712x faster than ASTRAL-MP (the most scalable variant of ASTRAL) while using 7.5x less CPU memory. Most significantly, STELAR-X analyzed a dataset of 100,000 taxa and 1,000 genes in 8.5 hours using 86 GB RAM, and a 100,000-gene dataset with 1,000 taxa in just 4 minutes using 106 GB RAM -- scales that were previously intractable for statistically consistent summary methods. STELAR-X is publicly available at https://github.com/aaniksahaa/STELAR-X.
bioinformatics2026-02-22v2Bacterial protein function prediction via multimodal deep learning
Muzio, G.; Adamer, M.; Fernandez, L.; Miklautz, L.; Borgwardt, K.; Avican, K.AI Summary
- Developed DeepEST, a multimodal deep learning framework to predict bacterial protein functions by assigning GO terms, using gene expression, location, and protein structure.
- DeepEST integrates a multi-layer perceptron for expression/location data and a structure-based predictor, enhanced by a novel masked loss function for bacterial species.
- DeepEST outperformed existing methods on a 25-species benchmark and predicted functions for unclassified proteins in 25 human bacterial pathogens.
Abstract
Bacterial proteins are specialized with extensive functional diversity for survival in diverse and stressful environments. A significant portion of these proteins remains functionally uncharacterized, limiting our understanding of bacterial survival mechanisms. Hence, we developed Deep Expression STructure (DeepEST), a multimodal deep learning framework designed to accurately predict protein function in bacteria by assigning Gene Ontology (GO) terms. DeepEST comprises two modules: a multi-layer perceptron that takes gene expression and gene location as input features, and a protein structure-based predictor. Within DeepEST, we integrated these modules through a learnable weighted linear combination and introduced a novel masked loss function to fine-tune the structure-based predictor for bacterial species. These modeling choices are particularly well suited for bacteria due to the spatial organization of their circular genomes. Functionally related genes frequently co-localize and are co-transcribed within operons, allowing transcription dynamics to serve as crucial, condition-dependent regulatory signals. We show that, on a 25-species benchmark, DeepEST outperforms existing protein function prediction methods that rely solely on amino acid sequence or protein structure. Moreover, DeepEST predicts GO terms for unclassified hypothetical proteins across 25 human bacterial pathogens, facilitating the design of experimental setups for characterization studies. By combining expression, localization, and structure information in a unified deep learning framework, DeepEST bridges organism-specific data integration and structure-based transfer learning, providing a method tailored for bacterial protein function prediction in settings with structural and multi-condition expression data.
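The abstract above mentions a weighted linear combination of two module outputs and a masked loss. The sketch below shows one plausible reading of those two ideas in plain Python: a per-term convex combination of two probability vectors, and a binary cross-entropy averaged only over unmasked GO terms. The function names, the convex form of the weights, and the choice of cross-entropy are assumptions for illustration; the abstract does not specify DeepEST's actual formulation.

```python
import math


def combine(p_expr, p_struct, weights):
    """Per-GO-term combination w * p_expr + (1 - w) * p_struct.

    Assumption: each weight lies in [0, 1]; in a trained model these
    would be learned parameters rather than fixed numbers.
    """
    return [w * a + (1.0 - w) * b for a, b, w in zip(p_expr, p_struct, weights)]


def masked_bce(probs, labels, mask, eps=1e-12):
    """Binary cross-entropy averaged only over entries whose mask is truthy.

    Assumption: masked-out terms (mask == 0) are ones we choose not to
    penalize, for example GO terms with no reliable annotation.
    """
    total, n = 0.0, 0
    for p, y, m in zip(probs, labels, mask):
        if not m:
            continue
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        n += 1
    return total / n if n else 0.0


# Hypothetical scores for two GO terms from the two modules.
combined = combine([0.8, 0.2], [0.4, 0.6], [0.5, 0.5])
print(combined, masked_bce(combined, [1, 0], [1, 0]))
```

Masking means the second term contributes nothing to the loss in this toy call, regardless of its predicted probability.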
bioinformatics2026-02-22v2Leveraging Large Language Models to Extract Prognostic Pathology Features in Ewing Sarcoma
Huang, J.; Batool, A.; Gu, Z.; Zhao, Z.; Yao, B.; Black, J.; Davis, J.; al-Ibraheemi, A.; DuBois, S.; Barkauskas, D.; Ramakrishnan, S.; Hall, D.; Grohar, P.; Xie, Y.; Xiao, G.; Leavey, P. J.AI Summary
- The study aimed to use Large Language Models (LLMs) to extract prognostic histologic features from pathology reports of Ewing sarcoma patients from six COG trials.
- The LLM achieved high accuracy (94-98.1%) in extracting 17 IHC markers, surpassing human annotators in some cases.
- Survival analysis revealed NSE as a negative prognostic marker (HR 2.15) and S100 as a positive one (HR 0.58), suggesting potential for refining risk stratification in Ewing sarcoma.
Abstract
Background: Current risk stratification for Ewing sarcoma relies heavily on clinical factors such as metastatic status, failing to capture histologic heterogeneity as a potential prognostic indicator. Although pathology reports contain rich biological data, this information remains locked in unstructured narrative text, limiting large-scale retrospective analyses. We aimed to validate the utility of Large Language Models (LLMs) for scalable data abstraction and to identify prognostic histologic features from a large multi-institutional cohort. Methods: We conducted a retrospective cohort study using data from six Children's Oncology Group (COG) clinical trials. We utilized an LLM-based pipeline (OpenAI o3) to extract structured variables, including immunohistochemical (IHC) markers and CD99 staining patterns, from digitized, Optical Character Recognition (OCR)-processed pathology reports. Extraction accuracy was validated against a human-annotated ground truth (n=200) and cross-validated against senior experts (n=48). We assessed the association between extracted features and Overall Survival (OS) using Kaplan-Meier analysis and multivariable Cox proportional hazards regression, adjusting for metastatic status. Findings: We analyzed 931 diagnostic pathology reports spanning over 19 years. The LLM achieved a weighted average accuracy of 94% across 17 IHC markers; in a cross-validation subset, the LLM outperformed human annotators (weighted average accuracy over 15 IHC markers: LLM o3: 98.1%, a resident specialist 91.4%, and a senior expert 95.9%). Survival analysis identified Neuron-Specific Enolase (NSE) and S100 as significant prognostic biomarkers. After adjusting for metastatic status, NSE positivity was associated with significantly inferior survival (HR 2.15, 95% CI 1.15-4.02, p=0.016); this risk was most pronounced in patients with non-metastatic disease (HR 5.64, p=0.0055). Conversely, S100 positivity was associated with improved survival (HR 0.58, 95% CI 0.34-1.00, p=0.046). Interpretation: LLM-assisted extraction of pathology variables is highly accurate and scalable, capable of unlocking "dark data" from historical clinical trials. We identified NSE as a potent risk factor and S100 as a protective marker in Ewing sarcoma, particularly in localized disease. These findings suggest that AI-derived histologic data can refine risk stratification and, if validated, warrant inclusion in future prospective trials.
bioinformatics2026-02-22v1Hybrid MD-generative modeling expands RNA ensembles to include cryptic ligand-binding conformations: application to HIV-1 TAR
Kurisaki, I.; Hamada, M.AI Summary
- The study uses Molearn, a hybrid MD-generative deep learning model, to explore cryptic ligand-binding conformations in apo RNA structures, focusing on HIV-1 TAR.
- Molearn was trained on apo TAR conformations to generate a diverse ensemble, from which potential MV2003-binding conformations were identified.
- Docking simulations showed that these generated conformations could bind MV2003 with interaction scores similar to those from NMR structures, demonstrating the model's ability to predict novel ligand-binding RNA states.
Abstract
Cryptic ligand-binding sites--absent in apo structures but formed upon conformational rearrangement--offer high specificity for RNA-ligand recognition, yet remain rare among experimentally resolved RNA-ligand complex structures and difficult to predict in silico. We apply Molearn, a hybrid molecular dynamics-generative deep learning model, to expand apo RNA conformational ensembles to include cryptic states. Focusing on the paradigmatic HIV-1 TAR-MV2003 system, Molearn was trained exclusively on apo TAR conformations and used to generate a diverse ensemble of TAR structures. Candidate cryptic MV2003-binding conformations were subsequently identified using post-generation geometric analyses. Docking simulations of these conformations with MV2003 yielded binding poses with RNA-ligand interaction scores comparable to those of NMR-derived complexes. Notably, this work provides the first demonstration that a generative modeling framework can access cryptic RNA conformations that are ligand-binding competent and have not been observed in prior molecular-dynamics and deep-learning studies.
bioinformatics2026-02-21v7Defining the Active Conformation of Typical Protein Kinase Domains from Substrate-Bound PDB Structures Enables Active-State AlphaFold2 Models for All 437 Human Catalytic Protein Kinases
Gizzio, J.; Faezov, B.; Xu, Q.; Dunbrack, R. L.AI Summary
- The study aimed to define the active conformation of human protein kinase domains by analyzing 248 substrate-bound kinase structures, identifying criteria for the active state.
- Using these criteria, only 30% of the 437 human catalytic kinases were found in active form in the PDB.
- AlphaFold2 was employed to model all 437 kinases in their active form, with 71% of models achieving a backbone RMSD < 1.0 Å to benchmark structures, enhancing understanding of kinase activity in diseases like cancer.
Abstract
Humans have 437 catalytically competent protein kinase domains with the typical kinase fold, similar to the structure of Protein Kinase A (PKA). The active form of a kinase must satisfy requirements for binding ATP, magnesium, and substrate. From structural bioinformatics analysis of 248 crystal structures of 54 unique substrate-bound kinases, we derived structural criteria for the active form of typical protein kinases. We include well-known requirements on the DFG motif of the activation loop and the N-terminal domain salt bridge, but also on the positions of the N-terminal and C-terminal segments of the activation loop that must be placed appropriately to bind substrate. With these criteria, only 130 of the 437 human catalytic protein kinases (30%) are in the Protein Data Bank in their active form. Because the active forms of catalytic kinases are needed for understanding substrate specificity and the effects of mutations on catalytic activity in cancer and other diseases, we used AlphaFold2 to produce models of all 437 human protein kinases in the active form. This was accomplished with templates from the PDB that resemble substrate-bound structures, shallow multiple sequence alignments of orthologs and close paralogs of the query protein, and application of the active-kinase criteria to the output models. We selected models for each kinase based on intramolecular ipSAE scores of the activation loop residues of these models, demonstrating that the highest scoring models have the lowest or close to the lowest RMSD to 29 non-redundant substrate-bound structures in the PDB. A larger benchmark of 117 active kinase structures with solved activation loops in the PDB shows that 71% of the highest scoring AlphaFold2 models had backbone RMSD < 1.0 Å to the benchmark structures and 92% were within 2.0 Å. Models for all 437 catalytic kinases are available at https://dunbrack.fccc.edu/kincore/activemodels. We believe they may be useful for interpreting mutations leading to constitutive catalytic activity in cancer as well as for templates for modeling substrate and inhibitor binding for molecules which bind to the active state.
bioinformatics2026-02-21v2DynaBiomeX: An Interpretable Dual-Strategy Deep Learning Framework for Architectural Noise Filtration in Sparse Longitudinal Microbiome Data
Qureshi, A.; Wahid, A.; Qazi, S.; Shahzad, M. K. K.AI Summary
- DynaBiomeX is a dual-strategy deep learning framework designed to filter noise in sparse longitudinal microbiome data, integrating Stacking Ensembles (Bi-LSTM, GRU) and an adapted Temporal Fusion Transformer (TFT).
- It was validated on 1,871 hematopoietic cell transplantation patients to detect gut dysbiosis, with the ensembles acting as high-sensitivity screeners (ROC-AUC = 0.912) and the TFT as a precision Sentinel (Precision = 1.0, MCC = 0.646).
- The TFT showed superior calibration (ECE = 0.0085), and ablation studies confirmed robustness without clinical covariates (ROC-AUC > 0.81).
Abstract
Longitudinal microbiome datasets present unique challenges due to extreme sparsity, zero-inflation, and non-stationary behavior. Conventional Recurrent Neural Networks (RNNs) struggle to distinguish structural from sampling zeros in these contexts, limiting their utility for Clinical Decision Support (CDS). We introduce DynaBiomeX, an interpretable framework specifically developed for sparse biomedical time-series. It integrates Stacking Ensembles (Bi-LSTM, GRU) with an adapted Temporal Fusion Transformer (TFT) in a unified Screener-Sentinel workflow. The Ensembles optimize collective decision boundaries to maximize sensitivity and minimize missed cases. Concurrently, the TFT functions as a Physiological Gatekeeper, utilizing Gated Residual Networks (GRN) to actively filter stochastic noise from real biological signals. We validated this approach on a multi-modal dataset of 1,871 hematopoietic cell transplantation (HCT) patients to detect gut dysbiosis. Stacking ensembles maximized discriminative performance (ROC-AUC = 0.912), effectively serving as high-sensitivity screeners. In contrast, the Adapted TFT functioned as a precision Sentinel, achieving zero false positives (Precision = 1.0) and high stability (MCC = 0.646). Crucially, the TFT demonstrated superior probabilistic reliability with a low Expected Calibration Error (ECE = 0.0085), addressing the "black-box" overconfidence typical of deep learning models. Ablation studies confirmed predictive robustness even without clinical covariates (ROC-AUC > 0.81). DynaBiomeX couples sensitive screening with precise, calibrated validation to robustly analyze sparse longitudinal data. Validated on microbiome dysbiosis, this framework offers a scalable template for zero-inflated domains like single-cell sequencing and EHR monitoring.
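The abstract above reports Expected Calibration Error (ECE) and the Matthews correlation coefficient (MCC). For readers unfamiliar with these metrics, here is a minimal sketch of one common binned ECE formulation for binary probabilities, together with the standard MCC formula. The equal-width binning, the number of bins, and the toy confusion-matrix counts are assumptions for illustration and are not taken from the DynaBiomeX evaluation.

```python
import math


def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE: weighted mean of |mean predicted prob - observed positive rate|.

    Assumption: equal-width bins over [0, 1]; other binning schemes exist.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_p = sum(p for p, _ in b) / len(b)
        frac_pos = sum(y for _, y in b) / len(b)
        ece += (len(b) / n) * abs(mean_p - frac_pos)
    return ece


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy counts (hypothetical): no false positives, but some missed cases.
print(mcc(tp=2, fp=0, tn=6, fn=2))
print(expected_calibration_error([0.1, 0.1, 0.9, 0.9], [0, 0, 1, 1]))
```

The toy MCC call illustrates why a precision of 1.0 can coexist with an MCC well below 1: zero false positives does not prevent false negatives from lowering the coefficient.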
bioinformatics 2026-02-21 v2
ProteoMapper: Alignment-Aware Identification and Quantitative Analysis of Contextual Motif-Domain Patterns in Protein Families
Sefa, S. M.; Sarkar, J.; Robin, A. H. K.; Uddin, M.
AI Summary
- ProteoMapper integrates domain annotation with motif detection to analyze spatial relationships in protein families, introducing metrics like positional conservation scoring and Motif-Domain Coverage Score (MDCS).
- The tool processes alignments in Excel format, providing rapid analysis and color-coded reports, validated across three protein families with high accuracy.
- In Arabidopsis ERD6-like sugar transporters, MDCS analysis showed PROSITE signatures PS00216 and PS00217 are fully domain-embedded but differ in evolutionary conservation, suggesting subfunctionalization.
Abstract
Protein function depends on interactions between structural domains and regulatory motifs. Yet current tools analyze these elements separately, hindering investigation of disease mutations affecting evolutionarily conserved, structurally constrained motifs. We present ProteoMapper, a computational framework integrating HMMER-based domain annotation with user-defined motif detection to quantify motif-domain spatial relationships in protein families. ProteoMapper introduces two discovery metrics: (1) positional conservation scoring, identifying motifs at identical alignment coordinates in ≥ N% of sequences (default 60%), indicating purifying selection; (2) Motif-Domain Coverage Score (MDCS), quantifying motif embedding within Pfam domains (MDCS=1: fully embedded; MDCS=0: extra-domain). The platform processes Excel-formatted alignments without programming requirements, delivering color-coded reports with conserved motif positions, domain boundaries, and MDCS values. Parallel execution of sequence batches enables rapid analysis (8 motifs were searched in 150 sequences with complete Pfam scanning in <6 seconds on standard hardware). Validation across three protein families confirmed technical accuracy and biological insight. In PLATZ transcription factors (24 proteins), domain predictions achieved 0.94 mean intersection-over-union versus published annotations, exactly reproducing 22 of 23 reported spans. In Arabidopsis ERD6-like sugar transporters (17 proteins), MDCS analysis revealed that the canonical PROSITE signatures PS00216 and PS00217 are equally domain-embedded (MDCS=1.0) but evolutionarily divergent. PS00217 shows positional conservation (58.8% of sequences) while PS00216 exhibits dispersal, suggesting subfunctionalization. In tomato actin-depolymerizing factors (11 proteins), domain detection achieved 100% sensitivity with >93% positional concordance.
ProteoMapper enables hypothesis-driven investigation of evolutionary constraints, regulatory mechanisms, and variant effect prediction in biomedical and functional proteomics. Source code, documentation, and test results with datasets at https://github.com/sifullah0/ProteoMapper.
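The two metrics described above can be illustrated with a small sketch. The abstract gives only the endpoints of MDCS (1 for fully embedded, 0 for extra-domain), so the fraction-of-residues formula below is an assumption, as are the function names, the inclusive coordinate convention, and the one-hit-per-sequence simplification. It is not ProteoMapper's actual code.

```python
from collections import Counter


def mdcs(motif_start, motif_end, domains):
    """Assumed MDCS: fraction of motif residues lying inside at least one domain.

    Coordinates are inclusive; `domains` is a list of (start, end) tuples.
    """
    length = motif_end - motif_start + 1
    covered = sum(
        1
        for pos in range(motif_start, motif_end + 1)
        if any(d_start <= pos <= d_end for d_start, d_end in domains)
    )
    return covered / length


def positional_conservation(motif_columns, n_sequences, threshold=0.60):
    """Return alignment columns where the motif starts in >= threshold of sequences.

    `motif_columns` holds one alignment start column per sequence that
    contains the motif (a simplification: one hit per sequence).
    """
    counts = Counter(motif_columns)
    return {
        col: count / n_sequences
        for col, count in counts.items()
        if count / n_sequences >= threshold
    }


# Hypothetical coordinates, chosen only to show the three MDCS regimes.
print(mdcs(10, 19, [(5, 30)]))    # motif entirely inside the domain
print(mdcs(10, 19, [(15, 30)]))   # motif half inside the domain
print(mdcs(10, 19, [(40, 50)]))   # motif outside every domain

# 12 of 17 sequences share start column 12, which exceeds the 60% default.
conserved = positional_conservation([12] * 12 + [40] * 5, n_sequences=17)
print(conserved)
```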
bioinformatics 2026-02-20 v1
Geometric-aware and interpretable deep learning for single-cell batch correction via explicit disentanglement and optimal transport
Jiang, C.; Zheng, R.; Ji, Y.; Cao, S.; Fang, Y.; Wang, Z.; Wang, R.; Liang, S.; Tao, S.
AI Summary
- The study introduces iDLC, a deep learning framework for single-cell RNA sequencing batch correction, using explicit feature disentanglement and optimal transport for dual-level correction.
- iDLC separates biological from technical components in a structured latent space and uses mutual nearest neighbors for geometric alignment.
- Evaluations on various datasets show iDLC effectively removes batch effects, preserves cell subtypes, and outperforms existing methods in both correction and biological fidelity.
Abstract
Single-cell RNA sequencing enables high-resolution characterization of cellular heterogeneity, yet integrating datasets from diverse sources remains challenging due to batch effects. Current methods rely on implicit feature disentanglement and lack geometric constraints, often resulting in under-correction, over-correction, or compromised biological fidelity. Here, we present iDLC, an interpretable deep learning framework that performs dual-level correction through explicit feature disentanglement and optimal transport-regularized adversarial alignment. iDLC separates biological and technical components within a structured latent space, then leverages high-confidence mutual nearest neighbor pairs to guide geometrically constrained distribution alignment. Systematic evaluation across pancreatic cancer datasets with varying batch effect intensities, multi-source human immune cells, and large-scale cross-species atlases demonstrates that iDLC robustly eliminates complex batch effects while preserving fine-grained cell subtypes, continuous developmental trajectories, and rare populations. The framework scales efficiently to datasets exceeding one million cells and consistently outperforms existing methods in both batch correction and biological conservation metrics. iDLC provides a principled and reliable tool for constructing unified single-cell reference atlases across diverse experimental conditions and biological systems.
bioinformatics 2026-02-20 v1
OT-knn: a neighborhood-aware optimal transport framework for aligning spatial transcriptomics data
Song, J.; Li, Q.
AI Summary
- OT-knn is introduced as a method for aligning spatial transcriptomics (ST) data by integrating local neighborhood information into an optimal transport framework.
- It reconstructs each spot using its k-nearest neighbors to capture microenvironment context, enhancing robustness against noise and variability.
- Evaluations on simulated and real datasets, including human and mouse brain data, show OT-knn achieves accurate alignment despite spatial deformation, donor heterogeneity, and developmental variation.
Abstract
Spatial transcriptomics (ST) measures gene expression while preserving spatial context within tissues, enabling detailed characterization of tissue organization. As ST technologies advance, aligning datasets across tissue sections, individuals, platforms, and developmental stages has become increasingly important but remains challenging due to sparse expression, biological heterogeneity, and geometric distortions between slices. We introduce OT-knn, a method for ST alignment that integrates local neighborhood information within an optimal transport framework. Rather than relying solely on single-spot expression, OT-knn reconstructs each spot using its spatial k-nearest neighbors, capturing microenvironment context that is more robust to noise and variability. These representations are then used to derive probabilistic correspondences between slices. We evaluate OT-knn using simulated data with known ground-truth alignment and real datasets from multiple ST platforms, including human dorsolateral prefrontal cortex data (10x Genomics Visium), mouse brain aging data with both within-donor and cross-donor comparisons (MERFISH), and a multi-stage axolotl brain dataset (Stereo-seq). Across these settings, OT-knn achieves accurate and robust alignment, particularly in the presence of spatial deformation, donor heterogeneity, and developmental variation.
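The general idea of combining neighborhood-based spot representations with optimal transport can be sketched as follows. This is not the authors' implementation: the abstract does not say how spots are reconstructed or which OT solver is used, so the plain neighborhood mean, the squared-Euclidean cost, the uniform marginals, the entropic Sinkhorn solver, and all function names are assumptions made for illustration.

```python
import math


def knn_smooth(coords, expr, k):
    """Represent each spot by the mean expression of itself and its k nearest spatial neighbors."""
    n_genes = len(expr[0])
    smoothed = []
    for ci in coords:
        order = sorted(range(len(coords)), key=lambda j: math.dist(ci, coords[j]))
        hood = order[: k + 1]  # the k + 1 closest spots, including the spot itself
        smoothed.append(
            [sum(expr[j][g] for j in hood) / len(hood) for g in range(n_genes)]
        )
    return smoothed


def sinkhorn(cost, reg=0.1, n_iter=200):
    """Entropic optimal transport between uniform marginals (Sinkhorn iterations)."""
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    a, b = 1.0 / n, 1.0 / m
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iter):
        u = [a / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def align(coords_a, expr_a, coords_b, expr_b, k=1, reg=0.1):
    """Probabilistic spot-to-spot correspondence between two slices."""
    rep_a = knn_smooth(coords_a, expr_a, k)
    rep_b = knn_smooth(coords_b, expr_b, k)
    cost = [
        [sum((x - y) ** 2 for x, y in zip(ra, rb)) for rb in rep_b] for ra in rep_a
    ]
    return sinkhorn(cost, reg)


# Toy slices with one gene: slice B is slice A with its two regions swapped.
coords = [(0, 0), (1, 0), (5, 0), (6, 0)]
plan = align(coords, [[0.0], [0.0], [1.0], [1.0]],
             coords, [[1.0], [1.0], [0.0], [0.0]])
for row in plan:
    print([round(p, 4) for p in row])
```

Each row of `plan` distributes a spot's mass over the spots of the other slice; in the toy case most of the mass moves to the region with matching neighborhood expression. For real data, a maintained OT library would be a more practical choice than this hand-written loop, which can underflow when costs are large relative to `reg`.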
bioinformatics 2026-02-20 v1
SuperCell2.0 enables semi-supervised construction of multimodal metacell atlases
Herault, L.; Gabriel, A. A.; Duc, B.; Dolfi, B.; Shah, A.; Joyce, J. A.; Gfeller, D.
AI Summary
- SuperCell2.0 is introduced as a workflow for constructing semi-supervised multimodal metacells from large single-cell datasets.
- It was found that multimodal metacells outperform single-modality metacells, enhancing inter-modality consistency and integration of multiomic data.
- The workflow identified interferon-primed monocytes and macrophages in blood and tumor samples, with markers used to characterize this population in healthy donors.
Abstract
Multimodal single-cell atlases comprising hundreds of thousands of cells provide unique resources for exploring complex biological tissues and generating testable hypotheses. To streamline the analysis of such large datasets, we introduce SuperCell2.0, a robust workflow to build (semi-)supervised multimodal metacells. We demonstrate that multimodal metacells outperform metacells built with a single modality, improve inter-modality consistency, and facilitate integration of multiomic single-cell datasets. SuperCell2.0 can further leverage full or partial cell type annotations to improve metacell quality. This workflow enables us to construct multimodal metacell atlases from blood and tumor samples and identifies interferon-primed monocytes and macrophages in the circulation and in the tumor microenvironment. Markers derived from the metacell analysis enable us to sort and phenotypically characterize this population in healthy donors. Overall, our work demonstrates how SuperCell2.0 facilitates the analysis of large multimodal single-cell atlases.
bioinformatics 2026-02-20 v1
wavess 1.2: Presenting an HLA-aware within-host virus sequence simulation framework
Lapp, Z.; Leitner, T.
AI Summary
- The study extends the wavess framework to simulate within-host virus sequence evolution by incorporating an HLA-aware CD8+ CTL response and variable recombination rates.
- This allows for more accurate modeling of virus sequences, especially in regions influenced by CTLs, and supports investigations into how these mechanisms affect within-host evolution.
Abstract
Motivation: Understanding how virus sequences are shaped by selection can inform vaccine design and transmission inference. Modeling within-host evolution to interrogate these questions requires a detailed mechanistic framework that accurately captures sequence diversification. The CD8+ cytotoxic T-lymphocyte (CTL) response plays an important role in immune-mediated selection and can leave strong signatures in virus sequences; however, existing sequence-based within-host virus modeling frameworks do not explicitly include an HLA-aware CTL response. Results: We extended our previously published within-host sequence evolution simulator, wavess, to include an explicit CTL response, and share a method for identifying HLA-specific CTL epitopes given a founder virus sequence. We also updated the model to permit a variable recombination rate, which allows for modeling recombination hotspots, non-adjacent genes, and segmented genomes. These extensions to wavess allow for more accurate simulation of viruses and virus genes, particularly in regions of the genome where the immune response is dominated by CTLs (rather than antibodies). They also provide the foundation for investigations of how these newly added biological mechanisms influence within-host evolution. Availability and implementation: The core of wavess is written in Python 3, with helper functions written in R. It is available at https://github.com/MolEvolEpid/wavess.
bioinformatics 2026-02-20 v1
Prediction of ligand-dependent conformational sampling of ABC transporters by AlphaFold3 and correlation to experimental structures and energetics
Tang, Q.; Mchaourab, H.; Wu, T.; Soubasis, B.
AI Summary
- This study uses AlphaFold3 to predict nucleotide-dependent conformational changes in ABC transporters, comparing these predictions to experimental structures.
- AlphaFold3 accurately samples known conformations and correlates with experimental dynamics, also predicting previously unobserved conformations.
- The study suggests that AlphaFold3's predictions might extrapolate from known structures, as sequence determinants influence the predicted conformational changes.
Abstract
The AlphaFold3 architecture represented an important leap relative to AlphaFold2 by enabling the inclusion of protein ligands in the prediction network. Ligand-dependent structural rearrangements are inherently difficult to predict computationally as they imply transitions between states separated by large energy differences. Here we apply AlphaFold3 to predict nucleotide-dependent changes in the conformational cycle of representative ABC transporters that have been extensively investigated by experimental structural biology techniques. We show that under similar conditions, AlphaFold3 predictions sample experimentally observed conformations. Moreover, the heterogeneity of these predictions correlates with experimental measures of dynamics obtained from multiple techniques. For a couple of the tested transporters, the implied relative energetics of the conformations mirror their experimental counterparts. Remarkably, AlphaFold3 predicts previously unobserved conformations that have been implied to be sampled by ABC transporters. Finally, we report preliminary results showing that postulated sequence determinants of conformational changes modify the predictions of AlphaFold3. Although hundreds of ABC transporter structures have been determined and were included in the training data of AF3, we propose that aspects of its predictions reflect extrapolation of principles learned from these structures.
bioinformatics 2026-02-20 v1
A New Sparse Bayesian Quantile Neural Network-based Approach and Its Application to Discover Physiological Sweet Spots in the Canadian Longitudinal Study on Aging
Min, J.; Vishnyakova, O.; Brooks-Wilson, A.; Elliott, L. T.
AI Summary
- The study introduces Q-FSNet and Q-DirichNet, neural network frameworks integrating quantile regression for identifying physiological sweet spots in high-dimensional data.
- Using data from the Canadian Longitudinal Study on Aging, these methods identified 25 metabolites with optimal ranges that minimize biological age acceleration.
- The findings suggest dietary and gut microbiome-derived metabolites as potential biomarkers for healthy aging, supported by existing literature.
Abstract
Identifying physiological sweet spots (optimal ranges for homeostasis) is essential for precision medicine. However, traditional statistical methods often rely on globally linear or locally jagged models that struggle to capture the smooth, non-linear nature of biological regulation in high-dimensional data. We present the Quantile Feature Selection Network (Q-FSNet), a neural network-based framework that integrates quantile regression, feature selection, and uncertainty estimation to identify biomarkers with sweet spots. Unlike traditional methods, Q-FSNet learns continuous response curves without requiring a pre-specified number of change points. We further introduce the Quantile Dirichlet Network (Q-DirichNet), a fully Bayesian extension that utilizes Dirichlet priors to automate feature shrinkage. Using data from the Canadian Longitudinal Study on Aging, we identified 25 metabolites with distinct homeostatic ranges for which biological age acceleration is minimized. The metabolites with sweet spots for biological aging include some derived from diet or produced by the gut microbiome; this highlights their potential for knowledge translation and public health impact. Our results, corroborated by existing literature, demonstrate that these sparse neural network-based methods offer a scalable and interpretable tool for discovering metabolic signatures of healthy aging vs. dysregulation in large-scale omics research.
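Quantile regression, which the abstract names as a component of Q-FSNet, is conventionally trained with the pinball (check) loss. The sketch below shows that standard loss only; it is not the Q-FSNet or Q-DirichNet objective, whose exact form the abstract does not give, and the function name and toy data are hypothetical.

```python
def pinball_loss(y_true, y_pred, tau):
    """Mean pinball (quantile) loss for quantile level tau in (0, 1).

    Under-prediction (y > q) is penalized with weight tau,
    over-prediction (y < q) with weight 1 - tau.
    """
    total = 0.0
    for y, q in zip(y_true, y_pred):
        diff = y - q
        total += max(tau * diff, (tau - 1.0) * diff)
    return total / len(y_true)


# Toy data with one outlier. The constant that minimizes the tau = 0.5 loss
# is the median, which is why quantile fits are robust to extreme values.
data = [1.0, 2.0, 3.0, 4.0, 100.0]
best_constant = min(data, key=lambda q: pinball_loss(data, [q] * len(data), 0.5))
print(best_constant)
```

Changing `tau` shifts the minimizer to other quantiles, which is how a network trained on this loss can trace lower and upper response curves rather than only a mean.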
bioinformatics 2026-02-20 v1