Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
BESTish: A Diffusion-Approximation Framework for Inferring Selection and Mutation in Clonal Hematopoiesis
Wang, R.-Y.; Dinh, K. N.; Taketomi, K.; Pang, G.; King, K. Y.; Kimmel, M.
AI Summary
- The study introduces BESTish, a Bayesian inference method for analyzing clonal hematopoiesis (CH) by modeling the dynamics of wild-type and mutant hematopoietic stem cells with an environmental parameter affecting death rates.
- BESTish uses derived mean-field dynamics and Gaussian-Markov approximations to estimate mutation fitness, mutation rate, and environmental strength from VAF datasets.
- Applied to CH datasets, BESTish consistently estimates mutation fitness and rates, revealing patient-specific mutation behavior and identifying variants in non-homeostatic environments.
Abstract
Clonal hematopoiesis (CH) arises when hematopoietic stem cells (HSCs) gain a fitness advantage from somatic mutations and expand, resulting in an increase in variant allele frequency (VAF) over time. To analyze CH trajectories, we develop a state-dependent stochastic model of wild-type and mutant HSCs, in which an environmental parameter in [0, 1] regulates death rates and interpolates between the homeostatic (Moran-like, parameter = 1) and growth-facilitating (parameter < 1) regimes. Using a functional law of large numbers and central limit theorems, we derive explicit mean-field dynamics and a Gaussian-Markov approximation for VAF fluctuations. We show that the mean VAF trajectory has an explicit logistic form determined by selective advantage, while environmental effects affect only the variance and autocovariance structure. Building on these results, we introduce BESTish (Bayesian estimate for selection incorporating scaling-limit to detect mutant heterogeneity), a novel, efficient and accurate Bayesian inference method that can be applied to analyze both cohort-level and longitudinal VAF datasets. BESTish implements the closed-form finite-dimensional distributions that we derive to estimate mutation fitness, mutation rate, and environmental strength for individual CH drivers. When applied to existing CH datasets, BESTish produces consistent mutation fitness inferences across different studies, and estimates CH driver mutation rates in agreement with independent experimental studies. Furthermore, BESTish reveals patient-specific heterogeneity in the selective behavior of recurrent mutations, and identifies variants whose dynamics are compatible with non-homeostatic, growth-facilitating environments. BESTish provides a unified and mechanistic framework for quantifying CH evolution, with potential applications for other biological systems where clonal expansions can be measured.
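As a quick illustration of the logistic mean-VAF form the abstract derives, here is a minimal numerical sketch; the parameterization (initial clone fraction x0, saturation at VAF 0.5 for a heterozygous driver) is an illustrative assumption, not BESTish's exact formulation.

```python
import numpy as np

def logistic_vaf(t, s, x0, v_max=0.5):
    """Mean VAF trajectory for a clone with selective advantage s per year.

    x0    : initial mutant clone fraction among HSCs (hypothetical value)
    v_max : saturation VAF (0.5 for a heterozygous driver)
    """
    x = x0 * np.exp(s * t) / (1.0 + x0 * (np.exp(s * t) - 1.0))
    return v_max * x

# Example: a driver with a 15%/year fitness advantage, seeded at frequency 1e-4
years = np.linspace(0, 60, 7)
print(np.round(logistic_vaf(years, s=0.15, x0=1e-4), 4))
```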
bioinformatics · 2026-02-05 · v2

PlotGDP: an AI Agent for Bioinformatics Plotting
Luo, X.; Shi, Y.; Huang, H.; Wang, H.; Cao, W.; Zuo, Z.; Zhao, Q.; Zheng, Y.; Xie, Y.; Jiang, S.; Ren, J.
AI Summary
- PlotGDP is an AI agent-based web server designed for creating high-quality bioinformatics plots using natural language commands.
- It leverages large language models (LLMs) to process user-uploaded data on a remote server, eliminating the need for coding or environment setup.
- The platform uses curated template scripts to ensure plot accuracy, aiming to enhance bioinformatics visualization for global research.
Abstract
High-quality bioinformatics plotting is important for biology research, especially when preparing for publications. However, the long learning curve and complex coding environment configuration often appear as inevitable costs towards the creation of publication-ready plots. Here, we present PlotGDP (https://plotgdp.biogdp.com/), an AI agent-based web server for bioinformatics plotting. Built on large language models (LLMs), the intelligent plotting agent is designed to accommodate various types of bioinformatics plots, while offering easy usage with simple natural language commands from users. No coding experience or environment deployment is required, since all the user-uploaded data is processed by LLM-generated codes on our remote high-performance server. Additionally, all plotting sessions are based on curated template scripts to minimize the risk of hallucinations from the LLM. Aided by PlotGDP, we hope to contribute to the global biology research community by constructing an online platform for fast and high-quality bioinformatics visualization.
bioinformatics · 2026-02-05 · v2

Prediction of protein-carbohydrate binding sites from protein primary sequence
Nafi, M. M. I.; Nawar, Q. F.; Islam, T. N.; Rahman, M. S.
AI Summary
- This study developed StackCBEmbed, an ensemble machine learning model to predict protein-carbohydrate binding sites from protein primary sequences.
- StackCBEmbed integrates traditional sequence-based features with features from a pre-trained transformer-based protein language model.
- It achieved sensitivity and balanced accuracy scores of 0.730/0.776 and 0.666/0.742 on two independent test sets, outperforming previous models.
Abstract
Background: A protein is a large, complex macromolecule that has a crucial role in performing most of the work in cells and tissues. It is made up of one or more long chains of amino acid residues. Another important biomolecule, after DNA and protein, is carbohydrate. Carbohydrates interact with proteins to run various biological processes. Several biochemical experiments exist to study protein-carbohydrate interactions, but they are expensive, time-consuming, and challenging. Therefore, developing computational techniques for effectively predicting protein-carbohydrate binding interactions from protein primary sequence has given rise to a prominent new field of research. Result: In this study, we propose StackCBEmbed, an ensemble machine learning model to effectively classify protein-carbohydrate binding interactions at the residue level. StackCBEmbed combines traditional sequence-based features with features derived from a pre-trained transformer-based protein language model. To the best of our knowledge, ours is the first attempt to apply a protein language model to predicting protein-carbohydrate binding interactions. StackCBEmbed achieved sensitivity and balanced accuracy scores of 0.730, 0.776 and 0.666, 0.742 on two separate independent test sets. This performance is superior to that of earlier prediction models benchmarked on the same datasets. Conclusion: We thus hope that StackCBEmbed will help discover novel protein-carbohydrate interactions and advance the related fields of research. StackCBEmbed is freely available as Python scripts at https://github.com/nafiislam/StackCBEmbed.
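A minimal sketch of the stacking idea described above, with random stand-in features (the base learners, the 20-dim "sequence" block, and the 1280-dim "PLM embedding" block are assumptions for illustration, not the authors' exact configuration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Stand-in per-residue features: 20-dim "sequence" block + 1280-dim "PLM" block
X = np.hstack([rng.normal(size=(500, 20)), rng.normal(size=(500, 1280))])
y = rng.integers(0, 2, size=500)  # 1 = carbohydrate-binding residue (toy labels)

stack = StackingClassifier(
    estimators=[("svm", SVC(probability=True)),
                ("rf", RandomForestClassifier(n_estimators=100))],
    final_estimator=LogisticRegression(max_iter=1000),
)
stack.fit(X, y)
print(stack.predict_proba(X[:3])[:, 1])  # per-residue binding probabilities
```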
bioinformatics · 2026-02-05 · v2

Constrained Evolutionary Funnels Shape Viral Immune Escape
Huot, M.; Wang, D.; Shakhnovich, E.; Monasson, R.; Cocco, S.
AI Summary
- This study presents a probabilistic framework to predict how viral proteins adapt under immune pressure, focusing on SARS-CoV-2's receptor binding domain.
- The framework identifies 'escape funnels' where viable mutations for immune evasion occur, constrained by protein viability and antibody escape.
- It predicts mutation sites, explains antibody-cocktail effectiveness, and shows that cocktails with de-correlated escape profiles force longer, costlier viral adaptation paths.
Abstract
Understanding how viral proteins adapt under immune pressure while preserving viability is crucial for anticipating antibody-resistant variants. We present a probabilistic framework that predicts viral escape trajectories and shows that immune evasion is channeled into a small set of viable "escape funnels" within the vast mutational space. These escape funnels arise from the combined constraints of protein viability and antibody escape, modeled using a generative model trained on homologs and deep mutational scanning data. We derive a mean-field approximation of evolutionary path ensembles, enabling us to quantify both the fitness and entropy of escape routes. Applied to the SARS-CoV-2 receptor binding domain, our framework reveals convergent evolution patterns, predicts mutation sites in variants of concern, and explains differences in antibody-cocktail effectiveness. In particular, cocktails with de-correlated escape profiles slow viral adaptation by forcing longer, higher-cost escape paths.
bioinformatics · 2026-02-05 · v2

CAMUS: Scalable Phylogenetic Network Estimation
Willson, J.; Warnow, T.
AI Summary
- CAMUS is a new method for estimating phylogenetic networks, designed to handle larger datasets by maximizing quartet trees within a constrained tree framework.
- Simulation studies under the Network Multi-Species Coalescent showed CAMUS to be highly accurate, fast, and scalable, outperforming PhyloNet-MPL and SNaQ in speed and dataset size capacity.
- While slightly less accurate than PhyloNet-MPL without a fixed tree, CAMUS can process datasets with up to 201 species, compared to PhyloNet-MPL's limit of 51 species.
Abstract
Motivation: Phylogenetic networks are models of evolution that go beyond trees, and so represent reticulate events such as horizontal gene transfer or hybridization, which are frequently found in many taxa. Yet, the estimation of phylogenetic networks is extremely computationally challenging, and nearly all methods are limited to very small datasets with perhaps 10 to 15 species (some limited to even smaller numbers). Results: We introduce CAMUS (Constrained Algorithm Maximizing qUartetS), a scalable method for phylogenetic network estimation. CAMUS takes an input constraint tree T as well as a set Q of unrooted quartet trees that it derives from input, and returns a level-1 phylogenetic network N that is built upon T through the addition of edges, in order to maximize the number of quartet trees in Q that are induced in N. We perform a simulation study under the Network Multi-Species Coalescent and show that a simple pipeline using CAMUS provides high accuracy and outstanding speed and scalability, in comparison to two leading methods, PhyloNet-MPL used with a fixed tree and SNaQ. CAMUS is slightly less accurate than PhyloNet-MPL used without a fixed tree, but is much faster (minutes instead of hours) and can complete on inputs with 201 species while PhyloNet-MPL fails to complete on the inputs with more than 51 species. Availability and Implementation: The source code is available at https://github.com/jsdoublel/camus.
bioinformatics · 2026-02-05 · v2

Integrating SHAPE Probing with Direct RNA Nanopore Sequencing Reveals Dynamic RNA Structural Landscapes
White Bear, J.; De Bisschop, G.; Lecuyer, E.; Waldispühl, J.
AI Summary
- The study introduces Dashing Turtle (DT), an algorithm that integrates SHAPE probing with direct RNA nanopore sequencing to analyze RNA structural dynamics.
- DT uses probabilistic, weighted, stacked ensemble learning to detect structural modifications with high resolution, achieving 10-20% higher accuracy and 10-30% better structural feature identification than existing methods.
- Applied to well-characterized RNAs, DT identifies dominant conformations, conserved regions, and correlates well with known structures, enhancing understanding of RNA folding and interaction dynamics.
Abstract
Traditional SHAPE experiments rely on averaged reactivities, which may limit information on folding patterns, alternate structures, and RNA dynamics. Short-read sequencing often suffers from false stopping, stalls, and biases during reverse transcription. The introduction of direct, long-read nanopore technology offers an opportunity to expand RNA structure probing methods to better understand RNA structural diversity. While many comparative approaches have been developed for detection of endogenous modifications, fewer have explored the expansion of SHAPE-based methods. We introduce Dashing Turtle (DT), an algorithm using probabilistic, weighted, stacked ensemble learning to perform high-resolution detection of structural modifications that can capture detailed information about RNA architecture across dynamic structural landscapes. We apply our method to several well-characterized RNA samples, identifying dominant conformations and structurally conserved regions. We show that our landscapes correlate well with expected structures and recapitulate important functional elements. DT achieves accuracy 10-20% higher than comparable methods on many sequences. It accurately identifies structural features at a rate of 80-100%, approximately 10-30% better than its peers. DT's predictions are robust across replicates and sub-sampled datasets and can help detect changes in conformational states, inform RNA folding mechanisms, and indicate interaction efficiency. Overall, it expands the capabilities of direct RNA sequencing and structural probing.
bioinformatics · 2026-02-05 · v2

The genetic repertoire of the deep sea: from sequence to structure and function
Guo, Y.; Wang, Z.; Li, D.; Wang, L.; Lan, H.; Guo, F.; Zhao, Z.; Liu, Z.; Meng, L.; Shen, X.; Wang, M.; Zhao, W.; Zhang, W.; Kong, C.; Shi, L.; Sun, Y.; Seim, I.; Jiang, A.; Ma, K.; Su, Z.; Zhang, N.; Ji, Q.; Chen, J.; Chen, K.; Qi, C.; Li, B.; He, B.; Liu, Y.; Zhou, J.; Zheng, Y.; Zhang, H.; Wang, Y.; Han, M.; Yang, T.; Tong, J.; Zhang, Y.; Wang, Z.; Xu, X.; Chen, J.; Liu, Y.; Chen, H.; Zeng, T.; Wei, X.; Li, C.; Yang, H.; Wang, B.; Liu, X.; Shao, C.; Zhang, W.; Gu, Y.; Xiao, X.; Xu, X.; Wang, J.; Mock, T.; Fan, G.; Li, Y.; Liu, S.; Dong, Y.
AI Summary
- This study presents a comprehensive genetic dataset from the deep sea, including 502 million genes and 2.4 million predicted structures, to explore the link between genetic variants and protein structures adapted to deep-sea conditions.
- Analysis showed high sequence diversity but conserved protein structures, with proteins involved in replication, recombination, and repair evolving rapidly.
- A structurally divergent helicase was identified, showing potential in controlling nanopore sequencing speed, highlighting the deep sea's role in biotechnology.
Abstract
The deep sea, as the largest and perhaps most hostile environment on Earth, is still underexplored, especially regarding its genetic repertoire. Yet previous work has revealed significant habitat-specific deep-sea biodiversity. Here, we present an integrated deep-sea genetic dataset comprising 502 million nonredundant genes from 2,138 samples and 2.4 million predicted structures, and use it to link specific protein structures with genetic variants associated with life in the deep sea and to assess their biotechnology potential. Combining global sequence analysis with biophysical and biochemical measurements revealed unprecedented sequence diversity, yet substantial structural conservation of proteins. In particular, proteins involved in replication, recombination, and repair were identified to be under rapid evolution and to have specialized properties. Among these, a structurally divergent helicase exhibited advantages in controlling nanopore sequencing speed. Thus, our work positions the deep sea as a unique evolutionary engine that generates and hosts genetic diversity and bridges genetic knowledge with biotechnology.
bioinformatics · 2026-02-05 · v1

PRISM: Niche-informed Deciphering of Incomplete Spatial Multi-Omics Data
Mu, S.; Wang, Z.; Liao, Y.; Liang, J.; Zhang, D.; Wang, C.; Xie, J.; Sheng, X.; Zhang, T.; Huang, W.; Song, J.; Yuan, Z.; Cai, H.
AI Summary
- PRISM is a computational method designed to address incomplete spatial multi-omics data by using a niche-informed graph to propagate information from paired to unpaired regions.
- It was benchmarked on five datasets, showing superior performance in spatial multi-omics analysis.
- Application to Parkinson's disease data demonstrated PRISM's ability to accurately recover dopamine-related spatial domains and metabolite distributions obscured by data gaps.
Abstract
Spatial multi-omics data, characterizing diverse molecular layers, have become indispensable for the in situ analysis of tissue architecture and complex biological processes. Nevertheless, spatial multi-omics experiments are often hindered by incompatible sequencing protocols, resulting in incomplete spatial multi-omics pairing due to inconsistent fields of view or spatial resolution. To address this, we present PRISM, a novel computational method tailored for this scenario. PRISM leverages a niche-informed graph to propagate information from paired to unpaired regions, jointly achieving spatial domain identification and spatial omics imputation. Extensive benchmarking on five diverse simulated and real experimental datasets demonstrated that PRISM outperformed existing methods in spatial multi-omics analysis tasks. Application to human Parkinson's disease data revealed that PRISM accurately recovered dopamine-associated spatial domains and metabolite distributions masked by incomplete data gaps. PRISM offers a robust solution for bridging the integration gap inherent in incompatible sequencing protocols, thereby facilitating more accurate downstream biological interpretation.
bioinformatics · 2026-02-05 · v1

A Shape Analysis Algorithm Quantifies Spatial Morphology and Context of 2D to 3D Cell Culture for Correlating Novel Phenotypes with Treatment Resistance
Nguyen, D. H.
AI Summary
- The study addresses the limitation of traditional metrics in capturing spatial context like chirality in cell morphology, which is crucial for understanding treatment resistance.
- A novel algorithm, the Linearized Compressed Polar Coordinates (LCPC) Transform, was developed to quantify spatial morphology by converting 2D cell contours into discrete sinusoid waves, followed by Fast Fourier Transform analysis.
- This approach allows for multidimensional representation of cell shapes in 2D and 3D cultures, potentially revealing insights into how morphological phenotypes correlate with resistance to anti-cancer treatments.
Abstract
Numerous studies have shown that the morphological phenotype of a cell or organoid correlates with its susceptibility to anti-cancer agents. However, traditional methods of measuring phenotype rely on spatial metrics such as area, volume, perimeter, and signal intensity, which work but are limited. These approaches cannot measure many crucial features of spatial context, such as chirality, the property of having left- and right-handedness. Volume cannot register chirality because a left shoe and a right shoe harbor the same amount of volume. Though spatial context in the form of chirality, the direction of gravity, and the axis of polarity are intuitive notions to humans, the traditional metrics relied on by cell biologists, pathologists, radiologists, and machine learning scientists up to this point cannot register these fundamental notions. The Linearized Compressed Polar Coordinates (LCPC) Transform is a novel algorithm that can capture spatial context unlike any other metric. The LCPC Transform translates a two-dimensional (2D) contour into a discrete sinusoid wave by overlaying a grid system that tracks points of intersection between the contour and the grid lines. It turns the contour into a series of sequential pairs of discrete coordinates, with the independent coordinate (x-coordinate) being consecutive positions in 2D space. Each dependent coordinate (y-coordinate) is the distance from an intersection of the contour and a gridline to the origin of the grid system. With the contour in the form of a discrete sinusoid wave, the Fast Fourier Transform is then applied to the data. In this way, the shapes of cells in 2D and 3D cell culture are represented systematically and multidimensionally, allowing for robust quantitative stratification that will reveal insights into treatment resistance.
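A toy version of the core idea, simplified from the description above (sampling the contour at regular angles and measuring distances to a fixed origin stands in for the paper's grid-intersection bookkeeping):

```python
import numpy as np

# Toy closed contour: an ellipse with an asymmetric bump
theta = np.linspace(0, 2 * np.pi, 256, endpoint=False)
x = 2.0 * np.cos(theta) + 0.3 * np.cos(3 * theta + 0.5)
y = 1.0 * np.sin(theta)

# Distance of each contour point to the grid-system origin forms the "wave"
origin = np.array([0.0, 0.0])
wave = np.hypot(x - origin[0], y - origin[1])

# Fourier analysis of the discrete sinusoid wave summarizes the shape
spectrum = np.abs(np.fft.rfft(wave - wave.mean()))
print(np.round(spectrum[:6], 3))  # low-order harmonics as shape descriptors
```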
bioinformatics · 2026-02-05 · v1

MosaicLev: Modified Levenshtein distance for mobile element-aware genome comparison
Stoltz, H. K.; Kuhlman, T. E.
AI Summary
- The study introduces MosaicLev, a modified Levenshtein distance (mlev) that accounts for mobile element insertions in genome comparison by using a tunable parameter to discount these events.
- Validation on 67 mycobacteriophage genomes showed that mlev can distinguish between MPME1 and MPME2 elements, with significant discounts for the respective element carriers.
- The approach classified 35 MPME1 and 11 MPME2 carriers, and identified 14 phages lacking both elements.
Abstract
Motivation: Somatic genetic mosaicism arises when genomes diverge across cells during development, in part due to the activity of transposons (cut-paste) and retrotransposons (copy-paste). Standard sequence comparison methods are not motif-aware, penalizing mobile element insertions based on length rather than recognizing them as single biological events. Results: We introduce a modified Levenshtein distance (mlev) that discounts mobile element insertions via a tunable parameter (m in [0, 1]). Validation on 67 Cluster G1 mycobacteriophage genomes demonstrates bidirectional discrimination between Mycobacteriophage Mobile Element 1 (MPME1) and MPME2 elements: using MPME1 as target yields 49% discount for MPME1 carriers but only 9% for MPME2 carriers, while using MPME2 reverses this pattern. This approach classifies 35 MPME1 carriers and 11 MPME2 carriers, and identifies 14 phages showing low discount with both motifs, consistent with absence of both elements. Availability and Implementation: Python implementation with Numba JIT compilation freely available at https://doi.org/10.5281/zenodo.18452982.
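A simplified sketch of a motif-discounted Levenshtein distance in the spirit of mlev (the published algorithm and its Numba implementation may differ): inserting the whole mobile-element motif as one block costs m times its length, and m = 1 recovers the ordinary edit distance.

```python
def mlev(a, b, motif, m=0.5):
    """Edit distance where block-inserting `motif` costs m * len(motif)."""
    n, k, L = len(a), len(b), len(motif)
    d = [[0.0] * (k + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = float(i)
    for i in range(n + 1):
        for j in range(1, k + 1):
            if i > 0:
                d[i][j] = min(d[i][j - 1] + 1,                           # insertion
                              d[i - 1][j] + 1,                           # deletion
                              d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))  # substitution
            else:
                d[i][j] = d[i][j - 1] + 1
            if j >= L and b[j - L:j] == motif:                 # discounted block
                d[i][j] = min(d[i][j], d[i][j - L] + m * L)    # insertion of motif
    return d[n][k]

# Toy check: one mobile-element insertion costs 0.1 * 7 = 0.7
print(mlev("ACGT", "ACG" + "GATTACA" + "T", "GATTACA", m=0.1))
```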
bioinformatics · 2026-02-05 · v1

A deep-learning-based score to evaluate multiple sequence alignments
Serok, N.; Polonsky, K.; Ashkenazy, H.; Mayrose, I.; Thorne, J. L.; Pupko, T.
AI Summary
- The study evaluates the performance of the sum-of-pairs (SP) score in multiple sequence alignment (MSA) and finds it often does not correspond to the most accurate alignment.
- A deep-learning-based scoring function, Model 1, was developed to predict alignment accuracy, showing a stronger correlation with true accuracy than the SP score.
- Model 2 was introduced to rank alternative MSAs, outperforming the SP score, Model 1, and other programs, leading to more accurate phylogenetic reconstructions.
Abstract
Multiple sequence alignment (MSA) inference is a central task in molecular evolution and comparative genomics, and the reliability of downstream analyses, including phylogenetic inference, depends critically on alignment quality. Despite this importance, most widely used MSA methods optimize the sum-of-pairs (SP) score, and relatively little attention has been paid to whether this objective function accurately reflects alignment accuracy. Here, we evaluate the performance of the SP score using simulated and empirical benchmark alignments. For each dataset, we compare alternative MSAs derived from the same unaligned sequences and quantify the relationship between their SP scores and their distances from a reference alignment. We show that the alignment with the optimal SP score often does not correspond to the most accurate alignment. To address this limitation, we develop deep-learning-based scoring functions that integrate a collection of MSA features. We first introduce Model 1, a regression model that predicts the distance of a given MSA from the reference alignment. Across simulated and empirical datasets, this learned score correlates more strongly with true alignment accuracy than the SP score. However, Model 1 is less effective at identifying the best alignment among alternatives. We therefore develop Model 2, which takes as input a set of alternative MSAs generated from the same sequences and predicts their relative ranking. Model 2 more accurately identifies the top-ranking MSA than the SP score, Model 1, and several widely used alignment programs. Using simulations, we show that selecting MSAs based on our approach leads to more accurate phylogenetic reconstructions.
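For reference, the sum-of-pairs objective the study critiques is straightforward to compute; a minimal scorer with an assumed match/mismatch/gap scheme (production aligners use substitution matrices such as BLOSUM62 and affine gap penalties):

```python
from itertools import combinations

def sp_score(msa, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of an MSA given as equal-length strings."""
    total = 0
    for col in zip(*msa):                    # iterate alignment columns
        for a, b in combinations(col, 2):    # all sequence pairs in the column
            if a == "-" and b == "-":
                continue                     # gap-gap pairs are ignored
            elif a == "-" or b == "-":
                total += gap
            else:
                total += match if a == b else mismatch
    return total

print(sp_score(["AC-GT", "ACGGT", "A--GT"]))
```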
bioinformatics · 2026-02-05 · v1

LongPolyASE: An end-to-end framework for allele-specific gene and isoform analysis in polyploids using long-read RNA-seq
Nolte, N. F.; Gruden, K.; Petek, M.
AI Summary
- The study addresses the lack of tools for allele-specific gene and isoform analysis in polyploids by developing LongPolyASE, an end-to-end framework using long-read RNA-seq.
- LongPolyASE comprises three components: Syntelogfinder for identifying syntenic genes, longrnaseq for transcript quantification and isoform discovery, and PolyASE for analyzing differential expression and isoform usage.
- The framework was demonstrated on diploid rice and autotetraploid potato, showing its applicability in polyploid analysis.
Abstract
Motivation: Long-read RNA-seq and phased reference genomes enable haplotype-resolved gene and isoform expression analysis. While methods and tools exist for diploid organisms, analysis tools for polyploids are lacking. Results: We developed an end-to-end framework for allele-specific gene and isoform analysis in polyploids with three components: Syntelogfinder (https://github.com/NIB-SI/syntelogfinder) identifies syntenic genes in phased assemblies; longrnaseq (https://github.com/NIB-SI/longrnaseq) quantifies transcripts, discovers novel isoforms, and performs quality control of long-read RNA-seq; and PolyASE (https://pypi.org/project/polyase/) analyzes differential allelic expression, differential isoform usage between conditions, and structural differences in major isoforms between haplotypes. We demonstrate the use of the framework on diploid rice and autotetraploid potato. Availability and Implementation: Syntelogfinder and longrnaseq are implemented in Nextflow and available on GitHub. PolyASE is a Python package available on PyPI. The framework is fully documented (https://polyase.readthedocs.io/en/latest/index.html) and tutorials are provided. Contact: Nadja.nolte.franziska@nib.si. Supplementary information: Supplementary data are available online and on Zenodo (https://zenodo.org/records/17590760?token=eyJhbGciOiJIUzUxMiJ9.eyJpZCI6IjIzODg5ODA1LTdiMDQtNDU5Mi1hYTE1LTQyYzY5YzA2ZjhjYyIsImRhdGEiOnt9LCJyYW5kb20iOiJkYTQ1YjZmZjg1Mzg2MmE2M2I4M2Q4NTBlNzlhM2UwYSJ9.2AMQOPvz5yhPWJ-dIuf05NOiYQcyFkKPpRDYIqcUdYyGsMlHT8Y82wzMudpNJ6vxleNlOOSpukfcl-_astWCBw).
bioinformatics · 2026-02-05 · v1

BioVix: An Integrated Large Language Model Framework for Data Visualization, Graph Interpretation, and Literature-Aware Scientific Validation
Butt, M. Z.; Ahmad, R. S.; Fatima, E.; Tahir ul Qamar, M.
AI Summary
- BioVix is a web-based framework that integrates data visualization, natural-language querying, and literature retrieval using LLMs to support scientific analysis.
- It uses a multi-model architecture to handle dataset uploads, generate visualizations, interpret data, and contextualize findings with literature.
- Evaluations across various biological domains showed BioVix's effectiveness in managing diverse datasets, though it supports rather than replaces expert analysis.
Abstract
The application of Large Language Models (LLMs) for generating data visualizations through natural language interaction represents a promising advance in AI-assisted scientific analysis. However, existing LLM-based tools largely emphasize graph generation, while research workflows require not only visualization but also rigorous interpretation and validation against established scholarly evidence. Despite advances in visualization technologies, no single tool currently integrates literature references with visualization while also generating insights from graphical data. To address this gap, we present BioVix, a web-based LLM-driven framework that integrates interactive data visualization, natural-language querying, and automated retrieval of relevant academic literature. BioVix enables users to upload datasets, generate complex visualizations, interpret graphical patterns, and contextualize findings through literature references within a unified workflow. The system employs a multi-model architecture combining DeepSeek V3.1 for code and logic generation, Qwen2.5-VL-32B-Instruct for multimodal interpretation, and GPT-OSS-20B for conversational reasoning, coordinated through structured prompt engineering. BioVix was evaluated across diverse biological domains, including proteomic expression profiling, epigenomic peak annotation, and clinical diabetes data, demonstrating its flexibility in handling heterogeneous datasets and supporting exploratory, literature-aware analysis. While BioVix substantially streamlines exploratory research workflows, its LLM-generated outputs are intended to support, not replace, expert judgment, and users should independently verify results before scientific reporting. BioVix is openly available via public deployment on Hugging Face (https://huggingface.co/spaces/MuhammadZain10/BioVix), with source code provided through GitHub (https://github.com/MuhammadZain-Butt/BioVix).
bioinformatics · 2026-02-05 · v1

Hierarchical Representation Learning for Drug Mechanism-of-Action Prediction from Gene Expression Data
Katsaouni, N.; Schulz, M. H.
AI Summary
- The study introduces a hierarchical representation learning framework using dual ArcFace objectives to predict drug mechanisms of action (MoAs) from gene expression data, enhancing interpretability and capturing both MoA-level and compound-level structures.
- The framework, trained on LINCS L1000 data, outperformed existing methods in F1 performance and generalized to new compounds, cell types, and even CRISPR knockdowns without retraining.
- Gene importance and pathway enrichment analyses validated that the learned representations align with known signaling pathways.
Abstract
Deciphering drug mechanisms of action (MoAs) from transcriptional responses is key for discovery and repurposing. While recent machine learning approaches improve prediction accuracy beyond traditional similarity metrics, they often lack biological structure and interpretability in the learned space. We introduce a hierarchical representation learning framework that explicitly enforces mechanistically coherent organization using dual ArcFace objectives, yielding an interpretable latent space that captures both MoA-level separation and compound-level substructure. Gene importance and pathway enrichment analyses confirm that the learned representations recover established signaling programs. Trained on LINCS L1000 data, the model also improves F1 performance over state-of-the-art baselines and generalizes to unseen compounds and cell types. Additionally, the latent space generalizes to CRISPR knockdowns without the need for retraining, indicating it captures pathway-level perturbations independently of modality.
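A compact sketch of an ArcFace-style objective, the building block the abstract applies at both MoA and compound level (the scale s, margin m, and learnable class centers are generic ArcFace conventions, not the paper's exact hyperparameters):

```python
import torch
import torch.nn.functional as F

def arcface_logits(emb, centers, labels, s=30.0, m=0.5):
    """Cosine logits with an additive angular margin on the true class."""
    emb = F.normalize(emb, dim=1)
    centers = F.normalize(centers, dim=1)          # one center per class
    cos = emb @ centers.t()                        # (batch, n_classes)
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
    theta[torch.arange(len(labels)), labels] += m  # penalize true-class angle
    return s * torch.cos(theta)

emb = torch.randn(8, 64)                 # expression-profile embeddings (toy)
centers = torch.randn(10, 64)            # 10 hypothetical MoA classes
labels = torch.randint(0, 10, (8,))
loss = F.cross_entropy(arcface_logits(emb, centers, labels), labels)
print(loss.item())
```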
bioinformatics · 2026-02-05 · v1

Decoding ATG9A Variation: A Comprehensive Structural Investigation of All Missense Variants
Utichi, M.; Marjault, H.-B.; Tiberti, M.; Papaleo, E.
AI Summary
- This study used an in silico saturation mutagenesis approach with the MAVISp framework to predict the impact of all missense mutations in ATG9A, focusing on protein stability, conformational changes, multimerization, and post-translational modifications.
- By integrating structural predictions with variant effect predictors and cross-referencing with disease databases, the study identified potentially damaging variants in ATG9A.
- The findings provide insights into the molecular mechanisms of these variants, offering a roadmap for understanding ATG9A mutations in human diseases.
Abstract
Macroautophagy (hereafter autophagy) is a cellular recycling pathway that requires different ATG (autophagy-related) proteins to generate double-membraned autophagosomes. ATG9A, a multi-spanning membrane protein, plays a crucial role in this process as the only transmembrane component of the core autophagy machinery. ATG9A functions as a lipid scramblase, redistributing lipids between membrane leaflets for the expanding autophagosome membrane. Structural studies have revealed that ATG9A forms a homotrimer with an interlocked domain-swapped architecture and a network of internal hydrophilic cavities. This configuration underlies its role in lipid transfer and membrane remodeling together with the lipid transporter ATG2A. ATG9A dysfunction has also been linked to human disease, as specific ATG9A mutations cause neurodevelopmental or neurodegenerative phenotypes. Additionally, ATG9A is altered in cancer, promoting pro-tumorigenic traits. However, most missense variants in ATG9A remain uncharacterized, posing a significant challenge for interpreting genomic data. In this study, we employed an in silico saturation mutagenesis approach using the MAVISp (Multi-layered Assessment of VarIants by Structure) framework to predict the impact of every missense mutation in ATG9A. By analyzing multiple structural assemblies of ATG9A (monomer, trimer, and the ATG9A-ATG2A complex), we evaluated diverse mechanistic indicators of variant impact, including protein stability, long-range conformational changes, effects on multimerization interfaces, and alterations in post-translational modifications. We integrated the structure-based predictions with Variant Effect Predictors from recent deep-learning or evolutionary-based models and cross-referenced known variants catalogued in ClinVar, COSMIC, and cBioPortal. Finally, we predicted mechanistic indicators for all possible variants with structural coverage not yet reported in the disease-related databases supported by MAVISp. Our analyses identified a group of potentially damaging variants in ATG9A and the possible molecular mechanisms underlying their effects. Together, this work provides a roadmap for interpreting missense variants in autophagy regulators and highlights specific ATG9A mutations that deserve further investigation in the context of human disease.
bioinformatics · 2026-02-05 · v1

ABFormer: A Transformer-based Model to Enhance Antibody-Drug Conjugates Activity Prediction through Contextualized Antibody-Antigen Embedding
Katabathuni, R.; Loka, V.; Gogte, S.; Kondaparthi, V.
AI Summary
- ABFormer, a transformer-based model, was developed to predict Antibody-Drug Conjugate (ADC) activity by integrating contextualized antibody-antigen embeddings with chemically enriched linker and payload representations.
- It outperforms the current state-of-the-art model, ADCNet, by achieving 100% accuracy on a test set of 22 novel ADCs.
- The model's effectiveness is primarily due to its interaction-aware antibody-antigen representations, with additional specificity from small molecule encoders.
Abstract
Computational screening is becoming an increasingly crucial aspect of Antibody Drug Conjugate (ADC) research, allowing the elimination of dead ends at earlier stages and concentration on promising candidates, which can significantly reduce the cost of development. The current state-of-the-art deep learning model, ADCNet, treats antibodies, antigens, linkers, and payloads as distinct features. However, this overlooks the complex context of antibody-antigen binding, which is primarily responsible for the targeting and uptake of ADCs. To address this limitation, we present ABFormer, a transformer-based framework tailored for ADC activity prediction and in-silico triage. ABFormer integrates high-resolution antibody-antigen interface information through a pretrained interaction encoder and combines it with chemically enriched linker and payload representations obtained from a fine-tuned molecular encooder. This multi-modal design replaces naive feature concatenation with biologically informed contextual embeddings that more accurately reflect molecular recognition. ABFormer outperforms the baselines in leave-pair-out evaluation and achieves 100% accuracy on a separate test set of 22 novel ADCs, while the baselines are severely miscalibrated. An ablation study confirms that the predictive capability is predominantly driven by interaction-aware antibody-antigen representations, while small molecule encoders enhance specificity by reducing false positives. In conclusion, ABFormer provides a reliable and efficient platform for early filtering of ADC activity and selection of candidates.
bioinformatics · 2026-02-05 · v1

Large mRNA language foundation modeling with NUWA for unified sequence perception and generation
Zhong, Y.; Yan, W.; Zhang, Y.; Tan, K.; Saito, Y.; Bian, B.
AI Summary
- This study introduces NUWA, a large mRNA language foundation model using a BERT-like architecture, trained on extensive mRNA sequences from bacteria, eukaryotes, and archaea for unified sequence perception and generation.
- NUWA excels in various downstream tasks, including RNA-related perception and cross-modal protein tasks, and uses an entropy-guided strategy for generating natural-like mRNA sequences.
- Fine-tuned NUWA can generate functional, novel mRNA sequences for applications in biomanufacturing, vaccine development, and therapeutics.
Abstract
The mRNA serves as a crucial bridge between DNA and proteins. Compared to DNA, mRNA sequences are much more concise and information-dense, which makes mRNA an ideal language through which to explore various biological principles. In this study, we present NUWA, a large mRNA language foundation model leveraging a BERT-like architecture, trained with curriculum masked language modeling and supervised contrastive loss for unified mRNA sequence perception and generation. For pretraining, we utilized large-scale mRNA coding sequences comprising approximately 80 million sequences from 19,676 bacterial species, 33 million from 4,688 eukaryotic species, and 2.1 million from 702 archaeal species, and pre-trained three domain-specific models, one for each domain. This enables NUWA to learn coding sequence patterns across the entire tree of life. The fine-tuned NUWA demonstrates strong performance across a variety of downstream tasks, excelling not only in RNA-related perception tasks but also exhibiting robust capability in cross-modal protein-related tasks. On the generation front, NUWA pioneers an entropy-guided strategy that enables BERT-like models to generate mRNA sequences, producing natural-like sequences that accurately recapitulate species-specific codon usage patterns. Moreover, NUWA can be effectively fine-tuned on small, task-specific datasets to generate functional mRNAs with desired properties, including sequences that do not exist in nature, and to design coding sequences for diverse proteins in biomanufacturing, vaccine development, and therapeutic applications. To our knowledge, NUWA represents the first mRNA language model for unified sequence perception and generation, providing a versatile and programmable platform for mRNA design.
bioinformatics · 2026-02-04 · v3

FlashDeconv enables atlas-scale, multi-resolution spatial deconvolution via structure-preserving sketching
Yang, C.; Chen, J.; Zhang, X.
AI Summary
- FlashDeconv uses leverage-score importance sampling and sparse spatial regularization to enable rapid, high-resolution spatial deconvolution, processing 1.6 million bins in 153 seconds.
- It identifies a tissue-specific resolution horizon in mouse intestine at 8-16 µm, where cell-type co-localization sign inversions occur, validated by Xenium data.
- In human colorectal cancer, FlashDeconv reveals neutrophil inflammatory microdomains in low-UMI regions, recovering spatial biology from previously uninformative data.
Abstract
Coarsening Visium HD resolution from 8 to 64 µm can flip cell-type co-localization from negative to positive (r = -0.12 → +0.80), yet investigators are routinely forced to coarsen because current deconvolution methods cannot scale to million-bin datasets. Here we introduce FlashDeconv, which combines leverage-score importance sampling with sparse spatial regularization to match top-tier Bayesian accuracy while processing 1.6 million bins in 153 seconds on a standard laptop. Systematic multi-resolution analysis of Visium HD mouse intestine reveals a tissue-specific resolution horizon (8-16 µm)--the scale at which this sign inversion occurs--validated by Xenium ground truth. Below this horizon, FlashDeconv provides the first sequencing-based quantification of Tuft cell chemosensory niches (15.3-fold stem cell enrichment). In a 1.6-million-bin human colorectal cancer cohort, FlashDeconv further uncovers neutrophil inflammatory microdomains in low-UMI regions that classification-based methods discard, recovering spatially organized biology from measurements previously considered uninformative.
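The sketching ingredient named in the abstract, leverage-score importance sampling, can be illustrated in a few lines (toy matrix sizes and sketching rank; FlashDeconv's actual implementation is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(20_000, 200)).astype(float)  # bins x genes (toy)

k = 20                                     # sketching rank (assumed)
U, _, _ = np.linalg.svd(X, full_matrices=False)
lev = (U[:, :k] ** 2).sum(axis=1)          # rank-k row leverage scores
p = lev / lev.sum()

idx = rng.choice(len(p), size=2_000, replace=False, p=p)
X_sketch = X[idx]                          # informative subsample of bins
print(X_sketch.shape)
```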
bioinformatics · 2026-02-04 · v2

Common Pitfalls in CircRNA Detection and Quantification
Weyrich, M.; Trummer, N.; Boehm, F.; Furth, P. A.; Hoffmann, M.; List, M.
AI Summary
- This study compares circRNA detection in poly(A)-enriched versus ribosomal RNA-depleted RNA-seq data, finding that poly(A) data often yield false positives.
- The quality of sample processing, indicated by ribosomal read fraction, impacts circRNA detection sensitivity.
- Best practices include using total RNA sequencing with rRNA depletion, employing multiple detection tools, and focusing on back-splice junctions for reliable circRNA analysis.
Abstract
Circular RNAs have garnered considerable interest, as they have been implicated in numerous biological processes and diseases. Through their stability, they are often considered promising biomarker candidates or therapeutic targets. Due to the lack of a poly(A) tail, circRNAs are best detected in total RNA-seq data after depleting ribosomal RNA. However, we observe that the application of circRNA detection in the vastly more ubiquitous poly(A)-enriched RNA-seq data still occurs. In this study, we systematically compare the detection of circRNAs in two matched poly(A) and ribosomal RNA-depleted data sets. Our results indicate that the comparably few circRNAs detected in poly(A) data are likely false positives. In addition, we demonstrate that the quality of sample processing, as measured by the fraction of ribosomal reads, significantly affects the sensitivity of circRNA detection, leading to a bias in downstream analysis. Our findings establish best practices for circRNA research: total RNA sequencing with effective rRNA depletion is the preferred approach for accurate circRNA profiling, whereas poly(A)-enriched data are unsuitable for comprehensive detection. Employing multiple circRNA detection tools and prioritizing back-splice junctions identified by several algorithms enhances confidence in the selection of candidates. These recommendations, validated across diverse datasets and tissue types, provide generalizable principles for robust circRNA analysis.
bioinformatics · 2026-02-04 · v1

Ophiuchus-Ab: A Versatile Generative Foundation Model for Advanced Antibody-Based Immunotherapy
Zhu, Y.; Ma, J.; Yin, M.; Wu, J.; Tang, L.; Zhang, Z.; Li, Q.; Feng, S.; Liu, H.; Qin, T.; Yan, J.; Hsieh, C.-Y.; Hou, T.
AI Summary
- The study addresses the challenge of antibody design by modeling the sequence space of paired heavy and light chains to understand inter-chain dependencies.
- Ophiuchus-Ab, a generative foundation model, was developed using a diffusion language modeling framework, trained on large-scale paired antibody repertoires.
- This model excels in tasks like CDR infilling, antibody humanization, and light-chain pairing, and predicts antibody properties like developability and binding affinity, enhancing antibody-based immunotherapy.
Abstract
Antibodies exhibit extraordinary specificity and diversity in antigen recognition and have become a central class of therapeutics across a wide range of diseases. Despite this clinical success, antibody design remains fundamentally challenging. Antibody function emerges from intricate and highly coupled interactions between heavy and light chains, which complicate sequence-function relationships and limit the rational design of developable antibodies. Here, we reveal that modeling antibody sequence space at the level of paired heavy and light chains is essential to faithfully capture inter-chain dependencies, enabling a deeper understanding of antibody function and facilitating antibody discovery. We present Ophiuchus-Ab, a generative foundation model pre-trained on large-scale paired antibody repertoires within a diffusion language modeling framework, unifying antibody generation and representation learning in a single probabilistic formulation. This framework excels at diverse antibody design tasks, including CDR infilling, antibody humanization, and light-chain pairing. Beyond generation, diffusion-based pre-training yields transferable representations that enable accurate prediction of antibody properties, including developability, binding affinity, and specificity, even in low-data regimes. Together, these results establish Ophiuchus-Ab as a versatile foundation model for modeling antibodies, providing a foundation for next-generation antibody-based immunotherapy.
bioinformatics · 2026-02-04 · v1

SPCoral: diagonal integration of spatial multi-omics across diverse modalities and technologies
Wang, H.; Yuan, J.; Li, K.; Chen, X.; Yan, X.; Lin, P.; Tang, Z.; Wu, B.; Nan, H.; Lai, Y.; Lv, Y.; Esteban, M. A.; Xie, L.; Wang, G.; Hui, L.; Li, H.
AI Summary
- SPCoral was developed to integrate spatial multi-omics data across different slices, modalities, and technologies using graph attention networks and optimal transport.
- It employs a cross-modality attention network for feature integration and cross-omics prediction, showing superior performance in benchmarks.
- The integration enhances spatial domain identification, data augmentation, cross-modal analysis, and cell-cell communication, revealing insights unattainable with single modality data.
Abstract
Spatial multi-omics is indispensable for decoding the comprehensive molecular landscape of biological systems. However, the integration of multi-omics remains largely unresolved due to inherent disparities in molecular features, spatial morphology, and resolution. Here we developed SPCoral for diagonal integration of spatial multi-omics across adjacent slices. SPCoral extracts spatial covariation patterns via graph attention networks, followed by the use of optimal transport to identify high-confidence anchors in an unsupervised, feature-independent manner. SPCoral utilizes a cross-modality attention network to enable seamless cross-resolution feature integration alongside robust cross-omics prediction. Comprehensive benchmarking demonstrates SPCoral's superior performance across different technologies, modalities and varied resolutions. The integrated multi-omics representation further improves spatial domain identification, effectively augments experimental data, enables cross-modal association analysis, and facilitates cell-cell communication analysis. SPCoral exhibits good scalability with data size and reveals biological insights that are not attainable using a single modality. In summary, SPCoral offers a powerful framework for spatial multi-omics integration across various technologies and biological scenarios.
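To illustrate the anchor-finding step, here is a toy entropic optimal-transport coupling between spot embeddings of two slices; high-mass plan entries play the role of the high-confidence anchors the abstract describes (the hand-rolled Sinkhorn loop and thresholds are illustrative assumptions, not SPCoral's code):

```python
import numpy as np

def sinkhorn(C, reg=0.1, n_iter=200):
    """Entropic OT plan between uniform marginals for cost matrix C."""
    K = np.exp(-C / C.max() / reg)             # normalized cost avoids underflow
    a = np.full(C.shape[0], 1.0 / C.shape[0])
    b = np.full(C.shape[1], 1.0 / C.shape[1])
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):                    # alternating scaling updates
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
emb1 = rng.normal(size=(50, 16))               # slice-1 spot embeddings (toy)
emb2 = rng.normal(size=(60, 16))               # slice-2 spot embeddings (toy)
C = ((emb1[:, None, :] - emb2[None, :, :]) ** 2).sum(-1)
P = sinkhorn(C)
anchors = np.argwhere(P > 5 * P.mean())        # candidate cross-slice anchors
print(P.shape, len(anchors))
```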
bioinformatics · 2026-02-04 · v1

Peptide-to-protein data aggregation using Fisher's method improves target identification in chemical proteomics
Lyu, H.; Gharibi, H.; Meng, Z.; Sokolova, B.; Zhang, X.; Zubarev, R.
AI Summary
- This study compares two methods for protein-level statistical testing in chemical proteomics: traditional aggregation of peptide data versus Fisher's method of combining peptide p-values.
- Fisher's method, using the top four peptides by p-value, was tested across various datasets and consistently outperformed traditional methods by avoiding biases from deviant or missing peptide data.
- The approach improved the identification of regulated or shifted proteins in diverse proteomics assays.
Abstract
Protein-level statistical tests in proteomics aimed at obtaining a p-value are conventionally made on protein abundances aggregated from peptide data. This integral approach overlooks peptide-level heterogeneity and ignores important information coded in individual peptide data, while a protein p-value can also be obtained by Fisher's method of combining peptide p-values using chi-square statistics. Here we test the latter approach across diverse chemical proteomics datasets based on assessments of protein expression, solubility, and protease accessibility. Using the top four peptides ranked by their p-values consistently outperformed protein-level analysis and avoided biases introduced by the inclusion of deviant peptides or the imputation of missing peptide values. Fisher's method provides a simple and robust strategy, improving identification of regulated/shifted proteins in diverse proteomics assays.
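The core computation is a few lines; a minimal sketch of Fisher's method over the top four peptide p-values (note that the paper's calibration for the top-n selection step may be more involved than this raw chi-square tail):

```python
import numpy as np
from scipy import stats

def fisher_protein_p(peptide_pvals, top_n=4):
    """Combine a protein's peptide p-values via Fisher's chi-square method."""
    p = np.sort(np.asarray(peptide_pvals, dtype=float))[:top_n]
    chi2 = -2.0 * np.log(p).sum()
    return stats.chi2.sf(chi2, df=2 * len(p))   # chi-square survival function

# Toy protein with five quantified peptides
print(fisher_protein_p([0.004, 0.03, 0.2, 0.6, 0.9]))
```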
bioinformatics · 2026-02-04 · v1

EvoPool: Evolution-Guided Pooling of Protein Language Model Embeddings
NaderiAlizadeh, N.; Singh, R.
AI Summary
- The study introduces EvoPool, a self-supervised pooling framework that integrates evolutionary information from homologous sequences into protein language model (PLM) embeddings using optimal transport.
- EvoPool constructs a fixed-size evolutionary anchor and uses sliced Wasserstein distances to enhance PLM representations for protein-level prediction tasks.
- Experiments on the ProteinGym benchmark showed that EvoPool outperforms standard pooling methods in variant effect prediction, highlighting the benefit of evolutionary guidance.
Abstract
Protein language models (PLMs) encode amino acid sequences into residue-level embeddings that must be pooled into fixed-size representations for downstream protein-level prediction tasks. Although these embeddings implicitly reflect evolutionary constraints, existing pooling strategies operate on single sequences and do not explicitly leverage information from homologous sequences or multiple sequence alignments. We introduce EvoPool, a self-supervised pooling framework that integrates evolutionary information from homologs directly into aggregated PLM representations using optimal transport. Our method constructs a fixed-size evolutionary anchor from an arbitrary number of homologous sequences and uses sliced Wasserstein distances to derive query protein embeddings that are geometrically informed by homologous sequence embeddings. Experiments across multiple state-of-the-art PLM families on the ProteinGym benchmark show that EvoPool consistently outperforms standard pooling baselines for variant effect prediction, demonstrating that explicit evolutionary guidance substantially enhances the functional utility of PLM representations. Our implementation code is available at https://github.com/navid-naderi/EvoPool.
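A minimal sliced-Wasserstein distance between two residue-embedding clouds, the geometric ingredient the abstract names (equal point counts are assumed here so sorted projections align; EvoPool's actual anchor construction and pooling are richer):

```python
import numpy as np

def sliced_wasserstein(A, B, n_proj=128, seed=0):
    """Approximate SW-1 distance between point clouds A and B (same size)."""
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n_proj, A.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)  # unit directions
    pa = np.sort(A @ dirs.T, axis=0)     # 1-D projections, sorted
    pb = np.sort(B @ dirs.T, axis=0)
    return np.abs(pa - pb).mean()        # W1 between sorted projections

query = np.random.default_rng(1).normal(size=(300, 64))   # residues x dim (toy)
anchor = np.random.default_rng(2).normal(size=(300, 64))  # evolutionary anchor
print(sliced_wasserstein(query, anchor))
```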
bioinformatics · 2026-02-04 · v1

Embarrassingly_FASTA: Enabling Recomputable, Population-Scale Pangenomics by Reducing Commercial Genome Processing Costs from $100 to less than $1
Walsh, D. J.; Njie, e. G.
AI Summary
- The study introduces Embarrassingly_FASTA, a GPU-accelerated preprocessing pipeline that reduces genome processing costs from ~$100 to less than $1 per genome by using transient intermediates and ephemeral cloud infrastructure.
- This approach enables the retention of raw FASTQ data, facilitating recomputable, population-scale pangenomics.
- The efficiency was demonstrated through simulated large-cohort pangenome builds in C. elegans and humans, showcasing the potential for capturing unsampled genetic diversity.
Abstract
Computational preprocessing has become the dominant bottleneck in genomics, frequently exceeding sequencing costs and constraining population-scale analysis, even as large repositories grow from tens of petabytes toward exabyte-scale storage to support World Genome Models. Legacy CPU-based workflows require many hours to days per 30x human genome, driving many repositories to distribute aligned or derived intermediates such as BAM and VCF files rather than raw FASTQ data. These intermediates embed reference- and model-dependent assumptions that limit reproducibility and impede reanalysis as reference genomes, including pangenomes, continue to evolve. Although recent work has established that GPUs can dramatically accelerate genomic pipelines, enabling large-cohort processing to shrink from years to days given sufficient parallelism, such workflows remain cost-prohibitive. Here, we introduce Embarrassingly_FASTA, a GPU-accelerated preprocessing pipeline built on NVIDIA Parabricks that fundamentally changes the economics of genomic data management. By rendering intermediate files transient rather than archival, Embarrassingly_FASTA enables retention of raw FASTQ data and reliable use of highly discounted ephemeral cloud infrastructure such as spot instances, reducing compute spend from ~$17/genome (CPU on-demand) to <$1/genome (GPU spot), and commercial secondary-analysis pricing from ~$120/genome to compute spend under $1/genome. We demonstrate the impact of this efficiency using a simulated large-cohort pangenome build-up (using variant-union accumulation as a proxy for diversity growth) in Caenorhabditis elegans and humans, highlighting the long tail of unsampled human genetic diversity. Beyond GPU kernels, Embarrassingly_FASTA contributes a transient-intermediate lifecycle and spot-friendly orchestration that makes FASTQ retention and routine recomputation economically viable. Embarrassingly_FASTA thus provides enabling infrastructure for recomputable, population-scale pangenomics and next-generation genomic models. Keywords: Genome preprocessing, GPU acceleration, Whole-genome sequencing (WGS), Population genomics, Pangenomics, World Genome Models, Genomic infrastructure, Variant calling, Recomputable genomics
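The economics reduce to simple arithmetic; a back-of-envelope sketch consistent with the abstract's cited figures (all runtimes and hourly rates below are illustrative assumptions, not measured prices):

```python
cpu_hours_per_genome = 24       # legacy CPU pipeline, 30x WGS (assumed)
cpu_on_demand_rate = 0.70       # $/hour on-demand -> ~$17/genome as cited
gpu_minutes_per_genome = 25     # Parabricks-class GPU runtime (assumed)
gpu_spot_rate = 1.80            # $/hour on a discounted spot instance

cpu_cost = cpu_hours_per_genome * cpu_on_demand_rate
gpu_cost = gpu_minutes_per_genome / 60 * gpu_spot_rate
print(f"CPU on-demand: ${cpu_cost:.2f}/genome")   # ~ $16.80
print(f"GPU spot:      ${gpu_cost:.2f}/genome")   # ~ $0.75
```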
bioinformatics · 2026-02-04 · v1

Joint Modeling of Transcriptomic and Morphological Phenotypes for Generative Molecular Design
Verma, S.; Wang, M.; Jayasundara, S.; Malusare, A. M.; Wang, L.; Grama, A.; Kazemian, M.; Lanman, N. A.
AI Summary
- The study introduces Pert2Mol, a framework for integrating transcriptomic and morphological data from paired control-treatment experiments to generate molecular structures.
- Pert2Mol uses bidirectional cross-attention and a rectified flow transformer to model perturbation dynamics, achieving a Frechet ChemNet Distance of 4.996 on the GDP dataset, outperforming diffusion and transcriptomics-only methods.
- The model offers high molecular validity, good physicochemical property distributions, 84.7% scaffold diversity, and is 12.4 times faster than diffusion methods for generation.
Abstract
Motivation: Phenotypic drug discovery generates rich multi-modal biological data from transcriptomic and morphological measurements, yet translating complex cellular responses into molecular design remains a computational bottleneck. Existing generative methods operate on single modalities and condition on post-treatment measurements without leveraging paired control-treatment dynamics to capture perturbation effects. Results: We present Pert2Mol, the first framework for multi-modal phenotype-to-structure generation that integrates transcriptomic and morphological features from paired control-treatment experiments. Pert2Mol employs bidirectional cross-attention between control and treatment states to capture perturbation dynamics, conditioning a rectified flow transformer that generates molecular structures along straight-line trajectories. We introduce Student-Teacher Self-Representation (SERE) learning to stabilize training in high-dimensional multi-modal spaces. On the GDP dataset, Pert2Mol achieves Frechet ChemNet Distance of 4.996 compared to 7.343 for diffusion baselines and 59.114 for transcriptomics-only methods, while maintaining perfect molecular validity and appropriate physicochemical property distributions. The model demonstrates 84.7% scaffold diversity and 12.4 times faster generation than diffusion approaches with deterministic sampling suitable for hypothesis-driven validation. Availability: Code and pretrained models will be available at https://github.com/wangmengbo/Pert2Mol.
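The rectified-flow objective behind the generator is compact; a generic one-step training sketch (the stand-in network, latent sizes, and conditioning are placeholders, not Pert2Mol's architecture):

```python
import torch

def rectified_flow_loss(model, x1, cond):
    """Regress the straight-line velocity between noise x0 and data x1."""
    x0 = torch.randn_like(x1)                 # noise endpoint
    t = torch.rand(x1.size(0), 1)             # uniform time in [0, 1]
    xt = (1 - t) * x0 + t * x1                # straight-line interpolation
    v_target = x1 - x0                        # constant velocity along the line
    v_pred = model(xt, t, cond)
    return ((v_pred - v_target) ** 2).mean()

model = lambda xt, t, cond: xt + cond         # stand-in "network" (toy)
x1 = torch.randn(16, 32)                      # latent molecule codes (toy)
cond = torch.randn(16, 32)                    # phenotype conditioning (toy)
print(rectified_flow_loss(model, x1, cond).item())
```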
bioinformatics2026-02-04v1
FRED: a universal tool to generate FAIR metadata for omics experiments
Walter, J.; Kuenne, C.; Knoppik, N.; Goymann, P.; Looso, M.AI Summary
- The study addresses the challenge of standardizing metadata in omics experiments to enhance data management according to FAIR principles.
- FRED, a new tool, was developed to generate machine-readable metadata, offering features like dialog-based creation, semantic validation, logical search, an API, and a web interface.
- FRED is designed for use by both non-computational scientists and specialized facilities, integrating easily into existing research data management systems.
Abstract
Scientific research relies on transparent dissemination of data and its associated interpretations. This task encompasses accessibility of raw data, its metadata, details concerning experimental design, along with parameters and tools employed for data interpretation. Production and handling of these data represent an ongoing challenge, extending beyond publication into individual facilities, institutes and research groups, often termed Research Data Management (RDM). RDM is foundational to scientific discovery and innovation, and is summarized by the principles of Findability, Accessibility, Interoperability and Reusability (FAIR). Although the majority of peer-reviewed journals require the deposition of raw data in public repositories in alignment with FAIR principles, metadata frequently lacks full standardization. This critical gap in data management practices hinders effective utilization of research findings and complicates sharing of scientific knowledge. Here we present a flexible design of a machine-readable metadata format to store experimental metadata, along with an implementation of a generalized tool named FRED. It enables i) dialog-based creation of metadata files, ii) structured semantic validation, iii) logical search, iv) an external programming interface (API), and v) a standalone web front end. The tool is intended to be used by non-computational scientists as well as specialized facilities, and can be seamlessly integrated into existing RDM infrastructure.
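As a rough illustration of what "structured semantic validation" of machine-readable metadata can look like, here is a sketch using the `jsonschema` package. The schema, field names, and example record are invented for the example; FRED defines its own metadata format and validation rules.

```python
# Illustrative semantic validation of an omics-experiment metadata record
# against a schema. Everything below is a made-up example, not FRED's format.
from jsonschema import validate, ValidationError

EXPERIMENT_SCHEMA = {
    "type": "object",
    "required": ["experiment_id", "organism", "assay", "replicates"],
    "properties": {
        "experiment_id": {"type": "string", "pattern": "^EXP-[0-9]{4}$"},
        "organism": {"type": "string", "enum": ["human", "mouse", "zebrafish"]},
        "assay": {"type": "string", "enum": ["RNA-seq", "ATAC-seq", "ChIP-seq"]},
        "replicates": {"type": "integer", "minimum": 1},
    },
}

record = {"experiment_id": "EXP-0042", "organism": "mouse",
          "assay": "RNA-seq", "replicates": 3}

try:
    validate(instance=record, schema=EXPERIMENT_SCHEMA)
    print("metadata record is valid")
except ValidationError as err:
    print(f"invalid metadata: {err.message}")
```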
bioinformatics2026-02-04v1
QMAP: A Benchmark for Standardized Evaluation of Antimicrobial Peptide MIC and Hemolytic Activity Regression
Lavertu, A.; Corbeil, J.; Germain, P.AI Summary
- QMAP is introduced as a benchmark for evaluating the prediction of antimicrobial peptide (AMP) potency (MIC) and hemolytic toxicity (HC50), using homology-aware test sets to prevent overfitting.
- The benchmark reassessed existing MIC models, revealing limited progress over six years, poor performance in predicting high-potency MIC, and low predictability for hemolytic activity.
- A Python package with a Rust-accelerated engine for efficient data manipulation is provided to facilitate the adoption of QMAP.
Abstract
Antimicrobial peptides (AMPs) are promising alternatives to conventional antibiotics, but progress in computational AMP discovery has been difficult to quantify due to inconsistent datasets and evaluation protocols. We introduce QMAP, a domain-specific benchmark for predicting AMP antimicrobial potency (MIC) and hemolytic toxicity (HC50) with homology-aware, predefined test sets. QMAP enforces strict sequence homology constraints between training and test data, ensuring that model performance reflects true generalization rather than overfitting. Applying QMAP, we reassess existing MIC models and establish baselines for MIC and HC50 regression. Results show limited progress over six years, poor performance for high-potency MIC regression, and low predictability for hemolytic activity, emphasizing the need for standardized evaluation and improved modeling approaches for highly potent peptides. We release a Python package facilitating practical adoption, with a Rust-accelerated engine enabling efficient data manipulation, installable via pip install qmap-benchmark.
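The homology-aware splitting idea is worth making concrete. The sketch below greedily clusters peptides by k-mer Jaccard similarity and holds out whole clusters, so no test sequence has a near-homolog in training; the threshold, k, and greedy clustering are illustrative stand-ins for QMAP's stricter sequence-identity constraints.

```python
# Toy homology-aware train/test split: cluster peptides by 3-mer Jaccard
# similarity, then assign entire clusters to one side of the split.
# Threshold and k are arbitrary choices for illustration only.

def kmers(seq, k=3):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def homology_clusters(seqs, threshold=0.5, k=3):
    profiles = [kmers(s, k) for s in seqs]
    clusters = []                      # each cluster: list of sequence indices
    for i, p in enumerate(profiles):
        for cluster in clusters:
            if any(jaccard(p, profiles[j]) >= threshold for j in cluster):
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters

peptides = ["GIGKFLHSAKKFGKAFVGEIMNS",                   # magainin-2-like
            "GIGKFLHSAGKFGKAFVGEIMKS",                   # near-duplicate homolog
            "KWKLFKKIEKVGQNIRDGIIKAGPAVAVVGQATQIAK"]     # cecropin-A-like
clusters = homology_clusters(peptides)
test = set(clusters[0])               # hold out an entire homology cluster
train = [i for i in range(len(peptides)) if i not in test]
print(clusters, "train:", train, "test:", sorted(test))
```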
bioinformatics2026-02-04v1
petVAE: A Data-Driven Model for Identifying Amyloid PET Subgroups Across the Alzheimer's Disease Continuum
Tagmazian, A. A.; Schwarz, C.; Lange, C.; Pitkänen, E.; Vuoksimaa, E.AI Summary
- This study aimed to identify subgroups along the Alzheimer's disease (AD) continuum using Aβ PET scans by developing petVAE, a 2D variational autoencoder model.
- petVAE was trained on 3,110 scans from ADNI and A4 datasets, identifying four clusters (Aβ-, Aβ-+, Aβ+, Aβ++) that differed significantly in standardized uptake value ratio, CSF Aβ, cognitive performance, APOE ε4 prevalence, and progression rate to AD.
- The model effectively captured the AD continuum, revealing preclinical stages and offering a new framework for studying disease progression.
Abstract
Amyloid-β (Aβ) PET imaging is a core biomarker and is considered sufficient for the biological diagnosis of Alzheimer's disease (AD). However, it is typically reduced to a binary Aβ-/Aβ+ classification. In this study, we aimed to identify subgroups along the continuum of Aβ accumulation, including subgroups within Aβ- and Aβ+. We used a total of 3,110 Aβ PET scans from the Alzheimer's Disease Neuroimaging Initiative (ADNI) and Anti-Amyloid Treatment in Asymptomatic Alzheimer's Disease (A4) datasets to develop petVAE, a 2D variational autoencoder model. The model accurately reconstructed Aβ PET scans without prior labeling or pre-selection based on scanner type or region of interest. Latent representations of scans extracted from the petVAE (11,648 latent features per scan) were used to visualize, analyze, and cluster the AD continuum. We identified the latent features most representative of the continuum, and clustering of PET scans using these features produced four clusters. Post-hoc characterization revealed that two clusters (Aβ-, Aβ-+) were predominantly Aβ negative and two (Aβ+, Aβ++) were predominantly Aβ positive. All clusters differed significantly in standardized uptake value ratio (p < 1.64e-8) and cerebrospinal fluid (CSF) Aβ (p < 0.02), demonstrating petVAE's ability to assign scans along the Aβ continuum. The clusters at the extremes of the continuum (Aβ-, Aβ++) resembled the conventional Aβ negative and Aβ positive groups and differed significantly in cognitive performance, Apolipoprotein E (APOE) ε4 prevalence, and Aβ, tau and phosphorylated tau CSF biomarkers (p < 3e-6). The two intermediate clusters (Aβ-+, Aβ+) showed significantly higher odds of carrying at least one APOE ε4 allele compared with the Aβ- cluster (p < 0.026). Participants in the Aβ+ or Aβ++ clusters exhibited a significantly faster rate of progression to AD compared to the Aβ- cluster (hazard ratio = 2.42 and 9.43 for Aβ+ and Aβ++, respectively, p < 1.17e-7). Thus, petVAE was capable of reconstructing PET scans while also extracting latent features that effectively represented the AD continuum and defined biologically meaningful clusters. By capturing subtle Aβ-related changes in brain PET scans, petVAE-based classification enables the detection of preclinical AD stages and offers a new data-driven framework for studying disease progression.
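A compact sketch of the VAE machinery such a model rests on: a convolutional encoder producing a latent Gaussian, the reparameterization trick, and a deconvolutional decoder. The layer sizes, 96x96 input, and 32-dimensional latent are illustrative; petVAE's actual architecture and its 11,648 latent features differ.

```python
# Minimal 2D variational autoencoder with the reparameterization trick.
# Architecture and sizes are placeholders, not petVAE's.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    def __init__(self, latent=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(1, 16, 4, 2, 1), nn.ReLU(),
                                 nn.Conv2d(16, 32, 4, 2, 1), nn.ReLU(),
                                 nn.Flatten())
        self.mu = nn.Linear(32 * 24 * 24, latent)
        self.logvar = nn.Linear(32 * 24 * 24, latent)
        self.dec = nn.Sequential(nn.Linear(latent, 32 * 24 * 24), nn.ReLU(),
                                 nn.Unflatten(1, (32, 24, 24)),
                                 nn.ConvTranspose2d(32, 16, 4, 2, 1), nn.ReLU(),
                                 nn.ConvTranspose2d(16, 1, 4, 2, 1))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterize
        return self.dec(z), mu, logvar

x = torch.randn(4, 1, 96, 96)                 # stand-in for PET slices
recon, mu, logvar = TinyVAE()(x)
kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
loss = F.mse_loss(recon, x) + 1e-3 * kl       # reconstruction + KL terms
```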
bioinformatics2026-02-04v1
RareCapsNet: An explainable capsule network enables robust discovery of rare cell populations from large-scale single-cell transcriptomics
Ray, S.; Lall, S.AI Summary
- RareCapsNet uses capsule networks to identify rare cell populations in large-scale single-cell RNA-seq data.
- It leverages explainable AI to interpret lower-level capsules, identifying novel marker genes for rare cell types.
- Evaluations on simulated and real data show RareCapsNet outperforms other methods in specificity and selectivity, and can transfer knowledge across batches.
Abstract
In-silico downstream analysis of single-cell data has attracted considerable attention from machine learning researchers in recent years. Technological advances and increases in throughput open up new opportunities to discover rare cell types. We develop RareCapsNet, a capsule-network-based technique for identifying rare cells in large single-cell RNA-seq data. RareCapsNet leverages the key advantages of capsule networks in the single-cell domain, identifying novel rare cell populations through marker genes derived from interpretable lower-level (primary) capsules. We demonstrate the explainability of the capsule network for identifying novel markers that act as signatures of rare cell populations. A comprehensive evaluation on simulated and real single-cell data demonstrates the efficacy of RareCapsNet for detecting rare populations in large scRNA-seq data. RareCapsNet not only outperforms other state-of-the-art methods in specificity and selectivity for identifying rare cell types, it also successfully extracts the transcriptomic signature of each cell population. We further apply RareCapsNet to a multi-batch dataset, where knowledge learned on one batch can be transferred to find rare cells in another batch without retraining the model. Availability and Implementation: RareCapsNet is available at: https://github.com/sumantaray/RareCapsNet.
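For orientation, the defining nonlinearity of capsule networks is the squash function, which scales each capsule's activity vector to a length in (0, 1), interpretable as the probability that an entity (here, a cell population) is present, while the direction encodes its properties. This is the standard formulation, not RareCapsNet-specific code; the routing and interpretation layers are omitted.

```python
# Standard capsule-network squash nonlinearity.
import torch

def squash(s, dim=-1, eps=1e-8):
    """v = (|s|^2 / (1 + |s|^2)) * (s / |s|): length in (0, 1), direction kept."""
    sq_norm = (s ** 2).sum(dim=dim, keepdim=True)
    return (sq_norm / (1.0 + sq_norm)) * s / torch.sqrt(sq_norm + eps)

primary_capsules = torch.randn(128, 8)   # 128 primary capsules, 8-D pose vectors
v = squash(primary_capsules)
print(v.norm(dim=-1).max())              # all output lengths are < 1
```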
bioinformatics2026-02-04v1
coelsch: Platform-agnostic single-cell analysis of meiotic recombination events
Parker, M. T.; Amar, S.; Freudigmann, J.; Walkemeier, B.; Dong, X.; Solier, V.; Marek, M.; Huettel, B.; Mercier, R.; Schneeberger, K.AI Summary
- The study benchmarks single-cell sequencing methods (droplet-based chromatin accessibility, RNA sequencing, and plate-based whole-genome amplification) for mapping meiotic recombination in Arabidopsis thaliana.
- Novel tools, coelsch_mapping_pipeline and coelsch, were developed for haplotype-aware alignment and crossover detection, successfully mapping recombination in 34 out of 40 F1 hybrids.
- The analysis revealed significant variation in recombination rates and identified a large ~10 Mb pericentric inversion in accession Zin-9, the largest known in A. thaliana.
Abstract
Background: Meiotic recombination creates genetic diversity through reciprocal exchange of haplotypes between homologous chromosomes. Scalable and robust methods for mapping recombination breakpoints are essential for understanding meiosis and for genetic mapping. Single cell sequencing of gametes offers a direct approach to recombination mapping, yet the effect of technical differences between single-cell sequencing methods for crossover detection remains unclear. Results: We benchmark single cell methods for droplet-based chromatin accessibility and RNA sequencing and plate-based whole-genome amplification for mapping meiotic recombination in Arabidopsis thaliana. For this purpose we introduce two novel open-source tools coelsch_mapping_pipeline and coelsch for haplotype-aware alignment and per-cell crossover detection, using them to recover known recombination frequencies and quantify the effects of coverage sparsity. We subsequently apply our approach to a panel of 40 recombinant F1 hybrids derived from crosses of 22 diverse natural accessions, successfully recovering genetic maps for 34 F1s in a single dataset. This analysis reveals substantial variation in recombination rate and identifies a ~10 Mb pericentric inversion in the accession Zin-9, the largest natural inversion reported in A. thaliana to date. Conclusions: These results demonstrate the applicability and scalability of single-cell gamete sequencing for high-throughput mapping of meiotic recombination, and highlight the strengths and limitations of different single-cell modalities. The accompanying open-source tools provide a framework for haplotyping and crossover detection analysis using sparse single-cell sequencing data. Our methodology enables parallel analysis of large numbers of hybrids in a single dataset, removing a major technical barrier to large-scale studies of natural variation in recombination rate.
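To make the crossover-detection task concrete, here is a toy detector over ordered haplotype-informative marker calls: a sliding-window vote smooths sparse, noisy calls, and a switch in the smoothed state marks a candidate crossover. coelsch uses a more principled per-cell model; this sketch only illustrates the underlying idea, and all parameters are invented.

```python
# Toy per-cell crossover detection from sparse haplotype calls along a
# chromosome: 0 = parent-A allele, 1 = parent-B allele, None = no coverage.

def crossovers(calls, window=5, min_frac=0.8):
    observed = [(pos, a) for pos, a in calls if a is not None]
    states, breaks = [], []
    for i in range(len(observed)):
        votes = [a for _, a in observed[max(0, i - window):i + window + 1]]
        frac = sum(votes) / len(votes)
        if frac >= min_frac:
            state = 1
        elif frac <= 1 - min_frac:
            state = 0
        else:                                  # ambiguous: keep previous state
            state = states[-1] if states else round(frac)
        if states and state != states[-1]:
            breaks.append(observed[i][0])      # candidate crossover position
        states.append(state)
    return breaks

calls = [(i * 10_000, 0) for i in range(30)] + \
        [(i * 10_000, 1) for i in range(30, 60)]
print(crossovers(calls))   # one switch reported near position 300,000
```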
bioinformatics2026-02-04v1
LoReMINE: Long Read-based Microbial genome mining pipeline
Agrawal, A. A.; Bader, C. D.; Kalinina, O. V.AI Summary
- The study introduces LoReMINE, a pipeline for microbial genome mining that automates the process from long-read sequencing data to the prediction and clustering of biosynthetic gene clusters (BGCs).
- LoReMINE integrates various tools to provide a scalable, reproducible workflow for natural product discovery, addressing the limitations of existing methods that require manual curation.
Abstract
Microbial natural products represent a chemically diverse repertoire of small molecules with major pharmaceutical potential. Despite the increasing availability of microbial genome sequences, large-scale natural product discovery remains challenging because existing genome mining approaches lack integrated workflows for rapid dereplication of known compounds and prioritization of novel candidates, forcing researchers to rely on multiple tools that require extensive manual curation and expert intervention at each step. To address these limitations, we introduce LoReMINE (Long Read-based Microbial genome mining pipeline), a fully automated end-to-end pipeline that generates high-quality assemblies, performs taxonomic classification, predicts biosynthetic gene clusters (BGCs) responsible for biosynthesis of natural products, and clusters them into gene cluster families (GCFs) directly from long-read sequencing data. By integrating state-of-the-art tools into a seamless pipeline, LoReMINE enables scalable, reproducible, and comprehensive genome mining across diverse microbial taxa. The pipeline is openly available at https://github.com/kalininalab/LoReMINE and can be installed via Conda (https://anaconda.org/kalininalab/loremine), facilitating broad adoption by the natural product research community.
bioinformatics2026-02-04v1
Generative deep learning expands apo RNA conformational ensembles to include ligand-binding-competent cryptic conformations: a case study of HIV-1 TAR
Kurisaki, I.; Hamada, M.AI Summary
- The study used Molearn, a hybrid molecular-dynamics-generative deep-learning model, to explore cryptic conformations of apo HIV-1 TAR RNA that could bind ligands.
- Molearn was trained on apo TAR conformations and generated a diverse ensemble, from which potential MV2003-binding conformations were identified.
- Docking simulations showed these conformations had RNA-ligand interaction scores similar to NMR-derived complexes, demonstrating the model's ability to predict ligand-binding competent RNA states.
Abstract
RNA plays vital roles in diverse biological processes and represents an attractive class of therapeutic targets. In particular, cryptic ligand-binding sites--absent in apo structures but formed upon conformational rearrangement--offer high specificity for RNA-ligand recognition, yet remain rare among experimentally-resolved RNA-ligand complex structures and difficult to predict in silico. RNA-targeted structure-based drug design (SBDD) is therefore limited by challenges in sampling cryptic states. Here, we apply Molearn, a hybrid molecular-dynamics-generative deep-learning model, to expand apo RNA conformational ensembles toward cryptic states. Focusing on the paradigmatic HIV-1 TAR-MV2003 system, Molearn was trained exclusively on apo TAR conformations and used to generate a diverse ensemble of TAR structures. Candidate cryptic MV2003-binding conformations were subsequently identified using post-generation geometric analyses. Docking simulations of these conformations with MV2003 yielded binding poses with RNA-ligand interaction scores comparable to those of NMR-derived complexes. Notably, this work provides the first demonstration that a generative modeling framework can access cryptic RNA conformations that are ligand-binding competent and have not been recovered in prior molecular-dynamics and deep-learning studies. Finally, we discuss current limitations in scalability and systematic detection, including application to the Internal Ribosome Entry Site, and outline future directions toward RNA-targeted SBDD.
bioinformatics2026-02-03v6
GCP-VQVAE: A Geometry-Complete Language for Protein 3D Structure
Pourmirzaei, M.; Morehead, A.; Esmaili, F.; Ren, J.; Pourmirzaei, M.; Xu, D.AI Summary
- The study introduces GCP-VQVAE, a tokenizer using SE(3)-equivariant GCPNet to convert protein structures into discrete tokens while preserving chirality and orientation.
- Trained on 24 million protein structures, GCP-VQVAE achieves state-of-the-art performance with backbone RMSDs of 0.4377 Å, 0.5293 Å, and 0.7567 Å on CAMEO2024, CASP15, and CASP16 datasets respectively.
- On a zero-shot set of 1,938 new structures, it showed robust generalization with a backbone RMSD of 0.8193 Å and TM-score of 0.9673, and offers significantly reduced latency compared to previous models.
Abstract
Converting protein tertiary structure into discrete tokens via vector-quantized variational autoencoders (VQ-VAEs) creates a language of 3D geometry and provides a natural interface between sequence and structure models. While pose invariance is commonly enforced, retaining chirality and directional cues without sacrificing reconstruction accuracy remains challenging. In this paper, we introduce GCP-VQVAE, a geometry-complete tokenizer built around a strictly SE(3)-equivariant GCPNet encoder that preserves orientation and chirality of protein backbones. We vector-quantize rotation/translation-invariant readouts that retain chirality into a 4,096-token vocabulary, and a transformer decoder maps tokens back to backbone coordinates via a 6D rotation head trained with SE(3)-invariant objectives. Building on these properties, we train GCP-VQVAE on a corpus of 24 million monomer protein backbone structures gathered from the AlphaFold Protein Structure Database. On the CAMEO2024, CASP15, and CASP16 evaluation datasets, the model achieves backbone RMSDs of 0.4377 Å, 0.5293 Å, and 0.7567 Å, respectively, and achieves 100% codebook utilization on a held-out validation set, substantially outperforming prior VQ-VAE-based tokenizers and achieving state-of-the-art performance. Beyond these benchmarks, on a zero-shot set of 1,938 completely new experimental structures, GCP-VQVAE attains a backbone RMSD of 0.8193 Å and a TM-score of 0.9673, demonstrating robust generalization to unseen proteins. Lastly, we show that the Large and Lite variants of GCP-VQVAE are substantially faster than the previous SOTA (AIDO), reaching up to ~408x and ~530x lower end-to-end latency, while remaining robust to structural noise. We make the GCP-VQVAE source code, zero-shot dataset, and its pretrained weights fully open for the research community: https://github.com/mahdip72/vq_encoder_decoder
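The vector-quantization step at the core of any VQ-VAE tokenizer can be written in a few lines: snap each continuous latent to its nearest codebook entry, add the codebook and commitment losses, and use the straight-through estimator so gradients bypass the discrete lookup. The 4,096-entry vocabulary follows the abstract; everything else (dimensions, and the absent SE(3)-equivariant encoder and decoder) is illustrative.

```python
# Generic VQ-VAE quantization with a straight-through estimator.
import torch
import torch.nn.functional as F

codebook = torch.nn.Embedding(4096, 64)           # 4,096-token vocabulary

def vector_quantize(z, beta=0.25):
    """z: (N, 64) continuous latents -> (tokens, quantized latents, vq loss)."""
    d = torch.cdist(z, codebook.weight)            # (N, 4096) distances
    tokens = d.argmin(dim=1)                       # discrete structure tokens
    z_q = codebook(tokens)
    # Codebook loss pulls entries toward encoder outputs; commitment loss
    # (scaled by beta) keeps encoder outputs near their chosen entries.
    vq_loss = F.mse_loss(z_q, z.detach()) + beta * F.mse_loss(z, z_q.detach())
    z_q = z + (z_q - z).detach()                   # straight-through gradients
    return tokens, z_q, vq_loss

z = torch.randn(100, 64, requires_grad=True)       # e.g., per-residue latents
tokens, z_q, vq_loss = vector_quantize(z)
print(tokens.shape, z_q.shape, float(vq_loss))
```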
bioinformatics2026-02-03v3
Automated Segmentation of Kidney Nephron Structures by Deep Learning Models on Label-free Autofluorescence Microscopy for Spatial Multi-omics Data Acquisition and Mining
Patterson, N. H.; Neumann, E. K.; Sharman, K.; Allen, J. L.; Harris, R. C.; Fogo, A. B.; deCaestecker, M. P.; Van de Plas, R.; Spraggins, J. M.AI Summary
- Developed deep learning models for automated segmentation of kidney nephron structures using label-free autofluorescence microscopy.
- Models accurately segmented functional tissue units and gross kidney morphology with F1-scores >0.85 and Dice-Sorensen coefficients >0.80.
- Enabled quantitative association of lipids with segmented structures and spatial transcriptomics data acquisition from collecting ducts, showing differential gene expression in medullary regions.
Abstract
Automated spatial segmentation models can enrich spatio-molecular omics analyses by providing a link to relevant biological structures. We developed segmentation models that use label-free autofluorescence (AF) microscopy to recognize multicellular functional tissue units (FTUs) (glomerulus, proximal tubule, descending thin limb, ascending thick limb, distal tubule, and collecting duct) and gross morphological structures (cortex, outer medulla, and inner medulla) in the human kidney. Annotations were curated using highly specific multiplex immunofluorescence and transferred to co-registered AF for model training. All FTUs (except the descending thin limb) and gross kidney morphology were segmented with high accuracy: >0.85 F1-score, and Dice-Sorensen coefficients >0.80, respectively. This workflow allowed lipids, profiled by imaging mass spectrometry, to be quantitatively associated with segmented FTUs. The segmentation masks were also used to acquire spatial transcriptomics data from collecting ducts. Consistent with previous literature, we demonstrated differing transcript expression of collecting ducts in the inner and outer medulla.
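For reference, the two overlap metrics quoted above coincide on binary masks: the Dice-Sorensen coefficient of a binary segmentation equals the F1-score of its positive class. A minimal computation on synthetic masks:

```python
# Dice-Sorensen coefficient on binary masks (equals positive-class F1).
import numpy as np

def dice(pred, truth, eps=1e-8):
    pred, truth = pred.astype(bool), truth.astype(bool)
    return 2.0 * np.logical_and(pred, truth).sum() / (pred.sum() + truth.sum() + eps)

pred  = np.zeros((64, 64), dtype=np.uint8); pred[10:40, 10:40] = 1
truth = np.zeros((64, 64), dtype=np.uint8); truth[15:45, 15:45] = 1
print(f"Dice / F1 = {dice(pred, truth):.3f}")   # ~0.694 for these toy masks
```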
bioinformatics2026-02-03v2
Informative Missingness in Nominal Data: A Graph-Theoretic Approach to Revealing Hidden Structure
Zangene, E.; Schwammle, V.; Jafari, M.AI Summary
- This study introduces a graph-theoretic approach to analyze missing data in nominal datasets, treating missing values as informative signals rather than gaps.
- By constructing bipartite graphs from nominal variables, the method reveals hidden structures through modularity, nestedness, and similarity analysis.
- Applied across various domains, the approach showed that missing data patterns can distinguish between random and non-random missingness, enhancing structural understanding and aiding in tasks like clustering.
Abstract
Missing data is often treated as a nuisance, routinely imputed or excluded from statistical analyses, especially in nominal datasets where its structure cannot be easily modeled. However, the form of missingness itself can reveal hidden relationships, substructures, and biological or operational constraints within a dataset. In this study, we present a graph-theoretic approach that reinterprets missing values not as gaps to be filled, but as informative signals. By representing nominal variables as nodes and encoding observed or missing associations as edges, we construct both weighted and unweighted bipartite graphs to analyze modularity, nestedness, and projection-based similarities. This framework enables downstream clustering and structural characterization of nominal data based on the topology of observed and missing associations; edge prediction via multiple imputation strategies is included as an optional downstream analysis to evaluate how well inferred values preserve the structure identified in the non-missing data. Across a series of biological, ecological, and social case studies, including proteomics data, the BeatAML drug screening dataset, ecological pollination networks, and HR analytics, we demonstrate that the structure of missing values can be highly informative. These configurations often reflect meaningful constraints and latent substructures, providing signals that help distinguish between data missing at random and not at random. When analyzed with appropriate graph-based tools, these patterns can be leveraged to improve the structural understanding of data and provide complementary signals for downstream tasks such as clustering and similarity analysis. Our findings support a conceptual shift: missing values are not merely analytical obstacles but valuable sources of insight that, when properly modeled, can enrich our understanding of complex nominal systems across domains.
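A small sketch of the bipartite encoding described above: samples and variables become the two node sets, an edge marks an observed (non-missing) value, and community structure of the graph summarizes the missingness pattern. The simulated data, node names, and greedy-modularity choice are illustrative stand-ins, not the paper's pipeline.

```python
# Bipartite graph of observed values; modularity as a crude signal of
# structured (potentially informative) missingness.
import networkx as nx
import numpy as np
from networkx.algorithms import bipartite, community

rng = np.random.default_rng(0)
samples = [f"s{i}" for i in range(20)]
variables = ["tissue", "treatment", "genotype", "batch"]

# Simulate block-structured missingness: the first half of samples rarely
# report 'genotype' and 'batch', mimicking non-random missingness.
G = nx.Graph()
G.add_nodes_from(samples, bipartite=0)
G.add_nodes_from(variables, bipartite=1)
for i, s in enumerate(samples):
    for v in variables:
        p_observed = 0.15 if (i < 10 and v in {"genotype", "batch"}) else 0.9
        if rng.random() < p_observed:
            G.add_edge(s, v)

# High modularity suggests structured missingness rather than MCAR noise.
comms = community.greedy_modularity_communities(G)
print("modularity:", community.modularity(G, comms))

# Sample-sample similarity via projection onto the sample node set.
proj = bipartite.weighted_projected_graph(G, samples)
print(proj.number_of_edges(), "sample-sample similarity edges")
```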
bioinformatics2026-02-03v2
Transcriptomic and protein analysis of human cortex reveals genes and pathways linked to NPTX2 disruption in Alzheimer's disease
Lao, Y.; Xiao, M.-F.; Ji, S.; Piras, I. S.; Kim, K.; Bonfitto, A.; Song, S.; Aldabergenova, A.; Sloan, J.; Trejo, A.; Geula, C.; Na, C.-H.; Rogalski, E. J.; Kawas, C. H.; Corrada, M. M.; Serrano, G. E.; Beach, T. G.; Troncoso, J. C.; Huentelman, M. J.; Barnes, C. A.; Worley, P. F.; Colantuoni, C.AI Summary
- This study used bulk RNA sequencing and targeted proteomics on human cortex samples to explore genes and pathways associated with NPTX2 disruption in Alzheimer's disease (AD).
- NPTX2 expression was significantly reduced in AD, correlating with BDNF, VGF, SST, and SCG2, indicating a role in synaptic and mitochondrial functions.
- In AD, NPTX2-related synaptic and mitochondrial pathways weakened, while stress-linked transcriptional regulators increased, suggesting a shift in regulatory dynamics.
Abstract
The expression of NPTX2, a neuronal immediate early gene (IEG) essential for excitatory-inhibitory balance, is altered in the earliest stages of cognitive decline that precede Alzheimer's disease (AD). Here, we use NPTX2 as a point of reference to identify genes and pathways linked to its role in AD onset and progression. We performed bulk RNA sequencing on 575 middle temporal gyrus (MTG) samples across four cohorts, together with targeted proteomics in 135 of these same samples, focusing on 20 curated proteins spanning synaptic, trafficking, lysosomal, and regulatory categories. NPTX2 RNA and protein were significantly reduced in AD, and to a lesser extent in mild cognitive impairment (MCI) samples. RNA expression of BDNF, VGF, SST, and SCG2 correlated with both NPTX2 mRNA and protein levels. We identified NPTX2-correlated synaptic and mitochondrial programs that were negatively correlated with lysosomal and chromatin/stress modules. Gene set enrichment analysis (GSEA) of NPTX2 correlations across all samples confirmed broad alignment with synaptic and mitochondrial compartments, and more NPTX2-specific associations with proteostasis and translation regulator pathways, all of which were weakened in AD. In contrast, correlation of NPTX2 protein with transcriptomic profiles revealed negative associations with stress-linked transcription regulator RNAs (FOXJ1, ZHX3, SMAD5, JDP2, ZIC4), which were strengthened in AD. These results position NPTX2 as a hub of an activity-regulated "plasticity cluster" (BDNF, VGF, SST, SCG2) that encompasses interneuron function and is embedded on a neuronal/mitochondrial integrity axis that is inversely coupled to lysosomal and chromatin-stress programs. In AD, these RNA-level correlations broadly weaken, and stress-linked transcriptional regulators become more prominent, suggesting a role in NPTX2 loss of function. Individual gene-level data from the bulk RNA-seq in this study can be freely explored at [INSERT LINK].
bioinformatics2026-02-03v2
SpaCEy: Discovery of Functional Spatial Tissue Patterns by Association with Clinical Features Using Explainable Graph Neural Networks
Rifaioglu, A. S.; Ervin, E. H.; Sarigun, A.; Germen, D.; Bodenmiller, B.; Tanevski, J.; Saez-Rodriguez, J.AI Summary
- SpaCEy uses explainable graph neural networks to analyze spatial tissue patterns from molecular marker expression, linking these patterns to clinical outcomes without predefined cell types.
- Applied to lung cancer, SpaCEy identified spatial cell arrangements and protein marker expressions linked to disease progression.
- In breast cancer datasets, SpaCEy stratified patients by overall survival, revealing key spatial patterns of protein markers across and within clinical subtypes.
Abstract
Tissues are complex ecosystems tightly organized in space. This organization influences their function, and its alteration underpins multiple diseases. Spatial omics allows us to profile its molecular basis, but how to leverage these data to link spatial organization and molecular patterns to clinical practice remains a challenge. We present SpaCEy (SpatialClinicalExplainability), an explainable graph neural network that uncovers organizational tissue patterns predictive of clinical outcomes. SpaCEy learns directly from molecular marker expression by modelling tissues as spatial graphs of cells and their interactions, without requiring predefined cell types or anatomical regions. Its embeddings capture intercellular relationships and molecular dependencies that enable accurate prediction of variables such as overall survival and disease progression. SpaCEy integrates a specialized explainer module that reveals recurring spatial patterns of cell organisation and coordinated marker expression that are most relevant to predictions of the models. Applied to a spatially resolved proteomic lung cancer cohort, SpaCEy discovers distinct spatial arrangements of cells together with coordinated expression of protein markers associated with disease progression. Across multiple breast cancer proteomic datasets, it consistently stratifies patients according to overall survival, both across and within established clinical subtypes. SpaCEy also highlights spatial patterns of a small set of key protein markers underlying this patient stratification.
bioinformatics2026-02-03v2
ImmunoPheno: A Computational Framework for Data-Driven Design and Analysis of Immunophenotyping Experiments
Wu, L.; Nguyen, M. A.; Yang, Z.; Potluri, S.; Sivagnanam, S.; Kirchberger, N.; Joshi, A.; Ahn, K. J.; Tumulty, J. S.; Cruz Cabrera, E.; Romberg, N.; Tan, K.; Coussens, L. M.; Camara, P. G.AI Summary
- ImmunoPheno is a computational framework that uses single-cell proteo-transcriptomic data to automate the design of antibody panels, gating strategies, and cell identity annotation for immunophenotyping.
- It was used to create a reference (HICAR) with 390 antibodies and 93 immune cell populations, enabling the design of minimal panels for isolating rare cells like MAIT cells and pDCs, validated experimentally.
- The framework accurately annotates cell identities across various cytometry datasets, enhancing the accuracy, reproducibility, and resolution of immunophenotyping.
Abstract
Immunophenotyping is fundamental to characterizing tissue cellular composition, pathogenic processes, and immune infiltration, yet its accuracy and reproducibility remain constrained by heuristic antibody panel design and manual gating. Here, we present ImmunoPheno, an open-source computational platform that repurposes large-scale single-cell proteo-transcriptomic data to guide immunophenotyping experimental design and analysis. ImmunoPheno integrates existing datasets to automate the design of optimal antibody panels, gating strategies, and cell identity annotation. We used ImmunoPheno to construct a harmonized reference (HICAR) comprising 390 monoclonal antibodies and 93 human immune cell populations. Leveraging this resource, we algorithmically designed minimal panels to isolate rare populations, such as MAIT cells and pDCs, which we validated experimentally. We further demonstrate accurate cell identity annotation across publicly available and newly generated cytometry datasets spanning diverse technologies, including spatial platforms like CODEX. ImmunoPheno complements expert curation and supports continual expansion, providing a scalable framework to enhance the accuracy, reproducibility, and resolution of immunophenotyping.
bioinformatics2026-02-03v1
A modality gap in personal-genome prediction by sequence-to-function models
Mostafavi, S.; Tu, X.; Spiro, A.; Chikina, M.AI Summary
- The study evaluated AlphaGenome's ability to predict personal genome variations in gene expression and chromatin accessibility.
- AlphaGenome performed near the heritability ceiling for chromatin accessibility but significantly underperformed for gene expression compared to baseline.
- Findings suggest chromatin accessibility is influenced by local regulatory elements, while gene expression requires integration of long-range regulatory effects, which current models struggle with.
Abstract
Sequence-to-function (S2F) models trained on reference genomes have achieved strong performance on regulatory prediction and variant-effect benchmarks, yet they still struggle to predict inter-individual variation in gene expression from personal genomes. We evaluated AlphaGenome on personal genome prediction in two molecular modalities--gene expression and chromatin accessibility--and observed a striking dichotomy: AlphaGenome approaches the heritability ceiling for chromatin accessibility variation, but remains far below baseline for gene-expression variation, despite improving over Borzoi. Context truncation and fine-mapped QTL analyses indicate that accessibility is governed by local regulatory grammar captured by current architectures, whereas gene-expression variation requires long-range regulatory integration that remains challenging.
bioinformatics2026-02-03v1
GAISHI: A Python Package for Detecting Ghost Introgression with Machine Learning
Huang, X.; Hackl, J.; Kuhlwilm, M.AI Summary
- GAISHI is a Python package designed to detect ghost introgression using machine learning techniques like logistic regression and UNet++.
- It addresses the limitation of previous studies by providing a software implementation for identifying introgressed segments and alleles.
- The package's utility was demonstrated in a Human-Neanderthal introgression scenario.
Abstract
Summary: Ghost introgression is a challenging problem in population genetics. Recent studies have explored supervised learning models, namely logistic regression and UNet++, to detect genomic footprints of ghost introgression. However, their applicability is limited because the existing implementations are tailored to the tasks in their respective publications and are not available as reusable software. Here, we present GAISHI, a Python package for identifying introgressed segments and alleles using machine learning, and demonstrate its usage in a Human-Neanderthal introgression scenario. Availability and implementation: GAISHI is available on GitHub under the GNU General Public License v3.0. The source code can be found at https://github.com/xin-huang/gaishi.
bioinformatics2026-02-03v1
PepMCP: A Graph-Based Membrane Contact Probability Predictor for Membrane-Lytic Antimicrobial Peptides
Dong, R.; Awang, T.; Cao, Q.; Kang, K.; Wang, L.; Zhu, Z.; Song, C.AI Summary
- This study introduces PepMCP, a graph-based model for predicting membrane contact probability (MCP) of short antimicrobial peptides (AMPs) targeting bacterial membranes.
- Over 500 membrane-lytic AMPs were used to train PepMCP, employing coarse-grained molecular dynamics simulations and the GraphSAGE framework.
- PepMCP achieved a Pearson correlation coefficient of 0.883 and RMSE of 0.123, enhancing mechanism-driven AMP discovery with the MemAMPdb database and a web server for access.
Abstract
Motivation: The membrane-lytic mechanism of antimicrobial peptides (AMPs) is often overlooked during their in silico discovery process, largely due to the lack of a suitable metric for the membrane-binding propensity of peptides. Previously, we proposed a characteristic called membrane contact probability (MCP) and applied it to the identification of membrane proteins and membrane-lytic AMPs. However, previous MCP predictors were not trained on short peptides targeting bacterial membranes, which may result in unsatisfactory performance for peptide studies. Results: In this study, we present PepMCP, a peptide-tailored model for predicting MCP values of short peptides. We collected more than 500 membrane-lytic AMPs from the literature, conducted coarse-grained molecular dynamics (MD) simulations for these AMPs, and extracted their residue MCP labels from MD trajectories to train PepMCP. PepMCP employs the GraphSAGE framework to address this node regression task, encoding each peptide sequence as a graph with 4-hop edges. PepMCP achieved a Pearson correlation coefficient of 0.883 and an RMSE of 0.123 on the node-level test set. It can recognize membrane-lytic AMPs with the predicted MCP values for each sequence, thereby facilitating mechanism-driven AMP discovery. Additionally, we provide a database, MemAMPdb, which includes the membrane-lytic AMPs, as well as the PepMCP web server for easy access. Availability and Implementation: The code and data are available at https://github.com/ComputBiophys/PepMCP.
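A minimal GraphSAGE node-regression setup matching the description above: residues are nodes, edges connect residues within four sequence positions, and the network regresses a per-residue contact probability. The features and labels below are random placeholders, and the architecture is a generic sketch; consult the PepMCP repository for the real model and training data.

```python
# GraphSAGE node regression over a peptide graph with 4-hop sequence edges.
import torch
from torch_geometric.data import Data
from torch_geometric.nn import SAGEConv

def peptide_graph(length, feat_dim=20):
    # Connect residues within 4 sequence positions of each other.
    edges = [(i, j) for i in range(length)
             for j in range(length) if i != j and abs(i - j) <= 4]
    edge_index = torch.tensor(edges, dtype=torch.long).t().contiguous()
    x = torch.randn(length, feat_dim)   # placeholder residue features
    y = torch.rand(length)              # placeholder MCP labels in [0, 1]
    return Data(x=x, edge_index=edge_index, y=y)

class MCPRegressor(torch.nn.Module):
    def __init__(self, feat_dim=20, hidden=64):
        super().__init__()
        self.conv1 = SAGEConv(feat_dim, hidden)
        self.conv2 = SAGEConv(hidden, hidden)
        self.head = torch.nn.Linear(hidden, 1)

    def forward(self, data):
        h = self.conv1(data.x, data.edge_index).relu()
        h = self.conv2(h, data.edge_index).relu()
        return torch.sigmoid(self.head(h)).squeeze(-1)   # per-residue MCP

data = peptide_graph(length=25)
model = MCPRegressor()
loss = torch.nn.functional.mse_loss(model(data), data.y)
loss.backward()
```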
bioinformatics2026-02-03v1
Predicting mutation-rate variation across the genome using epigenetic data
Katori, M.; Kobayashi, T. J.; Nordborg, M.; Shi, S.AI Summary
- The study integrates epigenetic data (histone marks, DNA methylation, chromatin accessibility) with de novo mutation data in Arabidopsis thaliana to model mutation probability at the coding sequence level.
- Using non-negative matrix factorization, 15 epigenetic patterns were identified, stratifying coding sequences into six classes with different mutation probabilities.
- A predictive model based on these patterns outperformed others, showing that epigenetic context significantly influences local mutation rates, with changes under hypoxia indicating dynamic chromatin effects on mutation probability.
Abstract
Mutation rate variation is a fundamental driver of evolution, yet how it is locally patterned across genomes and structured by chromatin context remains unresolved. Here, we integrate genome-wide profiles of histone marks, DNA methylation and chromatin accessibility in Arabidopsis thaliana with de novo mutation data to model mutation probability at the level of coding sequence (CDS). Using non-negative matrix factorization, we identify 15 combinatorial epigenetic patterns whose graded mixtures stratify CDSs into six classes with distinct mutation probabilities. A generalized linear model based on pattern weights predicts local mutation probability and outperforms models based on sequence context, expression and classical genomic categories. These patterns capture context-dependent variation that is obscured by gene-level summaries and single-feature analyses. Cluster-level differences are partly retained in mutation-accumulation lines, indicating persistence into heritable mutational input. Under hypoxia, stress-responsive chromatin remodeling redistributes epigenetic contexts associated with higher predicted mutation probability toward hypoxia-responsive genes and DNA-repair pathways. Together, our results provide a CDS-resolved and interpretable framework linking combinatorial epigenomic context to mutational input, clarifying how dynamic chromatin states shape local mutation-rate heterogeneity.
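The decomposition-then-regression design can be sketched in a few lines of scikit-learn: factorize a CDS-by-mark matrix into 15 non-negative patterns, then regress mutation counts on the per-CDS pattern weights with a Poisson GLM. The data here are simulated, and the preprocessing and exact model family are assumptions inspired by, not taken from, the paper.

```python
# NMF patterns from epigenetic signals, then a Poisson GLM on pattern weights.
import numpy as np
from sklearn.decomposition import NMF
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(1)
n_cds, n_marks = 500, 30
X = rng.gamma(shape=2.0, scale=1.0, size=(n_cds, n_marks))  # epigenetic signals

nmf = NMF(n_components=15, init="nndsvda", max_iter=500, random_state=0)
W = nmf.fit_transform(X)     # (CDS x 15) pattern weights per coding sequence
H = nmf.components_          # (15 x marks) combinatorial epigenetic patterns

# Simulated mutation counts that depend on the patterns, then the GLM.
mutations = rng.poisson(lam=np.exp(0.1 * W.sum(axis=1)))
glm = PoissonRegressor(alpha=1e-3, max_iter=300).fit(W, mutations)
print("in-sample D^2:", round(glm.score(W, mutations), 3))
```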
bioinformatics2026-02-03v1
Predicting unknown binding sites for transition metal based compounds in proteins
Levy, A.; Rothlisberger, U.AI Summary
- This study evaluates the use of Metal3D and Metal1D, tools originally designed for zinc ion binding prediction, to identify binding sites for transition metal complexes in proteins.
- Both tools successfully predicted several known binding sites from apo protein structures, despite limitations like sensitivity to side-chain conformations.
- The research suggests a computational pipeline where these tools could initially identify potential binding sites, followed by refinement with more precise methods.
Abstract
Transition metal based compounds are promising therapeutic agents, particularly in cancer treatment. However, predicting their binding sites remains a major challenge. In this work, we investigate the applicability of two tools, Metal3D and Metal1D, for this purpose. Although originally trained to predict zinc ion binding sites only, both predictors successfully identify several experimentally observed binding sites for transition metal complexes directly from apo protein structures. At the same time, we highlight current limitations, such as the sensitivity to side-chain conformations, and discuss possible strategies for improvement. This work provides a first step toward establishing a robust computational pipeline in which rapid and low-cost predictors are able to identify putative hotspots for transition metal binding, which can then be refined using more accurate but computationally demanding methods.
bioinformatics2026-02-03v1
PPGLomics: An Interactive Platform for Pheochromocytoma and Paraganglioma Transcriptomics
Alkaissi, H.; Gordon, C. M.; Pacak, K.AI Summary
- PPGLomics is an interactive web platform for analyzing pheochromocytoma and paraganglioma (PPGL) transcriptomics, addressing the lack of disease-specific bioinformatics resources.
- It integrates the TCGA-PCPG (n=160) and A5 consortium SDHB (n=91) datasets, offering tools for differential expression, correlation, survival analysis, and various visualizations.
- The platform is designed for use by scientists and healthcare professionals without requiring bioinformatics expertise and is freely accessible online.
Abstract
Pheochromocytoma and paraganglioma (PPGL) are rare neuroendocrine tumors with unique biological behavior and remarkably high heritability, yet dedicated bioinformatics resources for these diagnoses remain limited. Existing cancer multi-omics platforms are pan-cancer in scope, often lacking the disease-specific annotations, granularity, and cross-database harmonization required for meaningful stratification and hypothesis generation. Here we introduce PPGLomics, an interactive web-based platform designed for comprehensive PPGL transcriptomics analysis. PPGLomics v1.0 integrates two major datasets, the TCGA-PCPG cohort (n=160) spanning multiple molecular subtypes, and the A5 consortium SDHB cohort (n=91) with detailed clinicopathological and molecular annotations. The platform provides basic and clinical scientists, as well as a broad range of healthcare professionals, with tools for differential expression analysis, correlation analysis, survival analysis, and visualization, including boxplots, heatmaps, volcano plots, and Kaplan-Meier survival plots, enabling exploration of gene expression patterns across PPGL subtypes without requiring bioinformatics expertise. PPGLomics v1.0 is freely available at https://alkaissilab.shinyapps.io/PPGLomics.
bioinformatics2026-02-03v1
HiChIA-Rep quantifies the similarity between enrichment-based chromatin interaction datasets
Kim, S. S.; Jackson, J. T.; Zhang, H. B.; Kim, M.AI Summary
- HiChIA-Rep is an algorithm designed to quantify the similarity between datasets from enrichment-based 3D genome mapping technologies like ChIA-PET and HiChIP.
- It uses both 1D and 2D signals through graph signal processing to assess data reproducibility.
- HiChIA-Rep effectively distinguishes biological replicates from non-replicates and outperforms tools designed for Hi-C data.
Abstract
3D genome mapping technologies such as ChIA-PET, HiChIP, PLAC-seq, HiCAR, and ChIATAC yield pairwise contacts and a one-dimensional signal indicating protein binding or chromatin accessibility. However, a lack of computational tools to quantify the reproducibility of these enrichment-based 3C data prevents rigorous data quality assessment and interpretation. We developed HiChIA-Rep, an algorithm incorporating both 1D and 2D signals to measure similarity via graph signal processing methods. HiChIA-Rep can distinguish biological replicates from non-replicates across cell lines and protein factors, outperforming tools designed for Hi-C data. With the growing number of multi-ome datasets being generated, HiChIA-Rep will likely become a fundamental tool for the 3D genomics community.
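As a generic illustration of graph-signal-processing-style reproducibility scoring (not HiChIA-Rep's actual algorithm, which also integrates the 1D binding/accessibility signal), the sketch below diffuses each contact map along its own contact graph before correlating, so the score reflects shared multi-scale structure rather than bin-level noise:

```python
# Smooth contact maps with a few random-walk steps, then correlate.
import numpy as np

def smooth(contacts, steps=2):
    row_sums = contacts.sum(axis=1, keepdims=True).clip(min=1e-12)
    P = contacts / row_sums          # row-stochastic transition matrix
    S = contacts.astype(float)
    for _ in range(steps):
        S = P @ S @ P.T              # diffuse the 2D signal along the graph
    return S

def similarity(A, B, steps=2):
    a, b = smooth(A, steps).ravel(), smooth(B, steps).ravel()
    return np.corrcoef(a, b)[0, 1]

rng = np.random.default_rng(0)
base = np.exp(-np.abs(np.subtract.outer(np.arange(50), np.arange(50))) / 5.0)
rep1 = rng.poisson(50 * base)        # two noisy "replicates" of one structure
rep2 = rng.poisson(50 * base)
other = rng.poisson(50 * np.roll(base, 12, axis=0))   # a different structure
print("replicates:    ", round(similarity(rep1, rep2), 3))
print("non-replicates:", round(similarity(rep1, other), 3))
```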
bioinformatics2026-02-03v1
MOSAIC: A Structured Multi-level Framework for Probabilistic and Interpretable Cell-type Annotation
Yang, M.; Qi, J.; Lan, M.; Huang, J.; Jin, S.AI Summary
- MOSAIC is a multi-level framework for cell-type annotation in single-cell RNA sequencing that integrates cell-level marker evidence with cluster-level population context.
- It uses a probabilistic approach to handle uncertainty, mixed states, and population structure, improving upon single-level annotation methods.
- Across six tissues and under dropout perturbations, MOSAIC matched or outperformed other methods, providing structured uncertainty estimates and identifying stable intermediate cell states.
Abstract
Accurate cell-type annotation is a foundational task in single-cell RNA sequencing analysis, yet remains fundamentally challenged by cellular heterogeneity, gradual lineage transitions, and technical noise. As single-cell atlases expand in scale and resolution, most existing annotation approaches operate at a single analytical level and encode cell identity as fixed categorical labels, limiting their ability to represent uncertainty, mixed biological states, and population-level structure. Here we introduce MOSAIC (Multi-level prObabilistic and Structured Adaptive IdentifiCation), a structured multi-level annotation framework that integrates cell-level marker evidence with cluster-level population context within a unified probabilistic system. Rather than treating annotation as an independent per-cell prediction task, MOSAIC formulates cell-type assignment as a coordinated multi-level inference process, in which probabilistic evidence at the single-cell level is aggregated, constrained, and refined by population context. MOSAIC integrates direction-aware marker scoring with dual-layer probabilistic representation and adaptive cross-level refinement, enabling uncertainty to be quantified and propagated across biological scales. This design yields coherent annotations that preserve fine-grained single-cell variation while maintaining population-level consistency, and allows ambiguous or transitional states to be represented explicitly rather than collapsed into hard labels. Across six diverse tissues and under controlled dropout perturbations, MOSAIC consistently matches or outperforms representative marker-based, reference-based, and machine-learning annotation methods. Beyond accuracy, MOSAIC provides structured uncertainty estimates and coherent population-level structure, enabling the identification of stable intermediate cell states that arise from gradual lineage transitions rather than technical noise. Together, MOSAIC advances cell-type annotation from a single-level classification task to a structured multi-level inference problem, and establishes a general, interpretable, and uncertainty-aware computational framework for large-scale single-cell analysis.
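The multi-level idea can be miniaturized as follows: marker evidence yields per-cell posteriors, cluster composition supplies a population-level prior, and a simple reweighting reconciles the two while leaving genuinely ambiguous cells soft rather than force-labeled. The arithmetic below is a generic stand-in for MOSAIC's direction-aware scoring and adaptive cross-level refinement.

```python
# Two-level annotation: cell-level softmax posteriors refined by cluster context.
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_types = 12, 3
cell_logits = rng.normal(size=(n_cells, n_types))     # marker evidence per cell
clusters = np.array([0] * 6 + [1] * 6)                 # cluster assignments

# Level 1: cell-level posteriors from marker evidence (softmax).
p_cell = np.exp(cell_logits)
p_cell /= p_cell.sum(axis=1, keepdims=True)

# Level 2: cluster-level context = average composition of each cluster.
p_cluster = np.vstack([p_cell[clusters == c].mean(axis=0) for c in (0, 1)])

# Cross-level refinement: reweight each cell by its cluster's composition.
p_refined = p_cell * p_cluster[clusters]
p_refined /= p_refined.sum(axis=1, keepdims=True)

# Cells whose refined posterior stays flat are flagged, not force-labeled.
entropy = -(p_refined * np.log(p_refined)).sum(axis=1)
print("ambiguous cells:", np.where(entropy > 0.9 * np.log(n_types))[0])
```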
bioinformatics2026-02-03v1
Attractor Landscape Analysis Distinguishes Aging Markers from Rejuvenation Targets in Human Keratinocytes
Copes, N.; Canfield, C.-A. E.AI Summary
- The study used PRISM, a computational pipeline integrating pseudotime trajectory and Boolean network analysis, to identify rejuvenation targets in aging human keratinocytes from single-cell RNA sequencing data.
- Two distinct aging trajectories were identified: one where cells converge to an aged state (Y_272) and another where cells depart from a youthful state (Y_308).
- Key findings included BACH2 knockdown as the top rejuvenation target for Y_272, improving the aging score by 98.9%, and ASCL2 knockdown for Y_308, with enhanced effects when combined with ATF6 perturbation.
Abstract
Cellular aging is characterized by progressive changes in gene expression that contribute to tissue dysfunction; however, identifying genes that regulate the aging process, rather than merely serve as biomarkers, remains a significant challenge. Here we present PRISM (Pseudotime Reversion via In Silico Modeling), a computational pipeline that integrates pseudotime trajectory analysis with Boolean network analysis to identify cellular rejuvenation targets from single-cell RNA sequencing data. We applied PRISM to a published dataset of human skin comprising 47,060 cells from nine donors aged 18 to 76 years. Analysis of keratinocytes revealed two distinct aging trajectories with fundamentally different regulatory architectures. One trajectory (labeled Y_272) exhibited "aging as convergence," where cells were driven toward a single dominant aged attractor (aging score +2.181). A second trajectory (labeled Y_308) exhibited "aging as departure," where cells escaped from a dominant youthful attractor basin (aging score -0.536). Systematic perturbation analysis revealed a critical distinction between genes exhibiting age-related expression changes (phenotypic markers) and genes controlling attractor landscape architecture (regulatory controllers). Switch genes marking the aging trajectories proved largely ineffective as intervention targets, while master regulators operating at higher levels of the regulatory hierarchy produced substantial rejuvenation effects. BACH2 knockdown was identified as the dominant intervention for Y_272, shifting the aging score by Δ = -3.746 (98.9% improvement). ASCL2 knockdown was identified as the top target for Y_308, with synergistic enhancement observed through combinatorial perturbation with ATF6. These findings demonstrate that attractor-based analysis identifies different and potentially superior therapeutic targets compared to expression-based approaches and provide specific hypotheses for experimental validation of cellular rejuvenation strategies in human skin.
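A toy example of the attractor logic involved: iterate a Boolean network synchronously until it enters a cycle, clamp one gene to zero to simulate a knockdown, and compare the attractor landscapes before and after. The three-gene rules are invented for illustration; PRISM derives its networks from the scRNA-seq data.

```python
# Synchronous Boolean network: enumerate attractors, then repeat under a
# simulated gene knockdown (node clamped to 0) and compare the landscapes.
from itertools import product

# State: (A, B, C). Invented rules: A activates B, B activates C, C represses A.
def step(state, knockdown=None):
    a, b, c = state
    nxt = (int(not c), a, b)
    if knockdown is not None:
        nxt = tuple(0 if i == knockdown else v for i, v in enumerate(nxt))
    return nxt

def attractor(state, knockdown=None):
    seen = []
    while state not in seen:
        seen.append(state)
        state = step(state, knockdown)
    return tuple(seen[seen.index(state):])   # the cycle the trajectory enters

wild_type = {s: attractor(s) for s in product((0, 1), repeat=3)}
perturbed = {s: attractor(s, knockdown=0) for s in product((0, 1), repeat=3)}
print("wild-type attractors: ", set(wild_type.values()))
print("A-knockdown attractors:", set(perturbed.values()))
```

In this miniature network the knockdown collapses a limit cycle into a single fixed point, which is the kind of landscape shift that distinguishes a regulatory controller from a gene that merely changes expression along the trajectory.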
bioinformatics2026-02-03v1
An agentic framework turns patient-sourced records into a multimodal map of ALS heterogeneity
Li, Z.; Gao, C.; Kong, J.; Fu, Y.; Wen, S.; Li, G.; Cao, Y.; Fu, Y.; Zhang, H.; Jia, S.; Liu, X.; Cai, L.; Yan, F.; Liu, X.; Tian, L.AI Summary
- The study introduces MEDSTREM, an LLM-based agent that transforms patient-sourced document images into standardized electronic health records, facilitating cohort building and linkage to trials and multi-omics data.
- By analyzing 8,298 individuals' clinical reports, MEDSTREM generated 17,602 records and multi-omics profiles, identifying five ALS subtypes and a continuous degeneration score.
- Key findings include functional loss tracking with hand-grip strength and forced vital capacity, malnutrition as a modifiable factor, and epigenetic changes like cell-cycle suppression and chromatin opening linked to clinical severity.
Abstract
ALS shows marked clinical heterogeneity, yet much real-world evidence remains trapped in unstructured reports. Here we introduce MEDSTREM, a large-language-model (LLM)-based agent that converts patient-sourced document images into standardized longitudinal electronic health records, enabling bottom-up cohort building and linkage to trials and multi-omics. By applying MEDSTREM to clinical report images from 8,298 individuals collected via AskHelpU and harmonizing with PRO-ACT and Answer ALS, we generated 17,602 standardized records and multi-omics profiles from 940 induced motor neuron lines. Progression modelling resolved five subtypes and a continuous degeneration score with interpretable anchors: hand-grip strength and forced vital capacity tracked functional loss, and malnutrition emerged as a modifiable correlate. Across RNA-seq and ATAC-seq, clinical severity is aligned with suppression of cell-cycle programmes, declining histone-gene activity and genome-wide chromatin opening, suggesting distinct epigenetic trajectories. These findings establish an agentic AI framework that turns unstructured clinical records into mechanistic insight and links them to multi-omics, reframing ALS studies from top-down, trial-centric analyses to a bottom-up, patient-sourced approach that reveals actionable heterogeneity.
bioinformatics2026-02-03v1