Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
Learning a Continuous Progression Trajectory of Amyloid in Alzheimer's disease
Tong, M.; Mehfooz, F.; Zhang, S.; Wang, Y.; Fang, S.; Saykin, A. J.; Wang, X.; Yan, J.; Alzheimer's Disease Neuroimaging Initiative
AI Summary
- Researchers developed SLOPE, an unsupervised method to model amyloid progression in Alzheimer's disease (AD) continuously using longitudinal amyloid PET data.
- SLOPE generated a two-dimensional trajectory that better preserved temporal progression and showed greater sensitivity to early amyloid changes than global measures.
- The method revealed biologically consistent amyloid spreading patterns, enhancing disease modeling and monitoring in early AD stages.
Abstract
BACKGROUND: Understanding Alzheimer's disease progression is critical for timely diagnosis and treatment evaluation, but traditional discrete diagnostic groups often lack sensitivity to subtle early-stage changes. METHODS: We developed SLOPE, an unsupervised dimensionality reduction method that models amyloid progression in AD on a continuous scale while preserving the temporal order of longitudinal follow-up visits. Applied to longitudinal amyloid PET data, SLOPE generated a two-dimensional trajectory capturing global amyloid accumulation across the AD continuum. RESULTS: SLOPE-derived staging scores better preserved temporal progression across diagnostic groups and longitudinal follow-up visits, and generalized to held-out subjects. The learned trajectory revealed biologically consistent amyloid spreading patterns and greater sensitivity to early progression than global amyloid SUVR. DISCUSSION: SLOPE provides a continuous staging of amyloid pathology that complements global amyloid measures by capturing early localized progression. These properties highlight its potential in disease modeling and monitoring, particularly in early and preclinical stages of AD.
bioinformatics · 2026-02-06 · v1
StrainCascade: An automated, modular workflow for high-throughput long-read bacterial genome reconstruction and characterization
Jordi, S. B. U.; Baertschi, I.; Li, J.; Fasel, N.; Misselwitz, B.; Yilmaz, B.
AI Summary
- StrainCascade is an automated, modular workflow designed to streamline high-throughput long-read bacterial genome reconstruction.
- It integrates genome assembly, annotation, and functional profiling into a single framework, enhancing reproducibility.
- Key findings include improved resolution of strain-level variability, facilitating comparative genomics on diversity, host-microbe interactions, resistance mechanisms, and mobile genetic elements.
Abstract
Long-read sequencing offers unprecedented opportunities for high-resolution bacterial genome reconstruction, yet fragmented bioinformatics workflows hinder biological insights. StrainCascade addresses this gap by providing a fully automated, modular pipeline that integrates genome assembly, accurate annotation, and comprehensive functional profiling into a single, reproducible framework. Leveraging deterministic computational execution strategies, StrainCascade systematically resolves strain-level structural and functional variability, enabling robust comparative genomics of strain diversity, host-microbe interactions, antimicrobial resistance mechanisms, and mobile genetic element dynamics.
bioinformatics · 2026-02-06 · v1
Unified imputation of missing data modalities and features in multi-omic data via shared representation learning
Nambiar, A.; Melendez, C.; Noble, W. S.
AI Summary
- The study introduces MIMIR, a deep learning framework for imputing missing data in multi-omic studies by addressing both missing modalities and missing values through shared representation learning.
- MIMIR uses masked autoencoders to learn modality-specific representations, which are then projected into a common latent space for reconstruction from any observed modality subset.
- Evaluated on The Cancer Genome Atlas data, MIMIR outperformed baseline methods in various missing data scenarios, revealing structured cross-modal dependencies that influence imputation accuracy.
Abstract
Multi-omic studies promise a more comprehensive view of biological systems by jointly measuring multiple molecular layers. In practice, however, such datasets are rarely complete: entire molecular modalities may be missing for many samples, and observed modalities often contain substantial feature-level missingness. Existing imputation approaches typically address only one of these two problems, relying either on feature-level imputation within a single modality or on pairwise translation models that cannot accommodate arbitrary combinations of missing modalities. As a result, there is, to our knowledge, no unified framework for reconstructing both missing data modalities and missing values within those modalities. We present MIMIR, a deep learning framework for unified multi-omic imputation that addresses both missing modalities and missing values through shared representation learning. MIMIR first learns modality-specific representations using masked autoencoders and then projects these representations into a common latent space, enabling reconstruction from any subset of observed modalities. Evaluated on pan-cancer multi-omic data from The Cancer Genome Atlas, MIMIR consistently outperforms baseline methods across a range of missing-modality and missing-value scenarios, including missing completely at random and missing not at random settings. Analysis of the learned shared space reveals structured cross-modal dependencies that explain modality-specific differences in imputation accuracy, with transcriptional and epigenetic modalities forming a strongly aligned core and copy number variation contributing more distinct signal. Together, these results demonstrate that shared representation learning provides an effective and flexible foundation for multi-omic imputation under heterogeneous missingness.
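The shared-representation design described above lends itself to a compact sketch. Below is a minimal, hypothetical PyTorch illustration of the idea (module names, dimensions, and the mean-fusion step are our own assumptions, not the authors' implementation): each modality gets its own encoder, modality-specific latents are projected into a common space, and every modality is reconstructed from whichever subset happens to be observed.

```python
# Minimal sketch of a shared-latent multi-omic imputer in the spirit of the
# MIMIR description; all names and dimensions are illustrative.
import torch
import torch.nn as nn

class SharedLatentImputer(nn.Module):
    def __init__(self, modality_dims, latent_dim=64, shared_dim=32):
        super().__init__()
        self.encoders = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(d, latent_dim), nn.ReLU())
            for name, d in modality_dims.items()
        })
        # Project each modality-specific latent into a common space.
        self.to_shared = nn.ModuleDict({
            name: nn.Linear(latent_dim, shared_dim) for name in modality_dims
        })
        self.decoders = nn.ModuleDict({
            name: nn.Linear(shared_dim, d) for name, d in modality_dims.items()
        })

    def forward(self, observed):
        # observed: dict mapping modality name -> (batch, dim) tensor,
        # containing only the modalities measured for these samples.
        shared = [self.to_shared[m](self.encoders[m](x))
                  for m, x in observed.items()]
        z = torch.stack(shared).mean(dim=0)  # fuse observed modalities
        # Reconstruct every modality, observed or not, from the fused latent.
        return {m: dec(z) for m, dec in self.decoders.items()}

model = SharedLatentImputer({"rna": 1000, "methylation": 800, "cnv": 500})
batch = {"rna": torch.randn(4, 1000)}  # methylation and CNV missing
recon = model(batch)                   # imputes all three layers
```

In training, one would mask features and whole modalities at random and penalize reconstruction error only on observed entries, mirroring the masked-autoencoder setup the abstract describes.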
bioinformatics · 2026-02-06 · v1
Reference genome choice impacts SNP recovery but not evolutionary inference in young species
Soares, L. S.; Goncalves, L. T.; Guzman-Rodriguez, S.; Bombarely, A.; Freitas, L. B.
AI Summary
- This study investigates how the choice of reference genome affects SNP recovery and evolutionary inference in RAD-seq analyses of young species like Petunia and Calibrachoa.
- Using congeneric reference genomes resulted in consistent mapping rates, SNP recovery, and population genomic patterns, while distantly related genomes showed lower mapping rates and affected summary statistics.
- Despite these differences, the broader genetic structure, diversity, and evolutionary relationships remained consistent, suggesting that closely related reference genomes are sufficient for robust analyses in recent radiations.
Abstract
Reduced-representation sequencing approaches such as RAD-seq are widely used in population genomics and phylogenetics, particularly for non-model organisms. However, bioinformatics choices during data processing can strongly influence downstream analyses. One key but underexplored factor is the reference genome used for read alignment and SNP discovery. Here, we evaluate the effects of reference genome choice on RAD-seq analyses using multiple datasets spanning recent radiations in Petunia and Calibrachoa, and reference genomes that differ in phylogenetic relatedness. When using congeneric reference genomes, we observed highly consistent mapping rates, SNP recovery, and downstream population genomic patterns. In contrast, mapping to more distantly related genomes resulted in lower mapping rates and stronger effects on summary statistics. Despite these quantitative reductions, broader patterns of genetic structure and diversity, as well as evolutionary relationships, remained largely congruent across reference genomes. Overall, our results indicate that reference genome choice matters most when genomes are distantly related or when analyses target fine-scale genomic signals. For recent radiations with largely conserved genome structure, closely related reference genomes yield comparable SNP datasets and lead to the same biological conclusions regarding population structure and phylogenetic relationships. These findings provide practical guidance for RAD-seq studies in non-model systems, showing that congeneric reference genomes are sufficient for robust population and phylogenetic inference, and that more distantly related genomes can remain informative when no close reference is available.
bioinformatics · 2026-02-06 · v1
Scaling Variant-Aware Multiplex Primer Design
Han, Y.; Boucher, C.
AI Summary
- The study addresses the challenge of designing primers for multiplex PCR that are effective across diverse and evolving pathogen genomes by introducing a near-linear algorithm for Primer Design Region (PDR) optimization with provable guarantees.
- A reference-free risk model based on Gini impurity was developed to ensure PDRs are robust to sequence diversity, and a local-search heuristic was used for optimizing primer subsets for thermodynamic stability.
- Testing on Foot-and-Mouth Disease and Zika virus datasets showed that the method, Δ-PRO, produced more compact and robust PDR sets with reduced predicted dimerization, enhancing multiplex PCR efficiency.
Abstract
Motivation: Robust primer design is essential for reliable multiplex PCR in diverse and evolving pathogen, microbial, and host genomes. Traditional methods optimized for a single reference often fail on emerging variants, leading to reduced efficiency. Variant-aware design seeks primers that remain effective across diverse targets, but this introduces two key challenges: identifying robust candidates and selecting an optimal subset of primers. Although there are methods for the first challenge, namely the Primer Design Region (PDR) optimization problem, existing approaches lack optimality guarantees. Results: We introduce a near-linear algorithm with provable guarantees for efficient PDR optimization. Complementing this, we propose a reference-free risk model based on Gini impurity that provides a stable, biologically interpretable measure of site-specific variation and yields PDRs that are robust to sequence diversity across datasets without ad hoc smoothing. For the second challenge, related to thermodynamic stability, we optimize predicted ΔG and cast subset selection as a k-partite maximum-weight clique problem, which is NP-hard. We then design an efficient local-search heuristic with linear-time updates. Together, these advances yield a principled, scalable framework for variant-aware primer design. Across Foot-and-Mouth Disease virus and Zika virus datasets, Δ-PRO produces more compact and robust PDR sets and multiplex panels with reduced predicted dimerization compared to existing tools, demonstrating the practical gains of principled and scalable variant-aware primer design for high-throughput multiplex PCR assays. Availability: The proposed methods are implemented in a software package. The implementation and results are publicly available at https://github.com/yhhan19/variant-aware-primer-design.
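Gini impurity itself is a standard, easily computed quantity. The sketch below (our own illustration, not the Δ-PRO code) scores each alignment column so that low-impurity, conserved sites can anchor primer design regions:

```python
# Per-column Gini impurity over an alignment: 1 - sum_i p_i^2, where the p_i
# are base frequencies at that site. Low impurity = conserved site, a safer
# anchor for a primer. Illustrative sketch only.
from collections import Counter

def gini_impurity(column):
    counts = Counter(b for b in column if b in "ACGT")
    n = sum(counts.values())
    if n == 0:
        return 1.0  # no information: treat as maximally risky
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

alignment = ["ACGTA", "ACGTT", "ACGAA"]  # three aligned sequences
site_risk = [gini_impurity(col) for col in zip(*alignment)]
print(site_risk)  # 0.0 at fully conserved sites, higher where variants occur
```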
bioinformatics · 2026-02-06 · v1
Ecological context structures duplication and mobilization of antibiotic and metal resistance genes in bacteria
Tran, E.; Xu, P. N.; Assis, R.
AI Summary
- The study investigated how ecological contexts influence the duplication and mobilization of antibiotic resistance genes (ARGs) and metal resistance genes (MRGs) in bacteria across clinical, agricultural, and wastewater environments.
- Resistance gene profiles varied significantly by environment, with distinct duplication patterns observed.
- Duplication of resistance genes was often linked with mobile genetic elements, but the dynamics of ARGs and MRGs were not uniformly coupled, highlighting the role of ecological context in resistance gene evolution.
Abstract
Antibiotic resistance is a global challenge driven by the persistence and spread of resistance genes across ecological contexts. While mobile genetic elements (MGEs) facilitate horizontal gene transfer, gene duplication represents an additional mechanism through which resistance genes can be amplified, diversified, and maintained under selection. How these processes interact across environments remains poorly understood. Here, we examined genome-level patterns of resistance gene abundance, duplication, and mobilization across clinical, agricultural, and wastewater settings, focusing on both antibiotic resistance genes (ARGs) and metal resistance genes (MRGs). Resistance gene profiles were strongly structured by environment, with distinct duplication patterns emerging across sources. Duplicate genes were frequently associated with MGEs, although the strength of this relationship varied by resistance type and ecological context. Despite frequent co-occurrence of ARGs and MRGs, their duplication and mobilization dynamics were not uniformly coupled at the genome level. Together, these findings highlight gene duplication as a context-dependent contributor to resistance evolution and underscore the importance of ecological setting in shaping how resistance genes persist and spread across microbial communities.
bioinformatics · 2026-02-06 · v1
ChromBERT-tools: A versatile toolkit for context-specific embedding of transcription regulators across different cell types
Chen, Q.; Yu, Z.; Zhang, Y.
AI Summary
- ChromBERT-tools is a toolkit designed to generate context-specific embeddings of transcription regulators, addressing the need for embeddings that consider genome-wide context and cell-type specificity.
- It uses a pre-trained foundation model to produce both cell-type-agnostic and cell-type-specific embeddings, enhancing transcription regulation modeling.
- The toolkit offers command-line interfaces and Python APIs for embedding generation, adaptation to different cell types, and interpretation, facilitating biological inferences like regulator interactions and cell state transitions.
Abstract
Motivation: Representations that capture the genome-wide context of transcription regulators are critical for establishing a shared backbone for flexible transcription modeling and in silico regulatory analysis. Yet, current embeddings predominantly rely on limited modalities, such as gene co-expression or static protein features, offering an incomplete perspective that ignores context-dependent transcription regulator activities across the genome. The lack of transcription regulation-informed embeddings, paired with the absence of a user-friendly and lightweight toolkit for their generation, adaptation to different cell types, and interpretation, impedes the capture of the regulatory logic that underpins cellular states and functions. Results: To address this need, we present ChromBERT-tools, a lightweight toolkit designed to operationalize regulation-informed embeddings derived from a foundation model pre-trained on the comprehensive landscapes of human and mouse transcription regulators. ChromBERT-tools provides user-friendly command-line interfaces (CLIs) and Python APIs to achieve two primary goals: (i) generating cell-type-agnostic embeddings that capture the semantic representations of individual regulators and their combinatorial interactions, serving as biological priors of the transcription regulator modality to enhance transcription regulation modeling and rule interpretation; and (ii) generating cell-type-specific embeddings via fine-tuned model variants, which support in silico inference of regulatory roles of transcription regulators in cell types with scarce experimental data. The toolkit streamlines end-to-end workflows for embedding generation, adaptation to different cell types, and interpretation towards biological inferences, such as regulator-regulator interactions across the genome and key regulators determining cell identity or cell state transitions.
bioinformatics · 2026-02-06 · v1
Rapid gene exchange explains differences in bacterial pangenome structure
Horsfield, S. T.; Peng, A.; Russell, M. J.; von Wachsmann, J.; Toussaint, J.; D'Aeth, J. C.; Qin, C.; Pesonen, H.; Tonkin-Hill, G.; Corander, J.; Croucher, N. J.; Lees, J. A.
AI Summary
- The study developed Pansim and PopPUNK-mod to model pangenome dynamics, analyzing over 600,000 genomes from 400 bacterial species.
- Findings indicate that variation in the number of rapidly exchanged genes primarily drives differences in pangenome structure between species.
- Bacterial phylogeny, not ecology, was found to correlate with pangenome dynamics, suggesting the need for pan-species gene-level analyses.
Abstract
The size and diversity of bacterial gene repertoires, known as pangenomes, vary widely across species. The evolutionary forces driving the maintenance of pangenomes are an open topic of debate, with contradictory theories suggesting that pangenomes exist as a result of neutral evolution, with all genes gained and lost at random, or that all genes provide a fitness benefit to the host and are maintained by positive selection. Modelling of pangenome dynamics has provided insight into how gene exchange explains observed gene frequency distributions, and stands as the only means of jointly inferring the contributions of individual gene selection effects and mobility to the maintenance of pangenomes. However, previous modelling studies have not included both gene-level selection and mobility, and do not consider broadly sampled genome datasets for many species. To differentiate neutral and selective forces maintaining pangenomes, we developed a mechanistic model of gene-level evolution, Pansim, and a scalable model fitting framework, PopPUNK-mod. Together, these tools leverage rapid genome distance calculation to fit models of pangenome dynamics to datasets containing hundreds of thousands of genomes. We used this framework to compare the pangenome dynamics of over 400 different bacterial species, using over 600,000 genomes. We find that diversity in pangenome characteristics between species is driven predominantly by variation in the number of rapidly exchanged genes, while the rate of exchange of remaining genes is conserved. We find that bacterial phylogeny, rather than ecology, correlates with pangenome dynamics. We argue that pan-species gene-level analyses are now needed to understand selection across accessory genes. Our work highlights the importance of gene exchange rate differences in governing differences in pangenome characteristics between species.
bioinformatics · 2026-02-06 · v1
A Systematic Benchmark of Antibiotic Resistance Gene Detection Tools for Shotgun Metagenomic Datasets
Tiwari, S. K.; Ponsero, A. J.; Talas, J.; Grimes, K. P.; Haynes, S.; Telatin, A.
AI Summary
- This study benchmarked five ARG detection tools (ARGprofiler, KARGA, ARIBA, GROOT, SRST2) on simulated metagenomic datasets with varying sequencing coverages and microbial complexities.
- Sequencing coverage significantly affects ARG detection accuracy, with reliable detection at 10x coverage; ARGprofiler had the highest F1-score (0.891) at ≥10x.
- Increased community complexity reduced accuracy for all tools, with KARGA showing the highest mean F1-score (0.122 ± 0.067) under realistic uneven coverage, while computational efficiency varied, with ARGprofiler, SRST2, and GROOT being the most efficient.
Abstract
Accurate detection of antimicrobial resistance genes (ARGs) from metagenomic data is essential for understanding resistance dissemination within microbial communities, yet tool performance remains influenced by sequencing coverage, community complexity, and dataset variability. In this study, we systematically benchmarked five widely used read-based ARG detection tools (ARGprofiler, KARGA, ARIBA, GROOT, and SRST2) across simulated metagenomic datasets representing varying sequencing coverages, microbial complexities, and an approximately realistic metagenomic dataset. The results demonstrated that sequencing coverage is a major determinant of ARG detection accuracy, with reliable detection achieved at 10x coverage and performance stabilizing between 20x and 30x. ARGprofiler exhibited the highest overall F1-score (0.891) at ≥10x, whereas KARGA showed higher recall at low coverage levels but lower precision compared to ARGprofiler. Increasing community complexity led to a decline in accuracy across all tools, and under realistic uneven coverage, performance variability increased substantially, with KARGA achieving the highest mean F1-score (0.122 ± 0.067). Runtime evaluation further revealed substantial differences in computational efficiency, with ARGprofiler, SRST2, and GROOT being the most resource-efficient, while KARGA imposed the highest computational burden. Collectively, these findings highlight that both sequencing coverage and community complexity profoundly shape ARG detection outcomes, and that tool selection should balance accuracy with computational efficiency. The study also emphasizes the need for standardized benchmarking datasets that reflect true metagenomic complexity to ensure robust and comparable ARG surveillance across analytical pipelines.
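For reference, the F1-score used throughout this benchmark is the usual harmonic mean of precision and recall:

```latex
F_1 = \frac{2\,\mathrm{precision}\cdot\mathrm{recall}}{\mathrm{precision}+\mathrm{recall}},
\qquad
\mathrm{precision} = \frac{TP}{TP+FP},
\quad
\mathrm{recall} = \frac{TP}{TP+FN}.
```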
bioinformatics · 2026-02-06 · v1
Improved Ensemble Performance by Weight Optimisation for the Genomic Prediction of Maize Flowering Time Traits
Tomura, S.; Powell, O. M.; Wilkinson, M. J.; Lefevre, J.; Cooper, M.
AI Summary
- This study investigated the impact of weight optimization on ensemble models for predicting maize flowering time traits using TeoNAM and MaizeNAM datasets.
- Three weight optimization methods (linear transformation, Nelder-Mead, Bayesian) were compared, showing that optimized weights improved prediction performance over naive equal-weighted ensembles.
- No single optimization method was consistently superior, suggesting further research into integrating weight optimization with hyperparameter tuning could be beneficial.
Abstract
Ensembles of multiple genomic prediction models have demonstrated improved prediction performance over the individual models contributing to the ensemble. The outperformance of ensemble models is expected from the Diversity Prediction Theorem, which states that for ensembles constructed with diverse prediction models, the ensemble prediction error becomes lower than the mean prediction error of the individual models. While a naive ensemble-average model provides baseline performance improvement by aggregating all individual prediction models with equal weights, optimising weights for each individual model could further enhance ensemble prediction performance. The weights can be optimised based on their level of informativeness regarding prediction error and diversity. Here, we evaluated weighted ensemble-average models with three possible weight optimisation approaches (linear transformation, Nelder-Mead, and Bayesian) using flowering time traits from two maize nested association mapping (NAM) datasets: TeoNAM and MaizeNAM. The three proposed weighted ensemble-average approaches improved prediction performance in several of the prediction scenarios investigated. In particular, the weighted ensemble models enhanced prediction performance when the adjusted weights differed substantially from the equal weights used by the naive ensemble models. For performance comparisons within the weighted ensembles, there was no clear superiority among the proposed approaches in either prediction accuracy or error across the prediction scenarios. Weight optimisation in ensembles warrants further investigation to explore the opportunities to improve their prediction performance; for example, integration of a weighted ensemble with a simultaneous hyperparameter tuning process may offer a promising direction for further research.
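As an illustration of the weighted ensemble-average idea, here is a minimal toy sketch that fits non-negative, sum-to-one weights by Nelder-Mead on held-out squared error (our own setup; the paper's three optimisation schemes differ in detail):

```python
# Toy sketch of weight optimisation for an ensemble-average prediction model;
# illustrative only, not the authors' implementation.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
y = rng.normal(size=200)                        # validation phenotypes
preds = np.stack([y + rng.normal(scale=s, size=200)
                  for s in (0.5, 0.8, 1.2)])    # three imperfect models

def loss(raw):
    w = np.exp(raw) / np.exp(raw).sum()         # softmax -> valid weights
    return np.mean((w @ preds - y) ** 2)

res = minimize(loss, x0=np.zeros(len(preds)), method="Nelder-Mead")
w = np.exp(res.x) / np.exp(res.x).sum()
print("optimised weights:", w.round(3))         # vs. naive equal weights 1/3
```

The optimised weights should downweight the noisiest model relative to the naive equal-weight ensemble, which is exactly the situation in which the paper reports the largest gains.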
bioinformatics · 2026-02-06 · v1
Decomposing multi-scale dynamic regulation from single-cell multiomics with scMagnify
Chen, X.; Yan, X.; Shen, B.; Wang, H.; Tang, Z.; Zang, Y.; Lin, P.; Zhang, H.; Li, Y.; Li, H.
AI Summary
- scMagnify is a deep-learning framework that uses multiomic single-cell data to reconstruct and decompose multi-scale gene regulatory networks (GRNs) via nonlinear Granger causality.
- It employs tensor decomposition to identify combinatorial transcription factor modules and their activation profiles across different time-lags, providing insights into regulatory logic.
- Applied to human hematopoiesis, mouse pancreas development, and kidney injury, scMagnify revealed known regulators and new insights into cell fate decisions and pathological changes.
Abstract
Deciphering the highly coupled regulatory circuits that drive cellular dynamics remains a fundamental goal in biology. However, capturing the multi-scale time-lagged dynamics and combinatorial regulatory logic of gene regulation remains computationally challenging. Here we present scMagnify, a deep-learning-based framework that leverages multiomic single-cell assays of chromatin accessibility and gene expression via nonlinear Granger causality to reconstruct and decompose multi-scale gene regulatory networks (GRNs). Benchmarking on both simulated and real datasets demonstrates that scMagnify achieves superior performance. scMagnify employs tensor decomposition to systematically identify combinatorial TF modules and their activation profiles across different time-lags. It enables a hierarchical dissection of the regulatory landscape, from the activity of individual regulators to the combinatorial logic of regulatory modules and intercellular communications. We applied scMagnify to human hematopoiesis and mouse pancreas development, where it successfully recovered known lineage-driving regulators and provided novel insights into the combinatorial logic that governs cell fate decisions. Furthermore, in the context of kidney injury, scMagnify's intracellular communication module mapped key signaling-to-transcription cascades linking microenvironment cues to pathological epithelial cell changes. In summary, scMagnify provides a powerful and versatile computational framework for dissecting the multi-scale regulatory logic that governs complex biological processes in development and disease.
bioinformatics · 2026-02-06 · v1
Identification and Characterization of Metastasis-initiating cells
Wu, S.; Wei, J.; Liu, X.; Zhang, J.; Wen, J.; Huang, L.; Zhou, X.
AI Summary
- The study introduces scMIC, a computational framework for identifying metastasis-initiating cells (MICs) from single-cell data, addressing limitations of current methods.
- scMIC uses embedding-based representation, unbalanced optimal transport, and top-k selection to reliably identify MICs across various cancer types and datasets.
- Key findings include the framework's validation, its clinical utility in metastasis prognosis, and its role in discovering metastasis-related gene programs and biomarkers.
Abstract
Metastasis, the primary cause of cancer-related mortality, is a dynamic and complex process driven by a subset of cells known as metastasis-initiating cells (MICs). Accurate identification of MICs is therefore critical for metastasis diagnosis and therapeutic decision-making. However, current approaches rely either on mouse tracing experiments, which are difficult to translate to human systems, or on indirect strategies such as stemness, trajectory, pathway, and biomarker analyses that often yield inconsistent results. To address these limitations, we propose scMIC, a computational framework designed to explicitly and reliably identify MICs from single-cell data (available at https://github.com/swu13/scMIC). scMIC integrates an embedding-based representation, unbalanced optimal transport, and a top-k selection strategy to robustly capture metastasis-initiating potential. The framework was validated and applied across multiple cancer types, species, and multi-omics datasets. Our results demonstrate the reliability of scMIC for MIC identification, its potential clinical utility in metastasis prognosis, and its effectiveness in discovering metastasis-related gene programs and molecular biomarkers. Elucidating the mechanisms of metastasis initiation not only advances our understanding of metastatic progression but also enables the development of therapeutic strategies that target the more aggressive MIC population rather than non-MICs, thereby avoiding unintended increases in metastatic risk. Collectively, scMIC provides a powerful tool for cancer metastasis research and drug discovery.
bioinformatics · 2026-02-06 · v1
Prediction of protein-carbohydrate binding sites from protein primary sequence
Nafi, M. M. I.; Nawar, Q. F.; Islam, T. N.; Rahman, M. S.
AI Summary
- This study developed StackCBEmbed, an ensemble machine learning model to predict protein-carbohydrate binding sites from protein primary sequences.
- StackCBEmbed integrates traditional sequence-based features with features from a pre-trained transformer-based protein language model.
- It achieved sensitivity and balanced accuracy scores of 0.730, 0.776 and 0.666, 0.742 on two test sets, outperforming previous models.
Abstract
Background: A protein is a large, complex macromolecule that plays a crucial role in performing most of the work in cells and tissues. It is made up of one or more long chains of amino acid residues. Another important class of biomolecules, alongside DNA and proteins, is carbohydrates. Carbohydrates interact with proteins to run various biological processes. Several biochemical experiments exist to characterize protein-carbohydrate interactions, but they are expensive, time-consuming, and challenging. Therefore, developing computational techniques for effectively predicting protein-carbohydrate binding interactions from protein primary sequence has given rise to a prominent new field of research. Result: In this study, we propose StackCBEmbed, an ensemble machine learning model to effectively classify protein-carbohydrate binding interactions at the residue level. StackCBEmbed combines traditional sequence-based features with features derived from a pre-trained transformer-based protein language model. To the best of our knowledge, ours is the first attempt to apply a protein language model to the prediction of protein-carbohydrate binding interactions. StackCBEmbed achieved sensitivity and balanced accuracy scores of 0.730 and 0.776, and 0.666 and 0.742, on two separate independent test sets. This performance is superior to earlier prediction models benchmarked on the same datasets. Conclusion: We thus hope that StackCBEmbed will discover novel protein-carbohydrate interactions and help advance the related fields of research. StackCBEmbed is freely available as Python scripts at https://github.com/nafiislam/StackCBEmbed.
bioinformatics · 2026-02-05 · v2
BESTish: A Diffusion-Approximation Framework for Inferring Selection and Mutation in Clonal Hematopoiesis
Wang, R.-Y.; Dinh, K. N.; Taketomi, K.; Pang, G.; King, K. Y.; Kimmel, M.
AI Summary
- The study introduces BESTish, a Bayesian inference method for analyzing clonal hematopoiesis (CH) by modeling the dynamics of wild-type and mutant hematopoietic stem cells with an environmental parameter affecting death rates.
- BESTish uses derived mean-field dynamics and Gaussian-Markov approximations to estimate mutation fitness, mutation rate, and environmental strength from VAF datasets.
- Applied to CH datasets, BESTish consistently estimates mutation fitness and rates, revealing patient-specific mutation behavior and identifying variants in non-homeostatic environments.
Abstract
Clonal hematopoiesis (CH) arises when hematopoietic stem cells (HSCs) gain a fitness advantage from somatic mutations and expand, resulting in an increase in variant allele frequency (VAF) over time. To analyze CH trajectories, we develop a state-dependent stochastic model of wild-type and mutant HSCs, in which an environmental parameter taking values in [0,1] regulates death rates and interpolates between homeostatic (Moran-like; parameter equal to 1) and growth-facilitating (parameter less than 1) regimes. Using a functional law of large numbers and central limit theorems, we derive explicit mean-field dynamics and a Gaussian-Markov approximation for VAF fluctuations. We show that the mean VAF trajectory has an explicit logistic form determined by the selective advantage, while environmental effects affect only the variance and autocovariance structure. Building on these results, we introduce BESTish (Bayesian estimate for selection incorporating scaling-limit to detect mutant heterogeneity), a novel, efficient, and accurate Bayesian inference method that can be applied to analyze both cohort-level and longitudinal VAF datasets. BESTish implements the closed-form finite-dimensional distributions that we derive to estimate mutation fitness, mutation rate, and environmental strength for individual CH drivers. When applied to existing CH datasets, BESTish produces consistent mutation fitness inferences across different studies, and estimates CH driver mutation rates in agreement with independent experimental studies. Furthermore, BESTish reveals patient-specific heterogeneity in the selective behavior of recurrent mutations, and identifies variants whose dynamics are compatible with non-homeostatic, growth-facilitating environments. BESTish provides a unified and mechanistic framework for quantifying CH evolution, with potential applications for other biological systems where clonal expansions can be measured.
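For intuition, the "explicit logistic form" of the mean VAF trajectory can be written, in the textbook parameterization of logistic growth in frequency space (the paper's exact constants may differ):

```latex
\frac{dv}{dt} = s\,v(1-v)
\;\;\Longrightarrow\;\;
v(t) = \frac{v_0 e^{st}}{1 - v_0 + v_0 e^{st}},
```

where s is the selective advantage of the mutant clone and v_0 the initial VAF. Consistent with the abstract, the environmental parameter would then enter only through the fluctuations around this mean trajectory, not the mean itself.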
bioinformatics · 2026-02-05 · v2
PlotGDP: an AI Agent for Bioinformatics Plotting
Luo, X.; Shi, Y.; Huang, H.; Wang, H.; Cao, W.; Zuo, Z.; Zhao, Q.; Zheng, Y.; Xie, Y.; Jiang, S.; Ren, J.
AI Summary
- PlotGDP is an AI agent-based web server designed for creating high-quality bioinformatics plots using natural language commands.
- It leverages large language models (LLMs) to process user-uploaded data on a remote server, eliminating the need for coding or environment setup.
- The platform uses curated template scripts to ensure plot accuracy, aiming to enhance bioinformatics visualization for global research.
Abstract
High-quality bioinformatics plotting is important for biology research, especially when preparing for publications. However, the long learning curve and complex coding environment configuration are often unavoidable costs on the way to publication-ready plots. Here, we present PlotGDP (https://plotgdp.biogdp.com/), an AI agent-based web server for bioinformatics plotting. Built on large language models (LLMs), the intelligent plotting agent is designed to accommodate various types of bioinformatics plots, while offering easy usage with simple natural language commands from users. No coding experience or environment deployment is required, since all the user-uploaded data is processed by LLM-generated code on our remote high-performance server. Additionally, all plotting sessions are based on curated template scripts to minimize the risk of hallucinations from the LLM. Aided by PlotGDP, we hope to contribute to the global biology research community by constructing an online platform for fast and high-quality bioinformatics visualization.
bioinformatics · 2026-02-05 · v2
CAMUS: Scalable Phylogenetic Network Estimation
Willson, J.; Warnow, T.
AI Summary
- CAMUS is a new method for estimating phylogenetic networks, designed to handle larger datasets by maximizing quartet trees within a constrained tree framework.
- Simulation studies under the Network Multi-Species Coalescent showed CAMUS to be highly accurate, fast, and scalable, outperforming PhyloNet-MPL and SNaQ in speed and dataset size capacity.
- While slightly less accurate than PhyloNet-MPL without a fixed tree, CAMUS can process datasets with up to 201 species, compared to PhyloNet-MPL's limit of 51 species.
Abstract
Motivation: Phylogenetic networks are models of evolution that go beyond trees, and so represent reticulate events such as horizontal gene transfer or hybridization, which are frequently found in many taxa. Yet, the estimation of phylogenetic networks is extremely computationally challenging, and nearly all methods are limited to very small datasets with perhaps 10 to 15 species (some limited to even smaller numbers). Results: We introduce CAMUS (Constrained Algorithm Maximizing qUartetS), a scalable method for phylogenetic network estimation. CAMUS takes an input constraint tree T as well as a set Q of unrooted quartet trees that it derives from the input, and returns a level-1 phylogenetic network N that is built upon T through the addition of edges, in order to maximize the number of quartet trees in Q that are induced in N. We perform a simulation study under the Network Multi-Species Coalescent and show that a simple pipeline using CAMUS provides high accuracy and outstanding speed and scalability, in comparison to two leading methods, PhyloNet-MPL used with a fixed tree and SNaQ. CAMUS is slightly less accurate than PhyloNet-MPL used without a fixed tree, but is much faster (minutes instead of hours) and can complete on inputs with 201 species, while PhyloNet-MPL fails to complete on inputs with more than 51 species. Availability and Implementation: The source code is available at https://github.com/jsdoublel/camus.
bioinformatics · 2026-02-05 · v2
Constrained Evolutionary Funnels Shape Viral Immune Escape
Huot, M.; Wang, D.; Shakhnovich, E.; Monasson, R.; Cocco, S.
AI Summary
- This study presents a probabilistic framework to predict how viral proteins adapt under immune pressure, focusing on SARS-CoV-2's receptor binding domain.
- The framework identifies 'escape funnels' where viable mutations for immune evasion occur, constrained by protein viability and antibody escape.
- It predicts mutation sites, explains antibody-cocktail effectiveness, and shows that cocktails with de-correlated escape profiles force longer, costlier viral adaptation paths.
Abstract
Understanding how viral proteins adapt under immune pressure while preserving viability is crucial for anticipating antibody-resistant variants. We present a probabilistic framework that predicts viral escape trajectories and shows that immune evasion is channeled into a small set of viable "escape funnels" within the vast mutational space. These escape funnels arise from the combined constraints of protein viability and antibody escape, modeled using a generative model trained on homologs and deep mutational scanning data. We derive a mean-field approximation of evolutionary path ensembles, enabling us to quantify both the fitness and entropy of escape routes. Applied to the SARS-CoV-2 receptor binding domain, our framework reveals convergent evolution patterns, predicts mutation sites in variants of concern, and explains differences in antibody-cocktail effectiveness. In particular, cocktails with de-correlated escape profiles slow viral adaptation by forcing longer, higher-cost escape paths.
bioinformatics · 2026-02-05 · v2
Integrating SHAPE Probing with Direct RNA Nanopore Sequencing Reveals Dynamic RNA Structural Landscapes
White Bear, J.; De Bisschop, G.; Lecuyer, E.; Waldispühl, J.
AI Summary
- The study introduces Dashing Turtle (DT), an algorithm that integrates SHAPE probing with direct RNA nanopore sequencing to analyze RNA structural dynamics.
- DT uses probabilistic, weighted, stacked ensemble learning to detect structural modifications with high resolution, achieving 10-20% higher accuracy and 10-30% better structural feature identification than existing methods.
- Applied to well-characterized RNAs, DT identifies dominant conformations, conserved regions, and correlates well with known structures, enhancing understanding of RNA folding and interaction dynamics.
Abstract
Traditional SHAPE experiments rely on averaged reactivities, which may limit information on folding patterns, alternate structures, and RNA dynamics. Short-read sequencing often suffers from false stopping, stalls, and biases during reverse transcription. The introduction of direct, long-read nanopore technology offers an opportunity to expand RNA structure probing methods to better understand RNA structural diversity. While many comparative approaches have been developed for detection of endogenous modifications, fewer have explored the expansion of SHAPE-based methods. We introduce Dashing Turtle (DT), an algorithm using probabilistic, weighted, stacked ensemble learning to perform high-resolution detection of structural modifications that can capture detailed information about RNA architecture across dynamic structural landscapes. We apply our method to several well-characterized RNA samples, identify dominant conformations, and structurally conserved regions. We show that our landscapes correlate well with expected structures and recapitulate important functional elements. DT achieves accuracy 10-20% higher than comparable methods on many sequences. It accurately identifies structural features at a rate of 80-100%, approximately 10-30% better than its peers. DT's predictions are robust across replicates and sub-sampled datasets and can help detect changes in conformational states, inform RNA folding mechanisms, and indicate interaction efficiency. Overall, it expands the capabilities of direct RNA sequencing and structural probing.
bioinformatics · 2026-02-05 · v2
ABFormer: A Transformer-based Model to Enhance Antibody-Drug Conjugates Activity Prediction through Contextualized Antibody-Antigen Embedding
Katabathuni, R.; Loka, V.; Gogte, S.; Kondaparthi, V.
AI Summary
- ABFormer, a transformer-based model, was developed to predict Antibody-Drug Conjugate (ADC) activity by integrating contextualized antibody-antigen embeddings with chemically enriched linker and payload representations.
- It outperforms the current state-of-the-art model, ADCNet, by achieving 100% accuracy on a test set of 22 novel ADCs.
- The model's effectiveness is primarily due to its interaction-aware antibody-antigen representations, with additional specificity from small molecule encoders.
Abstract
Computational screening is increasingly becoming a crucial aspect of Antibody Drug Conjugate (ADC) research, allowing the elimination of dead ends at earlier stages and concentrating on potential candidates, which can significantly reduce the cost of development. The current state-of-the-art deep learning model, ADCNet, treats antibodies, antigens, linkers, and payloads as distinct features. However, this overlooks the complex context of antibody-antigen binding, which is primarily responsible for the targeting and uptake of ADCs. To address this limitation, we present ABFormer, a transformer-based framework tailored for ADC activity prediction and in-silico triage. ABFormer integrates high-resolution antibody-antigen interface information through a pretrained interaction encoder and combines it with chemically enriched linker and payload representations obtained from a fine-tuned molecular encoder. This multi-modal design replaces naive feature concatenation with biologically informed contextual embeddings that more accurately reflect molecular recognition. ABFormer outperforms baseline models in leave-pair-out evaluation and achieves 100% accuracy on a separate test set of 22 novel ADCs, while the baselines are severely miscalibrated. An ablation study confirms that the predictive capability is predominantly driven by interaction-aware antibody-antigen representations, while small molecule encoders enhance specificity by reducing false positives. In conclusion, ABFormer provides a reliable and efficient platform for early filtering of ADC activity and selection of candidates.
bioinformatics · 2026-02-05 · v1
A Shape Analysis Algorithm Quantifies Spatial Morphology and Context of 2D to 3D Cell Culture for Correlating Novel Phenotypes with Treatment Resistance
Nguyen, D. H.
AI Summary
- The study addresses the limitation of traditional metrics in capturing spatial context like chirality in cell morphology, which is crucial for understanding treatment resistance.
- A novel algorithm, the Linearized Compressed Polar Coordinates (LCPC) Transform, was developed to quantify spatial morphology by converting 2D cell contours into discrete sinusoid waves, followed by Fast Fourier Transform analysis.
- This approach allows for multidimensional representation of cell shapes in 2D and 3D cultures, potentially revealing insights into how morphological phenotypes correlate with resistance to anti-cancer treatments.
Abstract
Numerous studies have shown that the morphological phenotype of a cell or organoid correlates with its susceptibility to anti-cancer agents. However, traditional methods of measuring phenotype rely on spatial metrics such as area, volume, perimeter, and signal intensity, which are useful but limited. These approaches cannot measure many crucial features of spatial context, such as chirality, which is the property of having left- and right-handedness. Volume cannot register chirality because a left shoe and a right shoe harbor the same amount of volume. Though spatial context in the form of chirality, direction of gravity, and the axis of polarity are intuitive notions to humans, the traditional metrics relied on by cell biologists, pathologists, radiologists, and machine learning scientists up to this point cannot register these fundamental notions. The Linearized Compressed Polar Coordinates (LCPC) Transform is a novel algorithm that can capture spatial context unlike any other metric. The LCPC Transform translates a two-dimensional (2D) contour into a discrete sinusoid wave by overlaying a grid system that tracks points of intersection between the contour and the grid lines. It turns the contour into a series of sequential pairs of discrete coordinates, with the independent coordinate (x-coordinate) being consecutive positions in 2D space. Each dependent coordinate (y-coordinate) is the distance from an intersection of the contour with a gridline to the origin of the grid system. With the contour in the form of a discrete sinusoid wave, the Fast Fourier Transform is then applied to the data. In this way, the shapes of cells in 2D and 3D cell culture are represented systematically and multidimensionally, allowing for robust quantitative stratification that will reveal insights into treatment resistance.
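As a simplified illustration of the contour-to-signal idea, the sketch below uses a radial signature from the centroid rather than the paper's grid-intersection scheme, then applies the FFT (our own toy construction, assuming numpy):

```python
# Simplified illustration of turning a 2D contour into a 1D signal and
# analysing it with the FFT; this uses a radial signature rather than the
# LCPC grid-intersection scheme described in the abstract.
import numpy as np

def shape_spectrum(contour, n_samples=256):
    """contour: (N, 2) array of (x, y) points tracing a closed cell outline."""
    centroid = contour.mean(axis=0)
    # Distance from centroid as a function of position along the contour.
    idx = np.linspace(0, len(contour) - 1, n_samples).astype(int)
    signal = np.linalg.norm(contour[idx] - centroid, axis=1)
    # FFT magnitudes form a compact, multidimensional shape descriptor.
    return np.abs(np.fft.rfft(signal - signal.mean()))

theta = np.linspace(0, 2 * np.pi, 400, endpoint=False)
blob = np.c_[np.cos(theta), np.sin(theta)] * (1 + 0.2 * np.sin(3 * theta)[:, None])
print(shape_spectrum(blob)[:6])  # energy concentrates at the 3-lobe harmonic
```

Note that magnitude spectra are reflection-invariant, so this simplification still cannot register chirality; recovering handedness requires retaining phase and ordering information, which is what the full LCPC construction is claimed to preserve.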
bioinformatics · 2026-02-05 · v1
The genetic repertoire of the deep sea: from sequence to structure and function
Guo, Y.; Wang, Z.; Li, D.; Wang, L.; Lan, H.; Guo, F.; Zhao, Z.; Liu, Z.; Meng, L.; Shen, X.; Wang, M.; Zhao, W.; Zhang, W.; Kong, C.; Shi, L.; Sun, Y.; Seim, I.; Jiang, A.; Ma, K.; Su, Z.; Zhang, N.; Ji, Q.; Chen, J.; Chen, K.; Qi, C.; Li, B.; He, B.; Liu, Y.; Zhou, J.; Zheng, Y.; Zhang, H.; Wang, Y.; Han, M.; Yang, T.; Tong, J.; Zhang, Y.; Wang, Z.; Xu, X.; Chen, J.; Liu, Y.; Chen, H.; Zeng, T.; Wei, X.; Li, C.; Yang, H.; Wang, B.; Liu, X.; Shao, C.; Zhang, W.; Gu, Y.; Xiao, X.; Xu, X.; Wang, J.; Mock, T.; Fan, G.; Li, Y.; Liu, S.; Dong, Y.
AI Summary
- This study presents a comprehensive genetic dataset from the deep sea, including 502 million genes and 2.4 million predicted structures, to explore the link between genetic variants and protein structures adapted to deep-sea conditions.
- Analysis showed high sequence diversity but conserved protein structures, with proteins involved in replication, recombination, and repair evolving rapidly.
- A structurally divergent helicase was identified, showing potential in controlling nanopore sequencing speed, highlighting the deep sea's role in biotechnology.
Abstract
The deep sea, as the largest and perhaps most hostile environment on Earth, is still underexplored, especially regarding its genetic repertoire. Yet, previous work has revealed significant habitat-specific deep-sea biodiversity. Here, we present an integrated deep-sea genetic dataset comprising 502 million nonredundant genes from 2,138 samples and 2.4 million predicted structures, and use it to link specific protein structures with genetic variants associated with life in the deep sea and to assess their biotechnology potential. Combining global sequence analysis with biophysical and biochemical measurements revealed unprecedented sequence diversity, yet substantial structural conservation of proteins. Proteins involved in replication, recombination, and repair in particular were identified as being under rapid evolution and as having specialized properties. Among these, a structurally divergent helicase exhibited advantages in controlling nanopore sequencing speed. Thus, our work positions the deep sea as a unique evolutionary engine that generates and hosts genetic diversity and bridges genetic knowledge with biotechnology.
bioinformatics · 2026-02-05 · v1
MosaicLev: Modified Levenshtein distance for mobile element-aware genome comparison
Stoltz, H. K.; Kuhlman, T. E.
AI Summary
- The study introduces MosaicLev, a modified Levenshtein distance (mlev) that accounts for mobile element insertions in genome comparison by using a tunable parameter to discount these events.
- Validation on 67 mycobacteriophage genomes showed that mlev can distinguish between MPME1 and MPME2 elements, with significant discounts for the respective element carriers.
- The approach classified 35 MPME1 and 11 MPME2 carriers, and identified 14 phages lacking both elements.
Abstract
Motivation: Somatic genetic mosaicism arises when genomes diverge across cells during development, in part due to the activity of transposons (cut-paste) and retrotransposons (copy-paste). Standard sequence comparison methods are not motif-aware, penalizing mobile element insertions based on length rather than recognizing them as single biological events. Results: We introduce a modified Levenshtein distance (mlev) that discounts mobile element insertions via a tunable parameter (m in [0, 1]). Validation on 67 Cluster G1 mycobacteriophage genomes demonstrates bidirectional discrimination between Mycobacteriophage Mobile Element 1 (MPME1) and MPME2 elements: using MPME1 as target yields 49% discount for MPME1 carriers but only 9% for MPME2 carriers, while using MPME2 reverses this pattern. This approach classifies 35 MPME1 carriers and 11 MPME2 carriers, and identifies 14 phages showing low discount with both motifs, consistent with absence of both elements. Availability and Implementation: Python implementation with Numba JIT compilation freely available at https://doi.org/10.5281/zenodo.18452982.
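The core recurrence is a standard Levenshtein dynamic program augmented with one extra transition: a block insertion of the whole motif at a discounted cost. A minimal sketch of that idea follows (our own simplification; see the archived MosaicLev code for the actual Numba implementation):

```python
# Minimal sketch of a motif-aware edit distance: standard Levenshtein DP with
# one extra transition that inserts the entire mobile-element motif at a
# discounted cost m * len(motif), with m in [0, 1]. Illustrative only.
def mlev(a, b, motif, m=0.5):
    L = len(motif)
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # (mis)match
            # Discounted block insertion when the motif ends at position j.
            if j >= L and b[j - L:j] == motif:
                dp[i][j] = min(dp[i][j], dp[i][j - L] + m * L)
    return dp[len(a)][len(b)]

print(mlev("ACGT", "ACGGGGGGT", "GGGGG", m=0.2))  # motif insertion costs 1.0
```

Setting m = 1 recovers the ordinary Levenshtein distance, while m = 0 treats a motif insertion as a free single event, which matches the tunable-discount behaviour the abstract describes.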
bioinformatics · 2026-02-05 · v1
PRISM: Niche-informed Deciphering of Incomplete Spatial Multi-Omics Data
Mu, S.; Wang, Z.; Liao, Y.; Liang, J.; Zhang, D.; Wang, C.; Xie, J.; Sheng, X.; Zhang, T.; Huang, W.; Song, J.; Yuan, Z.; Cai, H.
AI Summary
- PRISM is a computational method designed to address incomplete spatial multi-omics data by using a niche-informed graph to propagate information from paired to unpaired regions.
- It was benchmarked on five datasets, showing superior performance in spatial multi-omics analysis.
- Application to Parkinson's disease data demonstrated PRISM's ability to accurately recover dopamine-related spatial domains and metabolite distributions obscured by data gaps.
Abstract
Spatial multi-omics data, capturing diverse molecular layers, have become indispensable for the in situ analysis of tissue architecture and complex biological processes. Nevertheless, current spatial multi-omics experiments are often hindered by incompatible sequencing protocols, resulting in incomplete spatial multi-omics pairing due to inconsistent fields of view or spatial resolution. To address this, we present PRISM, a novel computational method tailored for this scenario. PRISM leverages a niche-informed graph to propagate information from paired to unpaired regions, jointly achieving spatial domain identification and spatial omics imputation. Extensive benchmarking on five diverse simulated and real experimental datasets demonstrated that PRISM outperformed existing methods in spatial multi-omics analysis tasks. Application to human Parkinson's disease data revealed that PRISM accurately recovered dopamine-associated spatial domains and metabolite distributions masked by incomplete data gaps. PRISM offers a robust solution for bridging the integration gap inherent in incompatible sequencing protocols, thereby facilitating more accurate downstream biological interpretation.
bioinformatics · 2026-02-05 · v1
A deep-learning-based score to evaluate multiple sequence alignments
Serok, N.; Polonsky, K.; Ashkenazy, H.; Mayrose, I.; Thorne, J. L.; Pupko, T.
AI Summary
- The study evaluates the performance of the sum-of-pairs (SP) score in multiple sequence alignment (MSA) and finds it often does not correspond to the most accurate alignment.
- A deep-learning-based scoring function, Model 1, was developed to predict alignment accuracy, showing a stronger correlation with true accuracy than the SP score.
- Model 2 was introduced to rank alternative MSAs, outperforming the SP score, Model 1, and other programs, leading to more accurate phylogenetic reconstructions.
Abstract
Multiple sequence alignment (MSA) inference is a central task in molecular evolution and comparative genomics, and the reliability of downstream analyses, including phylogenetic inference, depends critically on alignment quality. Despite this importance, most widely used MSA methods optimize the sum-of-pairs (SP) score, and relatively little attention has been paid to whether this objective function accurately reflects alignment accuracy. Here, we evaluate the performance of the SP score using simulated and empirical benchmark alignments. For each dataset, we compare alternative MSAs derived from the same unaligned sequences and quantify the relationship between their SP scores and their distances from a reference alignment. We show that the alignment with the optimal SP score often does not correspond to the most accurate alignment. To address this limitation, we develop deep-learning-based scoring functions that integrate a collection of MSA features. We first introduce Model 1, a regression model that predicts the distance of a given MSA from the reference alignment. Across simulated and empirical datasets, this learned score correlates more strongly with true alignment accuracy than the SP score. However, Model 1 is less effective at identifying the best alignment among alternatives. We therefore develop Model 2, which takes as input a set of alternative MSAs generated from the same sequences and predicts their relative ranking. Model 2 more accurately identifies the top-ranking MSA than the SP score, Model 1, and several widely used alignment programs. Using simulations, we show that selecting MSAs based on our approach leads to more accurate phylogenetic reconstructions.
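The SP score being evaluated here is simple to state: sum a pairwise substitution score over all pairs of residues in every column. A minimal sketch with a toy scoring scheme (real pipelines use substitution matrices such as BLOSUM and affine gap penalties):

```python
# Minimal sum-of-pairs (SP) scorer for an MSA, with a toy scheme:
# +1 match, -1 mismatch, -2 against a gap, 0 for gap-gap pairs.
from itertools import combinations

def sp_score(msa):
    score = 0
    for column in zip(*msa):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue            # gap-gap pairs are ignored
            elif "-" in (x, y):
                score -= 2          # residue aligned against a gap
            else:
                score += 1 if x == y else -1
    return score

msa = ["AC-GT",
       "ACAGT",
       "AC-GA"]
print(sp_score(msa))  # higher under SP, but not necessarily more accurate
```

This makes the paper's central observation concrete: the SP objective rewards column-wise pairwise similarity, which need not coincide with closeness to the true (reference) alignment.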
bioinformatics · 2026-02-05 · v1
LongPolyASE: An end-to-end framework for allele-specific gene and isoform analysis in polyploids using long-read RNA-seq
Nolte, N. F.; Gruden, K.; Petek, M.
AI Summary
- The study addresses the lack of tools for allele-specific gene and isoform analysis in polyploids by developing LongPolyASE, an end-to-end framework using long-read RNA-seq.
- LongPolyASE comprises three components: Syntelogfinder for identifying syntenic genes, longrnaseq for transcript quantification and isoform discovery, and PolyASE for analyzing differential expression and isoform usage.
- The framework was demonstrated on diploid rice and autotetraploid potato, showing its applicability in polyploid analysis.
Abstract
Motivation: Long-read RNA-seq and phased reference genomes enable haplotype-resolved gene and isoform expression analysis. While methods and tools exist for diploid organisms, analysis tools for polyploids are lacking. Results: We developed an end-to-end framework for allele-specific gene and isoform analysis in polyploids with three components: Syntelogfinder (https://github.com/NIB-SI/syntelogfinder) identifies syntenic genes in phased assemblies; longrnaseq (https://github.com/NIB-SI/longrnaseq) quantifies transcripts, discovers novel isoforms, and performs quality control of long-read RNA-seq; and PolyASE (https://pypi.org/project/polyase/) analyzes differential allelic expression, differential isoform usage between conditions, and structural differences in major isoforms between haplotypes. We demonstrate the use of the framework on diploid rice and autotetraploid potato. Availability and Implementation: Syntelogfinder and longrnaseq are implemented in Nextflow and available on GitHub. PolyASE is a Python package available on PyPI. The framework is fully documented (https://polyase.readthedocs.io/en/latest/index.html) and tutorials are provided. Contact: Nadja.nolte.franziska@nib.si. Supplementary information: Supplementary data are available online and on Zenodo (https://zenodo.org/records/17590760?token=eyJhbGciOiJIUzUxMiJ9.eyJpZCI6IjIzODg5ODA1LTdiMDQtNDU5Mi1hYTE1LTQyYzY5YzA2ZjhjYyIsImRhdGEiOnt9LCJyYW5kb20iOiJkYTQ1YjZmZjg1Mzg2MmE2M2I4M2Q4NTBlNzlhM2UwYSJ9.2AMQOPvz5yhPWJ-dIuf05NOiYQcyFkKPpRDYIqcUdYyGsMlHT8Y82wzMudpNJ6vxleNlOOSpukfcl-_astWCBw).
bioinformatics2026-02-05v1BioVix: An Integrated Large Language Model Framework for Data Visualization, Graph Interpretation, and Literature-Aware Scientific Validation
Butt, M. Z.; Ahmad, R. S.; Fatima, E.; Tahir ul Qamar, M.AI Summary
- BioVix is a web-based framework that integrates data visualization, natural-language querying, and literature retrieval using LLMs to support scientific analysis.
- It uses a multi-model architecture to handle dataset uploads, generate visualizations, interpret data, and contextualize findings with literature.
- Evaluations across various biological domains showed BioVix's effectiveness in managing diverse datasets, though it supports rather than replaces expert analysis.
Abstract
The application of Large Language Models (LLMs) for generating data visualizations through natural language interaction represents a promising advance in AI-assisted scientific analysis. However, existing LLM-based tools largely emphasize graph generation, while research workflows require not only visualization but also rigorous interpretation and validation against established scholarly evidence. Despite advances in visualization technologies, no single tool currently integrates literature references with visualization while also generating insights from graphical data. To address this gap, we present BioVix, a web-based LLM-driven framework that integrates interactive data visualization, natural-language querying, and automated retrieval of relevant academic literature. BioVix enables users to upload datasets, generate complex visualizations, interpret graphical patterns, and contextualize findings through literature references within a unified workflow. The system employs a multi-model architecture combining DeepSeek V3.1 for code and logic generation, Qwen2.5-VL-32B-Instruct for multimodal interpretation, and GPT-OSS-20B for conversational reasoning, coordinated through structured prompt engineering. BioVix was evaluated across diverse biological domains, including proteomic expression profiling, epigenomic peak annotation, and clinical diabetes data, demonstrating its flexibility in handling heterogeneous datasets and supporting exploratory, literature-aware analysis. While BioVix substantially streamlines exploratory research workflows, its LLM-generated outputs are intended to support, not replace, expert judgment, and users should independently verify results before scientific reporting. BioVix is openly available via public deployment on Hugging Face (https://huggingface.co/spaces/MuhammadZain10/BioVix), with source code provided through GitHub (https://github.com/MuhammadZain-Butt/BioVix).
bioinformatics2026-02-05v1Hierarchical Representation Learning for Drug Mechanism-of-Action Prediction from Gene Expression Data
Katsaouni, N.; Schulz, M. H.AI Summary
- The study introduces a hierarchical representation learning framework using dual ArcFace objectives to predict drug mechanisms of action (MoAs) from gene expression data, enhancing interpretability and capturing both MoA-level and compound-level structures.
- The framework, trained on LINCS L1000 data, outperformed existing methods in F1 performance and generalized to new compounds, cell types, and even CRISPR knockdowns without retraining.
- Gene importance and pathway enrichment analyses validated that the learned representations align with known signaling pathways.
Abstract
Deciphering drug mechanisms of action (MoAs) from transcriptional responses is key for discovery and repurposing. While recent machine learning approaches improve prediction accuracy beyond traditional similarity metrics, they often lack biological structure and interpretability in the learned space. We introduce a hierarchical representation learning framework that explicitly enforces mechanistically coherent organization using dual ArcFace objectives, yielding an interpretable latent space that captures both MoA-level separation and compound-level substructure. Gene importance and pathway enrichment analyses confirm that the learned representations recover established signaling programs. Trained on LINCS L1000 data, the model also improves F1 performance over state-of-the-art baselines and generalizes to unseen compounds and cell types. Additionally, the latent space generalizes to CRISPR knockdowns without the need for retraining, indicating it captures pathway-level perturbations independently of modality.
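ArcFace is a published additive-angular-margin loss; the dual-objective hierarchy is this paper's own contribution, but a single ArcFace head looks roughly like the following PyTorch sketch (embedding dimension and class count are placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceHead(nn.Module):
    """Additive angular margin softmax (ArcFace): the target-class logit is
    s*cos(theta + m) instead of s*cos(theta), pulling classes apart on the
    hypersphere."""
    def __init__(self, dim, n_classes, s=30.0, m=0.50):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_classes, dim))
        self.s, self.m = s, m

    def forward(self, x, labels):
        # Cosine between L2-normalized embeddings and class centers.
        cos = F.linear(F.normalize(x), F.normalize(self.weight))
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        target = F.one_hot(labels, cos.size(1)).bool()
        logits = self.s * torch.where(target, torch.cos(theta + self.m), cos)
        return F.cross_entropy(logits, labels)

# Toy usage: 128-d embeddings, 10 MoA classes (placeholder sizes).
head = ArcFaceHead(dim=128, n_classes=10)
loss = head(torch.randn(4, 128), torch.tensor([0, 3, 7, 1]))
```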
bioinformatics2026-02-05v1Decoding ATG9A Variation: A Comprehensive Structural Investigation of All Missense Variants
Utichi, M.; Marjault, H.-B.; Tiberti, M.; Papaleo, E.AI Summary
- This study used an in silico saturation mutagenesis approach with the MAVISp framework to predict the impact of all missense mutations in ATG9A, focusing on protein stability, conformational changes, multimerization, and post-translational modifications.
- By integrating structural predictions with variant effect predictors and cross-referencing with disease databases, the study identified potentially damaging variants in ATG9A.
- The findings provide insights into the molecular mechanisms of these variants, offering a roadmap for understanding ATG9A mutations in human diseases.
Abstract
Macroautophagy (hereafter autophagy) is a cellular recycling pathway that requires different ATG (autophagy-related) proteins to generate double-membraned autophagosomes. ATG9A, a multi-spanning membrane protein, plays a crucial role in this process as the only transmembrane component of the core autophagy machinery. ATG9A functions as a lipid scramblase, redistributing lipids between membrane leaflets for the expanding autophagosome membrane. Structural studies have revealed that ATG9A forms a homotrimer with an interlocked domain-swapped architecture and a network of internal hydrophilic cavities. This configuration underlies its role in lipid transfer and membrane remodeling together with the lipid transporter ATG2A. ATG9A dysfunction has also been linked to human disease, as specific ATG9A mutations cause neurodevelopmental or neurodegenerative phenotypes. Additionally, ATG9A is altered in cancer, promoting pro-tumorigenic traits. However, most missense variants in ATG9A remain uncharacterized, posing a significant challenge for interpreting genomic data. In this study, we employed an in silico saturation mutagenesis approach using the MAVISp (Multi-layered Assessment of VarIants by Structure) framework to predict the impact of every missense mutation in ATG9A. By analyzing multiple structural assemblies of ATG9A (monomer, trimer, and the ATG9A-ATG2A complex), we evaluated diverse mechanistic indicators of variant impact, including protein stability, long-range conformational changes, effects on multimerization interfaces, and alterations in post-translational modifications. We integrated the structure-based predictions with Variant Effect Predictors from recent deep-learning or evolutionary-based models and cross-referenced known variants catalogued in ClinVar, COSMIC, and cBioPortal. Finally, we predicted mechanistic indicators for all possible variants with structural coverage not yet reported in the disease-related databases supported by MAVISp. Our analyses identified a group of potentially damaging variants in ATG9A and the possible molecular mechanisms underlying their effects. Together, this work provides a roadmap for interpreting missense variants in autophagy regulators and highlights specific ATG9A mutations that deserve further investigation in the context of human disease.
bioinformatics2026-02-05v1Large mRNA language foundation modeling with NUWA for unified sequence perception and generation
Zhong, Y.; Yan, W.; Zhang, Y.; Tan, K.; Saito, Y.; Bian, B.AI Summary
- This study introduces NUWA, a large mRNA language foundation model using a BERT-like architecture, trained on extensive mRNA sequences from bacteria, eukaryotes, and archaea for unified sequence perception and generation.
- NUWA excels in various downstream tasks, including RNA-related perception and cross-modal protein tasks, and uses an entropy-guided strategy for generating natural-like mRNA sequences.
- Fine-tuned NUWA can generate functional, novel mRNA sequences for applications in biomanufacturing, vaccine development, and therapeutics.
Abstract
The mRNA serves as a crucial bridge between DNA and proteins. Compared to DNA, mRNA sequences are much more concise and information-dense, which makes mRNA an ideal language through which to explore various biological principles. In this study, we present NUWA, a large mRNA language foundation model leveraging a BERT-like architecture, trained with curriculum masked language modeling and supervised contrastive loss for unified mRNA sequence perception and generation. For pretraining, we utilized large-scale mRNA coding sequences comprising approximately 80 million sequences from 19,676 bacterial species, 33 million from 4,688 eukaryotic species, and 2.1 million from 702 archaeal species, and pre-trained three domain-specific models, one for each domain. This enables NUWA to learn coding sequence patterns across the entire tree of life. The fine-tuned NUWA demonstrates strong performance across a variety of downstream tasks, excelling not only in RNA-related perception tasks but also exhibiting robust capability in cross-modal protein-related tasks. On the generation front, NUWA pioneers an entropy-guided strategy that enables BERT-like models to generate mRNA sequences, producing natural-like sequences that accurately recapitulate species-specific codon usage patterns. Moreover, NUWA can be effectively fine-tuned on small, task-specific datasets to generate functional mRNAs with desired properties, including sequences that do not exist in nature, and to design coding sequences for diverse proteins in biomanufacturing, vaccine development, and therapeutic applications. To our knowledge, NUWA represents the first mRNA language model for unified sequence perception and generation, providing a versatile and programmable platform for mRNA design.
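The abstract does not detail the entropy-guided generation strategy; one common way to generate with a BERT-like model is iterative unmasking, committing first the position whose predicted distribution has the lowest entropy. A toy sketch under that assumption, with a random stand-in for the model:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["A", "C", "G", "U"]

def predict_probs(seq):
    """Stand-in for a BERT-like model returning per-position distributions
    over VOCAB (random here; a real model would condition on the context)."""
    p = rng.random((len(seq), len(VOCAB)))
    return p / p.sum(axis=1, keepdims=True)

def entropy_guided_fill(length):
    seq = ["[MASK]"] * length
    while "[MASK]" in seq:
        probs = predict_probs(seq)
        ent = -(probs * np.log(probs)).sum(axis=1)  # per-position entropy
        masked = [i for i, t in enumerate(seq) if t == "[MASK]"]
        i = min(masked, key=lambda j: ent[j])       # most confident position
        seq[i] = VOCAB[int(probs[i].argmax())]
    return "".join(seq)

print(entropy_guided_fill(12))
```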
bioinformatics2026-02-04v3FlashDeconv enables atlas-scale, multi-resolution spatial deconvolution via structure-preserving sketching
Yang, C.; Chen, J.; Zhang, X.AI Summary
- FlashDeconv uses leverage-score importance sampling and sparse spatial regularization to enable rapid, high-resolution spatial deconvolution, processing 1.6 million bins in 153 seconds.
- It identifies a tissue-specific resolution horizon in mouse intestine at 8-16 µm, where cell-type co-localization sign inversions occur, validated by Xenium data.
- In human colorectal cancer, FlashDeconv reveals neutrophil inflammatory microdomains in low-UMI regions, recovering spatial biology from previously uninformative data.
Abstract
Coarsening Visium HD resolution from 8 to 64 µm can flip cell-type co-localization from negative to positive (r = -0.12 → +0.80), yet investigators are routinely forced to coarsen because current deconvolution methods cannot scale to million-bin datasets. Here we introduce FlashDeconv, which combines leverage-score importance sampling with sparse spatial regularization to match top-tier Bayesian accuracy while processing 1.6 million bins in 153 seconds on a standard laptop. Systematic multi-resolution analysis of Visium HD mouse intestine reveals a tissue-specific resolution horizon (8-16 µm)--the scale at which this sign inversion occurs--validated by Xenium ground truth. Below this horizon, FlashDeconv provides the first sequencing-based quantification of Tuft cell chemosensory niches (15.3-fold stem cell enrichment). In a 1.6-million-bin human colorectal cancer cohort, FlashDeconv further uncovers neutrophil inflammatory microdomains in low-UMI regions that classification-based methods discard, recovering spatially organized biology from measurements previously considered uninformative.
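Leverage-score importance sampling is a standard matrix-sketching tool; a minimal numpy illustration (toy data, not FlashDeconv's implementation) of scoring spatial bins and drawing a reweighted sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.poisson(1.0, size=(10_000, 50)).astype(float)   # bins x genes (toy)

# Statistical leverage of row i: squared norm of row i of U from a thin
# SVD, i.e. the diagonal of the hat matrix X (X^T X)^+ X^T.
U, _, _ = np.linalg.svd(X, full_matrices=False)
leverage = (U ** 2).sum(axis=1)
probs = leverage / leverage.sum()

# Importance-sample 500 bins, rescaling rows to keep the sketch unbiased.
idx = rng.choice(X.shape[0], size=500, replace=False, p=probs)
sketch = X[idx] / np.sqrt(500 * probs[idx, None])
```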
bioinformatics2026-02-04v2SPCoral: diagonal integration of spatial multi-omics across diverse modalities and technologies
Wang, H.; Yuan, J.; Li, K.; Chen, X.; Yan, X.; Lin, P.; Tang, Z.; Wu, B.; Nan, H.; Lai, Y.; Lv, Y.; Esteban, M. A.; Xie, L.; Wang, G.; Hui, L.; Li, H.AI Summary
- SPCoral was developed to integrate spatial multi-omics data across different slices, modalities, and technologies using graph attention networks and optimal transport.
- It employs a cross-modality attention network for feature integration and cross-omics prediction, showing superior performance in benchmarks.
- The integration enhances spatial domain identification, data augmentation, cross-modal analysis, and cell-cell communication, revealing insights unattainable with single modality data.
Abstract
Spatial multi-omics is indispensable for decoding the comprehensive molecular landscape of biological systems. However, the integration of multi-omics remains largely unresolved due to inherent disparities in molecular features, spatial morphology, and resolution. Here we developed SPCoral for diagonal integration of spatial multi-omics across adjacent slices. SPCoral extracts spatial covariation patterns via graph attention networks, followed by the use of optimal transport to identify high-confidence anchors in an unsupervised, feature-independent manner. SPCoral utilizes a cross-modality attention network to enable seamless cross-resolution feature integration alongside robust cross-omics prediction. Comprehensive benchmarking demonstrates SPCoral's superior performance across different technologies, modalities and varied resolutions. The integrated multi-omics representation further improves spatial domain identification, effectively augments experimental data, enables cross-modal association analysis, and facilitates cell-cell communication analysis. SPCoral exhibits good scalability with data size and reveals biological insights that are not attainable using a single modality. In summary, SPCoral offers a powerful framework for spatial multi-omics integration across various technologies and biological scenarios.
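The abstract describes optimal transport for unsupervised anchor identification; here is a compact entropic-OT (Sinkhorn) sketch in numpy, with a mutual-best-match rule standing in for SPCoral's actual anchor filter (all sizes and the filtering rule are illustrative assumptions):

```python
import numpy as np

def sinkhorn(a, b, cost, reg=0.05, n_iter=200):
    """Entropic optimal transport between histograms a (n,) and b (m,)
    given an n-by-m cost matrix; returns the transport plan."""
    K = np.exp(-cost / reg)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 8))    # spot embeddings, slice/modality 1 (toy)
Y = rng.normal(size=(60, 8))    # spot embeddings, slice/modality 2 (toy)
cost = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
cost /= cost.max()              # normalize so exp(-cost/reg) stays stable

plan = sinkhorn(np.full(40, 1 / 40), np.full(60, 1 / 60), cost)

# Mutual best matches in the plan serve as high-confidence anchors here.
anchors = []
for i in range(plan.shape[0]):
    j = int(plan[i].argmax())
    if int(plan[:, j].argmax()) == i:
        anchors.append((i, j))
```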
bioinformatics2026-02-04v1Peptide-to-protein data aggregation using Fisher's method improves target identification in chemical proteomics
Lyu, H.; Gharibi, H.; Meng, Z.; Sokolova, B.; Zhang, X.; Zubarev, R.AI Summary
- This study compares two methods for protein-level statistical testing in chemical proteomics: traditional aggregation of peptide data versus Fisher's method of combining peptide p-values.
- Fisher's method, using the top four peptides by p-value, was tested across various datasets and consistently outperformed traditional methods by avoiding biases from deviant or missing peptide data.
- The approach improved the identification of regulated or shifted proteins in diverse proteomics assays.
Abstract
Protein-level statistical tests in proteomics aimed at obtaining a p-value are conventionally performed on protein abundances aggregated from peptide data. This integral approach overlooks peptide-level heterogeneity and ignores important information encoded in individual peptide data, whereas a protein p-value can also be obtained by Fisher's method of combining peptide p-values using chi-square statistics. Here we test this latter approach across diverse chemical proteomics datasets based on assessments of protein expression, solubility and protease accessibility. Using the top four peptides ranked by their p-values consistently outperformed protein-level analysis and avoided biases introduced by inclusion of deviant peptides or imputation of missing peptide values. Fisher's method provides a simple and robust strategy, improving identification of regulated/shifted proteins in diverse proteomics assays.
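Fisher's method itself is a one-liner; a short sketch of the combination step, including the top-four-peptides rule (interpreted here as the four smallest p-values; note that selecting the smallest p-values alters the nominal chi-square null, which a full implementation must account for):

```python
import numpy as np
from scipy import stats

# Peptide-level p-values for one protein (toy numbers).
peptide_p = np.array([0.004, 0.03, 0.20, 0.41, 0.66, 0.90])
top4 = np.sort(peptide_p)[:4]          # four smallest p-values

# Fisher's method: X^2 = -2 * sum(ln p_i) follows a chi-square
# distribution with 2k degrees of freedom under the null.
stat = -2.0 * np.log(top4).sum()
protein_p = stats.chi2.sf(stat, df=2 * len(top4))

# scipy performs the same combination directly:
stat2, p2 = stats.combine_pvalues(top4, method="fisher")
assert np.isclose(protein_p, p2)
print(protein_p)
```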
bioinformatics2026-02-04v1EvoPool: Evolution-Guided Pooling of Protein Language Model Embeddings
NaderiAlizadeh, N.; Singh, R.AI Summary
- The study introduces EvoPool, a self-supervised pooling framework that integrates evolutionary information from homologous sequences into protein language model (PLM) embeddings using optimal transport.
- EvoPool constructs a fixed-size evolutionary anchor and uses sliced Wasserstein distances to enhance PLM representations for protein-level prediction tasks.
- Experiments on the ProteinGym benchmark showed that EvoPool outperforms standard pooling methods in variant effect prediction, highlighting the benefit of evolutionary guidance.
Abstract
Protein language models (PLMs) encode amino acid sequences into residue-level embeddings that must be pooled into fixed-size representations for downstream protein-level prediction tasks. Although these embeddings implicitly reflect evolutionary constraints, existing pooling strategies operate on single sequences and do not explicitly leverage information from homologous sequences or multiple sequence alignments. We introduce EvoPool, a self-supervised pooling framework that integrates evolutionary information from homologs directly into aggregated PLM representations using optimal transport. Our method constructs a fixed-size evolutionary anchor from an arbitrary number of homologous sequences and uses sliced Wasserstein distances to derive query protein embeddings that are geometrically informed by homologous sequence embeddings. Experiments across multiple state-of-the-art PLM families on the ProteinGym benchmark show that EvoPool consistently outperforms standard pooling baselines for variant effect prediction, demonstrating that explicit evolutionary guidance substantially enhances the functional utility of PLM representations. Our implementation code is available at https://github.com/navid-naderi/EvoPool.
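Sliced Wasserstein distances reduce an optimal-transport problem to sorted 1D projections; a self-contained numpy sketch comparing two embedding clouds of different sizes (dimensions, sizes, and the quantile grid are placeholders, not EvoPool's configuration):

```python
import numpy as np

def sliced_wasserstein(X, Y, n_proj=128, n_q=200, seed=0):
    """Average 1D Wasserstein-2 distance over random projection directions;
    quantile matching handles point clouds of different sizes."""
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n_proj, X.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    q = np.linspace(0, 1, n_q)
    total = 0.0
    for u in dirs:
        xq = np.quantile(X @ u, q)   # empirical quantile function of the
        yq = np.quantile(Y @ u, q)   # projected 1D distributions
        total += np.mean((xq - yq) ** 2)
    return np.sqrt(total / n_proj)

rng = np.random.default_rng(1)
A = rng.normal(size=(120, 16))            # query residue embeddings (toy)
B = rng.normal(loc=0.5, size=(300, 16))   # evolutionary anchor (toy)
print(sliced_wasserstein(A, B))
```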
bioinformatics2026-02-04v1Embarrassingly_FASTA: Enabling Recomputable, Population-Scale Pangenomics by Reducing Commercial Genome Processing Costs from $100 to less than $1
Walsh, D. J.; Njie, e. G.AI Summary
- The study introduces Embarrassingly_FASTA, a GPU-accelerated preprocessing pipeline that reduces genome processing costs from ~$100/genome to less than $1/genome by using transient intermediates and ephemeral cloud infrastructure.
- This approach enables the retention of raw FASTQ data, facilitating recomputable, population-scale pangenomics.
- The efficiency was demonstrated through simulated large-cohort pangenome builds in C. elegans and humans, showcasing the potential for capturing unsampled genetic diversity.
Abstract
Computational preprocessing has become the dominant bottleneck in genomics, frequently exceeding sequencing costs and constraining population-scale analysis, even as large repositories grow from tens of petabytes toward exabyte-scale storage to support World Genome Models. Legacy CPU-based workflows require many hours to days per 30x human genome, driving many repositories to distribute aligned or derived intermediates such as BAM and VCF files rather than raw FASTQ data. These intermediates embed reference- and model-dependent assumptions that limit reproducibility and impede reanalysis as reference genomes, including pangenomes, continue to evolve. Although recent work has established that GPUs can dramatically accelerate genomic pipelines, enabling large-cohort processing to shrink from years to days given sufficient parallelism, such workflows remain cost-prohibitive. Here, we introduce Embarrassingly_FASTA, a GPU-accelerated preprocessing pipeline built on NVIDIA Parabricks that fundamentally changes the economics of genomic data management. By rendering intermediate files transient rather than archival, Embarrassingly_FASTA enables retention of raw FASTQ data and reliable use of highly discounted ephemeral cloud infrastructure such as spot instances, reducing compute spend from ~$17/genome (CPU on-demand) to <$1/genome (GPU spot), and replacing commercial secondary-analysis pricing of ~$120/genome with a compute spend under $1/genome. We demonstrate the impact of this efficiency using a simulated large-cohort pangenome build-up (using variant-union accumulation as a proxy for diversity growth) in Caenorhabditis elegans and humans, highlighting the long tail of unsampled human genetic diversity. Beyond GPU kernels, Embarrassingly_FASTA contributes a transient-intermediate lifecycle and spot-friendly orchestration that makes FASTQ retention and routine recomputation economically viable. Embarrassingly_FASTA thus provides enabling infrastructure for recomputable, population-scale pangenomics and next-generation genomic models. Keywords: Genome preprocessing, GPU acceleration, Whole-genome sequencing (WGS), Population genomics, Pangenomics, World Genome Models, Genomic infrastructure, Variant calling, Recomputable genomics
bioinformatics2026-02-04v1Joint Modeling of Transcriptomic and Morphological Phenotypes for Generative Molecular Design
Verma, S.; Wang, M.; Jayasundara, S.; Malusare, A. M.; Wang, L.; Grama, A.; Kazemian, M.; Lanman, N. A.AI Summary
- The study introduces Pert2Mol, a framework for integrating transcriptomic and morphological data from paired control-treatment experiments to generate molecular structures.
- Pert2Mol uses bidirectional cross-attention and a rectified flow transformer to model perturbation dynamics, achieving a Frechet ChemNet Distance of 4.996 on the GDP dataset, outperforming diffusion and transcriptomics-only methods.
- The model offers high molecular validity, good physicochemical property distributions, 84.7% scaffold diversity, and is 12.4 times faster than diffusion methods for generation.
Abstract
Motivation: Phenotypic drug discovery generates rich multi-modal biological data from transcriptomic and morphological measurements, yet translating complex cellular responses into molecular design remains a computational bottleneck. Existing generative methods operate on single modalities and condition on post-treatment measurements without leveraging paired control-treatment dynamics to capture perturbation effects. Results: We present Pert2Mol, the first framework for multi-modal phenotype-to-structure generation that integrates transcriptomic and morphological features from paired control-treatment experiments. Pert2Mol employs bidirectional cross-attention between control and treatment states to capture perturbation dynamics, conditioning a rectified flow transformer that generates molecular structures along straight-line trajectories. We introduce Student-Teacher Self-Representation (SERE) learning to stabilize training in high-dimensional multi-modal spaces. On the GDP dataset, Pert2Mol achieves Frechet ChemNet Distance of 4.996 compared to 7.343 for diffusion baselines and 59.114 for transcriptomics-only methods, while maintaining perfect molecular validity and appropriate physicochemical property distributions. The model demonstrates 84.7% scaffold diversity and 12.4 times faster generation than diffusion approaches with deterministic sampling suitable for hypothesis-driven validation. Availability: Code and pretrained models will be available at https://github.com/wangmengbo/Pert2Mol.
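Rectified flow trains a velocity field to follow straight-line paths between noise and data, which is what enables the fast deterministic sampling claimed above; a generic PyTorch sketch in which an unconditional MLP stands in for Pert2Mol's phenotype-conditioned transformer (all sizes hypothetical):

```python
import torch
import torch.nn as nn

class Velocity(nn.Module):
    """Toy velocity field v(x_t, t); Pert2Mol conditions a transformer on
    phenotype embeddings, replaced here by an unconditional MLP."""
    def __init__(self, dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(),
                                 nn.Linear(128, dim))
    def forward(self, x, t):
        return self.net(torch.cat([x, t], dim=-1))

dim = 32
model = Velocity(dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(200):
    x0 = torch.randn(64, dim)          # noise endpoint
    x1 = torch.randn(64, dim)          # data endpoint (molecule latent)
    t = torch.rand(64, 1)
    xt = (1 - t) * x0 + t * x1         # straight-line interpolation
    loss = ((model(xt, t) - (x1 - x0)) ** 2).mean()  # constant-velocity target
    opt.zero_grad(); loss.backward(); opt.step()

# Deterministic sampling: Euler-integrate dx/dt = v(x, t) from t=0 to 1.
with torch.no_grad():
    x = torch.randn(8, dim)
    n_steps = 50
    for i in range(n_steps):
        t = torch.full((8, 1), i / n_steps)
        x = x + model(x, t) / n_steps
```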
bioinformatics2026-02-04v1FRED: a universal tool to generate FAIR metadata for omics experiments
Walter, J.; Kuenne, C.; Knoppik, N.; Goymann, P.; Looso, M.AI Summary
- The study addresses the challenge of standardizing metadata in omics experiments to enhance data management according to FAIR principles.
- FRED, a new tool, was developed to generate machine-readable metadata, offering features like dialog-based creation, semantic validation, logical search, an API, and a web interface.
- FRED is designed for use by both non-computational scientists and specialized facilities, integrating easily into existing research data management systems.
Abstract
Scientific research relies on transparent dissemination of data and its associated interpretations. This task encompasses accessibility of raw data, its metadata, details concerning experimental design, along with parameters and tools employed for data interpretation. Production and handling of these data represent an ongoing challenge, extending beyond publication into individual facilities, institutes and research groups, often termed Research Data Management (RDM). It is foundational to scientific discovery and innovation, and can be paraphrased as Findability, Accessibility, Interoperability and Reusability (FAIR). Although the majority of peer-reviewed journals require the deposition of raw data in public repositories in alignment with FAIR principles, metadata frequently lacks full standardization. This critical gap in data management practices hinders effective utilization of research findings and complicates sharing of scientific knowledge. Here we present a flexible design of a machine-readable metadata format to store experimental metadata, along with an implementation of a generalized tool named FRED. It enables i) dialog-based creation of metadata files, ii) structured semantic validation, iii) logical search, iv) an external programming interface (API), and v) a standalone web front end. The tool is intended to be used by non-computational scientists as well as specialized facilities, and can be seamlessly integrated into existing RDM infrastructure.
bioinformatics2026-02-04v1QMAP: A Benchmark for Standardized Evaluation of Antimicrobial Peptide MIC and Hemolytic Activity Regression
Lavertu, A.; Corbeil, J.; Germain, P.AI Summary
- QMAP is introduced as a benchmark for evaluating the prediction of antimicrobial peptide (AMP) potency (MIC) and hemolytic toxicity (HC50), using homology-aware test sets to prevent overfitting.
- The benchmark reassessed existing MIC models, revealing limited progress over six years, poor performance in predicting high-potency MIC, and low predictability for hemolytic activity.
- A Python package with a Rust-accelerated engine for efficient data manipulation is provided to facilitate the adoption of QMAP.
Abstract
Antimicrobial peptides (AMPs) are promising alternatives to conventional antibiotics, but progress in computational AMP discovery has been difficult to quantify due to inconsistent datasets and evaluation protocols. We introduce QMAP, a domain-specific benchmark for predicting AMP antimicrobial potency (MIC) and hemolytic toxicity (HC50) with homology-aware, predefined test sets. QMAP enforces strict sequence homology constraints between training and test data, ensuring that model performance reflects true generalization rather than overfitting. Applying QMAP, we reassess existing MIC models and establish baselines for MIC and HC50 regression. Results show limited progress over six years, poor performance for high-potency MIC regression, and low predictability for hemolytic activity, emphasizing the need for standardized evaluation and improved modeling approaches for highly potent peptides. We release a Python package that facilitates practical adoption, with a Rust-accelerated engine enabling efficient data manipulation; it is installable with pip install qmap-benchmark.
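The homology-aware splitting idea can be illustrated generically: cluster peptides at an identity threshold, then assign whole clusters to train or test. A sketch under those assumptions (random identity matrix as a stand-in; QMAP's actual protocol and thresholds may differ):

```python
import numpy as np

def homology_aware_split(ids, identity, thresh=0.4, test_frac=0.2, seed=0):
    """Greedy single-linkage clustering at a sequence-identity threshold,
    then assignment of whole clusters to train or test, so that no test
    peptide has a close homolog in training. Generic sketch of the idea,
    not QMAP's implementation."""
    n = len(ids)
    cluster = list(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if identity[i, j] >= thresh and cluster[i] != cluster[j]:
                old, new = cluster[j], cluster[i]
                cluster = [new if c == old else c for c in cluster]
    labels = sorted(set(cluster))
    rng = np.random.default_rng(seed)
    rng.shuffle(labels)
    test_clusters = set(labels[: max(1, int(test_frac * len(labels)))])
    test = [ids[i] for i in range(n) if cluster[i] in test_clusters]
    train = [ids[i] for i in range(n) if cluster[i] not in test_clusters]
    return train, test

rng = np.random.default_rng(1)
identity = rng.random((8, 8))
identity = (identity + identity.T) / 2      # toy symmetric identity matrix
train, test = homology_aware_split([f"pep{i}" for i in range(8)], identity)
print(train, test)
```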
bioinformatics2026-02-04v1petVAE: A Data-Driven Model for Identifying Amyloid PET Subgroups Across the Alzheimer's Disease Continuum
Tagmazian, A. A.; Schwarz, C.; Lange, C.; Pitkänen, E.; Vuoksimaa, E.AI Summary
- This study aimed to identify subgroups along the Alzheimer's disease (AD) continuum using Aβ PET scans by developing petVAE, a 2D variational autoencoder model.
- petVAE was trained on 3,110 scans from ADNI and A4 datasets, identifying four clusters (Aβ-, Aβ-+, Aβ+, Aβ++) that differed significantly in standardized uptake value ratio, CSF Aβ, cognitive performance, APOE ε4 prevalence, and progression rate to AD.
- The model effectively captured the AD continuum, revealing preclinical stages and offering a new framework for studying disease progression.
Abstract
Amyloid-β (Aβ) PET imaging is a core biomarker and is considered sufficient for the biological diagnosis of Alzheimer's disease (AD). However, it is typically reduced to a binary Aβ-/Aβ+ classification. In this study, we aimed to identify subgroups along the continuum of Aβ accumulation, including subgroups within Aβ- and Aβ+. We used a total of 3,110 Aβ PET scans from the Alzheimer's Disease Neuroimaging Initiative (ADNI) and Anti-Amyloid Treatment in Asymptomatic Alzheimer's Disease (A4) datasets to develop petVAE, a 2D variational autoencoder model. The model accurately reconstructed Aβ PET scans without prior labeling or pre-selection based on scanner type or region of interest. Latent representations of scans extracted from the petVAE (11,648 latent features per scan) were used to visualize, analyze, and cluster the AD continuum. We identified the latent features most representative of the continuum, and clustering of PET scans using these features produced four clusters. Post-hoc characterization revealed that two clusters (Aβ-, Aβ-+) were predominantly Aβ negative and two (Aβ+, Aβ++) were predominantly Aβ positive. All clusters differed significantly in standardized uptake value ratio (p < 1.64e-8) and cerebrospinal fluid (CSF) Aβ (p < 0.02), demonstrating petVAE's ability to assign scans along the Aβ continuum. The clusters at the extremes of the continuum (Aβ-, Aβ++) resembled the conventional Aβ negative and Aβ positive groups and differed significantly in cognitive performance, Apolipoprotein E (APOE) ε4 prevalence, and Aβ, tau and phosphorylated tau CSF biomarkers (p < 3e-6). The two intermediate clusters (Aβ-+, Aβ+) showed significantly higher odds of carrying at least one APOE ε4 allele compared with the Aβ- cluster (p < 0.026). Participants in the Aβ+ or Aβ++ clusters exhibited a significantly faster rate of progression to AD compared to the Aβ- group (hazard ratio = 2.42 and 9.43 for groups Aβ+ and Aβ++, respectively, p < 1.17e-7). Thus, petVAE was capable of reconstructing PET scans while also extracting latent features that effectively represented the AD continuum and defined biologically meaningful clusters. By capturing subtle Aβ-related changes in brain PET scans, petVAE-based classification enables the detection of preclinical AD stages and offers a new data-driven framework for studying disease progression.
bioinformatics2026-02-04v1RareCapsNet: An explainable capsule networks enable robust discovery of rare cell populations from large-scale single-cell transcriptomics
Ray, S.; Lall, S.AI Summary
- RareCapsNet uses capsule networks to identify rare cell populations in large-scale single-cell RNA-seq data.
- It leverages explainable AI to interpret lower-level capsules, identifying novel marker genes for rare cell types.
- Evaluations on simulated and real data show RareCapsNet outperforms other methods in specificity, selectivity, and can transfer knowledge across batches.
Abstract
In silico downstream analysis of single-cell data has attracted considerable attention from machine learning researchers in recent years. Recent technological advances and increased throughput open new opportunities to discover rare cell types. We develop RareCapsNet, a capsule-network-based technique for identifying rare cells in large single-cell RNA-seq data. RareCapsNet leverages the landmark advantages of capsule networks in the single-cell domain, identifying novel rare cell populations through marker genes derived from interpretable lower-level (primary) capsules. We demonstrate the explainability of the capsule network for identifying novel markers that act as signatures of rare cell populations. A comprehensive evaluation on simulated and real single-cell data demonstrates the efficacy of RareCapsNet for detecting rare populations in large scRNA-seq data. RareCapsNet not only outperforms other state-of-the-art methods in specificity and selectivity for identifying rare cell types, but also successfully extracts the transcriptomic signatures of these populations. We further apply RareCapsNet to multi-batch datasets, where knowledge learned on one batch can be transferred to identify rare cells in another batch without retraining. Availability and Implementation: RareCapsNet is available at: https://github.com/sumantaray/RareCapsNet.
bioinformatics2026-02-04v1coelsch: Platform-agnostic single-cell analysis of meiotic recombination events
Parker, M. T.; Amar, S.; Freudigmann, J.; Walkemeier, B.; Dong, X.; Solier, V.; Marek, M.; Huettel, B.; Mercier, R.; Schneeberger, K.AI Summary
- The study benchmarks single-cell sequencing methods (droplet-based chromatin accessibility, RNA sequencing, and plate-based whole-genome amplification) for mapping meiotic recombination in Arabidopsis thaliana.
- Novel tools, coelsch_mapping_pipeline and coelsch, were developed for haplotype-aware alignment and crossover detection, successfully mapping recombination in 34 out of 40 F1 hybrids.
- The analysis revealed significant variation in recombination rates and identified a large ~10 Mb pericentric inversion in accession Zin-9, the largest known in A. thaliana.
Abstract
Background: Meiotic recombination creates genetic diversity through reciprocal exchange of haplotypes between homologous chromosomes. Scalable and robust methods for mapping recombination breakpoints are essential for understanding meiosis and for genetic mapping. Single-cell sequencing of gametes offers a direct approach to recombination mapping, yet the effect of technical differences between single-cell sequencing methods on crossover detection remains unclear. Results: We benchmark single-cell methods (droplet-based chromatin accessibility and RNA sequencing, and plate-based whole-genome amplification) for mapping meiotic recombination in Arabidopsis thaliana. For this purpose, we introduce two novel open-source tools, coelsch_mapping_pipeline and coelsch, for haplotype-aware alignment and per-cell crossover detection, using them to recover known recombination frequencies and quantify the effects of coverage sparsity. We subsequently apply our approach to a panel of 40 recombinant F1 hybrids derived from crosses of 22 diverse natural accessions, successfully recovering genetic maps for 34 F1s in a single dataset. This analysis reveals substantial variation in recombination rate and identifies a ~10 Mb pericentric inversion in the accession Zin-9, the largest natural inversion reported in A. thaliana to date. Conclusions: These results demonstrate the applicability and scalability of single-cell gamete sequencing for high-throughput mapping of meiotic recombination, and highlight the strengths and limitations of different single-cell modalities. The accompanying open-source tools provide a framework for haplotyping and crossover detection analysis using sparse single-cell sequencing data. Our methodology enables parallel analysis of large numbers of hybrids in a single dataset, removing a major technical barrier to large-scale studies of natural variation in recombination rate.
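As a generic illustration of per-cell crossover detection (not coelsch's actual model, which must cope with far sparser and noisier signal), one can smooth binary parental-haplotype calls along a chromosome and report well-supported switch points:

```python
import numpy as np

def call_crossovers(pos, hap, window=25, min_delta=0.8):
    """Toy crossover caller: hap gives 0/1 parental origin at sorted marker
    positions pos; report midpoints where the windowed haplotype fraction
    flips with strong contrast on both sides. Illustrative only."""
    smooth = np.convolve(hap, np.ones(window) / window, mode="same")
    state = (smooth > 0.5).astype(int)
    switches = np.flatnonzero(np.diff(state) != 0)
    calls = []
    for i in switches:
        left = smooth[max(i - window, 0)]
        right = smooth[min(i + window, len(pos) - 1)]
        if abs(left - right) >= min_delta:
            calls.append(int((pos[i] + pos[i + 1]) // 2))
    return calls

rng = np.random.default_rng(3)
pos = np.sort(rng.integers(0, 30_000_000, 2000))
hap = (pos > 12_000_000).astype(int)                    # one true crossover
hap = np.where(rng.random(2000) < 0.02, 1 - hap, hap)   # 2% genotyping noise
print(call_crossovers(pos, hap))                        # midpoint near 12 Mb
```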
bioinformatics2026-02-04v1LoReMINE: Long Read-based Microbial genome mining pipeline
Agrawal, A. A.; Bader, C. D.; Kalinina, O. V.AI Summary
- The study introduces LoReMINE, a pipeline for microbial genome mining that automates the process from long-read sequencing data to the prediction and clustering of biosynthetic gene clusters (BGCs).
- LoReMINE integrates various tools to provide a scalable, reproducible workflow for natural product discovery, addressing the limitations of existing methods that require manual curation.
Abstract
Microbial natural products represent a chemically diverse repertoire of small molecules with major pharmaceutical potential. Despite the increasing availability of microbial genome sequences, large-scale natural product discovery remains challenging because the existing genome mining approaches lack integrated workflows for rapid dereplication of known compounds and prioritization of novel candidates, forcing researchers to rely on multiple tools that require extensive manual curation and expert intervention at each step. To address these limitations, we introduce LoReMINE (Long Read-based Microbial genome mining pipeline), a fully automated end-to-end pipeline that generates high-quality assemblies, performs taxonomic classification, predicts biosynthetic gene clusters (BGCs) responsible for biosynthesis of natural products, and clusters them into gene cluster families (GCFs) directly from long-read sequencing data. By integrating state-of-the-art tools into a seamless pipeline, LoReMINE enables scalable, reproducible, and comprehensive genome mining across diverse microbial taxa. The pipeline is openly available at https://github.com/kalininalab/LoReMINE and can be installed via Conda (https://anaconda.org/kalininalab/loremine), facilitating broad adoption by the natural product research community.
bioinformatics2026-02-04v1CAMUS: Scalable Phylogenetic Network Estimation
Willson, J.; Warnow, T.AI Summary
- CAMUS is a scalable method for estimating phylogenetic networks, designed to handle larger datasets by maximizing quartet trees within a given constraint tree.
- Simulation studies under the Network Multi-Species Coalescent showed CAMUS to be highly accurate, fast, and scalable, processing up to 201 species in minutes.
- Compared to PhyloNet-MPL and SNaQ, CAMUS is slightly less accurate when PhyloNet-MPL is used without a fixed tree but significantly faster and capable of handling larger datasets.
Abstract
Motivation: Phylogenetic networks are models of evolution that go beyond trees, and so represent reticulate events such as horizontal gene transfer or hybridization, which are frequently found in many taxa. Yet, the estimation of phylogenetic networks is extremely computationally challenging, and nearly all methods are limited to very small datasets with perhaps 10 to 15 species (some limited to even smaller numbers). Results: We introduce CAMUS (Constrained Algorithm Maximizing qUartetS), a scalable method for phylogenetic network estimation. CAMUS takes an input constraint tree T as well as a set Q of unrooted quartet trees that it derives from the input, and returns a level-1 phylogenetic network N that is built upon T through the addition of edges, in order to maximize the number of quartet trees in Q that are induced in N. We perform a simulation study under the Network Multi-Species Coalescent and show that a simple pipeline using CAMUS provides high accuracy and outstanding speed and scalability, in comparison to two leading methods, PhyloNet-MPL used with a fixed tree and SNaQ. CAMUS is slightly less accurate than PhyloNet-MPL used without a fixed tree, but is much faster (minutes instead of hours) and can complete on inputs with 201 species while PhyloNet-MPL fails to complete on inputs with more than 51 species. Availability and Implementation: The source code is available at https://github.com/jsdoublel/camus.
bioinformatics2026-02-04v1Common Pitfalls in CircRNA Detection and Quantification
Weyrich, M.; Trummer, N.; Boehm, F.; Furth, P. A.; Hoffmann, M.; List, M.AI Summary
- This study compares circRNA detection in poly(A)-enriched versus ribosomal RNA-depleted RNA-seq data, finding that poly(A) data often yield false positives.
- The quality of sample processing, indicated by ribosomal read fraction, impacts circRNA detection sensitivity.
- Best practices include using total RNA sequencing with rRNA depletion, employing multiple detection tools, and focusing on back-splice junctions for reliable circRNA analysis.
Abstract
Circular RNAs have garnered considerable interest, as they have been implicated in numerous biological processes and diseases. Owing to their stability, they are often considered promising biomarker candidates or therapeutic targets. Due to the lack of a poly(A) tail, circRNAs are best detected in total RNA-seq data after depleting ribosomal RNA. However, we observe that circRNA detection is still frequently applied to the far more common poly(A)-enriched RNA-seq data. In this study, we systematically compare the detection of circRNAs in two matched poly(A) and ribosomal RNA-depleted data sets. Our results indicate that the comparably few circRNAs detected in poly(A) data are likely false positives. In addition, we demonstrate that the quality of sample processing, as measured by the fraction of ribosomal reads, significantly affects the sensitivity of circRNA detection, leading to a bias in downstream analysis. Our findings establish best practices for circRNA research: total RNA sequencing with effective rRNA depletion is the preferred approach for accurate circRNA profiling, whereas poly(A)-enriched data are unsuitable for comprehensive detection. Employing multiple circRNA detection tools and prioritizing back-splice junctions identified by several algorithms enhances confidence in the selection of candidates. These recommendations, validated across diverse datasets and tissue types, provide generalizable principles for robust circRNA analysis.
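The multi-tool recommendation boils down to intersecting back-splice junction (BSJ) calls; a toy sketch using set operations over (chrom, start, end, strand) tuples (the tool names refer to real circRNA callers, but the calls below are made up):

```python
from collections import Counter

# Hypothetical BSJ calls from three detection tools.
ciri2        = {("chr1", 1200345, 1208912, "+"), ("chr2", 44031, 51877, "-")}
circexplorer = {("chr1", 1200345, 1208912, "+"), ("chr9", 301554, 310002, "+")}
dcc          = {("chr1", 1200345, 1208912, "+"), ("chr2", 44031, 51877, "-")}

counts = Counter()
for calls in (ciri2, circexplorer, dcc):
    counts.update(calls)

# Keep junctions supported by at least two algorithms.
consensus = {bsj for bsj, n in counts.items() if n >= 2}
print(consensus)
```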
bioinformatics2026-02-04v1Ophiuchus-Ab: A Versatile Generative Foundation Model for Advanced Antibody-Based Immunotherapy
Zhu, Y.; Ma, J.; Yin, M.; Wu, J.; Tang, L.; Zhang, Z.; Li, Q.; Feng, S.; Liu, H.; Qin, T.; Yan, J.; Hsieh, C.-Y.; Hou, T.AI Summary
- The study addresses the challenge of antibody design by modeling the sequence space of paired heavy and light chains to understand inter-chain dependencies.
- Ophiuchus-Ab, a generative foundation model, was developed using a diffusion language modeling framework, trained on large-scale paired antibody repertoires.
- This model excels in tasks like CDR infilling, antibody humanization, and light-chain pairing, and predicts antibody properties like developability and binding affinity, enhancing antibody-based immunotherapy.
Abstract
Antibodies exhibit extraordinary specificity and diversity in antigen recognition and have become a central class of therapeutics across a wide range of diseases. Despite this clinical success, antibody design remains fundamentally challenging. Antibody function emerges from intricate and highly coupled interactions between heavy and light chains, which complicate sequence-function relationships and limit the rational design of developable antibodies. Here, we reveal that modeling antibody sequence space at the level of paired heavy and light chains is essential to faithfully capture inter-chain dependencies, enabling a deeper understanding of antibody function and facilitating antibody discovery. We present Ophiuchus-Ab, a generative foundation model pre-trained on large-scale paired antibody repertoires within a diffusion language modeling framework, unifying antibody generation and representation learning in a single probabilistic formulation. This framework excels at diverse antibody design tasks, including CDR infilling, antibody humanization, and light-chain pairing. Beyond generation, diffusion-based pre-training yields transferable representations that enable accurate prediction of antibody properties, including developability, binding affinity, and specificity, even in low-data regimes. Together, these results establish Ophiuchus-Ab as a versatile foundation model for modeling antibodies, providing a foundation for next-generation antibody-based immunotherapy.
bioinformatics2026-02-04v1Generative deep learning expands apo RNA conformational ensembles to include ligand-binding-competent cryptic conformations: a case study of HIV-1 TAR
Kurisaki, I.; Hamada, M.AI Summary
- The study used Molearn, a hybrid molecular-dynamics-generative deep-learning model, to explore cryptic conformations of apo HIV-1 TAR RNA that could bind ligands.
- Molearn was trained on apo TAR conformations and generated a diverse ensemble, from which potential MV2003-binding conformations were identified.
- Docking simulations showed these conformations had RNA-ligand interaction scores similar to NMR-derived complexes, demonstrating the model's ability to predict ligand-binding competent RNA states.
Abstract
RNA plays vital roles in diverse biological processes and represents an attractive class of therapeutic targets. In particular, cryptic ligand-binding sites--absent in apo structures but formed upon conformational rearrangement--offer high specificity for RNA-ligand recognition, yet remain rare among experimentally-resolved RNA-ligand complex structures and difficult to predict in silico. RNA-targeted structure-based drug design (SBDD) is therefore limited by challenges in sampling cryptic states. Here, we apply Molearn, a hybrid molecular-dynamics-generative deep-learning model, to expand apo RNA conformational ensembles toward cryptic states. Focusing on the paradigmatic HIV-1 TAR-MV2003 system, Molearn was trained exclusively on apo TAR conformations and used to generate a diverse ensemble of TAR structures. Candidate cryptic MV2003-binding conformations were subsequently identified using post-generation geometric analyses. Docking simulations of these conformations with MV2003 yielded binding poses with RNA-ligand interaction scores comparable to those of NMR-derived complexes. Notably, this work provides the first demonstration that a generative modeling framework can access cryptic RNA conformations that are ligand-binding competent and have not been recovered in prior molecular-dynamics and deep-learning studies. Finally, we discuss current limitations in scalability and systematic detection, including application to the Internal Ribosome Entry Site, and outline future directions toward RNA-targeted SBDD.
bioinformatics2026-02-03v6GCP-VQVAE: A Geometry-Complete Language for Protein 3D Structure
Pourmirzaei, M.; Morehead, A.; Esmaili, F.; Ren, J.; Pourmirzaei, M.; Xu, D.AI Summary
- The study introduces GCP-VQVAE, a tokenizer using SE(3)-equivariant GCPNet to convert protein structures into discrete tokens while preserving chirality and orientation.
- Trained on 24 million protein structures, GCP-VQVAE achieves state-of-the-art performance with backbone RMSDs of 0.4377 Å, 0.5293 Å, and 0.7567 Å on CAMEO2024, CASP15, and CASP16 datasets respectively.
- On a zero-shot set of 1,938 new structures, it showed robust generalization with a backbone RMSD of 0.8193 Å and TM-score of 0.9673, and offers significantly reduced latency compared to previous models.
Abstract
Converting protein tertiary structure into discrete tokens via vector-quantized variational autoencoders (VQ-VAEs) creates a language of 3D geometry and provides a natural interface between sequence and structure models. While pose invariance is commonly enforced, retaining chirality and directional cues without sacrificing reconstruction accuracy remains challenging. In this paper, we introduce GCP-VQVAE, a geometry-complete tokenizer built around a strictly SE(3)-equivariant GCPNet encoder that preserves orientation and chirality of protein backbones. We vector-quantize rotation/translation-invariant readouts that retain chirality into a 4,096-token vocabulary, and a transformer decoder maps tokens back to backbone coordinates via a 6D rotation head trained with SE(3)-invariant objectives. Building on these properties, we train GCP-VQVAE on a corpus of 24 million monomer protein backbone structures gathered from the AlphaFold Protein Structure Database. On the CAMEO2024, CASP15, and CASP16 evaluation datasets, the model achieves backbone RMSDs of 0.4377 Å, 0.5293 Å, and 0.7567 Å, respectively, and achieves 100% codebook utilization on a held-out validation set, substantially outperforming prior VQ-VAE-based tokenizers and achieving state-of-the-art performance. Beyond these benchmarks, on a zero-shot set of 1,938 completely new experimental structures, GCP-VQVAE attains a backbone RMSD of 0.8193 Å and a TM-score of 0.9673, demonstrating robust generalization to unseen proteins. Lastly, we show that the Large and Lite variants of GCP-VQVAE are substantially faster than the previous SOTA (AIDO), reaching up to ~408x and ~530x lower end-to-end latency, while remaining robust to structural noise. We make the GCP-VQVAE source code, zero-shot dataset, and its pretrained weights fully open for the research community: https://github.com/mahdip72/vq_encoder_decoder
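The quantization bottleneck shared by all VQ-VAE tokenizers is a nearest-codebook lookup with a straight-through gradient; a minimal PyTorch sketch (4,096 codes as in the paper, other dimensions hypothetical; not GCP-VQVAE's implementation):

```python
import torch

def vector_quantize(z, codebook):
    """Nearest-codebook lookup with a straight-through gradient.
    z: (n, d) encoder outputs; codebook: (K, d), e.g. K = 4096."""
    d2 = ((z ** 2).sum(1, keepdim=True)
          - 2 * z @ codebook.T
          + (codebook ** 2).sum(1))            # squared distances to all codes
    idx = d2.argmin(dim=1)                     # discrete token per input vector
    zq = codebook[idx]
    zq_st = z + (zq - z).detach()              # forward: zq; backward: grads to z
    commit = ((zq.detach() - z) ** 2).mean()   # commitment loss term
    return zq_st, idx, commit

z = torch.randn(10, 64, requires_grad=True)    # toy encoder outputs
codebook = torch.randn(4096, 64)               # 4,096-token vocabulary
zq, tokens, commit = vector_quantize(z, codebook)
```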
bioinformatics2026-02-03v3Transcriptomic and protein analysis of human cortex reveals genes and pathways linked to NPTX2 disruption in Alzheimer's disease
Lao, Y.; Xiao, M.-F.; Ji, S.; Piras, I. S.; Kim, K.; Bonfitto, A.; Song, S.; Aldabergenova, A.; Sloan, J.; Trejo, A.; Geula, C.; Na, C.-H.; Rogalski, E. J.; Kawas, C. H.; Corrada, M. M.; Serrano, G. E.; Beach, T. G.; Troncoso, J. C.; Huentelman, M. J.; Barnes, C. A.; Worley, P. F.; Colantuoni, C.AI Summary
- This study used bulk RNA sequencing and targeted proteomics on human cortex samples to explore genes and pathways associated with NPTX2 disruption in Alzheimer's disease (AD).
- NPTX2 expression was significantly reduced in AD, correlating with BDNF, VGF, SST, and SCG2, indicating a role in synaptic and mitochondrial functions.
- In AD, NPTX2-related synaptic and mitochondrial pathways weakened, while stress-linked transcriptional regulators increased, suggesting a shift in regulatory dynamics.
Abstract
The expression of NPTX2, a neuronal immediate early gene (IEG) essential for excitatory-inhibitory balance, is altered in the earliest stages of cognitive decline that precede Alzheimer's disease (AD). Here, we use NPTX2 as a point of reference to identify genes and pathways linked to its role in AD onset and progression. We performed bulk RNA sequencing on 575 middle temporal gyrus (MTG) samples across four cohorts, together with targeted proteomics in 135 of these same samples, focusing on 20 curated proteins spanning synaptic, trafficking, lysosomal, and regulatory categories. NPTX2 RNA and protein were significantly reduced in AD, and to a lesser extent in mild cognitive impairment (MCI) samples. RNA expression of BDNF, VGF, SST, and SCG2 correlated with both NPTX2 mRNA and protein levels. We identified NPTX2-correlated synaptic and mitochondrial programs that were negatively correlated with lysosomal and chromatin/stress modules. Gene set enrichment analysis (GSEA) of NPTX2 correlations across all samples confirmed broad alignment with synaptic and mitochondrial compartments, and more NPTX2-specific associations with proteostasis and translation regulator pathways, all of which were weakened in AD. In contrast, correlation of NPTX2 protein with transcriptomic profiles revealed negative associations with stress-linked transcription regulator RNAs (FOXJ1, ZHX3, SMAD5, JDP2, ZIC4), which were strengthened in AD. These results position NPTX2 as a hub of an activity-regulated "plasticity cluster" (BDNF, VGF, SST, SCG2) that encompasses interneuron function and is embedded on a neuronal/mitochondrial integrity axis that is inversely coupled to lysosomal and chromatin-stress programs. In AD, these RNA-level correlations broadly weaken, and stress-linked transcriptional regulators become more prominent, suggesting a role in NPTX2 loss of function. Individual gene-level data from the bulk RNA-seq in this study can be freely explored at [INSERT LINK].
bioinformatics2026-02-03v2Informative Missingness in Nominal Data: A Graph-Theoretic Approach to Revealing Hidden Structure
Zangene, E.; Schwammle, V.; JAFARI, M.AI Summary
- This study introduces a graph-theoretic approach to analyze missing data in nominal datasets, treating missing values as informative signals rather than gaps.
- By constructing bipartite graphs from nominal variables, the method reveals hidden structures through modularity, nestedness, and similarity analysis.
- Applied across various domains, the approach showed that missing data patterns can distinguish between random and non-random missingness, enhancing structural understanding and aiding in tasks like clustering.
Abstract
Missing data is often treated as a nuisance, routinely imputed or excluded from statistical analyses, especially in nominal datasets where its structure cannot be easily modeled. However, the form of missingness itself can reveal hidden relationships, substructures, and biological or operational constraints within a dataset. In this study, we present a graph-theoretic approach that reinterprets missing values not as gaps to be filled, but as informative signals. By representing nominal variables as nodes and encoding observed or missing associations as edges, we construct both weighted and unweighted bipartite graphs to analyze modularity, nestedness, and projection-based similarities. This framework enables downstream clustering and structural characterization of nominal data based on the topology of observed and missing associations; edge prediction via multiple imputation strategies is included as an optional downstream analysis to evaluate how well inferred values preserve the structure identified in the non-missing data. Across a series of biological, ecological, and social case studies, including proteomics data, the BeatAML drug screening dataset, ecological pollination networks, and HR analytics, we demonstrate that the structure of missing values can be highly informative. These configurations often reflect meaningful constraints and latent substructures, providing signals that help distinguish between data missing at random and not at random. When analyzed with appropriate graph-based tools, these patterns can be leveraged to improve the structural understanding of data and provide complementary signals for downstream tasks such as clustering and similarity analysis. Our findings support a conceptual shift: missing values are not merely analytical obstacles but valuable sources of insight that, when properly modeled, can enrich our understanding of complex nominal systems across domains.
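The core construction can be sketched with networkx: encode missing entries of a nominal table as edges of a sample-variable bipartite graph, then project to see which samples share missingness patterns (toy data; the paper additionally analyzes modularity and nestedness):

```python
import pandas as pd
import networkx as nx
from networkx.algorithms import bipartite

# Toy nominal table with informative missingness (values are hypothetical).
df = pd.DataFrame({"habitat":    ["forest", None,   "meadow", None],
                   "pollinator": ["bee",    "moth", None,     "moth"]},
                  index=["s1", "s2", "s3", "s4"])

# Bipartite graph: samples on one side, variables on the other; draw an
# edge wherever a value is MISSING, treating missingness as the signal.
B = nx.Graph()
B.add_nodes_from(df.index, bipartite=0)
B.add_nodes_from(df.columns, bipartite=1)
for s in df.index:
    for v in df.columns:
        if pd.isna(df.loc[s, v]):
            B.add_edge(s, v)

# Project onto samples: samples sharing missing variables become linked,
# ready for clustering, modularity, or nestedness analysis.
proj = bipartite.weighted_projected_graph(B, list(df.index))
print(proj.edges(data=True))   # e.g. s2 -- s4 (both missing 'habitat')
```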
bioinformatics2026-02-03v2Automated Segmentation of Kidney Nephron Structures by Deep Learning Models on Label-free Autofluorescence Microscopy for Spatial Multi-omics Data Acquisition and Mining
Patterson, N. H.; Neumann, E. K.; Sharman, K.; Allen, J. L.; Harris, R. C.; Fogo, A. B.; deCaestecker, M. P.; Van de Plas, R.; Spraggins, J. M.AI Summary
- Developed deep learning models for automated segmentation of kidney nephron structures using label-free autofluorescence microscopy.
- Models accurately segmented functional tissue units and gross kidney morphology with F1-scores >0.85 and Dice-Sorensen coefficients >0.80.
- Enabled quantitative association of lipids with segmented structures and spatial transcriptomics data acquisition from collecting ducts, showing differential gene expression in medullary regions.
Abstract
Automated spatial segmentation models can enrich spatio-molecular omics analyses by providing a link to relevant biological structures. We developed segmentation models that use label-free autofluorescence (AF) microscopy to recognize multicellular functional tissue units (FTUs) (glomerulus, proximal tubule, descending thin limb, ascending thick limb, distal tubule, and collecting duct) and gross morphological structures (cortex, outer medulla, and inner medulla) in the human kidney. Annotations were curated using highly specific multiplex immunofluorescence and transferred to co-registered AF for model training. All FTUs (except the descending thin limb) and gross kidney morphology were segmented with high accuracy: >0.85 F1-score, and Dice-Sorensen coefficients >0.80, respectively. This workflow allowed lipids, profiled by imaging mass spectrometry, to be quantitatively associated with segmented FTUs. The segmentation masks were also used to acquire spatial transcriptomics data from collecting ducts. Consistent with previous literature, we demonstrated differing transcript expression of collecting ducts in the inner and outer medulla.
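For reference, the Dice-Sorensen coefficient reported above is straightforward to compute from binary masks; a small numpy example with toy rectangles standing in for predicted and annotated structures:

```python
import numpy as np

def dice_sorensen(pred, truth):
    """Dice-Sorensen coefficient: 2|A & B| / (|A| + |B|) for binary masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    denom = pred.sum() + truth.sum()
    return 2.0 * np.logical_and(pred, truth).sum() / denom if denom else 1.0

# Toy masks standing in for a predicted vs. annotated structure.
truth = np.zeros((64, 64), dtype=bool); truth[20:40, 20:40] = True
pred = np.zeros((64, 64), dtype=bool); pred[22:42, 20:40] = True
print(round(dice_sorensen(pred, truth), 3))   # 0.9
```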
bioinformatics2026-02-03v2SpaCEy: Discovery of Functional Spatial Tissue Patterns by Association with Clinical Features Using Explainable Graph Neural Networks
Rifaioglu, A. S.; Ervin, E. H.; Sarigun, A.; Germen, D.; Bodenmiller, B.; Tanevski, J.; Saez-Rodriguez, J.AI Summary
- SpaCEy uses explainable graph neural networks to analyze spatial tissue patterns from molecular marker expression, linking these patterns to clinical outcomes without predefined cell types.
- Applied to lung cancer, SpaCEy identified spatial cell arrangements and protein marker expressions linked to disease progression.
- In breast cancer datasets, SpaCEy stratified patients by overall survival, revealing key spatial patterns of protein markers across and within clinical subtypes.
Abstract
Tissues are complex ecosystems tightly organized in space. This organization influences their function, and its alteration underpins multiple diseases. Spatial omics allows us to profile its molecular basis, but how to leverage these data to link spatial organization and molecular patterns to clinical practice remains a challenge. We present SpaCEy (SpatialClinicalExplainability), an explainable graph neural network that uncovers organizational tissue patterns predictive of clinical outcomes. SpaCEy learns directly from molecular marker expression by modelling tissues as spatial graphs of cells and their interactions, without requiring predefined cell types or anatomical regions. Its embeddings capture intercellular relationships and molecular dependencies that enable accurate prediction of variables such as overall survival and disease progression. SpaCEy integrates a specialized explainer module that reveals recurring spatial patterns of cell organisation and coordinated marker expression that are most relevant to predictions of the models. Applied to a spatially resolved proteomic lung cancer cohort, SpaCEy discovers distinct spatial arrangements of cells together with coordinated expression of protein markers associated with disease progression. Across multiple breast cancer proteomic datasets, it consistently stratifies patients according to overall survival, both across and within established clinical subtypes. SpaCEy also highlights spatial patterns of a small set of key protein markers underlying this patient stratification.
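A common way to model tissue as a spatial graph of cells, as SpaCEy does, is a k-nearest-neighbour graph over cell centroids; a brief scipy sketch (k and the coordinates are placeholders, and SpaCEy's exact construction may differ):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(7)
coords = rng.uniform(0, 1000, size=(500, 2))   # toy cell centroids (microns)

# k-nearest-neighbour spatial graph: each cell is a node and its k nearest
# cells are its neighbours in the tissue graph fed to the GNN.
k = 6
_, nbrs = cKDTree(coords).query(coords, k=k + 1)  # first hit is the cell itself
edges = [(i, int(j)) for i in range(len(coords)) for j in nbrs[i, 1:]]
print(len(edges))   # 500 cells * 6 neighbours = 3000 directed edges
```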
bioinformatics2026-02-03v2