Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
Information-Content-Informed Kendall-tau Correlation Methodology: Interpreting Missing Values in Metabolomics as Potentially Useful Information
Flight, R. M.; Bhatt, P. S.; Moseley, H. N. B.
AI Summary
- The study introduces the Information-Content-Informed Kendall-tau (ICI-Kt) methodology to handle missing values in metabolomics data by treating them as informative, particularly when they are left-censored due to detection limits.
- Using simulated and over 700 experimental datasets, the approach was shown to enhance the interpretation of correlation by including these missing values, improving outlier detection and feature network construction.
- The methodology is implemented in R and Python, available on GitHub, facilitating fast calculations for large datasets.
Abstract
Background: Almost all correlation measures currently available are unable to directly handle missing values. Typically, missing values are either ignored completely by removing them or are imputed and used in the calculation of the correlation coefficient. In either case, the correlation value will be impacted based on a perspective that the missing data represents no useful information. However, missing values occur in real data sets for a variety of reasons. In metabolomics data sets, a major reason for missing values is that a specific measurable phenomenon falls below the detection limits of the analytical instrumentation (left-censored values). These missing data are not missing at random, but represent potentially useful information by virtue of their "missingness" at one end of the data distribution. Methods: To include this information due to left-censored missingness, we propose the information-content-informed Kendall-tau (ICI-Kt) methodology. We developed a statistical test and then show that most missing values in metabolomics datasets are the result of left-censorship. Next, we show how left-censored missing values can be included within the definition of the Kendall-tau correlation coefficient, and how that inclusion leads to an interpretation of information being added to the correlation. We also implement calculations for additional measures of theoretical maxima and pairwise completeness that add further layers of information interpretation in the methodology. Results: Using both simulated and over 700 experimental data sets from The Metabolomics Workbench, we demonstrate that the ICI-Kt methodology allows for the inclusion of left-censored missing data values as interpretable information, enabling both improved determination of outlier samples and improved feature-feature network construction. Conclusions: We provide explicitly parallel implementations in both R and Python that allow fast calculations of all the variables used when applying the ICI-Kt methodology on large numbers of samples. The ICI-Kt methods are available as an R package and Python module on GitHub at https://github.com/moseleyBioinformaticsLab/ICIKendallTau and https://github.com/moseleyBioinformaticsLab/icikt, respectively.
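To make the core idea concrete, here is a minimal Python sketch of a left-censoring-aware Kendall-tau: missing values are ranked below every observed value and tied with one another, so missing-observed pairs still contribute ordering information. This is a simplification rather than the authors' implementation; the ICI-Kt package adds tie corrections, theoretical maxima, and pairwise-completeness measures on top.

```python
import numpy as np
from itertools import combinations

def ici_kt_sketch(x, y):
    """Kendall-tau with NaNs treated as left-censored: ranked below every
    observed value and tied with other NaNs (simplified tau-a variant)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xr = np.where(np.isnan(x), np.nanmin(x) - 1.0, x)  # sentinel below all data
    yr = np.where(np.isnan(y), np.nanmin(y) - 1.0, y)
    n = len(xr)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = np.sign(xr[i] - xr[j]) * np.sign(yr[i] - yr[j])
        concordant += s > 0
        discordant += s < 0
        # tied pairs, including NaN-NaN, count toward neither
    return (concordant - discordant) / (n * (n - 1) / 2)

# Two metabolites censored at low abundance still correlate positively
print(ici_kt_sketch([1, 2, np.nan, 4, 5], [2, np.nan, np.nan, 8, 10]))  # 0.7
```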
bioinformatics · 2026-02-17 · v5
ConNIS and labeling instability: new statistical methods for improving the detection of essential genes in TraDIS libraries
Hanke, M.; Harten, T.; Foraita, R.
AI Summary
- The study introduces ConNIS, a new method for determining gene essentiality in TraDIS data by analyzing the probability of consecutive non-insertion sites within genes.
- ConNIS was shown to outperform existing methods in simulations and real-world scenarios, especially at low to medium insertion densities.
- A subsample-based instability criterion was developed to set methodologically sound parameter values, enhancing the precision of TraDIS analyses.
Abstract
The identification of essential genes in Transposon Directed Insertion Site Sequencing (TraDIS) data relies on the assumption that transposon insertions occur randomly in non-essential regions, leaving essential genes largely insertion-free. While intragenic insertion-free sequences have been considered as a reliable indicator for gene essentiality, so far, no exact probability distribution for these sequences has been proposed. Further, many methods require setting thresholds or parameter values a priori without providing any statistical basis, limiting the comparability of results. Here, we introduce Consecutive Non-Insertion Sites (ConNIS), a novel method for gene essentiality determination. ConNIS provides an analytic solution for the probability of observing insertion-free sequences within genes of given length and considers variation in insertion density across the genome. Based on an extensive simulation study and real-world scenarios, ConNIS was found to be superior to prevalent state-of-the-art methods, particularly in scenarios with a low or medium insertion density. In addition, our results show that the precision of existing methods can be improved by incorporating a simple weighting factor for the genome-wide insertion density. To set methodologically sound parameter and threshold values for TraDIS methods, a subsample-based instability criterion was developed. Application of this criterion in real and synthetic data settings demonstrated its effectiveness in selecting well-suited parameter/threshold values across methods. A ready-to-use R package and an interactive web application are provided to facilitate application and reproducibility.
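The statistical intuition is compact enough to sketch. Under a simplifying assumption that insertions hit each candidate site independently at the genome-wide density (ConNIS itself derives an exact distribution for runs of consecutive non-insertion sites and accounts for local density variation), the chance probability of an insertion-free gene serves as a one-sided essentiality p-value:

```python
def p_insertion_free(n_sites, insertion_density):
    """P(gene with n_sites candidate sites receives no insertion) under an
    independent per-site Bernoulli model; a toy stand-in for the exact
    ConNIS distribution over consecutive non-insertion sites."""
    return (1.0 - insertion_density) ** n_sites

# 600 candidate sites at 5% genome-wide density: ~4e-14, strong evidence
print(p_insertion_free(600, 0.05))
# At 1% density the same gene gives ~2e-3, which is far less decisive;
# this is the low/medium-density regime where method choice matters most.
print(p_insertion_free(600, 0.01))
```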
bioinformatics · 2026-02-17 · v3
Minimizer Density revisited: Models and Multiminimizers
Ingels, F.; Robidou, L.; Martayan, I.; Marchet, C.; Limasset, A.
AI Summary
- This study revisits the concept of density in k-mer-based sequence analysis by linking it to the gap between selected positions, proposing a probabilistic model that assumes equally distributed gaps.
- A novel technique, multiminimizers, is introduced where each k-mer is associated with multiple candidate minimizers, leading to a semi-local scheme that reduces density at the cost of increased computation time.
- The study also introduces deduplicated density, showing that multiminimizers improve this metric, though globally minimizing it is NP-complete, and provides an efficient SIMD-accelerated Rust implementation.
Abstract
High-throughput sequence analysis commonly relies on k-mers (words of fixed length k) to remain tractable at modern scales. These k-mer-based pipelines can employ a sampling step, which in turn allows grouping consecutive k-mers into larger strings to improve data locality. Although other sampling strategies exist, local schemes have become standard: such schemes map each k-mer to the position of one of its characters. A key performance measure of these schemes is their density, defined as the expected fraction of selected positions. The most widely used local scheme is the minimizer scheme: given an integer m ≤ k, a minimizer scheme associates each k-mer to the starting position of one of its m-mers, called its minimizer. Being a local scheme, the minimizer scheme guarantees covering all k-mers of a sequence, with a maximal gap between selected positions of w = k - m + 1. Recent works have established near-tight lower bounds on achievable density under standard assumptions for local schemes, and state-of-the-art schemes now operate close to these limits, suggesting that further improvements under the classical notion of density will face diminishing returns. Hence, in this work, we aim to revisit the notion of density and broaden its scope. As a first contribution, we draw a link between density and the gap between consecutive selected positions. We propose a probabilistic model allowing us to establish that the density of a local scheme is exactly the inverse of the expected gap between the positions it selects, under the minimal and only assumption that said gaps are somehow equally distributed. We emphasize here that our model makes no assumptions about how positions are selected, unlike the classical models in the literature. Our result introduces a novel method for computing the density of a local scheme, extending beyond classical settings. Based on this analysis, we introduce a novel technique, named multiminimizers, by associating each k-mer with a bounded set of candidate minimizers rather than a single one. The candidate furthest away (in a precise sense defined in the article) is selected. Since the decision is made by taking advantage of a context beyond a single k-mer, this technique is not a local scheme - as a result, we propose the concept of semi-local schemes, which provide a broader framework within which our method fits. Using the multiminimizer trick on a local scheme reduces its density at the expense of a controlled increase in computation time. We show that this method, when applied to random (hash-based) minimizers and to open-closed mod-minimizers, achieves asymptotically optimal density representing, to our knowledge, the first construction converging to this limit. Our third contribution is the introduction of the deduplicated density, which measures the fraction of distinct minimizers used to cover all k-mers of a set of sequences. While this problem has gained traction in applications such as assembly, filtering, and pattern matching, standard minimizer schemes are often used as a proxy, blurring the distinction between the two objectives (minimizing the number of selected positions or the number of selected minimizers). Although related to the classical notion of density, deduplicated density differs in both definition and suitable constructions, and must be analyzed in its own right, together with its precise connections to standard density.
We show that multiminimizers can also improve this metric, but that globally minimizing deduplicated density in this setting is NP-complete, and we instead propose a local heuristic with strong empirical behavior. Finally, we show that multiminimizers can be computed efficiently, and provide a SIMD-accelerated Rust implementation together with proofs of concept demonstrating reduced memory footprints on core sequence-analysis tasks. We conclude with open theoretical and practical questions that remain to be addressed in the area of density.
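As a baseline for the density notion discussed above, a plain random (hash-based) minimizer scheme is easy to measure empirically; its expected density is roughly 2/(w+1). The sketch below only illustrates the quantity being optimized, not the paper's multiminimizer or semi-local constructions:

```python
import random

def minimizer_positions(seq, k, m, seed=0):
    """Random minimizer scheme: each k-mer window selects the start of the
    m-mer with the smallest random hash; ties break leftmost."""
    rnd, hashes = random.Random(seed), {}
    w = k - m + 1
    selected = set()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        j = min(range(w),
                key=lambda p: hashes.setdefault(window[p:p + m], rnd.random()))
        selected.add(i + j)
    return selected

rng = random.Random(1)
seq = "".join(rng.choice("ACGT") for _ in range(100_000))
k, m = 31, 21
density = len(minimizer_positions(seq, k, m)) / (len(seq) - k + 1)
print(f"empirical {density:.4f} vs ~2/(w+1) = {2 / (k - m + 2):.4f}")
```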
bioinformatics · 2026-02-17 · v2
ProteomeLM: A proteome-scale language model enables accurate and rapid prediction of protein-protein interactions and gene essentiality across taxa
Malbranke, C.; Zalaffi, G. P.; Bitbol, A.-F.
AI Summary
- ProteomeLM, a transformer-based model, was developed to predict protein-protein interactions (PPI) and gene essentiality by learning from entire proteomes across various species.
- The model uses contextualized protein representations to predict PPI with high accuracy and speed, surpassing traditional coevolution-based methods.
- ProteomeLM-PPI and ProteomeLM-Ess, extensions of ProteomeLM, achieve state-of-the-art performance in PPI prediction and gene essentiality prediction across different taxa.
Abstract
Language models trained on biological sequences are advancing inference tasks from the scale of single proteins to that of genomic neighborhoods. Here, we introduce ProteomeLM, a transformer-based language model that uniquely operates on entire proteomes from species spanning the tree of life. ProteomeLM is trained to reconstruct masked protein embeddings using the whole proteomic context, yielding contextualized protein representations that reflect proteome-scale functional constraints. Notably, ProteomeLM's attention coefficients encode protein-protein interactions (PPI), despite being trained without interaction labels. Furthermore, it enables interactome-wide PPI screening that is substantially more accurate, and orders of magnitude faster, than amino-acid coevolution-based methods. We further develop ProteomeLM-PPI, a supervised model that combines ProteomeLM embeddings and attention coefficients to achieve state-of-the-art PPI prediction across benchmarks and species. Finally, we introduce ProteomeLM-Ess, a supervised gene essentiality predictor that generalizes across diverse taxa. Our results demonstrate the potential of proteome-scale language models for addressing function and interactions at the organism level.
bioinformatics · 2026-02-17 · v2
Compressed inverted indexes for scalable sequence similarity
Ingels, F.; Vandamme, L.; Girard, M.; Agret, C.; Cazaux, B.; Limasset, A.
AI Summary
- The study addresses the scalability limits of MinHash-derived sketching methods for nucleotide sequence similarity by introducing a novel framework using compressed inverted indexes, which match the space complexity of forward indexes.
- They developed algorithms for efficient all-vs-all comparisons and introduced early-pruning schemes to optimize time and memory usage, maintaining high accuracy in similarity searches.
- Onika, the resulting tool, significantly accelerates large-scale similarity searches and reduces resource usage, as demonstrated on various datasets, while maintaining sensitivity at practical similarity thresholds.
Abstract
Modern sequencing continues to drive explosive growth of nucleotide sequence archives, pushing MinHash-derived sketching methods to their practical scalability limits. State-of-the-art tools such as Mash, Dashing2, and Bindash2 provide compact sketches and accurate similarity estimates for large collections, yet ultimately rely on forward indexes that materialize sketches as explicit fingerprint vectors. As a result, large-scale similarity search and exhaustive collection-versus-collection comparison still incur quadratic resource usage. In this work, we revisit the architecture of sketch-based indexes and provide a novel framework for scalable similarity search over massive sketch collections. Our first contribution is a formal cost model for sketch comparison, within which we prove that inverted indexes on sketch fingerprints, equipped with suitably compressed posting lists, achieve the same asymptotic space complexity as standard forward indexes, thereby eliminating the perceived memory penalty traditionally associated with inverted indexes. Building on this model, we design algorithms for all-vs-all comparison between two inverted indexes whose running times are proportional to the total number of matching sketch positions, leading to output-sensitive optimality and enabling efficient large-scale similarity comparisons. Our second contribution leverages the prevalence of similarity thresholds in downstream applications such as clustering, redundancy filtering, and database search. We introduce two early-pruning schemes: an exact criterion that safely eliminates pairs guaranteed not to reach a target Jaccard similarity, and a probabilistic strategy that exploits partial match statistics to discard pairs unlikely to exceed this threshold. Together, these schemes address both time and memory bottlenecks while maintaining rigorous guarantees on retained high-similarity pairs and providing explicit control of the false-rejection probability for the probabilistic variant. Finally, we instantiate these ideas in Onika, an open-source Rust implementation based on compressed inverted posting lists available at github.com/Malfoy/Onika. Onika incorporates a similarity-aware document reordering strategy that restructures sketch identifiers to further shrink index size and improve locality, particularly for redundant collections. Experiments on large bacterial genome repositories, synthetic low-redundancy benchmarks, and long-read HiFi sequencing datasets demonstrate that Onika matches or improves upon the sketch sizes of leading tools while accelerating large-scale search and collection-versus-collection comparisons by up to several orders of magnitude in low-redundancy regimes, without compromising sensitivity at practically relevant similarity thresholds.
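The inverted-index idea at the core of the paper is compact: rather than comparing sketch vectors pairwise, index each (slot, fingerprint) pair and accumulate per-pair match counts from the postings lists, so work scales with the number of matching positions. The toy below omits the compression, document reordering, and pruning machinery that Onika adds:

```python
from collections import defaultdict

def all_vs_all(sketches, threshold):
    """Output-sensitive all-vs-all comparison of MinHash-style sketches via
    an inverted index: only co-occurring (slot, fingerprint) entries create
    work. Returns pairs with estimated Jaccard (matches / s) >= threshold."""
    index = defaultdict(list)
    for doc, sketch in enumerate(sketches):
        for slot, fp in enumerate(sketch):
            index[(slot, fp)].append(doc)
    matches = defaultdict(int)
    for docs in index.values():
        for a in range(len(docs)):
            for b in range(a + 1, len(docs)):
                matches[(docs[a], docs[b])] += 1
    s = len(sketches[0])
    return {pair: m / s for pair, m in matches.items() if m / s >= threshold}

sketches = [(1, 7, 3, 9), (1, 7, 4, 9), (2, 8, 4, 0)]
print(all_vs_all(sketches, threshold=0.5))  # {(0, 1): 0.75}
```

The exact pruning criterion mentioned in the abstract slots in naturally: scanning slots in order, a pair with m matches after slot i can reach at most m + (s - i - 1) matches, so any pair that can no longer attain the threshold can be discarded early.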
bioinformatics · 2026-02-17 · v2
A Discrete Language of Protein Words for Functional Discovery and Design
Guo, Z.; Wang, Z.; Chai, Y.; XU, K.; Li, M.; Li, W.; Ou, G.
AI Summary
- The study introduces a framework that discretizes protein sequences into a vocabulary of "Protein Words" to capture higher-order structural and functional signals, outperforming traditional residue-level models in tasks like remote homology and mutation effect prediction.
- Analysis across 54 species showed that these words correlate with evolutionary complexity, particularly in eukaryotic disordered regions.
- The framework identified ADMAP1 as a new regulator of sperm motility and enabled the design of functional cofilin variants, demonstrating its utility in both discovery and protein engineering.
Abstract
Proteins function through hierarchical modules, yet conventional models treat sequences as linear strings of residues, overlooking the recurrent multi-residue patterns, or "Protein Words," that govern biological architecture. We introduce a physics-aware framework that discretizes protein space into a learnable vocabulary derived from the evolutionary record. By encoding proteins as sequences of discrete "words," our model captures higher-order structural and functional signals inaccessible to residue-level models, achieving highly competitive performance against widely established baselines in remote homology and mutation effect prediction. Analysis across 54 species reveals that these words track evolutionary complexity, specifically identifying the expansion of eukaryotic disordered regions. We demonstrate the discovery potential of this semantic axis by identifying ADMAP1 as a previously uncharacterized regulator of sperm motility, validated via CRISPR-Cas9 knockout mice. Finally, this vocabulary enables programmable design, generating functional cofilin variants despite high sequence divergence. This work establishes a linguistically inspired framework for deciphering the dark proteome and engineering biological function.
bioinformatics · 2026-02-17 · v1
TITAN-BBB: Predicting BBB Permeability using Multi-Modal Deep-Learning Models
de Oliveira, G. B.; Saeed, F.
AI Summary
- TITAN-BBB uses a multi-modal deep-learning approach integrating tabular, image, and text features to predict blood-brain barrier (BBB) permeability.
- The model was trained on the largest aggregated BBB permeability dataset, achieving 86.5% balanced accuracy in classification and 0.436 mean absolute error in regression.
- TITAN-BBB outperformed existing models by 3.1% in accuracy and reduced regression error by 20%.
Abstract
Computational prediction of blood-brain barrier (BBB) permeability has emerged as a vital alternative to traditional experimental assays, which are often too resource-intensive and low-throughput to meet the demands of early-stage drug discovery. While early machine learning approaches have shown promise, integration of traditional chemical descriptors with deep learning embeddings remains an underexplored frontier. In this paper, we introduce TITAN-BBB, a multi-modal deep-learning architecture that utilizes tabular, image, and text-based features and combines them using attention mechanisms. To evaluate, we aggregated multiple literature sources to create the largest BBB permeability dataset to date, enabling robust training for both classification and regression tasks. Our results demonstrate that TITAN-BBB achieves 86.5% balanced accuracy on classification tasks and a mean absolute error of 0.436 for regression, outperforming the state-of-the-art by 3.1 percentage points in balanced accuracy and reducing the regression error by 20%. Our approach also outperforms state-of-the-art models in both classification and regression performance, demonstrating the benefits of combining deep and domain-specific representations. The source code is publicly available at https://github.com/pcdslab/TITAN-BBB. The inference-ready model is hosted on Hugging Face at https://huggingface.co/SaeedLab/TITAN-BBB, and the aggregated BBB permeability datasets are available at https://huggingface.co/datasets/SaeedLab/BBBP.
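As a rough illustration of the fusion idea only (the actual TITAN-BBB layers, feature extractors, and dimensions are not described here, so everything below is a hypothetical sketch), per-modality embeddings can be projected into a shared space and mixed with multi-head attention before a prediction head:

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Hypothetical attention-based fusion of tabular, image, and text
    embeddings into a binary BBB-permeability logit."""
    def __init__(self, dims, d_model=256):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d, d_model) for d in dims])
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.head = nn.Linear(d_model, 1)

    def forward(self, feats):  # feats: one (batch, dim_i) tensor per modality
        tokens = torch.stack([p(f) for p, f in zip(self.proj, feats)], dim=1)
        fused, _ = self.attn(tokens, tokens, tokens)  # modalities attend to each other
        return self.head(fused.mean(dim=1)).squeeze(-1)

model = AttentionFusion(dims=[200, 512, 768])
logits = model([torch.randn(8, 200), torch.randn(8, 512), torch.randn(8, 768)])
print(logits.shape)  # torch.Size([8])
```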
bioinformatics · 2026-02-17 · v1
FiCOPS: Hardware/Software Co-Design of FPGA Computational Framework for Mass Spectrometry-Based Peptide Database Search
Kumar, S.; Zambreno, J.; Khokhar, A.; Akram, S.; Saeed, F.
AI Summary
- The study aimed to enhance the speed and efficiency of peptide database search from mass spectrometry data by developing FiCOPS, an FPGA-based computational framework.
- FiCOPS was designed using a hardware/software co-design approach, focusing on parallelism and reducing computational bottlenecks in the database search algorithm.
- Testing on the Intel Stratix 10 FPGA showed FiCOPS achieved a 3.5x speed-up over CPU solutions and reduced power consumption by 3x and 5x compared to CPU and GPU solutions, respectively.
Abstract
Improving the speed and efficiency of database search algorithms that deduce peptides from mass spectrometry (MS) data has been an active area of research for more than three decades. The need for faster database search methods has rapidly increased due to the growing interest in studying non-model organisms, meta-proteomics, and proteogenomic data, which are notorious for their enormous search space. Poor scalability of serial algorithms with the growing size of the database and increasing parameters of post-translational modifications is a widely recognized problem. While high-performance computing techniques can be used on supercomputing machines, the need for real-time, on-the-instrument solutions necessitates the development of an efficient system-on-chip that optimizes design constraints such as cost, performance, and power of the system. To showcase that such a system can work, we present an FPGA-based computational framework called FiCOPS to accelerate database search using a hardware/software co-design methodology. First, we theoretically analyze the database-search algorithm (closed-search) to reveal opportunities for parallelism and uncover computational bottlenecks. We then design an FPGA-based architectural template to exploit parallelism inherent in the search workload. We also formulate an analytical performance model for the architecture template to perform rapid design space exploration and find a near-optimal accelerator configuration. Finally, we implement our design on the Intel Stratix 10 FPGA platform and evaluate it using real-world datasets. Our experiments demonstrate that FiCOPS achieves 3.5 times speed-up over existing CPU solutions and 3 times and 5 times reduction in power consumption compared to existing CPU and GPU solutions, respectively.
bioinformatics · 2026-02-17 · v1
ProtFlow: Flow Matching-based Protein Sequence Design with Comprehensive Protein Semantic Distribution Learning and High-quality Generation
Kong, Z.; Zhu, Y.; Xu, Y.; Yin, M.; Hou, T.; Wu, J.; Xu, H.; Hsieh, C.-Y.
AI Summary
- ProtFlow uses a flow matching algorithm to learn the comprehensive semantic distribution of protein sequences, addressing the limitations of existing models that focus on local statistics.
- The model incorporates a semantic integration network to reorganize protein representation space, enhancing global semantic capture.
- Experiments showed ProtFlow excels in generating high-quality peptides, particularly antimicrobial peptides with effective activity against various pathogens, including underrepresented species.
Abstract
Designing protein sequences with desired properties is a fundamental task in protein engineering. Recent advances in deep generative models have greatly accelerated this design process. However, most existing models face the issue of distribution centralization and focus on local compositional statistics of natural sequences instead of the global semantic organization of protein space, which confines their generation to specific regions of the distribution. These problems are amplified for functional proteins, whose sequence patterns strongly correlate with semantic representations and exhibit a long-tailed functional distribution, causing existing models to miss semantic regions associated with rare but essential functions. Here, we propose ProtFlow, a generative model designed for comprehensive semantic distribution learning of protein sequences, enabling high-quality sequence generation. ProtFlow employs a rectified flow matching algorithm to efficiently capture the underlying semantic distribution of the protein design manifold, and introduces a reflow technique enabling one-step sequence generation. We construct a semantic integration network to reorganize the representation space of large protein language models, facilitating stable and compact incorporation of global protein semantics. We pretrain ProtFlow on 2.6M peptide sequences and finetune it on antimicrobial peptides (AMPs), a representative class of therapeutic proteins exhibiting unevenly distributed activities across pathogen targets. Experiments show that ProtFlow outperforms state-of-the-art methods in generating high-quality peptides, and AMPs with desirable activity profiles across a range of pathogens, particularly against underrepresented bacterial species. These results demonstrate the effectiveness of ProtFlow in capturing the full training distribution and its potential as a general framework for computational protein design.
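The rectified flow matching objective underlying this kind of model is short to state: draw a straight-line interpolant between noise x0 and a data embedding x1, and regress a velocity network onto the constant displacement x1 - x0. A toy PyTorch sketch (dimensions and the network are invented; ProtFlow works in a protein-language-model latent space reorganized by its semantic integration network):

```python
import torch
import torch.nn as nn

def rectified_flow_loss(v_theta, x1):
    """One flow-matching step: x_t = (1 - t) x0 + t x1 along straight paths,
    with regression target x1 - x0 (the ODE velocity along the path)."""
    x0 = torch.randn_like(x1)       # noise endpoint
    t = torch.rand(x1.shape[0], 1)  # one time per sample in [0, 1)
    xt = (1 - t) * x0 + t * x1
    return nn.functional.mse_loss(v_theta(xt, t), x1 - x0)

d = 64  # assumed embedding dimension
net = nn.Sequential(nn.Linear(d + 1, 128), nn.SiLU(), nn.Linear(128, d))
v_theta = lambda x, t: net(torch.cat([x, t], dim=-1))
loss = rectified_flow_loss(v_theta, torch.randn(16, d))
loss.backward()  # ready for an optimizer step
```

Roughly speaking, the reflow technique the abstract mentions retrains the network on its own generated (x0, x1) couplings so the learned trajectories straighten enough for one-step sampling.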
bioinformatics · 2026-02-17 · v1
Diffusion Probabilistic Models for Missing-Wedge Correction in Cryo-Electron Tomography
Hasan, N.; Bertin, A.; Jonic, S.
AI Summary
- The study addresses the missing-wedge (MW) distortion in cryo-electron tomography by proposing MW-RaMViD, a method for generating unacquired 2D tilt images based on the RaMViD approach.
- MW-RaMViD was adapted for cryo-ET by incorporating MRC image format, floating-point pixel intensity, and a controlled inference protocol for MW correction.
- Evaluations on synthetic datasets showed that smaller step sizes and larger conditioning windows in MW-RaMViD reduce error accumulation and improve reconstruction fidelity, as measured by RMSE and Fourier Shell Correlation.
Abstract
Interpretation of 3D cryo-electron tomography (cryo-ET) reconstructions (tomograms) is hampered by the so-called missing-wedge (MW) distortions, which arise because tilt image series used for the reconstructions are acquired in a limited angular range. While many deep-learning approaches address the correction of the MW artifacts on the level of tomograms (3D volumes), the correction at the level of 2D tilt images (generation of unacquired images) remains underexplored. We propose MW-RaMViD, a 2D tilt-image generation method for MW correction, based on the Random-Mask Video Diffusion (RaMViD) method for predicting frames in natural videos. To adapt RaMViD for cryo-ET, we add MRC image-format support, floating-point pixel intensity representation, and a controlled inference protocol enabling both one-run and progressive MW completion (generating a small number of missing tilts per step using a sliding window). We evaluate the method on a synthetic noisy tilt-series dataset and study the effects of MW completion step size and conditioning sequence length. Results show that smaller step sizes and larger conditioning windows reduce error accumulation at higher tilt angles and improve reconstruction fidelity, which was measured by Root Mean Square Error on the image level and by Fourier Shell Correlation on the tomogram level.
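The progressive inference protocol can be sketched independently of the diffusion model: fill the wedge a few tilts per step, conditioning each step on a sliding window of the nearest known tilts so errors accumulate slowly toward high angles. The sampler below is a stand-in and every name is hypothetical:

```python
import numpy as np

def progressive_mw_completion(sample_step, acquired, missing, step=2, window=4):
    """Generate `step` missing tilt images at a time, conditioning on the
    `window` known (acquired or previously generated) tilts nearest in angle.
    `sample_step(context, targets)` stands in for one conditional sampling
    pass of the video-diffusion model."""
    known = dict(acquired)  # angle -> 2D image
    while True:
        todo = [a for a in sorted(missing, key=abs) if a not in known][:step]
        if not todo:
            break
        ctx = sorted(known, key=lambda a: min(abs(a - t) for t in todo))[:window]
        for angle, img in zip(todo, sample_step({c: known[c] for c in ctx}, todo)):
            known[angle] = img
    return known

fake = lambda ctx, targets: [np.mean(list(ctx.values()), axis=0) for _ in targets]
acquired = {a: np.random.rand(64, 64) for a in range(-60, 61, 3)}  # +/-60 deg
missing = [a for a in range(-90, 91, 3) if a not in acquired]      # the wedge
print(len(progressive_mw_completion(fake, acquired, missing)))     # 61 tilts
```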
bioinformatics · 2026-02-17 · v1
Evaluating Single-Cell Perturbation Response Models Is Far from Straightforward
Heidari, M.; Karimpour, M.; Srivatsa, S.; Montazeri, H.
AI Summary
- This study evaluates the performance of single-cell perturbation response models, highlighting the limitations of current evaluation metrics like correlation-based measures and distributional distances.
- Using cross-splitting, controlled noise experiments, and synthetic data, the research shows that metrics such as Wasserstein and Energy distances can misrepresent model performance due to issues like scale and dimensionality.
- The findings indicate that complex deep learning models often do not outperform simple baselines, suggesting a need for improved evaluation standards in developing reliable virtual-cell models.
Abstract
Predicting cellular responses to genetic and chemical perturbations remains a central challenge in single-cell biology and a key step toward building in silico virtual cells. The rapid growth of perturbation datasets and advances in deep-learning models have raised expectations for accurate and generalizable prediction. We show that these expectations are overly optimistic, largely due to the failure modes of existing evaluation metrics. In this study, using cross-splitting, controlled noise experiments, and synthetic data, we systematically evaluate both prediction models and evaluation metrics. We demonstrate that widely used metrics, including correlation-based measures and common distributional distances, are strongly influenced by scale, sparsity, and dimensionality, often misrepresenting model performance. In particular, the Wasserstein distance fails in high-dimensional gene expression spaces under variance scaling, while the Energy distance can overlook disruptions in gene-gene dependencies. Our analyses further reveal that complex deep learning models often underperform simple baselines and remain far from empirical performance bounds across multiple chemical perturbation datasets. Together, our framework exposes critical pitfalls, establishes robust evaluation guidelines, and provides a foundation for trustworthy benchmarking toward reliable virtual-cell models.
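One failure mode is easy to reproduce with toy Gaussian data (not the paper's benchmark): per-gene permutation destroys gene-gene dependencies while keeping every marginal identical, and the energy distance barely moves, whereas a modest mean shift registers clearly:

```python
import numpy as np
from scipy.spatial.distance import cdist

def energy_distance(X, Y):
    """Empirical energy distance between two cells-by-genes samples."""
    d_xy = cdist(X, Y).mean()
    d_xx = cdist(X, X).sum() / (len(X) * (len(X) - 1))  # drop zero self-pairs
    d_yy = cdist(Y, Y).sum() / (len(Y) * (len(Y) - 1))
    return 2 * d_xy - d_xx - d_yy

rng = np.random.default_rng(0)
n, g = 300, 200
factor = rng.normal(size=(n, 1))                # shared latent factor
X = 0.5 * factor + rng.normal(size=(n, g))      # correlated genes
Y = np.column_stack([rng.permutation(X[:, j]) for j in range(g)])  # marginals kept
Z = X + 0.5                                     # modest per-gene mean shift

print(energy_distance(X, Y))  # small: lost gene-gene structure barely registers
print(energy_distance(X, Z))  # markedly larger: the marginal shift dominates
```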
bioinformatics · 2026-02-17 · v1
RNAiSpline: A Deep learning model for siRNA efficacy prediction
Surkanti, S. R.; Kasturi, V. V.; Saligram, S. S.; Basangari, B. C.; Kondaparthi, V.
AI Summary
- The study aimed to develop a computational model, RNAiSpline, for predicting siRNA efficacy in silencing mRNA.
- RNAiSpline uses self-supervised pretraining and fine-tuning with KAN, CNN, and Transformer Encoder to address data scarcity and bias.
- On an independent test dataset, RNAiSpline achieved an ROC-AUC of 0.8175, F1 score of 0.7717, and Pearson correlation of 0.6032.
Abstract
RNA interference (RNAi) is a crucial biological post-transcriptional gene silencing mechanism where small interfering RNA (siRNA) guides the RNA-induced silencing complex (RISC) to bind with messenger RNA (mRNA), thereby silencing it and stopping protein formation. We exploit this process to prevent the formation of harmful proteins by silencing mRNA before it is translated into protein through an effective siRNA. There exists a need to develop a computational model that predicts the effectiveness of siRNA on a given mRNA. Designing such a model is challenging, as available data are either scarce or biased, and existing models lack generalization ability, even though the ratio of parameters to training samples is very high. To overcome these challenges, we introduce RNAiSpline, which incorporates self-supervised pretraining and fine-tuning with a Kolmogorov-Arnold Network (KAN), Convolutional Neural Network (CNN), and Transformer Encoder. Evaluation on the independent test dataset yields an ROC-AUC of 0.8175, an F1 score of 0.7717, and a Pearson correlation of 0.6032, making RNAiSpline a robust model for siRNA efficacy prediction.
bioinformatics · 2026-02-17 · v1
A Robust Framework for Predicting Mutation Effects on Transcription Factor Binding: Insights from Mutational Signatures in 560 Breast Cancer Genomes
Kilinc, H. H.; Otlu, B.
AI Summary
- This study developed a k-mer-based linear regression model to predict the effects of 3.5 million somatic mutations from 560 breast cancer genomes on transcription factor (TF) binding affinity.
- The framework identified that specific mutational signatures like APOBEC (SBS2, SBS13) and aging (SBS1) correlate with gain- or loss-of-function (GOF/LOF) in TF families, affecting gene regulation.
- Analysis showed subtype-specific effects, with basal-like TNBC showing SBS3-driven GOF in CXXC family linked to MYC targets, and SBS39-driven LOF linked to DNA repair pathways.
Abstract
Background: A vast majority of somatic mutations in cancer reside in noncoding regions, yet systematically predicting their functional impact on gene regulation remains a significant challenge. These variants often enforce their effects by altering the binding affinity of transcription factors (TFs) to cis-regulatory elements. However, a critical gap exists in linking specific mutational processes to the disruption of gene regulatory networks at a systems level. Results: In this study, we present a comprehensive in silico pipeline centered on k-mer-based linear regression models to quantify TF binding affinity. Our framework produced 403 high-confidence TF models trained on high-throughput ChIP-seq and PBM datasets. We applied this pipeline to 3.5 million somatic mutations from 560 breast cancer whole genomes to predict gain- or loss-of-function (GOF/LOF) binding perturbations. These predictions were integrated with mutational signature analysis and curated gene sets, utilizing Activity-by-Contact model-based enhancer-gene maps to link variants to their target genes. Our analysis revealed that distinct mutational processes exert non-random, directional effects on specific TF families. The APOBEC-associated signatures (SBS2 and SBS13) were strongly enriched for GOF events in the Myb/SANT and FOX families, while the aging-associated signature SBS1 was enriched for LOF events in the Ets family members. Furthermore, predicted perturbations at putative enhancers were significantly linked to key oncogenes and tumor suppressor genes, with GOF and LOF events affecting, for example, FOXA1 and BRCA1/2, respectively. Within breast cancer samples, in the basal-like TNBC subtype, SBS3-driven GOF enrichments for the CXXC family converged on MYC target gene programs, while SBS39-driven LOF events for the same family converged on DNA repair pathways. Conclusions: Our framework provides a robust and scalable approach for prioritizing and interpreting the functional consequences of somatic mutations in terms of TF perturbations. We demonstrate that specific mutational processes systematically rewire the gene regulatory landscape in a subtype-specific manner, offering novel mechanisms for transcriptional deregulation in breast cancer.
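The scoring scheme behind such k-mer regression models reduces to a weighted sum over sliding k-mers, with a variant's effect read off as the difference between alternate- and reference-context scores (positive delta suggesting GOF, negative LOF). The weights and motif below are invented for illustration; the real weights come from the ChIP-seq/PBM-trained models:

```python
def site_score(seq, kmer_weights, k=6):
    """Affinity proxy: sum of learned k-mer weights over sliding windows."""
    return sum(kmer_weights.get(seq[i:i + k], 0.0)
               for i in range(len(seq) - k + 1))

def mutation_effect(ref_context, alt_base, pos, kmer_weights, k=6):
    """Delta score of an SNV in its local context: > 0 ~ GOF, < 0 ~ LOF."""
    alt_context = ref_context[:pos] + alt_base + ref_context[pos + 1:]
    return (site_score(alt_context, kmer_weights, k)
            - site_score(ref_context, kmer_weights, k))

# Toy weights favouring a FOX-like core motif (illustrative values only)
w = {"GTAAAC": 2.0, "GTAAAT": 1.5, "GTCAAC": -0.2}
print(mutation_effect("TTGTAAACAG", "C", pos=4, kmer_weights=w))  # -2.2 -> LOF
```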
bioinformatics · 2026-02-17 · v1
Ancestry-specific performance of variant effect predictors in clinical variant classification
Hoffing, R.; Zeiberg, D.; Stenton, S. L.; Mort, M.; Cooper, D. N.; Hahn, M. W.; O'Donnell-Luria, A.; Ward, L. D.; Radivojac, P.
AI Summary
- The study assessed the ancestry-specific performance of variant effect predictors in classifying clinical variants, focusing on accuracy and evidence strength per ACMG/AMP guidelines.
- Key confounders identified were the count of rare variants and their allele frequency distribution across ancestries.
- Results showed that after stratifying by allele frequency, predictors had comparable performance across major ancestry groups, supporting their broad application in genetic diagnosis.
Abstract
Predicting the effects of genetic variants and assessing prediction performance are key computational tasks in genomic medicine. It has been shown that well-calibrated variant effect predictors can be reliably used as evidence towards establishing pathogenicity (or benignity) of missense variants, thereby rendering these variants suitable for use in (or exclusion from) the genetic diagnosis of rare Mendelian conditions. However, most predictors have been trained or calibrated on data that may not be sufficiently representative to lead to similar performance across all genetic ancestries. This raises questions about the responsible deployment of these tools to improve human health. To better understand the utility of computational predictors, we set out to assess their ancestry-specific performance in terms of accuracy and evidence strength according to the ACMG/AMP guidelines. First, we determined that the expected count of rare variants in an individual's genome and the allele frequency distribution of these variants are the key confounders when evaluating a predictor's performance across different genetic ancestries. Second, we found that a predictor's accuracy itself inversely correlates with the allele frequency of the rare variant. After stratifying according to allele frequency, we show that established methods for predicting the pathogenicity of missense variants have comparable performance levels across major ancestry groups. Our results therefore support the wide deployment of such models in the context of genetic diagnosis and related applications.
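The key methodological move, stratifying by allele frequency before comparing ancestries, takes only a few lines of pandas; the column names below are a hypothetical schema:

```python
import pandas as pd

def stratified_accuracy(df, af_bins=(0, 1e-5, 1e-4, 1e-3, 1e-2, 1.0)):
    """Bin variants by allele frequency first, then compare predictor
    accuracy across ancestries within each bin, removing the AF confounder."""
    df = df.assign(af_bin=pd.cut(df["allele_freq"], bins=list(af_bins)),
                   correct=(df["pred"] == df["label"]).astype(float))
    return df.groupby(["af_bin", "ancestry"], observed=True)["correct"].mean().unstack()

toy = pd.DataFrame({
    "ancestry": ["AFR", "EUR", "AFR", "EUR"],
    "allele_freq": [2e-5, 3e-4, 5e-4, 1e-5],
    "label": [1, 0, 1, 1],
    "pred": [1, 0, 0, 1],
})
print(stratified_accuracy(toy))
```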
bioinformatics · 2026-02-17 · v1
rbio1 - training scientific reasoning LLMs with biological world models as soft verifiers
Istrate, A.-M.; Milletari, F.; Castrotorres, F.; Tomczak, J. M.; Torkar, M.; Li, D.; Karaletsos, T.
AI Summary
- The study explores training reasoning models in biology using biological world models as soft verifiers, introducing two paradigms: RLEMF and RLPK.
- rbio1, a model post-trained from a pretrained LLM with reinforcement learning, uses these paradigms to achieve state-of-the-art performance on the PerturbQA benchmark.
- The approach demonstrates that soft verification can enhance model performance and enable zero-shot transfer to tasks like disease-state prediction.
Abstract
Reasoning models are typically trained against verification mechanisms in formally specified systems such as code or symbolic math. In open domains like biology, however, we lack exact rules to enable large-scale formal verification and instead often rely on lab experiments to test predictions. Such experiments are slow, costly, and cannot scale with computation. In this work, we show that world models of biology or other prior knowledge can serve as approximate oracles for soft verification, allowing reasoning systems to be trained without additional experimental data. We present two paradigms for training models with approximate verifiers: RLEMF (reinforcement learning with experimental model feedback) and RLPK (reinforcement learning from prior knowledge). Using these paradigms, we introduce rbio1, a reasoning model for biology post-trained from a pretrained LLM with reinforcement learning, using learned biological models for verification during training. We demonstrate that soft verification can distill biological world models into rbio1, enabling it to achieve state-of-the-art performance on perturbation prediction in the PerturbQA benchmark. We further show that composing multiple AI-verifiers improves performance and that models trained with soft biological rewards transfer zero-shot to cross-domain tasks such as disease-state prediction. We present rbio1 as a proof of concept that predictions from biological models can train powerful reasoning systems using simulations rather than experimental data, offering a new paradigm for model training.
bioinformatics · 2026-02-16 · v4
SMECT: a framework for benchmarking post-GWAS methods for spatial mapping of cells associated with human complex traits
Liu, M.; Xue, C.; Luo, Y.; Peng, W.; Ye, L.; Zhang, L.; Wei, W.; Li, M.
AI Summary
- SMECT is a framework designed to benchmark post-GWAS methods for mapping cells associated with complex human traits using spatial transcriptomics.
- It uses a simulation engine, 21 real-world datasets, and an assessment toolkit to evaluate methods like DESE, S-LDSC, and scDRS.
- Findings show DESE excels in both sensitivity and specificity, while S-LDSC has high sensitivity but low specificity, and scDRS is specific but less sensitive.
Abstract
Spatially resolving the cellular basis of complex human traits is essential for elucidating disease mechanisms, yet the comparative performance of computational methods for this task has not been systematically evaluated. Here, we present SMECT (Spatial Mapping Evaluation of Complex Traits), the first comprehensive framework for systematically evaluating methods that integrate genetic data with spatial transcriptomics. SMECT combines a biologically realistic simulation engine, a curated resource of 21 diverse real-world datasets, and a multi-faceted assessment toolkit. Using this framework, we benchmarked three state-of-the-art methods, DESE, S-LDSC, and scDRS, across 19 complex traits. Our analysis reveals a fundamental trade-off between detection sensitivity and biological specificity. We demonstrate that while S-LDSC identifies extensive spatial signals, it suffers from inflated non-specific significant associations. Conversely, scDRS is highly specific but conservative, performing well only in tissues with strong biological signals while missing subtle associations in sparser datasets. DESE overcomes these limitations, consistently achieving high power and robust specificity across both simulated and real-world scenarios. SMECT provides critical guidelines for method selection and serves as a foundational resource for developing robust spatial analyses of human complex traits. The framework is publicly available at https://github.com/pmglab/smect.
bioinformatics · 2026-02-16 · v1
GATSBI: Improving context-aware protein embeddings through biologically motivated data splits
Nayar, G.; Altman, R. B.
AI Summary
- GATSBI introduces a graph attention framework to create context-aware protein embeddings by integrating various biological data types.
- The study uses task-aligned evaluation protocols, showing that models trained with biologically relevant data partitions generalize better.
- GATSBI outperforms existing embeddings in predicting interactions, functions, and functional sets, especially for understudied proteins under inductive settings.
Abstract
Motivation: Understanding protein function requires integrating diverse biological evidence while accounting for strong contextual dependence. Recent protein embedding methods increasingly leverage heterogeneous biological networks, yet their evaluation protocols often fail to reflect the specific biological tasks for which the embeddings are intended. Prediction of missing interactions, annotation of new proteins, and discovery of functional modules require fundamentally different data partitions, such as edge-masked versus node-held-out splits. Moreover, most approaches report performance primarily on well-studied proteins, where computational predictions are least needed, risking substantial overestimation of real-world utility. Results: We introduce a graph-attention-based framework (GATSBI) to construct context-aware protein embeddings from integrated protein-protein interactions, co-expression, sequence representations, and tissue-specific associations. Using task-aligned evaluation protocols, we show that models trained with biologically appropriate partitions achieve markedly better generalization. Across interaction, function, and functional set prediction, GATSBI consistently outperforms existing pretrained embeddings for both well-studied and understudied proteins, with the largest gains observed for the understudied regime and under inductive node-held-out evaluation. To enable broad reuse, we provide the learned embeddings for download for application to other protein prediction tasks. Availability and Implementation: Code and models for our experiments are available at https://github.com/Helix-Research-Lab/GATSBI-embedding
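The two data partitions the abstract contrasts are easiest to see in code: an edge-masked split hides interactions while keeping all proteins visible (transductive, for link prediction), whereas a node-held-out split removes whole proteins together with every edge touching them (inductive, for annotating new proteins). A minimal sketch:

```python
import random

def edge_masked_split(edges, frac=0.1, seed=0):
    """Hide a fraction of edges; every node remains visible in training."""
    rnd = random.Random(seed)
    edges = list(edges)
    rnd.shuffle(edges)
    cut = int(len(edges) * frac)
    return edges[cut:], edges[:cut]  # train, test

def node_held_out_split(edges, nodes, frac=0.2, seed=0):
    """Hold out whole nodes; any edge touching one leaves the training set."""
    rnd = random.Random(seed)
    held = set(rnd.sample(sorted(nodes), int(len(nodes) * frac)))
    train = [e for e in edges if not (set(e) & held)]
    test = [e for e in edges if set(e) & held]
    return train, test, held

edges = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "D"), ("D", "E")]
print(node_held_out_split(edges, nodes={"A", "B", "C", "D", "E"}))
```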
bioinformatics · 2026-02-15 · v1
Empty drops in scRNA-seq uncover the surprising prevalence of sequestered neuropeptide mRNA and pervasive sequencing artifacts
Gorin, G.; Goodman, L.
AI Summary
- This study utilized empty drops from single-cell RNA sequencing to explore sequencing artifacts and biological phenomena.
- A simple procedure was developed to detect sequencing artifacts, providing recommendations to minimize quantification errors.
- Surprisingly, empty drops showed a high prevalence of mRNA for neuropeptide-related genes, suggesting potential physiological relevance.
Abstract
The empty drops in single-cell sequencing experiments are an underexplored resource. As such, they present a substrate to ask questions orthogonal to standard single-cell sequencing workflows, calibrate statistical models using simple internal controls, and detect technical outliers which would be otherwise challenging to distinguish from real biology. In this case study, we report a relatively simple procedure to detect sequencing artifacts and make recommendations to reduce the risk of erroneous quantifications. In addition, we report the surprising abundance and co-expression of mRNA coding for neuropeptide-related genes in the empty drops, possibly reflecting underlying physiology.
bioinformatics · 2026-02-15 · v1
EMReady2: improvement of cryo-EM and cryo-ET maps by local quality-aware deep learning with Mamba
Cao, H.; Zhu, Y.; Li, T.; Chen, J.; He, J.; Wang, X.; Huang, S.-Y.
AI Summary
- The study addresses the challenge of improving cryo-EM map quality by introducing EMReady_mamba, a deep learning model using a Mamba-based dual-branch UNet architecture.
- EMReady_mamba employs a local resolution-guided learning strategy to handle map heterogeneity, extending its applicability to various map types including nucleic acids and cryo-ET maps.
- Evaluated on 136 maps, EMReady_mamba demonstrated superior performance in enhancing map quality and interpretation compared to existing methods.
Abstract
Cryo-electron microscopy (cryo-EM) has emerged as a leading technology for determining the structures of biological macromolecules. However, map quality issues such as noise and loss of contrast hinder accurate map interpretation. Traditional and deep learning-based post-processing methods offer improvements but face limitations, particularly in handling map heterogeneity. Here, we present a generalist Mamba-based deep learning model for improving cryo-EM maps, named EMReady_mamba. EMReady_mamba introduces a fast Mamba-based dual-branch UNet architecture to jointly capture local and global features. In addition, EMReady_mamba uses a local resolution-guided learning strategy to address map heterogeneity, and significantly extends the training set. These advances render EMReady_mamba applicable to a broader range of cryo-EM maps, including those containing nucleic acids, medium-resolution maps, and cryo-electron tomography (cryo-ET) maps, while substantially reducing computational cost. EMReady_mamba is extensively evaluated on 136 diverse maps at 2.0-10.0 Å resolutions, and compared with existing map post-processing methods. It is shown that EMReady_mamba exhibits state-of-the-art performance in both map quality and map interpretation improvement. EMReady2 is freely available at https://github.com/huang-laboratory/EMReady2/.
bioinformatics · 2026-02-14 · v2
RNApdbee 3.0: A unified web server for comprehensive RNA secondary structure annotation from 3D coordinates
Pielesiak, J.; Niznik, K.; Snioszek, P.; Wachowski, G.; Zurawski, M.; Antczak, M.; Szachniuk, M.; Zok, T.
AI Summary
- RNApdbee 3.0 is a web server that integrates 2D and 3D data to annotate RNA secondary structures, classifying base pairs and identifying various nucleotide interactions.
- It handles incomplete or modified residues, providing results in standard formats like dot-bracket notation, BPSEQ, and CT, along with graphical visualizations.
- The tool standardizes inputs to PDBx/mmCIF, integrates seven annotation tools, and decomposes structures into stems, loops, and single strands, ensuring comprehensive RNA structural analysis.
Abstract
RNApdbee 3.0 (publicly available at https://rnapdbee.cs.put.poznan.pl/) offers an advanced pipeline for comprehensive RNA structural annotation, integrating 2D and 3D data to build detailed nucleotide interaction networks. It classifies base pairs as canonical or noncanonical using the Leontis-Westhof and Saenger schemes and identifies stacking, base-ribose, base-phosphate, and base-triple interactions. The tool handles incomplete or modified residues, marking missing nucleotides and distinguishing noncanonical base pairs for accurate and effective visualization. Results are provided in standard formats - namely, extended dot-bracket notation, BPSEQ, and CT - and in highly valuable graphical visualizations. RNApdbee decomposes secondary structures into stems, loops, and single strands and offers flexible pseudoknot encoding. Its unified framework addresses inconsistencies across structural data formats by standardizing all inputs to PDBx/mmCIF and integrating seven widely used annotation tools. Finally, RNApdbee ensures reliable, format-independent, and comprehensive RNA structural annotation and interpretation.
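For a flavor of the pseudoknot-encoding problem, the sketch below greedily assigns each base pair to the first bracket level it does not cross, spilling crossing pairs into [], {}, and <>; RNApdbee implements several such strategies, so this is only illustrative:

```python
def to_dot_bracket(n, pairs):
    """Greedy pseudoknot-aware dot-bracket encoding of base pairs (i, j)."""
    openers, closers = "([{<", ")]}>"
    out = ["."] * n
    assigned = [[] for _ in openers]  # pairs placed at each level
    for i, j in sorted(pairs):
        for lvl in range(len(openers)):
            # with pairs sorted by i, (i, j) crosses (a, b) iff a < i < b < j
            if all(not (a < i < b < j) for a, b in assigned[lvl]):
                assigned[lvl].append((i, j))
                out[i], out[j] = openers[lvl], closers[lvl]
                break
    return "".join(out)

# A minimal H-type pseudoknot: pairs (0, 6) and (3, 9) cross each other
print(to_dot_bracket(10, [(0, 6), (3, 9)]))  # (..[..)..]
```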
bioinformatics · 2026-02-14 · v2
DiCoLo: Integration-free and cluster-free detection of localized differential gene co-expression in single-cell data
Li, R.; Yang, J.; Su, P.-C.; Jaffe, A.; Lindenbaum, O.; Kluger, Y.
AI Summary
- DiCoLo is introduced to detect localized differential gene co-expression in single-cell data without relying on cell clustering or cross-condition alignment.
- It uses Optimal Transport distances to construct gene graphs and identify changes in gene connectivity patterns between conditions.
- DiCoLo effectively identifies differential gene co-localization in complex scenarios, revealing new insights in mouse hair follicle development related to dermal condensate differentiation.
Abstract
Detecting changes in gene coordination patterns between biological conditions and identifying the cell populations in which these changes occur are key challenges in single-cell analysis. Existing approaches often compare gene co-expression between predefined cell clusters or rely on aligning cells across conditions. These strategies can be suboptimal when changes occur within small subpopulations or when batch effects obscure the underlying biological signal. To address these challenges, we introduce DiCoLo, a framework that identifies genes exhibiting differential co-localization, defined as changes in coordinated expression within localized cell neighborhoods - subsets of highly similar cells in the transcriptomic space. Importantly, DiCoLo does not rely on cell clustering or cross-condition alignment. For each condition, DiCoLo constructs a gene graph using Optimal Transport distances that reflect gene co-localization patterns across the cell manifold. Then, it identifies differential gene programs by detecting changes in connectivity patterns between the gene graphs. We show that DiCoLo robustly identifies differential gene co-localization even under weak signals or complex batch effects, outperforming existing methods across multiple benchmark datasets. When applied to mouse hair follicle development data, DiCoLo reveals coordinated gene programs and emerging cell populations driven by perturbations in morphogen signaling that underlie dermal condensate differentiation. Overall, these results establish DiCoLo as a powerful framework for uncovering localized differential transcriptional coordination patterns in single-cell data.
bioinformatics · 2026-02-14 · v2
DVPNet: A New XAI-Based Interpretable Genetic Profiling Framework Using Nucleotide Transformer and Probabilistic Circuits
Kusumoto, T.
AI Summary
- The study introduces DVPNet, an XAI-based framework for genetic profiling that uses a Nucleotide Transformer and probabilistic circuits to classify cancer vs. normal cells.
- Using the GSE131907 dataset, 900 genes per sample were selected, transformed into embeddings, and used to train the model, which then provided probabilistic contributions for classification.
- Key findings include identification of 1,524 genes with unexpected contribution scores, highlighting genes like ITGA5 and TP73, offering new insights beyond traditional statistical methods.
Abstract
In this study, we present an XAI-based genetic profiling framework that quantifies gene importance for distinguishing cancer cells from normal cells based on an interpretable AI decision process. We propose a new explainable AI (XAI) classification model that combines probabilistic circuits with the Nucleotide Transformer. By leveraging the strong feature-extraction capability of the Nucleotide Transformer, we design a tractable classification framework based on probabilistic circuits while preserving probabilistic interpretability. To demonstrate the capability of this framework, we used the GSE131907 single-cell lung cancer atlas and constructed a dataset consisting of cancer-cell and normal-cell classes. From each sample, 900 gene types were randomly selected and converted into embedding vectors using the Nucleotide Transformer, after which the classification model was trained. We then extracted class-specific probabilistic contributions from the tractable model and defined a contribution score for the cancer-cell class. Genetic profiling was performed based on these scores, providing insights into which genes and biological pathways are most important for the classification task. Notably, 1,524 of the 9,540 observed genes showed contribution scores that contradicted what would be expected from their class-wise occurrence frequencies, suggesting that the profiling goes beyond simple statistics by leveraging biological feature representations encoded by the Nucleotide Transformer. The top-ranked genes among these contradictory cases include several well-studied genes in cancer research (e.g., ITGA5, SIGLEC9, NOTUM, and TP73). Overall, these analyses go beyond traditional statistical or gene-expression-level approaches and provide new academic insights for genetic research.
bioinformatics · 2026-02-14 · v2
IntelliFold-2: Surpassing AlphaFold 3 via Architectural Refinement and Structural Consistency
Qiao, L.; Yan, H.; Liu, G.; Guo, G.; Sun, S.
AI Summary
- IntelliFold-2 enhances biomolecular structure prediction through architectural refinements like latent space scaling and atom-attention, improving over AlphaFold 3.
- Key improvements include better performance in therapeutic contexts, especially for antibody-antigen interactions and protein-ligand co-folding.
- Three variants (Flash, v2, Pro) are released to cater to different needs from efficient fine-tuning to high-precision inference.
Abstract
IntelliFold-2 is an open-source biomolecular structure prediction model that improves accuracy and robustness through architectural refinement and multiscale structural consistency. We introduce latent space scaling in Pairformer blocks, a principled atom-attention formulation with stochastic atomization, policy-guided optimization for diffusion sampling and difficulty-aware loss reweighting. On Foldbench, IntelliFold-2 improves performance in therapeutically relevant settings, with particularly strong gains for antibody-antigen interactions and protein-ligand co-folding relative to AlphaFold 3. We release three variants (Flash, v2, and Pro) to cover efficient fine-tuning through high-precision server-side inference.
bioinformatics · 2026-02-14 · v2
TOXsiRNA: A web server to predict the toxicity of chemically modified siRNAs
Dar, S.; Kumar, M.
AI Summary
- The study developed TOXsiRNA, a web server to predict the toxicity of chemically modified siRNAs, addressing the challenge of experimental testing.
- Machine learning models, including SVM, LR, KNN, and ANN, were used, with the SVM model based on mononucleotide composition showing the best performance (PCC of 0.91 and 0.92).
- The server, available at http://bioinfo.imtech.res.in/manojk/toxsirna, also integrates models for predicting siRNA knockdown efficacy and off-target effects.
Abstract
Small interfering RNAs (siRNAs) are largely modified with chemical molecules to enhance their properties for use in molecular biology research and therapeutic applications. Toxicity effects may arise due to these chemical moieties as well as sequence-based off-targets at the cellular level. Enormous resources are required to experimentally design and test the toxicity of these chemical modifications and their combinations on siRNAs. To address this problem, we developed the TOXsiRNA web server to computationally predict the toxicity of chemically modified siRNAs and their off-targets. We selected 2749 siRNAs with different permutations and combinations of 21 different chemical modifications engineered on them. Next, we used Support Vector Machine (SVM), Linear Regression (LR), K-Nearest Neighbor (KNN) and Artificial Neural Network (ANN) machine learning approaches to develop models. The best performance was displayed by the mononucleotide-composition-based model developed with SVM, offering a Pearson Correlation Coefficient (PCC) of 0.91 and 0.92 on training/testing and independent validation, respectively. Other sequence features, such as dinucleotide composition, binary pattern, and their combinations, were also tested. Finally, three models of chemically modified siRNAs were implemented on the web server. Other algorithms, including prediction of normal as well as chemically modified siRNA knockdown efficacy and off-target effects, are also integrated. The resource is freely hosted online for scientific use at http://bioinfo.imtech.res.in/manojk/toxsirna.
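An outline of the winning model family is easy to reconstruct (the sequences and toxicity values below are invented, and the real server additionally encodes the 21 chemical modification types, which plain composition features cannot capture):

```python
import numpy as np
from sklearn.svm import SVR

def mono_composition(seq):
    """Mononucleotide composition: fractions of A, C, G, U in the siRNA."""
    seq = seq.upper().replace("T", "U")
    return [seq.count(b) / len(seq) for b in "ACGU"]

# Toy stand-ins for the curated siRNAs and their measured toxicity
seqs = ["AUGCUAGCUAGCUAGCUAGCU", "GGGGCCCCAAAAUUUUGGGGC", "AUAUAUAUAUAUAUAUAUAUA"]
tox = [0.2, 0.8, 0.4]

model = SVR(kernel="rbf").fit(np.array([mono_composition(s) for s in seqs]), tox)
print(model.predict([mono_composition("AUGGCUAGCUAUUUGCUAGCU")]))
```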
bioinformatics2026-02-14v1Analysis of Age-Specific Dysregulation of miRNAs in Lung Cancer Via Machine learning: Biomarker Identification and Therapeutic Implications in Patients Aged 60 and Above.
Hasan, A.; Muzaffar, A.AI Summary
- This study analyzed miRNA dysregulation in lung cancer patients aged 60 and above using RNA sequencing data from TCGA.
- Differential expression analysis identified 25 significant miRNAs, with hsa-mir-1911 upregulated and others like hsa-mir-196a downregulated.
- Machine learning highlighted key miRNAs involved in lung cancer biology, suggesting their potential as biomarkers for early diagnosis and personalized therapy targets.
Abstract
Lung cancer is the leading cause of cancer-related mortality worldwide and predominantly affects older individuals, with non-small cell lung cancer (NSCLC) comprising 85% of cases. Despite advancements in diagnosis and treatment, the prognosis for elderly patients remains poor. This study investigates the role of microRNAs (miRNAs) in lung cancer, focusing on individuals aged 60 and above. RNA sequencing data from The Cancer Genome Atlas (TCGA) were used to conduct differential expression analysis of miRNA profiles from elderly and senile patient groups. Of 1,881 miRNA profiles, 801 were differentially expressed. Filtering for significance identified 25 miRNAs, with hsa-mir-1911 upregulated and the remaining 24, including hsa-mir-196a and hsa-mir-323b, downregulated. Prior studies indicate that these miRNAs play roles in apoptosis, senescence, and inflammation. A complementary machine learning analysis highlighted key miRNAs, including hsa-mir-181b, hsa-mir-542, hsa-mir-450b, hsa-mir-584, and hsa-mir-21, as crucial in lung cancer biology. Functional enrichment analysis revealed their involvement in gene silencing, translational repression, and RNA-induced silencing complex (RISC) regulation. This research links miRNA dysregulation and aging in lung cancer and identifies potential biomarkers for early diagnosis and targets for personalized therapies.
bioinformatics2026-02-14v1Feature-based in-silico model to predict the Mycobacterium tuberculosis bedaquiline phenotype associated with Rv0678 variants
Quispe Rojas, W.; de Diego Fuertes, M.; Rennie, V.; Riviere, E.; Safarpour, M.; Van Rie, A.AI Summary
- The study developed an in-silico model to predict bedaquiline resistance in Mycobacterium tuberculosis based on 13 features of Rv0678 variants.
- Key features included evolutionary conservation and proximity to functional sites, with the model achieving high accuracy (ROC-AUC 0.826, sensitivity 87.1%, specificity 88.2%).
- External validation showed reduced performance, likely due to varied phenotypic measurement methods.
Abstract
Bedaquiline resistance is emerging globally and threatens the effectiveness of the novel short all-oral regimens for rifampicin-resistant tuberculosis. Following a systematic literature review, we quantified 13 sequence, biochemical, and structural features of 62 Rv0678 missense variants reported in 136 Mycobacterium tuberculosis isolates. Using rigorous machine learning methods, we show that the strongest contributing features were the evolutionary conservation score and the shortest atomic distance to key functional sites. The final 5-feature model had good performance (ROC-AUC 0.826) and classified the bedaquiline phenotype with high accuracy [sensitivity 87.1% (95% CI, 78.3-92.6) and specificity 88.2% (95% CI, 76.6-94.5)]. Performance was lower in external validation, likely due to the measurement error introduced when using diverse phenotypic methods. The model also offers insight into the effects of missense variants on mmpR5 protein structure and function. Integrating the five-feature in-silico model into variant interpretation software could improve the prediction of the effect of Rv0678 variants and guide clinical management of rifampicin-resistant tuberculosis.
bioinformatics2026-02-14v1CodonRL: Multi-Objective Codon Sequence Optimization Using Demonstration-Guided Reinforcement Learning
Du, S.; Kaynar, G.; Li, J.; You, Z.; Tang, S.; Kingsford, C.AI Summary
- CodonRL uses reinforcement learning to optimize codon sequences for translation efficiency, RNA stability, and compositional properties, addressing challenges like large action spaces and delayed rewards.
- It employs LinearFold for training and ViennaRNA for evaluation, with expert sequences to guide learning and milestone rewards to manage long-range optimization.
- On a benchmark of 55 human proteins, CodonRL outperformed GEMORNA, showing improvements in CAI by 9.5%, MFE by 25.4 kcal/mol, and reducing uridine content by 3.4%, enhancing translation efficiency, stability, and reducing immunogenicity.
Abstract
Optimizing synonymous codon sequences to improve translation efficiency, RNA stability, and compositional properties is challenging because the search space grows exponentially with protein length and objectives interact through long-range RNA structure. Dynamic programming-based methods can provide strong solutions for fixed objective combinations but are difficult to extend to additional constraints. Deep generative models require large-scale, high-quality mRNA sequence datasets for training, limiting applicability when such data are scarce. Reinforcement learning naturally handles sequential decision-making but faces challenges in codon optimization due to delayed rewards, large action spaces, and expensive structural evaluation. We present CodonRL, a reinforcement learning framework that learns a structural prior for mRNA design from efficient folding feedback and demonstration-guided replay, and then enables user-controlled multi-objective trade-offs during inference. CodonRL uses LinearFold for fast intermediate reward computation during training and ViennaRNA for final evaluation, warms up learning with expert sequences to accelerate convergence for global structure objectives, and introduces milestone-based intermediate rewards to address delayed feedback in long-range optimization. On a benchmark of 55 human proteins, CodonRL outperforms GEMORNA, a state-of-the-art codon optimization method, across multiple metrics, achieving 9.5% higher codon adaptation index (CAI), 25.4 kcal/mol more favorable minimum free energy (MFE), and 3.4% lower uridine content on average, while improving codon stabilization coefficient (CSC) in over 90% of benchmark proteins under matched constraints. These gains translate into designs that are predicted to be more efficiently translated, more structurally stable, and less immunogenic, while supporting continuous objective reweighting at inference time.
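The CAI figure quoted above refers to the standard codon adaptation index of Sharp and Li (1987): the geometric mean of per-codon relative-adaptiveness weights. A minimal sketch, assuming the weight table has already been derived from a reference set of highly expressed genes (toy values shown):

```python
# Codon adaptation index: geometric mean of relative-adaptiveness weights
# w(codon) = freq(codon) / max freq among its synonymous codons, with the
# frequencies taken from a reference set (toy subset below).
import math

def cai(cds: str, weights: dict[str, float]) -> float:
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    log_sum = sum(math.log(weights[c]) for c in codons)
    return math.exp(log_sum / len(codons))

weights = {"GCU": 1.0, "GCC": 0.8, "AAA": 1.0, "AAG": 0.6}  # toy subset
print(f"{cai('GCUAAAGCCAAG', weights):.3f}")
```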
bioinformatics2026-02-14v1Theseus: Fast and Optimal Affine-Gap Sequence-to-Graph Alignment
Jimenez-Blanco, A.; Lopez-Villellas, L.; Moure, J. C.; Moreto, M.; Marco-Sola, S.AI Summary
- Theseus is a novel algorithm for optimal affine-gap sequence-to-graph alignment, designed to reduce memory and computational demands while maintaining optimality.
- It uses the diagonal transition property and a sparse-data strategy to accelerate alignment, applicable to arbitrary directed graphs including those with cycles.
- Theseus outperforms existing methods in speed for multiple sequence alignment (2.0x to 232.2x faster) and pangenome read mapping (1.9x to 16.9x faster).
Abstract
Motivation: Sequence-to-graph alignment is a central problem in bioinformatics, with applications in multiple sequence alignment (MSA) and pangenome analysis, among others. However, current algorithms for optimal affine-gap alignment impose high memory and computational requirements, limiting their scalability when aligning long sequences to complex graphs. Practical solutions partially address this problem using heuristic strategies that ultimately trade off optimality for speed. Results: This work presents Theseus, a novel, fast, and optimal affine-gap sequence-to-graph alignment algorithm. Theseus leverages similarities between genomic sequences to accelerate the alignment computation and reduces the overall memory requirements without compromising optimality. To that end, Theseus exploits the diagonal transition property to process only a subset of the dynamic programming cells, combined with a sparse-data strategy that enables efficient sequence-to-graph alignment. Moreover, our algorithm supports optimal affine-gap alignment on arbitrary directed graphs, including those with cycles. We evaluate Theseus on two key problems: multiple sequence alignment (MSA) and pangenome read mapping. For MSA, we compare it against the state-of-the-art methods SPOA, abPOA, and POASTA. Theseus is 2.0x to 232.2x faster than the two optimal aligners, SPOA and POASTA. Compared with abPOA, a heuristic aligner, Theseus is 3.3x faster on average, while ensuring optimality. For pangenome read mapping, we benchmark Theseus against the alignment stage of the popular mapping tool vg map, along with the alignment kernels of SPOA, abPOA, and POASTA. Theseus outperforms the other methods, showing a 1.9x to 16.9x speed improvement on short reads.
bioinformatics2026-02-14v1Machine learning-guided design of artificial microRNAs for targeted gene silencing
Belter, A.; Synak, J.; Mackowiak, M.; Kotowska-Zimmer, A.; Figlerowicz, M.; Szachniuk, M.; Olejniczak, M.AI Summary
- The study developed miRarchitect, a machine learning-based platform for designing artificial microRNAs (amiRNAs) to enhance targeted gene silencing.
- miRarchitect integrates neural network-guided selection, siRNA design, and scaffold choice, using data from human pri-miRNAs and next-generation sequencing.
- Validation experiments targeting TMPRSS2 and ACE-2 showed miRarchitect-designed amiRNAs had precise processing, robust knockdown, and high specificity, outperforming other tools in benchmarking.
Abstract
Artificial microRNAs (amiRNAs) offer a powerful strategy for targeted gene silencing, but their rational design is limited by complex sequence-structure-processing relationships and the lack of tools capable of optimizing efficacy and specificity. To address this need, we developed miRarchitect, a web-based platform that uses machine learning to support the customizable design of amiRNAs. miRarchitect integrates neural network-guided target-site selection, siRNA insert design, and scaffold choice, utilizing large-scale data from human primary microRNAs (pri-miRNAs) and next-generation sequencing. The platform generates molecules that closely resemble endogenous pri-miRNAs and includes comprehensive off-target analysis to enhance specificity. Experimental validation targeting TMPRSS2 and ACE-2 confirmed precise processing, robust knockdown, and high specificity of miRarchitect-designed amiRNAs. In comparative benchmarking, miRarchitect consistently produced functional amiRNAs, whereas only half of the top candidates generated by other tools showed measurable activity. miRarchitect is freely available at https://rnadrug.ichb.pl/mirarchitect and provides an intuitive interface with an automated workflow for generating, ranking, and selecting candidate amiRNAs for research and therapeutic applications.
bioinformatics2026-02-14v1MassID provides near complete annotation of metabolomics data with identification probabilities
Stancliffe, E.; Gandhi, M.; Guzior, D. V.; Mehta, A.; Acharya, S.; Richardson, A. D.; Cho, K.; Cohen, T.; Patti, G. J.AI Summary
- MassID is a cloud-based untargeted metabolomics pipeline designed to process LC/MS data from raw spectra to identified metabolite profiles, addressing challenges like noise and non-quantitative identification.
- It uses deep learning for peak detection, comprehensive noise filtering, and introduces DecoID2 for probabilistic metabolite identification with FDR control.
- Applied to human plasma, MassID annotated nearly all signals, identifying over 4,000 metabolites, with over 1,200 at FDR <5%, enhancing specificity and discovery beyond traditional MSI levels.
Abstract
Liquid chromatography coupled to mass spectrometry (LC/MS) is a powerful tool in metabolomics research, generating tens of thousands of signals from a single biological sample. However, current software solutions for unbiased metabolomics data analysis are limited by complex sources of noise and non-quantitative metabolite identifications that make results difficult to interpret. Here, we present MassID, a cloud-based untargeted metabolomics pipeline that aims to overcome the innate challenges of unbiased metabolite analysis and perform end-to-end data processing, transforming raw spectra into normalized and identified metabolite profiles. MassID incorporates a suite of software functionalities, including deep learning-based peak detection and comprehensive noise filtering. In addition, MassID introduces a novel software module, DecoID2, which enables probabilistic metabolite identification for false discovery rate (FDR)-controlled metabolomics. When applied to a human plasma dataset, MassID achieves near-complete signal annotation, identifies >4,000 metabolites (including >1,200 compounds at an FDR <5%) across four complementary LC/MS runs, and enables integrated downstream analyses to understand biochemical dysregulation at both the molecular and pathway level. When compared to the Metabolomics Standards Initiative (MSI) confidence levels, identification probability generally correlated with MSI levels. However, only 356 of 418 MSI Level 1 compounds were identified at <5% FDR, and the remaining 884 compounds passing the 5% FDR threshold came from MSI Level 2-3 annotations, highlighting the enhanced specificity and discovery potential achieved by MassID.
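For intuition on FDR-controlled identification, one common recipe converts per-hit identification probabilities into an FDR cutoff by ranking hits by probability and averaging the posterior error over the accepted prefix. The sketch below illustrates that generic recipe; it is not necessarily DecoID2's exact procedure.

```python
# Generic PEP-to-FDR conversion: accept the largest prefix of hits,
# ranked by identification probability, whose mean posterior error
# (1 - p) stays below the target FDR.
import numpy as np

def accept_at_fdr(probs: np.ndarray, fdr: float = 0.05) -> np.ndarray:
    """Boolean mask of identifications accepted at the target FDR."""
    order = np.argsort(probs)[::-1]                 # most confident first
    pep = 1.0 - probs[order]                        # posterior error per hit
    running = np.cumsum(pep) / np.arange(1, len(pep) + 1)
    ok = np.where(running <= fdr)[0]
    k = ok[-1] + 1 if ok.size else 0                # largest passing prefix
    mask = np.zeros(probs.shape, dtype=bool)
    mask[order[:k]] = True
    return mask

probs = np.random.default_rng(2).uniform(0.5, 1.0, 1000)
print(accept_at_fdr(probs).sum(), "identifications at FDR <= 5%")
```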
bioinformatics2026-02-14v1Cell phenotypes in the biomedical literature: a systematic analysis and text mining corpus
Rotenberg, N. H.; Leaman, R.; Islamaj, R.; Kuivaniemi, H.; Tromp, G.; Fluharty, B.; Richardson, S.; Eastwood, C.; Diller, M.; Xu, B.; Pankajam, A. V.; Osumi-Sutherland, D.; Lu, Z.; Scheuermann, R. H.AI Summary
- The study introduces CellLink, a corpus of over 22,000 annotated mentions of human and mouse cell populations from recent literature, linked to Cell Ontology terms.
- Analysis showed lineage-specific patterns in cell naming based on various attributes.
- Fine-tuning transformer models on CellLink improved named entity recognition, and embedding approaches enhanced zero-shot entity linking, with applications in refining the chondrocyte branch of Cell Ontology.
Abstract
The variety of cell phenotypes identified by single-cell technologies is rapidly expanding, yet this knowledge is dispersed across the scientific literature and incompletely represented in structured resources. We present the CellLink corpus, a manually annotated collection of over 22,000 mentions of human and mouse cell populations in recent journal articles, distinguishing specific cell phenotypes, heterogeneous cell populations, and vague cell populations, and linking to Cell Ontology (CL) terms as either exact or related matches, covering nearly half of the terms in the current CL. A systematic analysis reveals lineage-specific patterns in how authors utilize anatomical context, molecular signatures, functional roles, developmental stage, and other attributes in cell naming. We show that fine-tuning transformer-based models on CellLink yields strong performance for named entity recognition, while embedding-based approaches support zero-shot entity linking and distinguishing exact from related matches. We further demonstrate the utility of CellLink to expand and refine the chondrocyte branch of CL.
bioinformatics2026-02-14v1evoCancerGPT: Generating Zero-Shot Single-Cell and Single-Sample Cancer Progression Through Transfer Learning
Wang, X.; Tan, R.; Cristea, S.AI Summary
- The study introduces evoCancerGPT, a transformer model designed to predict future gene expression in cancer evolution using single-cell RNA sequencing data.
- It uses transfer learning from 2.76 million cell tokens across 7 cancer types, ordered by pseudotime, to forecast cancer progression.
- evoCancerGPT showed high accuracy in predicting cancer trajectories, outperforming linear models and scGPT in low-context scenarios.
Abstract
Cancer evolution is driven by complex changes in gene expression as cells transition and change states during tumorigenesis. Single-cell RNA sequencing has provided snapshot insights into how the transcriptomics of tumors evolve, but whether the existing knowledge can be used to reliably learn and generate the patterns behind the evolution of cancers remains unknown. Here, we introduce evoCancerGPT, a generative pre-trained transformer decoder-only single-cell foundation model designed to forecast future gene expression profiles in cancer evolution by leveraging previous cell states at the level of single patients. This model integrates the continuous gene expression data of each cell to create a comprehensive representation of a cell token. Training sentences are constructed for each cancer type, each patient and each cell type separately, ordered via inferred pseudotime algorithms, using 2.76 million cell tokens, each with 12,639 genes, spanning 7 cancer types. By learning from long-range dependencies between cells arranged in pseudotime from a large corpus of data, evoCancerGPT captures key transitions in cancer evolution, achieving high concordance to ground truth trajectories and outperforming linear and scGPT baselines in held-out test samples in low-context scenarios. Our work suggests evoCancerGPT's potential utility in characterizing tumor progression at a single-cell and single-patient level and ultimately contributing to more personalized cancer care.
bioinformatics2026-02-14v1Discovery of TDP-43 aggregation inhibitors via a hybrid machine learning framework
Kapsiani, S.; Vora, S.; Fernandez-Villegas, A.; Kaminski, C. F.; Läubli, N. F.; Kaminski Schierle, G. S.AI Summary
- Researchers developed a hybrid machine learning approach using GNN embeddings, chemical descriptors, and biological annotations to identify TDP-43 aggregation inhibitors.
- The model screened 3,853 compounds, identifying berberrubine and PE859 as effective inhibitors, with molecular docking showing favorable interactions with TDP-43's RRM domain.
- Experimental validation confirmed that both compounds reduced TDP-43 aggregation in HEK cells, with PE859 significantly improving locomotor defects in C. elegans.
Abstract
TAR DNA-binding protein 43 (TDP-43) aggregation is a hallmark of several neurodegenerative diseases, including amyotrophic lateral sclerosis and frontotemporal dementia. Recent therapeutic efforts have highlighted the potential of small molecules capable of inhibiting TDP-43 aggregation; however, no effective treatments currently exist. Here, we developed a hybrid machine learning approach combining graph neural network (GNN) embeddings with traditional chemical descriptors and biological target annotations. Using XGBoost as the final classifier enabled model interpretability through SHAP analysis, allowing the identification of key chemical features and target annotations associated with TDP-43 anti-aggregation activity. Complementary Monte Carlo Tree Search analysis highlighted specific chemical substructures linked to predicted activity. By screening an external library of 3,853 small molecules, the model identified two compounds not previously evaluated against TDP-43 aggregation, namely berberrubine and PE859. Molecular docking analysis revealed that both compounds interact favourably with the TDP-43 RNA recognition motif (RRM) domain through distinct binding modes. Experimental validation showed that both compounds significantly reduced TDP-43 aggregation in HEK cells. Further testing in Caenorhabditis elegans expressing human TDP-43 demonstrated that PE859 significantly rescued locomotor defects, while berberrubine showed partial improvement. This work establishes a hybrid machine learning approach for accelerating small molecule drug discovery, yielding two promising therapeutic candidates for TDP-43 proteinopathies.
bioinformatics2026-02-14v1CPLfold: Chimeric and Pseudoknot-capable almost Linear-time RNA Secondary Structure Prediction
Wang, K.; Kudla, G.; Cohen, S. B.AI Summary
- CPLfold is a new RNA folding method that integrates thermodynamic modeling with chimeric evidence from RNA cross-linking and ligation to predict RNA secondary structures, including pseudoknots.
- It scales effectively for long sequences and outperforms existing methods in predicting global structures and long-range interactions, as shown in benchmarks like COMRADES and IRIS.
- The method offers flexibility through two parameters to balance the incorporation of chimeric evidence and pseudoknot prediction.
Abstract
Motivation: RNA structure plays a central role in how transcripts function, but inferring it reliably remains difficult, especially when pseudoknots need to be part of the prediction. Chemical probing experiments provide additional signals, yet these signals do not directly identify base pairing partners. RNA proximity ligation provides direct evidence of base pairing, but balancing this evidence with pseudoknot prediction accuracy and scalability of structure prediction for long sequences remains challenging. Results: We present CPLfold, a fast and flexible RNA folding method that combines thermodynamic modeling with chimeric evidence from RNA cross-linking and ligation experiments, while naturally supporting pseudoknots. CPLfold scales to long sequences and recovers more accurate global structures and long-range interactions than existing approaches across multiple benchmarks such as COMRADES and IRIS. By tuning two simple trade-off parameters (α, β), the method allows flexible control over how strongly chimeric evidence is incorporated and pseudoknots are asserted. Availability and Implementation: Source code and scripts are available at https://github.com/Vicky-0256/CPLfold.
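Purely as an illustration of the two knobs described above, a combined per-pair objective might look like the sketch below, with α weighting chimeric read support and β controlling how readily pseudoknotted pairs are asserted. The function name and functional form are assumptions for illustration, not CPLfold's actual objective.

```python
# Hypothetical combined score for a candidate base pair: thermodynamic
# favourability plus alpha-weighted chimeric read support, with a penalty
# on pseudoknot-crossing pairs that shrinks as beta approaches 1.
def pair_score(thermo: float, chimeric_reads: int, crosses_pk: bool,
               alpha: float = 1.0, beta: float = 0.5) -> float:
    score = -thermo                      # lower free energy -> higher score
    score += alpha * chimeric_reads      # reward experimentally ligated pairs
    if crosses_pk:
        score -= (1.0 - beta) * 2.0      # beta -> 1 tolerates pseudoknots
    return score

print(pair_score(thermo=-3.2, chimeric_reads=12, crosses_pk=True))
```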
bioinformatics2026-02-14v1Inferring a novel insecticide resistance metric and exposure variability in mosquito bioassays across Africa
Denz, A.; Kont, M. D.; Sanou, A.; Churcher, T. S.; Lambert, B.AI Summary
- This study introduces a new predictive model to assess insecticide resistance in mosquitoes by incorporating data from intensity-dose susceptibility bioassays, addressing variability due to genetic factors.
- The model was fitted to data from across Africa, focusing on Burkina Faso, to estimate location-specific resistance heterogeneity and exposure differences in bioassays versus experimental huts.
- The approach aims to enhance malaria transmission models by providing a mechanistic understanding of insecticide resistance's public health impact.
Abstract
Malaria claims approximately 500,000 lives each year, and insecticide-treated nets (ITNs), which kill mosquitoes that transmit the disease, remain the most effective intervention. However, resistance to pyrethroids, the primary insecticide class used in ITNs, has risen dramatically in Africa, making it difficult to assess the current public health impact of pyrethroid-ITNs. Past work has modelled the relation between pyrethroid susceptibility measured in discriminating-dose susceptibility bioassays and ITN effectiveness in experimental hut trials. Here, we introduce a new predictive approach that accounts for heterogeneity in insecticide resistance within wild mosquito populations, for example, due to genetic variability, by incorporating data from newly recommended intensity-dose susceptibility bioassays. We fit our mathematical model to a comprehensive data set that combines discriminating dose bioassays from all over Africa, intensity dose bioassays from Burkina Faso, and concurrent experimental hut trials. Our analysis estimates location- and insecticide-specific variation in resistance heterogeneity in Burkina Faso and quantifies differences in insecticide exposure in bioassays and experimental huts. By providing a mechanistic understanding of these experimental data, our approach could be integrated into malaria transmission models to account for the public health impact of insecticide resistance detected by surveillance programmes.
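As background for the intensity-dose bioassays mentioned above, a minimal dose-response fit looks like the sketch below: a two-parameter log-logistic mortality curve fitted to (dose, mortality) pairs. The data are toy numbers, and the paper's hierarchical model is considerably richer than this illustration.

```python
# Fit a two-parameter log-logistic mortality curve to intensity-dose
# bioassay data (doses expressed as multiples of the discriminating dose).
import numpy as np
from scipy.optimize import curve_fit

def mortality(dose, lc50, slope):
    # Log-logistic: 0.5 mortality at dose == lc50, steepness set by slope.
    return 1.0 / (1.0 + (dose / lc50) ** (-slope))

doses = np.array([0.25, 0.5, 1.0, 2.0, 5.0, 10.0])       # toy doses
observed = np.array([0.10, 0.22, 0.45, 0.70, 0.90, 0.97])  # toy mortality
(lc50, slope), _ = curve_fit(mortality, doses, observed, p0=[1.0, 1.0])
print(f"LC50 ~ {lc50:.2f}x discriminating dose, slope ~ {slope:.2f}")
```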
bioinformatics2026-02-12v4Predicting interaction-specific protein-protein interaction perturbations by missense variants with MutPred-PPI
Stewart, R.; Laval, F.; Coppin, G.; Spirohn-Fitzgerald, K.; Tixhon, M.; Hao, T.; Calderwood, M. A.; Mort, M.; Cooper, D. N.; Vidal, M.; Radivojac, P.AI Summary
- MutPred-PPI, a graph attention network, was developed to predict the interaction-specific effects of missense variants on protein-protein interactions using AlphaFold 3-based contact graphs and protein language model embeddings.
- The model outperformed existing methods with AUCs of 0.85 for seen proteins and 0.72 for unseen proteins, showing strong generalizability.
- Application to various datasets revealed distinct PPI perturbation patterns, with disease-associated variants showing enrichment for edgetic effects, particularly in cancer and neurodevelopmental disorders.
Abstract
Disruption of protein-protein interactions (PPIs) is a major mechanism of a variant's deleterious effect. Computational tools are needed to assess such variants at scale, yet existing predictors rarely consider loss of specific interactions, particularly when variants perturb binding interfaces without significantly affecting protein stability. To address this problem, we present MutPred-PPI, a graph attention network that predicts interaction-specific (edgetic) effects of missense variants by operating on AlphaFold 3-based protein complex contact graphs with protein language model embeddings imposed upon nodes. We systematically evaluated our model with stringent group cross-validation as well as benchmark data recently collected within the IGVF Consortium. MutPred-PPI outperformed all baseline methods across all evaluation criteria, achieving an AUC of 0.85 on seen proteins and 0.72 on previously unseen proteins in cross-validation, demonstrating strong generalizability despite scarce training data. To demonstrate biomedical relevance, we applied MutPred-PPI to variants from ClinVar, HGMD, COSMIC, gnomAD, and two de novo neurodevelopmental disorder-linked datasets. Disease-associated variants from ClinVar and HGMD showed strong enrichment for both quasi-null and edgetic effects, whereas population variants from gnomAD increasingly preserved interactions with higher allele frequencies. Notably, we observed a strong edgetic disruption signature in highly recurrent cancer variants from both the full COSMIC dataset and a subset of variants from oncogenes. Recurrent tumor suppressor gene variants and autism spectrum disorder-associated variants exhibited moderate quasi-null enrichment, whilst neurodevelopmental disorder-linked variants showed a weak edgetic disruption signature. These results indicate distinct PPI perturbation mechanisms across disease types and show that MutPred-PPI captures functionally relevant molecular effects of pathogenic variants.
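To sketch the model family named above, a graph attention network over a residue contact graph with protein-language-model node features could be assembled roughly as follows (assuming torch_geometric is available); this illustrates the architecture class, not the authors' MutPred-PPI implementation.

```python
# Minimal GAT classifier over a residue contact graph whose node features
# are pLM embeddings; output is a per-complex perturbation probability.
import torch
from torch_geometric.nn import GATConv, global_mean_pool

class EdgeticGAT(torch.nn.Module):
    def __init__(self, in_dim: int, hidden: int = 64, heads: int = 4):
        super().__init__()
        self.gat1 = GATConv(in_dim, hidden, heads=heads)
        self.gat2 = GATConv(hidden * heads, hidden, heads=1)
        self.head = torch.nn.Linear(hidden, 1)

    def forward(self, x, edge_index, batch):
        # x: (n_residues, in_dim) embeddings; edge_index: contact edges.
        h = torch.relu(self.gat1(x, edge_index))
        h = torch.relu(self.gat2(h, edge_index))
        h = global_mean_pool(h, batch)              # one vector per complex
        return torch.sigmoid(self.head(h))          # P(interaction perturbed)
```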
bioinformatics2026-02-12v2Reading TEA leaves for de novo protein design
Pantolini, L.; Durairaj, J.AI Summary
- The study explores de novo protein design using a 20-letter structure-inspired alphabet from protein language model embeddings to enhance Monte Carlo sampling efficiency.
- This approach allows for rapid template-guided and unconditional design of protein sequences that meet in silico designability criteria, without relying on known homologues.
- The method significantly reduces the time required for protein design, opening new avenues for therapeutic and industrial applications.
Abstract
De novo protein design expands the functional protein universe beyond natural evolution, offering vast therapeutic and industrial potential. Monte Carlo sampling in protein design is under-explored due to the long simulation times typically required and the prohibitive cost of current structure prediction oracles. Here we make use of a 20-letter structure-inspired alphabet derived from protein language model embeddings to score random mutagenesis-based Metropolis sampling of amino acid sequences. This facilitates fast template-guided and unconditional design, generating sequences that satisfy in silico designability criteria without known homologues. Ultimately, this unlocks a new path to fast de novo protein design.
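The core loop described above is plain Metropolis sampling over sequence space; a minimal version is sketched below with the scoring function left abstract (the toy `score` here just counts alanines, whereas the paper scores sequences via its 20-letter structural alphabet).

```python
# Metropolis sampling of amino acid sequences against a black-box score.
import math
import random

AA = "ACDEFGHIKLMNPQRSTVWY"

def metropolis(seq: str, score, steps: int = 1000, temp: float = 1.0) -> str:
    current, s_cur = list(seq), score(seq)
    for _ in range(steps):
        pos = random.randrange(len(current))
        old = current[pos]
        current[pos] = random.choice(AA)          # propose a point mutation
        s_new = score("".join(current))
        # Accept if better, else with Boltzmann probability.
        if s_new >= s_cur or random.random() < math.exp((s_new - s_cur) / temp):
            s_cur = s_new
        else:
            current[pos] = old                    # reject: revert mutation
    return "".join(current)

# Toy objective: maximize alanine content (illustration only).
print(metropolis("MKV" * 10, score=lambda s: s.count("A"), steps=500))
```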
bioinformatics2026-02-12v2GeneReL: A Large Language Model-Powered Platform for Gene Regulatory Relationship Extraction with Community Curation
Park, J.-S.; Ha, S.; Lee, Y.; Kang, Y. J.AI Summary
- GeneReL is a platform developed to extract and curate gene regulatory relationships in Arabidopsis thaliana using large language models (LLMs) and community validation.
- It uses a tiered pipeline with different LLMs for screening, extraction, and verification, and includes a five-step gene normalization process.
- The platform has curated 13,710 interactions, with 86.8% unique compared to IntAct, and features interactive visualization and community voting for validation.
Abstract
Motivation: Gene regulatory networks provide fundamental insights into plant biology, yet extracting structured interaction data from scientific literature remains a significant bottleneck. Traditional manual curation cannot scale to meet the demands of modern research, while automated text mining approaches struggle with the complexity of gene nomenclature and relationship classification. Large language models offer promising capabilities for information extraction, but integrated platforms combining LLM extraction with community validation for plant regulatory databases remain scarce. Results: We developed GeneReL, an integrated platform combining LLM-based extraction with community-driven curation for gene regulatory networks in Arabidopsis thaliana. The system employs a tiered pipeline using Claude Haiku 4.5 for screening, Claude Sonnet 4 for extraction, and Claude Opus 4 for verification, along with a novel five-step gene normalization pipeline incorporating paper-text search and LLM-based disambiguation with UniProt annotations. The database contains 13,710 curated interactions across 51 relationship types, with 90.2% classified as high confidence based on linguistic certainty markers in source text. Comparison with IntAct reveals 86.8% of interactions are unique to our literature-derived database, demonstrating complementary coverage to existing resources. The web platform provides card-based browsing with voting capabilities, interactive network visualization using Cytoscape.js with locus-ID-based node consolidation, and administrative interfaces for curator review of ambiguous gene mappings.
bioinformatics2026-02-12v2LineageSim: A Single-Cell Lineage Simulator with Fate-Aware Gene Expression
Lai, H.; Sadria, M.AI Summary
- LineageSim is introduced as a simulator that generates single-cell lineage data with fate-aware gene expression, addressing the limitation of existing simulators which lack long-range temporal dependencies.
- The simulator includes latent signals in progenitor states that predict future cell fates, providing a benchmark for cell fate prediction algorithms.
- Validation through logistic regression showed a 68.3% balanced accuracy, confirming the presence of predictive fate information in the simulated data.
Abstract
Single-cell lineage data paired with gene expression are critical for developing computational methods in developmental biology. Since experimental lineage tracing is often technically limited, robust simulations are necessary to provide the ground truth for rigorous validation. However, existing simulators generate largely Markovian gene expression, failing to encode the fate bias observed in real biological systems, where progenitor states exhibit early signatures of future commitment. Consequently, they cannot support the training and evaluation of computational methods that model long-range temporal dependencies. We present LineageSim, a generative framework that introduces fate-aware gene expression, where progenitor states carry latent signals of their descendants' terminal fates. This framework establishes a new class of benchmarks for cell fate prediction algorithms. We validate the presence of these temporal signals by training a logistic regression baseline, which achieves 68.3% balanced accuracy. This confirms that the generated data contain subtle but recoverable fate information, in contrast to existing simulators, where such predictive signals are systematically absent.
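The validation step has a simple shape: a logistic-regression probe predicting terminal fate from progenitor expression, scored by balanced accuracy. The sketch below reproduces that shape on synthetic stand-in data; with real LineageSim output the features would be progenitor-state expression vectors.

```python
# Logistic-regression probe for recoverable fate information, scored by
# balanced accuracy. Data here are synthetic stand-ins for LineageSim
# progenitor expression and terminal-fate labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(2000, 50))                        # progenitor expression
w = rng.normal(size=50)
fate = (X @ w + rng.normal(scale=6.0, size=2000)) > 0  # noisy fate signal

X_tr, X_te, y_tr, y_te = train_test_split(X, fate, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"balanced accuracy: {balanced_accuracy_score(y_te, clf.predict(X_te)):.3f}")
```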
bioinformatics2026-02-12v2A hyperparameter benchmark of VAE-based methods for scRNA-seq batch integration
Kassab, M.; Maniero, L.; Beltrame, E.AI Summary
- The study benchmarks hyperparameters of VAE-based methods (scVI, MrVI, LDVAE) for scRNA-seq batch integration, using 960 trainings across four datasets and two feature regimes.
- Evaluations with scib metrics showed scVI excels in batch correction, LDVAE preserves biological structure better in some datasets, and MrVI is effective in multi-protocol settings but resource-intensive.
- Results indicated that training with highly variable genes (HVGs) generally outperformed full-gene training, and higher latent dimensionality (>30) often balanced batch mixing with biological conservation.
Abstract
We present the first systematic benchmark of model architecture hyperparameters for variational autoencoder (VAE) methods for single-cell RNA-seq batch integration within scvi-tools, comparing scVI, MrVI, and LDVAE across four heterogeneous datasets under two feature regimes (all genes vs highly variable genes (HVGs)). We investigated 960 trainings (120 configurations) varying latent size and network depth/width, and evaluated with a standardized scib metric suite covering batch removal and biological conservation (Batch ASW, PCR batch, iLISI, graph connectivity, NMI, ARI, label ASW, isolated-label F1/ASW, cLISI, trajectory conservation), plus qualitative UMAP/t-SNE and PCA, random projection, and unintegrated baselines. Results show dataset-dependent trade-offs: scVI performs best overall via stronger batch correction; LDVAE can better preserve biological structure in some datasets; MrVI is stable and excels at batch correction in multi-protocol settings, but is more resource-intensive. HVG-only training generally outperforms full-gene training for all models. Hyperparameter analysis suggests moderate-to-high latent dimensionality (>30) often gives the best balance; sensitivity to latent size tracks dataset heterogeneity (tissues, labs, chemistries, gene coverage), with larger latents improving batch mixing but sometimes reducing biological conservation. We provide model- and dataset-specific guidelines for practical defaults and tuning of VAE-based integration in single-cell studies.
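For orientation, a single cell of such a hyperparameter grid can be run with the public scvi-tools API roughly as below; `adata` is assumed to be an AnnData object with a `batch` column, and the scib metric evaluation is elided.

```python
# One grid point of an scVI hyperparameter sweep: vary latent size and
# network depth/width, then export the latent embedding for scoring.
import scvi

def train_scvi(adata, n_latent: int, n_layers: int, n_hidden: int):
    scvi.model.SCVI.setup_anndata(adata, batch_key="batch")
    model = scvi.model.SCVI(
        adata, n_latent=n_latent, n_layers=n_layers, n_hidden=n_hidden
    )
    model.train()
    # Latent embedding to be scored with scib batch/bio-conservation metrics.
    return model.get_latent_representation()

# Example grid point in the regime discussed above:
# Z = train_scvi(adata, n_latent=30, n_layers=2, n_hidden=128)
```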
bioinformatics2026-02-12v1Structure-guided analysis and prediction of human E2-E3 ligase pairing specificity
Jarboe, B.; Dunbrack, R.AI Summary
- This study addresses the specificity of E2-E3 ligase interactions in ubiquitination by analyzing experimental structures from the PDB and using AlphaFold to predict thousands of ubiquitin-E2-E3 ternary complexes.
- A machine learning model was developed to predict functional E2-E3 pairings, enhancing the understanding of ubiquitination networks.
- The model predicted E2 partners for 88 E3 ligases, including a novel pairing between UBE2C and RNF214, potentially linking them in hepatocellular carcinoma pathways.
Abstract
Protein ubiquitination, directed by specific E3 ligases, constitutes the primary cellular pathway for selective protein degradation. In addition to targeting proteins for degradation, ubiquitination can mediate new protein-protein interactions, and otherwise modulate protein function, thereby regulating key cellular processes such as DNA repair and immune responses. Recently, Proteolysis-Targeting Chimeras (PROTACs), and related proximity-inducing agents, have revealed the significant therapeutic potential of co-opting ubiquitin ligase activity to induce the selective degradation of disease-relevant proteins. Despite the biological and clinical significance of this pathway, fundamental gaps remain in our understanding of ubiquitination networks, particularly regarding the specificity of E2-E3 interactions and their substrate preferences. In this study, we leverage analysis of experimental structures in the Protein Data Bank (PDB) and use AlphaFold to generate structures of thousands of ubiquitin-E2-E3 ternary complexes. Using these predicted structures and complementary analyses, we develop a machine learning model to predict functional E2-E3 pairings, advancing our ability to map ubiquitination networks and providing structural insights into functional ubiquitin-E2-E3 complexes. We demonstrate the utility of our model by predicting E2 partners for 88 putative E3 ligases lacking any previously known E2 interactors. Notably, we identify a predicted pairing between UBE2C and RNF214, two proteins recently implicated in hepatocellular carcinoma separately but through interrelated pathways, suggesting a potential functional link mediated by RNF214-dependent ubiquitination in partnership with UBE2C. Additionally, we present our web resource, UbiqCore, making the E2-E3 pairing predictions and ternary complex structures available to the scientific community (https://dunbrack.fccc.edu/ubiqcore).
bioinformatics2026-02-12v1Investigating Enzyme Function by Geometric Matching of Catalytic Motifs
Hackett, R. E.; Riziotis, I. G.; Larralde, M.; Ribeiro, A. J. M.; Zeller, G.; Thornton, J.AI Summary
- Developed a method using geometric matching to detect catalytic features in protein structures, utilizing a library of 6780 3D coordinate sets from 762 enzyme mechanisms.
- The approach was validated on 3751 high-quality experimental enzyme structures and predicted human proteome structures, showing higher sensitivity in identifying enzyme homology than sequence or 3D-structure-based methods.
- This method identifies structural similarities in catalytic sites of divergent enzymes, offering insights into enzyme function evolution, and is available as the Python module Enzyme Motif Miner.
Abstract
The rapidly growing universe of predicted protein structures offers opportunities for data-driven exploration but requires computationally scalable and interpretable tools. We developed a method to detect catalytic features in protein structures, providing insights into enzyme function and mechanism. A library of 6780 3D coordinate sets describing enzyme catalytic sites, referred to as templates, has been collected from manually curated examples of 762 enzyme catalytic mechanisms described in the Mechanism and Catalytic Site Atlas. For template searching we optimised the geometric-matching algorithm Jess. We implemented RMSD and residue orientation filters to differentiate catalytically informative matches from spurious ones. We validated this approach on a non-redundant set of high-quality experimental enzyme structures (n=3751, <40% amino acid identity) with well-annotated catalytic sites, as well as predicted structures of the human proteome. We show that matching catalytic templates solely on structure is more sensitive than sequence- and 3D-structure-based approaches in identifying homology between distantly related enzymes. Since geometric matching does not depend on conserved sequence motifs or even common evolutionary history, we are able to identify examples of structural active site similarity in highly divergent and possibly convergent enzymes. Such examples make interesting case studies into the evolution of enzyme function. Though not intended for characterizing substrate-specific binding pockets, the speed and knowledge-driven interpretability of our method make it well suited for expanding enzyme active-site annotation across large predicted proteomes. We provide the method and template library as a Python module, Enzyme Motif Miner, at https://github.com/rayhackett/enzymm.
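The RMSD filter mentioned above presupposes an optimal rigid-body superposition of template and candidate-site atoms, conventionally computed with the Kabsch algorithm. The sketch below shows that geometric criterion in isolation; it is not the Jess implementation.

```python
# Kabsch superposition: minimal RMSD between two atom coordinate sets
# after optimal rotation and translation.
import numpy as np

def kabsch_rmsd(P: np.ndarray, Q: np.ndarray) -> float:
    """RMSD between two (n, 3) coordinate sets after optimal rigid-body
    superposition."""
    P = P - P.mean(axis=0)                         # remove translation
    Q = Q - Q.mean(axis=0)
    V, _, Wt = np.linalg.svd(P.T @ Q)
    d = np.sign(np.linalg.det(V @ Wt))             # avoid reflections
    R = V @ np.diag([1.0, 1.0, d]) @ Wt            # optimal rotation
    return float(np.sqrt(np.mean(np.sum((P @ R - Q) ** 2, axis=1))))

rng = np.random.default_rng(4)
site = rng.normal(size=(5, 3))                     # 5 catalytic atoms
noisy = site + rng.normal(scale=0.05, size=(5, 3))
print(f"{kabsch_rmsd(site, noisy):.3f}")           # small RMSD expected
```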
bioinformatics2026-02-12v1tensorOmics: Data integration for longitudinal omics data using tensor factorisation
Kodikara, S.; Lu, B.; Wang, S.; Le Cao, K.-A.AI Summary
- The study introduces tensorOmics, a framework using tensor factorization for integrating longitudinal multi-omics data, addressing the limitations of traditional matrix-based methods.
- tensorOmics includes both supervised and unsupervised methods for single and multi-omic analyses, preserving temporal structures and integrating phenotypic responses.
- Validation through case studies showed tensorOmics effectively differentiates treatment groups, captures time-dependent molecular signatures, and reveals coordinated responses across omics layers.
Abstract
Multi-omics studies capture comprehensive molecular profiles across biological layers to understand complex biological processes. A central challenge is integrating information across heterogeneous data types to identify coordinated molecular responses, particularly when measurements are collected longitudinally. Traditional integration methods can be broadly classified as unsupervised (exploring patterns without phenotypic information) or supervised (discriminating between groups or predicting outcomes). These approaches rely predominantly on matrix-based techniques that concatenate or project data into lower-dimensional spaces. However, matrix methods struggle with longitudinal data, as flattening multi-dimensional structures obscures temporal trajectories and violates independence assumptions. Tensor-based methods preserve the natural multi-way structure of longitudinal data but existing approaches are predominantly unsupervised, cannot incorporate phenotypic responses for discriminant analysis, and lack frameworks for integrating multiple omics layers. We present tensorOmics, a comprehensive framework for longitudinal omics analysis using tensor factorisation. The framework encompasses supervised and unsupervised methods for both single-omic (tensor PCA, tensor PLS discriminant analysis) and multi-omic settings (tensor PLS, block tensor PLS, block tensor PLS discriminant analysis). This unified approach captures coordinated responses across biological layers while preserving temporal structure. We validated tensorOmics through three case studies: antibiotic perturbation experiments, anaerobic digestion systems, and fecal microbiota transplantation. These applications demonstrate tensorOmics differentiates treatment groups, captures time-dependent molecular signatures, and reveals multi-layer coordinated responses that cross-sectional methods miss.
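As a minimal illustration of the tensor view underlying the framework, a (sample × feature × time) array can be factorised with a CP/PARAFAC model, for example via the tensorly library; the package's supervised PLS and discriminant variants extend this idea, and only the plain factorisation is shown.

```python
# CP/PARAFAC decomposition of a longitudinal omics tensor: one factor
# matrix per mode (samples, features, time points).
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

rng = np.random.default_rng(5)
X = tl.tensor(rng.normal(size=(20, 100, 6)))       # samples x features x time
weights, factors = parafac(X, rank=3, init="random")
sample_f, feature_f, time_f = factors              # one factor matrix per mode
print(sample_f.shape, feature_f.shape, time_f.shape)  # (20, 3) (100, 3) (6, 3)
```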
bioinformatics2026-02-12v1Deep learning-based non-invasive profiling of tumor transcriptomes from cell-free DNA for precision oncology
Patton, R. D.; Netzley, A.; Persse, T. W.; Nair, A.; Galipeau, P. C.; Coleman, I. M.; Itagi, P.; Chandra, P.; Adil, M.; Vashisth, M.; Sayar, E.; Hiatt, J. B.; Dumpit, R.; Kollath, L.; Demirci, R. A.; Ghodsi, A.; Lam, H.-M.; Morrissey, C.; Iravani, A.; Chen, D. L.; Hsieh, A. C.; MacPherson, D.; Haffner, M. C.; Nelson, P. S.; Ha, G.AI Summary
- The study introduces Triton for fragmentomic and nucleosome profiling of cfDNA and Proteus, a deep learning framework for predicting gene expression from standard depth whole genome sequencing of cfDNA.
- Proteus accurately reproduced gene expression profiles from ctDNA in patient-derived xenografts, similar to RNA-Seq replicates.
- When applied to patient cohorts, Proteus predicted expression of prognostic markers, phenotype markers, and therapeutic targets, demonstrating its utility in precision oncology.
Abstract
Circulating tumor DNA (ctDNA) profiling from liquid biopsies is increasingly adopted as a minimally invasive solution for clinical cancer diagnostic applications. Current methods for inferring gene expression from ctDNA require specialized assays or ultra-deep, targeted sequencing, which preclude transcriptome-wide profiling at single-gene resolution. Herein we jointly introduce Triton, a tool for comprehensive fragmentomic and nucleosome profiling of cell-free DNA (cfDNA), and Proteus, a multi-modal deep learning framework for predicting single gene expression, using standard depth (~30-120x) whole genome sequencing of cfDNA. By synthesizing fragmentation and inferred nucleosome positioning patterns in the promoter and gene body from Triton, Proteus reproduced expression profiles using pure ctDNA from patient-derived xenografts (PDX) with an accuracy similar to RNA-Seq technical replicates. Applying Proteus to cfDNA from four patient cohorts with matched tumor RNA-Seq, we show that the model accurately predicted the expression of specific prognostic and phenotype markers and therapeutic targets. As an analog to RNA-Seq, we further confirmed the immediate applicability of Proteus to existing tools through accurate prediction of gene pathway enrichment scores. Our results demonstrate the potential clinical utility of Triton and Proteus as non-invasive tools for precision oncology applications such as cancer monitoring and therapeutic guidance.
bioinformatics2026-02-12v1Splicer: Phylogenetic Placement in Sub-Linear Time
Markin, A.; Anderson, T. K.AI Summary
- Splicer is developed to perform phylogenetic placement in sub-linear time, specifically O(√n), addressing the scalability issues of existing methods like pplacer and EPA-ng.
- It decomposes the reference tree into "blobs" and constructs a scaffold tree, then places query sequences first on the scaffold and then within blobs for precision.
- Splicer demonstrated high accuracy on an influenza A dataset and was applied to over 12 million SARS-CoV-2 genomes, scaling maximum-likelihood placement to large datasets.
Abstract
Motivation: Phylogenetic placement is an established approach for rapidly classifying new genetic sequences and updating a phylogeny without fully recomputing it. Popular maximum-likelihood placement methods, such as pplacer and EPA-ng, tend to struggle computationally when the size of the reference tree increases to tens or hundreds of thousands of sequences. As a more scalable alternative, distance-based and parsimony-based placement methods such as UShER were introduced. These methods, in principle, scale linearly as the size of the reference tree grows. However, as the scale of genetic and genomic sequences continues to grow nearly exponentially, developing algorithms that can perform placement in sub-linear time while maintaining accuracy becomes more crucial. Results: Here, we develop Splicer, the first such algorithm that can perform placement in guaranteed O(√n) time. To achieve this performance, Splicer first decomposes the original reference tree into "blobs" and constructs a phylogenetic scaffold tree linking representatives from different blobs. Every blob in such a decomposition has at most c√n taxa, and the scaffold tree has at most (4/c)√n leaves, where c is any constant. Then, given the query sequences for placement, they are first placed onto the scaffold tree using pplacer or EPA-ng, and then placed more precisely within the respective blobs. We demonstrate the high accuracy of Splicer on an empirical influenza A virus dataset that has sparse coverage due to limited genomic surveillance. We also show that Splicer can, for the first time, apply maximum-likelihood placement to COVID-19 pandemic-scale data using a dataset with over 12 million SARS-CoV-2 reference genomes. Splicer scales the highly accurate maximum-likelihood approaches implemented in pplacer and EPA-ng to trees with millions of taxa and eliminates the necessity to curate and subsample genomic datasets for real-time classifications. Availability and implementation: Splicer tool and source code are freely available at https://github.com/flu-crew/splicer.
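A schematic of the two-stage procedure follows, with the blob decomposition and the per-stage placer as hypothetical helpers (the real tool delegates to pplacer or EPA-ng); the point is simply that both stages operate on trees of only about √n leaves.

```python
# Schematic only: two-stage placement over a sqrt-decomposed reference
# tree. `blobs`, `scaffold`, and `place_on` are hypothetical stand-ins.
import math

def two_stage_place(query, blobs, scaffold, place_on):
    """blobs: subtrees with <= c*sqrt(n) taxa each; scaffold: tree over
    blob representatives with <= (4/c)*sqrt(n) leaves; place_on(query,
    tree): any maximum-likelihood placement routine."""
    scaffold_edge = place_on(query, scaffold)      # stage 1: coarse placement
    blob = blobs[scaffold_edge.blob_id]            # blob behind that edge
    return place_on(query, blob)                   # stage 2: precise placement

# With n = 12,000,000 reference taxa and c = 1, each stage sees a tree of
# roughly sqrt(n) leaves:
print(math.isqrt(12_000_000), "leaves per stage")  # -> 3464
```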
bioinformatics2026-02-12v1Spatiotemporal cell type deconvolution leveraging tissue structure
Lobo, M. M.; Zhang, Z.; Zhang, X.AI Summary
- SpaDecoder is introduced as a method for cell type deconvolution in spatial transcriptomics, utilizing 3D tissue structure through an adaptive Gaussian kernel.
- It accounts for variability in single-cell reference profiles and batch effects, enhancing the accuracy of cell type distribution estimation.
- Comparisons and ablation tests demonstrate SpaDecoder's superior performance in leveraging 3D tissue structure for improved deconvolution across various datasets.
Abstract
Spot-based spatial transcriptomics (ST) captures aggregated transcriptomic profiles at spatial locations (spots) in tissue slices. Cell type deconvolution methods decode each spot and estimate the proportion of every cell type in the spot, which is necessary for uncovering spatial cell type distributions for further downstream analyses. Existing methods utilize cell type markers or reference transcriptomic (scRNA-seq) atlases, either at single-cell (sc) resolution or by aggregating profiles of identified cell types. However, current methods fail to effectively utilize the 3D tissue layout and single-cell-resolution references. Some leverage 2D spatial organization assuming proximal spots are similar, an assumption that may be violated around boundaries or isolated cell types. We present SpaDecoder, a parallelized matrix factorization-based per-spot deconvolution method for multiple 3D spatial or temporal ST tissue slices, effectively leveraging tissue structure with an adaptively inferred 3D neighborhood Gaussian kernel. We additionally account for variability in sc-reference profiles, along with batch effects. The mathematical framework of SpaDecoder allows it to be used for a range of downstream analyses. It can decode anteroposterior variability, impute gene expression, uncover putatively key tissue regions, identify colocalized cell types, and predict spatio-temporal scRNA-seq cell locations. Ablation tests, along with comparisons against other methods on various metrics, datasets, and scenarios, collectively show that SpaDecoder effectively harnesses 3D tissue structure and sc-reference profiles to improve cell type deconvolution. SpaDecoder is available at https://github.com/ZhangLabGT/spadecoder.
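The 3D neighborhood kernel mentioned above can be pictured as Gaussian weights over spot coordinates (x, y, slice z); the sketch below fixes the bandwidth for illustration, whereas SpaDecoder infers it adaptively from the data.

```python
# Row-normalised Gaussian neighbourhood weights over 3D spot coordinates.
import numpy as np
from scipy.spatial.distance import cdist

def gaussian_kernel(coords: np.ndarray, bandwidth: float) -> np.ndarray:
    """coords: (n_spots, 3) array of spot positions across slices."""
    d2 = cdist(coords, coords, metric="sqeuclidean")
    K = np.exp(-d2 / (2.0 * bandwidth ** 2))
    return K / K.sum(axis=1, keepdims=True)        # rows sum to 1

coords = np.random.default_rng(6).uniform(0, 100, size=(500, 3))
W = gaussian_kernel(coords, bandwidth=10.0)
print(W.shape, W[0].sum())                         # (500, 500) 1.0
```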
bioinformatics2026-02-12v1MOSAIC: A Spectral Framework for Integrative Phenotypic Characterization Using Population-Level Single-Cell Multi-Omics
Lu, C.; Kluger, Y.; Ma, R.AI Summary
- MOSAIC is a spectral framework designed to analyze population-scale single-cell multi-omics data by learning a joint feature x sample embedding, addressing limitations of existing cell-centric or feature-centric methods.
- It enables Differential Connectivity (DC) analysis, revealing regulatory network changes like the rewiring of proliferation programs in activated T cells post-vaccination, despite unchanged gene expression.
- Applied to an HIV+ cohort, MOSAIC identified a novel stress-driven neuronal subtype with increased protein synthesis, highlighting its utility in discovering biologically significant sample subgroups.
Abstract
Population-scale single-cell multi-omics offers unprecedented opportunities to link molecular variation to human health and disease. However, existing methods for single-cell multi-omics analysis are either cell-centric, prioritizing batch-corrected cell embeddings that neglect feature relationships, or feature-centric, imposing global feature representations that overlook inter-sample heterogeneity. To address these limitations, we present MOSAIC, a spectral framework that learns a high-resolution feature x sample joint embedding from population-scale single-cell multi-omics data. For each individual, MOSAIC constructs a sample-specific coupling matrix capturing complete intra- and cross-modality feature interactions, then projects these into a shared latent space via spectral decomposition. The joint feature x sample embedding defines each feature's connectivity profile per sample, enabling two key downstream applications. First, MOSAIC introduces Differential Connectivity (DC) analysis, which identifies features exhibiting regulatory network rewiring across conditions even when their expression or abundance remains unchanged. Applied to a CITE-seq vaccination cohort, MOSAIC revealed rewiring of proliferation programs in activated T cells, highlighting a functional shift in STAT5B despite stable expression. Second, MOSAIC enables identification of biologically meaningful sample subgroups by isolating coherent multimodal feature modules. Applied to an HIV+ prefrontal cortex cohort, MOSAIC uncovered a novel stress-driven neuronal subtype within HIV+ samples characterized by elevated protein synthesis without chromatin accessibility changes. MOSAIC provides a general-purpose framework for systems-level phenotypic characterization, offering novel biological insights from population-scale multi-omic studies.
bioinformatics2026-02-12v1SLECA: a single-cell atlas of systemic lupus erythematosus enabling rare cell discovery using graph transformer
Duan, M.; Shi, Y.; Tian, H.; Wu, Q.; Wang, X.; Liu, B.AI Summary
- The study introduces SLECA, a large-scale single-cell atlas for systemic lupus erythematosus (SLE), using a graph-transformer framework to identify rare immune cell populations.
- SLECA integrates 366 samples, identifying 54 cell types, including disease-relevant rare populations like double-negative T cells (DNTs), which correlate with clinical severity.
- In silico perturbation showed that transcription factors JUN and EGR1 can reprogram DNTs, suggesting potential therapeutic targets in SLE.
Abstract
Systemic lupus erythematosus (SLE) is a highly heterogeneous autoimmune disease with complex immune and molecular dysregulation. While rare immune cell populations are increasingly recognized as critical drivers of disease pathogenesis and progression, the lack of sufficiently powered, comprehensive single-cell atlases has limited their systematic identification and characterization. To address this gap, we present SLECA, the first large-scale single-cell atlas of SLE, enabled by a novel graph-transformer framework for the interpretable discovery and analysis of disease-relevant rare cell populations. SLECA integrates 366 single-cell samples with standardized clinical and biological metadata, providing a comprehensive and analytically unified atlas of systemic lupus erythematosus. By enabling scalable integration and interpretable analysis, SLECA resolves 54 distinct cell types, including rare populations with critical disease relevance. Notably, we identify double-negative T cells (DNTs) as a disease-expanded population whose abundance correlates with clinical severity. Through in silico perturbation, we demonstrate that key transcription factors, specifically JUN and EGR1, can reprogram DNT cells toward conventional T-cell phenotypes, highlighting actionable regulatory vulnerabilities in SLE.
bioinformatics2026-02-12v1Taxonomy-aware, disorder-matched benchmarking of phase-separating protein predictors
Hou, S.; Shen, H.; Zhang, Y.AI Summary
- The study addresses biases in existing benchmarks for phase-separating protein (PSP) predictors due to taxonomic and intrinsic-disorder imbalances.
- A new taxonomy-aware, disorder-matched benchmark was developed, revealing that PSP features vary by taxa but LLPS-associated shifts are conserved.
- Benchmarking 20 PSP predictors showed taxon-dependent performance variations, with PSPs lacking IDRs being particularly challenging, suggesting the need for disorder-stratified evaluations.
Abstract
Background: Biomolecular condensates formed via liquid-liquid phase separation (LLPS) play vital roles in cellular organization and function. Computational prediction of phase-separating proteins (PSPs) is increasingly used to prioritize candidates at proteome scale, making robust, well-designed benchmarks essential for fair evaluation and iterative improvement of PSP predictors. Results: We first show that a recently released PSP benchmark is substantially confounded by the imbalances in taxonomic origin and intrinsic-disorder compositions between positive and negative sets, allowing predictors to achieve high apparent performance by exploiting non-LLPS shortcuts and obscuring their true ability to distinguish PSPs. To minimize these effects, we construct a taxonomy-aware, disorder-matched PSP benchmark. Using this benchmark, we find that absolute sequence and biophysical feature values of PSPs differ markedly across taxa, whereas LLPS-associated feature shifts relative to taxon-specific proteome backgrounds are comparatively conserved. Benchmarking twenty PSP predictors under this framework reveals pronounced taxon-dependent variation in performance. Moreover, PSPs lacking IDRs consistently constitute a more challenging regime across methods, motivating routine disorder-stratified evaluation. Conclusions: Our taxonomy-aware, disorder-matched benchmarking framework reduces shortcut-driven biases, enables more interpretable evaluation of PSP predictors, and provides guidance for developing models that capture transferable LLPS-associated signals rather than dataset- or taxon-specific shortcuts.
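The matching idea can be made concrete as follows: for each positive PSP, draw a negative from the same taxon with the closest intrinsic-disorder fraction, so taxonomy and disorder alone cannot separate the sets. The column names below are illustrative, not the benchmark's actual schema.

```python
# Greedy taxon- and disorder-matched negative selection (illustrative).
import pandas as pd

def disorder_matched_negatives(pos: pd.DataFrame, neg_pool: pd.DataFrame):
    picks = []
    for _, p in pos.iterrows():
        cand = neg_pool[(neg_pool.taxon == p.taxon)
                        & ~neg_pool.index.isin(picks)]
        if cand.empty:
            continue                               # no same-taxon negative left
        picks.append((cand.disorder_frac - p.disorder_frac).abs().idxmin())
    return neg_pool.loc[picks]

pos = pd.DataFrame({"taxon": ["Human", "Yeast"], "disorder_frac": [0.42, 0.15]})
neg_pool = pd.DataFrame({"taxon": ["Human", "Human", "Yeast"],
                         "disorder_frac": [0.40, 0.05, 0.17]})
print(disorder_matched_negatives(pos, neg_pool))
```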
bioinformatics2026-02-12v1Decoding the Molecular Language of Proteins with Evolla
Zhou, X.; Han, C.; Zhang, Y.; Du, H.; Tian, J.; Su, J.; Liu, R.; Zhuang, K.; Jiang, S.; Gitter, A.; Liu, L.; Li, H.; Wu, M.; You, S.; Yuan, Z.; Ju, F.; Zhang, H.; Zheng, W.; Dai, F.; Zhou, Y.; Tao, Y.; Wu, D.; Shao, Z.; Liu, Y.; Lu, H.; Yuan, F.AI Summary
- Evolla is an interactive protein-language model trained on 546 million protein-text pairs, designed to interpret protein function through natural language queries.
- It outperforms general large language models in functional inference and matches state-of-the-art supervised models in zero-shot performance.
- Applications include identifying eukaryotic signature proteins in Asgard archaea and discovering a novel PET hydrolase, PsPETase, validated for plastic degradation.
Abstract
Proteins, nature's intricate molecular machines, are the products of billions of years of evolution and play fundamental roles in sustaining life. Yet, deciphering their molecular language - understanding how sequences and structures encode biological functions - remains a cornerstone challenge. Here, we introduce Evolla, an interactive protein-language model designed to transcend static classification by interpreting protein function through natural language queries. Trained on 546 million protein-text pairs and refined via Direct Preference Optimization, Evolla couples high-dimensional molecular representations with generative semantic decoding. Benchmarking establishes Evolla's superiority over general large language models in functional inference, demonstrates zero-shot performance parity with the state-of-the-art supervised model, and exposes remote functional relationships invisible to conventional alignment. We validate Evolla through two distinct applications: identifying candidate eukaryotic signature proteins in Asgard archaea, with functional Vps4 homologs validated via yeast complementation; and interactively discovering a novel deep-sea polyethylene terephthalate (PET) hydrolase, PsPETase, confirmed to degrade plastic films. These results position Evolla not merely as a predictor, but as a generative engine capable of complex hypothesis formulation, shifting the paradigm from static annotation to interactive, actionable discovery. The Evolla online service is available at http://www.chat-protein.com/.
bioinformatics2026-02-11v4