Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
Inferring a novel insecticide resistance metric and exposure variability in mosquito bioassays across Africa
Denz, A.; Kont, M. D.; Sanou, A.; Churcher, T. S.; Lambert, B.
Malaria claims approximately 500,000 lives each year, and insecticide-treated nets (ITNs), which kill mosquitoes that transmit the disease, remain the most effective intervention. However, resistance to pyrethroids, the primary insecticide class used in ITNs, has risen dramatically in Africa, making it difficult to assess the current public health impact of pyrethroid-ITNs. Past work has modelled the relation between pyrethroid susceptibility measured in discriminating-dose susceptibility bioassays and ITN effectiveness in experimental hut trials. Here, we introduce a new predictive approach that accounts for heterogeneity in insecticide resistance within wild mosquito populations, for example, due to genetic variability, by incorporating data from newly recommended intensity-dose susceptibility bioassays. We fit our mathematical model to a comprehensive data set that combines discriminating dose bioassays from all over Africa, intensity dose bioassays from Burkina Faso, and concurrent experimental hut trials. Our analysis estimates location- and insecticide-specific variation in resistance heterogeneity in Burkina Faso and quantifies differences in insecticide exposure in bioassays and experimental huts. By providing a mechanistic understanding of these experimental data, our approach could be integrated into malaria transmission models to account for the public health impact of insecticide resistance detected by surveillance programmes.
bioinformatics 2026-04-01 v5
Finding stable clusterings of single-cell RNA-seq data
Klebanoff, V. F.
Run a UMI count matrix through a clustering pipeline to obtain n cell clusters. Suppose that counts from the same experiment for an equal number of additional cells become available. Would including them change the results? Form the matrix containing both sets of counts, process it to obtain n clusters, restrict this (second) clustering to the initial cells and compare it with the initial clustering. If the clusterings are not consistent, conclude that the initial clustering is unstable. Although this scenario is unrealistic, it is practical to reverse the perspective: given a clustering, process samples of half of the cells. If their clusters are consistent with those of the full set of cells restricted to the samples, conclude that the clustering is stable. We use divisive hierarchical spectral clustering and describe a possibly novel mapping of the tree it produces to a set of nested clusterings. Positive affinities are defined for points (representing cells in Euclidean space) that are k-nearest neighbors (k is an input parameter). The affinity equals the inverse of the distance between the points. Ng, Jordan, and Weiss' algorithm divides a set of points into two clusters. The normalized cut measures the clusters' separation. Recursion generates a hierarchy of clusters. Viewing clusters as nodes of a tree, set the length of the branch between a node and each of its daughters to the normalized cut. Nodes' distances from the root define the mapping of the tree to nested clusterings. For four large data sets, this gave clusterings compatible with published results. Analysis is performed for all cells and for multiple pairs of complementary samples of cells. For a given number of clusters, each sample's clustering and clusters are compared with those of the full data set (restricted to the sample). This provides measures of the stability of the clustering and its clusters. 
For two of the large data sets, the clusterings compatible with published results were judged to be stable.
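The affinity and normalized-cut ingredients described in this abstract can be illustrated with a minimal sketch. It is not the author's implementation: the symmetrization rule (an edge exists if either point is among the other's k nearest neighbors) and the Shi-Malik form of the normalized cut are assumptions, and the function names `knn_affinity` and `normalized_cut` are hypothetical.

```python
import math

def knn_affinity(points, k):
    """Affinity matrix: inverse distance between k-nearest neighbours, else 0.

    Assumption: the matrix is symmetrized, so an edge exists when either
    point is among the other's k nearest neighbours.
    """
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Distances from point i to every other point, nearest first.
        order = sorted((math.dist(points[i], points[j]), j)
                       for j in range(n) if j != i)
        for d, j in order[:k]:
            if d > 0:  # coincident points get no edge in this sketch
                w[i][j] = w[j][i] = 1.0 / d
    return w

def normalized_cut(w, in_a):
    """Ncut(A, B) = cut(A, B)/vol(A) + cut(A, B)/vol(B).

    in_a[i] is True when point i belongs to cluster A. Both clusters are
    assumed to have non-zero volume.
    """
    n = len(w)
    cut = vol_a = vol_b = 0.0
    for i in range(n):
        for j in range(n):
            if in_a[i]:
                vol_a += w[i][j]
                if not in_a[j]:
                    cut += w[i][j]
            else:
                vol_b += w[i][j]
    return cut / vol_a + cut / vol_b
```

A small normalized cut indicates well-separated clusters; in the scheme the abstract describes, that value becomes the branch length between a node and its daughters.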
bioinformatics 2026-04-01 v3
Adaptive Cluster-Count Autoencoders with Dirichlet Process Priors for Geometry-Aware Single-Cell Representation Learning
Fu, Z.
Standard autoencoders for single-cell transcriptomics learn latent spaces whose cluster structure emerges only post hoc through K-means or community detection, leaving cluster count and boundary quality uncontrolled during training. Here we ask whether imposing an adaptive nonparametric prior can shift this balance. We equip a feedforward autoencoder with an online Dirichlet Process Mixture Model (DPMM) prior that refits cluster assignments throughout training and directly regularizes latent compactness and separation. Across 56 scRNA-seq datasets the DPMM prior produces a pronounced geometry-concordance trade-off: cluster compactness (ASW) improves by 127% and Davies-Bouldin overlap drops by 47%, but label-recovery metrics decline (NMI -17%, ARI -21%) and downstream kNN accuracy falls from 0.784 to 0.725. Wilcoxon signed-rank tests confirm that the geometry gains are significant with large Cliff's δ effects while concordance losses remain bounded and non-significant. A second-stage conditional-flow refinement (DPMM-FM) further improves projection fidelity (DRE 0.751, LSE 0.695, DREX 0.873) at additional concordance cost, revealing a three-tier operating regime: prior-free for label recovery, DPMM for manifold geometry, and DPMM-FM for visualization fidelity. Against 18 external baselines DPMM-Base wins 70.5% of core-metric comparisons (p < 0.05). Gene Ontology enrichment confirms that geometry-improved latent components recover coherent biological programs. Rather than claiming universal superiority, this study characterizes the operating envelope of nonparametric mixture priors and identifies the task contexts (trajectory analysis, manifold visualization, and program-level annotation) where adaptive geometric structure outweighs label-counting accuracy.
bioinformatics 2026-04-01 v2
Protein Language Model Decoys for Target Decoy Competition in Proteomics: Quality Assessment and Benchmarks
Reznikov, G.; Kusters, F.; Mohammadi, M.; van den Toorn, H. W. P.; Sinitcyn, P.
Large-scale proteomics relies heavily on target-decoy competition for false discovery rate estimation in peptide identification, and the performance of this strategy depends strongly on the design of the decoy database. Classical generators such as reversal and shuffling remain widely used. Here, we introduce protein language model-based (PLM) decoy generation for peptide identification and benchmark it against classical strategies. We evaluate these approaches using three complementary quality-control layers: sequence-based separability, search-engine-agnostic spectral-space diagnostics, and end-to-end mass spectrometry benchmarks, including pipelines with rescoring. Across these analyses, PLM-based decoys are harder for sequence-only neural networks to distinguish than most classical generators, suggesting fewer obvious sequence-level artifacts. However, this signal is only weakly informative for search performance. Spectral diagnostics further show that short peptides occupy a particularly crowded target-decoy space and are therefore especially prone to local collisions across all generators. In full search pipelines, reverse decoys remain a strong baseline, and current PLM-based generators do not yet provide a clear overall advantage. We therefore view PLM-based decoys not as universal replacements for reverse decoys, but as tunable tools for benchmarking, diagnostics, stress testing, and future adaptive decoy optimization, with increasing value as search models become more expressive.
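For readers unfamiliar with target-decoy competition, the following minimal sketch shows the classical reversal generator and the simplest FDR estimate, the number of decoy hits divided by the number of target hits above a score threshold. It is an illustration only: the function names are hypothetical, the input format (one best match per spectrum as a `(score, is_decoy)` pair) is an assumption, and real pipelines often use refinements such as a +1 correction or peptide-level reversal that preserves the C-terminal residue.

```python
def reverse_decoy(protein):
    """Classical decoy generator: the reversed protein sequence."""
    return protein[::-1]

def tdc_fdr(psms, threshold):
    """Estimate FDR among matches scoring >= threshold as decoys / targets.

    psms: list of (score, is_decoy) pairs, assumed to hold the single best
    match per spectrum after searching targets and decoys together.
    """
    targets = 0
    decoys = 0
    for score, is_decoy in psms:
        if score >= threshold:
            if is_decoy:
                decoys += 1
            else:
                targets += 1
    # With no accepted targets there is nothing to estimate; report 0.0.
    return decoys / targets if targets else 0.0
```

The estimate assumes that a wrong match is equally likely to land on a target or a decoy, which is why the design of the decoy database matters so much.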
bioinformatics 2026-04-01 v2
On the Comparison of LGT networks and Tree-based Networks
Marchand, B.; Tahiri, N.; Tremblay-Savard, O.; Lafond, M.
Phylogenetic networks are widespread representations of evolutionary histories for taxa that undergo hybridization or Lateral Gene Transfer (LGT) events. There are now many tools to reconstruct such networks, but no clearly established metric to compare them. Such metrics are needed, for example, to evaluate predictions against a simulated ground truth. Despite years of effort in developing metrics, known dissimilarity measures either do not distinguish all pairs of different networks, or are extremely difficult to compute. Since it appears challenging, if not impossible, to create the ideal metric for all classes of networks, it may be relevant to design them for specialized applications. In this article, we introduce a metric on LGT networks, which consist of trees with additional arcs that represent lateral gene transfer events. Our metric is based on edit operations, namely the addition/removal of transfer arcs, and the contraction/expansion of arcs of the base tree, allowing it to connect the space of all LGT networks. We show that it is linear-time computable if the order of transfers along a branch is unconstrained but NP-hard otherwise, in which case we provide a fixed-parameter tractable (FPT) algorithm in the level. We implemented our algorithms and demonstrate their applicability on three numerical experiments.
bioinformatics 2026-04-01 v2
Searching the Druggable Genome using Large Language Models
Schimmelpfennig, L. E.; Cannon, M.; Cody, Q.; McMichael, J.; Coffman, A.; Kiwala, S.; Krysiak, K. J.; Wagner, A. H.; Griffith, M.; Griffith, O. L.
The druggable genome encompasses the genes that are known or predicted to interact with drugs. The Drug-Gene Interaction Database (DGIdb) provides an integrated resource for discovering and contextualizing these interactions, supporting a broad range of research and clinical applications. DGIdb is currently accessed through structured web interfaces and API calls, requiring users to translate natural-language questions into database-specific query patterns. To allow for the use of DGIdb through natural language, we developed the DGIdb Model Context Protocol (MCP) server, which allows large language models (LLMs) access to up-to-date information through the DGIdb API. We demonstrate that the MCP server greatly enhances an LLM's ability to answer questions requiring accurate, up-to-date biomedical knowledge drawn from structured external resources. Availability and implementation: The DGIdb MCP server is detailed at https://github.com/griffithlab/dgidb-mcp-server and includes instructions for accessing the server through the Claude desktop app.
bioinformatics 2026-04-01 v2
Simplex-Constrained Neural Topic VAEs with Flow Refinement for Interpretable Single-Cell Gene-Program Discovery
Fu, Z.
Variational autoencoders for single-cell transcriptomics typically learn Gaussian latent spaces that lack part-based interpretability: individual latent dimensions carry no inherent biological meaning and the decoder provides no explicit gene-program readout. We introduce Topic-FM, a family of neural topic VAEs in which a logistic-normal Dirichlet prior constrains the latent vector to the probability simplex, turning each coordinate into a topic proportion and the decoder weight matrix into a directly readable topic-gene signature. A conditional optimal-transport flow field, trained entirely in pre-softmax ℝ^K, sharpens posterior geometry without modifying the decoder or breaking simplex validity. Unlike nonparametric mixture priors that improve geometry at the expense of label concordance, Topic-FM improves all core metrics simultaneously: across 56 scRNA-seq datasets, Topic-FM-Transformer raises NMI by 8.2%, ARI by 20.4%, and ASW by 21.7% relative to prior-free Pure-VAE (composite 0.502 vs. 0.434, +15.6%). Wilcoxon signed-rank tests confirm significance with medium-to-large Cliff's δ effects on all three metrics; no concordance-geometry trade-off is observed. Downstream kNN classification improves by 13.5% in accuracy and 27.7% in macro-F1. Among four architectural variants, Topic-FM-Contrastive achieves the highest external core win rate (86.4% against 23 baselines), while Topic-FM-Transformer leads on composite score and supervised discrimination. Dual-pathway biological validation (perturbation importance and direct decoder-β readout) yields convergent GO enrichment, demonstrating that the learned topics correspond to coherent, annotatable gene programs rather than opaque embedding dimensions.
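The simplex constraint mentioned in this abstract rests on a standard construction: draw a Gaussian vector in unconstrained K-dimensional space and pass it through a softmax, so that every coordinate is positive and the coordinates sum to one. The sketch below illustrates only that generic step. It is not the Topic-FM code, and the names `softmax` and `sample_topic_proportions` as well as the diagonal-covariance assumption are hypothetical.

```python
import math
import random

def softmax(z):
    """Map an unconstrained vector to the probability simplex."""
    m = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def sample_topic_proportions(mu, sigma, rng):
    """Logistic-normal draw: Gaussian sample in R^K, then softmax.

    Assumption: independent coordinates with means mu and standard
    deviations sigma (a diagonal covariance).
    """
    z = [rng.gauss(m, s) for m, s in zip(mu, sigma)]
    return softmax(z)

rng = random.Random(0)
theta = sample_topic_proportions([0.0, 1.0, -1.0], [1.0, 1.0, 1.0], rng)
```

Because any transformation applied before the softmax still yields a valid point on the simplex, refinement in the pre-softmax space cannot break the constraint.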
bioinformatics 2026-04-01 v2
A Convolutional Deep Learning Approach to identify DNA Sequences for Gene Prediction
Motta, J. A.; Gomez, P. D.
In this work, we present a highly efficient machine learning method for identifying DNA sequences that code for genes. The learning process is based on Human Genome Build 38 (GRCh38) sequences extracted from various specialized databases. The sequences were then translated into amino acid sequences and used to build matrices that facilitate feature extraction with the TF-IDF vectorization method to create the training space. The prediction functions are learned using a convolutional neural network (CNN) deep learning model. The training spaces were created using the 24 chromosomes of the human genome and approximately 36,000 genes and pseudogenes whose names were fetched from the HUGO Gene Nomenclature Committee (HGNC). Performance analysis was performed on 24 genes associated with genetic disorders, as well as the surrounding DNA regions. The metrics used were precision, recall, F-score, accuracy, and ROC curves for the genes of interest. The results achieved exceed all our expectations and place the work at the level of the state of the art for gene prediction.
bioinformatics 2026-04-01 v2
Protein Language Models Outperform BLAST for Evolutionarily Distant Enzymes: A Systematic Benchmark of EC Number Prediction
Sathyamoorthy, R.; Puri, M.
Accurate prediction of Enzyme Commission (EC) numbers is foundational to genome annotation, metabolic reconstruction, and enzyme engineering. Protein language models (PLMs) have transformed protein function prediction, yet their systematic evaluation for EC number prediction across architectures, EC hierarchy levels, and sequence identity thresholds is lacking. Here we present a comprehensive benchmark of three PLMs (ESM2-650M, ESM2-3B, ProtT5-XL) combined with nine downstream neural architectures, evaluated across four EC hierarchy levels and four sequence identity thresholds with 1,296 trained models in total. Our results establish that simple MLP classifiers achieve 98.0% accuracy at EC1, 96.9% at EC2, 96.6% at EC3, and 97.0% at EC4, matching or marginally exceeding a train-set-matched BLASTp baseline (+/-0.7 pp) for in-distribution proteins. Crucially, PLM-based methods dramatically outperform BLAST for evolutionarily distant eukaryotes: gains reach +31.8 pp over a fair 90K-sequence BLAST baseline (Giardia lamblia) and +26.4 pp over a full 520K SwissProt database (Trichomonas vaginalis). For held-out prokaryotic proteomes, PLMs outperform BLAST by a mean of +16.9 pp at EC4. Our benchmark reveals that (i) MLP architectures are sufficient and consistently superior to CNN/ResNet/Transformer variants, (ii) ESM2-650M is statistically distinguishable from but practically equivalent to the 5x larger ESM2-3B, and (iii) Transformer re-encoding of PLM embeddings fails at a shared learning rate due to convergence instability. All code, models, and benchmark results are available at https://github.com/r-mbio/plm_benchmark.git
bioinformatics 2026-04-01 v1
Accurate detection of mosaic mutations at short tandem repeats from bulk sequencing data
Wang, W.; Li, W.; Wang, C.; Fan, W.; Xia, Y.; Yang, X.; Chu, C.; Dou, Y.
Short tandem repeats (STRs) are among the most mutable regions of the human genome, yet their somatic mosaicism remains poorly characterized due to the technical challenges of distinguishing genuine mutations from high intrinsic polymorphism and sequencing noise. Here, we introduce BulkMonSTR, a computational framework that combines STR-specific error modelling with machine-learning classification to enable accurate detection of mosaic STR mutations from bulk next-generation sequencing data. BulkMonSTR identifies nucleotide-resolution mutations, including insertions, deletions, and single-nucleotide variants (SNVs), and supports both control-independent and case-control study designs. Leveraging a comprehensive training dataset derived from pedigree-based validation and in silico spike-in simulations, our random forest classifier effectively discriminates true mosaic events from germline variants and technical artifacts. Benchmarking on simulated and real datasets demonstrates that BulkMonSTR achieves substantially improved precision and F1 scores across diverse coverages and variant allele frequencies. In normal samples, cancer samples and controlled in silico mixing experiments, BulkMonSTR consistently outperforms existing methods, capturing a broader spectrum of STR mutations, including those arising on non-reference alleles, while achieving high validation rates. By enabling systematic, genome-wide interrogation of STR mosaicism, BulkMonSTR provides a scalable foundation for investigating the contributions of somatic STR mutations to aging and disease.
bioinformatics 2026-04-01 v1
The human pangenome reference reduces ancestry-related biases in somatic mutation detection
Pham, C. V. K.; Abdelmalek, F. S. A.; Hua, T.; Apel, E.; Bizjak, A.; Schmidt, E. J.; Houlahan, K. E.
Commonly used human reference genomes collapse extensive genetic variability into a single linear genome of which 70% is derived from one donor. These linear genomes fail to capture the full spectrum of genetic variation, which can lead to misalignment of sequencing reads particularly for individuals underrepresented by the linear reference genomes. To address this shortcoming, the Human Pangenome Reference Consortium released the first draft of the human pangenome reference, a graph-based reference that integrates diverse haplotypes. While the human pangenome reference has shown increased accuracy in detecting inherited DNA variants, it remains to be seen if the observed improvements extend to somatic mutation detection. Here, we systematically benchmarked somatic single nucleotide variant (SNV) detection leveraging the human pangenome in 30 whole exome sequenced bladder tumours with matched blood tissue of diverse ancestries. We found somatic SNV detection leveraging the human pangenome reference outperformed the linear reference, most notably in individuals of East Asian ancestry where we observed on average a 20% improvement in detection accuracy. Improvements to detection accuracy in individuals of European ancestry were marginal. The increase in accuracy was attributed to reduced germline contamination and reduced reference bias. Further, we demonstrate the pangenome increases SNV detection precision, mitigating the need for time and computationally expensive ensemble approaches that take the consensus across multiple tools. Finally, we demonstrate that the increased precision when aligned to the pangenome generalized to an additional 29 lung adenocarcinoma tumours, particularly for individuals of East Asian ancestry. These findings support adoption of the pangenome to improve somatic variant detection and reduce ancestry-related disparities.
bioinformatics 2026-04-01 v1
Subcellular Localization Constrains Protein Detectability and Reveals Systematic RNA-Protein Discordance Across Cancers
Joshi, K.; Kate, S.
Transcript abundance is widely used as a proxy for protein expression in cancer studies; however, mRNA levels often fail to predict protein detectability due to post-transcriptional and compartment-specific regulatory processes. Here, we present a machine learning framework that integrates RNA expression, gene-level attributes, and subcellular localization to model protein detectability across human cancers. Leveraging transcriptomic data from TCGA, TARGET, and GTEx, and protein annotations from the Human Protein Atlas, we constructed a dataset comprising over 100,000 gene-cancer pairs across seven tumor types. Models based on RNA features alone achieved moderate predictive performance (ROC-AUC ~0.71), whereas incorporating subcellular localization significantly improved accuracy (ROC-AUC ~0.82). Paired bootstrap analysis confirmed that these gains were statistically robust. We further identify a substantial set of genes with high transcript abundance yet absent protein detection, revealing widespread RNA-protein decoupling. These discordant genes are enriched in mitochondrial, metabolic, and translational regulatory pathways, suggesting that discordance reflects structured biological processes rather than stochastic variation. Together, our results demonstrate that cellular context, particularly subcellular localization, is a key determinant of protein detectability and underscore the limitations of transcript-centric interpretations in cancer genomics.
bioinformatics 2026-04-01 v1
IMMREP25: Unseen Peptides
Richardson, E.; Aarts, Y. J. M.; Altin, J. A.; Baakman, C. A. B.; Bradley, P.; Chen, B.; Clifford, J.; Dhar, M.; Diepenbroek, D.; Fast, E.; Gowthaman, R.; He, J.; Karnaukhov, V.; Marzella, D. F.; Meysman, P.; Nielsen, M.; Nilsson, J. B.; Deleuran, S. N.; Parizi, F. M.; Pelissier, A.; Pierce, B. G.; Rodriguez Martinez, M.; Roran A R, D.; Saravanakumar, S.; Shao, Y.; Smit, N.; Van Houcke, M.; Visani, G. M.; Wan, Y.-T. R.; Wang, X.; Woods, L.; Wuyts, S.; Xiao, C.; Xue, L. C.; IMMREP25 Participant Consortium; Barton, J.; Noakes, M.; May, D. H.; Peters, B.
T cell receptors (TCRs) can bind to peptides presented by MHC molecules (pMHC) as a first step to trigger a T cell response. Reliable approaches to predict TCR:pMHC binding would have broad applications in clinical diagnostics, therapeutics, and the fundamental understanding of molecular interactions. IMMREP is a community organized series of prediction contests that asks participants to predict TCR:pMHC binding on unpublished datasets. Previous iterations in 2022 and 2023 showed multiple approaches can predict TCR-pMHC binding with significant accuracy (median AUC_0.1 greater than or equal to 0.7) for peptides where experimental data is available ("seen" peptides). In contrast, models did not outperform random guessing for peptides that have no such data available ("unseen" peptides). Here we report on the results of IMMREP25, which focused solely on unseen peptides in order to evaluate the cutting edge of the field. We received 126 named submissions predicting the specificity of 1,000 TCRs against twenty unseen peptides restricted by one of two MHC molecules (HLA-A*02:01 and HLA-B*40:01). The best performing methods showed a macro-AUC_0.1 of 0.60, significantly better than random, demonstrating significant advances in the field. The top performing methods incorporated structural modeling into their approach, indicating that especially for "unseen" peptides, a structural understanding aids in the prediction of TCR:pMHC interactions. The results from this benchmark highlight the significant challenges remaining for TCR:pMHC predictions and will inform future method development.
bioinformatics 2026-04-01 v1
Serum metabolic signatures of cognitive resilience in a longitudinal aging cohort
Scheurink, T. A. W.; Seo, J. I.; David, L. C.; Wang, C. X.; Solis, D.; Zemlin, J.; Bergstrom, J.; Dorrestein, P. C.; Mohanty, I.; Molina, A. J. A.
Aging is typically accompanied by a progressive decline in cognitive function, yet some individuals maintain exceptional cognitive performance, even across the transition from middle to older age, defining exceptional cognitive resilience. While existing measures of resilience primarily rely on clinical assessments, its molecular determinants and early predictive markers remain poorly understood. Here, we performed untargeted LC-MS/MS profiling of longitudinal serum samples to identify metabolic signatures associated with cognitive resilience, which was established based on cognitive tests conducted over 28 years in a cohort of 237 participants. We observed associations across multiple chemical classes, including carnitines, glutamine conjugates, phosphocholines, as well as diet- and drug-derived metabolites. Chemical class-specific analyses revealed distinct metabolic profiles, including predominantly negative associations of medium-chain acylcarnitines with cognitive resilience, increased accumulation of glucuronide conjugates in individuals with low cognitive resilience, altered metabolism of the antihypertensive drug, metoprolol, and elevated levels of dietary compounds such as piperine and lutein in individuals with high cognitive resilience. By leveraging public metabolomics data, we further contextualized the metabolic signatures with respect to their organ specificity, microbial origin, and disease associations. Collectively, these metabolic features, including several previously underexplored compounds, represent promising candidates for functional characterization in mechanisms of aging biology and provide mechanistic insights into the molecular basis of cognitive resilience.
bioinformatics 2026-04-01 v1
Benchmark of biomarker identification and prognostic modeling methods on diverse censored data
Fletcher, W. L.; Sinha, S.
The practices of identifying biomarkers and developing prognostic models using genomic data have become increasingly prevalent. Such data often have characteristics that make these practices difficult, namely high dimensionality, correlations between predictors, and sparsity. Many modern methods have been developed to address these problematic characteristics while performing feature selection and prognostic modeling, but a large-scale comparison of their performance in these tasks on diverse right-censored time-to-event data (also known as survival time data) is much needed. We have compiled many existing methods, including some machine learning methods, several of which have performed well in previous benchmarks. We compare them primarily on variable selection capability and secondarily on survival time prediction, using many synthetic datasets with varying levels of sparsity, correlation between predictors, and signal strength of informative predictors. For illustration, we have also performed multiple analyses on a publicly available and widely used cancer cohort from The Cancer Genome Atlas using these methods. We evaluated the methods through extensive simulation studies in terms of the false discovery rate, F1-score, concordance index, Brier score, root mean square error, and computation time. Of the methods compared, CoxBoost and the Adaptive LASSO performed well in all metrics, and the LASSO and elastic net excelled when evaluating concordance index and F1-score. The Benjamini-Hochberg and q-value procedures showed volatile performance in controlling the false discovery rate. The performance of some methods was greatly affected by differences in the data characteristics. With our extensive numerical study, we have identified the best performing methods for a plethora of data characteristics using informative metrics. This will help cancer researchers choose the best approach for their needs when working with genomic data.
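As background for the false discovery rate results, the Benjamini-Hochberg step-up procedure can be written in a few lines. The sketch below is a generic textbook version, not the benchmark's code: sort the p-values, find the largest rank r with p_(r) <= r * alpha / m, and reject every hypothesis up to that rank. The function name and the default `alpha` are illustrative choices.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return booleans marking which hypotheses are rejected at FDR level alpha."""
    m = len(pvalues)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Largest 1-based rank whose p-value falls under the step-up line.
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * alpha / m:
            cutoff = rank
    # Reject everything up to that rank, even if some smaller ranks
    # individually missed their own threshold (the step-up property).
    rejected = [False] * m
    for i in order[:cutoff]:
        rejected[i] = True
    return rejected
```

The procedure controls the FDR under independence or positive dependence of the tests; strongly correlated predictors, a data characteristic this benchmark varies, fall outside that comfortable case.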
bioinformatics 2026-04-01 v1
Assessing the potential of bee-collected pollen sequence data to train machine learning models for geolocation of sample origin
Hayes, R. A.; Kern, A. D.; Ponisio, L. C.
Pollen is a robust and widespread substance that captures a historical snapshot of a specific time and place, and it can be used to track movements through space by examining the pollen deposited on various objects. Palynology, the study of pollen, is used across fields such as conservation, natural history, and forensics, where it is particularly useful for tracing the origin and movement of objects. However, pollen has remained underutilized due to the difficulty of distinguishing many pollen taxa beyond the family level and limited pollen reference material to support location predictions. With recent developments in pollen DNA metabarcoding these issues have been rectified, but much of the available pollen data are primarily from wind-pollinated species, which are widespread and less informative of specific sample locations. Bee-collected pollen presents an untapped resource in training predictive models to geolocate sample origin. Here we compiled bee-collected pollen DNA sequence relative abundance data from three projects in the western U.S. and assessed the accuracy of supervised machine learning models to predict the location of sample origin based solely on pollen assemblage, without the need of incorporating additional data. Random Forest and k-Nearest Neighbors models yielded high accuracy across all projects. We also found that models trained on taxonomically clustered pollen assigned sequence variants (ASVs) performed slightly better than those trained on raw sequence data, but the difference was minor, indicating that models trained on raw sequence data can reliably predict location and avoid the time-consuming taxonomic assignment process. Our results demonstrate the utility of repurposing bee-collected pollen for geolocation and provide a framework for employing supervised machine learning in future geolocation efforts.
bioinformatics 2026-04-01 v1
VicMAG, an open-source tool for visualizing circular metagenome-assembled genomes highlighting bacterial virulence and antimicrobial resistance
Tsuda, Y.; Tanizawa, Y.; Vu, T. M. H.; Nishimura, Y.; Shintani, M.; Abe, H.; Hasebe, F.; Kasuga, I.; Nagao, M.; Suzuki, M.
Bacterial pathogens spread in clinical and environmental settings, and mobile genetic elements (MGEs), such as plasmids and phages, mediate the transfer of virulence factor genes (VFGs) and antimicrobial resistance genes (ARGs) among bacterial communities. Metagenomic analysis of environmental and wastewater samples using highly accurate long-read sequencing technologies, such as PacBio HiFi sequencing, provides valuable insights into monitoring the regional spread of VFGs and ARGs, including dissemination mediated by MGEs. No visualization tool is currently available for the comprehensive display of numerous resulting circular metagenome-assembled genomes (cMAGs) with functional gene annotations. Here, we developed VicMAG, a visualization tool for highly complex cMAGs derived from long-read metagenome assemblies annotated using updated databases of VFGs, ARGs, and MGEs. Using 353 cMAGs from PacBio HiFi sequencing of a wastewater sample, we demonstrated the utility of VicMAG for metagenome visualization. VicMAG provides comprehensive, size-aware visualization of cMAGs representing bacterial chromosomes and plasmids, annotated with VFGs, ARGs, and phages. By simultaneously visualizing all cMAGs in a framework, VicMAG facilitates a holistic understanding of the distribution and genomic context of VFGs and ARGs across complex microbial communities. This tool supports integrated surveillance of bacteria associated with virulence and antimicrobial resistance across clinical, environmental, and One Health contexts.
bioinformatics 2026-04-01 v1
Millisecond Prediction of Protein Contact Maps from Amino Acid Sequences
Lin, R.; Ahnert, S. E.
Protein structure prediction typically outputs static coordinates, often obscuring the underlying physical principles and conformational flexibility. In this work, we present a coarse-grained generative framework to recover the Circuit Topology (CT) of proteins using Generative Flow Matching. We represent protein architecture using highly compressed Secondary Structure Elements (SSEs), reducing the sequence length to roughly 1/13 of the original amino acid sequence. We show that this minimal representation captures the essential "topological fingerprint" required to determine the global fold. By employing a joint-prediction head, our model simultaneously generates contact probabilities and asymmetric topological features, achieving a mean F1 score of 0.822 at the SSE level. Notably, our results demonstrate a counter-intuitive robustness in capturing long-range interactions, suggesting that global topology acts as a stable constraint compared to local residue packing. Furthermore, we show that these coarse-grained predictions can be mapped back to residue-level contact maps with sub-helical precision, yielding a mean alignment error of 2.69 residues. The probabilistic nature of the flow model effectively separates the stable structural signal of the folding core from flexible regions, providing a physically interpretable view of the protein's conformational ensemble. This pipeline is extremely fast, capable of completing a contact map prediction from amino acid sequence in an average of 110 milliseconds on a single GPU. These ultra-fast and accurate predictions provide a valuable tool for identifying conserved protein folding cores, facilitating the exploration of the protein structural genotype-phenotype (GP) map through large-scale sampling of mutants with highly similar folding cores.
bioinformatics2026-03-31v4Protein sequence domain annotation using a language model
Sarkar, A.; Krishnan, K.; Eddy, S. R.Abstract
Protein domain annotation underlies large-scale functional inference and is commonly performed by scanning sequences against libraries of profile hidden Markov models (profile HMMs). We describe PSALM, a protein domain annotation method that combines (i) a pretrained protein language model (ESM-2) with (ii) a per-residue domain-state classifier and (iii) a structured probabilistic decoder that produces a single, non-overlapping set of domain calls with explicit boundaries and scores. On a benchmark of 89M protein sequences with 107M annotated domains, PSALM attains a domain-detection sensitivity-specificity tradeoff comparable to HMMER. We characterize sequence and residue-level coverage on UniProtKB, observing higher coverage for HMMER at stringent expected false positive counts (E-values) and higher coverage for PSALM at relaxed E-values. We release code for data processing, training, and inference, along with the model weights and datasets used for training, validation, and benchmarking.
bioinformatics2026-03-31v3Phylogenetic detection of protein sites associated with continuous traits
Duchemin, L.; Muntane, G.; Boussau, B.; Veber, P.Abstract
Comparative genomic data can be used to look for substitutions in coding sequences that are associated with the variation of a particular phenotypic trait. A few statistical methods have been proposed to do so for phenotypes represented by discrete values. For continuous traits, no such statistical approach has been proposed, and researchers have resorted to sensible but uncharacterized criteria. Here, we investigate a phylogenetic model for coding sequences where amino acid preferences at a site are given by a continuous function of a quantitative trait. This function is inferred from the amino acids and the trait values in extant species and requires inferred point estimates of ancestral values of the trait at internal nodes. For detecting sites whose evolution is associated with this trait, we use a significance test against the hypothesis that amino acid preference does not depend on the trait. This procedure is compared to simpler strategies on simulated alignments. It displays an increased recall for low false positive rates, which is of special importance for performing whole-genome scans. This comes however at a much higher computational cost, and we suggest using a simple test to filter promising candidate sites. We then revisit a dataset of alignments for 62 species of mammals, using longevity as a phenotypic trait. We apply our method to three protein families that have previously been proposed to display sites associated with variation in lifespan in mammals. Using a graphical representation extracted from the detailed phylogenetic analysis of candidate sites, we suggest that the evidence for this in the sequence data alone is weak. The proposed method has been added to our Pelican software. It is available at https://gitlab.in2p3.fr/phoogle/pelican and can now be used with both discrete and continuous phenotypes to search for sites associated with phenotypic variation, on data sets with thousands of alignments.
bioinformatics2026-03-31v3Flipper: An advanced framework for identifying differential RNA binding behavior with eCLIP data
Flanagan, K.; Xu, S.; Yeo, G. W.Abstract
Motivation: Crosslinking and immunoprecipitation (CLIP) methods remain the gold standard for characterizing RNA binding protein (RBP) behavior. As a result, many researchers rely on CLIP to assess how treatments targeting RBPs alter binding patterns and regulatory activity. However, current tools for differential RBP binding analysis lack core features required for rigorous statistical inference, including proper normalization and appropriate handling of replicate experiments. Furthermore, existing approaches cannot adequately separate expression driven effects from true changes in RBP binding, complicating interpretation of differential analyses. Addressing these limitations is essential for producing reproducible and informative analyses of differential RBP binding. Results: Here we present Flipper, an application purpose built for the analysis of differential RBP binding. Flipper introduces several innovations that adapt the DESeq2 framework for robust differential analysis of eCLIP count data. These include integration of input controls to account for expression driven binding shifts, hierarchical normalization strategies that adjust for technical variation without confounding signal to noise ratios, and improved post-differential analysis tools. We demonstrate that Flipper exhibits high specificity when applied to real differential eCLIP data while also providing deeper biological insights. In addition, analyses of both real and simulated data indicate that Flipper achieves superior sensitivity and precision compared with existing approaches. Together, these results highlight Flipper as a robust and generalizable framework for differential RBP binding analysis.
bioinformatics2026-03-31v2CCIDeconv: Hierarchical model for deconvolution of subcellular cell-cell interactions in single-cell data
Jayakumar, R.; Panwar, P.; Yang, J. Y. H.; Ghazanfar, S.Abstract
Motivation: Cell-cell interaction (CCI) underlies several fundamental mechanisms including development, homeostasis and disease progression. CCI are known to be localised to specific subcellular regions, for example, within the cytoplasms of cells. With the emergence of subcellular spatial transcriptomics technologies (sST), there is an opportunity to attribute CCI to subcellular regions. We aimed to deconvolute CCI to subcellular CCI (sCCI) in non-spatial single-cell transcriptomics (i.e. scRNA-seq) datasets using a modified CCI score from CellChat. Results: By calculating the sCCI score specific to cytoplasm and nucleus in nine publicly available sST datasets, we identified unique nucleus-nucleus and cytoplasm-cytoplasm sCCI. Then, we deconvolved the communication score to subcellular regions by using a hierarchical classification and regression model, which we name CCIDeconv. We performed leave-one-dataset-out cross-validation across nine datasets over a range of different tissue types from human samples. We observed that training across many different tissue types resulted in robust deconvolution performance in an unseen dataset. As the number of training datasets increased, models trained without spatial features achieved similar performance to models including spatial features. This implied the potential for accurate prediction of sCCI events even from scRNA-seq data, given large numbers of training datasets. Overall, we offer a method for attributing CCI events to subcellular regions. This method can help researchers dissect sCCI patterns to gain insights into the underlying biology in a range of tissues covering health and disease.
bioinformatics2026-03-31v2A Bioinformatic Investigation into the Role of ITGB1 in Cancer Prognosis and Therapeutic Resistance
Mo, X.Abstract
Integrin β1 is a crucial transmembrane protein that regulates cellular adhesion, migration, and signal transduction, processes essential for cancer progression. This study investigates the role of ITGB1, the gene that encodes Integrin β1, in various cancers using bioinformatics tools. By analyzing gene expression data across different cancer types and normal tissues, the study identifies significant upregulation of ITGB1 in several cancers. We find elevated expression of ITGB1 is associated with poor prognosis in multiple tumors, suggesting its potential as a biomarker for cancer progression and therapeutic resistance. Further analysis reveals ITGB1's correlation with chemoresistance and immunoresistance genes, highlighting its involvement in cancer treatment evasion. The study also explores the expression and role of genes that are highly related to ITGB1 in tumor and patient prognosis, offering insights into potential molecular pathways and therapeutic targets. These findings underscore the clinical relevance of ITGB1 in cancer prognosis and therapy.
bioinformatics2026-03-31v2IDBSpred: An intrinsically disordered binding site predictor using machine learning and protein language model
Jones, D.; Wu, Y.Abstract
Intrinsically disordered proteins (IDPs) mediate many cellular functions through interactions with structured protein partners, but predicting the corresponding binding sites on the structured partner remains challenging. Here, we present IDBSpred, a sequence-based method for residue-level prediction of IDP-binding sites on structured proteins. Training and test data were collected from the DIBS database, which contains more than 700 non-redundant IDP-protein complexes. Residue-level embeddings of structured partner sequences were generated using the ESM-2 protein language model and used as input to a multilayer perceptron classifier for binary prediction of binding versus non-binding residues. Analysis of amino acid composition showed that IDP-binding sites are enriched in aromatic residues, especially Trp, Tyr, and Phe, as well as several charged and polar residues, whereas Ala and several small or conformationally restrictive residues are depleted. The classifier achieved an ROC AUC of 0.87 and an average precision of 0.61. Structural case studies further showed that the predicted sites largely recapitulate the major experimentally defined binding interfaces. These results demonstrate that protein language model embeddings plus machine learning algorithms can effectively capture sequence features associated with IDP recognition on structured proteins. IDBSpred provides a practical framework for studying IDP-mediated interfaces and identifying potential therapeutic hotspots.
bioinformatics2026-03-31v2Amino acid substitutomics: profiling amino acid substitutions at proteomic scale unveils biological implication and escape mechanism in cancer
Zhao, P.; DAI, S.; Lai, S.; Zhou, C.; Li, N.; Yu, W.Abstract
Amino acid (AA) substitutions play a critical role in regulating cellular activities, including complex signaling and cell cycle processes. Recent research on AA substitutions has primarily relied on genomic and transcriptomic data. Their identification at the proteomic scale remains underexplored, despite evidence suggesting that DNA and RNA biosynthesis are not the sole sources of these substitutions. This gap persists due to challenges in analyzing large-scale proteomic data. In this study, we address this limitation by analyzing multiple independent datasets across five cancer types using PIPI-C, a novel mass spectrometry data analysis tool. We also propose AA substitutomics, a pipeline for characterizing AA substitutions arising after protein translation and dissecting the regulatory functions of key proteins with AA substitutions. Among our identified AA substitutions, 87% are novel findings not recorded in genomic/transcriptomic databases, which indicates that post-translational AA substitutions are prevalent. Our findings reveal biologically significant AA substitutions linked to cancer, such as F43S and E91D in hemoglobin subunit beta, P584T in filamin A, and A175N in fructose-bisphosphate aldolase B. Furthermore, our pipeline enables direct investigation of drug resistance and immune escape. By capturing functional protein-level alterations beyond genomic and transcriptomic profiling, it establishes a robust framework to advance cancer research.
bioinformatics2026-03-31v2Deep representation learning for temporal inference in cancer omics: a systematic review
Prol-Castelo, G.; Cirillo, D.; Valencia, A.Abstract
Deep learning methods, including deep representation learning (DRL) approaches such as variational autoencoders (VAEs), have been widely applied to cancer omics data to address the high dimensionality of these datasets. Despite remarkable advances, cancer remains a complex and dynamic disease that is challenging to study, and the temporal resolution of cancer progression captured by omics-based studies remains limited. In this systematic literature review, we explore the use of DRL, particularly the VAE, in cancer omics studies for modeling time-related processes, such as tumor progression and evolutionary dynamics. Our work reveals that these methods most commonly support subtyping, diagnosis, and prognosis in this context, but rarely emphasize temporal information. We observed that the scarcity of longitudinal omics data currently limits deeper temporal analyses that could enhance these applications. We propose that applying the VAE as a generative model to study cancer in time, e.g. focusing on cancer staging, could lead to meaningful advancements in our understanding of the disease.
bioinformatics2026-03-31v2Co-designing sequence and structure of functional de novo enzymes with EnzyGen2
Song, Z.; Liu, H.; Zhao, Y.; Yang, Y.; Li, L.Abstract
Proteins underpin essential biological functions across all kingdoms of life. The capacity to design novel proteins with tailored activities holds transformative potential for biotechnology, medicine, and sustainability. However, since protein functions, particularly enzymatic activities, depend on precise interactions with small-molecule ligands, accurately modeling these interactions remains a formidable challenge in de novo protein design. Here, we introduce EnzyGen2, a protein foundation model designed for the simultaneous co-design of sequence and structure under ligand-guided functional targeting. Comprising 730 million parameters, EnzyGen2 is trained on 720,993 protein-ligand pairs using multi-task learning objectives that encompass the joint prediction of sequence, structure, and protein-ligand interactions. In rigorous in silico benchmarks, EnzyGen2 consistently outperforms state-of-the-art baselines, including Inpainting, RFdiffusion/ProteinMPNN, RFdiffusion2/LigandMPNN, and RFdiffusion3/LigandMPNN, as measured by enzyme-substrate prediction scores, AlphaFold2 confidence metrics, and structural fidelity, while generating samples 400x faster than prior methods. We further experimentally validated EnzyGen2 across multiple enzyme families, including chloramphenicol acetyltransferase, aminoglycoside adenylyltransferase, and thiopurine S-methyltransferase. De novo enzymes generated by our family-specific EnzyGen2 exhibited catalytic activities comparable to or exceeding those of natural enzymes, while retaining substantial novelty with sequence identities as low as 51.6%. These results establish EnzyGen2 as a robust Artificial Intelligence-based tool for functional enzyme design, demonstrating the power of large protein foundation models to create high-performance, novel biocatalysts.
bioinformatics2026-03-31v2Reliable prediction of short linear motifs in the human proteome
Pancsa, R.; Ficho, E.; Kalman, Z. E.; Gerdan, C.; Remenyi, I.; Zeke, A.; Tusnady, G. E.; Dobson, L.Abstract
Short linear motifs (SLiMs) are small interaction modules within intrinsically disordered regions of proteins that interact with specific domains, and thereby regulate numerous biological processes. Their limited sequence information leads to frequent false positive hits in computational and experimental SLiM identification methods. We present SLiMMine, a deep learning-based method to identify SLiMs in the human proteome. By refining the annotations of known motif classes, we created a high-quality training dataset. Using protein embeddings and neural networks, SLiMMine reliably predicts novel SLiM candidates in known classes and eliminates ~80% of the pattern matching-based hits as false positives. Furthermore, it functions as a discovery tool to find uncharacterized SLiMs based on optimal sequence environment. Finally, narrowing the broad interactor-domain definitions of known SLiM classes to specific human proteins enables more precise linking of predicted SLiMs to known protein-protein interactions. SLiMMine is available as a user-friendly, multi-purpose web server at https://slimmine.pbrg.hu/.
bioinformatics2026-03-31v2HORI-EN: Atomic-level energetic profiling and higher-order network identification in protein structures
Joshi, S.; Sowdhamini, R.Abstract
Motivation: Characterizing atomic-level stability and cooperative interaction networks is essential for understanding protein function and evolution. However, existing tools often lack the precision to integrate detailed physicochemical energies with higher-order graph-theoretic analyses. Results: We present HORI-EN, an updated implementation of the HORI framework, featuring hybrid energetic scoring (Physicochemical + Knowledge-Based) and a Normalized Interaction Score (NIS) based on cumulative distribution functions. HORI-EN identifies higher-order cliques of interacting residues, revealing cooperative stabilization networks. Validation on the SKEMPI v2 dataset demonstrates that HORI-EN shows discriminative performance in identifying mutational hotspots, achieving an ROC-AUC of 0.780 on the full dataset and 0.844 on a clean benchmark. Enrichment analysis indicates a 3.1-fold increase in precision for the top 1% of predictions. Furthermore, analysis of the residue interaction network recovers 77.4% of non-contacting hotspots by identifying one-hop bridging interactions to the partner chain. Beyond hotspot prediction, HORI-EN distinguishes native structures from decoys and captures conserved energetic signatures in evolutionary case studies of serine proteases and lipases. Availability and Implementation: The web server is freely available at https://caps.ncbs.res.in/HORI-EN and source code is available at https://github.com/thesixeyedknight/HoriPy.
bioinformatics2026-03-31v1Analysis of biological networks using Krylov subspace trajectories
Frost, H. R.Abstract
We describe an approach for analyzing biological networks using rows of the Krylov subspace of the adjacency matrix. Specifically, we explore the scenario where the Krylov subspace matrix is computed via power iteration using a non-random and potentially non-uniform initial vector that captures a specific biological state or perturbation. In this case, the rows of the Krylov subspace matrix (i.e., Krylov trajectories) carry important functional information about the network nodes in the biological context represented by the initial vector. We demonstrate the utility of this approach for community detection and perturbation analysis using the C. elegans neural network.
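As a rough illustration of the idea, the sketch below builds a Krylov subspace matrix by repeated multiplication with the adjacency matrix, starting from a chosen initial vector. The function name `krylov_trajectories`, the unit-norm rescaling of each column, and the toy path graph are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def krylov_trajectories(adjacency, x0, steps):
    """Return an (n_nodes, steps + 1) matrix with columns x0, A x0, A^2 x0, ...

    Each column is rescaled to unit Euclidean norm, as in power iteration
    (an assumption of this sketch). Row i is the trajectory of node i.
    """
    A = np.asarray(adjacency, dtype=float)
    x = np.asarray(x0, dtype=float)
    norm = np.linalg.norm(x)
    if norm == 0.0:
        raise ValueError("initial vector must be non-zero")
    x = x / norm
    columns = [x]
    for _ in range(steps):
        x = A @ x
        norm = np.linalg.norm(x)
        if norm == 0.0:
            raise ValueError("iteration collapsed to the zero vector")
        x = x / norm
        columns.append(x)
    return np.column_stack(columns)

# Toy example: path graph 0-1-2-3 with a perturbation placed on node 0.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
x0 = np.array([1.0, 0.0, 0.0, 0.0])
K = krylov_trajectories(A, x0, steps=3)
```

Rows of `K` could then be compared or clustered, for instance to group nodes that respond similarly to the perturbation encoded in `x0`.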
bioinformatics2026-03-31v1Pan-Metabolomics Repository Mapping of the Carnitine Landscape
Mannochio-Russo, H.; Ferreira, P. C.; Kvitne, K. E.; Patan, A.; Deleray, V.; Agongo, J.; Gouda, H.; Goncalves Nunes, W. D.; Xing, S.; Zemlin, J.; van Faassen, M.; Reilly, E. R.; Koo, I.; Patterson, A. D.; Tsunoda, S. M.; Wang, M.; Siegel, D.; Burnett, L. A.; Dorrestein, P. C.Abstract
Carnitines are a structurally diverse class of metabolites formed by conjugation of L-carnitine with fatty acids, amino acids, xenobiotics, and microbial metabolites. They play roles in transport, mitochondrial and peroxisomal metabolism, detoxification, and systemic signaling, yet their chemical diversity remains incompletely defined. We applied a pan-repository data mining strategy of LC-MS/MS data across GNPS/MassIVE, MetaboLights, and Metabolomics Workbench using MassQL diagnostic fragment ion filtering to systematically extract acylcarnitine spectra. This yielded a library of 34,222 unique MS/MS spectra representing 2,857 atomic compositions, corresponding to 3,872,050 detections. These datasets provide an MS/MS library for annotation, discovery, and contextualization of acylcarnitines, enabling identification of previously unknown carnitines, such as dihydroferulic acid conjugated carnitines and supporting future exploration of this metabolite class across host metabolism, diet, microbial activity, pharmacological exposures, and metabolic dysregulation.
bioinformatics2026-03-31v1Structured Pooling Improves Detection of Rare Regulatory Mutations in Population-Scale Reporter Assays
Dura, K.; Siklenka, K.; Strouse, K. P.; Morrow, S.; Zhang, C.; Barrera, A.; Allen, A. S.; Reddy, T. E.; Majoros, W. H.Abstract
Identifying genetic variants in noncoding DNA that impact gene expression and thereby contribute to disease risk remains a difficult but important challenge in genomic medicine. Modern reporter assays such as STARR-seq and MPRA provide an efficient and effective means of testing, in very high throughput, millions of variants captured directly from patient genomes. While these assays have previously been scaled to whole genomes and, separately, to populations, we report findings from the first whole-genome population-scale STARR-seq experiment performed on 100 individuals. In order to achieve that scale we devised a novel experimental design that partitions samples into pools so as to increase allele frequencies within pools and thereby reduce expected dropout and increase signal-to-noise ratio in experimental readouts. We show that this design produces more accurate estimates of variant effect sizes, and we provide a Bayesian model for robust estimation of those effect sizes that also reports full posterior distributions for assessment of confidence in estimates. Together, these methodological innovations facilitate the detection of functional regulatory variants, particularly rare variants, with much higher accuracy and at greater scale than previously possible. We demonstrate the utility of this approach on the task of functional annotation of quantitative trait loci such as eQTLs and caQTLs, and show concordance with patterns of constraint in transcription factor binding profiles.
bioinformatics2026-03-31v1Modeling gene regulatory perturbations via deep learning from high-throughput reporter assays
Venukuttan, R.; Doty, R.; Thomson, A.; Chen, Y.; Li, B.; Duan, Y.; Barrera, A.; Dura, K.; Ko, K.-Y.; Lapp, H.; Reddy, T. E.; Allen, A. S.; Majoros, W. H.Abstract
Assessing likely variant effects on phenotypes is of critical importance in diagnostic settings, and while much progress has been made in interpreting genic mutations based on our understanding of coding sequence, noncoding variants can be much more challenging to reliably interpret based on DNA sequence alone. High-throughput reporter assays such as STARR-seq and MPRA have shown utility in experimentally measuring regulatory effects of noncoding variants present in samples, but provide no readout for variants not present in the assay inputs. However, whole-genome reporter assays provide copious data that can be used to train predictive models for prioritizing variants not directly observed in the experiment. We describe a retrainable predictive modeling framework, BlueSTARR, for this task, and present results of training several models with this framework on whole-genome STARR-seq data from two cell lines and one drug treatment. Using these models, we uncover a global signature across the human genome consistent with purifying selection against both loss-of-function and gain-of-function regulatory variants, with the latter showing a significant bias consistent with selection against gains of cis regulatory function in closed chromatin proximal to genes. By testing the model on synthetic enhancers with binding motifs for transcription factors GR and AP-1, we find that when trained on drug perturbation data, the model is able to learn distance-dependent and treatment-dependent binding patterns and their resulting reporter gene activation. These results demonstrate that lightweight, easily retrainable models such as ours have utility in probing latent signals present in novel experimental data. 
Finally, we find only modest differences in performance between different deep-learning architectures when trained on this single data modality, and while somewhat greater predictive accuracy can be achieved with much larger models trained at great expense on many terabytes of data, there is still copious room for improvement even for industrial strength, state-of-the-art models.
bioinformatics2026-03-31v1MoCoO: Momentum Contrast ODE-Regularized VAE for Single-Cell Trajectory Inference and Representation Learning
Fu, Z.Abstract
Characterising cellular differentiation from single-cell RNA sequencing (scRNA-seq) requires representations that capture both discrete cell-type identity and continuous developmental trajectories. We present MoCoO, a modular framework integrating a Variational Autoencoder (VAE), Neural Ordinary Differential Equations (Neural ODE), and Momentum Contrast (MoCo), complemented by a systematic Phase-2 Flow Matching (FM) refinement step applicable to all model variants. Through a systematic six-configuration ablation across 20 scRNA-seq datasets evaluated with a proposed five-metric suite covering clustering geometry (ASW, DAV, CAL) and embedding quality (DRE, DREX), we demonstrate two central findings. First, the ODE+MoCo combination is the core architectural synergy: VAE+ODE+MoCo achieves four of five top-two finishes among base configurations, including the best ASW (0.225) and DAV (1.478), plus second-best DRE (0.640) and CAL. Second, FM refinement systematically improves both embedding quality and clustering geometry across all six base configurations: DREX in 92% and DRE in 88% of 120 dataset-configuration pairs (ΔDREX=+0.030, ΔDRE=+0.023), CAL in 88%, ASW in 86% (ΔASW=+0.018), and DAV in 80% (ΔDAV=-0.072; Fig. 2). Combined, the full MoCoO pipeline (VAE+ODE+MoCo+Proto+FM) achieves the best DRE (0.678), DREX (0.660), and CAL, while VAE+ODE+MoCo+FM achieves the best ASW (0.257) and DAV (1.359). ODE smooths the latent manifold along developmental trajectories; MoCo sharpens cluster geometry; FM recovers and amplifies both embedding quality and cluster separation post-hoc. Downstream validation confirms that MoCoO latent spaces support annotation transfer, uncertainty quantification, differential expression, and branching detection. Pseudotime predictions correlate significantly with canonical marker genes across all five core developmental systems. We publicly release the MoCoO Python package (pip install mocoo) and full benchmark suite.
bioinformatics2026-03-31v1WayFindR: Investigating Feedback in Biological Pathways
Bombina, P.; McGee, R. L.; Reed, J.; Abrams, Z.; Abruzzo, L. V.; Coombes, K. R.Abstract
Understanding biological pathways requires more than static diagrams. We present WayFindR, an R package that converts pathway data from WikiPathways and KEGG into graph structures using igraph, enabling computational analysis of regulatory features such as negative feedback loops. Rooted in control theory, negative feedback is essential for system stability, yet it is often underrepresented in curated pathway data. In this study, we systematically analyzed pathway information from both databases across multiple species and found that feedback loops, particularly negative ones, are rarely captured. This gap likely reflects both biological and technical challenges. Biologically, feedback mechanisms are inherently complex and often remain uncharted due to limited experimental focus. Technically, pathway databases frequently lack standardized annotations and complete representations of regulatory interactions, especially inhibitory edges that are crucial for identifying feedback. These observations underscore the need for improved data curation and consistent annotation practices to enhance our understanding of regulatory dynamics. By bridging the gap between static pathway diagrams and dynamic systems-level insights, WayFindR enables reproducible and scalable investigation of feedback regulation in cellular networks. The WayFindR R package can be downloaded from the Comprehensive R Archive Network (CRAN) (https://cran.r-project.org/web/packages/WayFindR/index.html). The processed data along with code for download can be accessed via the GitLab repository (https://gitlab.com/krcoombes/wayfindr).
bioinformatics2026-03-31v1Protein Language Model Decoys for Target Decoy Competition in Proteomics: Quality Assessment and Benchmarks
Reznikov, G.; Kusters, F.; Mohammadi, M.; van den Toorn, H. W. P.; Sinitcyn, P.Abstract
Large-scale proteomics relies heavily on target-decoy competition for false discovery rate estimation in peptide identification, and the performance of this strategy depends strongly on the design of the decoy database. Classical generators such as reversal and shuffling remain widely used. Here, we introduce protein language model-based (PLM) decoy generation for peptide identification and benchmark it against classical strategies. We evaluate these approaches using three complementary quality-control layers: sequence-based separability, search-engine-agnostic spectral-space diagnostics, and end-to-end mass spectrometry benchmarks, including pipelines with rescoring. Across these analyses, PLM-based decoys are harder for sequence-only neural networks to distinguish than most classical generators, suggesting fewer obvious sequence-level artifacts. However, this signal is only weakly informative for search performance. Spectral diagnostics further show that short peptides occupy a particularly crowded target-decoy space and are therefore especially prone to local collisions across all generators. In full search pipelines, reverse decoys remain a strong baseline, and current PLM-based generators do not yet provide a clear overall advantage. We therefore view PLM-based decoys not as universal replacements for reverse decoys, but as tunable tools for benchmarking, diagnostics, stress testing, and future adaptive decoy optimization, with increasing value as search models become more expressive.
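For readers unfamiliar with target-decoy competition, the following minimal sketch shows a reversed-sequence decoy and the simplest FDR estimate (decoy hits divided by target hits above a score threshold). The function names, the toy scores, and the choice of this particular estimator are assumptions for illustration; the paper's pipelines and search engines may differ in detail.

```python
def reverse_decoy(sequence):
    """Return the reversed sequence, one classical decoy generator."""
    return sequence[::-1]

def estimate_fdr(psms, threshold):
    """Estimate FDR among matches scoring at or above `threshold`.

    `psms` is an iterable of (score, is_decoy) pairs. The estimate is
    (# decoy hits) / (# target hits), capped at 1.0.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    if targets == 0:
        # No accepted targets: report 1.0 if only decoys pass, else 0.0.
        return 1.0 if decoys else 0.0
    return min(1.0, decoys / targets)

# Hypothetical peptide-spectrum match scores (not real data).
psms = [(10, False), (9, False), (8, True), (7, False), (6, False), (5, True)]
decoy = reverse_decoy("PEPTIDEK")
fdr_at_6 = estimate_fdr(psms, threshold=6)
```

With these toy values, four targets and one decoy pass the threshold of 6, so the estimate is 0.25.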
bioinformatics2026-03-31v1Cell type composition drives patient stratification in single-cell RNA-seq cohorts
Halter, C.; Andreatta, M.; Carmona, S.Abstract
Early transcriptomic studies demonstrated that unsupervised analysis of bulk gene expression can reveal clinically meaningful patient subgroups. Single-cell RNA sequencing (scRNA-seq) provides high-resolution characterization of cellular heterogeneity and therefore enables more refined patient stratification. Several computational approaches have been proposed to summarize single-cell data into sample-level representations for cohort-level exploratory analyses. However, these methods generally do not explicitly account for the compositional nature of cell-type proportions. Based on eleven scRNA-seq cohorts across different biological conditions, we evaluated several state-of-the-art sample representation methods for their ability to recover known biological groupings in an unsupervised setting. Surprisingly, we found that baseline approaches based on cell-type composition and pseudobulk gene expression consistently matched or outperformed more complex methods while requiring orders of magnitude fewer computational resources. In particular, centered log-ratio-transformed cell-type proportions achieved the highest stratification performance and demonstrated robustness to batch effects. The stratification signal was frequently concentrated in a small subset of highly variable cell types, and performance was robust across diverse cell type annotation strategies. Altogether, these results suggest that clinically relevant inter-sample variation in scRNA-seq cohorts is largely driven by differences in cell-type composition. Importantly, compositional representations directly link cohort-level structure to specific cell populations, enabling mechanistic interpretation and facilitating clinical translation. 
We provide scECODA, an open-source R package for scalable and interpretable cohort-level Exploratory COmpositional Data Analysis of scRNA-seq data, and establish cell-type compositional representations as a powerful and interpretable approach for unsupervised patient stratification.
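The centered log-ratio (CLR) transform mentioned above can be sketched in a few lines. This is a generic illustration in Python, not the scECODA implementation (which is an R package); the function name `clr`, the pseudocount of 0.5 used to avoid taking the log of zero, and the toy counts are assumptions.

```python
import numpy as np

def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of a samples x cell-types count matrix.

    A pseudocount is added so that zero counts do not produce log(0);
    its value is an assumption of this sketch.
    """
    counts = np.asarray(counts, dtype=float) + pseudocount
    proportions = counts / counts.sum(axis=1, keepdims=True)
    log_props = np.log(proportions)
    # Subtract each sample's mean log-proportion (log of the geometric mean).
    return log_props - log_props.mean(axis=1, keepdims=True)

# Hypothetical cell counts per sample for three cell types.
counts = np.array([[10, 10, 10],
                   [90, 5, 5]])
transformed = clr(counts)
```

Each transformed row sums to zero, and the resulting values can be fed to standard methods such as PCA or hierarchical clustering for sample-level stratification.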
bioinformatics2026-03-31v1Carafe2 enables high quality in silico spectral library generation for timsTOF data-independent acquisition proteomics
Wen, B.; Paez, J. S.; Hsu, C.; Canzani, D.; Chang, A. T.; Shulman, N.; MacLean, B. X.; Berg, M. D.; Villen, J.; Fondrie, W.; Pino, L.; MacCoss, M. J.; Noble, W. S.Abstract
Data-independent acquisition (DIA) proteomics enables reproducible and systematic peptide detection and quantification, and trapped ion mobility spectrometry (TIMS) on the timsTOF platform further improves DIA by synchronizing ion mobility separation with quadrupole precursor sampling. Analyzing the highly multiplexed spectra generated by DIA typically relies on spectral libraries, and fully leveraging the additional ion mobility dimension requires these libraries to include accurate retention time, fragment ion intensity, and ion mobility annotations. Existing in silico spectral library generation tools either lack ion mobility support entirely or rely on models trained on data-dependent acquisition (DDA) data, which can introduce a mismatch and fail to capture experiment-specific biases when applied to a given timsTOF dataset. Carafe is a software tool that uses deep learning models to generate high-quality, experiment-specific in silico libraries by training directly on DIA data. In this study, we extend Carafe to generate libraries for timsTOF DIA data, which involves fine-tuning retention time (RT), fragment ion intensity, and ion mobility prediction models using timsTOF DIA data. Carafe2 operates directly on native timsTOF raw data (Bruker .d directories) without the need for data conversion. We demonstrate the performance of Carafe2 across a wide range of DIA applications, including global proteome, phosphoproteome, and plasma proteome datasets. Comparing Carafe2 fine-tuned RT, fragment ion intensity, and ion mobility prediction models with pretrained DDA models, we find that Carafe2 models outperform pretrained models on a variety of DIA datasets. 
We then demonstrate the utility of in silico libraries generated by Carafe2 for peptide detection on several different types of timsTOF DIA datasets by comparing with the libraries generated with DDA-trained AlphaPeptDeep models, DIA-NN built-in models, and empirical spectral libraries generated from DDA experiments.
bioinformatics2026-03-31v1Condition-matched in silico prediction of drug transcriptional responses enables mechanism-guided screening and combination discovery
Xiao, M.; He, Y.; Hu, J.; Zou, F.; Zou, B.Abstract
Perturbational transcriptomics links therapeutic compounds to cellular mechanisms and provides a powerful framework for drug discovery, but experimentally profiling transcriptional responses across diverse cell states, doses and durations is costly and often infeasible. Here we present DEPICT (Drug rEsponse Prediction in transCriptomics with Transformers), a deep learning framework that predicts condition-matched drug-induced transcriptional responses from baseline gene expression, perturbation settings and complementary drug representations. Using the LINCS L1000 dataset, DEPICT generalized to unseen drugs and cell types and outperformed five baseline strategies and two recent deep learning models. In the most challenging unseen-cell evaluation, DEPICT was the only model to surpass all baselines, improving differential-expression prediction accuracy and reducing perturbed-expression prediction error by 30.3% and 36.8%, respectively, relative to the next-best deep model. In a non-small cell lung cancer (NSCLC) case study, DEPICT-enabled virtual screening prioritized compounds predicted to reverse disease-associated transcriptional signatures. Notably, 13 of the top 20 prioritized compounds had either previously entered NSCLC-related clinical trials or been validated in NSCLC studies, supporting the translational relevance of the predicted perturbational profiles. DEPICT further enabled condition-matched drug synergy prediction and mechanistic exploration when experimentally matched profiles were unavailable. Together, these results show that accurate, condition-matched in silico perturbation profiling can scale transcriptomics-driven hypothesis generation for drug repurposing and combination discovery.
bioinformatics2026-03-31v1LAML-Pro: Maximum Likelihood Inference of Cell Genotypes and Cell Lineage Trees
Chu, G.; Schmidt, H.; Raphael, B.Abstract
Motivation: Recent dynamic lineage tracing technologies use genome editing to induce heritable mutations, or edits, that accumulate across successive cell divisions. These edits are measured using single-cell sequencing or imaging, providing data to reconstruct cell lineages at single-cell resolution. Current computational approaches to infer cell lineage trees, or phylogenies, from these data perform two separate steps: (1) Identify each cell's edits (genotype) from the raw sequencing or imaging data; (2) Infer a cell lineage tree from the cell genotypes. However, genotyping cells is an inexact process and genotype errors can yield an inaccurate lineage tree. For example, using fluorescence-based imaging to measure edits results in a high fraction (approximately 25-50%) of uncertain or erroneous genotypes. Results: We introduce Lineage Analysis via Maximum Likelihood with PRobabilistic Observations (LAML-Pro), an algorithm that jointly infers cell genotypes and a cell lineage tree. LAML-Pro is based on the Probabilistic Mixed-type Missing Observation (PMMO) model, which we derive to describe both the genome editing and genotype observation processes. LAML-Pro constructs lineage trees from thousands of cells in under an hour by leveraging the sparsity of transitions under the PMMO model. On simulated data, we demonstrate that LAML-Pro corrects genotype errors and infers substantially more accurate trees than existing methods, which are vulnerable to genotype errors. Applied to data from two recent imaging-based lineage tracing systems, LAML-Pro reduces genotype errors by 5-fold and produces more spatially coherent lineage trees compared to existing methods. Availability and Implementation: LAML-Pro is freely available at: github.com/raphael-group/LAML-Pro.
bioinformatics2026-03-31v1BCAR: A fast and general barcode-sequence mapper for correcting sequencing errors
Andrews, B.; Ranganathan, R.Abstract
Motivation: DNA barcodes are commonly used as a tool to distinguish genuine mutations from sequencing errors in sequencing-based assays. In the presence of indel errors, utilizing barcodes requires accurate alignment of the raw reads to distinguish genuine indels from indel errors. Existing strategies to do this generally rely on aligners built for homology comparison and do not fully utilize quality scores. We reasoned that developing an aligner purpose-built for error correction could yield higher quality barcode-sequence maps. Results: Here, we present BCAR, a fast barcode-sequence mapper for correcting sequencing errors. BCAR considers all of the evidence for each base call at each position both during alignment and during final consensus generation. BCAR creates high-accuracy barcode-sequence maps from simulated reads across a broad range of error rates and read lengths, outperforming existing methods. We apply BCAR to two experimental datasets, where it generates high-quality barcode-sequence maps. Availability and implementation: BCAR source code, documentation and test data are available from: https://github.com/dry-brews/BCAR
bioinformatics2026-03-31v1Scalable Microbiome Network Inference: Mitigating Sparsity and Computational Bottlenecks in Random Effects Models
Roy, D.; Ghosh, T. S.Abstract
The application of Large Language Models (LLMs) and Transformers to biological and healthcare datasets requires the extraction of highly accurate, noise-filtered ecological networks. The Random Effects Model (REM) is a powerful statistical method for inferring microbial interaction networks and identifying keystone species across heterogeneous studies. However, existing implementations in R that rely on single-threaded "Iteratively Reweighted Least Squares" (IRLS) are computationally prohibitive for high-dimensional metagenomic data, creating a significant bottleneck for downstream machine learning pipelines. In this paper, we present Parallel-REM, a highly scalable, Python-based parallel pipeline accelerating large-scale network inference. By integrating robust variance filtering, sparsity checks, and a batched Master-Worker parallelisation strategy using joblib and statsmodels, we resolve native convergence failures associated with sparse biological matrices. Benchmarking on a massive clinical dataset comprising 70,185 samples and 466 optimal species demonstrates a 26.1x speedup over sequential baselines on a 64-core architecture, reducing computation time from days to minutes. Furthermore, statistical validation shows > 99.9% directional concordance with the original R implementation. Parallel-REM democratises large-scale network extraction, providing the high-throughput infrastructure necessary to feed clean, topological and biological features into modern deep learning and Transformer-based diagnostic architectures.
bioinformatics2026-03-31v1eSIG-Net: Accurate prediction of single-mutation induced perturbations on protein interactions using a language model
Pan, X.; Shrawat, A.; Raghavan, S.; Dong, C.; Yang, Y.; Li, Z.; Zheng, W. J.; Eckhardt, S. G.; Wu, E.; Fuxman Bass, J. I.; Jarosz, D. F.; Chen, S.; McGrail, D. J.; Sheynkman, G. M.; Huang, J. H.; Sahni, N.; Yi, S. S.Abstract
Most proteins exert their functions in complex with other interactors. Single mutations can exhibit a profound impact on perturbing protein interactions, leading to human disease. However, predicting the effect of single mutations on protein interactions remains a major computational challenge. Deep learning, particularly protein language models or transformers, has become an effective tool in bioinformatics for protein structure prediction. However, the functional divergence of mutations makes it difficult to predict their interaction perturbation profiles. To address this fundamental challenge, we present eSIG-Net (edgetic mutation Sequence-based Interaction Grammar Network), a novel sequence-based 'Interaction Language Model' for predicting protein interaction alterations caused by single mutations. eSIG-Net combines various protein sequence embeddings, introduces a mutation-encoding module with syntax and evolutionary insights, and employs contrastive learning to evaluate mutation-induced interaction changes. eSIG-Net significantly outperforms current state-of-the-art sequence-based and structure-based prediction methods at predicting mutational impact on protein interactions. We highlight examples where eSIG-Net nominates causal variants with high confidence and elucidates their functional role under relevant biological contexts. Together, eSIG-Net is a first-in-kind 'interaction language model' that can accurately predict interaction-specific rewiring by single mutations with only sequence information, and exhibits generalizability across biological contexts.
bioinformatics2026-03-31v1Transcriptional Hysteresis and Irreversibility in Periodontitis Revealed by Single-Cell Latent Manifold Modeling
Yadalam, P. K.Abstract
Chronic periodontitis is a prevalent inflammatory condition characterized by progressive tissue destruction; however, the molecular determinants that distinguish reversible inflammation from irreversible structural damage remain poorly defined. In this study, we analyzed single-cell RNA sequencing data from 12,104 cells (GSE152042) spanning healthy gingiva, mild periodontitis, and severe periodontitis to investigate transcriptional state transitions associated with disease progression. A variational autoencoder-derived 20-dimensional latent manifold was constructed to capture disease-associated transcriptional dynamics, followed by formal hysteresis analysis to assess irreversibility. Chi-square testing across 9,163 cells in transitional pseudotime bins revealed strong state-dependence (chi-square = 11,971, p < 1e-300, df = 4; Cramér's V = 0.81), suggesting non-reversible transcriptional trajectories. Non-negative matrix factorization (k = 15) identified coherent gene programs, whose interactions were modeled as a hypergraph constraint network. Severe disease was associated with marked disruption of these constraints, including an 84% reduction in fibroblast-epithelial coupling. To explore system-level dynamics, we implemented a multi-agent simulation framework that recapitulated observed shifts in cellular composition and suggested the presence of a transition point beyond which tissue states diverge. We further propose a composite single-cell metric, the Regenerative Permission Index (RPI), which quantifies regenerative potential across disease states. RPI values were significantly reduced in severe periodontitis (mean = 0.323), indicating diminished regenerative permissibility relative to earlier stages. Classification models achieved high performance (accuracy = 88%, AUC = 0.992), and permutation testing supported the specificity of inferred network patterns (p < 0.01). 
Together, these findings provide a quantitative framework for investigating transcriptional irreversibility in periodontitis and introduce RPI as a potential tool for stratifying regenerative capacity at the single-cell level.
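The effect size reported in this abstract can be related to the chi-square statistic with a short calculation. The sketch below uses the standard Cramér's V formula; the 3 x 3 contingency table shape is an assumption (it is consistent with df = 4 but is not stated in the abstract), and the function name `cramers_v` is illustrative.

```python
import math

def cramers_v(chi2, n, n_rows, n_cols):
    """Cramér's V: sqrt(chi2 / (n * min(rows - 1, cols - 1)))."""
    return math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))

# Values from the abstract; the 3 x 3 table shape is assumed, since
# (3 - 1) * (3 - 1) = 4 matches the reported degrees of freedom.
v = cramers_v(chi2=11971, n=9163, n_rows=3, n_cols=3)
print(round(v, 2))  # prints 0.81
```

Under this assumed table shape, the computed value agrees with the reported V of 0.81.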
bioinformatics2026-03-31v1MetaGEAR Explorer: Rapid interactive searches and cross-cohort analyses of microbiome gene associations in disease
Rios, E.; Jin, S.; Zhang, C.; Neuhaus, F.; He, X.; Weissenberger, S.; Schirmer, M.Abstract
The human gut microbiome has been linked to inflammatory bowel disease (IBD) and colorectal cancer (CRC), yet identifying disease-associated microbial genes across diverse human cohort studies remains challenging due to inconsistent data processing and the high dimensionality of gene-level abundance profiles. Here we present MetaGEAR Explorer, a web platform comprising a user interface and web services for interactive and programmatic gene-centric exploration of >33 million microbial gene families across 9,053 metagenomic samples from 24 IBD, CRC, and healthy cohorts. MetaGEAR Explorer facilitates gene searches against a catalog of non-redundant gene families via nucleotide or amino acid sequence queries (BLAST) and Pfam domain-based searches. For matched gene families, the platform computes disease-stratified prevalence, cross-cohort disease associations, species-level taxonomic stratification, and functional domain annotations. Importantly, users can also explore the genomic context of individual gene families via contig-based co-localization networks derived from metagenomic species pangenome (MSP) assignments and pivot from sequence to domain searches to identify functional homologs. Additionally, the platform features a dedicated catalog to interactively browse 13,795 MSPs and export results programmatically via API endpoints. We demonstrate MetaGEAR Explorer's utility using the narG-encoding nitrate reductase gene and a case study of colibactin self-protection genes (clbS and DUF1706 homologs), where the platform revealed a consistent shift from commensals to Gammaproteobacteria carriers in disease. In summary, MetaGEAR Explorer enables rapid cross-cohort functional meta-analyses and is freely available at https://metagear-explorer.schirmerlab.de.
bioinformatics2026-03-31v1scTGCL: A Transformer-Based Graph Contrastive Learning Approach for Efficiently Clustering Single-Cell RNA-seq Data
Khan, M. S. A.; Kabir, M. H.; Faisal, M. M.Abstract
Single-cell RNA sequencing (scRNA-seq) enables characterization of cellular heterogeneity, but clustering remains challenging due to high dimensionality, dropout-induced sparsity, and technical noise. Existing graph-based and contrastive methods often rely on predefined similarity measures or suffer from high computational costs on large datasets. We propose single-cell Transformer-based Graph Contrastive Learning (scTGCL), a framework integrating multi-head self-attention with graph contrastive learning to learn robust cell representations. The model projects raw expression data into an embedding space and employs multi-head attention to adaptively learn weighted cell-cell graphs capturing diverse biological relationships. For contrastive augmentation, we apply random gene masking at the feature level and random edge dropping on attention matrices, simulating dropout and structural uncertainty. A symmetric contrastive loss maximizes agreement between original and augmented representations, while joint optimization with reconstruction and imputation losses preserves biological interpretability. Experiments on ten real scRNA-seq datasets demonstrate that scTGCL consistently outperforms nine state-of-the-art methods across clustering accuracy, normalized mutual information, and adjusted Rand index. Ablation studies validate each architectural component, and robustness analysis on simulated data confirms stable performance under varying dropout rates and differential expression levels. Furthermore, scTGCL exhibits superior computational efficiency, achieving substantially lower runtime on large-scale datasets compared with existing approaches. The framework provides an accurate, efficient, and scalable solution for single-cell clustering. Source code and datasets are available at https://github.com/ShoaibAbdullahKhan/scTGCL.
bioinformatics2026-03-31v1Scalable computation of ultrabubbles in pangenomes by orienting bidirected graphs
Harviainen, J.; Sena, F.; Moumard, C.; Politov, A.; Schmidt, S.; Tomescu, A. I.Abstract
Motivation: Pangenome graphs are increasingly used in bioinformatics, ranging from environmental surveillance and crop improvement to the construction of population-scale human pangenomes. As these graphs grow in size, methods that scale efficiently become essential. A central task in pangenome analysis is the discovery of variation structures. In directed graphs, the most widely studied such structures, superbubbles, can be identified in linear time. Their canonical generalization to bidirected graphs, ultrabubbles, more accurately models DNA reverse complementarity. However, existing ultrabubble algorithms are quadratic in the worst case. Results: We show that all ultrabubbles in a bidirected graph containing at least one tip or one cutvertex--a common property of pangenome graphs--can be computed in linear time. Our key contribution is a new linear-time orientation algorithm that transforms such a bidirected graph into a directed graph of the same size, in practice. Orientation conflicts are resolved by introducing auxiliary source or sink vertices. We prove that ultrabubbles in the original bidirected graph correspond to weak superbubbles in the resulting directed graph, enabling the use of existing linear-time algorithms. Our approach achieves speedups of up to 25x over the ultrabubble implementation in vg, and of more than 200x over BubbleGun, enabling scalable pangenome analyses. For example, on the v2.0 pangenome graph constructed by the Human Pangenome Reference Consortium from 232 individuals, after reading the input, our method completes in under 3 minutes, while vg requires more than one hour, and four times more RAM. Availability: Our method is implemented in the BubbleFinder tool https://www.github.com/algbio/BubbleFinder, via the new ultrabubbles subcommand.
bioinformatics2026-03-31v1Identifying Inheritance Patterns of Allelic Imbalance, using Integrative Modeling and Bayesian Inference
Hoyt, S. H.; Reddy, T. E.; Gordan, R.; Allen, A. S.; Majoros, W. H.Abstract
Interpreting the effects of novel mutations on phenotypic traits remains challenging, particularly for cis-regulatory variants. For rare variants, individuals typically possess at most one affected copy of the causal allele, leading to allelic imbalance, and thus the ability to infer inheritance of allelic imbalance can inform genetic studies of phenotypic traits. While many methods for detection of allele-specific expression (ASE) exist, they largely focus on ASE in one individual. We show that performing joint inference across multiple individuals in a trio allows for simultaneously improving estimates of ASE and identifying its likely mode of inheritance. Our Bayesian approach has the benefit of being able to (1) aggregate information across individuals so as to improve statistical power, (2) quantify uncertainty in estimates, and (3) rank modes of inheritance by posterior probability. We demonstrate that this model is also applicable to other forms of imbalance such as allele-specific chromatin accessibility. Applying the model to ATAC-seq and RNA-seq from several trios, we uncover examples in which ASE can be linked to imbalance in chromatin state of cis-regulatory elements and to potential causal variants. As the cost of sequencing continues to decrease, we expect that powerful methodologies such as the one presented here will promote more routine collection of samples from related individuals and improve our understanding of genetic effects on gene regulation and their contribution to phenotypic traits.
bioinformatics2026-03-31v1LATTE for locus-specific quantification of transposable element expression across species
He, J.; Peng, C.; Zhang, Y.; Wang, Z.; Zhang, H.; Fang, L.; Zhao, P.Abstract
Transposable elements (TEs) are pivotal drivers of eukaryotic genome evolution and phenotypic diversity. However, their functional contributions to complex traits remain largely obscured by expression quantification challenges arising from high sequence homology and multi-mapping ambiguities. Here, we present LATTE, an efficient computational framework for defining and quantifying TE expression at locus-specific resolution by leveraging an innovative multi-indicator Expectation-Maximization (EM) algorithm. Extensive benchmarking against simulated datasets demonstrated that LATTE significantly outperformed existing state-of-the-art tools, achieving an accuracy of 0.998 at the subfamily level and 0.839 at the locus-specific level. Applying LATTE to 813 RNA-seq datasets across humans, cattle, and chickens, we quantified expression profiles of 2,703 TEs, followed by TE-expression quantitative trait loci (TE-eQTL) mapping. The colocalization rate between TE-eQTL and host gene-eQTL was low, revealing a distinct regulatory landscape of TE expression. This decoupling of TEs from host genes is likely mediated by the differential expression of alternative transcripts. Through integrated TE-eQTL and genome-wide association studies on 3,746 complex traits across three species, we demonstrated that TEs contribute 204 (8.7%) additional associations with complex traits beyond gene-eQTL. More specifically, the Sjögren's syndrome-associated variant rs10954213 acts as a TE-eQTL that shifts the splicing landscape of IRF5, upregulating TE-containing transcripts while simultaneously suppressing canonical ones. Collectively, LATTE provides an efficient framework for studying TE expression across species, and our findings highlight the key role of TEs in understanding the genetic architecture of complex phenotypes.
bioinformatics2026-03-31v1Decoupling Topology from Geometry: Detecting Large-Scale Conformational Changes via Conformational Scanning
Lin, R.; Ahnert, S. E.Abstract
Protein function is fundamentally driven by structural dynamics, yet the majority of structural bioinformatics treats proteins as static rigid bodies. While Molecular Dynamics (MD) simulations attempt to capture these motions, they are computationally prohibitive for exploring large-scale conformational changes, such as domain movements or allostery, which occur on timescales often inaccessible to standard simulation. However, the Protein Data Bank (PDB) contains a latent wealth of dynamic information in the form of redundant entries: proteins solved in multiple distinct conformational states. Detecting these "shape-shifting" pairs remains challenging because standard structural alignment algorithms (e.g., TM-align) rely on rigid-body superposition, which fails when substantial geometric rearrangement occurs. In this study, we introduce a high-throughput method to systematically mine the PDB for proteins that share identical topology but exhibit divergent tertiary conformations. By utilizing a coarse-grained Secondary Structure Element (SSE) representation, we decouple topological connectivity from geometric rigidity, allowing for the detection of conformational homologs that share low global structural similarity despite high predicted structural similarity. We applied this "conformational scanning" across the entire RCSB database, identifying a curated dataset of proteins undergoing significant structural rearrangements. This work bridges the gap between static structural data and dynamic function, providing a critical "ground truth" dataset for benchmarking data-driven protein design and checking the plausibility of generative structure models.
bioinformatics2026-03-31v1