Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
A novel metric reveals previously unrecognized distortion in dimensionality reduction of scRNA-Seq data
Hamilton, T.; Sparta, B.; Cooley, S. M.; Aragones, S. D.; Ray, J. C. J.; Deeds, E. J.
Abstract
High-dimensional data are becoming increasingly common in nearly all areas of science. Developing approaches to analyze these data and understand their meaning is a pressing issue. This is particularly true for single-cell RNA-seq (scRNA-seq), a technique that simultaneously measures the expression of tens of thousands of genes in thousands to millions of single cells. Popular analysis pipelines significantly reduce the dimensionality of the dataset before performing downstream analysis. One problem with this approach is that dimensionality reduction can introduce substantial distortion into the data, particularly by disrupting the local neighborhoods of certain points. Since many scRNA-seq analyses like cell type clustering or trajectory inference rely on these near-neighbor relationships, distortion in this aspect of the data could significantly influence the outcomes of these analyses. Here, we introduce a straightforward approach to quantifying this distortion by comparing the local neighborhoods of points before and after dimensionality reduction. We found that popular techniques like t-SNE and UMAP introduce substantial distortion even for simple simulated data sets. For scRNA-seq data, we found the distortion in local neighborhoods was often greater than 95%, and that there was no consistent set of neighborhoods across the various steps in the consensus scRNA-seq analysis pipeline. We also found that this distortion had profound impacts on the outcomes of cell type clustering and other downstream analyses. Our findings suggest that caution must be applied when interpreting results in terms of 2-D visualizations produced by tools like UMAP, and that there is a critical need for new dimensionality reduction tools that more effectively preserve the local topological structure of the data.
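The comparison of local neighborhoods before and after reduction can be illustrated with a small sketch. The abstract does not give the exact formula, so the mean Jaccard distance between k-nearest-neighbor sets used here is an assumption, as are the function names and the toy data. Dropping coordinates stands in for a real embedding method such as t-SNE or UMAP.

```python
import math
import random

def knn_sets(points, k):
    """For each point, return the index set of its k nearest neighbors (Euclidean)."""
    result = []
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        result.append({j for _, j in dists[:k]})
    return result

def neighborhood_distortion(high, low, k):
    """Mean Jaccard distance between k-NN sets before and after reduction.

    0.0 means every neighborhood is preserved; 1.0 means none overlap.
    """
    before, after = knn_sets(high, k), knn_sets(low, k)
    total = 0.0
    for a, b in zip(before, after):
        total += 1.0 - len(a & b) / len(a | b)
    return total / len(high)

# Toy data (assumption): 200 points drawn from a 20-dimensional Gaussian.
random.seed(0)
high = [[random.gauss(0, 1) for _ in range(20)] for _ in range(200)]
low = [p[:2] for p in high]  # crude stand-in for a 2-D embedding

print(neighborhood_distortion(high, high, k=10))  # identical spaces: no distortion
print(neighborhood_distortion(high, low, k=10))   # projection scrambles neighborhoods
```

The brute-force neighbor search is quadratic in the number of points, so a real dataset would need an approximate nearest-neighbor index instead.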
bioinformatics 2026-05-05 v7
cellNexus: Quality control, annotation, aggregation and analytical layers for the Human Cell Atlas data
Shen, M.; Gao, Y.; Liu, N.; Bhuva, D.; Milton, M.; Henao, J.; Andrews, J.; Yang, E.; Zhan, C.; Liu, N.; Si, S.; Hutchison, W. J.; Shakeel, M. H.; Morgan, M.; Papenfuss, A. T.; Iskander, J.; Polo, J. M.; Mangiola, S.
Abstract
Large-scale single-cell atlases such as the Human Cell Atlas have transformed our understanding of human biology. Yet, the lack of a robust framework that standardises quality control, expands cellular annotation, and adds normalisation and analytical layers, limits multi-study analyses and the usefulness of this resource. Here we present cellNexus, a comprehensive tool and resource that converts the Human Cell Atlas collection into analysis-ready data by linking quality control layers, metadata enrichment, expression normalisation, analysis and data aggregation. These enhancements enable robust statistical modelling across studies, exemplified by a multi-tissue map of immune cell communication during ageing, which reveals macrophage-muscle axes as among the most depleted regenerative interactions with age. All harmonised layers, including pseudobulk and cell-cell communication summaries, are accessible via a public web interface and with R and Python APIs. By providing continuous integration with CELLxGENE releases, cellNexus transforms large cell atlas corpora into an accessible, reproducible, interoperable foundation for large-scale biological discovery and the next generation of single-cell foundation models.
bioinformatics 2026-05-05 v3
Topology Matters: The Trade-off Between Wasserstein Critics and Discriminators in Single-Cell Data Integration
Reid, K.; Stein-O'Brien, G.; Guven, E.
Abstract
Motivation: Integrating single-cell RNA sequencing experiments (scRNA-seq) across technologies is hindered by severe technical batch effects that confound analysis and mask biological variation. Adversarial autoencoders are a popular solution to correct for these confounding effects, often relying on discriminator networks that approximate the Jensen-Shannon divergence. Previous research has established that the Jensen-Shannon divergence suffers from vanishing gradients when distributions do not overlap, a common phenomenon when datasets come from different sequencing technologies, leading to failed training. In contrast, the Wasserstein distance remains a valid metric with informative gradients even for disjoint distributions. While both approaches appear in the literature, no study has rigorously isolated the adversarial objective to systematically evaluate its impact on batch alignment, biological conservation, and scalability across varying dataset complexities. Results: We introduce a multi-class reference-based Wasserstein critic to systematically benchmark adversarial objectives. We find that the Wasserstein critic yields superior mixing; however, extensive reference sensitivity analysis reveals that the Wasserstein critic is prone to over-correction resulting in collapsed cellular representations; that its integrative performance is dependent on a topologically dense reference batch; and that it scales poorly with the number of batches. In contrast, we find that the "weak" integration characteristic of discriminators acts as a protective measure against over-correction. By highlighting the trade-offs between these methods, we aim to empower researchers to choose the correct method for their specific needs. Availability and Implementation: Source code is available at https://github.com/kreid415/wasserstein-critic-deconfounding. 
Data are available at https://figshare.com/articles/dataset/Benchmarking_atlas-level_data_integration_in_single-cell_genomics_-_integration_task_datasets_Immune_and_pancreas_/12420968/1. Contact: kreid20@jh.edu.
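The contrast between the two divergences on non-overlapping distributions can be checked numerically. This is a minimal sketch, not the authors' code: it places two point masses on a small integer grid (an assumed toy setup) and shows that the Jensen-Shannon divergence stays at log 2 however far apart they are, while the Wasserstein-1 distance grows with the separation.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def wasserstein_1d(p, q):
    """Wasserstein-1 distance on the integer grid 0..n-1 (sum of absolute CDF gaps)."""
    cdf_gap, total = 0.0, 0.0
    for x, y in zip(p, q):
        cdf_gap += x - y
        total += abs(cdf_gap)
    return total

def point_mass(position, size):
    """Distribution with all probability on one grid position."""
    return [1.0 if i == position else 0.0 for i in range(size)]

for shift in (1, 5, 9):
    p, q = point_mass(0, 10), point_mass(shift, 10)
    # JS is constant (log 2) for disjoint supports; Wasserstein-1 equals the shift.
    print(shift, round(js_divergence(p, q), 4), wasserstein_1d(p, q))
```

Because the Jensen-Shannon value does not change with the shift, it gives a generator no signal about which direction reduces the gap, which is the vanishing-gradient problem the abstract refers to.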
bioinformatics 2026-05-05 v3
Sequence-dependent transferability of the LRLLR membrane translocation motif: A computational study of smacN and NR2B9c peptides.
Munoz-Gacitua, D.; Blamey, J.
Abstract
The LRLLR cell-penetrating motif can be transferred to confer membrane translocation activity, but only to compatible recipient peptides. Using umbrella sampling molecular dynamics simulations, we demonstrate that C-terminal LRLLR addition to the pro-apoptotic smacN peptide eliminates its translocation barrier entirely, transforming a +65 kJ/mol barrier into a -50 kJ/mol energy well. In contrast, N-terminal LRLLR addition to the neuroprotective NR2B9c peptide increases the translocation barrier from +85 to +100 kJ/mol, demonstrating that motif transfer can prove counterproductive for incompatible sequences. Cell-penetrating peptides offer promising strategies for intracellular delivery of therapeutic cargo, yet the sequence determinants governing their activity remain incompletely understood. The LRLLR motif, identified through systematic screening as essential for spontaneous membrane translocation, represents a minimal penetrating element whose transferability has not been previously evaluated. We appended this motif to two clinically relevant peptides: smacN, a tetrapeptide targeting inhibitor of apoptosis proteins in chemotherapy-resistant cancers, and NR2B9c, a nonapeptide that disrupts excitotoxic signaling in ischemic stroke. Potential of mean force profiles calculated across a POPC/POPG bilayer, combined with analysis of hydrogen bonding patterns, secondary structure propensity, and conformational dynamics, reveal the structural basis for these divergent outcomes. Successful transfer to smacN results from favorable complementarity: the hydrophobic, neutral smacN provides an ideal platform for the charged, amphipathic LRLLR motif, yielding a chimera capable of simultaneous interaction with both membrane leaflets. Transfer failure with NR2B9c stems from conformational rigidity induced by intramolecular hydrogen bonding, which prevents optimal membrane insertion, combined with unfavorable positioning of internal polar residues at the bilayer center. 
These findings establish that cell-penetrating motif transfer requires compatibility in charge distribution, hydrophobicity, and conformational flexibility between the motif and recipient sequence. The smacN-LRLLR chimera emerges as a promising candidate for experimental validation as a membrane-permeable therapeutic for survivin-positive tumors. More broadly, this work demonstrates the value of computational screening to identify compatible motif-cargo pairings prior to experimental investment.
bioinformatics 2026-05-05 v2
MolGene-E: Inverse Molecular Design to Modulate Single Cell Transcriptomics
Ohlan, R.; Murugan, R.; Xie, L.; Nallabolu, V.; Mottaqi, M.; Zhang, S.; Xie, L.
Abstract
Designing drugs that can restore a diseased cell to its healthy state is an emerging approach in systems pharmacology to address medical needs that conventional target-based drug discovery paradigms have failed to meet. Single-cell transcriptomics can comprehensively map the differences between diseased and healthy cellular states, making it a valuable technique for systems pharmacology. However, single-cell omics data is noisy, heterogeneous, scarce, and high-dimensional. As a result, no machine learning methods currently exist to use single-cell omics data to design new drug molecules. We have developed a new deep generative framework named MolGene-E to tackle this challenge. MolGene-E combines two novel models: 1) a cross-modal model that can harmonize and denoise chemical-perturbed bulk and single-cell transcriptomics data, and 2) a contrastive learning-based generative model that can generate new molecules based on the transcriptomics data. MolGene-E consistently outperforms baseline methods in generating high-quality, hit-like molecules on gene expression profiles from two evaluation settings: CRISPR knock-out perturbation profiles from L1000toRNAseq dataset, and single-cell gene expression profiles from Sciplex-3 dataset, both in zero-shot molecule generation setting. This superior performance is demonstrated across diverse de novo molecule generation metrics. Extensive evaluations demonstrate that MolGene-E achieves state-of-the-art performance for zero-shot molecular generations. This makes MolGene-E a potentially powerful new tool for drug discovery.
bioinformatics 2026-05-05 v2
Exploring per-base quality scores as a surrogate marker of cell-free DNA fragmentome
Volkov, H. H. V.; Raitses-Gurevich, M.; Grad, M.; Shlayem, R.; Leibowitz, D.; Rubinek, T.; Golan, T.; Shomron, N.
Abstract
Per-base quality scores are widely treated as technical metadata in next-generation sequencing. Here, we show that in rigorously controlled whole-genome sequencing of cell-free DNA, quality profiles may encode fragmentomic signals that enable classification of cancer samples against matched controls. Analyzing four independent batches (23 cancer samples: pancreatic and breast; 22 matched controls) sequenced in a within-lane regime and further normalized per flow-cell tile to reduce technical confounders, we demonstrate through unsupervised analysis that boundary-enriched dynamics captured in these quality scores consistently separate cancer from control samples. A leave-one-batch-out classifier trained on quality-derived scores achieved a pooled area under the curve of 0.81. Furthermore, we show that the quality-derived metric correlates with short-fragment enrichment and tumor-associated 5'-end motifs, performing comparably to established, motif-based orthogonal methods. These results provide initial evidence that quality scores could serve as a low-cost, alignment-free biomarker for cfDNA-based cancer detection.
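The leave-one-batch-out evaluation with a pooled area under the curve can be sketched as follows. The classifier here is a deliberately simple midpoint rule on a single scalar feature, which is an assumption for illustration; the abstract does not describe the actual model. All data values are invented.

```python
def auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney statistic (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def leave_one_batch_out(features, labels, batches):
    """Score each sample with a rule fitted on the other batches, then pool the scores."""
    pooled = [0.0] * len(features)
    for held_out in sorted(set(batches)):
        train = [i for i, b in enumerate(batches) if b != held_out]
        pos = [features[i] for i in train if labels[i] == 1]
        neg = [features[i] for i in train if labels[i] == 0]
        mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
        direction = 1.0 if mean_pos >= mean_neg else -1.0
        midpoint = (mean_pos + mean_neg) / 2
        # Held-out samples never influence the rule that scores them.
        for i, b in enumerate(batches):
            if b == held_out:
                pooled[i] = direction * (features[i] - midpoint)
    return auc(pooled, labels)

# Invented toy data: one quality-derived score per sample, three batches.
features = [0.9, 0.8, 0.3, 0.2, 0.7, 0.6, 0.4, 0.1, 1.0, 0.5, 0.35, 0.25]
labels = [1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0]
batches = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
print(leave_one_batch_out(features, labels, batches))
```

Pooling the held-out scores before computing a single AUC, rather than averaging per-batch AUCs, is what makes the estimate sensitive to batch-level shifts in the score scale.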
bioinformatics 2026-05-05 v2
Preferential CDR masking in paired antibody language models improves binding affinity prediction
Talaei, M.; Walker, K. C.; Hao, B.; Jolley, E.; Jin, Y.; Kozakov, D.; Misasi, J.; Vajda, S.; Paschalidis, I. C.; Joseph-McCarthy, D.
Abstract
Background: Therapeutic antibodies are a leading class of biologics, yet their unique architecture poses challenges for computational modeling. Each antibody comprises paired heavy and light variable domains with conserved framework regions that maintain structure and hypervariable complementarity-determining regions (CDRs) that directly contact antigens. This functional asymmetry, where CDRs determine binding specificity while frameworks provide scaffolding, suggests that region-aware training strategies could yield superior representations. Existing protein language models treat all regions uniformly, potentially missing critical features present in CDRs. Methods: We developed a region-aware pretraining strategy for paired variable domain sequences using two protein language models: a 3 billion parameter model (ESM2) and a compact 600 million parameter model (ESM C). We compared three masking approaches: uniform whole-chain masking, CDR-focused masking, and a hybrid strategy. Final models were trained on over 1.6 million paired antibody sequences and evaluated on binding affinity datasets with over 90,000 antibody variants across six antigens, including single-mutant panels and combinatorial libraries. Results: Here we show that CDR-focused training produces embeddings with superior predictive performance for antibody-antigen binding. Our approach achieves up to 27% improvements in binding affinity prediction compared to benchmarked antibody models. Remarkably, training exclusively on paired sequences proves sufficient; pretraining on billions of unpaired sequences provides no measurable benefit. Our compact model matches or exceeds larger antibody-specific baselines. Conclusions: These findings establish that prioritizing paired sequences with CDR-aware supervision over scale and complex training schemes achieves both computational efficiency and predictive accuracy, providing a practical framework for next generation antibody language models.
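As a rough illustration of region-aware masking, the sketch below picks masked positions with a higher probability inside CDRs than in framework regions. The rates, the CDR coordinates, and the function name are all assumptions; the abstract does not state how the CDR-focused or hybrid strategies were parameterized.

```python
import random

def sample_mask_positions(length, cdr_positions, cdr_rate, framework_rate, rng):
    """Pick positions to mask, using a higher rate inside CDRs than in frameworks."""
    masked = []
    for i in range(length):
        rate = cdr_rate if i in cdr_positions else framework_rate
        if rng.random() < rate:
            masked.append(i)
    return masked

# Hypothetical 30-residue variable domain with two CDR stretches.
cdrs = set(range(8, 13)) | set(range(20, 26))
rng = random.Random(0)

# Assumed rates: CDR residues are masked ten times as often as framework residues.
mask = sample_mask_positions(30, cdrs, cdr_rate=0.5, framework_rate=0.05, rng=rng)
print(mask)
```

Setting both rates to the same value recovers uniform whole-chain masking, so the three strategies compared in the abstract could in principle be expressed as different settings of the two rates.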
bioinformatics 2026-05-05 v2
Systematic contextual biases in SegmentNT potentially relevant to other nucleotide transformer models
Ebbert, M. T. W.; Ho, A.; Page, M. L.; Dutch, B.; Byer, B. K.; Hankins, K. L.; Sabra, H.; Aguzzoli Heberle, B.; Wadsworth, M. E.; Fox, G. A.; Karki, B.; Hickey, C.; Fardo, D. W.; Bumgardner, C.; Jakubek, Y. A.; Steely, C. J.; Miller, J. B.
Abstract
Recent advances in large language models (LLMs) have extended to genomic applications, yet model robustness relative to context is unclear. Here, we demonstrate two intrinsic biases (input sequence length and nucleotide position) affecting SegmentNT results, a model included with the Nucleotide Transformer that provides nucleotide-level predictions of biological features. We demonstrate that nucleotide position within the input sequence (beginning, middle, or end) alters the nature of SegmentNT's raw prediction probabilities, which can be standardized to improve prediction consistency. While longer input sequence length improves model performance, diminishing returns suggest a surprisingly small input length of ~3,072 nucleotides might be sufficient for many applications. We further identify a 24-nucleotide periodic oscillation in SegmentNT's prediction probabilities, revealing an intrinsic bias potentially linked to the model's training tokenization (6-mers) and architecture. We identify potential approaches to account for these biases and provide generalizable insights for utilizing nucleotide-resolution functional prediction models.
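A periodic oscillation of this kind can be detected by measuring Fourier power at candidate periods. The sketch below is not the authors' analysis: it builds a synthetic probability track with an assumed 24-position oscillation plus a weaker unrelated component, then recovers the dominant period.

```python
import cmath
import math

def power_at_period(signal, period):
    """Squared magnitude of the Fourier component of `signal` at the given period."""
    mean = sum(signal) / len(signal)
    coeff = sum((x - mean) * cmath.exp(-2j * math.pi * i / period)
                for i, x in enumerate(signal))
    return abs(coeff) ** 2 / len(signal)

def dominant_period(signal, candidates):
    """Candidate period with the largest Fourier power."""
    return max(candidates, key=lambda p: power_at_period(signal, p))

# Synthetic per-nucleotide probabilities (assumption): baseline 0.5, a 24-position
# oscillation, and a weaker period-7 component acting as a distractor.
probs = [0.5
         + 0.1 * math.sin(2 * math.pi * i / 24)
         + 0.02 * math.cos(2 * math.pi * i / 7)
         for i in range(240)]

print(dominant_period(probs, range(2, 61)))
```

With 6-mer tokens, a period of 24 nucleotides corresponds to four tokens, which is consistent with the abstract's suggestion that the oscillation is linked to tokenization and architecture.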
bioinformatics 2026-05-05 v2
multiVIB: A unified probabilistic contrastive learning framework for atlas-scale integration of single-cell multi-omics data
Xu, Y.; Fleming, S. J.; Wang, B.; Schoenbeck, E. G.; Babadi, M.; Huo, B.-X.
Abstract
Comprehensive brain cell atlases are essential for understanding neural functions and enabling translational insights. As single-cell technologies proliferate across experimental platforms, species, and modalities, these atlases must scale accordingly, calling for a data integration framework that aligns heterogeneous datasets without erasing biologically meaningful variations. Existing tools typically address narrow integration settings, forcing researchers to assemble ad hoc workflows that may generate artifacts. Here, we introduce multiVIB, a unified probabilistic contrastive learning framework that handles diverse integration scenarios. We show that multiVIB achieves state-of-the-art performance while mitigating spurious alignments. Applied to atlas-scale datasets from the BRAIN Initiative, multiVIB demonstrates robust and scalable integration, including integration of diverse data modalities and reliable preservation of species-specific variations in cross-species integration. These capabilities position multiVIB as a scalable, biologically faithful foundation for constructing next-generation brain cell atlases with the growing landscape of single-cell data.
bioinformatics 2026-05-05 v2
IMAS enables target-aware integration of tumour multiomics to resolve communication-guided regulatory mechanisms
Deyang, W.; Yamashiro, T.; Inubushi, T.
Abstract
Tumour multiomic datasets are often sparse, heterogeneous and limited in size, hindering robust and interpretable discovery of regulatory mechanisms. Here we present IMAS, a target-aware integrative framework for multiomic data augmentation and mechanism prioritization that leverages a pan-cancer single-cell multiomic resource to contextualize new tumour datasets and identify reliable sample-specific mechanistic hypotheses. IMAS combines shared latent-space modelling with target-domain adaptation to improve correspondence between predicted and observed RNA and TF profiles while concentrating explanatory predictive supports within the target dataset. Building on this adapted representation, IMAS reconstructs structured RNA-TF coupling networks, refines intercellular signaling through ligand-informed communication modelling, and organizes regulatory programs along communication-associated ordering. In independent colon cancer data, IMAS improved cluster-resolved correspondence and revealed communication-guided regulatory cascades across malignant epithelial states. A LAMB1-centred analysis further demonstrates how the framework supports progressive reinforcement of local regulatory structure and enables perturbation-based probing of context-specific dependencies. Rather than exhaustively predicting all possible outcomes, IMAS provides a target-aware and interpretable strategy to construct consistent and interpretable mechanism-discovery scaffolds and prioritize regulatory dependencies in data-limited tumour systems.
bioinformatics 2026-05-05 v2
Transfer Learning Enables Drug-Target Interaction Prediction in Data-Scarce One-Carbon Metabolism
Dalkiran, A.; Cho, T.; Atalay, M. V.; Shin, K. W. D.; Meliton, A. Y.; Woods, P. S.; Shamaa, O. R.; Hamanaka, R. B.; Mutlu, G. M.; Cetin-Atalay, R.
Abstract
Predicting drug-target interactions (DTIs) with deep learning offers opportunities to accelerate drug discovery, yet performance is constrained by the scarcity of target-specific training data. This is a particular challenge for mitochondrial one-carbon (1C) pathway enzymes, which are attractive therapeutic targets but remain pharmacologically understudied. Mitochondrial 1C metabolism supplies glycine, reducing equivalents, and 1C units critical for nucleotide synthesis, and has emerged as a key pathway in cancer and fibrosis. SHMT2 and MTHFD2, two key 1C enzymes, support collagen production in fibroblasts; blocking either prevents TGF-β-induced glycine and collagen accumulation. Here, we developed transfer learning-based deep learning models to predict interactions between approved drugs and SHMT2 or MTHFD2 despite minimal target-specific training data, pre-training on large datasets from related enzymes before fine-tuning to these targets. Virtual screening of the DrugBank library identified six candidates, three of which, Carbimazole, Crizotinib, and GSK2018682, reduced TGF-β-induced collagen production and glycine accumulation in human lung fibroblasts, demonstrating transfer learning as a strategy for repurposable drug identification in data-scarce metabolic targets.
bioinformatics 2026-05-05 v1
immuneKG: An Immune-Cell-Aware Knowledge Graph Framework for Target Discovery in Immune-Mediated Diseases
Ye, Y.; PB-IDD Department, Pharmablock Sciences Inc.
Abstract
Biomedical knowledge graphs have emerged as foundational infrastructure for AI-driven drug discovery, yet their translational impact on novel target identification in immune-mediated diseases remains limited. Here we present immuneKG, a multimodal knowledge graph centred on autoimmune diseases, constructed through biologically meaningful feature reprogramming of disease nodes to enable deep mechanistic modelling of immune-related disorders. immuneKG introduces a new entity class immune_cell, and four original directed relation types, together adding 9,105 novel triples absent from all existing biomedical KG schemas. Disease nodes are endowed with three novel modal feature sets quantifying immune homeostatic imbalance: autoantibody profiles, cytokine signatures, and HLA genotypes, complemented by systemic involvement scores and genetic features. The graph encompasses over 407,000 training triples across 7,287 entities and 32 relation types. Applied to inflammatory bowel disease (IBD), immuneKG combined with a HeteroPNA-Attn graph neural network achieves a Hits@100 of 0.99 against a Clarivate Phase II+ clinical pipeline, while a novelty-penalised scoring function surfaces high-potential dark targets. The framework shifts from conventional candidate-space screening to a development-oriented decision-support paradigm, providing actionable and interpretable guidance for downstream drug discovery.
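For readers unfamiliar with the Hits@k metric, the sketch below shows one common reading of it: the fraction of known targets that appear among the top k ranked candidates. Whether the paper computes it exactly this way is an assumption, and the gene identifiers are placeholders.

```python
def hits_at_k(ranked_candidates, true_targets, k):
    """Fraction of known targets that appear among the top-k ranked candidates."""
    top_k = set(ranked_candidates[:k])
    return sum(t in top_k for t in true_targets) / len(true_targets)

# Placeholder identifiers; a real ranking would come from the model's target scores.
ranking = ["G1", "G7", "G3", "G9", "G2", "G5"]
known = {"G3", "G5", "G7"}

print(hits_at_k(ranking, known, k=3))  # G7 and G3 are in the top 3, G5 is not
print(hits_at_k(ranking, known, k=6))  # every known target is in the top 6
```

Under this reading, a Hits@100 of 0.99 would mean that nearly every target in the reference clinical pipeline falls within the model's top 100 predictions.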
bioinformatics 2026-05-05 v1
DOMINO: Learning Domain Co-occurrence for Multidomain Protein Design
Dai, F.; Su, J.; Tan, Q.; Yang, H.; Zhou, X.; Yuan, F.
Abstract
Multidomain proteins arise through the reuse and recombination of structural domains, yet natural architectures represent a sparse, structured sample of the possible domain-combination space. Here, we introduce DOMINO, a two-stage framework that learns domain co-occurrence from TED-annotated multidomain proteins and uses the learned patterns to generate new multidomain sequences. DOMIN, a contrastive retrieval model, embeds domains into a latent compatibility space and retrieves candidate partners for a query domain from a TED-derived domain pool, including pairings not observed in the TED-derived co-occurrence set. DOMO, a conditional autoregressive sequence model, converts each retrieved domain pair into a full-length protein sequence by jointly generating the specified domain regions and the non-domain sequence context between and around them. DOMIN recovers hierarchical patterns of natural domain co-occurrence and expands the observed CATH homologous-superfamily co-occurrence network with candidate novel pairings. DOMO realizes both held-out natural pairs and DOMIN-retrieved pairs as proteins with high domain recovery and high AlphaFold-predicted structural confidence. Applied at scale, DOMINO generated 5 million retrieval-derived multidomain proteins, with sampled designs showing recovery of the specified domains, diverse CATH annotations, and sequence novelty relative to UniRef100. Together, these results support domain co-occurrence as a predictive design prior and demonstrate a scalable strategy for exploring multidomain protein architectures through new combinations of existing structural modules.
bioinformatics 2026-05-05 v1
Building computational benchmarks: an Omnibenchmark reimplementation of a single-cell preprocessing pipeline evaluation
Choudhury, A.; Kitak, T.; Carrillo, B.; Busch, P.; Emons, M.; Gunz, S.; Koderman, M.; Luo, S.; Mallona, I.; Meara, A.; Wissel, D.; Robinson, M. D.
Abstract
In the past few years, we have seen a veritable surge in single-cell (e.g., RNA sequencing) techniques and datasets, enabling increasingly detailed characterization of cellular heterogeneity across tissues and conditions. This surge in single-cell techniques has been complemented by a large number of analysis frameworks and pipelines, and a large parameter space and researcher degrees of freedom to use them. Many neutral benchmarks have been presented for various computational tasks, but most make design decisions that render them incompatible with each other, e.g., different datasets and metrics, or parameter sets used. In this work, we showcase a recently developed framework, Omnibenchmark, to build reproducible, extensible and standardized method comparisons. This not only facilitates the broad investigation of pipelines used in single-cell data analysis, but also highlights how the process of building benchmarks can be streamlined and unified. We do this as an initial proof-of-principle for an arms-length benchmark that evaluates five single-cell RNA sequencing pipelines (filtering to normalization to dimensionality reduction to clustering) on three datasets. This standardization enables benchmarks to be easily extended in several directions, including broader parameter sweeps, comparisons across software versions and architectures, isolation of pipeline steps, and integration of additional pipelines, datasets, and metrics.
bioinformatics 2026-05-05 v1
ANYI: The ANnotated Yeast Interactome
Nissley, D. A.; Goel, M.; Castellanos-Girouard, X.; Kuntz, C. P.; Wang, Y.; Mukhtar, S.; Serohijos, A.; Schlebach, J. P.
Abstract
Although several existing protein-protein interaction (PPI) databases provide yeast PPI data, none unify large-scale network topology information with detailed biophysical, proteostasis, and regulatory annotations in a single protein-centric framework. To address this gap, we developed the ANnotated Yeast Interactome (ANYI), an open, integrated resource that combines experimental yeast PPIs with sixteen feature annotation types, including protein abundance, half-life, disorder content, post-translational modifications, conformational stability, chaperone interactions, sequence, and structure. ANYI integrates 3,927 proteins with 155 annotation features, forming a unified matrix that enables systematic cross-layer analyses. Available via GitHub and Docker Hub with an interactive network browser for broad accessibility, ANYI provides both experienced and beginner computational scientists with tools to investigate the yeast interactome. For example, users can directly test whether highly connected hub proteins exhibit distinct stability, disorder, or proteostasis signatures relative to peripheral nodes.
bioinformatics 2026-05-05 v1
Network-based analysis of glioblastoma identifies patient communities and cluster-specific biomarkers
Siminea, N.; Florea, D.; Paun, M.; Paun, A.; Petre, I.
Abstract
Glioblastoma is an aggressive and highly heterogeneous brain tumor with poor prognosis despite multimodal treatment strategies. Understanding the molecular diversity of the disease is essential for improving tumor stratification and identifying potential therapeutic targets. In this study, we investigate whether network-based analysis can reveal biologically meaningful subgroups of glioblastoma tumors. Using RNA sequencing and mutation data from the TCGA-GBM cohort, we constructed patient-specific protein-protein interaction networks based on genes that are differentially expressed or harbor somatic mutations. These networks capture the molecular alterations associated with individual tumors within the context of the human interactome. We then derived similarities between tumors using a binary representation of network nodes and the Jaccard similarity metric, enabling the construction of a patient similarity graph. Community detection algorithms (Louvain and Leiden) were applied to this graph to identify clusters of tumors with similar molecular network profiles. Our analysis revealed six tumor communities characterized by distinct gene compositions and enriched biological processes. For each community, we identified candidate biomarkers and network hubs that may represent potential therapeutic targets. Several of the identified genes correspond to known drug targets, while others represent potential candidates for further investigation. These results illustrate how integrating molecular alterations with network-based modeling can help stratify glioblastoma tumors and uncover molecular mechanisms that may guide the development of more personalized therapeutic strategies.
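The step from binary node representations to a patient similarity graph can be sketched as below. The patient identifiers and gene sets are invented, and the edge threshold is an assumption; the abstract does not say whether edges were thresholded or kept fully weighted. Community detection (Louvain or Leiden) would then run on the resulting edge list.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similarity_edges(patient_nodes, threshold):
    """Weighted edges between patients whose network node sets overlap enough."""
    edges = []
    for p, q in combinations(sorted(patient_nodes), 2):
        weight = jaccard(patient_nodes[p], patient_nodes[q])
        if weight >= threshold:
            edges.append((p, q, weight))
    return edges

# Invented example: the node set of each patient-specific network.
patients = {
    "P1": {"TP53", "EGFR", "PTEN"},
    "P2": {"TP53", "EGFR", "NF1"},
    "P3": {"IDH1", "ATRX"},
}

# P1 and P2 share 2 of 4 distinct nodes (similarity 0.5); P3 shares none.
print(similarity_edges(patients, threshold=0.3))
```

Representing each network as a set is equivalent to the binary node vector described in the abstract, since the Jaccard index depends only on which nodes are present.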
bioinformatics 2026-05-05 v1
Clonal embeddings allow exploratory analysis of lineage-resolved single-cell data
Isaev, S.; Erickson, A. G.; Adameyko, I.; Kharchenko, P. V.
Abstract
Assays coupling high-throughput lineage tracing with single-cell transcriptomics are transforming studies of development and disease biology, revealing not only major differentiation routes but also continuous fate biases and their putative regulators. Yet, analysis of such data at scale presents challenges due to the sparse nature of clonal data and annotation dependencies. Towards that aim we developed a machine learning approach - clone2vec - which learns informative clone embeddings directly from the cellular expression manifold, bypassing discrete cell-type labels and remaining stable when clones are represented by few cells. This representation summarizes clonal variation as an interpretable geometry that supports exploration, statistics for clone-gene associations, and cross-dataset alignment. In prospective barcoding datasets spanning embryogenesis, tumorigenesis, and hematopoiesis, clone2vec recapitulates established clonal patterns and uncovers new axes of continuous variation that implicate regulatory programs and developmental pathways. In tumor microenvironments profiled with TCR sequencing, clone2vec robustly recovers distinct Treg lineages as well as conserved CD8+ T cell sublineages across cancer types, including several bystander-like clonal subsets. Overall, clone2vec provides a robust, general solution for the exploratory analysis of lineage-coupled scRNA-seq data.
bioinformatics 2026-05-05 v1
Cross-assay RNA modeling reveals cancer biomarkers
Townsend, H. A.; Jordan, K. R.; Wolsky, R. J.; Van Kleunen, L. B.; Davidson, N. R.; Behbakht, K.; Sikora, M. J.; Dowell, R. D.; Clauset, A.; Bitler, B. G.
Abstract
The clinical heterogeneity of cancer poses a major challenge for precision medicine. Limited cohort sizes across evolving assay platforms impede reliable biomarker discovery. Here, we systematically evaluate how to integrate data from four transcriptomics platforms: bulk and single-cell (sc) RNA sequencing (RNA-seq), NanoString, and microarray for predictive modeling in cancer. We use high-grade serous carcinoma (HGSC) of tubo-ovarian origin as a model system, as it is highly heterogeneous in both biology and assay data. We find that using fold-change of gene expression in patients with matched pre- and post-neoadjuvant chemotherapy samples reduces inter-patient and inter-assay variability but is insufficient to overcome platform-specific biases. Microarray and scRNA-seq data exhibit systematic biases, while RNA-seq and NanoString show the most promise for combination into a single training cohort. To mitigate inter-assay limitations, we generate a new data set of HGSC tumor samples profiled with both RNA-seq and NanoString, and use it to identify the limits of detection and optimal harmonization strategies. Our approaches enable integration of cohorts for separate and combined RNA-seq and NanoString predictive models of disease recurrence (test-set AUROCs > 0.8), validated in external microarray cohorts. We leverage single-cell and bulk RNA-seq network-based analyses to provide mechanistic context for genes in the predictive models. Our models indicate that GBP4 expression is a key predictor of recurrence and marks immune remodeling towards cytotoxicity. We provide an interactive web portal to facilitate exploration of data and results. These findings guide cross-assay harmonization of transcriptomic data and enable improved predictive modeling in heterogeneous cancers.
bioinformatics2026-05-05v1Massively parallel reporter assay-informed modeling improves prediction of context-specific enhancer-gene regulatory interactions
DeGroat, W.; Kreimer, A.Abstract
Enhancers are cis-regulatory elements that drive context-specific gene expression, yet their target genes and modes of action remain largely unresolved. Because most disease-associated variants lie in non-coding regulatory DNA, accurate, cell type-specific enhancer-gene (E-G) mapping is essential for understanding genetic risk. However, current E-G prediction frameworks lack the resolution to capture such context-specific interactions. Massively parallel reporter assays (MPRAs) provide measurement of cis-regulatory activity, but their integration into genome-scale E-G models has been limited. Here, we introduce MPRabc, an MPRA-informed model that improves E-G interaction prediction. MPRabc integrates predicted MPRA activity, sequence-derived regulatory features, epigenomic signals, and three-dimensional chromatin contact maps with CRISPR-based perturbation training data. Benchmarking against validated regulatory interactions shows that MPRabc outperforms state-of-the-art models. We generated high-resolution E-G networks for K562, HepG2, and hiPSC cell lines and applied a graph-based framework to identify regulatory architecture, map trait-associated variants and expression quantitative trait loci, and resolve transcription factor drivers of enhancer activity. Across contexts, we accurately recovered lineage-defining regulatory programs, including GATA1::TAL1 in K562, HNF1A/B in HepG2, and POU factor circuits in hiPSCs. Together, these results establish MPRA-informed modeling as a scalable strategy for decoding enhancer function and linking non-coding variants to gene regulatory mechanisms across cellular contexts.
bioinformatics2026-05-05v1Cell Type Weighted Dimensionality Reduction
Putta, S.; Jensen, W.; Devakonda, S.; Pennell, L.; Croteau, J.Abstract
High-dimensional single-cell technologies, such as flow cytometry and CITE-Seq, typically rely on established lineage markers to define cell identities. Additional markers are commonly analyzed within the context of these predefined cell types. Nonlinear projection methods such as t-SNE and UMAP provide a visual framework for this analysis by enabling the overlay of cell types and marker expression. However, these methods frequently produce projections where distinct cell types substantially overlap, hindering interpretation of marker expression patterns relative to known cell types. In this study, we investigate the underlying causes of this phenomenon and demonstrate that such overlaps often stem from the inherent high-dimensional structure of the data rather than limitations in the dimensionality reduction algorithms themselves. To address this, we introduce Cell Type Weighted Dimensionality Reduction (CWDR), a novel approach that incorporates lineage-based information through a supervised weighting mechanism. By integrating both cell identity and marker expression, CWDR preserves the visual separation between predefined cell types while maintaining the local variance necessary for downstream analysis. We validate our method across multiple high-dimensional flow cytometry and proteogenomic datasets. Our results show that CWDR significantly reduces inter-cluster overlap compared to traditional methods, providing a clearer framework for visualizing marker expression within the context of specific cell lineages.
bioinformatics2026-05-05v1Interpreting Omics Data Analysis with Large Language Models for Disease Target and Drug Discovery
XU, Z.; Chen, W.; Ren, W.; Xu, T.; Amaechin, S.; Khan, R.; Chen, Y.; Province, M.; Payne, P.; Li, F.Abstract
In biomedical scientific discovery, synthesizing prior knowledge from the literature is an essential component of interpreting numerical omics data analyses for disease target identification and drug discovery. Large language models (LLMs) alone can rapidly retrieve disease mechanisms from biomedical text, but text-only outputs are general and unreliable for target and drug prioritization without cohort-specific quantitative evidence. Herein, we propose a provenance-aware Text-to-Target framework that couples schema-constrained multi-model LLM retrieval with numeric omics data analysis. The key design is a modality-aware fusion step: candidates are partitioned into overlap-supported anchors, retrieval-only hidden hubs, and network-emergent novelty nodes, then propagated into staged hypothesis and strategy generation under topology constraints. We evaluate the model in Alzheimer's disease (AD) and pancreatic ductal adenocarcinoma (PDAC). In PDAC, the workflow produced a balanced 75-gene candidate universe and a 23-strategy portfolio, with significant DepMap support at both target level and strategy level. In AD, stricter candidate controls yielded a compact 34-gene universe and 14 strategies; under an expanded CRISPRbrain registry, both target-level axes were significant, with strong strategy-level enrichment. Across both diseases, final strategies preserved full provenance closure to the candidate pool, enabling end-to-end auditability from retrieval artifacts to validation outputs. These results support a transferable discovery architecture in which omics evidence constrains biological activity, LLM retrieval expands mechanistic search space, and network-aware fusion preserves interpretability. The framework provides a reproducible basis for dual-disease target prioritization and motivates continuous literature-mechanism concordance with agentic evidence-refresh loops.
bioinformatics2026-05-05v1A universal taxonomic and functional human gut microbiome model for disease classification and phenotype discovery
Karwowska, Z.; Mozejko, M.; Nowak, W.; Romanchenko, A.; Szczurek, E.; Kosciolek, T.Abstract
The human gut microbiome is a powerful indicator of host health, yet its compositional nature, high sparsity, and inter-individual variability complicate downstream analysis. Here, we introduce two complementary approaches to characterize gut microbiome structure at population scale. First, we define eight functional signatures of the human gut microbiome using Non-negative Matrix Factorization, revealing coordinated metabolic patterns that partially decouple from taxonomic composition. Second, we present GUT-FORMer, a transformer-based autoencoder that jointly models taxonomic and functional metagenomic profiles from close to 21,000 publicly available samples. The learned latent representations capture biologically meaningful structure, reflect geographic and disease-associated variation, and enable accurate classification of 25 diseases in both binary and multiclass settings, as well as regression of host age and BMI. GUT-FORMer outperforms existing microbiome indices and deep learning methods across all tasks, establishing a generalizable framework for microbiome-based precision medicine.
bioinformatics2026-05-05v1Structure-derived synthetic sequences guide a protein language model toward metalloproteins
Peteani, G.; Sgueglia, G.; Lemmin, T.; Chino, M.Abstract
Motivation Protein language models (pLMs) capture evolutionary sequence constraints but are limited in modeling underrepresented functional classes due to training data imbalance. Metalloproteins constitute a fundamental but sparsely represented class in sequence databases. We therefore assess whether structure-conditioned synthetic sequences can be used to specialize pLMs toward metal-binding functionality. Results We fine-tuned the generalist model ProtGPT2 on synthetic sequences generated by the inverse-folding model ProteinMPNN, constructing training sets with controlled variation in size and diversity. Fine-tuning increased recovery of canonical metal-binding motifs from 43% in the baseline model to 91% in the fine-tuned models. Generated sequences retained high predicted structural confidence and structural similarity to known folds, despite low sequence identity. Analysis of latent representations from ProtGPT2 indicated that fine-tuned models occupy distinct regions of embedding space relative to both the baseline model and structure-conditioned sequences, consistent with partial incorporation of structural constraints while preserving sequence diversity. A multi-step filtering pipeline applied to sequences lacking canonical motifs identified candidate metal-binding sites in four-helical bundle topologies not detected in a non-redundant subset of Protein Data Bank structures or in AlphaFold-predicted proteomes. Availability and implementation Code, trained models, and datasets are available at: https://doi.org/10.5281/zenodo.18672158 and https://huggingface.co/gsgueglia.
bioinformatics2026-05-05v1Revealing the Hidden Landscape of Public Metabolomics Data Reuse in MetaboLights
Karaman, I.; Payne, T.; Vizcaino, J. A.Abstract
Public data reuse is a key driver of progress in omics sciences, including, increasingly, metabolomics. In this study, we present a validated analysis of confirmed reuse of datasets from the MetaboLights data repository, one of the leading resources in the field. Candidate publications were collected via dataset identifiers (MTBLS#) using a Python-based retrieval pipeline across major publisher databases. They were then manually validated to distinguish active reuse from citation-only mentions. Overall, 272 unique publications were confirmed to have reused at least one MetaboLights dataset. Reuse is dominated by Method/Tool Development, with smaller contributions from Secondary Biological Analysis and Data Integration/Meta-analysis. LC-MS datasets account for the majority of reuse, whereas NMR and GC-MS also contribute but at a lower level. Data reuse has increased over time, with a noticeable acceleration in the most recent years. At the dataset level, reuse follows a long-tail distribution, where a small subset of datasets accounts for repeated reuse, mainly as community benchmarks. These results provide a conservative estimate of public metabolomics data reuse and show that public datasets are predominantly used for methodological and computational applications. They also indicate that reuse is under-detected when dataset identifiers are not consistently reported, highlighting the need for standardised dataset citation to improve traceability and recognition of reuse.
bioinformatics2026-05-05v1Machine learning approaches for the identification and analysis of enterotoxin genes in Staphylococcus aureus genomes
Uttin, A.; Leggett, R.; Moulton, V.; Dicks, J.Abstract
Staphylococcus aureus produces a broad range of enterotoxins that act as superantigens, disrupting host immune responses and resulting in a myriad of clinical symptoms. However, large-scale analyses determining enterotoxin gene diversity, lineage structure and isolate metadata remain scarce. We analysed 15,887 S. aureus RefSeq genomes using a machine learning pipeline combining profile Hidden Markov Model-based enterotoxin gene identification, lineage typing, gene profile-based strain clustering and association rule mining using a broad range of gene and metadata features. This approach identified 35 distinct enterotoxin genes and five variant forms, including two putative novel enterotoxin genes, sel34 and sel35. HDBSCAN clustering distinguished 45 enterotoxin gene profile groups, revealing strong associations between the two major egc enterotoxin gene cluster variants (OMIWNG and OMIUNG) and Clonal Complex membership: CC5, CC22 and CC45 with OMIWNG; CC30 and CC121 with OMIUNG. Integration of isolate metadata exposed distinct geographic and temporal trends, including a recent rise in non-egc lineages derived from Asia and animal sources. These findings show that S. aureus enterotoxin diversity is structured by lineage, mobile genetic element composition and Clonal Complex association. The discovery of sel34 and sel35, together with the comprehensive overview of lineage-specific enterotoxin profiles, expands current understanding of S. aureus virulence evolution and provides a scalable analytical framework for monitoring toxin gene dynamics in clinical and environmental populations.
bioinformatics2026-05-05v1Celldetective: an AI-enhanced image analysis tool for unraveling dynamic cell interactions
Torro, R.; Diaz Bello, B.; El Arawi, D.; Dervanova, K.; Ammer, L.; Dupuy, F.; Chames, P.; Sengupta, K.; Limozin, L.Abstract
Analysis of multimodal and multidimensional data capturing dynamic interactions between diverse cell populations is a current challenge in bioimaging, especially in the context of immunology and immunotherapy research. Here, we introduce Celldetective, an open-source Python-based software tool designed for high-performance, end-to-end analysis of image-based in vitro immune and immunotherapy assays. Celldetective is purpose-built for multicondition, 2D multi-channel time-lapse microscopy of mixed cell populations. Although it is optimised for the needs of immunology assays, it is nevertheless broadly applicable to any biological system involving interacting cell populations. The software seamlessly integrates AI-based segmentation, tracking, and automated single-cell event detection, all within an intuitive graphical interface that supports interactive visualisation, annotation, and training options. We showcase its capabilities with original datasets of single immune effector cell interactions with an activating surface mediated by bispecific antibodies, and pairwise interactions in antibody-dependent cell cytotoxicity events.
bioinformatics2026-05-04v4SenNet Portal: Build, Optimization and Usage
Borner, K.; Blood, P. D.; Silverstein, J. C.; Ruffalo, M.; Satija, R.; Gehlenborg, N.; Honick, B.; Bueckle, A.; Jain, Y.; Qaurooni, D.; Shirey, B.; Sibilla, M.; Metis, K.; Bisciotti, J.; Morgan, R. S.; Betancur, D.; Sablosky, G. R.; Turner, M. L.; Kim, S.-J.; Lee, P. J.; Bartz, J.; Domanskyi, S.; Peters, S. T.; Enninful, A.; Farzad, N.; Fan, R.; SenNet Team, ; Herr, B. W.Abstract
Cellular senescence is a hallmark of aging and a driver of functional decline across tissues, yet its heterogeneity and context dependence have limited systematic study. The Common Fund's Cellular Senescence Network (SenNet) Program addresses this challenge by generating multimodal, multi-tissue datasets that profile senescent cells across the human lifespan and complementary mouse models. The SenNet Data Portal (https://data.sennetconsortium.org) serves as the public gateway to these resources, providing open access to harmonized single-cell, spatial, imaging, transcriptomic, and proteomic data; senescence biomarker catalogs; and standardized protocols that can be used to comprehensively identify and characterize senescent cells in mouse and human tissue. As of April 2026, the portal hosts 2,041 publicly available human and mouse datasets across 15 organs using 6 general assay types. Experts from 13 Tissue Mapping Centers (TMCs) and 12 Technology Development and Application (TDAs) components contribute tissue data, analyze data, identify senescent biomarkers, and agree on panels for cross-tissue antibody harmonization. They also register human tissue data into the Human Reference Atlas (HRA) and develop user interfaces for the multiscale and multimodal exploration of this data. Built on a scalable hybrid cloud microservices architecture by the Consortium Organization and Data Coordinating Center (CODCC), the Portal enables data submission, management, integrated analysis, spatial context mapping, and harmonized access to cross-species data critical for aging research. This paper presents user needs, the Portal's architecture, data processing workflows, and senescence-focused analytical tools; usage scenarios illustrating applications in biomarker discovery, quality benchmarking, hypothesis generation, spatial analysis, cost-efficient profiling, and cell distance distribution analysis; and utility and usage by the larger researcher community. 
Current limitations and planned extensions, including expanded spatial-omics releases and improved tools for senotype characterization, are discussed. SenNet protocols, code, and user interfaces are freely available on https://docs.sennetconsortium.org/apis.
bioinformatics2026-05-04v2Species-specific transformer models of bacterial gene order and content for genomic surveillance tasks
Horsfield, S. T.; Wiatrak, M.; McInerney, J. O.; Bentley, S. D.; Colijn, C.; Lees, J. A.Abstract
Transformer models enable functionally meaningful representation of complex biological data, such as nucleotide or protein sequences. Existing foundation transformer models are trained on large multi-domain corpuses of unlabelled DNA or protein data, showing unmatched task generalisation. However, these foundation models are often outperformed on domain-specific tasks by models trained on taxonomically-constrained data, such as gene classification in prokaryotes. By extension, species-specific transformer models hold promise for targeted analyses, given sufficient training data are available. Epidemiological analysis of bacterial pathogens exemplifies the use-case of species-specific transformers, due to the wealth of genome data available, coupled with pathogen-specific analyses carried out during routine and outbreak surveillance. Here, we trained a transformer model, PanBART, on the gene content and gene order of two important and biologically distinct bacterial pathogens, Escherichia coli and Streptococcus pneumoniae, benchmarking against state-of-the-art non-transformer approaches for genomic epidemiology. We show PanBART learns representations of population structure in an unsupervised manner, and can be used to accurately assign genomes to biologically-meaningful sequence clusters. PanBART is also able to identify emergent lineages, differentiating them from pre-existing lineages, and can accurately predict genomes likely to uptake genes involved in antibiotic resistance before a transfer event has occurred. Finally, PanBART can be used to conduct co-selection analysis to identify pairs of genes likely to be found together. Our work demonstrates that species-specific transformer models can be employed in many critical public health scenarios. We lay the groundwork for wider application of such models in epidemiological analysis, and provide scenarios where such models excel.
bioinformatics2026-05-04v2Semi supervised GAN for smart microscopy, fast and data efficient cell cycle classification
Manick, R.; El Habouz, Y.; Guillout, M.; Martin, C.; Bonnet, J.; Ruel, L.; Pastezeur, S.; Chanteux, O.; Bouchareb, O.; Tramier, M.; Pecreaux, J.Abstract
Modern optical microscopes are fully motorised; however, transforming them into truly smart systems requires real-time adjustment of acquisition settings in response to detected objects and dynamic biological events. At the core are classification algorithms that commonly depend on customised software and are generally designed for narrowly defined biological applications. In addition, they often require substantial annotated datasets for effective training. We introduce a semi-supervised generative adversarial network (SGAN) for robust cell-cycle stage classification under low-resource conditions, adaptable to diverse cellular structures. The framework combines unlabelled microscopy images with synthetically generated samples to mitigate limited annotation, while preserving stable performance even when the unlabelled subset is class-imbalanced. Tested on the Mitocheck dataset, which features five mitosis classes, the model achieved 93 ± 2% accuracy using only 80 labelled images per class and 600 unlabelled images. The proposed algorithm is generic and can be readily adapted to new labelling schemes, classification targets, cell lines, or microscopy modalities through transfer learning. SGAN is well suited for integration into automated microscopes, enabling efficient and adaptable image analysis across diverse biological and microscopy applications.
bioinformatics2026-05-04v2MetaUmbra: Statistically Controlled Genome-Level Presence Inference from Metaproteomic Peptides
Wu, Q.; Ning, Z.; Zhang, A.; Cheng, K.; Figeys, D.Abstract
Taxonomic interpretation of metaproteomic peptides remains difficult because many peptide sequences are present in proteins from different organisms, reducing taxonomic specificity. Current peptide-centric workflows can report taxonomic summaries or taxon-level confidence scores, but they do not provide formal statistical evidence that a taxon is present in the microbiome. Here we present MetaUmbra, a tool that derives genome-level statistical significance values from identified peptides. MetaUmbra builds theoretical peptide lists by in silico digestion of the taxon-specific proteins and matches observed peptides against these references. It then combines a conservative significance estimate from unique peptides with a Monte Carlo-based p-value for shared peptide evidence estimated under an empirical null model. In the defined community benchmark SIHUMIx, MetaUmbra identified the expected genomes without introducing false-positive genomes after embedding the SIHUMIx genomes in a large gut reference background. In the single-strain benchmark Mix24X, all expected genomes were identified with the best statistical significance even after near-neighbor and full background expansion. In a hamster gut genome panel, MetaUmbra further preserved an interpretable ranking of candidate genomes in a dense real-data setting. Together, these results show that MetaUmbra can statistically identify the presence of specific microbes in a complex microbiome while maintaining low false-positive calls. MetaUmbra therefore provides a practical framework for converting peptide evidence into genome-level statistical inference in metaproteomics.
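As a generic illustration of a Monte Carlo p-value under an empirical null, the sketch below uses the standard add-one estimator. It is not MetaUmbra's actual procedure: the function name, the choice of test statistic, and the background counts are all hypothetical, and the abstract does not specify how the null model is built.

```python
import random
from typing import Callable


def monte_carlo_p_value(
    observed: float,
    draw_null: Callable[[random.Random], float],
    n_draws: int = 10_000,
    seed: int = 0,
) -> float:
    """Estimate P(null statistic >= observed) by simulation.

    Uses the (hits + 1) / (n_draws + 1) estimator, so the returned
    p-value is never exactly zero and stays conservative.
    """
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_draws) if draw_null(rng) >= observed)
    return (hits + 1) / (n_draws + 1)


# Hypothetical empirical null: shared-peptide match counts observed for
# background genomes assumed to be absent from the sample.
background_counts = [0, 1, 2, 3]
p_strong = monte_carlo_p_value(
    10, lambda rng: rng.choice(background_counts), n_draws=999
)
p_weak = monte_carlo_p_value(
    0, lambda rng: rng.choice(background_counts), n_draws=999
)
```

With 999 draws the smallest attainable p-value is 1/1000, so the number of draws sets the resolution of the test; more draws are needed when many genomes are tested and a multiple-testing correction is applied.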
bioinformatics2026-05-04v1Integrated transcriptomic and proteomic analyses identify novel biomarkers of bladder outlet obstruction
Bigger-Allen, A. A.; Das, B.; Tang, Y.; Costa, K.; Ocampo, G.-L.; Hashemi Gheinani, A.; DiMartino, S.; Kaull, J.; Froehlich, J.; Lee, R. S.; Adam, R.Abstract
Bladder outlet obstruction leads to pathological remodeling and emergence of lower urinary tract symptoms. Although relief of obstruction is associated with symptomatic improvement, it is not universally successful, reflecting persistent alterations in the bladder. Reliable surrogate biomarkers of obstruction are lacking, particularly early in the disease course before irreversible damage to the bladder may have occurred. In this study, re-analysis of publicly available transcriptomic datasets from diverse rodent models of obstruction identified tissue transcripts including Cthrc1, Grem1, Ltbp2 and Msn that were induced in response to injury. Candidate markers were validated experimentally in an independent model of neurogenic obstruction demonstrating time-dependent changes. Candidate markers were also attenuated with either surgical removal of obstruction or treatment with anticholinergic medication or inosine. Integrated analysis of tissue transcriptomics data and tissue and urine proteomics data from a model of neurogenic obstruction revealed significant concordance between markers observed in tissue and urine. Urinary proteomics analysis identified a statistically significant increase in MSN in patients with neurogenic bladder compared to unaffected controls. These findings identify tissue and urine biomarkers of both non-neurogenic and neurogenic obstruction that may reflect early changes in obstructive uropathy that could be monitored in a non-invasive manner.
bioinformatics2026-05-04v1spatiAlytica: Viewer-Grounded Multimodal Agentic System for Interactive Spatial Omics Analysis
Das, A.; Zhang, K.; Song, J.; Han, M.; Chen, A.; Meng, W.; Galloway, H.; Chen, P.-Y.; Jo, S.; Liu, Z.; Hasib, M. M.; Officer, A.; Sinha, H.; Chiu, Y.-C.; Gao, S.-J.; Li, L.; Huang, Y.Abstract
Spatial transcriptomics and proteomics map tissue architecture and cellular interactions, but analysis remains limited by programming demands and text-centered AI agents that lack viewer grounding and cross-turn context. We present spatiAlytica, a viewer-centric multimodal interactive agentic system embedded in the Napari viewer that enables non-programmer biologists to perform iterative, hypothesis-driven spatial omics analysis via natural language. spatiAlytica couples viewer-state serialization, agentic memory, biological concept-to-data-field mapping, code generation and debugging, Spatial VQA, and grounded interpretation to support an exploratory analysis and interpretive reasoning workflow. We introduce spatiAlyticaBench, a comprehensive benchmark spanning 222 single-turn spatial analytical coding questions, 178 multi-turn sequential workflow questions, and 7,350 image-grounded reasoning questions. spatiAlytica outperformed strong agentic baselines while using less time and fewer tokens. Case studies across Kaposi's sarcoma, colorectal cancer, and ovarian cancer recapitulated known spatial patterns and uncovered progressive CD8 T-cell dysfunction during KS progression.
bioinformatics2026-05-04v1FastDedup - A fast and memory-efficient tool for read deduplication
Ribes, R.; Mandier, C.; Baniel, A.Abstract
PCR duplicate removal is a critical first step in high-throughput sequencing pipelines, yet existing tools struggle with speed, memory, or correctness at modern dataset scales. We present FastDedup, a Rust-based FASTX deduplicator that transforms each read or read pair to a compact xxh3 hash fingerprint, drastically reducing memory usage and binding most of the execution time to disk I/O. Benchmarked against six competing tools on synthetic human WGS datasets up to 300 million reads, FastDedup consistently leads on paired-end data, running more than 10 times faster than fastp. It also outperforms all tools on uncompressed single-end data, deduplicating a million reads in a second. We additionally report correctness failures in prinseq++ and clumpify. FastDedup is available under the MIT License via GitHub, Bioconda, and Cargo.
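The fingerprint-based approach described above can be sketched in a few lines. This is an illustrative Python sketch, not FastDedup's Rust implementation: the function names are hypothetical, and BLAKE2b from the standard library stands in for xxh3 purely so the example has no third-party dependencies.

```python
import hashlib
from typing import Iterable, Iterator, Optional, Set, Tuple

Read = Tuple[str, str]  # (header, sequence)


def fingerprint(seq1: str, seq2: Optional[str] = None) -> int:
    """Return a 64-bit fingerprint of a read or a read pair.

    Assumption: BLAKE2b with an 8-byte digest replaces xxh3 here.
    """
    h = hashlib.blake2b(digest_size=8)
    h.update(seq1.encode("ascii"))
    if seq2 is not None:
        h.update(b"\x00")  # separator so ("AC", "G") differs from ("A", "CG")
        h.update(seq2.encode("ascii"))
    return int.from_bytes(h.digest(), "big")


def dedup(reads: Iterable[Read]) -> Iterator[Read]:
    """Yield the first occurrence of each distinct sequence.

    Only fixed-size fingerprints are kept in memory, never the
    sequences themselves.
    """
    seen: Set[int] = set()
    for header, seq in reads:
        fp = fingerprint(seq)
        if fp not in seen:
            seen.add(fp)
            yield header, seq


reads = [("r1", "ACGT"), ("r2", "ACGT"), ("r3", "TTTT")]
unique_headers = [header for header, _ in dedup(reads)]
```

Storing a fixed-size fingerprint per read keeps memory use independent of read length. The trade-off is that two different sequences could in principle collide on the same fingerprint and one would be dropped; with 64-bit hashes this is rare but not impossible at very large read counts.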
bioinformatics2026-05-04v1Radiant DIA: A Fast, Sensitive, and Accurate Search Engine for Quantitative Proteomics
Just, S.; Cantrell, L. S.; Nichols, A.; Wang, J.; Kis, J.; Mohtashemi, I.; Platt, T.; Farokhzad, O.; Batzoglou, S.Abstract
In mass spectrometry-based proteomics, robust and efficient search engines are essential for accurate peptide and protein identification and quantification. Advances in sample preparation and instrumentation have increased the demand for highly scalable processing tools, with datasets comprising hundreds or thousands of samples in single-cell and population studies. Here we present Radiant DIA, a novel Data-Independent Acquisition search engine which achieves 4x faster processing and 10x lower cloud compute costs for large experiments while ensuring rigorous control of false discovery rate (FDR) and maintaining similar sensitivity, precision, and quantitative accuracy. The Radiant DIA search engine is paired with a modular pipeline deployable on cloud and desktop environments comprising individual modules for distributed re-scoring, FDR estimation, protein inference and quantification. Unlike traditional monolithic applications, this architecture enables high-performance, cloud-scale analysis without sacrificing local usability. Together, the Radiant DIA and Fulcrum Pipeline tools enhance computational efficiency to facilitate biological discovery in large-scale proteomics, as demonstrated by analyses of real-world experiments up to thousands of MS acquisitions.
bioinformatics2026-05-04v1MeiCOfi: Meiotic CrossOver Finder in haploid, diploid, polyploid and hyper-recombinant genomes
Fuentes, R. R.; Fernandes, J. B.; Susanto, T.; Wang, Y.; Underwood, C. J.Abstract
During the meiotic cell division, homologous chromosomes pair and recombine, leading to large reciprocal exchanges of genetic information. In most species, meiotic crossovers (COs) are crucial for normal chromosome segregation and they generate genetic diversity, which can be acted upon by natural selection in wild populations or by breeders to combine desirable traits in a genome. Identifying the position and frequency of COs is therefore essential in both classical genetics studies and breeding programmes. However, a computational tool capable of accurately detecting COs across diverse contexts, including varying marker densities, genome size and structure, recombination rate, and ploidy, remains lacking. We developed MeiCOfi (Meiotic CrossOver Finder) to detect meiotic crossover events at high-resolution from low-coverage genome sequencing data. We evaluated it using data from Arabidopsis thaliana, rice, barley and both intra- and inter-specific tomato hybrids, encompassing a wide range of genome complexities and marker densities. It reliably detects crossovers in hyper-recombinant A. thaliana with up to 62 CO per backcross offspring and in haploid gametes from barley with sequencing coverage as low as 0.1x. It can identify crossovers in polyploid genomes, including simulated recombinant tetraploids and also real data from tetraploid tomato hybrid offspring. Our results demonstrate that MeiCOfi can robustly identify crossovers in diverse genomic contexts.
bioinformatics2026-05-04v1Robust identification of cell-cell communication heterogeneity in single cells
Bocci, F.; Jia, Y.; Atwood, S.; Nie, Q.Abstract
Communication between cells modulates cell fate decisions by relaying information across tissues and inducing intracellular responses mediated by gene regulatory networks. Inference of cell-cell communication from high-throughput data such as single-cell transcriptomics is gaining popularity due to high data availability and the ease of automating modeling over hundreds of signaling pathways. How cell-cell communication operates across biological scales and influences cell fate decisions, however, remains a major open question. Here, we present scRICH, a framework and package that integrates mechanism-based, multiscale mathematical modeling with learning strategies to capture the complexity of cell-cell communication from single-cell and spatial transcriptomics data. scRICH unravels the heterogeneity of communication behavior within cell types, links cell-cell communication to cell fate decisions by incorporating dynamical information of RNA splicing, and connects the scales of cell-cell interactions and intracellular response by constructing multilayer regulatory networks. We validate scRICH with new experiments on EGF ligand/receptor co-expression in keratinocytes from skin-equivalent organoids, and compare these computational predictions against existing CCC inference methods. Applying scRICH to multiple biological scenarios demonstrates its ability to capture emerging relations between distinct cell-cell communication pathways, interactions at the onset of cell fate decisions, and emerging trends in cell-cell communication along cell lineages and in space.
bioinformatics2026-05-04v1DPLM: Dynamics-aware Protein Language Model via contrastive learning between sequence and molecular dynamics simulation trajectory
Jiang, Y.; Wang, D.; Imam, I. A.; Xu, D.; Shao, Q.Abstract
Protein dynamics play a critical role in protein function, yet such important information is missing in many protein language models (PLMs). We introduce DPLM, a dynamics-aware protein language model that aligns sequence embeddings with molecular dynamics (MD) trajectory embeddings via contrastive learning. Using MD features encoded by a pretrained video model, DPLM learns sequence representations that correlate with residue-level flexibility and improve protein-level functional clustering compared to static sequence- and structure-based PLMs. Without task-specific training, DPLM outperforms ESM-based representations in zero-shot mutation-effect prediction on multiple deep mutational scanning datasets. When adapted with lightweight task-specific heads, DPLM further achieves top-tier performance on protein stability prediction and intrinsic disorder region identification, demonstrating that contrastive alignment with MD trajectories enables PLMs to capture biologically meaningful dynamic properties.
bioinformatics2026-05-04v1AI-guided discovery of atypical protein assemblies
Toghani, A.; Seager, B. A.; Sugihara, Y.; Roijen, L.-M.; Azcue, J. M.; Garro, M.; Sargolzaei, M.; Morianou, I.; Harant, A.; Gallop, S.; Kourelis, J.; MacLean, D.; Contreras, M. P.; Kamoun, S.; Lüdke, D.Abstract
Artificial intelligence (AI) systems such as AlphaFold have transformed structural biology by enabling accurate prediction of protein structures. However, their capacity to uncover new classes of macromolecular assemblies remains largely untapped. We developed the Structural Novelty Index (SNI), a quantitative framework for identifying protein complexes that diverge from canonical architectures. As one implementation of SNI, we developed SNINRC-Hexa, to identify unconventional resistosomes formed by nucleotide-binding, leucine-rich repeat immune receptors (NLRs). We used it to analyze AlphaFold 3 models of 637 non-redundant NRC proteins from 346 genomes representing 85 plant species. This analysis identified candidates with predicted architectures distinct from the canonical hexameric resistosomes of NRC proteins. Biochemical purification and negative-stain transmission electron microscopy of NRC7 orthologs from multiple species supported the SNI prediction and revealed an unexpected undecameric (11-mer) assembly. Our results establish SNI as a scalable approach for discovering atypical protein complexes.
bioinformatics2026-05-04v1Automatic Bevacizumab Response Prediction in Ovarian Cancer from Digital Pathology Images via Novel AI-based Computational Pipeline
Alsaiari, A.; Turki, T.; Taguchi, Y.-h.Abstract
Ovarian cancer is one of the gynecological cancer types, which, if metastasized and not detected early, can cause deaths among women. Therefore, there is a need to accurately predict drug responses in ovarian cancer. A gynecological pathologist inspects abnormality in tissues, followed by providing a report about patients; however, such a diagnostic process (1) is hard; (2) requires experience; and (3) is time-consuming. Moreover, existing tools are far from perfect. Hence, we present a computational pipeline to improve the prediction of drug response in ovarian cancer, derived as follows. First, we downloaded digital pathology images pertaining to ovarian bevacizumab response from the cancer imaging archive repository. We applied the histogram of oriented gradients to the images to construct feature vectors, which were passed to Fisher linear discriminant analysis to change the representation through dimensionality reduction. Then, we provided the reduced-dimensionality data to support vector regression coupled with various kernels and calculated the area under the ROC curve (AUC). Experimental results against transformer-based models (ViT and Swin) and other deep learning (DL) models (VGG16, ResNet50, InceptionV3, MobileNetV2, and EfficientNetB6) demonstrate that our approach with a radial kernel (named SVRD+R) yielded an AUC performance improvement of 17% over the best-performing transformer-based model (ViT) and an AUC performance improvement of 14.9% over the best DL-based model (MobileNetV2). These results demonstrate the superiority and feasibility of our AI-based pipeline when tackling prediction problems pertaining to gynecologic cancer studies.
bioinformatics2026-05-04v1Automated Multimodal Correlative Registration for Organelle-Specific Molecular Imaging
Lu, C.; ZHAO, K.; Cui, D.; Chen, G.; Yang, Q.; Yang, H.; Zhao, M.; Song, K.; Nikan, M.; Li, Z.; Zhao, S.; Cen, J.; Qiu, X.; Young, S.; Bennett, C. F.; Seth, P.; Chen, K.; Qi, X.; Jiang, H.Abstract
Mapping subcellular drug distribution is essential for understanding trafficking and off-target effects. NanoSIMS enables chemical imaging of labeled therapeutics, but signal interpretation requires ultrastructural correlation with electron microscopy, a manual and laborious process. We present an automated AI-driven pipeline for correlating chemical and ultrastructural images, enabling multiscale, organelle-precise imaging of molecules in cells and tissues. The method integrates bidirectional optical flow, confidence-guided affine transformation, and automated template matching for cross-scale EM alignment. Morphology-rich ion channels (e.g., 32S) estimate transformations that propagate to sparse therapeutic signals (e.g., 79Br, 15N), overcoming low signal-to-noise challenges. We validate this framework across diverse cell and tissue types, tracking oligonucleotide and antibody therapeutics in vitro and in vivo to reveal cell-type- and organelle-specific distribution patterns. This work establishes a generalizable platform for automated multimodal registration and organelle-resolved subcellular pharmacology.
bioinformatics2026-05-04v1PDBe-SIFTS: an open-source tool for Structure Integration with Function, Taxonomy, and Sequences, featuring improved alignment, scoring scheme, and accelerated search
Bellaiche, A.; Choudhary, P.; Nair, S.; Harrus, D.; Yu, C. W.-H.; Tanweer, S. A.; Evans, G. L.; Lo, S. W.; Martin, M.; Fleming, J. R.; Velankar, S.Abstract
Structure Integration with Function, Taxonomy and Sequences (SIFTS) provides residue-level mappings between UniProt Knowledgebase sequences and Protein Data Bank structures and has historically been generated through internal Protein Data Bank in Europe (PDBe) pipelines. Here, PDBe-SIFTS is presented as a fully open-source, locally deployable implementation of this mapping framework. The pipeline combines fast, scalable sequence search using MMseqs2, an improved bounded scoring scheme for ranking candidate mappings, and residue-level mapping refinement based on backbone connectivity. PDBe-SIFTS is distributed as a Python package with command-line tools for 1) building a sequence search database, 2) identifying the best sequence-structure match, 3) one-to-one mapping at the residue level, and 4) generating SIFTS annotations in PDBx/mmCIF format. Benchmarking on the complete Protein Data Bank archive showed that MMseqs2 reduced archive-scale UniProtKB searches from hours with BLASTP to minutes, approximately 22-36 times faster, while curated mappings were recovered at top rank in 93.1% of cases. The remaining discrepancies mainly involved biologically ambiguous cases such as highly conserved proteins, chimeric constructs, or closely related orthologs. These results show that PDBe-SIFTS enables fast mapping, improving structural coherence in residue-level alignments while delivering the most up-to-date and accurate mappings, comparable to expert curation. Tool: https://github.com/PDBeurope/SIFTS Quick start notebook with example: https://github.com/PDBeurope/SIFTS/tree/master/notebooks
bioinformatics2026-05-04v1AnnotateMissense: a genome-wide annotation and benchmarking framework for missense pathogenicity prediction
Muneeb, M.; Ascher, D. B.Abstract
Missense variant interpretation remains challenging because pathogenicity depends on heterogeneous evidence from population frequency, evolutionary conservation, transcript context, amino acid substitution severity, prior pathogenicity predictors and protein-language-model-derived features. We present AnnotateMissense, a scalable annotation, benchmarking and genome-wide prediction framework for missense variant interpretation. AnnotateMissense integrates hg38 missense variants derived from dbNSFP v5.1 with ANNOVAR annotations, dbNSFP transcript/protein descriptors, AlphaMissense scores, ESM-derived features, conservation metrics, population-frequency variables, established pathogenicity predictors and engineered amino acid/codon-context features. Using 132,714 ClinVar-labelled missense variants, we benchmarked machine-learning and deep-learning models under controlled feature configurations. The full 303-feature benchmark set achieved the strongest performance with XGBoost, reaching mean MCC = 0.9411 and ROC-AUC = 0.9950 across stratified five-fold cross-validation. Restricted naive and location-oriented feature sets achieved lower best MCC values of 0.4989 and 0.5113, respectively. Circularity-controlled ablations showed that removing prior-predictor, population-frequency and clinically overlapping evidence reduced performance, whereas excluding AlphaMissense and ESM-derived features alone had minimal effect. Temporal ClinVar validation on newly observed pathogenic/benign variants achieved MCC = 0.7613, accuracy = 0.8798 and F1-score = 0.8750. The final model was applied to 90,643,830 hg38 missense variants to generate AnnotateMissense pathogenicity scores and binary prediction labels. Code and outputs are available at https://github.com/MuhammadMuneeb007/CAGI7_Annotate_All_Missense and https://doi.org/10.5281/zenodo.19981867.
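For readers unfamiliar with the headline metric, the Matthews correlation coefficient (MCC) reported above can be computed from the four counts of a binary confusion matrix. The following is a minimal sketch, not the authors' evaluation code; the function name and the convention of returning 0.0 when the denominator vanishes are assumptions made for illustration.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC from binary confusion-matrix counts (illustrative helper)."""
    # Product of the four marginals: predicted/actual positives and negatives.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: report 0.0 when any marginal is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Hypothetical counts, not taken from the paper.
    print(round(matthews_corrcoef(tp=90, tn=85, fp=15, fn=10), 4))
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when pathogenic and benign labels are imbalanced.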
bioinformatics2026-05-04v1Clustering Strategies Improve Structure-Preserving Visualization of Single-Cell RNA-seq Data with CBMAP
Alchaar, M.; Dogan, B.Abstract
Dimensionality reduction for visualization is a fundamental step in single-cell RNA sequencing (scRNA-seq) analysis due to the extremely high dimensionality of gene expression profiles. However, widely used nonlinear embedding techniques such as UMAP and t-SNE can introduce substantial distortions when projecting data into two-dimensional space, potentially altering global organization, local neighborhoods, and distance relationships in ways that may mislead downstream biological interpretation. In this study, we investigate the applicability of Clustering-Based Manifold Approximation and Projection (CBMAP) for the visualization of scRNA-seq data and systematically examine how clustering strategies influence the quality of the resulting embeddings. CBMAP was integrated with several clustering algorithms commonly used in single-cell analysis, including k-means, Leiden, HDBSCAN, Secuer, HGC, and FlowSOM. The resulting embeddings were evaluated using quantitative metrics that measure global, local, and distance-level structure preservation and were compared with widely used dimensionality reduction methods such as UMAP, t-SNE, and PaCMAP across multiple benchmark datasets. Our results demonstrate that the clustering stage plays a critical role in determining the structural fidelity of CBMAP embeddings. Clustering algorithms specifically designed for single-cell transcriptomic data, particularly Secuer, produced more consistent preservation of global relationships between cell populations. Across multiple datasets, CBMAP more faithfully preserved global structural organization and inter-population distance relationships than the compared methods, although local neighborhood preservation was generally weaker than in techniques optimized for local structure. Importantly, CBMAP embeddings retained biologically meaningful relationships in trajectory benchmark datasets. 
When combined with RNA velocity analysis, CBMAP successfully preserved cyclic progenitor states and branching differentiation trajectories, demonstrating compatibility with trajectory-aware visualization. These findings indicate that CBMAP provides a structure-faithful visualization framework for scRNA-seq data and that clustering selection plays a central role in determining embedding quality.
bioinformatics2026-05-04v1DoFormer: Causal Transformer for Gene Perturbation
Karbalayghareh, A.; Paull, E.; Califano, A.Abstract
Learning causal gene regulatory mechanisms from single-cell data, and thereby predicting the effects of unseen perturbations, remains challenging. Observational RNA-seq data alone is insufficient for causal modeling, whereas perturbational data is essential. Classical causal inference methods often rely on unrealistic directed acyclic graph (DAG) assumptions and are not well suited to integrating multimodal data. Current transcriptomic foundation models also typically treat observational and perturbational data identically, limiting their ability to model perturbations. We present DoFormer, a causal multimodal Transformer that makes no DAG assumptions and leverages rich perturbational data to accurately predict previously unseen perturbations. DoFormer enables principled in silico perturbations by adapting the causal do-operator within the attention mechanism: the perturbed gene is set to the intervention value and prevented from attending to other genes, allowing the model to fully distinguish observational from interventional regimes. We train DoFormer using biologically informed loss functions and evaluate it with comprehensive perturbation prediction metrics. DoFormer substantially improves perturbation prediction relative to baseline and prior foundation models, underscoring the importance of intervention-aware architectures and biologically grounded objectives for causal modeling in single-cell genomics.
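The do-operator idea described above (clamp the perturbed gene to its intervention value and stop it from attending to other genes) can be illustrated with a toy attention step. This is a sketch under stated assumptions, not DoFormer's implementation: scalar gene values, a single attention head, precomputed scores, and the function names `do_attention_weights` and `apply_intervention` are all hypothetical.

```python
import math


def do_attention_weights(scores, perturbed):
    """Row-wise softmax; the perturbed gene's row is masked to attend only to itself."""
    weights = []
    for i, score_row in enumerate(scores):
        # Mask: for the perturbed gene, every off-diagonal score becomes -inf.
        row = [s if (i != perturbed or j == i) else -math.inf
               for j, s in enumerate(score_row)]
        m = max(row)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in row]
        z = sum(exps)
        weights.append([e / z for e in exps])
    return weights


def apply_intervention(values, weights, perturbed, do_value):
    """Clamp the perturbed gene's value, then mix values with the attention weights."""
    v = list(values)
    v[perturbed] = do_value  # the do-operator: set, rather than observe, the value
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


if __name__ == "__main__":
    scores = [[0.0, 1.0, 2.0], [1.0, 1.0, 1.0], [2.0, 0.0, 1.0]]
    w = do_attention_weights(scores, perturbed=1)
    print(apply_intervention([1.0, 2.0, 3.0], w, perturbed=1, do_value=5.0))
```

In this sketch the other genes can still attend to the perturbed gene, so the intervention propagates downstream while nothing flows back into the clamped gene.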
bioinformatics2026-05-04v1Do Larger Models Really Win in Drug Discovery? A Benchmark Assessment of Model Scaling in AI-Driven Molecular Property and Activity Prediction
Guo, J.Abstract
The rapid growth of molecular foundation models and general-purpose large language models has encouraged a scale-centric view of artificial intelligence in drug discovery, in which larger pretrained models are expected to supersede compact cheminformatics models and task-specific graph neural networks (GNNs). We test this assumption on 22 molecular property and activity endpoints, including public ADMET and Tox21 benchmarks and two internal anti-infective activity datasets. Across 167,056 held-out task-molecule evaluations under structure-similarity-separated five-fold cross-validation (37,756 ADMET, 77,946 Tox21, 49,266 anti-TB and 2,088 antimalaria), classical machine-learning (ML) models such as RF(ECFP4) and ExtraTrees(RDKit descriptors) win ten primary-metric tasks, GNNs such as GIN and Ligandformer win nine, and pretrained molecular sequence models such as MoLFormer and ChemBERTa2 win three. Rule-based SAR reasoning baselines, represented by GPT5.5-SAR and Opus4.7-SAR, do not win under the prespecified primary metrics, although train-fold-derived SAR knowledge provides measurable but uneven gains for SAR reasoning and interpretation. These results indicate that compact, specialized models remain highly effective for molecular property and activity prediction. The performance differences among classical ML, GNN and pretrained sequence models are often modest and endpoint-dependent, whereas larger or more general models do not provide a universal predictive advantage. Large models may still add value for zero-shot reasoning, SAR interpretation and hypothesis generation, but the results suggest that predictive performance depends on the alignment among molecular representation, inductive bias, data regime, endpoint biology and validation protocol.
bioinformatics2026-05-04v1Reference-Based Library Construction Improves Performance in low-input diaPASEF Workflows
Charkow, J.; Ghaznavi, M.; Seale, B.; Peng, J.; Gingras, A.-C.; Rost, H.Abstract
In low-input mass spectrometry-based proteomics, Data Independent Acquisition (DIA), including diaPASEF, is quickly becoming the method of choice for label-free quantification. Whether using empirical or in silico spectral libraries, performance is dependent on the library; however, the optimal library construction strategy for low-input proteomics remains an open question. To address this, we examine and develop library construction approaches that are compatible with both spectrum-centric and peptide-centric analysis workflows. These approaches leverage a closely related, high-quality sample to improve library quality. First, we validated our approach in bulk sample amounts, where we observed that the effects of library construction based on gas-phase fractionation depend on the software framework, with improvements more pronounced in OpenSWATH compared to DIA-NN. In OpenSWATH, our peptide-centric library reconstruction workflow consistently outperforms a transfer learning strategy, an emerging alternative approach. In DIA-NN, trends are dependent on library source, highlighting OpenSWATH's stronger dependence on the search space. In low-input applications, such as single-cell-equivalent injection amounts (100 pg) of HeLa cell digest on a timsTOF SCP, our library construction approach provided more pronounced improvements across both software tools compared to bulk samples. Using a peptide-centric reconstruction approach with the OpenSWATH analysis framework, we detected over 15,000 peptide precursors (2480 protein groups), a 90% improvement over the original library. Furthermore, using a spectrum-centric construction approach, peptide precursor identification rates improved over 6-fold (~1000 to ~6000). Our strategy provides a practical solution for generating high-quality libraries in low-input applications.
bioinformatics2026-05-04v1An unsupervised framework for comparing SARS-CoV-2 protein sequences using LLMs
Littlefield, S. B.; Campbell, R. H.Abstract
The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic led to 700 million infections and 7 million deaths worldwide. While studying these viruses, scientists generated a large amount of sequencing data that was made available to researchers. Large language models (LLMs) are pre-trained on large databases of proteins, and prior work has shown their use in studying the structure and function of proteins. This paper proposes an unsupervised framework for characterizing SARS-CoV-2 sequences using large language models. First, we perform a comparison of several language models previously proposed by other authors. This step is used to determine how clustering and classification approaches perform on SARS-CoV-2 sequence embeddings. In this paper, we focus on surface glycoprotein sequences, also known as spike proteins in SARS-CoV-2, because scientists have previously studied their involvement in being recognized by the human immune system. Our contrastive learning framework is trained in an unsupervised manner, leveraging the Levenshtein distance from pairwise alignment of sequences when the contrastive loss is computed by the Siamese Neural Network. The final part of this paper focuses on a comparison with a previous approach on a test dataset containing data from the latter part of the pandemic. In the prediction of emerging variants, the proposed LLM-based approach shows an improvement of 0.2 in terms of the adjusted Rand index for clustering compared to a previously proposed approach. This shows the potential of applying large language models to this field.
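The Levenshtein distance mentioned above is the minimum number of single-character insertions, deletions, and substitutions needed to turn one sequence into another. The sketch below shows the standard dynamic-programming computation only; how the paper feeds this distance into its contrastive loss is not specified here, and the function name and example strings are illustrative assumptions.

```python
def levenshtein(a, b):
    """Edit distance between two sequences, using two rows of the DP table."""
    # prev[j] holds the distance between the first i-1 symbols of a and the first j of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # Hypothetical short peptide fragments, not real spike sequences.
    print(levenshtein("NITNLCPF", "NITKLCPF"))
```

Keeping only two rows makes memory linear in the length of the second sequence, which matters for spike proteins of more than a thousand residues.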
bioinformatics2026-05-03v3A 37-million-particle dataset from over 250 experiments to accelerate data-driven cryo-EM analysis
Zamanos, A.; Kyrilis, F. L.; Koromilas, P.; Kastritis, P. L.; Panagakis, Y.Abstract
Cryogenic Electron Microscopy (cryo-EM) has revolutionized structural biology by enabling near-atomic-resolution structure determination of biological macromolecules. Central to cryo-EM analysis are particles, namely 2D projections of biomolecules extracted from micrographs, which serve as the primary input for 3D reconstruction. While data-driven methods have transformed other scientific domains, their impact on cryo-EM remains limited because existing particle datasets are too small, too narrow in protein diversity, and lack rich per-particle annotations. We introduce cryoPANDA (cryo-EM Particles ANnotated DAtaset), comprising over 37 million annotated particles from 252 experiments spanning a wide range of protein types, more than 10-fold larger than prior collections. Each particle is accompanied by detailed annotations covering acquisition, classification, and reconstruction metadata, alongside the corresponding 3D electrostatic potential map, the published EMDB map, and, where available, the PDB model. We validate cryoPANDA in two ways: first, by reconstructing hundreds of distinct high-resolution cryo-EM maps; and second, by training a DINOv2 foundation model and evaluating its learned representations on micrograph segmentation, particle picking, and particle clustering.
bioinformatics2026-05-03v1MorphOTU: A universal image-based framework for delineating biodiversity discovery
Zhan, Z.; Chen, W.; Liu, X.; Yue, L.; Zhang, F.Abstract
The absence of a scalable system for organizing the vast majority of unidentified species has become a central obstacle in biodiversity science. Existing molecular and computer-vision methods rely on DNA material or closed-set labels, which hamper biodiversity quantification under the open, incomplete conditions that characterize real ecosystems. Here, we introduce morphOTU, a general image-based framework that constructs operational units of biodiversity directly from phenotype. Using morphOTU, we derive image-based OTUs across five plant and beetle datasets spanning heterogeneous imaging conditions. These units recover species-level boundaries, retain coherent structure when most species are "unseen" during training, and accurately approximate richness and Shannon diversity indices even under sparse labeling or limited sampling. Visual explanations reveal that morphOTU consistently focuses on biologically meaningful traits and captures continuous phenotypic variation. By providing a scalable and open-set framework for quantifying phenotypic diversity, morphOTU enables biodiversity assessment that includes unnamed species and unlocks the ecological value of rapidly expanding digital image repositories.
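Richness and the Shannon diversity index mentioned above can both be derived from per-specimen unit assignments. The following is a small sketch assuming each image has already been assigned an OTU label; the function name and the example labels are hypothetical, and the natural logarithm is an assumed choice of base.

```python
import math
from collections import Counter


def richness_and_shannon(labels):
    """Return (number of distinct units, Shannon index H = -sum p_i ln p_i)."""
    counts = Counter(labels)
    total = sum(counts.values())
    richness = len(counts)
    # Each unit contributes -p * ln(p), where p is its share of all specimens.
    shannon = -sum((c / total) * math.log(c / total) for c in counts.values())
    return richness, shannon


if __name__ == "__main__":
    # Hypothetical OTU assignments for six images.
    print(richness_and_shannon(["otu1", "otu1", "otu2", "otu3", "otu3", "otu3"]))
```

Richness counts units regardless of abundance, while the Shannon index also rewards evenness, so two samples with the same richness can differ in diversity.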
bioinformatics2026-05-01v1Confronting global eradication of TB head on: Uncovering the root of drug resistance and bacterial survival strategies through a comprehensive computational study of first-line TB drug resistant mutations
Pawar, P.; Samarasinghe, S.Abstract
Tuberculosis (TB) is fast becoming incurable, affecting millions globally. Mycobacterium tuberculosis (Mtb), the causative agent of TB, has evolved elusive survival strategies through point mutations in the drug targets, leading to a daunting scenario of resistance towards first-line TB drugs, exacerbated by global differences in mutation patterns. Drug resistance studies have focussed on only a few mutations; however, hundreds of mutations have been reported in the last three decades. The WHO goal of global eradication of TB therefore now requires a deep understanding of mechanisms of drug resistance, involving many mutations, addressed in a global context. This study addresses bacterial survival strategies by following bacteria-drug interaction to probe how bacteria evolve drug resistance mechanisms through mutations. We hypothesise that bacteria favour mutations that protect them from a drug while making the drug ineffective. To test the hypothesis, we quantify the impact of mutations on both bacterial function and drug binding affinity to get to the root of drug resistance, revealing how bacteria may evolve an arsenal of mutations towards an optimal survival strategy. This first comprehensive and systematic in-depth study investigates global patterns of mutation and drug resistance mechanisms from mutation data for Mtb reported over the last 30 years. These were collected for 31,073 drug-resistant Mtb isolates from 149 published studies for the four first-line drugs isoniazid (INH), pyrazinamide (PZA), rifampicin (RIF), and ethambutol (EMB). We found 821 single frequency non-synonymous mutations for INH (n=202), RIF (n=120), EMB (n=226) and PZA (n=273). We then investigated the prevalence and diversity of these mutations in the drug targets across the globe.
We found S315T in the target katG (60%) to be the most prevalent mutation in INH resistance followed by S450L in rpoB (56%) and M306V in embB (29%) associated with RIF and EMB resistance, respectively; these were also the highly occurring mutations across the six WHO regions, except for the most common mutation Q10P in pncA (1.4%) (PZA resistance; with shorter exposure to drug) showing a variable pattern of occurrence globally. We found the highest mutational burden in the Western Pacific and South-East Asia regions for INH and RIF resistance. Frequent mutations had also undergone frequent amino acid substitutions. Accordingly, we developed a comprehensive atlas of mutation spread across the globe and their evolution over the last 30 years. We then probed into the impact of mutations on TB bacteria and drug binding with a comprehensive bioinformatics analysis for understanding crucial changes caused by mutation at the molecular level affecting function and structural stability of bacteria and the drug binding affinity. We found that the most prevalent mutations occur in non-conserved areas in the drug binding region indicating a choice of a less dramatic level of change in target protein function and stability. All mutations reduced drug binding affinity. For characterising drug resistance mechanisms, we introduced a new concept of ranking drug-resistant TB mutations into lethal, moderate, mild and neutral considering the combined effect on Mtb viability and drug binding. We identified 340 mutations as lethal, 284 as moderate, 185 as mild and 12 as neutral. We observed that frequently occurring mutations occur in non-conserved regions causing a mild effect on target proteins (such as S315T of katG, S450L of rpoB and M306V in embB), while reducing drug binding affinity. 
With these we uncovered a universal strategy of drug resistance and bacterial survival: Mtb favours less harmful mutations in the drug binding region without compromising conservation while destabilising the drugs, thus striking a balance between fitness and drug resistance. This ingenious strategy seems successful and reasonable, persisting globally over three decades, and provides a holistic understanding of drug resistance and a strong foundation for designing efficacious drugs and therapies towards global eradication of TB.
bioinformatics2026-05-01v1