Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
Celldetective: an AI-enhanced image analysis tool for unraveling dynamic cell interactions
Torro, R.; Diaz Bello, B.; El Arawi, D.; Dervanova, K.; Ammer, L.; Dupuy, F.; Chames, P.; Sengupta, K.; Limozin, L.
Abstract
Analysis of multimodal and multidimensional data capturing dynamic interactions between diverse cell populations is a current challenge in bioimaging, especially in the context of immunology and immunotherapy research. Here, we introduce Celldetective, an open-source Python-based software tool designed for high-performance, end-to-end analysis of image-based in vitro immune and immunotherapy assays. Celldetective is purpose-built for multicondition, 2D multi-channel time-lapse microscopy of mixed cell populations. Although it is optimised for the needs of immunology assays, it is nevertheless broadly applicable to any biological system involving interacting cell populations. The software seamlessly integrates AI-based segmentation, tracking, and automated single-cell event detection, all within an intuitive graphical interface that supports interactive visualisation, annotation, and training options. We showcase its capabilities with original datasets of single immune effector cell interactions with an activating surface mediated by bispecific antibodies, and pairwise interactions in antibody-dependent cell cytotoxicity events.
bioinformatics 2026-05-04 v4
SenNet Portal: Build, Optimization and Usage
Borner, K.; Blood, P. D.; Silverstein, J. C.; Ruffalo, M.; Satija, R.; Gehlenborg, N.; Honick, B.; Bueckle, A.; Jain, Y.; Qaurooni, D.; Shirey, B.; Sibilla, M.; Metis, K.; Bisciotti, J.; Morgan, R. S.; Betancur, D.; Sablosky, G. R.; Turner, M. L.; Kim, S.-J.; Lee, P. J.; Bartz, J.; Domanskyi, S.; Peters, S. T.; Enninful, A.; Farzad, N.; Fan, R.; SenNet Team; Herr, B. W.
Abstract
Cellular senescence is a hallmark of aging and a driver of functional decline across tissues, yet its heterogeneity and context dependence have limited systematic study. The Common Fund's Cellular Senescence Network (SenNet) Program addresses this challenge by generating multimodal, multi-tissue datasets that profile senescent cells across the human lifespan and complementary mouse models. The SenNet Data Portal (https://data.sennetconsortium.org) serves as the public gateway to these resources, providing open access to harmonized single-cell, spatial, imaging, transcriptomic, and proteomic data; senescence biomarker catalogs; and standardized protocols that can be used to comprehensively identify and characterize senescent cells in mouse and human tissue. As of April 2026, the portal hosts 2,041 publicly available human and mouse datasets across 15 organs using 6 general assay types. Experts from 13 Tissue Mapping Centers (TMCs) and 12 Technology Development and Application (TDAs) components contribute tissue data, analyze data, identify senescent biomarkers, and agree on panels for cross-tissue antibody harmonization. They also register human tissue data into the Human Reference Atlas (HRA) and develop user interfaces for the multiscale and multimodal exploration of this data. Built on a scalable hybrid cloud microservices architecture by the Consortium Organization and Data Coordinating Center (CODCC), the Portal enables data submission, management, integrated analysis, spatial context mapping, and harmonized access to cross-species data critical for aging research. This paper presents user needs, the Portal's architecture, data processing workflows, and senescence-focused analytical tools; usage scenarios illustrating applications in biomarker discovery, quality benchmarking, hypothesis generation, spatial analysis, cost-efficient profiling, and cell distance distribution analysis; and utility and usage by the larger researcher community. 
Current limitations and planned extensions, including expanded spatial-omics releases and improved tools for senotype characterization, are discussed. SenNet protocols, code, and user interfaces are freely available on https://docs.sennetconsortium.org/apis.
bioinformatics 2026-05-04 v2
Species-specific transformer models of bacterial gene order and content for genomic surveillance tasks
Horsfield, S. T.; Wiatrak, M.; McInerney, J. O.; Bentley, S. D.; Colijn, C.; Lees, J. A.
Abstract
Transformer models enable functionally meaningful representation of complex biological data, such as nucleotide or protein sequences. Existing foundation transformer models are trained on large multi-domain corpora of unlabelled DNA or protein data, showing unmatched task generalisation. However, these foundation models are often outperformed on domain-specific tasks, such as gene classification in prokaryotes, by models trained on taxonomically constrained data. By extension, species-specific transformer models hold promise for targeted analyses, provided sufficient training data are available. Epidemiological analysis of bacterial pathogens exemplifies the use case for species-specific transformers, owing to the wealth of genome data available, coupled with pathogen-specific analyses carried out during routine and outbreak surveillance. Here, we trained a transformer model, PanBART, on the gene content and gene order of two important and biologically distinct bacterial pathogens, Escherichia coli and Streptococcus pneumoniae, benchmarking against state-of-the-art non-transformer approaches for genomic epidemiology. We show that PanBART learns representations of population structure in an unsupervised manner, and can be used to accurately assign genomes to biologically meaningful sequence clusters. PanBART is also able to identify emergent lineages, differentiating them from pre-existing lineages, and can accurately predict genomes likely to acquire genes involved in antibiotic resistance before a transfer event has occurred. Finally, PanBART can be used to conduct co-selection analysis to identify pairs of genes likely to be found together. Our work demonstrates that species-specific transformer models can be employed in many critical public health scenarios. We lay the groundwork for wider application of such models in epidemiological analysis, and provide scenarios where such models excel.
bioinformatics 2026-05-04 v2
Semi supervised GAN for smart microscopy, fast and data efficient cell cycle classification
Manick, R.; El Habouz, Y.; Guillout, M.; Martin, C.; Bonnet, J.; Ruel, L.; Pastezeur, S.; Chanteux, O.; Bouchareb, O.; Tramier, M.; Pecreaux, J.
Abstract
Modern optical microscopes are fully motorised; however, transforming them into truly smart systems requires real-time adjustment of acquisition settings in response to detected objects and dynamic biological events. At the core are classification algorithms that commonly depend on customised software and are generally designed for narrowly defined biological applications. In addition, they often require substantial annotated datasets for effective training. We introduce a semi-supervised generative adversarial network (SGAN) for robust cell-cycle stage classification under low-resource conditions, adaptable to diverse cellular structures. The framework combines unlabelled microscopy images with synthetically generated samples to mitigate limited annotation, while preserving stable performance even when the unlabelled subset is class-imbalanced. Tested on the Mitocheck dataset, which features five mitosis classes, the model achieved 93 ± 2% accuracy using only 80 labelled images per class and 600 unlabelled images. The proposed algorithm is generic and can be readily adapted to new labelling schemes, classification targets, cell lines, or microscopy modalities through transfer learning. SGAN is well suited for integration into automated microscopes, enabling efficient and adaptable image analysis across diverse biological and microscopy applications.
bioinformatics 2026-05-04 v2
MetaUmbra: Statistically Controlled Genome-Level Presence Inference from Metaproteomic Peptides
Wu, Q.; Ning, Z.; Zhang, A.; Cheng, K.; Figeys, D.
Abstract
Taxonomic interpretation of metaproteomic peptides remains difficult because many peptide sequences are present in proteins from different organisms, reducing taxonomic specificity. Current peptide-centric workflows can report taxonomic summaries or taxon level confidence scores, but they do not provide formal statistical evidence that a taxon is present in the microbiome. Here we present MetaUmbra, a tool that derives genome-level statistical significance values from identified peptides. MetaUmbra builds theoretical peptide lists by in silico digestion of the taxon specific proteins and matches observed peptides against these references. It then combines a conservative significance estimate from unique peptides with a Monte Carlo based p-value for shared peptide evidence estimated under an empirical null model. In the defined community benchmark SIHUMIx, MetaUmbra identified the expected genomes without introducing false-positive genomes after embedding the SIHUMIx genomes in a large gut reference background. In the single strain benchmark Mix24X, all expected genomes were identified with the best statistical significances even after near neighbor and full background expansion. In a hamster gut genome panel, MetaUmbra further preserved an interpretable ranking of candidate genomes in a dense real-data setting. Together, these results show that MetaUmbra can statistically identify the presence of specific microbes in a complex microbiome while maintaining low false-positive calls. MetaUmbra therefore provides a practical framework for converting peptide evidence into genome-level statistical inference in metaproteomics.
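The Monte Carlo p-value mentioned in this abstract can be illustrated with a generic sketch. This is not MetaUmbra's implementation: the function names, the binomial null model, and all numbers below are hypothetical, chosen only to show how an empirical p-value is estimated from simulated null draws.

```python
import random


def monte_carlo_p_value(observed, null_sampler, n_draws=10_000, seed=0):
    """Empirical one-sided p-value: fraction of null draws >= observed.

    The +1 terms keep the estimate strictly positive, the usual
    convention for Monte Carlo p-values.
    """
    rng = random.Random(seed)
    exceed = sum(null_sampler(rng) >= observed for _ in range(n_draws))
    return (1 + exceed) / (1 + n_draws)


def make_binomial_null(n_peptides, match_prob):
    """Hypothetical null: each shared peptide matches the genome by chance
    with probability match_prob (an assumption, not the paper's null model)."""
    def sample(rng):
        return sum(rng.random() < match_prob for _ in range(n_peptides))
    return sample


null = make_binomial_null(n_peptides=50, match_prob=0.1)
# Many more matches than chance predicts: small p-value.
p_strong = monte_carlo_p_value(observed=20, null_sampler=null, n_draws=2000)
# Fewer matches than the chance expectation of 5: large p-value.
p_weak = monte_carlo_p_value(observed=3, null_sampler=null, n_draws=2000)
```

With a finite number of draws the smallest reportable p-value is 1 / (n_draws + 1), so the number of simulations bounds the achievable significance.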
bioinformatics 2026-05-04 v1
Integrated transcriptomic and proteomic analyses identify novel biomarkers of bladder outlet obstruction
Bigger-Allen, A. A.; Das, B.; Tang, Y.; Costa, K.; Ocampo, G.-L.; Hashemi Gheinani, A.; DiMartino, S.; Kaull, J.; Froehlich, J.; Lee, R. S.; Adam, R.
Abstract
Bladder outlet obstruction leads to pathological remodeling and emergence of lower urinary tract symptoms. Although relief of obstruction is associated with symptomatic improvement, it is not universally successful, reflecting persistent alterations in the bladder. Reliable surrogate biomarkers of obstruction are lacking, particularly early in the disease course before irreversible damage to the bladder may have occurred. In this study, re-analysis of publicly available transcriptomic datasets from diverse rodent models of obstruction identified tissue transcripts including Cthrc1, Grem1, Ltbp2 and Msn that were induced in response to injury. Candidate markers were validated experimentally in an independent model of neurogenic obstruction demonstrating time-dependent changes. Candidate markers were also attenuated with either surgical removal of obstruction or treatment with anticholinergic medication or inosine. Integrated analysis of tissue transcriptomics data and tissue and urine proteomics data from a model of neurogenic obstruction revealed significant concordance between markers observed in tissue and urine. Urinary proteomics analysis identified a statistically significant increase in MSN in patients with neurogenic bladder compared to unaffected controls. These findings identify tissue and urine biomarkers of both non-neurogenic and neurogenic obstruction that may reflect early changes in obstructive uropathy that could be monitored in a non-invasive manner.
bioinformatics 2026-05-04 v1
spatiAlytica: Viewer-Grounded Multimodal Agentic System for Interactive Spatial Omics Analysis
Das, A.; Zhang, K.; Song, J.; Han, M.; Chen, A.; Meng, W.; Galloway, H.; Chen, P.-Y.; Jo, S.; Liu, Z.; Hasib, M. M.; Officer, A.; Sinha, H.; Chiu, Y.-C.; Gao, S.-J.; Li, L.; Huang, Y.
Abstract
Spatial transcriptomics and proteomics map tissue architecture and cellular interactions, but analysis remains limited by programming demands and text-centered AI agents that lack viewer grounding and cross-turn context. We present spatiAlytica, a viewer-centric multimodal interactive agentic system embedded in the Napari viewer that enables non-programmer biologists to perform iterative, hypothesis-driven spatial omics analysis via natural language. spatiAlytica couples viewer-state serialization, agentic memory, biological concept-to-data-field mapping, code generation and debugging, Spatial VQA, and grounded interpretation to support an exploratory analysis and interpretive reasoning workflow. We introduce spatiAlyticaBench, a comprehensive benchmark spanning 222 single-turn spatial analytical coding questions, 178 multi-turn sequential workflow questions, and 7,350 image-grounded reasoning questions. spatiAlytica outperformed strong agentic baselines while using less time and fewer tokens. Case studies across Kaposi's sarcoma (KS), colorectal cancer, and ovarian cancer recapitulated known spatial patterns and uncovered progressive CD8 T-cell dysfunction during KS progression.
bioinformatics 2026-05-04 v1
FastDedup - A fast and memory-efficient tool for read deduplication
Ribes, R.; Mandier, C.; Baniel, A.
Abstract
PCR duplicate removal is a critical first step in high-throughput sequencing pipelines, yet existing tools struggle with speed, memory, or correctness at modern dataset scales. We present FastDedup, a Rust-based FASTX deduplicator that transforms each read or read pair to a compact xxh3 hash fingerprint, drastically reducing memory usage and binding most of the execution time to disk I/O. Benchmarked against six competing tools on synthetic human WGS datasets up to 300 million reads, FastDedup consistently leads on paired-end data, running more than 10 times faster than fastp. It also outperforms all tools on uncompressed single-end data, deduplicating a million reads in a second. We additionally report correctness failures in prinseq++ and clumpify. FastDedup is available under the MIT License via GitHub, Bioconda, and Cargo.
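The fingerprint idea described above can be sketched in a few lines. This is an illustration only, not FastDedup's code: FastDedup is written in Rust and uses xxh3, whereas this sketch substitutes the standard-library BLAKE2b hash truncated to 64 bits, and the function names and example reads are hypothetical.

```python
import hashlib


def fingerprint(seq: str, mate: str = "") -> int:
    """64-bit fingerprint of a read or read pair.

    BLAKE2b (8-byte digest) is a stand-in for xxh3, which is not in the
    Python standard library.
    """
    h = hashlib.blake2b(digest_size=8)
    h.update(seq.encode("ascii"))
    h.update(b"\x00")  # separator so ("AC", "GT") and ("ACG", "T") differ
    h.update(mate.encode("ascii"))
    return int.from_bytes(h.digest(), "big")


def dedup(reads):
    """Yield the first occurrence of each distinct sequence.

    Only fingerprints are kept in memory, not the sequences themselves.
    """
    seen = set()
    for name, seq in reads:
        fp = fingerprint(seq)
        if fp not in seen:
            seen.add(fp)
            yield name, seq


reads = [("r1", "ACGT"), ("r2", "TTGA"), ("r3", "ACGT")]
unique = list(dedup(reads))  # r3 duplicates r1 and is dropped
```

Storing fixed-size fingerprints instead of full reads is what keeps memory low; the trade-off is that two distinct reads with colliding hashes would be treated as duplicates, which is very unlikely but not impossible with a 64-bit fingerprint.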
bioinformatics 2026-05-04 v1
Radiant DIA: A Fast, Sensitive, and Accurate Search Engine for Quantitative Proteomics
Just, S.; Cantrell, L. S.; Nichols, A.; Wang, J.; Kis, J.; Mohtashemi, I.; Platt, T.; Farokhzad, O.; Batzoglou, S.
Abstract
In mass spectrometry-based proteomics, robust and efficient search engines are essential for accurate peptide and protein identification and quantification. Advances in sample preparation and instrumentation have increased the demand for highly scalable processing tools, with datasets comprising hundreds or thousands of samples in single-cell and population studies. Here we present Radiant DIA, a novel Data-Independent Acquisition search engine which achieves 4x faster processing and 10x lower cloud compute costs for large experiments while ensuring rigorous control of false discovery rate (FDR) and maintaining similar sensitivity, precision, and quantitative accuracy. The Radiant DIA search engine is paired with a modular pipeline deployable on cloud and desktop environments comprising individual modules for distributed re-scoring, FDR estimation, protein inference and quantification. Unlike traditional monolithic applications, this architecture enables high-performance, cloud-scale analysis without sacrificing local usability. Together, the Radiant DIA and Fulcrum Pipeline tools enhance computational efficiency to facilitate biological discovery in large-scale proteomics, as demonstrated by analyses of real-world experiments up to thousands of MS acquisitions.
bioinformatics 2026-05-04 v1
MeiCOfi: Meiotic CrossOver Finder in haploid, diploid, polyploid and hyper-recombinant genomes
Fuentes, R. R.; Fernandes, J. B.; Susanto, T.; Wang, Y.; Underwood, C. J.
Abstract
During meiotic cell division, homologous chromosomes pair and recombine, leading to large reciprocal exchanges of genetic information. In most species, meiotic crossovers (COs) are crucial for normal chromosome segregation and they generate genetic diversity, which can be acted upon by natural selection in wild populations or by breeders to combine desirable traits in a genome. Identifying the position and frequency of COs is therefore essential in both classical genetics studies and breeding programmes. However, a computational tool capable of accurately detecting COs across diverse contexts, including varying marker densities, genome size and structure, recombination rate, and ploidy, remains lacking. We developed MeiCOfi (Meiotic CrossOver Finder) to detect meiotic crossover events at high resolution from low-coverage genome sequencing data. We evaluated it using data from Arabidopsis thaliana, rice, barley and both intra- and inter-specific tomato hybrids, encompassing a wide range of genome complexities and marker densities. It reliably detects crossovers in hyper-recombinant A. thaliana with up to 62 COs per backcross offspring and in haploid gametes from barley with sequencing coverage as low as 0.1x. It can identify crossovers in polyploid genomes, including simulated recombinant tetraploids and real data from tetraploid tomato hybrid offspring. Our results demonstrate that MeiCOfi can robustly identify crossovers in diverse genomic contexts.
bioinformatics 2026-05-04 v1
Robust identification of cell-cell communication heterogeneity in single cells
Bocci, F.; Jia, Y.; Atwood, S.; Nie, Q.
Abstract
Communication between cells modulates cell fate decisions by relaying information across tissues and inducing intracellular responses mediated by gene regulatory networks. Inference of cell-cell communication (CCC) from high-throughput data such as single-cell transcriptomics is gaining popularity due to high data availability and the ease of automating modeling over hundreds of signaling pathways. How cell-cell communication operates across biological scales and influences cell fate decisions, however, remains a major open question. Here, we present scRICH, a framework and package that integrates mechanism-based, multiscale mathematical modeling with learning strategies to capture the complexity of cell-cell communication from single-cell and spatial transcriptomics data. scRICH unravels the heterogeneity of communication behavior within cell types, links cell-cell communication to cell fate decisions by incorporating dynamical information of RNA splicing, and connects the scales of cell-cell interactions and intracellular response by constructing multilayer regulatory networks. We validate scRICH with new experiments on EGF ligand/receptor co-expression in keratinocytes from skin-equivalent organoids, and compare these computational predictions against existing CCC inference methods. Applying scRICH to multiple biological scenarios demonstrates its ability to capture emerging relations between distinct cell-cell communication pathways, interactions at the onset of cell fate decisions, and emerging trends in cell-cell communication along cell lineages and in space.
bioinformatics 2026-05-04 v1
DPLM: Dynamics-aware Protein Language Model via contrastive learning between sequence and molecular dynamics simulation trajectory
Jiang, Y.; Wang, D.; Imam, I. A.; Xu, D.; Shao, Q.
Abstract
Protein dynamics play a critical role in protein function, yet this information is missing from many protein language models (PLMs). We introduce DPLM, a dynamics-aware protein language model that aligns sequence embeddings with molecular dynamics (MD) trajectory embeddings via contrastive learning. Using MD features encoded by a pretrained video model, DPLM learns sequence representations that correlate with residue-level flexibility and improve protein-level functional clustering compared to static sequence- and structure-based PLMs. Without task-specific training, DPLM outperforms ESM-based representations in zero-shot mutation-effect prediction on multiple deep mutational scanning datasets. When adapted with lightweight task-specific heads, DPLM further achieves top-tier performance on protein stability prediction and intrinsic disorder region identification, demonstrating that contrastive alignment with MD trajectories enables PLMs to capture biologically meaningful dynamic properties.
bioinformatics 2026-05-04 v1
AI-guided discovery of atypical protein assemblies
Toghani, A.; Seager, B. A.; Sugihara, Y.; Roijen, L.-M.; Azcue, J. M.; Garro, M.; Sargolzaei, M.; Morianou, I.; Harant, A.; Gallop, S.; Kourelis, J.; MacLean, D.; Contreras, M. P.; Kamoun, S.; Lüdke, D.
Abstract
Artificial intelligence (AI) systems such as AlphaFold have transformed structural biology by enabling accurate prediction of protein structures. However, their capacity to uncover new classes of macromolecular assemblies remains largely untapped. We developed the Structural Novelty Index (SNI), a quantitative framework for identifying protein complexes that diverge from canonical architectures. As one implementation of SNI, we developed SNINRC-Hexa, to identify unconventional resistosomes formed by nucleotide-binding, leucine-rich repeat immune receptors (NLRs). We used it to analyze AlphaFold 3 models of 637 non-redundant NRC proteins from 346 genomes representing 85 plant species. This analysis identified candidates with predicted architectures distinct from the canonical hexameric resistosomes of NRC proteins. Biochemical purification and negative-stain transmission electron microscopy of NRC7 orthologs from multiple species supported the SNI prediction and revealed an unexpected undecameric (11-mer) assembly. Our results establish SNI as a scalable approach for discovering atypical protein complexes.
bioinformatics 2026-05-04 v1
Automatic Bevacizumab Response Prediction in Ovarian Cancer from Digital Pathology Images via Novel AI-based Computational Pipeline
Alsaiari, A.; Turki, T.; Taguchi, Y.-h.
Abstract
Ovarian cancer is a gynecological cancer that, if it metastasizes without being detected early, can be fatal. There is therefore a need to accurately predict drug responses in ovarian cancer. A gynecological pathologist inspects tissue for abnormalities and then reports on the patient; however, this diagnostic process (1) is hard, (2) requires experience, and (3) is time-consuming. Moreover, existing tools are far from perfect. Hence, we present a computational pipeline to improve drug response prediction in ovarian cancer, constructed as follows. First, we downloaded digital pathology images of ovarian bevacizumab response from The Cancer Imaging Archive repository. We applied histogram of oriented gradients to the images to construct feature vectors, which were passed to Fisher linear discriminant analysis for dimensionality reduction. We then performed regression on the reduced-dimensionality data using support vector regression with various kernels and calculated the area under the ROC curve (AUC). Experimental results against transformer-based models (ViT and Swin) and other deep learning (DL) models (VGG16, ResNet50, InceptionV3, MobileNetV2, and EfficientNetB6) demonstrate that our approach with a radial kernel (named SVRD+R) yielded an AUC improvement of 17% over the best-performing transformer-based model (ViT) and of 14.9% over the best DL-based model (MobileNetV2). These results demonstrate the superiority and feasibility of our AI-based pipeline for prediction problems in gynecologic cancer studies.
bioinformatics 2026-05-04 v1
Automated Multimodal Correlative Registration for Organelle-Specific Molecular Imaging
Lu, C.; Zhao, K.; Cui, D.; Chen, G.; Yang, Q.; Yang, H.; Zhao, M.; Song, K.; Nikan, M.; Li, Z.; Zhao, S.; Cen, J.; Qiu, X.; Young, S.; Bennett, C. F.; Seth, P.; Chen, K.; Qi, X.; Jiang, H.
Abstract
Mapping subcellular drug distribution is essential for understanding trafficking and off-target effects. NanoSIMS enables chemical imaging of labeled therapeutics, but signal interpretation requires ultrastructural correlation with electron microscopy, a manual and laborious process. We present an automated AI-driven pipeline for correlating chemical and ultrastructural images, enabling multiscale, organelle-precise imaging of molecules in cells and tissues. The method integrates bidirectional optical flow, confidence-guided affine transformation, and automated template matching for cross-scale EM alignment. Morphology-rich ion channels (e.g., 32S) estimate transformations that propagate to sparse therapeutic signals (e.g., 79Br, 15N), overcoming low signal-to-noise challenges. We validate this framework across diverse cell and tissue types, tracking oligonucleotide and antibody therapeutics in vitro and in vivo to reveal cell-type- and organelle-specific distribution patterns. This work establishes a generalizable platform for automated multimodal registration and organelle-resolved subcellular pharmacology.
bioinformatics 2026-05-04 v1
PDBe-SIFTS: an open-source tool for Structure Integration with Function, Taxonomy, and Sequences, featuring improved alignment, scoring scheme, and accelerated search
Bellaiche, A.; Choudhary, P.; Nair, S.; Harrus, D.; Yu, C. W.-H.; Tanweer, S. A.; Evans, G. L.; Lo, S. W.; Martin, M.; Fleming, J. R.; Velankar, S.
Abstract
Structure Integration with Function, Taxonomy and Sequences (SIFTS) provides residue-level mappings between UniProt Knowledgebase sequences and Protein Data Bank structures and has historically been generated through internal Protein Data Bank in Europe (PDBe) pipelines. Here, PDBe-SIFTS is presented as a fully open-source, locally deployable implementation of this mapping framework. The pipeline combines fast, scalable sequence search using MMseqs2, an improved bounded scoring scheme for ranking candidate mappings, and residue-level mapping refinement based on backbone connectivity. PDBe-SIFTS is distributed as a Python package with command-line tools for 1) building a sequence search database, 2) identifying the best sequence-structure match, 3) one-to-one mapping at the residue level, and 4) generating SIFTS annotations in PDBx/mmCIF format. Benchmarking on the complete Protein Data Bank archive showed that MMseqs2 reduced archive-scale UniProtKB searches from hours with BLASTP to minutes, approximately 22-36 times faster, while curated mappings were recovered at top rank in 93.1% of cases. The remaining discrepancies mainly involved biologically ambiguous cases such as highly conserved proteins, chimeric constructs, or closely related orthologs. These results show that PDBe-SIFTS enables fast mapping, improving structural coherence in residue-level alignments while delivering the most up-to-date and accurate mappings, comparable to expert curation. Tool: https://github.com/PDBeurope/SIFTS Quick start notebook with example: https://github.com/PDBeurope/SIFTS/tree/master/notebooks
bioinformatics 2026-05-04 v1
Clustering Strategies Improve Structure-Preserving Visualization of Single-Cell RNA-seq Data with CBMAP
Alchaar, M.; Dogan, B.
Abstract
Dimensionality reduction for visualization is a fundamental step in single-cell RNA sequencing (scRNA-seq) analysis due to the extremely high dimensionality of gene expression profiles. However, widely used nonlinear embedding techniques such as UMAP and t-SNE can introduce substantial distortions when projecting data into two-dimensional space, potentially altering global organization, local neighborhoods, and distance relationships in ways that may mislead downstream biological interpretation. In this study, we investigate the applicability of Clustering-Based Manifold Approximation and Projection (CBMAP) for the visualization of scRNA-seq data and systematically examine how clustering strategies influence the quality of the resulting embeddings. CBMAP was integrated with several clustering algorithms commonly used in single-cell analysis, including k-means, Leiden, HDBSCAN, Secuer, HGC, and FlowSOM. The resulting embeddings were evaluated using quantitative metrics that measure global, local, and distance-level structure preservation and were compared with widely used dimensionality reduction methods such as UMAP, t-SNE, and PaCMAP across multiple benchmark datasets. Our results demonstrate that the clustering stage plays a critical role in determining the structural fidelity of CBMAP embeddings. Clustering algorithms specifically designed for single-cell transcriptomic data, particularly Secuer, produced more consistent preservation of global relationships between cell populations. Across multiple datasets, CBMAP more faithfully preserved global structural organization and inter-population distance relationships than the compared methods, although local neighborhood preservation was generally weaker than in techniques optimized for local structure. Importantly, CBMAP embeddings retained biologically meaningful relationships in trajectory benchmark datasets. 
When combined with RNA velocity analysis, CBMAP successfully preserved cyclic progenitor states and branching differentiation trajectories, demonstrating compatibility with trajectory-aware visualization. These findings indicate that CBMAP provides a structure-faithful visualization framework for scRNA-seq data and that clustering selection plays a central role in determining embedding quality.
bioinformatics 2026-05-04 v1
AnnotateMissense: a genome-wide annotation and benchmarking framework for missense pathogenicity prediction
Muneeb, M.; Ascher, D. B.
Abstract
Missense variant interpretation remains challenging because pathogenicity depends on heterogeneous evidence from population frequency, evolutionary conservation, transcript context, amino acid substitution severity, prior pathogenicity predictors and protein-language-model-derived features. We present AnnotateMissense, a scalable annotation, benchmarking and genome-wide prediction framework for missense variant interpretation. AnnotateMissense integrates hg38 missense variants derived from dbNSFP v5.1 with ANNOVAR annotations, dbNSFP transcript/protein descriptors, AlphaMissense scores, ESM-derived features, conservation metrics, population-frequency variables, established pathogenicity predictors and engineered amino acid/codon-context features. Using 132,714 ClinVar-labelled missense variants, we benchmarked machine-learning and deep-learning models under controlled feature configurations. The full 303-feature benchmark set achieved the strongest performance with XGBoost, reaching mean MCC = 0.9411 and ROC-AUC = 0.9950 across stratified five-fold cross-validation. Restricted naive and location-oriented feature sets achieved lower best MCC values of 0.4989 and 0.5113, respectively. Circularity-controlled ablations showed that removing prior-predictor, population-frequency and clinically overlapping evidence reduced performance, whereas excluding AlphaMissense and ESM-derived features alone had minimal effect. Temporal ClinVar validation on newly observed pathogenic/benign variants achieved MCC = 0.7613, accuracy = 0.8798 and F1-score = 0.8750. The final model was applied to 90,643,830 hg38 missense variants to generate AnnotateMissense pathogenicity scores and binary prediction labels. Code and outputs are available at https://github.com/MuhammadMuneeb007/CAGI7_Annotate_All_Missense and https://doi.org/10.5281/zenodo.19981867.
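For readers unfamiliar with the headline metric, the Matthews correlation coefficient (MCC) reported above is computed from the four cells of a binary confusion matrix. The sketch below uses the standard formula; the function name and the example counts are made up for illustration and are not taken from the paper.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).
    Returns 0.0 when any marginal is empty (the usual convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Hypothetical counts for a pathogenic-vs-benign classifier.
example_score = mcc(tp=90, tn=85, fp=15, fn=10)
perfect_score = mcc(tp=50, tn=50, fp=0, fn=0)     # all predictions correct
chance_score = mcc(tp=25, tn=25, fp=25, fn=25)    # no better than random
```

Unlike accuracy, MCC stays near zero for a classifier that ignores the minority class, which is why it is often preferred on imbalanced variant sets.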
bioinformatics 2026-05-04 v1
DoFormer: Causal Transformer for Gene Perturbation
Karbalayghareh, A.; Paull, E.; Califano, A.
Abstract
Learning causal gene regulatory mechanisms from single-cell data, and thereby predicting the effects of unseen perturbations, remains challenging. Observational RNA-seq data alone is insufficient for causal modeling, whereas perturbational data is essential. Classical causal inference methods often rely on unrealistic directed acyclic graph (DAG) assumptions and are not well suited to integrating multimodal data. Current transcriptomic foundation models also typically treat observational and perturbational data identically, limiting their ability to model perturbations. We present DoFormer, a causal multimodal Transformer that makes no DAG assumptions and leverages rich perturbational data to accurately predict previously unseen perturbations. DoFormer enables principled in silico perturbations by adapting the causal do-operator within the attention mechanism: the perturbed gene is set to the intervention value and prevented from attending to other genes, allowing the model to fully distinguish observational from interventional regimes. We train DoFormer using biologically informed loss functions and evaluate it with comprehensive perturbation prediction metrics. DoFormer substantially improves perturbation prediction relative to baseline and prior foundation models, underscoring the importance of intervention-aware architectures and biologically grounded objectives for causal modeling in single-cell genomics.
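The do-operator mechanism described in this abstract (clamp the perturbed gene to its intervention value and stop it from attending to other genes) can be illustrated with a toy attention step. This is a sketch under strong simplifying assumptions, not DoFormer's architecture: embeddings are scalars, there is a single head with no learned projections, and the function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax; -inf entries receive zero weight."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def do_attention(values, scores, perturbed, intervention):
    """One attention step with a do-style intervention on one gene.

    values: per-gene scalar embeddings.
    scores: n x n raw attention scores (row = query gene, column = key gene).
    The perturbed gene is set to `intervention`, and its query row is masked
    so it attends only to itself; other genes can still attend to it.
    """
    n = len(values)
    v = list(values)
    v[perturbed] = intervention  # do(gene = intervention)
    out = []
    for i in range(n):
        if i == perturbed:
            row = [0.0 if j == i else -math.inf for j in range(n)]
        else:
            row = scores[i]
        weights = softmax(row)
        out.append(sum(w * x for w, x in zip(weights, v)))
    return out


values = [1.0, 2.0, 3.0]
scores = [[0.0] * 3 for _ in range(3)]  # uniform attention for simplicity
out = do_attention(values, scores, perturbed=1, intervention=0.0)
```

In this toy setting the perturbed gene's output equals the intervention value regardless of the other genes, while the unperturbed genes see the clamped value, which mirrors cutting the incoming edges of the intervened node in a causal graph.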
bioinformatics2026-05-04v1Do Larger Models Really Win in Drug Discovery? A Benchmark Assessment of Model Scaling in AI-Driven Molecular Property and Activity Prediction
Guo, J.Abstract
The rapid growth of molecular foundation models and general-purpose large language models has encouraged a scale-centric view of artificial intelligence in drug discovery, in which larger pretrained models are expected to supersede compact cheminformatics models and task-specific graph neural networks (GNNs). We test this assumption on 22 molecular property and activity endpoints, including public ADMET and Tox21 benchmarks and two internal anti-infective activity datasets. Across 167,056 held-out task-molecule evaluations under structure-similarity-separated five-fold cross-validation (37,756 ADMET, 77,946 Tox21, 49,266 anti-TB and 2,088 antimalaria), classical machine-learning (ML) models such as RF(ECFP4) and ExtraTrees(RDKit descriptors) win ten primary-metric tasks, GNNs such as GIN and Ligandformer win nine, and pretrained molecular sequence models such as MoLFormer and ChemBERTa2 win three. Rule-based SAR reasoning baselines, represented by GPT5.5-SAR and Opus4.7-SAR, do not win under the prespecified primary metrics, although train-fold-derived SAR knowledge provides measurable but uneven gains for SAR reasoning and interpretation. These results indicate that compact, specialized models remain highly effective for molecular property and activity prediction. The performance differences among classical ML, GNN and pretrained sequence models are often modest and endpoint-dependent, whereas larger or more general models do not provide a universal predictive advantage. Large models may still add value for zero-shot reasoning, SAR interpretation and hypothesis generation, but the results suggest that predictive performance depends on the alignment among molecular representation, inductive bias, data regime, endpoint biology and validation protocol.
bioinformatics2026-05-04v1Reference-Based Library Construction Improves Performance in low-input diaPASEF Workflows
Charkow, J.; Ghaznavi, M.; Seale, B.; Peng, J.; Gingras, A.-C.; Rost, H.Abstract
In low-input mass spectrometry-based proteomics, data-independent acquisition (DIA), including diaPASEF, is quickly becoming the method of choice for label-free quantification. Whether using empirical or in silico spectral libraries, performance depends on the library; however, the optimal library construction strategy for low-input proteomics remains an open question. To address this, we examine and develop library construction approaches that are compatible with both spectrum-centric and peptide-centric analysis workflows. These approaches leverage a closely related, high-quality sample to improve library quality. First, we validated our approach in bulk sample amounts, where we observed that the effect of gas-phase-fractionation-based library construction depends on the software framework, with improvements more pronounced in OpenSWATH than in DIA-NN. In OpenSWATH, our peptide-centric library reconstruction workflow consistently outperforms a transfer learning strategy, an emerging alternative approach. In DIA-NN, trends depend on library source, highlighting OpenSWATH's stronger dependence on the search space. In low-input applications, such as single-cell-equivalent injection amounts (100 pg) of HeLa cell digest on a timsTOF SCP, our library construction approach provided more pronounced improvements across both software tools than in bulk samples. Using a peptide-centric reconstruction approach with the OpenSWATH analysis framework, we detected over 15,000 peptide precursors (2,480 protein groups), a 90% improvement over the original library. Furthermore, using a spectrum-centric construction approach, peptide precursor identification rates improved over 6-fold (~1000 to ~6000). Our strategy provides a practical solution for generating high-quality libraries in low-input applications.
bioinformatics2026-05-04v1An unsupervised framework for comparing SARS-CoV-2 protein sequences using LLMs
Littlefield, S. B.; Campbell, R. H.Abstract
The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic led to 700 million infections and 7 million deaths worldwide. While studying the virus, scientists generated a large amount of sequencing data that was made available to researchers. Large language models (LLMs) are pre-trained on large databases of proteins, and prior work has shown their use in studying the structure and function of proteins. This paper proposes an unsupervised framework for characterizing SARS-CoV-2 sequences using large language models. First, we compare several language models previously proposed by other authors. This step is used to determine how clustering and classification approaches perform on SARS-CoV-2 sequence embeddings. In this paper, we focus on surface glycoprotein sequences, also known as spike proteins in SARS-CoV-2, because scientists have previously studied their role in recognition by the human immune system. Our contrastive learning framework is trained in an unsupervised manner, leveraging the Levenshtein distance from pairwise alignment of sequences when the contrastive loss is computed by the Siamese neural network. The final part of this paper focuses on a comparison with a previous approach on a test dataset containing data from the latter part of the pandemic. In the prediction of emerging variants, the proposed LLM-based approach shows an improvement of 0.2 in adjusted Rand index for clustering compared to a previously proposed approach. This shows the potential of applying large language models to this field.
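The Levenshtein distance that the abstract uses as an unsupervised training signal is the minimum number of single-character insertions, deletions, and substitutions needed to turn one sequence into another. A minimal dynamic-programming sketch follows; how the authors feed this distance into their contrastive loss is not specified in the abstract, so only the distance itself is shown, and the function name is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two sequences using a two-row dynamic programme."""
    prev = list(range(len(b) + 1))          # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion from a
                            curr[j - 1] + 1,      # insertion into a
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

# Two short hypothetical peptide fragments differing by one substitution.
print(levenshtein("NLVKNKC", "NLVKNQC"))
```

The two-row formulation keeps memory linear in the length of the shorter input, which matters when comparing many spike sequences of more than a thousand residues each.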
bioinformatics2026-05-03v3A 37-million-particle dataset from over 250 experiments to accelerate data-driven cryo-EM analysis
Zamanos, A.; Kyrilis, F. L.; Koromilas, P.; Kastritis, P. L.; Panagakis, Y.Abstract
Cryogenic Electron Microscopy (cryo-EM) has revolutionized structural biology by enabling near-atomic-resolution structure determination of biological macromolecules. Central to cryo-EM analysis are particles, namely 2D projections of biomolecules extracted from micrographs, which serve as the primary input for 3D reconstruction. While data-driven methods have transformed other scientific domains, their impact on cryo-EM remains limited because existing particle datasets are too small, too narrow in protein diversity, and lack rich per-particle annotations. We introduce cryoPANDA (cryo-EM Particles ANnotated DAtaset), comprising over 37 million annotated particles from 252 experiments spanning a wide range of protein types, more than 10-fold larger than prior collections. Each particle is accompanied by detailed annotations covering acquisition, classification, and reconstruction metadata, alongside the corresponding 3D electrostatic potential map, the published EMDB map, and, where available, the PDB model. We validate cryoPANDA in two ways: first, by reconstructing hundreds of distinct high-resolution cryo-EM maps; and second, by training a DINOv2 foundation model and evaluating its learned representations on micrograph segmentation, particle picking, and particle clustering.
bioinformatics2026-05-03v1MorphOTU: A universal image-based framework for delineating biodiversity discovery
Zhan, Z.; Chen, W.; Liu, X.; Yue, L.; Zhang, F.Abstract
The absence of a scalable system for organizing the vast majority of unidentified species has become the central obstacle in biodiversity science. Existing molecular and computer-vision methods rely on DNA material or closed-set labels, which hampers biodiversity quantification under the open, incomplete conditions that characterize real ecosystems. Here, we introduce morphOTU, a general image-based framework that constructs operational units of biodiversity directly from phenotype. Using morphOTU, we derive image-based OTUs across five plant and beetle datasets spanning heterogeneous imaging conditions. These units recover species-level boundaries, retain coherent structure when most species are "unseen" during training, and accurately approximate richness and Shannon diversity indices even under sparse labeling or limited sampling. Visual explanations reveal that morphOTU consistently focuses on biologically meaningful traits and captures continuous phenotypic variation. By providing a scalable, open-set framework for quantifying phenotypic diversity, morphOTU enables biodiversity assessment that includes unnamed species and unlocks the ecological value of rapidly expanding digital image repositories.
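The two diversity summaries named in the abstract, richness and the Shannon index, can be computed directly from per-specimen unit assignments. The sketch below assumes each image has already been assigned an OTU label (the labels shown are hypothetical) and uses the natural-log form of the Shannon index, H = -Σ p_i ln p_i.

```python
import math
from collections import Counter

def richness_and_shannon(labels):
    """Return (number of distinct units, Shannon index in nats)."""
    counts = Counter(labels)
    total = sum(counts.values())
    shannon = -sum((c / total) * math.log(c / total) for c in counts.values())
    return len(counts), shannon

# Hypothetical OTU assignments for eight specimen images.
otus = ["otu1", "otu1", "otu1", "otu2", "otu2", "otu3", "otu3", "otu3"]
richness, shannon = richness_and_shannon(otus)
print(richness, round(shannon, 3))
```

Because both quantities depend only on the partition of specimens into units, they can be estimated from image-derived clusters even when the units carry no species names.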
bioinformatics2026-05-01v1Single-cell foundation models reveal context-sensitive cancer programmes under subtype shift
Wallace, J.; Youssef, G.; Han, N.Abstract
Single-cell foundation models (scFMs) have shown promise as transferable representations of cellular state, but recent zero-shot evaluations suggest that they do not consistently outperform simpler baselines. We asked whether this apparent limitation reflects an intrinsic weakness of scFMs or instead the difficulty of using them without task-specific adaptation. To test this, we fine-tuned two widely used scFMs, Geneformer and scGPT, on common tumour subtypes from renal, lung, and breast cancer, and compared them with a LightGBM baseline on within-domain validation cohorts and on out-of-domain rarer, unseen cancer subtypes. Across all three organs, the models achieved near-perfect within-domain discrimination (AUROC 0.98-1.00), but differences emerged under subtype shift. On chromophobe RCC, scGPT and Geneformer achieved AUROC 0.88 and 0.92 respectively versus 0.64 for LightGBM; on SCLC, Geneformer reached 1.00 versus 0.82 for LightGBM; and on TNBC, scGPT achieved 0.80 versus 0.49 for LightGBM. To determine whether this generalisation reflected meaningful adaptation rather than arbitrary feature drift, we applied Integrated Gradients, an interpretability technique, to the fine-tuned scFMs and SHAP to LightGBM. LightGBM showed highly stable gene-importance rankings across datasets, whereas the foundation models were substantially more context-sensitive. However, this flexibility was not random: all models converged on a shared within-domain core, while scFMs acquired larger rare-subtype-specific gene sets and pathway programmes during transfer. Pathway enrichment further supported the biological relevance of these attributed genes. Together, these results suggest that fine-tuned scFMs can bridge clinically relevant domain shifts in cancer single-cell analysis and that interpretability provides a practical route to distinguishing biologically grounded adaptation from rigid reuse of training-era rules.
bioinformatics2026-05-01v1JUMPlion improves quantitative DIA proteomics through ion-level recovery of missing values
Fu, Y.; Yuan, Z.-F.; Byrum, S. D.; Wu, L.; Peng, J.; Wang, X.; High, A. A.Abstract
Incomplete quantification remains a persistent challenge in data-independent acquisition (DIA) mass spectrometry (MS), particularly in low-input and single-cell analyses. In identification-driven workflows, missing protein quantities often arise not from true absence of the corresponding peptides, but from failure to retain low-abundance signals from precursor or product ions for quantification. Here we present JUMPlion (local inference of ion-level missingness), a DIA quantification framework that re-examines MS raw files to recover missing values at the ion level before protein quantification. JUMPlion re-extracts precursor- and product-ion signals directly from raw data, infers ion-level measurements within precursor-specific local quantitative neighborhoods, and combines complementary precursor- and product-ion signals into downstream quantification. Using benchmark datasets acquired on multiple DIA platforms, JUMPlion increased protein-level completeness, improved fold-change accuracy, and enhanced detection of differentially abundant proteins while maintaining low differential-abundance false discovery rates. These gains were most evident in low-input and single-cell DIA datasets. Together, these results show that addressing missingness at the ion level before protein-level summarization can improve DIA quantification in diverse acquisition settings.
bioinformatics2026-05-01v1Rapid-PFP: Accelerating Prefix-Free Parsing with GPU Parallelism
Ferro, E. A.; Pencinger, T.; Green, O.; Lotfollahi, M.; Boucher, C.Abstract
Prefix-Free Parsing (PFP) is widely used in genomic data processing to construct compressed indexes on massive, highly repetitive datasets. However, existing CPU implementations are constrained by sequential bottlenecks, limiting their ability to scale to modern large-scale pangenomic collections. We introduce RAPID-PFP, a redesigned implementation of the PFP algorithm that takes advantage of the massive parallelism and high memory bandwidth of modern GPUs. RAPID-PFP parallelizes trigger-string detection, phrase parsing, dictionary construction, and parse generation through custom CUDA kernels and GPU-resident data structures built using cuDF, CuPy, and Numba-CUDA. The algorithm operates entirely within GPU memory, minimizes host interaction, and dynamically adapts to available VRAM, enabling efficient processing across a range of hardware configurations. Across E. coli and Human Pangenome (HPRC) datasets, RAPID-PFP produces identical output to established CPU pipelines while delivering an order-of-magnitude acceleration. On 3,682 E. coli assemblies, RAPID-PFP reduces runtime from 552 seconds to 17 seconds compared to PFP-FL (32.1-fold) and from 1,078 seconds to 17 seconds compared to PFP-ITL (62.6-fold). On the complete 46-sample HPRC dataset, RAPID-PFP achieves a 33.4-fold speedup and successfully processes scales that PFP-ITL cannot handle. Performance improves with dataset size, reflecting that PFP maps naturally onto thousands of CUDA cores, yielding sublinear scaling relative to CPU implementations. RAPID-PFP demonstrates that foundational compressed-indexing algorithms can be re-engineered for accelerators, enabling scalable and practical preprocessing for large-scale genomic indexing workflows.
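The four stages listed in the abstract (trigger-string detection, phrase parsing, dictionary construction, and parse generation) can be illustrated with a simplified, CPU-only sketch of prefix-free parsing. This is not RAPID-PFP's CUDA implementation: the function names, the sentinel characters, the tiny window size `w` and modulus `p`, and the non-rolling stand-in hash are all assumptions chosen to keep the example short.

```python
def is_trigger(window: str, p: int) -> bool:
    # Stand-in for the rolling hash used in practice: read the window bytes as an integer.
    return int.from_bytes(window.encode("utf-8"), "big") % p == 0

def prefix_free_parse(text: str, w: int = 2, p: int = 3):
    """Split text into phrases that overlap by w characters at trigger strings."""
    s = "\x01" + text + "\x02" * w      # sentinels assumed absent from text
    last = len(s) - w                   # start of the final (sentinel) window
    phrases, start = [], 0
    for i in range(1, last + 1):
        # Each position is tested independently, so detection parallelizes well.
        if i == last or is_trigger(s[i:i + w], p):
            phrases.append(s[start:i + w])   # phrase ends with the trigger string
            start = i                        # next phrase starts with it
    dictionary = sorted(set(phrases))        # distinct phrases in lexicographic order
    rank = {phrase: r for r, phrase in enumerate(dictionary)}
    parse = [rank[phrase] for phrase in phrases]
    return dictionary, parse

def reconstruct(dictionary, parse, w):
    """Invert the parse: drop the w-character overlap, then strip the sentinels."""
    s = dictionary[parse[0]] + "".join(dictionary[r][w:] for r in parse[1:])
    return s[1:-w]

genome = "GATTACAGATTACAGATTACA"
dictionary, parse = prefix_free_parse(genome)
print(len(dictionary), len(parse))
```

On repetitive input the same phrases recur, so the dictionary stays small while the parse records their order. That compression is what makes PFP useful for pangenomic collections.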
bioinformatics2026-05-01v1A high-quality, chromosome-scale genome assembly of the shade-tolerant wild rice, Oryza granulata
Zhang, F.; Yang, Y.-h.; Li, W.; Shi, C.; Zhu, X.-g.; Gao, L.-z.Abstract
Oryza granulata Nees et Arn. ex Watt, a diploid wild rice (GG genome), possesses exceptional shade tolerance and is a key genetic resource for rice improvement. However, previous genome assemblies lacked continuity and completeness. Here we present a chromosome-scale reference genome of O. granulata using PacBio SMRT (113×), Hi-C (95×), and Illumina sequencing. The final assembly is ~764.24 Mb, with a scaffold N50 of ~59.32 Mb, and ~96.47% of the sequence anchored to 12 chromosomes. BUSCO completeness is ~98.6%. We annotated ~42,064 protein-coding genes, of which ~95.39% were functionally annotated, along with ~73.46% repetitive elements. The genome assembly and raw sequencing data are available at NGDC (PRJCA061980), NGDC GSA (CRA068332), and NGDC GWH (GWHISVE00000000.1). This high-quality genome will serve as a fundamental resource for evolutionary genomics, conservation biology, and breeding of shade-tolerant rice cultivars.
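The scaffold N50 quoted above is the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the assembly. A minimal sketch of that calculation follows; the function name and the scaffold lengths in the example are hypothetical and unrelated to the O. granulata assembly.

```python
def n50(lengths):
    """Length L such that scaffolds of length >= L cover at least half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50 requires at least one scaffold length")

# Hypothetical scaffold lengths (arbitrary units), for illustration only.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

Because N50 is driven by the longest sequences, anchoring most of an assembly to a few chromosome-scale scaffolds raises it sharply, which is why it is reported alongside the anchored fraction.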
bioinformatics2026-05-01v1Aiki-GeNano: Multi-Stage Preference Optimization for Generative Design of Developable Nanobodies
Meda, R. S.; Doshi, J.; Iyer, E.; Shastry, S.; Mysore, V.Abstract
Therapeutic nanobodies must combine target binding with biophysical and chemical properties that determine manufacturability, stability, and clinical viability, collectively termed developability, yet most computational design pipelines still treat developability as a post-hoc filter rather than an integrated training objective. We present Aiki-GeNano, a three-stage language-model alignment pipeline for epitope-conditioned nanobody generation that integrates multiple developability signals directly into training, using only sequence information and previously published predictors. Across 65 target epitopes and relative to the supervised baseline, the combined pipeline raised predicted mean melting temperature by 6.6 °C, halved isomerization-motif severity, reduced deamidation, N-glycosylation sequons and CDR methionine-oxidation motifs, and preserved predicted humanness and solubility. On a shared 10-target GPCR benchmark, Aiki-GeNano achieved the highest predicted melting temperature and the lowest isomerization severity among five contemporary VHH generators. Starting from ProtGPT2 and a 1.35-million-pair binder dataset generated on an mRNA-display platform, the pipeline applies supervised fine-tuning, Direct Preference Optimization on 522,800 pairs ranked by a composite of selectivity, predicted thermal stability, solubility, and humanness, and Group Reward-Decoupled Policy Optimization against six sequence-based rewards (FR2 hydrophobicity, hydrophobic-patch coverage, chemical-liability motifs, Wilkinson-Harrison expression probability, VHH hallmark residues, scaffold integrity). Generated sequences differ from the nearest training sequence by a mean of 8.1-9.0 amino acids out of 126, and two alternative training trajectories converge to distinct amino-acid-composition strategies with similar liability outcomes but different thermal-stability gains, indicating initialization-dependent convergence of the reward-optimized policy.
Predicted humanness was preserved at the level of the camelid VHH scaffold of the training library, a data-side limitation rather than a methodological one, since the framework was effectively constant across all preference pairs. Applicability to the drug discovery and development pipeline, limitations of predicted-property evaluation, and future work are discussed.
bioinformatics2026-05-01v1Skill-Augmented Frontier Agents Nearly Saturate BixBench-Verified-50
Zhang, X.Abstract
Large language model (LLM) agents are increasingly used for biological data analysis, but prior benchmark results have given a mixed picture of whether they are ready for routine bioinformatics work. The original BixBench study reported only ~17-21% accuracy for frontier agents on open-answer bioinformatics questions. Subsequent curation of BixBench-Verified-50 removed or revised ambiguous items, revealing much higher performance for modern agents. Here we evaluate three frontier-model configurations on the 50 verified questions using the same local benchmark, prompt structure, answer format, and grading pipeline: GPT-5.4 with Claude Scientific Skills and no web access, Claude Opus 4.7 with Claude Scientific Skills and no web access, and GPT-5.5 with Claude Scientific Skills, bioSkills, and web access. The three configurations achieve 88.0% (44/50), 84.0% (42/50), and 98.0% (49/50) accuracy, respectively. The remaining GPT-5.5 error is not a clear analytical failure: the agent correctly computed Spearman correlations on the distributed CRISPRGeneEffect.csv values and selected CCND1, whereas the reference answer is recovered only after interpreting stronger essentiality as the opposite sign of the raw gene-effect score. Offline errors mainly occurred when agents lacked pathway, organism-annotation, BUSCO, or PhyKIT-related resources. These results show that frontier agents equipped with high-quality scientific skills can nearly saturate a curated bioinformatics benchmark, while also emphasizing that question wording, score sign conventions, and access to current external resources remain decisive for reliable evaluation.
bioinformatics2026-05-01v1Differential peripheral immune dynamics underlie therapeutic response to chemotherapy and chemo-immunotherapy in triple-negative breast cancer
Mesrizadeh, Z.; Mukund, K.; Subramaniam, S.Abstract
Triple-negative breast cancer (TNBC) remains the most aggressive breast cancer subtype, with limited treatment options and variable response to immune checkpoint inhibitors. While tumor-infiltrating lymphocytes have been extensively studied, the integration of system-level peripheral immune dynamics with the mechanistic immune regulation underlying therapeutic response and resistance remains poorly defined. Here, we integrate systems-level immune state modeling with pathway-level mechanistic inference to analyze single-cell RNA sequencing of peripheral blood mononuclear cells from advanced TNBC patients treated with paclitaxel alone (chemotherapy) or in combination with the anti-PD-L1 antibody atezolizumab (combination). This framework leverages treatment arm, longitudinal sampling, and clinical response to resolve coordinated immune programs across lymphoid and myeloid compartments. Using this approach, we identified distinct treatment- and response-specific immune states before and after treatment. Chemotherapy responders displayed pre-treatment adaptive immune priming, whereas combination therapy responders exhibited pre-existing effector T cell activity coupled with tumor tissue PD-L1 expression. In contrast, chemotherapy non-responders developed persistent post-treatment immune dysregulation in regulatory and terminal effector programs, while combination therapy non-responders demonstrated maladaptive remodeling of adaptive and innate lymphoid compartments, including dysfunctional NK and metabolically reprogrammed myeloid populations. Across both regimens, pathways involving protein translation, metabolic adaptation, and stress signaling emerged as critical modulators of response. These findings suggest that coordinated adaptive-innate immune dynamics underlie therapeutic efficacy, whereas systemic immune exhaustion and myeloid immunoregulation lead to resistance.
Projection of these peripheral immune programs onto independent I-SPY2 data showed concordant associations with tumor immune phenotypes and pathological complete response, supporting generalizability of the identified systemic immune states. Our study demonstrates the utility of an integrative systems-level approach for linking peripheral immune state organization with mechanistic insights, informing immune response and resistance in TNBC.
bioinformatics2026-05-01v1Reconstructing True 3D Spatial Omics at Single-Cell Resolution
Yang, Y.; Luo, Y.; Zhang, K.; Bu, Y.; Xia, Z.; Peng, H.; Yan, R.; Liu, Q.; Chen, Y.; Shen, L.; Chen, E.Abstract
Capturing the three-dimensional (3D) organization of cells is essential for deciphering complex biological processes, yet comprehensive 3D spatial omics is severely hindered by the destructive nature of physical sectioning and the depth limitations of intact tissue imaging. Current computational methods rely on 2.5D stacking of discrete slices, which inherently disrupts tissue topology and fails to resolve continuous depth-dependent molecular gradients. To bridge this gap, we introduce DeepSpatial, an Optimal Transport flow matching framework that models tissue evolution as a continuous dynamic vector field. By solving the underlying probability flow ODEs, DeepSpatial enables the direct extraction of uninterrupted, infinitely resolvable tissue states at arbitrary spatial depths. Using Deep STAR/RIBOmap 3D technologies, we demonstrate that DeepSpatial achieves improved 3D reconstruction fidelity relative to 2.5D approaches, yielding structures that more closely recapitulate native tissue microenvironments in real-world datasets. Across diverse spatial omics modalities, including spatial proteomics using imaging mass cytometry in human breast cancer and spatial transcriptomics using openST in head and neck squamous cell carcinoma metastatic lymph nodes, DeepSpatial produces biologically interpretable and high-fidelity reconstructions across datasets. We evaluated the scalability and robustness of DeepSpatial on a large-scale mouse brain dataset, reconstructing a continuous 3D cellular atlas comprising 39 million cells within 41.6 hours. Systematic downstream characterization validated its ability to recapitulate consistent spatial architectures, cell-type distributions, transcriptomic patterns, and microenvironmental structures across brain regions. Collectively, these results demonstrate DeepSpatial as a generalizable and efficient solution for true 3D spatial reconstruction across scales and modalities.
bioinformatics2026-05-01v1Interpretable sequence-based machine learning consolidates candidate H3N2 hemagglutinin antigenic sites
Meyer, A. G.; Santillana, M.Abstract
Vaccine strain selection for seasonal influenza A(H3N2) depends on knowing which hemagglutinin (HA) substitutions are most likely to erode neutralizing antibody recognition, yet published antigenic site sets disagree substantially on which positions matter most. We applied interpretable gradient-boosted tree models with SHAP-based site attribution to two complementary hemagglutination inhibition (HI) datasets to produce a more consolidated ranking of candidate antigenic positions. Models trained on a Neher/Bedford benchmark dataset recover the canonical cluster-transition sites established by prior analyses. Moreover, after filtering the WIC dataset for confounding factors, our models recover the majority of positions from four major prior reference sets (Koel, Neher/Bedford, Harvey, and Shah) and improve concordance between rankings derived from the Neher/Bedford and WIC datasets. Rankings from our models also agree more strongly with models trained to predict sampling time or passage identity than with standard evolutionary metrics used to detect diversifying selection. Our results show that interpretable sequence-based models can provide a more integrative ranking of candidate antigenic positions across different data sources and modeling approaches. This work should aid efforts to prioritize H3N2 substitutions for epidemic surveillance.
bioinformatics2026-05-01v1Single nucleus transcriptome analysis of Arabidopsis thaliana roots infected with Phytophthora capsici
Alajoleen, R. M.; Chau, T. N.; Shuman, J.; Bargmann, B.; Li, S.Abstract
Understanding how plant roots coordinate immune responses at the cellular level is key to unraveling host-pathogen interactions. Using single-nucleus RNA sequencing (snRNA-seq) of Arabidopsis thaliana roots 24 hours after Phytophthora capsici (P. capsici) inoculation, we captured the transcriptional landscape of early infection at single-cell resolution. Four libraries (two infected and two mock-treated) were generated, yielding approximately 26,000 high-quality nuclei with consistent sequencing depth and viability. A reference-based pipeline distinguished host and pathogen transcripts, enabling species-resolved mapping and host-focused single-nucleus transcriptomic analysis. Integration and clustering identified 12 transcriptionally distinct root cell types, encompassing major tissues such as the meristem, cortex, endodermis, and vasculature. Cluster-specific marker analysis confirmed cell-type identities, while differential expression and Gene Ontology enrichment revealed a global transcriptional shift from metabolic and translational processes in mock samples to defense, stress, and pathogen-response pathways upon infection. Hormone-related enrichment indicated broad salicylic acid activation across root tissues, spatially confined ethylene signaling in vascular-associated clusters, and localized jasmonic acid responses in cortex and phloem. Together, these results provide a high-resolution view of Arabidopsis root immunity, highlighting a coordinated yet tissue-specific defense architecture in which salicylic acid underpins systemic protection, ethylene modulates vascular defense, and jasmonic acid contributes targeted reinforcement during early P. capsici infection. Keywords: single-nucleus RNA sequencing, root immunity, cell-type-specific defense, hormone signaling, salicylic acid, jasmonic acid, ethylene signaling
bioinformatics2026-05-01v1OmniAge: a compendium of aging omic biomarkers links mitotic clocks to clonal hematopoiesis and causality
Du, Z.; Ling, Y.; Tong, H.; Guo, X.; Teschendorff, A. E.Abstract
Interest in aging 'omic' biomarkers has grown due to their ability to quantify biological age. Most of these biomarkers have been derived in blood and fall into many diverse categories, yet relatively little is known about their correlative patterns, especially between biomarkers from different categories. Here we present the OmniAge R and Python package, a collection of 413 aging omic biomarkers representing 12 different categories, including traditional epigenetic clocks, epigenetic mitotic clocks, DNA methylation-based proxies for clonal hematopoiesis and inflammaging, causal clocks, cell-type specific epigenetic clocks and single-cell transcriptomic clocks. By studying their inter-class correlations across large blood datasets, we reveal associations of mitotic age with clonal hematopoiesis subtypes and causal clocks, which are predictive of cancer risk. Using proxies of serum protein levels, we further dissect associations with mitotic clocks, clonal hematopoiesis and causal clocks into distinct biological processes mapping to key aging pathways. Applying OmniAge to multi-modal data of sorted immune cell-types reveals that age-acceleration estimates derived from transcriptomic and epigenetic clocks correlate, but that this correlation is driven by underlying cell-type heterogeneity. In summary, the OmniAge package is an exploratory tool for evaluating large numbers of aging omic biomarkers, aiding discovery and generating new hypotheses.
bioinformatics2026-05-01v1Modeling healthy proteomic profiles for anomaly detection using subspace learning based one-class classification
Sohrab, F.; Kumar, A.; Ahola, V.; Magis, A.; Hautamaki, V.; Heinaniemi, M.; Huang, S.Abstract
High-throughput plasma proteomics provides sensitive and scalable measurements of thousands of systemic protein profiles from minimally invasive blood samples, creating new opportunities for disease detection and population-scale health monitoring. However, robust statistical modeling remains challenging due to high dimensionality, limited availability and high diversity of diseased samples, resulting in class imbalance in clinical cohorts. Here, we present a subspace One-Class Classification (OCC) framework for proteomics-driven anomaly detection that models healthy proteomic profiles as a reference distribution. To address the limitations of conventional hyperparameter tuning in severely imbalanced data settings, we introduce a fully data-driven parameter estimation strategy that infers all model parameters directly from intrinsic properties of the healthy training data, without using any disease labels. Using plasma proteomics data generated with Olink, we evaluate a family of subspace and graph-embedded subspace extensions of Support Vector Data Description, in which all models operate on learned low-dimensional representations rather than the original feature space. Models are trained exclusively on a healthy reference cohort and evaluated on heterogeneous disease conditions, including multiple cancer types and an independent COVID-19 cohort, with all disease samples withheld from training to enable unbiased assessment of cross-disease generalization. Across disease contexts, the evaluated one-class models yield stable and balanced detection performance, demonstrating that learning structured low-dimensional representations of healthy proteomic variation captures intrinsic biological organization that generalizes across disease-specific perturbations. These results establish healthy-profile-based, subspace one-class learning as a robust and disease-agnostic framework for screening in high-dimensional plasma proteomics.
bioinformatics2026-05-01v1From Generalist to Specialist: Evolution of PS2 α-integrins and Implications for Drug Targeting
Liu, S.; Chen, Y.; Xu, R.-G.; Zhang, H.; Mostafa, F.; Liu, L.Abstract
Integrins are heterodimeric transmembrane receptors that mediate cell-cell and cell-extracellular matrix interactions and play essential roles in development and disease. Within the PS2 alpha-integrin subfamily, four paralogs (alphaIIb, alpha5, alpha8, and alphaV) share a conserved RGD-binding motif yet exhibit diverse functional specializations. Integrins have been widely targeted therapeutically for various clinical conditions, though achieving subtype specificity remains a major challenge. Here, we performed an integrative evolutionary analysis of 114 PS2 alpha-integrin sequences across 28 vertebrate species, combining phylogenetic reconstruction, time calibration, ancestral sequence inference, and structural mapping. Our time-calibrated phylogeny indicates that the PS2 lineage originated ~862 Mya, with diversification of the four paralogs occurring prior to vertebrate radiation. Ancestral state reconstruction reveals that fibronectin and vitronectin binding are ancestral traits, whereas fibrinogen binding and beta3 pairing arose independently in the alphaIIb and alphaV lineages. Evolutionary rate analysis shows domain-specific divergence, with the beta-propeller acting as a hotspot of evolutionary change, likely driven by combined pressures from ligand binding and beta-subunit interaction. These pressures vary across paralogs: alphaIIb exhibits accelerated evolution in ligand-binding regions, while alphaV displays elevated rates in beta-subunit interaction domains. Mapping sequence variation onto structural interfaces identifies lineage-specific substitutions underlying functional divergence, including distinct molecular solutions for fibrinogen binding in alphaIIb and alphaV. These findings collectively demonstrate that PS2 alpha-integrins evolved from a generalist ancestor through neofunctionalization and lineage-specific specialization.
This work provides an evolutionary framework for identifying subtype-specific functional sites and highlights the potential of evolution-informed strategies to guide the development of more selective integrin-targeting therapeutics.
bioinformatics2026-05-01v1Confronting global eradication of TB head on: Uncovering the root of drug resistance and bacterial survival strategies through a comprehensive computational study of first-line TB drug resistant mutations
Pawar, P.; Samarasinghe, S.Abstract
Tuberculosis (TB), which affects millions globally, is fast becoming incurable. Mycobacterium tuberculosis (Mtb), the causative agent of TB, has evolved elusive survival strategies through point mutations in the drug targets leading to a daunting scenario of resistance towards first-line TB drugs, exacerbated by global differences in mutation patterns. Drug resistance studies have focussed on only a few mutations; however, hundreds of mutations have been reported in the last three decades. The WHO goal of global eradication of TB therefore now requires a deep understanding of mechanisms of drug resistance, involving many mutations, addressed in a global context. This study addresses bacterial survival strategies by following bacteria-drug interaction to probe into how bacteria evolve drug resistance mechanisms through mutations. We hypothesise that bacteria favour mutations that protect them from a drug while making the drug ineffective. To test the hypothesis, we quantify the impact of mutations on both bacterial function and drug binding affinity to get to the root of drug resistance, revealing how bacteria may evolve an arsenal of mutations towards an optimal survival strategy. This first comprehensive and systematic in-depth study investigates global patterns of mutation and drug resistance mechanisms from mutation data for Mtb reported over the last 30 years. These were collected for 31,073 drug-resistant Mtb isolates from 149 published studies for the four first-line drugs isoniazid (INH), pyrazinamide (PZA), rifampicin (RIF), and ethambutol (EMB). We found 821 single frequency non-synonymous mutations for INH (n=202), RIF (n=120), EMB (n=226) and PZA (n=273). We then investigated the prevalence and diversity of these mutations in the drug targets across the globe.
We found S315T in the target katG (60%) to be the most prevalent mutation in INH resistance followed by S450L in rpoB (56%) and M306V in embB (29%) associated with RIF and EMB resistance, respectively; these were also the highly occurring mutations across the six WHO regions, except for the most common mutation Q10P in pncA (1.4%) (PZA resistance; with shorter exposure to drug) showing a variable pattern of occurrence globally. We found the highest mutational burden in the Western Pacific and South-East Asia regions for INH and RIF resistance. Frequent mutations had also undergone frequent amino acid substitutions. Accordingly, we developed a comprehensive atlas of mutation spread across the globe and their evolution over the last 30 years. We then probed into the impact of mutations on TB bacteria and drug binding with a comprehensive bioinformatics analysis for understanding crucial changes caused by mutation at the molecular level affecting function and structural stability of bacteria and the drug binding affinity. We found that the most prevalent mutations occur in non-conserved areas in the drug binding region indicating a choice of a less dramatic level of change in target protein function and stability. All mutations reduced drug binding affinity. For characterising drug resistance mechanisms, we introduced a new concept of ranking drug-resistant TB mutations into lethal, moderate, mild and neutral considering the combined effect on Mtb viability and drug binding. We identified 340 mutations as lethal, 284 as moderate, 185 as mild and 12 as neutral. We observed that frequently occurring mutations occur in non-conserved regions causing a mild effect on target proteins (such as S315T of katG, S450L of rpoB and M306V in embB), while reducing drug binding affinity. 
With these we uncovered a universal strategy of drug resistance and bacterial survival: Mtb favours less harmful mutations in the drug binding region without compromising conservancy while destabilising the drugs, thus striking a balance between fitness and drug resistance. This ingenious strategy seems successful and reasonable, persisting globally over three decades, and provides a holistic understanding of drug resistance and a strong foundation for designing efficacious drugs and therapies towards global eradication of TB.
bioinformatics2026-05-01v1Calibrated analysis framework for nanopore direct RNA sequencing uncovers cell-specific m⁶A stoichiometry at conserved sites
Ohnezeit, D.; Loliashvili, E.; Putzel, G.; Verstraten, R.; Liu, J.; Nicholson, L. S.; Pironti, A.; Jaffrey, S. R.; Depledge, D. P.; Wilson, A. C.Abstract
Nanopore direct RNA sequencing (DRS) coupled with Dorado modification-aware basecalling enables mapping of epitranscriptomic modifications including N6-methyladenosine (m6A) at the level of individual RNAs. However, a lack of systematic benchmarking continues to raise questions regarding the sensitivity, specificity, and reproducibility of this method. To address this and to establish a best-practice workflow, we evaluated multiple Dorado versions using in vitro transcribed RNA and an m6A methyltransferase inhibitor as specificity controls. We established that stringent filtering is necessary to reduce false-positive calls and found strong concordance at high-stoichiometry sites when compared to an orthogonal m6A mapping method (GLORI). Further, by applying DRS to primary human fibroblasts and HD10.6 neurons, we uncovered cell type-specific differences in m6A stoichiometry, indicating a finely tuned epitranscriptomic regulation. Our study thus presents the first systematic comparison of Dorado and GLORI from the same input RNA and expands characterization of the m6A epitranscriptome to fibroblasts and neurons.
bioinformatics2026-04-30v4Praxis-BGM: Clustering of Omics Data Using Semi-Supervised Transfer Learning for Gaussian Mixture Models via Natural-Gradient Variational Inference
Jia, Q.; Goodrich, J. A.; Conti, D. V.Abstract
High-dimensional omics data are typically measured on limited sample sizes, which challenges model-based clustering methods such as Gaussian mixture models, often leading to instability and poor generalization under complex mixture structures. To address these limitations, we developed Praxis-BGM, a natural-gradient variational inference framework for Gaussian mixture models that enables semi-supervised transfer learning by incorporating an informative prior Gaussian mixture model derived from large-scale reference data with robust cluster structures. This prior can encode cluster-specific means, covariance structures, and structural connectivity patterns, and is updated using the target data with variational inference to improve clustering in small-sample settings. We derived natural-gradient updates for standard parameters and assessed feature-level contributions to posterior clustering via Bayes Factors. Implemented in the Python library JAX for accelerator-oriented computation, Praxis-BGM is computationally efficient and scalable. Across extensive simulations and two real-world applications (breast cancer bulk transcriptomics for subtype recovery and single-cell transcriptomics for cross-platform label transfer), Praxis-BGM improves posterior clustering performance, stability, and biological interpretability, even when priors are partially mismatched.
bioinformatics2026-04-30v3SpatialQuery: scalable discovery and molecular characterization of multicellular motifs from spatial omics data
An, S.; Keller, M.; Gehlenborg, N.; Hemberg, M.Abstract
Spatially resolved single-cell technologies enable profiling of cells in situ, yet computational approaches that jointly discover multicellular spatial patterns and characterize their molecular programs remain limited. Here we introduce SpatialQuery, a framework that can both identify cellular motifs, i.e. recurrent multicellular co-localization patterns, and perform molecular analyses focused on the motifs. It uncovers genes modulated by spatial contexts through differential expression analysis, and detects coordinated expression changes through covariation analysis. SpatialQuery can identify functional tissue units, and goes beyond pairwise analyses to characterize multicellular interactions. Applications to both spatial transcriptomics and proteomics data uncover cross-germ-layer signaling in gut tube patterning, disease-specific fibrotic and immunosuppressive niches in kidney and colon, and regional determinants of motif-associated transcriptional programs in a mouse brain atlas. SpatialQuery is available as a Python package, and we demonstrate how its light computational footprint enables integration into web-based cell atlas portals for interactive visualization and exploration.
bioinformatics2026-04-30v3Systematic evaluation and benchmarking of text summarization methods for biomedical literature: From word-frequency methods to language models
Baumgärtel, F.; Bono, E.; Fillinger, L.; Galou, L.; Keska-Izworska, K.; Walter, S.; Andorfer, P.; Kratochwill, K.; Perco, P.; Ley, M.Abstract
The rapid expansion of biomedical literature demands automated summarization tools that can reliably condense research articles into concise, accurate overviews. We benchmarked 62 text summarization methods - ranging from frequency-based and TextRank extractors to modern encoder-decoder models (EDMs) and large language models (LLMs) - on a set of 1,000 biomedical abstracts for which author-generated highlights sections were available as reference summaries. Models were evaluated using a composite suite of metrics covering lexical overlap (ROUGE-1/2/L, BLEU, METEOR), embedding-based semantic similarity (RoBERTa, DeBERTa, all-mpnet-base-v2), and factual consistency (AlignScore). Our results indicate that general-purpose language models (LMs) achieve the highest overall scores across both lexical and semantic metrics, outperforming both reasoning-oriented and domain-specific models. Within the general-purpose group, medium-sized models, typically runnable on a single node, often outperform frontier-scale counterparts, suggesting an optimal balance between model capacity and computational efficiency. Statistical extractive methods lag behind all neural approaches. These findings provide a systematic reference for selecting summarization tools in biomedical research and highlight that broad pretraining remains more effective than narrow domain adaptation for generating high-quality scientific summaries.
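As a rough illustration of the lexical-overlap family of metrics named above, ROUGE-1 can be sketched as clipped unigram overlap between a candidate summary and a reference. This is a minimal sketch, not the benchmark's evaluation code: the function name `rouge1_f1`, the whitespace tokenization, and the lowercasing are all assumptions made for illustration.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1 from clipped unigram overlap (illustrative sketch).

    Assumption: tokens are lowercase whitespace-separated words; real
    ROUGE implementations typically add stemming and other normalization.
    """
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Multiset intersection keeps, per token, the smaller of the two counts.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```

A score like this only rewards shared wording, which is presumably why the benchmark pairs it with embedding-based similarity and factual-consistency metrics.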
bioinformatics2026-04-30v3On the use of variational autoencoders for biomedical data integration
Pielies Avelli, M.; Hernandez Medina, R.; Webel, H. E.; Rasmussen, S.Abstract
Variational Autoencoders (VAEs) are a widely used framework to integrate diverse biomedical data modalities, create representations that capture the underlying structure of the datasets, and obtain insights about the relations between variables. Here we describe how this is achieved from an empirical point of view in our previously developed VAE-based framework MOVE, providing an intuitive perspective on the inner workings of multimodal VAEs in biomedical contexts. We explore how the models' emerging dynamics shape their performance and how in silico perturbations can be leveraged to identify potential associations between variables. To do that, we extend our framework to handle perturbations of continuous variables, introduce a new approach to better capture associations between them, and create synthetic datasets to benchmark the proposed methods against well-defined ground truth associations. We finally showcase our findings in real biomedical scenarios, namely a multimodal dataset of inflammatory bowel disease and a dataset containing genetic knockdowns in K562 and RPE1 cells.
bioinformatics2026-04-30v2A gene program dictionary of human cells
Xu, Y.; Wang, Y.; Geng, Z.; Qin, Y.; Ma, S.Abstract
Defining all human cell types and their roles in health and disease is a central goal of biology. Single-cell RNA sequencing has enabled the construction of organ-specific cell atlases, but building a comprehensive organism-wide atlas spanning multiple organs remains challenging due to batch effects, study biases, and inter-organ complexity. Here, we present Gene Program Dictionary (GPD), a framework that leverages robust gene co-expression programs, rather than direct cell integration, to overcome these barriers. Using SpacGPA, a partial correlation-based network method, we analyzed 466 scRNA-seq datasets, generating 1,975 independent networks and 90,701 gene co-expression modules, which were consolidated into 1,534 consensus gene programs representing a wide range of human tissues and cell types. Each program serves as a composite marker, capturing both cell-type-specific and shared biological processes. We demonstrate their utility by mapping endothelial cell subtypes across tissues to reveal their heterogeneity (including tumor-specific programs), annotating colorectal cancer spatial transcriptomes, and linking programs and their corresponding cell types to disease loci, revealing hotspots such as neuronal programs in psychiatric disorders and a proximal tubule program in kidney diseases. GPD provides an organism-wide reference for studying cellular diversity and disease mechanisms.
bioinformatics2026-04-30v2Spartan: activation-aware framework for spatial domain and variable gene discovery
Faiz, M. F. I.; Jokl, E.; Jennings, R.; Piper Hanley, K.; Sharrocks, A.; Iqbal, M.; Baker, S. M.Abstract
Spatial transcriptomics is rapidly advancing toward single-cell-level resolution, revealing complex tissue architectures organized across continuous anatomical gradients. However, accurate identification of spatial domains remains a central computational challenge, as many existing clustering approaches blur anatomical boundaries, merge transitional zones, or fail to resolve localized microstructures. Here we introduce Spartan, an activation-aware multiplex graph framework for high-resolution domain discovery. Spartan integrates spatial topology and Local Spatial Activation (LSA), a neighborhood deviation signal that captures localized transcriptional heterogeneity often attenuated by similarity-based clustering. By jointly modeling cohesion within domains and localized activation structure, Spartan recovers anatomically aligned partitions across spatially resolved transcriptomics technologies including Visium HD, MERFISH, Stereo-seq, and STARmap. We further demonstrate its utility in a high-resolution Visium HD section of developing human esophagus and stomach, where activation-aware graph integration enables precise delineation of complex transitional regions such as the gastroesophageal junction and supports stable multi-scale domain recovery without fragile hyperparameter tuning. Beyond domain identification, Spartan leverages activation-aware structure to detect spatially variable genes associated with localized tissue remodeling. Spartan scales near-linearly with dataset size, providing a robust and interpretable framework for spatial systems-level analysis.
bioinformatics2026-04-30v2PanVariants: Best Practice for Pangenome-based Variant Calling Pipeline and Framework
Yi, H.; Wang, L.; Chen, X.; Ding, Y.; Carroll, A.; Chang, P.-C.; Shafin, K.; Xu, L.; Zeng, X.; Zhao, X.; Gong, M.; Wei, X.; Hou, Y.; Ni, M.Abstract
Background: Although pangenome references offer richer population diversity compared to linear references, current mainstream pangenome-based variant callers are limited to detecting only known variants stored in the graph. To address this limitation, we developed PanVariants, a novel pipeline designed to accurately detect both known and novel variants. We systematically evaluated its performance against the traditional linear alignment solution (BWA+GATK/Manta) and the existing pangenome-aware solution (DRAGEN/PanGenie) in three contexts: small variant (SNV/indel) and structural variant (SV) accuracy in Genome in a Bottle samples, clinical detection on positive samples, and application in cohort-based joint calling. Results: By integrating k-mer-based and mapping-based methods, PanVariants significantly reduced variant errors (FPs + FNs), achieving a 73% reduction compared to BWA+GATK and a 45% reduction compared to DRAGEN for SNVs. Retraining the DeepVariant model with high-quality DNBSEQ data further decreased errors by 15%. For SV detection, PanVariants attained an F1-score of 89.39%, markedly outperforming DRAGEN (68.18%) and BWA+Manta (58.33%), approaching long-read sequencing performance (95.22%). In validation using clinical positive samples, PanVariants successfully detected all expected pathogenic variants while PanGenie failed. In the cohort joint-calling analysis, PanVariants detected more variants, made fewer Mendelian inheritance errors, and gave better per-sample accuracy than GATK. Conclusions: PanVariants establishes a robust framework and best-practice pipeline for pangenome-based variant detection, achieving both sensitive novel variant discovery and high accuracy for SNVs, indels and SVs. Our systematic evaluation of optional processing steps and input variables offers practical guidance for users.
Validated across diagnostic and population-based applications, our findings strongly support the transition from linear to pangenome references in future genomics.
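The headline figures in this abstract (error reduction counted as FPs + FNs, and F1-score) reduce to simple arithmetic on call counts. The sketch below shows that arithmetic only; the function names are hypothetical and no counts from the study are reproduced here.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def error_reduction(baseline_errors: int, new_errors: int) -> float:
    """Fractional reduction in total errors (FP + FN) relative to a baseline.

    Assumption: "errors" means the sum of false positives and false
    negatives, as the abstract's "(FPs + FNs)" suggests.
    """
    if baseline_errors <= 0:
        raise ValueError("baseline_errors must be positive")
    return 1 - new_errors / baseline_errors
```

For instance, with made-up counts, a caller that makes 27 errors where a baseline makes 100 has an error reduction of 0.73.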
bioinformatics2026-04-30v2Harnessing AI to Build Virtual Cells
Cheng, X.; Li, P.; Guo, H.; Liang, Y.; Gong, J.; de Vazelhes, W.; Gou, C.; Xie, P.; Song, L.; Xing, E. P.Abstract
A virtual cell is a world model of a cell: a computational system that predicts, simulates and programs cellular processes across modalities and scales. An important path toward this goal is to model how genetic and chemical perturbations give rise to transcriptional responses, a core capability for disease understanding and drug discovery. However, current approaches remain expert-intensive, relying on iterative manual model design, training and debugging over months. Here we present VCHarness, an autonomous AI system that constructs perturbation-response models by combining an AI coding agent with multimodal biological foundation models. The system explores large spaces of architectures and training pipelines with minimal human intervention, iteratively generating, evaluating and refining candidate models. Across multiple perturbation-response benchmarks, VCHarness identifies architectures that outperform expert-designed approaches while reducing development time from months to days. It further uncovers non-obvious architectural patterns associated with improved performance, indicating that automated search can extend beyond conventional design strategies. These results suggest a shift from manually engineered models toward autonomous systems for constructing components of virtual cell world models, enabling scalable and data-driven exploration of cellular systems.
bioinformatics2026-04-30v2Hierarchical Breakdown of RNA Structure Prediction in CASP16: From Reliable Local Features to Speculative Multimer Assembly
Nithin, C.; Pilla, S. P.; Kmiecik, S.Abstract
CASP16 provided a community-wide benchmark for assessing RNA structure prediction, including the first large-scale blind assessment of RNA-RNA multimer prediction. The results showed that achieving high atomic precision remains a major challenge across the field. In this work, we use the performance of our group (LCBio) as a diagnostic case study to examine the current limits of RNA structure prediction. Our workflow ranked first in the RNA multimer category and remained competitive for monomers. We combine hierarchical analysis with representative case studies to identify a pattern of predictive breakdown, in which modeling fidelity degrades from reliable local features to increasingly speculative global architectures. Multi-helix junctions appear to mark a major transition boundary where 2D topological success often fails to translate into 3D geometric realism, leading to cascading errors in global architecture. This hierarchical breakdown is especially pronounced in RNA multimers, where limitations in the recovery of junction geometry and tertiary interactions propagate directly into errors in higher-order assembly, making multimer prediction increasingly speculative. By placing benchmark performance in a direct structural context, this case study helps define the current limits of RNA structure prediction and highlights priorities for improving predictive accuracy.
bioinformatics2026-04-30v2Survey of the human proteostasis network: the ubiquitin-proteasome system
Elsasser, S.; Powers, E.; Stoeger, T.; Sui, X.; Kurtzbard, R. D.; Martinez-Botia, P.; Wangaline, M. A.; Gama, A. R.; Huttlin, E. L.; Elia, L. P.; Kelly, J. W.; Gestwicki, J. E.; Frydman, J. E.; Finkbeiner, S.; Clerico, E. M.; Morimoto, R.; Prado, M. A.; Vertegaal, A. C. O.; Hofmann, K.; Finley, D.Abstract
Modification by ubiquitination governs the half-lives of thousands of proteins that are fated for elimination by either the proteasome or autophagy pathways, depending on the intricate architectures of ubiquitin modification. This system mediates quality control for individual proteins, protein complexes, and organelles, as well as myriad purely regulatory functions. Here we provide a comprehensive survey of the ubiquitin-proteasome system (UPS), the scope of which is at present poorly defined. The UPS, with the inclusion of pathways involving ubiquitin-like modifiers, comprises in our estimate over 1400 distinct proteins in humans, a vast set of activities whose collective impact on the biology of the cell is pervasive. The UPS is an integral component of the proteostasis network (PN), the remainder of which we have also surveyed in recent studies. With the addition of molecular chaperones, proteins from the autophagy-lysosome pathway, and related activities, the PN includes in total over 3100 components by our estimates. Comprehensive and systematic definition of these pathways should support a range of ongoing investigations in the areas of genomics, proteomics, biochemistry, cell biology, and disease research.
bioinformatics2026-04-30v2Overcoming systematic data biases enables accurate prediction of enzyme kcat fold-changes for computational protein design
Rousset, Y.; Kroll, A.; Lercher, M.Abstract
Machine learning is increasingly used to guide protein engineering by predicting how mutations affect desired properties. Recent models for the turnover number (kcat) of enzymes report high accuracy, suggesting that mutation effects can be inferred directly from protein sequence. However, these approaches are typically evaluated on heterogeneous datasets of enzyme variants, where closely related sequences and systematic reporting patterns may confound model performance. A central challenge is therefore to determine whether current models truly capture mutation-specific effects or instead exploit statistical regularities in the data. Here we show that much of the reported accuracy in mutant kcat prediction arises from two pervasive biases: variants of the same enzyme occupy a narrow activity range, and mutations within a group often share a common direction of change. Simple baselines that exploit these biases match or exceed the performance of existing models, indicating that high apparent accuracy does not imply mechanistic understanding. To address this limitation, we introduce a bias-aware framework that reformulates prediction as a pairwise fold-change task and evaluates performance on unseen mutant-mutant pairs, thereby isolating mutation-specific signal. A proof-of-principle implementation explains approximately one-third of the variance under these conditions and outperforms existing models on leakage-controlled benchmarks. More broadly, this work establishes a general strategy for evaluating and modeling mutation effects in biochemical datasets, with implications for protein engineering and related fields.
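One way to picture the pairwise reformulation described above is to turn the kcat values of one enzyme's variants into fold-change targets between variant pairs. The sketch below is an illustration under stated assumptions, not the authors' implementation: the function name, the use of log10, and the restriction to unordered pairs within a single enzyme are all hypothetical choices.

```python
import math
from itertools import combinations


def pairwise_log_fold_changes(kcat_by_variant: dict[str, float]) -> dict[tuple[str, str], float]:
    """Map each variant pair (a, b) to log10(kcat_b / kcat_a).

    Assumption: all variants belong to the same enzyme and have kcat > 0.
    Pairs are emitted once each, in sorted-name order.
    """
    targets = {}
    for a, b in combinations(sorted(kcat_by_variant), 2):
        targets[(a, b)] = math.log10(kcat_by_variant[b] / kcat_by_variant[a])
    return targets


# Hypothetical kcat values (1/s) for three variants of one enzyme.
example = {"A": 1.0, "B": 10.0, "C": 100.0}
pairs = pairwise_log_fold_changes(example)
```

Because each target is a ratio, multiplying every kcat of an enzyme by the same constant leaves the targets unchanged. That is the sense in which a pairwise formulation can strip out the enzyme-level activity range that simple baselines exploit.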
bioinformatics2026-04-30v2