Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
BacTaxID: A universal framework for standardized bacterial classification
Fernandez-de-Bobadilla, M. D.; Lanza, V. F.
AI Summary
- BacTaxID is a universal framework for bacterial classification using whole-genome k-mer-based sketches, which organizes strains into hierarchical clusters based on user-defined similarity thresholds.
- It provides a direct quantitative link to Average Nucleotide Identity (ANI), showing universal concordance with species and sub-species classifications across 2.3 million genomes from 67 genera.
- BacTaxID outperforms traditional methods in capturing strain-level diversity and replicates SNP and cgMLST-based definitions in surveillance and outbreak scenarios.
Abstract
Bacterial strain typing is key to surveillance, outbreak investigation and microbial ecology, yet current systems remain species-specific, reference-dependent and lack a universal, interpretable metric of genomic relatedness. Here, we introduce BacTaxID, a fully configurable, whole-genome k-mer-based framework that encodes each genome as a numeric sketch and organizes strains into hierarchical clusters with user-defined similarity thresholds. BacTaxID distances are strictly proportional to Average Nucleotide Identity (ANI), providing a direct quantitative link between vectorial typing and genome-wide divergence. Applied to 2.3 million genomes from the AllTheBacteria dataset across 67 genera, BacTaxID demonstrates universal concordance with species and sub-species classification systems, while capturing finer strain-level diversity than traditional reference-based approaches. In simulated surveillance and real outbreak datasets, BacTaxID reproduces SNP and cgMLST-based definitions while enabling rapid, scalable screening. Precomputed genus-level schemes and an open implementation provide a practical, genus-agnostic alternative to classical typing systems for standardized bacterial classification.
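The k-mer-sketch-to-ANI link the abstract relies on can be illustrated with the standard Mash-style estimator, which maps the Jaccard index of two genomes' k-mer sets to a genome-wide distance. This is a generic sketch under assumed parameters (k=21), not BacTaxID's actual sketching code:

```python
# Generic Mash-style k-mer comparison, shown only to illustrate the
# Jaccard -> ANI relationship; BacTaxID's own encoding is not reproduced.
import math

def kmers(seq, k=21):
    """Set of all k-length substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def estimated_ani(seq_a, seq_b, k=21):
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    j = len(a & b) / len(a | b)              # Jaccard index of k-mer sets
    if j == 0.0:
        return 0.0                           # no shared k-mers
    dist = -math.log(2 * j / (1 + j)) / k    # Mash distance
    return 1.0 - dist                        # ANI estimate

genome = "ACGTACGGATCCATGCATCGATCGGATTACA" * 4
assert estimated_ani(genome, genome) == 1.0  # identical genomes -> ANI 1.0
```

Pairwise ANI estimates of this kind can then be thresholded to produce the hierarchical, user-configurable clusters the abstract describes.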
bioinformatics · 2026-02-22 · v3
Protenix-v1: Toward High-Accuracy Open-Source Biomolecular Structure Prediction
Zhang, Y.; Gong, C.; Zhang, H.; Ma, W.; Liu, Z.; Chen, X.; Guan, J.; Wang, L.; Yang, Y.; Xia, Y.; Xiao, W.
AI Summary
- Protenix-v1 (PX-v1) is an open-source biomolecular structure prediction model that outperforms AlphaFold3 with the same constraints, showing improved accuracy with increased sampling budget.
- It includes features like protein template integration and RNA MSA support, with a variant, Protenix-v1-20250630, trained on a larger dataset for enhanced accuracy.
- The study also addresses benchmarking limitations by providing updated evaluation tools and year-stratified benchmarks for more reliable assessments.
Abstract
We introduce Protenix-v1 (PX-v1), the first open-source structure prediction model to attain superior performance to AlphaFold3 while strictly adhering to the same training data cutoff, model size, and inference budget. Beyond standard evaluations, we highlight the effectiveness of inference-time scaling behavior, demonstrating that increasing the sampling budget yields consistent improvements in prediction quality--a behavior previously seen in AlphaFold3 but not in other open-source models. In addition to improved accuracy, Protenix-v1 incorporates key capabilities including protein template integration and RNA MSA support. Furthermore, to better support real-world applications such as drug discovery, we additionally release Protenix-v1-20250630, a variant trained on a larger dataset (cutoff: June 30, 2025), delivering further improved prediction accuracy. Finally, we identify the limitations of current benchmarking tools and we provide updated evaluation tools and year-stratified benchmarks to facilitate more reliable and transparent assessment within the community. Collectively, these contributions provide a robust foundation for the Protenix series and the broader field.
bioinformatics · 2026-02-22 · v3
Bias in genome-wide association test statistics due to omitted interactions
Yelmen, B.; Güler, M. N.; Estonian Biobank Research Team; Kollo, T.; Möls, M.; Charpiat, G.; Jay, F.
AI Summary
- This study investigates how omitting interaction terms in linear models used for GWAS can bias test statistics, specifically by altering the mean and variance of the null statistic.
- Through mathematical derivation and simulation using Estonian Biobank data, the study shows that ignoring epistasis can lead to an anti-conservative regime, inflating significance.
- The findings suggest that GWAS results based on linear models might include spurious associations, urging caution in their interpretation.
Abstract
Over the past two decades, genome-wide association studies (GWAS) have enabled the discovery of thousands of variants associated with many complex human traits. However, conventional GWAS are still widely performed with linear models under the assumption that the genetic effects are predominantly additive. In this work, we investigate the test statistic behavior when linear models are used to obtain significant genotype-phenotype associations without accounting for epistasis. We first algebraically derive the mean and variance shifts in the null statistic due to the omitted interaction term, and define the boundary between conservative (i.e., deflated statistic tail) and anti-conservative (i.e., inflated statistic tail) regimes for the common GWAS significance threshold. We then perform phenotype simulation analyses using the Estonian Biobank genotypes and validate the mathematical model. We demonstrate that the anti-conservative regime is plausible under realistic parameter settings and that models omitting interaction terms can produce spurious significance. Our findings suggest caution when interpreting statistically significant signals reported in the literature based on linear models, especially for large-scale GWAS.
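The qualitative effect the abstract describes can be reproduced in a few lines: a toy phenotype with a purely epistatic architecture (no main effects) still yields a nonzero marginal slope when the interaction term is omitted from the linear model. This is an illustrative simulation under assumed parameters, not the paper's derivation or the Estonian Biobank analysis:

```python
# Toy illustration (not the paper's derivation): a purely epistatic
# phenotype y = g1*g2 with no main effects still produces a nonzero
# marginal slope for g1 when the interaction term is omitted.
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
g1 = rng.binomial(2, 0.5, n).astype(float)   # genotypes coded 0/1/2
g2 = rng.binomial(2, 0.5, n).astype(float)
y = g1 * g2                                   # pure interaction phenotype

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

ones = np.ones(n)
b_marginal = ols(np.column_stack([ones, g1]), y)            # interaction omitted
b_full = ols(np.column_stack([ones, g1, g2, g1 * g2]), y)   # interaction included

# The misspecified model attributes the epistatic signal to g1
# (slope -> E[g2] = 1); the full model gives g1 a null main effect.
assert 0.9 < b_marginal[1] < 1.1
assert abs(b_full[1]) < 1e-8
```

Because the uncentered genotypes make Cov(g1, g1·g2) proportional to E[g2]·Var(g1), the marginal fit converges to a slope of E[g2] = 1, a spurious "association" of exactly the kind the paper analyzes.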
bioinformatics · 2026-02-22 · v2
Cellects, a software to quantify cell expansion and motion
Boussard, A.; Petit, M.; Arrufat, P.; Dussutour, A.; Perez-Escudero, A.
AI Summary
- Cellects is an open-source software designed for automated quantification of cell expansion, motion, and morphology from 2D and time-lapse images across various biological systems.
- It features a graphical interface for interactive parameter tuning, visualization, and batch processing, with customization options via a Python API.
- The software outputs quantitative data like area and trajectory in .csv format, facilitating statistical analysis and integration into existing research workflows.
Abstract
Cellects is a user-friendly and open-source software for automated quantification of biological growth, motion, and morphology from 2D image data and time-lapse sequences (2D + t), acquired under a wide range of experimental conditions and biological systems (from fungal colonies to unicellular branching networks). The software is available as a stand-alone version, featuring a graphical interface that supports interactive parameter tuning, visualization, validation, and batch processing. The analysis pipeline can be extended and customized using a dedicated Python API. The typical inputs and outputs are as follows. Cellects is designed to process grayscale or color images originating from standard microscopy, macroscopic imaging setups, or camera-based platforms. The software supports single or multiple organisms growing or moving in one or several arenas and can analyze multiple folders sequentially. All quantitative results (area, circularity, orientation axes, centroid trajectories, oscillations, network topology) are exported as standardized .csv files suitable for downstream statistical analysis, ensuring reproducibility and integration into existing workflows.
bioinformatics · 2026-02-22 · v2
STELAR-X: Scaling Coalescent-Based Species Tree Inference to 100,000 Species and Beyond
Saha, A.; Bayzid, M. S.
AI Summary
- STELAR-X is introduced as a scalable, triplet-based phylogenetic inference algorithm for species tree reconstruction under the multispecies coalescent model, optimized for ultra-large datasets.
- It achieves O(nk) memory complexity for n species and k gene trees, significantly reducing both running time and memory usage compared to existing methods like ASTRAL-MP.
- Experiments showed STELAR-X analyzed datasets with 100,000 taxa in 8.5 hours and 100,000 genes in 4 minutes, demonstrating unprecedented scalability.
Abstract
Summary methods reconstruct species trees from collections of gene trees by accounting for gene tree discordance, and offer a statistically consistent framework for phylogenomic inference under the multispecies coalescent model. While existing triplet- and quartet-based approaches such as ASTRAL and STELAR have provable statistical consistency, their running time and memory usage restrict their applicability to ultra-large datasets. We introduce STELAR-X, a statistically consistent and highly scalable triplet-based phylogenetic inference algorithm that achieves an asymptotically optimal memory complexity of O(nk) for n species and k gene trees--essentially matching the input size and allowing analyses to remain feasible as long as the input trees fit in memory--while also substantially reducing running time. STELAR-X achieves this by a comprehensive re-engineering of the underlying data structures and algorithms. We introduce a novel, compact integer tuple-based encoding of tree bipartitions and efficient procedures for rapid pre-computation of bipartition weights. We further leverage GPU parallelism for fast pre-computation of necessary weights. This improved and redesigned computational framework underpins a dynamic programming algorithm with substantially reduced computational overhead. Extensive experiments demonstrate that STELAR-X achieves unprecedented scalability. On simulated datasets with 10,000 taxa and 1,000 gene trees, STELAR-X runs 712x faster than ASTRAL-MP (the most scalable variant of ASTRAL) while using 7.5x less CPU memory. Most significantly, STELAR-X analyzed a dataset of 100,000 taxa and 1,000 genes in 8.5 hours using 86 GB RAM, and a 100,000-gene dataset with 1,000 taxa in just 4 minutes using 106 GB RAM--scales that were previously intractable for statistically consistent summary methods. STELAR-X is publicly available at https://github.com/aaniksahaa/STELAR-X.
bioinformatics · 2026-02-22 · v2
Bacterial protein function prediction via multimodal deep learning
Muzio, G.; Adamer, M.; Fernandez, L.; Miklautz, L.; Borgwardt, K.; Avican, K.
AI Summary
- Developed DeepEST, a multimodal deep learning framework to predict bacterial protein functions by assigning GO terms, using gene expression, location, and protein structure.
- DeepEST integrates a multi-layer perceptron for expression/location data and a structure-based predictor, enhanced by a novel masked loss function for bacterial species.
- DeepEST outperformed existing methods on a 25-species benchmark and predicted functions for unclassified proteins in 25 human bacterial pathogens.
Abstract
Bacterial proteins are specialized with extensive functional diversity for survival in diverse and stressful environments. A significant portion of these proteins remains functionally uncharacterized, limiting our understanding of bacterial survival mechanisms. Hence, we developed Deep Expression STructure (DeepEST), a multimodal deep learning framework designed to accurately predict protein function in bacteria by assigning Gene Ontology (GO) terms. DeepEST comprises two modules: a multi-layer perceptron that takes gene expression and gene location as input features, and a protein structure-based predictor. Within DeepEST, we integrated these modules through a learnable weighted linear combination and introduced a novel masked loss function to fine-tune the structure-based predictor for bacterial species. These modeling choices are particularly well suited for bacteria due to the spatial organization of their circular genomes. Functionally related genes frequently co-localize and are co-transcribed within operons, allowing transcription dynamics to serve as crucial, condition-dependent regulatory signals. We show that DeepEST outperforms, on a 25-species benchmark, existing protein function prediction methods that rely solely on amino acid sequence or protein structure. Moreover, DeepEST predicts GO terms for unclassified hypothetical proteins across 25 human bacterial pathogens, facilitating the design of experimental setups for characterization studies. By combining expression, localization, and structure information in a unified deep learning framework, DeepEST bridges organism-specific data integration and structure-based transfer learning, providing a method tailored for bacterial protein function prediction in settings with structural and multi-condition expression data.
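The two modeling ideas named above, a learnable weighted linear combination of module outputs and a masked loss restricted to annotated GO terms, can be sketched in numpy. The shapes, names, and sigmoid parameterization are illustrative assumptions, not DeepEST's actual code:

```python
# Hedged numpy sketch of two ideas named in the abstract: a learnable
# weighted linear combination of the two modules' GO-term scores, and a
# masked loss that skips unannotated terms. Names/shapes are illustrative.
import numpy as np

def combine(p_expr_loc, p_struct, w):
    """Convex combination; sigmoid keeps the learnable weight in (0, 1)."""
    a = 1.0 / (1.0 + np.exp(-w))
    return a * p_expr_loc + (1.0 - a) * p_struct

def masked_bce(pred, target, mask, eps=1e-7):
    """Binary cross-entropy averaged only over annotated (mask == 1) terms."""
    pred = np.clip(pred, eps, 1.0 - eps)
    loss = -(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred))
    return (loss * mask).sum() / mask.sum()

p = combine(np.array([0.9, 0.2, 0.5]), np.array([0.7, 0.4, 0.5]), w=0.0)
assert np.allclose(p, [0.8, 0.3, 0.5])   # w=0 -> equal weighting

target = np.array([1.0, 0.0, 1.0])
mask = np.array([1.0, 1.0, 0.0])         # third GO term unannotated
assert masked_bce(p, target, mask) < masked_bce(p, target, np.ones(3))
```

Masking unannotated terms keeps the missing labels from being treated as true negatives, which is the problem a masked loss addresses.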
bioinformatics · 2026-02-22 · v2
Deciphering Features of Metalloprotease Cleavage Targets Using Protein Structure Prediction
Chung, D. S.; Park, J.; Choi, W.; Hong, D.
AI Summary
- The study developed a computational model using protein structure prediction to identify and classify substrates of ADAM10, a metalloprotease, and predict its cleavage sites.
- The model analyzed predicted protein complexes, focusing on protein-protein interactions, structural information of cleavage sites, and spatial relationships with metal ions.
- Results showed the model's effectiveness in substrate classification and cleavage site prediction, with potential applications to other ADAM family members and metalloproteases.
Abstract
Metalloproteases are a class of enzymes that utilize metal ions within their active sites to catalyze the hydrolysis of peptide bonds in proteins. Among these, ADAM10 (A Disintegrin and Metalloproteinase 10), a member of the ADAM family, plays a crucial role in mediating intracellular signaling by cleaving specific substrates, thereby influencing a variety of physiological and pathological processes. The mechanisms underlying the activity of ADAM10 present significant opportunities for the development of novel therapeutic strategies aimed at disease intervention. However, the information available to identify the substrates and cleavage sites of ADAM10 is still insufficient. It is therefore essential to identify and classify the features of substrates and to elucidate cleavage sites, yet doing so experimentally across numerous proteins presents significant challenges. To address these challenges, we developed a model that leverages protein structure prediction to decipher substrate features, classify substrates, and predict cleavage sites. Through the analysis of predicted protein complexes between ADAM10 and its substrates using PDB files, we evaluated protein-protein interaction (PPI) data, the structural information of cleavage sites, and the spatial relationships between the cleavage sites and metal ions. Finally, we present a computational model that effectively classifies substrates and accurately predicts cleavage sites. Our study demonstrates the potential for application not only to ADAM10 but also to other members of the ADAM family and, more broadly, to additional metalloproteases. By leveraging computed protein structural information, our approach offers a novel perspective for substrate classification.
bioinformatics · 2026-02-22 · v2
Leveraging Large Language Models to Extract Prognostic Pathology Features in Ewing Sarcoma
Huang, J.; Batool, A.; Gu, Z.; Zhao, Z.; Yao, B.; Black, J.; Davis, J.; al-Ibraheemi, A.; DuBois, S.; Barkauskas, D.; Ramakrishnan, S.; Hall, D.; Grohar, P.; Xie, Y.; Xiao, G.; Leavey, P. J.
AI Summary
- The study aimed to use Large Language Models (LLMs) to extract prognostic histologic features from pathology reports of Ewing sarcoma patients from six COG trials.
- The LLM achieved high accuracy (94-98.1%) in extracting 17 IHC markers, surpassing human annotators in some cases.
- Survival analysis revealed NSE as a negative prognostic marker (HR 2.15) and S100 as a positive one (HR 0.58), suggesting potential for refining risk stratification in Ewing sarcoma.
Abstract
Background: Current risk stratification for Ewing sarcoma relies heavily on clinical factors such as metastatic status, failing to capture histologic heterogeneity as a potential prognostic indicator. Although pathology reports contain rich biological data, this information remains locked in unstructured narrative text, limiting large-scale retrospective analyses. We aimed to validate the utility of Large Language Models (LLMs) for scalable data abstraction and to identify prognostic histologic features from a large multi-institutional cohort. Methods: We conducted a retrospective cohort study using data from six Children's Oncology Group (COG) clinical trials. We utilized an LLM-based pipeline (OpenAI o3) to extract structured variables, including immunohistochemical (IHC) markers and CD99 staining patterns, from digitized, Optical Character Recognition (OCR)-processed pathology reports. Extraction accuracy was validated against a human-annotated ground truth (n=200) and cross-validated against senior experts (n=48). We assessed the association between extracted features and Overall Survival (OS) using Kaplan-Meier analysis and multivariable Cox proportional hazards regression, adjusting for metastatic status. Findings: We analyzed 931 diagnostic pathology reports spanning 19 years. The LLM achieved a weighted average accuracy of 94% across 17 IHC markers; in a cross-validation subset, the LLM outperformed human annotators (weighted average accuracy over 15 IHC markers: LLM o3, 98.1%; resident specialist, 91.4%; senior expert, 95.9%). Survival analysis identified Neuron-Specific Enolase (NSE) and S100 as significant prognostic biomarkers. After adjusting for metastatic status, NSE positivity was associated with significantly inferior survival (HR 2.15, 95% CI 1.15-4.02, p=0.016); this risk was most pronounced in patients with non-metastatic disease (HR 5.64, p=0.0055). Conversely, S100 positivity was associated with improved survival (HR 0.58, 95% CI 0.34-1.00, p=0.046). Interpretation: LLM-assisted extraction of pathology variables is highly accurate and scalable, capable of unlocking "dark data" from historical clinical trials. We identified NSE as a potent risk factor and S100 as a protective marker in Ewing sarcoma, particularly in localized disease. These findings suggest that AI-derived histologic data can refine risk stratification and, if validated, warrant inclusion in future prospective trials.
bioinformatics · 2026-02-22 · v1
Hybrid MD-generative modeling expands RNA ensembles to include cryptic ligand-binding conformations: application to HIV-1 TAR
Kurisaki, I.; Hamada, M.
AI Summary
- The study uses Molearn, a hybrid MD-generative deep learning model, to explore cryptic ligand-binding conformations in apo RNA structures, focusing on HIV-1 TAR.
- Molearn was trained on apo TAR conformations to generate a diverse ensemble, from which potential MV2003-binding conformations were identified.
- Docking simulations showed that these generated conformations could bind MV2003 with interaction scores similar to those from NMR structures, demonstrating the model's ability to predict novel ligand-binding RNA states.
Abstract
Cryptic ligand-binding sites--absent in apo structures but formed upon conformational rearrangement--offer high specificity for RNA-ligand recognition, yet remain rare among experimentally resolved RNA-ligand complex structures and difficult to predict in silico. We apply Molearn, a hybrid molecular dynamics-generative deep learning model, to expand apo RNA conformational ensembles to include cryptic states. Focusing on the paradigmatic HIV-1 TAR-MV2003 system, Molearn was trained exclusively on apo TAR conformations and used to generate a diverse ensemble of TAR structures. Candidate cryptic MV2003-binding conformations were subsequently identified using post-generation geometric analyses. Docking simulations of these conformations with MV2003 yielded binding poses with RNA-ligand interaction scores comparable to those of NMR-derived complexes. Notably, this work provides the first demonstration that a generative modeling framework can access cryptic RNA conformations that are ligand-binding competent and have not been observed in prior molecular-dynamics and deep-learning studies.
bioinformatics · 2026-02-21 · v7
Defining the Active Conformation of Typical Protein Kinase Domains from Substrate-Bound PDB Structures Enables Active-State AlphaFold2 Models for All 437 Human Catalytic Protein Kinases
Gizzio, J.; Faezov, B.; Xu, Q.; Dunbrack, R. L.
AI Summary
- The study aimed to define the active conformation of human protein kinase domains by analyzing 248 substrate-bound kinase structures, identifying criteria for the active state.
- Using these criteria, only 30% of the 437 human catalytic kinases were found in active form in the PDB.
- AlphaFold2 was employed to model all 437 kinases in their active form, with 71% of models achieving a backbone RMSD < 1.0 Å to benchmark structures, enhancing understanding of kinase activity in diseases like cancer.
Abstract
Humans have 437 catalytically competent protein kinase domains with the typical kinase fold, similar to the structure of Protein Kinase A (PKA). The active form of a kinase must satisfy requirements for binding ATP, magnesium, and substrate. From structural bioinformatics analysis of 248 crystal structures of 54 unique substrate-bound kinases, we derived structural criteria for the active form of typical protein kinases. We include well-known requirements on the DFG motif of the activation loop and the N-terminal domain salt bridge, but also on the positions of the N-terminal and C-terminal segments of the activation loop that must be placed appropriately to bind substrate. With these criteria, only 130 of the 437 human catalytic protein kinases (30%) are in the Protein Data Bank in their active form. Because the active forms of catalytic kinases are needed for understanding substrate specificity and the effects of mutations on catalytic activity in cancer and other diseases, we used AlphaFold2 to produce models of all 437 human protein kinases in the active form. This was accomplished with templates from the PDB that resemble substrate-bound structures, shallow multiple sequence alignments of orthologs and close paralogs of the query protein, and application of the active-kinase criteria to the output models. We selected models for each kinase based on intramolecular ipSAE scores of the activation loop residues of these models, demonstrating that the highest scoring models have the lowest or close to the lowest RMSD to 29 non-redundant substrate-bound structures in the PDB. A larger benchmark of 117 active kinase structures with solved activation loops in the PDB shows that 71% of the highest scoring AlphaFold2 models had backbone RMSD < 1.0 Å to the benchmark structures and 92% were within 2.0 Å. Models for all 437 catalytic kinases are available at https://dunbrack.fccc.edu/kincore/activemodels. We believe they may be useful for interpreting mutations leading to constitutive catalytic activity in cancer, as well as templates for modeling substrate and inhibitor binding for molecules that bind to the active state.
bioinformatics · 2026-02-21 · v2
DynaBiomeX: An Interpretable Dual-Strategy Deep Learning Framework for Architectural Noise Filtration in Sparse Longitudinal Microbiome Data
Qureshi, A.; Wahid, A.; Qazi, S.; Shahzad, M. K. K.
AI Summary
- DynaBiomeX is a dual-strategy deep learning framework designed to filter noise in sparse longitudinal microbiome data, integrating Stacking Ensembles (Bi-LSTM, GRU) and an adapted Temporal Fusion Transformer (TFT).
- It was validated on 1,871 hematopoietic cell transplantation patients to detect gut dysbiosis, with the ensembles acting as high-sensitivity screeners (ROC-AUC = 0.912) and the TFT as a precision Sentinel (Precision = 1.0, MCC = 0.646).
- The TFT showed superior calibration (ECE = 0.0085), and ablation studies confirmed robustness without clinical covariates (ROC-AUC > 0.81).
Abstract
Longitudinal microbiome datasets present unique challenges due to extreme sparsity, zero-inflation, and non-stationary behavior. Conventional Recurrent Neural Networks (RNNs) struggle to distinguish structural from sampling zeros in these contexts, limiting their utility for Clinical Decision Support (CDS). We introduce DynaBiomeX, an interpretable framework specifically developed for sparse biomedical time-series. It integrates Stacking Ensembles (Bi-LSTM, GRU) with an adapted Temporal Fusion Transformer (TFT) in a unified Screener-Sentinel workflow. The Ensembles optimize collective decision boundaries to maximize sensitivity and minimize missed cases. Concurrently, the TFT functions as a Physiological Gatekeeper, utilizing Gated Residual Networks (GRN) to actively filter stochastic noise from real biological signals. We validated this approach on a multi-modal dataset of 1,871 hematopoietic cell transplantation (HCT) patients to detect gut dysbiosis. Stacking ensembles maximized discriminative performance (ROC-AUC = 0.912), effectively serving as high-sensitivity screeners. In contrast, the Adapted TFT functioned as a precision Sentinel, achieving zero false positives (Precision = 1.0) and high stability (MCC = 0.646). Crucially, the TFT demonstrated superior probabilistic reliability with a low Expected Calibration Error (ECE = 0.0085), addressing the "black-box" overconfidence typical of deep learning models. Ablation studies confirmed predictive robustness even without clinical covariates (ROC-AUC > 0.81). DynaBiomeX couples sensitive screening with precise, calibrated validation to robustly analyze sparse longitudinal data. Validated on microbiome dysbiosis, this framework offers a scalable template for zero-inflated domains like single-cell sequencing and EHR monitoring.
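Expected Calibration Error, the reliability metric the abstract reports for the TFT (ECE = 0.0085), is commonly computed with equal-width confidence bins; a generic sketch, not the paper's implementation, is:

```python
# Equal-width-binned Expected Calibration Error (ECE): the gap between
# mean predicted probability and empirical frequency per bin, weighted by
# bin occupancy. Generic sketch; bin count and edge handling are assumptions.
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    bins = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = abs(probs[mask].mean() - labels[mask].mean())
            ece += mask.mean() * gap       # occupancy-weighted gap
    return ece

# perfectly calibrated toy forecasts -> ECE ~ 0
p = np.array([0.2] * 5 + [0.8] * 5)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
assert expected_calibration_error(p, y) < 1e-9
```

A low ECE means predicted probabilities can be read as frequencies, which is the overconfidence concern the abstract raises about typical deep models.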
bioinformatics2026-02-21v2SuperCell2.0 enables semi-supervised construction of multimodal metacell atlases
Herault, L.; Gabriel, A. A.; Duc, B.; Dolfi, B.; Shah, A.; Joyce, J. A.; Gfeller, D.
AI Summary
- SuperCell2.0 is introduced as a workflow for constructing semi-supervised multimodal metacells from large single-cell datasets.
- It was found that multimodal metacells outperform single-modality metacells, enhancing inter-modality consistency and integration of multiomic data.
- The workflow identified interferon-primed monocytes and macrophages in blood and tumor samples, with markers used to characterize this population in healthy donors.
Abstract
Multimodal single-cell atlases comprising hundreds of thousands of cells provide unique resources for exploring complex biological tissues and generating testable hypotheses. To streamline the analysis of such large datasets, we introduce SuperCell2.0, a robust workflow to build (semi-)supervised multimodal metacells. We demonstrate that multimodal metacells outperform metacells built with a single modality, improve inter-modality consistency, and facilitate integration of multiomic single-cell datasets. SuperCell2.0 can further leverage full or partial cell type annotations to improve metacell quality. This workflow enables us to construct multimodal metacell atlases from blood and tumor samples and identifies interferon-primed monocytes and macrophages in the circulation and in the tumor microenvironment. Markers derived from the metacell analysis enable us to sort and phenotypically characterize this population in healthy donors. Overall, our work demonstrates how SuperCell2.0 facilitates the analysis of large multimodal single-cell atlases.
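At its core, a metacell is an aggregate of transcriptionally similar cells; a minimal sketch of the pooling step, with the cell-to-metacell assignment taken as given (SuperCell2.0's graph-based grouping is not reproduced here), is:

```python
# Toy metacell pooling: sum raw counts of cells assigned to the same
# metacell. The assignment vector stands in for SuperCell2.0's graph-based
# grouping, which is not reproduced here.
import numpy as np

def build_metacells(counts, assignment):
    """counts: cells x genes matrix; assignment: metacell id per cell."""
    ids = np.unique(assignment)
    pooled = np.stack([counts[assignment == i].sum(axis=0) for i in ids])
    return pooled, ids

counts = np.array([[1, 0],    # cell 0
                   [3, 2],    # cell 1
                   [0, 4]])   # cell 2
mc, ids = build_metacells(counts, np.array([0, 0, 1]))
assert mc.tolist() == [[4, 2], [0, 4]]   # cells 0+1 pooled; cell 2 alone
```

In the multimodal setting the abstract describes, the same assignment would be applied to each modality's matrix so the metacells stay consistent across modalities.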
bioinformatics · 2026-02-20 · v1
ProteoMapper: Alignment-Aware Identification and Quantitative Analysis of Contextual Motif-Domain Patterns in Protein Families
Sefa, S. M.; Sarkar, J.; Robin, A. H. K.; Uddin, M.
AI Summary
- ProteoMapper integrates domain annotation with motif detection to analyze spatial relationships in protein families, introducing metrics like positional conservation scoring and Motif-Domain Coverage Score (MDCS).
- The tool processes alignments in Excel format, providing rapid analysis and color-coded reports, validated across three protein families with high accuracy.
- In Arabidopsis ERD6-like sugar transporters, MDCS analysis showed PROSITE signatures PS00216 and PS00217 are fully domain-embedded but differ in evolutionary conservation, suggesting subfunctionalization.
Abstract
Protein function depends on interactions between structural domains and regulatory motifs. Yet current tools analyze these elements separately, hindering investigation of disease mutations affecting evolutionarily conserved, structurally constrained motifs. We present ProteoMapper, a computational framework integrating HMMER-based domain annotation with user-defined motif detection to quantify motif-domain spatial relationships in protein families. ProteoMapper introduces two discovery metrics: (1) positional conservation scoring, identifying motifs at identical alignment coordinates in ≥N% of sequences (default 60%), indicating purifying selection; (2) Motif-Domain Coverage Score (MDCS), quantifying motif embedding within Pfam domains (MDCS=1: fully embedded; MDCS=0: extra-domain). The platform processes Excel-formatted alignments without programming requirements, delivering color-coded reports with conserved motif positions, domain boundaries, and MDCS values. Parallel execution of sequence batches enables rapid analysis (8 motifs were searched in 150 sequences with complete Pfam scanning in <6 seconds on standard hardware). Validation across three protein families confirmed technical accuracy and biological insight. In PLATZ transcription factors (24 proteins), domain predictions achieved 0.94 mean intersection-over-union versus published annotations, exactly reproducing 22 of 23 reported spans. In Arabidopsis ERD6-like sugar transporters (17 proteins), MDCS analysis revealed canonical PROSITE signatures PS00216 and PS00217 are equally domain-embedded (MDCS=1.0) but evolutionarily divergent. PS00217 shows positional conservation (58.8% of sequences) while PS00216 exhibits dispersal, suggesting subfunctionalization. In tomato actin-depolymerizing factors (11 proteins), domain detection achieved 100% sensitivity with >93% positional concordance.
ProteoMapper enables hypothesis-driven investigation of evolutionary constraints, regulatory mechanisms, and variant effect prediction in biomedical and functional proteomics. Source code, documentation, and test results with datasets at https://github.com/sifullah0/ProteoMapper.
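The two discovery metrics described above are simple enough to sketch. Below is a minimal illustration, not the ProteoMapper implementation: the function names and the half-open coordinate convention are my own, and real motif hits would come from a scanner rather than be passed in directly.

```python
from collections import Counter

def mdcs(motif_span, domain_spans):
    """Motif-Domain Coverage Score: fraction of motif positions falling
    inside any annotated domain (1.0 = fully embedded, 0.0 = extra-domain).
    Spans are half-open [start, end) intervals."""
    start, end = motif_span
    positions = range(start, end)
    covered = sum(
        any(d_start <= p < d_end for d_start, d_end in domain_spans)
        for p in positions
    )
    return covered / len(positions)

def positionally_conserved(motif_starts, n_sequences, threshold=0.60):
    """Flag alignment coordinates where a motif starts in >= threshold of
    the sequences (one start coordinate per per-sequence hit)."""
    counts = Counter(motif_starts)
    return {pos for pos, c in counts.items() if c / n_sequences >= threshold}
```

For example, a 10-residue motif at alignment positions 10-19 that overlaps a domain beginning at position 15 scores MDCS = 0.5.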
bioinformatics2026-02-20v1Geometric-aware and interpretable deep learning for single-cell batch correction via explicit disentanglement and optimal transport
Jiang, C.; Zheng, R.; Ji, Y.; Cao, S.; Fang, Y.; Wang, Z.; Wang, R.; Liang, S.; Tao, S.AI Summary
- The study introduces iDLC, a deep learning framework for single-cell RNA sequencing batch correction, using explicit feature disentanglement and optimal transport for dual-level correction.
- iDLC separates biological from technical components in a structured latent space and uses mutual nearest neighbors for geometric alignment.
- Evaluations on various datasets show iDLC effectively removes batch effects, preserves cell subtypes, and outperforms existing methods in both correction and biological fidelity.
Abstract
Single-cell RNA sequencing enables high-resolution characterization of cellular heterogeneity, yet integrating datasets from diverse sources remains challenging due to batch effects. Current methods, which rely on implicit feature disentanglement and lack geometric constraints, often result in under-correction, over-correction, or compromised biological fidelity. Here, we present iDLC, an interpretable deep learning framework that performs dual-level correction through explicit feature disentanglement and optimal transport-regularized adversarial alignment. iDLC separates biological and technical components within a structured latent space, then leverages high-confidence mutual nearest neighbor pairs to guide geometrically constrained distribution alignment. Systematic evaluation across pancreatic cancer datasets with varying batch effect intensities, multi-source human immune cells, and large-scale cross-species atlases demonstrates that iDLC robustly eliminates complex batch effects while preserving fine-grained cell subtypes, continuous developmental trajectories, and rare populations. The framework scales efficiently to datasets exceeding one million cells and consistently outperforms existing methods in both batch correction and biological conservation metrics. iDLC provides a principled and reliable tool for constructing unified single-cell reference atlases across diverse experimental conditions and biological systems.
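The mutual-nearest-neighbor anchors that guide the alignment step can be illustrated with a brute-force sketch. This is a generic MNN computation under assumed Euclidean distance in the latent space (the abstract does not specify the metric), not the iDLC code:

```python
import numpy as np

def mutual_nearest_neighbors(X, Y, k=5):
    """Return (i, j) pairs where cell i in batch X and cell j in batch Y
    are within each other's k nearest neighbors. Such high-confidence
    cross-batch anchors can then guide distribution alignment."""
    # Pairwise Euclidean distances, shape (|X|, |Y|)
    d = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)
    knn_xy = np.argsort(d, axis=1)[:, :k]    # for each x, its k nearest y
    knn_yx = np.argsort(d, axis=0)[:k, :].T  # for each y, its k nearest x
    pairs = []
    for i in range(X.shape[0]):
        for j in knn_xy[i]:
            if i in knn_yx[j]:               # mutuality check
                pairs.append((i, int(j)))
    return pairs
```

For large datasets one would replace the dense distance matrix with an approximate nearest-neighbor index; the quadratic version above is only meant to show the mutuality criterion.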
bioinformatics2026-02-20v1OT-knn: a neighborhood-aware optimal transport framework for aligning spatial transcriptomics data
Song, J.; Li, Q.AI Summary
- OT-knn is introduced as a method for aligning spatial transcriptomics (ST) data by integrating local neighborhood information into an optimal transport framework.
- It reconstructs each spot using its k-nearest neighbors to capture microenvironment context, enhancing robustness against noise and variability.
- Evaluations on simulated and real datasets, including human and mouse brain data, show OT-knn achieves accurate alignment despite spatial deformation, donor heterogeneity, and developmental variation.
Abstract
Spatial transcriptomics (ST) measures gene expression while preserving spatial context within tissues, enabling detailed characterization of tissue organization. As ST technologies advance, aligning datasets across tissue sections, individuals, platforms, and developmental stages has become increasingly important but remains challenging due to sparse expression, biological heterogeneity, and geometric distortions between slices. We introduce OT-knn, a method for ST alignment that integrates local neighborhood information within an optimal transport framework. Rather than relying solely on single-spot expression, OT-knn reconstructs each spot using its spatial k-nearest neighbors, capturing microenvironment context that is more robust to noise and variability. These representations are then used to derive probabilistic correspondences between slices. We evaluate OT-knn using simulated data with known ground-truth alignment and real datasets from multiple ST platforms, including human dorsolateral prefrontal cortex data (10x Genomics Visium), mouse brain aging data with both within-donor and cross-donor comparisons (MERFISH), and a multi-stage axolotl brain dataset (Stereo-seq). Across these settings, OT-knn achieves accurate and robust alignment, particularly in the presence of spatial deformation, donor heterogeneity, and developmental variation.
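The two ingredients of OT-knn, reconstructing each spot from its spatial k-nearest neighbors and deriving probabilistic correspondences via optimal transport, can each be sketched in a few lines. This is a generic illustration with uniform marginals and entropic regularization, not the authors' implementation:

```python
import numpy as np

def knn_smooth(coords, expr, k=3):
    """Represent each spot as the mean expression of its k spatially
    nearest spots (itself included), capturing microenvironment context."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    nn = np.argsort(d, axis=1)[:, :k]          # indices of k nearest spots
    return expr[nn].mean(axis=1)

def sinkhorn(cost, reg=0.1, n_iter=200):
    """Entropy-regularized OT plan between uniform marginals; the plan's
    rows give soft correspondences from slice A spots to slice B spots."""
    n, m = cost.shape
    a, b = np.ones(n) / n, np.ones(m) / m
    K = np.exp(-cost / reg)
    v = np.ones(m) / m
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]
```

In practice the cost matrix would be built from distances between the smoothed expression profiles of the two slices, so that spots with similar neighborhoods receive high transport mass.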
bioinformatics2026-02-20v1wavess 1.2: Presenting an HLA-aware within-host virus sequence simulation framework
Lapp, Z.; Leitner, T.AI Summary
- The study extends the wavess framework to simulate within-host virus sequence evolution by incorporating an HLA-aware CD8+ CTL response and variable recombination rates.
- This allows for more accurate modeling of virus sequences, especially in regions influenced by CTLs, and supports investigations into how these mechanisms affect within-host evolution.
Abstract
Motivation: Understanding how virus sequences are shaped by selection can inform vaccine design and transmission inference. Modeling within-host evolution to interrogate these questions requires a detailed mechanistic framework that accurately captures sequence diversification. The CD8+ cytotoxic T-lymphocyte (CTL) response plays an important role in immune-mediated selection and can leave strong signatures in virus sequences; however, existing sequence-based within-host virus modeling frameworks do not explicitly include an HLA-aware CTL response. Results: We extended our previously published within-host sequence evolution simulator, wavess, to include an explicit CTL response, and share a method for identifying HLA-specific CTL epitopes given a founder virus sequence. We also updated the model to permit a variable recombination rate, which allows for modeling recombination hotspots, non-adjacent genes, and segmented genomes. These extensions to wavess allow for more accurate simulation of viruses and virus genes, particularly in regions of the genome where the immune response is dominated by CTLs (rather than antibodies). It also provides the foundation for investigations of how these newly-added biological mechanisms influence within-host evolution. Availability and implementation: The core of wavess is written in Python 3, with helper functions written in R. It is available at https://github.com/MolEvolEpid/wavess.
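A variable recombination rate of the kind described can be modeled by giving each inter-site interval its own breakpoint probability, so hotspots get locally elevated rates and segment boundaries can be forced or forbidden. A minimal sketch with a hypothetical function name and independent draws per interval (not the wavess code):

```python
import random

def sample_breakpoints(site_rates, seed=0):
    """Draw recombination breakpoints along a genome: interval i recombines
    independently with probability site_rates[i]. A rate of 1.0 forces a
    breakpoint (e.g. between segments), 0.0 forbids one."""
    rng = random.Random(seed)
    return [i for i, r in enumerate(site_rates) if rng.random() < r]
```

A hotspot is then just a run of intervals with elevated rates, and a segmented genome a rate vector with 1.0 at segment boundaries and the background rate elsewhere.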
bioinformatics2026-02-20v1Prediction of ligand-dependent conformational sampling of ABC transporters by AlphaFold3 and correlation to experimental structures and energetics
Tang, Q.; Mchaourab, H.; Wu, T.; Soubasis, B.AI Summary
- This study uses AlphaFold3 to predict nucleotide-dependent conformational changes in ABC transporters, comparing these predictions to experimental structures.
- AlphaFold3 accurately samples known conformations and correlates with experimental dynamics, also predicting previously unobserved conformations.
- The study suggests that AlphaFold3's predictions might extrapolate from known structures, as sequence determinants influence the predicted conformational changes.
Abstract
The AlphaFold3 architecture represented an important leap relative to AlphaFold2 by enabling the inclusion of protein ligands in the prediction network. Ligand-dependent structural rearrangements are inherently difficult to predict computationally as they imply transitions between states separated by large energy differences. Here we apply AlphaFold3 to predict nucleotide-dependent changes in the conformational cycle of representative ABC transporters that have been extensively investigated by experimental structural biology techniques. We show that under similar conditions, AlphaFold3 predictions sample experimentally observed conformations. Moreover, the heterogeneity of these predictions correlates with experimental measures of dynamics obtained from multiple techniques. For a couple of the tested transporters, the implied relative energetics of the conformations mirror their experimental counterparts. Remarkably, AlphaFold3 predicts previously unobserved conformations that have been implied to be sampled by ABC transporters. Finally, we report preliminary results showing that postulated sequence determinants of conformational changes modify the predictions of AlphaFold3. Although hundreds of ABC transporter structures have been determined and were included in the training data of AF3, we propose that aspects of its predictions reflect extrapolation of principles learned from these structures.
bioinformatics2026-02-20v1A New Sparse Bayesian Quantile Neural Network-based Approach and Its Application to Discover Physiological Sweet Spots in the Canadian Longitudinal Study on Aging
Min, J.; Vishnyakova, O.; Brooks-Wilson, A.; Elliott, L. T.AI Summary
- The study introduces Q-FSNet and Q-DirichNet, neural network frameworks integrating quantile regression for identifying physiological sweet spots in high-dimensional data.
- Using data from the Canadian Longitudinal Study on Aging, these methods identified 25 metabolites with optimal ranges that minimize biological age acceleration.
- The findings suggest dietary and gut microbiome-derived metabolites as potential biomarkers for healthy aging, supported by existing literature.
Abstract
Identifying physiological sweet spots (optimal ranges for homeostasis) is essential for precision medicine. However, traditional statistical methods often rely on globally linear or locally jagged models that struggle to capture the smooth, non-linear nature of biological regulation in high-dimensional data. We present the Quantile Feature Selection Network (Q-FSNet), a neural network-based framework that integrates quantile regression, feature selection, and uncertainty estimation to identify biomarkers with sweet spots. Unlike traditional methods, Q-FSNet learns continuous response curves without requiring a pre-specified number of change points. We further introduce the Quantile Dirichlet Network (Q-DirichNet), a fully Bayesian extension that utilizes Dirichlet priors to automate feature shrinkage. Using data from the Canadian Longitudinal Study on Aging, we identified 25 metabolites with distinct homeostatic ranges for which biological age acceleration is minimized. The metabolites with sweet spots for biological aging include some derived from diet or produced by the gut microbiome; this highlights their potential for knowledge translation and public health impact. Our results, corroborated by existing literature, demonstrate that these sparse neural network-based methods offer a scalable and interpretable tool for discovering metabolic signatures of healthy aging vs. dysregulation in large-scale omics research.
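Quantile regression, the building block of both Q-FSNet and Q-DirichNet, replaces the squared error with the asymmetric pinball loss, whose minimizer is the tau-th conditional quantile. A minimal numpy sketch of the loss itself (the networks in the paper wrap this in feature selection and Bayesian shrinkage):

```python
import numpy as np

def pinball_loss(y_true, y_pred, tau):
    """Quantile (pinball) loss: underestimates are penalized with weight
    tau, overestimates with weight (1 - tau), so minimizing it fits the
    tau-th conditional quantile rather than the mean."""
    e = y_true - y_pred
    return np.mean(np.maximum(tau * e, (tau - 1) * e))
```

With tau = 0.5 this reduces to half the mean absolute error (median regression); with tau = 0.9 an underestimate of 1 unit costs nine times more than an overestimate of the same size.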
bioinformatics2026-02-20v1Chemical Probes in Scientific Literature: Expanding and Validating Target-Disease Evidence
Adasme, M. F.; Ochoa, D.; Lopez, I.; Do, H.-M.-A.; McDonagh, E. M.; O'Boyle, N. M.; Leach, A. R.; Zdrazil, B.AI Summary
- This study systematically analyzed over 18 million articles to quantify the impact of 561 chemical probes, identifying 5,558 unique target-disease associations.
- Findings showed chemical probe evidence precedes structured data by 1-7 years, revealed 353 new T-D pairs, and 135 high-confidence associations for therapeutic repurposing in rare diseases.
- Chemical probes were crucial for validating target-disease associations, enhancing evidence beyond correlative data like RNA expression.
Abstract
Chemical probes are indispensable tools for validating therapeutic hypotheses, yet their broader impact on early-stage drug discovery remains unquantified. To our knowledge, this study represents the first systematic, large-scale investigation of the chemical probe literature. By screening over 18 million articles using a high-quality dictionary of 561 chemical probes, we identified 20,000 articles mentioning a chemical probe, yielding 5,558 unique target-disease (T-D) associations. Our analysis yields four principal findings that redefine the utility of these chemicals: First, we show that chemical probe evidence typically precedes the appearance of structured data in major knowledge bases by 1-7 years, providing a crucial lead time for target prioritisation. Second, we identified 353 T-D pairs (6.4%) with no prior evidence in the Open Targets Platform, highlighting the approach's discovery potential. Third, the application of strict novelty filters uncovered 135 new high-confidence associations between targets and diseases, revealing distinct opportunities for therapeutic repurposing in non-oncological, rare autoimmune diseases, and diseases without effective therapies due to complex biology or high treatment resistance. Finally, we demonstrate that chemical probes are essential for strengthening evidence, providing functional validation for associations previously supported only by weaker, correlative data such as RNA expression or animal models. Collectively, these findings illustrate that chemical probes catalyse early therapeutic discovery, emphasising the importance of cataloguing existing probes and identifying new ones.
bioinformatics2026-02-20v1Differential analysis of image-based chromatin tracing data with Dory
Ma, Z.; Liu, M.; Wang, S.; Wang, S.; Zang, C.AI Summary
- Dory is a statistical method designed for differential analysis of chromatin tracing data to identify spatial pattern differences between two groups.
- It quantifies pairwise spatial distances and uses multi-level statistical tests to detect significant structural changes, producing a differential score matrix.
- Application of Dory revealed associations between chromatin structural changes and alterations in A/B compartments, promoter-enhancer interactions, and gene expression.
Abstract
Spatial organization of the genome plays a vital role in defining cell identity and regulating gene expression. The three-dimensional (3D) genome structure can be measured by sequencing-based techniques such as Hi-C, usually at the cell-population level, or by imaging-based techniques such as chromatin tracing at the single-cell level. Chromatin tracing is a multiplexed DNA fluorescence in situ hybridization (FISH)-based method that can directly map the 3D positions of genomic loci along individual chromosomes at single-molecule resolution. However, few computational tools are available for statistical differential analysis of chromatin tracing data, which are inherently high-dimensional, highly variable and contain many missing values. Here, we present Dory, a statistical method for identifying differential spatial patterns between two groups of chromatin traces. Dory quantifies pairwise spatial distances among genomic regions in a chromatin trace and applies multi-level statistical tests to detect significant structural differences between the two groups of traces. It produces a differential score matrix highlighting region pairs with significant distance differences. Applying Dory to multiple chromatin tracing datasets, we found that the detected chromatin structural changes were associated with alterations in A/B compartments and promoter-enhancer interactions correlated with differential gene expression. Dory is a robust and user-friendly computational tool for quantitative analysis of imaging-based 3D genome data that enables systematic exploration of chromatin architecture and its roles in gene regulation.
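The starting point of this kind of analysis, pairwise spatial distances within each trace followed by a group-wise comparison, can be sketched as follows. NaN-aware means stand in here for the multi-level statistical tests the method actually applies, and the function names are illustrative:

```python
import numpy as np

def pairwise_distances(trace):
    """trace: (n_loci, 3) array of 3D locus positions, with NaN rows for
    loci that were not detected; NaN propagates into affected distances."""
    diff = trace[:, None, :] - trace[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def differential_score(group_a, group_b):
    """Difference of NaN-aware mean distance matrices between two groups
    of chromatin traces: entry (i, j) is positive where regions i and j
    sit farther apart on average in group A than in group B."""
    da = np.nanmean([pairwise_distances(t) for t in group_a], axis=0)
    db = np.nanmean([pairwise_distances(t) for t in group_b], axis=0)
    return da - db
```

The NaN handling matters because chromatin tracing data routinely miss loci; averaging only over traces where both loci were detected is what `nanmean` provides.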
bioinformatics2026-02-20v1Learning heritable multimodal brain representation via contrastive learning
Xia, T.; Zhao, X.; Islam, S. S. M.; Mohammed, K. K.; Xie, Z.; Zhi, D.AI Summary
- This study introduces a multimodal contrastive learning framework using paired T1- and T2-weighted MRIs to derive heritable brain representations.
- The approach improves prediction of traditional imaging-derived phenotypes, age, and brain disorders compared to single-modality models.
- GWAS on these representations showed increased genetic loci overlap, revealing shared biological targets and enhancing genetic discovery.
Abstract
Magnetic resonance imaging (MRI)-derived phenotypes (IDPs) have enabled the discovery of numerous genomic loci associated with brain structure and function. However, most existing IDPs and learned representations are derived from a single imaging modality, missing complementary information across modalities and potentially limiting the scope of genetic discovery. Here, we introduce a multimodal contrastive learning framework to derive heritable representations from paired T1- and T2-weighted MRIs. Unlike single-modality reconstruction-based models, we designed a momentum-based contrastive learning framework. As a result, our approach offers improved prediction of traditional IDPs, age, and brain disorders. Notably, genome-wide association studies (GWAS) of the learned representations reveal a substantially higher overlap of genetic loci across modalities, indicating improved alignment of their underlying genetic architecture. Analysis of the GWAS loci identified shared protein and drug targets, yielding meaningful biological insights. Overall, our framework learns shared representations across brain imaging modalities that exhibit anatomical and genetic coherence.
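The contrastive idea can be conveyed with a standard symmetric InfoNCE-style objective: paired T1/T2 embeddings of the same subject act as positives and other subjects in the batch as negatives. This generic numpy sketch omits the momentum encoder and is not the authors' loss:

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """Contrastive loss over paired embeddings: row i of z1 (e.g. a T1
    scan) should match row i of z2 (the same subject's T2 scan) more
    strongly than any other row in the batch."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature                      # cosine sims
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                    # pull diagonal up
```

Minimizing this loss drives the two modality encoders toward a shared embedding space, which is what makes cross-modality overlap of downstream GWAS loci plausible.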
bioinformatics2026-02-20v1SpecLig: Energy-Guided Hierarchical Model for Target-Specific 3D Ligand Design
Zhang, P.; Han, R.; Kong, X.; Chen, T.; Ma, J.AI Summary
- SpecLig is introduced as a framework for generating small molecules and peptides with enhanced target affinity and specificity, addressing the issue of promiscuous binding in structure-based models.
- It uses a hierarchical SE(3)-equivariant variational autoencoder and an energy-guided geometric latent-diffusion model, incorporating chemical priors to favor pocket-complementary fragment combinations.
- Evaluations show that SpecLig's ligands bind with high specificity and affinity, with real applications demonstrating reduced off-target risks.
Abstract
Structure-based generative models often optimize single-target affinity while ignoring specificity, resulting in the generation of high-affinity candidates that exhibit promiscuous binding across unrelated targets. This decoupling of affinity and specificity not only compromises therapeutic efficacy but also elevates off-target risks that constrain translational potential. Therefore, we introduce SpecLig, a unified structure-based framework that jointly generates small molecules and peptides with improved target affinity and specificity. SpecLig represents a complex as a block-based graph, combining a hierarchical SE(3)-equivariant variational autoencoder with an energy-guided geometric latent-diffusion model. Chemical priors derived from block-block contact statistics are explicitly incorporated, biasing generation towards pocket-complementary fragment combinations. We benchmark SpecLig on peptide and small-molecule tasks using standard public datasets and propose precision/breadth testing paradigms to quantify specificity. Across multiple evaluations, ligand candidates generated by SpecLig usually bind to the target pocket with high specificity and affinity while maintaining competitive advantages in other attributes. Ablations indicate that both hierarchical representation and energy guidance contribute to success. Finally, we present multiple real applications that demonstrate how SpecLig improves ligands in natural complexes to mitigate potential off-target risks. SpecLig, therefore, provides a practical route to prioritize higher-specificity designs for downstream experimental validation. The codes are available at: https://github.com/CQ-zhang-2016/SpecLig.
bioinformatics2026-02-19v3A statistical framework for defining synergistic anticancer drug interactions
Dias, D.; Zobolas, J.; Ianevski, A.; Aittokallio, T.AI Summary
- The study developed a statistical framework to identify significant synergistic anticancer drug interactions by establishing reference null distributions from a large dataset of over 2,000 drug combinations across 125 cancer cell lines.
- This approach allowed for the calculation of empirical p-values, confirming known synergistic combinations and revealing novel ones that were previously overlooked.
- The framework was also applied to a smaller dataset, demonstrating its general applicability in detecting significant drug combination effects.
Abstract
Synergistic drug combinations have the potential to delay drug resistance and improve clinical outcomes. However, current cell-based screens lack robust statistical assessment to identify significant synergistic interactions for downstream experimental or clinical validation. Leveraging a large-scale dataset that systematically evaluated more than 2,000 drug combinations across 125 pan-cancer cell lines, we established reference null distributions separately for various synergy metrics and cancer types. These data-driven reference distributions enable estimation of empirical p-values to assess the significance of observed drug combination effects, thereby standardizing synergy detection in future studies. The statistical evaluation confirmed key synergistic combinations and uncovered novel combination effects that met stringent statistical criteria, yet were overlooked in the original analyses. We revealed cell context-specific drug combination effects across tissue types, as well as differences in the statistical behavior of the synergy metrics. To demonstrate the general applicability of our approach to smaller-scale studies, we applied the reference distributions to evaluate the significance of combination effects in an independent dataset. We provide a fast and statistically rigorous approach to detecting synergistic drug interactions in combinatorial screens.
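Once a reference null distribution is in hand, the empirical p-value machinery is straightforward. A minimal sketch with the usual +1 correction so that p can never be exactly zero (the authors' exact estimator may differ):

```python
def empirical_p(observed, null_scores):
    """One-sided empirical p-value of an observed synergy score against a
    reference null distribution of scores from the same metric and cancer
    type; the +1 terms give a conservative, never-zero estimate."""
    exceed = sum(s >= observed for s in null_scores)
    return (1 + exceed) / (1 + len(null_scores))
```

With a null of 10,000 reference scores, an observed score exceeding all of them would get p = 1/10,001, which is also the resolution limit of the estimate.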
bioinformatics2026-02-19v3The practical impact of numerical variability on structural MRI measures of Parkinson's disease
Chatelain, Y. M. B.; Sokołowski, A.; Sharp, M.; Poline, J.-B.; Glatard, T.AI Summary
- The study investigated how numerical variability in MRI analyses affects structural measures in Parkinson's disease using FreeSurfer to simulate computational differences.
- Numerical variability was found to be significant, reaching up to one-third of population variability, impacting statistical conclusions.
- A tool was developed to estimate the Numerical-Population Variability Ratio (NPVR), revealing a high probability of false positives and negatives in existing Parkinson's disease MRI studies due to numerical variability.
Abstract
Numerical variability is rarely quantified in neuroimaging despite many biomarkers relying on subtle morphometric differences across individuals. We instrumented FreeSurfer, a widely used neuroimaging pipeline, to simulate numerical differences across computational environments, and used it to measure numerical variability in MRI analyses of Parkinson's disease patients and controls. In multiple cortical and subcortical regions, numerical variation reached nearly one-third of the population variability, altering statistical conclusions about group differences and clinical associations. To assess the impact of numerical noise in existing studies, we developed a practical tool that estimates the Numerical-Population Variability Ratio (NPVR) in a study, and propagates the resulting numerical uncertainty to common statistics and associated p-values. By applying this framework to thirteen previously published studies reporting MRI measures of Parkinson's disease, we quantified the probability of numerically induced false positives and false negatives in the literature, highlighting a substantial impact of numerical variability on MRI measures of Parkinson's disease. These results underscore the importance of systematically evaluating numerical stability in neuroimaging and provide a practical framework to do so.
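The NPVR itself is a ratio of two spreads. A minimal sketch, assuming sample standard deviations for both spreads (the paper may define and estimate them differently):

```python
import numpy as np

def npvr(numerical_replicates, population_values):
    """Numerical-Population Variability Ratio: spread of one subject's
    measure across numerically perturbed re-runs of the pipeline, relative
    to the spread of that measure across subjects. Values near 1 mean
    numerical noise is as large as the biological signal of interest."""
    return (np.std(numerical_replicates, ddof=1)
            / np.std(population_values, ddof=1))
```

The abstract's headline finding, numerical variation reaching nearly one-third of population variability, corresponds to NPVR values around 0.33 in this formulation.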
bioinformatics2026-02-19v2Pioneer and Altimeter: Fast Analysis of DIA Proteomics Data Optimized for Narrow Isolation Windows
Wamsley, N. T.; Wilkerson, E. M.; Major, M. B.; Goldfarb, D.AI Summary
- The study introduces Pioneer and Altimeter, tools designed for fast analysis of DIA proteomics data, addressing challenges posed by narrow isolation windows in mass spectrometry.
- Altimeter models fragment intensity as a function of collision energy, allowing spectral library reuse, while Pioneer re-isotopes spectra and uses advanced techniques for efficient analysis.
- These tools enable high-confidence protein identification and quantification, performing analyses 2-6 times faster while controlling false-discovery rates across various experimental setups.
Abstract
Advances in mass spectrometry have enabled increasingly fast data-independent acquisition (DIA) experiments, producing datasets whose scale and complexity challenge existing analysis tools. Those same advances have also led to the use of narrow isolation windows, which alter MS2 spectra via fragment isotope effects and give rise to systematic deviations from spectral libraries. Here we introduce Pioneer and Altimeter, open-source tools for fast DIA analysis with explicit modeling of isolation-window effects. Altimeter predicts deisotoped fragment intensity as a continuous function of collision energy, allowing a single spectral library to be reused across datasets. Pioneer re-isotopes predicted spectra per scan and combines an intensity-aware fragment index, spectral deconvolution, and dual-window quantification for fast, spectrum-centric DIA analysis. Across instruments, experimental designs, and sample inputs, Pioneer enables high-confidence identification and precise quantification at scale, completing analyses 2-6x faster and maintaining conservative false-discovery rate control.
bioinformatics2026-02-19v2Harnessing DNA Foundation Models for Cross-Species Transcription Factor Binding Site Prediction in Plant Genomes
Haghani, M.; Dhulipalla, K. V.; Li, S.AI Summary
- This study evaluates the performance of DNA foundation models (DNABERT-2, AgroNT, HyenaDNA) in predicting transcription factor binding sites (TFBSs) in plant genomes using Arabidopsis thaliana and Sisymbrium irio data.
- The models were benchmarked against specialized methods like DeepBind and BERT-TFBS.
- HyenaDNA showed superior predictive accuracy and computational efficiency, suggesting potential for scalable genome-wide TFBS prediction in plants.
Abstract
Accurate prediction of transcription factor binding sites (TFBSs) is crucial for understanding gene regulation. While experimental methods like ChIP-seq and DAP-seq are informative, they are labor-intensive and species-specific. Recent advancements in large-scale pretrained DNA foundation models have shown promise in overcoming these limitations. This study evaluates the performance of three such models, DNABERT-2, AgroNT, and HyenaDNA, in predicting TFBSs in plants. Using Arabidopsis thaliana and Sisymbrium irio DAP-seq data, we benchmark their accuracy against specialized methods like DeepBind and BERT-TFBS. Our results demonstrate that foundation models, particularly HyenaDNA, offer superior predictive accuracy and computational efficiency, highlighting their potential for scalable, genome-wide TFBS prediction in plants.
bioinformatics2026-02-19v2Fine-tuning protein language models on human spatial constraint improves variant effect prediction by reducing wild-type sequence bias
Bajracharya, G.; Capra, J. A.AI Summary
- The study introduces Human Spatial Constraint (HuSC), which quantifies intraspecies constraint on missense variants by integrating human genetic variation with 3D protein structures.
- Fine-tuning protein language models (PLMs) on HuSC scores enhances prediction of variant effects by reducing bias towards wild-type sequences.
- HuSC outperforms traditional conservation metrics in predicting pathogenic variants and improves variant fitness predictions across different taxa and assays.
Abstract
Protein language models (PLMs) achieve state-of-the-art performance in predicting effects of missense variants, yet they do not explicitly consider variation within the human population. Here, we introduce Human Spatial Constraint (HuSC), a framework for quantifying intraspecies constraint on missense variants that integrates population-scale human genetic variation with 3D protein structures. We then fine-tune PLMs on HuSC scores. HuSC models the expected frequency of missense variation under neutral evolution and compares it to observed variation, accounting for both variation in mutational processes and 3D structural context. HuSC outperforms traditional inter- and intraspecies conservation metrics in predicting pathogenic variants. By focusing on intraspecies variation, HuSC reveals protein sites under human-specific constraint that cannot be captured by interspecies models. Integrating this intraspecies perspective into PLMs by fine-tuning on HuSC scores improves the prediction of variant fitness from deep mutational scans across diverse taxa and functional assay types. The improvement after fine-tuning comes largely from reducing bias toward wild-type sequences in regions that tolerate variation. Together, these results demonstrate that combining intraspecies constraint with cross-species PLMs improves their performance in variant-effect interpretation.
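The observed-versus-expected logic behind a constraint score like HuSC can be conveyed by a simple ratio with a pseudocount; the actual framework additionally conditions the expectation on mutational processes and 3D structural context, and the function below is only an illustrative stand-in:

```python
def oe_constraint(observed, expected, pseudocount=0.5):
    """Observed/expected missense-variant ratio at a protein site or
    region: values well below 1 indicate fewer variants than the neutral
    expectation, i.e. purifying selection. The pseudocount keeps the
    ratio defined when counts are zero."""
    return (observed + pseudocount) / (expected + pseudocount)
```

A site with 0 observed missense variants against an expectation of 9.5 scores 0.05, a strong constraint signal, whereas observed counts near expectation score close to 1.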
bioinformatics2026-02-19v2Convergence of Angiotensin Signaling on Lung Pericyte and Stromal Behaviors
Benjamin, K. J. M.; Gonye, E.; Sauler, M.; Gidner, S.; Malinina, A.; Neptune, E. R.AI Summary
- The study investigated the expression of angiotensin receptors AGTR1 and AGTR2 in human lung tissue using bulk and single-nucleus transcriptomics, finding AGTR1 in lung pericytes and AGTR2 in alveolar epithelial type 2 cells.
- AGTR1 expression in pericytes was linked to pericyte behaviors; its inhibition restored pericyte numbers in an emphysema model, suggesting a role in airspace repair.
- In COPD, AGTR1 showed dysregulated expression in stromal cells, and angiotensin II with cigarette smoke exposure impaired pericyte migration and proliferation.
Abstract
The renin-angiotensin system is a well-characterized regulator of tissue homeostasis whose clinical relevance has expanded to include lung disorders such as chronic obstructive pulmonary disease (COPD)-associated emphysema, idiopathic pulmonary fibrosis, and COVID-19. Despite this interest, the cell-specific localization of angiotensin receptors in the human lung has remained poorly defined, in part due to limitations of available antibody reagents. Here, we define the expression patterns of the two predominant angiotensin receptors, AGTR1 and AGTR2, using complementary bulk and single-nucleus transcriptomic datasets from human lung tissue. We demonstrate that these receptors exhibit mutually exclusive, compartment-specific localization, with AGTR1 expressed in lung pericytes and AGTR2 expressed in alveolar epithelial type 2 cells. AGTR1 is detectable in isolated lung pericytes, and spatial colocalization with pericyte markers was confirmed within the airspace microvasculature compartment by RNAscope. Airspace pericyte abundance was reduced in an experimental emphysema model but restored by pharmacologic attenuation of AGTR1 signaling, commensurate with airspace repair. In COPD lungs, AGTR1 expression showed heterogeneous, disease-associated dysregulation across stromal populations, including upregulation in alveolar fibroblasts. Bulk transcriptomics also revealed aging-associated redistribution of AGTR1 expression into stromal compartments. Angiotensin II and cigarette smoke impaired pericyte migration toward endothelial cells, while combined exposure suppressed pericyte proliferation. Together, these findings identify AGTR1 as a new, highly selective marker of lung pericytes and a regulator of pericyte behaviors within the airspace microvasculature. These findings provide a cell-resolved framework for angiotensin signaling with direct relevance to airspace resilience and therapeutic targeting.
bioinformatics2026-02-19v2Investigating the topological motifs of inversions in pangenome graphs
Romain, S.; Dubois, S.; Legeai, F.; Lemaitre, C.AI Summary
- This study investigated how inversions are represented in pangenome graphs, focusing on identifying topological motifs for inversion bubbles.
- Two motifs were identified: path-explicit and alignment-rescued, and a tool was developed to annotate these from bubble-caller outputs.
- Analysis across four pipelines showed significant differences in inversion representation, with low recovery rates in real human datasets, indicating challenges in pangenomic inversion analysis.
Abstract
Background: Pangenome graphs are increasingly used in genetic diversity analyses because they reduce reference bias in read mapping and enhance variant discovery and genotyping from SNPs to Structural Variants. In pangenome graphs, variants appear as bubbles, which can be detected by dedicated bubble calling tools. Although these tools report essential information on the variant bubbles, such as their position and allele walks in the graph, they do not annotate the type of the detected variants. While simple SNPs, insertions, and deletions are easily distinguishable by allele size, large balanced variants like inversions are harder to differentiate among the large number of unannotated bubbles and remain underexplored in pangenome graph benchmarks and analyses. Results: In this work, we focused on inversions, which have drawn renewed attention in evolutionary genomics in recent years, and aimed to assess how this type of variant is handled by state-of-the-art pangenome graph pipelines. We identified two distinct topological motifs for inversion bubbles: one path-explicit and one alignment-rescued, and developed a tool to annotate them from bubble-caller outputs. We constructed pangenome graphs with both simulated data and real data using four state-of-the-art pipelines, and assessed the impact of inversion size, genome divergence and variant density on inversion representation and accuracy. Conclusions: Our results reveal substantial differences between pipelines in simulated graphs, with some inversions either misrepresented or lost. In addition, recovery rates are strikingly low in real human datasets, highlighting major challenges in analyzing inversions through pangenomic approaches.
bioinformatics2026-02-19v2jazzPanda: A hybrid approach to find spatial marker genes in imaging-based spatial transcriptomics data
Jin, X.; Putri, G. H.; Cheng, J.; Asselin-Labat, M.-L.; Smyth, G. K.; Phipson, B.AI Summary
- The study introduces jazzPanda, a hybrid method for identifying spatial marker genes in imaging-based spatial transcriptomics, which integrates spatial coordinates of gene detections and cells.
- jazzPanda uses a binning approach to pseudobulk gene detections and cells within clusters, enhancing marker gene analysis through linear models.
- Testing on datasets from Xenium, CosMx, and MERSCOPE showed that jazzPanda's marker genes have strong spatial correlation and increased specificity compared to existing methods.
Abstract
Spatial transcriptomics enables the understanding of the spatial architecture of tissues, providing deeper insight into tissue structure and cellular neighbourhoods. A crucial step in the analysis of spatial data is cell type identification. In single cell RNA-sequencing (scRNA-seq) analysis, cells are clustered according to their transcriptional similarity, and marker genes are identified for each cluster. Marker analysis identifies genes highly expressed in each cluster compared to the remaining clusters, and these marker genes are used to annotate clusters with cell types. For spatial data, few software tools offer marker gene detection methods that account for the spatial distribution of gene expression. Tools developed for scRNA-seq ignore spatial information for the cells and genes. We have developed a hybrid approach to prioritize marker genes that uses the spatial coordinates of gene detections and cells making up clusters. We propose a binning approach that effectively "pseudobulks" gene detections and cells within clusters that can then be used as input into linear models for marker analysis. Our approach can account for multiple samples and background noise. We have tested our methods on several public datasets from different platforms including Xenium, CosMx and MERSCOPE. The marker genes detected by our method show strong spatial correlation with the corresponding clusters and have increased specificity compared to other methods. The method is implemented in the jazzPanda R Bioconductor package and is publicly available (https://bioconductor.org/packages/jazzPanda).
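The binning-and-pseudobulk idea described in the jazzPanda abstract can be illustrated with a minimal sketch: gene detections and cluster cells are each summed into a regular spatial grid, and the resulting binned vectors can feed a correlation or linear model. This is a hypothetical illustration, not the package's actual implementation; the grid size, extent, and data layout are assumptions.

```python
import numpy as np

def pseudobulk_bins(x, y, counts, n_bins=5, extent=(0.0, 100.0)):
    """Sum per-detection counts into an n_bins x n_bins spatial grid.

    x, y   : coordinates of gene detections (or cells)
    counts : count for each detection (often all ones)
    Returns a flattened vector with one entry per grid square.
    """
    lo, hi = extent
    edges = np.linspace(lo, hi, n_bins + 1)
    # digitize maps each coordinate to a bin index; clip keeps edge points in range
    ix = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)
    iy = np.clip(np.digitize(y, edges) - 1, 0, n_bins - 1)
    grid = np.zeros((n_bins, n_bins))
    np.add.at(grid, (ix, iy), counts)  # unbuffered accumulation into bins
    return grid.ravel()

# Toy data: detections of one gene vs. cells of one cluster
rng = np.random.default_rng(0)
gene_xy = rng.uniform(0, 100, size=(200, 2))
cluster_xy = rng.uniform(0, 100, size=(300, 2))
g = pseudobulk_bins(gene_xy[:, 0], gene_xy[:, 1], np.ones(200))
c = pseudobulk_bins(cluster_xy[:, 0], cluster_xy[:, 1], np.ones(300))
# The binned vectors can then be compared, e.g. by spatial correlation
r = np.corrcoef(g, c)[0, 1]
```

Binning turns point clouds of detections into fixed-length vectors, so standard pseudobulk-style linear modelling applies directly.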
bioinformatics2026-02-19v2Differential analysis of genomics count data with edge*
Pachter, L.AI Summary
- The study addresses the integration of edgeR, a tool for differential expression analysis, into the Python ecosystem, which is prevalent in single-cell genomics.
- They developed edgePython, a Python version of edgeR 4.8.2, incorporating a negative binomial gamma mixed model for multi-subject single-cell analysis and empirical Bayes shrinkage for cell-level dispersion.
- Key findings include the successful adaptation of edgeR to Python, enhancing its utility in single-cell genomic studies.
Abstract
The edgeR Bioconductor package is one of the most widely used tools for differential expression analysis of count-based genomics data. Despite its popularity, the R-only implementation limits its integration with the Python-centric ecosystem that has become dominant in single-cell genomics. We present edgePython, a Python port of edgeR 4.8.2 that extends the framework with a negative binomial gamma mixed model for multi-subject single-cell analysis and empirical Bayes shrinkage of cell-level dispersion.
bioinformatics2026-02-19v2NanoHIVSeq: A Long-Read Bioinformatics Pipeline for High-Throughput Processing of HIV Env Sequences
Sheng, Z.; Xiao, Q.; Qiao, Y.; Lu, H.; McWhirter, J.; Sagar, M.; Wu, X.AI Summary
- NanoHIVSeq is a UMI-free bioinformatics pipeline designed for high-throughput sequencing of HIV-1 Env gene using Oxford Nanopore Technology (ONT).
- It processes ONT data through clustering, consensus polishing, indel correction, denoising, and genotyping to recover functional Env variants.
- Testing on plasmid env and bulk HIV datasets showed NanoHIVSeq's high robustness, reproducibility, and accuracy (>99.9% or >Q30), comparable to UMI methods.
Abstract
High-throughput sequencing of the HIV-1 envelope (Env) gene from viral quasispecies is essential for epidemiology, virus-antibody coevolution studies, and evaluating therapeutics, but the conventional single-genome amplification (SGA) coupled with Sanger sequencing is labor-intensive and low-throughput. Oxford Nanopore Technology (ONT) offers long-read sequencing advantages, but high error rates (1-7%) pose a challenge in distinguishing biological variants from sequencing artifacts. Unique molecular identifiers (UMIs) can address this but cause loss of DNA template and add complexity to library preparation. Here we introduce NanoHIVSeq, a UMI-free and reference-free bioinformatics pipeline that processes ONT data from bulk Env PCR amplicons through multistep clustering, consensus polishing, indel correction, denoising, and genotyping to recover functional full-length Env variants. By leveraging advanced ONT duplex sequencing technology, NanoHIVSeq was assessed using plasmid env and bulk HIV reservoir datasets, demonstrating high robustness, recovery rate, reproducibility, and accuracy (>99.9% or >Q30) comparable to UMI approaches. Our findings indicate that NanoHIVSeq allows flexible and simplified ONT library preparation for reproducible and efficient Env sequencing, especially for large cohorts.
bioinformatics2026-02-19v1A Machine Learning and Benchmarking Approach for Molecular Formula Assignment of Ultra High-Resolution Mass Spectrometry Data from Complex Mixtures
Shabbir, B.; Oliveira, P. B.; Fernandez-Lima, F.; Saeed, F.AI Summary
- This study applies machine learning, specifically KNN, DTR, and RFR algorithms, to improve molecular formula assignment in ultra-high resolution mass spectrometry (UHRMS) data from complex mixtures like dissolved organic matter (DOM).
- The approach was benchmarked against traditional methods, showing a 43% increase in formula annotations (5796 vs 4047) and up to 2x more formulas assigned with Model-Synthetic (8268 vs 4047).
- DTR and RFR achieved formula-level accuracies of 86.5% and 60.4%, respectively, enhancing the reliability of characterizing complex systems in environmental science, metabolomics, and petroleomics.
Abstract
A machine learning approach to molecular formula assignment is crucial for unlocking the full potential of ultra-high resolution mass spectrometry (UHRMS) when analyzing complex mixtures. By combining data-driven models with rigorous benchmarking, the accuracy, consistency, and speed in identifying plausible molecular formulas from vast spectral datasets can be improved. Compared with traditional de novo methods that rely heavily on rule-based heuristics and manual parameter tuning, machine learning approaches can capture complex patterns in data and adapt more readily to diverse sample types. In this paper, we describe the application of machine learning methods using the k-nearest neighbors (KNN) algorithm trained on curated chemical formula datasets from UHRMS analysis of dissolved organic matter (DOM) covering the saline river continuum and tropical wet/dry season variability. The influence of mass accuracy (training sets with 0.15-1 ppm) was evaluated on a blind test set of DOMs of different geographical origins. A Decision Tree Regressor (DTR) and Random Forest Regressor (RFR) based on mass accuracy (<1 ppm) were also used. Our ML models annotated 43% more formulas than traditional methods (5,796 vs 4,047), and the Model-Synthetic variant achieved a 99.9% assignment rate while assigning 2x more formulas (8,268 vs 4,047). DTR and RFR achieved formula-level accuracies (FA) of 86.5% and 60.4%, respectively. Overall, results show an increase in formula assignment when compared with traditional methods. This ultimately enables more reliable characterization of complex natural and engineered systems, supporting advances in fields such as environmental science, metabolomics, and petroleomics. Furthermore, the novel dataset produced for this study is made publicly available, establishing an initial benchmark for molecular formula assignment in UHRMS using machine learning.
The dataset and code are publicly available at: https://github.com/pcdslab/dom-formula-assignment-using-ml
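The core of formula assignment from exact mass, which the KNN approach above builds on, can be illustrated with a toy 1-nearest-neighbour sketch: match an observed m/z against a candidate library within a ppm tolerance. This is a hypothetical illustration only; the candidate list is tiny and the paper's actual models, features, and datasets differ.

```python
import numpy as np

# Monoisotopic element masses (Da); the candidate set below is illustrative only.
MASS = {"C": 12.0, "H": 1.007825, "O": 15.994915, "N": 14.003074}

def exact_mass(formula):
    """formula as a dict of element -> count, e.g. {"C": 6, "H": 12, "O": 6}."""
    return sum(MASS[el] * n for el, n in formula.items())

candidates = [
    {"C": 6, "H": 12, "O": 6},           # hexose-like
    {"C": 7, "H": 6, "O": 2},            # benzoic-acid-like
    {"C": 9, "H": 11, "N": 1, "O": 2},   # phenylalanine-like
]
masses = np.array([exact_mass(f) for f in candidates])

def assign(observed_mz, tol_ppm=1.0):
    """Return the 1-nearest candidate within tol_ppm of observed_mz, else None."""
    err_ppm = np.abs(masses - observed_mz) / observed_mz * 1e6
    best = int(np.argmin(err_ppm))
    return candidates[best] if err_ppm[best] <= tol_ppm else None

hexose_mass = exact_mass({"C": 6, "H": 12, "O": 6})
hit = assign(hexose_mass * (1 + 0.3e-6))  # simulated 0.3 ppm measurement error
miss = assign(150.0)                       # far from every candidate -> None
```

Tightening `tol_ppm` trades assignment rate for accuracy, which is exactly the mass-accuracy effect (0.15-1 ppm) the abstract reports evaluating.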
bioinformatics2026-02-19v1Foundation Models Improve Perturbation Response Prediction
Cole, E.; Huizing, G.-J.; Addagudi, S.; Ho, N.; Hasanaj, E.; Kuijs, M.; Johnstone, T.; Carilli, M.; Davi, A.; Ellington, C.; Feinauer, C.; Li, P.; Menegaux, R.; Mohammadi, S.; Shao, Y.; Zhang, J.; Lundberg, E.; Song, L.; Bar-Joseph, Z.; Xing, E. P.AI Summary
- The study analyzed over 600 models to assess the effectiveness of foundation models in predicting cellular responses to genetic or chemical perturbations.
- Findings showed that while some foundation models did not outperform simple baselines, others significantly enhanced prediction accuracy.
- Integrating multiple foundation models was shown to approach fundamental performance limits, confirming their utility in improving cellular response simulations.
Abstract
Predicting cellular responses to genetic or chemical perturbations has been a long-standing goal in biology. Recent applications of foundation models to this task have yielded contradictory results regarding their superiority over simple baselines. We conducted an extensive analysis of over 600 different models across various prediction tasks and evaluation metrics, demonstrating that while some foundation models fail to outperform simple baselines, others significantly improve predictions for both genetic and chemical perturbations. Furthermore, we developed and evaluated methods for integrating multiple foundation models for perturbation prediction. Our results show that with sufficient data, these models approach fundamental performance limits, confirming that foundation models can improve cellular response simulations.
bioinformatics2026-02-19v1Spartan: Spatial Activation Aware Transcriptomic Analysis Network
Faiz, M. F. I.; Jokl, E.; Jennings, R.; Piper Hanley, K.; Sharrocks, A.; Iqbal, M.; Baker, S. M.AI Summary
- Spartan is a new framework designed to improve the identification of spatial domains in spatial transcriptomics by modeling spatial transitions and using Local Spatial Activation (LSA) to enhance resolution.
- It integrates spatial topology and activation signals to accurately partition tissues across various technologies like Visium HD, MERFISH, Stereo-seq, and STARmap.
- Applied to a high-resolution Visium HD section of developing human esophagus and stomach, Spartan effectively delineates transitional regions and detects genes linked to tissue remodeling.
Abstract
Spatial transcriptomics is rapidly advancing toward single cell level resolution, revealing complex tissue architectures organized across continuous anatomical gradients. However, accurate identification of spatial domains remains a central computational challenge, as many existing clustering approaches blur anatomical boundaries, merge transitional zones, or fail to resolve localized microstructures. Here we introduce Spartan, an activation-aware multiplex graph framework that explicitly models spatial transitions for high-resolution domain discovery. Spartan integrates spatial topology with Local Spatial Activation (LSA), a neighborhood deviation signal that amplifies localized transcriptomic shifts often attenuated by similarity-based clustering. By jointly modeling cohesion within domains and activation at interfaces, Spartan recovers anatomically aligned partitions across spatially resolved transcriptomics technologies including Visium HD, MERFISH, Stereo-seq, and STARmap. We demonstrate its utility in a high-resolution Visium HD section of developing human esophagus and stomach, where activation-aware graph integration enables precise delineation of transitional regions such as the gastroesophageal junction and supports stable multi-scale domain recovery without fragile hyperparameter tuning. Beyond domain identification, Spartan leverages activation-aware structure to detect spatially variable genes associated with localized tissue remodeling. Spartan scales near-linearly with dataset size, providing a robust and interpretable framework for spatial systems-level analysis.
bioinformatics2026-02-19v1In silico degradomics reveals disease- and endotype-specific alterations in the joint tissue landscape
Hoyle, A.; Midwood, K. S.AI Summary
- The study developed DegrAID, an in-silico pipeline to analyze semi-tryptic peptides from proteomic data, mapping neo-epitopes in matrix proteins without the need for labeling or enrichment.
- Applied to osteoarthritis and rheumatoid arthritis (RA) patient samples, DegrAID identified distinct degradomes in different tissues and highlighted disease-specific degradation patterns.
- In RA, different endotypes (myeloid and lymphoid) showed varied degradation patterns, with proteoglycans more degraded in myeloid-RA and collagens in lymphoid-RA, revealing endotype-specific biomarkers.
Abstract
Tissues dynamically remodel extracellular matrix to maintain homeostasis, alterations in which are an early pathogenic hallmark of disease. Protein degradation, essential for tissue remodelling, is often dismissed as indiscriminate damage, despite evidence of its specificity. A major determinant of protein tissue levels and activity, matrix proteolysis also creates circulating degradation products that are emerging biomarkers, with specific collagen fragments capable of tracking disease severity. Understanding intentional matrix destruction therefore is key to understanding tissue biology. Unbiased, holistic analysis, extending our knowledge beyond ubiquitously expressed collagens, will uncover tissue- and disease-specific remodelling. However, degradomics' technical demands, requiring labelling and enrichment for neo-epitopes generated by cleavage events, restrict its inclusion in omics research. Here, we develop an in-silico pipeline (DegrAID) that identifies semi-tryptic peptides in unlabelled/unenriched proteomic datasets, mapping neo-epitopes within matrix domain organization and 3D structure, correlating these with known/predicted protease sites, and applies this to rare patient cohorts. Validation with matched degradomic data showed good conservation across degraded proteins and cleavage sites. Interrogation of multiple, independent cohorts including cartilage, synovial tissue and synovial fluid from osteoarthritis (OA) or rheumatoid arthritis (RA) patients identified distinct degradomes between disease and tissue compartments. Further investigation of RA heterogeneity revealed that myeloid and lymphoid endotypes, which display different treatment responses, have substantially different degradation patterns. Proteoglycans were more degraded in myeloid-RA, while collagens more so in lymphoid-RA, with notable exceptions, and endotype-specific fingerprints were conserved between synovial tissue and fluid.
Thus, this tool provides new insights into tissue remodelling by unlocking degradomes from any proteomic dataset.
bioinformatics2026-02-19v1Benchmarking Large Language Models for Predicting Therapeutic Antisense Oligonucleotide Efficacy
Wei, Z.; Griesmer, S.; Sundar, A.AI Summary
- This study benchmarks large language models (LLMs) and molecular embedding models for predicting the efficacy of therapeutic antisense oligonucleotides (ASOs) using datasets PFRED, openASO, and ASOptimizer.
- DNA sequence-based representations with gene context were found to be superior to SMILES-based representations for predicting ASO efficacy.
- GPT-3.5-Turbo with few-shot prompting showed the best performance, achieving R² values up to 0.6381, significantly outperforming baseline regression models.
Abstract
Antisense oligonucleotides (ASOs) are a promising class of therapeutic agents capable of selectively modulating gene expression and treating a wide range of genetic and neurological disorders. Accurate prediction of ASO efficacy is essential for accelerating drug discovery and reducing experimental costs, yet remains a challenging computational task due to complex sequence-function relationships. In this study, we benchmark large language models (LLMs) and molecular embedding-based regression models for predicting therapeutic ASO efficacy across three publicly available biological datasets: PFRED, openASO, and ASOptimizer. We evaluate multiple transformer-based molecular embedding models, including ChemBERTa and MolFormer, alongside prompt-engineered LLM configurations such as GPT-3.5-Turbo, LLaMA-2-7B, and Galactica-6.7B. Our results demonstrate that DNA sequence-based representations combined with gene context outperform SMILES-based molecular representations for efficacy prediction. Among the evaluated approaches, GPT-3.5-Turbo using few-shot prompting achieves strong predictive performance, reaching coefficient of determination (R squared) values up to 0.6381 and substantially outperforming baseline regression models. These findings highlight the potential of general-purpose large language models as effective tools for biomolecular prediction and computational drug discovery. This work provides a systematic benchmarking framework and establishes a foundation for integrating large language models into therapeutic antisense oligonucleotide design pipelines.
bioinformatics2026-02-19v1Experimental Time Points Guided Transcriptomic Velocity Inference
Zang, X.; Shu, X.; Zhang, N.; Wu, Y.; Deng, M.; Zhou, X.; Yang, J.; Zhang, C.-Y.; Wang, X.; Zhou, Z.; Wang, J.AI Summary
- The study introduces CellDyc, a semi-supervised learning framework that uses experimental time points to enhance the reconstruction of cellular trajectories via transcriptomic velocities.
- CellDyc outperforms existing methods in various contexts, providing insights like temporal heterogeneity in erythroid maturation and delayed monocyte differentiation in glioblastoma.
- It integrates well with tools like CellRank and remains effective with inferred temporal data.
Abstract
Time-series single-cell RNA sequencing enables longitudinal tracking of biological processes, yet cellular trajectory reconstruction informed by experimental time remains challenging. Existing trajectory inference methods either perform de novo reconstruction without leveraging experimental time points, or prioritize transitions between time points while paying less attention to intra-time-point dynamics. To reconcile experimental time points with local precision, we present CellDyc, a semi-supervised learning framework that leverages experimental time-point supervision to reconstruct transcriptomic velocities and recover an intrinsic gene-embedded time. CellDyc consistently outperforms existing approaches in reconstructing cellular trajectories across development, disease, and reprogramming contexts. Biologically, CellDyc provides novel insights, such as resolving temporal heterogeneity in erythroid maturation and quantitatively demonstrating that the immunosuppressive environment delays monocyte differentiation in glioblastoma. CellDyc integrates seamlessly with downstream tools like CellRank and remains robust even when only inferred temporal information is available. Collectively, CellDyc offers a rigorous, data-driven solution for deciphering time-resolved cellular dynamics.
bioinformatics2026-02-19v1NaVis: a virtual microscopy framework for interactive, high-resolution navigation of spatial transcriptomics data
Oshinjo, A.; Wu, J.; Petrov, P.; Izzi, V.AI Summary
- NaVis is a web-based virtual microscopy framework designed to enhance the exploration of spatial transcriptomics data by providing interactive, high-resolution navigation.
- It allows for near real-time super-resolution inference from low-resolution platforms, transforming resolution into a user-controlled parameter.
- NaVis offers a point-and-click interface, making it accessible to non-coders and facilitating direct interrogation of spatial molecular architecture, thus broadening its use in biological research.
Abstract
Despite the wide adoption of spatial transcriptomics (ST) in the biomedical community, its practical use remains constrained by a fundamental resolution/coverage trade-off and by reliance on computationally intensive and static workflows. As a result, transcriptome-wide spatial data are typically interpreted as ad-hoc processed outputs rather than explored dynamically as one would do with stained or fluorescence tissue images, limiting ST accessibility and slowing biological insight. Here we introduce NaVis, a web-based virtual microscopy framework that redefines how spatial transcriptomics is experienced. NaVis enables near real-time, on-demand super-resolution inference from low-resolution whole-transcriptome platforms (10x Genomics Visium V1/V2, Cytassist and VisiumHD), generating high-resolution reconstructions that approach microscopy-level detail while preserving transcriptome-wide coverage. Unlike conventional interpolation approaches that produce fixed images, NaVis computes and refines spatial reconstructions interactively as users navigate tissue sections, transforming resolution from a platform-imposed constraint into a dynamic, user-controlled parameter. Also, NaVis is delivered through a fully point-and-click browser interface requiring no coding expertise, thus removing computational mediation and allowing clinicians, pathologists and experimental researchers to directly interrogate spatial molecular architecture. By coupling high-resolution inference with immediate visual interaction, NaVis shifts spatial transcriptomics from a static computational analysis to an exploratory, microscopy-like modality, broadening its accessibility, conceptual reach, and potential for biological discovery.
bioinformatics2026-02-19v1Identification of an ERCC2 mutation associated mutational signature of nucleotide excision repair deficiency in targeted panel sequencing data
Stojkova, O.; Borcsok, J.; Sztupinszki, Z.; Diossy, M.; Prosz, A.; Neil, A.; Mouw, K. W.; Sorensen, C. S.; Szallasi, Z.AI Summary
- This study developed a method to identify a mutational signature of nucleotide excision repair (NER) deficiency from targeted panel sequencing data in bladder cancer with ERCC2 mutations.
- ERCC2 wild type bladder cancers with high levels of this signature showed better response to neoadjuvant platinum therapy and improved survival.
- The signature was also observed in other solid tumors with ERCC2 mutations, suggesting potential therapeutic targeting beyond bladder cancer.
Abstract
Next-generation sequencing-based mutational signatures are frequently used to identify tumors with specific DNA repair deficiencies for targeted therapeutic strategies. Although mutational signatures are most commonly derived from whole exome (WES) or whole genome sequencing (WGS) data, more patients currently undergo tumor sequencing using more limited targeted panels that typically encompass several hundred cancer-associated genes. Identifying clinically relevant mutational signatures from targeted panel data requires new approaches capable of deriving signatures from the more limited sequencing data. Here, we derive and validate a panel sequencing-based composite mutational signature associated with nucleotide excision repair (NER) deficiency induced by inactivating ERCC2 mutations in bladder cancer. Using publicly available panel sequencing data, we find that ERCC2 wild type (WT) bladder cancer cases that have high levels of this mutational signature respond better to neoadjuvant platinum therapy and have improved overall survival compared to ERCC2 WT cases with low levels of the signature. We also find that other solid tumor types with ERCC2 mutations also show the characteristic mutational signature seen in NER-deficient ERCC2-mutant bladder cancers, suggesting a novel approach to therapeutically target these ERCC2-mutant solid tumors beyond bladder cancer.
bioinformatics2026-02-19v1Hi-Cformer enables multi-scale chromatin contact map modeling for single-cell Hi-C data analysis
Wu, X.; Chen, X.; Jiang, R.AI Summary
- Hi-Cformer is a transformer-based method designed to model multi-scale chromatin contact maps from single-cell Hi-C data, addressing challenges like sparsity and uneven contact distribution.
- It uses a specialized attention mechanism to capture dependencies across genomic regions and scales, providing robust low-dimensional cell representations and clearer cell type separation.
- Hi-Cformer accurately imputes chromatin interactions, identifies 3D genome features, and extends to cell type annotation with high accuracy across different datasets.
Abstract
Single-cell Hi-C captures the three-dimensional organization of chromatin in individual cells and provides insights into fundamental genomic processes such as gene regulation and transcription. While analyses of bulk Hi-C data have revealed multi-scale chromatin structures like A/B compartments and topologically associating domains, single-cell Hi-C data remain challenging to analyze due to sparsity and uneven distribution of chromatin contacts across genomic distances. These characteristics lead to strong signals near the diagonal and complex multi-scale local patterns in single-cell contact maps. Here, we propose Hi-Cformer, a transformer-based method that simultaneously models multi-scale blocks of chromatin contact maps and incorporates a specially designed attention mechanism to capture the dependencies between chromatin interactions across genomic regions and scales, enabling the integration of both global and fine-grained chromatin interaction features. Building on this architecture, Hi-Cformer robustly derives low-dimensional representations of cells from single-cell Hi-C data, achieving clearer separation of cell types compared to existing methods. Hi-Cformer can also accurately impute chromatin interaction signals associated with cellular heterogeneity, including 3D genome features such as topologically associating domain-like boundaries and A/B compartments. Furthermore, by leveraging its learned embeddings, Hi-Cformer can be extended to cell type annotation, achieving high accuracy and robustness across both intra- and inter-dataset scenarios.
bioinformatics2026-02-18v2BioGraphX: Bridging the Sequence-Structure Gap via Physicochemical Graph Encoding for Interpretable Subcellular Localization Prediction
Saeed, A.; Abbas, W.AI Summary
- BioGraphX introduces a novel encoding framework that constructs protein interaction graphs from sequences using biochemical rules, bypassing the need for 3D structure determination.
- The framework integrates 158 interpretable biophysical features with ESM-2 embeddings, enhancing prediction accuracy on the DeepLoc benchmarks.
- SHAP analysis reveals that BioGraphX-Net uses sequence profiles for exclusion and specific biophysical features for precise localization, with Frustration features aiding in resolving targeting ambiguities.
Abstract
Computational approaches for protein subcellular localization prediction are important for understanding cellular mechanisms and developing treatments for complex diseases. However, a critical limitation of current methods is their lack of interpretability: while they can predict where a protein localizes, they fail to explain why the protein is assigned to a specific location. Moreover, traditional approaches rely on Anfinsen's principle, which assumes that protein behavior is determined by its native three-dimensional structure, requiring a costly and time-consuming process. Here, we propose BioGraphX, a novel encoding framework that constructs protein interaction graphs directly from protein sequences using biochemical rules. This approach eliminates the need for three-dimensional structure determination by encoding 158 interpretable features grounded in biophysical principles. Building upon this representation, BioGraphX-Net demonstrates superior performance on the DeepLoc benchmarks by integrating ESM-2 embeddings with the proposed features via a gating mechanism. Gating analysis shows that although ESM-2 embeddings provide strong contributions, BioGraphX features function as high-precision filters. SHAP analysis shows that BioGraphX-Net encodes a sophisticated biophysical logic: sequence profiles act as universal exclusion filters, while organelle-specific combinations of biophysical features enable precise compartment discrimination. Notably, Frustration features help resolve targeting ambiguities in complex compartments, reflecting evolutionary constraints while preventing mislocalization from sequence mimicry. It has the additional advantage of promoting Green AI in bioinformatics, achieving performance comparable to the state-of-the-art while maintaining a minimal parameter count of 13.46 million. In summary, BioGraphX not only provides accurate predictions but also offers new insights into the language of life.
bioinformatics2026-02-18v2Short linear motifs - Underexplored players driving Toxoplasma gondii infection
Alvarado Valverde, J.; Lapouge, K.; Boergel, A.; Remans, K.; Luck, K.; Gibson, T.AI Summary
- The study explores the role of short linear motifs in Toxoplasma gondii's infection process, focusing on how these motifs facilitate interactions with host proteins.
- A computational pipeline was developed to identify motifs in Toxoplasma secreted proteins, revealing 24,291 motif matches in 295 proteins.
- Experimental validation confirmed the presence of TRAF6-binding motifs in Toxoplasma proteins RON10 and GRA15, highlighting the utility of motif predictions in understanding infection mechanisms.
Abstract
Pathogens infect hosts by interacting with host proteins and exploiting their functions to their advantage. Short linear motifs, small functional regions within intrinsically disordered protein regions, are common mediators of host-pathogen protein interactions. While motifs have been more extensively studied in viruses and bacteria, the extent to which eukaryotic unicellular parasites use motifs during infection remains largely unexplored. Toxoplasma gondii is a widespread intracellular Apicomplexan parasite capable of infecting all warm-blooded animals and invading any of their nucleated cells. Toxoplasma's secreted proteins play a key role in interactions with host proteins during infection, making them potential sources of motifs. To highlight the role of motifs in Toxoplasma gondii infection, we curated 21 known motif instances in Toxoplasma proteins from the scientific literature. To identify further motifs in Toxoplasma secreted proteins, we developed a computational pipeline that annotates putative motif matches with structural and functional features. Through this approach, we identified a set of 24,291 motif matches in 295 secreted proteins. We highlight strategies for further prioritisation of likely functional motif matches by focusing on integrin motifs, degrons and TRAF6-binding motifs. We subjected four predicted TRAF6-binding motifs to experimental validation, supporting the predicted motifs in the Toxoplasma proteins RON10 and GRA15. Our motif predictions provide a valuable resource for generating hypotheses and designing experiments to study infection mechanisms. The characterisation of motifs in Toxoplasma will be key to understanding the molecular principles underlying its broad host range and, more broadly, Apicomplexan infection strategies.
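The pipeline above annotates putative motif matches in secreted proteins; the core matching step for a motif class such as TRAF6 binding can be sketched as a regular-expression scan. This is an illustrative sketch, not the authors' pipeline: the pattern (a commonly cited TRAF6-binding consensus, Pro-x-Glu-x-x-aromatic/acidic) and the toy sequence are assumptions.

```python
import re

# Commonly cited TRAF6-binding consensus: P-x-E-x-x-(aromatic/acidic).
# Both the pattern and the toy sequence are illustrative assumptions.
TRAF6_CONSENSUS = re.compile(r"P.E..[FYWDE]")

def scan_motif(name, seq, pattern=TRAF6_CONSENSUS):
    """Return (protein, 1-based start, matched substring) for every hit."""
    return [(name, m.start() + 1, m.group()) for m in pattern.finditer(seq)]

# Toy secreted-protein sequence containing one match (PMEDAF at position 6).
hits = scan_motif("toy_protein", "MKAVLPMEDAFSTRG")
print(hits)  # [('toy_protein', 6, 'PMEDAF')]
```

A real pipeline would scan curated patterns (e.g. from a motif resource) over the full secretome and then annotate each hit with disorder and conservation features, as the abstract describes.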
bioinformatics · 2026-02-18 · v2
BOND-PEP: topology-conditioned bipartite alignment for evidence-grounded peptide binder generation
Ding, W.
AI Summary
- BOND-PEP is a novel framework for generating peptide binders that uses empirical binding evidence to condition peptide generation explicitly at the residue level.
- It achieves state-of-the-art performance in terms of low perplexity, high hit rates, and sequence novelty, outperforming existing methods.
- The approach provides a practical method for de novo peptide binder design, effective even under conditions of noisy labels and distribution shift.
Abstract
Peptide binders can modulate proteins that remain challenging for small molecules, but discovering high-affinity, selective peptides is still slow and sample-intensive. Sequence-first generators could scale design when structures are unavailable or conformationally heterogeneous, yet they often trade diversity for control: unconstrained sampling is inefficient while conditioning remains largely implicit. This limitation is exacerbated by the uneven transfer of protein language model priors to short peptides. Here we present BOND-PEP, a retrieval-augmented, bipartite-aligned, topology-conditioned framework that converts empirical binding evidence into an explicit, residue-resolved conditioning state for peptide generation. BOND-PEP shows low perplexity together with satisfactory free-generation hit rates and sequence novelty under a fair evaluation protocol and decoding budget. Compared with existing peptide generation methods, BOND-PEP achieves state-of-the-art results that match or improve upon validated peptide-protein sequence pairs. In total, BOND-PEP provides a practical, sequence-only route to controllable de novo peptide binder generation under noisy labels and distribution shift.
bioinformatics · 2026-02-18 · v1
KG-Orchestra: An Open-Source Multi-Agent Framework for Evidence-Based Biomedical Knowledge Graphs Enrichment.
Mohamed, A. H.; Shalaby, K. S.; Kaladharan, A.; Atas Guvenilir, H.; Tom Kodamullil, A.
AI Summary
- KG-Orchestra is a multi-agent framework designed to enrich biomedical knowledge graphs (BKGs) by focusing on specific topics, using Retrieval-Augmented Generation (RAG) for evidence acquisition, validation, and integration.
- Evaluations on specialized contexts like Nelivaptan-Alzheimer's link and gut-brain axis interactions showed that Qwen 3 variants and hybrid retrieval strategies improved reasoning and evidence relevance.
- The framework ensures high triplet integrity and biological validity, is computationally flexible, and supports applications like drug repurposing and pathway completion.
Abstract
Biomedical Knowledge Graphs (BKGs) offer integrative representations of complex biology, yet their utility is compromised by the limitations of current construction methods: manual curation offers high fidelity but is unscalable, whereas purely automated Large Language Model (LLM) approaches often yield broad networks lacking mechanistic granularity. We present KG-Orchestra, an open-source multi-agent framework designed to build specialized, directional, cause-and-effect BKGs by enriching seed graphs. The framework focuses on increasing granularity within specific topics by leveraging Retrieval-Augmented Generation (RAG) to autonomously acquire, validate, and integrate evidence. The system orchestrates specialized agents for retrieval, schema alignment, and triplet validation with explicit, traceable provenance, transforming sparse seeds into dense, high-resolution resources. We evaluated KG-Orchestra on two specialized contexts -- the mechanistic link between Nelivaptan and Alzheimer's Disease (NADKG) and the complex probiotic interactions within the gut-brain axis (ProPreSyn-GBA) -- across varying computational budgets. Our benchmarking results demonstrate that Qwen 3 variants deliver superior reasoning performance and that hybrid retrieval strategies significantly enhance evidence relevance. Furthermore, the multi-agent architecture ensures high triplet integrity and biological validity through iterative cross-checking and self-correction. The framework remains computationally flexible, deploying from single laptop GPUs to high-performance clusters. By bridging knowledge gaps and adding context-aware entities, KG-Orchestra increases reliability while validating seed assertions against up-to-date sources. 
This versatility supports critical downstream applications, including completing missing mechanistic pathways, integrating novel entities for drug repurposing, constructing targeted subgraphs from entity lists, and retroactively validating graph evidence for transparent auditing.
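The abstract describes agents for schema alignment and triplet validation with traceable provenance. As a minimal sketch of the validation idea only, the toy checker below accepts a candidate triplet when its typed pattern matches an allowed schema; the schema, the entity typing (e.g. Nelivaptan as a drug acting on AVPR1A), and the example triplets are all invented for illustration and are not KG-Orchestra's actual agents or data.

```python
# Invented (subject type, relation, object type) schema and entity typing,
# purely to illustrate schema-based triplet validation.
SCHEMA = {
    ("Drug", "inhibits", "Protein"),
    ("Protein", "participates_in", "Pathway"),
    ("Pathway", "associated_with", "Disease"),
}
ENTITY_TYPES = {
    "Nelivaptan": "Drug", "AVPR1A": "Protein",
    "vasopressin signaling": "Pathway", "Alzheimer's Disease": "Disease",
}

def validate(triplets):
    """Keep triplets whose typed pattern matches the schema; flag the rest."""
    kept, rejected = [], []
    for s, r, o in triplets:
        pattern = (ENTITY_TYPES.get(s), r, ENTITY_TYPES.get(o))
        (kept if pattern in SCHEMA else rejected).append((s, r, o))
    return kept, rejected

kept, rejected = validate([
    ("Nelivaptan", "inhibits", "AVPR1A"),
    ("Nelivaptan", "associated_with", "Alzheimer's Disease"),  # type mismatch
])
```

In the framework described above, rejected triplets would presumably be routed back for self-correction or discarded with provenance recorded, rather than silently dropped.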
bioinformatics · 2026-02-18 · v1
Wayfarer: A multiscale framework for spatial analysis of tumor progression
Moses, L.; Herault, A.; Cabon, L.; Dumitrascu, B.
AI Summary
- Wayfarer is a multiscale framework designed to analyze how spatial association metrics in tumor progression change across different spatial scales using spatial -omics data.
- Applied to Xenium data from lung adenocarcinoma, Wayfarer revealed that tumor progression involves shifts in spatial patterns, with increased fine-scale coherence in ERBB2-high regions and coarse-scale clustering of immune markers.
- This framework transforms spatial aggregation from a confounder into a diagnostic tool, available as an R package via Bioconductor.
Abstract
Spatial biology spans multiple length scales, from intracellular organization to tissue-level architecture. Spatial transcriptomics captures this structure, yet most analyses operate at a single spatial resolution, implicitly assuming that biological organization is scale-consistent. In practice, spatial autocorrelation and co-localization are functions of scale, and conclusions can depend on arbitrary aggregation choices. Here we present Wayfarer, a multiscale framework for spatial -omics that tracks how spatial association metrics evolve across nested spatial aggregations, enabling statistical comparison of multiscale structure across biological conditions. Using Xenium data from lung adenocarcinoma (LUAD), we show that spatial patterns often co-exist at fine and coarse scales and that progression is accompanied by reproducible shifts in scale-response profiles. These include increased fine-scale coherence of ERBB2-high tumor regions and coarse-scale clustering of immune-associated markers that are not apparent at a single resolution. Wayfarer converts spatial aggregation from a confounder into a diagnostic signal and is implemented as an R package to be released through Bioconductor.
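Wayfarer's central idea, tracking a spatial association metric across nested aggregations, can be illustrated with a generic statistic. The sketch below bins point measurements at several cell sizes and computes Moran's I (rook adjacency) on each binned grid. This is a minimal illustration of a scale-response profile under simulated gradient data, not Wayfarer's implementation.

```python
import numpy as np

def morans_i(grid):
    """Moran's I on a 2D grid with rook (4-neighbour) binary weights."""
    z = grid - grid.mean()
    num, w_sum = 0.0, 0
    rows, cols = grid.shape
    for i in range(rows):
        for j in range(cols):
            for di, dj in ((1, 0), (0, 1)):  # each unordered pair once
                ni, nj = i + di, j + dj
                if ni < rows and nj < cols:
                    num += 2 * z[i, j] * z[ni, nj]  # count both directions
                    w_sum += 2
    return (grid.size / w_sum) * (num / (z ** 2).sum())

def multiscale_profile(x, y, values, extent, cell_sizes):
    """Bin points at each cell size and record Moran's I of the bin means."""
    profile = {}
    for s in cell_sizes:
        nb = int(np.ceil(extent / s))
        sums, counts = np.zeros((nb, nb)), np.zeros((nb, nb))
        ix = np.minimum((x / s).astype(int), nb - 1)
        iy = np.minimum((y / s).astype(int), nb - 1)
        np.add.at(sums, (ix, iy), values)   # unbuffered accumulation per bin
        np.add.at(counts, (ix, iy), 1)
        means = np.where(counts > 0, sums / np.maximum(counts, 1), 0.0)
        profile[s] = morans_i(means)
    return profile

# Simulated points with a smooth spatial gradient along x: positive
# autocorrelation should appear at every aggregation scale.
rng = np.random.default_rng(0)
x = rng.uniform(0, 100, 2000)
y = rng.uniform(0, 100, 2000)
profile = multiscale_profile(x, y, values=x, extent=100, cell_sizes=[5, 25])
```

Comparing such profiles between conditions, rather than a single-resolution statistic, is the kind of scale-response analysis the abstract describes.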
bioinformatics · 2026-02-18 · v1
Private Information Leakage from Polygenic Risk Scores
Nikitin, K.; Gursoy, G.
AI Summary
- This study investigates the privacy risks of sharing Polygenic Risk Scores (PRSs), demonstrating that PRSs can be used to reconstruct parts of an individual's genome.
- Using dynamic programming and population-based likelihood estimation, the research shows how a single PRS value can reveal genotypes, with increased accuracy when combining multiple PRSs.
- The authors propose an analytical framework to evaluate privacy risks and suggest methods for sharing PRS models while maintaining utility.
Abstract
Polygenic Risk Scores (PRSs) estimate an individual's likelihood of developing complex diseases based on their genetic variation. While their use in clinical practice and direct-to-consumer genetic testing is growing, the privacy implications of publicly sharing PRS values are often underestimated. In this work, we demonstrate that PRSs can be exploited to recover genotypes and to de-anonymize individuals. We describe how to reconstruct a portion of an individual's genome from a single PRS value using dynamic programming and population-based likelihood estimation, which we experimentally demonstrate on PRS panels of up to 50 variants. We highlight the risks of combining multiple, even larger-panel PRSs to improve genotype-recovery accuracy, which can lead to the re-identification of individuals or their relatives in genomic databases, or to the prediction of additional health risks not originally associated with the disclosed PRSs. We then develop an analytical framework to assess the privacy risk of releasing individual PRS values and provide a potential solution for sharing PRS models without decreasing their utility. Our tool and instructions to reproduce our calculations can be found at https://github.com/G2Lab/prs-privacy.
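The core observation, that a disclosed PRS constrains the underlying genotypes, can be illustrated on a tiny panel. The sketch below uses brute-force enumeration rather than the paper's dynamic programming (feasible only for a handful of variants) and ranks score-consistent dosage vectors by Hardy-Weinberg population likelihood; the weights, allele frequencies, and target score are invented.

```python
from itertools import product

weights = [0.12, -0.35, 0.08, 0.21]  # per-variant effect sizes (invented)
freqs = [0.3, 0.1, 0.5, 0.2]         # effect-allele frequencies (invented)

def hwe_prob(g, p):
    """Hardy-Weinberg probability of dosage g in {0, 1, 2} at allele freq p."""
    return [(1 - p) ** 2, 2 * p * (1 - p), p ** 2][g]

def recover(target, tol=1e-9):
    """Enumerate dosage vectors matching the score; rank by population likelihood."""
    candidates = []
    for g in product((0, 1, 2), repeat=len(weights)):
        score = sum(w * d for w, d in zip(weights, g))
        if abs(score - target) <= tol:
            likelihood = 1.0
            for d, p in zip(g, freqs):
                likelihood *= hwe_prob(d, p)
            candidates.append((g, likelihood))
    return sorted(candidates, key=lambda t: -t[1])

# A disclosed score computed from the (hidden) true genotype vector.
true_genotypes = (1, 0, 2, 1)
target = sum(w * d for w, d in zip(weights, true_genotypes))
# On this toy panel the score pins down the genotypes uniquely.
```

With larger panels many vectors match a single score, which is where the paper's dynamic programming and likelihood weighting (and the combination of multiple PRSs) become essential.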
bioinformatics · 2026-02-18 · v1
Structural Characterization of the Type IV Secretion System in Brucella melitensis for Virtual Screening-Based Therapeutic Targeting
Kapoor, J.; Panda, A.; Rajagopal, R.; Kumar, S.; Bandyopadhyay, A.
AI Summary
- The study focused on characterizing the Type IV Secretion System (T4SS) in Brucella melitensis to explore its potential as a therapeutic target for brucellosis.
- Computational modeling and structural analysis of T4SS components were performed, revealing conserved architecture with E. coli T4SS despite low sequence identity.
- Virtual screening identified three promising drug candidates (Ezetimibe, Chlordiazepoxide, Alloin) targeting the VirB11 ATPase dimeric interface, with favorable binding energies confirmed by molecular dynamics simulations.
Abstract
Brucellosis is a globally important zoonotic disease caused by Brucella melitensis, the most virulent and clinically significant species affecting both humans and livestock. Unlike many Gram-negative pathogens, B. melitensis, a facultative intracellular pathogen, lacks conventional virulence factors and instead relies on specialized systems such as the Type IV Secretion System (T4SS) for secretion of effector proteins. In this study, an integrated computational pipeline was implemented to identify, model, and assemble the T4SS components, encoded by the virB operon, from the complete B. melitensis proteome. Template-based modeling strategies were employed to generate structures of T4SS subcomplexes, referencing crystallographic data from the E. coli T4SS. Structural superposition with E. coli homologs revealed a highly conserved architecture despite only 30 to 50% sequence identity. Stereochemical validation confirmed high model quality and favorable interactions among most VirB protein pairs. Membrane insertion analysis of the membrane-embedded assemblies further corroborated the spatial orientation of the modeled T4SS. The potential of the T4SS as a drug target was explored by targeting the dimeric interface of the VirB11 ATPase, aiming to disrupt protein-protein interactions and thereby disarm the pathogen. Virtual screening of compounds from the DrugBank database revealed compounds with docking scores below -7.0 kcal/mol, which were then filtered on ADMET properties, yielding three promising candidates: Ezetimibe (Drug Id: DB00973), Chlordiazepoxide (Drug Id: DB00475), and Alloin (Drug Id: DB15477). MM-GBSA analysis estimated favorable binding free energies for these compounds, and a 200 ns molecular dynamics simulation further confirmed the stability of the protein-ligand interactions. Collectively, these findings provide new insights into the architecture of the B. melitensis T4SS and identify three potential drug molecules targeting the T4SS. This supports repurposing of FDA-approved drugs as an effective anti-virulence strategy against brucellosis.
bioinformatics · 2026-02-18 · v1
Privacy-Preserving Pangenome Graphs
Blindenbach, J.; Soni, S.; Gursoy, G.
AI Summary
- The study introduces PanMixer, a framework for privacy-preserving pangenome graph releases, addressing privacy concerns by selectively obfuscating individual haplotypes.
- PanMixer formulates the privacy-utility trade-off as a knapsack problem, using information theory for privacy risk and graph properties for utility.
- Testing on a draft human pangenome of 47 individuals showed PanMixer reduces re-identification risk while maintaining the accuracy of downstream genomic applications.
Abstract
The human pangenome reference, often represented as a graph, promises to capture genetic diversity across populations, but open release of individual haplotypes raises significant privacy concerns, including risks of re-identification and inference of sensitive traits. To address these challenges, we introduce PanMixer, a framework for privacy-preserving pangenome graph releases that selectively obfuscates an individual's haplotypes while retaining the utility of the reference graph. PanMixer formulates the privacy-utility trade-off as a knapsack problem, where privacy risk is quantified using information theory and utility is measured using graph properties. Using the recently released draft human pangenome containing 47 individuals, we show that PanMixer robustly reduces re-identification risk under linkage attacks and genome reconstruction attempts. We also show that PanMixer preserves the accuracy of key downstream applications, including allele frequency estimation, linkage disequilibrium analysis, and read mapping. By addressing privacy concerns, PanMixer enables the inclusion of individuals, particularly those from underrepresented populations, who might otherwise be reluctant to contribute but seek representation in future genomic studies. Our results provide both a practical tool and a generalizable framework for balancing privacy and utility in future large-scale pangenome references.
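The abstract frames obfuscation choice as a knapsack problem over privacy risk and utility. As a minimal sketch under assumed inputs: each hypothetical haplotype segment carries an invented privacy gain (bits of identifying information removed) and an invented utility cost, and a classic 0/1 knapsack DP picks which segments to obfuscate within a utility budget. This is the generic formulation only, not PanMixer's actual scoring.

```python
# Hypothetical segments: (segment id, privacy gain in bits, utility cost).
# All numbers are invented for illustration.
segments = [
    ("seg1", 6, 4), ("seg2", 3, 2), ("seg3", 5, 3), ("seg4", 4, 5),
]

def select_obfuscations(segments, budget):
    """Classic 0/1 knapsack DP: maximize privacy gain within a utility budget."""
    dp = [(0, ())] * (budget + 1)  # dp[c] = (best gain, chosen ids) at cost <= c
    for sid, gain, cost in segments:
        for c in range(budget, cost - 1, -1):  # reverse: each segment used once
            cand_gain = dp[c - cost][0] + gain
            if cand_gain > dp[c][0]:
                dp[c] = (cand_gain, dp[c - cost][1] + (sid,))
    return dp[budget]

best_gain, chosen = select_obfuscations(segments, budget=9)
print(best_gain, chosen)
```

In the paper's setting, the gains would come from an information-theoretic re-identification risk and the costs from graph-utility measures; the DP itself is the standard knapsack recurrence.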
bioinformatics · 2026-02-18 · v1