Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
Fitness translocation: improving variant effect prediction with biologically-grounded data augmentation
Mialland, A.; Fukunaga, S.; Katsuki, R.; Dong, Y.; Yamaguchi, H.; Saito, Y.
Abstract
Predicting the functional effects of protein variants (variant effect prediction) is essential in protein engineering but remains challenging due to the scarcity of fitness data for training prediction models. To address this limitation, we introduce a data augmentation strategy called fitness translocation, which leverages variant fitness data from homologous proteins to enhance prediction models for a target protein. Using embeddings from protein language models, our method computes the differences between the homolog's wild type and its variants, which are applied to the target wild type to generate its synthetic variants in the embedding space. We evaluate this approach on three protein families: IGPS, GFP, and SARS-CoV-2 spike proteins, under various prediction models and training data sizes. Fitness translocation consistently improves prediction accuracy, especially under limited training data. Moreover, accuracy improvement is observed even between remote homologs with sequence identity as low as 35%. These results highlight the potential of data-efficient protein engineering by reusing fitness data previously accumulated in homologs. The code is available at https://github.com/adrienmialland/ProtFitTrans
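The augmentation step described in the abstract can be illustrated with a minimal sketch. It assumes embeddings are plain fixed-length vectors; the function name `translocate` and the toy numbers are hypothetical and are not taken from the ProtFitTrans repository.

```python
from typing import List, Sequence

Vector = List[float]


def translocate(homolog_wt: Sequence[float],
                homolog_variants: Sequence[Sequence[float]],
                target_wt: Sequence[float]) -> List[Vector]:
    """Apply each homolog variant's offset from its wild type to the target wild type."""
    synthetic = []
    for variant in homolog_variants:
        # Difference between the homolog variant and the homolog wild type.
        delta = [v - w for v, w in zip(variant, homolog_wt)]
        # Synthetic target variant in embedding space.
        synthetic.append([t + d for t, d in zip(target_wt, delta)])
    return synthetic


# Toy 3-dimensional embeddings (hypothetical values).
homolog_wt = [1.0, 0.0, 2.0]
homolog_variants = [[1.5, 0.0, 2.0], [1.0, -1.0, 2.5]]
target_wt = [0.0, 3.0, 1.0]

synthetic_variants = translocate(homolog_wt, homolog_variants, target_wt)
print(synthetic_variants)
```

How the synthetic embeddings are paired with fitness labels and mixed into the training set is not specified in the abstract, so the sketch stops at the embedding arithmetic.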
bioinformatics 2026-03-25 v4
Signature Distance: Generalizing Energy Statistics
Lazzaro, N.; Marchesi, R.; Leonardi, G.; Tessadori, J.; Chierici, M.; Sales, G.; Moroni, M.; Tebaldi, T.; Jurman, G.
Abstract
Comparing empirical distributions is central to generative model evaluation, hypothesis testing and data augmentation in high-dimensional biological data. Established methods such as energy distance summarize each point's relationship to the opposing distribution through a single expected distance, providing sensitivity to location shifts but not to local density or topological structure. We introduce Signature Distance (SD), a metric that compares empirical distributions through the mean absolute difference of their sorted pointwise distance profiles. SD is a structural generalization of energy distance and matches its $\mathcal{O}(n^2)$ computational complexity. On TCGA pan-cancer transcriptomic data, we show that (1) SD detects density changes that energy distance is insensitive to; (2) the per-point SD loss landscape reveals the geometric mechanisms behind known limitations of energy distance as a generative objective; (3) linearly interpolated biological samples that are not detected by energy distance are correctly penalized by SD; (4) SD provides a direct differentiable potential energy for model-free Langevin data expansion, with a bootstrap resampling protocol that stabilises the stopping epoch; and (5) SD is directly usable as a differentiable generative training loss.
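To make the contrast concrete, here is a small one-dimensional sketch. `energy_distance` follows the standard definition (twice the mean cross-sample distance minus the two mean within-sample distances). `sorted_profile_gap` is only a hypothetical reading of "sorted pointwise distance profiles", since the abstract does not give the formula; it should not be taken as the paper's definition of SD. It also assumes equal sample sizes.

```python
from typing import Sequence


def mean_pairwise(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute distance over all pairs (p, q) with p in a and q in b."""
    return sum(abs(p - q) for p in a for q in b) / (len(a) * len(b))


def energy_distance(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Standard energy distance: 2 E|X-Y| - E|X-X'| - E|Y-Y'|."""
    return 2 * mean_pairwise(xs, ys) - mean_pairwise(xs, xs) - mean_pairwise(ys, ys)


def sorted_profile_gap(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Hypothetical illustration: for each pooled point, compare its sorted
    distances to xs with its sorted distances to ys, element by element."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    pooled = list(xs) + list(ys)
    total = 0.0
    for z in pooled:
        to_x = sorted(abs(z - p) for p in xs)
        to_y = sorted(abs(z - q) for q in ys)
        total += sum(abs(a - b) for a, b in zip(to_x, to_y)) / len(to_x)
    return total / len(pooled)


same = [0.0, 1.0, 2.0, 3.0]
shifted = [1.0, 2.0, 3.0, 4.0]

print(energy_distance(same, same), sorted_profile_gap(same, same))
print(energy_distance(same, shifted), sorted_profile_gap(same, shifted))
```

Both functions loop over all pairs of points, which matches the quadratic cost mentioned in the abstract.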
bioinformatics 2026-03-25 v4
DVPNet: A New XAI-Based Interpretable Genetic Profiling Framework Using Nucleotide Transformer and Probabilistic Circuits
Kusumoto, T.
Abstract
In this study, we present an XAI-based genetic profiling framework that quantifies gene importance for distinguishing cancer cells from normal cells based on an interpretable AI decision process. We propose a new explainable AI (XAI) classification model that combines probabilistic circuits with the Nucleotide Transformer. By leveraging the strong feature-extraction capability of the Nucleotide Transformer, we design a tractable classification framework based on probabilistic circuits while preserving probabilistic interpretability. To demonstrate the capability of this framework, we used the GSE131907 single-cell lung cancer atlas and constructed a dataset consisting of cancer-cell and normal-cell classes. From each sample, 900 gene types were randomly selected and converted into embedding vectors using the Nucleotide Transformer, after which the classification model was trained. We then extracted class-specific probabilistic contributions from the tractable model and defined a contribution score for the cancer-cell class. Genetic profiling was performed based on these scores, providing insights into which genes and biological pathways are most important for the classification task. Notably, 1,524 of the 9,540 observed genes showed contribution scores that contradicted what would be expected from their class-wise occurrence frequencies, suggesting that the profiling goes beyond simple statistics by leveraging biological feature representations encoded by the Nucleotide Transformer. The top-ranked genes among these contradictory cases include several well-studied genes in cancer research (e.g., ITGA5, SIGLEC9, NOTUM, and TP73). Overall, these analyses go beyond traditional statistical or gene-expression-level approaches and provide new academic insights for genetic research.
bioinformatics 2026-03-25 v3
Mosaic integration of spatial multi-omics with SpaMosaic
Yan, X.; Fang, Z.; Ang, K. S.; Olst, L. v.; Edwards, A.; Watson, T.; Zheng, R.; Fan, R.; Li, M.; Gate, D.; Chen, J.
Abstract
With the advent of spatial multi-omics, mosaic integration of diverse datasets with partially overlapping modalities enables construction of comprehensive multi-modal spatial atlases from heterogeneous sources. Here, we present SpaMosaic, a tool that employs contrastive learning and graph neural networks to build a modality-agnostic and batch-corrected latent space for spatial domain identification and missing modality imputation. We systematically benchmarked SpaMosaic against existing integration methods using simulated data and experimentally acquired datasets spanning RNA and protein abundance, chromatin accessibility, and histone modifications from brain, embryo, tonsil, and lymph node tissues. SpaMosaic consistently outperformed other methods in identifying coherent spatial domains by reducing noise and mitigating batch effects across diverse technologies and developmental stages. Computationally, SpaMosaic is highly scalable, capable of integrating over 100 sections and processing a single section with more than 800,000 spots. Beyond robust integration, the unified latent space generated by SpaMosaic enables accurate imputation of missing modalities. In a mosaic mouse brain dataset, the imputed histone modifications not only recapitulated expected transcriptome-epigenome correlations but also uncovered more region-specific regulatory links compared to the measured chromatin accessibility data, demonstrating the ability to infer relationships between modalities without co-profiling. In summary, SpaMosaic provides a versatile framework for unifying the rapidly accumulating heterogeneous spatial omics data into comprehensive biological atlases.
bioinformatics 2026-03-25 v3
Cenote-Taker 3 for Fast and Accurate Virus Discovery and Annotation of the Virome
Tisza, M. J.; Varsani, A.; Petrosino, J. F.; Cregeen, S. J. J.
Abstract
Viruses are abundant across all Earth's environments and infect all classes of cellular life. Despite this, viruses are something of a black box for genomics scientists. Their genetic diversity is greater than all other lifeforms combined, their genomes are often overlooked in sequencing datasets, they encode polyproteins, and no function can be inferred for a large majority of their encoded proteins. For these reasons, scientists need robust, performant, well-documented, extensible tools that can be deployed to conduct sensitive and specific analyses of sequencing data to discover virus genomes - even those with high divergence from known references - and annotate their genes. Here, we present Cenote-Taker 3. This command line interface tool processes genome assemblies and/or metagenomic assemblies with modules for virus discovery, prophage extraction, and annotation of genes and other genetic features. Benchmarks show that Cenote-Taker 3 outperforms most tools for virus gene annotation in both speed (wall time) and accuracy. For virus discovery benchmarks, Cenote-Taker 3 performs well compared to geNomad, and these tools produce complementary results. Cenote-Taker 3 is freely available on Bioconda, and its open-source code is maintained on GitHub (https://github.com/mtisza1/Cenote-Taker3).
bioinformatics 2026-03-25 v3
HEDeST: An Integrative Approach to Enhance Spatial Transcriptomic Deconvolution with Histology
Gortana, L.; Chadoutaud, L.; Bourgade, R.; Barillot, E.; Walter, T.
Abstract
Spatial organization of cells is essential for tissue function, yet sequencing-based spatial transcriptomics often lacks single-cell resolution. We present HEDeST, a weakly supervised framework that integrates histology-derived morphological features with deconvolution-derived spot-level proportions to assign cell types at single-cell resolution. HEDeST is robust to technical variability, adaptable to user-defined cell types, and compatible with any deconvolution method. Across simulated and semi-simulated datasets, HEDeST outperforms existing morphology-based approaches and reveals biologically meaningful microenvironments when applied to real cancer datasets, providing a scalable tool for high-resolution spatial tissue analysis.
bioinformatics 2026-03-25 v2
OmicClaw: executable and reproducible natural-language multi-omics analysis over the unified OmicVerse ecosystem.
Zeng, Z.; Wang, X.; Luo, Z.; Zheng, Y.; Hu, L.; Xing, C.; Du, H.
Abstract
Advances in bulk, single-cell and spatial omics have transformed biological discovery, yet analysis remains fragmented across packages with incompatible interfaces, heterogeneous dependencies and limited workflow reproducibility. Here, we present OmicClaw, an executable natural-language framework for multi-omics analysis built on the unified OmicVerse ecosystem and the J.A.R.V.I.S. runtime. OmicVerse organizes upstream processing, preprocessing, single-cell, spatial, bulk-transcriptomic and foundation-model workflows into a shared AnnData-centered interface spanning more than 100 methods. J.A.R.V.I.S. converts this ecosystem into a bounded analytical action space by exposing more than 200 registered functions and classes through a registry-grounded, state-aware and recoverable execution layer that validates prerequisites, preserves provenance and supports iterative repair. Rather than relying on unconstrained code generation, OmicClaw translates user requests into traceable workflows over live omics objects. Across a benchmark of 15 tasks spanning scRNA-seq, spatial transcriptomics, RNA velocity, scATAC-seq, CITE-seq and multiome analysis, ov.Agent improved rubric-based performance over bare one-shot large language model baselines, particularly for long-horizon multi-step workflows. OmicClaw further supports external agent access through an MCP-compatible server and a beginner-friendly web platform for interactive analysis, code execution and million-scale visualization. Together, OmicClaw provides a practical foundation for reproducible human-AI collaboration in modern multi-omics research. OmicClaw is ready to use at https://github.com/Starlitnightly/omicverse
bioinformatics 2026-03-25 v2
Genome-wide maps of transcription factor footprints identify noncoding variants rewiring gene regulatory networks
Lin, J.; Dong, W.; Zhang, J.; Xie, C.; Jing, X.; Zhao, J.; Ma, K.; Kang, H.; Jiang, Y.; Xie, X. S.; Zhao, Y.
Abstract
Genome-wide association studies have identified millions of noncoding loci linked to human traits, yet how these variants alter gene regulation remains a major challenge, particularly for rare variants where whole-genome sequencing cohorts and high-resolution functional annotations remain limited. Here we show that single-molecule deaminase footprinting (FOODIE) in K562 cells captures up to 103-fold heritability enrichment for erythroid traits despite covering 0.12% of the genome. We introduce varTFBridge, integrating FOODIE footprinting with AlphaGenome variant effect prediction to identify causal noncoding variants altering transcription factor (TF)-mediated regulation. Applied to 490,640 UK Biobank genomes across 13 erythrocyte traits, varTFBridge prioritises 113 high-confidence regulatory variants (104 common, 9 rare), encompassing 2,173 linkages along the variant-TF binding-gene-trait cascade across 64 TFs and 108 genes. varTFBridge recapitulates rs112233623 and resolves its mechanism: GATA1/TAL1 co-binding disruption at a CCND3 enhancer altering red blood cell count and volume.
bioinformatics 2026-03-25 v2
Chromatix: a differentiable, GPU-accelerated wave-optics library
Deb, D.; Both, G.-J.; Bezzam, E.; Kohli, A.; Yang, S.; Chaware, A.; Allier, C.; Cai, C.; Anderberg, G.; Eybposh, M. H.; Schneider, M. C.; Heintzmann, R.; Rivera-Sanchez, F. A.; Simmerer, C.; Meng, G.; Tormes-Vaquerano, J.; Han, S.; Shanmugavel, S. C.; Maruvada, T.; Yang, X.; Kim, Y.; Diederich, B.; Joo, C.; Waller, L.; Durr, N. J.; Pegard, N. C.; La Riviere, P. J.; Horstmeyer, R.; Chowdhury, S.; Turaga, S. C.
Abstract
Modern microscopy methods incorporate computational modeling as an integral part of the imaging process, either to solve inverse problems or optimize the optical system design itself. These methods often depend on differentiable optics simulations, yet no standardized framework exists--forcing computational optics researchers to repeatedly and independently implement simulations with limited reusability and performance. These common problems limit the potential impact of computational optics as a field. Here we present Chromatix: an open-source, GPU-accelerated, differentiable wave optics simulation library. Chromatix builds on JAX to democratize fast, parallelized simulation of diverse optical systems and expand the design space in computational optics. Chromatix standardizes a growing collection of optical elements and propagation methods allowing a broad range of applications, which we demonstrate here for snapshot microscopy, holography, and phase retrieval. We demonstrate speed improvements of 2-6x on a single GPU and up to 22x on 8 GPUs.
bioinformatics 2026-03-25 v2
cellSight: Characterizing dynamics of cells using single-cell RNA-sequencing
Chatterjee, R.; Gohel, C.; Shook, B. A.; Taheriyoun, A. R.; Rahnavard, A.
Abstract
Single-cell analysis has transformed our understanding of cellular diversity, offering insights into complex biological systems. Yet, manual data processing in single-cell studies poses challenges, including inefficiency, human error, and limited scalability. To address these issues, we propose the automated workflow cellSight, which integrates high-throughput sequencing in a user-friendly platform. By automating tasks like cell type clustering, feature extraction, and data normalization, cellSight reduces researcher workload, promoting focus on data interpretation and hypothesis generation. Its standardized analysis pipelines and quality control metrics enhance reproducibility, enabling collaboration across studies. Moreover, cellSight's adaptability supports integration with emerging technologies, keeping pace with advancements in single-cell genomics. cellSight accelerates discoveries in single-cell biology, driving impactful insights and clinical translation. It is available with documentation and tutorials at https://github.com/omicsEye/cellSight.
bioinformatics 2026-03-25 v2
Statistical detection of protein sites associated with continuous traits
Duchemin, L.; Muntane, G.; Boussau, B.; Veber, P.
Abstract
Comparative genomic data can be used to look for substitutions in coding sequences that are associated with the variation of a particular phenotypic trait. A few statistical methods have been proposed to do so for phenotypes represented by discrete values. For continuous traits, no such statistical approach has been proposed, and researchers have resorted to sensible but uncharacterized criteria. Here, we investigate a phylogenetic model for coding sequences where amino acid preferences at a site are given by a continuous function of a quantitative trait. This function is inferred from the amino acids and the trait values in extant species and requires inferred point estimates of ancestral values of the trait at internal nodes. For detecting sites whose evolution is associated with this trait, we use a significance test against the hypothesis that amino acid preference does not depend on the trait. This procedure is compared to simpler strategies on simulated alignments. It displays an increased recall for low false positive rates, which is of special importance for performing whole-genome scans. This comes however at a much higher computational cost, and we suggest using a simple test to filter promising candidate sites. We then revisit a dataset of alignments for 62 species of mammals, using longevity as a phenotypic trait. We apply our method to three protein families that have previously been proposed to display sites associated with variation in lifespan in mammals. Using a graphical representation extracted from the detailed phylogenetic analysis of candidate sites, we suggest that the evidence for this in the sequence data alone is weak. The proposed method has been added to our Pelican software. It is available at https://gitlab.in2p3.fr/phoogle/pelican and can now be used with both discrete and continuous phenotypes to search for sites associated with phenotypic variation, on data sets with thousands of alignments.
bioinformatics 2026-03-25 v2
PATTY corrects open chromatin bias for improved bulk and single-cell CUT&Tag profiling
Hu, S. S.; Su, Z.; Liu, L.; Chen, Q.; Grieco, M. C.; Tian, M.; Dutta, A.; Zang, C.
Abstract
Precise profiling of epigenomes is essential for better understanding chromatin biology and gene regulation. Cleavage Under Targets & Tagmentation (CUT&Tag) is an efficient epigenomic profiling technique that can be performed on a low number of cells and at the single-cell level. With its growing adoption, CUT&Tag datasets spanning diverse biological systems are rapidly accumulating in the field. CUT&Tag assays use the hyperactive transposase Tn5 for DNA tagmentation. Tn5's preference toward accessible chromatin alters CUT&Tag sequence read distributions in the genome and introduces open chromatin bias that can confound downstream analysis, an issue more substantial in sparse single-cell data. We show that open chromatin bias extensively exists in published CUT&Tag datasets, including those generated with recently optimized high-salt protocols. To address this challenge, we present PATTY (Propensity Analyzer for Tn5 Transposase Yielded bias), a comprehensive computational method that corrects open chromatin bias in CUT&Tag data by leveraging accompanying ATAC-seq. By integrating transcriptomic and epigenomic data using machine learning and integrative modeling, we demonstrate that PATTY enables accurate and robust detection of occupancy sites for both active and repressive histone modifications, including H3K27ac, H3K27me3, and H3K9me3, with experimental validation. We further develop a single-cell CUT&Tag analysis framework built on PATTY and show improved cell clustering when using bias-corrected single-cell CUT&Tag data compared to using uncorrected data. Beyond CUT&Tag, PATTY sets a foundation for further development of bias correction methods for improving data analysis for all Tn5-based high-throughput assays.
bioinformatics 2026-03-25 v2
EvoRMD: Integrating Biological Context and Evolutionary RNA Language Models for Interpretable Prediction of RNA Modifications
Wang, B.; Zhang, H.; Cui, T.; Wang, X.; Song, J.; Xu, H.
Abstract
RNA modifications are essential regulators of post-transcriptional gene expression, influencing RNA stability, localization, translation, and degradation. Determining the specific modification at a given nucleotide is therefore critical for understanding its regulatory role. Most computational approaches treat each modification type as an independent binary task. This strategy provides a macro-level statistical perspective, but it does not reflect that, under a defined biochemical or cellular condition, only one modification type can occur at a specific site. Current mapping assays also report a single observed modification per site, leaving all other types unlabeled rather than truly negative. These properties motivate a framework that can reason over competing modification types. We introduce EvoRMD, a unified model for biologically contextualized and interpretable prediction of RNA modification types. EvoRMD integrates contextual sequence embeddings from a large-scale RNA language model with structured biological metadata, including species, organ, cell type, and subcellular localization. A lightweight attention mechanism highlights informative sequence positions. A shared multi-class classifier then generates a context-conditioned plausibility distribution over eleven modification types (Am, Cm, Um, Gm, D, pseudouridine, m1A, m5C, m5U, m6A, m7G), consistent with the single-positive, multiple-unlabeled nature of existing datasets. Although trained in a multi-class setting, EvoRMD also produces calibrated multi-label predictions through sigmoid-transformed logits, enabling direct comparison with existing single-modification and multi-label methods. EvoRMD achieves strong predictive performance and offers interpretable insights through attention profiles and motif analyses. Together, these components establish a biologically grounded framework for identifying and prioritizing RNA modification types from sequence and context.
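The two readouts mentioned in the abstract, a multi-class distribution and per-type sigmoid scores from the same logits, can be sketched as follows. The logit values are invented for illustration, and nothing here reproduces EvoRMD's architecture or its calibration procedure.

```python
import math
from typing import List

MOD_TYPES = ["Am", "Cm", "Um", "Gm", "D", "pseudouridine",
             "m1A", "m5C", "m5U", "m6A", "m7G"]


def softmax(logits: List[float]) -> List[float]:
    """Multi-class readout: the scores compete and sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z: float) -> float:
    """Multi-label readout: each score is judged independently."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical logits for one site in one biological context.
logits = [0.1, -0.3, 0.0, 0.2, -1.0, 1.1, 0.4, 0.7, -0.5, 2.3, 0.6]

plausibility = softmax(logits)
per_type = [sigmoid(z) for z in logits]

best = MOD_TYPES[plausibility.index(max(plausibility))]
print(best)
for name, p, s in zip(MOD_TYPES, plausibility, per_type):
    print(f"{name:14s} softmax={p:.3f} sigmoid={s:.3f}")
```

The softmax values express relative plausibility among the eleven types, while the sigmoid values can be thresholded one type at a time, which is what allows comparison with binary, single-modification predictors.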
bioinformatics 2026-03-25 v1
STRmie-HD enables interruption-aware HTT repeat genotyping and somatic mosaicism profiling across sequencing platforms
Napoli, A.; Liorni, N.; Biagini, T.; Giovannetti, A.; Squitieri, A.; Miele, L.; Urbani, A.; Caputo, V.; Gasbarrini, A.; Squitieri, F.; Mazza, T.
Abstract
Short tandem repeat expansions in exon 1 of the HTT gene drive Huntington's disease (HD) pathogenesis, with disease onset and progression heavily influenced by somatic mosaicism and sequence interruptions. While sequencing technologies enable repeat sizing, many computational tools lack the resolution to capture subtle interruption motifs and allele-specific somatic variation. We present STRmie-HD, an alignment-free, de novo framework for interruption-aware genotyping and quantitative profiling of somatic mosaicism at single-read resolution. The tool parses individual reads to quantify uninterrupted CAG tract length, CCG repeat content, and critical interruption variants, including Loss of Interruption (LOI) and Duplication of Interruption (DOI). Validated across Illumina, PacBio SMRT, and Oxford Nanopore platforms, STRmie-HD demonstrates high concordance with reference genotypes and superior sensitivity in identifying rare interruption patterns that conventional tools often overlook. Furthermore, it implements somatic mosaicism metrics to characterize repeat dynamics and successfully distinguishes a higher somatic expansion burden in brain tissues than in peripheral blood. STRmie-HD offers a comprehensive and extensible solution for high-resolution molecular characterization of HTT variation, providing a robust framework for patient stratification and genetic research in HD.
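As a rough illustration of read-level repeat parsing, the sketch below measures the longest uninterrupted run of a repeat unit in a single read with a regular expression. It is not STRmie-HD's algorithm: the helper `longest_repeat` and the synthetic read are hypothetical, and real reads would need handling of sequencing errors, strand, and flanking sequence.

```python
import re


def longest_repeat(read: str, unit: str) -> int:
    """Return the largest number of consecutive copies of `unit` in `read`."""
    runs = re.findall(f"(?:{re.escape(unit)})+", read)
    return max((len(run) // len(unit) for run in runs), default=0)


# Synthetic read: 5 x CAG, a CAA interruption, 1 x CAG, CCG CCA, then 3 x CCG.
read = "GG" + "CAG" * 5 + "CAA" + "CAG" + "CCG" + "CCA" + "CCG" * 3 + "TT"

uninterrupted_cag = longest_repeat(read, "CAG")
ccg_tract = longest_repeat(read, "CCG")
print(uninterrupted_cag, ccg_tract)
```

Because the CAA codon breaks the CAG run, the uninterrupted tract is shorter than the total glutamine-coding stretch; a read in which that CAA is absent would show a longer pure CAG run.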
bioinformatics 2026-03-25 v1
Computational Design and Atomistic Validation of a High-Affinity VHH Nanobody Targeting the PI/RuvC Interface of Streptococcus pyogenes Cas9: A Bivalent Hub Strategy for CRISPR-Cas9 Enhancement
Kumar, N.; Dalal, D.; Sharma, V.
Abstract
The CRISPR-Cas9 system has revolutionized genome engineering, yet its full therapeutic potential remains constrained by challenges in precisely modulating its activity and specificity. Here we report a fully computational end-to-end pipeline for the de novo design of a single-domain VHH nanobody (NbSpCas9-v1) targeting a structurally conserved, non-catalytic epitope at the PAM-interacting (PI) and RuvC-III interface of Streptococcus pyogenes Cas9 (SpCas9; PDB: 4UN3). Nanobody sequences were generated using BoltzGen, a generative diffusion binder design framework, and co-folded with SpCas9 using Boltz-2 to evaluate structural confidence and binding affinity. The top-ranked model (SpCas9_4UN3_Bivalent_Hub_v1) achieved a complex pLDDT of 0.8406, an aggregate score of 0.8016, and an ipTM of >0.8, indicating high confidence in the nanobody-antigen interface. The designed 1,616-residue quaternary complex was subjected to 10 ns of all-atom molecular dynamics simulation using the AMBER14SB force field within the GROMACS/OpenMM framework. The complex stabilized at RMSD approximately 6 Angstrom with a radius of gyration of 39 to 44 Angstrom, confirming thermodynamic stability under physiological conditions (310 K, 0.15 M NaCl). A conserved 96.3 Angstrom inter-molecular distance between the nanobody centroid and the HNH catalytic residue H840 establishes NbSpCas9-v1 as a distal, non-inhibitory binder ideally suited for a Bivalent Hub architecture recruiting secondary effectors to the Cas9 ribonucleoprotein. These results provide a rigorous structural and dynamic foundation for experimental validation of VHH-based CRISPR-Cas9 enhancers and modulators.
bioinformatics 2026-03-25 v1
AI-guided design of candidate BMPR1A-binding peptides for cartilage regeneration: a multi-tool computational benchmarking study
Ahmadov, A.; Ahmadov, O.
Abstract
Bone morphogenetic protein receptor type IA (BMPR1A) is a key mediator of chondrogenesis and a validated therapeutic target for cartilage repair, yet existing BMP mimetic peptides suffer from low potency and the full-length protein (rhBMP-2) carries significant safety risks. Generative AI tools for protein design can now produce de novo peptide binders, but none have been applied to cartilage regeneration targets. Here, we benchmarked four architecturally distinct AI tools--RFdiffusion, BindCraft, PepMLM, and RFpeptides--to design candidate BMPR1A-binding peptides. We generated 192 candidates alongside 98 negative controls (290 total) and evaluated all complexes using AlphaFold 3 structure prediction, dual physics-based energy scoring (PyRosetta and FoldX), and contact recapitulation against the crystallographic BMP-2:BMPR1A interface (PDB: 1REW). A four-metric composite ranking identified a 15-residue PepMLM design (pepmlm_L15_0026) as the top candidate, combining favorable binding energy (PyRosetta dG_separated = -45.9 REU; FoldX DeltaG = -19.4 kcal/mol) with the highest contact recapitulation among top-ranked peptides (11/30 gold-standard interface residues). Designed candidates significantly outperformed controls on ipTM (p = 0.002) and FoldX DeltaG (p < 0.001). BindCraft candidates achieved the highest structural confidence (ipTM up to 0.81) but exhibited moderate contact recapitulation (mean 0.224), consistent with the computational hypothesis that they may engage alternative BMPR1A binding surfaces rather than the native BMP-2 interface. Physicochemical filtering yielded a shortlist of 54 candidates across all four tools. These results establish a reproducible computational framework for AI-guided peptide design targeting cartilage regeneration and identify specific candidates for future experimental validation via binding assays and chondrocyte differentiation studies.
bioinformatics 2026-03-25 v1
CBIcall: a configuration-driven framework for variant calling in large sequencing cohorts
Rueda, M.; Fernandez Orth, D.; Gut, I. G.
Abstract
Motivation: Variant calling for next-generation sequencing (NGS) data relies on a diverse ecosystem of tools and workflows. Large-scale collaborative studies increasingly adopt federated analysis, where each institution processes sensitive data locally using standardized pipelines. Deploying identical pipelines across multiple centers remains challenging because heterogeneous software environments and computing policies can cause workflow divergence and inconsistent results. Results: We developed CBIcall, a workflow-agnostic, configuration-driven framework that runs standardized variant-calling pipelines from raw FASTQ files to analysis-ready VCFs using a single YAML file. An execution driver validates user parameters, enforces compatibility across pipelines, analysis modes, workflow backends, genome builds, and tool versions, and records structured provenance for each run, ensuring consistent and reproducible pipeline execution across computing environments. CBIcall dispatches validated workflows through Bash or Snakemake backends and provides production-ready pipelines for germline WES, WGS (single-sample or cohort joint genotyping following GATK Best Practices), and mitochondrial DNA analysis. We validated CBIcall on public datasets and deployed it in the EU HEREDITARY project, processing 1,111 samples with both WES and mtDNA pipelines on an institutional HPC system, demonstrating its suitability for reproducible cohort-scale genomic analyses. Availability and implementation: CBIcall is open source (GPLv3) and distributed with ready-to-run pipelines; full dependency and installation documentation is available at https://github.com/CNAG-Biomedical-Informatics/cbicall.
bioinformatics 2026-03-25 v1
Visualize, Explore, and Select: A protein Language Model-based Approach Enabling Navigation of Protein Sequence Space for Enzyme Discovery and Mining
Moorhoff, F.; Medina-Ortiz, D.; Kotnis, A.; Hassanin, A.; D. Davari, M.
Abstract
The rapid expansion of protein sequence databases continues to outpace functional characterization, limiting efficient enzyme discovery in large, heterogeneous, and sparsely annotated sequence spaces. Here, we present an embedding-guided framework for structured navigation of enzyme sequence space, implemented in SelectZyme. The workflow integrates protein language model embeddings with dimensionality reduction, hierarchical clustering, connectivity reconstruction, and quantitative dendrogram analysis to enable exploration without reliance on fixed sequence identity thresholds or predefined functional annotations. Across distinct case studies, we demonstrate that embedding-defined neighborhoods preserve fold-level coherence even when sequence identity resides in the twilight zone and that biologically meaningful functional organization can emerge under fully unsupervised conditions. In a large multi-family PETase landscape exceeding 100,000 sequences, the framework supports scalable, anchor-guided prioritization under sparse labeling and process-motivated constraints. By unifying latent representations with connectivity-aware and hierarchical interpretation, this approach reframes enzyme mining from threshold-driven similarity filtering into structured exploration, providing a scalable methodological foundation for hypothesis-driven biocatalyst discovery and reasonable starting points for downstream protein engineering.
bioinformatics 2026-03-25 v1
A universal model for drug-receptor interactions
Menezes, F.; Wahida, A.; Froehlich, T.; Grass, P.; Zaucha, J.; Napolitano, V.; Siebenmorgen, T.; Pustelny, K.; Barzowska-Gogola, A.; Rioton, S.; Didi, K.; Bronstein, M.; Czarna, A.; Hochhaus, A.; Plettenburg, O.; Sattler, M.; Nissen-Meyer, J.; Conrad, M.; Kurzrock, R.; Popowicz, G. M.
Abstract
The genomic landscape of disease holds, in principle, the information required for rational therapeutic design. Genes encode proteins whose functions are tightly coupled to their three-dimensional structures via non-bonded interactions. Since the late 1970s, the advent of macromolecular crystallography inspired the notion that structural knowledge alone could enable a lock-and-key approach to drug design. However, this framework has failed to catalyze a step-change in the generation of new chemical matter. Drug discovery continues to depend on costly and largely serendipitous screening campaigns. Our understanding of, and reasoning from, non-bonded interaction chemistry is still too limited. Compounding this is the scarcity of novel chemistry and infinitesimal coverage of the chemical combinatorial space by current experimental data. To alleviate these problems, we show that a machine learning model can successfully learn and infer the principles of non-bonded interactions in the drug-receptor space. A reductionist approach to training data led to a model generalizing drug-target interactions to truly novel chemical matter without suffering from memorization bias. This work addresses that gap in drug discovery through a theoretical framework for predictive molecular recognition.
bioinformatics 2026-03-24 v10
FlashDeconv enables atlas-scale, multi-resolution spatial deconvolution via structure-preserving sketching
Yang, C.; Chen, J.; Zhang, X.
Abstract
Coarsening Visium HD resolution from 8 to 64 µm can flip cell-type co-localization from negative to positive (r = -0.12 → +0.80), yet investigators are routinely forced to coarsen because current deconvolution methods cannot scale to million-bin datasets. Here we introduce FlashDeconv, which combines leverage-score importance sampling with sparse spatial regularization to match top-tier Bayesian accuracy while processing 1.6 million bins in 153 seconds on a standard laptop. Systematic multi-resolution analysis of Visium HD mouse intestine reveals a tissue-specific resolution horizon (8-16 µm), the scale at which this sign inversion occurs, validated by Xenium ground truth. Below this horizon, FlashDeconv provides, to our knowledge, the first sequencing-based quantification of Tuft cell chemosensory niches (15.3-fold stem cell enrichment). In a 1.6-million-bin human colorectal cancer cohort, FlashDeconv uncovers neutrophil inflammatory microdomains co-localized with immunoregulatory dendritic cells (mRegDC) at the tumor-stroma interface, spatial niches invisible to classification-based methods, which discard 97.7% of the relevant bins.
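Illustrative sketch (not from the preprint): the sign inversion under coarsening can be reproduced on toy data. The abundances below are made up and the code is not FlashDeconv; it only assumes two hypothetical cell types that alternate between neighbouring fine bins while both follow the overall density of their coarse block.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def coarsen(values, factor):
    """Sum consecutive groups of `factor` fine bins into one coarse bin."""
    return [sum(values[i:i + factor]) for i in range(0, len(values), factor)]

# Hypothetical abundances: 4 coarse blocks, each holding 4 fine bins.
block_density = [4, 6, 8, 10]
type_a, type_b = [], []
for density in block_density:
    # Within a block the two types exclude each other bin by bin.
    for offset in (3, -3, 3, -3):
        type_a.append(density + offset)
        type_b.append(density - offset)

r_fine = pearson(type_a, type_b)
r_coarse = pearson(coarsen(type_a, 4), coarsen(type_b, 4))
print(f"fine r = {r_fine:.2f}, coarse r = {r_coarse:.2f}")
```

At the fine scale the local exclusion dominates and the correlation is negative; after summing each block, only the shared density trend remains and the correlation becomes positive.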
bioinformatics 2026-03-24 v3
Deconvolution of omics data in Python with Deconomix -- cellular compositions, cell-type specific gene regulation, and background contributions
Mensching-Buhr, M.; Sterr, T.; Voelkl, D.; Seifert, N.; Tauschke, J.; Engel, L.; Rayford, A.; Straume, O.; Grellscheid, S. N.; Beissbarth, T.; Zacharias, H. U.; Goertler, F.; Altenbuchinger, M. C.
Abstract
Background: Gene expression profiles derived from heterogeneous bulk samples contain signals from various cell populations. Cell-type deconvolution approaches are computational tools to reverse engineer the composition of bulks in terms of cell populations. Accurate estimates of cell compositions are crucial for identifying cell populations relevant for disease. Moreover, analyses, such as the identification of differentially expressed genes, can be confounded by cellular composition, as differences in gene expression may arise from both variations in cellular composition and gene regulation. Results: We present Deconvolution of omics data (Deconomix) - a comprehensive toolbox for the cell-type deconvolution of bulk transcriptomics data, available as a Python package and standalone graphical user interface. Deconomix stands apart from competing solutions with rich functionality and highly efficient implementations. It facilitates (A) the inference of cellular compositions from bulk transcriptomics data, (B) the machine learning-based optimization of gene weights to resolve small cell populations and to disentangle phenotypically related cells, (C) the inference of background contributions which otherwise would deteriorate cell-type deconvolution, and (D) population estimates of cell-type specific gene regulation. To showcase the application of Deconomix, we present a case study on breast cancer data from TCGA, highlighting subtype-specific cellular compositions and cell-type-specific gene-regulatory programs. Conclusion: We present Deconomix, a comprehensive Python package including a graphical user interface for the inference of cellular compositions, cell-type-specific gene regulation, and background contributions from bulk transcriptomics data. Key words: Cell-type deconvolution, Bulk transcriptomics, Gene regulation, Gene expression analysis, Machine learning, Cellular composition inference
bioinformatics 2026-03-24 v2
A comprehensive reference database to support untargeted metabolomics in Pseudomonas putida
Ross, D. H.; Chang, C.; Vasquez, J.; Overstreet, R.; Schultz, K.; Metz, T.; Bade, J.
Abstract
Pseudomonas putida strain KT2440 is a crucial model organism for synthetic biology and bioengineering applications, yet there currently exists no comprehensive metabolomics database comparable to those available for other model organisms. This gap hinders the use of untargeted metabolomics for exploratory analyses in this system. We developed the P. putida metabolome reference database (PPMDB v1) to address this limitation by consolidating metabolite information from multiple sources and expanding coverage through computational predictions. The database was constructed by curating metabolites from BioCyc, BiGG, and other literature sources, then computationally expanding this collection using BioTransformer environmental transformation predictions to generate additional predicted metabolites. We enhanced the database's utility for molecular annotation in metabolomics studies by incorporating analytical properties including collision cross-sections, tandem mass spectra, and gas-phase infrared spectra. These analytical properties were gathered from existing measurement data or predicted using computational tools. We further augmented the database through inclusion of reaction information and pathway annotations, facilitating biological interpretation of metabolomics data. This publicly available resource fills a critical gap in P. putida research infrastructure, supporting metabolite annotation and biological interpretation in untargeted metabolomics studies and enabling in-depth exploratory analyses of this important synthetic biology platform at the molecular level.
bioinformatics 2026-03-24 v1
Col-Ovo: Smartphone-based artificial intelligence for rapid counting of Aedes mosquito eggs under field conditions
Almanza, J.; Montenegro, D.
Abstract
Background: OviCol has recently been proposed as a disruptive strategy for the surveillance and control of synanthropic Aedes mosquitoes, vectors of dengue, Zika, and chikungunya viruses. The approach integrates monitoring and control through ultra-low-cost ovitraps (~0.2 USD), bioattractants, and egg inactivation using hot water. However, large-scale ovitrap surveillance generates thousands of egg substrates that require time-consuming manual counting, creating a major operational bottleneck. To address this limitation, we developed Col-Ovo, an artificial intelligence-based tool for automated counting of Aedes aegypti eggs from real field samples, together with OviLab, a digital platform for annotation, curation, and management of entomological image datasets. Methodology/Principal Findings: The detection model was trained using YOLOv11m on a dataset of 275 oviposition substrates (20.5 cm strips) collected under routine operational conditions. Images were captured in situ without preprocessing and included substrates heavily stained by bioattractants such as blackstrap molasses and dry yeast (Saccharomyces cerevisiae), as well as sand and particulate debris, reflecting realistic field conditions. The system was designed to operate with standard smartphone images and tolerate compression artifacts produced by messaging platforms such as WhatsApp. Performance was evaluated by comparing automated egg counts with expert manual counts and with virtual-human counts conducted in OviLab using >200% image magnification. Col-Ovo achieved >95% agreement with expert counts and 88% agreement with OviLab while reducing processing time from approximately 15 minutes to <3 seconds per sample. Conclusions/Significance: Col-Ovo enables rapid, scalable quantification of Ae. aegypti eggs from smartphone images, addressing a critical operational barrier in ovitrap-based surveillance. 
The system requires no image preprocessing or specialized hardware and is accessible through a lightweight web interface supported by an AI architecture that allows retraining for new ecological contexts or additional Aedes species. Integrated with OviLab, this platform provides a flexible digital infrastructure that can strengthen routine vector surveillance and community-level control programs across regions where Aedes mosquitoes continue to expand.
bioinformatics 2026-03-24 v1
Micro16S: Universal Phylogenetic 16S rRNA Gene Representations for Deep Learning of the Microbiome
Bishop, H. V.; Ogilvie, O. J.; Dobson, R. C. J.; Herbold, C. W.
Abstract
Existing self-supervised microbiome models represent taxa as discrete, independent units restricted to fixed vocabularies, disregarding their evolutionary context. Here we present Micro16S, a deep learning approach that embeds 16S ribosomal RNA gene sequences into a continuous vector space according to phylogenetic relationships derived from the Genome Taxonomy Database. Using a combination of triplet and pair loss objectives, the model learns representations where spatial proximity reflects phylogenetic relatedness, while remaining largely invariant to the specific 16S rRNA region. Evaluations demonstrate taxonomically coherent clustering across most ranks and substantially improved region invariance compared to k-mer frequency baselines. A transformer pretrained on 50,418 unlabelled gut microbiome samples using these embeddings captured biologically meaningful community structure, though classical machine learning baselines outperformed Micro16S across six benchmark classification tasks, highlighting the limitations of the current system. These results establish the feasibility of phylogenetic embeddings for microbiome deep learning and identify mining algorithm design and class imbalance as primary targets for future improvement.
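Illustrative sketch (not from the preprint): the abstract names triplet and pair loss objectives without giving their exact form. The code below assumes the standard triplet margin loss and a simple squared-error pair loss against a target distance; the function names, the margin, and the use of Euclidean distance are all assumptions made for illustration.

```python
from math import sqrt

def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss.

    Zero once the negative (distantly related sequence) is at least
    `margin` farther from the anchor than the positive (close relative).
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

def pair_loss(u, v, target_distance):
    """Squared error between the embedding distance and a target distance,
    for example one derived from a reference phylogeny (assumed here)."""
    return (euclidean(u, v) - target_distance) ** 2

# A well-separated triplet incurs no loss; swapping positive and negative does.
good = triplet_loss((0.0, 0.0), (0.0, 1.0), (3.0, 4.0))
bad = triplet_loss((0.0, 0.0), (3.0, 4.0), (0.0, 1.0))
print(good, bad)
```

Minimising such losses pulls close relatives together and pushes distant taxa apart, which is the property the abstract describes as spatial proximity reflecting phylogenetic relatedness.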
bioinformatics 2026-03-24 v1
MiCBuS: Marker Gene Mining for Unknown Cell Types Using Bulk and Single Cell RNA-Seq Data
Zhang, S.; Lu, Y.; Luo, Q.; An, L.
Abstract
Motivation: Identifying cell type-specific expressed genes (marker genes) is essential for understanding the roles and interactions of cell populations within tissues. To achieve this, the traditional differential analysis approaches are often applied to individual cell-type bulk RNA-seq and single-cell RNA-seq data. However, real-world datasets often pose challenges, such as heterogeneous bulk RNA-seq and incomplete scRNA-seq. Heterogeneous bulk RNA-seq amalgamates gene expression profiles from multiple cell types and results in low resolution, while incomplete scRNA-seq does not capture some cell types from the tissue, leading to unknown cell types. Traditional methods fail to identify marker genes for such unknown cell types. Results: MiCBuS addresses this limitation by generating Dirichlet-pseudo-bulk RNA-seq based on bulk and incomplete single-cell RNA-seq data. By performing differential analysis of gene expressions on bulk and Dirichlet-pseudo-bulk RNA-seq samples, MiCBuS can identify the marker genes of unknown cell types, enabling the identification and characterization of these elusive cellular components. Simulation studies and real data analyses demonstrate that MiCBuS reliably and robustly identifies marker genes specific to unknown cell types, a capability that traditional differential analysis methods cannot achieve.
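Illustrative sketch (not from the preprint): one generic way to build a Dirichlet-weighted pseudo-bulk sample from known cell-type profiles. The abstract does not describe MiCBuS's actual procedure, so the function names, the symmetric concentration parameter, and the toy profiles below are assumptions.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw proportions from a Dirichlet distribution by normalising gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def pseudo_bulk(profiles, concentration, rng):
    """Mix per-cell-type mean expression profiles with Dirichlet proportions.

    profiles: dict mapping cell type -> list of per-gene mean expression.
    Returns (proportions by cell type, mixed per-gene expression).
    """
    cell_types = sorted(profiles)
    props = sample_dirichlet([concentration] * len(cell_types), rng)
    n_genes = len(profiles[cell_types[0]])
    mixed = [
        sum(p * profiles[ct][g] for p, ct in zip(props, cell_types))
        for g in range(n_genes)
    ]
    return dict(zip(cell_types, props)), mixed

# Hypothetical profiles for two cell types captured by scRNA-seq (3 genes).
known_profiles = {"type_1": [10.0, 0.0, 5.0], "type_2": [0.0, 8.0, 5.0]}
proportions, expression = pseudo_bulk(known_profiles, 1.0, random.Random(0))
print(proportions, expression)
```

Because such pseudo-bulk samples contain only the cell types present in the single-cell data, genes that are consistently higher in the real bulk than in the pseudo-bulk are natural candidates for the differential analysis the abstract describes.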
bioinformatics 2026-03-24 v1
TCRseek: Scalable Approximate Nearest Neighbor Search for T-Cell Receptor Repertoires via Windowed k-mer Embeddings
Yang, Y.
Abstract
The rapid growth of T-cell receptor (TCR) sequencing data has created an urgent need for computational methods that can efficiently search CDR3 sequences at scale. Existing approaches either rely on exact pairwise distance computation, which scales quadratically with repertoire size, or employ heuristic grouping that sacrifices sensitivity. Here we present TCRseek, a two-stage retrieval framework that combines biologically informed sequence embeddings with approximate nearest neighbor (ANN) indexing for scalable search over TCR repertoires. TCRseek first encodes CDR3 amino acid sequences into fixed-length numerical vectors through a multi-scale windowed k-mer embedding scheme derived from BLOSUM62 eigendecomposition, then indexes these vectors using FAISS-based structures (IVF-Flat, IVF-PQ, or HNSW-Flat) that support sublinear-time search. A second-stage reranking module refines the shortlisted candidates using exact sequence alignment scores (Needleman-Wunsch with BLOSUM62), Levenshtein distance, or Hamming distance. We benchmarked TCRseek against tcrdist3, TCRMatch, and GIANA on a 100,000-sequence corpus with precomputed exact ground truth under three distance metrics. Under cross-metric evaluation (where the reranking and ground truth metrics differ, providing the most informative test of generalization), TCRseek achieved NDCG@10 = 0.890 (Levenshtein ground truth) and 0.880 (Hamming ground truth), ranking highest among the retained baselines under Hamming and remaining competitive with tcrdist3 (0.894) under Levenshtein. When the reranking metric matches the ground truth definition (BLOSUM62 alignment), NDCG@10 reached 0.993, confirming that the ANN shortlist captures >99% of true neighbors, the expected ceiling of the two-stage design. On the 100,000-sequence corpus, TCRseek achieved 3.6-39.6x speedup over exact brute-force search depending on index type and distance metric, with the largest gains for alignment-based retrieval. 
These results demonstrate that embedding-based ANN search provides a practical and scalable alternative for TCR repertoire analysis.
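Illustrative sketch (not from the preprint): the two-stage shape of the retrieval, shortlist by embedding similarity and then rerank by an exact distance, can be shown with simplified stand-ins. Plain 2-mer counts replace the BLOSUM62-derived windowed embedding, a brute-force cosine scan replaces the FAISS index, and the CDR3-like sequences are made-up examples; only the Levenshtein reranking corresponds to an option named in the abstract.

```python
from collections import Counter
from math import sqrt

def kmer_vector(seq, k=2):
    """Sparse k-mer count vector (a simplified stand-in embedding)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[key] * v[key] for key in u if key in v)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def levenshtein(a, b):
    """Edit distance via the classic row-by-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def search(query, corpus, shortlist_size=3, top_k=2):
    """Stage 1: shortlist by embedding similarity. Stage 2: rerank exactly."""
    qv = kmer_vector(query)
    shortlist = sorted(corpus, key=lambda s: cosine(qv, kmer_vector(s)), reverse=True)
    shortlist = shortlist[:shortlist_size]
    return sorted(shortlist, key=lambda s: levenshtein(query, s))[:top_k]

corpus = ["CASSLGQAYEQYF", "CASSLGQGYEQYF", "CASSPDRGNTIYF", "CAWSVGGNQPQHF"]
hits = search("CASSLGQAYEQYF", corpus)
print(hits)
```

The design point is that the expensive exact distance is computed only for the shortlist, so overall cost is governed by how quickly stage 1 can produce candidates; in a real system an ANN index would take the place of the linear scan used here.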
bioinformatics 2026-03-24 v1
ERFMTDA: Predicting tsRNA-disease associations using an enhanced rotative factorization machine
Lan, W.; Wang, D.; Chen, W.; Yan, X.; Chen, Q.; Pan, S.; Pan, Y.
Abstract
Motivation: tRNA-derived small RNAs (tsRNAs) have emerged as a novel class of regulatory molecules implicated in the pathogenesis of many human diseases, making them promising biomarkers and therapeutic targets. However, existing computational methods for tsRNA-disease association prediction often overlook explicit biological attributes and complex feature interactions, limiting their predictive performance. Results: We propose ERFMTDA, an enhanced rotative factorization machine framework for predicting potential tsRNA-disease associations. ERFMTDA explicitly models complex interactions among heterogeneous biological features while integrating latent structural representations derived from the global association matrix. In addition, a biologically informed negative sampling strategy based on motif-level sequence similarity is introduced to improve the reliability of negative samples. Extensive experiments demonstrate that ERFMTDA consistently outperforms eleven state-of-the-art methods. Case studies on diabetic retinopathy and hepatocellular carcinoma further confirm its ability to prioritize biologically meaningful tsRNA-disease associations.
bioinformatics 2026-03-24 v1
PACMON: Pathway-guided Multi-Omics data integration for interpreting large-scale perturbation screens
Qoku, A.; Stickel, T.; Amerifar, S.; Wolf, S.; Oellerich, T.; Buettner, F.
Abstract
High-throughput perturbation screens coupled with single-cell molecular profiling enable systematic interrogation of gene function, yet interpreting the resulting data in terms of biological pathways remains challenging. Existing approaches either identify latent gene modules without linking them to perturbations, or model perturbation effects without incorporating prior biological knowledge, limiting interpretability and scalability. Here, we introduce PACMON (Pathway-guided Multi-Omics data integration for interpreting large-scale perturbation screens), a Bayesian latent factor model that jointly infers pathway-level programs and their modulation by experimental perturbations. PACMON decomposes multimodal molecular measurements into shared latent factors aligned with known biological pathways through structured sparsity priors, while simultaneously estimating how each perturbation activates or represses these pathway programs. The framework naturally accommodates multiple data modalities and employs stochastic variational inference for scalable application to large datasets. We evaluate PACMON in three settings of increasing complexity. On synthetic data with known ground truth, PACMON achieves near-perfect recovery of pathway structure and perturbation effects, outperforming existing methods in both accuracy and computational scalability. Applied to a multimodal Perturb-CITE-seq screen of melanoma cells, PACMON recovers coherent interferon-signaling and cell-cycle programs spanning RNA and surface-protein modalities and identifies interpretable perturbation-pathway associations consistent with known immune-evasion mechanisms. Finally, we apply PACMON to the Tahoe-100M perturbation atlas - approximately 100 million cells and over 1,000 drug-dose combinations - producing the first pathway-level latent factor analysis at this scale and revealing biologically meaningful drug-response landscapes across Hallmark pathway programs. 
PACMON provides a unified, scalable and interpretable framework for mapping perturbation effects onto biological pathways in modern large-scale perturbation experiments.
bioinformatics 2026-03-24 v1
RiboPipe: efficient per-transcript codon-resolution ribo-seq coverage imputation for low-coverage transcripts
Zhang, Y.-z.; Hashimoto, S.; Li, S.; Inada, T.; Imoto, S.
Abstract
Motivation: Ribosome profiling (Ribo-seq) provides codon-resolution measurements of translation; however, many transcripts exhibit sparse or low read coverage, which limits downstream quantitative analyses. Reliable prediction and imputation of codon-resolution coverage for low-coverage transcripts remain computationally challenging. Results: We present RiboPipe, an efficient framework for per-transcript codon-resolution Ribo-seq coverage imputation for low-coverage transcripts. RiboPipe is designed around three key principles. First, it jointly optimizes transcript-level mean ribosome load (MRL) prediction and codon-level coverage modeling within a unified objective, enabling consistent learning across both local and transcript-level scales. Second, it introduces a peak-weighted loss that emphasizes high-signal codon positions associated with translational pausing, improving the recovery of functionally relevant coverage peaks. Third, the framework is lightweight and data-efficient, achieving stable performance even when trained on only a small fraction of high-coverage transcripts. Using two publicly available Ribo-seq datasets (GSE233886 and GSE133393), we demonstrate stable convergence and consistent prediction accuracy across multiple train-test split ratios. Comparative evaluation of embedding strategies shows that simple one-hot representations achieve competitive or even superior performance compared with pre-trained language model embeddings under identical training conditions. Overall, RiboPipe provides a computationally efficient and scalable framework for Ribo-seq coverage imputation in low-coverage transcripts. Availability and Implementation: The source code and associated data can be accessed at https://github.com/yaozhong/riboPipe
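Illustrative sketch (not from the preprint): one possible reading of a peak-weighted loss combined with a transcript-level term. The abstract does not give RiboPipe's formulas, so the quantile-based peak definition, the weight value, the squared-error form, and every name below are assumptions chosen only to make the idea concrete.

```python
def peak_weighted_mse(predicted, observed, peak_weight=5.0, peak_quantile=0.9):
    """Weighted mean squared error that up-weights high-coverage codons.

    Codons whose observed coverage is non-zero and at or above the chosen
    quantile count `peak_weight` times as much as the remaining codons.
    """
    ranked = sorted(observed)
    threshold = ranked[min(len(ranked) - 1, int(peak_quantile * len(ranked)))]
    weights = [peak_weight if y >= threshold and y > 0 else 1.0 for y in observed]
    weighted = sum(w * (p - y) ** 2 for w, p, y in zip(weights, predicted, observed))
    return weighted / sum(weights)

def joint_loss(pred_cov, obs_cov, pred_mrl, obs_mrl, mrl_weight=1.0):
    """Codon-level loss plus a transcript-level mean ribosome load (MRL) term."""
    return peak_weighted_mse(pred_cov, obs_cov) + mrl_weight * (pred_mrl - obs_mrl) ** 2

# Hypothetical codon coverage with one pause-like peak at position 5.
observed = [0, 1, 0, 2, 0, 10, 0, 1, 0, 0]
misses_peak = [0, 1, 0, 2, 0, 8, 0, 1, 0, 0]   # error of 2 at the peak
misses_flat = [0, 3, 0, 2, 0, 10, 0, 1, 0, 0]  # error of 2 away from the peak
print(peak_weighted_mse(misses_peak, observed), peak_weighted_mse(misses_flat, observed))
```

Under this weighting an error of the same size costs more at the peak than elsewhere, which is the behaviour the abstract attributes to emphasising high-signal codon positions.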
bioinformatics 2026-03-24 v1
From SNPs to Pathways: A genome-wide benchmark of annotation discrepancies and their impact on protein- and pathway-level inference
Queme, B.; Muruganujan, A.; Ebert, D.; Mushayahama, T.; Gauderman, W. J.; Mi, H.
Abstract
Background: Accurate single-nucleotide polymorphism (SNP) annotation is central to genomic research, yet widely used tools and gene models often yield divergent results. Prior studies have shown such discrepancies in small datasets, but the extent of genome-wide variation and its impact on downstream pathway analysis remain unclear. Results: We conducted a comprehensive comparison of three commonly used SNP annotation tools, ANNOVAR, SnpEff, and VEP, using both Ensembl and RefSeq gene models to evaluate more than 40 million SNPs from the Haplotype Reference Consortium. At the protein level, annotation output differed significantly across tools and gene models (p-adj < 0.001), with discrepancies present in both genic and intergenic regions. RefSeq produced broader annotation coverage, particularly for intergenic SNPs, while Ensembl showed greater internal consistency. SnpEff provided the most complete coverage overall, whereas no single tool or model configuration achieved full annotation recovery of the union reference. Integration across tools and models maximized coverage and reduced annotation loss. In a case study of 204 colorectal cancer-associated SNPs from the FIGI GWAS, pathway enrichment results varied depending on annotation strategy. The fully integrated approach identified all four significant pathways, whereas several single-tool or single-model strategies missed one or more. Conclusion: SNP annotation outcomes are influenced by both the tool and gene model used, and relying on a single approach may result in incomplete coverage. A multi-tool, multi-model strategy provides the most comprehensive annotation and preserves enriched pathways, supporting more robust and reproducible genomic interpretation.
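Illustrative sketch (not from the preprint): measuring each annotation strategy's coverage against the union of all strategies reduces to set arithmetic. The strategy labels, SNP identifiers, and gene names below are placeholders, not data from the study.

```python
def coverage_against_union(annotations):
    """Fraction of the union reference recovered by each strategy.

    annotations: dict mapping strategy name -> set of (snp_id, gene) pairs.
    Returns (coverage by strategy, union of all annotations).
    """
    union = set().union(*annotations.values())
    coverage = {name: len(found) / len(union) for name, found in annotations.items()}
    return coverage, union

# Placeholder annotations for three hypothetical tool/gene-model combinations.
toy = {
    "tool_A+model_1": {("rs1", "GENE1"), ("rs2", "GENE2")},
    "tool_B+model_1": {("rs1", "GENE1"), ("rs3", "GENE3")},
    "tool_A+model_2": {("rs2", "GENE2"), ("rs3", "GENE3"), ("rs4", "GENE4")},
}
coverage, union = coverage_against_union(toy)
print(coverage, len(union))
```

In this toy case no single strategy reaches full coverage of the union, mirroring the abstract's observation that integration across tools and models maximises coverage.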
bioinformatics 2026-03-24 v1
dreampy: Pseudobulk mixed-model differential expression for single-cell RNA-seq in Python
Wells, S. B.; Shahnawaz, H.; Jones, J. L.
Abstract
dreampy is a Python implementation of the R dreamlet framework for pseudobulk differential expression analysis of single-cell RNA-seq data. dreamlet combines voom precision-weighted linear mixed models with empirical Bayes moderation to handle batch effects, repeated measures, and other hierarchical structure in multi-donor studies, but exists entirely within the R/Bioconductor ecosystem. dreampy reproduces this pipeline natively in Python, integrating with AnnData and the scverse ecosystem.
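Illustrative sketch (not from the preprint): the pseudobulk step that precedes the mixed-model analysis amounts to summing counts over all cells that share a donor and a cell type. The code below does not use the dreampy or dreamlet API; the function name, the input layout, and the example counts are assumptions.

```python
from collections import defaultdict

def pseudobulk(cells):
    """Sum per-gene counts over cells sharing a (donor, cell type) pair.

    cells: iterable of (donor, cell_type, {gene: count}) tuples.
    Returns {(donor, cell_type): {gene: summed count}}.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for donor, cell_type, counts in cells:
        for gene, count in counts.items():
            totals[(donor, cell_type)][gene] += count
    return {key: dict(genes) for key, genes in totals.items()}

# Hypothetical cells from two donors.
cells = [
    ("donor1", "T cell", {"CD3E": 2, "GAPDH": 5}),
    ("donor1", "T cell", {"CD3E": 1}),
    ("donor2", "T cell", {"GAPDH": 4}),
]
bulk = pseudobulk(cells)
print(bulk)
```

Aggregating to one profile per donor and cell type makes the donor, rather than the individual cell, the unit of replication, which is what allows batch effects and repeated measures to be modelled as the abstract describes.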
bioinformatics 2026-03-24 v1
AI-readiness for Biomedical Data
Clark, T.; Caufield, H.; Parker, J. A.; Al Manir, S.; Amorim, E.; Eddy, J.; Gim, N.; Gow, B.; Goar, W.; Hansen, J. N.; Harris, N.; Hermjakob, H.; Joachimiak, M.; Jordan, G.; Lee, I.-H.; McWeeney, S. K.; Nebeker, C.; Nikolov, M.; Reese, J.; Shaffer, J.; Sheffield, N.; Sheynkman, G.; Stevenson, J.; Chen, J. Y.; Mungall, C.; Wagner, A.; Kong, S. W.; Ghosh, S. S.; Patel, B.; Williams, A.; Munoz-Torres, M. C.
Abstract
Biomedical research is rapidly adopting artificial intelligence (AI). Yet the inherent complexity of biomedical data preparation requires implementing actionable, robust criteria for ethical and explainable AI (XAI) at the "pre-model" stage, encompassing data acquisition, detailed transformations, and ethical governance. Simple conformance to FAIR (Findable, Accessible, Interoperable, Reusable) Principles is insufficient. Here, we define criteria and practices for reliable AI-readiness of biomedical data, developed by the NIH Bridge to Artificial Intelligence (Bridge2AI) Standards Working Group across seven core dimensions of dataset AI-readiness: FAIRness, Provenance, Characterization, Ethics, Pre-model Explainability, Sustainability, and Computability. Conformance to these criteria provides a basis for pre-model scientific rigor and ethical integrity, mitigating downstream risks of bias and error before AI modeling. We apply and evaluate these standards across all four Bridge2AI flagship datasets, spanning functional genomics to clinical medicine, and encode them in machine-actionable metadata bound to the datasets. This framework sets a benchmark for preparing ethical, reusable datasets in biomedical AI and provides standardized methods for reliable pre-model data evaluation.
bioinformatics 2026-03-23 v5
Variable performance of widely used bisulfite sequencing methods and read mapping software for DNA methylation
Kerns, E. V.; Weber, J. N.
Abstract
DNA methylation (DNAm) is the most commonly studied marker in ecological epigenetics, yet the performance of library preparation strategies and bioinformatic tools is seldom assessed in genetically variable natural populations. We profiled DNAm in threespine stickleback (Gasterosteus aculeatus) liver tissue, using reduced representation bisulfite sequencing (RRBS) and whole genome bisulfite sequencing (WGBS) across technical and biological replicates. We additionally collated publicly available RRBS and WGBS data from taxonomically diverse organisms, and then compared how the most commonly used methylation software (Bismark) performed relative to alternative pipelines (BWA-meth, BiSulfite Bolt, and Biscuit). Even after choosing parameters to maximize Bismark's mapping efficiency, it was still outperformed by all other methods. Surprisingly, newer tools overrepresented DNAm compared to older methods, highlighting the importance of testing methods on nonmodel organisms. There were also distinct differences in DNAm profiles produced across library preparation methods, with large impacts of population and read depth filters. Methylated sites unique to WGBS predominantly mapped to introns and intergenic regions, while sites unique to RRBS primarily overlapped with promoters and exons. Moreover, the prevalence of nucleotides with intermediate methylation (within individuals) was greatly reduced in RRBS. Together, these results suggest that RRBS may be more useful for detecting functionally relevant methylation differences. Based on these results, we provide methodological recommendations for improving the reliability and utility of DNAm profiles, particularly concerning the detection of functionally relevant DNAm differences in genetically diverse natural populations.
bioinformatics 2026-03-23 v4
AI-Enhanced Adaptive Virtual Screening Platform Enabling Exploration of 69 Billion Molecules Discovers Structurally Validated FSP1 Inhibitors
Cecchini, D.; Nigam, A.; Tang, M.; Reis, J.; Koop, M.; Gottinger, A.; Nicoll, C. R.; Wang, Y.; Jayaraj, A.; Cinaroglu, S. S.; Törner, R.; Malets, Y.; Gehev, M.; Padmanabha Das, K. M.; Churion, K.; Kim, J.; Thomas, N.; Li, Y.; Seo, H.-S.; Dhe-Paganon, S.; Secker, C.; Haddadnia, M.; Hasson, A.; Li, M.; Kumar, A.; Levin-Konigsberg, R.; Choi, E.-B.; Shapiro, G. I.; Cox, H.; Sebastian, L.; Braithwaite, C.; Bashyal, P.; Radchenko, D. S.; Kumar, A.; Yang, L.; Aquilanti, P.-Y.; Gabb, H.; Alhossary, A.; Wagner, G.; Aspuru-Guzik, A.; Moroz, Y. S.; Kalodimos, C. G.; Fackeldey, K.; Schuetz, J. D.; Mattev
Abstract
Identifying potent lead molecules for specific targets remains a major bottleneck in drug discovery. As structural information about proteins becomes increasingly available, ultra-large virtual screenings (ULVSs), which computationally evaluate billions of molecules, offer a powerful way to accelerate early-stage drug discovery. Here, we introduce AdaptiveFlow, an open-source platform designed to make ULVSs more accessible, scalable, and efficient. AdaptiveFlow provides free access to a screening-ready version of the Enamine REAL Space, the largest library of ready-to-dock, drug-like molecules, containing 69 billion compounds that we prepared using the ligand preparation module of the platform. A key innovation of the platform is its use of a multi-dimensional grid of molecular properties, which helps researchers explore and prioritize chemical space more effectively and reduce computational costs by a factor of approximately 1000. This grid forms the basis of a new method for identifying promising regions of chemical space, enabling systematic exploration and prioritization of compound libraries. An optional active learning component can further accelerate this process by adaptively steering the search toward molecules most likely to bind a given target. To support a broad range of applications, AdaptiveFlow is compatible with over 1,500 docking methods. The platform achieves near-linear scaling on up to 5.6 million CPUs in the AWS Cloud, setting a new benchmark for large-scale cloud computing in drug discovery. Using this approach, we identified nanomolar inhibitors of two disease-relevant targets: ferroptosis suppressor protein 1 (FSP1) and poly(ADP-ribose) polymerase 1 (PARP-1). 
By leveraging newly solved crystal structures of FSP1 in complex with NAD+, FAD, and coenzyme Q1, we validated these hits experimentally and determined the first co-crystal structures of FSP1 bound to small-molecule inhibitors, enabling insights into inhibitor binding mechanisms previously unknown. With its high scalability, flexibility, and open accessibility, AdaptiveFlow offers a powerful new resource for discovering and optimizing drug candidates at an unprecedented scale and speed.
bioinformatics 2026-03-23 v3
Identification of Distinct Topological Structures From High-Dimensional Data
Xu, B.; Braun, R.
Abstract
Single-cell RNA sequencing allows the direct measurement of the expression of tens of thousands of genes, providing an unprecedented view of the transcriptomic state of a cell. Within each cell, different biological processes such as differentiation or cell cycle take place simultaneously, each providing a different characterization of cell state. To identify gene sets that govern these processes for the purpose of disentangling convolved biological processes, we present "Identification of Distinct topological structures" (ID). ID works by constructing an alternative low-dimensional parametrization of the high-dimensional system, applying a finite perturbation to this alternative parametrization, and looking for genes that respond similarly. With this approach, we demonstrate that ID is capable of identifying structures within the data that will otherwise be missed. We further demonstrate the utility of ID in scRNA-seq datasets collected under various backgrounds, delineating cellular differentiation, characterizing cellular response to external perturbation, and dissecting the effect of genetic knock-outs.
bioinformatics 2026-03-23 v3
VINE: Variational inference for scalable Bayesian reconstruction of species and cell-lineage phylogenies
Siepel, A.; Hassett, R.; Staklinski, S. J.
Abstract
Bayesian methods are now widely used in reconstructing both species and cell-lineage phylogenies, but they remain heavily reliant on computationally intensive Markov chain Monte Carlo sampling. Phylogenetic variational inference (VI) circumvents this dependency but so far has been limited in speed and scalability. Here we introduce Variational Inference with Node Embeddings (VINE), a computational method that combines an embedding of taxa in a high-dimensional space and a distance-based "decoder" with several algorithmic innovations to dramatically improve phylogenetic VI. VINE supports both standard DNA substitution models and CRISPR barcode-mutation models for inference of cell-lineage trees and tissue-migration histories. In extensive simulation experiments, we show that VINE is comparable in accuracy to the best available Bayesian methods with speeds orders of magnitude faster. We then apply VINE to ~1,000 complete SARS-CoV-2 genomes and ~900 lung-cancer cell barcodes, showing reductions in compute time from days to hours or minutes.
bioinformatics 2026-03-23 v2
Single-cell spatial multi-omics molecular pathology enabled by SuperFocus
Lu, Y.; Tian, X.; Vicari, M.; Enninful, A.; Bao, S.; Bai, Z.; Liu, C.; Zhang, X.; Andren, P.; Lundeberg, J.; Xu, M. L.; Fan, R.; Xiao, Y.; Ma, Z.
Abstract
Histopathology and molecular pathology are currently distinct diagnostic modalities for the most part, one revealing tissue morphology at cellular resolution and the other providing molecular measurements with limited or no spatial context. Projecting genome-scale molecular information onto histopathology images at single-cell resolution across whole tissue sections represents a long-sought goal for next-generation pathology. Here we present SuperFocus, a modality-agnostic computational platform that generates histopathology-integrated single-cell spatial multi-omics from spot-based spatial measurements acquired on the same or an adjacent section without requiring external reference data. SuperFocus combines constrained cascading imputation with feature-level and cell-level quality-control scores to reduce spurious predictions and quantify confidence. On a ground-truth spatial transcriptomics benchmark dataset, SuperFocus improves key accuracy metrics by 28-73% over existing methods. Across Patho-DBiT, spatial ATAC-RNA, spatial CITE-seq and Visium-MALDI-MSI (SMA) datasets, SuperFocus enables cell-resolved analyses of MALT lymphoma microenvironments, gene regulatory programs in human hippocampus, lipotoxic hepatocyte states in human MASH, and transcriptomic-metabolomic states linked to neurotransmission and neuroinflammation in Parkinsonian mouse brain. Overall, SuperFocus enables scalable whole-slide single-cell spatial multi-omics integrated with histopathology, bridging histology and genome-scale molecular profiling for next-generation molecular pathology.
bioinformatics2026-03-23v2Breaking the Extraction Bottleneck: A Single AI Agent Achieves Statistical Equivalence with Human-Extracted Meta-Analysis Data Across Five Agricultural Datasets
Halpern, M.Abstract
Background: Data extraction is the primary bottleneck in meta-analysis, consuming weeks of researcher time with single-extractor error rates of 17.7%. Existing LLM-based systems achieve only 26-36% accuracy on continuous outcomes, and no study has validated AI-extracted continuous data against multiple independent datasets using formal equivalence testing. Methods: A single AI agent (Claude Opus 4.6) extracted treatment means, control means, sample sizes, and variance measures from source PDFs across five published agricultural meta-analyses spanning zinc biofortification, biostimulant efficacy, biochar amendments, predator biocontrol, and elevated CO2 effects on plant mineral nutrition. Observations were matched to reference standards using an LLM-driven alignment method. Validation employed proportional TOST equivalence testing, ICC(3,1), Bland-Altman analysis, and source-type stratification. Results: Across five datasets, the agent produced 1,149 matched observations from 136 papers. Pearson correlations ranged from 0.984 to 0.999. Proportional TOST confirmed statistical equivalence for all five datasets (all p < 0.05). Table-sourced observations achieved 5.5x lower median error than figure-sourced observations. Aggregate effects were reproduced within 0.01-1.61 pp of published values. Independent duplicate runs confirmed extraction stability (within 0.09-0.23 pp). Conclusions: A single AI agent achieves statistical equivalence with human-extracted meta-analysis data across five independent agricultural datasets. The approach reduces extraction cost by approximately one to two orders of magnitude while maintaining accuracy sufficient for aggregate meta-analytic pooling.
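One of the validation steps named above, Bland-Altman analysis, can be sketched as follows. This is a minimal illustration rather than the author's pipeline: the function name `bland_altman` and the sample values are hypothetical, and the 1.96 multiplier assumes approximately normal differences. The proportional TOST and ICC(3,1) steps are not shown.

```python
from statistics import mean, stdev

def bland_altman(extracted, reference):
    """Return (bias, lower limit, upper limit) of agreement for paired values."""
    if len(extracted) != len(reference):
        raise ValueError("paired inputs must have the same length")
    # Per-observation difference: agent-extracted value minus reference value.
    diffs = [e - r for e, r in zip(extracted, reference)]
    bias = mean(diffs)   # systematic offset between the two sources
    sd = stdev(diffs)    # sample standard deviation of the differences
    # 95% limits of agreement, assuming roughly normal differences.
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Hypothetical treatment means (illustrative values only, not from the paper).
agent_values = [10.2, 15.1, 20.0, 24.8]
reference_values = [10.0, 15.0, 20.0, 25.0]
bias, lower, upper = bland_altman(agent_values, reference_values)
print(f"bias={bias:.3f}, limits of agreement=({lower:.3f}, {upper:.3f})")
```

A bias near zero with narrow limits indicates that the two extraction sources agree closely on individual observations, which complements the correlation figures reported in the abstract.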
bioinformatics2026-03-23v2REMAG: recovery of eukaryotic genomes from metagenomic data using contrastive learning
Gomez-Perez, D.; Raguideau, S.; Warring, S.; James, R.; Hildebrand, F.; Quince, C.Abstract
Metagenome-assembled genomes (MAGs) are central to exploring microbial communities. Yet, despite the relevance of protists and fungi to diverse ecosystems, eukaryotic MAG recovery lags behind that of prokaryotes. A major bottleneck is that most state-of-the-art binning pipelines exclusively rely on prokaryotic single-copy core gene reference databases and are optimized for smaller genomes. To address this gap, we present REMAG (Recovery of Eukaryotic MAGs), a tool designed to recover high-quality eukaryotic genomes suited for long-read metagenomic data. REMAG leverages fine-tuned HyenaDNA genomic foundation models to efficiently filter eukaryotic contigs. It then employs a dual-encoder Siamese network trained with Barlow Twins contrastive loss to learn a shared embedding space by integrating contig composition and differential coverage. Finally, high-quality bins are extracted using greedy iterative Leiden clustering optimized with eukaryotic single-copy core gene constraints. In benchmarks based on simulated mixed prokaryotic/eukaryotic communities and real datasets of varying sizes and origin, we demonstrate REMAG's ability to recover more near-complete eukaryotic genomes than existing state-of-the-art tools, which often produce highly fragmented eukaryotic bins. REMAG provides an automated eukaryotic binning method that scales effectively with the increasing size and sequencing depth of metagenomic datasets.
bioinformatics2026-03-23v2Translating Histopathology Foundation Model Embeddings into Cellular and Molecular Features for Clinical Studies
Cui, S.; Sui, Z.; Li, Z.; Matkowskyj, K. A.; Yu, M.; Grady, W. M.; Sun, W.Abstract
AI-powered pathology foundation models provide general-purpose representations of histopathological images by encoding image tiles into numerical embeddings. However, these embeddings are not directly interpretable in biological or clinical terms and must be translated into biologically meaningful features, such as cell-type composition or gene expression, to enable downstream clinical applications. To bridge this gap, we developed STpath, a framework that integrates histopathology image embeddings derived from existing pathology foundation models with matched, spatially resolved transcriptomics data. STpath consists of cancer-specific XGBoost models trained to infer cell-type compositions and gene expression from histopathology image tiles. We evaluated STpath in colorectal and breast cancer datasets and showed that it provides accurate estimates of the composition of major cell types and the expression of a subset of genes, with further performance gains achieved by combining embeddings from multiple foundation models. Finally, we demonstrated that STpath-inferred features can be used in downstream studies to evaluate their associations with clinical outcomes.
bioinformatics2026-03-23v2ChEA-KG: Human Transcription Factor Regulatory Network with a Knowledge Graph Interactive User Interface
Byrd, A. I.; Evangelista, J. E.; Lachmann, A.; Chung, H.-Y.; Jenkins, S. L.; Ma'ayan, A.Abstract
Gene expression is controlled by transcription factors (TFs) that selectively bind and unbind to DNA to regulate mRNA expression of all human genes. TFs control the expression of other TFs, forming a complex gene regulatory network (GRN) with switches, feedback loops, and other regulatory motifs. Many experimental and computational methods have been developed to reconstruct the human intracellular GRN. Here we present a different approach. By submitting thousands of up and down gene sets from the RummaGEO resource for TF enrichment analysis with ChEA3, we distill signed and directed edges that connect human TFs to construct a high-quality human GRN. The GRN has 131,581 signed and directed edges connecting 701 source TF nodes to 1,559 target TF nodes. The GRN is accessible via the ChEA-KG web server application, which provides interactive network visualization and analysis tools. Users may query the GRN for single TFs or pairs of TFs, or submit gene sets to perform TF enrichment analysis with ChEA3, placing the enriched TFs within the GRN. To demonstrate the utility of ChEA-KG, several TF-centric atlases are also made available via the ChEA-KG website. These atlases host TF subnetworks that regulate 131 major normal human cell-types (Cell Type Atlas); 69 tumour subtypes from 10 cancers (Cancer Atlas); 30 consensus perturbation response signatures for common mechanisms of action (MoA Atlas); and 24 aging signatures from tissues profiled by GTEx. Overall, ChEA-KG is an interactive web-server application that presents to users a new method of exploring the human gene regulatory network through both network visualization and transcription factor enrichment analysis. The ChEA-KG application is available from: https://chea-kg.maayanlab.cloud/.
bioinformatics2026-03-23v2Learning gene interactions from tabular gene expression data using Graph Neural Networks
Boulougouri, M.; Nallapareddy, M. V.; Vandergheynst, P.Abstract
Gene interactions form complex networks underlying disease susceptibility and therapeutic response. While bulk transcriptomic datasets offer rich resources for studying these interactions, applying Graph Neural Networks (GNNs) to such data remains limited by a lack of methodological guidance, especially for constructing gene interaction graphs. We present REGEN (REconstruction of GEne Networks), a GNN-based framework that simultaneously learns latent gene interaction networks from bulk transcriptomic profiles and predicts patient vital status. Evaluated across seven cancer types in the TCGA cohort, REGEN outperforms baseline models in five datasets and provides robust network inference. By systematically comparing strategies for initializing gene-gene adjacency matrices, we derive practical guidelines for GNN application to bulk transcriptomics. Analysis of the learned kidney cancer gene network reveals cancer-related pathways and biomarkers, validating the model's biological relevance. Together, we establish a principled approach for applying GNNs to bulk transcriptomics, enabling improved phenotype prediction and meaningful gene network discovery.
bioinformatics2026-03-23v1Solving the Diagnostic Odyssey with Synthetic Phenotype Data
Colangelo, G.; Marti, M.Abstract
The space of possible phenotype profiles over the Human Phenotype Ontology (HPO) is combinatorially vast, whereas the space of candidate disease genes is far smaller. Phenotype-driven diagnosis is therefore highly non-bijective: many distinct symptom profiles can correspond to the same gene, but only a small fraction of the theoretical phenotype space is biologically and clinically plausible. When a structured ontology exists, this constraint can be exploited to generate realistic synthetic cases. We introduce GraPhens, a simulation framework that uses gene-local HPO structure together with two empirically motivated soft priors, over the number of observed phenotypes per case and phenotype specificity, to generate synthetic phenotype-gene pairs that are novel yet clinically plausible. We use these synthetic cases to train GenPhenia, a graph neural network that reasons over patient-specific phenotype subgraphs rather than flat phenotype sets. Despite being trained entirely on synthetic data, GenPhenia generalizes to real, previously unseen clinical cases and outperforms existing phenotype-driven gene-prioritization methods on two real-world datasets. These results show that when patient-level data are scarce but a structured ontology is available, principled simulation can provide effective training data for end-to-end neural diagnosis models.
bioinformatics2026-03-23v1FuzzyClusTeR: a web server for analysis of tandem and diffuse DNA repeat clusters with application to telomeric-like repeats
Aksenova, A. Y.; Zhuk, A. S.; Lada, A. G.; Sergeev, A. V.; Volkov, K. V.; Batagov, A.Abstract
DNA repeats constitute a large fraction of eukaryotic genomes and play important roles in genome stability and evolution. While tandem repeats such as microsatellites have been extensively studied, the genomic organization and potential functions of dispersed or loosely organized repeat patterns remain poorly understood. Here we present FuzzyClusTeR, a web server for the identification, visualization and enrichment analysis of DNA repeat clusters in genomic sequences. Using parameterized metrics, FuzzyClusTeR detects both classical tandem repeats and regions where related motifs occur in proximity without forming perfect tandem arrays, which we term diffuse (or fuzzy) repeat clusters. The server supports analysis of user-defined sequences as well as genome-scale datasets, including the T2T-CHM13 and GRCh38 human genome assemblies, and provides interactive visualization and statistical tools for assessing the genomic distribution of repetitive motifs and corresponding clusters. As a demonstration, we analyzed telomeric-like repeats in the T2T-CHM13v2.0 genome and identified families of diffuse clusters enriched in these motifs. Comparison with simulated sequences suggests that these clusters represent non-random genomic patterns with potential evolutionary and functional significance. FuzzyClusTeR enables systematic exploration of repeat clustering across genomic regions or entire genomes. It is available at https://utils.researchpark.ru/bio/fuzzycluster
bioinformatics2026-03-23v1EvoMut: A Computational Framework for Engineering Oxidative Stability in Proteins
Arab, S. S.; Lewis, N. E.Abstract
Amino acid oxidation is a major cause of protein instability and loss of function in therapeutic and industrial settings. Although methionine, cysteine, tyrosine, and tryptophan residues are widely recognized as oxidation-prone, only a subset of such residues are dominant functional hotspots, and not all are suitable targets for mutation. Identifying these vulnerable yet engineerable sites remains a major challenge. Here, we present EvoMut, a residue-level analytical framework for evaluating both oxidative vulnerability and mutation feasibility. EvoMut estimates oxidation risk by integrating structural features, local functional context, intrinsic chemical susceptibility, and evolutionary conservation. A central feature of the framework is the explicit separation of oxidation risk from mutation feasibility: candidate substitutions are evaluated only after high-risk residues are identified and ranked by evolutionary substitution patterns. Application of EvoMut to multiple proteins, and evaluation with experimental data, showed that oxidation-prone residues differ markedly in their engineering potential. EvoMut distinguishes residues that are both oxidation-sensitive and evolutionarily permissive from those that are chemically vulnerable but functionally constrained. By providing residue-level mechanistic insight, EvoMut offers a practical framework for the rational design of oxidation-resistant proteins. EvoMut is freely available as a web server at https://evomut.org.
bioinformatics2026-03-23v1Time-Resolved Phosphoproteomics-Guided BFS Beam Search Reveals Cell-Type-Specific EGFR Signaling Architectures and SHP2 Inhibitor-Induced Pathway Rewiring
Lee, H.; Lee, G.Abstract
Background: The epidermal growth factor receptor (EGFR) orchestrates highly context-dependent intracellular signaling networks whose architecture varies across cell types and is frequently rewired by targeted therapeutics. Systems-level reconstruction of these networks from phosphoproteomic data remains challenging because phosphorylation measurements identify signaling nodes but do not reveal the interaction paths that propagate signals between proteins. Results: We developed a computational framework integrating time-resolved phosphoproteomics with graph traversal algorithms to reconstruct EGFR-initiated signaling pathways across three contexts/conditions. A sign-assignment preprocessing procedure converts quantitative phosphorylation measurements into binary activation states across time points, defining a condition-specific active node set that filters the protein-protein interaction network. Breadth-First Search combined with interaction-weighted Beam Search is then applied to the STRING interaction database (v11.5) to enumerate candidate signaling paths. Applying this framework to phosphoproteomic datasets from EGF-stimulated HeLa cells, EGF-stimulated MDA-MB-468 triple-negative breast cancer (TNBC) cells, and EGF-stimulated MDA-MB-468 cells pretreated with the SHP2 inhibitor SHP099 yielded 260 paths in HeLa cells (117 unique topologies), 293 paths in MDA-MB-468 cells (155 unique), and 292 paths under SHP2 inhibition (85 unique). HeLa cells displayed a SRC-centered architecture dominated by ERBB2 and SHC1 first-hop effectors, converging on focal adhesion, HSP90 chaperone, CRKL adaptor, and integrin signaling arms. In contrast, MDA-MB-468 cells showed a PIK3CA/PTPN11 dual-axis architecture integrating direct PI3K engagement with SHP2-mediated GRB2-IRS1-ABL1 signaling. 
SHP2 inhibition abolished PTPN11-mediated pathways and induced PIK3CA dominance (69.2% first-hop), accompanied by compensatory ERBB3 engagement and a computationally predicted SYK/VAV1/LCP2 node set whose biological role warrants experimental validation. Conclusions: Time-resolved phosphoproteomics-guided BFS Beam Search over STRING interaction networks captures cell-type-specific EGFR signaling architectures and drug-induced pathway rewiring. This framework provides a systematic approach for transforming phosphoproteomic measurements into mechanistically interpretable signaling hypotheses for specific cell-type contexts, directly applicable to drug resistance modeling.
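The path-enumeration idea described in this abstract (level-by-level expansion restricted to an active node set, with only the highest-weighted partial paths kept at each hop) can be sketched as below. This is a minimal sketch under stated assumptions, not the authors' implementation: the function `beam_search_paths`, the multiplicative path score, the beam width, and the toy edge weights are all hypothetical and are not STRING values.

```python
from typing import Dict, List, Set, Tuple

# Adjacency map: node -> {neighbor: interaction confidence in (0, 1]}.
Graph = Dict[str, Dict[str, float]]

def beam_search_paths(graph: Graph, source: str, active: Set[str],
                      beam_width: int = 3, max_hops: int = 3
                      ) -> List[Tuple[float, List[str]]]:
    """Expand paths hop by hop (BFS order), keeping the top `beam_width` per hop."""
    beam = [(1.0, [source])]
    found: List[Tuple[float, List[str]]] = []
    for _ in range(max_hops):
        candidates = []
        for score, path in beam:
            for nxt, weight in graph.get(path[-1], {}).items():
                # Follow only active nodes, and never revisit a node (no cycles).
                if nxt in active and nxt not in path:
                    candidates.append((score * weight, path + [nxt]))
        if not candidates:
            break
        # Highest score first; ties broken by path for deterministic output.
        candidates.sort(key=lambda c: (-c[0], c[1]))
        beam = candidates[:beam_width]
        found.extend(beam)
    return found

# Toy network with made-up weights (illustrative only).
toy_graph: Graph = {
    "EGFR": {"SHC1": 0.9, "ERBB2": 0.8, "PIK3CA": 0.7},
    "SHC1": {"SRC": 0.9, "GRB2": 0.6},
    "ERBB2": {"SRC": 0.5},
    "PIK3CA": {"AKT1": 0.9},
}
# Hypothetical condition-specific active set; PIK3CA is treated as inactive here.
active_nodes = {"EGFR", "SHC1", "ERBB2", "SRC", "GRB2"}
paths = beam_search_paths(toy_graph, "EGFR", active_nodes, beam_width=2, max_hops=2)
for score, path in paths:
    print(" -> ".join(path), round(score, 2))
```

Changing the active node set between conditions changes which paths survive the filter, which is the sense in which the same interaction network can yield condition-specific path sets.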
bioinformatics2026-03-23v1The Risk of Gulf Birds Functional Diversity Loss with Climate Change Uncovered Using Deep Learning Population Models
Li, L.; Bai, J.; Sun, S.; Zuzuarregui, M.; Wang, Z.Abstract
Climate change and sea-level rise (SLR) pose increasing threats to coastal ecosystems and biodiversity in the Gulf of America. Most efforts to anticipate these threats focus on species counts or range shifts, while changes in species functional diversity remain unexplored. We estimated the impacts of climate change and sea-level rise on the populations of hundreds of bird species and the corresponding shifts in functional diversity. We used a generative deep learning method, the Gaussian Mixture Variational Autoencoder (GMVAE), together with Trait Probability Density analysis to study these impacts. We found that a generative GMVAE model uncovered species' unobserved ranges, and that climate change reduced coastal ecosystem resilience and caused biodiversity loss across multiple dimensions, including functional richness, redundancy, evenness, and divergence. Surprisingly, the most impacted areas are not the exposed shoreline but the landward coastal transition zones. Specifically, shoreline functional diversity increased with climate change and sea-level rise, whereas uplands showed declining functional diversity and increasing redundancy, indicating contraction of functional trait space. Furthermore, avian biodiversity expanded in coastal protected areas, which serve as refugia embedded in a surrounding landscape where unique combinations of species traits are lost.
bioinformatics2026-03-23v1Rastair: an integrated variant and methylation caller
Etzioni, Z.; Zhao, L.; Hertleif, P.; Schuster-Boeckler, B.Abstract
Cytosine methylation is a crucial epigenetic mark that impacts tissue-specific chromatin conformation and gene expression. For many years, bisulfite sequencing (BS-seq), which converts all non-methylated cytosine (C) to thymine (T), remained the only approach to measure cytosine methylation at base resolution. Recently, however, several new methods that convert only methylated cytosines to thymine (mC→T) have become widely available. Here we present rastair, an integrated software toolkit for simultaneous SNP detection and methylation calling from mC→T sequencing data such as those created with Watchmaker's TAPS+ and Illumina's 5-Base chemistries. Rastair combines machine-learning-based variant detection with genotype-aware methylation estimation. Using NA12878 benchmark datasets, we show that rastair outperforms existing methylation-aware SNP callers and achieves F1 scores exceeding 0.99 for datasets above 30x depth, matching the accuracy of state-of-the-art tools run on whole-genome sequencing data. At the same time, rastair is significantly faster than other genetic variant callers: processing a 30x depth file takes less than 30 minutes given 32 CPU cores on an Intel Xeon, and half as long when a GPU is available. By integrating genotyping with methylation calling, rastair reports an additional 500,000 positions in NA12878 where a SNP turns a non-CpG reference position into a "de-novo" CpG. Conversely, rastair also identifies positions where a variant disrupts a CpG and corrects their reported methylation levels. Rastair produces standard-compliant outputs in VCF, BAM and BED formats, facilitating integration into downstream analysis pipelines. Rastair is open-source and available via conda, Dockerhub, and as pre-compiled binaries from https://www.rastair.com.
bioinformatics2026-03-23v1TogoMCP: Natural Language Querying of Life-Science Knowledge Graphs via Schema-Guided LLMs and the Model Context Protocol
Kinjo, A. R.; Yamamoto, Y.; Bustamante-Larriet, S.; Labra-Gayo, J. E.; Fujisawa, T.Abstract
Querying the RDF Portal knowledge graph maintained by DBCLS, which aggregates more than 70 life-science databases, requires proficiency in both SPARQL and database-specific RDF schemas, placing this resource beyond the reach of most researchers. Large Language Models (LLMs) can, in principle, translate natural-language questions into executable SPARQL, but without schema-level context, they frequently fabricate non-existent predicates or fail to resolve entity names to database-specific identifiers. We present TogoMCP, a system that recasts the LLM as a protocol-driven inference engine orchestrating specialized tools via the Model Context Protocol (MCP). Two mechanisms are essential to its design: (i) the MIE (Metadata-Interoperability-Exchange) file, a concise YAML document that dynamically supplies the LLM with each target database's structural and semantic context at query time; and (ii) a two-stage workflow separating entity resolution via external REST APIs from schema-guided SPARQL generation. On a benchmark of 50 biologically grounded questions spanning five types and 23 databases, TogoMCP achieved a large improvement over an unaided baseline (Cohen's d = 0.92, Wilcoxon $p < 10^{-6}$), with win rates exceeding 80% for question types with precise, verifiable answers. An ablation study identified MIE files as the single indispensable component: removing them reduced the effect to a non-significant level (d = 0.08), while a one-line instruction to load the relevant MIE file recovered the full benefit of an elaborate behavioral protocol. These results suggest a general design principle: concise, dynamically delivered schema context is more valuable than complex orchestration logic.
bioinformatics2026-03-23v1Single-cell Landscape of T Cell Heterogeneity in Kawasaki Disease: STAT3/JAK Axis Regulates the Lineage Differentiation Bias of Th17 Cells
Song, S.; Zong, Y.; Xu, Y.; Chen, L.; Zhou, Y.; Chen, L.; Li, G.; Xiao, T.; Huang, M.Abstract
Background: Kawasaki disease (KD) is a pediatric systemic vasculitis in which T-cell-mediated immune responses play a pivotal role. However, the precise dynamic evolution of T-cell subsets during disease progression remains poorly understood. Methods: Single-cell RNA sequencing (scRNA-seq) was employed to perform high-resolution annotation of peripheral blood mononuclear cells (PBMCs) from healthy controls and KD patients, both pre- and post-IVIG treatment. T-cell developmental trajectories were reconstructed via Monocle3-based pseudotime analysis. Furthermore, the functional significance of the identified pathway was validated in a CAWS-induced KD murine model. Results: A high-resolution single-cell landscape identified 13 distinct T-cell subtypes. Pseudotime analysis revealed a significant lineage commitment of CD4+ T cells toward a Th17 phenotype during the acute phase of KD, synchronized with the transcriptional upregulation of the STAT3/JAK signaling axis. Animal experiments further demonstrated that pharmacological inhibition of this pathway substantially attenuated inflammatory infiltration in the cardiac vasculature of KD mice. Conclusion: This study identifies the STAT3/JAK-mediated Th17 differentiation bias as a potential regulatory program associated with acute inflammation in Kawasaki disease, thereby highlighting the STAT3/JAK axis as a potential therapeutic target.
bioinformatics2026-03-23v1