Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
MiGenPro: A linked data workflow for phenotype-genotype prediction of microbial traits using machine learning.
Loomans, M.; Suarez-Diez, M.; Schaap, P. J.; Saccenti, E.; Koehorst, J. J.
Abstract
The availability of microbial genomic data and the development of machine learning methods have created a unique opportunity to establish associations between genetic information and phenotypes. Here, we introduce a computational workflow for Microbial Genome Prospecting (MiGenPro) that combines phenotypic and genomic information to train machine learning models that predict microbial traits from annotated genomes. Genomes were consistently annotated, and the resulting features were stored in a semantic framework that is easy to query using SPARQL. These data were used to train machine learning models that successfully predicted microbial traits such as motility, Gram stain, optimal temperature range, and sporulation capability. To ensure robustness, a halving grid search was used to determine optimal hyperparameter settings, followed by five-fold cross-validation, which demonstrated consistent model performance across iterations without overfitting. Effectiveness was further validated through comparison with existing models, showing comparable accuracy, with modest variations attributed to differences in datasets rather than methodology. Classification can be further explored using feature importance characterisation to identify biologically relevant genomic features. MiGenPro provides an easy-to-use, interoperable workflow to build and validate models that predict microbial phenotypes from annotated genomes.
bioinformatics · 2026-03-03 · v2
Large-Scale Statistical Dissection of Sequence-Derived Biochemical Features Distinguishing Soluble and Insoluble Proteins
Vu, N. H. H.; Nguyen Bao, L.
Abstract
Protein solubility critically influences recombinant expression efficiency and downstream biotechnological applications. While deep learning models have improved predictive accuracy, the intrinsic magnitude, redundancy, and interpretability of classical sequence-derived determinants remain insufficiently characterized. We performed a statistically rigorous large-scale univariate analysis on a curated dataset of 78,031 proteins (46,450 soluble; 31,581 insoluble). Thirty-six biochemical descriptors were evaluated using Mann-Whitney U tests with Benjamini-Hochberg false discovery rate correction. Effect sizes were quantified using Cliff's δ, and discriminative performance was assessed by ROC-AUC. Although 34 features remained significant after correction, most exhibited small effect sizes and substantial class overlap, consistent with a weak-signal regime. The strongest effects were associated with size-related features (sequence length and molecular weight; δ ≈ -0.21), whereas charge-related descriptors, particularly the proportion of negatively charged residues (δ = 0.150; AUC = 0.575), showed consistent but modest shifts. Spearman correlation analysis revealed near-complete redundancy among major size-related variables (ρ up to 0.998). Applying a redundancy threshold (|ρ| ≥ 0.85), we derived a parsimonious composite integrating sequence length and negative charge proportion, achieving AUC = 0.624 (MCC = 0.1746). These findings demonstrate that sequence-level solubility information is intrinsically low-dimensional and governed by coordinated weak effects, establishing a transparent statistical baseline for large-scale solubility characterization.
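The screen described above pairs a rank-based test with multiple-testing correction and a rank effect size. As an illustrative sketch in plain Python (toy inputs, not the authors' code), Cliff's δ and the Benjamini-Hochberg step-up adjustment each take only a few lines:

```python
from itertools import product

def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all pairs (naive O(n*m))."""
    gt = sum(1 for x, y in product(xs, ys) if x > y)
    lt = sum(1 for x, y in product(xs, ys) if x < y)
    return (gt - lt) / (len(xs) * len(ys))

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (step-up FDR procedure)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank_from_end, i in enumerate(reversed(order)):
        rank = m - rank_from_end                         # 1-based rank of p-value i
        running_min = min(running_min, pvals[i] * m / rank)  # enforce monotonicity
        adjusted[i] = running_min
    return adjusted
```

At this dataset's scale (78,031 proteins), a vectorized or sampled estimate of δ would be preferable to the quadratic pairwise loop shown here.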
bioinformatics · 2026-03-03 · v1
h5adify: neuro-symbolic metadata harmonization enables scalable AnnData integration with local large language models
Rincon de la Rosa, L.; Mouazer, A.; Navidi, M.; Degroodt, E.; Künzle, T.; Geny, S.; Idbaih, A.; Verrault, M.; Labreche, K.; Hernandez-Verdin, I.; Alentorn, A.
Abstract
Background: The rapid growth of public single-cell and spatial transcriptomics repositories has shifted the main bottleneck for atlas-scale integration from data generation to metadata heterogeneity. Even when datasets are released in the AnnData H5AD format, inconsistent column naming, partial annotations, and mixed gene identifier conventions frequently prevent reproducible merging, downstream benchmarking, and reuse in foundation model training. Automated approaches that resolve semantic inconsistency while preserving biological validity are therefore essential for scalable data reuse. Results: We present h5adify, a neuro-symbolic toolkit that combines deterministic biological inference with locally deployed large language models to transform heterogeneous AnnData objects into schema-normalized, integration-ready representations. The framework performs metadata field discovery, gene identifier harmonization, optional paper-aware extraction, and consensus resolution with explicit uncertainty logging. Benchmarking four open-weight model families deployed through Ollama (Gemma, Llama, Mistral, and Qwen) demonstrates that small local models achieve high semantic accuracy in metadata resolution with low hallucination rates and modest computational requirements. In controlled simulations introducing annotation noise into single-cell and Visium-like datasets, harmonization improves integration benchmarking and reduces spurious batch effects. Application to sex-annotated glioblastoma datasets recovers biologically coherent microenvironmental patterns and cell type-specific genomic differences not explained by differential expression alone. Conclusions: Together, h5adify provides a reproducible framework for evaluating LLM-assisted biocuration and enables scalable, privacy-preserving metadata harmonization for modern single-cell atlases and foundation model pipelines. 
These results demonstrate that modular neuro-symbolic integration of deterministic biological inference and small local language models can effectively resolve semantic heterogeneity while remaining computationally accessible.
bioinformatics · 2026-03-03 · v1
snputils: A High-Performance Python Library for Genetic Variation and Population Structure
Bonet, D.; Comajoan Cara, M.; Barrabes, M.; Smeriglio, R.; Agrawal, D.; Aounallah, K.; Geleta, M.; Dominguez Mantes, A.; Thomassin, C.; Shanks, C.; Huang, E. C.; Franquesa Mones, M.; Luis, A.; Saurina, J.; Perera, M.; Lopez, C.; Sabat, B. O.; Abante, J.; Moreno-Grau, S.; Mas Montserrat, D.; Ioannidis, A. G.
Abstract
The increasing size and resolution of genomic and population genetic datasets offer unprecedented opportunities to study population structure and uncover the genetic basis of complex traits and diseases. The collection of existing analytical tools, however, is characterized by format incompatibilities, limited functionality, and computational inefficiencies, forcing researchers to construct fragile pipelines that chain together fragmented command-line utilities and ad hoc scripts. These are difficult to maintain, scale, and reproduce. To address such limitations, we present snputils, a Python library that unifies high-performance I/O, transformation, and analysis of genotype, ancestry, and phenotypic information within a single framework suitable for biobank-scale research. The library provides efficient tools for essential operations, including querying, cleaning, merging, and statistical analysis. In addition, it offers classical population genetic statistics with optional ancestry-specific masking. An identity-by-descent module supports reading of multiple formats, filtering, and ancestry-restricted segment trimming for relatedness and demographic inference. snputils also incorporates ancestry-masking and multi-array functionalities for dimensionality reduction methods, as well as efficient implementations of admixture simulation, admixture mapping, and advanced visualization capabilities. With support for the most commonly used file formats, snputils integrates smoothly with existing tools and clinical databases. At the same time, its modular and optimized design reduces technical overhead, facilitating reproducible workflows that accelerate discoveries in population genetics, genomic research, and precision medicine. Benchmarking demonstrates a significant reduction in genotype data loading time compared to existing Python libraries.
The open-source library is available at https://github.com/AI-sandbox/snputils, with full documentation and tutorials at https://snputils.org.
bioinformatics · 2026-03-03 · v1
Minimum Unique Substrings as a Context-Aware k-mer Alternative for Genomic Sequence Analysis
Adu, A. F.; Menkah, E. S.; Amoako-Yirenkyi, P.; Pandam Salifu, S.
Abstract
Fixed-length k-mers have long been the standard in sequence analysis. However, they impose a uniform resolution across heterogeneous genomes, often resulting in significant redundancy and a loss of contextual sensitivity. To address these limitations, we introduce Minimum Unique Substrings (MUSs), which are variable-length sequence units that adapt to the local complexity of the genome. MUSs function as context-aware markers that naturally define repeat boundaries by extending only until uniqueness is achieved. We build upon the theoretical relationship between MUSs and maximal repeats, extending this framework to sequencing reads by establishing a read-consistent definition of uniqueness. We present a linear-time (O(n)) algorithm based on a generalized suffix tree and introduce the concept of outposts. These outposts act as anchors for uniqueness, enabling precise localization of MUS boundaries within the sequencing data. Empirical studies of E. coli K-12 and human HiFi reads reveal distinct distributions in MUS lengths that reflect their respective genomic architectures. The compact bacterial genome produces a highly dense set of MUSs with a narrow length distribution (averaging 30.44 bp). In contrast, the repeat-rich human genome requires longer substrings to resolve uniqueness, resulting in an increased mean length (36.08 bp) and a broader distribution that delineates complex repetitive elements. The MUS framework achieves 100% unique coverage with an average length of 36.08 bp, surpassing the 69% coverage of k = 61. By reducing the total number of tokens by over 99%, it provides higher resolution and superior data compression compared to fixed-length k-mer sampling. These results demonstrate that MUSs provide a biologically meaningful, context-sensitive alternative to k-mers, with direct applications in genome assembly, repeat characterization, and comparative genomics.
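The defining property of an MUS (a substring that occurs exactly once while every proper substring repeats) can be checked directly on small sequences. The sketch below is a naive, roughly cubic-time illustration of the concept, not the paper's linear-time generalized-suffix-tree algorithm:

```python
def minimal_unique_substrings(s):
    """Return all minimal unique substrings (MUSs) of s: substrings that
    occur exactly once in s while every proper substring occurs at least
    twice. Naive scan, for illustration only."""
    def count(t):
        # occurrences of t in s, overlaps allowed
        return sum(1 for i in range(len(s) - len(t) + 1) if s.startswith(t, i))

    mus = set()
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            t = s[i:j]
            if count(t) == 1:
                # t is the shortest unique substring starting at i, so its
                # prefix t[:-1] repeats by construction; t is an MUS iff its
                # suffix t[1:] also repeats (the empty string always does).
                if count(t[1:]) > 1:
                    mus.add(t)
                break  # longer extensions contain t and cannot be minimal
    return sorted(mus)
```

For example, "banana" yields the MUSs "b" and "nan": both occur once, and every proper substring of each ("na", "an", "n", "a") occurs at least twice, which is exactly the repeat-boundary behavior the abstract describes.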
bioinformatics · 2026-03-03 · v1
Improved prediction of virus-human protein-protein interactions by incorporating network topology and viral molecular mimicry
Zhang, Z.; Feng, Y.; Meng, X.; Peng, Y.
Abstract
The protein-protein interactions (PPIs) between viruses and humans play crucial roles in viral infections. Although numerous computational approaches have been proposed for predicting virus-human PPIs, their performance remains suboptimal and may be overestimated due to the lack of benchmark datasets. To address these limitations, we first constructed a carefully curated benchmark dataset, ensuring non-overlapping PPIs and minimal sequence similarity of both human and viral proteins between the training and test sets. Based on this dataset, we developed vhPPIpred, a machine learning-based prediction method that not only incorporated sequence embedding and evolutionary information but also leveraged network topology and viral molecular mimicry of human PPIs. Comparative experiments demonstrated that vhPPIpred outperformed five state-of-the-art methods on both our benchmark dataset and three independent datasets. vhPPIpred also achieved high computational efficiency, requiring relatively little runtime and memory. Finally, vhPPIpred was demonstrated to have great potential in identifying human virus receptors and in inferring virus phenotypes, as the virus-human PPIs predicted by vhPPIpred can be used to effectively infer virus virulence. In summary, this study provides a valuable benchmark dataset and an effective tool for virus-human PPI prediction, with potential applications in antiviral drug discovery, host-pathogen interaction research, and early warning of emerging viruses.
bioinformatics · 2026-03-03 · v1
LLPSight: enhancing prediction of LLPS-driving proteins using machine learning and protein Language Models
GONAY, V.; VITALE, R.; STEGMAYER, G.; Dunne, M. P.; KAJAVA, A. V.
Abstract
In eukaryotic cells, essential functions are often confined within organelles enclosed by lipid membranes. Increasing evidence, however, highlights the role of membrane-less organelles (MLOs), formed through liquid-liquid phase separation (LLPS). MLO assemblies are typically initiated by "driver" proteins, which form a scaffold to recruit additional "client" molecules. By leveraging expanding MLO datasets and modern machine learning approaches, we developed LLPSight, an ML-based predictor of LLPS-driving proteins. The model was trained using rigorously curated datasets: a positive set of proteins experimentally confirmed to drive LLPS in vivo and a negative set of soluble, unstructured proteins not associated with LLPS. For the features, we employed a cutting-edge approach using embeddings from protein Language Models. LLPSight achieves the highest F1 score (0.885) among existing tools, enabling more efficient discovery of new LLPS drivers eagerly awaited by researchers for experimental validation. An additional key feature of LLPSight is its ability to perform proteome-wide analyses; application to the human proteome yielded promising targets. LLPSight can be obtained from authors upon request.
bioinformatics · 2026-03-03 · v1
In Silico Screening of Indian Medicinal Herb Compounds for Intestinal α-Glucosidase Inhibition with ADMET and Toxicity Assessment for Postprandial Glucose Management in Type-2 Diabetes
Roy, D. A. C.; GHOSH, D. I.
Abstract
Postprandial hyperglycemia is a major concern in type 2 diabetes, and inhibition of intestinal alpha-glucosidases is an established method for controlling post-meal glucose excursions. In this study, we conducted an in-silico screening of phytochemicals from different well-known medicinal plants (Withania somnifera, Rauwolfia serpentina, Curcuma longa, and Camellia sinensis) against MGAM, using the clinically approved inhibitor miglitol as reference for docking protocol validation. Molecular docking revealed that miglitol binds to MGAM with a binding energy of -6.86 kcal/mol and an RMSD of 1.04 relative to the co-crystal structure (PDB ID: 3L4W); however, several phytochemicals exhibited binding affinities equal to or stronger than miglitol. Among these, Withanolide B (-9.25 kcal/mol) and Withanone (-7.57 kcal/mol) from Withania somnifera showed the highest predicted affinities, indicating robust engagement of the MGAM catalytic pocket. Rauwolfia serpentina alkaloids such as yohimbine (-8.50 kcal/mol) and raubasine (-8.46 kcal/mol) also displayed strong binding energies, whereas curcuminoids (curcumin -6.36 kcal/mol; deoxycurcumin -6.35 kcal/mol) and tea catechins (e.g., epicatechin gallate -6.85 kcal/mol) demonstrated moderate affinity. Interaction analysis showed that top-ranking compounds formed extensive hydrogen-bonding and hydrophobic interactions with key catalytic residues of MGAM, suggesting stable occupancy of the active site. In-silico ADME profiling predicted favorable gastrointestinal absorption for lead phytochemicals, supporting their potential for oral intestinal action.
Collectively, these results identify plant-derived ligands with binding energies comparable to or exceeding that of miglitol, highlighting Withania somnifera withanolides as priority candidates for experimental validation in enzyme inhibition assays and glucose tolerance models, and providing a focused set of natural MGAM inhibitors for further translational investigation in postprandial glucose control.
bioinformatics · 2026-03-03 · v1
scUnify: A Unified Framework for Zero-shot Inference of Single-Cell Foundation Models
KIM, D.; Jeong, K.; KIM, K.
Abstract
Foundation models (FMs) pre-trained on large-scale single-cell RNA sequencing (scRNA-seq) data provide powerful cell embeddings, but their practical usability and systematic comparison are limited by model-specific environments, preprocessing pipelines, and execution procedures. To address these challenges, we introduce scUnify, a unified zero-shot inference framework for single-cell foundation models. scUnify accepts a standard AnnData object and automatically manages environment isolation, preprocessing, and tokenization through a registry-based modular design. It employs a hierarchical distributed inference strategy that combines Ray-based task scheduling with multi-GPU data-parallel execution via HuggingFace Accelerate, enabling scalable inference on datasets containing up to one million cells. In addition, built-in integration of scIB and scGraph metrics enables standardized cross-model embedding evaluation within a single workflow. Benchmarking results demonstrate substantial reductions in inference time compared with the original model implementations, while preserving embedding quality and achieving near-linear multi-GPU scaling. scUnify is implemented in Python and is publicly available at https://github.com/DHKim327/scUnify.
bioinformatics · 2026-03-03 · v1
Phenotypic Bioactivity Prediction as Open-set Biological Assay Querying
Sun, Y.; Zhang, X.; Zheng, Q.; Li, H.; Zhang, J.; Hong, L.; Wang, Y.; Zhang, Y.; Xie, W.
Abstract
The traditional drug discovery pipeline is severely bottlenecked by the need to design and execute bespoke biological assays for every new target and compound, which is both time-consuming and prohibitively expensive. While machine learning has accelerated virtual screening, current models remain confined to "closed-set" paradigms, unable to generalize to entirely novel biological assays without target-specific experimental data. Here, we present OpenPheno, a groundbreaking multimodal foundation model that fundamentally redefines bioactivity prediction as an open-set, visual-language question-answering (QA) task. By integrating chemical structures (SMILES), universal phenotypic profiles (Cell Painting images), and natural language descriptions of biological assays, OpenPheno unlocks the highly coveted "profile once, predict many" paradigm. Instead of conducting countless target-specific wet-lab experiments, researchers only need to capture a single, low-cost Cell Painting image of a novel compound. OpenPheno then evaluates this universal phenotypic "fingerprint" against the text-based description of any unseen assay, predicting bioactivity in a zero-shot manner. On 54 entirely unseen assays, it achieves strong zero-shot performance (mean AUROC 0.75), exceeding supervised baselines trained with full labeled data, and few-shot adaptation further improves predictions. In the most stringent setting where both compounds and assays are novel, OpenPheno maintains robust generalization (mean AUROC 0.66), opening up a new paradigm for a highly scalable, cost-effective, and universal engine for next-generation drug discovery.
bioinformatics · 2026-03-03 · v1
Enabling Megascale Microbiome Analysis with DartUniFrac
Zhao, J.; McDonald, D.; Sfiligoi, I.; Lladser, M. E.; Patel, L.; Weng, Y.; Khatib, L.; Degregori, S.; Gonzalez, A.; Lozupone, C.; Knight, R.
Abstract
We introduce a new algorithm, DartUniFrac, and a near-optimal implementation with GPU acceleration, up to three orders of magnitude faster than the state of the art and scaling to millions of samples (pairwise) and billions of taxa. DartUniFrac connects UniFrac with weighted Jaccard similarity and exploits sketching algorithms for fast computation. We benchmark DartUniFrac against exact UniFrac implementations, demonstrating that DartUniFrac is statistically indistinguishable from them on real-world microbiome and metagenomic datasets.
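The stated connection between UniFrac and weighted Jaccard similarity rests, as commonly formulated, on viewing each sample as a vector of branch-length-weighted presences over the tree, so that unweighted UniFrac equals one minus the weighted Jaccard similarity of those vectors. A minimal sketch of the weighted Jaccard (Ruzicka) distance itself, on toy vectors and without the sketching acceleration DartUniFrac adds:

```python
def weighted_jaccard_distance(u, v):
    """Weighted Jaccard (Ruzicka) distance between equal-length vectors of
    non-negative weights: 1 - sum(min)/sum(max). Two all-zero vectors are
    treated as identical (distance 0)."""
    num = sum(min(a, b) for a, b in zip(u, v))
    den = sum(max(a, b) for a, b in zip(u, v))
    return 1.0 - (num / den if den else 1.0)
```

Recasting UniFrac this way is what lets sketching methods for weighted Jaccard (the "dart" style estimators) replace exact pairwise computation.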
bioinformatics · 2026-03-03 · v1
Evaluating Few-Shot Meta-Learning using STUNT for Microbiome-Based Disease Classification
Peng, C.; Abeel, T.
Abstract
The human gut microbiome is increasingly explored as a diagnostic indicator for disease, yet machine learning models trained on metagenomic data are often constrained by limited sample sizes and poor cross-cohort generalizability. Meta-learning, a machine learning paradigm that optimizes models for rapid adaptation to new tasks with limited examples, offers a promising strategy to address this by leveraging the potential shared microbial structure across publicly available metagenomic datasets. Here, we evaluated STUNT, a framework combining self-supervised pretraining with metric-based meta-learning (Prototypical Networks), for few-shot microbiome-based disease classification. Using over 5,000 species-level gut metagenomic profiles from 57 cohorts in GMrepo v2, we meta-trained STUNT on 52 cohorts and evaluated the pretrained embedding on five held-out disease cohorts covering rheumatoid arthritis (RA), gestational diabetes mellitus during pregnancy (GDM), non-alcoholic fatty liver disease (NAFLD), diabetes mellitus, type 1 (T1D), and inflammatory bowel disease (IBD). We compared Prototypical Networks, Logistic Regression, and Random Forest with and without STUNT-derived embeddings across shot sizes of 1 to 10 samples per class. We found that STUNT-derived embeddings provided a modest benefit only under extreme data scarcity (one labeled sample per class) and this advantage rapidly diminished and reversed with additional samples, indicating that the meta-learned representations impose an information bottleneck limiting access to task-specific signals. Classification performance varied substantially across cohorts, consistent with PERMANOVA-estimated microbiome-disease separability. These results highlight the need for representation learning approaches that preserve disease- and cohort-specific variation and suggest that intrinsic biological signal strength is the primary determinant of classification success.
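At evaluation time, Prototypical Networks reduce to a nearest-class-mean rule in the embedding space: each class prototype is the mean of its few labeled support embeddings, and a query is assigned to the closest prototype. A minimal sketch of that step in plain Python (toy 2-D embeddings and Euclidean distance; the study's embeddings come from STUNT pretraining):

```python
import math

def prototype_classify(support, query):
    """Few-shot metric classification: `support` maps label -> list of
    embedding vectors; returns the label whose mean embedding (prototype)
    is nearest to `query` in Euclidean distance."""
    prototypes = {
        label: [sum(dim) / len(vecs) for dim in zip(*vecs)]
        for label, vecs in support.items()
    }
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(prototypes, key=lambda label: dist(prototypes[label], query))
```

With one support sample per class the prototype is just that sample, which is the extreme-scarcity regime where the abstract reports STUNT-derived embeddings still helped.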
bioinformatics · 2026-03-03 · v1
Towards Cross-Sample Alignment for Multi-Modal Representation Learning in Spatial Transcriptomics
Dai, J.; Nonchev, K.; Koelzer, V. H.; Raetsch, G.
Abstract
The growing number of spatial transcriptomics (ST) datasets enables comprehensive multi-modal characterization of cell types across diverse biological and clinical contexts. However, integration across patient cohorts remains challenging, as local microenvironment, patient-specific variability, and technical batch effects can dominate signals. Here, we hypothesize that combining specialized transcriptomics correction methods with deep representation learning can jointly align morphology, transcriptomics, and spatial information across multiple tissue samples. This approach benefits from recent transcriptomics and pathology foundation models, projecting cells into a shared embedding space where they cluster by cell type rather than dataset-specific conditions. Applying this framework to 18 skin melanoma, 12 human brain, and 4 lung cancer datasets, we demonstrate that it outperforms conventional batch-correction approaches by 58%, 38%, and 2-fold, respectively. Together, this framework enables efficient integration of multi-modal ST data across modalities and samples, facilitating the systematic discovery of conserved cellular programs and spatial niches while remaining robust to cohort-specific batch effects. Code availability: https://github.com/ratschlab/aestetik
bioinformatics · 2026-03-03 · v1
Pinc: a simple probabilistic AlphaFold interaction score
Toth-Petroczy, A.; Badonyi, M.
Abstract
Motivation: Screening of interacting proteins with AlphaFold has become widespread in biological research owing to its utility in generating and testing hypotheses. While several model quality and interaction confidence metrics have been developed, their interpretation is not always straightforward.
Results: Here, building on a previously published method, we address this limitation by converting predicted aligned errors of an AlphaFold model into conditional contact probabilities. We show that, without additional parametrisation, the contact probabilities are readily calibrated to the fraction of native contacts observed across experimentally determined protein dimers. We find that the average contact probability for interacting chains, termed Pinc (probability of interface native contacts), is more sensitive to interactions involving smaller interfaces than many commonly used scores. We provide an R script to calculate Pinc for AlphaFold models, and propose its use as an alternative scoring metric for interaction screens and for prioritising interface residues for experimental validation.
Availability and implementation: An R script and a Colab notebook are available at https://git.mpi-cbg.de/tothpetroczylab/Pinc
bioinformatics · 2026-03-03 · v1
Characterizing and Mitigating Protocol-Dependent Gene Expression Bias in 3' and 5' Single-Cell RNA Sequencing
Shydlouskaya, V.; Haeryfar, S. M. M.; Andrews, T. S.
Abstract
Single-cell RNA sequencing (scRNA-seq) has enabled large-scale characterization of cellular heterogeneity; yet, integrating datasets generated through different library preparation protocols remains challenging. For instance, comparisons between 10X Genomics 3' and 5' chemistries are complicated by protocol-dependent technical biases imposed by differences in transcript end capture and amplification. While normalization, and often batch correction, is an integral step in preprocessing scRNA-seq datasets, it remains unclear which correction is most appropriate, or even necessary, for reliable cross-protocol comparisons. Here, we systematically characterize protocol-related expression differences using 35 matched donors across six tissues profiled with both 3' and 5' scRNA-seq approaches. We find that gene expression discrepancies are not pervasive across the whole transcriptome, but driven instead by a relatively small, reproducible subset of protocol-biased genes. Excluding these genes improves cross-protocol concordance, indicating that most genes are directly comparable without aggressive correction. We then benchmark commonly employed normalization approaches and show that while several methods, such as fastMNN, improve statistical alignment when cell populations are well matched, they can distort gene-level signals and inflate differential expression in biologically realistic settings with incomplete cell-type overlap. Taken together, our results demonstrate that protocol bias between 3' and 5' scRNA-seq is limited in scope and that targeted handling of a small set of biased genes presents an alternative approach to normalization or batch correction strategies. This work provides a practical guideline for integrating 3' and 5' scRNA-seq data and highlights the importance of matching normalization strategies to the structure of technical variation and the intended downstream analyses.
bioinformatics · 2026-03-03 · v1
selscape: A Snakemake Workflow for Investigating Genomic Landscapes of Natural Selection
Chen, S.; Huang, X.
Abstract
Analyzing natural selection is a central task in evolutionary genomics, yet applying multiple tools across populations in a reproducible and scalable manner is often complicated by heterogeneous input formats, parameter settings, and tool dependencies. Here, we present selscape, a Snakemake workflow that automates end-to-end genome-wide selection analysis--from input preparation and statistic calculation to functional annotation, downstream visualization, and summary reporting. We demonstrate selscape on high-coverage genomes from the 1000 Genomes Project, illustrating how the workflow enables efficient, large-scale analyses and streamlined comparisons across populations. By unifying diverse tools with Snakemake, selscape lowers the barrier to robust genome-wide analyses and provides a flexible framework for future extensions and integration with complementary population genetic analyses.
bioinformatics · 2026-03-03 · v1
The limits of Bayesian estimates of divergence times in measurably evolving populations
Ivanov, S.; Fosse, S.; dos Reis, M.; Duchene, S.
Abstract
Bayesian inference of divergence times for extant species using molecular data is an unconventional statistical problem: divergence times and molecular rates are confounded, and only their product, the molecular branch length, is statistically identifiable. This means we must use priors on times and rates to break this non-identifiability. As a consequence, there is a lower bound on the uncertainty that can be attained under infinite data for estimates of evolutionary timescales using the molecular clock. With infinite data (i.e., an infinite number of sites and loci in the alignment), uncertainty in the ages of nodes in phylogenies increases proportionally with their mean age, such that older nodes have higher uncertainty than younger nodes. On the other hand, if extinct taxa are present in the phylogeny, and if their sampling times are known (i.e., 'heterochronous' data), then times and rates are identifiable and the uncertainties of inferred times and rates go to zero with infinite data. However, in real heterochronous datasets (such as viruses and bacteria), alignments tend to be small, and how much uncertainty is present and how it can be reduced as a function of data size are questions that have not been explored. This is clearly important for our understanding of the tempo and mode of microbial evolution using the molecular clock. Here we conducted extensive simulation experiments and analyses of empirical data to develop the infinite-sites theory for heterochronous data. Contrary to expectations, we find that uncertainty in the ages of internal nodes scales positively with the distance to their closest tip with known age (i.e., calibration age), not with their absolute age. Our results also demonstrate that estimation uncertainty decreases with calibration age more slowly in data sets with more, rather than fewer, site patterns, although overall uncertainty is lower in the former.
Our statistical framework establishes the minimum uncertainty that can be attained with perfect calibrations and sequence data that are effectively infinitely informative. Finally, we discuss the implications for viral sequence data sets. In a vast majority of cases viral data from outbreaks is not sufficiently informative to display infinite-sites behaviour and thus all estimates of evolutionary timescales will be associated with a degree of uncertainty that will depend on the size of the data set, its information content, and the complexity of the model. We anticipate that our framework is useful to determine such theoretical limits in empirical analyses of microbial outbreaks.
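The confounding at the heart of this argument can be written compactly. In standard molecular-clock notation (a sketch of the textbook identity, not the authors' exact formulation), for data $D$ on a branch with rate $r$ and time duration $t$:

```latex
p(D \mid r, t) = p(D \mid b), \qquad b = rt
\quad\Longrightarrow\quad
p(r, t \mid D) \;\propto\; p(D \mid rt)\, p(r, t).
```

Any pair $(r', t')$ with $r't' = rt$ has identical likelihood, so only the prior $p(r, t)$ separates times from rates; for heterochronous data, known tip sampling times constrain $t$ directly, which is what restores identifiability.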
bioinformatics · 2026-03-03 · v1
A comprehensive assessment of tandem repeat genotyping methods for Nanopore long-read genomes
Aliyev, E.; Avvaru, A.; De Coster, W.; Arner, G. M.; Nyaga, D. M.; Gibson, S. B.; Weisburd, B.; Gu, B.; Gonzaga-Jauregui, C.; 1000 Genomes Long-Read Sequencing Consortium; Chaisson, M. J. P.; Miller, D. E.; Ostrowski, E.; Dashnow, H.
Abstract
Background Tandem repeats (TRs) play critical roles in human disease and phenotypic diversity but are among the most challenging classes of genomic variation to measure accurately. While it is possible to identify TR expansions using short-read sequencing, these methods are limited because they often cannot accurately determine repeat length or sequence composition. Long-read sequencing (LRS) has the potential to accurately characterize long TRs, including the identification of non-canonical motifs and complex structures. However, while there are an increasing number of genotyping methods available, no systematic effort has been undertaken to evaluate their length and sequence-level accuracy, performance across motifs from STRs to VNTRs and across allele lengths, and, critically, how usable these tools are in practice. Results We reviewed 25 available bioinformatic tools, and selected seven that are actively maintained for benchmarking using publicly available Oxford Nanopore genome sequencing data from more than 100 individuals. Our benchmarking catalog included ~43k TR loci genome-wide, selected to represent a range of simple and challenging TR loci. As no "truth" exists for this purpose, we used four complementary strategies to assess accuracy: concordance with high-quality haplotype-resolved Human Pangenome Reference Consortium (HPRC) assemblies, Mendelian consistency in Genome in a Bottle trios, cross-tool consistency, and sensitivity in individuals with pathogenic TR expansions confirmed by molecular methods. For all comparisons, we assess both total allele length and full sequence similarity using the Levenshtein distance. We also evaluated installation, documentation, computational requirements, and output characteristics to reflect real-world use. We provide a complete analysis workflow for all tools to support community reuse. Tool performance varied substantially across both accuracy and usability. 
Most methods achieved high concordance with HPRC assemblies, with higher accuracy when using the R10 ONT pore chemistry. Accuracy generally declined with increasing allele length, and most tools performed worse on homopolymers, likely reflecting underlying sequencing accuracy. Tools generally performed worse at heterozygous loci and at alleles that differed from the reference genome. Interestingly, concordance with assembly in population samples did not predict sensitivity to pathogenic expansions, with different genotypers performing best in each category. Similarly, Mendelian consistency was highest in the tool that performed worst in assembly concordance. Conclusions No single genotyper emerged as consistently best across all assessments, but strong contenders emerged in each. Our results demonstrate that length accuracy (a typical benchmarking approach) alone overestimates TR genotyping performance. Sequence-level benchmarking is essential for selecting tools best-suited for population studies and clinical diagnostics. This work provides practical guidance for tool selection and highlights key priorities for future long-read TR genotyping method development.
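Sequence-level concordance of the kind described above is typically scored with the Levenshtein (edit) distance between a genotyped allele and its assembly-derived counterpart. A minimal pure-Python sketch (function names are illustrative, not taken from any of the benchmarked tools):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two allele sequences."""
    # prev[j] holds the distance between a[:i-1] and b[:j]; one row at a time
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion from a
                curr[j - 1] + 1,           # insertion into a
                prev[j - 1] + (ca != cb),  # substitution (0 if match)
            ))
        prev = curr
    return prev[-1]

def sequence_similarity(call: str, truth: str) -> float:
    """Normalised similarity in [0, 1]; 1.0 means identical alleles."""
    if not call and not truth:
        return 1.0
    return 1.0 - levenshtein(call, truth) / max(len(call), len(truth))
```

Normalising by the longer allele yields a similarity in [0, 1], so a call with the right length but wrong motif composition is still penalised, which is exactly the failure mode that length-only benchmarking misses.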
bioinformatics2026-03-03v1iGS: A Zero-Code Dual-Engine Graphical Software for Polygenic Trait Prediction
Zhang, J.; Chen, F.Abstract
Genomic selection (GS) has become the core driving force in modern plant and animal breeding. However, state-of-the-art comprehensive GS tools often rely on complex underlying environment configurations and command-line operations, posing significant technical barriers for breeders lacking programming expertise. To address this barrier, this study developed a fully "zero-code" graphical user interface (GUI) decision support system for genomic selection. The platform employs a portable dual-engine architecture (R-Portable and Python-Portable) to achieve completely dependency-free, out-of-the-box deployment, and integrates a standardized six-step end-to-end workflow from data quality control to result export. Furthermore, the platform integrates 33 prediction models across four major paradigms (linear, Bayesian, machine learning, and deep learning) and features an intelligent parameter configuration system that dynamically renders algorithm parameters to provide a minimalist UI interaction experience. Benchmark testing on the Wheat2000 dataset across six complex agronomic and quality traits, including thousand-kernel weight (TKW) and grain protein content (PROT), demonstrated that classic linear models remain highly robust for polygenic additive traits, while tree-based machine learning and hybrid deep learning architectures exhibit superior predictive potential and noise resilience when resolving complex epistatic effects and low-heritability traits. By removing these computational barriers, the platform provides robust digital infrastructure to accelerate the adoption and practical application of GS technologies in agricultural production.
bioinformatics2026-03-03v1RankMap: Rank-based reference mapping for fast and robust cell type annotation in spatial and single-cell transcriptomics
Cheng, J.; Li, S.; Kim, S.; Ang, C. H.; Chew, S. C.; Chow, P. K.-H.; Liu, N.Abstract
Accurate cell type annotation is essential for the analysis of single-cell and spatial transcriptomics data. While reference-based annotation methods have been widely adopted, many existing approaches rely on full-transcriptome profiles and incur substantial computational cost, limiting their applicability to large-scale spatial datasets and platforms with partial gene panels. Here, we present RankMap (https://github.com/jinming-cheng/RankMap), an efficient and flexible R package for reference-based cell type annotation across both single-cell and spatial transcriptomics. RankMap transforms gene expression profiles into rank-based representations using the top expressed genes per cell, improving robustness to platform-specific biases and expression scale differences. A multinomial regression model trained with elastic net regularization is then used to predict cell types and associated confidence scores. We benchmarked RankMap on five spatial transcriptomics datasets, including Xenium, MERFISH, and Stereo-seq, as well as two single-cell datasets, and compared it with established methods such as SingleR, Azimuth, and RCTD. RankMap achieved competitive or superior annotation accuracy while consistently reducing runtime compared to existing methods, particularly for large spatial datasets. These results demonstrate that RankMap provides a scalable and robust solution for reference-based cell type annotation in modern single-cell and spatial transcriptomics studies.
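The rank transformation at the heart of this approach can be sketched as follows: for each cell, keep only the top expressed genes and replace their expression values with within-cell ranks, which makes the representation insensitive to per-cell scaling. This is a sketch of the idea only; the actual RankMap implementation may differ in details such as tie handling and the choice of top_k:

```python
import numpy as np

def rank_representation(expr: np.ndarray, top_k: int = 50) -> np.ndarray:
    """Convert a cells x genes expression matrix into a rank-based representation.

    For each cell, the top_k most highly expressed genes receive ranks
    top_k .. 1 (highest expression -> largest rank); all other genes get 0.
    Ranks depend only on within-cell ordering, so the representation is
    invariant to per-cell scaling and platform-specific expression scales.
    """
    n_cells, n_genes = expr.shape
    k = min(top_k, n_genes)
    ranked = np.zeros_like(expr, dtype=float)
    for i in range(n_cells):
        order = np.argsort(expr[i])[::-1][:k]   # indices of the top-k genes
        ranked[i, order] = np.arange(k, 0, -1)  # top gene gets rank k
    return ranked
```

A downstream classifier (RankMap uses elastic-net multinomial regression) then sees the same representation for a cell regardless of the platform's expression scale.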
bioinformatics2026-03-03v1Navigating the peptide sequence space in search for peptide binders with BoPep
Hartman, E.; Samsudin, F.; Siljehag Alencar, M.; Tang, D.; Bond, P. J.; Schmidtchen, A.; Malmstrom, J.AI Summary
- The study developed BoPep, a framework using Bayesian optimization to efficiently explore peptide sequence space for protein binders, reducing the need for extensive docking evaluations.
- BoPep was applied to peptides from clinical wound fluids, the human proteome, and de novo designs, identifying novel peptide classes that bind CD14 and neutralize pneumolysin's hemolytic activity.
Abstract
Peptides are short amino-acid chains that mediate essential biological processes, including antimicrobial defence, immune modulation and cell signalling. Their high degree of modularity, biocompatibility and capacity to bind proteins with high specificity make them attractive therapeutic candidates. However, identifying peptides that bind and modulate the function of specific proteins remains challenging due to the immense size of the peptide sequence space. To address this challenge, we developed BoPep (Bayesian Optimization for Peptides), an end-to-end modular framework that effectively navigates the landscape of peptide-protein interactions by directing the search toward informative regions of sequence space and prioritizing candidates with high binding potential. By focusing computational effort where it is most informative and using calibrated uncertainty to balance exploration and exploitation, BoPep reduces the number of expensive docking evaluations by orders of magnitude. We demonstrate the utility of BoPep by applying it to three sources of peptides: endogenous proteolytic fragments from clinical wound fluids, the complete human proteome, and a de novo-designed peptide landscape generated by diffusion-based backbone sampling. Using these sources, we uncover novel encrypted peptide classes that bind CD14 and identify peptides that neutralize the hemolytic activity of pneumolysin, a major bacterial virulence factor. Together, these findings show that BoPep accelerates the identification of testable therapeutic leads from large and diverse peptide collections. BoPep is available at GitHub.
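The core loop (a probabilistic surrogate plus an acquisition function that balances exploration against exploitation) can be sketched with a tiny Gaussian-process surrogate and expected improvement over peptide embeddings. This is a generic Bayesian-optimization sketch under my own simplifications, not BoPep's actual surrogate or docking oracle:

```python
import numpy as np
from math import erf, sqrt, pi

def rbf_kernel(X, Y, lengthscale=1.0):
    """Squared-exponential kernel between embedding matrices X (n,d) and Y (m,d)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_posterior(X_obs, y_obs, X_cand, noise=1e-6):
    """GP posterior mean and std of the (expensive) docking score at candidates."""
    K = rbf_kernel(X_obs, X_obs) + noise * np.eye(len(X_obs))
    K_s = rbf_kernel(X_cand, X_obs)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_obs))
    mu = K_s @ alpha
    v = np.linalg.solve(L, K_s.T)
    var = 1.0 - (v ** 2).sum(0)           # prior variance of the RBF kernel is 1
    return mu, np.sqrt(np.clip(var, 1e-12, None))

def expected_improvement(mu, sigma, best):
    """EI for maximisation: expected amount by which a candidate beats the incumbent."""
    z = (mu - best) / sigma
    cdf = 0.5 * (1.0 + np.vectorize(erf)(z / sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / sqrt(2.0 * pi)
    return (mu - best) * cdf + sigma * pdf
```

Each round, the candidate maximising EI is docked, the new observation is appended, and the surrogate is refit, so expensive docking is only spent where the model is uncertain or optimistic.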
bioinformatics2026-03-02v2t2pmhc: A Structure-Informed Graph Neural Network to predict TCR-pMHC Binding
Polster, M.; Stadelmaier, J.; Ball, E.; Scheid, J.; Bauer, J.; Nelde, A.; Claassen, M.; Dubbelaar, M. L.; Walz, J. S.; Nahnsen, S.AI Summary
- The study introduces t2pmhc, a structure-based graph neural network framework to predict TCR-pMHC binding, utilizing predicted structures of the entire TCR-pMHC complex.
- t2pmhc, incorporating Graph Convolutional Network (GCN) and Graph Attention Network, showed enhanced generalization to unseen peptides over sequence-based methods.
- Analysis revealed that t2pmhc-GCN assigns high attention to biologically relevant regions, like the peptide and CDR3, with specific weighting within the peptide sequence.
Abstract
Mapping of T cell receptors (TCRs) to their cognate MHC-presented peptides (pMHC) is central for the development of precision immunotherapies and vaccine design. However, accurate prediction of TCR affinity to peptide antigens remains an open challenge. Most approaches rely solely on sequence information, although increasing evidence suggests that TCR-pMHC binding is primarily determined by three-dimensional structural interactions within the entire TCR-pMHC complex. Consequently, sequence-based methods often fail to generalize to peptides not included in the training data (unseen peptides). Here we introduce t2pmhc, a structure-based graph neural network framework for predicting TCR-pMHC binding using predicted structures of the entire TCR-pMHC complex. We evaluated a Graph Convolutional Network (GCN) and a Graph Attention Network, both demonstrating improved generalization to unseen peptides compared to state-of-the-art models across a variety of public datasets. Evaluation with crystallographic structures yields high-confidence predictions, indicating that current limitations of structure-based models are largely driven by the accuracy of structure prediction. Analysis of node attention patterns in t2pmhc-GCN reveals biologically consistent patterns, assigning high attention to the peptide and the CDR3 regions. Within the peptide sequence, canonical MHC anchor residues are consistently downweighted, whereas potential TCR-binding residues are upweighted. These findings establish t2pmhc as a structure-informed framework for robust TCR-pMHC binding prediction, enabling improved generalization to unseen antigens and providing a foundation for integrating TCR repertoire sequencing into vaccine design and immunotherapy.
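Structure-based models of this kind start by turning a predicted complex into a residue graph. A minimal sketch of building a contact graph from C-alpha coordinates and applying one symmetric-normalised graph-convolution step (the 8 Å cutoff and this featurisation are my assumptions, not t2pmhc's exact pipeline):

```python
import numpy as np

def contact_graph(ca_coords: np.ndarray, cutoff: float = 8.0) -> np.ndarray:
    """Undirected residue contact graph from C-alpha coordinates.

    Returns an (n, n) boolean adjacency matrix connecting residues closer
    than `cutoff` angstroms (self-contacts excluded) -- the typical input
    to a GCN/GAT message-passing layer.
    """
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    return (dist < cutoff) & ~np.eye(len(ca_coords), dtype=bool)

def gcn_layer(adj: np.ndarray, features: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """One graph-convolution step: H' = ReLU(D^-1/2 (A+I) D^-1/2 H W)."""
    a_hat = adj.astype(float) + np.eye(len(adj))   # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(1))
    norm = d_inv_sqrt[:, None] * a_hat * d_inv_sqrt[None, :]
    return np.maximum(norm @ features @ weights, 0.0)
```

Stacking such layers over a graph built from the predicted TCR-pMHC complex lets the model reason over spatial neighbourhoods rather than raw sequence positions.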
bioinformatics2026-03-02v1Multiscale Symbolic Morpho-Barcoding Reveals Region-Specific and Scale-Dependent Neuronal Organization
Zhao, S.; Li, Y.; Liu, Y.; Peng, H.AI Summary
- The study introduces Multiscale Morpho-Barcoding (MMB), a framework for encoding whole-brain neuronal morphology into symbolic representations.
- By applying MMB to 1,876 reconstructed mouse neurons, the research identified region-specific and scale-dependent neuronal organization patterns.
- MMB effectively distinguishes major brain divisions and specific thalamic circuit classes, enhancing understanding beyond traditional projection strength analysis.
Abstract
Neuronal morphology is a central determinant of circuit organization, yet its multiscale complexity has hindered systematic, brain-wide analysis and integration with anatomical context. Here we introduce Multiscale Morpho-Barcoding (MMB), a framework that encodes whole-brain neuronal morphology into symbolic representations spanning cellular geometry, axonal tract routing, arbor organization, and predicted synaptic distributions. Applying MMB to 1,876 fully reconstructed mouse neurons, comprising 3,776 arbors and 2.63 million predicted presynaptic sites, we identify distinct multiscale morpho-patterns that reveal region-specific and scale-dependent principles of neuronal organization across the brain. MMB robustly discriminates major anatomical divisions and resolves canonical thalamic circuit classes beyond what can be achieved using projection strength alone. By transforming complex neuronal geometry into interpretable multiscale representations, MMB provides a general framework for systematic comparison of neuronal structure and for integrating morphology with connectivity and function at whole-brain scale.
bioinformatics2026-03-02v1Explainable AI for end-to-end pathogen target discovery and molecular design
Polonio, A.; Perez-Garcia, A.; Fernandez-Ortuno, D.; Jimenez-Castro, L.AI Summary
- The study introduces APEX, an explainable AI framework for identifying pathogen targets and designing molecules across species.
- APEX uses ESM-2 embeddings, graph attention networks, and a multilayer perceptron to predict essentiality, virulence, and druggability, recovering known fungal targets and proposing new ones like GmrSD and YadV.
- It also guides the design of inhibitors by highlighting key residues and pockets, demonstrating its utility in both known and novel target sites.
Abstract
Drug discovery is often constrained by target identification, a bottleneck especially acute in antimicrobial development and the fight against emerging fungicide resistance. We present APEX (Attention-based Protein EXplainer), an explainable AI framework for cross-species, proteome-scale target discovery and pocket-guided molecular design. APEX combines ESM-2 evolutionary embeddings, graph attention networks, and a multilayer perceptron to train pathogen-specific essentiality and virulence predictors (APEX-Tar) alongside a universal druggability model (APEX-Drug). Attention maps and GNNExplainer-derived subgraphs highlight residues and pockets driving predictions, enabling direct conditioning of structure-based diffusion models for inhibitor generation. APEX-Tar recovers known fungal targets (endopolygalacturonase 1, Hog1 MAPK) and proposes new candidates, including fungal GmrSD and bacterial YadV. APEX-Drug recapitulates established fungicide sites ({beta}-tubulin, cytochrome b), guides putative inhibitor design for GmrSD, and identifies in YadV a previously undescribed pocket distinct from known pilicide sites. Together, APEX offers a kingdom-agnostic pipeline for explainable target prioritization and guided molecular design.
bioinformatics2026-03-02v1Exploring the mechanism of Panax Notoginseng in the treatment of skin wound based on network pharmacology and experimental verification
Li, Y.-b.; Li, Q.-l.; Liu, J.; Li, J.-c.; Geng, H.-m.; Li, G.-k.; Jin, C.; Luo, J.; Zhang, Z.AI Summary
- This study used network pharmacology to identify 8 active components, 156 targets, and 115 pathways of Panax notoginseng (PN) in treating skin wounds, focusing on core targets like TNF, IL-6, and IL-10.
- Experimental validation in rats showed that PN treatment significantly reduced wound size, inflammation, and cytokine expression (TNF, IL-6, IL-10) compared to controls at various post-injury time points.
- PN promotes skin healing by modulating multiple signaling pathways, enhancing fibroblast proliferation, and optimizing the healing process from inflammation to tissue remodeling.
Abstract
Background Shortening the healing cycle and reducing the incidence of infection are difficult problems faced by clinicians. Panax notoginseng (PN), a traditional Chinese medicine, can promote the absorption of inflammatory exudates, granulation tissue formation and epidermal proliferation, effectively inhibit the inflammatory reaction of wounds and promote the healing of skin wounds, but its molecular mechanism has not yet been fully clarified. Based on network pharmacology and animal experiments, this study explored the targets and molecular mechanism of PN in the treatment of skin wounds. Methods Through network pharmacology, we screened the active components of PN and the common targets related to skin wounds, constructed a target protein-protein interaction (PPI) network, and performed GO and KEGG enrichment analysis. Using the MCODE and CytoHubba plugins, we explored core functional modules and key targets, ultimately constructing a visual network of PN components-targets-pathways. In the experimental section, forty-eight male Sprague-Dawley (SD) rats were randomly divided into a control group and a PN group, with 24 rats in each group, and underwent full-thickness skin excision. Postoperatively, the PN group received intraperitoneal injections of the drug, while the control group received an equal amount of saline. Data were collected on postoperative days 1, 4, and 7, and hematoxylin and eosin (HE) staining, immunohistochemical staining, quantitative real-time polymerase chain reaction (qRT-PCR), and enzyme-linked immunosorbent assay (ELISA) were used to evaluate skin healing and detect changes in the expression of TNF-{alpha}, IL-6, and IL-10 in the tissues. Results This study identified 8 major active components, 156 targets, and 115 signaling pathways involved in the treatment of skin wounds in rats using PN. The top 10 core target genes included TNF, IL-6, and IL-10, primarily enriched in signaling pathways such as NF-{kappa}B, MAPK, and JAK-STAT. 
Animal experiments revealed that at 4 and 7 days post-injury, the wound area in the PN group was significantly smaller than that in the control group (P<0.05). HE staining showed reduced infiltration of neutrophils and inflammatory cells in the injury area at 7 days in the PN group, accompanied by more pronounced fibroblast proliferation and collagen secretion. Molecular detection indicated that TNF-{alpha}, IL-6, and IL-10 positive reactants were mainly distributed in the cytoplasm and matrix of epidermal cells, inflammatory cells, and fibroblasts in the skin. qRT-PCR and ELISA results showed that TNF-{alpha} expression in the PN group was significantly lower than that in the control group at 4 and 7 days (P<0.01). IL-6 expression was lower than that in the control group at all time points, peaking at 4 days and then decreasing (P<0.01). IL-10 expression was significantly lower than that in the control group at 1 and 7 days (P<0.01). Conclusion PN treatment for skin wounds exhibits characteristics such as multi-component, multi-target, multi-pathway synergistic effects, and various regulatory pathways. It can reshape the dynamic balance of the cytokine network, optimize the temporal progression of "inflammation initiation - repair transition - tissue remodeling", and improve skin wound healing.
bioinformatics2026-03-02v1ExoFILT: Transfer learning for robust and accelerated analysis of exocytosis single-particle tracking data
Kramer, E.; Betancur, L. I.; Meek, S.; Tosi, S.; Manzo, C.; Oliva, B.; Gallego, O.AI Summary
- ExoFILT uses transfer learning to classify exocytic events in single-particle tracking data, reducing manual annotation time by ten-fold and enhancing consistency.
- Applied to dual-color time-lapse movies, ExoFILT quantified temporal relationships between exocytic proteins.
- The tool revealed distinct subpopulations of exocytic events with different molecular compositions, providing insights into exocytosis mechanisms.
Abstract
Motivation: Understanding constitutive exocytosis at the molecular level requires quantitative characterization of protein dynamics during the process. Single-particle tracking allows the measurement of protein dynamics in living cells. However, identifying bona fide exocytic events requires extensive manual annotation, limiting throughput and introducing personal biases that affect reproducibility. Results: We present ExoFILT, a deep learning-based classifier designed to identify exocytic events in single-particle tracking data, using the exocyst complex as a reference. Trained via transfer learning on simulated and experimental data, ExoFILT reduces the time required for manual annotation by ten-fold while improving measurement consistency across researchers. When applied to simultaneous dual-color time-lapse movies, ExoFILT enabled the systematic quantification of temporal relationships between exocytic proteins. The increased throughput uncovered distinct subpopulations of exocytic events with differential molecular composition (e.g., events with and without detectable levels of Sec1), underscoring the potential of ExoFILT to reveal mechanistic insights into exocytosis.
bioinformatics2026-03-02v1ToxiVerse: A Public Platform for Chemical Toxicity Data Sharing and Customizable Predictive Modeling
Durai, P.; Russo, D. P.; Shen, Y.; Wang, T.; Chung, E.; Li, L.; Zhu, H.AI Summary
- ToxiVerse is a public platform developed to provide user-friendly machine learning tools for computational toxicology, addressing the need for efficient chemical toxicity assessment.
- It features three modules: Bioprofiler for chemical descriptor generation, Database with 50,000 curated chemicals, and Cheminformatics for dataset management and QSAR model generation.
- The platform allows researchers to perform bioprofiling, access toxicity data, and predict chemical toxicity without programming expertise, available at www.toxiverse.com.
Abstract
Chemical toxicity assessment is critical for drug development and environmental safety. Computational models have emerged as a promising alternative to animal testing and now play a significant role in efficiently evaluating new chemicals. To address the urgent need for user-friendly machine learning tools in computational toxicology, we developed ToxiVerse, a public web-based platform. It provides curated toxicity datasets, automatic chemical bioprofiling, and a predictive modeling interface designed for researchers who lack programming expertise. The platform comprises three integrated modules: (i) the Bioprofiler module, which provides chemical descriptors by combining chemical-bioactivity data from PubChem assays with a machine learning-based data gap-filling procedure; (ii) the Database module, which hosts around 50,000 curated unique chemicals covering diverse toxicity endpoints; and (iii) the Cheminformatics module, which allows users to upload their own datasets, use datasets from ToxiVerse, or retrieve existing data from PubChem; perform chemical curation; and automatically generate Quantitative Structure-Activity Relationship (QSAR) models to predict chemicals of interest. ToxiVerse enables researchers to carry out bioprofiling, access curated toxicity datasets, and evaluate chemical toxicity through machine learning-based modeling and prediction. The platform is supported by sample files and a detailed tutorial, and it is freely accessible at www.toxiverse.com.
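To illustrate the kind of modeling such a platform automates, here is a deliberately simple read-across baseline: k-nearest-neighbor prediction under Tanimoto similarity over binary bioprofile descriptors. This is a stand-in sketch of the general QSAR idea, not ToxiVerse's actual modeling workflow:

```python
import numpy as np

def tanimoto(a: np.ndarray, b: np.ndarray) -> float:
    """Tanimoto (Jaccard) similarity between two binary descriptor vectors."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 1.0

def knn_read_across(query: np.ndarray, profiles: np.ndarray,
                    labels: np.ndarray, k: int = 3) -> int:
    """Predict a binary toxicity label by majority vote of the k most
    similar chemicals in the curated database (read-across)."""
    sims = np.array([tanimoto(query, p) for p in profiles])
    top = np.argsort(sims)[::-1][:k]
    return int(round(labels[top].mean()))
```

Real QSAR models replace the vote with a trained learner, but the core assumption is the same: structurally or bioactively similar chemicals tend to share toxicity endpoints.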
bioinformatics2026-03-02v1A Query-to-Dashboard Framework for Reproducible PubMed-Scale Bibliometrics and Trend Intelligence
Kidder, B. L.AI Summary
- The study introduces PubMed Atlas, a platform for conducting topic-specific bibliometric analyses using PubMed E-utilities, which retrieves and organizes metadata into a SQLite database for analysis.
- An interactive Streamlit dashboard allows for the exploration of publication trends, journal distributions, MeSH term frequencies, and author geography.
- The framework was applied to cancer stem cell biology and stem cell transcriptional regulatory networks, demonstrating its utility in identifying research trends and gaps.
Abstract
The rapid expansion of biomedical literature necessitates computational approaches for systematic analysis of publication patterns, identification of emerging scientific themes, and characterization of field evolution. We present PubMed Atlas, an integrated command-line and web-based platform for conducting topic-specific bibliometric analyses through programmatic access to PubMed E-utilities. This workflow retrieves PubMed identifiers matching user-defined queries, downloads comprehensive metadata in batch mode, extracts structured information including titles, abstracts, author affiliations, Medical Subject Headings, publication classifications, funding acknowledgments, and digital object identifiers, then organizes these data within a local SQLite relational database optimized for rapid queries and visualization. An accompanying Streamlit-based interactive dashboard enables exploration of temporal publication patterns, journal distribution profiles, MeSH term frequencies, geographic author distributions, and direct linking to recent publications. We demonstrate the application of PubMed Atlas to cancer stem cell biology and stem cell transcriptional regulatory network research, providing a framework for reproducible bibliometric investigation and systematic identification of research gaps within dynamically evolving scientific domains.
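The retrieve-then-store pattern is straightforward with the standard-library sqlite3 module; the schema and field names below are illustrative, not the platform's actual schema:

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS articles (
    pmid    INTEGER PRIMARY KEY,
    title   TEXT NOT NULL,
    journal TEXT,
    year    INTEGER
);
CREATE TABLE IF NOT EXISTS mesh_terms (
    pmid INTEGER REFERENCES articles(pmid),
    term TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_mesh_term ON mesh_terms(term);
"""

def load_records(conn, records):
    """Insert article metadata (as parsed from E-utilities XML) in one transaction."""
    with conn:  # commits on success, rolls back on error
        for r in records:
            conn.execute(
                "INSERT OR REPLACE INTO articles VALUES (?, ?, ?, ?)",
                (r["pmid"], r["title"], r.get("journal"), r.get("year")),
            )
            conn.executemany(
                "INSERT INTO mesh_terms VALUES (?, ?)",
                [(r["pmid"], t) for t in r.get("mesh", [])],
            )

def publications_per_year(conn):
    """The kind of aggregate a trend dashboard plots: publication counts by year."""
    return conn.execute(
        "SELECT year, COUNT(*) FROM articles GROUP BY year ORDER BY year"
    ).fetchall()
```

A dashboard layer then only needs cheap SQL aggregates like the one above, which is why a local relational store scales comfortably to PubMed-sized topic queries.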
bioinformatics2026-03-02v1Density-guided AlphaFold3 uncovers unmodelled conformations in β2-microglobulin
Maddipatla, S. A.; Vedula, S.; Bronstein, A. M.; Marx, A.AI Summary
- The study uses density-guided AlphaFold3 to model alternative backbone conformations of β2-microglobulin from crystallographic maps, which are typically obscured in standard X-ray crystallography models.
- Findings show that the approach can reveal conformational heterogeneity influenced by electron density quality, crystallization conditions, and lattice packing.
- This method enhances the ability to capture the full structural landscape of proteins, improving macromolecular crystallography interpretation.
Abstract
Although X-ray crystallography captures the ensemble of conformations present within the crystal lattice, models typically depict only the most dominant conformation, obscuring the existence of alternative states. Applying the electron density-guided AlphaFold3 approach to {beta}2-Microglobulin highlights how ensembles of alternate backbone conformations can be systematically modeled directly from crystallographic maps. This study also highlights how the detection of conformational ensembles is affected by the local quality of electron density and subtle variations in crystallization conditions and lattice packing. These results demonstrate that density-guided AlphaFold3 can uncover conformational heterogeneity missed by conventional refinement, offering a robust, systematic framework to capture the full structural landscape of proteins in crystals and enhancing the interpretive power of macromolecular crystallography.
bioinformatics2026-03-02v1Synora: vector-based boundary detection for spatial omics
Li, J.-T.; Liang, Z.; Fu, Z.; Chen, H.; Liang, Y.-L.; Liu, N.; Wu, Q.-N.; Liu, Z.; Zheng, Y.; Huo, J.; Li, X.; Zuo, Z.; Zhao, Q.; Liu, Z.-X.AI Summary
- Synora is a computational framework for detecting tumor-stroma boundaries in spatial omics data, using only cell coordinates and binary annotations.
- It introduces 'orientedness' to differentiate true boundary cells from infiltrated regions, integrating this with diversity measures into a BoundaryScore.
- Synora effectively identifies boundaries in synthetic and real datasets, revealing gene signatures and spatial patterns, and performs well under data perturbations.
Abstract
Tumor-stroma boundaries are critical microenvironmental niches where malignant and non-malignant cells exchange signals that shape invasion, immune modulation and therapeutic response. Spatial omics platforms now resolve these interfaces at single-cell scale, but computational boundary detection remains challenging because heterogeneous neighborhoods can arise either from true compartment interfaces or from unstructured immune infiltration. Here we present Synora, a modality-agnostic computational framework that identifies tumor boundaries using only cell coordinates and binary tumor/non-tumor annotations, making it readily applicable across a broad range of spatial omics modalities. Synora introduces 'orientedness', a novel metric that quantifies directional neighborhood asymmetry and distinguishes true boundary cells, where neighbors are spatially segregated by type, from infiltrated regions where cell types intermingle randomly. By integrating orientedness with traditional diversity measures into a unified BoundaryScore, Synora achieves robust boundary identification across synthetic datasets with ground-truth boundaries, maintaining performance under realistic perturbations including 50% missing cells and 25% infiltration. Application to 15 Visium HD spatial transcriptomic datasets across multiple cancer types reveals consistent boundary-enriched gene signatures and cell-type spatial gradients. Validation on a CODEX multiplexed protein dataset demonstrates that Synora's precise boundary identification enables discovery of clinically relevant cellular neighborhoods and disease-associated spatial patterns missed by frequency-based approaches. Synora enables boundary-aware spatial analyses by making tissue interfaces quantifiable from minimal inputs, helping to standardize interface detection and comparison across spatial omics platforms and biological contexts.
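The 'orientedness' idea, directional asymmetry of a cell's neighborhood, can be approximated as the length of the mean unit vector from a focal cell toward its tumor-labelled neighbors (my formulation for illustration; the paper's exact definition may differ):

```python
import numpy as np

def orientedness(coords: np.ndarray, is_tumor: np.ndarray,
                 cell_idx: int, k: int = 10) -> float:
    """Directional asymmetry of a cell's neighborhood (sketch).

    Length of the mean unit vector from the focal cell toward its
    tumor-labelled k nearest neighbours. Near 1: tumor neighbours lie on
    one side (candidate boundary cell); near 0: tumor neighbours surround
    the cell isotropically (infiltration-like mixing).
    """
    delta = coords - coords[cell_idx]
    dist = np.linalg.norm(delta, axis=1)
    order = np.argsort(dist)[1 : k + 1]   # nearest neighbours, skipping self
    nbr = order[is_tumor[order]]          # keep tumor-labelled neighbours only
    if len(nbr) == 0:
        return 0.0
    units = delta[nbr] / dist[nbr, None]  # unit vectors toward each neighbour
    return float(np.linalg.norm(units.mean(axis=0)))
```

Note how this separates the two cases that neighborhood-diversity measures alone confuse: both a boundary cell and an infiltrated cell have mixed neighborhoods, but only the boundary cell's tumor neighbors are spatially one-sided.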
bioinformatics2026-03-02v1STCS: A Platform-Agnostic Framework for Cell-Level Reconstruction in Sequencing-Based Spatial Transcriptomics
Chen Wu, L.; Hu, X.; Zhan, F.; Sun, C.; Gonzales, J.; Ofer, R.; Tran, T.; Verzi, M. P.; Liu, L.; Yang, J.AI Summary
- The study introduces STCS, a platform-agnostic framework for reconstructing single-cell expression profiles from sequencing-based spatial transcriptomics data by integrating transcriptomic and spatial data from H&E images.
- STCS uses two interpretable parameters for optimization, selected via internal metrics, and outperforms existing methods in reconstructing cell-level data from Visium HD and Stereo-seq datasets.
Abstract
Sequencing-based spatial transcriptomics platforms such as Visium HD and Stereo-seq achieve transcriptome-wide coverage at subcellular resolution, yet their measurements are defined over spatially barcoded units rather than biologically segmented cells. Reconstructing coherent cell-level expression profiles from these data remains a central computational challenge. Here, we introduce Spatial Transcriptomics Cell Segmentation (STCS), a platform-agnostic framework that reconstructs single-cell expression profiles by assigning spatial units to nuclei segmented from paired H&E images, using a combined transcriptomic and spatial distance. STCS is governed by two interpretable parameters that can be selected using reference-free internal metrics. On both Visium HD human lung cancer data with matched Xenium references and Stereo-seq mouse brain data, STCS achieves consistent improvements over existing methods across multiple evaluation dimensions. STCS is fully open-source and designed for broad applicability across sequencing-based spatial transcriptomics technologies.
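The assignment step can be sketched as a nearest-nucleus rule under a weighted combination of spatial and transcriptomic distance; the weighting and radius parameters below are illustrative stand-ins for the framework's two tunable parameters, not the exact STCS formulation:

```python
import numpy as np

def assign_bins(bin_xy, bin_expr, nuc_xy, nuc_expr, lam=0.5, max_radius=20.0):
    """Assign each spatial bin to the nucleus minimising a combined distance.

    score = lam * (spatial distance / max_radius)
          + (1 - lam) * cosine distance between expression profiles.
    Bins farther than max_radius from every nucleus stay unassigned (-1).
    """
    # pairwise spatial distances: (n_bins, n_nuclei)
    sp = np.linalg.norm(bin_xy[:, None, :] - nuc_xy[None, :, :], axis=-1)
    # cosine distance between L2-normalised expression profiles
    be = bin_expr / (np.linalg.norm(bin_expr, axis=1, keepdims=True) + 1e-12)
    ne = nuc_expr / (np.linalg.norm(nuc_expr, axis=1, keepdims=True) + 1e-12)
    tr = 1.0 - be @ ne.T
    score = lam * (sp / max_radius) + (1.0 - lam) * tr
    assign = score.argmin(axis=1)
    assign[sp.min(axis=1) > max_radius] = -1  # too far from any nucleus
    return assign
```

With lam near 1 the rule degenerates to plain nearest-nucleus expansion; lowering lam lets transcriptomic similarity override geometry at crowded cell borders, which is where segmentation-free binning errs most.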
bioinformatics2026-03-02v1STEQ: A statistically consistent quartet distance based species tree estimation method
Saha, P.; Saha, A.; Roddur, M. S.; Sikdar, S.; Anik, N. H.; Reaz, R.; Bayzid, M. S.AI Summary
- The study introduces STEQ, a new method for estimating species trees from multi-locus data using a quartet-based distance metric, which is statistically consistent under the multi-species coalescent model.
- STEQ offers faster computation, with a time complexity of O(kn^2 log n) for n taxa and k genes, outperforming methods like ASTRAL in speed.
- Evaluations on simulated and empirical datasets show STEQ maintains competitive accuracy with leading methods like ASTRAL and wQFM-TREE while significantly reducing inference time.
Abstract
Accurate estimation of large-scale species trees from multi-locus data in the presence of gene tree discordance remains a major challenge in phylogenomics. Although maximum likelihood, Bayesian, and statistically consistent summary methods can infer species trees with high accuracy, most of these methods are slow and do not scale to large numbers of taxa and genes. Distance-based estimation is one of the most promising routes to large-scale phylogeny estimation. Here, we present STEQ, a new statistically consistent, fast, and accurate distance-based method to estimate species trees from a collection of gene trees. We used a quartet-based distance metric which is statistically consistent under the multi-species coalescent (MSC) model. The running time of STEQ scales as $\mathcal{O}(kn^2 \log n)$, for $n$ taxa and $k$ genes, which is asymptotically faster than leading summary-based methods such as ASTRAL. We evaluated the performance of STEQ in comparison with ASTRAL and wQFM-TREE -- two of the most popular and accurate coalescent-based methods. Experimental findings on a collection of simulated and empirical datasets suggest that STEQ enables significantly faster inference of species trees while maintaining competitive accuracy with the best current methods. STEQ is publicly available at \url{https://github.com/prottoysaha99/STEQ}.
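For intuition, the quartet distance underlying such methods can be computed by brute force via the four-point condition: for each 4-taxon subset, the pairing with the smallest summed path length is the induced quartet topology. The sketch below assumes fully resolved (binary) trees with unit branch lengths and is O(n^4) per tree pair, a reference implementation rather than STEQ's fast algorithm:

```python
from collections import deque
from itertools import combinations

def path_lengths(adj, source):
    """BFS distances (unit branch lengths) from one node to all others."""
    dist = {source: 0}
    q = deque([source])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def quartet_topology(dist, a, b, c, d):
    """Four-point condition: the pairing with the smallest summed path
    length is the quartet topology induced by a binary tree."""
    pairings = [
        (frozenset([frozenset([a, b]), frozenset([c, d])]), dist[a][b] + dist[c][d]),
        (frozenset([frozenset([a, c]), frozenset([b, d])]), dist[a][c] + dist[b][d]),
        (frozenset([frozenset([a, d]), frozenset([b, c])]), dist[a][d] + dist[b][c]),
    ]
    return min(pairings, key=lambda p: p[1])[0]

def quartet_distance(adj1, adj2, taxa):
    """Number of 4-taxon subsets on which the two trees disagree."""
    d1 = {t: path_lengths(adj1, t) for t in taxa}
    d2 = {t: path_lengths(adj2, t) for t in taxa}
    return sum(
        quartet_topology(d1, *q) != quartet_topology(d2, *q)
        for q in combinations(taxa, 4)
    )
```

Trees are given as adjacency dicts over leaf and internal nodes, e.g. ((a,b),(c,d)) as a-x, b-x, x-y, y-c, y-d; a quartet-based pairwise distance between taxa, aggregated over gene trees, is what enables the fast distance-matrix construction described in the abstract.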
bioinformatics2026-03-02v1Genomic language models improve cross-species gene expression prediction and accurately capture regulatory variant effects in Brachypodium mutant lines
Vahedi Torghabeh, B.; Moslemi, C.; Dybdal Jensen, J.; Hentrup, S.; Li, T.; Yu, X.; Wang, H.; Asp, T.; Ramstein, G. P.AI Summary
- This study developed deep learning sequence-to-expression (S2E) models using context-aware sequence embeddings from PlantCaduceus to predict gene expression across 17 plant species, incorporating chromatin accessibility data.
- The models showed superior performance over PhytoExpr in predicting gene expression across species (Pearson R=0.82 vs. R=0.74) and in Brachypodium mutant lines for between-gene expression differences (β=0.78 vs. β=0.57).
- Notably, the models accurately predicted single-nucleotide mutation effects on within-gene expression, outperforming existing models (β=0.38 vs. β=0.08).
Abstract
Predicting gene expression from cis-regulatory DNA sequences at the promoter and terminator regions is a central challenge in plant genomics. This capability is also a prerequisite for assessing the effects of regulatory mutations on gene expression. Here, we developed deep learning sequence-to-expression (S2E) models that leverage context-aware sequence embeddings from the PlantCaduceus genomic language model, instead of one-hot encoding of sequences, to predict gene expression across 17 plant species. To further improve predictions, we integrated chromatin accessibility data as auxiliary regulatory features. First, we evaluated our models' ability to predict gene expression on unseen gene families via cross-validation, demonstrating that their prediction accuracy across all species outperforms PhytoExpr, the current state-of-the-art (SOTA) S2E model in plants (Pearson R=0.82 vs. R=0.74). We then validated variant effect predictions using an experimental dataset across 796 Brachypodium mutant lines, specifically designed to test predictions at single-base resolution. Our models outperformed SOTA S2E models in predicting between-gene expression differences (regression coefficient β=0.78 vs. β=0.57). Remarkably, they also accurately predicted the effects of single-nucleotide mutations on within-gene expression, while SOTA S2E models showed only weak associations (regression coefficient β=0.38 vs. β=0.08). Our results demonstrate the value of context-aware DNA sequence embeddings for predicting regulatory variant effects in plants. They also reveal a persistent accuracy gap in S2E models when moving from between-gene to allelic variation, a challenge that needs to be addressed in future S2E studies.
bioinformatics 2026-03-02 v1
DNA fragment length analysis using machine learning assisted vibrational spectroscopy
Fatayer, R.; Ahmed, W.; Szeto, I.; Sammut, S.-J.; Senthil Murugan, G.
AI Summary
- This study introduces a rapid, label-free method using ATR-FTIR and Raman spectroscopy combined with machine learning to quantify DNA fragment lengths from 50-300 bp.
- Machine learning models achieved high accuracy in predicting DNA length (R2=0.92-0.96), with multimodal fusion enhancing performance.
- The approach requires minimal sample (4 µL), short processing time (15 minutes), and allows full sample recovery, making it a scalable alternative for DNA length analysis.
Abstract
DNA length analysis is essential for genomic workflows including next-generation sequencing and fragmentomics-based diagnostics. Conventional approaches typically require large, expensive instrumentation and sample-destructive protocols with long processing times. Here we present a rapid, label-free approach integrating vibrational spectroscopy with deep learning to quantify DNA fragment length distributions. We demonstrate that ATR-FTIR and Raman spectroscopy capture length-dependent spectral features arising from phosphate backbone, nucleobase, and structural vibrations. Machine learning models trained on spectra acquired from purified monodisperse DNA (50-300 bp) predicted DNA length with high accuracy (R2=0.92-0.94), with multimodal fusion improving performance to R2=0.96. A convolutional neural network trained on 35 DNA mixtures comprising molecules of different lengths also successfully deconvoluted their fragment length profiles. Transfer learning enabled adaptation to biological samples, achieving low prediction error (RMSE=0.3-7.2%, Δ=12 bp). Importantly, the method requires only 4 µL of sample and 15 minutes of passive drying, with no consumables beyond cleaning materials, and allows full sample recovery. This establishes vibrational spectroscopy as a scalable alternative for DNA length quantification.
bioinformatics 2026-03-02 v1
Evaluation of deep learning tools for chromatin contact prediction
Nguyen, T. H. T.; Vermeirssen, V.
AI Summary
- This study evaluates five deep learning models (C.Origami, Epiphany, ChromaFold, HiCDiffusion, GRACHIP) for predicting Hi-C contact maps from genomic and epigenomic data.
- Epiphany was found to have the best performance in terms of accuracy, generalization across cell types, and biological relevance.
- Key findings include the importance of CTCF binding and chromatin co-accessibility in prediction accuracy, with only a subset of omics inputs significantly contributing to model performance.
Abstract
Three-dimensional chromatin organization is essential for gene regulation and is commonly measured using Hi-C contact maps. Recent deep learning models have been developed to predict Hi-C maps from genomic and epigenomic features. However, their relative performance and biological interpretability remain poorly understood due to the lack of systematic evaluation. Here, we present a comprehensive benchmarking framework that evaluates five Hi-C prediction models: C.Origami, Epiphany, ChromaFold, HiCDiffusion, and GRACHIP, across predictive accuracy, visual fidelity, and downstream biological analyses. Among them, Epiphany consistently achieved the best overall performance, combining high accuracy, cross-cell-type generalization, realistic map quality, and reliable loop recovery. The framework further shows that epigenomic features, particularly CTCF binding and chromatin co-accessibility, are the primary drivers of accurate Hi-C pattern prediction. Notably, although many models incorporate multiple omics inputs, only a limited subset substantially contributes to performance. This manuscript clarifies model behaviour and provides guidance for developing and interpreting Hi-C prediction methods.
bioinformatics 2026-03-02 v1
miREA: a network-based tool for microRNA-oriented enrichment analysis
Zhang, Z.; Lai, X.
AI Summary
- miREA is a network-based tool designed for miRNA-oriented enrichment analysis, focusing on miRNA-gene interactions (MGIs) to interpret miRNA function at the pathway level.
- It employs five edge-based enrichment methods, integrating expression and interactome data with pathway networks, outperforming traditional node-based methods in sensitivity and biological interpretability.
- Benchmarking in various cancer types, including bladder cancer, demonstrated miREA's effectiveness in identifying relevant pathways and generating mechanistic hypotheses for experimental validation.
Abstract
MicroRNAs (miRNAs) regulate gene expression at the post-transcriptional level. To interpret the function of miRNAs at the pathway level, it is necessary to use enrichment analysis tools that employ gene regulatory networks. However, existing network node-centric methods focus predominantly on gene expression profiles, neglecting the regulatory information encoded in the miRNA-gene interactions (MGIs) that constitute network edges. This omission introduces analytical bias and limits the methods' biological interpretability. Here, we present miREA, a network-based tool for miRNA enrichment analysis that leverages MGIs to characterize miRNA function at the pathway level. miREA implements five edge-based enrichment methods spanning over-representation, scoring-based, topology-aware, and network propagation approaches by integrating expression and interactome profiles with pathway networks. Benchmarking across multiple cancer types shows that the edge-based methods outperform node-based methods in sensitivity for identifying relevant pathways and in biological interpretability, while maintaining controlled false positive rates. We further demonstrate the utility of miREA in elucidating miRNA-gene-pathway regulatory mechanisms in bladder cancer. miREA is a versatile enrichment analysis tool that provides pathway-level interpretation of human miRNA function and facilitates mechanistic hypothesis generation for experimental validation.
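Of the enrichment families the abstract lists, the simplest, over-representation analysis, can be sketched with a hypergeometric tail test: given a set of miRNA-targeted genes and a pathway gene set, how surprising is their overlap? All gene counts below are invented, and miREA's actual edge-based statistics operate on miRNA-gene interactions rather than plain gene sets.

```python
# Hypergeometric over-representation sketch, pure stdlib.
from math import comb

def hypergeom_sf(k, N, K, n):
    """P(X >= k) when drawing n genes from a universe of N genes that
    contains K pathway genes; X is the size of the overlap."""
    return sum(
        comb(K, i) * comb(N - K, n - i) / comb(N, n)
        for i in range(k, min(K, n) + 1)
    )

N = 20000   # genes in the universe (illustrative)
K = 150     # genes in the pathway (illustrative)
n = 300     # genes targeted by the miRNA of interest (illustrative)
k = 12      # observed overlap; expected by chance: n * K / N = 2.25

p = hypergeom_sf(k, N, K, n)
print(f"P(overlap >= {k}) = {p:.2e}")
```

In practice such a p-value would be computed per pathway and corrected for multiple testing before calling a pathway enriched.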
bioinformatics 2026-03-02 v1
Evaluating genome assemblies with HMM-Flagger
Asri, M.; Eizenga, J. M.; Hebbar, P.; Real, T. D.; Lucas, J.; Loucks, H.; Calicchio, A.; Diekhans, M.; Eichler, E. E.; Salama, S.; Miga, K. H.; Paten, B.
AI Summary
- HMM-Flagger uses a hidden Markov model with a Gaussian autoregressive process to detect structural errors in genome assemblies by analyzing read coverage.
- It achieved F1 scores of 78.4% and 60.4% for synthetic errors with Pacific Biosciences HiFi and Oxford Nanopore Technologies R10 data, respectively.
- Applied to real assemblies, it identified large misassemblies in HG002 and showed significant error rate reduction from 0.94% to 0.38% between HPRC releases, validating NOTCH2NL assemblies.
Abstract
HMM-Flagger is a reference-free tool for detecting structural errors in haplotype-resolved genome assemblies based upon the coverage of mapped reads. It models read coverage with a hidden Markov model augmented by a Gaussian autoregressive process, which enables classifying coverage anomalies as erroneous blocks, false duplications, or collapsed blocks. Trained and tested on synthetic misassemblies, it detected synthetic errors using Pacific Biosciences HiFi and Oxford Nanopore Technologies R10 data with F1 scores of 78.4\% and 60.4\% respectively. When applied to six HG002 assemblies it revealed multiple large misassemblies including false duplications and collapse events in human satellites. Applied to assemblies from the Human Pangenome Reference Consortium (HPRC), HMM-Flagger demonstrated substantial improvements from release 1 (0.94\% error rate) to release 2 (0.38\%), reflecting technological advances. HMM-Flagger also validated NOTCH2NL assemblies in HPRC release 2 and confirmed the correctness of three novel structural configurations.
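The coverage-classification idea can be sketched with a deliberately simplified HMM: three toy states whose Poisson emission means correspond to collapsed (roughly doubled), correct, and falsely duplicated (roughly halved) coverage, decoded with Viterbi. HMM-Flagger's real model uses Gaussian autoregressive emissions and a different state space; every number here is invented for illustration.

```python
# Toy Viterbi decoder over per-window read coverage.
from math import log, factorial

STATES = ["collapsed", "correct", "falsely_duplicated"]
MEANS = {"collapsed": 60.0, "correct": 30.0, "falsely_duplicated": 15.0}
STAY = log(0.9)   # log-probability of staying in the same state
MOVE = log(0.05)  # log-probability of switching to each other state

def log_poisson(k, lam):
    return k * log(lam) - lam - log(factorial(k))

def viterbi(coverage):
    """Most likely state path for a sequence of per-window coverages."""
    trellis = [{s: log_poisson(coverage[0], MEANS[s]) for s in STATES}]
    back = []
    for k in coverage[1:]:
        row, ptr = {}, {}
        for s in STATES:
            prev, score = max(
                ((p, trellis[-1][p] + (STAY if p == s else MOVE))
                 for p in STATES),
                key=lambda t: t[1],
            )
            row[s] = score + log_poisson(k, MEANS[s])
            ptr[s] = prev
        trellis.append(row)
        back.append(ptr)
    # Trace back the best path from the best final state.
    state = max(trellis[-1], key=trellis[-1].get)
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

cov = [29, 31, 33, 28, 58, 62, 61, 59, 30, 27]
path = viterbi(cov)
print(path)  # the doubled-coverage stretch is flagged as "collapsed"
```

The self-transition prior is what keeps single noisy windows from being flagged: a state switch must pay the transition penalty, so only sustained coverage shifts change the decoded label.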
bioinformatics 2026-03-02 v1
Benchmarking niche identification via domain segmentation for spatial transcriptomics data
Wang, Y.; Chen, Y.; Yang, L.; Wang, C.; Cai, J.; Xin, H.
AI Summary
- This study benchmarks 16 domain segmentation algorithms on high-resolution CosMx ST data from a human lymph node to identify tissue niches, revealing that most algorithms fail to accurately define niche boundaries in their default settings.
- The primary challenge identified is the reduction in spatial signal-to-noise ratio due to stochastic infiltration of peripheral cell types, which obscures key functional lineage distributions.
- Strategic weighting of core functional lineages improved niche resolution, highlighting the need for specialized computational methods for functional microenvironment analysis.
Abstract
Tissue niches are spatially organized microenvironments in which coordinated multicellular interactions shape cellular states and biological functions. Currently, niche identification is routinely performed using domain segmentation frameworks. While interrelated, spatial domains and niches are not fundamentally equivalent. The former emphasizes intra-domain compositional consistency and transcriptomic homogeneity, whereas the latter is defined by the emergent properties of localized signaling gradients and the functional reciprocity between key cell lineages. Here, we present a high-resolution reference by thoroughly annotating single-cell resolution CosMx ST data of a human follicular lymphoid hyperplasia lymph node, a dynamic, non-compartmentalized tissue containing several critical immune niches defined by specific lineage architectures. We systematically benchmarked 16 contemporary domain segmentation algorithms, demonstrating that most methods in their default configurations fail to recapitulate biologically defined niche boundaries. Our analysis reveals that the definitive, disjoint spatial distributions of key functional lineages are frequently obscured by the stochastic infiltration of peripheral cell types. Such reduction in the spatial signal-to-noise ratio represents a primary bottleneck for existing algorithms, which prioritize local transcriptomic variance over global architectural logic. Following this observation, we demonstrate that strategic weighting of core functional lineages can restore the resolution of spatial niches in select domain segmentation frameworks. Cross-comparison against compartmentalized tissues further underscores the unique challenges of niche identification in non-mechanically separated environments and clarifies the fundamental divergence between structural domain segmentation and functional niche discovery. 
Our work delineates the limitations of current paradigms and advocates for the development of specialized computational approaches tailored specifically to the complexity of functional microenvironments.
bioinformatics 2026-03-02 v1
GTA-5: A Unified Graph Transformer Framework for Ligands and Protein Binding Sites - Part I: Constructing the PDB Pocket and Ligand Space
Ciambur, B. C.; Pageau, R.; Sperandio, O.
AI Summary
- GTA-5 is a graph transformer auto-encoder framework that integrates ligands and protein binding sites into a unified latent space by representing them as 3D point clouds with Tripos atom type labels.
- Trained on 64,124 liganded pockets and 23,133 unique ligands, GTA-5 clusters functional protein families coherently while capturing physicochemical properties like volume and hydrophobicity.
- The framework supports applications like scaffold hopping, QSAR/QSPR modeling, and drug repurposing by enabling structural reasoning based on spatial context rather than bond connectivity.
Abstract
Structural recognition between a protein target and a ligand underpins therapeutic innovation, yet computational representations of protein binding sites and small molecules remain largely disjoint. Here we introduce GTA-5, a unified graph transformer auto-encoder framework designed to capture the geometric structure and chemical composition of ligands and protein binding pockets, embedding them into multidimensional latent spaces where proximity reflects functional compatibility. Ligands and pockets are represented as three-dimensional point clouds annotated with Tripos atom type labels, omitting explicit bond connectivity to enable structural reasoning based on spatial context rather than predefined connectivity graphs. By not enforcing bond topology, GTA-5 maintains representational flexibility across molecular modalities while preserving chemically meaningful local environments. The model was trained on a curated dataset from the Protein Data Bank comprising 64,124 liganded pockets and 23,133 unique ligands spanning 2,257 protein families. We find that functional protein families cluster coherently in both pocket and ligand latent spaces while retaining biologically meaningful heterogeneity. The model captures physicochemical pocket properties such as volume, exposure, and hydrophobicity directly from raw structural data, while ligands with distinct scaffolds co-localise when occupying similar binding environments. This provides a basis for several downstream applications including scaffold hopping in ligand-based virtual screening, QSAR/QSPR modelling using embedding-derived descriptors, and drug repurposing via pocket similarity. More broadly, the GTA-5 framework establishes a foundation for structural reasoning across molecular modalities in drug discovery.
bioinformatics 2026-03-02 v1
ProPrep: An Interactive and Instructional Interface for Proper Protein Preparation with AMBER
Walker, A.; Guberman-Pfeffer, M. J.
AI Summary
- ProPrep is an interactive interface designed to guide users through the process of preparing proteins for molecular dynamics (MD) simulations using AMBER, addressing the need for accessible yet expert-quality preparation.
- It integrates multiple functions including structure downloading, homology searches, alignment, structural repair, mutation application, and simulation setup, all within a single workspace.
- The tool was demonstrated on a 64-heme cytochrome 'nanowire' bundle, completing the preparation from a PDB file to energy minimization in 18 minutes, showcasing its efficiency and transparency through an interactive session log.
Abstract
Millions of experimental and AI-predicted protein structures are now available, and the biosynthetic promise of bespoke proteins is increasingly within reach. The functional characterization challenge thus posed cannot be addressed by experimental techniques alone. Molecular dynamics (MD) simulations offer functional screening with atomic resolution, yet accessibility remains limited. Existing computational chemistry software presents stark trade-offs whereby powerful tools require extensive expertise and manual effort, or user-friendly programs function as black boxes that obscure critical preparation decisions. Herein, we present ProPrep, an interactive workflow manager that guides users through expert-quality MD preparation by showing the 'what, why, and how' of each step while automating tedious manual operations. Within a single workspace, ProPrep integrates (1) downloading structures from multiple sources (PDB, AlphaFold, AlphaFill), (2) performing homology searches, (3) aligning structures, (4) curating and repairing structural issues, (5) applying mutations, (6) parameterizing specialized residues, (7) converting redox-active sites to forcefield-compatible forms, (8) generating topology and coordinate files, and (9) configuring, executing, and analyzing simulations with active monitoring of key quantities via ASCII visualizations. A key innovation is ProPrep's extensible transformer framework for detecting, defining, and transforming redox-active sites--including mono- and polynuclear metal centers, organic cofactors, and redox-active amino acids--for forcefield compatibility. We demonstrate the full workflow on a 64-heme cytochrome 'nanowire' bundle (PDB: 9YUQ), proceeding from a PDB file to energy minimization of the solvated system (467,635 atoms) for constant pH molecular dynamics--a process demanding 4,819 PDB record modifications and 610 bond definitions--in 18 minutes of user interaction.
The entire process is recorded in an interactive session log that can be shared and replayed for reproducibility, making simulation setup a fully transparent process that relies on what was done instead of what was remembered and reported.
bioinformatics 2026-03-02 v1
Assessment of Generative De Novo Peptide Design Methods for G Protein-Coupled Receptors
Junker, H.; Schoeder, C. T.
AI Summary
- The study assessed the effectiveness of deep learning methods (AlphaFold2 Initial Guess, Boltz-2, RosettaFold3) in designing de novo peptides for G protein-coupled receptors (GPCRs) by validating 124 known GPCR-peptide complexes.
- Generative methods (BindCraft, BoltzGen, RFdiffusion3) were evaluated for their peptide sampling capabilities, revealing issues with confidence overestimation and memorization in both prediction and generation.
- While backbone sampling was adequate, sequence generation was less effective, though improved by ProteinMPNN.
Abstract
G protein-coupled receptors (GPCRs) play a ubiquitous role in the transduction of extracellular stimuli into intracellular responses and therefore represent a major target for the development of novel peptide-based therapeutics. In fact, approximately 30% of all non-sensory GPCRs are peptide-targeted, representing a blueprint for the design of de novo peptides, both as pharmacological tools and therapeutics. Recent advances in deep learning-based protein structure generation and structure prediction offer a multitude of peptide design strategies for GPCRs, yet confidence metrics rarely correlate with experimental success. In the context of peptides, this problem is exacerbated by the lack of elaborate tertiary structure in peptides, raising the question of whether it stems from inadequate sampling or insufficient scoring. In this two-part benchmark, we addressed this question by first simulating the validation process of 124 unique known GPCR-peptide complexes using AlphaFold2 Initial Guess, Boltz-2 and RosettaFold3. We then assessed the peptide sampling capabilities of the respective generative methods BindCraft, BoltzGen and RFdiffusion3. Our results indicate that current design pipelines primarily suffer from significant confidence overestimation for misplaced peptides in the validation phase across all three prediction methods. We further highlight occurrences of significant memorization in both prediction and generation of peptides. While all generative methods sample backbone space sufficiently, their simultaneous sequence generation remains subpar and can be partially recovered through the use of ProteinMPNN. Taken together, our benchmark offers guidance for the design of peptides specifically using deep learning-based pipelines.
bioinformatics 2026-03-02 v1
SPATIALLY PATTERNED PODOCYTE STATE TRANSITIONS COORDINATE AGING OF THE GLOMERULUS
Chaney, C.; Pippin, J. W.; Tran, U.; Eng, D.; Wang, J.; Carroll, T. J.; Shankland, S. J.; Wessely, O.
AI Summary
- The study investigated how aging affects the glomerulus by analyzing single nuclei transcriptomics from kidneys of mice at different ages, focusing on regional and cell type-specific responses.
- Results showed that aging in podocytes is characterized by a transition from expressing canonical podocyte genes to showing inflammatory and senescent signatures, predominantly in the juxtamedullary region.
- Unlike podocytes, other glomerular cell types showed minimal age-related changes, indicating that podocyte aging is selective and coordinated rather than a universal degeneration.
Abstract
Background: As the US population lives longer, the risk, incidence, prevalence, and severity of chronic kidney disease all increase. Glomerular diseases are the leading cause of chronic and end-stage kidney disease. Yet the cellular responses and underlying mechanisms of progressive glomerular disease, which ultimately leads to glomerulosclerosis and loss of kidney function with advancing age, are poorly understood. Methods: Kidneys of young (4-month-old), middle-aged (20-month-old) and aged (24-month-old) mice were separated into outer cortex and juxtamedullary region and processed for single nuclei transcriptomics. Focusing on the aging glomerulus, the data were analyzed using a state-of-the-art analysis pipeline to dissect cell type-, age-, and kidney region-specific responses. Results: Global analysis of the transcriptome revealed region-specific differences detectable across multiple cell types, exemplified by the expression of Napsa as a bona fide juxtamedullary marker. In contrast, aging led to largely cell type-specific responses. In the glomerulus, healthy podocytes were characterized by expression of canonical podocyte genes; conversely, the senescent, aged podocytes were characterized by down-regulation of canonical podocyte genes and the emergence of inflammatory and senescent signatures. Interestingly, these senescent podocytes were primarily located in the juxtamedullary region, suggesting that juxtamedullary podocytes are more sensitive to aging. Rather than aging being defined by distinct cell states, the expression profiles, together with ligand-receptor and pseudotime analyses, suggest that podocyte aging is selective and coordinated, not universal degeneration. This differed from the other glomerular cell types (parietal epithelial cells, glomerular endothelial cells and mesangial cells): while they also existed as distinct subpopulations, they exhibited little region- or age-dependent change.
Finally, proximal tubular aging manifested as discrete cellular states. Conclusions: Single nuclei transcriptomics of the aging kidney provides a mechanistic explanation for the regional susceptibility of nephrons and suggests that future therapeutic strategies need to consider the cellular and spatial complexity of the glomerulus.
bioinformatics 2026-03-02 v1
Detecting Extrachromosomal DNA from Routine Histopathology
Khalid, M. A.; Gratius, M.; Brown, C.; Younis, R.; Ahmadi, Z.; Chavez, L.
AI Summary
- This study developed a deep learning framework to detect extrachromosomal DNA (ecDNA) from standard histopathology images across twelve cancer types.
- The approach successfully distinguished ecDNA-amplified tumors from chromosomally amplified or non-amplified ones, with notable results in glioblastoma.
- The method identified histomorphologic changes associated with ecDNA, correlating with poor survival outcomes, suggesting potential for routine diagnostic integration.
Abstract
Extrachromosomal DNA (ecDNA) is a major driver of oncogene amplification, tumour heterogeneity and poor clinical outcomes [1-3], yet its detection relies on specialised genomic assays that are not integrated into routine diagnostics. Here, we show that ecDNA status can be inferred directly from standard haematoxylin and eosin-stained whole-slide pathology images. We develop an end-to-end, weakly supervised deep learning framework that aggregates thousands of high-magnification patches per slide with slide-level augmentation and interpretable attention. Across twelve cancer types from The Cancer Genome Atlas, the approach identifies tumours with genomic amplifications and, critically, distinguishes ecDNA-amplified from chromosomally amplified or non-amplified tumours, with the strongest signal in glioblastoma. Attention maps localise regions enriched for nuclei with altered chromatin intensity and texture, and predicted ecDNA status recapitulates its adverse association with survival. These results indicate that ecDNA amplifications leave reproducible histomorphologic footprints detectable by routine pathology, enabling scalable screening to prioritise tumours for confirmatory molecular testing.
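The weakly supervised aggregation described above belongs to the attention-based multiple-instance learning family, which can be sketched as a softmax-weighted average of patch embeddings followed by a slide-level classifier. The random embeddings and weight vectors below are stand-ins; the paper's trained network, augmentations, and architecture are not reproduced.

```python
# Sketch of attention pooling: many patch embeddings -> one slide prediction.
import math
import random

random.seed(1)
DIM = 8
patches = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(5)]
attn_w = [random.gauss(0, 1) for _ in range(DIM)]  # attention projection (toy)
clf_w = [random.gauss(0, 1) for _ in range(DIM)]   # slide-level classifier (toy)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Attention scores -> softmax weights over patches.
scores = [dot(p, attn_w) for p in patches]
m = max(scores)
exps = [math.exp(s - m) for s in scores]
weights = [e / sum(exps) for e in exps]

# Weighted average of patch embeddings gives the slide representation.
slide = [sum(w * p[i] for w, p in zip(weights, patches)) for i in range(DIM)]
prob = 1 / (1 + math.exp(-dot(slide, clf_w)))  # e.g. a toy P(ecDNA-positive)
print("attention weights:", [round(w, 3) for w in weights])
print(f"slide-level probability: {prob:.3f}")
```

The attention weights are also what makes the model interpretable: high-weight patches are exactly the regions the abstract's attention maps highlight.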
bioinformatics 2026-03-02 v1
Scalable mass-spectrometry-based molecular phylogeny with TreeMS2
Dierckx, M.; Adams, C.; Gauglitz, J. M.; Bittremieux, W.
AI Summary
- TreeMS2 extends molecular phylogeny to proteomic and metabolomic data by comparing MS/MS spectra, bypassing annotation for rapid analysis.
- The tool constructs phenotype-derived trees that can be compared with genetic trees, revealing where molecular phenotypes align or diverge from evolutionary history.
- Across various datasets, TreeMS2 effectively reconstructs biological relationships, distinguishing cell types in single-cell proteomics and resolving biochemical structures in metabolomics.
Abstract
Molecular phylogeny is a well-established method for inferring evolutionary relationships from DNA and RNA sequences. Here, we extend this concept beyond genetic information by applying phylogeny-like analysis to proteomic and metabolomic mass spectrometry data, capturing relationships based on the realized molecular phenotype. The resulting phenotype-derived trees can be directly compared with conventional genetic-based trees to identify where molecular phenotypes reflect evolutionary history and where they diverge due to functional adaptation, regulation, or environmental influence. To enable this analysis, we introduce TreeMS2, a computational tool that constructs similarity matrices by directly comparing tandem mass spectrometry (MS/MS) spectra between samples. By bypassing spectrum annotation, TreeMS2 enables rapid, unbiased comparisons. Across diverse datasets, TreeMS2 reconstructs biologically meaningful relationships. In proteomics, phenotype-derived trees recapitulate established taxonomy, with deviations pinpointing sample handling errors. In single-cell proteomics, our method distinguishes cell types despite sparse and noisy measurements, and in metabolomics it resolves major biochemical divisions and fine-scale compositional structure. Together, these results establish TreeMS2 as a scalable, annotation-independent framework for deriving molecular relationships from raw MS/MS data.
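The annotation-free comparison can be illustrated by binning two MS/MS peak lists onto a shared m/z grid and scoring them with cosine similarity, a common spectrum-similarity baseline. The peak lists are invented, and TreeMS2's actual similarity computation and tree construction are more involved than this.

```python
# Binned cosine similarity between raw (m/z, intensity) peak lists.
import math

def binned(peaks, bin_width=1.0, mz_max=500):
    """Turn (m/z, intensity) pairs into a fixed-length intensity vector."""
    vec = [0.0] * int(mz_max / bin_width)
    for mz, inten in peaks:
        vec[int(mz / bin_width)] += inten
    return vec

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

spec_a = [(147.1, 40.0), (204.1, 100.0), (360.2, 25.0)]
spec_b = [(147.2, 35.0), (204.0, 90.0), (433.9, 10.0)]  # shares two peaks with A
spec_c = [(101.0, 80.0), (250.5, 60.0)]                 # shares none

sim_ab = cosine(binned(spec_a), binned(spec_b))
sim_ac = cosine(binned(spec_a), binned(spec_c))
print(f"A~B: {sim_ab:.3f}   A~C: {sim_ac:.3f}")
```

Averaging such scores over all spectrum pairs between two samples yields one entry of the sample-by-sample similarity matrix from which a tree can be built.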
bioinformatics 2026-03-02 v1
Prediction and analysis of new HisKA-like domains
Silly, L.; Perriere, G.; Ortet, P.
AI Summary
- This study analyzed 869,964 sequences of incomplete histidine kinases (iHKs) with HATPase but lacking HisKA domains to identify new HisKA-like domains.
- 18 HisKA-like profiles were identified, with their 3D structures matching known HisKA domains and genomic contexts indicating involvement in signal transduction.
- The findings were cross-validated with curated annotations and a negative dataset, suggesting potential improvements in annotating prokaryotic regulation pathways.
Abstract
Histidine kinases (HKs) are part of many signaling pathways through their involvement in two-component systems (TCSs). Using autophosphorylation and phosphotransfer to a response regulator (RR), they enable organisms to adapt to their environment. Most HKs are transmembrane proteins with a sensing domain outside the cell and two catalytic domains, called HisKA and HATPase. HATPase is required for interaction with ATP, and HisKA contains the phosphorylated histidine residue. HKs are involved in various environmental adaptation mechanisms, such as light sensing or responses to biochemical changes. Studying their diversity is therefore important to better understand how cells interact with their environment. There exist incomplete HKs (iHKs) lacking either the HisKA or the HATPase domain. Some iHKs with an HATPase domain possess a section of their sequence where a HisKA domain could be expected. These iHKs may include "true" HKs with unknown HisKA domains that could fill gaps in various signaling pathways. In this study we analyzed 869,964 sequences of iHKs having an HATPase domain but lacking a HisKA domain. We identified 18 HisKA-like profiles and performed multiple meta-analyses to assess their HisKA-like characteristics. We found that their 3D structures matched those of known HisKA domains, and that the genomic context of the genes associated with these profiles contained genes implicated in signal transduction pathways. We cross-validated some of our profiles with curated annotations, as well as with a "negative dataset" of non-HK proteins. We believe that our work could help improve the annotation of regulation pathways in prokaryotes.
bioinformatics 2026-03-02 v1
Atlas-scale spatially aware clustering with support for 3D and multimodal data using SpatialLeiden
Müller-Bötticher, N.; Malt, A.; Kiessling, P.; Eils, R.; Kuppe, C.; Ishaque, N.
AI Summary
- SpatialLeiden was extended to handle atlas-scale, multi-sample, 3D, and multimodal spatial omics data through neighbor-graph multiplexing on batch-corrected latent spaces.
- The algorithm demonstrated superior performance in creating coherent domains aligned with brain atlases, reconstructing 3D cancer tissue structures, and integrating multimodal features, surpassing specialized tools in modularity and scalability.
Abstract
Here we extend SpatialLeiden, our spatial clustering algorithm, to enable generalised atlas-scale multi-sample, 3D serial-section, and multimodal spatial omics analysis via flexible neighbour-graph multiplexing on batch-corrected latent spaces. It delivers coherent domains that align with brain atlases across >100 samples, stable 3D reconstruction of cancer tissue structures, and integrated multimodal features, outperforming specialized tools in modularity and scalability on standard hardware. SpatialLeiden is compatible with scverse for broad and intuitive adoption.
bioinformatics 2026-03-02 v1
scProfiterole: Clustering of Single-Cell Proteomic Data Using Graph Contrastive Learning via Spectral Filters
Coskun, M.; Lopes, F. B.; Kubilay Tolunay, P.; Chance, M. R.; Koyuturk, M.
AI Summary
- The study addresses the challenge of clustering single-cell proteomic data by introducing scProfiterole, which uses graph contrastive learning (GCL) with spectral filters to improve cell type identification.
- scProfiterole employs three types of homophilic filters (random walks, heat kernels, beta kernels) and uses Arnoldi orthonormalization for efficient polynomial interpolation of these filters.
- Key findings show that GCL with learnable polynomial coefficients, along with heat and beta kernels, enhances clustering performance, with polynomial interpolation outperforming traditional methods.
Abstract
Novel technologies for the acquisition of protein expression data at the single cell level are emerging rapidly. Although there exists a substantial body of computational algorithms and tools for the analysis of single cell gene expression (scRNAseq) data, tools for even basic tasks such as clustering or cell type identification for single cell proteomic (scProteomics) data are relatively scarce. Adoption of algorithms that have been developed for scRNAseq into scProteomics is challenged by the larger number of drop-outs, missing data, and noise in single cell proteomic data. Graph contrastive learning (GCL) on cell-to-cell similarity graphs derived from single cell protein expression profiles shows promise in cell type identification. However, missing edges and noise in the cell-to-cell similarity graph require careful design of convolution matrices to overcome the imperfections in these graphs. Here, we introduce scProfiterole (Single Cell Proteomics Clustering via Spectral Filters), a computational framework to facilitate effective use of spectral graph filters in GCL-based clustering of single cell proteomic data. Since clustering assumes a homophilic network topology, we consider three types of homophilic filters: (i) random walks, (ii) heat kernels, (iii) beta kernels. Since direct implementation of these filters is computationally prohibitive, the filters are either truncated or approximated in practice. To overcome this limitation, scProfiterole uses Arnoldi orthonormalization to implement polynomial interpolations of any given spectral graph filter.
Our results on comprehensive single cell proteomic data show that (i) graph contrastive learning with learnable polynomial coefficients that are carefully initialized improves the effectiveness and robustness of cell type identification, (ii) heat kernels and beta kernels improve clustering performance over adjacency matrices or random walks, and (iii) polynomial interpolation of complex filters outperforms approximation or truncation. The source code for scProfiterole is available at https://github.com/mustafaCoskunAgu/scProfiterole
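The trade-off between exact and polynomial spectral filters can be seen on a toy graph. The sketch below contrasts an exact heat-kernel filter with a truncated Taylor polynomial on a small random similarity graph; scProfiterole itself instead uses Arnoldi orthonormalization to interpolate arbitrary filters, so this is only a simplified illustration of the underlying idea:

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy symmetric cell-to-cell similarity graph on 20 "cells"
A = (rng.random((20, 20)) < 0.2).astype(float)
A = np.maximum(A, A.T)
np.fill_diagonal(A, 0)
deg = A.sum(1)
deg[deg == 0] = 1                               # guard isolated nodes
D_inv_sqrt = np.diag(1 / np.sqrt(deg))
L = np.eye(20) - D_inv_sqrt @ A @ D_inv_sqrt    # normalised Laplacian

t = 1.0
# Exact heat-kernel filter exp(-tL) via eigendecomposition
w, V = np.linalg.eigh(L)
H_exact = V @ np.diag(np.exp(-t * w)) @ V.T

# Truncated Taylor polynomial of the same filter, order K
K = 10
H_poly = np.zeros_like(L)
term = np.eye(20)
for k in range(K + 1):
    H_poly += term                              # accumulate (-tL)^k / k!
    term = term @ (-t * L) / (k + 1)

err = np.linalg.norm(H_exact - H_poly) / np.linalg.norm(H_exact)
```

At order 10 the polynomial already matches the exact filter closely on this toy graph; the point of learnable or interpolated coefficients is to reach such accuracy without hand-tuning the truncation order.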
bioinformatics2026-02-28v1LRSomatic: a highly scalable and robust pipeline for somatic variant calling in long-read sequencing data
Forsyth, R. A.; Harbers, L.; Verhasselt, A.; Iraizos, A.-L. R.; Yang, S.; Vande Velde, J.; Davies, C.; Pillay, N.; Lambrechts, L.; Demeulemeester, J.AI Summary
- LRSomatic is a Nextflow-based pipeline for somatic variant calling from long-read sequencing data, supporting SNV, indel, structural variant, and copy number analysis in both PacBio HiFi and ONT platforms.
- It accommodates paired tumor-normal and tumor-only designs, with the option for epigenetic integration via Fiber-seq.
- Benchmarking on COLO829 and HG008 showed high performance, and application to a clear cell sarcoma case identified all driver alterations, including the EWSR1::ATF1 fusion.
Abstract
Motivation Long-read sequencing is increasingly used in cancer research and clinical genomics due to its ability to resolve complex genomic variation and previously inaccessible regions of the genome. However, dedicated workflows for comprehensive somatic variant analysis from long-read whole-genome data remain scarce, limiting uptake in cancer genomics. Results We present LRSomatic, a Nextflow-based, nf-core-compliant pipeline supporting somatic SNV, indel, structural variant, and copy number calling from PacBio HiFi and ONT data. LRSomatic supports paired tumor-normal and tumor-only designs, as well as integration of epigenetic information via Fiber-seq. Benchmarked on COLO829 and HG008 reference cell lines, LRSomatic achieves state-of-the-art performance across both platforms and variant types. Applied to a case of clear cell sarcoma, it recovers all identified driver alterations, including the pathognomonic EWSR1::ATF1 fusion, and resolves haplotype-specific chromatin accessibility via Fiber-seq. Availability and Implementation Freely available at https://github.com/intgenomicslab/lrsomatic, implemented in Nextflow DSL2, supported via Docker and Singularity.
bioinformatics2026-02-28v1Arborist: Prioritizing Bulk DNA Inferred Tumor Phylogenies via Low-pass Single-cell DNA Sequencing Data
Weber, L. L.; Ching, C. Y.; Ly, C.; Pan, Y.; Cheng, Y.; Gao, C.; Van Loo, P.AI Summary
- The study introduces ARBORIST, a method that integrates bulk DNA sequencing with low-pass single-cell DNA sequencing to improve tumor phylogeny reconstruction.
- ARBORIST uses variational inference to prioritize tumor phylogenies by approximating the marginal likelihood of candidate trees.
- Testing on simulated and biological data showed ARBORIST outperforms existing methods, resolving evolutionary relationships in a malignant peripheral nerve sheath tumor.
Abstract
Cancer arises from an evolutionary process that can be reconstructed from DNA sequencing and modeled by tumor phylogenies. High coverage bulk DNA sequencing (bulk DNA-seq) is widely available, but tumor phylogeny inference requires deconvolution, often resulting in non-uniqueness in the solution space. Single-cell DNA sequencing (scDNA-seq) holds potential to yield higher resolution tumor phylogenies, but the sparsity of emerging low-pass sequencing technologies poses challenges for the study of single-nucleotide variants. Increasing availability of data sequenced with both modalities provides an opportunity to capitalize on the advantages of these technologies. While inference methods exist for bulk DNA-seq and for low-pass scDNA-seq, no joint inference methods currently exist. As a first step, we propose a method named ARBORIST that prioritizes tumor phylogenies inferred via bulk DNA-seq using low-pass scDNA-seq data. ARBORIST takes as input a candidate set of trees with corresponding SNV clustering, along with variant and total read count data from scDNA-seq, and uses variational inference to approximate a lower bound on the marginal likelihood of each tree in the candidate set. On simulated data, matching characteristics of current scDNA-seq data, ARBORIST outperforms both bulk and low-pass single-cell reconstruction methods. On a biological dataset, ARBORIST conclusively resolves the evolutionary relationship between different SNV clusters in a malignant peripheral nerve sheath tumor, which is supported by orthogonal validation via a proxy for copy number. ARBORIST provides a principled framework for integrating bulk DNA-seq and low-pass scDNA-seq data, improving confidence in tumor phylogeny reconstruction. Availability: https://github.com/VanLoo-lab/Arborist
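The variational bound used to score candidate trees can be illustrated on a toy clone-assignment model. The sketch below uses synthetic log-likelihoods rather than the authors' tree-structured read-count model; it only shows the key property exploited here, namely that the evidence lower bound (ELBO) never exceeds the exact log marginal likelihood and becomes tight at the exact posterior:

```python
import numpy as np

rng = np.random.default_rng(2)
# Toy setting: 50 cells, 3 clones; log p(reads_i | clone k) per cell and
# clone (synthetic stand-in values, not real read-count likelihoods).
log_lik = rng.normal(size=(50, 3))
log_prior = np.log(np.full(3, 1 / 3))      # uniform prior over clones

def elbo(log_lik, log_prior, q):
    """ELBO E_q[log p(x, z)] + H(q) for a factorised q (entries > 0)."""
    joint = q * (log_lik + log_prior)
    entropy = -(q * np.log(q))
    return (joint + entropy).sum()

# Exact log marginal likelihood: sum_i log sum_k p(x_i | k) p(k)
logits = log_lik + log_prior
log_marg = np.logaddexp.reduce(logits, axis=1).sum()

# The bound is tight when q is the exact per-cell posterior ...
post = np.exp(logits - np.logaddexp.reduce(logits, axis=1, keepdims=True))
# ... and strictly below log_marg for any other q, e.g. uniform
q_uniform = np.full((50, 3), 1 / 3)
```

In ARBORIST's setting the tree constrains which clone assignments are admissible, so comparing (approximate) marginal likelihoods across candidate trees ranks the phylogenies.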
bioinformatics2026-02-28v1Random Matrix Theory-guided sparse PCA for single-cell RNA-seq data
Chardes, V.AI Summary
- The study introduces a Random Matrix Theory (RMT)-guided sparse PCA method for single-cell RNA-seq data to address noise and variability issues, using a novel biwhitening algorithm to estimate noise per gene.
- This approach automatically selects sparsity levels, making sparse PCA nearly parameter-free, and retains PCA's interpretability.
- Across various technologies and algorithms, the method improved principal subspace reconstruction and outperformed traditional PCA, autoencoders, and diffusion methods in cell-type classification.
Abstract
Single-cell RNA-seq provides detailed molecular snapshots of individual cells but is notoriously noisy. Variability stems from biological differences and technical factors, such as amplification bias and limited RNA capture efficiency, making it challenging to adapt computational pipelines to heterogeneous datasets or evolving technologies. As a result, most studies still rely on principal component analysis (PCA) for dimensionality reduction, valued for its interpretability and robustness, in spite of its known bias in high dimensions. Here, we improve upon PCA with a Random Matrix Theory (RMT)-based approach that guides the inference of sparse principal components using existing sparse PCA algorithms. We first introduce a novel biwhitening algorithm which self-consistently estimates the magnitude of transcriptomic noise affecting each gene in individual cells, without assuming a specific noise distribution. This enables the use of an RMT-based criterion to automatically select the sparsity level, rendering sparse PCA nearly parameter-free. Our mathematically grounded approach retains the interpretability of PCA while enabling robust, hands-off inference of sparse principal components. Across seven single-cell RNA-seq technologies and four sparse PCA algorithms, we show that this method systematically improves the reconstruction of the principal subspace and consistently outperforms PCA-, autoencoder-, and diffusion-based methods in cell-type classification tasks.
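The RMT criterion for separating signal from noise can be sketched with the classic Marchenko-Pastur bulk edge. The example below is a simplified stand-in for the paper's biwhitening procedure: it assumes the matrix has already been whitened to unit noise variance, plants one rank-1 signal, and counts covariance eigenvalues above the edge:

```python
import numpy as np

rng = np.random.default_rng(3)
n_cells, n_genes = 500, 200
# Pure-noise matrix with unit-variance entries (already "whitened")
X = rng.normal(size=(n_cells, n_genes))
# Plant one strong rank-1 signal (a coarse "cell type" axis)
u = rng.normal(size=(n_cells, 1))
v = rng.normal(size=(1, n_genes))
X_signal = X + 5 * u @ v / np.sqrt(n_cells)

gamma = n_genes / n_cells
mp_edge = (1 + np.sqrt(gamma)) ** 2       # Marchenko-Pastur bulk edge

def n_signal_components(M):
    """Count sample-covariance eigenvalues above the MP bulk edge."""
    evals = np.linalg.eigvalsh(M.T @ M / M.shape[0])
    return int((evals > mp_edge * 1.05).sum())   # 5% tolerance above edge
```

Eigenvalues below the edge are indistinguishable from noise, so only components above it are retained; the same cutoff can then guide how many sparse principal components to infer.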
bioinformatics2026-02-28v1