Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
Metadata Collector: An Open-Source Platform for Standardized Metadata Management in Multi Centre Sequencing Projects
Liguori, R.; Huttner, M.; Ferrazzi, F.
Abstract
Background: Next-generation sequencing (NGS) projects generate increasingly complex metadata that are critical for reproducibility, interoperability, and compliance with FAIR principles. Nevertheless, metadata curation in multi-institutional settings often still relies on spreadsheets, manual data entry and curation, as well as non-standardized terminology. These practices frequently result in incomplete or inconsistent annotations, hinder metadata sharing, and delay submission to public repositories. Results: We developed Metadata Collector as a React/API/PostgreSQL web platform and deployed it on a Kubernetes cluster within a large German research consortium. The platform implements a flexible, machine-readable metadata model for experimental data and integrates customizable templates, controlled vocabularies designed to support future ontology integration, and a complete event-based versioning model. Since deployment, Metadata Collector has been used across 32 projects involving RNA-seq, scRNA-seq, ATAC-seq and multiomics datasets, representing over 700 annotated samples contributed by multiple consortium partners. The platform is designed for use by non-computational researchers as well as centralized facilities and can be integrated into existing research data management infrastructures. Conclusions: Metadata Collector embeds standardization early in the metadata lifecycle, ensuring consistent, FAIR-aligned, and reproducible metadata across distributed research groups. Its modular, open-source architecture supports both local and consortium-scale deployments and provides a foundation for future extensions, including multi-omics support and integration with laboratory information management systems and automated submission pipelines.
bioinformatics 2026-06-08 v3
Do Larger Models Really Win in Drug Discovery? A Benchmark Assessment of Model Scaling in AI-Driven Molecular Property and Activity Prediction
Guo, J.; Ding, S.
Abstract
The rapid growth of molecular foundation models and large language models has encouraged a scale-centred view of AI in drug discovery, in which larger pretrained models are expected to supersede compact cheminformatics models and graph neural networks (GNNs) trained for individual tasks. We test this assumption across 26 endpoints for molecular properties, toxicity, safety liabilities and biological activity, grouped into ADME, toxicity and bioactivity classes. The benchmark contains 78 endpoint-and-split entries spanning random, Murcko scaffold and structure-separated 5-fold CV. Ordered from easiest to hardest, these splits approximate retrospective evaluation on a closed library, scaffold expansion in hit-to-lead, and library expansion on novel chemotypes. Each entry includes ML, GNN, pretrained molecular sequence and LLM-based SAR families. Across 156 fold-mean comparisons, classical ML such as RF(ECFP4) and ExtraTrees(RDKit) win 116, GNNs such as GIN and Ligandformer win 25, pretrained sequence models such as MoLFormer and ChemBERTa2 win 12, and LLM-based SAR baselines win three. ML dominates random-split interpolation but loses part of this advantage under harder splits; GNN and sequence models also decline but gain relative ground, whereas LLM-based SAR is weaker in absolute terms yet less sensitive to the split axis. Paired bootstrap analyses support family-level trends more strongly than individual model rankings. SAR knowledge derived from training folds improves many GPT5.5-SAR and Opus4.7-SAR metrics but does not make rule-based reasoning a universal substitute for supervised predictors. Compact specialized models remain highly effective for molecular property and activity prediction. Larger models add value for SAR interpretation and reasoning in low-data settings, but predictive performance depends on the fit among model, task and validation scenario, not on scale alone.
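The tallies in this abstract can be sanity-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in the abstract; the dictionary keys and the helper name `win_share` are illustrative, and the assumption that the 78 entries arise as 26 endpoints times 3 split types is an inference from the text rather than a stated fact.

```python
# Win counts per model family, as quoted in the abstract.
wins = {
    "classical ML": 116,
    "GNN": 25,
    "pretrained sequence": 12,
    "LLM-based SAR": 3,
}

# The four counts should add up to the 156 fold-mean comparisons.
total = sum(wins.values())

# Assumption: 26 endpoints x 3 split types (random, Murcko scaffold,
# structure-separated) gives the 78 benchmark entries.
entries = 26 * 3


def win_share(family: str) -> float:
    """Fraction of all fold-mean comparisons won by one model family."""
    return wins[family] / total


if __name__ == "__main__":
    print(f"total comparisons: {total}, entries: {entries}")
    for family in wins:
        print(f"{family}: {win_share(family):.1%}")
```

Expressing the counts as shares makes the family-level picture easier to compare across benchmarks of different sizes.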
bioinformatics 2026-06-08 v3
Protein large language model assisted one-to-one gene homology mapping in cross-species single-cell transcriptome integration
Kuang, Z.-Y.; Sun, Y.-C.; Wei, N.-N.; Wang, Y.-J.; Wu, H.-J.
Abstract
Cross-species integration of single-cell transcriptomes requires establishing gene correspondences to enable comparative analysis of expression profiles across organisms. Current approaches predominantly rely on Ensembl homology tables, whose default many-to-many mappings often amplify gene-family effects and introduce artifactual micro-clusters that lack clear cell-type identity, thereby complicating biological interpretation. While restricting mappings to a one-to-one scheme suppresses such artifacts, it reduces the number of homology gene pairs by approximately 8% (~900 pairs). To address this limitation, we developed a protein large language model (pLLM)-based gene homology mapping strategy that boosts the number of homology gene pairs. By integrating pLLM-derived representations with sequence similarity, we constructed a fused mapping approach, which achieved top performance in a comprehensive benchmark based on a curated cross-species atlas spanning nine datasets, 11 species, and over 3.2 million cells. Our method further identifies previously unannotated cell-type marker pairs, facilitating novel cross-species marker discovery. These results establish a robust framework for gene homology mapping in cross-species transcriptome integration, improving both accuracy and biological interpretability.
bioinformatics 2026-06-08 v3
Melody: Decoding the Sequence Determinants of Locus-Specific DNA Methylation Across Human Tissues
Jin, J.; Wang, D.; Qiao, J.; Gao, W.; Liu, Y.; Chen, S.; Zou, Q.; Wu, S.; Su, R.; Wei, L.
Abstract
DNA methylation is a fundamental epigenetic modification that plays crucial roles in transcriptional regulation, cellular differentiation, and genome stability. However, how locus-specific DNA methylation is determined by intrinsic DNA sequence remains poorly understood. Here, we introduce Melody, a deep learning framework that predicts DNA methylation from 10-kb genomic sequences, enabling the integration of both local and long-range sequence signals. Across 39 human tissues, Melody accurately predicts methylation profiles and consistently outperforms existing state-of-the-art methods in whole-chromosome, hypomethylated-region, and cell-type-specific benchmarks. Melody also generalizes to methylation quantitative trait locus (meQTL) effect prediction and identifies regulatory sequence motifs associated with methylation variability. To extend prediction beyond profiled tissues, we further develop Melody-G, which incorporates single-cell RNA-seq foundation model embeddings to infer methylation states in previously unseen cell types directly from transcriptomic data. Together, Melody provides a unified framework for linking genomic sequence and cellular state to DNA methylation and offers new insights into the regulatory logic governing the human methylome.
bioinformatics 2026-06-08 v2
Multi-feature Classification to Improve Colorimetric Loop-Mediated Isothermal Amplification Fidelity
Melton, G.; Negron, D. A.; Hauser, K.; Jagannathan, S.; Tolli, N.; Jennings, K.; Necciai, B.; Sozhamannan, S.; Abramson, B.
Abstract
Loop-mediated isothermal amplification (LAMP) is a cost-effective and portable assay technique for performing nucleic acid-based diagnostics in the field whose adoption is hindered by design and reproducibility issues. This is due to a complex primer design process that fine-tunes parameters across 6-8 binding regions. The likelihood of assay success depends on satisfying thermodynamic and secondary structure constraints while maintaining target specificity and avoiding overlaps between multiple primers. Software such as the NEB® LAMP Primer Design Tool, PREMIER Biosoft LAMP Designer, Primer3, PCR Signature Erosion Tool (PSET), and PrimerExplorer enable automation of this task for researchers. However, in our experience, these programs can sometimes yield inconsistent results in laboratory testing. Here, we approached the issue by comparing and training multiple machine learning (ML) models on primer sets targeting various organisms from working assays and failing ones to determine significant features and improve predictions prior to ordering primer sets. A literature review produced an initial list of primer sets (n=116), which were then filtered down based on reference template availability to discern their FIP/BIP components (F2/F1c and B1c/B2). The final training set (n=109) included sequence and thermodynamic features derived from primers collected from the review (n=74) and those designed in-house with PSET (n=35). Failing assays were difficult to obtain from the publications, so we provided our own (n=23). Using WEKA Experimenter, models were created based on decision tree and Bayesian learning algorithms using an experimental scheme that performed a parameter grid search, seeded replicates, feature selection, and cross-validation while avoiding data leakage and outputting logs for model comparison, feature analysis, and overfit assessment.
Notably, thermodynamic features associated with the F1c and B1c primers consistently appeared in the top ranks according to consensus between information gain, class-correlation, and model-based feature ranking. For classification, the NaiveBayes algorithm had a TP and TN rate of 0.90 (± 0.02) and 0.73 (± 0.05) while achieving Cohen's kappa coefficient and F-score values of 0.61 (± 0.06) and 0.91 (± 0.01). This work highlights how a practical model was built from a small, imbalanced training set incorporating negative research results, of which more are needed to improve generalization and refine parameters critical to assay success.
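For readers less familiar with the reported metrics, the following minimal sketch shows how TP rate, TN rate, F-score and Cohen's kappa are derived from a binary confusion matrix. The function name and the example counts are hypothetical and are not taken from the study; the authors report using WEKA, not this code.

```python
def classification_summary(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts."""
    n = tp + fn + tn + fp
    tp_rate = tp / (tp + fn)          # sensitivity / recall
    tn_rate = tn / (tn + fp)          # specificity
    precision = tp / (tp + fp)
    f_score = 2 * precision * tp_rate / (precision + tp_rate)

    # Cohen's kappa: observed agreement corrected for chance agreement.
    observed = (tp + tn) / n
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n**2
    kappa = (observed - expected) / (1 - expected)

    return {
        "tp_rate": tp_rate,
        "tn_rate": tn_rate,
        "f_score": f_score,
        "kappa": kappa,
    }


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate the arithmetic.
    summary = classification_summary(tp=40, fn=10, tn=30, fp=20)
    for name, value in summary.items():
        print(f"{name}: {value:.3f}")
```

Kappa is useful on imbalanced sets like the one described because it discounts the agreement a classifier would reach by chance given the class proportions.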
bioinformatics 2026-06-08 v1
From topography to connectome: Towards an integrated understanding of the resting brain
Naranjo Rincon, S.; Ahmad, F.; Easley, T.; Shoushtari, S.; Glatard, T.; Kiar, G.; Modi, H.; Dahan, S.; Robinson, E.; Kamilov, U.; Bijsterbosch, J.
Abstract
As the field expands from early research into the human connectome, there has been a fast expansion in the number of analytical approaches to study resting state functional MRI (rsfMRI) data. With increasing focus on individual differences, topographical brain maps of spatial organization have emerged in addition to traditional functional connectomes. Here, we developed a deep-learning model to embed maps of network topography and faithfully translate to individualized connectomes. Results confirmed the validity of the surface vision transformer based on reconstruction accuracy (0.73 ± 0.09) and accurate topography-to-connectome translation (0.43 ± 0.08). Importantly, translated connectomes retained identifiability and brain-cognition associations. These findings establish a direct mapping from spatial topography to connectomes that can be used to integrate scientific insights across rsfMRI sub-fields. This is an important step towards broadening our conceptualization of the connectome and supporting broader integration of findings to inform a complete understanding of the human connectome.
bioinformatics 2026-06-08 v1
scFAIR Consortium: a decentralized hub for single-cell RNA-Seq data standardization and unification
Gardeux, V.; Carsanaro, S.; Chen, W. J.; David, F. P. A.; Goutte-Gattat, D.; Hilton, J. A.; Lubiana, T.; Patel, N.; Raymor, B.; Zucchi, I.; Deplancke, B.; Ernst, C.; Osumi-Sutherland, D.; Robinson-Rechavi, M.; Sternberg, P. W.; Bastian, F. B.
Abstract
The rapid accumulation of single-cell RNA-Seq (scRNA-seq) data across multiple repositories presents major challenges for data accessibility, integration, and reproducibility. While primary repositories provide raw data, they rarely include structured cell-type annotations or descriptions of analytical workflows, limiting the ability to reuse and integrate datasets in a FAIR (Findable, Accessible, Interoperable, Reusable) manner. Here we present scFAIR, a consortium of single-cell data resources that has developed a unified metadata schema and common curation framework to improve the FAIRness of scRNA-seq data. Building on and extending the CZ CELLxGENE Discover metadata schema, the scFAIR consortium has been instrumental in driving key schema improvements, including the expansion of supported organisms, richer biological context, and structured reporting of computational workflows. To provide unified access to decentralized datasets, the consortium developed the sc-fair.org portal, which currently aggregates 2,346 datasets across partner resources through ontology-aware semantic search. We demonstrate the practical value of FAIR-compliant datasets through a cross-species validation between human and mouse Allen Brain Atlases, showing that standardized ontology annotations enable reliable annotation transfer across species, with 90% of neuronal clusters receiving an exact or equivalent label. Together, the scFAIR schema, validator, and portal constitute a community-driven framework that advances single-cell data standardization and lays the foundation for reproducible, large-scale integration of single-cell datasets.
bioinformatics 2026-06-08 v1
Intra-slide calibration technology improves immunohistochemical harmonization within and between anatomic pathology laboratories
Fernandes, G. M. d. M.; Wang, W.; Parwani, A.; Ahmadian, S. S.; Alves, M. J.; Philips, J. J.; Otero, J. J.
Abstract
The reproducibility of immunohistochemistry in tumor tissue analysis across reference labs remains a persistent challenge. We tested the extent to which an intra-slide calibration technology mitigated discrepancies in inter-laboratory assays of p53 immunohistochemical (IHC) reactions in brain biopsies of glioblastoma (GB), IDH-wildtype. Intra-slide calibration technologies apply a 0-100% concentration scale incorporating primary surrogate and secondary antibodies to generate a standardized curve for DAB precipitation. IHC of GB samples was performed independently by the pathology departments of two different hospital laboratories, and the slides were digitized at 40x magnification using Aperio ImageScope software. Feature extraction, including intensity and texture parameters, was performed using the EBImage package in R, followed by UMAP dimensionality reduction and DBSCAN clustering analysis. Our results show significant differences in intensity and texture clustering patterns between laboratory tissue samples and the intra-slide calibration technology ruler, caused by the different laboratories. Intra-slide calibration technology coupled with polynomial regression analysis improved data harmonization by ~90%. Our findings demonstrate a key role for computational pathology using intra-slide calibration technology to enable intra-laboratory consistency and inter-laboratory reproducibility. These advances strengthen the reproducibility of diagnostic assessments and support more objective, data-driven decision-making in neuro-oncology.
bioinformatics 2026-06-08 v1
HydraMPP: A lightweight library for distributed massive parallel processing in Python - threading at scale.
Figueroa, J. L.; White, R. A.
Abstract
We live in an era of massive datasets from genomics and large language models, with much of humanity's knowledge at our fingertips. Much of this data is becoming more accessible; however, processing it remains an ongoing challenge across systems, including high-performance computing (HPC) infrastructures. Massively parallel computing (MPP) addresses this with a divide-and-conquer approach that splits workloads across independent nodes (i.e., central processing units (CPUs)), allowing data processing to scale. The main engine for this in Python is Ray; however, it has drawbacks, including a large code base, security issues, debugging opacity, and memory management issues. Here, we present HydraMPP, a lightweight, easy-to-use library with high auditability and SLURM ergonomics.
bioinformatics 2026-06-08 v1
DDI_single: Single-Sequence-Based Protein Domain Assembly
Shengyi, Z.
Abstract
Domains are the basic units of protein structure and function. Appropriate inter-domain organization is critical to enable cooperative execution of multiple related functions. It is thus a crucial step to determine the full-length structure of multi-domain proteins for the purpose of elucidating their functions and designing new drugs to regulate these functions. Existing structure prediction algorithms are generally better at solving the internal conformation of domains, rather than modeling the relative positions between domains. To address the challenge of accurately determining multi-domain protein conformations, we develop a single-sequence-based domain assembly algorithm called DDI_single. DDI_single directly extracts features from the amino acid sequence using the protein language model ESM-1b, and accurately predicts the interactions between residue pairs of structural domains through a novel gated cross-attention module, thus achieving the correct assembly of structural domains. With the knowledge of domain definition, DDI_single achieves more than 20% higher accuracy in the task of predicting the relative distances of residue pairs between domains than that of the single-sequence-based structure prediction algorithm trRosettaX_single. When assembling domains with known spatial conformations, DDI_single correctly assembles 74.4% of the samples in the test set (TM-score>0.5). When assembling domains with unknown spatial conformations, in cases where the internal spatial conformations of domains are correctly modeled, DDI_single correctly assembles 73.9% of the samples.
bioinformatics 2026-06-08 v1
HNSW-MS: Hierarchical Graph Indexing Enables Accurate Real-Time Mass Spectral Similarity Search at Repository Scale
Semenov, A.; Gupta, S.; Roberts, A. M. P.; Boginski, V.; Aksenov, A. A.
Abstract
Spectral similarity search is the basis of mass spectrometry-based metabolomics, underpinning library matching, molecular network construction, and repository searches such as MASST. Until recently, dataset sizes were limited, making exhaustive pairwise comparison tractable. This is no longer true. Public repositories such as GNPS now exceed one billion spectra, and the emerging paradigm of reverse metabolomics (placing experimental spectra into the context of all existing public data to drive annotation and discovery) demands search at a scale where linear sequential comparison is no longer viable. We introduce HNSW-MS, which implements Hierarchical Navigable Small World graph indexing natively for mass spectral similarity, operating directly on raw GC-MS and LC-MS/MS spectra without preprocessing or embedding, thus ensuring maximum reproducibility. Validated on 8.4 million MS/MS spectra, HNSW-MS achieves up to 560-fold acceleration over linear scan while maintaining top-1 recall above 90%, with perfect recall achievable at moderate parameter settings. This acceleration removes the search bottleneck at repository scale, enabling near real-time spectral querying against the entirety of public metabolomics data.
bioinformatics 2026-06-08 v1
DipSkmer: Reference-free population genomics with diploid genome skims
Charvel, E.; Alves Monteiro, H. J.; Mirarab, S.; Bafna, V.
Abstract
Ecologists and conservation biologists rely on genetic diversity as a key essential biodiversity variable (EBV) used to track population health and dynamics, and utilize the population parameter θ (estimated by the average pairwise genomic distance) as a key metric of diversity. While whole-genome sequencing (WGS) is increasingly affordable, it will be a considerable time before the full diversity of life is represented by high-quality assembled genomes; even then, constant monitoring will still require repeated sampling of populations. In contrast, genome skimming (low-coverage, short-read WGS) is highly cost-effective but challenging to analyze because the coverage is too low for assembly and reliable error correction. Mature methods, such as Mash, exist for estimating pairwise genomic distances based on the Jaccard similarity of k-mer sets computed using sketching techniques. Some, such as Skmer, additionally model the impacts of low coverage. These methods have been successfully applied to assembly-free species identification and phylogenetics; however, their use in population genetics has been limited. This is because these methods implicitly treat genomes as haploid and heterozygosity confounds true estimates of genomic distance for diploid organisms. In this paper, we address this problem through a number of technical advances. First, we use coalescent theory to mathematically derive how the Jaccard index between two diploid samples changes with the scaled population size parameter (θ). Next, we derive an estimator that computes θ from the Jaccard index, in addition to several auxiliary variables, which we also estimate from the genome skims. The resulting method, DipSkmer, enables more accurate estimates of coverage, sequencing error, and pairwise nucleotide distance for diploid samples.
Analyses of both simulated and empirical datasets show that for diploids and low distances (e.g., <2%), DipSkmer produces the most accurate pairwise distance estimates, outperforming existing alignment-free methods such as Mash and Skmer, and closely approximates ANGSD, a reference and alignment-based tool.
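To make the baseline concrete, here is a minimal sketch of the haploid, Mash-style conversion from a k-mer Jaccard index to a genomic distance, D = -(1/k) ln(2J / (1 + J)). It is not DipSkmer's diploid estimator, which the abstract does not spell out; it uses exact k-mer sets instead of sketches, ignores coverage and sequencing error, and all function names are illustrative.

```python
import math


def kmers(seq: str, k: int) -> set:
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard index |A & B| / |A | B| (defined as 1.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mash_distance(j: float, k: int) -> float:
    """Mash-style distance from a Jaccard index; assumes haploid genomes."""
    if j <= 0.0:
        return 1.0  # no shared k-mers: distance is capped at 1
    return -math.log(2.0 * j / (1.0 + j)) / k


if __name__ == "__main__":
    # Toy sequences differing by a single substitution.
    s1 = "ACGTACGGTCAGTTAGCCATGACGATTACA"
    s2 = "ACGTACGGTCAGTAAGCCATGACGATTACA"
    k = 5
    j = jaccard(kmers(s1, k), kmers(s2, k))
    print(f"J = {j:.3f}, D = {mash_distance(j, k):.4f}")
```

For a diploid sample the k-mer set mixes both haplotypes, so heterozygous sites lower J even between closely related individuals. This is the confounding effect the abstract says DipSkmer is designed to model.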
bioinformatics 2026-06-08 v1
A Web-based Software Resource for Interactive Analysis of Multiplex Tissue Imaging Datasets
Creason, A. L.; Watson, C.; Gu, Q.; Persson, D.; Sargent, L. L.; Chen, Y.-A.; Lin, J.-R.; Sivagnanam, S.; Wünnemann, F.; Nirmal, A. J.; Chin, K.; Feiler, H. S.; Holly, H.; Coussens, L. M.; Schapiro, D.; Grüning, B. A.; Sorger, P. K.; Sokolov, A.; Goecks, J.
Abstract
Highly multiplexed tissue imaging (MTI) technologies are powerful spatial proteomics tools that enable in situ single-cell characterization of tissues. However, analysis and visualization of MTI datasets remain challenging, so we developed the Galaxy-ME software hub to address this challenge. Galaxy-ME is a web-based, interactive software hub that enables end-to-end analysis and visualization of MTI datasets and is accessible to everyone. To demonstrate its utility, Galaxy-ME was used to analyze datasets obtained from multiple MTI assays in both normal and cancerous tissues. Galaxy-ME is a publicly available web resource.
bioinformatics 2026-06-07 v3
Metadata Collector: An Open-Source Platform for Standardized Metadata Management in Multi Centre Sequencing Projects
Liguori, R.; Ferrazzi, F.
Abstract
Background: Next-generation sequencing (NGS) projects generate increasingly complex metadata that are critical for reproducibility, interoperability, and compliance with FAIR principles. Nevertheless, metadata curation in multi-institutional settings often still relies on spreadsheets, manual data entry and curation, as well as non-standardized terminology. These practices frequently result in incomplete or inconsistent annotations, hinder metadata sharing, and delay submission to public repositories. Results: We developed Metadata Collector as a React/API/PostgreSQL web platform and deployed it on a Kubernetes cluster within a large German research consortium. The platform implements a flexible, machine-readable metadata model for experimental data and integrates customizable templates, controlled vocabularies designed to support future ontology integration, and a complete event-based versioning model. Since deployment, Metadata Collector has been used across 32 projects involving RNA-seq, scRNA-seq, ATAC-seq and multiomics datasets, representing over 700 annotated samples contributed by multiple consortium partners. The platform is designed for use by non-computational researchers as well as centralized facilities and can be integrated into existing research data management infrastructures. Conclusions: Metadata Collector embeds standardization early in the metadata lifecycle, ensuring consistent, FAIR-aligned, and reproducible metadata across distributed research groups. Its modular, open-source architecture supports both local and consortium-scale deployments and provides a foundation for future extensions, including multi-omics support and integration with laboratory information management systems and automated submission pipelines.
bioinformatics 2026-06-07 v2
An Agentic Platform for Drug Repurposing Unified across Molecular, Phenotypic, and Clinical Scales
Wang, C.; El Moussaoui, M.; Zhang, D.; Prabhakaraalva, P.; Merzliakov, S.; Lu, R. J.-H.; Zaman, N.; Chakraborty, G.; Huang, K.-l.
Abstract
Drug repurposing offers an effective path to new therapies, yet existing computational approaches rely on a single line of evidence and are rarely validated across biological scales. We present LinkD, an integrated framework that unifies diffusion-based affinity prediction, proteome-wide selectivity scoring, phenotypic validation, and population-scale clinical evidence. LinkD-Bind predicts binding across 14,981 drugs and 20,385 human targets, ranking first in 8 of 9 BindingDB, Davis, and KIBA evaluations, with the largest gains under cold-start conditions. LinkD-Select recovers 95.3% of known drug-target pairs by combining selectivity scoring and molecular docking. LinkD-Pheno integrates drug-sensitivity and CRISPR dependency data across 960 cancer cell lines, identifying 34 novel drug-gene pairs and recovering ~85% of known targets among the top 50 candidates. Across 11.5 million individuals from Mount Sinai and UK Biobank, LinkD-prioritized β-blockers propranolol (HR 0.82) and carvedilol (HR 0.92) reduced 5-year prostate cancer incidence relative to metoprolol, corroborated by ADRB2 docking and LNCaP growth inhibition. LinkD-Agent, which can effectively orchestrate all evidence layers, is served on a publicly available web platform (https://linkd-agent.onrender.com/), enabling a wide range of users to derive new drug repurposing opportunities through natural language queries.
bioinformatics 2026-06-07 v2
Germline regulation of tumor evolutionary dynamics shapes multiple myeloma progression
Chen, H.; Shu, J.; Mudappathi, R.; Wang, P.; Bergsagel, L.; Yang, P.; Sun, Z.; Shi, C.; Liu, L.
Abstract
Germline variation shapes cancer risk, yet its influence on the evolutionary dynamics of established tumors remains poorly understood. In multiple myeloma, subclonal diversification drives disease progression and treatment failure, but the heritable factors that modulate this process are unknown. Here, we show that germline variation is associated with tumor evolutionary features, implicating inherited regulation in subclonal expansion. Integrating germline variation with tumor evolutionary parameters identifies variants associated with evolutionary features, with signals enriched in regulatory regions, consistent with a transcriptional basis. We further identify TBKBP1 as a key locus linking germline variation to tumor evolution and clinical outcome. Germline variation at this locus is associated with TBKBP1 expression and subclonal expansion, and TBKBP1 expression correlates with adverse prognosis, consistent across independent cohorts. Functional analyses demonstrate that TBKBP1 promotes proliferation and activates the MYC, mTORC1 and non-canonical NF-κB signaling pathways. Together, these findings establish germline regulatory variation as a determinant of tumor evolutionary dynamics and identify TBKBP1 as a mediator linking inherited variation to subclonal expansion and disease progression in multiple myeloma.
bioinformatics 2026-06-07 v1
Multi-level, multi-body atomic interaction graphs for machine learning-based prediction of protein-ligand binding energies
Le, T. T. H.; Nguyen, B. T.; Vo, H.; Nguyen, N. H.; Nguyen, D. D.
Abstract
Accurate prediction of binding affinity is crucial for rational drug design and discovery. Traditional computational methods often rely on complex scoring functions that incorporate a multitude of physical and chemical descriptors, leading to high computational demands and sometimes limited generalizability. In this work, we propose a novel scoring function that models multi-level, multi-body atomic interactions using graph-based representations. Our method constructs comprehensive interaction graphs that incorporate both pairwise and triplet-wise atomic features that help capture cooperative spatial patterns essential for binding affinity prediction. By employing a feature fusion strategy, GMI-Score maintains model simplicity while enhancing accuracy. Extensive evaluation across multiple datasets, such as PDBbind v2013, PDBbind v2016, PDBbind v2020, CSAR-NRC-HiQ, and PDBbind-Redocked, demonstrates that our model consistently outperforms state-of-the-art scoring functions, achieving Pearson correlation coefficients up to 0.877. Furthermore, it retains strong predictive power under strict data leakage controls and realistic docking conditions to highlight its robustness and generalizability.
bioinformatics 2026-06-07 v1
GLOF: A large-scale expert-curated benchmark dataset of gain-of-function and loss-of-function missense variants
Maricato, V.; Schlesinger, D.; de Souza Moura, P. N.
Abstract
Distinguishing loss-of-function (LOF) from gain-of-function (GOF) effects of missense variants is fundamental to understanding disease mechanisms and guiding therapeutic strategy, yet no large-scale, expert-curated benchmark has been publicly available for this task. Here we present GLOF (Gain and Loss Of Function), a dataset of 112,399 missense variants across 2,809 human genes, each classified as LOF, GOF, or neutral by board-certified clinical geneticists following ACMG guidelines. Pathogenic variants were sourced from ClinVar and annotated with their functional mechanism based on published functional studies, phenotype correlations, and established gene-disease relationships. Neutral variants were drawn from gnomAD v3.1 and validated against v4.1 using stringent population frequency filters. The dataset spans diverse protein families, includes 97 genes with bidirectional mechanisms (containing both LOF and GOF variants), and has been validated against well-characterized variants in the literature. GLOF is publicly available on Kaggle (https://www.kaggle.com/datasets/maricatovictor/loss-and-gain-of-function-variants) and Hugging Face (https://huggingface.co/datasets/victormaricato/glof), and provides a standardized resource for developing and benchmarking computational methods that predict variant functional mechanisms.
bioinformatics 2026-06-07 v1
Quantifying Evidence for Competing Biomedical Hypotheses using Large Language Models and Bayesian Analysis
Moore, B. M.; Freeman, J.; Millikin, R. J.; Mohanty, C.; George, K. S.; Bal, A.; Lock, C.; Sauer, J.-D.; Spurgeon, M. E.; Moore, D. L.; Travers, B. G.; Stewart, R.
Abstract
Science fundamentally depends on the generation and testing of hypotheses, many of them controversial. An explosion in scientific literature has made evaluating hypotheses even within a domain a problem of scale, and risks slowing an already extensive consensus-building process. While this challenge has prompted interest in automated hypothesis evaluation tools, existing methods have not yet proven effective for comparing hypotheses. Here, we introduce KM-GPT-DCH, an algorithm that combines co-occurrence methods with large language models (LLMs) to develop a transparent and reproducible literature-based algorithm to compare controversial hypotheses using a structured scoring approach with Bayesian methods to estimate confidence. When testing the algorithm on historical controversial hypotheses previously decided, KM-GPT-DCH chooses the correct hypothesis with high confidence several years before the scientific community or public do so. We further apply the algorithm to compare twenty unresolved controversial hypothesis pairs providing guidance for future research. The method can help researchers and the public to evaluate biomedical hypotheses such as "Is it more likely that monoamine deficiency or inflammation causes depression?" It can also be used to assess and visualize historical trends in the scientific literature. A web-based implementation of the algorithm is freely available at https://skim.morgridge.org.
bioinformatics2026-06-07v1KDM: embedding DNA/RNA motifs and sequences in a shared k-mer space for unified discovery, analysis and binding prediction
Fumagalli, L.; Becchi, T.; Cereda, M.; Pozzoli, U.Abstract
Motif discovery and binding-site prediction in DNA and RNA sequences are central tasks in regulatory genomics, yet the methodological landscape is split between interpretable but rigid position weight matrices (PWMs) and high-performing but opaque machine-learning models. We present KDM, a unifying framework in which both motifs and sequences are represented as probability distributions over a shared k-mer dictionary, embedded via the Hellinger transformation. This common geometry enables motif-sequence scoring, motif-motif comparison, de novo discovery, and binding prediction with a single primitive, the Bhattacharyya coefficient. We instantiate four tools on this representation: KDMMap for positional enrichment analysis, KDMMatch for information-content-aware motif matching, KDMFind for unsupervised motif discovery via projective non-negative matrix factorization, and KDM-LRLM for binding prediction with Lasso-regularized logistic regression. Across 1,324 transcription-factor ChIP-seq and 161 RBP eCLIP experiments, KDMMap matches CentriMo's motif rankings in 84% of TF and 79% of RBP experiments, and KDMMatch agrees with Tomtom on motif annotation in 74.5% of TFs. On binding prediction across four datasets covering 2,475 experiments, KDM-LRLM matches or exceeds eight deep-learning and three k-mer-based competitors. Notably, AI methods overtake k-mer methods only in the top quartile of training-set size, indicating that data scale, not architecture, drives the recent dominance of deep models. KDM provides a single interpretable representation across the full motif-analysis workflow.
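The shared geometry described above can be illustrated with a minimal sketch. It assumes sequences are turned into k-mer frequency distributions by simple overlapping counting; the function names (`kmer_distribution`, `hellinger_embed`, `bhattacharyya`) are hypothetical and not taken from KDM, and how KDM builds distributions for motifs is not shown here. The point is only that, after the Hellinger (square-root) transformation, the Bhattacharyya coefficient reduces to a dot product.

```python
from collections import Counter
from itertools import product
import math


def kmer_distribution(seq, k=3, alphabet="ACGT"):
    """Return k-mer frequencies of seq over a fixed, ordered k-mer dictionary."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Only k-mers made of alphabet letters contribute (e.g. windows with N are ignored).
    total = sum(counts[km] for km in kmers)
    if total == 0:
        raise ValueError("sequence contains no valid k-mers")
    return [counts[km] / total for km in kmers]


def hellinger_embed(p):
    """Hellinger transformation: element-wise square root of a distribution."""
    return [math.sqrt(x) for x in p]


def bhattacharyya(p, q):
    """Bhattacharyya coefficient, computed as a dot product of Hellinger embeddings."""
    return sum(a * b for a, b in zip(hellinger_embed(p), hellinger_embed(q)))
```

Under these assumptions the coefficient is 1 for identical distributions and 0 for distributions with no k-mers in common, so a single scoring function can compare any two objects that live in the same k-mer space.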
bioinformatics2026-06-07v1Polynomial Trajectory Compression for Protein Language Model Embeddings
Sahni, H.; Chen, X.; Estrada, T.Abstract
Protein language models (PLMs) generate rich, layer-wise embeddings that capture diverse biological information but are expensive in terms of storage and computation at scale. In this work, we propose a compact surrogate representation for PLM embeddings across transformer layers using low-dimensional PCA projections and cubic polynomial trajectories. This approach enables efficient storage and on-demand reconstruction of these protein-level embeddings at any layer without rerunning the PLM. We evaluate our method on two downstream tasks, protein-protein interaction and subcellular localization, using the ESM-35M and ESM-3B PLMs. We show that the surrogate embeddings achieve high reconstruction fidelity while reducing storage and computational requirements significantly. The new approach also retains downstream task prediction performance compared to original embeddings. Our approach provides a scalable and practical solution for large-scale protein embedding storage and reuse.
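One way to picture the idea is the sketch below. It is not the authors' implementation: it assumes a single protein with one embedding vector per layer, fits the PCA basis to that protein's own layers (the paper may fit it across proteins), and uses a normalized layer index as the polynomial variable. The names `compress` and `reconstruct` are hypothetical.

```python
import numpy as np


def compress(layer_embeddings, rank=3):
    """Compress an (n_layers, dim) embedding stack into PCA basis + cubic coefficients."""
    E = np.asarray(layer_embeddings, dtype=float)
    n_layers = E.shape[0]  # needs at least 4 layers for a cubic fit
    mean = E.mean(axis=0)
    # PCA via SVD of the centered layer-by-dimension matrix.
    _, _, vt = np.linalg.svd(E - mean, full_matrices=False)
    basis = vt[:rank]                      # (rank, dim)
    scores = (E - mean) @ basis.T          # (n_layers, rank) low-dimensional trajectory
    t = np.linspace(0.0, 1.0, n_layers)    # normalized layer index
    # Least-squares cubic fit of every PCA coordinate as a function of t.
    coeffs, *_ = np.linalg.lstsq(np.vander(t, 4), scores, rcond=None)
    return {"mean": mean, "basis": basis, "coeffs": coeffs, "n_layers": n_layers}


def reconstruct(model, layer):
    """Rebuild the embedding at a given layer index from the stored surrogate."""
    t = layer / (model["n_layers"] - 1)
    scores = np.vander([t], 4) @ model["coeffs"]   # (1, rank)
    return model["mean"] + (scores @ model["basis"])[0]
```

In this sketch the stored surrogate is a mean vector, a `rank x dim` basis, and a `4 x rank` coefficient matrix, instead of `n_layers x dim` values; any savings in practice would depend on how the basis is shared across proteins.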
bioinformatics2026-06-07v1Structure-guided compound prioritization strategy for virtual screening identifies putative binders for the nuclear receptor LRH-1
Chang-Gonzalez, A. C.; Campbell, A. N.; Bell, E. W.; Blind, R.; Meiler, J.Abstract
Compound ranking in structure-based virtual screening notoriously yields highly ranked false positive binders due to variable poses or biases in scoring terms. We developed a compound prioritization strategy that utilizes sampled docked poses from contrasting docking approaches (targeted physics-based docking and blind docking with a generative model) against multiple models of the target protein to train a multi-layer perceptron (MLP). The model predicts binders at the orthosteric ligand-binding pocket of the nuclear receptor LRH-1 (NR5A2). Our approach circumvents the reliance on a single docked pose for scoring compounds or individual scoring metrics for compound ranking. In a separate benchmarking set, we observed that the MLP identifies known binders that are chemically dissimilar from the compounds in the training set and is sensitive to single scaffold modifications, making it a potential tool for lead optimization. We applied our strategy to a prospective virtual screening campaign, which resulted in the discovery of four putative LRH-1 binders. We found that a combination of scoring and prediction metrics enriches for the hit compounds across library sizes. In all, this implementation presents a method to leverage structural and experimental data to aid virtual screening for a challenging protein target.
bioinformatics2026-06-07v1Learning quality scores for chromatin accessibility bigWig tracks using Machine Learning
Sanders, E.; Riva, S. G.; Hughes, J. R.Abstract
High-throughput chromatin accessibility assays such as bulk and single-cell ATAC-seq have generated large collections of processed signal tracks in bigWig format, which are widely used for visualisation, data integration, and Machine Learning (ML)-based analyses. Despite their central role, systematic quality control (QC) frameworks operating directly at the level of bigWig signal tracks remain underdeveloped. This gap limits the ability to assess data reliability and hampers robust downstream analyses. Here, we present a biologically grounded QC framework for chromatin accessibility bigWig files that integrates peak-level information, background noise estimation, and recovery of stable genomic reference features. Using an ML-based peak caller (LanceOtron), we derive complementary quality metrics capturing signal structure and signal-to-noise properties. We further define constant promoter and CTCF regions as internal biological controls and show that their recovery provides a sensitive measure of data quality across diverse cellular contexts. We apply this framework to a collection of 502 human chromatin accessibility bigWig tracks spanning a wide range of tissues and cell types. The proposed metrics capture related but non-redundant aspects of signal quality and motivate the use of constant promoter and CTCF recovery as biologically meaningful targets. An XGBoost model trained on LanceOtron-derived features accurately predicts recovery of these stable genomic elements on held-out data (R2 = 0.97), yielding a continuous and interpretable quality score. Feature importance analysis using SHAP values highlights that model decisions are driven by biologically relevant signal properties rather than arbitrary heuristics. Quantile-based stratification of the quality score is further supported by clear qualitative differences in genome browser visualisations. 
Together, this work provides a principled and extensible framework for assessing the quality of chromatin accessibility bigWig tracks, enabling more reliable data integration and supporting downstream ML applications in regulatory genomics.
bioinformatics2026-06-07v1CLASPP: A unified model for predicting post-translational modifications
Gravel, N.; Zhou, Z.; Fang, R.; Soleymani, S.; Kannan, N.Abstract
Post-Translational Modifications (PTMs) are a fundamental mechanism for regulating cellular pathways and increasing the functional diversity of the proteome. Accurately predicting the PTM types that are likely to occur at a given site in the primary sequence is a key challenge in functional proteomics. Existing PTM prediction models predominantly focus on either single PTM types or employ ensemble methods that combine multiple models to predict different PTM types. This fragmentation is largely driven by the vast imbalance in data availability across PTM types, making it difficult to predict multiple PTM types with a single model. To address this limitation, we present the Contrastively Learned Attention-based Stratified PTM Predictor (CLASPP), a unified PTM prediction model. CLASPP addresses imbalance challenges by leveraging unsupervised clustering-based undersampling and a novel contrastive learning framework tailored to PTM data. Additionally, our hierarchical data organization and curation are shown to improve CLASPP's performance by balancing the representation of individual PTM types, and they provide a standardized dataset to train and validate future model designs. Drawing inspiration from advancements in image and natural language processing, the CLASPP model employs a multi-stage training strategy and a high-quality, curated training dataset to improve PTM prediction performance. To uncover what is learned during the contrastive learning stage, we show that the CLASPP model distinguishes known protein kinase substrate specificity profiles, as a form of explainability. Finally, we evaluate the application of CLASPP in predicting PTMs in different model organisms and experimentally validated ubiquitination sites in the understudied DCLK3 kinase. 
Overall, CLASPP represents a unified model for PTM prediction that addresses key bottlenecks in data imbalance and offers new strategies for biological data curation, thereby improving PTM-type prediction performance across diverse organisms.
bioinformatics2026-06-07v1BacteReason: A Reasoning Model for Antimicrobial Resistance Prediction
Oikawa, Y.; Kawashima, S.; Kinjo, A. R.; Demizu, Y.; Tamura, R.; Tsuda, K.Abstract
The rapid global spread of antimicrobial resistance (AMR) has placed unprecedented pressure on clinical decision-making. Machine learning predictors of antibiotic susceptibility exist, but their lack of mechanistic grounding limits credibility. We present BacteReason, a reasoning large language model (LLM) that predicts bacterial susceptibility to a target antibiotic, together with a mechanistic rationale. BacteReason is obtained by fine-tuning an open-weight LLM on clinical susceptibility data augmented with rationales that explain the molecular mechanisms. These rationales are produced by a proprietary teacher LLM prompted to explain known susceptibility outcomes. The teacher is interfaced via TogoMCP with a collection of biomedical knowledge-graph databases, grounding each reasoning step in retrieved evidence. On an extrapolation benchmark, BacteReason achieves a relative improvement of 43% over the untuned baseline and 38% over the same base LLM fine-tuned without rationales, demonstrating that reasoning supervision improves prediction accuracy.
bioinformatics2026-06-07v1CREP: Cis-Regulatory Element Predictor Based on Fine-Tuned Enformer
Stranieri, N.; Riva, S. G.; Hughes, J. R.Abstract
A substantial fraction of disease-associated genetic variants reside in non-coding regions of the genome, where they act by perturbing cis-regulatory elements (CREs) such as enhancers, promoters, and insulators. While recent sequence-based deep learning models, such as Enformer, accurately predict continuous epigenomic signals from DNA sequence, they do not directly provide discrete and interpretable CRE annotations. Here, we present CREP (Cis-Regulatory Element Predictor), a fine-tuned version of Enformer trained to predict regulatory element identity from sequence using REgulamentary-derived annotations across multiple human cell-types. Through a controlled experimental framework, we show that incorporating diverse cell-types improves model performance. CREP leverages cell-type-specific training data to learn regulatory representations while producing a unified prediction of CRE identity from sequence. This is demonstrated by the Vanuatu SNP, a non-coding variant that creates a de novo erythroid regulatory element, which is correctly detected only when erythroid data are included during training. Error analysis further reveals that apparent misclassifications between enhancers and promoters reflect their shared regulatory architecture, supporting the view of CREs as a functional continuum rather than strictly discrete classes. Together, these results demonstrate that CREP enables interpretable prediction of regulatory element identity from sequence and provides a framework for the functional interpretation of non-coding genetic variation.
bioinformatics2026-06-07v1Fasting Status and Epigenetic Clock Stability: Implications for Aging Research
Seale, K. B.; Dwaraka, V. B.; Giosan, I.; Mendez, T.; Smith, R.Abstract
Background: Epigenetic clocks are DNA methylation-based biomarkers increasingly used in aging research and clinical trials. A recent assessment of 18 clocks across multiple short-term perturbations concluded that most demonstrate only moderate biological reliability, raising concerns about their translational utility. However, understanding biological variability requires understanding the construction of each clock: different clocks capture distinct biological properties that respond differently to specific perturbations, and pooling reliability metrics across heterogeneous populations and array platforms may obscure the mechanisms driving variability in each case. Methods: We evaluated 24 epigenetic clocks spanning five construction categories - first- and second-generation classical clocks (e.g., Horvath, Hannum, PhenoAge), the PC versions of the classical clocks, SystemsAge organ-system clocks, mortality-trained clocks (GrimAge, PCGrimAge, OMICmAge), pace of aging clocks (DunedinPACE) and the IntrinClock, across three datasets: a within-person paired fasting design (n = 15 pairs), a cross-sectional cohort of fasted vs non-fasted (n = 2,895), and EPICv2custom technical replicates (n = 96 samples from 4 individuals). For each clock, we quantified the acute fasting effect with and without immune cell adjustment, decomposed between-person and within-person variance at successive adjustment levels (Raw, EAA, IAA), and benchmarked biological variability against the technical measurement floor. Results: Fasting followed by acute refeeding was associated with group-level shifts of 0.5-3 years in immune-sensitive clocks, while within-person reliability remained high (Raw clock ICC median ~0.96). These observations are compatible because fasting effects are small relative to the age-driven between-person variance that dominates the ICC denominator. The magnitude of the observed shift varied by clock. 
PC transformations showed larger effects than their classical counterparts in the paired cohort (PC Hannum -2.03 vs. Hannum -1.37 years; PC PhenoAge > PhenoAge; PC Horvath > Horvath), SystemsAge showed the largest effects (1.15-2.9 years younger when fasted), and mortality-trained clocks (GrimAge V1/V2, OMICmAge) and DunedinPACE showed no detectable acute effect (all FDR p > 0.10). Immune cell adjustment attenuated or eliminated the fasting effects in sensitive clocks (PC Hannum 88% attenuation; SystemsAge Blood 99.7%); no clock retained a significant fasting effect after FDR-corrected immune adjustment in either cohort. Within the cross-sectional cohort, a clock's immune content, which is the fraction of its age-independent variance explained by immune cell composition, was correlated with the degree to which immune adjustment attenuated its fasting effect (r = 0.68, p = 0.003). IntrinClock, designed to exclude immune-variable CpGs, showed no fasting effect in either cohort (immune R-squared = 3.2%), serving as a negative control. Technical replicates confirmed near-perfect measurement reproducibility (median Raw ICC > 0.97), establishing that variance in fasting pairs reflects biology, not noise. Immune-adjusted ICCs behaved differently across clocks in ways consistent with their composition: for clocks where fasting generated within-person variance, immune adjustment removed it and ICC increased (SystemsAge EAA 0.768 to IAA 0.913); for clocks unaffected by fasting, immune adjustment removed between-person structure and ICC fell substantially (OMICmAge 0.922 to 0.160), reflecting the estimation cost of fitting many immune cell predictors to stable residuals. Cross-sectional replication (n = 2,895) confirmed immune cell redistribution at scale. Mortality clocks reached significance cross-sectionally despite resistance to acute fasting. 
Conclusions: Acute refeeding after an overnight fast elicits small shifts in some epigenetic clocks, which varied systematically by training category in our data. PC-based clocks, which concentrate correlated CpG variance including that associated with immune cell composition, showed the largest shifts; mortality-trained clocks showed no detectable acute effect. A reliability-only framework that reports ICC without also testing for systematic group-level effects can miss the kind of structured biological variation observed here under fasting. ICC is not a fixed property of a clock; it is shaped by the study design, the population heterogeneity, the perturbation, and the adjustment applied. We recommend that clock reliability be assessed on a perturbation-specific, clock-by-clock basis, with variance decomposition at each adjustment level and explicit benchmarking against technical replicates.
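The compatibility of a systematic group-level shift with a high ICC can be seen in a small worked example. The sketch below uses a one-way random-effects ICC, which is an assumption (the abstract does not state which ICC form was used), and the ages are toy values chosen for illustration, not study data.

```python
import numpy as np


def icc_oneway(measurements):
    """One-way random-effects ICC(1,1) for an (n_subjects, k_measurements) array."""
    x = np.asarray(measurements, dtype=float)
    n, k = x.shape
    subject_means = x.mean(axis=1)
    grand_mean = x.mean()
    # Between-subject and within-subject mean squares from a one-way ANOVA.
    ms_between = k * np.sum((subject_means - grand_mean) ** 2) / (n - 1)
    ms_within = np.sum((x - subject_means[:, None]) ** 2) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Toy clock estimates (years): wide age spread, uniform 2-year shift when fasted.
fed = np.array([30.0, 40.0, 50.0, 60.0, 70.0])
fasted = fed - 2.0
pairs = np.column_stack([fed, fasted])

mean_shift = float(np.mean(fasted - fed))  # systematic group-level effect
icc = icc_oneway(pairs)                    # stays close to 1
```

With these toy values the between-subject mean square is 500 and the within-subject mean square is 2, so the ICC is 498/502 (about 0.992) even though every subject shifts by 2 years. Reporting the ICC alone would therefore not reveal the shift.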
bioinformatics2026-06-07v1A Web-based software toolkit for accessible and best-practice machine learning analyses in biomedical research
Morais Lyra Junior, P. C.; Qiu, J.; Van Dang, K.; Pybus, A.; Narvaez-Bandera, I.; Singh, M. A.; Gu, Q.; Sargent, L.; Creason, A. L.; Goecks, J.Abstract
Machine learning is increasingly central to biomedical research, but using machine learning well often requires substantial computational expertise and methodological care to produce high-quality results. To make machine learning tools more accessible to biomedical researchers while supporting best-practice approaches, we developed the Galaxy Learning and Modeling (GLEAM) software toolkit. GLEAM enables researchers to perform supervised machine learning analyses through a set of web-based, code-free software tools for tabular, image, and multimodal biomedical datasets. GLEAM standardizes data partitioning, model selection, training, evaluation, and reporting, helping researchers apply machine learning with greater rigor and consistency. GLEAM runs on the Galaxy computational workbench and uses Galaxy's core features to make all analyses accessible, reproducible, and scalable. We validated GLEAM on three biomedical tasks: predicting patient response to immunotherapy, skin lesion classification, and cancer recurrence prediction. Across these tasks, GLEAM produced highly accurate predictive models and improved transparency, reproducibility, and rigor.
bioinformatics2026-06-07v1Multimodal physical evidence uncovers interpretable gene regulatory networks for perturbation prediction
Yang, Z.; Huang, S.; Bai, G.; Dong, J.; Wang, J.; Li, S. Z.Abstract
Gene regulatory networks govern cell fate transitions through dynamic causal mechanisms. Since exhaustively mapping this vast perturbation space experimentally is prohibitive, scalable computational models are essential. Yet, current frameworks fall short because they infer statistical co-expression rather than physical mechanisms, remain blind to non-canonical regulators lacking classical DNA-binding motifs, and fail to generalize across unseen perturbation factors or cell lines. Here we show that a multimodal biophysical framework, VitaGRN, overcomes these barriers by constructing a biophysical regulatory scaffold from multimodal evidence and propagating interactions to capture non-canonical regulators. By leveraging structurally aligned protein embeddings, VitaGRN predicts zero-shot perturbation responses and uncovers non-canonical translational control programs. Notably, VitaGRN demonstrates robust generalization across unseen factors, cell lines, and developmental transitions. Ultimately, VitaGRN generates a confidence-calibrated virtual perturbation atlas spanning over a thousand factors. This resource reframes gene regulatory networks from static correlation graphs into dynamically generalizable and mechanistically transparent models, streamlining wet-lab candidate prioritization.
bioinformatics2026-06-07v1Anthocyanin-associated cellular programs underlying terroir variation in Cabernet Sauvignon grape berry revealed by SEED-based deconvolution
Hu, X.; Tang, Y.; Deng, F.; Chen, Z.; Tang, G.; Yan, X.; Xia, Z.; Tong, H. H. Y.; Zhan, J.; Zou, X.; Hao, J.Abstract
Plant tissues consist of diverse cell populations that collectively contribute to development, metabolism, environmental responses, and phenotype formation. Although single-cell and single-nucleus RNA sequencing have greatly advanced the study of plant cellular heterogeneity, their application to large sample cohorts remains limited by cost, technical complexity, tissue dissociation constraints, and throughput. In contrast, bulk RNA-seq datasets have accumulated extensively across plant species, tissues, developmental stages, and environmental conditions, yet the cell-type-level information embedded in these datasets remains difficult to resolve because plant-oriented deconvolution frameworks are still lacking. Existing deconvolution methods have largely been developed in mammalian systems and have not been systematically optimized for plant transcriptomic features, leaving their applicability under plant-specific constraints unclear. Here, we present SEED, an adaptive deconvolution framework optimized for plant transcriptomic data. SEED integrates candidate reference-template construction with seven deconvolution strategies and automatically identifies an optimal combination for a given dataset. In grapevine simulated benchmarking, SEED showed its clearest advantage under low-replication conditions and remained broadly competitive, rather than uniformly dominant, when larger pseudo-bulk sample sizes were evaluated. SEED further performed robustly in public Arabidopsis thaliana and Nicotiana tabacum datasets. Finally, we applied SEED to bulk RNA-seq data generated in this study from Vitis vinifera cv. Cabernet Sauvignon berries collected from Yinchuan and Yantai, identifying terroir-associated cell subtypes and coordinated cell-type interaction patterns. 
Together, these results establish SEED as a practical framework for plant transcriptome deconvolution and provide a new tool for dissecting cellular heterogeneity associated with environmental adaptation and phenotype formation in plants.
bioinformatics2026-06-07v1VelocityFM: Short-Horizon Protein Trajectory Prediction via Flow Matching in Velocity Space
Jayathilake, L.; Wijesinghe, C. R.; Weerasinghe, R.Abstract
Protein dynamics is fundamentally a trajectory prediction problem, but molecular dynamics (MD) simulation remains expensive and static structure predictors do not model time-ordered motion. We present VelocityFM, a short-horizon protein trajectory predictor that applies rectified flow matching in velocity space over residue frames and torsions. The model combines six Invariant Point Attention (IPA) blocks with a two-layer per-residue temporal self-attention encoder, and is trained on 710 ATLAS proteins comprising 2090 filtered replicate trajectories. At the primary 128-frame rollout horizon, VelocityFM achieves a median TM-score of 0.929 on 72 held-out proteins, with 100% of proteins remaining above TM > 0.7 and 100% clash-free generation. Backbone geometry also remains strong, with a median Ramachandran favoured rate of 91.09%, while dynamics calibration is conservative with median RMSF ratio 0.697. These results show that velocity-space geometric learning can generalise short-horizon trajectory prediction to unseen proteins while preserving fold structure and geometric validity within its intended operating regime.
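For readers unfamiliar with rectified flow matching, the generic training objective can be sketched in plain Euclidean space. This is not VelocityFM's formulation over residue frames and torsions: the choice of source sample `x0` and target sample `x1`, and the names `flow_matching_batch`, `flow_matching_loss`, and `euler_rollout`, are assumptions made for illustration only.

```python
import numpy as np


def flow_matching_batch(x0, x1, rng):
    """Sample points on straight lines from x0 to x1 and their target velocities."""
    t = rng.uniform(size=(x0.shape[0], 1))   # one time in [0, 1) per example
    x_t = (1.0 - t) * x0 + t * x1            # linear interpolation path
    target = x1 - x0                         # constant velocity along that path
    return x_t, t, target


def flow_matching_loss(velocity_fn, x0, x1, rng):
    """Mean squared error between predicted and target velocities."""
    x_t, t, target = flow_matching_batch(x0, x1, rng)
    return float(np.mean((velocity_fn(x_t, t) - target) ** 2))


def euler_rollout(velocity_fn, x0, n_steps=8):
    """Integrate the learned velocity field from t=0 to t=1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = np.full((x.shape[0], 1), i * dt)
        x = x + dt * velocity_fn(x, t)
    return x
```

A model that predicts the straight-line velocity exactly has zero loss, and integrating that field carries `x0` to `x1`; a trained network would only approximate this.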
bioinformatics2026-06-07v1Metadata Collector: An Open-Source Platform for Standardized Metadata Management in Multi Centre Sequencing Projects
Liguori, R.; Ferrazzi, F.Abstract
Background: Next-generation sequencing (NGS) projects generate increasingly complex metadata that are critical for reproducibility, interoperability, and compliance with FAIR principles. Nevertheless, metadata curation in multi-institutional settings often still relies on spreadsheets, manual data entry and curation, as well as non-standardized terminology. These practices frequently result in incomplete or inconsistent annotations, hinder metadata sharing, and delay submission to public repositories. Results: We developed Metadata Collector as a React/API/PostgreSQL web platform and deployed it on a Kubernetes cluster within a large German research consortium. The platform implements a flexible, machine-readable metadata model for experimental data and integrates customizable templates, controlled vocabularies designed to support future ontology integration, and a complete event-based versioning model. Since deployment, Metadata Collector has been used across 32 projects involving RNA-seq, scRNA-seq, ATAC-seq and multiomics datasets, representing over 700 annotated samples contributed by multiple consortium partners. The platform is designed for use by non-computational researchers as well as centralized facilities and can be integrated into existing research data management infrastructures. Conclusions: Metadata Collector embeds standardization early in the metadata lifecycle, ensuring consistent, FAIR-aligned, and reproducible metadata across distributed research groups. Its modular, open-source architecture supports both local and consortium-scale deployments and provides a foundation for future extensions, including multi-omics support and integration with laboratory information management systems and automated submission pipelines.
bioinformatics2026-06-07v1CytoGem-XAI: A Hypergraph Neural Network Framework for Genome-Scale Metabolic Modeling and Interpretable Analysis
Chen, S.; Chen, T.; Xu, Z.; Zhang, L.; Gao, B.; Mao, J.Abstract
Genome-scale metabolic models are essential for understanding cellular metabolism, yet existing deep learning approaches remain black boxes, and traditional flux balance analysis (FBA) cannot provide sample-specific predictions. To our knowledge, CytoGem-XAI is the first framework to combine hypergraph neural network representation with interpretable, FBA-parallel analysis and sample-specific metabolic characterization. Built upon hypergraph representations where reactions are encoded as hyperedges connecting their participating metabolites, CytoGem-XAI introduces three analysis modules: perturbation-based carbon source importance ranking, hard intervention reaction bottleneck identification, and pathway-level topological attribution. Beyond prediction, CytoGem-XAI uniquely enables condition-dependent carbon source essentiality and reaction bottlenecks that vary with genetic background - capabilities absent from both traditional FBA and existing deep learning methods. Trained on 17,400 E. coli growth conditions using 10-fold cross-validation, our framework achieves R^2 = 0.862, substantially outperforming AMN (R^2 = 0.81, +6.4%), FBA (R^2 = 0.62, +39%), and gradient boosting baselines (R^2 = 0.71, +21%). Biological validation confirms that CytoGem-XAI identifies known essential carbon sources (e.g., alanine, malate) and rate-limiting enzymes (e.g., TCA cycle), while also revealing N-acetylmuramate - a peptidoglycan precursor - as a previously underappreciated essential nutrient.
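The hypergraph encoding mentioned above, with reactions as hyperedges over metabolite nodes, can be sketched as an incidence matrix. The three reactions and metabolite names below are placeholders invented for illustration, and `build_incidence` is a hypothetical helper, not part of CytoGem-XAI.

```python
import numpy as np

# Toy network: each reaction maps metabolites to stoichiometric coefficients
# (negative = consumed, positive = produced). Placeholder names only.
reactions = {
    "R1": {"A": -1, "B": -1, "C": 1},
    "R2": {"C": -1, "D": 1, "E": 1},
    "R3": {"D": -1, "A": 1},
}


def build_incidence(reactions):
    """Return metabolite ids, reaction ids, signed matrix S, and binary incidence H."""
    metabolites = sorted({m for stoich in reactions.values() for m in stoich})
    rxn_ids = list(reactions)
    row = {m: i for i, m in enumerate(metabolites)}
    S = np.zeros((len(metabolites), len(rxn_ids)))
    for j, rxn in enumerate(rxn_ids):
        for met, coef in reactions[rxn].items():
            S[row[met], j] = coef
    # A hyperedge (column) contains every metabolite that takes part in the reaction.
    H = (S != 0).astype(int)
    return metabolites, rxn_ids, S, H


metabolites, rxn_ids, S, H = build_incidence(reactions)
hyperedge_sizes = H.sum(axis=0)   # metabolites per reaction
node_degrees = H.sum(axis=1)      # reactions per metabolite
```

Unlike an ordinary graph edge, one column of `H` can link any number of metabolites, which is why a single reaction with several substrates and products is kept as one unit rather than being split into pairwise edges.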
bioinformatics2026-06-07v1Single-cell gene regulatory network reconstruction and key regulator identification using a dual-channel fusion graph convolutional network
Tang, R.; Liu, J.; Zhang, P.; Liang, X.Abstract
Background and objective: Gene regulatory networks are formed by complex regulatory relationships between transcription factors and their target genes. A systematic understanding of these regulatory relationships is crucial for deciphering the molecular mechanisms that underlie cell state transitions under physiological and pathological conditions. Single-cell expression data can reveal cell-type-specific transcriptional regulation, and computational methods have recently been developed to infer gene regulatory networks from single-cell transcriptomics and prior regulatory knowledge. However, existing methods could not explore the common and specific information in expression correlations and prior regulatory knowledge, which can adversely affect prediction performance. Methods: We propose a novel method for inferring gene regulatory networks from single-cell RNA sequencing data. The proposed method consists of dual-channel graph neural networks and a weight-shared common graph neural network, enabling effective fusion of prior regulatory knowledge with gene co-expression patterns. Furthermore, we formulate a new computational framework built upon the proposed algorithm, which integrates differential gene expression profiles and regulatory changes to identify key regulators that distinguish different cell states. Results: Experimental results demonstrate that our method significantly improves the accuracy of regulatory inference across multiple datasets, outperforming other state-of-the-art approaches. Our method also exhibits robustness to noise and missing data. Analysis of two single-cell expression datasets suggests that the proposed framework could help identify key regulators involved in tumor metastasis and drug resistance. 
Conclusion: These results indicate that the proposed method could advance the understanding of the biological mechanisms underlying diseases by reconstructing single-cell gene regulatory networks and identifying key regulators across different cell states.
bioinformatics2026-06-07v1CiliAI: Automated segmentation and compartment specific fluorescence quantification of primary cilia in confocal microscopy images
Karapetian, E.; Gerhardt, C.; Nazif, E.; Pfirrmann, T.Abstract
Primary cilia regulate essential signalling pathways controlling cell proliferation, differentiation, and tissue homeostasis. Quantitative analysis of ciliary morphology and compartment-specific protein localization by confocal microscopy is labor-intensive, user-dependent, and difficult to scale, particularly for multiplexed 3D image datasets. Here, we present CiliAI, a web-based deep-learning workflow for automated detection, substructure segmentation, and quantitative analysis of primary cilia in confocal microscopy images. CiliAI identifies ciliary substructures including the basal body, transition zone, and axoneme from multiplexed 3D image stacks and performs automated measurements of cilium length and compartment-specific fluorescence intensity. In NIH-3T3 cells, automated cilium length measurements showed close agreement with manual quantification and no statistically significant difference between methods (mean difference -0.214 µm, p = 0.213). Automated fluorescence analysis reproduced previously reported reductions in transition zone-associated Cep290 signal intensity in Rpgrip1l-deficient cells and identified the absence of significant Rpgrip1l accumulation changes in Rmnd5a-deficient cells. Automated processing reduced analysis time from days of manual quantification to minutes. Together, these findings establish CiliAI as an automated framework for quantitative analysis of ciliary morphology and compartment-specific protein abundance in confocal microscopy datasets.
bioinformatics2026-06-07v1Cross Dataset Transcriptomic Analysis Identifies Oxidative Stress Inflammation Gene Networks Modulated by Nutrigenomic Interventions in Parkinson Disease
Rafiee, M.; Abaj, F.; Ghiasvand, R.Abstract
Inflammation and oxidative stress (OS) are key to Parkinson's disease (PD). We performed a cross-dataset integrative transcriptomic analysis of publicly available datasets to identify OS- and inflammation-related hub genes consistently dysregulated in PD and to explore gene-compound relationships reported in nutrigenomic studies. Four GEO datasets (GSE7621, GSE20141, GSE20146, GSE49036) were analysed to identify differentially expressed genes (DEGs), which were intersected with GeneCards OS-inflammation gene sets. Functional enrichment analyses, including gene ontology (GO), pathway over-representation analysis (ORA), and protein-protein interaction (PPI) analysis, were used to identify key pathways and hub genes. Gene-food bioactive compound (FBC) associations were explored by integrating PD signatures with nutrigenomic profiles from NutriGenomeDB. We identified 183 DEGs in PD, enriched in synaptic, dopaminergic, OS, and inflammatory pathways. Intersection analysis yielded 26 OS-inflammation-related genes and 10 central regulators, including TH, DDC, SNCA, LRRK2, HSPB1, and HSPA1B. Integration with nutrigenomic datasets revealed opposing-direction transcriptional patterns, with several FBC-associated signatures showing lower expression of stress-related genes and higher expression of dopaminergic markers such as TH, GCH1, and DDC. Overall, this integrative analysis highlights OS-inflammation gene networks in PD and identifies candidate diet-gene associations that warrant further experimental and clinical validation.
bioinformatics 2026-06-06 v2
Quantifying the contribution of DNA conformational flexibility to transcription factor binding on nucleosomal DNA uncovers indirect readout across diverse TF families
Dey, U.; Martinez, G. S.; Kumar, R.; Yella, V. R.; Kumar, A.
Abstract
Background: Eukaryotic gene regulation depends on transcription factors (TFs) recognizing short DNA motifs within chromatin. Many of these motifs lie within nucleosomes, where DNA is sharply bent, rotationally phased, and constrained by histone-DNA contacts. Yet only a subset is occupied in any cellular context. Motif identity alone, therefore, cannot fully explain selective TF engagement with nucleosomal DNA. We asked whether sequence-derived DNA conformational flexibility provides an interpretable representation of sequence context relevant to TF recognition on nucleosomes. Results: We compiled five DNA flexibility descriptors in the Python package DNAflexpy, representing bendability, torsional deformation, backbone conformational variability, and stiffness. We built quantitative models of TF binding affinity across 226 datasets from a high-throughput in vitro TF-nucleosome binding assay. Flexibility-augmented models improved prediction over mononucleotide baselines in most datasets, with smaller but reproducible gains over trinucleotide baselines. The gains were not uniform: they varied across TF families and were concordant with DNA shape-fluctuation features, suggesting that DNAflexpy descriptors capture a sequence-encoded structural signal. In PIONEAR-seq data, model performance generalized across nucleosomal templates in a TF- and sequence-dependent manner. Beyond prediction, position-resolved flexibility footprints revealed deformation signatures at cognate motifs and flanking regions across diverse TF families. For SOX11, model-derived footprints aligned with DNA shape fluctuations from nanosecond-to-microsecond molecular dynamics trajectories of SOX11-bound nucleosomes, consistent with independently observed DNA conformational dynamics and bound-state stabilization. The in vivo data showed a similar but more context-dependent pattern. 
OCT4 occupancy tended to correlate with local flexibility, whereas GATA3-pioneered regions showed flexibility coupled with altered rotational positioning of cognate motifs. Flexibility-augmented classifiers further improved discrimination of occupied nucleosomal motifs across ENCODE datasets. Torsional flexibility features, particularly twist dispersion and trx, were most informative for classification. Conclusions: Sequence-derived DNA conformational flexibility provides a quantitative and interpretable representation of sequence context in TF recognition on nucleosomes. By augmenting the sequence with structural information, these models help quantify and interpret an indirect-readout contribution in which DNA deformation tendencies may complement motif sequence and DNA shape. This framework may help explain why only selected motif instances are engaged in chromatin, without treating flexibility as independent of primary sequence.
bioinformatics 2026-06-06 v2
HOPE: Interpretable Histology Analysis with Spatial Omics-Derived Signatures for Precision Oncology
Wang, T.; Bieniosek, M.; Krpicak, T. J.; Luan, M.; Ruf, B.; Schürch, C. M.; Mayer, A. T.; Luo, R.; Trevino, A. E.; Wu, Z.
Abstract
Hematoxylin and eosin (H&E) stained images are fundamental clinical tools for disease assessment. However, even with advanced computational models, their prognostic capabilities remain limited. Spatial omics characterizes tumor microenvironments (TME) in detail yet remains clinically inaccessible due to cost and complexity. In this study, we present HOPE, a lightweight framework that learns TME signatures from paired H&E and spatial omics data during training, then applies these to H&E alone at inference. Leveraging H&E foundation models, HOPE consistently outperforms identical architectures trained without spatial omics guidance across cancer types and cohorts. It further generates interpretable annotations of TME signature on H&E regions, stratifying patients into biologically coherent groups with different prognostic outcomes. HOPE establishes a practical route to translate high-content spatial omics discoveries into scalable, clinically deployable tools.
bioinformatics 2026-06-06 v1
Compositional and interpretable representation of histology using AI foundation models and sparse autoencoders
Zhao, Z.; Maliga, Z.; Ogbonna, E. C.; Talemi, S. R.; Coy, S.; Gagne, A.; Lumamba, K.; Solomon, I. H.; Santagata, S.; Steyn, A. J. C.; Naidoo, T.; Sorger, P. K.
Abstract
Light microscopy of tissue sections stained with hematoxylin and eosin (H&E) has been the foundation of histopathology for over 150 years and remains essential for diagnosis and research. The development of high-plex spatial profiling approaches able to measure protein and RNA expression at single-cell resolution augments but does not replace H&E imaging, even in research. Computational pathology (CPath) models based on deep learning promise to further increase the value of H&E imaging but interpreting these models in biological terms remains challenging. As a result, they are not widely used in spatial profiling studies. Here we describe a human-in-the-loop computational framework that leverages CPath foundation models (FMs) and sparse autoencoders (SAEs) to decompose FM embeddings and automatically identify diverse, human-interpretable histopathology features in H&E images. When FM-SAE modeling was applied to pulmonary diseases such as tuberculosis and lung cancer, human-machine interaction augmented and accelerated expert interpretation. Moreover, the resulting annotations provide a morphology-aware approach to integrating 2D and 3D mesoscale tissue architectures with molecular spatial profiling.
bioinformatics 2026-06-06 v1
TetraFuse: A Synergistic Four-Dimensional Dynamic Fusion Framework for Efficient and Robust Medical Image Classification
Gao, Y.; Li, J.; Xu, J.; Li, Q.; Li, Z.; Shi, Y.; ZHao, G.; Wu, X.; Zhang, Y.
Abstract
Accurate and robust classification of medical pathology images is pivotal for computer-aided diagnosis. However, the deployment of deep learning models in high-throughput clinical screening faces a fundamental challenge: the trade-off between diagnostic accuracy and computational efficiency. Current lightweight architectures, while reducing parameter complexity through grouped convolutions, often lead to cross-channel information isolation and diminished representational capacity. In this paper, we propose TetraFuse, a novel framework that systematically integrates features from four complementary domains: space, channel, statistics, and frequency. TetraFuse introduces a novel Cross-Channel Dynamic Aggregation (CCDA) paradigm that reconstructs global channel topology with negligible computational overhead, resolving the inter-group isolation issue. To balance perceptual fidelity and efficiency, we design a stage-aware local enhancement mechanism: Local Variance-Guided Enhancer (LVGE) is employed to filter out shallow-stage background noise, while High-Frequency Boundary Injection (HFBI) reinforces deep-stage pathological contours, preventing spatial over-smoothing. Experimental results on the COVID-19, ISIC 2018, and Kvasir datasets confirm that TetraFuse outperforms state-of-the-art (SOTA) methods. Notably, TetraFuse-Tiny achieves a transformative 91.53% reduction in FLOPs compared to ResNet50; on the Kvasir dataset, it achieved an accuracy of 0.926 and an AUC of 0.994 with only 0.345G FLOPs. By combining high representational power with minimal computational demand, TetraFuse offers a scalable solution for large-scale medical image analysis, especially in resource-constrained clinical environments.
bioinformatics 2026-06-06 v1
Revised Adaptive Immune Receptor Data in the Immune Epitope Database
Scheffer, L.; Richardson, E. M.; Vita, R.; Zarebski, L.; Blazeska, N.; Wheeler, D. K.; Cantrell, J. R.; Deleuran, S. N.; Lees, W. D.; Christley, S.; Corrie, B.; Cowell, L. G.; Sette, A.; Peters, B.
Abstract
The Immune Epitope Database (IEDB, iedb.org) is a freely available resource that catalogs experimentally defined immune epitopes and - if available - the immune receptors that recognize them. Currently, the IEDB records ~185,000 T cell receptors and ~5,000 B cell receptors/antibodies with experimentally verified epitope specificity. Because these receptor data were manually curated from ~3,300 references spanning decades, nomenclature inconsistencies present challenges for computational analyses and user queries. To support integrated analysis of the entire dataset, we revised the IEDB receptor data standardization and validation pipeline to flag and correct inaccuracies. Anomalous receptors from over 800 studies were flagged for re-curation. The updated receptor dataset shows greater conformity through consistent gene nomenclature formatting and harmonized CDR sequence delimitation. Taking advantage of the increased receptor data consistency, the IEDB web interface was expanded to include receptor search features directly on the homepage, support V/J gene and species options in the refined receptor search, and allow direct data export in the Adaptive Immune Receptor Repertoire (AIRR) format. We anticipate that the improved receptor data quality will simplify bioinformatics analyses, and facilitate integration of IEDB data into cross-repository data resources, such as the AIRR Knowledge Commons.
bioinformatics 2026-06-06 v1
Comparative Proteomics Across Tissues and Crop Agroecosystems Reveals Agricultural Stressor Responses in the Western Honey Bee
Zhong, H.; ZHONG, P.; Park, J.; Kozlova-Ryabova, A.; Moravcova, R.; Rogalski, J. C.; Jamieson, A.; Lansing, L.; Fang, W. W. T.; Moon, K.-M.; Yuan, X.; Ovinge, L. P.; Kearns, J. D.; Gregoris, A. S.; Higo, H.; Common, J.; Conflitti, I. M.; Pepinelli, M.; Tran, L.; Cunningham, M.; Jabbari, H.; Bukhari, S. A.; French, S. K.; Ho, J.; Deckers, T. B.; Zorz, J.; Polo, R. O.; Hoover, S. E.; Pernal, S. F.; Giovenazzo, P.; Currie, R. W.; Guarna, M. M.; Zayed, A.; Foster, L. J.
Abstract
Maintaining honey bee health in crop production systems is increasingly difficult because worker bees encounter multiple chemical and biological pressures from pesticides and pathogens. How these field-realistic pressures affect molecular physiology across functionally distinct tissues remains poorly understood. Here, we tested whether tissue-resolved proteomics could separate stable tissue-specific patterns from crop-associated molecular changes. To do this, we profiled abdomen, gut, and head proteomes from honey bees collected across four Canadian crop ecosystems over two consecutive years, and integrated these data with pesticide-residue and pathogen-load measurements. Proteomic variation was structured by both tissue identity and crop environment. Tissue-specific proteomic profiles were characterized across samples, whereas crop-associated effects were detected in both years and were stronger in 2021, the second year of the study. Tissue-specific enrichment and network analyses linked the abdomen to lipid catabolism and ubiquitin-proteasome proteostasis, the gut to central carbon metabolism, membrane transport, vesicle trafficking, and cytoskeletal organization, and the head to neurosensory and mitochondrial functions, together with amino-sugar metabolism and vesicle-associated quality-control modules. Among the measured pesticide residues, boscalid was the most reproducible chemical correlate of proteomic variation, with the strongest signal in the gut. Cross-year validation associated boscalid exposure with reduced abundance of gut proteins involved in mitochondrial metabolism, protein quality control, vesicle trafficking, nutrient transport, and biosynthetic pathways. Additionally, integrated proteome-transcriptome-microbiome factor analysis further identified gut-centered components associated with measured stressor variables and linked protein-level variation to coordinated transcriptomic and microbial shifts. 
Independent-year validation showed that compact crop-associated protein signatures detected in 2020 were also present in 2021. Together, these results show that honey bee tissues maintain stable proteomic identities while showing tissue- and year-specific responses to pesticide and pathogen pressures encountered in crop ecosystems. The gut proteome may specifically provide a sensitive molecular indicator of pesticide-associated perturbation under field conditions.
bioinformatics 2026-06-06 v1
samsampleX: Distribution-aware downsampling for benchmarking next-generation sequencing data
Demiriz, S.; Taliun, D.
Abstract
High-throughput next-generation sequencing (NGS) is essential for genetic variant discovery across diverse applications. As NGS technologies evolve, there is a growing need for benchmarking tools that support realistic data simulation and downsampling. Existing downsampling tools apply uniform sampling of sequencing reads, which inadequately models realistic coverage distributions, particularly in difficult-to-sequence regions and hybrid sequencing designs. Here we present samsampleX, a Python-based tool implementing a novel distribution-aware downsampling algorithm that dynamically adjusts read retention probabilities to emulate coverage profiles derived from real sequencing data. Using ultra-high-coverage reference datasets, samsampleX accurately reproduces coverage patterns observed in typical sequencing experiments, outperforming uniform downsampling methods at preserving depth variability across genomic regions such as the HLA locus and hybrid whole-exome/genome sequencing configurations. samsampleX extends current downsampling strategies by offering enhanced flexibility for specialized NGS benchmarking scenarios, facilitating improved assessment of sequencing data analysis methods.
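The idea of position-dependent read retention can be illustrated with a minimal sketch. This is not the samsampleX algorithm, whose details the abstract does not give; it assumes a simple rule in which each read is kept with probability equal to the ratio of target to source depth averaged over its span. The function names (`coverage`, `keep_probability`, `downsample`) and the toy data are hypothetical.

```python
import random


def coverage(reads, length):
    """Per-base depth for reads given as half-open (start, end) intervals."""
    depth = [0] * length
    for start, end in reads:
        for pos in range(start, end):
            depth[pos] += 1
    return depth


def mean_depth(profile, start, end):
    """Mean depth of a profile over the half-open interval [start, end)."""
    window = profile[start:end]
    return sum(window) / len(window)


def keep_probability(read, source_depth, target_depth):
    """Assumed rule: target/source mean depth over the read span, capped at 1."""
    start, end = read
    src = mean_depth(source_depth, start, end)
    if src == 0:
        return 0.0
    return min(1.0, mean_depth(target_depth, start, end) / src)


def downsample(reads, source_depth, target_depth, seed=0):
    """Keep each read independently with its position-specific probability."""
    rng = random.Random(seed)
    return [r for r in reads
            if rng.random() < keep_probability(r, source_depth, target_depth)]


# Toy data: 20 copies of a 50-bp read at every 10th start position.
reads = [(s, s + 50) for s in range(0, 150, 10) for _ in range(20)]
source = coverage(reads, 200)
# Hypothetical target profile: shallow first half, deeper second half.
target = [10] * 100 + [50] * 100
kept = downsample(reads, source, target)
print(len(reads), "reads in,", len(kept), "reads kept")
```

Uniform downsampling corresponds to the special case where `keep_probability` returns the same constant for every read, which is why it cannot reproduce region-specific depth variation.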
bioinformatics 2026-06-06 v1
An inflammatory gene set driven epigenetic clock tracks down disease progression and rejuvenation
Sandor, P.; Kerepesi, C.; Castro, J. P.
Abstract
Chronic, low-level inflammation, characterized by elevated pro-inflammatory programs, including epigenetic changes, in the absence of infection, is a major driver of aging and age-related diseases. On the other side of the spectrum, aging interventions work, at least in part, by decreasing inflammation. However, the molecular connection between epigenetic aging and inflammatory profiles in chronic diseases and rejuvenation has not been established yet. This study aimed to investigate the role of a newly described inflammatory signature gene set (ISig) in aging, previously associated with accelerated aging, in the progression of chronic diseases and rejuvenation. To achieve this, we developed inflammation-derived epigenetic aging clocks using ElasticNet regression models trained on CpG sites from ISig promoter regions. The newly developed inflammation aging clocks were validated on healthy samples and tested for their capacity to detect accelerated aging in diseased samples and rejuvenation during cellular reprogramming. The data demonstrate that the ISig inflammatory clocks accurately predict age, detect rejuvenation, and identify accelerated aging in disease contexts. Furthermore, we have demonstrated that it is possible to use a curated inflammatory gene-set with biological relevance to estimate biological age acceleration. We also developed a web application, the GeneClock Studio (available at https://ilab.sztaki.hu/geneclockstudio/), that allows researchers to apply the inflammatory aging clocks to their own DNA methylation datasets without requiring any programming expertise. Furthermore, the GeneClockStudio supports the training of new aging clocks based on an arbitrarily selected gene set in a similar way as in the case of the ISig inflammatory clocks.
bioinformatics 2026-06-06 v1
Multivariate integration of histological images and gene expression data: a comparative review
Ma, C.; Mao, J.; Le Cao, K.-A.
Abstract
Integrating histological images with gene expression data offers a promising approach for linking tissue morphologies to molecular signatures and improving disease subtyping. However, such integration remains challenging due to the high dimensionality of these datasets, cross-modal heterogeneity, and limited interpretability. Multivariate methods such as Sparse Canonical Correlation Analysis (Sparse CCA), Joint Nonnegative Matrix Factorisation (Joint NMF), and Angle-based Joint and Individual Variation Explained (AJIVE), have been used to address these challenges by reducing dimensionality while identifying features associated with latent factors, thereby enhancing biological interpretability. Despite increasing application in imaging-omics research, systematic comparisons of their methodological properties remain limited. Consequently, users often lack guidance on how to appropriately select these methods in practice, and these approaches are frequently treated as interchangeable despite differing modelling assumptions. Here, we use paired H&E images and gene expression data from breast cancer as a representative case study to examine the methodological characteristics, interpretability, and complementary properties of these integration approaches. Our results show that each method captures distinct yet complementary aspects of the underlying information. Although the biological findings are derived from the TCGA-BRCA datasets, the methodological insights identified here extend more broadly to imaging-omics integration studies. Overall, this comparative review highlights the strengths and limitations of each approach and outlines considerations for future methodological development.
bioinformatics 2026-06-06 v1
Mapping Chemical Diversity: Descriptor-Guided Clustering of Natural Products in the COCONUT Database
Shreyasree, G.; Dileep, A.; Namani, A.; Karunakar, P.
Abstract
Natural products represent a major source of bioactive compounds for drug discovery, yet their exploration remains challenging due to extensive structural complexity and scaffold diversity. Using the COCONUT database, we developed a cluster-oriented framework to systematically map and characterize the natural product chemical space through feature engineering, molecular clustering, and representative-based analysis. Descriptor selection identified a greedy maximum coverage strategy with a 0.35-0.85 correlation threshold range and 20 descriptors as the optimal feature set, enriched in physicochemical and graph-topological properties. Comparative evaluation of clustering approaches identified UMAP-HDBSCAN as the best-performing pipeline, generating 1,683 clusters with silhouette scores of 0.42 before and 0.24 after noise reassignment. Cluster profiling revealed a highly heterogeneous scaffold landscape, with 67.56% of clusters exhibiting low scaffold dominance and only 15.21% representing highly scaffold-dominated regions, supporting a chemical space composed largely of interconnected transitional clusters. Descriptor analyses showed that natural product clusters were generally enriched in saturated, low-aromaticity chemotypes with moderate lipophilicity and constrained molecular flexibility. Representative-based analyses demonstrated that central representatives (medoid and centroid-closest molecules) closely captured cluster-average properties, whereas diverse representatives better reflected structural breadth, findings further supported through descriptor-based and docking-based validation. Collectively, the results reinforce the natural product chemical space as a continuous yet structured manifold and provide a representative-guided framework for its efficient exploration in drug discovery applications. The complete data can be accessed at: https://github.com/shrek-28/DescriptorClusteringNPSpace
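For readers unfamiliar with the silhouette scores quoted above, the following minimal sketch computes the standard mean silhouette coefficient in pure Python. It assumes Euclidean distance and at least two clusters; the abstract does not state which distance or embedding space the authors used, and the function name and toy points are hypothetical.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette s = (b - a) / max(a, b); needs at least two clusters."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(point, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Two tight, well-separated pairs of points.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(points, [0, 0, 1, 1])  # labels match the geometry
bad = silhouette_score(points, [0, 1, 0, 1])   # labels cut across it
print(good, bad)
```

Values near 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values indicate points that sit closer to another cluster than to their own.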
bioinformatics 2026-06-06 v1
Ignet 2.0 and Vignet: An Ontology-Driven Web Platform for Biomedical Gene Interaction Discovery and Visualization
Asaduzzaman, S.; Bansal, B.; Combs, P.; Zhang, J.; Rehana, H.; McGregor, B.; He, Y.; Hur, J.
Abstract
Background: The expansion of biomedical literature demands systematic ontology-guided discovery of gene interactions, vaccine mechanisms, drug associations, and adverse events. Existing platforms such as STRING, DisGeNET, and PubTator fall short of providing a unified, freely accessible system that integrates ontology-based semantic interaction classification, vaccine-focused heterogeneous network construction, and Artificial Intelligence-assisted evidence retrieval. Results: Ignet 2.0 and Vignet are freely accessible dual-platform systems that combine PubMed literature mining with BioBERT-based interaction scoring for millions of gene-gene co-occurrence pairs, and that integrate three biomedical ontologies and one curated drug resource: the Interaction Network Ontology (INO), Vaccine Ontology (VO), Human Disease Ontology (HDO), and DrugBank. Ignet 2.0 supports gene interaction discovery, gene set enrichment retrieval of BioBERT-scored GenePair evidence, and AI-assisted summarization through BioSummarAI. Vignet extends these features with VO-guided Vaccine Exploration, VacPair interaction scoring, and the creation of vaccine, gene, drug, and disease networks in VacNet. A public Representational State Transfer Application Programming Interface (REST API) and Model Context Protocol (MCP) endpoint enable real-time integration, fostering trust in biomedical knowledge discovery. Conclusion: Ignet 2.0 and Vignet are scalable, ontology-guided biomedical knowledge platforms that facilitate evidence-based gene interaction analysis, vaccine-focused semantic exploration, and AI-assisted knowledge discovery. Their real-time PubMed data integration ensures up-to-date insights; however, users should consider validation processes and potential lags in incorporating the latest experimental data, which may affect the reliability of immediate data. Availability: Ignet 2.0: https://ignet.org/ignet; Vignet: https://ignet.org/vignet/
bioinformatics 2026-06-06 v1
Correcting for Global Synonymous Selection Improves the Accuracy of Episodic Positive Selection Inference
Verdonk, H. E.; Pivirotto, A.; Hey, J.; Kosakovsky Pond, S. L.
Abstract
The ratio of nonsynonymous to synonymous substitution rates (ω) constitutes a fundamental parameter for inferring adaptive protein evolution, predicated upon the assumption that synonymous substitutions are selectively inert. This premise, however, is increasingly untenable given evidence of selection acting on synonymous substitutions, driven by various biological processes such as translational efficiency and mRNA stability. In this study, we demonstrate that unmodelled synonymous selection introduces substantial bias into ω estimation, resulting in elevated false positive rates in tests for positive selection. To rectify this, we present BUSTED+S+MSS, a statistical framework incorporating Multiclass Synonymous Substitution (MSS) models into BUSTED, a method for detecting episodic selection. By partitioning synonymous codons into empirically derived rate classes, this approach accounts for global synonymous constraints. Application to five diverse clades - Drosophila, Caenorhabditis, Enterobacteria, Saccharomyces, and Primates - reveals that the inclusion of MSS components consistently improves model fit and reduces the proportion of genes inferred to be under positive selection. In Enterobacteria, genes retaining significance under the corrected model exhibit weaker constraint on synonymous substitutions (dSs), consistent with the hypothesis that unmodelled purifying selection drives spurious signals of adaptation. Furthermore, an information-theoretic analysis indicates that whilst site-specific variation (SRV) provides the primary correction, global synonymous rate variation (MSS) contributes a distinct second-order correction. In highly divergent alignments, these signals act in concert to improve model fit. 
The BUSTED+S+MSS framework, especially when coupled with an "error-sink" to absorb alignment artifacts, thus offers a computationally feasible means to disentangle adaptive nonsynonymous substitution from the confounding effects of synonymous constraint.
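For reference, the rate ratio discussed in this abstract is conventionally written as follows. The notation is the standard textbook form, with d_N and d_S denoting the nonsynonymous and synonymous substitution rates; it is not taken from the abstract itself.

```latex
\omega = \frac{d_N}{d_S}, \qquad
\begin{cases}
\omega < 1 & \text{purifying (negative) selection} \\
\omega = 1 & \text{neutral evolution} \\
\omega > 1 & \text{positive (diversifying) selection}
\end{cases}
```

Because d_S sits in the denominator, purifying selection on synonymous sites that the model ignores lowers the estimated d_S and pushes the estimated ratio upward, which is the source of the false positives the abstract describes.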
bioinformatics 2026-06-06 v1
PAG-Agent: a biologist-oriented research assistant for context-aware pathway-level analysis and interpretation
Nguyen, Q.-H.; Zhang, Z.; Le, D.-H.; Chen, J. Y.; Ku, W.-S.; Chen, H.; Yue, Z.
Abstract
Pathway analysis is a critical step for translating gene-level omics results into biological mechanisms, yet existing workflows often leave researchers with long lists of statistically significant pathways that are difficult to interpret, validate, and connect to experimental context. We developed PAG-Agent, a biologist-oriented virtual research assistant that integrates pathway-level statistical analysis, context-aware biological interpretation, literature-supported reasoning, and scientific writing support within a unified workflow. PAG-Agent supports bulk and single-cell transcriptomic data and enables users to perform data preprocessing, differential expression analysis, pathway analysis, pathway-level consensus analysis, and pathway-level meta-analysis through click-based and chat-based interactions. Unlike conventional pathway-analysis tools that analyze gene sets largely in isolation, PAG-Agent incorporates experimental conditions and research objectives to prioritize biologically relevant pathways and generate interpretable hypotheses. The system also provides gene and pathway annotation, citation retrieval, visualization, and writing refinement functions. In Alzheimer's disease case studies using three transcriptomic datasets, PAG-Agent consistently identified neurodegeneration-related pathways across multiple analysis methods and datasets. In citation-retrieval benchmarking, PAG-Agent outperformed six competing LLMs across five common literature-support scenarios, demonstrating improved ability to provide contextually relevant and valid references. Overall, PAG-Agent lowers technical barriers for pathway-level analysis and helps researchers move from transcriptomic data to biologically grounded interpretation, hypothesis generation, and scientific communication.
bioinformatics 2026-06-06 v1
STITCH: Spatial Transcriptomics Imputation via Flow Matching with Internal Learning
Wang, S.; Wang, X.; Peng, Q.; Li, T.
Abstract
Spatial transcriptomics datasets frequently suffer from spatial gaps and missing regions due to sectioning artifacts, tissue damage, and the high cost of sequencing that limits tissue coverage. We present STITCH, a scalable and robust generative framework for multidimensional virtual spatial transcriptomics reconstruction. STITCH models intrinsic spatial-transcriptomic patterns directly from individual tissue samples, enabling reconstruction without requiring external reference atlases or matched histological image priors. The framework adopts a decoupled architecture that separates spatial morphology restoration from transcriptomic generation. STITCH first compresses high-dimensional transcriptomic profiles into a low-dimensional latent representation through a spatial-aware graph autoencoder. For 3D cross-slice gaps, STITCH employs optimal transport-conditioned flow matching for spatial reconstruction, whereas 2D in-slice damage is repaired through an internal learning strategy. To generate the corresponding transcriptomic profiles, STITCH further establishes a point-wise conditional flow matching model in the latent space. This module achieves linear computational complexity, enabling continuous 3D atlas reconstruction of over 11 million cells within 5 hours on a single commodity GPU. Extensive evaluations across diverse spatial transcriptomics platforms, spanning both single-cell and spot-level technologies, demonstrate that STITCH consistently preserves transcriptomic identities, spatial topologies, and anatomical continuity. Overall, STITCH provides a scalable and platform-compatible computational framework for reconstructing high-resolution continuous spatial transcriptomic atlases.
bioinformatics 2026-06-06 v1