Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
A Web-based Software Resource for Interactive Analysis of Multiplex Tissue Imaging Datasets
Creason, A. L.; Watson, C.; Gu, Q.; Persson, D.; Sargent, L. L.; Chen, Y.-A.; Lin, J.-R.; Sivagnanam, S.; Wünnemann, F.; Nirmal, A. J.; Chin, K.; Feiler, H. S.; Holly, H.; Coussens, L. M.; Schapiro, D.; Grüning, B. A.; Sorger, P. K.; Sokolov, A.; Goecks, J.
Abstract
Highly multiplexed tissue imaging (MTI) methods are powerful spatial proteomics technologies that enable in situ single-cell characterization of tissues. However, analysis and visualization of MTI datasets remain challenging, and we developed the Galaxy-ME software hub to address this challenge. Galaxy-ME is a web-based, interactive software hub that enables end-to-end analysis and visualization of MTI datasets and is accessible to everyone. To demonstrate its utility, Galaxy-ME was used to analyze datasets obtained from multiple MTI assays in both normal and cancerous tissues. Galaxy-ME is a publicly available web resource.
bioinformatics, 2026-06-07, v3

Metadata Collector: An Open-Source Platform for Standardized Metadata Management in Multi Centre Sequencing Projects
Liguori, R.; Ferrazzi, F.
Abstract
Background: Next-generation sequencing (NGS) projects generate increasingly complex metadata that are critical for reproducibility, interoperability, and compliance with FAIR principles. Nevertheless, metadata curation in multi-institutional settings often still relies on spreadsheets, manual data entry and curation, as well as non-standardized terminology. These practices frequently result in incomplete or inconsistent annotations, hinder metadata sharing, and delay submission to public repositories. Results: We developed Metadata Collector as a React/API/PostgreSQL web platform and deployed it on a Kubernetes cluster within a large German research consortium. The platform implements a flexible, machine-readable metadata model for experimental data and integrates customizable templates, controlled vocabularies designed to support future ontology integration, and a complete event-based versioning model. Since deployment, Metadata Collector has been used across 32 projects involving RNA-seq, scRNA-seq, ATAC-seq and multiomics datasets, representing over 700 annotated samples contributed by multiple consortium partners. The platform is designed for use by non-computational researchers as well as centralized facilities and can be integrated into existing research data management infrastructures. Conclusions: Metadata Collector embeds standardization early in the metadata lifecycle, ensuring consistent, FAIR-aligned, and reproducible metadata across distributed research groups. Its modular, open-source architecture supports both local and consortium-scale deployments and provides a foundation for future extensions, including multi-omics support and integration with laboratory information management systems and automated submission pipelines.
bioinformatics, 2026-06-07, v2

An Agentic Platform for Drug Repurposing Unified across Molecular, Phenotypic, and Clinical Scales
Wang, C.; El Moussaoui, M.; Zhang, D.; Prabhakaraalva, P.; Merzliakov, S.; Lu, R. J.-H.; Zaman, N.; Chakraborty, G.; Huang, K.-l.
Abstract
Drug repurposing offers an effective path to new therapies, yet existing computational approaches rely on a single line of evidence and are rarely validated across biological scales. We present LinkD, an integrated framework that unifies diffusion-based affinity prediction, proteome-wide selectivity scoring, phenotypic validation, and population-scale clinical evidence. LinkD-Bind predicts binding across 14,981 drugs and 20,385 human targets, ranking first in 8 of 9 BindingDB, Davis, and KIBA evaluations, with the largest gains under cold-start conditions. LinkD-Select recovers 95.3% of known drug-target pairs by combining selectivity scoring and molecular docking. LinkD-Pheno integrates drug-sensitivity and CRISPR dependency data across 960 cancer cell lines, identifying 34 novel drug-gene pairs and recovering ~85% of known targets among the top 50 candidates. Across 11.5 million individuals from Mount Sinai and UK Biobank, LinkD-prioritized β-blockers propranolol (HR 0.82) and carvedilol (HR 0.92) reduced 5-year prostate cancer incidence relative to metoprolol, corroborated by ADRB2 docking and LNCaP growth inhibition. LinkD-Agent, which can effectively orchestrate all evidence layers, is served on a publicly available web platform (https://linkd-agent.onrender.com/), enabling a wide range of users to derive new drug repurposing opportunities through natural language queries.
bioinformatics, 2026-06-07, v2

BacteReason: A Reasoning Model for Antimicrobial Resistance Prediction
Oikawa, Y.; Kawashima, S.; Kinjo, A. R.; Demizu, Y.; Tamura, R.; Tsuda, K.
Abstract
The rapid global spread of antimicrobial resistance (AMR) has placed unprecedented pressure on clinical decision-making. Machine learning predictors of antibiotic susceptibility exist, but their lack of mechanistic grounding limits credibility. We present BacteReason, a reasoning large language model (LLM) that predicts bacterial susceptibility to a target antibiotic, together with a mechanistic rationale. BacteReason is obtained by fine-tuning an open-weight LLM on clinical susceptibility data augmented with rationales that explain the molecular mechanisms. These rationales are produced by a proprietary teacher LLM prompted to explain known susceptibility outcomes. The teacher is interfaced via TogoMCP with a collection of biomedical knowledge-graph databases, grounding each reasoning step in retrieved evidence. On an extrapolation benchmark, BacteReason achieves a relative improvement of 43% over the untuned baseline and 38% over the same base LLM fine-tuned without rationales, demonstrating that reasoning supervision improves prediction accuracy.
bioinformatics, 2026-06-07, v1

CREP: Cis-Regulatory Element Predictor Based on Fine-Tuned Enformer
Stranieri, N.; Riva, S. G.; Hughes, J. R.
Abstract
A substantial fraction of disease-associated genetic variants reside in non-coding regions of the genome, where they act by perturbing cis-regulatory elements (CREs) such as enhancers, promoters, and insulators. While recent sequence-based deep learning models, such as Enformer, accurately predict continuous epigenomic signals from DNA sequence, they do not directly provide discrete and interpretable CRE annotations. Here, we present CREP (Cis-Regulatory Element Predictor), a fine-tuned version of Enformer trained to predict regulatory element identity from sequence using REgulamentary-derived annotations across multiple human cell-types. Through a controlled experimental framework, we show that incorporating diverse cell-types improves model performance. CREP leverages cell-type-specific training data to learn regulatory representations while producing a unified prediction of CRE identity from sequence. This is demonstrated by the Vanuatu SNP, a non-coding variant that creates a de novo erythroid regulatory element, which is correctly detected only when erythroid data are included during training. Error analysis further reveals that apparent misclassifications between enhancers and promoters reflect their shared regulatory architecture, supporting the view of CREs as a functional continuum rather than strictly discrete classes. Together, these results demonstrate that CREP enables interpretable prediction of regulatory element identity from sequence and provides a framework for the functional interpretation of non-coding genetic variation.
bioinformatics, 2026-06-07, v1

Fasting Status and Epigenetic Clock Stability: Implications for Aging Research
Seale, K. B.; Dwaraka, V. B.; Giosan, I.; Mendez, T.; Smith, R.
Abstract
Background: Epigenetic clocks are DNA methylation-based biomarkers increasingly used in aging research and clinical trials. A recent assessment of 18 clocks across multiple short-term perturbations concluded that most demonstrate only moderate biological reliability, raising concerns about their translational utility. However, understanding biological variability requires understanding the construction of each clock: different clocks capture distinct biological properties that respond differently to specific perturbations, and pooling reliability metrics across heterogeneous populations and array platforms may obscure the mechanisms driving variability in each case. Methods: We evaluated 24 epigenetic clocks spanning five construction categories - first and second generation classical clocks (eg. Horvath, Hannum, PhenoAge), the PC versions of the classical clocks, SystemsAge organ-system clocks, mortality-trained clocks (GrimAge, PCGrimAge, OMICmAge), pace of aging clocks (DunedinPACE) and the IntrinClock, across three datasets: a within-person paired fasting design (n = 15 pairs), a cross-sectional cohort of fasted vs non-fasted (n = 2,895), and EPICv2custom technical replicates (n = 96 samples from 4 individuals). For each clock, we quantified the acute fasting effect with and without immune cell adjustment, decomposed between-person and within-person variance at successive adjustment levels (Raw, EAA, IAA), and benchmarked biological variability against the technical measurement floor. Results: Fasting followed by acute refeeding was associated with group-level shifts of 0.5-3 years in immune-sensitive clocks, while within-person reliability remained high (Raw clock ICC median ~0.96). These observations are compatible because fasting effects are small relative to the age-driven between-person variance that dominates the ICC denominator. The magnitude of the observed shift varied by clock. 
PC transformations showed larger effects than their classical counterparts in the paired cohort (PC Hannum -2.03 vs. Hannum -1.37 years; PC PhenoAge > PhenoAge; PC Horvath > Horvath), SystemsAge showed the largest effects (1.15-2.9 years younger when fasted), and mortality-trained clocks (GrimAge V1/V2, OMICmAge) and DunedinPACE showed no detectable acute effect (all FDR p > 0.10). Immune cell adjustment attenuated or eliminated the fasting effects in sensitive clocks (PC Hannum 88% attenuation; SystemsAge Blood 99.7%); no clock retained a significant fasting effect after FDR-corrected immune adjustment in either cohort. Within the cross-sectional cohort, a clock's immune content, which is the fraction of its age-independent variance explained by immune cell composition, was correlated with the degree to which immune adjustment attenuated its fasting effect (r = 0.68, p = 0.003). IntrinClock, designed to exclude immune-variable CpGs, showed no fasting effect in either cohort (immune R-squared = 3.2%), serving as a negative control. Technical replicates confirmed near-perfect measurement reproducibility (median Raw ICC > 0.97), establishing that variance in fasting pairs reflects biology, not noise. Immune-adjusted ICCs behaved differently across clocks in ways consistent with their composition: for clocks where fasting generated within-person variance, immune adjustment removed it and ICC increased (SystemsAge EAA 0.768 to IAA 0.913); for clocks unaffected by fasting, immune adjustment removed between-person structure and ICC fell substantially (OMICmAge 0.922 to 0.160), reflecting the estimation cost of fitting many immune cell predictors to stable residuals. Cross-sectional replication (n = 2,895) confirmed immune cell redistribution at scale. Mortality clocks reached significance cross-sectionally despite resistance to acute fasting. 
Conclusions: Acute refeeding after an overnight fast elicits small shifts in some epigenetic clocks, which varied systematically by training category in our data. PC-based clocks, which concentrate correlated CpG variance including that associated with immune cell composition, showed the largest shifts; mortality-trained clocks showed no detectable acute effect. A reliability-only framework that reports ICC without also testing for systematic group-level effects can miss the kind of structured biological variation observed here under fasting. ICC is not a fixed property of a clock; it is shaped by the study design, the population heterogeneity, the perturbation, and the adjustment applied. We recommend that clock reliability be assessed on a perturbation-specific, clock-by-clock basis, with variance decomposition at each adjustment level and explicit benchmarking against technical replicates.
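The point that a systematic fasting shift can coexist with a high ICC can be illustrated with a small numerical sketch. The abstract does not say which ICC form was used, so the one-way random-effects ICC(1,1) below is an assumption, and the ages and the 2-year shift are hypothetical values chosen only for illustration.

```python
import numpy as np

def icc_oneway(scores):
    """One-way random-effects ICC(1,1) for an (n_subjects, k_measurements) array."""
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    subject_means = scores.mean(axis=1)
    grand_mean = scores.mean()
    # Between-person mean square: driven by the spread of ages across people.
    ms_between = k * np.sum((subject_means - grand_mean) ** 2) / (n - 1)
    # Within-person mean square: driven by the fed/fasted difference.
    ms_within = np.sum((scores - subject_means[:, None]) ** 2) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Hypothetical data: 15 people with clock ages from 30 to 70 years,
# each measured once fed and once fasted with a uniform 2-year shift.
fed = np.linspace(30.0, 70.0, 15)
fasted = fed - 2.0
pairs = np.column_stack([fed, fasted])

mean_shift = float(np.mean(fasted - fed))  # systematic group-level effect
icc = icc_oneway(pairs)                    # within-person reliability

print(f"mean shift = {mean_shift:.2f} years, ICC = {icc:.3f}")
```

In this toy setup the group-level shift is clearly nonzero while the ICC stays close to 1, because the between-person age variance is much larger than the within-person variance produced by the shift.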
bioinformatics, 2026-06-07, v1

A Web-based software toolkit for accessible and best-practice machine learning analyses in biomedical research
Morais Lyra Junior, P. C.; Qiu, J.; Van Dang, K.; Pybus, A.; Narvaez-Bandera, I.; Singh, M. A.; Gu, Q.; Sargent, L.; Creason, A. L.; Goecks, J.
Abstract
Machine learning is increasingly central to biomedical research, but using machine learning well often requires substantial computational expertise and methodological care to produce high-quality results. To make machine learning tools more accessible to biomedical researchers while supporting best-practice approaches, we developed the Galaxy Learning and Modeling (GLEAM) software toolkit. GLEAM enables researchers to perform supervised machine learning analyses through a set of web-based, code-free software tools for tabular, image, and multimodal biomedical datasets. GLEAM standardizes data partitioning, model selection, training, evaluation, and reporting, helping researchers apply machine learning with greater rigor and consistency. GLEAM runs on the Galaxy computational workbench and uses Galaxy's core features to make all analyses accessible, reproducible, and scalable. We validated GLEAM on three biomedical tasks: predicting patient response to immunotherapy, skin lesion classification, and cancer recurrence prediction. Across these tasks, GLEAM produced highly accurate predictive models and improved transparency, reproducibility, and rigor.
bioinformatics, 2026-06-07, v1

Multimodal physical evidence uncovers interpretable gene regulatory networks for perturbation prediction
Yang, Z.; Huang, S.; Bai, G.; Dong, J.; Wang, J.; Li, S. Z.
Abstract
Gene regulatory networks govern cell fate transitions through dynamic causal mechanisms. Since exhaustively mapping this vast perturbation space experimentally is prohibitive, scalable computational models are essential. Yet, current frameworks fall short because they infer statistical co-expression rather than physical mechanisms, remain blind to non-canonical regulators lacking classical DNA-binding motifs, and fail to generalize across unseen perturbation factors or cell lines. Here we show that a multimodal biophysical framework, VitaGRN, overcomes these barriers by constructing a biophysical regulatory scaffold from multimodal evidence and propagating interactions to capture non-canonical regulators. By leveraging structurally aligned protein embeddings, VitaGRN predicts zero-shot perturbation responses and uncovers non-canonical translational control programs. Notably, VitaGRN demonstrates robust generalization across unseen factors, cell lines, and developmental transitions. Ultimately, VitaGRN generates a confidence-calibrated virtual perturbation atlas spanning over a thousand factors. This resource reframes gene regulatory networks from static correlation graphs into dynamically generalizable and mechanistically transparent models, streamlining wet-lab candidate prioritization.
bioinformatics, 2026-06-07, v1

Anthocyanin-associated cellular programs underlying terroir variation in Cabernet Sauvignon grape berry revealed by SEED-based deconvolution
Hu, X.; Tang, Y.; Deng, F.; Chen, Z.; Tang, G.; Yan, X.; Xia, Z.; Tong, H. H. Y.; Zhan, J.; Zou, X.; Hao, J.
Abstract
Plant tissues consist of diverse cell populations that collectively contribute to development, metabolism, environmental responses, and phenotype formation. Although single-cell and single-nucleus RNA sequencing have greatly advanced the study of plant cellular heterogeneity, their application to large sample cohorts remains limited by cost, technical complexity, tissue dissociation constraints, and throughput. In contrast, bulk RNA-seq datasets have accumulated extensively across plant species, tissues, developmental stages, and environmental conditions, yet the celltype-level information embedded in these datasets remains difficult to resolve because plant-oriented deconvolution frameworks are still lacking. Existing deconvolution methods have largely been developed in mammalian systems and have not been systematically optimized for plant transcriptomic features, leaving their applicability under plant-specific constraints unclear. Here, we present SEED, an adaptive deconvolution framework optimized for plant transcriptomic data. SEED integrates candidate reference-template construction with seven deconvolution strategies and automatically identifies an optimal combination for a given dataset. In grapevine simulated benchmarking, SEED showed its clearest advantage under low-replication conditions and remained broadly competitive, rather than uniformly dominant, when larger pseudo-bulk sample sizes were evaluated. SEED further performed robustly in public Arabidopsis thaliana and Nicotiana tabacum datasets. Finally, we applied SEED to bulk RNA-seq data generated in this study from Vitis vinifera cv. Cabernet Sauvignon berries collected from Yinchuan and Yantai, identifying terroir-associated cell subtypes and coordinated celltype interaction patterns. 
Together, these results establish SEED as a practical framework for plant transcriptome deconvolution and provide a new tool for dissecting cellular heterogeneity associated with environmental adaptation and phenotype formation in plants.
bioinformatics, 2026-06-07, v1

VelocityFM: Short-Horizon Protein Trajectory Prediction via Flow Matching in Velocity Space
Jayathilake, L.; Wijesinghe, C. R.; Weerasinghe, R.
Abstract
Protein dynamics is fundamentally a trajectory prediction problem, but molecular dynamics (MD) simulation remains expensive and static structure predictors do not model time-ordered motion. We present VelocityFM, a short-horizon protein trajectory predictor that applies rectified flow matching in velocity space over residue frames and torsions. The model combines six Invariant Point Attention (IPA) blocks with a two-layer per-residue temporal self-attention encoder, and is trained on 710 ATLAS proteins comprising 2090 filtered replicate trajectories. At the primary 128-frame rollout horizon, VelocityFM achieves a median TM-score of 0.929 on 72 held-out proteins, with 100% of proteins remaining above a TM-score of 0.7 and 100% clash-free generation. Backbone geometry also remains strong, with a median Ramachandran favoured rate of 91.09%, while dynamics calibration is conservative, with a median RMSF ratio of 0.697. These results show that velocity-space geometric learning can generalise short-horizon trajectory prediction to unseen proteins while preserving fold structure and geometric validity within its intended operating regime.
bioinformatics, 2026-06-07, v1

Structure-guided compound prioritization strategy for virtual screening identifies putative binders for the nuclear receptor LRH-1
Chang-Gonzalez, A. C.; Campbell, A. N.; Bell, E. W.; Blind, R.; Meiler, J.
Abstract
Compound ranking in structure-based virtual screening notoriously yields highly ranked false positive binders due to variable poses or biases in scoring terms. We developed a compound prioritization strategy that utilizes sampled docked poses from contrasting docking approaches (targeted physics-based docking and blind docking with a generative model) against multiple models of the target protein to train a multi-layer perceptron (MLP). The model predicts binders at the orthosteric ligand-binding pocket of the nuclear receptor LRH-1 (NR5A2). Our approach circumvents the reliance on a single docked pose for scoring compounds or individual scoring metrics for compound ranking. In a separate benchmarking set, we observed that the MLP identifies known binders that are chemically dissimilar from the compounds in the training set and is sensitive to single scaffold modifications, making it a potential tool for lead optimization. We applied our strategy to a prospective virtual screening campaign, which resulted in the discovery of four putative LRH-1 binders. We found that a combination of scoring and prediction metrics enriches for the hit compounds across library sizes. In all, this implementation presents a method to leverage structural and experimental data to aid virtual screening for a challenging protein target.
bioinformatics, 2026-06-07, v1

CytoGem-XAI: A Hypergraph Neural Network Framework for Genome-Scale Metabolic Modeling and Interpretable Analysis
Chen, S.; Chen, T.; Xu, Z.; Zhang, L.; Gao, B.; Mao, J.
Abstract
Genome-scale metabolic models are essential for understanding cellular metabolism, yet existing deep learning approaches remain black boxes, and traditional flux balance analysis (FBA) cannot provide sample-specific predictions. To our knowledge, CytoGem-XAI is the first framework to combine hypergraph neural network representation with interpretable, FBA-parallel analysis and sample-specific metabolic characterization. Built upon hypergraph representations where reactions are encoded as hyperedges connecting their participating metabolites, CytoGem-XAI introduces three analysis modules: perturbation-based carbon source importance ranking, hard intervention reaction bottleneck identification, and pathway-level topological attribution. Beyond prediction, CytoGem-XAI uniquely enables condition-dependent carbon source essentiality and reaction bottlenecks that vary with genetic background - capabilities absent from both traditional FBA and existing deep learning methods. Trained on 17,400 E. coli growth conditions using 10-fold cross-validation, our framework achieves R^2 = 0.862, substantially outperforming AMN (R^2 = 0.81, +6.4%), FBA (R^2 = 0.62, +39%), and gradient boosting baselines (R^2 = 0.71, +21%). Biological validation confirms that CytoGem-XAI identifies known essential carbon sources (e.g., alanine, malate) and rate-limiting enzymes (e.g., TCA cycle), while also revealing N-acetylmuramate - a peptidoglycan precursor - as a previously underappreciated essential nutrient.
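The hypergraph encoding described above, where each reaction is a hyperedge over its participating metabolites, can be sketched as a binary incidence matrix. This is only an illustrative toy built from three glycolysis steps; the variable names are hypothetical, and CytoGem-XAI's actual encoding (for example, whether it is signed or stoichiometric) is not specified in the abstract.

```python
import numpy as np

# Toy reaction set (first steps of glycolysis); each reaction is one hyperedge
# listing every metabolite that takes part in it, substrates and products alike.
reactions = {
    "hexokinase": ["glucose", "ATP", "G6P", "ADP"],
    "phosphoglucose_isomerase": ["G6P", "F6P"],
    "phosphofructokinase": ["F6P", "ATP", "F16BP", "ADP"],
}

metabolites = sorted({m for members in reactions.values() for m in members})
reaction_names = list(reactions)
row = {m: i for i, m in enumerate(metabolites)}

# Incidence matrix H: rows are metabolites (nodes), columns are reactions
# (hyperedges); H[i, j] = 1 when metabolite i participates in reaction j.
H = np.zeros((len(metabolites), len(reaction_names)), dtype=int)
for j, name in enumerate(reaction_names):
    for m in reactions[name]:
        H[row[m], j] = 1

node_degree = dict(zip(metabolites, H.sum(axis=1).tolist()))     # reactions per metabolite
edge_size = dict(zip(reaction_names, H.sum(axis=0).tolist()))    # metabolites per reaction

print(node_degree)
print(edge_size)
```

Unlike an ordinary graph edge, a single column here can connect four metabolites at once, which is what lets a hypergraph keep a multi-substrate reaction as one unit.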
bioinformatics, 2026-06-07, v1

Single-cell gene regulatory network reconstruction and key regulator identification using a dual-channel fusion graph convolutional network
Tang, R.; Liu, J.; Zhang, P.; Liang, X.
Abstract
Background and objective: Gene regulatory networks are formed by complex regulatory relationships between transcription factors and their target genes. A systematic understanding of these regulatory relationships is crucial for deciphering the molecular mechanisms that underlie cell state transitions under physiological and pathological conditions. Single-cell expression data can reveal cell-type-specific transcriptional regulation, and computational methods have recently been developed to infer gene regulatory networks from single-cell transcriptomics and prior regulatory knowledge. However, existing methods could not explore the common and specific information in expression correlations and prior regulatory knowledge, which can adversely affect prediction performance. Methods: We propose a novel method for inferring gene regulatory networks from single-cell RNA sequencing data. The proposed method consists of dual-channel graph neural networks and a weight-shared common graph neural network, enabling effective fusion of prior regulatory knowledge with gene co-expression patterns. Furthermore, we formulate a new computational framework built upon the proposed algorithm, which integrates differential gene expression profiles and regulatory changes to identify key regulators that distinguish different cell states. Results: Experimental results demonstrate that our method significantly improves the accuracy of regulatory inference across multiple datasets, outperforming other state-of-the-art approaches. Our method also exhibits robustness to noise and missing data. Analysis of two single-cell expression datasets suggests that the proposed framework could help identify key regulators involved in tumor metastasis and drug resistance. 
Conclusion: These results indicate that the proposed method could advance the understanding of the biological mechanisms underlying diseases by reconstructing single-cell gene regulatory networks and identifying key regulators across different cell states.
bioinformatics, 2026-06-07, v1

CiliAI: Automated segmentation and compartment specific fluorescence quantification of primary cilia in confocal microscopy images
Karapetian, E.; Gerhardt, C.; Nazif, E.; Pfirrmann, T.
Abstract
Primary cilia regulate essential signalling pathways controlling cell proliferation, differentiation, and tissue homeostasis. Quantitative analysis of ciliary morphology and compartment-specific protein localization by confocal microscopy is labor-intensive, user-dependent, and difficult to scale, particularly for multiplexed 3D image datasets. Here, we present CiliAI, a web-based deep-learning workflow for automated detection, substructure segmentation, and quantitative analysis of primary cilia in confocal microscopy images. CiliAI identifies ciliary substructures including the basal body, transition zone, and axoneme from multiplexed 3D image stacks and performs automated measurements of cilium length and compartment-specific fluorescence intensity. In NIH-3T3 cells, automated cilium length measurements showed close agreement with manual quantification and no statistically significant difference between methods (mean difference -0.214 µm, p = 0.213). Automated fluorescence analysis reproduced previously reported reductions in transition zone-associated Cep290 signal intensity in Rpgrip1l-deficient cells and identified the absence of significant Rpgrip1l accumulation changes in Rmnd5a-deficient cells. Automated processing reduced analysis time from days of manual quantification to minutes. Together, these findings establish CiliAI as an automated framework for quantitative analysis of ciliary morphology and compartment-specific protein abundance in confocal microscopy datasets.
bioinformatics, 2026-06-07, v1

Germline regulation of tumor evolutionary dynamics shapes multiple myeloma progression
Chen, H.; Shu, J.; Mudappathi, R.; Wang, P.; Bergsagel, L.; Yang, P.; Sun, Z.; Shi, C.; Liu, L.
Abstract
Germline variation shapes cancer risk, yet its influence on the evolutionary dynamics of established tumors remains poorly understood. In multiple myeloma, subclonal diversification drives disease progression and treatment failure, but the heritable factors that modulate this process are unknown. Here, we show that germline variation is associated with tumor evolutionary features, implicating inherited regulation in subclonal expansion. Integrating germline variation with tumor evolutionary parameters identifies variants associated with evolutionary features, with signals enriched in regulatory regions, consistent with a transcriptional basis. We further identify TBKBP1 as a key locus linking germline variation to tumor evolution and clinical outcome. Germline variation at this locus is associated with TBKBP1 expression and subclonal expansion, and TBKBP1 expression correlates with adverse prognosis, consistent across independent cohorts. Functional analyses demonstrate that TBKBP1 promotes proliferation and activates the MYC, mTORC1, and non-canonical NF-κB signaling pathways. Together, these findings establish germline regulatory variation as a determinant of tumor evolutionary dynamics and identify TBKBP1 as a mediator linking inherited variation to subclonal expansion and disease progression in multiple myeloma.
bioinformatics, 2026-06-07, v1

Multi-level, multi-body atomic interaction graphs for machine learning-based prediction of protein-ligand binding energies
Le, T. T. H.; Nguyen, B. T.; Vo, H.; Nguyen, N. H.; Nguyen, D. D.
Abstract
Accurate prediction of binding affinity is crucial for rational drug design and discovery. Traditional computational methods often rely on complex scoring functions that incorporate a multitude of physical and chemical descriptors, leading to high computational demands and sometimes limited generalizability. In this work, we propose a novel scoring function that models multi-level, multi-body atomic interactions using graph-based representations. Our method constructs comprehensive interaction graphs that incorporate both pairwise and triplet-wise atomic features that help capture cooperative spatial patterns essential for binding affinity prediction. By employing a feature fusion strategy, GMI-Score maintains model simplicity while enhancing accuracy. Extensive evaluation across multiple datasets, such as PDBbind v2013, PDBbind v2016, PDBbind v2020, CSAR-NRC-HiQ, and PDBbind-Redocked, demonstrates that our model consistently outperforms state-of-the-art scoring functions, achieving Pearson correlation coefficients up to 0.877. Furthermore, it retains strong predictive power under strict data leakage controls and realistic docking conditions to highlight its robustness and generalizability.
bioinformatics, 2026-06-07, v1

GLOF: A large-scale expert-curated benchmark dataset of gain-of-function and loss-of-function missense variants
Maricato, V.; Schlesinger, D.; de Souza Moura, P. N.
Abstract
Distinguishing loss-of-function (LOF) from gain-of-function (GOF) effects of missense variants is fundamental to understanding disease mechanisms and guiding therapeutic strategy, yet no large-scale, expert-curated benchmark has been publicly available for this task. Here we present GLOF (Gain and Loss Of Function), a dataset of 112,399 missense variants across 2,809 human genes, each classified as LOF, GOF, or neutral by board-certified clinical geneticists following ACMG guidelines. Pathogenic variants were sourced from ClinVar and annotated with their functional mechanism based on published functional studies, phenotype correlations, and established gene-disease relationships. Neutral variants were drawn from gnomAD v3.1 and validated against v4.1 using stringent population frequency filters. The dataset spans diverse protein families, includes 97 genes with bidirectional mechanisms (containing both LOF and GOF variants), and has been validated against well-characterized variants in the literature. GLOF is publicly available on Kaggle (https://www.kaggle.com/datasets/maricatovictor/loss-and-gain-of-function-variants) and Hugging Face (https://huggingface.co/datasets/victormaricato/glof), and provides a standardized resource for developing and benchmarking computational methods that predict variant functional mechanisms.
bioinformatics, 2026-06-07, v1

Quantifying Evidence for Competing Biomedical Hypotheses using Large Language Models and Bayesian Analysis
Moore, B. M.; Freeman, J.; Millikin, R. J.; Mohanty, C.; George, K. S.; Bal, A.; Lock, C.; Sauer, J.-D.; Spurgeon, M. E.; Moore, D. L.; Travers, B. G.; Stewart, R.
Abstract
Science fundamentally depends on the generation and testing of hypotheses, many of them controversial. An explosion in scientific literature has made evaluating hypotheses even within a domain a problem of scale, and risks slowing an already extensive consensus-building process. While this challenge has prompted interest in automated hypothesis evaluation tools, existing methods have not yet proven effective for comparing hypotheses. Here, we introduce KM-GPT-DCH, an algorithm that combines co-occurrence methods with large language models (LLMs) to develop a transparent and reproducible literature-based algorithm to compare controversial hypotheses using a structured scoring approach with Bayesian methods to estimate confidence. When testing the algorithm on historical controversial hypotheses previously decided, KM-GPT-DCH chooses the correct hypothesis with high confidence several years before the scientific community or public do so. We further apply the algorithm to compare twenty unresolved controversial hypothesis pairs providing guidance for future research. The method can help researchers and the public to evaluate biomedical hypotheses such as "Is it more likely that monoamine deficiency or inflammation causes depression?" It can also be used to assess and visualize historical trends in the scientific literature. A web-based implementation of the algorithm is freely available at https://skim.morgridge.org.
bioinformatics2026-06-07v1KDM: embedding DNA/RNA motifs and sequences in a shared k-mer space for unified discovery, analysis and binding prediction
Fumagalli, L.; Becchi, T.; Cereda, M.; Pozzoli, U.Abstract
Motif discovery and binding-site prediction in DNA and RNA sequences are central tasks in regulatory genomics, yet the methodological landscape is split between interpretable but rigid position weight matrices (PWMs) and high-performing but opaque machine-learning models. We present KDM, a unifying framework in which both motifs and sequences are represented as probability distributions over a shared k-mer dictionary, embedded via the Hellinger transformation. This common geometry enables motif-sequence scoring, motif-motif comparison, de novo discovery, and binding prediction with a single primitive, the Bhattacharyya coefficient. We instantiate four tools on this representation: KDMMap for positional enrichment analysis, KDMMatch for information-content-aware motif matching, KDMFind for unsupervised motif discovery via projective non-negative matrix factorization, and KDM-LRLM for binding prediction with Lasso-regularized logistic regression. Across 1,324 transcription-factor ChIP-seq and 161 RBP eCLIP experiments, KDMMap matches CentriMo's motif rankings in 84% of TF and 79% of RBP experiments, and KDMMatch agrees with Tomtom on motif annotation in 74.5% of TFs. On binding prediction across four datasets covering 2,475 experiments, KDM-LRLM matches or exceeds eight deep-learning and three k-mer-based competitors. Notably, AI methods overtake k-mer methods only in the top quartile of training-set size, indicating that data scale, not architecture, drives the recent dominance of deep models. KDM provides a single interpretable representation across the full motif-analysis workflow.
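The shared representation described above can be illustrated with a minimal sketch: sequences become probability distributions over a fixed k-mer dictionary, the Hellinger transformation takes element-wise square roots, and the Bhattacharyya coefficient is then the dot product of two embedded vectors. The function names, the choice of k = 3, and the pseudocount parameter are assumptions made for illustration. The abstract does not say how KDM derives k-mer distributions for motifs, so this example compares sequences only and is not the authors' implementation.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_distribution(seq, k=3, alphabet="ACGT", pseudocount=0.0):
    """Probability distribution of seq over the full k-mer dictionary (hypothetical helper)."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[km] for km in kmers) + pseudocount * len(kmers)
    if total == 0:
        raise ValueError("sequence contains no k-mers from the dictionary")
    return [(counts[km] + pseudocount) / total for km in kmers]


def hellinger_embed(p):
    """Hellinger transformation: element-wise square root of a distribution."""
    return [sqrt(x) for x in p]


def bhattacharyya(p, q):
    """Bhattacharyya coefficient, i.e. the dot product of the Hellinger embeddings."""
    return sum(a * b for a, b in zip(hellinger_embed(p), hellinger_embed(q)))


if __name__ == "__main__":
    p = kmer_distribution("ACGTACGTACGT")
    q = kmer_distribution("ACGTTTGTACGA")
    print(round(bhattacharyya(p, q), 3))
```

Because every embedded vector has unit Euclidean length, the coefficient lies between 0 (no shared k-mers) and 1 (identical distributions), which is what allows one primitive to serve for both scoring and comparison.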
bioinformatics2026-06-07v1Polynomial Trajectory Compression for Protein Language Model Embeddings
Sahni, H.; Chen, X.; Estrada, T.Abstract
Protein language models (PLMs) generate rich, layer-wise embeddings that capture diverse biological information but are expensive to store and compute at scale. In this work, we propose a compact surrogate representation for PLM embeddings across transformer layers using low-dimensional PCA projections and cubic polynomial trajectories. This approach enables efficient storage and on-demand reconstruction of these protein-level embeddings at any layer without rerunning the PLM. We evaluate our method on two downstream tasks, protein-protein interaction and subcellular localization, using the ESM-35M and ESM-3B PLMs. We show that the surrogate embeddings achieve high reconstruction fidelity while significantly reducing storage and computational requirements. The new approach also retains downstream task prediction performance compared to original embeddings. Our approach provides a scalable and practical solution for large-scale protein embedding storage and reuse.
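The trajectory idea can be sketched as follows: once each layer's embedding has been projected to a few PCA coordinates, every coordinate is treated as a function of layer index and summarised by four cubic coefficients, from which any layer can be re-evaluated. The sketch assumes the PCA projection has already been done, normalises the layer index to [0, 1], and fits by ordinary least squares. All function names are hypothetical and the abstract does not confirm these details.

```python
def fit_cubic(ts, ys):
    """Least-squares fit of y ~ c0 + c1*t + c2*t^2 + c3*t^3 via the normal equations."""
    n = 4
    A = [[sum(t ** (i + j) for t in ts) for j in range(n)] for i in range(n)]
    b = [sum(y * t ** i for t, y in zip(ts, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        s = sum(A[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - s) / A[i][i]
    return coeffs


def compress(trajectory):
    """trajectory[layer][dim] holds low-dimensional scores; returns 4 coefficients per dim."""
    n_layers = len(trajectory)
    ts = [layer / (n_layers - 1) for layer in range(n_layers)]
    dims = len(trajectory[0])
    return [fit_cubic(ts, [trajectory[l][d] for l in range(n_layers)]) for d in range(dims)]


def reconstruct(coeffs_per_dim, layer, n_layers):
    """Evaluate the stored cubics at one layer to recover its low-dimensional scores."""
    t = layer / (n_layers - 1)
    return [sum(c * t ** i for i, c in enumerate(coeffs)) for coeffs in coeffs_per_dim]
```

With this layout, storage per protein is four numbers per retained component regardless of how many layers the model has; mapping the reconstructed scores back to the full embedding space would additionally require the stored PCA basis.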
bioinformatics2026-06-07v1Learning quality scores for chromatin accessibility bigWig tracks using Machine Learning
Sanders, E.; Riva, S. G.; Hughes, J. R.Abstract
High-throughput chromatin accessibility assays such as bulk and single-cell ATAC-seq have generated large collections of processed signal tracks in bigWig format, which are widely used for visualisation, data integration, and Machine Learning (ML)-based analyses. Despite their central role, systematic quality control (QC) frameworks operating directly at the level of bigWig signal tracks remain underdeveloped. This gap limits the ability to assess data reliability and hampers robust downstream analyses. Here, we present a biologically grounded QC framework for chromatin accessibility bigWig files that integrates peak-level information, background noise estimation, and recovery of stable genomic reference features. Using an ML-based peak caller (LanceOtron), we derive complementary quality metrics capturing signal structure and signal-to-noise properties. We further define constant promoter and CTCF regions as internal biological controls and show that their recovery provides a sensitive measure of data quality across diverse cellular contexts. We apply this framework to a collection of 502 human chromatin accessibility bigWig tracks spanning a wide range of tissues and cell types. The proposed metrics capture related but non-redundant aspects of signal quality and motivate the use of constant promoter and CTCF recovery as biologically meaningful targets. An XGBoost model trained on LanceOtron-derived features accurately predicts recovery of these stable genomic elements on held-out data (R² = 0.97), yielding a continuous and interpretable quality score. Feature importance analysis using SHAP values highlights that model decisions are driven by biologically relevant signal properties rather than arbitrary heuristics. Quantile-based stratification of the quality score is further supported by clear qualitative differences in genome browser visualisations.
Together, this work provides a principled and extensible framework for assessing the quality of chromatin accessibility bigWig tracks, enabling more reliable data integration and supporting downstream ML applications in regulatory genomics.
bioinformatics2026-06-07v1CLASPP: A unified model for predicting post-translational modifications
Gravel, N.; Zhou, Z.; Fang, R.; Soleymani, S.; Kannan, N.Abstract
Post-Translational Modifications (PTMs) are a fundamental mechanism for regulating cellular pathways and increasing the functional diversity of the proteome. Accurately predicting the PTM types that are likely to occur at a given site in the primary sequence is a key challenge in functional proteomics. Existing PTM prediction models predominantly focus on either single PTM types or employ ensemble methods that combine multiple models to predict different PTM types. This fragmentation is largely driven by the vast imbalance in data availability across PTM types, making it difficult to predict multiple PTM types with a single model. To address this limitation, we present the Contrastively Learned Attention-based Stratified PTM Predictor (CLASPP), a unified PTM prediction model. CLASPP addresses imbalance challenges by leveraging unsupervised clustering-based undersampling and a novel contrastive learning framework tailored to PTM data. Additionally, our hierarchical data organization and curation are shown to improve CLASPP's performance by balancing the representation of individual PTM types and provide a standardized dataset for training and validating future model designs. Drawing inspiration from advancements in image and natural language processing, the CLASPP model employs a multi-stage training strategy and a high-quality, curated training dataset to improve PTM prediction performance. To uncover what is learned during the contrastive learning stage, we show that the CLASPP model distinguishes known protein kinase substrate specificity profiles, as a form of explainability. Finally, we evaluate the application of CLASPP in predicting PTMs in different model organisms and experimentally validated ubiquitination sites in the understudied DCLK3 kinase.
Overall, CLASPP represents a unified model for PTM prediction that addresses key bottlenecks in data imbalance and offers new strategies for biological data curation, thereby improving PTM-type prediction performance across diverse organisms.
bioinformatics2026-06-07v1Cross Dataset Transcriptomic Analysis Identifies Oxidative Stress Inflammation Gene Networks Modulated by Nutrigenomic Interventions in Parkinson Disease
Rafiee, M.; Abaj, F.; Ghiasvand, R.Abstract
Inflammation and oxidative stress (OS) are central to Parkinson's disease (PD). We performed a cross-dataset integrative transcriptomic analysis to identify OS- and inflammation-related hub genes consistently dysregulated in PD and to explore gene-compound relationships using publicly available nutrigenomic datasets. Four GEO datasets (GSE7621, GSE20141, GSE20146, GSE49036) were analysed to identify differentially expressed genes (DEGs), which were intersected with GeneCards OS-inflammation gene sets. Functional enrichment analyses, including gene ontology (GO), pathway over-representation analysis (ORA), and protein-protein interaction (PPI) analysis, were used to identify key pathways and hub genes. Gene-food bioactive compound (FBC) associations were explored by integrating PD signatures with nutrigenomic profiles from NutriGenomeDB. We identified 183 DEGs in PD, enriched in synaptic, dopaminergic, OS, and inflammatory pathways. Intersection analysis yielded 26 OS-inflammation-related genes and 10 central regulators, including TH, DDC, SNCA, LRRK2, HSPB1, and HSPA1B. Integration with nutrigenomic datasets revealed opposing-direction transcriptional patterns, with several FBC-associated signatures showing lower expression of stress-related genes and higher expression of dopaminergic markers such as TH, GCH1, and DDC. Overall, this integrative analysis highlights OS-inflammation gene networks in PD and identifies candidate diet-gene associations that warrant further experimental and clinical validation.
bioinformatics2026-06-06v2Quantifying the contribution of DNA conformational flexibility to transcription factor binding on nucleosomal DNA uncovers indirect readout across diverse TF families
Dey, U.; Martinez, G. S.; Kumar, R.; Yella, V. R.; Kumar, A.Abstract
Background: Eukaryotic gene regulation depends on transcription factors (TFs) recognizing short DNA motifs within chromatin. Many of these motifs lie within nucleosomes, where DNA is sharply bent, rotationally phased, and constrained by histone-DNA contacts. Yet only a subset is occupied in any cellular context. Motif identity alone, therefore, cannot fully explain selective TF engagement with nucleosomal DNA. We asked whether sequence-derived DNA conformational flexibility provides an interpretable representation of sequence context relevant to TF recognition on nucleosomes. Results: We compiled five DNA flexibility descriptors in the Python package DNAflexpy, representing bendability, torsional deformation, backbone conformational variability, and stiffness. We built quantitative models of TF binding affinity across 226 datasets from a high-throughput in vitro TF-nucleosome binding assay. Flexibility-augmented models improved prediction over mononucleotide baselines in most datasets, with smaller but reproducible gains over trinucleotide baselines. The gains were not uniform: they varied across TF families and were concordant with DNA shape-fluctuation features, suggesting that DNAflexpy descriptors capture a sequence-encoded structural signal. In PIONEAR-seq data, model performance generalized across nucleosomal templates in a TF- and sequence-dependent manner. Beyond prediction, position-resolved flexibility footprints revealed deformation signatures at cognate motifs and flanking regions across diverse TF families. For SOX11, model-derived footprints aligned with DNA shape fluctuations from nanosecond-to-microsecond molecular dynamics trajectories of SOX11-bound nucleosomes, consistent with independently observed DNA conformational dynamics and bound-state stabilization. The in vivo data showed a similar but more context-dependent pattern. 
OCT4 occupancy tended to correlate with local flexibility, whereas GATA3-pioneered regions showed flexibility coupled with altered rotational positioning of cognate motifs. Flexibility-augmented classifiers further improved discrimination of occupied nucleosomal motifs across ENCODE datasets. Torsional flexibility features, particularly twist dispersion and trx, were most informative for classification. Conclusions: Sequence-derived DNA conformational flexibility provides a quantitative and interpretable representation of sequence context in TF recognition on nucleosomes. By augmenting the sequence with structural information, these models help quantify and interpret an indirect-readout contribution in which DNA deformation tendencies may complement motif sequence and DNA shape. This framework may help explain why only selected motif instances are engaged in chromatin, without treating flexibility as independent of primary sequence.
bioinformatics2026-06-06v2HOPE: Interpretable Histology Analysis with Spatial Omics-Derived Signatures for Precision Oncology
Wang, T.; Bieniosek, M.; Krpicak, T. J.; Luan, M.; Ruf, B.; Schürch, C. M.; Mayer, A. T.; Luo, R.; Trevino, A. E.; Wu, Z.Abstract
Hematoxylin and eosin (H&E) stained images are fundamental clinical tools for disease assessment. However, even with advanced computational models, their prognostic capabilities remain limited. Spatial omics characterizes tumor microenvironments (TME) in detail yet remains clinically inaccessible due to cost and complexity. In this study, we present HOPE, a lightweight framework that learns TME signatures from paired H&E and spatial omics data during training, then applies these to H&E alone at inference. Leveraging H&E foundation models, HOPE consistently outperforms identical architectures trained without spatial omics guidance across cancer types and cohorts. It further generates interpretable annotations of TME signature on H&E regions, stratifying patients into biologically coherent groups with different prognostic outcomes. HOPE establishes a practical route to translate high-content spatial omics discoveries into scalable, clinically deployable tools.
bioinformatics2026-06-06v1Compositional and interpretable representation of histology using AI foundation models and sparse autoencoders
Zhao, Z.; Maliga, Z.; Ogbonna, E. C.; Talemi, S. R.; Coy, S.; Gagne, A.; Lumamba, K.; Solomon, I. H.; Santagata, S.; Steyn, A. J. C.; Naidoo, T.; Sorger, P. K.Abstract
Light microscopy of tissue sections stained with hematoxylin and eosin (H&E) has been the foundation of histopathology for over 150 years and remains essential for diagnosis and research. The development of high-plex spatial profiling approaches able to measure protein and RNA expression at single-cell resolution augments but does not replace H&E imaging, even in research. Computational pathology (CPath) models based on deep learning promise to further increase the value of H&E imaging but interpreting these models in biological terms remains challenging. As a result, they are not widely used in spatial profiling studies. Here we describe a human-in-the-loop computational framework that leverages CPath foundation models (FMs) and sparse autoencoders (SAEs) to decompose FM embeddings and automatically identify diverse, human-interpretable histopathology features in H&E images. When FM-SAE modeling was applied to pulmonary diseases such as tuberculosis and lung cancer, human-machine interaction augmented and accelerated expert interpretation. Moreover, the resulting annotations provide a morphology-aware approach to integrating 2D and 3D mesoscale tissue architectures with molecular spatial profiling.
bioinformatics2026-06-06v1TetraFuse: A Synergistic Four-Dimensional Dynamic Fusion Framework for Efficient and Robust Medical Image Classification
Gao, Y.; Li, J.; Xu, J.; Li, Q.; Li, Z.; Shi, Y.; ZHao, G.; Wu, X.; Zhang, Y.Abstract
Accurate and robust classification of medical pathology images is pivotal for computer-aided diagnosis. However, the deployment of deep learning models in high-throughput clinical screening faces a fundamental challenge: the trade-off between diagnostic accuracy and computational efficiency. Current lightweight architectures, while reducing parameter complexity through grouped convolutions, often lead to cross-channel information isolation and diminished representational capacity. In this paper, we propose TetraFuse, a novel framework that systematically integrates features from four complementary domains: space, channel, statistics, and frequency. TetraFuse introduces a novel Cross-Channel Dynamic Aggregation (CCDA) paradigm that reconstructs global channel topology with negligible computational overhead, resolving the inter-group isolation issue. To balance perceptual fidelity and efficiency, we design a stage-aware local enhancement mechanism: Local Variance-Guided Enhancer (LVGE) is employed to filter out shallow-stage background noise, while High-Frequency Boundary Injection (HFBI) reinforces deep-stage pathological contours, preventing spatial over-smoothing. Experimental results on the COVID-19, ISIC 2018, and Kvasir datasets confirm that TetraFuse outperforms state-of-the-art (SOTA) methods. Notably, TetraFuse-Tiny achieves a transformative 91.53% reduction in FLOPs compared to ResNet50; on the Kvasir dataset, it achieved an accuracy of 0.926 and an AUC of 0.994 with only 0.345G FLOPs. By combining high representational power with minimal computational demand, TetraFuse offers a scalable solution for large-scale medical image analysis, especially in resource-constrained clinical environments.
bioinformatics2026-06-06v1Revised Adaptive Immune Receptor Data in the Immune Epitope Database
Scheffer, L.; Richardson, E. M.; Vita, R.; Zarebski, L.; Blazeska, N.; Wheeler, D. K.; Cantrell, J. R.; Deleuran, S. N.; Lees, W. D.; Christley, S.; Corrie, B.; Cowell, L. G.; Sette, A.; Peters, B.Abstract
The Immune Epitope Database (IEDB, iedb.org) is a freely available resource that catalogs experimentally defined immune epitopes and - if available - the immune receptors that recognize them. Currently, the IEDB records ~185,000 T cell receptors and ~5,000 B cell receptors/antibodies with experimentally verified epitope specificity. Because these receptor data were manually curated from ~3,300 references spanning decades, nomenclature inconsistencies present challenges for computational analyses and user queries. To support integrated analysis of the entire dataset, we revised the IEDB receptor data standardization and validation pipeline to flag and correct inaccuracies. Anomalous receptors from over 800 studies were flagged for re-curation. The updated receptor dataset shows greater conformity through consistent gene nomenclature formatting and harmonized CDR sequence delimitation. Taking advantage of the increased receptor data consistency, the IEDB web interface was expanded to include receptor search features directly on the homepage, support V/J gene and species options in the refined receptor search, and allow direct data export in the Adaptive Immune Receptor Repertoire (AIRR) format. We anticipate that the improved receptor data quality will simplify bioinformatics analyses, and facilitate integration of IEDB data into cross-repository data resources, such as the AIRR Knowledge Commons.
bioinformatics2026-06-06v1Comparative Proteomics Across Tissues and Crop Agroecosystems Reveals Agricultural Stressor Responses in the Western Honey Bee
Zhong, H.; ZHONG, P.; Park, J.; Kozlova-Ryabova, A.; Moravcova, R.; Rogalski, J. C.; Jamieson, A.; Lansing, L.; Fang, W. W. T.; Moon, K.-M.; Yuan, X.; Ovinge, L. P.; Kearns, J. D.; Gregoris, A. S.; Higo, H.; Common, J.; Conflitti, I. M.; Pepinelli, M.; Tran, L.; Cunningham, M.; Jabbari, H.; Bukhari, S. A.; French, S. K.; Ho, J.; Deckers, T. B.; Zorz, J.; Polo, R. O.; Hoover, S. E.; Pernal, S. F.; Giovenazzo, P.; Currie, R. W.; Guarna, M. M.; Zayed, A.; Foster, L. J.Abstract
Maintaining honey bee health in crop production systems is increasingly difficult because worker bees encounter multiple chemical and biological pressures from pesticides and pathogens. How these field-realistic pressures affect molecular physiology across functionally distinct tissues remains poorly understood. Here, we tested whether tissue-resolved proteomics could separate stable tissue-specific patterns from crop-associated molecular changes. To do this, we profiled abdomen, gut, and head proteomes from honey bees collected across four Canadian crop ecosystems over two consecutive years, and integrated these data with pesticide-residue and pathogen-load measurements. Proteomic variation was structured by both tissue identity and crop environment. Tissue-specific proteomic profiles were characterized across samples, whereas crop-associated effects were detected in both years and were stronger in 2021, the second year of the study. Tissue-specific enrichment and network analyses linked the abdomen to lipid catabolism and ubiquitin-proteasome proteostasis, the gut to central carbon metabolism, membrane transport, vesicle trafficking, and cytoskeletal organization, and the head to neurosensory and mitochondrial functions, together with amino-sugar metabolism and vesicle-associated quality-control modules. Among the measured pesticide residues, boscalid was the most reproducible chemical correlate of proteomic variation, with the strongest signal in the gut. Cross-year validation associated boscalid exposure with reduced abundance of gut proteins involved in mitochondrial metabolism, protein quality control, vesicle trafficking, nutrient transport, and biosynthetic pathways. Additionally, integrated proteome-transcriptome-microbiome factor analysis further identified gut-centered components associated with measured stressor variables and linked protein-level variation to coordinated transcriptomic and microbial shifts. 
Independent-year validation showed that compact crop-associated protein signatures detected in 2020 were also present in 2021. Together, these results show that honey bee tissues maintain stable proteomic identities while showing tissue- and year-specific responses to pesticide and pathogen pressures encountered in crop ecosystems. The gut proteome may specifically provide a sensitive molecular indicator of pesticide-associated perturbation under field conditions.
bioinformatics2026-06-06v1samsampleX: Distribution-aware downsampling for benchmarking next-generation sequencing data
Demiriz, S.; Taliun, D.Abstract
High-throughput next-generation sequencing (NGS) is essential for genetic variant discovery across diverse applications. As NGS technologies evolve, there is a growing need for benchmarking tools that support realistic data simulation and downsampling. Existing downsampling tools apply uniform sampling of sequencing reads, which inadequately models realistic coverage distributions, particularly in difficult-to-sequence regions and hybrid sequencing designs. Here we present samsampleX, a Python-based tool implementing a novel distribution-aware downsampling algorithm that dynamically adjusts read retention probabilities to emulate coverage profiles derived from real sequencing data. Using ultra-high-coverage reference datasets, samsampleX accurately reproduces coverage patterns observed in typical sequencing experiments, outperforming uniform downsampling methods at preserving depth variability across genomic regions such as the HLA locus and hybrid whole-exome/genome sequencing configurations. samsampleX extends current downsampling strategies by offering enhanced flexibility for specialized NGS benchmarking scenarios, facilitating improved assessment of sequencing data analysis methods.
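The contrast with uniform downsampling can be shown with a toy sketch in which the retention probability varies by genomic bin, set to the ratio of a target depth profile to the depth of the high-coverage source. This is only one plausible reading of "distribution-aware". The binning scheme, the function names, and the representation of reads as start positions are assumptions for illustration and do not describe samsampleX's actual algorithm or interface.

```python
import random


def retention_probs(source_depth, target_depth):
    """Per-bin probability of keeping a read so that expected depth matches the target."""
    return [min(1.0, t / s) if s > 0 else 0.0
            for s, t in zip(source_depth, target_depth)]


def downsample(read_starts, probs, bin_size, seed=0):
    """Keep each read with the probability assigned to the bin containing its start."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    kept = []
    for start in read_starts:
        if rng.random() < probs[start // bin_size]:
            kept.append(start)
    return kept


if __name__ == "__main__":
    # Source is deep everywhere; the target profile dips in the second bin.
    probs = retention_probs(source_depth=[100, 100, 100], target_depth=[30, 5, 30])
    reads = [random.Random(1).randrange(0, 300) for _ in range(1000)]
    print(probs, len(downsample(reads, probs, bin_size=100)))
```

A uniform downsampler corresponds to the special case where every bin shares a single probability, which is why it cannot reproduce region-specific dips such as those the abstract mentions at the HLA locus.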
bioinformatics2026-06-06v1An inflammatory gene set driven epigenetic clock tracks down disease progression and rejuvenation
Sandor, P.; Kerepesi, C.; Castro, J. P.Abstract
Chronic, low-level inflammation, characterized by elevated pro-inflammatory programs, including epigenetic changes, in the absence of infection, is a major driver of aging and age-related diseases. On the other side of the spectrum, aging interventions work, at least in part, by decreasing inflammation. However, the molecular connection between epigenetic aging and inflammatory profiles in chronic diseases and rejuvenation has not been established yet. This study aimed to investigate the role of a newly described inflammatory signature gene set (ISig) in aging, previously associated with accelerated aging, in the progression of chronic diseases and rejuvenation. To achieve this, we developed inflammation-derived epigenetic aging clocks using ElasticNet regression models trained on CpG sites from ISig promoter regions. The newly developed inflammation aging clocks were validated on healthy samples and tested for their capacity to detect accelerated aging in diseased samples and rejuvenation during cellular reprogramming. The data demonstrate that the ISig inflammatory clocks accurately predict age, detect rejuvenation, and identify accelerated aging in disease contexts. Furthermore, we have demonstrated that it is possible to use a curated inflammatory gene-set with biological relevance to estimate biological age acceleration. We also developed a web application, the GeneClock Studio (available at https://ilab.sztaki.hu/geneclockstudio/), that allows researchers to apply the inflammatory aging clocks to their own DNA methylation datasets without requiring any programming expertise. Furthermore, the GeneClockStudio supports the training of new aging clocks based on an arbitrarily selected gene set in a similar way as in the case of the ISig inflammatory clocks.
bioinformatics2026-06-06v1Multivariate integration of histological images and gene expression data: a comparative review
Ma, C.; Mao, J.; Le Cao, K.-A.Abstract
Integrating histological images with gene expression data offers a promising approach for linking tissue morphologies to molecular signatures and improving disease subtyping. However, such integration remains challenging due to the high dimensionality of these datasets, cross-modal heterogeneity, and limited interpretability. Multivariate methods such as Sparse Canonical Correlation Analysis (Sparse CCA), Joint Nonnegative Matrix Factorisation (Joint NMF), and Angle-based Joint and Individual Variation Explained (AJIVE), have been used to address these challenges by reducing dimensionality while identifying features associated with latent factors, thereby enhancing biological interpretability. Despite increasing application in imaging-omics research, systematic comparisons of their methodological properties remain limited. Consequently, users often lack guidance on how to appropriately select these methods in practice, and these approaches are frequently treated as interchangeable despite differing modelling assumptions. Here, we use paired H&E images and gene expression data from breast cancer as a representative case study to examine the methodological characteristics, interpretability, and complementary properties of these integration approaches. Our results show that each method captures distinct yet complementary aspects of the underlying information. Although the biological findings are derived from the TCGA-BRCA datasets, the methodological insights identified here extend more broadly to imaging-omics integration studies. Overall, this comparative review highlights the strengths and limitations of each approach and outlines considerations for future methodological development.
bioinformatics2026-06-06v1Mapping Chemical Diversity: Descriptor-Guided Clustering of Natural Products in the COCONUT Database
Shreyasree, G.; Dileep, A.; Namani, A.; Karunakar, P.Abstract
Natural products represent a major source of bioactive compounds for drug discovery, yet their exploration remains challenging due to extensive structural complexity and scaffold diversity. Using the COCONUT database, we developed a cluster-oriented framework to systematically map and characterize the natural product chemical space through feature engineering, molecular clustering, and representative-based analysis. Descriptor selection identified a greedy maximum coverage strategy with a 0.35-0.85 correlation threshold range and 20 descriptors as the optimal feature set, enriched in physicochemical and graph-topological properties. Comparative evaluation of clustering approaches identified UMAP-HDBSCAN as the best-performing pipeline, generating 1,683 clusters with silhouette scores of 0.42 before and 0.24 after noise reassignment. Cluster profiling revealed a highly heterogeneous scaffold landscape, with 67.56% of clusters exhibiting low scaffold dominance and only 15.21% representing highly scaffold-dominated regions, supporting a chemical space composed largely of interconnected transitional clusters. Descriptor analyses showed that natural product clusters were generally enriched in saturated, low-aromaticity chemotypes with moderate lipophilicity and constrained molecular flexibility. Representative-based analyses demonstrated that central representatives (medoid and centroid-closest molecules) closely captured cluster-average properties, whereas diverse representatives better reflected structural breadth, findings further supported through descriptor-based and docking-based validation. Collectively, the results reinforce the natural product chemical space as a continuous yet structured manifold and provide a representative-guided framework for its efficient exploration in drug discovery applications. The complete data can be accessed at: https://github.com/shrek-28/DescriptorClusteringNPSpace
bioinformatics2026-06-06v1Ignet 2.0 and Vignet: An Ontology-Driven Web Platform for Biomedical Gene Interaction Discovery and Visualization
Asaduzzaman, S.; Bansal, B.; Combs, P.; Zhang, J.; Rehana, H.; McGregor, B.; He, Y.; Hur, J.Abstract
Background: The expansion of biomedical literature demands systematic ontology-guided discovery of gene interactions, vaccine mechanisms, drug associations, and adverse events. Existing platforms such as STRING, DisGeNET, and PubTator fall short of providing a unified, freely accessible system that integrates ontology-based semantic interaction classification, vaccine-focused heterogeneous network construction, and Artificial Intelligence-assisted evidence retrieval. Results: Ignet 2.0 and Vignet are freely accessible dual-platform systems that combine PubMed literature mining with BioBERT-based interaction scoring for millions of gene-gene co-occurrence pairs and integrate three biomedical ontologies and one curated drug resource: the Interaction Network Ontology (INO), Vaccine Ontology (VO), Human Disease Ontology (HDO), and DrugBank. Ignet 2.0 supports gene interaction discovery, gene set enrichment retrieval of BioBERT-scored GenePair evidence, and AI-assisted summarization through BioSummarAI. Vignet extends these features with VO-guided Vaccine Exploration, VacPair interaction scoring, and the creation of vaccine, gene, drug, and disease networks in VacNet. A public Representational State Transfer Application Programming Interface (REST API) and Model Context Protocol (MCP) endpoint enable real-time integration, fostering trust in biomedical knowledge discovery. Conclusion: Ignet 2.0 and Vignet are scalable, ontology-guided biomedical knowledge platforms that facilitate evidence-based gene interaction analysis, vaccine-focused semantic exploration, and AI-assisted knowledge discovery. Their real-time PubMed data integration ensures up-to-date insights; however, users should consider validation processes and potential lags in incorporating the latest experimental data, which may affect the reliability of immediate data. Availability: Ignet 2.0: https://ignet.org/ignet; Vignet: https://ignet.org/vignet/
bioinformatics2026-06-06v1Correcting for Global Synonymous Selection Improves the Accuracy of Episodic Positive Selection Inference
Verdonk, H. E.; Pivirotto, A.; Hey, J.; Kosakovsky Pond, S. L.Abstract
The ratio of nonsynonymous to synonymous substitution rates (ω) constitutes a fundamental parameter for inferring adaptive protein evolution, predicated upon the assumption that synonymous substitutions are selectively inert. This premise, however, is increasingly untenable given evidence of selection acting on synonymous substitutions, driven by various biological processes such as translational efficiency and mRNA stability. In this study, we demonstrate that unmodelled synonymous selection introduces substantial bias into ω estimation, resulting in elevated false positive rates in tests for positive selection. To rectify this, we present BUSTED+S+MSS, a statistical framework incorporating Multiclass Synonymous Substitution (MSS) models into BUSTED, a method for detecting episodic selection. By partitioning synonymous codons into empirically derived rate classes, this approach accounts for global synonymous constraints. Application to five diverse clades - Drosophila, Caenorhabditis, Enterobacteria, Saccharomyces, and Primates - reveals that the inclusion of MSS components consistently improves model fit and reduces the proportion of genes inferred to be under positive selection. In Enterobacteria, genes retaining significance under the corrected model exhibit weaker constraint on synonymous substitutions (dSs), consistent with the hypothesis that unmodelled purifying selection drives spurious signals of adaptation. Furthermore, an information-theoretic analysis indicates that whilst site-specific variation (SRV) provides the primary correction, global synonymous rate variation (MSS) contributes a distinct second-order correction. In highly divergent alignments, these signals act in concert to improve model fit. 
The BUSTED+S+MSS framework, especially when coupled with an "error-sink" to absorb alignment artifacts, thus offers a computationally feasible means to disentangle adaptive nonsynonymous substitution from the confounding effects of synonymous constraint.
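For readers less familiar with the notation, the standard definition of the ratio discussed above can be written out as follows. This is the textbook form only, not the BUSTED+S+MSS likelihood itself; the symbols d_N and d_S (nonsynonymous and synonymous substitution rates) are the conventional ones and are assumed here rather than taken from the abstract.

```latex
\omega = \frac{d_N}{d_S}, \qquad
\begin{cases}
\omega < 1 & \text{purifying selection} \\
\omega = 1 & \text{neutral evolution} \\
\omega > 1 & \text{positive selection}
\end{cases}
```

Under this definition, any unmodelled constraint that lowers d_S raises ω even when d_N is unchanged, which is the source of the bias the abstract describes.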
bioinformatics2026-06-06v1PAG-Agent: a biologist-oriented research assistant for context-aware pathway-level analysis and interpretation
Nguyen, Q.-H.; Zhang, Z.; Le, D.-H.; Chen, J. Y.; Ku, W.-S.; Chen, H.; Yue, Z.Abstract
Pathway analysis is a critical step for translating gene-level omics results into biological mechanisms, yet existing workflows often leave researchers with long lists of statistically significant pathways that are difficult to interpret, validate, and connect to experimental context. We developed PAG-Agent, a biologist-oriented virtual research assistant that integrates pathway-level statistical analysis, context-aware biological interpretation, literature-supported reasoning, and scientific writing support within a unified workflow. PAG-Agent supports bulk and single-cell transcriptomic data and enables users to perform data preprocessing, differential expression analysis, pathway analysis, pathway-level consensus analysis, and pathway-level meta-analysis through click-based and chat-based interactions. Unlike conventional pathway-analysis tools that analyze gene sets largely in isolation, PAG-Agent incorporates experimental conditions and research objectives to prioritize biologically relevant pathways and generate interpretable hypotheses. The system also provides gene and pathway annotation, citation retrieval, visualization, and writing refinement functions. In Alzheimer's disease case studies using three transcriptomic datasets, PAG-Agent consistently identified neurodegeneration-related pathways across multiple analysis methods and datasets. In citation-retrieval benchmarking, PAG-Agent outperformed six competing LLMs across five common literature-support scenarios, demonstrating improved ability to provide contextually relevant and valid references. Overall, PAG-Agent lowers technical barriers for pathway-level analysis and helps researchers move from transcriptomic data to biologically grounded interpretation, hypothesis generation, and scientific communication.
bioinformatics2026-06-06v1STITCH: Spatial Transcriptomics Imputation via Flow Matching with Internal Learning
Wang, S.; Wang, X.; Peng, Q.; Li, T.Abstract
Spatial transcriptomics datasets frequently suffer from spatial gaps and missing regions due to sectioning artifacts, tissue damage, and the high cost of sequencing that limits tissue coverage. We present STITCH, a scalable and robust generative framework for multidimensional virtual spatial transcriptomics reconstruction. STITCH models intrinsic spatial-transcriptomic patterns directly from individual tissue samples, enabling reconstruction without requiring external reference atlases or matched histological image priors. The framework adopts a decoupled architecture that separates spatial morphology restoration from transcriptomic generation. STITCH first compresses high-dimensional transcriptomic profiles into a low-dimensional latent representation through a spatial-aware graph autoencoder. For 3D cross-slice gaps, STITCH employs optimal transport-conditioned flow matching for spatial reconstruction, whereas 2D in-slice damage is repaired through an internal learning strategy. To generate the corresponding transcriptomic profiles, STITCH further establishes a point-wise conditional flow matching model in the latent space. This module achieves linear computational complexity, enabling continuous 3D atlas reconstruction of over 11 million cells within 5 hours on a single commodity GPU. Extensive evaluations across diverse spatial transcriptomics platforms, spanning both single-cell and spot-level technologies, demonstrate that STITCH consistently preserves transcriptomic identities, spatial topologies, and anatomical continuity. Overall, STITCH provides a scalable and platform-compatible computational framework for reconstructing high-resolution continuous spatial transcriptomic atlases.
bioinformatics2026-06-06v1EnzOracle: Mechanism-aware prediction of enzyme environmental adaptation via a classification-guided mixture-of-experts framework
Wei, D.-Q.; Gao, Q.; Fang, Z.; Yuan, Y.; Jin, M.; Sun, H.; Peng, Z.; Yang, L.; Li, J.Abstract
Industrial biocatalysis increasingly requires enzymes capable of operating under extreme physicochemical conditions, yet most natural sequence data reflect adaptation to mild environments, leading conventional predictive models to suffer from regression-to-the-mean effects in extremophilic regimes. Here we present EnzOracle, a classification-guided mixture-of-experts framework that enables distribution-aware prediction of enzyme melting temperature (Tm), optimal catalytic temperature (Topt), and optimal pH (pHopt) directly from sequence. EnzOracle demonstrated robust predictive accuracy across diverse benchmarks, achieving RMSE of 5.245 for Tm, 11.458 for Topt, and 0.781 for pHopt. Beyond predictive accuracy, we introduce a trait-resolved molecular simulation strategy to evaluate whether EnzOracle-derived attribution patterns correspond to independent physical mechanisms. Across representative systems, attention hotspots mapped onto rigidity-conferring interaction networks for Tm, dynamically preorganized active-site ensembles for Topt, and pH-dependent electrostatic and hydration networks for pHopt. These orthogonal validations indicate that EnzOracle captures transferable biophysical principles of enzyme environmental adaptation rather than merely exploiting dataset-specific correlations, positioning sequence-based learning as a mechanism-aware framework for discovering stability and activity determinants across diverse catalytic landscapes.
bioinformatics2026-06-06v1Single-Cell Multi-Omics Dissection of Malignant Evolutionary Mechanisms and Construction of a Prognostic Model for Clear Cell Renal Cell Carcinoma
Liu, R.; Shi, Y.; Xiao, Y.; Ren, B.; Li, L.; Qi, B.; Li, T.; Zhang, Y.; Gao, J.Abstract
Clear cell renal cell carcinoma (ccRCC) exhibits pronounced heterogeneity across WHO histological grades, yet systematic single-cell multi-omics studies characterizing these transitions remain limited. We integrated scRNA-seq and scATAC-seq data across ccRCC WHO grades to establish a multi-omics framework encompassing tumor cells and immune populations. Using pseudotime trajectory analysis and machine learning ensembles, we developed a prognostic signature (CBG) from core nodes of transcriptional regulatory networks. We found that in tumor cells, epigenetic alterations consistently precede metabolic reprogramming and invasive adaptation. CD8+ T cell exhaustion followed a trajectory shifting from IRF7- to ZNF683-regulated states, while monocytes differentiated toward M1 and M2 macrophages orchestrated by NFIC/IL1B and CEBPD/GLI2. Intercellular communication networks showed a temporal progression from inflammation, through vascular remodeling, to immunosuppression dominance. The CBG signature demonstrated robust performance in independent cohorts, identifying SLC11A1 and SH3YL1 as antagonistic survival determinants. This study elucidates the dynamic molecular and immunological mechanisms underlying ccRCC grade progression, providing a robust framework for subtype-specific prognostication and precision therapeutic targeting.
bioinformatics2026-06-06v1Temporal Biodynamics: An AI Platform for Identification of Stage-Relevant Targets and Biomarkers
Natekar, P.; Yao, B.; Mohammad-Taheri, S.; Rusnak, A.; Gort-Freitas, N. A.; Fillatre, J.; Raymond, J. J.; Saksena, S. D.; Lipnick, S.; Sokolov, A.Abstract
Temporal modeling of disease progression is poised to revolutionize the process of target identification, leading to better characterization of and intervention at the critical early stages of chronic conditions. Temporal Biodynamics is an artificial intelligence-driven platform that leverages within-tissue heterogeneity in cross-sectional cohorts to assemble a single, continuous trajectory of transcriptomic changes between health and disease. We demonstrate that the platform enriches for known disease-associated genes and proteins by more than 50% over the conventional case-control comparisons. When compared to other published pseudotime methods, our models were better at extracting disease-relevant signals in the presence of confounders and co-morbidities. The Temporal Biodynamics platform enables rich profiling of a disease continuum, providing temporal insights that are otherwise hidden by the traditional discrete staging of chronic diseases. This includes detecting cascades of molecular events, providing clues regarding causality, and increasing confidence in blood-based protein biomarkers using tissue-based context.
bioinformatics2026-06-06v1Chromap Suite: an open-source single-binary platform for agentic multiomic RNA + ATAC profiling
Hung, L.-H.; Yeung, K. Y.Abstract
Background. Single-cell multiomic profiling of RNA expression and chromatin accessibility is now a standard tool for resolving regulatory state in single cells, but existing analysis toolchains have lagged. Cell Ranger ARC, the proprietary multiomic pipeline, uses a custom broad peak caller rather than the MACS3 narrow peaks that the ATAC field has consolidated on, and its restrictive end-user licence forbids redistribution of analysis pipelines that include it. A fully open-source, permissively-licensed alternative anchored on community-standard methods (Chromap for ATAC alignment and MACS3 for narrow peak calling) has been impractical to assemble because the two codebases are written in different languages with incompatible runtimes, leaving practitioners to chain them together with ad-hoc scripts. Results. We present Chromap Suite, the chromatin-accessibility side of an open-source multiomic stack built in support of the NIH Molecular Phenotypes of Null Alleles in Cells (MorPhiC) consortium's multiomic production pipeline. We extended Chromap with native BAM output and coordinate sorting, in-process narrow peak calling, optional Y-chromosome filtering, and native input from the compressed binary CBQ sequencing format alongside FASTQ, and hardened the result with a regression-test matrix that auto-validates the four upstream Chromap presets (bulk ATAC, scATAC, ChIP-seq, Hi-C). We reimplemented MACS3's narrow peak caller in portable C++ as libMACS3, byte-identical to MACS3 v3.0.3 and free of any Python interpreter dependency. Finally, we extracted Chromap's alignment and fragment-generation paths into a callable C++ library (libchromap) and embedded both libchromap and libMACS3 into STAR Suite, so that one STAR invocation runs alignment, peak calling, and cell calling for both RNA and ATAC modalities concurrently. To our knowledge this is the first true single-binary RNA + ATAC multiomic implementation. 
On the public 3K PBMC Multiome at 32 threads, the platform completes in 18 minutes 55 seconds wall time and 44.6 GB peak resident memory, against 40 minutes 4 seconds and 79.1 GB resident memory for Cell Ranger ARC v2.2.0 (a 2.12x wall speedup with 1.8x less peak memory), and produces 50,274 peaks that are byte-identical to MACS3 v3.0.3. To support deployment by both research scientists and the AI agents increasingly used in bioinformatics analysis, Chromap Suite ships a Model Context Protocol (MCP) server and a browser-based Launchpad driven by a shared set of composable YAML recipes that humans and agents drive the same way. Conclusions. Chromap Suite delivers a unified, freely redistributable multiomic pipeline that produces the MACS3 narrow peaks downstream ATAC analyses already rely on, with substantially lower wall time and memory than the proprietary alternative. The MIT- and BSD-3-licensed code carries no redistribution restrictions, the constituent libraries are independently embeddable in other open-source tools, and the MCP server plus Launchpad recipes make the platform straightforward to drive both by humans and by AI agents.
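The speedup and memory figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only the wall times and peak memory values stated in the abstract; the helper name `to_seconds` and the variable names are illustrative, not part of Chromap Suite.

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a minutes/seconds wall time into total seconds."""
    return minutes * 60 + seconds


# Wall times reported in the abstract for the 3K PBMC Multiome at 32 threads.
chromap_suite_s = to_seconds(18, 55)    # Chromap Suite / STAR Suite
cell_ranger_arc_s = to_seconds(40, 4)   # Cell Ranger ARC v2.2.0

# Peak resident memory reported in the abstract, in GB.
chromap_suite_gb = 44.6
cell_ranger_arc_gb = 79.1

wall_speedup = cell_ranger_arc_s / chromap_suite_s
memory_ratio = cell_ranger_arc_gb / chromap_suite_gb

print(f"wall speedup: {wall_speedup:.2f}x")
print(f"peak memory ratio: {memory_ratio:.1f}x")
```

Both ratios round to the values given in the text (2.12x and 1.8x).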
bioinformatics2026-06-06v1Predicting Clinical Phenotypes by Growth Curve Modeling of Transcriptomic Signatures during Disease Progression
Akhlaghi, M.; Ghasemi, E.; Ray, M. S.; Pyne, S.Abstract
High-throughput gene expression data analysis has benefited from many statistical tests of differential expression across two or more groups such as t tests, ANOVA, etc. Yet, in complex transcriptomic datasets such as longitudinal or repeated measures, few studies have addressed such key issues as group effects and temporal dependency in expression profiles with a single model that is both practically effective and theoretically grounded. In this study, we used the Growth Curve Model (GCM), a generalization of MANOVA, to identify differentially expressed longitudinal profiles of genes, and thus predicted the associated clinical phenotypes, of pediatric lupus during disease progression across two different racial groups. In particular, we detected a module of histone genes which was shown to be linked with lupus. Key words: Growth Curve Model; Trace test; Longitudinal gene expression; Pediatric lupus; Overrepresentation analysis; Clinical phenotypes
bioinformatics2026-06-05v2Cellpin enables reference-based imputation and denoising of spatial transcriptomes
Putze, P.; Lucarelli, D.; Wellappili, D.; Bahrami, M.; Luecken, M. D.; Theis, F. J.; Saur, D.Abstract
Spatially resolved transcriptomics enables gene expression profiling within tissue architecture, but targeted panels leave much of the transcriptome unmeasured and spatial artifacts such as RNA diffusion and segmentation errors introduce technical noise. These limitations necessitate computational imputation and denoising, yet existing methods typically incorporate spatial measurements during training, limiting scalability and risking the embedding of technology-specific artifacts into learned representations. To address this, we present cellpin, a variational autoencoder trained exclusively on single-cell RNA sequencing data, using teacher-student latent distillation and noise-simulating augmentations to jointly impute unmeasured genes and denoise spatial profiles without requiring cross-modality alignment. Benchmarked against six methods across multiple paired datasets, cellpin achieves superior held-out gene prediction while scaling efficiently to atlas-size references and multi-sample cohorts. In full-transcriptome Atera data, cellpin reduces residual spatial noise and improves cell-state resolution, providing a scalable and principled foundation for biological discovery from spatial transcriptomics data.
bioinformatics2026-06-05v1Mycol: A user-friendly app for automating analysis of microscopy images
Bradley, S. A.; Schiesaro, G.; Webel, H.; Skumantz, M.; Novillo-Sanjuan, O.; Panagou, A.; Lucena-Marin, R.; Jensen, E. D.; Di Pietro, A.; Acevedo-Rocha, C. G.Abstract
Microscopy image analysis is central to modern biology, yet many available platforms remain inaccessible to non-specialist users because they require advanced technical expertise, code-based workflows, extensive setup, or paid access. This creates a barrier for researchers who need reliable and fast image quantification but lack dedicated computational support. Here, we introduce Mycol, an open-source, machine-learning-assisted image analysis platform designed to be accessible and to run on standard laptops with minimal setup. Mycol supports end-to-end workflows in which users annotate microscopy images, perform human-in-the-loop fine-tuning of machine learning models for automated segmentation and classification, deploy the resulting models, quality-control their predictions, and quantitatively compare morphological and class frequency descriptors through a single intuitive interface. By combining machine-learning analysis with efficient quality control by humans, Mycol makes rapid and high-quality image quantification available to biologists without requiring specialist training. We demonstrate the utility of Mycol in diverse workflows using two economically important organisms, the crop pathogen Fusarium oxysporum and the blue mussel Mytilus edulis. Through Mycol, curated training sets were generated and high-quality segmentation and classification models were obtained in each case. Deploying these models through Mycol decreased the time requirements and increased the traceability of established cell-counting workflows, and facilitated a quantitative comparison of morphological parameters that reveals new patterns in early M. edulis larval development.
bioinformatics2026-06-05v1A Reproducible and Extensible Benchmark of Supervised Cell Type Annotation Tools for Cytometry Data
Kirk, F.; Sonnenholzner, A.; Herranz del Cerro, J.; Scheel Wegener, H.; Modvig, S.; Olsen, L. R.Abstract
High-dimensional cytometry technologies such as flow cytometry (FCM) and mass cytometry (CyTOF) are central to immunophenotyping in research and clinical practice. While manual gating remains the standard for cell population annotation, it is time-consuming, difficult to scale, and subject to inter-operator variability. Supervised annotation methods have emerged as a way of scaling manual annotation work, yet independent benchmarks for comparing these tools remain limited and quickly become outdated. This study presents a reproducible and extensible benchmark of supervised cytometry annotation tools implemented within the OmniBenchmark framework. Five supervised annotation methods were evaluated, spanning linear models, nearest-neighbor approaches, tree-based classifiers, mixture-rule systems, and deep learning, across eight publicly available datasets carefully selected to cover technologies, tissues, panel designs, and healthy and disease contexts. Using a sample-centric cross-validation design that reflects common reference-mapping scenarios, overall and per-population F1 scores, performance on rare populations, runtime, and robustness to reduced training set sizes were tested. Performance varied substantially across datasets and was not fully explained by dataset size or dimensionality, highlighting both operator dependence in annotation and the importance of biological context, cohort heterogeneity, and population imbalance. Less prevalent populations (<1%) remained a key challenge for most methods. Downsampling analyses showed that moderate reference sizes were often sufficient to achieve near-maximum performance. Rather than ranking methods, this benchmark provides a standardized and transparent framework for evaluating annotation tools under realistic deployment conditions. As a living resource, the OmniBenchmark implementation supports continuous integration of new datasets, tools, and metrics for both tool developers and end users annotating datasets. 
This enables ongoing, reproducible method comparison and informed tool selection for diverse cytometry applications.
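To make the per-population F1 metric mentioned above concrete, here is a minimal sketch in plain Python. It is not the benchmark's own code: the function names, the toy labels, and the use of an unweighted (macro) average as the "overall" score are all assumptions for illustration, since the abstract does not say how the overall score is aggregated.

```python
from collections import Counter


def per_population_f1(true_labels, predicted_labels):
    """Return {population: F1} computed from paired per-cell labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted population gains a false positive
            fn[truth] += 1  # true population gains a false negative
    scores = {}
    for pop in sorted(set(true_labels) | set(predicted_labels)):
        denom = 2 * tp[pop] + fp[pop] + fn[pop]
        scores[pop] = 2 * tp[pop] / denom if denom else 0.0
    return scores


def macro_f1(scores):
    """Unweighted mean of per-population F1 (an assumed aggregation)."""
    return sum(scores.values()) / len(scores)


# Hypothetical toy annotation: six cells, three populations, one rare (NK).
truth = ["T", "T", "T", "B", "B", "NK"]
pred = ["T", "T", "B", "B", "B", "T"]

scores = per_population_f1(truth, pred)
overall = macro_f1(scores)
print(scores, overall)
```

In this toy case the rare NK population is never recovered and scores zero, which pulls the macro average well below the scores of the two abundant populations. That is the kind of effect a per-population breakdown exposes and a single pooled accuracy can hide.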
bioinformatics2026-06-05v1Development of the Mitochondrial Base Editor Analysis Package (MitoBEAP).
Mutti, C. D.; Nash, P.; Silva-Pinheiro, P.; Minczuk, M.; Van Haute, L.Abstract
For many years, the genetic manipulation of mitochondrial DNA was largely hampered by inefficient delivery of nucleic acids to mitochondria. However, the development of mitoCBEs, such as mitochondrial cytosine base editors (DdCBEs), which catalyse C-to-T and G-to-A conversions, and more recently, mitoABEs, such as transcription-activator-like effector (TALE)-linked deaminases (TALEDs) enabling A-to-G and T-to-C conversion, has transformed this field. Generally, mitochondrial base editors exhibit high on-target efficiency and are straightforward to design and use. Nonetheless, unintended off-target effects cannot be overlooked and should be assessed consistently with each experiment, which can be challenging without specialised bioinformatic expertise. Here, we introduce Mitochondrial Base Editor Analysis Package (MitoBEAP), which, to our knowledge, is the first R package specifically designed to analyse next-generation sequencing data from base-edited mtDNA samples. The package facilitates the analysis of potential off-target effects, offers multiple visualisation options, and allows customisation of graphics and thresholds for calculations. As a proof of concept, this study demonstrates how MitoBEAP can be utilised to measure the efficiency of DdCBE treatment targeting human 12S rRNA, as well as to identify potentially harmful off-target conversions across the mtDNA.
bioinformatics2026-06-05v1Towards Generalizable Protein-ligand Co-folding with ACER
Vithayapalert, N.; Grisoni, F.Abstract
Predicting protein-ligand complex structures is a central challenge in drug discovery. While recent co-folding models such as AlphaFold-3 achieve accurate structure prediction, they fail to generalize to underexplored binding interfaces - systematically misplacing ligands, particularly for allosteric or structurally novel targets. To address this gap, we present ACER (Adaptive Co-folding via pocket Exploration and pose Ranking), a training-free framework that (a) enables co-folding models to systematically explore alternative binding pockets, and (b) leverages the discovered pockets to increase pose accuracy. Our method enables the efficient discovery of non-prevalent pockets without prior expert knowledge. ACER improves pocket discovery and pose accuracy on allosteric targets and structurally novel complexes, successfully modeling binding interfaces that are under-represented or absent from the training set. Our results demonstrate how improved sampling dynamics enhance the generalisability of co-folding models without retraining.
bioinformatics2026-06-05v1inGSEA: An Improved Method for Gene Set Enrichment Analysis Using a Weighted Integral Statistic
Zhang, Q.; Li, Q.Abstract
Gene Set Enrichment Analysis (GSEA) is one of the most popular methods for transcriptomic analysis, yet its statistical power is limited when the biological pathways exhibit heterogeneous or non-concordant expression patterns. We propose an improved GSEA method, integral-based GSEA (inGSEA). inGSEA introduces a novel enrichment score based on the Anderson-Darling weighted integral statistic. The new enrichment score enhances detection power for complex signals, particularly sparse and bidirectional ones, while the Cauchy combination of integral and classic maximum statistics provides robustness across diverse expression patterns. Extensive numerical studies demonstrate that inGSEA achieves superior power and well-calibrated false discoveries. Application to real-world datasets reveals biologically relevant pathways missed by the standard GSEA. inGSEA reduces the computational burden of permutation testing by employing a generalized gamma distribution to approximate the null distribution. inGSEA is accessible as a user-friendly web-based software tool (https://amss-stat.github.io/inGSEA).
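The Cauchy combination step mentioned above can be illustrated with the standard Cauchy combination rule for p-values. The sketch below is not inGSEA's implementation: the function name, the equal default weights, and the example p-values are assumptions, and the abstract does not state how the integral and maximum statistics are weighted. Input p-values are assumed to lie strictly between 0 and 1.

```python
import math


def cauchy_combination(p_values, weights=None):
    """Combine p-values with the standard Cauchy combination rule.

    Each p-value is mapped to a Cauchy-distributed quantity, the weighted
    sum is formed, and the sum is mapped back to a p-value through the
    standard Cauchy distribution function.
    """
    if weights is None:
        # Assumed default: equal weights that sum to one.
        weights = [1.0 / len(p_values)] * len(p_values)
    statistic = sum(
        w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, p_values)
    )
    return 0.5 - math.atan(statistic) / math.pi


# Hypothetical p-values for one gene set: one from an integral-type
# statistic and one from the classic maximum-type statistic.
p_integral = 0.004
p_maximum = 0.30
combined = cauchy_combination([p_integral, p_maximum])
print(combined)
```

The combined value is dominated by the smaller of the two inputs, so a gene set flagged strongly by either statistic stays significant. This is consistent with the robustness across expression patterns that the abstract attributes to the combination.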
bioinformatics2026-06-05v1A mitochondrial-immune axis drives the transcriptomic transition from brain aging to Alzheimer's disease
Pal, A.; Arif, S.; Karthikeyan, I.; Waisberg, E.; Guarnieri, J. W.Abstract
Aging is the primary risk factor for Alzheimer's disease (AD), yet the molecular transitions linking normal brain aging to neurodegeneration remain poorly defined. Here, we performed integrative bulk transcriptomic analyses across a multi-region mouse aging atlas, a human aging-to-AD cohort, and an independent human AD validation dataset. Aging is associated with a progressive, region-specific increase in transcriptional perturbation, with the entorhinal cortex and choroid plexus showing the most pronounced age-associated remodeling. Females develop more extensive late-stage remodeling than males, characterized by stronger immune activation and greater suppression of mitochondrial metabolic pathways. Across cohorts, aging drives a coordinated shift toward immune activation and suppression of oxidative phosphorylation and respiratory-chain programs that is amplified in AD. Aged brains occupy an intermediate molecular state between young and AD conditions, supporting a continuum model. Together, our findings define a sex-modulated mitochondrial-immune axis linking normal aging to AD and highlight early immune-metabolic changes as potential intervention targets.
bioinformatics2026-06-05v1