Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
misosoup: A metabolic modeling tool for identifying minimal microbial communities provides valuable insights into microbial ecology and biotechnological applications
Ochsner, N.; San Roman, M.; Jimenez-Fernandez, A.; Bonhoeffer, S.; Pascual-Garcia, A.
Abstract
Microbial survival and function often depend on metabolic interactions within communities. Therefore, a central question in disentangling microbial organization is determining which minimal groups of species are able to thrive in a given medium--referred to as 'minimal communities'. Answering this question is essential for understanding microbial distribution, enhancing laboratory cultivation, and designing synthetic communities (SynComs). Here, we introduce misosoup, a Python package for identifying minimal communities (MInimal Supplying cOmmunity Search). Through genome-scale constraint-based metabolic modeling, misosoup enables the systematic identification of communities that support microbial growth in environments where individual species fail to survive alone. We validate misosoup against experimentally verified minimal communities, demonstrating its ability to predict known cooperative interactions, cocultures, and consortia with biotechnological potential. We further illustrate the use of misosoup to investigate broad microbial ecology questions by applying it to a set of 60 marine microbes, revealing pervasive cross-feeding-driven niche expansion and showing how the detailed outputs provided by misosoup facilitate research on topics such as the identification of functional groups. In summary, misosoup provides a powerful tool for microbial ecology and community design, with potential applications in both research and biotechnological innovation.
bioinformatics 2026-05-25 v2
Read-Consistent Minimum Unique Substrings: A Parameter-Free, Linear-Time Framework for Genomic Sequence Representation
Adu, A. F.; Menkah, E. S.; Amoako-Yirenkyi, P.; Pandam Salifu, S.
Abstract
Fixed-length k-mers have been the standard unit of genomic sequence representation for over two decades. However, they impose a uniform resolution on genomes whose complexity varies across loci. We introduce Minimum Unique Substrings (MUSs), variable-length sequence units defined by the local uniqueness structure of the genome rather than predefined parameters. We first extend MUS theory from single contiguous strings to fragmented sequencing reads by formalizing a definition of uniqueness that is consistent with these reads. Next, we present a linear-time extraction algorithm that runs in O(n) time using the generalized suffix tree. In this context, we introduce outpost nodes, topological anchors within the suffix tree that accurately localize MUS boundaries in fragmented sequencing reads. Finally, we empirically characterize the distributions of MUS lengths in E. coli K-12 and human chromosome 11. Our results demonstrate that MUS lengths naturally mirror genomic architectural complexity without the need for user-defined parameters. Notably, the MUS framework achieves 100% unique positional coverage with a mean length of only 36.08 bp. In contrast, fixed-length k=61 coverage reaches only 69.4%, despite being 1.69 times the MUS average. We show that increasing k from 21 to 61 triples the unique k-mer count from 2.35M to 6.86M. This k-paradox occurs because repetitive sequences are fragmented into spuriously unique tokens without improving true genomic resolution. MUSs escape this artifact entirely by adapting dynamically to local sequence complexity. These results establish MUSs as a biologically grounded, computationally tractable foundation for parameter-free genome assembly, repeat characterization, and alignment-free genomics.
bioinformatics 2026-05-25 v2
MSLipidMapper: a pathway-centered lipidome analysis environment linking lipid class, acyl-chain subsets, and multi-omics data
Oka, T.; Nishida, K.; Harayama, T.; Tsugawa, H.
Abstract
Lipids exhibit extensive structural diversity arising from variation in lipid classes, subclasses, and acyl-chain compositions, making systematic interpretation of lipidomics data challenging. Although untargeted lipidomics enables the quantification of hundreds to thousands of lipid molecular species, downstream analyses often treat pathway-level summaries, molecular-species visualization, structural subsetting, and multi-omics interpretation as separate steps. Here, we present MSLipidMapper, an R/Shiny-based lipidomics data exploration environment for pathway-centered and structure-aware analysis of annotated lipidomics datasets. MSLipidMapper reconstructs annotated lipid peak tables as Bioconductor SummarizedExperiment objects, thereby organizing quantitative lipid abundance values, sample metadata, lipid subclass annotations, and parsed acyl-chain features within a unified data structure. Lipid molecular species are summarized on static, curated lipid metabolic pathway maps at the subclass level while retaining direct links to the underlying molecular species and acyl-chain annotations. This design enables users to inspect molecular-species patterns underlying each pathway node, define lipid subsets based on structural features such as specific acyl chains, and re-project these subsets onto the same pathway context. Gene or protein expression data can also be overlaid on pathway-associated reactions to support multi-layer interpretation of lipid metabolism. The program is showcased using publicly available aging lipidome datasets of mice, illustrating how subclass-level pathway summaries can be connected to molecular-species heatmaps, acyl-chain-defined subsets, and transcriptome or proteome information.
bioinformatics 2026-05-25 v1
Cell-type-specific transposable element transcription tracks symbiosis and calcification programs in the reef-building coral Acropora hemprichii
Zhong, H.; Konciute, M. K.; Hu, J.; Menzies, J.; Cui, G.; Aranda, M.
Abstract
Transposable elements (TEs) are pervasive components of eukaryotic genomes and major drivers of genome evolution, yet their contribution to cell-type-specific regulatory landscapes remains poorly understood, particularly in non-model marine invertebrates. Here, we integrated single-cell RNA sequencing with pseudo-aligned TE expression profiling to examine how TE transcription relates to cell type identity in the reef-building coral Acropora hemprichii. We constructed a cell atlas comprising 4,716 cells across eight major cell types. Notably, TE expression alone was sufficient to accurately resolve all major cell types, indicating that cell-type-specific transcriptional states are robustly reflected in TE activity patterns. We identified 9,759 expressed TEs, of which 333 exhibited strong cell-type-specific activity. These differentially expressed TE features were associated with nearby expressed genes and transcription factor loci, suggesting a relationship between cell-type-specific TE activity and local gene regulatory programs. Genes associated with cell-type-specific TEs were enriched for core coral physiological processes, including calcification, metabolite transport, and symbiosis-related functions. Together, these findings indicate that TE transcription is structured along coral cell-type identity and physiological specialization. Our study provides a single-cell-resolved framework for investigating TE-gene relationships in early-diverging metazoans and a community resource for future functional interrogation in reef-building corals.
bioinformatics 2026-05-25 v1
SpatialClaw: A Memory-Augmented Autonomous Ecosystem for Spatial Omics Analysis
Du, G.; Lan, O.; Wei, X.; Wu, Y.; Meng, G.; Wu, J.; Li, Z.; Li, X.; Shang, X.
Abstract
While the expansion of spatial omics has revolutionized our ability to dissect tissue architecture, the accumulation of incompatible computational methods has heavily fragmented end-to-end analysis, rendering complex workflows irreproducible. Generic conversational agents lack the domain-specific precision necessary to navigate the intricate biological pipelines. To overcome this, we present SpatialClaw, a memory-augmented autonomous ecosystem to unify spatial omics analysis under a single natural-language interaction. SpatialClaw integrates 30 specialized skills, spanning raw data preprocessing, spatial domain identification, deconvolution, spatially variable gene detection, cell-cell communication analysis, multi-sample and cross-modality integration. Distinct from existing agents, SpatialClaw introduces a graph-based persistent memory architecture that stores dataset metadata, analysis lineage, biological insights, and user preferences as versioned nodes and edges across three hierarchical layers (Session, Episodic, and Semantic), governed by a deterministic promotion policy. A Memory-Augmented Reasoning (MAR) Operator bridges the memory store and the main agent, synthesizing retrieved experiences into task-specific guidance for each query. In rigorous benchmarking spanning three memory-sensitive scenarios across 10 spatial-omics skills, SpatialClaw outperforms both a standard large language model and the memory-only configuration. Furthermore, we demonstrate its robust biological utility by dissecting the complex tumor microenvironment of a 15-section human triple-negative breast cancer cohort. In merely three conversational turns and with zero direct scripting, SpatialClaw executes a comprehensive end-to-end workflow, yielding standardized output bundles. 
Ultimately, by synergizing comprehensive analytical tools with structured persistent memory, SpatialClaw elevates spatial omics from disjointed computational stitching to a fully traceable, reproducible, and self-improving discovery ecosystem.
bioinformatics 2026-05-25 v1
E-InfertilityTest: An Explainable AI Framework for Male Infertility Assessment
Das, G.; Ghosh, B.; Ghosh, Z.
Abstract
Male infertility has emerged as a significant concern in modern society, with genetic defects among its major underlying causes. These defects negatively impact sperm motility and morphology, leading to conditions such as Asthenozoospermia (reduced sperm motility), Teratozoospermia (abnormal sperm morphology) and sometimes Asthenoteratozoospermia (both motility and morphology defects). Assisted reproductive technologies (ART), such as in-vitro fertilization (IVF), offer a potential solution for such cases but have a low success rate. Classical semen analysis provides only a phenotypic snapshot without revealing the fertilizing potential of the sperm. Hence, to screen the functional sperm population and to gain deeper insight into the causes of the aberrant sperm population, it is important to study their genetic profile. In this work, we performed a meta-analysis comparing transcriptomic data of infertile sperm from Asthenozoospermia and Teratozoospermia patients with that of fertile sperm from normal individuals. We then screened a signature gene set, which was used to develop a prediction model named Explainable Infertility Test (E-InfertilityTest) that distinguishes fertile from infertile sperm at a preliminary level. For each prediction, the model also reports the set of genes that play a dominant role in that prediction, thereby providing a patient-specific dominant gene expression profile responsible for the aberration. This work warrants future validation experiments to substantiate the model's performance in a clinical setting. Users can access E-InfertilityTest as a standalone tool on GitHub. GitHub link: https://github.com/zglabDIB/einfertility.git
bioinformatics 2026-05-25 v1
OAC-PCA: orthogonal adjustment of confounding effects in principal component analysis for metabolomics data mining
Kurata, M.; Yamamoto, H.; Tsugawa, H.
Abstract
Principal component analysis (PCA) is widely used in mass spectrometry-based metabolomics for exploratory data mining. Statistical testing of loading values can extract metabolite features associated with score patterns, but this approach requires principal components (PCs) to remain orthogonal while loadings are defined as correlation coefficients between PC scores and variables. Adjustment for Confounding PCA (AC-PCA) was previously developed to explore biologically meaningful components from data matrices affected by biological and technical confounders. However, AC-PCA does not simultaneously ensure PC orthogonality and a correlation-coefficient definition of loadings, limiting the statistical interpretation of its loadings. Here, we reformulated AC-PCA as Orthogonal Adjustment for Confounding effects in PCA (OAC-PCA). In OAC-PCA, PCs remain orthogonal, and loadings retain this correlation-coefficient interpretation. These properties enable statistical testing of metabolite associations while accounting for confounding effects.
bioinformatics 2026-05-25 v1
CardioSeg: An interactive platform for integrated spatial transcriptomics data and nuclear morphological analysis of mouse heart tissue
Kancherla, S. K.; Melleby, A. O.; Aronsen, J. M.
Abstract
Motivation: Spatial transcriptomics enables gene expression profiling within its spatial context in intact tissue sections. Existing workflows for segmentation, spatial annotation, and morphological analysis are often code-heavy and poorly integrated. This limits the joint analysis of spatial gene expression at single-nucleus resolution and the corresponding nuclear morphology. Results: We present CardioSeg, a Python-based graphical interface for nuclei segmentation, spatial annotation, and interactive analysis of myocardial histology. CardioSeg integrates multi-threshold Cellpose-based segmentation with nuclei-level transcriptomic mapping and interactive visualisation. CardioSeg achieved robust segmentation performance across heterogeneous imaging conditions, with union-based inference outperforming the individual parameter configurations. For cell-type annotation, CardioSeg achieved 0.88 in accuracy and 0.85 in balanced accuracy against reference labels, while also resolving spatial heterogeneity not captured by spot-based approaches. Application to pressure-overloaded cardiac tissue revealed uncharacterized intra-ventricular variations in nuclear morphology, indicating the potential of CardioSeg to couple disease-specific nuclear morphology with the associated transcriptomics. Availability and Implementation: Source code is available at GitHub under the CC BY 4.0 license (https://github.com/SrijanKancherla/CardioSeg). A versioned release was archived in Zenodo (DOI: 10.5281/zenodo.20177171). Keywords: Spatial transcriptomics, nuclei segmentation, cardiac histology, single-cell annotation, bioimage analysis, interactive visualization
bioinformatics 2026-05-25 v1
Highly Constrained Kinetic Models for Single-Cell Gene Expression Analysis
Cho, H. J.; Bohrer, C. H.; Trzaskoma, P.; Kim, J. M.; Pekowska, A.; Casellas, R. C.; Patro, R.; Chow, C. C.; Larson, D. R.
Abstract
Advances in single-cell RNA sequencing (scRNA-seq) and high-resolution imaging techniques, such as single-molecule tracking (SMT) of RNA and transcription factors, allow researchers to quantitatively explore dynamics and variation, but these data types have never been integrated into a single coherent model. In this study, we propose a kinetic model that takes in multiple data types, including steady-state and time-resolved datasets, to simulate and fit stochastic models of gene transcription to experimental data. We find that 3-state models provide an essential improvement over the widely used 2-state model for most genes and have the property of kinetic proofreading, which we argue is advantageous in the cellular context. We further identify two dimensionless quantities derived from the rate equations which are broadly conserved across genes. Finally, we extend this model to scRNA-seq datasets to infer kinetic rates under defined perturbations and reveal biochemical insight into the mechanism of action of transcription factors.
bioinformatics 2026-05-25 v1
HiCPotts: An R/Bioconductor package to identify significant interactions in chromosome conformation capture data and model sources of biases.
Osuntoki, I. G.; Harrison, A. P.; Dai, H.; Bao, Y.; Zabet, N. R.
Abstract
Motivation: Chromosome Conformation Capture methods, including Hi-C, micro-C or Capture-C, are used to map chromatin interactions genome-wide. Most existing computational methods do not account for sources of bias (such as DNA accessibility, GC content or TE content) in the data. Results: We previously developed ZipHiC, a Bayesian method based on the hidden Markov random field (HMRF) model and Approximate Bayesian Computation (ABC) that uses a zero-inflated Poisson distribution to model the noise, signal and false signal of the data, and showed that this approach was able to detect biases from DNA accessibility, GC content and TE content in both Hi-C and micro-C data. Here, we present HiCPotts, another Bayesian method based on the HMRF model and ABC that instead uses a zero-inflated Negative Binomial distribution to model the noise and signal of the data. We systematically show that HiCPotts reduces false positives and increases recovery of true interactions compared to ZipHiC, and also compared to other methods such as FastHiC, Juicer and HiCExplorer. Most importantly, we provide an R/Bioconductor package that allows modelling the noise, signal and false signal using various distributions such as the zero-inflated Negative Binomial (ZINB) and the zero-inflated Poisson distribution (ZIP). Availability: https://bioconductor.org/packages/HiCPotts/
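To make the two count models named in this abstract concrete, the sketch below writes out the probability mass functions of the zero-inflated Poisson (ZIP) and zero-inflated Negative Binomial (ZINB) distributions. It is written in Python for illustration only and is not the HiCPotts API; the function names and the mean/dispersion parameterization (`mu`, `size`, mixing weight `pi`) are assumptions.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Poisson probability of count k with rate lam > 0."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def neg_binomial_pmf(k: int, mu: float, size: float) -> float:
    """Negative Binomial pmf with mean mu and dispersion size
    (variance = mu + mu**2 / size)."""
    p = size / (size + mu)
    log_pmf = (
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * math.log(p) + k * math.log1p(-p)
    )
    return math.exp(log_pmf)


def zero_inflated(pmf_at_k: float, k: int, pi: float) -> float:
    """Mix a point mass at zero (weight pi) with a count distribution."""
    return pi * (k == 0) + (1.0 - pi) * pmf_at_k


if __name__ == "__main__":
    for k in range(4):
        zip_p = zero_inflated(poisson_pmf(k, 3.0), k, 0.3)
        zinb_p = zero_inflated(neg_binomial_pmf(k, 3.0, 2.0), k, 0.3)
        print(k, round(zip_p, 4), round(zinb_p, 4))
```

With the same mean, the Negative Binomial component has a larger variance than the Poisson one, which is the usual reason for preferring ZINB on overdispersed contact counts.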
bioinformatics 2026-05-25 v1
Decoding Condition-Specific Cellular Crosstalk in Spatial Omics via Bilinear Edge Classification
Karin, J.; Friedman, R.; Nitzan, M.
Abstract
Tissues are multicellular structured communities whose function emerges from a combination of individual cellular characteristics along with their corresponding spatial configuration, affecting their interactions and response patterns. During processes such as disease progression or aging, tissues can undergo structural reorganization, including changes in co-localization of different cell types, assembly or destruction of functional niches, and disruption of intercellular communication axes. Such changes can manifest primarily in the spatial reorganization of cells rather than in the transcriptional states of individual cells. While computational tools for spatial transcriptomics have made significant progress in characterizing tissue architecture, most approaches for characterizing changes in tissue states across biological conditions operate at the level of individual cells or rely on discrete cell type labels, thus limiting the ability to detect coordinated transcriptional changes between neighboring cells that distinguish one condition from another. We present Casei, a bilinear classification framework operating on cellular proximity graphs, which directly models condition-specific cell-cell interactions in spatial omics data by focusing on interactions (edges), rather than cells (nodes), as the fundamental unit of biological inference. To capture such condition-specific signals, we leverage a model whose inductive bias aligns with cellular interactions through coordinated gene-gene relationships of neighboring cells. Casei enables the discovery of condition-associated multicellular interactions and spatial expression programs, and characterizes the loss of multicellular function and structure. 
Applied to mammalian liver fibrosis, atherosclerosis, and brain aging, Casei reveals biologically meaningful spatial reorganization, including the shift from endothelial- to macrophage-dominated networks in atherosclerotic plaques, disruption of hepatocyte zonation in fibrosis, and oligodendrocyte-microglia crosstalk in aging white matter.
bioinformatics 2026-05-24 v2
BioGraphX-RNA: A Universal Physicochemical Graph Encoding for Interpretable RNA Subcellular Localization Prediction
Saeed, A.; Abbas, W.
Abstract
RNA subcellular localization is a critical determinant of cellular function. However, current computational approaches often operate as "black boxes," overlooking the complex interplay among sequence, structure, and physicochemical interactions that govern RNA localization. Building upon BioGraphX, originally developed for proteins, we introduce BioGraphX-RNA, a universal physicochemical graph-encoding framework that provides structure-informed encoding by translating primary nucleotide sequences into multi-scale interaction graphs using explicit biophysical rules. When combined with frozen RiNALMo embeddings through an interpretable gated fusion layer, BioGraphX-RNA outperforms DeepLocRNA and, uniquely, quantifies the relative contributions of sequence and structure for each RNA type, achieving macro-AUROC improvements of 0.0172 for mRNA, 0.0545 for miRNA, and 0.0422 for lncRNA on human datasets. In a blind cross-species prediction task on mouse data, the model demonstrates promising zero-shot transfer performance, suggesting that biophysical localization cues are evolutionarily conserved. Notably, the BioGraphX graph-only model outperforms RNAfold-derived secondary-structure graphs for miRNA (macro-AUROC 0.9482 vs. 0.8787), validating the structural proxy hypothesis under the most stringent possible conditions. Explainability analyses further reveal RNA-type-specific structural dependencies. In particular, miRNA exhibits a near-equilibrium balance between sequence and structure. SHAP-based interpretations provide mechanistic insights, identifying patterned GC content as a potential nuclear retention signal and an anti-structure profile as indicative of exosome-mediated targeting. These advances are achieved with only 2.05 million trainable parameters, aligning with Green AI principles. 
BioGraphX-RNA therefore demonstrates that explicitly integrating biophysical constraints into graph-based encodings enables accurate, generalizable, and interpretable predictions, advancing structure-aware RNA biology and laying a foundation for precision medicine.
bioinformatics 2026-05-24 v2
Interpreting Omics Data Analysis with Large Language Models for Disease Target and Drug Discovery
XU, Z.; Chen, W.; Ren, W.; Xu, T.; Amaechin, S.; Khan, R.; Chen, Y.; Province, M.; Payne, P.; Li, F.
Abstract
In biomedical scientific discovery, synthesizing prior knowledge from the literature is an essential component of interpreting numerical omics data analyses for disease target identification and drug discovery. Large language models (LLMs) alone can rapidly retrieve disease mechanisms from biomedical text, but text-only outputs are general and unreliable for target and drug prioritization without cohort-specific quantitative evidence. Herein, we propose a provenance-aware Text-to-Target framework that couples schema-constrained multi-model LLM retrieval with numeric omics data analysis. The key design is a modality-aware fusion step: candidates are partitioned into overlap-supported anchors, retrieval-only hidden hubs, and network-emergent novelty nodes, then propagated into staged hypothesis and strategy generation under topology constraints. We evaluate the model in Alzheimer's disease (AD) and pancreatic ductal adenocarcinoma (PDAC). In PDAC, the workflow produced a balanced 75-gene candidate universe and a 23-strategy portfolio, with significant DepMap support at both target level and strategy level. In AD, stricter candidate controls yielded a compact 34-gene universe and 14 strategies; under an expanded CRISPRbrain registry, both target-level axes were significant, with strong strategy-level enrichment. Across both diseases, final strategies preserved full provenance closure to the candidate pool, enabling end-to-end auditability from retrieval artifacts to validation outputs. These results support a transferable discovery architecture in which omics evidence constrains biological activity, LLM retrieval expands mechanistic search space, and network-aware fusion preserves interpretability. The framework provides a reproducible basis for dual-disease target prioritization and motivates continuous literature-mechanism concordance with agentic evidence-refresh loops.
bioinformatics 2026-05-23 v2
Time-Resolved Phosphoproteomics-Guided BFS Beam Search Reveals Cell-Type-Specific EGFR Signaling Architectures and SHP2 Inhibitor-Induced Pathway Rewiring
Lee, H.; Lee, G.
Abstract
Adaptive resistance to kinase- and phosphatase-targeted therapies is frequently driven by pharmacological rewiring of intracellular signaling networks, yet systematic computational methods for quantifying cross-condition pathway changes from phosphoproteomic data remain limited. We present an algorithmic framework for reconstructing cell-type-specific signaling pathways from time-resolved phosphoproteomic data using Breadth-First Search (BFS) combined with interaction-weight-guided Beam Search over the STRING protein-protein interaction database. The framework integrates the data-adaptive Median Absolute Deviation (MAD)-based binary-state assignment, BFS Beam Search traversal anchored to experimentally supported active nodes at zone boundaries and terminals (with STRING-inferred bridge proteins permitted as intermediate connectors), and a post-enumeration path cleaning pipeline that produces biologically interpretable, acyclic signaling routes (with edge-level validation against Human Protein Atlas-based cell-line expression data), with real-time access to the STRING REST API (v12.5), enabling network construction without local database installation. Benchmarked across five published phosphoproteomic datasets spanning three cell types (HeLa, MDA-MB-468, EGFR Flp-In HEK293T), the framework captures cell-type-specific EGFR signaling architectures and quantifies drug-induced pathway rewiring. Applied to MDA-MB-468 cells under three pharmacological conditions, SHP2 inhibition abolished PTPN11-mediated pathways and shifted first-hop effector distribution toward ERBB3 (21.5% to 25.2% of paths) and PIK3CA engagement (9.2% to 14.3% of paths), while SHP2 inhibitor washout revealed partial PTPN11 recovery with ERBB2 re-emerging as the dominant first-hop effector (30.3% of paths). 
This framework provides a systematic, reproducible approach for transforming time-resolved phosphoproteomic measurements into mechanistically interpretable signaling hypotheses, with direct applicability to drug resistance modeling and combination therapy design.
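The abstract names a Median Absolute Deviation (MAD)-based binary-state assignment without giving its rule. The sketch below shows one common way such a data-adaptive threshold can be built, purely as an assumption: a measurement is called active when it exceeds the median by more than `k` scaled MADs. The function name, the cutoff `k = 3`, and the 1.4826 normal-consistency factor are illustrative choices, not details taken from the paper.

```python
from statistics import median


def mad_binary_states(values: list[float], k: float = 3.0) -> list[int]:
    """Return 1 where a value exceeds median + k * scaled MAD, else 0.

    The cutoff k and the 1.4826 scaling are assumptions for illustration.
    """
    center = median(values)
    mad = median(abs(v - center) for v in values)
    threshold = center + k * 1.4826 * mad
    return [1 if v > threshold else 0 for v in values]


if __name__ == "__main__":
    # Hypothetical log fold changes for six phosphosites at one time point.
    log_fold_changes = [0.1, -0.2, 0.0, 0.3, -0.1, 4.0]
    print(mad_binary_states(log_fold_changes))
```

Because the median and MAD are robust to outliers, the single strongly changed site does not inflate the threshold that is used to detect it.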
bioinformatics 2026-05-23 v2
Reproducible transcriptional modules define glioblastoma ecosystems across independent cohorts.
Seo, H.
Abstract
Glioblastoma (GBM) comprises a complex ecosystem of malignant, immune, vascular and neural transcriptional states. However, it remains difficult to determine which gene expression programmes are reproducibly recovered across independent cohorts and profiling platforms, because programme-level analyses are sensitive to cohort composition, technical context and factorization rank. Here, we analyzed three public GBM datasets--GLASS and IVYGAP bulk RNA-seq cohorts and the HEILAND Visium spatial transcriptomics cohort--to examine whether deconvolution-derived programmes could be organized into shared cross-cohort modules. Integrating 279 programmes inferred by consensus non-negative matrix factorization identified eight transcriptional communities, including myeloid immune microenvironment, neuronal and synaptic, oligodendrocyte and myelin, developmental, tumour-associated mesenchymal or hypoxic, proliferative, and ciliated or ependymal-like modules, as well as one cohort-restricted community. Community activity showed coherent associations with independent annotations: the myeloid community correlated with ESTIMATE immune score and inversely with tumour purity; the oligodendrocyte and myelin community was reduced in recurrent tumours; and ciliated or ependymal-like and neuronal communities showed modest exploratory associations with overall survival. Spatial projection onto Visium data provided qualitative support for the histological coherence of several modules, while also highlighting the limits of spot-level interpretation. Together, these results provide a proof-of-concept that cross-cohort integration can recover recurrent transcriptional structure across heterogeneous GBM datasets and offer an interpretable framework for comparing gene expression programmes while preserving cohort-specific signal and uncertainty in biological assignment.
bioinformatics 2026-05-23 v1
Atlas-Level Single-Cell and Spatial Transcriptomics Data Integration via PRIME
Wu, X.; Wang, X.; Wang, J.; Wan, S.
Abstract
Single-cell RNA sequencing (scRNA-seq) and spatial transcriptomics (ST) have enabled atlas-scale cellular cartography, with consortium efforts now assembling millions of cells across diverse tissues, donors, and technologies to build comprehensive references for cell identity and disease mechanism. Yet the scientific value of these atlases hinges on robust computational integration across heterogeneous data sources. Unlike pairwise batch correction, atlas-level integration must jointly reconcile heterogeneous and often hierarchically nested batch effects across many datasets whose cell-type compositions are highly imbalanced, all while preserving subtle biological variation and remaining computationally tractable at the scale of millions of cells. Existing approaches often prioritize either batch mixing or preservation of local biological structure, and most cannot natively accommodate spatial coordinates. Here we introduce PRIME (Projection-based Robust Integration via Manifold Embedding), an ensemble integration framework that combines random-projection-based consensus anchoring, graph-Laplacian correction, and optional spatial-neighborhood regularization. Across multiple random projections of the expression manifold, PRIME uses consensus voting to keep only cell pairs that are repeatedly matched, reducing false anchors caused by projection-specific distortions. For ST, PRIME couples this expression-based anchor graph with a coordinate-derived spatial neighborhood graph in a unified graph-Laplacian objective with a closed-form solution, enabling simultaneous cross-batch alignment and local spatial coherence. Based on extensive benchmarking spanning diverse datasets, we show that PRIME consistently outperforms state-of-the-art methods in both batch correction and biological conservation across scRNA-seq and ST integration scenarios and downstream tasks including trajectory inference, spatial-domain preservation, and perturbation-response analysis. 
In particular, when integrating a human hematopoiesis benchmark spanning eight donors and approximately 33,000 cells, PRIME preserves biologically coherent developmental trajectories. It also maintains cortical laminar architecture across dorsolateral prefrontal cortex sections in an ST dataset and recovers known drug-target relationships in a perturbation atlas of more than 1 million cells while suppressing batch-associated confounders. Together, these results establish PRIME as a versatile and scalable framework for atlas-level integration of scRNA-seq and ST across diverse biological applications.
bioinformatics 2026-05-23 v1
Widespread use of invalid statistical tests in biomedical machine learning
Zeng, T.; Li, H.; Zhang, S.; Tan, Y. Q.; Tian, F.; Orban, C.; An, L.; Che, W.; Cheng, J.; Chong, J. S. X.; Dehestani, N.; Dong, Z.; Li, X.; Li, Z.; Lim, M. J. R.; Lin, Y.; Ling, Q.; Ling, Z.; Low, X. Z.; Mansour L., S.; Ng, K. K.; Nguyen, T. T.; Ooi, L. Q. R.; Pande, S.; Qian, X.; Ruan, J.; Wang, Z.; Xie, Y.; Zhang, C.; Zhang, Y.; Patil, K.; Parkes, L.; Dhamala, E.; Chopra, S.; Zalesky, A.; Holmes, A.; Eickhoff, S.; Zhou, J. H.; Renaud, O.; Dosenbach, N.; Kording, K. P.; Bzdok, D.; Nichols, T.; Yeo, B. T. T.
Abstract
Machine learning is accelerating biomedical research. Cross-validation is widely used to compare predictive performance, not only to benchmark algorithms but also to inform scientific applications such as ranking biomarkers. However, prediction performance estimates across cross-validation folds are not independent. Standard tests for comparing prediction performance (e.g., paired t-test) assume independence and can therefore inflate false positive rates. In a PRISMA-guided meta-analysis of 210 studies (impact factor ≥15, 1 June 2020 to 1 June 2025), we find that 97% ignored fold dependence when comparing prediction performance. This problem is ubiquitous across scientific fields and unaffected by impact factor, rigor-promoting policies, or open science practices. Simulations across 420 scenarios spanning four diverse datasets show that ignoring fold dependence leads to invalid false positive control in most settings. Repeated cross-validation further compounds this problem, with false positive rates rising toward 100% as the number of repetitions grows. Existing fold-dependence-aware tests rely on strong assumptions because the variance of fold-level statistics and the between-fold correlation cannot be disentangled under standard cross-validation. We therefore propose the SHARP (Split-HAlf RePeated) test, a simple modification to standard cross-validation that enables direct estimation of variance and correlation. Benchmarked against 12 tests, SHARP provides the best overall balance of false-positive control, statistical power, and confidence-interval calibration across simulation schemes. We conclude by providing best practices and reporting guidelines for valid model comparison inference in biomedical machine learning and beyond.
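To illustrate why ignoring fold dependence matters, the sketch below contrasts the naive paired t statistic with the Nadeau-Bengio corrected resampled t statistic, one well-known fold-dependence-aware alternative. This is not the SHARP test, whose construction the abstract does not specify; the function name and the example numbers are hypothetical. The correction was originally derived for repeated random subsampling, so applying it to k-fold differences is itself an assumption.

```python
import math
import statistics

def paired_t_statistics(diffs, n_train, n_test):
    """Return (naive_t, corrected_t) for per-fold performance differences.

    naive_t treats the folds as independent samples.
    corrected_t inflates the variance by (1/J + n_test/n_train), following
    the Nadeau-Bengio corrected resampled t-test. Both statistics are
    compared against a t distribution with J - 1 degrees of freedom.
    """
    j = len(diffs)
    mean_diff = statistics.fmean(diffs)
    var_diff = statistics.variance(diffs)  # sample variance (J - 1 denominator)
    naive_t = mean_diff / math.sqrt(var_diff / j)
    corrected_t = mean_diff / math.sqrt((1.0 / j + n_test / n_train) * var_diff)
    return naive_t, corrected_t

# Hypothetical accuracy differences (model A minus model B) over 5 folds,
# with 80 training and 20 test samples per fold.
naive_t, corrected_t = paired_t_statistics(
    [0.02, 0.01, 0.03, 0.02, 0.02], n_train=80, n_test=20
)
```

With five folds and a 20/80 test-to-train ratio, the corrected statistic is smaller than the naive one by a factor of sqrt(0.45 / 0.2) = 1.5, so a difference that looks significant under the naive test may not survive the correction.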
bioinformatics2026-05-22v2IDEAL-Age: an interpretable deep learning framework for single-cell resolution profiling of immunological aging
Xu, Y.; Luo, Z.; He, K.; Zhang, F.; Zhang, Y.; Wang, J.; Wen, H.; Li, Y.; Han, D.Abstract
Immunosenescence increases susceptibility to infection and reduces vaccine responsiveness, yet bulk transcriptomic clocks obscure the cellular heterogeneity underlying this process. Here, we present IDEAL-Age, an interpretable deep learning framework that operates directly on single-cell PBMC transcriptomes. Benchmarking against 31 methods across independent cohorts demonstrates superior predictive performance. The framework's interpretability uncovers linear and non-linear transcriptomic dynamics that reveal phase-specific physiological transitions, and identifies pro-youthful or pro-aging cellular contributions. Application to systemic lupus erythematosus (SLE) reveals accelerated immunological aging driven by interferon-associated monocyte shifts. IDEAL-Age establishes a high-resolution computational framework for deciphering systemic immune aging.
bioinformatics2026-05-22v2SpatialCCCbench: Standardized Metrics for the Systematic Evaluation of Spatial Cell-Cell Communication Methods
Dai, W.Abstract
Spatial transcriptomics (ST) enables transcriptome profiling with preserved spatial context, providing spatial dimensions that are essential for understanding complex intercellular signals in tissue architecture. ST-based cell-cell communication (CCC) tools integrate spatial and molecular information to decipher intercellular interactions from a spatially informed perspective. Despite the rapid evolution of many CCC computational tools, a systematic assessment of their performance in handling ST-specific heterogeneity, using spatial information efficiently, and remaining robust to technical or biological noise is still lacking. To address this gap, SpatialCCCbench incorporates classification accuracy, spatial signal features, robustness, and user-friendliness, aiming to guide the selection of optimal CCC inference tools across diverse spatial biology contexts. SpatialCCCbench systematically evaluates the scenario-specific applicability of ST-based CCC tools. It helps users select tools according to their analytical objectives and provides a practical benchmark for future method development.
bioinformatics2026-05-22v1Min-frame transformation enables more sensitive viral genome alignment
Doughty, R. D.; Banerjee, A.; Kille, B.; Warnow, T.; Treangen, T. J.Abstract
Motivation: Maximal unique matches (MUMs) are a fundamental primitive in genome comparison, where they serve as high-confidence anchors for downstream multiple genome alignment. However, because MUMs rely on exact string matching, their effectiveness degrades with increased genome divergence and larger sets of genomes, inhibiting their ability to recover long homologous regions and reducing the number of base pairs covered by the multiple genome alignment. Additionally, existing approaches that improve robustness to mutation, such as spaced seeds or translated alignment methods, introduce trade-offs in specificity, scalability, or computational complexity. Methods: To address this gap, we introduce the Min-Frame Transformation (MFT), a deterministic encoding of nucleotide sequences to sequences over a transformed alphabet that preserves the coordinate structure of the original sequence. At each position, the MFT selects a k-mer from a local window according to a fixed global ordering and assigns it a character in the transformed alphabet via a predefined mapping. This process captures local sequence context and can mask the impact of mutations, increasing the likelihood that homologous regions remain detectable as exact matches. The resulting transformed sequences can be indexed using standard string data structures, such as suffix arrays and suffix trees, enabling efficient extraction of MUMs without modifying existing algorithms. Impact: The MFT is a novel computational approach for improving the robustness of MUM-based seeding for genome alignment by producing longer and more contiguous matches that span a greater fraction of the genome, leading to improved alignment coverage and SNP recall. Altogether, these gains have the potential to benefit downstream viral genome analysis applications such as phylogenetic inference and transmission analysis.
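The per-position selection described in the Methods can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes lexicographic order as the fixed global ordering, a window of w consecutive k-mers starting at each position, and the k-mer's base-4 rank as its symbol in the transformed alphabet; the function names and default parameters are hypothetical.

```python
# Hypothetical sketch of a min-frame-style transform; the ordering and the
# symbol mapping below are assumptions, not the authors' definitions.
BASE_RANK = {"A": 0, "C": 1, "G": 2, "T": 3}

def kmer_rank(kmer: str) -> int:
    """Base-4 rank of a k-mer, which equals its lexicographic rank."""
    rank = 0
    for base in kmer:
        rank = rank * 4 + BASE_RANK[base]
    return rank

def min_frame_transform(seq: str, k: int = 3, w: int = 4) -> list[int]:
    """Map each position i to the smallest k-mer rank among the w k-mers
    starting at positions i .. i + w - 1. Output index i corresponds to
    input position i, so coordinates are preserved."""
    ranks = [kmer_rank(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    return [min(ranks[i:i + w]) for i in range(len(ranks) - w + 1)]

# A single substitution (T to G at index 3) between two toy sequences.
reference = min_frame_transform("ACGTACGT", k=2, w=3)
variant = min_frame_transform("ACGGACGT", k=2, w=3)
```

In this toy case both sequences transform to the same symbol sequence, because the k-mers touched by the substitution are never the minimum of their windows. An exact-match index built over the transformed sequences would therefore still match across the mutated site.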
bioinformatics2026-05-22v1A community machine learning challenge to predict the effects of gene perturbations on T cell differentiation for cancer immunotherapy
Zhang, J.; Schwartz, M. A.; Mutaher, M.; Olajide, O.; Pritykin, Y.; Ashenberg, O.; Hacohen, N.; Uhler, C.Abstract
Perturbations of genes with functional importance in T cells could be used to change the distribution of CD8 T cell states to enhance anti-tumor functions for cancer immunotherapies. We launched a worldwide computational challenge to predict the effects of gene perturbations and to devise objective functions for prioritizing gene perturbations that lead to desired T-cell state distributions. We supported the challenge by generating a single-cell Perturb-seq dataset profiling the effect of knocking out 73 individual expert-defined genes in T cells transferred into a mouse melanoma model. We compared the top algorithms developed by participants and found that performance was primarily determined by the prior data used for gene feature representation, with features derived from perturbational data proving most effective. Experimental validation of the top 61 genes nominated by the algorithms revealed that perturbation of Ndufv2 and Dimt1 reached the defined objective and biased T cell differentiation toward desired states.
bioinformatics2026-05-22v1AbSolution: interactive exploration of sequence-derived features in AIRR-seq repertoires
Garcia-Valiente, R.; Triantafyllou, C.; van Schaik, B.; Jongejan, A.; Pollastro, S.; Anang, D. C.; Guikema, J. E.; de Vries, N.; Hoefsloot, H. C.; van Kampen, A. H. C.Abstract
High-throughput sequencing of B-cell and T-cell immune receptor repertoires provides unprecedented insight into adaptive immune responses. The data produced are structured by clonal relationships and somatic mutation signatures, and yield extremely rich information in sequence-derived features, including physicochemical properties and compositional patterns. However, integrated analysis across datasets, conditions, and time points remains challenging. Current analytical tools typically focus only on certain features within individual repertoires, without enabling integrated, multivariable comparisons across datasets, conditions, and time points to address their diversity and variability. Here we present AbSolution, a user-friendly and flexible interactive application for comprehensive exploration of immune repertoires and their sequence-based properties. AbSolution enables multiscale analysis of thousands of sequence-derived features across receptor regions, while accounting for V(D)J usage, clonal composition and experimental groupings. We demonstrate its utility by identifying distinct sequence-based profiles associated with dominant (highly abundant) and non-dominant B-cell clones in peripheral blood BCR repertoires from patients with idiopathic inflammatory myopathies, and with antigen-responsive T-cell populations over time in a longitudinal in vitro antigen-stimulation dataset. Through interactive, interlinked visualizations, statistical feature selection and multi-sample comparisons, AbSolution facilitates integrated feature profiling that supports the interpretation of immune selection processes and enables systematic analysis of complex repertoire datasets.
bioinformatics2026-05-22v1Large-Scale Assessment of Animal-to-Human Drug Translation Using Natural Language Processing
Doneva, S. E.; Ellendorff, T. R.; Schneider, G.; Held, L.; von Wyl, V.; Simpson, I.; Sick, B.; Ineichen, B. V.Abstract
Background: Large-scale estimates of animal-to-human drug translation and the study characteristics associated with successful translation remain limited. The expanding preclinical literature also challenges manual evidence synthesis. We developed a natural language processing (NLP) pipeline to structure and link preclinical and clinical evidence at scale. Methods: In this retrospective meta-research study, we analysed more than 500,000 neuroscience-related animal drug studies from PubMed and linked them to clinical trial and regulatory approval data. NLP methods extracted drug, disease, and experimental design characteristics from abstracts and full texts. Translation was defined as progression to completed phase III/IV trials or regulatory approval. Logistic regression assessed associations between preclinical study characteristics and successful translation. Findings: Among 291,624 drug entities identified in animal studies, 6.7% entered clinical development and 3.1% reached phase III/IV trials or regulatory approval. At the drug-disease level, 4.4% entered clinical development and 1.9% achieved translation. Restricting analyses to successfully linked ontology entities increased estimates to 11.3% and 4.1%, respectively. Male-only animal studies predominated, whereas reporting of randomisation, blinding, and sample size calculations remained limited. Testing across multiple species and reporting blinding were associated with higher odds of successful translation. Interpretation: Only a minority of interventions tested in animals progress to advanced clinical development or regulatory approval. Greater species diversity and blinding were associated with improved translational success. NLP-based evidence synthesis may support scalable evaluation of translational research and identification of potentially modifiable research practices.
bioinformatics2026-05-22v1Metabarcode and transcriptome datasets of Pinus sylvestris to assess fungal phyllosphere and disease dynamics.
Moore, B.; Perry, A.; Kaur, S.; Crampton, B.; Gurung, A.; Beaton, J.; Cottrell, J. E.; Stockan, J. K.; Smith, V. A.; Morris, J.; Hedley, P. E.; Nemeth, K.; Barber, H.; Cavers, S.; Jones, S.Abstract
Understanding how host-microbiome interactions influence tree disease is critical for understanding forest resilience. Here, we present foliar microbiome ITS2 metabarcoding and transcriptomic datasets from Pinus sylvestris to investigate susceptibility to Dothistroma needle blight (DNB), a globally important foliar disease caused by Dothistroma septosporum. We hypothesised that host genotype shapes foliar microbial communities and their interactions, thereby influencing disease outcomes. Samples were collected from a progeny-provenance field trial in the south of Scotland representing a broad spectrum of disease susceptibilities. The dataset comprises ITS2 metabarcoding samples from 200 genotypes across three timepoints and RNAseq samples from 48 genotypes across two timepoints. Sampling captured key stages of pathogen exposure and disease progression. Both standardised and bespoke protocols were used for nucleotide extraction, sequencing, and quality control, including multiple negative and positive controls. These datasets, available in the European Nucleotide Archive (project accession PRJEB88228), enable analysis of temporal dynamics in foliar fungal communities, host-microbiome transcriptional responses, and genotype-dependent variation in disease susceptibility.
bioinformatics2026-05-21v3Counterfactual Explanations for Graph Neural Networks in Patient Outcome Prediction
Chaidos, N.; Dimitriou, A.; Calzi, H.; Casiraghi, E.; Stamou, G.; Valentini, G.Abstract
Counterfactual Explanation (CE) algorithms have been successfully applied to uncover the main factors driving computational diagnostic and prognostic predictions on tabular medical data. Recently, a new Network Medicine paradigm has been introduced for patient diagnosis and prognosis using Patient Similarity Networks (PSNs), i.e. graphs where patients are represented as nodes and their clinical and biomolecular similarities as edges. In this context, graph-based algorithms, including Graph Neural Networks (GNNs), can provide predictions using not only individual patient features but also their relations within a network of clinically and biomolecularly similar individuals. In this work, we propose the first CE algorithm tailored to explain diagnostic and prognostic predictions within PSNs. Alongside a contrastive GNN backbone, we introduce a versatile, model-agnostic counterfactual search method compatible with any underlying classifier. Preliminary results on synthetic data and on a cohort of patients affected by Alzheimer's disease show that our algorithm is competitive both with seminal tabular-based CE algorithms and GNNExplainer, a well-established method for explaining graph-based classification tasks.
bioinformatics2026-05-21v2sxLaep: a Lightweight and Accurate Enzyme Predictor for High-throughput Mining of Metagenomic Sequences
Duan, H.; Han, X.; Mo, Y.; Ren, B.; Xia, L. C.Abstract
Motivation: Metagenomic sequencing generates petabyte-scale sequence datasets that strain both deep learning and alignment-based enzyme annotation tools. A lightweight, rapid, and accurate filter is needed to identify enzymatic sequences prior to resource-intensive functional prediction. Results: We present sxLaep (Lightweight and Accurate Enzyme Predictor), a resource-efficient framework using lightweight physicochemical features for enzyme pre-screening. On the external validation set, sxLaep completed prediction in only 0.002 s/sequence, which is 22.9-fold faster than Diamond (0.0457 s/sequence). It used 372.16 MB peak memory, corresponding to a 54.4% memory reduction relative to Diamond (815.64 MB). sxLaep achieved an accuracy of 99.34% and the highest recall in remote homology detection, including enzyme candidates missed by alignment-based methods. We further successfully applied sxLaep to a marine metagenomic enzyme-mining workflow, demonstrating its utility for high-throughput discovery from large-scale metagenomic sequences. Availability and Implementation: sxLaep is available as a Python package at https://pypi.org/project/sxlaep and is maintained as an open-source software repository at https://github.com/labxscut/sxLaep. Detailed installation, usage, and Docker deployment instructions are provided in the GitHub repository to support reproducible enzyme prediction and model execution.
bioinformatics2026-05-21v2Unique molecular identifiers don't need to be unique: a collision-aware estimator for RNA-seq quantification
Agyemang, D.; Irizarry, R. A.; Baharav, T. Z.Abstract
RNA-sequencing (RNA-seq) relies on Unique Molecular Identifiers (UMIs) to accurately quantify gene expression after PCR amplification. Longer UMIs minimize collisions (where two distinct transcripts are assigned the same UMI) at the expense of increased sequencing and synthesis costs. However, it is not clear how long UMIs need to be in practice, especially given the nonuniformity of the empirical UMI distribution. In this work, we develop a method-of-moments estimator that accounts for UMI collisions, accurately quantifying gene expression and preserving downstream biological insights. We show that UMIs need not be unique: shorter UMIs can be used with a more sophisticated estimator.
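To make the collision correction concrete, here is a minimal sketch under a simplifying assumption the abstract does not make: UMIs are drawn uniformly from K = 4^L possible sequences of length L. Under that assumption the expected number of distinct UMIs observed among n molecules is K(1 - (1 - 1/K)^n), and solving for n gives a method-of-moments estimate. The function names are hypothetical, and the authors' estimator, which addresses nonuniform UMI usage, may differ.

```python
import math

def expected_distinct_umis(n_molecules: int, umi_length: int) -> float:
    """Expected number of distinct UMIs when n molecules each draw a UMI
    uniformly at random from the 4**umi_length possibilities (assumed)."""
    k = 4 ** umi_length
    return k * (1.0 - (1.0 - 1.0 / k) ** n_molecules)

def estimate_molecules(n_distinct: float, umi_length: int) -> float:
    """Invert the expectation above: estimate the molecule count from the
    observed number of distinct UMIs for one gene in one cell."""
    k = 4 ** umi_length
    if not 0 <= n_distinct < k:
        raise ValueError("distinct UMI count must lie in [0, 4**umi_length)")
    return math.log1p(-n_distinct / k) / math.log1p(-1.0 / k)

# With 5-base UMIs (1024 possibilities), 400 distinct UMIs imply more than
# 400 underlying molecules, because some molecules shared a UMI.
naive_count = 400
corrected_count = estimate_molecules(naive_count, umi_length=5)
```

The naive count simply reports the number of distinct UMIs, which underestimates expression once collisions become likely. The corrected estimate grows faster than the naive count as the observed UMIs approach the size of the UMI space.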
bioinformatics2026-05-21v2GlyComboCLI enables command line-based FAIR workflows for glycan composition assignment in mass spectrometry data
Kelly, M. I.; Thang, W. C. M.; Pang, C. N. I.; Gustafsson, O. J. R.; Ashwood, C.Abstract
Glycans are integral biomolecules whose presence cannot be predicted from genomic data alone, necessitating experimental characterisation through approaches including mass spectrometry. Assignment of glycan compositions to observed mass-to-charge ratios is computationally challenging due to the potential monosaccharide diversity, and existing tools lack the required flexibility for integration into automated bioinformatic workflows. Here, we present GlyComboCLI, an open-source command-line application for the assignment of glycan compositions to mass spectrometry data, which expands upon our previous GUI application, GlyCombo. GlyComboCLI accepts mass lists and vendor-neutral mzML files and supports an extensive range of monosaccharides, derivatisation states, reducing-end modifications, and adducts to ensure compatibility with a breadth of glycomics approaches. Outputs are compatible with downstream tools including Skyline and GlycoWorkBench. This software is deployable as a standalone executable, a Docker container, and a Galaxy tool, adhering to FAIR principles. When applied to 52 raw files from a published mouse glycomics dataset, a local instance completed composition assignment and downstream quality control in under three hours, recovering biologically consistent findings. Furthermore, an integrated Galaxy workflow demonstrated reproducible detection of sialidase treatment effects. GlyComboCLI substantially reduces the pool of spectra requiring manual structural interpretation, offering a flexible and scalable solution for glycomics bioinformatic workflows.
bioinformatics2026-05-21v2Nipoppy: A framework for standardizing neuroimaging studies to facilitate international derived-data sharing
Bhagwat, N.; Wang, M.; Dugre, M.; Pfarr, J.-K.; Dai, A.; Urchs, S.; McPherson, B.; Gau, R.; van Heese, E. M.; d'Angremont, E.; Laansma, M. A.; Prasad, S.; Sanz-Robinson, J.; Torabi, M.; Jahanpour, A.; Danyluik, M.; Joubert, A.; Macdonald, A.; Waller, L.; Stewart, A.; Joulot, M.; Dickie, E.; Devenyi, G. A.; Bouix, S.; Bollmann, S.; Jahanshad, N.; Thompson, P. M.; Burgos, N.; Chakravarty, M. M.; Halchenko, Y. O.; van der Werf, Y. D.; Poline, J.-B.Abstract
Neuroimaging data management and processing are tedious and error-prone, prompting reproducibility concerns. Globally, studies with heterogeneous infrastructure and governance policies lead to eclectic data processing and sharing, necessitating standardization of data workflows to ensure reusability and comparability of multi-centric datasets. The Nipoppy neuroinformatics framework facilitates such standardization by combining specification, protocol, and software to manage study-level data workflows. With its adoption, researchers can share standardized, derived datasets enabling efficient, reproducible, and inclusive research.
bioinformatics2026-05-21v1geneML: Gene annotation across diverse fungal species using deep learning
Vader, L.; Harvey, C. J.; Weber, T.; Hon, L. S.Abstract
Accurate gene prediction remains a major bottleneck in fungal genomics, where lineage diversity and alternative splicing challenge existing ab initio methods. Here, we present geneML, a deep learning-based gene prediction tool tailored to fungal genomes. Across nine reference genomes spanning diverse fungal taxa, geneML improved gene-level F1 score from 64.9 to 67.1 compared to BRAKER3 with protein-based hints, driven by substantially higher recall (69.0 vs. 64.1) at equivalent precision. geneML also remains fast, averaging around 6 minutes per genome on a standard 8-core CPU. A key feature of geneML is its ability to predict alternative transcripts. Compared to Fusarium graminearum Iso-Seq control data, it achieves 41.1% transcript recall and 71.1% precision, outperforming AUGUSTUS (33.8% recall, 48.9% precision), one of the few tools that support isoform prediction. The predicted transcript diversity is consistent with experimentally observed fungal alternative splicing patterns. Reannotation of the curated training dataset further suggests improved biological completeness, with geneML recovering 15.3% more genes containing complete PFAM domains than the reference annotation. These results demonstrate that geneML enables faster, more sensitive, and more biologically informative fungal genome annotation. geneML is available as an open-source command-line tool at https://github.com/hexagonbio/geneML.
bioinformatics2026-05-21v1A phylogeny-guided framework for decoding mechanisms of human endogenous retrovirus regulation in health and disease
Patterson, A.; Duong, B.; Yoon, L.; Foster, M.; MacMullen, L.; Wickramasinghe, J.; Lucas, A.; Srivastava, A.; Jacobson, S.; Murphy, M. E.; Soldan, S.; Lieberman, P. M.; Auslander, N.Abstract
Human endogenous retroviruses (HERVs) are remnants of ancient infections that make up to ~8% of the human genome. Their activity influences development, immunity, and cancer, but studying them has been limited by a key technical challenge: short-read sequencing cannot uniquely assign reads to these highly repetitive elements. Here, we present ERVmancer, a phylogeny-informed method that resolves the read-mapping ambiguity and quantifies HERV expression across scales, from individual loci to entire retroviral clades, depending on mapping confidence. Benchmarking with sample-matched long- and short-read data generated in this study demonstrates that ERVmancer outperforms existing approaches in both sensitivity and specificity. Application of ERVmancer recapitulates known HERV expression patterns in multiple sclerosis and uncovers new biology in breast cancer, including suppression of HERVH-LTR7 by p53. By enabling accurate and scalable quantification of integrated retroviral elements, ERVmancer provides a broadly applicable resource for investigating retroviral mechanisms in health and disease.
bioinformatics2026-05-21v1BioRAG-DRAG: A Multimodal Biological Retrieval Layer for Local-First Biomedical Agents
Wang, L.Abstract
Biomedical agents need reliable access to heterogeneous evidence: literature text, gene and pathway records, protein sequences, DNA/cDNA sequences, and structured biological relations. Classical sequence tools such as BLAST remain the right choice for alignment-grounded verification, but they are not a unified context interface for large language model agents. We present BioRAG-DRAG, a local-first multimodal retrieval layer that combines pluggable neural sequence-text retrieval, BLAST verification, and graph-based evidence packaging. Specialized encoders such as ESM-2 can serve protein partitions, while OmniGene CPT provides a unified biological-language backbone for mixed sequence/text and agent-facing use; BLAST reranks or verifies sequence candidates; and DRAG graphs expose typed, traceable paths for downstream agents. We introduce BioRAG-Standard v0, a partitioned corpus/library with 257,886 retrievable records and an initial annotation layer for engineering evaluation built from Open-Rosalind Standard biomedical records and sequence-window extensions. On an in-index sequence-window stress test, BLAST nearly saturates biological matching, while vector retrieval recovers substantial but lower biological match rates. On held-out parent-fragment controls, public protein encoders outperform the current OmniGene protein-window embedding, while DNA/cDNA dense retrieval remains weak even with off-the-shelf Nucleotide Transformer pooling; this supports a model-agnostic BioRAG design rather than a claim that one unified generator backbone is the best sequence-search encoder. Indexed Chroma lookup over Standard text and 100k sequence-window collections adds only small lookup overhead after query embedding; this does not measure end-to-end instant latency. 
Finally, exploratory sequence DRAG traces show inspectable biological neighborhoods, including immunoglobulin-family and gene-symbol modules, with initial graph controls indicating non-random but partly sequence-similarity-driven structure. These results support a bounded architecture: vector retrieval supplies unified candidate context, while BLAST and DRAG provide biological verification and evidence attribution.
bioinformatics2026-05-21v1Heterogeneity-driven adaptive scale graph learning for subcellular spatial transcriptomics
Shi, W.; Shen, C.; Liu, Y.; Xiao, Q.; Luo, J.Abstract
Spatial transcriptomics enables gene expression profiling within intact tissue sections, providing an important basis for analyzing tissue organization, cellular heterogeneity, and microenvironmental interactions. However, existing spatial structure identification methods often integrate spatial information using fixed neighborhoods or predefined smoothing scales, which limits their ability to adapt to region-specific structural heterogeneity. In homogeneous regions, broader spatial smoothing can help preserve continuous tissue structures, whereas in regions with complex boundaries or mixed cell populations, excessive smoothing may obscure local expression differences and fine-scale structural changes. Therefore, it is necessary to develop an adaptive graph learning framework that can adjust the range of spatial information integration according to tissue structural heterogeneity. In this study, we propose HAST, a heterogeneity-driven adaptive-scale graph learning framework for spatial transcriptomics. HAST adaptively determines graph filtering scales according to spatial structural heterogeneity, enabling flexible information aggregation across different tissue regions. It further decomposes gene expression signals into low-frequency structural components and high-frequency residual components, thereby jointly modeling global spatial continuity and local expression variations. Experiments on high-resolution spatial transcriptomics datasets show that HAST improves spatial structure identification and cross-section generalization. Tumor-enriched cluster identification and neighborhood enrichment analysis further demonstrate its ability to characterize tumor-associated spatial regions and microenvironmental organization.
bioinformatics2026-05-21v1Spectral Prompting: Unsupervised Recovery of Human Hair Follicle Cell-Type and Multiscale Systems Architecture from Bulk and Single-Cell RNA-Seq Datasets via Single-Gene Seeded Spectral Unfolding
Purba, T.Abstract
Bulk RNA sequencing datasets are assumed to carry minimal resolvable programmatic and cell type biological information; as such, in the absence of single-cell resolution, researchers prioritise data analysis approaches based on differential expression, or rely on deconvolution and co-expression methods that require external reference panels, large multi-sample cohorts, or prior single-cell data to resolve cell-type structure. Here I describe the recovery of specialised cell-type and systems gene expression architecture resolved from a static gene expression dataset of untreated cultured human hair follicles (pooled from N=12 patients) isolated from scalp skin. To achieve this, I used graph theoretic methods to mathematically transform gene expression data into a latent space of relational structure, which was spectrally organised into coarse- and fine-grained modes and partitioned using a purpose-built computational algorithm. This permitted the synthesis of a computational Spectral Prompting system, whereby a single gene can be seeded to unfold to reveal associated partners across manifold projections in gene expression space. Individual projections across the manifold can reveal rich individual gene expression programmes, which can then be aggregated to identify core-associated genes for a given spectral gene prompt, both within the manifold analysed and across >1 manifold constructions. With this, I recover hitherto unresolved gene expression programmes from bulk data, including, but not limited to, epithelial hair follicle stem cell (eHFSC), hair shaft, dermal papilla and endothelial gene expression signatures. Focusing on querying KRT15, a human anagen bulge eHFSC and progenitor marker, raw output from individual spectral prompts during testing recovered known eHFSC-associated genes including LGR5, LHX2 and CXCL14, and discovered new candidate human eHFSC and progenitor cell-associated markers, such as RGMA and MUCL1 which were validated in situ. 
Finally, I show a brief demonstration that the technique can be similarly applied to single-cell data (GSE129611), whereby a KRT15 gene prompt from a combined expression matrix was mapped to a KRT15+/CXCL14+/LHX2+/DIO2+/SFRP1+ cell population (31/6000 cells) independent of standard clustering tools. Moving forward, from this foundation, the method will be developed to study how latent gene expression space shifts following perturbation or pathology.
bioinformatics2026-05-21v1A framework for peptide identification on commercial nanopore sequencing platforms
Beslic, D.; Kucklick, M.; Graap, E.; Sedaghatjoo, S.; Renard, B. Y.; Fuchs, S.; Engelmann, S.; Koerber, N.Abstract
Direct single-molecule peptide analysis could in principle enable rapid and sensitive identification of pathogen-derived or disease-associated biomarkers without reliance on mass spectrometry. However, existing nanopore peptide sensing methods are typically constrained by limited throughput and lack of accessibility beyond specialized setups. Here, we present an integrated experimental-computational framework for DNA-linked peptide translocation on a commercially available, high-throughput nanopore sequencing platform, the MinION. Synthetic peptides were covalently bound to oligonucleotides at both termini. The resulting peptide-DNA constructs were then translocated through the CsgG-CsgF pores using a DNA motor protein. Current traces were segmented using the known DNA sequences to extract peptide-associated signal regions. From these segments, we extracted signal features and trained feature-based and deep-learning classifiers to distinguish peptides, balancing interpretability and classification performance. We establish a framework for peptide identification using standard nanopore sequencing hardware. Across a diverse panel of synthetic peptides, our approach resolves single-amino-acid substitutions, maintains performance across independent sequencing runs, and correctly identifies peptides in blind mixtures. Interpretable model analyses connect classifier decisions and common errors to specific signal motifs. By combining commercially available instrumentation with a reproducible experimental and computational workflow, this framework lowers the barrier to nanopore-based proteomics and enables broader adoption across laboratories. It provides a foundation for future developments in amino acid modification detection and sequence analysis.
bioinformatics2026-05-21v1S-IGTD: supervised tabular-to-image topology learning via between-group correlation for multiclass classification of biological data
WU, H.-M.Abstract
Motivation: Tabular-to-image methods allow convolutional neural network (CNN)-based classifiers to analyse high-dimensional biological tables by mapping features onto a two-dimensional grid. Existing layouts are usually driven by unsupervised global correlation, which can place class-discriminative features far apart when nuisance or housekeeping covariation dominates the total covariance structure. Results: We present the Supervised Image Generator for Tabular Data (S-IGTD), a supervised extension of IGTD that optimizes tabular-to-image topology by replacing total-correlation distance with one minus the absolute between-group correlation, computed from class-wise feature means, under the Within-And-Between-Analysis (WABA) decomposition. We prove entrywise consistency of the supervised distance matrix under standard moment conditions and identify balanced-class settings in which S-IGTD improves a Signal Dispersion Score (SDS)-related topology objective. In controlled simulations targeting between-group signal, S-IGTD outperformed Euclidean- and correlation-distance IGTD variants in SDS, accuracy and macro-F1 score. Across five biological benchmarks ranging from 4- to 91-class classification, S-IGTD produced compact class-supervised layouts, with 24/35 Holm-adjusted significant SDS wins against seven non-reference layout controls. As a secondary downstream diagnostic, a CNN with batch normalization showed higher mean accuracy than random layouts and correlation-distance IGTD on all real datasets, and higher mean accuracy than Euclidean-distance IGTD on four of five datasets, with the clearest gains on large multiclass cancer and methylation benchmarks. Availability and implementation: Source code, datasets, configuration files and reproducibility scripts are freely available at https://github.com/hanmingwu1103/S-IGTD.
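The supervised distance described in this abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (see their repository for that). The function name `between_group_distance` is hypothetical, and weighting class means by class size, as in a WABA-style between-group component, is an assumption.

```python
import numpy as np

def between_group_distance(X, y):
    """Return a (p, p) matrix of 1 - |between-group correlation|.

    X: (n_samples, n_features) array; y: (n_samples,) class labels.
    Hypothetical helper for illustration only.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Replace every sample by the feature means of its class, so that
    # larger classes carry more weight (assumed WABA-style weighting).
    means = np.empty_like(X)
    for label in np.unique(y):
        mask = y == label
        means[mask] = X[mask].mean(axis=0)
    centered = means - means.mean(axis=0)
    cov = centered.T @ centered
    scale = np.sqrt(np.diag(cov))
    # Features whose class means are all equal have no defined correlation;
    # they are treated here as uncorrelated with everything (distance 1).
    with np.errstate(divide="ignore", invalid="ignore"):
        corr = cov / np.outer(scale, scale)
    corr = np.clip(np.nan_to_num(corr, nan=0.0), -1.0, 1.0)
    dist = 1.0 - np.abs(corr)
    np.fill_diagonal(dist, 0.0)
    return dist
```

Features whose class means rise and fall together (or in exact opposition) get a distance near 0 and would be placed close together on the image grid; features whose class means are unrelated get a distance near 1.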
bioinformatics2026-05-21v1Multi-layer transcriptomic characterization of age-related immune dynamics
Zhao, Z.; Zhao, S.; Jin, J.; Ni, T.Abstract
Despite the pivotal role of mRNA isoform diversity in governing immune cell function, current investigations into peripheral immune aging have predominantly focused on gene-level expression, obscuring deeper regulatory layers of transcriptome complexity. Here, we leveraged a 5' scRNA-seq atlas comprising approximately 2.5 million PBMCs from 378 healthy donors. We demonstrate that immune aging is characterized by profound, non-linear transcriptional reprogramming that extends beyond gene-level shifts to include fine-tuned regulation of alternative transcription initiation and splice site selection. By quantifying the transcriptional activity of cis-regulatory elements, we resolved their contributions to age-related expression dynamics. Notably, we identified a subset of endogenous retroviruses that are reactivated in older individuals, some of which served as alternative promoters driving the production of chimeric transcripts. Furthermore, our analysis revealed EDA as a top-ranked gene consistently upregulated with age across multiple independent cohorts. Increasing EDA expression in in vitro-stimulated naive CD4+ T cells from young individuals recapitulated aged phenotypes. This comprehensive resource elucidates the multi-layered transcriptomic landscape of the aging immune system and facilitates the identification of novel drivers of immune aging.
bioinformatics2026-05-21v1Structural Pockets and Interacting RNA-Associated Ligands (SPIRAL): A DSSR-enabled Meta-Analysis of RNA-Small Molecule Recognition
Lu, X.-J.; Wang, Y.Abstract
Small molecules that target structured RNA hold therapeutic promise across a wide range of diseases, yet the structural principles governing RNA-ligand recognition remain poorly defined. Here we present SPIRAL (Structural Pockets and Interacting RNA-Associated Ligands), a curated database of 1,098 RNA-small molecule structures from the Protein Data Bank covering 1,137 ligand-binding events across six functional RNA categories: riboswitches, ribozymes, synthetic aptamers, G-quadruplexes, ribosomal RNA, and regulatory RNA motifs. A customized pipeline built on DSSR (Dissecting the Spatial Structure of RNA) extracts structural interaction parameters from each complex, capturing stacking geometry, hydrogen-bond topology by RNA moiety, backbone contacts, groove engagement, and tertiary motif context. Unsupervised clustering of these fingerprints resolves six mechanistically distinct binding modes whose distribution is strongly governed by RNA functional class, demonstrating that different RNA categories engage small molecules through fundamentally different chemical strategies. To enable category-independent comparison of interaction quality across these mechanistically diverse modes, we introduce the Composite Binding Quality Score (CBQS), a seven-metric framework that ranks riboswitches highest and regulatory RNA motifs lowest among the six categories, while ribozymes, synthetic aptamers, and G-quadruplexes achieve statistically equivalent intermediate scores through three distinct recognition strategies. Analysis of 275 non-redundant affinity-characterized entries identifies C2'-endo sugar pucker count and total buried contact surface area as the dominant independent predictors of binding affinity. 
Both predictors are enriched at junction loops, pseudoknots, and base multiplet networks, the same tertiary structural sites most under-engaged by current regulatory RNA motif binders, suggesting that ligands designed to contact these sites would improve both potency and selectivity simultaneously.
bioinformatics2026-05-21v1Differential Gene Expression in the Tropical House Cricket and Its Iridovirus in Healthy versus Diseased Specimens
Hinton, J. A.; Walt, H. K.; Duffield, K. R.; Ramirez, J. L.; Meyer, F.; Hoffmann, F. G.Abstract
The tropical house cricket, Gryllodes sigillatus, is a mass-produced insect that is used as a protein source for pets and livestock. However, intensive mass-rearing conditions, coupled with high genetic relatedness, create an ideal environment for the spread of pathogenic microbes that severely impact production. Cricket iridovirus (CrIV) is a pathogen that impedes cricket growth and causes significant losses for cricket farmers. Interestingly, recent studies have shown that CrIV is often present asymptomatically, yet the molecular basis of the emergence of disease symptoms remains unknown. To address this, we sampled healthy and diseased crickets and examined differences in cricket and CrIV gene expression via RNAseq. Using differential gene expression analysis and functional enrichment analysis, we found significant differences in host and viral gene expression between healthy and diseased crickets, including genes involved in immunity. Interestingly, while we observed high CrIV gene expression across the entire CrIV genome in sick populations, healthy asymptomatic populations showed elevated expression at a single viral locus. Our results not only shed light on the cricket immune response to CrIV infection but also identify a viral gene that is highly expressed during covert infections, suggesting its potential role in suppressing the host's immune response. These findings enhance our understanding of how CrIV interacts with its cricket host, providing essential insights for developing targeted strategies to manage CrIV outbreaks in cricket mass-rearing facilities.
bioinformatics2026-05-21v1A Bayesian modelling framework for inference of latent infection risk patterns from virus neutralisation assay titration data
Alrefae, T. A.; Pons-Salort, M.; Donnelly, C. A.; Lambert, B.; Kamau, E.Abstract
Serological assays remain the standard experimental approach for estimating the cumulative incidence of a pathogen and monitoring population immunity. The predominant approach for analysing serum titration data from virus neutralisation assays uses a nearly century-old interpolation-based method which neglects inherent imperfections in the assay and produces estimates with no measure of uncertainty. We introduce a two-part Bayesian modelling framework to estimate the underlying antibody concentrations in the raw serum samples taken from serosurveyed individuals, to improve the interpretation of serological data over age. First, we develop a mechanistic Bayesian model for serum antibody titration data that estimates latent antibody concentrations while accounting for assay variability and quantifying uncertainty. Second, we propagate this uncertainty into an age-structured serocatalytic model by integrating over posterior draws of individual antibody concentrations, allowing joint inference on latent serostate membership, force of infection, and serological waning rate. We use this framework to explore the dynamics of infection and immunity for three enterovirus serotypes: enteroviruses A71 (EV-A71) and D68 (EV-D68) and coxsackievirus A6 (CVA6). These serotypes are leading causes of outbreaks of severe respiratory illness and hand, foot, and mouth disease. Applying these approaches to three cross-sectional serosurveys, we estimated consistently higher and more persistent antibody concentrations throughout life for EV-D68 compared to EV-A71 and CVA6. Our analysis suggests that the proportion of recently infected individuals (i.e. individuals with high estimated antibody concentration levels given their age) peaks around 25% by age 7 years for both EV-A71 and CVA6 before gradually declining with age. In contrast, for EV-D68 the inferred proportion of the population in the infected state exceeds 50% by age 9 years and continues to grow with age. 
We also estimate that EV-D68 antibody concentration levels are higher than those of the other two serotypes, with the force of infection estimated to be highest in early childhood and declining more gradually with age than for EV-A71 and CVA6. These estimates are different to previous estimates found in the literature. Our inferential framework uncovers the wide-ranging variation in antibody levels that are often obscured by conventional endpoint titre estimation methods. We demonstrate that our framework can infer infection rates without relying on predetermined seropositivity cut-offs and without making explicit assumptions of virus-specific infection mechanisms.
bioinformatics2026-05-21v1KmerSignificance Score: A discriminative and biologically-informed framework for viral k-mer prioritization
Lebatteux, D.; Corso, F.; Soudeyns, H.; Boucoiran, I.; Gantt, S.; Banire Diallo, A.Abstract
Distinguishing closely related viral strains requires identifying genomic regions where subtle sequence differences carry biological significance. While k-mer-based approaches offer computational efficiency for genome analysis, existing methods lack standardized frameworks for evaluating which k-mers are most informative. Current selection strategies focus primarily on statistical discriminative power without integrating biological relevance. We introduce KmerSignificance Score (KSS), a k-mer prioritization framework combining three components: an information-theoretic method measuring strain-distinguishing capacity, an optimized amino acid substitution matrix (MIYATA_EVO) for mutation impact assessment, and protein-level functional importance scoring derived from UniProt annotations. KSS produces standardized scores in the [0,1] interval, enabling direct cross-dataset comparison. The discriminative component achieved classification performance comparable or superior to all tested alternatives (mean F1 = 0.880 vs. 0.718-0.877 for six established methods) while additionally providing bounded scores with consistent empirical distributions for cross-dataset comparability. MIYATA_EVO, optimized via genetic algorithm, improved biophysical property correlations by 28.4% over the original MIYATA matrix. Protein scoring on 17,470 viral proteins showed robust agreement with UniProt annotation scores (Kendall τ = 0.777) while revealing finer functional distinctions. Literature validation on SARS-CoV-2 (278,738 sequences, 19 variants), HIV-1 (12,223 sequences, 15 subtypes), and human cytomegalovirus (HCMV; 399-646 sequences, 4-8 genotypes) confirmed that high-scoring k-mers consistently map to established variant-defining mutations, subtype-specific polymorphisms, and genotype markers. KSS provides a standardized framework for viral k-mer prioritization with applications in variant surveillance, molecular epidemiology, and functional annotation. 
The tool is available at https://github.com/bioinfoUQAM/KmerSignificanceScore.
bioinformatics2026-05-21v1Recovering biological structure in sparse single-cell proteomics with GIRAFI
Zhong, H.; Chi, S.; Wong, R.; Rogalski, J.; Wang, Z.; Chan, S.; Bailey, M. L.; Ebrahimi, A.; Jayme, G.; Yin, J.; Gong, A.; Snutch, T. P.; Maier, C. S.; Marra, M. A.; Foster, L. J.; Tang, X.Abstract
Single-cell proteomics (SCP) based on liquid-chromatography mass-spectrometry resolves protein-level cellular heterogeneity, but interpretation remains limited by detection-linked sparsity. SCP profiles continuous, peptide-derived intensities and has lower throughput than single-cell RNA sequencing, making denoising methods for large-scale, count-based transcriptomics difficult to apply. Here we present GIRAFI, a graph-informed statistical learning framework that imputes missing values and reveals reproducible cell states by constraining inference to dataset-aware, prior-knowledge-informed protein neighborhoods. We evaluated GIRAFI across SCP datasets spanning diverse biological/technical contexts. In masking-based recovery experiments and cell-type-specific protein-protein interaction inference, GIRAFI outperformed existing methods; matched bulk proteomics comparisons corroborated recovery accuracy, and ablations supported the graph-informed design. Beyond reduced replicate- and source-associated technical structure, GIRAFI recovered ground-truth cell-type annotations, improved cell state-resolved pathway analysis, and enabled trajectory inference consistent with known time courses. These results establish graph-constrained imputation as an effective strategy for improving SCP robustness, biological structure, interpretation, and cross-dataset comparability.
bioinformatics2026-05-21v1Multimodal single-cell analysis uncovers transcription factor networks underlying T-cell aging
Shaigan, M.; Puri, D.; Fornero, G.; Kruger, R.; Steiger, M.; Klump, H. J.; Meissner, A.; Kretzmer, H.; Wagner, W.; Gesteira Costa Filho, I.Abstract
Aging of the immune system is associated with chronic inflammation and impaired immune function, yet the regulatory mechanisms underlying these changes remain incompletely understood. Here, we generated paired single-cell transcriptomic and chromatin accessibility profiles from peripheral blood mononuclear cells of young and old healthy donors to characterize immune aging at single-cell resolution. Using an integrative computational framework for multi-omic single-cell analysis, we detected pronounced age-associated changes in T cells, including loss of naive CD8+ T cells and expansion of differentiated memory and effector populations. Aging was accompanied by increased inflammatory signaling and reduced oxidative phosphorylation programs. Enhancer-based gene regulatory network analyses identified a reduced role of TCF7 and increased activity of inflammatory regulators, including FOSL2, in aged T cells. Integration with genetic association and eQTL datasets further supported the functional relevance of age-associated regulatory regions and their target genes.
bioinformatics2026-05-21v1Deciphering context-dependent epigenetic program by network-based prediction of clustered open regulatory elements from single-cell chromatin accessibility
Park, S.; Ma, S.; Lee, W.; Park, S. H.Abstract
Large cis-regulatory domains, spanning tens to hundreds of kilobases, are pivotal in orchestrating cell-state-specific transcriptional programs that define cellular identity. However, existing single-cell analytical frameworks lack the capacity to identify these higher-order structures, thereby obscuring the coordinated, domain-level epigenetic regulation essential for complex biological processes. To address this, we introduce enCORE, a computational framework that leverages enhancer-enhancer interaction networks to determine Clustered Open Regulatory Elements (COREs) solely from single-cell ATAC-sequencing data. Our approach faithfully recapitulates established hematopoietic hierarchies and resolves lineage-specific regulatory programs by recovering canonical master transcription factors, frequent chromatin interactions, and enrichment of fine-mapped autoimmune disease-associated genome-wide association study (GWAS) variants. In colorectal cancer, enCORE captures tumor-associated H3K27ac landscapes and prioritizes USP7 as a potential therapeutic candidate, supported by in silico perturbation. Collectively, our framework provides a powerful and scalable platform for deciphering the complex epigenetic architectures underlying human development and disease.
bioinformatics2026-05-20v8Early terminated transcripts and missing proteins reflect artifacts in bacterial proteomes
Insana, G.; Martin, M. J.; Pearson, W. R.Abstract
MMseqs2 clustering was used to examine the uniformity and heterogeneity of proteomes from 20 bacterial species. Using clustering parameters that required 50% sequence overlap, clusters with proteins from 50% of proteomes typically contain proteins from 95% of the proteomes and capture more than 80% of the proteins in an organism. Protein clusters are highly uniform in length; across the 20 bacteria, the median cluster has more than 99% of the proteins at the mode length. While protein lengths in clusters are highly uniform, some clusters contain dozens to hundreds of proteins that are considerably shorter (75%) than the mode-length, and a few clusters include proteins that are 133% the mode length. Most "outlier" proteins are found in fewer than 10% of clusters, and "high-outlier" clusters are over-represented in a small fraction of proteomes. Short-outlier proteins are artifacts; at least 80% of short-outlier genomes contain mode-length copies of the protein in the cluster; 40% of short protein artifacts are produced by sequencing errors (frameshifts and termination codons) while another 40% by initiation codon choice. High "outlier" clusters are concentrated in a small fraction of proteomes, which often have poor Proteome BUSCO fragment scores. As with "short-outlier" proteins, the 5% of proteomes that are excluded from the core (50% participation) cluster set encode the missing protein more than 98% of the time; these proteins were missed because of frameshifts in the genome sequence. MMseqs2 clustering with 50% participation provides robust sets of core bacterial proteins.
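The length-outlier idea in this abstract can be sketched as follows. The sketch rests on an assumption about the cut-offs: "short" means at most 75% of the cluster's mode length and "long" means at least 133% of it. The function name `classify_lengths` and its parameters are hypothetical and are not taken from the authors' code.

```python
from collections import Counter

def classify_lengths(lengths, short_frac=0.75, long_frac=1.33):
    """Label each protein length in one cluster relative to the mode length.

    Returns (mode_length, labels), where each label is "short", "long",
    or "typical". Thresholds are assumptions for illustration.
    """
    counts = Counter(lengths)
    top = max(counts.values())
    # If several lengths tie for most frequent, pick the smallest one.
    mode_len = min(length for length, c in counts.items() if c == top)
    labels = []
    for length in lengths:
        if length <= short_frac * mode_len:
            labels.append("short")
        elif length >= long_frac * mode_len:
            labels.append("long")
        else:
            labels.append("typical")
    return mode_len, labels
```

Applied per cluster, the "short" labels would flag candidates for the frameshift, premature-termination, or initiation-codon artifacts the abstract describes; confirming the cause still requires inspecting the genome sequence.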
bioinformatics2026-05-20v2Pan1c: a pipeline to easily build chromosome-level pangenome graphs
Mergez, A.; Racoupeau, M.; Bardou, P.; Linard, B.; Legeai, F.; Choulet, F.; Gaspin, C.; Klopp, C.Abstract
Advances in sequencing technologies and the availability of high-quality genome assemblies for many genotypes per species offer the opportunity to improve sequence alignment rate and quality, as well as variant calling accuracy, by including all genomic variations in a graph reference, called a pangenome graph. Because the process of building and analysing a pangenome graph is still complex, with related software packages under development, there is an important need for user-friendly pipelines in this emerging research area. Pan1C is a pipeline based on a chromosome-by-chromosome graph construction strategy. It integrates two complementary strategies for building pangenomes and produces informative metric plots and graphics using a large set of tools. By benchmarking Pan1C on human, fungal, and wheat assemblies, which span a wide range of genome sizes and complexities, we showed the value of Pan1C for assembly and graph validation as well as for performing primary analyses.
bioinformatics2026-05-20v2Novel 4D tensor decomposition-based approach integrating tri-omics profiling data can identify functionally relevant gene clusters
Taguchi, Y.-h.; Turki, T.Abstract
Understanding gene expression requires integrating multiple regulatory layers, because transcript abundance does not necessarily correspond to translational activity or protein abundance. Ribosome profiling and proteomics help distinguish increased translation from ribosome stacking or translational buffering, but no de facto standard framework exists for unsupervised integration of transcriptome, translatome, and proteome profiles. Here, we propose a four-dimensional tensor decomposition-based unsupervised feature extraction approach for tri-omics integration. We applied higher-order singular value decomposition to transcriptome, Ribo-seq, and proteome profiles measured under branched-chain amino acid starvation. The resulting singular value vectors captured relationships among the three omics layers, including a component consistent with ribosome stacking, where transcriptome and translatome signals increased while proteome signals decreased, and another consistent with translational buffering, where proteome variation was suppressed despite transcriptome and translatome changes. Gene selection identified 1,781 genes associated with ribosome stacking and 227 genes associated with translational buffering. Enrichment analyses linked the former to translation, post-translational protein modification, RNA polymerase II transcription, cell cycle regulation, endoplasmic reticulum protein processing, ubiquitin-mediated proteolysis, and stress-related pathways, and the latter to ribosome, translation elongation and termination, spliceosome, immune- and stress-related pathways, and ribosomopathy-associated diseases. Robustness analyses indicated that the results were not substantially affected by the duplicated proteome replicate or missing-value handling. Comparison with MOFA+ and mixOmics suggested that our approach more effectively extracted components interpretable as ribosome stacking and translational buffering. 
These results demonstrate that tensor decomposition-based unsupervised feature extraction is useful for identifying functionally relevant gene clusters from tri-omics data.
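The higher-order singular value decomposition (HOSVD) named in this abstract can be sketched in plain NumPy. This is a generic HOSVD, not the authors' pipeline. The tensor layout mentioned in the usage note (genes x conditions x replicates x omics layers) is an assumption, and `unfold`, `hosvd`, and `reconstruct` are hypothetical helper names.

```python
import numpy as np

def unfold(tensor, mode):
    """Matricize: rows index `mode`, columns index all remaining modes."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def hosvd(tensor):
    """Return (core, factors) such that the tensor equals the core
    multiplied along every mode by the matching factor matrix."""
    factors = []
    for mode in range(tensor.ndim):
        # Left singular vectors of each unfolding give that mode's factor.
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u)
    core = tensor
    for mode, u in enumerate(factors):
        # Project mode `mode` onto its singular vectors (multiply by U^T).
        core = np.moveaxis(np.tensordot(u.T, core, axes=(1, mode)), 0, mode)
    return core, factors

def reconstruct(core, factors):
    """Multiply the core by each factor matrix to recover the tensor."""
    out = core
    for mode, u in enumerate(factors):
        out = np.moveaxis(np.tensordot(u, out, axes=(1, mode)), 0, mode)
    return out
```

With a tensor laid out as genes x conditions x replicates x omics layers, the columns of `factors[3]` would describe how the three omics layers co-vary in each component, and the columns of `factors[0]` would give per-gene loadings that could then be used for gene selection.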
bioinformatics2026-05-20v2Metabarcode and transcriptome datasets of Pinus sylvestris to assess fungal phyllosphere and disease dynamics.
Moore, B.; Perry, A.; Kaur, S.; Crampton, B.; Gurung, A.; Beaton, J.; Smith, V. A.; Morris, J.; Hedley, P. E.; Nemeth, K.; Barber, H.; Cavers, S.; Jones, S.Abstract
Understanding how host-microbiome interactions influence tree disease is critical for understanding forest resilience. Here, we present foliar microbiome ITS2 metabarcoding and transcriptomic datasets from Pinus sylvestris to investigate susceptibility to Dothistroma needle blight (DNB), a globally important foliar disease caused by Dothistroma septosporum. We hypothesised that host genotype shapes foliar microbial communities and their interactions, thereby influencing disease outcomes. Samples were collected from a progeny-provenance field trial in the south of Scotland representing a broad spectrum of disease susceptibilities. The dataset comprises ITS2 metabarcoding samples from 200 genotypes across three timepoints and RNAseq samples from 48 genotypes across two timepoints. Sampling captured key stages of pathogen exposure and disease progression. Both standardised and bespoke protocols were used for nucleotide extraction, sequencing, and quality control, including multiple negative and positive controls. These datasets, available in the European Nucleotide Archive (project accession PRJEB88228), enable analysis of temporal dynamics in foliar fungal communities, host-microbiome transcriptional responses, and genotype-dependent variation in disease susceptibility.
bioinformatics2026-05-20v2Shiny AMMOA: an interactive platform for integrative multi-omics analysis of murine aging
Ninomiya Kanda, M.Abstract
Aging is accompanied by complex, tissue-specific molecular changes across multiple biological layers, yet integrative analysis of multi-omics datasets remains challenging for many experimental researchers due to technical and computational barriers. Here, I present Shiny Aging Murine Multi-Omic Analyzer (Shiny AMMOA), a graphical user interface (GUI)-based, user-friendly analytical platform that enables interactive exploration of murine aging-associated bulk transcriptomic, proteomic, and metabolomic datasets. Shiny AMMOA integrates publicly available multi-omics resources within a unified R Shiny framework and supports end-to-end analyses, including differential expression testing, pathway enrichment analysis, and pathway-level visualization across individual and multiple omics layers. Using representative use cases, I demonstrate that Shiny AMMOA recapitulates key findings from original source studies and facilitates intuitive discovery of tissue-, pathway-, and modality-specific aging signatures, including age-associated alterations in unfolded protein response, extracellular matrix organization, and metabolic pathways across specific tissues and omics layers. The platform further enables integrated visualization of molecular changes across omics layers on Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway diagrams, supporting hypothesis generation at the systems level. By democratizing access to integrative multi-omics analysis while preserving analytical rigor, Shiny AMMOA provides an extensible resource for experimental biologists and aging researchers to interrogate large-scale public datasets, prioritize biological pathways, and accelerate translation of multi-omics insights into testable experimental hypotheses. Shiny AMMOA is available at https://github.com/M-Ninomiya-Kanda/Shiny_AMMOA_local, and a lightweight web-based demonstration version with limited functionality is available at https://m-ninomiya-kanda.shinyapps.io/shiny_ammoa_web/.
bioinformatics2026-05-20v1