Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
IDEAL-Age: an interpretable deep learning framework for single-cell resolution profiling of immunological aging
Xu, Y.; Luo, Z.; He, K.; Zhang, F.; Zhang, Y.; Wang, J.; Wen, H.; Li, Y.; Han, D.
Immunosenescence increases susceptibility to infection and reduces vaccine responsiveness, yet bulk transcriptomic clocks obscure the cellular heterogeneity underlying this process. Here, we present IDEAL-Age, an interpretable deep learning framework that operates directly on single-cell PBMC transcriptomes. Benchmarking against 31 methods across independent cohorts demonstrates superior predictive performance. The framework's interpretability uncovers linear and non-linear transcriptomic dynamics that reveal phase-specific physiological transitions, and identifies pro-youthful or pro-aging cellular contributions. Application to systemic lupus erythematosus (SLE) reveals accelerated immunological aging driven by interferon-associated monocyte shifts. IDEAL-Age establishes a high-resolution computational framework for deciphering systemic immune aging.
bioinformatics 2026-05-22 v2
Widespread use of invalid statistical tests in biomedical machine learning
Zeng, T.; Li, H.; Zhang, S.; Tan, Y. Q.; Tian, F.; Orban, C.; An, L.; Che, W.; Cheng, J.; Chong, J. S. X.; Dehestani, N.; Dong, Z.; Li, X.; Li, Z.; Lim, M. J. R.; Lin, Y.; Ling, Q.; Ling, Z.; Low, X. Z.; Mansour L., S.; Ng, K. K.; Nguyen, T. T.; Ooi, L. Q. R.; Pande, S.; Qian, X.; Ruan, J.; Wang, Z.; Xie, Y.; Zhang, C.; Zhang, Y.; Patil, K.; Parkes, L.; Dhamala, E.; Chopra, S.; Zalesky, A.; Holmes, A.; Eickhoff, S.; Zhou, J. H.; Renaud, O.; Dosenbach, N.; Kording, K. P.; Bzdok, D.; Nichols, T.; Yeo, B. T. T.
Machine learning is accelerating biomedical research. Cross-validation is widely used to compare predictive performance, not only to benchmark algorithms but also to inform scientific applications such as ranking biomarkers. However, prediction performance estimates across cross-validation folds are not independent. Standard tests for comparing prediction performance (e.g., paired t-test) assume independence and can therefore inflate false positive rates. In a PRISMA-guided meta-analysis of 210 studies (impact factor ≥ 15, 1 June 2020 to 1 June 2025), we find that 97% ignored fold dependence when comparing prediction performance. This problem is ubiquitous across scientific fields and unaffected by impact factor, rigor-promoting policies, or open science practices. Simulations across 420 scenarios spanning four diverse datasets show that ignoring fold dependence leads to invalid false positive control in most settings. Repeated cross-validation further compounds this problem, with false positive rates rising toward 100% as the number of repetitions grows. Existing fold-dependence-aware tests rely on strong assumptions because the variance of fold-level statistics and the between-fold correlation cannot be disentangled under standard cross-validation. We therefore propose the SHARP (Split-HAlf RePeated) test, a simple modification to standard cross-validation that enables direct estimation of variance and correlation. Benchmarked against 12 tests, SHARP provides the best overall balance of false-positive control, statistical power, and confidence-interval calibration across simulation schemes. We conclude by providing best practices and reporting guidelines for valid model comparison inference in biomedical machine learning and beyond.
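To make the fold-dependence issue concrete, the sketch below contrasts the naive paired t statistic with the corrected resampled t statistic of Nadeau and Bengio, one existing fold-dependence-aware test. It is not the SHARP test, whose construction is not described in this abstract. The per-fold differences and the train/test sizes are made-up illustrative values, and the function names are hypothetical.

```python
import math
from statistics import mean, variance

def naive_paired_t(diffs):
    """Paired t statistic that treats fold-level differences as independent."""
    k = len(diffs)
    return mean(diffs) / math.sqrt(variance(diffs) / k)

def corrected_resampled_t(diffs, n_train, n_test):
    """Nadeau-Bengio correction: the variance factor 1/k becomes 1/k + n_test/n_train,
    which accounts for overlap between training sets across folds."""
    k = len(diffs)
    return mean(diffs) / math.sqrt((1.0 / k + n_test / n_train) * variance(diffs))

# Hypothetical per-fold accuracy differences (model A minus model B), 5-fold CV.
diffs = [0.02, 0.01, 0.03, 0.02, 0.02]
t_naive = naive_paired_t(diffs)
t_corrected = corrected_resampled_t(diffs, n_train=800, n_test=200)
print(f"naive t = {t_naive:.2f}, corrected t = {t_corrected:.2f}")
```

Both statistics are compared against a t distribution with k - 1 degrees of freedom; because the corrected statistic is always smaller in magnitude, it rejects less often than the naive test on the same fold-level differences.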
bioinformatics 2026-05-22 v2
Large-Scale Assessment of Animal-to-Human Drug Translation Using Natural Language Processing
Doneva, S. E.; Ellendorff, T. R.; Schneider, G.; Held, L.; von Wyl, V.; Simpson, I.; Sick, B.; Ineichen, B. V.
Background: Large-scale estimates of animal-to-human drug translation and the study characteristics associated with successful translation remain limited. The expanding preclinical literature also challenges manual evidence synthesis. We developed a natural language processing (NLP) pipeline to structure and link preclinical and clinical evidence at scale. Methods: In this retrospective meta-research study, we analysed more than 500,000 neuroscience-related animal drug studies from PubMed and linked them to clinical trial and regulatory approval data. NLP methods extracted drug, disease, and experimental design characteristics from abstracts and full texts. Translation was defined as progression to completed phase III/IV trials or regulatory approval. Logistic regression assessed associations between preclinical study characteristics and successful translation. Findings: Among 291,624 drug entities identified in animal studies, 6.7% entered clinical development and 3.1% reached phase III/IV trials or regulatory approval. At the drug-disease level, 4.4% entered clinical development and 1.9% achieved translation. Restricting analyses to successfully linked ontology entities increased estimates to 11.3% and 4.1%, respectively. Male-only animal studies predominated, whereas reporting of randomisation, blinding, and sample size calculations remained limited. Testing across multiple species and reporting blinding were associated with higher odds of successful translation. Interpretation: Only a minority of interventions tested in animals progress to advanced clinical development or regulatory approval. Greater species diversity and blinding were associated with improved translational success. NLP-based evidence synthesis may support scalable evaluation of translational research and identification of potentially modifiable research practices.
bioinformatics 2026-05-22 v1
A community machine learning challenge to predict the effects of gene perturbations on T cell differentiation for cancer immunotherapy
Zhang, J.; Schwartz, M. A.; Mutaher, M.; Olajide, O.; Pritykin, Y.; Ashenberg, O.; Hacohen, N.; Uhler, C.
Perturbations of genes with functional importance in T cells could be used to change the distribution of CD8 T cell states to enhance anti-tumor functions for cancer immunotherapies. We launched a worldwide computational challenge to predict the effects of gene perturbations and to devise objective functions for prioritizing gene perturbations that lead to desired T-cell state distributions. We supported the challenge by generating a single-cell Perturb-seq dataset profiling the effect of knocking out 73 individual expert-defined genes in T cells transferred into a mouse melanoma model. We compared the top algorithms developed by participants and found that performance was primarily determined by the prior data used for gene feature representation, with features derived from perturbational data proving most effective. Experimental validation of the top 61 genes nominated by the algorithms revealed that perturbation of Ndufv2 and Dimt1 reached the defined objective and biased T cell differentiation toward desired states.
bioinformatics 2026-05-22 v1
SpatialCCCbench: Standardized Metrics for the Systematic Evaluation of Spatial Cell-Cell Communication Methods
Dai, W.
Spatial transcriptomics (ST) enables transcriptome profiling with preserved spatial context, providing spatial dimensions that are essential for understanding complex intercellular signals in tissue architecture. ST-based cell-cell communication (CCC) tools integrate spatial and molecular information to decipher intercellular interactions from a spatially informed perspective. Despite the rapid evolution of many CCC computational tools, a systematic assessment of how well they handle ST-specific heterogeneity, use spatial information efficiently, and withstand technical or biological noise is still lacking. To address this gap, SpatialCCCbench incorporates classification accuracy, spatial signal features, robustness, and user-friendliness, aiming to guide the selection of optimal CCC inference tools across diverse spatial biology contexts. SpatialCCCbench systematically evaluates the scenario-specific applicability of ST-based CCC tools. It helps users select tools according to their analytical objectives and provides a practical benchmark for future method development.
bioinformatics 2026-05-22 v1
Min-frame transformation enables more sensitive viral genome alignment
Doughty, R. D.; Banerjee, A.; Kille, B.; Warnow, T.; Treangen, T. J.
Motivation: Maximal unique matches (MUMs) are a fundamental primitive in genome comparison, where they serve as high-confidence anchors for downstream multiple genome alignment. However, because MUMs rely on exact string matching, their effectiveness degrades with increased genome divergence and larger sets of genomes, inhibiting their ability to recover long homologous regions and reducing the number of base pairs covered by the multiple genome alignment. Additionally, existing approaches that improve robustness to mutation, such as spaced seeds or translated alignment methods, introduce trade-offs in specificity, scalability, or computational complexity. Methods: To address this gap, we introduce the Min-Frame Transformation (MFT), a deterministic encoding of nucleotide sequences to sequences over a transformed alphabet that preserves the coordinate structure of the original sequence. At each position, the MFT selects a k-mer from a local window according to a fixed global ordering and assigns it a character in the transformed alphabet via a predefined mapping. This process captures local sequence context and can mask the impact of mutations, increasing the likelihood that homologous regions remain detectable as exact matches. The resulting transformed sequences can be indexed using standard string data structures, such as suffix arrays and suffix trees, enabling efficient extraction of MUMs without modifying existing algorithms. Impact: The MFT is a novel computational approach for improving the robustness of MUM-based seeding for genome alignment by producing longer and more contiguous matches that span a greater fraction of the genome, leading to improved alignment coverage and SNP recall. Altogether, these gains could benefit downstream viral genome analysis applications such as phylogenetic inference and transmission analysis.
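The transformation described in the Methods can be illustrated with a minimal sketch. The abstract does not specify the ordering, the window definition, or the mapping, so everything concrete here is an assumption: lexicographic order stands in for the fixed global ordering, a base-4 integer code stands in for the predefined mapping, and `k`, `w`, and both function names are hypothetical. It is not the authors' implementation.

```python
def kmer_symbol(kmer):
    """Hypothetical mapping: encode a k-mer as a base-4 integer symbol."""
    code = 0
    for base in kmer:
        code = code * 4 + "ACGT".index(base)
    return code

def min_frame_transform(seq, k=3, w=4):
    """For each position i, take the w k-mers starting at i..i+w-1, select the
    smallest under lexicographic order (an assumed global ordering), and emit
    its symbol. Output position i corresponds to input position i."""
    n_kmers = len(seq) - k + 1
    out = []
    for i in range(n_kmers - w + 1):
        window = [seq[j:j + k] for j in range(i, i + w)]
        out.append(kmer_symbol(min(window)))
    return out

# Two toy sequences that differ by a single substitution at index 4 (T -> G).
a = "ACGTTGCAAC"
b = "ACGTGGCAAC"
ta = min_frame_transform(a)
tb = min_frame_transform(b)
print(ta)
print(tb)
```

In this toy pair the k-mers overlapping the substitution are never the window minimum, so the two transformed sequences match exactly even though the originals do not. That is the masking effect the abstract describes, and it means an exact-match index built on the transformed sequences could still anchor across the mutated site.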
bioinformatics 2026-05-22 v1
AbSolution: interactive exploration of sequence-derived features in AIRR-seq repertoires
Garcia-Valiente, R.; Triantafyllou, C.; van Schaik, B.; Jongejan, A.; Pollastro, S.; Anang, D. C.; Guikema, J. E.; de Vries, N.; Hoefsloot, H. C.; van Kampen, A. H. C.
High-throughput sequencing of B-cell and T-cell immune receptor repertoires provides unprecedented insight into adaptive immune responses. The data produced are structured by clonal relationships and somatic mutation signatures, and yield extremely rich information in sequence-derived features, including physicochemical properties and compositional patterns. However, integrated analysis across datasets, conditions, and time points remains challenging. Current analytical tools typically focus only on certain features within individual repertoires, without enabling integrated, multivariable comparisons across datasets, conditions, and time points to address their diversity and variability. Here we present AbSolution, a user-friendly and flexible interactive application for comprehensive exploration of immune repertoires and their sequence-based properties. AbSolution enables multiscale analysis of thousands of sequence-derived features across receptor regions, while accounting for V(D)J usage, clonal composition and experimental groupings. We demonstrate its utility by identifying distinct sequence-based profiles associated with dominant (highly abundant) and non-dominant B-cell clones in peripheral blood BCR repertoires from patients with idiopathic inflammatory myopathies, and with antigen-responsive T-cell populations over time in a longitudinal in vitro antigen-stimulation dataset. Through interactive, interlinked visualizations, statistical feature selection and multi-sample comparisons, AbSolution facilitates integrated feature profiling that supports the interpretation of immune selection processes and enables systematic analysis of complex repertoire datasets.
bioinformatics 2026-05-22 v1
Metabarcode and transcriptome datasets of Pinus sylvestris to assess fungal phyllosphere and disease dynamics.
Moore, B.; Perry, A.; Kaur, S.; Crampton, B.; Gurung, A.; Beaton, J.; Cottrell, J. E.; Stockan, J. K.; Smith, V. A.; Morris, J.; Hedley, P. E.; Nemeth, K.; Barber, H.; Cavers, S.; Jones, S.
Understanding how host-microbiome interactions influence tree disease is critical for understanding forest resilience. Here, we present foliar microbiome ITS2 metabarcoding and transcriptomic datasets from Pinus sylvestris to investigate susceptibility to Dothistroma needle blight (DNB), a globally important foliar disease caused by Dothistroma septosporum. We hypothesised that host genotype shapes foliar microbial communities and their interactions, thereby influencing disease outcomes. Samples were collected from a progeny-provenance field trial in the south of Scotland representing a broad spectrum of disease susceptibilities. The dataset comprises ITS2 metabarcoding samples from 200 genotypes across three timepoints and RNAseq samples from 48 genotypes across two timepoints. Sampling captured key stages of pathogen exposure and disease progression. Both standardised and bespoke protocols were used for nucleotide extraction, sequencing, and quality control, including multiple negative and positive controls. These datasets, available in the European Nucleotide Archive (project accession PRJEB88228), enable analysis of temporal dynamics in foliar fungal communities, host-microbiome transcriptional responses, and genotype-dependent variation in disease susceptibility.
bioinformatics 2026-05-21 v3
Counterfactual Explanations for Graph Neural Networks in Patient Outcome Prediction
Chaidos, N.; Dimitriou, A.; Calzi, H.; Casiraghi, E.; Stamou, G.; Valentini, G.
Counterfactual Explanation (CE) algorithms have been successfully applied to uncover the main factors driving computational diagnostic and prognostic predictions on tabular medical data. Recently, a new Network Medicine paradigm has been introduced for patient diagnosis and prognosis using Patient Similarity Networks (PSNs), i.e. graphs where patients are represented as nodes and their clinical and biomolecular similarities as edges. In this context, graph-based algorithms, including Graph Neural Networks (GNNs), can provide predictions using not only individual patient features but also their relations within a network of clinically and biomolecularly similar individuals. In this work, we propose the first CE algorithm tailored to explain diagnostic and prognostic predictions within PSNs. Alongside a contrastive GNN backbone, we introduce a versatile, model-agnostic counterfactual search method compatible with any underlying classifier. Preliminary results on synthetic data and on a cohort of patients affected by Alzheimer's disease show that our algorithm is competitive both with seminal tabular-based CE algorithms and with GNNExplainer, a well-established method for explaining graph-based classification tasks.
bioinformatics 2026-05-21 v2
sxLaep: a Lightweight and Accurate Enzyme Predictor for High-throughput Mining of Metagenomic Sequences
Duan, H.; Han, X.; Mo, Y.; Ren, B.; Xia, L. C.
Motivation: Metagenomic sequencing generates petabyte-scale sequence datasets that strain both deep learning and alignment-based enzyme annotation tools. A lightweight, rapid, and accurate pre-screening tool is needed to identify enzymatic sequences prior to resource-intensive functional prediction. Results: We present sxLaep (Lightweight and Accurate Enzyme Predictor), a resource-efficient framework using lightweight physicochemical features for enzyme pre-screening. On the external validation set, sxLaep completed prediction in only 0.002 s/sequence, which is 22.9-fold faster than Diamond (0.0457 s/sequence). It used 372.16 MB peak memory, corresponding to a 54.4% memory reduction relative to Diamond (815.64 MB). sxLaep achieved an accuracy of 99.34% and the highest recall in remote homology detection, including enzyme candidates missed by alignment-based methods. We further successfully applied sxLaep to a marine metagenomic enzyme-mining workflow, demonstrating its utility for high-throughput discovery from large-scale metagenomic sequences. Availability and Implementation: sxLaep is available as a Python package at https://pypi.org/project/sxlaep and is maintained as an open-source software repository at https://github.com/labxscut/sxLaep. Detailed installation, usage, and Docker deployment instructions are provided in the GitHub repository to support reproducible enzyme prediction and model execution.
bioinformatics 2026-05-21 v2
Unique molecular identifiers don't need to be unique: a collision-aware estimator for RNA-seq quantification
Agyemang, D.; Irizarry, R. A.; Baharav, T. Z.
RNA-sequencing (RNA-seq) relies on Unique Molecular Identifiers (UMIs) to accurately quantify gene expression after PCR amplification. Longer UMIs minimize collisions, in which two distinct transcripts are assigned the same UMI, at the expense of increased sequencing and synthesis costs. However, it is not clear how long UMIs need to be in practice, especially given the nonuniformity of the empirical UMI distribution. In this work, we develop a method-of-moments estimator that accounts for UMI collisions, accurately quantifying gene expression and preserving downstream biological insights. We show that UMIs need not be unique: shorter UMIs can be used with a more sophisticated estimator.
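A minimal sketch of the idea behind a collision-aware method-of-moments estimate follows. It assumes UMIs are drawn uniformly at random, which is a simplification: the abstract stresses that the empirical UMI distribution is nonuniform, and the authors' estimator is not reproduced here. Under uniformity with K = 4^L possible UMIs and n molecules, the expected number of distinct UMIs is K(1 - (1 - 1/K)^n); equating this to the observed count and solving for n gives the estimate. Function names and the example values are hypothetical.

```python
import math

def expected_distinct_umis(n_molecules, umi_length):
    """Expected number of distinct UMIs when n molecules draw uniformly from 4**L UMIs."""
    K = 4 ** umi_length
    return K * (1.0 - (1.0 - 1.0 / K) ** n_molecules)

def collision_aware_count(n_distinct, umi_length):
    """Method-of-moments estimate of the molecule count from the distinct-UMI count,
    assuming uniform UMI usage (an assumption, not the paper's model)."""
    K = 4 ** umi_length
    if n_distinct >= K:
        raise ValueError("every UMI was observed; the molecule count is not identifiable")
    return math.log1p(-n_distinct / K) / math.log1p(-1.0 / K)

umi_length = 5          # only 4**5 = 1024 possible UMIs, so collisions are common
true_molecules = 500
observed = expected_distinct_umis(true_molecules, umi_length)
estimate = collision_aware_count(observed, umi_length)
print(f"distinct UMIs = {observed:.1f}, corrected estimate = {estimate:.1f}")
```

Counting distinct UMIs directly would undercount the molecules here, whereas inverting the occupancy formula recovers the true count when fed the expected number of distinct UMIs.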
bioinformatics 2026-05-21 v2
GlyComboCLI enables command line-based FAIR workflows for glycan composition assignment in mass spectrometry data
Kelly, M. I.; Thang, W. C. M.; Pang, C. N. I.; Gustafsson, O. J. R.; Ashwood, C.
Glycans are integral biomolecules whose presence cannot be predicted from genomic data alone, necessitating experimental characterisation through approaches including mass spectrometry. Assignment of glycan compositions to observed mass-to-charge ratios is computationally challenging due to the potential monosaccharide diversity, and existing tools lack the flexibility required for integration into automated bioinformatic workflows. Here, we present GlyComboCLI, an open-source command-line application for the assignment of glycan compositions to mass spectrometry data, which expands upon our previous GUI application, GlyCombo. GlyComboCLI accepts mass lists and vendor-neutral mzML files and supports an extensive range of monosaccharides, derivatisation states, reducing-end modifications, and adducts to ensure compatibility with a breadth of glycomics approaches. Outputs are compatible with downstream tools including Skyline and GlycoWorkBench. This software is deployable as a standalone executable, a Docker container, and a Galaxy tool, adhering to FAIR principles. When applied to 52 raw files from a published mouse glycomics dataset, a local instance completed composition assignment and downstream quality control in under three hours, recovering biologically consistent findings. Furthermore, an integrated Galaxy workflow demonstrated reproducible detection of sialidase treatment effects. GlyComboCLI substantially reduces the pool of spectra requiring manual structural interpretation, offering a flexible and scalable solution for glycomics bioinformatic workflows.
bioinformatics 2026-05-21 v2
Multi-layer transcriptomic characterization of age-related immune dynamics
Zhao, Z.; Zhao, S.; Jin, J.; Ni, T.
Despite the pivotal role of mRNA isoform diversity in governing immune cell function, current investigations into peripheral immune aging have predominantly focused on gene-level expression, obscuring deeper regulatory layers of transcriptome complexity. Here, we leveraged a 5' scRNA-seq atlas comprising approximately 2.5 million PBMCs from 378 healthy donors. We demonstrate that immune aging is characterized by profound, non-linear transcriptional reprogramming that extends beyond gene-level shifts to include fine-tuned regulation of alternative transcription initiation and splice site selection. By quantifying the transcriptional activity of cis-regulatory elements, we resolved their contributions to age-related expression dynamics. Notably, we identified a subset of endogenous retroviruses that are reactivated in older individuals, some of which served as alternative promoters driving the production of chimeric transcripts. Furthermore, our analysis revealed EDA as a top-ranked gene consistently upregulated with age across multiple independent cohorts. Increasing EDA expression in in vitro-stimulated naive CD4+ T cells from young individuals recapitulated aged phenotypes. This comprehensive resource elucidates the multi-layered transcriptomic landscape of the aging immune system and facilitates the identification of novel drivers of immune aging.
bioinformatics 2026-05-21 v1
Structural Pockets and Interacting RNA-Associated Ligands (SPIRAL): A DSSR-enabled Meta-Analysis of RNA-Small Molecule Recognition
Lu, X.-J.; Wang, Y.
Small molecules that target structured RNA hold therapeutic promise across a wide range of diseases, yet the structural principles governing RNA-ligand recognition remain poorly defined. Here we present SPIRAL (Structural Pockets and Interacting RNA-Associated Ligands), a curated database of 1,098 RNA-small molecule structures from the Protein Data Bank covering 1,137 ligand-binding events across six functional RNA categories: riboswitches, ribozymes, synthetic aptamers, G-quadruplexes, ribosomal RNA, and regulatory RNA motifs. A customized pipeline built on DSSR (Dissecting the Spatial Structure of RNA) extracts structural interaction parameters from each complex, capturing stacking geometry, hydrogen-bond topology by RNA moiety, backbone contacts, groove engagement, and tertiary motif context. Unsupervised clustering of these fingerprints resolves six mechanistically distinct binding modes whose distribution is strongly governed by RNA functional class, demonstrating that different RNA categories engage small molecules through fundamentally different chemical strategies. To enable category-independent comparison of interaction quality across these mechanistically diverse modes, we introduce the Composite Binding Quality Score (CBQS), a seven-metric framework that ranks riboswitches highest and regulatory RNA motifs lowest among the six categories, while ribozymes, synthetic aptamers, and G-quadruplexes achieve statistically equivalent intermediate scores through three distinct recognition strategies. Analysis of 275 non-redundant affinity-characterized entries identifies C2'-endo sugar pucker count and total buried contact surface area as the dominant independent predictors of binding affinity. 
Both predictors are enriched at junction loops, pseudoknots, and base multiplet networks, the same tertiary structural sites most under-engaged by current regulatory RNA motif binders, suggesting that ligands designed to contact these sites would improve both potency and selectivity simultaneously.
bioinformatics 2026-05-21 v1
Differential Gene Expression in the Tropical House Cricket and Its Iridovirus in Healthy versus Diseased Specimens
Hinton, J. A.; Walt, H. K.; Duffield, K. R.; Ramirez, J. L.; Meyer, F.; Hoffmann, F. G.
The tropical house cricket, Gryllodes sigillatus, is a mass-produced insect that is used as a protein source for pets and livestock. However, intensive mass-rearing conditions, coupled with high genetic relatedness, create an ideal environment for the spread of pathogenic microbes that severely impact production. Cricket iridovirus (CrIV) is a pathogen that impedes cricket growth and causes significant losses for cricket farmers. Interestingly, recent studies have shown that CrIV is often present asymptomatically, yet the molecular basis of the emergence of disease symptoms remains unknown. To address this, we sampled healthy and diseased crickets and examined differences in cricket and CrIV gene expression via RNAseq. Using differential gene expression analysis and functional enrichment analysis, we found significant differences in host and viral gene expression between healthy and diseased crickets, including genes involved in immunity. Interestingly, while we observed high CrIV gene expression across the entire CrIV genome in sick populations, healthy asymptomatic populations showed elevated expression at a single viral locus. Our results not only shed light on the cricket immune response to CrIV infection but also identify a viral gene that is highly expressed during covert infections, suggesting its potential role in suppressing the host's immune response. These findings enhance our understanding of how CrIV interacts with its cricket host, providing essential insights for developing targeted strategies to manage CrIV outbreaks in cricket mass-rearing facilities.
bioinformatics 2026-05-21 v1
BioRAG-DRAG: A Multimodal Biological Retrieval Layer for Local-First Biomedical Agents
Wang, L.
Biomedical agents need reliable access to heterogeneous evidence: literature text, gene and pathway records, protein sequences, DNA/cDNA sequences, and structured biological relations. Classical sequence tools such as BLAST remain the right choice for alignment-grounded verification, but they are not a unified context interface for large language model agents. We present BioRAG-DRAG, a local-first multimodal retrieval layer that combines pluggable neural sequence-text retrieval, BLAST verification, and graph-based evidence packaging. Specialized encoders such as ESM-2 can serve protein partitions, while OmniGene CPT provides a unified biological-language backbone for mixed sequence/text and agent-facing use; BLAST reranks or verifies sequence candidates; and DRAG graphs expose typed, traceable paths for downstream agents. We introduce BioRAG-Standard v0, a partitioned corpus/library with 257,886 retrievable records and an initial annotation layer for engineering evaluation built from Open-Rosalind Standard biomedical records and sequence-window extensions. On an in-index sequence-window stress test, BLAST nearly saturates biological matching, while vector retrieval recovers substantial but lower biological match rates. On held-out parent-fragment controls, public protein encoders outperform the current OmniGene protein-window embedding, while DNA/cDNA dense retrieval remains weak even with off-the-shelf Nucleotide Transformer pooling; this supports a model-agnostic BioRAG design rather than a claim that one unified generator backbone is the best sequence-search encoder. Indexed Chroma lookup over Standard text and 100k sequence-window collections adds only small lookup overhead after query embedding; this does not measure end-to-end instant latency. 
Finally, exploratory sequence DRAG traces show inspectable biological neighborhoods, including immunoglobulin-family and gene-symbol modules, with initial graph controls indicating non-random but partly sequence-similarity-driven structure. These results support a bounded architecture: vector retrieval supplies unified candidate context, while BLAST and DRAG provide biological verification and evidence attribution.
bioinformatics 2026-05-21 v1
A Bayesian modelling framework for inference of latent infection risk patterns from virus neutralisation assay titration data
Alrefae, T. A.; Pons-Salort, M.; Donnelly, C. A.; Lambert, B.; Kamau, E.
Serological assays remain the standard experimental approach for estimating the cumulative incidence of a pathogen and monitoring population immunity. The predominant approach for analysing serum titration data from virus neutralisation assays uses a nearly century-old interpolation-based method which neglects inherent imperfections in the assay and produces estimates with no measure of uncertainty. We introduce a two-part Bayesian modelling framework to estimate the underlying antibody concentrations in the raw serum samples taken from serosurveyed individuals, to improve the interpretation of serological data over age. First, we develop a mechanistic Bayesian model for serum antibody titration data that estimates latent antibody concentrations while accounting for assay variability and quantifying uncertainty. Second, we propagate this uncertainty into an age-structured serocatalytic model by integrating over posterior draws of individual antibody concentrations, allowing joint inference on latent serostate membership, force of infection, and serological waning rate. We use this framework to explore the dynamics of infection and immunity for three enterovirus serotypes: enteroviruses A71 (EV-A71) and D68 (EV-D68) and coxsackievirus A6 (CVA6). These serotypes are leading causes of outbreaks of severe respiratory illness and hand, foot, and mouth disease. Applying these approaches to three cross-sectional serosurveys, we estimated consistently higher and more persistent antibody concentrations throughout life for EV-D68 compared to EV-A71 and CVA6. Our analysis suggests that the proportion of recently infected individuals (i.e. individuals with high estimated antibody concentration levels given their age) peaks around 25% by age 7 years for both EV-A71 and CVA6 before gradually declining with age. In contrast, for EV-D68 the inferred proportion of the population in the infected state exceeds 50% by age 9 years and continues to grow with age.
We also estimate that EV-D68 antibody concentration levels are higher than those of the other two serotypes, with the force of infection estimated to be highest in early childhood and declining more gradually with age than for EV-A71 and CVA6. These estimates differ from previous estimates in the literature. Our inferential framework uncovers the wide-ranging variation in antibody levels that is often obscured by conventional endpoint titre estimation methods. We demonstrate that our framework can infer infection rates without relying on predetermined seropositivity cut-offs and without making explicit assumptions of virus-specific infection mechanisms.
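For readers unfamiliar with serocatalytic models, the sketch below shows the textbook reversible catalytic model that relates force of infection and waning rate to the seropositive fraction by age. It assumes a constant force of infection and a constant waning rate, which is simpler than the age-structured model in the abstract, and it omits the titration model and latent serostate inference entirely. The parameter values and the function name are hypothetical.

```python
import math

def seroprevalence(age, foi, waning):
    """Seropositive fraction at a given age under dp/da = foi*(1 - p) - waning*p, p(0) = 0.
    Closed form: p(a) = foi/(foi + waning) * (1 - exp(-(foi + waning) * a))."""
    total = foi + waning
    return foi / total * (1.0 - math.exp(-total * age))

foi = 0.10      # hypothetical force of infection, per year
waning = 0.02   # hypothetical serological waning rate, per year
plateau = foi / (foi + waning)   # long-run seropositive fraction

for age in (1, 5, 10, 20, 40):
    print(f"age {age:>2}: seropositive fraction = {seroprevalence(age, foi, waning):.3f}")
```

With waning included, the curve rises toward the plateau foi / (foi + waning) rather than toward 1, which is why force of infection and waning rate have to be inferred jointly from age-stratified data.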
bioinformatics 2026-05-21 v1
KmerSignificance Score: A discriminative and biologically-informed framework for viral k-mer prioritization
Lebatteux, D.; Corso, F.; Soudeyns, H.; Boucoiran, I.; Gantt, S.; Banire Diallo, A.
Distinguishing closely related viral strains requires identifying genomic regions where subtle sequence differences carry biological significance. While k-mer-based approaches offer computational efficiency for genome analysis, existing methods lack standardized frameworks for evaluating which k-mers are most informative. Current selection strategies focus primarily on statistical discriminative power without integrating biological relevance. We introduce KmerSignificance Score (KSS), a k-mer prioritization framework combining three components: an information-theoretic method measuring strain-distinguishing capacity, an optimized amino acid substitution matrix (MIYATA_EVO) for mutation impact assessment, and protein-level functional importance scoring derived from UniProt annotations. KSS produces standardized scores in the [0,1] interval, enabling direct cross-dataset comparison. The discriminative component achieved classification performance comparable or superior to all tested alternatives (mean F1 = 0.880 vs. 0.718-0.877 for six established methods) while additionally providing bounded scores with consistent empirical distributions for cross-dataset comparability. MIYATA_EVO, optimized via genetic algorithm, improved biophysical property correlations by 28.4% over the original MIYATA matrix. Protein scoring on 17,470 viral proteins showed robust agreement with UniProt annotation scores (Kendall τ = 0.777) while revealing finer functional distinctions. Literature validation on SARS-CoV-2 (278,738 sequences, 19 variants), HIV-1 (12,223 sequences, 15 subtypes), and human cytomegalovirus (HCMV; 399-646 sequences, 4-8 genotypes) confirmed that high-scoring k-mers consistently map to established variant-defining mutations, subtype-specific polymorphisms, and genotype markers. KSS provides a standardized framework for viral k-mer prioritization with applications in variant surveillance, molecular epidemiology, and functional annotation.
The tool is available at https://github.com/bioinfoUQAM/KmerSignificanceScore.
bioinformatics2026-05-21v1Recovering biological structure in sparse single-cell proteomics with GIRAFI
Zhong, H.; Chi, S.; Wong, R.; Rogalski, J.; Wang, Z.; Chan, S.; Bailey, M. L.; Ebrahimi, A.; Jayme, G.; Yin, J.; Gong, A.; Snutch, T. P.; Maier, C. S.; Marra, M. A.; Foster, L. J.; Tang, X.Abstract
Single-cell proteomics (SCP) based on liquid-chromatography mass-spectrometry resolves protein-level cellular heterogeneity, but interpretation remains limited by detection-linked sparsity. SCP profiles continuous, peptide-derived intensities and has lower throughput than single-cell RNA sequencing, making denoising methods for large-scale, count-based transcriptomics difficult to apply. Here we present GIRAFI, a graph-informed statistical learning framework that imputes missing values and reveals reproducible cell states by constraining inference to dataset-aware, prior-knowledge-informed protein neighborhoods. We evaluated GIRAFI across SCP datasets spanning diverse biological/technical contexts. In masking-based recovery experiments and cell-type-specific protein-protein interaction inference, GIRAFI outperformed existing methods, and matched bulk proteomics comparisons corroborated recovery accuracy and ablations supported the graph-informed design. Beyond reduced replicate- and source-associated technical structure, GIRAFI recovered ground-truth cell-type annotations, improved cell state-resolved pathway analysis, and enabled trajectory inference consistent with known time courses. These results establish graph-constrained imputation as an effective strategy for improving SCP robustness, biological structure, interpretation, and cross-dataset comparability.
bioinformatics2026-05-21v1Multimodal single-cell analysis uncovers transcription factor networks underlying T-cell aging
Shaigan, M.; Puri, D.; Fornero, G.; Kruger, R.; Steiger, M.; Klump, H. J.; Meissner, A.; Kretzmer, H.; Wagner, W.; Gesteira Costa Filho, I.Abstract
Aging of the immune system is associated with chronic inflammation and impaired immune function, yet the regulatory mechanisms underlying these changes remain incompletely understood. Here, we generated paired single-cell transcriptomic and chromatin accessibility profiles from peripheral blood mononuclear cells of young and old healthy donors to characterize immune aging at single-cell resolution. Using an integrative computational framework for multi-omic single-cell analysis, we detected pronounced age-associated changes in T cells, including loss of naive CD8+ T cells and expansion of differentiated memory and effector populations. Aging was accompanied by increased inflammatory signaling and reduced oxidative phosphorylation programs. Enhancer-based gene regulatory network analyses identified a reduced role of TCF7 and increased activity of inflammatory regulators, including FOSL2, in aged T cells. Integration with genetic association and eQTL datasets further supported the functional relevance of age-associated regulatory regions and their target genes.
bioinformatics2026-05-21v1Nipoppy: A framework for standardizing neuroimaging studies to facilitate international derived-data sharing
Bhagwat, N.; Wang, M.; Dugre, M.; Pfarr, J.-K.; Dai, A.; Urchs, S.; McPherson, B.; Gau, R.; van Heese, E. M.; d'Angremont, E.; Laansma, M. A.; Prasad, S.; Sanz-Robinson, J.; Torabi, M.; Jahanpour, A.; Danyluik, M.; Joubert, A.; Macdonald, A.; Waller, L.; Stewart, A.; Joulot, M.; Dickie, E.; Devenyi, G. A.; Bouix, S.; Bollmann, S.; Jahanshad, N.; Thompson, P. M.; Burgos, N.; Chakravarty, M. M.; Halchenko, Y. O.; van der Werf, Y. D.; Poline, J.-B.Abstract
Neuroimaging data management and processing are tedious and error-prone, prompting reproducibility concerns. Globally, studies with heterogeneous infrastructure and governance policies lead to eclectic data processing and sharing, necessitating standardization of data workflows to ensure reusability and comparability of multi-centric datasets. The Nipoppy neuroinformatics framework facilitates such standardization by combining specification, protocol, and software to manage study-level data workflows. With its adoption, researchers can share standardized, derived datasets enabling efficient, reproducible, and inclusive research.
bioinformatics2026-05-21v1geneML: Gene annotation across diverse fungal species using deep learning
Vader, L.; Harvey, C. J.; Weber, T.; Hon, L. S.Abstract
Accurate gene prediction remains a major bottleneck in fungal genomics, where lineage diversity and alternative splicing challenge existing ab initio methods. Here, we present geneML, a deep learning-based gene prediction tool tailored to fungal genomes. Across nine reference genomes spanning diverse fungal taxa, geneML improved gene-level F1 score from 64.9 to 67.1 compared to BRAKER3 with protein-based hints, driven by substantially higher recall (69.0 vs. 64.1) at equivalent precision. geneML also remains fast, averaging around 6 minutes per genome on a standard 8-core CPU. A key feature of geneML is its ability to predict alternative transcripts. Compared to Fusarium graminearum Iso-Seq control data, it achieves 41.1% transcript recall and 71.1% precision, outperforming AUGUSTUS (33.8% recall, 48.9% precision), one of the few tools that support isoform prediction. The predicted transcript diversity is consistent with experimentally observed fungal alternative splicing patterns. Reannotation of the curated training dataset further suggests improved biological completeness, with geneML recovering 15.3% more genes containing complete PFAM domains than the reference annotation. These results demonstrate that geneML enables faster, more sensitive, and more biologically informative fungal genome annotation. geneML is available as an open-source command-line tool at https://github.com/hexagonbio/geneML.
bioinformatics2026-05-21v1A phylogeny-guided framework for decoding mechanisms of human endogenous retrovirus regulation in health and disease
Patterson, A.; Duong, B.; Yoon, L.; Foster, M.; MacMullen, L.; Wickramasinghe, J.; Lucas, A.; Srivastava, A.; Jacobson, S.; Murphy, M. E.; Soldan, S.; Lieberman, P. M.; Auslander, N.Abstract
Human endogenous retroviruses (HERVs) are remnants of ancient infections that make up ~8% of the human genome. Their activity influences development, immunity, and cancer, but studying them has been limited by a key technical challenge: short-read sequencing cannot uniquely assign reads to these highly repetitive elements. Here, we present ERVmancer, a phylogeny-informed method that resolves the read-mapping ambiguity and quantifies HERV expression across scales, from individual loci to entire retroviral clades, depending on mapping confidence. Benchmarking with sample-matched long- and short-read data generated in this study demonstrates that ERVmancer outperforms existing approaches in both sensitivity and specificity. Application of ERVmancer recapitulates known HERV expression patterns in multiple sclerosis and uncovers new biology in breast cancer, including suppression of HERVH-LTR7 by p53. By enabling accurate and scalable quantification of integrated retroviral elements, ERVmancer provides a broadly applicable resource for investigating retroviral mechanisms in health and disease.
bioinformatics2026-05-21v1Heterogeneity-driven adaptive scale graph learning for subcellular spatial transcriptomics
Shi, W.; Shen, C.; Liu, Y.; Xiao, Q.; Luo, J.Abstract
Spatial transcriptomics enables gene expression profiling within intact tissue sections, providing an important basis for analyzing tissue organization, cellular heterogeneity, and microenvironmental interactions. However, existing spatial structure identification methods often integrate spatial information using fixed neighborhoods or predefined smoothing scales, which limits their ability to adapt to region-specific structural heterogeneity. In homogeneous regions, broader spatial smoothing can help preserve continuous tissue structures, whereas in regions with complex boundaries or mixed cell populations, excessive smoothing may obscure local expression differences and fine-scale structural changes. Therefore, it is necessary to develop an adaptive graph learning framework that can adjust the range of spatial information integration according to tissue structural heterogeneity. In this study, we propose HAST, a heterogeneity-driven adaptive-scale graph learning framework for spatial transcriptomics. HAST adaptively determines graph filtering scales according to spatial structural heterogeneity, enabling flexible information aggregation across different tissue regions. It further decomposes gene expression signals into low-frequency structural components and high-frequency residual components, thereby jointly modeling global spatial continuity and local expression variations. Experiments on high-resolution spatial transcriptomics datasets show that HAST improves spatial structure identification and cross-section generalization. Tumor-enriched cluster identification and neighborhood enrichment analysis further demonstrate its ability to characterize tumor-associated spatial regions and microenvironmental organization.
bioinformatics2026-05-21v1Spectral Prompting: Unsupervised Recovery of Human Hair Follicle Cell-Type and Multiscale Systems Architecture from Bulk and Single-Cell RNA-Seq Datasets via Single-Gene Seeded Spectral Unfolding
Purba, T.Abstract
Bulk RNA sequencing datasets are assumed to carry minimal resolvable programmatic and cell type biological information; as such, in the absence of single-cell resolution, researchers prioritise data analysis approaches based on differential expression, or rely on deconvolution and co-expression methods that require external reference panels, large multi-sample cohorts, or prior single-cell data to resolve cell-type structure. Here I describe the recovery of specialised cell-type and systems gene expression architecture resolved from a static gene expression dataset of untreated cultured human hair follicles (pooled from N=12 patients) isolated from scalp skin. To achieve this, I used graph-theoretic methods to mathematically transform gene expression data into a latent space of relational structure, which was spectrally organised into coarse- and fine-grained modes and partitioned using a purpose-built computational algorithm. This permitted the synthesis of a computational Spectral Prompting system, whereby a single gene can be seeded to unfold to reveal associated partners across manifold projections in gene expression space. Individual projections across the manifold can reveal rich individual gene expression programmes, which can then be aggregated to identify core-associated genes for a given spectral gene prompt, both within the manifold analysed and across more than one manifold construction. With this, I recover hitherto unresolved gene expression programmes from bulk data, including, but not limited to, epithelial hair follicle stem cell (eHFSC), hair shaft, dermal papilla and endothelial gene expression signatures. Focusing on querying KRT15, a human anagen bulge eHFSC and progenitor marker, raw output from individual spectral prompts during testing recovered known eHFSC-associated genes including LGR5, LHX2 and CXCL14, and discovered new candidate human eHFSC and progenitor cell-associated markers, such as RGMA and MUCL1, which were validated in situ.
Finally, I show a brief demonstration that the technique can be similarly applied to single-cell data (GSE129611), whereby a KRT15 gene prompt from a combined expression matrix was mapped to a KRT15+/CXCL14+/LHX2+/DIO2+/SFRP1+ cell population (31/6000 cells) independent of standard clustering tools. Moving forward, from this foundation, the method will be developed to study how latent gene expression space shifts following perturbation or pathology.
bioinformatics2026-05-21v1A framework for peptide identification on commercial nanopore sequencing platforms
Beslic, D.; Kucklick, M.; Graap, E.; Sedaghatjoo, S.; Renard, B. Y.; Fuchs, S.; Engelmann, S.; Koerber, N.Abstract
Direct single-molecule peptide analysis could in principle enable rapid and sensitive identification of pathogen-derived or disease-associated biomarkers without reliance on mass spectrometry. However, existing nanopore peptide sensing methods are typically constrained by limited throughput and lack of accessibility beyond specialized setups. Here, we present an integrated experimental-computational framework for DNA-linked peptide translocation on a commercially available, high-throughput nanopore sequencing platform, the MinION. Synthetic peptides were covalently bound to oligonucleotides at both termini. The resulting peptide-DNA constructs were then translocated through the CsgG-CsgF pores using a DNA motor protein. Current traces were segmented using the known DNA sequences to extract peptide-associated signal regions. From these segments, we extracted signal features and trained feature-based and deep-learning classifiers to distinguish peptides, balancing interpretability and classification performance. We establish a framework for peptide identification using standard nanopore sequencing hardware. Across a diverse panel of synthetic peptides, our approach resolves single-amino-acid substitutions, maintains performance across independent sequencing runs, and correctly identifies peptides in blind mixtures. Interpretable model analyses connect classifier decisions and common errors to specific signal motifs. By combining commercially available instrumentation with a reproducible experimental and computational workflow, this framework lowers the barrier to nanopore-based proteomics and enables broader adoption across laboratories. It provides a foundation for future developments in amino acid modification detection and sequence analysis.
bioinformatics2026-05-21v1S-IGTD: supervised tabular-to-image topology learning via between-group correlation for multiclass classification of biological data
WU, H.-M.Abstract
Motivation: Tabular-to-image methods allow convolutional neural network (CNN)-based classifiers to analyse high-dimensional biological tables by mapping features onto a two-dimensional grid. Existing layouts are usually driven by unsupervised global correlation, which can place class-discriminative features far apart when nuisance or housekeeping covariation dominates the total covariance structure. Results: We present the Supervised Image Generator for Tabular Data (S-IGTD), a supervised extension of IGTD that optimizes tabular-to-image topology by replacing total-correlation distance with one minus the absolute between-group correlation, computed from class-wise feature means, under the Within-And-Between-Analysis (WABA) decomposition. We prove entrywise consistency of the supervised distance matrix under standard moment conditions and identify balanced-class settings in which S-IGTD improves a Signal Dispersion Score (SDS)-related topology objective. In controlled simulations targeting between-group signal, S-IGTD outperformed Euclidean- and correlation-distance IGTD variants in SDS, accuracy and macro-F1 score. Across five biological benchmarks ranging from 4- to 91-class classification, S-IGTD produced compact class-supervised layouts, with 24/35 Holm-adjusted significant SDS wins against seven non-reference layout controls. As a secondary downstream diagnostic, a CNN with batch normalization showed higher mean accuracy than random layouts and correlation-distance IGTD on all real datasets, and higher mean accuracy than Euclidean-distance IGTD on four of five datasets, with the clearest gains on large multiclass cancer and methylation benchmarks. Availability and implementation: Source code, datasets, configuration files and reproducibility scripts are freely available at https://github.com/hanmingwu1103/S-IGTD.
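The supervised distance described in the abstract above (one minus the absolute between-group correlation, computed from class-wise feature means) can be illustrated with a minimal sketch. This is not the S-IGTD implementation: the function names are hypothetical, class means are weighted equally regardless of class size (the WABA decomposition may weight them differently), and features whose class means are constant are assumed to have zero between-group correlation.

```python
from math import sqrt

def class_means(X, y):
    """Return one row of per-feature means for each class label in y."""
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append(row)
    return [[sum(col) / len(rows) for col in zip(*rows)]
            for rows in groups.values()]

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (v - mb) for x, v in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((v - mb) ** 2 for v in b)
    if va == 0 or vb == 0:
        # Assumption: a feature with identical class means has no
        # between-group signal, so its correlation is set to 0.
        return 0.0
    return cov / sqrt(va * vb)

def between_group_distance(X, y):
    """Feature-by-feature matrix of 1 - |between-group correlation|."""
    cols = list(zip(*class_means(X, y)))  # class means, one tuple per feature
    p = len(cols)
    return [[0.0 if i == j else 1.0 - abs(pearson(cols[i], cols[j]))
             for j in range(p)] for i in range(p)]

# Toy table: 6 samples, 4 features, 3 classes (illustrative values only).
X = [[0.5, 9, 3.5, 1], [1.5, 11, 2.5, 1],
     [2, 20, 2, 3], [2, 20, 2, 3],
     [3, 30, 1, 1], [3, 30, 1, 1]]
y = ["a", "a", "b", "b", "c", "c"]
D = between_group_distance(X, y)
```

In this toy example, features 0, 1 and 2 have class means that move together (or in exact opposition) across classes, so their pairwise distances are close to 0. Feature 3 varies independently of feature 0 across classes and sits at distance 1 from it.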
bioinformatics2026-05-21v1Deciphering context-dependent epigenetic program by network-based prediction of clustered open regulatory elements from single-cell chromatin accessibility
Park, S.; Ma, S.; Lee, W.; Park, S. H.Abstract
Large cis-regulatory domains, spanning tens to hundreds of kilobases, are pivotal in orchestrating cell-state-specific transcriptional programs that define cellular identity. However, existing single-cell analytical frameworks lack the capacity to identify these higher-order structures, thereby obscuring the coordinated, domain-level epigenetic regulation essential for complex biological processes. To address this, we introduce enCORE, a computational framework that leverages enhancer-enhancer interaction networks to determine Clustered Open Regulatory Elements (COREs) solely from single-cell ATAC-sequencing data. Our approach faithfully recapitulates established hematopoietic hierarchies and resolves lineage-specific regulatory programs by recovering canonical master transcription factors, frequent chromatin interactions, and enrichment of fine-mapped autoimmune disease-associated genome-wide association study (GWAS) variants. In colorectal cancer, enCORE captures tumor-associated H3K27ac landscapes and prioritizes USP7 as a potential therapeutic candidate, supported by in silico perturbation. Collectively, our framework provides a powerful and scalable platform for deciphering the complex epigenetic architectures underlying human development and disease.
bioinformatics2026-05-20v8Early terminated transcripts and missing proteins reflect artifacts in bacterial proteomes
Insana, G.; Martin, M. J.; Pearson, W. R.Abstract
MMseqs2 clustering was used to examine the uniformity and heterogeneity of proteomes from 20 bacterial species. Using clustering parameters that required 50% sequence overlap, clusters with proteins from 50% of proteomes typically contain proteins from 95% of the proteomes and capture more than 80% of the proteins in an organism. Protein clusters are highly uniform in length; across the 20 bacteria, the median cluster has more than 99% of the proteins at the mode length. While protein lengths in clusters are highly uniform, some clusters contain dozens to hundreds of proteins that are considerably shorter (75%) than the mode-length, and a few clusters include proteins that are 133% the mode length. Most "outlier" proteins are found in fewer than 10% of clusters, and "high-outlier" clusters are over-represented in a small fraction of proteomes. Short-outlier proteins are artifacts; at least 80% of short-outlier genomes contain mode-length copies of the protein in the cluster; 40% of short protein artifacts are produced by sequencing errors (frameshifts and termination codons) while another 40% by initiation codon choice. High "outlier" clusters are concentrated in a small fraction of proteomes, which often have poor Proteome BUSCO fragment scores. As with "short-outlier" proteins, the 5% of proteomes that are excluded from the core (50% participation) cluster set encode the missing protein more than 98% of the time; these proteins were missed because of frameshifts in the genome sequence. MMseqs2 clustering with 50% participation provides robust sets of core bacterial proteins.
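The length-outlier idea in the abstract above can be sketched as follows. This is only an illustration, not the authors' pipeline: the function name is hypothetical, and the cut-offs are one reading of the abstract (a "short" outlier is at most 75% of the cluster's mode length, a "long" outlier at least 133%).

```python
from collections import Counter

def length_outliers(lengths, short_ratio=0.75, long_ratio=1.33):
    """Flag proteins in one cluster whose length deviates from the mode.

    lengths maps protein ID -> sequence length for a single cluster.
    The ratios are assumptions taken from the percentages in the abstract.
    """
    mode_len = Counter(lengths.values()).most_common(1)[0][0]
    short = sorted(p for p, n in lengths.items() if n <= short_ratio * mode_len)
    long_ = sorted(p for p, n in lengths.items() if n >= long_ratio * mode_len)
    return mode_len, short, long_

# Toy cluster (illustrative values only).
cluster = {"p1": 300, "p2": 300, "p3": 300, "p4": 210, "p5": 405, "p6": 298}
print(length_outliers(cluster))  # (300, ['p4'], ['p5'])
```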
bioinformatics2026-05-20v3Early terminated transcripts and missing proteins reflect artifacts in bacterial proteomes
Insana, G.; Martin, M. J.; Pearson, W. R.Abstract
MMseqs2 clustering was used to examine the uniformity and heterogeneity of proteomes from 20 bacterial species. Using clustering parameters that required 50% sequence overlap, clusters with proteins from 50% of proteomes typically contain proteins from 95% of the proteomes and capture more than 80% of the proteins in an organism. Protein clusters are highly uniform in length; across the 20 bacteria, the median cluster has more than 99% of the proteins at the mode length. While protein lengths in clusters are highly uniform, some clusters contain dozens to hundreds of proteins that are considerably shorter (75%) than the mode-length, and a few clusters include proteins that are 133% the mode length. Most "outlier" proteins are found in fewer than 10% of clusters, and "high-outlier" clusters are over-represented in a small fraction of proteomes. Short-outlier proteins are artifacts; at least 80% of short-outlier genomes contain mode-length copies of the protein in the cluster; 40% of short protein artifacts are produced by sequencing errors (frameshifts and termination codons) while another 40% by initiation codon choice. High "outlier" clusters are concentrated in a small fraction of proteomes, which often have poor Proteome BUSCO fragment scores. As with "short-outlier" proteins, the 5% of proteomes that are excluded from the core (50% participation) cluster set encode the missing protein more than 98% of the time; these proteins were missed because of frameshifts in the genome sequence. MMseqs2 clustering with 50% participation provides robust sets of core bacterial proteins.
bioinformatics2026-05-20v2Novel 4D tensor decomposition-based approach integrating tri-omics profiling data can identify functionally relevant gene clusters
Taguchi, Y.-h.; Turki, T.Abstract
Understanding gene expression requires integrating multiple regulatory layers, because transcript abundance does not necessarily correspond to translational activity or protein abundance. Ribosome profiling and proteomics help distinguish increased translation from ribosome stacking or translational buffering, but no de facto standard framework exists for unsupervised integration of transcriptome, translatome, and proteome profiles. Here, we propose a four-dimensional tensor decomposition-based unsupervised feature extraction approach for tri-omics integration. We applied higher-order singular value decomposition to transcriptome, Ribo-seq, and proteome profiles measured under branched-chain amino acid starvation. The resulting singular value vectors captured relationships among the three omics layers, including a component consistent with ribosome stacking, where transcriptome and translatome signals increased while proteome signals decreased, and another consistent with translational buffering, where proteome variation was suppressed despite transcriptome and translatome changes. Gene selection identified 1,781 genes associated with ribosome stacking and 227 genes associated with translational buffering. Enrichment analyses linked the former to translation, post-translational protein modification, RNA polymerase II transcription, cell cycle regulation, endoplasmic reticulum protein processing, ubiquitin-mediated proteolysis, and stress-related pathways, and the latter to ribosome, translation elongation and termination, spliceosome, immune- and stress-related pathways, and ribosomopathy-associated diseases. Robustness analyses indicated that the results were not substantially affected by the duplicated proteome replicate or missing-value handling. Comparison with MOFA+ and mixOmics suggested that our approach more effectively extracted components interpretable as ribosome stacking and translational buffering. 
These results demonstrate that tensor decomposition-based unsupervised feature extraction is useful for identifying functionally relevant gene clusters from tri-omics data.
bioinformatics2026-05-20v2Pan1c : a pipeline to easily build chromosome-level pangenome graphs
Mergez, A.; Racoupeau, M.; Bardou, P.; Linard, B.; Legeai, F.; Choulet, F.; Gaspin, C.; Klopp, C.Abstract
Advances in sequencing technologies and the availability of high-quality genome assemblies for many genotypes per species make it possible to improve sequence alignment rate and quality, as well as variant calling accuracy, by including all genomic variations in a graph reference, called a pangenome graph. Because building and analysing a pangenome graph is still complex, and the related software packages are still under development, there is a strong need for user-friendly pipelines in this emerging research area. Pan1C is a pipeline based on a chromosome-by-chromosome graph construction strategy. It integrates two complementary strategies for building pangenomes and produces informative metric plots and graphics using a large set of tools. By benchmarking Pan1C on human, fungal, and wheat assemblies, which span a wide range of genome sizes and complexities, we show the value of Pan1C for assembly and graph validation as well as for primary analyses.
bioinformatics2026-05-20v2Metabarcode and transcriptome datasets of Pinus sylvestris to assess fungal phyllosphere and disease dynamics.
Moore, B.; Perry, A.; Kaur, S.; Crampton, B.; Gurung, A.; Beaton, J.; Smith, V. A.; Morris, J.; Hedley, P. E.; Nemeth, K.; Barber, H.; Cavers, S.; Jones, S.Abstract
Understanding how host-microbiome interactions influence tree disease is critical for understanding forest resilience. Here, we present foliar microbiome ITS2 metabarcoding and transcriptomic datasets from Pinus sylvestris to investigate susceptibility to Dothistroma needle blight (DNB), a globally important foliar disease caused by Dothistroma septosporum. We hypothesised that host genotype shapes foliar microbial communities and their interactions, thereby influencing disease outcomes. Samples were collected from a progeny provenance field trial in the south of Scotland representing a broad spectrum of disease susceptibilities. The dataset comprises ITS2 metabarcoding samples from 200 genotypes across three timepoints and RNAseq samples from 48 genotypes across two timepoints. Sampling captured key stages of pathogen exposure and disease progression. Both standardised and bespoke protocols were used for nucleotide extraction, sequencing, and quality control, including multiple negative and positive controls. These datasets, available in the European Nucleotide Archive (project accession PRJEB88228), enable analysis of temporal dynamics in foliar fungal communities, host microbiome transcriptional responses, and genotype-dependent variation in disease susceptibility.
bioinformatics2026-05-20v2Synthetic-data augmented calibration for expert-informed rare disease models
Yang, H.; Rachel, T.; Litwin, T.; Karakioulaki, M.; Reimer-Taschenbrecker, A.; Timmer, J.; Has, C.; Binder, H.; Hess, M.Abstract
Clinical data for rare diseases are sparse, noisy, and heterogeneous, complicating calibration of ordinary differential equation (ODE) models. Thus, we introduce a noise-robust calibration in latent space that combines expert-derived ODEs with learned latent representations. Our approach leverages synthetic ODE trajectories, augmenting our scarce observations to train a model-specific autoencoder representation and imputer. During calibration, observed and ODE-generated trajectories are compared in latent space, and ODE parameters are updated by minimizing their latent distance. In a controlled ABCDE simulation model, the imputer outperformed a carry-forward baseline for moderate parameter shifts, parameter recovery remained stable under random missingness, calibration remained robust to additional noise variables despite reduced downstream identifiability, and distinct dynamics formed visually separable latent trajectories. On a custom-developed ODE model for real Epidermolysis Bullosa patients, the calibrated phenomenological model reproduced patient-level trajectories from sparse observations. Thus, we conclude that our latent-space calibration approach supports rare-disease modeling.
bioinformatics2026-05-20v1Widespread use of invalid statistical tests in biomedical machine learning
Zeng, T.; Li, H.; Zhang, S.; Tan, Y. Q.; Tian, F.; Orban, C.; An, L.; Che, W.; Cheng, J.; Chong, J. S. X.; Dehestani, N.; Dong, Z.; Li, X.; Li, Z.; Lim, M. J. R.; Lin, Y.; Ling, Q.; Ling, Z.; Low, X. Z.; Mansour L., S.; Ng, K. K.; Nguyen, T. T.; Ooi, L. Q. R.; Pande, S.; Qian, X.; Ruan, J.; Wang, Z.; Xie, Y.; Zhang, C.; Zhang, Y.; Patil, K.; Parkes, L.; Dhamala, E.; Chopra, S.; Zalesky, A.; Holmes, A.; Eickhoff, S.; Zhou, J. H.; Renaud, O.; Dosenbach, N.; Kording, K. P.; Bzdok, D.; Nichols, T.; Yeo, B. T. T.Abstract
Machine learning is accelerating biomedical research. Cross-validation is widely used to compare predictive performance -- not only to benchmark algorithms, but also to inform scientific applications, such as ranking biomarkers. However, prediction performance estimates across cross-validation folds are not independent. Standard tests for comparing prediction performance (e.g., paired t-test) assume independence and can therefore inflate false positive rates. In a PRISMA-guided meta-analysis of 210 studies (impact factor ≥15, 1 June 2020 - 1 June 2025), we find that 97% ignored fold dependence when comparing prediction performance. This problem is ubiquitous across scientific fields and unaffected by impact factor, rigor-promoting policies, or open science practices. Simulations across 420 scenarios spanning four diverse datasets show that ignoring fold dependence leads to invalid false positive control in most settings. Repeated cross-validation further compounds this problem, with false positive rates rising toward 100% as the number of repetitions grows. Existing fold-dependence-aware tests rely on strong assumptions because the variance of fold-level statistics and the between-fold correlation cannot be disentangled under standard cross-validation. We therefore propose the SHARP (Split-HAlf RePeated) test, a simple modification to standard cross-validation that enables direct estimation of variance and correlation. Benchmarked against 12 tests, SHARP provides the best overall balance of false-positive control, statistical power, and confidence-interval calibration across simulation schemes. We conclude by providing best practices and reporting guidelines for valid model comparison inference in biomedical machine learning and beyond.
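To make the fold-dependence problem concrete, the sketch below contrasts the naive paired t-statistic with the corrected resampled t-statistic of Nadeau and Bengio, a widely known fold-dependence-aware adjustment. It is not the SHARP test, and the abstract does not say whether this particular correction is among the 12 benchmarked tests. The function names and the toy per-fold differences are illustrative assumptions.

```python
from math import sqrt
from statistics import mean, variance

def naive_t(diffs):
    """Paired t-statistic that treats the k fold-level differences as independent."""
    k = len(diffs)
    return mean(diffs) / sqrt(variance(diffs) / k)

def corrected_t(diffs, n_train, n_test):
    """Corrected resampled t-statistic (Nadeau-Bengio style).

    The variance term 1/k is inflated to 1/k + n_test/n_train to account
    for the overlap between training sets across folds.
    """
    k = len(diffs)
    return mean(diffs) / sqrt((1.0 / k + n_test / n_train) * variance(diffs))

# Toy per-fold accuracy differences between two models, 10-fold CV on 100 samples.
diffs = [0.02, 0.01, 0.03, 0.00, 0.02, 0.01, 0.04, 0.02, 0.01, 0.03]
t_naive = naive_t(diffs)
t_corr = corrected_t(diffs, n_train=90, n_test=10)
print(t_naive, t_corr)
```

The corrected statistic is always smaller in magnitude than the naive one, which is why ignoring fold dependence tends to overstate significance. The correction relies on a heuristic for the between-fold correlation, which is the kind of strong assumption the abstract refers to.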
bioinformatics2026-05-20v1Mapping Tumor-Microenvironment dependencies with TMEformer: A spatial foundation framework enabling in silico perturbation
Chen, S.; Zhu, G.; Yang, L.; Li, S.; Liu, P.; Chen, Q.; Tang, Y.; Luo, J.; Huang, L.; Chen, B.; Ou, S.; Jiang, J.Abstract
Despite the fundamental role of spatial context in driving tumor progression, most current computational models for virtual perturbation have largely overlooked its importance. Here, we introduce TMEformer, a tumor microenvironment-aware deep learning framework that leverages high-resolution spatial transcriptomics to jointly model intrinsic tumor cell programs and local microenvironmental signals by explicitly incorporating spatial architecture. Validated across diverse tumor spatial transcriptomic cohorts, TMEformer enables virtual perturbations that capture functional dependencies within local cellular ecosystems. Despite being trained on cancer-specific spatial datasets, TMEformer outperforms baseline models pretrained on large-scale corpora in capturing key tumor transitions, including lineage plasticity and the emergence of therapy resistance. Systematic perturbation analyses prioritize tumor-intrinsic transcription factors and TME-derived ligands that drive disease progression, recovering established regulators and revealing novel candidates. Furthermore, TME-derived embeddings improve the spatial stratification of tumor cells and align more closely with pathological architecture. Together, TMEformer establishes a general framework for modeling tumors as spatially coupled, perturbable ecosystems.
bioinformatics2026-05-20v1Static2Dynamic: Reconstructing videos of unobservable cellular, developmental, and disease processes
Boyer, T.; Del Nery, E.; Spassky, N.; Genovesio, A.Abstract
A fundamental limitation in biology is that many of its most important processes unfold as visual dynamics that cannot be directly observed. Development, tissue remodeling, and disease progression often occur deep in living organisms, over extended timescales, and at cellular resolution beyond the reach of current live imaging technologies. As a result, much of biology remains accessible only through static snapshots, while the underlying phenotypic trajectories and visual transformations remain hidden. Here, we introduce Static2Dynamic, a general framework to reconstruct unseen biological dynamics from sets of cross-sectional image data. Starting from time-unpaired static samples, Static2Dynamic first recovers a continuous pseudotime for individual images in a time-discriminative deep representation space, then learns a generative model of images conditioned on the underlying process, and finally reconstructs temporally coherent videos initialized from real samples. This makes it possible to infer past and future visual states of a static image and to simulate complete trajectories of cellular, developmental, and disease processes that were never directly recorded. We quantitatively validate Static2Dynamic on two large-scale experimental microscopy video datasets generated specifically for benchmarking, enabling direct comparison of inferred pseudotime trajectories and reconstructed videos against ground-truth biological dynamics. We further show that the framework generalizes across biological scales, organisms, and imaging modalities, including processes inaccessible to continuous live observation. More broadly, Static2Dynamic establishes the foundations of pseudotime microscopy, a new paradigm for reconstructing the visual and temporal dynamics of biological processes directly from static imaging data, thereby expanding the observable space of living systems beyond current experimental limits.
bioinformatics2026-05-20v1Using Mapping-Profiles to Refine Strain-Level Metagenomic Classification
Lipovac, J.; Angevin, L.; Krizanovic, K.Abstract
Metagenomic classification at the strain level remains challenging due to high sequence similarity among closely related genomes, which leads to ambiguous read mappings and frequent false-positive strain detections. Reducing such errors improves the reliability of strain-level analyses, which is critical for applications such as pathogen detection. We introduce StrainRefine, a post-mapping refinement method that analyzes read-reference mapping profiles to resolve ambiguous assignments among highly similar genomes. The method represents candidate reference genomes using binary profiles that capture read-support patterns and measures similarity between references based on profile overlap. The method clusters references based on similar mapping profiles, filters weakly supported genomes, and reassigns reads to representative references, reducing redundant reporting of near-identical strains. StrainRefine substantially reduces false-positive strain detections while preserving recall and improving agreement between predicted and true abundance profiles. On large-scale metagenomic datasets, it achieves a substantially improved precision-recall balance compared to existing mapping-based approaches, with the standalone method obtaining the highest read-level classification accuracy on the most complex evaluated dataset. Unlike many strain-level tools designed for individual species, StrainRefine operates without prior assumptions about sample composition or curated species-specific reference collections, while still achieving comparable performance in single-species settings on species-specific reference databases. These results highlight mapping-profile similarity as an effective signal for improving strain-level metagenomic classification.
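The profile-based refinement described above can be illustrated with a minimal Python sketch. This is not the StrainRefine implementation: the use of Jaccard overlap as the similarity measure, single-linkage clustering, the thresholds `sim_threshold` and `min_reads`, and the rule that the best-supported member represents its cluster are all assumptions made for illustration. Each binary profile is modeled as the set of read IDs that map to a reference.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Overlap between two binary profiles stored as sets of read IDs."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def refine(profiles: dict, sim_threshold: float = 0.8, min_reads: int = 2) -> dict:
    """Cluster references with similar profiles and reassign reads.

    Returns {representative reference: reads reassigned to it}.
    Thresholds are hypothetical defaults, not values from the paper.
    """
    # Filter weakly supported references.
    kept = {ref: reads for ref, reads in profiles.items() if len(reads) >= min_reads}

    # Single-linkage clustering via union-find on pairwise profile overlap.
    parent = {ref: ref for ref in kept}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(kept, 2):
        if jaccard(kept[a], kept[b]) >= sim_threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for ref in kept:
        clusters.setdefault(find(ref), []).append(ref)

    # The member with the most reads represents the cluster (ties broken by name).
    refined = {}
    for members in clusters.values():
        rep = max(members, key=lambda r: (len(kept[r]), r))
        refined[rep] = set().union(*(kept[m] for m in members))
    return refined


profiles = {
    "strainA": {1, 2, 3, 4, 5},
    "strainA2": {1, 2, 3, 4},   # near-identical to strainA
    "strainB": {6, 7, 8},
    "strainC": {9},             # weakly supported
}
refined = refine(profiles)
```

In this toy input, `strainA` and `strainA2` share four of five reads and collapse into one reported strain, while the single-read `strainC` is dropped.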
bioinformatics2026-05-20v1HiCP2GAN: A Plug and Play Foundation Model-based GAN for Hi-C Enhancement
Olowofila, S.; Oluwadare, O.Abstract
The three-dimensional organization of chromatin shapes gene regulation and cellular function. Hi-C has emerged as the primary technique for mapping chromatin interactions genome-wide, yet high-resolution data remain costly and scarce, leaving many studies with sparse contact maps that limit downstream analysis. Deep learning methods, especially generative adversarial networks (GANs), have shown promise for enhancing low-resolution Hi-C data. Most existing GAN-based approaches, however, rely on custom discriminators trained from scratch, which can yield unstable training and limited generalization. Hi-C foundation models pretrained on large-scale data capture rich, transferable representations of chromatin structure; their use as discriminators within adversarial enhancement frameworks has not been explored. In this work, we introduce HiCP2GAN, a plug-and-play GAN that employs a pretrained Vision Transformer-based Hi-C foundation model as its discriminator. The discriminator was pretrained on 118 million Hi-C patches across diverse species and cell types, providing biologically meaningful gradients for adversarial supervision. The HiCP2GAN framework is generator-agnostic: any compatible Hi-C resolution enhancement architecture can serve as the generator, enabling fair comparison across methods. We adapted the encoder of the foundation model as the discriminator backbone and experimented with fine-tuning different numbers of layers from the input while freezing the deeper transformer layers. Fine-tuning the first few layers while freezing the rest preserved pretrained knowledge while allowing task-specific adaptation. Experiments on human cell lines show that HiCP2GAN consistently improves resolution over standalone generators and conventional GAN-based models, while serving as a plug-and-play framework for most non-GAN generator models. HiCP2GAN is publicly available at https://github.com/OluwadareLab/HiCP2GAN.
bioinformatics2026-05-20v1Counterfactual Explanations for Graph Neural Networks in Patient Outcome Prediction
Chaidos, N.; Dimitriou, A.; Calzi, H.; Casiraghi, E.; Stamou, G.; Valentini, G.Abstract
Counterfactual Explanation (CE) algorithms have been successfully applied to uncover the main factors driving computational diagnostic and prognostic predictions on tabular medical data. Recently, a new Network Medicine paradigm has been introduced for patient diagnosis and prognosis using Patient Similarity Networks (PSNs), i.e. graphs where patients are represented as nodes and their clinical and biomolecular similarities as edges. In this context, graph-based algorithms, including Graph Neural Networks (GNNs), can provide predictions using not only individual patient features but also their relations within a network of clinically and biomolecularly similar individuals. In this work, we propose the first CE algorithm tailored to explain diagnostic and prognostic predictions within PSNs. Alongside a contrastive GNN backbone, we introduce a versatile, model-agnostic counterfactual search method compatible with any underlying classifier. Preliminary results on synthetic data and on a cohort of patients affected by Alzheimer's disease show that our algorithm is competitive both with seminal tabular-based CE algorithms and with GNNExplainer, a well-established method for explaining graph-based classification tasks.
bioinformatics2026-05-20v1Deciphering interaction syntax via decoupling intrinsic lineages and niche pressure
Guo, Q.; Zhong, W.; Zeng, Z.; Nie, Q.; Zhou, P.; Zhang, L.Abstract
Spatial transcriptomics enables the mapping of gene expression within intact tissues, yet a fundamental gap remains between knowing where cells are and understanding how they interact. A cell's measured transcriptome reflects both its intrinsic lineage identity and niche pressure. Here we introduce TRINUS, a self-supervised model that deciphers interaction syntax by generative decoupling of a cell's intrinsic lineage identity from the extrinsic niche pressure. TRINUS maintains a library of context-free cell prototypes to isolate lineage identity while modeling cooperative interaction dependencies among neighbors. We validated TRINUS on synthetic datasets with known interaction logic and benchmarked it against existing methods, where it showed superior performance in cell clustering and spatial domain detection. Applied across diverse platforms and biological systems, TRINUS resolves multi-level interaction syntax and maps tissue-wide interaction patterns in colorectal cancer, and identifies stage-specific signaling dependencies and time-dependent receptor windows during mouse organogenesis. We also show TRINUS's bidirectional in silico engineering capability in the ovarian tumor microenvironment, where forward perturbation revealed subtype-specific macrophage immunosuppressive programs via virtual transplantation and inverse design identified molecular modifications in macrophages predicted to rescue adjacent T-cell function. Collectively, TRINUS provides a practical tool for interaction syntax discovery and predictive tissue engineering on spatial transcriptomics data.
bioinformatics2026-05-20v1The ATLAS Penalty: Auxiliary-Transformed Location-Aware Smoothing with Applications to Spatial Transcriptomics
Tang, Q.; Chi, E. C.; Wang, W.Abstract
We address the problem of fitting a collection of location-specific models under a spatial smoothness assumption. Existing approaches penalize roughness in the model parameters directly, an assumption that breaks down when smoothness is a function of parameters and auxiliary covariates rather than the parameters themselves. Our framework, the Auxiliary-Transformed Location-Aware Smoothing (ATLAS) penalty, generalizes spatial smoothness by penalizing roughness in transformations of model parameters using auxiliary information. As a concrete case study, we develop a spatially smooth deconvolution model for spatial transcriptomics that estimates tumor mixing coefficients from thousands of spots distributed on a single tissue slide. To handle the computational challenges posed by the nonlinear likelihood, nonsmooth nonconvex penalty, and spatially coupled estimation, we propose an alternating direction method of multipliers (ADMM) algorithm. Through simulation studies, we demonstrate that our framework provides substantially better spatial domain detection than approaches that smooth model parameters directly, with particularly strong gains when auxiliary covariates carry calibrated spatial structure.
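One way to read the contrast drawn above is as a change in what the penalty compares across neighboring locations. The display below is only a schematic sketch, not the formulation from the paper: the per-location loss \ell_i, the neighborhood edge set \mathcal{E}, the weights w_{ij}, the roughness function \phi, the auxiliary covariates z_i, and the transformation T are all notation assumed here for illustration.

```latex
\min_{\theta_1,\dots,\theta_n}\;
\sum_{i=1}^{n} \ell_i(\theta_i)
\;+\; \lambda \sum_{(i,j)\in\mathcal{E}} w_{ij}\,
\phi\bigl( T(\theta_i; z_i) - T(\theta_j; z_j) \bigr)
```

Under this reading, penalizing roughness in the parameters directly corresponds to the special case T(\theta; z) = \theta, and the auxiliary-transformed penalty replaces the raw parameters with a quantity that is expected to vary smoothly over space.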
bioinformatics2026-05-20v1Predicting and Elucidating Peptide Retention Mechanisms with Graph Attention Networks
Kensert, A.; Hruzova, K.; Devreese, R.; Nameni, A.; Declercq, A.; Gabriels, R.; Martens, L.; Bouwmeester, R.; Urban, J.Abstract
Liquid chromatography (LC) is a key technology in bottom-up proteomics, separating proteolytic peptides to decrease sample complexity, enhance coverage, and increase the robustness of protein identification and quantification. Although high-resolution mass spectrometry has advanced significantly, comparable progress in LC has lagged, primarily due to a limited understanding of peptide-column interactions. To bridge this knowledge gap, we introduce a novel deep learning model (PeptideGNN) based on a Graph Neural Network (GNN) architecture to model and elucidate peptide behaviors across various separation conditions. Trained to accurately predict peptide retention times on ten diverse proteomic datasets, the model subsequently employed a saliency mapping technique to interpret the underlying retention mechanisms. Our model consistently outperformed existing retention-time predictors across multiple datasets, while the saliency mapping, importantly, revealed insights into peptide-stationary phase interactions, highlighting the effects of neighboring amino acids, post-translational modifications (PTMs), chromatographic columns, and mobile phase additives on peptide retention.
bioinformatics2026-05-20v1Decoding heterogeneous aging clocks and disease risk stratification using a metabolomic foundation model
Xu, Y.; Zou, B.; Xie, G.; Jia, W.; Zhang, L.Abstract
Metabolomic aging clocks estimate biological age by modeling metabolite concentrations, thereby capturing aging signals from healthspan and adverse outcomes. However, existing clocks generally assume homogeneous aging trajectories and yield only a single age acceleration metric, limiting their capacity to capture inter-individual metabolic heterogeneity and characterize nuanced individual-level representations. To address these limitations, we proposed MetFoundation, a metabolomic foundation model pre-trained on nuclear magnetic resonance (NMR) metabolomic profiles from over 430,000 participants in UK Biobank via self-supervised learning. This large-scale pre-training enables MetFoundation to learn a metabolomic representation space that captures the complex, nonlinear structure of systemic metabolism as reflected in NMR data. Building on MetFoundation, we developed a mortality-informed metabolomic aging clock by fine-tuning an attached survival module, deriving age acceleration that demonstrates significant associations with multiple age-related diseases and factors. More importantly, we utilized embeddings generated by MetFoundation to identify metabolic subtypes, resulting in 13 distinct subtypes with differential susceptibility profiles for major age-related diseases, particularly dementia and diabetes. This finding empirically demonstrated profound metabolic heterogeneity across populations, persisting even at comparable levels of age acceleration. To enhance clinical applicability, we further employed contrastive learning to distill a lightweight model that approximates the learned metabolomic representation space using only routine blood test measurements as inputs. 
Both hold-out testing within UK Biobank and the external validation in China Health and Retirement Longitudinal Study replicated similar disease onset patterns across the identified subtypes, underscoring the robust generalizability of MetFoundation and the translational potential of the discovered metabolic subtypes.
bioinformatics2026-05-20v1Therapeutic Relevance of NLPA Lipoprotein to Combat Biofilm-Associated infection in Acinetobacter baumannii
Brahma, V. U.; Munagalasetty, S.; Bhandari, V.Abstract
Acinetobacter baumannii is a leading multidrug-resistant, critical-priority pathogen in healthcare settings, where biofilm formation confers survival and antibiotic tolerance. Targeting virulence-associated proteins offers an alternative to conventional bactericidal strategies. Here, the inner-membrane-anchored lipoprotein NLPA, implicated in biofilm-associated adaptation, was studied as a putative anti-virulence target using an integrated in silico pipeline, with in vitro assays complementing the computational findings. The AlphaFold-derived structure of NLPA served as the basis for virtual screening of approximately 1.6 million compounds, with subsequent prioritization guided by MM/GBSA-calculated binding free energies to highlight the most promising candidates. Molecular dynamics simulations demonstrated stable NLPA-ligand complexes, as indicated by equilibrated RMSD, low residue fluctuations in the binding region, and persistent interaction networks over time. Pharmacokinetic evaluation indicated that the compounds satisfied Lipinski's Rule of Five and had overall acceptable ADMET characteristics. Two compounds, NLPA-6 and NLPA-3, showed the most favourable predicted binding free energies, suggesting strong and stable interactions within the NLPA binding site. NLPA-3 was evaluated in vitro against A. baumannii to validate the computational outcomes. The compound displayed moderate antibacterial activity with a MIC of 125 mcg/mL and demonstrated 55.75% inhibition of biofilm formation at 4x MIC. In addition, in macrophage infection studies, NLPA-3 decreased intracellular bacterial survival to 19.25% at 50 mcg/mL, suggesting that it may disrupt virulence pathways linked to persistence. Overall, these findings identify promising NLPA-targeting compounds and support the feasibility of NLPA as an anti-virulence target.
bioinformatics2026-05-20v1Shiny AMMOA: an interactive platform for integrative multi-omics analysis of murine aging
Ninomiya Kanda, M.Abstract
Aging is accompanied by complex, tissue-specific molecular changes across multiple biological layers, yet integrative analysis of multi-omics datasets remains challenging for many experimental researchers due to technical and computational barriers. Here, I present Shiny Aging Murine Multi-Omic Analyzer (Shiny AMMOA), a graphical user interface (GUI)-based, user-friendly analytical platform that enables interactive exploration of murine aging-associated bulk transcriptomic, proteomic, and metabolomic datasets. Shiny AMMOA integrates publicly available multi-omics resources within a unified R Shiny framework and supports end-to-end analyses, including differential expression testing, pathway enrichment analysis, and pathway-level visualization across individual and multiple omics layers. Using representative use cases, I demonstrate that Shiny AMMOA recapitulates key findings from original source studies and facilitates intuitive discovery of tissue-, pathway-, and modality-specific aging signatures, including age-associated alterations in unfolded protein response, extracellular matrix organization, and metabolic pathways across specific tissues and omics layers. The platform further enables integrated visualization of molecular changes across omics layers on Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway diagrams, supporting hypothesis generation at the systems level. By democratizing access to integrative multi-omics analysis while preserving analytical rigor, Shiny AMMOA provides an extensible resource for experimental biologists and aging researchers to interrogate large-scale public datasets, prioritize biological pathways, and accelerate translation of multi-omics insights into testable experimental hypotheses. Shiny AMMOA is available at https://github.com/M-Ninomiya-Kanda/Shiny_AMMOA_local, and a lightweight web-based demonstration version with limited functionality is available at https://m-ninomiya-kanda.shinyapps.io/shiny_ammoa_web/.
bioinformatics2026-05-20v1An 8 Gene Bevacizumab Resistance Signature Predicts Prognosis and Reveals Immunosuppressive Microenvironment in Colorectal Cancer
Niu, Z.; Qiu, D.; Xu, P.Abstract
Background: Bevacizumab resistance severely limits long-term efficacy in metastatic colorectal cancer (CRC). This study aimed to develop and validate a bevacizumab-resistance-associated gene signature for prognosis prediction and immune microenvironment characterization in CRC. Methods: Two GEO datasets (GSE19862, GSE86582) with bevacizumab response data and TCGA COAD/READ RNA-seq data were analyzed. Overlapping differentially expressed genes (DEGs) linked to both CRC progression and bevacizumab resistance were identified. An 8-gene signature (AXIN2, PSORS1C1, KRT74, SLC2A3, STIL, IL33, GALNT6, HSD11B2) was constructed via univariate Cox and LASSO Cox regression. Results: In the TCGA cohort, high-risk patients had shorter overall survival (OS; log-rank P < 0.0001). Time-dependent ROC yielded 1-year AUC = 0.638, 3-year AUC = 0.657, and 5-year AUC = 0.757. Multivariate Cox regression confirmed the risk score as an independent prognostic factor. External validation in GSE39582 (optimal cutoff = -1.49) replicated these findings: high-risk patients had inferior OS (P = 0.0016) with acceptable 1/3/5-year AUCs and retained independent prognostic value (HR = 1.634, P = 0.00415). CIBERSORT and ESTIMATE analyses showed that the high-risk group was characterized by increased M2 macrophages and neutrophils, higher immune and stromal scores, and reduced activated memory CD4 T cells, monocytes, and activated dendritic cells (all P < 0.05). GSEA highlighted enrichment of TNF/NF-κB, IL-6/JAK/STAT3, and immune checkpoint pathways in the high-risk group. AXIN2 (HR = 0.829, P = 0.032) was an independent protective factor, while PSORS1C1 (HR = 1.356, P = 0.048) was an independent risk factor. Conclusion: The 8-gene bevacizumab resistance signature robustly predicts prognosis and reflects an immunosuppressive microenvironment closely linked to bevacizumab failure in CRC. These findings provide novel insights into immune-mediated resistance and support clinical risk stratification.
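A signature built with LASSO Cox regression is typically applied as a weighted sum of gene expression values that is then split at a cutoff. The Python sketch below shows that general pattern only. The coefficients and expression values are toy placeholders, not the published ones (the abstract does not report coefficients), the example uses just two of the eight genes, and the convention that scores above the cutoff count as high risk is an assumption. The cutoff of -1.49 is the value reported for GSE39582.

```python
def risk_score(expression: dict, coefficients: dict) -> float:
    """Linear predictor: sum of coefficient x expression over signature genes."""
    return sum(coef * expression[gene] for gene, coef in coefficients.items())


def risk_group(score: float, cutoff: float = -1.49) -> str:
    """Assumed convention: scores above the cutoff are labeled high risk."""
    return "high" if score > cutoff else "low"


# Toy placeholder coefficients for illustration only. The signs follow the
# reported hazard ratios (AXIN2 protective, PSORS1C1 a risk factor); the
# magnitudes are made up.
toy_coefficients = {"AXIN2": -0.5, "PSORS1C1": 0.25}

patient_a = {"AXIN2": 4.0, "PSORS1C1": 2.0}   # toy expression values
patient_b = {"AXIN2": 2.0, "PSORS1C1": 2.0}

score_a = risk_score(patient_a, toy_coefficients)   # -2.0 + 0.5
score_b = risk_score(patient_b, toy_coefficients)   # -1.0 + 0.5
groups = [risk_group(score_a), risk_group(score_b)]
```

Higher expression of the protective gene pulls the score down, so the first toy patient falls below the cutoff and the second above it.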
bioinformatics2026-05-20v1NANOTAXI: A Shiny-Based GUI for Real-Time Classification and Analysis of 16S rRNA Nanopore Reads
Mahar, N. S.; Chouhan, K.; Gupta, I.Abstract
Real-time taxonomic classification of nanopore amplicon sequencing data enables rapid insights into microbial communities, with applications in clinical diagnostics, environmental monitoring, and outbreak surveillance. However, bridging the gap between long-read data and interpretable results often requires specialised bioinformatics expertise. There remains a need for integrated, user-friendly software that combines live data acquisition with downstream microbiome analysis. Here we present NANOTAXI, a fully automated Shiny-based GUI for the classification of barcoded 16S rRNA gene sequences generated by Oxford Nanopore sequencing. The platform supports four taxonomic classifiers, integrated with five reference databases, enabling flexible selection of classification strategies based on user requirements and available computational resources. In addition to real-time monitoring, NANOTAXI performs cohort-level analyses, including alpha and beta diversity, ordination, differential abundance testing, and functional inference using PICRUSt2. Validation using barcoded synthetic communities comprising pooled genomic DNA from clinically relevant bacterial species and the ZymoBIOMICS mock community demonstrated that NANOTAXI generated biologically coherent taxonomic and functional profiles. Benchmarking revealed clear trade-offs between computational performance and taxonomic specificity. Emu provided the lowest observed species-level false-positive rate, whereas Kraken2 offered the fastest classification and enabled continuous near-real-time monitoring across all tested databases.
bioinformatics2026-05-20v1Dual-Stream Compression of High Bit-Depth Medical Images with Application to DNA Storage
Su, H.; Fan, W.; Peng, J.; Zhang, Y.Abstract
High bit-depth medical images preserve subtle intensity variations that are important for quantitative analysis and clinical interpretation, but their large dynamic range poses challenges for efficient compression. We propose a bit-plane-aware dual-stream compression framework for 16-bit medical images by separately modeling the most significant bit (MSB) and least significant bit (LSB) components. The MSB structural stream is encoded using JPEG coding with a Duplicate Segment Skipping (DSS) strategy to exploit spatial and segment-level redundancy, while the LSB detail stream is compressed using learned image compression to represent residual variations and fine-grained details. Experiments on four MRI and CT datasets show that the proposed method consistently outperforms representative traditional and learning-based codecs, achieving the lowest bit rate across all datasets. Meanwhile, it preserves high reconstruction fidelity and maintains local intensity profiles and downstream segmentation consistency. As a downstream application, we further demonstrate that the compressed bitstreams can be effectively integrated with DNA encoding and converted into sequences with favorable biochemical properties.
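The separation into MSB and LSB components can be sketched in a few lines of Python. This shows only the bit-plane split and its lossless recombination, not the JPEG, Duplicate Segment Skipping, or learned-compression stages. The even 8/8 split is an assumption, since the abstract does not state how many bits go to each stream, and the function names are hypothetical.

```python
def split_bit_planes(pixels, lsb_bits=8):
    """Split 16-bit pixel values into an MSB stream and an LSB stream."""
    if any(not 0 <= p <= 0xFFFF for p in pixels):
        raise ValueError("pixel values must fit in 16 bits")
    mask = (1 << lsb_bits) - 1
    msb = [p >> lsb_bits for p in pixels]   # coarse structure
    lsb = [p & mask for p in pixels]        # fine-grained detail
    return msb, lsb


def merge_bit_planes(msb, lsb, lsb_bits=8):
    """Recombine the two streams into the original 16-bit values."""
    return [(m << lsb_bits) | l for m, l in zip(msb, lsb)]


pixels = [0, 255, 256, 0x1234, 65535]
msb, lsb = split_bit_planes(pixels)
restored = merge_bit_planes(msb, lsb)
```

The split itself loses nothing, so any reconstruction error in a scheme of this kind would come from how each stream is subsequently encoded.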
bioinformatics2026-05-20v1VX: an AI-enabled desktop genome viewer and transcriptome browser with a programmable analysis framework
Shirokikh, N. E.; Cleynen, A.Abstract
Background. Genome and transcriptome browsers are central to the interpretation of high-throughput sequencing data, but today's tools assume a human operator at a graphical interface and offer only limited programmability. As large-language-model assistants become routine in bioinformatics [MCP 2024], this creates a bottleneck: agents cannot observe the visual state of the browser or drive it through the same interface as the human user, and analyses remain fragmented across a separate ecosystem of external tools. Transcript-coordinate data, produced by ribosome profiling [Ingolia 2012] and direct RNA sequencing [Garalde 2018], is also awkwardly supported in chromosome-oriented viewers. Results. We present VX, a desktop genome and transcriptome viewer written in D, using GTK 3 and OpenGL, that handles genome-scale and transcriptome-scale data in a unified interface. VX exposes its full functionality through an embedded HTTP API on the loopback interface and a Model Context Protocol server that currently offers thirty-nine tools, so that scripts and LLM agents can load data, navigate, manage tracks, run analyses, and capture figures through the same contract used by the GUI. An integrated analysis framework provides more than fifty analyses and includes signal processing and peak calling, quantification, variant analysis, alignment statistics, interaction and cross-track comparisons, all with an explicit four-level scope hierarchy running from viewport to whole dataset; results are written to disk and, where appropriate, added as new tracks. Additional features include a magnifier popup for base-resolution inspection (Alt+hover), chromosome-alias resolution across UCSC, Ensembl, and NCBI conventions, viewport video recording via an ffmpeg pipe, and INI-based configuration. Conclusions. VX complements existing desktop and web browsers by providing a native agent-control layer, an integrated analysis framework, and first-class transcript-space handling.
The binary is freely available for non-commercial use; the HTTP API and MCP protocol are fully specified in this article, so third-party clients can be written independently of the core implementation.
bioinformatics2026-05-20v1