Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
Beyond single markers: bacterial synergies identified by Multidimensional Feature Selection reveal conserved microbiome disease signatures
Zielinska, K.; Rudnicki, W.; Labaj, P. P.
Abstract
The gut microbiome encodes disease-relevant information not only in the abundance of individual taxa and functions, but in the way they co-occur and interact. Yet metagenomic analyses have largely relied on univariate approaches that evaluate features in isolation, systematically overlooking the combinatorial signals that arise from microbial co-occurrence. Here, we introduce a framework based on the Multidimensional Feature Selection (MDFS) algorithm to identify synergistic feature pairs - combinations of taxa and functions whose joint predictive relevance substantially exceeds that of either constituent alone, including features that carry no individual signal and would be discarded by any conventional analysis. We first validated the approach on a meta-analysis of colorectal cancer (CRC) cohorts - one of the most competitive microbiome classification benchmarks available - using a leave-one-cohort-out cross-validation framework. Our framework matched state-of-the-art classification performance (AUC = 0.85) while simultaneously revealing microbial interactions that are structurally inaccessible to univariate methods. A subset of high-stability synergistic pairs showed consistently elevated model selection frequencies and robust discriminatory power across independent cohorts, confirmed under stringent per-cohort effect size testing. Extending the framework to 20 disease cohorts spanning inflammatory bowel disease, type 2 diabetes, liver cirrhosis, and atherosclerotic cardiovascular disease, we identified thousands of high-impact synergistic interactions and 21 conserved cross-cohort markers. Across all contexts examined, synergistic pairs substantially outperformed their individual constituents, establishing microbial co-occurrence as a reproducible and biologically informative axis of disease-associated variation that univariate approaches are structurally unable to detect. The framework is freely available at https://github.com/Kizielins/MDFS_synergies. 
Importance: Most microbiome studies search for individual gut bacterial species associated with disease. However, bacteria do not act in isolation, and their combined presence or relative balance may be far more informative than any single microbe considered alone. This study presents a computational framework that identifies pairs of gut microorganisms whose co-occurrence or relative abundance carries substantially greater predictive signal than either constituent feature independently. Applied to stool metagenomic data from patients with colorectal cancer and more than a dozen additional conditions, we demonstrate that these synergistic interactions are widespread, reproducible across independent patient cohorts, and reveal disease-relevant microbial relationships that standard analyses miss entirely. Our framework offers a more complete view of how the gut microbiome is altered in disease and provides a principled basis for identifying robust, interaction-based biomarkers.
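The idea of a synergistic pair can be illustrated with a toy calculation. The sketch below is not the MDFS statistic itself; it assumes discrete presence/absence features and uses plain information gain, with the helper names `entropy` and `information_gain` and the XOR-style toy data all invented for illustration. Each taxon alone carries no signal about the label, while the pair determines it completely.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a discrete label vector."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(features, y):
    """I(features; y) for discrete features given as one or more columns."""
    features = np.asarray(features).reshape(len(y), -1)
    # Map every distinct row (joint state of the feature tuple) to one integer id.
    _, joint = np.unique(features, axis=0, return_inverse=True)
    joint = joint.ravel()
    h_conditional = 0.0
    for state in np.unique(joint):
        mask = joint == state
        h_conditional += mask.mean() * entropy(y[mask])
    return entropy(y) - h_conditional

# Hypothetical toy cohort: presence/absence of two taxa in 100 samples.
taxon_a = np.array([0, 0, 1, 1] * 25)
taxon_b = np.array([0, 1, 0, 1] * 25)
disease = taxon_a ^ taxon_b  # label depends only on the combination

ig_a = information_gain(taxon_a, disease)
ig_b = information_gain(taxon_b, disease)
ig_pair = information_gain(np.column_stack([taxon_a, taxon_b]), disease)
synergy = ig_pair - max(ig_a, ig_b)  # assumed definition, for illustration only
print(ig_a, ig_b, ig_pair, synergy)
```

In this constructed case a univariate filter would discard both taxa, which is the situation the abstract describes for features that carry no individual signal.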
bioinformatics 2026-06-10 v3
SLiMNet: a deep learning model to detect short linear motifs using protein large language model representations and paired inputs
McFee, M. C.; Kim, P. M.
Abstract
Short linear motifs (SLiMs) are short (3-15 amino acids in length) segments within intrinsically disordered regions (IDRs) that mediate transient protein-protein interactions as well as other functions such as stability and subcellular localization. Only a few thousand out of likely hundreds of thousands have been experimentally validated. SLiMs can be detected as conserved regions inside IDRs using local alignments, though current approaches have limited sensitivity and specificity and are unable to functionally annotate their hits. Assigning function is hence a major outstanding issue in SLiM biology. Here we present SLiMNet, a deep learning model inspired by siamese networks and contrastive learning that predicts functional similarity in pairs of SLiMs. SLiMNet uses protein large language model embeddings and is trained on annotated sets of SLiMs. We show that it detects shared function in unseen, non-redundant motif pairs, and its scores correlate with experimental binding strengths from deep mutational scanning of cyclin-binding motifs. Using SLiMNet, we provide repositories of putative SLiM pairs derived from annotated IDRs to help with hypothesis generation for the functional annotation of SLiMs. This includes an atlas generated from all-by-all scoring of 16-mers from tiled IDRs from the DisProt database. We show that it captures a new nuclear localization motif recently added to MoMaP and a PRMT1 methylation motif in the literature. We also provide a repository of all IDRs scored with SLiMNet against all MoMaP instances, and an atlas of potential functional pairs for 256 known orphan motifs (motifs with only a single known instance with essential function). Collectively, these atlases are useful resources for the SLiM biology community.
bioinformatics 2026-06-10 v3
Depth normalization for single-cell genomics count data
Booeshaghi, A. S.; Hallgrimsdottir, I. B.; Galvez-Merchan, A.; Pachter, L.
Abstract
Single-cell genomics analysis requires normalization of feature counts that stabilizes variance, accounts for variable cell sequencing depth, and preserves monotonicity of within-cell feature abundances. We show that normalization via an (additive in the raw counts) proportional fitting step followed by the logarithm and then another (multiplicative in the raw counts) proportional fitting step (PFlogPF) is the only feature-relabeling-equivariant method satisfying the three desiderata. We demonstrate superior performance of this method, which is equivalent to a shifted centered-log ratio transform, in comparison to other normalizations on numerous benchmarks across hundreds of single-cell RNA-seq datasets.
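The three-step recipe described above (proportional fitting, logarithm, proportional fitting) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a dense cells-by-features matrix with no empty cells, takes proportional fitting to mean rescaling every cell to the mean depth, and uses `log1p` so that zero counts stay finite (the abstract says only "the logarithm"). The function names are hypothetical.

```python
import numpy as np

def proportional_fit(x):
    """Rescale each cell (row) so that its total equals the mean row total."""
    totals = x.sum(axis=1, keepdims=True)
    return x * (totals.mean() / totals)

def pf_log_pf(counts):
    """PF, then log1p (an assumption, to keep zeros finite), then PF again."""
    counts = np.asarray(counts, dtype=float)
    return proportional_fit(np.log1p(proportional_fit(counts)))

# Hypothetical toy matrix: 3 cells x 3 features with very different depths.
counts = np.array([[10, 0, 5],
                   [100, 20, 50],
                   [1, 1, 2]])
normalized = pf_log_pf(counts)
print(normalized.sum(axis=1))  # every cell ends up with the same total
```

Because each step is a monotone transformation applied within a cell, the ordering of feature abundances inside every cell is preserved, which is one of the three desiderata listed in the abstract.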
bioinformatics 2026-06-10 v3
RingNet: An Interactive Platform for Multi-Modal Data Visualization in Networks
Zhang, L.; Lai, X.
Abstract
The exponential growth of data in biomedicine has created an urgent need for intuitive visualization tools. These tools must be able to effectively represent complex biological networks and remain accessible to domain experts without extensive computational training. Current network visualization approaches often require specialized programming skills and/or cannot handle the scale and complexity of modern biomedical datasets, which creates significant barriers to biological discovery. We develop RingNet, a web-based interactive visualization tool that integrates computational efficiency with flexible, user-driven exploration. This tool addresses the community's need to visualize multi-modal datasets within a single, compact network representation, as well as identify patterns of interest in complex data. RingNet uses an R backend for network computation and coordinate optimization. This generates JSON data structures that feed into a JavaScript and HTML frontend, which provides real-time, interactive visualization functions. It offers dynamic layout adjustments, node and edge filtering, and customizable color schemes for representing data. It can export reproducible, publication-ready figures in SVG and PNG formats. In our case studies, we use RingNet to visualize breast cancer patients' omics profiles in a gene regulatory network and a cell-to-cell communication network in atopic dermatitis. This demonstrates RingNet's ability to reveal biological relationships across multiple data modalities. RingNet lowers the barrier to exploring, analyzing, and communicating data-driven findings, thereby accelerating research.
bioinformatics 2026-06-10 v3
jsPCA: fast, scalable, and interpretable identification of spatial domains and variable genes across multi-slice and multi-sample spatial transcriptomics data
Assali, I.; Escande, P.; Picard, F.; Villoutreix, P.
Abstract
Spatially structured cell heterogeneity within tissues is essential for healthy organ function. This heterogeneity is reflected by differential gene expression activity at various spatial locations. Spatial transcriptomics technologies record genome-wide measurements of gene expression at the scale of entire tissues with high spatial resolution. While they have revolutionized our quantitative understanding of tissue architecture, these technologies generate large and high-dimensional datasets encompassing tens of thousands of genes recorded at tens of thousands of spatial locations, requiring efficient automated methods for their analysis. In this study we introduce joint spatial PCA (jsPCA), a novel, fast, scalable and interpretable method for the automatic identification of spatial domains and variable genes in multi-slice and multi-sample spatial transcriptomics data. jsPCA relies on a simple mathematical formulation of a spatial covariance defined as the product of the gene expression covariance with the spatial autocorrelation. The principal components of this spatial covariance yield a biologically meaningful low-dimensional representation. From this representation, we can derive spatial domains by simple clustering. In addition, spatially variable genes can be identified directly from the principal component coefficients. Moreover, this approach enables the joint representation of multiple slices and samples, a frequent experimental setting. This joint representation is obtained without spatial alignment by computing common principal components via joint diagonalization of the set of spatial covariance matrices obtained for each slice. By leveraging sparsity and non-convex optimization on manifolds, jsPCA achieves computing times on the order of seconds to minutes, substantially outperforming state-of-the-art approaches.
We benchmarked jsPCA on the Visium 10x dataset of human dorsolateral prefrontal cortex and the Stereo-seq MOSTA dataset of mouse embryonic development against 10 state-of-the-art methods. Our approach demonstrated excellent performance, comparable to or better than state-of-the-art methods such as SpatialPCA, BASS, GraphPCA or Stagate, while being much faster, interpretable, and scalable to very large datasets.
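A single-slice version of the spatial covariance idea can be sketched as follows. The abstract does not give the exact formula, so this sketch assumes a Moran-style form, X^T W X scaled by the sum of weights, where X is the centered expression matrix and W is a symmetric k-nearest-neighbour adjacency standing in for the spatial autocorrelation term. The joint diagonalization across slices is not shown, and all names and the toy data are hypothetical.

```python
import numpy as np

def knn_weights(coords, k=4):
    """Symmetric binary k-nearest-neighbour adjacency between spots."""
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)  # a spot is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    w = np.zeros((len(coords), len(coords)))
    rows = np.repeat(np.arange(len(coords)), k)
    w[rows, neighbours.ravel()] = 1.0
    return np.maximum(w, w.T)

def spatial_pca(expr, coords, n_components=2, k=4):
    """Eigendecompose an assumed spatial covariance X^T W X / sum(W)."""
    x = expr - expr.mean(axis=0)
    w = knn_weights(coords, k)
    cov = x.T @ w @ x / w.sum()
    evals, evecs = np.linalg.eigh(cov)  # cov is symmetric because w is
    order = np.argsort(evals)[::-1][:n_components]
    loadings = evecs[:, order]
    return x @ loadings, loadings, evals[order]

# Hypothetical toy slice: 6 x 6 grid of spots, 5 genes, one spatial gradient.
rng = np.random.default_rng(0)
gx, gy = np.meshgrid(np.arange(6), np.arange(6))
coords = np.column_stack([gx.ravel(), gy.ravel()]).astype(float)
expr = rng.normal(size=(36, 5))
expr[:, 0] = 3.0 * coords[:, 0]  # gene 0 varies smoothly along x
scores, loadings, evals = spatial_pca(expr, coords)
print(np.abs(loadings[:, 0]).round(2))  # gene 0 dominates the first component
```

In this toy setting the spot scores could be clustered to obtain domains, and the loading magnitudes rank genes by spatial variability, mirroring the two uses of the components mentioned in the abstract.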
bioinformatics 2026-06-10 v2
Candidate Molecular Subtypes of Cognitive Resilience in Alzheimer's Disease: A Multi-Cohort Machine Learning and Neuroimaging Study
Kitani, A.; Matsui, Y.
Abstract
Background: Cognitive resilience (CR) in Alzheimer's disease (AD) refers to preserved cognitive function despite substantial AD pathology. Diverse biological processes have been implicated in CR, including synaptic maintenance, neuroimmune regulation, and metabolic homeostasis. However, how these mechanisms are organized into molecularly distinct CR subtypes and relate to clinical and neuroanatomical heterogeneity remains unclear. Here, we applied a machine learning framework to multi-cohort transcriptomic, proteomic, and neuroimaging data to investigate molecular subtypes of CR in AD. Methods: RNA-seq data from the Religious Orders Study and Memory and Aging Project (ROSMAP) cohort were used to train machine learning models classifying individuals with AD pathology as CR or non-CR based on residual-based resilience scores. Model development and performance estimation used nested cross-validation to minimize information leakage. Final ROSMAP-trained models were evaluated in the independent Mount Sinai Brain Bank (MSBB) cohort. Model-derived genes were used for biological interpretation and hierarchical clustering of CR individuals. The subtype structure was further evaluated in the Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort using cerebrospinal fluid proteomics, MRI-derived brain measures, and longitudinal MMSE data. Results: Machine learning models showed modest but consistent predictive performance in ROSMAP, with out-of-fold AUROC values of 0.644 to 0.688. In the independent MSBB full cohort, AUROC values were 0.586 to 0.659, with improved discrimination in a top/bottom quartile analysis. Hierarchical clustering identified two major molecular subgroups among CR individuals in ROSMAP/MSBB RNA-seq data. A reduced 22-gene/protein signature showed a partial, cluster-like resemblance to this structure in ADNI cerebrospinal fluid proteomics.
In ADNI, both projected CR subtypes showed preserved brain tissue-volume profiles and slower longitudinal MMSE decline compared with non-CR participants, whereas clear differences between CR subtypes were not observed. Differential CSF proteomic analysis suggested partially distinct molecular characteristics. Conclusions: These findings suggest that CR in AD may encompass molecularly heterogeneous, subtype-like profiles that converge on broadly preserved brain structure and slower cognitive decline. Our results provide a candidate framework for stratifying resilience-associated molecular phenotypes in AD and warrant prospective and experimental validation. We also developed the Resilience Gene Analyzer, a web-based platform for visualizing gene-level contributions to CR prediction (https://igcore.cloud/GerOmics/REsilienceGeneAnalyzer/).
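A residual-based resilience score, as mentioned in the Methods, can be illustrated with a simple regression. The sketch below assumes a single pathology covariate, an ordinary least-squares fit, and a zero threshold for calling an individual resilient. None of these details are given in the abstract, and the numbers are invented.

```python
import numpy as np

def resilience_scores(cognition, pathology):
    """Residual of cognition after regressing it on pathology (OLS with intercept)."""
    design = np.column_stack([np.ones(len(pathology)), pathology])
    coef, *_ = np.linalg.lstsq(design, cognition, rcond=None)
    return cognition - design @ coef

# Hypothetical individuals: pathology burden and a cognitive score.
pathology = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
cognition = np.array([29.0, 27.0, 28.0, 22.0, 25.0, 19.0])

scores = resilience_scores(cognition, pathology)
is_resilient = scores > 0  # assumed cut-off: better cognition than pathology predicts
print(scores.round(2), is_resilient)
```

A positive residual means the individual performs better than expected for their pathology burden, which is the sense in which the label is "residual-based".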
bioinformatics 2026-06-10 v2
Multi-level, multi-body atomic interaction graphs for machine learning-based prediction of protein-ligand binding energies
Le, T. T. H.; Nguyen, B. T.; Vo, H.; Nguyen, N. H.; Nguyen, D. D.
Abstract
Accurate prediction of binding affinity is crucial for rational drug design and discovery. Traditional computational methods often rely on complex scoring functions that incorporate a multitude of physical and chemical descriptors, leading to high computational demands and sometimes limited generalizability. In this work, we propose a novel scoring function, GMI-Score, that models multi-level, multi-body atomic interactions using graph-based representations. Our method constructs comprehensive interaction graphs that incorporate both pairwise and triplet-wise atomic features, which help capture cooperative spatial patterns essential for binding affinity prediction. By employing a feature fusion strategy, GMI-Score maintains model simplicity while enhancing accuracy. Extensive evaluation across multiple datasets, such as PDBbind v2013, PDBbind v2016, PDBbind v2020, CSAR-NRC-HiQ, and PDBbind-Redocked, demonstrates that our model consistently outperforms state-of-the-art scoring functions, achieving Pearson correlation coefficients up to 0.877. Furthermore, it retains strong predictive power under strict data leakage controls and realistic docking conditions, highlighting its robustness and generalizability.
bioinformatics 2026-06-10 v2
Advances in protein function prediction from the fifth CAFA challenge
De Paolis Kaluza, M. C.; Ramola, R.; Joshi, P.; Piovesan, D.; Reade, W.; Orchard, S.; Martin, M. J.; Ignatchenko, A.; Rost, B.; Orengo, C. A.; Robinson-Rechavi, M.; Durand, D.; Brenner, S. E.; Greene, C. S.; Mooney, S. D.; Friedberg, I.; Radivojac, P.
Abstract
The Critical Assessment of Function Annotation (CAFA) is a long-standing community effort to independently assess computational methods for protein function prediction, to highlight well-performing methodologies, to identify bottlenecks in the field, and to provide a forum for the dissemination of results and exchange of ideas. In its fifth round (CAFA 5) of triennial challenges, a partnership with Kaggle Inc. facilitated participation from a large community of data scientists and computational biologists through a competitive prospective challenge on the crowdsourcing platform. In this work, we present an in-depth analysis of the submitted predictions and report improvements in accuracy over all methods from the previous CAFA challenges. We further introduce a new evaluation setting for proteins with pre-existing (incomplete) annotations and identify the need for methods that better leverage existing annotations to predict those that will be discovered later. Finally, we characterize the prospective evaluation framework by examining performance on a strict set of unpublished annotations and across intermediate database releases. Our results indicate that recent developments in the field, such as the availability of protein language models and accurately predicted 3D structures, as well as the growth of experimental annotations through biocuration, have all contributed to performance improvements.
bioinformatics 2026-06-10 v2
Transcriptomic profiling reveals multiple mechanisms of insecticide resistance in Aedes aegypti from Angola
Youd, H. A.; Ooi, J. M. F.; Muhammad, A.; Paine, M. J. I.; Lucas, E. R.; Grau-Bove, X.; Grigoraki, L. R.; Troco, A. D.; Parreira, R.; Sousa, C. A.; Pinto, J.; Weetman, D.
Abstract
Control of arboviruses remains heavily reliant on insecticide-based vector control targeting adult Aedes aegypti, especially during outbreaks, but the effectiveness of these tools can be compromised by insecticide resistance. While the mechanisms underlying resistance have been widely studied in Latin American and South East Asian Ae. aegypti, knowledge from African populations is limited, particularly regarding metabolic resistance. To address this knowledge gap, we sequenced the transcriptomes of Ae. aegypti collected in Angola, from both unexposed individuals and survivors of exposure to the organophosphate fenitrothion, alongside two insecticide-susceptible laboratory reference strains. Many overexpressed genes belonged to the major detoxification enzyme families, including 96 cytochrome P450 monooxygenases (CYP450s), 18 glutathione S-transferases (GSTs), and 35 carboxylesterases, with multiple genes previously detected as upregulated in Latin American and Asian populations. These included frequently reported, functionally validated metabolic resistance genes such as CYP9J24, CYP9J26, and CYP6BB2. However, expression of auxiliary resistance families including hexamerins, heat shock proteins, and odorant binding proteins was also linked to the insecticide resistance phenotype, whilst numerous cuticular genes differentiated the Angolan population from both susceptible laboratory strains. A novel candidate, CYP6AG7, that was overexpressed after fenitrothion exposure was experimentally validated and, surprisingly, metabolised fenitrothion into its toxic oxon form, which it did not subsequently break down. The antioxidant response element (ARE) motif, to which the transcription factor Maf-S binds, was detected in all CYP450s overexpressed in the fenitrothion treatment, suggesting their potential coordinated induction.
Analysis of genetic differentiation revealed several resistance-linked genes under potential selection, and SNP screening identified both known and novel non-synonymous mutations in the voltage-gated sodium channel (VGSC) gene, the target for pyrethroid insecticides. This is the first RNAseq dataset for Ae. aegypti from Africa in the context of insecticide resistance, providing insight into the complexity of resistance mechanisms, including some shared, and others potentially novel, compared to better studied populations from other geographical regions.
bioinformatics 2026-06-10 v2
ECMME: an atlas of selection pressures on the mammalian extracellular matrix reveals contrasting evolutionary dynamics
Petrov, P. B.; Oshinjo, A.; Roning, J.; Izzi, V.
Abstract
The extracellular matrix (ECM) is a fundamental metazoan innovation that provides structural support and regulatory cues essential for multicellular life. While core matrisome components are subject to strong functional constraints, their evolutionary dynamics at the molecular level remain incompletely characterized. Here, we present a comprehensive per-residue analysis of selection pressures across 272 human core matrisome proteins using high-quality orthologous sequences from up to 228 placental mammal species. We developed an automated pipeline integrating ortholog identification, codon-aware alignments, and site-specific selection analyses with the MEME and FUBAR methods from the HyPhy suite. Results reveal pervasive strong purifying selection across the matrisome, consistent with its structural and functional indispensability. This is accompanied by episodic positive selection and rarer pervasive positive selection, with collagens exhibiting significantly elevated episodic positive selection compared to glycoproteins and proteoglycans. To facilitate community access, we developed the ECMME (ECM Molecular Evolution) browser, an intuitive open-access web resource that visualizes selection metrics plotted directly onto protein topologies. ECMME allows researchers to seamlessly browse and investigate the data, providing a powerful framework for interpreting functional sites. It is available online and requires no local installation or set-up (https://izzilab-ecmme.share.connect.posit.cloud/).
bioinformatics 2026-06-10 v1
Bias-mitigated microbiome inference refines coronary artery disease signature
Honeybrook, L.
Abstract
Roughly half the cells in the human body are microbial, and changes in these communities are increasingly implicated in cardiovascular, metabolic, and oncological diseases. Yet identifying which taxa truly differ in abundance, known as differential abundance (DA) analysis, is distorted by four major sources of bias: loss of total microbial load, taxa measurement efficiencies, arbitrary pseudocounts required to handle pervasive zeros, and contamination, which has recently driven retractions. No existing DA method accounts for all four. Here we introduce BootDA, a non-parametric bootstrap-based method that explicitly models each bias source without data transformations, pseudocounts, parametric assumptions, or assuming that most taxa are non-DA. In semi-parametric simulations preserving the sparsity (>70% zeros) and correlation structure of real 16S amplicon data, BootDA achieved the highest sensitivity among tested methods, including ANCOM-BC2, LinDA, MaAsLin 3, and Wilcoxon tests, while controlling the false discovery rate. Performance was retained in low-biomass settings when contamination contributed ~50% of counts, and without negative controls, indicating de novo decontamination capability. Applied to a coronary artery disease cohort, BootDA refined the original signature to two co-enriched genera, Klebsiella and Gemmiger, and excluded likely contaminants. BootDA is available as an R package and could generalise to other sparse, high-dimensional biological data.
bioinformatics 2026-06-10 v1
SPARQ-MI leverages end-to-end spatial single-cell analysis of the tumor microenvironment
Kiwitz, L.; Turiello, R.; Effern, M.; Toma, M.; Landsberg, J.; Hoelzel, M.; Thurley, K.
Abstract
Detailed spatial analysis of the tumor microenvironment (TME) through multiplexed fluorescence imaging requires quantitative image-processing and data-analysis methods. While data preprocessing down to segmentation of individual cells is covered by available methods, statistical analysis of single-cell features is compromised by the uneven noise distribution, especially in complex tissues such as the TME, as well as by labor-intensive manual cell-type annotation and region segmentation. Here, we present SPARQ-MI (Spatial Phenotyping, Architecture Reconstruction and Quantification from Multiplexed Imaging) for streamlined spatial single-cell analysis, along with a tissue microarray PhenoCycler dataset with 37 fluorescent channels from melanoma patients under immunotherapy. We demonstrate that SPARQ-MI enables robust reconstruction of the cellular and spatial composition in this and other tissue types. Our analysis reveals associations of the cell state and spatial location of CD8 T cells with response to immunotherapy. Overall, SPARQ-MI allows for quantitative analysis of complex fluorescence histology samples with minimal user input while accounting for spatially uneven coverage of antibody signals, setting the stage for quantitative analysis of clinical samples.
bioinformatics 2026-06-10 v1
Is level-1 blob reconstruction under the network multispecies coalescent easy?
Dai, J.; Molloy, E.
Abstract
Hybridization is an important evolutionary process, commonly modeled by the network multispecies coalescent. Reconstructing evolutionary histories under this model is notoriously costly, even for level-1 networks where hybridization events are isolated from each other. The widely used methods that combine speed with statistical guarantees rely on quartet concordance factors computed for all subsets of four species, resulting in an O(n^4 k) bottleneck that severely limits scalability to large numbers of species (n) and genes (k). Among quartet-based methods, NANUQ+ is notable because it decomposes the problem into two steps: first reconstructing a tree of blobs, which compresses each non-treelike part of the network, called a blob, into a single vertex, and second reconstructing the internal structure of each level-1 blob, specifically its circular order and hybrid vertex. Here, we investigate whether level-1 blob reconstruction is difficult once the tree of blobs is known. We present a fast and statistically consistent algorithm, called NetCS, based on two simple primitives: majority voting and merge sort, circumventing the bottleneck of computing all quartet concordance factors. In simulations, NetCS achieved comparable accuracy to NANUQ+ and was dramatically faster, enabling analyses of 200 taxa and 1000 genes in only a few minutes. Both methods attained near-perfect accuracy when given the true tree of blobs; however, their performance degraded in end-to-end pipelines due to errors in tree of blobs reconstruction. Strikingly, even methods that reconstruct level-1 networks directly struggled to accurately predict hybrid ancestry. Our results suggest that reconstructing level-1 blobs is unexpectedly easy once the tree of blobs is known, and that a major challenge for phylogenetic network inference lies in accurate tree of blobs reconstruction.
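To give a sense of scale for the quartet bottleneck, the short calculation below counts the four-species subsets for the 200-taxon, 1000-gene setting mentioned in the abstract. It assumes one unit of work per quartet per gene tree, which is a simplification, and the function name is hypothetical.

```python
from math import comb

def quartet_workload(n_species, n_genes):
    """Number of 4-species subsets, and quartet-by-gene evaluations."""
    quartets = comb(n_species, 4)  # grows like n^4 / 24
    return quartets, quartets * n_genes

quartets, evaluations = quartet_workload(200, 1000)
print(quartets)     # 64684950
print(evaluations)  # 64684950000
```

Roughly 65 million quartets, each examined in 1000 gene trees, is the cost that an approach avoiding the full table of concordance factors does not have to pay.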
bioinformatics 2026-06-10 v1
APOSM: Pairwise preference learning improves generative small-molecule design
Dreisler, M. W.; Michael, R.; Hatzakis, N. S.; Boomsma, W.
Abstract
Small-molecule lead refinement is constrained by the cost of synthesizing and assaying candidates, making the surrogate models that prioritize compounds for experimental testing central to the design process. The reliability of such surrogates is limited by the noise and sparsity of screening measurements. We show that training the surrogate on pairwise comparisons between candidate molecules, rather than on absolute predicted scores, yields a substantially more reliable signal for active candidate selection in this regime. We develop APOSM, an active-learning algorithm that combines a fragment-based generator, a pairwise message-passing graph neural network surrogate, and probabilistic ranking inside a batched acquisition loop. On the Practical Molecular Optimization benchmark and a GPCR ligand rediscovery task, APOSM improves target attainment and sampling efficiency over unguided fragment-based optimization, the Graph-GA genetic algorithm, and a pointwise-regression ablation, with the largest gains on tasks where absolute scores are hardest to calibrate.
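The difference between pairwise and pointwise training can be made concrete with a small sketch. APOSM's surrogate is a pairwise message-passing graph neural network; the code below swaps in a linear scorer with a Bradley-Terry-style logistic loss on feature differences, purely to show the training signal. The descriptors, weights, and function names are all hypothetical.

```python
import numpy as np

def make_pairs(scores):
    """All index pairs (i, j) with label 1.0 when item i is preferred to item j."""
    i, j = np.triu_indices(len(scores), k=1)
    return i, j, (scores[i] > scores[j]).astype(float)

def fit_pairwise(x, i, j, y, lr=0.5, steps=300):
    """Gradient descent on the logistic loss of score differences."""
    w = np.zeros(x.shape[1])
    diff = x[i] - x[j]
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(diff @ w)))  # P(i preferred to j)
        w -= lr * diff.T @ (p - y) / len(y)
    return w

# Hypothetical data: 40 molecules with 3 descriptors and a hidden true score.
rng = np.random.default_rng(0)
x = rng.normal(size=(40, 3))
true_scores = x @ np.array([1.5, -2.0, 0.5])

i, j, y = make_pairs(true_scores)
w = fit_pairwise(x, i, j, y)
predicted = x @ w
accuracy = float(((predicted[i] > predicted[j]) == (y == 1.0)).mean())
print(accuracy)
```

The model only ever sees which of two candidates is better, so it never has to reproduce the absolute scale of the measurements, which is the property the abstract credits for more reliable candidate selection.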
bioinformatics 2026-06-10 v1
A Unified Spatial AI Framework for Cross-Domain Tissue-State Analysis in Trauma, Oral, and Cardiovascular Pathology
Pham, T. D.
Abstract
Objective: To develop a cross-domain spatial AI framework for identifying conserved tissue-state organisation across trauma, oral disease, and cardiovascular tissue using spatial transcriptomic data. Methods: Four public spatial transcriptomic datasets spanning wound healing, periodontitis, oral squamous cell carcinoma, and cardiac tissue were integrated using recurrence modelling, graph-based spatial learning, fuzzy tissue-state analysis, and tensor decomposition. Cross-domain coupling, spatial fragmentation, recurrence structure, and permutation-based topological validation were evaluated. Results: Six conserved fuzzy tissue states were identified, dominated by extracellular matrix remodelling, fibroblast/stromal activation, endothelial signalling, and inflammatory pathways. Latent embedding analysis demonstrated strong overlap between trauma and oral domains, while cardiovascular tissue exhibited more compact spatial organisation. Oral inflammatory tissue showed the highest fragmentation, whereas cardiovascular tissue demonstrated greater recurrence coherence. Tensor decomposition identified conserved stromal-remodelling programmes across domains. Permutation testing confirmed significantly elevated graph modularity and reduced spatial entropy relative to null distributions. Conclusion: The proposed framework identified conserved spatial tissue-state architecture linking wound healing, oral pathology, and cardiovascular tissue despite differences in tissue origin, pathology, and acquisition technology. Significance: These findings demonstrate the potential of spatial AI for investigating conserved stromal and inflammatory microenvironmental organisation across clinically related disease systems and may support spatial biology research in trauma-oral-systemic health.
bioinformatics 2026-06-10 v1
When batch correction corrupts gene expression: uncovering distortions in correlation structures
Nourisa, J.; Passemiers, A.; Moreau, Y.; Raimondi, D.
Abstract
Batch correction is essential for integrating datasets and enabling population-level insights into health and disease. Embedding-based approaches are among the most widely used solutions, but here we highlight a critical, overlooked limitation: these methods can distort feature-to-feature (e.g., gene-gene) relationships, potentially undermining downstream analyses. We investigate this issue and introduce a novel metric to quantify it.
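One simple way to see what a distortion of gene-gene relationships looks like is to compare correlation matrices before and after a transformation. The abstract does not define the authors' metric, so the function below is a hypothetical stand-in (mean absolute change in pairwise Pearson correlation). The rank-one reconstruction is a deliberately crude proxy for pushing data through a low-dimensional embedding, not any real batch-correction tool.

```python
import numpy as np

def correlation_distortion(before, after):
    """Mean absolute change in gene-gene Pearson correlation (hypothetical metric)."""
    c_before = np.corrcoef(before, rowvar=False)
    c_after = np.corrcoef(after, rowvar=False)
    upper = np.triu_indices_from(c_before, k=1)  # each gene pair once
    return float(np.abs(c_before[upper] - c_after[upper]).mean())

def rank_one_reconstruction(x):
    """Project onto the first principal component and map back to gene space."""
    mean = x.mean(axis=0)
    u, s, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean + s[0] * np.outer(u[:, 0], vt[0])

# Hypothetical data: 200 cells x 5 roughly independent genes.
rng = np.random.default_rng(0)
expr = rng.normal(size=(200, 5))

d_rescaled = correlation_distortion(expr, expr * 2.0 + 1.0)
d_lowrank = correlation_distortion(expr, rank_one_reconstruction(expr))
print(d_rescaled, d_lowrank)
```

A per-gene shift and rescale leaves every correlation intact, whereas the low-rank round trip forces all genes to be perfectly correlated or anti-correlated, so the second value is large even though each gene's mean is preserved.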
bioinformatics 2026-06-10 v1
Promera: a unified model for biomolecular structure prediction, filtering, and design
Jing, B.; Bafna, M.; Diaz, D. J.; Klivans, A. R.; Berger, B.
Abstract
Generative models have become staple tools for modeling and designing biomolecular structures. However, although these tools have improved in structural prediction accuracy, their ability to filter designed binders, an essential use case, remains insufficient, while design methods have focused more on unconstrained binder generation than on capabilities enabled by controllable design. We introduce Promera, a unified generative model that combines all-atom structure prediction with improved filtering and controllable design. We find that Promera's confidence metrics are more accurate for filtering binders from non-binders for both miniproteins and nanobodies, while its co-folding performance surpasses popular open-source models (OpenFold3-p2, Boltz-2) on therapeutically relevant categories. As a design model, Promera generates binders by predicting masked protein sequences with optional epitope, paratope, and template constraints. Remarkably, our nanobody designs match the in silico success rates from backprop-based techniques (mBER) when evaluated under co-folding confidence filters. We further provide two in silico demonstrations of the versatile capabilities of our design method: epitope targeting of the Andes hantavirus glycoprotein with VHHs and active state stabilization of the beta-2 adrenergic GPCR. We conclude by proposing a scaling law for co-folding models, suggesting a path for further performance improvement.
bioinformatics 2026-06-10 v1
GEOAgent: An AI-driven Autonomous Framework for Intelligent GEO Data Retrieval and Standardized Preprocessing
Zhao, Y.; Cai, Q.; Chen, D.; Chen, J.
Abstract
Datasets in the Gene Expression Omnibus (GEO) remain difficult to reuse at scale because sample annotations are heterogeneous and raw sequencing data require assay-specific preprocessing. We present GEOAgent, an AI-driven autonomous framework designed for intelligent dataset retrieval and standardized preprocessing by coupling autonomous semantic governance with an automated Nextflow pipeline named bioStream. Metadata from 181,760 sequencing series and 84,756 associated PubMed records were organized in a relational database and semantic index to support natural-language dataset retrieval. The framework automatically determines assay modalities, resolves experimental design pairings, and standardizes sample naming to minimize manual curation overhead. Based on these parsed attributes, the framework generates deployment-ready manifests to automatically execute containerized workflows across bulk and single-cell omics modalities. In expert-curated benchmarks, the workflow achieved 96% retrieval precision alongside 100% accuracy in assay classification and sample relationship resolution. The web platform is publicly accessible, while the source code and associated databases are openly available via GitHub and Zenodo.
bioinformatics2026-06-10v1NaVis: a virtual microscopy framework for interactive histological interrogation of spatial transcriptomics data
Oshinjo, A.; Wu, J.; Petrov, P.; Hashmi, A.; Englund, J. I.; Izzi, V.Abstract
Despite the widespread adoption of spatial transcriptomics (ST), revealing the alignment between transcriptional layers and tissue morphology remains technically demanding, typically requiring proficiency across multiple computational frameworks and thereby limiting accessibility for a substantial fraction of the biomedical community. Here, we introduce NaVis (https://github.com/Izzilab/NaVis), a point-and-click virtual microscopy framework that redefines ST analysis as an interactive, image-centric experience. NaVis enables rapid high-resolution inference from low-resolution whole-transcriptome platforms, producing microscopy-like visualizations while preserving transcriptome-wide coverage. It further decomposes histological images into quantitative tissue architecture priors (nuclei-rich regions, fibrillar extracellular matrix, and soft tissue) allowing direct integration of gene expression with local morphology. This unified representation supports analyses of compartment enrichment, boundary concordance, spatial cross-correlation, morphological patterning, histology/expression decoupling, and transcriptome-wide spatial similarity. By coupling transcriptomic and image-derived information within an interactive framework, NaVis shifts ST from static computational workflows to an exploratory modality, broadening its accessibility, conceptual reach and potential for biological discoveries.
bioinformatics2026-06-09v2LongAllele: a joint inference framework for allele-specific analysis on long-read bulk and single-cell RNA sequencing
Xu, Z.; Wang, K.Abstract
Allele-specific analysis from RNA-seq is a powerful approach to characterize cis-regulatory effects. However, existing methods remain limited in both haplotype inference and allelic testing. Their haplotype-inference workflows separate variant calling, haplotype phasing, and read-haplotype assignment into sequential steps, failing to fully exploit within-read single-nucleotide variant (SNV) linkage information and propagating errors into downstream allelic analysis. At the testing stage, they ignore non-phasable reads lacking heterozygous SNVs, biasing calls and inflating false positives, and remain incomplete across gene-, isoform-, and local-event-level variant effects. Here, we present LongAllele, a statistical framework that employs an expectation-maximization algorithm to jointly infer heterozygous variants, haplotype structure, and read-haplotype assignments from long-read bulk and single-cell RNA sequencing. LongAllele further introduces phasability-aware testing that explicitly accounts for non-phasable reads, avoiding inflated false-positive calls when haplotype information is incomplete. It also enables comprehensive allelic testing across gene-level allele-specific expression (ASE), isoform-level allele-specific transcript usage (ASTU), and local-event-level haplotype-associated exon and junction usage (HAEU and HAJU), providing a multi-scale view of cis-regulation across biological contexts. We applied LongAllele to long-read RNA-seq datasets spanning GTEx (multi-tissue bulk), peripheral blood mononuclear cells (single-cell), human hippocampus (single-nucleus), and human cortex from two Alzheimer's disease (AD) case-control cohorts (bulk, Oxford Nanopore and PacBio). LongAllele consistently revealed greater context dependence in expression-level than isoform-level allelic regulation across tissues, cell types, and disease states, and pinpointed high-impact regulatory variants including rare splice-site mutations missed by standalone variant callers. 
It further showed that purifying selection constrains allelic imbalance at both gene and isoform levels and resolved AD-associated variant effects in individual transcriptomes across long-read platforms.
bioinformatics2026-06-09v2FLAG-X: Hybrid machine learning workflows for automated gating of clinical flow cytometry data
Martini, P.; Mohammadi, M.; Thrun, M. C.; Blumenthal, D. B.; Krause, S. W.Abstract
Flow cytometry analysis is a widespread practice in cell biology, immunology and hematology. Cell populations of interest are typically identified by consecutively examining the expression levels of antigen marker pairs. Since this manual gating process lacks standardization and is time-consuming, several machine learning (ML) methods for automated gating of flow cytometry data have been proposed in recent years. However, their translation into routine workflows has been limited. To address this, we developed the Python package FLAG-X ("flow cytometry automated gating toolbox"), which supports two novel workflows that integrate manual with ML-based gating, using labeled and unlabeled training data. We selected state-of-the-art ML methods developed for automated gating for inclusion in FLAG-X, based on their gating performance in comparison to manual expert annotations. FLAG-X provides a unified interface for top-performing methods and enables seamless integration with standard software for manual gating by exporting results as FCS files. To demonstrate its practical utility, we applied FLAG-X to representative cases from clinical practice. FLAG-X is available at https://anaconda.org/channels/bioconda/packages/flagx/overview
bioinformatics2026-06-09v2Do Larger Models Really Win in Drug Discovery? A Benchmark Assessment of Model Scaling in AI-Driven Molecular Property and Activity Prediction
Guo, J.; Ding, S.Abstract
The rapid growth of molecular foundation models and large language models has encouraged a scale centred view of AI in drug discovery, in which larger pretrained models are expected to supersede compact cheminformatics models and graph neural networks (GNNs) trained for individual tasks. We test this assumption across 26 endpoints for molecular properties, toxicity, safety liabilities and biological activity, grouped into ADME, toxicity and bioactivity classes. The benchmark contains 78 endpoint and split entries spanning random, Murcko scaffold and structure separated 5-fold CV. Ordered from easiest to hardest, these splits approximate retrospective evaluation on a closed library, scaffold expansion in hit to lead, and library expansion on novel chemotypes. Each entry includes ML, GNN, pretrained molecular sequence and LLM based SAR families. Across 156 fold mean comparisons, classical ML such as RF(ECFP4) and ExtraTrees(RDKit) win 116, GNNs such as GIN and Ligandformer win 25, pretrained sequence models such as MoLFormer and ChemBERTa2 win 12, and LLM based SAR baselines win three. ML dominates random split interpolation but loses part of this advantage under harder splits; GNN and sequence models also decline but gain relative ground, whereas LLM based SAR is weaker in absolute terms yet less sensitive to the split axis. Paired bootstrap analyses support family level trends more strongly than individual model rankings. SAR knowledge derived from training folds improves many GPT5.5-SAR and Opus4.7-SAR metrics but does not make rule based reasoning a universal substitute for supervised predictors. Compact specialized models remain highly effective for molecular property and activity prediction. Larger models add value for SAR interpretation and reasoning in low data settings, but predictive performance depends on the fit among model, task and validation scenario, not on scale alone.
bioinformatics2026-06-08v3Metadata Collector: An Open-Source Platform for Standardized Metadata Management in Multi Centre Sequencing Projects
Liguori, R.; Huttner, M.; Ferrazzi, F.Abstract
Background: Next-generation sequencing (NGS) projects generate increasingly complex metadata that are critical for reproducibility, interoperability, and compliance with FAIR principles. Nevertheless, metadata curation in multi-institutional settings often still relies on spreadsheets, manual data entry and curation, as well as non-standardized terminology. These practices frequently result in incomplete or inconsistent annotations, hinder metadata sharing, and delay submission to public repositories. Results: We developed Metadata Collector as a React/API/PostgreSQL web platform and deployed it on a Kubernetes cluster within a large German research consortium. The platform implements a flexible, machine-readable metadata model for experimental data and integrates customizable templates, controlled vocabularies designed to support future ontology integration, and a complete event-based versioning model. Since deployment, Metadata Collector has been used across 32 projects involving RNA-seq, scRNA-seq, ATAC-seq and multiomics datasets, representing over 700 annotated samples contributed by multiple consortium partners. The platform is designed for use by non-computational researchers as well as centralized facilities and can be integrated into existing research data management infrastructures. Conclusions: Metadata Collector embeds standardization early in the metadata lifecycle, ensuring consistent, FAIR-aligned, and reproducible metadata across distributed research groups. Its modular, open-source architecture supports both local and consortium-scale deployments and provides a foundation for future extensions, including multi-omics support and integration with laboratory information management systems and automated submission pipelines.
bioinformatics2026-06-08v3Protein large language model assisted one-to-one gene homology mapping in cross-species single-cell transcriptome integration
Kuang, Z.-Y.; Sun, Y.-C.; Wei, N.-N.; Wang, Y.-J.; Wu, H.-J.Abstract
Cross-species integration of single-cell transcriptomes requires establishing gene correspondences to enable comparative analysis of expression profiles across organisms. Current approaches predominantly rely on Ensembl homology tables, whose default many-to-many mappings often amplify gene-family effects and introduce artifactual micro-clusters that lack clear cell-type identity, thereby complicating biological interpretation. While restricting mappings to a one-to-one scheme suppresses such artifacts, it reduces the number of homology gene pairs by approximately 8% (~900 pairs). To address this limitation, we developed a protein large language model (pLLM)-based gene homology mapping strategy that boosts the number of homology gene pairs. By integrating pLLM-derived representations with sequence similarity, we constructed a fused mapping approach, which achieved top performance in a comprehensive benchmark based on a curated cross-species atlas spanning nine datasets, 11 species, and over 3.2 million cells. Our method further identifies previously unannotated cell-type marker pairs, facilitating novel cross-species marker discovery. These results establish a robust framework for gene homology mapping in cross-species transcriptome integration, improving both accuracy and biological interpretability.
bioinformatics2026-06-08v3Melody: Decoding the Sequence Determinants of Locus-Specific DNA Methylation Across Human Tissues
Jin, J.; Wang, D.; Qiao, J.; Gao, W.; Liu, Y.; Chen, S.; Zou, Q.; Wu, S.; Su, R.; Wei, L.Abstract
DNA methylation is a fundamental epigenetic modification that plays crucial roles in transcriptional regulation, cellular differentiation, and genome stability. However, how locus-specific DNA methylation is determined by intrinsic DNA sequence remains poorly understood. Here, we introduce Melody, a deep learning framework that predicts DNA methylation from 10-kb genomic sequences, enabling the integration of both local and long-range sequence signals. Across 39 human tissues, Melody accurately predicts methylation profiles and consistently outperforms existing state-of-the-art methods in whole-chromosome, hypomethylated-region, and cell-type-specific benchmarks. Melody also generalizes to methylation quantitative trait locus (meQTL) effect prediction and identifies regulatory sequence motifs associated with methylation variability. To extend prediction beyond profiled tissues, we further develop Melody-G, which incorporates single-cell RNA-seq foundation model embeddings to infer methylation states in previously unseen cell types directly from transcriptomic data. Together, Melody provides a unified framework for linking genomic sequence and cellular state to DNA methylation and offers new insights into the regulatory logic governing the human methylome.
bioinformatics2026-06-08v2HydraMPP: A lightweight library for distributed massive parallel processing in Python - threading at scale.
Figueroa, J. L.; White, R. A.Abstract
We live in an era of massive datasets from genomics and large language models, with much of humanity's knowledge at our fingertips. Much of this data is becoming more accessible; however, processing it remains an ongoing challenge across systems, including high-performance computing (HPC) infrastructures. Massively parallel processing (MPP) has addressed this with a divide-and-conquer approach, splitting workloads across independent nodes (i.e., central processing units (CPUs)) to allow higher scaling of data. The main engine for this in Python is Ray; however, it has several drawbacks, including a large codebase, security issues, debugging opacity, and memory management problems. Here, we present HydraMPP, a lightweight library that emphasizes ease of use, high auditability, and SLURM ergonomics.
bioinformatics2026-06-08v1DDI_single: Single-Sequence-Based Protein Domain Assembly
Shengyi, Z.Abstract
Domains are the basic units of protein structure and function. Appropriate inter-domain organization is critical to enable cooperative execution of multiple related functions. It is thus a crucial step to determine the full-length structure of multi-domain proteins for the purpose of elucidating their functions and designing new drugs to regulate these functions. Existing structure prediction algorithms are generally better at solving the internal conformation of domains, rather than modeling the relative positions between domains. To address the challenge of accurately determining multi-domain protein conformations, we develop a single-sequence-based domain assembly algorithm called DDI_single. DDI_single directly extracts features from the amino acid sequence using the protein language model ESM-1b, and accurately predicts the interactions between residue pairs of structural domains through a novel gated cross-attention module, thus achieving the correct assembly of structural domains. With the knowledge of domain definition, DDI_single achieves more than 20% higher accuracy in the task of predicting the relative distances of residue pairs between domains than that of the single-sequence-based structure prediction algorithm trRosettaX_single. When assembling domains with known spatial conformations, DDI_single correctly assembles 74.4% of the samples in the test set (TM-score>0.5). When assembling domains with unknown spatial conformations, in cases where the internal spatial conformations of domains are correctly modeled, DDI_single correctly assembles 73.9% of the samples.
bioinformatics2026-06-08v1Intra-slide calibration technology improves immunohistochemical harmonization within and between anatomic pathology laboratories
Fernandes, G. M. d. M.; Wang, W.; Parwani, A.; Ahmadian, S. S.; Alves, M. J.; Philips, J. J.; Otero, J. J.Abstract
The reproducibility of immunohistochemistry in tumor tissue analysis across reference labs remains a persistent challenge. We tested the extent to which an intra-slide calibration technology mitigated discrepancies in inter-laboratory assays of p53 immunohistochemical (IHC) reactions in brain biopsies of glioblastoma (GB), IDH-wildtype. Intra-slide calibration technologies apply a 0-100% concentration scale incorporating primary surrogate and secondary antibodies to generate a standardized curve for DAB precipitation. IHC on GB samples was performed independently by the pathology departments of two different hospital laboratories, and the slides were digitized at 40x magnification using Aperio Image Scope software. Feature extraction, including intensity and texture parameters, was performed using the EBImage package in R, followed by UMAP dimensionality reduction and DBSCAN clustering analysis. Our results show significant differences in intensity and texture clustering patterns between laboratory tissue samples and the intra-slide calibration technology ruler, attributable to the different laboratories. Intra-slide calibration technology coupled with polynomial regression analysis improved data harmonization by ~90%. Our findings demonstrate a key role for computational pathology using intra-slide calibration technology to enable intra-laboratory consistency and inter-laboratory reproducibility. These advances strengthen the reproducibility of diagnostic assessments and support more objective, data-driven decision-making in neuro-oncology.
bioinformatics2026-06-08v1HNSW-MS: Hierarchical Graph Indexing Enables Accurate Real-Time Mass Spectral Similarity Search at Repository Scale
Semenov, A.; Gupta, S.; Roberts, A. M. P.; Boginski, V.; Aksenov, A. A.Abstract
Spectral similarity search is the basis of mass spectrometry-based metabolomics, underpinning library matching, molecular network construction, and repository searches such as MASST. Until recently, dataset sizes were limited, making exhaustive pairwise comparison tractable. This is no longer true. Public repositories such as GNPS now exceed one billion spectra, and the emerging paradigm of reverse metabolomics (placing experimental spectra into the context of all existing public data to drive annotation and discovery) demands search at a scale where linear sequential comparison is no longer viable. We introduce HNSW-MS, which implements Hierarchical Navigable Small World graph indexing natively for mass spectral similarity, operating directly on raw GC-MS and LC-MS/MS spectra without preprocessing or embedding, thus ensuring maximum reproducibility. Validated on 8.4 million MS/MS spectra, HNSW-MS achieves up to 560-fold acceleration over linear scan while maintaining top-1 recall above 90%, with perfect recall achievable at moderate parameter settings. This acceleration removes the search bottleneck at repository scale, enabling near real-time spectral querying against the entirety of public metabolomics data.
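To make the "linear scan" baseline and the "top-1 recall" metric in the abstract above concrete, the following is a minimal Python sketch. It is not the HNSW-MS implementation: the binned cosine similarity, the bin width, and the function names (bin_spectrum, cosine, linear_top1, recall_at_1) are all assumptions chosen for illustration, and the abstract states that HNSW-MS itself works on raw spectra without such preprocessing.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Sum intensities of (m/z, intensity) peaks into fixed-width m/z bins.

    Binning is an illustrative assumption, not a step described by the authors.
    """
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz // bin_width)] += intensity
    return dict(binned)


def cosine(a, b):
    """Cosine similarity between two sparse binned spectra (dict: bin -> intensity)."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def linear_top1(query, library):
    """Exhaustive (linear) scan: index of the most similar library spectrum."""
    return max(range(len(library)), key=lambda i: cosine(query, library[i]))


def recall_at_1(exact_hits, approx_hits):
    """Fraction of queries where an approximate index returns the exact top-1 hit."""
    matches = sum(e == a for e, a in zip(exact_hits, approx_hits))
    return matches / len(exact_hits)


if __name__ == "__main__":
    # Toy, made-up spectra purely for demonstration.
    library = [
        bin_spectrum([(100.2, 1.0), (150.7, 0.5)]),
        bin_spectrum([(200.1, 1.0)]),
    ]
    query = bin_spectrum([(100.4, 0.9), (150.1, 0.6)])
    print("best match index:", linear_top1(query, library))
```

The linear scan costs one similarity computation per library spectrum per query, which is the cost an approximate graph index is meant to avoid; recall_at_1 then measures how often the approximate answer agrees with this exhaustive one.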
bioinformatics2026-06-08v1Multi-feature Classification to Improve Colorimetric Loop-Mediated Isothermal Amplification Fidelity
Melton, G.; Negron, D. A.; Hauser, K.; Jagannathan, S.; Tolli, N.; Jennings, K.; Necciai, B.; Sozhamannan, S.; Abramson, B.Abstract
Loop-mediated isothermal amplification (LAMP) is a cost-effective and portable assay technique for performing nucleic acid-based diagnostics in the field whose adoption is hindered by design and reproducibility issues. This is due to a complex primer design process that fine-tunes parameters across 6-8 binding regions. The likelihood of assay success depends on satisfying thermodynamic and secondary structure constraints while maintaining target specificity and avoiding overlaps between multiple primers. Software such as the NEB(R) LAMP Primer Design Tool, PREMIER Biosoft LAMP Designer, Primer3, PCR Signature Erosion Tool (PSET), and PrimerExplorer enable automation of this task for researchers. However, in our experience, these programs can sometimes yield inconsistent results in laboratory testing. Here, we approached the issue by comparing and training multiple machine learning (ML) models on primer sets targeting various organisms from working assays and failing ones to determine significant features and improve predictions prior to ordering primer sets. A literature review produced an initial list of primer sets (n=116), which were then filtered down based on reference template availability to discern their FIP/BIP components (F2/F1c and B1c/B2). The final training set (n=109) included sequence and thermodynamic features derived from primers collected from the review (n=74) and those designed in-house with PSET (n=35). Failing assays were difficult to obtain from the publications, so we provided our own (n=23). Using WEKA Experimenter, models were created based on decision tree and Bayesian learning algorithms using an experimental scheme that performed a parameter grid search, seeded replicates, feature selection, and cross-validation while avoiding data-leakage and outputting logs for model comparison, feature analysis, and overfit assessment. 
Notably, thermodynamic features associated with the F1c and B1c primers consistently appeared in the top ranks according to consensus between information gain, class-correlation, and model-based feature ranking. For classification, the NaiveBayes algorithm had a TP and TN rate of 0.90 (+/- 0.02) and 0.73 (+/- 0.05) while achieving Cohen's kappa coefficient and F-score values of 0.61 (+/- 0.06) and 0.91 (+/- 0.01). This work highlights how a practical model was built from a small, imbalanced training set incorporating negative research results, of which more are needed to improve generalization and refine parameters critical to assay success.
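The classification metrics quoted above (TP rate, TN rate, Cohen's kappa, F-score) can all be derived from a 2x2 confusion matrix. The sketch below shows the standard formulas in Python; the function name binary_metrics and the example counts are hypothetical and are not taken from the study, which used WEKA.

```python
def binary_metrics(tp, fn, tn, fp):
    """Compute TP rate, TN rate, Cohen's kappa, and F1 from confusion-matrix counts.

    Assumes both classes are present and at least one positive prediction exists,
    so that no denominator is zero.
    """
    n = tp + fn + tn + fp
    tp_rate = tp / (tp + fn)  # sensitivity / recall
    tn_rate = tn / (tn + fp)  # specificity
    observed = (tp + tn) / n  # observed agreement
    # Chance agreement from the marginal totals of actual and predicted labels.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / (n * n)
    kappa = (observed - expected) / (1 - expected)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return {"tp_rate": tp_rate, "tn_rate": tn_rate, "kappa": kappa, "f1": f1}


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    print(binary_metrics(tp=40, fn=10, tn=30, fp=20))
```

Kappa corrects raw accuracy for the agreement expected by chance, which matters on an imbalanced set where a classifier can score a high TP rate and F-score while still separating the minority (failing) class only moderately well.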
bioinformatics2026-06-08v1From topography to connectome: Towards an integrated understanding of the resting brain
Naranjo Rincon, S.; Ahmad, F.; Easley, T.; Shoushtari, S.; Glatard, T.; Kiar, G.; Modi, H.; Dahan, S.; Robinson, E.; Kamilov, U.; Bijsterbosch, J.Abstract
As the field expands from early research into the human connectome, there has been a rapid expansion in the number of analytical approaches to study resting state functional MRI (rsfMRI) data. With increasing focus on individual differences, topographical brain maps of spatial organization have emerged in addition to traditional functional connectomes. Here, we developed a deep-learning model to embed maps of network topography and faithfully translate to individualized connectomes. Results confirmed the validity of the surface vision transformer based on reconstruction accuracy (0.73 ± 0.09) and accurate topography-to-connectome translation (0.43 ± 0.08). Importantly, translated connectomes retained identifiability and brain-cognition associations. These findings establish a direct mapping from spatial topography to connectomes that can be used to integrate scientific insights across rsfMRI sub-fields. This is an important step towards broadening our conceptualization of the connectome and supporting broader integration of findings to inform a complete understanding of the human connectome.
bioinformatics2026-06-08v1scFAIR Consortium: a decentralized hub for single-cell RNA-Seq data standardization and unification
Gardeux, V.; Carsanaro, S.; Chen, W. J.; David, F. P. A.; Goutte-Gattat, D.; Hilton, J. A.; Lubiana, T.; Patel, N.; Raymor, B.; Zucchi, I.; Deplancke, B.; Ernst, C.; Osumi-Sutherland, D.; Robinson-Rechavi, M.; Sternberg, P. W.; Bastian, F. B.Abstract
The rapid accumulation of single-cell RNA-Seq (scRNA-seq) data across multiple repositories presents major challenges for data accessibility, integration, and reproducibility. While primary repositories provide raw data, they rarely include structured cell-type annotations or descriptions of analytical workflows, limiting the ability to reuse and integrate datasets in a FAIR (Findable, Accessible, Interoperable, Reusable) manner. Here we present scFAIR, a consortium of single-cell data resources that has developed a unified metadata schema and common curation framework to improve the FAIRness of scRNA-seq data. Building on and extending the CZ CELLxGENE Discover metadata schema, the scFAIR consortium has been instrumental in driving key schema improvements, including the expansion of supported organisms, richer biological context, and structured reporting of computational workflows. To provide unified access to decentralized datasets, the consortium developed the sc-fair.org portal, which currently aggregates 2,346 datasets across partner resources through ontology-aware semantic search. We demonstrate the practical value of FAIR-compliant datasets through a cross-species validation between human and mouse Allen Brain Atlases, showing that standardized ontology annotations enable reliable annotation transfer across species, with 90% of neuronal clusters receiving an exact or equivalent label. Together, the scFAIR schema, validator, and portal constitute a community-driven framework that advances single-cell data standardization and lays the foundation for reproducible, large-scale integration of single-cell datasets.
bioinformatics2026-06-08v1DipSkmer: Reference-free population genomics with diploid genome skims
Charvel, E.; Alves Monteiro, H. J.; Mirarab, S.; Bafna, V.Abstract
Ecologists and conservation biologists rely on genetic diversity as a key essential biodiversity variable (EBV) used to track population health and dynamics, and utilize the population parameter θ (estimated by the average pairwise genomic distance) as a key metric of diversity. While whole-genome sequencing (WGS) is increasingly affordable, it will be a considerable time before the full diversity of life is represented by high-quality assembled genomes; even then, constant monitoring will still require repeated sampling of populations. In contrast, genome skimming (low-coverage, short-read WGS) is highly cost-effective but challenging to analyze because the coverage is too low for assembly and reliable error correction. Mature methods, such as Mash, exist for estimating pairwise genomic distances based on the Jaccard similarity of k-mer sets computed using sketching techniques. Some, such as Skmer, additionally model the impacts of low coverage. These methods have been successfully applied to assembly-free species identification and phylogenetics; however, their use in population genetics has been limited. This is because these methods implicitly treat genomes as haploid and heterozygosity confounds true estimates of genomic distance for diploid organisms. In this paper, we address this problem through a number of technical advances. First, we use coalescent theory to mathematically derive how the Jaccard index between two diploid samples changes with the scaled population size parameter (θ). Next, we derive an estimator that computes θ from the Jaccard index, in addition to several auxiliary variables, which we also estimate from the genome skims. The resulting method, DipSkmer, enables more accurate estimates of coverage, sequencing error, and pairwise nucleotide distance for diploid samples.
Analyses of both simulated and empirical datasets show that for diploids and low distances (e.g., <2%), DipSkmer produces the most accurate pairwise distance estimates, outperforming existing alignment-free methods such as Mash and Skmer, and closely approximates ANGSD, a reference and alignment-based tool.
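For context on the haploid baseline that the abstract contrasts with, the sketch below shows the Mash-style conversion from a k-mer Jaccard index J to a distance, D = -(1/k) ln(2J / (1 + J)). It is only an illustration of that existing transform on exact k-mer sets: it does not use sketching, does not model coverage or sequencing error as Skmer does, and is not the diploid estimator derived in DipSkmer. The function names and toy sequences are assumptions.

```python
import math


def kmers(seq, k):
    """Return the set of all length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard index of two sets; defined as 1.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mash_distance(j, k):
    """Mash-style distance from Jaccard index j and k-mer length k.

    Implicitly treats each sample as a single haploid genome. Returns 1.0 when
    no k-mers are shared, and clamps the result to the range [0, 1].
    """
    if j <= 0.0:
        return 1.0
    d = -math.log(2.0 * j / (1.0 + j)) / k
    return min(1.0, max(0.0, d))


if __name__ == "__main__":
    # Toy sequences for demonstration; real genome skims use k around 21-31.
    a = kmers("ACGTACGTTAGC", 4)
    b = kmers("ACGTACGATAGC", 4)
    j = jaccard(a, b)
    print("Jaccard:", round(j, 3), "distance:", round(mash_distance(j, 4), 3))
```

Because a heterozygous diploid sample contributes k-mers from both haplotypes, the Jaccard index between two such samples no longer maps to distance through this haploid formula, which is the confounding the abstract describes.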
bioinformatics2026-06-08v1A Web-based Software Resource for Interactive Analysis of Multiplex Tissue Imaging Datasets
Creason, A. L.; Watson, C.; Gu, Q.; Persson, D.; Sargent, L. L.; Chen, Y.-A.; Lin, J.-R.; Sivagnanam, S.; Wünnemann, F.; Nirmal, A. J.; Chin, K.; Feiler, H. S.; Holly, H.; Coussens, L. M.; Schapiro, D.; Grüning, B. A.; Sorger, P. K.; Sokolov, A.; Goecks, J.Abstract
Highly multiplexed tissue imaging (MTI) methods are powerful spatial proteomics technologies that enable in situ single-cell characterization of tissues. However, analysis and visualization of MTI datasets remain challenging, and we developed the Galaxy-ME software hub to address this challenge. Galaxy-ME is a web-based, interactive software hub that enables end-to-end analysis and visualization of MTI datasets and is accessible to everyone. To demonstrate its utility, Galaxy-ME was used to analyze datasets obtained from multiple MTI assays in both normal and cancerous tissues. Galaxy-ME is a publicly available web resource.
bioinformatics2026-06-07v3Metadata Collector: An Open-Source Platform for Standardized Metadata Management in Multi Centre Sequencing Projects
Liguori, R.; Ferrazzi, F.Abstract
Background: Next-generation sequencing (NGS) projects generate increasingly complex metadata that are critical for reproducibility, interoperability, and compliance with FAIR principles. Nevertheless, metadata curation in multi-institutional settings often still relies on spreadsheets, manual data entry and curation, as well as non-standardized terminology. These practices frequently result in incomplete or inconsistent annotations, hinder metadata sharing, and delay submission to public repositories. Results: We developed Metadata Collector as a React/API/PostgreSQL web platform and deployed it on a Kubernetes cluster within a large German research consortium. The platform implements a flexible, machine-readable metadata model for experimental data and integrates customizable templates, controlled vocabularies designed to support future ontology integration, and a complete event-based versioning model. Since deployment, Metadata Collector has been used across 32 projects involving RNA-seq, scRNA-seq, ATAC-seq and multiomics datasets, representing over 700 annotated samples contributed by multiple consortium partners. The platform is designed for use by non-computational researchers as well as centralized facilities and can be integrated into existing research data management infrastructures. Conclusions: Metadata Collector embeds standardization early in the metadata lifecycle, ensuring consistent, FAIR-aligned, and reproducible metadata across distributed research groups. Its modular, open-source architecture supports both local and consortium-scale deployments and provides a foundation for future extensions, including multi-omics support and integration with laboratory information management systems and automated submission pipelines.
bioinformatics2026-06-07v2An Agentic Platform for Drug Repurposing Unified across Molecular, Phenotypic, and Clinical Scales
Wang, C.; El Moussaoui, M.; Zhang, D.; Prabhakaraalva, P.; Merzliakov, S.; Lu, R. J.-H.; Zaman, N.; Chakraborty, G.; Huang, K.-l.Abstract
Drug repurposing offers an effective path to new therapies, yet existing computational approaches rely on a single line of evidence and are rarely validated across biological scales. We present LinkD, an integrated framework that unifies diffusion-based affinity prediction, proteome-wide selectivity scoring, phenotypic validation, and population-scale clinical evidence. LinkD-Bind predicts binding across 14,981 drugs and 20,385 human targets, ranking first in 8 of 9 BindingDB, Davis, and KIBA evaluations, with the largest gains under cold-start conditions. LinkD-Select recovers 95.3% of known drug-target pairs by combining selectivity scoring and molecular docking. LinkD-Pheno integrates drug-sensitivity and CRISPR dependency data across 960 cancer cell lines, identifying 34 novel drug-gene pairs and recovering ~85% of known targets among the top 50 candidates. Across 11.5 million individuals from Mount Sinai and UK Biobank, LinkD-prioritized β-blockers propranolol (HR 0.82) and carvedilol (HR 0.92) reduced 5-year prostate cancer incidence relative to metoprolol, corroborated by ADRB2 docking and LNCaP growth inhibition. LinkD-Agent, which can effectively orchestrate all evidence layers, is served on a publicly available web platform (https://linkd-agent.onrender.com/), enabling a wide range of users to derive new drug repurposing opportunities through natural language queries.
bioinformatics2026-06-07v2BacteReason: A Reasoning Model for Antimicrobial Resistance Prediction
Oikawa, Y.; Kawashima, S.; Kinjo, A. R.; Demizu, Y.; Tamura, R.; Tsuda, K.Abstract
The rapid global spread of antimicrobial resistance (AMR) has placed unprecedented pressure on clinical decision-making. Machine learning predictors of antibiotic susceptibility exist, but their lack of mechanistic grounding limits credibility. We present BacteReason, a reasoning large language model (LLM) that predicts bacterial susceptibility to a target antibiotic, together with a mechanistic rationale. BacteReason is obtained by fine-tuning an open-weight LLM on clinical susceptibility data augmented with rationales that explain the molecular mechanisms. These rationales are produced by a proprietary teacher LLM prompted to explain known susceptibility outcomes. The teacher is interfaced via TogoMCP with a collection of biomedical knowledge-graph databases, grounding each reasoning step in retrieved evidence. On an extrapolation benchmark, BacteReason achieves a relative improvement of 43% over the untuned baseline and 38% over the same base LLM fine-tuned without rationales, demonstrating that reasoning supervision improves prediction accuracy.
bioinformatics2026-06-07v1CREP: Cis-Regulatory Element Predictor Based on Fine-Tuned Enformer
Stranieri, N.; Riva, S. G.; Hughes, J. R.Abstract
A substantial fraction of disease-associated genetic variants reside in non-coding regions of the genome, where they act by perturbing cis-regulatory elements (CREs) such as enhancers, promoters, and insulators. While recent sequence-based deep learning models, such as Enformer, accurately predict continuous epigenomic signals from DNA sequence, they do not directly provide discrete and interpretable CRE annotations. Here, we present CREP (Cis-Regulatory Element Predictor), a fine-tuned version of Enformer trained to predict regulatory element identity from sequence using REgulamentary-derived annotations across multiple human cell-types. Through a controlled experimental framework, we show that incorporating diverse cell-types improves model performance. CREP leverages cell-type-specific training data to learn regulatory representations while producing a unified prediction of CRE identity from sequence. This is demonstrated by the Vanuatu SNP, a non-coding variant that creates a de novo erythroid regulatory element, which is correctly detected only when erythroid data are included during training. Error analysis further reveals that apparent misclassifications between enhancers and promoters reflect their shared regulatory architecture, supporting the view of CREs as a functional continuum rather than strictly discrete classes. Together, these results demonstrate that CREP enables interpretable prediction of regulatory element identity from sequence and provides a framework for the functional interpretation of non-coding genetic variation.
bioinformatics2026-06-07v1Fasting Status and Epigenetic Clock Stability: Implications for Aging Research
Seale, K. B.; Dwaraka, V. B.; Giosan, I.; Mendez, T.; Smith, R.Abstract
Background: Epigenetic clocks are DNA methylation-based biomarkers increasingly used in aging research and clinical trials. A recent assessment of 18 clocks across multiple short-term perturbations concluded that most demonstrate only moderate biological reliability, raising concerns about their translational utility. However, understanding biological variability requires understanding the construction of each clock: different clocks capture distinct biological properties that respond differently to specific perturbations, and pooling reliability metrics across heterogeneous populations and array platforms may obscure the mechanisms driving variability in each case. Methods: We evaluated 24 epigenetic clocks spanning five construction categories - first- and second-generation classical clocks (e.g., Horvath, Hannum, PhenoAge), the PC versions of the classical clocks, SystemsAge organ-system clocks, mortality-trained clocks (GrimAge, PCGrimAge, OMICmAge), pace of aging clocks (DunedinPACE) and the IntrinClock, across three datasets: a within-person paired fasting design (n = 15 pairs), a cross-sectional cohort of fasted vs non-fasted (n = 2,895), and EPICv2custom technical replicates (n = 96 samples from 4 individuals). For each clock, we quantified the acute fasting effect with and without immune cell adjustment, decomposed between-person and within-person variance at successive adjustment levels (Raw, EAA, IAA), and benchmarked biological variability against the technical measurement floor. Results: Fasting followed by acute refeeding was associated with group-level shifts of 0.5-3 years in immune-sensitive clocks, while within-person reliability remained high (Raw clock ICC median ~0.96). These observations are compatible because fasting effects are small relative to the age-driven between-person variance that dominates the ICC denominator. The magnitude of the observed shift varied by clock.
PC transformations showed larger effects than their classical counterparts in the paired cohort (PC Hannum -2.03 vs. Hannum -1.37 years; PC PhenoAge > PhenoAge; PC Horvath > Horvath), SystemsAge showed the largest effects (1.15-2.9 years younger when fasted), and mortality-trained clocks (GrimAge V1/V2, OMICmAge) and DunedinPACE showed no detectable acute effect (all FDR p > 0.10). Immune cell adjustment attenuated or eliminated the fasting effects in sensitive clocks (PC Hannum 88% attenuation; SystemsAge Blood 99.7%); no clock retained a significant fasting effect after FDR-corrected immune adjustment in either cohort. Within the cross-sectional cohort, a clock's immune content, which is the fraction of its age-independent variance explained by immune cell composition, was correlated with the degree to which immune adjustment attenuated its fasting effect (r = 0.68, p = 0.003). IntrinClock, designed to exclude immune-variable CpGs, showed no fasting effect in either cohort (immune R-squared = 3.2%), serving as a negative control. Technical replicates confirmed near-perfect measurement reproducibility (median Raw ICC > 0.97), establishing that variance in fasting pairs reflects biology, not noise. Immune-adjusted ICCs behaved differently across clocks in ways consistent with their composition: for clocks where fasting generated within-person variance, immune adjustment removed it and ICC increased (SystemsAge EAA 0.768 to IAA 0.913); for clocks unaffected by fasting, immune adjustment removed between-person structure and ICC fell substantially (OMICmAge 0.922 to 0.160), reflecting the estimation cost of fitting many immune cell predictors to stable residuals. Cross-sectional replication (n = 2,895) confirmed immune cell redistribution at scale. Mortality clocks reached significance cross-sectionally despite resistance to acute fasting. 
Conclusions: Acute refeeding after an overnight fast elicits small shifts in some epigenetic clocks, which varied systematically by training category in our data. PC-based clocks, which concentrate correlated CpG variance including that associated with immune cell composition, showed the largest shifts; mortality-trained clocks showed no detectable acute effect. A reliability-only framework that reports ICC without also testing for systematic group-level effects can miss the kind of structured biological variation observed here under fasting. ICC is not a fixed property of a clock; it is shaped by the study design, the population heterogeneity, the perturbation, and the adjustment applied. We recommend that clock reliability be assessed on a perturbation-specific, clock-by-clock basis, with variance decomposition at each adjustment level and explicit benchmarking against technical replicates.
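The abstract's point that a systematic group-level shift can coexist with a high ICC can be illustrated with a small simulation. This is a minimal sketch, not the authors' analysis: the one-way ICC(1,1) estimator, the simulated age spread, the shift size, and the noise level are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import random
from statistics import mean


def icc_oneway(pairs):
    """One-way random-effects ICC(1,1) for two measurements per person."""
    n, k = len(pairs), 2
    grand = mean(v for p in pairs for v in p)
    person_means = [mean(p) for p in pairs]
    # Between-person and within-person mean squares from a one-way ANOVA.
    msb = k * sum((m - grand) ** 2 for m in person_means) / (n - 1)
    msw = sum((v - m) ** 2 for p, m in zip(pairs, person_means) for v in p) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)


def simulate(n=200, age_sd=10.0, shift=-2.0, noise_sd=0.5, seed=0):
    """Simulated (non-fasted, fasted) clock values; all parameters are assumptions."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        base = rng.gauss(50.0, age_sd)  # age-driven between-person spread
        non_fasted = base + rng.gauss(0.0, noise_sd)
        fasted = base + shift + rng.gauss(0.0, noise_sd)  # systematic fasting shift
        pairs.append((non_fasted, fasted))
    return pairs


pairs = simulate()
icc = icc_oneway(pairs)
mean_shift = mean(fasted - non_fasted for non_fasted, fasted in pairs)
print(f"ICC = {icc:.3f}, mean fasting shift = {mean_shift:.2f} years")
```

Under these assumptions the between-person variance (about 100 years squared) dwarfs the variance contributed by a 2-year shift, so the ICC stays high even though every simulated person moves in the same direction. A paired test on the differences would still detect the shift, which is why the two quantities answer different questions.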
bioinformatics2026-06-07v1A Web-based software toolkit for accessible and best-practice machine learning analyses in biomedical research
Morais Lyra Junior, P. C.; Qiu, J.; Van Dang, K.; Pybus, A.; Narvaez-Bandera, I.; Singh, M. A.; Gu, Q.; Sargent, L.; Creason, A. L.; Goecks, J.Abstract
Machine learning is increasingly central to biomedical research, but using machine learning well often requires substantial computational expertise and methodological care to produce high-quality results. To make machine learning tools more accessible to biomedical researchers while supporting best-practice approaches, we developed the Galaxy Learning and Modeling (GLEAM) software toolkit. GLEAM enables researchers to perform supervised machine learning analyses through a set of web-based, code-free software tools for tabular, image, and multimodal biomedical datasets. GLEAM standardizes data partitioning, model selection, training, evaluation, and reporting, helping researchers apply machine learning with greater rigor and consistency. GLEAM runs on the Galaxy computational workbench and uses Galaxy's core features to make all analyses accessible, reproducible, and scalable. We validated GLEAM on three biomedical tasks: predicting patient response to immunotherapy, skin lesion classification, and cancer recurrence prediction. Across these tasks, GLEAM produced highly accurate predictive models and improved transparency, reproducibility, and rigor.
bioinformatics2026-06-07v1Multimodal physical evidence uncovers interpretable gene regulatory networks for perturbation prediction
Yang, Z.; Huang, S.; Bai, G.; Dong, J.; Wang, J.; Li, S. Z.Abstract
Gene regulatory networks govern cell fate transitions through dynamic causal mechanisms. Since exhaustively mapping this vast perturbation space experimentally is prohibitive, scalable computational models are essential. Yet, current frameworks fall short because they infer statistical co-expression rather than physical mechanisms, remain blind to non-canonical regulators lacking classical DNA-binding motifs, and fail to generalize across unseen perturbation factors or cell lines. Here we show that a multimodal biophysical framework, VitaGRN, overcomes these barriers by constructing a biophysical regulatory scaffold from multimodal evidence and propagating interactions to capture non-canonical regulators. By leveraging structurally aligned protein embeddings, VitaGRN predicts zero-shot perturbation responses and uncovers non-canonical translational control programs. Notably, VitaGRN demonstrates robust generalization across unseen factors, cell lines, and developmental transitions. Ultimately, VitaGRN generates a confidence-calibrated virtual perturbation atlas spanning over a thousand factors. This resource reframes gene regulatory networks from static correlation graphs into dynamically generalizable and mechanistically transparent models, streamlining wet-lab candidate prioritization.
bioinformatics2026-06-07v1Anthocyanin-associated cellular programs underlying terroir variation in Cabernet Sauvignon grape berry revealed by SEED-based deconvolution
Hu, X.; Tang, Y.; Deng, F.; Chen, Z.; Tang, G.; Yan, X.; Xia, Z.; Tong, H. H. Y.; Zhan, J.; Zou, X.; Hao, J.Abstract
Plant tissues consist of diverse cell populations that collectively contribute to development, metabolism, environmental responses, and phenotype formation. Although single-cell and single-nucleus RNA sequencing have greatly advanced the study of plant cellular heterogeneity, their application to large sample cohorts remains limited by cost, technical complexity, tissue dissociation constraints, and throughput. In contrast, bulk RNA-seq datasets have accumulated extensively across plant species, tissues, developmental stages, and environmental conditions, yet the cell-type-level information embedded in these datasets remains difficult to resolve because plant-oriented deconvolution frameworks are still lacking. Existing deconvolution methods have largely been developed in mammalian systems and have not been systematically optimized for plant transcriptomic features, leaving their applicability under plant-specific constraints unclear. Here, we present SEED, an adaptive deconvolution framework optimized for plant transcriptomic data. SEED integrates candidate reference-template construction with seven deconvolution strategies and automatically identifies an optimal combination for a given dataset. In grapevine simulated benchmarking, SEED showed its clearest advantage under low-replication conditions and remained broadly competitive, rather than uniformly dominant, when larger pseudo-bulk sample sizes were evaluated. SEED further performed robustly in public Arabidopsis thaliana and Nicotiana tabacum datasets. Finally, we applied SEED to bulk RNA-seq data generated in this study from Vitis vinifera cv. Cabernet Sauvignon berries collected from Yinchuan and Yantai, identifying terroir-associated cell subtypes and coordinated cell-type interaction patterns.
Together, these results establish SEED as a practical framework for plant transcriptome deconvolution and provide a new tool for dissecting cellular heterogeneity associated with environmental adaptation and phenotype formation in plants.
bioinformatics2026-06-07v1VelocityFM: Short-Horizon Protein Trajectory Prediction via Flow Matching in Velocity Space
Jayathilake, L.; Wijesinghe, C. R.; Weerasinghe, R.Abstract
Protein dynamics is fundamentally a trajectory prediction problem, but molecular dynamics (MD) simulation remains expensive and static structure predictors do not model time-ordered motion. We present VelocityFM, a short-horizon protein trajectory predictor that applies rectified flow matching in velocity space over residue frames and torsions. The model combines six Invariant Point Attention (IPA) blocks with a two-layer per-residue temporal self-attention encoder, and is trained on 710 ATLAS proteins comprising 2090 filtered replicate trajectories. At the primary 128-frame rollout horizon, VelocityFM achieves a median TM-score of 0.929 on 72 held-out proteins, with 100% of proteins remaining above TM > 0.7 and 100% clash-free generation. Backbone geometry also remains strong, with a median Ramachandran favoured rate of 91.09%, while dynamics calibration is conservative with median RMSF ratio 0.697. These results show that velocity-space geometric learning can generalise short-horizon trajectory prediction to unseen proteins while preserving fold structure and geometric validity within its intended operating regime.
bioinformatics2026-06-07v1Metadata Collector: An Open-Source Platform for Standardized Metadata Management in Multi Centre Sequencing Projects
Liguori, R.; Ferrazzi, F.Abstract
Background: Next-generation sequencing (NGS) projects generate increasingly complex metadata that are critical for reproducibility, interoperability, and compliance with FAIR principles. Nevertheless, metadata curation in multi-institutional settings often still relies on spreadsheets, manual data entry and curation, as well as non-standardized terminology. These practices frequently result in incomplete or inconsistent annotations, hinder metadata sharing, and delay submission to public repositories. Results: We developed Metadata Collector as a React/API/PostgreSQL web platform and deployed it on a Kubernetes cluster within a large German research consortium. The platform implements a flexible, machine-readable metadata model for experimental data and integrates customizable templates, controlled vocabularies designed to support future ontology integration, and a complete event-based versioning model. Since deployment, Metadata Collector has been used across 32 projects involving RNA-seq, scRNA-seq, ATAC-seq and multiomics datasets, representing over 700 annotated samples contributed by multiple consortium partners. The platform is designed for use by non-computational researchers as well as centralized facilities and can be integrated into existing research data management infrastructures. Conclusions: Metadata Collector embeds standardization early in the metadata lifecycle, ensuring consistent, FAIR-aligned, and reproducible metadata across distributed research groups. Its modular, open-source architecture supports both local and consortium-scale deployments and provides a foundation for future extensions, including multi-omics support and integration with laboratory information management systems and automated submission pipelines.
bioinformatics2026-06-07v1CytoGem-XAI: A Hypergraph Neural Network Framework for Genome-Scale Metabolic Modeling and Interpretable Analysis
Chen, S.; Chen, T.; Xu, Z.; Zhang, L.; Gao, B.; Mao, J.Abstract
Genome-scale metabolic models are essential for understanding cellular metabolism, yet existing deep learning approaches remain black boxes, and traditional flux balance analysis (FBA) cannot provide sample-specific predictions. To our knowledge, CytoGem-XAI is the first framework to combine hypergraph neural network representation with interpretable, FBA-parallel analysis and sample-specific metabolic characterization. Built upon hypergraph representations where reactions are encoded as hyperedges connecting their participating metabolites, CytoGem-XAI introduces three analysis modules: perturbation-based carbon source importance ranking, hard intervention reaction bottleneck identification, and pathway-level topological attribution. Beyond prediction, CytoGem-XAI uniquely enables condition-dependent carbon source essentiality and reaction bottlenecks that vary with genetic background - capabilities absent from both traditional FBA and existing deep learning methods. Trained on 17,400 E. coli growth conditions using 10-fold cross-validation, our framework achieves R^2 = 0.862, substantially outperforming AMN (R^2 = 0.81, +6.4%), FBA (R^2 = 0.62, +39%), and gradient boosting baselines (R^2 = 0.71, +21%). Biological validation confirms that CytoGem-XAI identifies known essential carbon sources (e.g., alanine, malate) and rate-limiting enzymes (e.g., TCA cycle), while also revealing N-acetylmuramate - a peptidoglycan precursor - as a previously underappreciated essential nutrient.
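The hypergraph encoding described in the abstract above, with reactions as hyperedges over their participating metabolites, can be sketched as an incidence matrix. This is an illustrative toy, not CytoGem-XAI's code: the reaction identifiers, metabolite names, and the unsigned 0/1 encoding are assumptions made here, and the abstract does not say whether stoichiometry or direction is stored.

```python
# Toy reactions (hypothetical identifiers): each reaction lists every
# metabolite that participates in it, substrates and products alike.
reactions = {
    "R_hex": ["glucose", "atp", "g6p", "adp"],
    "R_pgi": ["g6p", "f6p"],
    "R_pfk": ["f6p", "atp", "fbp", "adp"],
}


def build_incidence(reactions):
    """Return (metabolites, reaction ids, incidence matrix).

    Rows are metabolites (nodes), columns are reactions (hyperedges);
    an entry is 1 when the metabolite participates in the reaction.
    """
    metabolites = sorted({m for members in reactions.values() for m in members})
    rxn_ids = list(reactions)
    matrix = [[1 if m in reactions[r] else 0 for r in rxn_ids] for m in metabolites]
    return metabolites, rxn_ids, matrix


metabolites, rxn_ids, H = build_incidence(reactions)

# Column sums give the hyperedge size (metabolites per reaction);
# row sums give the node degree (reactions per metabolite).
hyperedge_sizes = {r: sum(row[j] for row in H) for j, r in enumerate(rxn_ids)}
node_degrees = {m: sum(H[i]) for i, m in enumerate(metabolites)}

print(hyperedge_sizes)
print(node_degrees)
```

Unlike an ordinary graph, where an edge joins exactly two nodes, a single column here can join any number of metabolites, which is what lets one reaction be represented as one object rather than as a clique of pairwise links.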
bioinformatics2026-06-07v1Single-cell gene regulatory network reconstruction and key regulator identification using a dual-channel fusion graph convolutional network
Tang, R.; Liu, J.; Zhang, P.; Liang, X.Abstract
Background and objective: Gene regulatory networks are formed by complex regulatory relationships between transcription factors and their target genes. A systematic understanding of these regulatory relationships is crucial for deciphering the molecular mechanisms that underlie cell state transitions under physiological and pathological conditions. Single-cell expression data can reveal cell-type-specific transcriptional regulation, and computational methods have recently been developed to infer gene regulatory networks from single-cell transcriptomics and prior regulatory knowledge. However, existing methods could not explore the common and specific information in expression correlations and prior regulatory knowledge, which can adversely affect prediction performance. Methods: We propose a novel method for inferring gene regulatory networks from single-cell RNA sequencing data. The proposed method consists of dual-channel graph neural networks and a weight-shared common graph neural network, enabling effective fusion of prior regulatory knowledge with gene co-expression patterns. Furthermore, we formulate a new computational framework built upon the proposed algorithm, which integrates differential gene expression profiles and regulatory changes to identify key regulators that distinguish different cell states. Results: Experimental results demonstrate that our method significantly improves the accuracy of regulatory inference across multiple datasets, outperforming other state-of-the-art approaches. Our method also exhibits robustness to noise and missing data. Analysis of two single-cell expression datasets suggests that the proposed framework could help identify key regulators involved in tumor metastasis and drug resistance. 
Conclusion: These results indicate that the proposed method could advance the understanding of the biological mechanisms underlying diseases by reconstructing single-cell gene regulatory networks and identifying key regulators across different cell states.
bioinformatics2026-06-07v1CiliAI: Automated segmentation and compartment specific fluorescence quantification of primary cilia in confocal microscopy images
Karapetian, E.; Gerhardt, C.; Nazif, E.; Pfirrmann, T.Abstract
Primary cilia regulate essential signalling pathways controlling cell proliferation, differentiation, and tissue homeostasis. Quantitative analysis of ciliary morphology and compartment-specific protein localization by confocal microscopy is labor-intensive, user-dependent, and difficult to scale, particularly for multiplexed 3D image datasets. Here, we present CiliAI, a web-based deep-learning workflow for automated detection, substructure segmentation, and quantitative analysis of primary cilia in confocal microscopy images. CiliAI identifies ciliary substructures including the basal body, transition zone, and axoneme from multiplexed 3D image stacks and performs automated measurements of cilium length and compartment-specific fluorescence intensity. In NIH-3T3 cells, automated cilium length measurements showed close agreement with manual quantification and no statistically significant difference between methods (mean difference -0.214 µm, p = 0.213). Automated fluorescence analysis reproduced previously reported reductions in transition zone-associated Cep290 signal intensity in Rpgrip1l-deficient cells and identified the absence of significant Rpgrip1l accumulation changes in Rmnd5a-deficient cells. Automated processing reduced analysis time from days of manual quantification to minutes. Together, these findings establish CiliAI as an automated framework for quantitative analysis of ciliary morphology and compartment-specific protein abundance in confocal microscopy datasets.
bioinformatics2026-06-07v1Germline regulation of tumor evolutionary dynamics shapes multiple myeloma progression
Chen, H.; Shu, J.; Mudappathi, R.; Wang, P.; Bergsagel, L.; Yang, P.; Sun, Z.; Shi, C.; Liu, L.Abstract
Germline variation shapes cancer risk, yet its influence on the evolutionary dynamics of established tumors remains poorly understood. In multiple myeloma, subclonal diversification drives disease progression and treatment failure, but the heritable factors that modulate this process are unknown. Here, we show that germline variation is associated with tumor evolutionary features, implicating inherited regulation in subclonal expansion. Integrating germline variation with tumor evolutionary parameters identifies variants associated with evolutionary features, with signals enriched in regulatory regions, consistent with a transcriptional basis. We further identify TBKBP1 as a key locus linking germline variation to tumor evolution and clinical outcome. Germline variation at this locus is associated with TBKBP1 expression and subclonal expansion, and TBKBP1 expression correlates with adverse prognosis, consistent across independent cohorts. Functional analyses demonstrate that TBKBP1 promotes proliferation and activates MYC, mTORC1 and non-canonical NF-κB signaling pathways. Together, these findings establish germline regulatory variation as a determinant of tumor evolutionary dynamics and identify TBKBP1 as a mediator linking inherited variation to subclonal expansion and disease progression in multiple myeloma.
bioinformatics2026-06-07v1Multi-level, multi-body atomic interaction graphs for machine learning-based prediction of protein-ligand binding energies
Le, T. T. H.; Nguyen, B. T.; Vo, H.; Nguyen, N. H.; Nguyen, D. D.Abstract
Accurate prediction of binding affinity is crucial for rational drug design and discovery. Traditional computational methods often rely on complex scoring functions that incorporate a multitude of physical and chemical descriptors, leading to high computational demands and sometimes limited generalizability. In this work, we propose a novel scoring function that models multi-level, multi-body atomic interactions using graph-based representations. Our method constructs comprehensive interaction graphs that incorporate both pairwise and triplet-wise atomic features that help capture cooperative spatial patterns essential for binding affinity prediction. By employing a feature fusion strategy, GMI-Score maintains model simplicity while enhancing accuracy. Extensive evaluation across multiple datasets, such as PDBbind v2013, PDBbind v2016, PDBbind v2020, CSAR-NRC-HiQ, and PDBbind-Redocked, demonstrates that our model consistently outperforms state-of-the-art scoring functions, achieving Pearson correlation coefficients up to 0.877. Furthermore, it retains strong predictive power under strict data leakage controls and realistic docking conditions to highlight its robustness and generalizability.
bioinformatics2026-06-07v1GLOF: A large-scale expert-curated benchmark dataset of gain-of-function and loss-of-function missense variants
Maricato, V.; Schlesinger, D.; de Souza Moura, P. N.Abstract
Distinguishing loss-of-function (LOF) from gain-of-function (GOF) effects of missense variants is fundamental to understanding disease mechanisms and guiding therapeutic strategy, yet no large-scale, expert-curated benchmark has been publicly available for this task. Here we present GLOF (Gain and Loss Of Function), a dataset of 112,399 missense variants across 2,809 human genes, each classified as LOF, GOF, or neutral by board-certified clinical geneticists following ACMG guidelines. Pathogenic variants were sourced from ClinVar and annotated with their functional mechanism based on published functional studies, phenotype correlations, and established gene-disease relationships. Neutral variants were drawn from gnomAD v3.1 and validated against v4.1 using stringent population frequency filters. The dataset spans diverse protein families, includes 97 genes with bidirectional mechanisms (containing both LOF and GOF variants), and has been validated against well-characterized variants in the literature. GLOF is publicly available on Kaggle (https://www.kaggle.com/datasets/maricatovictor/loss-and-gain-of-function-variants) and Hugging Face (https://huggingface.co/datasets/victormaricato/glof), and provides a standardized resource for developing and benchmarking computational methods that predict variant functional mechanisms.
bioinformatics2026-06-07v1