Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
DOTSeq enables genome-wide detection of differential ORF usage
Lim, C. S.; Chieng, G. S. W.

Abstract:
Protein synthesis is regulated by multiple cis-regulatory elements, including small ORFs, yet current differential translation methods assume uniform changes at the gene level. We present DOTSeq, a Differential ORF Translation statistical framework that resolves ORF-level regulation in bulk ribosome profiling (Ribo-seq) experiments and provides ORF-level read summarisation for single-cell Ribo-seq. DOTSeq's core module, Differential ORF Usage (DOU), quantifies changes in an ORF's relative contribution to a gene's translation output, using a beta-binomial GLM with flexible dispersion modelling. DOTSeq also implements ORF-level Differential Translation Efficiency (DTE) using a standard approach to complement DOU. Benchmarks show that DOU achieves superior sensitivity with near-nominal FDR across effect sizes, while DTE and some existing methods excel when technical noise is low. DOTSeq introduces an ORF-aware, quantitative framework for ribosome profiling, delivering end-to-end workflows for ORF annotation, read summarisation, contrast estimation, and visualisation to uncover translational control events at scale.
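The core DOU idea, modelling an ORF's counts as a beta-binomial fraction of its gene's total and testing for a condition effect, can be sketched with a likelihood-ratio test. This is a simplification of DOTSeq's GLM, not its implementation; the counts, starting values, and sample sizes below are invented for illustration:

```python
import numpy as np
from scipy import optimize, stats

def bb_nll(params, k, n):
    # negative log-likelihood of a beta-binomial parameterised by
    # mean mu and dispersion rho
    mu, rho = params
    a = mu * (1 - rho) / rho
    b = (1 - mu) * (1 - rho) / rho
    return -stats.betabinom.logpmf(k, n, a, b).sum()

def fit_nll(k, n):
    # fit (mu, rho) by maximum likelihood; return the minimised NLL
    res = optimize.minimize(bb_nll, x0=[0.4, 0.1], args=(k, n),
                            bounds=[(1e-4, 1 - 1e-4)] * 2)
    return res.fun

# toy data: ORF-level counts k out of gene-level totals n, per sample
k_ctrl, n_ctrl = np.array([30, 35, 28]), np.array([100, 110, 95])
k_trt,  n_trt  = np.array([60, 55, 62]), np.array([105, 98, 112])

# alternative model: separate (mu, rho) per condition; null: pooled fit
nll_alt  = fit_nll(k_ctrl, n_ctrl) + fit_nll(k_trt, n_trt)
nll_null = fit_nll(np.r_[k_ctrl, k_trt], np.r_[n_ctrl, n_trt])
lr   = 2 * (nll_null - nll_alt)   # likelihood-ratio statistic
pval = stats.chi2.sf(lr, df=2)    # 2 extra parameters in the alternative
```

Here the ORF's relative usage shifts from roughly 30% to roughly 57% of the gene's output, so the likelihood-ratio test flags it even though total gene-level translation barely changes.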

bioinformatics | 2026-03-18 | v3

plsMD: A plasmid reconstruction tool from short-read assemblies
Lotfi, M.; Jalal, D.; Sayed, A. A.

Abstract:
While whole genome sequencing (WGS) has become a cornerstone of antimicrobial resistance (AMR) surveillance, the reconstruction of plasmid sequences from short-read WGS data remains a challenge due to repetitive sequences and assembly fragmentation. Current computational tools for plasmid identification and binning, such as PlasmidFinder, cBAR, PlasmidSPAdes, and Mob-recon, have limitations in reconstructing full plasmid sequences, hindering downstream analyses like phylogenetic studies and AMR gene tracking. To address this gap, we present plsMD, a tool designed for full plasmid reconstruction from short-read assemblies. plsMD integrates Unicycler assemblies with replicon and full plasmid sequence databases (PlasmidFinder, MOB-typer and PLSDB) to guide plasmid reconstruction through a series of contig manipulations. Evaluated on two datasets, an established benchmark used in previous benchmarking studies and a novel dataset of newly sequenced bacterial isolates, plsMD outperformed existing tools on both. In the benchmark dataset, it achieved excellent recall, precision, and F1 scores of 91.3%, 95.5%, and 92.0%, respectively. In the novel dataset, it achieved good recall, precision, and F1 scores of 77.6%, 88.9%, and 74.5%, respectively. plsMD supports two usage modalities: single-sample analysis for plasmid reconstruction and gene annotation, and batch-sample analysis for phylogenetic investigations of plasmid transmission. This computational tool represents a significant advancement in plasmid analysis, offering a robust solution for utilizing existing short-read WGS data to study plasmid-mediated AMR spread and evolution.

bioinformatics | 2026-03-18 | v2

PREMISE: A Quality-Aware Probabilistic Framework for Pathogen Resolution and Source Assignment in Viral mNGS
Vijendran, S.; Dorman, K.; Anderson, T. K.; Eulenstein, O.

Abstract:
The circulation of Influenza A viruses (IAVs) in wildlife and livestock presents a significant public health threat due to their zoonotic potential and rapid genomic diversification. Accurate classification of viral subtypes and characterization of within-host diversity are crucial for risk assessment and vaccine development. Although metagenomic sequencing facilitates early detection, prevalent memory-efficient k-mer-based pipelines often discard critical linkage information. This loss of information can result in missed or imprecise pathogen identification, potentially delaying clinical and public health responses. We introduce premise (Pathogen Resolution via Expectation Maximization In Sequencing Experiments), a probabilistic, alignment-based framework implemented in Rust for high-resolution viral genome identification. By integrating advanced string data structures for efficient alignment with a quality-score-aware Expectation-Maximization algorithm, premise accurately identifies source strains, estimates relative abundances, and performs precise read assignments. This framework provides superior source estimation with statistical confidence, enabling the identification of mixed infections, recombination, and IAV reassortment directly from raw data. Validated against simulated and empirical datasets, premise outperforms state-of-the-art k-mer methods. Ultimately, this framework represents a significant advancement in viral identification, providing a foundation for novel approaches that can automatically flag reassorted viruses or recombination events in the future, thereby improving the detection of emerging pathogens with zoonotic potential. Availability: https://github.com/sriram98v/premise under an MIT license. Contact: sriramv@iastate.edu
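The quality-score-aware EM at the heart of such a framework can be sketched in a few lines: per-base Phred scores give read-versus-strain likelihoods, and EM alternates between assigning reads and updating abundances. This is a minimal, unoptimised sketch of the idea, not premise's Rust implementation; the strains, reads, and quality values are invented:

```python
import numpy as np

def read_likelihood(read, ref, quals):
    # P(read | ref) from per-base Phred scores: error prob e = 10**(-q/10);
    # a match contributes (1 - e), a mismatch e / 3
    p = 1.0
    for base, r, q in zip(read, ref, quals):
        e = 10 ** (-q / 10)
        p *= (1 - e) if base == r else e / 3
    return p

def em_abundances(L, n_iter=100):
    # EM over a reads-by-strains likelihood matrix L[r, s] = P(read r | strain s)
    n_reads, n_strains = L.shape
    pi = np.full(n_strains, 1.0 / n_strains)   # uniform starting abundances
    for _ in range(n_iter):
        R = L * pi                             # E-step: responsibilities
        R /= R.sum(axis=1, keepdims=True)
        pi = R.mean(axis=0)                    # M-step: mean responsibility
    return pi, R

# toy example: two candidate strains differing at one base
refs = ["ACGT", "ACGA"]
reads = [("ACGT", [30, 30, 30, 30]),
         ("ACGT", [30, 30, 30, 20]),
         ("ACGA", [30, 30, 30, 30])]
L = np.array([[read_likelihood(r, ref, q) for ref in refs] for r, q in reads])
pi, R = em_abundances(L)   # pi: estimated relative abundances
```

Two of the three reads support the first strain, so its estimated abundance dominates; low-quality bases (the q=20 call) simply contribute weaker evidence rather than being discarded.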

bioinformatics | 2026-03-18 | v1

Interpolating and Extrapolating Node Counts in Colored Compacted de Bruijn Graphs for Pangenome Diversity
Parmigiani, L.; Peterlongo, P.

Abstract:
A pangenome is a collection of taxonomically related genomes, often from the same species, serving as a representation of their genomic diversity. The study of pangenomes, or pangenomics, aims to quantify and compare this diversity, which has significant relevance in fields such as medicine and biology. Originally conceptualized as sets of genes, pangenomes are now commonly represented as pangenome graphs. These graphs consist of nodes representing genomic sequences and edges connecting consecutive sequences within a genome. Among possible pangenome graphs, a common option is the compacted de Bruijn graph. In our work, we focus on the colored compacted de Bruijn graph, where each node is associated with a set of colors that indicate the genomes traversing it. In response to the evolution of pangenome representation, we introduce a novel method for comparing pangenomes by their node counts, addressing two main challenges: the variability in node counts arising from graphs constructed with different numbers of genomes, and the large influence of rare genomic sequences. We propose an approach for interpolating and extrapolating node counts in colored compacted de Bruijn graphs, adjusting for the number of genomes. To tackle the influence of rare genomic sequences, we apply Hill numbers, a well-established diversity index previously utilized in ecology and metagenomics for similar purposes, to proportionally weight both rare and common nodes according to the frequency of genomes traversing them.
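Hill numbers have a closed form that is easy to state: for node weights normalised to proportions p_i, the Hill number of order q is (sum of p_i^q) raised to 1/(1-q), with the q=1 case taken as the exponential of Shannon entropy. A small sketch (the node weights below are invented; applying this to colored compacted de Bruijn graph nodes is the paper's contribution, not reproduced here):

```python
import numpy as np

def hill_number(counts, q):
    # Hill number of order q for a vector of node weights;
    # q = 0 counts nodes equally, larger q down-weights rare nodes
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    if q == 1:   # limit case: exponential of Shannon entropy
        return float(np.exp(-(p * np.log(p)).sum()))
    return float((p ** q).sum() ** (1 / (1 - q)))

# toy node weights = number of genomes traversing each node
w = [10, 10, 10, 1, 1]
d0 = hill_number(w, 0)   # = 5, the plain node count
d2 = hill_number(w, 2)   # < 5: the two rare nodes contribute less
```

Raising q moves the index from "count every node" toward "count only nodes traversed by many genomes", which is exactly the knob used to damp the influence of rare sequences.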

bioinformatics | 2026-03-18 | v1

Hierarchical genomic feature annotation with variable-length queries
Alanko, J. N.; Ranallo-Benavidez, T. R.; Barthel, F. P.; Puglisi, S. J.; Marchet, C.

Abstract:
K-mer-based methods are widely used for sequence classification in metagenomics, pangenomics, and RNA-seq analysis, but existing tools face important limitations: they typically require a fixed k-mer length chosen at index construction time, handle multi-matching k-mers (whose origin in the indexed data is ambiguous) in ad-hoc ways, and some resort to lossy approximations, complicating interpretation. We present HKS, a data structure for exact hierarchical variable-length k-mer annotation. Building on the Spectral Burrows-Wheeler Transform (SBWT), a single HKS index is constructed for a specified maximum query length s, and supports queries at any length k ≤ s. HKS associates each k-mer with exactly one label from a user-defined category hierarchy, where multi-matching k-mers are resolved to their most specific common node in the hierarchy. We formalize a feature assignment framework that partitions indexed k-mers into disjoint sets according to a user-defined category hierarchy. To recover specificity lost to multi-matching and novel k-mers, we introduce a hierarchy-aware smoothing algorithm that makes use of flanking sequence context. We validate the approach by assigning each query k-mer to a specific chromosome across human genome assemblies, including the T2T-CHM13v2.0 reference as a positive control and two diploid genomes of different ancestries (HG002, NA19185). Smoothing increases overall concordance from ~81% to ~97%, with residual errors attributable to known biological phenomena including acrocentric short-arm recombination and subtelomeric duplications. In performance benchmarks against Kraken2, HKS provides comparable query throughput while providing exact, lossless annotation across all k-mer lengths simultaneously from a single index. A prototype implementation is available at https://github.com/jnalanko/HKS.
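Resolving a multi-matching k-mer to "the most specific common node" of its labels is a lowest-common-ancestor computation on the category hierarchy. A minimal sketch, with a hypothetical two-level hierarchy (the labels and parent map below are invented, not HKS's index format):

```python
def resolve(labels, parent):
    # resolve a set of labels to their most specific common node in a
    # hierarchy given as a child -> parent map (the root maps to None)
    def path(n):
        chain = []
        while n is not None:
            chain.append(n)
            n = parent[n]
        return chain[::-1]   # root .. node

    paths = [path(label) for label in labels]
    lca = None
    for level in zip(*paths):        # walk down while all paths agree
        if len(set(level)) == 1:
            lca = level[0]
        else:
            break
    return lca

# hypothetical hierarchy: root -> human -> {chr1, chr2}; root -> mouse
parent = {"root": None, "human": "root", "mouse": "root",
          "chr1": "human", "chr2": "human"}
resolve({"chr1"}, parent)            # -> "chr1": unique match stays specific
resolve({"chr1", "chr2"}, parent)    # -> "human": ambiguity moves up a level
resolve({"chr1", "mouse"}, parent)   # -> "root"
```

The smoothing step described in the abstract then tries to push these coarse "human"- or "root"-level assignments back down to a specific leaf using the labels of flanking k-mers.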

bioinformatics | 2026-03-18 | v1

HARVEST: Unlocking the Dark Bioactivity Data of Pharmaceutical Patents via Agentic AI
Shepard, V.; Musin, A.; Chebykina, K.; Zeninskaya, N. A.; Mistryukova, L.; Avchaciov, K.; Fedichev, P. O.

Abstract:
Pharmaceutical patents contain vast Structure-Activity Relationship tables documenting protein-ligand binding data that are technically public yet computationally inaccessible, rendering this wealth of data effectively dark - trapped in unstructured archives no existing database has systematically captured. We present HARVEST, a multi-agent large language model pipeline that autonomously extracts structured bioactivity records from USPTO patent archives at $0.11 per document. Applied to 164,877 patents, HARVEST produced 3.36 million activity records, recovering 365,713 unique scaffolds and 1,108 protein targets absent from BindingDB - completing in under a week a task requiring over 55 years of continuous expert labor. Automated extraction achieves 91% agreement with human curators while exhibiting lower unit-conversion error rates. We further introduce H-Bench, a structurally guaranteed held-out benchmark built from this recovered data. Evaluation of the leading open-source model Boltz-2 on H-Bench reveals a two-dimensional generalization gap: performance degrades both on novel chemical scaffolds and on uncharacterized protein targets, exposing fundamental limitations of models trained on existing public repositories.

bioinformatics | 2026-03-18 | v1

Sex Checking by Zygosity Distributions
Molina-Sedano, O.; Mas Montserrat, D.; Ioannidis, A. G.

Abstract:
Motivation: In genomic and clinical studies, verifying concordance between self-reported and genotype-inferred sex is a crucial quality control step, since mismatches arising from mislabeling or aneuploidies can bias downstream analyses and affect diagnostic accuracy. Existing approaches typically require substantial auxiliary data, and often require manual threshold tuning. There remains a need for a streamlined, reference-free method that generalizes across different data modalities, including whole-genome, single-sample and array, without requiring additional files or parameter tuning. Results: We present Zigo, a novel ML-based sex-checking method that operates solely on a standard VCF file, designed using X-chromosome genotype class distributions across sexes. Our model was trained on synthetic data incorporating standard demographic models and empirical recombination maps to ensure realistic genetic architecture and population structure. We simulate WGS, array, and single-sample files for broad applicability. Unlike traditional methods, we eliminate manual thresholding by distilling learned discriminative patterns into a single polynomial equation that determines genetic sex directly from normalized genotype counts. We validated Zigo on independent datasets, including 1000 Genomes, UK Biobank, and HGDP. Additional experiments assessed robustness under reduced variant availability through random SNP subsampling and allele-frequency filtering. Across all evaluations, the model achieved state-of-the-art accuracy, high time efficiency, and strong generalization, even with severely limited variant sets.
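The features Zigo operates on, normalised X-chromosome genotype-class counts, are simple to compute; the intuition is that XX samples show substantial heterozygosity on X while XY samples show almost none. Below is a toy sketch of that feature and a crude threshold rule; the threshold is an invented stand-in, not Zigo's learned polynomial, and the genotype lists are fabricated:

```python
def x_genotype_fractions(genotypes):
    # normalised genotype-class counts from biallelic X-chromosome calls;
    # genotypes are (allele1, allele2) pairs, e.g. (0, 1) for a het call
    hom_ref = sum(a == 0 and b == 0 for a, b in genotypes)
    het     = sum(a != b for a, b in genotypes)
    hom_alt = sum(a == 1 and b == 1 for a, b in genotypes)
    total = hom_ref + het + hom_alt
    return hom_ref / total, het / total, hom_alt / total

def infer_sex(genotypes, het_threshold=0.05):
    # toy threshold rule, NOT Zigo's fitted polynomial: XX samples show
    # substantial X heterozygosity, XY samples close to none
    return "XX" if x_genotype_fractions(genotypes)[1] > het_threshold else "XY"

xx_like = [(0, 0)] * 60 + [(0, 1)] * 30 + [(1, 1)] * 10
xy_like = [(0, 0)] * 85 + [(1, 1)] * 15   # hemizygous X called as homozygous
infer_sex(xx_like)   # -> "XX"
infer_sex(xy_like)   # -> "XY"
```

Replacing the hard threshold with a function learned from simulated WGS, array, and single-sample data is what removes the manual tuning the abstract criticises in existing tools.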

bioinformatics | 2026-03-18 | v1

usiGrabber: Automating the curation of proteomics spectra data at scale, making large datasets ready for use in machine learning systems
Auge, G.; Clausen, M.; Ketterer, K.; Schaefer, J.; Schmitt, N.; Altenburg, T.; Hartmaring, Y.; Raetz, H.; Schlaffner, C. N.; Renard, B. Y.

Abstract:
Motivation: An unprecedented amount of mass spectrometry-based proteomics data is publicly available through repositories such as the PRoteomics IDEntifications Database (PRIDE), and the field is increasingly leveraging machine learning approaches. However, the available data is not ready to be reused in a scalable way beyond the original acquisition purpose. Existing machine learning models commonly rely on a few manually curated datasets that require deep domain expertise and tedious technical work to construct. Importantly, these datasets have not been updated in recent years, so that newly published data remains inaccessible. We present usiGrabber, a scalable framework for assembling large proteomic datasets. usiGrabber is designed around portability and extensibility. It extracts spectra identification data from mzIdentML files, stores additional project-level metadata retrieved through the PRIDE API, indexes raw spectra using Universal Spectrum Identifiers (USIs), and offers download utilities to retrieve spectra data at scale. Results: Within 49 hours, we parsed over 800 million peptide spectrum matches and corresponding USIs from over 1,200 projects. As a proof of concept, we used usiGrabber to construct a phosphorylation-specific training dataset of nearly 11 million spectra in under two days and used it to retrain a binary phosphorylation classifier based on the AHLF model architecture. With a balanced accuracy of 0.78, our model achieves comparable performance to the original model on an independent test set, showing that automated data extraction is an alternative to manual curation of static datasets. Availability: All code is available at https://github.com/usiGrabber/usiGrabber; the data is available at https://zenodo.org/records/18853258.

bioinformatics | 2026-03-18 | v1

Ryder: Epigenome normalization using a two-tier model and internal reference regions
Cao, Y.; Ge, G.; Zhao, K.

Abstract:
Motivation: Sequencing-based epigenomic profiling methods are powerful but suffer from technical variability that complicates cross-sample comparisons and can obscure true biological signals. While existing normalization methods using spike-in controls or computational approaches have been proposed, they often rely on assumptions that may not hold across diverse experimental conditions or require additional data types. Results: We present Ryder, a flexible and robust Python package for the normalization and differential analysis of epigenomic data. Ryder introduces a normalization strategy that leverages stable internal reference regions, such as invariant CTCF binding sites, to correct for technical artifacts genome-wide. Our results show that it effectively models and adjusts both background noise and signal intensity, ensuring accurate signal alignment across samples. We demonstrate that Ryder performs robust, genome-wide normalization, correcting signals in both peak and background regions, across a range of assays including DNase-seq, CUT&RUN, ATAC-seq, MNase-seq, and ChIP-seq, with or without spike-in controls. By reducing technical noise, we show that Ryder improves the detection of genuine biological changes, such as quantitative reduction of chromatin accessibility at key enhancer elements by depletion of BRG1, a key subunit of the chromatin remodeling BAF complexes. Availability and Implementation: The Ryder source code and documentation are freely available at https://github.com/YaqiangCao/ryder.
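The simplest form of reference-region normalization can be sketched directly: signal over regions assumed invariant across samples yields per-sample scale factors. This is a one-tier toy version (Ryder's two-tier model additionally adjusts background noise); the signal matrix below is invented:

```python
import numpy as np

def reference_scale_factors(ref_signal):
    # per-sample scale factors from signal over stable internal reference
    # regions (rows = reference regions, columns = samples): each sample is
    # scaled so its median reference signal matches the across-sample median
    per_sample = np.median(ref_signal, axis=0)
    target = np.median(per_sample)
    return target / per_sample

# toy: sample 2 was sequenced ~2x deeper; the reference regions reveal that
ref = np.array([[10.0, 20.0],
                [12.0, 24.0],
                [ 8.0, 16.0]])
f = reference_scale_factors(ref)   # ≈ [1.5, 0.75]
normalized = ref * f               # reference signals now comparable
```

After scaling, both samples have the same median signal over the reference regions, so genuine differences elsewhere in the genome are no longer confounded by depth.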
bioinformatics2026-03-18v1Millisecond Prediction of Protein Contact Maps from Amino AcidSequences
Lin, R.; Ahnert, S. E.

Abstract:
Protein structure prediction typically outputs static coordinates, often obscuring the underlying physical principles and conformational flexibility. In this work, we present a coarse-grained generative framework to recover the Circuit Topology (CT) of proteins using Generative Flow Matching. We represent protein architecture using highly compressed Secondary Structure Elements (SSEs), reducing the sequence length to roughly 1/13 of the original amino acid sequence. We show that this minimal representation captures the essential "topological fingerprint" required to determine the global fold. By employing a joint-prediction head, our model simultaneously generates contact probabilities and asymmetric topological features, achieving a mean F1 score of 0.822 at the SSE level. Notably, our results demonstrate a counter-intuitive robustness in capturing long-range interactions, suggesting that global topology acts as a stable constraint compared to local residue packing. Furthermore, we show that these coarse-grained predictions can be mapped back to residue-level contact maps with sub-helical precision, yielding a mean alignment error of 2.69 residues. The probabilistic nature of the flow model effectively separates the stable structural signal of the folding core from flexible regions, providing a physically interpretable view of the protein's conformational ensemble. This pipeline is extremely fast, capable of completing a contact map prediction from amino acid sequence in an average of 110 milliseconds on a single GPU. These ultra-fast and accurate predictions provide a valuable tool for identifying conserved protein folding cores, facilitating the exploration of the protein structural genotype-phenotype (GP) map through large-scale sampling of mutants with highly similar folding cores.

bioinformatics | 2026-03-18 | v1

A Permutation-Based Framework for Evaluating Bias in Microbiome Differential Abundance Analysis
Zeng, K.; Fodor, A. A.

Abstract:
Background: In microbiome research, differential abundance analysis aids in identifying significant differences in microbial taxa across two or more conditions. Statistical approaches used for this purpose include classical tests such as the t-test and Wilcoxon test, as well as methods designed to account for the compositional nature of microbiome data, including ALDEx2, ANCOM-BC2, and metagenomeSeq. In addition, methods originally developed for RNA sequencing data, such as DESeq2 and edgeR, have been frequently applied to microbiome studies. However, the use of these methods has been controversial. One area of concern is whether different modeling frameworks produce accurate p-values when the null hypothesis is true. Results: We evaluated eight methods across six publicly available datasets. Four permutation strategies were applied to generate data under the null hypothesis: shuffling sample names, shuffling counts within samples, shuffling counts within taxa, and fully randomizing the counts table. Methods based on the negative binomial distribution (DESeq2 and edgeR) produced p-values that were consistently smaller than expected under the null hypothesis. In contrast, methods that attempt to correct for compositionality (ALDEx2, ANCOM-BC2, and metagenomeSeq) tended to produce larger-than-expected p-values, even when only sample labels were shuffled, a permutation strategy that does not alter compositional structure. These deviations were dependent on dataset characteristics and permutation strategy, suggesting complex interactions between underlying data structure and algorithm performance. Generating data to follow the expected negative binomial distribution did not eliminate the tendency of DESeq2 and edgeR to exaggerate statistical significance. Although similar patterns were observed in RNA sequencing (RNAseq) datasets, the deviations were less pronounced than in microbiome data. 
In contrast, the classical t-test and Wilcoxon test yielded p-value distributions consistent with theoretical expectations across datasets and permutation strategies. Conclusions: These results indicate that the performance of several widely used differential abundance methods can be problematic under null conditions and may affect biological interpretation. Our findings emphasize the importance of careful method selection and highlight the robustness of simpler statistical approaches for reliable inference.
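The four null-generating permutation strategies translate directly into operations on a taxa-by-samples counts matrix. A schematic reimplementation (not the authors' code; the matrix and labels are random toy data):

```python
import numpy as np

rng = np.random.default_rng(0)

def shuffle_sample_labels(labels):
    # strategy 1: permute group labels across samples; the counts table,
    # and thus its compositional structure, is untouched
    return rng.permutation(labels)

def shuffle_within_samples(counts):
    # strategy 2: permute counts within each sample (column)
    return np.apply_along_axis(rng.permutation, 0, counts)

def shuffle_within_taxa(counts):
    # strategy 3: permute counts within each taxon (row)
    return np.apply_along_axis(rng.permutation, 1, counts)

def randomize_table(counts):
    # strategy 4: fully randomise the counts table
    return rng.permutation(counts.ravel()).reshape(counts.shape)

counts = rng.integers(0, 50, size=(5, 4))   # 5 taxa x 4 samples
```

Note the invariants each strategy preserves: label shuffling keeps every count in place, within-sample shuffling keeps library sizes (column sums), within-taxon shuffling keeps taxon totals (row sums), and full randomisation keeps only the grand total. These invariants are what make the strategies probe different modelling assumptions.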

bioinformatics | 2026-03-18 | v1

Deciphering context-dependent epigenetic program by network-based prediction of clustered open regulatory elements from single-cell chromatin accessibility
Park, S.; Ma, S.; Lee, W.; Park, S. H.

Abstract:
Large-scale cis-regulatory domains, such as super-enhancers, are pivotal in orchestrating robust and cell-state-specific transcriptional programs that define cellular identity. However, current single-cell methods do not effectively identify these higher-order structures, obscuring the coordinated, domain-level regulation essential for complex biological processes. Identifying such domain-scale representation at the single-cell level is critical for understanding the regulatory logic underlying development and disease. Here, we introduce enCORE, a computational framework that leverages enhancer-enhancer interaction networks to determine Clustered Open Regulatory Elements (COREs) from single-cell ATAC-sequencing. Our approach faithfully recapitulated established hematopoietic hierarchies and resolved lineage-specific regulatory programs by recovering canonical lineage-defining regulators, frequent chromatin interactions, and enrichment of fine-mapped autoimmune disease-associated genome-wide association study (GWAS) variants. In colorectal cancer, enCORE successfully captured tumor-associated H3K27ac programs and prioritized cancer-relevant regulators, pointing to USP7 as a potential therapeutic candidate supported by in silico perturbation. Our framework provides a fruitful approach for deciphering context-dependent epigenetic reprogramming.

bioinformatics | 2026-03-18 | v1

GOTFlow: Learning Directed Population Transitions from Cross-Sectional Biomedical Data with Optimal Transport
Wright, G.; Alzaid, E.; Muter, J.; Brosens, J.; Minhas, F.

Abstract:
Motivation: Many biological and clinical processes are dynamic, yet most datasets are cross-sectional, capturing populations at discrete states rather than tracking individuals over time. This makes it difficult to quantify how populations change across developmental, physiological, or disease-associated conditions. Existing trajectory and transport-based methods often rely on fixed feature spaces, assumptions tailored to transcriptomic time-course data, or approximately linear progression, limiting their ability to model heterogeneous and unbalanced transitions across diverse biomedical modalities. Flexible methods are needed that can infer directed population-level change from cross-sectional data while retaining biological interpretability. Results: We present GOTFlow, a framework for learning directed population transitions from cross-sectional biomedical data using graph-constrained optimal transport in a learned latent space. GOTFlow integrates representation learning with unbalanced optimal transport to jointly estimate embeddings and transport couplings between biological states. This enables hypothesis-driven modelling of progression structures while accommodating non-linear geometry, branching relationships, and changes in population mass. From the inferred transport plans, GOTFlow derives interpretable summaries of dynamics, including drift vectors quantifying transitions, and feature-level transported changes that highlight molecular drivers of progression. In synthetic data, GOTFlow recovered known transitions with strong agreement between inferred and ground-truth drifts. Across three biological applications, endometrial remodelling, breast cancer risk progression, and prion disease, GOTFlow identified state-to-state transitions and biologically meaningful feature shifts reflecting impaired decidualisation, increasing cancer risk, and neurodegenerative progression. 
These results establish GOTFlow as a general and interpretable framework for analysing directed population dynamics from cross-sectional data. Availability: https://github.com/wgrgwrght/GOTFlow. Supplementary information: available online.
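The transport-coupling-plus-drift idea can be sketched with plain entropic (balanced) Sinkhorn iterations; GOTFlow itself uses an unbalanced variant in a learned latent space, so this is only the skeleton, and the point clouds and masses below are invented:

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.5, n_iter=200):
    # entropic optimal transport between histograms a and b under cost C
    # (balanced Sinkhorn; the unbalanced variant relaxes the marginals)
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]   # transport coupling P

# toy 1-D "latent space": source states x, target states y
x = np.array([0.0, 1.0])
y = np.array([0.1, 0.9, 2.0])
C = (x[:, None] - y[None, :]) ** 2       # squared-distance cost
P = sinkhorn(np.array([0.5, 0.5]), np.array([0.4, 0.4, 0.2]), C)

# drift vectors: barycentric projection of each source state minus itself
drift = (P @ y) / P.sum(axis=1) - x
```

The coupling P says how much mass flows from each source state to each target state; reading off row-wise barycentres gives exactly the kind of per-state drift summary the abstract describes.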

bioinformatics | 2026-03-18 | v1

New Space-Time Tradeoffs for Subset Rank and k-mer Lookup
Diseth, A. C.; Puglisi, S. J.

Abstract:
Given a sequence S of subsets of symbols drawn from a fixed alphabet, a subset rank query srank(i, c) asks for the number of subsets before the ith subset that contain the symbol c. It was recently shown (Alanko et al., Proc. SIAM ACDA, 2023) that subset rank queries on the spectral Burrows-Wheeler transform (SBWT) lead to efficient k-mer lookup queries, an essential and widespread task in genomic sequence analysis. In this paper we design faster subset rank data structures that use small space (less than 3 bits per k-mer). Our experiments show that this translates to new Pareto-optimal SBWT-based k-mer lookup structures at the low-memory end of the space-time spectrum.
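The query itself is easy to state naively; the whole point of the paper is to answer it far faster and in succinct space. A direct O(i) reference implementation of the definition (the subset sequence below is an invented toy, not an actual SBWT column):

```python
def srank(S, i, c):
    # number of subsets among S[0..i-1] that contain symbol c
    # (naive linear scan; succinct structures answer this in ~constant time)
    return sum(1 for subset in S[:i] if c in subset)

# toy sequence of subsets over the DNA alphabet
S = [{"A"}, {"A", "C"}, set(), {"C", "G"}, {"A"}]
srank(S, 4, "A")   # -> 2: subsets 0 and 1 contain A
srank(S, 5, "C")   # -> 2: subsets 1 and 3 contain C
```

A compressed structure must return exactly these answers while storing S in a few bits per element, which is the space-time tradeoff the title refers to.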

bioinformatics | 2026-03-18 | v1

Homology-based perspective on pangenome graphs
Lisiecka, A.; Kowalewska, A.; Dojer, N.

Abstract:
Pangenome graphs conveniently represent genetic variation within a population. Several types of such graphs have been proposed, with varying properties and potential applications. Among them, variation graphs (VGs) seem best suited to replace reference genomes in sequencing data processing, while whole genome alignments (WGAs) are particularly practical for comparative genomics applications. For both models, no widely accepted optimization criteria for a graph representing a given set of genomes have been proposed. In the current paper we introduce the concept of a homology relation induced by a pangenome graph on the characters of the represented genomic sequences and define such relations for both the VG and WGA models. Then, we use this concept to propose homology-based metrics for comparing different graphs representing the same genome collection, and to formulate the desired properties of transformations between the VG and WGA models. Moreover, we propose several such transformations and examine their properties on pangenome graph data. Finally, we provide implementations of these transformations in the WGAtools package, available at https://github.com/anialisiecka/WGAtools.

bioinformatics | 2026-03-18 | v1

scTimeBench: A streamlined benchmarking platform for single-cell time-series analysis
Osakwe, A.; Huang, E. H.; Li, Y.

Abstract:
Temporal modelling of single-cell gene expression is essential for capturing dynamic cellular processes, yet a systematic framework for evaluating time-aware trajectory inference methods has not yet been established. Here, we present a modular and scalable benchmark designed to assess methods across three critical tasks: forecast accuracy (temporal cell alignment) for projecting cells to unseen time points, embedding coherence between original and projected data, and cell-type lineage fidelity. We evaluated nine state-of-the-art methods, broadly categorized into seven forecasting-based and two optimal transport (OT)-based methods, across eight diverse datasets spanning four species. Our results show that while several methods achieve high forecast accuracy, they often fail to preserve biological signals, both in their latent spaces and in cell lineage reconstruction. Notably, most methods confer low lineage fidelity and often underperform compared to a correlation baseline. We further demonstrate that integrating pseudotime can effectively denoise trajectories by aligning the data snapshots with the intrinsic biological clock in each cell. Finally, to streamline benchmarking for temporal single-cell analysis, we built one of the first self-contained Python packages for the research community: https://github.com/li-lab-mcgill/scTimeBench.

bioinformatics | 2026-03-18 | v1

drFrankenstein: An Automated Pipeline for the Parameterisation of Non-Canonical Amino Acids
Shrimpton-Phoenix, E.; Notari, E.; Wood, C. W.

Abstract:
The incorporation of non-canonical amino acids (ncAAs) is a powerful strategy for introducing novel chemical functions into proteins. Molecular dynamics (MD) simulations are essential for understanding the structural and dynamic effects of these modifications, yet the creation of accurate force field parameters for ncAAs remains a significant bottleneck. Current parameterisation methods are often inaccurate or computationally expensive. To address this, we present drFrankenstein, an automated pipeline for generating AMBER force field parameters for ncAAs. drFrankenstein is a robust and accessible tool that streamlines the parameterisation workflow, enabling the routine use of MD simulations to study the behaviour of ncAA-containing proteins.

bioinformatics | 2026-03-18 | v1

SpeciefAI: Multi-species mRNA-level Antibody Framework Generation using Transformers
Grabarczyk, D.; Kocikowski, M.; Parys, M.; Cohen, S. B.; Alfaro, J. A.

Abstract:
Motivation: Encoding antibodies (Abs) and nanobodies (Nbs) as mRNA enables in vivo production of therapeutic proteins. However, this approach requires meeting two species-dependent requirements: the mRNA encoding must support efficient expression in the host species, and the encoded protein sequence must resemble the natural Ab repertoire of the recipient species to minimize immunogenicity. These requirements motivate species-conditioned generative models for joint mRNA and protein design. Results: We propose SpeciefAI, a transformer-based model for multi-species Ab and Nb sequence harmonisation that generates novel Framework Regions (FRs) tailored to input Complementarity-Determining Regions (CDRs). Our model works directly in mRNA space and learns the correspondence between FRs and CDRs in six species. The model generates sequences with a distribution highly similar to natural sequences and a mean absolute difference in codon adaptation index (CAI) of 0.013 and 0.033 for humans and dogs, respectively. We show that the generated human sequences are highly human (0.95 T20 score) and canine sequences highly canine (0.95 cT20 score). We furthermore demonstrate that we can generate diverse candidate sequences using our method. Availability and Implementation: Source code is available at https://github.com/Dominko/SpeciefAI. OAS and COGNANO data are publicly available at https://opig.stats.ox.ac.uk/webapps/oas/ and https://cognanous.com/datasets/vhh-corpus (preprocessed versions available upon request). Canine data is available at https://zenodo.org/records/18301526
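The codon adaptation index (CAI) used as an evaluation metric here is just the geometric mean of per-codon relative adaptiveness values. A minimal sketch; the weight table below is illustrative only (real w values are derived from a host-specific reference set of highly expressed genes):

```python
import math

def cai(codons, w):
    # Codon Adaptation Index: geometric mean of the relative adaptiveness
    # w[codon] of each codon in the coding sequence
    return math.exp(sum(math.log(w[c]) for c in codons) / len(codons))

# illustrative weights only, not a real host table
w = {"GCU": 1.0, "GCC": 0.5, "AAA": 1.0, "AAG": 0.8}
cai(["GCU", "AAA", "AAG"], w)   # geometric mean of (1.0, 1.0, 0.8)
```

A mean absolute CAI difference of 0.013 against natural human sequences therefore means the model's codon choices track the host's preferred codons almost as closely as real repertoire sequences do.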

bioinformatics | 2026-03-18 | v1

Mutation-centric Network Construction using Long-Range Interactions
Huseynov, R.; Otlu, B.

Abstract:
Somatic mutations can alter normal cells and lead to cancer development. Yet distinguishing functional driver mutations from neutral passenger mutations remains a significant challenge. Traditional genomic tools often prioritize linear overlap searches, failing to capture the complex, three-dimensional regulatory environment of the genome. We present a graph-based framework, MutationNetwork, for constructing mutation-centric networks by integrating long-range intrachromosomal interactions with local genomic overlaps. Our method utilizes a unique positive and negative indexing scheme to represent interacting genomic intervals as nodes. By encoding both interactions and overlaps as edges, we enable constant-time retrieval of complex relationship data. By iteratively expanding the graph from a seed mutation, we can quantify a mutation's influence on the genomic landscape and assess its proximity to genes. We applied this framework to a dataset of 560 breast cancer whole-genome sequences, focusing on Triple-Negative Breast Cancer (TNBC) and Luminal A subtypes. Our results demonstrate that the generated mutation embeddings successfully cluster samples according to their biological subtypes, with the highest classification performance achieved at specific ranges. This approach provides a comprehensive view of mutation impact, offering a scalable solution for cancer patient stratification and the prioritization of potential non-coding driver mutations by assessing their network-level impact.
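The iterative expansion from a seed mutation is, at its core, a bounded breadth-first search over a node-adjacency map. A toy sketch (the adjacency map and the use of signed integers to index the two anchors of an interaction are invented illustrations of the scheme, not the paper's data structures):

```python
from collections import deque

def expand_from_seed(adj, seed, max_depth=2):
    # expand a mutation-centric subgraph from a seed node by breadth-first
    # search over interaction/overlap edges, up to max_depth hops;
    # returns {node: hop distance from the seed}
    seen = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if seen[node] == max_depth:
            continue   # do not expand beyond the depth limit
        for nb in adj.get(node, ()):
            if nb not in seen:
                seen[nb] = seen[node] + 1
                queue.append(nb)
    return seen

# hypothetical: positive/negative integers index the two anchors of an interaction
adj = {1: [-1, 2], -1: [1, 3], 2: [1], 3: [-1, 4], 4: [3]}
expand_from_seed(adj, 1)   # -> {1: 0, -1: 1, 2: 1, 3: 2}; node 4 is beyond reach
```

Counting how many genes and interactions fall inside this bounded neighbourhood is one simple way to quantify a mutation's network-level reach.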
bioinformatics2026-03-18v1PalmaClust: A graph-fusion framework leveraging the Palma ratio for robust ultra-rare cell type detection in scRNA-seq data
Niu, X.; Wang, J.; Wan, S.Abstract
Motivation: Single-cell RNA sequencing (scRNA-seq) is routinely used to build atlases of tissues, resolve developmental trajectories, and characterize disease microenvironments. Yet many biologically and clinically meaningful populations--including transient progenitors, therapy-resistant tumor subclones, and antigen-specific lymphocytes--occur at very low frequencies (<1%) and are easily missed by standard clustering pipelines. Existing approaches often require extensive manual curation, rely on known marker genes, or trade sensitivity for unacceptable false positive rates due to the insensitivity of metrics like the Gini index to heavy-tailed distributions. A scalable, statistically grounded method is needed to sensitively detect rare populations while providing calibrated confidence and interpretable molecular signatures. Results: We present PalmaClust, a graph-fusion clustering framework that repurposes the Palma ratio--a tail-sensitive inequality metric from sociology--to identify marker genes driven by extreme sparsity. PalmaClust constructs and fuses multiple K-Nearest Neighbor (KNN) graphs derived from complementary gene-selection statistics including the Palma ratio, Gini index, and Fano factor. It employs a local refinement strategy that re-prioritizes Palma-ranked genes within parent clusters. Benchmarking across diverse public scRNA-seq datasets confirms that PalmaClust consistently outperforms state-of-the-art baselines, improving rare-class F1 scores by at least 20% (absolute) while maintaining high global clustering stability. Further studies demonstrate that the Palma ratio-derived graph layer is essential for capturing ultra-rare signatures that other views miss. Availability: https://github.com/wan-mlab/PalmaClust.
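As a rough illustration of the tail sensitivity PalmaClust exploits: the Palma ratio compares the share held by the top 10% of observations with the share held by the bottom 40%. This minimal sketch is a generic reading of the metric applied to per-cell expression values, not PalmaClust's implementation:

```python
def palma_ratio(values):
    """Palma ratio: share held by the top 10% of observations divided
    by the share held by the bottom 40%."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if total == 0:
        return 0.0  # degenerate case: no expression at all
    bottom = sum(xs[: int(n * 0.4)])
    top = sum(xs[int(n * 0.9):])
    return top / bottom if bottom > 0 else float("inf")
```

A gene expressed almost exclusively in a rare subpopulation concentrates all of its mass in the top decile and none in the bottom 40%, so it receives an extreme score — exactly the sparsity-driven signal the abstract describes.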
bioinformatics2026-03-18v110-minimizers: a promising class of constant-space minimizers
Shur, A.; Tziony, I.; Orenstein, Y.Abstract
Minimizers are sampling schemes that are ubiquitous in almost any high-throughput sequencing analysis. Assuming a fixed alphabet of size σ, a minimizer is defined by two positive integers k, w and a linear order ρ on k-mers. A sequence is processed by a sliding-window algorithm that chooses in each window of length w + k - 1 its minimal k-mer with respect to ρ. A key characteristic of a minimizer is its density, which is the expected frequency of chosen k-mers among all k-mers in a random infinite σ-ary sequence. Minimizers of smaller density are preferred as they produce smaller samples, which lead to reduced runtime and memory usage in downstream applications. Recent studies developed methods to generate minimizers with optimal and near-optimal densities, but they require explicitly storing k-mer ranks in Ω(2^k) space. While constant-space minimizers exist, and some of them are proven to be asymptotically optimal, no constant-space minimizer has been proven to guarantee lower density than a random minimizer in the non-asymptotic regime, and many minimizer schemes suffer from long k-mer key-retrieval times due to complex computation. In this paper, we introduce 10-minimizers, a class of minimizers with promising properties. First, we prove that for every k > 1 and every w ≥ k - 2, a random 10-minimizer has, in expectation, lower density than a random minimizer. This is the first provable guarantee for a class of minimizers in the non-asymptotic regime. Second, we present spacers, particular 10-minimizers that combine three desirable properties: they are constant-space, low-density, and have small k-mer key-retrieval time. In terms of density, spacers are competitive with the best known constant-space minimizers; in certain (k, w) regimes they achieve the lowest density among all known (not necessarily constant-space) minimizers.
Notably, we are the first to benchmark constant-space minimizers on the time spent for k-mer key retrieval, the most fundamental operation in many minimizer-based methods. Our empirical results show that spacers can retrieve k-mer keys in competitive time (a few seconds per genome-size sequence, which is less than required by random minimizers) for all practical values of k and w. We expect 10-minimizers to improve minimizer-based methods, especially those using large window sizes. We also propose the k-mer key-retrieval benchmark as a standard objective for any new minimizer.
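For readers unfamiliar with the scheme, a generic minimizer — not the paper's 10-minimizers or spacers — can be sketched directly from the definition above; the `order` function is a stand-in for the linear order ρ:

```python
def minimizer_positions(seq, k, w, order=hash):
    """For each window of w consecutive k-mers, select the position of
    the minimal k-mer under `order`; return the distinct positions."""
    positions = set()
    n = len(seq)
    for start in range(n - (w + k - 1) + 1):
        window = [(order(seq[i:i + k]), i)
                  for i in range(start, start + w)]
        positions.add(min(window)[1])  # ties broken by leftmost position
    return sorted(positions)
```

The density of a scheme can then be estimated as the number of selected positions divided by the number of k-mers; for a random order it approaches roughly 2/(w+1) for sufficiently large k, the baseline the paper's guarantee is measured against.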
bioinformatics2026-03-18v1Outperforming the Majority-Rule Consensus Tree Using Fine-Grained Dissimilarity Measures
Takazawa, Y.; Takeda, A.; Hayamizu, M.; Gascuel, O.Abstract
Phylogenetic analyses often require the summarization of multiple trees, e.g., in Bayesian analyses to obtain the centroid of the posterior distribution of trees, or to determine the consensus of a set of bootstrap trees. The majority-rule consensus tree is the most commonly used. It is easy to compute and minimizes the sum of Robinson-Foulds (RF) distances to the input trees. In mathematical terms, the majority-rule consensus tree is the median of the input trees with respect to the RF distance. However, due to the coarse nature of RF distance, which only considers whether two branches induce exactly the same bipartition of the taxa or not, highly unresolved trees can be produced when the phylogenetic signal is low. To overcome this limitation, we propose using median trees with respect to finer-grained dissimilarity measures between trees. These measures include a quartet distance between tree topologies, and transfer distances, which quantify the similarity between bipartitions, in contrast to the 0/1 view of RF. We describe fast heuristic consensus algorithms for transfer-based tree dissimilarities, capable of efficiently processing trees with thousands of taxa. Through evaluations on simulated datasets in both Bayesian and bootstrapping maximum-likelihood frameworks, our results show that our methods improve consensus tree resolution in scenarios with low to moderate phylogenetic signal, while providing better or comparable dissimilarities to the true phylogeny. Applying our methods to Mammal phylogeny and a large HIV dataset of over nine thousand taxa confirms the improvement with real data. These results demonstrate the usefulness of our new consensus tree methods for analyzing the large datasets that are available today. Our software, PhyloCRISP, is available from https://github.com/yukiregista/PhyloCRISP.
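The majority-rule construction itself is simple once trees are viewed as sets of bipartitions. The sketch below — with a hypothetical input encoding, not PhyloCRISP's — shows the >50% rule whose output is the median of the inputs under the RF distance:

```python
from collections import Counter

def majority_rule(trees):
    """Each tree is given as a set of bipartitions (frozensets of the
    taxa on one side of a branch); keep those bipartitions present in
    more than half of the input trees."""
    counts = Counter(b for t in trees for b in t)
    cutoff = len(trees) / 2
    return {b for b, c in counts.items() if c > cutoff}
```

Because a bipartition enters the consensus only when it appears in a majority of inputs, each retained branch reduces the summed 0/1 RF disagreement — which is precisely the coarseness the transfer- and quartet-based medians proposed here are designed to overcome.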
bioinformatics2026-03-18v1SpatialFusion: A lightweight multimodal foundation model for pathway-informed spatial niche mapping
Yates, J.; Shavakhi, M.; Choueiri, T. K.; Van Allen, E.; Uhler, C.Abstract
Foundation models enable knowledge transfer across data modalities and tasks, yet foundation models for spatial biology remain in their early stages, largely centered on encoding single-cell representations in spatial context without fully integrating transcriptomic and morphological information to delineate functional niches. Here we introduce SpatialFusion, a lightweight multimodal foundation model that identifies biologically coherent microenvironments defined by distinct pathway activation patterns rather than spatial proximity alone. SpatialFusion integrates paired histopathology, gene expression, and inferred pathway activity into a unified representation. Compared with two specialist niche-detection methods and four spatial foundation models, SpatialFusion performs competitively and consistently resolves fine-grained spatial niches with unique pathway-level signatures. Applying the model to two Visium HD cohorts uncovered a pre-malignant niche in morphologically normal mucosa adjacent to colorectal tumors and revealed distinct malignant microenvironments in non-small cell lung cancer that were predictive of tumor stage. Overall, SpatialFusion offers a versatile framework for multimodal spatial analysis, enabling the discovery of new morpho-molecular niches with significant biological and clinical relevance.
bioinformatics2026-03-18v1scRGCL: Neighbor-Aware Graph Contrastive Learning for Robust Single-Cell Clustering
Fan, J.; Liu, F.; Lai, X.Abstract
Accurate cell type identification is a fundamental step in single-cell RNA sequencing (scRNA-seq) data analysis, providing critical insights into cellular heterogeneity at high resolution. However, the high dimensionality, zero inflation, and long-tailed distribution of scRNA-seq data pose significant computational challenges for conventional clustering approaches. Although recent deep learning-based methods utilize contrastive learning to jointly learn representations and clustering assignments, they often overlook cluster-level information, leading to suboptimal feature extraction for downstream tasks. To address these limitations, we propose scRGCL, a single-cell clustering method that learns a regularized representation guided by contrastive learning. Specifically, scRGCL captures the cell-type-associated expression structure by clustering similar cells together while ensuring consistency. For each sample, the model performs negative sampling by selecting cells from distinct clusters, thereby ensuring semantic dissimilarity between the target cell and its negative pairs. Moreover, scRGCL introduces a neighbor-aware re-weighting strategy that increases the contribution of samples from clusters closely related to the target. This mechanism prevents cells from the same category from being mistakenly pushed apart, effectively preserving intra-cluster compactness. Extensive experiments on fourteen public datasets demonstrate that scRGCL consistently outperforms state-of-the-art methods, as evidenced by significant improvements in normalized mutual information and adjusted Rand index. Moreover, ablation studies confirm that the integration of cluster-aware negative sampling and the neighbor-aware re-weighting module is essential for achieving high-fidelity clustering.
By harmonizing cell-level contrast with cluster-level guidance, scRGCL provides a robust and scalable framework that advances the precision of automated cell-type discovery in increasingly complex single-cell landscapes.
bioinformatics2026-03-18v1InSTaPath: Integrating Spatial Transcriptomics and histoPathology Images via Multimodal Topic Learning
Xiao, W.; Chen, H.; Osakwe, A.; Zhang, Q.; Li, Y.Abstract
Spatial transcriptomic (ST) technologies enable the measurement of gene expression directly within tissue sections while preserving spatial context. Many ST platforms additionally generate paired histological images alongside spatially resolved transcriptomic profiles. However, most existing computational approaches only incorporate histology images as auxiliary features in representation learning models and typically produce latent embeddings that are difficult to interpret. We present InSTaPath (Integrating Spatial Transcriptomics and histoPathology images), a multimodal topic modeling framework that links transcriptional programs with tissue morphology. InSTaPath converts token-level embeddings extracted from pretrained histology foundation models into discrete image words through vector quantization, enabling histological morphology to be represented in a count-based form analogous to gene expression. InSTaPath then jointly analyzes image-word and gene expression counts to infer shared latent topics that are interpretable through both topic-gene and topic-image-word associations. Across multiple ST datasets, InSTaPath improves spatial domain identification and uncovers biologically meaningful relationships between gene programs and tissue morphology through pathway enrichment and in silico perturbation analyses.
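The vector-quantization step — mapping continuous patch embeddings to discrete "image words" — amounts to nearest-codebook assignment. The toy codebook below is illustrative, not InSTaPath's learned one:

```python
def quantize(embeddings, codebook):
    """Assign each embedding to its nearest codebook vector under
    squared Euclidean distance, yielding discrete 'image word' ids."""
    def sq(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return [min(range(len(codebook)), key=lambda c: sq(e, codebook[c]))
            for e in embeddings]
```

Counting the resulting word ids per spot produces a count vector directly analogous to a gene expression profile, which is what makes joint topic modeling over the two modalities possible.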
bioinformatics2026-03-18v1Developing a Standard Definition for Sequences of Concern
Alexanian, T.; Beal, J.; Bartling, C.; Berlips, J.; Carr, P. A.; Clore, A.; Cozzarini, H.; Diggans, J.; El Moubayed, Y.; Esvelt, K.; Flyangolts, K.; Foner, L.; Fullerton, P. A.; Gemler, B. T.; Jagla, C. A.; Lababidi, R.; Mitchell, T.; Murphy, S. T.; Parker, M. T.; Roehner, N.; Rusch, A.; Talley, K.; Timmerman, T.; Wheeler, N. E.Abstract
Readily available nucleic acid synthesis is both critical for the bioeconomy and an increasingly pressing security concern due to the potential for accidental or deliberate misuse. While biosecurity experts broadly agree that nucleic acid providers should screen orders for potential "sequences of concern," there has previously been no agreed standard for how to define and recognize such sequences. To address this gap, we first organized a test set of 1.1 million sequences from pathogens and toxins on the Australia Group Common Control Lists and their non-controlled relatives, along with model organisms and synthetic constructs. An initial categorization of sequences as to whether or not they were sequences of concern was produced by comparing the results of four biosecurity screening systems for each of these sequences, finding that these systems already agreed on the categorization of more than 80% of sequences. We then refined these results through a science-based stakeholder review process to define a rubric for determining whether a sequence should be flagged as a potential sequence of concern, then applied this rubric to improve the categorization of the test sets. The result is a rubric that identifies sequences of concern with respect to human pandemic-potential viruses, key classes of low-risk genes, and controlled toxins. Applying this rubric to the test set collection has reduced the number of test sequences with disputed categorization by 44.3% for controlled viruses and 10.7% across the test set as a whole. Together, these results provide a concrete "sequence of concern" definition that can be used as a foundation for development of biosecurity screening standards and policy.
bioinformatics2026-03-18v1DeSCENT: Deconvolutional Single-Cell RNA-seq Enhances Transcriptome-based Cancer Survival Analysis
Zhao, Y.; You, Z.; Shen, Y.; Chu, J.; Gong, X.; Li, T.; Wang, Z.; Xu, C.; Luo, Z.; He, Y.Abstract
Motivation: Accurate cancer survival prediction requires modeling tumor heterogeneity across both population and cell levels. Most cancer survival analyses use tumor transcriptomes only, since cohorts are usually measured with bulk RNA-seq but are rarely recorded with single-cell RNA-seq. This prevents the direct use of cell-level transcriptomes in cancer survival analysis. Results: To bridge this gap, we propose using bulk RNA-seq deconvolution algorithms to reconstruct each subject's scRNA-seq profile from their bulk data. Then, by combining both scRNA-seq and bulk RNA-seq together with their survival labels (paired to bulk), we perform multimodal transcriptome-based survival analysis. We built this framework as DeSCENT and evaluated it with common survival models on eight TCGA cancer cohorts. Results showed notable and consistent improvements in C-index over bulk-only models or models using cellular information alone. Availability: Our code is available at GitHub: https://github.com/YonghaoZhao722/DeSCENT.
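The C-index used for evaluation can be computed, in its basic form, as the fraction of comparable subject pairs ranked concordantly by the risk score. This sketch handles right-censoring only through the event indicator and is not DeSCENT's evaluation code:

```python
def c_index(times, events, scores):
    """Concordance index: over pairs where subject i had an observed
    event before subject j's time, the fraction in which the higher
    risk score belongs to the subject with the shorter survival."""
    conc, ties, total = 0, 0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # pair (i, j) is comparable if i's event precedes j's time
            if events[i] and times[i] < times[j]:
                total += 1
                if scores[i] > scores[j]:
                    conc += 1
                elif scores[i] == scores[j]:
                    ties += 1  # tied scores count as half-concordant
    return (conc + 0.5 * ties) / total
```

A C-index of 1.0 means the risk scores perfectly order the observed survival times; 0.5 corresponds to random ranking, so the consistent gains over bulk-only models reported above are gains above this scale.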
bioinformatics2026-03-18v1Accelerating k-mer-based sequence filtering
Martayan, I.; Vandamme, L.; Constantinides, B.; Cazaux, B.; Paperman, C.; Limasset, A.Abstract
The exponential growth of global sequencing data repositories presents both analytical challenges and opportunities. While k-mer-based indexing has improved scalability over traditional alignment for identifying relevant documents, pinpointing the exact sequences matching numerous queries remains a hurdle. In particular, searching for numerous k-mers with a single large query or multiple distinct queries strains existing exact matching tools, whose performance scales poorly with an increasing number of patterns. At the same time, indexing entire vast datasets for infrequent or ad-hoc searches is often resource-prohibitive. Designing fast methods for matching a large number of k-mers without exhaustive pre-indexing is therefore critical. We propose an efficient solution to the problem of k-mer-based sequence filtering: given a set of k-mers of interest and a threshold, quickly evaluate whether an arbitrary sequence has a number of k-mer matches above or below the threshold. Our approach demonstrates how minimizer-based sketching, alongside SIMD acceleration, can enhance the performance of streaming searches, and is implemented as a Rust tool named K2Rmini. On a consumer laptop, K2Rmini is able to filter long reads at 2 Gbp/s. Availability: https://github.com/Malfoy/K2Rmini
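Stripped of minimizer sketching and SIMD, the filtering decision reduces to a streaming membership count with early exit. This Python sketch is a naive reference for the problem statement, many orders of magnitude slower than the Rust tool:

```python
def passes_filter(seq, kmers_of_interest, k, threshold):
    """Return True if `seq` contains at least `threshold` k-mer
    matches against the set of k-mers of interest."""
    hits = 0
    for i in range(len(seq) - k + 1):
        if seq[i:i + k] in kmers_of_interest:
            hits += 1
            if hits >= threshold:  # early exit once threshold is met
                return True
    return False
```

The early exit matters for filtering: reads that clearly pass can be accepted without scanning to the end, which is one reason a threshold decision is cheaper than exhaustive match enumeration.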
bioinformatics2026-03-17v2ChromBERT: Uncovering Chromatin State Motifs in the Human Genome Using a BERT-based Approach
Lee, S.; Sakatsume, J.; Oba, G. M.; Nagaoka, Y.; Lin, C.; Chen, C.-Y.; Nakato, R.Abstract
Chromatin states, which are defined by specific combinations of histone post-translational modifications, are fundamental to gene regulation and cellular identity. Despite their importance, comprehensive patterns within chromatin state sequences, which could provide insights into key biological functions, remain largely unexplored. In this study, we introduce ChromBERT, a BERT-based model specifically designed to detect distinct chromatin state patterns as 'motifs.' We pre-trained ChromBERT on 15-state chromatin annotations from 127 human cell and tissue types from the ROADMAP consortium. This pre-trained model can be fine-tuned for various downstream tasks, and obtained high-attention chromatin state patterns are extracted as motifs. To account for the variable-length nature of chromatin state motifs, ChromBERT uses Dynamic Time Warping to cluster similar motifs and identify meaningful representative patterns. In this study, we evaluated the performance of the model on several tasks, including binary and quantitative gene expression prediction, cell type classification, and three-dimensional genome feature classification. Our analyses yielded biologically grounded results and revealed the associated chromatin state motifs. This workflow facilitates the discovery of specific chromatin state patterns across different biological contexts and offers a new framework for exploring the dynamics of epigenomic states.
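Dynamic Time Warping, used above to cluster variable-length motifs, can be sketched for categorical state sequences with a 0/1 mismatch cost. This is the generic textbook dynamic program, not ChromBERT's implementation:

```python
def dtw(a, b):
    """Dynamic time warping distance between two state sequences with
    a 0/1 mismatch cost, so stretches of repeated states can align."""
    INF = float("inf")
    n, m = len(a), len(b)
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            # extend by insertion, deletion, or (mis)match
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]
```

Because warping lets a run of identical states in one motif absorb a longer run in another, two motifs differing only in segment lengths get distance 0 — the property that makes DTW suitable for grouping variable-length chromatin state patterns.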
bioinformatics2026-03-17v2From Circles to Signals: Representation Learning on Ultra-Long Extrachromosomal Circular DNA
Li, J.; Liu, Z.; Zhang, Z.; Zhang, J.; Singh, R.Abstract
Extrachromosomal circular DNA (eccDNA) is a covalently closed circular DNA molecule that plays an important role in cancer biology. Genomic foundation models have recently emerged as a powerful direction for DNA sequence modeling, enabling the direct prediction of biologically relevant properties from DNA sequences. Although recent genomic foundation models have shown strong performance on general DNA sequence modeling, their application to eccDNA remains limited: existing approaches either rely on computationally expensive attention mechanisms or truncate ultra-long sequences into kilobase fragments, thereby disrupting long-range continuity and ignoring the molecule's circular topology. To overcome these problems, we introduce eccDNAMamba, a bidirectional state space model (SSM) built upon the Mamba-2 framework, which scales linearly with input sequence length and enables scalable modeling of ultra-long eccDNA sequences. eccDNAMamba further incorporates a circular augmentation strategy to preserve the intrinsic circular topology of eccDNA. Comprehensive evaluations against state-of-the-art genomic foundation models demonstrate that eccDNAMamba achieves superior performance on ultra-long sequences across multiple task settings, such as cancer versus healthy eccDNA discrimination and eccDNA copy-number level prediction. Moreover, the Integrated Gradient (IG) based model explanation indicates that eccDNAMamba focuses on biologically meaningful regulatory elements and can uncover key sequence patterns in cancer-derived eccDNAs. Overall, these results demonstrate that eccDNAMamba effectively models ultra-long eccDNA sequences by leveraging their unique circular topology and regulatory architecture, bridging a critical gap in sequence analysis. Our codes and datasets are available at https://github.com/zzq1zh/eccDNAMamba.
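One plausible reading of the circular augmentation strategy — an assumption on our part, since the abstract does not specify it — is to sample rotations of the sequence, because a circular molecule has no canonical linearization start:

```python
def circular_rotations(seq, n_rot):
    """Sample up to n_rot rotations of a circular sequence: rotating
    changes the linearization start point but not the molecule."""
    step = max(1, len(seq) // n_rot)
    return [seq[i:] + seq[:i] for i in range(0, len(seq), step)][:n_rot]
```

Training on such rotations would encourage representations invariant to the arbitrary cut point a linear file format imposes on a circular topology.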
bioinformatics2026-03-17v2The History of Enzyme Evolution Embedded in Metabolism
Corlett, T.; Smith, H. B.; Smith, E.; Goldford, J. E.; Longo, L. M.Abstract
Whereas phylogenetic reconstructions are a primary record of protein evolution, it is unknown whether the deep history of enzymes is encoded at higher levels of biological organization. Here, we demonstrate that the emergence and reuse history of enzymatic folds is embedded within the web of metabolite-cofactor-enzyme interdependencies that comprise biosphere-scale metabolic reaction networks. Using a simple network analysis approach, we reconstruct the relative ordering of enzymatic fold emergence and, where possible, the first reaction(s) that each enzymatic fold catalyzed. We find that a large majority of enzymatic folds were sufficient as independent additions to open new avenues for metabolic growth. The resulting network-based histories are broadly concordant with enzyme phyletic distribution in prokaryotes, a proxy for enzyme age. Our results suggest that the earliest enzyme-mediated metabolisms were enriched for α/β proteins, likely due to their strong association with cofactor utilization, and that α-proteins preferentially emerge at later stages. The cradle-loop barrel, a member of the small β-barrel metafold, is predicted to be the founding β-fold, in agreement with analyses of ribosome structure. An examination of how the protein universe responded to the biological production of molecular oxygen reveals that the adaptation of existing enzymatic folds, not novel fold emergence, was the primary driver of metabolic evolution. This work presents a self-consistent model of metabolic and enzyme evolution, key progress towards integrating diverse perspectives into a unified history of protein evolution.
bioinformatics2026-03-17v2BioOS: A Gene-Driven Digital Twin Runtime for Emergent Plant Development
AUGER, E.; Gandecki, M.; Delarche, C.; Heng, F. X.Abstract
Predicting plant phenotypes from genomic data requires models that bridge molecular regulation and organ-scale morphogenesis. We introduce BioOS, a computational runtime in which plant behavior - cell division, differentiation, and elongation - emerges from the execution of a gene regulatory network rather than from hardcoded rules. The system is built on the Formal Cell abstraction: a minimal signal-processing unit analogous to the McCulloch-Pitts formal neuron, whose transfer function is gene expression. Each Formal Cell evaluates promoters, transcribes mRNA, translates proteins, and derives its entire behavioral repertoire from the resulting protein concentrations - without a single hardcoded rule in the simulator code. A multi-scale architecture with level-of-detail switching enables real-time simulation of Arabidopsis thaliana primary root development. On the current official five-case primary-root auxin benchmark, BioOS achieves 75.4% mean score, 5/5 qualitative matches, 5/5 cases passing all current gates, and Spearman severity correlation {rho} = 0.70. The current root-auxin runtime is driven by a curated 35-gene registry with explicit promoter logic, kinetic parameters, and epigenetic state; for readability, this manuscript details a core 18-gene subnetwork that carries the main auxin benchmark logic. We describe the architecture, the gene expression runtime, the epigenetic memory model, the completed transition to post-hoc (non-causal) zone classification, and candidate benchmark extensions for persistent plasmodesmata and intracellular auxin compartmentalization within a broader six-suite, 63-case benchmark framework. Beyond the root-auxin slice, the current codebase also closes the official flowering (5/5), photosynthesis (7/7), and cytokinin (5/5) gates, while root-patterning remains a passing candidate panel.
bioinformatics2026-03-17v1Cross-Propagative Graph Learning Reveals Spatial Tissue Domains in Multi-Modal Spatial Transcriptomics
Guo, Y.; Liu, S.; Zhang, Z.; Zhang, S.; Li, L.Abstract
Spatial transcriptomics enables in situ characterization of tissue organization by jointly profiling gene expression and spatial coordinates, with histological images as complementary contextual information. However, effectively integrating these heterogeneous modalities remains challenging due to differences in statistical properties and structural patterns. We propose st-Xprop, a cross-propagative graph network with dual-graph embedding coupling for spatial domain identification. st-Xprop constructs modality-specific graphs for gene expression and histological features, and performs alternating cross-modal propagation to explicitly model inter-modal heterogeneity while enabling complementary information exchange. Through dual-graph embedding coupling, the framework progressively learns a unified low-dimensional representation that integrates multimodal signals and preserves spatial coherence. Evaluations on multiple real spatial transcriptomics datasets demonstrate that st-Xprop consistently improves clustering accuracy and robustness, particularly in weak-signal or structurally complex regions, yielding spatial domains that are more stable and biologically meaningful.
bioinformatics2026-03-17v1Integrated Artificial Intelligence and Quantum Chemistry Approach for the Rational Design of Novel Antibacterial Agents against Ralstonia solanacearum.
Gulumbe, D. A.; Tiwari, G.; Lohar, T.; Nikam, R.; Kumar, A.; Giri, S.Abstract
Antimicrobial resistance (AMR) in plant pathogenic bacteria poses a serious threat to global agriculture, necessitating the development of novel antibacterial agents targeting virulence mechanisms. This study presents an integrated bioinformatics-driven framework for the rational design and computational validation of Solres, a newly designed small molecule targeting key virulence proteins in phytopathogenic bacteria. Approximately 10,000 active compounds from PubChem BioAssay (AID: 588726) were analyzed using structural clustering and scaffold mining to identify conserved molecular motifs associated with antibacterial activity. Guided by high-frequency substructures, Solres was designed de novo and screened for structural novelty against PubChem, ChEMBL, and WIPO databases. Drug-likeness evaluation using Lipinski's Rule of Five confirmed favorable physicochemical properties. Molecular docking was performed against essential virulence regulators, including PhcA, PhcR, HrpB, PehA, and Egl from Ralstonia solanacearum and Xanthomonas spp., with active sites predicted using CaspFold. Docking analyses revealed strong binding affinities and stable interactions with key catalytic and regulatory residues. Complex stability and conformational integrity were further validated through molecular dynamics simulations. Quantum chemical descriptors, including the HOMO-LUMO energy gap and dipole moment, supported the electronic suitability and reactivity profile of Solres. Collectively, this study demonstrates the effective integration of cheminformatics, structural bioinformatics, molecular simulations, and quantum chemical analyses for plant-focused antibacterial discovery. The compound Solres represents a promising lead candidate for mitigating bacterial wilt disease and provides a computational framework for future experimental validation and sustainable crop protection strategies against AMR-driven phytopathogens.
bioinformatics2026-03-17v1VarDCL: A Multimodal PLM-Enhanced Framework for Missense Variant Effect Prediction via Self-distilled Contrastive Learning
Zhang, H.; Zheng, G.; Xu, Z.; Zhao, H.; Cai, S.; Huang, Y.; Zhou, Z.; Wei, Y.Abstract
Missense variants are a common type of genetic mutation that can alter the structure and function of proteins, thereby affecting the normal physiological processes of organisms. Accurately distinguishing damaging missense variants from benign ones is of great significance for clinical genetic diagnosis, treatment strategy development, and protein engineering. Here, we propose the VarDCL method, which ingeniously integrates multimodal protein language model embeddings and self-distilled contrastive learning to identify subtle sequence and structural differences before and after protein mutations, thereby accurately predicting pathogenic missense variants. First, leveraging sequence and structural information before and after mutations, VarDCL generates sequence-structural multimodal features via different language models. It incorporates both global and local perspectives of feature embeddings to provide the model with dynamic, multimodal, and multi-view input data. Additionally, a Self-distilled Contrastive Learning (SDCL) module was proposed to enable more effective information integration and feature learning, enhancing the model's ability to detect sequence and structural changes induced by mutations. Within this module, the multi-level contrastive learning framework excels at capturing information differences before and after mutations within the same modality; meanwhile, the feature self-distillation mechanism effectively utilizes high-level fused features to guide the learning of low-level differential features, facilitating information interaction across different modalities. The VarDCL framework not only ensures the model's capacity to learn dynamic changes pre- and post-mutation but also significantly improves cross-modal information interaction between sequence and structure, thereby remarkably boosting the model's performance in distinguishing pathogenic mutations from benign ones.
To validate the effectiveness of VarDCL, extensive experiments were conducted. The ablation study demonstrates that all key components of VarDCL contribute significantly. On an independent test set containing 18,731 clinical variants, VarDCL achieved an AUC of 0.917, an AUPR of 0.876, an MCC of 0.690, and an F1-score of 0.789, outperforming 21 state-of-the-art existing methods. Benchmark analysis shows that VarDCL can be utilized as an accurate and potent tool for predicting missense variant effects. The data and code for VarDCL are available at https://github.com/mjcoo/VarDCL for academic use.
bioinformatics2026-03-17v1OmicClaw: executable and reproducible natural-language multi-omics analysis over the unified OmicVerse ecosystem.
Zeng, Z.; Wang, X.; Luo, Z.; Zheng, Y.; Hu, L.; Xing, C.; Du, H.Abstract
Advances in bulk, single-cell and spatial omics have transformed biological discovery, yet analysis remains fragmented across packages with incompatible interfaces, heterogeneous dependencies and limited workflow reproducibility. Here, we present OmicClaw, an executable natural-language framework for multi-omics analysis built on the unified OmicVerse ecosystem and the J.A.R.V.I.S. runtime. OmicVerse organizes upstream processing, preprocessing, single-cell, spatial, bulk-transcriptomic and foundation-model workflows into a shared AnnData-centered interface spanning more than 100 methods. J.A.R.V.I.S. converts this ecosystem into a bounded analytical action space by exposing more than 200 registered functions and classes through a registry-grounded, state-aware and recoverable execution layer that validates prerequisites, preserves provenance and supports iterative repair. Rather than relying on unconstrained code generation, OmicClaw translates user requests into traceable workflows over live omics objects. Across a benchmark of 15 tasks spanning scRNA-seq, spatial transcriptomics, RNA velocity, scATAC-seq, CITE-seq and multiome analysis, ov.Agent improved rubric-based performance over bare one-shot large language model baselines, particularly for long-horizon multi-step workflows. OmicClaw further supports external agent access through an MCP-compatible server and a beginner-friendly web platform for interactive analysis, code execution and million-scale visualization. Together, OmicClaw provides a practical foundation for reproducible human-AI collaboration in modern multi-omics research. OmicClaw is ready to use at https://github.com/Starlitnightly/omicverse
bioinformatics2026-03-17v1CROCHET: a versatile pipeline for automated analysis and visual atlas creation from single-cell spatialomic data
Bozorgui, B.; Thibault, G.; Yuan, C.; Dereli, Z.; Wang, H.; Overman, M. J.; Weinstein, J. N.; Korkut, A.Abstract
Spatial biology technologies offer a unique opportunity to link tissue composition with function. However, analytical methods for quantifying and interpreting highly complex spatial data remain limited. We present CROCHET (ChaRacterization Of Cellular HEterogeneity in Tissues), an end-to-end analysis pipeline for construction of spatially resolved cell atlases from raw data covering millions of cells across large sample cohorts. Its modular architecture supports the integration of diverse data modalities and novel analytical methods for image processing and segmentation, spatialomics quantification and downstream analyses. With comprehensive, open-source, user-friendly, interactive, and visual analysis modules, CROCHET aims to democratize spatial omics for a broad community of users.
bioinformatics2026-03-17v1RIBEX: Predicting and Explaining RNA Binding Across Structured and Intrinsically Disordered Regions (IDR)-rich Proteins
Firmani, S.; Steinbauer, F.; Kasneci, G.; Horlacher, M.; Marsico, A.Abstract
Motivation: RNA-binding proteins (RBPs) regulate post-transcriptional processes, yet many remain undiscovered because RNA-binding activity often occurs outside canonical RNA-binding domains (RBDs), including within intrinsically disordered regions (IDRs) or through protein complexes. Computational methods can help identify novel RBPs, but approaches relying solely on sequence-derived features or ignoring the cellular interaction context are limited in capturing the complexity of RNA-binding behavior. To date, no framework rigorously integrates both sequence information and protein interaction context for RBP prediction. Results: We introduce RIBEX, a multimodal framework that combines protein language model (pLM) embeddings with protein interactome topology to improve RBP prediction and interpretation. Specifically, we integrate sequence representations with graph-derived positional encodings (PE) from the human STRING protein-protein interaction (PPI) network. PE are computed using Personalized PageRank, reduced with principal component analysis, and fused with pooled sequence embeddings through FiLM conditioning, while Low-Rank Adaptation (LoRA) enables parameter-efficient task adaptation. Across both an annotation-based benchmark and an experimental RNA Interactome Capture (RIC) dataset, PE consistently improves predictive performance, indicating that interactome topology provides complementary information beyond sequence features. LoRA adaptation of ESM2-650M further yields larger gains than simply scaling frozen backbone size. RIBEX outperforms state-of-the-art methods such as RBPTSTL and HydRA, particularly on challenging subsets including proteins lacking canonical RBDs and those enriched in IDRs.
For interpretability, we combine sequence-level computational alanine scanning with network-level positional-encoding ablation and inverse-PCA mapping, recovering known RNA-binding domains, IDR-associated contributions, and functional interactome communities linked to RBP predictions.
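A minimal sketch of the two ingredients named above, Personalized PageRank positional encodings over a PPI graph and FiLM conditioning of a pooled sequence embedding, using a toy adjacency matrix and hypothetical weight shapes (illustrative only, not the RIBEX implementation):

```python
import numpy as np

def personalized_pagerank(A, seed, alpha=0.85, iters=100):
    """Power iteration for Personalized PageRank on an adjacency matrix A,
    restarting at one seed protein; yields a topology-aware node encoding."""
    n = A.shape[0]
    P = A / np.maximum(A.sum(axis=1, keepdims=True), 1e-12)  # row-stochastic
    r = np.zeros(n)
    r[seed] = 1.0
    v = np.full(n, 1.0 / n)
    for _ in range(iters):
        v = alpha * (P.T @ v) + (1 - alpha) * r
    return v

def film(embedding, pe, W_gamma, W_beta):
    """FiLM conditioning: scale and shift the pooled sequence embedding
    using affine parameters predicted from the positional encoding."""
    gamma, beta = W_gamma @ pe, W_beta @ pe
    return (1 + gamma) * embedding + beta
```

In RIBEX the PageRank vectors are additionally reduced with PCA before fusion; the sketch skips that step for brevity.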
bioinformatics2026-03-17v13D-Manhattan: An interactive visualization tool for multiple GWAS results
Hashimoto, S.Abstract
Genome-wide association studies (GWAS) are widely used to identify genetic loci underlying various agronomic traits. Conventional Manhattan plots provide an effective two-dimensional (2D) summary of an individual GWAS result. However, recent advances in high-throughput phenotyping have led to study designs that generate multiple GWAS outputs across time points, traits, or experimental conditions. In such settings, biological insight increasingly depends on comparative interpretation of multiple association maps, yet panel-based arrangements of 2D plots fragment related information and impede recognition of shared or dynamic genetic signals. Here, I present 3D-Manhattan, an interactive visualization framework that integrates multiple GWAS results within a unified three-dimensional (3D) coordinate system. By extending the conventional Manhattan plot with an additional axis representing time, trait, or condition, 3D-Manhattan enables simultaneous, axis-aligned comparison of association landscapes while preserving genomic coordinates and statistical values. The tool is implemented as a stand-alone, browser-based application using WebGL-based rendering and supports smooth interaction without server-side computation. The framework provides flexible visualization controls, region highlighting, and variant-level correspondence across datasets, facilitating exploratory analysis of stable and context-dependent genetic associations. Collectively, 3D-Manhattan provides an alternative approach for visualizing multi-dimensional GWAS results and offers a powerful general-purpose platform for visualizing series of genome-wide datasets.
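The core data transformation behind such a 3D view is simple: each per-condition GWAS table contributes points at (genomic position, condition index, -log10 p). A toy sketch of that mapping (a hypothetical helper, not code from the tool):

```python
import math

def stack_gwas(results):
    """Flatten per-condition GWAS tables into 3D points:
    (genomic position, condition index along the extra axis, -log10 p).
    `results` maps condition name -> list of (position, p-value) rows."""
    points = []
    for z, (condition, rows) in enumerate(results.items()):
        for pos, pval in rows:
            points.append((pos, z, -math.log10(pval)))
    return points
```

Points sharing a genomic position across condition indices then line up along the third axis, which is what makes shared or dynamic signals visually traceable.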
bioinformatics2026-03-17v1Calcium transient detection and segmentation with the astronomically motivated algorithm for background estimation and transient segmentation (Astro-BEATS)
Fan, B.; Bilodeau, A.; Beaupre, F.; Wiesner, T.; Gagne, C.; Lavoie-Cardinal, F.; Hlozek, R.Abstract
Fluorescence-based calcium imaging is a powerful tool for studying localized neuronal activity, including miniature Synaptic Calcium Transients, providing real-time insights into synaptic activity. These transients induce only subtle changes in the fluorescence signal, often barely above baseline, which poses a significant challenge for automated synaptic transient detection and segmentation. Detecting astronomical transients similarly requires efficient algorithms that remain robust over a large field of view with varying noise properties. We leverage techniques used in astronomical transient detection for miniature Synaptic Calcium Transient detection in fluorescence microscopy. We present Astro-BEATS, an automatic miniature Synaptic Calcium Transient segmentation algorithm that incorporates background-estimation and source-finding techniques from astronomy, adapted for calcium-imaging videos. Astro-BEATS outperforms current threshold-based approaches for synaptic calcium transient detection and segmentation. The produced segmentation masks can be used to train a supervised deep learning algorithm for improved synaptic calcium transient detection in calcium-imaging data. The speed of Astro-BEATS and its applicability to previously unseen datasets without re-optimization make it particularly useful for generating training datasets for deep learning-based approaches.
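The astronomy-style recipe described above, robustly estimate the background, then segment pixels that rise significantly above it, can be sketched with iterative sigma clipping (a simplified stand-in for the actual Astro-BEATS algorithm):

```python
import numpy as np

def sigma_clipped_stats(frame, sigma=3.0, iters=5):
    """Iteratively clip outliers to estimate background mean and noise,
    as commonly done in astronomical source finding."""
    data = frame.ravel().astype(float)
    for _ in range(iters):
        mu, sd = data.mean(), data.std()
        data = data[np.abs(data - mu) < sigma * sd]
    return data.mean(), data.std()

def segment_transients(frame, nsigma=4.0):
    """Binary mask of pixels significantly above the estimated background."""
    mu, sd = sigma_clipped_stats(frame)
    return frame > mu + nsigma * sd
```

Because the bright transient pixels are clipped out before the final statistics are taken, the threshold tracks the true background even when transients are present in the frame.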
bioinformatics2026-03-17v1NYX: Format-aware, learned compression across omics file types
Patsakis, M.; Chronopoulos, T.; Mouratidis, I.; Georgakopoulos-Soares, I.Abstract
Genomic data repositories continue to grow as sequencing technologies improve, with the NCBI SRA alone exceeding 47 PB. General-purpose compressors treat bioinformatics files as unstructured byte streams and fail to exploit the structured nature of omics data. We present NYX, a format-aware compression system for FASTA, FASTQ, VCF, WIG, H5AD, and BED files. NYX combines lightweight, reversible preprocessing with the OpenZL framework to take advantage of the inherent structure of the data, delivering high compression ratios while remaining fast and lossless. Across representative datasets in the target formats, NYX achieves substantially higher speed than format-specific compressors while maintaining or improving compression ratio.
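One common form of reversible, format-aware preprocessing for FASTQ is to route each record's fields into homogeneous streams (headers, sequences, qualities) that compress better separately than as an interleaved byte stream. A toy sketch assuming bare `+` separator lines (illustrative only; NYX's actual transforms are not shown here):

```python
def split_fastq(text):
    """Reversible preprocessing: route FASTQ record fields into homogeneous
    streams so a downstream compressor sees uniform data in each stream.
    Assumes well-formed 4-line records with bare '+' separators."""
    headers, seqs, quals = [], [], []
    lines = text.strip().split("\n")
    for i in range(0, len(lines), 4):
        headers.append(lines[i])
        seqs.append(lines[i + 1])
        quals.append(lines[i + 3])
    return headers, seqs, quals

def join_fastq(headers, seqs, quals):
    """Inverse transform: reconstruct the original FASTQ text losslessly."""
    return "\n".join(f"{h}\n{s}\n+\n{q}"
                     for h, s, q in zip(headers, seqs, quals)) + "\n"
```

The split/join pair must round-trip exactly, since losslessness of the whole pipeline depends on the preprocessing being reversible.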
bioinformatics2026-03-17v1Glydentify: An explainable deep learning platform for glycosyltransferase donor substrate prediction
Fang, R.; Na, L.; Corulli, C. J.; Prabhakar, P. K.; Berardinelli, S. J.; Venkat, A.; Prasad, A.; Mahmud, R.; Moremen, K. W.; Urbanowicz, B. R.; Dou, F.; Kannan, N.Abstract
Glycosyltransferases (GTs) are a large family of enzymes that catalyze the formation of glycosidic linkages between chemically diverse donor and acceptor molecules to regulate diverse cellular processes across all domains of life. Despite their importance, the activated sugar donors (donor substrates) used by most GTs remain unidentified, limiting our understanding of GT functions. To address this challenge, we developed Glydentify, a deep learning framework that predicts donor usage across GT-A and GT-B fold glycosyltransferases. Trained on large-scale UniProt annotations, Glydentify integrates protein sequence embeddings learned from protein language models with chemical features derived from molecular encoders trained on extensive chemical datasets. The resulting models achieve high predictive performance, with precision-recall AUCs (PR-AUC) of 0.86 for GT-A and 0.91 for GT-B, surpassing general enzyme-substrate predictors while requiring minimal manual curation. We employed Glydentify to predict the donor specificity of uncharacterized plant GTs and experimentally tested the predictions using in vitro biochemical assays. Furthermore, we demonstrate that the model utilizes a combination of evolutionary, structural, and biochemical features to predict donor specificity through residue attention score analysis. Together, these results establish Glydentify as a robust, explainable framework for decoding donor-glycosyltransferase relationships and highlight its potential as a broadly applicable framework for modeling enzyme classes that act on chemically diverse substrates.
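The multimodal fusion described above, pLM sequence embeddings combined with molecular-encoder donor features feeding a predictor, can be caricatured in a few lines (hypothetical feature shapes and a plain linear scorer, not the Glydentify architecture):

```python
import numpy as np

def fuse_and_score(seq_emb, donor_fp, W, b):
    """Hypothetical fusion: concatenate a pLM sequence embedding with a
    molecular-encoder donor fingerprint, then apply a linear scorer with
    softmax over candidate donor classes."""
    x = np.concatenate([seq_emb, donor_fp])
    logits = W @ x + b
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()
```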
bioinformatics2026-03-17v1Eco-Evolutionary Dynamics of Proliferation Heterogeneity: A Phenotype-Structured Model for Tumor Growth and Treatment Response
Schmalenstroer, L.; Rockne, R. C.; Farahpour, F.Abstract
Intra-tumor heterogeneity in proliferation rates fundamentally influences cancer progression and treatment resistance. To investigate how continuous phenotypic variation shapes eco-evolutionary dynamics, we develop a phenotype-structured partial differential equation framework that explicitly models proliferation heterogeneity as a dynamic trait distribution. Our model integrates three key biological principles: (1) phenotypic diffusion capturing heritable variation in proliferation rates, (2) global resource competition enforcing density-dependent growth constraints, and (3) an experimentally grounded life-history trade-off linking elevated proliferation to increased mortality. Using adaptive dynamics, we derive the optimum proliferation rate in a growing tumor, showing that the optimal phenotype dynamically shifts toward slower proliferation as tumors approach carrying capacity. We perform in silico treatment simulations for four different treatment regimes (pan-proliferation, low-, mid-, and high-proliferation targeting) to show how therapeutic selective pressures reshape fitness landscapes. While all treatments slow down tumor growth, they induce divergent evolutionary trajectories: low- and mid-proliferation targeting enrich fast-proliferating clones, whereas high-proliferation targeting selects for slower phenotypes. We connect these dynamics with changes in mean proliferation rates during and after treatment. We use adaptive dynamics to explain the shifts in mean proliferation rate during treatment, showing how each regimen alters the maximum fitness proliferation rate. Our work establishes a predictive, evolutionarily grounded framework for understanding how therapy reshapes tumor proliferation landscapes, offering a mechanistic basis for designing strategies that anticipate and counteract adaptive resistance.
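The three ingredients listed above map naturally onto a phenotype-structured reaction-diffusion equation in proliferation-rate space. A minimal explicit-Euler sketch (an illustrative discretization with a quadratic mortality trade-off, not the authors' exact model):

```python
import numpy as np

def step(u, p, dt=1e-3, D=1e-3, K=1.0):
    """One explicit Euler step for a density u(p) structured by proliferation
    rate p: phenotypic diffusion + logistic resource competition + a
    proliferation-mortality trade-off. Illustrative toy discretization."""
    dp = p[1] - p[0]
    lap = np.zeros_like(u)
    lap[1:-1] = (u[2:] - 2.0 * u[1:-1] + u[:-2]) / dp**2  # phenotypic diffusion
    lap[0], lap[-1] = lap[1], lap[-2]                     # no-flux boundaries
    N = 0.5 * dp * (u[:-1] + u[1:]).sum()                 # total burden (trapezoid rule)
    growth = p * (1.0 - N / K)                            # density-dependent proliferation
    death = 0.5 * p**2                                    # trade-off: faster growth, higher mortality
    return u + dt * (D * lap + u * (growth - death))
```

The density-dependent growth term shows why the optimal phenotype drifts: as total burden N approaches carrying capacity K, the benefit of a high p shrinks while its quadratic mortality cost does not.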
bioinformatics2026-03-17v1Flipper: An advanced framework for identifyingdifferential RNA binding behavior with eCLIP data
Flanagan, K.; Xu, S.; Yeo, G. W.Abstract
Motivation: Crosslinking and immunoprecipitation (CLIP) methods remain the gold standard for characterizing RNA binding protein (RBP) behavior. As a result, many researchers rely on CLIP to assess how treatments targeting RBPs alter binding patterns and regulatory activity. However, current tools for differential RBP binding analysis lack core features required for rigorous statistical inference, including proper normalization and appropriate handling of replicate experiments. Furthermore, existing approaches cannot adequately separate expression driven effects from true changes in RBP binding, complicating interpretation of differential analyses. Addressing these limitations is essential for producing reproducible and informative analyses of differential RBP binding. Results: Here we present Flipper, an application purpose built for the analysis of differential RBP binding. Flipper introduces several innovations that adapt the DESeq2 framework for robust differential analysis of eCLIP count data. These include integration of input controls to account for expression driven binding shifts, hierarchical normalization strategies that adjust for technical variation without confounding signal to noise ratios, and improved post-differential analysis tools. We demonstrate that Flipper exhibits high specificity when applied to real differential eCLIP data while also providing deeper biological insights. In addition, analyses of both real and simulated data indicate that Flipper achieves superior sensitivity and precision compared with existing approaches. Together, these results highlight Flipper as a robust and generalizable framework for differential RBP binding analysis.
bioinformatics2026-03-15v1Bayesian AMMI-Based Simulation of Genotype x Environment Interactions
Lee, H.; Segae, V. S.; Garcia-Abadillo, J.; de Oliveira Bussiman, F.; Trujano Chavez, M. Z.; Hidalgo, J.; Jarquin, D.Abstract
Genotype-by-environment interaction (GEI) has been studied to identify environment-stable/favorable genotypes. Simulating GEI could help refine this inference by incorporating tangible factors such as genomic and environmental information. The Bayesian additive main effect and multiplicative interaction (Bayesian AMMI) model captures the genotype-specific responses across environments, reflecting directional relationships between genotypes and environments. Thus, we propose a Bayesian AMMI-based GEI simulation framework that utilizes high-throughput environmental covariance matrices to generate GEI effects with interpretable directional structure. To demonstrate the proposed approach, two simulated phenotypes were assessed under four levels of GEI variance. In the first simulation (Sim1), GEI effects were sampled from a multivariate normal distribution defined by the GEI matrix. In the second simulation (Sim2), GEI effects were generated by extending Sim1 with the Bayesian AMMI model. In both simulations, increasing GEI variance resulted in lower correlations of phenotypes across environments and stronger genotype-specific sensitivity to environmental variation. Across five cross-validation designs, models accounting for GEI consistently outperformed the one that did not, with prediction accuracy generally decreasing as GEI variance increased. Clear distinctions between the two simulated phenotypes were evident from biplot analyses: Sim2 successfully captured environmental relatedness and genotype-specific responses, whereas such structure was absent in Sim1. These results demonstrate that the proposed Bayesian AMMI-based GEI simulation framework enables interpretable visualization of GEI and supports genomic selection strategies under complex environmental conditions.
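The Sim1 scheme described above, sampling GEI effects from a multivariate normal defined by an environmental covariance matrix, can be sketched directly (a simplified stand-in; the Bayesian AMMI extension of Sim2 is not shown):

```python
import numpy as np

def simulate_gei(n_geno, env_cov, var_gei, rng=None):
    """Sim1-style GEI effects: each genotype's interaction vector is drawn
    from N(0, var_gei * env_cov), so correlated environments induce
    correlated interaction effects across genotypes."""
    if rng is None:
        rng = np.random.default_rng(0)
    L = np.linalg.cholesky(var_gei * np.asarray(env_cov, float))
    z = rng.standard_normal((n_geno, L.shape[0]))
    return z @ L.T  # rows: genotypes; columns: environments
```

Raising `var_gei` scales the magnitude of the interaction effects, which is exactly the knob the four GEI-variance levels in the study turn.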
bioinformatics2026-03-15v1Asymmetric Contrastive Objectives for Efficient Phenotypic Screening
Nightingale, L.; Tuersley, J.; Warchal, S.; Cairoli, A.; Howes, J.; Shand, C.; Powell, A.; Green, D.; Strange, A.; Howell, M.Abstract
Phenotypic screening experiments produce many microscope images of cells under diverse perturbations, with biologically significant responses often subtle or difficult to identify visually. A central challenge is to extract image representations that distinguish activity from controls and group phenotypically similar perturbations. In this work we propose new adaptations of contrastive loss functions that incorporate experimental metadata as learned class vectors, and a geometrically inspired variant, called SPC, where class vectors are confined to the unit sphere and updated only by attractive terms (allowing more overlap of phenotypically similar classes). The approach is tested on two popular benchmarking datasets, BBBC021 and RxRx3-core; and we also evaluate performance on uncurated screens of HaCaT cells to gauge effectiveness in a realistic use-case scenario. We find we outperform prior methods across the three datasets and on a wide array of metrics measuring phenotype grouping, biological recall, drug-target interaction and mechanism-of-action inference. We also show we maintain this improved performance compared to models over 10x larger in parameter count, and that SPC can be used as an effective fine-tuning technique. The method is easy to implement and is well suited to settings with limited data or compute resources.
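The SPC idea above, class vectors confined to the unit sphere and updated only by attractive terms, can be illustrated with a toy update rule (a schematic caricature, not the paper's exact loss):

```python
import numpy as np

def spc_update(class_vecs, embeddings, labels, lr=0.1):
    """SPC-style update sketch: pull each class vector toward the mean of its
    member embeddings (attractive term only, no repulsion between classes),
    then renormalize so every class vector stays on the unit sphere."""
    out = class_vecs.copy()
    for c in range(class_vecs.shape[0]):
        members = embeddings[labels == c]
        if len(members):
            out[c] += lr * (members.mean(axis=0) - out[c])
    return out / np.linalg.norm(out, axis=1, keepdims=True)
```

Omitting the repulsive term is what lets phenotypically similar perturbation classes drift together on the sphere instead of being forced apart.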
bioinformatics2026-03-14v3stMCP: Spatial Transcriptomics with a Model Context Protocol Server
Smith, J. J.; Wang, X.; McPheeters, M.; Widjaja-Adhi, M. A.; Littleton, S.; Saban, D.; Golczak, M.; Jenkins, M. W.Abstract
Spatial transcriptomics enables high-resolution mapping of gene expression in intact tissues but remains challenging due to complex computational workflows that limit accessibility and reproducibility. Here, we present a Model Context Protocol (MCP) framework enabling natural language-driven spatial transcriptomics analysis. By executing analytical tools locally, this architecture eliminates the need to upload massive datasets to large language models, bypassing high token costs and mitigating data privacy and training risks. The MCP orchestrator interprets intent, dynamically routes requests, maintains session state, and verifies input integrity to ensure reproducible execution. Benchmarking across biological discovery, orchestration accuracy, token usage, and execution time demonstrates robust performance. This architecture establishes a scalable template for AI-native research by standardizing the interface between models and local analytical engines. Rather than replacing bioinformaticians, this framework empowers biologists to independently and comprehensively explore their data, accelerating hypothesis testing and unlocking broader biological discoveries.
bioinformatics2026-03-14v1Image Analysis Tools for Electron Microscopy
Shtengel, D.; Shtengel, G.; Xu, C. S.; Hess, H. F.Abstract
Electron Microscopy (EM) is widely used in many scientific fields, particularly in life sciences, offering high-resolution information on the ultrastructure of biological organisms. Accurate characterization of EM image quality is important for assessing EM instrument performance, as well as sample preparation protocols and imaging conditions. This paper provides an overview of tools we developed as plugins for the popular image processing package Fiji (ImageJ) (1). These tools include signal-to-noise ratio analysis, contrast evaluation, and resolution analysis, as well as the capability to import images acquired on custom FIB-SEM instruments (2). We have also made these tools available in Python, with both versions available on GitHub.
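A toy version of the signal-to-noise analysis such plugins perform, here reduced to a simple contrast-to-noise estimate in decibels (illustrative only; the Fiji and Python tools may use different definitions):

```python
import numpy as np

def snr_db(image, background_mask):
    """Simple image-quality estimate: mean signal above background divided
    by the background noise standard deviation, reported in decibels."""
    bg = image[background_mask]
    signal = image[~background_mask].mean() - bg.mean()
    return 20.0 * np.log10(signal / bg.std())
```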
bioinformatics2026-03-14v1DisGeneFormer: Precise Disease Gene Prioritization by Integrating Local and Global Graph Attention
Koeksal, R.; Fritz, A.; Kumar, A.; Schmidts, M.; Tran, V. D.; Backofen, R.Abstract
Identifying genes associated with human diseases is essential for effective diagnosis and treatment. Experimentally identifying disease-causing genes is time-consuming and expensive. Computational prioritization methods aim to streamline this process by ranking genes based on their likelihood of association with a given disease. However, existing methods often report long ranked lists consisting of thousands of potential disease genes, often containing a high number of false positives. This fails to meet the practical needs of clinicians who require shorter, more precise candidate lists. To address this problem, we introduce DisGeneFormer (DGF), an end-to-end disease-gene prioritization pipeline. Our approach is based on two distinct graph representations, modeling gene and disease relationships, respectively. Each graph is first processed separately by graph attention and then jointly by a transformer module to combine within-graph and cross-graph knowledge through local and global attention. We propose an evaluation pipeline based on the precision of a top K ranked gene list, with K set to clinically feasible values between 5 and 50, relying solely on experimentally verified associations as ground truth. Our evaluation demonstrates that DGF substantially outperforms existing methods. We additionally assessed the influence of the negative-data sampling strategy and analyzed the effects of graph topology and features on the performance of our model.
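The precision-at-K evaluation proposed above is straightforward to state precisely (generic metric code, not the DGF pipeline itself):

```python
def precision_at_k(ranked_genes, true_genes, k):
    """Fraction of the top-k ranked candidates that are verified disease
    genes; with k at clinically feasible values (e.g. 5-50), this rewards
    short, precise candidate lists rather than long recall-oriented ones."""
    hits = sum(1 for g in ranked_genes[:k] if g in true_genes)
    return hits / k
```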
bioinformatics2026-03-14v1A Multi-Omics Processing Pipeline (MOPP) for Extracting Taxonomic and Functional Insights from Metaribosome Profiling (metaRibo-Seq) data
Weng, Y.; Moyne, O.; Walker, C.; Haddad, E.; Lieng, C.; Chin, L.; Rahman, G.; McDonald, D.; Knight, R.; Zengler, K.Abstract
Metaribosome profiling (metaRibo-Seq) enables genome-wide measurement of translation across complex microbial communities by sequencing ribosome-protected mRNA fragments, but the short length of these footprints creates substantial nonspecific mapping against large reference genome collections, leading to spurious taxonomic and functional assignments. Here we present MOPP (Multi-Omics Processing Pipeline), a modular reference-based workflow that denoises metaRibo-Seq data by leveraging matched metagenomic coverage breadth to identify genomes likely to be truly present in a sample before aligning metatranslatomic and optional metatranscriptomic reads. MOPP generates taxon-by-gene count tables across genomic, transcriptional and translational layers, enabling integrated downstream analyses of microbial function. We evaluated MOPP using a defined 79-member synthetic human gut community profiled by metagenomics and metaRibo-Seq. Coverage breadth filtering markedly improved detection accuracy relative to a standard baseline workflow, with performance remaining robust across a broad intermediate threshold range and peaking at 92-95% coverage breadth. At a 92% threshold, MOPP reduced the number of distinct detected operational genomic units by 99.4% while retaining 87.8% of aligned metaRibo-Seq reads on average, and increased the F1 score from 0.02 to 0.61. Residual false positives were predominantly attributable to genomes with extremely high nucleotide similarity to true community members, whereas false negatives were enriched among low-abundance taxa, indicating that remaining errors are driven primarily by biological similarity and detection limits rather than widespread nonspecific mapping. 
Together, these results establish MOPP as a high-throughput workflow for robust processing of metaRibo-Seq in the context of matched metagenomics and position it as a scalable framework for integrated taxonomic and functional analysis of microbial communities across genomic, transcriptional and translational layers.
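The coverage-breadth filter at the heart of MOPP, and the F1 score used to evaluate detection, can be sketched as simple set operations (hypothetical helpers for illustration, not the pipeline's implementation):

```python
def filter_by_breadth(breadth, threshold=0.92):
    """Keep only genomes whose matched metagenomic coverage breadth
    (fraction of genome positions covered) meets the threshold, before
    aligning metaRibo-Seq reads against them."""
    return {g for g, b in breadth.items() if b >= threshold}

def f1(detected, truth):
    """F1 of detected vs. truly present genomes, as sets."""
    tp = len(detected & truth)
    if not detected or not truth or not tp:
        return 0.0
    p, r = tp / len(detected), tp / len(truth)
    return 2 * p * r / (p + r)
```

Truly present genomes accumulate reads across most of their length, while nonspecific mappings pile up on small shared regions, which is why breadth (not depth) separates the two so cleanly.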
bioinformatics2026-03-14v1