Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
The History of Enzyme Evolution Embedded in Metabolism
Corlett, T.; Smith, H. B.; Smith, E.; Goldford, J. E.; Longo, L. M.
Abstract
Whereas phylogenetic reconstructions are a primary record of protein evolution, it is unknown whether the deep history of enzymes is encoded at higher levels of biological organization. Here, we demonstrate that the emergence and reuse history of enzymatic folds is embedded within the web of metabolite-cofactor-enzyme interdependencies that comprise biosphere-scale metabolic reaction networks. Using a simple network analysis approach, we reconstruct the relative ordering of enzymatic fold emergence and, where possible, the first reaction(s) that each enzymatic fold catalyzed. We find that a large majority of enzymatic folds were sufficient as independent additions to open new avenues for metabolic growth. The resulting network-based histories are broadly concordant with enzyme phyletic distribution in prokaryotes, a proxy for enzyme age. Our results suggest that the earliest enzyme-mediated metabolisms were enriched for α/β proteins, likely due to their strong association with cofactor utilization, and that α-proteins preferentially emerge at later stages. The cradle-loop barrel, a member of the small β-barrel metafold, is predicted to be the founding β-fold, in agreement with analyses of ribosome structure. An examination of how the protein universe responded to the biological production of molecular oxygen reveals that the adaptation of existing enzymatic folds, not novel fold emergence, was the primary driver of metabolic evolution. This work presents a self-consistent model of metabolic and enzyme evolution, key progress towards integrating diverse perspectives into a unified history of protein evolution.
bioinformatics 2026-03-17 v2
From Circles to Signals: Representation Learning on Ultra-Long Extrachromosomal Circular DNA
Li, J.; Liu, Z.; Zhang, Z.; Zhang, J.; Singh, R.
Abstract
Extrachromosomal circular DNA (eccDNA) is a covalently closed circular DNA molecule that plays an important role in cancer biology. Genomic foundation models have recently emerged as a powerful direction for DNA sequence modeling, enabling the direct prediction of biologically relevant properties from DNA sequences. Although recent genomic foundation models have shown strong performance on general DNA sequence modeling, their application to eccDNA remains limited: existing approaches either rely on computationally expensive attention mechanisms or truncate ultra-long sequences into kilobase fragments, thereby disrupting long-range continuity and ignoring the molecule's circular topology. To overcome these problems, we introduce eccDNAMamba, a bidirectional state space model (SSM) built upon the Mamba-2 framework, which scales linearly with input sequence length and enables scalable modeling of ultra-long eccDNA sequences. eccDNAMamba further incorporates a circular augmentation strategy to preserve the intrinsic circular topology of eccDNA. Comprehensive evaluations against state-of-the-art genomic foundation models demonstrate that eccDNAMamba achieves superior performance on ultra-long sequences across multiple task settings, such as cancer versus healthy eccDNA discrimination and eccDNA copy-number level prediction. Moreover, the Integrated Gradient (IG) based model explanation indicates that eccDNAMamba focuses on biologically meaningful regulatory elements and can uncover key sequence patterns in cancer-derived eccDNAs. Overall, these results demonstrate that eccDNAMamba effectively models ultra-long eccDNA sequences by leveraging their unique circular topology and regulatory architecture, bridging a critical gap in sequence analysis. Our codes and datasets are available at https://github.com/zzq1zh/eccDNAMamba.
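The abstract does not describe how the circular augmentation is implemented. As a minimal sketch of the general idea only, the functions below (`circular_augment` and `rotate`, both hypothetical names) show two common ways to expose a linear-sequence model to the junction of a circular molecule: appending the first bases to the end, and rotating the start position. The `overlap` and `offset` parameters are assumptions for illustration.

```python
def circular_augment(seq: str, overlap: int) -> str:
    """Append the first `overlap` bases to the end of a circular sequence.

    A linear model then sees windows that span the junction between the
    last and first base. Hypothetical helper, not the eccDNAMamba code.
    """
    if overlap < 0:
        raise ValueError("overlap must be non-negative")
    overlap = min(overlap, len(seq))
    return seq + seq[:overlap]


def rotate(seq: str, offset: int) -> str:
    """Return the same circular molecule linearized at a different start."""
    if not seq:
        return seq
    offset %= len(seq)
    return seq[offset:] + seq[:offset]


ecc = "ACGTTGCA"
augmented = circular_augment(ecc, 3)   # "ACGTTGCA" + "ACG"
rotated = rotate(ecc, 5)               # starts reading at index 5
```

In this toy example the junction-spanning 4-mer `CAAC` is absent from the linear string but present after augmentation.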
bioinformatics 2026-03-17 v2
ChromBERT: Uncovering Chromatin State Motifs in the Human Genome Using a BERT-based Approach
Lee, S.; Sakatsume, J.; Oba, G. M.; Nagaoka, Y.; Lin, C.; Chen, C.-Y.; Nakato, R.
Abstract
Chromatin states, which are defined by specific combinations of histone post-translational modifications, are fundamental to gene regulation and cellular identity. Despite their importance, comprehensive patterns within chromatin state sequences, which could provide insights into key biological functions, remain largely unexplored. In this study, we introduce ChromBERT, a BERT-based model specifically designed to detect distinct chromatin state patterns as 'motifs.' We pre-trained ChromBERT on 15-state chromatin annotations from 127 human cell and tissue types from the ROADMAP consortium. This pre-trained model can be fine-tuned for various downstream tasks, and the chromatin state patterns that receive high attention are extracted as motifs. To account for the variable-length nature of chromatin state motifs, ChromBERT uses Dynamic Time Warping to cluster similar motifs and identify meaningful representative patterns. In this study, we evaluated the performance of the model on several tasks, including binary and quantitative gene expression prediction, cell type classification, and three-dimensional genome feature classification. Our analyses yielded biologically grounded results and revealed the associated chromatin state motifs. This workflow facilitates the discovery of specific chromatin state patterns across different biological contexts and offers a new framework for exploring the dynamics of epigenomic states.
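To illustrate why Dynamic Time Warping suits variable-length motifs, here is a minimal sketch of the textbook DTW distance between two state sequences. It is not ChromBERT's implementation: the single-letter state labels and the 0/1 mismatch cost are assumptions chosen for brevity.

```python
def dtw_distance(a, b, cost=lambda x, y: 0.0 if x == y else 1.0):
    """Classic dynamic-programming DTW distance between two sequences.

    d[i][j] is the cheapest cost of aligning a[:i] with b[:j], where each
    element may be matched to one or more elements of the other sequence.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = cost(a[i - 1], b[j - 1])
            d[i][j] = step + min(d[i - 1][j],      # a[i-1] repeats a match
                                 d[i][j - 1],      # b[j-1] repeats a match
                                 d[i - 1][j - 1])  # both advance
    return d[n][m]


# Hypothetical state labels: the same pattern at different lengths.
stretched = dtw_distance("AABBC", "ABC")
different = dtw_distance("ABC", "ABD")
```

Because repeated states can be warped onto a single state, `AABBC` and `ABC` are at distance zero, which is the property that lets motifs of different lengths fall into one cluster.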
bioinformatics 2026-03-17 v2
Accelerating k-mer-based sequence filtering
Martayan, I.; Vandamme, L.; Constantinides, B.; Cazaux, B.; Paperman, C.; Limasset, A.
Abstract
The exponential growth of global sequencing data repositories presents both analytical challenges and opportunities. While k-mer-based indexing has improved scalability over traditional alignment for identifying relevant documents, pinpointing the exact sequences matching numerous queries remains a hurdle. In particular, searching for numerous k-mers with a single large query or multiple distinct queries strains existing exact matching tools, whose performance scales poorly with an increasing number of patterns. At the same time, indexing entire vast datasets for infrequent or ad-hoc searches is often resource-prohibitive. Designing fast methods for matching a large number of k-mers without exhaustive pre-indexing is therefore critical. We propose an efficient solution to the problem of k-mer-based sequence filtering: given a set of k-mers of interest and a threshold, quickly evaluate whether an arbitrary sequence has a number of k-mer matches above or below the threshold. Our approach demonstrates how minimizer-based sketching, alongside SIMD acceleration, can enhance the performance of streaming searches, and is implemented as a Rust tool named K2Rmini. On a consumer laptop, K2Rmini is able to filter long reads at 2 Gbp/s. Availability: https://github.com/Malfoy/K2Rmini
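The filtering problem stated above can be sketched naively in a few lines. This baseline uses a plain hash set and omits the minimizer sketching and SIMD acceleration that K2Rmini relies on, so it shows only the problem definition, not the tool's method. The function names, and the choice to treat "at least `threshold` matches" as passing, are assumptions.

```python
def kmers(seq: str, k: int):
    """Yield every overlapping k-mer of `seq` (forward strand only)."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def passes_filter(seq: str, query_kmers: set, k: int, threshold: int) -> bool:
    """Return True once `seq` contains at least `threshold` query k-mers.

    Stops scanning as soon as the threshold is reached. Reverse
    complements and canonical k-mers are ignored in this sketch.
    """
    if threshold <= 0:
        return True
    hits = 0
    for kmer in kmers(seq, k):
        if kmer in query_kmers:
            hits += 1
            if hits >= threshold:
                return True
    return False


query = set(kmers("ACGTACGT", 4))   # {"ACGT", "CGTA", "GTAC", "TACG"}
read = "TTACGTACTT"                 # contains all four query k-mers once
keep = passes_filter(read, query, 4, 3)
drop = passes_filter(read, query, 4, 5)
```

The early exit is what makes threshold queries cheaper than full counting: a read that clearly matches is accepted after only a prefix has been scanned.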
bioinformatics 2026-03-17 v2
VarDCL: A Multimodal PLM-Enhanced Framework for Missense Variant Effect Prediction via Self-distilled Contrastive Learning
Zhang, H.; Zheng, G.; Xu, Z.; Zhao, H.; Cai, S.; Huang, Y.; Zhou, Z.; Wei, Y.
Abstract
Missense variants are a common type of genetic mutation that can alter the structure and function of proteins, thereby affecting the normal physiological processes of organisms. Accurately distinguishing damaging missense variants from benign ones is of great significance for clinical genetic diagnosis, treatment strategy development, and protein engineering. Here, we propose the VarDCL method, which ingeniously integrates multimodal protein language model embeddings and self-distilled contrastive learning to identify subtle sequence and structural differences before and after protein mutations, thereby accurately predicting pathogenic missense variants. First, leveraging sequence and structural information before and after mutations, VarDCL generates sequence-structural multimodal features via different language models. It incorporates both global and local perspectives of feature embeddings to provide the model with dynamic, multimodal, and multi-view input data. Additionally, a Self-distilled Contrastive Learning (SDCL) module was proposed to enable more effective information integration and feature learning, enhancing the model's ability to detect sequence and structural changes induced by mutations. Within this module, the multi-level contrastive learning framework excels at capturing information differences before and after mutations within the same modality; meanwhile, the feature self-distillation mechanism effectively utilizes high-level fused features to guide the learning of low-level differential features, facilitating information interaction across different modalities. The VarDCL framework not only ensures the model's capacity to learn dynamic changes pre- and post-mutation but also significantly improves cross-modal information interaction between sequence and structure, thereby remarkably boosting the model's performance in distinguishing pathogenic mutations from benign ones.
To validate the effectiveness of VarDCL, extensive experiments were conducted. The ablation study demonstrates that all key components of VarDCL contribute significantly. On an independent test set containing 18,731 clinical variants, VarDCL achieved an AUC of 0.917, an AUPR of 0.876, an MCC of 0.690, and an F1-score of 0.789, outperforming 21 state-of-the-art existing methods. Benchmark analysis shows that VarDCL can be utilized as an accurate and potent tool for predicting missense variant effects. The data and code for VarDCL are available at https://github.com/mjcoo/VarDCL for academic use.
bioinformatics 2026-03-17 v1
Integrated Artificial Intelligence and Quantum Chemistry Approach for the Rational Design of Novel Antibacterial Agents against Ralstonia solanacearum.
Gulumbe, D. A.; Tiwari, G.; Lohar, T.; Nikam, R.; Kumar, A.; Giri, S.
Abstract
Antimicrobial resistance (AMR) in plant pathogenic bacteria poses a serious threat to global agriculture, necessitating the development of novel antibacterial agents targeting virulence mechanisms. This study presents an integrated bioinformatics-driven framework for the rational design and computational validation of Solres, a newly designed small molecule targeting key virulence proteins in phytopathogenic bacteria. Approximately 10,000 active compounds from PubChem BioAssay (AID: 588726) were analyzed using structural clustering and scaffold mining to identify conserved molecular motifs associated with antibacterial activity. Guided by high-frequency substructures, Solres was designed de novo and screened for structural novelty against PubChem, ChEMBL, and WIPO databases. Drug-likeness evaluation using Lipinski's Rule of Five confirmed favorable physicochemical properties. Molecular docking was performed against essential virulence regulators, including PhcA, PhcR, HrpB, PehA, and Egl from Ralstonia solanacearum and Xanthomonas spp., with active sites predicted using CaspFold. Docking analyses revealed strong binding affinities and stable interactions with key catalytic and regulatory residues. Complex stability and conformational integrity were further validated through molecular dynamics simulations. Quantum chemical descriptors, including the HOMO-LUMO energy gap and dipole moment, supported the electronic suitability and reactivity profile of Solres. Collectively, this study demonstrates the effective integration of cheminformatics, structural bioinformatics, molecular simulations, and quantum chemical analyses for plant-focused antibacterial discovery. The compound Solres represents a promising lead candidate for mitigating bacterial wilt disease and provides a computational framework for future experimental validation and sustainable crop protection strategies against AMR-driven phytopathogens.
bioinformatics 2026-03-17 v1
Cross-Propagative Graph Learning Reveals Spatial Tissue Domains in Multi-Modal Spatial Transcriptomics
Guo, Y.; Liu, S.; Zhang, Z.; Zhang, S.; Li, L.
Abstract
Spatial transcriptomics enables in situ characterization of tissue organization by jointly profiling gene expression profiles and spatial coordinates, with histological images as complementary contextual information. However, effectively integrating these heterogeneous modalities remains challenging due to differences in statistical properties and structural patterns. We propose st-Xprop, a cross-propagative graph network with dual-graph embedding coupling for spatial domain identification. st-Xprop constructs modality-specific graphs for gene expression and histological features, and performs alternating cross-modal propagation to explicitly model inter-modal heterogeneity while enabling complementary information exchange. Through dual-graph embedding coupling, the framework progressively learns a unified low-dimensional representation that integrates multimodal signals and preserves spatial coherence. Evaluations on multiple real spatial transcriptomics datasets demonstrate that st-Xprop consistently improves clustering accuracy and robustness, particularly in weak-signal or structurally complex regions, yielding spatial domains that are more stable and biologically meaningful.
bioinformatics 2026-03-17 v1
BioOS: A Gene-Driven Digital Twin Runtime for Emergent Plant Development
AUGER, E.; Gandecki, M.; Delarche, C.; Heng, F. X.
Abstract
Predicting plant phenotypes from genomic data requires models that bridge molecular regulation and organ-scale morphogenesis. We introduce BioOS, a computational runtime in which plant behavior - cell division, differentiation, and elongation - emerges from the execution of a gene regulatory network rather than from hardcoded rules. The system is built on the Formal Cell abstraction: a minimal signal-processing unit analogous to the McCulloch-Pitts formal neuron, whose transfer function is gene expression. Each Formal Cell evaluates promoters, transcribes mRNA, translates proteins, and derives its entire behavioral repertoire from the resulting protein concentrations - without a single hardcoded rule in the simulator code. A multi-scale architecture with level-of-detail switching enables real-time simulation of Arabidopsis thaliana primary root development. On the current official five-case primary-root auxin benchmark, BioOS achieves 75.4% mean score, 5/5 qualitative matches, 5/5 cases passing all current gates, and Spearman severity correlation ρ = 0.70. The current root-auxin runtime is driven by a curated 35-gene registry with explicit promoter logic, kinetic parameters, and epigenetic state; for readability, this manuscript details a core 18-gene subnetwork that carries the main auxin benchmark logic. We describe the architecture, the gene expression runtime, the epigenetic memory model, the completed transition to post-hoc (non-causal) zone classification, and candidate benchmark extensions for persistent plasmodesmata and intracellular auxin compartmentalization within a broader six-suite, 63-case benchmark framework. Beyond the root-auxin slice, the current codebase also closes the official flowering (5/5), photosynthesis (7/7), and cytokinin (5/5) gates, while root-patterning remains a passing candidate panel.
bioinformatics 2026-03-17 v1
Eco-Evolutionary Dynamics of Proliferation Heterogeneity: A Phenotype-Structured Model for Tumor Growth and Treatment Response
Schmalenstroer, L.; Rockne, R. C.; Farahpour, F.
Abstract
Intra-tumor heterogeneity in proliferation rates fundamentally influences cancer progression and treatment resistance. To investigate how continuous phenotypic variation shapes eco-evolutionary dynamics, we develop a phenotype-structured partial differential equation framework that explicitly models proliferation heterogeneity as a dynamic trait distribution. Our model integrates three key biological principles: (1) phenotypic diffusion capturing heritable variation in proliferation rates, (2) global resource competition enforcing density-dependent growth constraints, and (3) an experimentally grounded life-history trade-off linking elevated proliferation to increased mortality. Using adaptive dynamics, we derive the optimum proliferation rate in a growing tumor, showing that the optimal phenotype dynamically shifts toward slower proliferation as tumors approach carrying capacity. We perform in silico treatment simulations for four different treatment regimes (pan-proliferation, low-, mid-, and high-proliferation targeting) to show how therapeutic selective pressures reshape fitness landscapes. While all treatments slow down tumor growth, they induce divergent evolutionary trajectories: low- and mid-proliferation targeting enrich fast-proliferating clones, whereas high-proliferation targeting selects for slower phenotypes. We connect these dynamics with changes in mean proliferation rates during and after treatment. We use adaptive dynamics to explain the shifts in mean proliferation rate during treatment, showing how each regimen alters the maximum fitness proliferation rate. Our work establishes a predictive, evolutionarily grounded framework for understanding how therapy reshapes tumor proliferation landscapes, offering a mechanistic basis for designing strategies that anticipate and counteract adaptive resistance.
bioinformatics 2026-03-17 v1
Glydentify: An explainable deep learning platform for glycosyltransferase donor substrate prediction
Fang, R.; Na, L.; Corulli, C. J.; Prabhakar, P. K.; Berardinelli, S. J.; Venkat, A.; Prasad, A.; Mahmud, R.; Moremen, K. W.; Urbanowicz, B. R.; Dou, F.; Kannan, N.
Abstract
Glycosyltransferases (GTs) are a large family of enzymes that catalyze the formation of glycosidic linkages between chemically diverse donor and acceptor molecules to regulate diverse cellular processes across all domains of life. Despite their importance, the activated sugar donors (donor substrates) used by most GTs remain unidentified, limiting our understanding of GT functions. To address this challenge, we developed Glydentify, a deep learning framework that predicts donor usage across GT-A and GT-B fold glycosyltransferases. Trained on large-scale UniProt annotations, Glydentify integrates protein sequence embeddings learned from protein language models with chemical features derived from molecular encoders trained on extensive chemical datasets. The resulting models achieve high predictive performance, with precision-recall AUCs (PR-AUC) of 0.86 for GT-A and 0.91 for GT-B, surpassing general enzyme-substrate predictors while requiring minimal manual curation. We employed Glydentify to predict the donor specificity of uncharacterized plant GTs and experimentally tested the predictions using in vitro biochemical assays. Furthermore, we demonstrate that the model utilizes a combination of evolutionary, structural, and biochemical features to predict donor specificity through residue attention score analysis. Together, these results establish Glydentify as a robust, explainable framework for decoding donor-glycosyltransferase relationships and highlight its potential as a broadly applicable framework for modeling enzyme classes that act on chemically diverse substrates.
bioinformatics 2026-03-17 v1
NYX: Format-aware, learned compression across omics file types
Patsakis, M.; Chronopoulos, T.; Mouratidis, I.; Georgakopoulos-Soares, I.
Abstract
Genomic data repositories continue to grow as sequencing technologies improve, with the NCBI SRA alone exceeding 47 PB. General-purpose compressors treat bioinformatics files as unstructured byte streams and fail to exploit the structured nature of omics data. We present NYX, a format-aware compression system for FASTA, FASTQ, VCF, WIG, H5AD, and BED files. NYX combines lightweight, reversible preprocessing with the OpenZL framework, on which it is built, to take advantage of inherent data structure, delivering high compression ratios while preserving fast and lossless compression. Across representative datasets in the target formats, NYX achieves substantially higher speed than format-specific compressors while maintaining or improving compression ratio.
bioinformatics 2026-03-17 v1
3D-Manhattan: An interactive visualization tool for multiple GWAS results
Hashimoto, S.
Abstract
Genome-wide association studies (GWAS) are widely used to identify genetic loci underlying various agronomic traits. Conventional Manhattan plots provide an effective two-dimensional (2D) summary of an individual GWAS result. However, recent advances in high-throughput phenotyping have led to study designs that generate multiple GWAS outputs across time points, traits, or experimental conditions. In such settings, biological insight increasingly depends on comparative interpretation of multiple association maps, yet panel-based arrangements of 2D plots fragment related information and impede recognition of shared or dynamic genetic signals. Here, I present 3D-Manhattan, an interactive visualization framework that integrates multiple GWAS results within a unified three-dimensional (3D) coordinate system. By extending the conventional Manhattan plot with an additional axis representing time, trait, or condition, 3D-Manhattan enables simultaneous, axis-aligned comparison of association landscapes while preserving genomic coordinates and statistical values. The tool is implemented as a stand-alone, browser-based application using WebGL-based rendering and supports smooth interaction without server-side computation. The framework provides flexible visualization controls, region highlighting, and variant-level correspondence across datasets, facilitating exploratory analysis of stable and context-dependent genetic associations. Collectively, 3D-Manhattan provides an alternative approach for visualizing multi-dimensional GWAS results and offers a powerful platform for visualizing series of genome-wide datasets in general.
bioinformatics 2026-03-17 v1
Calcium transient detection and segmentation with the astronomically motivated algorithm for background estimation and transient segmentation (Astro-BEATS)
Fan, B.; Bilodeau, A.; Beaupre, F.; Wiesner, T.; Gagne, C.; Lavoie-Cardinal, F.; Hlozek, R.
Abstract
Fluorescence-based calcium-imaging is a powerful tool for studying localized neuronal activity, including miniature Synaptic Calcium Transients, providing real-time insights into synaptic activity. These transients induce only subtle changes in the fluorescence signal, often barely above baseline, which poses a significant challenge for automated synaptic transient detection and segmentation. Detecting astronomical transients similarly requires efficient algorithms that will remain robust over a large field of view with varying noise properties. We leverage techniques used in astronomical transient detection for miniature Synaptic Calcium Transient detection in fluorescence microscopy. We present Astro-BEATS, an automatic miniature Synaptic Calcium Transient segmentation algorithm that incorporates image estimation and source-finding techniques used in astronomy and designed for calcium-imaging videos. Astro-BEATS outperforms current threshold-based approaches for synaptic calcium transient detection and segmentation. The produced segmentation masks can be used to train a supervised deep learning algorithm for improved synaptic calcium transient detection in calcium-imaging data. The speed of Astro-BEATS and its applicability to previously unseen datasets without re-optimization makes it particularly useful for generating training datasets for deep learning-based approaches.
bioinformatics 2026-03-17 v1
RIBEX: Predicting and Explaining RNA Binding Across Structured and Intrinsically Disordered Regions (IDR)-rich Proteins
Firmani, S.; Steinbauer, F.; Kasneci, G.; Horlacher, M.; Marsico, A.
Abstract
Motivation: RNA-binding proteins (RBPs) regulate post-transcriptional processes, yet many remain undiscovered because RNA-binding activity often occurs outside canonical RNA-binding domains (RBDs), including within intrinsically disordered regions (IDRs) or through protein complexes. Computational methods can help identify novel RBPs, but approaches relying solely on sequence-derived features or ignoring the cellular interaction context are limited in capturing the complexity of RNA-binding behavior. To date, no framework rigorously integrates both sequence information and protein interaction context for RBP prediction. Results: We introduce RIBEX, a multimodal framework that combines protein language model (pLM) embeddings with protein interactome topology to improve RBP prediction and interpretation. Specifically, we integrate sequence representations with graph-derived positional encodings (PE) from the human STRING protein-protein interaction (PPI) network. PE are computed using Personalized PageRank, reduced with principal component analysis, and fused with pooled sequence embeddings through FiLM conditioning, while Low-Rank Adaptation (LoRA) enables parameter-efficient task adaptation. Across both an annotation-based benchmark and an experimental RNA Interactome Capture (RIC) dataset, PE consistently improves predictive performance, indicating that interactome topology provides complementary information beyond sequence features. LoRA adaptation of ESM2-650M further yields larger gains than simply scaling frozen backbone size. RIBEX outperforms state-of-the-art methods such as RBPTSTL and HydRA, particularly on challenging subsets including proteins lacking canonical RBDs and those enriched in IDRs.
For interpretability, we combine sequence-level computational alanine scanning with network-level positional-encoding ablation and inverse-PCA mapping, recovering known RNA-binding domains, IDR-associated contributions, and functional interactome communities linked to RBP predictions.
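As a rough illustration of the first step in the positional-encoding pipeline described above, the sketch below computes Personalized PageRank scores for one seed node by power iteration. It is a generic textbook formulation, not RIBEX's code: the toy graph, the restart probability `alpha`, and the iteration count are assumptions, and the PCA and FiLM steps are not shown.

```python
def personalized_pagerank(adj, seed, alpha=0.15, iters=100):
    """Power-iteration Personalized PageRank with restart to `seed`.

    `adj` maps each node to a list of neighbours. At every step a walker
    restarts at `seed` with probability `alpha`, otherwise it moves to a
    uniformly chosen neighbour. Nodes without neighbours send their mass
    back to the seed so that scores keep summing to 1.
    """
    nodes = list(adj)
    p = {v: 1.0 if v == seed else 0.0 for v in nodes}
    for _ in range(iters):
        nxt = {v: (alpha if v == seed else 0.0) for v in nodes}
        for u in nodes:
            nbrs = adj[u]
            if not nbrs:
                nxt[seed] += (1.0 - alpha) * p[u]
                continue
            share = (1.0 - alpha) * p[u] / len(nbrs)
            for v in nbrs:
                nxt[v] += share
        p = nxt
    return p


# Hypothetical four-protein path graph A - B - C - D, seeded at A.
toy_ppi = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
scores = personalized_pagerank(toy_ppi, "A")
```

Running this once per protein gives each node a vector of scores over the network, which is the kind of topology-derived feature that could then be reduced in dimension and combined with sequence embeddings.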
bioinformatics 2026-03-17 v1
CROCHET: a versatile pipeline for automated analysis and visual atlas creation from single-cell spatialomic data
Bozorgui, B.; Thibault, G.; Yuan, C.; Dereli, Z.; Wang, H.; Overman, M. J.; Weinstein, J. N.; Korkut, A.
Abstract
Spatial biology technologies offer a unique opportunity to link tissue composition with function. However, analytical methods for quantifying and interpreting highly complex spatial data remain limited. We present CROCHET (ChaRacterization Of Cellular HEterogeneity in Tissues), an end-to-end analysis pipeline for construction of spatially resolved cell atlases from raw data covering millions of cells across large sample cohorts. Its modular architecture supports the integration of diverse data modalities and novel analytical methods for image processing and segmentation, spatialomics quantification and downstream analyses. With comprehensive, open-source, user-friendly, interactive, and visual analysis modules, CROCHET aims to democratize spatial omics for a broad community of users.
bioinformatics 2026-03-17 v1
OmicClaw: executable and reproducible natural-language multi-omics analysis over the unified OmicVerse ecosystem.
Zeng, Z.; Wang, X.; Luo, Z.; Zheng, Y.; Hu, L.; Xing, C.; Du, H.
Abstract
Advances in bulk, single-cell and spatial omics have transformed biological discovery, yet analysis remains fragmented across packages with incompatible interfaces, heterogeneous dependencies and limited workflow reproducibility. Here, we present OmicClaw, an executable natural-language framework for multi-omics analysis built on the unified OmicVerse ecosystem and the J.A.R.V.I.S. runtime. OmicVerse organizes upstream processing, preprocessing, single-cell, spatial, bulk-transcriptomic and foundation-model workflows into a shared AnnData-centered interface spanning more than 100 methods. J.A.R.V.I.S. converts this ecosystem into a bounded analytical action space by exposing more than 200 registered functions and classes through a registry-grounded, state-aware and recoverable execution layer that validates prerequisites, preserves provenance and supports iterative repair. Rather than relying on unconstrained code generation, OmicClaw translates user requests into traceable workflows over live omics objects. Across a benchmark of 15 tasks spanning scRNA-seq, spatial transcriptomics, RNA velocity, scATAC-seq, CITE-seq and multiome analysis, ov.Agent improved rubric-based performance over bare one-shot large language model baselines, particularly for long-horizon multi-step workflows. OmicClaw further supports external agent access through an MCP-compatible server and a beginner-friendly web platform for interactive analysis, code execution and million-scale visualization. Together, OmicClaw provides a practical foundation for reproducible human-AI collaboration in modern multi-omics research. OmicClaw is ready to use at https://github.com/Starlitnightly/omicverse
bioinformatics 2026-03-17 v1
Flipper: An advanced framework for identifying differential RNA binding behavior with eCLIP data
Flanagan, K.; Xu, S.; Yeo, G. W.
Abstract
Motivation: Crosslinking and immunoprecipitation (CLIP) methods remain the gold standard for characterizing RNA binding protein (RBP) behavior. As a result, many researchers rely on CLIP to assess how treatments targeting RBPs alter binding patterns and regulatory activity. However, current tools for differential RBP binding analysis lack core features required for rigorous statistical inference, including proper normalization and appropriate handling of replicate experiments. Furthermore, existing approaches cannot adequately separate expression-driven effects from true changes in RBP binding, complicating interpretation of differential analyses. Addressing these limitations is essential for producing reproducible and informative analyses of differential RBP binding. Results: Here we present Flipper, an application purpose-built for the analysis of differential RBP binding. Flipper introduces several innovations that adapt the DESeq2 framework for robust differential analysis of eCLIP count data. These include integration of input controls to account for expression-driven binding shifts, hierarchical normalization strategies that adjust for technical variation without confounding signal-to-noise ratios, and improved post-differential analysis tools. We demonstrate that Flipper exhibits high specificity when applied to real differential eCLIP data while also providing deeper biological insights. In addition, analyses of both real and simulated data indicate that Flipper achieves superior sensitivity and precision compared with existing approaches. Together, these results highlight Flipper as a robust and generalizable framework for differential RBP binding analysis.
bioinformatics 2026-03-15 v1
Bayesian AMMI-Based Simulation of Genotype x Environment Interactions
Lee, H.; Segae, V. S.; Garcia-Abadillo, J.; de Oliveira Bussiman, F.; Trujano Chavez, M. Z.; Hidalgo, J.; Jarquin, D.
Abstract
Genotype-by-environment interaction (GEI) has been studied to identify environment-stable/favorable genotypes. The GEI simulation could help refine the inference by incorporating tangible factors such as genomic and environmental information. The Bayesian additive main effect and multiplicative interaction (Bayesian AMMI) model captures the genotype-specific responses across environments, reflecting directional relationships between genotypes and environments. Thus, we propose a Bayesian AMMI-based GEI simulation framework that utilizes high-throughput environmental covariance matrices to generate GEI effects with interpretable directional structure. To demonstrate the proposed approach, two simulated phenotypes were assessed under four levels of GEI variance. In the first simulation (Sim1), GEI effects were sampled from a multivariate normal distribution defined by the GEI matrix. In the second simulation (Sim2), GEI effects were generated by extending Sim1 with the Bayesian AMMI model. In both simulations, increasing GEI variance resulted in lower correlations of phenotypes across environments and stronger genotype-specific sensitivity to environmental variation. Across five cross-validation designs, models accounting for GEI consistently outperformed one that did not, with prediction accuracy generally decreasing as GEI variance increased. Clear distinctions between the two simulated phenotypes were evident from biplot analyses: Sim2 successfully captured environmental relatedness and genotype-specific responses, whereas such structure was absent in Sim1. These results demonstrate that the proposed Bayesian AMMI-based GEI simulation framework enables interpretable visualization of GEI and supports genomic selection strategies under complex environmental conditions.
bioinformatics2026-03-15v1Asymmetric Contrastive Objectives for Efficient Phenotypic Screening
Nightingale, L.; Tuersley, J.; Warchal, S.; Cairoli, A.; Howes, J.; Shand, C.; Powell, A.; Green, D.; Strange, A.; Howell, M.Abstract
Phenotypic screening experiments produce many microscope images of cells under diverse perturbations, with biologically significant responses often subtle or difficult to identify visually. A central challenge is to extract image representations that distinguish activity from controls and group phenotypically similar perturbations. In this work we propose new adaptations of contrastive loss functions that incorporate experimental metadata as learned class vectors, and a geometrically inspired variant, called SPC, where class vectors are confined to the unit sphere and updated only by attractive terms (allowing more overlap of phenotypically similar classes). The approach is tested on two popular benchmarking datasets, BBBC021 and RxRx3-core; and we also evaluate performance on uncurated screens of HaCaT cells to gauge effectiveness in a realistic use-case scenario. We find we outperform prior methods across the three datasets and on a wide array of metrics measuring phenotype grouping, biological recall, drug-target interaction and mechanism-of-action inference. We also show we maintain this improved performance compared to models over 10x larger in parameter count, and that SPC can be used as an effective fine-tuning technique. The method is easy to implement and is well suited to settings with limited data or compute resources.
bioinformatics2026-03-14v3A Multi-Omics Processing Pipeline (MOPP) for Extracting Taxonomic and Functional Insights from Metaribosome Profiling (metaRibo-Seq) data
Weng, Y.; Moyne, O.; Walker, C.; Haddad, E.; Lieng, C.; Chin, L.; Rahman, G.; McDonald, D.; Knight, R.; Zengler, K.Abstract
Metaribosome profiling (metaRibo-Seq) enables genome-wide measurement of translation across complex microbial communities by sequencing ribosome-protected mRNA fragments, but the short length of these footprints creates substantial nonspecific mapping against large reference genome collections, leading to spurious taxonomic and functional assignments. Here we present MOPP (Multi-Omics Processing Pipeline), a modular reference-based workflow that denoises metaRibo-Seq data by leveraging matched metagenomic coverage breadth to identify genomes likely to be truly present in a sample before aligning metatranslatomic and optional metatranscriptomic reads. MOPP generates taxon-by-gene count tables across genomic, transcriptional and translational layers, enabling integrated downstream analyses of microbial function. We evaluated MOPP using a defined 79-member synthetic human gut community profiled by metagenomics and metaRibo-Seq. Coverage breadth filtering markedly improved detection accuracy relative to a standard baseline workflow, with performance remaining robust across a broad intermediate threshold range and peaking at 92-95% coverage breadth. At a 92% threshold, MOPP reduced the number of distinct detected operational genomic units by 99.4% while retaining 87.8% of aligned metaRibo-Seq reads on average, and increased the F1 score from 0.02 to 0.61. Residual false positives were predominantly attributable to genomes with extremely high nucleotide similarity to true community members, whereas false negatives were enriched among low-abundance taxa, indicating that remaining errors are driven primarily by biological similarity and detection limits rather than widespread nonspecific mapping. 
Together, these results establish MOPP as a high-throughput workflow for robust processing of metaRibo-Seq in the context of matched metagenomics and position it as a scalable framework for integrated taxonomic and functional analysis of microbial communities across genomic, transcriptional and translational layers.
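As an illustration of the coverage-breadth filtering idea described in the MOPP abstract above, the following is a minimal sketch only. It assumes that per-genome covered-base counts from the matched metagenome are already available. The function names, input dictionaries, and toy numbers are hypothetical and do not reflect MOPP's actual interface or data; only the 92% threshold is taken from the abstract.

```python
def coverage_breadth(covered_bases: int, genome_length: int) -> float:
    """Fraction of a genome's bases covered by at least one metagenomic read."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return covered_bases / genome_length


def select_present_genomes(covered, lengths, threshold=0.92):
    """Keep genome IDs whose coverage breadth meets the threshold.

    Hypothetical helper: `covered` maps genome ID -> covered bases,
    `lengths` maps genome ID -> genome length.
    """
    return sorted(
        genome_id
        for genome_id, length in lengths.items()
        if coverage_breadth(covered.get(genome_id, 0), length) >= threshold
    )


# Toy inputs, made up for illustration (not data from the study).
lengths = {"genome_A": 1000, "genome_B": 1000, "genome_C": 2000}
covered = {"genome_A": 970, "genome_B": 400, "genome_C": 1900}

# Only genomes passing the breadth filter would be used as alignment targets.
present = select_present_genomes(covered, lengths)
print(present)
```

In this sketch, short footprint reads would then be aligned only against the retained genomes, which is the intuition behind the reduction in nonspecific mapping reported above.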
bioinformatics2026-03-14v1stMCP: Spatial Transcriptomics with a Model Context Protocol Server
Smith, J. J.; Wang, X.; McPheeters, M.; Widjaja-Adhi, M. A.; Littleton, S.; Saban, D.; Golczak, M.; Jenkins, M. W.Abstract
Spatial transcriptomics enables high-resolution mapping of gene expression in intact tissues but remains challenging due to complex computational workflows that limit accessibility and reproducibility. Here, we present a Model Context Protocol (MCP) framework enabling natural language-driven spatial transcriptomics analysis. By executing analytical tools locally, this architecture eliminates the need to upload massive datasets to large language models, bypassing high token costs and mitigating data privacy and training risks. The MCP orchestrator interprets intent, dynamically routes requests, maintains session state, and verifies input integrity to ensure reproducible execution. Benchmarking across biological discovery, orchestration accuracy, token usage, and execution time demonstrates robust performance. This architecture establishes a scalable template for AI-native research by standardizing the interface between models and local analytical engines. Rather than replacing bioinformaticians, this framework empowers biologists to independently and comprehensively explore their data, accelerating hypothesis testing, and unlocking broader biological discoveries.
bioinformatics2026-03-14v1Image Analysis Tools for Electron Microscopy
Shtengel, D.; Shtengel, G.; Xu, C. S.; Hess, H. F.Abstract
Electron Microscopy (EM) is widely used in many scientific fields, particularly in life sciences, offering high-resolution information on the ultrastructure of biological organisms. Accurate characterization of EM image quality is important for assessing the EM tool performance, in addition to sample preparation protocol, imaging conditions, etc. This paper provides an overview of tools we developed as plugins for the popular image processing package Fiji (ImageJ) (1). These tools include signal-to-noise ratio analysis, contrast evaluation, and resolution analysis, as well as the capability to import images acquired on custom FIB-SEM instruments (2). We have also made these tools available in Python, with both versions available on GitHub.
bioinformatics2026-03-14v1DisGeneFormer: Precise Disease Gene Prioritization by Integrating Local and Global Graph Attention
Koeksal, R.; Fritz, A.; Kumar, A.; Schmidts, M.; Tran, V. D.; Backofen, R.Abstract
Identifying genes associated with human diseases is essential for effective diagnosis and treatment. Experimentally identifying disease-causing genes is time-consuming and expensive. Computational prioritization methods aim to streamline this process by ranking genes based on their likelihood of association with a given disease. However, existing methods often report long ranked lists consisting of thousands of potential disease genes, often containing a high number of false positives. This fails to meet the practical needs of clinicians who require shorter, more precise candidate lists. To address this problem, we introduce DisGeneFormer (DGF), an end-to-end disease-gene prioritization pipeline. Our approach is based on two distinct graph representations, modeling gene and disease relationships, respectively. Each graph is first processed separately by graph attention and then jointly by a transformer module to combine within-graph and cross-graph knowledge through local and global attention. We propose an evaluation pipeline based on the precision of a top K ranked gene list, with K set to clinically feasible values between 5 and 50, relying solely on experimentally verified associations as ground truth. Our evaluation demonstrates that DGF substantially outperforms existing methods. We additionally assessed the influence of the negative data sampling strategy and analyzed the effect of graph topology and features on the performance of our model.
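The top-K precision metric mentioned in the DisGeneFormer abstract above can be sketched as follows. This is a generic precision-at-K computation under stated assumptions, not the authors' evaluation pipeline. The function name and the toy gene identifiers are hypothetical.

```python
def precision_at_k(ranked_genes, verified_genes, k):
    """Share of the top-k ranked genes that appear in the verified set."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = ranked_genes[:k]
    hits = sum(1 for gene in top_k if gene in verified_genes)
    # Dividing by k (not len(top_k)) penalizes lists shorter than k.
    return hits / k


# Toy ranking and ground truth, made up for illustration.
ranked = ["G1", "G2", "G3", "G4", "G5", "G6"]
verified = {"G1", "G4", "G9"}

p_at_5 = precision_at_k(ranked, verified, 5)
print(p_at_5)  # 2 of the top 5 genes are verified -> 0.4
```

Small values of K, such as the 5 to 50 range named in the abstract, focus the metric on the short candidate lists a clinician would actually review.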
bioinformatics2026-03-14v1Comprehensive long-read transcriptome analysis uncovers alternative RNA processing feature and isoform diversity in ovarian cancer progression
Liu, T.; Lv, J.; Wang, S.; Liu, Y.; Chen, Y.; Li, J.; Wang, L.; Shi, Y.; Li, N.; Ding, W.; Piao, Y.Abstract
Post-transcriptional processing plays a crucial yet largely unresolved role during the malignant progression of ovarian cancer, and its dynamics remain poorly characterized, largely because the limited read length of short-read RNA sequencing is insufficient to capture transcript diversity. Here, we performed Iso and RNA sequencing on paired normal, primary tumor, and metastatic samples, generating a comprehensive isoform atlas of over 41,000 full-length transcripts including many unannotated isoforms. Integrative analyses revealed extensive isoform-level remodeling across disease states that often occurred without concordant alterations at the gene level, emphasizing the importance of qualitative transcript regulation. Notably, we identified isoform-level alterations with distinct biological and clinical relevance, including differential expression of the short KRAS isoform, a tumor-specific isoform switch of TMEM201, and an alternative first-exon event in FNDC3B associated with poor survival. Together, these findings provide a high-resolution map of the ovarian cancer transcriptome and illustrate how long-read sequencing exposes multiple layers of post-transcriptional and clinical insight that remain hidden in conventional expression profiling.
bioinformatics2026-03-14v1ProteinMCP: An Agentic AI Framework for Autonomous Protein Engineering
Xu, X.; Feng, C.; Zha, C.; He, W.; He, M.; Xiao, B.; Gao, X.Abstract
Computational protein design is often constrained by slow, complex, inaccessible, highly sophisticated, and expert-dependent workflows that hinder its transferability and generalization to broader applications. We present ProteinMCP, an agentic AI framework designed to accelerate and democratize protein engineering. ProteinMCP automates end-to-end scientific tasks, delivering dramatic gains in efficiency; for instance, a comprehensive protein fitness modeling workflow was completed in just 11 minutes. This performance is achieved by an AI agent that intelligently orchestrates a unified ecosystem of 38 specialized tools, made accessible through a Model-Context-Protocol (MCP). A cornerstone of the framework is an automated pipeline that converts existing software into MCP-compliant servers, ensuring the platform is both powerful and perpetually extensible. We further demonstrate its capabilities through the successful autonomous design and selection of high-affinity de novo binders and therapeutic nanobodies. By removing technical barriers, ProteinMCP has the potential to shorten the design-build-test cycle and make advanced computational protein design accessible to the broader scientific community.
bioinformatics2026-03-14v1Fast, accurate construction of multiple sequence alignments from protein language embeddings
Hoang, M.; Armour-Garb, I.; Singh, M.Abstract
Multiple sequence alignment (MSA) is a foundational task in computational biology, underpinning protein structure prediction, evolutionary analysis, and domain annotation. Traditional MSA algorithms rely on pairwise amino acid substitution matrices derived from conserved protein families. While effective for aligning closely related sequences, these scoring schemes struggle in the low-identity "twilight zone." Here, we present a new approach for constructing MSAs leveraging amino acid embeddings generated by protein language models (PLMs), which capture rich evolutionary and contextual information from massive and diverse sequence datasets. We introduce a windowed reciprocal-weighted embedding similarity metric that is surprisingly effective in identifying corresponding amino acids across sequences. Building on this metric, we develop ARIES (Alignment via RecIprocal Embedding Similarity), an algorithm that constructs a PLM-generated template embedding and aligns each sequence to this template via dynamic time warping in order to build a global MSA. Across diverse benchmark datasets, ARIES achieves higher accuracies than existing state-of-the-art approaches, especially in low-identity regimes where traditional methods degrade, while scaling almost linearly with the number of sequences to be aligned. Together, these results provide the first large-scale demonstration of the power of PLMs for accurate and scalable MSA construction across protein families of varying sizes and levels of similarity, highlighting the potential of PLMs to transform comparative sequence analysis. Code availability: https://github.com/Singh-Lab/ARIES
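To make the dynamic time warping step named in the ARIES abstract above more concrete, here is a textbook DTW sketch over sequences of embedding vectors. It is an assumption-laden illustration, not the ARIES algorithm: it uses plain Euclidean distance between vectors in place of the windowed reciprocal-weighted similarity the authors describe, and the function name and toy one-dimensional "embeddings" are hypothetical.

```python
import math


def dtw_distance(seq_a, seq_b):
    """Classic dynamic time warping cost between two vector sequences.

    Each element of seq_a / seq_b is a tuple of floats (a stand-in for a
    per-residue embedding). Euclidean distance is an assumed local cost.
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = best cumulative cost aligning seq_a[:i] with seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # advance in seq_a only
                cost[i][j - 1],      # advance in seq_b only
                cost[i - 1][j - 1],  # advance in both
            )
    return cost[n][m]


# Toy example: the second sequence repeats one position of the template.
template = [(0.0,), (1.0,), (2.0,)]
query = [(0.0,), (1.0,), (1.0,), (2.0,)]
print(dtw_distance(template, query))  # the repeat is absorbed at zero cost
```

The quadratic table is filled once per sequence-to-template comparison, which is consistent with the near-linear scaling in the number of sequences that the abstract reports for template-based alignment.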
bioinformatics2026-03-13v3Per-residue optimisation of protein structures: Rapid alternative to optimisation with constrained alpha carbons
Schindler, O.; Bucekova, G.; Svoboda, T.; Svobodova, R.Abstract
In recent years, the number of known protein structures has increased significantly. Predictive algorithms and experimental methods provide the positions of protein residues relative to each other with high accuracy. However, the local quality of the protein structure, including bond lengths, angles, and positions of individual atoms, often lacks the same level of precision. For this reason, protein structures are usually optimised by a force field prior to their application in further research sensitive to structural quality. Protein structure optimisation, however, is computationally challenging. In this paper, we introduce a general method, Per-residue optimisation of protein structures: Rapid alternative to optimisation with constrained alpha carbons (PROPTIMUS RAPHAN). Rather than optimising the entire protein structure at once, PROPTIMUS RAPHAN divides the structure into overlapping residual substructures and optimises each substructure individually. This approach results in computational time that scales linearly with the size of the structure. Additionally, we present PROPTIMUS RAPHAN GFN-FF, a reference implementation of our method employing a generic, almost QM-accurate force field, GFN-FF. We tested PROPTIMUS RAPHAN GFN-FF on 461 AlphaFold DB structures and demonstrated that our approach achieves results comparable to the optimisation of the structure with constrained alpha carbons in significantly less time.
bioinformatics2026-03-13v2User-driven development and evaluation of an agentic framework for analysis of large pathway diagrams
Corradi, M.; Djidrovski, I.; Ladeira, L.; Staumont, B.; Verhoeven, A.; Sanz Serrano, J.; Rougny, A.; Vaez, A.; Hemedan, A.; Mazein, A.; Niarakis, A.; de Carvalho e Silva, A.; Auffray, C.; Wilighagen, E.; Kuchovska, E.; Schreiber, F.; Balaur, I.; Calzone, L.; Matthews, L.; Veschini, L.; Gillespie, M. E.; Kutmon, M.; Koenig, M.; van Welzen, M.; Hiroi, N.; Lopata, O.; Klemmer, P.; Overall, R.; Hofer, T.; Satagopam, V.; Schneider, R.; Teunis, M.; Geris, L.; Ostaszewski, M.Abstract
As biomedical knowledge keeps growing, resources storing available information multiply and grow in size and complexity. Such resources can be in the format of molecular interaction maps, which represent cellular and molecular processes under normal or pathological conditions. However, these maps can be complex and hard to navigate, especially to novice users. Large Language Models (LLMs), particularly in the form of agentic frameworks, have emerged as a promising technology to support this exploration. In this article, we describe a user-driven process of prototyping, development, and user testing of Llemy, an LLM-based system for exploring these molecular interaction maps. By involving domain experts from the very first prototyping in the form of a hackathon and collecting both fine-grained and general feedback on more refined versions, we were able to evaluate the perceived utility and quality of the developed system, in particular for summarising maps and pathways, as well as prioritise the development of future features. We recommend continued user-driven development and benchmarking to keep the community engaged. This will also facilitate the transition towards open-weight LLMs to support the needs of the open research environment in an ever-changing technology landscape.
bioinformatics2026-03-13v2Improving Local Ancestry Inference through Neural Networks
Medina Tretmanis, J.; Avila-Arcos, M. C.; Jay, F.; Huerta-Sanchez, E.Abstract
Motivation: Local Ancestry Inference (LAI) allows us to study evolutionary processes in admixed populations, uncover ancestry-specific disease risk factors, and better understand the demographic history of these populations. Many methods for LAI exist; however, these methods usually focus on cases of intercontinental admixture. In this work, we evaluate both existing and novel methods in challenging scenarios, such as downsampled reference panels, intracontinental admixture, and distant admixture events. Results: We present four novel LAI implementations based on neural network architectures, including Bidirectional Long Short-Term Memory and Transformer networks, which have not previously been used for LAI. We compare these novel implementations to existing methods for LAI across a variety of scenarios using the 1000 Genomes dataset and other synthetic datasets. We find that while all networks achieve high performance for intercontinental admixture scenarios, inference power is comparatively low for scenarios of intracontinental or distant admixture. We further show how our implementations achieve the best performance of all methods through specialized preprocessing and inference smoothing steps.
bioinformatics2026-03-13v1Descriptron-GBIF Annotator: A browser-based platform for crowdsourced morphological annotation of biodiversity images to help accelerate morphology based biodiversity data
Van Dam, A. R.; Hita Garcia, F.Abstract
The accelerating biodiversity crisis demands new approaches to taxonomic description that can scale beyond the capacity of professional taxonomists alone. We present the Descriptron-GBIF Annotator, a zero-installation, browser-based tool for morphological annotation of biodiversity specimen images retrieved directly from the Global Biodiversity Information Facility (GBIF). The application runs entirely client-side as a single HTML file, integrating SAM2 (Segment Anything Model 2) for AI-assisted segmentation, ontology-linked anatomical region templates covering 25 major taxonomic groups across 124 standardized views, 335 ontology Compact URI Expressions (CURIEs), with 745 possible ontology mentions, and structured trait attribute recording. The annotator supports multiple export formats including Darwin Core JSON, COCO JSON, traits CSV, and a novel JSON-LD knowledge graph linking specimens to anatomical regions and morphological traits via UBERON and domain-specific ontologies. A built-in Zenodo publishing pipeline enables users to deposit annotations as citable datasets with DOIs directly from the browser. Additionally users can also annotate images from Zenodo BioSysLit enabling annotation of taxonomic treatments directly. We position this public-facing tool as the first tier of a two-tier architecture complementing the Descriptron Portal, a GPU-accelerated professional workbench for taxonomists providing tools for fine-tuning AI models, geometric morphometrics, and automated species descriptions. Together, these tiers create a feedback loop where public annotations generate training data for expert AI models, while expert-validated outputs improve the public tool. 
This approach draws on the citizen science model pioneered by Notes from Nature and iNaturalist to engage diverse audiences in structured morphological data collection, addressing a critical gap in biodiversity informatics where specimen images exist in abundance but structured morphological annotations remain scarce. To learn more go here: https://descriptrongbifannotator.org
bioinformatics2026-03-13v1Cross-etiology transcriptomic conservation in hepatocellular carcinoma reveals opposing proliferation and hepatocyte-loss programs validated across cohorts
Romero, R.; Toledo, C.Abstract
Background: Hepatocellular carcinoma (HCC) arises from diverse etiologies, but the balance between conserved and specific transcriptomic programs remains unclear. Methods: HBV and HCV cohorts were analyzed using GSVA to quantify Hallmark shifts. Biology was distilled into proliferation (ProlifHub) and hepatocyte-loss (HepLoss) modules, forming a composite HCCStateScore. An HBV injury axis was adjusted for proliferative state (E2F/G2M). Validation was performed using GSE14520 and GEPIA3. Results: Hallmark analysis revealed conserved proliferative activation and hepatocyte function suppression across etiologies. In HBV-HCC, the injury axis remained significantly elevated after adjusting for proliferation (p≈0.0147), indicating an injury component independent of the cell cycle. HCCStateScore robustly separated tumor from non-tumor tissue (AUC≈0.986, p=0). GEPIA3 confirmed concordant expression and survival associations for module genes. Conclusions: HCC features conserved opposing proliferation and hepatocyte-loss programs. HBV-associated tumors retain a distinct injury-linked component not fully explained by cell division. This validated score provides a framework for cross-cohort analysis and mechanistic prioritization in liver cancer research.
bioinformatics2026-03-13v1An explainable boosting machine model for identifying artifacts caused by formalin-fixed paraffin embedding
Grether, V.; Goldstein, Z. R.; Shelton, J. M.; Chu, T. R.; Hooper, W. F.; Geiger, H.; Corvelo, A.; Martini, R.; Davis, M. B.; Robine, N.; Liao, W.Abstract
Background: Formalin-fixed paraffin-embedding (FFPE) is a widely used, cost-effective method for long-term storage of clinical samples. However, fixation is known to introduce damage to nucleic acids that can present as artifactual bases in sequencing otherwise absent from higher fidelity storage methods such as fresh freezing (FF). Various machine learning methods exist for filtering these variant artifacts, but benchmarking performance can be difficult without reliable truth sets. In this study, we employ a collection of 90 paired fresh-frozen and formalin-fixed paraffin embedded samples from the same tumor to robustly define real and FFPE-derived, artifactual variation and enable objective evaluation of filtering methods. To address existing shortcomings, we propose a novel explainable boosting machine (EBM) model that improves performance, can be easily updated with new data, requires modest computational resources, and is analysis pipeline agnostic, making it broadly accessible. Results: We evaluated several methods for limiting FFPE-derived variant artifacts using cohorts of B-cell lymphoma samples. We found capturing local context around variants to be a highly informative, under-utilized feature set not commonly incorporated into many existing machine learning methods. Consequently, we developed a novel algorithm, FIFA, for filtering FFPE artifacts, which uses an EBM model, an interpretable decision-tree-based learning algorithm, to address some of the existing shortcomings. We used four independent cohorts composed of paired lymphoma and cervical cancer samples and a breast cancer cell line with both FF and FFPE samples to define clearly annotated training and test sets and demonstrated improved performance over existing methods. Additionally, FIFA filtering increased relevant biological signals in FFPE breast cancer datasets distinct from the training and testing sets. 
The EBM framework employed by FIFA is computationally efficient, and its generalized additive modeling of features makes it straightforward to incorporate new data into existing models over time. Conclusions: Our novel FFPE variant artifact filtering tool, FIFA, is a marked improvement over existing methods. It can be easily implemented, post hoc, to supplement existing somatic calling pipelines; training and inference can be run quickly across most compute environments; and it can be easily updated online as new training data becomes available. Accordingly, FIFA represents an important advance in retrospective cancer genomics research by further enhancing access to the vast stores of FFPE-archived tumor samples currently in existence.
bioinformatics2026-03-13v1Fast and accurate resolution of ecDNA sequence using Cycle-Extractor
Faizrahnemoon, M.; Luebeck, J.; Hung, K. L.; Rao, S.; Prasad, G.; Tsz-Lo Wong, I.; G. Jones, M.; S. Mischel, P.; Y. Chang, H.; Zhu, K.; Bafna, V.Abstract
Extrachromosomal DNA (ecDNA) plays a key role in cancer pathology. EcDNAs mediate high oncogene amplification and expression and worse patient outcomes. Accurately determining the structure of these circular molecules is essential for understanding their function, yet reconstructing ecDNA cycles from sequencing data remains challenging. We introduce Cycle-Extractor (CE) for reconstruction. CE accepts a breakpoint graph derived from either short or long read sequencing data as input and extracts a cycle with the maximum length-weighted-copy-number. CE utilizes a mixed-integer linear program (MILP) and a separate traversal procedure, enabling fast optimization and compatibility with free solvers. We evaluated CE against CoRAL (long-read-based quadratic optimization), Decoil (long-reads), and AmpliconArchitect (AA for short reads) on both simulated data and real cancer cell lines. On simulated ecDNA, CE achieves performance comparable to CoRAL across three accuracy metrics and consistently outperforms AA and Decoil. On cancer cell lines, CE produces longer and heavier cycles than AA, and achieves performance similar to CoRAL. Moreover, CE is, on average, 40x faster than CoRAL. These results demonstrate that CE accurately reconstructs ecDNA from both short- and long-read sequencing data, while long-read inputs allow CE to recover more complete and higher-confidence ecDNA structures. CE improved the prediction of many ecDNA structures. On a PC3 ecDNA containing MYC, CE uses ONT data to reconstruct a substantially larger and higher-copy sequence (4.2 Mbp) compared to the short-read-derived reconstruction (690 Kbp). CRISPR-CATCH experiments confirm the presence of a large ecDNA molecule, validating the long-read-based CE reconstruction.
bioinformatics2026-03-13v1Multiscale conformational sampling of multidomain fusion proteins by a physics informed diffusion model
Su, Z.; Wang, B.; Wu, Y.Abstract
Multidomain fusion proteins, such as bispecific antibodies, rely on highly flexible linker regions for their therapeutic efficacy. Characterizing these vast conformational ensembles is crucial for rational drug design; however, while all-atom molecular dynamics (MD) is the traditional gold standard, its immense computational cost makes simulating large-scale domain motions prohibitive. Recently, deep generative diffusion models have emerged as a rapid alternative for sampling protein dynamics. Yet, being trained primarily on massive databases of structured, static domains, these generic models often lack the biophysical constraints required to thoroughly sample the large-scale dynamics of highly flexible multidomain architectures. To overcome this, we leverage microsecond MD trajectories of a multidomain protein construct with various linkers to train a multiscale diffusion framework utilizing an Equivariant Graph Neural Network (EGNN). To efficiently model the dynamics of the large molecular complexes, we employ a coarse-grained spatial graph that condenses rigid domains into center-of-mass anchors while preserving explicit backbone resolution for the flexible linker. By further integrating foundational rules in biophysics directly into both the training objective and the inference process, our model generates high-fidelity conformational ensembles that reproduce the thermodynamic distributions of long-timescale MD. This physics-informed approach provides a mathematically stable, highly scalable platform for the rapid multiscale characterization of flexible biologics, significantly accelerating the rational design of fusion protein therapeutics.
bioinformatics2026-03-13v1EnsAgent: a tool-ensemble multiple Agent system for robust annotation in spatial transcriptomics
Zhang, D.; Zhang, M.; Li, N.; Zheng, C.; Liang, L.; Ke, X.; Dong, Q.Abstract
Motivation: Automated domain annotation in spatially resolved transcriptomics (SRT) remains challenging since it depends on gene expression, morphology, and clinical conventions, which vary across cohorts and platforms. While Large Language Model (LLM)-driven agents show promise, current approaches typically condition semantic reasoning on static, single-method partitions. This reliance makes annotation pipelines fragile to upstream partition errors and prone to hallucinations when molecular evidence is ambiguous. A robust framework integrating ensemble intelligence with iterative, evidence-based reasoning is required to ensure reproducibility and accuracy. Results: We introduce EnsAgent, a tool-ensemble multi-agent system designed for robust SRT annotation. Uniquely, EnsAgent decouples structural partitioning from semantic labeling via a Consultation-Review workflow. A Tool-Runner Agent orchestrates a diverse portfolio of clustering algorithms via the Model Context Protocol (MCP), generating a consensus partition optimized by a multimodal Scoring Agent. Subsequently, a Proposer-Critic feedback loop coordinates four specialized experts (Marker, Pathway, Spatiality, and Visual) to formulate annotations with explicit evidence trails and uncertainty estimates. Benchmarking on three SRT datasets demonstrates that EnsAgent effectively neutralizes batch effects and resolves subtle tumor microenvironment niches missed by single-paradigm baselines, delivering state-of-the-art accuracy and interpretability. Availability and Implementation: EnsAgent is available at github.com/keviccz/ensAgent.
bioinformatics2026-03-13v1SAMWOOD: An automated method to measure wood cells along growth orientation
Verlingue, K.; Brunel, G.; Decombeix, A.-L.; Ramel, M.; Tresson, P.Abstract
Quantitative wood anatomy requires precise measurement of wood cells. This step is often laborious and limiting for further analysis. We introduce Samwood, a tool based on the zero-shot Segment-Anything Model to easily segment cells on microscopic images without the need for a training dataset. The reconstruction of cell files then allows for the analysis of wood along growth orientation and precise measurement of anatomical properties of the wood, such as lumen areas. We tested our pipeline on an example dataset of fossil woods featuring deformation, heterogeneous preservation, and frequent artefacts, to assess the robustness of our approach. The model achieves a precision of 0.78 and a recall of 0.80, often producing segmentation of better quality and more consistent than a human operator. This approach substantially reduces analysis time, minimizes operator bias, and provides a robust and extensible framework for large-scale anatomical studies. The complete code pipeline is available at https://github.com/umr-amap/samwood.
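For readers who want a single summary number from the precision and recall quoted in the Samwood abstract above, the harmonic mean (F1) can be computed as in this short sketch. The F1 value is derived arithmetic from the two reported figures and is not a result stated by the authors. The function name is hypothetical.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values quoted in the abstract: precision 0.78, recall 0.80.
f1 = f1_from_precision_recall(0.78, 0.80)
print(round(f1, 2))
```

Because the harmonic mean is dominated by the smaller of the two inputs, it stays close to the lower value whenever precision and recall are similar, as they are here.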
bioinformatics2026-03-13v1MetaResNet: Enhancing Microbiome-Based Disease Classification through Colormap Optimization and Imbalance Handling
Qureshi, A.; Wahid, A.; Qazi, S.; Khattak, H. A.; Hussain, S. F.Abstract
Image-based representations of metagenomic data enable convolutional neural network (CNN) applications for microbiome disease classification. However, the impact of colormap selection on model performance remains unexplored. Current approaches arbitrarily select visualization parameters despite evidence that colormap choices can suppress minority-class features in imbalanced microbiome datasets. This study systematically evaluates colormap effects on metagenomic disease classification to establish evidence-based visualization guidelines. We developed MetaResNet, a custom CNN architecture incorporating residual blocks and attention mechanisms, to assess five colormap schemes (Jet, YlGnBu, Reds, Paired and nipy_spectral) across four benchmark datasets: inflammatory bowel disease (n=110), colon cancer (n=121), women type 2 diabetes (n=96), and obesity (n=253). Class imbalance was addressed using Synthetic Minority Over-sampling Technique (SMOTE) versus class weighting strategies. Custom data augmentation preserved taxonomic abundance relationships while enhancing generalization. Performance evaluation employed F1-score, Receiver Operating Characteristic and Area Under the Curve (AUC-ROC), Matthews correlation coefficient (MCC), precision, and recall to address accuracy limitations in imbalanced scenarios. Results identified the Jet colormap coupled with SMOTE as the optimal global configuration, maximizing signal retention and achieving peak performance (AUC 1.00 in Colon). SMOTE significantly improved minority-class recall over class weighting (0.81±0.09 vs. 0.69±0.11, p=0.003). MetaResNet achieved performance comparable to current state-of-the-art frameworks, while statistically outperforming established deep learning baselines (e.g., DeepMicro, PopPhy-CNN; p=0.025) in discriminatory power (AUC), with peak values exceeding 0.96. 
These findings demonstrate that visualization efficacy is strategy-dependent, establishing MetaResNet as a robust framework for microbiome-based diagnostics that supports evidence-based visualization strategies for precision medicine.
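To illustrate why the abstract above reports MCC and recall rather than accuracy alone, here is a minimal Python sketch. The 90/10 label split, the variable names, and the helper functions are hypothetical and do not come from the MetaResNet paper; the class-weight formula is the common inverse-frequency heuristic, assumed here for illustration rather than confirmed as the authors' choice.

```python
import math
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    tp, tn, _, _ = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def mcc(y_true, y_pred):
    """Matthews correlation coefficient; defined as 0 when a margin is empty."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


# Hypothetical imbalanced cohort: 90 controls (0) and 10 cases (1).
y_true = [0] * 90 + [1] * 10
# A degenerate classifier that always predicts the majority class.
y_majority = [0] * 100

print("accuracy:", accuracy(y_true, y_majority))
print("MCC:", mcc(y_true, y_majority))
print("class weights:", balanced_class_weights(y_true))
```

The majority-class predictor scores high on accuracy while its MCC is zero, which is the failure mode that imbalance-aware metrics are meant to expose. Class weighting rescales the loss per class, whereas SMOTE synthesizes new minority samples; this sketch covers only the former.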
bioinformatics 2026-03-13 v1
Nanoscale Material Size Shapes Distinct Immune Transcriptional States Under Physiological Flow
Kovacevic, V.; Milivojevic Dimitrijevic, N.; Mihailovich, M.; Zivanovic, M.; Ivanovic, M.; Zivic, A.; Jankovic, M. G.; Kovacevic, A.; Zmrzljak, U. P.; Puac, F.; Filipovic, N.; Ljujic, B.
Abstract
Nanoscale materials interact with circulating immune cells, yet how material size and exposure complexity shape transcriptional state organization under physiological flow conditions remains poorly understood. Controlled microfluidic exposure is combined with single-cell RNA sequencing to examine how size-defined polystyrene nanoplastics (PSNPs; 40 nm, 200 nm) and their combination modulate transcriptional programs in primary human peripheral blood mononuclear cells (PBMCs) under dynamic flow conditions. Across immune populations, PSNP exposure induces a conserved translational and RNA-regulatory program, indicating a shared intracellular adaptation framework. Upon this backbone, innate and adaptive immune compartments exhibit distinct organizational principles. Monocytes undergo size-dependent, pathway-coherent state remodeling, whereas B cells and CD4+ T cells display distributed, lineage-preserving transcriptional tuning without discrete state transitions. Combined exposure to different particle sizes does not produce additive responses but instead generates integrated transcriptional states in monocytes, revealing non-linear immune adaptation to heterogeneous material cues. These findings demonstrate that nanoscale material size and exposure complexity shape immune transcriptional state architecture under physiological flow and establish a framework for understanding dynamic material-immune interfaces at single-cell resolution.
bioinformatics 2026-03-13 v1
Learning the All-Atom Equilibrium Distribution of Biomolecular Interactions at Scale
Wang, Y.; Xu, Y.; Li, W.; Yu, H.; Tan, W.; Li, S.; Huang, Q.; Chen, N.; Wu, X.; Wu, Q.; Liu, K.
Abstract
Biomolecular functions are governed by dynamic conformational ensembles rather than static structures. While models like AlphaFold have revolutionized static structure prediction, accurately capturing the equilibrium distribution of all-atom biomolecular interactions remains a significant challenge due to the high computational cost of molecular dynamics (MD). We present AnewSampling, a transferable generative foundation framework designed for the high-fidelity sampling of all-atom equilibrium distributions, which is the first model to faithfully reproduce MD at the all-atom level. It uses a novel quotient-space generative framework to ensure mathematical consistency and leverages the largest self-curated database of protein-ligand trajectories to date, with over 15 million conformations. Statistically, AnewSampling consistently outperforms all prior generative methods on the ATLAS monomer benchmark, and the all-atom capabilities of AnewSampling enable close statistical alignment with ground-truth MD for evaluating atomic biomolecular interactions in protein-ligand dynamics. Furthermore, AnewSampling successfully recovers coupled ligand and side-chain motions in CDK2 systems, overcoming a major sampling hurdle inherent to conventional MD. AnewSampling enables rapid exploration of conformational landscapes prior to intensive simulations, elucidating fundamental biophysical mechanisms and accelerating the broader design of functional biomolecules.
bioinformatics 2026-03-13 v1
SuperSurv: A Unified Framework for Machine Learning Ensembles in Survival Analysis
Lyu, Y.; Huang, X.; Lin, S. H.; Li, Z.
Abstract
This paper introduces SuperSurv, a user-friendly R package for building, evaluating, and interpreting ensemble models for right-censored survival data. Although many survival modeling methods are available, existing tools are often model-specific and lack a unified platform for systematically integrating, comparing, and ensembling heterogeneous learners. SuperSurv addresses this gap by providing a unified interface for diverse survival learners, including models that return full survival curves as well as methods that produce only risk scores. All learner outputs are mapped to calibrated survival probability curves on a common evaluation time grid, enabling direct comparison and ensemble construction across heterogeneous model classes. SuperSurv implements stacking of survival models using inverse-probability-of-censoring weighted (IPCW) Brier risk to estimate ensemble weights in the presence of right censoring. The framework integrates hyperparameter tuning, time-dependent benchmarking metrics, and visualization tools for survival model evaluation. In addition, the package provides post-hoc interpretability utilities based on SHAP values and supports covariate-adjusted restricted mean survival time (RMST) contrasts through g-computation. By bridging the gap between theoretical rigor and clinical application, SuperSurv offers researchers a comprehensive ecosystem for modern survival analysis. The SuperSurv package is open-source and freely available at https://github.com/yuelyu21/SuperSurv. An empirical example using the METABRIC breast cancer dataset demonstrates a complete workflow from model training and benchmarking to explainability and clinically interpretable survival contrasts.
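The stacking criterion mentioned in the abstract above can be sketched in a few lines. The following Python example is not SuperSurv code (SuperSurv is an R package, and its API is not shown here); it is a simplified illustration, under stated assumptions, of an IPCW Brier score at a single time point and a grid search for the blend weight of two learners. The function names, toy data, and predicted survival probabilities are all hypothetical, and the sketch assumes no tied observation times and uses in-sample predictions, whereas a real stacking procedure would use cross-validated predictions over a time grid.

```python
def censoring_km(times, events):
    """Kaplan-Meier estimate G of the censoring survival function.

    events[i] is 1 for an observed event and 0 for right censoring, so
    censored observations play the role of "events" for G.
    """
    steps = []
    surv = 1.0
    for u in sorted({t for t, e in zip(times, events) if e == 0}):
        at_risk = sum(1 for t in times if t >= u)
        censored = sum(1 for t, e in zip(times, events) if t == u and e == 0)
        surv *= 1.0 - censored / at_risk
        steps.append((u, surv))

    def G(t, strict=False):
        """Return G(t), or the left limit G(t-) when strict is True."""
        value = 1.0
        for u, s in steps:
            if u < t or (u == t and not strict):
                value = s
            else:
                break
        return value

    return G


def ipcw_brier(times, events, surv_probs, t):
    """IPCW Brier score at time t for predicted survival probabilities S_i(t)."""
    G = censoring_km(times, events)
    total = 0.0
    for T, e, s in zip(times, events, surv_probs):
        if T <= t and e == 1:
            # Event observed by t: true status is 0, weight by 1 / G(T-).
            total += s ** 2 / G(T, strict=True)
        elif T > t:
            # Still at risk at t: true status is 1, weight by 1 / G(t).
            total += (1.0 - s) ** 2 / G(t)
        # Subjects censored before t contribute zero.
    return total / len(times)


def stack_two(times, events, preds_a, preds_b, t, grid=21):
    """Grid search for the convex blend weight that minimizes IPCW Brier risk."""
    best_w, best_risk = 0.0, float("inf")
    for k in range(grid):
        w = k / (grid - 1)
        blend = [w * a + (1.0 - w) * b for a, b in zip(preds_a, preds_b)]
        risk = ipcw_brier(times, events, blend, t)
        if risk < best_risk:
            best_w, best_risk = w, risk
    return best_w, best_risk


# Hypothetical toy data: follow-up times, event indicators, and two
# learners' predicted survival probabilities at t = 6.
times = [2, 3, 5, 7, 8]
events = [1, 0, 1, 1, 0]
preds_a = [0.1, 0.5, 0.3, 0.8, 0.9]
preds_b = [0.4, 0.5, 0.6, 0.6, 0.7]

risk_a = ipcw_brier(times, events, preds_a, 6)
risk_b = ipcw_brier(times, events, preds_b, 6)
weight, risk = stack_two(times, events, preds_a, preds_b, 6)
print(risk_a, risk_b, weight, risk)
```

Because the grid includes the weights 0 and 1, the blended risk can never exceed that of either learner on the data used to choose the weight; that guarantee does not carry over to held-out data, which is why cross-validated predictions are normally used.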
bioinformatics 2026-03-13 v1
DNA-MGC+: A versatile codec for reliable and resource-efficient data storage on synthetic DNA
Khabbaz, R.; Mateos, J.; Antonini, M.; Kas Hanna, S.
Abstract
The biochemical processes underlying DNA data storage, including synthesis, amplification, and sequencing, are inherently noisy. Consequently, base-level insertion, deletion, and substitution (IDS) errors, as well as sequence-level dropouts, occur and pose major challenges for reliable data retrieval. Here we introduce DNA-MGC+, a DNA storage codec designed to enable reliable and resource-efficient data retrieval under diverse operating conditions. We evaluate DNA-MGC+ across a wide range of in silico and in vitro settings, including experiments with both Illumina and Nanopore sequencing, and show that it consistently outperforms existing codecs. In particular, DNA-MGC+ achieves simultaneous gains in sequencing depth requirements, read cost, decoding time, storage density, and error-correction capability under explicit reliability constraints. Notable results include reliable decoding under IDS error rates of up to 24% in synthetic scenarios, and reliable retrieval at sequencing depths below 3x with read costs below 3.5 bits/nt under electrochemical synthesis for both Illumina and Nanopore sequencing.
bioinformatics 2026-03-13 v1
StrainVis: interactive visual strain-level analysis of microbiome data
Paz, I.; Ley, R. E.; Enav, H.
Abstract
Background: Microbiomes contain multiple conspecific strains whose genomic differences arise from both single nucleotide variants (SNVs) and structural variation (insertions, deletions, recombination). Recently, computational tools to assess strain-level differences became available, based either on average nucleotide identity (ANI) or on the average pairwise synteny score (APSS) of strains, which are sensitive to SNVs and to structural variation, respectively. However, strain-level analyses remain technically challenging and fragmented across approaches, and combining these complementary signals typically requires substantial bioinformatic expertise. Results: Here we present StrainVis, a web-based analysis and visualization platform that integrates outputs from both ANI- and APSS-based strain tracking tools to enable unified, interactive exploration of within-species diversity. StrainVis allows users to perform per-species and multi-species comparisons, incorporate metadata and gene annotations, and generate statistical summaries and publication-ready figures without programming. Conclusions: By lowering technical barriers and enabling joint interpretation of sequence and structural variation, StrainVis makes advanced strain-level microbiome analysis accessible to a broader community and facilitates discovery of evolutionary patterns that would be missed by single-method approaches alone.
bioinformatics 2026-03-13 v1
Orally Delivered dsRNA-Derived siRNAs Reach the Central Nervous System in Leptinotarsa decemlineata
Amineni, V. P. S.; Cedden, D.
Abstract
RNA interference (RNAi) has emerged as an eco-friendly approach to pest management and relies on the processing of exogenous double-stranded RNA (dsRNA). RNAi-based pest management is highly effective in the Colorado potato beetle (Leptinotarsa decemlineata); however, the tissue-specific distribution and processing of exogenous dsRNA following oral uptake remain incompletely understood. In this study, we investigated whether ingested dsRNA reaches the central nervous system (CNS) and is processed into active small interfering RNAs (siRNAs). Adult beetles were fed dsmGFP-coated leaf disks, and RISC-bound small RNAs were isolated from midgut, CNS, and remaining body tissues using a RISC-enrichment approach. Small RNA sequencing revealed abundant 21-nucleotide antisense guide-strand siRNAs in all analysed tissues, with relative proportions following the order midgut > CNS > remaining tissues. Notably, antisense siRNAs of consistent size were detected in CNS samples, indicating that exogenous dsRNA or its processed products can access neural tissue and enter the RNAi silencing machinery. These findings provide strong biochemical evidence that orally taken-up dsRNA is processed into AGO-loaded siRNAs in the L. decemlineata CNS. Together, our results offer a tissue-resolved view of functional RNAi activity in this species and contribute to a mechanistic understanding of systemic dsRNA transport in coleopteran pests.
bioinformatics 2026-03-13 v1
Context-dependent genetic regulation of gene expression in pigs
Wang, F.; Wang, C.; Teng, J.; Fang, L.; Ionita-Laza, I.
Abstract
Production livestock provide a natural system for studying gene regulation under physiologically demanding conditions shaped by rapid growth, environmental exposure, and immune challenges. Using farm pigs from the PigGTEx resource as a model, we applied quantile regression to uncover latent, context-dependent genetic effects on gene expression across tissues. This approach identifies quantile-specific expression quantitative trait loci (eQTLs) that are not detected by standard linear regression and are enriched in distal regulatory elements and three-dimensional genome architectural features rather than promoter-proximal regions. Genes with quantile-dependent eQTLs are more intolerant to loss-of-function variants and exhibit stronger enrichment in GO functional categories, indicating their likely functional significance. Cross-species comparisons reveal substantial overlap between pig and human eGenes across tissues, indicating conservation of regulatory architecture. Notably, many quantile-specific eQTLs influence tail expression states and involve genes relevant to human disease. For example, we identify a cis-eQTL affecting the conserved transcriptional regulator BCL6B in pig blood that modulates enhancer activity and reduces expression at lower quantiles. In contrast, BCL6B is minimally expressed in resting human blood and lacks detectable cis-regulatory variation under baseline conditions, consistent with its reported induction during immune activation. These findings demonstrate that pig eQTL maps can reveal context-dependent regulatory variation at loci that remain silent or weakly variable in human cohorts.
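The idea of a quantile-specific eQTL in the abstract above can be made concrete with a small worked example. The Python sketch below fits a quantile regression line by minimizing the pinball (check) loss; it is not the authors' pipeline. The genotype dosages, expression values, and function names are hypothetical, and the fitting routine is a brute-force search over lines through pairs of data points, which relies on the fact that a two-parameter quantile regression has an optimal solution interpolating two observations. It is practical only for tiny toy datasets.

```python
def pinball_loss(residual, tau):
    """Check loss for quantile level tau (0 < tau < 1)."""
    return tau * residual if residual >= 0 else (tau - 1.0) * residual


def fit_quantile_line(x, y, tau):
    """Return (intercept, slope) minimizing total pinball loss.

    Brute force: evaluate every line through two points with distinct x
    and keep the one with the smallest loss.
    """
    best = None
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            if x[i] == x[j]:
                continue  # vertical line, not a valid regression fit
            slope = (y[j] - y[i]) / (x[j] - x[i])
            intercept = y[i] - slope * x[i]
            loss = sum(
                pinball_loss(yk - (intercept + slope * xk), tau)
                for xk, yk in zip(x, y)
            )
            if best is None or loss < best[0]:
                best = (loss, intercept, slope)
    if best is None:
        raise ValueError("need at least two distinct x values")
    return best[1], best[2]


# Hypothetical toy data: genotype dosage (0/1/2) and expression.
# The alternate allele lowers only the bottom of the expression
# distribution; the median is identical across genotypes.
genotype = [0] * 5 + [1] * 5 + [2] * 5
expression = [
    4.0, 5.0, 6.0, 7.0, 8.0,  # dosage 0
    2.0, 5.0, 6.0, 7.0, 8.0,  # dosage 1
    0.0, 5.0, 6.0, 7.0, 8.0,  # dosage 2
]

for tau in (0.1, 0.5):
    intercept, slope = fit_quantile_line(genotype, expression, tau)
    print(f"tau={tau}: intercept={intercept:.2f}, slope={slope:.2f}")
```

In this constructed example the genotype has no effect on median expression but a clear negative effect at the 0.1 quantile, which is the kind of tail-specific association a quantile-based scan is designed to pick up.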
bioinformatics 2026-03-13 v1
scprocess: a pipeline for processing, integrating and visualising atlas-scale single cell data
Koderman, M.; Pilarski, J.; Bianco, E.; Gonzalez, D.; Robinson, M. D.; Macnair, W.
Abstract
The transition toward "atlas-scale" single cell research has resulted in datasets comprising millions of cells across hundreds of samples, creating significant challenges for data management, computational efficiency, and reproducibility. While numerous methods are available for individual steps in single cell data processing, the highly complex nature of the analysis makes it challenging to maintain a clear record of every tool and parameter used. This makes final results difficult to reproduce, highlighting the need for a unified workflow that integrates multiple steps into a cohesive framework. We introduce scprocess, a Snakemake pipeline designed to streamline and automate the complex steps involved in processing single cell RNA sequencing data. Specifically optimized for data generated using the 10x Genomics technology, it provides a comprehensive solution that transforms raw sequencing files into standardized outputs suitable for a variety of downstream tasks. The pipeline is built to support the analysis of datasets comprising many (e.g. 100+) samples via a simple CLI, allowing researchers to efficiently explore their datasets while ensuring reproducibility and scalability in their workflows. scprocess can be installed via GitHub (https://github.com/marusakod/scprocess) under the MIT license. Documentation, including setup instructions and tutorials on example datasets, is available at https://marusakod.github.io/scprocess/.
bioinformatics 2026-03-13 v1
Expression-based annotation identifies and enables quantification of small vault RNAs (svtRNAs) in human cells
Sheppard, J. D.; Smircich, P.; Duhagon, M. A.; Fort, R. S.
Abstract
Background: Small non-coding RNAs (sncRNAs) play central roles in post-transcriptional gene regulation. In addition to canonical microRNAs (miRNAs), fragments derived from vault RNAs (vtRNAs), called small vault RNAs (svtRNAs), have been reported in human cells. However, the absence of a standardized annotation framework has hindered their systematic detection, quantification, and comparison across small RNA sequencing (small RNA-seq) studies. Methods: We developed an expression-based annotation strategy to identify svtRNAs from human small RNA-seq datasets. Using FlaiMapper followed by structure- and expression-based filtering, we generated two annotation sets: (i) a stringent miRNA-like set enriched in Argonaute-associated datasets, and (ii) a broader Total set derived from total small RNA-seq libraries under relaxed structural constraints. We explored the expression of the annotated svtRNAs across the datasets analyzed, which cover multiple normal and tumor-derived human cell lines and include Argonaute immunoprecipitation datasets. Results: We identified a repertoire of svtRNAs that are detected across independent datasets and, in several cases, reach abundance levels comparable to canonical miRNAs. Several highly abundant svtRNAs correspond to molecules with experimental validation from prior studies, supporting the robustness of our annotation strategy. Importantly, the same dominant (in terms of expression) svtRNAs emerged independently from Argonaute-associated and total small RNA datasets, supporting the idea of enzymatically consistent, reproducible svtRNA processing. We further identified svtRNAs derived from distinct vtRNA precursors that could share identical seed sequences, suggesting the possibility of svtRNA families with potential miRNA-like regulatory properties. We provide a standardized annotation that enables reproducible svtRNA quantification. Conclusions: Our study establishes a comprehensive expression-based annotation resource for human svtRNAs.
By enabling their systematic detection and reproducible quantification, we show that svtRNAs appear to represent an abundant component of the human small RNA landscape.
bioinformatics 2026-03-13 v1
STEVE: Single-cell Transcriptomics Expression Visualization and Evaluation
Torbenson, E. J.; Ma, X.; Lin, J.-R.; Garry, D.; Jameson, S. C.; Zhang, Z.; Niedernhofer, L. J.; Zhang, L.; Li, M.; Dong, X.
Abstract
Single-cell RNA sequencing (scRNA-seq) has become a key technology for characterizing cell-type heterogeneity in complex tissues. However, its utility depends on accurate and reproducible cell-type annotation, which remains a major analytical challenge. Although hundreds of computational tools have been developed for automated annotation, there is currently no systematic framework to evaluate annotation robustness in a dataset-specific manner or within the context of complete analytical pipelines. Here, we present STEVE (Single-cell Transcriptomics Expression Visualization and Evaluation), a quantitative framework designed to assess the accuracy, robustness, and reproducibility of cell-type annotation in scRNA-seq studies. STEVE implements three complementary in silico evaluation modules: (i) Subsampling Evaluation to quantify annotation stability under varying reference sizes and data partitions; (ii) Novel Cell Evaluation to assess the ability to detect previously unseen cell types; and (iii) Annotation Benchmarking to compare alternative annotation tools against ground-truth labels. In addition, STEVE includes a Reference Transfer Annotation module that enables cross-dataset cell-type mapping using external reference datasets. All modules are built upon a unified probabilistic framework that provides consistent confidence estimation across evaluation scenarios. We evaluated STEVE across four independent scRNA-seq datasets with experimentally defined or expert-curated cell-type labels. Our results show that annotation robustness is strongly influenced by the annotation method, biological separability, dataset complexity, and batch effects. STEVE provides a practical framework for quantifying annotation uncertainty and improving reproducibility in single-cell transcriptomic analyses. STEVE is freely available at GitHub (https://github.com/XiaoDongLab/STEVE).
bioinformatics 2026-03-13 v1
EoRNA2: Autonomous Data Discovery and Processing for Databasing of Gene Expression Data
Milne, L.; Simpson, C. G.; Guo, W.; Mayer, C.-D.; Milne, I.; Bayer, M.
Abstract
We describe a major new release of the EoRNA database, a gene expression database for barley based on public data, first published in 2021. EoRNA v.2 (https://ics.hutton.ac.uk/eorna2/index.html) features an order of magnitude more samples and is based on a new automated workflow of sample discovery and processing, which has enabled a dramatic scale-up of the original database. EoRNA v.2 also features a major rebuild of the web user interface with rich new functionality. All infrastructure-related code, database schemas, and web components are now species agnostic and publicly available for reuse with other taxa. A dedicated new reference transcript dataset has been created for EoRNA v.2, which is largely based on the recently published barley pan-transcriptome and represents the most comprehensive dataset of its kind to date.
bioinformatics 2026-03-13 v1
Immune Transcriptional Signatures Across Human Cardiomyopathy Subtypes: A Multi-Cohort Integrative Computational Analysis
Adegboyega, B. B.; Okorie, B.; Courage, P.
Abstract
Background: Heart failure, arrhythmia, and sudden cardiac death are common outcomes of cardiomyopathies, which are molecularly diverse heart muscle disorders marked by structural and functional myocardial dysfunction. The lack of sensitive molecular biomarkers that precede overt physiological deterioration makes early diagnosis difficult despite advancements in imaging and clinical classification. The immune transcriptional landscape across cardiomyopathy subtypes is still poorly understood, despite growing evidence implicating both innate and adaptive immune dysregulation, such as macrophage activation and T-cell and inflammatory cytokine networks, as active contributors to myocardial remodelling and disease progression. Methods: We performed a multi-cohort integrative transcriptomic analysis of 1,068 cardiac tissue samples from five publicly available GEO datasets (GSE57338, GSE5406, GSE36961, GSE141910, GSE47495) spanning dilated, ischemic, hypertrophic, and peripartum cardiomyopathy. Using a fully scripted R and Python pipeline, we conducted differential expression analysis (limma), immune cell deconvolution (xCell), pathway enrichment (clusterProfiler), weighted gene co-expression network analysis (WGCNA), and regularised machine learning classification (LASSO, Random Forest). Cross-dataset validation was performed in two independent cohorts on different microarray platforms. Results: Differential expression analysis identified 43 primary DEGs (FDR < 0.05, |log2FC| > 1.0), revealing a coherent immune-fibrotic program characterized by loss of anti-inflammatory macrophage markers (CD163, VSIG4), complement dysregulation (FCN3), innate interferon activation (IFI44L, IFIT2), and ECM remodelling (ASPN, SFRP4, LUM). xCell deconvolution identified coordinated depletion of adaptive immune populations in failing myocardium. WGCNA defined a fibrosis hub module (brown; CTSK, SULF1, SFRP4) and an immune collapse module (turquoise; MYD88, TNFRSF1A, LAPTM5).
A nine-gene LASSO classifier achieved a cross-validated AUC of 0.986, with HMOX2 as the top-discriminating feature, implicating ferroptosis in cardiomyocyte death. Cross-platform validation in an independent HCM cohort (GSE36961) demonstrated a directional concordance of 34.9%. Conclusions: This study defines a reproducible immune-fibrotic transcriptional signature of human cardiomyopathy, nominates HMOX2 and ferroptosis as central pathomechanisms, and provides a validated nine-gene biomarker panel for future translational investigation.
bioinformatics 2026-03-13 v1
ATAClone: Cancer Clone Identification and Copy Number Estimation from Single-cell ATAC-seq
Cain, L. D.; Trigos, A. S.
Abstract
Single-cell analyses of cancer typically begin by identifying distinct populations of cancer cells by unsupervised clustering. However, in many cases this clustering is explained simply by differences in DNA copy number, which affects the interpretation of differential expression results and tumour heterogeneity studies. To detect and estimate these differences in copy number, we have developed ATAClone. Applicable to both standalone and multiome scATAC-seq assays, ATAClone first identifies cancer cells with shared DNA copy number profiles (i.e. clones), then estimates their copy number jointly. Importantly, ATAClone can determine an optimal clustering resolution automatically using simulations. By utilising only stably accessible regions, ATAClone maximises copy number signal while minimising unrelated biological and technical noise. Additionally, by leveraging differences in total DNA between cells, ATAClone can infer absolute copy number, even in the presence of polyploidy. Using cancer cell mixture experiments, we verify the ability of ATAClone to accurately separate clones based on copy number differences. Moreover, using matched scATAC-seq and bulk whole genome sequencing, we show that copy number estimates from ATAClone are more accurate than those derived with existing methods, achieving Pearson correlations between 0.75 and 0.95 with the bulk-derived estimates. ATAClone represents an important tool for disentangling the genetic and non-genetic contributions to gene expression in cancer, providing deeper insight into the evolutionary history and adaptive forces driving a tumour.
bioinformatics 2026-03-13 v1