Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
Analysis of multicellular anatomical structures from spatial omics data using sosta
Gunz, S.; Crowell, H. L.; Robinson, M. D.
Spatial omics technologies enable high-resolution, large-scale quantification of molecular features while preserving the spatial context within tissues. Existing analysis methods largely focus on spatial arrangements of single cells, whereas biological function often emerges from multicellular arrangements. Here, we introduce structure-based analysis of spatial omics data, which focuses on the direct analysis of multicellular, anatomical structures. We illustrate this type of analysis using two publicly available datasets and provide sosta, an open-source Bioconductor package for broad community use.
bioinformatics · 2026-04-07 · v3
Representation Methods of Transcriptomics with Applications in Neuroimmune Biology
Abbasi, M.; Ochoa Zermeno, S.; Spendlove, M. D.; Tashi, Z.; Plaisier, C. L.; Bartelle, B. B.
Interpretable representations of gene expression are used to define cellular identities and the molecular programs active within cells, two related but distinct phenomena. In the case of microglia, a cell type with high transcriptomic, functional, and morphological heterogeneity, the predominant representation of transcriptomic data presumes the adoption of distinct molecular identities, despite a lack of easily separable transcriptional states. Here, we explore alternative transcriptomic representations by comparing two single-cell analysis methods: differential expression analysis for identities and co-expression network analysis for molecular programs. For microglia, co-expression network analysis identifies highly significant functional ontologies not resolved by differential expression analysis. The identified co-expression modules are preserved across transcriptomic datasets and suggest reducible functional programs that activate and modulate depending on context. We conclude that co-expression analysis constitutes a best practice for single-cell analysis of an individual cell type, and that describing microglia function as a set of concurrent molecular programs offers a more parsimonious model.
bioinformatics · 2026-04-07 · v1
Locat: Joint enrichment and depletion testing identifies localized marker genes in single-cell transcriptomics
Lewis, W. R.; Aizenbud, Y.; Strino, F.; Kluger, Y.; Parisi, F.
Several methods have been developed to identify marker genes that delineate cell populations in single-cell transcriptomic data, yet most emphasize enrichment within candidate populations without testing whether expression is significantly reduced outside those populations. We present Locat, a framework for identifying highly specific localized genes by testing whether expression is concentrated within compact regions of the cellular embedding and depleted elsewhere. For each gene, Locat fits weighted Gaussian mixture models to gene-specific and background densities, computes test statistics for concentration within compact regions and depletion outside those regions, and integrates the results into a unified localization score. Across synthetic benchmarks with controlled ground truth, Locat detects localized genes spanning uni-modal, multi-modal, and sparse expression patterns, and appropriately loses significance when simulated expression becomes indistinguishable from background structure. In biological datasets spanning developmental, perturbation, and differentiation contexts, Locat identifies compact marker sets that capture lineage organization, condition-specific programs, and temporal regulatory dynamics. Localized gene sets are often smaller than conventional feature selections such as highly variable genes, and embeddings constructed from localized gene sets tend to preserve separation of major cell populations and developmental programs. In murine dermis, embeddings computed using localized genes preserve differentiation and cell-cycle trajectories observed in the full dataset. In interferon-β-treated PBMCs, independent localization analysis of control and stimulated samples reveals stimulus-responsive programs and markers of shared immune populations without requiring batch correction or data integration. In retinoic acid-induced embryonic stem cell differentiation, localized genes exhibit reproducible stage-specific patterns across time points.
Together, these results demonstrate that jointly assessing concentration and depletion yields specific, interpretable marker genes that enable direct cross-condition and multi-sample comparisons across diverse biological settings.
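The core idea, requiring enrichment inside a candidate region jointly with depletion outside it, can be sketched with a toy detection-rate version using binomial tests against the pooled background rate. Locat itself fits weighted Gaussian mixtures on the embedding, so everything below (function name, score definition, simulated data) is an illustrative assumption, not the authors' code:

```python
import numpy as np
from scipy import stats

def joint_localization_test(detected, in_region):
    """Toy joint enrichment/depletion test: a gene counts as localized only
    if detection is enriched inside the region AND depleted outside it.
    `detected`: boolean, gene detected per cell; `in_region`: boolean mask."""
    k_in, n_in = int(detected[in_region].sum()), int(in_region.sum())
    k_out, n_out = int(detected[~in_region].sum()), int((~in_region).sum())
    p0 = detected.mean()  # pooled background detection rate
    # Enrichment: detection inside the region exceeds the background rate
    p_enrich = stats.binomtest(k_in, n_in, p0, alternative="greater").pvalue
    # Depletion: detection outside the region falls below the background rate
    p_deplete = stats.binomtest(k_out, n_out, p0, alternative="less").pvalue
    # A single score driven by the WEAKER of the two requirements
    score = -np.log10(max(p_enrich, p_deplete))
    return p_enrich, p_deplete, score

rng = np.random.default_rng(1)
in_region = np.arange(1000) < 150                      # 150 cells in the region
marker = rng.random(1000) < np.where(in_region, 0.8, 0.05)
p_e, p_d, s = joint_localization_test(marker, in_region)
```

Taking the weaker of the two p-values mirrors the paper's point: enrichment alone is not enough, the gene must also be significantly absent elsewhere.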
bioinformatics · 2026-04-07 · v1
STDrug enables spatially informed personalized drug repurposing from spatial transcriptomics
Yang, Y.; Unjitwattana, T.; Zhou, S.; Kadomoto, S.; Yang, X.; Chen, T.; Karaaslanli, A.; Du, Y.; Zhang, W.; Liang, H.; Guo, X.; Keller, E. T.; Garmire, L. X.
Drug repurposing offers a scalable route to accelerate therapeutic discovery, yet existing approaches based on single-cell RNA sequencing (scRNA-seq) often overlook spatial tissue context, limiting their ability to capture microenvironment-dependent drug responses. Here we present STDrug, a spatially informed computational framework that integrates spatial transcriptomics, graph-based modeling, and multimodal learning to enable patient-specific therapeutic prioritization. STDrug identifies and aligns disease and control spatial domains using graph convolutional networks and coherent point drift, and prioritizes candidate drugs through an integrative scoring scheme combining tumor-reversible gene signatures, perturbation-based reversal scores, and knowledge-guided gene weighting within a machine learning framework. By modeling spatial domain interactions alongside predicted drug efficacy and toxicity, STDrug generates robust patient-level drug scores. Across hepatocellular carcinoma and prostate cancer datasets, STDrug outperforms existing single-cell and spatial transcriptomics-based drug repurposing methods, achieving significantly improved predictive accuracy (AUCs=0.81-0.82) across patients. Validation using large-scale electronic health records and in vitro assays further supports the translational relevance of top-ranked candidates. Taken together, STDrug establishes a generalizable framework for incorporating spatial omics into therapeutic discovery, advancing spatially informed and personalized drug repurposing.
bioinformatics · 2026-04-07 · v1
A Context-Aware Single-Cell Proteomics Analysis pipeline
Salomo Coll, C.; Makar, A. N.; Brenes, A. J.; Inns, J.; Trost, M.; Rajan, N.; Wilkinson, S.; von Kriegsheim, A.
Single-cell proteomics (SCP) by mass spectrometry can now quantify hundreds to thousands of proteins per cell, but the field still lacks standardised analytical pipelines that accommodate the diversity of instruments, sample preparation workflows and biological contexts encountered in practice. Existing workflows, largely adapted from single-cell transcriptomics, do not account for the informative missingness, pervasive ambient protein contamination and limited feature space that distinguish proteomic from transcriptomic data. In addition, cell type annotation remains a manual bottleneck that is subjective, difficult to reproduce and hard to scale. Here we present an end-to-end pipeline that integrates adaptive quality control, entropy-guided iterative batch correction, multi-modal marker discovery that exploits detection patterns unique to proteomics, and context-aware annotation by large language models (LLMs) coupled to structured contradiction reasoning and orthogonal data-driven validation. Benchmarking on published single-cell proteomic datasets from developing human brain and glioblastoma-associated neutrophils revealed systematic LLM failure modes, including context-insensitive marker vocabulary and misinterpretation of phagocytic or lytic cell states. We addressed these errors using a three-round prompt architecture that combines general biological principles with auto-generated dataset-specific constraints. In held-out validation on a skin tumour dataset, the pipeline showed high concordance with FACS-sorted ground truth. In the caerulein-injured pancreas, orthogonal immunohistochemistry further supported annotations of macrophage, stellate and immune populations.
The pipeline is fully automated under fixed settings, and available as Context-Aware Single-Cell Proteomics Analysis (CASPA), providing SCP laboratories and facilities with a reproducible workflow that delivers interpretable, confidence-quantified annotations suitable for downstream expert review.
bioinformatics · 2026-04-07 · v1
DrugPlayGround: Benchmarking Large Language Models and Embeddings for Drug Discovery
Liu, T.; Jiang, S.; Zhang, F.; Sun, K.; Head-Gordon, T.; Zhao, H.
Large language models (LLMs) are in the ascendancy for research in drug discovery, offering unprecedented opportunities to reshape drug research by accelerating hypothesis generation, optimizing candidate prioritization, and enabling more scalable and cost-effective drug discovery pipelines. However, there is currently a lack of objective assessments of LLM performance to ascertain their advantages and limitations over traditional drug discovery platforms. To tackle this emergent problem, we have developed DrugPlayGround, a framework to evaluate and benchmark LLM performance for generating meaningful text-based descriptions of physicochemical drug characteristics, drug synergism, drug-protein interactions, and the physiological response to perturbations introduced by drug molecules. Moreover, DrugPlayGround is designed to work with domain experts to provide detailed explanations for justifying the predictions of LLMs, thereby testing the chemical and biological reasoning capabilities of LLMs and supporting their use across all stages of drug discovery.
bioinformatics · 2026-04-07 · v1
Correlation Between Information Entropy and Functions of Gene Sequences in the Evolutionary Context: A New Way to Construct Gene Regulatory Networks from Sequence
Pan, L.; Chen, M.; Tanik, M.
The information encoded in DNA sequences can be rigorously quantified using Shannon entropy and related measures. When placed in an evolutionary context, this quantification offers a principled yet underexplored route to constructing gene regulatory networks (GRNs) directly from sequence data. While most GRN inference methods rely exclusively on gene expression profiles, the regulatory code is ultimately written in the DNA sequence itself. Here we review the mathematical foundations of information theory as applied to gene sequences, survey existing computational methods for GRN inference with emphasis on information-theoretic and sequence-based approaches, and examine how evolutionary conservation constrains sequence entropy to preserve biological function. We then propose a four-layer integrative framework that combines per-position Shannon entropy profiles, evolutionary conservation scoring via Jensen-Shannon divergence, expression-based mutual information and transfer entropy, and DNA foundation model embeddings to construct GRNs from sequence. Through worked examples on the Escherichia coli SOS regulatory sub-network, we demonstrate how conservation-weighted mutual information improves edge discrimination and how transfer entropy resolves regulatory directionality. The framework generates testable predictions: edges supported by low-entropy regulatory regions should show higher experimental validation rates, and cross-species entropy profile conservation should predict GRN topology conservation. This work bridges three scales of biological information: nucleotide-level entropy, evolutionary constraint patterns, and network-level regulatory logic, establishing information entropy as the natural mathematical language for sequence-to-network regulatory inference.
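The two building blocks the review leans on, per-position Shannon entropy and Jensen-Shannon divergence as a conservation score, are straightforward to compute directly. The fragment below is a generic illustration; the sequences and the uniform background are made up, not taken from the paper:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def column_entropy(alignment):
    """Per-position Shannon entropy (bits) of an aligned set of sequences.
    Low entropy marks conserved, information-rich positions."""
    cols = np.asarray([list(s) for s in alignment]).T
    entropies = []
    for col in cols:
        _, counts = np.unique(col, return_counts=True)
        p = counts / counts.sum()
        entropies.append(-np.sum(p * np.log2(p)))
    return np.array(entropies)

# Hypothetical aligned promoter fragments (illustrative only)
seqs = ["ACGTAC", "ACGAAC", "ACGTTC", "ACGGAC"]
H = column_entropy(seqs)  # position 4 is variable, the rest are conserved

# Jensen-Shannon divergence between an observed per-position base
# distribution and a uniform background, as a conservation score
p = np.array([0.7, 0.1, 0.1, 0.1])      # observed base frequencies
q = np.array([0.25, 0.25, 0.25, 0.25])  # uniform background
jsd = jensenshannon(p, q, base=2) ** 2  # squared distance = divergence
```

Note that `scipy.spatial.distance.jensenshannon` returns the square root of the divergence, hence the squaring to recover the JSD itself.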
bioinformatics · 2026-04-07 · v1
Accurate estimation of canine inbreeding using ultra low-coverage whole genome sequencing
Pellegrini, M.; Kim, R.; Rubbi, L.; Kislik, G.; Smith, D.
The measurement of inbreeding has gained significance across diverse fields, including population and conservation genetics, agricultural genetics, breeding programs for animals and plants, and wildlife management. This is because inbreeding increases homozygosity and lowers genetic diversity, rendering populations more vulnerable to environmental changes, diseases, and other stressors. High or mid-coverage whole genome sequencing (WGS) has been widely used for inbreeding estimation, but it is resource-intensive. We aimed to investigate the use of ultra low-coverage whole genome sequencing (ulcWGS) as a cost-effective alternative for inbreeding analysis. Domestic dogs were used for our study as their extensive breeding histories lead to populations with a wide range of inbreeding levels. We constructed a multi-breed reference panel from high-coverage WGS samples. Inbreeding in independent ulcWGS samples was then estimated using runs of homozygosity (RoH) and inbreeding coefficients (F). We modeled the relationship between these measures and sequencing depth using nonlinear regression to generate depth-adjusted inbreeding estimates. The resulting depth-adjusted RoH and F measurements were significantly correlated, with purebred dogs exhibiting more runs of homozygosity and higher inbreeding coefficients compared to mixed-breed dogs. Our findings demonstrate that ulcWGS can provide reliable and economical estimations of inbreeding, expanding accessibility to genetic monitoring.
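The F statistic itself is standard; a minimal method-of-moments estimator, omitting the paper's depth-dependent nonlinear regression entirely, might look like the sketch below (function names and the simulation are illustrative, not the authors' pipeline):

```python
import numpy as np

def inbreeding_coefficient(genotypes, allele_freqs):
    """Method-of-moments inbreeding coefficient for one individual:
    F = 1 - O(het)/E(het), with E(het) the Hardy-Weinberg expectation.
    `genotypes`: 0/1/2 alternate-allele counts per SNP.
    `allele_freqs`: population alternate-allele frequency per SNP."""
    obs_het = np.mean(genotypes == 1)                         # observed heterozygosity
    exp_het = np.mean(2 * allele_freqs * (1 - allele_freqs))  # HWE expectation
    return 1.0 - obs_het / exp_het

rng = np.random.default_rng(2)
freqs = rng.uniform(0.05, 0.95, 10_000)
# Outbred individual: genotypes drawn at Hardy-Weinberg proportions
outbred = rng.binomial(2, freqs)
# Inbred individual: each locus autozygous with probability F = 0.25
auto = rng.random(10_000) < 0.25
hap = rng.binomial(1, freqs)                    # single allele, then doubled
inbred = np.where(auto, 2 * hap, rng.binomial(2, freqs))
f_out = inbreeding_coefficient(outbred, freqs)  # expected near 0
f_in = inbreeding_coefficient(inbred, freqs)    # expected near 0.25
```

Autozygosity removes a fixed fraction of heterozygous calls, which is exactly what the ratio O(het)/E(het) measures.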
bioinformatics · 2026-04-07 · v1
MitoChontrol: Adaptive mitochondrial filtering for robust single-cell RNA sequencing quality control
Strassburg, C.; Pitlor, D.; Singhi, A. D.; Gottschalk, R.; Uttam, S.
Mitochondrial transcript abundance is a standard quality control metric in single-cell RNA sequencing, but fixed percentage thresholds fail to account for the substantial variation in mitochondrial content across cell types and tissues, risking both retention of compromised cells and exclusion of transcriptionally active viable cell populations. We present MitoChontrol, a cell-type-aware probabilistic framework for mitochondrial quality control that models the mitochondrial transcript fraction within transcriptionally coherent clusters as a Gaussian mixture distribution. Compromised-cell components are identified from the upper tail of each cluster-specific distribution, and filtering thresholds are defined as the point at which the posterior probability of cellular compromise exceeds a user-defined confidence value. Applied to controlled perturbation experiments and a pancreatic ductal adenocarcinoma single-cell dataset, MitoChontrol selectively removes transcriptionally compromised cells while preserving biologically elevated but viable populations, outperforming fixed-threshold and outlier-based approaches.
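The mixture-plus-posterior-cutoff idea can be sketched for a single cluster as follows. This is a simplified stand-in assuming a two-component Gaussian mixture and a fixed posterior cutoff, not MitoChontrol's actual interface or model (which operates per transcriptionally coherent cluster):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def mito_threshold(mito_frac, posterior_cutoff=0.75):
    """Fit a two-component Gaussian mixture to the mitochondrial fraction,
    treat the higher-mean component as 'compromised', and return the point
    where its posterior probability first exceeds the cutoff."""
    x = np.asarray(mito_frac).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(x)
    bad = int(np.argmax(gmm.means_.ravel()))        # upper-tail component
    grid = np.linspace(x.min(), x.max(), 2000).reshape(-1, 1)
    post = gmm.predict_proba(grid)[:, bad]          # P(compromised | fraction)
    crossing = int(np.argmax(post >= posterior_cutoff))
    return float(grid[crossing, 0])

rng = np.random.default_rng(3)
healthy = rng.normal(0.05, 0.01, 2000)  # viable cells, ~5% mito reads
dying = rng.normal(0.25, 0.05, 200)     # compromised cells, elevated mito
thr = mito_threshold(np.concatenate([healthy, dying]).clip(0, 1))
```

The threshold adapts to whatever the cluster's own baseline is, which is the abstract's argument against fixed percentage cutoffs.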
bioinformatics · 2026-04-07 · v1
Estimation of metabolite levels in cheese from microbial gene expression
Mansouri, A.; Mekuli, R.; Swennen, D.; Durazzi, F.; Remondini, D.
Characterizing aroma and flavours generated during cheese production is of high relevance for the food industry. A deeper comprehension of flavour generation can be achieved by understanding the role of the microbial populations governing milk processing, and in particular their metabolic activity as reflected in gene expression. In this work we considered two independent experiments in which gene expression of the microbial population involved in cheese processing is sampled, together with final volatile products quantification. We estimated the final volatile compound profile from the measured metatranscriptomic expression by using machine learning with two different strategies for model training and validation, and we were able to associate specific biochemical pathways to the identified gene signatures.
bioinformatics · 2026-04-07 · v1
FunctionaL Assigning Sequence Homing (FLASH) maps phenotype to sequence with deep and machine learning
Cotter, D. J.; Harrison, M.-C.; Rustagi, A.; Wang, P. L.; Kokot, M.; Carey, A. F.; Deorowicz, S.; Salzman, J.
Genome-wide association studies (GWAS) map genetic variation to a reference genome and correlate variants to phenotypes. Yet, GWAS and similar procedures have limitations, including an inability to predict phenotype on variants never seen during the discovery phase and difficulty integrating structural variants. Deep and machine learning alternatives have not been successful at consistent prediction of resistance phenotypes (Hu et al. 2024). Here, we introduce FLASH: a new interpretable, statistically grounded deep learning framework that operates directly on raw sequencing reads. In over 35,000 isolates of bacteria, fungi and viruses, FLASH achieves uniformly high accuracy on independent test data, including on variation never seen in training, meeting or exceeding bespoke state-of-the-art methods. FLASH identifies canonical drug targets ab initio and new pan-species predictors of virulence, including those lacking annotation and those only partially aligned to NCBI reference databases. Further, FLASH can predict phenotypes beyond the possibility of GWAS, such as bacterial host range of phage, a task that to our knowledge is impossible today. FLASH is simple to run, highly efficient and constitutes a new approach for predicting gene function and phenotype across the tree of life. It is especially valuable when bioethical concerns and the vast genetic complexity of pathogenic microbes limit the feasibility of experimental validation.
bioinformatics · 2026-04-07 · v1
Flow molecular dynamics simulations reveal mechanosensitive regulation of von Willebrand factor through glycan-modulated autoinhibitory modules
Richard Louis, N. E. L.; Zhao, Y. C.; Ju, L. A.
Force-induced protein conformational changes govern many essential biological processes, yet their molecular mechanisms remain difficult to resolve. Von Willebrand factor (VWF), a central regulator of haemostasis, is activated by hydrodynamic forces in blood flow, but how mechanical signals propagate across its multidomain architecture is poorly understood. Here, we use flow molecular dynamics (FMD), a simulation framework that applies fluid forces via controlled solvent flow, to interrogate mechanosensitive proteins. Using VWF as a model system, we reconstructed the complete mechanomodule (D'D3A1A2A3; 1,109 residues) with native glycosylation by integrating crystallographic data and AlphaFold predictions. FMD simulations capture a force-driven transition from a compact, autoinhibited bird-nest ensemble to an extended, activated state, revealing asymmetric autoinhibitory strengths within the NAIM and CAIM modules of the A1 domain. By directly linking static structures to dynamic, force-regulated behaviour, this work establishes a generalizable platform for dissecting protein mechanosensitivity and enabling the rational design of force-responsive therapeutics.
bioinformatics · 2026-04-07 · v1
Integrative AlphaFold Modeling, Fragment Mapping, and Microsecond Molecular Dynamics Reveal Ligand-Specific Structural Plasticity at the Human Urotensin II Receptor
Torbey, A. G.
The peptide ligands Urotensin II (hUII, human) and hUII-related peptide (URP), and their cognate human receptor (hUT), are known for their implications in cardiovascular pathophysiology, yet the lack of experimentally resolved hUT structures has limited a deep mechanistic understanding of ligand binding and receptor activation. Here, we leverage recent breakthroughs in multistate AlphaFold predictions, long-timescale molecular dynamics (MD) simulations, and site identification by ligand competitive saturation (SILCS)-based pocket mapping to resolve ligand-bound conformations and illuminate the dynamic interaction of hUII and URP with hUT. By analyzing hUT dynamics in its intracellular transducer binding pocket, and residue-level interaction probabilities in each simulation, we capture subtle distinctions in the way hUII and URP anchor key pocket residues and modulate transmembrane (TM) domain tilts. Results indicate that hUII imposes stronger conformational constraints on TM5 and TM6 relative to URP, with the two ligands potentially stabilizing different active-like receptor configurations. At the same time, interaction maps highlight unique aromatic and polar networks that each ligand exploits. These findings reinforce the concept that relatively small differences in GPCR peptide ligand structure may lead to large effects on receptor-state selection and signal specificity, ultimately reflecting different clinical outcomes. By integrating computational modeling with per-residue dynamics, this work not only reconciles prior mutagenesis and docking data but also provides validated 3D models and MD simulations of the endogenous ligands bound to hUT, offering new opportunities to selectively harness ligand-dependent signaling in the urotensinergic system.
bioinformatics · 2026-04-07 · v1
REBEL, Reproducible Environment Builder for Explicit Library Resolution
Martelli, E.; Ratto, M. L.; Nuvolari, B.; Arigoni, M.; Tao, J.; Micocci, F. M. A.; Alessandri, L.
Background: Achieving FAIR-compliant computational research in bioinformatics is systematically undermined by two compounding challenges that existing tools leave unresolved: long-term reproducibility and accessibility. Standard package managers re-download dependencies from live repositories at every build, making environments vulnerable to library disappearance and version drift, and pinning a package version does not pin the versions of its transitive dependencies, causing divergences between builds performed at different points in time. Compounding this, packages from repositories such as CRAN, Bioconductor, and PyPI frequently omit critical system-level dependencies from their installation metadata, leaving users to manually discover which underlying library is missing or which version is required. Beyond these technical failures, constructing a truly reproducible environment demands expertise in containerization, making reproducibility in practice a privilege rather than a standard. Findings: We present REBEL (Reproducible Environment Builder for Explicit Library Resolution), a framework that addresses both challenges through three dependency inference heuristics: (i) Deep Inspection of source code, (ii) Fuzzy Matching against a manually curated knowledge base, and (iii) Conservative Dependency Locking. The resolved dependency stack is then archived into a self-contained local store, enabling offline and deterministic rebuilds at any future time. We compared the installation of 1,000 randomly sampled CRAN packages in isolated Docker containers using the standard package manager versus REBEL: REBEL resolved 149 of the 328 standard installation failures (45.4%). Moreover, through its DockerBuilder component, REBEL further generates fully reproducible Docker images from a plain text requirements file, making deterministic environment construction accessible without expertise in containerization.
Conclusions: REBEL provides a practical foundation for FAIR-compliant, long-term reproducible bioinformatics analyses, making deterministic environment construction accessible to researchers regardless of their technical background. REBEL is freely available at https://github.com/Rebel-Project-Core
Keywords: reproducibility, bioinformatics, dependency resolution, Docker, FAIR, software environments, package management
bioinformatics · 2026-04-07 · v1
Multistage Machine Learning Reveals Circadian Gene Programs and Supports a Retina-Choroid Axis in Myopia Development
Watcharapalakorn, A.; Poyomtip, T.; Tawonkasiwattanakun, P.; Dewi, P. K. K.; Thomrongsuwannakij, T.; Mahawan, T.
Purpose: To determine whether circadian timing defines critical molecular windows in myopia development and to assess the transferability of circadian gene programs across ocular tissues, disease stages, and species.
Methods: Publicly available retinal and choroidal RNA-seq datasets from chick models of form-deprivation myopia were analyzed using unsupervised transcriptomic profiling and multistage machine-learning classification. Circadian windows were defined based on Zeitgeber time, and samples were grouped accordingly for downstream analyses. Classification model robustness was evaluated through cross-tissue and cross-stage validation and further assessed using external validation in an independent dataset. Functional translation to humans was examined using ortholog-based Gene Ontology enrichment analysis to identify conserved biological processes and higher-order regulatory pathways.
Results: A circadian critical window at ZT8-ZT12 exhibited the strongest transcriptional divergence during both myopia onset and progression. Gene signatures derived from this window generalized across retina and choroid and remained predictive across disease stages, supporting coordinated molecular regulation between ocular tissues. External validation confirmed the reproducibility of these signatures despite differences in experimental design and gene coverage. Functional mapping revealed that conserved molecular components in chicks are reorganized into more complex neuroendocrine and regulatory networks in humans, indicating cross-species conservation with increased functional complexity.
Conclusions: Circadian timing strongly shapes myopia-related gene expression and underlies coordinated retina-choroid signaling. These findings highlight circadian biology as a key factor in refractive development and suggest that time-dependent mechanisms may influence myopia susceptibility, progression, and response to treatment.
bioinformatics · 2026-04-06 · v1
MolClaw: An Autonomous Agent with Hierarchical Skills for Drug Molecule Evaluation, Screening, and Optimization
Zhang, L.; Wang, L.; Sun, X.; Tang, W.; Su, H.; Qian, Y.; Yang, Q.; Li, Q.; Tang, Z.; Sun, H.; Han, Y.; Jiang, Y.; Lou, W.; Zhou, B.; Wang, X.; Bai, L.; Xie, Z.
Computational drug discovery, particularly the complex workflows of drug molecule screening and optimization, requires orchestrating dozens of specialized tools in multi-step workflows, yet current AI agents struggle to maintain robust performance and consistently underperform in these high-complexity scenarios. Here we present MolClaw, an autonomous agent that leads drug molecule evaluation, screening, and optimization. It unifies over 30 specialized domain resources through a three-tier hierarchical skill architecture (70 skills in total) that facilitates agent long-term interaction at runtime: tool-level skills standardize atomic operations, workflow-level skills compose them into validated pipelines with quality check and reflection, and a discipline-level skill supplies scientific principles governing planning and verification across all scenarios in the field. Additionally, we introduce MolBench, a benchmark comprising molecular screening, optimization, and end-to-end discovery challenges spanning 8 to 50+ sequential tool calls. MolClaw achieves state-of-the-art performance across all metrics, and ablation studies confirm that gains concentrate on tasks that demand structured workflows while vanishing on those solvable with ad hoc scripting, establishing workflow orchestration competence as the primary capability bottleneck for AI-driven drug discovery.
bioinformatics · 2026-04-06 · v1
From Parametric Guessing to Graph-Grounded Answers: Building Reliable ChatGPT-like tools for Plant Science
Itharajula, M.; Lim, S. C.; Mutwil, M.
Large language models (LLMs) are increasingly used by plant biologists to summarize literature, generate hypotheses, and interpret experimental results. However, LLMs are unreliable sources of exhaustive, source-attributed facts, a critical limitation for the list-style queries that pervade plant biology (e.g., "list all transcription factors regulating secondary cell wall (SCW) biosynthesis in Arabidopsis"). Here, we query ChatGPT, Claude, and Gemini with such queries and demonstrate that none return complete gene lists with reliable citations. We trace these failures to how LLMs store knowledge: as statistical patterns distributed across billions of internal parameters, with no mechanism to guarantee completeness, provenance, or reproducibility. We also review fine-tuning mitigation strategies, including multi-task instruction tuning, parameter-efficient methods, and context engineering, that alleviate but do not resolve these limitations. We then discuss retrieval-augmented generation (RAG), which feeds relevant documents to the LLM at query time, and argue that while it improves source attribution, it remains impractical when answers require synthesizing information scattered across hundreds of papers. As an alternative, we advocate graph retrieval-augmented generation (GraphRAG), in which the LLM serves as a reasoning and language interface over a structured, provenance-linked knowledge graph (KG) that returns complete result sets reproducibly. We outline a practical GraphRAG architecture and survey existing plant KG resources. Finally, we discuss open challenges, including entity disambiguation, relation normalization and evidence grading, and propose a roadmap for building open, continuously updated plant KGs that can turn "read 1,000 papers" into a single reproducible query.
bioinformatics · 2026-04-06 · v1
Statistical signals indicate a dependence between amino acid backbone conformation and the translated synonymous codon
Rosenberg, A.; Marx, A.; Bronstein, A. M.
Synonymous codons encode the same amino acid but can differ in their usage and translational properties. In previous work we reported statistical differences in backbone dihedral angle distributions associated with synonymous codons in the Escherichia coli proteome. This finding has been questioned due to concerns regarding the statistical methodology used. Here we revisit the dataset using corrected statistical procedures and alternative statistical tests. Across multiple frameworks, the real dataset consistently shows an excess of small p-values relative to randomized controls, indicating detectable codon-associated differences in backbone conformation.
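An "excess of small p-values relative to randomized controls" can be quantified generically, for instance with a Kolmogorov-Smirnov test against uniformity plus the observed/expected count below a significance cutoff. This is an illustration of the logic only, using simulated p-values, not the paper's corrected procedures:

```python
import numpy as np
from scipy import stats

def small_p_excess(pvalues, alpha=0.05):
    """Two checks for signal in a batch of p-values: (i) KS test against
    Uniform(0,1), the null when no codon effect exists, and (ii) the
    observed-to-expected ratio of p-values below alpha (>1 = enrichment)."""
    ks_p = stats.kstest(pvalues, "uniform").pvalue
    excess = (pvalues < alpha).mean() / alpha
    return ks_p, excess

rng = np.random.default_rng(4)
null_ps = rng.uniform(0, 1, 5000)                   # randomized control
signal_ps = np.concatenate([rng.beta(0.3, 1, 500),  # ~10% true effects
                            rng.uniform(0, 1, 4500)])
ks_null, ex_null = small_p_excess(null_ps)
ks_sig, ex_sig = small_p_excess(signal_ps)
```

Under the null both diagnostics stay unremarkable; even a modest fraction of true effects drives the KS p-value down and the excess ratio above one, which is the pattern the abstract reports for the real dataset versus randomized controls.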
bioinformatics · 2026-04-06 · v1
EV-Net: A computational framework to model extracellular vesicles-mediated communication
Torrejon, E.; Sleegers, J.; Matthiesen, R.; Macedo, M. P.; Baudot, A.; Machado de Oliveira, R.
Summary: Extracellular vesicles (EVs) are bilayer vesicles that carry a diverse cargo of molecules, such as nucleic acids, proteins and metabolites. These EVs can be transported throughout the organism to specific recipient tissues. For this reason, EVs have been recognized as pivotal mediators of cell-to-cell communication (CCC). Importantly, alterations in EV-mediated communication have been linked to pathological processes, further highlighting their biological relevance. However, the in silico exploration of the functional effects of EV cargo in recipient tissues remains limited due to the lack of dedicated tools that can be applied to EV omics datasets. Most current bioinformatics tools for assessing CCC rely on ligand-mediated communication and therefore cannot be used to explore EV-mediated communication. To address this gap, we developed EV-Net, a bioinformatics tool designed to explore the effects of EV cargo on recipient tissues. EV-Net was built by adapting NicheNet, a CCC bioinformatics tool that relies on ligand-receptor mediated communication, for the analysis of EV proteomics and RNA-seq data. The EV-Net framework enables the identification and prioritization of EV cargo molecules with high regulatory potential in a recipient tissue of interest. This prioritization facilitates the systematic translation of EV cargo profiles into testable biological hypotheses.
Availability and documentation: The source code of EV-Net is available on GitHub at https://github.com/torrejoNia/EV-Net alongside instructions on how to install it. Comprehensive tutorials and additional documentation are available at https://torrejonia.github.io/EV-Net/. The datasets used in the use cases were deposited in Zenodo; the corresponding Zenodo links are provided in the tutorials for each use case. This software is distributed under a GPL3 licence.
bioinformatics2026-04-06v1NovoTax: prokaryotic strain identification from mass spectrometry-based proteomics data
Svedberg, D.; Mateus, A.Abstract
Traditional mass spectrometry-based proteomics typically requires prior knowledge of sample composition to match spectra to peptides. However, recent de novo peptide sequencing approaches can infer peptide sequences directly from spectra, which can then be used to identify the source organism. Here, we introduce an end-to-end pipeline (NovoTax) to identify the closest prokaryotic genome directly from raw bottom-up proteomics data. The approach combines peptide sequencing tools with an optimized implementation of peptide searching through an extensive genome database. On a benchmark dataset of species isolates, we identified the reported species and strain in the majority of cases, and showed that in discordant cases NovoTax was likely correct. Interestingly, NovoTax was also able to identify contaminating species in some samples. The algorithm also identified the most abundant species in bacterial communities. In summary, NovoTax provides strain-level identification of microbial samples, enabling the downstream use of traditional proteomics search engines for a deeper proteome analysis.
bioinformatics2026-04-06v1Multimodal Fusion of Circular Functional Data on High-resolution Neuroretinal Phenotypes
Pyne, S.; Wainwright, B.; Ali, M. H.; Lee, H.; Ray, M. S.; Senthil, S.; Jammalamadaka, S. R.Abstract
Progressive optic neuropathies, particularly glaucoma, represent a significant global health challenge, and the need for precise understanding of the heterogeneous neurodegenerative phenotypes cannot be overstated. Here, we brought together two complementary sources of unstructured yet clinically relevant information about neuroretinal rim (NRR) thinning, a common clinical marker of such decay. These are based on a new dataset of fundus digital images and a corresponding one of optical coherence tomography, both collected from a large clinical cohort of healthy eyes. First, we represented them using a common data structure that imposed a high-resolution scale of 180 equally-spaced and registered measurements on a 360° circular axis. We modeled such NRR data-points of each eye as circular curves, and aligned these multimodal curves to obtain a fused NRR curve for each eye. Unsupervised clustering of these fused curves identified 4 clusters of eyes with structural heterogeneity, which were also found to have distinctive clinical covariates. The computation of functional derivatives revealed the troughs in the curves of each cluster. Using circular statistics, we estimated the directional distributions of such troughs as potentially clinically relevant regions of NRR decay. We also demonstrated that multimodal fusion leads to improvement in the robustness of baseline NRR data obtained from fundus imaging.
bioinformatics2026-04-06v1Domain classification of archaeal proteomes reveals conserved fold repertoire
Schaeffer, R. D.; Pei, J.; Guo, R.; Zhang, J.; Medvedev, K.; Cong, Q.; Grishin, N.Abstract
Archaea represent one of the three domains of cellular life and yet account for fewer than 1% of experimentally determined protein structures, leaving the extent of their structural novelty unknown. Here we present a systematic domain-level classification of 124,075 proteins from 65 archaeal classes spanning 21 phyla and all major lineages, using both AFDB and newly predicted AlphaFold3 structures classified against the Evolutionary Classification of protein Domains (ECOD). We assigned 204,758 domains, of which 76.8% received high-confidence classifications, spanning 987 ECOD X-groups, or 40% of known structural diversity, within a single domain of life. Clustering by Foldseek recovered structural relationships for 63% of domains that are singletons by sequence comparison. To characterize the 21% of proteins lacking high-confidence classification, we applied successive filters for structure prediction confidence, protein length, and structural cluster context, reducing 8,452 domain-free proteins to a small number of well-folded structural orphans (less than 0.1% of the dataset). The unclassified fraction is dominated by sub-threshold matches to known folds (14% of all proteins) and low-confidence structure predictions (5%), not by novel structures. These results demonstrate that the protein fold repertoire at the single-domain level is broadly conserved across the deepest phylogenetic distances in cellular life, and that the gap between archaeal and well-characterized proteomes reflects classification sensitivity for divergent sequences rather than unexplored structural diversity.
bioinformatics2026-04-06v1BABAPPASnake: a workflow for episodic selection analysis with robustness-aware summaries
Singha, S.; Panda, P.; Panda, A.; Das, S. K.; Das, A.; Ghosh, N.; Sinha, K.Abstract
Episodic selection analyses are often assembled from fragmented toolchains in which ortholog discovery, codon alignment, phylogeny, exploratory scans, branch-site testing, and reporting are handled separately, making reproducibility and sensitivity tracking difficult. We introduce BABAPPASnake as an integrated workflow contribution for orthogroup-centered episodic selection analysis. The workflow combines orthogroup construction logic, CDS quality-aware mapping, multi-engine alignment pathways, phylogenetic inference, exploratory nomination, and branch-site follow-up testing in one reproducible execution framework. It also supports optional HyPhy GARD recombination screening as a conservative preprocessing report layer without forcing fragment-level rerouting by default. It generates pathway-level and cross-pathway robustness outputs, including matrix, consensus, narrative, and provenance summaries to support sensitivity-aware interpretation. A four-gene mosquito melanization-associated module is analyzed as a real-data empirical demonstration of end-to-end workflow behavior. In this demonstration, branch-site signals show both recurrent and method-sensitive components across six method-trim pathways, with a directional core-tier tendency in several summaries. These case-study patterns are interpreted as workflow-based empirical evidence and hypothesis-generating asymmetry, not decisive pathway-level confirmation. Overall, BABAPPASnake provides a practical and reproducible framework for episodic selection studies where analytical sensitivity must be explicitly reported.
bioinformatics2026-04-05v7RAMBO: Resolving Amplicons in Mixed Samples for Accurate DNA Barcoding with Oxford Nanopore
Kolter, A.; Hebert, P. D. N.Abstract
DNA barcoding, the use of short genetic markers to identify and differentiate species, is a foundational tool for ecological and taxonomic research. The method has been scaled rapidly with next-generation sequencing technologies, enabling the processing of thousands of specimens in parallel. Nanopore sequencing not only offers a flexible, low-cost alternative to other platforms but produces full-length reads in real time and can be used in remote settings. However, its comparatively high error rate complicates downstream processing, particularly when PCR amplifies multiple templates from a single specimen, reflecting pseudogenes, paralogs, or contaminants. We present a novel pipeline for DNA barcoding that resolves mixed sequence signals from Nanopore reads using unsupervised clustering and staged consensus generation, without relying on curated reference databases, taxonomic priors, or error models. While existing methods to curate Nanopore sequence data assume a single dominant amplicon per sample or require deep sequence divergence among amplicons, our pipeline can distinguish variants differing by as little as 0.15%. It combines column-weighted encodings, UMAP projection, and HDBSCAN clustering, followed by conservative consensus refinement. The pipeline was benchmarked and validated using datasets with known composition, including high-fidelity PacBio sequences. The results show that Nanopore barcoding, when paired with appropriate analysis, can recover biologically meaningful variation even in technically complex samples. The pipeline is particularly suited for specimens where divergent templates are co-amplified, including mitochondrial pseudogenes or multicopy nuclear regions like ITS. As such, it provides a generalizable framework for high-resolution Nanopore analysis of complex amplicon mixtures.
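The conservative consensus step described above can be illustrated with a minimal sketch: within each read cluster, a per-column majority vote over already-aligned reads yields a consensus, with ambiguous columns masked. The function name, the 60% threshold, and the toy reads below are hypothetical and only illustrate the idea, not the pipeline's actual implementation.

```python
from collections import Counter

def cluster_consensus(aligned_reads, min_frac=0.6):
    """Majority-vote consensus over the columns of equal-length, gap-padded
    reads from one cluster; columns without a clear majority become 'N'."""
    consensus = []
    for column in zip(*aligned_reads):
        base, count = Counter(column).most_common(1)[0]
        consensus.append(base if count / len(column) >= min_frac else "N")
    return "".join(consensus)

# Four toy reads from one cluster; the second and fourth carry isolated errors.
reads = ["ACGT", "ACGA", "ACGT", "ATGT"]
print(cluster_consensus(reads))  # ACGT
```

Masking low-agreement columns rather than forcing a call is what makes the consensus conservative: downstream steps can then refine or discard ambiguous positions.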
bioinformatics2026-04-05v2Unravelling genome-wide mosaic microsatellite mutations at single-cell resolution
Wang, C.; Fan, W.; Wang, W.; Xia, Y.; Lu, J.; Ma, X.; Yu, J.; Zheng, Y.; Luo, Y.; Li, W.; Yang, Q.; Lin, M.; Liu, H.; Lan, Y.; Li, C.; Liu, X.; HE, D.; Cai, S.; Yu, X.; Zhou, D.; Kellis, M.; Xiong, X.; Xie, Q.; Dou, Y.Abstract
Short tandem repeats (STRs), or microsatellites, are highly mutable genomic elements that modulate gene regulation and are implicated in a range of human diseases. However, detecting mosaic STR mutations at single-cell resolution remains challenging due to both technical and biological complexities. To address this, we developed BayesMonSTR, a robust algorithm that enables accurate detection of mosaic STR mutations. Using this tool in single-cell analysis of human tissues, we reveal an accumulation of longer mosaic STR insertions and deletions (indels) in aging mitotic and post-mitotic cells. Strikingly, prefrontal cortex (PFC) neurons accumulate a higher burden of STR mutations than B cells or lung epithelium, with aged neurons exhibiting a particularly pronounced increase in longer STR deletions. These mutations are enriched at transcription start sites (TSSs) and active enhancers of highly expressed genes. Our work establishes a foundation for genome-wide, hypothesis-free discovery of disease-associated mosaic STR mutations and reveals a previously unexplored landscape of mosaic STR variation in development and aging.
bioinformatics2026-04-05v2Benchmarking long-read RNA-seq across modalities, methods, and sequencing depth in iNeurons
Schubert, R.Abstract
Long-read RNA sequencing (lrRNA-seq) provides advantages for transcript discovery and quantification through the sequencing of full-length transcripts. Although recent benchmarks have evaluated long-read technologies and quantification tools, to the best of our knowledge, no study to date has jointly compared sequencing technology, quantification choice, and depth across both bulk and single-cell platforms. Here, we generate a matched dataset using NGN2-induced neurons derived from Fragile X syndrome and isogenic rescue lines, profiled with bulk and single-cell Illumina, Oxford Nanopore Technologies (ONT), and Pacific Biosciences (PB) Kinnex technologies. All platforms and technologies capture the expected FMR1 reactivation signal. We find that PB bulk under-detects and under-quantifies short transcripts (less than 1.25 kb), ONT bulk under-detects and under-quantifies long transcripts (greater than 5 kb), and single-cell long-read technologies detect a large number of single-cell-specific transcripts associated with truncations. Across six bulk and four single-cell long-read quantification tools, Isosceles, Miniquant, and Oarfish provide the best compromise between computational efficiency and performance in terms of quantification accuracy as measured by spike-ins, comparisons to Illumina, and consensus-based downstream tasks such as differential transcript expression (DTE). Depth-equivalency analyses reveal that PB single-cell sequencing requires approximately three- to four-fold greater depth than bulk to reach comparable power for transcript discovery and differential transcript expression. Our work aims to offer practical guidance for study design, including the choice of technology, sequencing depth, and quantification method. In addition, we hope our data may serve as a reference dataset to evaluate emerging long-read transcriptomic protocols and methods, as well as to more closely investigate FMR1 biology.
bioinformatics2026-04-04v1Correlate: A Web Application for Analyzing Gene Sets and Exploring Gene Dependencies Using CRISPR Screen Data
Deolankar, S.; Wermeling, F.Abstract
CRISPR screen data provides a valuable resource for understanding gene function and identifying potential drug targets. Here, we present Correlate, a freely accessible web application (https://correlate.cmm.se) that enables exploration of the Cancer Dependency Map (DepMap) CRISPR screen gene effects, hotspot mutations, and translocation/fusion data across more than 1,000 human cancer cell lines. The application supports two main use cases: (i) analysis of user-defined gene sets (e.g. CRISPR screen hits) to identify functionally linked genes based on correlations while providing an overview based on essentiality or user-provided screen statistics; and (ii) exploration of genes of interest in defined biological contexts, such as specific cancer types or mutational backgrounds, to generate hypotheses about gene function and dependencies. Additionally, Correlate supports experimental design by providing rapid overviews of gene essentiality and enabling the identification of cell lines with relevant mutational profiles. In contrast to knowledge-based approaches such as STRING and GSEA, which rely on prior biological annotations and curated interaction networks, Correlate identifies gene connections directly from functional CRISPR screen readouts, offering a complementary and data-driven perspective on gene network analysis. The application runs entirely in the browser, requires no installation or login, and integrates with the Green Listed v2.0 tool family for custom CRISPR screen design.
bioinformatics2026-04-04v1muat: portable transformer-based method for tumour classification and representation learning from somatic variants
Sanjaya, P.; Pitkänen, E.Abstract
Motivation: Deep neural networks have proven effective in classifying tumour types using next-generation sequencing data. However, developing transferable models that work across heterogeneous operating environments remains challenging due to differences in cohort compositions and data generation protocols, privacy concerns, and limited computational capabilities. Results: We introduce muat, a transformer-based software for tumour classification using somatic variant data from whole-genome (WGS) and whole-exome sequencing (WES). Building on previously developed MuAt and MuAt2 models, we distribute the software via Docker containers and Bioconda for deployment in high-performance computing (HPC) systems and Secure Processing Environments (SPEs). Using a downloadable MuAt checkpoint, we reproduce the performance reported in the original study on whole genome (PCAWG; 89% accuracy in histological tumour typing) and exome sequencing data (TCGA; 64% accuracy). Cross-cohort evaluation in Genomics England SPE achieved 81% accuracy without retraining and 89% following fine-tuning. As a demonstration of the software's adaptability, we also deployed muat within the iCAN Digital Precision Cancer Medicine Flagship's SPE and integrated it into a Nextflow-managed workflow. Availability and implementation: muat is available through conda (www.anaconda.org/bioconda/muat) and GitHub (https://github.com/primasanjaya/muat), under the Apache 2.0 License. Contact: prima.sanjaya@helsinki.fi, esa.pitkanen@helsinki.fi; website: mlbiomed.net
bioinformatics2026-04-03v1Conserved water molecules as structural ligands modulating pathogenic variation in human protein binding sites
Konc, J.; Recer, K.; Kunej, T.; Janezic, D.Abstract
Conserved water molecules (CWMs) are tightly bound solvent molecules that occupy well-defined and recurrent positions in protein structures. Although they are known to influence protein stability, function, and ligand binding, their contribution to human genetic disease has remained largely unexplored. Here, we demonstrate that CWMs substantially contribute to the pathogenicity of single nucleotide polymorphisms (SNPs). By systematically mapping SNPs onto ligand-binding and conserved water sites across human protein structures in the Protein Data Bank, we find that pathogenic variants are strongly enriched at CWM positions. Enrichment is particularly pronounced at CWM sites within ligand-binding regions, exceeding that observed for ligand-binding sites as a whole. To establish a mechanistic link, we performed molecular dynamics simulations on human lysosomal acid glucosylceramidase (GCase), encoded by GBA1 and associated with Gaucher disease and Parkinson's disease risk. Removal of a single conserved water molecule in the wild-type protein recapitulates key structural features of the pathogenic L444P variant, whereas stabilization of this water in the mutant restores native-like behavior. These findings demonstrate that disruption of a conserved water molecule can induce long-range structural changes consistent with disease-associated mutations. Together, our results identify conserved water molecules as functional structural elements whose disruption represents a recurrent mechanism of protein dysfunction and provide direct mechanistic evidence for their pathogenic role in Gaucher disease.
bioinformatics2026-04-03v1LigandForge: A Web Server for Structure-Guided De Novo Drug Design
Nada, H.; Sipos-Szabo, L.; Bajusz, D.; Keseru, G.; Gabr, M.Abstract
Despite advances in computational drug discovery, de novo drug design remains hindered by high licensing costs and the need for specialized programming expertise. We present LigandForge, a webserver for structure-guided de novo ligand generation. LigandForge integrates structural validation and binding-site characterization; voxel-based property grid construction for spatial mapping of electrostatics and hydrophobicity; chemistry-aware fragment assembly; multi-objective lead optimization; and retrosynthetic feasibility analysis. The platform utilizes a structure-guided framework to assemble molecules from curated fragment libraries while enforcing physicochemical constraints, including molecular weight, LogP, and hybridization states. Generated molecules are refined via reinforcement learning and genetic algorithms, and the resulting candidates are evaluated using composite metrics such as the quantitative estimate of drug-likeness. By leveraging RDKit for cheminformatics and NGL viewer for real-time 3D visualization, LigandForge provides a synthesis-aware environment that bridges the gap between macromolecular structural data and experimentally feasible lead compounds without requiring local software installation.
bioinformatics2026-04-03v1Anonymized Somatic Tumor Twins (STTs) enable open genome data sharing and use in research and clinical oncology
Gaitan, N.; Martin, R.; Tello, D.; Benetti, E.; Riba, M.; Licata, L.; Arbones, M.; Royo, R.; Olmos, D.; Morelli, M. J.; Tonon, G.; Castro, E.; Torrents, D.Abstract
The study of somatic variants from tumor genomes is fundamental to cancer research and clinical decision-making. However, existing data protection frameworks impose restrictions on the use and sharing of these variants in conjunction with sensitive germline information. To overcome these challenges, we developed GenomeAnonymizer, the first method to anonymize short-read DNA sequences from tumor-normal pairs. This generates Somatic Tumor Twins (STTs), an anonymized version of the original data that preserves the donor's privacy while retaining somatic tumor information and sequencing noise. This method successfully removed all detectable germline variants from the 47 PCAWG-Pilot samples. We further demonstrate that Whole-Genome Sequencing (WGS) STTs preserve more than 98% of the original somatic variants, enabling reliable downstream analysis that replicates somatic-related findings from the original samples, including cancer driver genes, mutational signatures, and intratumor heterogeneity. Importantly, we also show that STTs can reproduce the identification of actionable genes and downstream clinical interpretations and decision-making. We generated a cancer cohort of STTs matched with synthetic clinical data that could be openly shared and used across projects and centers worldwide. This paradigm-shifting approach will accelerate discovery and clinical translation in oncology and enable the robust benchmarking of genome analysis and large-scale data infrastructures.
bioinformatics2026-04-03v1Importance of taking Single Amino Acid Variant and accessory proteome variability into account in Data Independent Acquisition Proteomics: illustrated with Legionella pneumophila analysis
Dupas, A.; Ibranosyan, M.; Ginevra, C.; Jarraud, S.; Lemoine, J.Abstract
Understanding allelic variability is crucial for elucidating intrinsic bacterial mechanisms and distinguishing phenotypic profiles. However, such variability poses a major challenge for the reliable identification of proteins in data-independent acquisition (DIA) proteomics. To address this, we developed an analytical workflow that integrates protein sequence variability to enhance proteome coverage. Fifteen Legionella pneumophila isolates were analyzed using DIA-NN, with spectral libraries generated either from a reference proteome or incorporating allelic variability. Our workflow includes protein clustering and subsequent protein inference from these clusters, allowing the accurate assignment of shared and variant-specific peptides. Integration of variability enabled the identification of a comparable number of proteins to the reference proteome while capturing between 28% and 77% of variant-specific sequences in each isolate, all while maintaining a low false positive rate. These findings demonstrate that accounting for allelic variability substantially improves proteomic coverage and identification confidence, providing a more comprehensive view of the proteome. This approach facilitates a deeper understanding of biological mechanisms and enables precise bacterial proteotyping of Legionella pneumophila isolates.
bioinformatics2026-04-03v1Proteome analyses reveal Endoplasmic Reticulum stress-induced changes in protein abundance associated with Ube2j2 deficiency in human cell culture
Dahlberg, C. L.; Zinkgraf, M.; Laugesen, S. H.; Soltoft, C. L.; Ginebra, Q.; Bennett, E. P.; Hartmann-Petersen, R.; Ellgaard, L.Abstract
The unfolded protein response (UPR) helps reinstate cellular proteostasis upon an accumulation of misfolded proteins in the endoplasmic reticulum (ER), in part through ER-associated degradation (ERAD). Ube2j2 is an ER-localized E2 ubiquitin-conjugating enzyme that participates in ERAD. We used mass spectrometry analysis of cultured U2OS cells to investigate how the loss of Ube2j2 affects the cellular proteome in response to tunicamycin-induced ER stress. We constructed a network of twelve statistically distinct modules of protein abundance profiles across conditions. We describe the Gene Ontology annotations for each module along with the hub gene proteins whose abundance levels most closely adhere to each module's protein abundance profile. Our analysis identifies known Ube2j2-associated pathways (e.g., the UPR and ERAD) and cellular functions that were previously unassociated with Ube2j2 (e.g., RNA metabolism, ER-Golgi transport, and cell-cycle progression). These data are available via ProteomeXchange with identifier PXD076153 and provide avenues for further investigation into the cellular functions of Ube2j2 under basal and ER-stressed conditions.
bioinformatics2026-04-03v1PANDA: Read-Level Phased Analysis of DNA Amplicons for Methylation Studies
Kubota, A.; Kobayashi, H.; Tajima, A.Abstract
DNA methylation analysis using bisulfite sequencing is widely used to investigate epigenetic regulation at single-base resolution; however, conventional analysis workflows primarily rely on site-wise averaging, which obscures contiguous methylation patterns encoded within individual DNA molecules and limits interpretation of epiallelic heterogeneity in targeted amplicon studies. Here, we present PANDA (Phased ANalysis of DNA Amplicons), an end-to-end graphical pipeline that restores contiguous single-molecule methylation patterns by linking unmerged paired-end reads to reconstruct epiallelic patterns across unsequenced regions. PANDA supports both Sanger and next-generation sequencing inputs, providing a unified workflow for alignment, read-level methylation calling, phased visualization, and quantification of within-sample methylation heterogeneity. Using synthetic benchmarking datasets, we demonstrated that in silico motif filtering isolates specific target reads, enabling the accurate detection of allele-specific methylation and loss of imprinting. Furthermore, the re-analysis of primate placentae datasets confirmed that long-range phasing across unsequenced regions successfully restored the original epiallelic architectures. PANDA establishes a robust, practical approach to single-molecule epigenomic profiling using targeted bisulfite amplicon sequencing.
bioinformatics2026-04-03v1DeepTrio: Variant Calling in Families Using Deep Learning
Brambrink, L.; Kolesnikov, A.; Goel, S.; Nattestad, M.; Yun, T.; Baid, G.; Yang, H.; McLean, C.; Shafin, K.; Chang, P.-C.; Carroll, A.Abstract
Every human inherits one copy of the genome from their mother and another from their father. Parental inheritance helps us understand the transmission of traits and genetic diseases, which often involve de novo variants and rare recessive alleles. Here we present DeepTrio, which learns to analyze child-mother-father trios from the joint sequence information, without explicit encoding of inheritance priors. DeepTrio learns how to weigh sequencing error, mapping error, de novo rates, and genome context directly from the sequence data. DeepTrio has higher accuracy on both Illumina and PacBio HiFi data when compared to DeepVariant. Improvements are especially pronounced at lower coverages (with 20x DeepTrio roughly equivalent to 30x DeepVariant). As DeepTrio learns directly from data, we also demonstrate extensions to exome calling solely by changing the training data. DeepTrio includes pre-trained models for Illumina WGS, Illumina exome, and PacBio HiFi.
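For contrast, the explicit inheritance prior that DeepTrio avoids hard-coding can be sketched as a plain Mendelian-consistency check over trio genotypes, where a violation flags a candidate de novo variant. This toy function (the name and allele-tuple encoding are ours, not DeepTrio's) ignores sequencing and mapping error entirely, which is precisely the uncertainty a learned model must weigh.

```python
def mendelian_consistent(child, mother, father):
    """True if the child's diploid genotype can be formed from one allele of
    each parent. Genotypes are 2-tuples of allele strings (toy encoding)."""
    a, b = child
    return (a in mother and b in father) or (b in mother and a in father)

# Child carries a T absent from both parents: a candidate de novo variant.
print(mendelian_consistent(("A", "T"), ("A", "A"), ("A", "A")))  # False
```

In real data, a hard rule like this misclassifies genotyping errors as de novo events; learning error rates jointly with inheritance from sequence context is the motivation for the trio model.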
bioinformatics2026-04-02v2SSAlign: Ultrafast and Sensitive Protein Structure Search at Scale
Wang, L.; Zhang, X.; Wang, Y.; Xue, Z.Abstract
The advent of highly accurate structure prediction techniques such as AlphaFold3 is driving an unprecedented expansion of protein structure databases. This rapid growth creates an urgent demand for novel search tools, as even the current fastest available methods like Foldseek face significant limitations in sensitivity and scalability when confronted with these massive repositories. To meet this challenge, we have developed SSAlign, a protein structure retrieval tool that leverages protein language models to jointly encode sequence and structural information and adopts a two-stage alignment strategy. On large-scale datasets such as AFDB50, SSAlign achieves a two-orders-of-magnitude speedup over Foldseek in search, substantially improving scalability for high-throughput structural analysis. Compared to Foldseek, SSAlign retrieves substantially more high-quality matches on Swiss-Prot and achieves marked performance improvements on SCOPe40, with relative AUC increases of +20.2% at the family level and +33.3% at the superfamily level, demonstrating significantly enhanced sensitivity and recall. In sum, SSAlign achieves TM-align-comparable accuracy with Foldseek-surpassing speed and coverage, offering an efficient, sensitive, and scalable solution for large-scale structural biology and structure-based drug discovery.
bioinformatics2026-04-02v2Ankh-score produces better sequence alignments than AlphaFold3
Malec, J.; Rusen, K.; Golding, G. B.; Ilie, L.Abstract
Protein sequence alignment is one of the most fundamental procedures in bioinformatics. Due to its many downstream applications, improvements to this procedure are of great importance. We consider two revolutionary concepts that emerged recently as candidates for improving the state-of-the-art alignment methods: AlphaFold and protein language models such as Ankh, ProtT5 or ESM-C. Alignment improvements can come from the structural alignment of AlphaFold-predicted structures or from scoring based on the similarity of protein embeddings produced by the protein language models. Thorough comparison on many domains from BAliBASE and CDD demonstrates that the Ankh-score method produces much better sequence alignments than structural alignments of AlphaFold3-predicted structures computed with US-align. Both are better than the traditional method using BLOSUM matrices. This suggests that Ankh embeddings may possess certain information that is not available in the AlphaFold3-predicted structures. The alignment software is freely available as a web server at e-score.csd.uwo.ca and as source code at github.com/lucian-ilie/E-score.
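The embedding-scored alignment idea can be sketched as a standard Needleman-Wunsch recurrence in which the substitution score for a residue pair is the cosine similarity of their per-residue embeddings, replacing a BLOSUM lookup. The sketch below uses random vectors as stand-ins for Ankh embeddings and a simple linear gap penalty; E-score's actual scoring and gap scheme may differ.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def nw_align_score(emb_a, emb_b, gap=-1.0):
    """Needleman-Wunsch global alignment score where matching residues i, j
    contribute the cosine similarity of their embedding vectors."""
    n, m = len(emb_a), len(emb_b)
    F = np.zeros((n + 1, m + 1))
    F[:, 0] = gap * np.arange(n + 1)   # leading gaps in sequence b
    F[0, :] = gap * np.arange(m + 1)   # leading gaps in sequence a
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i, j] = max(F[i - 1, j - 1] + cosine(emb_a[i - 1], emb_b[j - 1]),
                          F[i - 1, j] + gap,
                          F[i, j - 1] + gap)
    return F[n, m]

rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 8))     # stand-in for 5 per-residue embeddings
print(nw_align_score(emb, emb))   # self-alignment: five matches, ~5.0
```

With traceback added, the same dynamic-programming table yields the alignment itself; only the scoring function changes relative to a classical substitution-matrix aligner.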
bioinformatics2026-04-02v2Optimisation of Weighted Ensembles of Genomic Prediction Models in Maize
Tomura, S.; Powell, O. M.; Wilkinson, M. J.; Lefevre, J.; Cooper, M.Abstract
Ensembles of multiple genomic prediction models have demonstrated improved prediction performance over the individual models contributing to the ensemble. The outperformance of ensemble models is expected from the Diversity Prediction Theorem, which states that for ensembles constructed with diverse prediction models, the ensemble prediction error becomes lower than the mean prediction error of the individual models. While a naive ensemble-average model provides baseline performance improvement by aggregating all individual prediction models with equal weights, optimising weights for each individual model could further enhance ensemble prediction performance. The weights can be optimised based on their level of informativeness regarding prediction error and diversity. Here, we evaluated weighted ensemble-average models with three possible weight optimisation approaches (linear transformation, Nelder-Mead and Bayesian) using flowering time and tillering traits from two maize nested association mapping (NAM) datasets: TeoNAM and MaizeNAM. The three proposed weighted ensemble-average approaches improved prediction performance in several of the prediction scenarios investigated. In particular, the weighted ensemble models enhanced prediction performance when the adjusted weights differed substantially from the equal weights used by the naive ensemble models. For performance comparisons among the weighted ensembles, there was no clear superiority among the proposed approaches in both prediction accuracy and error across the prediction scenarios. Weight optimisation for ensembles warrants further investigation to explore the opportunities to improve their prediction performance; for example, integration of a weighted ensemble with a simultaneous hyperparameter tuning process may offer a promising direction for further research.
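As a rough illustration of one of the three approaches, ensemble weights can be fitted with a Nelder-Mead search minimising prediction error on held-out data; starting from equal weights means the search can do no worse than the naive ensemble average. The simplex projection and mean-squared-error objective below are our simplifications for toy data, not necessarily the authors' exact formulation.

```python
import numpy as np
from scipy.optimize import minimize

def optimise_weights(preds, y):
    """Nelder-Mead search for ensemble weights minimising MSE.
    `preds` is (n_models, n_obs); raw weights are projected onto the
    simplex (non-negative, summing to one) before scoring."""
    def mse(w):
        w = np.abs(w) / np.abs(w).sum()
        return float(np.mean((w @ preds - y) ** 2))
    res = minimize(mse, x0=np.ones(preds.shape[0]), method="Nelder-Mead")
    return np.abs(res.x) / np.abs(res.x).sum(), mse(res.x)

y = np.array([1.0, 2.0, 3.0, 4.0])
preds = np.vstack([y + 0.1, y - 0.5, 2 * y])          # three toy "models"
w, err = optimise_weights(preds, y)
naive = float(np.mean((preds.mean(axis=0) - y) ** 2))  # equal-weight baseline
print(err <= naive)  # True: optimised weights never lose to the naive average
```

Because the initial simplex contains the equal-weight point, the returned objective value is bounded above by the naive ensemble's error, mirroring the paper's observation that gains appear when the fitted weights drift far from equality.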
bioinformatics2026-04-02v2CLEAR: Concise List Enrichment Analysis Reducing Redundancy
Jia, X.; Phan, A.; Dorman, K.; Kadelka, C.Abstract
High-throughput experiments generate genome-wide measurements for thousands of genes, which are often tested marginally. Biological processes are driven by coordinated groups of genes rather than individual genes, making gene set enrichment analysis an essential post hoc interpretation tool. Traditional approaches such as Over-Representation Analysis and Gene Set Enrichment Analysis test gene sets independently, which ignores the hierarchical and overlapping structure of gene set collections such as the Gene Ontology, and often leads to redundant enrichment results. Set-based approaches such as MGSA address this issue by modeling multiple gene sets simultaneously, but they rely on binary gene activation states derived from arbitrary thresholds on gene-level statistics. We introduce Concise List Enrichment Analysis Reducing Redundancy (CLEAR), a Bayesian gene set enrichment framework that jointly models gene sets while incorporating continuous gene-level statistics such as test statistics or p-values. CLEAR extends model-based gene set analysis by replacing threshold-based gene activation with a probabilistic model for continuous gene-level statistics. This approach preserves the redundancy-reduction advantages of set-based enrichment methods while avoiding the information loss introduced by binarization. Using both simulated datasets and human gene expression data, we show that CLEAR improves sensitivity compared with existing enrichment approaches while producing a more concise and interpretable set of enriched gene sets.
bioinformatics 2026-04-02 v2
Resolution of recursive data corruption to transform T-cell epitope discovery
Preibisch, G.; Tyrolski, M.; Kucharski, P.; Gizinski, S.; Grzegorczyk, P.; Moon, S.; Kim, S.; Zaro, B.; Gambin, A.
Abstract
Accurate prediction of MHC class I-presented peptides is essential for vaccine and T-cell therapy design, yet reported gains on in silico benchmarks have not translated into clinical successes. Here we show that this discrepancy may stem from a common methodological error: immunopeptidomics datasets are fundamentally contaminated by existing prediction models through prediction-based deconvolution and filtering, resulting in an iterative confirmation bias. An audit of the IEDB, the largest database in the field, reveals that as of January 2025, 55.8% of assessable data are labeled by computational models rather than verified experimentally. This inflates in silico benchmarks while degrading real-world applicability to new data, making it effectively impossible to test model performance objectively, which can lead to choosing suboptimal solutions and decreases the chance of any therapy's clinical success. In silico simulation shows that iterative data corruption maintains high AUROC while top-of-list retrieval collapses. We reframe epitope discovery as a protein-centric learning-to-rank task and introduce deepMHCflare, a model evaluated exclusively on clean data. deepMHCflare achieves 0.80 Precision@4 on mono-allelic benchmarks versus 0.55-0.65 for gold-standard prediction models. A preclinical cancer vaccine study validated that 2 of the 4 deepMHCflare-nominated peptides were immunogenic, with a third independently confirmed in the literature.
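Precision@4, the headline metric here, scores only the top of the ranked list, exactly the regime where the simulated data corruption collapses retrieval while AUROC stays high. A minimal sketch:

```python
def precision_at_k(ranked, relevant, k=4):
    """Fraction of the k highest-ranked candidates that are truly relevant,
    e.g. experimentally immunogenic peptides among the top nominations."""
    rel = set(relevant)
    return sum(1 for item in ranked[:k] if item in rel) / k
```

Unlike AUROC, which averages over every possible score threshold, this metric is indifferent to how the rest of the list is ordered, which is why the two can diverge so sharply on corrupted data.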
bioinformatics 2026-04-02 v2
HalluCodon enables species-specific codon optimization using multimodal language models
Lou, Y.; Mao, S.; Wu, T.; Xia, F.; Zhang, Z.; Tian, Y.; Li, Y.; Cheng, Q.; Yan, J.; Wang, X.
Abstract
Codon optimization is widely used in transgenic crop development, plant synthetic biology, and molecular farming to improve heterologous protein expression in plant cells. The increasing availability of plant omics data now enables optimization strategies that account for species-specific sequence features. We developed HalluCodon, a customizable framework that uses multimodal language models to design coding sequences tailored to individual plant species. The framework allows users to fine-tune pre-trained protein and RNA language models with their own datasets to build species-specific codon optimization models. The current implementation includes base models trained on coding sequences and proteomes from fifteen plant species. HalluCodon generates coding sequences through a hallucination-based design strategy guided by two predictive modules that evaluate coding sequence naturalness (CodonNAT) and expression potential (CodonEXP). Benchmark tests using representative proteins show that the generated sequences reproduce host-specific codon usage patterns and support high expression levels in plant systems.
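Hallucination-style sequence design can be illustrated as a score-guided search over synonymous codons: propose a synonymous substitution, keep it if the scoring function approves. The sketch below uses a toy GC-content objective and a truncated codon table as stand-ins for the learned CodonNAT/CodonEXP scorers described in the abstract:

```python
import random

# Small illustrative subset of the standard genetic code.
CODONS = {
    "M": ["ATG"], "K": ["AAA", "AAG"], "F": ["TTT", "TTC"],
    "L": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"],
}

def gc_score(cds):
    # Toy objective: fraction of G/C bases. A real scorer would be a model.
    return sum(base in "GC" for base in cds) / len(cds)

def optimise_cds(protein, score=gc_score, iters=200, seed=0):
    """Hill climb over synonymous codon substitutions: propose a random
    synonymous swap, accept it if the score does not decrease."""
    rng = random.Random(seed)
    cds = [rng.choice(CODONS[aa]) for aa in protein]
    best = score("".join(cds))
    for _ in range(iters):
        i = rng.randrange(len(cds))
        cand = cds.copy()
        cand[i] = rng.choice(CODONS[protein[i]])
        s = score("".join(cand))
        if s >= best:  # accept neutral or improving synonymous moves
            cds, best = cand, s
    return "".join(cds), best
```

By construction every accepted move is synonymous, so the protein sequence is preserved while the nucleotide-level objective climbs.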
bioinformatics 2026-04-02 v1
Evaluating FoldX5.1 for MAVISp Stability Data Collection
Vliora, A.; Tiberti, M.; Papaleo, E.
Abstract
MAVISp (Multi-layered Assessment of VarIants by Structure for proteins) is a structure-based framework for facilitating mechanistic interpretation of missense variants, with protein stability as one of its core analytical layers. When software tools are updated, a key consideration for database curation is whether the new version can be adopted without compromising compatibility with existing entries. This study evaluated the effect of replacing FoldX5 with FoldX5.1 on the results of the MAVISp stability workflow. We compared predicted changes in folding free energy for 539,809 shared variants across 119 proteins and found high overall agreement, with a mean Pearson correlation of 0.933 and a mean Cohen's kappa of 0.814. Most proteins showed strong concordance, whereas only three (NUPR1, TSC1, and TMEM127) showed poor agreement. For NUPR1 and TSC1, disagreements were more frequent at sites with low AlphaFold2 confidence. These outliers did not display systematic inter-version bias, as mean shifts in folding free energies between versions were minimal. Collectively, these findings support adopting FoldX5.1 for future MAVISp data collection. We will include a transition period during which existing entries retain FoldX5 annotations until their scheduled annual update, while new or updated entries are processed with FoldX5.1. To facilitate this transition, the FoldX software version has been added as a new metadata annotation in the MAVISp database.
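Agreement statistics of this kind can be computed with standard estimators: Pearson correlation on the raw free-energy changes, and a Cohen-style agreement coefficient on discretised stability classes. A sketch, assuming a ternary classification with an illustrative ±3 kcal/mol cutoff (the actual MAVISp class thresholds may differ):

```python
import numpy as np
from scipy.stats import pearsonr

def classify_ddg(ddg, destab=3.0, stab=-3.0):
    """Ternary stability classes from predicted free-energy changes; the
    ±3 kcal/mol cutoffs here are illustrative assumptions."""
    ddg = np.asarray(ddg)
    return np.where(ddg >= destab, 2, np.where(ddg <= stab, 0, 1))

def cohens_kappa(a, b):
    """Chance-corrected agreement between two categorical label vectors."""
    cats = np.union1d(a, b)
    po = np.mean(a == b)                                   # observed agreement
    pe = sum(np.mean(a == c) * np.mean(b == c) for c in cats)  # chance agreement
    return (po - pe) / (1 - pe)

def version_agreement(ddg_old, ddg_new):
    r, _ = pearsonr(ddg_old, ddg_new)
    kappa = cohens_kappa(classify_ddg(ddg_old), classify_ddg(ddg_new))
    return r, kappa
```

The two statistics answer different questions: the correlation measures agreement of the continuous predictions, while kappa measures whether the two versions would assign variants to the same stability category.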
bioinformatics 2026-04-02 v1
RastQC: High-Performance Sequencing Quality Control Written in Rust
Huang, K.-l.
Abstract
Quality control (QC) of high-throughput sequencing data is a critical first step in genomics analysis pipelines. FastQC has served as the de facto standard for sequencing QC for over a decade, but its Java runtime dependency introduces startup overhead, elevated memory consumption, and deployment complexity. Here we present RastQC, a complete reimplementation of FastQC in Rust that provides all 12 standard QC modules with matching algorithms, plus 3 additional long-read QC modules, MultiQC-compatible output formats, native MultiQC JSON export, a built-in multi-file summary dashboard, and a web-based report viewer. RastQC also supports SOLiD colorspace reads, Oxford Nanopore Fast5/POD5 formats, standard input streaming, intra-file parallelism, and QC-aware exit codes for workflow integration. We benchmarked RastQC against FastQC v0.12.1 on both synthetic datasets (100K-1M reads) and real whole-genome sequencing data spanning five model organisms: Escherichia coli, Saccharomyces cerevisiae, Drosophila melanogaster, Mus musculus, and Homo sapiens. Despite running 15 modules (vs. 11 in FastQC), RastQC achieves comparable speed while using 4-9x less memory (59-125 MB vs. 551-638 MB). On real genome data, RastQC matches FastQC speed on most organisms while achieving 100% module-level concordance (55/55 module calls identical across all organisms for the 11 shared modules). RastQC compiles to a single 2.1 MB static binary with no external dependencies, representing a 102x reduction in deployment footprint. RastQC is freely available at https://github.com/Huang-lab/RastQC under the MIT license.
bioinformatics 2026-04-02 v1
When Multimodal Fusion Fails: Contrastive Alignment as a Necessary Stabilizer for TCR–Peptide Binding Prediction
Qi, C.; Wang, W.; Fang, H.; Wei, Z.
Abstract
Multimodal learning is commonly assumed to improve predictive performance, yet in biological applications auxiliary modalities are often imperfect and can degrade learning if fused naively. We investigate this problem in TCR–peptide binding prediction, where sequence embeddings from pretrained protein language models are strong and transferable, but structure-derived residue graphs are built from predicted folds and heuristic discretization. In this setting, structural views can be noisy, inconsistent, and difficult to optimize jointly with sequence features. We introduce TRACE, a lightweight multimodal framework that encodes each entity (TCR and peptide) with parallel sequence and graph towers, then applies CLIP-style intra-entity contrastive alignment before interaction modeling. The alignment objective regularizes representation geometry by encouraging modality consistency for the same biological entity, thereby preventing unstable graph signals from dominating fusion. Across protocol-aware TCHard RN evaluations, naive sequence+graph fusion frequently underperforms a sequence-only baseline and can collapse toward near-random behavior. In contrast, TRACE consistently restores and improves performance. Controlled noise and supervision sweeps show that these gains persist under increasing graph corruption and positive-label scarcity, indicating that alignment is especially important when training conditions are hard. Our results challenge the assumption that adding modalities is inherently beneficial. Instead, they highlight a central principle for robust multimodal bioinformatics: performance depends not only on what modalities are used, but on how their interaction is constrained during optimization. TRACE provides a simple and general recipe for leveraging imperfect structural information without sacrificing stability.
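The CLIP-style intra-entity alignment objective is a symmetric InfoNCE loss over a batch: the sequence and graph embeddings of the same entity are positives, and all other pairings are negatives. A minimal numpy sketch, where the temperature value and cosine normalisation are illustrative details:

```python
import numpy as np

def clip_alignment_loss(seq_emb, graph_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch: row i of seq_emb and row i of
    graph_emb describe the same entity and form the positive pair;
    every other row pairing is a negative."""
    s = seq_emb / np.linalg.norm(seq_emb, axis=1, keepdims=True)
    g = graph_emb / np.linalg.norm(graph_emb, axis=1, keepdims=True)
    logits = s @ g.T / temperature          # (batch, batch) cosine similarities

    def xent(lgt):
        lgt = lgt - lgt.max(axis=1, keepdims=True)       # stable log-softmax
        logp = lgt - np.log(np.exp(lgt).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))                   # diagonal = positives

    return 0.5 * (xent(logits) + xent(logits.T))
```

Minimising this loss pulls the two modality views of each entity together in embedding space, which is the geometric regularisation the abstract credits with stabilising fusion.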
bioinformatics 2026-04-02 v1
A structure-informed deep learning framework for modeling TCR-peptide-HLA interactions
Cao, K.; Li, R.; Strazar, M.; Brown, E. M.; Nguyen, P. N. U.; Pust, M.-M.; Park, J.; Graham, D. B.; Ashenberg, O.; Uhler, C.; Xavier, R.
Abstract
The interaction between T cell receptors (TCRs), peptides, and human leukocyte antigens (HLAs) underlies antigen-specific T cell immunity. Despite substantial advances in peptide-HLA presentation prediction, accurate modeling of coupled TCR-peptide-HLA recognition remains underdeveloped, limiting applications such as TCR and neoepitope prioritization in cancer and antigen identification in autoimmunity. Here we present StriMap, a unified framework for predicting TCR-peptide-HLA interactions by integrating physicochemical, sequence-context, and structural features at recognition interfaces. StriMap achieves state-of-the-art performance with improved generalizability and enables applications in both cancer and autoimmunity. As a case study in ankylosing spondylitis (AS), we screened 13 million peptides derived from 43,241 bacterial proteins and identified candidate molecular mimics that were experimentally validated to activate T cells expressing an AS-associated TCR. Notably, a top validated peptide was enriched in patients with inflammatory bowel disease (IBD), suggesting potential shared microbial triggers between AS and IBD. Overall, StriMap provides a generalizable framework for rational immunotherapy design and for dissecting antigenic drivers of autoimmunity.
bioinformatics 2026-04-02 v1
DESPOT: Direction-Enhanced Scoring POTentials
Poelmans, R.; Bruncsics, B.; Arany, A.; Van Eynde, W.; Shemy, A.; Moreau, Y.; Voet, A. R.
Abstract
Knowledge-based potentials (KBPs) have long been used to score protein-ligand interactions, yet existing formulations remain isotropic, capturing only distance dependencies and neglecting the directional preferences that govern molecular recognition. Here, we introduce Direction-Enhanced Scoring POTentials (DESPOT), an anisotropic knowledge-based framework that unifies pose scoring and binding-site characterization within a single probabilistic model. Where classical knowledge-based methods model the probability of observing a distance given an interacting atom pair, DESPOT instead models the conditional probability of observing specific ligand atom types at discretized spatial positions around protein atoms. This inverted probabilistic formulation naturally supports directional modelling, via atom type-specific local reference frames and symmetry-aware geometric discretization, as well as steric exclusion, encoded as a dedicated void state that explicitly captures the probability that a spatial bin remains unoccupied. Evaluation on the CASF-2016 benchmark shows that DESPOT substantially outperforms isotropic KBPs in all pose-discrimination and virtual screening tasks (p < 0.0001 for all enrichment factors), with the largest gains arising from its ability to penalize geometrically implausible poses. Constrained energy minimization of training structures proves strongly beneficial for the derivation of KBPs, while our train-test leakage analysis reveals that overfitting is an underestimated and understudied issue for KBPs. The resulting anisotropic interaction profiles reveal systematic directional preferences (illustrated here for hydrogen bonds, aromatic interactions, and halogen bonds) that extend beyond idealized geometric models. DESPOT provides a data-driven framework for direction-aware modelling of protein-ligand interactions, with applications in pose scoring, binding-site characterization, and structure-based design.
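The inverted formulation, a conditional distribution over ligand atom types at each spatial bin plus an explicit void state, can be illustrated with a simple count-based estimator. This is one plausible reading of the abstract rather than DESPOT's actual estimator:

```python
import numpy as np

def atom_type_profile(bin_counts, n_frames, pseudocount=1.0):
    """Per-bin categorical distribution over ligand atom types plus an
    explicit 'void' state capturing the probability that the bin is
    unoccupied. bin_counts: (n_bins, n_types) occupancy counts accumulated
    over n_frames observed local reference frames. Illustrative sketch only."""
    counts = np.asarray(bin_counts, dtype=float)
    occupied = counts + pseudocount
    # Frames in which the bin held no ligand atom contribute to the void state.
    void = np.maximum(n_frames - counts.sum(axis=1), 0.0) + pseudocount
    full = np.column_stack([occupied, void])
    return full / full.sum(axis=1, keepdims=True)
```

In this picture, bins that are rarely occupied acquire a high void probability, which is how geometrically implausible (clashing) poses end up penalized.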
bioinformatics 2026-04-02 v1
Benchmarking Agentic Bioinformatics Systems for Complex Protein-Set Retrieval: A Coccolithophore Calcification Case Study
Zhang, X.
Abstract
Large language model agents are increasingly used for bioinformatics tasks that require external databases, tool use, and long multi-step retrieval workflows. However, practical evaluation of these systems remains limited, especially for prompts whose target set is both large and biologically heterogeneous. Here, I benchmarked three agent systems on the same difficult retrieval task: downloading coccolithophore calcification-related proteins from UniProt across six mechanistically distinct categories, while producing category-separated FASTA files and supporting evidence. The compared systems were Codex app agents extended with Claude Scientific Skills, Biomni Lab online, and DeerFlow 2 with default skills only. Outputs were normalized at the UniProt accession level and compared category by category using overlap analysis, Venn decomposition, and a heuristic relevance assessment of each subset relative to the benchmark prompt. Across the six shared categories in a single run, Codex retrieved 2,118 proteins, DeerFlow 6,255, and Biomni 8,752. Codex showed the best balance between sensitivity and specificity: 92.4% of its proteins fell into subsets labeled high relevance and the remaining 7.6% into medium relevance. DeerFlow was substantially more exhaustive, but 43.8% of its proteins fell into low or low-medium relevance subsets. Biomni produced the largest sets, yet 69.5% of its proteins fell into low or low-medium relevance subsets, mainly due to broad expansion into generic calcium sensors, kinases, transcription factors, and poorly specific domain families. Category-specific analysis showed that Codex was the strongest primary source for inorganic carbon transport, calcium and pH regulation, vesicle trafficking, and signaling, whereas DeerFlow contributed valuable complementary matrix and polysaccharide candidates.
A second run for each system also separated them strongly by repeatability: Codex had the highest within-system stability (mean category Jaccard 0.982; micro-Jaccard 0.974), DeerFlow was intermediate (0.795; 0.571), and Biomni was least stable (0.412; 0.319). These results suggest that for complex protein-family retrieval tasks, agent quality depends less on raw output volume than on prompt decomposition, taxonomic scoping, exact query generation, provenance-rich export artifacts, and repeated-run stability.
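The stability comparison rests on two Jaccard summaries: a per-category (macro) mean and a pooled micro-Jaccard. A sketch, assuming micro-Jaccard pools intersections and unions across categories (one common convention; the paper's exact definition may differ):

```python
def run_stability(run1, run2):
    """Per-category Jaccard and pooled micro-Jaccard between two runs,
    each a dict mapping category name -> iterable of UniProt accessions."""
    cats = set(run1) | set(run2)
    per_cat, inter, union = {}, 0, 0
    for c in cats:
        a, b = set(run1.get(c, ())), set(run2.get(c, ()))
        per_cat[c] = len(a & b) / len(a | b) if (a | b) else 1.0
        inter += len(a & b)   # pooled intersection size across categories
        union += len(a | b)   # pooled union size across categories
    return per_cat, (inter / union if union else 1.0)
```

The macro mean treats every category equally, while the micro score is dominated by the largest categories, which is why the two can diverge (as in DeerFlow's 0.795 vs 0.571).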
bioinformatics 2026-04-02 v1
The U-method: Leveraging expression probability for robust biological marker detection
Stein, Y.; Lavon, H.; Hindi Malowany, M.; Arpinati, L.; Scherz-Shouval, R.
Abstract
Reliable identification of cluster-defining markers is fundamental to single-cell transcriptomic analysis, yet current approaches often rely on average expression differences, which can dilute biologically informative signals in sparse and heterogeneous data. Here we introduce the U-method, a fast probability-based framework for identifying uniquely expressed genes (UEGs) by contrasting the expression probability of a gene within a cluster with its highest expression probability in any other cluster. This highest-probability comparison prioritizes detection consistency over expression magnitude, resulting in markers that consistently identify cell populations across independent datasets analyzed at comparable clustering resolutions. Applied to colorectal, breast, pancreatic, and lung cancer single-cell RNA-sequencing datasets, the U-method identifies canonical lineage markers together with additional genes showing clear cluster specificity. When projected onto Visium HD spatial transcriptomics data using only raw average expression of top UEGs, these signatures reveal coherent and biologically interpretable tissue organization without the need for smoothing, deconvolution, or model-based spatial inference. These results position the U-method as a practical implementation of detection consistency, enabling robust marker discovery and spatial interpretation in single-cell analysis.
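The highest-probability contrast at the core of the U-method can be sketched directly: compare a gene's detection probability in the target cluster against its maximum detection probability in any other cluster. The count > 0 detection rule below is an assumption about the implementation:

```python
import numpy as np

def ueg_score(counts, clusters, target):
    """Expression-probability contrast: P(detected | target cluster) minus
    the highest P(detected) in any other cluster. 'Detected' means count > 0,
    an assumed implementation detail. counts: (n_genes, n_cells)."""
    detected = np.asarray(counts) > 0
    labels = np.asarray(clusters)
    p_target = detected[:, labels == target].mean(axis=1)
    p_others = [detected[:, labels == c].mean(axis=1)
                for c in np.unique(labels) if c != target]
    return p_target - np.max(np.stack(p_others), axis=0)
```

A score near 1 marks a gene detected in nearly every cell of the target cluster and almost nowhere else; contrasting against the single worst competitor (rather than the average of all other clusters) is what makes the markers unique rather than merely enriched.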
bioinformatics 2026-04-02 v1
Generating and navigating single cell dynamics via a geodesic bridge between nonlinear transcriptional and linear latent manifolds
Zhu, J.; Zhang, Z.; Sun, Y.; Dai, H.; Wen, H.; Zhou, P.; Chen, L.
Abstract
Time-series single-cell RNA sequencing (scRNA-seq) captures cellular processes as sparse and unpaired snapshots, limiting our ability not only to reconstruct continuous cell state transitions, but also to navigate between states in a controlled and interpretable manner. Here we present GeoBridge, a framework modeling cellular dynamics as geodesic trajectories on the transcriptional manifold, based on our isometric geodesic theory, which theoretically and computationally transforms time-varying nonlinear transcriptional geodesics (original nonlinear manifold) into constant-velocity straight-line geodesics (latent linear manifold) via a learned geodesic bridge. In this learned geodesic space, continuous interpolation becomes biologically meaningful, enabling reconstruction of unobserved intermediate states and efficient navigation between distinct cellular phenotypes at single-cell resolution. By mapping interpolated trajectories back to the original gene expression space, GeoBridge recovers smooth transcriptional programs that are robust to noise and snapshot sparsity. Leveraging the derived geodesic potentials, GeoBridge further infers pseudo-temporal trajectories from single-snapshot scRNA-seq data without temporal annotation, and directly identifies genes that drive progression along geodesic paths. Across diverse biological systems, GeoBridge accurately resolves developmental dynamics, generates unmeasured intermediate states, identifies dynamic driver genes, and, most significantly, enables navigable transitions across multiple differentiation endpoints. Together, GeoBridge establishes a principled method that transforms sparse single-cell measurements into a continuous, controllable landscape for the reconstruction, navigation and manipulation of cellular state transitions.
bioinformatics 2026-04-02 v1
CardamomOT: a mechanistic optimal transport-based framework for gene regulatory network inference, trajectory reconstruction and generative modeling
Mauge, Y.; Ventre, E.
Abstract
A key challenge in inferring gene regulatory networks (GRNs) governing cellular processes such as differentiation and reprogramming from experimental data lies in the impossibility of directly measuring protein dynamics at the single-cell level, which prevents establishing causal relationships between regulator activity and target responses. In earlier work, we introduced CARDAMOM, an algorithm that uses temporal snapshots of scRNA-seq data to calibrate a GRN-driven mechanistic model of gene expression. However, this method had several limitations: it could only rely on the relative ordering of time points rather than their exact labels, imposed restrictive quasi-stationary assumptions on protein dynamics, and depended on multiple hyperparameters. Here, we present CardamomOT, a new method based on the same mechanistic model that jointly reconstructs the GRN and unobserved protein trajectories from the data within a mechanistic optimal transport framework. By incorporating exact time labels and priors on protein kinetic rates from the literature, and substantially reducing the number of required hyperparameters, our approach addresses these limitations and substantially improves the accuracy and robustness of GRN calibration. We validate our framework on both in silico and experimental datasets, demonstrating computational scalability and consistently improved performance over state-of-the-art methods in both GRN and trajectory reconstruction. In particular, CardamomOT accurately recovers velocity fields driving cellular trajectories and unobserved protein levels, alongside reliable GRN structures. We also show that these improvements make the calibrated mechanistic model suitable to be used as a generative model to predict cellular responses to unseen perturbations. 
To our knowledge, this is among the first methods to explicitly integrate mechanistic GRN inference, trajectory reconstruction, and simulation of realistic datasets into a unified framework for scRNA-seq time series analysis.
bioinformatics 2026-04-02 v1