Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
muat: portable transformer-based method for tumour classification and representation learning from somatic variants
Sanjaya, P.; Pitkänen, E.
Abstract
Motivation: Deep neural networks have proven effective in classifying tumour types using next-generation sequencing data. However, developing transferable models that work across heterogeneous operating environments remains challenging due to differences in cohort compositions and data generation protocols, privacy concerns, and limited computational capabilities. Results: We introduce muat, a transformer-based software for tumour classification using somatic variant data from whole-genome (WGS) and whole-exome sequencing (WES). Building on previously developed MuAt and MuAt2 models, we distribute the software via Docker containers and Bioconda for deployment in high-performance computing (HPC) systems and Secure Processing Environments (SPEs). Using a downloadable MuAt checkpoint, we reproduce the performance reported in the original study on whole genome (PCAWG; 89% accuracy in histological tumour typing) and exome sequencing data (TCGA; 64% accuracy). Cross-cohort evaluation in Genomics England SPE achieved 81% accuracy without retraining and 89% following fine-tuning. As a demonstration of the software's adaptability, we also deployed muat within the iCAN Digital Precision Cancer Medicine Flagship's SPE and integrated it into a Nextflow-managed workflow. Availability and implementation: muat is available through conda (www.anaconda.org/bioconda/muat) and GitHub (https://github.com/primasanjaya/muat), under the Apache 2.0 License. Contact: prima.sanjaya@helsinki.fi, esa.pitkanen@helsinki.fi; website: mlbiomed.net
bioinformatics 2026-04-03 v1
PANDA: Read-Level Phased Analysis of DNA Amplicons for Methylation Studies
Kubota, A.; Kobayashi, H.; Tajima, A.
Abstract
DNA methylation analysis using bisulfite sequencing is widely used to investigate epigenetic regulation at single-base resolution; however, conventional analysis workflows primarily rely on site-wise averaging, which obscures contiguous methylation patterns encoded within individual DNA molecules and limits interpretation of epiallelic heterogeneity in targeted amplicon studies. Here, we present PANDA (Phased ANalysis of DNA Amplicons), an end-to-end graphical pipeline that restores contiguous single-molecule methylation patterns by linking unmerged paired-end reads to reconstruct epiallelic patterns across unsequenced regions. PANDA supports both Sanger and next-generation sequencing inputs, providing a unified workflow for alignment, read-level methylation calling, phased visualization, and quantification of within-sample methylation heterogeneity. Using synthetic benchmarking datasets, we demonstrated that in silico motif filtering isolates specific target reads, enabling the accurate detection of allele-specific methylation and loss of imprinting. Furthermore, the re-analysis of primate placentae datasets confirmed that long-range phasing across unsequenced regions successfully restored the original epiallelic architectures. PANDA establishes a robust, practical approach to single-molecule epigenomic profiling using targeted bisulfite amplicon sequencing.
bioinformatics 2026-04-03 v1
Conserved water molecules as structural ligands modulating pathogenic variation in human protein binding sites
Konc, J.; Recer, K.; Kunej, T.; Janezic, D.
Abstract
Conserved water molecules (CWMs) are tightly bound solvent molecules that occupy well-defined and recurrent positions in protein structures. Although they are known to influence protein stability, function, and ligand binding, their contribution to human genetic disease has remained largely unexplored. Here, we demonstrate that CWMs substantially contribute to the pathogenicity of single nucleotide polymorphisms (SNPs). By systematically mapping SNPs onto ligand-binding and conserved water sites across human protein structures in the Protein Data Bank, we find that pathogenic variants are strongly enriched at CWM positions. Enrichment is particularly pronounced at CWM sites within ligand-binding regions, exceeding that observed for ligand-binding sites as a whole. To establish a mechanistic link, we performed molecular dynamics simulations on human lysosomal acid glucosylceramidase (GCase), encoded by GBA1 and associated with Gaucher disease and Parkinson's disease risk. Removal of a single conserved water molecule in the wild-type protein recapitulates key structural features of the pathogenic L444P variant, whereas stabilization of this water in the mutant restores native-like behavior. These findings demonstrate that disruption of a conserved water molecule can induce long-range structural changes consistent with disease-associated mutations. Together, our results identify conserved water molecules as functional structural elements whose disruption represents a recurrent mechanism of protein dysfunction and provide direct mechanistic evidence for their pathogenic role in Gaucher disease.
bioinformatics 2026-04-03 v1
LigandForge: A Web Server for Structure-Guided De Novo Drug Design
Nada, H.; Sipos-Szabo, L.; Bajusz, D.; Keseru, G.; Gabr, M.
Abstract
Despite advances in computational drug discovery, de novo drug design remains hindered by high licensing costs and the need for specialized programming expertise. We present LigandForge, a web server for structure-guided de novo ligand generation. LigandForge integrates structural validation and binding-site characterization; voxel-based property grid construction for spatial mapping of electrostatics and hydrophobicity; chemistry-aware fragment assembly; multi-objective lead optimization; and retrosynthetic feasibility analysis. The platform utilizes a structure-guided framework to assemble molecules from curated fragment libraries while enforcing physicochemical constraints, including molecular weight, LogP, and hybridization states. Generated molecules are refined via reinforcement learning and genetic algorithms and are subsequently evaluated using composite metrics such as the quantitative estimate of drug-likeness. By leveraging RDKit for cheminformatics and NGL viewer for real-time 3D visualization, LigandForge provides a synthesis-aware environment that bridges the gap between macromolecular structural data and experimentally feasible lead compounds without requiring local software installation.
bioinformatics 2026-04-03 v1
Importance of taking Single Amino Acid Variant and accessory proteome variability into account in Data Independent Acquisition Proteomics: illustrated with Legionella pneumophila analysis
Dupas, A.; Ibranosyan, M.; Ginevra, C.; Jarraud, S.; Lemoine, J.
Abstract
Understanding allelic variability is crucial for elucidating intrinsic bacterial mechanisms and distinguishing phenotypic profiles. However, such variability poses a major challenge for the reliable identification of proteins in data-independent acquisition (DIA) proteomics. To address this, we developed an analytical workflow that integrates protein sequence variability to enhance proteome coverage. Fifteen Legionella pneumophila isolates were analyzed using DIA-NN, with spectral libraries generated either from a reference proteome or incorporating allelic variability. Our workflow includes protein clustering and subsequent protein inference from these clusters, allowing the accurate assignment of shared and variant-specific peptides. Integration of variability enabled the identification of a number of proteins comparable to that obtained with the reference proteome while capturing between 28% and 77% of variant-specific sequences in each isolate, all while maintaining a low false positive rate. These findings demonstrate that accounting for allelic variability substantially improves proteomic coverage and identification confidence, providing a more comprehensive view of the proteome. This approach facilitates a deeper understanding of biological mechanisms and enables precise bacterial proteotyping of Legionella pneumophila isolates.
bioinformatics 2026-04-03 v1
Anonymized Somatic Tumor Twins (STTs) enable open genome data sharing and use in research and clinical oncology
Gaitan, N.; Martin, R.; Tello, D.; Benetti, E.; Riba, M.; Licata, L.; Arbones, M.; Royo, R.; Olmos, D.; Morelli, M. J.; Tonon, G.; Castro, E.; Torrents, D.
Abstract
The study of somatic variants from tumor genomes is fundamental to cancer research and clinical decision-making. However, existing data protection frameworks impose restrictions on the use and sharing of these variants in conjunction with sensitive germline information. To overcome these challenges, we developed GenomeAnonymizer, the first method to anonymize short-read DNA sequences from tumor-normal pairs. This generates Somatic Tumor Twins (STTs), an anonymized version of the original data that preserves the donor's privacy while retaining somatic tumor information and sequencing noise. This method successfully removed all detectable germline variants from the 47 PCAWG-Pilot samples. We further demonstrate that Whole-Genome Sequencing (WGS) STTs preserve more than 98% of the original somatic variants, enabling reliable downstream analysis that replicates somatic-related findings from the original samples, including cancer driver genes, mutational signatures, and intratumor heterogeneity. Importantly, we also show that STTs can reproduce the identification of actionable genes and downstream clinical interpretations and decision-making. We generated a cancer cohort of STTs matched with synthetic clinical data that could be openly shared and used across projects and centers worldwide. This paradigm-shifting approach will accelerate discovery and clinical translation in oncology and enable the robust benchmarking of genome analysis and large-scale data infrastructures.
bioinformatics 2026-04-03 v1
Proteome analyses reveal Endoplasmic Reticulum stress-induced changes in protein abundance associated with Ube2j2 deficiency in human cell culture
Dahlberg, C. L.; Zinkgraf, M.; Laugesen, S. H.; Soltoft, C. L.; Ginebra, Q.; Bennett, E. P.; Hartmann-Petersen, R.; Ellgaard, L.
Abstract
The unfolded protein response (UPR) helps reinstate cellular proteostasis upon an accumulation of misfolded proteins in the endoplasmic reticulum (ER), in part through ER-associated degradation (ERAD). Ube2j2 is an ER-localized E2 ubiquitin-conjugating enzyme that participates in ERAD. We used mass spectrometry analysis of cultured U2OS cells to investigate how the loss of Ube2j2 affects the cellular proteome in response to tunicamycin-induced ER stress. We constructed a network of twelve statistically distinct modules of protein abundance profiles across conditions. We describe the Gene Ontology annotations for each module along with the hub gene proteins whose abundance levels most closely adhere to each module's protein abundance profile. Our analysis identifies known Ube2j2-associated pathways (e.g., the UPR and ERAD) and cellular functions that were previously unassociated with Ube2j2 (e.g., RNA metabolism, ER-Golgi transport, and cell-cycle progression). These data are available via ProteomeXchange with identifier PXD076153 and provide avenues for further investigation into the cellular functions of Ube2j2 under basal and ER-stressed conditions.
bioinformatics 2026-04-03 v1
Resolution of recursive data corruption to transform T-cell epitope discovery
Preibisch, G.; Tyrolski, M.; Kucharski, P.; Gizinski, S.; Grzegorczyk, P.; Moon, S.; Kim, S.; Zaro, B.; Gambin, A.
Abstract
Accurate prediction of MHC class I-presented peptides is essential for any vaccine or T-cell therapy design, yet reported gains on in silico benchmarks have not translated into clinical successes. Here we show that this discrepancy may come from a common methodological error: immunopeptidomics datasets are fundamentally contaminated by existing prediction models through prediction-based deconvolution and filtering, resulting in an iterative confirmation bias. An audit of the IEDB, the biggest database in the field, reveals that as of January 2025, 55.8% of assessable data are labeled by computational models rather than verified experimentally. This inflates in silico benchmarks while degrading real-world applicability on new data, effectively making it impossible to objectively test model performance, which can lead to choosing suboptimal solutions and decreasing the chance of any therapy's clinical success. In silico simulation shows that iterative data corruption maintains high AUROC while top-of-list retrieval collapses. We reframe epitope discovery as a protein-centric learning-to-rank task and introduce deepMHCflare, a model evaluated exclusively on clean data. deepMHCflare achieves 0.80 Precision@4 on mono-allelic benchmarks versus 0.55-0.65 for gold-standard prediction models. A preclinical cancer vaccine study validated that 2 of the 4 deepMHCflare-nominated peptides were immunogenic, with a third independently confirmed in the literature.
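The contrast the abstract draws between AUROC and top-of-list retrieval can be illustrated with a small sketch. The scores and labels below are invented toy data, not values from the study, and the function names are hypothetical: four decoy peptides are ranked above the four true binders, so Precision@4 drops to zero while AUROC stays high because the true binders still outrank most of the remaining negatives.

```python
def precision_at_k(scores, labels, k):
    """Fraction of true positives among the k highest-scoring items."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k


def auroc(scores, labels):
    """AUROC as the probability that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def make_toy_ranking():
    """Toy data (assumption): 4 decoys on top, then 4 true binders, then 92 negatives."""
    scores = [1.00, 0.99, 0.98, 0.97]          # decoys ranked first
    labels = [0, 0, 0, 0]
    scores += [0.90, 0.89, 0.88, 0.87]         # true binders
    labels += [1, 1, 1, 1]
    scores += [0.5 - 0.001 * i for i in range(92)]  # remaining negatives
    labels += [0] * 92
    return scores, labels


scores, labels = make_toy_ranking()
print("Precision@4:", precision_at_k(scores, labels, 4))
print("AUROC:", round(auroc(scores, labels), 4))
```

In this toy case each true binder outranks 92 of the 96 negatives, so AUROC is 92/96 even though none of the top four candidates is correct, which is why a ranking metric is the more relevant one when only a handful of peptides can be tested.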
bioinformatics 2026-04-02 v2
DeepTrio: Variant Calling in Families Using Deep Learning
Brambrink, L.; Kolesnikov, A.; Goel, S.; Nattestad, M.; Yun, T.; Baid, G.; Yang, H.; McLean, C.; Shafin, K.; Chang, P.-C.; Carroll, A.
Abstract
Every human inherits one copy of the genome from their mother and another from their father. Parental inheritance helps us understand the transmission of traits and genetic diseases, which often involve de novo variants and rare recessive alleles. Here we present DeepTrio, which learns to analyze child-mother-father trios from the joint sequence information, without explicit encoding of inheritance priors. DeepTrio learns how to weigh sequencing error, mapping error, and de novo rates and genome context directly from the sequence data. DeepTrio has higher accuracy on both Illumina and PacBio HiFi data when compared to DeepVariant. Improvements are especially pronounced at lower coverages (with 20x DeepTrio roughly equivalent to 30x DeepVariant). As DeepTrio learns directly from data, we also demonstrate extensions to exome calling solely by changing the training data. DeepTrio includes pre-trained models for Illumina WGS, Illumina exome, and PacBio HiFi.
bioinformatics 2026-04-02 v2
SSAlign: Ultrafast and Sensitive Protein Structure Search at Scale
Wang, L.; Zhang, X.; Wang, Y.; Xue, Z.
Abstract
The advent of highly accurate structure prediction techniques such as AlphaFold3 is driving an unprecedented expansion of protein structure databases. This rapid growth creates an urgent demand for novel search tools, as even the current fastest available methods like Foldseek face significant limitations in sensitivity and scalability when confronted with these massive repositories. To meet this challenge, we have developed SSAlign, a protein structure retrieval tool that leverages protein language models to jointly encode sequence and structural information and adopts a two-stage alignment strategy. On large-scale datasets such as AFDB50, SSAlign achieves a two-orders-of-magnitude speedup over Foldseek in search, substantially improving scalability for high-throughput structural analysis. Compared to Foldseek, SSAlign retrieves substantially more high-quality matches on Swiss-Prot and achieves marked performance improvements on SCOPe40, with relative AUC increases of +20.2% at the family level and +33.3% at the superfamily level, demonstrating significantly enhanced sensitivity and recall. In sum, SSAlign achieves TM-align-comparable accuracy with Foldseek-surpassing speed and coverage, offering an efficient, sensitive, and scalable solution for large-scale structural biology and structure-based drug discovery.
bioinformatics 2026-04-02 v2
Ankh-score produces better sequence alignments than AlphaFold3
Malec, J.; Rusen, K.; Golding, G. B.; Ilie, L.
Abstract
Protein sequence alignment is one of the most fundamental procedures in bioinformatics. Due to its many downstream applications, improvements to this procedure are of great importance. We consider two revolutionary concepts that emerged recently as candidates for improving the state-of-the-art alignment methods: AlphaFold and protein language models such as Ankh, ProtT5 or ESM-C. Alignment improvements can come from the structural alignment of AlphaFold-predicted structures or the scoring based on the similarity of protein embeddings produced by the protein language models. Thorough comparison on many domains from BAliBASE and CDD demonstrates that the Ankh-score method produces much better sequence alignments than the structural alignments using US-align of AlphaFold3-predicted structures. Both are better than the traditional method using BLOSUM matrices. This suggests that Ankh embeddings may possess certain information that is not available in the AlphaFold3-predicted structures. The alignment software is freely available as a web server at e-score.csd.uwo.ca and as source code at github.com/lucian-ilie/E-score.
bioinformatics 2026-04-02 v2
Optimisation of Weighted Ensembles of Genomic Prediction Models in Maize
Tomura, S.; Powell, O. M.; Wilkinson, M. J.; Lefevre, J.; Cooper, M.
Abstract
Ensembles of multiple genomic prediction models have demonstrated improved prediction performance over the individual models contributing to the ensemble. The outperformance of ensemble models is expected from the Diversity Prediction Theorem, which states that for ensembles constructed with diverse prediction models, the ensemble prediction error becomes lower than the mean prediction error of the individual models. While a naive ensemble-average model provides baseline performance improvement by aggregating all individual prediction models with equal weights, optimising weights for each individual model could further enhance ensemble prediction performance. The weights can be optimised based on their level of informativeness regarding prediction error and diversity. Here, we evaluated weighted ensemble-average models with three possible weight optimisation approaches (linear transformation, Nelder-Mead and Bayesian) using flowering time and tillering traits from two maize nested association mapping (NAM) datasets: TeoNAM and MaizeNAM. The three proposed weighted ensemble-average approaches improved prediction performance in several of the prediction scenarios investigated. In particular, the weighted ensemble models enhanced prediction performance when the adjusted weights differed substantially from the equal weights used by the naive ensemble models. For performance comparisons among the weighted ensembles, no approach was clearly superior in either prediction accuracy or error across the prediction scenarios. Weight optimisation for ensembles warrants further investigation to explore the opportunities to improve their prediction performance; for example, integration of a weighted ensemble with a simultaneous hyperparameter tuning process may offer a promising direction for further research.
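The Diversity Prediction Theorem mentioned above can be checked numerically. For a single target, the squared error of the (weighted) ensemble prediction equals the weighted mean squared error of the individual models minus the weighted spread of their predictions around the ensemble. The sketch below is a minimal illustration under stated assumptions: the three predictions and the observed value are invented flowering-time numbers, and `decompose` is a hypothetical helper, not code from the study.

```python
def decompose(predictions, truth, weights=None):
    """Return (ensemble_error, mean_individual_error, diversity) for one target.

    With weights that sum to 1, the identity
        ensemble_error == mean_individual_error - diversity
    holds exactly (up to floating-point rounding).
    """
    n = len(predictions)
    if weights is None:
        weights = [1.0 / n] * n          # naive ensemble: equal weights
    total = sum(weights)
    weights = [w / total for w in weights]  # normalise so weights sum to 1

    ensemble = sum(w * p for w, p in zip(weights, predictions))
    ensemble_error = (ensemble - truth) ** 2
    mean_individual_error = sum(w * (p - truth) ** 2 for w, p in zip(weights, predictions))
    diversity = sum(w * (p - ensemble) ** 2 for w, p in zip(weights, predictions))
    return ensemble_error, mean_individual_error, diversity


# Invented example: three models predict days to flowering for one line.
predictions = [60.0, 66.0, 75.0]
observed = 68.0

for label, w in [("equal weights", None), ("custom weights", [0.2, 0.5, 0.3])]:
    err, mean_err, div = decompose(predictions, observed, w)
    print(f"{label}: ensemble={err:.2f}, mean individual={mean_err:.2f}, diversity={div:.2f}")
```

Because diversity is never negative, the ensemble error can never exceed the weighted mean individual error. Changing the weights shifts both terms at once, which is why weight optimisation has to account for error and diversity together.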
bioinformatics 2026-04-02 v2
CLEAR: Concise List Enrichment Analysis Reducing Redundancy
Jia, X.; Phan, A.; Dorman, K.; Kadelka, C.
Abstract
High-throughput experiments generate genome-wide measurements for thousands of genes, which are often tested marginally. Biological processes are driven by coordinated groups of genes rather than individual genes, making gene set enrichment analysis an essential post hoc interpretation tool. Traditional approaches such as Over-Representation Analysis and Gene Set Enrichment Analysis test gene sets independently, which ignores the hierarchical and overlapping structure of gene set collections such as the Gene Ontology, and often leads to redundant enrichment results. Set-based approaches such as MGSA address this issue by modeling multiple gene sets simultaneously, but they rely on binary gene activation states derived from arbitrary thresholds on gene-level statistics. We introduce Concise List Enrichment Analysis Reducing Redundancy (CLEAR), a Bayesian gene set enrichment framework that jointly models gene sets while incorporating continuous gene-level statistics such as test statistics or p-values. CLEAR extends model-based gene set analysis by replacing threshold-based gene activation with a probabilistic model for continuous gene-level statistics. This approach preserves the redundancy-reduction advantages of set-based enrichment methods while avoiding the information loss introduced by binarization. Using both simulated datasets and human gene expression data, we show that CLEAR improves sensitivity compared with existing enrichment approaches while producing a more concise and interpretable set of enriched gene sets.
bioinformatics 2026-04-02 v2
When Multimodal Fusion Fails: Contrastive Alignment as a Necessary Stabilizer for TCR--Peptide Binding Prediction
Qi, C.; Wang, W.; Fang, H.; Wei, Z.
Abstract
Multimodal learning is commonly assumed to improve predictive performance, yet in biological applications auxiliary modalities are often imperfect and can degrade learning if fused naively. We investigate this problem in TCR--peptide binding prediction, where sequence embeddings from pretrained protein language models are strong and transferable, but structure-derived residue graphs are built from predicted folds and heuristic discretization. In this setting, structural views can be noisy, inconsistent, and difficult to optimize jointly with sequence features. We introduce TRACE, a lightweight multimodal framework that encodes each entity (TCR and peptide) with parallel sequence and graph towers, then applies CLIP-style intra-entity contrastive alignment before interaction modeling. The alignment objective regularizes representation geometry by encouraging modality consistency for the same biological entity, thereby preventing unstable graph signals from dominating fusion. Across protocol-aware TCHard RN evaluations, naive sequence+graph fusion frequently underperforms a sequence-only baseline and can collapse toward near-random behavior. In contrast, TRACE consistently restores and improves performance. Controlled noise and supervision sweeps show that these gains persist under increasing graph corruption and positive-label scarcity, indicating that alignment is especially important when training conditions are hard. Our results challenge the assumption that adding modalities is inherently beneficial. Instead, they highlight a central principle for robust multimodal bioinformatics: performance depends not only on what modalities are used, but on how their interaction is constrained during optimization. TRACE provides a simple and general recipe for leveraging imperfect structural information without sacrificing stability.
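A CLIP-style alignment objective of the kind described above is commonly written as a symmetric cross-entropy over a matrix of cosine similarities, where the sequence and graph embeddings of the same entity form the positive pair. The following is a generic sketch of that loss in plain Python. It is not the TRACE implementation; the function names, the temperature value, and the toy embeddings are all assumptions made for illustration.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _diag_cross_entropy(logits):
    """Mean of -log softmax(row)[i] over rows; row i's positive is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def clip_alignment_loss(seq_emb, graph_emb, temperature=0.1):
    """Symmetric contrastive loss between two views of the same batch of entities."""
    s = [_normalize(v) for v in seq_emb]
    g = [_normalize(v) for v in graph_emb]
    logits = [[sum(a * b for a, b in zip(si, gj)) / temperature for gj in g] for si in s]
    logits_t = [list(col) for col in zip(*logits)]  # graph-to-sequence direction
    return 0.5 * (_diag_cross_entropy(logits) + _diag_cross_entropy(logits_t))


# Toy embeddings (assumption): two entities, two-dimensional views.
seq_view = [[1.0, 0.0], [0.0, 1.0]]
graph_aligned = [[0.9, 0.1], [0.1, 0.9]]   # each graph view matches its sequence view
graph_swapped = [[0.1, 0.9], [0.9, 0.1]]   # views assigned to the wrong entity

print("aligned loss:", clip_alignment_loss(seq_view, graph_aligned))
print("swapped loss:", clip_alignment_loss(seq_view, graph_swapped))
```

The loss is small when the two views of each entity agree and large when they are mismatched, so minimising it pulls the graph representation toward the sequence representation of the same entity before the two are fused.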
bioinformatics 2026-04-02 v1
Genetic demultiplexing and transcript start site identification from nanopore sequencing of 10x Genomics multiome libraries
Mears, J.; Orchard, P.; Varshney, A.; Bose, M. L.; Robertson, C. C.; Piper, M.; Pashos, E.; Dolgachev, V.; Manickam, N.; Jean, P.; Kitzman, D. W.; Fauman, E.; Damilano, F.; Roth Flach, R. J.; Nicklas, B.; Parker, S. C.
Abstract
Short-read Illumina sequencing of 10x Genomics single-nucleus multiome libraries captures only the 3' end of RNA transcripts, losing transcription start site (TSS) information. Here we demonstrate nanopore sequencing of 10x multiome libraries, which enables the profiling of full-length transcripts. We show concordance with common short-read sequencing-based workflows, including successful genetic demultiplexing of nanopore data despite its higher error rate. We compare TSSs identified using nanopore sequencing of multiome cDNA to those identified using a short-read 5' assay, and provide an optimized approach for the preprocessing of nanopore reads prior to TSS identification. We find that nanopore sequencing of multiome cDNA captures a median of 63% of the TSSs detected by the 5' assay.
bioinformatics 2026-04-02 v1
Decoding antibiotic modes of action from multimodal cellular responses
Hesse, J.; Schum, D.; Leidel, L.; Gareis, L. R.; Herrmann, J.; Müller, R.; Sieber, S. A.
Abstract
Antibiotic resistance continues to rise, yet most new drug candidates act through long-established targets. Faster mode of action (MoA) assessment would enable more effective prioritization of screening hits and help identify compounds with novel mechanisms. In this study, we aimed to develop a scalable framework for MoA inference from antibiotic-induced cellular response profiles in Escherichia coli. We generated a multimodal dataset spanning more than 50 antibiotics, including proteome profiles, chemical structure descriptors, inhibitory concentrations and growth dynamics, and used it to build MAPPER (Mode of Action Prediction via Proteomics-Enhanced Representation), a framework comprising a fixed multimodal predictor and an uncertainty module. MAPPER accurately classified antibiotics across nine mechanistic classes, flagged compounds with likely novel mechanisms and retained predictive power in proteomics-only transfer experiments across mass spectrometry platforms and external data. Together, these results establish MAPPER as an innovative tool for MoA prediction and novelty detection, enabling prioritization of antibacterial candidates with distinct mechanisms.
bioinformatics 2026-04-02 v1
HalluCodon enables species-specific codon optimization using multimodal language models
Lou, Y.; Mao, S.; Wu, T.; Xia, F.; Zhang, Z.; Tian, Y.; Li, Y.; Cheng, Q.; Yan, J.; Wang, X.
Abstract
Codon optimization is widely used in transgenic crop development, plant synthetic biology, and molecular farming to improve heterologous protein expression in plant cells. Increasing availability of plant omics data now enables optimization strategies that account for species-specific sequence features. We developed HalluCodon, a customizable framework that uses multimodal language models to design coding sequences tailored to individual plant species. The framework allows users to fine tune pre-trained protein and RNA language models with their own datasets to build species-specific codon optimization models. The current implementation includes base models trained on coding sequences and proteomes from fifteen plant species. HalluCodon generates coding sequences through a hallucination-based design strategy guided by two predictive modules that evaluate coding sequence naturalness (CodonNAT) and expression potential (CodonEXP). Benchmark tests using representative proteins show that the generated sequences reproduce host-specific codon usage patterns and support high expression levels in plant systems.
bioinformatics 2026-04-02 v1
Evaluating FoldX5.1 for MAVISp Stability Data Collection
Vliora, A.; Tiberti, M.; Papaleo, E.
Abstract
MAVISp (Multi-layered Assessment of VarIants by Structure for proteins) is a structure-based framework for facilitating mechanistic interpretation of missense variants, with protein stability as one of its core analytical layers. When software tools are updated, a key consideration for database curation is whether the new version can be adopted without compromising compatibility with existing entries. This study evaluated the effect of replacing FoldX5 with FoldX5.1 on the results of the MAVISp stability workflow. We compared predicted changes in folding free energy for 539,809 shared variants across 119 proteins. We found a high overall agreement with a mean Pearson correlation of 0.933 and a mean Cohen coefficient of 0.814. Most proteins showed strong concordance, whereas only three (NUPR1, TSC1, and TMEM127) showed poor agreement. The number of disagreements was higher at sites with low AlphaFold2 confidence for NUPR1 and TSC1. These outliers did not display systematic inter-version bias, as mean shifts in folding free energies between versions were minimal. Collectively, these findings support adopting FoldX5.1 for future MAVISp data collection. We will include a transition period, during which existing entries retain FoldX5 annotations until their scheduled annual update, while new or updated entries are processed with FoldX5.1. To facilitate this transition, the FoldX software version has been added as a new metadata annotation in the MAVISp database.
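The two agreement measures reported above can be sketched as follows, assuming the "Cohen coefficient" refers to Cohen's kappa computed on categorical stability classes. The classification cut-offs in `classify` are hypothetical placeholders chosen only to make the example runnable; they are not the MAVISp thresholds, and the free-energy values are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both raters use a single identical category
    return (observed - expected) / (1.0 - expected)


def classify(ddg, destabilizing=3.0, stabilizing=-3.0):
    """Map a folding free-energy change to a class (hypothetical cut-offs, kcal/mol)."""
    if ddg >= destabilizing:
        return "destabilizing"
    if ddg <= stabilizing:
        return "stabilizing"
    return "neutral"


# Invented per-variant values for one protein under the two software versions.
ddg_old = [0.4, 3.6, 5.1, -0.2, 2.9, -3.4]
ddg_new = [0.5, 3.2, 4.8, 0.1, 3.1, -3.0]

classes_old = [classify(v) for v in ddg_old]
classes_new = [classify(v) for v in ddg_new]
print("Pearson r:", round(pearson(ddg_old, ddg_new), 3))
print("Cohen's kappa:", round(cohen_kappa(classes_old, classes_new), 3))
```

Pearson correlation measures agreement on the continuous values, whereas kappa measures whether variants near a cut-off change class between versions; a protein can score well on one and poorly on the other, which is why reporting both is informative.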
bioinformatics 2026-04-02 v1
RastQC: High-Performance Sequencing Quality Control Written in Rust
Huang, K.-l.
Abstract
Quality control (QC) of high-throughput sequencing data is a critical first step in genomics analysis pipelines. FastQC has served as the de facto standard for sequencing QC for over a decade, but its Java runtime dependency introduces startup overhead, elevated memory consumption, and deployment complexity. Here we present RastQC, a complete reimplementation of FastQC in Rust that provides all 12 standard QC modules with matching algorithms, plus 3 additional long-read QC modules, MultiQC-compatible output formats, native MultiQC JSON export, a built-in multi-file summary dashboard, and a web-based report viewer. RastQC also supports SOLiD colorspace reads, Oxford Nanopore Fast5/POD5 formats, standard input streaming, intra-file parallelism, and QC-aware exit codes for workflow integration. We benchmarked RastQC against FastQC v0.12.1 on both synthetic datasets (100K-1M reads) and real whole-genome sequencing data spanning five model organisms: Escherichia coli, Saccharomyces cerevisiae, Drosophila melanogaster, Mus musculus, and Homo sapiens. Despite running 15 modules (vs. 11 in FastQC), RastQC achieves comparable speed while using 4-9x less memory (59-125 MB vs. 551-638 MB). On real genome data, RastQC matches FastQC speed on most organisms while achieving 100% module-level concordance (55/55 module calls identical across all organisms for the 11 shared modules). RastQC compiles to a single 2.1 MB static binary with no external dependencies, representing a 102x reduction in deployment footprint. RastQC is freely available at https://github.com/Huang-lab/RastQC under the MIT license.
bioinformatics 2026-04-02 v1
A structure-informed deep learning framework for modeling TCR-peptide-HLA interactions
Cao, K.; Li, R.; Strazar, M.; Brown, E. M.; Nguyen, P. N. U.; Pust, M.-M.; Park, J.; Graham, D. B.; Ashenberg, O.; Uhler, C.; Xavier, R.
Abstract
The interaction between T cell receptors (TCRs), peptides, and human leukocyte antigens (HLAs) underlies antigen-specific T cell immunity. Despite substantial advances in peptide-HLA presentation prediction, accurate modeling of coupled TCR-peptide-HLA recognition remains underdeveloped, limiting applications such as TCR and neoepitope prioritization in cancer and antigen identification in autoimmunity. Here we present StriMap, a unified framework for predicting TCR-peptide-HLA interactions by integrating physicochemical, sequence-context, and structural features at recognition interfaces. StriMap achieves state-of-the-art performance with improved generalizability and enables applications in both cancer and autoimmunity. As a case study in ankylosing spondylitis (AS), we screened 13 million peptides derived from 43,241 bacterial proteins and identified candidate molecular mimics that were experimentally validated to activate T cells expressing an AS-associated TCR. Notably, a top validated peptide was enriched in patients with inflammatory bowel disease (IBD), suggesting potential shared microbial triggers between AS and IBD. Overall, StriMap provides a generalizable framework for rational immunotherapy design and for dissecting antigenic drivers of autoimmunity.
bioinformatics 2026-04-02 v1
DESPOT: Direction-Enhanced Scoring POTentials
Poelmans, R.; Bruncsics, B.; Arany, A.; Van Eynde, W.; Shemy, A.; Moreau, Y.; Voet, A. R.
Abstract
Knowledge-based potentials (KBPs) have long been used to score protein-ligand interactions, yet existing formulations remain isotropic, capturing only distance dependencies and neglecting the directional preferences that govern molecular recognition. Here, we introduce Direction-Enhanced Scoring POTentials (DESPOT), an anisotropic knowledge-based framework that unifies pose scoring and binding-site characterization within a single probabilistic model. Where classical knowledge-based methods model the probability of observing a distance given an interacting atom pair, DESPOT instead models the conditional probability of observing specific ligand atom types at discretized spatial positions around protein atoms. This inverted probabilistic formulation naturally supports both directional modelling through atom type-specific local reference frames and symmetry-aware geometric discretization, and steric exclusion, encoded as a dedicated void state that explicitly captures the probability that a spatial bin remains unoccupied. Evaluation on the CASF-2016 benchmark shows that DESPOT substantially outperforms isotropic KBPs in all pose-discrimination and virtual screening tasks (p < 0.0001 for all enrichment factors), with the largest gains arising from its ability to penalize geometrically implausible poses. Constrained energy minimization of training structures proves strongly beneficial for the derivation of KBPs, while our train-test leakage analysis reveals that overfitting is an underestimated and understudied issue for KBPs. The resulting anisotropic interaction profiles reveal systematic directional preferences (illustrated here for hydrogen bonds, aromatic interactions, and halogen bonds) that extend beyond idealized geometric models. DESPOT provides a data-driven framework for direction-aware modelling of protein-ligand interactions, with applications in pose scoring, binding-site characterization, and structure-based design.
bioinformatics2026-04-02v1Benchmarking Agentic Bioinformatics Systems for Complex Protein-Set Retrieval: A Coccolithophore Calcification Case Study
Zhang, X.Abstract
Large language model agents are increasingly used for bioinformatics tasks that require external databases, tool use, and long multi-step retrieval workflows. However, practical evaluation of these systems remains limited, especially for prompts whose target set is both large and biologically heterogeneous. Here, I benchmarked three agent systems on the same difficult retrieval task: downloading coccolithophore calcification-related proteins from UniProt across six mechanistically distinct categories, while producing category-separated FASTA files and supporting evidence. The compared systems were Codex app agents extended with Claude Scientific Skills, Biomni Lab online, and DeerFlow 2 with default skills only. Outputs were normalized at the UniProt accession level and compared category by category using overlap analysis, Venn decomposition, and a heuristic relevance assessment of each subset relative to the benchmark prompt. Across the six shared categories, Codex retrieved 2,118 proteins, DeerFlow 6,255, and Biomni 8,752 in a run. Codex showed the best balance between sensitivity and specificity: 92.4% of its proteins fell into subsets labeled high relevance and the remaining 7.6% into medium relevance. DeerFlow was substantially more exhaustive, but 43.8% of its proteins fell into low or low-medium relevance subsets. Biomni produced the largest sets, yet 69.5% of its proteins fell into low or low-medium relevance subsets, mainly due to broad expansion into generic calcium sensors, kinases, transcription factors, and poorly specific domain families. Category-specific analysis showed that Codex was the strongest primary source for inorganic carbon transport, calcium and pH regulation, vesicle trafficking, and signaling, whereas DeerFlow contributed valuable complementary matrix and polysaccharide candidates. 
A second run for each system also separated them strongly by repeatability: Codex had the highest within-system stability (mean category Jaccard 0.982; micro-Jaccard 0.974), DeerFlow was intermediate (0.795; 0.571), and Biomni was least stable (0.412; 0.319). These results suggest that for complex protein-family retrieval tasks, agent quality depends less on raw output volume than on prompt decomposition, taxonomic scoping, exact query generation, provenance-rich export artifacts, and repeated-run stability.
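The repeatability figures can be illustrated with a minimal sketch. The abstract does not define either statistic, so this assumes that "mean category Jaccard" averages the per-category Jaccard index between two runs, and that "micro-Jaccard" pools (category, accession) pairs across all categories before computing a single index. The function names and the toy accessions are hypothetical.

```python
from typing import Dict, Set

Run = Dict[str, Set[str]]  # category name -> set of UniProt accessions


def jaccard(a: Set, b: Set) -> float:
    """Jaccard index |A & B| / |A | B|; taken as 1.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_category_jaccard(run1: Run, run2: Run) -> float:
    """Average the per-category Jaccard index (assumed definition)."""
    categories = sorted(set(run1) | set(run2))
    if not categories:
        return 1.0
    per_category = [
        jaccard(run1.get(c, set()), run2.get(c, set())) for c in categories
    ]
    return sum(per_category) / len(per_category)


def micro_jaccard(run1: Run, run2: Run) -> float:
    """Pool (category, accession) pairs, then compute one index (assumed definition)."""
    pairs1 = {(c, acc) for c, accs in run1.items() for acc in accs}
    pairs2 = {(c, acc) for c, accs in run2.items() for acc in accs}
    return jaccard(pairs1, pairs2)


# Toy data: made-up accessions for two runs of the same system.
run_a = {"transport": {"P1", "P2", "P3"}, "signaling": {"Q1"}}
run_b = {"transport": {"P1", "P2"}, "signaling": {"Q1", "Q2"}}

mean_j = mean_category_jaccard(run_a, run_b)
micro_j = micro_jaccard(run_a, run_b)
```

Under these assumptions the two statistics can diverge when category sizes differ, because the pooled index is dominated by the largest categories while the mean weights every category equally.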
bioinformatics2026-04-02v1The U-method: Leveraging expression probability for robust biological marker detection
Stein, Y.; Lavon, H.; Hindi Malowany, M.; Arpinati, L.; Scherz-Shouval, R.Abstract
Reliable identification of cluster-defining markers is fundamental to single-cell transcriptomic analysis, yet current approaches often rely on average expression differences, which can dilute biologically informative signals in sparse and heterogeneous data. Here we introduce the U-method, a fast probability-based framework for identifying uniquely expressed genes (UEGs) by contrasting the expression probability of a gene within a cluster with its highest expression probability in any other cluster. This highest-probability comparison prioritizes detection consistency over expression magnitude, resulting in markers that consistently identify cell populations across independent datasets analyzed at comparable clustering resolutions. Applied to colorectal, breast, pancreatic, and lung cancer single-cell RNA-sequencing datasets, the U-method identifies canonical lineage markers together with additional genes showing clear cluster specificity. When projected onto Visium HD spatial transcriptomics data using only raw average expression of top UEGs, these signatures reveal coherent and biologically interpretable tissue organization without the need for smoothing, deconvolution, or model-based spatial inference. These results position the U-method as a practical implementation of detection consistency, enabling robust marker discovery and spatial interpretation in single-cell analysis.
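The highest-probability comparison can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: "expression probability" is taken to be the fraction of cells in a cluster with a nonzero count, and the "contrast" is taken to be a simple difference. The function names and toy matrix are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def detection_probabilities(
    counts: Sequence[Sequence[float]], labels: Sequence[str]
) -> Dict[str, List[float]]:
    """Per cluster, the fraction of cells with a nonzero count for each gene."""
    n_genes = len(counts[0])
    n_cells: Dict[str, int] = defaultdict(int)
    n_detected: Dict[str, List[int]] = defaultdict(lambda: [0] * n_genes)
    for row, label in zip(counts, labels):
        n_cells[label] += 1
        for g, value in enumerate(row):
            if value > 0:
                n_detected[label][g] += 1
    return {c: [d / n_cells[c] for d in n_detected[c]] for c in n_cells}


def u_scores(
    counts: Sequence[Sequence[float]], labels: Sequence[str]
) -> Dict[str, List[float]]:
    """Score = P(detected | cluster) - max over other clusters (assumed form).

    Requires at least two clusters.
    """
    prob = detection_probabilities(counts, labels)
    scores = {}
    for cluster, p in prob.items():
        others = [q for other, q in prob.items() if other != cluster]
        scores[cluster] = [
            p[g] - max(q[g] for q in others) for g in range(len(p))
        ]
    return scores


# Toy matrix: 4 cells (rows) x 2 genes (columns), two clusters.
counts = [[5, 0], [2, 1], [0, 3], [0, 0]]
labels = ["A", "A", "B", "B"]
scores = u_scores(counts, labels)
```

In the toy data, gene 0 is detected in every cell of cluster A and in no cell of cluster B, so it receives the maximum score for A, while gene 1 is detected in half of each cluster and scores zero in both.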
bioinformatics2026-04-02v1Generating and navigating single cell dynamics via a geodesic bridge between nonlinear transcriptional and linear latent manifolds
Zhu, J.; Zhang, Z.; Sun, Y.; Dai, H.; Wen, H.; Zhou, P.; Chen, L.Abstract
Time-series single-cell RNA sequencing (scRNA-seq) captures cellular processes as sparse and unpaired snapshots, limiting our ability not only to reconstruct continuous cell state transitions, but also to navigate between states in a controlled and interpretable manner. Here we present GeoBridge, a framework modeling cellular dynamics as geodesic trajectories on the transcriptional manifold based on our isometric geodesic theory, which theoretically and computationally transforms time-varying nonlinear transcriptional geodesics (original nonlinear manifold) into constant-velocity straight-line geodesics (latent linear manifold) by a learned geodesic bridge. In such learned geodesic space, continuous interpolation becomes biologically meaningful, enabling reconstruction of unobserved intermediate states and efficient navigation between distinct cellular phenotypes at a single-cell resolution. By mapping interpolated trajectories back to the original gene expression space, GeoBridge recovers smooth transcriptional programs that are robust to noise and snapshot sparsity. Leveraging the derived geodesic potentials, GeoBridge further infers pseudo-temporal trajectories from single-snapshot scRNA-seq data without temporal annotation, and directly identifies genes that drive progression along geodesic paths. Across diverse biological systems, GeoBridge accurately resolves developmental dynamics, generates unmeasured intermediate states, identifies dynamic driver genes, and more significantly, enables navigable transitions across multiple differentiation endpoints. Together, GeoBridge establishes a principled method that transforms sparse single-cell measurements into a continuous, controllable landscape for the reconstruction, navigation and manipulation of cellular state transitions.
bioinformatics2026-04-02v1CardamomOT: a mechanistic optimal transport-based framework for gene regulatory network inference, trajectory reconstruction and generative modeling
Mauge, Y.; Ventre, E.Abstract
A key challenge in inferring gene regulatory networks (GRNs) governing cellular processes such as differentiation and reprogramming from experimental data lies in the impossibility of directly measuring protein dynamics at the single-cell level, which prevents establishing causal relationships between regulator activity and target responses. In earlier work, we introduced CARDAMOM, an algorithm that uses temporal snapshots of scRNA-seq data to calibrate a GRN-driven mechanistic model of gene expression. However, this method had several limitations: it could only rely on the relative ordering of time points rather than their exact labels, imposed restrictive quasi-stationary assumptions on protein dynamics, and depended on multiple hyperparameters. Here, we present CardamomOT, a new method based on the same mechanistic model that jointly reconstructs the GRN and unobserved protein trajectories from the data within a mechanistic optimal transport framework. By incorporating exact time labels and priors on protein kinetic rates from the literature, and substantially reducing the number of required hyperparameters, our approach addresses these limitations and substantially improves the accuracy and robustness of GRN calibration. We validate our framework on both in silico and experimental datasets, demonstrating computational scalability and consistently improved performance over state-of-the-art methods in both GRN and trajectory reconstruction. In particular, CardamomOT accurately recovers velocity fields driving cellular trajectories and unobserved protein levels, alongside reliable GRN structures. We also show that these improvements make the calibrated mechanistic model suitable to be used as a generative model to predict cellular responses to unseen perturbations. 
To our knowledge, this is among the first methods to explicitly integrate mechanistic GRN inference, trajectory reconstruction, and simulation of realistic datasets into a unified framework for scRNA-seq time series analysis.
bioinformatics2026-04-02v1mRNA-GPT: A Generative Model for Full-Length mRNA Design and Optimization
Li, S.; Chauvin, P.; Gross, O.; Bailey, M.; Jager, S.Abstract
We introduce mRNA-GPT, a generative model for end-to-end full-length mRNA sequence design and optimization. Unlike existing approaches that optimize isolated regions, mRNA-GPT jointly optimizes across all three regions (5' UTR, CDS, and 3' UTR) to capture long-range sequence dependencies and cross-region regulatory interactions critical for therapeutic efficacy. The model is pre-trained on 30 million full-length natural mRNA sequences across diverse species and organisms, establishing a robust foundation for sequence generation. We employ Reinforcement Learning (RL), specifically Proximal Policy Optimization (PPO) with oracle-based reward signals, to directly and iteratively optimize target properties, such as half-life and translation efficiency. mRNA-GPT supports flexible generation modes: single regions (UTR or CDS alone), full-length sequences, or generation of any region conditioned on any other region. Through multi-objective optimization, mRNA-GPT achieves Pareto-optimal designs that balance competing properties without sacrificing performance on either objective. mRNA-GPT demonstrates superior design capabilities compared to state-of-the-art methods, achieving enhanced performance in 3' UTR stability optimization, CDS translation rate enhancement, and comprehensive full-length sequence design.
bioinformatics2026-04-02v1Inferring a novel insecticide resistance metric and exposure variability in mosquito bioassays across Africa
Denz, A.; Kont, M. D.; Sanou, A.; Churcher, T. S.; Lambert, B.Abstract
Malaria claims approximately 500,000 lives each year, and insecticide-treated nets (ITNs), which kill mosquitoes that transmit the disease, remain the most effective intervention. However, resistance to pyrethroids, the primary insecticide class used in ITNs, has risen dramatically in Africa, making it difficult to assess the current public health impact of pyrethroid-ITNs. Past work has modelled the relation between pyrethroid susceptibility measured in discriminating-dose susceptibility bioassays and ITN effectiveness in experimental hut trials. Here, we introduce a new predictive approach that accounts for heterogeneity in insecticide resistance within wild mosquito populations, for example, due to genetic variability, by incorporating data from newly recommended intensity-dose susceptibility bioassays. We fit our mathematical model to a comprehensive data set that combines discriminating dose bioassays from all over Africa, intensity dose bioassays from Burkina Faso, and concurrent experimental hut trials. Our analysis estimates location- and insecticide-specific variation in resistance heterogeneity in Burkina Faso and quantifies differences in insecticide exposure in bioassays and experimental huts. By providing a mechanistic understanding of these experimental data, our approach could be integrated into malaria transmission models to account for the public health impact of insecticide resistance detected by surveillance programmes.
bioinformatics2026-04-01v5Finding stable clusterings of single-cell RNA-seq data
Klebanoff, V. F.Abstract
Run a UMI count matrix through a clustering pipeline to obtain n cell clusters. Suppose that counts from the same experiment for an equal number of additional cells become available. Would including them change the results? Form the matrix containing both sets of counts, process it to obtain n clusters, restrict this (second) clustering to the initial cells and compare it with the initial clustering. If the clusterings are not consistent, conclude that the initial clustering is unstable. Although this scenario is unrealistic, it is practical to reverse the perspective: given a clustering, process samples of half of the cells. If their clusters are consistent with those of the full set of cells restricted to the samples, conclude that the clustering is stable. We use divisive hierarchical spectral clustering and describe a possibly novel mapping of the tree it produces to a set of nested clusterings. Positive affinities are defined for points (representing cells in Euclidean space) that are k-nearest neighbors (k is an input parameter). The affinity equals the inverse of the distance between the points. Ng, Jordan, and Weiss' algorithm divides a set of points into two clusters. The normalized cut measures the clusters' separation. Recursion generates a hierarchy of clusters. Viewing clusters as nodes of a tree, set the length of the branch between a node and each of its daughters to the normalized cut. Nodes' distances from the root define the mapping of the tree to nested clusterings. For four large data sets, this gave clusterings compatible with published results. Analysis is performed for all cells and for multiple pairs of complementary samples of cells. For a given number of clusters, each sample's clustering and clusters are compared with those of the full data set (restricted to the sample). This provides measures of the stability of the clustering and its clusters. 
For two of the large data sets, the clusterings compatible with published results were judged to be stable.
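The affinity definition in the abstract (positive only between k-nearest neighbours, equal to the inverse of their distance) can be sketched as below. Two details are assumptions, since the abstract does not state them: the matrix is symmetrized so that a pair is connected if either point is among the other's k nearest neighbours, and coincident points are skipped to avoid division by zero. The function name is hypothetical.

```python
import math
from typing import List, Sequence


def knn_affinity(points: Sequence[Sequence[float]], k: int) -> List[List[float]]:
    """Symmetric affinity matrix: 1/distance for k-nearest-neighbour pairs, else 0."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    affinity = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Indices of the k closest other points to point i.
        neighbours = sorted(
            (j for j in range(n) if j != i), key=lambda j: dist[i][j]
        )[:k]
        for j in neighbours:
            if dist[i][j] > 0:  # assumption: skip coincident points
                affinity[i][j] = affinity[j][i] = 1.0 / dist[i][j]
    return affinity


# Three points on a line, one nearest neighbour each.
aff = knn_affinity([(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)], k=1)
```

A matrix of this kind is the usual input to spectral methods such as the Ng, Jordan, and Weiss algorithm named in the abstract; the bipartition and normalized-cut steps are not shown here.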
bioinformatics2026-04-01v3Adaptive Cluster-Count Autoencoders with Dirichlet Process Priors for Geometry-Aware Single-Cell Representation Learning
Fu, Z.Abstract
Standard autoencoders for single-cell transcriptomics learn latent spaces whose cluster structure emerges only post hoc through K-means or community detection, leaving cluster count and boundary quality uncontrolled during training. Here we ask whether imposing an adaptive nonparametric prior can shift this balance. We equip a feedforward autoencoder with an online Dirichlet Process Mixture Model (DPMM) prior that refits cluster assignments throughout training and directly regularizes latent compactness and separation. Across 56 scRNA-seq datasets the DPMM prior produces a pronounced geometry-concordance trade-off: cluster compactness (ASW) improves by 127% and Davies-Bouldin overlap drops by 47%, but label-recovery metrics decline (NMI -17%, ARI -21%) and downstream kNN accuracy falls from 0.784 to 0.725. Wilcoxon signed-rank tests confirm that the geometry gains are significant with large Cliff's δ effects while concordance losses remain bounded and non-significant. A second-stage conditional-flow refinement (DPMM-FM) further improves projection fidelity (DRE 0.751, LSE 0.695, DREX 0.873) at additional concordance cost, revealing a three-tier operating regime: prior-free for label recovery, DPMM for manifold geometry, and DPMM-FM for visualization fidelity. Against 18 external baselines DPMM-Base wins 70.5% of core-metric comparisons (p < 0.05). Gene Ontology enrichment confirms that geometry-improved latent components recover coherent biological programs. Rather than claiming universal superiority, this study characterizes the operating envelope of nonparametric mixture priors and identifies the task contexts (trajectory analysis, manifold visualization, and program-level annotation) where adaptive geometric structure outweighs label-counting accuracy.
bioinformatics2026-04-01v2Searching the Druggable Genome using Large Language Models
Schimmelpfennig, L. E.; Cannon, M.; Cody, Q.; McMichael, J.; Coffman, A.; Kiwala, S.; Krysiak, K. J.; Wagner, A. H.; Griffith, M.; Griffith, O. L.Abstract
The druggable genome encompasses the genes that are known or predicted to interact with drugs. The Drug-Gene Interaction Database (DGIdb) provides an integrated resource for discovering and contextualizing these interactions, supporting a broad range of research and clinical applications. DGIdb is currently accessed through structured web interfaces and API calls, requiring users to translate natural-language questions into database-specific query patterns. To allow for the use of DGIdb through natural language, we developed the DGIdb Model Context Protocol (MCP) server, which allows large language models (LLMs) access to up-to-date information through the DGIdb API. We demonstrate that the MCP server greatly enhances an LLM's ability to answer questions requiring accurate, up-to-date biomedical knowledge drawn from structured external resources. Availability and implementation: The DGIdb MCP server is detailed at https://github.com/griffithlab/dgidb-mcp-server and includes instructions for accessing the server through the Claude desktop app.
bioinformatics2026-04-01v2Simplex-Constrained Neural Topic VAEs with Flow Refinement for Interpretable Single-Cell Gene-Program Discovery
Fu, Z.Abstract
Variational autoencoders for single-cell transcriptomics typically learn Gaussian latent spaces that lack part-based interpretability: individual latent dimensions carry no inherent biological meaning and the decoder provides no explicit gene-program readout. We introduce Topic-FM, a family of neural topic VAEs in which a logistic-normal Dirichlet prior constrains the latent vector to the probability simplex, turning each coordinate into a topic proportion and the decoder weight matrix into a directly readable topic-gene signature. A conditional optimal-transport flow field, trained entirely in pre-softmax ℝ^K, sharpens posterior geometry without modifying the decoder or breaking simplex validity. Unlike nonparametric mixture priors that improve geometry at the expense of label concordance, Topic-FM improves all core metrics simultaneously: across 56 scRNA-seq datasets, Topic-FM-Transformer raises NMI by 8.2%, ARI by 20.4%, and ASW by 21.7% relative to prior-free Pure-VAE (composite 0.502 vs. 0.434, +15.6%). Wilcoxon signed-rank tests confirm significance with medium-to-large Cliff's δ effects on all three metrics; no concordance-geometry trade-off is observed. Downstream kNN classification improves by 13.5% in accuracy and 27.7% in macro-F1. Among four architectural variants, Topic-FM-Contrastive achieves the highest external core win rate (86.4% against 23 baselines), while Topic-FM-Transformer leads on composite score and supervised discrimination. Dual-pathway biological validation (perturbation importance and direct decoder-β readout) yields convergent GO enrichment, demonstrating that the learned topics correspond to coherent, annotatable gene programs rather than opaque embedding dimensions.
bioinformatics2026-04-01v2A Convolutional Deep Learning Approach to identify DNA Sequences for Gene Prediction
Motta, J. A.; Gomez, P. D.Abstract
In this work, we present a highly efficient machine learning method for identifying DNA sequences that code for genes. The learning process is based on Human Genome Build 38 (GRCh38) sequences extracted from various specialized databases. The sequences were then translated into amino acid sequences and used to build matrices that facilitate feature extraction with the TF-IDF vectorization method to create the training space. The prediction functions are learned using a convolutional neural network (CNN) deep learning model. The training spaces were created using the 24 chromosomes of the human genome and approximately 36,000 genes and pseudogenes whose names were fetched from the HUGO Gene Nomenclature Committee (HGNC). Performance analysis was performed on 24 genes associated with genetic disorders, as well as the surrounding DNA regions. The metrics used were precision, recall, F-score, accuracy, and ROC curves for the genes of interest. The results achieved exceed all our expectations and place the work at the level of the state of the art for gene prediction.
bioinformatics2026-04-01v2On the Comparison of LGT networks and Tree-based Networks
Marchand, B.; Tahiri, N.; Tremblay-Savard, O.; Lafond, M.Abstract
Phylogenetic networks are widespread representations of evolutionary histories for taxa that undergo hybridization or Lateral Gene Transfer (LGT) events. There are now many tools to reconstruct such networks, but no clearly established metric to compare them. Such metrics are needed, for example, to evaluate predictions against a simulated ground truth. Despite years of effort in developing metrics, known dissimilarity measures either do not distinguish all pairs of different networks, or are extremely difficult to compute. Since it appears challenging, if not impossible, to create the ideal metric for all classes of networks, it may be relevant to design them for specialized applications. In this article, we introduce a metric on LGT networks, which consist of trees with additional arcs that represent lateral gene transfer events. Our metric is based on edit operations, namely the addition/removal of transfer arcs, and the contraction/expansion of arcs of the base tree, allowing it to connect the space of all LGT networks. We show that it is linear-time computable if the order of transfers along a branch is unconstrained but NP-hard otherwise, in which case we provide a fixed-parameter tractable (FPT) algorithm in the level. We implemented our algorithms and demonstrate their applicability on three numerical experiments.
bioinformatics2026-04-01v2Protein Language Model Decoys for Target Decoy Competition in Proteomics: Quality Assessment and Benchmarks
Reznikov, G.; Kusters, F.; Mohammadi, M.; van den Toorn, H. W. P.; Sinitcyn, P.Abstract
Large-scale proteomics relies heavily on target-decoy competition for false discovery rate estimation in peptide identification, and the performance of this strategy depends strongly on the design of the decoy database. Classical generators such as reversal and shuffling remain widely used. Here, we introduce protein language model-based (PLM) decoy generation for peptide identification and benchmark it against classical strategies. We evaluate these approaches using three complementary quality-control layers: sequence-based separability, search-engine-agnostic spectral-space diagnostics, and end-to-end mass spectrometry benchmarks, including pipelines with rescoring. Across these analyses, PLM-based decoys are harder for sequence-only neural networks to distinguish than most classical generators, suggesting fewer obvious sequence-level artifacts. However, this signal is only weakly informative for search performance. Spectral diagnostics further show that short peptides occupy a particularly crowded target-decoy space and are therefore especially prone to local collisions across all generators. In full search pipelines, reverse decoys remain a strong baseline, and current PLM-based generators do not yet provide a clear overall advantage. We therefore view PLM-based decoys not as universal replacements for reverse decoys, but as tunable tools for benchmarking, diagnostics, stress testing, and future adaptive decoy optimization, with increasing value as search models become more expressive.
bioinformatics2026-04-01v2Benchmark of biomarker identification and prognostic modeling methods on diverse censored data
Fletcher, W. L.; Sinha, S.Abstract
The practices of identifying biomarkers and developing prognostic models using genomic data have become increasingly prevalent. Such data often feature characteristics that make these practices difficult, namely high dimensionality, correlations between predictors, and sparsity. Many modern methods have been developed to address these problematic characteristics while performing feature selection and prognostic modeling, but a large-scale comparison of their performance in these tasks on diverse right-censored time-to-event data (also known as survival time data) is much needed. We have compiled many existing methods, including some machine learning methods, several of which have performed well in previous benchmarks, and compared them primarily on variable selection capability and secondarily on survival time prediction, using many synthetic datasets with varying levels of sparsity, correlation between predictors, and signal strength of informative predictors. For illustration, we have also performed multiple analyses on a publicly available and widely used cancer cohort from The Cancer Genome Atlas using these methods. We evaluated the methods through extensive simulation studies in terms of the false discovery rate, F1-score, concordance index, Brier score, root mean square error, and computation time. Of the methods compared, CoxBoost and the Adaptive LASSO performed well in all metrics, and the LASSO and elastic net excelled when evaluating concordance index and F1-score. The Benjamini-Hochberg and q-value procedures showed volatile performance in controlling the false discovery rate. The performance of some methods was greatly affected by differences in the data characteristics. With our extensive numerical study, we have identified the best performing methods for a plethora of data characteristics using informative metrics. This will help cancer researchers in choosing the best approach for their needs when working with genomic data.
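Of the metrics listed, the concordance index is the one most specific to right-censored data, and a minimal sketch may help. It follows the common Harrell-style definition: a pair is comparable when the subject with the shorter time had an observed event, and it is concordant when that subject also has the higher predicted risk. The handling of ties (risk ties count one half, time ties are skipped) is an assumption, and the abstract does not say which variant the benchmark used.

```python
from typing import Sequence


def concordance_index(
    times: Sequence[float], events: Sequence[bool], risks: Sequence[float]
) -> float:
    """Harrell-style C-index for right-censored data.

    times:  observed follow-up times
    events: True if the event was observed, False if censored
    risks:  predicted risk scores (higher = expected earlier event)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # assumption: risk ties count one half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data: the second subject is censored at time 4.
c_index = concordance_index(
    times=[2, 4, 6, 8],
    events=[True, False, True, True],
    risks=[0.9, 0.3, 0.1, 0.5],
)
```

In the toy data four pairs are comparable and three are ordered correctly, so the index is 0.75; a value of 0.5 corresponds to random ranking and 1.0 to perfect ranking.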
bioinformatics2026-04-01v1Serum metabolic signatures of cognitive resilience in a longitudinal aging cohort
Scheurink, T. A. W.; Seo, J. I.; David, L. C.; Wang, C. X.; Solis, D.; Zemlin, J.; Bergstrom, J.; Dorrestein, P. C.; Mohanty, I.; Molina, A. J. A.Abstract
Aging is typically accompanied by a progressive decline in cognitive function, yet some individuals maintain exceptional cognitive performance, even across the transition from middle to older age, defining exceptional cognitive resilience. While existing measures of resilience primarily rely on clinical assessments, its molecular determinants and early predictive markers remain poorly understood. Here, we performed untargeted LC-MS/MS profiling of longitudinal serum samples to identify metabolic signatures associated with cognitive resilience, which was established based on cognitive tests conducted over 28 years in a cohort of 237 participants. We observed associations across multiple chemical classes, including carnitines, glutamine conjugates, phosphocholines, as well as diet- and drug-derived metabolites. Chemical class-specific analyses revealed distinct metabolic profiles, including predominantly negative associations of medium-chain acylcarnitines with cognitive resilience, increased accumulation of glucuronide conjugates in individuals with low cognitive resilience, altered metabolism of the antihypertensive drug, metoprolol, and elevated levels of dietary compounds such as piperine and lutein in individuals with high cognitive resilience. By leveraging public metabolomics data, we further contextualized the metabolic signatures with respect to their organ specificity, microbial origin, and disease associations. Collectively, these metabolic features, including several previously underexplored compounds, represent promising candidates for functional characterization in mechanisms of aging biology and provide mechanistic insights into the molecular basis of cognitive resilience.
bioinformatics2026-04-01v1Subcellular Localization Constrains Protein Detectability and Reveals Systematic RNA-Protein Discordance Across Cancers
Joshi, K.; Kate, S.Abstract
Transcript abundance is widely used as a proxy for protein expression in cancer studies; however, mRNA levels often fail to predict protein detectability due to post-transcriptional and compartment-specific regulatory processes. Here, we present a machine learning framework that integrates RNA expression, gene-level attributes, and subcellular localization to model protein detectability across human cancers. Leveraging transcriptomic data from TCGA, TARGET, and GTEx, and protein annotations from the Human Protein Atlas, we constructed a dataset comprising over 100,000 gene-cancer pairs across seven tumor types. Models based on RNA features alone achieved moderate predictive performance (ROC-AUC ~0.71), whereas incorporating subcellular localization significantly improved accuracy (ROC-AUC ~0.82). Paired bootstrap analysis confirmed that these gains were statistically robust. We further identify a substantial set of genes with high transcript abundance yet absent protein detection, revealing widespread RNA-protein decoupling. These discordant genes are enriched in mitochondrial, metabolic, and translational regulatory pathways, suggesting that discordance reflects structured biological processes rather than stochastic variation. Together, our results demonstrate that cellular context, particularly subcellular localization, is a key determinant of protein detectability and underscore the limitations of transcript-centric interpretations in cancer genomics.
bioinformatics2026-04-01v1The human pangenome reference reduces ancestry-related biases in somatic mutation detection
Pham, C. V. K.; Abdelmalek, F. S. A.; Hua, T.; Apel, E.; Bizjak, A.; Schmidt, E. J.; Houlahan, K. E.Abstract
Commonly used human reference genomes collapse extensive genetic variability into a single linear genome of which 70% is derived from one donor. These linear genomes fail to capture the full spectrum of genetic variation, which can lead to misalignment of sequencing reads, particularly for individuals underrepresented by the linear reference genomes. To address this shortcoming, the Human Pangenome Reference Consortium released the first draft of the human pangenome reference, a graph-based reference that integrates diverse haplotypes. While the human pangenome reference has shown increased accuracy in detecting inherited DNA variants, it remains to be seen if the observed improvements extend to somatic mutation detection. Here, we systematically benchmarked somatic single nucleotide variant (SNV) detection leveraging the human pangenome in 30 whole-exome-sequenced bladder tumours with matched blood tissue from individuals of diverse ancestries. We found somatic SNV detection leveraging the human pangenome reference outperformed the linear reference, most notably in individuals of East Asian ancestry where we observed on average a 20% improvement in detection accuracy. Improvements to detection accuracy in individuals of European ancestry were marginal. The increase in accuracy was attributed to reduced germline contamination and reduced reference bias. Further, we demonstrate the pangenome increases SNV detection precision, mitigating the need for time-consuming and computationally expensive ensemble approaches that take the consensus across multiple tools. Finally, we demonstrate that the increased precision when aligned to the pangenome generalized to an additional 29 lung adenocarcinoma tumours, particularly for individuals of East Asian ancestry. These findings support adoption of the pangenome to improve somatic variant detection and reduce ancestry-related disparities.
bioinformatics2026-04-01v1Accurate detection of mosaic mutations at short tandem repeats from bulk sequencing data
Wang, W.; Li, W.; Wang, C.; Fan, W.; Xia, Y.; Yang, X.; Chu, C.; Dou, Y.Abstract
Short tandem repeats (STRs) are among the most mutable regions of the human genome, yet their somatic mosaicism remains poorly characterized due to the technical challenges of distinguishing genuine mutations from high intrinsic polymorphism and sequencing noise. Here, we introduce BulkMonSTR, a computational framework that combines STR-specific error modelling with machine-learning classification to enable accurate detection of mosaic STR mutations from bulk next-generation sequencing data. BulkMonSTR identifies nucleotide-resolution mutations, including insertions, deletions, and single-nucleotide variants (SNVs), and supports both control-independent and case-control study designs. Leveraging a comprehensive training dataset derived from pedigree-based validation and in silico spike-in simulations, our random forest classifier effectively discriminates true mosaic events from germline variants and technical artifacts. Benchmarking on simulated and real datasets demonstrates that BulkMonSTR achieves substantially improved precision and F1 scores across diverse coverages and variant allele frequencies. In normal samples, cancer samples and controlled in silico mixing experiments, BulkMonSTR consistently outperforms existing methods, capturing a broader spectrum of STR mutations, including those arising on non-reference alleles, while achieving high validation rates. By enabling systematic, genome-wide interrogation of STR mosaicism, BulkMonSTR provides a scalable foundation for investigating the contributions of somatic STR mutations to aging and disease.
bioinformatics2026-04-01v1Protein Language Models Outperform BLAST for Evolutionarily Distant Enzymes: A Systematic Benchmark of EC Number Prediction
Sathyamoorthy, R.; Puri, M.Abstract
Accurate prediction of Enzyme Commission (EC) numbers is foundational to genome annotation, metabolic reconstruction, and enzyme engineering. Protein language models (PLMs) have transformed protein function prediction, yet their systematic evaluation for EC number prediction across architectures, EC hierarchy levels, and sequence identity thresholds is lacking. Here we present a comprehensive benchmark of three PLMs (ESM2-650M, ESM2-3B, ProtT5-XL) combined with nine downstream neural architectures, evaluated across four EC hierarchy levels and four sequence identity thresholds with 1,296 trained models in total. Our results establish that simple MLP classifiers achieve 98.0% accuracy at EC1, 96.9% at EC2, 96.6% at EC3, and 97.0% at EC4, matching or marginally exceeding a train-set-matched BLASTp baseline (+/-0.7 pp) for in-distribution proteins. Crucially, PLM-based methods dramatically outperform BLAST for evolutionarily distant eukaryotes: gains reach +31.8 pp over a fair 90K-sequence BLAST baseline (Giardia lamblia) and +26.4 pp over a full 520K SwissProt database (Trichomonas vaginalis). For held-out prokaryotic proteomes, PLMs outperform BLAST by a mean of +16.9 pp at EC4. Our benchmark reveals that (i) MLP architectures are sufficient and consistently superior to CNN/ResNet/Transformer variants, (ii) ESM2-650M is statistically distinguishable from but practically equivalent to the 5x larger ESM2-3B, and (iii) Transformer re-encoding of PLM embeddings fails at a shared learning rate due to convergence instability. All code, models, and benchmark results are available at https://github.com/r-mbio/plm_benchmark.git
bioinformatics2026-04-01v1VicMAG, an open-source tool for visualizing circular metagenome-assembled genomes highlighting bacterial virulence and antimicrobial resistance
Tsuda, Y.; Tanizawa, Y.; Vu, T. M. H.; Nishimura, Y.; Shintani, M.; Abe, H.; Hasebe, F.; Kasuga, I.; Nagao, M.; Suzuki, M.Abstract
Bacterial pathogens spread in clinical and environmental settings, and mobile genetic elements (MGEs), such as plasmids and phages, mediate the transfer of virulence factor genes (VFGs) and antimicrobial resistance genes (ARGs) among bacterial communities. Metagenomic analysis of environmental and wastewater samples using highly accurate long-read sequencing technologies, such as PacBio HiFi sequencing, provides valuable insights into monitoring the regional spread of VFGs and ARGs, including dissemination mediated by MGEs. No visualization tool is currently available for the comprehensive display of numerous resulting circular metagenome-assembled genomes (cMAGs) with functional gene annotations. Here, we developed VicMAG, a visualization tool for highly complex cMAGs derived from long-read metagenome assemblies annotated using updated databases of VFGs, ARGs, and MGEs. Using 353 cMAGs from PacBio HiFi sequencing of a wastewater sample, we demonstrated the utility of VicMAG for metagenome visualization. VicMAG provides comprehensive, size-aware visualization of cMAGs representing bacterial chromosomes and plasmids, annotated with VFGs, ARGs, and phages. By simultaneously visualizing all cMAGs in a framework, VicMAG facilitates a holistic understanding of the distribution and genomic context of VFGs and ARGs across complex microbial communities. This tool supports integrated surveillance of bacteria associated with virulence and antimicrobial resistance across clinical, environmental, and One Health contexts.
bioinformatics2026-04-01v1Assessing the potential of bee-collected pollen sequence data to train machine learning models for geolocation of sample origin
Hayes, R. A.; Kern, A. D.; Ponisio, L. C.Abstract
Pollen is a robust and widespread substance that captures a historical snapshot of a specific time and place, and it can be used to track movements through space by examining the pollen deposited on various objects. Palynology, the study of pollen, is used across fields such as conservation, natural history, and forensics, where it is particularly useful for tracing the origin and movement of objects. However, pollen has remained underutilized due to the difficulty of distinguishing many pollen taxa beyond the family level and limited pollen reference material to support location predictions. With recent developments in pollen DNA metabarcoding these issues have been rectified, but much of the available pollen data comes from wind-pollinated species, which are widespread and less informative of specific sample locations. Bee-collected pollen presents an untapped resource for training predictive models to geolocate sample origin. Here we compiled bee-collected pollen DNA sequence relative abundance data from three projects in the western U.S. and assessed the accuracy of supervised machine learning models to predict the location of sample origin based solely on pollen assemblage, without incorporating additional data. Random Forest and k-Nearest Neighbors models yielded high accuracy across all projects. We also found that models trained on taxonomically clustered pollen amplicon sequence variants (ASVs) performed slightly better than those trained on raw sequence data, but the difference was minor, indicating that models trained on raw sequence data can reliably predict location and avoid the time-consuming taxonomic assignment process. Our results demonstrate the utility of repurposing bee-collected pollen for geolocation and provide a framework for employing supervised machine learning in future geolocation efforts.
bioinformatics2026-04-01v1IMMREP25: Unseen Peptides
Richardson, E.; Aarts, Y. J. M.; Altin, J. A.; Baakman, C. A. B.; Bradley, P.; Chen, B.; Clifford, J.; Dhar, M.; Diepenbroek, D.; Fast, E.; Gowthaman, R.; He, J.; Karnaukhov, V.; Marzella, D. F.; Meysman, P.; Nielsen, M.; Nilsson, J. B.; Deleuran, S. N.; Parizi, F. M.; Pelissier, A.; Pierce, B. G.; Rodriguez Martinez, M.; Roran A R, D.; Saravanakumar, S.; Shao, Y.; Smit, N.; Van Houcke, M.; Visani, G. M.; Wan, Y.-T. R.; Wang, X.; Woods, L.; Wuyts, S.; Xiao, C.; Xue, L. C.; IMMREP25 Participant Consortium, ; Barton, J.; Noakes, M.; May, D. H.; Peters, B.Abstract
T cell receptors (TCRs) can bind to peptides presented by MHC molecules (pMHC) as a first step to trigger a T cell response. Reliable approaches to predict TCR:pMHC binding would have broad applications in clinical diagnostics, therapeutics, and the fundamental understanding of molecular interactions. IMMREP is a community-organized series of prediction contests that asks participants to predict TCR:pMHC binding on unpublished datasets. Previous iterations in 2022 and 2023 showed multiple approaches can predict TCR:pMHC binding with significant accuracy (median AUC_0.1 ≥ 0.7) for peptides where experimental data is available ("seen" peptides). In contrast, models did not outperform random guessing for peptides that have no such data available ("unseen" peptides). Here we report on the results of IMMREP25, which focused solely on unseen peptides in order to evaluate the cutting edge of the field. We received 126 named submissions predicting the specificity of 1,000 TCRs against twenty unseen peptides restricted by one of two MHC molecules (HLA-A*02:01 and HLA-B*40:01). The best performing methods showed a macro-AUC_0.1 of 0.60, significantly better than random, demonstrating significant advances in the field. The top performing methods incorporated structural modeling into their approach, indicating that especially for "unseen" peptides, a structural understanding aids in the prediction of TCR:pMHC interactions. The results from this benchmark highlight the significant challenges remaining for TCR:pMHC predictions and will inform future method development.
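The AUC_0.1 metric quoted in this abstract is a partial area under the ROC curve restricted to false positive rates of at most 0.1. The sketch below shows one common way to compute it in plain Python. Two things are assumptions rather than details confirmed by the abstract: the McClish standardization (which rescales the partial area so that random ranking scores 0.5 and perfect ranking scores 1.0), and the reading of "macro" as an unweighted mean over peptides. The peptide names and scores are made up.

```python
from typing import Dict, List, Sequence, Tuple


def roc_points(labels: Sequence[int], scores: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (FPR, TPR) points, sweeping the threshold from high to low scores."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Emit a point only after the last item in a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points


def partial_auc(labels: Sequence[int], scores: Sequence[float], max_fpr: float = 0.1) -> float:
    """Standardized partial AUC over FPR in [0, max_fpr] (assumed McClish rescaling)."""
    area = 0.0
    pts = roc_points(labels, scores)
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Linearly interpolate the TPR where the segment crosses max_fpr.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0  # trapezoid rule
    min_area = 0.5 * max_fpr ** 2  # area under the chance diagonal
    max_area = max_fpr             # area under a perfect curve
    return 0.5 * (1.0 + (area - min_area) / (max_area - min_area))


def macro_partial_auc(per_peptide: Dict[str, Tuple[Sequence[int], Sequence[float]]],
                      max_fpr: float = 0.1) -> float:
    """Unweighted mean of the per-peptide partial AUCs."""
    values = [partial_auc(y, s, max_fpr) for y, s in per_peptide.values()]
    return sum(values) / len(values)


# Made-up example: one perfectly ranked peptide and one with uninformative (tied) scores.
toy = {
    "PEPTIDE_A": ([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]),
    "PEPTIDE_B": ([1, 0], [0.5, 0.5]),
}
macro = macro_partial_auc(toy)
```

Under this standardization a value of 0.5 corresponds to chance, which is the scale on which a score such as 0.60 would read as modestly better than random. Without the rescaling, chance would instead sit at half of `max_fpr`.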
bioinformatics2026-04-01v1Millisecond Prediction of Protein Contact Maps from Amino Acid Sequences
Lin, R.; Ahnert, S. E.Abstract
Protein structure prediction typically outputs static coordinates, often obscuring the underlying physical principles and conformational flexibility. In this work, we present a coarse-grained generative framework to recover the Circuit Topology (CT) of proteins using Generative Flow Matching. We represent protein architecture using highly compressed Secondary Structure Elements (SSEs), reducing the sequence length to roughly 1/13 of the original amino acid sequence. We show that this minimal representation captures the essential "topological fingerprint" required to determine the global fold. By employing a joint-prediction head, our model simultaneously generates contact probabilities and asymmetric topological features, achieving a mean F1 score of 0.822 at the SSE level. Notably, our results demonstrate a counter-intuitive robustness in capturing long-range interactions, suggesting that global topology acts as a stable constraint compared to local residue packing. Furthermore, we show that these coarse-grained predictions can be mapped back to residue-level contact maps with sub-helical precision, yielding a mean alignment error of 2.69 residues. The probabilistic nature of the flow model effectively separates the stable structural signal of the folding core from flexible regions, providing a physically interpretable view of the protein's conformational ensemble. This pipeline is extremely fast, capable of completing a contact map prediction from amino acid sequence in an average of 110 milliseconds on a single GPU. These ultra-fast and accurate predictions provide a valuable tool for identifying conserved protein folding cores, facilitating the exploration of the protein structural genotype-phenotype (GP) map through large-scale sampling of mutants with highly similar folding cores.
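To make the sequence compression described in this abstract concrete, here is a minimal sketch of collapsing a per-residue secondary structure string into Secondary Structure Elements by run-length encoding. The abstract does not say how SSEs are actually derived, so the three-state alphabet (H, E, C), the choice to count coil runs as elements, and the toy string are all assumptions made for illustration.

```python
from itertools import groupby
from typing import List, Tuple

# (state, first residue index, last residue index); indices are 0-based and inclusive.
Segment = Tuple[str, int, int]


def compress_to_sses(ss: str) -> List[Segment]:
    """Collapse consecutive identical secondary structure states into segments."""
    segments: List[Segment] = []
    pos = 0
    for state, run in groupby(ss):
        length = len(list(run))
        segments.append((state, pos, pos + length - 1))
        pos += length
    return segments


# Hypothetical three-state string: H = helix, E = strand, C = coil.
ss = "CCHHHHHHHHCCEEEEECC"
segments = compress_to_sses(ss)
compression = len(ss) / len(segments)  # residues per element
```

Because each segment keeps its residue range, an SSE-level prediction can be traced back to the residues it covers, which is the kind of bookkeeping a residue-level mapping would need.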
bioinformatics2026-03-31v4Protein sequence domain annotation using a language model
Sarkar, A.; Krishnan, K.; Eddy, S. R.Abstract
Protein domain annotation underlies large-scale functional inference and is commonly performed by scanning sequences against libraries of profile hidden Markov models (profile HMMs). We describe PSALM, a protein domain annotation method that combines (i) a pretrained protein language model (ESM-2) with (ii) a per-residue domain-state classifier and (iii) a structured probabilistic decoder that produces a single, non-overlapping set of domain calls with explicit boundaries and scores. On a benchmark of 89M protein sequences with 107M annotated domains, PSALM attains a domain-detection sensitivity-specificity tradeoff comparable to HMMER. We characterize sequence and residue-level coverage on UniProtKB, observing higher coverage for HMMER at stringent expected false positive counts (E-values) and higher coverage for PSALM at relaxed E-values. We release code for data processing, training, and inference, along with the model weights and datasets used for training, validation, and benchmarking.
bioinformatics2026-03-31v3Phylogenetic detection of protein sites associated with continuous traits
Duchemin, L.; Muntane, G.; Boussau, B.; Veber, P.Abstract
Comparative genomic data can be used to look for substitutions in coding sequences that are associated with the variation of a particular phenotypic trait. A few statistical methods have been proposed to do so for phenotypes represented by discrete values. For continuous traits, no such statistical approach has been proposed, and researchers have resorted to sensible but uncharacterized criteria. Here, we investigate a phylogenetic model for coding sequences where amino acid preferences at a site are given by a continuous function of a quantitative trait. This function is inferred from the amino acids and the trait values in extant species and requires inferred point estimates of ancestral values of the trait at internal nodes. For detecting sites whose evolution is associated with this trait, we use a significance test against the hypothesis that amino acid preference does not depend on the trait. This procedure is compared to simpler strategies on simulated alignments. It displays an increased recall for low false positive rates, which is of special importance for performing whole-genome scans. This comes however at a much higher computational cost, and we suggest using a simple test to filter promising candidate sites. We then revisit a dataset of alignments for 62 species of mammals, using longevity as a phenotypic trait. We apply our method to three protein families that have previously been proposed to display sites associated with variation in lifespan in mammals. Using a graphical representation extracted from the detailed phylogenetic analysis of candidate sites, we suggest that the evidence for this in the sequence data alone is weak. The proposed method has been added to our Pelican software. It is available at https://gitlab.in2p3.fr/phoogle/pelican and can now be used with both discrete and continuous phenotypes to search for sites associated with phenotypic variation, on data sets with thousands of alignments.
bioinformatics2026-03-31v3Reliable prediction of short linear motifs in the human proteome
Pancsa, R.; Ficho, E.; Kalman, Z. E.; Gerdan, C.; Remenyi, I.; Zeke, A.; Tusnady, G. E.; Dobson, L.Abstract
Short linear motifs (SLiMs) are small interaction modules within intrinsically disordered regions of proteins that interact with specific domains, and thereby regulate numerous biological processes. Their limited sequence information leads to frequent false positive hits in computational and experimental SLiM identification methods. We present SLiMMine, a deep learning-based method to identify SLiMs in the human proteome. By refining the annotations of known motif classes, we created a high-quality training dataset. Using protein embeddings and neural networks, SLiMMine reliably predicts novel SLiM candidates in known classes and eliminates ~80% of the pattern matching-based hits as false positives. It also functions as a discovery tool to find uncharacterized SLiMs based on optimal sequence environment. Finally, narrowing the broad interactor-domain definitions of known SLiM classes to specific human proteins enables more precise linking of predicted SLiMs to known protein-protein interactions. SLiMMine is available as a user-friendly, multi-purpose web server at https://slimmine.pbrg.hu/.
bioinformatics2026-03-31v2IDBSpred: An intrinsically disordered binding site predictor using machine learning and protein language model
Jones, D.; Wu, Y.Abstract
Intrinsically disordered proteins (IDPs) mediate many cellular functions through interactions with structured protein partners, but predicting the corresponding binding sites on the structured partner remains challenging. Here, we present IDBSpred, a sequence-based method for residue-level prediction of IDP-binding sites on structured proteins. Training and test data were collected from the DIBS database, which contains more than 700 non-redundant IDP-protein complexes. Residue-level embeddings of structured partner sequences were generated using the ESM-2 protein language model and used as input to a multilayer perceptron classifier for binary prediction of binding versus non-binding residues. Analysis of amino acid composition showed that IDP-binding sites are enriched in aromatic residues, especially Trp, Tyr, and Phe, as well as several charged and polar residues, whereas Ala and several small or conformationally restrictive residues are depleted. The classifier achieved an ROC AUC of 0.87 and an average precision of 0.61. Structural case studies further showed that the predicted sites largely recapitulate the major experimentally defined binding interfaces. These results demonstrate that protein language model embeddings plus machine learning algorithms can effectively capture sequence features associated with IDP recognition on structured proteins. IDBSpred provides a practical framework for studying IDP-mediated interfaces and identifying potential therapeutic hotspots.
bioinformatics2026-03-31v2Flipper: An advanced framework for identifying differential RNA binding behavior with eCLIP data
Flanagan, K.; Xu, S.; Yeo, G. W.Abstract
Motivation: Crosslinking and immunoprecipitation (CLIP) methods remain the gold standard for characterizing RNA binding protein (RBP) behavior. As a result, many researchers rely on CLIP to assess how treatments targeting RBPs alter binding patterns and regulatory activity. However, current tools for differential RBP binding analysis lack core features required for rigorous statistical inference, including proper normalization and appropriate handling of replicate experiments. Furthermore, existing approaches cannot adequately separate expression-driven effects from true changes in RBP binding, complicating interpretation of differential analyses. Addressing these limitations is essential for producing reproducible and informative analyses of differential RBP binding. Results: Here we present Flipper, an application purpose-built for the analysis of differential RBP binding. Flipper introduces several innovations that adapt the DESeq2 framework for robust differential analysis of eCLIP count data. These include integration of input controls to account for expression-driven binding shifts, hierarchical normalization strategies that adjust for technical variation without confounding signal-to-noise ratios, and improved post-differential analysis tools. We demonstrate that Flipper exhibits high specificity when applied to real differential eCLIP data while also providing deeper biological insights. In addition, analyses of both real and simulated data indicate that Flipper achieves superior sensitivity and precision compared with existing approaches. Together, these results highlight Flipper as a robust and generalizable framework for differential RBP binding analysis.
bioinformatics2026-03-31v2A Bioinformatic Investigation into the Role of ITGB1 in Cancer Prognosis and Therapeutic Resistance
Mo, X.Abstract
Integrin β1 is a crucial transmembrane protein that regulates cellular adhesion, migration, and signal transduction, processes essential for cancer progression. This study investigates the role of ITGB1, the gene that encodes Integrin β1, in various cancers using bioinformatics tools. By analyzing gene expression data across different cancer types and normal tissues, the study identifies significant upregulation of ITGB1 in several cancers. We find elevated expression of ITGB1 is associated with poor prognosis in multiple tumors, suggesting its potential as a biomarker for cancer progression and therapeutic resistance. Further analysis reveals ITGB1's correlation with chemoresistance and immunoresistance genes, highlighting its involvement in cancer treatment evasion. The study also explores the expression and role of genes that are highly related to ITGB1 in tumor and patient prognosis, offering insights into potential molecular pathways and therapeutic targets. These findings underscore the clinical relevance of ITGB1 in cancer prognosis and therapy.
bioinformatics2026-03-31v2