Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
DeepTrio: Variant Calling in Families Using Deep Learning
Brambrink, L.; Kolesnikov, A.; Goel, S.; Nattestad, M.; Yun, T.; Baid, G.; Yang, H.; McLean, C.; Shafin, K.; Chang, P.-C.; Carroll, A.
Abstract
Every human inherits one copy of the genome from their mother and another from their father. Parental inheritance helps us understand the transmission of traits and genetic diseases, which often involve de novo variants and rare recessive alleles. Here we present DeepTrio, which learns to analyze child-mother-father trios from the joint sequence information, without explicit encoding of inheritance priors. DeepTrio learns how to weigh sequencing error, mapping error, and de novo rates and genome context directly from the sequence data. DeepTrio has higher accuracy on both Illumina and PacBio HiFi data when compared to DeepVariant. Improvements are especially pronounced at lower coverages (with 20x DeepTrio roughly equivalent to 30x DeepVariant). As DeepTrio learns directly from data, we also demonstrate extensions to exome calling solely by changing the training data. DeepTrio includes pre-trained models for Illumina WGS, Illumina exome, and PacBio HiFi.
bioinformatics 2026-04-02 v2
SSAlign: Ultrafast and Sensitive Protein Structure Search at Scale
Wang, L.; Zhang, X.; Wang, Y.; Xue, Z.
Abstract
The advent of highly accurate structure prediction techniques such as AlphaFold3 is driving an unprecedented expansion of protein structure databases. This rapid growth creates an urgent demand for novel search tools, as even the current fastest available methods like Foldseek face significant limitations in sensitivity and scalability when confronted with these massive repositories. To meet this challenge, we have developed SSAlign, a protein structure retrieval tool that leverages protein language models to jointly encode sequence and structural information and adopts a two-stage alignment strategy. On large-scale datasets such as AFDB50, SSAlign achieves a two-orders-of-magnitude speedup over Foldseek in search, substantially improving scalability for high-throughput structural analysis. Compared to Foldseek, SSAlign retrieves substantially more high-quality matches on Swiss-Prot and achieves marked performance improvements on SCOPe40, with relative AUC increases of +20.2% at the family level and +33.3% at the superfamily level, demonstrating significantly enhanced sensitivity and recall. In sum, SSAlign achieves TM-align-comparable accuracy with Foldseek-surpassing speed and coverage, offering an efficient, sensitive, and scalable solution for large-scale structural biology and structure-based drug discovery.
bioinformatics 2026-04-02 v2
Ankh-score produces better sequence alignments than AlphaFold3
Malec, J.; Rusen, K.; Golding, G. B.; Ilie, L.
Abstract
Protein sequence alignment is one of the most fundamental procedures in bioinformatics. Due to its many downstream applications, improvements to this procedure are of great importance. We consider two revolutionary concepts that emerged recently as candidates for improving the state-of-the-art alignment methods: AlphaFold and protein language models such as Ankh, ProtT5 or ESM-C. Alignment improvements can come from the structural alignment of AlphaFold-predicted structures or the scoring based on the similarity of protein embeddings produced by the protein language models. Thorough comparison on many domains from BAliBASE and CDD demonstrates that the Ankh-score method produces much better sequence alignments than the structural alignments using US-align of AlphaFold3-predicted structures. Both are better than the traditional method using BLOSUM matrices. This suggests that Ankh embeddings may possess certain information that is not available in the AlphaFold3-predicted structures. The alignment software is freely available as a web server at e-score.csd.uwo.ca and as source code at github.com/lucian-ilie/E-score.
bioinformatics 2026-04-02 v2
Optimisation of Weighted Ensembles of Genomic Prediction Models in Maize
Tomura, S.; Powell, O. M.; Wilkinson, M. J.; Lefevre, J.; Cooper, M.
Abstract
Ensembles of multiple genomic prediction models have demonstrated improved prediction performance over the individual models contributing to the ensemble. The outperformance of ensemble models is expected from the Diversity Prediction Theorem, which states that for ensembles constructed with diverse prediction models, the ensemble prediction error becomes lower than the mean prediction error of the individual models. While a naive ensemble-average model provides baseline performance improvement by aggregating all individual prediction models with equal weights, optimising weights for each individual model could further enhance ensemble prediction performance. The weights can be optimised based on their level of informativeness regarding prediction error and diversity. Here, we evaluated weighted ensemble-average models with three possible weight optimisation approaches (linear transformation, Nelder-Mead and Bayesian) using flowering time and tillering traits from two maize nested association mapping (NAM) datasets: TeoNAM and MaizeNAM. The three proposed weighted ensemble-average approaches improved prediction performance in several of the prediction scenarios investigated. In particular, the weighted ensemble models enhanced prediction performance when the adjusted weights differed substantially from the equal weights used by the naive ensemble models. For performance comparisons among the weighted ensembles, there was no clear superiority among the proposed approaches in either prediction accuracy or error across the prediction scenarios. Weight optimisation for ensembles warrants further investigation to explore the opportunities to improve their prediction performance; for example, integration of a weighted ensemble with a simultaneous hyperparameter tuning process may offer a promising direction for further research.
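The Diversity Prediction Theorem mentioned in this abstract can be checked numerically. The sketch below is an illustration only, not the authors' code: it assumes squared error on a single target and an equal-weight ensemble, and the function names and numbers are hypothetical. It also shows a plain weighted average, the quantity whose weights the paper sets out to optimise.

```python
from statistics import fmean


def diversity_decomposition(predictions, truth):
    """Return (ensemble error, mean individual error, diversity) for one target.

    All three terms use squared differences; the ensemble is the equal-weight mean.
    """
    ensemble = fmean(predictions)
    ensemble_error = (ensemble - truth) ** 2
    mean_individual_error = fmean((p - truth) ** 2 for p in predictions)
    diversity = fmean((p - ensemble) ** 2 for p in predictions)
    return ensemble_error, mean_individual_error, diversity


def weighted_ensemble(predictions, weights):
    """Weighted average of model predictions; weights need not sum to one."""
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, predictions)) / total


# Hypothetical predictions from three models for a trait whose true value is 14.0.
err, mean_err, div = diversity_decomposition([10.0, 12.0, 17.0], 14.0)
# The theorem states: err == mean_err - div (up to floating-point rounding).
print(err, mean_err - div)
```

Because the diversity term is never negative, the equal-weight ensemble error cannot exceed the mean individual error under these assumptions.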
bioinformatics 2026-04-02 v2
CLEAR: Concise List Enrichment Analysis Reducing Redundancy
Jia, X.; Phan, A.; Dorman, K.; Kadelka, C.
Abstract
High-throughput experiments generate genome-wide measurements for thousands of genes, which are often tested marginally. Biological processes are driven by coordinated groups of genes rather than individual genes, making gene set enrichment analysis an essential post hoc interpretation tool. Traditional approaches such as Over-Representation Analysis and Gene Set Enrichment Analysis test gene sets independently, which ignores the hierarchical and overlapping structure of gene set collections such as the Gene Ontology, and often leads to redundant enrichment results. Set-based approaches such as MGSA address this issue by modeling multiple gene sets simultaneously, but they rely on binary gene activation states derived from arbitrary thresholds on gene-level statistics. We introduce Concise List Enrichment Analysis Reducing Redundancy (CLEAR), a Bayesian gene set enrichment framework that jointly models gene sets while incorporating continuous gene-level statistics such as test statistics or p-values. CLEAR extends model-based gene set analysis by replacing threshold-based gene activation with a probabilistic model for continuous gene-level statistics. This approach preserves the redundancy-reduction advantages of set-based enrichment methods while avoiding the information loss introduced by binarization. Using both simulated datasets and human gene expression data, we show that CLEAR improves sensitivity compared with existing enrichment approaches while producing a more concise and interpretable set of enriched gene sets.
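For context, the traditional Over-Representation Analysis that this abstract contrasts with CLEAR tests each gene set independently with a one-sided hypergeometric test. The sketch below shows that baseline only, not CLEAR's Bayesian model; the function and parameter names are hypothetical.

```python
from math import comb


def ora_p_value(universe, gene_set, hits, overlap):
    """One-sided hypergeometric p-value for a single gene set.

    universe: number of genes measured
    gene_set: number of measured genes annotated to the set
    hits:     number of genes called significant
    overlap:  number of significant genes that fall in the set
    Returns P(X >= overlap) when `hits` genes are drawn without replacement.
    """
    upper = min(gene_set, hits)
    tail = sum(
        comb(gene_set, i) * comb(universe - gene_set, hits - i)
        for i in range(overlap, upper + 1)
    )
    return tail / comb(universe, hits)


# Hypothetical example: 10 genes measured, 5 in the set, 5 significant, all 5 in the set.
print(ora_p_value(10, 5, 5, 5))
```

Running this test once per gene set is what produces redundant results when sets overlap, which is the issue that joint models such as MGSA and CLEAR aim to address.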
bioinformatics 2026-04-02 v2
Resolution of recursive data corruption to transform T-cell epitope discovery
Preibisch, G.; Tyrolski, M.; Kucharski, P.; Gizinski, S.; Grzegorczyk, P.; Moon, S.; Kim, S.; Zaro, B.; Gambin, A.
Abstract
Accurate prediction of MHC class I-presented peptides is essential for any vaccine or T-cell therapy design, yet reported gains on in silico benchmarks have not translated into clinical successes. Here we show that this discrepancy may come from a common methodological error: immunopeptidomics datasets are fundamentally contaminated by existing prediction models through prediction-based deconvolution and filtering, resulting in an iterative confirmation bias. An audit of the IEDB, the biggest database in the field, reveals that as of January 2025, 55.8% of assessable data are labeled by computational models rather than verified experimentally. This inflates in silico benchmarks while degrading real-world applicability on new data, effectively making it impossible to objectively test model performance, which can lead to choosing suboptimal solutions and decreasing the chance of any therapy's clinical success. In silico simulation shows that iterative data corruption maintains high AUROC while top-of-list retrieval collapses. We reframe epitope discovery as a protein-centric learning-to-rank task and introduce deepMHCflare, a model evaluated exclusively on clean data. deepMHCflare achieves 0.80 Precision@4 on mono-allelic benchmarks versus 0.55-0.65 for gold-standard prediction models. A preclinical cancer vaccine study validated that 2 of the 4 deepMHCflare-nominated peptides were immunogenic, with a third independently confirmed in the literature.
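The Precision@4 figure quoted above is a top-of-list retrieval metric. A minimal sketch of how such a metric is usually computed follows; it assumes one ranked peptide list per protein and averages over proteins, which is one reading of "protein-centric" and not a confirmed detail of the paper. The peptide labels are placeholders.

```python
def precision_at_k(ranked_peptides, true_epitopes, k):
    """Fraction of the top-k ranked peptides that are confirmed epitopes."""
    top_k = ranked_peptides[:k]
    hits = sum(1 for peptide in top_k if peptide in true_epitopes)
    return hits / k


def mean_precision_at_k(rankings, truths, k):
    """Average Precision@k over proteins; both arguments are dicts keyed by protein."""
    values = [precision_at_k(rankings[p], truths[p], k) for p in rankings]
    return sum(values) / len(values)


# Placeholder ranking for one protein: two of the top four candidates are true epitopes.
ranking = ["pep1", "pep2", "pep3", "pep4", "pep5"]
confirmed = {"pep1", "pep3", "pep5"}
print(precision_at_k(ranking, confirmed, 4))
```

Unlike AUROC, this metric looks only at the head of the ranking, which is why it can collapse while AUROC stays high.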
bioinformatics 2026-04-02 v2
HalluCodon enables species-specific codon optimization using multimodal language models
Lou, Y.; Mao, S.; Wu, T.; Xia, F.; Zhang, Z.; Tian, Y.; Li, Y.; Cheng, Q.; Yan, J.; Wang, X.
Abstract
Codon optimization is widely used in transgenic crop development, plant synthetic biology, and molecular farming to improve heterologous protein expression in plant cells. Increasing availability of plant omics data now enables optimization strategies that account for species-specific sequence features. We developed HalluCodon, a customizable framework that uses multimodal language models to design coding sequences tailored to individual plant species. The framework allows users to fine tune pre-trained protein and RNA language models with their own datasets to build species-specific codon optimization models. The current implementation includes base models trained on coding sequences and proteomes from fifteen plant species. HalluCodon generates coding sequences through a hallucination-based design strategy guided by two predictive modules that evaluate coding sequence naturalness (CodonNAT) and expression potential (CodonEXP). Benchmark tests using representative proteins show that the generated sequences reproduce host-specific codon usage patterns and support high expression levels in plant systems.
bioinformatics 2026-04-02 v1
Evaluating FoldX5.1 for MAVISp Stability Data Collection
Vliora, A.; Tiberti, M.; Papaleo, E.
Abstract
MAVISp (Multi-layered Assessment of VarIants by Structure for proteins) is a structure-based framework for facilitating mechanistic interpretation of missense variants, with protein stability as one of its core analytical layers. When software tools are updated, a key consideration for database curation is whether the new version can be adopted without compromising compatibility with existing entries. This study evaluated the effect of replacing FoldX5 with FoldX5.1 on the results of the MAVISp stability workflow. We compared predicted changes in folding free energy for 539,809 shared variants across 119 proteins. We found a high overall agreement with a mean Pearson correlation of 0.933 and a mean Cohen coefficient of 0.814. Most proteins showed strong concordance, whereas only three (NUPR1, TSC1, and TMEM127) showed poor agreement. The number of disagreements was higher at sites with low AlphaFold2 confidence for NUPR1 and TSC1. These outliers did not display systematic inter-version bias, as mean shifts in folding free energies between versions were minimal. Collectively, these findings support adopting FoldX5.1 for future MAVISp data collection. We will include a transition period, during which existing entries retain FoldX5 annotations until their scheduled annual update, while new or updated entries are processed with FoldX5.1. To facilitate this transition, the FoldX software version has been added as a new metadata annotation in the MAVISp database.
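The two agreement measures reported here, Pearson correlation on predicted free-energy changes and Cohen's coefficient on categorical calls, can be sketched as follows. This illustrates the statistics only; the label scheme and values are hypothetical and do not reproduce the MAVISp classification rules.

```python
from collections import Counter
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical per-variant values from an old and a new software version.
old_ddg = [0.2, 1.5, 3.1, -0.4]
new_ddg = [0.3, 1.4, 3.3, -0.2]
old_class = ["neutral", "destabilizing", "destabilizing", "neutral"]
new_class = ["neutral", "destabilizing", "destabilizing", "destabilizing"]
print(pearson(old_ddg, new_ddg), cohen_kappa(old_class, new_class))
```

Reporting both is useful because a high correlation on continuous values does not guarantee that variants land in the same category in both versions.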
bioinformatics 2026-04-02 v1
RastQC: High-Performance Sequencing Quality Control Written in Rust
Huang, K.-l.
Abstract
Quality control (QC) of high-throughput sequencing data is a critical first step in genomics analysis pipelines. FastQC has served as the de facto standard for sequencing QC for over a decade, but its Java runtime dependency introduces startup overhead, elevated memory consumption, and deployment complexity. Here we present RastQC, a complete reimplementation of FastQC in Rust that provides all 12 standard QC modules with matching algorithms, plus 3 additional long-read QC modules, MultiQC-compatible output formats, native MultiQC JSON export, a built-in multi-file summary dashboard, and a web-based report viewer. RastQC also supports SOLiD colorspace reads, Oxford Nanopore Fast5/POD5 formats, standard input streaming, intra-file parallelism, and QC-aware exit codes for workflow integration. We benchmarked RastQC against FastQC v0.12.1 on both synthetic datasets (100K-1M reads) and real whole-genome sequencing data spanning five model organisms: Escherichia coli, Saccharomyces cerevisiae, Drosophila melanogaster, Mus musculus, and Homo sapiens. Despite running 15 modules (vs. 11 in FastQC), RastQC achieves comparable speed while using 4-9x less memory (59-125 MB vs. 551-638 MB). On real genome data, RastQC matches FastQC speed on most organisms while achieving 100% module-level concordance (55/55 module calls identical across all organisms for the 11 shared modules). RastQC compiles to a single 2.1 MB static binary with no external dependencies, representing a 102x reduction in deployment footprint. RastQC is freely available at https://github.com/Huang-lab/RastQC under the MIT license.
bioinformatics 2026-04-02 v1
When Multimodal Fusion Fails: Contrastive Alignment as a Necessary Stabilizer for TCR-Peptide Binding Prediction
Qi, C.; Wang, W.; Fang, H.; Wei, Z.
Abstract
Multimodal learning is commonly assumed to improve predictive performance, yet in biological applications auxiliary modalities are often imperfect and can degrade learning if fused naively. We investigate this problem in TCR-peptide binding prediction, where sequence embeddings from pretrained protein language models are strong and transferable, but structure-derived residue graphs are built from predicted folds and heuristic discretization. In this setting, structural views can be noisy, inconsistent, and difficult to optimize jointly with sequence features. We introduce TRACE, a lightweight multimodal framework that encodes each entity (TCR and peptide) with parallel sequence and graph towers, then applies CLIP-style intra-entity contrastive alignment before interaction modeling. The alignment objective regularizes representation geometry by encouraging modality consistency for the same biological entity, thereby preventing unstable graph signals from dominating fusion. Across protocol-aware TCHard RN evaluations, naive sequence+graph fusion frequently underperforms a sequence-only baseline and can collapse toward near-random behavior. In contrast, TRACE consistently restores and improves performance. Controlled noise and supervision sweeps show that these gains persist under increasing graph corruption and positive-label scarcity, indicating that alignment is especially important when training conditions are hard. Our results challenge the assumption that adding modalities is inherently beneficial. Instead, they highlight a central principle for robust multimodal bioinformatics: performance depends not only on what modalities are used, but on how their interaction is constrained during optimization. TRACE provides a simple and general recipe for leveraging imperfect structural information without sacrificing stability.
bioinformatics 2026-04-02 v1
A structure-informed deep learning framework for modeling TCR-peptide-HLA interactions
Cao, K.; Li, R.; Strazar, M.; Brown, E. M.; Nguyen, P. N. U.; Pust, M.-M.; Park, J.; Graham, D. B.; Ashenberg, O.; Uhler, C.; Xavier, R.
Abstract
The interaction between T cell receptors (TCRs), peptides, and human leukocyte antigens (HLAs) underlies antigen-specific T cell immunity. Despite substantial advances in peptide-HLA presentation prediction, accurate modeling of coupled TCR-peptide-HLA recognition remains underdeveloped, limiting applications such as TCR and neoepitope prioritization in cancer and antigen identification in autoimmunity. Here we present StriMap, a unified framework for predicting TCR-peptide-HLA interactions by integrating physicochemical, sequence-context, and structural features at recognition interfaces. StriMap achieves state-of-the-art performance with improved generalizability and enables applications in both cancer and autoimmunity. As a case study in ankylosing spondylitis (AS), we screened 13 million peptides derived from 43,241 bacterial proteins and identified candidate molecular mimics that were experimentally validated to activate T cells expressing an AS-associated TCR. Notably, a top validated peptide was enriched in patients with inflammatory bowel disease (IBD), suggesting potential shared microbial triggers between AS and IBD. Overall, StriMap provides a generalizable framework for rational immunotherapy design and for dissecting antigenic drivers of autoimmunity.
bioinformatics 2026-04-02 v1
DESPOT: Direction-Enhanced Scoring POTentials
Poelmans, R.; Bruncsics, B.; Arany, A.; Van Eynde, W.; Shemy, A.; Moreau, Y.; Voet, A. R.
Abstract
Knowledge-based potentials (KBPs) have long been used to score protein-ligand interactions, yet existing formulations remain isotropic, capturing only distance dependencies and neglecting the directional preferences that govern molecular recognition. Here, we introduce Direction-Enhanced Scoring POTentials (DESPOT), an anisotropic knowledge-based framework that unifies pose scoring and binding-site characterization within a single probabilistic model. Where classical knowledge-based methods model the probability of observing a distance given an interacting atom pair, DESPOT instead models the conditional probability of observing specific ligand atom types at discretized spatial positions around protein atoms. This inverted probabilistic formulation naturally supports both directional modelling through atom type-specific local reference frames and symmetry-aware geometric discretization, and steric exclusion, encoded as a dedicated void state that explicitly captures the probability that a spatial bin remains unoccupied. Evaluation on the CASF-2016 benchmark shows that DESPOT substantially outperforms isotropic KBPs in all pose-discrimination and virtual screening tasks (p < 0.0001 for all enrichment factors), with the largest gains arising from its ability to penalize geometrically implausible poses. Constrained energy minimization of training structures proves strongly beneficial for the derivation of KBPs, while our train-test leakage analysis reveals that overfitting is an underestimated and understudied issue for KBPs. The resulting anisotropic interaction profiles reveal systematic directional preferences (illustrated here for hydrogen bonds, aromatic interactions, and halogen bonds) that extend beyond idealized geometric models. DESPOT provides a data-driven framework for direction-aware modelling of protein-ligand interactions, with applications in pose scoring, binding-site characterization, and structure-based design.
bioinformatics 2026-04-02 v1
Benchmarking Agentic Bioinformatics Systems for Complex Protein-Set Retrieval: A Coccolithophore Calcification Case Study
Zhang, X.
Abstract
Large language model agents are increasingly used for bioinformatics tasks that require external databases, tool use, and long multi-step retrieval workflows. However, practical evaluation of these systems remains limited, especially for prompts whose target set is both large and biologically heterogeneous. Here, I benchmarked three agent systems on the same difficult retrieval task: downloading coccolithophore calcification-related proteins from UniProt across six mechanistically distinct categories, while producing category-separated FASTA files and supporting evidence. The compared systems were Codex app agents extended with Claude Scientific Skills, Biomni Lab online, and DeerFlow 2 with default skills only. Outputs were normalized at the UniProt accession level and compared category by category using overlap analysis, Venn decomposition, and a heuristic relevance assessment of each subset relative to the benchmark prompt. Across the six shared categories, Codex retrieved 2,118 proteins, DeerFlow 6,255, and Biomni 8,752 in a run. Codex showed the best balance between sensitivity and specificity: 92.4% of its proteins fell into subsets labeled high relevance and the remaining 7.6% into medium relevance. DeerFlow was substantially more exhaustive, but 43.8% of its proteins fell into low or low-medium relevance subsets. Biomni produced the largest sets, yet 69.5% of its proteins fell into low or low-medium relevance subsets, mainly due to broad expansion into generic calcium sensors, kinases, transcription factors, and poorly specific domain families. Category-specific analysis showed that Codex was the strongest primary source for inorganic carbon transport, calcium and pH regulation, vesicle trafficking, and signaling, whereas DeerFlow contributed valuable complementary matrix and polysaccharide candidates. 
A second run for each system also separated them strongly by repeatability: Codex had the highest within-system stability (mean category Jaccard 0.982; micro-Jaccard 0.974), DeerFlow was intermediate (0.795; 0.571), and Biomni was least stable (0.412; 0.319). These results suggest that for complex protein-family retrieval tasks, agent quality depends less on raw output volume than on prompt decomposition, taxonomic scoping, exact query generation, provenance-rich export artifacts, and repeated-run stability.
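The repeatability figures above are Jaccard overlaps between two runs of the same system. The sketch below shows one plausible way to compute a mean per-category Jaccard and a pooled "micro" Jaccard. The pooling rule (treating each category/accession pair as one item) is an assumption, since the abstract does not define micro-Jaccard, and the accessions are placeholders.

```python
def jaccard(a, b):
    """Jaccard index of two collections; two empty sets count as identical."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def run_stability(run1, run2):
    """Compare two runs given as dicts mapping category name to a set of accessions.

    Returns (mean per-category Jaccard, micro-Jaccard over pooled pairs).
    """
    categories = sorted(set(run1) | set(run2))
    per_category = [jaccard(run1.get(c, ()), run2.get(c, ())) for c in categories]
    mean_category = sum(per_category) / len(per_category)
    pooled1 = {(c, acc) for c in run1 for acc in run1[c]}
    pooled2 = {(c, acc) for c in run2 for acc in run2[c]}
    return mean_category, jaccard(pooled1, pooled2)


# Placeholder accessions for two categories across two runs.
first = {"transport": {"P1", "P2"}, "signaling": {"P3"}}
second = {"transport": {"P1", "P2"}, "signaling": {"P3", "P4", "P5"}}
print(run_stability(first, second))
```

The two numbers can diverge when large categories are less stable than small ones, which may explain gaps such as 0.795 versus 0.571.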
bioinformatics 2026-04-02 v1
The U-method: Leveraging expression probability for robust biological marker detection
Stein, Y.; Lavon, H.; Hindi Malowany, M.; Arpinati, L.; Scherz-Shouval, R.
Abstract
Reliable identification of cluster-defining markers is fundamental to single-cell transcriptomic analysis, yet current approaches often rely on average expression differences, which can dilute biologically informative signals in sparse and heterogeneous data. Here we introduce the U-method, a fast probability-based framework for identifying uniquely expressed genes (UEGs) by contrasting the expression probability of a gene within a cluster with its highest expression probability in any other cluster. This highest-probability comparison prioritizes detection consistency over expression magnitude, resulting in markers that consistently identify cell populations across independent datasets analyzed at comparable clustering resolutions. Applied to colorectal, breast, pancreatic, and lung cancer single-cell RNA-sequencing datasets, the U-method identifies canonical lineage markers together with additional genes showing clear cluster specificity. When projected onto Visium HD spatial transcriptomics data using only raw average expression of top UEGs, these signatures reveal coherent and biologically interpretable tissue organization without the need for smoothing, deconvolution, or model-based spatial inference. These results position the U-method as a practical implementation of detection consistency, enabling robust marker discovery and spatial interpretation in single-cell analysis.
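The contrast described in this abstract, a gene's expression probability within a cluster against its highest expression probability in any other cluster, can be sketched for a single gene as below. Taking the contrast as a simple difference and defining "expressed" as a count above zero are assumptions made for illustration; the published U-method may define either differently.

```python
def u_scores(counts, clusters):
    """Per-cluster score for one gene.

    counts:   expression value of the gene in each cell
    clusters: cluster label of each cell (same order as counts)
    Score = P(expressed | cluster) - max over other clusters of P(expressed | other).
    """
    totals, expressed = {}, {}
    for value, label in zip(counts, clusters):
        totals[label] = totals.get(label, 0) + 1
        expressed[label] = expressed.get(label, 0) + (1 if value > 0 else 0)
    prob = {c: expressed[c] / totals[c] for c in totals}
    scores = {}
    for c in prob:
        others = [p for other, p in prob.items() if other != c]
        scores[c] = prob[c] - (max(others) if others else 0.0)
    return scores


# Hypothetical counts for one gene across eight cells in three clusters.
gene_counts = [3, 1, 0, 0, 0, 2, 0, 0]
cell_clusters = ["a", "a", "a", "b", "b", "b", "c", "c"]
print(u_scores(gene_counts, cell_clusters))
```

Because the score depends on detection frequency and not on magnitude, a gene detected at low levels in most cells of one cluster can outrank a gene with a few very high counts.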
bioinformatics 2026-04-02 v1
Generating and navigating single cell dynamics via a geodesic bridge between nonlinear transcriptional and linear latent manifolds
Zhu, J.; Zhang, Z.; Sun, Y.; Dai, H.; Wen, H.; Zhou, P.; Chen, L.
Abstract
Time-series single-cell RNA sequencing (scRNA-seq) captures cellular processes as sparse and unpaired snapshots, limiting our ability not only to reconstruct continuous cell state transitions, but also to navigate between states in a controlled and interpretable manner. Here we present GeoBridge, a framework modeling cellular dynamics as geodesic trajectories on the transcriptional manifold based on our isometric geodesic theory, which theoretically and computationally transforms time-varying nonlinear transcriptional geodesics (original nonlinear manifold) into constant-velocity straight-line geodesics (latent linear manifold) by a learned geodesic bridge. In such learned geodesic space, continuous interpolation becomes biologically meaningful, enabling reconstruction of unobserved intermediate states and efficient navigation between distinct cellular phenotypes at a single-cell resolution. By mapping interpolated trajectories back to the original gene expression space, GeoBridge recovers smooth transcriptional programs that are robust to noise and snapshot sparsity. Leveraging the derived geodesic potentials, GeoBridge further infers pseudo-temporal trajectories from single-snapshot scRNA-seq data without temporal annotation, and directly identifies genes that drive progression along geodesic paths. Across diverse biological systems, GeoBridge accurately resolves developmental dynamics, generates unmeasured intermediate states, identifies dynamic driver genes, and more significantly, enables navigable transitions across multiple differentiation endpoints. Together, GeoBridge establishes a principled method that transforms sparse single-cell measurements into a continuous, controllable landscape for the reconstruction, navigation and manipulation of cellular state transitions.
bioinformatics 2026-04-02 v1
CardamomOT: a mechanistic optimal transport-based framework for gene regulatory network inference, trajectory reconstruction and generative modeling
Mauge, Y.; Ventre, E.
Abstract
A key challenge in inferring gene regulatory networks (GRNs) governing cellular processes such as differentiation and reprogramming from experimental data lies in the impossibility of directly measuring protein dynamics at the single-cell level, which prevents establishing causal relationships between regulator activity and target responses. In earlier work, we introduced CARDAMOM, an algorithm that uses temporal snapshots of scRNA-seq data to calibrate a GRN-driven mechanistic model of gene expression. However, this method had several limitations: it could only rely on the relative ordering of time points rather than their exact labels, imposed restrictive quasi-stationary assumptions on protein dynamics, and depended on multiple hyperparameters. Here, we present CardamomOT, a new method based on the same mechanistic model that jointly reconstructs the GRN and unobserved protein trajectories from the data within a mechanistic optimal transport framework. By incorporating exact time labels and priors on protein kinetic rates from the literature, and substantially reducing the number of required hyperparameters, our approach addresses these limitations and substantially improves the accuracy and robustness of GRN calibration. We validate our framework on both in silico and experimental datasets, demonstrating computational scalability and consistently improved performance over state-of-the-art methods in both GRN and trajectory reconstruction. In particular, CardamomOT accurately recovers velocity fields driving cellular trajectories and unobserved protein levels, alongside reliable GRN structures. We also show that these improvements make the calibrated mechanistic model suitable to be used as a generative model to predict cellular responses to unseen perturbations. 
To our knowledge, this is among the first methods to explicitly integrate mechanistic GRN inference, trajectory reconstruction, and simulation of realistic datasets into a unified framework for scRNA-seq time series analysis.
bioinformatics 2026-04-02 v1
mRNA-GPT: A Generative Model for Full-Length mRNA Design and Optimization
Li, S.; Chauvin, P.; Gross, O.; Bailey, M.; Jager, S.
Abstract
We introduce mRNA-GPT, a generative model for end-to-end full-length mRNA sequence design and optimization. Unlike existing approaches that optimize isolated regions, mRNA-GPT jointly optimizes across all three regions (5' UTR, CDS, and 3' UTR) to capture long-range sequence dependencies and cross-region regulatory interactions critical for therapeutic efficacy. The model is pre-trained on 30 million full-length natural mRNA sequences across diverse species and organisms, establishing a robust foundation for sequence generation. We employ Reinforcement Learning (RL), specifically Proximal Policy Optimization (PPO) with oracle-based reward signals, to directly and iteratively optimize target properties, such as half-life and translation efficiency. mRNA-GPT supports flexible generation modes: single regions (UTR or CDS alone), full-length sequences, or generation of any region conditioned on any other region. Through multi-objective optimization, mRNA-GPT achieves Pareto-optimal designs that balance competing properties without sacrificing performance on either objective. mRNA-GPT demonstrates superior design capabilities compared to state-of-the-art methods, achieving enhanced performance in 3' UTR stability optimization, CDS translation rate enhancement, and comprehensive full-length sequence design.
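A design is Pareto-optimal when no other design is at least as good on every objective and strictly better on at least one. The sketch below filters a candidate list down to that set for two objectives that are both maximized. It illustrates the concept only and is unrelated to mRNA-GPT's training procedure; the design names and scores are made up.

```python
def pareto_front(designs):
    """Return names of non-dominated designs.

    designs: list of (name, half_life, translation_efficiency); higher is better for both.
    """
    front = []
    for name, half_life, efficiency in designs:
        dominated = any(
            other_hl >= half_life
            and other_eff >= efficiency
            and (other_hl > half_life or other_eff > efficiency)
            for _, other_hl, other_eff in designs
        )
        if not dominated:
            front.append(name)
    return front


# Made-up candidate sequences scored on two objectives.
candidates = [("a", 1, 5), ("b", 3, 3), ("c", 2, 2), ("d", 5, 1)]
print(pareto_front(candidates))
```

In this example design "c" is dropped because "b" beats it on both objectives, while "a", "b", and "d" each represent a different trade-off.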
bioinformatics 2026-04-02 v1
Genetic demultiplexing and transcript start site identification from nanopore sequencing of 10x Genomics multiome libraries
Mears, J.; Orchard, P.; Varshney, A.; Bose, M. L.; Robertson, C. C.; Piper, M.; Pashos, E.; Dolgachev, V.; Manickam, N.; Jean, P.; Kitzman, D. W.; Fauman, E.; Damilano, F.; Roth Flach, R. J.; Nicklas, B.; Parker, S. C.
Abstract
Short-read Illumina sequencing of 10x Genomics single-nucleus multiome libraries captures only the 3' end of RNA transcripts, losing transcription start site (TSS) information. Here we demonstrate nanopore sequencing of 10x multiome libraries, which enables the profiling of full length transcripts. We show concordance with common short-read sequencing based workflows including successful genetic demultiplexing of nanopore data despite its higher error rate. We compare TSS identified using nanopore sequencing of multiome cDNA to those identified using a short-read 5' assay, and provide an optimized approach for the preprocessing of nanopore reads prior to TSS identification. We find that nanopore sequencing of multiome cDNA captures a median of 63% of the TSS detected by the 5' assay.
bioinformatics 2026-04-02 v1
Decoding antibiotic modes of action from multimodal cellular responses
Hesse, J.; Schum, D.; Leidel, L.; Gareis, L. R.; Herrmann, J.; Müller, R.; Sieber, S. A.
Abstract
Antibiotic resistance continues to rise, yet most new drug candidates act through long-established targets. Faster mode of action (MoA) assessment would enable more effective prioritization of screening hits and help identify compounds with novel mechanisms. In this study, we aimed to develop a scalable framework for MoA inference from antibiotic-induced cellular response profiles in Escherichia coli. We generated a multimodal dataset spanning more than 50 antibiotics, including proteome profiles, chemical structure descriptors, inhibitory concentrations and growth dynamics, and used it to build MAPPER (Mode of Action Prediction via Proteomics-Enhanced Representation), a framework comprising a fixed multimodal predictor and an uncertainty module. MAPPER accurately classified antibiotics across nine mechanistic classes, flagged compounds with likely novel mechanisms and retained predictive power in proteomics-only transfer experiments across mass spectrometry platforms and external data. Together, these results establish MAPPER as an innovative tool for MoA prediction and novelty detection, enabling prioritization of antibacterial candidates with distinct mechanisms.
bioinformatics2026-04-02v1Inferring a novel insecticide resistance metric and exposure variability in mosquito bioassays across Africa
Denz, A.; Kont, M. D.; Sanou, A.; Churcher, T. S.; Lambert, B.Abstract
Malaria claims approximately 500,000 lives each year, and insecticide-treated nets (ITNs), which kill mosquitoes that transmit the disease, remain the most effective intervention. However, resistance to pyrethroids, the primary insecticide class used in ITNs, has risen dramatically in Africa, making it difficult to assess the current public health impact of pyrethroid-ITNs. Past work has modelled the relation between pyrethroid susceptibility measured in discriminating-dose susceptibility bioassays and ITN effectiveness in experimental hut trials. Here, we introduce a new predictive approach that accounts for heterogeneity in insecticide resistance within wild mosquito populations, for example, due to genetic variability, by incorporating data from newly recommended intensity-dose susceptibility bioassays. We fit our mathematical model to a comprehensive data set that combines discriminating dose bioassays from all over Africa, intensity dose bioassays from Burkina Faso, and concurrent experimental hut trials. Our analysis estimates location- and insecticide-specific variation in resistance heterogeneity in Burkina Faso and quantifies differences in insecticide exposure in bioassays and experimental huts. By providing a mechanistic understanding of these experimental data, our approach could be integrated into malaria transmission models to account for the public health impact of insecticide resistance detected by surveillance programmes.
bioinformatics2026-04-01v5Finding stable clusterings of single-cell RNA-seq data
Klebanoff, V. F.Abstract
Run a UMI count matrix through a clustering pipeline to obtain n cell clusters. Suppose that counts from the same experiment for an equal number of additional cells become available. Would including them change the results? Form the matrix containing both sets of counts, process it to obtain n clusters, restrict this (second) clustering to the initial cells and compare it with the initial clustering. If the clusterings are not consistent, conclude that the initial clustering is unstable. Although this scenario is unrealistic, it is practical to reverse the perspective: given a clustering, process samples of half of the cells. If their clusters are consistent with those of the full set of cells restricted to the samples, conclude that the clustering is stable. We use divisive hierarchical spectral clustering and describe a possibly novel mapping of the tree it produces to a set of nested clusterings. Positive affinities are defined for points (representing cells in Euclidean space) that are k-nearest neighbors (k is an input parameter). The affinity equals the inverse of the distance between the points. Ng, Jordan, and Weiss' algorithm divides a set of points into two clusters. The normalized cut measures the clusters' separation. Recursion generates a hierarchy of clusters. Viewing clusters as nodes of a tree, set the length of the branch between a node and each of its daughters to the normalized cut. Nodes' distances from the root define the mapping of the tree to nested clusterings. For four large data sets, this gave clusterings compatible with published results. Analysis is performed for all cells and for multiple pairs of complementary samples of cells. For a given number of clusters, each sample's clustering and clusters are compared with those of the full data set (restricted to the sample). This provides measures of the stability of the clustering and its clusters. 
For two of the large data sets, the clusterings compatible with published results were judged to be stable.
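The half-sample stability check described in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's implementation: `cluster_fn` is a hypothetical stand-in for any clustering pipeline, `largest_gap_clusterer` is a toy 1-D clusterer used only to make the example runnable, and agreement is scored with the Rand index, which the abstract does not specify.

```python
import random
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of point pairs that both clusterings treat the same way
    (together in both, or apart in both). Invariant to label renaming."""
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def half_sample_stability(points, cluster_fn, n_clusters, n_splits=5, seed=0):
    """Cluster complementary half-samples and compare each result with the
    full-data clustering restricted to the same points."""
    rng = random.Random(seed)
    full = cluster_fn(points, n_clusters)
    idx = list(range(len(points)))
    scores = []
    for _ in range(n_splits):
        rng.shuffle(idx)
        half = len(idx) // 2
        for sample in (idx[:half], idx[half:]):  # a pair of complementary samples
            sub = cluster_fn([points[i] for i in sample], n_clusters)
            restricted = [full[i] for i in sample]
            scores.append(rand_index(sub, restricted))
    return sum(scores) / len(scores)


def largest_gap_clusterer(points, n_clusters=2):
    """Hypothetical toy clusterer: split 1-D values at the largest gap.
    Supports two clusters only; n_clusters is accepted for interface parity."""
    order = sorted(points)
    _, k = max((order[k + 1] - order[k], k) for k in range(len(order) - 1))
    cut = (order[k] + order[k + 1]) / 2
    return [0 if p < cut else 1 for p in points]


# Two well-separated groups of ten 1-D points each.
points = [0.1 * k for k in range(10)] + [10 + 0.1 * k for k in range(10)]
score = half_sample_stability(points, largest_gap_clusterer, n_clusters=2)
```

A score near 1 means the half-sample clusterings agree with the restricted full clustering; where to place the threshold for calling a clustering "stable" is a design choice the sketch leaves open.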
bioinformatics2026-04-01v3Adaptive Cluster-Count Autoencoders with Dirichlet Process Priors for Geometry-Aware Single-Cell Representation Learning
Fu, Z.Abstract
Standard autoencoders for single-cell transcriptomics learn latent spaces whose cluster structure emerges only post hoc through K-means or community detection, leaving cluster count and boundary quality uncontrolled during training. Here we ask whether imposing an adaptive nonparametric prior can shift this balance. We equip a feedforward autoencoder with an online Dirichlet Process Mixture Model (DPMM) prior that refits cluster assignments throughout training and directly regularizes latent compactness and separation. Across 56 scRNA-seq datasets the DPMM prior produces a pronounced geometry-concordance trade-off: cluster compactness (ASW) improves by 127% and Davies-Bouldin overlap drops by 47%, but label-recovery metrics decline (NMI -17%, ARI -21%) and downstream kNN accuracy falls from 0.784 to 0.725. Wilcoxon signed-rank tests confirm that the geometry gains are significant with large Cliff's δ effects while concordance losses remain bounded and non-significant. A second-stage conditional-flow refinement (DPMM-FM) further improves projection fidelity (DRE 0.751, LSE 0.695, DREX 0.873) at additional concordance cost, revealing a three-tier operating regime: prior-free for label recovery, DPMM for manifold geometry, and DPMM-FM for visualization fidelity. Against 18 external baselines DPMM-Base wins 70.5% of core-metric comparisons (p < 0.05). Gene Ontology enrichment confirms that geometry-improved latent components recover coherent biological programs. Rather than claiming universal superiority, this study characterizes the operating envelope of nonparametric mixture priors and identifies the task contexts (trajectory analysis, manifold visualization, and program-level annotation) where adaptive geometric structure outweighs label-counting accuracy.
bioinformatics2026-04-01v2Searching the Druggable Genome using Large Language Models
Schimmelpfennig, L. E.; Cannon, M.; Cody, Q.; McMichael, J.; Coffman, A.; Kiwala, S.; Krysiak, K. J.; Wagner, A. H.; Griffith, M.; Griffith, O. L.Abstract
The druggable genome encompasses the genes that are known or predicted to interact with drugs. The Drug-Gene Interaction Database (DGIdb) provides an integrated resource for discovering and contextualizing these interactions, supporting a broad range of research and clinical applications. DGIdb is currently accessed through structured web interfaces and API calls, requiring users to translate natural-language questions into database-specific query patterns. To allow for the use of DGIdb through natural language, we developed the DGIdb Model Context Protocol (MCP) server, which allows large language models (LLMs) access to up-to-date information through the DGIdb API. We demonstrate that the MCP server greatly enhances an LLM's ability to answer questions requiring accurate, up-to-date biomedical knowledge drawn from structured external resources. Availability and implementation: The DGIdb MCP server is detailed at https://github.com/griffithlab/dgidb-mcp-server and includes instructions for accessing the server through the Claude desktop app.
bioinformatics2026-04-01v2Simplex-Constrained Neural Topic VAEs with Flow Refinement for Interpretable Single-Cell Gene-Program Discovery
Fu, Z.Abstract
Variational autoencoders for single-cell transcriptomics typically learn Gaussian latent spaces that lack part-based interpretability: individual latent dimensions carry no inherent biological meaning and the decoder provides no explicit gene-program readout. We introduce Topic-FM, a family of neural topic VAEs in which a logistic-normal Dirichlet prior constrains the latent vector to the probability simplex, turning each coordinate into a topic proportion and the decoder weight matrix into a directly readable topic-gene signature. A conditional optimal-transport flow field, trained entirely in pre-softmax ℝ^K, sharpens posterior geometry without modifying the decoder or breaking simplex validity. Unlike nonparametric mixture priors that improve geometry at the expense of label concordance, Topic-FM improves all core metrics simultaneously: across 56 scRNA-seq datasets, Topic-FM-Transformer raises NMI by 8.2%, ARI by 20.4%, and ASW by 21.7% relative to prior-free Pure-VAE (composite 0.502 vs. 0.434, +15.6%). Wilcoxon signed-rank tests confirm significance with medium-to-large Cliff's δ effects on all three metrics; no concordance-geometry trade-off is observed. Downstream kNN classification improves by 13.5% in accuracy and 27.7% in macro-F1. Among four architectural variants, Topic-FM-Contrastive achieves the highest external core win rate (86.4% against 23 baselines), while Topic-FM-Transformer leads on composite score and supervised discrimination. Dual-pathway biological validation (perturbation importance and direct decoder-β readout) yields convergent GO enrichment, demonstrating that the learned topics correspond to coherent, annotatable gene programs rather than opaque embedding dimensions.
bioinformatics2026-04-01v2A Convolutional Deep Learning Approach to identify DNA Sequences for Gene Prediction
Motta, J. A.; Gomez, P. D.Abstract
In this work, we present a highly efficient machine learning method for identifying DNA sequences that code for genes. The learning process is based on Human Genome Build 38 (GRCh38) sequences extracted from various specialized databases. The sequences were then translated into amino acid sequences and used to build matrices that facilitate the extraction of features with the TF*IDF vectorization method for the creation of the training space. The prediction functions are learned using a convolutional neural network (CNN) deep learning model. The training spaces were created using the 24 chromosomes of the human genome and approximately 36,000 genes and pseudogenes whose names were fetched from the HUGO Gene Nomenclature Committee (HGNC). Performance analysis was performed on 24 genes associated with genetic disorders, as well as the surrounding DNA regions. The metrics used were precision, recall, F_score measure, accuracy and ROC curves for the genes of interest. The results achieved exceed all our expectations and place the work at the level of the state of the art for gene prediction.
bioinformatics2026-04-01v2On the Comparison of LGT networks and Tree-based Networks
Marchand, B.; Tahiri, N.; Tremblay-Savard, O.; Lafond, M.Abstract
Phylogenetic networks are widespread representations of evolutionary histories for taxa that undergo hybridization or Lateral-Gene Transfer (LGT) events. There are now many tools to reconstruct such networks, but no clearly established metric to compare them. Such metrics are needed, for example, to evaluate predictions against a simulated ground truth. Despite years of effort in developing metrics, known dissimilarity measures either do not distinguish all pairs of different networks, or are extremely difficult to compute. Since it appears challenging, if not impossible, to create the ideal metric for all classes of networks, it may be relevant to design them for specialized applications. In this article, we introduce a metric on LGT networks, which consist of trees with additional arcs that represent lateral gene transfer events. Our metric is based on edit operations, namely the addition/removal of transfer arcs, and the contraction/expansion of arcs of the base tree, allowing it to connect the space of all LGT networks. We show that it is linear-time computable if the order of transfers along a branch is unconstrained but NP-hard otherwise, in which case we provide a fixed-parameter tractable (FPT) algorithm in the level. We implemented our algorithms and demonstrate their applicability on three numerical experiments.
bioinformatics2026-04-01v2Protein Language Model Decoys for Target Decoy Competition in Proteomics: Quality Assessment and Benchmarks
Reznikov, G.; Kusters, F.; Mohammadi, M.; van den Toorn, H. W. P.; Sinitcyn, P.Abstract
Large-scale proteomics relies heavily on target-decoy competition for false discovery rate estimation in peptide identification, and the performance of this strategy depends strongly on the design of the decoy database. Classical generators such as reversal and shuffling remain widely used. Here, we introduce protein language model-based (PLM) decoy generation for peptide identification and benchmark it against classical strategies. We evaluate these approaches using three complementary quality-control layers: sequence-based separability, search-engine-agnostic spectral-space diagnostics, and end-to-end mass spectrometry benchmarks, including pipelines with rescoring. Across these analyses, PLM-based decoys are harder for sequence-only neural networks to distinguish than most classical generators, suggesting fewer obvious sequence-level artifacts. However, this signal is only weakly informative for search performance. Spectral diagnostics further show that short peptides occupy a particularly crowded target-decoy space and are therefore especially prone to local collisions across all generators. In full search pipelines, reverse decoys remain a strong baseline, and current PLM-based generators do not yet provide a clear overall advantage. We therefore view PLM-based decoys not as universal replacements for reverse decoys, but as tunable tools for benchmarking, diagnostics, stress testing, and future adaptive decoy optimization, with increasing value as search models become more expressive.
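For readers unfamiliar with the false discovery rate estimate that this abstract builds on, the following is a minimal sketch of target-decoy competition and of the reversal decoy generator it mentions. The estimator shown (decoy count divided by target count above a score threshold) is one common simple form and is an assumption here, since the abstract does not state which variant the authors use. The function names and the example scores are hypothetical.

```python
def reverse_decoy(peptide):
    """Classical reversal decoy: reverse the amino acid sequence.
    (Many pipelines keep the C-terminal residue fixed; that variant is not shown.)"""
    return peptide[::-1]


def tdc_fdr(psms, threshold):
    """Estimate the FDR among target matches scoring at or above `threshold`.

    psms: (score, is_decoy) pairs, one per spectrum, where each spectrum has
    already been assigned its single best match across targets and decoys.
    """
    psms = list(psms)
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0
    # Decoy hits stand in for the number of incorrect target hits.
    return min(1.0, decoys / targets)


# Hypothetical scores for seven spectra.
example = [
    (9.1, False), (8.7, False), (8.2, True), (7.9, False),
    (7.5, False), (6.0, True), (5.5, False),
]
fdr_strict = tdc_fdr(example, threshold=7.0)
fdr_loose = tdc_fdr(example, threshold=5.0)
```

The estimate is only as good as the assumption that decoys and incorrect targets score alike, which is why decoy design matters for the benchmarks described above.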
bioinformatics2026-04-01v2Benchmark of biomarker identification and prognostic modeling methods on diverse censored data
Fletcher, W. L.; Sinha, S.Abstract
The practices of identifying biomarkers and developing prognostic models using genomic data have become increasingly prevalent. Such data often features characteristics that make these practices difficult, namely high dimensionality, correlations between predictors, and sparsity. Many modern methods have been developed to address these problematic characteristics while performing feature selection and prognostic modeling, but a large-scale comparison of their performances in these tasks on diverse right-censored time to event data (aka survival time data) is much needed. We have compiled many existing methods, including some machine learning methods, several of which have performed well in previous benchmarks, primarily for comparison with regard to variable selection capability, and secondarily for survival time prediction on many synthetic datasets with varying levels of sparsity, correlation between predictors, and signal strength of informative predictors. For illustration, we have also performed multiple analyses on a publicly available and widely used cancer cohort from The Cancer Genome Atlas using these methods. We evaluated the methods through extensive simulation studies in terms of the false discovery rate, F1-score, concordance index, Brier score, root mean square error, and computation time. Of the methods compared, CoxBoost and the Adaptive LASSO performed well in all metrics, and the LASSO and elastic net excelled when evaluating concordance index and F1-score. The Benjamini-Hochberg and q-value procedures showed volatile performances in controlling the false discovery rate. Some methods' performances were greatly affected by differences in the data characteristics. With our extensive numerical study, we have identified the best performing methods for a plethora of data characteristics using informative metrics. This will help cancer researchers in choosing the best approach for their needs when working with genomic data.
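One of the metrics named in this abstract, the concordance index for right-censored data, can be illustrated with a short sketch. This assumes the standard Harrell formulation and, for simplicity, skips pairs with tied times; the authors' exact implementation is not described in the abstract, and the example values below are made up.

```python
def concordance_index(times, events, risks):
    """Harrell's C for right-censored survival data.

    times:  observed time for each subject (event or censoring time)
    events: 1 if the event was observed, 0 if the subject was censored
    risks:  predicted risk scores (higher means earlier expected event)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:  # i failed before j was last observed
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical cohort of four subjects; the third is censored at time 6.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
c_perfect = concordance_index(times, events, [0.9, 0.7, 0.5, 0.1])
c_reversed = concordance_index(times, events, [0.1, 0.5, 0.7, 0.9])
c_constant = concordance_index(times, events, [0.5, 0.5, 0.5, 0.5])
```

A value of 0.5 corresponds to uninformative predictions and 1.0 to a perfect ranking, which is the scale on which methods are compared for this metric.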
bioinformatics2026-04-01v1Serum metabolic signatures of cognitive resilience in a longitudinal aging cohort
Scheurink, T. A. W.; Seo, J. I.; David, L. C.; Wang, C. X.; Solis, D.; Zemlin, J.; Bergstrom, J.; Dorrestein, P. C.; Mohanty, I.; Molina, A. J. A.Abstract
Aging is typically accompanied by a progressive decline in cognitive function, yet some individuals maintain exceptional cognitive performance, even across the transition from middle to older age, defining exceptional cognitive resilience. While existing measures of resilience primarily rely on clinical assessments, its molecular determinants and early predictive markers remain poorly understood. Here, we performed untargeted LC-MS/MS profiling of longitudinal serum samples to identify metabolic signatures associated with cognitive resilience, which was established based on cognitive tests conducted over 28 years in a cohort of 237 participants. We observed associations across multiple chemical classes, including carnitines, glutamine conjugates, phosphocholines, as well as diet- and drug-derived metabolites. Chemical class-specific analyses revealed distinct metabolic profiles, including predominantly negative associations of medium-chain acylcarnitines with cognitive resilience, increased accumulation of glucuronide conjugates in individuals with low cognitive resilience, altered metabolism of the antihypertensive drug, metoprolol, and elevated levels of dietary compounds such as piperine and lutein in individuals with high cognitive resilience. By leveraging public metabolomics data, we further contextualized the metabolic signatures with respect to their organ specificity, microbial origin, and disease associations. Collectively, these metabolic features, including several previously underexplored compounds, represent promising candidates for functional characterization in mechanisms of aging biology and provide mechanistic insights into the molecular basis of cognitive resilience.
bioinformatics2026-04-01v1Subcellular Localization Constrains Protein Detectability and Reveals Systematic RNA-Protein Discordance Across Cancers
Joshi, K.; Kate, S.Abstract
Transcript abundance is widely used as a proxy for protein expression in cancer studies; however, mRNA levels often fail to predict protein detectability due to post-transcriptional and compartment-specific regulatory processes. Here, we present a machine learning framework that integrates RNA expression, gene-level attributes, and subcellular localization to model protein detectability across human cancers. Leveraging transcriptomic data from TCGA, TARGET, and GTEx, and protein annotations from the Human Protein Atlas, we constructed a dataset comprising over 100,000 gene-cancer pairs across seven tumor types. Models based on RNA features alone achieved moderate predictive performance (ROC-AUC ~0.71), whereas incorporating subcellular localization significantly improved accuracy (ROC-AUC ~0.82). Paired bootstrap analysis confirmed that these gains were statistically robust. We further identify a substantial set of genes with high transcript abundance yet absent protein detection, revealing widespread RNA-protein decoupling. These discordant genes are enriched in mitochondrial, metabolic, and translational regulatory pathways, suggesting that discordance reflects structured biological processes rather than stochastic variation. Together, our results demonstrate that cellular context, particularly subcellular localization, is a key determinant of protein detectability and underscore the limitations of transcript-centric interpretations in cancer genomics.
bioinformatics2026-04-01v1The human pangenome reference reduces ancestry-related biases in somatic mutation detection
Pham, C. V. K.; Abdelmalek, F. S. A.; Hua, T.; Apel, E.; Bizjak, A.; Schmidt, E. J.; Houlahan, K. E.Abstract
Commonly used human reference genomes collapse extensive genetic variability into a single linear genome of which 70% is derived from one donor. These linear genomes fail to capture the full spectrum of genetic variation, which can lead to misalignment of sequencing reads particularly for individuals underrepresented by the linear reference genomes. To address this shortcoming, the Human Pangenome Reference Consortium released the first draft of the human pangenome reference, a graph-based reference that integrates diverse haplotypes. While the human pangenome reference has shown increased accuracy in detecting inherited DNA variants, it remains to be seen if the observed improvements extend to somatic mutation detection. Here, we systematically benchmarked somatic single nucleotide variant (SNV) detection leveraging the human pangenome in 30 whole exome sequenced bladder tumours with matched blood tissue of diverse ancestries. We found somatic SNV detection leveraging the human pangenome reference outperformed the linear reference, most notably in individuals of East Asian ancestry where we observed on average a 20% improvement in detection accuracy. Improvements to detection accuracy in individuals of European ancestry were marginal. The increase in accuracy was attributed to reduced germline contamination and reduced reference bias. Further, we demonstrate the pangenome increases SNV detection precision, mitigating the need for time and computationally expensive ensemble approaches that take the consensus across multiple tools. Finally, we demonstrate that the increased precision when aligned to the pangenome generalized to an additional 29 lung adenocarcinoma tumours, particularly for individuals of East Asian ancestry. These findings support adoption of the pangenome to improve somatic variant detection and reduce ancestry-related disparities.
bioinformatics2026-04-01v1Accurate detection of mosaic mutations at short tandem repeats from bulk sequencing data
Wang, W.; Li, W.; Wang, C.; Fan, W.; Xia, Y.; Yang, X.; Chu, C.; Dou, Y.Abstract
Short tandem repeats (STRs) are among the most mutable regions of the human genome, yet their somatic mosaicism remains poorly characterized due to the technical challenges of distinguishing genuine mutations from high intrinsic polymorphism and sequencing noise. Here, we introduce BulkMonSTR, a computational framework that combines STR-specific error modelling with machine-learning classification to enable accurate detection of mosaic STR mutations from bulk next-generation sequencing data. BulkMonSTR identifies nucleotide-resolution mutations, including insertions, deletions, and single-nucleotide variants (SNVs), and supports both control-independent and case-control study designs. Leveraging a comprehensive training dataset derived from pedigree-based validation and in silico spike-in simulations, our random forest classifier effectively discriminates true mosaic events from germline variants and technical artifacts. Benchmarking on simulated and real datasets demonstrates that BulkMonSTR achieves substantially improved precision and F1 scores across diverse coverages and variant allele frequencies. In normal samples, cancer samples and controlled in silico mixing experiments, BulkMonSTR consistently outperforms existing methods, capturing a broader spectrum of STR mutations, including those arising on non-reference alleles, while achieving high validation rates. By enabling systematic, genome-wide interrogation of STR mosaicism, BulkMonSTR provides a scalable foundation for investigating the contributions of somatic STR mutations to aging and disease.
bioinformatics2026-04-01v1Protein Language Models Outperform BLAST for Evolutionarily Distant Enzymes: A Systematic Benchmark of EC Number Prediction
Sathyamoorthy, R.; Puri, M.Abstract
Accurate prediction of Enzyme Commission (EC) numbers is foundational to genome annotation, metabolic reconstruction, and enzyme engineering. Protein language models (PLMs) have transformed protein function prediction, yet their systematic evaluation for EC number prediction across architectures, EC hierarchy levels, and sequence identity thresholds is lacking. Here we present a comprehensive benchmark of three PLMs (ESM2-650M, ESM2-3B, ProtT5-XL) combined with nine downstream neural architectures, evaluated across four EC hierarchy levels and four sequence identity thresholds with 1,296 trained models in total. Our results establish that simple MLP classifiers achieve 98.0% accuracy at EC1, 96.9% at EC2, 96.6% at EC3, and 97.0% at EC4, matching or marginally exceeding a train-set-matched BLASTp baseline (+/-0.7 pp) for in-distribution proteins. Crucially, PLM-based methods dramatically outperform BLAST for evolutionarily distant eukaryotes: gains reach +31.8 pp over a fair 90K-sequence BLAST baseline (Giardia lamblia) and +26.4 pp over a full 520K SwissProt database (Trichomonas vaginalis). For held-out prokaryotic proteomes, PLMs outperform BLAST by a mean of +16.9 pp at EC4. Our benchmark reveals that (i) MLP architectures are sufficient and consistently superior to CNN/ResNet/Transformer variants, (ii) ESM2-650M is statistically distinguishable from but practically equivalent to the 5x larger ESM2-3B, and (iii) Transformer re-encoding of PLM embeddings fails at a shared learning rate due to convergence instability. All code, models, and benchmark results are available at https://github.com/r-mbio/plm_benchmark.git
bioinformatics2026-04-01v1VicMAG, an open-source tool for visualizing circular metagenome-assembled genomes highlighting bacterial virulence and antimicrobial resistance
Tsuda, Y.; Tanizawa, Y.; Vu, T. M. H.; Nishimura, Y.; Shintani, M.; Abe, H.; Hasebe, F.; Kasuga, I.; Nagao, M.; Suzuki, M.Abstract
Bacterial pathogens spread in clinical and environmental settings, and mobile genetic elements (MGEs), such as plasmids and phages, mediate the transfer of virulence factor genes (VFGs) and antimicrobial resistance genes (ARGs) among bacterial communities. Metagenomic analysis of environmental and wastewater samples using highly accurate long-read sequencing technologies, such as PacBio HiFi sequencing, provides valuable insights into monitoring the regional spread of VFGs and ARGs, including dissemination mediated by MGEs. No visualization tool is currently available for the comprehensive display of numerous resulting circular metagenome-assembled genomes (cMAGs) with functional gene annotations. Here, we developed VicMAG, a visualization tool for highly complex cMAGs derived from long-read metagenome assemblies annotated using updated databases of VFGs, ARGs, and MGEs. Using 353 cMAGs from PacBio HiFi sequencing of a wastewater sample, we demonstrated the utility of VicMAG for metagenome visualization. VicMAG provides comprehensive, size-aware visualization of cMAGs representing bacterial chromosomes and plasmids, annotated with VFGs, ARGs, and phages. By simultaneously visualizing all cMAGs in a framework, VicMAG facilitates a holistic understanding of the distribution and genomic context of VFGs and ARGs across complex microbial communities. This tool supports integrated surveillance of bacteria associated with virulence and antimicrobial resistance across clinical, environmental, and One Health contexts.
bioinformatics2026-04-01v1Assessing the potential of bee-collected pollen sequence data to train machine learning models for geolocation of sample origin
Hayes, R. A.; Kern, A. D.; Ponisio, L. C.Abstract
Pollen is a robust and widespread substance that captures a historical snapshot of a specific time and place, and it can be used to track movements through space by examining the pollen deposited on various objects. Palynology, the study of pollen, is used across fields such as conservation, natural history, and forensics, where it is particularly useful for tracing the origin and movement of objects. However, pollen has remained underutilized due to the difficulty of distinguishing many pollen taxa beyond the family level and limited pollen reference material to support location predictions. With recent developments in pollen DNA metabarcoding these issues have been rectified, but much of the available pollen data are primarily from wind-pollinated species, which are widespread and less informative of specific sample locations. Bee-collected pollen presents an untapped resource in training predictive models to geolocate sample origin. Here we compiled bee-collected pollen DNA sequence relative abundance data from three projects in the western U.S. and assessed the accuracy of supervised machine learning models to predict the location of sample origin based solely on pollen assemblage, without the need of incorporating additional data. Random Forest and k-Nearest Neighbors models yielded high accuracy across all projects. We also found that models trained on taxonomically clustered pollen assigned sequence variants (ASVs) performed slightly better than those trained on raw sequence data, but the difference was minor, indicating that models trained on raw sequence data can reliably predict location and avoid the time-consuming taxonomic assignment process. Our results demonstrate the utility of repurposing bee-collected pollen for geolocation and provide a framework for employing supervised machine learning in future geolocation efforts.
bioinformatics2026-04-01v1IMMREP25: Unseen Peptides
Richardson, E.; Aarts, Y. J. M.; Altin, J. A.; Baakman, C. A. B.; Bradley, P.; Chen, B.; Clifford, J.; Dhar, M.; Diepenbroek, D.; Fast, E.; Gowthaman, R.; He, J.; Karnaukhov, V.; Marzella, D. F.; Meysman, P.; Nielsen, M.; Nilsson, J. B.; Deleuran, S. N.; Parizi, F. M.; Pelissier, A.; Pierce, B. G.; Rodriguez Martinez, M.; Roran A R, D.; Saravanakumar, S.; Shao, Y.; Smit, N.; Van Houcke, M.; Visani, G. M.; Wan, Y.-T. R.; Wang, X.; Woods, L.; Wuyts, S.; Xiao, C.; Xue, L. C.; IMMREP25 Participant Consortium, ; Barton, J.; Noakes, M.; May, D. H.; Peters, B.Abstract
T cell receptors (TCRs) can bind to peptides presented by MHC molecules (pMHC) as a first step to trigger a T cell response. Reliable approaches to predict TCR:pMHC binding would have broad applications in clinical diagnostics, therapeutics, and the fundamental understanding of molecular interactions. IMMREP is a community-organized series of prediction contests that asks participants to predict TCR:pMHC binding on unpublished datasets. Previous iterations in 2022 and 2023 showed multiple approaches can predict TCR:pMHC binding with significant accuracy (median AUC_0.1 greater than or equal to 0.7) for peptides where experimental data is available ("seen" peptides). In contrast, models did not outperform random guessing for peptides that have no such data available ("unseen" peptides). Here we report on the results of IMMREP25, which focused solely on unseen peptides in order to evaluate the cutting edge of the field. We received 126 named submissions predicting the specificity of 1,000 TCRs against twenty unseen peptides restricted by one of two MHC molecules (HLA-A*02:01 and HLA-B*40:01). The best performing methods showed a macro-AUC_0.1 of 0.60, significantly better than random, demonstrating significant advances in the field. The top performing methods incorporated structural modeling into their approach, indicating that especially for "unseen" peptides, a structural understanding aids in the prediction of TCR:pMHC interactions. The results from this benchmark highlight the significant challenges remaining for TCR:pMHC predictions and will inform future method development.
bioinformatics2026-04-01v1Millisecond Prediction of Protein Contact Maps from Amino Acid Sequences
Lin, R.; Ahnert, S. E.Abstract
Protein structure prediction typically outputs static coordinates, often obscuring the underlying physical principles and conformational flexibility. In this work, we present a coarse-grained generative framework to recover the Circuit Topology (CT) of proteins using Generative Flow Matching. We represent protein architecture using highly compressed Secondary Structure Elements (SSEs), reducing the sequence length to roughly 1/13 of the original amino acid sequence. We show that this minimal representation captures the essential "topological fingerprint" required to determine the global fold. By employing a joint-prediction head, our model simultaneously generates contact probabilities and asymmetric topological features, achieving a mean F1 score of 0.822 at the SSE level. Notably, our results demonstrate a counter-intuitive robustness in capturing long-range interactions, suggesting that global topology acts as a stable constraint compared to local residue packing. Furthermore, we show that these coarse-grained predictions can be mapped back to residue-level contact maps with sub-helical precision, yielding a mean alignment error of 2.69 residues. The probabilistic nature of the flow model effectively separates the stable structural signal of the folding core from flexible regions, providing a physically interpretable view of the protein's conformational ensemble. This pipeline is extremely fast, capable of completing a contact map prediction from amino acid sequence in an average of 110 milliseconds on a single GPU. These ultra-fast and accurate predictions provide a valuable tool for identifying conserved protein folding cores, facilitating the exploration of the protein structural genotype-phenotype (GP) map through large-scale sampling of mutants with highly similar folding cores.
bioinformatics2026-03-31v4Protein sequence domain annotation using a language model
Sarkar, A.; Krishnan, K.; Eddy, S. R.Abstract
Protein domain annotation underlies large-scale functional inference and is commonly performed by scanning sequences against libraries of profile hidden Markov models (profile HMMs). We describe PSALM, a protein domain annotation method that combines (i) a pretrained protein language model (ESM-2) with (ii) a per-residue domain-state classifier and (iii) a structured probabilistic decoder that produces a single, non-overlapping set of domain calls with explicit boundaries and scores. On a benchmark of 89M protein sequences with 107M annotated domains, PSALM attains a domain-detection sensitivity-specificity tradeoff comparable to HMMER. We characterize sequence and residue-level coverage on UniProtKB, observing higher coverage for HMMER at stringent expected false positive counts (E-values) and higher coverage for PSALM at relaxed E-values. We release code for data processing, training, and inference, along with the model weights and datasets used for training, validation, and benchmarking.
bioinformatics2026-03-31v3Phylogenetic detection of protein sites associated with continuous traits
Duchemin, L.; Muntane, G.; Boussau, B.; Veber, P.Abstract
Comparative genomic data can be used to look for substitutions in coding sequences that are associated with the variation of a particular phenotypic trait. A few statistical methods have been proposed to do so for phenotypes represented by discrete values. For continuous traits, no such statistical approach has been proposed, and researchers have resorted to sensible but uncharacterized criteria. Here, we investigate a phylogenetic model for coding sequences where amino acid preferences at a site are given by a continuous function of a quantitative trait. This function is inferred from the amino acids and the trait values in extant species and requires inferred point estimates of ancestral values of the trait at internal nodes. For detecting sites whose evolution is associated with this trait, we use a significance test against the hypothesis that amino acid preference does not depend on the trait. This procedure is compared to simpler strategies on simulated alignments. It displays an increased recall for low false positive rates, which is of special importance for performing whole-genome scans. This comes however at a much higher computational cost, and we suggest using a simple test to filter promising candidate sites. We then revisit a dataset of alignments for 62 species of mammals, using longevity as a phenotypic trait. We apply our method to three protein families that have previously been proposed to display sites associated with variation in lifespan in mammals. Using a graphical representation extracted from the detailed phylogenetic analysis of candidate sites, we suggest that the evidence for this in the sequence data alone is weak. The proposed method has been added to our Pelican software. It is available at https://gitlab.in2p3.fr/phoogle/pelican and can now be used with both discrete and continuous phenotypes to search for sites associated with phenotypic variation, on data sets with thousands of alignments.
bioinformatics2026-03-31v3Deep representation learning for temporal inference in cancer omics: a systematic review
Prol-Castelo, G.; Cirillo, D.; Valencia, A.Abstract
Deep learning methods, including deep representation learning (DRL) approaches such as variational autoencoders (VAEs), have been widely applied to cancer omics data to address the high dimensionality of these datasets. Despite remarkable advances, cancer remains a complex, dynamic disease that is challenging to study, and the temporal resolution of cancer progression captured by omics-based studies remains limited. In this systematic literature review, we explore the use of DRL, particularly the VAE, in cancer omics studies for modeling time-related processes, such as tumor progression and evolutionary dynamics. Our work reveals that these methods most commonly support subtyping, diagnosis, and prognosis in this context, but rarely emphasize temporal information. We observed that the scarcity of longitudinal omics data currently limits deeper temporal analyses that could enhance these applications. We propose that applying the VAE as a generative model to study cancer in time, e.g. focusing on cancer staging, could lead to meaningful advancements in our understanding of the disease.
bioinformatics2026-03-31v2Amino acid substitutomics: profiling amino acid substitutions at proteomic scale unveils biological implication and escape mechanism in cancer
Zhao, P.; DAI, S.; Lai, S.; Zhou, C.; Li, N.; Yu, W.Abstract
Amino acid (AA) substitutions play a critical role in regulating cellular activities, including complex signaling and cell cycle processes. Recent research on AA substitutions has primarily relied on genomic and transcriptomic data. Their identification at the proteomic scale remains underexplored, despite evidence suggesting that DNA and RNA biosynthesis are not the sole sources of these substitutions. This gap persists due to challenges in analyzing large-scale proteomic data. In this study, we address this limitation by analyzing multiple independent datasets across five cancer types using PIPI-C, a novel mass spectrometry data analysis tool. We also propose AA substitutomics, a pipeline for characterizing AA substitutions arising after protein translation and dissecting the regulatory functions of key proteins with AA substitutions. Among our identified AA substitutions, 87% are novel and not recorded in genomic/transcriptomic databases, indicating that post-translational AA substitutions are prevalent. Our findings reveal biologically significant AA substitutions linked to cancer, such as F43S and E91D in hemoglobin subunit beta, P584T in filamin A, and A175N in fructose-bisphosphate aldolase B. Furthermore, our pipeline enables direct investigation of drug resistance and immune escape. By capturing functional protein-level alterations beyond genomic and transcriptomic profiling, it establishes a robust framework to advance cancer research.
bioinformatics2026-03-31v2Reliable prediction of short linear motifs in the human proteome
Pancsa, R.; Ficho, E.; Kalman, Z. E.; Gerdan, C.; Remenyi, I.; Zeke, A.; Tusnady, G. E.; Dobson, L.Abstract
Short linear motifs (SLiMs) are small interaction modules within intrinsically disordered regions of proteins that interact with specific domains, and thereby regulate numerous biological processes. Their limited sequence information leads to frequent false positive hits in computational and experimental SLiM identification methods. We present SLiMMine, a deep learning-based method to identify SLiMs in the human proteome. By refining the annotations of known motif classes, we created a high-quality training dataset. Using protein embeddings and neural networks, SLiMMine reliably predicts novel SLiM candidates in known classes and eliminates ~80% of the pattern matching-based hits as false positives. It also functions as a discovery tool to find uncharacterized SLiMs based on optimal sequence environment. Finally, narrowing the broad interactor-domain definitions of known SLiM classes to specific human proteins enables more precise linking of predicted SLiMs to known protein-protein interactions. SLiMMine is available as a user-friendly, multi-purpose web server at https://slimmine.pbrg.hu/.
bioinformatics2026-03-31v2IDBSpred: An intrinsically disordered binding site predictor using machine learning and protein language model
Jones, D.; Wu, Y.Abstract
Intrinsically disordered proteins (IDPs) mediate many cellular functions through interactions with structured protein partners, but predicting the corresponding binding sites on the structured partner remains challenging. Here, we present IDBSpred, a sequence-based method for residue-level prediction of IDP-binding sites on structured proteins. Training and test data were collected from the DIBS database, which contains more than 700 non-redundant IDP-protein complexes. Residue-level embeddings of structured partner sequences were generated using the ESM-2 protein language model and used as input to a multilayer perceptron classifier for binary prediction of binding versus non-binding residues. Analysis of amino acid composition showed that IDP-binding sites are enriched in aromatic residues, especially Trp, Tyr, and Phe, as well as several charged and polar residues, whereas Ala and several small or conformationally restrictive residues are depleted. The classifier achieved an ROC AUC of 0.87 and an average precision of 0.61. Structural case studies further showed that the predicted sites largely recapitulate the major experimentally defined binding interfaces. These results demonstrate that protein language model embeddings plus machine learning algorithms can effectively capture sequence features associated with IDP recognition on structured proteins. IDBSpred provides a practical framework for studying IDP-mediated interfaces and identifying potential therapeutic hotspots.
bioinformatics2026-03-31v2A Bioinformatic Investigation into the Role of ITGB1 in Cancer Prognosis and Therapeutic Resistance
Mo, X.Abstract
Integrin β1 is a crucial transmembrane protein that regulates cellular adhesion, migration, and signal transduction, processes essential for cancer progression. This study investigates the role of ITGB1, the gene that encodes Integrin β1, in various cancers using bioinformatics tools. By analyzing gene expression data across different cancer types and normal tissues, the study identifies significant upregulation of ITGB1 in several cancers. We find elevated expression of ITGB1 is associated with poor prognosis in multiple tumors, suggesting its potential as a biomarker for cancer progression and therapeutic resistance. Further analysis reveals ITGB1's correlation with chemoresistance and immunoresistance genes, highlighting its involvement in cancer treatment evasion. The study also explores the expression and role of genes that are highly related to ITGB1 in tumor and patient prognosis, offering insights into potential molecular pathways and therapeutic targets. These findings underscore the clinical relevance of ITGB1 in cancer prognosis and therapy.
bioinformatics2026-03-31v2CCIDeconv: Hierarchical model for deconvolution of subcellular cell-cell interactions in single-cell data
Jayakumar, R.; Panwar, P.; Yang, J. Y. H.; Ghazanfar, S.Abstract
Motivation: Cell-cell interaction (CCI) underlies several fundamental mechanisms including development, homeostasis and disease progression. CCI are known to be localised to specific subcellular regions, for example, within the cytoplasms of cells. With the emergence of subcellular spatial transcriptomics technologies (sST), there is an opportunity to attribute CCI to subcellular regions. We aimed to deconvolute CCI to subcellular CCI (sCCI) in non-spatial single-cell transcriptomics (i.e. scRNA-seq) datasets using a modified CCI score from CellChat. Results: By calculating the sCCI score specific to cytoplasm and nucleus in nine publicly available sST datasets, we identified unique nucleus-nucleus and cytoplasm-cytoplasm sCCI. Then, we deconvolved the communication score to subcellular regions by using a hierarchical classification and regression model, which we name CCIDeconv. We performed leave-one-dataset-out cross-validation across nine datasets over a range of different tissue types from human samples. We observed that training across many different tissue types resulted in robust deconvolution performance in an unseen dataset. As the number of training datasets increased, models trained without spatial features achieved performance similar to that of models including spatial features. This implied the potential for accurate prediction of sCCI events even from scRNA-seq, given large numbers of training datasets. Overall, we offer a method towards attributing CCI events to subcellular regions. This method can help researchers dissect sCCI patterns to gain insights into the underlying biology in a range of tissues covering health and disease.
bioinformatics2026-03-31v2Flipper: An advanced framework for identifying differential RNA binding behavior with eCLIP data
Flanagan, K.; Xu, S.; Yeo, G. W.Abstract
Motivation: Crosslinking and immunoprecipitation (CLIP) methods remain the gold standard for characterizing RNA binding protein (RBP) behavior. As a result, many researchers rely on CLIP to assess how treatments targeting RBPs alter binding patterns and regulatory activity. However, current tools for differential RBP binding analysis lack core features required for rigorous statistical inference, including proper normalization and appropriate handling of replicate experiments. Furthermore, existing approaches cannot adequately separate expression driven effects from true changes in RBP binding, complicating interpretation of differential analyses. Addressing these limitations is essential for producing reproducible and informative analyses of differential RBP binding. Results: Here we present Flipper, an application purpose built for the analysis of differential RBP binding. Flipper introduces several innovations that adapt the DESeq2 framework for robust differential analysis of eCLIP count data. These include integration of input controls to account for expression driven binding shifts, hierarchical normalization strategies that adjust for technical variation without confounding signal to noise ratios, and improved post-differential analysis tools. We demonstrate that Flipper exhibits high specificity when applied to real differential eCLIP data while also providing deeper biological insights. In addition, analyses of both real and simulated data indicate that Flipper achieves superior sensitivity and precision compared with existing approaches. Together, these results highlight Flipper as a robust and generalizable framework for differential RBP binding analysis.
bioinformatics2026-03-31v2Co-designing sequence and structure of functional de novo enzymes with EnzyGen2
Song, Z.; Liu, H.; Zhao, Y.; Yang, Y.; Li, L.Abstract
Proteins underpin essential biological functions across all kingdoms of life. The capacity to design novel proteins with tailored activities holds transformative potential for biotechnology, medicine, and sustainability. However, since protein functions, particularly enzymatic activities, depend on precise interactions with small-molecule ligands, accurately modeling these interactions remains a formidable challenge in de novo protein design. Here, we introduce EnzyGen2, a protein foundation model designed for the simultaneous co-design of sequence and structure under ligand-guided functional targeting. Comprising 730 million parameters, EnzyGen2 is trained on 720,993 protein-ligand pairs using multi-task learning objectives that encompass the joint prediction of sequence, structure, and protein-ligand interactions. In rigorous in silico benchmarks, EnzyGen2 consistently outperforms state-of-the-art baselines, including Inpainting, RFdiffusion/ProteinMPNN, RFdiffusion2/LigandMPNN, and RFdiffusion3/LigandMPNN, as measured by enzyme-substrate prediction scores, AlphaFold2 confidence metrics, and structural fidelity, while generating samples 400x faster than prior methods. We further experimentally validated EnzyGen2 across multiple enzyme families, including chloramphenicol acetyltransferase, aminoglycoside adenylyltransferase, and thiopurine S-methyltransferase. De novo enzymes generated by our family-specific EnzyGen2 exhibited catalytic activities comparable to or exceeding those of natural enzymes, while retaining substantial novelty with sequence identities as low as 51.6%. These results establish EnzyGen2 as a robust Artificial Intelligence-based tool for functional enzyme design, demonstrating the power of large protein foundation models to create high-performance, novel biocatalysts.
bioinformatics2026-03-31v2Scalable computation of ultrabubbles in pangenomes by orienting bidirected graphs
Harviainen, J.; Sena, F.; Moumard, C.; Politov, A.; Schmidt, S.; Tomescu, A. I.Abstract
Motivation: Pangenome graphs are increasingly used in bioinformatics, ranging from environmental surveillance and crop improvement to the construction of population-scale human pangenomes. As these graphs grow in size, methods that scale efficiently become essential. A central task in pangenome analysis is the discovery of variation structures. In directed graphs, the most widely studied such structures, superbubbles, can be identified in linear time. Their canonical generalization to bidirected graphs, ultrabubbles, more accurately models DNA reverse complementarity. However, existing ultrabubble algorithms are quadratic in the worst case. Results: We show that all ultrabubbles in a bidirected graph containing at least one tip or one cutvertex--a common property of pangenome graphs--can be computed in linear time. Our key contribution is a new linear-time orientation algorithm that transforms such a bidirected graph into a directed graph of the same size, in practice. Orientation conflicts are resolved by introducing auxiliary source or sink vertices. We prove that ultrabubbles in the original bidirected graph correspond to weak superbubbles in the resulting directed graph, enabling the use of existing linear-time algorithms. Our approach achieves speedups of up to 25x over the ultrabubble implementation in vg, and of more than 200x over BubbleGun, enabling scalable pangenome analyses. For example, on the v2.0 pangenome graph constructed by the Human Pangenome Reference Consortium from 232 individuals, after reading the input, our method completes in under 3 minutes, while vg requires more than one hour, and four times more RAM. Availability: Our method is implemented in the BubbleFinder tool https://www.github.com/algbio/BubbleFinder, via the new ultrabubbles subcommand.
bioinformatics2026-03-31v1CoLa-VAE: Cell-Cell Communication-aware Variational Autoencoder with Dynamic Graph Laplacian Constraints
Chen, Y.; Qi, C.; Fang, H.; Luan, F.; Zhang, Z.; Arya, S.; Wei, Z.Abstract
Intercellular communication is a fundamental driver of cellular identity and tissue homeostasis, yet current single-cell representation learning frameworks predominantly model cell state as a function of intrinsic gene expression, neglecting the extrinsic signaling context. Conversely, dedicated communication inference tools are often constrained by the sparsity and noise of raw transcriptomic data. Here, we present CoLa-VAE, a deep generative framework that explicitly integrates cell-cell communication (CCC) constraints into latent variable learning. By employing a dynamic graph Laplacian regularization derived from pairwise ligand-receptor interactions, CoLa-VAE disentangles communication-driven topology from intrinsic transcriptional heterogeneity within a variational autoencoder architecture. We demonstrate that CoLa-VAE serves as a robust, method-agnostic framework compatible with diverse signaling definitions, consistently outperforming state-of-the-art baselines in structural clustering metrics and denoising fidelity across heterogeneous sequencing platforms.
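The graph Laplacian regularization mentioned above can be sketched in a few lines. The abstract does not say how the communication weights are derived or how they are updated during training, so this example only illustrates the generic penalty tr(Z^T L Z), which equals half the weighted sum of squared distances between the latent vectors of communicating cells. The weight matrix W, the latent matrix Z, and the function names are hypothetical.

```python
def laplacian(W):
    """Unnormalized graph Laplacian L = D - W for a symmetric weight matrix."""
    n = len(W)
    return [
        [(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(n)]
        for i in range(n)
    ]


def laplacian_penalty(W, Z):
    """Compute tr(Z^T L Z) for latent vectors Z (one row per cell)."""
    L = laplacian(W)
    n, d = len(Z), len(Z[0])
    return sum(
        Z[i][k] * L[i][j] * Z[j][k]
        for i in range(n)
        for j in range(n)
        for k in range(d)
    )


def pairwise_penalty(W, Z):
    """Equivalent form: 0.5 * sum_ij W_ij * ||z_i - z_j||^2."""
    n = len(Z)
    return 0.5 * sum(
        W[i][j] * sum((a - b) ** 2 for a, b in zip(Z[i], Z[j]))
        for i in range(n)
        for j in range(n)
    )


# Hypothetical communication weights among three cells (symmetric) and
# their two-dimensional latent embeddings.
W = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 2.0],
     [0.0, 2.0, 0.0]]
Z = [[0.0, 0.0],
     [1.0, 0.0],
     [1.0, 2.0]]

penalty_trace = laplacian_penalty(W, Z)
penalty_pairs = pairwise_penalty(W, Z)
```

Both functions return the same value, which shows why adding this term to a VAE loss pulls strongly communicating cells together in latent space: each edge contributes its weight times the squared distance between its endpoints.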
bioinformatics2026-03-31v1Identifying Inheritance Patterns of Allelic Imbalance, using Integrative Modeling and Bayesian Inference
Hoyt, S. H.; Reddy, T. E.; Gordan, R.; Allen, A. S.; Majoros, W. H.Abstract
Interpreting the effects of novel mutations on phenotypic traits remains challenging, particularly for cis-regulatory variants. For rare variants, individuals typically possess at most one affected copy of the causal allele, leading to allelic imbalance, and thus the ability to infer inheritance of allelic imbalance can inform genetic studies of phenotypic traits. While many methods for detection of allele-specific expression (ASE) exist, they largely focus on ASE in one individual. We show that performing joint inference across multiple individuals in a trio allows for simultaneously improving estimates of ASE and identifying its likely mode of inheritance. Our Bayesian approach has the benefit of being able to (1) aggregate information across individuals so as to improve statistical power, (2) estimate uncertainty in estimates, and (3) rank modes of inheritance by posterior probability. We demonstrate that this model is also applicable to other forms of imbalance such as allele-specific chromatin accessibility. Applying the model to ATAC-seq and RNA-seq from several trios, we uncover examples in which ASE can be linked to imbalance in chromatin state of cis-regulatory elements and to potential causal variants. As the cost of sequencing continues to decrease, we expect that powerful methodologies such as the one presented here will promote more routine collection of samples from related individuals and improve our understanding of genetic effects on gene regulation and their contribution to phenotypic traits.
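Ranking modes of inheritance by posterior probability can be illustrated with a deliberately simplified toy model. This is not the authors' model: it assumes that each individual's reference-allele read count is binomial, that a balanced individual has allelic fraction 0.5, that an imbalanced individual has an unknown fraction with a uniform prior (independent between individuals), and that four hypothetical modes are equally likely a priori. All names and read counts are invented for illustration.

```python
from math import comb


def lik_balanced(k, n):
    """Binomial likelihood of k reference reads out of n at fraction 0.5."""
    return comb(n, k) * 0.5 ** n


def lik_imbalanced(k, n):
    """Marginal likelihood with a uniform prior on the allelic fraction.

    Integrating the binomial likelihood over a Beta(1, 1) prior gives
    1 / (n + 1) for every k.
    """
    return 1.0 / (n + 1)


# Hypothetical modes: which trio members show allelic imbalance.
MODES = {
    "none": (),
    "maternal": ("mother", "child"),
    "paternal": ("father", "child"),
    "de_novo": ("child",),
}


def rank_modes(counts):
    """Return (mode, posterior) pairs sorted by decreasing posterior.

    counts maps each trio member to (reference_reads, total_reads).
    A uniform prior over modes is assumed.
    """
    unnormalized = {}
    for mode, imbalanced in MODES.items():
        likelihood = 1.0
        for person, (k, n) in counts.items():
            if person in imbalanced:
                likelihood *= lik_imbalanced(k, n)
            else:
                likelihood *= lik_balanced(k, n)
        unnormalized[mode] = likelihood
    total = sum(unnormalized.values())
    posterior = [(mode, value / total) for mode, value in unnormalized.items()]
    return sorted(posterior, key=lambda item: -item[1])


# Invented read counts: mother and child are skewed, father is balanced.
trio_counts = {"mother": (45, 50), "father": (26, 50), "child": (44, 50)}
ranking = rank_modes(trio_counts)
```

Because the three likelihoods are multiplied within each mode, evidence from the parents sharpens the call for the child, which is the intuition behind aggregating information across a trio.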
bioinformatics2026-03-31v1