Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
GeneCAD: Plant Genome Annotation with a DNA Foundation Model
Liu, Z.-Y.; Berthel, A.; Czech, E.; Marroquin, E.; Stitzer, M. C.; Hsu, S.-K.; Pennell, M.; Buckler, E. S.; Zhai, J.
Abstract
Accurate genome annotation is fundamental to biological discovery, yet identifying gene structures directly from DNA sequence remains a major challenge in complex genomes. We introduce GeneCAD, a sequence-only framework that predicts biologically coherent gene models without requiring species-matched transcriptomic or proteomic evidence. GeneCAD integrates lineage-specific DNA representations from the PlantCAD2 foundation model with a transformer encoder and a chromosome-scale conditional random field (CRF) to enforce structural constraints, such as splice-phase and feature order. To ensure high-quality supervision, we implement a curation strategy using a sequence-based masked-motif score to filter reference transcripts. As a primary validation across diverse angiosperms, including a complex allotetraploid, GeneCAD improves transcript F1 by approximately 9% over current tools like Helixer and BRAKER3, while sharpening boundary precision and achieving a best-in-class recovery of 86% of classical coding sequences. Furthermore, we demonstrate the framework's modularity by adapting it to animal lineages through the substitution of the underlying DNA foundation model. While the long introns of vertebrates challenge full transcript reconstruction, the model remains highly effective at identifying individual exons. By connecting evolutionary signals with structured decoding, GeneCAD provides a versatile and scalable solution for high-fidelity genome annotation across the Tree of Life.
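The role of structured decoding can be illustrated with a toy example. The sketch below is not GeneCAD's CRF: the three states, the allowed-transition table, and the per-position probabilities are all hypothetical, and splice phase is ignored. It only shows how Viterbi decoding under hard transition constraints can replace an incoherent per-position argmax with a coherent label path.

```python
import math

# Hypothetical label set and feature-order constraints (not taken from GeneCAD).
STATES = ["intergenic", "exon", "intron"]
ALLOWED = {
    "intergenic": {"intergenic", "exon"},
    "exon": {"exon", "intron", "intergenic"},
    "intron": {"intron", "exon"},
}

def viterbi(probs):
    """Return the most probable state path that uses only allowed transitions.

    probs is a list of dicts mapping each state to a per-position probability.
    """
    best = {s: math.log(probs[0][s]) for s in STATES}
    back = []
    for p in probs[1:]:
        new_best, pointers = {}, {}
        for s in STATES:
            # Only predecessors that may legally precede state s are considered.
            prev = max((r for r in STATES if s in ALLOWED[r]), key=lambda r: best[r])
            new_best[s] = best[prev] + math.log(p[s])
            pointers[s] = prev
        best = new_best
        back.append(pointers)
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Illustrative per-position probabilities, as an encoder might emit them.
probs = [
    {"intergenic": 0.9, "exon": 0.05, "intron": 0.05},
    {"intergenic": 0.1, "exon": 0.8, "intron": 0.1},
    {"intergenic": 0.6, "exon": 0.1, "intron": 0.3},
    {"intergenic": 0.1, "exon": 0.1, "intron": 0.8},
    {"intergenic": 0.1, "exon": 0.8, "intron": 0.1},
    {"intergenic": 0.9, "exon": 0.05, "intron": 0.05},
]

argmax_path = [max(STATES, key=lambda s: p[s]) for p in probs]
constrained_path = viterbi(probs)
```

Here the per-position argmax labels position 2 intergenic and position 3 intron, an order the transition table forbids; the constrained path keeps the intron inside the gene.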
bioinformatics 2026-05-12 v4
ConvergeCELL: An end-to-end platform from patient transcriptomics to therapeutic hypotheses
Shahar, N.; Miller, D.; Shahoha, M.; Lurie, G.; Weiner, I. N.
Abstract
Translating transcriptomic data into therapeutic hypotheses remains fragmented and labor-intensive. Here we present ConvergeCELL, a platform combining a patient representation model trained on over 20 million cells across 4,479 patients, an interpretability framework for gene discovery, and a large language model-driven workflow that classifies candidates along an evidence hierarchy and constructs mechanism-of-action hypotheses. Validated on held-out cohorts spanning lupus, multiple myeloma, and sepsis across single-cell and bulk modalities, ConvergeCELL recovers known disease-associated genes at or above differential expression, machine-learning, and patient-level foundation model (PaSCient) baselines. The advantage is most pronounced for clinically validated, disease-specific drug targets: ConvergeCELL ranks TNFSF13B (Belimumab; lupus), TNFRSF17/BCMA (Belantamab; myeloma), and CXCR4 (Plerixafor; myeloma) within the top 0.3% of its gene rankings - significantly outcompeting alternative approaches. ConvergeCELL delivers an end-to-end translational workflow with state-of-the-art performance on both disease-associated gene recovery and patient-level disease classification. The pretrained ConvergeCELL patient representation model and bulk distillation module are publicly available on Hugging Face (huggingface.co/ConvergeBio/virtual-cell-patient) under the Apache 2.0 license.
bioinformatics 2026-05-12 v1
Quantifying Cross-Modal Association Confidence for Single-Cell RNA-ATAC Integration
Furutani, T.; Ji, H.
Abstract
While multimodal sequencing technologies are rapidly advancing, most single-cell and spatial datasets still measure only a single modality. Integrative computational methods for separately profiled single-cell RNA-seq (scRNA-seq) and ATAC-seq (scATAC-seq) data typically rely on the assumption that gene expression correlates with the chromatin accessibility of nearby regulatory regions. However, the strength and reliability of these correlations vary substantially across genes, and incorporating low-confidence associations can compromise integration accuracy. Here, we introduce the CLIC (Cross-modality Link Confidence) score, a quantitative measure of the empirical concordance between gene expression and nearby chromatin accessibility, derived from diverse single-cell multiome datasets from the ENCODE project. CLIC scores provide prior confidence estimates for gene-peak associations across modalities. Building on this, we propose a hybrid feature selection strategy that intersects highly variable genes with high-CLIC genes, generating feature sets that better align with the assumptions of cross-modal integration methods. Across diverse publicly available single-cell and spatial datasets, and multiple state-of-the-art integration frameworks, our approach consistently improves the integration of gene expression and chromatin accessibility data, enhancing both robustness and biological interpretability.
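The hybrid feature selection described above amounts to a set intersection. The sketch below assumes per-gene CLIC scores are available as a dictionary; the gene names, score values, and the `min_clic` cutoff are hypothetical and are not taken from the paper.

```python
def hybrid_features(hvgs, clic_scores, min_clic):
    """Keep highly variable genes whose CLIC score meets the cutoff."""
    high_clic = {gene for gene, score in clic_scores.items() if score >= min_clic}
    return sorted(set(hvgs) & high_clic)

# Illustrative inputs only: a list of highly variable genes and made-up scores.
hvgs = ["GATA1", "CD3E", "MS4A1", "LYZ"]
clic_scores = {"GATA1": 0.82, "CD3E": 0.64, "MS4A1": 0.21, "ACTB": 0.90}

selected = hybrid_features(hvgs, clic_scores, min_clic=0.5)
```

Genes that are variable but have a low or missing score (MS4A1, LYZ) are dropped, as are high-scoring genes that are not variable (ACTB).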
bioinformatics 2026-05-12 v1
The illusion of interpretability in biologically informed neural networks
Caranzano, I.; Sanavia, T.; Lio', P.; Baldi, P.; Fariselli, P.
Abstract
Biologically informed neural networks (BINNs), also known as visible neural networks (VNNs), are widely adopted in omics because their architectures mirror known biological structures, such as gene-to-pathway relationships, and are therefore often assumed to be inherently interpretable. This assumption implies that learned gene-to-pathway weights and pathway node activations reflect meaningful biological mechanisms. Here, we show that this premise fails for a classical reason: nonidentifiability. Using a controlled teacher and student framework, we demonstrate that even under ideal conditions, including noiseless data, the correct model class, and identical sparse wiring, a BINN can perfectly recover the input-to-output mapping while failing to recover both gene-to-pathway weights and pathway activations. This failure persists across classification, regression, and survival tasks, and remains robust to variations in biological structure and network depth. Thus, the problem is not merely overparameterization or poor optimization: learning from outputs alone does not identify internal structure. Since biological mechanisms are not directly observed, recovering them from predictions alone is harder, not easier, than recovering neural network parameters, which are already known to be nonidentifiable. Critically, this failure reflects standard practice: widely used BINNs do not impose objective level constraints on gene-to-pathway weights or pathway activations, and therefore operate precisely in the regime modeled by our teacher-student framework. These results indicate that architectural transparency does not imply mechanistic interpretability. Without constraints that explicitly enforce identifiability, the apparent interpretability of BINNs reflects their design rather than what they actually learn.
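One familiar source of nonidentifiability, the positive rescaling symmetry of ReLU units, can be shown in a few lines. The sketch below is not the authors' teacher-student experiment: the wiring, weights, and scale factors are hypothetical. It only demonstrates that two networks with identical sparse wiring can agree on every output while disagreeing on gene-to-pathway weights and pathway activations.

```python
import math

def relu(x):
    return max(0.0, x)

def binn(genes, gene_to_pathway, pathway_to_out):
    """One sparse 'pathway' layer with ReLU, followed by a linear output."""
    pathways = [
        relu(sum(w * genes[i] for i, w in wiring.items()))
        for wiring in gene_to_pathway
    ]
    output = sum(v * a for v, a in zip(pathway_to_out, pathways))
    return output, pathways

# Hypothetical teacher: pathway 0 reads genes 0 and 1, pathway 1 reads genes 2 and 3.
teacher_hidden = [{0: 1.0, 1: -2.0}, {2: 0.5, 3: 1.5}]
teacher_out = [3.0, -1.0]

# Student: same wiring, but each pathway is rescaled by c > 0 and its
# output weight by 1 / c, which leaves the input-to-output map unchanged.
scales = [4.0, 0.25]
student_hidden = [
    {i: w * c for i, w in wiring.items()}
    for wiring, c in zip(teacher_hidden, scales)
]
student_out = [v / c for v, c in zip(teacher_out, scales)]

samples = [[1.0, 0.2, 0.4, 0.6], [0.5, 1.0, 2.0, 0.1], [2.0, 0.5, 0.0, 1.0]]
teacher_results = [binn(g, teacher_hidden, teacher_out) for g in samples]
student_results = [binn(g, student_hidden, student_out) for g in samples]
```

Every pair of outputs matches, yet the first sample's pathway activations differ by the scale factors, so neither the weights nor the activations can be read as the unique "mechanism".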
bioinformatics 2026-05-12 v1
StabCell: Stability selection for clustering and marker detection in single-cell RNA sequencing
Lück, N.; Rossi, A.; Staerk, C.
Abstract
Motivation: Conventional pipelines for differential expression analysis in single-cell RNA sequencing (scRNA-seq) data first cluster individual cells and then test for differentially expressed genes between the resulting clusters. Using the same data for clustering and testing, however, poses a selective inference problem and can result in overconfidence in differences that may not reflect true biological variation. Results: We introduce StabCell, a stability selection framework that integrates clustering and detection of differentially expressed marker genes. By repeatedly performing clustering and differential expression analysis in complementary random subsamples, StabCell assesses clustering and marker stability, yielding a stable clustering with sets of stable marker genes. In simulations, we demonstrate that StabCell provides approximate empirical per-family error rate (PFER) control, selecting fewer false positive marker genes compared with conventional approaches, especially in cases with low signal-to-noise ratio and low sequencing depth. Applying the method to a cell differentiation dataset from induced pluripotent stem cells (iPSCs) to cardiomyocytes reveals that meaningful marker genes are consistently among the top-ranked genes. These results indicate that StabCell can improve the interpretability and robustness of scRNA-seq analyses. Availability and implementation: An implementation of StabCell in the statistical programming language R is available at https://github.com/LuckyLueck/StabCell. Code to reproduce the results is available at https://github.com/LuckyLueck/StabCell_paper.
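The idea of selecting markers by their frequency across complementary subsamples can be sketched as follows. This is a simplified Python illustration, not the StabCell R implementation: group labels are held fixed rather than re-clustered in each subsample, the mean-difference selector stands in for a proper differential expression test, and the data, `cutoff`, and `threshold` values are all assumptions.

```python
import random

def select_markers(cells, labels, cutoff):
    """Toy selector: genes whose two group means differ by more than cutoff."""
    chosen = set()
    for g in range(len(cells[0])):
        a = [c[g] for c, l in zip(cells, labels) if l == 0]
        b = [c[g] for c, l in zip(cells, labels) if l == 1]
        if a and b and abs(sum(a) / len(a) - sum(b) / len(b)) > cutoff:
            chosen.add(g)
    return chosen

def stability_selection(cells, labels, n_pairs=50, cutoff=2.0, threshold=0.8, seed=0):
    """Count how often each gene is selected across complementary half-samples."""
    rng = random.Random(seed)
    idx = list(range(len(cells)))
    half = len(idx) // 2
    counts = [0] * len(cells[0])
    for _ in range(n_pairs):
        rng.shuffle(idx)
        # The two halves are disjoint, i.e. a complementary pair.
        for part in (idx[:half], idx[half:2 * half]):
            picked = select_markers(
                [cells[i] for i in part], [labels[i] for i in part], cutoff
            )
            for g in picked:
                counts[g] += 1
    freq = [c / (2 * n_pairs) for c in counts]
    stable = [g for g, f in enumerate(freq) if f >= threshold]
    return stable, freq

# Simulated data: 40 cells, 5 genes; only gene 0 differs between the groups.
data_rng = random.Random(1)
labels = [0] * 20 + [1] * 20
cells = []
for label in labels:
    row = [data_rng.gauss(0.0, 0.5) for _ in range(5)]
    row[0] += 4.0 * label
    cells.append(row)

stable, freq = stability_selection(cells, labels)
```

A gene is reported as stable only if it is selected in at least the `threshold` fraction of subsamples, which is what damps spurious selections driven by a single partition of the data.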
bioinformatics 2026-05-12 v1
MurineCyto-Det: A High-Resolution Murine BALF Cytology Dataset for Leukocyte Segmentation and Detection
Le, T. X.; Tran, L.-A. T.; Farabi, D. A.; Wang, S.; Phan, A. T. Q.; Cormier, S. A.; Taada, A.; McGrew, D.; Du, Y.; Vu, L. D.
Abstract
Automated analysis of murine bronchoalveolar lavage fluid (BALF) cytology is important for preclinical respiratory research, yet progress has been limited by the lack of publicly available, well-annotated mouse BALF image datasets. We present MurineCyto-Det, a high-resolution murine BALF cytology dataset comprising 333 image tiles of size 1024 x 1024 pixels, annotated across five cytological categories with both pixel-level segmentation masks and one-to-one matched bounding boxes. The dataset contains 14,551 annotated cell instances and supports two complementary analysis tasks: morphology-oriented cell segmentation and object-level cell detection. To establish reproducible benchmark baselines, we evaluated representative segmentation and detection models. The results demonstrate the practical utility of MurineCyto-Det while highlighting realistic challenges arising from class imbalance, small object size, irregular cell morphology, and ambiguous debris-like structures. MurineCyto-Det provides a standardized resource for developing, evaluating, and comparing automated methods for murine BALF cytology analysis. The dataset is publicly available at https://doi.org/10.5281/zenodo.17608677.
bioinformatics 2026-05-12 v1
Task-Specialized Protein Language Models Decode the Sequence Grammar of Post-Translational Modification Sites
Adhikari, S.; Mondal, J.
Abstract
Post-translational modifications (PTMs) regulate protein signaling, localization, degradation, and cellular decision-making, yet the sequence determinants that distinguish modified from chemically eligible but unmodified residues remain difficult to decode at proteome scale. Here, we examine whether adapting a general protein language model to PTM-site prediction can reveal the biochemical logic underlying residue-level modification. We fine-tune ESM2, a protein language model trained on tens of millions of evolutionarily diverse protein sequences, for phosphorylation, acetylation, and ubiquitination-site prediction. To address the pronounced class imbalance inherent in proteome-wide PTM annotation, we combine parameter-efficient fine-tuning with focal-loss training. The resulting task-specialized models show that PTM recognition depends on model capacity, annotation depth, and modification chemistry: phosphorylation benefits from larger models, whereas acetylation and ubiquitination peak at intermediate scale. More importantly, the fine-tuned phosphorylation model exposes three layers of biological organization: it recovers canonical kinase-recognition motifs without kinase-label supervision, resolves pathway-level functional relationships among proteins from sequence-derived embeddings, and preserves evolutionary signatures of homologous phosphorylation sites across 200 eukaryotic species. These results establish task-specialized protein language models as interpretable instruments for probing PTM-site biochemistry, kinase specificity, functional organization, and evolutionary conservation.
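For readers unfamiliar with focal loss, the sketch below gives the standard binary form, FL = -alpha_t (1 - p_t)^gamma log(p_t). The `gamma` and `alpha` values are common defaults and are assumptions here; the abstract does not state the settings used.

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one residue.

    p is the predicted probability of the positive (modified) class,
    y is the true label (1 = modified, 0 = unmodified).
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-12), 1.0)  # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy_positive = focal_loss(0.9, 1)   # confidently correct, strongly down-weighted
hard_positive = focal_loss(0.1, 1)   # confidently wrong, dominates the loss
```

With `gamma=0` and `alpha=0.5` the expression reduces to half the usual cross-entropy, which makes clear that the `(1 - p_t) ** gamma` factor is what shifts training effort from the abundant easy negatives to the rare hard sites.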
bioinformatics 2026-05-12 v1
Identifying Context-Specific Cell-Cell Interaction Genes Without Ligand-Receptor Databases from Spatial Transcriptomics
Kim, H.; Park, B.; Jung, J.; Lee, S.; Panahandeh, S.; Kwon, S.; Li, J. J.; Madan, E.; Kim, D.; Kim, J.; Gogna, R.; Won, K. J.
Abstract
Current approaches to inferring cell-cell interactions (CCIs) are largely constrained by predefined ligand-receptor databases, particularly for low-resolution spatial transcriptomics (ST) platforms such as Visium. Due to the difficulties in accurately resolving interacting cells at coarse spatial resolution, other modes of interaction are often overlooked. Low-resolution ST data, however, can serve as an alternative to high-resolution ST, which suffers from low sensitivity, and to image-based ST, which is limited by restricted gene panels. Here, we present CellNeighborEX v2, a database-free framework that directly infers CCI-associated genes from ST data by detecting deviations between observed and expected gene expression at the spot-population level. These deviations are rigorously evaluated through a hybrid statistical framework involving permutation testing and are further refined by considering the abundance of interacting cell-type pairs. Compared with other conventional approaches relying on ligand-receptor databases, CellNeighborEX v2 can capture CCI genes from a broad spectrum of interactions, including both paracrine signaling and contact-dependent communication. Across datasets from hippocampus, liver cancer, colorectal cancer, ovarian cancer, and lymph node infection, CellNeighborEX v2 accurately recapitulated previously identified CCIs. Notably, it uniquely detected interactions absent from existing ligand-receptor databases, enabling detection of context-specific CCIs from Visium data. CellNeighborEX v2 is a tool that expands the analytical spectrum of Visium data and deepens our understanding of the molecular language of intercellular communication.
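The permutation-testing step can be illustrated generically. The sketch below is not the CellNeighborEX v2 statistic: it compares one gene's mean expression in spots flagged as containing an interacting cell-type pair against the remaining spots, and builds a null by shuffling the flags. The expression values, the flags, and the number of permutations are hypothetical.

```python
import random

def permutation_pvalue(values, is_neighbor, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in mean expression."""
    rng = random.Random(seed)

    def mean_difference(flags):
        inside = [v for v, f in zip(values, flags) if f]
        outside = [v for v, f in zip(values, flags) if not f]
        return sum(inside) / len(inside) - sum(outside) / len(outside)

    observed = mean_difference(is_neighbor)
    flags = list(is_neighbor)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(flags)  # break any link between spot type and expression
        if abs(mean_difference(flags)) >= abs(observed):
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)

# Made-up expression of one gene across 20 spots; the first 6 are heterotypic.
values = [5.1, 4.8, 5.3, 4.9, 5.0, 5.2,
          1.0, 1.2, 0.9, 1.1, 1.3, 0.8, 1.0, 1.1, 0.9, 1.2, 1.0, 1.1, 0.9, 1.0]
is_neighbor = [True] * 6 + [False] * 14

observed, p_value = permutation_pvalue(values, is_neighbor)
```

Because the null is generated from the data themselves, no ligand-receptor database or distributional assumption is needed, which is the property the abstract emphasizes.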
bioinformatics 2026-05-12 v1
misoTar: A novel approach for predicting miRNA and isomiR targets
Ripan, R. C.; Li, X.; Hu, H.
Abstract
Understanding the interactions between microRNAs/isomiRs and mRNAs has long been a major challenge in RNA biology. Although numerous computational approaches have been developed to predict these interactions, most fail to account for isomiR-mediated targeting. To address this limitation, we developed misoTar, a deep learning framework trained on more than 6.662 million positive and negative interaction pairs derived from 67 publicly available human samples across six independent studies. In five-fold cross-validation, misoTar achieved an average precision of 0.930 and a recall of 0.898. Evaluation on independent test datasets demonstrated consistently superior or comparable performance relative to existing tools, including TargetScan, Mimosa, DMISO, and TEC-miTarget. In addition, single-nucleotide mutation analyses of true positive interactions revealed the critical functional contributions of non-seed regions in microRNA/isomiR targeting. Overall, misoTar provides a robust and accurate framework for predicting microRNA/isomiR interactions while offering new biological insights into microRNA targeting mechanisms. The misoTar tool is publicly available at https://figshare.com/projects/misoTar/262723.
bioinformatics 2026-05-12 v1
Mechanisms Matter: Transportability of Cellular Perturbation Effects
Qi, S.-a.; Chapfuwa, P.
Abstract
Predicting cellular responses to genetic or chemical perturbations across biological contexts is central to drug development and disease understanding. Despite increases in data and model scale, deep learning models have not consistently outperformed simple baselines. Leveraging causal transportability theory, we show that cross-context generalization is governed by shared causal mechanisms, not merely distributional similarity. To enable controlled evaluation, we develop a causal simulator that generates realistic semi-synthetic Perturb-seq datasets with tunable mechanistic divergence, providing benchmarks with known ground-truth causal structure. Further, we adapt the Vendi diversity score to the perturbation setting as a diagnostic for mode collapse, a failure mode invisible to standard per-perturbation metrics. Extensive experiments across four deep learning models and six simple baselines on semi-synthetic and real Perturb-seq datasets reveal a cross-context generalization gap: performance under cross-context splits drops substantially, often to simple baseline levels. Notably, even on synthetic data with fully specified causal structure, no model generalized across contexts with different causal mechanisms. These results underscore the need for cross-context evaluation, diversity-aware metrics, and mechanistically grounded inductive biases.
bioinformatics 2026-05-12 v1
MucOneUp: A Simulation Framework for MUC1-VNTR Variant Benchmarking
Popp, B.; Saei, H.
Abstract
Summary: Variable number tandem repeats (VNTRs) in the MUC1 gene cause autosomal dominant tubulointerstitial kidney disease when disrupted by frameshift variants, but the GC-rich 60-bp repeat structure (20-125 copies) challenges variant detection. While tools like VNtyper enable MUC1 variant calling, no gold-standard benchmarking datasets exist for systematic performance evaluation. We present MucOneUp, a specialized simulation framework for generating MUC1-VNTR reference sequences with targeted variants and platform-specific sequencing reads (Illumina, Oxford Nanopore, PacBio). MucOneUp employs Markov chain-based repeat generation, supports diploid simulation with customizable variant placement, and includes additional analysis modules for SNaPshot assay simulation and exploratory frameshift analysis. We validate MucOneUp through a multi-variant, cross-platform benchmark of six tool-platform combinations using 13 distinct frameshift variants and investigate VNTR length effects on detection.
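Markov chain-based repeat generation can be sketched as a walk over repeat-unit labels. The code below is not MucOneUp's implementation: the unit labels "X", "A", and "B", the transition probabilities, and the function name are hypothetical, and no 60-bp unit sequences are assigned. Only the copy-number range (20 to 125) comes from the abstract.

```python
import random

# Hypothetical first-order transition probabilities between repeat-unit labels.
TRANSITIONS = {
    "X": {"X": 0.5, "A": 0.3, "B": 0.2},
    "A": {"X": 0.6, "B": 0.4},
    "B": {"X": 0.7, "A": 0.3},
}

def simulate_repeat_chain(n_units, transitions, start="X", seed=0):
    """Generate a chain of repeat-unit labels of the requested length."""
    rng = random.Random(seed)
    chain = [start]
    while len(chain) < n_units:
        options = transitions[chain[-1]]
        nxt = rng.choices(list(options), weights=list(options.values()))[0]
        chain.append(nxt)
    return chain

# One haplotype with 40 repeat units, within the 20-125 range cited above.
haplotype = simulate_repeat_chain(40, TRANSITIONS, seed=42)
```

A diploid simulation would call the function twice with different seeds and lengths; a variant could then be placed by editing a chosen unit of one haplotype.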
bioinformatics 2026-05-12 v1
Amino Acid Insertion Energetics in a POPC Bilayer from Unbiased Molecular Dynamics
Bories, S. C. A.; Lague, P.
Abstract
Membrane association is governed by the thermodynamics of amino acid partitioning between water and the lipid bilayer. Here, we quantified amino acid side-chain insertion energetics in a 1-palmitoyl-2-oleoyl-sn-glycero-3-phosphocholine (POPC) bilayer using unbiased molecular dynamics simulations. Equilibrium depth distributions of 28 analogs, including multiple protonation states, were converted into potentials of mean force (PMFs) by Boltzmann inversion. The resulting PMFs reproduced the main features of bilayer partitioning. Hydrophobic analogs favored the bilayer core, aromatic analogs were stabilized in interfacial regions, and polar or charged analogs remained unfavorable in the hydrophobic interior. A diglycine analog representing the peptide backbone behaved similarly to uncharged polar residues. Depth-dependent pKa profiles and orientational analyses further showed how protonation equilibria and aromatic-ring alignment influence insertion energetics. Agreement with experimental hydrophobicity scales supports the robustness of the approach. These results provide an efficient and internally consistent framework for characterizing bilayer insertion energetics and establish a reference for future studies in more complex lipid environments.
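Boltzmann inversion converts an equilibrium depth distribution into a free-energy profile via PMF(z) = -k_B T ln[P(z) / P(z_ref)]. The sketch below applies that relation to a histogram of depth counts; the counts, the temperature, the choice of reference bin, and the kcal/mol units are assumptions for illustration and are not taken from the paper.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol*K)

def boltzmann_inversion(counts, temperature=303.15, ref_bin=0):
    """Convert histogram counts along the bilayer normal into a PMF.

    The PMF is reported relative to ref_bin (for example, bulk water).
    Empty bins are assigned an infinite free energy.
    """
    ref = counts[ref_bin]
    pmf = []
    for c in counts:
        if c > 0:
            pmf.append(-K_B * temperature * math.log(c / ref))
        else:
            pmf.append(math.inf)
    return pmf

# Made-up counts from bulk water (bin 0) towards the bilayer center (last bin).
counts = [1000, 1500, 4000, 2500, 0]
pmf = boltzmann_inversion(counts)
```

Bins visited more often than the reference get a negative PMF (favorable), and bins that are never sampled cannot be assigned a finite value, which is why the approach depends on adequate unbiased sampling at every depth.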
bioinformatics 2026-05-12 v1
A novel vaccine and drug targets for global eradication of bovine tuberculosis: Holistic frameworks for construction of a potent vaccine and identification of drug targets
Pawar, P.; Samarasinghe, S.; Kulasiri, D.
Abstract
Bovine tuberculosis (TB), caused by Mycobacterium bovis, has become a global concern over the last two decades. Bovine TB primarily affects cattle, but other domestic livestock are also affected, and it is more common in less developed and developing countries. The significant loss of livestock leads to trade restrictions and economic crises. The zoonotic potential of bovine TB raises public health concerns. Currently, no effective treatment is available, and animals are usually slaughtered to reduce the disease burden in the environment. Antibiotic therapy can be used on animals living in captivity, but it is not reliable for herd or free-grazing animals. The BCG vaccine is another available option against the disease, but it shows limited efficacy in cattle. The prevention of bovine TB is a long-term goal that can only be accomplished by developing a more effective vaccine than BCG and designing new drugs. In this research, we propose therapeutic drug targets and a vaccine for bovine TB. The conceptual vaccine framework developed in this study uses a number of bioinformatics approaches to identify potential vaccine candidates and construct an in-silico epitope-based vaccine. Our holistic framework identified potential therapeutic candidates by directly analysing the proteome of TB bacterial strains. Specifically, we performed a comparative proteomic analysis of 11 Mycobacterium bovis strains to cover the diversity and identify conserved proteins among those strains for developing the bovine TB vaccine. An extensive reverse vaccinology and immunoinformatics analysis provided 26 highly immunogenic, non-toxic and non-allergenic epitopes (8 CTL, 2 HTL and 16 B-cell epitopes) for Mycobacterium bovis, which were used to construct the three-dimensional structure of the TB vaccine. The constructed epitope-based vaccine showed a potent interaction inside the host, thus generating efficient cell-mediated and humoral immune responses.
Next, a framework based on a novel subtractive proteomic approach was developed for identifying bovine TB drug targets. We performed this approach on the 11 Mycobacterium bovis strains and identified nine drug targets that are conserved, essential, antigenic and have unique metabolic pathways in Mycobacterium bovis. These drug targets could further help investigate therapeutic drugs for the treatment of bovine TB. Several bioinformatics prediction tools were used together to ensure checks and balances, aiming to reduce the chance of errors and provide accurate results. The vaccine and drug targets developed in this study can be tested experimentally with confidence for further validation as therapeutics with the potential to eradicate bovine TB globally. The strategies implemented in the study are generic and can be used for other zoonotic infectious diseases. This study would be a game changer in the field of bovine tuberculosis treatment.
bioinformatics 2026-05-12 v1
BatchVaria: a variance-aware framework for evaluating batch correction in high-dimensional omics data
Moir, N.; Sherwood, K.; Simpson, I.
Abstract
Batch effects and other unwanted technical sources of variation remain a persistent challenge in the integrative analysis of high-dimensional -omics data. Although established methods such as ComBat effectively mitigate batch-associated signal, their impact on biologically meaningful variation is frequently evaluated in an ad hoc and non-quantitative manner. This is particularly problematic in heterogeneous disease contexts, such as breast cancer transcriptomics, where technical and biological sources of variation may be partially confounded. We present BatchVaria, an R package that implements a variance-aware framework for batch correction and post-adjustment evaluation. BatchVaria integrates variance component modelling, batch adjustment, and systematic re-profiling within a unified analysis container, enabling iterative quantification and reassessment of technical and biological variance contributions while preserving analytical provenance. By supporting multiple variance profiling engines and structured storage of intermediate results, BatchVaria facilitates transparent and reproducible evaluation of batch correction strategies. We demonstrate the utility of BatchVaria using a publicly available breast cancer transcriptomic dataset with known covariate-driven structure, illustrating how iterative variance profiling can guide responsible batch correction without erosion of subtype-associated biological signal.
bioinformatics 2026-05-12 v1
CausalKnowledgeTrace: A Novel Computational Framework for Automated Literature-Based Causal Graph Construction and Evidence-Based Variable Selection in Biomedical Research
Upadhayaya, R.; Pradhan, M. M.; Metzger, V. T.; Malec, S. A.
Abstract
Background: Variable selection for causal inference from observational biomedical data is challenging, as overlooking confounders or conditioning on colliders leads to biased estimates. While vast causal knowledge exists in biomedical literature, manually extracting this information for principled variable selection is impractical at scale. Methods: We developed CausalKnowledgeTrace, a Python-based computational framework with a Django web interface that systematically leverages structured causal knowledge from the Semantic MEDLINE Database (SemMedDB) to inform variable selection in causal studies. The system implements a six-stage analysis pipeline using NetworkX for graph operations, including graph parsing, basic analysis, comprehensive cycle detection, systematic generic node removal, post-removal analysis, and formal causal inference with bias detection. Results: Analysis of the hypertension and Alzheimer's relationship across three degree neighborhoods (1 to 3) demonstrated systematic scaling of causal complexity: 361 to 866 variables, 429 to 1,442 relationships, with graph densities of 0.0033 to 0.0019. The analysis revealed complex cyclic structures with 54 to 606 baseline cycles across degree levels. Processing times ranged from 0.3 to 1.0 seconds for all three degrees, demonstrating computational efficiency for complex biomedical networks. Key confounders identified across all degrees included inflammation, diabetes, insulin resistance, obesity, and ischemia. In the third-degree graph, the pipeline structurally identified 39 confounders, 11 mediators, and 3 colliders. Among the key identified confounders and mediators (including obesity, oxidative stress, ischemia, and vascular diseases), all were found to have strong supporting evidence in established epidemiological and pathophysiological literature.
Conclusions: CausalKnowledgeTrace provides a scalable, evidence-based approach to causal graph construction that systematically identifies confounders and bias structures often missed by conventional approaches. The Python-Django architecture enables both standalone analysis and integration into larger computational workflows, representing a significant advance in computational support for causal inference in biomedical research.
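The reported graph densities are consistent with the directed-graph formula m / (n (n - 1)), and the structural roles can be illustrated on a toy graph. The sketch below uses plain Python rather than NetworkX; the node names, the edge list, and the direct-neighbor definitions of confounder, mediator, and collider are simplifying assumptions, since the abstract does not state the exact criteria the pipeline applies.

```python
def directed_density(n_nodes, n_edges):
    """Density of a directed graph without self-loops: m / (n * (n - 1))."""
    return n_edges / (n_nodes * (n_nodes - 1))

# Node and edge counts reported for the degree-1 and degree-3 neighborhoods.
density_degree_1 = round(directed_density(361, 429), 4)
density_degree_3 = round(directed_density(866, 1442), 4)

def classify_roles(edges, exposure, outcome):
    """Classify nodes by their direct edges to the exposure and the outcome."""
    def parents(node):
        return {a for a, b in edges if b == node}

    def children(node):
        return {b for a, b in edges if a == node}

    return {
        "confounders": parents(exposure) & parents(outcome),   # C -> X and C -> Y
        "mediators": children(exposure) & parents(outcome),    # X -> M -> Y
        "colliders": children(exposure) & children(outcome),   # X -> K <- Y
    }

# Hypothetical toy graph: X is the exposure, Y the outcome.
edges = [("C", "X"), ("C", "Y"), ("X", "M"), ("M", "Y"), ("X", "K"), ("Y", "K")]
roles = classify_roles(edges, exposure="X", outcome="Y")
```

Adjusting for the confounder set while leaving mediators and colliders unadjusted is the variable-selection decision this kind of classification is meant to support.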
bioinformatics 2026-05-12 v1
WasteFams: A database of protein families from global wastewater microbiomes
Galaras, A.; Chasapi, I. N.; Aplakidou, E.; Chasapi, M. N.; Lamari, E.; Diplari, S.; Georgakopoulos-Soares, I.; Karatzas, E.; Baltoumas, F. A.; Kyrpides, N.; Pavlopoulos, G.
Abstract
Wastewater surveillance has emerged as a critical tool for global epidemiology, yet the functional diversity of wastewater microbiomes remains poorly characterized at the protein level. Here, we present WasteFams, the first comprehensive database dedicated to the systematic exploration of protein families in wastewater metagenomic and metatranscriptomic studies worldwide. Integrating data from 580 metagenomes, 132 metatranscriptomes, and 1,709 reference genomes, WasteFams catalogs 3,887 non-redundant protein families (containing ≥100 members) derived from over 105 million predicted proteins. Each protein family is enriched with multi-layered annotations, including AlphaFold3 structural predictions, taxonomic classifications, and biome-specific metadata. To further expand their functional annotation, we integrated deep genomic context analysis to link protein families to Mobile Genetic Elements (MGEs), Biosynthetic Gene Clusters (BGCs), Antibiotic Resistance Genes (ARGs), and CRISPR elements. Accessible through the EnvoFams portal, WasteFams provides a user-friendly interface featuring advanced search capabilities, sequence and structural similarity tools, and interactive visualization modules. As global initiatives increasingly leverage wastewater for public health and environmental insights, WasteFams can serve as a critical resource for discovering novel microbial functions, monitoring resistance mechanisms, and exploring the biotechnological potential of secondary metabolites within wastewater-engineered ecosystems.
bioinformatics 2026-05-12 v1
Carbohydrate active enzymes in Pectobacteriaceae: coevolving enzyme sets and host adaptation
Hobbs, E. E. M.; Gloster, T. M.; Pritchard, L.
Abstract
Many phytopathogenic bacteria have evolved large, diverse arsenals of Carbohydrate Active enZymes (CAZymes) that liberate simple sugars, and thus nutrition and energy, from the complex lignocellulosic matrices of their plant hosts. The CAZyme arsenals of these phytopathogens are expected to be influenced by and adapted to the cell wall composition of their plant hosts. The solutions these organisms have reached for the problem of degrading plant material may help us understand their host ranges and present a rich source of novel CAZymes for exploitation in industrial bioprocessing. Here we catalogue and analyse CAZyme complements (CAZomes) of publicly-available Enterobacterial phytopathogen genomes, including those of the economically significant and widely-studied Pectobacterium and Dickeya genera. These comprise a broad diversity of CAZymes, providing insight into host adaptation and a resource for bioprospection of industrially-relevant enzymes. We find evidence supporting coevolution of sets of CAZymes specific to bacterial genus and species and, notably, CAZymes associated with pathogen preference for either woody or soft plant tissue, suggesting adaptation of CAZomes to host plant cell wall composition.
bioinformatics 2026-05-12 v1
Dogcatcher2: Improved statistical detection of transcriptional readthrough and repetitive element analysis across sequencing platforms
Melnick, M.; Link, C. D.
Abstract
Downstream of Gene (DoG) transcription occurs when RNA polymerase II fails to terminate normally at the transcription end site, resulting in extended transcription downstream of the gene. This is a widespread phenomenon linked to cellular stress, cancer and neurodegeneration. Existing tools for DoG detection from short-read RNA-seq rely on absolute coverage thresholds and sliding window approaches that are sensitive to sequencing depth and expression level. Here we present Dogcatcher2, which applies improved statistical detection methods to gene body-normalized coverage profiles. Using long-read ground truth across multiple datasets, we show that Dogcatcher2 outperforms existing methods in both detection sensitivity and boundary accuracy while maintaining high precision even at low sequencing depths. Dogcatcher2 further improves detection on pseudobulk scRNA-seq and snRNA-seq data. Analysis of DoG regions in human reveals specific enrichment for Alu elements including inverted Alu pairs capable of forming double-stranded RNA, with transposable elements within DoG regions showing elevated expression, connecting readthrough transcription to dsRNA generation and innate immune signaling.
bioinformatics2026-05-12v1Engineering a pacemaker-driven human mini-heart guided by spatial multi-omics of sinoatrial node development
Zhu, J.; Zhang, Z.; Gregorio, R. D.; Chang, K.; Dong, X.; Banerjee, K.; Liu, K.; Rea-Moreno, M.; Kizilbash, M.; Alonso, A.; Liu, J.; Tsai, S.; Chen, Y.-W.; Evans, T.; Chen, S.Abstract
The human sinoatrial node (SAN) functions as the primary pacemaker of the heart and coordinates the hierarchical electrical activity that drives cardiac contraction. However, experimental systems capable of reconstructing pacemaker-driven cardiac organization in human tissues remain limited. Here we integrate spatial multi-omics of the human fetal SAN with stem cell engineering to generate pacemaker organoids (Sinoids) and assemble them into a pacemaker-driven human mini-heart composed of sinoatrial, atrial and ventricular cardiac modules. High-resolution spatial transcriptomics and single-nucleus multiomic analyses of human fetal SAN tissues identify regulatory pathways guiding pacemaker lineage specification, which we leverage to engineer human pluripotent stem cell-derived SAN organoids with robust pacemaker identity and electrophysiological activity. When integrated with atrial and ventricular cardioids, Sinoids initiate and coordinate electrical activation across assembled cardiac tissues, establishing directional propagation of electrophysiological signals within structured mini-heart organoids. Combining AI-guided perturbation modeling with functional validation further identifies conserved regulatory pathways controlling pacemaker specification and regionalization, including YAP-TEAD and NRG-ERBB signaling. Together, these results establish a multiomic-guided strategy for engineering pacemaker tissues and reconstructing cardiac conduction hierarchy in vitro. The pacemaker-driven mini-heart platform provides a modular human cardiac system for studying pacemaker biology, modeling arrhythmia mechanisms and enabling electrophysiological drug discovery.
bioinformatics2026-05-12v1Figra: A WebAssembly-based Excel Add-in for publication-quality scientific visualization with ggplot2
Sato, Y.Abstract
Data visualization is a critical step in scientific communication. Most researchers rely on subscription-based software for this purpose, which requires ongoing licensing costs. Free alternatives such as R and Python offer publication-quality output but demand programming expertise that many researchers do not possess. Artificial intelligence tools can assist with figure generation but remain frustrating when users wish to fine-tune specific visual parameters to their preference. Meanwhile, Microsoft Excel, the most widely used tool for scientific data storage and management, offers limited visualization capabilities, forcing researchers to transfer their data to external software as an extra step before creating figures. Here we present Figra, a free Excel Office Add-in that eliminates this extra step by enabling publication-quality ggplot2-based figure generation directly within Excel, with simple and direct control over every visual option. Figra leverages WebAssembly technology (webR) to execute R code entirely within the browser, requiring no R installation, no subscription, and no server connection. The add-in supports over 20 chart types spanning distribution plots, grouped comparisons, time-series, scatter plots, and specialized curve-fitting analyses. For applicable chart types, Figra performs automated or manual statistical analysis supporting both paired and unpaired designs across two or more groups. Additionally, Figra exports simplified, executable R code that reproduces the displayed figure, serving as an educational tool for researchers wishing to learn ggplot2. Figra is open-source and freely available at https://h20gg702.github.io/figra-pages/index.html while the source code is provided at https://github.com/h20gg702/Figra.
bioinformatics2026-05-12v1Receptor-Anchored Olfaction Representation through Perception-Consistent Metric Learning
Tian, C.; Wang, J.; Hou, J.; Liu, W.; Luo, Y.; Wang, Y.; Yang, L.; Lin, W.Abstract
Olfactory perception arises from distributed activation across hundreds of olfactory receptors (ORs), yet our understanding of this landscape remains constrained by the scarcity of OR affinity measurements. Here, we present Receptor-Anchored Metric Supervision (RAMS), a transfer learning framework using perceptual consistency as weak supervision to predict OR activation spectra. RAMS fine-tunes a pretrained drug-target affinity model by imposing constraints derived from olfactory perception, where similar odorants are encouraged to exhibit similar OR activations. It transfers protein-ligand interaction knowledge learned from large-scale pharmacological data into the olfactory domain and reshapes it toward OR activation prediction. Evaluations against experimental measurements show that RAMS improves the accuracy of receptor-spectrum prediction and yields biologically plausible activation patterns. The predicted spectra show concordance between receptor discriminative capacity and expression level, and highlight the understudied OR52 family as a potential contributor to primary odor recognition. Together, RAMS provides a scalable framework for reconstructing receptor-anchored olfactory representations.
bioinformatics2026-05-12v1Culsma: A Formal Language for Laboratory Protocols
Chen, Y.; Sun, M.; Tadepally, L.; Wang, J.; Barcenilla, H.; Gonzalez, L.; Brodin, P.Abstract
The application of artificial intelligence to biomedical research increasingly depends on iterative cycles in which AI systems analyze experimental data, propose follow-up conditions, and drive automated execution at scale, a paradigm central to Bio-AI and autonomous laboratory science. For such cycles to operate, laboratory protocols must be expressed in a form that is simultaneously human-readable and machine-executable. Natural-language descriptions, the current standard in laboratory practice, do not satisfy this dual requirement. We present Culsma, a formal language and execution framework that elevates laboratory protocols from informal prose to semantically explicit workflow programs that can be analyzed, validated, executed, and transferred across settings. The same protocol can be read and verified by a bench scientist, and parsed, validated, and executed by an automated pipeline without re-translation. We demonstrate an end-to-end implementation providing concrete evidence of practical viability.
bioinformatics2026-05-12v1Dual-view Guided Context-aware Network for Automated Bone Lesion Segmentation and Quantification in Whole-body SPECT
Chen, W.; Yang, X.; Lu, J.; Miao, M.; Huang, Y.; Zheng, S.; Zhang, C.; Xie, L.; Zhang, Y.Abstract
Whole-body SPECT bone scintigraphy reflects skeletal metabolic activity throughout the body and plays an indispensable role in the screening, treatment evaluation, and prognostic assessment of bone metastases in tumors. However, the automatic detection and segmentation of hypermetabolic bone lesions remain challenging due to low contrast, limited spatial resolution, and complex lesion distributions. In this study, we proposed Bone-Segnet, a dual-view guided automatic segmentation network for hypermetabolic bone lesions that integrated multi-scale feature modeling, global context modeling, and view-conditioned modulation. Pixel-level annotated anterior and posterior whole-body bone scintigraphy images were used for model training and prediction. The proposed network enhanced the recognition of low-contrast and small-scale lesions through small-lesion enhancement and multi-scale contextual modeling. A Transformer module was further introduced to strengthen global feature representation, while cross-view collaborative modeling was achieved by incorporating the complementary characteristics of anterior and posterior imaging. Experimental results demonstrated that the proposed method outperformed existing approaches across multiple evaluation metrics, with the Dice score improving from 0.7440 to 0.8750, indicating a substantial improvement in segmentation performance. Further quantitative analysis based on the segmentation results revealed significant differences among disease types in lesion count, pixel burden, and spatial distribution patterns, reflecting the heterogeneity of disease-related skeletal metabolic activity. Overall, the proposed method improved automatic lesion segmentation performance and enabled quantitative analysis of lesion burden and spatial distribution patterns, providing objective data support for the assessment of related diseases.
bioinformatics2026-05-12v1SigBridgeR: An Integrative Framework and Toolkit for Comprehensive Screening and Benchmarking of Phenotype-Associated Cell Subpopulations in Single-Cell Transcriptomics
Yang, Y.; Yan, Z.; Qian, H.; Du, L.; Wang, C.; Peng, Y.; Bu, X.; Zhou, J.-G.; Wang, S.Abstract
Single-cell RNA sequencing has revolutionized our understanding of cellular heterogeneity, yet linking specific cell subpopulations to clinically relevant phenotypes remains a persistent challenge. Although multiple computational methods have been developed to bridge this gap, they are typically implemented as standalone packages with heterogeneous preprocessing pipelines, incompatible parameter conventions, and divergent output formats, thereby hindering rigorous cross-method benchmarking and reproducible multi-method workflows. Here, we present SigBridgeR, an extensible R framework and comprehensive toolkit that currently unifies eight state-of-the-art phenotype-associated cell screening algorithms within consistent workflows. We conducted a systematic benchmarking study across four cancer types (HER2-positive breast cancer, triple-negative breast cancer, lung adenocarcinoma, and ovarian cancer) using both binary phenotypes and patient survival endpoints. Our evaluation incorporated positive and negative control assessments based on differentially expressed genes and randomly selected marker panels, alongside quantitative accuracy comparisons using ground-truth cell labels. Building upon these insights, SigBridgeR provides standardized preprocessing for scRNA-seq and bulk transcriptomic data, unified algorithmic interfaces through a registry-based architecture, ensemble analysis via weighted voting, and comprehensive visualization utilities for multi-method comparison. By lowering technical barriers and promoting methodological standardization, SigBridgeR facilitates reliable discovery of phenotype-relevant cell subpopulations and enhances the translational potential of single-cell omics research.
bioinformatics2026-05-12v1Temporal-deviation-driven community detection uncovers early-warning signals for critical transitions in complex diseases
Wang, L.; Xu, M.; Yan, H.; Zheng, Y.; Feng, S.; Zhang, Y.; Li, C.; Qiu, D.; Hu, B.; Wan, X.; Zhang, F.Abstract
Early detection of critical transitions in complex diseases is crucial for timely clinical intervention. However, because patients often provide only a single snapshot, identifying sample-specific early-warning signals (EWS) from a dynamical evolution perspective remains challenging, a difficulty compounded by high-dimensional noise amplification. Here, we present TD-COM, a framework for detecting personalized EWS of critical transitions via single-sample community detection. By constructing a temporal perturbation map STDN, TD-COM captures latent dynamical perturbations inferred from static individual profiles. Synergizing these temporal-deviation signals with static topological features, TD-COM implements a multi-level node filtering strategy during community detection, effectively suppressing single-sample noise. Validated on hour-scale, multi-year, and multi-decade transcriptomic data, TD-COM robustly detects critical states preceding clinical deterioration and uncovers their underlying molecular mechanisms. Comparative experiments demonstrate that TD-COM outperforms existing methods in accuracy and topological robustness. Thus, TD-COM provides a generalizable framework for personalized early warning of complex diseases, particularly when longitudinal sampling is infeasible.
bioinformatics2026-05-12v1BAT: an integrated pipeline for gene tree construction, annotation, and functional inference
Sheppard, B. D.; Behnken, B.; Steinbrenner, A.Abstract
Gene family functional exploration often requires analyzing motifs, domains, and associated datasets (e.g. gene expression) in the phylogenetic context of a gene tree. As genomic resources become more abundant, local pipelines are needed to analyze gene families of interest with project-specific resources. Here we present BLAST-Align-Tree (BAT), a bioinformatic pipeline for automated gene family phylogeny construction and annotation to enable gene tree exploration. BAT combines a BLAST search of local genome databases with a robust and flexible gene tree construction pipeline that enables multiple modes of annotation. Output visualizations display experimental datasets, custom regex specified amino acid motifs, and protein HMM domain annotations. For flexibility, BAT runs locally and is independent of pre-existing databases, allowing the easy incorporation of custom genomes and datasets. Three primary case studies described here demonstrate the utility of BAT for inferring the function of homologs and orthologs within characterized gene families. BAT is suitable for fine scale phylogenomic analysis of gene families across the tree of life, and default genomes available on installation span model eukaryotes.
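As a rough illustration of what a regex-specified amino acid motif annotation could look like, the sketch below scans protein sequences for a pattern and reports 1-based match positions. It is not BAT's code or interface: the function name `find_motifs`, the toy sequences, and the choice of the N-glycosylation sequon pattern `N[^P][ST][^P]` are all assumptions made for illustration.

```python
import re


def find_motifs(sequences, pattern):
    """Return {name: [(1-based start, matched text), ...]} for each sequence.

    Matches are non-overlapping, as produced by re.finditer.
    """
    regex = re.compile(pattern)
    return {
        name: [(m.start() + 1, m.group()) for m in regex.finditer(seq)]
        for name, seq in sequences.items()
    }


# Hypothetical toy sequences; real input would come from the aligned homologs.
toy_sequences = {
    "geneA": "MKNGSAPLNPTQ",
    "geneB": "MNKTLLNVSA",
}

# N-glycosylation sequon: N, any residue except P, S or T, any residue except P.
hits = find_motifs(toy_sequences, r"N[^P][ST][^P]")
for name, matches in hits.items():
    print(name, matches)
```

Positions reported this way could then be drawn alongside the tips of a gene tree; overlapping motif occurrences would need a lookahead pattern instead.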
bioinformatics2026-05-12v1Generative Chemistry Platform for Small Molecules Targeting RNA: A Case Study for Chemical Optimization
Allen, T. E. H.; Bonnet, M.; Khan, R. T.Abstract
We introduce the Serna Bio GenAI platform, a generative chemistry and multiparametric optimization platform for the design of RNA-targeting small molecules. Targeting RNA with small molecules has proven historically challenging but offers notable potential upsides, including access to unique mechanisms of action and the ability to target otherwise untargetable genes. We consider a major challenge here to be designing chemistry specific to RNA-targeting. Molecular design is a valuable application of AI in drug discovery, but many publicly available models use training data focused on protein-targeting - the modality best historically explored in drug discovery. We showcase the difference and value in building a specifically RNA-targeting platform, comparing its performance to state-of-the-art public chemical generators and experimentally validating its chemical designs in comparison to chemistry designed by a human expert.
bioinformatics2026-05-12v1Spurious correlation inflates performance in single-cell perturbation prediction
Nicol, P. B.; Shivakumar, S.; Irizarry, R.Abstract
The increasing number of computational methods designed to predict the effects of genetic perturbations on cellular gene expression profiles has led to a need for rigorous evaluation metrics. Recent benchmarking studies rely on correlation or cosine similarity of differential expression relative to a shared population of control cells. We show that these metrics are systematically inflated by statistical bias induced by reusing the same control population to define both quantities being compared. As a result, even non-informative methods can appear to perform well, particularly in datasets with limited numbers of control cells. Reanalysis of published datasets using a simple control-splitting procedure that removes this bias leads to a substantial reduction in performance previously attributed to biological signal.
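The bias described here can be reproduced in a small simulation. The sketch below is an illustration under stated assumptions, not the authors' procedure: the true perturbation effect is set to zero, the "prediction" is an independent noisy estimate of the unperturbed profile, noise is unit-variance Gaussian, and the control-splitting step simply uses one half of the control cells for each side of the comparison. All function names and parameter values are hypothetical.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def mean_profile(rng, baseline, n_cells):
    """Per-gene mean over n_cells simulated cells (unit-variance Gaussian noise)."""
    return [
        statistics.fmean([rng.gauss(mu, 1.0) for _ in range(n_cells)])
        for mu in baseline
    ]


def simulate(n_genes=500, n_control=10, n_pert=100, seed=0):
    """Return (shared_control_corr, split_control_corr) when the true effect is zero."""
    assert n_control % 2 == 0, "sketch assumes an even number of control cells"
    rng = random.Random(seed)
    baseline = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
    half = n_control // 2
    ctrl_a = mean_profile(rng, baseline, half)
    ctrl_b = mean_profile(rng, baseline, half)
    # Mean over all control cells (the two halves have equal size).
    ctrl_all = [(a + b) / 2 for a, b in zip(ctrl_a, ctrl_b)]
    observed = mean_profile(rng, baseline, n_pert)   # perturbed cells, no real effect
    predicted = mean_profile(rng, baseline, n_pert)  # non-informative prediction
    # Both deltas subtract the same noisy control mean, so they share its noise.
    shared = pearson(
        [p - c for p, c in zip(predicted, ctrl_all)],
        [o - c for o, c in zip(observed, ctrl_all)],
    )
    # Independent control halves remove the shared noise term.
    split = pearson(
        [p - c for p, c in zip(predicted, ctrl_a)],
        [o - c for o, c in zip(observed, ctrl_b)],
    )
    return shared, split


shared_corr, split_corr = simulate()
print(f"shared controls: {shared_corr:.2f}, split controls: {split_corr:.2f}")
```

With few control cells, the noise in the control mean dominates both deltas, so the shared-control correlation should come out high even though nothing is being predicted, while the split-control correlation should sit near zero.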
bioinformatics2026-05-12v1CardioSafe: Multi-task prediction of cardiac ion channel activity with reverse-leak audited benchmarking
Jovanovic, M.; Weidener, L. S.; Brkic, M.; Ulgac, E.; Meduri, A.Abstract
Drug-induced inhibition of the hERG potassium channel is the leading cause of cardiac safety-related drug attrition, but the Comprehensive in Vitro Proarrhythmia Assay (CiPA) framework requires activity data on multiple cardiac ion channels to assess proarrhythmic risk. We present CardioSafe, a three-branch multi-task neural network with cross-attention fusion that integrates chemical fingerprints, ChemBERTa embeddings, and predicted L1000 transcriptomic features to predict blocker status and potency for hERG, Nav1.5, and Cav1.2, with an exploratory IKs head. CardioSafe was trained on the largest publicly reported multi-channel cardiac ion channel dataset, combining ChEMBL 36 with the hERGCentral database (331127 hERG, 3160 Nav1.5, 1138 Cav1.2, and 115 IKs compounds), curated under a pharmacology-aware policy that retains censored measurements and inhibition-percentage votes. Under Tanimoto-similarity-controlled splits, CardioSafe outperforms the leading published comparators (CToxPred2 and CardioGenAI) on the data-rich hERG head; on the smaller Nav1.5 and Cav1.2 heads the standard evaluation is statistically inconclusive. A reverse-leak audit revealed that 22% of Nav1.5 and 21% of Cav1.2 test compounds were present in published comparators' training data (92% as exact compound matches); after removing these contaminated compounds, CardioSafe's lead on Nav1.5 and Cav1.2 also reaches statistical significance, demonstrating that prior cross-publication benchmarks for these channels were inflated by training-data overlap.
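The idea of auditing a test set against another model's training data can be sketched as follows. This is a minimal illustration, not CardioSafe's pipeline: fingerprints are represented as plain sets of on-bit indices, the function names `tanimoto` and `leak_audit` are hypothetical, and a threshold of 1.0 only flags identical fingerprints, which is a proxy for (not proof of) an exact compound match.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def leak_audit(test_fps, comparator_train_fps, threshold=1.0):
    """Flag test compounds whose best match in a comparator's training set
    reaches the similarity threshold; return (flagged names, flagged fraction)."""
    flagged = [
        name
        for name, fp in test_fps.items()
        if any(tanimoto(fp, other) >= threshold for other in comparator_train_fps)
    ]
    return flagged, len(flagged) / len(test_fps)


# Hypothetical toy fingerprints for illustration only.
toy_test = {
    "cmpd_a": {1, 2, 3},
    "cmpd_b": {4, 5, 6},
    "cmpd_c": {1, 2, 7},
    "cmpd_d": {8, 9},
}
toy_comparator_train = [{1, 2, 3}, {10, 11}]

exact_flagged, exact_rate = leak_audit(toy_test, toy_comparator_train)
near_flagged, near_rate = leak_audit(toy_test, toy_comparator_train, threshold=0.5)
print(exact_flagged, exact_rate)
print(near_flagged, near_rate)
```

Removing the flagged compounds before re-scoring all methods is what makes the comparison fair to a model that never saw them; lowering the threshold extends the audit from identical fingerprints to near-duplicates.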
bioinformatics2026-05-12v1Easydecon: Efficient Cell Type Mapping for High-Definition Spatial Transcriptomic Data
Umu, S. U.; Karlsen, V. T.; Baekkevold, E. S.; Jahnsen, F. L.; Domanska, D.Abstract
The emergence of high-resolution spatial transcriptomics platforms, such as VisiumHD, has enabled transcriptome-wide spatial profiling at near-single-cell resolution. However, existing analysis tools often lack scalability or compatibility with this new resolution, limiting their utility for multimodal cell type analysis. We present Easydecon, a lightweight and modular computational framework for spatial transcriptomics analysis using marker genes from single-cell RNA sequencing datasets. Easydecon uses a two-phase strategy, which firstly detects expression hotspots and then refines cell type assignments with similarity-based methods. We demonstrate its efficacy by resolving cell type subsets with high accuracy. Easydecon supports integration with segmentation tools and outperforms established methods in speed, usability and cell type recovery.
bioinformatics2026-05-11v4barbieQ: An R software package for analysing barcode count data from clonal tracking experiments
Fei, L.; Maksimovic, J.; Oshlack, A.Abstract
Motivation: A clone encompasses a progenitor cell and its progeny cells. Tracking clonal composition as cells differentiate or evolve is useful in many fields. Various single-cell lineage tracing (clonal tracking) technologies use unique DNA barcodes that are passed from progenitor cells to their offspring. The barcode count for each sample indicates cell number in clones. However, analysis of barcode count data is often bespoke and relies on visualisations and heuristics. A generalized workflow for preprocessing and robust statistical analysis of barcode count data across protocols is needed. Results: We introduce barbieQ, a Bioconductor R package for analysing barcode count data across groups of samples. It provides data-driven quality control and filtering, extensive visualisations, and two statistical tests: 1) Differential barcode proportion (differences in proportions between sample groups), and 2) Differential barcode occurrence (differences in presence/absence odds between groups). Both tests handle complex experimental designs using regression models and rigorously account for sample-to-sample variability. We validated both tests on semi-simulated, real data and a case study, demonstrating that they hold their size, are sufficiently powered to detect true differences, and outperform existing approaches.
bioinformatics2026-05-11v3A Permutation-Based Framework for Evaluating Bias in Microbiome Differential Abundance Analysis
Zeng, K.; Fodor, A. A.Abstract
ABSTRACT Background: In microbiome research, differential abundance analysis aids in identifying significant differences in microbial taxa across two or more conditions. Statistical approaches used for this purpose include classical tests such as the t-test and Wilcoxon test, as well as methods designed to account for the compositional nature of microbiome data, including ALDEx2, ANCOM-BC2, and metagenomeSeq. In addition, methods originally developed for RNA sequencing data, such as DESeq2 and edgeR, have been frequently applied to microbiome studies. However, the use of these methods has been controversial. One area of concern is whether different modeling frameworks produce accurate p-values when the null hypothesis is true. Results: We evaluated seven methods across six datasets. Four permutation strategies were applied to generate data under the null hypothesis: shuffling sample names, shuffling counts within samples, shuffling counts within taxa, and fully randomizing the counts table. Methods based on the negative binomial distribution (DESeq2 and edgeR) produced p-values that were consistently smaller than expected under the null hypothesis. In contrast, methods that attempt to correct for compositionality (ALDEx2, ANCOM-BC2, and metagenomeSeq) tended to produce larger-than-expected p-values, even when only sample labels were shuffled, a permutation strategy that does not alter compositional structure. These deviations were dependent on dataset characteristics and permutation strategy, suggesting complex interactions between underlying data structure and algorithm performance. Generating data to follow the expected negative binomial distribution did not eliminate the tendency of DESeq2 and edgeR to exaggerate statistical significance. Although similar patterns were observed in RNA sequencing (RNAseq) datasets, the deviations were less pronounced than in microbiome data. 
In contrast, the classical t-test and Wilcoxon test yielded p-value distributions consistent with theoretical expectations across datasets and permutation strategies. Conclusions: These results indicate that the performance of several widely used differential abundance methods can be problematic under null conditions and may affect biological interpretation. Our findings emphasize the importance of careful method selection and highlight the robustness of simpler statistical approaches for reliable inference.
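The label-shuffling check can be illustrated with a short simulation. The sketch below is a simplified stand-in for the study's design: it uses a single simulated taxon with a real group difference, shuffles its sample labels repeatedly, and applies a Wilcoxon rank-sum test with a normal approximation (no tie correction, so it assumes continuous values). Function names and settings are hypothetical.

```python
import random
from statistics import NormalDist


def rank_sum_p(x, y):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation, assumes no ties)."""
    n1, n2 = len(x), len(y)
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    r1 = sum(rank for rank, (_, g) in enumerate(pooled, start=1) if g == 0)
    u = r1 - n1 * (n1 + 1) / 2
    mean = n1 * n2 / 2
    sd = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    z = (u - mean) / sd
    return 2 * (1 - NormalDist().cdf(abs(z)))


def shuffled_label_rejection_rate(values, labels, n_perm=2000, alpha=0.05, seed=0):
    """Fraction of label shuffles in which the test rejects at level alpha."""
    rng = random.Random(seed)
    labels = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # break any real link between samples and groups
        x = [v for v, g in zip(values, labels) if g == 0]
        y = [v for v, g in zip(values, labels) if g == 1]
        if rank_sum_p(x, y) < alpha:
            hits += 1
    return hits / n_perm


# One simulated taxon with a real difference between two groups of 20 samples.
rng = random.Random(42)
group0 = [rng.lognormvariate(0.0, 1.0) for _ in range(20)]
group1 = [rng.lognormvariate(3.0, 1.0) for _ in range(20)]
values = group0 + group1
labels = [0] * 20 + [1] * 20

p_original = rank_sum_p(group0, group1)
null_rate = shuffled_label_rejection_rate(values, labels)
print(f"original p = {p_original:.2e}, rejection rate after shuffling = {null_rate:.3f}")
```

A well-calibrated test should reject in roughly a fraction `alpha` of the shuffles; a rate well above that would correspond to p-values that are too small under the null, and a rate well below it to p-values that are too large.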
bioinformatics2026-05-11v2InterScale reveals multi-scale cellular interaction programs in spatial transcriptomics
Drummer, F. K.; Jimenez, S.; Marco, F. D.; Schaar, A. C.; Pentimalli, T. M.; Beckmann, J.; Rajewsky, N.; Theis, F. J.Abstract
Tissue homeostasis and disease emerge from cell-cell interactions operating across spatial scales: from autocrine and juxtacrine signals within micrometers to paracrine gradients coordinating responses across tissues. While these can be read out from spatial transcriptomics, existing computational methods capture either local adjacency-based or long-range dependencies, but rarely both within a single framework. We introduce InterScale, a graph-transformer approach that jointly models local and global cellular interactions from spatial transcriptomics data. By integrating a Graph Convolutional Network as a local component with a global transformer encoder, InterScale learns multi-scale representations of cellular communication. A downstream workflow enables scale-resolved interpretation of interactions from gene to tissue level. Applied to Sonic Hedgehog morphogen patterning in neural organoids, InterScale resolves spatially restricted neuronal differentiation programs and broader progenitor regulatory states along the morphogen gradient. In a human pancreatic dataset contrasting healthy and type 1 diabetic tissue, it reveals disease-associated spatial reorganization and tissue remodeling. InterScale's modular architecture supports diverse spatial transcriptomics platforms and provides a scalable, unbiased, and biologically interpretable framework for studying cellular interactions across scales.
bioinformatics2026-05-11v1miR-128 Regulates Hypertensive Vascular Remodeling via PPAR-γ
Zhoufei, F.; Han, C.; Liu, R.; Yu, L.; Chen, C.; Chen, S.; Li, L.; Chen, Q.; Cai, H.; Su, J.; Peng, F.Abstract
This study investigated the role and mechanism of microRNA-128 (miR-128) in hypertensive vascular remodeling, focusing on peroxisome proliferator-activated receptor γ (PPAR-γ) and the Toll-like receptor 4/nuclear factor-κB (TLR4/NF-κB) pathway. Ten-week-old male spontaneously hypertensive rats (SHRs) were randomly divided into renal denervation (RDN), sacubitril/valsartan, and sham groups; age-matched Wistar-Kyoto rats served as normotensive controls. Eight weeks after intervention, mesenteric arteries were collected for histological, functional, and molecular analyses. Serum miR-128 levels were measured by quantitative real-time polymerase chain reaction (qRT-PCR). Protein expression was determined by immunofluorescence, immunohistochemistry, and Western blotting. Compared with the sham group, SHRs showed elevated blood pressure, severe vascular remodeling, and impaired vasodilation, accompanied by downregulated miR-128 and activated TLR4/NF-κB signaling (all p < 0.0001). RDN markedly restored miR-128 expression, suppressed the TLR4/NF-κB pathway and pro-inflammatory cytokines (IL-1β, IL-6, TNF-α), and improved vasodilatory function (all p < 0.0001). Mechanistically, miR-128 negatively regulated the TLR4/NF-κB pathway by upregulating PPAR-γ (p < 0.05). In conclusion, RDN attenuates hypertension and vascular remodeling. miR-128 alleviates vascular inflammation and remodeling via the PPAR-γ/TLR4/NF-κB axis, representing a promising therapeutic target for hypertension.
bioinformatics2026-05-11v1eSkip2 prioritizes exon-skipping antisense oligonucleotide target regions across exon-intron contexts
Chiba, S.; Kunitake, K.; Shirakaki, S.; Haque, U. S.; Wilton-Clark, H.; Shah, M. N. A.; Leckie, J. N.; Matsui, K.; Uno-Ono, F.; Yokota, T.; Aoki, Y.; Okuno, Y.Abstract
Antisense oligonucleotides (ASOs) for exon skipping are increasingly used to correct pathogenic splicing; however, rational target-region selection remains difficult because regulatory information is distributed across exons, introns, and splice junctions. Here we present eSkip2, a framework for prioritizing exon-skipping ASO target regions from joint exon-intron sequence context. eSkip2 combines transfer learning from a genome-pretrained foundation model with joint training on ASO activity and SNV-derived splicing perturbation data and can be adapted to a target locus without experimental ASO labels. Across multi-gene benchmarks spanning canonical exons, pseudoexons, cell types, chemistries, and exonic, intronic, and exon-intron-spanning targets, eSkip2 robustly prioritized active regions; in exon-confined comparisons, it showed improved overall performance compared with applicable existing models. It also supported prospective design of dual-targeting ASOs for DMD exon 46, where top-ranked candidates were enriched for active ASOs and yielded dose-dependent dystrophin restoration. eSkip2 narrows the experimental search space across diverse target architectures.
bioinformatics2026-05-11v1Efficient and Tidy Manipulation of Annotated Matrix Data with plyxp
Landis, J. T.; Love, M. I.Abstract
Manipulating high-dimensional omics data, such as bulk or single cell gene expression counts matrices, typically requires a bioinformatics analyst to learn domain-specific functions and syntax. These matrix-centric functions and syntax can be less intuitive than working with tidy data analytic principles, as exemplified by tools such as dplyr applied to tabular data. We propose an expressive grammar for manipulating annotated matrix data, with syntax to access, modify, and append matrix data and tabular row and column metadata, including row-wise or column-wise grouped operations. This grammar defines multiple contexts and provides pronouns for specific recall and assignment within and across these contexts. The plyxp package is an implementation of this grammar for the R/Bioconductor ecosystem, with efficient abstractions for the SummarizedExperiment class. We demonstrate plyxp's efficiency compared to alternative approaches on data manipulation tasks requiring computation across contexts.
bioinformatics2026-05-11v1BioMADE: Predicting Torsades de Pointes from molecular structures through biologically informed representations
Acitores Cortina, J. M.; Schut, M. C.; Tatonetti, N. P.Abstract
Drug-induced arrhythmias, particularly Torsades de Pointes (TdP), pose a significant risk to patient safety and can sometimes have life-threatening outcomes. They remain a major concern in drug development and regulation. Machine learning (ML) has become a powerful tool for analyzing complex biological and chemical datasets, enabling researchers to identify subtle patterns that differentiate safe compounds from those likely to cause dangerous cardiac effects. However, most existing in silico approaches do not sufficiently incorporate biological elements, relying heavily on chemical and structural properties or on computationally expensive simulations. Here, we introduce BioMADE, a novel ML framework that harnesses small-molecule-protein activity profiles from publicly available datasets to predict TdP risk without requiring exhaustive mechanistic annotation. Activity data from ChEMBL were used to train individual models for each gene, which predict activity values for any given compound. A curated set of arrhythmia-relevant genes was then used to construct a latent biological embedding (BioMADE embedding) for each molecule. We validated the performance of these features in distinguishing biological elements such as ATC3 class, showing superior classification performance compared with representations such as Molformer (lacks biological information) and MACCS (limited chemical properties) (0.85 AUROC vs 0.81 and 0.73, respectively). BioMADE representations served as input to a support vector machine classifier to discriminate TdP-inducing drugs from safe compounds. BioMADE achieved an AUROC of 0.91 in internal validation, indicating strong predictive performance. Against state-of-the-art models such as ADMEThyst, BioMADE achieved an AUROC of 0.74 on ADMEThyst's validation set (vs. 0.72 for ADMEThyst). When we combined both approaches, the AUROC reached 0.77. 
These results demonstrate that BioMADE provides a scalable, biology-informed, and generalizable approach for predicting drug-induced toxicities. By integrating protein activity profiles into toxicology modeling, our framework highlights the critical role of human biology in adverse drug reaction prediction, an aspect often overshadowed by purely chemical or structural descriptors.
bioinformatics2026-05-11v1EVd3x: a source-attributed multi-omic platform for mapping extracellular vesicle cargo evidence
Ait Ouares, K.; Weerakkody, J. S.Abstract
Extracellular vesicle (EV) studies increasingly generate mixed cargo lists that include genes, proteins, miRNAs, biofluids, cell contexts, disease labels, pathways, and interaction networks. The central interpretive challenge is determining which source supports each record and what level of biological claim that source can justify. We developed EVd3x, a source-attributed multi-omic platform that integrates 28 public resources into 17 canonical Apache Parquet analysis tables and converts molecule, disease, or natural-language queries into a reusable analysis state. The same state can be inspected across linked evidence layers for EV cargo, disease aggregation, pathway enrichment, cell context, ligand-receptor evidence, miRNA target support, STRING protein-protein interactions, and exportable source rows. We evaluated EVd3x using the disease-first query: early onset Alzheimers disease with behavioral disturbance. The query resolved a PSEN1-centered state with 5 seeds, 109 nodes, and 197 edges, and exported 647 EV evidence rows, 4,053 disease rows, 2,204 pathway rows, 3,555 cell-context/communication/ligand-receptor rows, and 4,032 bridge rows. EVd3x recovered familial Alzheimer disease type 3, gamma-secretase and Notch context, nervous-system pathway terms, oligodendrocyte-to-astrocyte communication hypotheses, and PSEN1 bridges in which six queried miRNAs, including hsa-miR-107, target PSEN1 directly. These outputs are reported as separable evidence layers rather than as a composite proof score. A table-backed research assistant fine-tuned from Qwen2.5-1.5B-Instruct with QLoRA routes natural-language requests through deterministic retrieval before optional synthesis. EVd3x supports transparent EV hypothesis generation by preserving source attribution from query to export.
bioinformatics 2026-05-11 v1
Autoresearch Discovery of Interpretable Filter Rules for Antibody Binder Classification
Landajuela, M.
Abstract
Antibody design campaigns increasingly generate many candidates before only a small subset can be tested experimentally, making candidate filtering a central bottleneck. We study whether an autoresearch loop can discover better training-free filters for antibody binder classification by iteratively proposing rule variants, evaluating them under a fixed Leave-One-System-Out protocol, recording each experiment in version control, and using the results to guide the next iteration. Across 75 unique logged filter variants on seven antibody-antigen systems, the loop improves average ROC-AUC from 0.6371 for the initial baseline to 0.8060 for a compact final rule that we call the RMSD-Tuned Triad rule, an absolute gain of 0.1689 and a relative improvement of 26.5%. The discovered filter is competitive with supervised machine learning baselines and prompted LLM baselines evaluated on the same systems: it exceeds logistic regression (0.7144), feature-selected balanced logistic regression (0.7536), and GPT-4o tabular few-shot prompting (0.7640), and it comes within 0.0044 ROC-AUC of the strongest GPT-5 tabular few-shot result (0.8104). Unlike the LLM baseline, the final rule requires no prompted examples and no LLM inference once the numeric structure-derived features are available. These results show that systematic autoresearch can turn simple structural-confidence signals into compact, interpretable filters that are useful when target-specific training data are scarce.
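The headline figures in this abstract follow from simple arithmetic on the reported ROC-AUC values, which the short Python sketch below re-derives. The `roc_auc` helper is a generic rank-based definition (the probability that a positive outscores a negative), and `mean_auc` is only an assumption about how an average over held-out systems might be formed; neither is taken from the paper's code, and the toy scores are hypothetical.

```python
def roc_auc(scores, labels):
    """ROC-AUC as P(random positive scores above random negative); ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_auc(per_system):
    """Macro-average of per-system AUCs; per_system maps name -> (scores, labels)."""
    aucs = [roc_auc(scores, labels) for scores, labels in per_system.values()]
    return sum(aucs) / len(aucs)


# Values reported in the abstract.
baseline_auc, final_auc, gpt5_auc = 0.6371, 0.8060, 0.8104

absolute_gain = final_auc - baseline_auc        # reported as 0.1689
relative_gain = absolute_gain / baseline_auc    # reported as 26.5%
gap_to_gpt5 = gpt5_auc - final_auc              # reported as 0.0044

print(f"absolute gain: {absolute_gain:.4f}")
print(f"relative gain: {relative_gain:.1%}")
print(f"gap to GPT-5:  {gap_to_gpt5:.4f}")

# Hypothetical toy data: two systems, each with filter scores and binder labels.
toy = {
    "system_a": ([0.9, 0.4, 0.7, 0.2], [1, 0, 1, 0]),
    "system_b": ([0.6, 0.8, 0.3, 0.5], [1, 0, 0, 1]),
}
toy_mean = mean_auc(toy)
```

Averaging per-system AUCs rather than pooling all candidates keeps one large system from dominating the summary, which is one plausible reason to report a mean over systems.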
bioinformatics 2026-05-11 v1
ProteinFlux: accurate, rapid and scalable generative prediction of protein dynamics driven by post-translational modifications
Qian, Q.; Peng, J.; Ma, D.; Liu, K.; Cheng, Y.; Deng, Y.; Zhao, J.; Su, S.; Yao, Y.; Qu, Y.; Fu, R.; Liu, J.; Zhao, M.; Xiao, Y.; Wang, K.; Wu, Y.; Wang, Y.; Xu, Q.; Wang, J.; Hay, D. C.; Ke, Y.; Wang, Y.; Shipston, M. J.; Chi, Y.
Abstract
The function of proteins, the building blocks of life, in health and disease depends not only on their 3D-conformational states but most importantly on the dynamic transition between states controlled by a wide array of post-translational modifications (PTMs). Recent major advances have been made in our ability to predict static 3D structures; however, understanding and predicting the impact of PTMs on protein conformational dynamics remains a major question and challenge in the field. Molecular dynamics (MD) simulation remains the major computational approach for studying protein dynamics. However, the high computational cost, lack of integration of PTMs as conditioning inputs and inefficient generation of continuous protein dynamics largely preclude the study of PTM-regulated conformational dynamics and of slow conformational processes. To address this critical bottleneck, we developed ProteinFlux, a flow-matching generative framework that links PTM-conditioned conformational dynamics to evolutionary constraints encoded by PTM sites. Evolutionary information plays a critical role in capturing conformational dynamics beyond sequence identity, and PTM sites inherently encode evolutionary constraints critical to protein functional regulation. We therefore built FluxSite, a dual-modal PTM site predictor that integrates sequence evolutionary information and 3D structural features to generate a continuous conditional signal encoding conservation and functional importance for each predicted site. FluxSite achieves robust generalization across 18 PTM types and 30 disease-associated proteomes. ProteinFlux generates phosphorylation-conditioned, all-atom conformational trajectories across diverse protein fold classes, faithfully reproducing both thermodynamic properties such as free energy landscapes and kinetic features such as conformational transition pathways.
It outperforms state-of-the-art predictors while achieving inference speeds several orders of magnitude faster than traditional MD. In addition, we introduce DynaMo-phos, a benchmark dataset of phosphorylated protein MD simulations. Together, ProteinFlux, FluxSite and DynaMo-phos provide a scalable, high-throughput platform for elucidating PTM-driven conformational mechanisms, with potential applications across allosteric drug design, functional annotation of disease-associated modifications and mechanism-guided therapeutic development.
bioinformatics 2026-05-11 v1
Partner determination from protein sequences using class information with CLAPP
Gennai, L.; Caredda, F.; Rebeaud, M. E.; Pagnani, A.; De Los Rios, P.
Abstract
Protein-protein interactions underpin nearly all cellular processes, making their accurate identification a central challenge in biology. With the rapid expansion of genomic data, sequence-based computational approaches have emerged as a powerful route to infer such interactions, complementing experimental methods that are often prohibitively time- and resource-intensive. This challenge becomes particularly acute in the presence of paralogs, which arise through gene duplication and typically diversify toward distinct, though sometimes overlapping, functions. Reconstructing their interaction networks is therefore essential for understanding a wide range of biological processes. Protein paralogs within a family can often be subdivided into classes based on a range of properties, including functional, structural and architectural features. When interactions between these classes are conserved across organisms, such that sequences from one class interact exclusively with sequences from another, this information can be used to solve the paralog matching problem. We introduce here CLAPP (CLAss Pooling for Paralog matching), a method for predicting interacting paralogs by pooling interaction scores from different subclasses across organisms. We apply it to scores extracted using coevolution-based methods. Pooling scores at the class level reduces noise in the interaction scores and replaces organism-specific assignments with a single shared assignment, improving performance and substantially reducing computational cost. We apply CLAPP to bacterial systems including histidine kinases and response regulators, as well as interacting families of chaperones and co-chaperones, and recover known interaction partners.
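The idea of pooling interaction scores at the class level and then solving for a single shared assignment can be illustrated with a small Python sketch. This is not the CLAPP implementation: the data layout, the use of a mean as the pooling rule, and the brute-force search over class permutations are all assumptions made for illustration, and the scores are hypothetical rather than coevolution-derived.

```python
from collections import defaultdict
from itertools import permutations


def pool_scores(organisms):
    """Average pairwise scores per (class_a, class_b) pair across all organisms.

    Each organism is a dict with:
      'a_class', 'b_class': sequence id -> class label
      'scores': (a_id, b_id) -> interaction score
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for org in organisms:
        for (a_id, b_id), score in org["scores"].items():
            key = (org["a_class"][a_id], org["b_class"][b_id])
            totals[key] += score
            counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


def best_class_assignment(pooled, a_classes, b_classes):
    """Return the one-to-one class mapping with the highest total pooled score.

    Brute force over permutations; only sensible for a handful of classes and
    requires len(b_classes) >= len(a_classes).
    """
    best, best_score = None, float("-inf")
    for perm in permutations(b_classes, len(a_classes)):
        total = sum(pooled.get((a, b), 0.0) for a, b in zip(a_classes, perm))
        if total > best_score:
            best, best_score = dict(zip(a_classes, perm)), total
    return best, best_score


# Hypothetical toy input: two organisms, two classes per family.
organisms = [
    {
        "a_class": {"a1": "A1", "a2": "A2"},
        "b_class": {"b1": "B1", "b2": "B2"},
        "scores": {("a1", "b1"): 0.9, ("a1", "b2"): 0.2,
                   ("a2", "b1"): 0.6, ("a2", "b2"): 0.5},
    },
    {
        "a_class": {"x1": "A1", "x2": "A2"},
        "b_class": {"y1": "B1", "y2": "B2"},
        "scores": {("x1", "y1"): 0.7, ("x1", "y2"): 0.3,
                   ("x2", "y1"): 0.1, ("x2", "y2"): 0.8},
    },
]

pooled = pool_scores(organisms)
assignment, total = best_class_assignment(pooled, ["A1", "A2"], ["B1", "B2"])
print(assignment, round(total, 3))
```

In the first toy organism the A2 sequence scores slightly higher against B1 than B2; pooling with the second organism removes that ambiguity, which mirrors the noise-reduction argument in the abstract. For more than a few classes, a polynomial-time assignment solver would replace the permutation search.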
bioinformatics 2026-05-11 v1
Benchmarking long-read simulators against Oxford Nanopore whole-genome sequencing data
Taouk, M. L.; Ingle, D. J.; Wick, R. R.
Abstract
Background: Oxford Nanopore Technologies (ONT) sequencing is increasingly used for whole-genome sequencing (WGS) across a wide range of applications. However, the platform has evolved rapidly through updates to flow cell chemistry and basecalling algorithms, altering the characteristics of the resulting sequencing data. Read simulators provide synthetic datasets with known ground truth, enabling controlled development and evaluation of methods. However, many existing simulators were developed for earlier versions of ONT sequencing or use generic long-read assumptions, and their realism for contemporary ONT data is unclear. Results: We benchmarked six ONT-compatible read simulators (Badread, LongISLND, lrsim, NanoSim, PBSIM3 and SimLoRD) using a microbial genome reference and ONT R10.4.1 reads as the empirical standard. Each tool was configured to maximise realism, including training on empirical reads when supported. We compared simulated and real datasets with respect to read length, read accuracy, FASTQ quality scores and sequence error profiles. No simulator reproduced all metrics of the real data well. PBSIM3 most closely reproduced read length, read accuracy and FASTQ quality scores, making it a strong simulator for broad read-level realism. However, it did not capture important features of the real error profile, including context-dependent substitution rates and homopolymer-length errors. Badread and LongISLND better reproduced some aspects of the error profile, but showed other departures from the real data. Conclusion: PBSIM3 is a good general-purpose choice for many ONT WGS simulation tasks because it reproduced several key read-level properties well. However, Badread or LongISLND may be preferable for applications where error structure is more important. No evaluated tool was realistic across all tested metrics, highlighting a gap for improved long-read simulators.
bioinformatics 2026-05-11 v1
Pathway-informed Universal Domain Adaptation for Single-cell RNA-seq Data
Wei, X.; Li, X.; Liu, H.; Du, G.; Wei, F.; Shang, X.
Abstract
The rapid accumulation of single-cell atlases has yielded datasets of unprecedented scale, encompassing samples across diverse platforms, locations and laboratories. This multidimensional complexity drives an urgent need for universal domain adaptation methods capable of achieving precise cell-type annotation. However, existing methods lack computational scalability and fail to integrate biological priors. Here, we develop scPathOT, a pathway-informed universal domain adaptation framework that leverages pathway activation transformations to harmonize single-cell datasets across disparate conditions. We demonstrate the versatility of scPathOT across diverse technological platforms, tissues, disease contexts, cellular senescence and treatment conditions. Crucially, this pathway-informed alignment not only accurately resolves cellular identities but also uncovers functional mechanisms. In pancreatic islets, scPathOT delineates a shared stress-repair axis traversed by beta-cells prior to their divergence into type 1 and type 2 diabetes-specific states. In aging bone marrow, scPathOT disentangles lineage-specific senescence modules that unexpectedly converge onto unified inflammatory and oxidative-stress programs. Furthermore, application to an in-house pancreatic ductal adenocarcinoma cohort uncovers the mechanistic basis underlying the neoadjuvant chemotherapy-induced reorganization of stromal-immune crosstalk. By coupling biological priors with universal domain adaptation, scPathOT provides a scalable, mechanistically interpretable framework to accelerate biological discovery from atlas-level single-cell data.
bioinformatics 2026-05-11 v1
Investigation of Protein Melting Temperature Prediction with Cross-Method Validation on Biophysical Data
Pailozian, K.; Kohout, P.; Damborsky, J.; Mazurenko, S.
Abstract
Motivation: Protein melting temperature (Tm) prediction accelerates the discovery of thermostable enzymes, which are crucial for industrial biotechnology, where harsh reaction conditions are often required. Experimental determination of Tm remains labour-intensive and varies across techniques, motivating the development of in silico predictors. Mass-spectrometry datasets such as Meltome Atlas now enable large-scale Tm prediction with models based on deep learning, but model generalisation across diverse experimental datasets has not been systematically tested. Results: We evaluated the generalisability of state-of-the-art deep learning approaches and explored ESM-based embeddings for Tm prediction. To this end, we assembled the ProMelt training dataset (45 441 proteins) and five independent biophysics-based validation datasets. Our analysis revealed substantial differences between proteomics- and biophysics-based Tm measurements, highlighting the challenge of cross-domain generalisation. Existing state-of-the-art predictors trained on large-scale proteomics datasets showed reduced performance on biophysics-based validation sets. Our fine-tuned embedding-based models, particularly LoRA-adapted ESM-2 (TmProt 1.0), outperformed state-of-the-art predictors in identifying thermostable proteins (Tm ≥ 60 °C) across heterogeneous datasets, achieving AUC scores of 0.75–0.77. We also demonstrated that the available models could be used efficiently in the sequence prioritization task. Availability: The TmProt web server is available at https://loschmidt.chemi.muni.cz/tmprot/.
bioinformatics 2026-05-11 v1
Nanopore event detection in a simple and adaptive way
Wei, P.; Kansari, M.; Mierzejewski, M.; Ensslen, T.; Lin, C.-Y.; Kavetsky, K.; Jones, P. D.; Behrends, J. C.; Drndic, M.; Fyta, M.
Abstract
Nanopore read-out, that is, the current signals measured across nanometer-sized openings in dielectric membranes or through natural protein channels, enables the detection, identification and sequencing of individual molecules. The detection can take place by analyzing the events of single biomolecules interacting with the pore. The accuracy in the detection of these single events is key for identification of physicochemical properties of analyte molecules. To this end, we further develop a very simple, fast, almost parameter-free, and adaptable cluster-based event detection (CBED) algorithm that clusters the nanopore signals prior to detecting nanopore events. The algorithm is validated against two other event detection schemes with respect to simplicity and efficiency. For this, nanopore data from four different experiments stemming from different laboratories that vary in the nanopore type, size, and analyte are considered. The comparison is made on the basis of the number of events detected, their quality, and the most important features extracted from nanopore events. Our results underline the higher efficiency and lower noise of CBED-detected events for biological nanopore data and the need for on-the-fly adaptivity of the baseline current for a class of solid-state nanopore data.
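The general pattern of clustering the signal first and detecting events second can be sketched in a few lines of Python. This is a generic illustration, not the authors' CBED algorithm: it assumes a one-dimensional two-cluster k-means, a fixed midpoint threshold, blockades that lower the current, and a hypothetical `min_len` parameter, and it uses a made-up trace.

```python
def two_means_1d(samples, iters=50):
    """Split samples into a low and a high cluster; return the two cluster means."""
    lo, hi = min(samples), max(samples)
    for _ in range(iters):
        low = [x for x in samples if abs(x - lo) <= abs(x - hi)]
        high = [x for x in samples if abs(x - lo) > abs(x - hi)]
        if not low or not high:
            break  # signal has a single level, nothing to separate
        new_lo, new_hi = sum(low) / len(low), sum(high) / len(high)
        if new_lo == lo and new_hi == hi:
            break  # converged
        lo, hi = new_lo, new_hi
    return lo, hi


def detect_events(samples, min_len=2):
    """Return (start, end) index pairs, end exclusive, of runs below the threshold.

    The threshold is the midpoint between the blockade and baseline cluster means.
    """
    blocked, baseline = two_means_1d(samples)
    threshold = (blocked + baseline) / 2
    events, start = [], None
    for i, x in enumerate(samples):
        if x < threshold and start is None:
            start = i
        elif x >= threshold and start is not None:
            if i - start >= min_len:
                events.append((start, i))
            start = None
    if start is not None and len(samples) - start >= min_len:
        events.append((start, len(samples)))
    return events


# Hypothetical current trace (arbitrary units): baseline near 100, blockades near 60.
trace = [100, 101, 99, 100, 60, 61, 59, 100, 99, 101, 62, 60, 61, 60, 100, 100]
events = detect_events(trace)
print(events)
```

Because the threshold comes from the clusters rather than from a user-supplied value, the only tunable quantity here is the minimum event length. A drifting baseline would need the clustering to be repeated on sliding windows, which is the kind of on-the-fly adaptivity the abstract mentions for solid-state pores.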
bioinformatics 2026-05-11 v1
Cadence: A Benchmark Evaluation of the Narrative Velocity Framework for Next Clinical Event Prediction in MIMIC-IV
Rouhollahi, A.; Nezami, F. R.
Abstract
Objective: How structured clinical features and cluster-semantic embeddings interact under self-distillation in EHR prediction models is unknown. Existing approaches treat these sources separately (gradient-boosted trees exploit tabular features while sequence models process text), and their interaction under self-distillation regularisation remains uncharacterised. We introduce the Narrative Velocity (NV) framework and evaluate this interaction in a 7-model benchmark. Materials and Methods: Cadence is a ~5.86M-parameter residual multilayer perceptron (MLP) combining structured EHR features with frozen PubMedBERT embeddings of cluster-label strings under born-again self-distillation from a prior Cadence checkpoint (seed-42 teacher). Cadence is benchmarked against six comparators on MIMIC-IV v3.1 with dual-sex TRIPOD+AI reporting (5 student seeds for Cadence; 2–3 seeds for baselines). Results: At full-cohort scale, Cadence achieves 38.04 ± 0.04% male and 35.66 ± 0.04% female top-1 accuracy, exceeding the strongest non-neural baseline (XGBoost-2420, trained on the identical 2,420-dimensional input) by +1.35 pp male and +0.82 pp female (paired t-test on shared seeds 42–44: t(2) = 69.06, p = 2.10 × 10^-4 male; t(2) = 25.32, p = 1.56 × 10^-3 female). On time-to-next-event regression Cadence lowers MAE by 7.68 d male and 7.30 d female versus XGBoost-2420; FT-Transformer attains the lowest absolute MAE at full scale (27.58 d male, 36.63 d female), revealing a classification-regression trade-off across model families. A controlled 2×2 random-vector ablation isolates the self-distillation–embedding interaction at +0.49 pp top-1 (95% CI [0.35, 0.64] pp; bootstrap, n = 10,000 resamples; 3-teacher-seed mean +0.513 ± 0.010 pp) under a matched-dimensionality null. A 3-teacher-seed validation (multi_teacher_02) confirms the interaction is robust to teacher-seed identity (per-seed values +0.525, +0.509, +0.507 pp; mean +0.513 ± 0.010 pp).
Cadence achieves the best Brier score among evaluated models (0.774 male / 0.798 female) but its raw probabilities are systematically miscalibrated (ECE 0.077 vs. XGBoost-884's 0.010); after a single scalar temperature scaling step (T* ~0.81), ECE drops to ~0.028 while Brier remains best. On a small (n = 1,120 patients, 39,120 events) external OCR-extracted BWH cohort, Cadence ranked 3rd of 7 models with three confounded sources of error (institutional shift, OCR noise, centroid mapping); we therefore report this as a generalisation probe rather than a definitive external validation. At the longer h30 evaluation horizon Cadence's MAE advantage reverses (47.35 d versus XGBoost 45.06 d), reflecting the absence of a matched-horizon self-distillation teacher. Discussion: The 2×2 random-vector ablation confirms that the self-distillation gain on PubMedBERT embeddings (+0.78 pp) exceeds that on matched-dimensionality random vectors (+0.29 pp) by +0.49 pp, isolating the interaction to semantic content rather than feature dimensionality. The factorial decomposition (+0.49–0.51 pp interaction) and the sequential pipeline-level decomposition (Supplementary Table S3) are complementary triangulations under different reference frames and are not directly additive. Conclusion: This 7-model benchmark establishes a dual-sex, dual-metric, cross-institutional reference for next clinical event prediction under the TRIPOD+AI reporting framework. These results characterise discrimination and calibration on a single retrospective cohort; prospective evaluation, decision-curve analysis, and harm-benefit assessment are required before clinical deployment. Keywords: clinical event prediction, electronic health records, MIMIC-IV, Narrative Velocity, residual MLP, PubMedBERT, knowledge distillation, TRIPOD+AI
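The calibration step described above, a single scalar temperature applied to the logits followed by an expected calibration error (ECE) check, can be sketched in plain Python. This is a minimal illustration under stated assumptions: the grid search over temperatures, the ten equal-width confidence bins, and the toy logits are hypothetical and are not taken from the Cadence pipeline, so the fitted value here does not reproduce the reported T* of about 0.81.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a scalar temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood, computed with log-sum-exp for stability."""
    loss = 0.0
    for row, y in zip(logit_rows, labels):
        scaled = [z / temperature for z in row]
        m = max(scaled)
        log_norm = m + math.log(sum(math.exp(z - m) for z in scaled))
        loss -= scaled[y] - log_norm
    return loss / len(labels)


def fit_temperature(logit_rows, labels, grid=None):
    """Pick the temperature on a grid (default 0.10 to 3.00) that minimises NLL."""
    grid = grid or [0.05 * k for k in range(2, 61)]
    return min(grid, key=lambda t: nll(logit_rows, labels, t))


def ece(logit_rows, labels, temperature=1.0, n_bins=10):
    """Expected calibration error of top-1 confidence with equal-width bins."""
    bins = [[0, 0.0, 0.0] for _ in range(n_bins)]  # count, confidence sum, correct sum
    for row, y in zip(logit_rows, labels):
        probs = softmax(row, temperature)
        conf = max(probs)
        b = min(int(conf * n_bins), n_bins - 1)
        bins[b][0] += 1
        bins[b][1] += conf
        bins[b][2] += 1.0 if probs.index(conf) == y else 0.0
    n = len(labels)
    return sum(c / n * abs(correct / c - conf_sum / c)
               for c, conf_sum, correct in bins if c)


# Hypothetical overconfident validation set: same logits, 3 of 4 predictions correct.
val_logits = [[3.0, 0.0]] * 4
val_labels = [0, 0, 0, 1]

t_star = fit_temperature(val_logits, val_labels)
ece_raw = ece(val_logits, val_labels)
ece_scaled = ece(val_logits, val_labels, t_star)
print(t_star, ece_raw, ece_scaled)
```

Dividing every logit by the same positive scalar never changes the arg-max, so top-1 accuracy is untouched and only the confidence values move. A fitted temperature below 1, as reported in the abstract, sharpens the probabilities, whereas this toy set yields a value above 1 because its raw outputs are overconfident.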
bioinformatics 2026-05-11 v1
G-SPRI: A Structure-Centric Graph Model for Comprehensive Prediction of Cancer Driver Events from Missense Mutations
Wang, B.; Ye, B.; Farhat, A.; Liang, J.; Yu, L.; Lu, Z.; Wang, X.; Xu, L.
Abstract
In silico approaches for predicting the functional impact of missense mutations are critical for interpreting personal genomes and identifying disease-related biomarkers. Existing methods largely rely on sequence-based information or intuitive structural features, but often overlook the complex biophysical patterns encoded in protein 3D structures. Here, we present G-SPRI, a multilevel framework built on a novel alpha shape protein graph that accurately captures residue connectivity from atomic-resolution geometry and enables precise message passing around mutation sites. Using this graph representation, G-SPRI integrates wild-type structural properties and mutation-specific perturbation signals derived from the Protein Data Bank (PDB) universe to support graph-based learning for distinguishing pathogenic from benign missense variants. G-SPRI performs strongly across multiple key tasks. On the binary prediction benchmark, G-SPRI delivers improved pathogenicity prediction for individual mutations. By integrating mutation recurrence across the pan-cancer cohort, G-SPRI recovers more known cancer driver genes than state-of-the-art methods from more than 2.3 million mutations. Furthermore, by jointly quantifying site-specific pathogenicity and co-clustering influence within higher order structural organization units, G-SPRI provides comprehensive evidence for pinpointing likely driver mutations and structurally susceptible regions within disease genes.
bioinformatics 2026-05-11 v1
sxRaep: A Rapid and Accurate Enzyme Predictor for high-throughput mining of enzymatic sequences
Duan, H.; Han, X.; Mo, Y.; Ren, B.; Xia, L. C.
Abstract
Metagenomic sequencing generates petabyte-scale sequence datasets that strain both deep-learning and alignment-based enzyme annotation tools. A lightweight, rapid, and accurate filter is needed to identify enzymatic sequences prior to resource-intensive functional prediction. We present sxRaep (Rapid and Accurate Enzyme Predictor), a resource-efficient framework using lightweight physicochemical features for enzyme pre-screening. sxRaep achieves a 6,604-fold speedup over Diamond (0.002 seconds per inference) with a 62.1% memory reduction relative to Diamond (372 MB peak), while maintaining 99.4% accuracy and the highest recall in remote homology detection. This lightweight approach identifies enzymatic candidates missed by alignment-based methods without sacrificing accuracy.
bioinformatics 2026-05-11 v1
A fine-tuned genomic language model adds complementary nucleotide-context information to missense variant interpretation
Su, Y.; Lin, Y.-J.
Abstract
Missense variant interpretation remains a central challenge in clinical genomics. Missense pathogenicity predictors achieve strong performance, but many emphasize protein-level consequences or overlapping annotation priors. Whether genomic language models add non-redundant nucleotide-context signal to missense interpretation remains unclear. Here, we systematically adapted genomic language models to ClinVar missense pathogenicity prediction across backbone architectures, representation strategies, classifier heads, and adaptation regimes. In our analysis, variant-position embeddings consistently outperformed pooled sequence representations, multi-species pretraining provided the strongest backbone-level advantage, and low-rank adaptation generalized better than full fine-tuning. The resulting fine-tuned model, GLM-Missense, substantially outperformed zero-shot scoring from the same pretrained model. To test whether GLM-Missense contributes information beyond existing methods, we built MetaMissense, an XGBoost ensemble combining GLM-Missense with AlphaMissense, ESM1b, REVEL, CADD, SIFT, and PolyPhen-2. GLM-Missense showed the lowest concordance with other predictors, retained the strongest partial association with pathogenicity after controlling for the other predictors, and ranked as the most informative non-ensemble input to MetaMissense. MetaMissense achieved the best performance in both cross-validation and held-out testing. Analyses of variants correctly classified by GLM-Missense but misclassified by several established predictors suggested two patterns. First, part of the GLM-Missense signal may reflect splice-relevant exonic context. Second, GLM-Missense appears to add value in settings where other predictors may overweight allele frequency, gene-level constraint, or amino-acid-change severity. However, these features explained only about 10% of the distinction between the GLM-Missense-correct subset and the background.
Together, our results demonstrate that fine-tuned genomic language models contribute complementary nucleotide-context information to missense variant interpretation.
bioinformatics 2026-05-11 v1