Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
Can AI Conduct Autonomous Scientific Research? Case Studies on Two Real-World Tasks
Agrawal, S.; Anadkat, H.; Athimoolam, K.; Bhardwaj, H.; Chowdhury, T.; Gao, S.; Kamat, P.; Makwana, V.; Shariff, M.; Badkul, A.; Xie, L.; Sinitskiy, A.
AI Summary
- This study evaluated eight AI frameworks on their ability to autonomously conduct scientific research by replicating tasks in uncertainty quantification and protein interaction discovery.
- None of the frameworks could complete a full research cycle, showing limitations in robust implementation and producing hallucinations, despite competence in planning and summarization.
- The findings indicate that while current AI systems cannot autonomously conduct scientific research, they can assist with specific research subtasks under human supervision.
Abstract
Recent advances in artificial intelligence (AI) have prompted claims about autonomous "AI scientists," yet systematic evaluations of these capabilities remain scarce. This exploratory study investigates whether current AI frameworks can execute scientific research tasks beyond isolated demonstrations. We tested eight open-source AI frameworks (Agent Laboratory, AutoGen, BabyAGI, GPT Researcher, MOOSE-Chem2, SciAgents, SciMON, and Virtual Lab) on two tasks that aimed to reproduce research on algorithm development from recent papers in uncertainty quantification and protein interaction discovery. In our evaluation, no framework completed a full research cycle from literature understanding through computational execution to validated results and scientific paper writing. While all systems showed competence in conceptual tasks such as planning and summarization, they consistently failed at robust implementation. Every framework produced sophisticated hallucinations. Deployment proved demanding, requiring substantial debugging and technical expertise, which undermines common claims about the democratization of science with AI. Despite these limitations, the frameworks showed promise as research assistants for methodological planning and ideation under careful human supervision. Our findings suggest that the explored AI systems cannot yet autonomously conduct scientific research, but may provide real value for specific subtasks within the research workflow. We offer preliminary observations to help researchers and developers better understand the gap between advertised and actual capabilities of AI in science.
bioinformatics · 2026-01-06 · v1
Circulating miR-4532 is associated with loss of ambulation in dysferlinopathy
Grewal, T. S.; Hollander, Z.; Dai, D. L.; Chen, V.; Windish, H. P.; Albrecht, D. E.; Paoliello, E. L.; Rufibach, L.; Williams, B.; Mittal, P.; Assadian, S.; Wilson-McManus, J. E.; McManus, B.; Ng, R.; Tebbutt, S.; Bernatchez, P.; Singh, A.
AI Summary
- This study investigated the association of circulating miRNAs with ambulation status in dysferlinopathy patients by comparing miRNA profiles of 49 patients and 25 controls.
- miR-4532 was found to be upregulated in ambulatory dysferlinopathy patients compared to controls, but downregulated in non-ambulatory patients compared to ambulatory ones.
- miR-4532 levels were positively correlated with circulating monocyte levels in ambulatory patients.
Abstract
Background: Limb-girdle muscular dystrophies (LGMDs) are inherited myopathies characterized mainly by progressive weakness of the proximal muscles of the shoulder and pelvic girdle areas, leading to functional decline and eventual loss of independent ambulation. Dysferlinopathy (LGMD2B) is an autosomal recessive LGMD subtype caused by mutations in the DYSF gene that lead to a lack of dysferlin, which results in muscle death and chronic muscle fiber degeneration. Preservation of ambulation is a key clinical milestone, as loss of independent gait markedly reduces quality of life and complicates care management. Although patients often perceive functional decline before their initial clinical presentation, current clinical assessments typically detect disease progression after substantial muscle damage has occurred. Methods: In this multigroup case-control study, we profiled plasma miRNAs from 49 genetically confirmed dysferlinopathy patients (24 ambulatory, 25 non-ambulatory) and 25 age- and sex-matched healthy controls. Total RNA was extracted from blood samples and hybridized to Affymetrix GeneChip miRNA 3.1 arrays. After quality control and filtering, differential expression analysis was performed using linear models for microarrays, adjusting for age and sex, with a false discovery rate cutoff of 10%. Results: 14 miRNAs were significantly altered between dysferlinopathy patients and controls. Notably, miR-4532 was upregulated in ambulatory patients relative to controls, whereas it was downregulated in non-ambulatory patients compared with ambulatory patients, although expression levels remained higher than in controls. Levels of miR-4532 were positively associated with circulating monocyte levels in ambulatory patients only.
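The 10% false discovery rate cutoff used above is typically the Benjamini-Hochberg procedure; a minimal sketch of that procedure follows. The p-values here are invented for illustration — the study itself fit linear models for microarrays (limma) to real array data.

```python
# Minimal Benjamini-Hochberg FDR sketch (illustrative p-values only).
def benjamini_hochberg(pvals, fdr=0.10):
    """Return indices of hypotheses rejected at the given FDR level."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # Find the largest rank k with p_(k) <= (k/m) * fdr;
    # reject that hypothesis and all smaller-ranked ones.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * fdr:
            k_max = rank
    return sorted(order[:k_max])

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(pvals))  # [0, 1, 2, 3, 4, 5]
```

Note the step-up behavior: 0.039 and 0.041 fail their own per-rank thresholds but are still rejected because a larger p-value (0.06 at rank 6) passes its threshold.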
bioinformatics · 2026-01-06 · v1
ICFinder: ion channel identification and ion permeation residue prediction using protein language models
Wang, J.; Zhang, X.; Fan, X.; Xiao, B.; Tian, B.
AI Summary
- The study developed BLAPE and CLAPE frameworks using the ESM-2 protein language model to identify ion channels and predict ion permeation residues, crucial for understanding transport mechanisms and developing therapies.
- These models showed significant improvements (33%-171% in MCC) over existing methods and highlighted an enrichment of weakly polar residues at ion permeation sites.
- Case studies demonstrated CLAPE's superior performance and its utility for proteins without experimental structures, with results accessible via the ICFinder webserver.
Abstract
Ion channel dysfunction underlies many diseases (e.g., arrhythmias, epilepsy, cystic fibrosis), and uncharacterized channels may also contribute to pathology. Identifying such channels and their residues directly contacting the permeation pathway (i.e., ion permeation residues) is key to elucidating transport mechanisms and developing targeted therapies. Leveraging the protein language model ESM-2 and curated datasets, we developed BLAPE and CLAPE frameworks for high-throughput ion channel identification and permeation residue prediction. Our models outperformed existing methods, with 33%-171% improvements in Matthews correlation coefficients (MCC) across different datasets. Analysis of amino acid composition revealed enrichment for weakly polar residues among ion permeation sites. Case studies on four diverse ion channels highlighted that CLAPE consistently outperforms existing predictors and remains applicable to proteins lacking experimental structures, while also complementing structure-based pipelines such as AlphaFold3. We further applied our models to UniRef50 to predict potential ion channels, and made these results publicly available through the ICFinder webserver (https://tianlab-tsinghua.cn/icfinder/), providing a ready-to-use resource for the research community. All source code is available at https://github.com/JueWangTHU/ICFinder.
bioinformatics · 2026-01-05 · v2
Predicting the Evolutionary and Functional Landscapes of Viruses with a Unified Nucleotide-Protein Language Model: LucaVirus
Pan, Y.; He, Y.; Liu, Y.-Q.; Shan, Y.-T.; Liu, S.-N.; Ma, J.-H.; Liu, X.; Pan, X.; Bai, Y.; Xu, Z.; Hou, T.; Wang, Z.; Ye, J.; Holmes, E. C.; Li, B.; Chen, Y.-Q.; Li, Z.-R.; Shi, M.
AI Summary
- LucaVirus, a multi-modal foundation model, was developed to predict viral evolution and function by training on 25.4 billion nucleotide and amino acid tokens from nearly all known viruses.
- The model captures relationships between sequences, protein/gene homology, and evolutionary divergence, enabling downstream tasks like identifying hidden viruses, annotating protein functions, predicting viral evolvability, and identifying antibody candidates.
- LucaVirus achieves state-of-the-art performance in three tasks and matches leading models in a fourth, demonstrating its efficiency and versatility in AI-driven virology.
Abstract
Predicting viral evolution and function remains a central challenge in biology, hindered by high sequence divergence and limited knowledge compared to cellular organisms. Here, we introduce LucaVirus, a multi-modal foundation model for viruses, trained on 25.4 billion nucleotide and amino acid tokens covering nearly all known viruses. LucaVirus learns biologically meaningful representations capturing relationships between sequences, protein/gene homology, and evolutionary divergence. Using these embeddings, we developed downstream models that address key virology tasks: identifying hidden viruses in genomic "dark matter", annotating enzymatic activities of uncharacterized proteins, predicting viral evolvability, and identifying antibody candidates for emerging viruses. LucaVirus achieves state-of-the-art results in three tasks and matches leading models in the fourth with one-third the parameters. Together, these findings demonstrate the power of a unified foundation model to comprehensively decode the viral world and establish LucaVirus as an efficient and versatile platform for AI-driven virology, from virus discovery to functional and therapeutic predictions.
bioinformatics · 2026-01-05 · v2
WITHDRAWN: GeneTerrain-GMM Unmasks a Coordinated Neuroinflammatory and Cell Death Network Perturbed by Dasatinib in a Human Neuronal Model of Alzheimer's Disease
Song, K. M.; Zhang, J.
AI Summary
- The manuscript titled "GeneTerrain-GMM Unmasks a Coordinated Neuroinflammatory and Cell Death Network Perturbed by Dasatinib in a Human Neuronal Model of Alzheimer's Disease" has been withdrawn.
- The withdrawal is due to unresolved issues concerning authorship attribution and intellectual property rights.
Abstract
The authors have withdrawn this manuscript due to an open and unresolved matter regarding authorship attribution and the clarification of intellectual property rights associated with the work. Therefore, the authors do not wish this work to be cited as a reference for the project. If you have any questions, please contact the corresponding author.
bioinformatics · 2026-01-05 · v2
Atlas-Based Spatio-temporal MRI Phenotyping of 3D Fungal Spread in Grapevine Wood
Phukon, G.; Cardoso, M.; Goze-bac, C.; Le Cunff, L.; Verdeil, J.-L.; Moisy, C.; Fernandez, R.
AI Summary
- The study developed a non-destructive 3D + t MRI pipeline to monitor early internal tissue degradation from fungal colonization in grapevines.
- The approach involved anatomical alignment, time-series registration, cylindrical coordinate transformation, supervised classification, and statistical analysis to create lesion atlases.
- Results showed consistent early degradation signals and cultivar-dependent lesion differences, enhancing understanding and management of grapevine trunk diseases.
Abstract
In perennial crops, inner wood degradation often escapes detection until irreversible damage has occurred. Grapevine trunk disease (GTD) is a well-known example in viticulture that alters plants from within, years before foliar symptoms arise, making early assessment difficult. To overcome this limitation, we present a novel non-destructive 3D + t pipeline for Magnetic Resonance Imaging (MRI) spatial quantification and monitoring of early internal tissue degradation resulting from fungal colonization. This pipeline integrates (i) anatomical alignment and rigid time-series registration of volumetric MRI scans, (ii) a generalized cylindrical coordinate transformation for cross-sectional trunk anatomy normalization, (iii) supervised classification to segment water-depleted (diseased/non-functional) regions, and (iv) population-level statistical analyses including construction of population mean images, probabilistic atlases of lesions, and 3D lesion descriptors. Applied to multiple Vitis vinifera cultivars inoculated with a fungal trunk pathogen, our approach enables time-lapse comparisons of cultivar and treatment in vivo. The results reveal consistent early degradation signals across individuals and cultivar-dependent lesion differences. By combining high-resolution MRI with advanced image processing and statistical atlas tools, this method provides a new paradigm for 3D plant phenotyping of internal disease progression. This methodological innovation allows non-invasive quantification of disease development and comparative assessment of host responses in woody plants, demonstrating its potential to advance understanding and management of GTDs.
bioinformatics · 2026-01-05 · v1
Predicting TCR-pMHC Binding by Reinforcement Learning
Lang, J.; Yu, C.; Tran, N. H.; Peng, C.; Lei, Q.; Qin, H.; Yang, L.; Zhang, Y.; Bu, D.; Li, M.
AI Summary
- This study introduces ProTCR, a reinforcement learning-based approach that integrates sequence, structural, and functional data to predict TCR-pMHC binding, enhancing prediction accuracy by considering their interdependencies.
- ProTCR achieved an AUROC of 0.75 on benchmark datasets, surpassing existing methods by 32.7%, and provided insights into binding determinants.
- Validation with cervical cancer patient data showed a correlation between T cell clonal expansion and ProTCR predictions, with additional applications in predicting SARS-CoV-2 complexes and designing immunotherapies.
Abstract
The binding between T cell receptors (TCRs) and peptide-major histocompatibility complexes (pMHCs) is fundamental to the immune system's ability to recognize and eliminate pathogens. Accurate prediction of TCR-pMHC interactions holds significant promise for advancing cancer immunotherapy, vaccine design, and autoimmune disease research. However, existing approaches often treat the sequences, structures, and functions of TCRs, peptides, and MHC molecules in isolation, neglecting their interdependencies and hence limiting the prediction accuracy. In this study, we present ProTCR, a novel approach that integrates sequence, structural, and functional information within a reinforcement learning framework, offering a new paradigm for predicting TCR-pMHC binding. The reinforcement learning optimization enables ProTCR to generate TCR-pMHC sequences with enhanced binding propensity, thereby improving prediction accuracy. On benchmark datasets such as IEDB and VDJdb, ProTCR achieves an AUROC of 0.75, outperforming state-of-the-art methods by 32.7%, while offering interpretable insights into the structural and sequence determinants of binding. We further validate ProTCR using TCRs and neoantigens derived from a cervical cancer patient via proteogenomic profiling. Our analysis reveals a strong correlation between T cell clonal expansion and ProTCR-predicted TCR-peptide binding scores, supporting the biological relevance of the model. Additionally, ProTCR demonstrates robust performance in predicting SARS-CoV-2 TCR-pMHC complexes and generating MHC-specific peptides with potential applications in peptide-based immunotherapies. Collectively, these findings establish ProTCR as a powerful and interpretable tool for TCR-pMHC binding prediction, with broad utility across immunology research and translational applications.
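For readers unfamiliar with the AUROC figure quoted in abstracts like this one, the metric has a simple rank-based definition: the probability that a randomly chosen positive example is scored above a randomly chosen negative one. A toy sketch with invented scores and labels (nothing here comes from ProTCR itself):

```python
def auroc(scores, labels):
    """AUROC via the Mann-Whitney U (rank-sum) formulation:
    the fraction of positive/negative pairs where the positive
    example gets the higher score; ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Four hypothetical predictions: two true binders (label 1), two non-binders.
print(auroc([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0]))  # 0.75
```

An AUROC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why a jump from roughly 0.56 to 0.75 (the reported 32.7% gain) is a substantial improvement on this scale.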
bioinformatics · 2026-01-05 · v1
Learning a PRECISE language for small-molecule binding
Erden, M.; Zhang, X.; Devkota, K.; Singh, R.; Cowen, L.
AI Summary
- PRECISE leverages CoNCISE's quantized small-molecule representations to predict binding sites by interpreting drug-target interactions as compatibility between drug embeddings and a target's 3D surface mesh, enriched with electrostatic and geometric features.
- It uses a geometric deep learning architecture to outperform existing methods in identifying binding sites, while maintaining scalability for billion-scale screening.
- PRECISE-MCTS integrates PRECISE with Vina docking via Monte Carlo Tree Search, enhancing efficiency in downstream docking workflows.
Abstract
Virtual screening of billion-scale compound libraries has become feasible through machine learning approaches. In particular, CoNCISE (RECOMB 2025) introduced drug quantization via codebooks, achieving highly scalable and accurate binary predictions. However, drug discovery requires understanding not just whether molecules bind, but where they bind and how to target specific sites. Here, we present PRECISE which leverages CoNCISE's quantized small-molecule representations while operating on the target's 3D structure as its input. The key innovation of PRECISE is reimagining drug-target interaction as compatibility between quantized drug embeddings and a latent representation of the target's surface mesh, enriched with electrostatic and geometric features. PRECISE designs a novel surface representation, interpreted through a geometric deep learning architecture, enabling it to identify binding sites more accurately than state-of-the-art methods (DiffDock-L, Chai, and Boltz-2) while the codebook ensures billion-scale screening capability. Our formulation unlocks zero-shot generalization to complex targets such as metalloproteins and multi-chain complexes. To enable efficient integration with downstream docking workflows, we introduce PRECISE-MCTS, which combines fast PRECISE-based screening with selective Vina docking through an iterative Monte Carlo Tree Search approach. By providing both mechanistic understanding and massive scalability, PRECISE delivers capabilities that were previously mutually exclusive in virtual screening.
bioinformatics · 2026-01-05 · v1
Proteins as Statistical Languages: Information-Theoretic Signatures of Proteomes Across the Tree of Life
Alegre, E. O. T.
AI Summary
- This study explores proteins as statistical languages by analyzing informational descriptors like composition entropy, mutual information, and separation-dependent information across 20 UniProt reference proteomes.
- Using bootstrap resampling and synthetic controls, real proteomes showed dependencies beyond local transition statistics, differing from composition-matched i.i.d. and Markov-1 models.
- Results suggest proteomes function as constrained statistical languages, offering a diagnostic approach for comparing sequence ensembles before detailed mechanistic modeling.
Abstract
Protein sequences are commonly interpreted through biochemical and evolutionary lenses, emphasizing structure-function relationships and selection in sequence space. Here we develop a complementary viewpoint: proteins as statistical languages, strings over a finite alphabet generated by constrained stochastic processes. We formalize intrinsic informational descriptors of protein ensembles, including composition entropy H1, adjacent mutual information I1, and separation-dependent information profiles Id. A null-model ladder (uniform, composition-matched i.i.d., and Markov-1) separates compositional effects from genuine positional dependence. We then evaluate these descriptors empirically across 20 UniProt reference proteomes spanning major clades, using protein-level bootstrap resampling and matched synthetic controls. Real proteomes consistently depart from composition-matched i.i.d. baselines and exhibit information profiles that remain elevated beyond the decay expected under first-order Markov surrogates, indicating dependencies beyond local transition statistics. Finally, a compressibility proxy (gzip) provides an orthogonal signature of redundancy relative to i.i.d. controls at matched composition. Together, these results support the view of proteomes as constrained statistical languages and provide model-agnostic fingerprints for comparing sequence ensembles. These signatures provide a lightweight diagnostic layer for comparing proteomes prior to mechanistic modeling.
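The descriptors named above, composition entropy H1 and adjacent mutual information I1, are standard information-theoretic quantities. A toy sketch on a made-up two-letter sequence (not real proteome data, and not the paper's own code):

```python
from collections import Counter
from math import log2

def h1(seq):
    """Shannon entropy (bits) of the single-symbol composition."""
    counts = Counter(seq)
    n = len(seq)
    return -sum(c / n * log2(c / n) for c in counts.values())

def i1(seq):
    """Mutual information (bits) between adjacent positions:
    I1 = H(X) + H(Y) - H(X, Y) over symbol pairs (s[i], s[i+1])."""
    pairs = list(zip(seq, seq[1:]))
    hx = h1([a for a, _ in pairs])   # entropy of left symbols
    hy = h1([b for _, b in pairs])   # entropy of right symbols
    hxy = h1(pairs)                  # joint entropy of adjacent pairs
    return hx + hy - hxy

seq = "ABABABAB"
print(h1(seq))            # 1.0 bit: equal A/B composition
print(round(i1(seq), 3))  # 0.985 bits: each symbol strongly predicts the next
```

An i.i.d. shuffle of the same string would preserve h1 but drive i1 toward zero, which is exactly the contrast the paper's composition-matched null models are designed to expose.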
bioinformatics · 2026-01-05 · v1
BiomeGPT: A foundation model for the human gut microbiome
Medearis, N. A.; Zhu, S.; Zomorrodi, A. R.
AI Summary
- BiomeGPT is a transformer-based foundation model pretrained on over 13,300 human gut metagenomes to learn species-level microbiome representations across 32 phenotypes.
- When fine-tuned, BiomeGPT accurately predicts health status, distinguishing between healthy and diseased states, and identifies specific microbial signatures related to various diseases.
- The model provides a scalable framework for microbiome analysis, aiding in biomarker discovery, disease stratification, and precision medicine.
Abstract
The human gut microbiome encodes rich information about host health, yet current analysis pipelines remain narrowly optimized for individual tasks. This limits our ability to gain a thorough view of how the microbiome impacts health and disease. Here we introduce BiomeGPT, a transformer-based foundation model pretrained on over 13,300 human gut metagenomes spanning 32 phenotypes, including healthy and 31 diverse diseases, to learn context-aware, species-level gut microbiome representations. The model captures quantitative compositional structure and intricate cross-species dependencies embedded within community profiles. When fine-tuned for predicting host health status, BiomeGPT accurately distinguishes healthy from diseased microbiomes and resolves individual disease states across a broad clinical spectrum. Furthermore, its attention patterns reveal biologically plausible microbial signatures, highlighting both shared and disease-specific microbial species linked to host phenotypes. By providing a unified, scalable framework for species-level gut microbiome representation learning and prediction, BiomeGPT enables new avenues for biomarker discovery, disease stratification, and microbiome-driven precision medicine.
bioinformatics · 2026-01-05 · v1
Large-scale Visuomotor Reaction Time Self-Testing Reveals Subtle Motor Changes in Older Adults with Subjective Cognitive Impairment
Wang, X.; Bindoff, A.; George, R. S.; Roccati, E.; Li, R.; Lawler, K.; Connelly, W.; Tran, S.; King, A.; Vickers, J.; Bai, Q.; Alty, J.
AI Summary
- This study investigated whether online visuomotor reaction time (RT) tests could detect subtle motor changes in older adults with subjective cognitive impairment (SCI), a risk factor for Alzheimer's disease.
- Results showed that SCI was linked to 8.4% longer RT, increased time-out failures, and greater RT variability, but not to differences in memory or executive function tests.
- These findings suggest that visuomotor RT tests might be more sensitive to early cognitive decline than traditional cognitive assessments.
Abstract
Introduction: Affordable tools for early Alzheimer's disease (AD) detection could support drug development and early intervention. Subtle motor changes may indicate preclinical AD, but hand response selection and initiation speeds are understudied. This study assessed whether unsupervised, online visuomotor reaction time (RT) tests relate to subjective cognitive impairment (SCI), a validated high-risk state for future conversion to AD. Methods: A total of 910 participants (age 66.3 +/- 7.5, 70.8% female) completed assessments of simple and choice visuomotor RT tests at home as part of the online TAS Test protocol; they also completed the Cambridge Neuropsychological Test Automated Battery (CANTAB) episodic memory and executive function tests. Among them, 142 participants reported SCI. Results: On the TAS Test visuomotor tests, SCI was associated with 8.4% [1.4%, 15.4%] longer RT (p = .008; adjusted for task complexity), greater odds of time-out failure (OR = 1.35 [1.01, 1.81]; p = .037), and greater variance in RT (log-variance (SCI - comparison) = .094 [.028, .159]; p < .001). There were no significant differences between the SCI and comparison groups on any of the CANTAB tests. After adjusting for SCI status, none of the CANTAB tests were significantly associated with RT. Discussion: SCI was associated with longer and more variable visuomotor RT, and greater odds of time-out failure, while not being associated with tests of memory and executive function. Cognitive test scores did not explain a significant amount of variance in visuomotor RT. Taken together, these results support the hypothesis that people with SCI may be experiencing earlier visuomotor deficits that are distinct from (or precede) decline in episodic memory and executive function. Visuomotor tasks that record RT may be more sensitive to preclinical manifestations of cognitive decline than more traditional tests of cognitive function.
bioinformatics · 2026-01-05 · v1
Most disease gene variants show minimal population differentiation despite incomplete coverage
Gyimah, S. K.
AI Summary
- The study compared naive zero-imputation and missingness-aware methods for assessing population differentiation in 72,915 variants across 17 African-relevant disease genes using 1000 Genomes Project data.
- Both methods showed high correlation (Spearman ρ = 0.9969), with only 0.71% disagreement, mainly in variants with incomplete coverage.
- Disease genes had a 4.75-fold lower rate of highly differentiated variants compared to the genome-wide background, suggesting functional constraints limit divergence except at sites under positive selection.
Abstract
Background: Underrepresentation of non-European populations in genomic databases creates challenges for ancestry-matched variant interpretation, particularly when population frequency data are incomplete. Current approaches either assume missing populations have reference allele fixation (naive zero-imputation) or restrict analyses to observed populations (missingness-aware), but the clinical impact of these methodological choices remains unquantified. Methods: We compared naive and missingness-aware differentiation metrics across 72,915 variants in 17 African-relevant disease genes with documented selection or clinical significance, using 10-population data from the 1000 Genomes Project Phase 3. Genome-wide validation employed 1,102,375 chromosome 22 variants with complete 26-population coverage. Population differentiation was quantified as the maximum absolute reference allele frequency difference (max|Δp|). High-differentiation variants (max|Δp| ≥ 0.5) were compared between disease genes and chromosomal background using Fisher's exact tests with bootstrap confidence intervals. Results: Methods showed high overall correlation (Spearman ρ = 0.9969) with only 0.71% disagreement (518/72,915 variants), concentrated among variants with incomplete population coverage. However, 350 variants (0.48%) exceeded the high-differentiation threshold, including well-characterized ancestry-informative markers under documented selection. Disease genes showed 4.75-fold depletion of highly differentiated variants relative to genome-wide background (Fisher's exact test OR = 0.210, 95% CI [0.188, 0.233], p < 10^-316), indicating that functional constraint limits frequency divergence except at sites under positive selection. Complete population coverage eliminated method disagreement (chromosome 22: Spearman ρ = 1.0000, zero disagreements).
Conclusions: Ancestry-matched variant interpretation is not universally required but becomes critical for a small, clinically enriched subset (0.48%) showing substantial population differentiation. Functional constraint in disease genes concentrates extreme differentiation at specific adaptive sites rather than distributing it across functionally important regions. These findings provide empirical guidance for resource allocation in equitable variant interpretation frameworks.
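The two metrics contrasted in this abstract reduce to a small computation: naive zero-imputation treats missing populations as fixed for the reference allele (frequency 1.0), while the missingness-aware version takes the max |Δp| over observed populations only. A sketch with invented frequencies for hypothetical population labels:

```python
def max_abs_dp(freqs, impute_missing=False):
    """Maximum absolute pairwise difference in reference allele frequency.
    freqs: dict population -> frequency, with None marking missing data."""
    if impute_missing:
        # Naive zero-imputation: missing populations assumed fixed (p = 1.0).
        vals = [1.0 if f is None else f for f in freqs.values()]
    else:
        # Missingness-aware: restrict to observed populations.
        vals = [f for f in freqs.values() if f is not None]
    return max(vals) - min(vals)

# Illustrative frequencies only; "EAS" has no data for this variant.
freqs = {"AFR": 0.75, "EUR": 0.25, "EAS": None, "SAS": 0.5}
print(max_abs_dp(freqs))                       # 0.5: observed populations only
print(max_abs_dp(freqs, impute_missing=True))  # 0.75: missing pop imputed as 1.0
```

This shows how the two methods can disagree only for variants with incomplete coverage, consistent with the abstract's finding that complete 26-population coverage eliminated all disagreements.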
bioinformatics · 2026-01-05 · v1
PRADA-DTI: A Prototype-Retrieval Augmented Domain-Adaptation Framework for Drug-Target Interaction Prediction
Zhu, J.; Lv, T.; Pan, X.
AI Summary
- PRADA-DTI addresses the challenge of dynamic biomedical data in drug-target interaction (DTI) prediction by using a retrieval-augmented, parameter-efficient framework for domain-incremental learning.
- It employs domain-specialized prompts and Low-Rank Adaptation (LoRA) modules, with protein embeddings querying a prototype memory to dynamically adapt to new protein domains.
- On BindingDB and BIOSNAP benchmarks, PRADA-DTI showed superior performance in predictive accuracy and forgetting mitigation, with interpretability analysis confirming its focus on relevant binding regions.
Abstract
Drug-target interaction (DTI) prediction is a fundamental task in computational drug discovery. However, most existing DTI models assume a static learning environment, whereas real-world biomedical data are dynamic, characterized by the continuous emergence of new protein families and interaction patterns. This poses major challenges for model generalization and continual adaptation, especially under privacy and data-access constraints. To address these issues, we introduce PRADA-DTI, a retrieval-augmented, parameter-efficient framework for domain-incremental DTI learning. Built on a shared physicochemical backbone, PRADA-DTI learns domain-specialized prompts for representation-level modulation, coupled with Low-Rank Adaptation (LoRA) modules for parameter-space adaptation. At inference, protein embeddings query a compact prototype memory without storing raw molecular data, retrieving similar domains to dynamically compose the relevant prompts and LoRA parameters without requiring domain labels. This retrieval-guided composition enables continual learning from new protein domains while mitigating catastrophic forgetting on previous ones. On BindingDB and BIOSNAP benchmarks, PRADA-DTI substantially outperforms state-of-the-art continual learning and parameter-efficient baselines in both predictive accuracy and forgetting mitigation with minimal parameter updates. Interpretability analysis through residue-level attribution visualization demonstrates that the model correctly attends to binding pocket regions across different protein domains, confirming that the retrieval-guided adaptation mechanism captures biologically relevant structural patterns. These results demonstrate the effectiveness of retrieval-augmented parameter adaptation for continual drug discovery.
bioinformatics · 2026-01-05 · v1
MCBO: Mammalian Cell Bioprocessing Ontology, A Hub-and-Spoke, IOF-Anchored Application Ontology
Robasky, K.; Morrissey, J.; Riedl, M.; Dräger, A.; Borth, N.; Betenbaugh, M. J.; Lewis, N. E.
AI Summary
- The study addresses the challenge of fragmented datasets in mammalian cell-based biopharmaceutical manufacturing by introducing the Mammalian Cell Bioprocessing Ontology (MCBO), which integrates bioreactor conditions, cell line characteristics, and product production.
- MCBO, built on BFO and anchored to IOF Core, uses a hub-and-spoke model to harmonize data, demonstrated through a datahub with 723 process instances and 325 samples.
- Validation via SPARQL queries confirmed MCBO's utility in cross-study analysis, enhancing AI and human workflows for biomanufacturing intelligence.
Abstract
Mammalian cell-based biopharmaceutical manufacturing generates vast, heterogeneous datasets that remain fragmented due to the lack of a standardized metadata framework. A key challenge in biologics manufacturing is linking bioreactor conditions, cell line characteristics and recombinant product production. A datahub for mammalian cell bioprocessing, integrated by semantic technologies, will serve as a tool to understand and query the connections between these complex datasets. While existing ontologies cover general biological and experimental concepts, they often lack the operational specificity required to harmonize bioreactor conditions, cell line engineering, and product quality metrics. To address this specific gap, we present the Mammalian Cell Bioprocessing Ontology (MCBO), a hub-and-spoke application ontology built on Basic Formal Ontology (BFO) foundations and anchored to the Industrial Ontology Foundry (IOF) Core. MCBO formalizes the process-participant-quality modeling pattern, enabling precise tracking of culture environmental conditions as qualities of the physical culture system. We demonstrate the utility of MCBO through a central datahub populated with 723 curated cell culture process instances and 325 unique bioprocess samples from published studies. The framework is validated against eight competency questions implemented via SPARQL, demonstrating efficient cross-study querying of culture optimization, cell line engineering, and multi-omics integration. By providing a stable, schema-independent substrate for data harmonization, MCBO enables AI agent-powered, human-in-the-loop workflows and facilitates LLM-assisted extraction of structured metadata from legacy records. MCBO is open-source and designed for deployment behind institutional firewalls to support interoperable biomanufacturing intelligence while maintaining intellectual property sensitivity. 
MCBO is supported by the International Biomanufacturing Network (IBioNe), which aims to accelerate discoveries and developments by providing a network of biomanufacturing training and workforce development to educate the next generation of biomanufacturing experts. Availability: https://github.com/lewiscelllabs/mcbo
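The process-participant-quality pattern and the SPARQL competency questions can be pictured with a toy in-memory triple store; all entity, class, and property names below (e.g. `mcbo:CultureProcess`, `mcbo:has_participant`) are invented placeholders for illustration, not MCBO's actual terms.

```python
# Toy triple store illustrating the process-participant-quality pattern and
# a competency-question-style query. All names here are invented placeholders,
# not MCBO's actual IRIs.
triples = {
    ("run42", "rdf:type", "mcbo:CultureProcess"),
    ("run42", "mcbo:has_participant", "cho_k1"),
    ("cho_k1", "rdf:type", "mcbo:CellLine"),
    ("run42", "mcbo:has_participant", "bioreactor1"),
    ("bioreactor1", "rdf:type", "mcbo:CultureSystem"),
    ("bioreactor1", "mcbo:has_quality", "temp_37C"),
    ("temp_37C", "rdf:type", "mcbo:Temperature"),
}

def match(subject=None, predicate=None, obj=None):
    """Return all triples matching the pattern (None acts as a wildcard)."""
    return [
        (s, p, o) for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]

# Competency question: which processes have a cell-line participant?
cell_lines = {s for s, _, _ in match(predicate="rdf:type", obj="mcbo:CellLine")}
processes = {s for s, _, o in match(predicate="mcbo:has_participant") if o in cell_lines}
```

A real deployment would express the same basic graph pattern as SPARQL against the populated datahub; the point of the sketch is only the pattern shape (process, participants, and a quality borne by the physical culture system).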
bioinformatics2026-01-05v1TAFFISH: A lightweight, modular, and containerized workflow framework for reproducible bioinformatics analyses
Han, K.; Wang, T.; Yuan, S.-S.; Ma, C.-Y.; Su, W.; Deng, K.; Li, X.; Lv, H.; Lin, H.AI Summary
- TAFFISH is a lightweight, modular framework for reproducible bioinformatics analyses, using containerized components to ensure consistency across different systems.
- It employs a shell-native DSL for flexible workflow design and includes over 60 standardized modules in the TAFFISH-Hub repository.
- A demonstration showed identical results across four different environments when analyzing Arabidopsis thaliana P450 protein sequences with BLAST, confirming high portability and reproducibility.
Abstract
Bioinformatics workflows often involve multiple tools and steps, with complex dependencies and environment-specific differences, making results difficult to reproduce across different systems. TAFFISH introduces a moderately engineered framework positioned between raw shell scripts and heavyweight workflow systems. It encapsulates each analysis tool, together with its fixed runtime environment and software version, into a modular, "Lego-like" component. It employs a lightweight, shell-native domain-specific language (DSL) to balance standardized environments with flexible workflow design. Each module in TAFFISH consists of a container image hosted on GitHub Packages and a corresponding script, and the system automatically pulls and runs the appropriate container when invoked through a unified interface. This design ensures a consistent interface, environment, and analysis results across Linux, macOS, and Windows (via WSL), significantly reducing cross-platform configuration overhead. We have developed a library of standardized modules for more than 60 widely used bioinformatics tools in the TAFFISH-Hub repository. As a demonstration, we applied a BLAST-based analysis of Arabidopsis thaliana P450 protein sequences and executed the same workflow on four different hardware/OS environments, obtaining identical results in all cases. This confirms that TAFFISH achieves high portability and reproducibility while preserving the flexibility of shell scripting. TAFFISH establishes a novel paradigm for constructing reusable, modular bioinformatics workflows, serving as an essential bridge between ad-hoc shell scripts and complex workflow frameworks, and enabling researchers to rapidly build and share reproducible analysis pipelines.
bioinformatics2026-01-04v2EssTFNet: Integration of Adaptive Time-Frequency and DNA Language Models for Interpretable Human Essential Gene Prediction
Ye, D.; Yuan, S.; Su, W.; Zhang, H.; Li, R.; Qi, Y.; Dong, N.; Lin, H.; Jin, Y.AI Summary
- The study introduces EssTFNet, a deep learning framework that integrates adaptive time-frequency analysis with DNA language models to predict human essential genes.
- EssTFNet uses ATFNet to convert DNA and protein sequences into time-series signals, enhancing feature extraction for better prediction accuracy.
- The model achieved an AUC of 0.8336 and AUPR of 0.8212, outperforming existing methods, and used DeepLIFT to identify functional motifs related to gene essentiality.
Abstract
Essential genes are defined as those that are indispensable for an organism's survival. The loss of function of these genes results in cell death or an inability to complete the normal life cycle. Research on essential genes is pivotal in elucidating the origin and evolution of life, as well as in identifying potential therapeutic targets. Therefore, accurately predicting essential genes is of great scientific importance and has many applications in basic research and the biomedical field. In this study, we propose EssTFNet, a novel and interpretable deep learning framework that combines adaptive time-frequency analysis with a DNA language model to achieve accurate prediction of human essential genes while enabling mechanistic biological interpretation. Specifically, EssTFNet leverages the architecture of ATFNet, which innovatively maps DNA and protein sequences into equivalent time-series signals to extract periodic and non-stationary features, thereby enhancing the model's capacity to capture complex sequence patterns. Through effective feature selection and architectural optimization, EssTFNet strikes a favorable balance among prediction accuracy, model interpretability, and cross-tissue generalization capability, significantly outperforming current mainstream sequence-based deep learning methods with an AUC of 0.8336 and an AUPR of 0.8212. Additionally, the DeepLIFT attribution method was employed to identify functional motifs closely associated with gene essentiality, offering valuable insights for experimental validation. For the convenience of researchers, we have developed an easy-to-use web server and made it available, along with the source code, in a GitHub repository: https://github.com/QIANJINYDX/EssTFNet. Overall, this study presents a potentially useful methodological framework for human essential gene prediction, which could provide valuable insights for future research and applications in this field.
bioinformatics2026-01-04v1NetMedGPT - A network medicine foundation model for extensive disease mechanism mining and drug repurposing
Firoozbakht, F.; Suwer, S.; Elkjaer, M. L.; Handy, D. E.; Maier, A.; Li, J.; Lancashire, L.; Loscalzo, J.; Baumbach, J.AI Summary
- NetMedGPT is a transformer-based foundation model trained on a large biomedical knowledge graph to enable zero-shot inference across various drug discovery tasks.
- It outperformed specialized baselines in predicting drug associations with indications, targets, adverse reactions, contraindications, and off-label uses, with gains of 2.2% to 26% in precision-recall curve area.
- External validation showed NetMedGPT's effectiveness in prioritizing clinically relevant drug-disease pairs and generating mechanistically plausible subnetworks for biological insights.
Abstract
Network medicine leverages large biomedical knowledge graphs (KGs) to model disease mechanisms and identify therapeutic opportunities. However, most deep learning approaches that use KGs in biomedicine remain task-specific, limiting their ability to generalize across diverse applications within a unified framework. Here, we introduce NetMedGPT, a transformer-based foundation model trained on a large-scale biomedical KG using masked token prediction. By learning contextualized representations of biomedical nodes, NetMedGPT enables unified, zero-shot inference across different drug discovery tasks. Specifically, in five tasks, i.e., predicting the association of drugs with indications, targets, adverse drug reactions, contraindications, and off-label uses, NetMedGPT consistently outperforms all specialized baselines, achieving gains in area under the precision-recall curve of between 2.2% and 26%. When evaluated on independent external datasets, NetMedGPT outperformed baselines on an expert-curated clinical indications set and also preferentially prioritized clinically relevant drug-disease pairs in ClinicalTrials.gov. NetMedGPT's generative capability further supports the construction of mechanistically plausible subnetworks offering biological insights. NetMedGPT provides a unified foundation model for network medicine that supports scalable hypothesis generation and has the potential to accelerate drug repurposing. We further provide an interactive interface (https://prototypes.cosy.bio/chatnetmedgpt/) that allows users to obtain model inferences through natural-language queries.
bioinformatics2026-01-04v1The Influence of Ligands on AlphaFold3 Prediction of Cryptic Pockets
Lazou, M.; Tuchscherer, F.; Vajda, S.; Joseph-McCarthy, D.AI Summary
- The study investigates how AlphaFold 3 (AF3) predicts cryptic pockets in proteins, focusing on the influence of ligands.
- AF3 can generate conformational ensembles that include cryptic pockets, especially when a ligand is provided, leading to predictions where the ligand binds correctly in the cryptic site.
- The choice of ligand affects predictions, and generating multiple co-folded models is crucial for accurately predicting binding modes.
Abstract
Cryptic pockets are binding sites that are formed or exposed upon a conformational change. They represent an important class of potentially druggable binding sites. Reliably predicting cryptic pockets capable of binding ligands, however, remains a challenge. Herein we examine the use of AlphaFold 3 (AF3) for generating realistic conformational ensembles that include known cryptic pockets. We find that AF3 is generally able to reproduce the scale of conformational change required for cryptic site formation. When given a cryptic-site ligand for the protein, AF3 predominantly predicts conformations competent to bind the ligand in the cryptic site; without the ligand, conformations lacking the cryptic pocket generally dominate. While the results may reflect a bias toward memorized structural priors, the level of detrimental memorization appears to be limited. We also show that the choice of the ligand can significantly impact the predictions, and that AF3 is able to produce models with the ligand correctly positioned. Variability in ligand position, however, suggests that generating ensembles of co-folded predictions is critical to enhancing the likelihood of obtaining a correct binding mode. Overall, AF3-generated protein-ligand structural ensembles have potential utility in cryptic-site drug discovery, and they can reveal ligands likely to bind to those sites.
bioinformatics2026-01-04v1Decoupled Representation Learning Improves Generalization in CRISPR Off-Target Prediction
Bhargava, N.; Goswami, A.AI Summary
- The study addresses the generalization issue in CRISPR-Cas9 off-target prediction by proposing a two-stage deep learning framework that decouples sequence representation learning from off-target classification.
- In the first stage, guide RNA sequences are encoded using pretrained transformer-based embeddings; in the second, these embeddings are integrated with additional features for prediction.
- The approach significantly improves generalization performance on the TrueOT benchmark, showing consistent enhancements in ROC-AUC and PR-AUC, demonstrating the effectiveness of representation transfer in handling distribution shifts.
Abstract
Computational prediction of CRISPR-Cas9 off-target activity is a critical step toward the safe and scalable design of guide RNAs. Most existing deep learning models are trained on large proxy datasets derived from high-throughput assays; however, their performance often degrades when evaluated on experimentally validated off-target sites. This limitation raises concerns about the generalization ability of proxy-trained models and highlights the need for architectures that better transfer to biologically relevant benchmarks such as TrueOT. We propose a two-stage deep learning framework that decouples sequence representation learning from off-target classification. In the first stage, guide RNA sequences are encoded using pretrained transformer-based embeddings learned from large genomic corpora. In the second stage, a hybrid neural network integrates these embeddings with mismatch-level and sequence-pair features to predict off-target activity. Models are trained exclusively on a proxy dataset and evaluated under strict external validation on the TrueOT benchmark. Our results show that incorporating pretrained sequence embeddings substantially improves generalization performance. The full two-stage model achieves consistent improvements in ROC-AUC and PR-AUC on TrueOT, relative to proxy-only baselines, across repeated runs. We demonstrate a consistent generalization effect under distribution shift, robust to training noise, seeds, and evaluation variance. These findings provide empirical evidence that representation transfer plays a critical role in bridging the gap between proxy assay data and experimentally validated off-target behavior.
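The two-stage decoupling can be sketched in miniature: a frozen encoder (stage 1) feeding a classifier over embeddings plus mismatch-level features (stage 2). Everything below is illustrative, not the paper's model: the pseudo-random `embed` stands in for a pretrained genomic language model, and the guide/target pairs are invented.

```python
import math
import zlib

# Two-stage sketch: a frozen "encoder" (stage 1) plus a logistic classifier
# over embeddings and mismatch features (stage 2). Purely illustrative.
def embed(seq, dim=8):
    """Deterministic pseudo-embedding seeded by CRC32 of the sequence
    (a stand-in for a pretrained transformer embedding)."""
    x = zlib.crc32(seq.encode())
    out = []
    for _ in range(dim):
        x = (1103515245 * x + 12345) % (2 ** 31)   # linear congruential step
        out.append(x / (2 ** 31) - 0.5)
    return out

def mismatches(guide, target):
    return sum(a != b for a, b in zip(guide, target))

def features(guide, target):
    # Stage-2 input: frozen embeddings plus a mismatch-level feature.
    return embed(guide) + embed(target) + [float(mismatches(guide, target))]

def predict(w, x):
    z = max(-60.0, min(60.0, sum(wi * xi for wi, xi in zip(w, x))))
    return 1.0 / (1.0 + math.exp(-z))

def train(data, lr=0.5, epochs=300):
    """Plain SGD on the logistic loss; only stage 2 has trainable weights."""
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        for x, y in data:
            p = predict(w, x)
            w = [wi + lr * (y - p) * xi for wi, xi in zip(w, x)]
    return w

# Toy "proxy dataset": perfect matches are active, mismatched pairs are not.
pairs = [("ACGTACGT", "ACGTACGT", 1), ("ACGTACGT", "TTGTACCA", 0),
         ("GGGTACGT", "GGGTACGT", 1), ("GGGTACGT", "GACTCCGA", 0)]
data = [(features(g, t), y) for g, t, y in pairs]
w = train(data)
```

The design point the sketch makes is the one the abstract argues: the encoder is trained (or pretrained) separately and kept frozen, so only the small stage-2 head sees the proxy labels, which is what allows the representation to transfer to external benchmarks such as TrueOT.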
bioinformatics2026-01-04v1Explaining how mutations affect AlphaFold predictions
Clore, M. F.; Thole, J. F.; Dontha, S.; Sharma, P.; Jensen, D.; Volkman, B.; Coudron, M.; Porter, L.AI Summary
- Researchers developed CAAT to identify how mutations affect AlphaFold's protein structure predictions by analyzing amino acid positions.
- CAAT revealed that AlphaFold uses simple, sparse amino acid patterns to select protein conformations, with significant prediction changes when these positions are modified.
- Experimental validation confirmed that mutations at positions identified by CAAT had a greater impact on protein structure than those not identified.
Abstract
Transformer models, neural networks that learn context by identifying relationships in sequential data, underpin many recent advances in artificial intelligence. Nevertheless, their inner workings are difficult to explain. Here, we find that a transformer model within the AlphaFold architecture uses simple, sparse patterns of amino acids to select protein conformations. To identify these patterns, we developed a straightforward algorithm called Conformational Attention Analysis Tool (CAAT). CAAT identifies amino acid positions that affect AlphaFold's predictions substantially when modified. These effects are corroborated by experiments in several cases. By contrast, modifying amino acids ignored by CAAT affects AlphaFold predictions less, regardless of experimental ground truth. Our results demonstrate that CAAT successfully identifies the positions of some amino acids important for protein structure, narrowing the search space required to make effective mutations and suggesting a framework that can be applied to other transformer-based neural networks.
bioinformatics2026-01-04v1Benchmarking algorithms for RNA velocity inference
Huang, K.; Zhou, Y.; Wang, T.; Li, X.; Zhao, X.; Liu, X.; Huang, L.; Zhou, X.; Liu, J.AI Summary
- This study benchmarks 29 RNA velocity inference algorithms across 114 simulated and 62 real scRNA-seq datasets, extending to spatial and multi-omics data.
- Performance was evaluated in four dimensions: accuracy, scalability, stability, and usability, revealing no single method is universally optimal.
- The findings guide the selection of RNA velocity tools based on data type, available priors, and computational constraints, highlighting scalability and sensitivity as current development bottlenecks.
Abstract
RNA velocity is a computational framework for single-cell RNA sequencing (scRNA-seq) that estimates the future transcriptional state of individual cells, thereby capturing the direction and rate of cell state transitions rather than providing a purely static snapshot. Since its introduction in 2018, multiple RNA velocity methods have been developed, differing in their modeling assumptions, required inputs, computational complexity, and robustness. However, there remains limited consensus on how best to evaluate these methods or on which tools are most reliable under specific biological and technical settings. Here, we perform a systematic comparison of 29 velocity inference algorithms across 114 simulated datasets with known ground-truth cell dynamics and 62 real scRNA-seq datasets, and we extend the evaluation to spatial and multi-omics levels where velocity is increasingly applied. We benchmark RNA velocity methods using a unified framework that decomposes performance into four practical dimensions: accuracy, scalability, stability, and usability. Our results show that performance rankings vary substantially across metrics and datasets, indicating that no single method is uniformly optimal and that practical deployment is often constrained by feasibility and robustness as much as by accuracy. Based on these results, we provide actionable guidance for selecting RNA velocity tools according to data modality, available priors, and computational constraints. Finally, we identify key bottlenecks that currently limit RNA velocity development and deployment, including scalability to large datasets, sensitivity to gene selection, and the lack of genuinely multimodal and spatially explicit velocity models for spot-based technologies.
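Most of the benchmarked tools build on the standard splicing kinetics model, with unspliced counts u and spliced counts s evolving as du/dt = alpha - beta*u and ds/dt = beta*u - gamma*s. In the common steady-state variant (beta fixed to 1), velocity reduces to v = u - gamma_hat*s, with gamma_hat fit by regression through the origin. A minimal sketch with invented toy counts:

```python
# Steady-state RNA velocity sketch: with beta fixed to 1, velocity is
# v = u - gamma_hat * s, where gamma_hat is the slope of u ~ gamma * s
# fit through the origin on cells assumed near steady state.
def fit_gamma(u, s):
    """Least-squares slope of u = gamma * s through the origin."""
    return sum(ui * si for ui, si in zip(u, s)) / sum(si * si for si in s)

def velocity(u, s):
    g = fit_gamma(u, s)
    return [ui - g * si for ui, si in zip(u, s)]

# Toy counts for one gene across four cells: a cell lying above the
# steady-state line (more unspliced than expected) gets positive velocity,
# i.e. the gene is being induced in that cell.
u = [0.2, 0.5, 1.0, 1.6]   # unspliced
s = [0.4, 1.0, 2.0, 2.0]   # spliced
v = velocity(u, s)
```

The benchmarked methods differ mainly in how they relax this picture: gene-shared versus gene-specific kinetics, likelihood-based latent time, deep generative variants, and so on, which is why their rankings diverge across datasets.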
bioinformatics2026-01-03v1PeptiVerse: A Unified Platform for Therapeutic Peptide Property Prediction
Zhang, Y.; Tang, S.; Chen, T.; Mahood, E.; Vincoff, S.; Chatterjee, P.AI Summary
- PeptiVerse is introduced as a unified platform for predicting properties of therapeutic peptides, accommodating both canonical sequences and chemically modified peptide SMILES.
- It leverages large foundational models to provide state-of-the-art performance in diverse property prediction tasks.
- The platform offers a web interface and open-source implementation, facilitating early-stage peptide therapeutic development and property-aware design.
Abstract
Therapeutic peptides combine the advantages of small molecules and antibodies, offering target flexibility and low immunogenicity, yet their successful translation requires careful evaluation of multiple developability properties beyond binding alone. As chemically modified peptides become increasingly common in drug design, no unified platform currently supports systematic property assessment across both canonical sequences and SMILES-based representations. Leveraging the generalizability of large foundational models trained on protein and chemical data, we introduce PeptiVerse, a universal therapeutic peptide property prediction platform. PeptiVerse accepts either amino acid sequences or chemically modified peptide SMILES, delivers state-of-the-art performance across diverse property prediction tasks, and provides both a web interface and open-source implementation for rapid, accessible, and scalable peptide developability analysis. By unifying property prediction across representations, PeptiVerse directly supports early-stage peptide therapeutic development campaigns and property-aware generative design workflows. PeptiVerse Interface: https://huggingface.co/spaces/ChatterjeeLab/PeptiVerse
bioinformatics2026-01-03v1Unveiling Gene Regulatory Network Dynamics using Fuzzy Clustering
Kollyfas, R.; Cagna, M.; Nicaise, A. M.; Vallier, L.; Mohorianu, I. I.AI Summary
- Flufftail, an R framework with a Shiny app, uses fuzzy clustering to address the limitations of current methods in capturing continuous expression dynamics in single-cell analyses.
- It aggregates iterative stochastic partitions to compute membership probabilities, consistency scores, and consensus matrices, leading to both fuzzy and crisp cell assignments.
- Applied to single-nuclei and spatial transcriptomics, Flufftail highlights transitional cell populations and infers state-dependent regulatory network dynamics linked to phenotypes.
Abstract
Partitioning cells into robust, reproducible clusters is a core step across single-cell-resolution analyses; current state-of-the-art approaches struggle with capturing and summarising dynamics on continuous expression patterns. We present Flufftail (Fuzzy Logic Unifying Framework reveals Transcriptional Architectures summarised via Integrated Learning), an R framework and interactive Shiny app that consolidates clustering uncertainty by aggregating iterative stochastic partitions generated from a constant input. Flufftail computes per-cell membership probabilities, element-centric consistency scores, consensus matrices, and collapsed hard/crisp cell-assignments; we also leverage fuzzy assignments to prioritise genes that might act as regulatory hubs, subsequently using these as anchor points to infer gene regulatory network dynamics across transitions. We showcase the approach on single-nuclei and spatial transcriptomic case studies, illustrating how fuzzy clustering highlights transitional cell populations, proposing an ordered, state-dependent rewiring of regulatory interactions directly linked to the observed phenotype.
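The aggregation-of-stochastic-partitions idea can be sketched as consensus clustering: run a cheap stochastic partitioner many times on the same input and record how often each pair of cells co-clusters. The partitioner and the one-dimensional toy data below are stand-ins for illustration, not Flufftail's actual algorithm or API.

```python
import random

# Consensus clustering sketch: membership probabilities from repeated
# stochastic partitions of a constant input.
def stochastic_partition(points, k, rng):
    """One crude stochastic partition: assign each point to the nearest of
    k randomly sampled seed points (stands in for a full clustering run)."""
    seeds = rng.sample(points, k)
    return [min(range(k), key=lambda j: abs(p - seeds[j])) for p in points]

def consensus(points, k, runs=200):
    n = len(points)
    co = [[0] * n for _ in range(n)]
    rng = random.Random(0)
    for _ in range(runs):
        labels = stochastic_partition(points, k, rng)
        for i in range(n):
            for j in range(n):
                co[i][j] += labels[i] == labels[j]
    # Entry (i, j): fraction of runs in which points i and j co-cluster.
    return [[c / runs for c in row] for row in co]

# Two tight groups plus one intermediate, "transitional" point (index 6):
# its intermediate co-clustering frequencies are what a fuzzy/consensus
# view surfaces and a single hard partition hides.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 2.6]
C = consensus(points, k=2)
```

Cells within a tight group co-cluster in almost every run, cells in different groups almost never, and the transitional point lands in between, which is exactly the signal used to flag transitional populations.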
bioinformatics2026-01-02v1Cross-View Latent Integration via Nonparametric Gamma Shrinkage Factor Analysis
Akell, H.; Lazecka, M.; Adhithya Haridoss, D.; Urban, M.; Staub, E.; Szczurek, E.AI Summary
- CLING is a new unsupervised multi-view factor model using hierarchical Bayesian sparsity priors to address challenges in multi-omic data analysis.
- It employs a product-of-Gammas prior and Gamma-Gamma local-precision hierarchy to automatically select factors and induce selective sparsity.
- Testing on synthetic and real multi-omics datasets, CLING outperformed baselines by recovering more accurate factors and identifying relevant biological pathways in glioblastoma data.
Abstract
Factor analysis is a dominant paradigm for multi-omic heterogeneous data, but is challenged by partially redundant signals and noise across views and by an unknown true number of factors. We present CLING, an unsupervised multi-view factor model with hierarchical Bayesian sparsity priors: a product-of-Gammas prior inducing cumulative column-wise shrinkage (increasing with factor index) coupled with a Gamma-Gamma local-precision hierarchy on loadings yielding heavy-tailed marginals. This pairing enables automatic factor selection by adaptively deactivating unsupported factors while retaining active ones during inference, and induces selective sparsity that allows salient loadings to escape shrinkage while collapsing negligible ones. As a fully conjugate hierarchical model, CLING admits a scalable variational inference algorithm for multi-view data. Across synthetic benchmarks and multi-omics datasets, CLING recovers more accurate factors and more informative loadings while explaining at least as much variance as competitive multi-view baselines; on glioblastoma gene expression and DNA methylation data, CLING identifies pathways linked to tumor subtype and patient age.
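The cumulative column-wise shrinkage described here resembles the multiplicative gamma process prior of Bhattacharya and Dunson; in that standard form (notation ours, not necessarily CLING's exact parameterization), the precision of factor k is a product of Gamma increments, so shrinkage stochastically grows with the factor index:

```latex
% Global, column-wise precision: a product of Gamma increments, so the
% precision (and hence shrinkage) increases with the factor index k:
\tau_k = \prod_{l=1}^{k} \delta_l, \qquad \delta_l \sim \mathrm{Gamma}(a_l, 1)
% A Gamma-Gamma local-precision hierarchy on each loading gives
% heavy-tailed marginals, letting salient loadings escape shrinkage:
\phi_{jk} \mid \eta_{jk} \sim \mathrm{Gamma}(a, \eta_{jk}), \qquad
\eta_{jk} \sim \mathrm{Gamma}(c, d)
% Loadings combine the local and column-wise precisions:
w_{jk} \mid \phi_{jk}, \tau_k \sim \mathcal{N}\bigl(0,\, (\phi_{jk}\,\tau_k)^{-1}\bigr)
```

Under such a construction, later factors are shrunk ever harder (automatic factor selection), while the heavy-tailed local term allows individual large loadings to survive (selective sparsity), matching the two behaviours the abstract attributes to CLING.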
bioinformatics2026-01-02v1Phylogenetic based dissection of eukaryotic Mo-insertase functionality: From mechanism to complex assembly
Schmidt, T. J.; Hassan, A. H.; Pucker, B.; Kruse, T.AI Summary
- The study investigates the evolutionary and functional aspects of eukaryotic molybdenum insertases (Mo-insertases) using phylogenetic analysis, sequence analysis, and structural modeling.
- It was found that most eukaryotic Mo-insertases have fused E- and G-domains with varying orientations, but the E-domain active site remains highly conserved.
- Vertebrate gephyrin, unique among Mo-insertases, serves dual roles in metabolism and scaffolding neurotransmitter receptors, showing extreme conservation, suggesting additional functional constraints.
Abstract
Molybdenum cofactor (Moco) biosynthesis is vitally important for all organisms, yet the domain organization of the eukaryotic molybdenum insertase (Mo-insertase) remains enigmatic. We combine extensive phylogenetic reconstructions, sequence analysis and structural modeling in order to uncover evolutionary and functional principles of eukaryotic Mo-insertases. We note that the vast majority of plant, fungal and animal species evolved fused E- and G-domains, yet the orientation of the two domains in the fusion proteins differs among eukaryotic lineages. Despite the divergent domain arrangements amongst eukaryotic Mo-insertases, the E-domain active site is well conserved, with very few tolerated substitutions in >1,000 sequences. Among the Mo-insertases from different eukaryotic species, vertebrate gephyrin is the only one with a dual function: in addition to its metabolic role, it scaffolds inhibitory neurotransmitter receptors at the postsynapse. Gephyrin is surprisingly highly conserved, including surface patches not directly involved in catalysis and receptor clustering. This profile suggests additional, as yet uncharacterized, functional constraints on gephyrin's evolution. Together, our results reveal how eukaryotic Mo-insertases combine plasticity in domain organization with stringent active-site conservation, and show that the evolutionary constraint on gephyrin's surface is extreme, likely owing to its dual metabolic and neuronal functions.
bioinformatics2026-01-02v1Clustering Compresses Attractors in Watts-Strogatz Threshold Boolean Networks
Alqarni, M.; Cooper, M.; Donovan, D. M.; Lefevre, J. G.AI Summary
- The study investigates if higher clustering in Watts-Strogatz threshold Boolean networks shortens attractor periods, using 330 directed, signed networks of sizes N=10-100 and mean degrees kbar=2-10.
- Findings show that a 0.10 increase in clustering coefficient C reduces the geometric mean period by 13-15%, with a shift from C=0.000 to C=0.460 reducing it by about 50%.
- The effect of clustering on reducing attractor periods persists across different model specifications and is not due to an increase in fixed points.
Abstract
Does higher clustering shorten attractor periods? We examine whether the global clustering coefficient C, a direct measure of triangle density, predicts attractor lengths in synchronous, signed-threshold Boolean networks on Watts-Strogatz (WS) graphs. We generate 330 directed, signed WS networks spanning sizes N=10-100 and mean degrees kbar=2-10, simulate dynamics from M=100 random initial states per graph with exact cycle detection, and summarise each graph by the average log attractor period (equivalently, the geometric mean period). Our primary analysis relates this log-period summary to C while adjusting for N, kbar, the mean directed shortest path length (MSP), and including nonlinear size-degree and clustering-degree interactions. Main finding: Higher clustering robustly shortens attractor periods: a 0.10 increase in C (C in [0,1]) corresponds to about a 13-15% lower expected geometric mean period, and moving from C=0.000 to C=0.460 yields about a 50% reduction, holding other properties fixed. The effect persists when the linear C term is replaced by a nonlinear function of C, and it replicates in held-out graph instances (graphs not used to fit the model). Shorter periods are not explained by an increase in fixed points under the strict comparator (>); rather, higher triangle density shifts mass from long cycles to medium-length cycles. Significance: In threshold-like logic, settling speed and oscillatory stability are central to computation and control. Our results provide a direct, quantitative link between triangle density and these long-run behaviours, showing that C acts as a structural lever on temporal complexity. All figures and tables are reproducible from the accompanying code, data, and analysis scripts.
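The dynamics being measured can be reproduced in a few lines: synchronous updates under the strict comparator (>) with signed weights, and exact cycle detection by recording visited states. The 3-node activation ring below is an illustrative toy, not one of the paper's 330 networks.

```python
# Synchronous signed-threshold Boolean dynamics with exact cycle detection,
# mirroring the update rule described (strict comparator ">").
def step(state, weights):
    """x_i(t+1) = 1 if sum_j w[i][j] * x_j(t) > 0, else 0."""
    n = len(state)
    return tuple(
        1 if sum(weights[i][j] * state[j] for j in range(n)) > 0 else 0
        for i in range(n)
    )

def attractor_period(state, weights):
    """Iterate until a state repeats; the gap between the two visits is the
    exact attractor period (1 = fixed point)."""
    seen = {}
    t = 0
    while state not in seen:
        seen[state] = t
        state = step(state, weights)
        t += 1
    return t - seen[state]

# Activation ring 0 -> 1 -> 2 -> 0: a single excitatory cycle that
# sustains a period-3 oscillation from state (1, 0, 0).
W = [[0, 0, 1],   # node 0 activated by node 2
     [1, 0, 0],   # node 1 activated by node 0
     [0, 1, 0]]   # node 2 activated by node 1
```

Note that under the strict comparator the all-zero state is always a fixed point (every weighted sum is 0, which is not > 0), which is why the paper checks whether shorter periods are merely an artifact of more fixed points.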
bioinformatics2026-01-02v1scDeepVariant: A population-informed deep learning framework for germline variant calling in scRNA-seq
Buralkin, I.; Chen, H.; Park, J.; Liu, Z.AI Summary
- scDeepVariant (scDV) is a deep learning framework for germline variant calling in scRNA-seq, adapted from DeepVariant and trained on paired WGS and scRNA-seq data.
- Incorporating allele frequency data from gnomAD or the 1000 Genomes Project into scDV enhances its performance, particularly in detecting rare variants.
- scDV with allele frequency channels outperformed existing methods like Monopogen in precision and recall at coverage depths above 10 reads.
Abstract
Single-cell RNA sequencing (scRNA-seq) provides unprecedented resolution of cellular heterogeneity while also capturing information on germline genetic variation, but accurate variant calling remains limited by sparse coverage, allelic imbalance, and RNA-specific artifacts. Existing single-cell methods, including cellSNP, scAllele, and Monopogen, address some of these challenges, yet either suffer from low sensitivity and precision or rely on linkage disequilibrium (LD) priors that restrict performance on rare variants. Here, we introduce single-cell DeepVariant (scDV), a deep learning-based framework adapted from DeepVariant and trained on paired WGS and scRNA-seq data. We show that scDV can be effectively trained on sparse single-cell data and that augmenting models with allele frequency information from gnomAD or the 1000 Genomes Project consistently improves performance. Across benchmarks, scDV with allele frequency channels achieved higher precision and recall than standard six-channel configurations, surpassing Monopogen at coverage depth above 10 reads and demonstrating a pronounced advantage in rare variant detection, where LD-based refinement is most limited. These results establish scDV as a robust alternative for germline variant discovery from scRNA-seq and highlight the broader value of integrating population-scale information into deep learning frameworks for transcriptomic variant calling.
bioinformatics2026-01-02v1ALLCatchR, a machine-learning classifier, now identifies 20 T-ALL subtypes across cohorts and age groups
Beder, T.; Wolgast, N.; Walter, W.; Bendig, S.; Hartmann, A. M.; Barz, M. J.; Zaliova, M.; Reitzel, E.; Baden, D.; Schwartz, S. M.; Gökbuget, N.; Kester, L.; Trka, J.; Haferlach, C.; Brüggemann, M.; Baldus, C. D.; Neumann, M.; Bastian, L.AI Summary
- The study aimed to establish a gene expression framework for T-ALL subtyping by analyzing 2,314 transcriptomes from 15 cohorts across various ages, identifying 20 subtypes including a novel "clonal hematopoiesis-related" subtype.
- A machine learning classifier, ALLCatchR, was developed to identify these subtypes with high accuracy (0.995-1.0) in validation and 92.7% in hold-out datasets.
- The classifier also provides functionalities like lineage separation, subtyping, blast estimation, and developmental annotation, enhancing T-ALL classification across different studies.
Abstract
T cell acute lymphoblastic leukemia (T-ALL) comprises molecularly diverse subtypes, which currently lack robust cross-cohort validation and operational gene expression definitions. To establish a gene expression anchored framework for T-ALL subtyping, we aggregated 2,314 transcriptomes (15 cohorts, age: 0.8 to 90.8 years). An extended unsupervised approach defined 17 main clusters and 3 sub-clusters in high blast fraction samples. Supervised analysis added an overarching immature "ETP-like" definition and resolved the LMO2 gd-like subtype. All clusters were populated by samples from at least two cohorts. Characteristic genomic driver enrichment agreed across cohorts, while gene expression clusters did not correspond exclusively to single driver events but also reflected developmental origins. A machine learning classifier based on ALLCatchR - our B-ALL classifier - identified these 21 transcriptomic definitions with 0.995-1.0 accuracy in a validation set (n=203). Testing the classifier on a hold-out data set (n=265 samples) showed that 92.7% of predictions matched with corresponding driver alterations. Across all samples, 88.5% of cases were high-confidence predictions, 6.5% were candidate predictions, and 5.0% remained unclassified, largely due to low blast fractions. We identified a novel gene expression cluster markedly enriched (P<0.001) for clonal hematopoiesis mutations (IDH2 R140Q, DNMT3A) and a stem-/progenitor cell-like gene expression. This novel "clonal hematopoiesis-related" T-ALL subtype was observed in six cohorts representing 8.9% of adults and 39.5% of patients aged >50 years. We advanced ALLCatchR, a free R package, which now enables B- /T- lineage separation, gene-expression subtyping, blast estimation, and developmental annotation to harmonize T-ALL classification across studies and clinical contexts.
bioinformatics2026-01-02v1BioTrouble: A Multi-Agent Workflow for Troubleshooting Molecular Biology Techniques
Ameri, M.; Yousefabadi, H.; Ramezani, A.AI Summary
- BioTrouble is a multi-agent AI workflow designed to assist in troubleshooting molecular biology experiments like PCR and cloning.
- It uses a retrieval-augmented generation framework with small language models and a smart routing system to manage costs.
- Compared to state-of-the-art large language models, BioTrouble provides comparable troubleshooting recommendations while learning from user feedback to enhance its knowledge base.
Abstract
Troubleshooting is a critical yet often underdocumented aspect of molecular biology experiments across laboratories. Failures in core techniques such as PCR, qPCR, molecular cloning, and related assays can lead to experimental failure, wasted resources, and delays in research progress. Here, we present BioTrouble, a multi-agent AI workflow designed to assist researchers in troubleshooting a wide range of molecular biology experiments. It leverages a custom-designed troubleshooting knowledge base through a retrieval-augmented generation (RAG) framework. BioTrouble employs small language models to generate the troubleshooting plan and utilizes a smart model routing system to manage cost per request. User interactions and feedback are stored as structured cases, enabling BioTrouble to expand its troubleshooting knowledge base and improve response generation over time. Compared with single-model state-of-the-art (SOTA) LLMs, BioTrouble generated comparable troubleshooting recommendations using small language models.
bioinformatics · 2026-01-02 · v1
In Silico analysis of the structural and functional impact of deleterious nsSNPs in the human RETREG1 gene associated with congenital sensory neuropathy type II
Alsied, M. E. M.; Abdelhameed Abbas, T. A.; Mohammed, S. A.
AI Summary
- This study aimed to identify pathogenic nsSNPs in the RETREG1 gene associated with HSAN II using in silico tools.
- Five nsSNPs (Y221C, G216R, G211R, L119V, W107C) were identified as deleterious, with four predicted to decrease protein stability and one to increase it.
- Structural analysis showed these mutations affect the Reticulon Homology Domain, causing structural disruptions and impairing protein function, providing insights into HSAN II pathogenesis.
Abstract
Background: Mutations in the RETREG1 gene are known to cause Hereditary Sensory and Autonomic Neuropathy type II (HSAN II), a severe congenital disorder affecting sensory neurons. However, the full spectrum of pathogenic single nucleotide polymorphisms (SNPs) and their specific structural consequences remain incompletely characterized. Objectives: This study aimed to elucidate pathogenic nsSNPs and their role in congenital sensory neuropathy (HSAN II) through in silico analysis. Method: The nsSNPs of RETREG1 were retrieved from the dbSNP database at NCBI. A range of in silico tools (SIFT, PolyPhen-2, SNP&GO, PHD-SNP, SNAP2, I-Mutant, Project HOPE, MutPred, ConSurf, Phyre2, Chimera, and GeneMANIA) was used to predict pathogenicity, protein stability, evolutionary conservation, structural alterations, and protein-protein interaction networks for the RETREG1 gene. Result: Five nsSNPs (Y221C, G216R, G211R, L119V, W107C) were identified as "damaging" or deleterious by these tools. Four SNPs (Y221C, G211R, L119V, and W107C) were predicted to decrease protein stability, while the fifth (G216R) was expected to increase it. Structural modeling revealed that all five mutations are located within the critical Reticulon Homology Domain (RHD), where they are predicted to cause steric clashes, disrupt hydrophobic packing, and impair protein-membrane interactions. Conclusion: This integrated in silico analysis identifies four novel deleterious nsSNPs in RETREG1 (W107C, L119V, G211R, Y221C) and confirms the established G216R variant. These mutations are predicted to impair RETREG1 structure and function, providing mechanistic insight into HSAN II pathogenesis and prioritizing candidates for future experimental validation.
bioinformatics · 2026-01-02 · v1
Explainability methods from machine learning detect important drugs' atoms in drug-target interactions
Mahindran, M.; Liu, Q.; Kadambalithaya, V. M.; Kalinina, O. V.
AI Summary
- The study benchmarks four explainable AI (XAI) methods on GNN models for predicting drug-target interactions (DTI) with kinases and GPCRs, focusing on interpretability.
- Consistency among methods was assessed using atom-level intersection-over-union, with biological relevance validated by mapping to 3D structures.
- Consensus attributions showed high enrichment for atoms in direct contact with binding pockets (up to 76% within 2Å), often involving key regulatory residues, demonstrating the utility of XAI in identifying significant drug features.
Abstract
Predicting drug-target interactions (DTI) with graph neural networks (GNNs) is hindered by their lack of interpretability. To address this, we benchmark four explainable artificial intelligence (XAI) attribution methods on GNN models trained for kinase and GPCR targets. We assess the methods' consistency through atom-level intersection-over-union and validate their biological relevance by mapping attributed atoms to 3D protein-ligand structures. While consistency across methods was modest, consensus attributions were highly enriched for atoms directly contacting the binding pocket: up to 76% within 2 Å in the kinase-inhibitor complexes. Notably, these attributed atoms were frequently found contacting experimentally important regulatory residues, such as those in the DFG motif. This indicates that XAI methods, despite their disagreements, can identify chemically meaningful ligand features, providing a foundation for developing more interpretable GNNs in drug discovery.
bioinformatics · 2026-01-02 · v1
HPVarcall: Calling lineages and sublineages for partial DNA sequences of human papillomavirus
Lomsadze, A.; Borodovsky, M.
AI Summary
- HPVarcall is a computational method that assigns HPV DNA sequences to specific lineages and sublineages using statistical models based on positional frequency profiles.
- The method involves aligning sequences, constructing phylogenetic trees, and using sublineage-specific models to determine the most probable sublineage for a query sequence.
- Testing on nine HPV types from Gardasil 9 showed high accuracy, with low error rates for sequences over 1000 nucleotides.
Abstract
We describe a computational method, HPVarcall, that assigns DNA sequences of a human papillomavirus (HPV) variant of known type to lineages and sublineages. The algorithm relies on statistical models - positional frequency profiles - trained on multiple alignments of HPV genomic sequences that are known to belong to specific sublineages of a given HPV type. The workflow begins with multiple alignment of all available sequences for the HPV type, followed by construction of a phylogenetic tree and identification of branches containing sublineage-specific reference sequences. In the prediction phase, sublineage-specific statistical models are used to compute the posterior probabilities for each sublineage given a query sequence. The query is assigned to the sublineage with the highest posterior probability. Accuracy assessments performed for the nine HPV types included in the Gardasil 9 vaccine demonstrated a low error rate in assigning HPV genomic fragments of at least 1000 nucleotides to their correct sublineages, and even higher accuracy for longer sequence fragments.
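The prediction phase described above reduces to scoring the query under each sublineage's positional frequency profile and taking the maximum a posteriori class. A minimal sketch, assuming uniform priors and toy profiles (the `profiles` structure below is an illustration, not the actual HPVarcall models):

```python
import math

def classify(query, profiles, pseudo=1e-4):
    """Assign an aligned sequence to the sublineage with the highest posterior.

    profiles: {sublineage_name: [ {base: frequency} per aligned position ]}
    Assumes the query is already aligned to the profile coordinates;
    gap characters ('-') are skipped. A pseudocount avoids log(0).
    """
    log_likes = {}
    for name, profile in profiles.items():
        ll = 0.0
        for pos, base in enumerate(query):
            if base == "-":
                continue
            ll += math.log(profile[pos].get(base, 0.0) + pseudo)
        log_likes[name] = ll
    # With a uniform prior, the posterior is proportional to the likelihood.
    m = max(log_likes.values())
    weights = {k: math.exp(v - m) for k, v in log_likes.items()}
    total = sum(weights.values())
    posteriors = {k: w / total for k, w in weights.items()}
    best = max(posteriors, key=posteriors.get)
    return best, posteriors
```

With longer fragments, more positions contribute to the log-likelihood, which is consistent with the higher accuracy the abstract reports for longer sequences.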
bioinformatics · 2026-01-02 · v1
Scaling Large Language Models for Next-Generation Single-Cell Analysis
Rizvi, S. A.; Levine, D.; Patel, A.; Zhang, S.; Wang, E.; Perry, C. J.; Vrkic, I.; Constante, N. M.; Fu, Z.; He, S.; Zhang, D.; Tang, C.; Lyu, Z.; Darji, R.; Li, M.; Sun, E.; Jeong, D.; Zhao, L.; Kwan, J.; Braun, D.; Hafler, B.; Chung, H.; Dhodapkar, R.; Jaeger, P.; Perozzi, B.; Ishizuka, J.; Azizi, S.; van Dijk, D.
AI Summary
- This study scales the Cell2Sentence (C2S) framework to train a 27 billion parameter Large Language Model (LLM) on over one billion tokens of transcriptomic, biological text, and metadata for single-cell RNA sequencing analysis.
- The scaled model, C2S-Scale, shows enhanced predictive and generative capabilities, supporting tasks like perturbation response prediction and complex biological reasoning.
- Experimental validation confirmed C2S-Scale's prediction of silmitasertib as a candidate for context-selective upregulation of antigen presentation in human cell models.
Abstract
Single-cell RNA sequencing has transformed our understanding of cellular diversity, yet current single-cell foundation models (scFMs) remain limited in their scalability, flexibility across diverse tasks, and ability to natively integrate textual information. In this work, we build upon the Cell2Sentence (C2S) framework, which represents scRNA-seq profiles as textual "cell sentences," to train Large Language Models (LLMs) on a corpus comprising over one billion tokens of transcriptomic data, biological text, and metadata. Scaling the model to 27 billion parameters yields consistent improvements in predictive and generative capabilities and supports advanced downstream tasks that require synthesis of information across multi-cellular contexts. Targeted fine-tuning with modern reinforcement learning techniques produces strong performance in perturbation response prediction, natural language interpretation, and complex biological reasoning. This predictive strength enabled a dual-context virtual screen that nominated the kinase inhibitor silmitasertib (CX-4945) as a candidate for context-selective upregulation of antigen presentation. Experimental assessment in human cell models unseen during training supported this prediction, demonstrating that C2S-Scale can effectively guide the discovery of context-conditioned biology. C2S-Scale unifies transcriptomic and textual data at unprecedented scales, surpassing both specialized single-cell models and general-purpose LLMs to provide a platform for next-generation single-cell analysis and the development of "virtual cells."
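The "cell sentence" idea underlying C2S, ranking a cell's genes by expression and emitting the gene symbols as text, can be sketched in a few lines (the gene names and `top_k` cutoff below are illustrative; the C2S-Scale tokenization is more elaborate):

```python
def cell_to_sentence(expression, top_k=5):
    """Render an scRNA-seq profile as a rank-ordered 'cell sentence'.

    expression: {gene_symbol: count}. Genes are sorted by decreasing
    expression (ties broken alphabetically for determinism); the top_k
    expressed gene names are joined into a space-separated string that
    an LLM can consume as ordinary text.
    """
    ranked = sorted(expression.items(), key=lambda kv: (-kv[1], kv[0]))
    return " ".join(gene for gene, count in ranked[:top_k] if count > 0)
```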
bioinformatics · 2026-01-01 · v4
yallHap: Modern Y-chromosome haplogroup inference with probabilistic scoring and ancient DNA support
Hardie, A.
AI Summary
- yallHap is a Y-chromosome haplogroup classifier that uses the YFull phylogenetic tree, probabilistic scoring, and ancient DNA damage filtering to improve classification accuracy.
- Validation on modern datasets showed 99.9% and 99.8% accuracy for gnomAD and 1000 Genomes samples respectively, while ancient DNA analysis achieved up to 90.7% accuracy with Bayesian mode.
- The tool supports multiple reference genomes, provides detailed quality metrics, and is designed for integration into bioinformatics pipelines with multi-threaded processing capabilities.
Abstract
The human Y chromosome enables detailed reconstruction of paternal lineages through haplogroup classification. Existing tools for this purpose typically rely on outdated phylogenies, lack ancient DNA handling, or provide limited confidence metrics. Here I present yallHap, a Y-chromosome haplogroup classifier that integrates the YFull phylogenetic tree (185,780 SNPs) with probabilistic scoring, built-in ancient DNA damage filtering, and parallel processing for population-scale studies. Validation on 1,231 high-coverage gnomAD samples achieved 99.9% accuracy (95% CI: 99.5-100%) on GRCh38, and 1,233 samples from 1000 Genomes Phase 3 achieved 99.8% accuracy (95% CI: 99.3-100%). For ancient DNA with moderate variant density (4-10%), Bayesian ancient mode achieves +19.3 pp improvement over heuristic mode (+12 to +24 pp at 1% increments; see Supplementary Table S3), reaching 60-86% accuracy. On full AADR ancient DNA validation (7,333 samples spanning ~45,000 years), this translates to 90.7% overall accuracy (95% CI: 90.0-91.3%) versus 88.3% for heuristic transversions-only mode. At variant densities ≥10%, both modes reach 97-99% accuracy. yallHap supports multiple reference genomes (GRCh37, GRCh38, T2T-CHM13v2.0), provides detailed quality metrics including optional ISOGG nomenclature output, and offers multi-threaded batch processing for large-scale studies. The tool is designed for integration into modern bioinformatics pipelines, with example wrappers for nf-core/eager [16,17] and Snakemake [18] workflows. The software is open source, available at https://github.com/trianglegrrl/yallHap, and distributed via pip, Bioconda, and Docker.
bioinformatics · 2026-01-01 · v2
Ancestral intronic splicing regulatory elements in the SCNα gene family
Chernyavskaya, E.; Vorobeva, M.; Spirin, S. A.; Skvortsov, D. A.; Pervouchine, D.
AI Summary
- The study explored the evolutionary history and splicing regulation of exons 5a and 5b in the SCN gene family, finding that exon 5 duplication dates back to a common ancestor.
- Analysis across tissues, tumors, and developmental stages showed that splicing is regulated by intronic splicing regulatory elements (ISRE) and Rbfox2 binding sites, not primarily by nonsense mediated decay.
- Mutagenesis and antisense oligonucleotide experiments confirmed that RNA structure formation between ISREs and Rbfox2 activity promote exon 5b skipping in SCN9A, suggesting an ancient regulatory mechanism conserved across species.
Abstract
SCN genes encode components of voltage-gated sodium channels that are crucial for generating electrical signals. Humans have ten paralogous SCN genes, some of which contain duplicated mutually exclusive exons 5a and 5b. In reconstructing their evolutionary history, we found multiple unannotated copies of exon 5 in distant species and showed that exon 5 duplication goes back to a common ancestor of the SCN gene family. We characterized splicing patterns of exons 5a and 5b across tissues, tumors, and developmental stages, and demonstrated that the nonsense mediated decay (NMD) system is not the major factor contributing to their mutually exclusive choice. Comparison of SCN2A, SCN3A, SCN5A, and SCN9A intronic nucleotide sequences revealed multiple Rbfox2 binding sites and two highly conserved intronic splicing regulatory elements (ISRE) that are shared between paralogs. Minigene mutagenesis and blockage by antisense oligonucleotides showed that the formation of RNA structure between ISRE promotes exon 5b skipping in SCN9A. The inclusion of exon 5b is also suppressed upon siRNA-mediated knockdown of Rbfox2, which makes the collective action of RNA structure and Rbfox2 compatible with the model of a structural RNA bridge. ISRE sequences are conserved from human to elephant shark and may represent an ancient, evolutionarily conserved regulatory mechanism. Our results demonstrate the power of comparative sequence analysis applied to paralogs for elucidating splicing regulatory programs.
bioinformatics · 2026-01-01 · v2
Combinatorial Optimization of Antibody Libraries via Constrained Integer Programming
Hayes, C. F.; Goncalves, A.; Magana-Zook, S. A.; Pettit, J.; Solak, A. C.; Faissol, D.; Landajuela, M.
AI Summary
- The study addresses the combinatorial challenge of designing antibody libraries by proposing an integer linear programming (ILP) method that optimizes for diversity and affinity.
- The method uses AI-guided mutational fitness profiling to predict binding scores, integrating protein language models and inverse folding tools.
- Testing on Trastuzumab, D44.1, and Spesolimab showed that the optimized libraries had superior predicted affinity and sequence diversity compared to baseline designs.
Abstract
Designing effective antibody libraries is a challenging combinatorial search problem in computational biology. We propose a novel integer linear programming (ILP) method that explicitly controls diversity and affinity objectives when generating candidate libraries. Our approach formulates library design as a constrained optimization problem, where diversity parameters and predicted binding scores are encoded as ILP constraints and objectives. Predicted binding scores are obtained via AI-guided mutational fitness profiling, which combines protein language models and inverse folding tools to evaluate mutational effects. We demonstrate the method on cold-start design tasks for Trastuzumab, D44.1, and Spesolimab, showing that our optimized libraries outperform baseline designs in both predicted affinity and sequence diversity. This hybrid search-and-learning framework illustrates how constrained optimization and predictive modeling can be combined to deliver interpretable, high-quality solutions to antibody library engineering. Code is available at https://github.com/llnl/protlib-designer.
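The constrained selection the abstract formulates, maximizing predicted binding while enforcing diversity, can be illustrated with a brute-force toy version (the real method encodes this as an ILP in protlib-designer; the Hamming-distance diversity rule, inputs, and exhaustive search below are simplifying assumptions for illustration):

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def design_library(candidates, scores, k, min_dist):
    """Pick k sequences maximizing total predicted binding score, subject
    to every selected pair differing in at least min_dist positions.

    Brute force over all k-subsets; an ILP solver handles the same
    objective and constraints at realistic library sizes.
    """
    best, best_score = None, float("-inf")
    for subset in combinations(range(len(candidates)), k):
        if any(hamming(candidates[i], candidates[j]) < min_dist
               for i, j in combinations(subset, 2)):
            continue  # diversity constraint violated
        total = sum(scores[i] for i in subset)
        if total > best_score:
            best, best_score = subset, total
    return [candidates[i] for i in best], best_score
```

In the ILP formulation, the subset choice becomes binary decision variables and the pairwise diversity rule becomes linear constraints, which is what makes the search tractable beyond toy sizes.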
bioinformatics · 2025-12-31 · v3
MetaPaCS: A Novel Meta-Learning Framework for Pancreatic Cancer Subtype Identification
Peterson, N. B.; Sun, M.; Wu, X.; Wang, J.; Wan, S.
AI Summary
- The study introduces MetaPaCS, a meta-learning framework for identifying pancreatic cancer (PaC) subtypes using transcriptomics data.
- MetaPaCS uses 10 base machine learning classifiers to generate ensemble feature vectors, which are then processed by a meta-learning model.
- Validation showed MetaPaCS significantly outperformed existing methods and individual classifiers in PaC subtype identification, enhancing risk stratification and personalized treatment design.
Abstract
As the third leading cause of cancer-related deaths in the United States, pancreatic cancer (PaC) is a highly heterogeneous malignancy that can be divided into a multitude of potential subtypes, with the four main subtypes being aberrantly differentiated endocrine exocrine (ADEX), immunogenic, progenitor, and squamous. Each PaC subtype is characterized by its unique molecular pathways and therapeutic characteristics. Identifying PaC molecular subtypes is essential for downstream patient risk stratification and tailored treatment design. Conventional wet-lab approaches for PaC subtyping like microdissection, histopathological studies or molecular profiling are often laborious, costly and time-consuming. To address these concerns, we present MetaPaCS, a novel meta-learning framework to accurately identify PaC subtypes based on transcriptomics data only. Specifically, after preprocessing, the transcriptome-based feature vectors were classified by 10 base machine learning (ML) classifiers, whose prediction outputs were then combined with the initial preprocessed feature vectors to constitute a new set of ensemble feature vectors for a meta-learning model. Our meta-learning model could learn and leverage the diversity of different base classifiers to boost the prediction performance beyond any single ML model. Results based on 100 times ten-fold cross validation tests on benchmarking datasets demonstrated that MetaPaCS performed significantly better than existing state-of-the-art methods for PaC subtyping. In addition, our meta-learning model remarkably outperformed each individual base classifier, demonstrating that MetaPaCS could combine diverse results from multiple base classifiers to boost the ensemble performance. We believe that MetaPaCS is a promising tool for characterizing PaC subtypes and will have positive impacts on downstream risk stratification and personalized treatment design for PaC patients.
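The stacking construction described above, appending base classifiers' prediction outputs to the preprocessed features to form ensemble feature vectors, can be sketched as follows (the callables stand in for MetaPaCS's 10 trained base classifiers; this is an illustration, not the published implementation):

```python
def stack_dataset(X, base_models):
    """Create ensemble feature vectors for a whole dataset.

    X: list of preprocessed transcriptome feature vectors.
    base_models: list of callables, each mapping a feature vector to a
    probability vector over the four subtypes (ADEX, immunogenic,
    progenitor, squamous). The meta-learner is trained on the
    concatenation, so it sees both the raw signal and every base
    model's prediction.
    """
    return [list(x) + [p for model in base_models for p in model(x)]
            for x in X]
```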
bioinformatics · 2025-12-31 · v2
Computational Discovery of CRISPR-Cas13b Guide RNAs for Broad-Spectrum Dengue Virus Targeting
Naqvi, S. M. A.; Khan, F.; Muslim, S. M.
AI Summary
- The study developed a computational pipeline and machine learning framework to design effective CRISPR-Cas13b guide RNAs targeting all four Dengue virus serotypes.
- The approach involved genomic data analysis, conservation analysis, and an optimization module, focusing on specific Cas13b design rules.
- Classical machine learning models were found to outperform deep learning models in predicting guide RNA efficiency, identifying potent pan-serotype guides.
Abstract
Dengue (DENV), an RNA virus, remains a significant global health threat, particularly in developing regions, with no widely effective antiviral therapy available. The CRISPR-Cas13b system, specifically the PspCas13b subtype, has emerged as a promising programmable antiviral tool capable of targeting viral RNA with high specificity. However, the efficacy of Cas13b-based interventions relies heavily on the design of potent and conserved CRISPR RNA (crRNA) spacer sequences, a task complicated by high viral genetic diversity. Unlike CRISPR-Cas9, which targets double-stranded DNA in eukaryotic genomes, Cas13b directly targets single-stranded RNA, making it ideally suited for RNA virus therapeutics; however, existing computational tools predominantly focus on Cas9 DNA targeting or Cas13d for mammalian transcript knockdown, leaving a significant gap for Cas13b-specific antiviral design. In this paper, we propose a computational pipeline and machine learning framework for the rational design of high-efficacy Cas13b guide RNAs targeting all four Dengue serotypes. Our approach integrates large-scale genomic data extraction, conservation analysis, and a novel in silico optimization module for guide RNA (gRNA) sequences, based on recently reported Cas13b design rules (e.g., 5' GG motif preference, Cytosine penalties). To predict targeting efficiency, we benchmark classical machine learning models (Random Forest, XGBoost) against foundation model-based predictors (Nucleotide Transformer, RNA-FM) using a dataset of experimentally validated spacers. Our results demonstrate that classical feature-engineered models significantly outperform deep learning approaches when trained on experimentally validated gRNA datasets in low-data regimes. We identify highly conserved, optimized crRNA candidates, including several pan-serotype guides with predicted high potency.
This work establishes a baseline for Cas13b efficiency prediction and provides a robust computational resource for accelerating the development of CRISPR-based antivirals against Dengue and other RNA viruses.
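The design rules named above (5' GG motif preference, cytosine penalties) lend themselves to a simple scoring sketch; the numeric weights below are invented placeholders, not fitted values from the paper:

```python
def score_guide(spacer, gg_bonus=2.0, c_penalty=0.25):
    """Toy Cas13b spacer score: reward a 5' GG motif, penalize cytosines.

    The rule set follows the design rules named in the text; the
    weights are arbitrary placeholders for illustration. Higher
    scores suggest a more promising spacer under these two rules.
    """
    score = 0.0
    if spacer.startswith("GG"):
        score += gg_bonus  # 5' GG motif preference
    score -= c_penalty * spacer.count("C")  # cytosine penalty
    return score
```

A real pipeline would combine such sequence rules with conservation across serotype alignments before ranking candidates.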
bioinformatics · 2025-12-31 · v1
Scalable Non-negative Matrix Factorization of the Human Cell Census Reveals Interpretable Transcriptional Programs
Liu, Y.-T.; Triche, T. J.; DeBruine, Z. J.
AI Summary
- The study applied Nonnegative Matrix Factorization to create an interpretable reference embedding of 28.5 million healthy cells from the Human Cell Census, focusing on 60,000 genes.
- This embedding defines transcriptional programs that align with cell types and biological pathways, allowing new datasets to be projected into this space without retraining.
- The approach was validated by projecting an independent cystic fibrosis dataset, demonstrating its utility for exploratory analysis and hypothesis generation.
Abstract
Large single-cell atlases now span tens of millions of cells, yet few provide reusable and interpretable reference representations that support direct biological reasoning at atlas-scale. Here, we present an interpretable Nonnegative Matrix Factorization reference embedding of 28.5 million healthy cells and approximately 60,000 genes from the Human Cell Census. The resulting gene and cell weights define additive transcriptional programs that align with annotated cell types and organized biological pathways. New datasets can be projected into this fixed reference space without fine-tuning or retraining, as we demonstrate using an independent cystic fibrosis dataset. This resource provides a transparent coordinate system for exploratory analysis and hypothesis generation, complementing deep embeddings that prioritize integration or prediction with a representation designed for interpretability and reuse.
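Projecting new cells into a fixed nonnegative basis, as described above, amounts to a nonnegative least-squares fit per cell. A minimal sketch using multiplicative updates (the basis and dimensions are toy values; the published embedding spans roughly 60,000 genes and 28.5 million cells):

```python
def project_cell(v, W, n_iter=200, eps=1e-9):
    """Project one cell's expression vector v (length g) onto a fixed
    nonnegative basis W (g x k gene-program matrix, as nested lists)
    via multiplicative updates, so new data land in the reference
    space without retraining. Returns nonnegative program weights h.
    """
    g, k = len(W), len(W[0])
    h = [1.0] * k
    for _ in range(n_iter):
        # Current reconstruction W @ h.
        Wh = [sum(W[i][j] * h[j] for j in range(k)) for i in range(g)]
        for j in range(k):
            num = sum(W[i][j] * v[i] for i in range(g))
            den = sum(W[i][j] * Wh[i] for i in range(g)) + eps
            h[j] *= num / den  # multiplicative update keeps h >= 0
    return h
```

Because W stays fixed, every projected dataset shares the same interpretable coordinate system, which is the reuse property the abstract emphasizes.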
bioinformatics · 2025-12-31 · v1
Phylogenetic tree-aware positive-unlabeled deep metric learning for phage-host interaction identification
Zhang, Y.-z.; Tobias, B.; Imoto, S.
AI Summary
- The study proposes a phylogenetic tree-aware positive-unlabeled deep metric learning framework to identify phage-host interactions (PHIs), addressing the limitations of traditional methods by incorporating host phylogenetic relationships.
- This approach learns representations from confirmed positive PHIs and phylogenetic constraints, enhancing prediction accuracy and cross-host generalization.
- Experiments on the Cherry and metagenome Hi-C datasets show improved species-level prediction and more interpretable phage-host relationship representations.
Abstract
Phages are viruses that infect bacteria and play essential roles in shaping microbial communities. Identifying phage-host interactions (PHIs) is crucial for understanding infection dynamics and developing phage-based therapeutic strategies. Recent deep learning approaches have shown great promise for PHI prediction; however, their performance remains constrained by the limited number of experimentally validated positive pairs and the overwhelming abundance of unlabeled or non-validated samples. Moreover, most existing models overlook higher-level phylogenetic relationships among hosts, which could provide valuable structural priors for guiding representation learning. To address these challenges, we propose a phylogenetic tree-aware positive-unlabeled deep metric learning framework for phage-host interaction (PHI) identification. Unlike traditional approaches that train classification models to strictly separate positive and negative phage-host pairs, the proposed method learns representations under supervision from both confirmed positive PHIs and host phylogenetic tree constraints on non-positive samples. The proposed method can seamlessly formalize contrastive learning and deep metric learning within the same framework that explicitly optimizes PHI encoders with biological constraints in the learning functions. We show that this metric learning formulation outperforms conventional contrastive learning approaches that enforce separation between positive and negative samples without consistently aligning the learned representations with evolutionary distances. Experiments on the Cherry benchmark dataset and metagenome Hi-C multi-host dataset demonstrate that our approach enhances species-level prediction accuracy, improves cross-host generalization, and yields more interpretable representations of phage-host relationships.
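One plausible way to combine metric learning with phylogenetic constraints, as the framework above does, is to scale a triplet margin by tree distance, so non-positive hosts that are phylogenetically close to the confirmed host are penalized less. This is an illustrative formulation under stated assumptions, not the authors' exact objective:

```python
import math

def euclid(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def phylo_triplet_loss(anchor, pos, neg, tree_dist, base_margin=1.0):
    """Phylogeny-aware triplet loss for one (phage, host+, host-) triple.

    anchor: phage embedding; pos: confirmed host embedding;
    neg: non-positive host embedding; tree_dist: phylogenetic distance
    between the two hosts. The separating margin grows with tree_dist,
    so closely related hosts may remain close in embedding space.
    """
    margin = base_margin * tree_dist
    return max(0.0, euclid(anchor, pos) - euclid(anchor, neg) + margin)
```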
bioinformatics · 2025-12-31 · v1
Metabolite Fraction Libraries for Quantitative NMR Metabolomics
Esselman, C.; Garrison, K.; Ponce, L.; Borges, R. M.; Delaglio, F.; Edison, A. S.
AI Summary
- The study introduces a metabolite fraction library (mFL) approach for quantitative NMR metabolomics to address spectral overlap in 1D proton NMR.
- An algorithm was developed to create a metabolite basis set (mBS) from mFL, which was used to fit and quantify NMR data.
- Applied to 10 mixtures of 53 metabolites, the method accurately quantified 50 metabolites, and when used on Neurospora crassa, identified 90 metabolites with high to medium confidence, accounting for 94% of spectral intensity.
Abstract
Nuclear Magnetic Resonance (NMR) has unique strengths in metabolomics studies, particularly in quantifying mixtures and elucidating the structures of unknown molecules. One-dimensional (1D) proton (1H) NMR is the most common method; however, spectral overlap is significant, making analysis challenging. We present a new approach that utilizes chromatographically separated fractions from a pooled sample, henceforth called a metabolite fraction library (mFL). We developed an algorithm to extract highly correlated peaks from the mFL, collectively forming a metabolite basis set (mBS). The mBS can be fit to NMR profiling data, enabling comprehensive quantification. Applied to 10 mixtures of 53 metabolites, our approach accurately quantified 50, quantified an impurity and an oxidation product, and described between 91-96% of total spectral intensity. The method is demonstrated using the fungus Neurospora crassa, resulting in the identification of 45 metabolites with high confidence, 45 with medium confidence, and accounting for 94% of total spectral intensity.
bioinformatics · 2025-12-31 · v1
An Evidence-Grounded Research Assistant for Functional Genomics and Drug Target Assessment
Sokolova, K.; Kosenkov, D.; Nallamotu, K.; Vedula, S.; Sokolov, D.; Sapiro, G.; Troyanskaya, O. G.
AI Summary
- Alvessa is an evidence-grounded research assistant designed to enhance functional genomics and drug target assessment by integrating entity recognition, biological tools, and data-constrained answer generation with verification against records.
- Evaluated on dbQA from LAB-Bench and GenomeArena, Alvessa showed improved accuracy over general-purpose models and comparable performance to coding-centric agents, with fully traceable outputs.
- The system's ability to detect fabricated statements relies on access to retrieved evidence, and it has been applied to drug discovery, identifying candidate targets overlooked by traditional literature-based methods.
Abstract
The growing availability of biological data resources has transformed research, yet their effective use remains challenging: selecting appropriate sources requires domain knowledge, data are fragmented across databases, and synthesizing results into reliable conclusions is labor-intensive. Although large language models promise to address these barriers, their impact in biomedicine has been limited by unsupported statements, incorrect claims, and lack of provenance. We introduce Alvessa, an evidence-grounded agentic research assistant designed around verifiability. Alvessa integrates entity recognition, orchestration of pre-validated biological tools, and data-constrained answer generation with statement-level verification against retrieved records, explicitly flagging unsupported claims and guiding revision when reliability criteria are not met. We evaluate Alvessa on dbQA from LAB-Bench and GenomeArena, a benchmark of 720 questions spanning gene and variant annotation, pathways, molecular interactions, miRNA targets, drug-target evidence, protein structure, and gene-phenotype associations. Alvessa substantially improves accuracy relative to general-purpose language models and performs comparably to coding-centric agents while producing fully traceable outputs. Using adversarial perturbations, we show that detection of fabricated statements depends critically on access to retrieved evidence. We further demonstrate application to drug discovery, where evidence-grounded synthesis enables identification of candidate targets missed or misattributed by literature-centered reasoning alone. Alvessa and GenomeArena are released to the community to support reproducible, verifiable AI-assisted biological research.
bioinformatics · 2025-12-31 · v1
CausalGRN: deciphering causal gene regulatory networks from single-cell CRISPR screens
Yu, B.; Liu, D.; Qi, G.; Huangfu, D.; Hsu, L.; Shojaie, A.; Sun, W.
AI Summary
- CausalGRN is a computational framework designed to infer causal gene regulatory networks (GRNs) from single-cell CRISPR screens with scRNA-seq data.
- It uses adaptive thresholding to correct for spurious correlations in sparse data, constructs an undirected graph, and orients it based on perturbation outcomes.
- CausalGRN outperforms existing methods in accurately reconstructing networks and predicting effects of novel perturbations in both simulations and experimental datasets.
Abstract
Large-scale single-cell CRISPR screens with single-cell RNA-seq (scRNA-seq) readouts provide critical data to map causal gene regulatory networks (GRNs). However, translating the complex scRNA-seq outputs into reliable causal insights remains a major analytical challenge. Here we present CausalGRN, a scalable computational framework that infers causal GRNs and predicts cellular responses to unseen perturbations. CausalGRN first mitigates pervasive spurious partial correlations in sparse scRNA-seq data through a novel adaptive thresholding correction, enabling robust inference of an undirected graph. It then orients this graph using observed perturbation outcomes. The resulting directed GRN can be used to predict the downstream effects of novel perturbations via network propagation. Across both simulations and diverse experimental datasets, CausalGRN substantially outperforms existing approaches in network reconstruction accuracy and in predicting the effects of unseen perturbations, providing a principled bridge from perturbation data to causal gene regulation.
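The two-stage pipeline described above, a thresholded undirected skeleton followed by perturbation-based orientation, can be sketched as follows (the fixed threshold and dictionary inputs are simplifications; CausalGRN chooses its threshold adaptively and works from sparse scRNA-seq matrices):

```python
def build_causal_grn(pcorr, perturb_effects, tau):
    """Toy version of the two-stage inference the text describes.

    pcorr: {(gene_i, gene_j): partial correlation} (one entry per pair).
    perturb_effects: {(i, j): True} if perturbing gene i measurably
    changed gene j's expression in the CRISPR screen.
    tau: correlation threshold (a fixed stand-in for the adaptive rule).
    Returns a set of directed edges (i, j) meaning i -> j.
    """
    # Stage 1: undirected skeleton from thresholded partial correlations.
    skeleton = {frozenset(pair) for pair, r in pcorr.items() if abs(r) >= tau}
    # Stage 2: orient skeleton edges using perturbation outcomes.
    directed = set()
    for edge in skeleton:
        i, j = tuple(edge)
        if perturb_effects.get((i, j)):
            directed.add((i, j))
        elif perturb_effects.get((j, i)):
            directed.add((j, i))
    return directed
```

Predicting unseen perturbations then amounts to propagating effects along the directed edges, as the abstract notes.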
bioinformatics · 2025-12-31 · v1
Hippocampome.org, a resource for subicular neuron types and beyond
Tecuatl, C.; Ascoli, G. A.AI Summary
- Hippocampome.org classifies neurons in the rodent hippocampal formation based on axonal and dendritic morphology, providing detailed annotations on properties like neurotransmitters and firing patterns.
- The resource quantifies circuitry through connection probabilities and synaptic signals, linking all data to peer-reviewed evidence and computational models.
- A focus on the subiculum reveals that only 6 of the 180 neuron types are from this region, highlighting the need for further research to understand its neuronal organization and connectivity within the hippocampal formation.
Abstract
To establish the relationship between circuit organization and information processing, many neuroscientists find it useful to reason in terms of neuron types. Hippocampome.org uses axonal and dendritic morphology as a foundational approach to classify neurons in the rodent hippocampal formation, including the dentate gyrus, Cornu Ammonis, subiculum, and entorhinal cortex. For each identified neuron type, this open-access knowledge base annotates essential properties, such as main neurotransmitter, membrane biophysics, firing patterns, molecular expression, and cell counts. Moreover, Hippocampome.org quantifies circuitry in terms of directional connection probabilities and synaptic signals between interacting neuron types. All properties are directly linked to peer-reviewed experimental evidence and matched with best-fit computational models. The resulting online resource provides an effective reference for designing new experiments, analyses, and spiking neural network simulations. Here we illustrate the content and utility of Hippocampome.org with a focus on the subiculum, whose neuron-type organization has received relatively little attention. Only 6 of the 180 Hippocampome.org neuron types are from the subiculum, compared with more than 60 in the adjacent area CA1. Specifically, we analyze the local subicular circuit and its broader interaction with the hippocampal formation with respect to both anatomical connectivity and signal transfer. Our results exemplify the potential added value of data integration in neuronal classification, while also highlighting the need for further research to fill existing knowledge gaps.
bioinformatics · 2025-12-31 · v1
SurvMarker: An R Package for Identifying Survival-Associated Molecular Features Using PCA-Based Weighted Scores
Gu, T.; Gammune, D. H. V.AI Summary
- SurvMarker is an R package designed to identify survival-associated molecular features by using PCA-based weighted scoring.
- It aggregates feature loadings across survival-associated principal components and assesses significance against an empirical null distribution.
- The package is available under the MIT License, with documentation and source code accessible on GitHub.
Abstract
Summary: SurvMarker is an R package for identifying survival-associated features in high-dimensional molecular data using PCA-based weighted scoring. The method aggregates feature loadings across survival-associated principal components (PCs) and evaluates feature significance against a feature-specific empirical null distribution, enabling stable, parsimonious, and statistically calibrated prognostic feature selection. Availability and Implementation: SurvMarker is implemented in R and distributed under the MIT License. The package includes comprehensive documentation, a reference manual, and a reproducible workflow example, and is provided as Supplementary Material. Source code and documentation are openly available at the GitHub repository https://github.com/tjgu/SurvMarker.
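The scoring idea, weighting PC loadings by each PC's survival association and testing against a permutation null, can be sketched as follows. This is not the SurvMarker package (which is in R and uses survival models and a feature-specific null); here a simple |correlation| with survival time stands in for the PC-survival association, purely to illustrate the aggregation and permutation steps.

```python
import numpy as np

def survival_weighted_scores(X, time, n_perm=200, seed=0):
    """PCA-based weighted feature scores for survival association (sketch).

    X: (samples x features) molecular matrix; time: survival times.
    Each PC is weighted by |corr(PC score, time)| (a stand-in for a
    proper survival-model association), and each feature's score
    aggregates |loading| * weight over PCs. Significance is judged
    against an empirical null built by permuting the survival times.
    Returns (observed scores, permutation p-values).
    """
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores_pc = U * S                       # per-sample PC scores

    def weighted(t):
        w = np.abs([np.corrcoef(scores_pc[:, k], t)[0, 1]
                    for k in range(Vt.shape[0])])
        return np.abs(Vt.T) @ w             # aggregate |loading| * weight

    obs = weighted(time)
    null = np.array([weighted(rng.permutation(time)) for _ in range(n_perm)])
    pvals = (1 + (null >= obs).sum(axis=0)) / (n_perm + 1)
    return obs, pvals
```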
bioinformatics · 2025-12-31 · v1
Localized Reactivity on Protein as Riemannian Manifolds: A Geometric and Quantum-Inspired Basis for Deterministic, Metal-Aware Reactive-Site Prediction
Park, H.AI Summary
- The study presents a geometric, quantum-inspired framework for predicting reactive sites on proteins, treating them as configurations in 3D space with additional conditions.
- A deterministic, metal-aware approach was used to map protein structures to environment vectors, processing large complexes like ribosomes on standard hardware.
- The method achieved competitive performance in predicting protein interfaces and catalytic sites, with high ROC AUC values in various case studies, demonstrating its effectiveness without evolutionary data or retraining.
Abstract
We introduce a unified framework for analysing molecular reactivity based on a geometric, quantum-inspired environment representation and a fully deterministic, metal-aware implementation. Proteins and ribonucleoprotein complexes are treated as configurations in 3D space with an abstract condition axis (R^3 x T), and each residue or nucleotide p is mapped to an environment vector Ep that encodes a coarse-grained, DFT-inspired density surrogate together with metal and phosphate fields, solvent exposure, and local geometry. A block-streamed, GPU-optional Python pipeline maps arbitrary PDB or mmCIF structures to fixed-dimensional environment vectors without stochastic training and scales to supramolecular assemblies: the 6Q97 tmRNA-SmpB-ribosome rescue complex (11,618 residues) can be processed in a single pass on commodity cloud hardware, demonstrating practical feasibility at ribosome scale. In a strict unbound, zero-shot setting on the Docking Benchmark 5.5 (DB5.5), a simple classifier trained on top of Ep achieves a macro-averaged area under the precision-recall curve of approximately 0.53 and a ROC AUC of approximately 0.86 for residue-level interface versus non-interface classification, competitive with specialised interface-prediction architectures despite using no evolutionary profiles, multiple sequence alignments, or task-specific retraining on DB5.5. Across mechanistically curated case studies (Rubisco, GroEL/GroES, SecA, p53-DNA, and ribosomal pockets), untuned random forests used purely as probes under site-grouped cross-validation yield ROC AUC values exceeding 0.95 for catalytic and anchor cores (for example SecA ATPase and the GroES IVL anchor), while diffuse regulatory and fitness-defined labels are substantially harder to separate. 
For 6Q97, a Tier 1 and Tier 2 labelling scheme over tmRNA and SmpB pockets, decoding-centre rRNA, the 23S peptidyl transferase centre, and helicase-like uS3, uS4, and uS5 pockets, together with a curated hard-negative panel of 323 buried hydrophobic, electrostatic, and stacking decoys, yields global AUCs of approximately 0.94 (Tier 1 plus Tier 2 versus all) and approximately 0.98 (Tier 1 plus Tier 2 versus hard negatives). These results support the view that the environment representation defines an interpretable "reactivity manifold" in which genuinely functional pockets occupy regions that cannot be mimicked by generic dense or charged environments, and that this structure remains accessible even for full ribosomes on modest hardware.
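The residue-level ROC AUC values reported above reduce to a rank statistic: the probability that a randomly chosen positive residue outranks a randomly chosen negative one. A minimal self-contained version of that metric, independent of the paper's method, looks like this.

```python
import numpy as np

def roc_auc(scores, labels):
    """ROC AUC via the Mann-Whitney rank statistic.

    scores: per-residue prediction scores; labels: binary ground truth
    (e.g. interface vs non-interface). Ties get averaged ranks.
    """
    scores = np.asarray(scores, float)
    labels = np.asarray(labels, bool)
    order = scores.argsort()
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    for s in np.unique(scores):            # average ranks over tied scores
        m = scores == s
        ranks[m] = ranks[m].mean()
    n_pos, n_neg = labels.sum(), (~labels).sum()
    return (ranks[labels].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```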
bioinformatics · 2025-12-30 · v8
An Explainable Knowledge Graph driven approach to decipher the link Between Brain Disorders and the Gut Microbiome
Aamer, N.; Asim, M. N.; Vollmer, S.; Dengel, A.AI Summary
- The study investigates the mechanisms through which the gut microbiome influences brain disorders via the microbiome-gut-brain axis (MGBA) using an explainable graph neural network (GNN-GBA) trained on a biomedical knowledge graph.
- GNN-GBA identified pathways for 103 brain disorders, showing consistency with existing literature.
- An interactive dashboard is available for exploring these mechanisms.
Abstract
Motivation: The communication between the gut microbiome and the brain, known as the microbiome-gut-brain axis (MGBA), is emerging as a critical factor in neurological and psychiatric disorders. This communication involves complex pathways including neural, hormonal, and immune interactions that enable gut microbes to modulate brain function and behavior. However, the specific mechanisms through which gut microbes influence brain function remain poorly understood, and existing computational efforts to understand these mechanisms are simplistic or have limited scope. Results: This work presents a comprehensive approach for understanding these mechanisms by elucidating the cascade of interactions that allows gut microbes to influence brain disorders. By using a large curated biomedical knowledge graph, we train GNN-GBA, an explainable graph neural network, to learn the complex biological interactions between the gut microbiome and the brain. GNN-GBA is then used to extract the mechanistic pathways through which the gut microbiome can influence brain disorders. The network successfully identified pathways for 103 brain disorders, and we show that these pathways are consistent with existing literature. Availability: An interactive dashboard to explore thousands of potential mechanisms through which the gut microbiome can influence brain diseases is available at: https://sds-genetic-interaction-analysis.opendfki.de/gut brain/. Contact: naafey.aamer@cs.rptu.de
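The pathway-extraction idea, tracing directed chains from a microbe through intermediate entities to a brain disorder, can be illustrated by plain path enumeration over a knowledge graph. The entities and relations below are hypothetical examples, and GNN-GBA scores candidate mechanisms with a trained graph neural network rather than exhaustive enumeration.

```python
from collections import deque

def mechanistic_paths(edges, source, target, max_hops=4):
    """Enumerate short directed paths through a biomedical knowledge graph.

    edges: iterable of (head, relation, tail) triples (relations are
    kept in the input but ignored for traversal here). Returns every
    acyclic path of node names from source to target within max_hops.
    """
    adj = {}
    for h, _r, t in edges:
        adj.setdefault(h, []).append(t)
    paths, queue = [], deque([[source]])
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target and len(path) > 1:
            paths.append(path)
            continue
        if len(path) - 1 < max_hops:
            for t in adj.get(node, []):
                if t not in path:          # avoid cycles
                    queue.append(path + [t])
    return paths
```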
bioinformatics · 2025-12-30 · v2
Ultra-fast and highly sensitive protein structure alignment with segment-level representations and block-sparse optimization
Litfin, T.; Zhou, Y.; von Itzstein, M.AI Summary
- This study introduces SPfast, a method for protein structure alignment that is over 100 times faster than traditional methods, with increased sensitivity.
- SPfast was used to perform over 100 billion pairwise comparisons, revealing new insights into type III secretion in pathogenic bacteria and identifying novel toxin-antitoxin systems.
- Functional assignments made by SPfast were supported by genomic context and high-confidence AlphaFold3 complex modeling.
Abstract
Deep learning models for protein structure prediction have given rise to extreme growth in 3D structure data. As a result, traditional methods for geometric structure alignment are too slow to effectively search modern structure libraries. In this study we introduce SPfast - a fully geometric method for structure-based alignment which accelerates search by more than 2 orders of magnitude while increasing sensitivity by 21% and 5% compared with foldseek and TMalign respectively. Using the significant speed of SPfast to conduct more than 100B pairwise comparisons between bona fide uncharacterized proteins and a large-scale, annotated structure library uncovers new biological insights relating to type III secretion in pathogenic bacteria and identifies novel toxin-antitoxin systems. Putative SPfast-based functional assignments are supported by orthogonal evidence including shared genomic context and high-confidence AlphaFold3 complex modelling.
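The geometric core shared by structure aligners such as TMalign, optimal rigid-body superposition of matched coordinates, is the Kabsch algorithm. SPfast's segment-level representations and block-sparse search are its own contributions and are not reproduced here; this sketch shows only the standard superposition-RMSD step.

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD after optimal rigid superposition (Kabsch algorithm).

    P, Q: (n_atoms x 3) matched coordinate sets. Both are centered,
    the optimal rotation is found via SVD of the cross-covariance,
    and the sign correction prevents an improper rotation (reflection).
    """
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    H = P.T @ Q                              # cross-covariance
    U, S, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(U @ Vt))       # reflection guard
    R = U @ np.diag([1.0, 1.0, d]) @ Vt      # optimal rotation
    return float(np.sqrt(np.mean(np.sum((P @ R - Q) ** 2, axis=1))))
```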
bioinformatics · 2025-12-30 · v2
An Empirical Analysis of Fine-Tuning Large Language Models on Bioinformatics Literature: PRSGPT and BioStarsGPT
Muneeb, M.; Ascher, D.AI Summary
- The study presents a pipeline for fine-tuning large language models (LLMs) on bioinformatics data, focusing on PRSGPT for polygenic risk scores and BioStarsGPT for community forum discussions.
- Three LLMs were fine-tuned and evaluated on 14 metrics, with Qwen2.5-7B showing significant improvements in BLEU-4 and ROUGE-1 scores.
- PRSGPT achieved 61.9% accuracy in PRS tool comparison, while BioStarsGPT had 59% conceptual accuracy, demonstrating the pipeline's effectiveness in creating domain-specific bioinformatics assistants.
Abstract
Large language models (LLMs) often lack specialized knowledge for complex bioinformatics applications. We present a reproducible pipeline for fine-tuning LLMs on specialized bioinformatics data, demonstrated through two use cases: PRSGPT, focused on polygenic risk score (PRS) tools, and BioStarsGPT, trained on community forum discussions. The nine-step pipeline integrates diverse data sources, structured preprocessing, prompt-based question-answer (QA) generation (via Google Gemini), natural language inference (NLI) for quality control, semantic deduplication, clustering-based data splitting, and parameter-efficient fine-tuning using LoRA. We fine-tuned three LLMs (LLaMA-3.2-3B, Qwen2.5-7B, Gemma) and benchmarked them on over 14 lexical and semantic metrics. Qwen2.5-7B emerged as the best performer, with BLEU-4 and ROUGE-1 improvements of 82% and 70% for PRSGPT and 6% and 18% for BioStarsGPT, respectively. The open-source datasets produced include over 28,000 QA pairs for PRSGPT and 154,282 for BioStarsGPT. Human evaluation of PRSGPT yielded 61.9% accuracy on the PRS tools comparison task, comparable to Google Gemini (61.4%), but with richer methodological detail and accurate citations. BioStarsGPT demonstrated 59% conceptual accuracy across 142 curated bioinformatics questions. Our pipeline enables scalable, domain-specific fine-tuning of LLMs and supports privacy-preserving, locally deployable bioinformatics assistants; we also discuss their practical applications and the challenges, limitations, and mitigation strategies associated with their development and use.
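The BLEU-4 metric used above to compare the fine-tuned models is a geometric mean of modified 1- to 4-gram precisions with a brevity penalty. A minimal single-reference, sentence-level version (not the exact implementation used in the paper) can be written in pure Python.

```python
import math
from collections import Counter

def bleu4(candidate, reference):
    """Sentence-level BLEU-4 against a single reference.

    candidate, reference: lists of tokens. Computes clipped n-gram
    precisions for n = 1..4, takes their geometric mean, and applies
    the brevity penalty for candidates shorter than the reference.
    """
    precisions = []
    for n in range(1, 5):
        cand = Counter(tuple(candidate[i:i + n])
                       for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n])
                      for i in range(len(reference) - n + 1))
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / 4
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(log_mean)
```

Corpus-level BLEU, as typically reported, pools n-gram counts across sentences before combining, but the per-sentence form above shows the mechanics.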
bioinformatics · 2025-12-30 · v1
Generative Reconstruction of Unobserved Cellular Dynamics using Single-Cell Transcriptomic Trajectories
Ray, S.; Das Mandal, S.; Lall, S.; Pyne, S.AI Summary
- The study addresses the challenge of capturing intermediate cellular states between two developmental stages using single-cell transcriptomics.
- GRAIL, a computational framework, reconstructs these intermediate states by employing Locality Sensitive Hashing for cell selection and interpolation in a learned latent space, followed by decoding via a GAN generator.
- Benchmarking showed GRAIL's effectiveness in preserving marker gene expression and enhancing downstream analyses like differential expression and cell clustering.
Abstract
The primary challenge in studying cellular dynamics is capturing the intermediate states between two distinct biological stages, which are often missed because of technical constraints in sampling transient states or the nature of single-cell sequencing protocols. To address this challenge, we introduce a new computational framework, GRAIL (Generative Reconstruction of Artificial Intermediate Lineages), which aims to bridge these missing transitions. GRAIL reconstructs biologically plausible intermediate cell states between two distinct developmental stages (A and B) through a Locality Sensitive Hashing (LSH) based cell selection strategy followed by smooth interpolation in a learned latent space. The interpolated latent representations are subsequently decoded into gene expression profiles using a trained generator from a Generative Adversarial Network (GAN). The framework consists of three components: (a) a pretrained autoencoder that learns latent representations of stage-specific transcriptomes, (b) an LSH-guided interpolation procedure that identifies anchor cells and performs interpolation in latent space, and (c) a GAN generator that extrapolates realistic intermediate expression profiles from the interpolated samples, consistent with the underlying trajectory. We benchmarked GRAIL against state-of-the-art (SOTA) methods across diverse setups of simulated datasets as well as real-life scRNA-seq datasets. Generated samples from GRAIL preserve expected marker gene expression patterns and also improve downstream analyses, including differential expression and cell clustering. Our method addresses a critical gap in studying cellular transitions when experimental intermediate samples are unavailable.
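The interpolation step between anchor cells of stages A and B can be sketched as simple linear interpolation in latent space. This is only the middle component of the pipeline described above: the anchor selection (LSH) and the GAN decoding of these codes into expression profiles are not reproduced here.

```python
import numpy as np

def interpolate_latent(z_a, z_b, n_steps=5):
    """Linearly interpolate between two latent anchor codes.

    z_a, z_b: 1-D latent vectors for anchor cells of stages A and B.
    Returns an (n_steps x dim) array of intermediate latent codes,
    excluding the endpoints; in a GRAIL-style pipeline these would be
    decoded by the trained GAN generator into expression profiles.
    """
    alphas = np.linspace(0.0, 1.0, n_steps + 2)[1:-1]
    return (1 - alphas)[:, None] * z_a + alphas[:, None] * z_b
```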
bioinformatics · 2025-12-30 · v1