Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
Cell type-specific functions of nucleic acid-binding proteins revealed by deep learning on co-expression networks
Osato, N.; Sato, K.
AI Summary
- This study uses a deep learning framework to infer the regulatory influence of nucleic acid-binding proteins (NABPs) across different cellular contexts by integrating gene co-expression data, improving prediction accuracy over traditional binding-based methods.
- The model's predictions were validated against ChIP-seq and eCLIP datasets, showing strong concordance.
- Analysis revealed cell type-specific regulatory programs, such as cancer pathways in K562 cells and differentiation in neural progenitor cells, highlighting the framework's utility in functional annotation of NABPs.
Abstract
Nucleic acid-binding proteins (NABPs) play central roles in gene regulation, yet their functional targets and regulatory programs remain incompletely characterized due to the limited scope and context specificity of experimental binding assays. Here, we present a deep learning framework that integrates gene co-expression-derived interactions with contribution-based model interpretation to infer NABP regulatory influence across diverse cellular contexts, without relying on predefined binding motifs or direct binding evidence. Replacing low-informative binding-based features with co-expression-derived interactions significantly improved gene expression prediction accuracy. Model-inferred regulatory targets showed strong and reproducible concordance with independent ChIP-seq and eCLIP datasets, exceeding random expectations across multiple genomic regions and threshold definitions. Functional enrichment and gene set enrichment analyses revealed coherent, cell type-specific regulatory programs, including cancer-associated pathways in K562 cells and differentiation-related processes in neural progenitor cells. Notably, we demonstrate that DeepLIFT-derived contribution scores capture relative regulatory importance in a background-dependent but biologically robust manner, enabling systematic identification of context-dependent NABP regulatory roles. Together, this framework provides a scalable strategy for functional annotation of NABPs and highlights the utility of combining expression-driven inference with interpretable deep learning to dissect gene regulatory architectures at scale.
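Illustrative sketch (not from the paper): the DeepLIFT-based attribution step the abstract describes can be approximated with the Captum library on any PyTorch expression predictor. The toy architecture, feature count, and zero baseline below are assumptions for demonstration only.

```python
# Illustrative only: computing DeepLIFT attributions for a toy expression
# predictor whose inputs are co-expression-derived NABP features.
# The architecture and baseline are assumptions, not the paper's model.
import torch
import torch.nn as nn
from captum.attr import DeepLift

n_features = 256  # hypothetical number of NABP co-expression features

model = nn.Sequential(
    nn.Linear(n_features, 64),
    nn.ReLU(),
    nn.Linear(64, 1),  # predicted expression of one target gene
)
model.eval()

inputs = torch.randn(8, n_features)   # 8 example genes
baseline = torch.zeros_like(inputs)   # the background-dependent choice

dl = DeepLift(model)
attributions = dl.attribute(inputs, baselines=baseline)
# attributions[i, j]: contribution of NABP feature j to gene i's prediction
print(attributions.shape)  # torch.Size([8, 256])
```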
bioinformatics · 2026-02-02 · v9
ELITE: E3 Ligase Inference for Tissue specific Elimination: A LLM Based E3 Ligase Prediction System for Precise Targeted Protein Degradation
Patjoshi, S.; Froehlich, H.; Madan, S.
AI Summary
- The study introduces ELITE, an AI-driven system using a BERT-based model to predict tissue-specific E3 ligases for targeted protein degradation (TPD).
- ELITE integrates protein embeddings with tissue-specific interaction data to identify E3 ligases that can selectively degrade pathogenic proteins in relevant tissues.
- This approach aims to expand the E3 ligase repertoire, enhancing precision in TPD and reducing systemic toxicity.
Abstract
Targeted protein degradation (TPD) has transformed modern drug discovery by harnessing the ubiquitin-proteasome system to eliminate disease-driving proteins previously deemed undruggable. However, current approaches predominantly rely on a narrow set of ubiquitously expressed E3 ligases, such as Cereblon (CRBN) and Von Hippel-Lindau (VHL), which limits tissue specificity, increases systemic toxicity, and fosters resistance. Here, we present an AI-driven framework for the rational identification of tissue-specific E3 ligases suitable for precision-targeted degradation. Our model leverages a BERT-based protein language architecture trained on billions of sequences to generate contextual embeddings that capture structural and functional motifs relevant for E3 substrate compatibility. By integrating these embeddings with tissue-resolved protein-protein interaction data, the framework predicts ligase-target interactions that are both biologically plausible and context-restricted. This enables the prioritization of ligases capable of driving selective degradation of pathogenic proteins within disease-relevant tissues. The proposed approach offers a scalable path to expand the E3 ligase repertoire and advance TPD toward true precision medicine.
bioinformatics · 2026-02-02 · v9
Near perfect identification of half sibling versus niece/nephew avuncular pairs without pedigree information or genotyped relatives
Sapin, E.; Kelly, K.; Keller, M. C.
AI Summary
- The study addresses the challenge of distinguishing half-siblings from niece/nephew-avuncular pairs in large genomic biobanks without pedigree information.
- A novel method using across-chromosome phasing and haplotype-level sharing features was developed, achieving over 98% classification accuracy.
- This approach also enhances long-range phasing accuracy, aiding in pedigree reconstruction and managing cryptic relatedness in genomic studies.
Abstract
Motivation: Large-scale genomic biobanks contain thousands of second-degree relatives with missing pedigree metadata. Accurately distinguishing half-sibling (HS) from niece/nephew-avuncular (N/A) pairs--both sharing approximately 25% of the genome--remains a significant challenge. Current SNP-based methods rely on Identical-By-Descent (IBD) segment counts and age differences, but substantial distributional overlap leads to high misclassification rates. There is a critical need for a scalable, genotype-only method that can resolve these "half-degree" ambiguities without requiring observed pedigrees or extensive relative information. Results: We present a novel computational framework that achieves near-complete separation of HS and N/A pairs using only genotype data. Our approach utilizes across-chromosome phasing to derive haplotype-level sharing features that summarize how IBD is distributed across parental homologues. By modeling these features with a Gaussian mixture model (GMM), we demonstrate near-perfect classification accuracy (> 98%) in biobank-scale data. Furthermore, we show that these high-confidence relationship labels can serve as long-range phasing anchors, providing structural constraints that improve the accuracy of across-chromosome homologue assignment. This method provides a robust, scalable solution for pedigree reconstruction and the control of cryptic relatedness in large-scale genomic studies.
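Illustrative sketch (not from the paper): the classification step the abstract describes is a two-component Gaussian mixture over haplotype-level sharing features. The feature definitions and simulated values below are invented for demonstration.

```python
# Illustrative only: separating two relationship classes with a two-component
# Gaussian mixture over haplotype-level IBD-sharing features. The feature
# definitions and simulated values are assumptions, not the paper's data.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Hypothetical 2D features, e.g. (sharing concentrated on one parental
# homologue, sharing split across homologues), for HS vs N/A pairs.
hs = rng.normal([0.25, 0.02], 0.02, size=(500, 2))
na = rng.normal([0.15, 0.12], 0.02, size=(500, 2))
X = np.vstack([hs, na])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
labels = gmm.fit_predict(X)

# With well-separated features, each component should capture one class.
acc = max((labels[:500] == 0).mean() + (labels[500:] == 1).mean(),
          (labels[:500] == 1).mean() + (labels[500:] == 0).mean()) / 2
print(f"clustering accuracy: {acc:.3f}")
```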
bioinformatics · 2026-02-02 · v3
rnaends: an R package to study exact RNA ends at nucleotide resolution
Caetano, T.; Redder, P.; Fichant, G.; Barriot, R.
AI Summary
- The rnaends R package is designed for analyzing RNA-end sequencing data, focusing on the exact nucleotide resolution of RNA ends.
- It provides tools for preprocessing, mapping, quantification, and post-processing of RNA-end data, including TSS identification, analysis of translation speed, and post-transcriptional modifications.
- The package's utility is demonstrated through workflows on published datasets, highlighting its application in RNA metabolism studies.
Abstract
5' and 3' RNA-end sequencing protocols have unlocked new opportunities to study aspects of RNA metabolism such as synthesis, maturation and degradation, by enabling the quantification of exact ends of RNA molecules in vivo. From RNA-Seq data that have been generated with one of the specialized protocols, it is possible to identify transcription start sites (TSS) and/or endoribonucleolytic cleavage sites, and even, in some cases, co-translational 5' to 3' degradation dynamics. Furthermore, post-transcriptional addition of ribonucleotides at the 3' end of RNA can be studied at nucleotide resolution. While different RNA-end sequencing library protocols exist that have been adapted to a specific organism (prokaryote or eukaryote) or specific biological question, the generated RNA-Seq data are very similar and share common processing steps. Most importantly, the major aspect of RNA-end sequencing is that only the 5' or 3' end mapped location is of interest, contrary to conventional RNA sequencing that considers genomic ranges for gene expression analysis. This translates to a simple representation of the quantitative data as a count matrix of RNA-end locations on the reference sequences. This representation seems under-exploited and is, to our knowledge, not available in a generic package focused on analyses of exact transcriptome ends. Here, we present the rnaends R package, which is dedicated to RNA-end sequencing analysis. It offers functions for raw read pre-processing, RNA-end mapping and quantification, RNA-end count matrix post-processing, and further downstream count matrix analyses such as TSS identification, fast Fourier transform for periodic signal pattern analysis, or differential RNA-end proportion analysis. The use of rnaends is illustrated here with applications in RNA metabolism studies through selected rnaends workflows on published RNA-end datasets: (i) TSS identification, (ii) ribosome translation speed and co-translational degradation, (iii) post-transcriptional modification analysis and differential proportion analysis.
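Illustrative sketch (not the rnaends API, which is R): the count-matrix representation the abstract centers on amounts to tallying exact mapped 5'-end (or 3'-end) positions, for example with pysam in Python. The file name and filters below are placeholders.

```python
# Conceptual sketch (rnaends itself is an R package; this is not its API):
# tallying exact 5'-end positions from aligned reads into a count table,
# the core representation the abstract describes.
from collections import Counter
import pysam

counts = Counter()
with pysam.AlignmentFile("sample.bam", "rb") as bam:  # hypothetical file
    for read in bam:
        if read.is_unmapped or read.is_secondary:
            continue
        # The biological 5' end is the rightmost aligned base on reverse reads.
        pos = read.reference_end - 1 if read.is_reverse else read.reference_start
        strand = "-" if read.is_reverse else "+"
        counts[(read.reference_name, pos, strand)] += 1

for (ref, pos, strand), n in list(counts.items())[:5]:
    print(ref, pos, strand, n)
```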
bioinformatics · 2026-02-02 · v3
MLMarker: A machine learning framework for tissue inference and biomarker discovery
Claeys, T.; van Puyenbroeck, S.; Gevaert, K.; Martens, L.
AI Summary
- MLMarker uses a Random Forest model to compute tissue similarity scores from proteomics data, trained on 34 healthy tissues.
- It employs SHAP for protein-level explanations and a penalty factor for missing proteins, enhancing robustness for sparse datasets.
- Testing on three datasets, MLMarker identified brain-like signatures in cerebral melanoma, achieved high accuracy in pan-cancer analysis, and traced origins in biofluids.
Abstract
MLMarker is a machine learning tool that computes continuous tissue similarity scores for proteomics data, addressing the challenge of interpreting complex or sparse datasets. Trained on 34 healthy tissues, its Random Forest model generates probabilistic predictions with SHAP-based protein-level explanations. A penalty factor corrects for missing proteins, improving robustness for low-coverage samples. Across three public datasets, MLMarker revealed brain-like signatures in cerebral melanoma metastases, achieved high accuracy in a pan-cancer cohort, and identified brain and pituitary origins in biofluids. MLMarker provides an interpretable framework for tissue inference and hypothesis generation, available as a Python package and Streamlit app.
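Illustrative sketch (not MLMarker's code): a random-forest tissue classifier whose probabilities are down-weighted by the fraction of model proteins detected in the sample, mimicking the described penalty factor. The penalty form and data are assumptions.

```python
# Illustrative only: down-weighting tissue probabilities by the fraction of
# model features actually detected in a sample. The penalty form is an
# assumption for illustration, not MLMarker's exact correction.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((300, 50))                 # 300 samples x 50 proteins
y = rng.integers(0, 3, 300)               # 3 hypothetical tissue classes
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

sample = rng.random(50)
detected = rng.random(50) > 0.4           # mask of proteins seen in the run
sample_imputed = np.where(detected, sample, 0.0)

proba = clf.predict_proba(sample_imputed.reshape(1, -1))[0]
penalty = detected.mean()                 # fraction of features observed
print("raw:", proba.round(3), "penalized:", (proba * penalty).round(3))
```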
bioinformatics · 2026-02-02 · v2
cheCkOVER: An open framework and AI-ready global crayfish database for next-generation biodiversity knowledge
Parvulescu, L.; Livadariu, D.; Bacu, V. I.; Nandra, C. I.; Stefanut, T. T.; World of Crayfish Contributors
AI Summary
- The study introduces cheCkOVER, an open framework that transforms species occurrence data into structured, AI-ready formats, focusing on crayfish.
- cheCkOVER processes 111,729 crayfish records from 465 species, producing biogeographic descriptors, dynamic maps, and JSON geo-narratives with provenance metadata.
- This framework supports conservation metrics, tracks invasive species, and enhances biodiversity data utility for AI applications and public platforms like World of Crayfish.
Abstract
Background: Species occurrence records represent the backbone of biodiversity science, yet their utility is often limited to spatial analyses, distribution maps, or presence-absence models. Current biodiversity infrastructures rarely provide computational formats directly usable by modern artificial intelligence (AI) systems, such as large language models (LLMs), which increasingly mediate scientific communication and knowledge synthesis. Open frameworks that convert biodiversity occurrences into structured, machine-accessible, provenance-rich knowledge are therefore essential--particularly those enabling rapid integration of new records, near real-time generation of spatial metrics, and production of both human-interpretable reports and AI-consumable outputs. Such capabilities substantially reduce latency between data acquisition and decision support, while ensuring biodiversity knowledge remains traceable and verifiable in AI-mediated workflows. Results: We introduce cheCkOVER, an open framework that converts raw species occurrence datasets into standardized, API-ready, multi-layered outputs: biogeographic descriptors, dynamic distribution maps, summary metrics, and structured JSON geo-narratives following a canonical template. The framework stratifies processing by population origin (indigenous vs. non-indigenous), enabling IUCN-aligned conservation metrics while simultaneously tracking invasion dynamics. Each output embeds standardized citation metadata ensuring full provenance traceability. We applied the pipeline to 111,729 validated crayfish (Astacidea) occurrence records from 465 species, generating comprehensive species packages including indigenous-range classifications (171 endemic, 287 regional, 5 cosmopolitan taxa) and non-indigenous range tracking for 30 invasive species. This proof-of-concept demonstrates how the framework transforms minimal datapoints--validated species occurrences--into interoperable knowledge consumable by both humans and computational systems. The JSON outputs are optimized for retrieval-augmented generation, enabling AI systems to dynamically access and cite biodiversity knowledge with explicit source attribution. Conclusions: cheCkOVER is taxon-agnostic and establishes a reproducible pathway from biodiversity occurrences to narrative-ready, AI-interoperable knowledge with immediate public utility via the World of Crayfish® platform (https://world.crayfish.ro/), where each species page integrates structured outputs. The open-source framework (GPL-3) combines a generalizable processing pipeline with taxon-specific knowledge products, enabling flexible reuse across conservation research, policy reporting, and AI-driven applications. This minimalist-to-complex design extends the reach of biodiversity data beyond traditional analyses, positioning occurrence repositories as active knowledge engines for next-generation biodiversity informatics.
bioinformatics · 2026-02-02 · v2
WITHDRAWN: OKR-Cell: Open World Knowledge Aided Single-Cell Foundation Model with Robust Cross-Modal Cell-Language Pre-training
Wang, H.; Zhang, X.; Fang, S.; Ran, L.; Deng, Z.; Zhang, Y.; Li, Y.; Li, S.
AI Summary
- The manuscript titled "OKR-Cell: Open World Knowledge Aided Single-Cell Foundation Model with Robust Cross-Modal Cell-Language Pre-training" was withdrawn due to duplicate posting on arXiv.
- The authors request that this work not be cited as a reference.
Abstract
The authors have withdrawn this manuscript because of a duplicate posting of a preprint on arXiv. Therefore, the authors do not wish this work to be cited as a reference for the project. If you have any questions, please contact the corresponding author. The original preprint can be found at arXiv:2601.05648.
bioinformatics · 2026-02-02 · v2
Learning Dynamic Protein Representations at Scale with Distograms
Portal, N.; Karroucha, W.; Mallet, V.; Bonomi, M.
AI Summary
- This study addresses the challenge of incorporating protein structural dynamics into machine learning by using distograms from AlphaFold2 instead of computationally intensive simulations.
- The approach involves encoding dynamic protein information through residue-residue distance probability distributions to enhance function prediction.
- Key finding: This method offers a scalable solution for dynamic protein representation, potentially improving prediction accuracy without the need for explicit conformational sampling.
Abstract
Protein function and other biological properties often depend on structural dynamics, yet most machine-learning predictors rely on static representations. Physics-based molecular simulations can describe conformational variability but remain computationally prohibitive at scale. Generative models provide a more efficient alternative, though their ability to produce accurate conformational ensembles is still limited. In this work, we bypass expensive simulations by leveraging residue-residue distance probability distributions (distograms) from structure predictors such as AlphaFold2. Our approach provides a scalable way to encode dynamic information into protein representations, aiming to improve function prediction without explicit conformational sampling.
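Illustrative sketch (not from the paper): one simple way to turn a distogram into a dynamics-aware descriptor is the per-residue-pair entropy of the distance-bin distribution. The shapes and values below are toy assumptions.

```python
# Illustrative only: summarizing an AlphaFold2-style distogram (L x L x B
# distance-bin probabilities) by its per-pair Shannon entropy, a simple
# proxy for conformational uncertainty. Shapes and data are toy assumptions.
import numpy as np

L, B = 120, 64                                  # residues, distance bins
rng = np.random.default_rng(0)
logits = rng.normal(size=(L, L, B))
probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)

entropy = -(probs * np.log(probs + 1e-9)).sum(-1)   # (L, L) pair entropy
per_residue = entropy.mean(axis=1)                  # (L,) dynamic profile
print(per_residue.shape, per_residue[:5].round(3))
```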
bioinformatics · 2026-02-02 · v1
MultiGEOmics: Graph-Based Integration of Multi-Omics via Biological Information Flows
Alipour Pijani, B.; Rifat, J. I. M.; Bozdag, S.
AI Summary
- MultiGEOmics is a graph-based framework designed to integrate multi-omics data by incorporating cross-omics regulatory signals and handling missing data.
- It learns robust embeddings across omics types, maintaining performance under varying data completeness scenarios.
- Evaluations on 11 datasets showed MultiGEOmics consistently performs well and provides interpretability by highlighting key omics features for predictions.
Abstract
Motivation: Multi-omics datasets capture complementary aspects of biological systems and are central to modern machine learning applications in biology and medicine. Existing graph-based integration methods typically construct separate graphs for each omics type and focus primarily on intra-omic relationships. As a result, they often overlook cross-omics regulatory signals (bidirectional interactions across omics layers) that are critical for modeling complex cellular processes. A second major challenge is missing or incomplete omics data; many current approaches degrade substantially in performance or exclude patients lacking one or more omics modalities. To address these limitations, we introduce MultiGEOmics, an intermediate-level graph integration framework that explicitly incorporates regulatory signals across omics types during graph representation learning and models biologically inspired omics-specific and cross-omics dependencies. MultiGEOmics learns robust cross-omics embeddings that remain reliable even when some modalities are partially missing. Results: We evaluated MultiGEOmics across eleven datasets spanning cancer and Alzheimer's disease, under zero, moderate, and high missing-rate scenarios. MultiGEOmics consistently maintains strong predictive performance across all missing-data conditions while offering interpretability by identifying the most influential omics types and features for each prediction task.
bioinformatics · 2026-02-02 · v1
Batch correction for large-scale mass spectrometry imaging experiments
Thomsen, A. A.; Jensen, O. N.
AI Summary
- This study evaluates batch correction methods for MALDI mass spectrometry imaging experiments.
- ComBat was found to reduce batch-related technical variance, preserve biological variation, and enhance the overall score by 19.4%.
Abstract
We assess batch correction methods for MALDI mass spectrometry imaging experiments. ComBat reduced batch-related technical variance, maintained biological variation, and improved the overall score by 19.4%.
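Illustrative sketch (not the authors' pipeline): ComBat fits an empirical-Bayes location/scale model per batch; the minimal version below applies the location/scale adjustment without the shrinkage step, just to show the shape of the computation.

```python
# Illustrative only: a bare-bones location/scale batch adjustment in the
# spirit of ComBat, omitting its empirical-Bayes shrinkage across features.
import numpy as np

def simple_batch_adjust(X, batches):
    """X: samples x features intensity matrix; batches: per-sample labels."""
    Xc = X.copy().astype(float)
    grand_mean = X.mean(axis=0)
    grand_std = X.std(axis=0) + 1e-9
    for b in np.unique(batches):
        idx = batches == b
        mu = X[idx].mean(axis=0)
        sd = X[idx].std(axis=0) + 1e-9
        Xc[idx] = (X[idx] - mu) / sd * grand_std + grand_mean
    return Xc

rng = np.random.default_rng(0)
# 3 batches of 20 samples with artificial batch offsets
X = rng.normal(size=(60, 100)) + np.repeat([0.0, 1.5, -0.7], 20)[:, None]
batches = np.repeat([0, 1, 2], 20)
print(simple_batch_adjust(X, batches)[:2, :3].round(2))
```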
bioinformatics · 2026-02-02 · v1
Evaluating the applicability of kinship analyses for sedimentary ancient DNA datasets
Cohen, P.; Johnson, S.; Zavala, E. I.; Moorjani, P.; Slon, V.
AI Summary
- This study evaluates the feasibility of kinship inference using sedimentary ancient DNA (sedaDNA), focusing on Neandertals, through extensive simulations.
- The main challenge identified was the presence of DNA from multiple individuals in samples, which complicates accurate kinship analysis.
- A heterozygosity-based test was developed to detect multi-individual DNA, and practical limits were assessed using Neandertal sedaDNA from the Galeria de las Estatuas site.
Abstract
Kinship reconstruction in ancient populations provides key insights into past social organization and evolutionary history. Sedimentary ancient DNA (sedaDNA) enables access to deep-time human populations in the absence of skeletal remains. However, it is characterized by severe degradation and the potential mixture of genetic material from multiple individuals, raising questions about its suitability for kinship inference. Here, we use extensive simulations to evaluate the feasibility and limitations of kinship inference in sparse and damaged sedaDNA data, with a focus on Neandertals. We find that the main obstacle to accurate kinship inference in sedaDNA is the presence of multiple contributors to a given sample. To address this, we introduce a simple heterozygosity-based test to identify samples containing DNA from multiple individuals. Guided by these results, we analyze published Neandertal sedaDNA from the Galeria de las Estatuas site to assess the practical limits of kinship inference in real sedimentary ancient DNA data. Together, our results define methodological considerations and practical limits for kinship inference in sedimentary ancient DNA.
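Illustrative sketch (not the authors' test): the idea of a heterozygosity-based check for multiple contributors can be phrased as a one-sided binomial test for excess apparent heterozygosity; the expected rate and simulated calls below are assumptions.

```python
# Illustrative only: a toy version of a heterozygosity-based check for
# multiple DNA contributors, i.e. excess apparent heterozygosity relative to
# a single-individual expectation. Rates and thresholds are assumptions.
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(0)
n_sites = 2000
expected_het = 0.20            # assumed single-individual heterozygosity

# Simulated site-level calls: a two-contributor mix inflates heterozygosity.
observed_het_calls = rng.random(n_sites) < 0.31

result = binomtest(int(observed_het_calls.sum()), n_sites,
                   expected_het, alternative="greater")
print(f"apparent het = {observed_het_calls.mean():.3f}, "
      f"p = {result.pvalue:.2e}")
```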
bioinformatics · 2026-02-02 · v1
DyGraphTrans: A temporal graph representation learning framework for modeling disease progression from Electronic Health Records
Rahman, M. T.; Al Olaimat, M.; Bozdag, S.; Alzheimer's Disease Neuroimaging Initiative
AI Summary
- DyGraphTrans is a framework that models disease progression using EHR data by representing it as temporal graphs, where nodes are patients, features are clinical attributes, and edges show patient similarity.
- It addresses high memory use and lack of interpretability in existing models by employing a sliding-window mechanism and capturing both local and global temporal trends.
- Evaluations on ADNI, NACC, and MIMIC-IV datasets showed DyGraphTrans had strong predictive performance and interpretability aligned with clinical risk factors.
Abstract
Motivation: Electronic Health Records (EHRs) contain vast amounts of longitudinal patient medical history data, making them highly informative for early disease prediction. Numerous computational methods have been developed to leverage EHR data; however, many process multiple patient records simultaneously, resulting in high memory consumption and computational cost. Moreover, these models often lack interpretability, limiting insight into the factors driving their predictions. Efficiently handling large-scale EHR data while maintaining predictive accuracy and interpretability therefore remains a critical challenge. To address this gap, we propose DyGraphTrans, a dynamic graph representation learning framework that represents patient EHR data as a sequence of temporal graphs. In this representation, nodes correspond to patients, node features encode temporal clinical attributes, and edges capture patient similarity. DyGraphTrans models both local temporal dependencies and long-range global trends, while a sliding-window mechanism reduces memory consumption without sacrificing essential temporal context. Unlike existing dynamic graph models, DyGraphTrans jointly captures patient similarity and temporal evolution in a memory-efficient and interpretable manner. Results: We evaluated DyGraphTrans on Alzheimer's Disease Neuroimaging Initiative (ADNI) and National Alzheimer's Coordinating Center (NACC) for disease progression prediction, as well as on the Medical Information Mart for Intensive Care (MIMIC-IV) dataset for early mortality prediction. We further assessed the model on multiple benchmark dynamic graph datasets to evaluate its generalizability. DyGraphTrans achieved strong predictive performance across diverse datasets. We also demonstrated interpretability of DyGraphTrans aligned with known clinical risk factors.
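Illustrative sketch (not DyGraphTrans itself): the memory-saving idea of processing a long sequence of temporal graph snapshots in overlapping sliding windows, so only the current window is held in memory. Window length and data are placeholder assumptions.

```python
# Illustrative only: iterating over temporal graph snapshots in sliding
# windows so only `window` timesteps are in memory at once. Snapshot
# contents and the window size are placeholder assumptions.
import numpy as np

n_timesteps, n_patients, n_feats = 20, 100, 8
rng = np.random.default_rng(0)
snapshots = [rng.random((n_patients, n_feats)) for _ in range(n_timesteps)]

window, stride = 5, 1
for start in range(0, n_timesteps - window + 1, stride):
    chunk = snapshots[start:start + window]       # local temporal context
    local_summary = np.stack(chunk).mean(axis=0)  # stand-in for a GNN pass
    # a real model would also carry a compact global state across windows
print(local_summary.shape)  # (100, 8)
```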
bioinformatics · 2026-02-02 · v1
NetPolicy-RL: Network-Informed Offline Reinforcement Learning for Pharmacogenomic Drug Prioritization
Lodh, E.; Majumder, S.; Chowdhury, T.; De, M.
AI Summary
- The study introduces NetPolicy-RL, a framework that combines network diffusion modeling with offline reinforcement learning for prioritizing drugs in pharmacogenomic screens.
- Drug selection is treated as an offline contextual bandit problem, optimizing ranking quality directly by integrating drug response data with network disruption scores from biological networks.
- NetPolicy-RL significantly outperformed traditional methods in ranking quality (NDCG@10) and reduced regret, with improvements for 88.7% of cell lines compared to GlobalTopK.
Abstract
Large-scale pharmacogenomic screens provide extensive measurements of drug response across diverse cancer cell lines; however, most computational approaches emphasize point-wise sensitivity prediction or static ranking, which are poorly aligned with practical decision-making, where only a limited number of candidate drugs can be tested. We propose NetPolicy-RL, a biologically informed and decision-centric framework for pharmacogenomic drug prioritization that integrates network diffusion modeling with offline reinforcement learning. Drug selection for each cell line is formulated as an offline contextual bandit problem, enabling direct optimization of ranking quality rather than surrogate regression objectives. Mechanistic biological context is incorporated by propagating drug targets over curated interaction networks (STRING and Reactome) using random walk with restart, and combining the resulting diffusion profiles with cell-specific molecular importance derived from multi-omics data to compute network disruption scores. These biologically grounded signals are integrated with normalized drug response measurements to construct a joint state representation, which is optimized using an offline actor-critic architecture. Across held-out test splits, NetPolicy-RL consistently outperforms global ranking heuristics and learning-to-rank baselines, achieving statistically significant improvements in per-cell Normalized Discounted Cumulative Gain (NDCG@10) and substantial reductions in per-cell regret. Relative to GlobalTopK, the policy improves NDCG@10 for 88.7% of cell lines, while improvements exceed 95% compared with LambdaMART and regression-to-ranking baselines. Ablation analyses show that neither empirical response signals nor network-derived features alone are sufficient, and that their integration yields the most robust performance. Overall, this study demonstrates that combining mechanistic network biology with offline policy learning provides an effective and interpretable framework for drug prioritization in precision oncology.
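Illustrative sketch (not the authors' code): random walk with restart over an interaction network, the diffusion step named in the abstract, on a four-node toy graph.

```python
# Illustrative only: random walk with restart (RWR) over a protein network,
# the diffusion step the abstract describes. The toy graph is an assumption.
import numpy as np

def rwr(adj, seeds, restart=0.5, tol=1e-8, max_iter=1000):
    """adj: symmetric adjacency matrix; seeds: indices of drug targets."""
    W = adj / adj.sum(axis=0, keepdims=True)       # column-normalize
    p0 = np.zeros(adj.shape[0])
    p0[seeds] = 1.0 / len(seeds)
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1 - restart) * W @ p + restart * p0
        if np.abs(p_next - p).sum() < tol:
            break
        p = p_next
    return p

adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
print(rwr(adj, seeds=[0]).round(3))   # diffusion profile from node 0
```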
bioinformatics · 2026-02-02 · v1
PHoNUPS: Open-Source Software for Standardized Analysis and Visualization of Multi-Instrument Extracellular Vesicle Measurements
Melykuti, B.; Bustos-Quevedo, G.; Prinz, T.; Nazarenko, I.
AI Summary
- PHoNUPS is open-source software developed in R to standardize the analysis and visualization of extracellular vesicle (EV) measurements from various instruments.
- It processes data to compute statistics and generate standardized histograms and contour plots for EV size and zeta potential, aiding in transparent reporting and cross-study comparisons.
- The software supports multiple file formats, produces publication-ready figures, and is designed for extensibility with community contributions.
Abstract
Accurate and transparent characterization of extracellular vesicle (EV) preparations is essential to ensure reproducibility, comparability, and adherence to MISEV reporting standards. However, data outputs from commonly used instruments for assessing EV size, concentration, and surface charge (zeta potential) vary widely in format and structure, complicating standardized analysis and integration across platforms. We present PHoNUPS (Plotting the Histogram of Non-Uniform Particles' Sizes), free and open-source software (FOSS) developed in R that enables unified processing, analysis, and visualization of EV characterization data. PHoNUPS computes statistics and generates standardized histograms and contour plots (for size against zeta potential) suitable for transparent reporting and cross-study comparison. The software produces high-quality, publication-ready figures. Third-party graphical editing tools allow users to refine and annotate visualizations for presentation or manuscript preparation. PHoNUPS supports multiple measurement file formats, thereby facilitating dataset integration from different instruments. PHoNUPS was developed with extensibility at its core, providing a basis for user-driven growth. We invite the EV community - researchers, analysts, and tool developers - to use PHoNUPS, share feedback on their experience and needs, and contribute to the platform by integrating additional input data formats, analytical routines, and visualization functionalities.
bioinformatics · 2026-02-02 · v1
LFQ Benchmark Dataset - Generation Beta: Assessing Modern Proteomics Instruments and Acquisition Workflows with High-Throughput LC Gradients
Van Puyvelde, B. R.; Devreese, R.; Chiva, C.; Sabido, E.; Pfammatter, S.; Panse, C.; Rijal, J. B.; Keller, C.; Batruch, I.; Pribil, P.; Vincendet, J.-B.; Fontaine, F.; Lefever, L.; Magalhaes, P.; Deforce, D.; Nanni, P.; Ghesquiere, B.; Perez-Riverol, Y.; Martens, L.; Carapito, C.; Bouwmeester, R.; Dhaenens, M.
AI Summary
- This study extends a previous benchmark dataset to evaluate modern LC-MS platforms for high-throughput proteomics using short LC gradients (5 and 15 min) and low sample inputs.
- Data was collected from a hybrid human-yeast-E. coli proteome across four platforms, including new quadrupole-based systems, to assess proteome depth, quantitative precision, and cross-instrument consistency.
- The dataset, available via ProteomeXchange, aims to advance cross-platform algorithm development and standardize high-throughput LC-MS proteomics.
Abstract
Recent advances in liquid chromatography mass spectrometry (LC-MS) have accelerated the adoption of high-throughput workflows that deliver deep proteome coverage using minimal sample amounts. This trend is largely driven by clinical and single-cell proteomics, where sensitivity and reproducibility are essential. Here, we extend our previous benchmark dataset (PXD028735) using next-generation LC-MS platforms optimized for rapid proteome analysis. We generated an extensive DDA/DIA dataset using a human-yeast-E. coli hybrid proteome. The proteome sample was distributed across multiple laboratories together with standardized analytical protocols specifying two short LC gradients (5 and 15 min) and low sample input amounts. This dataset includes data acquired on four different platforms, and features new scanning quadrupole-based implementations, extending coverage across different instruments and acquisition strategies. Our comprehensive evaluation highlights how technological advances and reduced LC gradients may affect proteome depth, quantitative precision, and cross-instrument consistency. The release of this benchmark dataset via ProteomeXchange (PXD070049 and PXD071205) accelerates cross-platform algorithm development, enhances data mining strategies, and supports standardization of short-gradient, high-throughput LC-MS-based proteomics.
bioinformatics · 2026-02-02 · v1
Bridging the gap between genome-wide association studies and network medicine with GNExT
Arend, L.; Woller, F.; Rehor, B.; Emmert, D.; Frasnelli, J.; Fuchsberger, C.; Blumenthal, D. B.; List, M.
AI Summary
- GNExT is a web-based platform designed to integrate GWAS data into network medicine, enhancing the interpretation of genetic variants within biological systems.
- It incorporates tools like MAGMA and Drugst.One to explore genetic variants at a network level, identifying potential drug repurposing candidates.
- The platform was demonstrated using a GWAS meta-analysis of human olfactory identification, translating genetic signals into pharmacological targets.
Abstract
Motivation: A growing volume of large-scale genome-wide association study (GWAS) datasets offers unprecedented power to uncover the genetic determinants of complex traits, but existing web-based platforms for GWAS data exploration provide limited support for interpreting these findings within broader biological systems. Systems medicine is particularly well-suited to fill this gap, as its network-oriented view of molecular interactions enables the integration of genetic signals into coherent network modules, thereby opening opportunities for disease mechanism mining and drug repurposing. Results: We introduce GNExT (GWAS network exploration tool), a web-based platform that moves beyond the variant-level effect and significance exploration provided by existing solutions. By including MAGMA and Drugst.One, GNExT allows its users to study genetic variants on the network level down to the identification of potential drug repurposing candidates. Moreover, GNExT advances over the current state of the art by offering a highly standardized Nextflow pipeline for data import and preprocessing, allowing researchers to easily deploy their study results on a web interface. We demonstrate the utility of GNExT using a genome-wide association meta-analysis of human olfactory identification, in which the framework translated isolated GWAS signals to potential pharmacological targets in human olfaction. Availability and Implementation: The complete GNExT ecosystem, including the Nextflow preprocessing pipeline, the backend service, and frontend interface, is publicly available on GitHub (https://github.com/dyhealthnet/gnext_nf_pipeline, https://github.com/dyhealthnet/gnext_platform). The public instance of the GNExT platform on olfaction is available under http://olfaction.gnext.gm.eurac.edu.
bioinformatics · 2026-02-02 · v1
Quantifying biomarker ambiguity using metabolic network analysis
Hinkston, M. A.; Bradley, A. S.
AI Summary
- This study quantifies biomarker ambiguity by introducing three metrics (retrobiosynthetic complexity, normalized branch depth, and fraction shared) to assess the biosynthetic specificity of biomarkers through metabolic network analysis.
- Analysis of 9,140 MetaCyc metabolites revealed that only 13% of multi-pathway compounds had low complexity, distal divergence, and high pathway consensus.
- Lipid biomarkers like hopanoids and sterols were found to vary in specificity, with hopanoids showing higher specificity, while diagnostic quality and lipophilicity were found to be independent.
Abstract
Molecular biomarkers preserved in rocks provide evidence about ancient life, but interpreting them requires inference through multiple stages of information loss arising from phylogenetic, biosynthetic, and diagenetic ambiguity. However, biomarker specificity is typically assessed qualitatively rather than quantitatively. Here we formalize biosynthetic ambiguity as entropy over metabolic networks. We introduce three metrics that quantify pathway-level information content: retrobiosynthetic complexity (ψ), normalized branch depth (λ), and fraction shared (σ). Analysis of 9,140 MetaCyc metabolites defines a three-dimensional specificity space for biomarker evaluation. Only 13% of multi-pathway compounds exhibited low complexity, distal divergence, and high pathway consensus. Lipid biomarkers span this specificity space heterogeneously: hopanoids cluster near the high-specificity region while sterols occupy intermediate territory. Diagnostic quality and lipophilicity are approximately independent, so the constraint on molecular paleontology is the limited chemical diversity among preservable compound classes rather than their biosynthetic properties. This framework supports probabilistic biomarker interpretation by explicitly incorporating biosynthetic, phylogenetic, and diagenetic constraints.
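Illustrative sketch (not the paper's metrics): entropy over candidate source pathways is the core quantity being formalized; the ψ, λ, and σ metrics are more involved than this minimal version.

```python
# Illustrative only: Shannon entropy over the pathways that can produce a
# compound, one way to formalize biosynthetic ambiguity. The pathway counts
# here are invented; the paper's psi/lambda/sigma metrics are more involved.
import numpy as np

def pathway_entropy(weights):
    """weights: relative support for each candidate source pathway."""
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()
    return float(-(p * np.log2(p + 1e-12)).sum())

print(pathway_entropy([1.0]))             # single pathway: 0 bits, specific
print(pathway_entropy([0.5, 0.5]))        # two equal pathways: 1 bit
print(pathway_entropy([0.7, 0.2, 0.1]))   # skewed: intermediate ambiguity
```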
bioinformatics · 2026-02-02 · v1
Multi-ancestry conditional and joint analysis (Manc-COJO) applied to GWAS summary statistics
Wang, X.; Wang, Y.; Visscher, P. M.; Wray, N. R.; Yengo, L.
AI Summary
- The study introduces Manc-COJO, a method for performing conditional and joint analysis on GWAS summary statistics across multiple ancestries to identify independent SNP associations.
- Simulations and real-data analyses demonstrate that Manc-COJO enhances the detection of independent signals and reduces false positives compared to traditional methods.
- Manc-COJO:MDISA, a follow-up algorithm, identifies ancestry-specific associations, and the C++ implementation of Manc-COJO significantly improves computational efficiency.
Abstract
Conditional and joint (COJO) analysis of genome-wide association study (GWAS) summary statistics to identify single nucleotide polymorphisms (SNPs) independently associated with a trait is standard in post-GWAS pipelines. GWAS meta-analyses are increasingly conducted across multiple ancestry groups, but how to perform COJO in a multi-ancestry context is not known. Here we introduce Manc-COJO, a method for multi-ancestry COJO analysis. Simulations and real-data analyses show that Manc-COJO improves the detection of independent association signals and reduces false positives compared to COJO and ad hoc adaptations for multi-ancestry use. We also introduce Manc-COJO:MDISA, a follow-up within-ancestry algorithm to identify ancestry-specific associations after fitting Manc-COJO-identified SNPs. The C++ implementation of Manc-COJO substantially improves computational efficiency (for a single ancestry, >120 times faster than the GCTA-COJO software) and supports linkage disequilibrium references derived either from individual-level genotype data or pre-computed matrices, facilitating analysis when data sharing is limited.
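Illustrative sketch (not Manc-COJO): in the single-ancestry case, approximate conditional analysis reduces to residualizing one SNP's z-score on another given their LD correlation r, as below; the multi-ancestry model generalizes well beyond this.

```python
# Illustrative only: single-ancestry approximate conditional analysis, i.e.
# the z-score of SNP2 after conditioning on SNP1 given their LD correlation
# r. Manc-COJO's multi-ancestry model is considerably more general.
import numpy as np

def conditional_z(z1, z2, r):
    """Residual association of SNP2 after removing SNP1's LD-shared signal."""
    return (z2 - r * z1) / np.sqrt(1.0 - r**2)

z1, z2, r = 8.0, 6.0, 0.7
print(f"marginal z2 = {z2:.2f}, "
      f"conditional z2 = {conditional_z(z1, z2, r):.2f}")
```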
bioinformatics · 2026-02-02 · v1
scDiagnostics: systematic assessment of cell type annotation in single-cell transcriptomics data
Christidis, A.; Ghazi, A. R.; Chawla, S.; Turaga, N.; Gentleman, R.; Geistlinger, L.
AI Summary
- The study addresses the challenge of assessing computational cell type annotations in single-cell transcriptomics by introducing scDiagnostics, a software package designed to detect complex or ambiguous annotations.
- scDiagnostics uses novel diagnostic methods compatible with major annotation tools and was tested on simulated and real-world datasets.
- The tool effectively identifies misleading annotations that could distort downstream analysis, enhancing the reliability of single-cell data interpretation.
Abstract
Although cell type annotation has become an integral part of single-cell analysis workflows, the assessment of computational annotations remains challenging. Many annotation tools transfer labels from an annotated reference dataset to a new query dataset of interest, but blindly transferring labels from one dataset to another has its own set of challenges. Often enough there is no perfect alignment between datasets, especially when transferring annotations from a healthy reference atlas for the discovery of disease states. We present scDiagnostics, a new open-source software package that facilitates the detection of complex or ambiguous annotation cases that may otherwise go unnoticed, thus addressing a critical unmet need in current single-cell analysis workflows. scDiagnostics is equipped with novel diagnostic methods that are compatible with all major cell type annotation tools. We demonstrate that scDiagnostics reliably detects complex or conflicting annotations using both carefully designed simulated datasets and diverse real-world single-cell datasets. Our evaluation demonstrates that scDiagnostics reliably identifies misleading annotations that systematically distort downstream analysis and interpretation and that would otherwise remain undetected. The scDiagnostics R package is available from Bioconductor (https://bioconductor.org/packages/scDiagnostics).
bioinformatics · 2026-02-02 · v1
An Explainable Machine Learning Approach to study the positional significance of histone post-translational modifications in gene regulation
Ramachandran, S.; Ramakrishnan, N.
AI Summary
- This study used XGBoost classifiers to analyze ChIP-seq data for 26 histone PTMs in yeast, focusing on their positional significance from -3 to 8 in genes.
- The approach predicted gene transcription rates and identified critical histone modifications and nucleosomal positions for gene expression using SHAP for explainability.
- Key findings highlighted the importance of specific histone modifications and their positions in yeast gene regulation, with potential for extension to other organisms.
Abstract
Epigenetic mechanisms regulate gene expression by altering the structure of chromatin without modifying the underlying DNA sequence. Histone post-translational modifications (PTMs) are critical epigenetic signals that influence transcriptional activity, promoting or repressing gene expression. Understanding the impact of individual PTMs and their combinatorial effects is essential to deciphering gene regulatory mechanisms. In this study, we analyzed ChIP-seq data for 26 PTMs in yeast, examining PTM intensities gene-wise from positions -3 to 8 in each gene. Using XGBoost classifiers, we predicted gene transcription rates and identified key histone modifications and nucleosomal positions that are critical for gene expression using explainability measures (such as SHAP). Our study provides comprehensive insight into the histone modifications, positions, and combinations that are most critical in gene regulation in yeast. The proposed explainable machine learning models can be easily extended to other model organisms to provide meaningful insights into gene regulation by epigenetic mechanisms.
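Illustrative sketch (not the authors' code): an XGBoost classifier over position-resolved PTM features with SHAP attributions, mirroring the analysis design; the data are random placeholders.

```python
# Illustrative only: an XGBoost classifier on position-resolved PTM features
# with SHAP attributions. The feature layout (26 PTMs x 12 nucleosome
# positions, -3..8) follows the abstract; the data are random placeholders.
import numpy as np
import shap
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
n_genes, n_ptms, n_pos = 500, 26, 12
X = rng.random((n_genes, n_ptms * n_pos))
y = rng.integers(0, 2, n_genes)          # high vs low transcription rate

model = XGBClassifier(n_estimators=100, max_depth=3, eval_metric="logloss")
model.fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
# Mean |SHAP| per feature ranks PTM x position combinations by importance.
importance = np.abs(shap_values).mean(axis=0)
print("top feature index:", int(importance.argmax()))
```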
bioinformatics · 2026-02-02 · v1
Serum Proteomic Profiling Implicates a Dysregulated Neurohormonal-Inflammatory Axis in Post-Fontan Tachycardia
Takaesu, F.; Villarreal, D. J.; Zhou, A.; Jimenez, M.; Turner, M.; Spiess, J. L.; Kievert, J.; Deshetler, C.; Schwartzman, W.; Yates, A. R.; Kelly, J. M.; Breuer, C. K.; Davis, M.
AI Summary
- This study used serum proteomics and machine learning to investigate the molecular mechanisms behind post-Fontan tachycardia in both ovine models and human patients.
- Post-operative tachycardia was observed in both species, with significant heart rate increases noted from day 1 to day 3 post-operation.
- A seven-protein panel was identified, with ANGT, ACE, and PTX3 consistently dysregulated across species, suggesting a neurohormonal-inflammatory axis involvement in tachycardia.
Abstract
Background: Post-operative tachycardia is a common and poorly understood complication following the Fontan procedure. Post-operative factors such as surgical scarring and venous hypertension can contribute to tachycardia risk, but the specific molecular signaling cascades triggering acute tachycardia remain uncharacterized, limiting therapeutic innovation and leaving clinicians with limited strategies. Here, we present a retrospective translational study leveraging serum proteomics and machine learning to identify molecular drivers of post-operative Fontan tachycardia. Methods: We integrated a clinically relevant ovine animal model of Fontan circulation with continuous telemetric heart rate monitoring and human patient data. Serum proteomics coupled with machine learning algorithms were employed to identify protein panels predictive of post-operative tachycardia. Cross-species validation was performed by comparing proteomic signatures from sheep and pediatric patients undergoing Glenn or Fontan surgery. Results: Ovine Fontan animals demonstrated significant heart rate elevation beginning on post-operative day (POD) 1, peaking at POD 3 (159.4 ± 11.7 bpm vs. pre-operative 105.3 ± 10.5 bpm, p<0.0001), before trending toward baseline by POD 10. This pattern was similar in human patients, though more modest. Proteomic analysis identified distinct separation between pre- and post-operative serum profiles. Principal component analysis revealed that the principal components most correlated with heart rate were significantly enriched for inflammatory and neural pathways. We leveraged the Boruta algorithm to identify a seven-protein panel (ACE, ANGT, ITIH4, SELENOP, W5PHP7, PTX3, and F5) with superior predictive power (AUC=0.926). A cross-species comparison between human and sheep demonstrated that three of these proteins, angiotensinogen (ANGT), angiotensin-converting enzyme (ACE), and pentraxin 3 (PTX3), were similarly dysregulated in both species. Conclusions: This study provides the first direct molecular evidence implicating a dysregulated neurohormonal-inflammatory axis as a principal driver of acute post-operative Fontan tachycardia. The identified protein signature offers novel mechanistic insights and establishes a foundation for targeted diagnostics and therapeutics to predict and mitigate this significant clinical complication.
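Illustrative sketch (not the study's pipeline): Boruta-style panel selection with a random-forest base learner, as named in the abstract; the dimensions and labels below are invented.

```python
# Illustrative only: Boruta feature selection over serum protein intensities
# with a random-forest base learner. Data and dimensions are placeholders.
import numpy as np
from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((80, 200))                     # 80 sera x 200 proteins
y = rng.integers(0, 2, 80)                    # tachycardia vs not

rf = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=0)
selector = BorutaPy(rf, n_estimators="auto", random_state=0)
selector.fit(X, y)                            # BorutaPy expects numpy arrays
print("selected protein indices:", np.flatnonzero(selector.support_))
```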
bioinformatics · 2026-02-02 · v1
CellCov: gene-body coverage profiling for single-cell RNA-seq
Chen, S.; Zevnik, U.; Ziegenhain, C.
AI Summary
- CellCov addresses the issue of gene-body coverage bias in single-cell RNA-seq by providing profiling at single-cell resolution, which reveals cell-to-cell variability.
- It supports flexible grouping and aggregation of coverage profiles, facilitating comparison across different sequencing protocols.
- The tool was demonstrated on public datasets from various full-length scRNA-seq chemistries, showcasing its utility.
Abstract
Motivation: Gene-body coverage bias differs across scRNA-seq protocols and can influence downstream analyses, yet coverage is often assessed using bulk-level summaries that obscure cell-to-cell variability. Results: CellCov provides gene-body coverage profiling at single-cell resolution, enabling exploration of coverage heterogeneity across both cells and features. The accompanying workflow supports flexible grouping and robust aggregation of profiles by user-provided annotations, allowing principled comparison of coverage bias across sequencing protocols. We demonstrate its use on public datasets from several full-length scRNA-seq chemistries. Availability: CellCov source code and documentation are available at https://github.com/ziegenhain-lab/CellCov
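Illustrative sketch (not CellCov's implementation): per-cell gene-body coverage reduces to binning each aligned position by its relative position along the gene and accumulating counts per cell. The inputs below are placeholder arrays.

```python
# Illustrative only: binning aligned positions into relative gene-body
# coordinates (0 = 5' end, 1 = 3' end) per cell. Data are placeholders.
import numpy as np

n_bins = 100
rng = np.random.default_rng(0)

# (cell, gene_start, gene_end, aligned_pos) records for one toy gene
cells = rng.integers(0, 3, 5000)             # 3 cells
starts = np.zeros(5000)
ends = np.full(5000, 2000.0)
pos = rng.uniform(0, 2000, 5000)

rel = (pos - starts) / (ends - starts)       # relative position in [0, 1)
bins = np.minimum((rel * n_bins).astype(int), n_bins - 1)

profiles = np.zeros((3, n_bins))
np.add.at(profiles, (cells, bins), 1)        # per-cell coverage profile
print(profiles.shape, profiles.sum(axis=1))
```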
bioinformatics · 2026-02-02 · v1
AstraKit: Customizable, reproducible workflows for biomedical research and precision medicine
Kurz, N. S.; Kornrumpf, K.; Stoves, M. K.; Dönitz, J.
AI Summary
- AstraKit is a customizable KNIME workflow suite designed to streamline precision medicine analytics by integrating variant interpretation, multi-omics analysis, and drug response modeling.
- It features dynamic variant annotation, multi-layered omics integration, and translational drug matching, validated in oncology cohorts to show concordance with clinical outcomes.
- AstraKit's open-source, platform-independent workflows enhance reproducibility and accelerate biomarker validation, linking molecular profiles to therapeutic decisions.
Abstract
Motivation: Fragmented bioinformatics tools compel researchers and clinicians to resort to error-prone manual pipelines. The success of precision medicine depends largely on the efficient interpretation of genetic variants and the selection of highly effective targeted therapies. More broadly, the success of precision medicine and biomedical research depends on the availability of efficient software solutions for processing and interpreting genetic variants, interpreting multi-omics data, and integrating drug screen analyses. Results: We present AstraKit, a unified KNIME workflow suite enabling end-to-end precision medicine analytics. AstraKit introduces three transformative innovations: 1) dynamic variant interpretation with customizable annotation and filtering for disease-specific genomic contexts; 2) multi-layered omics analyses integrating genomic, transcriptomic, and epigenetic data; and 3) translational drug matching that correlates in vitro drug screens with clinical outcomes. Validated across oncology cohorts, AstraKit demonstrates concordance between experimental drug sensitivity and clinical trial responses, resolving discordances to uncover resistance mechanisms. By unifying variant analysis, multi-omics, and drug response modeling on a single customizable platform, AstraKit eliminates siloed workflows, accelerating biomarker validation and enabling clinicians to directly link molecular profiles to therapeutic decisions. As all AstraKit workflows are open-source and platform-independent, we provide a versatile, comprehensive software suite for a multitude of tasks in bioinformatics and precision medicine. Availability and implementation: The KNIME workflows are available at KNIME Hub https://hub.knime.com/bioinf_goe/spaces/Public/AstraKit~lfVsGBY2HnPYc1h1/. The source code is available at https://gitlab.gwdg.de/MedBioinf/mtb/astrakit.
bioinformatics · 2026-01-31 · v3
A novel phylogenomics pipeline reveals complex pattern of reticulate evolution in Cucurbitales
Ortiz, E. M.; Hoewener, A.; Shigita, G.; Raza, M.; Maurin, O.; Zuntini, A.; Forest, F.; Baker, W. J.; Schaefer, H.
AI Summary
- This study introduces Captus, a novel pipeline for integrating diverse sequencing data types for phylogenomic analysis, applied to the angiosperm order Cucurbitales.
- Captus efficiently assembles and analyzes mixed data, recovering more complete loci across species, and reveals complex reticulate evolution patterns within Cucurbitales and Cucurbitaceae.
- The phylogenomic analysis supports the current classification of Cucurbitales but shows conflicting placement of Apodanthaceae, suggesting gene tree conflict as a cause for previous discrepancies in phylogenetic studies.
Abstract
A diverse range of high-throughput sequencing data, such as target capture, RNA-Seq, genome skimming, and high-depth whole genome sequencing, are used for phylogenomic analyses but the integration of such mixed data types into a single phylogenomic dataset requires a number of bioinformatic tools and significant computational resources. Here, we present a novel pipeline, Captus, to analyze mixed data in a fast and efficient way. Captus assembles these data types, allows searching of the assemblies for loci of interest, and finally produces alignments filtered for paralogs. If reference target loci are not available for the studied taxon, Captus can also be used to discover new putative homologs via sequence clustering. Compared to other software, Captus allows the recovery of a greater number of more complete loci across a larger number of species. We apply Captus to assemble a comprehensive mixed dataset, comprising the four types of sequencing data for the angiosperm order Cucurbitales, a clade of about 3,100 species in eight mainly tropical plant families, including begonias (Begoniaceae) and gourds (Cucurbitaceae). Our phylogenomic results support the currently accepted circumscription of Cucurbitales except for the position of the holoparasitic Apodanthaceae, which group with Rafflesiaceae in Malpighiales. A subset of mitochondrial gene regions supports the earlier position of Apodanthaceae in Cucurbitales. However, the nuclear regions and majority of mitochondrial regions place Apodanthaceae in Malpighiales. Within Cucurbitaceae, we confirm the monophyly of all currently accepted tribes but also reveal deep reticulation patterns both in Cucurbitales and within Cucurbitaceae. We show that contradicting results among earlier phylogenetic studies in Cucurbitales can be reconciled when accounting for gene tree conflict and demonstrate the efficiency of Captus for complex datasets.
bioinformatics · 2026-01-31 · v3
PanCNV-Explorer: Deciphering copy number alterations across human cancers
Kurz, N. S.; Kornrumpf, K.; Krüger, A.-R.; Dönitz, J.
AI Summary
- PanCNV-Explorer integrates copy number variation data from 33 cancer types and healthy tissues to create a comprehensive database.
- It combines CNV profiles with functional genomic data to identify context-specific oncogenic drivers, vulnerabilities, and therapeutic targets.
- The platform offers an interactive web interface and programmatic services for annotating user-submitted CNVs with functional, clinical, and pathogenicity insights.
Abstract
Copy number variants (CNVs) drive cancer progression and genetic disorders, yet the interpretation of their biological consequences and potential targeted therapeutic options remains fragmented across clinical, functional, and structural domains. To bridge this gap, we present PanCNV-Explorer, a comprehensive annotated CNV database integrating copy number variation data from pan-cancer and healthy cohorts. PanCNV-Explorer combines CNV profiles with functional genomic layers, including gene expression and CRISPR screening data, through a novel analytical framework. PanCNV-Explorer represents a systematic map of copy number variation in the human genome across 33 different cancer types and normal tissue, integrating pan-cancer and healthy samples into a comprehensive, harmonized database. These analyses reveal context-specific oncogenic drivers, vulnerabilities, and therapeutic targets, accessible via an interactive web interface for dynamic exploration and hypothesis generation. In addition, the web server provides programmatic web services for annotating user-submitted CNVs with functional annotation, clinical relevance, and pathogenicity predictions. PanCNV-Explorer serves as a pivotal resource for accelerating the analyses of copy number and structural variants in the human genome, bridging raw CNV data to actionable biological and clinical interpretations. A public web instance of the PanCNV-Explorer web server is available at https://mtb.bioinf.med.uni-goettingen.de/pancnv-explorer.
bioinformatics · 2026-01-31 · v3
FusionPath: Gene fusion pathogenicity prediction using protein structural data and contextual protein embeddings
Kurz, N. S.; Güven, I. B.; Beissbarth, T.; Dönitz, J.
AI Summary
- FusionPath is a deep learning framework designed to predict gene fusion pathogenicity by integrating protein embeddings, structural data, and functional annotations.
- It uses a hierarchical attention mechanism to weigh different feature contributions, achieving superior performance over existing methods with higher AUC scores.
- SHAP analysis showed that protein domains and GO terms provide interpretable, non-redundant signals, highlighting specific domains and processes crucial for pathogenicity prediction.
Abstract
Accurate prediction of gene fusion pathogenicity is critical for understanding oncogenic mechanisms and advancing precision oncology. While existing computational methods provide valuable insights, their performance remains limited by incomplete integration of multi-scale biological features and lack of interpretability. We present FusionPath, a novel deep learning framework for gene fusion pathogenicity prediction. FusionPath uniquely integrates embeddings from multiple pretrained protein language models, including FusON-pLM and ProtBERT, with retained protein domains and Gene Ontology (GO) functional annotations. A hierarchical attention mechanism dynamically weights the contribution of each feature type, enabling both high-accuracy prediction and biological interpretability. The model was trained and rigorously validated on a large-scale dataset of clinically annotated pathogenic and benign fusions. FusionPath significantly outperformed state-of-the-art methods, achieving higher AUC on independent test sets. Crucially, SHAP analysis revealed that protein domains and GO terms contributed non-redundant, biologically interpretable signals, with specific domains and GO processes exhibiting high predictive weights for pathogenicity. FusionPath establishes a new standard for gene fusion pathogenicity prediction by effectively leveraging complementary sequence, structural, and functional information. Its attention-driven interpretability provides actionable insights into the molecular determinants of fusion oncogenicity, facilitating biological discovery and clinical variant prioritization. The framework is publicly available to accelerate research in cancer genomics and therapeutic target identification.
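Illustrative sketch (not FusionPath's architecture): a minimal learned attention over three feature blocks (language-model embeddings, domain features, GO features), showing how per-modality weights can be produced and applied. The dimensions below are invented.

```python
# Illustrative only: softmax attention over modality embeddings, a minimal
# stand-in for the hierarchical attention the abstract describes.
import torch
import torch.nn as nn

class ModalityAttention(nn.Module):
    def __init__(self, dims, hidden=64):
        super().__init__()
        # project each feature block to a shared space, then score it
        self.proj = nn.ModuleList([nn.Linear(d, hidden) for d in dims])
        self.score = nn.Linear(hidden, 1)

    def forward(self, blocks):
        z = torch.stack([p(b) for p, b in zip(self.proj, blocks)], dim=1)
        w = torch.softmax(self.score(torch.tanh(z)).squeeze(-1), dim=1)
        return (w.unsqueeze(-1) * z).sum(dim=1), w  # fused vector, weights

plm = torch.randn(4, 1024)    # e.g. protein language-model embeddings
dom = torch.randn(4, 128)     # retained-domain features
go = torch.randn(4, 256)      # GO annotation features
fused, weights = ModalityAttention([1024, 128, 256])([plm, dom, go])
print(fused.shape, weights.shape)  # torch.Size([4, 64]) torch.Size([4, 3])
```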
bioinformatics · 2026-01-31 · v3
bioinformatics2026-01-31v2AstraKit: Customizable, reproducible workflows for biomedical research and precision medicine
Kurz, N. S.; Kornrumpf, K.; Stoves, M. K.; Doenitz, J.AI Summary
- AstraKit is a KNIME workflow suite designed to streamline precision medicine analytics by integrating variant interpretation, multi-omics analysis, and drug response modeling.
- It offers customizable workflows for dynamic variant annotation, multi-layered omics integration, and translational drug matching, validated in oncology cohorts.
- AstraKit's open-source and platform-independent nature enhances its utility in bioinformatics and precision medicine, available on KNIME Hub and GitLab.
Abstract
Motivation: Fragmented bioinformatics tools compel researchers and clinicians to resort to error-prone manual pipelines. The success of precision medicine depends largely on the efficient interpretation of genetic variants and the selection of highly effective targeted therapies; more broadly, precision medicine and biomedical research depend on efficient software solutions for processing and interpreting genetic variants, interpreting multi-omics data, and integrating drug screen analyses. Results: We present AstraKit, a unified KNIME workflow suite enabling end-to-end precision medicine analytics. AstraKit introduces three transformative innovations: 1) Dynamic variant interpretation with customizable annotation and filtering for disease-specific genomic contexts; 2) Multi-layered omics analyses integrating genomic, transcriptomic, and epigenetic data; and 3) Translational drug matching that correlates in vitro drug screens with clinical outcomes. Validated across oncology cohorts, AstraKit demonstrates concordance between experimental drug sensitivity and clinical trial responses, resolving discordances to uncover resistance mechanisms. By unifying variant analysis, multi-omics, and drug response modeling on a single customizable platform, AstraKit eliminates siloed workflows, accelerating biomarker validation and enabling clinicians to directly link molecular profiles to therapeutic decisions. As all AstraKit workflows are open-source and platform-independent, we provide a versatile, comprehensive software suite for a multitude of tasks in bioinformatics and precision medicine. Availability and implementation: The KNIME workflows are available at KNIME Hub https://hub.knime.com/bioinf_goe/spaces/Public/AstraKit~lfVsGBY2HnPYc1h1/. The source code is available at https://gitlab.gwdg.de/MedBioinf/mtb/astrakit.
bioinformatics2026-01-31v2Longevity Bench: Are SotA LLMs ready for aging research?
Zhavoronkov, A.; Sidorenko, D.; Naumov, V.; Pushkov, S.; Zagirova, D.; Aladinskiy, V.; Unutmaz, D.; Aliper, A.; Galkin, F.AI Summary
- LongevityBench was developed to evaluate if state-of-the-art LLMs can understand aging biology and utilize biodata for phenotype predictions.
- The benchmark includes tasks on predicting human time-to-death, mutation effects on lifespan, and age-related omics patterns, covering various biodata types.
- Testing revealed current LLMs' limitations, suggesting improvements for their application in aging research.
Abstract
Aging is a core biological process observed in most species and tissues, and it is studied with a vast array of technologies. We argue that the abilities of AI systems to emulate aging and to accurately interpret biodata in its context are the key criteria to judge an LLM's utility in biomedical research. Here, we present LongevityBench -- a collection of tasks designed to assess whether foundation models grasp the fundamental principles of aging biology and can use low-level biodata to arrive at phenotype-level conclusions. The benchmark covers a variety of prediction targets including human time-to-death, mutations' effect on lifespan, and age-dependent omics patterns. It spans all common biodata types used in longevity research: transcriptomes, DNA methylation profiles, proteomes, genomes, clinical blood tests and biometrics, as well as natural language annotations. After ranking state-of-the-art foundation models using LongevityBench, we highlight their weaknesses and outline procedures to maximize their utility in aging research and the life sciences.
bioinformatics2026-01-30v2Diffusion-based Representation Integration for Foundation Models Improves Spatial Transcriptomics Analysis
Jain, A.; Pham, T. M.; Laidlaw, D. H.; Ma, Y.; Singh, R.AI Summary
- DRIFT integrates spatial context into foundation models using diffusion on spatial graphs from spatial transcriptomics (ST) data to enhance tasks like cell-type annotation and clustering.
- The framework uses heat kernel diffusion to incorporate local neighborhood context while preserving transcriptomic representations from single-cell models.
- Benchmarking showed DRIFT significantly improves performance of foundational models on ST tasks compared to specialized methods.
Abstract
Motivation: We propose DRIFT, a framework that integrates spatial context into the input representations for foundation models by leveraging diffusion on spatial graphs derived from spatial transcriptomics (ST) data. ST captures gene expression profiles while preserving spatial context, enabling downstream analysis tasks such as cell-type annotation, clustering, and cross-sample alignment. However, due to its emerging nature, there are very few foundation models that can utilize ST data to generate embeddings generalizable across multiple tasks. Meanwhile, well-documented foundational models trained on large-scale single-cell gene expression (scRNA-seq) data have demonstrated generalizable performance across scRNA-seq assays, tissues, and tasks; however, they do not leverage the spatial information in ST data. We use heat kernel diffusion to propagate embeddings across spatial neighborhoods, incorporating the local neighborhood context of the ST data while preserving the transcriptomic representations learned by state-of-the-art single-cell foundation models. Results: We systematically benchmark five foundational models (both scRNA-seq and ST-based) across key ST tasks such as annotation, alignment, and clustering, ensuring a comprehensive evaluation of our proposed framework. Our results show that DRIFT significantly improves the performance of existing foundational models on ST data over specialized state-of-the-art methods. Overall, DRIFT is an effective, accessible, and generalizable framework that bridges the gap toward universal models for modeling spatial transcriptomics. Availability and Implementation: Code and data available at https://github.com/rsinghlab/DRIFT.
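As an illustration of the heat-kernel diffusion step described above, here is a minimal sketch assuming a kNN spatial graph and a combinatorial graph Laplacian; the function name and parameters are assumptions for illustration, not the DRIFT codebase.

```python
# Heat-kernel diffusion of foundation-model embeddings over a spatial kNN graph.
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse import diags
from scipy.sparse.linalg import expm_multiply

def diffuse_embeddings(coords, X, k=8, t=0.5):
    """coords: (n_cells, 2) spatial positions; X: (n_cells, d) embeddings."""
    A = kneighbors_graph(coords, k, mode="connectivity", include_self=False)
    A = 0.5 * (A + A.T)                               # symmetrize adjacency
    L = diags(np.asarray(A.sum(axis=1)).ravel()) - A  # graph Laplacian L = D - A
    return expm_multiply(-t * L, X)                   # apply heat kernel exp(-tL) to X

coords = np.random.rand(500, 2)      # toy spatial coordinates
X = np.random.randn(500, 64)         # toy single-cell embeddings
X_spatial = diffuse_embeddings(coords, X)  # spatially context-aware embeddings
```

The diffusion time `t` controls how far local neighborhood information propagates: small `t` keeps embeddings close to the original transcriptomic representation, large `t` smooths across wider spatial neighborhoods.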
bioinformatics2026-01-30v2Evidence of off-target probe binding affecting 10x Genomics Xenium gene panels compromises accuracy of spatial transcriptomic profiling
Hallinan, C.; Ji, H. J.; Tsou, E.; Salzberg, S. L.; Fan, J.AI Summary
- Investigated off-target binding in 10x Genomics Xenium technology using a developed software tool, Off-target Probe Tracker (OPT), to identify potential off-target binding in a human breast gene panel.
- Found that at least 14 out of 313 genes were potentially affected by off-target binding to protein-coding genes.
- Validated findings by comparing Xenium data with Visium CytAssist and single-cell RNA-seq, showing that some gene expression patterns reflected both target and off-target genes.
Abstract
The accuracy of spatial gene expression profiles generated by probe-based in situ spatially-resolved transcriptomic technologies depends on the specificity with which probes bind to their intended target gene. Off-target binding, defined as a probe binding to something other than the target gene, can distort a gene's true expression profile, making probe specificity essential for reliable transcriptomics. Here, we investigated off-target binding affecting the 10x Genomics Xenium technology. We developed a software tool, Off-target Probe Tracker (OPT), to identify putative off-target binding via alignment of probe sequences and assessing whether mapped loci corresponded to the intended target gene across multiple reference annotations. Applying OPT to a Xenium human breast gene panel, we identified at least 14 out of the 313 genes in the panel potentially impacted by off-target binding to protein-coding genes. To substantiate our predictions, we leveraged a Xenium breast cancer dataset generated using this gene panel and compared results to orthogonal spatial and single-cell transcriptomic profiles from Visium CytAssist and 3' single-cell RNA-seq derived from the same tumor block. Our findings indicate that for some genes, the expression patterns detected by Xenium demonstrably reflect the aggregate expression of the target and predicted off-target genes based on Visium and single-cell RNA-seq rather than the target gene alone. We further applied OPT to identify potential off-target binding in custom gene panels and integrate tissue-specific RNA-seq data to assess effects. Overall, this work enhances the biological interpretability of spatial transcriptomics data and improves reproducibility in spatial transcriptomics research.
bioinformatics2026-01-30v2Zero-shot biological reasoning with open-weights large language models reproduces CRISPR screen based prediction of synthetic lethal interactions.
Prosz, A. G.; Sztupinszki, Z.; Diossy, M.; Zimon, B.; Csabai, I. G.; Szallasi, Z.AI Summary
- This study tested open-weight Large Language Models (LLMs) for predicting synthetic lethal interactions, focusing on their ability to replicate results from CRISPR knockout screens.
- The best-performing model, Qwen2.5-32B-Instruct, achieved an AUROC of 0.715, outperforming random chance and showing that model size influences performance.
- An in silico screen of 398,277 gene pairs from 893 clinically relevant genes was conducted, demonstrating the potential of LLMs for scalable prediction of synthetic lethal interactions in cancer research.
Abstract
Identifying clinically relevant synthetic lethal interactions has great potential for uncovering novel therapeutic vulnerabilities in cancer. Current approaches rely on machine learning models that estimate probabilities of synthetic lethal interactions without supplying explicit knowledge of the underlying biology, and they lack a human-readable interpretation of what led to the prediction. Large Language Models (LLMs) represent a new class of tools capable of reasoning and leveraging extensive biological knowledge acquired from relevant literature during their pretraining. Here, we tested multiple open-weight LLMs for their ability to predict known and novel synthetic lethal interactions. We found that most of the tested models were better than random chance at reconstructing the results of three known genome-wide CRISPR knockout screens; we also observed that their performance was related to the parameter size of the model and that, on average, they benefited little from additional pathway and genetic information beyond what they already possess when estimating the likelihood of a synthetic lethal relationship. After selecting the best-performing and most computationally efficient model for our use case (Qwen2.5-32B-Instruct, 0.715 AUROC), we performed an in silico screen of 398,277 gene pairs from 893 clinically relevant genes. Our goal was to highlight the potential of open-weight LLMs as scalable, context-aware prioritization tools for synthetic lethal interactions, and to lay the groundwork for predicting higher-order genetic interactions.
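The evaluation step, scoring LLM-assigned likelihoods against screen-derived labels, reduces to a standard AUROC computation; in this sketch the scores and labels are placeholders, not model outputs or screen data.

```python
# AUROC of LLM-derived synthetic-lethality likelihoods vs CRISPR screen labels.
import numpy as np
from sklearn.metrics import roc_auc_score

# y_true: 1 if the gene pair was synthetic lethal in the knockout screen
y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])
# llm_scores: per-pair likelihoods parsed from the LLM's answer (assumed 0-1)
llm_scores = np.array([0.82, 0.31, 0.45, 0.67, 0.12, 0.55, 0.40, 0.28])

print(f"AUROC = {roc_auc_score(y_true, llm_scores):.3f}")
```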
bioinformatics2026-01-30v1clinTALL: machine learning-driven multimodal subtype classification and treatment outcome prediction in pediatric T-ALL
Stoiber, L.; Antic, Z.; Rebellato, S.; Fazio, G.; Rademacher, A.; Lenk, L.; Locatelli, F.; Balduzzi, A.; Cario, G.; Rizzari, C.; Cazzaniga, G.; Yu, J.; Bergmann, A. K.AI Summary
- The study developed clinTALL, a deep learning pipeline for classifying subtypes and predicting treatment outcomes in pediatric T-ALL using multimodal data.
- The transcriptomic-only model achieved 92.2% accuracy in subtype prediction and a 65.9% C-index for event-free survival (EFS), while integrating all data modalities improved EFS prediction to 67.5%.
- Validation on an internal dataset showed 81.8% accuracy for subtype prediction, with clinTALL available as a Docker application for broader use.
Abstract
Background: Childhood T-lineage acute lymphoblastic leukemia (T-ALL) is an aggressive hematologic malignancy with poor prognosis. Unlike B-cell precursor ALL, T-ALL lacks effective risk stratification strategies. A recent study has integrated whole genome and whole transcriptome data to define over 15 distinct molecular subtypes with prognostic significance. However, clinical translation of this knowledge remains challenging due to the complexity of interpreting high-dimensional multi-omics-based data. Methods: Here, we present clinTALL, a deep learning-based multi-task pipeline for pediatric T-ALL subtype classification and treatment outcome estimation. The model integrates multimodal input data and uses a neural network architecture to generate a shared latent embedding for jointly learned multi-task prediction. The competing risk-based model was used to predict event-specific outcomes. The model was trained on a publicly available multimodal dataset comprising clinical, genomic and transcriptomic features of 1309 pediatric T-ALL samples. Results: We observed that the transcriptomic-only model achieved superior single-modality results, with 92.2% accuracy for subtype prediction and a 65.9% concordance index (C-index) for event-free survival (EFS) in a cross-validation setup. Integrating all data modalities maintained high subtype classification accuracy (91.7%) and improved the overall concordance index for EFS estimation to 67.5%. The competing risk-based model enables accurate predictions of induction failure (C-index = 96.0%) and second malignant neoplasm (C-index = 62.1%). We validated molecular subtype predictions on an internal dataset of 120 pediatric T-ALL samples and obtained an accuracy of 81.8%. To facilitate the broad application of multi-omics-based subtype prediction and treatment outcome inference, we provide clinTALL as a Docker-based application, allowing for user-friendly access to the tool. The full source code of clinTALL is available on GitHub (https://github.com/UKWgenommedizin/clinTALL). Conclusion: Together, our machine learning-based framework allows for automated, accurate subtype classification and treatment outcome inference using multimodal input data, advancing precision risk stratification for pediatric T-ALL.
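The C-index reported for EFS above is Harrell's concordance index; a minimal sketch of how it is computed, using lifelines with toy numbers rather than the clinTALL pipeline or its data:

```python
# Harrell's concordance index for event-free survival (toy example).
import numpy as np
from lifelines.utils import concordance_index

event_times = np.array([5.0, 8.0, 3.0, 12.0, 7.0])    # years to event or censoring
predicted_risk = np.array([0.9, 0.4, 0.8, 0.1, 0.5])  # higher = higher predicted risk
observed = np.array([1, 0, 1, 0, 1])                  # 1 = event occurred, 0 = censored

# concordance_index expects predictions where larger values mean longer survival,
# so we negate the risk scores.
print(f"C-index = {concordance_index(event_times, -predicted_risk, observed):.3f}")
```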
bioinformatics2026-01-30v1A systematic assessment of machine learning for structural variant filtering
Kalra, A.; Paulin, L.; SEDLAZECK, F.AI Summary
- The study benchmarked five machine learning approaches for filtering structural variants (SVs) in long-read sequencing data using GIAB samples HG002 and HG005.
- A Random Forest classifier using 15 genomic features achieved a peak F1-score of 95.7%, comparable to more complex models like ResNet50 (95.9%) and diffusion-based methods (95.8%).
- Simpler models were found to offer the best balance of accuracy, speed, and interpretability, suggesting that increased model complexity needs justification beyond marginal performance improvements.
Abstract
Background: Accurate discrimination of true structural variants (SVs) from artifacts in long-read sequencing data remains a critical bottleneck. Numerous machine learning solutions have been proposed, ranging from classical models using engineered features to advanced deep learning and foundation model interpretability methods. However, a systematic comparison of their performance, efficiency, and practical utility is lacking. Results: We conducted a comprehensive benchmark of five machine learning paradigms for SV filtering using standardized Genome in a Bottle (GIAB) data for samples HG002 and HG005. We evaluated classical Random Forest classifiers on 15 genomic features, computer vision models (ResNet/VICReg), diffusion-based anomaly detection, sparse autoencoders (SAEs) on the Evo2-7B foundation model, and multimodal ensembles. A simple Random Forest on interpretable features achieved a peak F1-score of 95.7%, effectively matching all more complex models (ResNet50: 95.9%, Diffusion: 95.8%). This study represents the first application of diffusion-based anomaly detection and sparse autoencoders to structural variant analysis; while diffusion models learned highly discriminative, disentangled representations and SAEs uncovered biologically interpretable features (including atoms that were specific for ALU deletions, chromosome X variants and insertion events), they did not significantly surpass this classification ceiling. Ensemble methods offered no performance benefit but may have future potential given the orthogonality of vision-based and linear features. Conclusions: Our findings demonstrate that for the established task of germline SV filtering, simpler, interpretable models provide an optimal balance of accuracy, speed, and transparency. This benchmark establishes a pragmatic framework for method selection and argues that increased model complexity must be justified by clear, unmet biological needs rather than marginal predictive gains.
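The classical baseline at the center of this benchmark is straightforward to reproduce in outline; the sketch below uses random placeholder features in place of the paper's 15 engineered genomic features (e.g. read support, mapping quality, SV length), which are assumptions here.

```python
# Random Forest SV filter over engineered per-variant features, scored by F1.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 15))         # stand-ins for 15 engineered genomic features
y = rng.integers(0, 2, size=n)       # 1 = true SV, 0 = artifact (toy labels)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)
print(f"F1 = {f1_score(y_te, clf.predict(X_te)):.3f}")
```

On real, informative features this kind of model is fast to train, and its feature importances remain directly inspectable, which is the interpretability advantage the paper emphasizes.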
bioinformatics2026-01-30v1Systematic Data-Driven Penalty Calibration for Constrained Quantum Optimization with Application to Molecular Docking
Mukherjee, P.; Mandal, S.AI Summary
- The paper introduces MMP, a framework for quantum optimization of constrained molecular docking, focusing on converting these problems into QUBO formulations.
- MMP uses data-driven penalty calibration through classical analysis, systematic penalty sweeps, and adaptive constraint QAOA to optimize solution quality.
- Benchmarks showed 99.7% solution validity and a 25.5% improvement over static-penalty methods, with future work aimed at real molecular docking applications.
Abstract
This paper describes MMP, a three-stage framework for systematic quantum optimization of constrained molecular docking problems. The protocol addresses the formulation bottleneck: the critical challenge of translating constrained optimization problems into valid QUBO (Quadratic Unconstrained Binary Optimization) formulations for quantum solvers. MMP replaces heuristic penalty tuning with data-driven calibration through: (1) classical solution-space analysis to validate fragment libraries before quantum deployment, (2) systematic penalty sweeps to identify optimal Goldilocks Zone coefficients, and (3) MAC-QAOA (MMP Adaptive Constraint QAOA) with layer-dependent penalty decay. Preliminary benchmarks on synthetic constrained optimization problems demonstrate 99.7% solution validity at identified elbow points and 25.5% improvement in solution quality over static-penalty QAOA. MMP is hardware-agnostic but designed for near-term devices including Pasqal Orion Gamma (140+ qubits). The theoretical framework, algorithmic details, and preliminary validation results of the protocol are discussed, establishing a systematic methodology for quantum-augmented optimization workflows for drug discovery. All benchmarks are conducted on synthetic constrained optimization instances that reproduce structural features of docking formulations; application to real molecular docking targets is left for future work.
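To illustrate what a penalty sweep looks like in the QUBO setting, here is a toy, brute-force sketch on an exactly-one constraint; the costs and penalty grid are invented, and this is a sketch of the general idea rather than the MMP protocol itself.

```python
# Penalty sweep for a QUBO encoding "pick exactly one of n fragments".
import itertools
import numpy as np

costs = np.array([3.0, 1.0, 2.0, 4.0])  # objective: pick the cheapest fragment

def qubo_energy(x, lam):
    # H(x) = c.x + lam * (sum(x) - 1)^2 penalizes violating the constraint
    return costs @ x + lam * (x.sum() - 1) ** 2

for lam in [0.1, 0.5, 1.0, 2.0, 5.0]:
    best = min(itertools.product([0, 1], repeat=len(costs)),
               key=lambda x: qubo_energy(np.array(x), lam))
    valid = sum(best) == 1
    print(f"lambda={lam:>4}: argmin={best}, constraint satisfied={valid}")
# Too small a penalty lets the minimizer violate the constraint (all zeros win);
# the sweep locates the elbow beyond which ground states stay valid.
```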
bioinformatics2026-01-30v1Accurate haplotype-resolved de novo assembly of human genomes with RFhap
Gonzalez, D.; Cabas, G.; Miquel, J. F.; Moraga, C.; Salas, F.; Di Genova, A.AI Summary
- RFhap is a new trio-based long-read phasing method that uses multi-k-mer parent-specific markers, an alignment-free k-mer lookup engine, and a random forest classifier to improve haplotype-resolved de novo assembly.
- When benchmarked on four human trio datasets, RFhap significantly improved haplotype NG50 (24.3 Mb vs 13.1 Mb) and reduced switch error rates (0.111% vs 0.236%) compared to Hifiasm-Trio.
- RFhap also decreased long-switch errors by ~3-fold, particularly in repetitive regions, enhancing the accuracy of diploid human genome assembly.
Abstract
Haplotype-resolved de novo assemblies enable genome-wide separation of maternal and paternal variation, improving the interpretation of complex variants relevant to human disease. Trio-aware assemblers such as Hifiasm-Trio leverage parental short reads by deriving parent-specific k-mers to guide phasing within the long-read assembly graph; however, fixed k-mer length and heuristics cannot be optimal in the presence of sequencing errors and graph complexity in repetitive regions, contributing to phasing errors and reduced haplotype-resolved contiguity. Here we present RFhap, a trio-based long-read phasing method that integrates multi-k-mer parent-specific markers with an alignment-free k-mer lookup engine and a random forest classifier to assign long reads to maternal, paternal, or unknown haplotypes prior to de novo assembly. We benchmarked RFhap on four human trio datasets from the Human Pangenome project spanning two ONT chemistries (R10.4 and R9.4.1). Using Merqury to evaluate downstream assemblies, RFhap nearly doubled corrected haplotype NG50 (mean 24.3 Mb vs 13.1 Mb) and halved switch error rates (mean 0.111% vs 0.236%) relative to Hifiasm-Trio, while remaining competitive in consensus QV and parental k-mer completeness. Consistent with improved phasing, RFhap reduced long-switch errors by ~3-fold across datasets, particularly within interspersed repeats. Together, these results demonstrate that RFhap improves phasing accuracy from standard trio data, representing a step toward accurate and automated diploid human de novo assembly from long reads.
bioinformatics2026-01-30v1A Large-Scale Concordance Study of Toxicity Findings Across Preclinical Species and Humans for Small Molecules and Biologics in Drug Development
Liu, X.; Fan, F.AI Summary
- The study aimed to improve the translation of preclinical safety findings to human risk assessment by analyzing a large dataset of 7,565 drugs, addressing previous limitations in drug coverage and AE matching.
- Using likelihood ratios and semantic/mechanistic AE pairing, the research identified 850 significant identical-term AEs and 2,833 unique endpoints from cross-term associations.
- The study provides open access to its analytical tools and results through an interactive web application and a multi-agent AI system, ToxAgents, to enhance transparency and reproducibility in drug safety research.
Abstract
Translating preclinical safety findings into reliable insights for human risk assessment remains a fundamental challenge in drug development. Prior preclinical-clinical concordance studies have been constrained by limited drug coverage, reliance on identical-term matching for adverse events (AEs), and insufficient consideration of species, modality, exposure, and biological or mechanistic context. To address these gaps, we assembled a large cross-species concordance dataset, integrating standardized preclinical and clinical safety data for 7,565 marketed and investigational drugs from PharmaPendium and OFF-X. Our framework employs likelihood ratios to reduce prevalence bias and extends concordance assessment beyond identical-term matches to include semantically and mechanistically related AE pairs. Stratified analyses by species, modality, and exposure-matched subsets further refined translational relevance, while integration of on- and off-target annotations supports mechanistic interpretation and potential screening. Using this approach, we identified 850 significant identical-term AEs and 2,833 additional unique endpoints from cross-term associations. To promote reproducibility and transparency in animal research, we provide open access to the analytic code and statistical results via an interactive web application. An accompanying multi-agent AI system (ToxAgents) enables standardized querying and interpretation of concordance results. Together, these resources extend previous foundational efforts and establish a shared, data-driven platform to advance translational safety science, support evidence-based study design aligned with the 3Rs, and ultimately contribute to the development of safer medicines to improve human health.
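The likelihood-ratio statistic used above to reduce prevalence bias can be illustrated with a toy 2x2 contingency table; the counts below are invented and not from the study's dataset.

```python
# Positive likelihood ratio for one adverse-event (AE) term across drugs.
def positive_likelihood_ratio(tp, fn, fp, tn):
    """LR+ = sensitivity / (1 - specificity)."""
    sensitivity = tp / (tp + fn)   # P(preclinical finding | human AE)
    specificity = tn / (tn + fp)   # P(no preclinical finding | no human AE)
    return sensitivity / (1 - specificity)

# e.g. 40 drugs with both the finding and the AE, 60 with the AE only,
# 25 with the finding only, and 375 with neither:
print(f"LR+ = {positive_likelihood_ratio(40, 60, 25, 375):.2f}")  # 6.40
```

Unlike raw concordance rates, LR+ is insensitive to how common the AE is across the drug population, which is why it suits cross-species comparisons of unevenly reported events.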
bioinformatics2026-01-30v1ARUNA: Slice-based self-supervised imputation for upscaling DNA methylation sequencing assays
Singh, J.; Lee, W.-h.; Yu, G.; Yao, V.AI Summary
- ARUNA is a self-supervised denoising convolutional autoencoder designed to upscale DNA methylation data from RRBS to WGBS resolution by predicting genome-wide CpG-level methylation from sparse data.
- It uses methylation "slices" to preserve local correlation structure, enhancing the biological relevance of the imputed data.
- In simulations and real data applications, ARUNA outperformed existing methods, successfully handling 80-95% missingness in RRBS data and validated by comparison with WGBS replicates.
Abstract
Whole-genome bisulfite sequencing (WGBS) can provide near-comprehensive, base-resolution maps of DNA methylation, transforming our understanding of epigenetic regulation in development and disease, but its cost is often prohibitive for many studies. Reduced representation bisulfite sequencing (RRBS) offers a cost-effective alternative that profiles a CpG-enriched subset of the genome at base resolution. The similar sequencing protocols of the two assays create an opportunity for cross-assay integration, enabling a massive increase in sample sizes at whole-genome resolution. However, existing imputation methods are designed for within-assay scenarios and cannot handle the substantial CpG coverage differences between WGBS and RRBS. We introduce ARUNA, a self-supervised denoising convolutional autoencoder that predicts genome-wide CpG-level methylation using only a small subset of observed methylation values and CpG coordinates. By modeling methylation "slices", spatially stacked windows that preserve local correlation structure, ARUNA captures biologically meaningful covariation while avoiding representation collapse. In simulation studies using the GTEx dataset, ARUNA successfully upscales RRBS-scale sparse methylomes (80-95% missingness) to whole-genome resolution, consistently outperforming baselines and maintaining robust performance across donor and tissue holdouts. When applied to real RRBS data from the ENCODE dataset, ARUNA outperformed state-of-the-art methods, with performance validated by matching upscaled RRBS samples to isogenic WGBS replicates. Source code for ARUNA can be found at https://github.com/ylaboratory/ARUNA.
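The "slice" input representation, fixed-width windows of CpG methylation with a missingness mask, might be constructed along these lines; the window size, stride, and layout are assumptions for illustration, not ARUNA's actual preprocessing.

```python
# Build masked methylation "slices" as inputs for self-supervised imputation.
import numpy as np

def make_slices(beta, window=128, stride=128):
    """beta: (n_cpgs,) methylation levels in genomic order, NaN where unobserved."""
    slices, masks = [], []
    for start in range(0, len(beta) - window + 1, stride):
        w = beta[start:start + window]
        m = ~np.isnan(w)                  # 1 where the CpG was observed
        slices.append(np.nan_to_num(w))   # zero-fill missing entries
        masks.append(m.astype(np.float32))
    return np.stack(slices), np.stack(masks)

beta = np.random.rand(10_000)
beta[np.random.rand(10_000) < 0.9] = np.nan  # simulate ~90% RRBS-style missingness
X, M = make_slices(beta)  # model inputs + observation masks for denoising training
```

A denoising autoencoder trained on such pairs learns to reconstruct the full window from the sparse observed entries, which is the upscaling task the abstract describes.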
bioinformatics2026-01-30v1EMTscore infers divergent EMT pathways from omics data and enables rapid screening for correlated gene sets
wen, h.; Bleris, L.; Hong, T.AI Summary
- EMTscore is a new computational tool designed to analyze epithelial-mesenchymal transition (EMT) pathways using single-cell or bulk omics data.
- It offers unbiased scoring methods for multiple EMT pathways, addressing the need for standardization in EMT analysis.
- The tool enables rapid screening to identify correlations between EMT and other cellular processes.
Abstract
Quantitative analyses of epithelial-mesenchymal transition (EMT) have been widely used in several areas of biomedical sciences due to EMT's importance in development and cancer progression, but its multi-contextual nature requires standardization and implementation of gene set scoring methods beyond the capacities of conventional tools. We developed EMTscore, a package that provides an efficient implementation of unbiased scoring methods for multiple EMT pathways using individual single-cell or bulk omics data, and that allows rapid screening for relationships between EMT and other cellular processes.
bioinformatics2026-01-30v1How different AI models understand cells differently
Zhao, Y.; Sun, D.; Hao, M.; Xiong, Y.; Li, C.; Gong, T.; Wei, L.; Zhang, X.AI Summary
- The study introduces scGeneLens, a framework to analyze how different AI single-cell foundation models (scFMs) understand cell transcriptomics.
- By modifying attention mechanisms and using techniques like attention propagation and integrated gradients, scGeneLens was applied to scFoundation and scGPT.
- Findings reveal scFoundation focuses on cell-type marker genes for better cell-type separation, while scGPT emphasizes genes in shared pathways, enhancing generalization across conditions.
Abstract
AI single-cell foundation models (scFMs) are believed to be able to learn essential relations in cell transcriptomics with the attention modules in Transformer, but there is no method to reveal what they actually learned. We observed that different models may grasp different aspects of relations. To unravel the mystery, we propose scGeneLens, a framework for dissecting how scFMs perceive cells. We employed a sparse block attention to replace the original attention mechanism to concentrate attention on a few dominant gene-gene relations, used attention propagation to trace how the relations propagate across Transformer layers, and used integrated gradients to disentangle the relative contributions of gene identity and expression in cell representations. We applied it to scFoundation and scGPT and show that they exhibit pluralistic perceptions of cells: scFoundation emphasizes relations among cell-type marker genes, resulting in stronger cell-type separability, whereas scGPT focuses more on genes involved in shared cellular pathways and core biological activities, leading to representations that generalize across conditions. The framework provides a unified lens for probing what scFMs learn about cells and offers actionable insights for the design of future cellular foundation models. Our code is available at https://anonymous.4open.science/r/scGeneLens-B771/.
bioinformatics2026-01-30v1Pseudotime graph diffusion for post hoc visualization of inferred single-cell trajectories
Lukas, B. E.; Pang, J.; Koh, T. J.; Dai, Y. E.AI Summary
- The study introduces Pseudotime Graph Diffusion (PGD), a method for enhancing the visualization of single-cell trajectories by smoothing cell-level features along pseudotime.
- PGD uses random-walk diffusion on a pseudotime graph to propagate information, improving the continuity and structure of inferred trajectories.
- Application of PGD to monocytes and macrophages during wound healing showed improved visualization and extended to trajectory-aware gene expression smoothing, enhancing the interpretation of cellular dynamics.
Abstract
Visual representations are widely used to interpret trajectories in single-cell data; however, they do not always faithfully capture inferred trajectory structure. As a result, interpretation of cellular dynamics and downstream analyses may be compromised. Here, we present Pseudotime Graph Diffusion (PGD), a lightweight and interpretable post hoc framework for smoothing cell-level features along pseudotime. PGD operates by performing random-walk diffusion on a pseudotime graph, propagating information along inferred trajectory paths to enhance continuity and structure. We demonstrate that PGD-smoothed embeddings improve visualization of increasingly complex inferred trajectories of monocytes and macrophages during wound healing. We further show that PGD extends naturally to trajectory-aware gene expression smoothing. By improving agreement between visual representations and inferred trajectories, PGD enables more faithful interpretation of dynamic cellular processes.
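A minimal sketch of random-walk diffusion along a pseudotime graph follows, assuming a kNN graph built directly on pseudotime values; this illustrates the idea rather than reproducing the PGD package.

```python
# Random-walk smoothing of a cell-level feature along inferred pseudotime.
import numpy as np
from sklearn.neighbors import kneighbors_graph

def pseudotime_diffusion(pseudotime, feature, k=10, steps=5):
    """Diffuse a per-cell feature over a kNN graph built in pseudotime."""
    A = kneighbors_graph(pseudotime.reshape(-1, 1), k, mode="connectivity").toarray()
    A = np.maximum(A, A.T)                   # symmetrize adjacency
    P = A / A.sum(axis=1, keepdims=True)     # row-stochastic random-walk matrix
    smoothed = feature.copy()
    for _ in range(steps):
        smoothed = P @ smoothed              # one diffusion step along the trajectory
    return smoothed

t = np.sort(np.random.rand(300))                    # inferred pseudotime (toy)
expr = np.sin(4 * t) + 0.5 * np.random.randn(300)   # noisy per-cell gene expression
expr_smooth = pseudotime_diffusion(t, expr)          # trajectory-aware smoothed values
```

The number of diffusion steps plays the same role as a smoothing bandwidth: more steps propagate information further along the inferred trajectory.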
bioinformatics2026-01-30v1DVPNet: A New XAI-Based Interpretable Genetic Profiling Framework Using Nucleotide Transformer and Probabilistic Circuits
Kusumoto, T.AI Summary
- The study introduces DVPNet, an XAI-based genetic profiling framework using Nucleotide Transformer and probabilistic circuits for classifying genetic data.
- The framework was tested on a single-cell lung cancer dataset, achieving 0.97 training accuracy and 0.94 test accuracy.
- Gene rankings were derived from the model's probabilistic contributions, offering insights beyond traditional genetic analysis methods.
Abstract
This research provides an XAI-driven genetic profiling approach that may contribute to scientific discoveries in genetic research. We propose a new explainable AI (XAI) classification algorithm that combines probabilistic circuits with the Nucleotide Transformer. By leveraging the strong feature-extraction capability of the Nucleotide Transformer, we design a tractable classification framework based on probabilistic circuits while preserving decomposability and smoothness. To demonstrate the capability of this algorithm, we used the GSE131907 single-cell lung cancer atlas and created a dataset consisting of cancer-cell and normal-cell classes. From each sample, 900 gene types were randomly selected and converted into embedding vectors by the Nucleotide Transformer, after which the classification model was trained. The model demonstrated high representational capacity, achieving an accuracy of 0.97 on the training set and high robustness to unknown genetic contexts with an accuracy of 0.94 on the test set. We extracted the probabilistic contribution for each class from the tractable classification model and defined a contribution score for the cancer-cell class. Gene rankings were then created based on these scores. These rankings may reflect the inherent properties of each gene for the classification task. These analyses go beyond traditional statistical or gene-expression-level approaches, providing new academic insights in genetic research.
bioinformatics2026-01-30v1FluNexus: a versatile web platform for antigenic prediction and visualization of influenza A viruses
Li, X.; Zhou, C.; Wu, H.; Xiao, K.; Hao, J.; Zhao, D.; Zhu, J.; Li, Y.; Peng, J.; Gu, J.; Deng, G.; Cai, W.; Li, M.; Liu, Y.; Shang, X.; Chen, H.; Kong, H.AI Summary
- FluNexus is a web platform designed for antigenic prediction and visualization of influenza A viruses, addressing the need for a user-friendly tool in vaccine strain selection and pandemic preparedness.
- It features data preprocessing for HA1 and HI data of H1, H3, and H5 subtypes, and uses a novel manifold-based method for antigenic mapping with sparse data.
- The platform provides an interactive interface for antigenic prediction, visualization of antigenic evolution, and supports decision-making in vaccine strain selection.
Abstract
Influenza A viruses continuously undergo antigenic evolution to escape host immunity induced by previous infections or vaccinations, consequently causing seasonal epidemics and occasional pandemics. Antigenic prediction and visualization of influenza A viruses are crucial for precise vaccine strain selection and robust pandemic preparedness. However, a user-friendly online platform for these capabilities remains notably absent, despite widespread demand. Here, we present FluNexus (https://flunexus.com), the first-of-its-kind, one-stop-shop web platform designed to facilitate the prediction and visualization of the antigenic change in emerging variants. FluNexus features a data preprocessing module for hemagglutinin subunit 1 (HA1) and hemagglutination inhibition (HI) data across three major public health threat subtypes (H1, H3 and H5). Meanwhile, FluNexus provides an interactive interface for online antigenic prediction and offers practical guidance for researchers. Most notably, FluNexus offers the visualization of influenza A virus antigenic evolution, providing intuitive insights into its antigenic dynamics. Specifically, FluNexus proposes a novel manifold-based method for positioning antigens and antisera, generating accurate antigenic cartographies even with sparse HI data. By alleviating the programming burden on biologists, FluNexus supports more informed decision-making in vaccine strain selection and strengthens surveillance and pandemic preparedness.
bioinformatics2026-01-30v1Agentomics: An Agentic System that Autonomously Develops Novel State-of-the-art Solutions for Biomedical Machine Learning Tasks
Martinek, V.; Gariboldi, A.; Tzimotoudis, D.; Galea, M.; Zacharopoulou, E.; Alberdi Escudero, A.; Blake, E.; Cechak, D.; Cassar, L.; Balestrucci, A.; Alexiou, P.AI Summary
- Agentomics is an autonomous LLM-powered system designed for end-to-end machine learning (ML) experimentation in biomedical fields.
- It implements various ML strategies on given datasets, with strict validation checkpoints, and supports biomedical foundation models.
- Evaluated on 20 datasets, Agentomics outperformed other agentic systems and produced novel state-of-the-art models for 11/20 datasets compared to human expert solutions.
Abstract
Motivation: Extracting knowledge from biomedical data is crucial for advancing our understanding of biological systems and developing novel therapeutics. The quantity, quality, and resolution of biomedical data constantly evolves, requiring the automation of biomedical machine learning (ML). Existing Automated ML tools lack flexibility, while Large Language Models (LLMs) struggle to consistently deliver reproducible machine learning codebases, and existing LLM Agent-powered solutions lag behind human-engineered ML models. Results: Here, we introduce Agentomics, an autonomous LLM-powered agentic system for end-to-end ML experimentation. Given a biomedical dataset, Agentomics implements various ML modeling strategies, and produces a ready-to-use ML model. Agentomics introduces strict validation checkpoints for standard ML development steps, allowing gradual development on top of working code with defined interfaces and validated artifacts. Further, it offers native support for biomedical foundation models that can be leveraged during experimentation. The generic nature of Agentomics allows the user to create ML solutions for a large variety of datasets and use various LLMs. We evaluate Agentomics across 20 datasets from the domains of Protein Engineering, Drug Discovery, and Regulatory Genomics. When benchmarked against other agentic systems, Agentomics outperformed them in all tested domains. When benchmarked against human expert solutions, Agentomics generated novel state-of-the-art models for 11/20 established benchmark datasets. Availability and Implementation: Agentomics is implemented in Python. Source code and documentation are freely available at: https://github.com/BioGeMT/Agentomics-ML. Contact: panagiotis.alexiou@um.edu.mt
bioinformatics2026-01-30v1BEACON: predicting side effects and therapeutics outcomes to drugs by Bridging knowlEdge grAph with CONtextual language model
Xu, C.; Xu, J.; Bulusu, K.; Pan, H.; Elemento, O.AI Summary
- BEACON integrates knowledge graphs with contextual language models to predict drug side effects and therapeutic outcomes, converting biomedical entities into tokens and relationships into syntactic dependencies.
- It outperforms existing methods in predicting drug sensitivity in cancer cell lines (AUROC 0.941) and drug-drug interactions (AUROC 0.964 on TwoSIDES, 0.84 on FDA data).
- Analysis with BEACON on acalabrutinib showed enriched interactions with drugs metabolized by CYP3A enzymes, validated by network proximity analysis.
Abstract
Biomedical knowledge graphs encode millions of relationships between drugs, proteins, pathways, and diseases, yet translating this structured knowledge into accurate predictions remains challenging. Existing deep learning approaches, including graph neural networks and knowledge graph embeddings, assign fixed representations to entities regardless of biological context, limiting their ability to capture how the same gene or pathway functions differently across scenarios. These methods also lack interpretability and often fail when applied to novel drugs outside their training distribution. Here we present BEACON (Bridging knowlEdge grAph with CONtextual language model), a framework that transforms knowledge graphs into contextual sentence representations processable by language models. BEACON converts biomedical entities into tokens and relationships into syntactic dependencies, creating "sentence trees" that preserve graph structure while enabling contextual processing. A visibility matrix ensures that attention patterns respect the underlying knowledge graph topology, and a perturbation-based evaluation module identifies the specific genes, enzymes, and pathways driving each prediction. We demonstrate BEACON's versatility through two clinically important applications. For drug sensitivity prediction in cancer cell lines, BEACON achieves 0.941 AUROC and Spearman r = 0.919, outperforming existing methods (DrugCell, DeepCDR and DeepTTA). For drug-drug interaction (DDI) prediction, BEACON achieves 0.964 AUROC on the TwoSIDES benchmark and 0.84 AUROC on temporally held-out FDA adverse event data (2013-2023), demonstrating robust generalization to newly approved drugs. Applying BEACON to the BTK inhibitor acalabrutinib revealed that predicted interactions are enriched for drugs metabolized by CYP3A enzymes (OR = 3.01, P = 4.3 x 10^-4), a mechanism validated through network proximity analysis. BEACON provides a unified, interpretable approach to knowledge graph-enhanced biomedical prediction.
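The visibility-matrix idea, restricting attention to knowledge-graph edges, can be sketched as follows; the tiny graph and entity names are invented for illustration and this is not BEACON's implementation.

```python
# Additive attention mask that only lets tokens attend along graph edges.
import torch

entities = ["acalabrutinib", "BTK", "CYP3A4", "B-cell_receptor_pathway"]
edges = [(0, 1), (0, 2), (1, 3)]  # drug-target, drug-enzyme, gene-pathway (toy)

n = len(entities)
visible = torch.eye(n, dtype=torch.bool)  # every token can attend to itself
for i, j in edges:
    visible[i, j] = visible[j, i] = True

# 0 where attention is allowed, -inf where the graph has no edge
attn_mask = torch.zeros(n, n)
attn_mask[~visible] = float("-inf")

scores = torch.randn(n, n)                        # raw attention logits (toy)
attn = torch.softmax(scores + attn_mask, dim=-1)  # topology-respecting attention
```

Because masked positions receive zero attention weight after the softmax, information can only flow along relationships that actually exist in the knowledge graph.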
bioinformatics2026-01-30v1IBDkb: an AI-enhanced integrative knowledge base for inflammatory bowel disease research and drug discovery
Tao, L.; Shi, S.; Zhu, R.; Liu, Z.; Yang, B.; Liu, L.; Chen, W.; Long, Q.; Jiao, N.; Zhang, G.; Xu, P.; Wu, D.AI Summary
- IBDkb is a web-based platform that integrates multi-source data on inflammatory bowel disease (IBD) to address fragmentation in research resources.
- It features AI tools for literature retrieval, trend analysis, and interactive visualizations, integrating 98,453 articles, 3,390 trials, and other relevant data.
- A case study showed its utility in drug repositioning, highlighting its role in accelerating hypothesis generation and translational research in IBD.
Abstract
IBDkb (Inflammatory Bowel Disease Knowledge Base; https://www.biosino.org/ibdkb) is a freely accessible, integrated web-based platform that systematically curates and harmonizes multi-source data related to inflammatory bowel disease (IBD). To address the fragmentation and therapeutic gaps in existing specialized resources, IBDkb establishes a unified framework featuring advanced full-text search, interactive visualizations, cross-module knowledge graphs, and AI-powered utilities for real-time literature retrieval, trend analysis, text/PDF interpretation, and domain-specific conversational assistance. The platform currently integrates 98,453 research articles, 3,390 clinical trials, 200 investigational drugs, 200,606 bioactive compounds, 103 therapeutic targets, 77 experimental models, 12 pathogenesis summaries, and 15 treatment strategies. These integrated tools facilitate efficient exploration of complex associations among drugs, targets, trials, and mechanisms, thereby accelerating hypothesis generation and translational research in IBD. The platform is openly available without registration and supports data downloads. A case study on structure-aware drug comparison further demonstrates its utility in facilitating cross-disease drug repositioning hypotheses.
bioinformatics2026-01-30v1CytoVerse: Single-Cell AI Foundation Models in the Browser
Currie, R.; Gonzalez Ferrer, J.; Mostajo-Radji, M. A.; Haussler, D.AI Summary
- CytoVerse addresses server constraints and privacy issues in mapping single-cell datasets by running scRNA-seq Foundation Models (scFM) in the browser.
- It uses ONNX for model deployment, IVFPQ for efficient indexing of over 20 million cells, and a protocol for sharing embeddings without exposing raw data.
- This framework offers a scalable, privacy-preserving solution for distributed single-cell analysis.
Abstract
Mapping single-cell datasets to large atlases is often hindered by server constraints and privacy concerns. We present CytoVerse, a framework that runs scRNA-seq Foundation Models (scFM) entirely in the browser. Three key contributions enable this: (1) deploying models via ONNX without server-side compute; (2) using compressed indexing (IVFPQ) to search a reference of more than 20 million cells from the client; and (3) a lightweight protocol for sharing embeddings across consortia without exposing raw data. CytoVerse thereby provides a scalable, privacy-preserving framework for distributed single-cell analysis.
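For readers unfamiliar with IVFPQ, the compressed-indexing idea can be sketched with the faiss library as below; the parameters are illustrative, and CytoVerse's in-browser implementation is its own code, not faiss.

```python
# IVFPQ: inverted-file coarse clustering + product-quantized compressed vectors.
import numpy as np
import faiss

d, n = 64, 100_000                  # embedding dim, reference size (toy scale)
xb = np.random.randn(n, d).astype("float32")

quantizer = faiss.IndexFlatL2(d)    # coarse quantizer over cluster centroids
index = faiss.IndexIVFPQ(quantizer, d, 1024, 8, 8)  # 1024 lists, 8 sub-quantizers, 8 bits
index.train(xb)                     # learn centroids and PQ codebooks
index.add(xb)

index.nprobe = 16                   # probe the 16 closest inverted lists per query
query = np.random.randn(5, d).astype("float32")
D, I = index.search(query, 10)      # 10 nearest reference cells per query cell
```

The compression matters here: PQ codes shrink each vector to a few bytes, which is what makes shipping a multi-million-cell reference index to a browser client plausible.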
bioinformatics2026-01-30v1ProMeta: A meta-learning framework for robust disease diagnosis and prediction from plasma proteomics
Li, H.; Gu, H.; Hu, L.; Zhang, Z.; Lv, Y.; Gao, P.; Cooper-Knock, J.; Min, Y.; Zeng, J.; Zhang, S.AI Summary
- ProMeta is a meta-learning framework designed to enhance disease diagnosis and prediction from plasma proteomics under data scarcity by integrating knowledge-guided pathway encoding and bi-level meta-optimization.
- It uses biobank-scale data to learn a global initialization with transferable biological priors, allowing rapid adaptation to new tasks.
- In benchmark tests, ProMeta outperformed baselines, achieving an AUROC of ~0.69 in 4-shot scenarios, and identified disease-specific biomarkers and pathways.
Abstract
The plasma proteome offers a dynamic window into human health, capturing the real-time intersections between genetics and physiology. However, the application of deep learning to proteomics is currently hindered by a reliance on large-scale labeled datasets, rendering standard models ineffective for rare or novel diseases where patient samples are inherently scarce. Here, we present ProMeta, a meta-learning framework designed to enable robust disease modeling under extreme data restrictions. By integrating knowledge-guided pathway encoding with bi-level meta-optimization, ProMeta projects unstructured proteomic profiles into biologically interpretable functional tokens. This architecture allows the model to learn a global initialization containing transferable biological priors from biobank-scale data, facilitating rapid adaptation to novel tasks. Through comprehensive benchmark experiments, ProMeta consistently outperformed transfer learning and traditional machine learning baselines in both disease diagnosis and prediction tasks. In the most challenging 4-shot scenarios (utilizing only 2 cases and 2 controls), the model achieved robust generalization with an average AUROC of ~0.69, representing a 24.6% relative improvement over the best-performing baseline methods. Mechanistic investigation revealed that ProMeta disentangles cases from controls in the latent space prior to task-specific adaptation, confirming the acquisition of universal biological rules rather than rote memorization. Furthermore, gradient-based interpretation identified disease-specific protein biomarkers and functional pathways consistent with known pathophysiology. Collectively, ProMeta overcomes the data-scarcity bottleneck in precision medicine, providing a scalable, interpretable framework for characterizing the full spectrum of human diseases, particularly for rare conditions lacking extensive clinical cohorts.
bioinformatics2026-01-30v1Exploring protein conformational ensembles using evolutionary conditional diffusion
cui, X.; Ge, L.; Yang, X.; Li, X.; Hou, D.; Zhou, X.; Zhang, G.AI Summary
- This study introduces DiffEnsemble, a diffusion-based framework for modeling protein conformational ensembles, using static structures from the Protein Data Bank and AlphaFold data as guidance.
- DiffEnsemble was benchmarked on 72 proteins, showing superior performance over BioEmu and AlphaFLOW, with improvements of 28.9% and 11.3% in correlation coefficients for ensemble metrics.
- The framework successfully captured dominant motions in 42% of the targets, demonstrating the utility of latent dynamical information from static data in modeling protein dynamics.
Abstract
Protein conformational ensembles encode the dynamic landscapes underlying biological function, regulation, and allostery. Accurately reconstructing such ensembles while balancing the accuracy of conformational distributions with physical plausibility remains a fundamental challenge in structural biology, particularly when dynamic data is scarce. Here, we propose DiffEnsemble, a diffusion-based framework designed for modeling protein conformational ensembles. DiffEnsemble learns latent dynamical representations from static protein structures in the Protein Data Bank, integrated with the structural profile derived from the AlphaFold Protein Structure Database as conditional guidance during the diffusion process. Benchmarking on 72 protein targets from the ATLAS molecular dynamics simulation dataset demonstrates that DiffEnsemble outperforms existing methods, including BioEmu and AlphaFLOW. Compared with AlphaFLOW, DiffEnsemble achieves improvements of 28.9% and 11.3% in Pearson correlation coefficients for ensemble pairwise root mean square deviation and root mean square fluctuation, respectively. Importantly, DiffEnsemble successfully captures the dominant motions for 42% of the targets. These results demonstrate that latent dynamical information embedded in static structural data can effectively support the modeling of protein conformational ensembles.
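One of the reported ensemble metrics, the Pearson correlation of per-residue root mean square fluctuation (RMSF) between a generated and a reference ensemble, can be computed as in this sketch; the coordinate arrays are synthetic stand-ins, not ATLAS data.

```python
# Per-residue RMSF for two aligned ensembles and their Pearson correlation.
import numpy as np
from scipy.stats import pearsonr

def rmsf(ensemble):
    """ensemble: (n_frames, n_residues, 3) aligned coordinates."""
    mean = ensemble.mean(axis=0)                       # average structure
    return np.sqrt(((ensemble - mean) ** 2).sum(-1).mean(0))

md = np.random.randn(100, 50, 3)               # reference MD ensemble (toy)
gen = md + 0.1 * np.random.randn(100, 50, 3)   # generated ensemble (toy)
r, _ = pearsonr(rmsf(md), rmsf(gen))
print(f"RMSF Pearson r = {r:.2f}")
```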
bioinformatics2026-01-30v1Monalisa: An Open Source, Documented, User-Friendly MATLAB Toolbox for Magnetic Resonance Imaging Reconstruction
Leidi, M.; Jia, Y.; Helbing, D.; Barranco, J.; Acikgoz, B. C.; Peper, E.; Ledoux, J.-B.; Bastiaansen, J. A. M.; Milani, B.; Franceschiello, B.AI Summary
- Monalisa is an open-source MATLAB toolbox designed for MRI reconstruction, focusing on non-Cartesian imaging and dynamic applications with motion.
- It features modular steps like data reading, trajectory computation, and supports various reconstruction methods including iterative-SENSE, GRAPPA, and regularized techniques.
- Benchmark tests showed Monalisa outperforming BART with higher SSIM, lower l2 error, and fewer artifacts, highlighting its effectiveness and educational value.
Abstract
Purpose: An open-source, user-friendly MATLAB framework for Magnetic Resonance Imaging (MRI) reconstruction was developed to simplify the reconstruction process, with a specific focus on non-Cartesian imaging and dynamic applications in the presence of motion. Methods: Monalisa decomposes the reconstruction pipeline into clear modular steps, including raw data reading with flexible file-type abstraction, trajectory computation, density compensation, advanced coil sensitivity mapping, and tailored binning strategies through its "mitosius" preprocessing stage. The framework supports a suite of reconstruction methods, including iterative-SENSE (also named CG-SENSE), GeneRalized Autocalibrating Partial Parallel Acquisition (GRAPPA) reconstructions, and regularized reconstructions supporting both spatial and temporal regularization using l1 (Compressed Sensing (CS)) and l2 techniques, accommodating both Cartesian and non-Cartesian acquisitions. We performed benchmark experiments comparing Monalisa with the Berkeley Advanced Reconstruction Toolbox (BART) on simulated 2D radial acquisitions. Results: The comparison demonstrates competitive performance, yielding higher Structural Similarity Index (SSIM) and lower l2 error. Notably, Monalisa reconstructions exhibited fewer visible artifacts than BART. Conclusion: By providing comprehensive documentation, Monalisa serves not only as a powerful tool for research and clinical imaging but also as an educational platform to facilitate innovation in MRI reconstruction.
bioinformatics2026-01-29v3