Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
Generative deep learning expands apo RNA conformational ensembles to include ligand-binding-competent cryptic conformations: a case study of HIV-1 TAR
Kurisaki, I.; Hamada, M.
AI Summary
- The study used Molearn, a hybrid molecular-dynamics-generative deep-learning model, to explore cryptic conformations of apo HIV-1 TAR RNA that could bind ligands.
- Molearn was trained on apo TAR conformations and generated a diverse ensemble, from which potential MV2003-binding conformations were identified.
- Docking simulations showed these conformations had RNA-ligand interaction scores similar to NMR-derived complexes, demonstrating the model's ability to predict ligand-binding competent RNA states.
Abstract
RNA plays vital roles in diverse biological processes and represents an attractive class of therapeutic targets. In particular, cryptic ligand-binding sites--absent in apo structures but formed upon conformational rearrangement--offer high specificity for RNA-ligand recognition, yet remain rare among experimentally-resolved RNA-ligand complex structures and difficult to predict in silico. RNA-targeted structure-based drug design (SBDD) is therefore limited by challenges in sampling cryptic states. Here, we apply Molearn, a hybrid molecular-dynamics-generative deep-learning model, to expand apo RNA conformational ensembles toward cryptic states. Focusing on the paradigmatic HIV-1 TAR-MV2003 system, Molearn was trained exclusively on apo TAR conformations and used to generate a diverse ensemble of TAR structures. Candidate cryptic MV2003-binding conformations were subsequently identified using post-generation geometric analyses. Docking simulations of these conformations with MV2003 yielded binding poses with RNA-ligand interaction scores comparable to those of NMR-derived complexes. Notably, this work provides the first demonstration that a generative modeling framework can access cryptic RNA conformations that are ligand-binding competent and have not been recovered in prior molecular-dynamics and deep-learning studies. Finally, we discuss current limitations in scalability and systematic detection, including application to the Internal Ribosome Entry Site, and outline future directions toward RNA-targeted SBDD.
bioinformatics · 2026-02-03 · v6
GCP-VQVAE: A Geometry-Complete Language for Protein 3D Structure
Pourmirzaei, M.; Morehead, A.; Esmaili, F.; Ren, J.; Pourmirzaei, M.; Xu, D.
AI Summary
- The study introduces GCP-VQVAE, a tokenizer using SE(3)-equivariant GCPNet to convert protein structures into discrete tokens while preserving chirality and orientation.
- Trained on 24 million protein structures, GCP-VQVAE achieves state-of-the-art performance with backbone RMSDs of 0.4377 Å, 0.5293 Å, and 0.7567 Å on CAMEO2024, CASP15, and CASP16 datasets respectively.
- On a zero-shot set of 1,938 new structures, it showed robust generalization with a backbone RMSD of 0.8193 Å and TM-score of 0.9673, and offers significantly reduced latency compared to previous models.
Abstract
Converting protein tertiary structure into discrete tokens via vector-quantized variational autoencoders (VQ-VAEs) creates a language of 3D geometry and provides a natural interface between sequence and structure models. While pose invariance is commonly enforced, retaining chirality and directional cues without sacrificing reconstruction accuracy remains challenging. In this paper, we introduce GCP-VQVAE, a geometry-complete tokenizer built around a strictly SE(3)-equivariant GCPNet encoder that preserves orientation and chirality of protein backbones. We vector-quantize rotation/translation-invariant readouts that retain chirality into a 4,096-token vocabulary, and a transformer decoder maps tokens back to backbone coordinates via a 6D rotation head trained with SE(3)-invariant objectives. Building on these properties, we train GCP-VQVAE on a corpus of 24 million monomer protein backbone structures gathered from the AlphaFold Protein Structure Database. On the CAMEO2024, CASP15, and CASP16 evaluation datasets, the model achieves backbone RMSDs of 0.4377 Å, 0.5293 Å, and 0.7567 Å, respectively, and achieves 100% codebook utilization on a held-out validation set, substantially outperforming prior VQ-VAE-based tokenizers and achieving state-of-the-art performance. Beyond these benchmarks, on a zero-shot set of 1,938 completely new experimental structures, GCP-VQVAE attains a backbone RMSD of 0.8193 Å and a TM-score of 0.9673, demonstrating robust generalization to unseen proteins. Lastly, we show that the Large and Lite variants of GCP-VQVAE are substantially faster than the previous SOTA (AIDO), reaching up to ~408x and ~530x lower end-to-end latency, while remaining robust to structural noise. We make the GCP-VQVAE source code, zero-shot dataset, and its pretrained weights fully open for the research community: https://github.com/mahdip72/vq_encoder_decoder
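The "6D rotation head" mentioned in the abstract refers to a standard continuous rotation parameterization: two 3-vectors orthonormalized by Gram-Schmidt (Zhou et al.). The paper's exact head architecture is not given here, so the sketch below shows only the generic 6D-to-rotation mapping such heads typically use:

```python
import numpy as np

def rotation_from_6d(x):
    """Map a 6D vector to a rotation matrix via Gram-Schmidt.

    Generic continuous 6D parameterization commonly used by "6D rotation
    heads"; a sketch, not GCP-VQVAE's actual decoder head.
    """
    a, b = x[:3], x[3:]
    r1 = a / np.linalg.norm(a)              # first basis vector
    b = b - np.dot(r1, b) * r1              # remove component along r1
    r2 = b / np.linalg.norm(b)              # second basis vector
    r3 = np.cross(r1, r2)                   # completes a right-handed frame
    return np.stack([r1, r2, r3], axis=0)   # rows form an orthonormal basis

R = rotation_from_6d(np.array([1.0, 0.2, -0.3, 0.5, 1.0, 0.0]))
assert np.allclose(R @ R.T, np.eye(3), atol=1e-8)   # orthonormal
assert np.isclose(np.linalg.det(R), 1.0)            # proper rotation, det +1
```

The appeal of this representation is that every 6D input yields a valid rotation, so the network can regress it without discontinuities.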
bioinformatics · 2026-02-03 · v3
Transcriptomic and protein analysis of human cortex reveals genes and pathways linked to NPTX2 disruption in Alzheimer's disease
Lao, Y.; Xiao, M.-F.; Ji, S.; Piras, I. S.; Kim, K.; Bonfitto, A.; Song, S.; Aldabergenova, A.; Sloan, J.; Trejo, A.; Geula, C.; Na, C.-H.; Rogalski, E. J.; Kawas, C. H.; Corrada, M. M.; Serrano, G. E.; Beach, T. G.; Troncoso, J. C.; Huentelman, M. J.; Barnes, C. A.; Worley, P. F.; Colantuoni, C.
AI Summary
- This study used bulk RNA sequencing and targeted proteomics on human cortex samples to explore genes and pathways associated with NPTX2 disruption in Alzheimer's disease (AD).
- NPTX2 expression was significantly reduced in AD, correlating with BDNF, VGF, SST, and SCG2, indicating a role in synaptic and mitochondrial functions.
- In AD, NPTX2-related synaptic and mitochondrial pathways weakened, while stress-linked transcriptional regulators increased, suggesting a shift in regulatory dynamics.
Abstract
The expression of NPTX2, a neuronal immediate early gene (IEG) essential for excitatory-inhibitory balance, is altered in the earliest stages of cognitive decline that precede Alzheimer's disease (AD). Here, we use NPTX2 as a point of reference to identify genes and pathways linked to its role in AD onset and progression. We performed bulk RNA sequencing on 575 middle temporal gyrus (MTG) samples across four cohorts, together with targeted proteomics in 135 of these same samples, focusing on 20 curated proteins spanning synaptic, trafficking, lysosomal, and regulatory categories. NPTX2 RNA and protein were significantly reduced in AD, and to a lesser extent in mild cognitive impairment (MCI) samples. RNA expression of BDNF, VGF, SST, and SCG2 correlated with both NPTX2 mRNA and protein levels. We identified NPTX2-correlated synaptic and mitochondrial programs that were negatively correlated with lysosomal and chromatin/stress modules. Gene set enrichment analysis (GSEA) of NPTX2 correlations across all samples confirmed broad alignment with synaptic and mitochondrial compartments, and more NPTX2-specific associations with proteostasis and translation regulator pathways, all of which were weakened in AD. In contrast, correlation of NPTX2 protein with transcriptomic profiles revealed negative associations with stress-linked transcription regulator RNAs (FOXJ1, ZHX3, SMAD5, JDP2, ZIC4), which were strengthened in AD. These results position NPTX2 as a hub of an activity-regulated "plasticity cluster" (BDNF, VGF, SST, SCG2) that encompasses interneuron function and is embedded on a neuronal/mitochondrial integrity axis that is inversely coupled to lysosomal and chromatin-stress programs. In AD, these RNA-level correlations broadly weaken, and stress-linked transcriptional regulators become more prominent, suggesting a role in NPTX2 loss of function. Individual gene-level data from the bulk RNA-seq in this study can be freely explored at [INSERT LINK].
bioinformatics · 2026-02-03 · v2
Automated Segmentation of Kidney Nephron Structures by Deep Learning Models on Label-free Autofluorescence Microscopy for Spatial Multi-omics Data Acquisition and Mining
Patterson, N. H.; Neumann, E. K.; Sharman, K.; Allen, J. L.; Harris, R. C.; Fogo, A. B.; deCaestecker, M. P.; Van de Plas, R.; Spraggins, J. M.
AI Summary
- Developed deep learning models for automated segmentation of kidney nephron structures using label-free autofluorescence microscopy.
- Models accurately segmented functional tissue units and gross kidney morphology with F1-scores >0.85 and Dice-Sorensen coefficients >0.80.
- Enabled quantitative association of lipids with segmented structures and spatial transcriptomics data acquisition from collecting ducts, showing differential gene expression in medullary regions.
Abstract
Automated spatial segmentation models can enrich spatio-molecular omics analyses by providing a link to relevant biological structures. We developed segmentation models that use label-free autofluorescence (AF) microscopy to recognize multicellular functional tissue units (FTUs) (glomerulus, proximal tubule, descending thin limb, ascending thick limb, distal tubule, and collecting duct) and gross morphological structures (cortex, outer medulla, and inner medulla) in the human kidney. Annotations were curated using highly specific multiplex immunofluorescence and transferred to co-registered AF for model training. All FTUs (except the descending thin limb) and gross kidney morphology were segmented with high accuracy: >0.85 F1-score, and Dice-Sorensen coefficients >0.80, respectively. This workflow allowed lipids, profiled by imaging mass spectrometry, to be quantitatively associated with segmented FTUs. The segmentation masks were also used to acquire spatial transcriptomics data from collecting ducts. Consistent with previous literature, we demonstrated differing transcript expression of collecting ducts in the inner and outer medulla.
bioinformatics · 2026-02-03 · v2
SpaCEy: Discovery of Functional Spatial Tissue Patterns by Association with Clinical Features Using Explainable Graph Neural Networks
Rifaioglu, A. S.; Ervin, E. H.; Sarigun, A.; Germen, D.; Bodenmiller, B.; Tanevski, J.; Saez-Rodriguez, J.
AI Summary
- SpaCEy uses explainable graph neural networks to analyze spatial tissue patterns from molecular marker expression, linking these patterns to clinical outcomes without predefined cell types.
- Applied to lung cancer, SpaCEy identified spatial cell arrangements and protein marker expressions linked to disease progression.
- In breast cancer datasets, SpaCEy stratified patients by overall survival, revealing key spatial patterns of protein markers across and within clinical subtypes.
Abstract
Tissues are complex ecosystems tightly organized in space. This organization influences their function, and its alteration underpins multiple diseases. Spatial omics allows us to profile its molecular basis, but how to leverage these data to link spatial organization and molecular patterns to clinical practice remains a challenge. We present SpaCEy (SpatialClinicalExplainability), an explainable graph neural network that uncovers organizational tissue patterns predictive of clinical outcomes. SpaCEy learns directly from molecular marker expression by modelling tissues as spatial graphs of cells and their interactions, without requiring predefined cell types or anatomical regions. Its embeddings capture intercellular relationships and molecular dependencies that enable accurate prediction of variables such as overall survival and disease progression. SpaCEy integrates a specialized explainer module that reveals recurring spatial patterns of cell organisation and coordinated marker expression that are most relevant to predictions of the models. Applied to a spatially resolved proteomic lung cancer cohort, SpaCEy discovers distinct spatial arrangements of cells together with coordinated expression of protein markers associated with disease progression. Across multiple breast cancer proteomic datasets, it consistently stratifies patients according to overall survival, both across and within established clinical subtypes. SpaCEy also highlights spatial patterns of a small set of key protein markers underlying this patient stratification.
bioinformatics · 2026-02-03 · v2
Informative Missingness in Nominal Data: A Graph-Theoretic Approach to Revealing Hidden Structure
Zangene, E.; Schwammle, V.; Jafari, M.
AI Summary
- This study introduces a graph-theoretic approach to analyze missing data in nominal datasets, treating missing values as informative signals rather than gaps.
- By constructing bipartite graphs from nominal variables, the method reveals hidden structures through modularity, nestedness, and similarity analysis.
- Applied across various domains, the approach showed that missing data patterns can distinguish between random and non-random missingness, enhancing structural understanding and aiding in tasks like clustering.
Abstract
Missing data is often treated as a nuisance, routinely imputed or excluded from statistical analyses, especially in nominal datasets where its structure cannot be easily modeled. However, the form of missingness itself can reveal hidden relationships, substructures, and biological or operational constraints within a dataset. In this study, we present a graph-theoretic approach that reinterprets missing values not as gaps to be filled, but as informative signals. By representing nominal variables as nodes and encoding observed or missing associations as edges, we construct both weighted and unweighted bipartite graphs to analyze modularity, nestedness, and projection-based similarities. This framework enables downstream clustering and structural characterization of nominal data based on the topology of observed and missing associations; edge prediction via multiple imputation strategies is included as an optional downstream analysis to evaluate how well inferred values preserve the structure identified in the non-missing data. Across a series of biological, ecological, and social case studies, including proteomics data, the BeatAML drug screening dataset, ecological pollination networks, and HR analytics, we demonstrate that the structure of missing values can be highly informative. These configurations often reflect meaningful constraints and latent substructures, providing signals that help distinguish between data missing at random and not at random. When analyzed with appropriate graph-based tools, these patterns can be leveraged to improve the structural understanding of data and provide complementary signals for downstream tasks such as clustering and similarity analysis. Our findings support a conceptual shift: missing values are not merely analytical obstacles but valuable sources of insight that, when properly modeled, can enrich our understanding of complex nominal systems across domains.
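As a minimal illustration of the bipartite encoding described above, the sketch below (toy data of our own, not from the paper) records, for each nominal variable, the set of samples in which it is missing, and compares variables by the Jaccard similarity of those sets — one simple form of the projection-based similarity the framework analyzes:

```python
# Toy nominal table: None marks a missing value.
table = {
    "color": ["red", None, "blue", None, "red"],
    "shape": ["sq",  None, None,   None, "tri"],
    "size":  ["s",   "m",  "l",    "s",  None],
}

# Bipartite "missingness graph": variable -> set of sample indices
# where that variable is missing (edges of the variable-sample graph).
missing = {var: {i for i, v in enumerate(vals) if v is None}
           for var, vals in table.items()}

def jaccard(a, b):
    """Projection-based similarity of two variables' missingness patterns."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Pairwise similarity between variables, projected over shared missing samples.
sim = {(u, v): jaccard(missing[u], missing[v])
       for u in table for v in table if u < v}
```

Here "color" and "shape" tend to be missing in the same samples (similarity 2/3), while "size" is missing independently of both — exactly the kind of co-missingness structure the paper treats as signal rather than noise.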
bioinformatics · 2026-02-03 · v2
Predicting unknown binding sites for transition metal based compounds in proteins
Levy, A.; Rothlisberger, U.
AI Summary
- This study evaluates the use of Metal3D and Metal1D, tools originally designed for zinc ion binding prediction, to identify binding sites for transition metal complexes in proteins.
- Both tools successfully predicted several known binding sites from apo protein structures, despite limitations like sensitivity to side-chain conformations.
- The research suggests a computational pipeline where these tools could initially identify potential binding sites, followed by refinement with more precise methods.
Abstract
Transition metal based compounds are promising therapeutic agents, particularly in cancer treatment. However, predicting their binding sites remains a major challenge. In this work, we investigate the applicability of two tools, Metal3D and Metal1D, for this purpose. Although originally trained to predict zinc ion binding sites only, both predictors successfully identify several experimentally observed binding sites for transition metal complexes directly from apo protein structures. At the same time, we highlight current limitations, such as the sensitivity to side-chain conformations, and discuss possible strategies for improvement. This work provides a first step toward establishing a robust computational pipeline in which rapid and low-cost predictors are able to identify putative hotspots for transition metal binding, which can then be refined using more accurate but computationally demanding methods.
bioinformatics · 2026-02-03 · v1
PPGLomics: An Interactive Platform for Pheochromocytoma and Paraganglioma Transcriptomics
Alkaissi, H.; Gordon, C. M.; Pacak, K.
AI Summary
- PPGLomics is an interactive web platform for analyzing pheochromocytoma and paraganglioma (PPGL) transcriptomics, addressing the lack of disease-specific bioinformatics resources.
- It integrates the TCGA-PCPG (n=160) and A5 consortium SDHB (n=91) datasets, offering tools for differential expression, correlation, survival analysis, and various visualizations.
- The platform is designed for use by scientists and healthcare professionals without requiring bioinformatics expertise and is freely accessible online.
Abstract
Pheochromocytoma and paraganglioma (PPGL) are rare neuroendocrine tumors with unique biological behavior and remarkably high heritability, yet dedicated bioinformatics resources for these diagnoses remain limited. Existing cancer multi-omics platforms are pan-cancer in scope, often lacking the disease-specific annotations, granularity, and cross-database harmonization required for meaningful stratification and hypothesis generation. Here we introduce PPGLomics, an interactive web-based platform designed for comprehensive PPGL transcriptomics analysis. PPGLomics v1.0 integrates two major datasets, the TCGA-PCPG cohort (n=160) spanning multiple molecular subtypes, and the A5 consortium SDHB cohort (n=91) with detailed clinicopathological and molecular annotations. The platform provides basic and clinical scientists, as well as a broad range of healthcare professionals, with tools for differential expression analysis, correlation analysis, survival analysis, and visualization, including boxplots, heatmaps, volcano plots, and Kaplan-Meier survival plots, enabling exploration of gene expression patterns across PPGL subtypes without requiring bioinformatics expertise. PPGLomics v1.0 is freely available at https://alkaissilab.shinyapps.io/PPGLomics.
bioinformatics · 2026-02-03 · v1
PlotGDP: an AI Agent for Bioinformatics Plotting
Luo, X.; Shi, Y.; Huang, H.; Wang, H.; Cao, W.; Zuo, Z.; Zhao, Q.; Zheng, Y.; Xie, Y.; Jiang, S.; Ren, J.
AI Summary
- PlotGDP is an AI agent-based web server designed for creating high-quality bioinformatics plots using natural language commands, eliminating the need for coding or environment setup.
- It leverages large language models (LLMs) to process user-uploaded data on a remote server, ensuring ease of use.
- The platform uses curated template scripts to reduce the risk of errors from LLMs, aiming to enhance bioinformatics visualization for global research.
Abstract
High-quality bioinformatics plotting is important for biology research, especially when preparing publications. However, a steep learning curve and complex coding-environment configuration are often unavoidable costs of producing publication-ready plots. Here, we present PlotGDP (https://plotgdp.biogdp.com/), an AI agent-based web server for bioinformatics plotting. Built on large language models (LLMs), the plotting agent is designed to accommodate various types of bioinformatics plots while remaining easy to use through simple natural-language commands. No coding experience or environment deployment is required: all user-uploaded data is processed by LLM-generated code on our remote high-performance server. Additionally, all plotting sessions are based on curated template scripts to minimize the risk of LLM hallucinations. With PlotGDP, we hope to serve the global biology research community with an online platform for fast, high-quality bioinformatics visualization.
bioinformatics · 2026-02-03 · v1
HiChIA-Rep quantifies the similarity between enrichment-based chromatin interactions datasets
Kim, S. S.; Jackson, J. T.; Zhang, H. B.; Kim, M.
AI Summary
- HiChIA-Rep is an algorithm designed to quantify the similarity between datasets from enrichment-based 3D genome mapping technologies like ChIA-PET and HiChIP.
- It uses both 1D and 2D signals through graph signal processing to assess data reproducibility.
- HiChIA-Rep effectively distinguishes biological replicates from non-replicates and outperforms tools designed for Hi-C data.
Abstract
3D genome mapping technologies such as ChIA-PET, HiChIP, PLAC-seq, HiCAR, and ChIATAC yield pairwise contacts together with a one-dimensional signal indicating protein binding or chromatin accessibility. However, the lack of computational tools to quantify the reproducibility of these enrichment-based 3C data prevents rigorous data quality assessment and interpretation. We developed HiChIA-Rep, an algorithm that incorporates both 1D and 2D signals to measure similarity via graph signal processing methods. HiChIA-Rep distinguishes biological replicates from non-replicates, as well as datasets from different cell lines and protein factors, outperforming tools designed for Hi-C data. With the growing number of multi-ome datasets being generated, HiChIA-Rep is likely to become a fundamental tool for the 3D genomics community.
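One standard graph-signal-processing quantity that couples a 1D signal to 2D contacts is the Laplacian quadratic form. The toy sketch below (our illustration of the general idea, not HiChIA-Rep's actual algorithm) treats the contact matrix as a weighted graph and scores a 1D enrichment signal by how smoothly it varies across strong contacts:

```python
import numpy as np

def laplacian(W):
    """Combinatorial graph Laplacian L = D - W of a symmetric contact matrix."""
    return np.diag(W.sum(axis=1)) - W

def smoothness(W, s):
    """Laplacian quadratic form s^T L s: small values mean the 1D signal
    changes little across strongly weighted 2D contacts."""
    return float(s @ laplacian(W) @ s)

# Toy symmetric 2D contact matrix over three bins, and two 1D signals.
W = np.array([[0, 2, 0],
              [2, 0, 1],
              [0, 1, 0]], float)
aligned = np.array([1.0, 1.0, 0.9])   # similar values on contacting bins
opposed = np.array([1.0, 0.0, 1.0])   # jumps across the strongest contact
assert smoothness(W, aligned) < smoothness(W, opposed)
```

Comparing such graph-filtered representations between two datasets is one simple route to a 1D-plus-2D similarity score of the kind the abstract describes.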
bioinformatics · 2026-02-03 · v1
MOSAIC: A Structured Multi-level Framework for Probabilistic and Interpretable Cell-type Annotation
Yang, M.; Qi, J.; Lan, M.; Huang, J.; Jin, S.
AI Summary
- MOSAIC is a multi-level framework for cell-type annotation in single-cell RNA sequencing that integrates cell-level marker evidence with cluster-level population context.
- It uses a probabilistic approach to handle uncertainty, mixed states, and population structure, improving upon single-level annotation methods.
- Across six tissues and under dropout perturbations, MOSAIC matched or outperformed other methods, providing structured uncertainty estimates and identifying stable intermediate cell states.
Abstract
Accurate cell-type annotation is a foundational task in single-cell RNA sequencing analysis, yet remains fundamentally challenged by cellular heterogeneity, gradual lineage transitions, and technical noise. As single-cell atlases expand in scale and resolution, most existing annotation approaches operate at a single analytical level and encode cell identity as fixed categorical labels, limiting their ability to represent uncertainty, mixed biological states, and population-level structure. Here we introduce MOSAIC (Multi-level prObabilistic and Structured Adaptive IdentifiCation), a structured multi-level annotation framework that integrates cell-level marker evidence with cluster-level population context within a unified probabilistic system. Rather than treating annotation as an independent per-cell prediction task, MOSAIC formulates cell-type assignment as a coordinated multi-level inference process, in which probabilistic evidence at the single-cell level is aggregated, constrained, and refined by population context. MOSAIC integrates direction-aware marker scoring with dual-layer probabilistic representation and adaptive cross-level refinement, enabling uncertainty to be quantified and propagated across biological scales. This design yields coherent annotations that preserve fine-grained single-cell variation while maintaining population-level consistency, and allows ambiguous or transitional states to be represented explicitly rather than collapsed into hard labels. Across six diverse tissues and under controlled dropout perturbations, MOSAIC consistently matches or outperforms representative marker-based, reference-based, and machine-learning annotation methods. Beyond accuracy, MOSAIC provides structured uncertainty estimates and coherent population-level structure, enabling the identification of stable intermediate cell states that arise from gradual lineage transitions rather than technical noise. 
Together, MOSAIC advances cell-type annotation from a single-level classification task to a structured multi-level inference problem, and establishes a general, interpretable, and uncertainty-aware computational framework for large-scale single-cell analysis.
bioinformatics · 2026-02-03 · v1
PepMCP: A Graph-Based Membrane Contact Probability Predictor for Membrane-Lytic Antimicrobial Peptides
Dong, R.; Awang, T.; Cao, Q.; Kang, K.; Wang, L.; Zhu, Z.; Song, C.
AI Summary
- This study introduces PepMCP, a graph-based model for predicting membrane contact probability (MCP) of short antimicrobial peptides (AMPs) targeting bacterial membranes.
- Over 500 membrane-lytic AMPs were used to train PepMCP, employing coarse-grained molecular dynamics simulations and the GraphSAGE framework.
- PepMCP achieved a Pearson correlation coefficient of 0.883 and RMSE of 0.123, enhancing mechanism-driven AMP discovery with the MemAMPdb database and a web server for access.
Abstract
Motivation: The membrane-lytic mechanism of antimicrobial peptides (AMPs) is often overlooked during their in silico discovery process, largely due to the lack of a suitable metric for the membrane-binding propensity of peptides. Previously, we proposed a characteristic called membrane contact probability (MCP) and applied it to the identification of membrane proteins and membrane-lytic AMPs. However, previous MCP predictors were not trained on short peptides targeting bacterial membranes, which may result in unsatisfactory performance for peptide studies. Results: In this study, we present PepMCP, a peptide-tailored model for predicting MCP values of short peptides. We collected more than 500 membrane-lytic AMPs from the literature, conducted coarse-grained molecular dynamics (MD) simulations for these AMPs, and extracted their residue MCP labels from MD trajectories to train PepMCP. PepMCP employs the GraphSAGE framework to address this node regression task, encoding each peptide sequence as a graph with 4-hop edges. PepMCP achieved a Pearson correlation coefficient of 0.883 and an RMSE of 0.123 on the node-level test set. It can recognize membrane-lytic AMPs with the predicted MCP values for each sequence, thereby facilitating mechanism-driven AMP discovery. Additionally, we provide a database, MemAMPdb, which includes the membrane-lytic AMPs, as well as the PepMCP web server for easy access. Availability and Implementation: The code and data are available at https://github.com/ComputBiophys/PepMCP.
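A minimal sketch of the graph construction described above: one node per residue, with undirected edges linking residues up to four positions apart along the chain. This is our reading of "4-hop edges"; the real model additionally attaches node features before GraphSAGE message passing:

```python
def peptide_graph(seq, hops=4):
    """Build a residue graph for node regression: one node per residue,
    undirected edges between residues at most `hops` apart on the chain.

    A simplified reading of PepMCP's 4-hop edges; node features and the
    GraphSAGE layers themselves are omitted.
    """
    n = len(seq)
    edges = [(i, j)
             for i in range(n)
             for j in range(i + 1, min(i + hops, n - 1) + 1)]
    return list(range(n)), edges

# An 8-residue toy peptide: 8 nodes, each linked to up to 4 downstream residues.
nodes, edges = peptide_graph("KWKLFKKI")
```

Each edge would then carry messages between residue embeddings during GraphSAGE aggregation, so a residue's predicted MCP reflects its local sequence neighborhood.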
bioinformatics · 2026-02-03 · v1
Attractor Landscape Analysis Distinguishes Aging Markers from Rejuvenation Targets in Human Keratinocytes
Copes, N.; Canfield, C.-A. E.
AI Summary
- The study used PRISM, a computational pipeline integrating pseudotime trajectory and Boolean network analysis, to identify rejuvenation targets in aging human keratinocytes from single-cell RNA sequencing data.
- Two distinct aging trajectories were identified: one where cells converge to an aged state (Y_272) and another where cells depart from a youthful state (Y_308).
- Key findings included BACH2 knockdown as the top rejuvenation target for Y_272, improving the aging score by 98.9%, and ASCL2 knockdown for Y_308, with enhanced effects when combined with ATF6 perturbation.
Abstract
Cellular aging is characterized by progressive changes in gene expression that contribute to tissue dysfunction; however, identifying genes that regulate the aging process, rather than merely serve as biomarkers, remains a significant challenge. Here we present PRISM (Pseudotime Reversion via In Silico Modeling), a computational pipeline that integrates pseudotime trajectory analysis with Boolean network analysis to identify cellular rejuvenation targets from single-cell RNA sequencing data. We applied PRISM to a published dataset of human skin comprising 47,060 cells from nine donors aged 18 to 76 years. Analysis of keratinocytes revealed two distinct aging trajectories with fundamentally different regulatory architectures. One trajectory (labeled Y_272) exhibited "aging as convergence," where cells were driven toward a single dominant aged attractor (aging score +2.181). A second trajectory (labeled Y_308) exhibited "aging as departure," where cells escaped from a dominant youthful attractor basin (aging score -0.536). Systematic perturbation analysis revealed a critical distinction between genes exhibiting age-related expression changes (phenotypic markers) and genes controlling attractor landscape architecture (regulatory controllers). Switch genes marking the aging trajectories proved largely ineffective as intervention targets, while master regulators operating at higher levels of the regulatory hierarchy produced substantial rejuvenation effects. BACH2 knockdown was identified as the dominant intervention for Y_272, shifting the aging score by Δ = -3.746 (98.9% improvement). ASCL2 knockdown was identified as the top target for Y_308, with synergistic enhancement observed through combinatorial perturbation with ATF6.
These findings demonstrate that attractor-based analysis identifies different and potentially superior therapeutic targets compared to expression-based approaches and provide specific hypotheses for experimental validation of cellular rejuvenation strategies in human skin.
bioinformatics · 2026-02-03 · v1
Predicting mutation-rate variation across the genome using epigenetic data
Katori, M.; Kobayashi, T. J.; Nordborg, M.; Shi, S.
AI Summary
- The study integrates epigenetic data (histone marks, DNA methylation, chromatin accessibility) with de novo mutation data in Arabidopsis thaliana to model mutation probability at the coding sequence level.
- Using non-negative matrix factorization, 15 epigenetic patterns were identified, stratifying coding sequences into six classes with different mutation probabilities.
- A predictive model based on these patterns outperformed others, showing that epigenetic context significantly influences local mutation rates, with changes under hypoxia indicating dynamic chromatin effects on mutation probability.
Abstract
Mutation rate variation is a fundamental driver of evolution, yet how it is locally patterned across genomes and structured by chromatin context remains unresolved. Here, we integrate genome-wide profiles of histone marks, DNA methylation and chromatin accessibility in Arabidopsis thaliana with de novo mutation data to model mutation probability at the level of coding sequence (CDS). Using non-negative matrix factorization, we identify 15 combinatorial epigenetic patterns whose graded mixtures stratify CDSs into six classes with distinct mutation probabilities. A generalized linear model based on pattern weights predicts local mutation probability and outperforms models based on sequence context, expression and classical genomic categories. These patterns capture context-dependent variation that is obscured by gene-level summaries and single-feature analyses. Cluster-level differences are partly retained in mutation-accumulation lines, indicating persistence into heritable mutational input. Under hypoxia, stress-responsive chromatin remodeling redistributes epigenetic contexts associated with higher predicted mutation probability toward hypoxia-responsive genes and DNA-repair pathways. Together, our results provide a CDS-resolved and interpretable framework linking combinatorial epigenomic context to mutational input, clarifying how dynamic chromatin states shape local mutation-rate heterogeneity.
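The factorization step can be illustrated with a minimal multiplicative-update NMF (a generic sketch on toy data, not the paper's pipeline): rows of a nonnegative CDS-by-feature matrix are decomposed into pattern weights, and each CDS can then be stratified by its dominant pattern:

```python
import numpy as np

def nmf(X, k, iters=200, seed=0):
    """Minimal multiplicative-update NMF: X (n x m, nonnegative) ~ W @ H,
    where W (n x k) holds per-row pattern weights and H (k x m) the
    patterns. Illustrative only; real pipelines add model selection,
    convergence checks, and regularization.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, k)) + 1e-3
    H = rng.random((k, m)) + 1e-3
    for _ in range(iters):
        # Lee-Seung updates: preserve nonnegativity, reduce ||X - WH||_F.
        H *= (W.T @ X) / (W.T @ W @ H + 1e-9)
        W *= (X @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H

X = np.random.default_rng(1).random((30, 6))  # toy CDS-by-feature matrix
W, H = nmf(X, k=3)
classes = W.argmax(axis=1)  # stratify rows by their dominant pattern
```

In the paper's setting the rows would be coding sequences, the columns epigenetic features, and the graded mixture in each row of W is what stratifies CDSs into mutation-probability classes.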
bioinformatics · 2026-02-03 · v1
GAISHI: A Python Package for Detecting Ghost Introgression with Machine Learning
Huang, X.; Hackl, J.; Kuhlwilm, M.
AI Summary
- GAISHI is a Python package designed to detect ghost introgression using machine learning techniques like logistic regression and UNet++.
- It addresses the limitation of previous studies by providing a software implementation for identifying introgressed segments and alleles.
- The package's utility was demonstrated in a Human-Neanderthal introgression scenario.
Abstract
Summary: Ghost introgression is a challenging problem in population genetics. Recent studies have explored supervised learning models, namely logistic regression and UNet++, to detect genomic footprints of ghost introgression. However, their applicability is limited because existing implementations are tailored to the tasks in their respective publications and are not distributed as reusable software. Here, we present GAISHI, a Python package for identifying introgressed segments and alleles using machine learning, and demonstrate its usage in a Human-Neanderthal introgression scenario. Availability and implementation: GAISHI is available on GitHub under the GNU General Public License v3.0. The source code can be found at https://github.com/xin-huang/gaishi.
bioinformatics | 2026-02-03 | v1
A modality gap in personal-genome prediction by sequence-to-function models
Mostafavi, S.; Tu, X.; Spiro, A.; Chikina, M.
AI Summary
- The study evaluated AlphaGenome's ability to predict personal genome variations in gene expression and chromatin accessibility.
- AlphaGenome performed near the heritability ceiling for chromatin accessibility but significantly underperformed for gene expression compared to baseline.
- Findings suggest chromatin accessibility is influenced by local regulatory elements, while gene expression requires integration of long-range regulatory effects, which current models struggle with.
Abstract
Sequence-to-function (S2F) models trained on reference genomes have achieved strong performance on regulatory prediction and variant-effect benchmarks, yet they still struggle to predict inter-individual variation in gene expression from personal genomes. We evaluated AlphaGenome on personal genome prediction in two molecular modalities--gene expression and chromatin accessibility--and observed a striking dichotomy: AlphaGenome approaches the heritability ceiling for chromatin accessibility variation, but remains far below baseline for gene-expression variation, despite improving over Borzoi. Context truncation and fine-mapped QTL analyses indicate that accessibility is governed by local regulatory grammar captured by current architectures, whereas gene-expression variation requires long-range regulatory integration that remains challenging.
bioinformatics | 2026-02-03 | v1
An agentic framework turns patient-sourced records into a multimodal map of ALS heterogeneity
Li, Z.; Gao, C.; Kong, J.; Fu, Y.; Wen, S.; Li, G.; Cao, Y.; Fu, Y.; Zhang, H.; Jia, S.; Liu, X.; Cai, L.; Yan, F.; Liu, X.; Tian, L.
AI Summary
- The study introduces MEDSTREM, an LLM-based agent that transforms patient-sourced document images into standardized electronic health records, facilitating cohort building and linkage to trials and multi-omics data.
- By analyzing 8,298 individuals' clinical reports, MEDSTREM generated 17,602 records and multi-omics profiles, identifying five ALS subtypes and a continuous degeneration score.
- Key findings include functional loss tracking with hand-grip strength and forced vital capacity, malnutrition as a modifiable factor, and epigenetic changes like cell-cycle suppression and chromatin opening linked to clinical severity.
Abstract
ALS shows marked clinical heterogeneity, yet much real-world evidence remains trapped in unstructured reports. Here we introduce MEDSTREM, a large-language-model (LLM)-based agent that converts patient-sourced document images into standardized longitudinal electronic health records, enabling bottom-up cohort building and linkage to trials and multi-omics. By applying MEDSTREM to clinical report images from 8,298 individuals collected via AskHelpU and harmonizing with PRO-ACT and Answer ALS, we generated 17,602 standardized records and multi-omics profiles from 940 induced motor neuron lines. Progression modelling resolved five subtypes and a continuous degeneration score with interpretable anchors: hand-grip strength and forced vital capacity tracked functional loss, and malnutrition emerged as a modifiable correlate. Across RNA-seq and ATAC-seq, clinical severity is aligned with suppression of cell-cycle programmes, declining histone-gene activity and genome-wide chromatin opening, suggesting distinct epigenetic trajectories. These findings establish an agentic AI framework that turns unstructured clinical records into mechanistic insight and links them to multi-omics, reframing ALS studies from top-down, trial-centric analyses to a bottom-up, patient-sourced approach that reveals actionable heterogeneity.
bioinformatics | 2026-02-03 | v1
Computational insights into the interaction between Topoisomerase I and Rpc82 subunit of RNA Polymerase III in Saccharomyces cerevisiae
Nandi, P.; Kamal, I. M.; Chakrabarti, S.; Sengupta, S.
AI Summary
- This study modeled the full-length yeast Topoisomerase I (Top1) to investigate its interaction with Rpc82, a subunit of RNA Polymerase III in Saccharomyces cerevisiae.
- Using molecular docking and dynamics simulations, the study identified critical residues at the Top1-Rpc82 interface, providing insights into how Top1 might regulate Pol III-mediated transcription.
Abstract
The process of DNA transcription leads to the generation of torsional stress, which must be resolved for smooth progression of the transcription machinery. In Saccharomyces cerevisiae, DNA topoisomerase I (Top1), a type IB topoisomerase, plays a critical role in relaxing supercoils and mitigating the topological strain associated with transcription. While several proteins from the transcription machinery have been reported to interact with yeast Top1, detailed characterization and functional relevance of these interactions have remained underexplored. This gap is partly due to the absence of a complete three-dimensional structure of the full-length enzyme, which hinders structure-based computational analyses of its interactome. In this study, we present a template-based model of full-length yeast Top1. Leveraging this model, we investigated its molecular interaction with Rpc82, a key subunit of RNA polymerase III enzyme, responsible for transcribing small non-coding RNAs such as tRNAs and 5S rRNA. Through molecular docking and molecular dynamics simulations, critical residues at the Top1-Rpc82 interface were identified that likely mediate their interaction. Our findings provide new insights into the structural basis of Top1's association with RNA polymerase III and its potential role in regulating Pol III-mediated transcription. The Top1 model developed here offers a valuable framework for future in silico studies aimed at elucidating the broader interactome and regulatory mechanisms of this essential enzyme.
bioinformatics | 2026-02-03 | v1
ImmunoPheno: A Computational Framework for Data-Driven Design and Analysis of Immunophenotyping Experiments
Wu, L.; Nguyen, M. A.; Yang, Z.; Potluri, S.; Sivagnanam, S.; Kirchberger, N.; Joshi, A.; Ahn, K. J.; Tumulty, J. S.; Cruz Cabrera, E.; Romberg, N.; Tan, K.; Coussens, L. M.; Camara, P. G.
AI Summary
- ImmunoPheno is a computational framework that uses single-cell proteo-transcriptomic data to automate the design of antibody panels, gating strategies, and cell identity annotation for immunophenotyping.
- It was used to create a reference (HICAR) with 390 antibodies and 93 immune cell populations, enabling the design of minimal panels for isolating rare cells like MAIT cells and pDCs, validated experimentally.
- The framework accurately annotates cell identities across various cytometry datasets, enhancing the accuracy, reproducibility, and resolution of immunophenotyping.
Abstract
Immunophenotyping is fundamental to characterizing tissue cellular composition, pathogenic processes, and immune infiltration, yet its accuracy and reproducibility remain constrained by heuristic antibody panel design and manual gating. Here, we present ImmunoPheno, an open-source computational platform that repurposes large-scale single-cell proteo-transcriptomic data to guide immunophenotyping experimental design and analysis. ImmunoPheno integrates existing datasets to automate the design of optimal antibody panels, gating strategies, and cell identity annotation. We used ImmunoPheno to construct a harmonized reference (HICAR) comprising 390 monoclonal antibodies and 93 human immune cell populations. Leveraging this resource, we algorithmically designed minimal panels to isolate rare populations, such as MAIT cells and pDCs, which we validated experimentally. We further demonstrate accurate cell identity annotation across publicly available and newly generated cytometry datasets spanning diverse technologies, including spatial platforms like CODEX. ImmunoPheno complements expert curation and supports continual expansion, providing a scalable framework to enhance the accuracy, reproducibility, and resolution of immunophenotyping.
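Minimal-panel design of this kind can be viewed as a set-cover problem; the sketch below uses a greedy heuristic over invented markers and populations, not ImmunoPheno's actual algorithm or the HICAR reference.

```python
# Toy marker table: which cell populations each antibody helps resolve
# (markers and populations invented for illustration).
separates = {
    "CD3": {"T", "MAIT"},
    "CD161": {"MAIT", "NK"},
    "TCR-Va7.2": {"MAIT"},
    "CD123": {"pDC"},
    "HLA-DR": {"pDC", "B"},
}
targets = {"MAIT", "pDC"}  # rare populations to isolate

# Greedy set cover: repeatedly pick the antibody that resolves the
# most still-uncovered target populations.
panel, remaining = [], set(targets)
while remaining:
    best = max(separates, key=lambda ab: len(separates[ab] & remaining))
    panel.append(best)
    remaining -= separates[best]
```

The greedy choice gives a small (though not always optimal) panel; real panel design would also weigh antibody performance and co-staining constraints.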
bioinformatics | 2026-02-03 | v1
Cell type-specific functions of nucleic acid-binding proteins revealed by deep learning on co-expression networks
Osato, N.; Sato, K.
AI Summary
- This study uses a deep learning framework to infer the regulatory influence of nucleic acid-binding proteins (NABPs) across different cellular contexts by integrating gene co-expression data, improving prediction accuracy over traditional binding-based methods.
- The model's predictions were validated against ChIP-seq and eCLIP datasets, showing strong concordance.
- Analysis revealed cell type-specific regulatory programs, such as cancer pathways in K562 cells and differentiation in neural progenitor cells, highlighting the framework's utility in functional annotation of NABPs.
Abstract
Nucleic acid-binding proteins (NABPs) play central roles in gene regulation, yet their functional targets and regulatory programs remain incompletely characterized due to the limited scope and context specificity of experimental binding assays. Here, we present a deep learning framework that integrates gene co-expression-derived interactions with contribution-based model interpretation to infer NABP regulatory influence across diverse cellular contexts, without relying on predefined binding motifs or direct binding evidence. Replacing low-informative binding-based features with co-expression-derived interactions significantly improved gene expression prediction accuracy. Model-inferred regulatory targets showed strong and reproducible concordance with independent ChIP-seq and eCLIP datasets, exceeding random expectations across multiple genomic regions and threshold definitions. Functional enrichment and gene set enrichment analyses revealed coherent, cell type-specific regulatory programs, including cancer-associated pathways in K562 cells and differentiation-related processes in neural progenitor cells. Notably, we demonstrate that DeepLIFT-derived contribution scores capture relative regulatory importance in a background-dependent but biologically robust manner, enabling systematic identification of context-dependent NABP regulatory roles. Together, this framework provides a scalable strategy for functional annotation of NABPs and highlights the utility of combining expression-driven inference with interpretable deep learning to dissect gene regulatory architectures at scale.
bioinformatics | 2026-02-02 | v9
ELITE: E3 Ligase Inference for Tissue-specific Elimination: An LLM-Based E3 Ligase Prediction System for Precise Targeted Protein Degradation
Patjoshi, S.; Froehlich, H.; Madan, S.
AI Summary
- The study introduces ELITE, an AI-driven system using a BERT-based model to predict tissue-specific E3 ligases for targeted protein degradation (TPD).
- ELITE integrates protein embeddings with tissue-specific interaction data to identify E3 ligases that can selectively degrade pathogenic proteins in relevant tissues.
- This approach aims to expand the E3 ligase repertoire, enhancing precision in TPD and reducing systemic toxicity.
Abstract
Targeted protein degradation (TPD) has transformed modern drug discovery by harnessing the ubiquitin-proteasome system to eliminate disease-driving proteins previously deemed undruggable. However, current approaches predominantly rely on a narrow set of ubiquitously expressed E3 ligases, such as Cereblon (CRBN) and Von Hippel-Lindau (VHL), which limits tissue specificity, increases systemic toxicity, and fosters resistance. Here, we present an AI-driven framework for the rational identification of tissue-specific E3 ligases suitable for precision-targeted degradation. Our model leverages a BERT-based protein language architecture trained on billions of sequences to generate contextual embeddings that capture structural and functional motifs relevant for E3-substrate compatibility. By integrating these embeddings with tissue-resolved protein-protein interaction data, the framework predicts ligase-target interactions that are both biologically plausible and context-restricted. This enables the prioritization of ligases capable of driving selective degradation of pathogenic proteins within disease-relevant tissues. The proposed approach offers a scalable path to expand the E3 ligase repertoire and advance TPD toward true precision medicine.
bioinformatics | 2026-02-02 | v9
rnaends: an R package to study exact RNA ends at nucleotide resolution
Caetano, T.; Redder, P.; Fichant, G.; Barriot, R.
AI Summary
- The rnaends R package is designed for analyzing RNA-end sequencing data, focusing on the exact nucleotide resolution of RNA ends.
- It provides tools for preprocessing, mapping, quantification, and post-processing of RNA-end data, including TSS identification, analysis of translation speed, and post-transcriptional modifications.
- The package's utility is demonstrated through workflows on published datasets, highlighting its application in RNA metabolism studies.
Abstract
5' and 3' RNA-end sequencing protocols have unlocked new opportunities to study aspects of RNA metabolism such as synthesis, maturation and degradation, by enabling the quantification of exact ends of RNA molecules in vivo. From RNA-Seq data that have been generated with one of the specialized protocols, it is possible to identify transcription start sites (TSS) and/or endoribonucleolytic cleavage sites, and even, in some cases, co-translational 5' to 3' degradation dynamics. Furthermore, post-transcriptional addition of ribonucleotides at the 3' end of RNA can be studied at nucleotide resolution. While different RNA-end sequencing library protocols exist that have been adapted to a specific organism (prokaryote or eukaryote) or specific biological question, the generated RNA-Seq data are very similar and share common processing steps. Most importantly, the major aspect of RNA-end sequencing is that only the 5' or 3' end mapped location is of interest, contrary to conventional RNA sequencing that considers genomic ranges for gene expression analysis. This translates to a simple representation of the quantitative data as a count matrix of RNA-end locations on the reference sequences. This representation seems under-exploited and is, to our knowledge, not available in a generic package focused on analyses of exact transcriptome ends. Here, we present the rnaends R package, which is dedicated to RNA-end sequencing analysis. It offers functions for raw read pre-processing, RNA-end mapping and quantification, RNA-end count matrix post-processing, and further downstream count matrix analyses such as TSS identification, fast Fourier transform for signal periodic pattern analysis, or differential proportion of RNA-end analysis.
The use of rnaends is illustrated here with applications in RNA metabolism studies through selected rnaends workflows on published RNA-end datasets: (i) TSS identification, (ii) ribosome translation speed and co-translational degradation, (iii) post-transcriptional modification analysis and differential proportion analysis.
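The Fourier-transform periodicity analysis mentioned above can be illustrated on a simulated 5'-end count vector; the 3-nt periodic component mimics codon-resolution signal, and all numbers are invented (rnaends itself is an R package; this is a language-neutral sketch of the idea).

```python
import numpy as np

rng = np.random.default_rng(1)
# Simulated 5'-end counts along a 300-nt CDS: Poisson background
# plus an extra burst at every third position.
pos = np.arange(300)
counts = rng.poisson(5, size=pos.size) + 8 * (pos % 3 == 0)

# FFT of the mean-centred signal; a dominant peak at frequency 1/3
# (period 3 nt) indicates codon-scale periodicity.
spectrum = np.abs(np.fft.rfft(counts - counts.mean()))
freqs = np.fft.rfftfreq(counts.size)
period = 1.0 / freqs[spectrum.argmax()]
```

On real data the same spectrum would be computed from the RNA-end count matrix restricted to one transcript.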
bioinformatics | 2026-02-02 | v3
Near-perfect identification of half-sibling versus niece/nephew avuncular pairs without pedigree information or genotyped relatives
Sapin, E.; Kelly, K.; Keller, M. C.
AI Summary
- The study addresses the challenge of distinguishing half-siblings from niece/nephew-avuncular pairs in large genomic biobanks without pedigree information.
- A novel method using across-chromosome phasing and haplotype-level sharing features was developed, achieving over 98% classification accuracy.
- This approach also enhances long-range phasing accuracy, aiding in pedigree reconstruction and managing cryptic relatedness in genomic studies.
Abstract
Motivation: Large-scale genomic biobanks contain thousands of second-degree relatives with missing pedigree metadata. Accurately distinguishing half-sibling (HS) from niece/nephew-avuncular (N/A) pairs--both sharing approximately 25% of the genome--remains a significant challenge. Current SNP-based methods rely on Identical-By-Descent (IBD) segment counts and age differences, but substantial distributional overlap leads to high misclassification rates. There is a critical need for a scalable, genotype-only method that can resolve these "half-degree" ambiguities without requiring observed pedigrees or extensive relative information. Results: We present a novel computational framework that achieves near-complete separation of HS and N/A pairs using only genotype data. Our approach utilizes across-chromosome phasing to derive haplotype-level sharing features that summarize how IBD is distributed across parental homologues. By modeling these features with a Gaussian mixture model (GMM), we demonstrate near-perfect classification accuracy (> 98%) in biobank-scale data. Furthermore, we show that these high-confidence relationship labels can serve as long-range phasing anchors, providing structural constraints that improve the accuracy of across-chromosome homologue assignment. This method provides a robust, scalable solution for pedigree reconstruction and the control of cryptic relatedness in large-scale genomic studies.
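The Gaussian-mixture classification step can be sketched as follows; the two-feature representation, cluster locations, and spreads are invented stand-ins for the paper's haplotype-level sharing features.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
# Hypothetical 2-D sharing features per relative pair, e.g. fraction
# of IBD attributable to each parental homologue. HS pairs share
# through one parent; N/A sharing spreads across both. Simulated.
hs = rng.normal([0.45, 0.05], 0.03, size=(200, 2))
na = rng.normal([0.25, 0.25], 0.03, size=(200, 2))
X = np.vstack([hs, na])

# A two-component GMM separates the two relationship classes
# without any labelled training data.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)
```

With well-separated feature distributions, unsupervised assignment of pairs to the two components is essentially error-free, which is the regime the abstract's >98% accuracy describes.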
bioinformatics | 2026-02-02 | v3
WITHDRAWN: OKR-Cell: Open World Knowledge Aided Single-Cell Foundation Model with Robust Cross-Modal Cell-Language Pre-training
Wang, H.; Zhang, X.; Fang, S.; Ran, L.; Deng, Z.; Zhang, Y.; Li, Y.; Li, S.
AI Summary
- The manuscript titled "OKR-Cell: Open World Knowledge Aided Single-Cell Foundation Model with Robust Cross-Modal Cell-Language Pre-training" was withdrawn due to duplicate posting on arXiv.
- The authors request that this work not be cited as a reference.
Abstract
The authors have withdrawn this manuscript because of a duplicate posting of a preprint on arXiv. Therefore, the authors do not wish this work to be cited as a reference for the project. If you have any questions, please contact the corresponding author. The original preprint can be found at arXiv:2601.05648
bioinformatics | 2026-02-02 | v2
MLMarker: A machine learning framework for tissue inference and biomarker discovery
Claeys, T.; van Puyenbroeck, S.; Gevaert, K.; Martens, L.
AI Summary
- MLMarker uses a Random Forest model to compute tissue similarity scores from proteomics data, trained on 34 healthy tissues.
- It employs SHAP for protein-level explanations and a penalty factor for missing proteins, enhancing robustness for sparse datasets.
- Testing on three datasets, MLMarker identified brain-like signatures in cerebral melanoma, achieved high accuracy in pan-cancer analysis, and traced origins in biofluids.
Abstract
MLMarker is a machine learning tool that computes continuous tissue similarity scores for proteomics data, addressing the challenge of interpreting complex or sparse datasets. Trained on 34 healthy tissues, its Random Forest model generates probabilistic predictions with SHAP-based protein-level explanations. A penalty factor corrects for missing proteins, improving robustness for low-coverage samples. Across three public datasets, MLMarker revealed brain-like signatures in cerebral melanoma metastases, achieved high accuracy in a pan-cancer cohort, and identified brain and pituitary origins in biofluids. MLMarker provides an interpretable framework for tissue inference and hypothesis generation, available as a Python package and Streamlit app.
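The probability-plus-coverage-penalty idea can be sketched as below; the training data, number of tissues, and the exact penalty form are invented, not MLMarker's actual model.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
# Toy training set: protein intensities for three "tissues"
# (simulated; MLMarker is trained on 34 real tissues).
X = rng.random((300, 50))
y = rng.integers(0, 3, size=300)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Query sample with missing proteins (zero-filled here); down-weight
# the tissue scores by the fraction of proteins actually observed,
# a crude stand-in for the paper's penalty factor.
sample = rng.random(50)
missing = rng.random(50) < 0.4          # 40% of proteins undetected
sample[missing] = 0.0
coverage_penalty = 1.0 - missing.mean()
scores = clf.predict_proba(sample.reshape(1, -1))[0] * coverage_penalty
```

The penalized scores behave as similarity scores rather than calibrated probabilities, which suits sparse, low-coverage samples.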
bioinformatics | 2026-02-02 | v2
cheCkOVER: An open framework and AI-ready global crayfish database for next-generation biodiversity knowledge
Parvulescu, L.; Livadariu, D.; Bacu, V. I.; Nandra, C. I.; Stefanut, T. T.; World of Crayfish Contributors
AI Summary
- The study introduces cheCkOVER, an open framework that transforms species occurrence data into structured, AI-ready formats, focusing on crayfish.
- cheCkOVER processes 111,729 crayfish records from 465 species, producing biogeographic descriptors, dynamic maps, and JSON geo-narratives with provenance metadata.
- This framework supports conservation metrics, tracks invasive species, and enhances biodiversity data utility for AI applications and public platforms like World of Crayfish.
Abstract
Background Species occurrence records represent the backbone of biodiversity science, yet their utility is often limited to spatial analyses, distribution maps, or presence-absence models. Current biodiversity infrastructures rarely provide computational formats directly usable by modern artificial intelligence (AI) systems, such as large language models (LLMs), which increasingly mediate scientific communication and knowledge synthesis. Open frameworks that convert biodiversity occurrences into structured, machine-accessible, provenance-rich knowledge are therefore essential--particularly those enabling rapid integration of new records, near real-time generation of spatial metrics, and production of both human interpretable reports and AI-consumable outputs. Such capabilities substantially reduce latency between data acquisition and decision support, while ensuring biodiversity knowledge remains traceable and verifiable in AI-mediated workflows. Results We introduce cheCkOVER, an open framework that converts raw species occurrence datasets into standardized, API-ready, multi-layered outputs: biogeographic descriptors, dynamic distribution maps, summary metrics, and structured JSON geo-narratives following a canonical template. The framework stratifies processing by population origin (indigenous vs. non-indigenous), enabling IUCN-aligned conservation metrics while simultaneously tracking invasion dynamics. Each output embeds standardized citation metadata ensuring full provenance traceability. We applied the pipeline to 111,729 validated crayfish (Astacidea) occurrence records from 465 species, generating comprehensive species packages including indigenous-range classifications (171 endemic, 287 regional, 5 cosmopolitan taxa) and non-indigenous range tracking for 30 invasive species. 
This proof-of-concept demonstrates how the framework transforms minimal datapoints--validated species occurrences--into interoperable knowledge consumable by both humans and computational systems. The JSON outputs are optimized for retrieval-augmented generation, enabling AI systems to dynamically access and cite biodiversity knowledge with explicit source attribution. Conclusions cheCkOVER is taxon-agnostic and establishes a reproducible pathway from biodiversity occurrences to narrative-ready, AI-interoperable knowledge with immediate public utility via the World of Crayfish(R) platform (https://world.crayfish.ro/), where each species page integrates structured outputs. The open-source framework (GPL-3) combines a generalizable processing pipeline with taxon-specific knowledge products, enabling flexible reuse across conservation research, policy reporting, and AI-driven applications. This minimalist-to-complex design extends the reach of biodiversity data beyond traditional analyses, positioning occurrence repositories as active knowledge engines for next-generation biodiversity informatics.
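A provenance-rich JSON record in the spirit of the geo-narratives described above might look like this; all field names and values are illustrative inventions, not the cheCkOVER schema.

```python
import json

# Minimal invented species package: a narrative plus embedded
# provenance metadata, serialized for retrieval-augmented use.
record = {
    "taxon": "Astacus astacus",
    "range_class": "regional",
    "narrative": ("Indigenous to central and northern Europe; "
                  "declining within parts of its native range."),
    "provenance": {
        "source": "validated occurrence dataset",
        "retrieved": "2026-02-02",
        "license": "GPL-3 pipeline output",
    },
}
payload = json.dumps(record, indent=2)
```

Embedding the provenance block in every output is what lets an LLM consumer cite the record with explicit source attribution.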
bioinformatics | 2026-02-02 | v2
MultiGEOmics: Graph-Based Integration of Multi-Omics via Biological Information Flows
Alipour Pijani, B.; Rifat, J. I. M.; Bozdag, S.
AI Summary
- MultiGEOmics is a graph-based framework designed to integrate multi-omics data by incorporating cross-omics regulatory signals and handling missing data.
- It learns robust embeddings across omics types, maintaining performance under varying data completeness scenarios.
- Evaluations on 11 datasets showed MultiGEOmics consistently performs well and provides interpretability by highlighting key omics features for predictions.
Abstract
Motivation: Multi-omics datasets capture complementary aspects of biological systems and are central to modern machine learning applications in biology and medicine. Existing graph-based integration methods typically construct separate graphs for each omics type and focus primarily on intra-omic relationships. As a result, they often overlook cross-omics regulatory signals (bidirectional interactions across omics layers) that are critical for modeling complex cellular processes. A second major challenge is missing or incomplete omics data; many current approaches degrade substantially in performance or exclude patients lacking one or more omics modalities. To address these limitations, we introduce MultiGEOmics, an intermediate-level graph integration framework that explicitly incorporates regulatory signals across omics types during graph representation learning and models biologically inspired omics-specific and cross-omics dependencies. MultiGEOmics learns robust cross-omics embeddings that remain reliable even when some modalities are partially missing. Results: We evaluated MultiGEOmics across eleven datasets spanning cancer and Alzheimer's disease, under zero, moderate, and high missing-rate scenarios. MultiGEOmics consistently maintains strong predictive performance across all missing-data conditions while offering interpretability by identifying the most influential omics types and features for each prediction task.
bioinformatics | 2026-02-02 | v1
Batch correction for large-scale mass spectrometry imaging experiments
Thomsen, A. A.; Jensen, O. N.
AI Summary
- This study evaluates batch correction methods for MALDI mass spectrometry imaging experiments.
- ComBat was found to reduce batch-related technical variance, preserve biological variation, and enhance the overall score by 19.4%.
Abstract
We assess batch correction methods for MALDI mass spectrometry imaging experiments. ComBat reduced batch-related technical variance, maintained biological variation, and improved the overall score by 19.4%.
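ComBat's core location-scale idea can be sketched on simulated two-batch data; the version below omits ComBat's empirical-Bayes shrinkage of the batch parameters and uses invented data, so it is an illustration of the principle rather than the method itself.

```python
import numpy as np

rng = np.random.default_rng(4)
# Simulated feature matrix (pixels x m/z features) from two MSI
# batches, the second with a batch-specific shift and rescaling.
a = rng.normal(0.0, 1.0, size=(100, 20))
b = rng.normal(1.5, 2.0, size=(100, 20))
X = np.vstack([a, b])
batch = np.array([0] * 100 + [1] * 100)

# Per-batch, per-feature location-scale adjustment: subtract the
# batch mean and divide by the batch standard deviation.
Xc = X.copy()
for g in np.unique(batch):
    m = batch == g
    Xc[m] = (X[m] - X[m].mean(axis=0)) / X[m].std(axis=0)
```

After adjustment, both batches share a common location and scale per feature; full ComBat additionally borrows strength across features when estimating the batch effects.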
bioinformatics | 2026-02-02 | v1
Evaluating the applicability of kinship analyses for sedimentary ancient DNA datasets
Cohen, P.; Johnson, S.; Zavala, E. I.; Moorjani, P.; Slon, V.
AI Summary
- This study evaluates the feasibility of kinship inference using sedimentary ancient DNA (sedaDNA), focusing on Neandertals, through extensive simulations.
- The main challenge identified was the presence of DNA from multiple individuals in samples, which complicates accurate kinship analysis.
- A heterozygosity-based test was developed to detect multi-individual DNA, and practical limits were assessed using Neandertal sedaDNA from the Galeria de las Estatuas site.
Abstract
Kinship reconstruction in ancient populations provides key insights into past social organization and evolutionary history. Sedimentary ancient DNA (sedaDNA) enables access to deep-time human populations in the absence of skeletal remains. However, it is characterized by severe degradation and the potential mixture of genetic material from multiple individuals, raising questions about its suitability for kinship inference. Here, we use extensive simulations to evaluate the feasibility and limitations of kinship inference in sparse and damaged sedaDNA data, with a focus on Neandertals. We find that the main obstacle to accurate kinship inference in sedaDNA is the presence of multiple contributors to a given sample. To address this, we introduce a simple heterozygosity-based test to identify samples containing DNA from multiple individuals. Guided by these results, we analyze published Neandertal sedaDNA from the Galeria de las Estatuas site to assess the practical limits of kinship inference in real sedimentary ancient DNA data. Together, our results define methodological considerations and practical limits for kinship inference in sedimentary ancient DNA.
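A heterozygosity-style screen for multiple contributors can be sketched with simulated read counts; the allele-frequency spectra, read depth, and site counts below are invented, not the paper's test.

```python
import numpy as np

rng = np.random.default_rng(5)

def apparent_heterozygosity(allele_freqs, depth=6):
    """Fraction of sites where both alleles appear among the reads."""
    alt = rng.binomial(depth, allele_freqs)
    return np.mean((alt > 0) & (alt < depth))

# Invented allele-frequency spectra at 2,000 informative sites:
# a single diploid contributor yields only frequencies 0, 0.5, 1,
# whereas a two-individual mixture adds intermediate frequencies.
single = apparent_heterozygosity(
    rng.choice([0.0, 0.5, 1.0], size=2000, p=[0.4, 0.2, 0.4]))
mixed = apparent_heterozygosity(
    rng.choice([0.0, 0.25, 0.5, 0.75, 1.0], size=2000))
```

Excess apparent heterozygosity relative to the single-contributor expectation flags samples that mix DNA from multiple individuals, which is the intuition behind the test introduced above.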
bioinformatics | 2026-02-02 | v1
DyGraphTrans: A temporal graph representation learning framework for modeling disease progression from Electronic Health Records
Rahman, M. T.; Al Olaimat, M.; Bozdag, S.; Alzheimer's Disease Neuroimaging Initiative
AI Summary
- DyGraphTrans is a framework that models disease progression using EHR data by representing it as temporal graphs, where nodes are patients, features are clinical attributes, and edges show patient similarity.
- It addresses high memory use and lack of interpretability in existing models by employing a sliding-window mechanism and capturing both local and global temporal trends.
- Evaluations on ADNI, NACC, and MIMIC-IV datasets showed DyGraphTrans had strong predictive performance and interpretability aligned with clinical risk factors.
Abstract
Motivation: Electronic Health Records (EHRs) contain vast amounts of longitudinal patient medical history data, making them highly informative for early disease prediction. Numerous computational methods have been developed to leverage EHR data; however, many process multiple patient records simultaneously, resulting in high memory consumption and computational cost. Moreover, these models often lack interpretability, limiting insight into the factors driving their predictions. Efficiently handling large-scale EHR data while maintaining predictive accuracy and interpretability therefore remains a critical challenge. To address this gap, we propose DyGraphTrans, a dynamic graph representation learning framework that represents patient EHR data as a sequence of temporal graphs. In this representation, nodes correspond to patients, node features encode temporal clinical attributes, and edges capture patient similarity. DyGraphTrans models both local temporal dependencies and long-range global trends, while a sliding-window mechanism reduces memory consumption without sacrificing essential temporal context. Unlike existing dynamic graph models, DyGraphTrans jointly captures patient similarity and temporal evolution in a memory-efficient and interpretable manner. Results: We evaluated DyGraphTrans on the Alzheimer's Disease Neuroimaging Initiative (ADNI) and National Alzheimer's Coordinating Center (NACC) datasets for disease progression prediction, as well as on the Medical Information Mart for Intensive Care (MIMIC-IV) dataset for early mortality prediction. We further assessed the model on multiple benchmark dynamic graph datasets to evaluate its generalizability. DyGraphTrans achieved strong predictive performance across diverse datasets. We also demonstrated interpretability of DyGraphTrans aligned with known clinical risk factors.
bioinformatics | 2026-02-02 | v1
NetPolicy-RL: Network-Informed Offline Reinforcement Learning for Pharmacogenomic Drug Prioritization
Lodh, E.; Majumder, S.; Chowdhury, T.; De, M.
AI Summary
- The study introduces NetPolicy-RL, a framework that combines network diffusion modeling with offline reinforcement learning for prioritizing drugs in pharmacogenomic screens.
- Drug selection is treated as an offline contextual bandit problem, optimizing ranking quality directly by integrating drug response data with network disruption scores from biological networks.
- NetPolicy-RL significantly outperformed traditional methods in ranking quality (NDCG@10) and reduced regret, with improvements for 88.7% of cell lines compared to GlobalTopK.
Abstract
Large-scale pharmacogenomic screens provide extensive measurements of drug response across diverse cancer cell lines; however, most computational approaches emphasize point-wise sensitivity prediction or static ranking, which are poorly aligned with practical decision-making, where only a limited number of candidate drugs can be tested. We propose NetPolicy-RL, a biologically informed and decision-centric framework for pharmacogenomic drug prioritization that integrates network diffusion modeling with offline reinforcement learning. Drug selection for each cell line is formulated as an offline contextual bandit problem, enabling direct optimization of ranking quality rather than surrogate regression objectives. Mechanistic biological context is incorporated by propagating drug targets over curated interaction networks (STRING and Reactome) using random walk with restart, and combining the resulting diffusion profiles with cell-specific molecular importance derived from multi-omics data to compute network disruption scores. These biologically grounded signals are integrated with normalized drug response measurements to construct a joint state representation, which is optimized using an offline actor-critic architecture. Across held-out test splits, NetPolicy-RL consistently outperforms global ranking heuristics and learning-to-rank baselines, achieving statistically significant improvements in per-cell Normalized Discounted Cumulative Gain (NDCG@10) and substantial reductions in per-cell regret. Relative to GlobalTopK, the policy improves NDCG@10 for 88.7% of cell lines, while improvements exceed 95% compared with LambdaMART and regression-to-ranking baselines. Ablation analyses show that neither empirical response signals nor network-derived features alone are sufficient, and that their integration yields the most robust performance. 
Overall, this study demonstrates that combining mechanistic network biology with offline policy learning provides an effective and interpretable framework for drug prioritization in precision oncology.
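The network-diffusion step described above (random walk with restart over STRING/Reactome) can be sketched in a few lines. The toy interaction graph, seed gene, and restart probability below are illustrative assumptions, not values from the paper:

```python
# Random walk with restart (RWR) over a toy protein-interaction graph.
# Illustrative sketch only: graph, seed, and restart probability are invented.

def rwr(adj, seeds, restart=0.5, tol=1e-10, max_iter=1000):
    """adj: dict node -> list of neighbours (undirected); returns visiting probabilities."""
    nodes = sorted(adj)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        new = {}
        for n in nodes:
            # mass flowing into n: each neighbour m spreads p[m] evenly over its edges
            inflow = sum(p[m] / len(adj[m]) for m in adj[n])
            new[n] = (1.0 - restart) * inflow + restart * p0[n]
        done = max(abs(new[n] - p[n]) for n in nodes) < tol
        p = new
        if done:
            break
    return p

# Toy network seeded at a hypothetical drug target:
toy = {
    "EGFR": ["GRB2", "KRAS"],
    "GRB2": ["EGFR", "KRAS"],
    "KRAS": ["EGFR", "GRB2", "BRAF"],
    "BRAF": ["KRAS"],
}
profile = rwr(toy, seeds={"EGFR"}, restart=0.4)
```

The resulting stationary profile concentrates probability near the seed and decays with network distance, which is what makes it usable as a diffusion feature.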
bioinformatics 2026-02-02 v1
PHoNUPS: Open-Source Software for Standardized Analysis and Visualization of Multi-Instrument Extracellular Vesicle Measurements
Melykuti, B.; Bustos-Quevedo, G.; Prinz, T.; Nazarenko, I.
AI Summary
- PHoNUPS is open-source software developed in R to standardize the analysis and visualization of extracellular vesicle (EV) measurements from various instruments.
- It processes data to compute statistics and generate standardized histograms and contour plots for EV size and zeta potential, aiding in transparent reporting and cross-study comparisons.
- The software supports multiple file formats, produces publication-ready figures, and is designed for extensibility with community contributions.
Abstract
Accurate and transparent characterization of extracellular vesicle (EV) preparations is essential to ensure reproducibility, comparability, and adherence to MISEV reporting standards. However, data outputs from commonly used instruments for assessing EV size, concentration, and surface charge (zeta potential) vary widely in format and structure, complicating standardized analysis and integration across platforms. We present PHoNUPS (Plotting the Histogram of Non-Uniform Particles' Sizes), free and open-source software (FOSS) developed in R that enables unified processing, analysis, and visualization of EV characterization data. PHoNUPS computes statistics and generates standardized histograms and contour plots (for size against zeta potential) suitable for transparent reporting and cross-study comparison. The software produces high-quality, publication-ready figures. Third-party graphical editing tools allow users to refine and annotate visualizations for presentation or manuscript preparation. PHoNUPS supports multiple measurement file formats, thereby facilitating dataset integration from different instruments. PHoNUPS was developed with extensibility at its core, providing a basis for user-driven growth. We invite the EV community - researchers, analysts, and tool developers - to use PHoNUPS, share feedback on their experience and needs, and contribute to the platform by integrating additional input data formats, analytical routines, and visualization functionalities.
bioinformatics 2026-02-02 v1
Bridging the gap between genome-wide association studies and network medicine with GNExT
Arend, L.; Woller, F.; Rehor, B.; Emmert, D.; Frasnelli, J.; Fuchsberger, C.; Blumenthal, D. B.; List, M.
AI Summary
- GNExT is a web-based platform designed to integrate GWAS data into network medicine, enhancing the interpretation of genetic variants within biological systems.
- It incorporates tools like MAGMA and Drugst.One to explore genetic variants at a network level, identifying potential drug repurposing candidates.
- The platform was demonstrated using a GWAS meta-analysis of human olfactory identification, translating genetic signals into pharmacological targets.
Abstract
Motivation: A growing volume of large-scale genome-wide association study (GWAS) datasets offers unprecedented power to uncover the genetic determinants of complex traits, but existing web-based platforms for GWAS data exploration provide limited support for interpreting these findings within broader biological systems. Systems medicine is particularly well-suited to fill this gap, as its network-oriented view of molecular interactions enables the integration of genetic signals into coherent network modules, thereby opening opportunities for disease mechanism mining and drug repurposing. Results: We introduce GNExT (GWAS network exploration tool), a web-based platform that moves beyond the variant-level effect and significance exploration provided by existing solutions. By including MAGMA and Drugst.One, GNExT allows its users to study genetic variants on the network level down to the identification of potential drug repurposing candidates. Moreover, GNExT advances over the current state of the art by offering a highly standardized Nextflow pipeline for data import and preprocessing, allowing researchers to easily deploy their study results on a web interface. We demonstrate the utility of GNExT using a genome-wide association meta-analysis of human olfactory identification, in which the framework translated isolated GWAS signals to potential pharmacological targets in human olfaction. Availability and Implementation: The complete GNExT ecosystem, including the Nextflow preprocessing pipeline, the backend service, and frontend interface, is publicly available on GitHub (https://github.com/dyhealthnet/gnext_nf_pipeline, https://github.com/dyhealthnet/gnext_platform). The public instance of the GNExT platform on olfaction is available under http://olfaction.gnext.gm.eurac.edu.
bioinformatics 2026-02-02 v1
Quantifying biomarker ambiguity using metabolic network analysis
Hinkston, M. A.; Bradley, A. S.
AI Summary
- This study quantifies biomarker ambiguity by introducing three metrics (retrobiosynthetic complexity, normalized branch depth, and fraction shared) to assess the biosynthetic specificity of biomarkers through metabolic network analysis.
- Analysis of 9,140 MetaCyc metabolites revealed that only 13% of multi-pathway compounds had low complexity, distal divergence, and high pathway consensus.
- Lipid biomarkers like hopanoids and sterols were found to vary in specificity, with hopanoids showing higher specificity, while diagnostic quality and lipophilicity were found to be independent.
Abstract
Molecular biomarkers preserved in rocks provide evidence about ancient life, but interpreting them requires inference through multiple stages of information loss arising from phylogenetic, biosynthetic, and diagenetic ambiguity. However, biomarker specificity is typically assessed qualitatively rather than quantitatively. Here we formalize biosynthetic ambiguity as entropy over metabolic networks. We introduce three metrics that quantify pathway-level information content: retrobiosynthetic complexity (ψ), normalized branch depth (λ), and fraction shared (σ). Analysis of 9,140 MetaCyc metabolites defines a three-dimensional specificity space for biomarker evaluation. Only 13% of multi-pathway compounds exhibited low complexity, distal divergence, and high pathway consensus. Lipid biomarkers span this specificity space heterogeneously: hopanoids cluster near the high-specificity region while sterols occupy intermediate territory. Diagnostic quality and lipophilicity are approximately independent, so the constraint on molecular paleontology is the limited chemical diversity among preservable compound classes rather than their biosynthetic properties. This framework supports probabilistic biomarker interpretation by explicitly incorporating biosynthetic, phylogenetic, and diagenetic constraints.
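"Ambiguity as entropy over metabolic networks" can be illustrated with Shannon entropy over the pathways that can produce a metabolite: a simplified stand-in for the paper's pathway-level information-content metrics, with invented pathway weights:

```python
# Biosynthetic ambiguity as Shannon entropy over the pathways that can
# produce a metabolite. Simplified stand-in for the paper's metrics;
# pathway weights are invented for illustration.
import math

def pathway_entropy(weights):
    """Shannon entropy (bits) of the normalized distribution over producing pathways."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log2(p) for p in probs)

# A single-pathway compound is unambiguous; four equally likely pathways
# carry two bits of ambiguity.
unique = pathway_entropy([7])
shared = pathway_entropy([1, 1, 1, 1])
```

Under this reading, a diagnostically useful biomarker is one whose pathway distribution has low entropy, i.e. few plausible biosynthetic origins.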
bioinformatics 2026-02-02 v1
Multi-ancestry conditional and joint analysis (Manc-COJO) applied to GWAS summary statistics
Wang, X.; Wang, Y.; Visscher, P. M.; Wray, N. R.; Yengo, L.
AI Summary
- The study introduces Manc-COJO, a method for performing conditional and joint analysis on GWAS summary statistics across multiple ancestries to identify independent SNP associations.
- Simulations and real-data analyses demonstrate that Manc-COJO enhances the detection of independent signals and reduces false positives compared to traditional methods.
- Manc-COJO:MDISA, a follow-up algorithm, identifies ancestry-specific associations, and the C++ implementation of Manc-COJO significantly improves computational efficiency.
Abstract
Conditional and joint (COJO) analysis of genome-wide association study (GWAS) summary statistics to identify single nucleotide polymorphisms (SNPs) independently associated with a trait is standard in post-GWAS pipelines. GWAS meta-analyses are increasingly conducted across multiple ancestry groups, but how to perform COJO in a multi-ancestry context has remained unclear. Here we introduce Manc-COJO, a method for multi-ancestry COJO analysis. Simulations and real-data analyses show that Manc-COJO improves the detection of independent association signals and reduces false positives compared to COJO and ad hoc adaptations for multi-ancestry use. We also introduce Manc-COJO:MDISA, a follow-up within-ancestry algorithm to identify ancestry-specific associations after fitting Manc-COJO-identified SNPs. The C++ implementation of Manc-COJO substantially improves computational efficiency (for a single ancestry, >120 times faster than the GCTA-COJO software) and supports linkage disequilibrium references derived either from individual-level genotype data or pre-computed matrices, facilitating analysis when data sharing is limited.
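The core identity behind conditional/joint analysis: for standardized genotypes, the joint effects solve R b_joint = b_marginal, where R is the SNP-SNP LD correlation matrix. A hedged two-SNP sketch with an analytic 2x2 inverse; the numbers are invented, and this is not the Manc-COJO implementation:

```python
# Core identity behind conditional/joint (COJO) analysis of GWAS summary
# statistics: joint effects b_joint solve R @ b_joint = b_marginal, with
# R the LD correlation matrix. Toy two-SNP version; numbers are invented.

def joint_effects_2snp(b1, b2, r):
    """Solve [[1, r], [r, 1]] @ (bj1, bj2) = (b1, b2) analytically."""
    det = 1.0 - r * r
    return (b1 - r * b2) / det, (b2 - r * b1) / det

# SNP2's marginal signal (0.15) is exactly what LD with SNP1 (r = 0.5,
# true effect 0.30) would induce, so its joint effect collapses to zero:
bj1, bj2 = joint_effects_2snp(0.30, 0.15, 0.5)
```

This is why a COJO step shrinks a cluster of correlated genome-wide-significant SNPs down to the independently associated ones; the multi-ancestry difficulty the paper addresses is that R differs between ancestry groups.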
bioinformatics 2026-02-02 v1
scDiagnostics: systematic assessment of cell type annotation in single-cell transcriptomics data
Christidis, A.; Ghazi, A. R.; Chawla, S.; Turaga, N.; Gentleman, R.; Geistlinger, L.
AI Summary
- The study addresses the challenge of assessing computational cell type annotations in single-cell transcriptomics by introducing scDiagnostics, a software package designed to detect complex or ambiguous annotations.
- scDiagnostics uses novel diagnostic methods compatible with major annotation tools and was tested on simulated and real-world datasets.
- The tool effectively identifies misleading annotations that could distort downstream analysis, enhancing the reliability of single-cell data interpretation.
Abstract
Although cell type annotation has become an integral part of single-cell analysis workflows, the assessment of computational annotations remains challenging. Many annotation tools transfer labels from an annotated reference dataset to a new query dataset of interest, but blindly transferring labels from one dataset to another has its own set of challenges. Often enough there is no perfect alignment between datasets, especially when transferring annotations from a healthy reference atlas for the discovery of disease states. We present scDiagnostics, a new open-source software package that facilitates the detection of complex or ambiguous annotation cases that may otherwise go unnoticed, thus addressing a critical unmet need in current single-cell analysis workflows. scDiagnostics is equipped with novel diagnostic methods that are compatible with all major cell type annotation tools. We demonstrate that scDiagnostics reliably detects complex or conflicting annotations using both carefully designed simulated datasets and diverse real-world single-cell datasets. Our evaluation demonstrates that scDiagnostics reliably identifies misleading annotations that systematically distort downstream analysis and interpretation and that would otherwise remain undetected. The scDiagnostics R package is available from Bioconductor (https://bioconductor.org/packages/scDiagnostics).
bioinformatics 2026-02-02 v1
An Explainable Machine Learning Approach to study the positional significance of histone post-translational modifications in gene regulation
Ramachandran, S.; Ramakrishnan, N.
AI Summary
- This study used XGBoost classifiers to analyze ChIP-seq data for 26 histone PTMs in yeast, focusing on their positional significance from -3 to 8 in genes.
- The approach predicted gene transcription rates and identified critical histone modifications and nucleosomal positions for gene expression using SHAP for explainability.
- Key findings highlighted the importance of specific histone modifications and their positions in yeast gene regulation, with potential for extension to other organisms.
Abstract
Epigenetic mechanisms regulate gene expression by altering the structure of chromatin without modifying the underlying DNA sequence. Histone post-translational modifications (PTMs) are critical epigenetic signals that influence transcriptional activity, promoting or repressing gene expression. Understanding the impact of individual PTMs and their combinatorial effects is essential to deciphering gene regulatory mechanisms. In this study, we analyzed the ChIP-seq data for 26 PTMs in yeast, examining the PTM intensities gene-wise from positions -3 to 8 in each gene. Using XGBoost classifiers, we predicted gene transcription rates and identified key histone modifications and nucleosomal positions that are critical in gene expression using explainability measures (such as SHAP). Our study provides comprehensive insight into the histone modifications, their positions, and their combinations that are most critical in gene regulation in yeast. The proposed explainable machine learning models can be easily extended to other model organisms to provide meaningful insights into gene regulation by epigenetic mechanisms.
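The SHAP step rests on Shapley values. A minimal brute-force sketch for a toy two-feature model: the "transcription rate" model and its weights are invented, and real SHAP libraries use efficient tree-specific algorithms rather than this exponential enumeration:

```python
# Exact Shapley values for a toy model: average marginal contribution of
# each feature over all feature orderings. Illustrates the idea behind
# SHAP explainability; model weights are invented.
from itertools import permutations

def shapley_values(f, x, baseline):
    n = len(x)
    phi = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        z = list(baseline)
        for i in order:
            before = f(z)
            z[i] = x[i]          # switch feature i from baseline to its value
            phi[i] += f(z) - before
    return [v / len(orders) for v in phi]

# Hypothetical "transcription rate" model over two PTM intensities:
model = lambda z: 2.0 * z[0] + 3.0 * z[1]
phi = shapley_values(model, x=[1.0, 1.0], baseline=[0.0, 0.0])
```

By construction the values satisfy the efficiency property: they sum to the difference between the model's prediction at x and at the baseline, which is what makes per-position PTM attributions comparable.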
bioinformatics 2026-02-02 v1
Serum Proteomic Profiling Implicates a Dysregulated Neurohormonal-Inflammatory Axis in Post-Fontan Tachycardia
Takaesu, F.; Villarreal, D. J.; Zhou, A.; Jimenez, M.; Turner, M.; Spiess, J. L.; Kievert, J.; Deshetler, C.; Schwartzman, W.; Yates, A. R.; Kelly, J. M.; Breuer, C. K.; Davis, M.
AI Summary
- This study used serum proteomics and machine learning to investigate the molecular mechanisms behind post-Fontan tachycardia in both ovine models and human patients.
- Post-operative tachycardia was observed in both species, with significant heart rate increases noted from day 1 to day 3 post-operation.
- A seven-protein panel was identified, with ANGT, ACE, and PTX3 consistently dysregulated across species, suggesting a neurohormonal-inflammatory axis involvement in tachycardia.
Abstract
Background: Post-operative tachycardia is a common and poorly understood complication following the Fontan procedure. Post-operative factors such as surgical scarring and venous hypertension can contribute to tachycardia risk, but the specific molecular signaling cascades triggering acute tachycardia remain uncharacterized, limiting therapeutic innovation and leaving clinicians with limited strategies. Here, we present a retrospective translational study leveraging serum proteomics and machine learning to identify molecular drivers of post-operative Fontan tachycardia. Methods: We integrated a clinically relevant ovine animal model of Fontan circulation with continuous telemetric heart rate monitoring and human patient data. Serum proteomics coupled with machine learning algorithms were employed to identify protein panels predictive of post-operative tachycardia. Cross-species validation was performed by comparing proteomic signatures from sheep and pediatric patients undergoing Glenn or Fontan surgery. Results: Ovine Fontan animals demonstrated significant heart rate elevation beginning on post-operative day (POD) 1, peaking at POD 3 (159.4 ± 11.7 bpm vs. pre-operative 105.3 ± 10.5 bpm, p<0.0001), before trending toward baseline by POD 10. This pattern was similar in human patients, though more modest. Proteomic analysis identified distinct separation between pre- and post-operative serum profiles. Principal component analysis revealed that the principal components most correlated with heart rate were significantly enriched for inflammatory and neural pathways. We leveraged the Boruta algorithm to identify a seven-protein panel (ACE, ANGT, ITIH4, SELENOP, W5PHP7, PTX3, and F5) with superior predictive power (AUC=0.926). A cross-species comparison between human and sheep demonstrated that three proteins, angiotensinogen (ANGT), angiotensin-converting enzyme (ACE), and pentraxin 3 (PTX3), were similarly dysregulated in both species.
Conclusions: This study provides the first direct molecular evidence implicating a dysregulated neurohormonal-inflammatory axis as a principal driver of acute post-operative Fontan tachycardia. The identified protein signature offers novel mechanistic insights and establishes a foundation for targeted diagnostics and therapeutics to predict and mitigate this significant clinical complication.
bioinformatics 2026-02-02 v1
LFQ Benchmark Dataset - Generation Beta: Assessing Modern Proteomics Instruments and Acquisition Workflows with High-Throughput LC Gradients
Van Puyvelde, B. R.; Devreese, R.; Chiva, C.; Sabido, E.; Pfammatter, S.; Panse, C.; Rijal, J. B.; Keller, C.; Batruch, I.; Pribil, P.; Vincendet, J.-B.; Fontaine, F.; Lefever, L.; Magalhaes, P.; Deforce, D.; Nanni, P.; Ghesquiere, B.; Perez-Riverol, Y.; Martens, L.; Carapito, C.; Bouwmeester, R.; Dhaenens, M.
AI Summary
- This study extends a previous benchmark dataset to evaluate modern LC-MS platforms for high-throughput proteomics using short LC gradients (5 and 15 min) and low sample inputs.
- Data was collected from a hybrid human-yeast-E. coli proteome across four platforms, including new quadrupole-based systems, to assess proteome depth, quantitative precision, and cross-instrument consistency.
- The dataset, available via ProteomeXchange, aims to advance cross-platform algorithm development and standardize high-throughput LC-MS proteomics.
Abstract
Recent advances in liquid chromatography-mass spectrometry (LC-MS) have accelerated the adoption of high-throughput workflows that deliver deep proteome coverage using minimal sample amounts. This trend is largely driven by clinical and single-cell proteomics, where sensitivity and reproducibility are essential. Here, we extend our previous benchmark dataset (PXD028735) using next-generation LC-MS platforms optimized for rapid proteome analysis. We generated an extensive DDA/DIA dataset using a human-yeast-E. coli hybrid proteome. The proteome sample was distributed across multiple laboratories together with standardized analytical protocols specifying two short LC gradients (5 and 15 min) and low sample input amounts. This dataset includes data acquired on four different platforms, and features new scanning quadrupole-based implementations, extending coverage across different instruments and acquisition strategies. Our comprehensive evaluation highlights how technological advances and reduced LC gradients may affect proteome depth, quantitative precision, and cross-instrument consistency. The release of this benchmark dataset via ProteomeXchange (PXD070049 and PXD071205) accelerates cross-platform algorithm development, enhances data mining strategies, and supports standardization of short-gradient, high-throughput LC-MS-based proteomics.
bioinformatics 2026-02-02 v1
CellCov: gene-body coverage profiling for single-cell RNA-seq
Chen, S.; Zevnik, U.; Ziegenhain, C.
AI Summary
- CellCov addresses the issue of gene-body coverage bias in single-cell RNA-seq by providing profiling at single-cell resolution, which reveals cell-to-cell variability.
- It supports flexible grouping and aggregation of coverage profiles, facilitating comparison across different sequencing protocols.
- The tool was demonstrated on public datasets from various full-length scRNA-seq chemistries, showcasing its utility.
Abstract
Motivation: Gene-body coverage bias differs across scRNA-seq protocols and can influence downstream analyses, yet coverage is often assessed using bulk-level summaries that obscure cell-to-cell variability. Results: CellCov provides gene-body coverage profiling at single-cell resolution, enabling exploration of coverage heterogeneity across both cells and features. The accompanying workflow supports flexible grouping and robust aggregation of profiles by user-provided annotations, allowing principled comparison of coverage bias across sequencing protocols. We demonstrate its use on public datasets from several full-length scRNA-seq chemistries. Availability: CellCov source code and documentation are available at https://github.com/ziegenhain-lab/CellCov
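Per-cell gene-body coverage profiling of the kind described above can be sketched by rescaling read positions on a gene to the unit interval and binning, one profile per cell. The bin count and toy read positions are illustrative assumptions, not CellCov's implementation:

```python
# Sketch of a per-cell gene-body coverage profile: read positions on one
# gene, rescaled to [0, 1) and binned. Bin count and positions are invented.

def coverage_profile(read_positions, gene_length, n_bins=20):
    counts = [0] * n_bins
    for pos in read_positions:
        frac = min(pos / gene_length, 1.0 - 1e-9)  # clamp gene-end reads into last bin
        counts[int(frac * n_bins)] += 1
    total = sum(counts) or 1
    return [c / total for c in counts]  # normalized so profiles compare across cells

# A strongly 3'-biased cell: most reads fall in the last bins of the gene body.
biased = coverage_profile([901, 951, 981, 991], gene_length=1000)
```

Aggregating such per-cell vectors by user-provided annotations (protocol, cell type) is then a straightforward group-wise average or median, which is roughly the comparison the abstract describes.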
bioinformatics 2026-02-02 v1
Learning Dynamic Protein Representations at Scale with Distograms
Portal, N.; Karroucha, W.; Mallet, V.; Bonomi, M.
AI Summary
- This study addresses the challenge of incorporating protein structural dynamics into machine learning by using distograms from AlphaFold2 instead of computationally intensive simulations.
- The approach involves encoding dynamic protein information through residue-residue distance probability distributions to enhance function prediction.
- Key finding: This method offers a scalable solution for dynamic protein representation, potentially improving prediction accuracy without the need for explicit conformational sampling.
Abstract
Protein function and other biological properties often depend on structural dynamics, yet most machine-learning predictors rely on static representations. Physics-based molecular simulations can describe conformational variability but remain computationally prohibitive at scale. Generative models provide a more efficient alternative, though their ability to produce accurate conformational ensembles is still limited. In this work, we bypass expensive simulations by leveraging residue-residue distance probability distributions (distograms) from structure predictors such as AlphaFold2. Our approach provides a scalable way to encode dynamic information into protein representations, aiming to improve function prediction without explicit conformational sampling.
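Two simple summary features one could extract from a distogram row (the per-residue-pair distance distribution): the expected distance and the entropy of the bin distribution, a crude proxy for pair flexibility. The bin centers and probabilities below are invented for illustration; this sketches the general idea, not the authors' encoder:

```python
# Two summary features from one residue-pair distogram row: expected
# distance and entropy of the distance distribution (a flexibility proxy).
# Bin centers (Angstrom) and probabilities are invented.
import math

def distogram_features(bin_centers, probs):
    mean_dist = sum(c * p for c, p in zip(bin_centers, probs))
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return mean_dist, entropy

# A rigid pair concentrates mass in one bin; a flexible pair spreads it out.
rigid = distogram_features([4.0, 6.0, 8.0], [1.0, 0.0, 0.0])
flexible = distogram_features([4.0, 6.0, 8.0], [1 / 3, 1 / 3, 1 / 3])
```

The appeal of such features is that they come for free from a single structure-prediction pass, with no molecular dynamics or generative sampling.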
bioinformatics 2026-02-02 v1
AstraKit: Customizable, reproducible workflows for biomedical research and precision medicine
Kurz, N. S.; Kornrumpf, K.; Stoves, M. K.; Dönitz, J.
AI Summary
- AstraKit is a customizable KNIME workflow suite designed to streamline precision medicine analytics by integrating variant interpretation, multi-omics analysis, and drug response modeling.
- It features dynamic variant annotation, multi-layered omics integration, and translational drug matching, validated in oncology cohorts to show concordance with clinical outcomes.
- AstraKit's open-source, platform-independent workflows enhance reproducibility and accelerate biomarker validation, linking molecular profiles to therapeutic decisions.
Abstract
Motivation: Fragmented bioinformatics tools compel researchers and clinicians to resort to error-prone manual pipelines. The success of precision medicine and biomedical research depends largely on efficient software for interpreting genetic variants, selecting highly effective targeted therapies, analyzing multi-omics data, and integrating drug screen analyses. Results: We present AstraKit, a unified KNIME workflow suite enabling end-to-end precision medicine analytics. AstraKit introduces three transformative innovations: 1) Dynamic variant interpretation with customizable annotation and filtering for disease-specific genomic contexts; 2) Multi-layered omics analyses integrating genomic, transcriptomic, and epigenetic data; and 3) Translational drug matching that correlates in vitro drug screens with clinical outcomes. Validated across oncology cohorts, AstraKit demonstrates concordance between experimental drug sensitivity and clinical trial responses, resolving discordances to uncover resistance mechanisms. By unifying variant analysis, multi-omics, and drug response modeling on a single customizable platform, AstraKit eliminates siloed workflows, accelerating biomarker validation and enabling clinicians to directly link molecular profiles to therapeutic decisions. As all AstraKit workflows are open-source and platform-independent, we provide a versatile, comprehensive software suite for a multitude of tasks in bioinformatics and precision medicine. Availability and implementation: The KNIME workflows are available at KNIME Hub https://hub.knime.com/bioinf_goe/spaces/Public/AstraKit~lfVsGBY2HnPYc1h1/. The source code is available at https://gitlab.gwdg.de/MedBioinf/mtb/astrakit.
bioinformatics 2026-01-31 v3
FusionPath: Gene fusion pathogenicity prediction using protein structural data and contextual protein embeddings
Kurz, N. S.; Güven, I. B.; Beissbarth, T.; Dönitz, J.
AI Summary
- FusionPath is a deep learning framework designed to predict gene fusion pathogenicity by integrating protein embeddings, structural data, and functional annotations.
- It uses a hierarchical attention mechanism to weigh different feature contributions, achieving superior performance over existing methods with higher AUC scores.
- SHAP analysis showed that protein domains and GO terms provide interpretable, non-redundant signals, highlighting specific domains and processes crucial for pathogenicity prediction.
Abstract
Accurate prediction of gene fusion pathogenicity is critical for understanding oncogenic mechanisms and advancing precision oncology. While existing computational methods provide valuable insights, their performance remains limited by incomplete integration of multi-scale biological features and lack of interpretability. We present FusionPath, a novel deep learning framework for gene fusion pathogenicity prediction. FusionPath uniquely integrates embeddings from multiple pretrained protein language models, including FusON-pLM and ProtBERT, with retained protein domains and Gene Ontology (GO) functional annotations. A hierarchical attention mechanism dynamically weights the contribution of each feature type, enabling both high-accuracy prediction and biological interpretability. The model was trained and rigorously validated on a large-scale dataset of clinically annotated pathogenic and benign fusions. FusionPath significantly outperformed state-of-the-art methods, achieving higher AUC on independent test sets. Crucially, SHAP analysis revealed that protein domains and GO terms contributed non-redundant, biologically interpretable signals, with specific domains and GO processes exhibiting high predictive weights for pathogenicity. FusionPath establishes a new standard for gene fusion pathogenicity prediction by effectively leveraging complementary sequence, structural, and functional information. Its attention-driven interpretability provides actionable insights into the molecular determinants of fusion oncogenicity, facilitating biological discovery and clinical variant prioritization. The framework is publicly available to accelerate research in cancer genomics and therapeutic target identification.
bioinformatics 2026-01-31 v3
A novel phylogenomics pipeline reveals complex pattern of reticulate evolution in Cucurbitales
Ortiz, E. M.; Hoewener, A.; Shigita, G.; Raza, M.; Maurin, O.; Zuntini, A.; Forest, F.; Baker, W. J.; Schaefer, H.
AI Summary
- This study introduces Captus, a novel pipeline for integrating diverse sequencing data types for phylogenomic analysis, applied to the angiosperm order Cucurbitales.
- Captus efficiently assembles and analyzes mixed data, recovering more complete loci across species, and reveals complex reticulate evolution patterns within Cucurbitales and Cucurbitaceae.
- The phylogenomic analysis supports the current classification of Cucurbitales but shows conflicting placement of Apodanthaceae, suggesting gene tree conflict as a cause for previous discrepancies in phylogenetic studies.
Abstract
A diverse range of high-throughput sequencing data, such as target capture, RNA-Seq, genome skimming, and high-depth whole genome sequencing, are used for phylogenomic analyses but the integration of such mixed data types into a single phylogenomic dataset requires a number of bioinformatic tools and significant computational resources. Here, we present a novel pipeline, Captus, to analyze mixed data in a fast and efficient way. Captus assembles these data types, allows searching of the assemblies for loci of interest, and finally produces alignments filtered for paralogs. If reference target loci are not available for the studied taxon, Captus can also be used to discover new putative homologs via sequence clustering. Compared to other software, Captus allows the recovery of a greater number of more complete loci across a larger number of species. We apply Captus to assemble a comprehensive mixed dataset, comprising the four types of sequencing data for the angiosperm order Cucurbitales, a clade of about 3,100 species in eight mainly tropical plant families, including begonias (Begoniaceae) and gourds (Cucurbitaceae). Our phylogenomic results support the currently accepted circumscription of Cucurbitales except for the position of the holoparasitic Apodanthaceae, which group with Rafflesiaceae in Malpighiales. A subset of mitochondrial gene regions supports the earlier position of Apodanthaceae in Cucurbitales. However, the nuclear regions and majority of mitochondrial regions place Apodanthaceae in Malpighiales. Within Cucurbitaceae, we confirm the monophyly of all currently accepted tribes but also reveal deep reticulation patterns both in Cucurbitales and within Cucurbitaceae. We show that contradicting results among earlier phylogenetic studies in Cucurbitales can be reconciled when accounting for gene tree conflict and demonstrate the efficiency of Captus for complex datasets.
bioinformatics 2026-01-31 v3
PanCNV-Explorer: Deciphering copy number alterations across human cancers
Kurz, N. S.; Kornrumpf, K.; Krüger, A.-R.; Dönitz, J.
AI Summary
- PanCNV-Explorer integrates copy number variation data from 33 cancer types and healthy tissues to create a comprehensive database.
- It combines CNV profiles with functional genomic data to identify context-specific oncogenic drivers, vulnerabilities, and therapeutic targets.
- The platform offers an interactive web interface and programmatic services for annotating user-submitted CNVs with functional, clinical, and pathogenicity insights.
Abstract
Copy number variants (CNVs) drive cancer progression and genetic disorders, yet the interpretation of their biological consequences and potential targeted therapeutic options remains fragmented across clinical, functional, and structural domains. To bridge this gap, we present PanCNV-Explorer, a comprehensive annotated CNV database integrating copy number variation data from pan-cancer and healthy cohorts. PanCNV-Explorer combines CNV profiles with functional genomic layers, including gene expression and CRISPR screening data, through a novel analytical framework. PanCNV-Explorer represents a systematic map of copy number variation in the human genome across 33 different cancer types and normal tissue, integrating pan-cancer and healthy samples into a comprehensive, harmonized database. These analyses reveal context-specific oncogenic drivers, vulnerabilities, and therapeutic targets, accessible via an interactive web interface for dynamic exploration and hypothesis generation. In addition, the web server provides programmatic web services for annotating user-submitted CNVs with functional annotation, clinical relevance, and pathogenicity predictions. PanCNV-Explorer serves as a pivotal resource for accelerating the analyses of copy number and structural variants in the human genome, bridging raw CNV data to actionable biological and clinical interpretations. A public web instance of the PanCNV-Explorer web server is available at https://mtb.bioinf.med.uni-goettingen.de/pancnv-explorer.
bioinformatics2026-01-31v3PanCNV-Explorer: Deciphering copy number alterations across human cancers
Kurz, N. S.; Kornrumpf, K.; Krüger, A.-R.; Dönitz, J.AI Summary
- PanCNV-Explorer integrates copy number variation data from 33 cancer types and healthy tissues to create a comprehensive database.
- It combines CNV profiles with functional genomic data to identify context-specific oncogenic drivers, vulnerabilities, and therapeutic targets.
- The tool offers an interactive web interface and programmatic services for annotating user-submitted CNVs with functional, clinical, and pathogenicity insights.
Abstract
Copy number variants (CNVs) drive cancer progression and genetic disorders, yet the interpretation of their biological consequences and potential targeted therapeutic options remains fragmented across clinical, functional, and structural domains. To bridge this gap, we present PanCNV-Explorer, a comprehensive annotated CNV database integrating copy number variation data from pan-cancer and healthy cohorts. PanCNV-Explorer combines CNV profiles with functional genomic layers, including gene expression CRISPR screening data, through a novel analytical framework. PanCNV-Explorer represents a systematic map of copy number variation in the human genome across 33 different cancer types and normal tissue, integrating pan-cancer and healthy samples into a comprehensive, harmonized database. These analyses reveal context-specific oncogenic drivers, vulnerabilities, and therapeutic targets, accessible via an interactive web interface for dynamic exploration and hypothesis generation. In addition, the web server provides programmatic web services for annotating user-submitted CNVs with functional annotation, clinical relevance, and pathogenicity predictions. PanCNV-Explorer serves as a pivotal resource for accelerating the analyses of copy number and structural variants in the human genome, bridging raw CNV data to actionable biological and clinical interpretations. A public web instance of the PanCNV-Explorer web server is available at https://mtb.bioinf.med.uni-goettingen.de/pancnv-explorer.
bioinformatics 2026-01-31 v2
AstraKit: Customizable, reproducible workflows for biomedical research and precision medicine
Kurz, N. S.; Kornrumpf, K.; Stoves, M. K.; Doenitz, J.
AI Summary
- AstraKit is a KNIME workflow suite designed to streamline precision medicine analytics by integrating variant interpretation, multi-omics analysis, and drug response modeling.
- It offers customizable workflows for dynamic variant annotation, multi-layered omics integration, and translational drug matching, validated in oncology cohorts.
- AstraKit's open-source and platform-independent nature enhances its utility in bioinformatics and precision medicine, available on KNIME Hub and GitLab.
Abstract
Motivation
Fragmented bioinformatics tools compel researchers and clinicians to resort to error-prone manual pipelines. The success of precision medicine and biomedical research depends on efficient software solutions for processing and interpreting genetic variants, interpreting multi-omics data, integrating drug screen analyses, and selecting highly effective targeted therapies.
Results
We present AstraKit, a unified KNIME workflow suite enabling end-to-end precision medicine analytics. AstraKit introduces three transformative innovations: 1) dynamic variant interpretation with customizable annotation and filtering for disease-specific genomic contexts; 2) multi-layered omics analyses integrating genomic, transcriptomic, and epigenetic data; and 3) translational drug matching that correlates in vitro drug screens with clinical outcomes. Validated across oncology cohorts, AstraKit demonstrates concordance between experimental drug sensitivity and clinical trial responses, resolving discordances to uncover resistance mechanisms. By unifying variant analysis, multi-omics, and drug response modeling on a single customizable platform, AstraKit eliminates siloed workflows, accelerating biomarker validation and enabling clinicians to directly link molecular profiles to therapeutic decisions. As all AstraKit workflows are open-source and platform-independent, we provide a versatile, comprehensive software suite for a multitude of tasks in bioinformatics and precision medicine.
Availability and implementation
The KNIME workflows are available at KNIME Hub https://hub.knime.com/bioinf_goe/spaces/Public/AstraKit~lfVsGBY2HnPYc1h1/. The source code is available at https://gitlab.gwdg.de/MedBioinf/mtb/astrakit.
bioinformatics 2026-01-31 v2
Longevity Bench: Are SotA LLMs ready for aging research?
Zhavoronkov, A.; Sidorenko, D.; Naumov, V.; Pushkov, S.; Zagirova, D.; Aladinskiy, V.; Unutmaz, D.; Aliper, A.; Galkin, F.
AI Summary
- LongevityBench was developed to evaluate if state-of-the-art LLMs can understand aging biology and utilize biodata for phenotype predictions.
- The benchmark includes tasks on predicting human time-to-death, mutation effects on lifespan, and age-related omics patterns, covering various biodata types.
- Testing revealed current LLMs' limitations, suggesting improvements for their application in aging research.
Abstract
Aging is a core biological process observed in most species and tissues, which is studied with a vast array of technologies. We argue that the abilities of AI systems to emulate aging and to accurately interpret biodata in its context are the key criteria to judge an LLM's utility in biomedical research. Here, we present LongevityBench -- a collection of tasks designed to assess whether foundation models grasp the fundamental principles of aging biology and can use low-level biodata to arrive at phenotype-level conclusions. The benchmark covers a variety of prediction targets including human time-to-death, mutations' effects on lifespan, and age-dependent omics patterns. It spans all common biodata types used in longevity research: transcriptomes, DNA methylation profiles, proteomes, genomes, clinical blood tests and biometrics, as well as natural language annotations. After ranking state-of-the-art foundation models using LongevityBench, we highlight their weaknesses and outline procedures to maximize their utility in aging research and life sciences.
bioinformatics 2026-01-30 v2
Diffusion-based Representation Integration for Foundation Models Improves Spatial Transcriptomics Analysis
Jain, A.; Pham, T. M.; Laidlaw, D. H.; Ma, Y.; Singh, R.
AI Summary
- DRIFT integrates spatial context into foundation models using diffusion on spatial graphs from spatial transcriptomics (ST) data to enhance tasks like cell-type annotation and clustering.
- The framework uses heat kernel diffusion to incorporate local neighborhood context while preserving transcriptomic representations from single-cell models.
- Benchmarking showed DRIFT significantly improves performance of foundational models on ST tasks compared to specialized methods.
Abstract
Motivation: We propose DRIFT, a framework that integrates spatial context into the input representations for foundation models by leveraging diffusion on spatial graphs derived from spatial transcriptomics (ST) data. ST captures gene expression profiles while preserving spatial context, enabling downstream analysis tasks such as cell-type annotation, clustering, and cross-sample alignment. However, due to its emerging nature, there are very few foundation models that can utilize ST data to generate embeddings generalizable across multiple tasks. Meanwhile, well-documented foundational models trained on large-scale single-cell gene expression (scRNA-seq) data have demonstrated generalizable performance across scRNA-seq assays, tissues, and tasks; however, they do not leverage the spatial information in ST data. We use heat kernel diffusion to propagate embeddings across spatial neighborhoods, incorporating the local neighborhood context of the ST data while preserving the transcriptomic representations learned by state-of-the-art single-cell foundation models. Results: We systematically benchmark five foundational models (both scRNA-seq and ST-based) across key ST tasks such as annotation, alignment, and clustering, ensuring a comprehensive evaluation of our proposed framework. Our results show that DRIFT significantly improves the performance of existing foundational models on ST data over specialized state-of-the-art methods. Overall, DRIFT is an effective, accessible, and generalizable framework that bridges the gap toward universal models for modeling spatial transcriptomics. Availability and Implementation: Code and data available at https://github.com/rsinghlab/DRIFT.
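The heat-kernel diffusion step described in the abstract can be sketched in a few lines: build a kNN graph from spatial coordinates, form its Laplacian L, and smooth the per-cell embeddings with exp(-tL). This is a minimal illustration of the general technique, not DRIFT's actual implementation; the graph construction, the bandwidth `t`, and all names below are assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.linalg import expm

def heat_diffuse(coords, embeddings, k=6, t=0.5):
    """Smooth per-cell embeddings over a spatial kNN graph with a heat kernel.

    coords:      (n, 2) spatial positions of cells/spots
    embeddings:  (n, d) transcriptomic embeddings from a foundation model
    Returns the diffused (n, d) embeddings exp(-t * L) @ embeddings.
    """
    n = coords.shape[0]
    # Symmetric kNN adjacency built from spatial coordinates
    _, idx = cKDTree(coords).query(coords, k=k + 1)  # first neighbor is self
    A = np.zeros((n, n))
    for i, nbrs in enumerate(idx[:, 1:]):
        A[i, nbrs] = 1.0
    A = np.maximum(A, A.T)
    # Combinatorial graph Laplacian L = D - A
    L = np.diag(A.sum(axis=1)) - A
    # Heat kernel via exact matrix exponential (fine for small n; large
    # graphs would use a truncated Chebyshev or Taylor approximation)
    return expm(-t * L) @ embeddings

rng = np.random.default_rng(0)
coords = rng.uniform(size=(50, 2))
emb = rng.normal(size=(50, 16))
smoothed = heat_diffuse(coords, emb)
print(smoothed.shape)  # (50, 16)
```

Because the Laplacian is symmetric and annihilates the constant vector, this smoothing shrinks high-frequency variation across neighboring cells while leaving each embedding dimension's overall mean unchanged, which matches the stated goal of adding local context without discarding the learned transcriptomic representation.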
bioinformatics 2026-01-30 v2
Evidence of off-target probe binding affecting 10x Genomics Xenium gene panels compromises accuracy of spatial transcriptomic profiling
Hallinan, C.; Ji, H. J.; Tsou, E.; Salzberg, S. L.; Fan, J.
AI Summary
- Investigated off-target binding in 10x Genomics Xenium technology using a developed software tool, Off-target Probe Tracker (OPT), to identify potential off-target binding in a human breast gene panel.
- Found that at least 14 out of 313 genes were potentially affected by off-target binding to protein-coding genes.
- Validated findings by comparing Xenium data with Visium CytAssist and single-cell RNA-seq, showing that some gene expression patterns reflected both target and off-target genes.
Abstract
The accuracy of spatial gene expression profiles generated by probe-based in situ spatially-resolved transcriptomic technologies depends on the specificity with which probes bind to their intended target gene. Off-target binding, defined as a probe binding to something other than the target gene, can distort a gene's true expression profile, making probe specificity essential for reliable transcriptomics. Here, we investigated off-target binding affecting the 10x Genomics Xenium technology. We developed a software tool, Off-target Probe Tracker (OPT), to identify putative off-target binding via alignment of probe sequences and assessing whether mapped loci corresponded to the intended target gene across multiple reference annotations. Applying OPT to a Xenium human breast gene panel, we identified at least 14 out of the 313 genes in the panel potentially impacted by off-target binding to protein-coding genes. To substantiate our predictions, we leveraged a Xenium breast cancer dataset generated using this gene panel and compared results to orthogonal spatial and single-cell transcriptomic profiles from Visium CytAssist and 3' single-cell RNA-seq derived from the same tumor block. Our findings indicate that for some genes, the expression patterns detected by Xenium demonstrably reflect the aggregate expression of the target and predicted off-target genes based on Visium and single-cell RNA-seq rather than the target gene alone. We further applied OPT to identify potential off-target binding in custom gene panels and integrated tissue-specific RNA-seq data to assess effects. Overall, this work enhances the biological interpretability of spatial transcriptomics data and improves reproducibility in spatial transcriptomics research.
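The aggregate-expression check described in the abstract — asking whether a Xenium gene's profile matches the target gene alone or the target plus predicted off-targets in orthogonal scRNA-seq — could be sketched roughly as below. The data structures and gene names are hypothetical illustrations, not OPT's actual interface.

```python
import numpy as np

def aggregate_fit(xenium_means, scrna_means, target, off_targets):
    """Correlate a Xenium gene's per-cell-type profile against the target
    gene alone vs. the target plus predicted off-target genes in matched
    scRNA-seq. Inputs map gene name -> (n_celltypes,) mean expression.
    Returns (r_target, r_aggregate) Pearson correlations.
    """
    x = np.asarray(xenium_means[target], dtype=float)
    solo = np.asarray(scrna_means[target], dtype=float)
    agg = solo + sum(np.asarray(scrna_means[g], dtype=float)
                     for g in off_targets)
    r = lambda a, b: float(np.corrcoef(a, b)[0, 1])
    return r(x, solo), r(x, agg)

# Toy profiles over three cell types: the Xenium signal is constructed as
# the sum of the target gene and one predicted off-target gene.
scrna = {"GENE_A": np.array([10.0, 0.0, 5.0]),
         "GENE_B": np.array([0.0, 8.0, 1.0])}
xenium = {"GENE_A": scrna["GENE_A"] + scrna["GENE_B"]}
r_target, r_aggregate = aggregate_fit(xenium, scrna, "GENE_A", ["GENE_B"])
print(r_target, r_aggregate)  # aggregate fits better than target alone
```

A gene for which `r_aggregate` clearly exceeds `r_target` across cell types would be a candidate for the kind of off-target contamination the study reports.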
bioinformatics 2026-01-30 v2