Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
Cell-Type-Resolved Isoform Atlas of Human Tissues Reveals Age and Alzheimer's Disease-Associated Splicing Changes
Yamamoto, R.; Fu, T.; Huang, E.; Fraser, A.; Nazim, M.; Xiao, X.; Zaitlen, N.
AI Summary
- The study introduces Sciege, a method integrating bulk RNA-seq with single-cell and long-read data to estimate cell type-specific isoform distributions in human tissues.
- Applied to GTEx and ROSMAP datasets, Sciege created a multi-tissue isoform atlas, revealing isoform changes related to aging and Alzheimer's disease.
- Key findings include the upregulation of the MAPT-010 isoform in Alzheimer's disease inhibitory neurons, validated by external data.
Abstract
Alternative splicing is a key mechanism for transcriptomic diversity, but how isoforms map to specific cell types in bulk tissues remains unclear. We present Sciege, a multimodal method that integrates bulk short-read RNA-seq with single-cell and long-read data to estimate cell type-specific isoform distributions. Through simulations, we demonstrate that Sciege accurately estimates isoform abundances and identifies differentially abundant transcripts through statistical tests. Applied to seven tissues in GTEx and brain tissue in ROSMAP datasets, Sciege generates the first multi-tissue isoform atlas of its kind to date and reveals isoform changes linked to cell types, aging, and Alzheimer's disease. Validation with external cohorts and experimental data confirms our findings. Notably, we identify upregulation of the MAPT-010 isoform in AD inhibitory neurons, consistent with known methylation signatures. Our approach demonstrates the value of integrating RNA-seq data to study cell type-specific splicing and provides a foundation for further genetic and functional studies of alternative splicing across biological contexts.
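A minimal sketch of the core estimation problem described above: given per-sample cell-type fractions, recover cell type-specific isoform profiles from bulk mixtures by non-negative least squares. All names, shapes, and the NNLS simplification are illustrative assumptions, not Sciege's actual model.

```python
# Illustrative sketch (not Sciege): recover cell-type-specific isoform
# profiles P from bulk mixtures B given cell-type fractions F, via NNLS.
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
n_samples, n_celltypes, n_isoforms = 200, 5, 3

F = rng.dirichlet(np.ones(n_celltypes), size=n_samples)       # fractions per sample
P_true = rng.gamma(2.0, 1.0, size=(n_celltypes, n_isoforms))  # cell-type isoform profiles
B = F @ P_true + rng.normal(0, 0.05, size=(n_samples, n_isoforms))

# One non-negative least-squares problem per isoform: B[:, i] ~ F @ P[:, i]
P_hat = np.column_stack([nnls(F, B[:, i])[0] for i in range(n_isoforms)])
print(np.abs(P_hat - P_true).max())  # small recovery error
```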
bioinformatics · 2025-11-16 · v1
Contrastive modelling of transcription and transcript abundance in legumes using PlanTT
Raymond, N.; Zhang, X.; Sheikh, J.; Daveouis, F.; Verma, R.; Cram, D.; Song, H.; Cao, Y.; Kirzinger, M.; Akaniru, D.; Ubbens, J.; Konkin, D.
AI Summary
- The study aimed to predict the impact of sequence variation on gene expression in legumes by comparing transcription and transcript abundance.
- A multiomic dataset from four legume species was used to develop contrastive models within a novel prediction framework.
- Key findings included the ability to predict quantitative differences in gene expression between ortholog pairs from unseen orthogroups.
Abstract
Predicting the impacts of sequence variation on gene expression remains a challenging task. Further, in plants, we have a limited understanding of the relative contributions of different gene expression regulatory mechanisms. To address these limitations we generated a comparative multiomic dataset comprising matched 3'-RNA-seq and PRO-seq data from matched tissues of reference genotypes of four legumes of the inverted repeat-lacking clade (Pisum sativum, Vicia faba, Lathyrus sativus and Medicago truncatula). Focusing on the challenging task of predicting expression differences between ortholog pairs from unseen orthogroups, we used this dataset and a novel prediction framework to build contrastive models that predict quantitative differences (effect size differences) in transcription and transcript abundance.
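A minimal twin-tower sketch of a contrastive model that maps two ortholog sequences to a predicted expression difference; the architecture, sizes, and loss are assumptions for illustration and are not PlanTT's.

```python
# Toy twin-tower contrastive regressor: predict an expression difference
# between a pair of ortholog sequences. Sizes/architecture are assumptions.
import torch
import torch.nn as nn

class TwinRegressor(nn.Module):
    def __init__(self, vocab=5, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.encoder = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size=9, padding=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
        )
        self.head = nn.Linear(dim, 1)

    def forward(self, seq_a, seq_b):
        za = self.encoder(self.embed(seq_a).transpose(1, 2))
        zb = self.encoder(self.embed(seq_b).transpose(1, 2))
        return self.head(za - zb).squeeze(-1)  # predicted effect-size difference

model = TwinRegressor()
a = torch.randint(0, 5, (8, 1000))   # tokenized ortholog A (batch of pairs)
b = torch.randint(0, 5, (8, 1000))   # tokenized ortholog B
loss = nn.functional.mse_loss(model(a, b), torch.randn(8))
loss.backward()
```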
bioinformatics · 2025-11-16 · v1
Using Large Language Models to Assemble, Audit, and Prioritize the Therapeutic Landscape
Thyagatur Kidigannappa, A.; Sonti, N.; Vijayan, R.; Parikh, A.; Faller, R.
AI Summary
- The study introduces an AI-assisted pipeline that uses large language models to analyze the therapeutic landscape for specific diseases by integrating data from structured and unstructured sources.
- The pipeline creates a disease-centric map of therapeutic assets, categorizing them from preclinical to FDA-approved, and ranks them using a scoring system that considers trial data, clinical outcomes, and regulatory status.
- Case studies on Alzheimer's disease, pancreatic cancer, and cystic fibrosis illustrate the pipeline's effectiveness in providing comprehensive, prioritized therapeutic insights.
Abstract
We present an AI-assisted pipeline for disease-specific drug landscape analysis. Given a disease name, the system assembles a comprehensive, evidence-based view of therapeutic assets by integrating structured sources (such as ClinicalTrials.gov and ChEMBL) and unstructured sources (such as publications, press releases, and patents). Large language models are used in a constrained, auditable mode to normalize drug aliases, resolve drug target/mechanism of action annotations, and harmonize program status across records. The output is a disease-centric map that spans preclinical assets, not-yet-approved assets (both active and discontinued/shelved), and FDA-approved drugs suitable for repurposing. Assets are ranked using interpretable, evidence-based scoring heuristics that combine trial volume and clinical phase, endpoint outcomes, biomarker support, recency of activity, and regulatory designations, along with penalties for safety signals and non-pharmaceutical interventions, as well as proportional adjustments for operational versus scientific discontinuations. Case studies in Alzheimer's disease, pancreatic cancer, and cystic fibrosis demonstrate generality, coverage, and discrimination across mechanisms and stages. This framework provides a transparent method to assemble and prioritize the therapeutic landscape for any disease, unifying disparate data into a coherent and analyzable representation.
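An illustrative-only scoring function in the spirit of the heuristics described (phase, trial volume, endpoints, recency, designations, penalties, and proportional adjustments); the weights and field names are assumptions, not the authors' values.

```python
# Hypothetical asset-scoring heuristic; weights are placeholders, not the
# paper's calibrated values.
from dataclasses import dataclass

@dataclass
class Asset:
    max_phase: int                     # 0=preclinical .. 4=approved
    n_trials: int
    positive_endpoints: int
    years_since_activity: float
    has_regulatory_designation: bool
    safety_signal: bool
    operational_discontinuation: bool  # vs. scientific failure

def score(a: Asset) -> float:
    s = 2.0 * a.max_phase + 0.5 * min(a.n_trials, 10) + 1.5 * a.positive_endpoints
    s += 1.0 if a.has_regulatory_designation else 0.0
    s -= 0.3 * a.years_since_activity          # recency decay
    s -= 3.0 if a.safety_signal else 0.0       # safety penalty
    if a.operational_discontinuation:
        s *= 0.8  # proportional adjustment: softer than a scientific failure
    return s

print(score(Asset(3, 12, 2, 1.0, True, False, False)))
```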
bioinformatics · 2025-11-16 · v1
Preference-Based Fine-Tuning of Genomic Sequence Models for Personal Expression Prediction with Data Augmentation
Choi, M.; Cho, B.; Lee, S.
AI Summary
- The study addresses the challenge of predicting individual gene expression from DNA sequences by integrating genomic data synthesis with statistical frameworks.
- They simulated genetic variations from the 1000 Genomes Project, assigned pseudo-expression labels with PrediXcan, and used a preference-based fine-tuning approach on Enformer.
- Their method, tested on the GEUVADIS dataset, outperformed AlphaGenome, PrediXcan, and standard Enformer, enhancing prediction accuracy by leveraging simulated data.
Abstract
Despite substantial progress in genomic foundation models, accurately predicting inter-individual variation in gene expression from DNA sequence alone remains a major challenge. Current sequence-based models, such as Enformer and Borzoi, trained exclusively on the reference genome, cannot capture the effects of individual-specific regulatory variants. Moreover, the acquisition of paired whole-genome and transcriptome data required for personalized modeling is hindered by privacy and data-sharing constraints. To address this limitation, we integrate genomic data synthesis with established statistical frameworks. Our approach generates thousands of virtual training samples by simulating genetic variation from the 1000 Genomes Project and assigning pseudo-expression labels using PrediXcan, a validated eQTL-based predictor. Because simulated and real expression values differ in scale and distribution, we introduce a preference-based objective that models relative rather than absolute expression patterns. Fine-tuning Enformer through alternating cycles of real-data regression and virtual-data preference optimization enables efficient learning from both real and synthesized data. Using the GEUVADIS dataset, our framework outperforms AlphaGenome, PrediXcan, and Enformer fine-tuned without synthesized data, demonstrating that simulation-based integration of population-level regulatory knowledge can effectively mitigate data scarcity and improve cross-individual generalization in sequence-based gene expression prediction. Availability and implementation: Code and data are available at https://github.com/pacifiic/augment-finetune-genomics.
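A minimal sketch of a pairwise preference objective over relative expression, assuming a Bradley-Terry-style formulation; the paper's actual objective and alternating training schedule may differ.

```python
# Sketch of a preference objective on relative expression: model P(A > B)
# as a sigmoid of the predicted difference. A simplified stand-in, not the
# paper's exact loss.
import torch
import torch.nn.functional as F

def preference_loss(pred_a, pred_b, a_higher):
    """pred_a/pred_b: predicted expression for two individuals at one gene;
    a_higher: 1.0 where the pseudo-label says individual A expresses more."""
    return F.binary_cross_entropy_with_logits(pred_a - pred_b, a_higher)

pred_a = torch.randn(32, requires_grad=True)   # stand-in model outputs
pred_b = torch.randn(32)
labels = (torch.rand(32) > 0.5).float()        # pseudo-preferences (e.g., from PrediXcan)
preference_loss(pred_a, pred_b, labels).backward()
```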
bioinformatics · 2025-11-15 · v2
Novel tensor decomposition-based approach for cell type deconvolution in Visium datasets when reference scRNA-seqs include multiple minor cell types
Taguchi, Y.-h.; Turki, T.
AI Summary
- The study introduces a tensor decomposition (TD)-based unsupervised feature extraction method for integrating multiple Visium datasets to profile spatial gene expression.
- This approach successfully deconvolutes cell types within Visium spots by referencing scRNA-seq data with multiple minor cell types, where conventional methods like RCTD and SPOTlight fail.
- The TD-based method is effective for deconvolution in scenarios with multiple minor cell types, expanding its application range.
Abstract
We have applied tensor decomposition (TD)-based unsupervised feature extraction (FE) to integrate multiple Visium datasets, as a platform for spatial gene expression profiling (spatial transcriptomics). As a result, TD-based unsupervised FE successfully obtains singular value vectors consistent with the spatial distribution; that is, similar values are assigned to neighboring spots. Furthermore, TD-based unsupervised FE successfully infers the cell-type fractions within individual Visium spots (i.e., successful deconvolution) by referencing single-cell RNA-seq experiments that include multiple minor cell types, a setting in which other conventional methods--RCTD, SPOTlight, SpaCET, and cell2location--fail. Therefore, TD-based unsupervised FE can perform deconvolution even when the inclusion of multiple minor cell types in the reference profiles causes conventional methods to fail, although it is not positioned to replace them in typical cases. TD-based unsupervised FE is thus expected to be applied to a wide range of deconvolution applications.
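A conceptual sketch of the core TD operation: unfold a spots x genes x datasets tensor along the gene mode and take an SVD to obtain singular value vectors. The shapes and the Tucker-1-style simplification are assumptions for illustration, not the authors' exact decomposition.

```python
# Conceptual sketch: gene-mode unfolding + SVD of a spots x genes x datasets
# tensor; spot-wise vectors follow by projection onto the gene loadings.
import numpy as np

X = np.random.rand(500, 2000, 4)                  # spots x genes x Visium datasets
X_genes = X.transpose(1, 0, 2).reshape(2000, -1)  # mode-2 (gene) unfolding
U, s, Vt = np.linalg.svd(X_genes, full_matrices=False)
gene_svv = U[:, :10]                              # top gene singular value vectors
spot_svv = np.einsum('sgd,gk->skd', X, gene_svv)  # spot vectors per dataset
# Spot vectors whose values are similar on neighboring spots indicate
# spatially coherent components, as the abstract describes.
```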
bioinformatics · 2025-11-14 · v4
S3R: Modeling spatially varying associations with Spatially Smooth Sparse Regression
Zhou, X.; Dang, P.; Wang, X.; Peng, L. X.; Yeh, J. J.; Sears, R. C.; Zhang, N.; Neelon, B.; Zimmers, T.; Zhang, C.; Cao, S.
AI Summary
- S3R is a framework for modeling spatially varying associations in spatial transcriptomics data, using structured sparsity and a smoothness penalty guided by a minimum spanning tree.
- It efficiently handles large datasets with multi-GPU training and parallel hyperparameter search, accurately recovering spatial effects in synthetic data.
- Applied to real datasets, S3R revealed layer-specific associations in human brain tissue, spatial gradients in skin infection, and cross-cell type interactions in pancreatic cancer, enhancing understanding of spatial biological processes.
Abstract
Spatial transcriptomics (ST) data demands models that recover how associations among molecular and cellular features change across tissue while contending with noise, collinearity, cell mixing, and thousands of predictors. We present Spatially Smooth Sparse Regression (S3R), a general framework that estimates location-specific coefficients linking a response feature to high-dimensional spatial predictors. S3R unites structured sparsity with a minimum spanning tree-guided smoothness penalty, yielding coefficient fields that are coherent within neighborhoods yet permit sharp boundaries. S3R enables large-scale data analysis with an efficient implementation using a reduced MST graph, multi-GPU training, and parallel hyperparameter search. In synthetic data, S3R accurately recovers spatially varying effects, selects relevant predictors, and preserves known boundaries. Applied to human dorsolateral prefrontal cortex, S3R recapitulates layer-specific target-TF associations with concordant layer-wise correlations in matched single-cell data. In acute Haemophilus ducreyi skin infection, S3R converts spot-level mixtures into cell type-attributed expression fields and reveals spatial gradients; applying SVG tests to these fields increases concordance and recovers gradients missed by spot-level methods. In pancreatic ductal adenocarcinoma, S3R constructs cross-cell-type, cross-gene co-variation tensors that prioritize interactions among cell types, with interacting genes enriching pathways consistent with known biology. Because responses and predictors in S3R are user-defined, it could flexibly address diverse ST questions within a single, scalable, and interpretable regression framework.
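A simplified stand-in for the MST-guided penalty: squared coefficient differences along minimum-spanning-tree edges plus an L1 term. S3R's actual objective, reduced-graph construction, and optimization are more involved than this.

```python
# Sketch of an MST-guided smoothness penalty over per-spot coefficients.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

coords = np.random.rand(300, 2)                  # spot locations
mst = minimum_spanning_tree(squareform(pdist(coords))).tocoo()
edges = np.stack([mst.row, mst.col], axis=1)     # tree edges over spots

beta = np.random.randn(300, 20)                  # per-spot coefficient vectors

def penalty(beta, edges, lam_smooth=1.0, lam_sparse=0.1):
    smooth = ((beta[edges[:, 0]] - beta[edges[:, 1]]) ** 2).sum()  # MST smoothness
    sparse = np.abs(beta).sum()                  # structured-sparsity surrogate
    return lam_smooth * smooth + lam_sparse * sparse

print(penalty(beta, edges))
```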
bioinformatics · 2025-11-14 · v2
WITHDRAWN: Genome-wide association study reveals novel resistance loci to banana weevil (Cosmopolites sordidus) in Rwandan Musa cultivars
Erastus, D.; Edema, R.; Ebapu, D.; Nduwumuremyi, A.; Mukamuhirwa, A.
AI Summary
- The manuscript on a genome-wide association study for resistance loci to banana weevil in Rwandan Musa cultivars was withdrawn due to identified data issues.
- The authors no longer support the conclusions and request that the work not be cited.
Abstract
The authors have withdrawn this manuscript because they have identified issues with the data and can no longer stand by the conclusions. The authors agree that the original version remains online, labelled as withdrawn, to preserve the scholarly record. Therefore, the authors do not wish this work to be cited as a reference for the project. If you have any questions, please contact the corresponding author.
bioinformatics · 2025-11-14 · v2
Nuclear Irregularity as a Universal Diagnostic Tool in Solid Tumors
Hamilton, F.; Foster, K.
AI Summary
- This study investigates nuclear irregularity as a diagnostic tool in solid tumors, focusing on breast cancer using imaging mass cytometry (IMC).
- It differentiates cancerous from non-cancerous nuclei with a p-value of 1.02e-06, achieving 78% accuracy and a 72% F1 score through a machine learning approach.
- The method allows for cross-cohort and cross-cancer comparisons, potentially informing prognosis, treatment, and monitoring of therapeutic response.
Abstract
As tumors develop, cancer cells accumulate diverse genomic and phenotypic alterations to meet heightened demands for energy production and biosynthesis. Loss of lamina function and perturbations in energy production are associated with pronounced aberrations in cellular morphology, particularly within nuclear architecture and the plasma membrane. Systematic analysis of nuclear morphology can reveal conserved structures across diverse cancer types, enabling disease state stratification, biomarker discovery, and potential avenues for personalizing therapy to minimize recurrence risk. To this end, this study analyzes an imaging mass cytometry (IMC) breast cancer dataset, differentiating cancerous and non-cancerous nuclei with a p-value of 1.02e-06. In addition, this study achieves an accuracy of 78% and an F1 score of 72% using a computational and machine learning-based pipeline for analyzing the morphological heterogeneity of nuclei and protein expression, enabling characterization of patient-specific tumor phenotypes. Unlike traditional morphology analysis pipelines limited to specific imaging platforms, this workflow enables cross-cohort and cross-cancer comparison, capturing tumor-specific phenotypic deviations at single-cell resolution. The resulting phenotypic profiles could inform prognosis, treatment, and monitoring of therapeutic response.
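A small sketch of nucleus-irregularity features computable from a binary segmentation with scikit-image; the study's IMC pipeline and machine learning model go well beyond this.

```python
# Sketch: per-nucleus shape-irregularity features from a binary mask.
import numpy as np
from skimage.measure import label, regionprops

def irregularity_features(nuc_mask):
    """nuc_mask: binary segmentation of nuclei in one image."""
    feats = []
    for r in regionprops(label(nuc_mask)):
        circularity = 4 * np.pi * r.area / (r.perimeter ** 2 + 1e-9)
        feats.append({
            "area": r.area,
            "solidity": r.solidity,        # convexity deficit flags lobulation
            "eccentricity": r.eccentricity,
            "circularity": circularity,    # ~1.0 for a circle, lower if irregular
        })
    return feats

mask = np.zeros((64, 64), dtype=bool)
mask[10:30, 10:30] = True                  # a square "nucleus" for illustration
print(irregularity_features(mask)[0])
```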
bioinformatics · 2025-11-14 · v2
WITHDRAWN: Comprehensive Error Profiling of NovaSeq6000, NovaSeqX, and Salus Pro Using Overlapping Paired-End Reads
Yao, T.
AI Summary
- The manuscript titled "Comprehensive Error Profiling of NovaSeq6000, NovaSeqX, and Salus Pro Using Overlapping Paired-End Reads" was withdrawn.
- The withdrawal was due to disagreement among the authors regarding publication.
Abstract
The authors have withdrawn this manuscript because one or more of the authors did not agree with publishing this manuscript. Therefore, the authors do not wish this work to be cited as a reference for the project. If you have any questions, please contact the corresponding author.
bioinformatics · 2025-11-14 · v2
SAM-based Automatic Workflow for Histology Cyst Segmentation in Autosomal Dominant Polycystic Kidney Disease
Delgado-Rodriguez, P.; Kinakh, R.; Aldabe, R.; Munoz-Barrutia, A.
AI Summary
- The study introduces an automated workflow using the Segment Anything Model (SAM) for segmenting cysts in histological images of kidneys affected by Autosomal Dominant Polycystic Kidney Disease (ADPKD).
- This method requires no manual annotations or training, providing precise quantification of cyst progression over time in mice from 8 to 16 weeks.
- The workflow outperforms the existing Cystanalyser tool in accuracy and flexibility, and is publicly available for histological research.
Abstract
Autosomal Dominant Polycystic Kidney Disease (ADPKD) is a genetic disorder characterized by the development of numerous cysts in the kidneys, ultimately leading to significant structural alterations and renal failure. Detailed investigations of this disease frequently utilize histological analyses of kidney sections across various stages of ADPKD progression. In this paper, we introduce an automated workflow leveraging the Segment Anything Model (SAM) neural network, complemented by a series of post-processing steps, to autonomously segment cysts in histological images. This approach eliminates the need for manual annotations or preliminary training phases and enables precise quantification of cystic changes over entire kidney sections. Application of this method to sequential histology images across the development timeline of ADPKD in mice demonstrated a notable increase in the proportion of diseased tissue from 8 to 12 weeks and from 12 to 16 weeks, with the cysts appearing progressively lighter. Our workflow not only surpasses the performance of the existing Cystanalyser tool but also offers enhanced flexibility and accuracy in full-image segmentation. The developed workflow is made publicly accessible to facilitate its adoption as an efficient tool for rapid and reliable cyst segmentation in histological studies.
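A sketch using the public segment-anything API for automatic mask generation, followed by a hypothetical area filter; the checkpoint path and size bounds are placeholders, and the authors' post-processing is more elaborate.

```python
# Sketch: SAM automatic mask generation + a simple (assumed) size filter to
# keep cyst-like regions and compute a cystic-tissue fraction.
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # path is illustrative
mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("kidney_section.png"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)   # list of dicts with 'segmentation', 'area', ...

min_area, max_area = 200, 50_000         # assumed cyst size bounds (pixels)
cysts = [m for m in masks if min_area < m["area"] < max_area]
cystic_fraction = sum(m["area"] for m in cysts) / (image.shape[0] * image.shape[1])
print(f"cystic tissue fraction: {cystic_fraction:.3f}")
```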
bioinformatics · 2025-11-14 · v2
Benchmarking Data Leakage on Link Prediction in Biomedical Knowledge Graph Embeddings
Briere, G.; Stosskopf, T.; Loire, B.; Baudot, A.
AI Summary
- This study investigates data leakage in Knowledge Graph Embedding (KGE) models for link prediction in biomedical knowledge graphs, focusing on inadequate train-test separation.
- A systematic procedure was implemented to control train-test separation, showing its impact on model performance.
- Evaluations on real-world drug repurposing tasks revealed significantly lower performance compared to standard KG-sampled tasks, with no evidence of models using node degree as an illegitimate feature.
Abstract
In recent years, Knowledge Graphs (KGs) have gained significant attention for their ability to organize complex biomedical knowledge into entities and relationships. Knowledge Graph Embedding (KGE) models facilitate efficient exploration of KGs by learning compact data representations. These models are increasingly applied to biomedical KGs for link prediction, for instance to uncover new therapeutic uses for existing drugs. While numerous KGE models have been developed and benchmarked for link prediction, existing evaluations often overlook the critical issue of data leakage. Data leakage leads the model to learn patterns it would not encounter when deployed in real-world settings, artificially inflating performance metrics and compromising the overall validity of benchmark results. In machine learning, data leakage can arise when (1) there is inadequate separation between training and test sets, (2) the model leverages illegitimate features, or (3) the test set does not accurately reflect real-world inference scenarios. In this study, we implement a systematic procedure to control train-test separation for KGE-based link prediction and demonstrate its impact on models' performance. In addition, through permutation experiments, we investigate the potential use of node degree as an illegitimate predictive feature, finding no evidence of such leveraging. Finally, by evaluating KGE models on a curated dataset of rare disease drug indications, we demonstrate that performance metrics achieved on real-world drug repurposing tasks are substantially worse than those obtained on drug-disease indications sampled from the KG.
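A minimal sketch of one train-test separation control of the kind discussed above: dropping test triples that duplicate training triples or whose inverse already appears in training. The inverse-relation map is an illustrative assumption.

```python
# Sketch of a leakage filter for KG link-prediction splits.
def remove_inverse_leakage(train, test, inverse_of):
    """train/test: sets of (head, relation, tail) triples."""
    leaked = {
        (h, r, t) for (h, r, t) in test
        if (h, r, t) in train or (t, inverse_of.get(r, r), h) in train
    }
    return test - leaked

train = {("drugA", "treats", "disease1")}
test = {("disease1", "treated_by", "drugA"), ("drugB", "treats", "disease2")}
print(remove_inverse_leakage(train, test, {"treated_by": "treats"}))
# {('drugB', 'treats', 'disease2')}
```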
bioinformatics · 2025-11-14 · v2
Compound-specific DNA adduct profiling with nanopore sequencing and IonStats
Koski, Y.; Patel, D.; Kakko von Koch, N.; Jouhten, P.; Aaltonen, L.; Palin, K.; Sahu, B.; Pitkanen, E.
AI Summary
- This study developed IonStats, a statistical toolkit for profiling DNA adducts using nanopore sequencing.
- The approach was used to examine the effects of four genotoxic compounds, revealing both common and specific alterations in sequencing metrics.
- Notably, aristolochic acid II and melphalan significantly impacted nanopore readouts, demonstrating the potential for high-resolution DNA damage profiling.
Abstract
Covalently bound DNA adducts are mutation precursors that contribute to aging and diseases such as cancer. Accurate detection of adducts in the genome will shed light on tumorigenesis. Commonly used detection methods are unable to pinpoint the exact genomic locations of adducts. Long-read nanopore sequencing has the potential to accurately detect multiple types of DNA adducts at single-nucleotide precision. In this study, we developed a novel statistical toolkit, IonStats, to profile DNA adducts in nanopore sequencing data. With IonStats, we investigated the effects of four adduct-inducing genotoxic compounds on nanopore sequencing, and found both shared and compound-specific perturbations in base quality scores, ionic current profiles, and translocation dynamics. Notably, aristolochic acid II and melphalan treatments profoundly altered nanopore readouts and led to substantial sequence-specific read interruptions. Our study shows that nanopore sequencing can be effectively employed to detect and characterize DNA adducts, paving the way for high-resolution, high-throughput profiling of DNA damage and the exposome.
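A sketch of a per-context two-sample comparison of nanopore signal between treated and control reads, assuming a dictionary mapping k-mer contexts to ionic-current samples; IonStats itself covers more readouts and statistics.

```python
# Sketch (not IonStats): flag k-mer contexts whose ionic-current samples
# differ between treated and control reads by a two-sample KS test.
from scipy.stats import ks_2samp

def flag_perturbed_kmers(treated, control, alpha=1e-6, min_obs=30):
    """treated/control: dict mapping k-mer context -> list of current means."""
    hits = {}
    for kmer, x in treated.items():
        y = control.get(kmer)
        if y is None or len(x) < min_obs or len(y) < min_obs:
            continue  # skip sparsely observed contexts
        stat, p = ks_2samp(x, y)
        if p < alpha:
            hits[kmer] = (stat, p)
    return hits
```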
bioinformatics · 2025-11-14 · v2
CryoSiam: self-supervised representation learning for automated analysis of cryo-electron tomograms
Stojanovska, F.; Sanchez, R. M.; Jensen, R. K.; Mahamid, J.; Kreshuk, A.; Zaugg, J. B.
AI Summary
- CryoSiam is a self-supervised learning framework for cryo-electron tomography (cryo-ET) that learns hierarchical representations from both voxel and subtomogram levels.
- It was trained using CryoETSim, a synthetic dataset simulating various experimental conditions.
- CryoSiam models can directly analyze experimental data, enhancing tomogram denoising, segmentation, and macromolecular detection in both prokaryotic and eukaryotic systems.
Abstract
Cryo-electron tomography (cryo-ET) enables visualization of macromolecular complexes in their native cellular context, but interpretation remains challenging due to high noise levels, missing information, and lack of ground-truth data. Here, we present CryoSiam (CRYO-electron tomography SIAMese networks), an open-source framework for self-supervised representation learning in cryo-ET. CryoSiam learns hierarchical representations of tomographic data spanning both voxel-level and subtomogram-level information. To train CryoSiam, we generated CryoETSim (CRYO-Electron Tomography SIMulated), a synthetic dataset that systematically models defocus variation, sample thickness, and molecular crowding. CryoSiam-trained models transfer directly to experimental data without fine-tuning and support key aspects of cryo-ET data analysis, including tomogram denoising, segmentation of subcellular structures, and macromolecular detection and identification across both prokaryotic and eukaryotic systems. Publicly available pretrained models and the CryoETSim dataset provide a foundation for scalable and automated cryo-ET analysis.
bioinformatics · 2025-11-14 · v2
Single-cell transcriptomics reveals spaceflight-induced accelerated aging and impaired recovery in the murine bone marrow
Zhao, Z.; Zhang, J.; Ji, G.; Zhou, Z.; Yang, Y.; Sun, F.; Lu, H.
AI Summary
- This study used single-cell transcriptomics to examine how spaceflight affects bone marrow in young and old mice, comparing pre- and post-flight conditions.
- Old mice showed persistent dysregulation post-flight, particularly in erythroid and B cell lineages, with signs of accelerated aging like impaired maturation and increased oxidative stress in erythroid cells.
- B cells in old flight mice exhibited dysregulation in stress-response pathways and signaling networks, indicating spaceflight disproportionately impacts aged hematopoietic systems and recovery.
Abstract
Spaceflight induces physiological changes that resemble accelerated aging; however, how age influences the bone marrow response at the cellular level remains poorly understood. Here, we perform single-cell transcriptomic profiling of murine femur and humerus bone marrow from young (12-week-old) and old (29-week-old) mice that underwent a 32-day spaceflight mission followed by a 24-day Earth recovery, with age-matched ground controls. Our analysis reveals that, compared with young cohorts, old mice exhibit persistent dysregulation after spaceflight, most prominently in erythroid and B cell lineages. In erythroid cells, old flight mice show pronounced aging signatures, characterized by impaired maturation, inhibited mitophagy, and increased oxidative stress. In B cells, old flight mice show dysregulation associated with failure of the AP-1 stress-response pathway and complete collapse of the intercellular CXCL signaling network. Our findings dissect the age-dependent effects of spaceflight on the bone marrow hematopoietic and immune system at single-cell resolution, and demonstrate that spaceflight imposes a disproportionate burden on the aged hematopoietic system and blunts post-flight recovery. These insights provide candidate pathways and biomarkers for health monitoring and countermeasures in long-duration missions.
bioinformatics · 2025-11-14 · v2
Covary: A translation-aware framework for alignment-free phylogenetics using machine learning
De los Santos, M.
AI Summary
- Covary is a machine learning framework for alignment-free phylogenetic analysis that incorporates translation awareness by encoding codon-boundary and positional information into vector representations.
- It accurately clusters sequences, identifies species, and reconstructs phylogenetic trees across various datasets, including human TP53 variants and SARS-CoV-2 genomes.
- The framework's performance is comparable to traditional alignment-based methods, with near-linear scalability demonstrated by analyzing nearly a thousand SARS-CoV-2 genomes within minutes.
Abstract
In large-scale phylogenetic analysis, incorporating translation awareness is critical to account for the genotypic and phenotypic dimensions underlying biological diversification. Covary is a machine learning-based framework that analyzes, clusters, and compares genetic sequences through alignment-free, translation-aware embeddings. By integrating codon-boundary and intra-sequence positional information into a unified vector representation, Covary encodes mutational patterns alongside translation-level constraints. This design enables discrimination of frameshift-inducing mutations, substitutions, and other biologically meaningful sequence variations relevant to evolutionary relationships. Despite inherent sensitivity to k-mer-based distortions, Covary accurately clustered sequences, identified species, and reconstructed phylogenetic trees across diverse datasets, including human TP53 variants, ribosomal gene markers (18S and 16S), and complete genomes from viral, bacterial, and archaeal taxa. The resulting topologies were comparable to those produced by multiple sequence alignment (MSA)-based implementations like ETE3, with near-linear scalability demonstrated by the successful analysis of nearly a thousand SARS-CoV-2 genomes within minutes. The versatility and interpretability of Covary across mutation-, gene-, and genome-level analyses underscore its potential as a biologically informed, data-driven tool for bioinformatics, comparative genomics, taxonomy, ecology, and evolutionary studies. Covary is available online at https://github.com/mahvin92/Covary or at https://covary.chordexbio.com.
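A simplified illustration of a translation-aware, alignment-free encoding: counting k-mers in codon frame so that frameshifts perturb the vector far more than in-frame substitutions. This is a toy version, not Covary's actual embedding.

```python
# Toy codon-frame embedding: count codons stepping by 3, then L2-normalize.
from itertools import product
import numpy as np

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]
IDX = {c: i for i, c in enumerate(CODONS)}

def codon_embedding(seq: str) -> np.ndarray:
    vec = np.zeros(len(CODONS))
    for i in range(0, len(seq) - 2, 3):        # step by codon, not by base
        codon = seq[i:i + 3]
        if codon in IDX:
            vec[IDX[codon]] += 1
    n = np.linalg.norm(vec)
    return vec / n if n else vec

a = codon_embedding("ATGGCCATTGTAATG")
b = codon_embedding("ATGGCCATTGTTATG")        # one in-frame substitution
print(float(a @ b))                            # cosine similarity stays high
```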
bioinformatics · 2025-11-14 · v1
Single-cell transcriptomics reveals spaceflight-induced accelerated aging and impaired recovery in the murine bone marrow
Zhao, Z.; Zhang, J.; Ji, G.; Zhou, Z.; Yang, Y.; Sun, F.; Lv, H.
AI Summary
- This study used single-cell transcriptomics to investigate how spaceflight affects bone marrow in young and old mice, comparing them before, during, and after a 32-day mission.
- Old mice showed persistent dysregulation post-flight, particularly in erythroid and B cell lineages, with signs of accelerated aging like impaired maturation and increased oxidative stress in erythroid cells.
- B cells in old flight mice exhibited dysregulation in the AP-1 stress-response pathway and a collapse of the CXCL signaling network, indicating age-dependent impacts on recovery and immune function.
Abstract
Spaceflight induces physiological changes that resemble accelerated aging; however, how age influences the bone marrow response at the cellular level remains poorly understood. Here, we perform single-cell transcriptomic profiling of murine femur and humerus bone marrow from young (12-week-old) and old (29-week-old) mice that underwent a 32-day spaceflight mission followed by a 24-day Earth recovery, with age-matched ground controls. Our analysis reveals that, compared with young cohorts, old mice exhibit persistent dysregulation after spaceflight, most prominently in erythroid and B cell lineages. In erythroid cells, old flight mice show pronounced aging signatures, characterized by impaired maturation, inhibited mitophagy, and increased oxidative stress. In B cells, old flight mice show dysregulation associated with failure of the AP-1 stress-response pathway and complete collapse of the intercellular CXCL signaling network. Our findings dissect the age-dependent effects of spaceflight on the bone marrow hematopoietic and immune system at single-cell resolution, and demonstrate that spaceflight imposes a disproportionate burden on the aged hematopoietic system and blunts post-flight recovery. These insights provide candidate pathways and biomarkers for health monitoring and countermeasures in long-duration missions.
bioinformatics · 2025-11-14 · v1
CONCLAVE: CONsensus CLustering with Annotation-Validation Extrapolation for cyclic multiplexed immunofluorescence data
Nazari, P.; Arnould, A.; Andhari, M. D.; Fontecha, M.; Hernandez, J. M.; De Moor, B.; Pey, J.; De Smet, F.; Bosisio, F. M.; Antoranz, A.
AI Summary
- CONCLAVE is a consensus-clustering workflow developed to improve cell annotation in cyclic multiplexed immunofluorescence (cMIF) data by integrating results from multiple clustering algorithms.
- It was tested through in-silico simulations and real-world datasets, showing superior accuracy, reproducibility, and robustness compared to single-method approaches.
- CONCLAVE also includes a scoring module for identifying unreliable data regions, enhancing quality control in spatial proteomics analyses.
Abstract
High-dimensional cyclic multiplexed immunofluorescence (cMIF) enables single-cell phenotyping within intact tissues. Cell annotations rely on a multi-step pipeline involving normalization, sampling, dimensionality reduction, and clustering, but the absence of standardized benchmarks for method selection, especially at the clustering stage, leads to inconsistent and less reproducible phenotyping. To address this, we developed CONCLAVE, a consensus-clustering-based workflow that optimizes upstream steps and integrates results from multiple clustering algorithms, retaining only those cell labels supported by at least two independent methods. Through in-silico simulations and real-world cMIF datasets, CONCLAVE consistently outperformed single-clustering-method approaches in accuracy, reproducibility, and robustness, with improvements becoming more evident when mapped within spatial tissue contexts. Additionally, CONCLAVE includes a scoring module that flags regions likely to contain unreliable or inconsistent data, facilitating targeted quality control. In summary, CONCLAVE offers a robust framework for cell annotation in cMIF datasets, enhancing the reliability of downstream spatial proteomics analyses.
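A sketch of the ">= 2 methods agree" consensus rule: align each clustering to a reference with the Hungarian algorithm, then keep only cells whose label is supported by at least two methods. A simplification of CONCLAVE's full workflow.

```python
# Sketch of consensus over multiple clusterings with label alignment.
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_labels(ref, other, k):
    overlap = np.zeros((k, k))
    for r, o in zip(ref, other):
        overlap[r, o] += 1
    _, col = linear_sum_assignment(-overlap)     # maximize label overlap
    remap = {int(o): r for r, o in enumerate(col)}
    return np.array([remap[int(o)] for o in other])

def consensus(labelings, k, min_support=2):
    ref = labelings[0]
    aligned = np.stack([ref] + [align_labels(ref, l, k) for l in labelings[1:]])
    out = np.full(ref.shape, -1)                 # -1 = cell left unassigned
    for c in range(ref.shape[0]):
        vals, counts = np.unique(aligned[:, c], return_counts=True)
        if counts.max() >= min_support:
            out[c] = vals[counts.argmax()]
    return out

labelings = [np.random.randint(0, 3, 500) for _ in range(3)]
print((consensus(labelings, k=3) >= 0).mean())   # fraction of cells kept
```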
bioinformatics · 2025-11-14 · v1
Unified modeling of cellular responses to diverse perturbation types
Li, C.; Wei, L.; Zhang, X.
AI Summary
- The study introduces X-Pert, a model designed to predict cellular responses to various perturbations by capturing gene-perturbation interactions and gene-gene dependencies through attention mechanisms.
- X-Pert handles diverse perturbation types, combinations, and their dosage/efficacy effects within a unified representation space, improving prediction accuracy for unseen and combinatorial responses.
- Benchmark tests showed X-Pert's superior performance across genetic and chemical perturbations, facilitating applications in drug discovery and biological research.
Abstract
Predicting cellular responses to perturbations is essential for understanding gene regulation and advancing drug development. Most existing in silico perturbation models treat perturbation-cell interactions with simplistic fusion strategies that overlook the hierarchical nature of regulatory processes, leading to inaccurate predictions and poor generalization across perturbation types. We present X-Pert, a universal in silico perturbation model that explicitly captures both gene-perturbation interactions and gene-gene dependencies through attention mechanisms. X-Pert flexibly handles diverse perturbation inputs across different types and combinations, and quantitatively models their dosage- and efficacy-dependent effects, all within a unified representation space. This enables accurate prediction of unseen, combinatorial, and dose- or efficacy-dependent responses of various perturbation types. Across benchmarks spanning genetic and chemical perturbations, X-Pert consistently demonstrates superior performance at both gene and pathway levels. Moreover, its unified representation space enables downstream analyses such as perturbation retrieval and drug-gene association discovery. By integrating data across perturbation types, experimental platforms, and cell contexts, X-Pert establishes a versatile and generalizable foundation for in silico perturbation, enabling broad applications in biological and therapeutic discovery.
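A minimal sketch of the attention idea described above: gene tokens attending to perturbation tokens via cross-attention. Dimensions, layout, and the response head are assumptions; X-Pert's full architecture is far larger.

```python
# Sketch: gene tokens read from perturbation tokens via cross-attention.
import torch
import torch.nn as nn

dim, n_genes, n_pert = 64, 2000, 2
attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)

gene_tokens = torch.randn(1, n_genes, dim)   # one cell's gene embeddings
pert_tokens = torch.randn(1, n_pert, dim)    # e.g., a two-gene combinatorial KO

# Each gene representation is updated by what it "reads" from the perturbation.
updated, _ = attn(query=gene_tokens, key=pert_tokens, value=pert_tokens)
delta_expr = nn.Linear(dim, 1)(updated).squeeze(-1)  # predicted per-gene response
print(delta_expr.shape)                      # torch.Size([1, 2000])
```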
bioinformatics · 2025-11-14 · v1
Time-resolved phylogenomics analysis reveals patterns in biosphere nutrient limitation through Earth history
Ni, Z.; Osborn, T.; Zhong, J.; Gonzalez, A.; Puzella, W.; Klos, A.; Le, B.; Leonetti, A.; Boden, J. S.; Stueeken, E. E.; Anderson, R. E.
AI Summary
- This study uses phylogenomics to trace the evolution of genes related to nutrient limitation over Earth's history.
- Findings indicate that genes for phosphorus limitation appeared early, nitrogen limitation genes emerged near the Great Oxidation Event, and siderophores for iron uptake might date back to the Archean.
- These results highlight how nutrient availability has influenced the biosphere's scale over 4 billion years.
Abstract
The co-evolution of life and Earth has profoundly transformed global biogeochemical cycles over the past 3.5 billion years. These cycles, in turn, have dictated the availability of essential nutrients like phosphorus, nitrogen, and iron, thereby affecting primary productivity and the scale of the Earth's biosphere. Despite the critical role of nutrient limitation in shaping the size and scope of the biosphere, significant uncertainties persist about which nutrients were globally limiting at various points in Earth history. Here, we use a phylogenomic approach to trace the origin and spread of genes associated with nutrient limitation over time. We show that genes associated with phosphorus limitation emerged relatively early in life's history, whereas genes associated with nitrogen limitation emerged later, closer to the Great Oxidation Event. In terms of iron limitation, we present novel evidence that siderophores, compounds that facilitate iron uptake, may have arisen as early as the Archean. Overall, our results have important implications for understanding how the geosphere has influenced the scale and extent of life on Earth for the past 4 billion years.
bioinformatics · 2025-11-14 · v1
Discovery and optimization of antimicrobial peptides from extreme environments on global scale
Kang, Z.; Zhang, H.; Zhou, Q.; Liu, J.; Zhou, K.; Chen, P.; Liu, B.-F.; Ning, K.
AI Summary
- The study aimed to discover and optimize antimicrobial peptides (AMPs) from extreme environments to combat antimicrobial resistance.
- A deep-learning framework, SEGMA, was developed to mine AMPs from 60,461 extremophile metagenome-assembled genomes globally.
- The approach identified 3,298 novel AMPs, termed extremocins, highlighting the untapped potential of extremophiles for antibiotic discovery.
Abstract
Novel antibiotics to combat global antimicrobial resistance (AMR) in human and animal pathogens are urgently required. Antimicrobial peptides (AMPs) are a class of small molecules inhibiting growth of various microorganisms, including both Gram-negative and Gram-positive bacteria, fungi and viruses. These peptides are valued for their broad-spectrum antimicrobial activity, achieved through mechanisms such as bacterial membrane disruption or interference with intracellular processes rather than targeting specific proteins, which results in a lower propensity to induce resistance. While AMPs have been extensively identified and verified from animal proteomes, reference microbial genomes and host environments, those from extreme environments remain unexplored. Extreme environments, such as deep-sea hydrothermal vents, glaciers, the polar regions, plateaus or hot springs, are widely distributed globally and exhibit steep environmental gradients, harsh physiochemical conditions and limited nutrient availability. Microbes in such niches evolve unique membrane modifications, specialized metabolic pathways and other strategies to cope with the extreme stresses. Natural geographic isolation and competitive pressures driven by extreme environments make extremophiles an ideal reservoir for mining novel antibiotics. Archaea, which possess vast antimicrobial potential through the production of unique compounds such as archaeasins that exhibit activity against a range of drug-resistant bacteria, often dominate these habitats. Although recent studies have employed deep learning models to mine archaeal proteomes, the antimicrobial potential of uncultured archaea and other extremophiles on a global scale remains largely untapped. Many computational approaches have been developed for in-silico screening of AMPs. Traditional approaches encode sequences with physicochemical and biochemical properties, whereas some recent methods leverage protein language models (pLMs) to generate diverse representations. These encoding strategies are typically combined with deep learning models, such as Long Short-Term Memory (LSTM) networks or attention-based models, to predict peptide antimicrobial activities. Beyond identification, generative models have been developed to design novel AMPs de novo, and optimization can be achieved through genetic algorithms or reinforcement learning (RL). However, existing approaches often overlook structural information of peptides and frequently require training complex decision models for virtual evolution. To overcome the bottleneck of traditional data mining sources and limitations in peptide screening and optimization, here we introduce structure-aware extremophile genome mining for antimicrobial peptides (SEGMA), a deep-learning based framework used to systematically mine all AMPs from extreme environments on the global scale, herein referred to as extremocins. By leveraging this computational pipeline, we identified 3,298 extremocins from 60,461 extremophile metagenome-assembled genomes (MAGs).
bioinformatics · 2025-11-14 · v1
A spatially-aware unsupervised pipeline to identify co-methylation regions in DNA methylation data
Meshram, S.; Fadikar, A.; Arunkumar, G.; Chatterjee, S.
AI Summary
- This study introduces SACOMA, a spatially-aware unsupervised framework to identify co-methylated regions in DNA methylation data by clustering based on spatial proximity and methylation similarity.
- SACOMA uses a data-adaptive mixing parameter to avoid rigid assumptions, enhancing its robustness.
- Simulations and real data analyses showed SACOMA's superior sensitivity, effective false-positive control, and ability to identify biologically relevant co-regulated regions, improving reproducibility and biological inference.
Abstract
DNA methylation (DNAm) plays a central role in modern epigenetic research; however, the high dimensionality of DNAm data comprising hundreds of thousands of spatially ordered probes continues to present major analytical challenges. The multiple testing burden in these data introduces redundancy and reduces statistical power, contributing to the limited reproducibility often observed in association studies. Moreover, DNAm probes frequently exhibit correlated methylation patterns with neighboring sites, reflecting underlying biological co-regulation and spatial dependence along the genome. Ignoring these spatial correlations can bias parameter and standard error estimates, inflate type I error rates, and obscure biologically meaningful effects. Existing methods for detecting methylation co-regulation and reducing DNAm data dimensions typically rely on fixed distance or correlation thresholds and arbitrary hyperparameter settings that lack data adaptivity. In this study, we introduce SACOMA (Spatially-Aware Clustering for Co-Methylation Analysis), a flexible, data-driven, and unsupervised framework designed to identify co-methylated regions, i.e., genomic regions where adjacent sites show correlated methylation levels. SACOMA employs spatially constrained hierarchical clustering to group neighboring DNAm sites based on both spatial proximity and methylation similarity. A tunable, data-adaptive mixing parameter allows SACOMA to avoid rigid assumptions and remain robust to hyperparameter choices. Although developed for DNAm array data, SACOMA provides a generalizable framework applicable to any data exhibiting spatial dependence, enabling the identification of spatially correlated features across diverse domains. Through extensive simulations, SACOMA demonstrated superior sensitivity while maintaining effective false-positive control compared to existing methods. In population-level DNAm data analyses, SACOMA successfully identified biologically relevant co-regulated methylation regions with functional roles. Overall, SACOMA reduces the multiple-testing burden and enhances both the discovery and specificity of statistical associations, leading to improved reproducibility and more reliable biological inference.
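A sketch of the mixing idea: blend a genomic-distance matrix and a correlation-based dissimilarity with a parameter alpha, then run connectivity-constrained agglomerative clustering so only adjacent probes may merge. A simplification of SACOMA; alpha here is fixed rather than data-adaptive.

```python
# Sketch: spatially constrained co-methylation clustering with a fixed alpha.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def comethylation_clusters(beta, pos, alpha=0.5, n_clusters=20):
    """beta: probes x samples methylation; pos: sorted genomic coordinates."""
    d_corr = 1 - np.corrcoef(beta)                 # methylation dissimilarity
    d_pos = np.abs(pos[:, None] - pos[None, :])
    d_pos = d_pos / d_pos.max()
    D = alpha * d_pos + (1 - alpha) * d_corr       # blended dissimilarity
    n = len(pos)
    conn = np.eye(n, k=1) + np.eye(n, k=-1)        # only adjacent probes may merge
    model = AgglomerativeClustering(
        n_clusters=n_clusters, metric="precomputed",  # sklearn >= 1.2 (else affinity=)
        linkage="average", connectivity=conn)
    return model.fit_predict(D)

beta = np.random.rand(200, 50)
pos = np.sort(np.random.choice(10**6, size=200, replace=False)).astype(float)
print(np.bincount(comethylation_clusters(beta, pos)))
```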
bioinformatics · 2025-11-14 · v1
Simulation and empirical evaluation of biologically-informed neural network performance
Miller, G. A.; Roman, A.; Glettig, M.; Elmarakeby, H. A.; AlDubayan, S. H.; Park, J.; Collins, R. L.; Van Allen, E.
AI Summary
- This study developed simulation frameworks to assess how factors like signal type, strength, feature sparsity, and sample size affect the performance of biologically-informed neural networks (BiNNs).
- Simulations showed that BiNN performance is hindered by small sample sizes, weak signals, and extreme feature sparsity, with a preference for linear signals.
- Empirically, integrating germline with somatic data in the P-NET model did not enhance prediction of prostate cancer metastasis but improved gene prioritization and model interpretation.
Abstract
Biologically-informed neural networks (BiNNs) offer interpretable deep learning models for biological data, but the dataset characteristics required for strong performance remain poorly understood. For instance, we previously developed P-NET, a BiNN with an architecture based on the Reactome pathway database, and applied this model to predict metastatic status of patients with prostate cancer using somatic mutation and copy number information. It seems likely that including additional relevant signal -- e.g., germline variation in this context -- should improve model performance, but we currently lack a principled approach to assess whether BiNNs will successfully detect this signal. Here, we developed two simulation frameworks to evaluate the factors that influence BiNN performance -- including signal type, signal strength, feature sparsity, and sample size -- and empirically tested how integrating germline and somatic data affects the model's ability to predict prostate cancer metastatic status. Simulations revealed that small sample size, weak signal strength, and especially extreme feature sparsity limit BiNN performance, and that the model preferentially uses linear over nonlinear signal. Empirically, P-NET performed poorly on sparse germline data, and while adding germline to somatic data did not improve prediction, it improved gene prioritization and model interpretation. Broadly, our simulation frameworks enable systematic evaluation of how dataset-level characteristics affect BiNN performance and provide a principled framework for benchmarking novel methods.
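A toy version of one simulation axis described above: vary sample size for a fixed sparse linear signal and record held-out AUC. The study simulates for BiNNs; this sketch uses logistic regression purely for illustration.

```python
# Toy simulation: sparse linear signal, varying sample size, test AUC.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def simulate_auc(n, p=500, n_signal=5, strength=0.5, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))
    w = np.zeros(p)
    w[:n_signal] = strength                     # feature sparsity + signal strength
    y = rng.random(n) < 1 / (1 + np.exp(-(X @ w)))
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=seed)
    clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
    return roc_auc_score(yte, clf.predict_proba(Xte)[:, 1])

for n in (100, 500, 2000):                      # small n limits performance
    print(n, round(simulate_auc(n), 3))
```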
bioinformatics · 2025-11-14 · v1
Deciphering the global genomic landscape of C. neoformans: Population dynamics, molecular epidemiology and genomic signatures of pathogenicity
Sathiyamoorthy, J.; Ramakrishnan, J.
AI Summary
- The study analyzed 139 global C. neoformans genomes over 30 years, identifying 19 sequence types with ST5 being predominant (63%).
- Phylogenomic analysis showed no distinct clustering between clinical and environmental isolates, indicating high genome conservation.
- Key findings include the dominance of the MATα mating type (91%), clonal expansion of the VNI lineage, and conserved virulence genes with potential azole resistance mechanisms.
Abstract
The study investigates global genomic surveillance of Cryptococcus neoformans over three decades to elucidate genetic diversity, incorporating serotypes, molecular types, STs and mating types, phylogenomics, virulence-associated determinants, antifungal resistance, and pangenome profiles. Across 139 study genomes, the isolates exhibited nineteen distinct sequence types (STs), among which ST5 (63%) was the most prevalent. The predominance of the MATα mating type (91%) mirrors enhanced virulence, environmental adaptability and clonal expansion, enabling its persistence without MATa (9%) counterparts. The phylogenomic analysis revealed no distinct clustering of isolates from clinical and environmental settings, indicating a high level of genome conservation across sources. Core and accessory genome partitioning and ortholog-based clustering revealed clonal population dominance of the VNI lineage across continents, alongside clear divergence from the VNIV lineage. Virulence gene analysis highlighted conserved expression of capsular genes, the phospholipase gene, iron acquisition genes and superoxide dismutase genes, while point mutations and aneuploidy in ERG11 and AFR1 suggested potential azole-resistance mechanisms. This study delivers a holistic assessment of global genomic diversity in the C. neoformans population, emphasising evolutionary conservation, lineage-specific divergence, and adaptability to environment, host and antifungal treatments.
Keywords: Population structure, Genome diversity, Phylogenomics, Virulome, Resistome, Pangenome
bioinformatics · 2025-11-14 · v1
An Enhanced Variant-Aware Deep Learning Model for Individual Gene Expression Prediction
Zhao, X.; Su, S.
AI Summary
- The study introduces GenomicVariExpress (GVE), a deep learning model for predicting individual gene expression from whole-genome sequences, incorporating an Enhanced Variant Integration Module (EVIM) to handle genetic variations.
- GVE was evaluated using GTEx Whole Blood data, showing superior performance over existing methods, with EVIM significantly contributing to this enhancement.
- The model demonstrated improved biological interpretability and effectiveness across various tissues, particularly for genes affected by rare variants.
Abstract
Accurate prediction of gene expression from individual whole-genome sequences is critical for understanding disease mechanisms and advancing precision medicine. Current methods, however, struggle with individual-specific genetic variations and integrating detailed sequence context. To address this, we introduce GenomicVariExpress (GVE), a novel deep learning model that leverages a pre-trained sequence encoder and incorporates an Enhanced Variant Integration Module (EVIM). EVIM explicitly encodes and fuses multi-dimensional variant features, such as type, allele frequency, and predicted functional impact, enabling GVE to precisely capture how individual variations modulate gene expression. We evaluate GVE using paired whole-genome and RNA sequencing data from the GTEx Whole Blood cohort. Our experiments demonstrate GVE consistently achieves superior performance compared to state-of-the-art baselines. An ablation study confirms EVIM's critical contribution to this improved performance. Furthermore, analyses highlight GVE's enhanced biological interpretability and its superior performance across multiple tissues and for genes influenced by rare variants. GVE represents a significant step towards accurate, individual-level gene expression prediction, offering a powerful tool for genomic function research and personalized healthcare applications.
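A minimal sketch of fusing per-variant features (type, allele frequency, predicted impact) into a sequence representation at the variant positions, in the spirit of the described EVIM module; the fusion design and dimensions are assumptions, not the authors' architecture.

```python
# Sketch: add projected variant features onto encoder outputs at variant sites.
import torch
import torch.nn as nn

class VariantFusion(nn.Module):
    def __init__(self, seq_dim=256, n_var_types=6, var_dim=32):
        super().__init__()
        self.type_embed = nn.Embedding(n_var_types, var_dim)
        self.proj = nn.Linear(var_dim + 2, seq_dim)   # +2: allele freq, impact score

    def forward(self, seq_repr, var_type, var_pos, allele_freq, impact):
        """seq_repr: (batch, length, seq_dim); variant tensors: (batch, n_var)."""
        v = torch.cat([self.type_embed(var_type),
                       allele_freq.unsqueeze(-1), impact.unsqueeze(-1)], dim=-1)
        v = self.proj(v)                              # (batch, n_var, seq_dim)
        out = seq_repr.clone()
        b = torch.arange(seq_repr.size(0)).unsqueeze(-1)
        # Duplicate positions overwrite rather than accumulate in this sketch.
        out[b, var_pos] = out[b, var_pos] + v
        return out

fusion = VariantFusion()
seq = torch.randn(4, 1000, 256)        # encoder output for 4 individuals
vt = torch.randint(0, 6, (4, 7))       # 7 variants each: type codes
vp = torch.randint(0, 1000, (4, 7))    # positions within the window
af, im = torch.rand(4, 7), torch.rand(4, 7)
print(fusion(seq, vt, vp, af, im).shape)  # torch.Size([4, 1000, 256])
```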
bioinformatics · 2025-11-14 · v1
Clustering of Omic Data Using Semi-Supervised Transfer Learning for Gaussian Mixture Models via Natural-Gradient Variational Inference: Method and Applications to Bulk and Single-Cell Transcriptomics
Jia, Q.; Conti, D. V.; Goodrich, J. A.
AI Summary
- The study introduces Praxis-BGM, a semi-supervised transfer learning method for Gaussian Mixture Models using natural-gradient variational inference, designed to improve clustering of high-dimensional omic data with small sample sizes.
- Praxis-BGM incorporates prior knowledge from large-scale reference data, enhancing clustering accuracy through feature selection via Bayes Factors and efficient computation with JAX.
- Applications demonstrated improved clustering in bulk transcriptomics for breast cancer subtyping and in transferring cell-type annotations in single-cell RNA-seq data, even with partially mismatched priors.
Abstract
Recent advances in high-throughput technologies have enabled observational studies to collect high-dimensional omic data. However, such data, often measured on small sample sizes, pose challenges to model-based clustering approaches such as Gaussian Mixture Models. Existing methods often fail to generalize due to model instability under complex mixture patterns. To overcome these limitations, we propose a natural-gradient variational inference framework for Gaussian mixture models named Praxis-BGM that incorporates informative priors (cluster-specific means, covariances, and structural connectivity) from large-scale reference data with known cluster or class labels to enable semi-supervised transfer learning. We derive natural-gradient updates that integrate prior knowledge, leveraging the Variational Online Newton algorithm. We also perform feature selection for clustering using Bayes Factors. Implemented using the JAX library for accelerator-oriented computation, Praxis-BGM is computationally efficient and scalable. We demonstrate the effectiveness of Praxis-BGM in extensive simulations and with two real-world applications: bulk transcriptomic datasets for breast cancer subtyping (the Cancer Genome Atlas Breast Invasive Carcinoma and the Molecular Taxonomy of Breast Cancer International Consortium), and transferring cell-type annotations between single-cell transcriptomic datasets produced by different single-cell RNA-seq technologies in a human pancreas study. Even when priors are partially mismatched with the target data, Praxis-BGM enhances semi-supervised clustering accuracy and biological interpretability.
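A loose analogy for transferring reference knowledge into mixture priors, using scikit-learn's BayesianGaussianMixture; note that sklearn only supports a single shared prior mean, whereas Praxis-BGM uses cluster-specific priors and natural-gradient updates, so this sketch illustrates the idea rather than the method.

```python
# Sketch: derive mixture prior hyperparameters from labeled reference data.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

def fit_with_reference(X_target, X_ref, n_components):
    d = X_ref.shape[1]
    bgm = BayesianGaussianMixture(
        n_components=n_components,
        mean_prior=X_ref.mean(axis=0),            # prior center from reference data
        mean_precision_prior=5.0,                 # prior strength (assumed value)
        covariance_prior=np.cov(X_ref.T) + 1e-3 * np.eye(d),
        degrees_of_freedom_prior=d + 2,
        random_state=0,
    )
    return bgm.fit(X_target)

rng = np.random.default_rng(0)
model = fit_with_reference(rng.normal(size=(100, 5)), rng.normal(size=(300, 5)), 4)
print(model.weights_.round(2))
```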
bioinformatics · 2025-11-14 · v1
A unified language model bridging de novo and fragment-based 3D molecule design delivers potent CBL-B inhibitors for cancer treatment
Wang, H.; Sun, G.; Zhang, B.; Wang, Y.; Xi, B.; Yang, M.; Liu, C.; Ge, Y.; Fan, F.; Feng, W.; Zhu, Y.; Xiao, Y.; Wang, Y.; Liu, Z.; Jiang, D.; Wang, H.; Zhou, W.; Huang, B.
AI Summary
- The study introduces UniLingo3DMol, a language model for 3D molecular generation that integrates de novo and fragment-based design through multi-stage training.
- UniLingo3DMol outperformed existing models in generating molecules for over 100 biological targets.
- It was used to design potent CBL-B inhibitors, resulting in a lead compound with strong in vitro activity and in vivo anti-tumor efficacy.
Abstract
The rational design of small molecules is central to drug discovery, yet current artificial intelligence (AI) methodologies for generating three-dimensional (3D) molecules are often siloed, focusing on either de novo design or fragment-based design. The lack of a holistic framework limits AI's application across the complex and multi-step pipeline spanning from novel scaffold identification to lead compound optimization, and prevents AI from effectively learning from the entire process. Here, we introduce UniLingo3DMol, a language model for 3D molecular generation, empowered by fragment permutation-capable molecular representation alongside multi-stage and multi-task training strategy. This integrated design enables UniLingo3DMol to seamlessly span both de novo and fragment-retained molecular design, demonstrating superior performance over existing generation models in in silico evaluations across more than 100 diverse biological targets. We further leveraged UniLingo3DMol in the design of inhibitors targeting CBL-B, a crucial immune E3 ubiquitin ligase and attractive immunotherapy target. This strategy led to a lead compound demonstrating excellent in vitro activity and robust in vivo anti-tumor efficacy. Our findings establish UniLingo3DMol as a generalized and powerful platform, showing the strong potential to advance AI-driven drug discovery.
bioinformatics · 2025-11-14 · v1
TooTranslator: Zero-Shot Classification of Specific Substrates for Transport Proteins by Language Embedding Alignment of Proteins and Chemicals
Ataei, S.; Butler, G.
AI Summary
- TooTranslator uses a regression-based model to align embeddings from protein, chemical, and text language models for zero-shot prediction of transport protein substrates.
- The model was evaluated with four loss functions, showing no significant performance differences.
- For unseen substrates, the model achieved 4.3% accuracy for top-1 prediction, improving to 80% for top-100 predictions.
Abstract
Transmembrane transport proteins mediate selective movement of ions and metabolites across membranes. Experimental characterization of their substrate specificity is limited. For novel class discovery with limited data, zero-shot learning aims to assign labels to test samples whose label has not been seen previously during training of the model. We introduce TooTranslator, a regression-based model that aligns embeddings from pre-trained protein (ProtBERT), chemical (ChemBERTa), and text (SciBERT) language models into a shared latent space, enabling substrate prediction by minimizing distances between protein and substrate embeddings. TooTranslator tackles zero-shot learning for the task of predicting the specific substrate of transmembrane transport proteins. Models using four loss functions are evaluated on protein test sets with seen and unseen substrates. We find no statistically significant difference in performance of the four loss functions. For tests with seen substrates, the models compare with the state-of-the-art in performance. For tests with unseen (but known) substrates, a top-k approach is needed, as only 4.3% of test cases predict the correct label as the nearest label. For top-10, that rises to 17%; for top-50, to 54%; and for top-100, to 80%. TooTranslator demonstrates the potential of multimodal embedding alignment for open-world protein function inference.
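A sketch of the alignment-and-retrieve idea with random stand-in embeddings: fit a ridge regression from protein space to chemical space, then rank candidate substrates by distance for top-k evaluation. Extraction of real ProtBERT/ChemBERTa embeddings is omitted, and the regression choice is an assumption.

```python
# Sketch: regression-based embedding alignment + top-k substrate retrieval.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
P_train = rng.normal(size=(500, 1024))   # protein embeddings (stand-in)
C_train = rng.normal(size=(500, 768))    # matched substrate embeddings (stand-in)

mapper = Ridge(alpha=1.0).fit(P_train, C_train)   # align protein -> chemical space

C_labels = rng.normal(size=(1300, 768))  # embeddings of all candidate substrates
P_test = rng.normal(size=(50, 1024))
pred = mapper.predict(P_test)

# Rank candidate substrates by Euclidean distance in the shared space.
d = np.linalg.norm(pred[:, None, :] - C_labels[None, :, :], axis=-1)
topk = np.argsort(d, axis=1)[:, :100]    # top-100 nearest substrate labels
print(topk.shape)
```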
bioinformatics · 2025-11-14 · v1
SignifiKANTE: Efficient P-value computation for gene regulatory networks
Woller, F.; Martini, P.; Sen, S.; Blumenthal, D. B.; Hartebrodt, A.
AI Summary
- The study addresses the computational challenge of calculating statistical significance in gene regulatory network (GRN) inference by developing SignifiKANTE.
- SignifiKANTE uses gene clustering based on the 1-Wasserstein distance to efficiently compute approximate empirical P-values for multiple target genes, reducing computation time significantly.
- This method integrates with the Arboreto GRN inference package, enhancing its functionality without loss of P-value accuracy.
Abstract
Gene regulatory networks (GRNs) are graph-based representations of regulatory relationships between transcription factors and target genes. Various tools exist to infer GRNs from gene expression data, but since this task is computationally intensive, statistical significance estimates are often omitted. While permutation-based empirical P-value computation methods are relatively straightforward to implement, they are prohibitively expensive when applied to popular regression-based GRN inference methods and realistically sized datasets. To address this bottleneck, we developed SignifiKANTE. SignifiKANTE is based on the key insight that the background count distributions of groups of target genes may be highly similar, even if their expression vectors show distinct behavior. Relying on this insight, SignifiKANTE employs gene clustering based on the 1-Wasserstein distance to create a small, constant number of background distributions, which enables the simultaneous computation of approximate empirical P-values for multiple target genes. This reduces runtime by orders of magnitude (for some datasets, from several weeks to a few hours) without compromising faithfulness of the obtained P-values. SignifiKANTE extends the popular GRN inference package Arboreto and is available as a Python package on GitHub (https://github.com/bionetslab/SignifiKANTE) and PyPI (https://pypi.org/project/signifikante/).
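A conceptual sketch of the shared-background idea (not the package's API): cluster target genes whose permutation backgrounds are close in 1-Wasserstein distance, then reuse one pooled background per cluster when computing approximate empirical P-values. The backgrounds and cluster count below are simulated assumptions.

```python
# Approximate empirical P-values from pooled cluster backgrounds (illustration).
import numpy as np
from scipy.stats import wasserstein_distance
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(1)
n_genes, n_perm = 40, 500
backgrounds = rng.gamma(2.0, rng.uniform(0.5, 2.0, n_genes)[:, None], (n_genes, n_perm))
observed = rng.gamma(2.0, 1.5, n_genes)  # observed importance score per target gene

# Pairwise 1-Wasserstein distances between background distributions.
dist = np.zeros((n_genes, n_genes))
for i in range(n_genes):
    for j in range(i + 1, n_genes):
        dist[i, j] = dist[j, i] = wasserstein_distance(backgrounds[i], backgrounds[j])

# A small, constant number of clusters -> one pooled background each.
labels = fcluster(linkage(squareform(dist), method="average"), t=5, criterion="maxclust")
pooled = {c: backgrounds[labels == c].ravel() for c in np.unique(labels)}

# Empirical P-value of each gene against its cluster's pooled background.
pvals = np.array([(pooled[labels[g]] >= observed[g]).mean() for g in range(n_genes)])
print(pvals[:5])
```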
bioinformatics2025-11-14v1Novel tensor decomposition-based approach for cell type deconvolution in Visium datasets when reference scRNA-seqs include multiple minor cell types
Taguchi, Y.-h.; Turki,AI Summary
- The study introduces a tensor decomposition (TD)-based unsupervised feature extraction method for integrating multiple Visium datasets to profile spatial gene expression.
- This approach successfully deconvolutes cell types within Visium spots by referencing scRNA-seq data with multiple minor cell types, where conventional methods like RCTD and SPOTlight fail.
- TD-based unsupervised FE is effective for deconvolution in scenarios with multiple minor cell types, expanding its application range.
Abstract
We have applied tensor decomposition (TD)-based unsupervised feature extraction (FE) to integrate multiple Visium datasets, as a platform for spatial gene expression profiling (spatial transcriptomics). As a result, TD-based unsupervised FE successfully obtains singular value vectors consistent with the spatial distribution; that is, singular value vectors with similar values are assigned to neighboring spots. Furthermore, TD-based unsupervised FE successfully infers the cell-type fractions within individual Visium spots (i.e., successful deconvolution) by referencing single-cell RNA-seq experiments that include multiple minor cell types, for which other conventional methods--RCTD, SPOTlight, SpaCET, and cell2location--fail. Therefore, TD-based unsupervised FE can perform deconvolution even in the atypical case where conventional methods fail because the reference profiles include multiple minor cell types. TD-based unsupervised FE is thus expected to be applied to a wide range of deconvolution applications.
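For readers unfamiliar with TD-based unsupervised FE, the core operation is extracting per-mode singular value vectors from a tensor such as spots x genes x samples. The sketch below shows a generic higher-order SVD on simulated counts; it illustrates only the decomposition step, not the authors' full deconvolution procedure, and all shapes are invented.

```python
# Mode-wise singular value vectors of a spots x genes x samples tensor (HOSVD step).
import numpy as np

rng = np.random.default_rng(8)
tensor = rng.poisson(2.0, size=(100, 500, 3)).astype(float)  # spots x genes x Visium samples

def mode_singular_vectors(t, mode, rank):
    """Left singular vectors of the mode-`mode` unfolding of the tensor."""
    unfolded = np.moveaxis(t, mode, 0).reshape(t.shape[mode], -1)
    u, _, _ = np.linalg.svd(unfolded, full_matrices=False)
    return u[:, :rank]

spot_vecs = mode_singular_vectors(tensor, 0, rank=5)   # spatial patterns over spots
gene_vecs = mode_singular_vectors(tensor, 1, rank=5)   # gene loadings
print(spot_vecs.shape, gene_vecs.shape)                # (100, 5) (500, 5)
```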
bioinformatics2025-11-13v3Gene interaction enrichment analysis for transcriptomic data with GREA
Liu, X.; Jiang, A.; Lyu, C.; Chen, L.AI Summary
- GREA is a novel framework that incorporates gene interaction data into enrichment analysis, enhancing the detection of complex pathway signals in transcriptomic data.
- It uses interaction overlap ratios and supports three metrics: ES, ESD, and AUC, with statistical significance assessed via permutation testing and gamma distribution modeling.
- Benchmarking on respiratory viral infection datasets showed GREA outperforms tools like blitzGSEA and GSEApy, identifying more relevant pathways with improved stability and reproducibility.
Abstract
Gene Set Enrichment Analysis (GSEA) is a cornerstone for interpreting gene expression data, yet traditional approaches overlook gene interactions by focusing solely on individual genes, limiting their ability to detect subtle or complex pathway signals. To overcome this, we present GREA (Gene Interaction Enrichment Analysis), a novel framework incorporating gene interaction data into enrichment analysis. GREA replaces the binary gene hit indicator with an interaction overlap ratio, capturing the degree of overlap between gene sets and gene interactions to enhance sensitivity and biological interpretability. It supports three enrichment metrics: Enrichment Score (ES), Enrichment Score Difference (ESD) from a Kolmogorov-Smirnov-based statistic, and Area Under the Curve (AUC) from a recovery curve. GREA evaluates statistical significance using both permutation testing and gamma distribution modeling. Benchmarking on transcriptomic datasets related to respiratory viral infections shows that GREA consistently outperforms existing tools such as blitzGSEA and GSEApy, identifying more relevant pathways with greater stability and reproducibility. By integrating gene interactions into pathway analysis, GREA offers a powerful and flexible tool for uncovering biologically meaningful insights in complex datasets. The source code is available at https://github.com/compbioclub/GREA.
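The key substitution named in this abstract, a continuous interaction-overlap ratio in place of GSEA's binary hit indicator, can be sketched directly in the Kolmogorov-Smirnov-style running sum. The weights and gene set below are invented for illustration; this is not the GREA implementation.

```python
# Running-sum enrichment score with continuous hit weights (conceptual sketch).
import numpy as np

rng = np.random.default_rng(2)
n_genes = 1000                                     # genes assumed pre-sorted by a ranking stat
overlap_ratio = np.zeros(n_genes)                  # 0 = gene has no interaction overlap
members = rng.choice(n_genes, 50, replace=False)   # genes overlapping the gene set
overlap_ratio[members] = rng.uniform(0.1, 1.0, 50)

def enrichment_score(weights):
    """Kolmogorov-Smirnov-style running sum where hits contribute their overlap ratio."""
    hit = weights / weights.sum()                      # increments on overlapping genes
    miss = (weights == 0) / max((weights == 0).sum(), 1)
    running = np.cumsum(hit - miss)
    return running[np.abs(running).argmax()]           # signed maximal deviation (ES)

print(enrichment_score(overlap_ratio))
```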
bioinformatics2025-11-13v2SAGA (Simplified Association Genomewide Analyses): a user-friendly Pipeline to Democratize Genome-Wide Association Studies
Cieza, B.; Pandey, N.; Ruhela, V.; Ali, S.; Tosto, G.AI Summary
- The main challenge addressed is the accessibility and reproducibility of genome-wide association studies (GWAS).
- SAGA, a BASH-based pipeline, was developed to integrate PLINK, GMMAT, and SAIGE, automating GWAS from data preprocessing to visualization.
- Key findings show SAGA enables users without scripting expertise to perform robust GWAS, enhancing accessibility to genetic analyses.
Abstract
Genome-wide association studies (GWAS) have enabled clinicians and researchers to identify genetic variants linked to complex traits and diseases (1-3). However, GWAS still face several challenges, particularly regarding accessibility and reproducibility (4-6). Conducting these analyses often requires substantial bioinformatics expertise for data preprocessing, software installation, and scripting (7-10). We therefore developed SAGA ("Simplified Association Genome-wide Analyses"), a BASH-based, open-source, fully automated pipeline that integrates three widely adopted tools - PLINK (11), GMMAT (12), and SAIGE (13) - for accessible, robust, and reproducible GWAS. After installation, users simply need to provide genotype and phenotype files in standard formats. The pipeline automates preprocessing, association testing, and visualization, outputting summary statistics, Manhattan plots, and quantile-quantile plots. SAGA enables robust GWAS for users without scripting experience, expanding access to complex genetic analyses.
bioinformatics2025-11-13v2PathogenSurveillance: an automated pipeline for population genomic analyses and pathogen identification
Foster, Z. S. L.; Sudermann, M. A.; Parada Rojas, C. H.; Blair, L. K.; Iruegas Bocardo, F.; Dhakal, U.; Alcala-Briseno, R. I.; Weisberg, A. J.; Phan, H.; Schummer, T. R.; Chang, J. H.; Grunwald, N. J.AI Summary
- PathogenSurveillance is an automated Nextflow pipeline for population genomic analyses of whole genome sequencing (WGS) data, designed for biosurveillance.
- It supports both short- and long-read datasets, mixed samples, and automates processes from reference retrieval to producing interactive reports for pathogen identification.
- The pipeline is flexible, runs on Linux systems, and provides quality control metrics, enhancing real-time pathogen detection and monitoring.
Abstract
Whole genome sequencing (WGS) offers a comprehensive, organism-agnostic method that effectively meets the need for efficient, reliable, and standardized responses to emerging threats from pathogens and pests. Here, we present PathogenSurveillance, an open-source and automated Nextflow pipeline for population genomic analyses of WGS data. It is designed with features tailored for biosurveillance and is suitable for in-field or point-of-care diagnostics. PathogenSurveillance is flexible, accommodating short- and long-read datasets and mixed samples of prokaryotes and/or eukaryotes. It automates all steps, including reference identification and retrieval from the NCBI Assembly database, and produces customizable interactive reports with summaries, phylogenetic trees, and minimum spanning networks that enable species- and subspecies-level identification. It also outputs quality control metrics and organizes genomic data hierarchically to facilitate downstream analyses. The pipeline runs on any Linux-based system and minimizes the need for advanced computational expertise. Source code is available on GitHub under the open-source MIT license. The pipeline expands the toolkit for real-time biosurveillance, enabling rapid detection and monitoring of pathogens and pests and a swift response to novel variants.
bioinformatics2025-11-13v2Stratified Active Learning for Spatiotemporal Generalisation in Large-Scale Bioacoustic Monitoring
McEwen, B.; Bernard, C.; Stowell, D.AI Summary
- The study investigates stratified active learning to enhance model performance generalizability in large-scale bioacoustic monitoring across different ecological strata.
- It compares implicit cluster-based diversification with explicit stratification, finding that generalization depends on stratum divergence rather than sampling balance.
- Analysis revealed that spatiotemporal context significantly explains species label variance, aiding in informed sampling decisions.
Abstract
Active learning optimises machine learning model training through the data-efficient selection of informative samples for annotation and training. In the context of biodiversity monitoring using passive acoustic monitoring, active learning offers a promising strategy to reduce the fundamental annotation bottleneck and improve global training efficiency. However, the generalisability of model performance across ecologically relevant strata (e.g., sites, seasons) is often overlooked. As passive acoustic monitoring is extended to larger scales and finer resolutions, inter-strata spatiotemporal variability also increases. We introduce and investigate the concept of stratified active learning to achieve reliable and generalisable model performance across deployment conditions. We compare implicit cluster-based diversification methods with explicit stratification, demonstrating that cross-strata generalisation is a function of stratum divergence, not sampling balance. Additionally, mutual information and exclusion analyses show that spatiotemporal context can explain a substantial proportion of species label variance and inform sampling decisions.
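The mutual-information check mentioned at the end of the abstract amounts to asking how much the joint spatiotemporal stratum tells you about species labels. A generic calculation (not the authors' pipeline) on synthetic labels might look like this:

```python
# Mutual information between spatiotemporal strata and species labels (illustration).
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(10)
n_clips = 5000
site = rng.integers(0, 6, n_clips)        # spatial stratum per audio clip
season = rng.integers(0, 4, n_clips)      # temporal stratum
# Species occurrence depends partly on the strata (the effect we want to measure).
species = (site + season + rng.integers(0, 3, n_clips)) % 8

stratum = site * 4 + season               # joint spatiotemporal stratum
print(mutual_info_score(stratum, species))  # nats of label information carried by context
```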
bioinformatics2025-11-13v2One-Hot News: Drug Synergy Models Take a Shortcut
Candir, E. B.; Kuru, H. I.; Rattray, M.; Cicek, A. E.; Tastan, O.AI Summary
- This study investigates whether computational models for predicting drug synergy use structural and chemical information or merely identifiers, by comparing them with models using one-hot encodings.
- Results showed that one-hot encodings performed comparably or better, suggesting models primarily use representations as identifiers.
- The findings indicate a need for improved strategies in synergy prediction models to effectively learn from intended features and generalize to new drugs and cell lines.
Abstract
Combinatorial drug therapy holds great promise for tackling complex diseases, but the vast number of possible drug combinations makes exhaustive experimental testing infeasible. Computational models have been developed to guide experimental screens by assigning synergy scores to drug pair-cell line combinations, taking as input structural and chemical information on drugs and molecular features of cell lines. The premise of these models is that they leverage this biological and chemical information to predict synergy measurements. In this study, we demonstrate that replacing drug and cell line representations with simple one-hot encodings results in comparable or even slightly improved performance across diverse published drug combination models. This unexpected finding suggests that current models use these representations primarily as identifiers and exploit covariation in the synergy labels. Our synthetic data experiments show that models can learn from the true features; however, when drugs and cell lines recur across drug-drug-cell triplets, this repeating structure impairs feature-based learning. While the current synergy prediction models can still aid in prioritizing drug pairs within a panel of tested drugs and cell lines, our results highlight the need for better strategies to learn from intended features and generalize to unseen drugs and cell lines.
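The paper's control experiment is easy to re-create in spirit: train the same regressor once on feature representations and once on pure one-hot identifiers, then compare held-out error. Everything below is synthetic and the regressor choice is an assumption, not the authors' setup.

```python
# Features vs. one-hot identifiers for synergy regression (conceptual control).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)
n_drugs, n_cells, n_obs, d_feat = 30, 10, 2000, 64
drug_feats = rng.normal(size=(n_drugs, d_feat))   # stand-in chemical descriptors
cell_feats = rng.normal(size=(n_cells, d_feat))   # stand-in omics features

d1 = rng.integers(0, n_drugs, n_obs)              # drug A index per observation
d2 = rng.integers(0, n_drugs, n_obs)              # drug B index
c = rng.integers(0, n_cells, n_obs)               # cell line index
synergy = rng.normal(size=n_obs)                  # placeholder synergy labels

X_feat = np.hstack([drug_feats[d1], drug_feats[d2], cell_feats[c]])
X_onehot = np.hstack([np.eye(n_drugs)[d1], np.eye(n_drugs)[d2], np.eye(n_cells)[c]])

for name, X in [("features", X_feat), ("one-hot", X_onehot)]:
    Xtr, Xte, ytr, yte = train_test_split(X, synergy, random_state=0)
    m = RandomForestRegressor(n_estimators=50, random_state=0).fit(Xtr, ytr)
    print(name, mean_squared_error(yte, m.predict(Xte)))
```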
bioinformatics2025-11-13v2A Method to Calibrate Chemical-Agnostic Quantitative Adverse Outcome Pathways on Multiple Chemical Data
Zhou, Z.; Sahlin, U.AI Summary
- The study developed a chemical-agnostic calibration approach for quantitative Adverse Outcome Pathways (qAOPs) to address inter-chemical heterogeneity in multi-chemical data.
- This method uses hierarchical structures to separate chemical-specific effects from pathway effects, modeling chemical deviations as random effects.
- Simulation studies showed that without calibration, qAOPs are not truly chemical-agnostic, as demonstrated in a case study on non-mutagenic liver tumors.
Abstract
Quantitative Adverse Outcome Pathways (qAOPs) may support next-generation risk assessment by integrating New Approach Methodologies (NAMs) for derivation of points of departure. To be useful, a qAOP should be chemical-agnostic. However, existing qAOP studies often pool multi-chemical data without adequately addressing inter-chemical heterogeneity. Consequently, fundamental pathway relationships become obscured by heterogeneity-induced noise, thereby compromising the reliability of chemical-agnostic predictions. We developed a chemical-agnostic calibration approach to address this challenge by leveraging hierarchical structures to systematically separate chemical-specific heterogeneity from underlying pathway effects. Through this methodological framework, chemical-specific deviations are explicitly modeled as random effects, enabling the extraction of pathway-level parameters that represent core mechanistic relationships independent of individual chemical properties. Through simulation studies across varying heterogeneity levels, we demonstrate that performance differences between models with and without hierarchical calibration reveal the magnitude of heterogeneity in the data. Moreover, when heterogeneity is substantial, an uncalibrated qAOP should not be considered truly chemical-agnostic in practice, as it confounds pathway-level effects with chemical-specific variation. We demonstrated the application of this calibration approach through a case study of non-mutagenic liver tumors. The framework proposed in this study enhances qAOP generalizability while preserving the chemical-agnostic principle, supporting robust NAMs-based next-generation risk assessments.
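The hierarchical idea, modeling chemical-specific deviations as random effects so the fixed effect captures the chemical-agnostic pathway relationship, corresponds to a standard random-intercept model. Below is a hedged sketch with simulated data using statsmodels; it illustrates the concept, not the authors' exact model.

```python
# Random-intercept model: chemical-specific offsets as random effects (sketch).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
chemicals = np.repeat(np.arange(8), 25)                 # 8 chemicals x 25 doses each
dose = rng.uniform(0, 10, chemicals.size)
chem_offset = rng.normal(0, 1.5, 8)[chemicals]          # inter-chemical heterogeneity
key_event = 2.0 * dose + chem_offset + rng.normal(0, 1, chemicals.size)

df = pd.DataFrame({"key_event": key_event, "dose": dose, "chemical": chemicals})

# Random intercept per chemical; the fixed slope is the pathway-level effect.
fit = smf.mixedlm("key_event ~ dose", df, groups=df["chemical"]).fit()
print(fit.params["dose"])  # should recover ~2.0 despite chemical-specific variation
```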
bioinformatics2025-11-13v2SpecLig: Energy-Guided Hierarchical Model for Target-Specific 3D Ligand Design
Zhang, P.; Han, R.; Kong, X.; Chen, T.; Ma, J.AI Summary
- SpecLig is a structure-based framework that generates small molecules and peptides with enhanced target affinity and specificity by using a hierarchical SE(3)-equivariant variational autoencoder and an energy-guided geometric latent diffusion model.
- It incorporates chemical priors from block-block contact statistics to favor pocket-complementary fragment combinations.
- Benchmarking on public datasets showed that SpecLig's ligands bind with high specificity and affinity, with ablations confirming the importance of its hierarchical and energy-guided components.
Abstract
Structure-based generative models often optimize single-target affinity while ignoring specificity, producing candidates prone to off-target binding. We introduce SpecLig, a unified, structure-based framework that jointly generates small molecules and peptides with improved target affinity and specificity. SpecLig represents a complex as a block-based graph, combining a hierarchical SE(3)-equivariant variational autoencoder with an energy-guided geometric latent diffusion model. Chemical priors derived from block-block contact statistics are explicitly incorporated, biasing generation toward pocket-complementary fragment combinations. We benchmark SpecLig on peptide and small-molecule tasks using standard public datasets and propose precision/breadth testing paradigms to quantify specificity. Across multiple evaluations, ligand candidates generated by SpecLig usually bind the target pocket with high specificity and affinity while maintaining competitive advantages in other attributes. Ablations indicate that both the hierarchical representation and the energy guidance contribute to this success. Finally, we provide multiple real applications to demonstrate how SpecLig improves ligands in natural complexes to avoid potential off-target risks. SpecLig therefore provides a practical route to prioritize higher-specificity designs for downstream experimental validation. The code is available at: https://github.com/CQ-zhang-2016/SpecLig.
bioinformatics2025-11-13v2GPU-accelerated, self-optimizing processing for 3D multiplexed iterative RNA-FISH experiments
Kruithoff, R.; Spendlove, M. D.; Sheppard, S. J.; Schweiger, M. C.; Pessoa, P.; Abbasi, M.; Presse, S.; Bartelle, B. B.; Shepherd, D. P.AI Summary
- This study introduces a GPU-accelerated framework, merfish3d-analysis, to enhance the computational processing of 3D multiplexed iterative RNA-FISH experiments.
- The framework was used to assess information loss from axial sampling changes, reprocess existing MERFISH datasets, and analyze new MERFISH data from a human olfactory bulb.
- A multi-step autofluorescence quenching protocol was developed to improve data quality in post-mortem samples.
Abstract
Imaging-based spatial transcriptomic approaches rely on iterative labeling and imaging of carefully prepared samples, followed by solving a computational inverse problem to determine the location and identity of the targeted RNA. Because these approaches require high-resolution optics, the Nyquist-Shannon determined voxel size is small relative to typical tissue sample footprints. A common solution to speed up both experiments and computation is to increase the distance between focal planes, trading off local information content to sample a larger imaging area in a reasonable time. In this work we introduce a GPU-accelerated computational framework, merfish3d-analysis, designed to speed up the computational processing of barcoded, in situ imaging-based spatial transcriptomics. Using this framework, we quantify the information lost due to axial sampling changes in simulated imaging-based spatial transcriptomic experiments, robustly reprocess publicly available multiplexed error-robust fluorescence in situ hybridization (MERFISH) datasets, and analyze new MERFISH experiments performed on a post-mortem human olfactory bulb sample. To improve the quality of experimental data in the post-mortem human sample, we designed a multi-step autofluorescence quenching protocol specific for in situ imaging-based spatial transcriptomic strategies. Taken together, we hope that the sample preparation protocols and single-workstation, GPU-accelerated processing will further democratize imaging-based spatial transcriptomic experiments.
bioinformatics2025-11-13v2Accelerating ligand discovery by combining Bayesian optimization with MMGBSA-based binding affinity calculations
Andersen, L.; Rausch-Dupont, M.; Martinez Leon, A.; Volkamer, A.; Hub, J.; Klakow, D.AI Summary
- The study introduces an active learning framework combining Bayesian optimization with MMGBSA for predicting protein-ligand binding affinity, aiming to balance accuracy and computational efficiency.
- This approach was tested on 60,000 compounds targeting MCL1, showing that integrating MMGBSA into the learning loop improved recovery of top binders to 79.9% from 6.7% with docking alone.
- MMGBSA showed a stronger correlation with experimental data and was more efficient than docking, with one-at-a-time acquisition outperforming batched methods.
Abstract
Predicting protein-ligand binding affinity with high accuracy is critical in structure-based drug discovery. While docking methods offer computational efficiency, they often lack the precision required for reliable affinity ranking. In contrast, molecular dynamics (MD)-based approaches such as MMGBSA provide more accurate binding free energy estimates but are computationally intensive, limiting their scalability. To address this trade-off, we introduce an active learning framework that automates molecule selection for docking and MD simulations, replacing manual expert-driven decisions with a data-efficient, model-guided strategy. Our approach integrates fixed molecular embeddings, including pre-trained deep learning representations (MolFormer, ChemBERTa-2) and Morgan fingerprints, with adaptive regression models (e.g., Bayesian Ridge and Random Forest) to iteratively improve binding affinity predictions. We evaluate this approach retrospectively on a new dataset of 60,000 chemically diverse compounds from ZINC-22 targeting the MCL1 protein using both AutoDock Vina and MMGBSA. Our results show that incorporating MMGBSA scores into the active learning loop significantly enhances performance, recovering 79.9% of the top 1% binders in the whole dataset, compared to only 6.7% when using docking scores alone. Notably, MMGBSA exhibits a stronger correlation with experimental binding affinities than AutoDock Vina on our dataset and enables more accurate ranking of candidate compounds in a runtime-efficient way. Furthermore, we demonstrate that a one-at-a-time acquisition active learning strategy consistently outperforms traditional batched acquisition, the latter achieving just 78.4% recovery with MolFormer and Bayesian Ridge. These findings underscore the potential of integrating deep learning-based molecular representations with MD-level accuracy in an active learning framework, offering a scalable and efficient path to accelerate virtual screening and improve hit identification in drug discovery.
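A minimal one-at-a-time active-learning loop in the spirit of this abstract (not the authors' pipeline): a Bayesian Ridge surrogate over fixed embeddings picks the next compound to "simulate". The scores below are synthetic stand-ins for MMGBSA free energies, and the acquisition rule is an assumed mean-plus-uncertainty heuristic.

```python
# One-at-a-time active learning with a Bayesian Ridge surrogate (sketch).
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(5)
n_pool, d = 5000, 128
X = rng.normal(size=(n_pool, d))                        # stand-in molecular embeddings
true_score = X @ rng.normal(size=d) + rng.normal(0, 0.5, n_pool)  # "MMGBSA" energies

labeled = list(rng.choice(n_pool, 20, replace=False))   # initial random batch
for _ in range(100):                                    # 100 acquisitions, one at a time
    model = BayesianRidge().fit(X[labeled], true_score[labeled])
    mean, std = model.predict(X, return_std=True)
    acq = -mean + std                                   # lower energy is better; add exploration
    acq[labeled] = -np.inf                              # never re-acquire labeled compounds
    labeled.append(int(acq.argmax()))                   # acquire, then "run the simulation"

top1pct = set(np.argsort(true_score)[: n_pool // 100])  # most negative = strongest binders
print(len(top1pct & set(labeled)) / len(top1pct))       # fraction of top binders recovered
```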
bioinformatics2025-11-13v2MaxGeomHash: An Algorithm for Variable-Size Random Sampling of Distinct Elements
Hera, M. R.; Koslicki, D.; Martinez, C.AI Summary
- The study introduces MaxGeomHash, a novel algorithm for variable-size random sampling of distinct elements, designed to produce sub-linear sketches without prior knowledge of the total number of k-mers.
- MaxGeomHash offers a balance between efficiency and accuracy by generating samples of size r lg(n/r)+O(r), which is more efficient than FracMinHash for tasks like database filtering, clustering, and similarity search on genomic datasets.
Abstract
With the surge in sequencing data generated from an ever-expanding range of biological studies, designing scalable computational techniques has become essential. One effective strategy to enable large-scale computation is to split long DNA or protein sequences into k-mers, and summarize large k-mer sets into compact random samples (a.k.a. sketches). These random samples allow for rapid estimation of similarity metrics such as Jaccard or cosine, and thus facilitate scalable computations such as fast similarity search, classification, and clustering. Popular sketching tools in bioinformatics include Mash and sourmash. Mash uses the MinHash algorithm to generate fixed-size sketches, while sourmash employs FracMinHash, which produces sketches whose size scales linearly with the total number of k-mers. Here, we introduce a novel algorithm, MaxGeomHash, which, for a specified parameter r > 1, will produce, without prior knowledge of n (the number of k-mers), a random sample of size r lg(n/r) + O(r). Notably, this is the first deterministic, permutation-invariant, and parallelizable sketching algorithm to date that can produce sub-linear sketches. We also introduce a variant, ε-MaxGeomHash, that produces random samples of size Θ(n^ε) for a given ε ∈ (0, 1). We study the algorithm's properties, analyze generated sample sizes, verify theoretical results empirically, provide a fast implementation, and investigate similarity estimate quality. With intermediate-sized samples between constant (MinHash) and linear (FracMinHash), MaxGeomHash balances efficiency (smaller samples need less storage and processing) with accuracy (larger samples yield better estimates). On genomic datasets, we demonstrate that MaxGeomHash performs common sketching tasks such as database filtering, clustering, and similarity search more efficiently than approaches like FracMinHash.
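The abstract positions MaxGeomHash between two known baselines; the sketch below implements only those baselines (MinHash: a fixed-size sample of the smallest hashes; FracMinHash: every hash in a fixed fraction of hash space), not MaxGeomHash itself. The hash function and parameters are illustrative, not the internals of Mash or sourmash.

```python
# MinHash (constant-size) vs FracMinHash (linear-size) sketching baselines.
import hashlib, random

def h64(kmer: str) -> int:
    """64-bit hash of a k-mer (illustrative choice of hash function)."""
    return int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")

def kmer_hashes(seq: str, k: int = 21) -> set:
    return {h64(seq[i:i + k]) for i in range(len(seq) - k + 1)}

def minhash(seq: str, num: int = 200) -> list:
    """Fixed-size sketch: the `num` smallest distinct hash values."""
    return sorted(kmer_hashes(seq))[:num]

def fracminhash(seq: str, scaled: int = 100) -> set:
    """Linear-size sketch: every hash in the lowest 1/scaled fraction of hash space."""
    cutoff = (1 << 64) // scaled
    return {h for h in kmer_hashes(seq) if h < cutoff}

random.seed(0)
seq = "".join(random.choice("ACGT") for _ in range(100_000))
print(len(minhash(seq)), len(fracminhash(seq)))  # constant size vs ~n/scaled
```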
bioinformatics2025-11-13v1Chemical Dice Integrator (CDI): A Scalable Framework for Multimodal Molecular Representation Learning
Ahuja, G.; Kumar, S.; Solanki, S.; Gupta, M.; Mohanty, S. K.; Satija, S.; Chauhan, S.; Duari, S.; Sharma, A.; Gautam, V.; Arora, S.; Shome, R.; Sinha, S.; Sharma, A. K.; Mittal, A.; Sengupta, D.; Murugan, N. A.AI Summary
- The study introduces the Chemical Dice Integrator (CDI), a framework that integrates six different molecular representations into a unified embedding for improved molecular property prediction.
- CDI consists of CDI-Basic, which uses a two-tiered autoencoder, and CDI-Generalised, which employs a Mamba State-Space Model for direct mapping from SMILES strings.
- Benchmarking showed CDI embeddings outperform individual Featurizers and standard methods in predictive performance, with CDI-Generalised offering high efficiency and the ability to distinguish subtle structural differences.
Abstract
The machine learning landscape for molecular property prediction is fragmented, with numerous Featurizers each capturing a narrow, specialized view of chemical structure. This heterogeneity forces a suboptimal choice of representation a priori, limiting model generalizability. We introduce the Chemical Dice Integrator (CDI), a hierarchical framework that unifies six orthogonal molecular representations (physicochemical: Mordred; topological: GROVER; visual: ImageMol; biological: Signaturizer; quantum-mechanical: MOPAC; linguistic: ChemBERTa) into a single, coherent embedding. The framework consists of CDI-Basic, a two-tiered autoencoder that fuses these modalities, and CDI-Generalised, a Mamba State-Space Model (SSM) that learns a direct, efficient map from SMILES strings to the unified embedding space. Extensive benchmarking across 23 classification (171 tasks) and 10 regression datasets demonstrates that CDI embeddings consistently achieve superior predictive performance compared to individual Featurizers and standard feature aggregation methods. The CDI-Generalised model achieves this performance with exceptional computational efficiency, outperforming deep learning Featurizers in terms of speed and resource overhead. Furthermore, we demonstrate that the CDI embedding is chemically intuitive, allowing for the sensitive distinction of nuanced structural variants, such as chiral enantiomers and kekulized SMILES forms. By bridging multimodal chemical intelligence with scalable, sequence-based inference, CDI offers a strong foundation for molecular machine learning.
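A two-tiered fusion autoencoder of the kind CDI-Basic is described as can be sketched generically: per-modality encoders compress each view, and a second tier fuses the concatenated codes into one embedding. All widths, dimensions, and the architecture below are invented for illustration; this is not the CDI implementation.

```python
# Two-tier fusion autoencoder over six modality embeddings (conceptual sketch).
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, dims = 32, [1024, 768, 512, 512, 256, 384]   # six stand-in modality widths
views = [torch.randn(batch, d) for d in dims]       # one embedding per modality

# Tier 1: one encoder per modality compresses each view to a common width.
tier1 = nn.ModuleList([nn.Sequential(nn.Linear(d, 128), nn.ReLU()) for d in dims])
# Tier 2: a fusion autoencoder maps the concatenated codes to one unified embedding.
fuse = nn.Sequential(nn.Linear(6 * 128, 256), nn.ReLU(), nn.Linear(256, 64))
defuse = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 6 * 128))

codes = torch.cat([enc(v) for enc, v in zip(tier1, views)], dim=1)
unified = fuse(codes)                               # the single, coherent embedding
recon_loss = nn.functional.mse_loss(defuse(unified), codes)
print(unified.shape, recon_loss.item())             # torch.Size([32, 64]) ...
```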
bioinformatics2025-11-13v1Learning from All Views: A Multiview Contrastive Framework for Metabolite Annotation
Zhou Chen, Y.; Hassoun, S.AI Summary
- The study introduces MultiView Projection (MVP), a framework for metabolite annotation that learns a joint embedding space from multiple data views including molecular graphs, fingerprints, and spectra.
- MVP uses contrastive multiview learning to enhance spectral annotation by capturing mutual information across views, improving molecular candidate ranking.
- On the MassSpecGym benchmark, MVP showed superior performance, achieving 35.99% and 13.96% rank@1 for consensus spectra, and 26.37% and 11.10% for individual spectra when retrieving by mass and formula, respectively.
Abstract
Metabolomics, enabled by high-throughput mass spectrometry, promises to advance our understanding of cellular biochemistry and guide new discoveries in disease mechanisms, drug development, and personalized medicine. However, as the assignment of molecular structures to measured spectra is challenging, annotation rates remain low and hinder potential advancements. We present MultiView Projection (MVP), a novel framework for learning a joint embedding space between molecules and spectra by leveraging multiple data views: molecular graphs, molecular fingerprints, spectra, and consensus spectra. MVP builds on contrastive multiview learning to capture mutual information across views, leading to more robust and generalizable representations for spectral annotation. Unlike prior approaches that consider multiple views via concatenation or as targets of auxiliary tasks, MVP learns from all views jointly, resulting in improved molecular candidate ranking. Notably, MVP supports annotation using either individual spectra or consensus spectra, enabling flexible use of multiple measurements. On the MassSpecGym benchmark, we show that annotation using query consensus spectra significantly outperforms rank aggregation strategies based on constituent spectrum annotation. Using the consensus spectrum view, MVP achieves 35.99% and 13.96% rank@1 when retrieving candidates by mass and formula, respectively. When ranking using individual spectra, MVP demonstrates performance that is superior to or on par with existing methods, achieving 26.37% and 11.10% rank@1 for candidates by mass and formula, respectively. MVP offers a flexible, extensible foundation for learning from multiple molecule/spectra data views.
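Contrastive multiview alignment of this kind is commonly built on a symmetric InfoNCE objective, sketched generically below (this is a standard formulation, not MVP's exact loss): row i of each view matrix is a positive pair, and all other rows are negatives.

```python
# Symmetric InfoNCE over two aligned views (generic contrastive objective).
import numpy as np

def info_nce(view_a, view_b, tau=0.07):
    """Symmetric cross-entropy over cosine similarities of paired views."""
    a = view_a / np.linalg.norm(view_a, axis=1, keepdims=True)
    b = view_b / np.linalg.norm(view_b, axis=1, keepdims=True)
    logits = a @ b.T / tau                              # (n, n) similarity matrix
    logp = logits - np.log(np.exp(logits).sum(1, keepdims=True))
    loss_ab = -np.diag(logp).mean()                     # match each a_i to b_i
    logp_t = logits.T - np.log(np.exp(logits.T).sum(1, keepdims=True))
    loss_ba = -np.diag(logp_t).mean()                   # and each b_i to a_i
    return (loss_ab + loss_ba) / 2

rng = np.random.default_rng(6)
mol = rng.normal(size=(32, 256))              # stand-in molecule-view embeddings
spec = mol + rng.normal(0, 0.1, (32, 256))    # stand-in spectrum-view embeddings
print(info_nce(mol, spec))
```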
bioinformatics2025-11-13v1Wasserstein Critics Outperform Discriminators in Adversarial Deconfounding of Gene Expression Data
Reid, K.; Guven, E.AI Summary
- The study addresses the issue of confounding variables in gene expression data by comparing a discriminator-based approach with a Wasserstein critic in a deconfounding autoencoder.
- The Wasserstein critic was found to significantly outperform the discriminator in terms of integration quality on benchmark datasets.
- This improvement came with only a marginal loss in biological fidelity.
Abstract
High-throughput gene expression measurements are biased by technical and biological confounding variables, which obscure true biological signals. A common deep-learning-based solution involves training latent-space models with adversarial regularizers to ignore information from confounding variables. These methods rely on discriminator networks, which are unstable to train, especially when confounding effects are strong, leading to issues like vanishing gradients. Inspired by Wasserstein GANs, we propose replacing the standard discriminator with a Wasserstein critic for this deconfounding task. We systematically compare the performance of a discriminator-based approach against a Wasserstein critic-based approach on a state-of-the-art deconfounding autoencoder architecture. We evaluate these methods on standard single-cell integration benchmark datasets and demonstrate that the Wasserstein critic significantly outperforms the discriminator in integration quality with marginal loss of biological fidelity.
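The two adversarial objectives being compared can be contrasted in a few lines (a generic sketch, not the paper's architecture): a discriminator classifies the confounder from the latent code with cross-entropy, while a Wasserstein critic produces an unbounded score whose group means are pushed apart; the encoder then minimizes that separation. Shapes and the two-batch setup below are illustrative.

```python
# Wasserstein-critic adversarial term for latent-space deconfounding (sketch).
import torch
import torch.nn as nn

torch.manual_seed(0)
latent_dim, n_batches = 16, 2
z = torch.randn(64, latent_dim)                      # latent codes from the autoencoder
batch_id = torch.randint(0, n_batches, (64,))        # confounding batch labels

critic = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, 1))

scores = critic(z).squeeze(1)
# Critic loss: maximize the gap between the two groups' mean scores.
critic_loss = -(scores[batch_id == 0].mean() - scores[batch_id == 1].mean())
# The encoder's adversarial term is the negation: make the groups indistinguishable.
encoder_adv_loss = -critic_loss
print(critic_loss.item(), encoder_adv_loss.item())
# In real training, a Lipschitz constraint on the critic (weight clipping or a
# gradient penalty) is required for the Wasserstein estimate to be meaningful.
```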
bioinformatics2025-11-13v1Leveraging FracMinHash Containment for Genomic dN/dS
Rodriguez, J. S.; Hera, M. R.; Koslicki, D.AI Summary
- The study introduces an alignment-free method using FracMinHash containment to estimate the dN/dS ratio, which measures evolutionary pressures, at a genomic level.
- This approach was tested on 85,205 genomes, completing pairwise dN/dS estimations in 5 hours, demonstrating scalability and speed.
- Results showed comparability to traditional methods, with applications in identifying selection signatures between Archaeal and Bacterial genomes.
Abstract
Increasing availability of genomic data demands algorithmic approaches that can efficiently and accurately conduct downstream genomic analyses. These analyses, such as evaluating selection pressures within and across genomes, can reveal developmental and environmental pressures. One such commonly used metric to measure evolutionary pressures is based on the ratio of non-synonymous and synonymous substitution rates, dN/dS. Conventionally, the dN/dS ratio is used to infer selection pressures employing alignments to estimate total non-synonymous and synonymous substitution rates along protein-coding genes. However, this process can be time-consuming and does not scale to larger datasets. Recently, a fast, approximate similarity measure, FracMinHash containment, was introduced and related to average nucleotide identity. In this work, we show how FracMinHash containment can be used to quickly estimate dN/dS, enabling alignment-free estimations at a genomic level. Through simulated and real-world experiments, our results indicate that employing FracMinHash containment to estimate dN/dS is scalable, enabling pairwise dN/dS estimations for 85,205 genomes within 5 hours. Furthermore, our approach is comparable to traditional dN/dS methods, representing sequences subject to positive and negative selection across various mutation rates. Moreover, we used this model to evaluate signatures of selection between Archaeal and Bacterial genomes, identifying a previously unreported metabolic island between Methanobrevibacter sp. RGIG2411 and Candidatus Saccharibacteria bacterium RGIG2249. We present FracMinHash dN/dS, a novel alignment-free approach for estimating dN/dS at a genome level that is accurate and scalable beyond gene-level estimations while demonstrating comparability to conventional alignment-based dN/dS methods. Leveraging the alignment-free similarity estimation, FracMinHash containment, pairwise dN/dS estimations are facilitated within milliseconds, making the approach suitable for large-scale evolutionary analyses across diverse taxa. It supports comparative genomics, evolutionary inference, and functional interpretation across both synthetic and complex biological datasets. Availability and implementation: A version of the implementation is available at https://github.com/KoslickiLab/dnds-using-fmh.git. The reproduction of figures, data, and analysis can be found at https://github.com/KoslickiLab/dnds-using-fmh_reproducibles.git.
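FracMinHash containment, the similarity estimate this method builds on, is the fraction of one genome's sketched k-mers found in another genome's sketch. A minimal sketch of the containment computation follows (the dN/dS mapping itself is in the paper and not reproduced here; the hash scheme is illustrative).

```python
# FracMinHash containment between two genomes (illustrative implementation).
import hashlib, random

def h64(kmer: str) -> int:
    return int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")

def frac_sketch(seq: str, k: int = 21, scaled: int = 100) -> set:
    """Keep every k-mer whose hash falls in the lowest 1/scaled of hash space."""
    cutoff = (1 << 64) // scaled
    return {h for h in (h64(seq[i:i + k]) for i in range(len(seq) - k + 1)) if h < cutoff}

def containment(sketch_a: set, sketch_b: set) -> float:
    """C(A, B): fraction of A's sketched k-mers also present in B's sketch."""
    return len(sketch_a & sketch_b) / len(sketch_a) if sketch_a else 0.0

random.seed(1)
genome_a = "".join(random.choice("ACGT") for _ in range(200_000))
genome_b = genome_a[:150_000] + "".join(random.choice("ACGT") for _ in range(50_000))
print(containment(frac_sketch(genome_a), frac_sketch(genome_b)))  # ~0.75 shared
```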
bioinformatics2025-11-13v1MetaXtract: Extracting Metadata from Raw Files for FAIR Data Practices and Workflow Optimisation
Lutfi, A.; Chen, Z. A.; Fischer, L.; Rappsilber, J.AI Summary
- MetaXtract extracts metadata from Thermo Fisher raw files to enhance FAIR data practices by providing structured, tabular formats of sample information, LC-MS settings, and scan metrics.
- This tool supports reproducibility, data sharing, and quality control by making metadata accessible, thereby improving the value of MS datasets in public repositories.
- Beyond data sharing, MetaXtract aids in QC, troubleshooting, and integration into automated workflows, optimizing method development and supporting large-scale applications.
Abstract
Mass spectrometry (MS) experiments generate rich acquisition metadata that are essential for reproducibility, data sharing, and quality control (QC). Because these metadata are typically stored only in vendor-specific formats, they often remain difficult to access. MetaXtract is a lightweight tool that extracts detailed parameters directly from Thermo Fisher raw files and exposes them in structured, tabular formats. By capturing sample information, LC-MS method settings, and scan-level metrics such as retention time, total ion current, and ion injection time, MetaXtract increases transparency and ensures that essential acquisition details accompany published data and results in an easily readable form. This supports FAIR data practices by improving the findability, accessibility, interoperability, and reusability of MS datasets, thereby increasing the value of deposition in public repositories. The importance of such metadata accessibility was recently highlighted by the crosslinking mass spectrometry community in efforts to advance FAIR data principles, and it extends to MS-based omics approaches more broadly. Beyond data sharing, the tool streamlines QC and troubleshooting through simple visualisations of MS1/MS2 scans and enables integration into automated pipelines. By embedding acquisition parameters into routine data handling, MetaXtract strengthens reproducibility, optimises method development, and supports large-scale applications, including machine learning and secondary data analysis.
bioinformatics2025-11-13v1Integrating Millions of Years of Evolutionary Information into Protein Structure Models for Function Prediction
Ma, R.; He, C.; Zhang, Z.; Zheng, H.; Duan, L.AI Summary
- The study introduces ESMSCOP, a novel framework that integrates evolutionary sequence data with detailed 3D structural information for protein function prediction.
- ESMSCOP uses a contrast-aware pre-training strategy to bridge the sequence-structure gap, enhancing the synergy between sequence and structure.
- Experiments show ESMSCOP outperforms existing methods in function prediction, even with less pre-training data.
Abstract
Background. Understanding life processes relies on accurate protein function prediction, which fundamentally requires integrating the evolutionary information encoded in sequences with the spatial characteristics of 3D structures. However, existing approaches face limitations: they over-rely on sequence, use simplified structural representations instead of fine-grained spatial details, or fail to capture the synergistic relationship between sequence and structure, and these issues are compounded by challenges in acquiring annotated data. Results. To address these issues, we propose a novel contrast-aware pre-training framework, ESMSCOP. ESMSCOP leverages a state-of-the-art protein language model to harness evolutionary insights embedded in sequences, and introduces a new encoder to fuse topological and fine-grained spatial structural features. By employing a contrastive pre-training strategy with auxiliary supervision, ESMSCOP effectively bridges the sequence-structure gap, yielding rich and informative representations. Conclusions. Extensive experiments conducted on multiple benchmark datasets demonstrate that ESMSCOP achieves superior performance in protein function prediction tasks compared to existing methods. Furthermore, it shows strong performance even when utilizing relatively less pre-training data than some large-scale models.
bioinformatics2025-11-13v1Harnessing protein-folding algorithms to drug intrinsically disordered epitopes
Lala, J.; Angioletti-Uberti, S.AI Summary
- The study addresses the challenge of targeting intrinsically disordered protein epitopes by using a protein-folding algorithm within a Monte Carlo optimization framework to design peptide-based binders.
- The approach successfully designed peptides with binding energies comparable to covalent interactions, as confirmed by free energy calculations.
- Molecular simulations revealed that upon binding, the targeted epitopes fold into structured domains, indicating the algorithm's ability to induce folding.
Abstract
Due to their lack of a specific structure and their dynamic nature, targeting epitopes that are part of an intrinsically disordered region of a protein is a notoriously difficult task. Here, we describe a computational approach to overcome this problem, based on the use of a protein-folding algorithm and its confidence metrics within a Monte Carlo optimization pipeline to generate peptide-based binders. For different protein targets, we show by accurate free energy calculations that our approach is able to design peptides with binding free energies on the order of tens of $k_BT$, i.e., with strengths comparable to covalent interactions. Direct observation of the bound complex through molecular simulations shows that the targeted epitope folds into structured domains with lowered thermal fluctuations upon binding, while remaining unstructured and dynamic in the unbound state, suggesting that the protein-folding algorithm must have learned the principles of induced (co-)folding. Given the ubiquitous presence of unstructured regions in proteins, our results suggest a potential pathway to design drugs targeting a large variety of previously untargetable epitopes, and open new possibilities for therapeutic intervention in diseases where disordered proteins play a key role.
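The Monte Carlo optimization described here has a simple skeleton: mutate a candidate peptide, re-score it, and accept or reject with a Metropolis criterion. In the paper the score is a folding model's confidence for the peptide-target complex; in the sketch below, `fold_confidence` is a hypothetical stub standing in for such a predictor.

```python
# Metropolis Monte Carlo over peptide sequences (skeleton with a stub scorer).
import math, random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def fold_confidence(peptide: str) -> float:
    """Hypothetical stand-in for a structure-prediction confidence score."""
    return -abs(peptide.count("W") - 3) + random.gauss(0, 0.1)

def mc_design(length=12, steps=500, temperature=1.0):
    seq = [random.choice(AMINO_ACIDS) for _ in range(length)]
    score = fold_confidence("".join(seq))
    for _ in range(steps):
        pos = random.randrange(length)
        old = seq[pos]
        seq[pos] = random.choice(AMINO_ACIDS)          # propose a point mutation
        new_score = fold_confidence("".join(seq))
        # Metropolis criterion: accept improvements; sometimes accept worse moves.
        if new_score >= score or random.random() < math.exp((new_score - score) / temperature):
            score = new_score
        else:
            seq[pos] = old                             # reject: revert the mutation
    return "".join(seq), score

random.seed(0)
print(mc_design())
```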
bioinformatics2025-11-13v1SC-Framework: A robust and FAIR semi-interactive environment for single-cell resolution datasets
Schultheis, H.; Detleffsen, J.; Wiegandt, R.; Bentsen, M.; Alayoubi, Y.; Valente, G.; Kessler, M. F.; Bruns, B.; Mirza, D.; Usanayo, A.; Walter, J.; Goymann, P.; Hobein, M.; Kuenne, C.; Looso, M.AI Summary
- The SC-Framework addresses computational challenges in single-cell data analysis by integrating standardized data structures, declarative workflows, and computational backends in a containerized environment.
- This framework aims to enhance reproducibility, scalability, and benchmarking by reducing reliance on fragmented public tools.
- It is designed to allow analysts to focus on biological interpretation rather than technical issues, available on GitHub.
Abstract
The accelerated development of single-cell technologies has profoundly impacted the field of biological research, facilitating unparalleled insights into cellular heterogeneity. However, this progress has also produced new computational challenges in the field of bioinformatics: single-cell datasets are increasingly high-dimensional, multimodal, and large-scale, while analysis workflows often remain fragmented, data-type-specific, ad hoc, and difficult to reproduce. The prevailing methodologies are dependent on a combination of public tools, which hinders the reproducibility of results, limits scalability, and complicates efforts to establish benchmarks. A higher-level, unified framework for single-cell data analysis is needed to address these inherent limitations. Here, we introduce the SC-Framework, providing the integration of standardized data structures, declarative workflows and standardized computational backends in a containerized environment, enabling analysts to focus on biological interpretation rather than technical overhead. SC-Framework is available at GitHub (https://github.com/loosolab/SC-Framework).
bioinformatics2025-11-13v1SCOUT: Ornstein-Uhlenbeck modelling of gene expression evolution on single-cell lineage trees
Stuart, H.; McKenna, A.AI Summary
- SCOUT uses Ornstein-Uhlenbeck processes to model gene expression evolution on single-cell lineage trees, distinguishing between neutral drift and selective pressure.
- Simulations confirmed SCOUT's ability to classify genes accurately based on their evolutionary dynamics.
- Applied to C. elegans and a lung adenocarcinoma model, SCOUT identified genes under selection during development and key regulators in metastatic progression, respectively.
Abstract
Understanding the evolutionary dynamics of clonal populations is essential for uncovering the principles of development, disease progression, and therapeutic resistance. Recent advances in single-cell lineage tracing and transcriptomics enable such analyses by combining heritable barcodes with cell-state information. Here, we present SCOUT (single-cell Ornstein-Uhlenbeck trees), a framework that models gene expression dynamics along single-cell lineage trees using Ornstein-Uhlenbeck processes to distinguish neutral drift from selective pressure. Using simulations, we demonstrate that SCOUT accurately classifies genes based on their underlying evolutionary models. We further validate SCOUT in Caenorhabditis elegans development, identifying biological processes under selection across distinct developmental contexts. Finally, we apply SCOUT to a lung adenocarcinoma xenograft model, revealing key regulators of metastatic progression and tumor microenvironmental adaptation. By integrating lineage and transcriptomic data, SCOUT provides a powerful evolutionary lens for dissecting the forces that shape cell fate.
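The Ornstein-Uhlenbeck process at the heart of this framework has an exact transition density that makes its behavior easy to see: a strong pull (large alpha) draws expression toward an optimum theta, while alpha near zero recovers neutral Brownian drift. The snippet below is generic OU math, not SCOUT's code; branch lengths and parameters are invented.

```python
# Exact OU transition along branch segments: selection vs. ~neutral drift.
import numpy as np

def ou_step(x, dt, alpha, theta, sigma, rng):
    """Sample x(t+dt) | x(t) from the exact OU transition density."""
    mean = theta + (x - theta) * np.exp(-alpha * dt)
    var = sigma**2 / (2 * alpha) * (1 - np.exp(-2 * alpha * dt))
    return rng.normal(mean, np.sqrt(var))

rng = np.random.default_rng(7)
for alpha, label in [(5.0, "selection"), (1e-6, "~neutral drift")]:
    x = np.full(1000, -3.0)                 # ancestral expression, far from the optimum
    for _ in range(50):                     # 50 branch segments of length dt
        x = ou_step(x, dt=0.1, alpha=alpha, theta=0.0, sigma=1.0, rng=rng)
    print(label, round(x.mean(), 2), round(x.std(), 2))
```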
bioinformatics2025-11-13v1Elucidating Neurodevelopmental Trajectories in Cancer with Topic Modeling: Revealing Persistent External Granule Layer Lineages in Medulloblastoma
Rajendran, A.; Haldipur, P.; Arora, S.; Grama, K.; Subramanian, S. S.; Galan, L. M.; Johnson, D.; Aldinger, K. A.; Shendure, J.; Millen, K. J.; Gennari, J. H.; Pattwell, S. S.AI Summary
- Researchers used topic modeling on over one million fetal cerebellar nuclei to study neurodevelopmental trajectories in cancer, focusing on medulloblastoma.
- They identified proliferative states from the rhombic lip and external granule layer (EGL) that differentiate into glial and neuronal lineages, capturing developmental stages from outer to inner EGL.
- The study confirmed that these developmental signatures persist in Sonic hedgehog (SHH) medulloblastoma, validating the granule neuron precursor origins and revealing age-specific molecular programs within SHH subtypes.
Abstract
The cerebellar rhombic lip generates cerebellar progenitors and neurons that ultimately differentiate to comprise over half of all neurons in the adult human brain. Standard clustering approaches often fragment or miss rhombic lip progenitor populations entirely due to their transient nature, small size, and rapid state transitions, leaving fundamental questions unanswered about normal cerebellar development and how such processes may be hijacked in pediatric brain cancer. Medulloblastoma, the most common malignant pediatric brain tumor, affects approximately 500 children annually in the United States, with overall survival rates varying dramatically by subgroup. Sonic hedgehog (SHH) medulloblastoma, comprising 25-30% of cases, arises from rhombic lip-derived granule neuron precursors (GNP) within the external granule layer (EGL) and has particularly poor outcomes in several subtypes (5-year survival ~41%). Using our topic modeling framework on over one million fetal cerebellar nuclei, we identify proliferative rhombic lip and EGL states that bifurcate into distinct glial and neuronal lineages through intermediate progenitors and capture a portion of the developmental spectrum from outer EGL (oEGL) proliferation through inner EGL (iEGL) differentiation. These developmental signatures (topics) persist in medulloblastoma, validating GNP origins of SHH tumors and revealing age-specific molecular programs that correspond to distinct stages of EGL development within SHH subtypes. Our transferable framework enables systematic comparison of developmental and disease states across technologies without data integration, solving a fundamental challenge as genomic atlases expand.
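For readers unfamiliar with topic modeling on expression data, the generic setup treats each cell's gene counts as a document over gene "words": topics become expression programs and each cell gets topic weights. The sketch below uses sklearn's LDA on simulated counts and is a generic illustration, not the authors' framework.

```python
# Topic modeling of a cells x genes count matrix with LDA (generic illustration).
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(9)
counts = rng.poisson(1.0, size=(500, 2000))       # stand-in cells x genes counts

lda = LatentDirichletAllocation(n_components=10, random_state=0)
cell_topics = lda.fit_transform(counts)            # (500, 10): topic weights per cell
top_genes = np.argsort(lda.components_, axis=1)[:, -20:]  # 20 top genes per topic
print(cell_topics.shape, top_genes.shape)
```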
bioinformatics2025-11-13v1Comparative pangenomics unveils distinct host adaptation levels and conserved biosynthetic potential in microbiome Clostridia
De Vrieze, L.; Aerts, J.; Masschelein, J.AI Summary
- This study compared the evolutionary trajectories and metabolic capabilities of Clostridia across different orders to understand host adaptation.
- Findings showed that Oscillospirales are highly specialized for host-associated lifestyles, while Lachnospirales retain more free-living traits and metabolic versatility.
- The research also identified conserved biosynthetic gene clusters, suggesting significant untapped potential in Clostridia.
Abstract
Bacterial species adopt various lifestyles to thrive in diverse ecological niches, ranging from living freely in the soil to being part of various human and animal microflora. However, the impact of these adaptation processes on their genomes and metabolisms remains largely unexplored beyond the genus level. Investigating these evolutionary dynamics at higher taxonomic levels can enhance our understanding of the relationship between host adaptation and functional capabilities. In this regard, we examined the evolutionary trajectories and metabolic capabilities of the Clostridia class, whose members display a variety of lifestyles and are of high importance for industry, medicine and microbiome research. First, we uncover that the clostridial orders have significantly different adaptation rates. Second, we show that the Oscillospirales order has undergone extensive genomic and functional specialisation toward a host-associated lifestyle, while the Lachnospirales order tends to be at a lower level of host association, retaining a remarkably high number of free-living trait genes and a high degree of metabolic versatility. Third, we reveal substantial differences in genomic architecture and metabolic versatility between the clostridial orders and link these to the progressing stages of host adaptation. Additionally, we identify widely conserved biosynthetic gene clusters, highlighting untapped biosynthetic potential of evolutionary significance. Hence, the beyond-genus-level analyses in this study provide valuable new insights with implications for the biological sciences and biotechnology.
bioinformatics2025-11-12v3