Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
FroM Superstring to Indexing: a space-efficient index for unconstrained k-mer sets using the Masked Burrows-Wheeler Transform (MBWT)
Sladky, O.; Vesely, P.; Brinda, K.
AI Summary
- The study introduces FMSI, a superstring-based index for arbitrary k-mer sets, utilizing the Masked Burrows-Wheeler Transform (MBWT) to enhance efficiency in genomic data indexing.
- FMSI offers superior space efficiency, using 2-3x less memory than existing methods like SBWT and SSHash, while maintaining competitive query times.
- Across various k values and dataset types, FMSI proves to be a robust, scalable, and versatile framework for bioinformatics applications.
Abstract
The growing volumes and heterogeneity of genomic data call for scalable and versatile k-mer-set indexes. However, state-of-the-art indexes such as Spectral Burrows-Wheeler Transform (SBWT) and SSHash depend on long non-branching paths in de Bruijn graphs, which limits their efficiency for small k, sampled data, or high-diversity settings. Here, we introduce FMSI, a superstring-based index for arbitrary k-mer sets that supports efficient membership and compressed dictionary queries with strong theoretical guarantees. FMSI builds on recent advances in k-mer superstrings and uses the Masked Burrows-Wheeler Transform (MBWT), a novel extension of the classical BWT that incorporates position masking. Across a range of k values and dataset types - including genomic, pangenomic, and metagenomic - FMSI consistently achieves superior query space efficiency, using up to 2-3x less memory than state-of-the-art methods, while maintaining competitive query times. Only a space-optimized version of SBWT can match FMSI's footprint in some cases, but then FMSI is 2-3x faster. Our results establish superstring-based indexing as a robust, scalable, and versatile framework for arbitrary k-mer sets across diverse bioinformatics applications.
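As a rough illustration of what "position masking" over a superstring can mean, the sketch below decodes a k-mer set from a superstring paired with a binary mask, where a 1 at position i marks the k-mer starting at i as a member of the set. This is a toy reading of the masked-superstring idea, not FMSI or the MBWT itself; the function name, the string-encoded mask, and the example data are assumptions made for illustration, and reverse complements are ignored.

```python
def kmers_from_masked_superstring(superstring, mask, k):
    """Return the set of k-mers whose start positions are marked '1' in the mask.

    Hypothetical helper for illustration only: `superstring` and `mask` are
    plain strings of equal length, and `mask` contains only '0' and '1'.
    """
    if len(superstring) != len(mask):
        raise ValueError("superstring and mask must have the same length")
    if k <= 0 or k > len(superstring):
        raise ValueError("k must be between 1 and the superstring length")
    kmers = set()
    # Only positions 0 .. len - k can start a full k-mer.
    for i in range(len(superstring) - k + 1):
        if mask[i] == "1":
            kmers.add(superstring[i:i + k])
    return kmers


# Toy example: the superstring ACGTAC contains the 3-mers ACG, CGT, GTA, TAC.
# The mask switches GTA off, so only the other three are represented.
represented = kmers_from_masked_superstring("ACGTAC", "110100", 3)
```

The point of the toy is that the superstring may contain k-mers that are not in the set (here GTA); the mask is what distinguishes represented k-mers from incidental ones.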
bioinformatics 2025-10-01 v3
A statistical framework for defining synergistic anticancer drug interactions
Dias, D.; Zobolas, J.; Ianevski, A.; Aittokallio, T.
AI Summary
- This study developed a statistical framework to identify synergistic anticancer drug interactions by establishing reference null distributions from a large dataset of over 2,000 drug combinations across 125 cancer cell lines.
- The framework uses empirical p-values to assess the significance of drug combination effects, confirming known synergistic combinations and revealing novel ones.
- The approach was validated on an independent dataset, demonstrating its applicability to smaller-scale studies.
Abstract
Synergistic drug combinations have the potential to delay drug resistance and improve clinical outcomes. However, current cell-based screens lack robust statistical assessment to identify significant synergistic interactions for downstream experimental or clinical validation. Leveraging a large-scale dataset that systematically evaluated more than 2,000 drug combinations across 125 pan-cancer cell lines, we established reference null distributions separately for various synergy metrics and cancer types. These data-driven reference distributions enable estimation of empirical p-values to assess the significance of observed drug combination effects, thereby standardizing synergy detection in future studies. The statistical evaluation confirmed key synergistic combinations and uncovered novel combination effects that met stringent statistical criteria, yet were overlooked in the original analyses. We revealed cell context-specific drug combination effects across the tissue types and differences in statistical behavior of the synergy metrics. To demonstrate the general applicability of our approach to smaller-scale studies, we applied the reference distributions to evaluate the significance of combination effects in an independent dataset. We provide a fast and statistically rigorous approach to detecting synergistic drug interactions in combinatorial screens.
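The empirical p-value idea described above can be sketched as follows: an observed synergy score is compared against a reference null distribution, and the p-value is the fraction of null scores at least as extreme. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name and the toy null values are hypothetical, the test is taken to be one-sided (higher score means more synergy), and the +1 correction is a common convention that the abstract does not specify.

```python
def empirical_p_value(observed, null_scores):
    """One-sided empirical p-value of `observed` against a reference null.

    Counts null scores >= observed and applies a +1 correction so the
    p-value is never exactly zero (an assumed convention, see lead-in).
    """
    null_scores = list(null_scores)
    if not null_scores:
        raise ValueError("null_scores must not be empty")
    at_least_as_extreme = sum(1 for s in null_scores if s >= observed)
    return (at_least_as_extreme + 1) / (len(null_scores) + 1)


# Toy reference distribution for one synergy metric and one cancer type
# (made-up numbers, for illustration only).
toy_null = [-3.0, -1.0, 0.0, 0.5, 1.0, 2.0, 2.5, 4.0, 5.0]
p = empirical_p_value(4.5, toy_null)
```

In practice one reference distribution would be kept per synergy metric and cancer type, as the abstract describes, and the resulting p-values would typically be corrected for multiple testing.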
bioinformatics 2025-10-01 v2
Benchmarking generative AI tools for literature retrieval and summarization in genomic variant interpretation.
Gazzo, A. M.; Berardelli, S.; Biancospino, M.; Cuollo, L.; Dei Zotti, F.; Ferraro, E.; Marra, A.; Tartarotti, E.; Magni, P.
AI Summary
- This study evaluated five generative AI platforms (ChatGPT, MistralAI, VarChat, Perplexity, ScholarAI) for their ability to summarize literature on human genomic variants, focusing on accuracy and real-world usability.
- VarChat was identified as the top performer in summarization accuracy, citation relevance, and resistance to hallucinations, with GPT-4o as a close second.
- The performance of these tools was significantly affected by the availability of peer-reviewed literature, highlighting the need for expert validation and domain-specific fine-tuning for reliable integration into research.
Abstract
Background: Generative AI is increasingly used to extract structured information across domains, but its reliability in academic and clinical research, where precision and accuracy are essential, remains largely unexplored. This study evaluates the ability of Large Language Model (LLM)-based algorithms to generate accurate, literature-based summaries of human genomic variants, with a focus on real-world usability. Results: We benchmarked five open-access generative AI platforms (ChatGPT, MistralAI, VarChat, Perplexity, and ScholarAI) across 40 curated variants equally divided between somatic and germline settings. For each variant, summary reports were generated and blindly evaluated by domain experts using five defined metrics. VarChat emerged as the top-ranked tool, showing the highest summarization accuracy, citation relevance, and robustness against hallucinations. GPT-4o consistently ranked second, showing particularly stable robustness in conditions where the literature was scarce. Perplexity and ScholarAI, despite being literature-focused, ranked lowest across most metrics. Tool performance was strongly influenced by the availability of peer-reviewed literature, confirming that current generative models remain sensitive to data scarcity. Conclusions: Our findings highlight the heterogeneity of current generative AI tools in genomic variant interpretation workflows. While some platforms already provide useful outputs, reliable integration into basic and clinical research requires expert validation and domain-related fine-tuning. This work provides, for the first time, a curated benchmark for assessing LLM-generated content in variant genomics and underscores the need for caution when using these tools to support variant interpretation.
bioinformatics 2025-10-01 v1
Exploring brain lobe-specific insights in an explainable framework for EEG-based schizophrenia detection
Hossain, M. M.; Tawhid, M. N. A.
AI Summary
- This study developed an EEG-based framework using mel-spectrogram images and CNNs to detect schizophrenia, focusing on brain lobe-specific insights.
- The framework achieved high accuracies of 99.82% and 98.31% on two datasets, with the frontal lobe showing the highest diagnostic significance.
- Explainability was enhanced using LIME, SHAP, and Grad-CAM, providing insights into the critical brain regions for schizophrenia diagnosis.
Abstract
Schizophrenia (ScZ) is a growing global health concern that affects millions of people and puts severe pressure on healthcare systems. Early detection and accurate diagnosis are crucial for adequate management. Electroencephalography (EEG) has evolved into a promising non-invasive tool for detecting ScZ in contemporary research. However, current EEG-based diagnostic methods often cannot identify specific biomarkers, especially those related to brain lobes. Different brain lobes are associated with distinct cognitive functions and patterns of diseases. There is also a gap in the incorporation of explainable AI (XAI) techniques, as medical diagnosis needs trustworthiness and explainability. This study strives to address these gaps by developing a framework using mel-spectrogram images with Convolutional Neural Networks (CNNs). EEG signals are converted into mel-spectrogram images using the Short-Time Fourier Transform (STFT). These images are then analyzed using a CNN model to classify ScZ versus healthy control (HC). To identify the most critical brain regions, the whole brain is divided into five regions, and the same classification process is performed on each. The performance of the proposed framework is evaluated using two publicly available EEG datasets, repOD and the Kaggle basic sensory task dataset, on which it achieves accuracies of 99.82% and 98.31%, respectively. Among regions, the frontal lobe performs best, with accuracies of 97.02% and 88.03%, respectively, on these datasets, followed by the temporal lobe. Conversely, the occipital lobe shows the lowest accuracy among lobes, with only 79.30% and 68.33%, respectively, indicating its lower significance in the diagnosis. To make the results explainable, the LIME, SHAP, and Grad-CAM methods are applied, providing valuable insights for clinicians and researchers.
These findings emphasize the potential of EEG-based brain lobe analysis in enhancing ScZ detection, diagnostic accuracy, explainability, and clinical guidance.
bioinformatics 2025-10-01 v1
Correcting Non-Uniform Milling in FIB-SEM Images with Unsupervised Cross-Plane Image-to-Image Translation
Li, Y.; Kreinin, Y.; Huang, S.; Schomburg, E. W.; Chklovskii, D. B.; Pfister, H.; Wu, J.
AI Summary
- The study addresses non-uniform milling in FIB-SEM images by developing an unsupervised deep learning method for image-to-image translation.
- This method corrects distortions in 3D image volumes without needing ground truth annotations.
- Testing on a micro-wasp dataset showed significant improvements in image quality, confirmed by qualitative and quantitative analysis.
Abstract
Motivation: Focused Ion Beam Scanning Electron Microscopy (FIB-SEM) is an advanced Volume Electron Microscopy technology with growing applications, featuring thinner sectioning compared to other Volume Electron Microscopes. Such axial resolution is crucial for accurate segmentation and reconstruction of fine structures in biological tissues. However, in reality, the milling thickness is not always uniform across the sample surface, resulting in the axial plane looking distorted. Existing image processing approaches often: (i) assume constant section thickness; (ii) consist of multiple separate processing steps (i.e., not in an end-to-end fashion); (iii) require ground truth images for modeling, which may entail significant labor and be unsuitable for rapid analysis. Results: We develop a deep learning method to correct non-uniform milling artifacts observed in FIB-SEM images. The proposed method is an image-to-image translation technique that can mitigate image distortions in an unsupervised manner. It conducts cross-plane learning within 3D image volumes without any ground truth annotations. We demonstrate the efficacy of our method on a real-world micro-wasp dataset, showcasing significantly improved image quality after correction with qualitative and quantitative analysis.
bioinformatics 2025-10-01 v1
Functional Annotation of Novel Heat Stress-responsive Genes in Rice Utilizing Public Transcriptomes and Structurome
Yonezawa, S.; Bono, H.
AI Summary
- This study aimed to functionally annotate novel heat stress-responsive genes in rice by integrating public transcriptome data with structural information from AlphaFold.
- A meta-analysis identified gene groups, followed by structural and sequence alignment between rice and human proteins under low sequence-high structural similarity conditions.
- Key findings included the identification of genes related to metal homeostasis, particularly iron and copper metabolism, providing insights into their functions.
Abstract
Life science databases include large collections of public transcriptome and large-scale structural data. The reuse and integration of these datasets may facilitate the identification of understudied genes and enable functional annotation across distantly related species, including plants and humans. In this study, we used heat stress-responsive genes in rice as a model to functionally annotate previously understudied genes by integrating publicly available transcriptome data with structural information from the AlphaFold Protein Structure Database. Initially, we conducted a meta-analysis of public heat stress-related transcriptome datasets, identified gene groups, and verified stress-related terms through enrichment analysis. Subsequently, we performed structural alignment and sequence alignment between rice and human proteins, focusing on candidates exhibiting low sequence similarity but high structural similarity (LS-HS conditions). We further incorporated supplemental data from public databases, including shared domain information between rice and human. This approach yielded a unique set of LS-HS candidates, notably those associated with metal homeostasis, such as iron and copper metabolism. Overall, our integrative method provided insights into these genes by leveraging diverse, publicly available datasets.
bioinformatics 2025-10-01 v1
Rapid, accurate long- and short-read mapping to large pangenome graphs with vg Giraffe
Chang, X.; Novak, A. M.; Eizenga, J. M.; Siren, J.; Monlong, J.; Negi, S.; Andreace, F.; Nag, S.; Kyriakidis, K.; Hickey, G.; Hwang, S.; Delot, E. C.; Carroll, A.; Shafin, M. K.; Chang, P.-C.; Okamoto, F.; Paten, B.; the Human Pangenome Reference Consortium
AI Summary
- The study presents updates to Giraffe, enhancing its ability to map both short and long reads to large pangenome graphs.
- Giraffe now maps reads to a pangenome with over 450 human haplotypes as quickly as linear mappers to reference genomes, and is significantly faster than GraphAligner.
- The updated Giraffe improves variant calling and supports a pangenome-guided assembly workflow, producing more contiguous assemblies than Hifiasm.
Abstract
We previously introduced Giraffe, a short-read-to-pangenome graph mapper available in the vg pangenomics toolkit. Giraffe was fast and accurate for mapping short reads to human-scale pangenomes, but struggled with long reads. Long reads present a unique challenge to pangenome mapping algorithms due to their length and error profile, which allow them to take more topologically complex paths through the pangenome graph and increase the possible search space for the algorithm. We present updates to Giraffe that allow it to quickly and accurately map long reads to pangenome graphs. For both short and long reads, Giraffe mapping to a pangenome containing data from more than 450 human haplotypes, generated by the Human Pangenome Reference Consortium, is comparable in speed to linear mappers to human reference genomes; Giraffe is also over an order of magnitude faster than GraphAligner, the current state-of-the-art long-read-to-pangenome mapper. Its alignments produce similar or improved small and structural variant calling results, compared to those from commonly used graph-based and linear mappers. We additionally demonstrate using Giraffe's long read alignments in a pangenome-guided assembly workflow, which is capable of producing more contiguous local assemblies than Hifiasm in our test regions.
bioinformatics 2025-10-01 v1
T cell-microbiome associations captured through T cell receptor convergence analysis
Vandoren, R.; Ha, M. K.; Van Deuren, V. M. L.; De Roeck, N.; Pu, T.; Brand, E. C.; Kuznetsova, M.; Besbassi, H.; Bartholomeus, E.; Affaticati, F.; De Boeck, I.; Gehrmann, T.; Lebeer, S.; Oldenburg, B.; van Wijk, F.; Delputte, P.; Verbandt, S.; Tejpar, S.; Laukens, K.; Ogunjimi, B.; Meysman, P.
AI Summary
- The study investigates how specific bacterial genera influence T cell receptor (TCR) diversity using AIRRWAS, a computational framework integrating TCR-microbiome analysis.
- Applied to three cohorts, AIRRWAS identified associations between TCR clusters and 21 bacterial genera, including core commensals and probiotics.
- The findings showed that predicted TCR clonotypes were enriched in the TCR-microbiome network and responded to genus-specific stimuli, suggesting shared immune signatures across distinct repertoires.
Abstract
The gut microbiome modulates mucosal immunity, yet how specific bacterial taxa shape the diversity and specificity of T cell receptor (TCR) repertoires remains poorly understood. Existing approaches emphasize single-species effects or broad immune features, without pinpointing which microbes drive specific T cell clonotypes. We present AIRRWAS, a computational framework that integrates TCR-microbiome interaction analysis with targeted in vitro validation to detect genus-level TCR convergence. Applied to three independent cohorts, AIRRWAS identified reproducible associations between convergent TCR clusters and 21 bacterial genera spanning core commensals, probiotics, and taxa with immunomodulatory roles. Predicted clonotypes were enriched within the TCR-microbiome interaction network and preferentially activated by genus-matched stimuli, eliciting different functional T cell responses. These findings demonstrate that distinct repertoires can share genus-specific TCR motifs, enabling detection of shared immune signatures. AIRRWAS can map these TCR-microbiome interactions, laying the groundwork for biomarker discovery, immune monitoring, and the development of microbiome-targeted therapies.
bioinformatics 2025-10-01 v1
FUSED: CROSS-DOMAIN INTEGRATION OF FOUNDATION MODELS FOR CANCER DRUG RESPONSE PREDICTION
Rössner, T.; Balke, J.; Tang, M.
AI Summary
- The study introduces FUSED, a novel architecture for integrating molecular and single-cell foundation models (FMs) to predict cancer drug responses.
- By benchmarking different FMs, Molformer and scGPT were found to outperform ChemBERTa and scFoundation respectively in predictive accuracy.
- Integrating single-cell FMs significantly reduces required input features and improves prediction performance in both known and novel drug scenarios.
Abstract
AI-driven methods for predicting drug responses hold promise for advancing personalized cancer therapy, but cancer heterogeneity and the high cost of data generation pose substantial challenges. Here we explore the transfer learning capability of foundation models (FMs) and introduce FUSED (Fusion of Foundation Model Embeddings for Drug Response Prediction), a novel architecture for cross-domain FM integration. By systematically benchmarking FMs across two domains, molecular FMs for drugs and single-cell FMs for cell lines, we demonstrate that integrating single-cell FMs substantially reduces the number of input features required for cell line representation. Among FMs, Molformer significantly outperforms ChemBERTa, and scGPT surpasses scFoundation in predictive accuracy and training stability. Moreover, integrating single-cell FMs improves performance in both drug-known and leave-one-drug-out scenarios. These findings highlight the potential of cross-domain FM integration for more efficient and robust drug response prediction.
bioinformatics 2025-10-01 v1
Target-site Dynamics and Alternative Polyadenylation Explain Large Share of Apparent MicroRNA Differential Expression
Cihan, M.; More, P.; Sprang, M.; Marini, F.; Andrade, M.
AI Summary
- The study introduces MIRNAPEX, a machine learning framework that integrates target-gene expression and 3'UTR isoform usage to quantify miRNA regulatory activity from RNA-seq data.
- Using pan-cancer datasets, MIRNAPEX showed that alternative polyadenylation (APA) significantly enhances the prediction of miRNA differential expression beyond gene expression alone.
- The framework demonstrated that changes in miRNA abundance can result from APA-driven alterations in target-site availability, rather than changes in miRNA transcription, highlighting the importance of considering APA in miRNA expression analysis.
Abstract
MicroRNA (miRNA) abundance reflects a dynamic balance between biogenesis, target engagement and decay, yet differential expression (DE) analyses typically ignore changes in target-site availability driven by alternative polyadenylation (APA). We introduce MIRNAPEX, an interpretable expression-stratification-based machine learning framework that quantifies the effect size of miRNA regulatory activity from RNA-seq by integrating target-gene expression with 3'UTR isoform usage to infer binding-site dosage. Using pan-cancer training sets, we fit regularized linear models to learn robust relationships between transcriptomic features and miRNA log-fold changes, with APA patterns adding clear predictive power beyond expression alone. When applied to knockdowns of core APA regulators, MIRNAPEX captured widespread 3'UTR shortening and correctly anticipated distinct, miRNA-specific shifts whose direction and magnitude mirrored the APA-driven change in site availability. Analysis of target-directed miRNA degradation interactions further showed that loss of distal decay-trigger sites coincides with higher miRNA abundance, consistent with a reduced degradation rate. Together these findings reveal that apparent DE of miRNAs can arise from dynamic changes in target-site landscapes rather than altered miRNA transcription, and that ignoring this aspect in conventional analysis workflows can lead to misestimation of the true effect size of gene-expression regulation.
bioinformatics 2025-10-01 v1
AlphaMissense pathogenicity scores predict response to immunotherapy and enhances the predictive capability of tumor mutation burden
Adeleke, D.; Jansen, R.; Rahul, G.; Fadaka, A. O.
AI Summary
- The study introduces AlphaTMB, a biomarker combining Tumor Mutational Burden (TMB) with the pathogenicity of mutations assessed by AlphaMissense, to predict response to immune checkpoint inhibitor (ICI) therapy in cancer patients.
- Using data from 1,662 patients, AlphaTMB showed a strong correlation with TMB but provided superior prognostic accuracy, with high AlphaTMB patients having significantly better survival outcomes.
- AlphaTMB reclassified patients, identifying those with low TMB but high deleterious mutation load, and was enriched in mutations associated with immunotherapy responsiveness.
Abstract
Tumor Mutational Burden (TMB) is a widely used biomarker for selecting cancer patients for immune checkpoint inhibitor (ICI) therapy. However, TMB alone has limited predictive power, as it fails to account for the functional impact of mutations. We introduce AlphaTMB, a composite biomarker that integrates the quantity of mutations (TMB) with the qualitative assessment of their pathogenicity using AlphaMissense, a deep learning model that predicts the deleteriousness of missense variants. Using a pan-cancer cohort of 1,662 patients from the MSK-IMPACT study who received ICI therapy, we computed three scores per patient: TMB, Alpha (sum of AlphaMissense scores), and AlphaTMB (product of TMB and Alpha). Patients were stratified using both cancer-specific and pan-cancer quantiles. Survival outcomes were evaluated using Kaplan-Meier and multivariate Cox proportional hazards models, controlling for cancer type, age, and ICI regimen. AlphaTMB showed strong correlation with TMB (Spearman ρ = 0.866, p < 0.001), but offered improved prognostic accuracy. Patients in the bottom 80% AlphaTMB group had significantly poorer survival than those in the top 10% (HR < 2.51, p < 0.001), outperforming TMB and Alpha alone. AlphaTMB reclassified borderline cases, identifying subsets with low TMB but high deleterious mutation load, and vice versa. Gene mutation heatmaps and co-occurrence analysis confirmed that top 10% AlphaTMB-high tumors were enriched in mismatch repair and POLE mutations, reflecting a neoantigen-rich, immunotherapy-responsive phenotype. AlphaTMB improves survival prediction beyond TMB alone, better captures immunogenic tumor profiles, and reflects more accurate patient stratification. This AI-derived pathogenicity scoring of somatic mutations represents a step toward personalized immuno-oncology and merits further validation in prospective studies.
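The three per-patient scores defined in the abstract can be written out directly: Alpha is the sum of the AlphaMissense scores of a patient's missense variants, and AlphaTMB is the product of TMB and Alpha. The sketch below assumes TMB is supplied as a precomputed number; the function name, the dictionary keys, and the example values are hypothetical, and the abstract does not say how variants without an AlphaMissense score are handled.

```python
def alpha_tmb_scores(tmb, alphamissense_scores):
    """Compute TMB, Alpha, and AlphaTMB for one patient.

    tmb: precomputed tumor mutational burden (assumed given as a number).
    alphamissense_scores: iterable of per-variant AlphaMissense
        pathogenicity scores for that patient's missense mutations.
    """
    alpha = sum(alphamissense_scores)   # Alpha = sum of AlphaMissense scores
    return {
        "TMB": tmb,
        "Alpha": alpha,
        "AlphaTMB": tmb * alpha,        # AlphaTMB = TMB x Alpha
    }


# Made-up patient: TMB of 10.0 and three scored missense variants.
scores = alpha_tmb_scores(10.0, [0.5, 0.25, 0.25])
```

Because AlphaTMB multiplies the two quantities, a tumor with few but highly pathogenic mutations can rank above one with many benign mutations, which is consistent with the reclassification of borderline cases the abstract reports.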
bioinformatics 2025-10-01 v1
Protein Dimension DB: A Unified Protein Repository for Representation Learning and Functional Analysis
Alves Sobrinho, P. d. A.; Sakamoto, T.; Figuerola, W. B.
AI Summary
- Researchers developed Protein Dimension DB to address computational demands of large protein language models (PLMs) by providing precomputed protein embeddings from seven state-of-the-art PLMs for Swiss-Prot/UniProt proteins.
- The database was benchmarked for molecular function prediction, showing that hybrid embeddings outperformed single-model approaches, with taxonomic encodings enhancing performance by 2.9% AUPRC.
- Protein Dimension DB offers embeddings in Parquet format, reducing storage needs and enabling use in resource-limited settings, with datasets accessible via GitHub and HuggingFace for various biological research applications.
Abstract
Inspired by the success of large language models in areas like natural language processing, researchers have applied similar architectures, notably the Transformer, to protein sequences. Thanks to these developments, Protein Language Models (PLMs) have become important resources for diverse tasks such as predicting protein family, function, solubility, cellular location, molecular interactions and remote homology. However, the size of the best performing PLMs (which can be up to 15B parameters) requires substantial computational power. Protein Dimension DB addresses this critical bottleneck by providing a centralized, version-controlled resource of precomputed protein embeddings, experimentally validated molecular function annotations, and taxonomic encodings. The database integrates embeddings from seven state-of-the-art PLMs, including ProtT5, ESM2, and Ankh variants, for all Swiss-Prot/UniProt proteins. These models were compared by benchmarking molecular function prediction. Tests revealed that hybrid embeddings (e.g., Ankh Base + ProtT5) outperformed single-model approaches with minimal dimensionality increases. Taxonomic encodings further boosted performance by 2.9% AUPRC, demonstrating lineage-aware learning. By providing embeddings in Parquet format - a columnar storage optimized for machine learning workflows - the resource eliminates GPU-dependent preprocessing and reduces storage requirements. This enables immediate use in resource-constrained environments while maintaining backward compatibility through versioned releases. All datasets are freely accessible via GitHub and HuggingFace, with unified metadata enabling applications from functional annotation to evolutionary studies. Protein Dimension DB bridges the gap between cutting-edge PLMs and practical biological research, offering researchers standardized inputs for reproducible, multi-modal protein analysis.
bioinformatics 2025-10-01 v1
PLI Analyzer for Data-driven Validation of AI Predicted Biomolecular Interfaces
Liang, F.; Srinivasan, S.; Chang, H. Y.
AI Summary
- PLI-Analyzer is a tool for analyzing biomolecular interaction interfaces, integrating atomic-level contact detection with domain annotations.
- It was used to validate AI-predicted protein-protein complexes from AlphaFold-3 and Boltz-2, highlighting differences in predicted interfaces and confidence scores.
- The study revealed limitations in current AI models for predicting macromolecular complexes, emphasizing the need for domain-informed evaluation tools.
Abstract
PLI-Analyzer is a computational tool for detailed analysis of protein, RNA, and DNA interaction interfaces, combining atomic-level contact detection with domain annotations from UniProt and InterPro. It offers customizable interaction thresholds, residue-level output, and compatibility with AI-generated structures, enabling precise validation and interpretation of contact dynamics. We applied PLI-Analyzer to evaluate high-quality protein-protein complexes predicted by AlphaFold-3 and Boltz-2, revealing notable differences in predicted interfaces and confidence scores. These findings highlight current limitations in generative structure models specifically for macromolecular complexes and underscore the need for robust, domain-informed evaluation frameworks in structural bioinformatics.
bioinformatics 2025-10-01 v1
Mapping Structural Aging across Human Tissues reveals tissue-specific trajectories, coordinated deterioration and genetic determinants
Yadav, A.; Alvarez, K.; Yip, K.; Gomez-Lobo, V.; Ruppin, E.; Kumsta, C.; Sinha, S.
AI Summary
- Researchers developed PathStAR, a computational framework to analyze structural aging in tissues using histopathology images from 25,306 postmortem samples across 40 tissue types.
- PathStAR identified tissue-specific aging trajectories, revealing patterns like Early, Late, and Biphasic Aging, with shared aging hallmarks during accelerated phases.
- The study linked 123 genetic variants to accelerated structural aging, including SIRT6 variants, and found connections between systemic diseases and increased structural aging scores.
Abstract
Tissue structure, the organization of cells, vasculature and extracellular matrix, determines organ function. Yet how tissue structure changes with aging remains largely unknown. Current aging research primarily focuses on molecular changes, missing this structural dimension. Here, we present PathStAR, Pathology based Structural Aging Rate, the first computational framework that captures when and how tissue structure changes during aging from histopathology images. We applied it to 25,306 postmortem tissues covering 40 tissue types from individuals aged 21-70, connecting structural aging to molecular data, health records and genotype data. Without any training on chronological age, PathStAR captured non-linear functional decline of ovary, undetectable by bulk-molecular profiling. Applying it across 40 tissues, it revealed that structural aging occurs through discrete phases of rapid change (accelerated periods), with tissue-specific trajectories following three patterns: Early Aging Tissues (vascular system with major changes during the 30s), Late Aging Tissues (uterus and vagina with major changes during menopause (50s)) and Biphasic Aging Tissues (digestive, male reproductive tissues, and ovary with two periods of major changes). During these accelerated phases, most tissues exhibited shared aging hallmarks of inflammation and energy production decline, coupled with disruption of pathways governing their specialized functions. Cross-organ analysis revealed coordinated aging within organ systems and an unexpected link between digestive and male reproductive tissues. We next identified 123 germline variants associated with organ-specific accelerated structural aging, including SIRT6 variants linked to accelerated vascular decline. Finally, individuals with systemic autoimmune disease, as well as tissues with classical aging pathologies (atrophy, calcification, fibrosis), showed elevated structural aging scores.
We demonstrate that structural aging is measurable from histology scans and provide the first systematic framework for studying it, revealing organ-specific aging processes.
bioinformatics 2025-10-01 v1
GatorAffinity: Boosting Protein-Ligand Binding Affinity Prediction with Large-Scale Synthetic Structural Data
Wei, J.; Zhang, Y.; Ramdhan, P. A.; Huang, Z.; Seabra, G.; Jiang, Z.; Li, C.; Li, Y.
AI Summary
- The study addresses the challenge of data scarcity in protein-ligand binding affinity prediction by utilizing over 450,000 synthetic complexes with Kd and Ki values, and over 1 million from SAIR with IC50 values.
- GatorAffinity, a geometric deep learning model, was developed using this synthetic data, pre-trained and fine-tuned with experimental data from PDBbind.
- Evaluations showed GatorAffinity significantly outperforms existing methods, enhancing accuracy and generalizability in affinity prediction.
Abstract
Protein-ligand binding affinity prediction is a fundamental task in computational drug discovery. Although substantial efforts have been made to enhance prediction accuracy using data-driven approaches, progress remains limited by persistent data scarcity. The widely used PDBbind dataset, for example, contains fewer than 20,000 experimental structures with annotated binding affinities, while a vast number of affinity measurements remain underutilized due to missing structural data. Here, we investigate this untapped potential by curating more than 450,000 synthetic protein-ligand complexes annotated with Kd and Ki values using the Boltz-1 structure prediction model. Building on this unprecedented scale of synthetic data, further augmented with over 1 million synthetic complexes from the recently released SAIR database annotated with IC50 values, we develop GatorAffinity, a geometric deep learning-based scoring function pre-trained on large-scale synthetic data and fine-tuned using high-quality experimental structures from PDBbind. Extensive evaluation on a leak-proof benchmark demonstrates that GatorAffinity significantly outperforms state-of-the-art affinity prediction methods, offering superior accuracy and generalizability. Our findings show that augmenting available experimental data with synthetic complexes can effectively address the data scarcity challenge while maintaining strong predictive reliability. By releasing the pretrained GatorAffinity model and the large-scale synthetic dataset GatorAffinity-DB, we provide a scalable and reproducible foundation for affinity prediction, virtual screening, and broader structure-based drug design applications (https://github.com/AIDD-LiLab/GatorAffinity).
bioinformatics2025-10-01v1Disentangling covariate effects on single cell-resolved epigenomes with DeepDive
Moeller, A.; Madsen, J.AI Summary
- DeepDive is a deep learning framework designed to disentangle known and unknown sources of variation in single-nucleus ATAC-seq data, addressing multicollinearity issues.
- It outperforms existing methods by accurately reconstructing chromatin accessibility and recovering biological signals from entangled covariates.
- Applied to pancreatic islet cells, DeepDive enabled counter-factual analyses, identifying covariates linked to a type 2 diabetes-related beta cell subtype and potential transcription regulators.
Abstract
Understanding the effects of individual biological factors from single cell-resolved epigenomic data is hindered by multicollinearity, particularly in human cohorts. We introduce DeepDive, a novel deep learning framework designed to systematically disentangle known and unknown sources of variation in single-nucleus ATAC-seq data. DeepDive accurately reconstructs chromatin accessibility, outperforms state-of-the-art methods with incomplete covariate information, and robustly recovers true biological signals from even highly entangled covariates, unlocking counter-factual ("what-if") analyses. Applying DeepDive to pancreatic islet cells, we perform counter-factual analyses to prioritize covariates associated with a type 2 diabetes-linked beta cell subtype and nominate transcription regulators. DeepDive offers a powerful and unbiased tool for mechanistic discovery in complex human disease cohorts.
bioinformatics2025-10-01v1A unified analysis of cell-type and trajectory-associated pathways in single-cell data using Phoenix
Halperin, Y.; Nachmani, D.; Rabani, M.AI Summary
- Phoenix, a new pathway analysis framework, uses random forest models and non-parametric testing to identify cell-type and trajectory-associated pathways in single-cell RNA sequencing data.
- Applied to human, mouse, and zebrafish datasets, Phoenix identified both specific and shared pathways, outperforming existing tools in capturing cell-type-specific activities.
- It revealed complex non-linear gene interactions and provided insights into dynamic gene regulation across species.
Abstract
Single-cell RNA sequencing has transformed our ability to resolve complex cellular heterogeneity within biospecimens at the molecular level. However, identifying which biological pathways accurately reflect distinct cell types or continuous cellular trajectories remains a major challenge. Traditional methods often miss subtle or non-linear pathway activities, limiting biological interpretability and insights. To address this, we developed Phoenix, a pathway analysis framework that leverages random forest models and non-parametric significance testing to evaluate the relevance of functional gene sets for cell-type classification and pseudotemporal cellular trajectories. Phoenix reveals both up- and down-regulated processes, including those shaped by complex non-linear gene interactions, and quantifies their effect sizes. Applied to human and mouse hematopoiesis as well as zebrafish embryogenesis, Phoenix identified both cell-type-specific and trajectory-associated pathways, spanning housekeeping, developmental, and lineage-specific programs. It outperformed existing tools in capturing cell-type-specific activities and revealed greater overlap in pathway activities across species. By integrating statistical rigor with trajectory- and cell-type-aware analysis, Phoenix provides a sensitive, context-driven framework for uncovering biologically meaningful pathways in complex single-cell datasets, opening new opportunities to explore dynamic gene regulation across biological systems.
bioinformatics2025-10-01v1PathQC: Determining Molecular and Physical Integrity of Tissues from Histopathological Slides
Sinha, R. K.; Yadav, A.; Sinha, S.AI Summary
- PathQC is a deep learning framework designed to predict RNA Integrity Number (RIN) and autolysis from H&E-stained whole-slide images, addressing the limitations of destructive testing and manual evaluations.
- It uses a digital pathology foundation model (UNI) to extract morphological features, followed by a supervised model trained on the GTEx cohort of 25,306 samples.
- PathQC achieved correlations of 0.47 for RIN and 0.45 for autolysis, with higher performance in specific tissues like Adrenal Gland (R=0.82) for RIN and Colon (R=0.83) for autolysis.
Abstract
Quantifying tissue molecular and physical integrity is essential for biobank development. However, current assessment methods either involve destructive testing that depletes valuable biospecimens or rely on manual evaluations, which are not scalable and lead to interindividual variation. To overcome these challenges, we present PathQC, a deep learning framework that directly predicts the tissue RNA Integrity Number (RIN) and the extent of autolysis from hematoxylin and eosin (H&E)-stained whole-slide images of normal tissue biopsies. PathQC first extracts morphological features from the slide using a recently developed digital pathology foundation model (UNI), followed by a supervised model that learns to predict RIN and autolysis scores from these morphological features. PathQC is trained on and applied to the Genotype-Tissue Expression (GTEx) cohort, which comprises 25,306 non-diseased post-mortem samples across 29 tissues from 970 donors, where paired ground truth RIN and autolysis scores were available. Here, PathQC predicted RIN and autolysis score with average correlations of 0.47 and 0.45, respectively, with notably high performance in Adrenal Gland tissue (R=0.82) for RIN and in Colon tissue (R=0.83) for autolysis. We provide a pan-tissue model for the prediction of RIN and autolysis score for a new slide from any tissue type. Overall, PathQC will enable scalable measurement of molecular and physical integrity from routine H&E images, thereby enhancing the quality of both biobank generation and its retrospective analysis.
bioinformatics2025-10-01v1An Interpretable Deep Learning Framework for Biomarker Discovery in Complex Disease Survival Outcomes
Wan, S.; Mi, X.; Zou, F.; Zou, B.AI Summary
- The study introduces SurvDNN, a deep learning framework designed to identify biomarkers in complex disease survival outcomes, addressing challenges like non-linear interactions and high-dimensionality.
- SurvDNN uses bootstrapping for regularization and a stability-driven filtering algorithm for robustness, with PermFIT extended for survival data to quantify biomarker importance.
- Simulations and real-world data applications showed SurvDNN outperforming other methods in biomarker discovery and predictive accuracy.
Abstract
Identification of important biomarkers associated with complex disease survival outcomes is fundamental for gaining an in-depth understanding of disease mechanisms and advancing precision medicine in conditions such as cancer and cardiovascular disorders. However, these tasks are complicated by the unique nature of time-to-event data, which captures both the occurrence and timing of clinical events. Notably, complex associations such as the non-linear and non-additive biomarker interactions and the high-dimensionality challenge conventional survival data modeling approaches. To address these difficulties, we propose SurvDNN, an enhanced deep neural network framework specifically designed for survival outcomes modeling. SurvDNN incorporates a bootstrapping-based regularization strategy to mitigate overfitting and a novel stability-driven filtering algorithm to improve model robustness. To enable interpretable biomarker discovery, we extend the Permutation-based Feature Importance Test (PermFIT) to survival settings, allowing rigorous quantification of individual biomarker contributions under complex biomarker-outcome associations. Through extensive simulations and applications to real-world datasets, SurvDNN consistently outperforms existing machine learning approaches in both biomarker identification and predictive accuracy. Our results demonstrate the potential of SurvDNN coupled with PermFIT as an interpretable, robust, and powerful tool for biomarker-driven survival modeling in complex diseases. An open-source R package implementing SurvDNN is publicly available on GitHub (https://github.com/BZou-lab/SurvDNN).
bioinformatics2025-10-01v1scRGP: Prediction of Single-cell Genetic Perturbation Transcriptional Responses based on Rank in Multiple Scenarios
Liu, Y.; Zhang, H.; Xu, M.; Wang, D.; Hu, W.; Zhang, L.; Yang, Y.; Pian, C.; Chen, Y.AI Summary
- The study introduces scRGP, a deep learning framework for predicting transcriptional responses to single-cell genetic perturbations using rank-order gene expression data.
- scRGP addresses challenges of data explosion and complexity in single-cell perturbation sequencing by improving prediction accuracy for single, double, and triple gene perturbations.
- It outperforms existing methods by approximately 10-16 percentage points in Pearson correlation for single- and double-gene perturbation predictions and by 5-9 percentage points in cross-cell-line predictions, enhancing computational approaches in functional genomics.
Abstract
Single-cell perturbation sequencing technologies (e.g., Perturb-seq, CROP-seq), which integrate CRISPR-based gene editing with single-cell transcriptome profiling, have revolutionized the analysis of transcriptomic changes induced by genetic perturbations at single-cell resolution. These technologies serve as a powerful tool for identifying key genes that inhibit tumor growth or reverse cancer cell phenotypes. However, they face two major challenges: data explosion with high experimental costs, and data complexity characterized by high dimensionality, noise, sparsity, and heterogeneity. To address these challenges, we developed the single-cell Rank-based Genetic Perturbation predictor (scRGP), the first deep learning framework leveraging gene expression rank-order information for this task. scRGP demonstrates superior performance in terms of robustness, cross-cell-line perturbation prediction, and high-throughput screening. Specifically, scRGP achieves an improvement of approximately 10-16 percentage points in Pearson correlation coefficient (PCC) over state-of-the-art methods (e.g., GEARS and scFoundation) for single- and double-gene perturbation predictions, while also extending prediction capability to triple-gene perturbations. Furthermore, it outperforms these methods by approximately 5-9 percentage points in cross-cell-line predictions. These advancements promise to shift the paradigm of single-cell perturbation studies from experiment-driven to computation-driven approaches, providing new support for functional genomics and precision medicine.
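As a rough illustration of what "rank-order information" can mean for a single cell, the sketch below converts an expression profile into a list of genes ordered from highest to lowest expression. The abstract does not describe scRGP's actual encoding, so the function name, the decision to drop unexpressed genes, and the alphabetical tie-break are all assumptions made for this example.

```python
def rank_order_genes(expression_by_gene):
    """Return gene names ordered from highest to lowest expression.

    Hypothetical helper for illustration only; not scRGP's implementation.
    Genes with zero expression are dropped (an assumption), and ties are
    broken alphabetically so the output is deterministic.
    """
    expressed = [(gene, value) for gene, value in expression_by_gene.items() if value > 0]
    # Sort by descending expression, then by gene name for ties.
    expressed.sort(key=lambda gv: (-gv[1], gv[0]))
    return [gene for gene, _ in expressed]


# Toy expression profile for one cell (values are made up).
cell = {"GATA1": 5.0, "TP53": 0.0, "MYC": 12.5, "ACTB": 5.0}
ranked = rank_order_genes(cell)
```

A representation like this depends only on the relative ordering of genes within a cell, which is one reason rank-based inputs are often considered less sensitive to scaling differences between datasets.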
bioinformatics2025-10-01v1A structure and function-based complete mutational map of Human Hemoglobin using AI
Marti, M. A.; Salvatore, F.; Brunello, F. G.; Schuster, C. D.AI Summary
- This study aimed to understand how sequence variations affect the structure and function of human hemoglobin (HbA) by creating a comprehensive mutational map using AI.
- A dataset of HbA variants was curated, annotated with clinical classifications, and mapped to structural and evolutionary features to develop a pathogenicity prediction model.
- The model outperformed AlphaMissense, providing insights into variant effects and offering a framework for improving variant classification in other proteins.
Abstract
Hemoglobin (Hb), a well-characterized protein central to oxygen transport and molecular medicine, serves as a model for studying how sequence variations influence protein structure and function. Its precise activity depends on tightly regulated structural dynamics, which can be disrupted by mutations that give rise to structural hemoglobinopathies, including sickle cell disease, unstable hemoglobins, methemoglobins, and hemoglobins with altered oxygen affinity, each associated with distinct functional and clinical consequences. Among genetic variants, missense mutations are the most widely studied in clinical settings. Accurately predicting their clinical impact remains challenging, requiring integration of evolutionary, biochemical, and structural data. While broad deep learning models like AlphaMissense show promise, they often lack interpretability and protein-specific precision. This motivates the development of focused models that leverage detailed knowledge of individual proteins, like hemoglobin, to improve both predictive power and mechanistic understanding. In this work, we conducted a comprehensive analysis of all known and potential human adult hemoglobin (HbA) variants, guided by the hypothesis that a deep understanding of the sequence-structure-function relationship in Hb can yield interpretable and predictive insights into the functional and clinical consequences of single amino acid substitutions. We curated an updated dataset of HbA variants annotated with their clinical classifications, Benign, Pathogenic, or of Uncertain Significance (VUS), and systematically mapped each to a range of features, including structural location and classification, predicted impact on folding stability, and evolutionary conservation. Using this data, we developed a pathogenicity prediction model and benchmarked it against AlphaMissense, demonstrating strong and complementary performance.
Additionally, we generated a complete mutational landscape of all possible single amino acid substitutions (SAS) in HbA, providing a resource for future clinical interpretation. Our findings provide insight into the molecular basis for variant effects in HbA and highlight the utility of combining structure-informed features with Machine Learning (ML) for variant interpretation. Moreover, our results offer a framework for evaluating the portability and interpretability of variant effect predictors across structurally dynamic systems, with implications in the improvement of variant classification in other protein families.
bioinformatics2025-10-01v1Adversarial erasing enhanced multiple instance learning (siMILe): Discriminative identification of oligomeric protein structures in single molecule localization microscopy
Hallgrimson, C. D.; Li, Y. L.; Cardoen, B.; Lim, J.; Wong, T. H.; Khater, I. M.; Nabi, I. R.; Hamarneh, G.AI Summary
- The study introduces siMILe, a weakly-supervised machine learning method for identifying condition-specific changes in protein structures using SMLM data.
- siMILe uses multiple instance learning with adversarial erasing and a symmetric classifier to enhance structure classification without structure-level supervision.
- Validation on simulated data and PC3 prostate cancer cells showed siMILe's effectiveness in detecting caveolae and distinguishing different Cav1 oligomer structures based on cavin-1 expression.
Abstract
Single-molecule localization microscopy (SMLM) achieves nanoscale imaging of complex protein structures in the cell. However, the ability to capture structural variability across cell conditions (e.g., cell lines, gene expression, or treatment) from 3D point cloud SMLM data remains limited. We present siMILe, a novel weakly-supervised machine learning method based on multiple instance learning (MIL), leveraging shape and network features of protein assemblies, to close this important gap in interpretable subcellular discovery. siMILe identifies condition-specific changes in protein structures, without requiring structure-level supervision, and improves structure classification by extending embedded instance selection (MILES) through adversarial erasing and a symmetric classifier. siMILe is validated on simulated SMLM data and by detecting caveolae from caveolin-1 (Cav1) labeled PC3 prostate cancer cells differentially expressing cavin-1. In PC3-CAVIN1 cells dually labeled for Cav1 and cavin-1, cavin-1 closely associates with siMILe-identified caveolae, to a lesser extent with higher-order non-caveolar Cav1 scaffolds, but not with base Cav1 oligomers that correspond to 8S complexes, supporting a role for progressive cavin-1 interaction in 8S complex oligomerization. These results highlight siMILe's potential to identify differential molecular structures in distinct cell conditions. siMILe extends the SuperResNET SMLM software platform with the ability to detect interpretable structural differences across conditions.
bioinformatics2025-10-01v1Mechanistic Insights into the inhibition of Plasmodium falciparum DNA gyrase A by withanolides derivatives through integrated computational analysis
Dasari, J. B.; Soren, B. C.; Vastrad, S. J.; Junied, S.; Srikanth, D.; Chimalamari, A.; Jayappa, B. M. K. B.AI Summary
- This study used computational methods to explore how withanolide derivatives (D, E, O) inhibit Plasmodium falciparum DNA gyrase A, a unique target for antimalarial drugs.
- Molecular docking showed high binding affinities, with withanolide E having the highest at -9.73 kcal/mol, interacting via key residues.
- Molecular dynamics simulations and MM/GBSA calculations indicated withanolide O had the most favorable dynamic profile and lowest binding free energy, suggesting potential for antimalarial drug development.
Abstract
Malaria is a fatal disease affecting millions of people worldwide, primarily due to infection by Plasmodium falciparum. The emergence of multidrug-resistant parasite strains has necessitated the exploration of novel therapeutic targets, among which DNA gyrase represents a unique and underexploited enzyme in the parasite's replication machinery. Plasmodium falciparum DNA gyrase A (pfDNA gyrase), an essential topoisomerase II that is not present in humans, has been identified as a promising target for antimalarial drug development. The present study uses a structure-based computational approach to characterize the binding mechanism and dynamic stability of three bioactive withanolide derivatives (D, E, and O) against pfDNA gyrase. Molecular docking revealed high binding affinities for withanolide D (-9.14 kcal/mol), E (-9.73 kcal/mol), and O (-9.00 kcal/mol), with interactions mediated through key catalytic residues such as GLU648, LYS647, and TRY590 via hydrogen bonding and hydrophobic contacts. The stability of the ligand-protein complexes was further assessed through molecular dynamics simulations, in which analyses of RMSD, RMSF, radius of gyration (Rg), and solvent-accessible surface area (SASA) confirmed the structural integrity and compactness of the complexes. Notably, withanolide O exhibited the most favorable dynamic profile, whereas withanolide E induced conformational rigidity. MM/GBSA calculations further supported these results, showing the lowest binding free energies for withanolide O and E (ΔGbind = -20.89 and -20.22 kcal/mol, respectively). ADME studies showed favorable pharmacokinetic and physicochemical properties for all three ligands. Collectively, these findings highlight the potential of withanolide derivatives as promising inhibitors of pfDNA gyrase, paving the way for future antimalarial drug development.
bioinformatics2025-10-01v1Integrating Multi-Structure Covalent Docking with Machine Learning Consensus Scoring Enhances Virtual Screening of Human Acetylcholinesterase Inhibitors
Rayakar, A. A.; Jaladanki, C. K.; Yap, X. H.; Fan, H.AI Summary
- This study developed an in silico protocol integrating multi-structure covalent docking with machine learning (ML) consensus scoring to enhance virtual screening of human acetylcholinesterase (AChE) inhibitors.
- Analysis of 65 ligand-bound AChE structures identified four representative conformations for docking, with covalent docking showing superior performance (Spearman's ρ up to 0.54) compared to non-covalent docking (ρ up to 0.18).
- The ML consensus model, trained on covalent docking scores from five structures, achieved the highest predictive accuracy (ρ = 0.70), highlighting its effectiveness in predicting AChE inhibitors.
Abstract
Acetylcholinesterase (AChE) inhibition is a key mechanism in the treatment of neurodegenerative diseases and in counteracting toxic exposures to pesticides and nerve agents. However, virtual screening of AChE remains challenging due to the enzyme's structural flexibility and the chemical diversity of its covalently binding inhibitors. In this study, we developed an in silico protocol that integrates multi-structure covalent docking and machine learning (ML) consensus scoring to improve the prediction of AChE inhibitors. We analyzed 65 ligand-bound (holo) human AChE crystal structures using hierarchical clustering to identify four representative conformations, along with one high-resolution apo structure, for multi-structure docking. A curated library of 412 organophosphate and carbamate inhibitors was then docked covalently and non-covalently into each receptor conformation. The resulting docking scores were evaluated against inhibitors' experimental logIC50 values using Spearman's rank correlation coefficient (ρ). Covalent docking outperformed non-covalent docking (ρ values up to 0.54 vs 0.18), and our ML consensus model trained on the five structures' covalent docking scores achieved the highest predictive accuracy (ρ = 0.70), surpassing all single-structure and conventional consensus baselines. Chemical cluster analysis revealed structure-activity trends based on ligand flexibility, polarity, and aromaticity. SHapley Additive exPlanations analysis highlighted the ML consensus model's ability to flexibly distribute the influence each structure's scores played on its predictions. It identified and exploited relationships based on its training dataset that would be difficult to anticipate through a manual analysis of individual structures' docking performance metrics. This framework is broadly applicable to other covalently targeted proteins, offering a generalizable and interpretable strategy for data-driven covalent inhibitor discovery.
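The evaluation metric named in this abstract, Spearman's rank correlation between docking scores and experimental logIC50 values, can be sketched in a few lines. This is a minimal standard-library illustration of the statistic itself (Pearson correlation computed on tie-averaged ranks), not the authors' pipeline; the function names and the toy numbers are assumptions.

```python
import math


def average_ranks(values):
    """Return 1-based ranks (1 = smallest), averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        tied_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = tied_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data (made up): more negative docking scores paired with lower logIC50.
docking_scores = [-9.1, -7.4, -8.2, -6.0]
log_ic50 = [-1.5, 0.3, -0.2, 1.1]
rho = spearman_rho(docking_scores, log_ic50)
```

Because only the ordering of compounds matters, the metric does not require docking scores and logIC50 values to share units or scale.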
bioinformatics2025-10-01v1A proteogenomic approach to discover novel lncRNA-derived peptides and their potential clinical utility in hepatocellular carcinoma
Bingwu, L.; Joshi, K.; Wang, D. O.AI Summary
- This study used a proteogenomic approach integrating Ribo-seq data to identify 105 novel lncRNA-derived peptides (lncPeps) in hepatocellular carcinoma (HCC) tissues.
- The lncPeps were differentially expressed between tumor and non-tumor tissues and some correlated with prognosis.
- Incorporating lncPeps with canonical proteins in a LASSO regression model enhanced the prediction of HCC recurrence, improving AUC by 0.005 to 0.085.
Abstract
Peptides are increasingly recognized for their versatile functions in biological contexts, but their clinical relevance and utility remain largely unexplored. Proteogenomic approaches can accelerate peptide discovery in clinical samples by integrating proteomic data with genomic and transcriptomic evidence. However, long noncoding RNA (lncRNA)-derived peptides (lncPeps) remain largely unidentified, resulting in unmatchable MS/MS spectra. To solve this problem, we used high-quality Ribo-seq translatomic datasets to generate an extensive database of human liver lncPeps, which we subsequently applied to proteomics data of paired tumor and adjacent normal tissues from hepatocellular carcinoma (HCC) patients. Using the new database, we discovered 105 novel lncPeps, including lncPeps differentially expressed between tumor and non-tumor tissues and lncPeps significantly correlated with prognosis. Remarkably, combining the expression of lncPeps with canonical proteins in a LASSO regression model improved predictive performance for recurrence, increasing the AUC by 0.005 to 0.085 across three recurrence time points. These findings suggest that lncPep discovery contributes to our understanding of the molecular heterogeneity and progression of HCC, and broadens the range of potential biomarker candidates or treatment targets for the disease.
bioinformatics2025-10-01v1Comprehensive benchmarking with guidelines for analyzing transposable element-derived RNA expression
She, J.; Wang, J.; Yang, E.AI Summary
- This study benchmarks 16 tools for analyzing transposable element-derived RNA (teRNA) expression using 120 simulated and 60 real-world datasets.
- Findings highlight the exon-level analysis as a balance between accuracy and resolution, revealing method-specific strengths and weaknesses at different levels.
- The study provides decision-tree guidelines and an integrated pipeline for teRNA analysis, establishing a gold standard for future tool development.
Abstract
Transposable element-derived RNAs (teRNAs) are increasingly recognized for their fundamental or pathogenic roles, especially in humans. Despite the rapid development of computational methods, best practices for accurate identification and quantification of teRNAs are currently lacking owing to the difficulty of evaluation. Here we present a benchmark of 16 representative tools on 120 simulated datasets and 60 real-world paired datasets (comprising both long- and short-read data), evaluating the performance of teRNA identification or quantification at the family, unit, exon, and transcript levels. Our findings demonstrate not only that the exon level offers a trade-off between accuracy and resolution for teRNA analysis, but also that the evaluated methods have level-dependent strengths and weaknesses. To refine our benchmarking results, we present decision-tree-style guidelines and develop an integrated best-practice pipeline, serving as a basis for future functional research. In addition, our evaluation framework provides a gold standard for developing and benchmarking better computational tools in the field.
bioinformatics2025-10-01v1Robust metabolomics data normalization across scales and experimental designs
Vynck, M.; Vangeenderhuysen, P.; De Paepe, E.; Nawrot, T.; Plekhova, V.; Vanhaecke, L.AI Summary
- The study addresses the issue of signal drift and batch effects in metabolomics by introducing three robust normalization methods: rLOESS, rGAM, and tGAM, which improve resistance to outliers.
- These methods, implemented in the Metanorm R package, use additive models for flexible non-linear modeling and differential sample weighting, enhancing normalization performance.
- Across various datasets, these methods reduced false positives/negatives, improved replicate concordance, and minimized batch effects, demonstrating versatility in metabolomics studies.
Abstract
Metabolomics studies employing liquid chromatography-mass spectrometry are affected by signal drift and batch effects, introducing technical variance that impedes biological knowledge discovery. Quality control (QC) sample-based normalization strategies are widely implemented but remain vulnerable to outliers, thereby reducing normalization performance. We introduce rLOESS, rGAM and tGAM, three robust normalization methods that improve resistance to outliers by downweighting or accommodating them. Leveraging additive models, the rGAM and tGAM methods allow flexible non-linear modeling, differential sample weighting, and data-driven QC representativeness evaluation. Implementations of these methods are gathered in the Metanorm R package, integrating robust normalization with visualization for performance verification, while supporting efficient parallel processing. Spanning in silico and experimental datasets, the robust methods, relative to existing strategies, consistently demonstrated a reduction in false positive and false negative differentially abundant metabolites, improved replicate concordance, and reduced batch effects. Metanorm is versatile, supporting normalization in metabolomics studies across scales and experimental setups.
bioinformatics2025-10-01v1metaAPA: a tool for integration of PolyA site predictions from single-cell and spatial transcriptomics
Zhao, Q.; Rattray, M.AI Summary
- The study addresses the inconsistency in polyA site predictions from single-cell RNA-seq data by integrating outputs from different APA prediction tools.
- metaAPA was developed to combine results from tools like Sierra, polyApipe, and SCAPE, allowing selection of polyA sites based on user needs.
- Key findings show that metaAPA can identify high-confidence polyA sites with expected biological characteristics, enhancing both sensitivity and positional accuracy.
Abstract
Motivation: Single-cell 3'-tagging sequencing, such as that provided by 10x Genomics, can be utilized to study alternative polyadenylation (APA). APA can affect RNA function, stability, and subcellular localization, thereby influencing development and disease processes. Currently, computational tools based on various algorithms, such as Sierra, polyApipe, and SCAPE, have been developed to infer polyA site positions from scRNA-seq data. However, these methods exhibit significant differences in the number of predicted sites and positional inconsistencies in the sites identified for the same gene, leading to divergent conclusions when analyzing the same data with different tools. Results: We designed two strategies to integrate the outputs of alternative APA tools, enabling users to select appropriate polyA site sets based on their specific needs. Our method can be used to extract high confidence sites, supported by all methods, as well as putative sites supported by a subset of methods. We find that methods with high sensitivity for detecting APA sites can be usefully augmented by methods with higher positional accuracy but lower sensitivity. We show that our method obtains the expected number of high-confidence sites and that these sites exhibit the expected biological sequence characteristics.
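The abstract does not spell out the two integration strategies, so the following is only a sketch of one plausible way to separate sites "supported by all methods" from sites "supported by a subset of methods". It clusters predicted positions for one gene and strand that fall within a fixed window and counts the distinct tools in each cluster. The function names, the 25 nt window, and the toy coordinates are assumptions; the tool names come from the abstract.

```python
def merge_sites(predictions, window=25):
    """Cluster polyA site predictions from several tools.

    predictions: dict mapping tool name -> list of integer positions
                 (assumed to be on the same gene and strand).
    window:      maximum gap between neighbouring positions in a cluster
                 (hypothetical default, chosen for illustration).
    Returns a list of (representative_position, set_of_supporting_tools).
    """
    tagged = sorted((pos, tool) for tool, sites in predictions.items() for pos in sites)
    clusters = []
    for pos, tool in tagged:
        # Join the current cluster if close enough to its last member.
        if clusters and pos - clusters[-1][-1][0] <= window:
            clusters[-1].append((pos, tool))
        else:
            clusters.append([(pos, tool)])
    merged = []
    for cluster in clusters:
        positions = sorted(p for p, _ in cluster)
        # Middle position (upper median for even-sized clusters).
        representative = positions[len(positions) // 2]
        merged.append((representative, {t for _, t in cluster}))
    return merged


def split_by_support(merged, n_tools):
    """Split clusters into sites supported by all tools and by only some."""
    high_confidence = [pos for pos, tools in merged if len(tools) == n_tools]
    putative = [pos for pos, tools in merged if len(tools) < n_tools]
    return high_confidence, putative


# Toy predictions (coordinates are made up).
predictions = {
    "Sierra": [1000, 2500],
    "polyApipe": [1010, 4000],
    "SCAPE": [995, 2510],
}
merged = merge_sites(predictions)
high_confidence, putative = split_by_support(merged, n_tools=len(predictions))
```

Neighbour-based chaining keeps the example short, but it can join long runs of closely spaced sites into one cluster; a real integration would likely need a cap on cluster width.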
bioinformatics2025-10-01v1pLM-SAV: A Δ-Embedding Approach for Predicting Pathogenic Single Amino Acid Variants
Gereben, O.; Tordai, H.; Khamisi, L.; Kouri, A.; Hegedus, T.AI Summary
- The study introduces pLM-SAV, a predictor for pathogenic single amino acid variants (SAVs) using protein language models (pLMs) and Δ-embeddings, which are differences between wild-type and mutant sequence embeddings.
- pLM-SAV was trained on Eff10k and evaluated on ClinVar data, showing strong performance on Eff10k and reasonable on ClinVar, with notable improvements over AlphaMissense in ambiguous cases.
- An ensemble method, REVEL, was found to outperform both AlphaMissense and pLM-SAV, leading to its integration into the AlphaMissense web application.
Abstract
Predicting whether single amino acid variants (SAVs) in proteins lead to pathogenic outcomes is a critical challenge in molecular biology and precision medicine. Experimental determination of all possible mutation effects is infeasible, and while state-of-the-art tools such as AlphaMissense show promise, their diagnostic performance is insufficient and they are often difficult to run locally. We developed pLM-SAV, a simple yet effective predictor that leverages protein language models (pLMs). Δ-embeddings, computed as the difference between wild-type and mutant sequence embeddings, are used as input for a convolutional neural network. To prevent data leakage, we trained our model on a well-characterized, labeled set of Eff10k and evaluated it on a non-homologous subset of ClinVar data. Our results demonstrate that this approach performs exceptionally well on the Eff10k test folds and reasonably on ClinVar test sets. Notably, pLM-SAV excels in resolving ambiguous predictions by AlphaMissense. We also found that an ensemble method, REVEL, outperforms both AlphaMissense and pLM-SAV; we therefore integrated these REVEL-enhanced predictions into our widely used AlphaMissense web application. Our results demonstrate that an SAV predictor trained on labeled data can achieve high predictive performance. Unlike previous methods such as VESPA, pLM-SAV uses no handcrafted features or substitution matrices, relying solely on pLM-derived representations. We anticipate that incorporating Δ-embeddings into other mutation effect predictors or mutant structure prediction methods will further enhance their accuracy and utility in diverse biological contexts.
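The Δ-embedding idea described in this abstract can be illustrated with a minimal sketch. It assumes per-residue embeddings are already available as arrays of shape (sequence length, embedding dimension); the function name `delta_embedding` and the sign convention (mutant minus wild type) are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def delta_embedding(wt_emb, mut_emb):
    """Return the element-wise difference between mutant and wild-type embeddings.

    Both inputs are assumed to be (sequence_length, embedding_dim) arrays.
    A single amino acid substitution leaves the sequence length unchanged,
    so the two arrays must have the same shape.
    The sign convention (mutant - wild type) is an assumption of this sketch.
    """
    wt = np.asarray(wt_emb, dtype=float)
    mut = np.asarray(mut_emb, dtype=float)
    if wt.shape != mut.shape:
        raise ValueError("wild-type and mutant embeddings must share a shape")
    return mut - wt


# Toy example: 3 residues, 2 embedding dimensions (values are made up).
wt = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
mut = np.array([[0.1, 0.2], [0.9, 0.1], [0.5, 0.7]])
delta = delta_embedding(wt, mut)
```

In such a setup the resulting array, rather than either raw embedding, would be passed to the downstream classifier.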
bioinformatics 2025-09-30 v3
Towards Personalized Epigenomics: Learning Shared Chromatin Landscapes and Joint De-Noising of Histone Modification Assays
Narendra, T.; Visona, G.; Cardona, C. d. J.; Abbott, J.; Schweikert, G.
AI Summary
- The study introduces DecoDen, a method to learn shared chromatin landscapes and de-bias histone modification measurements from ChIP-Seq data.
- DecoDen was applied to analyze histone modification patterns across multiple tissues in personal epigenomes, demonstrating its effectiveness in reducing biases.
Abstract
Epigenetic mechanisms enable cellular differentiation and the maintenance of distinct cell-types. They enable rapid responses to external signals through changes in gene regulation and their registration over longer time spans. Consequently, the chromatin landscape, which is the overall organization and biochemical state of chromatin, exhibits both cell-type and individual specificity and contributes to phenotypic diversity. Genomic distributions of chromatin features are typically measured using ChIP-Seq and related methods. However, these measurements are subject to substantial biases introduced by the chromatin landscape itself. Here, we introduce DecoDen, which uses measurements of several different histone modifications, to simultaneously learn shared chromatin landscapes while de-biasing individual measurement tracks. We demonstrate DecoDen's effectiveness on an integrative analysis of histone modification patterns across multiple tissues in personal epigenomes.
bioinformatics 2025-09-30 v2
WITHDRAWN: NanoDel: a long-read sequencing pipeline for identifying large-scale mitochondrial DNA deletions validated in patient samples clinically diagnosed with mitochondrial disease and evaluated in glioblastoma.
Fearn, C.; Oliva, C.; Griguer, C.; Poulton, J.; Fratter, C.; McGeehan, J.; Baldock, R.; Robson, S.; McGeehan, R.
AI Summary
- The manuscript titled "NanoDel" was withdrawn before peer-review for substantial revisions, including additional data validation and clarification of conclusions.
- The revised version will be resubmitted for peer-review.
Abstract
The authors have withdrawn this manuscript (MS ID#: BIORXIV/2025/677263) prior to peer-review to perform substantial revisions to the content, including additional data validation and clarification of key conclusions. The revised manuscript will be resubmitted to a journal for peer-review in due course. Consequently, the authors do not wish this version of the work to be cited in any publications. If you have any questions, please contact the corresponding author.
bioinformatics 2025-09-30 v2
Hi-GREx: A 3D Genome Guided Framework for enhancing Gene Expression Prediction Using Hi-C Selected Distal SNPs
Joshi, K.; Xuan, Z.; Chen, M.
AI Summary
- The study introduces Hi-GREx, a framework that enhances gene expression prediction by incorporating long-distance SNPs identified through Hi-C data, addressing limitations of traditional TWAS methods.
- Hi-GREx was benchmarked using GTEx brain cortex data, improving gene expression prediction accuracy for 77.4% of active genes.
- Notably, Hi-GREx could predict expression for 18% of genes that were not modeled using only short-distance SNPs.
Abstract
The genome-wide association study (GWAS) method has been used successfully to map thousands of loci associated with complex traits, but its ability to reveal the molecular mechanisms altered in complex diseases has been limited because it does not include combinations of and interactions between markers when predicting a disease. Transcriptome-wide association studies (TWAS) estimate the aggregate effects of multiple genetic variants on complex diseases and represent a promising approach to address the limitations of GWAS. In particular, TWAS provides insights into the functional consequences of disease-associated SNPs by linking them to gene transcription, thereby offering a mechanistic understanding that GWAS alone cannot provide. However, TWAS-associated variants have been annotated with the closest or most biologically relevant candidate gene within arbitrarily defined distances, an approach that fails to account for long-distance SNPs, which can affect many genes and have a widespread impact on regulatory networks. Therefore, there is a need to leverage these observed enrichments and build a method that incorporates both short- and long-distance associations between SNPs and complex phenotypes. Here we present a method that uses Hi-C data to capture informative long-distance SNPs and aims to improve the prediction accuracy of previous TWAS methods. We benchmarked our method on GTEx brain cortex genotype and expression data together with the corresponding Hi-C data. By using the informative long-distance SNPs selected based on Hi-C, our method improved prediction accuracy of gene expression for 77.4% of the active genes across the entire genome. In particular, our method can build significant expression models for 18% of genes which were missed by using only short-distance SNPs. Our method demonstrates the efficiency and importance of utilizing long-distance SNPs in predicting gene expression and can further enhance the power of TWAS methods.
bioinformatics 2025-09-30 v1
Distinct cell wall molecular architecture of dimorphic Talaromyces marneffei cells revealed by solid-state NMR spectroscopy
Chen, Q.; Xu, X.; Liao, S.; Chen, Y.; Liang, H.; Wang, J.; Wang, F.; An, S.
AI Summary
- Researchers used solid-state NMR to investigate the cell wall structure of Talaromyces marneffei, which shifts from mold to yeast form with temperature changes.
- The yeast form had a cell wall 2.3 times thicker and more hydrated, with β-1,3-glucan increasing from 57% in mold to 72% in yeast.
- The study revealed distinct molecular architectures, with only the mold form showing lysine-containing protein interactions with chitin and chitosan.
Abstract
Talaromyces marneffei, which causes systemic infections in immunocompromised patients ranging from individuals with HIV/AIDS to cancer patients and transplant recipients, is an increasingly urgent global pathogen. However, the fungus remains underrecognized even though talaromycosis, the systemic infection it causes, is associated with high mortality rates. Its pathogenicity depends on a temperature-triggered shift from saprophytic mold (25 °C) to pathogenic yeast (37 °C), and the two growth forms display distinct sensitivity to antifungal drugs; these processes involve extensive remodeling of cell wall structure and components. To dissect these processes, we use solid-state nuclear magnetic resonance (ssNMR) and other techniques to show that T. marneffei yeast and hyphal cells have distinct cell wall thickness and hydrophobicity, and different assembly of mobile and rigid polymers within the T. marneffei cell wall. The yeast wall was 2.3 times thicker and more hydrated. ssNMR revealed a rigid core of β-1,3-glucans, chitin and chitosan, with β-1,3-glucan rising from 57% in mold to 72% in yeast. Both forms showed tight polysaccharide packing, but only mold exhibited lysine-containing protein interactions with chitin and chitosan. These insights not only map the structural basis of host temperature adaptation but also inform future targeted antifungal design.
bioinformatics 2025-09-30 v1
Unmmaped RNA-Seq reads from inoculated sugarcane reveals long non-coding RNAs related to retrotransposons and suggest microbiome modulation
Leite, J. N.; Cirino, H.; Antonello, P.; Zerillo, M.; Dias, H. M.; Van Sluys, M.-A.
AI Summary
- The study analyzed unmapped RNA-Seq reads from sugarcane inoculated with a pathogen, with or without beneficial bacteria, to explore transcripts not aligning to reference genomes.
- A pipeline was developed to assemble these unmapped reads, revealing long non-coding RNAs linked to retrotransposons and suggesting microbiome modulation.
- The findings included identification of sugarcane microbiome species and changes in microbial transcription levels under biotic stress, providing insights into the sugarcane pathobiome.
Abstract
Background: Transcriptome studies have contributed to the understanding of protein-coding and non-coding gene expression in several organisms. They also provide knowledge of host responses to pathogen infections. It is not yet common to detect transcripts from multiple interacting organisms present in a given sample. In addition, transcriptome studies of complex, polyploid hybrid genomes that are not completely sequenced, such as that of sugarcane, remain a challenge. In such studies, a considerable set of reads may not align to any reference genome but still hold significant relevance to the study, making it possible to gather information beyond the complex organism itself. Results: A complete transcriptome analysis of sugarcane inoculated with a pathogen, with and without the addition of a beneficial bacterium, generated a subset of reads that did not map to their respective genome references. Here, we report a pipeline based on the assembly of a collection of unmapped reads that potentially interfere with the gene expression of the associated microbiome, as well as the identification of long non-coding RNAs related to transposable elements present in a host-pathogen interaction. In addition, the detection of transcripts of naturally occurring microorganisms in sugarcane allowed the identification of its microbiome at the species level. Further studies support changes in microorganism transcription levels according to the biotic stress to which the plant was subjected. Conclusion: A quick and practical pipeline is proposed to study unmapped reads and infer relevant information that would otherwise remain unnoticed. New sequences from the hybrid sugarcane transcriptome, such as long non-coding genes related to known retrotransposons, are described. Also, changes in microbiome gene expression provide insights into microbiome alterations and add to knowledge of the sugarcane pathobiome.
bioinformatics 2025-09-30 v1
Zero-Shot Protein-Ligand Binding Site Prediction from Protein Sequence and SMILES
Pourmirzaei, M.; Alqarghuli, S.; Chen, K.; Pourmirzaei, M.; Xu, D.
AI Summary
- The study addresses the challenge of predicting protein-ligand binding sites across different data regimes by developing a three-stage modeling suite that incorporates protein sequence and SMILES data.
- Stage 2 improved performance on overrepresented ligands (Macro F1 from 0.4769 to 0.5832), while Stage 3 enabled zero-shot prediction with an F1 score of 0.3109 on 5,612 unseen ligands.
- Larger protein language model (PLM) backbones were found to significantly enhance performance across all regimes, whereas scaling the chemical language model (CLM) showed less consistent benefits.
Abstract
Accurate identification of protein-ligand binding sites is critical for mechanistic biology and drug discovery, yet performance varies widely across ligand families and data regimes. We present a systematic prediction and evaluation framework that stratifies ligands into three settings: overrepresented (many examples), underrepresented (tens of examples; few-shot), and zero-shot (unseen at training). We developed a novel three-stage, sequence-based modeling suite that progressively adds ligand conditioning and zero-shot capability, and used an evaluation framework to assess the suite. Stage 1 trains per-ligand predictors using a pretrained protein language model (PLM). Stage 2 introduces ligand-aware conditioning via an embedding table, enabling a single multi-ligand model. Stage 3 replaces the table with a pretrained chemical language model (CLM) operating on SMILES, enabling zero-shot generalization. We show Stage 2 improves Macro F1 on the overrepresented test set from 0.4769 (Stage 1) to 0.5832 and outperforms sequence- and structure-based baselines. Stage 3 attains zero-shot performance (F1 = 0.3109) on 5,612 previously unseen ligands while remaining competitive on represented ligands. Ablations across five PLM scales and multiple CLMs reveal larger PLM backbones consistently increase Macro F1 across all regimes, whereas scaling the CLM yields modest or inconsistent gains, which need further investigation. Our results demonstrate that zero-shot residue-level prediction from sequence and SMILES is feasible and identify the PLM scale as the dominant lever for further advances. The code is fully open source at GitHub: https://github.com/mahdip72/ProteinLigand
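For readers unfamiliar with the Macro F1 metric quoted above, the following minimal sketch shows how a macro average differs from pooling all predictions. It assumes the average is taken over ligands, which the abstract does not state explicitly; the function names and the toy counts are hypothetical.

```python
def f1_score(tp, fp, fn):
    """F1 from true-positive, false-positive and false-negative counts.

    Returns 0.0 when there are no positives at all, which is a common
    convention for an otherwise undefined score.
    """
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(counts_per_ligand):
    """Unweighted mean of per-ligand F1 scores.

    counts_per_ligand maps a ligand name to a (tp, fp, fn) tuple of
    residue-level counts. Every ligand contributes equally, regardless of
    how many binding residues it has.
    """
    scores = [f1_score(*c) for c in counts_per_ligand.values()]
    return sum(scores) / len(scores)


# Hypothetical residue-level counts for two ligands.
toy_counts = {"ligand_A": (8, 2, 2), "ligand_B": (1, 1, 3)}
toy_macro = macro_f1(toy_counts)
```

Because each ligand is weighted equally, a rare ligand with poor predictions lowers the macro score as much as a common one would.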
bioinformatics 2025-09-30 v1
BenchHub enables an inclusive and transparent ecosystem for community-focused benchmarking in computational biology
Yang, J. Y. H.; Liang, X. C.; Robertson, N.; Torkel, M.; Kim, S.; Strbenac, D.; Cao, Y.
AI Summary
- The study addresses the lack of standardized data structures for benchmarking in computational biology by introducing BenchHub, a modular R6-based ecosystem.
- BenchHub includes a Trio database linking datasets, performance metrics, and ground truth, a BenchmarkStudy structure for study designs, and tools for result analysis.
- This ecosystem enhances reproducibility, comparability, and sustainability in benchmarking for computational biology.
Abstract
The rapid growth of computational methods in computational biology highlights the critical role of benchmarking in guiding method selection. However, there is no standardised data structure that effectively links and stores datasets, performance metrics and available ground truth. Without such a unified and shareable structure, it is difficult for the community to contribute, update and extend existing benchmarking studies to ensure long-term relevancy. To address this challenge, we present BenchHub, a community-oriented ecosystem with a modular R6-based structure that enables living benchmarking. BenchHub comprises three key components: (i) a Trio database that links datasets, performance metrics, and supporting evidence (e.g. ground truth), (ii) a BenchmarkStudy structure that captures the different benchmark study designs, and (iii) a series of tools, together with vignettes and an interactive platform, that allow users to gain insights from the benchmarking results. Together, these components streamline the benchmarking process for benchmark study developers, methods contributors, and benchmark consumers, promoting reproducibility, comparability, and long-term sustainability in computational biology.
bioinformatics 2025-09-30 v1
Supervised Factorization to Associate Spatial Transcriptomics with Complementary Molecular Readouts
Awal, F. B.; Pautler, R. G.; Samee, M. A. H.; Rahman, M. S.
AI Summary
- The study introduces a supervised Non-negative Matrix Factorization (NMF) framework to link spatial transcriptomics with molecular readouts, focusing on spatial alignment for targeted factorization components.
- Applied to Alzheimer's Disease (AD) and Myocardial Infarction (MI), the method identified disease-related spatial factors, with AD analysis incorporating a spatial decay model of amyloid-beta plaque influence.
- The approach successfully highlighted gene sets and candidate genes associated with disease progression by ranking their contributions to the supervised spatial factor.
Abstract
Spatial transcriptomics enables the study of gene expression within the spatial context of tissues. Yet understanding how spatial molecular phenomena influence transcriptional patterns remains a key challenge. We propose a novel supervised Non-negative Matrix Factorization (NMF) framework, where supervision is selectively and explicitly applied to guide the learning of a supervised spatial factor. This distinguishes our method from prior approaches by enforcing spatial alignment only on a targeted component of the factorization, enabling biologically interpretable associations between gene expression and spatial molecular events. This approach also enables the identification of genes whose expression patterns are spatially correlated with molecular events of interest. Applied to datasets involving Alzheimer's Disease (AD) and Myocardial Infarction (MI), our method successfully discovered supervised spatial factors associated with disease-related signals. In the case of AD, we present a spatial decay model to represent how the influence of amyloid-beta plaque signals diminishes with distance, and use this as a supervision signal during matrix factorization. Applied across both disease contexts, our method successfully identified biologically meaningful gene sets associated with disease progression. By ranking genes based on their contribution to the supervised spatial factor, the framework highlights candidate genes potentially involved in disease-related processes.
bioinformatics 2025-09-30 v1
MitoNGS: an online platform to analyze fish metabarcoding data in high-resolution
Zhu, T.; Sato, Y.; Fukunaga, T.; Miya, M.; Iwasaki, W.; Yoshizawa, S.
AI Summary
- MitoNGS is an online platform designed to analyze fish metabarcoding data with high resolution, addressing challenges like incomplete reference databases and ambiguous taxa.
- It incorporates comprehensive references, including non-fish species, and uses a "species group" strategy with habitat and geographic data to improve species identification.
- MitoNGS supports various mitochondrial markers and Nanopore sequencing, showing excellent performance across diverse datasets.
Abstract
Environmental DNA (eDNA) metabarcoding has become a powerful tool for assessing fish biodiversity in aquatic ecosystems. However, accurate species-level identification remains challenging due to incomplete and contaminated reference databases, as well as ambiguous taxa sharing identical barcode sequences. Here, we present MitoNGS, a next-generation platform that succeeds the widely used MiFish pipeline, designed for high-resolution analysis of fish metabarcoding data. MitoNGS addresses these challenges by incorporating more comprehensive references including non-fish species and detailed annotations of heterospecific regions. Additionally, it introduces the "species group" strategy in conjunction with environmental habitat and geographic occurrence data to resolve ambiguous taxa. Furthermore, MitoNGS expands the functionalities of the legacy MiFish pipeline. It can analyze data from any mitochondrial markers and from Nanopore sequencing platforms. MitoNGS demonstrated excellent performance on our testing datasets from diverse locations, markers and sequencing platforms. MitoNGS offers a user-friendly, web-based solution for fish detection, biodiversity monitoring, conservation research, and bioresource management. MitoNGS is freely available via https://mitofish.aori.u-tokyo.ac.jp/mito-ngs.
bioinformatics 2025-09-30 v1
Label-free biochemical imaging and timepoint analysis of neural organoids via deep learning-enhanced Raman microspectroscopy
Georgiev, D.; Xie, R.; Reumann, D.; Zhao, X.; Fernandez-Galiana, A.; Barahona, M.; Stevens, M. M.
AI Summary
- This study introduces a non-invasive, label-free imaging platform combining Raman microspectroscopy with deep learning for biochemical analysis of neural organoids.
- The approach allows high-resolution mapping of cellular structures in both cryosectioned and intact organoids, enhancing imaging accuracy over traditional methods.
- Key findings include volumetric imaging of neural rosettes and analysis of spatiotemporal biochemical changes in lipids, proteins, and nucleic acids during organoid development.
Abstract
Three-dimensional organoids have emerged as powerful models for studying human development, disease and drug response in vitro. Yet, their analysis remains constrained by standard imaging and characterisation techniques, which are invasive, require exogenous labelling and offer limited multiplexing. Here, we present a non-invasive, label-free imaging platform that integrates Raman microspectroscopy with deep learning-based hyperspectral unmixing for unsupervised, spatially resolved biochemical analysis of neural organoids. Our approach enables high-resolution mapping of cellular and subcellular structures in both cryosectioned and intact organoids, achieving improved imaging accuracy and robustness compared to conventional methods for hyperspectral analysis. Using our platform, we demonstrate volumetric imaging of a neural rosette within a neural organoid, and interrogate changes in biochemical composition during early developmental stages in intact neural organoids, revealing spatiotemporal variations in lipids, proteins and nucleic acids. This work establishes a versatile framework for high-content, label-free (bio)chemical phenotyping with broad applications in organoid research and beyond.
bioinformatics 2025-09-30 v1
Enhancing protein structure prediction accuracy by prioritizing important residues using protein language models
Cui, Q.; Liu, Y.; Kang, B.
AI Summary
- The study aimed to improve protein structure prediction by incorporating residue importance scores (RIS) from protein language models into AlphaFold2, creating i-Fold.
- i-Fold uses RIS as dynamic positional weights during training to focus on functionally critical residues.
- Results showed i-Fold significantly enhanced prediction accuracy (p=0) and success rate by 7.6% on a benchmark set and 6.0% on an independent set, particularly for challenging proteins.
Abstract
Accurate prediction of protein tertiary structures from amino acid sequences remains a fundamental challenge in computational biology. Although AlphaFold2 represents a major advance, systematic discrepancies persist between its predictions and experimentally determined structures. Given that individual residues contribute differentially to protein function, we hypothesized that incorporating residue-specific importance metrics could improve prediction accuracy. Here, we develop i-Fold (importanceFold), a neural architecture that enhances AlphaFold2 by integrating residue importance scores (RIS), derived from the ESM protein language model, as dynamic positional weights during training. Our approach dynamically weights amino acids using RIS during structure prediction, thereby directing computational attention toward functionally critical residues and regions. Evaluation on a benchmark test set of 3,559 protein structures reveals that i-Fold significantly improves accuracy (reduction in r.m.s.d., p=0) and achieves a higher prediction success rate (7.6% improvement: 55.1% → 62.7%). Notably, i-Fold demonstrates particular improvements for targets that are typically challenging for AlphaFold2, including ribosomal proteins, membrane proteins, and orphan proteins. Consistent results were obtained on a completely independent test set of 167 recently released protein structures, where i-Fold again exhibited a higher prediction success rate (6.0% improvement: 43.7% → 49.7%) compared to AlphaFold2. Our findings indicate that explicit integration of RIS can advance the state-of-the-art in protein structure prediction, producing more accurate and generalizable models without substantially increasing computational cost.
bioinformatics 2025-09-30 v1
A statistical framework for defining synergistic anticancer drug interactions
Dias, D.; Zobolas, J.; Ianevski, A.; Aittokallio, T.
AI Summary
- The study developed a statistical framework to identify synergistic anticancer drug interactions by establishing reference null distributions for synergy metrics across 125 pan-cancer cell lines.
- This approach allowed for the estimation of empirical p-values to assess the significance of drug combinations, confirming known synergies and revealing novel ones.
- The framework was validated on an independent dataset, demonstrating its applicability to smaller-scale studies.
Abstract
Synergistic drug combinations have the potential to delay drug resistance and improve clinical outcomes. However, current cell-based screens lack robust statistical assessment to identify significant synergistic interactions for downstream experimental or clinical validation. Leveraging a large-scale dataset that systematically evaluated more than 2,000 drug combinations across 125 pan-cancer cell lines, we established reference null distributions separately for various synergy metrics and cancer types. These data-driven reference distributions enable estimation of empirical p-values to assess the significance of observed drug combination effects, thereby standardizing synergy detection in future studies. The statistical evaluation confirmed key synergistic combinations and uncovered novel combination effects that met stringent statistical criteria, yet were overlooked in the original analyses. We revealed cell context-specific drug combination effects across the tissue types and differences in statistical behavior of the synergy metrics. To demonstrate the general applicability of our approach to smaller-scale studies, we applied the reference distributions to evaluate the significance of combination effects in an independent dataset. We provide a fast and statistically rigorous approach to detecting synergistic drug interactions in combinatorial screens.
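The step from a reference null distribution to an empirical p-value can be sketched as follows. This uses the common add-one estimator and assumes that larger synergy scores indicate stronger synergy; neither choice is confirmed by the abstract, and the function name and toy values are hypothetical.

```python
import numpy as np


def empirical_p_value(observed, null_scores):
    """One-sided empirical p-value of an observed score against a null sample.

    Counts how many null scores are at least as large as the observed one.
    The +1 in numerator and denominator (a standard correction) keeps the
    estimate strictly above zero for a finite null sample.
    """
    null = np.asarray(null_scores, dtype=float)
    n_exceed = np.count_nonzero(null >= observed)
    return (n_exceed + 1) / (null.size + 1)


# Toy reference distribution for one synergy metric in one cancer type.
toy_null = [0.0, 1.0, 2.0, 3.0]
p = empirical_p_value(2.5, toy_null)
```

With a reference distribution built once per metric and cancer type, the same lookup could be reused to score combinations from a smaller independent screen.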
bioinformatics 2025-09-30 v1
Dividing out quantification uncertainty enables assessment of differential transcript usage with limma and edgeR
Baldoni, P. L.; Chen, L.; Li, M.; Chen, Y.; Smyth, G. K.
AI Summary
- The study addresses the challenge of differential transcript usage (DTU) analysis by incorporating read-to-transcript ambiguity (RTA) into the statistical frameworks of limma and edgeR.
- New pipelines using the diffSplice function were developed to remove RTA-induced dispersion, enhancing analysis for both small and large datasets.
- Simulations and real data analysis showed that these pipelines offer increased power, efficiency, and better false discovery rate control compared to existing methods.
Abstract
Differential transcript usage (DTU) refers to changes in the relative abundance of transcript isoforms of the same gene between experimental conditions, even when the total expression of the gene doesn't change. DTU analysis requires the quantification of individual isoforms from RNA-seq data, which has a high level of uncertainty due to transcript overlap and read-to-transcript ambiguity (RTA). Popular DTU analysis methods do not directly account for the RTA overdispersion within their statistical frameworks, leading to reduced statistical power or poor error rate control, particularly in scenarios with small sample sizes. This article presents limma and edgeR analysis pipelines that account for RTA during DTU assessment. Leveraging recent advancements in the limma and edgeR Bioconductor packages, we propose DTU analysis pipelines optimized for small and large datasets with a unified interface via the diffSplice function. The pipelines make use of divided counts to remove RTA-induced dispersion from transcript isoform counts and account for the sparsity in transcript-level counts. Simulations and analysis of real data from mouse mammary epithelial cells demonstrate that the diffSplice pipelines provide greater power, improved efficiency, and improved FDR control compared to existing specialized DTU methods.
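The idea of divided counts mentioned above can be illustrated with a small sketch. It assumes that each transcript has a single RTA overdispersion estimate and that the divided count is simply the raw count divided by that estimate; the published pipelines are implemented in R (limma and edgeR), so this Python function and its name are illustrative only.

```python
import numpy as np


def divided_counts(counts, overdispersion):
    """Scale transcript-level counts by per-transcript overdispersion.

    counts: (n_transcripts, n_samples) array of estimated counts.
    overdispersion: length n_transcripts array of positive RTA
    overdispersion estimates (assumed to be supplied by the quantifier's
    bootstrap or Gibbs resamples; not computed here).
    """
    counts = np.asarray(counts, dtype=float)
    od = np.asarray(overdispersion, dtype=float)
    if od.shape != (counts.shape[0],):
        raise ValueError("need one overdispersion value per transcript")
    if np.any(od <= 0):
        raise ValueError("overdispersion values must be positive")
    # Broadcast each transcript's overdispersion across its samples.
    return counts / od[:, None]


# Toy example: the second transcript is ambiguous (overdispersion of 3).
toy = divided_counts([[10, 20], [30, 60]], [1.0, 3.0])
```

Transcripts whose reads are assigned ambiguously receive larger divisors, so their counts carry less weight in downstream modelling.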
bioinformatics 2025-09-29 v3
Inferring spatial single-cell-level interactions through interpreting cell state and niche correlations learned by self-supervised graph transformer
Xiao, X.; Zhang, L.; Zhao, H.; Wang, Z.
AI Summary
- The study introduces GITIII, a self-supervised graph transformer model that infers cell-cell interactions (CCI) by correlating cell state with its niche, using spatial transcriptomics data.
- GITIII allows for visualization of spatial CCI patterns, CCI-informed cell clustering, and construction of CCI networks.
- Applied to four datasets, GITIII successfully identified and interpreted CCI patterns in brain and tumor microenvironments across different species and platforms.
Abstract
Cell-cell interactions (CCI), driven by distance-dependent signaling, are important for tissue development and organ function. While imaging-based spatial transcriptomics offers unprecedented opportunities to unravel CCI at single-cell resolution, current analyses face challenges such as limited ligand-receptor pairs measured, insufficient spatial encoding, and low interpretability. We present GITIII, a lightweight, interpretable, self-supervised graph transformer-based model that conceptualizes cells as words and their surrounding cellular neighborhood as context that shapes the meaning or state of the central cell. GITIII infers CCI by examining the correlation between cell state and its niche, enabling us to understand how sender cells influence the gene expression of receiver cells, visualize spatial CCI patterns, perform CCI-informed cell clustering, and construct CCI networks. Applied to four spatial transcriptomics datasets across multiple species, organs, and platforms, GITIII effectively identified and statistically interpreted CCI patterns in the brain and tumor microenvironments.
bioinformatics 2025-09-29 v2
Long-term clonal analysis using stochastic models reveals heterogeneity and quiescence of hematopoietic stem cells
Garcia Vilela, Y.; Thielecke, L.; Cesar Fassoni, A.; Glauche, I.
AI Summary
- The study used mechanistic mathematical modeling on longitudinal clonal data from non-human primates to understand the dynamics of hematopoietic stem cells (HSCs), focusing on quiescence and heterogeneity.
- A single homogeneous model failed to explain clone size distributions and persistence, whereas a two-compartment model, incorporating reversible transitions between active and quiescent states, provided a better fit.
- Findings suggest that HSC heterogeneity, particularly the reversible quiescent state, is crucial for explaining long-term clonal dynamics in hematopoiesis.
Abstract
Hematopoietic stem cells (HSCs) maintain lifelong production of blood by balancing self-renewal and differentiation. However, certain aspects of their divisional dynamics, namely the role of quiescence and the intrinsic heterogeneity of the HSC pool, are not completely understood. High-resolution clonal tracking provides a powerful resource to investigate such dynamics, as the data capture patterns of clonal persistence, dilution and late clonal emergence. Here, we apply mechanistic mathematical modeling to longitudinal clonal data from non-human primates to explore structural requirements that underlie the observed dynamical patterns. We show that models treating HSCs as a single, homogeneous population can explain the gradual loss of clonal diversity, but fail to reproduce clone size distributions and the long-term persistence of small and late-appearing clones. To address this, we propose a stochastic, two-compartment model in which HSCs transition reversibly between an actively cycling state and a quiescent, potentially niche-bound state. Compared to the simpler one-compartment model, this advanced framework provides a substantially improved fit for different metrics, consistently captures clone size distributions and explains the delayed activation and sustained coexistence of small and large clones. These results provide quantitative evidence that heterogeneity within the HSC pool, particularly the existence of a reversible quiescent state, is critical to account for clonal aspects of long-term hematopoiesis. Our findings highlight how clonal data can uncover underlying regulatory mechanisms and support a central role for niche-mediated HSC quiescence in maintaining stable and diverse blood production over time.
bioinformatics 2025-09-29 v2
AlphaFold-driven discovery of ORP-PIP phosphatase interactions using new generation confidence scores
Dall'Armellina, F.; Urbe, S.; Rigden, D. J.
AI Summary
- The study aimed to explore interactions between OSBP-related proteins (ORPs) and phosphoinositide phosphatases (PIPs) using AlphaFold2-Multimer, AlphaPulldown2, and AlphaFold3.
- A pipeline was developed to predict and validate these interactions, incorporating confidence metrics like ipTM+pTM, actifpTM, and ipSAE, along with biological context analysis.
- Key findings included conserved binding modes between SAC1 phosphatase and ORPs, notably ORP11, highlighting functionally relevant protein-lipid interfaces.
Abstract
Non-vesicular lipid transport contributes to the regulation of membrane composition and organelle function at membrane contact sites. OSBP-related proteins (ORPs) are central to this process, yet their interaction networks remain incompletely defined. Here, we systematically screened potential interactions between ORPs and phosphoinositide 3-, 4-, and 5-phosphatases (PIPs) using AlphaPulldown2, AlphaFold2-Multimer, and AlphaFold3. We established a pipeline for model generation by combining AlphaFold2-Multimer predictions (including five replicates) with an AlphaPulldown2 interaction screen across around 200 protein pairs, and with AlphaFold3 predictions including lipid-bound and multimeric assemblies. Interface confidence was assessed for consistency using the weighted ipTM+pTM metric, actifpTM, new generation ipSAE scoring, and FoldSeek-Multimer clustering. We further evaluated the protein pairs' biological plausibility based on subcellular localisation data, in silico membrane insertion, evolutionary conservation via ConSurf, and protein binding interface analysis using the deep learning tool PeSTo. This integrative pipeline uncovered functionally conserved binding modes in the SAC1 lipid phosphatase with the ORP family, particularly with ORP11, and predicted functionally relevant protein-lipid interfaces.
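For readers unfamiliar with the weighted ipTM+pTM metric, a minimal sketch follows. The abstract does not state the weights; the 0.8/0.2 split below is the model confidence weighting defined for AlphaFold-Multimer and is assumed here. The function name, the pair labels, and the scores are hypothetical and are not results from the study.

```python
def weighted_iptm_ptm(iptm: float, ptm: float, w_interface: float = 0.8) -> float:
    """Combine interface (ipTM) and global (pTM) scores into one confidence value.

    The default 0.8 * ipTM + 0.2 * pTM weighting is an assumption taken from
    AlphaFold-Multimer's model confidence; the abstract gives no weights.
    """
    return w_interface * iptm + (1.0 - w_interface) * ptm

# Hypothetical (ipTM, pTM) scores for two made-up protein pairs.
scores = {"pair_A": (0.82, 0.75), "pair_B": (0.35, 0.80)}

# Rank pairs from most to least confident.
ranked = sorted(scores, key=lambda name: weighted_iptm_ptm(*scores[name]), reverse=True)
```

Because the interface term dominates, a pair with a well-folded but poorly docked complex (high pTM, low ipTM) ranks below one with a confident interface.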
bioinformatics 2025-09-29 v2
Differential analysis of translation efficiency and usage of open reading frames using DOTSeq
Lim, C. S.; Chieng, G. S. W.
AI Summary
- DOTSeq is a statistical framework designed to analyze differential translation efficiency and usage of multiple open reading frames (ORFs) within genes.
- It enables the detection of cis-regulatory events like uORF-mediated control across different biological conditions.
- Benchmarking showed DOTSeq's sensitivity to subtle regulatory signals, outperforming existing tools in detecting modest effect sizes.
Abstract
Protein synthesis is a key cellular process in which mRNAs are translated into proteins by ribosomes. This process is tightly regulated, enabling cells to control protein output in response to specific cellular states. Ribosome profiling captures translatomic landscapes across conditions, but existing computational tools for differential translation analysis operate at the gene level, overlooking translational control at the level of multiple open reading frames (ORFs). Here, we present DOTSeq, a Differential ORF Translation statistical framework that enables systematic discovery of translational control events within genes. DOTSeq offers differential analyses of ORF usage and translation efficiency across biological conditions. These modules allow global detection of cis-regulatory events, such as upstream ORF (uORF)-mediated translational control. Benchmarking on simulated datasets demonstrates DOTSeq's sensitivity to subtle regulatory signals, particularly at modest effect sizes where most biological signals occur and where existing tools often show limited sensitivity. DOTSeq provides a flexible and powerful approach for dissecting the complexity of translational control.
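To make the notion of translation efficiency concrete, the sketch below uses the common textbook definition: ribosome footprint signal divided by mRNA signal for the same ORF, compared between two conditions on a log2 scale. This is not DOTSeq's statistical model, which the abstract does not describe; the function names and the pseudocount are assumptions, and a real analysis would also handle library-size normalisation, replicates, and dispersion.

```python
import math

def translation_efficiency(ribo_count, rna_count, pseudocount=1.0):
    """Ratio of ribosome footprint reads to RNA-seq reads for one ORF.

    The pseudocount (an assumption) avoids division by zero for unexpressed ORFs.
    """
    return (ribo_count + pseudocount) / (rna_count + pseudocount)

def log2_te_change(ribo_a, rna_a, ribo_b, rna_b, pseudocount=1.0):
    """log2 fold change in translation efficiency from condition A to condition B."""
    te_a = translation_efficiency(ribo_a, rna_a, pseudocount)
    te_b = translation_efficiency(ribo_b, rna_b, pseudocount)
    return math.log2(te_b / te_a)
```

An ORF whose footprint count doubles while its mRNA level stays flat has a log2 change of about 1, whereas a matching rise in both assays leaves it near 0, which is what separates translational from transcriptional regulation.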
bioinformatics 2025-09-29 v2
AFMnanoSALQ: An Accurate Detection Framework for Semi-Automatic Labeling and Quantitative Analysis of α-Hemolysin Nanopores Using Intensity-Height Cues in HS-AFM Data
Nguyen, T. V. T.; Ly, N. Q.; Le, N. T. P.; Nguyen, H. D.; Ngo, K. X.
AI Summary
- The study introduces AFMnanoSALQ, a framework for semi-automatic labeling and quantitative analysis of α-hemolysin nanopores using HS-AFM data, integrating both visual and geometric features for 3D morphology detection.
- AFMnanoSALQ does not require annotated data or extensive training, offering a cost-effective and rapid deployment solution.
- It performs comparably to deep learning models, aiding in preliminary data analysis and dataset creation for future deep learning studies.
Abstract
High-Speed Atomic Force Microscopy (HS-AFM) enables imaging of biological structures and dynamics with nanometer spatial and millisecond temporal resolution. AFM images contain three-dimensional (3D) surface information, comprising two-dimensional (2D) lateral (x-y) and one-dimensional (1D) height (z) encoded in pixel intensity. This dynamic structure poses significant challenges for instance boundary detection and morphological analysis. To address this, we develop AFMnanoSALQ, a feature-driven computational framework for semi-automatic labeling and quantitative (SALQ) detection and morphological measurement of HS-AFM data. Unlike conventional methods that rely solely on either visual or geometric features for 2D boundary detection, AFMnanoSALQ integrates both to extract 3D morphology. It requires neither annotated data nor intensive training, enabling fast deployment at minimal cost. With performance comparable to typical deep learning models, AFMnanoSALQ facilitates semi-automatic labeling, making it a practical tool for preliminary data inspection and accelerating the creation of training datasets. As a case study, we focus on α-hemolysin (αHL), a β-barrel pore-forming toxin secreted by Staphylococcus aureus, using both synthetic and experimental AFM data. AFMnanoSALQ provides a foundation for future deep learning studies, enabling both dataset generation and cross-validation between feature-driven and data-driven approaches.
bioinformatics 2025-09-29 v2
Recurrent enhancer-promoter interactions across samples
Weston, M.; Gunjala, S.; Hu, H.; Li, X.
AI Summary
- This study analyzed the recurrence of enhancer-promoter interactions (EPIs) across 49 Hi-C and 95 HiChIP datasets.
- Most EPIs were found to be recurrent across different samples, regardless of assay type or enhancer annotations.
- EPIs that appeared unique to an individual sample were typically surrounded by fewer neighboring EPIs, suggesting they might not be truly sample-specific.
Abstract
Enhancer-promoter interactions (EPIs) are fundamental to gene regulation, and understanding their recurrence across diverse biological samples is key to deciphering chromatin architecture. In this study, we systematically analyzed the recurrence of EPIs across 49 Hi-C and 95 HiChIP datasets. We found that the majority of EPIs identified in a given sample were also present in other samples, regardless of the assay type (Hi-C or HiChIP) or the enhancer annotations used. Interestingly, EPIs that appeared unique to individual samples were typically surrounded by fewer neighboring EPIs, suggesting they may not represent truly sample-specific interactions. Our findings indicate that most human EPIs have already been captured and that cells primarily reuse subsets of these shared EPIs across different cell types and conditions. This study provides new insights into the pervasive and reusable nature of EPIs in the human genome, with important implications for chromatin conformation studies.
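A minimal sketch of how recurrence across samples could be tallied is given below. It assumes each EPI is already reduced to a hashable `(enhancer, promoter)` identifier pair that matches exactly between samples; the abstract does not describe the matching procedure, and real Hi-C or HiChIP data would need interval-overlap matching instead. The function names are hypothetical.

```python
from collections import Counter

def epi_recurrence(samples):
    """Count in how many samples each (enhancer, promoter) pair appears.

    samples: dict mapping sample name -> iterable of (enhancer, promoter) pairs.
    """
    counts = Counter()
    for pairs in samples.values():
        counts.update(set(pairs))  # set() so a pair counts once per sample
    return counts

def fraction_recurrent(sample_pairs, counts):
    """Fraction of one sample's EPIs that are also seen in at least one other sample."""
    pairs = set(sample_pairs)
    if not pairs:
        return 0.0
    shared = sum(1 for pair in pairs if counts[pair] > 1)
    return shared / len(pairs)
```

Pairs with a count of 1 are the candidates for sample-specific interactions, which the study suggests may often not be truly sample-specific.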
bioinformatics 2025-09-29 v1
ChromPolymerDB: A High-Resolution Database of Single-Cell 3D Chromatin Structures for Functional Genomics
Chen, M.; Du, L.; Zhao, S.; Ye, B.; Delafrouz, P.; Farooq, H.; Chattopadhyay, D.; Marai, G. E.; Shao, Z.; Liang, J.; Czajkowsky, D. M.; Chronis, C.
AI Summary
- The study addresses the limitation of population-averaged chromatin structure data by developing sBIF, a polymer physics-based framework to reconstruct single-cell 3D chromatin conformations from bulk Hi-C data.
- ChromPolymerDB was created, containing ~10^8 reconstructed 5 kb-resolution single-cell structures across 50 human cell types, offering tools for 3D structural analysis and multi-omics integration.
- The database allows exploration of associations between chromatin structure, gene expression, and regulatory elements, supporting comparative analyses across different conditions.
Abstract
The three-dimensional (3D) organization of chromatin plays a critical role in regulating gene expression and genomic processes like DNA replication, repair, and genome stability. Although these processes occur at the individual-cell level, most chromatin structure data are derived from population-averaged assays, such as Hi-C, obscuring the heterogeneity of single-cell conformations. To address this limitation, we developed a polymer physics-based modelling framework, the Sequential Bayesian Inference Framework (sBIF), that deconvolutes bulk Hi-C data to reconstruct single-cell 3D chromatin conformations. To support a broader use of sBIF, we created ChromPolymerDB, a publicly accessible, high-resolution database of single-cell chromatin structures inferred by sBIF. The database contains ~10^8 reconstructed 5 kb-resolution single cell structures, spanning over 60,000 genomic loci across 50 human cell types and experimental conditions. ChromPolymerDB features an interactive web interface with tools for 3D structural analysis and multi-omics integration. Users can explore associations between chromatin conformation and gene expression, epigenetic modifications, and regulatory elements. The platform also supports comparative analyses to identify structural changes across cell types, developmental stages, or disease contexts. ChromPolymerDB offers a unique resource for researchers studying the relationship between genome architecture and gene regulation, and for advancing comparative 3D genomics. ChromPolymerDB is available online at https://chrompolymerdb.bme.uic.edu/.
bioinformatics 2025-09-29 v1
A Graph-Attentive GAN for Rare-Cell-Aware Single-Cell RNA-Seq Data Generation
Ganguly, R.; Aafrine, S.; Hossain, S. M. M.; Ray, S.
AI Summary
- The study addresses the challenge of high-dimensional, small-sample (HDSS) data and class imbalance in scRNA-seq by introducing GARAGE, a Graph-Attentive GAN.
- GARAGE uses a graph attention network to prioritize rare cell types, integrating them into the generator's input to enhance synthesis.
- Results show GARAGE improves feature selection and clustering in downstream analyses, outperforming existing methods by generating realistic synthetic cells that maintain rare-cell structure.
Abstract
A central challenge in downstream single-cell RNA sequencing (scRNA-seq) analysis is the high-dimensional, small-sample (HDSS) regime, often compounded by class imbalance from rare cell types. These factors hinder robust feature (gene) selection and cell clustering and limit the realism of samples generated by existing simulators. We introduce GARAGE, a Graph-Attentive RAre-cell aware single-cell data GEneration method that augments the generator's input with a small, attention-weighted 'leakage' of real cells in addition to prior noise. Specifically, we build a k-nearest-neighbour cell graph and use a graph attention network (GAT) to prioritize nodes that likely represent under-sampled (rare) subpopulations; these high-attention cell embeddings are injected into the generator input to steer synthesis toward biologically plausible regions of the data manifold while respecting cell-type proportions. This attention-guided leakage accelerates training, reduces mode dropping, and yields realistic synthetic cells that preserve rare-cell structure. Across real scRNA-seq benchmarks, GARAGE improves downstream feature selection and clustering compared with state-of-the-art baselines. In summary, GARAGE directly addresses HDSS and rarity in scRNA-seq by coupling graph attention with adversarial generation to produce high-fidelity synthetic cells that enhance downstream analyses.
bioinformatics 2025-09-29 v1