Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
DupyliCate: mining, classifying, and characterizing gene duplications
Natarajan, S.; Pucker, B.
Abstract
Paralogs, copies of a gene, form an important basis for novelty during evolution. Analysis of such gene duplications is important to understand the emergence of novel traits during evolution. DupyliCate is a Python tool that has been developed for this purpose. With the ability to process multiple datasets concurrently, flexible features, and parameters to set species-specific thresholds, DupyliCate offers a high-throughput method for gene copy identification and analysis. The different available parameters and modes are explored in detail based on the Arabidopsis thaliana datasets. Proof of concept for the tool is presented by characterizing well-known duplications in different plants, and its broad applicability is demonstrated by running it on diverse datasets including complex plant genome sequences with high heterozygosity. Further, two case studies involving the evolution of flavonol synthase (FLS) genes in Brassicales, and the evolution of the flavonol synthesis-regulating myeloblastosis (MYB) transcription factors MYB12 and MYB111 across a large number of plant species, are presented as exemplar use cases. The tool's applicability beyond plants is demonstrated on Escherichia coli, Saccharomyces cerevisiae, and Caenorhabditis elegans datasets. DupyliCate is available at: https://github.com/ShakNat/DupyliCate.
bioinformatics 2026-05-06 v4
Large-Scale Statistical Dissection of Sequence-Derived Biochemical Features Distinguishing Soluble and Insoluble Proteins
Vu, N. H. H.; Nguyen Bao, L.
Abstract
Protein solubility critically influences recombinant expression efficiency and downstream biotechnological applications. While deep learning models have improved predictive accuracy, the intrinsic magnitude, redundancy, and interpretability of classical sequence-derived determinants remain insufficiently characterized. We performed a statistically rigorous large-scale univariate analysis on a curated dataset of 78,031 proteins (46,450 soluble; 31,581 insoluble). Thirty-six biochemical descriptors were evaluated using Mann-Whitney U tests with Benjamini-Hochberg false discovery rate correction. Effect sizes were quantified using Cliff's δ, and discriminative performance was assessed by ROC-AUC. Although 34 features remained significant after correction, most exhibited small effect sizes and substantial class overlap, consistent with a weak-signal regime. The strongest effects were associated with size-related features (sequence length and molecular weight; δ ≈ -0.21), whereas charge-related descriptors, particularly the proportion of negatively charged residues (δ = 0.150; AUC = 0.575), showed consistent but modest shifts. Spearman correlation analysis revealed near-complete redundancy among major size-related variables (ρ up to 0.998). Applying a redundancy threshold (|ρ| ≥ 0.85), we derived a parsimonious composite integrating sequence length and negative charge proportion, achieving AUC = 0.624 (MCC = 0.1746). These findings demonstrate that sequence-level solubility information is intrinsically low-dimensional and governed by coordinated weak effects, establishing a transparent statistical baseline for large-scale solubility characterization.
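The effect size and AUC reported above are linked through the Mann-Whitney U statistic: for two groups, AUC = U / (n1·n2) and Cliff's δ = 2·AUC − 1, which is why δ = 0.150 pairs with AUC = 0.575. The following is a minimal sketch of that relationship, not the authors' code; the function names and the toy feature values are assumptions made for illustration only.

```python
from itertools import product


def mann_whitney_u(x, y):
    """U statistic for x versus y: pairs with x > y count 1, ties count 0.5."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a, b in product(x, y))


def cliffs_delta(x, y):
    """Cliff's delta = P(x > y) - P(x < y), bounded in [-1, 1]."""
    greater = sum(a > b for a, b in product(x, y))
    less = sum(a < b for a, b in product(x, y))
    return (greater - less) / (len(x) * len(y))


def auc_from_u(x, y):
    """ROC-AUC of the raw feature as a classifier score, obtained from U."""
    return mann_whitney_u(x, y) / (len(x) * len(y))


# Hypothetical toy values (e.g. a per-protein residue fraction); not real data.
soluble = [0.14, 0.12, 0.15, 0.11, 0.13]
insoluble = [0.10, 0.12, 0.09, 0.13, 0.11]

delta = cliffs_delta(soluble, insoluble)
auc = auc_from_u(soluble, insoluble)

# The two quantities carry the same information: delta == 2 * AUC - 1.
assert abs(delta - (2 * auc - 1)) < 1e-12
```

Because δ and AUC are affine transforms of each other, a feature with a small |δ| necessarily has an AUC close to 0.5, which matches the weak-signal reading in the abstract.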
bioinformatics 2026-05-06 v3
Identification, evolutionary history and characteristics of orphan genes in root-knot nematodes
Seckin, E.; Colinet, D.; Bailly-Bechet, M.; Seassau, A.; Bottini, S.; Sarti, E.; Danchin, E. G.
Abstract
Orphan genes, lacking homologs in other species, are systematically found across genomes. Their presence may result from extensive divergence from pre-existing genes or from de novo gene birth, which occurs when a gene emerges from a previously non-genic region. In this study, we identified orphan genes in the genomes of globally distributed plant-parasitic nematodes of the genus Meloidogyne and investigated their origins, evolution, and characteristics. Using a comparative genomics framework across 85 nematode species, we found that 18% of Meloidogyne genes are genus-specific, transcriptionally supported orphans. By combining ancestral sequence reconstruction and synteny-based approaches, we inferred that 20% of these orphan genes originated through high divergence, while 18% likely emerged de novo. Proteomic and translatomic evidence confirmed the translation of a subset of these genes, and feature analyses revealed distinctive molecular signatures, including shorter length, signal peptide enrichment, and a tendency for extracellular localization. These findings highlight orphan genes as a substantial and previously underexplored component of the Meloidogyne genome, with potential roles in their worldwide parasitism.
bioinformatics 2026-05-06 v3
Advancing in silico drug design with Bayesian refinement of AlphaFold models
Sen, S.; Hoff, S. E.; Morozova, T. I.; Schnapka, V.; Bonomi, M.
Abstract
Virtual screening has become an indispensable tool in modern structure-based drug discovery, enabling the identification of candidate molecules by computationally evaluating their potential to bind target proteins. The accuracy of such screenings critically depends on the quality of the target structures employed. Recent advances in protein structure prediction, particularly AlphaFold2, have revolutionized this field with unprecedented accuracy. However, AlphaFold2 models often exhibit limitations in local structural details, especially within binding pockets, which limit their utility for small molecule docking. In contrast, molecular dynamics simulations with accurate atomistic force fields can refine protein structures, but lack the ability to leverage the structural information provided by deep learning approaches. Here, we introduce bAIes, an integrative method that bridges this gap by combining physics-based force fields with data-driven predictions through Bayesian inference. Crucially, bAIes demonstrates a superior ability to discriminate between binders and non-binders in virtual screening campaigns, outperforming both AlphaFold2 and molecular dynamics-refined models. By enhancing the usability of AlphaFold2 models without requiring extensive experimental or computational resources, bAIes offers a convenient solution to a longstanding challenge in structure-based drug design, potentially accelerating the early phases of drug discovery.
bioinformatics 2026-05-06 v2
PhenotypeToGeneDownloaderR: automated multi-source retrieval and validation of phenotype-associated genes
Muneeb, M.; Ascher, D. B.
Abstract
Identifying phenotype-associated genes is a common first step in polygenic risk score construction, enrichment testing, target prioritisation and variant interpretation, but relevant evidence is distributed across heterogeneous databases with different interfaces, formats and evidence models. Here, we present PhenotypeToGeneDownloaderR, a phenotype-guided R/Python pipeline for automated gene retrieval, harmonisation, symbol validation and cross-source summary analysis. Given a phenotype term, the pipeline queries integrated biological databases, standardises per-source outputs, combines gene lists, validates retrieved symbols against the NCBI human gene reference and generates summary tables and visualisations. Across 13 clinically relevant phenotypes and 13 databases, PhenotypeToGeneDownloaderR generated 136,487 raw gene retrievals, with at least one source returning genes for every phenotype. Across all 13 phenotypes, 100,175 of 114,345 combined input symbols were retained after direct or synonym-based validation, corresponding to an 87.6% validation rate. Cross-source overlap was low, supporting the complementarity of integrated evidence sources. Against an HPO/ClinVar/OMIM-derived gold standard, the pipeline recovered 1,039 of 1,056 known phenotype-associated genes, corresponding to 98.4% recall. PhenotypeToGeneDownloaderR provides a lightweight, reproducible upstream framework for generating candidate gene sets for downstream prioritisation and interpretation. The pipeline is implemented in R and Python, released under the MIT licence, and available at https://github.com/MuhammadMuneeb007/PhenotypeToGeneDownloaderR.
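The "direct or synonym-based validation" step and the two reported percentages can be illustrated with a short sketch. This is not the pipeline's implementation: the function name, the tiny reference set, and the synonym table are hypothetical stand-ins for the NCBI human gene reference mentioned above.

```python
def validate_symbols(symbols, official, synonyms):
    """Keep symbols that match an official symbol directly or via a synonym.

    `official` is a set of approved symbols and `synonyms` maps an alias to
    its approved symbol; both are hypothetical stand-ins for a real reference.
    """
    retained = []
    for symbol in symbols:
        key = symbol.upper()
        if key in official:
            retained.append(key)            # direct match
        elif key in synonyms:
            retained.append(synonyms[key])  # synonym-based match
    return retained


official = {"TP53", "ERBB2", "BRCA1"}
synonyms = {"HER2": "ERBB2", "P53": "TP53"}
retained = validate_symbols(["tp53", "HER2", "NOTAGENE"], official, synonyms)

# Rates recomputed from the counts quoted in the abstract.
validation_rate = 100 * 100_175 / 114_345  # retained / combined input symbols
recall = 100 * 1_039 / 1_056               # recovered / gold-standard genes
print(retained, round(validation_rate, 1), round(recall, 1))
```

The recomputed values round to 87.6% and 98.4%, in agreement with the figures stated in the abstract.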
bioinformatics 2026-05-06 v1
An LLM-driven pipeline for proteomics-based detection and structural modeling of post-translational modifications
George, A.; Mejia-Rodriguez, D.; Li, X.; Rigor, P.; Cheung, M. S.; Bilbao, A.
Abstract
Post-translational modifications (PTMs) on proteins dynamically regulate their functions and subsequently cellular physiology. Significant advances have been made in their detection and modeling: mass spectrometry-based proteomics has become the cornerstone for PTM detection in complex samples, while emerging structure-prediction frameworks enable modeling of PTM-dependent conformational changes. However, the biological significance of many PTMs remains largely unexplored, in part because integrated pipelines that bridge PTM detection with structural modeling remain limited. We present a generative AI-driven pipeline that integrates PTM detection with structural modeling of their effects on protein dynamics and interactions. The pipeline comprises two complementary tools: PTMdiscoverer and PTM-Psi. First, PTMdiscoverer leverages large language models to identify, annotate, and interpret candidate PTMs from open-search proteomics results, addressing limitations of conventional proteomics tools. Next, PTM-Psi models the structural, functional, and dynamic consequences of these spatially aware modifications on protein dynamics. These two components bridge PTM discovery with mechanistic interpretation at the structural level. We demonstrate our pipeline by using cyanobacterial proteomics data to study potential molecular mechanisms of redox-regulated "dark complex" formation in carbon metabolism, advancing our ability to interpret PTM-mediated regulation in microbial systems.
bioinformatics 2026-05-06 v1
UNKAI: A protein functional identity prediction model based on ESM-C latent representations and the attention mechanism
Ukai, K.; Fujita, S.; Terada, T.
Abstract
The rapid advancement of genome sequencing technologies has led to the accumulation of a vast number of protein sequences in public databases. However, a significant proportion of these proteins remain functionally uncharacterized. Concurrently, the expansion of protein sequence data has enabled the development of protein language models (pLMs). By distilling billions of years of evolutionary history into a latent representational space, these models have acquired an unprecedented capacity to predict both the tertiary structures and functions of proteins. In this study, we developed a deep learning-based method to predict whether two proteins catalyze the same enzymatic reaction. Our approach leverages latent representations generated by ESM Cambrian (ESM C), a state-of-the-art pLM, which are then processed through a neural network architecture integrating an attention mechanism. Our method outperformed existing approaches, including those based solely on full-length sequence similarity. Notably, it also surpassed our previous LightGBM-based model, which relied on structural similarity scores derived from AlphaFold-predicted models. Analysis of the attention weights reveals that our model autonomously highlights biologically significant sites, such as catalytic and binding residues. This demonstrates that integrating pLMs with attention mechanisms can enhance the accuracy and interpretability of protein function prediction while eliminating the need for manual feature engineering.
bioinformatics 2026-05-06 v1
Integrated Multi-Omics Analysis for the Identification of Disease-Associated Variations and Prognostic Biomarkers in Triple-Negative Breast Cancer (TNBC)
MANNEKUNTA, N.; NATRAJAN, E.
Abstract
Background: Triple-negative breast cancer (TNBC) exhibits substantial molecular heterogeneity and lacks targeted receptor therapies. Single-omic approaches inadequately capture its regulatory complexity, necessitating integrated multi-omic frameworks to identify stable prognostic signatures. Methods: Matched transcriptomic and DNA methylation data from the TCGA-BRCA cohort were normalised and mathematically integrated to isolate disease-associated variations. A calibrated machine learning voting ensemble (comprising LightGBM, Random Forest, and Logistic Regression) was trained to predict clinical survival. Model generalisability was tested on an independent microarray cohort (GSE58812) using independent quantile normalisation. SHAP (SHapley Additive exPlanations) values provided biological interpretability. Results: Differential and integrative analyses identified a 47-gene master prognostic signature. The ensemble classifier achieved an external validation accuracy of 74.77% (AUC 0.590) on unseen clinical patients. SHAP analysis confirmed the biological directionality of these specific biomarkers in driving mortality. Hypergeometric pathway enrichment highlighted targetable metabolic and signalling networks. Conclusions: This multi-omic machine learning pipeline identifies a highly prognostic 47-gene signature for TNBC. The model demonstrates strong cross-platform generalisability and offers interpretable clinical utility for stratifying patient risk and guiding future therapeutic target development.
bioinformatics 2026-05-06 v1
Simple baselines rival protein language models in mutation-dense design tasks
Talpir, I.; Fleishman, S. J.
Abstract
Computational protein design demands generally applicable models that reliably predict or generate unmeasured variants with superior functional properties. Recent studies have proposed protein language models (pLMs) for design tasks, including zero-shot scoring and transfer learning from limited experimental data. Although pLMs have been used in zero-shot and transfer-learning studies, they have generally not been assessed in benchmarks that explicitly test combinatorial extrapolation from lower- to higher-order variants. Here we benchmark widely used pLMs against conventional baseline methods in recently described dense, experimentally validated multi-mutant landscapes. We find that regardless of architecture and parameter count, pLMs are statistically similar to one another, and none consistently outperforms conventional baseline methods. Furthermore, their ability to distinguish functional from non-functional variants in zero-shot prediction is comparable to that of conventional homology-based methods. We suggest that to contribute to the design of protein function, pLMs may need to encode biophysical and structural priors or be combined with structure-based approaches.
bioinformatics 2026-05-06 v1
ArchaicSeeker 3.0: A deep-learning framework for scalable, haplotype-resolved inference of archaic introgression
Wang, B.; Lei, C.; Lin, H.; Shi, S.; Ma, X.; Zeng, W.; Yuan, K.; Ni, X.; Xu, S.
Abstract
Archaic introgression has left a significant mark on human genetic diversity, but reliably identifying introgressed segments remains a major challenge, especially with complex demographic histories and limited sample sizes. Existing methods often rely on demographic assumptions or cohort-specific parameter fitting, which compromises robustness and scalability. We introduce ArchaicSeeker 3.0 (AS3), a deep-learning framework designed for haplotype-resolved detection of archaic introgression. AS3 integrates a tract-scale sequence model with an overlap-aware reassembly approach and boundary refinement, enabling accurate, boundary-coherent reconstruction of introgressed segments across diverse genomic contexts. By leveraging a simulation-trained model, AS3 avoids inference-time recalibration, offering stable performance across unrepresented demographic scenarios and small cohorts. In extensive simulations, AS3 outperforms existing methods in precision, recall, and F1 score, while providing more continuous segments with accurate boundary localization. It demonstrates robustness in small-target regimes and varying marker densities. Applied to 3,453 genomes from 209 populations, AS3 shows strong concordance with existing introgression callers and identifies additional introgressed regions, including high-frequency AS3-specific introgressed segments supported by locus-level haplotype and phylogenetic analyses. AS3 provides a scalable, robust solution for detecting archaic introgression from single individuals to large biobank datasets, marking a significant advancement in the field of local ancestry inference and opening new possibilities for the study of human evolutionary genetics. ArchaicSeeker 3.0 is available at https://github.com/Shuhua-Group/ArchaicSeeker3.0.
bioinformatics 2026-05-06 v1
First Survey of Publicly Available Metagenomic Sequencing Data Across 24 Middle Eastern and North African Countries: The MENA Microbiome Database
Mathlouthi, N. E. H.; Gdoura-Ben Amor, M.; Belguith, I.; Derouich, R.; Ammar Keskes, L.; Gdoura, R.
Abstract
Microbiome research has expanded globally, yet the Middle East and North Africa (MENA) region remains severely under-represented in international sequencing repositories. Here we present the MENA Microbiome Database, the first systematically harmonized catalog of publicly available metagenomic sequencing data from 24 MENA countries, consolidating 60,126 runs across 51,365 biological samples and 2,373 BioProjects deposited between 2008 and 2026. Records were retrieved from ENA, NCBI SRA, and PubMed, enriched with BioSample and study-level metadata, and classified into microbiome subtypes using a 73-rule keyword-based harmonization framework. Amplicon sequencing accounted for 80.6% of runs, with Illumina platforms dominating at 92.7%. Geographic coverage is highly skewed: Saudi Arabia and Turkey together contribute over half of all records, while five countries (Libya, Syria, Palestine, Yemen, and South Sudan) remain critically under-sampled. Metadata completeness averaged 73.97% under a MIxS-MIMS proxy framework, with geographic coordinates available for fewer than 15% of runs. Ecological analyses revealed that country-level factors significantly structure environmental, animal-associated, and plant-associated microbiomes, but not human-associated microbiomes. Spatial autocorrelation confirmed non-random clustering of sampling effort around Red Sea coastal and eastern Mediterranean hotspots. This open, reproducible resource, comprising harmonized data files, analysis code, and an interactive browsing platform, establishes a foundational infrastructure for regional microbiome science and equitable global comparative studies. Keywords: MENA; microbiome; metagenomics; public repository; SRA; ENA; database; harmonization; Middle East; North Africa
bioinformatics 2026-05-06 v1
Tumor cell specific total mRNA expression informed neural networks predicts cancer progression
Paul, A.; Lal, J. C.; Ji, S.; Fong, C.; Chen, K.; Ding, Y.; Li, R.; Dai, Y.; Tran, Q.; Montierth, M.; Alberti, S.; Kopetz, S.; Wang, W.
Abstract
Inferring tumor molecular phenotypes from high-dimensional multi-omic data is a fundamental challenge in computational biology. Current methods for estimating tumor cell-specific total mRNA expression (TmS) require matched DNA and RNA sequencing data and rely on computationally intensive deconvolution pipelines. We present TmSNet, a deep learning framework that predicts TmS using mRNA, DNA methylation, miRNA, and immune cell proportions as input features. TmSNet integrates structured feature selection (gradient boosting, LASSO, elastic net) with specialized neural architectures to predict continuous TmS. Across 12 TCGA cancer types, TmSNet achieved cross-validated performance up to concordance correlation coefficient (CCC) = 0.93 and correlation R-squared = 0.88 and generalized to external cohorts with correlations of 0.54 (SCAN-B) and 0.43 (FUSCC). Predicted TmS values effectively stratify patients by risk and preserve known transcriptional profiles across tumor subtypes. These results demonstrate that TmSNet can infer biologically meaningful phenotypes from multi-omic data and provide a scalable framework for modeling tumor transcriptional activity in heterogeneous cohorts.
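The concordance correlation coefficient (CCC) quoted above differs from an ordinary correlation in that it also penalises systematic offsets between predicted and observed values. Below is a minimal sketch of the standard (Lin's) definition using population variances; the function name and the toy vectors are illustrative assumptions, and nothing here reflects how TmSNet itself computes the metric.

```python
from statistics import fmean


def concordance_cc(x, y):
    """Lin's concordance correlation coefficient for paired values.

    CCC = 2 * cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2),
    with variance and covariance taken over n (not n - 1).
    """
    n = len(x)
    mx, my = fmean(x), fmean(y)
    var_x = sum((a - mx) ** 2 for a in x) / n
    var_y = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return 2 * cov / (var_x + var_y + (mx - my) ** 2)


# Hypothetical toy values: perfect agreement versus a constant offset of +1.
observed = [1.0, 2.0, 3.0]
print(concordance_cc(observed, [1.0, 2.0, 3.0]))  # perfect agreement
print(concordance_cc(observed, [2.0, 3.0, 4.0]))  # perfectly correlated but shifted
```

In the second call the Pearson correlation would still be 1, while the CCC drops to 4/7 because of the offset, which is the property that makes CCC a stricter measure of prediction agreement.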
bioinformatics 2026-05-06 v1
Pharmacological proximities in the GPCR family discovered using contact-informed amino-acid and binding pocket similarities
So, S. S.; Ngo, T.; Ilatovskiy, A. V.; Finch, A. M.; Riek, R. P.; Abagyan, R.; Smith, N. J.; Kufareva, I.
Abstract
Understanding protein proximities in the theoretical ligand space is essential for developing therapeutics with desirable polypharmacology, predicting off-targets, and discovering surrogate ligands for poorly characterized proteins. This is especially important for G protein-coupled receptors (GPCRs) - a major class of drug targets, many of which still lack known ligands. Circumventing this limitation, we present GPCR-CoINPocket v2, a contact-informed metric for detecting GPCR pharmacological similarities from amino-acid sequences alone. We first establish a "gold standard" of pharmacological relatedness using ChEMBL-derived ligand sets. We then replace traditional evolutionary amino acid similarity matrices with a chemically-informed matrix derived from protein:ligand interaction patterns across 3,306 structures, significantly improving early detection of shared pharmacology between distantly homologous receptors. An additional unconstrained, contact-informed matrix further enhances predictive performance. Pilot application of the method revealed previously unrecognized similarities between the β2 adrenoceptor and three Class A peptide GPCRs, which we confirmed experimentally by demonstrating the binding of select ligands of these receptors to the β2. Dimensionality reduction of similarity scores recapitulates known receptor relationships and predicts neighbors of orphan GPCRs later confirmed experimentally. Overall, GPCR-CoINPocket v2 provides a powerful sequence-based framework to prioritize ligand space, predict polypharmacology, and accelerate GPCR drug discovery and deorphanization.
bioinformatics 2026-05-06 v1
Learning the Language of the Microbiome with Transformers
Treloar, N. J.; Ur-Rehman, S.; Yang, J.
Abstract
Self-supervised pretraining has become central to biological machine learning, yet microbiome data remains comparatively underexplored in terms of both modeling approaches and evaluation frameworks. To address this gap, we present Atlas, a pretraining dataset of over 539,000 microbiome datapoints from the MGnify database. Using Atlas, we train the Waypoint family of microbiome foundation models: a series of GPT-2 style causal language models ranging from 6M to 170M parameters. We also introduce Compass, a curated benchmark of eight predictive tasks spanning biome classification, drug-microbiome interactions, drug degradation, and infant gut development. Using this benchmark, we compare the performance of Waypoint models against classical baselines and the existing MGM foundation model. Our results show that pretraining leads to consistent and significant improvements in downstream task performance, that both dataset scale and tokenization strategy impact model quality, and that pretraining is essential for achieving favorable scaling behavior. Furthermore, pretrained transformer models begin to reliably outperform classical methods once training data exceeds roughly 10,000 examples - a threshold that is attainable for modern microbiome studies. Finally, we demonstrate that the Waypoint models achieve state-of-the-art performance among microbiome foundation models. Overall, our work highlights the importance of large-scale self-supervised pretraining in this domain and establishes Atlas, Compass, and the Waypoint models as valuable resources for the research community in this emerging field.
bioinformatics 2026-05-06 v1
Bridging LLM Reasoning and Chemical Knowledge via an Evolutionary Multi-Agent Framework for Molecular Synthesis
Chen, Y.; Rao, J.; Xie, J.; Sun, Y.; Yang, Y.
Abstract
Molecular design faces the dual challenge of navigating a vast chemical space while ensuring experimental synthesizability. Traditional models are constrained by small datasets, restricting their scalability and broader chemical context. In contrast, Large Language Models (LLMs) encapsulate extensive synthesis protocols derived from vast scientific literature, yet they struggle to leverage this potential due to severe hallucinations and a superficial grasp of rigorous chemical logic. We propose EvoSyn, an evolutionary multi-agent framework that synergizes LLM reasoning with domain experts for preference-aware molecular synthesis. EvoSyn orchestrates a dual-process evolutionary paradigm: a co-evolving process that collaboratively aligns linguistic capabilities with multi-objective constraints, and a self-evolving process formulated as a Markov Game. Through evolution and reinforcement learning, agents actively learn from mistakes, utilizing domain feedback to penalize invalid proposals and ground generation in feasible reaction pathways. Extensive evaluations on comprehensive benchmarks demonstrate that EvoSyn significantly outperforms state-of-the-art baselines. These results highlight that by integrating LLM-guided self-evolution with rigorous domain validation to mitigate hallucinations, EvoSyn effectively yields molecules that are both bioactive and synthetically actionable.
bioinformatics 2026-05-06 v1
A novel metric reveals previously unrecognized distortion in dimensionality reduction of scRNA-Seq data
Hamilton, T.; Sparta, B.; Cooley, S. M.; Aragones, S. D.; Ray, J. C. J.; Deeds, E. J.
Abstract
High-dimensional data are becoming increasingly common in nearly all areas of science. Developing approaches to analyze these data and understand their meaning is a pressing issue. This is particularly true for single-cell RNA-seq (scRNA-seq), a technique that simultaneously measures the expression of tens of thousands of genes in thousands to millions of single cells. Popular analysis pipelines significantly reduce the dimensionality of the dataset before performing downstream analysis. One problem with this approach is that dimensionality reduction can introduce substantial distortion into the data, particularly by disrupting the local neighborhoods of certain points. Since many scRNA-seq analyses like cell type clustering or trajectory inference rely on these near-neighbor relationships, distortion in this aspect of the data could significantly influence the outcomes of these analyses. Here, we introduce a straightforward approach to quantifying this distortion by comparing the local neighborhoods of points before and after dimensionality reduction. We found that popular techniques like t-SNE and UMAP introduce substantial distortion even for simple simulated data sets. For scRNA-seq data, we found the distortion in local neighborhoods was often greater than 95%, and that there was no consistent set of neighborhoods across the various steps in the consensus scRNA-seq analysis pipeline. We also found that this distortion had profound impacts on the outcomes of cell type clustering and other downstream analyses. Our findings suggest that caution must be applied when interpreting results in terms of 2-D visualizations produced by tools like UMAP, and that there is a critical need for new dimensionality reduction tools that more effectively preserve the local topological structure of the data.
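Comparing local neighborhoods before and after dimensionality reduction can be sketched in a few lines. The abstract does not give the exact formula, so the version below assumes one plausible choice: the mean Jaccard distance between each point's k-nearest-neighbor set in the original space and in the reduced space. The function names and the toy coordinates are hypothetical.

```python
import math


def knn_sets(points, k):
    """For each point, the set of indices of its k nearest neighbors (Euclidean)."""
    sets = []
    for i, p in enumerate(points):
        others = (j for j in range(len(points)) if j != i)
        ranked = sorted(others, key=lambda j: math.dist(p, points[j]))
        sets.append(set(ranked[:k]))
    return sets


def mean_jaccard_distance(high_dim, low_dim, k):
    """Average of 1 - |A ∩ B| / |A ∪ B| over points; 0 = preserved, 1 = fully replaced."""
    before = knn_sets(high_dim, k)
    after = knn_sets(low_dim, k)
    distances = [1 - len(a & b) / len(a | b) for a, b in zip(before, after)]
    return sum(distances) / len(distances)


# Toy data: two pairs of points separated along the z axis.
high = [(0, 0, 0), (1, 0, 0), (0, 0, 5), (1, 0, 5)]
# A "reduction" that drops z collapses the two pairs onto each other.
low = [(x, y) for x, y, _ in high]

print(mean_jaccard_distance(high, high, k=1))  # identical spaces, no distortion
print(mean_jaccard_distance(high, low, k=1))   # every nearest neighbor changes
```

In this toy case every point's nearest neighbor is replaced after projection, so the distortion is maximal even though the 2-D picture looks tidy, which mirrors the caution the abstract raises about 2-D visualizations.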
bioinformatics 2026-05-05 v7
cellNexus: Quality control, annotation, aggregation and analytical layers for the Human Cell Atlas data
Shen, M.; Gao, Y.; Liu, N.; Bhuva, D.; Milton, M.; Henao, J.; Andrews, J.; Yang, E.; Zhan, C.; Liu, N.; Si, S.; Hutchison, W. J.; Shakeel, M. H.; Morgan, M.; Papenfuss, A. T.; Iskander, J.; Polo, J. M.; Mangiola, S.
Abstract
Large-scale single-cell atlases such as the Human Cell Atlas have transformed our understanding of human biology. Yet, the lack of a robust framework that standardises quality control, expands cellular annotation, and adds normalisation and analytical layers, limits multi-study analyses and the usefulness of this resource. Here we present cellNexus, a comprehensive tool and resource that converts the Human Cell Atlas collection into analysis-ready data by linking quality control layers, metadata enrichment, expression normalisation, analysis and data aggregation. These enhancements enable robust statistical modelling across studies, exemplified by a multi-tissue map of immune cell communication during ageing, which reveals macrophage-muscle axes as among the most depleted regenerative interactions with age. All harmonised layers, including pseudobulk and cell-cell communication summaries, are accessible via a public web interface and with R and Python APIs. By providing continuous integration with CELLxGENE releases, cellNexus transforms large cell atlas corpora into an accessible, reproducible, interoperable foundation for large-scale biological discovery and the next generation of single-cell foundation models.
bioinformatics 2026-05-05 v3
Topology Matters: The Trade-off Between Wasserstein Critics and Discriminators in Single-Cell Data Integration
Reid, K.; Stein-O'Brien, G.; Guven, E.
Abstract
Motivation: Integrating single-cell RNA sequencing experiments (scRNA-seq) across technologies is hindered by severe technical batch effects that confound analysis and mask biological variation. Adversarial autoencoders are a popular solution to correct for these confounding effects, often relying on discriminator networks that approximate the Jensen-Shannon divergence. Previous research has established that the Jensen-Shannon divergence suffers from vanishing gradients when distributions do not overlap, a common phenomenon when datasets come from different sequencing technologies, leading to failed training. In contrast, the Wasserstein distance remains a valid metric with informative gradients even for disjoint distributions. While both approaches appear in the literature, no study has rigorously isolated the adversarial objective to systematically evaluate its impact on batch alignment, biological conservation, and scalability across varying dataset complexities. Results: We introduce a multi-class reference-based Wasserstein critic to systematically benchmark adversarial objectives. We find that the Wasserstein critic yields superior mixing; however, extensive reference sensitivity analysis reveals that the Wasserstein critic is prone to over-correction resulting in collapsed cellular representations; that its integrative performance is dependent on a topologically dense reference batch; and that it scales poorly with the number of batches. In contrast, we find that the "weak" integration characteristic of discriminators acts as a protective measure against over-correction. By highlighting the trade-offs between these methods, we aim to empower researchers to choose the correct method for their specific needs. Availability and Implementation: Source code is available at https://github.com/kreid415/wasserstein-critic-deconfounding. 
Data are available at https://figshare.com/articles/dataset/Benchmarking_atlas-level_data_integration_in_single-cell_genomics_-_integration_task_datasets_Immune_and_pancreas_/12420968/1. Contact: kreid20@jh.edu.
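The contrast between the two adversarial objectives can be seen on a toy example. When two distributions have no overlap, the Jensen-Shannon divergence saturates at log 2 however far apart they are, so it provides no signal about which direction reduces the gap, whereas the Wasserstein-1 distance keeps growing with the separation. The sketch below uses point masses on a unit-spaced 1-D grid; the helper names are assumptions for illustration and are unrelated to the authors' implementation.

```python
import math


def point_mass(index, size):
    """Histogram of length `size` with all probability at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two histograms."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(u, v):
        return sum(a * math.log(a / b) for a, b in zip(u, v) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def wasserstein_1d(p, q):
    """Wasserstein-1 distance on a unit-spaced grid: sum of |CDF_p - CDF_q|."""
    total = cdf_p = cdf_q = 0.0
    for a, b in zip(p, q):
        cdf_p += a
        cdf_q += b
        total += abs(cdf_p - cdf_q)
    return total


reference = point_mass(0, 8)
for shift in range(1, 6):
    batch = point_mass(shift, 8)
    # JS stays at log(2) for every shift; Wasserstein-1 equals the shift itself.
    print(shift, round(js_divergence(reference, batch), 4), wasserstein_1d(reference, batch))
```

The same property that gives the Wasserstein objective a usable gradient for disjoint batches also means it keeps pulling distributions together, which is consistent with the over-correction risk described in the Results.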
bioinformatics 2026-05-05 v3
Exploring per-base quality scores as a surrogate marker of cell-free DNA fragmentome
Volkov, H. H. V.; Raitses-Gurevich, M.; Grad, M.; Shlayem, R.; Leibowitz, D.; Rubinek, T.; Golan, T.; Shomron, N.
Abstract
Per-base quality scores are widely treated as technical metadata in next-generation sequencing. Here, we show that in rigorously controlled whole-genome sequencing of cell-free DNA, quality profiles may encode fragmentomic signals that enable classification of cancer samples against matched controls. Analyzing four independent batches (23 cancer samples: pancreatic and breast; 22 matched controls) sequenced in a within-lane regime and further normalized per flow-cell tile to reduce technical confounders, we demonstrate through unsupervised analysis that boundary-enriched dynamics captured in these quality scores consistently separate cancer from control samples. A leave-one-batch-out classifier trained on quality-derived scores achieved a pooled area under the curve of 0.81. Furthermore, we show that the quality-derived metric correlates with short-fragment enrichment and tumor-associated 5'-end motifs, performing comparably to established, motif-based orthogonal methods. These results provide initial evidence that quality scores could serve as a low-cost, alignment-free biomarker for cfDNA-based cancer detection.
bioinformatics2026-05-05v2MolGene-E: Inverse Molecular Design to Modulate Single Cell Transcriptomics
Ohlan, R.; Murugan, R.; Xie, L.; Nallabolu, V.; Mottaqi, M.; Zhang, S.; Xie, L.Abstract
Designing drugs that can restore a diseased cell to its healthy state is an emerging approach in systems pharmacology to address medical needs that conventional target-based drug discovery paradigms have failed to meet. Single-cell transcriptomics can comprehensively map the differences between diseased and healthy cellular states, making it a valuable technique for systems pharmacology. However, single-cell omics data is noisy, heterogeneous, scarce, and high-dimensional. As a result, no machine learning methods currently exist to use single-cell omics data to design new drug molecules. We have developed a new deep generative framework named MolGene-E to tackle this challenge. MolGene-E combines two novel models: 1) a cross-modal model that can harmonize and denoise chemical-perturbed bulk and single-cell transcriptomics data, and 2) a contrastive learning-based generative model that can generate new molecules based on the transcriptomics data. MolGene-E consistently outperforms baseline methods in generating high-quality, hit-like molecules on gene expression profiles from two evaluation settings: CRISPR knock-out perturbation profiles from the L1000toRNAseq dataset, and single-cell gene expression profiles from the Sciplex-3 dataset, both in a zero-shot molecule generation setting. This superior performance is demonstrated across diverse de novo molecule generation metrics. Extensive evaluations demonstrate that MolGene-E achieves state-of-the-art performance for zero-shot molecule generation. This makes MolGene-E a potentially powerful new tool for drug discovery.
bioinformatics2026-05-05v2Systematic contextual biases in SegmentNT potentially relevant to other nucleotide transformer models
Ebbert, M. T. W.; Ho, A.; Page, M. L.; Dutch, B.; Byer, B. K.; Hankins, K. L.; Sabra, H.; Aguzzoli Heberle, B.; Wadsworth, M. E.; Fox, G. A.; Karki, B.; Hickey, C.; Fardo, D. W.; Bumgardner, C.; Jakubek, Y. A.; Steely, C. J.; Miller, J. B.Abstract
Recent advances in large language models (LLMs) have extended to genomic applications, yet model robustness relative to context is unclear. Here, we demonstrate two intrinsic biases (input sequence length and nucleotide position) affecting SegmentNT results, a model included with the Nucleotide Transformer that provides nucleotide-level predictions of biological features. We demonstrate that nucleotide position within the input sequence (beginning, middle, or end) alters the nature of SegmentNT's raw prediction probabilities, which can be standardized to improve prediction consistency. While longer input sequence length improves model performance, diminishing returns suggest a surprisingly small input length of ~3,072 nucleotides might be sufficient for many applications. We further identify a 24-nucleotide periodic oscillation in SegmentNT's prediction probabilities, revealing an intrinsic bias potentially linked to the model's training tokenization (6-mers) and architecture. We identify potential approaches to account for these biases and provide generalizable insights for utilizing nucleotide-resolution functional prediction models.
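A periodic oscillation such as the 24-nucleotide pattern reported above can be looked for with a discrete Fourier transform over per-position probabilities. The following is a minimal sketch under stated assumptions, not the authors' analysis: the signal is synthetic (a hypothetical baseline of 0.5 with a small sinusoidal ripple), the function name `dominant_period` is made up for illustration, and a plain O(n^2) transform is used to avoid external dependencies.

```python
import cmath
import math

def dominant_period(signal):
    """Return the period (in positions) of the strongest non-constant DFT component."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]  # remove the constant (k = 0) term
    best_k, best_power = 1, -1.0
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return n / best_k

# Synthetic per-nucleotide probabilities: baseline 0.5 plus a 24-position ripple.
probs = [0.5 + 0.05 * math.sin(2 * math.pi * t / 24) for t in range(240)]
print(dominant_period(probs))  # 24.0 for this synthetic signal
```

On real model output the peak would be less clean, so the full power spectrum is worth inspecting rather than only its maximum.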
bioinformatics2026-05-05v2Preferential CDR masking in paired antibody language models improves binding affinity prediction
Talaei, M.; Walker, K. C.; Hao, B.; Jolley, E.; Jin, Y.; Kozakov, D.; Misasi, J.; Vajda, S.; Paschalidis, I. C.; Joseph-McCarthy, D.Abstract
Background: Therapeutic antibodies are a leading class of biologics, yet their unique architecture poses challenges for computational modeling. Each antibody comprises paired heavy and light variable domains with conserved framework regions that maintain structure and hypervariable complementarity-determining regions (CDRs) that directly contact antigens. This functional asymmetry, where CDRs determine binding specificity while frameworks provide scaffolding, suggests that region-aware training strategies could yield superior representations. Existing protein language models treat all regions uniformly, potentially missing critical features present in CDRs. Methods: We developed a region-aware pretraining strategy for paired variable domain sequences using two protein language models: a 3 billion parameter model (ESM2) and a compact 600 million parameter model (ESM C). We compared three masking approaches: uniform whole-chain masking, CDR-focused masking, and a hybrid strategy. Final models were trained on over 1.6 million paired antibody sequences and evaluated on binding affinity datasets with over 90,000 antibody variants across six antigens, including single-mutant panels and combinatorial libraries. Results: Here we show that CDR-focused training produces embeddings with superior predictive performance for antibody-antigen binding. Our approach achieves up to 27% improvements in binding affinity prediction compared to benchmarked antibody models. Remarkably, training exclusively on paired sequences proves sufficient; pretraining on billions of unpaired sequences provides no measurable benefit. Our compact model matches or exceeds larger antibody-specific baselines. Conclusions: These findings establish that prioritizing paired sequences with CDR-aware supervision over scale and complex training schemes achieves both computational efficiency and predictive accuracy, providing a practical framework for next-generation antibody language models.
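The three masking approaches named above differ in where masked positions are sampled. The sketch below illustrates the idea only: the abstract does not give masking rates, CDR boundaries, or the exact definition of the hybrid strategy, so every number here is a placeholder and `choose_mask_positions` is a hypothetical helper rather than the authors' implementation.

```python
import random

def choose_mask_positions(length, cdr_spans, cdr_rate, framework_rate, rng):
    """Pick positions to mask, using a separate masking rate inside CDR spans.

    cdr_spans: list of (start, end) half-open index ranges marking CDRs.
    """
    in_cdr = [False] * length
    for start, end in cdr_spans:
        for i in range(start, end):
            in_cdr[i] = True
    masked = []
    for i in range(length):
        rate = cdr_rate if in_cdr[i] else framework_rate
        if rng.random() < rate:
            masked.append(i)
    return masked

rng = random.Random(0)
spans = [(26, 34), (51, 58), (97, 110)]  # illustrative spans, not a real numbering scheme

# Assumed rates, chosen only to show the three regimes side by side.
uniform = choose_mask_positions(120, spans, 0.15, 0.15, rng)   # same rate everywhere
cdr_only = choose_mask_positions(120, spans, 0.50, 0.00, rng)  # CDR positions only
hybrid = choose_mask_positions(120, spans, 0.50, 0.05, rng)    # mostly CDR, some framework
print(len(uniform), len(cdr_only), len(hybrid))
```

Setting the framework rate to zero confines the masked-token loss to CDR residues, which is one plausible reading of "CDR-focused" training.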
bioinformatics2026-05-05v2multiVIB: A unified probabilistic contrastive learning framework for atlas-scale integration of single-cell multi-omics data
Xu, Y.; Fleming, S. J.; Wang, B.; Schoenbeck, E. G.; Babadi, M.; Huo, B.-X.Abstract
Comprehensive brain cell atlases are essential for understanding neural functions and enabling translational insights. As single-cell technologies proliferate across experimental platforms, species, and modalities, these atlases must scale accordingly, calling for a data integration framework that aligns heterogeneous datasets without erasing biologically meaningful variations. Existing tools typically address narrow integration settings, forcing researchers to assemble ad hoc workflows that may generate artifacts. Here, we introduce multiVIB, a unified probabilistic contrastive learning framework that handles diverse integration scenarios. We show that multiVIB achieves state-of-the-art performance while mitigating spurious alignments. Applied to atlas-scale datasets from the BRAIN Initiative, multiVIB demonstrates robust and scalable integration, including integration of diverse data modalities and reliable preservation of species-specific variations in cross-species integration. These capabilities position multiVIB as a scalable, biologically faithful foundation for constructing next-generation brain cell atlases with the growing landscape of single-cell data.
bioinformatics2026-05-05v2IMAS enables target-aware integration of tumour multiomics to resolve communication-guided regulatory mechanisms
Deyang, W.; Yamashiro, T.; Inubushi, T.Abstract
Tumour multiomic datasets are often sparse, heterogeneous and limited in size, hindering robust and interpretable discovery of regulatory mechanisms. Here we present IMAS, a target-aware integrative framework for multiomic data augmentation and mechanism prioritization that leverages a pan-cancer single-cell multiomic resource to contextualize new tumour datasets and identify reliable sample-specific mechanistic hypotheses. IMAS combines shared latent-space modelling with target-domain adaptation to improve correspondence between predicted and observed RNA and TF profiles while concentrating explanatory predictive supports within the target dataset. Building on this adapted representation, IMAS reconstructs structured RNA-TF coupling networks, refines intercellular signaling through ligand-informed communication modelling, and organizes regulatory programs along communication-associated ordering. In independent colon cancer data, IMAS improved cluster-resolved correspondence and revealed communication-guided regulatory cascades across malignant epithelial states. A LAMB1-centred analysis further demonstrates how the framework supports progressive reinforcement of local regulatory structure and enables perturbation-based probing of context-specific dependencies. Rather than exhaustively predicting all possible outcomes, IMAS provides a target-aware strategy to construct consistent and interpretable mechanism-discovery scaffolds and prioritize regulatory dependencies in data-limited tumour systems.
bioinformatics2026-05-05v2Sequence-dependent transferability of the LRLLR membrane translocation motif: A computational study of smacN and NR2B9c peptides.
Munoz-Gacitua, D.; Blamey, J.Abstract
The LRLLR cell-penetrating motif can be transferred to confer membrane translocation activity, but only to compatible recipient peptides. Using umbrella sampling molecular dynamics simulations, we demonstrate that C-terminal LRLLR addition to the pro-apoptotic smacN peptide eliminates its translocation barrier entirely, transforming a +65 kJ/mol barrier into a -50 kJ/mol energy well. In contrast, N-terminal LRLLR addition to the neuroprotective NR2B9c peptide increases the translocation barrier from +85 to +100 kJ/mol, demonstrating that motif transfer can prove counterproductive for incompatible sequences. Cell-penetrating peptides offer promising strategies for intracellular delivery of therapeutic cargo, yet the sequence determinants governing their activity remain incompletely understood. The LRLLR motif, identified through systematic screening as essential for spontaneous membrane translocation, represents a minimal penetrating element whose transferability has not been previously evaluated. We appended this motif to two clinically relevant peptides: smacN, a tetrapeptide targeting inhibitor of apoptosis proteins in chemotherapy-resistant cancers, and NR2B9c, a nonapeptide that disrupts excitotoxic signaling in ischemic stroke. Potential of mean force profiles calculated across a POPC/POPG bilayer, combined with analysis of hydrogen bonding patterns, secondary structure propensity, and conformational dynamics, reveal the structural basis for these divergent outcomes. Successful transfer to smacN results from favorable complementarity: the hydrophobic, neutral smacN provides an ideal platform for the charged, amphipathic LRLLR motif, yielding a chimera capable of simultaneous interaction with both membrane leaflets. Transfer failure with NR2B9c stems from conformational rigidity induced by intramolecular hydrogen bonding, which prevents optimal membrane insertion, combined with unfavorable positioning of internal polar residues at the bilayer center. 
These findings establish that cell-penetrating motif transfer requires compatibility in charge distribution, hydrophobicity, and conformational flexibility between the motif and recipient sequence. The smacN-LRLLR chimera emerges as a promising candidate for experimental validation as a membrane-permeable therapeutic for survivin-positive tumors. More broadly, this work demonstrates the value of computational screening to identify compatible motif-cargo pairings prior to experimental investment.
bioinformatics2026-05-05v2Building computational benchmarks: an Omnibenchmark reimplementation of a single-cell preprocessing pipeline evaluation
Choudhury, A.; Kitak, T.; Carrillo, B.; Busch, P.; Emons, M.; Gunz, S.; Koderman, M.; Luo, S.; Mallona, I.; Meara, A.; Wissel, D.; Robinson, M. D.Abstract
In the past few years, we have seen a veritable surge in single-cell (e.g., RNA sequencing) techniques and datasets, enabling increasingly detailed characterization of cellular heterogeneity across tissues and conditions. This surge has been complemented by a large number of analysis frameworks and pipelines, along with a large parameter space and many researcher degrees of freedom in how they are used. Many neutral benchmarks have been presented for various computational tasks, but most make design decisions that render them incompatible with each other, e.g., different datasets and metrics, or parameter sets used. In this work, we showcase a recently developed framework, Omnibenchmark, to build reproducible, extensible and standardized method comparisons. This not only facilitates the broad investigation of pipelines used in single-cell data analysis, but also highlights how the process of building benchmarks can be streamlined and unified. We do this as an initial proof-of-principle for an arm's-length benchmark that evaluates five single-cell RNA sequencing pipelines (filtering to normalization to dimensionality reduction to clustering) on three datasets. This standardization enables benchmarks to be easily extended in several directions, including broader parameter sweeps, comparisons across software versions and architectures, isolation of pipeline steps, and integration of additional pipelines, datasets, and metrics.
bioinformatics2026-05-05v1Clonal embeddings allow exploratory analysis of lineage-resolved single-cell data
Isaev, S.; Erickson, A. G.; Adameyko, I.; Kharchenko, P. V.Abstract
Assays coupling high-throughput lineage tracing with single-cell transcriptomics are transforming studies of development and disease biology, revealing not only major differentiation routes but also continuous fate biases and their putative regulators. Yet, analysis of such data at scale presents challenges due to the sparse nature of clonal data and annotation dependencies. To address this, we developed a machine learning approach, clone2vec, which learns informative clone embeddings directly from the cellular expression manifold, bypassing discrete cell-type labels and remaining stable when clones are represented by few cells. This representation summarizes clonal variation as an interpretable geometry that supports exploration, statistics for clone-gene associations, and cross-dataset alignment. In prospective barcoding datasets spanning embryogenesis, tumorigenesis, and hematopoiesis, clone2vec recapitulates established clonal patterns and uncovers new axes of continuous variation that implicate regulatory programs and developmental pathways. In tumor microenvironments profiled with TCR sequencing, clone2vec robustly recovers distinct Treg lineages as well as conserved CD8+ T cell sublineages across cancer types, including several bystander-like clonal subsets. Overall, clone2vec provides a robust, general solution for the exploratory analysis of lineage-coupled scRNA-seq data.
bioinformatics2026-05-05v1Cell Type Weighted Dimensionality Reduction
Putta, S.; Jensen, W.; Devakonda, S.; Pennell, L.; Croteau, J.Abstract
High-dimensional single-cell technologies, such as flow cytometry and CITE-Seq, typically rely on established lineage markers to define cell identities. Additional markers are commonly analyzed within the context of these predefined cell types. Nonlinear projection methods such as t-SNE and UMAP provide a visual framework for this analysis by enabling the overlay of cell types and marker expression. However, these methods frequently produce projections where distinct cell types substantially overlap, hindering interpretation of marker expression patterns relative to known cell types. In this study, we investigate the underlying causes of this phenomenon and demonstrate that such overlaps often stem from the inherent high-dimensional structure of the data rather than limitations in the dimensionality reduction algorithms themselves. To address this, we introduce Cell Type Weighted Dimensionality Reduction (CWDR), a novel approach that incorporates lineage-based information through a supervised weighting mechanism. By integrating both cell identity and marker expression, CWDR preserves the visual separation between predefined cell types while maintaining the local variance necessary for downstream analysis. We validate our method across multiple high-dimensional flow cytometry and proteogenomic datasets. Our results show that CWDR significantly reduces inter-cluster overlap compared to traditional methods, providing a clearer framework for visualizing marker expression within the context of specific cell lineages.
bioinformatics2026-05-05v1Interpreting Omics Data Analysis with Large Language Models for Disease Target and Drug Discovery
XU, Z.; Chen, W.; Ren, W.; Xu, T.; Amaechin, S.; Khan, R.; Chen, Y.; Province, M.; Payne, P.; Li, F.Abstract
In biomedical scientific discovery, synthesizing prior knowledge from the literature is an essential component of interpreting numerical omics data analyses for disease target identification and drug discovery. Large language models (LLMs) alone can rapidly retrieve disease mechanisms from biomedical text, but text-only outputs are general and unreliable for target and drug prioritization without cohort-specific quantitative evidence. Herein, we propose a provenance-aware Text-to-Target framework that couples schema-constrained multi-model LLM retrieval with numeric omics data analysis. The key design is a modality-aware fusion step: candidates are partitioned into overlap-supported anchors, retrieval-only hidden hubs, and network-emergent novelty nodes, then propagated into staged hypothesis and strategy generation under topology constraints. We evaluate the model in Alzheimer's disease (AD) and pancreatic ductal adenocarcinoma (PDAC). In PDAC, the workflow produced a balanced 75-gene candidate universe and a 23-strategy portfolio, with significant DepMap support at both target level and strategy level. In AD, stricter candidate controls yielded a compact 34-gene universe and 14 strategies; under an expanded CRISPRbrain registry, both target-level axes were significant, with strong strategy-level enrichment. Across both diseases, final strategies preserved full provenance closure to the candidate pool, enabling end-to-end auditability from retrieval artifacts to validation outputs. These results support a transferable discovery architecture in which omics evidence constrains biological activity, LLM retrieval expands mechanistic search space, and network-aware fusion preserves interpretability. The framework provides a reproducible basis for dual-disease target prioritization and motivates continuous literature-mechanism concordance with agentic evidence-refresh loops.
bioinformatics2026-05-05v1A universal taxonomic and functional human gut microbiome model for disease classification and phenotype discovery
Karwowska, Z.; Mozejko, M.; Nowak, W.; Romanchenko, A.; Szczurek, E.; Kosciolek, T.Abstract
The human gut microbiome is a powerful indicator of host health, yet its compositional nature, high sparsity, and inter-individual variability complicate downstream analysis. Here, we introduce two complementary approaches to characterize gut microbiome structure at population scale. First, we define eight functional signatures of the human gut microbiome using Non-negative Matrix Factorization, revealing coordinated metabolic patterns that partially decouple from taxonomic composition. Second, we present GUT-FORMer, a transformer-based autoencoder that jointly models taxonomic and functional metagenomic profiles from close to 21,000 publicly available samples. The learned latent representations capture biologically meaningful structure, reflect geographic and disease-associated variation, and enable accurate classification of 25 diseases in both binary and multiclass settings, as well as regression of host age and BMI. GUT-FORMer outperforms existing microbiome indices and deep learning methods across all tasks, establishing a generalizable framework for microbiome-based precision medicine.
bioinformatics2026-05-05v1Structure-derived synthetic sequences guide a protein language model toward metalloproteins
Peteani, G.; Sgueglia, G.; Lemmin, T.; Chino, M.Abstract
Motivation: Protein language models (pLMs) capture evolutionary sequence constraints but are limited in modeling underrepresented functional classes due to training data imbalance. Metalloproteins constitute a fundamental but sparsely represented class in sequence databases. We therefore assess whether structure-conditioned synthetic sequences can be used to specialize pLMs toward metal-binding functionality. Results: We fine-tuned the generalist model ProtGPT2 on synthetic sequences generated by the inverse-folding model ProteinMPNN, constructing training sets with controlled variation in size and diversity. Fine-tuning increased recovery of canonical metal-binding motifs from 43% in the baseline model to 91% in the fine-tuned models. Generated sequences retained high predicted structural confidence and structural similarity to known folds, despite low sequence identity. Analysis of latent representations from ProtGPT2 indicated that fine-tuned models occupy distinct regions of embedding space relative to both the baseline model and structure-conditioned sequences, consistent with partial incorporation of structural constraints while preserving sequence diversity. A multi-step filtering pipeline applied to sequences lacking canonical motifs identified candidate metal-binding sites in four-helical bundle topologies not detected in a non-redundant subset of Protein Data Bank structures or in AlphaFold-predicted proteomes. Availability and implementation: Code, trained models, and datasets are available at: https://doi.org/10.5281/zenodo.18672158 and https://huggingface.co/gsgueglia.
bioinformatics2026-05-05v1Revealing the Hidden Landscape of Public Metabolomics Data Reuse in MetaboLights
Karaman, I.; Payne, T.; Vizcaino, J. A.Abstract
Public data reuse is a key driver of progress in omics sciences, including, increasingly, metabolomics. In this study, we present a validated analysis of confirmed reuse of datasets from the MetaboLights data repository, one of the leading resources in the field. Candidate publications were collected via dataset identifiers (MTBLS#) using a Python-based retrieval pipeline across major publisher databases. They were next manually validated to distinguish active reuse from citation-only mentions. Overall, 272 unique publications were confirmed to have reused at least one MetaboLights dataset. Reuse is dominated by Method/Tool Development, with smaller contributions from Secondary Biological Analysis and Data Integration/Meta-analysis. LC-MS datasets account for the majority of reuse, whereas NMR and GC-MS also contribute but at a lower level. Data reuse has increased over time, with a noticeable acceleration in the most recent years. At the dataset level, reuse follows a long-tail distribution, where a small subset of datasets accounts for repeated reuse, mainly as community benchmarks. These results provide a conservative estimate of public metabolomics data reuse and show that public datasets are predominantly used for methodological and computational applications. They also indicate that reuse is under-detected when dataset identifiers are not consistently reported, highlighting the need for standardised dataset citation to improve traceability and recognition of reuse.
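Identifier-based retrieval of the kind described above depends on spotting MTBLS accessions in publication text. The sketch below shows one way to do that step with a regular expression. It is an assumption-laden illustration, not the authors' pipeline: the function name `find_mtbls_ids` and the sample sentence are made up, and the pattern assumes identifiers of the form `MTBLS` followed by digits.

```python
import re

# "MTBLS" followed by one or more digits, bounded so longer tokens do not match.
MTBLS_PATTERN = re.compile(r"\bMTBLS\d+\b")

def find_mtbls_ids(text):
    """Return unique MetaboLights study identifiers in order of first appearance."""
    seen = []
    for match in MTBLS_PATTERN.findall(text):
        if match not in seen:
            seen.append(match)
    return seen

# Illustrative sentence, not a quotation from any publication.
sample = "Spectra were taken from MTBLS1 and MTBLS374; MTBLS1 was used as a benchmark."
print(find_mtbls_ids(sample))  # ['MTBLS1', 'MTBLS374']
```

A match only shows that an identifier is mentioned. Telling active reuse apart from a citation-only mention still needs the manual validation step the abstract describes.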
bioinformatics2026-05-05v1Machine learning approaches for the identification and analysis of enterotoxin genes in Staphylococcus aureus genomes
Uttin, A.; Leggett, R.; Moulton, V.; Dicks, J.Abstract
Staphylococcus aureus produces a broad range of enterotoxins that act as superantigens, disrupting host immune responses and resulting in a myriad of clinical symptoms. However, large-scale analyses determining enterotoxin gene diversity, lineage structure and isolate metadata remain scarce. We analysed 15,887 S. aureus RefSeq genomes using a machine learning pipeline combining profile Hidden Markov Model-based enterotoxin gene identification, lineage typing, gene profile-based strain clustering and association rule mining using a broad range of gene and metadata features. This approach identified 35 distinct enterotoxin genes and five variant forms, including two putative novel enterotoxin genes, sel34 and sel35. HDBSCAN clustering distinguished 45 enterotoxin gene profile groups, revealing strong associations between the two major egc enterotoxin gene cluster variants (OMIWNG and OMIUNG) and Clonal Complex membership: CC5, CC22 and CC45 with OMIWNG; CC30 and CC121 with OMIUNG. Integration of isolate metadata exposed distinct geographic and temporal trends, including a recent rise in non-egc lineages derived from Asia and animal sources. These findings show that S. aureus enterotoxin diversity is structured by lineage, mobile genetic element composition and Clonal Complex association. The discovery of sel34 and sel35, together with the comprehensive overview of lineage-specific enterotoxin profiles, expands current understanding of S. aureus virulence evolution and provides a scalable analytical framework for monitoring toxin gene dynamics in clinical and environmental populations.
bioinformatics2026-05-05v1Network-based analysis of glioblastoma identifies patient communities and cluster-specific biomarkers
Siminea, N.; Florea, D.; Paun, M.; Paun, A.; Petre, I.Abstract
Glioblastoma is an aggressive and highly heterogeneous brain tumor with poor prognosis despite multimodal treatment strategies. Understanding the molecular diversity of the disease is essential for improving tumor stratification and identifying potential therapeutic targets. In this study, we investigate whether network-based analysis can reveal biologically meaningful subgroups of glioblastoma tumors. Using RNA sequencing and mutation data from the TCGA-GBM cohort, we constructed patient-specific protein-protein interaction networks based on genes that are differentially expressed or harbor somatic mutations. These networks capture the molecular alterations associated with individual tumors within the context of the human interactome. We then derived similarities between tumors using a binary representation of network nodes and the Jaccard similarity metric, enabling the construction of a patient similarity graph. Community detection algorithms (Louvain and Leiden) were applied to this graph to identify clusters of tumors with similar molecular network profiles. Our analysis revealed six tumor communities characterized by distinct gene compositions and enriched biological processes. For each community, we identified candidate biomarkers and network hubs that may represent potential therapeutic targets. Several of the identified genes correspond to known drug targets, while others represent potential candidates for further investigation. These results illustrate how integrating molecular alterations with network-based modeling can help stratify glioblastoma tumors and uncover molecular mechanisms that may guide the development of more personalized therapeutic strategies.
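The patient similarity graph described above rests on a simple set comparison: each tumor is reduced to the set of nodes in its network, and pairs of tumors are scored by Jaccard similarity. The following minimal sketch illustrates that step. The patient labels, gene sets, and the 0.5 edge threshold are hypothetical, and the abstract does not state whether the authors thresholded edges at all.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity between two sets of network nodes (genes)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similarity_edges(patient_nodes, threshold):
    """Return (patient, patient, score) edges whose Jaccard score meets the threshold."""
    edges = []
    for p1, p2 in combinations(sorted(patient_nodes), 2):
        score = jaccard(patient_nodes[p1], patient_nodes[p2])
        if score >= threshold:
            edges.append((p1, p2, score))
    return edges

# Hypothetical patients and gene sets, for illustration only.
patients = {
    "P1": {"EGFR", "PTEN", "TP53", "NF1"},
    "P2": {"EGFR", "PTEN", "TP53"},
    "P3": {"IDH1", "ATRX"},
}
print(similarity_edges(patients, threshold=0.5))  # [('P1', 'P2', 0.75)]
```

The resulting weighted edge list is the kind of input a community detection method such as Louvain or Leiden would then partition.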
bioinformatics2026-05-05v1Transfer Learning Enables Drug-Target Interaction Prediction in Data-Scarce One-Carbon Metabolism
Dalkiran, A.; Cho, T.; Atalay, M. V.; Shin, K. W. D.; Meliton, A. Y.; Woods, P. S.; Shamaa, O. R.; Hamanaka, R. B.; Mutlu, G. M.; Cetin-Atalay, R.Abstract
Predicting drug-target interactions (DTIs) with deep learning offers opportunities to accelerate drug discovery, yet performance is constrained by the scarcity of target-specific training data. This is a particular challenge for mitochondrial one-carbon (1C) pathway enzymes, which are attractive therapeutic targets but remain pharmacologically understudied. Mitochondrial 1C metabolism supplies glycine, reducing equivalents, and 1C units critical for nucleotide synthesis, and has emerged as a key pathway in cancer and fibrosis. SHMT2 and MTHFD2, two key 1C enzymes, support collagen production in fibroblasts; blocking either prevents TGF-β-induced glycine and collagen accumulation. Here, we developed transfer learning-based deep learning models to predict interactions between approved drugs and SHMT2 or MTHFD2 despite minimal target-specific training data, pre-training on large datasets from related enzymes before fine-tuning to these targets. Virtual screening of the DrugBank library identified six candidates, three of which (Carbimazole, Crizotinib, and GSK2018682) reduced TGF-β-induced collagen production and glycine accumulation in human lung fibroblasts, demonstrating transfer learning as a strategy for repurposable drug identification in data-scarce metabolic targets.
bioinformatics2026-05-05v1ANYI: The ANnotated Yeast Interactome
Nissley, D. A.; Goel, M.; Castellanos-Girouard, X.; Kuntz, C. P.; Wang, Y.; Mukhtar, S.; Serohijos, A.; Schlebach, J. P.Abstract
Although several existing protein-protein interaction (PPI) databases provide yeast PPI data, none unify large-scale network topology information with detailed biophysical, proteostasis, and regulatory annotations in a single protein-centric framework. To address this gap, we developed the ANnotated Yeast Interactome (ANYI), an open, integrated resource that combines experimental yeast PPIs with sixteen feature annotation types, including protein abundance, half-life, disorder content, post-translational modifications, conformational stability, chaperone interactions, sequence, and structure. ANYI integrates 3,927 proteins with 155 annotation features, forming a unified matrix that enables systematic cross-layer analyses. Available via GitHub and Docker Hub with an interactive network browser for broad accessibility, ANYI provides both experienced and beginner computational scientists with tools to investigate the yeast interactome. For example, users can directly test whether highly connected hub proteins exhibit distinct stability, disorder, or proteostasis signatures relative to peripheral nodes.
bioinformatics2026-05-05v1immuneKG: An Immune-Cell-Aware Knowledge Graph Framework for Target Discovery in Immune-Mediated Diseases
Ye, Y.; PB-IDD Department, Pharmablock Sciences Inc.,Abstract
Biomedical knowledge graphs have emerged as foundational infrastructure for AI-driven drug discovery, yet their translational impact on novel target identification in immune-mediated diseases remains limited. Here we present immuneKG, a multimodal knowledge graph centred on autoimmune diseases, constructed through biologically meaningful feature reprogramming of disease nodes to enable deep mechanistic modelling of immune-related disorders. immuneKG introduces a new entity class, immune_cell, and four original directed relation types, together adding 9,105 novel triples absent from all existing biomedical KG schemas. Disease nodes are endowed with three novel modal feature sets quantifying immune homeostatic imbalance: autoantibody profiles, cytokine signatures, and HLA genotypes, complemented by systemic involvement scores and genetic features. The graph encompasses over 407,000 training triples across 7,287 entities and 32 relation types. Applied to inflammatory bowel disease (IBD), immuneKG combined with a HeteroPNA-Attn graph neural network achieves a Hits@100 of 0.99 against a Clarivate Phase II+ clinical pipeline, while a novelty-penalised scoring function surfaces high-potential dark targets. The framework shifts from conventional candidate-space screening to a development-oriented decision-support paradigm, providing actionable and interpretable guidance for downstream drug discovery.
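For readers unfamiliar with the Hits@100 figure quoted above, the sketch below shows one common reading of a Hits@k metric: the fraction of known targets that appear among the top k ranked candidates. This is an assumption, because the abstract does not define the metric and knowledge-graph papers sometimes compute it per test triple instead. The function name and the gene labels are hypothetical.

```python
def hits_at_k(ranked_candidates, known_targets, k):
    """Fraction of known targets that appear in the top-k ranked candidates."""
    known = set(known_targets)
    if not known:
        raise ValueError("known_targets must not be empty")
    top_k = set(ranked_candidates[:k])
    return len(top_k & known) / len(known)

# Hypothetical ranking and reference set, for illustration only.
ranked = ["G1", "G2", "G3", "G4"]
known = ["G2", "G4", "G9"]
print(hits_at_k(ranked, known, k=2))  # one of three known targets is in the top 2
print(hits_at_k(ranked, known, k=4))  # two of three are in the top 4
```

Under this reading, a value of 0.99 would mean that nearly all reference targets fall within the top 100 predictions.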
bioinformatics2026-05-05v1Massively parallel reporter assay-informed modeling improves prediction of context-specific enhancer-gene regulatory interactions
DeGroat, W.; Kreimer, A.Abstract
Enhancers are cis-regulatory elements that drive context-specific gene expression, yet their target genes and modes of action remain largely unresolved. Because most disease-associated variants lie in non-coding regulatory DNA, accurate, cell type-specific enhancer-gene (E-G) mapping is essential for understanding genetic risk. However, current E-G prediction frameworks lack the resolution to capture such context-specific interactions. Massively parallel reporter assays (MPRAs) provide measurement of cis-regulatory activity, but their integration into genome-scale E-G models has been limited. Here, we introduce MPRabc, an MPRA-informed model that improves E-G interaction prediction. MPRabc integrates predicted MPRA activity, sequence-derived regulatory features, epigenomic signals, and three-dimensional chromatin contact maps with CRISPR-based perturbation training data. Benchmarking against validated regulatory interactions shows that MPRabc outperforms state-of-the-art models. We generated high-resolution E-G networks for K562, HepG2, and hiPSC cell lines and applied a graph-based framework to identify regulatory architecture, map trait-associated variants and expression quantitative trait loci, and resolve transcription factor drivers of enhancer activity. Across contexts, we accurately recovered lineage-defining regulatory programs, including GATA1::TAL1 in K562, HNF1A/B in HepG2, and POU factor circuits in hiPSCs. Together, these results establish MPRA-informed modeling as a scalable strategy for decoding enhancer function and linking non-coding variants to gene regulatory mechanisms across cellular contexts.
bioinformatics2026-05-05v1DOMINO: Learning Domain Co-occurrence for Multidomain Protein Design
Dai, F.; Su, J.; Tan, Q.; Yang, H.; Zhou, X.; Yuan, F.Abstract
Multidomain proteins arise through the reuse and recombination of structural domains, yet natural architectures represent a sparse, structured sample of the possible domain-combination space. Here, we introduce DOMINO, a two-stage framework that learns domain co-occurrence from TED-annotated multidomain proteins and uses the learned patterns to generate new multidomain sequences. DOMIN, a contrastive retrieval model, embeds domains into a latent compatibility space and retrieves candidate partners for a query domain from a TED-derived domain pool, including pairings not observed in the TED-derived co-occurrence set. DOMO, a conditional autoregressive sequence model, converts each retrieved domain pair into a full-length protein sequence by jointly generating the specified domain regions and the non-domain sequence context between and around them. DOMIN recovers hierarchical patterns of natural domain co-occurrence and expands the observed CATH homologous-superfamily co-occurrence network with candidate novel pairings. DOMO realizes both held-out natural pairs and DOMIN-retrieved pairs as proteins with high domain recovery and high AlphaFold-predicted structural confidence. Applied at scale, DOMINO generated 5 million retrieval-derived multidomain proteins, with sampled designs showing recovery of the specified domains, diverse CATH annotations, and sequence novelty relative to UniRef100. Together, these results support domain co-occurrence as a predictive design prior and demonstrate a scalable strategy for exploring multidomain protein architectures through new combinations of existing structural modules.
bioinformatics2026-05-05v1Cross-assay RNA modeling reveals cancer biomarkers
Townsend, H. A.; Jordan, K. R.; Wolsky, R. J.; Van Kleunen, L. B.; Davidson, N. R.; Behbakht, K.; Sikora, M. J.; Dowell, R. D.; Clauset, A.; Bitler, B. G.Abstract
The clinical heterogeneity of cancer poses a major challenge for precision medicine. Limited cohort sizes across evolving assay platforms impede reliable biomarker discovery. Here, we systematically evaluate how to integrate data from four transcriptomics platforms: bulk and single-cell (sc) RNA sequencing (RNA-seq), NanoString, and microarray for predictive modeling in cancer. We use high-grade serous carcinoma (HGSC) of tubo-ovarian origin as a model system, as it is highly heterogeneous in both biology and assay data. We find that using fold-change of gene expression in patients with matched pre- and post-neoadjuvant chemotherapy samples reduces inter-patient and inter-assay variability but is insufficient to overcome platform-specific biases. Microarray and scRNA-seq data exhibit systematic biases, while RNA-seq and NanoString show the most promise for combination into a single training cohort. To mitigate inter-assay limitations, we generate a new data set of HGSC tumor samples profiled with both RNA-seq and NanoString, and use it to identify the limits of detection and optimal harmonization strategies. Our approaches enable integration of cohorts for separate and combined RNA-seq and NanoString predictive models of disease recurrence (test-set AUROCs > 0.8), validated in external microarray cohorts. We leverage single-cell and bulk RNA-seq network-based analyses to provide mechanistic context for genes in the predictive models. Our models indicate that GBP4 expression is a key predictor of recurrence and marks immune remodeling towards cytotoxicity. We provide an interactive web portal to facilitate exploration of data and results. These findings guide cross-assay harmonization of transcriptomic data and enable improved predictive modeling in heterogeneous cancers.
bioinformatics2026-05-05v1Celldetective: an AI-enhanced image analysis tool for unraveling dynamic cell interactions
Torro, R.; Diaz Bello, B.; El Arawi, D.; Dervanova, K.; Ammer, L.; Dupuy, F.; Chames, P.; Sengupta, K.; Limozin, L.Abstract
Analysis of multimodal and multidimensional data capturing dynamic interactions between diverse cell populations is a current challenge in bioimaging, especially in the context of immunology and immunotherapy research. Here, we introduce Celldetective, an open-source Python-based software tool designed for high-performance, end-to-end analysis of image-based in vitro immune and immunotherapy assays. Celldetective is purpose-built for multicondition, 2D multi-channel time-lapse microscopy of mixed cell populations. Although it is optimised for the needs of immunology assays, it is nevertheless broadly applicable to any biological system involving interacting cell populations. The software seamlessly integrates AI-based segmentation, tracking, and automated single-cell event detection, all within an intuitive graphical interface that supports interactive visualisation, annotation, and training options. We showcase its capabilities with original datasets of single immune effector cell interactions with an activating surface mediated by bispecific antibodies, and pairwise interactions in antibody-dependent cell cytotoxicity events.
bioinformatics2026-05-04v4SenNet Portal: Build, Optimization and Usage
Borner, K.; Blood, P. D.; Silverstein, J. C.; Ruffalo, M.; Satija, R.; Gehlenborg, N.; Honick, B.; Bueckle, A.; Jain, Y.; Qaurooni, D.; Shirey, B.; Sibilla, M.; Metis, K.; Bisciotti, J.; Morgan, R. S.; Betancur, D.; Sablosky, G. R.; Turner, M. L.; Kim, S.-J.; Lee, P. J.; Bartz, J.; Domanskyi, S.; Peters, S. T.; Enninful, A.; Farzad, N.; Fan, R.; SenNet Team; Herr, B. W.Abstract
Cellular senescence is a hallmark of aging and a driver of functional decline across tissues, yet its heterogeneity and context dependence have limited systematic study. The Common Fund's Cellular Senescence Network (SenNet) Program addresses this challenge by generating multimodal, multi-tissue datasets that profile senescent cells across the human lifespan and complementary mouse models. The SenNet Data Portal (https://data.sennetconsortium.org) serves as the public gateway to these resources, providing open access to harmonized single-cell, spatial, imaging, transcriptomic, and proteomic data; senescence biomarker catalogs; and standardized protocols that can be used to comprehensively identify and characterize senescent cells in mouse and human tissue. As of April 2026, the portal hosts 2,041 publicly available human and mouse datasets across 15 organs using 6 general assay types. Experts from 13 Tissue Mapping Centers (TMCs) and 12 Technology Development and Application (TDAs) components contribute tissue data, analyze data, identify senescent biomarkers, and agree on panels for cross-tissue antibody harmonization. They also register human tissue data into the Human Reference Atlas (HRA) and develop user interfaces for the multiscale and multimodal exploration of this data. Built on a scalable hybrid cloud microservices architecture by the Consortium Organization and Data Coordinating Center (CODCC), the Portal enables data submission, management, integrated analysis, spatial context mapping, and harmonized access to cross-species data critical for aging research. This paper presents user needs, the Portal's architecture, data processing workflows, and senescence-focused analytical tools; usage scenarios illustrating applications in biomarker discovery, quality benchmarking, hypothesis generation, spatial analysis, cost-efficient profiling, and cell distance distribution analysis; and utility and usage by the larger researcher community. 
Current limitations and planned extensions, including expanded spatial-omics releases and improved tools for senotype characterization, are discussed. SenNet protocols, code, and user interfaces are freely available on https://docs.sennetconsortium.org/apis.
bioinformatics2026-05-04v2Semi supervised GAN for smart microscopy, fast and data efficient cell cycle classification
Manick, R.; El Habouz, Y.; Guillout, M.; Martin, C.; Bonnet, J.; Ruel, L.; Pastezeur, S.; Chanteux, O.; Bouchareb, O.; Tramier, M.; Pecreaux, J.Abstract
Modern optical microscopes are fully motorised; however, transforming them into truly smart systems requires real-time adjustment of acquisition settings in response to detected objects and dynamic biological events. At the core are classification algorithms that commonly depend on customised software and are generally designed for narrowly defined biological applications. In addition, they often require substantial annotated datasets for effective training. We introduce a semi-supervised generative adversarial network (SGAN) for robust cell-cycle stage classification under low-resource conditions, adaptable to diverse cellular structures. The framework combines unlabelled microscopy images with synthetically generated samples to mitigate limited annotation, while preserving stable performance even when the unlabelled subset is class-imbalanced. Tested on the Mitocheck dataset, which features five mitosis classes, the model achieved 93 ± 2% accuracy using only 80 labelled images per class and 600 unlabelled images. The proposed algorithm is generic and can be readily adapted to new labelling schemes, classification targets, cell lines, or microscopy modalities through transfer learning. SGAN is well suited for integration into automated microscopes, enabling efficient and adaptable image analysis across diverse biological and microscopy applications.
bioinformatics2026-05-04v2Species-specific transformer models of bacterial gene order and content for genomic surveillance tasks
Horsfield, S. T.; Wiatrak, M.; McInerney, J. O.; Bentley, S. D.; Colijn, C.; Lees, J. A.Abstract
Transformer models enable functionally meaningful representation of complex biological data, such as nucleotide or protein sequences. Existing foundation transformer models are trained on large multi-domain corpora of unlabelled DNA or protein data, showing unmatched task generalisation. However, these foundation models are often outperformed on domain-specific tasks, such as gene classification in prokaryotes, by models trained on taxonomically constrained data. By extension, species-specific transformer models hold promise for targeted analyses, provided that sufficient training data are available. Epidemiological analysis of bacterial pathogens exemplifies the use case of species-specific transformers, due to the wealth of genome data available, coupled with pathogen-specific analyses carried out during routine and outbreak surveillance. Here, we trained a transformer model, PanBART, on the gene content and gene order of two important and biologically distinct bacterial pathogens, Escherichia coli and Streptococcus pneumoniae, benchmarking against state-of-the-art non-transformer approaches for genomic epidemiology. We show PanBART learns representations of population structure in an unsupervised manner, and can be used to accurately assign genomes to biologically meaningful sequence clusters. PanBART is also able to identify emergent lineages, differentiating them from pre-existing lineages, and can accurately predict which genomes are likely to acquire genes involved in antibiotic resistance before a transfer event has occurred. Finally, PanBART can be used to conduct co-selection analysis to identify pairs of genes likely to be found together. Our work demonstrates that species-specific transformer models can be employed in many critical public health scenarios. We lay the groundwork for wider application of such models in epidemiological analysis, and provide scenarios where such models excel.
bioinformatics2026-05-04v2Integrated transcriptomic and proteomic analyses identify novel biomarkers of bladder outlet obstruction
Bigger-Allen, A. A.; Das, B.; Tang, Y.; Costa, K.; Ocampo, G.-L.; Hashemi Gheinani, A.; DiMartino, S.; Kaull, J.; Froehlich, J.; Lee, R. S.; Adam, R.Abstract
Bladder outlet obstruction leads to pathological remodeling and emergence of lower urinary tract symptoms. Although relief of obstruction is associated with symptomatic improvement, it is not universally successful, reflecting persistent alterations in the bladder. Reliable surrogate biomarkers of obstruction are lacking, particularly early in the disease course before irreversible damage to the bladder may have occurred. In this study, re-analysis of publicly available transcriptomic datasets from diverse rodent models of obstruction identified tissue transcripts including Cthrc1, Grem1, Ltbp2 and Msn that were induced in response to injury. Candidate markers were validated experimentally in an independent model of neurogenic obstruction demonstrating time-dependent changes. Candidate markers were also attenuated with either surgical removal of obstruction or treatment with anticholinergic medication or inosine. Integrated analysis of tissue transcriptomics data and tissue and urine proteomics data from a model of neurogenic obstruction revealed significant concordance between markers observed in tissue and urine. Urinary proteomics analysis identified a statistically significant increase in MSN in patients with neurogenic bladder compared to unaffected controls. These findings identify tissue and urine biomarkers of both non-neurogenic and neurogenic obstruction that may reflect early changes in obstructive uropathy that could be monitored in a non-invasive manner.
bioinformatics2026-05-04v1spatiAlytica: Viewer-Grounded Multimodal Agentic System for Interactive Spatial Omics Analysis
Das, A.; Zhang, K.; Song, J.; Han, M.; Chen, A.; Meng, W.; Galloway, H.; Chen, P.-Y.; Jo, S.; Liu, Z.; Hasib, M. M.; Officer, A.; Sinha, H.; Chiu, Y.-C.; Gao, S.-J.; Li, L.; Huang, Y.Abstract
Spatial transcriptomics and proteomics map tissue architecture and cellular interactions, but analysis remains limited by programming demands and text-centered AI agents that lack viewer grounding and cross-turn context. We present spatiAlytica, a viewer-centric multimodal interactive agentic system embedded in the Napari viewer that enables non-programmer biologists to perform iterative, hypothesis-driven spatial omics analysis via natural language. spatiAlytica couples viewer-state serialization, agentic memory, biological concept-to-data-field mapping, code generation and debugging, Spatial VQA, and grounded interpretation to support an exploratory analysis and interpretive reasoning workflow. We introduce spatiAlyticaBench, a comprehensive benchmark spanning 222 single-turn spatial analytical coding questions, 178 multi-turn sequential workflow questions, and 7,350 image-grounded reasoning questions. spatiAlytica outperformed strong agentic baselines while using less time and fewer tokens. Case studies across Kaposi's sarcoma (KS), colorectal cancer, and ovarian cancer recapitulated known spatial patterns and uncovered progressive CD8 T-cell dysfunction during KS progression.
bioinformatics2026-05-04v1Do Larger Models Really Win in Drug Discovery? A Benchmark Assessment of Model Scaling in AI-Driven Molecular Property and Activity Prediction
Guo, J.Abstract
The rapid growth of molecular foundation models and general-purpose large language models has encouraged a scale-centric view of artificial intelligence in drug discovery, in which larger pretrained models are expected to supersede compact cheminformatics models and task-specific graph neural networks (GNNs). We test this assumption on 22 molecular property and activity endpoints, including public ADMET and Tox21 benchmarks and two internal anti-infective activity datasets. Across 167,056 held-out task-molecule evaluations under structure-similarity-separated five-fold cross-validation (37,756 ADMET, 77,946 Tox21, 49,266 anti-TB and 2,088 antimalaria), classical machine-learning (ML) models such as RF(ECFP4) and ExtraTrees(RDKit descriptors) win ten primary-metric tasks, GNNs such as GIN and Ligandformer win nine, and pretrained molecular sequence models such as MoLFormer and ChemBERTa2 win three. Rule-based SAR reasoning baselines, represented by GPT5.5-SAR and Opus4.7-SAR, do not win under the prespecified primary metrics, although train-fold-derived SAR knowledge provides measurable but uneven gains for SAR reasoning and interpretation. These results indicate that compact, specialized models remain highly effective for molecular property and activity prediction. The performance differences among classical ML, GNN and pretrained sequence models are often modest and endpoint-dependent, whereas larger or more general models do not provide a universal predictive advantage. Large models may still add value for zero-shot reasoning, SAR interpretation and hypothesis generation, but the results suggest that predictive performance depends on the alignment among molecular representation, inductive bias, data regime, endpoint biology and validation protocol.
bioinformatics2026-05-04v1MeiCOfi: Meiotic CrossOver Finder in haploid, diploid, polyploid and hyper-recombinant genomes
Fuentes, R. R.; Fernandes, J. B.; Susanto, T.; Wang, Y.; Underwood, C. J.Abstract
During meiotic cell division, homologous chromosomes pair and recombine, leading to large reciprocal exchanges of genetic information. In most species, meiotic crossovers (COs) are crucial for normal chromosome segregation and they generate genetic diversity, which can be acted upon by natural selection in wild populations or by breeders to combine desirable traits in a genome. Identifying the position and frequency of COs is therefore essential in both classical genetics studies and breeding programmes. However, a computational tool capable of accurately detecting COs across diverse contexts, including varying marker densities, genome size and structure, recombination rate, and ploidy, remains lacking. We developed MeiCOfi (Meiotic CrossOver Finder) to detect meiotic crossover events at high resolution from low-coverage genome sequencing data. We evaluated it using data from Arabidopsis thaliana, rice, barley and both intra- and inter-specific tomato hybrids, encompassing a wide range of genome complexities and marker densities. It reliably detects crossovers in hyper-recombinant A. thaliana with up to 62 COs per backcross offspring and in haploid gametes from barley with sequencing coverage as low as 0.1x. It can identify crossovers in polyploid genomes, including simulated recombinant tetraploids and real data from tetraploid tomato hybrid offspring. Our results demonstrate that MeiCOfi can robustly identify crossovers in diverse genomic contexts.
bioinformatics2026-05-04v1MetaUmbra: Statistically Controlled Genome-Level Presence Inference from Metaproteomic Peptides
Wu, Q.; Ning, Z.; Zhang, A.; Cheng, K.; Figeys, D.Abstract
Taxonomic interpretation of metaproteomic peptides remains difficult because many peptide sequences are present in proteins from different organisms, reducing taxonomic specificity. Current peptide-centric workflows can report taxonomic summaries or taxon-level confidence scores, but they do not provide formal statistical evidence that a taxon is present in the microbiome. Here we present MetaUmbra, a tool that derives genome-level statistical significance values from identified peptides. MetaUmbra builds theoretical peptide lists by in silico digestion of the taxon-specific proteins and matches observed peptides against these references. It then combines a conservative significance estimate from unique peptides with a Monte Carlo-based p-value for shared peptide evidence estimated under an empirical null model. In the defined-community benchmark SIHUMIx, MetaUmbra identified the expected genomes without introducing false-positive genomes after embedding the SIHUMIx genomes in a large gut reference background. In the single-strain benchmark Mix24X, all expected genomes were identified with the best statistical significance values even after near-neighbor and full background expansion. In a hamster gut genome panel, MetaUmbra further preserved an interpretable ranking of candidate genomes in a dense real-data setting. Together, these results show that MetaUmbra can statistically identify the presence of specific microbes in a complex microbiome while maintaining low false-positive calls. MetaUmbra therefore provides a practical framework for converting peptide evidence into genome-level statistical inference in metaproteomics.
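The Monte Carlo p-value mentioned above can be illustrated with a generic sketch. The abstract does not specify MetaUmbra's test statistic or its empirical null model, so the example below only shows the standard add-one Monte Carlo estimator, p = (1 + number of null draws at least as large as the observed value) / (1 + number of draws). The function names, the toy binomial null, and all numbers are hypothetical and are not MetaUmbra's implementation.

```python
import random
from typing import Callable


def monte_carlo_p_value(
    observed: float,
    draw_null: Callable[[random.Random], float],
    n_draws: int = 9999,
    seed: int = 0,
) -> float:
    """Estimate a one-sided p-value by sampling a statistic under a null model.

    The add-one correction keeps the estimate strictly positive, so a finite
    number of draws never reports p = 0.
    """
    rng = random.Random(seed)
    # Count null draws that are at least as extreme as the observed statistic.
    exceed = sum(1 for _ in range(n_draws) if draw_null(rng) >= observed)
    return (exceed + 1) / (n_draws + 1)


def toy_null(rng: random.Random) -> float:
    """Hypothetical null: shared-peptide matches out of 50 peptides,
    each matching by chance with probability 0.1."""
    return float(sum(rng.random() < 0.1 for _ in range(50)))


# Hypothetical observed count of shared-peptide matches for one genome.
p = monte_carlo_p_value(observed=15, draw_null=toy_null)
print(f"Monte Carlo p-value: {p:.4f}")
```

With this estimator the smallest attainable p-value is 1 / (n_draws + 1), so the number of draws limits how strong the reported significance can be.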
bioinformatics2026-05-04v1Radiant DIA: A Fast, Sensitive, and Accurate Search Engine for Quantitative Proteomics
Just, S.; Cantrell, L. S.; Nichols, A.; Wang, J.; Kis, J.; Mohtashemi, I.; Platt, T.; Farokhzad, O.; Batzoglou, S.Abstract
In mass spectrometry-based proteomics, robust and efficient search engines are essential for accurate peptide and protein identification and quantification. Advances in sample preparation and instrumentation have increased the demand for highly scalable processing tools, with datasets comprising hundreds or thousands of samples in single-cell and population studies. Here we present Radiant DIA, a novel Data-Independent Acquisition search engine which achieves 4x faster processing and 10x lower cloud compute costs for large experiments while ensuring rigorous control of false discovery rate (FDR) and maintaining similar sensitivity, precision, and quantitative accuracy. The Radiant DIA search engine is paired with a modular pipeline deployable on cloud and desktop environments comprising individual modules for distributed re-scoring, FDR estimation, protein inference and quantification. Unlike traditional monolithic applications, this architecture enables high-performance, cloud-scale analysis without sacrificing local usability. Together, the Radiant DIA and Fulcrum Pipeline tools enhance computational efficiency to facilitate biological discovery in large-scale proteomics, as demonstrated by analyses of real-world experiments up to thousands of MS acquisitions.
bioinformatics2026-05-04v1