Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
A universal model for drug-receptor interactions
Menezes, F.; Wahida, A.; Froehlich, T.; Grass, P.; Zaucha, J.; Napolitano, V.; Siebenmorgen, T.; Pustelny, K.; Barzowska-Gogola, A.; Rioton, S.; Didi, K.; Bronstein, M.; Czarna, A.; Hochhaus, A.; Plettenburg, O.; Sattler, M.; Nissen-Meyer, J.; Conrad, M.; Kurzrock, R.; Popowicz, G. M.
Abstract
The genomic landscape of disease holds, in principle, the information required for rational therapeutic design. Genes encode proteins whose functions are tightly coupled to their three-dimensional structures via non-bonded interactions. Since the late 1970s, the advent of macromolecular crystallography inspired the notion that structural knowledge alone could enable a lock-and-key approach to drug design. However, this framework has failed to catalyze a step-change in the generation of new chemical matter. Drug discovery continues to depend on costly and largely serendipitous screening campaigns. Our understanding of, and reasoning from, non-bonded interaction chemistry is still too limited. Compounding this is the scarcity of novel chemistry and infinitesimal coverage of the chemical combinatorial space by current experimental data. To alleviate these problems, we show that a machine learning model can successfully learn and infer the principles of non-bonded interactions in the drug-receptor space. A reductionist approach to training data led to a model generalizing drug-target interactions to truly novel chemical matter without suffering from memorization bias. This work addresses that gap in drug discovery through a theoretical framework for predictive molecular recognition.
bioinformatics · 2026-03-24 · v10
FlashDeconv enables atlas-scale, multi-resolution spatial deconvolution via structure-preserving sketching
Yang, C.; Chen, J.; Zhang, X.
Abstract
Coarsening Visium HD resolution from 8 to 64 µm can flip cell-type co-localization from negative to positive (r = -0.12 → +0.80), yet investigators are routinely forced to coarsen because current deconvolution methods cannot scale to million-bin datasets. Here we introduce FlashDeconv, which combines leverage-score importance sampling with sparse spatial regularization to match top-tier Bayesian accuracy while processing 1.6 million bins in 153 seconds on a standard laptop. Systematic multi-resolution analysis of Visium HD mouse intestine reveals a tissue-specific resolution horizon (8-16 µm)--the scale at which this sign inversion occurs--validated by Xenium ground truth. Below this horizon, FlashDeconv provides, to our knowledge, the first sequencing-based quantification of Tuft cell chemosensory niches (15.3-fold stem cell enrichment). In a 1.6-million-bin human colorectal cancer cohort, FlashDeconv uncovers neutrophil inflammatory microdomains co-localized with immunoregulatory dendritic cells (mRegDC) at the tumor-stroma interface--spatial niches invisible to classification-based methods, which discard 97.7% of the relevant bins.
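FlashDeconv's structure-preserving sketch is more elaborate than can be shown here, but the leverage-score importance-sampling idea it builds on can be illustrated with a minimal stdlib sketch (function names and toy scores are hypothetical, not the package's API): sample bins with probability proportional to their leverage scores and carry inverse-probability weights so downstream sums stay unbiased.

```python
import random

def importance_sample(scores, k, seed=0):
    """Sample k indices with probability proportional to `scores`
    (e.g. precomputed leverage scores). Returns (index, weight)
    pairs; weight = 1/(k * p_i) keeps weighted sums unbiased."""
    rng = random.Random(seed)
    total = sum(scores)
    probs = [s / total for s in scores]
    picks = rng.choices(range(len(scores)), weights=probs, k=k)
    return [(i, 1.0 / (k * probs[i])) for i in picks]

# Unbiasedness on a toy signal: the weighted sketch sum tracks the
# full sum (here sum(values) = 5050).
values = [float(v) for v in range(1, 101)]
scores = [v * v for v in values]  # stand-in "leverage" scores
sketch = importance_sample(scores, k=5000)
estimate = sum(values[i] * w for i, w in sketch)
```

The sketch of k weighted bins can then stand in for the full million-bin matrix in the expensive optimization step.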
bioinformatics · 2026-03-24 · v3
Deconvolution of omics data in Python with Deconomix -- cellular compositions, cell-type specific gene regulation, and background contributions
Mensching-Buhr, M.; Sterr, T.; Voelkl, D.; Seifert, N.; Tauschke, J.; Engel, L.; Rayford, A.; Straume, O.; Grellscheid, S. N.; Beissbarth, T.; Zacharias, H. U.; Goertler, F.; Altenbuchinger, M. C.
Abstract
Background: Gene expression profiles derived from heterogeneous bulk samples contain signals from various cell populations. Cell-type deconvolution approaches are computational tools to reverse engineer the composition of bulk samples in terms of cell populations. Accurate estimates of cell compositions are crucial for identifying cell populations relevant for disease. Moreover, analyses such as the identification of differentially expressed genes can be confounded by cellular composition, as differences in gene expression may arise from both variations in cellular composition and gene regulation. Results: We present Deconvolution of omics data (Deconomix) - a comprehensive toolbox for the cell-type deconvolution of bulk transcriptomics data, available as a Python package and standalone graphical user interface. Deconomix stands apart from competing solutions with rich functionality and highly efficient implementations. It facilitates (A) the inference of cellular compositions from bulk transcriptomics data, (B) the machine learning-based optimization of gene weights to resolve small cell populations and to disentangle phenotypically related cells, (C) the inference of background contributions which otherwise would deteriorate cell-type deconvolution, and (D) population estimates of cell-type specific gene regulation. To showcase the application of Deconomix, we present a case study on breast cancer data from TCGA, highlighting subtype-specific cellular compositions and cell-type-specific gene-regulatory programs. Conclusion: We present Deconomix, a comprehensive Python package including a graphical user interface for the inference of cellular compositions, cell-type-specific gene regulation, and background contributions from bulk transcriptomics data. Keywords: Cell-type deconvolution, Bulk transcriptomics, Gene regulation, Gene expression analysis, Machine learning, Cellular composition inference
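Deconomix's model additionally handles gene weights and background contributions; the core composition-inference step (A) reduces to constrained regression against a cell-type signature matrix. A minimal stdlib sketch of that idea, using projected gradient descent on toy numbers (this is not Deconomix's API or exact objective):

```python
def estimate_composition(signature, bulk, steps=3000, lr=0.002):
    """Minimize ||S c - b||^2 subject to c >= 0 by projected
    gradient descent, then normalize c to proportions.
    signature: genes x celltypes; bulk: per-gene expression."""
    n_types = len(signature[0])
    c = [1.0 / n_types] * n_types
    for _ in range(steps):
        resid = [sum(row[t] * c[t] for t in range(n_types)) - b
                 for row, b in zip(signature, bulk)]
        for t in range(n_types):
            grad = 2.0 * sum(r * row[t] for r, row in zip(resid, signature))
            c[t] = max(0.0, c[t] - lr * grad)  # project onto c >= 0
    total = sum(c)
    return [x / total for x in c]

# Toy check: a bulk mixed as 70% type A and 30% type B is recovered.
S = [[10.0, 1.0], [1.0, 10.0], [5.0, 5.0], [8.0, 2.0]]
bulk = [0.7 * r[0] + 0.3 * r[1] for r in S]
props = estimate_composition(S, bulk)
```

Real tools replace this naive solver with regularized or probabilistic formulations, but the non-negativity constraint and normalization to proportions are the common core.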
bioinformatics · 2026-03-24 · v2
Micro16S: Universal Phylogenetic 16S rRNA Gene Representations for Deep Learning of the Microbiome
Bishop, H. V.; Ogilvie, O. J.; Dobson, R. C. J.; Herbold, C. W.
Abstract
Existing self-supervised microbiome models represent taxa as discrete, independent units restricted to fixed vocabularies, disregarding their evolutionary context. Here we present Micro16S, a deep learning approach that embeds 16S ribosomal RNA gene sequences into a continuous vector space according to phylogenetic relationships derived from the Genome Taxonomy Database. Using a combination of triplet and pair loss objectives, the model learns representations where spatial proximity reflects phylogenetic relatedness, while remaining largely invariant to the specific 16S rRNA region. Evaluations demonstrate taxonomically coherent clustering across most ranks and substantially improved region invariance compared to k-mer frequency baselines. A transformer pretrained on 50,418 unlabelled gut microbiome samples using these embeddings captured biologically meaningful community structure, though classical machine learning baselines outperformed Micro16S across six benchmark classification tasks, highlighting the limitations of the current system. These results establish the feasibility of phylogenetic embeddings for microbiome deep learning and identify mining algorithm design and class imbalance as primary targets for future improvement.
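The triplet objective mentioned above is standard: pull an anchor sequence's embedding toward a phylogenetically close taxon (positive) and push it away from a distant one (negative) by at least a margin. A minimal sketch of that loss (Micro16S's pair loss and exact distance function are not reproduced here):

```python
import math

def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(a, p) - d(a, n) + margin) with Euclidean distance:
    zero once the negative is at least `margin` farther than the
    positive, otherwise a linear penalty."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

# Satisfied triplet (negative is far enough): zero loss.
ok = triplet_loss([0.0, 0.0], [1.0, 0.0], [3.0, 0.0])
# Violated triplet (negative too close): positive loss.
bad = triplet_loss([0.0, 0.0], [1.0, 0.0], [1.5, 0.0])
```

Training minimizes this over sampled (anchor, positive, negative) triplets, which is why the abstract flags mining algorithm design (how triplets are chosen) as a key lever for improvement.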
bioinformatics · 2026-03-24 · v1
MiCBuS: Marker Gene Mining for Unknown Cell Types Using Bulk and Single Cell RNA-Seq Data
Zhang, S.; Lu, Y.; Luo, Q.; An, L.
Abstract
Motivation: Identifying cell type-specific expressed genes (marker genes) is essential for understanding the roles and interactions of cell populations within tissues. To achieve this, the traditional differential analysis approaches are often applied to individual cell-type bulk RNA-seq and single-cell RNA-seq data. However, real-world datasets often pose challenges, such as heterogeneous bulk RNA-seq and incomplete scRNA-seq. Heterogeneous bulk RNA-seq amalgamates gene expression profiles from multiple cell types and results in low resolution, while incomplete scRNA-seq does not capture some cell types from the tissue, leading to unknown cell types. Traditional methods fail to identify marker genes for such unknown cell types. Results: MiCBuS addresses this limitation by generating Dirichlet-pseudo-bulk RNA-seq based on bulk and incomplete single-cell RNA-seq data. By performing differential analysis of gene expressions on bulk and Dirichlet-pseudo-bulk RNA-seq samples, MiCBuS can identify the marker genes of unknown cell types, enabling the identification and characterization of these elusive cellular components. Simulation studies and real data analyses demonstrate that MiCBuS reliably and robustly identifies marker genes specific to unknown cell types, a capability that traditional differential analysis methods cannot achieve.
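The Dirichlet-pseudo-bulk construction can be sketched in a few lines: draw mixing proportions from a Dirichlet distribution and combine per-cell-type mean profiles from the (incomplete) scRNA-seq reference into a synthetic bulk. This is a simplified stdlib illustration with hypothetical names; MiCBuS's actual parameterization may differ.

```python
import random

def dirichlet(alphas, rng):
    """Dirichlet draw via normalized independent Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def dirichlet_pseudo_bulk(profiles, alpha, rng):
    """Mix per-cell-type mean expression profiles with
    Dirichlet-sampled proportions to mimic a bulk sample.
    profiles: dict celltype -> list of mean gene expressions."""
    types = sorted(profiles)
    props = dirichlet([alpha] * len(types), rng)
    n_genes = len(profiles[types[0]])
    bulk = [sum(p * profiles[t][g] for p, t in zip(props, types))
            for g in range(n_genes)]
    return bulk, dict(zip(types, props))

rng = random.Random(42)
profiles = {"B": [5.0, 1.0, 0.0], "T": [1.0, 6.0, 2.0]}
bulk, props = dirichlet_pseudo_bulk(profiles, alpha=1.0, rng=rng)
```

Differential analysis between real bulks and these pseudo-bulks then exposes genes whose expression the known cell types cannot explain, pointing at unknown-cell-type markers.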
bioinformatics · 2026-03-24 · v1
ERFMTDA: Predicting tsRNA-disease associations using an enhanced rotative factorization machine
Lan, W.; Wang, D.; Chen, W.; Yan, X.; Chen, Q.; Pan, S.; Pan, Y.
Abstract
Motivation: tRNA-derived small RNAs (tsRNAs) have emerged as a novel class of regulatory molecules implicated in the pathogenesis of many human diseases, making them promising biomarkers and therapeutic targets. However, existing computational methods for tsRNA-disease association prediction often overlook explicit biological attributes and complex feature interactions, limiting their predictive performance. Results: We propose ERFMTDA, an enhanced rotative factorization machine framework for predicting potential tsRNA-disease associations. ERFMTDA explicitly models complex interactions among heterogeneous biological features while integrating latent structural representations derived from the global association matrix. In addition, a biologically informed negative sampling strategy based on motif-level sequence similarity is introduced to improve the reliability of negative samples. Extensive experiments demonstrate that ERFMTDA consistently outperforms eleven state-of-the-art methods. Case studies on diabetic retinopathy and hepatocellular carcinoma further confirm its ability to prioritize biologically meaningful tsRNA-disease associations.
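A plain second-order factorization machine (the base model ERFMTDA enhances; the rotative extension is not reproduced here) scores pairwise feature interactions through low-rank factors, and owes its efficiency to an algebraic identity that turns the O(n^2 k) pairwise sum into O(n k). A self-checking sketch:

```python
def fm_naive(w0, w, V, x):
    """y = w0 + sum_i w_i x_i + sum_{i<j} <v_i, v_j> x_i x_j,
    computed directly over all feature pairs."""
    n, k = len(x), len(V[0])
    y = w0 + sum(wi * xi for wi, xi in zip(w, x))
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(V[i][f] * V[j][f] for f in range(k))
            y += dot * x[i] * x[j]
    return y

def fm_fast(w0, w, V, x):
    """Same model via the O(nk) reformulation:
    0.5 * sum_f [ (sum_i v_if x_i)^2 - sum_i (v_if x_i)^2 ]."""
    k = len(V[0])
    y = w0 + sum(wi * xi for wi, xi in zip(w, x))
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(len(x)))
        s2 = sum((V[i][f] * x[i]) ** 2 for i in range(len(x)))
        y += 0.5 * (s * s - s2)
    return y

x = [1.0, 0.5, 0.0, 2.0]
w = [0.1, -0.2, 0.3, 0.05]
V = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.4], [-0.2, 0.1]]
a, b = fm_naive(0.5, w, V, x), fm_fast(0.5, w, V, x)
```

The two formulations agree to floating-point precision, which is the standard correctness check when implementing FMs.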
bioinformatics · 2026-03-24 · v1
PACMON: Pathway-guided Multi-Omics data integration for interpreting large-scale perturbation screens
Qoku, A.; Stickel, T.; Amerifar, S.; Wolf, S.; Oellerich, T.; Buettner, F.
Abstract
High-throughput perturbation screens coupled with single-cell molecular profiling enable systematic interrogation of gene function, yet interpreting the resulting data in terms of biological pathways remains challenging. Existing approaches either identify latent gene modules without linking them to perturbations, or model perturbation effects without incorporating prior biological knowledge, limiting interpretability and scalability. Here, we introduce PACMON (Pathway-guided Multi-Omics data integration for interpreting large-scale perturbation screens), a Bayesian latent factor model that jointly infers pathway-level programs and their modulation by experimental perturbations. PACMON decomposes multimodal molecular measurements into shared latent factors aligned with known biological pathways through structured sparsity priors, while simultaneously estimating how each perturbation activates or represses these pathway programs. The framework naturally accommodates multiple data modalities and employs stochastic variational inference for scalable application to large datasets. We evaluate PACMON in three settings of increasing complexity. On synthetic data with known ground truth, PACMON achieves near-perfect recovery of pathway structure and perturbation effects, outperforming existing methods in both accuracy and computational scalability. Applied to a multimodal Perturb-CITE-seq screen of melanoma cells, PACMON recovers coherent interferon-signaling and cell-cycle programs spanning RNA and surface-protein modalities and identifies interpretable perturbation-pathway associations consistent with known immune-evasion mechanisms. Finally, we apply PACMON to the Tahoe-100M perturbation atlas - approximately 100 million cells and over 1,000 drug-dose combinations - producing the first pathway-level latent factor analysis at this scale and revealing biologically meaningful drug-response landscapes across Hallmark pathway programs. 
PACMON provides a unified, scalable and interpretable framework for mapping perturbation effects onto biological pathways in modern large-scale perturbation experiments.
bioinformatics · 2026-03-24 · v1
RiboPipe: efficient per-transcript codon-resolution ribo-seq coverage imputation for low-coverage transcripts
Zhang, Y.-z.; Hashimoto, S.; Li, S.; Inada, T.; Imoto, S.
Abstract
Motivation: Ribosome profiling (Ribo-seq) provides codon-resolution measurements of translation; however, many transcripts exhibit sparse or low read coverage, which limits downstream quantitative analyses. Reliable prediction and imputation of codon-resolution coverage for low-coverage transcripts remain computationally challenging. Results: We present RiboPipe, an efficient framework for per-transcript codon-resolution Ribo-seq coverage imputation for low-coverage transcripts. RiboPipe is designed around three key principles. First, it jointly optimizes transcript-level mean ribosome load (MRL) prediction and codon-level coverage modeling within a unified objective, enabling consistent learning across both local and transcript-level scales. Second, it introduces a peak-weighted loss that emphasizes high-signal codon positions associated with translational pausing, improving the recovery of functionally relevant coverage peaks. Third, the framework is lightweight and data-efficient, achieving stable performance even when trained on only a small fraction of high-coverage transcripts. Using two publicly available Ribo-seq datasets (GSE233886 and GSE133393), we demonstrate stable convergence and consistent prediction accuracy across multiple train-test split ratios. Comparative evaluation of embedding strategies shows that simple one-hot representations achieve competitive or even superior performance compared with pre-trained language model embeddings under identical training conditions. Overall, RiboPipe provides a computationally efficient and scalable framework for Ribo-seq coverage imputation in low-coverage transcripts. Availability and Implementation: The source code and associated data can be accessed at https://github.com/yaozhong/riboPipe
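The peak-weighted loss described above upweights codon positions with high observed coverage (candidate pause sites) so that errors there dominate the objective. A minimal stdlib sketch of one plausible form (threshold and boost values are illustrative; the paper's exact weighting scheme may differ):

```python
def peak_weighted_mse(y_true, y_pred, threshold=5.0, boost=4.0):
    """Weighted MSE over codon positions: positions whose observed
    coverage exceeds `threshold` get weight 1 + boost, so recovery
    of coverage peaks is emphasized over flat background."""
    weights = [1.0 + boost * (t > threshold) for t in y_true]
    num = sum(w * (t - p) ** 2
              for w, t, p in zip(weights, y_true, y_pred))
    return num / sum(weights)

# A 2-unit error at a peak position (coverage 10) contributes
# 5x more than the same error would at a background position.
loss = peak_weighted_mse([0.0, 0.0, 10.0, 0.0], [0.0, 0.0, 8.0, 0.0])
```

Under plain MSE the same prediction would score 1.0; the weighting raises it to 2.5, steering the model toward fitting peaks first.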
bioinformatics · 2026-03-24 · v1
From SNPs to Pathways: A genome-wide benchmark of annotation discrepancies and their impact on protein- and pathway-level inference
Queme, B.; Muruganujan, A.; Ebert, D.; Mushayahama, T.; Gauderman, W. J.; Mi, H.
Abstract
Background: Accurate single-nucleotide polymorphism (SNP) annotation is central to genomic research, yet widely used tools and gene models often yield divergent results. Prior studies have shown such discrepancies in small datasets, but the extent of genome-wide variation and its impact on downstream pathway analysis remain unclear. Results: We conducted a comprehensive comparison of three commonly used SNP annotation tools, ANNOVAR, SnpEff, and VEP, using both Ensembl and RefSeq gene models to evaluate more than 40 million SNPs from the Haplotype Reference Consortium. At the protein level, annotation output differed significantly across tools and gene models (p-adj < 0.001), with discrepancies present in both genic and intergenic regions. RefSeq produced broader annotation coverage, particularly for intergenic SNPs, while Ensembl showed greater internal consistency. SnpEff provided the most complete coverage overall, whereas no single tool or model configuration achieved full annotation recovery of the union reference. Integration across tools and models maximized coverage and reduced annotation loss. In a case study of 204 colorectal cancer-associated SNPs from the FIGI GWAS, pathway enrichment results varied depending on annotation strategy. The fully integrated approach identified all four significant pathways, whereas several single-tool or single-model strategies missed one or more. Conclusion: SNP annotation outcomes are influenced by both the tool and gene model used, and relying on a single approach may result in incomplete coverage. A multi-tool, multi-model strategy provides the most comprehensive annotation and preserves enriched pathways, supporting more robust and reproducible genomic interpretation.
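The multi-tool integration the benchmark recommends amounts to a per-SNP union of gene assignments across tools and gene models. A minimal sketch (the tool names are real, but the toy annotations and gene symbols are invented for illustration):

```python
def integrate_annotations(per_tool):
    """Union per-SNP gene annotations across tools/gene models.
    per_tool: dict tool -> dict snp_id -> set of gene symbols.
    Returns the merged dict, which by construction covers at
    least as many SNPs and genes as any single tool."""
    merged = {}
    for ann in per_tool.values():
        for snp, genes in ann.items():
            merged.setdefault(snp, set()).update(genes)
    return merged

per_tool = {
    "ANNOVAR": {"rs1": {"GENE_A"}, "rs2": {"GENE_B"}},
    "SnpEff":  {"rs1": {"GENE_A", "GENE_C"}, "rs3": {"GENE_D"}},
    "VEP":     {"rs2": {"GENE_B"}},
}
merged = integrate_annotations(per_tool)
```

The merged gene set is what would then feed pathway enrichment, which is why the integrated strategy recovers pathways that single-tool gene lists miss.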
bioinformatics · 2026-03-24 · v1
dreampy: Pseudobulk mixed-model differential expression for single-cell RNA-seq in Python
Wells, S. B.; Shahnawaz, H.; Jones, J. L.
Abstract
dreampy is a Python implementation of the R dreamlet framework for pseudobulk differential expression analysis of single-cell RNA-seq data. dreamlet combines voom precision-weighted linear mixed models with empirical Bayes moderation to handle batch effects, repeated measures, and other hierarchical structure in multi-donor studies, but exists entirely within the R/Bioconductor ecosystem. dreampy reproduces this pipeline natively in Python, integrating with AnnData and the scverse ecosystem.
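The pseudobulk step that precedes the mixed-model testing is simple to state: sum per-cell counts into one profile per (donor, cell type). A stdlib sketch of that aggregation (dreampy itself operates on AnnData objects; this toy version uses plain lists):

```python
from collections import defaultdict

def pseudobulk(counts, donors, cell_types):
    """Sum per-cell gene counts into one profile per
    (donor, cell_type) pair -- the aggregation that turns
    single-cell data into pseudobulk samples.
    counts: list of per-cell gene-count lists."""
    agg = defaultdict(lambda: [0] * len(counts[0]))
    for row, d, ct in zip(counts, donors, cell_types):
        bucket = agg[(d, ct)]
        for g, v in enumerate(row):
            bucket[g] += v
    return dict(agg)

counts = [[1, 0], [2, 3], [0, 5]]
donors = ["d1", "d1", "d2"]
cell_types = ["T", "T", "T"]
pb = pseudobulk(counts, donors, cell_types)
```

Each resulting profile is then treated as one observation in the voom-weighted linear mixed model, with donor as a random effect.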
bioinformatics · 2026-03-24 · v1
TCRseek: Scalable Approximate Nearest Neighbor Search for T-Cell Receptor Repertoires via Windowed k-mer Embeddings
Yang, Y.
Abstract
The rapid growth of T-cell receptor (TCR) sequencing data has created an urgent need for computational methods that can efficiently search CDR3 sequences at scale. Existing approaches either rely on exact pairwise distance computation, which scales quadratically with repertoire size, or employ heuristic grouping that sacrifices sensitivity. Here we present TCRseek, a two-stage retrieval framework that combines biologically informed sequence embeddings with approximate nearest neighbor (ANN) indexing for scalable search over TCR repertoires. TCRseek first encodes CDR3 amino acid sequences into fixed-length numerical vectors through a multi-scale windowed k-mer embedding scheme derived from BLOSUM62 eigendecomposition, then indexes these vectors using FAISS-based structures (IVF-Flat, IVF-PQ, or HNSW-Flat) that support sublinear-time search. A second-stage reranking module refines the shortlisted candidates using exact sequence alignment scores (Needleman--Wunsch with BLOSUM62), Levenshtein distance, or Hamming distance. We benchmarked TCRseek against tcrdist3, TCRMatch, and GIANA on a 100,000-sequence corpus with precomputed exact ground truth under three distance metrics. Under cross-metric evaluation---where the reranking and ground truth metrics differ, providing the most informative test of generalization---TCRseek achieved NDCG@10 = 0.890 (Levenshtein ground truth) and 0.880 (Hamming ground truth), ranking highest among the retained baselines under Hamming and remaining competitive with tcrdist3 (0.894) under Levenshtein. When the reranking metric matches the ground truth definition (BLOSUM62 alignment), NDCG@10 reached 0.993, confirming that the ANN shortlist captures >99% of true neighbors---the expected ceiling of the two-stage design. On the 100,000-sequence corpus, TCRseek achieved 3.6--39.6x speedup over exact brute-force search depending on index type and distance metric, with the largest gains for alignment-based retrieval. 
These results demonstrate that embedding-based ANN search provides a practical and scalable alternative for TCR repertoire analysis.
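The two-stage shape of TCRseek (coarse vector shortlist, then exact rerank) can be sketched without its actual components: here a plain 3-mer cosine similarity stands in for the BLOSUM62-derived windowed embedding, brute-force scoring stands in for the FAISS index, and Levenshtein distance is one of the paper's rerank metrics. A stdlib sketch under those substitutions:

```python
import math
from collections import Counter

def kmer_vec(seq, k=3):
    """Bag-of-k-mers vector for a CDR3 amino-acid sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def cosine(a, b):
    dot = sum(v * b[g] for g, v in a.items() if g in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def levenshtein(s, t):
    """Edit distance by the standard one-row dynamic program."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]

def search(query, corpus, shortlist=10, top=3):
    """Stage 1: shortlist by k-mer cosine similarity.
    Stage 2: exact rerank of the shortlist by edit distance."""
    qv = kmer_vec(query)
    cand = sorted(corpus, key=lambda s: -cosine(qv, kmer_vec(s)))[:shortlist]
    return sorted(cand, key=lambda s: levenshtein(query, s))[:top]

corpus = ["CASSLGQAYEQYF", "CASSLGQGYEQYF", "CAWSVSDLAKNIQYF"]
hits = search("CASSLGQAYEQYF", corpus)
```

The production version replaces stage 1 with a sublinear-time FAISS lookup, which is where the reported 3.6-39.6x speedups come from.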
bioinformatics · 2026-03-24 · v1
A comprehensive reference database to support untargeted metabolomics in Pseudomonas putida
Ross, D. H.; Chang, C.; Vasquez, J.; Overstreet, R.; Schultz, K.; Metz, T.; Bade, J.
Abstract
Pseudomonas putida strain KT2440 is a crucial model organism for synthetic biology and bioengineering applications, yet there currently exists no comprehensive metabolomics database comparable to those available for other model organisms. This gap hinders the use of untargeted metabolomics for exploratory analyses in this system. We developed the P. putida metabolome reference database (PPMDB v1) to address this limitation by consolidating metabolite information from multiple sources and expanding coverage through computational predictions. The database was constructed by curating metabolites from BioCyc, BiGG, and other literature sources, then computationally expanding this collection using BioTransformer environmental transformation predictions to generate additional predicted metabolites. We enhanced the database's utility for molecular annotation in metabolomics studies by incorporating analytical properties including collision cross-sections, tandem mass spectra, and gas-phase infrared spectra. These analytical properties were gathered from existing measurement data or predicted using computational tools. We further augmented the database through inclusion of reaction information and pathway annotations, facilitating biological interpretation of metabolomics data. This publicly available resource fills a critical gap in P. putida research infrastructure, supporting metabolite annotation and biological interpretation in untargeted metabolomics studies and enabling in-depth exploratory analyses of this important synthetic biology platform at the molecular level.
bioinformatics · 2026-03-24 · v1
Col-Ovo: Smartphone-based artificial intelligence for rapid counting of Aedes mosquito eggs under field conditions
Almanza, J.; Montenegro, D.
Abstract
Background: OviCol has recently been proposed as a disruptive strategy for the surveillance and control of synanthropic Aedes mosquitoes, vectors of dengue, Zika, and chikungunya viruses. The approach integrates monitoring and control through ultra-low-cost ovitraps (~0.2 USD), bioattractants, and egg inactivation using hot water. However, large-scale ovitrap surveillance generates thousands of egg substrates that require time-consuming manual counting, creating a major operational bottleneck. To address this limitation, we developed Col-Ovo, an artificial intelligence-based tool for automated counting of Aedes aegypti eggs from real field samples, together with OviLab, a digital platform for annotation, curation, and management of entomological image datasets. Methodology/Principal Findings: The detection model was trained using YOLOv11m on a dataset of 275 oviposition substrates (20.5 cm strips) collected under routine operational conditions. Images were captured in situ without preprocessing and included substrates heavily stained by bioattractants such as blackstrap molasses and dry yeast (Saccharomyces cerevisiae), as well as sand and particulate debris, reflecting realistic field conditions. The system was designed to operate with standard smartphone images and tolerate compression artifacts produced by messaging platforms such as WhatsApp. Performance was evaluated by comparing automated egg counts with expert manual counts and with virtual-human counts conducted in OviLab using >200% image magnification. Col-Ovo achieved >95% agreement with expert counts and 88% agreement with OviLab while reducing processing time from approximately 15 minutes to <3 seconds per sample. Conclusions/Significance: Col-Ovo enables rapid, scalable quantification of Ae. aegypti eggs from smartphone images, addressing a critical operational barrier in ovitrap-based surveillance. 
The system requires no image preprocessing or specialized hardware and is accessible through a lightweight web interface supported by an AI architecture that allows retraining for new ecological contexts or additional Aedes species. Integrated with OviLab, this platform provides a flexible digital infrastructure that can strengthen routine vector surveillance and community-level control programs across regions where Aedes mosquitoes continue to expand.
bioinformatics · 2026-03-24 · v1
AI-readiness for Biomedical Data
Clark, T.; Caufield, H.; Parker, J. A.; Al Manir, S.; Amorim, E.; Eddy, J.; Gim, N.; Gow, B.; Goar, W.; Hansen, J. N.; Harris, N.; Hermjakob, H.; Joachimiak, M.; Jordan, G.; Lee, I.-H.; McWeeney, S. K.; Nebeker, C.; Nikolov, M.; Reese, J.; Shaffer, J.; Sheffield, N.; Sheynkman, G.; Stevenson, J.; Chen, J. Y.; Mungall, C.; Wagner, A.; Kong, S. W.; Ghosh, S. S.; Patel, B.; Williams, A.; Munoz-Torres, M. C.
Abstract
Biomedical research is rapidly adopting artificial intelligence (AI). Yet the inherent complexity of biomedical data preparation requires implementing actionable, robust criteria for ethical and explainable AI (XAI) at the "pre-model" stage, encompassing data acquisition, detailed transformations, and ethical governance. Simple conformance to FAIR (Findable, Accessible, Interoperable, Reusable) Principles is insufficient. Here, we define criteria and practices for reliable AI-readiness of biomedical data, developed by the NIH Bridge to Artificial Intelligence (Bridge2AI) Standards Working Group across seven core dimensions of dataset AI-readiness: FAIRness, Provenance, Characterization, Ethics, Pre-model Explainability, Sustainability, and Computability. Conformance to these criteria provides a basis for pre-model scientific rigor and ethical integrity, mitigating downstream risks of bias and error before AI modeling. We apply and evaluate these standards across all four Bridge2AI flagship datasets, spanning functional genomics to clinical medicine, and encode them in machine-actionable metadata bound to the datasets. This framework sets a benchmark for preparing ethical, reusable datasets in biomedical AI and provides standardized methods for reliable pre-model data evaluation.
bioinformatics · 2026-03-23 · v5
Variable performance of widely used bisulfite sequencing methods and read mapping software for DNA methylation
Kerns, E. V.; Weber, J. N.
Abstract
DNA methylation (DNAm) is the most commonly studied marker in ecological epigenetics, yet the performance of library preparation strategies and bioinformatic tools is seldom assessed in genetically variable natural populations. We profiled DNAm in threespine stickleback (Gasterosteus aculeatus) liver tissue, using reduced representation bisulfite sequencing (RRBS) and whole genome bisulfite sequencing (WGBS) across technical and biological replicates. We additionally collated publicly available RRBS and WGBS data from taxonomically diverse organisms, and then compared how the most commonly used methylation software (Bismark) performed relative to alternative pipelines (BWA meth, BiSulfite Bolt, and Biscuit). Even after choosing parameters to maximize Bismark's mapping efficiency, it was still outperformed by all other methods. Surprisingly, newer tools overrepresented DNAm compared to older methods, highlighting the importance of testing methods on nonmodel organisms. There were also distinct differences in DNAm profiles produced across library preparation methods, with large impacts of population and read depth filters. Methylated sites unique to WGBS predominantly mapped to introns and intergenic regions, while sites unique to RRBS primarily overlapped with promoters and exons. Moreover, the prevalence of nucleotides with intermediate methylation (within individuals) was greatly reduced in RRBS. Together, this suggests that RRBS may be more useful for detecting functionally-relevant methylation differences. Based on these results, we provide methodological recommendations for improving the reliability and utility of DNAm profiles, particularly concerning the detection of functionally relevant DNAm differences in genetically diverse natural populations.
bioinformatics2026-03-23v4AI-Enhanced Adaptive Virtual Screening Platform Enabling Exploration of 69 Billion Molecules Discovers Structurally Validated FSP1 Inhibitors
Cecchini, D.; Nigam, A.; Tang, M.; Reis, J.; Koop, M.; Gottinger, A.; Nicoll, C. R.; Wang, Y.; Jayaraj, A.; Cinaroglu, S. S.; Törner, R.; Malets, Y.; Gehev, M.; Padmanabha Das, K. M.; Churion, K.; Kim, J.; Thomas, N.; Li, Y.; Seo, H.-S.; Dhe-Paganon, S.; Secker, C.; Haddadnia, M.; Hasson, A.; Li, M.; Kumar, A.; Levin-Konigsberg, R.; Choi, E.-B.; Shapiro, G. I.; Cox, H.; Sebastian, L.; Braithwaite, C.; Bashyal, P.; Radchenko, D. S.; Kumar, A.; Yang, L.; Aquilanti, P.-Y.; Gabb, H.; Alhossary, A.; Wagner, G.; Aspuru-Guzik, A.; Moroz, Y. S.; Kalodimos, C. G.; Fackeldey, K.; Schuetz, J. D.; MattevAbstract
Identifying potent lead molecules for specific targets remains a major bottleneck in drug discovery. As structural information about proteins becomes increasingly available, ultra-large virtual screenings (ULVSs) which computationally evaluate billions of molecules offer a powerful way to accelerate early-stage drug discovery. Here, we introduce AdaptiveFlow, an open-source platform designed to make ULVSs more accessible, scalable, and efficient. AdaptiveFlow provides free access to a screening-ready version of the Enamine REAL Space, the largest library of ready-to-dock, drug-like molecules, containing 69 billion compounds that we prepared using the ligand preparation module of the platform. A key innovation of the platform is its use of a multi-dimensional grid of molecular properties, which helps researchers explore and prioritize chemical space more effectively and reduce the computational costs by a factor of approximately 1000. This grid forms the basis of a new method for identifying promising regions of chemical space, enabling systematic exploration and prioritization of compound libraries. An optional active learning component can further accelerate this process by adaptively steering the search toward molecules most likely to bind a given target. To support a broad range of applications, AdaptiveFlow is compatible with over 1,500 docking methods. The platform achieves near-linear scaling on up to 5.6 million CPUs in the AWS Cloud, setting a new benchmark for large-scale cloud computing in drug discovery. Using this approach, we identified nanomolar inhibitors of two disease-relevant targets: ferroptosis suppressor protein 1 (FSP1) and poly(ADP-ribose) polymerase 1 (PARP-1). 
By leveraging newly solved crystal structures of FSP1 in complex with NAD+, FAD, and coenzyme Q1, we validated these hits experimentally and determined the first co-crystal structures of FSP1 bound to small-molecule inhibitors, enabling insights into inhibitor binding mechanisms previously unknown. With its high scalability, flexibility, and open accessibility, AdaptiveFlow offers a powerful new resource for discovering and optimizing drug candidates at an unprecedented scale and speed.
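AdaptiveFlow's multi-dimensional property grid is described only at a high level above; one plausible reading is that molecules are binned by tuples of computed properties so each region of chemical space can be screened via a few representatives. A minimal stdlib sketch of that bucketing idea (all names, bin edges, and properties here are hypothetical, not the platform's schema):

```python
def grid_cell(props, bins):
    """Map a molecule's property vector to a grid-cell index tuple.
    props: e.g. (mol_weight, logP); bins: per-property (lo, hi, n)."""
    cell = []
    for v, (lo, hi, n) in zip(props, bins):
        i = int((v - lo) / (hi - lo) * n)
        cell.append(min(max(i, 0), n - 1))  # clamp out-of-range values
    return tuple(cell)

def bucket_library(library, bins):
    """Group molecules by grid cell so that screening can sample
    representatives per cell instead of docking every molecule."""
    grid = {}
    for name, props in library.items():
        grid.setdefault(grid_cell(props, bins), []).append(name)
    return grid

bins = [(0.0, 600.0, 6), (-2.0, 6.0, 4)]   # MW in Da, logP
library = {"mol_a": (320.0, 1.2), "mol_b": (340.0, 1.4),
           "mol_c": (510.0, 4.8)}
grid = bucket_library(library, bins)
```

Screening a handful of molecules per occupied cell, then refining only promising cells, is one way such a grid could cut docking cost by orders of magnitude, consistent with the ~1000x reduction claimed above.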
bioinformatics · 2026-03-23 · v3
Identification of Distinct Topological Structures From High-Dimensional Data
Xu, B.; Braun, R.
Abstract
Single-cell RNA sequencing allows the direct measurement of the expression of tens of thousands of genes, providing an unprecedented view of the transcriptomic state of a cell. Within each cell, different biological processes such as differentiation or cell cycle take place simultaneously, each providing a different characterization of cell state. To identify gene sets that govern these processes for the purpose of disentangling convolved biological processes, we present "Identification of Distinct topological structures" (ID). ID works by constructing an alternative low-dimensional parametrization of the high-dimensional system, applying a finite perturbation to this alternative parametrization, and looking for genes that respond similarly. With this approach, we demonstrate that ID is capable of identifying structures within the data that will otherwise be missed. We further demonstrate the utility of ID in scRNA-seq datasets collected under various backgrounds, delineating cellular differentiation, characterizing cellular response to external perturbation, and dissecting the effect of genetic knock-outs.
bioinformatics · 2026-03-23 · v3
VINE: Variational inference for scalable Bayesian reconstruction of species and cell-lineage phylogenies
Siepel, A.; Hassett, R.; Staklinski, S. J.
Abstract
Bayesian methods are now widely used in reconstructing both species and cell-lineage phylogenies, but they remain heavily reliant on computationally intensive Markov chain Monte Carlo sampling. Phylogenetic variational inference (VI) circumvents this dependency but so far has been limited in speed and scalability. Here we introduce Variational Inference with Node Embeddings (VINE), a computational method that combines an embedding of taxa in a high-dimensional space and a distance-based "decoder" with several algorithmic innovations to dramatically improve phylogenetic VI. VINE supports both standard DNA substitution models and CRISPR barcode-mutation models for inference of cell-lineage trees and tissue-migration histories. In extensive simulation experiments, we show that VINE is comparable in accuracy to the best available Bayesian methods with speeds orders of magnitude faster. We then apply VINE to ~1,000 complete SARS-CoV-2 genomes and ~900 lung-cancer cell barcodes, showing reductions in compute time from days to hours or minutes.
bioinformatics | 2026-03-23 | v2
Single-cell spatial multi-omics molecular pathology enabled by SuperFocus
Lu, Y.; Tian, X.; Vicari, M.; Enninful, A.; Bao, S.; Bai, Z.; Liu, C.; Zhang, X.; Andren, P.; Lundeberg, J.; Xu, M. L.; Fan, R.; Xiao, Y.; Ma, Z.
Abstract
Histopathology and molecular pathology are currently distinct diagnostic modalities for the most part, one revealing tissue morphology at cellular resolution and the other providing molecular measurements with limited or no spatial context. Projecting genome-scale molecular information onto histopathology images at single-cell resolution across whole tissue sections represents a long-sought goal for next-generation pathology. Here we present SuperFocus, a modality-agnostic computational platform that generates histopathology-integrated single-cell spatial multi-omics from spot-based spatial measurements acquired on the same or an adjacent section without requiring external reference data. SuperFocus combines constrained cascading imputation with feature-level and cell-level quality-control scores to reduce spurious predictions and quantify confidence. On a ground-truth spatial transcriptomics benchmark dataset, SuperFocus improves key accuracy metrics by 28-73% over existing methods. Across Patho-DBiT, spatial ATAC-RNA, spatial CITE-seq and Visium-MALDI-MSI (SMA) datasets, SuperFocus enables cell-resolved analyses of MALT lymphoma microenvironments, gene regulatory programs in human hippocampus, lipotoxic hepatocyte states in human MASH, and transcriptomic-metabolomic states linked to neurotransmission and neuroinflammation in Parkinsonian mouse brain. Overall, SuperFocus enables scalable whole-slide single-cell spatial multi-omics integrated with histopathology, bridging histology and genome-scale molecular profiling for next-generation molecular pathology.
bioinformatics | 2026-03-23 | v2
Breaking the Extraction Bottleneck: A Single AI Agent Achieves Statistical Equivalence with Human-Extracted Meta-Analysis Data Across Five Agricultural Datasets
Halpern, M.
Abstract
Background: Data extraction is the primary bottleneck in meta-analysis, consuming weeks of researcher time with single-extractor error rates of 17.7%. Existing LLM-based systems achieve only 26-36% accuracy on continuous outcomes, and no study has validated AI-extracted continuous data against multiple independent datasets using formal equivalence testing. Methods: A single AI agent (Claude Opus 4.6) extracted treatment means, control means, sample sizes, and variance measures from source PDFs across five published agricultural meta-analyses spanning zinc biofortification, biostimulant efficacy, biochar amendments, predator biocontrol, and elevated CO2 effects on plant mineral nutrition. Observations were matched to reference standards using an LLM-driven alignment method. Validation employed proportional TOST equivalence testing, ICC(3,1), Bland-Altman analysis, and source-type stratification. Results: Across five datasets, the agent produced 1,149 matched observations from 136 papers. Pearson correlations ranged from 0.984 to 0.999. Proportional TOST confirmed statistical equivalence for all five datasets (all p < 0.05). Table-sourced observations achieved 5.5x lower median error than figure-sourced observations. Aggregate effects were reproduced within 0.01-1.61 pp of published values. Independent duplicate runs confirmed extraction stability (within 0.09-0.23 pp). Conclusions: A single AI agent achieves statistical equivalence with human-extracted meta-analysis data across five independent agricultural datasets. The approach reduces extraction cost by approximately one to two orders of magnitude while maintaining accuracy sufficient for aggregate meta-analytic pooling.
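The equivalence logic reported above (proportional TOST) can be illustrated with a generic two one-sided-tests procedure on paired measurements. This is a minimal sketch with simulated data and a hypothetical ±0.5 equivalence margin, not the paper's validation pipeline:

```python
import numpy as np
from scipy import stats

def tost_paired(x, y, low, high):
    """Two one-sided tests (TOST) for equivalence of paired measurements.

    Equivalence is supported if the mean difference (x - y) lies inside
    [low, high] with both one-sided tests significant. Returns the
    larger of the two one-sided p-values.
    """
    d = np.asarray(x, float) - np.asarray(y, float)
    n = d.size
    se = d.std(ddof=1) / np.sqrt(n)
    t_low = (d.mean() - low) / se    # H0: mean difference <= low
    t_high = (d.mean() - high) / se  # H0: mean difference >= high
    p_low = stats.t.sf(t_low, df=n - 1)
    p_high = stats.t.cdf(t_high, df=n - 1)
    return max(p_low, p_high)

rng = np.random.default_rng(0)
human = rng.normal(10.0, 1.0, 200)           # simulated "reference" extractions
agent = human + rng.normal(0.0, 0.05, 200)   # near-identical simulated AI extractions
p = tost_paired(agent, human, -0.5, 0.5)     # hypothetical +/-0.5 margin
assert p < 0.05  # equivalence supported at the 5% level
```

Unlike a conventional t-test, a small p-value here supports similarity rather than difference, which is why equivalence testing is the appropriate frame for validating extractions against a reference standard.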
bioinformatics | 2026-03-23 | v2
REMAG: recovery of eukaryotic genomes from metagenomic data using contrastive learning
Gomez-Perez, D.; Raguideau, S.; Warring, S.; James, R.; Hildebrand, F.; Quince, C.
Abstract
Metagenome-assembled genomes (MAGs) are central to exploring microbial communities. Yet, despite the relevance of protists and fungi to diverse ecosystems, eukaryotic MAG recovery lags behind that of prokaryotes. A major bottleneck is that most state-of-the-art binning pipelines rely exclusively on prokaryotic single-copy core gene reference databases and are optimized for smaller genomes. To address this gap, we present REMAG (Recovery of Eukaryotic MAGs), a tool designed to recover high-quality eukaryotic genomes from long-read metagenomic data. REMAG leverages fine-tuned HyenaDNA genomic foundation models to efficiently filter eukaryotic contigs. It then employs a dual-encoder Siamese network trained with Barlow Twins contrastive loss to learn a shared embedding space by integrating contig composition and differential coverage. Finally, high-quality bins are extracted using greedy iterative Leiden clustering optimized with eukaryotic single-copy core gene constraints. In benchmarks based on simulated mixed prokaryotic/eukaryotic communities and real datasets of varying sizes and origins, we demonstrate REMAG's ability to recover more near-complete eukaryotic genomes than existing state-of-the-art tools, which often produce highly fragmented eukaryotic bins. REMAG provides an automated eukaryotic binning method that scales effectively with the increasing size and sequencing depth of metagenomic datasets.
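The Barlow Twins objective mentioned above can be sketched in a few lines of NumPy: the cross-correlation matrix between two standardized embedding views is pushed toward the identity (diagonal to 1 for invariance, off-diagonal to 0 for redundancy reduction). This is a generic illustration with random embeddings and hypothetical batch/dimension sizes, not REMAG's trained encoders:

```python
import numpy as np

def barlow_twins_loss(z1, z2, lam=5e-3):
    """Barlow Twins redundancy-reduction loss on two embedding views.

    z1, z2: (batch, dim) embeddings of two views of the same items
    (e.g. a composition encoder vs. a coverage encoder). Each dimension
    is standardized over the batch, then the cross-correlation matrix
    is compared to the identity.
    """
    n = z1.shape[0]
    z1 = (z1 - z1.mean(0)) / z1.std(0)
    z2 = (z2 - z2.mean(0)) / z2.std(0)
    c = z1.T @ z2 / n                                 # cross-correlation matrix
    on_diag = np.sum((np.diag(c) - 1.0) ** 2)         # invariance term
    off_diag = np.sum(c ** 2) - np.sum(np.diag(c) ** 2)  # decorrelation term
    return on_diag + lam * off_diag

rng = np.random.default_rng(1)
z = rng.normal(size=(256, 32))
perfect = barlow_twins_loss(z, z)                     # identical views: near-zero loss
noisy = barlow_twins_loss(z, z + rng.normal(size=z.shape))
assert perfect < noisy
```

A key appeal of this loss for binning-style tasks is that it needs no negative pairs, so batches of contig fragments from the same genome can be used directly.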
bioinformatics | 2026-03-23 | v2
Translating Histopathology Foundation Model Embeddings into Cellular and Molecular Features for Clinical Studies
Cui, S.; Sui, Z.; Li, Z.; Matkowskyj, K. A.; Yu, M.; Grady, W. M.; Sun, W.
Abstract
AI-powered pathology foundation models provide general-purpose representations of histopathological images by encoding image tiles into numerical embeddings. However, these embeddings are not directly interpretable in biological or clinical terms and must be translated into biologically meaningful features, such as cell-type composition or gene expression, to enable downstream clinical applications. To bridge this gap, we developed STpath, a framework that integrates histopathology image embeddings derived from existing pathology foundation models with matched, spatially resolved transcriptomics data. STpath consists of cancer-specific XGBoost models trained to infer cell-type compositions and gene expression from histopathology image tiles. We evaluated STpath in colorectal and breast cancer datasets and showed that it provides accurate estimates of the composition of major cell types and the expression of a subset of genes, with further performance gains achieved by combining embeddings from multiple foundation models. Finally, we demonstrated that STpath-inferred features can be used in downstream studies to evaluate associations with clinical outcomes.
bioinformatics | 2026-03-23 | v2
ChEA-KG: Human Transcription Factor Regulatory Network with a Knowledge Graph Interactive User Interface
Byrd, A. I.; Evangelista, J. E.; Lachmann, A.; Chung, H.-Y.; Jenkins, S. L.; Ma'ayan, A.
Abstract
Gene expression is controlled by transcription factors (TFs) that selectively bind to and unbind from DNA to regulate mRNA expression of all human genes. TFs control the expression of other TFs, forming a complex gene regulatory network (GRN) with switches, feedback loops, and other regulatory motifs. Many experimental and computational methods have been developed to reconstruct the human intracellular GRN. Here we present a different approach. By submitting thousands of up and down gene sets from the RummaGEO resource for TF enrichment analysis with ChEA3, we distill signed and directed edges that connect human TFs to construct a high-quality human GRN. The GRN has 131,581 signed and directed edges connecting 701 source TF nodes to 1,559 target TF nodes. The GRN is accessible via the ChEA-KG web server application, which provides interactive network visualization and analysis tools. Users may query the GRN for single or pairs of TFs or submit gene sets to perform TF enrichment analysis with ChEA3, placing the enriched TFs within the GRN. To demonstrate the utility of ChEA-KG, several TF-centric atlases are also made available via the ChEA-KG website. These atlases host TF subnetworks that regulate 131 major normal human cell types (Cell Type Atlas); 69 tumour subtypes from 10 cancers (Cancer Atlas); 30 consensus perturbation response signatures for common mechanisms of action (MoA Atlas); and 24 aging signatures from tissues profiled by GTEx. Overall, ChEA-KG is an interactive web-server application that offers users a new way to explore the human gene regulatory network through both network visualization and transcription factor enrichment analysis. The ChEA-KG application is available from: https://chea-kg.maayanlab.cloud/.
bioinformatics | 2026-03-23 | v2
Learning gene interactions from tabular gene expression data using Graph Neural Networks
Boulougouri, M.; Nallapareddy, M. V.; Vandergheynst, P.
Abstract
Gene interactions form complex networks underlying disease susceptibility and therapeutic response. While bulk transcriptomic datasets offer rich resources for studying these interactions, applying Graph Neural Networks (GNNs) to such data remains limited by a lack of methodological guidance, especially for constructing gene interaction graphs. We present REGEN (REconstruction of GEne Networks), a GNN-based framework that simultaneously learns latent gene interaction networks from bulk transcriptomic profiles and predicts patient vital status. Evaluated across seven cancer types in the TCGA cohort, REGEN outperforms baseline models in five datasets and provides robust network inference. By systematically comparing strategies for initializing gene-gene adjacency matrices, we derive practical guidelines for GNN application to bulk transcriptomics. Analysis of the learned kidney cancer gene network reveals cancer-related pathways and biomarkers, validating the model's biological relevance. Together, we establish a principled approach for applying GNNs to bulk transcriptomics, enabling improved phenotype prediction and meaningful gene network discovery.
bioinformatics | 2026-03-23 | v1
Solving the Diagnostic Odyssey with Synthetic Phenotype Data
Colangelo, G.; Marti, M.
Abstract
The space of possible phenotype profiles over the Human Phenotype Ontology (HPO) is combinatorially vast, whereas the space of candidate disease genes is far smaller. Phenotype-driven diagnosis is therefore highly non-bijective: many distinct symptom profiles can correspond to the same gene, but only a small fraction of the theoretical phenotype space is biologically and clinically plausible. When a structured ontology exists, this constraint can be exploited to generate realistic synthetic cases. We introduce GraPhens, a simulation framework that uses gene-local HPO structure together with two empirically motivated soft priors, one over the number of observed phenotypes per case and one over phenotype specificity, to generate synthetic phenotype-gene pairs that are novel yet clinically plausible. We use these synthetic cases to train GenPhenia, a graph neural network that reasons over patient-specific phenotype subgraphs rather than flat phenotype sets. Despite being trained entirely on synthetic data, GenPhenia generalizes to real, previously unseen clinical cases and outperforms existing phenotype-driven gene-prioritization methods on two real-world datasets. These results show that when patient-level data are scarce but a structured ontology is available, principled simulation can provide effective training data for end-to-end neural diagnosis models.
bioinformatics | 2026-03-23 | v1
FuzzyClusTeR: a web server for analysis of tandem and diffuse DNA repeat clusters with application to telomeric-like repeats
Aksenova, A. Y.; Zhuk, A. S.; Lada, A. G.; Sergeev, A. V.; Volkov, K. V.; Batagov, A.
Abstract
DNA repeats constitute a large fraction of eukaryotic genomes and play important roles in genome stability and evolution. While tandem repeats such as microsatellites have been extensively studied, the genomic organization and potential functions of dispersed or loosely organized repeat patterns remain poorly understood. Here we present FuzzyClusTeR, a web server for the identification, visualization and enrichment analysis of DNA repeat clusters in genomic sequences. Using parameterized metrics, FuzzyClusTeR detects both classical tandem repeats and regions where related motifs occur in proximity without forming perfect tandem arrays, which we term diffuse (or fuzzy) repeat clusters. The server supports analysis of user-defined sequences as well as genome-scale datasets, including the T2T-CHM13 and GRCh38 human genome assemblies, and provides interactive visualization and statistical tools for assessing the genomic distribution of repetitive motifs and corresponding clusters. As a demonstration, we analyzed telomeric-like repeats in the T2T-CHM13v2.0 genome and identified families of diffuse clusters enriched in these motifs. Comparison with simulated sequences suggests that these clusters represent non-random genomic patterns with potential evolutionary and functional significance. FuzzyClusTeR enables systematic exploration of repeat clustering across genomic regions or entire genomes. It is available at https://utils.researchpark.ru/bio/fuzzycluster.
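The notion of a diffuse repeat cluster (related motifs recurring in proximity without forming a perfect tandem array) can be illustrated with a toy sliding-window detector. The window size and hit threshold below are hypothetical, and FuzzyClusTeR's parameterized metrics are considerably more elaborate; this sketch only conveys the basic idea:

```python
import re

def fuzzy_clusters(seq, motif, window=60, min_hits=3):
    """Flag regions where `motif` recurs at least `min_hits` times within
    `window` bases, even when the copies are separated by spacer sequence.
    Returns merged (start, end) intervals. Parameters are illustrative."""
    # Lookahead regex finds overlapping motif occurrences.
    hits = [m.start() for m in re.finditer(f"(?={motif})", seq)]
    clusters = []
    i = 0
    for j in range(len(hits)):
        while hits[j] - hits[i] > window:   # shrink window from the left
            i += 1
        if j - i + 1 >= min_hits:
            clusters.append((hits[i], hits[j] + len(motif)))
    merged = []                             # merge overlapping intervals
    for s, e in clusters:
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged

# Three telomeric-like TTAGGG copies separated by short spacers form
# one diffuse cluster, even though they are not a perfect tandem array.
seq = "ACGT" * 5 + "TTAGGGacgTTAGGGtgaTTAGGGcc" + "ACGT" * 5
assert fuzzy_clusters(seq, "TTAGGG", window=40, min_hits=3) == [(20, 44)]
```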
bioinformatics | 2026-03-23 | v1
EvoMut: A Computational Framework for Engineering Oxidative Stability in Proteins
Arab, S. S.; Lewis, N. E.
Abstract
Amino acid oxidation is a major cause of protein instability and loss of function in therapeutic and industrial settings. Although methionine, cysteine, tyrosine, and tryptophan residues are widely recognized as oxidation-prone, only a subset of such residues are dominant functional hotspots, and not all are suitable targets for mutation. Identifying these vulnerable yet engineerable sites remains a major challenge. Here, we present EvoMut, a residue-level analytical framework for evaluating both oxidative vulnerability and mutation feasibility. EvoMut estimates oxidation risk by integrating structural features, local functional context, intrinsic chemical susceptibility, and evolutionary conservation. A central feature of the framework is the explicit separation of oxidation risk from mutation feasibility: candidate substitutions are evaluated only after high-risk residues are identified and ranked by evolutionary substitution patterns. Application of EvoMut to multiple proteins, and evaluation with experimental data, showed that oxidation-prone residues differ markedly in their engineering potential. EvoMut distinguishes residues that are both oxidation-sensitive and evolutionarily permissive from those that are chemically vulnerable but functionally constrained. By providing residue-level mechanistic insight, EvoMut offers a practical framework for the rational design of oxidation-resistant proteins. EvoMut is freely available as a web server at https://evomut.org.
bioinformatics | 2026-03-23 | v1
Time-Resolved Phosphoproteomics-Guided BFS Beam Search Reveals Cell-Type-Specific EGFR Signaling Architectures and SHP2 Inhibitor-Induced Pathway Rewiring
Lee, H.; Lee, G.
Abstract
Background: The epidermal growth factor receptor (EGFR) orchestrates highly context-dependent intracellular signaling networks whose architecture varies across cell types and is frequently rewired by targeted therapeutics. Systems-level reconstruction of these networks from phosphoproteomic data remains challenging because phosphorylation measurements identify signaling nodes but do not reveal the interaction paths that propagate signals between proteins. Results: We developed a computational framework integrating time-resolved phosphoproteomics with graph traversal algorithms to reconstruct EGFR-initiated signaling pathways across three cellular contexts and treatment conditions. A sign-assignment preprocessing procedure converts quantitative phosphorylation measurements into binary activation states across time points, defining a condition-specific active node set that filters the protein-protein interaction network. Breadth-First Search combined with interaction-weighted Beam Search is then applied to the STRING interaction database (v11.5) to enumerate candidate signaling paths. Applying this framework to phosphoproteomic datasets from EGF-stimulated HeLa cells, EGF-stimulated MDA-MB-468 triple-negative breast cancer (TNBC) cells, and EGF-stimulated MDA-MB-468 cells pretreated with the SHP2 inhibitor SHP099 yielded 260 paths in HeLa cells (117 unique topologies), 293 paths in MDA-MB-468 cells (155 unique), and 292 paths under SHP2 inhibition (85 unique). HeLa cells displayed a SRC-centered architecture dominated by ERBB2 and SHC1 first-hop effectors, converging on focal adhesion, HSP90 chaperone, CRKL adaptor, and integrin signaling arms.
SHP2 inhibition abolished PTPN11-mediated pathways and induced PIK3CA dominance (69.2% first-hop), accompanied by compensatory ERBB3 engagement and a computationally predicted SYK/VAV1/LCP2 node set whose biological role warrants experimental validation. Conclusions: Time-resolved phosphoproteomics-guided BFS Beam Search over STRING interaction networks captures cell-type-specific EGFR signaling architectures and drug-induced pathway rewiring. This framework provides a systematic approach for transforming phosphoproteomic measurements into mechanistically interpretable signaling hypotheses specific to each cellular context, directly applicable to drug-resistance modeling.
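The BFS-plus-beam-search idea described above can be sketched generically: expand paths breadth-first through a confidence-weighted interaction graph, restrict expansion to the active node set, and keep only the top-scoring partial paths at each depth. The toy network below uses invented weights, not STRING data, and the scoring (product of edge weights) is one common choice, not necessarily the paper's:

```python
from heapq import nlargest

def beam_search_paths(edges, source, active, beam_width=5, max_depth=4):
    """Breadth-first path expansion with interaction-weighted beam pruning.

    edges: dict node -> list of (neighbor, weight in [0, 1]); only
    neighbors in `active` (e.g. phospho-responsive proteins) are expanded.
    At each depth, only the `beam_width` highest-scoring partial paths
    survive, scored by the product of their edge weights.
    """
    beam = [((source,), 1.0)]
    complete = []
    for _ in range(max_depth):
        candidates = []
        for path, score in beam:
            for nbr, w in edges.get(path[-1], []):
                if nbr in active and nbr not in path:  # simple paths only
                    candidates.append((path + (nbr,), score * w))
        if not candidates:
            break
        beam = nlargest(beam_width, candidates, key=lambda c: c[1])
        complete.extend(beam)
    return complete

# Toy EGFR-proximal network with hypothetical confidence weights.
edges = {
    "EGFR": [("SHC1", 0.9), ("PIK3CA", 0.8), ("SRC", 0.7)],
    "SHC1": [("GRB2", 0.9)],
    "GRB2": [("SOS1", 0.8)],
    "PIK3CA": [("AKT1", 0.9)],
}
active = {"EGFR", "SHC1", "GRB2", "SOS1", "PIK3CA", "AKT1", "SRC"}
paths = beam_search_paths(edges, "EGFR", active, beam_width=3)
assert ("EGFR", "SHC1", "GRB2", "SOS1") in [p for p, _ in paths]
```

Product scoring implicitly favors short paths, so frameworks of this kind often report paths grouped by depth or normalize scores by path length.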
bioinformatics | 2026-03-23 | v1
The Risk of Gulf Birds Functional Diversity Loss with Climate Change Uncovered Using Deep Learning Population Models
Li, L.; Bai, J.; Sun, S.; Zuzuarregui, M.; Wang, Z.
Abstract
Climate change and sea-level rise (SLR) pose increasing threats to coastal ecosystems and biodiversity in the Gulf of America. Most efforts to anticipate these threats focus on species counts or range shifts, while changes in species' functional diversity remain largely unexplored. We estimated climate change and sea level rise impacts on hundreds of bird species populations and corresponding functional diversity shifts. We used a generative deep learning method, the Gaussian Mixture Variational Autoencoder (GMVAE), together with Trait Probability Density analysis to study these impacts. We found that the generative GMVAE model uncovered species' unobserved ranges, and that climate change reduced coastal ecosystem resilience and caused biodiversity loss across multiple dimensions, including functional richness, redundancy, evenness, and divergence. Surprisingly, the most impacted areas are not the exposed shoreline but the landward coastal transition zones. Specifically, shoreline functional diversity increased with climate change and sea level rise, whereas uplands showed declining functional diversity and increasing redundancy, indicating contraction of functional trait space. Furthermore, avian biodiversity expanded in coastal protected areas, which served as refugia embedded in a surrounding landscape where unique combinations of species traits are lost.
bioinformatics | 2026-03-23 | v1
Rastair: an integrated variant and methylation caller
Etzioni, Z.; Zhao, L.; Hertleif, P.; Schuster-Boeckler, B.
Abstract
Cytosine methylation is a crucial epigenetic mark that impacts tissue-specific chromatin conformation and gene expression. For many years, bisulfite sequencing (BS-seq), which converts all non-methylated cytosine (C) to thymine (T), remained the only approach to measure cytosine methylation at base resolution. Recently, however, several new methods that convert only methylated cytosines to thymine (mC[->]T) have become widely available. Here we present rastair, an integrated software toolkit for simultaneous SNP detection and methylation calling from mC[->]T sequencing data such as those created with Watchmaker's TAPS+ and Illumina's 5-Base chemistries. Rastair combines machine-learning-based variant detection with genotype-aware methylation estimation. Using NA12878 benchmark datasets, we show that rastair outperforms existing methylation-aware SNP callers and achieves F1 scores exceeding 0.99 for datasets above 30x depth, matching the accuracy of state-of-the-art tools run on whole-genome sequencing data. At the same time, rastair is significantly faster than other genetic variant callers: processing a 30x-depth file takes less than 30 minutes on 32 CPU cores of an Intel Xeon, and half as long when a GPU is available. By integrating genotyping with methylation calling, rastair reports an additional 500,000 positions in NA12878 where a SNP turns a non-CpG reference position into a "de-novo" CpG. Conversely, rastair also identifies positions where a variant disrupts a CpG and corrects their reported methylation levels. Rastair produces standard-compliant outputs in VCF, BAM and BED formats, facilitating integration into downstream analysis pipelines. Rastair is open-source and available via conda, Dockerhub, and as pre-compiled binaries from https://www.rastair.com.
bioinformatics | 2026-03-23 | v1
TogoMCP: Natural Language Querying of Life-Science Knowledge Graphs via Schema-Guided LLMs and the Model Context Protocol
Kinjo, A. R.; Yamamoto, Y.; Bustamante-Larriet, S.; Labra-Gayo, J. E.; Fujisawa, T.
Abstract
Querying the RDF Portal knowledge graph maintained by DBCLS, which aggregates more than 70 life-science databases, requires proficiency in both SPARQL and database-specific RDF schemas, placing this resource beyond the reach of most researchers. Large Language Models (LLMs) can, in principle, translate natural-language questions into executable SPARQL, but without schema-level context, they frequently fabricate non-existent predicates or fail to resolve entity names to database-specific identifiers. We present TogoMCP, a system that recasts the LLM as a protocol-driven inference engine orchestrating specialized tools via the Model Context Protocol (MCP). Two mechanisms are essential to its design: (i) the MIE (Metadata-Interoperability-Exchange) file, a concise YAML document that dynamically supplies the LLM with each target database's structural and semantic context at query time; and (ii) a two-stage workflow separating entity resolution via external REST APIs from schema-guided SPARQL generation. On a benchmark of 50 biologically grounded questions spanning five types and 23 databases, TogoMCP achieved a large improvement over an unaided baseline (Cohen's d = 0.92, Wilcoxon p < 10^-6), with win rates exceeding 80% for question types with precise, verifiable answers. An ablation study identified MIE files as the single indispensable component: removing them reduced the effect to a non-significant level (d = 0.08), while a one-line instruction to load the relevant MIE file recovered the full benefit of an elaborate behavioral protocol. These results suggest a general design principle: concise, dynamically delivered schema context is more valuable than complex orchestration logic.
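The benchmark statistics used above (paired Cohen's d alongside a Wilcoxon signed-rank test) can be computed generically with SciPy. The per-question scores below are simulated for illustration, not TogoMCP's benchmark data:

```python
import numpy as np
from scipy import stats

def cohens_d_paired(x, y):
    """Paired Cohen's d: mean of the differences over their SD."""
    d = np.asarray(x, float) - np.asarray(y, float)
    return d.mean() / d.std(ddof=1)

rng = np.random.default_rng(2)
baseline = rng.normal(0.55, 0.15, 50)                 # simulated per-question scores, unaided
schema_guided = baseline + rng.normal(0.15, 0.1, 50)  # simulated scores with schema context
d = cohens_d_paired(schema_guided, baseline)
w = stats.wilcoxon(schema_guided, baseline)           # nonparametric paired test
assert d > 0.8          # "large" effect by Cohen's convention
assert w.pvalue < 0.05
```

Reporting both is common practice: Cohen's d conveys effect magnitude, while the Wilcoxon test gives significance without assuming normally distributed score differences.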
bioinformatics | 2026-03-23 | v1
Single-cell Landscape of T Cell Heterogeneity in Kawasaki Disease: STAT3/JAK Axis Regulates the Lineage Differentiation Bias of Th17 Cells
Song, S.; Zong, Y.; Xu, Y.; Chen, L.; Zhou, Y.; Chen, L.; Li, G.; Xiao, T.; Huang, M.
Abstract
Background: Kawasaki disease (KD) is a pediatric systemic vasculitis in which T-cell-mediated immune responses play a pivotal role. However, the precise dynamic evolution of T-cell subsets during disease progression remains poorly understood. Methods: Single-cell RNA sequencing (scRNA-seq) was employed to perform high-resolution annotation of peripheral blood mononuclear cells (PBMCs) from healthy controls and KD patients, both pre- and post-IVIG treatment. T-cell developmental trajectories were reconstructed via Monocle3-based pseudotime analysis. Furthermore, the functional significance of the implicated pathway was validated in a CAWS-induced KD murine model. Results: A high-resolution single-cell landscape identified 13 distinct T-cell subtypes. Pseudotime analysis revealed a significant lineage commitment of CD4+ T cells toward a Th17 phenotype during the acute phase of KD, synchronized with the transcriptional upregulation of the STAT3/JAK signaling axis. Animal experiments further demonstrated that pharmacological inhibition of this pathway substantially attenuated inflammatory infiltration in the cardiac vasculature of KD mice. Conclusion: This study identifies the STAT3/JAK-mediated Th17 differentiation bias as a potential regulatory program associated with acute inflammation in Kawasaki disease, thereby highlighting the STAT3/JAK axis as a potential therapeutic target.
bioinformatics | 2026-03-23 | v1
Structurally Restricted Message-Passing within Shallow Architectures for Explainable Network-Level Brain Decoding on Small Cohorts
Marques dos Santos, J. D.; Ramos, M. B.; Reis, L. P.; Marques dos Santos, J. P.; Direito, B.
Abstract
The application of artificial intelligence (AI) to functional magnetic resonance imaging (fMRI) has gained increasing attention due to its ability to model complex, high-dimensional brain data and capture nonlinear patterns of neural activity. However, deep learning architectures, such as Graph Neural Networks (GNNs), typically require large sample sizes to achieve stable convergence, limiting their applicability in neuroimaging contexts where data are often scarce. This challenge highlights the need for compact, data-efficient models that maintain predictive performance and interpretability. Shallow neural networks (SNNs) have demonstrated robustness in low-sample settings but commonly rely on region-level features that treat brain areas independently, overlooking the brain's intrinsically network-based organization. To address this limitation, we propose a structurally constrained message-passing framework that integrates diffusion tensor imaging (DTI)-derived structural connectivity with region-level fMRI signals within a shallow architecture. This approach enables network-level modeling while preserving the stability and data efficiency of SNNs. The method is evaluated on 30 subjects performing a Theory of Mind (ToM) task from the Human Connectome Project Young Adult dataset. A baseline SNN achieved global accuracies of 88.2% (fully connected), 80.0% (pruned), and 84.7% (retrained), while the proposed model achieved 87.1%, 77.6%, and 84.7%, respectively. Although structural constraints led to a more pronounced performance decrease after pruning, retraining restored accuracy to baseline levels, demonstrating that biological constraints can be incorporated without compromising predictive validity. Model interpretability was assessed using SHAP (Shapley Additive Explanations). 
While the baseline model primarily identified isolated regions as key contributors, the proposed framework revealed distributed, structurally coherent networks as the main drivers of classification. These networks showed correspondence with established ToM regions, including the temporo-parietal junction, superior temporal sulcus, and inferior frontal gyrus. Importantly, the findings suggest that groups of moderately informative regions can collectively form highly relevant subnetworks. Overall, the proposed framework achieves competitive performance in a limited dataset while incorporating graph-inspired message passing into a shallow architecture. Its explainability provides insight into how structurally constrained networks support stimulus-driven responses in ToM and demonstrates potential for investigating network dysfunction in disorders such as Alzheimer's disease, ADHD, autism spectrum disorder, bipolar disorder, mild cognitive impairment, and schizophrenia.
bioinformatics | 2026-03-23 | v1
Epidemiology of Legionella: Genome-bAsed Typing (el_gato) - a new bioinformatic tool for identifying sequence-based types of Legionella pneumophila from whole genome sequencing data
Collins, A. J.; Mashruwala, D.; Chivukula, V.; Kozak-Muiznieks, N. A.; Rishishwar, L.; Norris, E. T.; Willby, M. J.; Hamlin, J.; Overholt, W. A.
Abstract
Sequence-based typing (SBT) via Sanger sequencing has been the standard for describing Legionella pneumophila relatedness for two decades. SBT involves sequencing seven loci, identifying alleles using the United Kingdom Health Security Agency (UKHSA) database, and inferring the corresponding sequence type (ST). While similar SBT approaches for other organisms can be easily adapted to whole-genome sequencing (WGS), L. pneumophila presents several challenges for this adaptation: multiple copies of one locus (mompS) and extensive heterogeneity in a second locus (neuA/neuAh). Although several computational methods have been proposed to address these issues, a WGS-based replacement with equal resolution to traditional SBT has been elusive. To address this gap, we developed el_gato (Epidemiology of Legionella: Genome-bAsed Typing; https://github.com/CDCgov/el_gato), which offers several advantages over existing methods: (1) a novel approach for resolving multiple mompS alleles identified in the same isolate, (2) the ability to capture diverse neuA/neuAh alleles, (3) fast single-threaded execution with an average of 27.7 seconds per sample, (4) easy installation via Bioconda or Docker and (5) an updated database as of March 2025. el_gato works with either paired-end short reads or genome assemblies, performing more accurately with paired-end short reads at least 250 base pairs (bp) in length. We compared el_gato against two other in silico SBT tools (mompS, hereafter referred to as mompS tool and legsta) using a dataset of 441 isolates with sequence types (STs) previously determined by Sanger-based sequencing. el_gato correctly identified the ST for 98.9% of the test isolates, compared to 95.2% for the mompS tool and 42.2% for legsta, demonstrating a significant improvement compared to the mompS tool (adjusted p = 1.06e-3) and legsta (adjusted p = 4.24e-55) in ST identification.
Furthermore, el_gato's determination of ST was not significantly different from Sanger sequencing (adjusted p = 0.442). In summary, el_gato significantly improves in silico SBT and, given its growing adoption, is poised to support the public health community.
bioinformatics | 2026-03-23 | v1
CoPISA: Combinatorial Proteome Integral Solubility/Stability Alteration analysis
Zangene, E.; Gholizadeh, E.; Vadadokhau, U.; Ritz, D.; Saei, A.; Jafari, M.
Abstract
Combination therapies are widely used in acute myeloid leukemia (AML), but systematic datasets capturing proteome-wide responses to multi-drug perturbations remain limited. Here we present CoPISA (Combinatorial Proteome Integral Solubility/Stability Alteration), a quantitative proteomics assay designed to profile protein solubility and stability responses to single and combined drug treatments. The dataset includes two AML drug pairs (LY3009120-sapanisertib and ruxolitinib-ulixertinib) applied to four AML cell lines (MOLM-13, MOLM-16, SKM-1, and NOMO-1) under control, single-agent, and combination conditions in both lysate and intact-cell formats. Thermal solubility profiling coupled with TMT-based multiplexed LC-MS/MS generated 16 TMT16-plex experiments comprising 192 LC-MS/MS raw files, providing deep proteome coverage across treatments and biological contexts. The resource includes raw and processed proteomics data, detailed experimental metadata in Sample and Data Relationship Format (SDRF), and reproducible analysis scripts for reporter normalization, protein-level aggregation, statistical modeling, and classification of combinatorial response patterns. The experimental design enables identification of proteins responding uniquely to combination treatments as well as overlapping single-agent effects. Technical validation demonstrates reproducible quantification across multiplex experiments and assay formats. All data are publicly available through the PRIDE repository (PXD066812) together with analysis code, enabling independent reanalysis and method development. This dataset provides a benchmark resource for studying proteome responses to drug combinations, comparing lysate and intact-cell perturbation profiles, developing computational approaches for combinatorial target inference, and supporting training in computational proteomics.
bioinformatics2026-03-23v1NLCD: A method to discover nonlinear causal relations among genes
Easwar, A.; Narayanan, M.Abstract
Distinguishing correlation from causation is a fundamental challenge in many scientific fields, including biology, especially when interventions like randomized controlled trials are infeasible and only observational data are available. Methods based on statistical tests of conditional independence within the Mendelian Randomization framework can detect causality between two observed variables that are each associated with a third instrumental variable. However, these methods for detecting causal relationships between traits (e.g., two gene expression or clinical traits associated with a genetic variant, all observed in the same population) often assume a linear relationship, thereby hindering the discovery of causal gene networks from genomics data. We have developed NLCD, a method for NonLinear Causal Discovery from genomics data based on nonlinear regression modeling and conditional feature importance scoring. NLCD uses these techniques to extend the statistical tests in an existing linear causal discovery method called the Causal Inference Test (CIT). We benchmarked NLCD against current state-of-the-art methods: CIT, Findr, and MRPC. On simulated datasets, NLCD performs comparably to most methods in detecting linear relations (Average AUPRC (Area Under the Precision-Recall Curve) of NLCD=0.94, CIT=0.94, Findr=0.94, and MRPC=0.99), and outperforms them in detecting nonlinear (sine and sawtooth type) relations between two genes (Average AUPRC of NLCD=0.76, CIT=0.60, Findr=0.56, and MRPC=0.73). When tested on a nonlinear subset of a yeast genomic dataset to recover known causal relations involving transcription factors, NLCD and CIT performed comparably to each other and slightly better than Findr and MRPC (Average AUPRC of NLCD=0.82, CIT=0.81, Findr=0.71, and MRPC=0.54).
On application to a human genomic dataset, NLCD revealed active causal gene pairs (IRF1→PSME1 and HLA-C→HLA-T) in muscle tissue, and clarified the promises and challenges of discovering causal gene networks in human tissues in vivo.
bioinformatics2026-03-23v1MHCBind: A Pan- and Allele-Specific Model for Predicting Class I MHC-Peptide Binding Affinity
Peddi, N.; Bijjula, D. R.; Gogte, S.; Kondaparthi, V.Abstract
Major Histocompatibility Complex (MHC) molecules are essential to the immune system because they bind and present peptide antigens to T cells, enabling immune recognition and response. The specificity of MHC-peptide interactions is crucial for understanding immune-related diseases, developing personalized immunotherapies, and designing effective vaccines. Current computational methods, while powerful, often rely on a single type of molecular information, usually sequence, and implicitly model the interaction between the two molecules. To address these limitations, we introduce MHCBind, a novel deep learning framework that captures a more comprehensive and biologically relevant view of the binding event. MHCBind's architecture employs a dual-view feature extraction strategy for both the MHC and the peptide. A Graph Attention Network (GAT) learns topological features from predicted residue contact maps, while a parallel 1D Convolutional Neural Network (CNN) captures multi-scale patterns from sequence embeddings. These four distinct feature sets are then integrated in a cross-fusion module that uses an attention mechanism to model interactions between the two molecules. Finally, a multi-layer perceptron (MLP) regression head maps the fused interaction signature to a precise binding affinity score. In rigorous comparative benchmarks against established tools such as NetMHCpan, MHCFlurry, and MHCnuggets, MHCBind demonstrates superior performance, achieving a significantly lower average prediction error (RMSE: 0.1485) and a higher correlation (PCC: 0.7231) in allele-specific contexts. For pan-allele tasks, it excels at correctly ranking peptides with a superior Spearman's correlation (SCC: 0.7102), a crucial advantage for practical applications. The framework's design is inherently flexible, excelling in both allele-specific and pan-allele prediction tasks.
bioinformatics2026-03-23v1General-purpose embeddings for long-read metagenomic sequences via β-VAE on multi-scale k-mer frequencies
Nielsen, T. N.; Lui, L. M.Abstract
Long-read metagenomics routinely produces millions of assembled contigs, creating a need for methods that organize sequences into biologically meaningful groups across samples and environments. We present a general-purpose compositional embedding for metagenomic sequences based on a β-variational autoencoder (β-VAE) trained on multi-scale k-mer frequencies (1-mers through 6-mers; 2,772 features with centered log-ratio transformation). The embedding compresses each contig into a 384-dimensional vector that preserves local compositional similarity, enabling similarity search and graph-based clustering from sequence composition alone. Through systematic comparison of fifteen models trained on up to 17.4 million contigs (525.5 Gbp) from brackish, terrestrial, and reference genome sources, we find that a small set of curated prokaryotic reference genomes (656,000 contigs) outperforms ten-fold larger domain-specific training sets, and that neither reconstruction loss nor Spearman correlation reliably predicts downstream clustering quality. On nearest-neighbor graphs, flow-based clustering (MCL) markedly outperforms modularity-based methods (Leiden), yielding 12,123 clusters from 154,041 contigs (≥100 kbp) with 99.2% phylum-level purity confirmed by independent marker gene phylogenetics. Multi-method taxonomic annotation achieves 87% coverage and reveals that 16.4% of contigs are eukaryotic, the single largest component invisible to standard prokaryotic annotation tools. The embedding provides a sample-independent coordinate system for organizing metagenomic sequence space at scale.
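The centered log-ratio (CLR) featurization of multi-scale k-mer frequencies described above can be sketched as follows. This is an illustrative stand-in only: it counts all 4^k k-mers for k ≤ 3 with a pseudocount, whereas the paper's 2,772-feature input for k = 1..6 presumably collapses reverse complements; the function name and parameters are hypothetical.

```python
import itertools
import math

def kmer_clr_features(seq, kmax=3, pseudo=0.5):
    """Multi-scale k-mer frequency vector with a per-scale centered
    log-ratio (CLR) transform. Illustrative sketch: counts all 4^k
    k-mers per scale (not canonical k-mers as the paper likely does)."""
    seq = seq.upper()
    features = []
    for k in range(1, kmax + 1):
        kmers = ["".join(p) for p in itertools.product("ACGT", repeat=k)]
        counts = {km: pseudo for km in kmers}  # pseudocount avoids log(0)
        for i in range(len(seq) - k + 1):
            km = seq[i:i + k]
            if km in counts:  # skip windows containing non-ACGT characters
                counts[km] += 1
        logs = [math.log(counts[km]) for km in kmers]
        mean_log = sum(logs) / len(logs)
        # CLR: subtract the mean log within each k-scale, so each
        # k-block of the feature vector sums to zero by construction
        features.extend(v - mean_log for v in logs)
    return features

feats = kmer_clr_features("ACGTACGTGGCCAATT", kmax=3)
```

Each contig's CLR vector would then be fed to the β-VAE encoder to obtain the 384-dimensional embedding.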
bioinformatics2026-03-23v1A harmonized benchmarking framework for implementation-aware evaluation of 46 polygenic risk score tools across binary and continuous phenotypes
Muneeb, M.; Ascher, D.Abstract
Polygenic risk score (PRS) tools differ substantially in statistical assumptions, input requirements, and implementation complexity, making direct comparison difficult. We developed a harmonized, implementation-aware benchmarking framework to evaluate 46 PRS tools across seven binary UK Biobank phenotypes and one continuous trait under three model configurations: null, PRS-only, and PRS plus covariates. The framework integrates standardized preprocessing, tool-specific execution, hyperparameter exploration, and unified downstream evaluation using five-fold cross-validation on high-performance computing infrastructure. In addition to predictive performance, we assessed runtime, memory use, input dependencies, and failure modes. A Friedman test across 40 phenotype-fold combinations confirmed significant differences in tool rankings (χ² = 102.29, p = 2.57 × 10⁻¹¹), with no single method universally optimal. These findings provide a reproducible framework for comparative PRS evaluation and demonstrate that tool performance is shaped not only by statistical methodology but also by phenotype architecture, preprocessing choices, covariate structure, computational demands, software robustness, and practical implementation constraints.
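The Friedman test used above, which ranks tools within each phenotype-fold block and asks whether rank sums differ more than chance allows, can be reproduced in miniature. A stdlib-only sketch with synthetic scores (not the paper's data), omitting the tie correction:

```python
def friedman_statistic(scores):
    """Friedman chi-square over n blocks (e.g. phenotype-fold
    combinations) x k treatments (e.g. PRS tools).
    scores[i][j] = metric of tool j on block i; higher is better.
    No tie correction (illustrative sketch only)."""
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        # rank 1 = best tool within this block
        order = sorted(range(k), key=lambda j: row[j], reverse=True)
        for rank, j in enumerate(order, start=1):
            rank_sums[j] += rank
    # chi^2_F = 12 / (n k (k+1)) * sum_j R_j^2  -  3 n (k+1)
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) \
        - 3.0 * n * (k + 1)

# tool 0 always best, tool 2 always worst across 4 blocks
stat = friedman_statistic([[0.9, 0.8, 0.7]] * 4)
```

With perfectly consistent rankings the statistic reaches its maximum n(k-1); the p-value would come from the chi-square distribution with k-1 degrees of freedom (e.g. `scipy.stats.friedmanchisquare` does both steps).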
bioinformatics2026-03-23v1aaKomp: Alignment-free amino acid k-mer matching for genome completeness assessment at scale
Wong, J.; Coombe, L.; Warren, R. L.; Birol, I.Abstract
In de novo sequencing projects, genome assembly optimization requires evaluating a number of candidate assemblies to identify optimal tool parameters. Yet, current completeness assessment tools like BUSCO and compleasm require 10-80 minutes per evaluation for gigabase-scale genomes, transforming what should be rapid iteration into a time-intensive process. These tools rely on alignment-based approaches and fixed ortholog databases, limiting their scalability across the tree of life. We present aaKomp, a scalable alignment-free tool that leverages amino acid k-mer matching and multi-index Bloom filters for rapid genome completeness assessment. Unlike current utilities, aaKomp supports user-defined reference databases, enabling customized assessments for any organism. In benchmarking against state-of-the-art tools using simulated T2T-CHM13 datasets, aaKomp achieved 68-fold faster execution and 15-fold lower memory consumption while maintaining accuracy. Testing on 94 Human Pangenome Reference Consortium assemblies and a European eel assembly, aaKomp maintained one-minute runtimes (1.2 ± 0.35 min) and low memory usage (<13.64 GB). aaKomp's scoring system provides nuanced estimates rather than threshold-based classifications, offering increased resolution for tracking incremental improvements during iterative workflows. aaKomp's speed, memory efficiency, and flexible database generation make it well-suited for modern biodiversity projects requiring the evaluation of hundreds of assemblies.
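The core idea of Bloom-filter-based k-mer membership that aaKomp builds on can be sketched minimally. aaKomp itself uses multi-index Bloom filters over amino acid k-mers; the single-filter toy below only illustrates probabilistic membership, and its size, hash scheme, and k are arbitrary choices, not aaKomp's.

```python
import hashlib

class Bloom:
    """Tiny single Bloom filter for amino-acid k-mer membership.
    Illustrative only: aaKomp uses multi-index Bloom filters at scale."""

    def __init__(self, m=1 << 16, h=3):
        self.m, self.h = m, h
        self.bits = bytearray(m // 8)

    def _indices(self, item):
        # h independent hash values via salted BLAKE2b digests
        for i in range(self.h):
            d = hashlib.blake2b(item.encode(), digest_size=8,
                                salt=bytes([i])).digest()
            yield int.from_bytes(d, "little") % self.m

    def add(self, item):
        for j in self._indices(item):
            self.bits[j // 8] |= 1 << (j % 8)

    def __contains__(self, item):
        # no false negatives; small chance of false positives
        return all(self.bits[j // 8] >> (j % 8) & 1
                   for j in self._indices(item))

# index every amino-acid 9-mer of a (made-up) reference protein
k = 9
protein = "MKTAYIAKQRQISFVKSHFSRQ"
bf = Bloom()
for i in range(len(protein) - k + 1):
    bf.add(protein[i:i + k])
```

Completeness assessment then reduces to streaming the k-mers of a translated assembly through such filters and scoring the fraction that hit the reference set.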
bioinformatics2026-03-22v1ATHILAfinder: a tool to detect ATHILA LTR retrotransposons in plant genomes
Bousios, A.; Primetis, E.Abstract
Motivation The ATHILA lineage of LTR retrotransposons has colonised all branches of the plant tree of life. In Arabidopsis thaliana and A. lyrata, ATHILA elements have invaded centromeres, influencing the genetic and epigenetic organisation, and driving satellite evolution. To assess the broader significance of ATHILA across plants, a computational pipeline is needed to identify ATHILA elements with high efficiency. Existing tools lack this ability because they are optimised for broad transposon classification at the expense of precise annotation of lower taxonomic levels. Results We present ATHILAfinder, a pipeline for accurate and large-scale discovery of ATHILA elements. ATHILAfinder uses lineage-specific sequence motifs as seeds and additional filters to build de novo intact elements. Homology-based steps rescue intact ATHILA and identify soloLTRs. A detailed identity card includes coordinates, LTR identity, coding capacity, length and other sequence features for every ATHILA. We validate ATHILAfinder in the A. thaliana Col-CEN assembly and five additional Brassicaceae species, covering four supertribes and ~30 million years of evolution. ATHILAfinder has very low false positive rates and outperforms widely-used tools like EDTA and the deep-learning-based Inpactor2 software for both recovery and precision of ATHILA. To demonstrate its usefulness, we generate insights into ATHILA dynamics across Brassicaceae. Outlook Few computational pipelines target specific transposon lineages, yet such tools can empower their identification and downstream analyses. Our tailored approach can be adapted to other LTR retrotransposon lineages, offering new ways for high-resolution analysis of transposons.
bioinformatics2026-03-22v1Helicase: Vectorized parsing and bitpacking of genomic sequences
Martayan, I.; Lobet, L.; Marchet, C.; Paperman, C.Abstract
Modern sequencing pipelines routinely produce billions of reads, yet the dominant storage formats (FASTQ and FASTA) are text-based and sequential, making high-throughput parsing a persistent bottleneck in bioinformatics. Their regular, line-oriented structure makes them well-suited to SIMD vectorization, but existing libraries do not fully exploit it. We present vectorized algorithms for high-throughput FASTA/Q parsing, with on-the-fly handling of non-ACTG characters and built-in bitpacking of DNA sequences into multiple compact representations. The parsing logic is expressed as a finite state machine, compiled into efficient SIMD programs targeting both x86 and ARM CPUs. These algorithms are implemented in Helicase, a Rust library exposing a tunable interface that retrieves only caller-requested fields, minimizing unnecessary work. Exhaustive benchmarks across a wide range of CPUs show that Helicase meets or exceeds the throughput of all evaluated state-of-the-art libraries, making it the fastest general-purpose FASTA/Q parser to our knowledge. Availability: https://github.com/imartayan/helicase
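The 2-bit packing Helicase performs can be illustrated in scalar form. Helicase itself is a Rust library with SIMD kernels, so the Python sketch below shows only the compact representation (4 bases per byte), not the vectorized parsing algorithm, and its function names are hypothetical.

```python
ENC = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
DEC = "ACGT"

def pack_2bit(seq):
    """Pack an ACGT string into bytes, 4 bases per byte, little-endian
    within each byte (scalar sketch of one compact DNA representation)."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= ENC[base] << ((i % 4) * 2)
    return bytes(out)

def unpack_2bit(data, n):
    """Recover the first n bases from a 2-bit-packed buffer."""
    return "".join(DEC[(data[i // 4] >> ((i % 4) * 2)) & 0b11]
                   for i in range(n))

packed = pack_2bit("ACGTTGCA")  # 8 bases -> 2 bytes
```

A 4x size reduction over ASCII; non-ACTG characters (e.g. N) would need a sentinel or a side channel, which is part of what the library handles on the fly.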
bioinformatics2026-03-22v1miRBind2 enables sequence-only prediction of miRNA binding and transcript repression
Cechak, D.; Tzimotoudis, D.; Sammut, S.; Gresova, K.; Marsalkova, E.; Farrugia, D.; Alexiou, P.Abstract
Motivation: MicroRNAs (miRNAs) regulate gene expression by guiding Argonaute proteins to partially complementary sites on target RNAs. While classical prediction methods rely on engineered features such as seed match categories, evolutionary conservation, and site context, recent advances in deep learning offer the potential to learn targeting rules directly from sequence. We developed a sequence-based deep learning model that improves miRNA target site prediction, and further validated the learned target site representations by extending the model to gene-level functional repression prediction. Results: We introduce miRBind2, a deep learning method for miRNA target site prediction that incorporates a novel pairwise nucleotide representation capturing all possible miRNA-target nucleotide interactions, with a CNN-based architecture. miRBind2 outperforms previous state-of-the-art models across four independent datasets from the debiased miRBench benchmark, while using 92% fewer parameters. We show that the convolutional features and weights learned by miRBind2 can be transferred to transcript-level prediction by extending the miRBind2 architecture and fine-tuning it on miRNA perturbation experiments. This miRBind2-3UTR model predicts gene repression from sequence alone. On a dataset of 50,549 miRNA-gene pairs, miRBind2-3UTR significantly outperforms TargetScan. These results show that deep models pretrained on target site data can capture regulatory signals and predict functional repression without requiring conventional engineered biological features. Availability: Models and source code are freely available via GitHub (https://github.com/BioGeMT/miRBind_2.0). A publicly available web tool for novel predictions and visualization is available at https://huggingface.co/spaces/dimostzim/BioGeMT-miRBind2. Contact: panagiotis.alexiou@um.edu.mt
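A pairwise nucleotide representation of the kind described can be sketched as a 2D grid scoring every miRNA position against every target position. The encoding below (binary Watson-Crick plus G:U wobble complementarity) is an assumed illustration of the general idea, not necessarily miRBind2's exact input channels.

```python
# base pairs treated as complementary: Watson-Crick plus G:U wobble
COMP = {("A", "U"), ("U", "A"), ("G", "C"),
        ("C", "G"), ("G", "U"), ("U", "G")}

def pairwise_matrix(mirna, target):
    """Binary miRNA x target pairing grid: entry (i, j) is 1 when
    miRNA base i can pair with target base j. An assumed sketch of a
    'all pairwise nucleotide interactions' input for a 2D CNN."""
    return [[1 if (m, t) in COMP else 0 for t in target] for m in mirna]

M = pairwise_matrix("UGAGGUA", "UACCUCA")  # toy 7-nt sequences
```

Seed matches and compensatory 3' pairing show up as diagonal runs of 1s in such a grid, which is what makes it a natural input for convolutional filters.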
bioinformatics2026-03-21v1SVPG: A pangenome-based structural variant detection approach and rapid augmentation of pangenome graphs with new samples
Jiang, T.; Hu, H.; Gao, R.; Jiang, Z.; Zhou, M.; Gao, W.; Zhou, S.; Wang, G.Abstract
Breakthrough advances in long-read sequencing technologies have opened unprecedented opportunities to study genetic variation through comprehensive pangenome analysis. However, structural variant (SV) calling tools that can effectively leverage pangenome information remain limited. In addition, efficient construction of pangenome graphs becomes increasingly challenging as larger numbers of samples are acquired. In this study, we present SVPG, an approach that leverages a haplotype-resolved pangenome reference for accurate SV detection and rapid pangenome graph augmentation from long-read sequencing data. Compared to state-of-the-art SV callers, SVPG maintained superior overall performance across different coverages and sequencing technologies. SVPG also achieves notable improvements in calling rare and individual-specific SVs on both simulated and real somatic datasets. Furthermore, in a benchmark involving 20 samples, SVPG accelerated pangenome graph augmentation by nearly 10-fold compared to traditional augmentation strategies. We believe that SVPG has the potential to revolutionize SV detection and serve as an effective and essential tool, offering new possibilities for advancing pangenomic research.
bioinformatics2026-03-20v4Coupling codon and protein constraints decouples drivers of variant pathogenicity
Chen, R.; Palpant, N.; Foley, G.; Boden, M.Abstract
Predicting the functional impact of genetic variants remains a fundamental challenge in genomics. Existing models focus on protein-intrinsic defects yet overlook regulatory constraints embedded within coding sequences. Here, we couple a codon language model (CaLM) with a protein language model (ESM-2) to dissect the drivers of variant pathogenicity. On ClinVar data, both modalities contribute near-equally to distinguishing pathogenic from benign variants. Evaluation across Deep Mutational Scanning and CRISPR-Based Genome Editing platforms in ClinMAVE reveals that loss-of-function variants are governed primarily by residue-level features, whereas gain-of-function variants show a greater relative contribution from codon-level constraints, albeit in a gene-specific manner. A controlled comparison of identical variants in BRCA1 and TP53 further suggests that codon-level signals are elevated in the endogenous genomic context. Together, these findings indicate that pathogenicity reflects both the "product" and the "process," and that the experimental platform may influence which dimension is observable.
bioinformatics2026-03-20v3SVPG: A pangenome-based structural variant detection approach and rapid augmentation of pangenome graphs with new samples
Jiang, T.; Hu, H.; Gao, R.; Cao, S.; Jiang, Z.; Liu, Y.; Zhou, M.; Gao, W.; Zhou, S.; Wang, G.Abstract
Breakthrough advances in long-read sequencing technologies have opened unprecedented opportunities to study genetic variation through comprehensive pangenome analysis. However, structural variant (SV) calling tools that can effectively leverage pangenome information remain limited. In addition, efficient construction of pangenome graphs becomes increasingly challenging as larger numbers of samples are acquired. In this study, we present SVPG, an approach that leverages a haplotype-resolved pangenome reference for accurate SV detection and rapid pangenome graph augmentation from long-read sequencing data. Compared to state-of-the-art SV callers, SVPG maintained superior overall performance across different coverages and sequencing technologies. SVPG also achieves notable improvements in calling rare and individual-specific SVs on both simulated and real somatic datasets. Furthermore, in a benchmark involving 20 samples, SVPG accelerated pangenome graph augmentation by nearly 10-fold compared to traditional augmentation strategies. We believe that SVPG has the potential to revolutionize SV detection and serve as an effective and essential tool, offering new possibilities for advancing pangenomic research.
bioinformatics2026-03-20v3PyrMol: A Knowledge-Structured Pyramid Graph Framework for Generalizable Molecular Property Prediction
Li, Y.; Zhao, Q.; Wang, J.Abstract
Expert pharmaceutical chemists interpret molecular structures through a sophisticated cognitive hierarchy, transitioning from local functional moieties to spatial pharmacophores and, ultimately, to macroscopic pharmacological and physicochemical profiles. However, conventional Graph Neural Networks frequently overlook this high-level chemical intuition by treating molecules as single-scale atomic topology. To bridge this gap between human expertise and computational inference, we propose PyrMol, a knowledge-structured pyramid representation learning framework. By constructing heterogeneous hierarchical graphs, PyrMol orchestrates information flow across atomic, subgraph, and molecular levels. Crucially, the subgraph level systematically integrates three complementary expert views comprising functional groups, pharmacophores, and retrosynthetic fragments. To harmonize these explicit domain priors with implicit computational semantics, we introduce an adaptive Multi-source Knowledge Enhancement and Fusion module that dynamically balances their complementarity and redundancy. A Hierarchical Contrastive Learning strategy further ensures cross-scale semantic consistency. Empirical evaluations across ten benchmark datasets demonstrate that PyrMol outperforms 12 state-of-the-art baselines. Furthermore, its "plug-and-play" versatility provides a framework-agnostic performance boost for existing GNN architectures. PyrMol thus establishes a principled data-knowledge dual-driven paradigm for AI-aided Drug Discovery, effectively leveraging domain knowledge to catalyze advances in molecular property prediction.
bioinformatics2026-03-20v2A new pipeline for cross-validation fold-aware machine learning prediction of clinical outcomes addresses hidden data leakage in omics-based 'predictors'
Hurtado, M.; Pancaldi, V.Abstract
Motivation: Machine learning (ML) approaches are increasingly applied to high-dimensional biological data in which features are often dataset-dependent. In many omics workflows, features are computed using information derived from the entire dataset, such as correlations between variables, clustering structures, or enrichment scores. We refer to these as global dataset features, defined as features whose computation depends on properties of the full dataset. In such cases, standard validation strategies can fail, especially when evaluating on independent datasets, due to information leakage that leads to overly optimistic performance estimates. Results: To address this challenge, we present pipeML, a flexible and modular machine learning framework designed to support leakage-free model training through custom cross-validation (CV) fold construction. pipeML enables users to recompute global dataset features independently within each CV fold, ensuring strict separation between training and test data, while preserving compatibility with a wide range of ML algorithms for both classification and survival tasks. Using real-world biological datasets, we demonstrate that pipeML enables leakage-free model evaluation when global dataset features are used. We argue that overestimation of model performance during CV can lead to overoptimistic expectations for validation on independent datasets. By explicitly addressing data leakage and offering a transparent, modular workflow, pipeML provides a robust solution for developing and validating ML models in complex biological settings. Availability: The pipeML R package as well as a tutorial are available at https://github.com/VeraPancaldiLab/pipeML Contact: vera.pancaldi@inserm.fr or marcelo.hurtado@inserm.fr Supplementary information: Available at Bioinformatics online.
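The fold-aware recomputation that pipeML enforces can be illustrated with the simplest possible "global dataset feature", per-feature standardization. This is a hypothetical Python sketch of the leakage-free pattern only; pipeML itself is an R package and generalizes the idea to correlations, clusterings, and enrichment scores.

```python
import statistics

def fold_aware_standardize(X, test_idx):
    """Standardize one feature column using mean/sd computed on the
    training split ONLY, then apply to the held-out fold. A leaky
    pipeline would instead derive mean/sd from all samples, letting
    test-fold information shape the feature (hypothetical sketch)."""
    train = [x for i, x in enumerate(X) if i not in test_idx]
    mu = statistics.mean(train)
    sd = statistics.pstdev(train) or 1.0  # guard against zero variance
    return [(X[i] - mu) / sd for i in sorted(test_idx)]

# sample 3 is an extreme value sitting in the test fold: a leaky
# full-dataset mean/sd would shrink its apparent deviation
X = [1.0, 2.0, 3.0, 10.0]
z = fold_aware_standardize(X, {3})
```

Repeating this per CV fold (and per global feature) is exactly the separation that makes cross-validated performance estimates honest.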
bioinformatics2026-03-20v2ChiMER: Integrating chromatin architecture into splicing graphs for chimeric enhancer RNAs detection
Xiang, Y.; Xiao, X.; Zhou, B.; Xie, L.Abstract
Motivation: Enhancer-derived RNAs (eRNAs) and their fusion with protein-coding genes represent a crucial yet understudied layer of transcriptional regulation. eRNAs are typically expressed at low levels, which makes fusion events difficult to detect with conventional fusion detection tools. In addition, these tools are not designed to capture fusion transcripts arising from spatial proximity between distal regulatory elements and gene loci. Reads spanning such regions are also frequently filtered as mapping artifacts. As a result, computational approaches for systematically identifying spatially mediated enhancer-exon fusion transcripts remain lacking. Methods: We developed ChiMER, a graph-based framework for detecting ChiMeric Enhancer RNAs from short-read RNA-seq data. ChiMER constructs splice graphs with chromatin contact information to introduce enhancer-exon edges and uses graph alignment to search for potential transcriptional paths. A ranking-based scoring module then prioritizes high-confidence events. Evaluations on simulated and real RNA-seq datasets show that ChiMER achieves higher sensitivity than conventional linear fusion detection methods while maintaining low false-positive rates. Results: Applied to cancer cell line RNA-seq datasets, ChiMER identified multiple enhancer-exon chimeric transcripts, several associated with super-enhancer regions. Multi-omics analysis further shows that fusion transcripts occur in transcriptionally active regulatory environments and frequently coincide with strong R-loop signals, suggesting a potential role of RNA-DNA hybrid structures in facilitating long-range transcriptional joining events.
bioinformatics2026-03-20v2Integrative transcriptome-based drug repurposing in tuberculosis
Samart, K.; Thang, L.; Buskirk, L. R.; Tonielli, A. P.; Krishnan, A.; Ravi, J.Abstract
Tuberculosis (TB) remains the leading cause of infectious disease mortality worldwide, killing over one million people annually. Rising antibiotic resistance has added urgency to the need for host-directed therapeutics (HDTs) that modulate host immune responses alongside directly targeting the pathogen. Repurposing FDA-approved drugs is particularly attractive for this purpose because their safety profiles are already well-established, substantially reducing development time and cost. Transcriptomic methods have successfully identified repurposable therapeutics for TB based on 'connectivity mapping,' which identifies drugs that reverse disease gene expression patterns. However, these applications are limited to a small subset of data belonging to a specific data platform and a few connectivity methods. Expanding beyond these constrained settings introduces substantial challenges, including dataset heterogeneity across transcriptomics platforms and biological conditions, uncertainty about optimal scoring methods, and the lack of systematic approaches to identify robust disease signatures. We developed a computational workflow that integrates 28 TB gene expression signatures and multiple connectivity scoring methods to capture dominant TB signals regardless of variation in microarray and RNAseq platforms, cell types, and infection conditions. We systematically identified 64 FDA-approved drugs as promising TB host-directed therapeutics. These high-confidence drug candidates include known HDTs, such as statins (rosuvastatin, fluvastatin, lovastatin) and tamoxifen, recently validated in experimental TB models. Our prioritized candidate drugs reveal enrichment for therapeutically TB-relevant mechanisms, e.g., cholesterol metabolism inhibition and immune modulation pathways. Network analysis of disease-drug interactions identified 12 key bridging genes (including IL-8, CXCR2) that represent potential novel druggable targets for TB host-directed therapy. 
This work establishes transcriptome-based connectivity mapping as a viable approach for systematic HDT discovery in bacterial infections and provides a robust computational framework applicable to other infectious diseases. Our findings offer immediate opportunities for experimental validation of prioritized drug candidates and mechanistic investigation of identified druggable targets in TB pathogenesis.
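One simple realization of the connectivity-mapping idea above, drug signatures that anti-correlate with the disease signature score highly, is a sign-flipped Spearman correlation. The stdlib sketch below uses made-up expression values and ignores ties; the paper integrates several scoring methods, of which this is only one plausible example.

```python
def _ranks(v):
    """Ascending ranks 1..n; no tie handling (illustrative)."""
    order = sorted(range(len(v)), key=v.__getitem__)
    r = [0.0] * len(v)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r

def connectivity_score(disease_sig, drug_sig):
    """Sign-flipped Spearman correlation between a disease expression
    signature and a drug-induced signature over the same genes:
    a positive score means the drug tends to reverse the disease
    pattern (one simple connectivity scoring rule)."""
    rd, rg = _ranks(disease_sig), _ranks(drug_sig)
    n = len(rd)
    mu = (n + 1) / 2  # mean of ranks 1..n
    cov = sum((a - mu) * (b - mu) for a, b in zip(rd, rg))
    var = sum((a - mu) ** 2 for a in rd)
    return -cov / var

# toy 4-gene signatures: the drug perfectly reverses the disease pattern
score = connectivity_score([2.0, 1.0, -1.5, -3.0], [-1.8, -0.5, 1.2, 2.5])
```

Aggregating such scores across many signatures and scoring methods, as the workflow does, is what turns a single noisy correlation into a robust drug ranking.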