Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
AlphaFold 3 Fails to Predict D-peptide Chirality, Fold, and Binding Pose in Heterochiral Complexes
Childs, H.; Zhou, P.; Donald, B. R.
Due to their favorable therapeutic properties, including improved stability, bioavailability, and membrane permeability, D-peptides that bind biological L-proteins represent an important class of systems in computational drug design. A reliable in silico workflow for these systems must correctly preserve stereochemistry while predicting fold and binding pose. The AlphaFold 3 (AF3) model reported by Abramson et al. (2024) enforces a strict chirality violation penalty to maintain chiral centers from model inputs and is reported to have a low chirality violation rate of only 4.4% on a PoseBusters benchmark containing diverse chiral molecules. Herein, we report the results of 3,255 black-box experiments with AF3 to evaluate its ability to predict the fold, chirality, and binding pose of D-peptides in heterochiral complexes. Despite inputs specifying explicit D-stereocenters, we report that the AF3 chirality violation rate for D-peptide binders is much higher at 51% across all evaluated predictions; on average the model is as accurate as chance (random chirality choice, L or D, for each peptide residue). Increasing the number of seeds failed to improve this violation rate. The AF3 predictions exhibit incorrect folds and binding poses, with D-peptides commonly oriented incorrectly in the L-protein binding interface. Confidence metrics returned by AF3 also fail to distinguish predictions with low chirality violation and correct docking vs. predictions with high chirality violation and incorrect docking. We conclude that AF3 is a poor predictor of D-peptide chirality, fold, and binding pose and propose solutions to address these limitations.
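The notion of a chirality violation can be illustrated geometrically. The sketch below is not the authors' evaluation code and is not part of AF3; it assumes each residue supplies N, CA, C, and CB coordinates (glycine, which has no CB, would have to be skipped) and classifies the CA stereocenter by the sign of a scalar triple product. The function names and toy coordinates are hypothetical.

```python
# Minimal sketch: classify C-alpha handedness from backbone coordinates.
# Assumption: a positive (N-CA) . ((C-CA) x (CB-CA)) corresponds to an
# L-residue (this follows from the CORN rule); a mirror image flips the sign.

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def handedness(n, ca, c, cb):
    """Return 'L' or 'D' for one residue from its N, CA, C, CB coordinates."""
    volume = dot(sub(n, ca), cross(sub(c, ca), sub(cb, ca)))
    return "L" if volume > 0 else "D"

def violation_rate(residues, expected="D"):
    """Fraction of residues whose handedness differs from `expected`."""
    calls = [handedness(*r) for r in residues]
    return sum(call != expected for call in calls) / len(calls)

# Toy L-residue: CA at the origin, the implicit H pointing along +z.
L_RESIDUE = ((-0.866, -0.5, -0.33),  # N
             (0.0, 0.0, 0.0),        # CA
             (0.0, 1.0, -0.33),      # C
             (0.866, -0.5, -0.33))   # CB

def mirror(residue):
    """Reflect through the xy-plane, which inverts the stereocenter."""
    return tuple((x, y, -z) for x, y, z in residue)

D_RESIDUE = mirror(L_RESIDUE)

# A "D-peptide" prediction in which half of the residues came out as L.
peptide = [D_RESIDUE, L_RESIDUE, D_RESIDUE, L_RESIDUE]
print(violation_rate(peptide))
```

With per-residue calls of this kind, a violation rate near 50% is what independent coin flips between L and D would produce, which is the comparison the abstract draws.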
bioinformatics · 2026-05-19 · v3
LAMPP: A benchmark for continuous evaluation of host phenotype prediction from shotgun metagenomic data
Barak, N.; Bhattacharya, H.; Asnicar, F.; Sung, J.; Segata, N.; Yassour, M.
Predicting host phenotypes from shotgun metagenomic data is essential for translating microbiome research into clinical practice. Despite the development of numerous computational tools for this task, researchers often default to traditional machine learning methods such as Random Forest. This hesitancy to adopt newer methods stems from their complexity as well as the lack of standardized evaluations, as most tools are assessed on different datasets and compared against a limited set of methods. Here, we introduce LAMPP, a standardized benchmark for evaluating methods for predicting host phenotypes from gut metagenomic data. LAMPP features a diverse range of prediction tasks and enables consistent, comparative assessments across prediction tools. Our systematic evaluation of existing tools shows that classic machine learning methods (e.g., Random Forest) perform competitively, offering both ease of use and state-of-the-art results. At the same time, it demonstrates that microbiome-based phenotype prediction remains a challenging problem. By providing a consistent platform for ongoing evaluation and access to raw sequencing data, LAMPP motivates the development of novel prediction pipelines from raw sequencing data to phenotype prediction, including novel sample representation and data augmentation strategies. LAMPP is publicly available for ongoing benchmarking at https://lampp.yassourlab.com/.
bioinformatics · 2026-05-19 · v3
Systematic cross-study assessment of RNA-Seq experimental workflows for plasma cell-free transcriptome profiling
Tuni-Dominguez, C.; Asole, G.; Monteagudo-Mesas, P.; Rusu, E. C.; Cabus, L.; Gonzalez, L.; Sanchez, L.; Neto, B.; Sanders, P.; Weber, M.; Lagarde, J.
Plasma cell-free RNA (cfRNA) is a promising source of non-invasive biomarkers, but its clinical translation is hindered by technical challenges and a lack of protocol standardization, which compromises reproducibility and comparability across studies. There is a need for a systematic evaluation of existing cfRNA-Seq workflows to understand the drivers of technical variability. Here, we address this gap by performing a comprehensive cross-study analysis of 2,166 cfRNA-Seq samples from 15 published studies and an in-house generated dataset, applying a uniform bioinformatics pipeline to enable a controlled comparison of experimental workflows. Our analysis reveals that the donor phenotype typically explains a negligible fraction of the transcriptomic variation, whose main determinants are technical: principally protocol choice, genomic DNA contamination levels, and library diversity. Remarkably, this technical noise is so profound that variation within plasma cfRNA samples exceeds that found across a wide range of human tissues. Furthermore, we demonstrate that critical pre-analytical factors are often confounded with patient phenotypes, jeopardizing the validity of biomarker discovery efforts. Finally, we identify a 100 bp fragment-length threshold as a vital requirement for reliable cfRNA-based taxonomic profiling. Our work serves as a comprehensive benchmark of current cfRNA-Seq methodologies and provides evidence-based guidelines to improve experimental design. By highlighting the dominance of controllable technical factors, we offer a path towards more robust and reproducible cfRNA research.
bioinformatics · 2026-05-19 · v3
Learning the Language of the Microbiome with Transformers
Treloar, N. J.; Ur-Rehman, S.; Yang, J.
Self-supervised pretraining has become central to biological machine learning, yet microbiome data remains comparatively underexplored in terms of both modeling approaches and evaluation frameworks. To address this gap, we present Atlas, a pretraining dataset of over 539,000 microbiome datapoints from the MGnify database. Using Atlas, we train the Waypoint family of microbiome foundation models: a series of GPT-2 style causal language models ranging from 6M to 170M parameters. We also introduce Compass, a curated benchmark of eight predictive tasks spanning biome classification, drug-microbiome interactions, drug degradation, and infant gut development. Using this benchmark, we compare the performance of Waypoint models against classical baselines and the existing MGM foundation model. Our results show that pretraining leads to consistent and significant improvements in downstream task performance, that both dataset scale and tokenization strategy impact model quality, and that pretraining is essential for achieving favorable scaling behavior. Furthermore, pretrained transformer models begin to reliably outperform classical methods once training data exceeds roughly 10,000 examples, a threshold that is attainable for modern microbiome studies. Finally, we demonstrate that the Waypoint models achieve state-of-the-art performance among microbiome foundation models. Overall, our work highlights the importance of large-scale self-supervised pretraining in this domain and establishes Atlas, Compass, and the Waypoint models as valuable resources for the research community in this emerging field.
bioinformatics · 2026-05-19 · v2
IsoformSwitchAnalyzeR v2: Analysis of Functional Isoform Changes in Long-read and Single-cell Sequencing Data
Han, C.; Gilis, J.; Delgado, E. I.; Clement, L.; Vitting-Seerup, K.
Alternative splicing enables a single gene to produce a variety of mRNA transcripts, significantly enhancing protein diversity in higher eukaryotes. Isoform switching refers to the differential usage of transcripts of a gene and occurs pervasively across physiological and pathological conditions. IsoformSwitchAnalyzeR was developed to identify these isoform switches and analyze their functional consequences. Advances in RNA-seq technology, including long-read and single-cell sequencing, along with state-of-the-art computational tools, enable unprecedented accuracy in isoform switch identification and its functional consequences, necessitating an update to IsoformSwitchAnalyzeR. Here we present IsoformSwitchAnalyzeR 2.0, with substantial improvements in the robustness of isoform switch detection, the incorporation of new types of functional annotation, and interoperability with other bioinformatics tools. We showcase how IsoformSwitchAnalyzeR is well-suited for analysis of both long-read RNA-seq and single-cell data through two case studies. Specifically, we analyze long-read data from patients with Alzheimer's Disease and single-cell data from Glioblastoma patients. In both case studies, we find important isoform switches with disease-relevant functional consequences, showcasing the power of IsoformSwitchAnalyzeR v2. Taken together, these findings highlight the versatility and robustness of IsoformSwitchAnalyzeR in handling advanced sequencing technologies, thereby broadening its applicability across diverse research contexts.
bioinformatics · 2026-05-19 · v2
Effects of Structural Reward Shaping on Biophysical Properties in RL-Trained Plasmid Generators
Thiel, M.; Cunningham, A.; Barnes, C. P.
We compare the efficacy and distributional effects of supervised fine-tuning (SFT) and reinforcement learning (RL) post-training for PlasmidGPT, a foundation model for whole-plasmid generation, using Group Relative Policy Optimization (GRPO) for the RL model. Using a biologically motivated reward function encoding functional annotations, length constraints, and repeat penalties, the RL model achieves a 71.6% quality control pass rate across 8 prompts on 4,000 sequences, compared to 4.3% for the pretrained baseline and 11.0% for SFT. A five-model reward ablation identifies the cassette arrangement bonus, which rewards correct promoter→CDS→terminator ordering, as the critical reward component. Rejection-sampling baselines indicate that the gain is not recovered by sampling more heavily from the base model. Beyond directly optimized features, RL-generated sequences converge toward real plasmid distributions in 3-mer composition, ORF length, and thermodynamic stability, properties we categorize as reward-correlated or indirectly shaped by the structural reward signal. Minimum free energy density independently converges to the real-plasmid regime under both SFT and RL despite these being parallel post-training paths. On a small curated hold-out set, RL improves continuation log-likelihood over the pretrained baseline on every sequence (mean Δ = +0.83 nats), with no degradation in next-token prediction.
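For readers unfamiliar with GRPO, its core step is to score each sampled sequence relative to the other samples drawn for the same prompt. The sketch below shows that group-relative normalization only; the reward values are made up, the function name is hypothetical, and the use of the population standard deviation and the epsilon value are assumptions rather than details taken from the paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group.

    `rewards` holds the scalar rewards of all sequences sampled for one prompt.
    `eps` (an assumed value) avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Illustrative rewards for four sequences generated from the same prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Sequences that beat their group average receive a positive advantage and the rest a negative one, so a reward term such as an ordering bonus only influences training when it separates samples within a group.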
bioinformatics · 2026-05-19 · v2
TransXplorer: An automated translational discovery platform for RNA-seq data
Verma, V. M.; Oler, E.; Syed, H.; Han, S.; Berjanskii, M.; Mason, A. L.; Wishart, D. S.; Wong, G. K.-S.
RNA-seq experiments routinely identify thousands of differentially expressed genes, but translating these into biological insights and therapeutic hypotheses often requires integrating multiple tools. Existing web platforms such as iDEP, NetworkAnalyst, and GEPIA2 address individual steps (differential expression, network visualization, or TCGA queries) but lack a unified environment spanning raw data processing to clinical and pharmacological interpretation. TransXplorer (https://www.transxplorer.org) is a freely available web platform that addresses this limitation by integrating the complete RNA-seq analytical workflow. It supports processing from raw FASTQ files using HISAT2 or Salmon, as well as direct GEO dataset import with automated metadata handling. Differential expression analysis is implemented via DESeq2, edgeR, and limma-voom, followed by functional enrichment across more than 1,800 species using Bioconductor resources. Batch effects are automatically detected and corrected using a composite of PVCA, kBET, and Silhouette metrics without requiring predefined batch annotations. Downstream analyses include co-expression network construction (WGCNA), protein-protein interaction mapping (STRING), cell-type deconvolution, and transcription factor inference using integrated DoRothEA and TFLink resources. The platform further links gene signatures to drug candidates through DGIdb and OpenTargets and enables survival and tumour-normal comparisons across TCGA cohorts. Application to cardiac endothelial differentiation (GSE151427) and kidney renal papillary cell carcinoma (TCGA-KIRP) datasets demonstrates accurate batch correction, biologically consistent pathway enrichment, recovery of expected cell-type proportions, and identification of clinically relevant genes and drug candidates. TransXplorer is freely available without a login.
bioinformatics · 2026-05-19 · v2
OmniGene-4: A Unified Bio-Language MoE Model with Router-Level Interpretability
Wang, L.
We introduce OmniGene-4, a unified bio-language foundation model built on Gemma-4-26B-A4B (30 layers, 128 experts per layer, top-8 routing). We inject 28,028 biological tokens (DNA and protein BPE, Foldseek 3Di, DSSP labels), continue pretraining on a 32.5 GB DNA / protein / natural-language / structural mixture, and run a five-stage supervised fine-tuning pipeline (v2-v5) on 199,576 instruction-format examples across eight task families. The final v5 adds a dual-head architecture: a generation head plus two per-residue classification heads (3Di, DSSP) trained jointly under a 0.5 / 0.5 loss split. v5 reaches 99.40% accuracy on BioPAWS standard protein homology, 82.60% on remote homology (500 pairs), and 93.66% on BixBench, gaining +14.4, +22.6, and +6.7 percentage points over the vocabulary-extended Gemma-4-Instruct baseline and outperforming ESM-2 (650M) by +32.1 pp on the identical remote-homology split. The classification heads reach 78.6% per-residue accuracy on 3Di (chance 5%) and 100% on DSSP (chance 12.5%). MoE router activations further yield a clean CPT/SFT 96%/4% decomposition of cross-task differentiation, providing direct interpretability of where biological specialization is acquired.
bioinformatics · 2026-05-19 · v2
Beyond single markers: bacterial synergies identified by Multidimensional Feature Selection reveal conserved microbiome disease signatures
Zielinska, K.; Rudnicki, W.; Kahles, A.; Labaj, P. P.
The gut microbiome encodes disease-relevant information not only in the abundance of individual taxa and functions, but in the way they co-occur and interact. Yet metagenomic analyses have largely relied on univariate approaches that evaluate features in isolation, systematically overlooking the combinatorial signals that arise from microbial co-occurrence. Here, we introduce a framework based on the Multidimensional Feature Selection (MDFS) algorithm to identify synergistic feature pairs: combinations of taxa and functions whose joint predictive relevance substantially exceeds that of either constituent alone, including features that carry no individual signal and would be discarded by any conventional analysis. We first validated the approach on a meta-analysis of colorectal cancer (CRC) cohorts, one of the most competitive microbiome classification benchmarks available, using a leave-one-cohort-out cross-validation framework. Our framework matched state-of-the-art classification performance (AUC = 0.85) while simultaneously revealing microbial interactions that are structurally inaccessible to univariate methods. A subset of high-stability synergistic pairs showed consistently elevated model selection frequencies and robust discriminatory power across independent cohorts, confirmed under stringent per-cohort effect size testing. Extending the framework to 20 disease cohorts spanning inflammatory bowel disease, type 2 diabetes, liver cirrhosis, and atherosclerotic cardiovascular disease, we identified thousands of high-impact synergistic interactions and 21 conserved cross-cohort markers. Across all contexts examined, synergistic pairs substantially outperformed their individual constituents, establishing microbial co-occurrence as a reproducible and biologically informative axis of disease-associated variation that univariate approaches are structurally unable to detect. The framework is freely available at https://github.com/Kizielins/MDFS_synergies.
Importance: Most microbiome studies search for individual gut bacterial species associated with disease. However, bacteria do not act in isolation, and their combined presence or relative balance may be far more informative than any single microbe considered alone. This study presents a computational framework that identifies pairs of gut microorganisms whose co-occurrence or relative abundance carries substantially greater predictive signal than either constituent feature independently. Applied to stool metagenomic data from patients with colorectal cancer and more than a dozen additional conditions, we demonstrate that these synergistic interactions are widespread, reproducible across independent patient cohorts, and reveal disease-relevant microbial relationships that standard analyses miss entirely. Our framework offers a more complete view of how the gut microbiome is altered in disease and provides a principled basis for identifying robust, interaction-based biomarkers.
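The idea of a pair that is informative only jointly can be made concrete with a toy calculation. The sketch below is not the MDFS implementation and omits its statistical testing; it uses plain mutual information on made-up presence/absence data, and the function names are hypothetical.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    p_xy = Counter(zip(xs, ys))
    p_x = Counter(xs)
    p_y = Counter(ys)
    return sum((c / n) * log2((c / n) / ((p_x[x] / n) * (p_y[y] / n)))
               for (x, y), c in p_xy.items())

def synergy(x1, x2, y):
    """Joint information of the pair minus the best single-feature information."""
    joint = mutual_information(list(zip(x1, x2)), y)
    best_single = max(mutual_information(x1, y), mutual_information(x2, y))
    return joint - best_single

# Toy data: disease status is 1 only when exactly one of two taxa is present.
taxon_a = [0, 0, 1, 1] * 5
taxon_b = [0, 1, 0, 1] * 5
disease = [a ^ b for a, b in zip(taxon_a, taxon_b)]

print(mutual_information(taxon_a, disease))  # no signal on its own
print(mutual_information(taxon_b, disease))  # no signal on its own
print(synergy(taxon_a, taxon_b, disease))    # the pair is fully informative
```

In this constructed case each taxon alone would be discarded by a univariate filter, while the pair determines the label exactly.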
bioinformatics · 2026-05-19 · v2
ToxCastLite: A portable semantic evidence graph linking in vitro bioactivity, in vivo toxicity, and exposure-use context
Dönmez, A.; Nosov, O.; Heck, K.; Mosig, A.; Fritsche, E.; Koch, K.
Motivation: The ToxCast database is a valuable resource for computational toxicology and new approach methodologies (NAMs), but the approximately 100 GB MySQL distribution is difficult to use for portable local analysis and cross-domain evidence mining. Many practical questions concern chemicals, in vitro bioactivity, in vivo toxicological evidence, and exposure-relevant product-use context rather than raw database keys. Results: We present ToxCastLite, a portable semantic evidence-access system that combines assay-scoped SQLite databases with a compact RDF layer for GraphDB-based querying. The system streams large ToxCast/invitrodb MySQL dumps into curated SQLite profiles, reducing the footprint to approximately 3 GB for focused use cases such as developmental neurotoxicity. Dense numerical evidence, including concentration-response rows, remains in SQLite, while the RDF projection exposes linked semantic entities such as chemicals, assays, endpoints, model results, potency parameters (AC50), and MC6 quality flags. We further extend the graph with CPDat v4.0 product-use and functional-use evidence and ToxRefDB v3.0 in vivo toxicity evidence, including processed studies, point-of-departure records, effect summaries, and observation summaries. These layers are linked through DSSTox Substance Identifiers, enabling integrated queries across NAM bioactivity, curated animal-study evidence, and exposure/use context. A Streamlit prototype supports exploration through a locally deployed LLM that translates natural-language questions into SPARQL, grounded by a versioned RDF schema to reduce hallucination risk. Case studies in developmental neurotoxicity demonstrate how ToxCastLite identifies concordance between high-confidence in vitro DNT activity and positive in vivo apical evidence, detects in vitro DNT activity beyond available DNT-specific in vivo evidence, and prioritizes chemicals where NAM signals, ToxRefDB evidence, and CPDat product-use context intersect.
For selected results, users can drill down from the semantic graph to the underlying SQLite records and retrieve concentration-response curves for expert inspection without manually writing SQL or SPARQL.
bioinformatics · 2026-05-19 · v1
CANCAN: high-resolution copy number and mutation heterogeneity analysis of DNA sequence data for clinical applications
Pladsen, A. V.; Vodak, D.; Zhao, S.; Nakken, S.; Nebdal, D.; Lien, T.; Danielsen, B. K.; Wang, C.; Kildal, W.; Hjortland, G. O.; Hovig, E.; Russnes, H. G.; Lingjaerde, O. C.
High-throughput DNA sequencing is central to precision oncology, yet robust and interpretable methods for integrated analysis of copy number alterations and somatic variants across sequencing platforms remain limited. We present CANCAN (Copy number integrative ANalysis in CANcer), a platform-agnostic computational framework for high-resolution analysis of allele-specific copy number and variant data. CANCAN integrates novel normalization and segmentation strategies and enables inference of tumor purity, ploidy, subclonality and mutation multiplicity, while providing statistical confidence estimates and transparent evaluation of alternative solutions. Benchmarking across whole-genome, whole-exome and targeted sequencing datasets from TCGA and the IMPRESS-Norway study demonstrates high concordance with established methods, with particularly strong performance on targeted sequencing data. CANCAN accurately estimates global genomic features, including purity and ploidy, even at reduced sequencing coverage, and shows comparable or improved agreement relative to existing tools. In addition, it provides detailed visualization of the genomic context of clinically relevant biomarkers, supporting diagnostic interpretation. CANCAN constitutes a reproducible and interpretable approach for integrated genomic analysis, addressing key methodological and practical challenges in clinical cancer genomics.
bioinformatics · 2026-05-19 · v1
Cross-Cohort Optimal Transport Maps Macrophage Plasticity and Competing Routes to Inflammation and Fibrosis in Human Atherosclerotic Plaques
Vazquez Montes de Oca, S.; Acedo Terrades, A.; Carreno Martinez, J. F.; Kirchner, P.; Ord, T.; Kaikkonen, M. U.; Freigang, S. B.; Zlobec, I.; Rodriguez Martinez, M.
Single-cell transcriptomics has revealed extensive macrophage heterogeneity in atherosclerotic plaques, but how macrophages move between states, and whether transition mechanisms depend on cellular origin, remain unclear. Here we develop a computational framework that reconstructs directed cell-state transition networks from cross-sectional single-cell RNA-sequencing data by combining optimal transport with RNA velocity and systematic cross-cohort validation. Applying this approach to seven human carotid plaque cohorts, we generate an integrated atlas of 81,633 monocytes and macrophages and identify 15 statistically significant pairwise transitions, of which 11 directed transitions organize into three biological axes: monocyte fate diversification, inflammatory reactivation, and fibrotic remodeling. The strongest transition links scavenging macrophages to inflammatory macrophages, indicating that plaque inflammation is driven predominantly by reactivation of tissue-adapted macrophages rather than by direct differentiation of newly recruited monocytes. By tracking gene expression changes along the OT commitment gradient, we find that macrophage plasticity follows an origin-dependent spectrum. Tissue-resident macrophages, in particular scavenging C1q+ macrophages, acquire inflammatory programs while preserving and reinforcing their resident scavenging identity, a mechanism we term transcriptional layering, whereas monocyte-derived transitions proceed through selective loss of source-identity modules. Despite these distinct routes, transitions converging on the same fate activate shared destination-specific regulatory circuits, with inflammatory and fibrotic programs governed by mutually antagonistic transcription factor networks. These findings identify inflammatory reactivation of scavenging macrophages as a dominant transition axis in human atherosclerosis and suggest that macrophage origin constrains how disease-associated programs are acquired.
More broadly, this framework provides a general strategy for quantifying cell-state transitions and dissecting plasticity mechanisms in chronic inflammatory disease.
bioinformatics · 2026-05-19 · v1
DistPCA: Tera-Scale Genomic PCA via Out-of-Core Distributed Parallelism
Mermigkis, G.; Sofotasios, A.; Kontopoulou, E.-M.; Gallopoulos, E.; Hadjidoukas, P.
Principal Component Analysis (PCA) is a fundamental tool in human genetics, widely used to study population structure. However, the rapid growth of modern genomic datasets, which often exceed main memory capacity, renders traditional PCA methods infeasible, motivating out-of-core approaches. Prior work on out-of-core genomic PCA has focused primarily on optimizing the inherently compute-intensive numerical core, largely overlooking the stages of data I/O and preprocessing, which emerge as significant performance bottlenecks at tera-scale. Furthermore, existing approaches remain limited to shared-memory single-node architectures, lacking support for distributed multi-node environments. To address these limitations, we introduce DistPCA, the first distributed out-of-core framework for tera-scale genomic PCA, implemented as a C++ library and scalable across both single- and multi-node systems. Built on top of the Message Passing Interface (MPI), the proposed framework employs multi-level data parallelism across the entire PCA pipeline, combining multiprocessing, multithreading, SIMD vectorization, and compute-transfer overlap, while remaining compatible with block-based methods that rely on associative operations. Extensive evaluation on real and synthetic datasets demonstrates near-linear scalability, achieving speedups of up to 58.2x and over 98% reduction in wall-clock time, while maintaining parallel efficiency above 82% and preserving accuracy in the recovered principal components.
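The phrase "block-based methods that rely on associative operations" can be illustrated with a small example. The sketch below is not DistPCA's API (the library is C++ and MPI based); it is a pure-Python illustration, with hypothetical function names and a made-up genotype matrix, of how a sample-by-sample Gram matrix can be accumulated one variant block at a time. Per-variant centering and scaling, which genomic PCA normally applies first, and the eigendecomposition itself are omitted.

```python
from functools import reduce

def gram(block):
    """Sample-by-sample Gram matrix B @ B.T for one block (rows = samples)."""
    return [[sum(a * b for a, b in zip(ri, rj)) for rj in block] for ri in block]

def add(g1, g2):
    """Element-wise sum of two equally sized matrices."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(g1, g2)]

def column_blocks(matrix, width):
    """Yield consecutive column chunks, standing in for chunks read from disk."""
    n_cols = len(matrix[0])
    for start in range(0, n_cols, width):
        yield [row[start:start + width] for row in matrix]

def blockwise_gram(matrix, width):
    """Accumulate the Gram matrix while holding only one block at a time."""
    return reduce(add, (gram(b) for b in column_blocks(matrix, width)))

# Toy genotype matrix: 3 samples x 5 variants, allele counts in {0, 1, 2}.
genotypes = [[0, 1, 2, 0, 1],
             [1, 1, 0, 2, 0],
             [2, 0, 1, 1, 1]]

full = gram(genotypes)
partials = [gram(b) for b in column_blocks(genotypes, 2)]
print(blockwise_gram(genotypes, 2) == full)
```

Because matrix addition is associative and commutative, the partial results can be computed by different workers and combined in any order, which is the property a distributed reduction relies on.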
bioinformatics · 2026-05-19 · v1
Multi-Scale Tri-Modal Histology Dataset Integrating Tumor Morphology, Immune Patterns, and Clinical Outcomes
Jung, K. J.; Qiu, J.; Cho, S.; McDonough, E.; Chadwick, C.; Ghose, S.; West, R. B.; Brooks, J. D.; Ginty, F.; Machiraju, R.; Mallick, P.
Accurate prognostic assessment of prostate cancer (PCa) requires an integrated understanding of tissue morphology (encompassing cell structure, glandular architecture, and tissue organization) and the immune environment. We present Prostate-TriMod, a novel tri-modal histology dataset designed to integrate high-resolution visual morphology with spatial tissue maps, immune infiltration patterns, and clinical outcomes. This dataset, generated from the Cell DIVE multiplexed imaging platform, consists of three synchronized modalities: (1) multiscale virtual H&E tiles (224px, 256px, 512px, and 2040px) providing visual morphological context, (2) spatial tissue maps identifying cancerous/non-cancerous epithelial cells, stroma and immune cell populations (via TOPAZ and CAT models), and (3) text captions generated from single-cell data and patterns. The dataset includes comprehensive clinical annotations, including Grade Groups and biochemical recurrence (BCR) status. By providing high-fidelity alignment between visual features, spatial tissue maps, and textual descriptions, Prostate-TriMod empowers the development of advanced multimodal AI frameworks. We expect this resource to support reuse in multimodal representation learning, spatial analysis, and benchmarking studies that link histology morphology and immune context to clinical outcomes in prostate cancer.
bioinformatics · 2026-05-19 · v1
PocketBagger: Generalizable pocket druggability prediction via positive-unlabeled learning
Gingrich, P. W.; Biswas, A.; Mica, I. L.; Brammer, K. M.; Shu, Z.; Maxwell, D. S.; Russell, K. P.; Al-Lazikani, B.
Reliable structure-based prediction of small-molecule druggability is hindered by a fundamental labeling problem. Experimentally confirmed liganded sites (positives) are observable, but credible "undruggable" pockets (negatives) are almost impossible to define. Standard supervised machine learning consequently relies on arbitrary definitions of 'undruggable', leading to bias and false negatives. Here we introduce PocketBagger, a positive-unlabeled (PU) learning framework for pocket druggability prediction trained exclusively on experimentally determined Protein Data Bank (PDB) structures. PocketBagger uses PU bagging to learn key features associated with reliable 'druggable' pockets and considers all remaining pockets in the structurally characterized proteome as unlabeled. We demonstrate the capability of PocketBagger through the training of a simple Random Forest classifier and demonstrate its power in recall (0.804), even when challenged with increasingly difficult generalizability assessments and entire protein-family hold outs. We benchmark and demonstrate the added value of PU learning by comparing PocketBagger to a leading deep-learning predictor. However, PocketBagger is intended to be used as a framework for any model architecture. Along with the code, the data generated by PocketBagger are deployed in canSAR.ai, providing scalable, generalizable pocket druggability predictions to the drug discovery community.
bioinformatics · 2026-05-19 · v1
BioGAIP: A Scalable, User-Friendly and Robust LLM-Powered Multi-Agent System for Automated Bioinformatics Tasks
Zhang, J.; Guo, P.; Jiang, G.; Zhou, M.; Wei, G.; Ni, T.
The rapid explosion of large-scale, high-throughput biological data has created an urgent demand for efficient analysis pipelines. Traditional bioinformatics approaches, while powerful, often require specialized computational expertise, placing them out of reach for bench biologists. Large Language Models (LLMs) offer new possibilities for automating complex reasoning and tool integration, yet existing LLM-based solutions have not sufficiently lowered this barrier, and expert-level analysis remains inaccessible to most nonexperts. Here, we present BioGAIP, an LLM-powered agent that integrates expert-level reasoning within an end-to-end platform for bioinformatics tasks. By coupling optimized autonomous agents with full graphical interfaces, BioGAIP transforms complex analytical workflows into an automated, user-friendly, and low-intervention process with natural language input. Key features of BioGAIP include dynamic information retrieval, automatic environment configuration, and self-directed design of analysis pipelines, making large-scale multi-omics analysis highly accessible. Built on agent-based client-server architecture, BioGAIP ensures secure resource management and supports heavy computational demands. Extensive evaluations on diverse published datasets demonstrate that BioGAIP reliably recapitulates established biological insights and shows strong potential for novel discovery. By democratizing complex bioinformatics workflows, BioGAIP accelerates accessible data-driven discovery for both experts and nonexperts.
bioinformatics · 2026-05-19 · v1
AbSolution and ENCORE: a proof-of-concept for automating computational reproducibility in interactive applications
Garcia-Valiente, R.; Langton, S. H.; van Kampen, A. H. C.
Reproducibility and transparency in computational analyses are essential in science, although achieving these goals often requires significant knowledge and systematic organization. Graphical interactive applications simplify the conduct of analyses and make them accessible to a broader audience. However, there is currently no consensus on how, and to what extent, to implement reproducibility in interactive applications. We recently developed AbSolution, a user-friendly and flexible interactive web-based R Shiny application for exploring immune repertoires and their sequence-based features, and we established the ENCORE framework to enhance transparency and reproducibility by guiding researchers in structuring and documenting computational projects. In this work, as a proof of concept, we integrate AbSolution, ENCORE, and specific R packages to address reproducibility challenges. This enables a single-step export of raw data, processed data, and metadata, the software environment, the underlying generated code, and an HTML report containing results and figures, operating system, hardware and R session details, and researcher notes. Its reproducibility has been independently validated by the CODECHECK initiative. This paper demonstrates how the combination of several approaches can improve and automate reproducibility of interactive applications.
bioinformatics2026-05-19v1cadmus: a robust pipeline for scalable retrieval of full-text biomedical literature
Campbell, J.; Lain, A. D.; Simpson, T. I.Abstract
cadmus is an open-source Python toolkit for automated retrieval and processing of full-text biomedical literature. It utilises programmatic access to PubMed, Crossref, Europe PMC, PMC, and publisher APIs, allowing users to construct large, domain-specific corpora with minimal manual intervention. cadmus parses PDF, HTML, XML, and plain text files, standardising them for downstream biomedical text mining. During the retrieval of a Developmental Disorders Corpus (204,043 publications), it achieved an 85.2% full-text retrieval rate with institutional subscriptions and 54.4% without. To test the fidelity of retrieved full-texts, we used ScispaCy to infer the similarity of paired documents from 44,264 open-access PubMed Central files and the files retrieved from cadmus, resulting in an average cosine similarity score of 0.98. Rarefaction analyses demonstrated that full-text corpora double the coverage of unique biomedical concepts over abstracts, resulting in better access to the depth of the biomedical information available.
bioinformatics2026-05-19v1Computational Design of Novel Selective Phosphodiesterase 4B Inhibitors from Natural Products: An Integrated Machine Learning and Structure-Based Drug Discovery Approach
Oni, S. A.; Oyemomi, M. D.; Osho, A.; Abdulfatai, A.Abstract
Selective inhibition of phosphodiesterase 4B (PDE4B) remains a promising strategy for preserving the anti-inflammatory benefit of PDE4 inhibition in chronic obstructive pulmonary disease while reducing PDE4D-associated tolerability liabilities. This study integrated SHAP-interpretable machine learning, natural product virtual screening, hierarchical docking, post-docking MM-GBSA, isoform cross-docking, binding-pocket comparison, ADMET prediction, and 100 ns molecular dynamics simulations to identify PDE4B-selective inhibitors from the LOTUS natural product database. A Random Forest classifier trained on curated ChEMBL PDE4B bioactivity data achieved external-test performance of AUC-ROC = 0.955, accuracy = 0.893, F1-score = 0.896, and MCC = 0.785, and prioritized 119,698 predicted actives from 276,518 LOTUS compounds. SHAP analysis identified BertzCT and TPSA as major contributors to predicted activity. Sequential Lipinski, PAINS, and QED filtering retained 14,210 candidates for structure-based evaluation. Extra precision docking identified four leads with PDE4B docking scores of -9.123 to -12.080 kcal/mol, all outperforming roflumilast (-7.658 kcal/mol). Cross-docking and post-docking MM-GBSA supported preferential PDE4B binding for three candidates. The top lead, LTS0048837, maintained a stable PDE4B-bound pose during simulation, with comparatively stronger interaction persistence than its PDE4D complex and the roflumilast reference. These findings nominate LTS0048837 as a computationally prioritized PDE4B-selective natural product lead requiring experimental enzyme, cellular, and pharmacokinetic validation.
bioinformatics2026-05-19v1Geometric averaging provides normalization-invariant feature ranking in compositional sequencing data
Nunzi, E.; Romani, L.Abstract
In compositional next-generation sequencing (NGS) analyses (including microbiome studies, RNA-seq and metagenomics) the arithmetic mean (AM) of relative proportions is the default operator for summarizing feature abundances. We show that this default produces unstable rankings in real compositional data. Across 102 prevalent genera in the dietswap dataset (n=38 baseline samples), 23 genera (22.5%), including members of Bacteroides, Eubacterium and Bilophila, yielded opposite group-level conclusions under AM and the geometric mean (GM). This pattern reflects two formal properties of compositional aggregation. First, AM-based rankings change with the within-sample normalization domain, whereas GM-based rankings are invariant under the multiplicative structure of compositional data. Second, the centered log-ratio (CLR) transformation absorbs geometric averaging into the data representation, so that arithmetic averaging on CLR-space recovers the GM ranking exactly. Both properties were verified numerically on the dietswap dataset, where the Spearman correlation between GM- and CLR-based rankings was 1.000 in both groups. The operator-choice problem propagates to between-group differential inference: under AM, log2 fold-changes vary across normalizations and the relative ranking of features by effect size is not preserved; under GM and CLR, the ranking is preserved. We recommend GM-based summaries for feature ranking and CLR-transformed abundances for cross-sample comparisons. This change requires no new computational tools and is fully compatible with existing differential-abundance pipelines, but eliminates an under-recognized source of irreproducibility in biomarker discovery across microbiome studies, transcriptomics, metagenomics, and mass-spectrometry-based metabolomics, in all settings where features are quantified relative to a sample total.
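The two properties stated in this abstract can be checked on toy data. The sketch below is not the authors' code: the helper names (`closure`, `clr`, `geometric_mean`, `ranking`) and the random counts are illustrative assumptions, and "normalization domain" is interpreted here as re-closing each sample over a subset of features. It assumes strictly positive counts; real data with zeros would need a pseudocount or another zero-handling step.

```python
import numpy as np

def closure(counts):
    """Normalize each sample (row) to proportions that sum to 1."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)

def clr(props):
    """Centered log-ratio: log proportion minus the per-sample mean log."""
    logp = np.log(props)
    return logp - logp.mean(axis=1, keepdims=True)

def geometric_mean(props):
    """Per-feature geometric mean across samples."""
    return np.exp(np.log(props).mean(axis=0))

def ranking(scores):
    """Feature indices ordered from highest to lowest score."""
    return np.argsort(-scores, kind="stable")

# Toy data (hypothetical): 8 samples x 5 features, strictly positive counts.
rng = np.random.default_rng(0)
counts = rng.integers(1, 1000, size=(8, 5))
props = closure(counts)

# Property 2: the arithmetic mean in CLR space equals log(GM) minus a
# constant shared by all features, so both give the same ranking.
gm_rank = ranking(geometric_mean(props))
clr_rank = ranking(clr(props).mean(axis=0))

# Property 1: re-closing over a subset of features rescales every GM by
# the same factor, so the GM ranking within the subset is unchanged.
sub_props = closure(counts[:, :3])
gm_rank_sub = ranking(geometric_mean(sub_props))
gm_rank_full_restricted = ranking(geometric_mean(props)[:3])
```

The arithmetic mean of proportions carries no such guarantee, because each sample is divided by a different total before averaging.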
bioinformatics2026-05-19v1Insertions, deletions, and exchangeable couplings: a Dirichlet process over TKF92 domains and sites
Large, A. L.; Holmes, I. H.Abstract
To introduce local heterogeneity, evolutionary models are often equipped with site-class mixtures, preserving this symmetry in the sense of de Finetti: conditional on the latent class, residues are still exchangeable. In a four-step theoretical ladder, we show how long-range structure such as couplings between distant sites can also be introduced exchangeably by using a Dirichlet process to partition sites into co-evolving classes. Our first step is a thorough analysis of TKF92 to establish sufficient statistics, limiting behavior, and inferential tools. We then lift the pairwise TKF92 hidden Markov model, in the limit of small time, to a time-indexed gravestone-augmented pair stochastic context-free grammar, and from there to its phylogenetic generalisation. This framing allows trajectories to be sampled exactly by Inside-Outside recursion. The third step places a Dirichlet process over the alive sites and asks co-keyed sites to evolve under a sparse Potts interaction --- an exchangeably-partitioned hidden direct-coupling model whose marginal alignment likelihood is unchanged from plain TKF92. The fourth rung of the ladder develops inference machinery: a Gibbs--Metropolis sampler that alternates alignment resamples, key-partition resamples, and stochastic parameter updates. We close several gaps along the way --- exact closed-form sufficient statistics for the linear birth--death--immigration component, the resolvable L'Hopital limit at {lambda}=, and a closed-form M-step for a recursive generalisation of TKF92 --- and we report a 1,000-family Pfam fit with K=4 site classes whose Potts atoms carry ~0.54 nats of covariation per class-pair on top of a substantial single-site substitution model. Supplementary material, including full source code for inference, may be found at https://tkfdp.net/.
bioinformatics2026-05-19v1Sequence alignment of the primate lineage reveals evolutionary divergence and conserved secondary structural motifs in noncoding RNAs
Beeram, A.; Perry, Z. R.; Pyle, A. M.Abstract
Long noncoding RNAs (lncRNAs) constitute most of the human transcriptome and perform essential roles in chromatin organization and transcriptional regulation. Because lncRNA genes are not constrained by protein-coding ability, they tend to exhibit more rapid evolutionary divergence. Their poor nucleotide sequence conservation among mammals often led to the assumption that lncRNAs lack conserved structures. However, emerging evidence indicates that many noncoding RNAs adopt secondary and tertiary folds critical for protein recruitment, chromatin binding, and regulation of gene expression. Nevertheless, there are few experimental secondary structures for lncRNAs, hindering mechanistic insight into lncRNA structure-function relationships. Even without available structural data, covariation, in which two nucleotides co-evolve, can provide evidence for conserved structures. This requires sequence alignments with sufficient divergence to detect covariation but enough similarity to maintain alignment quality. Here we report the development of a novel computational pipeline to mine 190 unannotated primate genomes to generate high-quality multiple sequence alignments of noncoding RNAs. This pipeline performs sequence searching, locus extraction, cross-species alignment, and downstream analyses, including assessment of covariation and primary sequence conservation. Ultimately, we demonstrate that because many noncoding elements, such as lncRNAs, evolve at a more rapid rate than protein-coding genes, phylogenetic analyses constrained within a narrower evolutionary span can be used to identify conservation of primary sequence and secondary structure. By focusing our alignments on the primate lineage, our method overcomes the limitations of broad phylogenetic analyses, enabling high-resolution detection of subtle conservation patterns and conserved secondary structural motifs of long noncoding RNAs.
bioinformatics2026-05-19v1Sequence-based Drug-Target Binding Site Pre-training Enables Cryptic Pocket Detection and Improves Binding Affinity and Kinetics Prediction
Zhang, S.; Xie, L.; Tiourine, D.; Xie, L.Abstract
Predicting protein-ligand binding characteristics, such as affinity and kinetics, is critical for accelerating drug discovery. However, many existing computational methods face key limitations, including insufficient integration of comprehensive databases, inadequate representation of protein structural dynamics, and incomplete modeling of microscale protein-ligand interactions. To address these challenges, we introduce ProMoNet, a sequence-based pre-training and fine-tuning framework to enhance the prediction of protein-ligand binding characteristics. ProMoNet leverages protein and molecular foundation models to expand data coverage and enhance diversity. It also introduces a pre-training strategy based on protein-ligand binding site prediction, which bridges protein- and ligand-level representations to support downstream prediction tasks involving protein-ligand complexes. Our pre-training module effectively models microscale protein-ligand interactions and captures the dynamic nature of proteins, including binding site crypticity, without relying on 3-dimensional structural inputs. Notably, this module surpasses or matches state-of-the-art structure-based methods in identifying exposed and cryptic binding sites while maintaining high efficiency. Our fine-tuning module then efficiently transfers the pre-trained knowledge to downstream tasks such as binding affinity and binding kinetics prediction, achieving superior performance. The combination of ProMoNet's strong performance and demonstrated efficiency across multiple tasks highlights its potential for broad applications in drug discovery.
bioinformatics2026-05-18v3CatIF-RL: Activity-Oriented Enzyme Sequence Design by Steered Inverse Protein Folding
Li, Y.; Xiong, J.; Zhang, Y.; Cai, T.; Fu, C.; Li, S.; Xu, W.; Lyu, R.; Chen, Z.; Guo, Z.; Gong, X.; Wang, F.Abstract
Protein inverse folding models are designed to generate amino acid sequences compatible with a given backbone structure, but they are not explicitly optimized for specific biological functions. Here, we present CatIF-RL, a framework that steers a graph-based denoising diffusion inverse folding model toward designing enzyme variants with enhanced catalytic activity. CatIF-RL first adapts the inverse folding model to enzyme structural data, then introduces activity-oriented preference signals using predicted catalytic constant (kcat) as the optimization objective, enabling specialization through generative dataset curation and group-relative policy optimization (GRPO). This process iteratively shifts the sequence distribution toward higher predicted kcat while constraining sequence divergence to sequences that remain compatible with the input structure. On the independent benchmark, CatIF-RL achieves an approximately four-fold increase in predicted kcat relative to native enzymes, substantially outperforming representative inverse folding methods, while maintaining sequence recovery (0.55) and structural fidelity, and supporting motif-preserving partial sequence design. CatIF-RL establishes a practical framework for activity-oriented enzyme design and provides a generalizable strategy for steering structure-conditioned protein generation toward functional optimization.
bioinformatics2026-05-18v2Ensemble Post-hoc Explainable AI for Multilead ECG: Identifying Disease-Relevant Features in Single-Lead Interpretations
Metsch, J.; Hempel, P.; Maurer, M. C.; Spicher, N.; Hauschild, A.-C.; Steinhaus, K. E.Abstract
Despite the growing success of deep learning (DL) in multivariate time-series classification, such as 12-lead electrocardiography (ECG), widespread integration into clinical practice has yet to be achieved. The limited transparency of DL hinders clinical adoption, where understanding model decisions is crucial for trust and compliance with regulations such as the General Data Protection Regulation (GDPR) or the EU AI Act. To tackle this challenge, we implemented a state-of-the-art 1D-ResNet in PyTorch that was trained on the large-scale Brazilian CODE dataset to classify six different ECG abnormalities. We applied the model to the German PTB-XL dataset and evaluated its decision-making processes using 16 post-hoc explainable AI (XAI) methods. To assess the clinical relevance of the model's attributions, we conducted a Wilcoxon signed-rank test to identify features with significantly higher relevance for each XAI method. We used an ensemble majority vote approach to validate whether the model has learned clinically meaningful features for each abnormality. Additionally, a Mann-Whitney U test was employed to detect significant differences in relevance attributions between correctly and incorrectly classified ECGs. Overall, the model achieved sensitivity scores above 0.9 for most abnormalities in the PTB-XL dataset. However, our XAI analysis showed that the model struggled to capture clinically relevant features for some diseases. Certain XAI methods, including DeepLift, DeepLiftShap, and Occlusion, consistently highlighted clinically meaningful features across abnormalities, while others, such as LIME, KernelShap, and LRP, failed to do so. Moreover, some XAI methods demonstrated significant differences in attributions between correctly and incorrectly classified ECGs, highlighting their potential for enhancing model robustness and interpretability.
In conclusion, our findings underscore the importance of selecting suitable XAI methods tailored to specific model architectures and data types to ensure transparency and reliability. By identifying effective XAI techniques, this study contributes to closing the gap between DL advancements and their clinical implementation, paving the way for more trustworthy AI-driven healthcare solutions.
bioinformatics2026-05-18v2On the applicability domain of HADDOCK3 for protein-aptamer docking: documented failure modes from a 5x7 cross-target screening matrix and a 1676 aa receptor case study (P01031)
Dohi, E.Abstract
We screened a 5-receptor x 7-aptamer = 35-cell cross-target screening matrix with HADDOCK3 under blind ambiguous-interaction-restraint (AIR) protocols on AlphaFold-modelled receptors. The 35-cell matrix is primarily a cross-target/decoy screening matrix rather than a 35-cognate-pair benchmark: it contains an n = 4 K_D-calibration subset under matched assay conditions, at least six biological cognate or intended-cognate cells, and the remaining cells are intentional non-target pairings used to characterise score-distribution behaviour. The screen surfaced 12 operationally distinct failure modes that collapse into five broad conceptual groups. The principal case study is P01031 (complement C5, 1676 aa, ≥ 12 structural domains): all seven panel members produced positive HADDOCK3 top-1 scores under a scale-adaptive AIR. Score-term decomposition locates the anomaly in the AIR term (+217 to +268 to top-1 score). With AIR zeroed, scores fall to -131 to -74 -- the small-receptor regime. Boltz-2 cofolding chain-pair ipTM (cpi_AB) is an independent channel: P01031 shows the lowest median cpi_AB (0.211; 0/7 above the 0.5 confident-interface threshold). To our knowledge, this is an early documented case study of a 1676 aa multi-domain receptor exhibiting this signature under a blind scale-adaptive AIR workflow -- an n = 1 mechanistic case, not a statistical generalisation. We adapt the QSAR applicability-domain concept to in silico aptamer screening. We report an empirical Mode 1 mitigation, a pLDDT-aware AIR prefilter, with cohort Jaccard recovery of ~10x. The n = 4 K_D-calibration Spearman ρ shift is reported as exploratory cross-method convergence, not as a calibration claim.
bioinformatics2026-05-18v2Design of DNA Aptamers for Lyme disease Diagnosis Combining experimental and numerical approaches
Issouani, E. M.; Da Ponte, H.; Guerin, M.; Padiolleau-Lefevre, S.; Maffucci, I.; Davila Felipe, M.; GAYRAUD, G.Abstract
Aptamers are single stranded DNA or RNA molecules selected for their high affinity and specificity to bind target molecules, similar to antibodies. They are commonly selected through the SELEX process, which involves the iterative exposure of a random sequence library to a target and retaining the sequences showing good binding properties. To improve Lyme disease detection, we propose designing aptamers that specifically bind to the CspZ protein on the surface of Borrelia burgdorferi, the bacterium responsible for the disease. Starting with a SELEX process consisting of thirteen rounds, from which selected in vitro sequence candidates have emerged, we aim to propose a holistic process that selects in silico new sequence candidates that are further validated experimentally. Our approach relies on 1) using Machine Learning (ML) techniques, specifically a Restricted Boltzmann Machine (RBM), to digitally replicate the last round of the SELEX process, 2) integrating insights from text analysis methods, such as word2vec and n-grams, into the RBM model trained on the final-round SELEX dataset to represent and compare newly generated sequences with in vitro candidates, 3) selecting in silico sequences with strong potential to bind to CspZ protein, 4) experimentally validating the selected in silico sequences of step 3. Our holistic approach combines biological insights with statistical models to improve the efficiency and outcome of the SELEX process. We enhance the RBM model, designed to replicate the distribution of the final SELEX round, by integrating geometric representations of sequences, which is especially advantageous when dealing with limited datasets relative to the vast sequence space. In addition, it provides in silico sequence candidates with strong binding properties.
bioinformatics2026-05-18v2Systematic cross-study assessment of RNA-Seq experimental workflows for plasma cell-free transcriptome profiling
Tuni, C.; Asole, G.; Monteagudo-Mesas, P.; Rusu, E. C.; Cabus, L.; Gonzalez, L.; Sanchez, L.; Neto, B.; Sanders, P.; Weber, M.; Lagarde, J.Abstract
Plasma cell-free RNA (cfRNA) is a promising source of non-invasive biomarkers, but its clinical translation is hindered by technical challenges and a lack of protocol standardization, which compromises reproducibility and comparability across studies. There is a need for a systematic evaluation of existing cfRNA-Seq workflows to understand the drivers of technical variability. Here, we address this gap by performing a comprehensive cross-study analysis of 2,1666 cfRNA-Seq samples from 15 published studies and an in-house generated dataset, applying a uniform bioinformatics pipeline to enable a controlled comparison of experimental workflows. Our analysis reveals that the donor phenotype typically explains a negligible fraction of the transcriptomic variation, whose main determinants are technical -- principally protocol choice, genomic DNA contamination levels and library diversity. Remarkably, this technical noise is so profound that variation within plasma cfRNA samples exceeds that found across a wide range of human tissues. Furthermore, we demonstrate that critical pre-analytical factors are often confounded with patient phenotypes, jeopardizing the validity of biomarker discovery efforts. Finally, we identify a 100 bp fragment-length threshold as a vital requirement for reliable cfRNA-based taxonomic profiling. Our work serves as a comprehensive benchmark of current cfRNA-Seq methodologies and provides evidence-based guidelines to improve experimental design. By highlighting the dominance of controllable technical factors, we offer a path towards more robust and reproducible cfRNA research.
bioinformatics2026-05-18v2petVAE: A Data-Driven Model for Identifying Amyloid PET Subgroups Across the Alzheimer's Disease Continuum
Tagmazian, A. A.; Schwarz, C.; Lange, C.; Pitkänen, E.; Vuoksimaa, E.Abstract
Amyloid-β (Aβ) PET imaging is a core biomarker and is sufficient for the biological diagnosis of Alzheimer's disease (AD). Here, we aimed to identify biologically meaningful subgroups across the continuum of Aβ accumulation using a data-driven deep learning approach, without imposing predefined thresholds for Aβ negativity or positivity. We analyzed 3,110 Aβ PET scans from the Alzheimer's Disease Neuroimaging Initiative and the Anti-Amyloid Treatment in Asymptomatic Alzheimer's Disease studies to develop petVAE, a two-dimensional variational autoencoder. The model accurately reconstructed scans without prior labeling or selection by scanner or region of interest. Latent representations of scans extracted from petVAE were used to visualize and cluster the AD continuum. Clustering yielded four groups: two predominantly Aβ negative (Aβ-, Aβ-+) and two predominantly Aβ positive (Aβ+, Aβ++). All clusters differed significantly in standardized uptake value ratio (p < 1.64e-8) and cerebrospinal fluid (CSF) Aβ (p < 0.02), demonstrating petVAE's ability to assign scans along the Aβ continuum. Extreme clusters (Aβ-, Aβ++) resembled conventional Aβ negative and positive groups and differed in cognition, APOE ε4 prevalence, and Aβ and tau CSF biomarkers (p < 3e-6). Intermediate clusters (Aβ-+, Aβ+) showed higher odds of carrying at least one APOE ε4 allele versus Aβ- (p < 0.03). Participants in Aβ+ or Aβ++ clusters exhibited faster progression to AD (Aβ+ hazard ratio = 2.42, Aβ++ HR = 9.43; p < 1.17e-7). Thus, petVAE was capable of reconstructing PET scans while extracting latent features that capture the AD continuum and define biologically meaningful subgroups, enabling data-driven characterization of preclinical disease stages.
bioinformatics2026-05-18v2Mantis-Delta: Mass-Action Network Theory and Steady-State Characterization for Chemical Reaction Networks
Venegas Hernandez, E. A.Abstract
Chemical Reaction Network Theory (CRNT), developed by Horn, Jackson, and Feinberg, provides parameter-free structural theorems that constrain the asymptotic dynamics of mass-action systems irrespective of the numerical values of the rate constants. Despite the maturity of the theory, modern open-source implementations that combine CRNT structural analysis with symbolic ordinary differential equation (ODE) construction and robust numerical steady-state finding remain scarce. We present mantis-delta, a pure Python library that ingests human-readable reaction strings, builds the complex reaction graph, computes the deficiency δ = n − ℓ − s (n complexes, ℓ linkage classes, s the rank of the stoichiometric matrix) and weak reversibility, and decides applicability of the Deficiency Zero Theorem (DZT) and Deficiency One Theorem (D1T). For systems satisfying these structural conditions, mantis-delta certifies, without any simulation whatsoever, existence, uniqueness and (for DZT) asymptotic stability of the positive steady state in every stoichiometric compatibility class. When the structural theorems do not apply, the library provides symbolic mass-action ODEs and Jacobians via SymPy and a hybrid numerical solver that combines stiff implicit integration with bound-constrained algebraic least-squares to locate both stable and unstable fixed points, including Hopf bifurcation centres inaccessible to forward integration. We demonstrate the workflow on six benchmarks: a reversible isomerisation, the Michaelis-Menten enzyme mechanism, the closed and chemostatted Brusselator, a catalytic hairpin assembly (CHA) miR-21 biosensor, and the Goldbeter-Koshland zero-order ultrasensitivity switch. In each case, the CRNT-predicted qualitative behaviour (monostability, oscillation, uniqueness) is recovered numerically with a residual below 10⁻⁶ M s⁻¹, and the Goldbeter-Koshland dose-response curve agrees with the closed-form quasisteady-state approximation to within 1% over a 400x kinase/phosphatase activity scan.
mantis-delta is open-source (MIT license) and available at https://github.com/EmilioVenegas/mantis
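The deficiency computation this abstract describes can be sketched from the textbook CRNT definition δ = n − ℓ − s, where n is the number of distinct complexes, ℓ the number of linkage classes, and s the rank of the reaction vectors. The code below is an independent illustration, not the mantis-delta API: `parse_complex` and `deficiency` are hypothetical names, the reaction-string syntax (coefficients separated from species by a space, `0` for the empty complex) is an assumption, and the Brusselator is written in its common textbook form, which may differ from the paper's benchmark.

```python
import numpy as np

def parse_complex(text):
    """Turn 'A + 2 B' into a sorted tuple of (species, coefficient) pairs."""
    coeffs = {}
    for term in text.split("+"):
        parts = term.split()
        if not parts or parts == ["0"]:
            continue  # '0' denotes the empty (zero) complex
        if len(parts) == 2:
            count, name = int(parts[0]), parts[1]
        else:
            count, name = 1, parts[0]
        coeffs[name] = coeffs.get(name, 0) + count
    return tuple(sorted(coeffs.items()))

def deficiency(reactions):
    """Compute delta = n - ell - s for a list of reaction strings."""
    edges = []
    for rxn in reactions:
        # Reversibility changes neither the linkage classes nor the rank.
        left, right = rxn.replace("<->", "->").split("->")
        edges.append((parse_complex(left), parse_complex(right)))

    complexes = sorted({c for edge in edges for c in edge})
    species = sorted({name for c in complexes for name, _ in c})
    n = len(complexes)

    # Linkage classes: connected components of the complex graph (union-find).
    parent = {c: c for c in complexes}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for a, b in edges:
        parent[find(a)] = find(b)
    ell = len({find(c) for c in complexes})

    # s: rank of the reaction vectors (product complex minus reactant complex).
    def vector(c):
        v = np.zeros(len(species))
        for name, count in c:
            v[species.index(name)] = count
        return v

    S = np.array([vector(b) - vector(a) for a, b in edges])
    s = int(np.linalg.matrix_rank(S))
    return n - ell - s

michaelis_menten = ["E + S <-> ES", "ES -> E + P"]
brusselator = ["0 -> X", "X -> Y", "2 X + Y -> 3 X", "X -> 0"]
delta_mm = deficiency(michaelis_menten)
delta_bruss = deficiency(brusselator)
```

Weak reversibility, the other structural input to the Deficiency Zero Theorem, would additionally require checking that every linkage class is strongly connected in the directed complex graph; that step is omitted here.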
bioinformatics2026-05-18v1Rescuing true protein binders from AI hallucinations via zero-shot, ensemble-driven statistical physics scoring
Chou, C.-H.; Hong, X.; Xu, J.Abstract
The advancement of deep generative models has facilitated de novo protein and antibody design, yet translation to experimental success is hindered by a high generation rate of structural decoys. Current affinity predictors and standard structural confidence metrics fail to reliably distinguish these AI hallucinations from true binders. Here, we present Sipobe-PPA, an affinity ranking framework that conceptualizes interacting protein interfaces as pseudo-ligands, evaluating them through an AI-driven statistical physics forcefield. Because this forcefield is trained exclusively on small-molecule interactions, Sipobe-PPA acts as a zero-shot physical evaluator for protein-protein interfaces, protecting the framework against the data leakage and memorization pitfalls that affect models trained directly on protein complex datasets. To capture the structural plasticity of binding interactions, Sipobe-PPA employs a conformational ensemble strategy, computing interaction scores across multiple AlphaFold 3 (AF3)-predicted structural states. Benchmarking on decoy-rich de novo datasets (including Bindcraft, Boltzgen, and the Germinal antibody dataset) demonstrates the significant improvement offered by this approach. In a real-world pipeline scenario simulating wet-lab constraints (pre-filtered by AF3 ipTM > 0.8 and pLDDT > 80), Sipobe-PPA achieved an 80% Hit Rate within its Top 5 predictions across the combined dataset, compared to 0% for physical baselines like Rosetta-dG. Notably, our structural ensemble averaging outperformed single-structure scoring, highlighting the necessity of modeling prediction diversity. By maximizing top-tier hit rates across diverse nanobody and de novo targets, Sipobe-PPA provides a scalable screening paradigm that bridges the gap between computational generation and wet-lab viability.
bioinformatics2026-05-18v1Evaluating open LLMs for agentic analysis orchestration in a typical biomedical lab
Nekrutenko, A.Abstract
Agentic tools - software environments where a large language model plans, calls external tools, executes code, and iterates with minimal human intervention - will run a substantial share of routine biomedical data analysis within the next few years. However, per-call inference cost on frontier models is the bottleneck and can add up quickly. Here, we tested whether a free, locally-runnable open-weight model could take over the repetitive execution steps at frontier accuracy. We used Claude's Opus to author plans of increasing detail for per-sample variant calling, and ran six 2026-release open-weight implementer LLMs against those plans on a set of desktop GPUs. qwen3.6:27b reproduced frontier accuracy on every plan and matched Opus cell-for-cell on a 36-cell error-injection matrix. A sub-$2,000 Jetson or Apple Mac Mini sufficed for the implementer side. The open-weight model landscape evolves on the order of months, so the specific implementer recommended here will be superseded; we provide the plans, harness, scoring code, and per-cell artifacts at https://github.com/nekrut/LLM-eval-paper as a framework for re-evaluating future models.
bioinformatics2026-05-18v1Manchester Proteome Profiler: A User-Friendly Platform for Quantitative Proteomic Analysis
Cain, S. A.; Fatima, M.; Humphries, M.Abstract
Manchester Proteome Profiler (MPP) is an open-source R Shiny application that streamlines downstream analysis of quantitative proteomic data. Compatible with grouped protein intensity tables from MaxQuant, FragPipe, Proteome Discoverer and other custom layouts, MPP provides an integrated platform for filtering, normalisation, imputation, differential expression analysis and cluster analysis across user-chosen experimental conditions. MPP supports both single- and dual-dataset comparisons, incorporates SAINTexpress for affinity purification and proximity labelling experiments, and connects clusters from the significant protein list to downstream functional enrichment and interaction network analysis via Gene Ontology, BioGRID and STRING. Benchmarking with a KRAS proximity biotinylation dataset demonstrated the ability of MPP to identify reproducible clusters of differentially expressed proteins and reveal biologically meaningful patterns, including enrichment of solute carrier transporters and adhesion molecules. With interactive visualisations, customisable reports, and support for complex experimental designs, MPP offers a novel, versatile and user-friendly environment for proteomic data exploration and hypothesis generation.
bioinformatics2026-05-18v1Elab2ARC: A Browser-Based Workspace for Converting Free-Text Protocols into rich FAIR digital objects
Zander, S.; Zhou, X.-R.; Kranz, A.; Dumschott, K.; Rocca-Serra, P.; Weil, H. L.; Tschoepke, M.; Muehlhaus, T.; Von Suchodoletz, D.; Usadel, B.Abstract
Electronic laboratory notebooks (ELNs) are widely used in the life sciences, but their notebook format limits machine-readability and FAIR compliance. Consequently, researchers often spend significant manual effort restructuring ELN records into publication-ready outputs. We present elab2ARC, a browser-based workspace that automates the conversion of open-source eLabFTW records into Annotated Research Contexts (ARCs) - version-controlled, ISA-compliant research objects. Using the eLabFTW API, elab2ARC retrieves administrative metadata, protocols, and attachments, reorganising them into ISA-compliant tables and linked datasets. All processing occurs client-side, ensuring user data control before submission to the PLANTdataHUB repository. An optional LLM-assisted workflow extracts structured metadata from free-text protocols, providing editable drafts while preserving human oversight. Designed for use at project completion, elab2ARC reuses existing ELN documentation without disrupting daily laboratory practice. It offers a practical route to FAIR-aligned sharing, publication, and long-term archiving of life-science experimental records.
bioinformatics2026-05-18v1Interpretable Predictive Modeling for Medical Data Using Boolean Rule-aware Regression
Eskandarian, M.; Malekpour, S. A.Abstract
Purpose: In clinical practice, accurate prediction of disease risk must be accompanied by transparent, human-understandable explanations to support diagnostic confidence, guide therapeutic decisions, and meet ethical and regulatory standards. While deep neural networks achieve high predictive performance in tasks such as cancer detection and diabetes risk stratification, their black-box nature prevents clinicians from understanding the reasoning behind predictions, severely limiting trust and safe integration into patient care. Methods: We present Regression-Based Boolean Rule (RBBR), a framework that automatically derives clinically interpretable Boolean rules directly from patient data. RBBR generates human-readable conjunctions (logical AND combinations) of up to three clinical features, transforms them into inputs for ridge regression to predict binary or multi-class disease outcomes, estimates rule importance via regularized coefficients, and selects the most parsimonious and predictive rule sets using the Bayesian Information Criterion. Results: Applied to six real-world medical datasets (lung cancer screening and staging, Wisconsin and diagnostic breast cancer, heart failure, and early-stage diabetes risk), RBBR consistently produced concise, clinically meaningful rules - e.g., gender-specific symptom combinations in diabetes, distinct histopathological subpopulations in breast cancer, and symptom-risk factor interactions in lung cancer - with strong explanatory power (R² up to 0.92) and competitive discrimination. Conclusion: By delivering logical, transparent decision rules aligned with clinical reasoning (if symptom A and B, then high risk), RBBR bridges the gap between predictive accuracy and bedside usability, enabling clinicians to validate predictions, identify high-risk patients, stratify subpopulations, and enhance shared decision-making in routine care.
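The Methods pipeline (conjunction features, ridge regression, coefficient-based rule importance, BIC selection) can be illustrated on synthetic data. This is a rough sketch under stated assumptions and not the RBBR implementation: the function names are hypothetical, features are assumed to be already binarized, negated literals and multi-class outcomes are omitted, and the BIC step (a Gaussian BIC on a least-squares refit of the top-k rules, with a small variance floor `eps`) is one plausible reading of "selects rule sets using the Bayesian Information Criterion".

```python
import itertools
import numpy as np

def conjunction_features(X, max_order=3):
    """Build AND-combinations of up to max_order binary columns."""
    rules, cols = [], []
    for order in range(1, max_order + 1):
        for idx in itertools.combinations(range(X.shape[1]), order):
            rules.append(idx)
            cols.append(X[:, list(idx)].all(axis=1).astype(float))
    return rules, np.column_stack(cols)

def ridge_fit(F, y, alpha=1e-3):
    """Closed-form ridge on centered data, so the intercept is unpenalized."""
    Fc = F - F.mean(axis=0)
    yc = y - y.mean()
    A = Fc.T @ Fc + alpha * np.eye(F.shape[1])
    return np.linalg.solve(A, Fc.T @ yc)

def bic(F, y, eps=1e-12):
    """Gaussian BIC of a least-squares refit on the chosen rule columns."""
    n = len(y)
    D = np.column_stack([np.ones(n), F])
    beta, *_ = np.linalg.lstsq(D, y, rcond=None)
    rss = float(np.sum((y - D @ beta) ** 2))
    return n * np.log(rss / n + eps) + D.shape[1] * np.log(n)

# Synthetic data (hypothetical): all 16 combinations of 4 binary features,
# with the outcome defined as "feature 0 AND feature 1".
X = np.array(list(itertools.product([0, 1], repeat=4)), dtype=bool)
y = (X[:, 0] & X[:, 1]).astype(float)

rules, F = conjunction_features(X)
coef = ridge_fit(F, y)
order = np.argsort(-np.abs(coef))  # rule importance by |coefficient|

# Keep the top-k rules whose refit minimizes BIC (k = 1..5).
best_k = min(range(1, 6), key=lambda k: bic(F[:, order[:k]], y))
selected = [rules[i] for i in order[:best_k]]
```

On this noise-free toy problem the procedure recovers the single generating rule; with real clinical data the ridge penalty, the candidate range for k, and the handling of correlated rules would all need tuning.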
bioinformatics2026-05-18v1The Paipu framework enables creation of a large-scale mammalian cancer transcriptomics atlas
Smith, B. S.; Smith, L. A.; Lee, J.-H.; Cahill, J. A.; Graim, K.Abstract
A plethora of studies have identified shared molecular mechanisms involved in tumor development across humans and other mammalian species. While these two-species analyses advance understanding of human disease, extending them across many species would provide evolutionary insight into molecular mechanisms driving human cancers. However, this expansion requires knowledge transfer and harmonization across species. Genomic differences between species, including variation in genome annotation quality, have historically hindered multi-species large-scale atlas creation. To overcome these challenges, we present Paipu, a comprehensive pipeline designed to streamline querying, preprocessing, harmonization, and retrieval of large-scale RNA-seq data and associated metadata from the NCBI Sequence Read Archive (SRA). Paipu facilitates multi-species analysis by creating a harmonized atlas from user-defined search terms and species. It consists of three components: reference genome preparation, SRA metadata retrieval, and RNA-seq data processing. We apply Paipu to 188 cancer-related terms in 239 non-human mammalian species, creating a harmonized atlas of 3,484 RNA-seq samples spanning 17 species and 35 cancers. This pan-mammalian pan-cancer atlas enables myriad comparative genomics analyses that leverage genetic variation to better understand rare human cancers. As such, Paipu serves as a resource for cross-species cancer genomics and supports atlas creation for any set of species and search terms.
bioinformatics2026-05-18v1Nutritional-Metabolic Lipid Profiling with LipidOne for plasma lipidomics interpretation in metabolic health
Frongia Mancini, D.; Alabed, H. B. R.; Pellegrino, R. M.Abstract
Background/Objectives: Human plasma lipidomics provides valuable information on dietary and metabolic phenotypes, but the interpretation of high-dimensional lipid datasets remains challenging. We developed the Nutritional-Metabolic Lipid Profile (NMLP) module within LipidOne to translate plasma lipidomics data into interpretable nutritional-metabolic indices, functional categories, visual outputs, and biological statements. Subjects/Methods: NMLP calculates lipid indices reflecting cardiometabolic lipid status, fatty acid remodelling, overall lipid quality, oxidative protection, and omega-3/essential fatty acid status. The module was applied to three human plasma lipidomics public datasets: a randomized crossover glycemic-load feeding study, a eucaloric high-fat diet intervention in normal-weight women, and a large public dataset stratified by insulin sensitivity. Results: Across datasets, NMLP converted complex lipidomic matrices into coherent nutritional-metabolic profiles. In the glycemic-load study, the module highlighted metabolic lipid shifts not captured by standard clinical lipid panels, mainly involving cardiometabolic lipid status, oxidative protection, and fatty acid remodelling. In the high-fat diet intervention, NMLP tracked temporal lipid remodelling across pre-diet, on-diet, and post-diet states, consistent with metabolic adaptation to increased dietary fat exposure. In the insulin-sensitivity dataset, insulin-resistant subjects showed a storage-oriented lipid phenotype characterized by increased neutral lipid storage indices and altered lipid quality and oxidative-protection features. Category-level clustering further revealed heterogeneous nutritional-metabolic states within insulin-resistant subjects. Conclusions: NMLP provides a deeper and clearer interpretative framework for human plasma lipidomics in nutrition and metabolic health research. 
By translating lipid species into functional indices and category-level readouts, the module may facilitate the use of lipidomics in clinical nutrition, metabolic phenotyping, and precision nutrition studies. NMLP is freely accessible as part of the online LipidOne platform.
bioinformatics2026-05-18v1Metabarcode and transcriptome datasets of Pinus sylvestris to assess fungal phyllosphere and disease dynamics.
Moore, B.; Perry, A.; Kaur, S.; Crampton, B.; Gurung, A.; Beaton, J.; Smith, V. A.; Morris, J.; Hedley, P. E.; Nemeth, K.; Barber, H.; Cavers, S.; Jones, S.Abstract
Understanding how host-microbiome interactions influence tree disease is critical for understanding forest resilience. Here, we present foliar microbiome ITS2 metabarcoding and transcriptomic datasets from Pinus sylvestris to investigate susceptibility to Dothistroma needle blight (DNB), a globally important foliar disease caused by Dothistroma septosporum. We hypothesised that host genotype shapes foliar microbial communities and their interactions, thereby influencing disease outcomes. Samples were collected from a progeny provenance field trial in the south of Scotland representing a broad spectrum of disease susceptibilities. The dataset comprises ITS2 metabarcoding samples from 200 genotypes across three timepoints and RNAseq samples from 48 genotypes across two timepoints. Sampling captured key stages of pathogen exposure and disease progression. Both standardised and bespoke protocols were used for nucleotide extraction, sequencing, and quality control, including multiple negative and positive controls. These datasets, available in the European Nucleotide Archive (project accession PRJEB88228), enable analysis of temporal dynamics in foliar fungal communities, host microbiome transcriptional responses, and genotype-dependent variation in disease susceptibility.
bioinformatics2026-05-18v1Discriminative learning of substitution matrices and gap penalties for pairwise alignment of biological sequences
Ciach, M. A.; Zacharopoulou, E.; Startek, M. P.; Miasojedow, B.; Alexiou, P.Abstract
Pairwise alignment scores are used to classify pairs of sequences in many areas of bioinformatics, including homology search, predicting interactions, or read mapping. The relative scores of different pairs strongly depend on the choice of a substitution matrix and gap penalties, but the existing approaches for the estimation of these parameters do not directly optimize them for the task of classification. In this work, we present DiscrimAlign, a statistical model for discriminative learning of substitution matrices and gap penalties from a dataset of positive and negative pairs of unaligned biological sequences. The model links the alignment score of a sequence pair with the associated binary label through a logistic function and learns the parameters by likelihood maximization. We analyze theoretical properties of the model, derive and implement a learning procedure, study its performance in simulated experiments, and apply it to predict microRNA-target interactions. We show that sequence alignment with discriminative substitution matrices and gap penalties predicts the interactions comparably to state-of-the-art neural network classifiers while being more interpretable. An implementation of the model and reproducibility workflows are available at https://github.com/BioGeMT/DiscrimAlign.
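The link between alignment score and binary label described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the DiscrimAlign code: it assumes global (Needleman-Wunsch) alignment with a linear gap penalty, a toy match/mismatch substitution matrix, and an added bias term. All names (`alignment_score`, `interaction_probability`, `log_likelihood`, `SUB`, `GAP`) are hypothetical.

```python
import math

def alignment_score(a, b, sub, gap):
    """Best global alignment score; `gap` is a (negative) per-gap score."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            curr[j] = max(prev[j - 1] + sub[(a[i - 1], b[j - 1])],  # substitution
                          prev[j] + gap,                            # gap in b
                          curr[j - 1] + gap)                        # gap in a
        prev = curr
    return prev[len(b)]

def interaction_probability(score, bias=0.0):
    """Logistic link from alignment score to P(label = 1)."""
    return 1.0 / (1.0 + math.exp(-(score + bias)))

def log_likelihood(pairs, labels, sub, gap, bias=0.0):
    """Objective that a learning procedure would maximize over sub, gap, bias."""
    total = 0.0
    for (a, b), y in zip(pairs, labels):
        p = interaction_probability(alignment_score(a, b, sub, gap), bias)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

# Toy parameters: +1 for a match, -1 for a mismatch, -2 per gap.
SUB = {(x, y): (1 if x == y else -1) for x in "ACGU" for y in "ACGU"}
GAP = -2
pairs = [("ACGU", "ACGU"), ("AAAA", "UUUU")]
labels = [1, 0]
ll = log_likelihood(pairs, labels, SUB, GAP)
```

In a discriminative setting the entries of `SUB` and the value of `GAP` would be the free parameters, adjusted so that `ll` increases on the labeled pairs.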
bioinformatics2026-05-18v1A Multimodal Neural Network Model for Early Recurrence Prediction in Lung Adenocarcinoma
Patricoski-Chavez, J. A.; Hayek, K.; Singh, R.; Azzoli, C. G.; Warner, J. L.; Gamsiz Uzun, E. D.Abstract
Lung adenocarcinoma (LUAD), a subtype of non-small cell lung cancer (NSCLC), is the most common primary lung cancer worldwide. Despite advancements in early detection and treatment, up to 39% of patients develop recurrent tumors following complete resection. Currently, no widely available models exist for reliably predicting early recurrence of LUAD, which is a significant prognostic factor of post-recurrence survival. Models leveraging deep learning (DL) techniques have demonstrated notable utility in cancer recurrence prediction, particularly when used in combination with both clinical and genomic data. We developed a DL-based model, Predicting Lung Adenocarcinoma recurrence via Selective Multimodal Attention (PLASMA), to predict early recurrence using clinical, mRNA expression, and mutation data from patients with primary stage I-III LUAD. Trained on The Cancer Genome Atlas (TCGA) dataset, PLASMA outperformed traditional machine learning models in predicting early recurrence in both the TCGA test set and an external validation set (TRACERx Lung), achieving area under the receiver operating characteristic curve (AUROC) scores of 85.0% and 76.5%, respectively. Our results support the potential of multimodal DL for early LUAD recurrence prediction and risk stratification.
bioinformatics2026-05-18v1Learning Chirality-Aware Representations to Predict Drug Side Effect Frequencies
Galeano, A.; Dutra, I.; Ferreyra, S.; Paccanaro, A.Abstract
Ab initio prediction of side effect frequencies is important for assessing the risk-benefit profile of drugs and for identifying potential adverse effects early in development. A key challenge is chirality: many drugs exist as enantiomers, pairs of molecules with the same atoms and bond connectivity but different three-dimensional arrangements. Although chemically similar, enantiomers can interact differently with biological targets and therefore exhibit distinct efficacy and adverse-effect profiles. Here we introduce F2S (Features to Signatures), a method to predict the frequencies of drug side effects while explicitly accounting for chirality. Drug representations are learned directly from chemical structure using a directed-bond message-passing graph neural network that captures stereochemical configurations. Side effect representations are derived from curated textual descriptions encoded with a frozen PubMedBERT model. Side effect frequencies are predicted from the dot product between drug and side effect signatures together with biases for drugs and side effects. We evaluated F2S extensively across multiple settings, including cold-start and warm-start prediction, prospective evaluation, and scenarios controlling for chemical similarity between training and test drugs. Across these evaluations, F2S achieves performance comparable to state-of-the-art methods for general side-effect frequency prediction while producing fewer false positives and substantially improves the prediction of frequency differences between enantiomer pairs. Finally, F2S learns compact 10-dimensional signatures that support interpretability: drug signatures reflect therapeutic class and shared targets, side-effect signatures capture phenotype similarity, and the learned bias terms correlate with the popularity of drugs and side effects.
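The scoring rule stated in this abstract (dot product of drug and side-effect signatures plus two bias terms) can be written in a few lines. This is a sketch only: the signature values and biases below are made up, the function name `predict_frequency` is hypothetical, and any output transformation or rescaling the authors may apply is not shown.

```python
def predict_frequency(drug_signature, effect_signature, drug_bias, effect_bias):
    """Dot product of the two signatures plus the drug and side-effect biases."""
    dot = sum(d * e for d, e in zip(drug_signature, effect_signature))
    return dot + drug_bias + effect_bias

# Illustrative 10-dimensional signatures (values are invented, not learned).
drug = [0.5, 0.0, 1.0, 0.0, 0.0, 0.25, 0.0, 0.0, 0.0, 0.0]
effect = [1.0, 0.0, 2.0, 0.0, 0.0, 4.0, 0.0, 0.0, 0.0, 0.0]
score = predict_frequency(drug, effect, drug_bias=0.5, effect_bias=0.25)
```

Because two enantiomers would receive different `drug` vectors from a stereochemistry-aware encoder, the same side-effect signature can yield different predicted frequencies for each form.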
bioinformatics2026-05-18v1Stereochemistry-Aware Drug-Target Affinity Prediction
Ferreyra, S.; Dutra, I.; Galeano, A.; Paccanaro, A.Abstract
Drug-target affinity (DTA) prediction is a key task in drug discovery, enabling the estimation of the interaction strength between candidate compounds and biological targets. However, current models rely on connectivity-based molecular representations and do not explicitly account for the spatial organization of atoms, known as stereochemistry. This limitation becomes evident when considering chirality, where a drug can exist as enantiomers, i.e., molecules that share the same atoms and bonds but differ in their three-dimensional arrangement. Despite their chemical similarity, they can interact differently with the same target, leading to variations in binding affinity and biological activity. In this paper, we propose a stereochemistry-aware DTA prediction framework that incorporates this information into molecular representations. Drug representations are learned from chemical structure using a directed-bond message passing graph neural network that captures enantiomer configurations, while protein targets are represented through sequence-based embeddings. Experiments on the Davis dataset demonstrate that our model can improve affinity prediction. Importantly, a case study on a manually curated dataset of enantiomers with different biological action shows that the model is able to distinguish the affinities of the two forms, consistent with their experimentally observed biological activity. These findings support the relevance of stereochemistry-aware molecular representation for more accurate and chemically faithful DTA prediction.
bioinformatics2026-05-18v1HiCPEP: Efficient estimation of chromatin compartment PC1 from Hi-C covariance structure
Cheng, Z.-R.; Chang, J.-M.Abstract
Principal component analysis (PCA) of the Hi-C Pearson correlation matrix is the standard approach for identifying A/B chromatin compartments. Despite its widespread use, the relationship between the first principal component (PC1) and the underlying compartment structure remains insufficiently characterized, and computing PC1 can become computationally expensive for high-resolution Hi-C data. Here we investigate the role of the PC1 explained variance ratio in compartment analysis and show that chromosomes with strong compartment organization typically exhibit a dominant PC1 signal. Based on this observation, we propose HiCPEP, a heuristic algorithm that estimates the sign pattern and relative magnitude of PC1 directly from the Hi-C Pearson covariance matrix without performing explicit eigenvector decomposition. The method can operate from either a dense Pearson matrix for fast approximation or a sparse observed/expected (O/E) matrix to reduce memory usage. Furthermore, because many covariance columns exhibit PC1-like patterns when the compartment signal is strong, HiCPEP can be accelerated using random sampling without substantially reducing accuracy. Across multiple Hi-C datasets, HiCPEP consistently recovered compartment patterns with high similarity to reference PC1 vectors produced by standard PCA-based methods. Benchmark experiments show that HiCPEP achieves comparable accuracy while reducing computational cost in terms of runtime or memory usage. These results suggest that HiCPEP provides a practical alternative for efficient chromatin compartment analysis from large-scale Hi-C datasets. The HiCPEP implementation is freely available at https://github.com/ZhiRongDev/HiCPEP.
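The observation that covariance columns look like PC1 when the compartment signal is strong can be illustrated on a toy matrix. If the covariance matrix is close to rank one, C ≈ λvvᵀ, each column is approximately a scaled (possibly sign-flipped) copy of v. The sketch below is not the HiCPEP algorithm: the rule for choosing a column (largest sum of absolute values) is an assumption made here, and the helper names and toy Pearson matrix are hypothetical.

```python
def column_covariance(matrix):
    """Sample covariance between the columns of a square matrix."""
    n, m = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in matrix]
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(m)]
            for i in range(m)]

def estimate_pc1_pattern(cov):
    """Pick one covariance column as a PC1 stand-in (selection rule is assumed)."""
    best = max(range(len(cov)), key=lambda j: sum(abs(v) for v in cov[j]))
    return cov[best]  # cov is symmetric, so row `best` equals column `best`

def toy_pearson(labels):
    """Toy Pearson matrix: 0.8 within a compartment, -0.6 between compartments."""
    return [[1.0 if i == j else (0.8 if a == b else -0.6)
             for j, b in enumerate(labels)]
            for i, a in enumerate(labels)]

compartments = ["A", "A", "A", "B", "B", "B"]
pattern = estimate_pc1_pattern(column_covariance(toy_pearson(compartments)))
signs = [value > 0 for value in pattern]
```

As with PC1 itself, the overall sign of `pattern` is arbitrary, so only the split of bins into two groups is meaningful without an external orientation signal.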
bioinformatics2026-05-18v1HESTIA: Scalable Multimodal Integration of Histology and High-Resolution Spatial Transcriptomics for Robust Spatial Domain Identification
Zhong, Z.; Zhu, X.; Guo, J.; Liao, S.; Chen, A.Abstract
Spatial omics has revolutionized molecular biology by providing invaluable insights into how native tissue microenvironments regulate cellular functions and disease mechanisms. Accurately capturing this structural complexity and decoding the underlying biological processes requires effectively integrating data from multiple modalities. However, transitioning to subcellular resolutions introduces massive data scales and severe transcriptomic sparsity, which challenge current analytical frameworks. To address this, we present HESTIA (Histology-Enhanced Scalable cross-Resolution inTegration for spatial trAnscriptomics), a highly efficient multimodal algorithm designed for identifying spatial domains in large-scale, high-resolution spatial omics data. By circumventing memory-intensive computations, HESTIA processes massive datasets that existing algorithms fail to handle due to memory constraints. HESTIA outperforms current multimodal methods in clustering accuracy and spatial continuity, accurately delineating fine structural boundaries. Furthermore, applying HESTIA to large-scale pathological samples successfully dissects clinically relevant intratumoral heterogeneity and maps distinct immune microenvironments in lung and colorectal cancers.
bioinformatics2026-05-18v1Combining amino acid frequency and 1D convolutional neural network embeddings for the identification of protein-protein interactions using a random forest classifier
Sindhi, N. A.; Pawar, N.; Dixson, J.; Garcia, D.Abstract
Predicting protein-protein interactions is a fundamental problem in molecular biology. Experimental approaches for identifying protein-protein interactions are time-consuming and labor-intensive, motivating the development of efficient computational alternatives, including machine learning-based methods. However, conventional machine learning methods often rely on manually engineered features that require substantial domain expertise. In this study, we propose a two-stage framework to address these limitations. In the first stage, a one-dimensional convolutional neural network autoencoder is used to automatically learn latent representations from protein sequences. The quality of these features is evaluated through reconstruction error, reflecting how accurately the model reconstructs the original sequence. In the second stage, these learned features are combined with amino acid frequency-based features to form a hybrid feature set for predicting protein-protein interactions. A systematic comparison is performed between models trained on frequency features alone and those using a hybrid representation. The comparison showed that incorporating one-dimensional convolutional neural network-derived latent features improved the models' performance in predicting protein-protein interactions. The dataset was split into training, validation, and test sets. Nested cross-validation was employed, with inner loops for hyperparameter tuning and outer loops for model selection. The random forest classifier achieved the best performance, with a mean receiver operating characteristic-area under curve of 0.91 and a test F1-score of 0.87. These results highlight the effectiveness of integrating deep feature learning with ensemble methods for predicting protein-protein interactions and build upon previous work focused on this fundamental problem.
bioinformatics2026-05-18v1GeneFior: A back to basics and transparent multi-tool approach to sequence detection
Dimonaco, N. J.; Lawther, K.Abstract
The detection of sequences of interest, such as antimicrobial resistance genes, directly from genomic and metagenomic sequencing data has become routine, enabled by curated reference databases and rapid in silico sequence search tools. Yet most workflows depend on prior assembly, an inherently lossy process in which a substantial proportion of reads fail to assemble or are collapsed into consensus sequences, causing low-abundance variants and nucleotide-level diversity to be systematically obscured. The tools used to interrogate the resulting assemblies compound this further, clustering reference sequences at arbitrary identity thresholds, imposing hidden parameter defaults, and reducing intermediate alignment evidence to summarised outputs that cannot be critically evaluated or reproduced. Here we present GeneFior, a transparent, multi-tool workflow integrating BLAST, DIAMOND, Bowtie2, BWA, and Minimap2 to search both DNA and protein sequences against any user-supplied reference database. By enforcing gene-centric identity and coverage thresholds at both the read and gene level, GeneFior reduces false positives while retaining sensitivity to genuine, low-abundance variants, including those differing at single-nucleotide resolution. Crucially, by exposing all alignment parameters, preserving intermediate outputs, and generating cross-tool consensus detection matrices, GeneFior makes the influence of tool choice, database selection, and parameter configuration on reported gene profiles directly observable and reproducible.
bioinformatics2026-05-18v1Learning from Drops: AI-Guided Integration of Liquid Biopsy Features in Cancer Studies
Andueza, M.; Villoslada-Blanco, P.; De Dreuille, B.; Alonso, L.; Sabroso-Lasa, S.; Pantel, K.; Alix-Panabieres, C.; Lopez de Maturana, E.; Malats, N.Abstract
Cancer is a major global health issue with rising incidence and mortality. Early detection, tumor characterization, and disease surveillance are crucial for timely and effective treatment, ultimately reducing mortality rates. Liquid biopsy (LB) has emerged as a valuable detection tool offering a non-invasive method to determine tumor-derived biomarkers in body fluids with demonstrated translational potential. To increase biomarker sensitivity, high-throughput sequencing platforms deliver massive volumes of data. Artificial Intelligence (AI) is pivotal in enabling the integration of large and complex data. This contribution aims to assess the current state of integrative AI-based research in the LB field and provide methodological guidance. First, we conducted a PubMed search and found that the literature is sparse in studies integrating LB features, particularly by applying AI. When adopting the latter approach, defining the study objectives is crucial to guide the subsequent methodological aspects, including study design, patient selection criteria, sample size, nature of the LB features, and metadata to collect. Specifically, we propose strategies and tools for data preprocessing, including normalization and batch correction, as well as handling outliers and missing data. Furthermore, we recommend various Machine/Deep Learning approaches for feature selection to ensure model robustness, and we highlight the importance of rigorous internal and external validation of the selected models. Assessing clinical utility and interpretability is often overlooked but fundamental for real-world implementation. In conclusion, we provide the LB scientific community with AI-based methodological guidance to bridge the two fields and enhance the integrative analysis of LB features.
bioinformatics2026-05-17v1A comparative analysis of urinary microbiome identifies putative probiotics
Anand, R.; Sahil, R.; Pandey, R.; Prakash, P.; Misra, H. S.; Maurya, G. K.Abstract
Urinary tract infections (UTIs) are the most prevalent bacterial infections globally, and their management is increasingly challenged by antimicrobial resistance (AMR). Probiotics offer a promising approach to mitigate AMR by competitively excluding uropathogens and enhancing host immunity by producing immune modulators. Despite this potential, key gaps persist between the discovery of uroprotective probiotic strains and optimization of formulations for urinary tract delivery. Here, we analyzed the urinary microbiome of UTI patients and healthy individuals to identify potential probiotic candidates for the prevention and management of UTIs. Publicly available 16S rRNA amplicon sequencing data of the urinary tract were processed using a standardized pipeline for sequence quality assessment, taxonomic assignment, and microbial function prediction. Comparative analysis showed a significant shift in microbial composition between UTI patients and healthy controls. The dominant phyla identified included Acidobacteriota, Actinobacteriota, Bacteroidota, Campylobacterota, Cyanobacteria, Firmicutes, Fusobacteriota, Patescibacteria, Proteobacteria, and Synergistota. Overall differential abundance analysis revealed Escherichia coli as the predominant UTI-associated species, while Lactobacillus crispatus was enriched in healthy samples. Additionally, predictive functional analysis indicated that metabolic pathways associated with beneficial microbes were enriched in the healthy group. Overall, the study highlights the association of distinct urinary microbiome signatures with infection status, which supports L. crispatus as the most promising probiotic for UTI prevention and control.
bioinformatics2026-05-17v1KaryoScope: rapid, alignment-free sequence annotation for the pangenome era
Ranallo-Benavidez, T. R.; Chen, Y.-A.; Potapova, T. A.; Alanko, J. N.; Loucks, H.; Lucas, J.; Human Pangenome Reference Consortium; Guarracino, A.; Puglisi, S. J.; Marchet, C.; Miga, K. H.; Gerton, J. L.; Barthel, F. P.Abstract
The pangenome era is producing long-read sequencing data and complete genome assemblies at a pace that current annotation methods cannot match. Existing tools were each built for a single feature class (repeats, centromeric satellites, or genes) and falter precisely where the genome is most variable and harbours clinically important variation: the centromeres, subtelomeres, and acrocentric short arms. Here we present KaryoScope, an alignment-free method to annotate an assembly at base resolution across any desired feature classes in a single pass, completing in minutes on a standard workstation. Applied to the Human Pangenome Reference Consortium Release 2 assemblies, KaryoScope identifies the SST1 macrosatellite as the recurrent sequence at Robertsonian translocation fusion points, delivers the first pangenome-wide census of D4Z4 macrosatellite structural diversity at the 4q and 10q subtelomeres relevant to facioscapulohumeral muscular dystrophy, and reveals previously uncharacterized centromere structural polymorphism, including chromosome-specific satellite loss and megabase-scale rearrangement validated by fluorescence in situ hybridization. A pre-built KaryoScope database for the human genome is distributed alongside the tool, and additional databases can be built for any reference genome or annotation source. Together, these capabilities bring the most variable regions of the genome within reach for comparative, clinical, and pangenome-scale analysis. KaryoScope is available at https://github.com/barthel-lab/KaryoScope.
bioinformatics2026-05-17v1Conservation of TNF-TNFR Signaling Modality Across Invertebrate
Govindan, M. K.; K, K.; Goswami, M.; Menon, N.; Singh, A.; Srinivasan, S.Abstract
In humans, the signaling mechanisms of the 19 paralogs of the tumor necrosis factor superfamily (TNFSF) and the 29 receptor paralogs of the tumor necrosis factor receptor superfamily (TNFRSF) are extensively characterized because of their therapeutic relevance. The functional expansion of TNFSF in vertebrates from a single ancestral gene through successive duplication events is also well established. However, apart from the first identification of a TNFSF homolog, Eiger (dmEiger), in Drosophila melanogaster in 2002, together with its receptor homologs Wengen (dmWgn) and Grindelwald (dmGrnd), this signaling system has remained largely unexplored in invertebrates. More recently, the implication of an Eiger homolog in Plasmodium resistance in malaria vectors has further highlighted the need for a systematic investigation of this pathway in lower invertebrates. Structural comparison of the dmEiger-dmGrnd complex with the canonical 3:3 ligand-receptor configuration observed in human TNFSF-TNFRSF signaling suggests either conservation of this signaling modality since before the bilaterian split or convergent evolution of a similar architecture in both branches. The recent explosion in high-quality proteomes spanning diverse phyla, together with advances in protein-complex prediction using AlphaFold-multimer, now enables large-scale exploration of ligand-receptor evolution across invertebrates. Here, we analyzed 148 near-complete proteomes spanning major invertebrate phyla and identified 290 TNFSF, 336 wengen (wgn), and 115 grindelwald (grnd) homologs, including homologs from lower invertebrates. 
Structural characterization of 140 selected complexes using AlphaFold and AlphaFold-multimer revealed several key findings: (i) TNFSF and TNFRSF homologs are present in the majority of invertebrate phyla; (ii) the canonical 3:3 ligand-receptor signaling configuration is conserved across invertebrates; (iii) orthologs of 25 out of the 26 genes implicated in TNF signaling pathways are present in lower invertebrates; and (iv) signaling through grnd-like receptors containing a single cysteine-rich domain with a CXXCXXXC signature is the predominant signaling mode in invertebrates and becomes highly prevalent in Arthropoda. We also elaborate a hypothesis on the evolutionary trajectories toward genetically parsimonious signaling by this complex system before functional expansion in vertebrates and species diversification in Arthropoda.
bioinformatics2026-05-16v2