Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
Deciphering context-dependent epigenetic program by network-based prediction of clustered open regulatory elements from single-cell chromatin accessibility
Park, S.; Ma, S.; Lee, W.; Park, S. H.
Abstract
Large cis-regulatory domains, spanning tens to hundreds of kilobases, are pivotal in orchestrating cell-state-specific transcriptional programs that define cellular identity. However, existing single-cell analytical frameworks lack the capacity to identify these higher-order structures, thereby obscuring the coordinated, domain-level epigenetic regulation essential for complex biological processes. To address this, we introduce enCORE, a computational framework that leverages enhancer-enhancer interaction networks to determine Clustered Open Regulatory Elements (COREs) solely from single-cell ATAC-sequencing data. Our approach faithfully recapitulates established hematopoietic hierarchies and resolves lineage-specific regulatory programs by recovering canonical master transcription factors, frequent chromatin interactions, and enrichment of fine-mapped autoimmune disease-associated genome-wide association study (GWAS) variants. In colorectal cancer, enCORE captures tumor-associated H3K27ac landscapes and prioritizes USP7 as a potential therapeutic candidate, supported by in silico perturbation. Collectively, our framework provides a powerful and scalable platform for deciphering the complex epigenetic architectures underlying human development and disease.
bioinformatics 2026-05-20 v8
Early terminated transcripts and missing proteins reflect artifacts in bacterial proteomes
Insana, G.; Martin, M. J.; Pearson, W. R.
Abstract
MMseqs2 clustering was used to examine the uniformity and heterogeneity of proteomes from 20 bacterial species. Using clustering parameters that required 50% sequence overlap, clusters with proteins from 50% of proteomes typically contain proteins from 95% of the proteomes and capture more than 80% of the proteins in an organism. Protein clusters are highly uniform in length; across the 20 bacteria, the median cluster has more than 99% of the proteins at the mode length. While protein lengths in clusters are highly uniform, some clusters contain dozens to hundreds of proteins that are considerably shorter (75%) than the mode length, and a few clusters include proteins that are 133% of the mode length. Most "outlier" proteins are found in fewer than 10% of clusters, and "high-outlier" clusters are over-represented in a small fraction of proteomes. Short-outlier proteins are artifacts; at least 80% of short-outlier genomes contain mode-length copies of the protein in the cluster; 40% of short protein artifacts are produced by sequencing errors (frameshifts and termination codons), while another 40% are produced by initiation codon choice. High "outlier" clusters are concentrated in a small fraction of proteomes, which often have poor Proteome BUSCO fragment scores. As with "short-outlier" proteins, the 5% of proteomes that are excluded from the core (50% participation) cluster set encode the missing protein more than 98% of the time; these proteins were missed because of frameshifts in the genome sequence. MMseqs2 clustering with 50% participation provides robust sets of core bacterial proteins.
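The length-outlier idea in this abstract can be illustrated with a minimal sketch. It assumes that a "short" outlier is a protein at or below 75% of the cluster's mode length and a "long" outlier is one at or above 133% of it; the abstract does not state the exact cut-off rules, so the function name, thresholds, and tie-breaking below are illustrative assumptions rather than the authors' procedure.

```python
from collections import Counter

def classify_by_mode_length(lengths, short_frac=0.75, long_frac=1.33):
    """Label each protein length relative to the cluster's mode length.

    short_frac and long_frac are assumed thresholds, not values taken
    from the authors' code.
    """
    if not lengths:
        return None, []
    counts = Counter(lengths)
    # Mode = most frequent length; ties are broken toward the shorter length.
    mode_len = min(counts, key=lambda n: (-counts[n], n))
    labels = []
    for n in lengths:
        if n <= short_frac * mode_len:
            labels.append("short")
        elif n >= long_frac * mode_len:
            labels.append("long")
        else:
            labels.append("mode-like")
    return mode_len, labels

# Hypothetical cluster of protein lengths (amino acids).
cluster = [300, 300, 300, 301, 210, 405, 300]
mode_len, labels = classify_by_mode_length(cluster)
print(mode_len, labels)
```

In practice the lengths would come from the members of each MMseqs2 cluster; parsing that output is left out here.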
bioinformatics 2026-05-20 v3
Metabarcode and transcriptome datasets of Pinus sylvestris to assess fungal phyllosphere and disease dynamics.
Moore, B.; Perry, A.; Kaur, S.; Crampton, B.; Gurung, A.; Beaton, J.; Smith, V. A.; Morris, J.; Hedley, P. E.; Nemeth, K.; Barber, H.; Cavers, S.; Jones, S.
Abstract
Determining how host-microbiome interactions influence tree disease is critical for understanding forest resilience. Here, we present foliar microbiome ITS2 metabarcoding and transcriptomic datasets from Pinus sylvestris to investigate susceptibility to Dothistroma needle blight (DNB), a globally important foliar disease caused by Dothistroma septosporum. We hypothesised that host genotype shapes foliar microbial communities and their interactions, thereby influencing disease outcomes. Samples were collected from a progeny-provenance field trial in the south of Scotland representing a broad spectrum of disease susceptibilities. The dataset comprises ITS2 metabarcoding samples from 200 genotypes across three timepoints and RNAseq samples from 48 genotypes across two timepoints. Sampling captured key stages of pathogen exposure and disease progression. Both standardised and bespoke protocols were used for nucleotide extraction, sequencing, and quality control, including multiple negative and positive controls. These datasets, available in the European Nucleotide Archive (project accession PRJEB88228), enable analysis of temporal dynamics in foliar fungal communities, host and microbiome transcriptional responses, and genotype-dependent variation in disease susceptibility.
bioinformatics 2026-05-20 v2
Novel 4D tensor decomposition-based approach integrating tri-omics profiling data can identify functionally relevant gene clusters
Taguchi, Y.-h.; Turki, T.
Abstract
Understanding gene expression requires integrating multiple regulatory layers, because transcript abundance does not necessarily correspond to translational activity or protein abundance. Ribosome profiling and proteomics help distinguish increased translation from ribosome stacking or translational buffering, but no de facto standard framework exists for unsupervised integration of transcriptome, translatome, and proteome profiles. Here, we propose a four-dimensional tensor decomposition-based unsupervised feature extraction approach for tri-omics integration. We applied higher-order singular value decomposition to transcriptome, Ribo-seq, and proteome profiles measured under branched-chain amino acid starvation. The resulting singular value vectors captured relationships among the three omics layers, including a component consistent with ribosome stacking, where transcriptome and translatome signals increased while proteome signals decreased, and another consistent with translational buffering, where proteome variation was suppressed despite transcriptome and translatome changes. Gene selection identified 1,781 genes associated with ribosome stacking and 227 genes associated with translational buffering. Enrichment analyses linked the former to translation, post-translational protein modification, RNA polymerase II transcription, cell cycle regulation, endoplasmic reticulum protein processing, ubiquitin-mediated proteolysis, and stress-related pathways, and the latter to ribosome, translation elongation and termination, spliceosome, immune- and stress-related pathways, and ribosomopathy-associated diseases. Robustness analyses indicated that the results were not substantially affected by the duplicated proteome replicate or missing-value handling. Comparison with MOFA+ and mixOmics suggested that our approach more effectively extracted components interpretable as ribosome stacking and translational buffering. 
These results demonstrate that tensor decomposition-based unsupervised feature extraction is useful for identifying functionally relevant gene clusters from tri-omics data.
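As a rough illustration of the higher-order singular value decomposition (HOSVD) mentioned in this abstract, the NumPy sketch below decomposes a small random four-dimensional array. The axis layout (genes, conditions, replicates, omics layers) is a hypothetical assumption, since the abstract does not specify how the tensor is arranged, and the code is a generic HOSVD, not the authors' feature-extraction pipeline.

```python
import numpy as np

def unfold(tensor, mode):
    """Matricize: move `mode` to the front and flatten the remaining axes."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def hosvd(tensor):
    """Return the core tensor and one orthonormal factor matrix per mode."""
    factors = []
    for mode in range(tensor.ndim):
        # Left singular vectors of the mode-n unfolding.
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u)
    core = tensor
    for mode, u in enumerate(factors):
        # Mode-n product with u.T projects that axis onto its singular vectors.
        core = np.moveaxis(np.tensordot(u.T, core, axes=(1, mode)), 0, mode)
    return core, factors

def reconstruct(core, factors):
    """Invert the decomposition by applying each factor matrix to its mode."""
    out = core
    for mode, u in enumerate(factors):
        out = np.moveaxis(np.tensordot(u, out, axes=(1, mode)), 0, mode)
    return out

# Hypothetical layout: (genes, conditions, replicates, omics layers).
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 3, 2, 3))
core, factors = hosvd(x)
print(core.shape, [u.shape for u in factors])
print(np.allclose(reconstruct(core, factors), x))
```

Each factor matrix holds the singular vectors for one axis; in a setting like the one described, the vectors along the omics axis would be inspected for patterns such as transcriptome and translatome rising while the proteome falls.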
bioinformatics 2026-05-20 v2
Pan1c: a pipeline to easily build chromosome-level pangenome graphs
Mergez, A.; Racoupeau, M.; Bardou, P.; Linard, B.; Legeai, F.; Choulet, F.; Gaspin, C.; Klopp, C.
Abstract
Advances in sequencing technologies and the availability of high-quality genome assemblies for many genotypes per species make it possible to improve sequence alignment rate and quality, as well as variant calling accuracy, by including all genomic variations in a graph reference called a pangenome graph. Because building and analysing a pangenome graph is still complex, and the related software packages are still under development, there is a strong need for user-friendly pipelines in this emerging research area. Pan1C is a pipeline based on a chromosome-by-chromosome graph construction strategy. It integrates two complementary strategies for building pangenomes and produces informative metric plots and graphics using a large set of tools. By benchmarking Pan1C on human, fungal, and wheat assemblies, which span a wide range of genome sizes and complexities, we show the value of Pan1C for assembly and graph validation as well as for performing primary analyses.
bioinformatics 2026-05-20 v2
Synthetic-data augmented calibration for expert-informed rare disease models
Yang, H.; Rachel, T.; Litwin, T.; Karakioulaki, M.; Reimer-Taschenbrecker, A.; Timmer, J.; Has, C.; Binder, H.; Hess, M.
Abstract
Clinical data for rare diseases are sparse, noisy, and heterogeneous, complicating calibration of ordinary differential equation (ODE) models. Thus, we introduce a noise-robust calibration in latent space that combines expert-derived ODEs with learned latent representations. Our approach leverages synthetic ODE trajectories, augmenting our scarce observations to train a model-specific autoencoder representation and imputer. During calibration, observed and ODE-generated trajectories are compared in latent space, and ODE parameters are updated by minimizing their latent distance. In a controlled ABCDE simulation model, the imputer outperformed a carry-forward baseline for moderate parameter shifts, parameter recovery remained stable under random missingness, calibration remained robust to additional noise variables despite reduced downstream identifiability, and distinct dynamics formed visually separable latent trajectories. On a custom developed ODE model for real Epidermolysis Bullosa patients, the calibrated phenomenological model reproduced patient-level trajectories from sparse observations. Thus, we conclude that our latent-space calibration approach supports rare-disease modeling.
bioinformatics 2026-05-20 v1
Widespread use of invalid statistical tests in biomedical machine learning
Zeng, T.; Li, H.; Zhang, S.; Tan, Y. Q.; Tian, F.; Orban, C.; An, L.; Che, W.; Cheng, J.; Chong, J. S. X.; Dehestani, N.; Dong, Z.; Li, X.; Li, Z.; Lim, M. J. R.; Lin, Y.; Ling, Q.; Ling, Z.; Low, X. Z.; Mansour L., S.; Ng, K. K.; Nguyen, T. T.; Ooi, L. Q. R.; Pande, S.; Qian, X.; Ruan, J.; Wang, Z.; Xie, Y.; Zhang, C.; Zhang, Y.; Patil, K.; Parkes, L.; Dhamala, E.; Chopra, S.; Zalesky, A.; Holmes, A.; Eickhoff, S.; Zhou, J. H.; Renaud, O.; Dosenbach, N.; Kording, K. P.; Bzdok, D.; Nichols, T.; Yeo, B. T. T.
Abstract
Machine learning is accelerating biomedical research. Cross-validation is widely used to compare predictive performance -- not only to benchmark algorithms, but also to inform scientific applications, such as ranking biomarkers. However, prediction performance estimates across cross-validation folds are not independent. Standard tests for comparing prediction performance (e.g., paired t-test) assume independence and can therefore inflate false positive rates. In a PRISMA-guided meta-analysis of 210 studies (impact factor ≥15, 1 June 2020 - 1 June 2025), we find that 97% ignored fold dependence when comparing prediction performance. This problem is ubiquitous across scientific fields and unaffected by impact factor, rigor-promoting policies, or open science practices. Simulations across 420 scenarios spanning four diverse datasets show that ignoring fold dependence leads to invalid false positive control in most settings. Repeated cross-validation further compounds this problem, with false positive rates rising toward 100% as the number of repetitions grows. Existing fold-dependence-aware tests rely on strong assumptions because the variance of fold-level statistics and the between-fold correlation cannot be disentangled under standard cross-validation. We therefore propose the SHARP (Split-HAlf RePeated) test, a simple modification to standard cross-validation that enables direct estimation of variance and correlation. Benchmarked against 12 tests, SHARP provides the best overall balance of false-positive control, statistical power, and confidence-interval calibration across simulation schemes. We conclude by providing best practices and reporting guidelines for valid model comparison inference in biomedical machine learning and beyond.
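To make the fold-dependence issue concrete, the sketch below compares the naive paired t statistic with the corrected resampled t statistic of Nadeau and Bengio, which inflates the variance term by the test-to-train size ratio. This is a swapped-in, previously published correction used only for illustration: it is not the SHARP test, and whether it is among the 12 benchmarked tests is not stated in the abstract. The per-fold differences are made-up numbers.

```python
import math
from statistics import mean, variance

def paired_t_statistics(diffs, n_train, n_test):
    """Return (naive_t, corrected_t) for per-fold performance differences.

    naive_t treats folds as independent; corrected_t applies the
    Nadeau-Bengio variance correction (1/J + n_test/n_train).
    Both are referred to a t distribution with J - 1 degrees of freedom.
    """
    j = len(diffs)
    d_bar = mean(diffs)
    s2 = variance(diffs)  # sample variance (divides by J - 1)
    naive_t = d_bar / math.sqrt(s2 / j)
    corrected_t = d_bar / math.sqrt((1.0 / j + n_test / n_train) * s2)
    return naive_t, corrected_t

# Hypothetical accuracy differences (model A minus model B) over 5 folds
# of a 100-sample dataset: 80 training and 20 test samples per fold.
diffs = [0.02, 0.01, 0.03, 0.015, 0.025]
naive_t, corrected_t = paired_t_statistics(diffs, n_train=80, n_test=20)
print(f"naive t = {naive_t:.3f}, corrected t = {corrected_t:.3f}")
```

The corrected statistic is always smaller in magnitude than the naive one, which is the direction needed to counter the inflated false positive rates described above.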
bioinformatics 2026-05-20 v1
Mapping Tumor-Microenvironment dependencies with TMEformer: A spatial foundation framework enabling in silico perturbation
Chen, S.; Zhu, G.; Yang, L.; Li, S.; Liu, P.; Chen, Q.; Tang, Y.; Luo, J.; Huang, L.; Chen, B.; Ou, S.; Jiang, J.
Abstract
Despite the fundamental role of spatial context in driving tumor progression, most current computational models for virtual perturbation have largely overlooked its importance. Here, we introduce TMEformer, a tumor microenvironment-aware deep learning framework that leverages high-resolution spatial transcriptomics to jointly model intrinsic tumor cell programs and local microenvironmental signals by explicitly incorporating spatial architecture. Validated across diverse tumor spatial transcriptomic cohorts, TMEformer enables virtual perturbations that capture functional dependencies within local cellular ecosystems. Despite being trained on cancer-specific spatial datasets, TMEformer outperforms baseline models pretrained on large-scale corpora in capturing key tumor transitions, including lineage plasticity and the emergence of therapy resistance. Systematic perturbation analyses prioritize tumor-intrinsic transcription factors and TME-derived ligands that drive disease progression, recovering established regulators and revealing novel candidates. Furthermore, TME-derived embeddings improve the spatial stratification of tumor cells and align more closely with pathological architecture. Together, TMEformer establishes a general framework for modeling tumors as spatially coupled, perturbable ecosystems.
bioinformatics 2026-05-20 v1
Static2Dynamic: Reconstructing videos of unobservable cellular, developmental, and disease processes
Boyer, T.; Del Nery, E.; Spassky, N.; Genovesio, A.
Abstract
A fundamental limitation in biology is that many of its most important processes unfold as visual dynamics that cannot be directly observed. Development, tissue remodeling, and disease progression often occur deep in living organisms, over extended timescales, and at cellular resolution beyond the reach of current live imaging technologies. As a result, much of biology remains accessible only through static snapshots, while the underlying phenotypic trajectories and visual transformations remain hidden. Here, we introduce Static2Dynamic, a general framework to reconstruct unseen biological dynamics from sets of cross-sectional image data. Starting from time-unpaired static samples, Static2Dynamic first recovers a continuous pseudotime for individual images in a time-discriminative deep representation space, then learns a generative model of images conditionally to the underlying process, and finally reconstructs temporally coherent videos initialized from real samples. This makes it possible to infer past and future visual states of a static image and to simulate complete trajectories of cellular, developmental, and disease processes that were never directly recorded. We quantitatively validate Static2Dynamic on two large-scale experimental microscopy video datasets generated specifically for benchmarking, enabling direct comparison of inferred pseudotime trajectories and reconstructed videos against ground-truth biological dynamics. We further show that the framework generalizes across biological scales, organisms, and imaging modalities, including processes inaccessible to continuous live observation. More broadly, Static2Dynamic establishes the foundations of pseudotime microscopy, a new paradigm for reconstructing the visual and temporal dynamics of biological processes directly from static imaging data, thereby expanding the observable space of living systems beyond current experimental limits.
bioinformatics 2026-05-20 v1
Using Mapping-Profiles to Refine Strain-Level Metagenomic Classification
Lipovac, J.; Angevin, L.; Krizanovic, K.
Abstract
Metagenomic classification at the strain level remains challenging due to high sequence similarity among closely related genomes, which leads to ambiguous read mappings and frequent false-positive strain detections. Reducing such errors improves the reliability of strain-level analyses, which is critical for applications such as pathogen detection. We introduce StrainRefine, a post-mapping refinement method that analyzes read-reference mapping profiles to resolve ambiguous assignments among highly similar genomes. The method represents candidate reference genomes using binary profiles that capture read-support patterns and measures similarity between references based on profile overlap. The method clusters references based on similar mapping profiles, filters weakly supported genomes, and reassigns reads to representative references, reducing redundant reporting of near-identical strains. StrainRefine substantially reduces false-positive strain detections while preserving recall and improving agreement between predicted and true abundance profiles. On large-scale metagenomic datasets, it achieves a substantially improved precision-recall balance compared to existing mapping-based approaches, with the standalone method obtaining the highest read-level classification accuracy on the most complex evaluated dataset. Unlike many strain-level tools designed for individual species, StrainRefine operates without prior assumptions about sample composition or curated species-specific reference collections, while still achieving comparable performance in single-species settings on species-specific reference databases. These results highlight mapping-profile similarity as an effective signal for improving strain-level metagenomic classification.
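The profile-based refinement described in this abstract (binary read-support profiles, overlap-based similarity, clustering, filtering, and reassignment) can be sketched in a few lines. Everything specific here is an assumption made for illustration: the use of Jaccard similarity, the greedy clustering order, the threshold values, and all function names are hypothetical and are not taken from StrainRefine.

```python
def jaccard(a, b):
    """Overlap between two read-support sets (1.0 = identical profiles)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_references(profiles, min_reads=2, min_similarity=0.75):
    """Group references with similar mapping profiles.

    profiles: dict mapping reference name -> set of read IDs mapped to it.
    Returns a dict mapping each representative -> list of cluster members.
    """
    # Filter out weakly supported references.
    kept = {ref: reads for ref, reads in profiles.items() if len(reads) >= min_reads}
    # Visit references from most to least supported (name breaks ties).
    order = sorted(kept, key=lambda r: (-len(kept[r]), r))
    clusters = {}
    for ref in order:
        for rep in clusters:
            if jaccard(kept[ref], kept[rep]) >= min_similarity:
                clusters[rep].append(ref)
                break
        else:
            # No sufficiently similar representative yet: start a new cluster.
            clusters[ref] = [ref]
    return clusters

def reassign_reads(profiles, clusters):
    """Assign each read to the representative of the first cluster it supports."""
    assignment = {}
    for rep, members in clusters.items():
        for ref in members:
            for read in profiles[ref]:
                assignment.setdefault(read, rep)
    return assignment

# Hypothetical mapping result: two near-identical strains, one distinct
# strain, and one reference supported by a single read.
profiles = {
    "strainA": {"r1", "r2", "r3", "r4", "r5"},
    "strainA2": {"r1", "r2", "r3", "r4"},
    "strainB": {"r6", "r7", "r8"},
    "strainC": {"r9"},
}
clusters = cluster_references(profiles)
print(clusters)
print(reassign_reads(profiles, clusters))
```

With these toy inputs the two near-identical strains collapse into one reported reference, which mirrors the goal of reducing redundant reporting of near-identical strains.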
bioinformatics 2026-05-20 v1
HiCP2GAN: A Plug and Play Foundation Model-based GAN for Hi-C Enhancement
Olowofila, S.; Oluwadare, O.
Abstract
The three-dimensional organization of chromatin shapes gene regulation and cellular function. Hi-C has emerged as the primary technique for mapping chromatin interactions genome-wide, yet high-resolution data remain costly and scarce, leaving many studies with sparse contact maps that limit downstream analysis. Deep learning methods, especially generative adversarial networks (GANs), have shown promise for enhancing low-resolution Hi-C data. Most existing GAN-based approaches, however, rely on custom discriminators trained from scratch, which can yield unstable training and limited generalization. Hi-C foundation models pretrained on large-scale data capture rich, transferable representations of chromatin structure, but their use as discriminators within adversarial enhancement frameworks has not been explored. In this work, we introduce HiCP2GAN, a plug-and-play GAN that employs a pretrained Vision Transformer-based Hi-C foundation model as its discriminator. The discriminator was pretrained on 118 million Hi-C patches across diverse species and cell types, providing biologically meaningful gradients for adversarial supervision. The HiCP2GAN framework is generator-agnostic: any compatible Hi-C resolution enhancement architecture can serve as the generator, enabling fair comparison across methods. We adapted the encoder phase of the foundation model as the discriminator backbone and experimented with fine-tuning different numbers of layers from the input while freezing the deeper transformer layers. Fine-tuning the first few layers while freezing the rest preserved pretrained knowledge while allowing task-specific adaptation. Experiments on human cell lines show that HiCP2GAN consistently improves resolution over standalone generators and conventional GAN-based models, while serving as a plug-and-play framework for most non-GAN generator models. HiCP2GAN is publicly available at https://github.com/OluwadareLab/HiCP2GAN.
bioinformatics 2026-05-20 v1
Deciphering interaction syntax via decoupling intrinsic lineages and niche pressure
Guo, Q.; Zhong, W.; Zeng, Z.; Nie, Q.; Zhou, P.; Zhang, L.
Abstract
Spatial transcriptomics enables the mapping of gene expression within intact tissues, yet a fundamental gap remains between knowing where cells are and understanding how they interact. A cell's measured transcriptome reflects both its intrinsic lineage identity and niche pressure. Here we introduce TRINUS, a self-supervised model that deciphers interaction syntax by generative decoupling of a cell's intrinsic lineage identity from the extrinsic niche pressure. TRINUS maintains a library of context-free cell prototypes to isolate lineage identity while modeling cooperative interaction dependencies among neighbors. We validated TRINUS on synthetic datasets with known interaction logic and benchmarked it against existing methods, where it showed superior performance in cell clustering and spatial domain detection. Applied across diverse platforms and biological systems, TRINUS resolves multi-level interaction syntax and maps tissue-wide interaction patterns in colorectal cancer, and identifies stage-specific signaling dependencies and time-dependent receptor windows during mouse organogenesis. We also show TRINUS's bidirectional in silico engineering capability in the ovarian tumor microenvironment, where forward perturbation revealed subtype-specific macrophage immunosuppressive programs via virtual transplantation and inverse design identified molecular modifications in macrophages predicted to rescue adjacent T-cell function. Collectively, TRINUS provides a practical tool for interaction syntax discovery and predictive tissue engineering on spatial transcriptomics data.
bioinformatics 2026-05-20 v1
The ATLAS Penalty: Auxiliary-Transformed Location-Aware Smoothing with Applications to Spatial Transcriptomics
Tang, Q.; Chi, E. C.; Wang, W.
Abstract
We address the problem of fitting a collection of location-specific models under a spatial smoothness assumption. Existing approaches penalize roughness in the model parameters directly, an assumption that breaks down when smoothness is a function of parameters and auxiliary covariates rather than the parameters themselves. Our framework, the Auxiliary-Transformed Location-Aware Smoothing (ATLAS) penalty, generalizes spatial smoothness by penalizing roughness in transformations of model parameters using auxiliary information. As a concrete case study, we develop a spatially smooth deconvolution model for spatial transcriptomics that estimates tumor mixing coefficients from thousands of spots distributed on a single tissue slide. To handle the computational challenges posed by the nonlinear likelihood, nonsmooth nonconvex penalty, and spatially coupled estimation, we propose an alternating direction method of multipliers (ADMM) algorithm. Through simulation studies, we demonstrate that our framework provides substantially better spatial domain detection than approaches that smooth model parameters directly, with particularly strong gains when auxiliary covariates carry calibrated spatial structure.
bioinformatics 2026-05-20 v1
Predicting and Elucidating Peptide Retention Mechanisms with Graph Attention Networks
Kensert, A.; Hruzova, K.; Devreese, R.; Nameni, A.; Declercq, A.; Gabriels, R.; Martens, L.; Bouwmeester, R.; Urban, J.
Abstract
Liquid chromatography (LC) is a key technology in bottom-up proteomics, separating proteolytic peptides to decrease sample complexity, enhance coverage, and increase the robustness of protein identification and quantification. Although high-resolution mass spectrometry has advanced significantly, comparable progress in LC has lagged, primarily due to a limited understanding of peptide-column interactions. To bridge this knowledge gap, we introduce a novel deep learning model (PeptideGNN) based on a Graph Neural Network (GNN) architecture to model and elucidate peptide behaviors across various separation conditions. The model was trained to predict peptide retention times on ten diverse proteomic datasets, and a saliency mapping technique was then applied to interpret the underlying retention mechanisms. Our model consistently outperformed existing retention-time predictors across multiple datasets, while the saliency mapping revealed insights into peptide-stationary phase interactions, highlighting the effects of neighboring amino acids, post-translational modifications (PTMs), chromatographic columns, and mobile phase additives on peptide retention.
bioinformatics 2026-05-20 v1
Decoding heterogeneous aging clocks and disease risk stratification using a metabolomic foundation model
Xu, Y.; Zou, B.; Xie, G.; Jia, W.; Zhang, L.
Abstract
Metabolomic aging clocks estimate biological age by modeling metabolite concentrations, thereby capturing aging signals from healthspan and adverse outcomes. However, existing clocks generally assume homogeneous aging trajectories and yield only a single age acceleration metric, limiting their capacity to capture inter-individual metabolic heterogeneity and characterize nuanced individual-level representations. To address these limitations, we proposed MetFoundation, a metabolomic foundation model pre-trained on nuclear magnetic resonance (NMR) metabolomic profiles from over 430,000 participants in UK Biobank via self-supervised learning. This large-scale pre-training enables MetFoundation to learn a metabolomic representation space that captures the complex, nonlinear structure of systemic metabolism as reflected in NMR data. Building on MetFoundation, we developed a mortality-informed metabolomic aging clock by fine-tuning an attached survival module, deriving age acceleration that demonstrates significant associations with multiple age-related diseases and factors. More importantly, we utilized embeddings generated by MetFoundation to identify metabolic subtypes, resulting in 13 distinct subtypes with differential susceptibility profiles for major age-related diseases, particularly dementia and diabetes. This finding empirically demonstrated profound metabolic heterogeneity across populations, persisting even at comparable levels of age acceleration. To enhance clinical applicability, we further employed contrastive learning to distill a lightweight model that approximates the learned metabolomic representation space using only routine blood test measurements as inputs. 
Both hold-out testing within UK Biobank and external validation in the China Health and Retirement Longitudinal Study replicated similar disease onset patterns across the identified subtypes, underscoring the robust generalizability of MetFoundation and the translational potential of the discovered metabolic subtypes.
bioinformatics 2026-05-20 v1
Therapeutic Relevance of NLPA Lipoprotein to Combat Biofilm-Associated Infection in Acinetobacter baumannii
Brahma, V. U.; Munagalasetty, S.; Bhandari, V.
Abstract
Acinetobacter baumannii is a leading multidrug-resistant, critical-priority pathogen in healthcare settings, where biofilm formation confers survival and antibiotic tolerance. Targeting virulence-associated proteins offers an alternative to conventional bactericidal strategies. Here, the inner-membrane-anchored lipoprotein NLPA, implicated in biofilm-associated adaptation, was studied as a putative anti-virulence target using an integrated in silico pipeline, with in vitro experiments complementing the computational findings. The AlphaFold-derived structure of NLPA served as the basis for virtual screening of approximately 1.6 million compounds, with subsequent prioritization guided by MM/GBSA-calculated binding free energies to highlight the most promising candidates. Molecular dynamics simulations demonstrated stable NLPA-ligand complexes, as indicated by equilibrated RMSD, low residue fluctuations in the binding region, and persistent interaction networks over time. Pharmacokinetic evaluation indicated that the compounds satisfied Lipinski's Rule of Five and had overall acceptable ADMET characteristics. Two compounds, NLPA-6 and NLPA-3, showed the most favourable predicted binding free energies, suggesting strong and stable interactions within the NLPA binding site. NLPA-3 was evaluated in vitro against A. baumannii to validate the computational outcomes. The compound displayed moderate antibacterial activity with a MIC of 125 mcg/mL and demonstrated 55.75% inhibition of biofilm formation at 4x MIC. In addition, in macrophage infection studies, NLPA-3 decreased intracellular bacterial survival to 19.25% at 50 mcg/mL, suggesting that it may disrupt virulence pathways linked to persistence. Overall, these findings identify promising NLPA-targeting compounds and support the feasibility of NLPA as an anti-virulence target.
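For readers unfamiliar with the Lipinski Rule of Five check mentioned in this abstract, here is a minimal sketch using the standard thresholds (molecular weight at most 500 Da, logP at most 5, at most 5 hydrogen-bond donors, at most 10 hydrogen-bond acceptors). The descriptor values in the example are invented placeholders, not properties of NLPA-3 or NLPA-6, and the function name is hypothetical; in practice the descriptors would be computed with a cheminformatics toolkit.

```python
def lipinski_violations(mol_weight, logp, h_donors, h_acceptors):
    """Return the list of Rule-of-Five criteria that a compound violates."""
    violations = []
    if mol_weight > 500:
        violations.append("molecular weight > 500 Da")
    if logp > 5:
        violations.append("logP > 5")
    if h_donors > 5:
        violations.append("more than 5 hydrogen-bond donors")
    if h_acceptors > 10:
        violations.append("more than 10 hydrogen-bond acceptors")
    return violations

# Hypothetical descriptor values for two made-up compounds.
compounds = {
    "compound_1": dict(mol_weight=350.4, logp=2.1, h_donors=2, h_acceptors=5),
    "compound_2": dict(mol_weight=612.0, logp=5.6, h_donors=2, h_acceptors=5),
}
for name, descriptors in compounds.items():
    found = lipinski_violations(**descriptors)
    print(name, "passes" if not found else f"violates: {found}")
```

Returning the list of violations, rather than a single pass/fail flag, leaves it to the caller to decide how many violations are tolerated.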
bioinformatics 2026-05-20 v1
Counterfactual Explanations for Graph Neural Networks in Patient Outcome Prediction
Chaidos, N.; Dimitriou, A.; Calzi, H.; Casiraghi, E.; Stamou, G.; Valentini, G.
Abstract
Counterfactual Explanation (CE) algorithms have been successfully applied to uncover the main factors driving computational diagnostic and prognostic predictions on tabular medical data. Recently, a new Network Medicine paradigm has been introduced for patient diagnosis and prognosis using Patient Similarity Networks (PSNs), i.e. graphs where patients are represented as nodes and their clinical and biomolecular similarities as edges. In this context, graph-based algorithms, including Graph Neural Networks (GNNs), can provide predictions using not only individual patient features but also their relations within a network of clinically and biomolecularly similar individuals. In this work, we propose the first CE algorithm tailored to explain diagnostic and prognostic predictions within PSNs. Alongside a contrastive GNN backbone, we introduce a versatile, model-agnostic counterfactual search method compatible with any underlying classifier. Preliminary results on synthetic data and on a cohort of patients affected by Alzheimer's disease show that our algorithm is competitive both with seminal tabular-based CE algorithms and with GNNExplainer, a well-established method for explaining graph-based classification tasks.
bioinformatics 2026-05-20 v1
Shiny AMMOA: an interactive platform for integrative multi-omics analysis of murine aging
Ninomiya Kanda, M.
Abstract
Aging is accompanied by complex, tissue-specific molecular changes across multiple biological layers, yet integrative analysis of multi-omics datasets remains challenging for many experimental researchers due to technical and computational barriers. Here, I present Shiny Aging Murine Multi-Omic Analyzer (Shiny AMMOA), a graphical user interface (GUI)-based, user-friendly analytical platform that enables interactive exploration of murine aging-associated bulk transcriptomic, proteomic, and metabolomic datasets. Shiny AMMOA integrates publicly available multi-omics resources within a unified R Shiny framework and supports end-to-end analyses, including differential expression testing, pathway enrichment analysis, and pathway-level visualization across individual and multiple omics layers. Using representative use cases, I demonstrate that Shiny AMMOA recapitulates key findings from original source studies and facilitates intuitive discovery of tissue-, pathway-, and modality-specific aging signatures, including age-associated alterations in unfolded protein response, extracellular matrix organization, and metabolic pathways across specific tissues and omics layers. The platform further enables integrated visualization of molecular changes across omics layers on Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway diagrams, supporting hypothesis generation at the systems level. By democratizing access to integrative multi-omics analysis while preserving analytical rigor, Shiny AMMOA provides an extensible resource for experimental biologists and aging researchers to interrogate large-scale public datasets, prioritize biological pathways, and accelerate translation of multi-omics insights into testable experimental hypotheses. Shiny AMMOA is available at https://github.com/M-Ninomiya-Kanda/Shiny_AMMOA_local, and a lightweight web-based demonstration version with limited functionality is available at https://m-ninomiya-kanda.shinyapps.io/shiny_ammoa_web/.
bioinformatics2026-05-20v1An 8-Gene Bevacizumab Resistance Signature Predicts Prognosis and Reveals Immunosuppressive Microenvironment in Colorectal Cancer
Niu, Z.; Qiu, D.; Xu, P.Abstract
Background: Bevacizumab resistance severely limits long-term efficacy in metastatic colorectal cancer (CRC). This study aimed to develop and validate a bevacizumab-resistance-associated gene signature for prognosis prediction and immune microenvironment characterization in CRC. Methods: Two GEO datasets (GSE19862, GSE86582) with bevacizumab response data and TCGA COAD/READ RNA-seq data were analyzed. Overlapping differentially expressed genes (DEGs) linked to both CRC progression and bevacizumab resistance were identified. An 8-gene signature (AXIN2, PSORS1C1, KRT74, SLC2A3, STIL, IL33, GALNT6, HSD11B2) was constructed via univariate Cox and LASSO Cox regression. Results: In the TCGA cohort, high-risk patients had shorter overall survival (OS; log-rank P < 0.0001). Time-dependent ROC yielded 1-year AUC = 0.638, 3-year AUC = 0.657, and 5-year AUC = 0.757. Multivariate Cox regression confirmed the risk score as an independent prognostic factor. External validation in GSE39582 (optimal cutoff = -1.49) replicated these findings: high-risk patients had inferior OS (P = 0.0016) with acceptable 1-, 3-, and 5-year AUCs and retained independent prognostic value (HR = 1.634, P = 0.00415). CIBERSORT and ESTIMATE analyses showed that the high-risk group was characterized by increased M2 macrophages and neutrophils, higher immune and stromal scores, and reduced activated memory CD4 T cells, monocytes, and activated dendritic cells (all P < 0.05). GSEA highlighted enrichment of TNF/NF-κB, IL-6/JAK/STAT3, and immune checkpoint pathways in the high-risk group. AXIN2 (HR = 0.829, P = 0.032) was an independent protective factor, while PSORS1C1 (HR = 1.356, P = 0.048) was an independent risk factor. Conclusion: The 8-gene bevacizumab resistance signature robustly predicts prognosis and reflects an immunosuppressive microenvironment closely linked to bevacizumab failure in CRC. These findings provide novel insights into immune-mediated resistance and support clinical risk stratification.
bioinformatics2026-05-20v1NANOTAXI: A Shiny-Based GUI for Real-Time Classification and Analysis of 16S rRNA Nanopore Reads
Mahar, N. S.; Chouhan, K.; Gupta, I.Abstract
Real-time taxonomic classification of nanopore amplicon sequencing data enables rapid insights into microbial communities, with applications in clinical diagnostics, environmental monitoring, and outbreak surveillance. However, bridging the gap between long-read data and interpretable results often requires specialised bioinformatics expertise. There remains a need for integrated, user-friendly software that combines live data acquisition with downstream microbiome analysis. Here we present NANOTAXI, a fully automated Shiny-based GUI for the classification of barcoded 16S rRNA gene sequences generated by Oxford Nanopore sequencing. The platform supports four taxonomic classifiers, integrated with five reference databases, enabling flexible selection of classification strategies based on user requirements and available computational resources. In addition to real-time monitoring, NANOTAXI performs cohort-level analyses, including alpha and beta diversity, ordination, differential abundance testing, and functional inference using PICRUSt2. Validation using barcoded synthetic communities comprising pooled genomic DNA from clinically relevant bacterial species and the ZymoBIOMICS mock community demonstrated that NANOTAXI generated biologically coherent taxonomic and functional profiles. Benchmarking revealed clear trade-offs between computational performance and taxonomic specificity. Emu provided the lowest observed species-level false-positive rate, whereas Kraken2 offered the fastest classification and enabled continuous near-real-time monitoring across all tested databases.
bioinformatics2026-05-20v1Dual-Stream Compression of High Bit-Depth Medical Images with Application to DNA Storage
Su, H.; Fan, W.; Peng, J.; Zhang, Y.Abstract
High bit-depth medical images preserve subtle intensity variations that are important for quantitative analysis and clinical interpretation, but their large dynamic range poses challenges for efficient compression. We propose a bit-plane-aware dual-stream compression framework for 16-bit medical images by separately modeling the most significant bit (MSB) and least significant bit (LSB) components. The MSB structural stream is encoded using JPEG coding with a Duplicate Segment Skipping (DSS) strategy to exploit spatial and segment-level redundancy, while the LSB detail stream is compressed using learned image compression to represent residual variations and fine-grained details. Experiments on four MRI and CT datasets show that the proposed method consistently outperforms representative traditional and learning-based codecs, achieving the lowest bit rate across all datasets. Meanwhile, it preserves high reconstruction fidelity and maintains local intensity profiles and downstream segmentation consistency. As a downstream application, we further demonstrate that the compressed bitstreams can be effectively integrated with DNA encoding and converted into sequences with favorable biochemical properties.
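The MSB/LSB separation described in the abstract above can be illustrated with a minimal sketch. It assumes an even 8/8 split of each 16-bit pixel, which the abstract does not specify, and the function names `split_bit_planes` and `merge_bit_planes` are hypothetical. The JPEG, DSS, and learned-compression stages are not reproduced here; the sketch only shows that the two streams together retain every bit of the original image.

```python
def split_bit_planes(pixels, lsb_bits=8):
    """Split each 16-bit pixel into a high-order (MSB) and low-order (LSB) part.

    lsb_bits=8 is an assumption made for illustration only.
    """
    mask = (1 << lsb_bits) - 1
    msb = [[p >> lsb_bits for p in row] for row in pixels]
    lsb = [[p & mask for p in row] for row in pixels]
    return msb, lsb


def merge_bit_planes(msb, lsb, lsb_bits=8):
    """Recombine the two streams into the original 16-bit pixel values."""
    return [
        [(m << lsb_bits) | l for m, l in zip(msb_row, lsb_row)]
        for msb_row, lsb_row in zip(msb, lsb)
    ]


# Toy 2x3 "image" with 16-bit intensities.
image = [[0, 255, 256], [4095, 40000, 65535]]
msb, lsb = split_bit_planes(image)
restored = merge_bit_planes(msb, lsb)
```

In this toy case the MSB stream carries the coarse structure (values 0 to 255) and the LSB stream carries the fine residual, so each stream can be handed to a different codec.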
bioinformatics2026-05-20v1VX: an AI-enabled desktop genome viewer and transcriptome browser with a programmable analysis framework
Shirokikh, N. E.; Cleynen, A.Abstract
Background. Genome and transcriptome browsers are central to the interpretation of high-throughput sequencing data, but today's tools assume a human operator at a graphical interface and offer only limited programmability. As large-language-model assistants become routine in bioinformatics (MCP 2024), this creates a bottleneck: agents cannot observe the visual state of the browser or drive it through the same interface as the human user, and analyses remain fragmented across a separate ecosystem of external tools. Transcript-coordinate data, produced by ribosome profiling (Ingolia 2012) and direct RNA sequencing (Garalde 2018), is also awkwardly supported in chromosome-oriented viewers. Results. We present VX, a desktop genome and transcriptome viewer written in D, using GTK 3 and OpenGL, that handles genome-scale and transcriptome-scale data in a unified interface. VX exposes its full functionality through an embedded HTTP API on the loopback interface and a Model Context Protocol server of currently thirty-nine tools, so that scripts and LLM agents can load data, navigate, manage tracks, run analyses, and capture figures through the same contract used by the GUI. An integrated analysis framework provides more than fifty analyses and includes signal processing and peak calling, quantification, variant analysis, alignment statistics, interaction and cross-track comparisons, all with an explicit four-level scope hierarchy running from viewport to whole dataset; results are written to disk and, where appropriate, added as new tracks. Additional features include a magnifier popup for base-resolution inspection (Alt+hover), chromosome-alias resolution across UCSC, Ensembl, and NCBI conventions, viewport video recording via an ffmpeg pipe, and INI-based configuration. Conclusions. VX complements existing desktop and web browsers by providing a native agent-control layer, an integrated analysis framework, and first-class transcript-space handling. The binary is freely available for non-commercial use; the HTTP API and MCP protocol are fully specified in this article, so third-party clients can be written independently of the core implementation.
bioinformatics2026-05-20v1LAMPP: A benchmark for continuous evaluation of host phenotype prediction from shotgun metagenomic data
Barak, N.; Bhattacharya, H.; Asnicar, F.; Sung, J.; Segata, N.; Yassour, M.Abstract
Predicting host phenotypes from shotgun metagenomic data is essential for translating microbiome research into clinical practice. Despite the development of numerous computational tools for this task, researchers often default to traditional machine learning methods such as Random Forest. This hesitancy to adopt newer methods stems from their complexity as well as the lack of standardized evaluations, as most tools are assessed on different datasets and compared against a limited set of methods. Here, we introduce LAMPP, a standardized benchmark for evaluating methods for predicting host phenotypes from gut metagenomic data. LAMPP features a diverse range of prediction tasks and enables consistent, comparative assessments across prediction tools. Our systematic evaluation of existing tools shows that classic machine learning methods (e.g., Random Forest) perform competitively, offering both ease of use and state-of-the-art results. At the same time, it demonstrates that microbiome-based phenotype prediction remains a challenging problem. By providing a consistent platform for ongoing evaluation and access to raw sequencing data, LAMPP motivates the development of novel prediction pipelines from raw sequencing data to phenotype prediction, including novel sample representation and data augmentation strategies. LAMPP is publicly available for ongoing benchmarking at https://lampp.yassourlab.com/.
bioinformatics2026-05-19v3Systematic cross-study assessment of RNA-Seq experimental workflows for plasma cell-free transcriptome profiling
Tuni-Dominguez, C.; Asole, G.; Monteagudo-Mesas, P.; Rusu, E. C.; Cabus, L.; Gonzalez, L.; Sanchez, L.; Neto, B.; Sanders, P.; Weber, M.; Lagarde, J.Abstract
Plasma cell-free RNA (cfRNA) is a promising source of non-invasive biomarkers, but its clinical translation is hindered by technical challenges and a lack of protocol standardization, which compromises reproducibility and comparability across studies. There is a need for a systematic evaluation of existing cfRNA-Seq workflows to understand the drivers of technical variability. Here, we address this gap by performing a comprehensive cross-study analysis of 2,166 cfRNA-Seq samples from 15 published studies and an in-house generated dataset, applying a uniform bioinformatics pipeline to enable a controlled comparison of experimental workflows. Our analysis reveals that the donor phenotype typically explains a negligible fraction of the transcriptomic variation, whose main determinants are technical, principally protocol choice, genomic DNA contamination levels and library diversity. Remarkably, this technical noise is so profound that variation within plasma cfRNA samples exceeds that found across a wide range of human tissues. Furthermore, we demonstrate that critical pre-analytical factors are often confounded with patient phenotypes, jeopardizing the validity of biomarker discovery efforts. Finally, we identify a 100 bp fragment-length threshold as a vital requirement for reliable cfRNA-based taxonomic profiling. Our work serves as a comprehensive benchmark of current cfRNA-Seq methodologies and provides evidence-based guidelines to improve experimental design. By highlighting the dominance of controllable technical factors, we offer a path towards more robust and reproducible cfRNA research.
bioinformatics2026-05-19v3AlphaFold 3 Fails to Predict D-peptide Chirality, Fold, and Binding Pose in Heterochiral Complexes
Childs, H.; Zhou, P.; Donald, B. R.Abstract
Due to their favorable therapeutic properties, including improved stability, bioavailability, and membrane permeability, D-peptides that bind biological L-proteins represent an important class of systems in computational drug design. A reliable in silico workflow for these systems must correctly preserve stereochemistry while predicting fold and binding pose. The AlphaFold 3 (AF3) model reported by Abramson et al. (2024) enforces a strict chirality violation penalty to maintain chiral centers from model inputs and is reported to have a low chirality violation rate of only 4.4% on a PoseBusters benchmark containing diverse chiral molecules. Herein, we report the results of 3,255 black-box experiments with AF3 to evaluate its ability to predict the fold, chirality, and binding pose of D-peptides in heterochiral complexes. Despite inputs specifying explicit D-stereocenters, we report that the AF3 chirality violation rate for D-peptide binders is much higher at 51% across all evaluated predictions; on average the model is as accurate as chance (random chirality choice, L or D, for each peptide residue). Increasing the number of seeds failed to improve this violation rate. The AF3 predictions exhibit incorrect folds and binding poses, with D-peptides commonly oriented incorrectly in the L-protein binding interface. Confidence metrics returned by AF3 also fail to distinguish predictions with low chirality violation and correct docking vs. predictions with high chirality violation and incorrect docking. We conclude that AF3 is a poor predictor of D-peptide chirality, fold, and binding pose and propose solutions to address these limitations.
bioinformatics2026-05-19v3OmniGene-4: A Unified Bio-Language MoE Model with Router-Level Interpretability
Wang, L.Abstract
We introduce OmniGene-4, a unified bio-language foundation model built on Gemma-4-26B-A4B (30 layers, 128 experts per layer, top-8 routing). We inject 28,028 biological tokens (DNA and protein BPE, Foldseek 3Di, DSSP labels), continue pretraining on a 32.5 GB DNA / protein / natural-language / structural mixture, and run a five-stage supervised fine-tuning pipeline (v2-v5) on 199,576 instruction-format examples across eight task families. The final v5 adds a dual-head architecture: a generation head plus two per-residue classification heads (3Di, DSSP) trained jointly under a 0.5 / 0.5 loss split. v5 reaches 99.40% accuracy on BioPAWS standard protein homology, 82.60% on remote homology (500 pairs), and 93.66% on BixBench, gaining +14.4, +22.6, and +6.7 percentage points, respectively, over the vocabulary-extended Gemma-4-Instruct baseline, and outperforming ESM-2 (650M) by +32.1 pp on the identical remote-homology split. The classification heads reach 78.6% per-residue accuracy on 3Di (chance 5%) and 100% on DSSP (chance 12.5%). MoE router activations further yield a clean CPT/SFT 96%/4% decomposition of cross-task differentiation, providing direct interpretability of where biological specialization is acquired.
bioinformatics2026-05-19v2Effects of Structural Reward Shaping on Biophysical Properties in RL-Trained Plasmid Generators
Thiel, M.; Cunningham, A.; Barnes, C. P.Abstract
We compare the efficacy and distributional effects of supervised fine-tuning (SFT) and reinforcement learning (RL) post-training for PlasmidGPT, a foundation model for whole-plasmid generation, using Group Relative Policy Optimization (GRPO) for the RL model. Using a biologically motivated reward function encoding functional annotations, length constraints, and repeat penalties, the RL model achieves a 71.6% quality control pass rate across 8 prompts on 4,000 sequences, compared to 4.3% for the pretrained baseline and 11.0% for SFT. A five-model reward ablation identifies the cassette arrangement bonus, which rewards correct promoter→CDS→terminator ordering, as the critical reward component. Rejection-sampling baselines indicate that the gain is not recovered by sampling more heavily from the base model. Beyond directly optimized features, RL-generated sequences converge toward real plasmid distributions in 3-mer composition, ORF length, and thermodynamic stability, properties we categorize as reward-correlated or indirectly shaped by the structural reward signal. Minimum free energy density independently converges to the real-plasmid regime under both SFT and RL despite these being parallel post-training paths. On a small curated hold-out set, RL improves continuation log-likelihood over the pretrained baseline on every sequence (mean Δ = +0.83 nats), with no degradation in next-token prediction.
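To make the cassette arrangement idea in the abstract above concrete, here is a small illustrative sketch. It is not the authors' reward function: the feature labels, the requirement that the three elements be adjacent in annotation order, and the function name `count_ordered_cassettes` are all assumptions introduced for this example.

```python
def count_ordered_cassettes(features):
    """Count non-overlapping promoter, CDS, terminator runs in annotation order.

    `features` is a list of annotation type labels sorted by position.
    Any other label between the three elements resets the match
    (an assumption of this sketch, not a rule stated in the abstract).
    """
    expected = ("promoter", "CDS", "terminator")
    count = 0
    stage = 0  # index of the next element we expect to see
    for kind in features:
        if kind == expected[stage]:
            stage += 1
            if stage == len(expected):
                count += 1
                stage = 0
        elif kind == "promoter":
            stage = 1  # a new promoter restarts a candidate cassette
        else:
            stage = 0
    return count


well_ordered = ["promoter", "CDS", "terminator", "ori", "promoter", "CDS", "terminator"]
scrambled = ["CDS", "promoter", "terminator", "ori", "terminator", "CDS"]
n_good = count_ordered_cassettes(well_ordered)
n_bad = count_ordered_cassettes(scrambled)
```

A bonus proportional to such a count would favour sequences whose annotated elements appear in a functional order, which is the kind of structural signal the ablation singles out.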
bioinformatics2026-05-19v2IsoformSwitchAnalyzeR v2: Analysis of Functional Isoform Changes in Long-read and Single-cell Sequencing Data
Han, C.; Gilis, J.; Delgado, E. I.; Clement, L.; Vitting-Seerup, K.Abstract
Alternative splicing enables a single gene to produce a variety of mRNA transcripts, significantly enhancing protein diversity in higher eukaryotes. Isoform switching refers to the differential usage of transcripts of a gene and occurs pervasively across physiological and pathological conditions. IsoformSwitchAnalyzeR was developed to identify these isoform switches and analyze their functional consequences. Advances in RNA-seq technology, including long-read and single-cell sequencing, along with state-of-the-art computational tools, enable unprecedented accuracy in isoform switch identification and its functional consequences, necessitating an update to IsoformSwitchAnalyzeR. Here we present IsoformSwitchAnalyzeR 2.0, with substantial improvements in the robustness of isoform switch detection, the incorporation of new types of functional annotation, and interoperability with other bioinformatics tools. We showcase how IsoformSwitchAnalyzeR is well-suited for analysis of both long-read RNA-seq and single-cell data through two case studies. Specifically, we analyze long-read data from patients with Alzheimer's Disease and single-cell data from Glioblastoma patients. In both case studies, we find important isoform switches with disease-relevant functional consequences, showcasing the power of IsoformSwitchAnalyzeR v2. Taken together, these findings highlight the versatility and robustness of IsoformSwitchAnalyzeR in handling advanced sequencing technologies, thereby broadening its applicability across diverse research contexts.
bioinformatics2026-05-19v2TransXplorer: An automated translational discovery platform for RNA-seq data
Verma, V. M.; Oler, E.; Syed, H.; Han, S.; Berjanskii, M.; Mason, A. L.; Wishart, D. S.; Wong, G. K.-S.Abstract
RNA-seq experiments routinely identify thousands of differentially expressed genes, but translating these into biological insights and therapeutic hypotheses often requires integrating multiple tools. Existing web platforms such as iDEP, NetworkAnalyst, and GEPIA2 address individual steps, differential expression, network visualization, or TCGA queries, but lack a unified environment spanning raw data processing to clinical and pharmacological interpretation. TransXplorer (https://www.transxplorer.org) is a freely available web platform that addresses this limitation by integrating the complete RNA-seq analytical workflow. It supports processing from raw FASTQ files using HISAT2 or Salmon, as well as direct GEO dataset import with automated metadata handling. Differential expression analysis is implemented via DESeq2, edgeR, and limma-voom, followed by functional enrichment across more than 1,800 species using Bioconductor resources. Batch effects are automatically detected and corrected using a composite of PVCA, kBET, and Silhouette metrics without requiring predefined batch annotations. Downstream analyses include co-expression network construction (WGCNA), protein-protein interaction mapping (STRING), cell-type deconvolution, and transcription factor inference using integrated DoRothEA and TFLink resources. The platform further links gene signatures to drug candidates through DGIdb and OpenTargets and enables survival and tumour-normal comparisons across TCGA cohorts. Application to cardiac endothelial differentiation (GSE151427) and kidney renal papillary cell carcinoma (TCGA-KIRP) datasets demonstrates accurate batch correction, biologically consistent pathway enrichment, recovery of expected cell-type proportions, and identification of clinically relevant genes and drug candidates. TransXplorer is freely available without a login.
bioinformatics2026-05-19v2Learning the Language of the Microbiome with Transformers
Treloar, N. J.; Ur-Rehman, S.; Yang, J.Abstract
Self-supervised pretraining has become central to biological machine learning, yet microbiome data remains comparatively underexplored in terms of both modeling approaches and evaluation frameworks. To address this gap, we present Atlas, a pretraining dataset of over 539,000 microbiome datapoints from the MGnify database. Using Atlas, we train the Waypoint family of microbiome foundation models: a series of GPT-2 style causal language models ranging from 6M to 170M parameters. We also introduce Compass, a curated benchmark of eight predictive tasks spanning biome classification, drug-microbiome interactions, drug degradation, and infant gut development. Using this benchmark, we compare the performance of Waypoint models against classical baselines and the existing MGM foundation model. Our results show that pretraining leads to consistent and significant improvements in downstream task performance, that both dataset scale and tokenization strategy impact model quality, and that pretraining is essential for achieving favorable scaling behavior. Furthermore, pretrained transformer models begin to reliably outperform classical methods once training data exceeds roughly 10,000 examples - a threshold that is attainable for modern microbiome studies. Finally, we demonstrate that the Waypoint models achieve state-of-the-art performance among microbiome foundation models. Overall, our work highlights the importance of large-scale self-supervised pretraining in this domain and establishes Atlas, Compass, and the Waypoint models as valuable resources for the research community in this emerging field.
bioinformatics2026-05-19v2Beyond single markers: bacterial synergies identified by Multidimensional Feature Selection reveal conserved microbiome disease signatures
Zielinska, K.; Rudnicki, W.; Kahles, A.; Labaj, P. P.Abstract
The gut microbiome encodes disease-relevant information not only in the abundance of individual taxa and functions, but in the way they co-occur and interact. Yet metagenomic analyses have largely relied on univariate approaches that evaluate features in isolation, systematically overlooking the combinatorial signals that arise from microbial co-occurrence. Here, we introduce a framework based on the Multidimensional Feature Selection (MDFS) algorithm to identify synergistic feature pairs - combinations of taxa and functions whose joint predictive relevance substantially exceeds that of either constituent alone, including features that carry no individual signal and would be discarded by any conventional analysis. We first validated the approach on a meta-analysis of colorectal cancer (CRC) cohorts - one of the most competitive microbiome classification benchmarks available - using a leave-one-cohort-out cross-validation framework. Our framework matched state-of-the-art classification performance (AUC = 0.85) while simultaneously revealing microbial interactions that are structurally inaccessible to univariate methods. A subset of high-stability synergistic pairs showed consistently elevated model selection frequencies and robust discriminatory power across independent cohorts, confirmed under stringent per-cohort effect size testing. Extending the framework to 20 disease cohorts spanning inflammatory bowel disease, type 2 diabetes, liver cirrhosis, and atherosclerotic cardiovascular disease, we identified thousands of high-impact synergistic interactions and 21 conserved cross-cohort markers. Across all contexts examined, synergistic pairs substantially outperformed their individual constituents, establishing microbial co-occurrence as a reproducible and biologically informative axis of disease-associated variation that univariate approaches are structurally unable to detect. The framework is freely available at https://github.com/Kizielins/MDFS_synergies. 
Importance: Most microbiome studies search for individual gut bacterial species associated with disease. However, bacteria do not act in isolation, and their combined presence or relative balance may be far more informative than any single microbe considered alone. This study presents a computational framework that identifies pairs of gut microorganisms whose co-occurrence or relative abundance carries substantially greater predictive signal than either constituent feature independently. Applied to stool metagenomic data from patients with colorectal cancer and more than a dozen additional conditions, we demonstrate that these synergistic interactions are widespread, reproducible across independent patient cohorts, and reveal disease-relevant microbial relationships that standard analyses miss entirely. Our framework offers a more complete view of how the gut microbiome is altered in disease and provides a principled basis for identifying robust, interaction-based biomarkers.
bioinformatics2026-05-19v2Computational Design of Novel Selective Phosphodiesterase 4B Inhibitors from Natural Products: An Integrated Machine Learning and Structure-Based Drug Discovery Approach
Oni, S. A.; Oyemomi, M. D.; Osho, A.; Abdulfatai, A.Abstract
Selective inhibition of phosphodiesterase 4B (PDE4B) remains a promising strategy for preserving the anti-inflammatory benefit of PDE4 inhibition in chronic obstructive pulmonary disease while reducing PDE4D-associated tolerability liabilities. This study integrated SHAP-interpretable machine learning, natural product virtual screening, hierarchical docking, post-docking MM-GBSA, isoform cross-docking, binding-pocket comparison, ADMET prediction, and 100 ns molecular dynamics simulations to identify PDE4B-selective inhibitors from the LOTUS natural product database. A Random Forest classifier trained on curated ChEMBL PDE4B bioactivity data achieved external performance of AUC-ROC = 0.955, accuracy = 0.893, F1-score = 0.896, and MCC = 0.785, and prioritized 119,698 predicted actives from 276,518 LOTUS compounds. SHAP analysis identified BertzCT and TPSA as major contributors to predicted activity. Sequential Lipinski, PAINS, and QED filtering retained 14,210 candidates for structure-based evaluation. Extra precision docking identified four leads with PDE4B docking scores of -9.123 to -12.080 kcal/mol, all outperforming roflumilast (-7.658 kcal/mol). Cross-docking and post-docking MM-GBSA supported preferential PDE4B binding for three candidates. The top lead, LTS0048837, maintained a stable PDE4B-bound pose during simulation, with comparatively stronger interaction persistence than its PDE4D complex and the roflumilast reference. These findings nominate LTS0048837 as a computationally prioritized PDE4B-selective natural product lead requiring experimental enzyme, cellular, and pharmacokinetic validation.
bioinformatics2026-05-19v1Geometric averaging provides normalization-invariant feature ranking in compositional sequencing data
Nunzi, E.; Romani, L.Abstract
In compositional next-generation sequencing (NGS) analyses (including microbiome studies, RNA-seq and metagenomics) the arithmetic mean (AM) of relative proportions is the default operator for summarizing feature abundances. We show that this default produces unstable rankings in real compositional data. Across 102 prevalent genera in the dietswap dataset (n=38 baseline samples), 23 genera (22.5%), including members of Bacteroides, Eubacterium and Bilophila, yielded opposite group-level conclusions under AM and the geometric mean (GM). This pattern reflects two formal properties of compositional aggregation. First, AM-based rankings change with the within-sample normalization domain, whereas GM-based rankings are invariant under the multiplicative structure of compositional data. Second, the centered log-ratio (CLR) transformation absorbs geometric averaging into the data representation, so that arithmetic averaging on CLR-space recovers the GM ranking exactly. Both properties were verified numerically on the dietswap dataset, where the Spearman correlation between GM- and CLR-based rankings was 1.000 in both groups. The operator-choice problem propagates to between-group differential inference: under AM, log2 fold-changes vary across normalizations and the relative ranking of features by effect size is not preserved; under GM and CLR, the ranking is preserved. We recommend GM-based summaries for feature ranking and CLR-transformed abundances for cross-sample comparisons. This change requires no new computational tools and is fully compatible with existing differential-abundance pipelines, but eliminates an under-recognized source of irreproducibility in biomarker discovery across microbiome studies, transcriptomics, metagenomics, and mass-spectrometry-based metabolomics, in all settings where features are quantified relative to a sample total.
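The two properties stated in the abstract above can be checked on a toy example. The sketch below uses made-up counts for four hypothetical features (A to D), not the dietswap data, and compares rankings computed on the full composition with rankings computed after dropping feature D and renormalizing. Per-feature mean CLR equals the log geometric mean minus a constant shared by all features, which is why the two rankings coincide.

```python
import math


def closure(counts):
    """Normalize a vector of positive counts to proportions summing to 1."""
    total = sum(counts)
    return [c / total for c in counts]


def arithmetic_mean(samples):
    """Per-feature arithmetic mean across samples."""
    return [sum(col) / len(samples) for col in zip(*samples)]


def geometric_mean(samples):
    """Per-feature geometric mean across samples (values must be > 0)."""
    return [math.exp(sum(math.log(x) for x in col) / len(samples))
            for col in zip(*samples)]


def clr(sample):
    """Centered log-ratio transform of one composition."""
    log_gm = sum(math.log(x) for x in sample) / len(sample)
    return [math.log(x) - log_gm for x in sample]


def ranking(scores, names):
    """Feature names ordered from highest to lowest score."""
    return [name for _, name in sorted(zip(scores, names), reverse=True)]


names = ["A", "B", "C", "D"]
counts = [[1, 2, 3, 94], [30, 20, 12, 38]]    # toy data, two samples
full = [closure(s) for s in counts]           # normalized over A-D
sub = [closure(s[:3]) for s in counts]        # renormalized without D

abc = names[:3]
am_full = ranking(arithmetic_mean(full)[:3], abc)
am_sub = ranking(arithmetic_mean(sub), abc)
gm_full = ranking(geometric_mean(full)[:3], abc)
gm_sub = ranking(geometric_mean(sub), abc)
clr_full = ranking(arithmetic_mean([clr(s) for s in full])[:3], abc)

print("AM :", am_full, "->", am_sub)
print("GM :", gm_full, "->", gm_sub)
print("CLR:", clr_full)
```

In this toy case the arithmetic-mean ranking of A, B, and C reverses when the normalization domain changes, while the geometric-mean ranking is unchanged and matches the ranking from arithmetic averaging in CLR space.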
bioinformatics2026-05-19v1ToxCastLite: A portable semantic evidence graph linking in vitro bioactivity, in vivo toxicity, and exposure-use context
Dönmez, A.; Nosov, O.; Heck, K.; Mosig, A.; Fritsche, E.; Koch, K.Abstract
Motivation: The ToxCast database is a valuable resource for computational toxicology and new approach methodologies (NAMs), but the approximately 100 GB MySQL distribution is difficult to use for portable local analysis and cross-domain evidence mining. Many practical questions concern chemicals, in vitro bioactivity, in vivo toxicological evidence, and exposure-relevant product-use context rather than raw database keys. Results: We present ToxCastLite, a portable semantic evidence-access system that combines assay-scoped SQLite databases with a compact RDF layer for GraphDB-based querying. The system streams large ToxCast/invitrodb MySQL dumps into curated SQLite profiles, reducing the footprint to approximately 3 GB for focused use cases such as developmental neurotoxicity. Dense numerical evidence, including concentration-response rows, remains in SQLite, while the RDF projection exposes linked semantic entities such as chemicals, assays, endpoints, model results, potency parameters (AC50), and MC6 quality flags. We further extend the graph with CPDat v4.0 product-use and functional-use evidence and ToxRefDB v3.0 in vivo toxicity evidence, including processed studies, point-of-departure records, effect summaries, and observation summaries. These layers are linked through DSSTox Substance Identifiers, enabling integrated queries across NAM bioactivity, curated animal-study evidence, and exposure/use context. A Streamlit prototype supports exploration through a locally deployed LLM that translates natural-language questions into SPARQL, grounded by a versioned RDF schema to reduce hallucination risk. Case studies in developmental neurotoxicity demonstrate how ToxCastLite identifies concordance between high-confidence in vitro DNT activity and positive in vivo apical evidence, detects in vitro DNT activity beyond available DNT-specific in vivo evidence, and prioritizes chemicals where NAM signals, ToxRefDB evidence, and CPDat product-use context intersect. For selected results, users can drill down from the semantic graph to the underlying SQLite records and retrieve concentration-response curves for expert inspection without manually writing SQL or SPARQL.
bioinformatics2026-05-19v1CANCAN: high-resolution copy number and mutation heterogeneity analysis of DNA sequence data for clinical applications
Pladsen, A. V.; Vodak, D.; Zhao, S.; Nakken, S.; Nebdal, D.; Lien, T.; Danielsen, B. K.; Wang, C.; Kildal, W.; Hjortland, G. O.; Hovig, E.; Russnes, H. G.; Lingjaerde, O. C.Abstract
High-throughput DNA sequencing is central to precision oncology, yet robust and interpretable methods for integrated analysis of copy number alterations and somatic variants across sequencing platforms remain limited. We present CANCAN (Copy number integrative ANalysis in CANcer), a platform-agnostic computational framework for high-resolution analysis of allele-specific copy number and variant data. CANCAN integrates novel normalization and segmentation strategies and enables inference of tumor purity, ploidy, subclonality and mutation multiplicity, while providing statistical confidence estimates and transparent evaluation of alternative solutions. Benchmarking across whole-genome, whole-exome and targeted sequencing datasets from TCGA and the IMPRESS-Norway study demonstrates high concordance with established methods, with particularly strong performance on targeted sequencing data. CANCAN accurately estimates global genomic features, including purity and ploidy, even at reduced sequencing coverage, and shows comparable or improved agreement relative to existing tools. In addition, it provides detailed visualization of the genomic context of clinically relevant biomarkers, supporting diagnostic interpretation. CANCAN constitutes a reproducible and interpretable approach for integrated genomic analysis, addressing key methodological and practical challenges in clinical cancer genomics.
bioinformatics2026-05-19v1Insertions, deletions, and exchangeable couplings: a Dirichlet process over TKF92 domains and sites
Large, A. L.; Holmes, I. H.Abstract
To introduce local heterogeneity, evolutionary models are often equipped with site-class mixtures, preserving this symmetry in the sense of de Finetti: conditional on the latent class, residues are still exchangeable. In a four-step theoretical ladder, we show how long-range structure such as couplings between distant sites can also be introduced exchangeably by using a Dirichlet process to partition sites into co-evolving classes. Our first step is a thorough analysis of TKF92 to establish sufficient statistics, limiting behavior, and inferential tools. We then lift the pairwise TKF92 hidden Markov model, in the limit of small time, to a time-indexed gravestone-augmented pair stochastic context-free grammar, and from there to its phylogenetic generalisation. This framing allows trajectories to be sampled exactly by Inside-Outside recursion. The third step places a Dirichlet process over the alive sites and asks co-keyed sites to evolve under a sparse Potts interaction: an exchangeably-partitioned hidden direct-coupling model whose marginal alignment likelihood is unchanged from plain TKF92. The fourth rung of the ladder develops inference machinery: a Gibbs-Metropolis sampler that alternates alignment resamples, key-partition resamples, and stochastic parameter updates. We close several gaps along the way (exact closed-form sufficient statistics for the linear birth-death-immigration component, the resolvable L'Hôpital limit at λ=, and a closed-form M-step for a recursive generalisation of TKF92), and we report a 1,000-family Pfam fit with K=4 site classes whose Potts atoms carry ~0.54 nats of covariation per class-pair on top of a substantial single-site substitution model. Supplementary material, including full source code for inference, may be found at https://tkfdp.net/.
bioinformatics2026-05-19v1Sequence alignment of the primate lineage reveals evolutionary divergence and conserved secondary structural motifs in noncoding RNAs
Beeram, A.; Perry, Z. R.; Pyle, A. M.Abstract
Long noncoding RNAs (lncRNAs) constitute most of the human transcriptome and perform essential roles in chromatin organization and transcriptional regulation. Because lncRNA genes are not constrained by protein-coding ability, they tend to exhibit more rapid evolutionary divergence. Their poor nucleotide sequence conservation among mammals has often led to the assumption that lncRNAs lack conserved structures. However, emerging evidence indicates that many noncoding RNAs adopt secondary and tertiary folds critical for protein recruitment, chromatin binding, and regulation of gene expression. Nevertheless, there are few experimental secondary structures for lncRNAs, hindering mechanistic insight into lncRNA structure-function relationships. Even without available structural data, covariation, in which two nucleotides co-evolve, can provide evidence for conserved structures. This requires sequence alignments with sufficient divergence to detect covariation but enough similarity to maintain alignment quality. Here we report the development of a novel computational pipeline to mine 190 unannotated primate genomes to generate high-quality multiple sequence alignments of noncoding RNAs. This pipeline performs sequence searching, locus extraction, cross-species alignment, and downstream analyses, including assessment of covariation and primary sequence conservation. Ultimately, we demonstrate that because many noncoding elements, such as lncRNAs, evolve at a more rapid rate than protein-coding genes, phylogenetic analyses constrained within a narrower evolutionary span can be used to identify conservation of primary sequence and secondary structure. By focusing our alignments on the primate lineage, our method overcomes the limitations of broad phylogenetic analyses, enabling high-resolution detection of subtle conservation patterns and conserved secondary structural motifs of long noncoding RNAs.
bioinformatics2026-05-19v1Cross-Cohort Optimal Transport Maps Macrophage Plasticity and Competing Routes to Inflammation and Fibrosis in Human Atherosclerotic Plaques
Vazquez Montes de Oca, S.; Acedo Terrades, A.; Carreno Martinez, J. F.; Kirchner, P.; Ord, T.; Kaikkonen, M. U.; Freigang, S. B.; Zlobec, I.; Rodriguez Martinez, M.Abstract
Single-cell transcriptomics has revealed extensive macrophage heterogeneity in atherosclerotic plaques, but how macrophages move between states, and whether transition mechanisms depend on cellular origin, remain unclear. Here we develop a computational framework that reconstructs directed cell-state transition networks from cross-sectional single-cell RNA-sequencing data by combining optimal transport with RNA velocity and systematic cross-cohort validation. Applying this approach to seven human carotid plaque cohorts, we generate an integrated atlas of 81,633 monocytes and macrophages and identify 15 statistically significant pairwise transitions, of which 11 directed transitions organize into three biological axes: monocyte fate diversification, inflammatory reactivation, and fibrotic remodeling. The strongest transition links scavenging macrophages to inflammatory macrophages, indicating that plaque inflammation is driven predominantly by reactivation of tissue-adapted macrophages rather than by direct differentiation of newly recruited monocytes. By tracking gene expression changes along the OT commitment gradient, we find that macrophage plasticity follows an origin-dependent spectrum. Tissue-resident macrophages, in particular scavenging C1q+ macrophages, acquire inflammatory programs while preserving and reinforcing their resident scavenging identity, a mechanism we term "transcriptional layering", whereas monocyte-derived transitions proceed through selective loss of source-identity modules. Despite these distinct routes, transitions converging on the same fate activate shared destination-specific regulatory circuits, with inflammatory and fibrotic programs governed by mutually antagonistic transcription factor networks. These findings identify inflammatory reactivation of scavenging macrophages as a dominant transition axis in human atherosclerosis and suggest that macrophage origin constrains how disease-associated programs are acquired.
More broadly, this framework provides a general strategy for quantifying cell-state transitions and dissecting plasticity mechanisms in chronic inflammatory disease.
bioinformatics2026-05-19v1DistPCA: Tera-Scale Genomic PCA via Out-of-Core Distributed Parallelism
Mermigkis, G.; Sofotasios, A.; Kontopoulou, E.-M.; Gallopoulos, E.; Hadjidoukas, P.Abstract
Principal Component Analysis (PCA) is a fundamental tool in human genetics, widely used to study population structure. However, the rapid growth of modern genomic datasets, which often exceed main memory capacity, renders traditional PCA methods infeasible, motivating out-of-core approaches. Prior work on out-of-core genomic PCA has focused primarily on optimizing the inherently compute-intensive numerical core, largely overlooking the stages of data I/O and preprocessing, which emerge as significant performance bottlenecks at tera-scale. Furthermore, existing approaches remain limited to shared-memory single-node architectures, lacking support for distributed multi-node environments. To address these limitations, we introduce DistPCA, the first distributed out-of-core framework for tera-scale genomic PCA, implemented as a C++ library and scalable across both single- and multi-node systems. Built on top of the Message Passing Interface (MPI), the proposed framework employs multi-level data parallelism across the entire PCA pipeline, combining multiprocessing, multithreading, SIMD vectorization, and compute-transfer overlap, while remaining compatible with block-based methods that rely on associative operations. Extensive evaluation on real and synthetic datasets demonstrates near-linear scalability, achieving speedups of up to 58.2x and over 98% reduction in wall-clock time, while maintaining parallel efficiency above 82% and preserving accuracy in the recovered principal components.
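To illustrate why associative operations matter for block-based, out-of-core processing, the following is a minimal sketch in Python, not DistPCA's C++/MPI implementation. It assumes the quantity of interest is a Gram matrix X^T X, which equals the sum of B^T B over row blocks B, so partial results can be combined in any grouping. The function names (`gram`, `add`, `blocked_gram`) are hypothetical.

```python
def gram(block):
    """Return B^T B for one row block given as a list of equal-length rows."""
    d = len(block[0])
    g = [[0.0] * d for _ in range(d)]
    for row in block:
        for i in range(d):
            for j in range(d):
                g[i][j] += row[i] * row[j]
    return g


def add(a, b):
    """Element-wise sum of two equally sized matrices (the associative combine step)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def blocked_gram(blocks):
    """Accumulate X^T X from an iterable of row blocks, holding one block at a time."""
    total = None
    for block in blocks:
        part = gram(block)
        total = part if total is None else add(total, part)
    return total


# Hypothetical toy matrix; in an out-of-core setting each block would be read from disk.
rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
streamed = blocked_gram(rows[i:i + 2] for i in range(0, len(rows), 2))
```

Because the combine step is a plain sum, the same partial matrices could in principle be produced by different workers and reduced in any order, which is the property the abstract refers to.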
bioinformatics2026-05-19v1Multi-Scale Tri-Modal Histology Dataset Integrating Tumor Morphology, Immune Patterns, and Clinical Outcomes
Jung, K. J.; Qiu, J.; Cho, S.; McDonough, E.; Chadwick, C.; Ghose, S.; West, R. B.; Brooks, J. D.; Ginty, F.; Machiraju, R.; Mallick, P.Abstract
Accurate prognostic assessment of prostate cancer (PCa) requires an integrated understanding of tissue morphology (encompassing cell structure, glandular architecture, and tissue organization) and the immune environment. We present Prostate-TriMod, a novel tri-modal histology dataset designed to integrate high-resolution visual morphology with spatial tissue maps, immune infiltration patterns, and clinical outcomes. This dataset, generated from the Cell DIVE multiplexed imaging platform, consists of three synchronized modalities: (1) multiscale virtual H&E tiles (224px, 256px, 512px, and 2040px) providing visual morphological context, (2) spatial tissue maps identifying cancerous/non-cancerous epithelial cells, stroma and immune cell populations (via TOPAZ and CAT models), and (3) text captions generated from single-cell data and patterns. The dataset includes comprehensive clinical annotations, including Grade Groups and biochemical recurrence (BCR) status. By providing high-fidelity alignment between visual features, spatial tissue maps, and textual descriptions, Prostate-TriMod empowers the development of advanced multimodal AI frameworks. We expect this resource to support reuse in multimodal representation learning, spatial analysis, and benchmarking studies that link histology morphology and immune context to clinical outcomes in prostate cancer.
bioinformatics2026-05-19v1PocketBagger: Generalizable pocket druggability prediction via positive-unlabeled learning
Gingrich, P. W.; Biswas, A.; Mica, I. L.; Brammer, K. M.; Shu, Z.; Maxwell, D. S.; Russell, K. P.; Al-Lazikani, B.Abstract
Reliable structure-based prediction of small-molecule druggability is hindered by a fundamental labeling problem. Experimentally confirmed liganded sites (positives) are observable, but credible "undruggable" pockets (negatives) are almost impossible to define. Standard supervised machine learning consequently relies on arbitrary definitions of 'undruggable', leading to bias and false negatives. Here we introduce PocketBagger, a positive-unlabeled (PU) learning framework for pocket druggability prediction trained exclusively on experimentally determined Protein Data Bank (PDB) structures. PocketBagger uses PU bagging to learn key features associated with reliable 'druggable' pockets and considers all remaining pockets in the structurally characterized proteome as unlabeled. We demonstrate the capability of PocketBagger through the training of a simple Random Forest classifier and demonstrate its power in recall (0.804), even when challenged with increasingly difficult generalizability assessments and entire protein-family hold outs. We benchmark and demonstrate the added value of PU learning by comparing PocketBagger to a leading deep-learning predictor. However, PocketBagger is intended to be used as a framework for any model architecture. Along with the code, the data generated by PocketBagger are deployed in canSAR.ai, providing scalable, generalizable pocket druggability predictions to the drug discovery community.
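PU bagging, as it is commonly described, repeatedly treats a random bootstrap sample of the unlabeled set as provisional negatives, trains a classifier against the known positives, and averages the out-of-bag predictions for each unlabeled item. The following is a minimal Python sketch of that idea, assuming a toy nearest-centroid base learner in place of the Random Forest mentioned in the abstract. The function names, feature vectors, and parameters are hypothetical and are not taken from PocketBagger.

```python
import random


def centroid(rows):
    """Mean vector of a non-empty list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pu_bagging_scores(positives, unlabeled, n_rounds=200, seed=0):
    """Average out-of-bag 'positive' votes for each unlabeled item."""
    rng = random.Random(seed)
    k = len(positives)                      # bag size matches the positive set
    totals = [0.0] * len(unlabeled)
    counts = [0] * len(unlabeled)
    pos_c = centroid(positives)
    for _ in range(n_rounds):
        # Bootstrap (with replacement) a bag of unlabeled items as provisional negatives.
        bag = [rng.randrange(len(unlabeled)) for _ in range(k)]
        in_bag = set(bag)
        neg_c = centroid([unlabeled[i] for i in bag])
        for i, x in enumerate(unlabeled):
            if i in in_bag:
                continue                    # score only out-of-bag items
            totals[i] += 1.0 if sq_dist(x, pos_c) < sq_dist(x, neg_c) else 0.0
            counts[i] += 1
    # Items that were never out of bag get NaN rather than a made-up score.
    return [t / c if c else float("nan") for t, c in zip(totals, counts)]
```

Unlabeled items that resemble the positives should receive scores near 1 and the rest scores near 0; any real use would swap in a stronger base learner and real pocket descriptors.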
bioinformatics2026-05-19v1BioGAIP: A Scalable, User-Friendly and Robust LLM-Powered Multi-Agent System for Automated Bioinformatics Tasks
Zhang, J.; Guo, P.; Jiang, G.; Zhou, M.; Wei, G.; Ni, T.Abstract
The rapid explosion of large-scale, high-throughput biological data has created an urgent demand for efficient analysis pipelines. Traditional bioinformatics approaches, while powerful, often require specialized computational expertise, placing them out of reach for bench biologists. Large Language Models (LLMs) offer new possibilities for automating complex reasoning and tool integration, yet existing LLM-based solutions have not sufficiently lowered this barrier, and expert-level analysis remains inaccessible to most nonexperts. Here, we present BioGAIP, an LLM-powered agent that integrates expert-level reasoning within an end-to-end platform for bioinformatics tasks. By coupling optimized autonomous agents with full graphical interfaces, BioGAIP transforms complex analytical workflows into an automated, user-friendly, and low-intervention process with natural language input. Key features of BioGAIP include dynamic information retrieval, automatic environment configuration, and self-directed design of analysis pipelines, making large-scale multi-omics analysis highly accessible. Built on agent-based client-server architecture, BioGAIP ensures secure resource management and supports heavy computational demands. Extensive evaluations on diverse published datasets demonstrate that BioGAIP reliably recapitulates established biological insights and shows strong potential for novel discovery. By democratizing complex bioinformatics workflows, BioGAIP accelerates accessible data-driven discovery for both experts and nonexperts.
bioinformatics2026-05-19v1AbSolution and ENCORE: a proof-of-concept for automating computational reproducibility in interactive applications
Garcia-Valiente, R.; Langton, S. H.; van Kampen, A. H. C.Abstract
Reproducibility and transparency in computational analyses are essential in science, although achieving these goals often requires significant knowledge and systematic organization. Graphical interactive applications simplify the conduct of analyses and make them accessible to a broader audience. However, there is currently no consensus on how, and to what extent, to implement reproducibility in interactive applications. We recently developed AbSolution, a user-friendly and flexible interactive web-based R Shiny application for exploring immune repertoires and their sequence-based features, and we established the ENCORE framework to enhance transparency and reproducibility by guiding researchers in structuring and documenting computational projects. In this work, as a proof-of-concept, we integrate AbSolution, ENCORE and specific R packages to address reproducibility challenges. This enables a single-step export of raw data, processed data, and metadata, the software environment, the underlying generated code and an HTML report containing results and figures, operating system, hardware and R session details, and researcher notes. Its reproducibility has been independently validated by the CODECHECK initiative. This paper demonstrates how the combination of several approaches can improve and automate reproducibility of interactive applications.
bioinformatics2026-05-19v1cadmus: a robust pipeline for scalable retrieval of full-text biomedical literature
Campbell, J.; Lain, A. D.; Simpson, T. I.Abstract
cadmus is an open-source Python toolkit for automated retrieval and processing of full-text biomedical literature. It utilises programmatic access to PubMed, Crossref, Europe PMC, PMC, and publisher APIs, allowing users to construct large, domain-specific corpora with minimal manual intervention. cadmus parses PDF, HTML, XML, and plain text files, standardising them for downstream biomedical text mining. During the retrieval of a Developmental Disorders Corpus (204,043 publications), it achieved an 85.2% full-text retrieval rate with institutional subscriptions and 54.4% without. To test the fidelity of retrieved full-texts, we used ScispaCy to infer the similarity of paired documents from 44,264 open-access PubMed Central files and the files retrieved from cadmus, resulting in an average cosine similarity score of 0.98. Rarefaction analyses demonstrated that full-text corpora double the coverage of unique biomedical concepts over abstracts, resulting in better access to the depth of the biomedical information available.
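For readers unfamiliar with the fidelity metric, the following is a minimal Python sketch of cosine similarity between two documents. It assumes simple term-count vectors rather than the ScispaCy document vectors used in the study, so its scores are not comparable to the reported 0.98. The function names are hypothetical and are not part of cadmus.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


# Hypothetical paired documents: a reference copy and a retrieved copy.
reference = term_counts("Full-text retrieval of biomedical literature")
retrieved = term_counts("Retrieval of full-text biomedical literature")
similarity = cosine_similarity(reference, retrieved)
```

A score near 1 indicates that the retrieved text carries essentially the same content as the reference copy, while a score near 0 indicates little shared vocabulary.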
bioinformatics2026-05-19v1Sequence-based Drug-Target Binding Site Pre-training Enables Cryptic Pocket Detection and Improves Binding Affinity and Kinetics Prediction
Zhang, S.; Xie, L.; Tiourine, D.; Xie, L.Abstract
Predicting protein-ligand binding characteristics, such as affinity and kinetics, is critical for accelerating drug discovery. However, many existing computational methods face key limitations, including insufficient integration of comprehensive databases, inadequate representation of protein structural dynamics, and incomplete modeling of microscale protein-ligand interactions. To address these challenges, we introduce ProMoNet, a sequence-based pre-training and fine-tuning framework to enhance the prediction of protein-ligand binding characteristics. ProMoNet leverages protein and molecular foundation models to expand data coverage and enhance diversity. It also introduces a pre-training strategy based on protein-ligand binding site prediction, which bridges protein- and ligand-level representations to support downstream prediction tasks involving protein-ligand complexes. Our pre-training module effectively models microscale protein-ligand interactions and captures the dynamic nature of proteins, including binding site crypticity, without relying on 3-dimensional structural inputs. Notably, this module surpasses or matches state-of-the-art structure-based methods in identifying exposed and cryptic binding sites while maintaining high efficiency. Our fine-tuning module then efficiently transfers the pre-trained knowledge to downstream tasks such as binding affinity and binding kinetics prediction, achieving superior performance. The combination of ProMoNet's strong performance and demonstrated efficiency across multiple tasks highlights its potential for broad applications in drug discovery.
bioinformatics2026-05-18v3Ensemble Post-hoc Explainable AI for Multilead ECG: Identifying Disease-Relevant Features in Single-Lead Interpretations
Metsch, J.; Hempel, P.; Maurer, M. C.; Spicher, N.; Hauschild, A.-C.; Steinhaus, K. E.Abstract
Despite the growing success of deep learning (DL) in multivariate time-series classification, such as 12-lead electrocardiography (ECG), widespread integration into clinical practice has yet to be achieved. The limited transparency of DL hinders clinical adoption, where understanding model decisions is crucial for trust and compliance with regulations such as the General Data Protection Regulation (GDPR) or the EU AI Act. To tackle this challenge, we implemented a state-of-the-art 1D-ResNet in PyTorch that was trained on the large-scale Brazilian CODE dataset to classify six different ECG abnormalities. We applied the model to the German PTB-XL dataset and evaluated its decision-making processes using 16 post-hoc explainable AI (XAI) methods. To assess the clinical relevance of the model's attributions, we conducted a Wilcoxon signed-rank test to identify features with significantly higher relevance for each XAI method. We used an ensemble majority vote approach to validate whether the model has learned clinically meaningful features for each abnormality. Additionally, a Mann-Whitney U test was employed to detect significant differences in relevance attributions between correctly and incorrectly classified ECGs. Overall, the model achieved sensitivity scores above 0.9 for most abnormalities in the PTB-XL dataset. However, our XAI analysis showed that the model struggled to capture clinically relevant features for some diseases. Certain XAI methods, including DeepLift, DeepLiftShap, and Occlusion, consistently highlighted clinically meaningful features across abnormalities, while others, such as LIME, KernelShap, and LRP, failed to do so. Moreover, some XAI methods demonstrated significant differences in attributions between correctly and incorrectly classified ECGs, highlighting their potential for enhancing model robustness and interpretability.
In conclusion, our findings underscore the importance of selecting suitable XAI methods tailored to specific model architectures and data types to ensure transparency and reliability. By identifying effective XAI techniques, this study contributes to closing the gap between DL advancements and their clinical implementation, paving the way for more trustworthy AI-driven healthcare solutions.
bioinformatics2026-05-18v2On the applicability domain of HADDOCK3 for protein-aptamer docking: documented failure modes from a 5x7 cross-target screening matrix and a 1676 aa receptor case study (P01031)
Dohi, E.Abstract
We screened a 5-receptor x 7-aptamer = 35-cell cross-target screening matrix with HADDOCK3 under blind ambiguous-interaction-restraint (AIR) protocols on AlphaFold-modelled receptors. The 35-cell matrix is primarily a cross-target/decoy screening matrix rather than a 35-cognate-pair benchmark: it contains an n = 4 K_D-calibration subset under matched assay conditions, at least six biological cognate or intended-cognate cells, and the remaining cells are intentional non-target pairings used to characterise score-distribution behaviour. The screen surfaced 12 operationally distinct failure modes that collapse into five broad conceptual groups. The principal case study is P01031 (complement C5, 1676 aa, ≥ 12 structural domains): all seven panel members produced positive HADDOCK3 top-1 scores under a scale-adaptive AIR. Score-term decomposition locates the anomaly in the AIR term (+217 to +268 to the top-1 score). With AIR zeroed, scores fall to between -131 and -74, the small-receptor regime. Boltz-2 cofolding chain-pair ipTM (cpi_AB) is an independent channel: P01031 shows the lowest median cpi_AB (0.211; 0/7 above the 0.5 confident-interface threshold). To our knowledge, this is an early documented case study of a 1676 aa multi-domain receptor exhibiting this signature under a blind scale-adaptive AIR workflow; it is an n = 1 mechanistic case, not a statistical generalisation. We adapt the QSAR applicability-domain concept to in silico aptamer screening. We report an empirical Mode 1 mitigation, a pLDDT-aware AIR prefilter, with cohort Jaccard recovery of ~10x. The n = 4 K_D-calibration Spearman ρ shift is reported as exploratory cross-method convergence, not as a calibration claim.
bioinformatics2026-05-18v2Design of DNA Aptamers for Lyme disease Diagnosis Combining experimental and numerical approaches
Issouani, E. M.; Da Ponte, H.; Guerin, M.; Padiolleau-Lefevre, S.; Maffucci, I.; Davila Felipe, M.; Gayraud, G.Abstract
Aptamers are single stranded DNA or RNA molecules selected for their high affinity and specificity to bind target molecules, similar to antibodies. They are commonly selected through the SELEX process, which involves the iterative exposure of a random sequence library to a target and retaining the sequences showing good binding properties. To improve Lyme disease detection, we propose designing aptamers that specifically bind to the CspZ protein on the surface of Borrelia burgdorferi, the bacterium responsible for the disease. Starting with a SELEX process consisting of thirteen rounds, from which selected in vitro sequence candidates have emerged, we aim to propose a holistic process that selects in silico new sequence candidates that are further validated experimentally. Our approach relies on 1) using Machine Learning (ML) techniques, specifically a Restricted Boltzmann Machine (RBM), to digitally replicate the last round of the SELEX process, 2) integrating insights from text analysis methods, such as word2vec and n-grams, into the RBM model trained on the final-round SELEX dataset to represent and compare newly generated sequences with in vitro candidates, 3) selecting in silico sequences with strong potential to bind to CspZ protein, 4) experimentally validating the selected in silico sequences of step 3. Our holistic approach combines biological insights with statistical models to improve the efficiency and outcome of the SELEX process. We enhance the RBM model, designed to replicate the distribution of the final SELEX round, by integrating geometric representations of sequences, which is especially advantageous when dealing with limited datasets relative to the vast sequence space. In addition, it provides in silico sequence candidates with strong binding properties.
bioinformatics2026-05-18v2petVAE: A Data-Driven Model for Identifying Amyloid PET Subgroups Across the Alzheimer's Disease Continuum
Tagmazian, A. A.; Schwarz, C.; Lange, C.; Pitkänen, E.; Vuoksimaa, E.Abstract
Amyloid-β (Aβ) PET imaging is a core biomarker and is sufficient for the biological diagnosis of Alzheimer's disease (AD). Here, we aimed to identify biologically meaningful subgroups across the continuum of Aβ accumulation using a data-driven deep learning approach, without imposing predefined thresholds for Aβ negativity or positivity. We analyzed 3,110 Aβ PET scans from the Alzheimer's Disease Neuroimaging Initiative and Anti-Amyloid Treatment in Asymptomatic Alzheimer's Disease studies to develop petVAE, a two-dimensional variational autoencoder. The model accurately reconstructed scans without prior labeling or selection by scanner or region of interest. Latent representations of scans extracted from petVAE were used to visualize and cluster the AD continuum. Clustering yielded four groups: two predominantly Aβ negative (Aβ-, Aβ-+) and two predominantly Aβ positive (Aβ+, Aβ++). All clusters differed significantly in standardized uptake value ratio (p < 1.64e-8) and cerebrospinal fluid (CSF) Aβ (p < 0.02), demonstrating petVAE's ability to assign scans along the Aβ continuum. Extreme clusters (Aβ-, Aβ++) resembled conventional Aβ negative and positive groups and differed in cognition, APOE ε4 prevalence, and Aβ and tau CSF biomarkers (p < 3e-6). Intermediate clusters (Aβ-+, Aβ+) showed higher odds of carrying at least one APOE ε4 allele versus Aβ- (p < 0.03). Participants in Aβ+ or Aβ++ clusters exhibited faster progression to AD (Aβ+ hazard ratio = 2.42, Aβ++ HR = 9.43; p < 1.17e-7). Thus, petVAE was capable of reconstructing PET scans while extracting latent features that capture the AD continuum and define biologically meaningful subgroups, enabling data-driven characterization of preclinical disease stages.
bioinformatics2026-05-18v2