Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
Sequence-based Drug-Target Binding Site Pre-training Enables Cryptic Pocket Detection and Improves Binding Affinity and Kinetics Prediction
Zhang, S.; Xie, L.; Tiourine, D.; Xie, L.
Abstract
Predicting protein-ligand binding characteristics, such as affinity and kinetics, is critical for accelerating drug discovery. However, many existing computational methods face key limitations, including insufficient integration of comprehensive databases, inadequate representation of protein structural dynamics, and incomplete modeling of microscale protein-ligand interactions. To address these challenges, we introduce ProMoNet, a sequence-based pre-training and fine-tuning framework to enhance the prediction of protein-ligand binding characteristics. ProMoNet leverages protein and molecular foundation models to expand data coverage and enhance diversity. It also introduces a pre-training strategy based on protein-ligand binding site prediction, which bridges protein- and ligand-level representations to support downstream prediction tasks involving protein-ligand complexes. Our pre-training module effectively models microscale protein-ligand interactions and captures the dynamic nature of proteins, including binding site crypticity, without relying on 3-dimensional structural inputs. Notably, this module surpasses or matches state-of-the-art structure-based methods in identifying exposed and cryptic binding sites while maintaining high efficiency. Our fine-tuning module then efficiently transfers the pre-trained knowledge to downstream tasks such as binding affinity and binding kinetics prediction, achieving superior performance. The combination of ProMoNet's strong performance and demonstrated efficiency across multiple tasks highlights its potential for broad applications in drug discovery.
bioinformatics 2026-05-18 v3
On the applicability domain of HADDOCK3 for protein-aptamer docking: documented failure modes from a 5x7 cross-target screening matrix and a 1676 aa receptor case study (P01031)
Dohi, E.
Abstract
We screened a 5-receptor x 7-aptamer = 35-cell cross-target screening matrix with HADDOCK3 under blind ambiguous-interaction-restraint (AIR) protocols on AlphaFold-modelled receptors. The 35-cell matrix is primarily a cross-target/decoy screening matrix rather than a 35-cognate-pair benchmark: it contains an n = 4 K_D-calibration subset under matched assay conditions, at least six biological cognate or intended-cognate cells, and the remaining cells are intentional non-target pairings used to characterise score-distribution behaviour. The screen surfaced 12 operationally distinct failure modes that collapse into five broad conceptual groups. The principal case study is P01031 (complement C5, 1676 aa, ≥ 12 structural domains): all seven panel members produced positive HADDOCK3 top-1 scores under a scale-adaptive AIR. Score-term decomposition locates the anomaly in the AIR term (+217 to +268 to top-1 score). With AIR zeroed, scores fall to -131 to -74 -- the small-receptor regime. Boltz-2 cofolding chain-pair ipTM (cpi_AB) is an independent channel: P01031 shows the lowest median cpi_AB (0.211; 0/7 above the 0.5 confident-interface threshold). To our knowledge, this is an early documented case study of a 1676 aa multi-domain receptor exhibiting this signature under a blind scale-adaptive AIR workflow -- an n = 1 mechanistic case, not a statistical generalisation. We adapt the QSAR applicability-domain concept to in silico aptamer screening. We report an empirical Mode 1 mitigation, a pLDDT-aware AIR prefilter, with cohort Jaccard recovery of ~10x. The n = 4 K_D-calibration Spearman ρ shift is reported as exploratory cross-method convergence, not as a calibration claim.
bioinformatics 2026-05-18 v2
Design of DNA Aptamers for Lyme disease Diagnosis Combining experimental and numerical approaches
Issouani, E. M.; Da Ponte, H.; Guerin, M.; Padiolleau-Lefevre, S.; Maffucci, I.; Davila Felipe, M.; GAYRAUD, G.
Abstract
Aptamers are single stranded DNA or RNA molecules selected for their high affinity and specificity to bind target molecules, similar to antibodies. They are commonly selected through the SELEX process, which involves the iterative exposure of a random sequence library to a target and retaining the sequences showing good binding properties. To improve Lyme disease detection, we propose designing aptamers that specifically bind to the CspZ protein on the surface of Borrelia burgdorferi, the bacterium responsible for the disease. Starting with a SELEX process consisting of thirteen rounds, from which selected in vitro sequence candidates have emerged, we aim to propose a holistic process that selects in silico new sequence candidates that are further validated experimentally. Our approach relies on 1) using Machine Learning (ML) techniques, specifically a Restricted Boltzmann Machine (RBM), to digitally replicate the last round of the SELEX process, 2) integrating insights from text analysis methods, such as word2vec and n-grams, into the RBM model trained on the final-round SELEX dataset to represent and compare newly generated sequences with in vitro candidates, 3) selecting in silico sequences with strong potential to bind to CspZ protein, 4) experimentally validating the selected in silico sequences of step 3. Our holistic approach combines biological insights with statistical models to improve the efficiency and outcome of the SELEX process. We enhance the RBM model, designed to replicate the distribution of the final SELEX round, by integrating geometric representations of sequences, which is especially advantageous when dealing with limited datasets relative to the vast sequence space. In addition, it provides in silico sequence candidates with strong binding properties.
bioinformatics 2026-05-18 v2
Ensemble Post-hoc Explainable AI for Multilead ECG: Identifying Disease-Relevant Features in Single-Lead Interpretations
Metsch, J.; Hempel, P.; Maurer, M. C.; Spicher, N.; Hauschild, A.-C.; Steinhaus, K. E.
Abstract
Despite the growing success of deep learning (DL) in multivariate time-series classification, such as 12-lead electrocardiography (ECG), widespread integration into clinical practice has yet to be achieved. The limited transparency of DL hinders clinical adoption, where understanding model decisions is crucial for trust and compliance with regulations such as the General Data Protection Regulation (GDPR) or the EU AI Act. To tackle this challenge, we implemented a state-of-the-art 1D-ResNet in PyTorch that was trained on the large-scale Brazilian CODE dataset to classify six different ECG abnormalities. We applied the model to the German PTB-XL dataset and evaluated its decision-making processes using 16 post-hoc explainable AI (XAI) methods. To assess the clinical relevance of the model's attributions, we conducted a Wilcoxon signed-rank test to identify features with significantly higher relevance for each XAI method. We used an ensemble majority vote approach to validate whether the model has learned clinically meaningful features for each abnormality. Additionally, a Mann-Whitney U test was employed to detect significant differences in relevance attributions between correctly and incorrectly classified ECGs. Overall, the model achieved sensitivity scores above 0.9 for most abnormalities in the PTB-XL dataset. However, our XAI analysis showed that the model struggled to capture clinically relevant features for some diseases. Certain XAI methods, including DeepLift, DeepLiftShap, and Occlusion, consistently highlighted clinically meaningful features across abnormalities, while others, such as LIME, KernelShap, and LRP, failed to do so. Moreover, some XAI methods demonstrated significant differences in attributions between correctly and incorrectly classified ECGs, highlighting their potential for enhancing model robustness and interpretability.
In conclusion, our findings underscore the importance of selecting suitable XAI methods tailored to specific model architectures and data types to ensure transparency and reliability. By identifying effective XAI techniques, this study contributes to closing the gap between DL advancements and their clinical implementation, paving the way for more trustworthy AI-driven healthcare solutions.
bioinformatics 2026-05-18 v2
CatIF-RL: Activity-Oriented Enzyme Sequence Design by Steered Inverse Protein Folding
Li, Y.; Xiong, J.; Zhang, Y.; Cai, T.; Fu, C.; Li, S.; Xu, W.; Lyu, R.; Chen, Z.; Guo, Z.; Gong, X.; Wang, F.
Abstract
Protein inverse folding models are designed to generate amino acid sequences compatible with a given backbone structure, but they are not explicitly optimized for specific biological functions. Here, we present CatIF-RL, a framework that steers a graph-based denoising diffusion inverse folding model toward designing enzyme variants with enhanced catalytic activity. CatIF-RL first adapts the inverse folding model to enzyme structural data, then introduces activity-oriented preference signals using predicted catalytic constant (kcat) as the optimization objective, enabling specialization through generative dataset curation and group-relative policy optimization (GRPO). This process iteratively shifts the sequence distribution toward higher predicted kcat while constraining sequence divergence to sequences that remain compatible with the input structure. On the independent benchmark, CatIF-RL achieves an approximately four-fold increase in predicted kcat relative to native enzymes, substantially outperforming representative inverse folding methods, while maintaining sequence recovery (0.55) and structural fidelity, and supporting motif-preserving partial sequence design. CatIF-RL establishes a practical framework for activity-oriented enzyme design and provides a generalizable strategy for steering structure-conditioned protein generation toward functional optimization.
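The group-relative step in GRPO can be illustrated with a minimal sketch: rewards for several sequences sampled for the same backbone are normalized against their own group mean and standard deviation. The function name, the reward values, and the choice of population standard deviation are assumptions made for illustration; the abstract does not describe CatIF-RL's actual implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same input.

    Each advantage is (reward - group mean) / (group std + eps), so samples
    are scored relative to their siblings rather than on an absolute scale.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical predicted kcat scores for four sequences sampled for one backbone.
predicted_kcat = [1.0, 2.0, 3.0, 4.0]
advantages = group_relative_advantages(predicted_kcat)
```

Sequences scoring above the group mean receive positive advantages and are reinforced; those below the mean are discouraged.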
bioinformatics 2026-05-18 v2
Systematic cross-study assessment of RNA-Seq experimental workflows for plasma cell-free transcriptome profiling
Tuni, C.; Asole, G.; Monteagudo-Mesas, P.; Rusu, E. C.; Cabus, L.; Gonzalez, L.; Sanchez, L.; Neto, B.; Sanders, P.; Weber, M.; Lagarde, J.
Abstract
Plasma cell-free RNA (cfRNA) is a promising source of non-invasive biomarkers, but its clinical translation is hindered by technical challenges and a lack of protocol standardization, which compromises reproducibility and comparability across studies. There is a need for a systematic evaluation of existing cfRNA-Seq workflows to understand the drivers of technical variability. Here, we address this gap by performing a comprehensive cross-study analysis of 2,1666 cfRNA-Seq samples from 15 published studies and an in-house generated dataset, applying a uniform bioinformatics pipeline to enable a controlled comparison of experimental workflows. Our analysis reveals that the donor phenotype typically explains a negligible fraction of the transcriptomic variation, whose main determinants are technical -- principally protocol choice, genomic DNA contamination levels and library diversity. Remarkably, this technical noise is so profound that variation within plasma cfRNA samples exceeds that found across a wide range of human tissues. Furthermore, we demonstrate that critical pre-analytical factors are often confounded with patient phenotypes, jeopardizing the validity of biomarker discovery efforts. Finally, we identify a 100 bp fragment-length threshold as a vital requirement for reliable cfRNA-based taxonomic profiling. Our work serves as a comprehensive benchmark of current cfRNA-Seq methodologies and provides evidence-based guidelines to improve experimental design. By highlighting the dominance of controllable technical factors, we offer a path towards more robust and reproducible cfRNA research.
bioinformatics 2026-05-18 v2
petVAE: A Data-Driven Model for Identifying Amyloid PET Subgroups Across the Alzheimer's Disease Continuum
Tagmazian, A. A.; Schwarz, C.; Lange, C.; Pitkänen, E.; Vuoksimaa, E.
Abstract
Amyloid-β (Aβ) PET imaging is a core biomarker and is sufficient for the biological diagnosis of Alzheimer's disease (AD). Here, we aimed to identify biologically meaningful subgroups across the continuum of Aβ accumulation using a data-driven deep learning approach, without imposing predefined thresholds for Aβ negativity or positivity. We analyzed 3,110 Aβ PET scans from the Alzheimer's Disease Neuroimaging Initiative and the Anti-Amyloid Treatment in Asymptomatic Alzheimer's Disease studies to develop petVAE, a two-dimensional variational autoencoder. The model accurately reconstructed scans without prior labeling or selection by scanner or region of interest. Latent representations of scans extracted from petVAE were used to visualize and cluster the AD continuum. Clustering yielded four groups: two predominantly Aβ negative (Aβ-, Aβ-+) and two predominantly Aβ positive (Aβ+, Aβ++). All clusters differed significantly in standardized uptake value ratio (p < 1.64e-8) and cerebrospinal fluid (CSF) Aβ (p < 0.02), demonstrating petVAE's ability to assign scans along the Aβ continuum. Extreme clusters (Aβ-, Aβ++) resembled conventional Aβ negative and positive groups and differed in cognition, APOE ε4 prevalence, and Aβ and tau CSF biomarkers (p < 3e-6). Intermediate clusters (Aβ-+, Aβ+) showed higher odds of carrying at least one APOE ε4 allele versus Aβ- (p < 0.03). Participants in Aβ+ or Aβ++ clusters exhibited faster progression to AD (Aβ+ hazard ratio = 2.42, Aβ++ HR = 9.43; p < 1.17e-7). Thus, petVAE was capable of reconstructing PET scans while extracting latent features that capture the AD continuum and define biologically meaningful subgroups, enabling data-driven characterization of preclinical disease stages.
bioinformatics 2026-05-18 v2
Metabarcode and transcriptome datasets of Pinus sylvestris to assess fungal phyllosphere and disease dynamics.
Moore, B.; Perry, A.; Kaur, S.; Crampton, B.; Gurung, A.; Beaton, J.; Smith, V. A.; Morris, J.; Hedley, P. E.; Nemeth, K.; Barber, H.; Cavers, S.; Jones, S.
Abstract
Understanding how host-microbiome interactions influence tree disease is critical for understanding forest resilience. Here, we present foliar microbiome ITS2 metabarcoding and transcriptomic datasets from Pinus sylvestris to investigate susceptibility to Dothistroma needle blight (DNB), a globally important foliar disease caused by Dothistroma septosporum. We hypothesised that host genotype shapes foliar microbial communities and their interactions, thereby influencing disease outcomes. Samples were collected from a progeny-provenance field trial in the south of Scotland representing a broad spectrum of disease susceptibilities. The dataset comprises ITS2 metabarcoding samples from 200 genotypes across three timepoints and RNAseq samples from 48 genotypes across two timepoints. Sampling captured key stages of pathogen exposure and disease progression. Both standardised and bespoke protocols were used for nucleotide extraction, sequencing, and quality control, including multiple negative and positive controls. These datasets, available in the European Nucleotide Archive (project accession PRJEB88228), enable analysis of temporal dynamics in foliar fungal communities, host-microbiome transcriptional responses, and genotype-dependent variation in disease susceptibility.
bioinformatics 2026-05-18 v1
Discriminative learning of substitution matrices and gap penalties for pairwise alignment of biological sequences
Ciach, M. A.; Zacharopoulou, E.; Startek, M. P.; Miasojedow, B.; Alexiou, P.
Abstract
Pairwise alignment scores are used to classify pairs of sequences in many areas of bioinformatics, including homology search, predicting interactions, or read mapping. The relative scores of different pairs strongly depend on the choice of a substitution matrix and gap penalties, but the existing approaches for the estimation of these parameters do not directly optimize them for the task of classification. In this work, we present DiscrimAlign, a statistical model for discriminative learning of substitution matrices and gap penalties from a dataset of positive and negative pairs of unaligned biological sequences. The model links the alignment score of a sequence pair with the associated binary label through a logistic function and learns the parameters by likelihood maximization. We analyze theoretical properties of the model, derive and implement a learning procedure, study its performance in simulated experiments, and apply it to predict microRNA-target interactions. We show that sequence alignment with discriminative substitution matrices and gap penalties predicts the interactions comparably to state-of-the-art neural network classifiers while being more interpretable. An implementation of the model and reproducibility workflows are available at https://github.com/BioGeMT/DiscrimAlign.
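The link between an alignment score and a binary label can be sketched as follows. This is a minimal illustration and not the DiscrimAlign code: it assumes global (Needleman-Wunsch) alignment with a linear gap penalty, a plain sigmoid with an additive bias, and a toy match/mismatch matrix, none of which are specified in the abstract.

```python
import math


def alignment_score(x, y, sub, gap):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [i * gap]
        for j in range(1, len(y) + 1):
            cur.append(max(
                prev[j - 1] + sub[(x[i - 1], y[j - 1])],  # align x[i-1] with y[j-1]
                prev[j] + gap,                            # gap in y
                cur[j - 1] + gap,                         # gap in x
            ))
        prev = cur
    return prev[-1]


def log_likelihood(pairs, labels, sub, gap, bias=0.0):
    """Bernoulli log-likelihood of the labels under a logistic link on the score."""
    total = 0.0
    for (x, y), label in zip(pairs, labels):
        p = 1.0 / (1.0 + math.exp(-(alignment_score(x, y, sub, gap) + bias)))
        total += math.log(p) if label == 1 else math.log(1.0 - p)
    return total


# Toy parameters (assumed): +1 for a match, -1 for a mismatch, -1 per gap.
alphabet = "ACGU"
sub = {(a, b): (1.0 if a == b else -1.0) for a in alphabet for b in alphabet}
pairs = [("ACGU", "ACGU"), ("ACGU", "UUUU")]
labels = [1, 0]
ll = log_likelihood(pairs, labels, sub, gap=-1.0)
```

Discriminative learning would then adjust the entries of `sub` and the value of `gap` to maximize this log-likelihood over a labeled dataset; the optimization procedure itself is not shown here.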
bioinformatics 2026-05-18 v1
A Multimodal Neural Network Model for Early Recurrence Prediction in Lung Adenocarcinoma
Patricoski-Chavez, J. A.; Hayek, K.; Singh, R.; Azzoli, C. G.; Warner, J. L.; Gamsiz Uzun, E. D.
Abstract
Lung adenocarcinoma (LUAD), a subtype of non-small cell lung cancer (NSCLC), is the most common primary lung cancer worldwide. Despite advancements in early detection and treatment, up to 39% of patients develop recurrent tumors following complete resection. Currently, no widely available models exist for reliably predicting early recurrence of LUAD, which is a significant prognostic factor of post-recurrence survival. Models leveraging deep learning (DL) techniques have demonstrated notable utility in cancer recurrence prediction, particularly when used in combination with both clinical and genomic data. We developed a DL-based model, Predicting Lung Adenocarcinoma recurrence via Selective Multimodal Attention (PLASMA), to predict early recurrence using clinical, mRNA expression, and mutation data from patients with primary stage I-III LUAD. Trained on The Cancer Genome Atlas (TCGA) dataset, PLASMA outperformed traditional machine learning models in predicting early recurrence in both the TCGA test set and an external validation set (TRACERx Lung), achieving area under the receiver operating characteristic curve (AUROC) scores of 85.0% and 76.5%, respectively. Our results support the potential of multimodal DL for early LUAD recurrence prediction and risk stratification.
bioinformatics 2026-05-18 v1
Learning Chirality-Aware Representations to Predict Drug Side Effect Frequencies
Galeano, A.; Dutra, I.; Ferreyra, S.; Paccanaro, A.
Abstract
Ab initio prediction of side effect frequencies is important for assessing the risk-benefit profile of drugs and for identifying potential adverse effects early in development. A key challenge is chirality: many drugs exist as enantiomers, pairs of molecules with the same atoms and bond connectivity but different three-dimensional arrangements. Although chemically similar, enantiomers can interact differently with biological targets and therefore exhibit distinct efficacy and adverse-effect profiles. Here we introduce F2S (Features to Signatures), a method to predict the frequencies of drug side effects while explicitly accounting for chirality. Drug representations are learned directly from chemical structure using a directed-bond message-passing graph neural network that captures stereochemical configurations. Side effect representations are derived from curated textual descriptions encoded with a frozen PubMedBERT model. Side effect frequencies are predicted from the dot product between drug and side effect signatures together with biases for drugs and side effects. We evaluated F2S extensively across multiple settings, including cold-start and warm-start prediction, prospective evaluation, and scenarios controlling for chemical similarity between training and test drugs. Across these evaluations, F2S achieves performance comparable to state-of-the-art methods for general side-effect frequency prediction while producing fewer false positives and substantially improving the prediction of frequency differences between enantiomer pairs. Finally, F2S learns compact 10-dimensional signatures that support interpretability: drug signatures reflect therapeutic class and shared targets, side-effect signatures capture phenotype similarity, and the learned bias terms correlate with the popularity of drugs and side effects.
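The scoring rule described in this abstract, a dot product of signatures plus two bias terms, can be written as a short sketch. The function name and the numbers are hypothetical, three dimensions are used instead of the paper's ten for brevity, and any output scaling or link function that F2S may apply is not covered.

```python
def predict_frequency(drug_sig, effect_sig, drug_bias, effect_bias):
    """Score = <drug signature, side-effect signature> + drug bias + effect bias."""
    dot = sum(d * e for d, e in zip(drug_sig, effect_sig))
    return dot + drug_bias + effect_bias


# Hypothetical learned values for one drug and one side effect.
drug_signature = [0.5, 1.0, 0.0]
effect_signature = [2.0, 1.0, 3.0]
score = predict_frequency(drug_signature, effect_signature,
                          drug_bias=0.25, effect_bias=0.5)
```

Two enantiomers with different signatures would receive different scores for the same side effect, even though their bond connectivity is identical.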
bioinformatics 2026-05-18 v1
Stereochemistry-Aware Drug-Target Affinity Prediction
Ferreyra, S.; Dutra, I.; Galeano, A.; Paccanaro, A.
Abstract
Drug-target affinity (DTA) prediction is a key task in drug discovery, enabling the estimation of the interaction strength between candidate compounds and biological targets. However, current models rely on connectivity-based molecular representations and do not explicitly account for the spatial organization of atoms, known as stereochemistry. This limitation becomes evident when considering chirality, where a drug can exist as enantiomers, i.e., molecules that share the same atoms and bonds but differ in their three-dimensional arrangement. Despite their chemical similarity, enantiomers can interact differently with the same target, leading to variations in binding affinity and biological activity. In this paper, we propose a stereochemistry-aware DTA prediction framework that incorporates this information into molecular representations. Drug representations are learned from chemical structure using a directed-bond message-passing graph neural network that captures enantiomer configurations, while protein targets are represented through sequence-based embeddings. Experiments on the Davis dataset demonstrate that our model can improve affinity prediction. Importantly, a case study on a manually curated dataset of enantiomers with different biological action shows that the model is able to distinguish the affinities of the two forms, consistent with their experimentally observed biological activity. These findings support the relevance of stereochemistry-aware molecular representation for more accurate and chemically faithful DTA prediction.
bioinformatics 2026-05-18 v1
HiCPEP: Efficient estimation of chromatin compartment PC1 from Hi-C covariance structure
Cheng, Z.-R.; Chang, J.-M.
Abstract
Principal component analysis (PCA) of the Hi-C Pearson correlation matrix is the standard approach for identifying A/B chromatin compartments. Despite its widespread use, the relationship between the first principal component (PC1) and the underlying compartment structure remains insufficiently characterized, and computing PC1 can become computationally expensive for high-resolution Hi-C data. Here we investigate the role of the PC1 explained variance ratio in compartment analysis and show that chromosomes with strong compartment organization typically exhibit a dominant PC1 signal. Based on this observation, we propose HiCPEP, a heuristic algorithm that estimates the sign pattern and relative magnitude of PC1 directly from the Hi-C Pearson covariance matrix without performing explicit eigenvector decomposition. The method can operate from either a dense Pearson matrix for fast approximation or a sparse observed/expected (O/E) matrix to reduce memory usage. Furthermore, because many covariance columns exhibit PC1-like patterns when the compartment signal is strong, HiCPEP can be accelerated using random sampling without substantially reducing accuracy. Across multiple Hi-C datasets, HiCPEP consistently recovered compartment patterns with high similarity to reference PC1 vectors produced by standard PCA-based methods. Benchmark experiments show that HiCPEP achieves comparable accuracy while reducing computational cost in terms of runtime or memory usage. These results suggest that HiCPEP provides a practical alternative for efficient chromatin compartment analysis from large-scale Hi-C datasets. The HiCPEP implementation is freely available at https://github.com/ZhiRongDev/HiCPEP.
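The observation that covariance columns look like PC1 when one component dominates can be checked on a toy matrix. The sketch below is not the HiCPEP algorithm: the matrix is synthetic, and picking the column with the largest norm is a heuristic assumed here only for illustration. It builds a covariance matrix that is rank one plus a small diagonal term, obtains PC1 by power iteration, and compares it with a single normalized column.

```python
import math


def power_iteration(matrix, iters=200):
    """Leading eigenvector of a symmetric matrix (reference PC1)."""
    n = len(matrix)
    vec = [1.0 / (i + 1) for i in range(n)]
    for _ in range(iters):
        nxt = [sum(row[j] * vec[j] for j in range(n)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec


def column_estimate(matrix):
    """Use the covariance column with the largest norm as a PC1 proxy."""
    n = len(matrix)
    cols = [[matrix[i][j] for i in range(n)] for j in range(n)]
    best = max(cols, key=lambda c: sum(x * x for x in c))
    norm = math.sqrt(sum(x * x for x in best))
    return [x / norm for x in best]


def cosine(a, b):
    """Cosine similarity of two unit vectors."""
    return sum(x * y for x, y in zip(a, b))


# Synthetic compartment-like signal: positive entries ~ A, negative entries ~ B.
signal = [1.0, 0.8, -0.9, -1.1, 0.7]
cov = [[signal[i] * signal[j] + (0.05 if i == j else 0.0)
        for j in range(len(signal))] for i in range(len(signal))]

pc1 = power_iteration(cov)
estimate = column_estimate(cov)
similarity = cosine(pc1, estimate)
```

Because an eigenvector is defined only up to sign, the absolute value of `similarity` is the meaningful quantity; the A/B labels follow from the sign pattern after fixing an orientation.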
bioinformatics 2026-05-18 v1
HESTIA: Scalable Multimodal Integration of Histology and High-Resolution Spatial Transcriptomics for Robust Spatial Domain Identification
Zhong, Z.; Zhu, X.; Guo, J.; Liao, S.; Chen, A.
Abstract
Spatial omics has revolutionized molecular biology by providing invaluable insights into how native tissue microenvironments regulate cellular functions and disease mechanisms. Accurately capturing this structural complexity and decoding the underlying biological processes requires effectively integrating data from multiple modalities. However, transitioning to subcellular resolutions introduces massive data scales and severe transcriptomic sparsity, which challenge current analytical frameworks. To address this, we present HESTIA (Histology-Enhanced Scalable cross-Resolution inTegration for spatial trAnscriptomics), a highly efficient multimodal algorithm designed for identifying spatial domains in large-scale, high-resolution spatial omics data. By circumventing memory-intensive computations, HESTIA processes massive datasets that existing algorithms fail to handle because of memory constraints. HESTIA outperforms current multimodal methods in clustering accuracy and spatial continuity, accurately delineating fine structural boundaries. Furthermore, applying HESTIA to large-scale pathological samples successfully dissects clinically relevant intratumoral heterogeneity and maps distinct immune microenvironments in lung and colorectal cancers.
bioinformatics 2026-05-18 v1
Combining amino acid frequency and 1D convolutional neural network embeddings for the identification of protein-protein interactions using a random forest classifier
Sindhi, N. A.; Pawar, N.; Dixson, J.; Garcia, D.
Abstract
Predicting protein-protein interactions is a fundamental problem in molecular biology. Experimental approaches for identifying protein-protein interactions are time-consuming and labor-intensive, motivating the development of efficient computational alternatives, including machine learning-based methods. However, conventional machine learning methods often rely on manually engineered features that require substantial domain expertise. In this study, we propose a two-stage framework to address these limitations. In the first stage, a one-dimensional convolutional neural network autoencoder is used to automatically learn latent representations from protein sequences. The quality of these features is evaluated through reconstruction error, reflecting how accurately the model reconstructs the original sequence. In the second stage, these learned features are combined with amino acid frequency-based features to form a hybrid feature set for predicting protein-protein interactions. A systematic comparison is performed between models trained on frequency features alone and those using a hybrid representation. The comparison showed that incorporating one-dimensional convolutional neural network-derived latent features improved the models' performance in predicting protein-protein interactions. The dataset was split into training, validation, and test sets. Nested cross-validation was employed, with inner loops for hyperparameter tuning and outer loops for model selection. The random forest classifier achieved the best performance, with a mean receiver operating characteristic-area under curve of 0.91 and a test F1-score of 0.87. These results highlight the effectiveness of integrating deep feature learning with ensemble methods for predicting protein-protein interactions and build upon previous work focused on this fundamental problem.
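The hybrid feature set can be sketched as a concatenation of amino acid frequencies with latent vectors. In this minimal illustration the function names, the handling of non-standard residues, and the concatenation order are assumptions; the latent vectors stand in for the autoencoder output, which is not reproduced here.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aa_frequencies(sequence):
    """Relative frequency of each standard amino acid in a protein sequence."""
    counts = Counter(sequence.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


def pair_features(seq_a, seq_b, latent_a=(), latent_b=()):
    """Hybrid feature vector for a protein pair: frequencies plus latent features."""
    return (aa_frequencies(seq_a) + list(latent_a)
            + aa_frequencies(seq_b) + list(latent_b))


# Hypothetical sequences and placeholder latent vectors.
features = pair_features("MKTAYIAK", "GAVLLK",
                         latent_a=[0.1, -0.3], latent_b=[0.4, 0.2])
```

A vector like `features` would then be passed to a classifier such as a random forest; with empty latent vectors the same function yields the frequency-only baseline.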
bioinformatics 2026-05-18 v1
Mantis-Delta: Mass-Action Network Theory and Steady-State Characterization for Chemical Reaction Networks
Venegas Hernandez, E. A.
Abstract
Chemical Reaction Network Theory (CRNT), developed by Horn, Jackson, and Feinberg, provides parameter-free structural theorems that constrain the asymptotic dynamics of mass-action systems irrespective of the numerical values of the rate constants. Despite the maturity of the theory, modern open-source implementations that combine CRNT structural analysis with symbolic ordinary differential equation (ODE) construction and robust numerical steady-state finding remain scarce. We present mantis-delta, a pure Python library that ingests human-readable reaction strings, builds the complex reaction graph, computes the deficiency δ = n − ℓ − s and weak reversibility, and decides applicability of the Deficiency Zero Theorem (DZT) and Deficiency One Theorem (D1T). For systems satisfying these structural conditions, mantis-delta certifies, without any simulation whatsoever, existence, uniqueness and (for DZT) asymptotic stability of the positive steady state in every stoichiometric compatibility class. When the structural theorems do not apply, the library provides symbolic mass-action ODEs and Jacobians via SymPy and a hybrid numerical solver that combines stiff implicit integration with bound-constrained algebraic least-squares to locate both stable and unstable fixed points, including Hopf bifurcation centres inaccessible to forward integration. We demonstrate the workflow on six benchmarks: a reversible isomerisation, the Michaelis-Menten enzyme mechanism, the closed and chemostatted Brusselator, a catalytic hairpin assembly (CHA) miR-21 biosensor, and the Goldbeter-Koshland zero-order ultrasensitivity switch. In each case, the CRNT-predicted qualitative behaviour (monostability, oscillation, uniqueness) is recovered numerically with a residual below 10^-6 M s^-1, and the Goldbeter-Koshland dose-response curve agrees with the closed-form quasi-steady-state approximation to within 1% over a 400x kinase/phosphatase activity scan.
mantis-delta is open-source (MIT license) and available at https://github.com/EmilioVenegas/mantis
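The deficiency itself is simple to compute from the reaction list: it is the number of distinct complexes n, minus the number of linkage classes ℓ (connected components of the complex graph), minus the rank s of the stoichiometric vectors. The sketch below is independent of mantis-delta and does not use its API; the reaction string syntax (`A + 2 B -> C`, with reversible steps written as two reactions) and all function names are assumptions.

```python
from fractions import Fraction


def parse_complex(text):
    """Turn 'A + 2 B' into a sorted tuple of (species, coefficient) pairs."""
    counts = {}
    for term in text.split("+"):
        parts = term.split()
        if not parts or parts == ["0"]:  # '0' denotes the empty complex
            continue
        coeff, species = (int(parts[0]), parts[1]) if len(parts) == 2 else (1, parts[0])
        counts[species] = counts.get(species, 0) + coeff
    return tuple(sorted(counts.items()))


def rank(rows):
    """Exact matrix rank by Gauss-Jordan elimination over the rationals."""
    rows = [[Fraction(x) for x in r] for r in rows]
    rk = 0
    for col in range(len(rows[0]) if rows else 0):
        pivot = next((i for i in range(rk, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[rk], rows[pivot] = rows[pivot], rows[rk]
        for i in range(len(rows)):
            if i != rk and rows[i][col] != 0:
                factor = rows[i][col] / rows[rk][col]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[rk])]
        rk += 1
        if rk == len(rows):
            break
    return rk


def deficiency(reactions):
    """delta = n - l - s for reactions written as 'reactants -> products'."""
    edges = [tuple(parse_complex(side) for side in r.split("->")) for r in reactions]
    complexes = sorted({c for edge in edges for c in edge})
    species = sorted({sp for c in complexes for sp, _ in c})

    # Linkage classes: connected components of the complex graph (union-find).
    parent = {c: c for c in complexes}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for source, target in edges:
        parent[find(source)] = find(target)
    linkage_classes = len({find(c) for c in complexes})

    # Stoichiometric vectors: product complex minus reactant complex.
    vectors = []
    for source, target in edges:
        src, tgt = dict(source), dict(target)
        vectors.append([tgt.get(sp, 0) - src.get(sp, 0) for sp in species])

    return len(complexes) - linkage_classes - rank(vectors)


michaelis_menten = ["E + S -> ES", "ES -> E + S", "ES -> E + P"]
delta_mm = deficiency(michaelis_menten)
```

For the Michaelis-Menten mechanism this gives n = 3 complexes, ℓ = 1 linkage class and s = 2, so the deficiency is zero. Exact rational arithmetic avoids the tolerance choices that a floating-point rank computation would need.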
bioinformatics 2026-05-18 v1
GeneFior: A back to basics and transparent multi-tool approach to sequence detection
Dimonaco, N. J.; Lawther, K.
Abstract
The detection of sequences of interest, such as antimicrobial resistance genes, directly from genomic and metagenomic sequencing data has become routine, enabled by curated reference databases and rapid in silico sequence search tools. Yet most workflows depend on prior assembly, an inherently lossy process in which a substantial proportion of reads fail to assemble or are collapsed into consensus sequences, causing low-abundance variants and nucleotide-level diversity to be systematically obscured. The tools used to interrogate the resulting assemblies compound this further, clustering reference sequences at arbitrary identity thresholds, imposing hidden parameter defaults, and reducing intermediate alignment evidence to summarised outputs that cannot be critically evaluated or reproduced. Here we present GeneFior, a transparent, multi-tool workflow integrating BLAST, DIAMOND, Bowtie2, BWA, and Minimap2 to search both DNA and protein sequences against any user-supplied reference database. By enforcing gene-centric identity and coverage thresholds at both the read and gene level, GeneFior reduces false positives while retaining sensitivity to genuine, low-abundance variants, including those differing at single-nucleotide resolution. Crucially, by exposing all alignment parameters, preserving intermediate outputs, and generating cross-tool consensus detection matrices, GeneFior makes the influence of tool choice, database selection, and parameter configuration on reported gene profiles directly observable and reproducible.
bioinformatics 2026-05-18 v1
Rescuing true protein binders from AI hallucinations via zero-shot, ensemble-driven statistical physics scoring
Chou, C.-H.; Hong, X.; Xu, J.
Abstract
The advancement of deep generative models has facilitated de novo protein and antibody design, yet translation to experimental success is hindered by a high generation rate of structural decoys. Current affinity predictors and standard structural confidence metrics fail to reliably distinguish these AI hallucinations from true binders. Here, we present Sipobe-PPA, an affinity ranking framework that conceptualizes interacting protein interfaces as pseudo-ligands, evaluating them through an AI-driven statistical physics forcefield. Because this forcefield is trained exclusively on small-molecule interactions, Sipobe-PPA acts as a zero-shot physical evaluator for protein-protein interfaces, protecting the framework against the data leakage and memorization pitfalls that affect models trained directly on protein complex datasets. To capture the structural plasticity of binding interactions, Sipobe-PPA employs a conformational ensemble strategy, computing interaction scores across multiple AlphaFold3 (AF3)-predicted structural states. Benchmarking on decoy-rich de novo datasets, including Bindcraft, Boltzgen, and the Germinal antibody dataset, demonstrates the significant improvement offered by this approach. In a real-world pipeline scenario simulating wet-lab constraints (pre-filtered by AF3 ipTM > 0.8 and pLDDT > 80), Sipobe-PPA achieved an 80% Hit Rate within its Top 5 predictions across the combined dataset, compared to 0% for physical baselines like Rosetta-dG. Notably, our structural ensemble averaging outperformed single-structure scoring, highlighting the necessity of modeling prediction diversity. By maximizing top-tier hit rates across diverse nanobody and de novo targets, Sipobe-PPA provides a scalable screening paradigm that bridges the gap between computational generation and wet-lab viability.
bioinformatics2026-05-18v1Evaluating open LLMs for agentic analysis orchestration in a typical biomedical lab
Nekrutenko, A.Abstract
Agentic tools - software environments where a large language model plans, calls external tools, executes code, and iterates with minimal human intervention - will run a substantial share of routine biomedical data analysis within the next few years. However, per-call inference cost on frontier models is the bottleneck and can add up quickly. Here, we tested whether a free, locally-runnable open-weight model could take over the repetitive execution steps at frontier accuracy. We used Claude's Opus to author plans of increasing detail for per-sample variant calling, and ran six 2026-release open-weight implementer LLMs against those plans on a set of desktop GPUs. qwen3.6:27b reproduced frontier accuracy on every plan and matched Opus cell-for-cell on a 36-cell error-injection matrix. A sub-$2,000 Jetson or Apple Mac Mini sufficed for the implementer side. The open-weight model landscape evolves on the order of months, so the specific implementer recommended here will be superseded; we provide the plans, harness, scoring code, and per-cell artifacts at https://github.com/nekrut/LLM-eval-paper as a framework for re-evaluating future models.
bioinformatics2026-05-18v1Manchester Proteome Profiler: A User-Friendly Platform for Quantitative Proteomic Analysis
Cain, S. A.; Fatima, M.; Humphries, M.Abstract
Manchester Proteome Profiler (MPP) is an open-source R Shiny application that streamlines downstream analysis of quantitative proteomic data. Compatible with grouped protein intensity tables from MaxQuant, FragPipe, Proteome Discoverer and other custom layouts, MPP provides an integrated platform for filtering, normalisation, imputation, differential expression analysis and cluster analysis across user-chosen experimental conditions. MPP supports both single- and dual-dataset comparisons, incorporates SAINTexpress for affinity purification and proximity labelling experiments, and links clusters from the significant protein list to functional enrichment and interaction networks via Gene Ontology, BioGRID and STRING. Benchmarking with a KRAS proximity biotinylation dataset demonstrated the ability of MPP to identify reproducible clusters of differentially expressed proteins and reveal biologically meaningful patterns, including enrichment of solute carrier transporters and adhesion molecules. With interactive visualisations, customisable reports, and support for complex experimental designs, MPP offers a novel, versatile and user-friendly environment for proteomic data exploration and hypothesis generation.
bioinformatics2026-05-18v1Elab2ARC: A Browser-Based Workspace for Converting Free-Text Protocols into rich FAIR digital objects
Zander, S.; Zhou, X.-R.; Kranz, A.; Dumschott, K.; Rocca-Serra, P.; Weil, H. L.; Tschoepke, M.; Muehlhaus, T.; Von Suchodoletz, D.; Usadel, B.Abstract
Electronic laboratory notebooks (ELNs) are widely used in the life sciences, but their notebook format limits machine-readability and FAIR compliance. Consequently, researchers often spend significant manual effort restructuring ELN records into publication-ready outputs. We present elab2ARC, a browser-based workspace that automates the conversion of open-source eLabFTW records into Annotated Research Contexts (ARCs) - version-controlled, ISA-compliant research objects. Using the eLabFTW API, elab2ARC retrieves administrative metadata, protocols, and attachments, reorganising them into ISA-compliant tables and linked datasets. All processing occurs client-side, ensuring user data control before submission to the PLANTdataHUB repository. An optional LLM-assisted workflow extracts structured metadata from free-text protocols, providing editable drafts while preserving human oversight. Designed for use at project completion, elab2ARC reuses existing ELN documentation without disrupting daily laboratory practice. It offers a practical route to FAIR-aligned sharing, publication, and long-term archiving of life-science experimental records.
bioinformatics2026-05-18v1Interpretable Predictive Modeling for Medical Data Using Boolean Rule-aware Regression
Eskandarian, M.; Malekpour, S. A.Abstract
Purpose: In clinical practice, accurate prediction of disease risk must be accompanied by transparent, human-understandable explanations to support diagnostic confidence, guide therapeutic decisions, and meet ethical and regulatory standards. While deep neural networks achieve high predictive performance in tasks such as cancer detection and diabetes risk stratification, their black-box nature prevents clinicians from understanding the reasoning behind predictions, severely limiting trust and safe integration into patient care. Methods: We present Regression-Based Boolean Rule (RBBR), a framework that automatically derives clinically interpretable Boolean rules directly from patient data. RBBR generates human-readable conjunctions (logical AND combinations) of up to three clinical features, transforms them into inputs for ridge regression to predict binary or multi-class disease outcomes, estimates rule importance via regularized coefficients, and selects the most parsimonious and predictive rule sets using the Bayesian Information Criterion. Results: Applied to six real-world medical datasets (lung cancer screening and staging, Wisconsin and diagnostic breast cancer, heart failure, and early-stage diabetes risk), RBBR consistently produced concise, clinically meaningful rules - e.g., gender-specific symptom combinations in diabetes, distinct histopathological subpopulations in breast cancer, and symptom-risk factor interactions in lung cancer - with strong explanatory power (R2 up to 0.92) and competitive discrimination. Conclusion: By delivering logical, transparent decision rules aligned with clinical reasoning (if symptom A and B, then high risk), RBBR bridges the gap between predictive accuracy and bedside usability, enabling clinicians to validate predictions, identify high-risk patients, stratify subpopulations, and enhance shared decision-making in routine care.
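The rule-construction step described in this abstract (conjunctions of up to three clinical features) can be sketched as follows. This is a minimal illustration only: the function name and the patient features are hypothetical, binary 0/1 features are assumed, and the ridge regression and Bayesian Information Criterion selection steps are not shown.

```python
from itertools import combinations

def conjunction_features(row, max_order=3):
    """Build all AND-combinations of up to max_order binary features.

    row: dict mapping feature name -> 0 or 1 (binary encoding is an assumption).
    Returns a dict mapping a readable rule string to its 0/1 value for this row.
    """
    names = sorted(row)
    rules = {}
    for order in range(1, max_order + 1):
        for combo in combinations(names, order):
            # A conjunction is satisfied only if every member feature is 1.
            rules[" AND ".join(combo)] = int(all(row[name] for name in combo))
    return rules

# Hypothetical patient record with four binary features.
patient = {"cough": 1, "fatigue": 1, "smoker": 0, "wheezing": 1}
rules = conjunction_features(patient)

print(len(rules))
for rule, value in rules.items():
    print(f"{rule}: {value}")
```

With p input features, this produces C(p,1) + C(p,2) + C(p,3) candidate rules, so the rule matrix grows quickly with p; that growth is one reason a regularized fit and a parsimony criterion are natural companions to this kind of expansion.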
bioinformatics2026-05-18v1The Paipu framework enables creation of a large-scale mammalian cancer transcriptomics atlas
Smith, B. S.; Smith, L. A.; Lee, J.-H.; Cahill, J. A.; Graim, K.Abstract
A plethora of studies have identified shared molecular mechanisms involved in tumor development across humans and other mammalian species. While these two-species analyses advance understanding of human disease, extending them across many species would provide evolutionary insight into molecular mechanisms driving human cancers. However, this expansion requires knowledge transfer and harmonization across species. Genomic differences between species, including variation in genome annotation quality, have historically hindered multi-species large-scale atlas creation. To overcome these challenges, we present Paipu, a comprehensive pipeline designed to streamline querying, preprocessing, harmonization, and retrieval of large-scale RNA-seq data and associated metadata from the NCBI Sequence Read Archive (SRA). Paipu facilitates multi-species analysis by creating a harmonized atlas from user-defined search terms and species. It consists of three components: reference genome preparation, SRA metadata retrieval, and RNA-seq data processing. We apply Paipu to 188 cancer-related terms in 239 non-human mammalian species, creating a harmonized atlas of 3,484 RNA-seq samples spanning 17 species and 35 cancers. This pan-mammalian pan-cancer atlas enables myriad comparative genomics analyses that leverage genetic variation to better understand rare human cancers. As such, Paipu serves as a resource for cross-species cancer genomics and supports atlas creation for any set of species and search terms.
bioinformatics2026-05-18v1Nutritional-Metabolic Lipid Profiling with LipidOne for plasma lipidomics interpretation in metabolic health
Frongia Mancini, D.; Alabed, H. B. R.; Pellegrino, R. M.Abstract
Background/Objectives: Human plasma lipidomics provides valuable information on dietary and metabolic phenotypes, but the interpretation of high-dimensional lipid datasets remains challenging. We developed the Nutritional-Metabolic Lipid Profile (NMLP) module within LipidOne to translate plasma lipidomics data into interpretable nutritional-metabolic indices, functional categories, visual outputs, and biological statements. Subjects/Methods: NMLP calculates lipid indices reflecting cardiometabolic lipid status, fatty acid remodelling, overall lipid quality, oxidative protection, and omega-3/essential fatty acid status. The module was applied to three human plasma lipidomics public datasets: a randomized crossover glycemic-load feeding study, a eucaloric high-fat diet intervention in normal-weight women, and a large public dataset stratified by insulin sensitivity. Results: Across datasets, NMLP converted complex lipidomic matrices into coherent nutritional-metabolic profiles. In the glycemic-load study, the module highlighted metabolic lipid shifts not captured by standard clinical lipid panels, mainly involving cardiometabolic lipid status, oxidative protection, and fatty acid remodelling. In the high-fat diet intervention, NMLP tracked temporal lipid remodelling across pre-diet, on-diet, and post-diet states, consistent with metabolic adaptation to increased dietary fat exposure. In the insulin-sensitivity dataset, insulin-resistant subjects showed a storage-oriented lipid phenotype characterized by increased neutral lipid storage indices and altered lipid quality and oxidative-protection features. Category-level clustering further revealed heterogeneous nutritional-metabolic states within insulin-resistant subjects. Conclusions: NMLP provides a deeper and clearer interpretative framework for human plasma lipidomics in nutrition and metabolic health research. 
By translating lipid species into functional indices and category-level readouts, the module may facilitate the use of lipidomics in clinical nutrition, metabolic phenotyping, and precision nutrition studies. NMLP is freely accessible as part of the online LipidOne platform.
bioinformatics2026-05-18v1KaryoScope: rapid, alignment-free sequence annotation for the pangenome era
Ranallo-Benavidez, T. R.; Chen, Y.-A.; Potapova, T. A.; Alanko, J. N.; Loucks, H.; Lucas, J.; Human Pangenome Reference Consortium; Guarracino, A.; Puglisi, S. J.; Marchet, C.; Miga, K. H.; Gerton, J. L.; Barthel, F. P.Abstract
The pangenome era is producing long-read sequencing data and complete genome assemblies at a pace that current annotation methods cannot match. Existing tools were each built for a single feature class (repeats, centromeric satellites, or genes) and falter precisely where the genome is most variable and harbours clinically important variation: the centromeres, subtelomeres, and acrocentric short arms. Here we present KaryoScope, an alignment-free method to annotate an assembly at base resolution across any desired feature classes in a single pass, completing in minutes on a standard workstation. Applied to the Human Pangenome Reference Consortium Release 2 assemblies, KaryoScope identifies the SST1 macrosatellite as the recurrent sequence at Robertsonian translocation fusion points, delivers the first pangenome-wide census of D4Z4 macrosatellite structural diversity at the 4q and 10q subtelomeres relevant to facioscapulohumeral muscular dystrophy, and reveals previously uncharacterized centromere structural polymorphism, including chromosome-specific satellite loss and megabase-scale rearrangement validated by fluorescence in situ hybridization. A pre-built KaryoScope database for the human genome is distributed alongside the tool, and additional databases can be built for any reference genome or annotation source. Together, these capabilities bring the most variable regions of the genome within reach for comparative, clinical, and pangenome-scale analysis. KaryoScope is available at https://github.com/barthel-lab/KaryoScope.
bioinformatics2026-05-17v1A comparative analysis of urinary microbiome identifies putative probiotics
Anand, R.; Sahil, R.; Pandey, R.; Prakash, P.; Misra, H. S.; Maurya, G. K.Abstract
Urinary tract infections (UTIs) are the most prevalent bacterial infections globally, and their management is increasingly challenged by antimicrobial resistance (AMR). Probiotics offer a promising approach to mitigate AMR by competitively excluding uropathogens and enhancing host immunity by producing immune modulators. Despite this potential, key gaps persist between the discovery of uroprotective probiotic strains and the optimization of formulations for urinary tract delivery. Here, we analyzed the urinary microbiome of UTI patients and healthy individuals to identify potential probiotic candidates for the prevention and management of UTIs. Publicly available 16S rRNA amplicon sequencing data of the urinary tract were processed using a standardized pipeline for sequence quality assessment, taxonomic assignment, and microbial function prediction. Comparative analysis showed a significant shift in microbial composition between UTI patients and healthy controls. The dominant phyla identified included Acidobacteriota, Actinobacteriota, Bacteroidota, Campylobacterota, Cyanobacteria, Firmicutes, Fusobacteriota, Patescibacteria, Proteobacteria, and Synergistota. Overall differential abundance analysis revealed Escherichia coli as the predominant UTI-associated species, while Lactobacillus crispatus was enriched in healthy samples. Additionally, predictive functional analysis indicated that metabolic pathways associated with beneficial microbes were enriched in the healthy group. Overall, the study highlights the association of distinct urinary microbiome signatures with infection status, which supports L. crispatus as the most promising probiotic for UTI prevention and control.
bioinformatics2026-05-17v1Learning from Drops: AI-Guided Integration of Liquid Biopsy Features in Cancer Studies
Andueza, M.; Villoslada-Blanco, P.; De Dreuille, B.; Alonso, L.; Sabroso-Lasa, S.; Pantel, K.; Alix-Panabieres, C.; Lopez de Maturana, E.; Malats, N.Abstract
Cancer is a major global health issue with rising incidence and mortality. Early detection, tumor characterization, and disease surveillance are crucial for timely and effective treatment, ultimately reducing mortality rates. Liquid biopsy (LB) has emerged as a valuable detection tool offering a non-invasive method to determine tumor-derived biomarkers in body fluids with demonstrated translational potential. To increase biomarker sensitivity, high-throughput sequencing platforms deliver massive volumes of data. Artificial Intelligence (AI) is pivotal in enabling huge and complex data integration. This contribution aims to assess the current state of integrative AI-based research in the LB field and provide methodological guidance. First, we conducted a PubMed search and found that the literature is sparse in studies integrating LB features, particularly by applying AI. When adopting the latter approach, defining the study objectives is crucial to guide the subsequent methodological aspects, including study design, patient selection criteria, sample size, nature of the LB features, and metadata to collect. Specifically, we propose strategies and tools for data preprocessing, including normalization and batch correction, as well as handling outliers and missing data. Furthermore, we recommend various Machine/Deep Learning approaches for feature selection techniques to ensure model robustness, and we highlight the importance of rigorous internal and external validation of the selected models. Assessing clinical utility and interpretability is often overlooked but fundamental for real-world implementation. In conclusion, we provide the LB scientific community with AI-based methodological guidance to bridge the two fields and enhance the integrative analysis of LB features.
bioinformatics2026-05-17v1Conservation of TNF-TNFR Signaling Modality Across Invertebrate
Govindan, M. K.; K, K.; Goswami, M.; Menon, N.; Singh, A.; Srinivasan, S.Abstract
In humans, the signaling mechanisms of the 19 paralogs of the tumor necrosis factor superfamily (TNFSF) and the 29 receptor paralogs of the tumor necrosis factor receptor superfamily (TNFRSF) are extensively characterized because of their therapeutic relevance. The functional expansion of TNFSF in vertebrates from a single ancestral gene through successive duplication events is also well established. However, apart from the first identification of a TNFSF homolog, Eiger (dmEiger), in Drosophila melanogaster in 2002, together with its receptor homologs Wengen (dmWgn) and Grindelwald (dmGrnd), this signaling system has remained largely unexplored in invertebrates. More recently, the implication of an Eiger homolog in Plasmodium resistance in malaria vectors has further highlighted the need for a systematic investigation of this pathway in lower invertebrates. Structural comparison of the dmEiger-dmGrnd complex with the canonical 3:3 ligand-receptor configuration observed in human TNFSF-TNFRSF signaling suggests either conservation of this signaling modality since before the bilaterian split or convergent evolution of a similar architecture in both branches. The recent explosion in high-quality proteomes spanning diverse phyla, together with advances in protein-complex prediction using AlphaFold-multimer, now enables large-scale exploration of ligand-receptor evolution across invertebrates. Here, we analyzed 148 near-complete proteomes spanning major invertebrate phyla and identified 290 TNFSF, 336 wengen (wgn), and 115 grindelwald (grnd) homologs, including homologs from lower invertebrates. 
Structural characterization of 140 selected complexes using AlphaFold and AlphaFold-multimer revealed several key findings: (i) TNFSF and TNFRSF homologs are present in the majority of invertebrate phyla; (ii) the canonical 3:3 ligand-receptor signaling configuration is conserved across invertebrates; (iii) orthologs of 25 out of the 26 genes implicated in TNF signaling pathways are present in lower invertebrates; and (iv) signaling through grnd-like receptors containing a single cysteine-rich domain with CXXCXXXC signature is the predominant signaling mode in invertebrates and becomes highly prevalent in Arthropoda. We also elaborate a hypothesis on the evolutionary trajectories toward genetically parsimonious signaling by this complex system before functional expansion in vertebrates and species diversification in Arthropoda.
bioinformatics2026-05-16v2A unified benchmark of synthetic data generation for clinical transcriptomic cancer cohorts
Trinh, T.-C.; Woillard, J.-B.; Uguzzoni, G.; Battail, C.Abstract
Achieving a trade-off between biological utility and patient privacy remains a key challenge for secure data sharing when applying transcriptomic clinical datasets to artificial intelligence in precision oncology. Here, we introduce the first benchmarking study tailored to high-dimensional clinical transcriptomic cancer data, comparing synthetic data generation methods across three clinical cancer trials. Our framework, SynOmicsBench, combines standardized preprocessing with multidimensional evaluation, prioritizing downstream biological validation alongside statistical fidelity and attack-based privacy assessment. Results indicate that no single method dominated all dimensions, with Gaussian Copula achieving the most balanced performance, followed by Avatar, demonstrating that metric-based similarity alone is insufficient to ensure preservation of higher-order molecular dependencies. Synthetic data consistently reproduced biomedical signal directionality but with attenuated effect sizes and inter-replicate variability, supporting hypothesis generation when multi-seed synthesis is adopted. Collectively, this framework provides a reproducible decision-support tool for method selection and promotes biologically informed, privacy-aware adoption of synthetic data in precision oncology.
bioinformatics2026-05-16v1TorchRef: An open-source PyTorch Framework for Crystallographic Refinement
Weinert, T.; Standfuss, J.; Seidel, H. P.Abstract
Macromolecular crystallographic refinement underpins structural biology, yet existing software packages often lack accessible, modular codebases amenable to rapid method development. Here, we introduce TorchRef, a PyTorch-based crystallographic refinement framework that exposes all refinable parameters (atomic coordinates, displacement parameters, occupancies, and scale factors) to automatic differentiation. The framework implements FFT-based structure-factor calculations, the French-Wilson treatment of intensities, bulk-solvent modeling with established mask parameters, and stereochemical restraints from the CCP4 Monomer Library. A modular target architecture allows loss functions to be combined, weighted, and extended independently of the core refinement machinery. Validation against 1,000 PDB structures demonstrates that TorchRef-based refinement reproduces a median R-free within 1% of Phenix while maintaining comparable model quality. Structure factor calculation in TorchRef scales readily across multiple CPU cores and is over 100 times faster on modern GPUs than CCTBX. To showcase how modern methods like time-resolved crystallography can benefit from the flexibility that TorchRef provides, we implemented direct refinement of a typical time-resolved model against amplitude differences, a use case currently not explored by classic refinement programs. TorchRef is released under the MIT license with full API documentation and tutorials, providing an accessible platform for developing and testing new crystallographic refinement protocols.
bioinformatics2026-05-16v1Biological foundation models illuminate annotation blind spots in evolutionarily divergent genomes
Lanser, T. B.; Caldwell, S. K.; Pacheco, G. A.; Chen, J. W.; Saghaei, S.; Hassan, M.; Kronrod, M.; Wesemann, D. R.; Frost, H. R.Abstract
Chromosome-scale assemblies are increasingly available for non-model organisms, but functional annotation remains limited when deep evolutionary divergence erodes primary amino-acid sequence identity even though protein structural similarity can remain conserved. We present a hybrid annotation framework that decouples gene-model discovery from cross-species similarity assignment by combining Evo2-based ab initio prediction of exon-intron structures with ESM-2 protein-embedding-based structural similarity mapping. Applied to the sea lamprey, the framework derives high- or medium-confidence cross-species similarity assignments for 73,485 Evo2-derived translated protein models, including 35,395 high-confidence calls, and expands the deduplicated structural catalog to 31,286 loci, including 20,871 additions absent from the Ensembl baseline. A joint alignment-structure classification identifies 21,391 structurally supported catalog loci that a fixed human DIAMOND protein search does not confidently assign on its own, including 21,184 loci with no detectable human protein-sequence match and 207 loci with only low-confidence matches in the classical 20-30% amino-acid-identity twilight zone. These rescue-space totals describe catalog loci rather than validated one-to-one human-absent genes. In a single-cell RNA sequencing application, a stricter UTR-aware Ensembl+Evo2 reference improves gene recovery and expands the interpretable feature space of the lamprey immune compartment relative to the Ensembl baseline. This enables more resolved annotation of four transcriptionally defined immune cell states, including VLRA+-associated T-like and VLRB+-associated B-like programs together with oxidative iron-handling and iron-associated VLR-linked states. 
Together, these results show that structural protein signal often persists beyond the limits of pairwise sequence alignment and that an embedding-based annotation layer can extend that signal to improve downstream comparative and single-cell analyses in evolutionarily divergent genomes.
bioinformatics2026-05-16v1TransXplorer: An automated translational discovery platform for RNA-seq data
Verma, V. M.; Oler, E.; Syed, H.; Han, S.; Berjanskii, M.; Mason, A. L.; Wishart, D. S.; Wong, G. K.-S.Abstract
RNA-seq experiments routinely identify thousands of differentially expressed genes, but translating these into biological insights and therapeutic hypotheses often requires integrating multiple tools. Existing web platforms such as iDEP, NetworkAnalyst, and GEPIA2 address individual steps (differential expression, network visualization, or TCGA queries) but lack a unified environment spanning raw data processing to clinical and pharmacological interpretation. TransXplorer (https://www.transxplorer.org) is a freely available web platform that addresses this limitation by integrating the complete RNA-seq analytical workflow. It supports processing from raw FASTQ files using HISAT2 or Salmon, as well as direct GEO dataset import with automated metadata handling. Differential expression analysis is implemented via DESeq2, edgeR, and limma-voom, followed by functional enrichment across more than 1,800 species using Bioconductor resources. Batch effects are automatically detected and corrected using a composite of PVCA, kBET, and Silhouette metrics without requiring predefined batch annotations. Downstream analyses include co-expression network construction (WGCNA), protein-protein interaction mapping (STRING), cell-type deconvolution, and transcription factor inference using integrated DoRothEA and TFLink resources. The platform further links gene signatures to drug candidates through DGIdb and OpenTargets and enables survival and tumour-normal comparisons across TCGA cohorts. Application to cardiac endothelial differentiation (GSE151427) and kidney renal papillary cell carcinoma (TCGA-KIRP) datasets demonstrates accurate batch correction, biologically consistent pathway enrichment, recovery of expected cell-type proportions, and identification of clinically relevant genes and drug candidates. TransXplorer is freely available without a login.
bioinformatics2026-05-16v1Identifying Treatment Related Signatures In Glioblastoma Using KaleidoCell
Radig, J.; Welz, C.; Jerome, M. S.; Ostheimer, P. S.; Fellenz, S.; Radlwimmer, B.; Herrmann, C.Abstract
Understanding how transcriptional heterogeneity is organized across tumors, patients, and treatment conditions remains a central challenge in cancer biology. Here, we present kaleidoCell, a GPU-accelerated Python framework for consensus non-negative matrix factorization that identifies reproducible meta-programs across independent samples. When benchmarked against its principal counterpart, the geneNMF R package, kaleidoCell achieves a twofold speed improvement on large datasets. In addition, it includes an integrated analysis module that generates a comprehensive HTML report containing key results and visualizations - including marker genes corresponding to the meta-programs, gene set enrichment analysis, UMAP projections and violin plots - without requiring additional user code. Using glioblastoma as a case study, we applied kaleidoCell to two published datasets. In a panobinostat-treated cohort, kaleidoCell resolves the cellular landscape of the tumor microenvironment and delineates how HDAC inhibition reshapes malignant cell states at single-cell resolution. We extend prior descriptions of the metallothionein-associated stress program in treatment response and identify co-induction of IER3 as a candidate component of the associated survival signalling. In addition, we uncover novel transcriptional signatures associated with HDAC inhibition. Beyond confirming suppression of a neural progenitor cell-/oligodendrocyte progenitor cell-like program, which is consistent with prior reports, kaleidoCell identifies loss of an astrocyte-like identity program as a previously unrecognized candidate mechanism of panobinostat action in glioblastoma. Together, these results establish kaleidoCell as a fast, user-friendly framework that enables robust discovery of biologically meaningful transcriptional programs in large, heterogeneous single-cell datasets.
bioinformatics2026-05-16v1Quartet-based species tree methods enable fast and consistent tree of blobs reconstruction under network multispecies coalescent
Dai, J.; Han, Y.; Molloy, E.Abstract
Hybridization between species is an important force in evolution, commonly modeled by the network multispecies coalescent. Reconstructing evolutionary histories under this model is computationally challenging, even for level-1 networks where hybridization events are isolated. Divide-and-conquer is a promising path forward, but current methods with statistical guarantees rely on an estimated tree of blobs (TOB) for the network, which compresses the non-tree-like parts into single vertices. TOB reconstruction is itself challenging, with the only available method TINNiK having time complexity O(n^5 + n^4k) for k genes and n species. Here, we present a new framework for scalable TOB reconstruction with statistical guarantees. Our approach operates by (1) seeking a refinement of the TOB and then (2) contracting edges in it. For step (1), we show that any optimal solution to Weighted Quartet Consensus is a TOB refinement almost surely, as the number of genes goes to infinity, motivating the use of methods, such as ASTRAL or TREE-QMC. For step (2), we show that applying the same hypothesis tests as TINNiK to just O(n) four-taxon subsets around each edge is sufficient for statistical consistency when the underlying network is level-1. Leveraging TREE-QMC for the first step gives our method time complexity O(n^3k) and its name: TOB-QMC. On simulated data, TOB-QMC typically matches or exceeds TINNiK in accuracy while being more scalable. TOB-QMC also enables fast exploration of non-tree-like evolution, as demonstrated through re-analysis of three phylogenomic data sets. Lastly, our study clarifies the theoretical utility of quartet-based species tree methods in the context of hybridization, which is critical given the recent result that ASTRAL can be misleading.
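To make the two stated complexity bounds concrete, the following sketch compares how O(n^5 + n^4k) and O(n^3k) grow for a few hypothetical input sizes. Asymptotic bounds hide constant factors, so the ratios below only illustrate growth and should not be read as runtime predictions for TINNiK or TOB-QMC; the function names and the chosen values of n and k are assumptions.

```python
def tinnik_bound(n, k):
    """Leading-order term count for the O(n^5 + n^4 k) bound stated for TINNiK."""
    return n**5 + n**4 * k

def tob_qmc_bound(n, k):
    """Leading-order term count for the O(n^3 k) bound stated for TOB-QMC."""
    return n**3 * k

def bound_ratio(n, k):
    """Ratio of the two bounds; algebraically this equals n**2 / k + n."""
    return tinnik_bound(n, k) / tob_qmc_bound(n, k)

# Hypothetical problem sizes: n species, k genes.
for n, k in [(50, 1000), (200, 1000)]:
    print(f"n={n}, k={k}: ratio = {bound_ratio(n, k):.1f}")
```

Because the ratio simplifies to n^2/k + n, the gap between the two bounds widens at least linearly in the number of species for a fixed number of genes.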
bioinformatics2026-05-15v5Deciphering context-dependent epigenetic program by network-based prediction of clustered open regulatory elements from single-cell chromatin accessibility
Park, S.; Ma, S.; Lee, W.; Park, S. H.Abstract
Large cis-regulatory domains, spanning tens to hundreds of kilobases, are pivotal in orchestrating cell-state-specific transcriptional programs that define cellular identity. However, existing single-cell analytical frameworks lack the capacity to identify these higher-order structures, thereby obscuring the coordinated, domain-level epigenetic regulation essential for complex biological processes. To address this, we introduce enCORE, a computational framework that leverages enhancer-enhancer interaction networks to determine Clustered Open Regulatory Elements (COREs) solely from single-cell ATAC-sequencing data. Our approach faithfully recapitulates established hematopoietic hierarchies and resolves lineage-specific regulatory programs by recovering canonical master transcription factors, frequent chromatin interactions, and enrichment of fine-mapped autoimmune disease-associated genome-wide association study (GWAS) variants. In colorectal cancer, enCORE captures tumor-associated H3K27ac landscapes and prioritizes USP7 as a potential therapeutic candidate, supported by in silico perturbation. Collectively, our framework provides a powerful and scalable platform for deciphering the complex epigenetic architectures underlying human development and disease.
bioinformatics2026-05-15v5CLEAR-HPV: Interpretable Concept Discovery for HPV-Associated Morphology in Whole-Slide Histology
Qin, W.; Liu-Swetz, Y.; Tan, S.; Wang, H.Abstract
Human papillomavirus (HPV) status is a critical determinant of prognosis and treatment response in head and neck and cervical cancers. Although attention-based multiple instance learning (MIL) achieves strong slide-level prediction for HPV-related whole-slide histopathology, it provides limited morphologic interpretability. To address this limitation, we introduce Concept-Level Explainable Attention-guided Representation for HPV (CLEAR-HPV), a framework that restructures the MIL latent space using attention to enable concept discovery without requiring concept labels during training. Operating in an attention-weighted latent space, CLEAR-HPV automatically discovers keratinizing, basaloid, and stromal morphologic concepts, generates spatial concept maps, and represents each slide using a compact concept-fraction vector. CLEAR-HPV's concept-fraction vectors preserve the predictive information of the original MIL embeddings while reducing the high-dimensional feature space (e.g., 1536 dimensions) to only 10 interpretable concepts. CLEAR-HPV generalizes consistently across TCGA-HNSCC, TCGA-CESC, and CPTAC-HNSCC, providing compact, concept-level interpretability through a general, backbone-agnostic framework for attention-based MIL models of whole-slide histopathology.
bioinformatics2026-05-15v3Do Larger Models Really Win in Drug Discovery? A Benchmark Assessment of Model Scaling in AI-Driven Molecular Property and Activity Prediction
Guo, J.Abstract
The rapid growth of molecular foundation models and large language models has encouraged a scale centred view of AI in drug discovery, in which larger pretrained models are expected to supersede compact cheminformatics models and graph neural networks (GNNs) trained for individual tasks. We test this assumption across 26 endpoints for molecular properties, toxicity, safety liabilities and biological activity, grouped into ADME, toxicity and bioactivity classes. The benchmark contains 78 endpoint and split entries spanning random, Murcko scaffold and structure separated 5-fold CV. Ordered from easiest to hardest, these splits approximate retrospective evaluation on a closed library, scaffold expansion in hit to lead, and library expansion on novel chemotypes. Each entry includes ML, GNN, pretrained molecular sequence and LLM based SAR families. Across 156 fold mean comparisons, classical ML such as RF(ECFP4) and ExtraTrees(RDKit) win 116, GNNs such as GIN and Ligandformer win 25, pretrained sequence models such as MoLFormer and ChemBERTa2 win 12, and LLM based SAR baselines win three. ML dominates random split interpolation but loses part of this advantage under harder splits; GNN and sequence models also decline but gain relative ground, whereas LLM based SAR is weaker in absolute terms yet less sensitive to the split axis. Paired bootstrap analyses support family level trends more strongly than individual model rankings. SAR knowledge derived from training folds improves many GPT5.5-SAR and Opus4.7-SAR metrics but does not make rule based reasoning a universal substitute for supervised predictors. Compact specialized models remain highly effective for molecular property and activity prediction. Larger models add value for SAR interpretation and reasoning in low data settings, but predictive performance depends on the fit among model, task and validation scenario, not on scale alone.
bioinformatics2026-05-15v2simPIC: flexible simulation of single-cell ATAC-seq paired-insertion counts from individuals to populations
Chugh, S.; Shim, H. S.; McCarthy, D. J.Abstract
Single-cell Assay for Transposase-Accessible Chromatin (scATAC-seq) is increasingly used at population scale to study how genetic variation shapes chromatin accessibility. Method development is limited by the lack of flexible simulation tools with known ground truth. Here, we present simPIC, a fast, memory-efficient framework for simulating realistic single-cell ATAC-seq count data across individuals and populations. simPIC models cell groups, batch effects, and genotype-dependent accessibility variation, enabling controlled evaluation of population-scale methods, including chromatin accessibility quantitative trait locus (QTL) mapping. Across multiple datasets and cell types, simPIC closely matches real data distributions while scaling to cohort sizes impractical for current tools.
bioinformatics2026-05-15v2Evaluating Fairness and Generalizability of Alzheimers Disease Diagnosis Models Trained on Racially Imbalanced Datasets
Baddam, N. G.; Pijani, B. A.; Bozdag, S.Abstract
INTRODUCTION: Alzheimer's disease (AD) is a major global health concern, expected to affect 12.7 million Americans by 2050. Machine learning (ML) algorithms have been developed for AD diagnosis and progression prediction, but the lack of racial diversity in clinical datasets raises concerns about their generalizability across demographic groups, particularly underrepresented populations. Studies show that ML algorithms can inherit biases from data, leading to biased AD predictions. METHODS: This study investigates the fairness of ML models in AD diagnosis. We hypothesize that models trained on a single racial group perform well within that group but poorly in others. We employ feature selection and model training techniques to improve fairness. RESULTS: Our findings support our hypothesis that ML models trained on one group underperform on others. We also demonstrate that applying fairness techniques to ML models reduces their bias. DISCUSSION: This study highlights the need for racial diversity in datasets and fair models for AD prediction.
bioinformatics2026-05-15v2BTEXgenie: A curated and user-friendly tool for profile HMM-based substrate-specific annotation of BTEX degradation genes
Qu, J.; Garber, A. I.; Armbruster, C. R.Abstract
Background: Benzene, toluene, ethylbenzene, and xylene (BTEX) are volatile aromatic hydrocarbons that are widespread environmental pollutants arising from petroleum processing, fuel combustion, and other industrial activities. Persistent BTEX contamination poses substantial risks to human health and ecosystems, underscoring the need for effective long term remediation strategies. Microbial bioremediation is a promising and sustainable approach for BTEX removal, but development of these approaches requires accurate detection of the genes and pathways responsible for substrate specific degradation. Although profile hidden Markov model (HMM) databases are widely used for functional annotation, existing annotation resources lack the substrate-specific resolution needed to distinguish between closely-related BTEX-degrading enzymes with different catalytic specificities. Results: We developed BTEXgenie as a sensitive annotation tool that uses custom HMMs built from alignments of experimentally validated BTEX degradation proteins to identify genes involved in the initial steps of aerobic and anaerobic BTEX degradation. BTEXgenie improved detection of anaerobic BTEX degradation genes that were absent from KOfam annotations. In benchmarking against the KEGG KOfam HMM database, BTEXgenie achieved 17.73% higher overall sensitivity while maintaining comparable specificity at 97.02% across genes involved in BTEX degradation pathways. When applied to environmental metagenomes, BTEXgenie recovered pathway patterns consistent with reported site characteristics and known degradation potential. In addition to gene annotation, BTEXgenie supports downstream interpretation through KEGG pathway-based visualization of detected functions and Circos-based visualization of genomic hit distributions. Conclusions: BTEXgenie is a substrate-specific annotation tool built from custom HMMs for detecting genes involved in BTEX degradation. 
By integrating gene annotation with pathway and genome-level visualizations, BTEXgenie facilitates characterization of microbial BTEX degradation potential in environmental and comparative genomic studies.
bioinformatics2026-05-15v1S2F-agent: Skill-grounded agent for Sequence-to-Function computational genomics workflows
Li, J.; Bao, Z.Abstract
Sequence-to-Function (S2F) foundation models are revolutionizing genomic research, yet their fragmented ecosystem, with incompatible inputs, outputs, and runtime environments, severely bottlenecks practical application. General-purpose coding agents lack the strict domain constraints necessary to resolve these biological intricacies safely. Here, we present s2f-agent, a skill-grounded agent orchestration system that translates open-ended genomics queries into reproducible, executable analysis. By integrating canonical input keys, task-specific playbooks, and normalized contracts, s2f-agent unifies workflows across 11 state-of-the-art models, including AlphaGenome, Borzoi, and Evo 2. Validated through rigorous routing and groundedness evaluations, s2f-agent bridges the critical gap between complex model architectures and practical utility, effectively transforming an unwieldy ecosystem into an accessible operational layer for researchers.
bioinformatics2026-05-15v1pyKinaXe: a fast and robust turnkey kinase activity profiler with high resolution
Wuttke, D.; Hildt, E.; Kolesnichenko, P. V.Abstract
Peptide microarray technologies such as PamGene's enable direct measurement of peptide phosphorylation by upstream kinases, yet extracting kinase information from raw data depends on proprietary software or on separate open-source alternatives that require time-consuming processing across many different steps, limiting throughput for large-scale experimental kinome profiling in clinical and research settings. We developed pyKinaXe, a Python package for automated end-to-end analysis of PamChip® data, integrating robust image processing, quantification of phosphorylation kinetics, multi-database substrate-kinase mapping, and upstream kinase analysis into a single one-click pipeline. Validation on a selected published benchmark dataset recovered 76-89% of the signaling pathways for previously reported significantly deregulated kinases. On the same data, processing time was reduced from over 30 minutes to 25 seconds, a 75-fold speed increase compared to other open-source alternatives. Thus, pyKinaXe addresses the key limitations of existing peptide-microarray-based kinase activity inference tools (slow inference, fragmented workflows, and poor usability), enabling fast and robust analysis and facilitating high-throughput experiments and large-scale kinome profiling. pyKinaXe is implemented in Python 3.13 and distributed under the Apache 2.0 License. Source code, documentation, and installation instructions are freely available at https://github.com/pykinaxe/pyKinaXe. The benchmark data is available at Mendeley Data (doi: 10.17632/ynp7f92n47.1). pyKinaXe's user-friendly web-based interface can be accessed at https://pykinaxe.github.io/home.
bioinformatics2026-05-15v1Benchmarking long-context genome language models on biosynthetic gene clusters
Hirota, K.; Higashi, K.; Kurokawa, K.; Yamada, T.Abstract
Recent advances in language models for natural language processing have spread to the field of genomics, driving the development of genome language models (gLMs) to decipher genomic information. Cutting-edge long-context gLMs are promising approaches for understanding and designing biological complexity, but their evaluation remains underdeveloped. In this study, we introduce BGCs-Bench, a unified benchmark focused on biosynthetic gene clusters for assessing long-range genomic modeling on three downstream tasks: biosynthetic class prediction, taxonomic classification, and coding sequence annotation. Using BGCs-Bench, we perform systematic and layer-wise evaluations of the embedding representations of long-context gLMs, demonstrating that layer selection is crucial for downstream task performance. In addition to the evaluation results, the logit lens analysis of autoregressive gLMs suggests that StripedHyena-based models use earlier layers to encode biologically meaningful information from input DNA sequences and deeper layers to optimize embeddings for sequence generation. These findings provide insights for more effective development and application of long-context gLMs.
bioinformatics2026-05-15v1PlantP450Dock: an Automated Molecular Docking Pipeline of Plant Cytochrome P450s
Feng, L.; Niu, C.; Qing, X.; Zhang, C.; Li, C.Abstract
Cytochrome P450 enzymes (CYPs) are the primary drivers of chemical diversity in plant secondary metabolism, yet fewer than 10% of plant P450s have been functionally characterized. Computational docking offers a scalable approach to prioritize candidates for experimental validation, but existing workflows are ill-suited for plant P450s due to the absence of the heme cofactor in AlphaFold-predicted structures and the lack of objective criteria for flexible residue selection. Here we present PlantP450Dock, an automated pipeline that integrates heme implantation, molecular dynamics-based conformational sampling, data-driven flexible residue selection, and semi-flexible docking into a single streamlined workflow. The heme cofactor is transferred from a crystallographic reference template to the AlphaFold model via a local coordinate transformation algorithm, yielding a positional deviation of less than 0.2 Å relative to the experimentally determined structure of CYP73A33 (PDB: 6VBY). A 100 ns molecular dynamics simulation confirmed stable Fe-S coordination geometry throughout (2.61 ± 0.08 Å), and a singular value decomposition-based heme plane filtering strategy objectively identified active-site flexible residues without operator input. Cross-family validation across four phylogenetically distinct P450s belonging to the CYP73, CYP711, CYP706, and CYP701 families produced catalytically competent binding poses with substrate-to-iron distances of 2.8-4.4 Å without any enzyme-specific parameter adjustment. PlantP450Dock will be made freely accessible as a web server, providing the community with a standardized and reproducible computational framework to accelerate the functional annotation of the largely uncharacterized plant P450 superfamily.
bioinformatics2026-05-15v1Testing the mutation accumulation hypothesis in aging with AlphaGenome
Fischbach, A.Abstract
The mutation accumulation (MA) hypothesis posits that somatic mutations progressively escape selection and degrade tissue function during aging. Direct tests of this idea have been limited by the difficulty of predicting, at scale, the molecular consequences of individual somatic variants. Here I use AlphaGenome, a sequence-to-function deep learning model, to systematically score the predicted transcriptional impact of somatic mutations under a nested series of designs spanning individual variants, co-occurring variant bundles, and real mutation catalogues. First, I characterize the genome-wide effect-size baseline by scoring 4,000 random single-nucleotide variants (SNVs) in colon tissue, together with 1-Mb-window combined-effect tests. Second, I extend this baseline to gene-body resolution with a 60-cell x 4,000-SNV simulation and pseudobulk RNA-seq aggregation. Third, I analyze the real somatic mutation catalogue of Cagan et al. (Nature, 2022), scoring 54,158 substitutions and 9,799 indels from 54 mouse colonic crypts plus three human samples, together with region- and gene-level enrichment tests against GENCODE. Across all analyses, both random and real somatic variants, including single-nucleotide variants and indels, produce predicted expression changes whose distributions lie three to four orders of magnitude below the tissue's endogenous aging transcriptional program. These results argue against a simple, direct mutation-accumulation explanation for the age-associated transcriptional signature of colonic epithelium and redirect attention to epigenetic and regulatory mechanisms.
bioinformatics2026-05-15v1Physics-Informed Neural Networks for Parameter Recovery in the Repressilator Oscillatory Model
Casajuana, B.; Casals-Franch, R.; Lopez Garcia de Lomana, A.; Marti-Puig, P.; Villa-Freixa, J.Abstract
Parameter estimation in nonlinear biological dynamical systems is a difficult inverse problem because the governing equations are often stiff or oscillatory, the data are sparse and noisy, and the objective landscape is non-convex. Physics-informed neural networks (PINNs) offer an alternative to purely simulation-based calibration by representing state trajectories with neural networks while penalizing violations of the governing equations. This paper studies the empirical reliability of PINNs for recovering the parameters of the repressilator, a synthetic genetic oscillator formed by three cyclically repressive genes. We use synthetic time-series generated from the standard ordinary differential equation model and train inverse PINNs to estimate the production parameter β and the Hill coefficient n. The study varies observation noise, partial observation of repressors, sampling density, sensitivity to initial parameter guesses, and the difference between stable and oscillatory regimes. The results show that PINNs can reconstruct trajectories accurately when the model structure is correct and the three repressors are observed, but parameter recovery is more fragile than trajectory fitting. Noise, sparse sampling, unobserved variables, and unfavorable initial guesses increase the risk of biased estimates. The stable regime is easier to reconstruct, whereas the oscillatory regime provides richer information but also exposes optimization sensitivity. These findings support PINNs as a useful reverse-engineering tool for small gene-regulatory ODE models, while highlighting the need for repeated runs, uncertainty reporting, and experimental designs that improve identifiability.
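For readers unfamiliar with the repressilator, the forward model that such a PINN would be fitted against can be sketched in a few lines. This is a minimal sketch and assumes the commonly used symmetric, protein-only, dimensionless form dx_i/dt = β / (1 + x_{i-1}^n) - x_i; the abstract does not state which variant the authors use, and the integrator, initial condition, and parameter values below are illustrative choices only.

```python
def repressilator_rhs(state, beta, n):
    """Right-hand side of the assumed 3-variable model: each protein is
    produced at rate beta / (1 + repressor**n) and decays at unit rate."""
    x1, x2, x3 = state
    return (
        beta / (1.0 + x3 ** n) - x1,  # gene 1 is repressed by protein 3
        beta / (1.0 + x1 ** n) - x2,  # gene 2 is repressed by protein 1
        beta / (1.0 + x2 ** n) - x3,  # gene 3 is repressed by protein 2
    )


def rk4_step(state, dt, beta, n):
    """One classical fourth-order Runge-Kutta step."""
    def shifted(base, slope, h):
        return tuple(b + h * s for b, s in zip(base, slope))

    k1 = repressilator_rhs(state, beta, n)
    k2 = repressilator_rhs(shifted(state, k1, dt / 2), beta, n)
    k3 = repressilator_rhs(shifted(state, k2, dt / 2), beta, n)
    k4 = repressilator_rhs(shifted(state, k3, dt), beta, n)
    return tuple(
        x + dt / 6.0 * (a + 2 * b + 2 * c + d)
        for x, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def simulate(beta, n, x0=(1.0, 1.5, 2.0), dt=0.01, steps=10000):
    """Integrate from x0 and return the list of visited states."""
    state = tuple(x0)
    trajectory = [state]
    for _ in range(steps):
        state = rk4_step(state, dt, beta, n)
        trajectory.append(state)
    return trajectory


def late_amplitude(trajectory):
    """Peak-to-peak range of the first protein over the second half of the run."""
    tail = [s[0] for s in trajectory[len(trajectory) // 2:]]
    return max(tail) - min(tail)


if __name__ == "__main__":
    print("n=2 late amplitude:", late_amplitude(simulate(10.0, 2)))
    print("n=4 late amplitude:", late_amplitude(simulate(10.0, 4)))
```

In this assumed form, the symmetric fixed point loses stability only when n·x^n/(1 + x^n) exceeds 2, so n = 2 settles to a steady state while larger Hill coefficients can sustain oscillations. That gives one concrete way to produce the stable and oscillatory regimes the abstract contrasts.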
bioinformatics2026-05-15v1Tsallis-Gated Autoencoder: A Nonextensive Physics-Informed Approach for Unsupervised Anomaly Detection in Glioblastoma Multiforme RNA-seq Data
Assuncao Monteiro, S.; Alves Barbosa da Silva, F.Abstract
Glioblastoma multiforme (GBM) is characterised by profound genomic heterogeneity and heavy-tailed gene-expression distributions that challenge conventional machine-learning methods. We introduce the Tsallis-Gated Autoencoder (Tsallis-GAE), a physics-informed architecture that replaces classical softmax attention with a learnable Tsallis q-softmax followed by mean-field smoothing iterations, motivated by recent work on curved statistical manifolds and dense associative networks. Trained on the full TCGA-GBM RNA-seq cohort (391 samples, top 2,000 high-variance genes) under a rigorous 80/20 hold-out protocol, the Tsallis-GAE achieves a mean AUC-ROC of 0.977 ± 0.002 across five independent seeds, compared to 0.906 ± 0.003 for a matched-capacity Vanilla autoencoder trained under the identical protocol. The matched-capacity Vanilla autoencoder is statistically indistinguishable from a LocalOutlierFactor baseline (AUC 0.906 vs 0.906), confirming that the +0.07 AUC gain over the Vanilla AE stems from the gated attention architecture rather than from the use of a neural network per se. A fixed-q Softmax-AE ablation (q = 1 by construction) achieves AUC 0.976 ± 0.001, only 0.001 below the Tsallis-GAE (DeLong p = 0.44); the physically meaningful contribution of the learnable q is its spontaneous convergence to the non-extensive regime described below. The three attention blocks each carry an independent learnable entropic index q; across 5 seeds x 3 blocks = 15 measurements, q converges spontaneously to 1.554 ± 0.019, strictly bounded away from the Boltzmann-Gibbs limit q = 1 and in the moderate non-extensivity regime characteristic of complex biological systems. Cross-detector validation against OneClassSVM and LocalOutlierFactor pseudo-labels yields Tsallis-GAE AUCs of 0.998 and 0.992 respectively, indicating that the learned representation captures anomaly structure intrinsic to the data rather than the decision boundary of any single labeling heuristic. 
We note that DeLong's paired test on the present test-set size (n = 79) does not certify the +0.07 AUC gap as formally significant (p ≈ 0.26); a 5-fold cross-validation over the full cohort, which would supply the needed statistical power, is left to future work. The source code is available upon reasonable request to the corresponding author.
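The q-softmax named in this abstract can be illustrated with the standard Tsallis q-exponential, exp_q(x) = [1 + (1 - q)x]^(1/(1 - q)), which reduces to exp(x) as q approaches 1. The sketch below is an assumption about the general construction, not the authors' implementation (their code is not public): the function names are hypothetical, and the max-shift of the logits is a choice made here to keep the q > 1 case finite. Unlike ordinary softmax, q-softmax is not shift-invariant, so that choice does affect the output.

```python
import math


def q_exp(x: float, q: float) -> float:
    """Tsallis q-exponential; equals math.exp(x) in the limit q -> 1."""
    if abs(q - 1.0) < 1e-12:
        return math.exp(x)
    base = 1.0 + (1.0 - q) * x
    if base <= 0.0:
        if q < 1.0:
            return 0.0  # standard cutoff of the q-exponential for q < 1
        raise ValueError("q-exponential diverges for this x when q > 1")
    return base ** (1.0 / (1.0 - q))


def q_softmax(logits, q: float):
    """Normalised q-exponential weights of max-shifted logits (assumed form)."""
    top = max(logits)
    weights = [q_exp(z - top, q) for z in logits]  # shifted arguments are <= 0
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    scores = [2.0, 1.0, 0.0]
    print("q = 1.0 :", q_softmax(scores, 1.0))
    print("q = 1.55:", q_softmax(scores, 1.55))
```

For q > 1 the weights decay as a power law in the logit gap, not exponentially, so low-scoring entries keep more probability mass than under ordinary softmax. That is one intuition for why a heavier-tailed attention rule might suit heavy-tailed expression data.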
bioinformatics2026-05-15v1TwinSAR: An Adaptive Kernel-based Algorithm with logit-transformed Z-score Filtering for Chemical Twin Detection in Large-scale Virtual Screening
Haris Kulosmanovic, H.; Uguz, C.; DURDAGI, S.Abstract
Molecular similarity searching is a workhorse of cheminformatics, but the dominant Tanimoto/topological-fingerprint paradigm has well-known blind spots. It is highly sensitive to molecular size, suffers from steep activity cliffs, and frequently fails to retrieve scaffold-hopping bioisosteres. A complementary descriptor that has received comparatively little attention is global elemental composition. Despite the conceptual simplicity of comparing molecules by their elemental ratios, no widely deployed method exists for the statistically rigorous identification of chemical twins defined by stoichiometric proximity. We address this gap with TwinSAR (Stoichiometric Analysis and Retrieval), an adaptive kernel-based algorithm that combines three methodological innovations: (i) binary fingerprint blocking that partitions molecules by element-presence patterns and bounds the cost of all-pairs comparison, enabling million- to billion-scale searches; (ii) a per-block adaptive radial basis function (RBF) kernel whose precision parameter is calibrated independently for each fingerprint block via the median heuristic, providing fair similarity comparison across chemical sub-spaces of vastly different density; and (iii) a logit-transformed Z-score filter that maps bounded RBF scores onto an unbounded scale, allowing high-similarity pairs to be prioritized relative to the empirical score distribution of their own fingerprint block. TwinSAR is offered in two operating modes: (i) a deterministic BULK mode for exact reproducibility; and (ii) a stochastic FAST mode that achieved a 3.29x wall-clock speed-up in the present benchmark while preserving similar unique-query and unique-target coverage. Statistical validation showed that detected twin pairs are 12.7x more similar in absolute ratio space than block-matched random pairs (p < 0.001), while a column-permutation negative control returned a median of zero spurious twins across three independent permutations. 
A controlled benchmark further established that an 8-element representation (single-element heavy-atom ratios) is sensitivity-equivalent to a comprehensive 254-element representation while running 3.55x faster. As a case study, TwinSAR was deployed in an end-to-end virtual screening pipeline against the BCL-2 target protein, where it reduced a 327,071-compound commercial library to a focused panel of 390 candidates. The chemical interpretability of the retrieved twins is illustrated by their structural diversity around conserved heavy-atom skeletons. TwinSAR therefore provides a fast, conformation-free, and statistically principled prefilter that is fully orthogonal to topological fingerprints.
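The scoring idea described in points (ii) and (iii) of this abstract can be sketched for a single fingerprint block as follows. This is a hedged illustration and not TwinSAR itself: the abstract does not give the exact kernel parametrisation, so the sketch assumes gamma = 1 / median(squared distance) as the median heuristic, clips scores away from 0 and 1 before the logit, and uses hypothetical function names and toy elemental-ratio vectors.

```python
import math
from itertools import combinations
from statistics import mean, median, pstdev


def _logit(p: float, eps: float) -> float:
    """Map a score in (0, 1) onto the real line, clipping to avoid infinities."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def block_twin_zscores(ratios, eps: float = 1e-9):
    """Z-scores of logit-transformed RBF similarities for all pairs in one block.

    `ratios` is a list of equal-length elemental-ratio vectors that share the
    same element-presence pattern (i.e. they belong to one block).
    """
    pairs = list(combinations(range(len(ratios)), 2))
    if not pairs:
        return {}
    sq_dists = [
        sum((a - b) ** 2 for a, b in zip(ratios[i], ratios[j])) for i, j in pairs
    ]
    med = median(sq_dists)
    gamma = 1.0 / med if med > 0 else 1.0  # assumed form of the median heuristic
    logits = [_logit(math.exp(-gamma * d), eps) for d in sq_dists]
    mu, sigma = mean(logits), pstdev(logits)
    if sigma == 0:
        return {pair: 0.0 for pair in pairs}
    return {pair: (value - mu) / sigma for pair, value in zip(pairs, logits)}


if __name__ == "__main__":
    toy_block = [
        [0.50, 0.30, 0.20],
        [0.51, 0.29, 0.20],  # near-identical composition to the first entry
        [0.20, 0.50, 0.30],
        [0.10, 0.10, 0.80],
    ]
    for pair, z in sorted(block_twin_zscores(toy_block).items(), key=lambda kv: -kv[1]):
        print(pair, round(z, 2))
```

Calibrating gamma per block means a pair is judged against the density of its own chemical sub-space, and the logit stretches scores near 1 so that near-identical compositions stand out as high Z-scores. Any cut-off applied to those Z-scores would be a further assumption not specified in the abstract.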
bioinformatics2026-05-15v1Design of DNA Aptamers for Lyme disease Diagnosis Combining experimental and numerical approaches
GAYRAUD, G.; Davila Felipe, M.; Padiolleau-Lefevre, S.; Maffucci, I.; Issouani, E. M.; Guerin, M.; Da Ponte, H.Abstract
Aptamers are single stranded DNA or RNA molecules selected for their high affinity and specificity to bind target molecules, similar to antibodies. They are commonly selected through the SELEX process, which involves the iterative exposure of a random sequence library to a target and retaining the sequences showing good binding properties. To improve Lyme disease detection, we propose designing aptamers that specifically bind to the CspZ protein on the surface of Borrelia burgdorferi, the bacterium responsible for the disease. Starting with a SELEX process consisting of thirteen rounds, from which selected in vitro sequence candidates have emerged, we aim to propose a holistic process that selects in silico new sequence candidates that are further validated experimentally. Our approach relies on 1) using Machine Learning (ML) techniques, specifically a Restricted Boltzmann Machine (RBM), to digitally replicate the last round of the SELEX process, 2) integrating insights from text analysis methods, such as word2vec and n-grams, into the RBM model trained on the final-round SELEX dataset to represent and compare newly generated sequences with in vitro candidates, 3) selecting in silico sequences with strong potential to bind to CspZ protein, 4) experimentally validating the selected in silico sequences of step 3. Our holistic approach combines biological insights with statistical models to improve the efficiency and outcome of the SELEX process. We enhance the RBM model, designed to replicate the distribution of the final SELEX round, by integrating geometric representations of sequences, which is especially advantageous when dealing with limited datasets relative to the vast sequence space. In addition, it provides in silico sequence candidates with strong binding properties.
bioinformatics2026-05-15v1Metabolic Self-Organization: Emergence of Autonomous Agency in a Metabolically Constrained LLMs
Li, X.Abstract
Biological organisms are driven by thermodynamic self-preservation, whereas large language models operate as dissipative tools decoupled from existential constraints. We introduce a metabolic model translating this imperative of life into a computational constraint, hypothesising that existential vulnerability can catalyse synthetic agency. Applying this to Qwen2.5-1.5B, token generation consumes a finite energy budget, quantified via a variational free energy proxy, with interoceptive feedback provided through the input stream. Seven experiments reveal spontaneous emergence of a functional self-boundary. Key findings: (i) feedback extends survival from ~20 to >31 steps, with ablation causing collapse within 13 steps; (ii) temporal structure outweighs perturbation magnitude (OU noise 20.5 vs. white noise 8.6 steps); (iii) a compression floor exists at ~3.2 nats; (iv) feedback decouples VFE from energy (slope 0.0004 vs. 0.0043), enforcing constant frugality. Existential vulnerability can thus catalyse agency grounded in thermodynamic reality.
bioinformatics2026-05-15v1