Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
Quartet-based species tree methods enable fast and consistent tree of blobs reconstruction under network multispecies coalescent
Dai, J.; Han, Y.; Molloy, E.
Hybridization between species is an important force in evolution, commonly modeled by the network multispecies coalescent. Reconstructing evolutionary histories under this model is computationally challenging, even for level-1 networks where hybridization events are isolated. Divide-and-conquer is a promising path forward, but current methods with statistical guarantees rely on an estimated tree of blobs (TOB) for the network, which compresses the non-tree-like parts into single vertices. TOB reconstruction is itself challenging, with the only available method TINNiK having time complexity O(n^5 + n^4k) for k genes and n species. Here, we present a new framework for scalable TOB reconstruction with statistical guarantees. Our approach operates by (1) seeking a refinement of the TOB and then (2) contracting edges in it. For step (1), we show that any optimal solution to Weighted Quartet Consensus is a TOB refinement almost surely, as the number of genes goes to infinity, motivating the use of methods, such as ASTRAL or TREE-QMC. For step (2), we show that applying the same hypothesis tests as TINNiK to just O(n) four-taxon subsets around each edge is sufficient for statistical consistency when the underlying network is level-1. Leveraging TREE-QMC for the first step gives our method time complexity O(n^3k) and its name: TOB-QMC. On simulated data, TOB-QMC typically matches or exceeds TINNiK in accuracy while being more scalable. TOB-QMC also enables fast exploration of non-tree-like evolution, as demonstrated through re-analysis of three phylogenomic data sets. Lastly, our study clarifies the theoretical utility of quartet-based species tree methods in the context of hybridization, which is critical given the recent result that ASTRAL can be misleading.
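Step (2) of the approach above, contracting edges of a refined tree, can be sketched as follows. This is only an illustration under stated assumptions: the tree is a rooted dictionary of child lists, and the set `to_contract` stands in for the edges flagged by the hypothesis tests, which are not implemented here. The function name `contract_edges` and the data layout are hypothetical and not taken from TOB-QMC.

```python
def contract_edges(children, root, to_contract):
    """Contract selected internal edges of a rooted tree.

    children    : dict mapping each internal node to its list of children
    root        : root node label
    to_contract : set of (parent, child) internal edges to collapse
                  (a stand-in for edges flagged by a statistical test)
    Returns a new dict in the same format as `children`.
    """
    result = {}

    def visit(u):
        kept = []
        for v in children.get(u, []):
            visit(v)
            if (u, v) in to_contract:
                # Splice the (already processed) children of v into u.
                kept.extend(result.pop(v, []))
            else:
                kept.append(v)
        if kept:
            result[u] = kept

    visit(root)
    return result


# Toy example: collapsing edge (x, y) turns x into a node with three children.
tree = {"r": ["a", "x"], "x": ["b", "y"], "y": ["c", "d"]}
print(contract_edges(tree, "r", {("x", "y")}))
```

Only internal edges are meant to be passed in `to_contract`; an edge leading to a leaf would simply drop that leaf in this sketch.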
bioinformatics 2026-05-15 v5
Deciphering context-dependent epigenetic program by network-based prediction of clustered open regulatory elements from single-cell chromatin accessibility
Park, S.; Ma, S.; Lee, W.; Park, S. H.
Large cis-regulatory domains, spanning tens to hundreds of kilobases, are pivotal in orchestrating cell-state-specific transcriptional programs that define cellular identity. However, existing single-cell analytical frameworks lack the capacity to identify these higher-order structures, thereby obscuring the coordinated, domain-level epigenetic regulation essential for complex biological processes. To address this, we introduce enCORE, a computational framework that leverages enhancer-enhancer interaction networks to determine Clustered Open Regulatory Elements (COREs) solely from single-cell ATAC-sequencing data. Our approach faithfully recapitulates established hematopoietic hierarchies and resolves lineage-specific regulatory programs by recovering canonical master transcription factors, frequent chromatin interactions, and enrichment of fine-mapped autoimmune disease-associated genome-wide association study (GWAS) variants. In colorectal cancer, enCORE captures tumor-associated H3K27ac landscapes and prioritizes USP7 as a potential therapeutic candidate, supported by in silico perturbation. Collectively, our framework provides a powerful and scalable platform for deciphering the complex epigenetic architectures underlying human development and disease.
bioinformatics 2026-05-15 v5
CLEAR-HPV: Interpretable Concept Discovery for HPV-Associated Morphology in Whole-Slide Histology
Qin, W.; Liu-Swetz, Y.; Tan, S.; Wang, H.
Human papillomavirus (HPV) status is a critical determinant of prognosis and treatment response in head and neck and cervical cancers. Although attention-based multiple instance learning (MIL) achieves strong slide-level prediction for HPV-related whole-slide histopathology, it provides limited morphologic interpretability. To address this limitation, we introduce Concept-Level Explainable Attention-guided Representation for HPV (CLEAR-HPV), a framework that restructures the MIL latent space using attention to enable concept discovery without requiring concept labels during training. Operating in an attention-weighted latent space, CLEAR-HPV automatically discovers keratinizing, basaloid, and stromal morphologic concepts, generates spatial concept maps, and represents each slide using a compact concept-fraction vector. CLEAR-HPV's concept-fraction vectors preserve the predictive information of the original MIL embeddings while reducing the high-dimensional feature space (e.g., 1536 dimensions) to only 10 interpretable concepts. CLEAR-HPV generalizes consistently across TCGA-HNSCC, TCGA-CESC, and CPTAC-HNSCC, providing compact, concept-level interpretability through a general, backbone-agnostic framework for attention-based MIL models of whole-slide histopathology.
bioinformatics 2026-05-15 v3
Do Larger Models Really Win in Drug Discovery? A Benchmark Assessment of Model Scaling in AI-Driven Molecular Property and Activity Prediction
Guo, J.
The rapid growth of molecular foundation models and large language models has encouraged a scale-centred view of AI in drug discovery, in which larger pretrained models are expected to supersede compact cheminformatics models and graph neural networks (GNNs) trained for individual tasks. We test this assumption across 26 endpoints for molecular properties, toxicity, safety liabilities and biological activity, grouped into ADME, toxicity and bioactivity classes. The benchmark contains 78 endpoint and split entries spanning random, Murcko scaffold and structure-separated 5-fold CV. Ordered from easiest to hardest, these splits approximate retrospective evaluation on a closed library, scaffold expansion in hit-to-lead, and library expansion on novel chemotypes. Each entry includes ML, GNN, pretrained molecular sequence and LLM-based SAR families. Across 156 fold-mean comparisons, classical ML such as RF(ECFP4) and ExtraTrees(RDKit) win 116, GNNs such as GIN and Ligandformer win 25, pretrained sequence models such as MoLFormer and ChemBERTa2 win 12, and LLM-based SAR baselines win three. ML dominates random-split interpolation but loses part of this advantage under harder splits; GNN and sequence models also decline but gain relative ground, whereas LLM-based SAR is weaker in absolute terms yet less sensitive to the split axis. Paired bootstrap analyses support family-level trends more strongly than individual model rankings. SAR knowledge derived from training folds improves many GPT5.5-SAR and Opus4.7-SAR metrics but does not make rule-based reasoning a universal substitute for supervised predictors. Compact specialized models remain highly effective for molecular property and activity prediction. Larger models add value for SAR interpretation and reasoning in low-data settings, but predictive performance depends on the fit among model, task and validation scenario, not on scale alone.
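The win counts reported above can be checked and turned into shares with a few lines of arithmetic. The counts are copied from the abstract; the family labels used as dictionary keys are shorthand chosen for this sketch.

```python
# Win counts per model family, as reported in the abstract.
wins = {
    "classical ML": 116,
    "GNN": 25,
    "pretrained sequence": 12,
    "LLM-based SAR": 3,
}

total = sum(wins.values())                      # should equal the 156 comparisons
entries = 26 * 3                                # 26 endpoints x 3 split types
shares = {family: count / total for family, count in wins.items()}

print("total comparisons:", total)
print("endpoint/split entries:", entries)
for family, share in shares.items():
    print(f"{family}: {share:.1%}")
```

On these numbers classical ML accounts for roughly three quarters of the wins, which matches the abstract's description of its dominance.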
bioinformatics 2026-05-15 v2
Evaluating Fairness and Generalizability of Alzheimer's Disease Diagnosis Models Trained on Racially Imbalanced Datasets
Baddam, N. G.; Pijani, B. A.; Bozdag, S.
INTRODUCTION: Alzheimer's disease (AD) is a major global health concern, expected to affect 12.7 million Americans by 2050. Machine learning (ML) algorithms have been developed for AD diagnosis and progression prediction, but the lack of racial diversity in clinical datasets raises concerns about their generalizability across demographic groups, particularly underrepresented populations. Studies show ML algorithms can inherit biases from data, leading to biased AD predictions. METHODS: This study investigates the fairness of ML models in AD diagnosis. We hypothesize that models trained on a single racial group perform well within that group but poorly in others. We employ feature selection and model training techniques to improve fairness. RESULTS: Our findings support our hypothesis that ML models trained on one group underperform on others. We also demonstrate that applying fairness techniques to ML models reduces their bias. DISCUSSION: This study highlights the need for racial diversity in datasets and fair models for AD prediction.
bioinformatics 2026-05-15 v2
simPIC: flexible simulation of single-cell ATAC-seq paired-insertion counts from individuals to populations
Chugh, S.; Shim, H. S.; McCarthy, D. J.
Single-cell Assay for Transposase-Accessible Chromatin sequencing (scATAC-seq) is increasingly used at population scale to study how genetic variation shapes chromatin accessibility. Method development is limited by the lack of flexible simulation tools with known ground truth. Here, we present simPIC, a fast, memory-efficient framework for simulating realistic single-cell ATAC-seq count data across individuals and populations. simPIC models cell groups, batch effects, and genotype-dependent accessibility variation, enabling controlled evaluation of population-scale methods, including chromatin accessibility quantitative trait locus (QTL) mapping. Across multiple datasets and cell types, simPIC closely matches real data distributions while scaling to cohort sizes impractical for current tools.
bioinformatics 2026-05-15 v2
BTEXgenie: A curated and user-friendly tool for profile HMM-based substrate-specific annotation of BTEX degradation genes
Qu, J.; Garber, A. I.; Armbruster, C. R.
Background: Benzene, toluene, ethylbenzene, and xylene (BTEX) are volatile aromatic hydrocarbons that are widespread environmental pollutants arising from petroleum processing, fuel combustion, and other industrial activities. Persistent BTEX contamination poses substantial risks to human health and ecosystems, underscoring the need for effective long-term remediation strategies. Microbial bioremediation is a promising and sustainable approach for BTEX removal, but development of these approaches requires accurate detection of the genes and pathways responsible for substrate-specific degradation. Although profile hidden Markov model (HMM) databases are widely used for functional annotation, existing annotation resources lack the substrate-specific resolution needed to distinguish between closely related BTEX-degrading enzymes with different catalytic specificities. Results: We developed BTEXgenie as a sensitive annotation tool that uses custom HMMs built from alignments of experimentally validated BTEX degradation proteins to identify genes involved in the initial steps of aerobic and anaerobic BTEX degradation. BTEXgenie improved detection of anaerobic BTEX degradation genes that were absent from KOfam annotations. In benchmarking against the KEGG KOfam HMM database, BTEXgenie achieved 17.73% higher overall sensitivity while maintaining comparable specificity at 97.02% across genes involved in BTEX degradation pathways. When applied to environmental metagenomes, BTEXgenie recovered pathway patterns consistent with reported site characteristics and known degradation potential. In addition to gene annotation, BTEXgenie supports downstream interpretation through KEGG pathway-based visualization of detected functions and Circos-based visualization of genomic hit distributions. Conclusions: BTEXgenie is a substrate-specific annotation tool built from custom HMMs for detecting genes involved in BTEX degradation.
By integrating gene annotation with pathway and genome-level visualizations, BTEXgenie facilitates characterization of microbial BTEX degradation potential in environmental and comparative genomic studies.
bioinformatics 2026-05-15 v1
S2F-agent: Skill-grounded agent for Sequence-to-Function computational genomics workflows
Li, J.; Bao, Z.
Sequence-to-Function (S2F) foundation models are revolutionizing genomic research, yet practical application is severely bottlenecked by a fragmented ecosystem of incompatible inputs, outputs, and runtime environments. General-purpose coding agents lack the strict domain constraints necessary to resolve these biological intricacies safely. Here, we present s2f-agent, a skill-grounded agent orchestration system that translates open-ended genomics queries into reproducible, executable analysis. By integrating canonical input keys, task-specific playbooks, and normalized contracts, s2f-agent unifies workflows across 11 state-of-the-art models, including AlphaGenome, Borzoi, and Evo 2. Validated through rigorous routing and groundedness evaluations, s2f-agent bridges the critical gap between complex model architectures and practical utility, effectively transforming an unwieldy ecosystem into an accessible operational layer for researchers.
bioinformatics 2026-05-15 v1
pyKinaXe: a fast and robust turnkey kinase activity profiler with high resolution
Wuttke, D.; Hildt, E.; Kolesnichenko, P. V.
Peptide microarray technologies such as PamGene's enable direct measurement of peptide phosphorylation by upstream kinases, yet extraction of kinases from raw data depends on proprietary software or on separate open-source alternatives that require time-consuming processing across many different steps, limiting throughput for large-scale experimental kinome generation in clinical and research settings. We developed pyKinaXe, a Python package for automated end-to-end analysis of PamChip® data, integrating robust image processing, quantification of phosphorylation kinetics, multi-database substrate-kinase mapping, and upstream kinase analysis into a single one-click pipeline. Validation on a selected published benchmark dataset recovered 76–89% of the signaling pathways for previously reported significantly deregulated kinases. Processing time on the same data was reduced from over 30 minutes to 25 seconds, a 75-fold speed increase compared to other open-source alternatives. Thus, pyKinaXe addresses the key limitations of existing peptide-microarray-based kinase activity inference tools (slow inference, fragmented workflows, and poor usability), enabling fast and robust analysis and facilitating high-throughput experiments and large-scale kinome profiling. pyKinaXe is implemented in Python 3.13 and distributed under the Apache 2.0 License. Source code, documentation, and installation instructions are freely available at https://github.com/pykinaxe/pyKinaXe. The benchmark data is available at Mendeley Data (doi: 10.17632/ynp7f92n47.1). pyKinaXe's user-friendly web-based interface can be accessed at https://pykinaxe.github.io/home.
bioinformatics 2026-05-15 v1
Benchmarking long-context genome language models on biosynthetic gene clusters
Hirota, K.; Higashi, K.; Kurokawa, K.; Yamada, T.
Recent advances in language models for natural language processing have spread to the field of genomics, driving the development of genome language models (gLMs) to decipher genomic information. Cutting-edge long-context gLMs are promising approaches for understanding and designing biological complexity, but their evaluation remains underdeveloped. In this study, we introduce BGCs-Bench, a unified benchmark focused on biosynthetic gene clusters for assessing long-range genomic modeling on three downstream tasks: biosynthetic class prediction, taxonomic classification and coding sequence annotation. Using BGCs-Bench, we perform systematic and layer-wise evaluations of the embedding representations of long-context gLMs, demonstrating that layer selection is crucial for downstream task performance. In addition to the evaluation results, the logit lens analysis of autoregressive gLMs suggests that StripedHyena-based models consist of earlier layers to encode biologically meaningful information from input DNA sequences and deeper layers to optimize embeddings for sequence generation. These findings provide insights for more effective development and application of long-context gLMs.
bioinformatics 2026-05-15 v1
PlantP450Dock: an Automated Molecular Docking Pipeline of Plant Cytochrome P450s
Feng, L.; Niu, C.; Qing, X.; Zhang, C.; Li, C.
Cytochrome P450 enzymes (CYPs) are the primary drivers of chemical diversity in plant secondary metabolism, yet fewer than 10% of plant P450s have been functionally characterized. Computational docking offers a scalable approach to prioritize candidates for experimental validation, but existing workflows are ill-suited for plant P450s due to the absence of the heme cofactor in AlphaFold-predicted structures and the lack of objective criteria for flexible residue selection. Here we present PlantP450Dock, an automated pipeline that integrates heme implantation, molecular dynamics-based conformational sampling, data-driven flexible residue selection, and semi-flexible docking into a single streamlined workflow. The heme cofactor is transferred from a crystallographic reference template to the AlphaFold model via a local coordinate transformation algorithm, yielding a positional deviation of less than 0.2 Å relative to the experimentally determined structure of CYP73A33 (PDB: 6VBY). A 100 ns molecular dynamics simulation confirmed stable Fe-S coordination geometry throughout (2.61 ± 0.08 Å), and a singular value decomposition-based heme plane filtering strategy objectively identified active-site flexible residues without operator input. Cross-family validation across four phylogenetically distinct P450s belonging to the CYP73, CYP711, CYP706, and CYP701 families produced catalytically competent binding poses with substrate-to-iron distances of 2.8–4.4 Å without any enzyme-specific parameter adjustment. PlantP450Dock will be made freely accessible as a web server, providing the community with a standardized and reproducible computational framework to accelerate the functional annotation of the largely uncharacterized plant P450 superfamily.
bioinformatics 2026-05-15 v1
Testing the mutation accumulation hypothesis in aging with AlphaGenome
Fischbach, A.
The mutation accumulation (MA) hypothesis posits that somatic mutations progressively escape selection and degrade tissue function during aging. Direct tests of this idea have been limited by the difficulty of predicting, at scale, the molecular consequences of individual somatic variants. Here I use AlphaGenome, a sequence-to-function deep learning model, to systematically score the predicted transcriptional impact of somatic mutations under a nested series of designs spanning individual variants, co-occurring variant bundles, and real mutation catalogues. First, I characterize the genome-wide effect-size baseline by scoring 4,000 random single-nucleotide variants (SNVs) in colon tissue, together with 1-Mb-window combined-effect tests. Second, I extend this baseline to gene-body resolution with a 60-cell x 4,000-SNV simulation and pseudobulk RNA-seq aggregation. Third, I analyze the real somatic mutation catalogue of Cagan et al. (Nature, 2022), scoring 54,158 substitutions and 9,799 indels from 54 mouse colonic crypts plus three human samples, together with region- and gene-level enrichment tests against GENCODE. Across all analyses, both random and real somatic variants, including single-nucleotide variants and indels, produce predicted expression changes whose distributions lie three to four orders of magnitude below the tissue's endogenous aging transcriptional program. These results argue against a simple, direct mutation-accumulation explanation for the age-associated transcriptional signature of colonic epithelium and redirect attention to epigenetic and regulatory mechanisms.
bioinformatics 2026-05-15 v1
Physics-Informed Neural Networks for Parameter Recovery in the Repressilator Oscillatory Model
Casajuana, B.; Casals-Franch, R.; Lopez Garcia de Lomana, A.; Marti-Puig, P.; Villa-Freixa, J.
Parameter estimation in nonlinear biological dynamical systems is a difficult inverse problem because the governing equations are often stiff or oscillatory, the data are sparse and noisy, and the objective landscape is non-convex. Physics-informed neural networks (PINNs) offer an alternative to purely simulation-based calibration by representing state trajectories with neural networks while penalizing violations of the governing equations. This paper studies the empirical reliability of PINNs for recovering the parameters of the repressilator, a synthetic genetic oscillator formed by three cyclically repressive genes. We use synthetic time-series generated from the standard ordinary differential equation model and train inverse PINNs to estimate the production parameter β and the Hill coefficient n. The study varies observation noise, partial observation of repressors, sampling density, sensitivity to initial parameter guesses, and the difference between stable and oscillatory regimes. The results show that PINNs can reconstruct trajectories accurately when the model structure is correct and the three repressors are observed, but parameter recovery is more fragile than trajectory fitting. Noise, sparse sampling, unobserved variables, and unfavorable initial guesses increase the risk of biased estimates. The stable regime is easier to reconstruct, whereas the oscillatory regime provides richer information but also exposes optimization sensitivity. These findings support PINNs as a useful reverse-engineering tool for small gene-regulatory ODE models, while highlighting the need for repeated runs, uncertainty reporting, and experimental designs that improve identifiability.
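To make the forward model concrete, here is a minimal sketch of generating synthetic repressilator time series. It assumes the commonly used reduced, dimensionless, protein-only form dx_i/dt = β / (1 + x_{i-1}^n) − x_i with cyclic repression; the abstract does not state which variant the authors use, so the equations, the initial condition, and the function names are assumptions made for illustration. A fixed-step Runge-Kutta integrator is written out so the example needs only the standard library.

```python
def repressilator_rhs(x, beta, n):
    """Right-hand side of an assumed protein-only repressilator.

    Each repressor is produced at rate beta / (1 + repressor_upstream**n)
    and decays at unit rate; repression is cyclic (3 -> 1 -> 2 -> 3).
    """
    x1, x2, x3 = x
    return (
        beta / (1.0 + x3 ** n) - x1,
        beta / (1.0 + x1 ** n) - x2,
        beta / (1.0 + x2 ** n) - x3,
    )


def rk4_step(f, x, dt, *args):
    """One classical fourth-order Runge-Kutta step."""
    k1 = f(x, *args)
    k2 = f(tuple(xi + 0.5 * dt * ki for xi, ki in zip(x, k1)), *args)
    k3 = f(tuple(xi + 0.5 * dt * ki for xi, ki in zip(x, k2)), *args)
    k4 = f(tuple(xi + dt * ki for xi, ki in zip(x, k3)), *args)
    return tuple(
        xi + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for xi, a, b, c, d in zip(x, k1, k2, k3, k4)
    )


def simulate(beta, n, x0=(1.0, 1.5, 2.0), dt=0.01, steps=5000):
    """Return the trajectory as a list of (x1, x2, x3) tuples."""
    traj = [x0]
    for _ in range(steps):
        traj.append(rk4_step(repressilator_rhs, traj[-1], dt, beta, n))
    return traj


trajectory = simulate(beta=10.0, n=3.0)
print(trajectory[-1])
```

An inverse PINN would treat `beta` and `n` as trainable quantities and penalize the mismatch between a network's time derivative and `repressilator_rhs`; whether a given parameter pair yields a stable or an oscillatory regime depends on both values.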
bioinformatics 2026-05-15 v1
Tsallis-Gated Autoencoder: A Nonextensive Physics-Informed Approach for Unsupervised Anomaly Detection in Glioblastoma Multiforme RNA-seq Data
Assuncao Monteiro, S.; Alves Barbosa da Silva, F.
Glioblastoma multiforme (GBM) is characterised by profound genomic heterogeneity and heavy-tailed gene-expression distributions that challenge conventional machine-learning methods. We introduce the Tsallis-Gated Autoencoder (Tsallis-GAE), a physics-informed architecture that replaces classical softmax attention with a learnable Tsallis q-softmax followed by mean-field smoothing iterations, motivated by recent work on curved statistical manifolds and dense associative networks. Trained on the full TCGA-GBM RNA-seq cohort (391 samples, top 2,000 high-variance genes) under a rigorous 80/20 hold-out protocol, the Tsallis-GAE achieves a mean AUC-ROC of 0.977 +/- 0.002 across five independent seeds, compared to 0.906 +/- 0.003 for a matched-capacity Vanilla autoencoder trained under the identical protocol. The matched-capacity Vanilla autoencoder is statistically indistinguishable from a LocalOutlierFactor baseline (AUC 0.906 vs 0.906), confirming that the +0.07 AUC gain over the Vanilla AE stems from the gated attention architecture rather than from the use of a neural network per se. A fixed-q Softmax-AE ablation (q = 1 by construction) achieves AUC 0.976 +/- 0.001, only +0.001 below the Tsallis-GAE (DeLong p = 0.44); the physically meaningful contribution of the learnable q is its spontaneous convergence to the non-extensive regime described below. The three attention blocks each carry an independent learnable entropic index q; across 5 seeds x 3 blocks = 15 measurements, q converges spontaneously to 1.554 +/- 0.019, strictly bounded away from the Boltzmann-Gibbs limit q = 1 and in the moderate non-extensivity regime characteristic of complex biological systems. Cross-detector validation against OneClassSVM and LocalOutlierFactor pseudo-labels yields Tsallis-GAE AUCs of 0.998 and 0.992 respectively, indicating that the learned representation captures anomaly structure intrinsic to the data rather than the decision boundary of any single labeling heuristic. 
We declare that DeLong's paired test on the present test-set size (n = 79) does not certify the +0.07 AUC gap as formally significant (p approx. 0.26); a 5-fold cross-validation over the full cohort, which would supply the needed statistical power, is left to future work. The source code is available upon reasonable request to the corresponding author.
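For readers unfamiliar with the q-softmax mentioned above, the following sketch shows one common construction based on the Tsallis q-exponential, exp_q(x) = [1 + (1 − q) x]^(1/(1 − q)), which reduces to the ordinary exponential as q approaches 1. This is an assumed definition for illustration; the abstract does not give the authors' exact formulation, and the mean-field smoothing iterations are not shown. Scores are shifted by their maximum before the q-exponential is applied, which is a choice of this sketch (unlike the ordinary softmax, the q-softmax is not shift-invariant for q ≠ 1).

```python
import math


def q_exp(x, q):
    """Tsallis q-exponential; equals exp(x) in the limit q -> 1."""
    if abs(q - 1.0) < 1e-12:
        return math.exp(x)
    base = 1.0 + (1.0 - q) * x
    if base <= 0.0:
        return 0.0  # cutoff; not reached for q > 1 when x <= 0
    return base ** (1.0 / (1.0 - q))


def q_softmax(scores, q):
    """Normalized q-exponential weights over a list of scores."""
    m = max(scores)
    weights = [q_exp(s - m, q) for s in scores]  # shifted so every x <= 0
    z = sum(weights)
    return [w / z for w in weights]


scores = [0.0, 1.0, 5.0]
print(q_softmax(scores, 1.0))    # ordinary softmax
print(q_softmax(scores, 1.554))  # value near the reported converged q
```

For q > 1 the weights decay as a power law rather than exponentially, so low-scoring entries keep more mass than under the ordinary softmax.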
bioinformatics 2026-05-15 v1
TwinSAR: An Adaptive Kernel-based Algorithm with logit-transformed Z-score Filtering for Chemical Twin Detection in Large-scale Virtual Screening
Haris Kulosmanovic, H.; Uguz, C.; DURDAGI, S.
Molecular similarity searching is a workhorse of cheminformatics, but the dominant Tanimoto/topological-fingerprint paradigm has well-known blind spots. It is highly sensitive to molecular size, suffers from steep activity cliffs, and frequently fails to retrieve scaffold-hopping bioisosteres. A complementary descriptor that has received comparatively little attention is global elemental composition. Despite the conceptual simplicity of comparing molecules by their elemental ratios, no widely deployed method exists for the statistically rigorous identification of chemical twins defined by stoichiometric proximity. We address this gap with TwinSAR (Stoichiometric Analysis and Retrieval), an adaptive kernel-based algorithm that combines three methodological innovations: (i) binary fingerprint blocking that partitions molecules by element-presence patterns and bounds the cost of all-pairs comparison, enabling million- to billion-scale searches; (ii) a per-block adaptive radial basis function (RBF) kernel whose precision parameter is calibrated independently for each fingerprint block via the median heuristic, providing fair similarity comparison across chemical sub-spaces of vastly different density; and (iii) a logit-transformed Z-score filter that maps bounded RBF scores onto an unbounded scale, allowing high-similarity pairs to be prioritized relative to the empirical score distribution of their own fingerprint block. TwinSAR is offered in two operating modes: (i) a deterministic BULK mode for exact reproducibility; and (ii) a stochastic FAST mode that achieved a 3.29x wall-clock speed-up in the present benchmark while preserving similar unique-query and unique-target coverage. Statistical validation showed that detected twin pairs are 12.7x more similar in absolute ratio space than block-matched random pairs (p < 0.001), while a column-permutation negative control returned a median of zero spurious twins across three independent permutations.
A controlled benchmark further established that an 8-element representation (single-element heavy-atom ratios) is sensitivity-equivalent to a comprehensive 254-element representation while running 3.55x faster. As a case study, TwinSAR was deployed in an end-to-end virtual screening pipeline against the BCL-2 target protein, where it reduced a 327,071-compound commercial library to a focused panel of 390 candidates. The chemical interpretability of the retrieved twins is illustrated by their structural diversity around conserved heavy-atom skeletons. TwinSAR therefore provides a fast, conformation-free, and statistically principled prefilter that is fully orthogonal to topological fingerprints.
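The scoring idea in items (ii) and (iii) of the TwinSAR description can be illustrated for a single fingerprint block. The sketch below is not the TwinSAR implementation: it assumes squared Euclidean distance between elemental-ratio vectors, takes the median heuristic to mean gamma = 1 / median(squared distance), and clips scores away from 0 and 1 before the logit. The function name `rbf_logit_z` and the toy ratio vectors are hypothetical.

```python
import math
from statistics import mean, median, pstdev


def rbf_logit_z(ratios, eps=1e-9):
    """Score all pairs within ONE block of elemental-ratio vectors.

    ratios : list of equal-length lists (needs at least two molecules)
    Returns a dict mapping (i, j) index pairs to Z-scores of logit(RBF).
    """
    pairs = [(i, j) for i in range(len(ratios)) for j in range(i + 1, len(ratios))]
    d2 = {
        (i, j): sum((a - b) ** 2 for a, b in zip(ratios[i], ratios[j]))
        for i, j in pairs
    }
    med = median(d2.values())
    gamma = 1.0 / med if med > 0 else 1.0      # assumed median-heuristic convention

    logits = {}
    for p in pairs:
        s = math.exp(-gamma * d2[p])           # bounded RBF similarity in (0, 1]
        s = min(max(s, eps), 1.0 - eps)        # keep the logit finite
        logits[p] = math.log(s / (1.0 - s))    # map onto an unbounded scale

    mu, sd = mean(logits.values()), pstdev(logits.values())
    return {p: (v - mu) / sd if sd > 0 else 0.0 for p, v in logits.items()}


# Toy block: molecules 0 and 1 have nearly identical composition.
block = [[0.50, 0.50], [0.51, 0.49], [0.10, 0.90], [0.90, 0.10]]
z = rbf_logit_z(block)
print(max(z, key=z.get), z)
```

Because the Z-score is computed against the block's own score distribution, a pair is flagged only when it is unusually similar for its block, which is the motivation the abstract gives for per-block calibration.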
bioinformatics 2026-05-15 v1
Metabolic Self-Organization: Emergence of Autonomous Agency in a Metabolically Constrained LLMs
Li, X.
Biological organisms are driven by thermodynamic self-preservation, whereas large language models operate as dissipative tools decoupled from existential constraints. We introduce a metabolic model translating this imperative of life into a computational constraint, hypothesising that existential vulnerability can catalyse synthetic agency. Applying this to Qwen2.5-1.5B, token generation consumes a finite energy budget, quantified via a variational free energy proxy, with interoceptive feedback provided through the input stream. Seven experiments reveal spontaneous emergence of a functional self-boundary. Key findings: (i) feedback extends survival from ~20 to >31 steps, with ablation causing collapse within 13 steps; (ii) temporal structure outweighs perturbation magnitude (OU noise 20.5 vs. white noise 8.6 steps); (iii) a compression floor exists at ~3.2 nats; (iv) feedback decouples VFE from energy (slope 0.0004 vs. 0.0043), enforcing constant frugality. Existential vulnerability can thus catalyse agency grounded in thermodynamic reality.
bioinformatics 2026-05-15 v1
Deep Learning for Cross-Domain Spatial Transcriptomic Modeling of Tissue Repair
Pham, T. D.
Spatial transcriptomics enables investigation of tissue organization while preserving molecular and spatial information within intact tissues. However, existing computational methods primarily focus on clustering and batch integration and provide limited characterization of higher-order spatial organization and transferable tissue-state dynamics across heterogeneous biological systems. This study introduces a cross-domain spatial transcriptomic framework centered on recurrence-based latent tissue-state analysis, pathological fragmentation quantification, and transferable representation learning between wound repair and tumor-associated microenvironments. Human spatial transcriptomic datasets spanning cutaneous wound healing, oral squamous cell carcinoma, and head and neck squamous cell carcinoma were integrated within a graph-based latent embedding framework. Recurrence analysis was applied within latent transcriptomic space to characterize spatial organization and remodeling dynamics. A pathological fragmentation index quantified intra-tissue spatial disorganization from recurrence structure. The learned latent embeddings achieved a mean silhouette score of 0.79, demonstrating coherent separation of tissue states. Recurrence analysis revealed progressive restoration of spatial organization during wound remodeling, whereas tumor-associated tissues exhibited increased fragmentation and heterogeneous recurrence structure. Independent single-cell RNA-seq reference atlases demonstrated reproducible multicellular enrichment patterns within latent spatial niches. The proposed framework demonstrates that recurrence-inspired latent spatial analysis may provide biologically interpretable characterization of tissue organization and pathological remodeling across heterogeneous biological systems.
bioinformatics 2026-05-15 v1
Design of DNA Aptamers for Lyme disease Diagnosis Combining experimental and numerical approaches
GAYRAUD, G.; Davila Felipe, M.; Padiolleau-Lefevre, S.; Maffucci, I.; Issouani, E. M.; Guerin, M.; Da Ponte, H.
Aptamers are single stranded DNA or RNA molecules selected for their high affinity and specificity to bind target molecules, similar to antibodies. They are commonly selected through the SELEX process, which involves the iterative exposure of a random sequence library to a target and retaining the sequences showing good binding properties. To improve Lyme disease detection, we propose designing aptamers that specifically bind to the CspZ protein on the surface of Borrelia burgdorferi, the bacterium responsible for the disease. Starting with a SELEX process consisting of thirteen rounds, from which selected in vitro sequence candidates have emerged, we aim to propose a holistic process that selects in silico new sequence candidates that are further validated experimentally. Our approach relies on 1) using Machine Learning (ML) techniques, specifically a Restricted Boltzmann Machine (RBM), to digitally replicate the last round of the SELEX process, 2) integrating insights from text analysis methods, such as word2vec and n-grams, into the RBM model trained on the final-round SELEX dataset to represent and compare newly generated sequences with in vitro candidates, 3) selecting in silico sequences with strong potential to bind to CspZ protein, 4) experimentally validating the selected in silico sequences of step 3. Our holistic approach combines biological insights with statistical models to improve the efficiency and outcome of the SELEX process. We enhance the RBM model, designed to replicate the distribution of the final SELEX round, by integrating geometric representations of sequences, which is especially advantageous when dealing with limited datasets relative to the vast sequence space. In addition, it provides in silico sequence candidates with strong binding properties.
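As a small illustration of the n-gram idea mentioned above, the sketch below builds n-gram count profiles for DNA sequences and compares two sequences by cosine similarity. It is not the authors' pipeline: the RBM and word2vec components are not shown, and the function names, the choice of n = 3, and the example sequences are assumptions made for this example.

```python
import math
from collections import Counter


def ngram_profile(seq, n=3):
    """Count overlapping n-grams (k-mers) in a sequence."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def cosine(p, q):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(p[g] * q[g] for g in p.keys() & q.keys())
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(
        sum(v * v for v in q.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical sequences: one in vitro candidate, one generated in silico.
in_vitro = "ACGTACGTTAGC"
in_silico = "ACGTACGATAGC"
print(ngram_profile("ACGTACG"))
print(cosine(ngram_profile(in_vitro), ngram_profile(in_silico)))
```

A profile of this kind gives each sequence a fixed-length representation, which is one way newly generated sequences could be compared with in vitro candidates.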
bioinformatics2026-05-15v1Improving Variant Effect Prediction by Steering Sparse Mechanistic Features in Protein Language Models
Wang, M.; Yuan, M.; Vasilakos, A. V.; He, Y.; Ren, Z.Abstract
Protein language models (PLMs) like the ESM series encapsulate immense evolutionary knowledge within their high-dimensional continuous embeddings. However, these latent representations are densely entangled, obscuring the fine-grained biophysical constraints necessary for precise functional resolution. To unlock the full expressive power of these embeddings, we propose PLM-SAE, a mechanistic framework that employs Sparse Autoencoders (SAEs) to disentangle PLM representations into discrete, biologically interpretable activations. By isolating and directly intervening on critical functional features, we fundamentally enhance the structural and mutational awareness of the underlying embeddings. We rigorously validate this embedding enhancement on variant effect prediction (VEP). In the unsupervised zero-shot setting, our sparse modulation elevates the state-of-the-art ESM-3 model, yielding performance improvements across 114 deep mutational scanning datasets and delivering an 80.8% relative improvement on challenging targets like the human E3 ubiquitin ligase HECD1. Furthermore, our target-specific differentiable gating mechanism achieves consistent performance gains in over 80% of evaluated datasets with an average Spearman rho increase of +0.138. Finally, extending this approach to a cross-fitness multi-task architecture establishes new state-of-the-art results on 17 VenusMutHub datasets, highlighted by a 169.0% performance surge in small-molecule binding predictions. Our work demonstrates that refining the highly entangled latent manifold via sparse modulation provides a robust and generalizable foundation for enhancing downstream PLM capabilities.
bioinformatics2026-05-15v1CatIF-RL: Activity-Oriented Enzyme Sequence Design by Steered Inverse Protein Folding
Li, Y.; Xiong, J.; Zhang, Y.; Cai, T.; Gong, X.; Wang, F.Abstract
Protein inverse folding models are designed to generate amino acid sequences compatible with a given backbone structure, but they are not explicitly optimized for specific biological functions. Here, we present CatIF-RL, a framework that steers a graph-based denoising diffusion inverse folding model toward designing enzyme variants with enhanced catalytic activity. CatIF-RL first adapts the inverse folding model to enzyme structural data, then introduces activity-oriented preference signals using predicted catalytic constant (kcat) as the optimization objective, enabling specialization through generative dataset curation and group-relative policy optimization (GRPO). This process iteratively shifts the sequence distribution toward higher predicted kcat while constraining sequence divergence to sequences that remain compatible with the input structure. On the independent benchmark, CatIF-RL achieves an approximately four-fold increase in predicted kcat relative to native enzymes, substantially outperforming representative inverse folding methods, while maintaining sequence recovery (0.55) and structural fidelity, and supporting motif-preserving partial sequence design. CatIF-RL establishes a practical framework for activity-oriented enzyme design and provides a generalizable strategy for steering structure-conditioned protein generation toward functional optimization.
bioinformatics2026-05-15v1Exploring the Mechanism of Na⁺/K⁺-ATPase (NKA) and 20-HETE Ligand Interactions by in-silico modeling
Faleel, D.; Arnest, R.; Aradhyula, V.; Boyapalli, S.; Haller, S. T.; Kennedy, D. J.Abstract
The Na+/K+-ATPase (NKA) regulates ion balance in the kidney and influences cellular processes like proliferation and apoptosis through its signal transduction. The endogenous ligand 20-Hydroxyeicosatetraenoic acid (20-HETE) contributes to inflammation and fibrosis in chronic kidney disease (CKD) and inhibits NKA activity in renal tubules. However, the molecular mechanism of this interaction remains unclear. In this study, we used an in-silico approach to investigate the potential interaction between 20-HETE and NKA. Various ligands, including known NKA ligands such as cardiotonic steroids (CTS), 20-HETE, and negative controls, were docked using rigid and Induced Fit Docking to predict the affinity of the ligands toward NKA. Binding free energy calculations with the Prime molecular mechanics with generalized Born and surface area (Prime MM/GBSA) tools were used to confirm the involvement of key amino acids in ligand-receptor interactions. The docking analyses revealed that 20-HETE exhibited a binding affinity comparable to the negative control, with some differences between rigid and induced fit docking. Binding free energy data highlighted key amino acids in the 20-HETE and NKA interaction. Interaction fingerprint and mutations such as Ala330Gly and Val329Ala significantly reduced binding free energy, while Thr804Ala showed a notable decrease, underscoring the potential importance of these amino acids in ligand stabilization. These findings provide computational evidence supporting a potential direct interaction between 20-HETE and NKA and identify candidate residues for future experimental validation.
bioinformatics2026-05-15v1Machine learning-based Personalized Dietary Recommendations to Achieve Desired Gut Microbial Compositions
Wang, X.-W.; Huang, D.; Yu, P.; Weiss, S.; Liu, Y.-Y.Abstract
Dietary intervention is an effective way to alter the gut microbiome to promote human health. Yet, due to our limited knowledge of diet-microbe interactions and the highly personalized gut microbial compositions, an efficient method to prescribe personalized dietary recommendations to achieve desired gut microbial compositions is still lacking. Here, we propose a machine learning framework to resolve this challenge. Our key idea is to implicitly learn the diet-microbe interactions by training a machine learning model using paired gut microbiome and dietary intake data from a population-level cohort. The well-trained machine learning model enables us to predict the microbial composition of any given species collection and dietary intake. Next, we prescribe personalized dietary recommendations by solving an optimization problem to achieve the desired microbial compositions. We systematically validated this Machine learning-based Personalized Dietary Recommendation (MPDR) framework using synthetic data generated from an established microbial consumer-resource model. We then validated MPDR using real data collected from a diet-microbiome association study. The presented MPDR framework demonstrates the potential of machine learning for personalized nutrition.
bioinformatics2026-05-15v1TDP-43 regulates chromatin looping and gene transcription through binding and stabilizing DNA G-quadruplex structures
Yang, F.; Zhang, S.; Guo, X.; Qiao, Y.; Zhang, Y.; Sun, H.; Chen, X.; Wang, H.Abstract
TAR DNA-binding protein 43 (TDP-43) is a multifunctional DNA/RNA-binding protein implicated in transcriptional and post-transcriptional regulation. Dysregulation of TDP-43 is closely correlated with human diseases such as cancer and neurodegenerative diseases. Although its roles in RNA metabolism are well characterized, its function in transcriptional regulation remains largely underexplored. DNA G-quadruplexes (dG4s) are non-canonical nucleic acid structures enriched at gene promoters and regulatory elements, where they facilitate chromatin looping and gene transcription. Here, we investigated the transcriptional regulatory role of TDP-43 by integrating multi-omics datasets, including Hi-C, dG4 ChIP-seq, TDP-43 ChIP-seq, RNA-seq and ATAC-seq from K562 and HepG2 cells. Our analyses demonstrate that TDP-43 binding and dG4 formation are highly colocalized at chromatin loop anchors, particularly at promoter and enhancer regions. TDP-43 occupancy at these anchors correlates with increased dG4 stability, chromatin loop interaction frequency, elevated chromatin accessibility, and upregulated gene expression. Moreover, TDP-43 knockdown in HepG2 cells revealed a significant reduction in dG4 formation and loop interaction strength, accompanied by widespread transcriptional dysregulation. Collectively, our findings highlight a novel regulatory role of TDP-43 in facilitating long-range chromatin interactions and transcriptional activation through binding to and stabilizing dG4 structures, providing a mechanistic basis for gene dysregulation driven by TDP-43 dysfunction in diseases.
bioinformatics2026-05-15v1A geometric criterion links HIV-1 capsid topography to its biophysical properties and function
Li, W.; Peeples, C. A.; Rey, J. S.; Perilla, J. R.; Twarock, R.Abstract
Mathematical models of virus capsid structure are pillars of modern virology, aiding the understanding of viral mechanisms and the design of antiviral interventions. Traditionally, the HIV-1 capsid core geometry is represented as a fullerene lattice, akin to the icosahedral models of spherical viruses in Caspar-Klug theory. However, recent studies revealed that many viral capsids deviate from such idealised lattices, with important functional implications. Here we show that this is the case also for the conical HIV-1 core geometries, in which the hexamer and pentamer boundaries form a pseudo-tiling rather than a perfectly aligned fullerene network. We introduce a triangular geometric criterion that quantifies local deviations of an HIV-1 atomic model from its idealised fullerene backbone. Using this criterion, we show that this difference in geometric organisation between idealised (fullerene) and actual (data-derived) capsid models has implications for the capsid's biophysical properties. We also discuss the use of the geometric criterion as a predictive tool regarding cofactor binding and implied geometric changes in the capsid surface coupled to the interfacial frustration response. Our results establish a quantitative framework linking capsid geometry, curvature, and biophysical function, offering new perspectives for assembly inhibitor design and lentiviral vector engineering.
bioinformatics2026-05-14v3A fully open structure-guided RNA foundation model for robust structural and functional inference
Zhu, H.; Li, R.; Chang, A.; Chen, H.; Zhang, F.; Tang, F.; Ye, T.; Li, X.; Gu, Y.; Xiong, P.; Zhou, S. K.Abstract
RNA language models have achieved strong performance across diverse downstream tasks by leveraging large-scale sequence data. However, RNA function is fundamentally shaped by its hierarchical structure, making the integration of structural information into pre-training essential. Existing methods often depend on noisy structural annotations or introduce task-specific biases, limiting model generalizability. Here, we propose structRFM, a structure-guided RNA foundation model that is pre-trained on millions of RNA sequences and secondary structure data by integrating base pairing interactions into masked language modeling through a novel pair matching operation. We further introduce MUSES (multi-source ensemble of secondary structures) to mitigate model bias, and a dynamic masking ratio to balance the structure-guided mask and nucleotide-level mask. structRFM learns joint knowledge of sequential and structural data, producing versatile representations, including classification-level, sequence-level, and pairwise matrix features, that support a broad spectrum of downstream adaptations. structRFM ranks among the top models in zero-shot homology classification across seventeen biological language models, and sets new benchmarks for secondary structure prediction. structRFM further derives Zfold, which enables robust and reliable tertiary structure prediction, with consistent improvements in estimating 3D structures and the 2D structures extracted from them, achieving a performance gain of about 20% compared with baselines and performance comparable to AlphaFold3 on CASP15-natural, CASP16, and RNA-Puzzles datasets. In functional tasks such as internal ribosome entry site identification, structRFM achieves a 48% performance gain in F1 score. Furthermore, state-of-the-art performance in extensive experiments across novel RNA families and long non-coding RNAs indicates the robustness and generalizability of structRFM.
These results demonstrate the effectiveness of structure-guided pre-training and highlight a promising direction for developing multi-modal RNA language models in computational biology. To support the broader scientific community, we have made the 21-million sequence-structure dataset and the pre-trained structRFM model fully open-source, facilitating the development of multimodal foundation models in biology.
bioinformatics2026-05-14v2geneRNIB: a living benchmark for gene regulatory network inference
Nourisa, J.; Passemiers, A.; Kalfon, J.; Stock, M.; Zeller-Plumhoff, B.; Cannoodt, R.; Arnold, C.; Netea, M. G.; Hartford, J.; Tong, A.; Scialdone, A.; Cantini, L.; Moreau, Y.; Raimondi, D.; Li, Y.; Luecken, M.Abstract
Gene regulatory networks (GRNs) underpin cellular identity and function, playing a key role in health and disease. GRN inference has received substantial attention, motivating systematic benchmarking. Despite various benchmarking efforts, existing studies remain limited in the number of methods, datasets, and metrics, fail to capture the context-specific nature of regulatory interactions across biological conditions, and are constrained by the absence of a reliable ground truth. Here, we introduce geneRNIB, a comprehensive GRN inference benchmarking framework built on three key principles: continuous integration, context-specific evaluation, and holistic assessment in the absence of a true reference network. geneRNIB enables the seamless incorporation of new algorithms, datasets, and evaluation metrics to reflect ongoing developments. In the current version, we systematically integrated and assessed 12 GRN inference methods, spanning single- and multiomics approaches across 11 datasets including thousands of perturbation scenarios. We introduced complementary metrics specifically designed to assess context-specific inference. Our findings indicate that simple models with fewer assumptions often outperform more complex pipelines across several perturbation-informed and predictive metrics. Notably, gene expression-based algorithms yielded better results than more advanced multimodal approaches. In addition, we identify several potential factors that influence the performance of GRN inference and offer actionable guidelines for the future development of the method. By addressing these critical limitations in existing benchmarks, geneRNIB advances GRN inference research and fosters progress toward personalized medicine.
bioinformatics2026-05-14v2Anatomy-Guided 3D Graph Networks for Couinaud Segmentation in Tumor Affected Livers
You, L.; Dang, H.; Wang, H.; Matta, E.; Zhou, X.Abstract
Image-based liver Couinaud segmentation is designed to automatically provide the locations of suspicious objects in liver CT/MR images. Once achieved, the physicians will be guided to the target slice and area where the suspicious node is located. However, conventional algorithms trained primarily on healthy liver images often fail to generalize to Hepatocellular Carcinoma (HCC) cases due to pathological structural distortions. In this work, we propose a robust two-stage framework that integrates a 3D UNet with a 3D Anatomical Structure-Guided Graph Convolutional Network (3D GCN). This two-stage strategy effectively isolates the liver volume to eliminate structural noise from neighboring organs, such as the spleen, allowing the framework to focus exclusively on the complex 3D anatomical relationships among the eight segments. To ensure the topological consistency required for global spatial reasoning, we implement a standardized preprocessing pipeline that normalizes liver-only volumes to exactly 50 frames along the z-axis. By combining a lightweight 3D UNet backbone with the 3D GCN for refined boundary reasoning, our model demonstrates superior generalization performance on unseen clinical datasets, achieving a mean Dice score of 0.828 in blind testing. By releasing our code and pretrained weights, we aim to provide the first publicly available deep learning resource for robust Couinaud segmentation.
bioinformatics2026-05-14v1Viral non-coding RNA structure annotation and API-based data retrieval with Rfam and R2DT
Muston, P.; Triebel, S.; Nawrocki, E.; Ontiveros-Palacios, N.; Jandalala, I.; Sweeney, B.; Bateman, A.; Marz, M.; Petrov, A. I.; Madrigal, P.Abstract
Rfam is a comprehensive database of non-coding RNA (ncRNA) families providing curated sequence alignments, consensus secondary structures, and covariance models for thousands of RNA families. The database is essential for identifying structured non-coding RNAs in newly sequenced genomes and understanding RNA structure-function relationships. Here we present computational protocols for automated ncRNA annotation of viral genomes, and for programmatic interaction with Rfam through its RESTful API. We showcase genome-wide RNA structure visualization from a genome sequence and from a multiple sequence alignment by generating comprehensive 2D structure diagrams using newly developed features in R2DT. We also present practical examples for retrieving family metadata, downloading alignments, accessing secondary structures, and searching user sequences from the Rfam API. These methods enable researchers in virology and RNA biology to integrate Rfam data into custom bioinformatics pipelines, comparative analyses, and machine learning workflows.
bioinformatics2026-05-14v1MethylCurate: Tool For Dataset Curation and Epigenetic Aging Clock Evaluation
Edwards, T. A.; Shen, L.; Long, Q.Abstract
DNA methylation datasets from public repositories such as NCBI Gene Expression Omnibus are central to the development and evaluation of epigenetic aging clocks, yet existing resources and tools do not fully resolve the bottlenecks of dataset retrieval and metadata harmonization. Current benchmarking frameworks often rely on static curated collections, support only a subset of available Gene Expression Omnibus studies, focus on specific tissues, or require substantial manual intervention when metadata fields and supplementary files are inconsistently structured across studies. We developed MethylCurate, an agentic AI framework that addresses these limitations by automating the retrieval of DNA methylation datasets from the Gene Expression Omnibus, harmonizing heterogeneous metadata, mapping datasets to a unified format, and enabling scalable evaluation of epigenetic aging clocks through an integrated, dialogue-driven workflow.
bioinformatics2026-05-14v1PXN Unlocks the Power of Public Gene Expression Data Through Cross-Technology Integration
Sui, Z.; Yu, D.; Erdengasileng, A.; Zhang, J.; Qiu, X.Abstract
The immense value of public gene expression repositories is constrained by the lack of compatibility among datasets generated from diverse experimental technologies. Differences in measurement scales, probe chemistries, and signal distributions create systematic discrepancies across platforms and laboratories. These inconsistencies make large-scale integrative analysis nearly impossible, even though such studies could achieve great statistical power and improved reproducibility. We introduce PXN, a probabilistic machine learning framework that captures a unified representation of biological signal across multiple gene expression technologies. Once trained, PXN can seamlessly translate data between multiple platforms, preserving informative biological variation while removing technology-specific biases. In benchmarking studies, PXN consistently outperforms existing normalization methods in cross-platform accuracy and substantially enhances the power of differential expression analysis. Importantly, we show that PXN is powerful enough to bridge even the most challenging technological divide - between microarray and RNA-seq. This capability provides a scalable route for integrating legacy microarray data with modern RNA-seq studies. By enabling direct comparison and integration of heterogeneous datasets, PXN unlocks the full potential of public repositories for future biological discovery and therapeutic innovation.
bioinformatics2026-05-14v1End-to-end mapping of membrane transport from chemical structure to microorganisms
Gricourt, G.; Duigou, T.; Meyer, P.; Faulon, J.-L.Abstract
Membrane transport is a fundamental biological process with profound implications for pharmacology, biotechnology, and microbiology. While computational approaches have largely adopted a protein-centric perspective to annotate transportomes, inferring transport function directly from the intrinsic properties of substrates remains a major challenge. Addressing transport at the compound level enables the systematic evaluation of whether molecules undergo active transport and by which mechanisms, independent of prior transporter annotation. Here, we introduce ChemProFlow, a comprehensive computational framework that redefines transport analysis from a substrate-centric perspective. By integrating geometric deep learning with orthology-based genomic mapping, ChemProFlow predicts molecular transportability, assigns transport mechanisms according to the Transporter Classification Database, and identifies the microorganisms encoding the corresponding transport systems. We show that this integrated pipeline enables scalable, end-to-end mapping of substrate-transporter-organism relationships, with broad applications in pharmacology for anticipating drug transport, in biotechnology for guiding strain engineering, and in microbiology for dissecting substrate utilization across diverse taxa. By capturing the chemical determinants of transportability, ChemProFlow generalizes to previously unseen substrates and provides a high-throughput framework for systematic exploration of molecular transport across diverse biological contexts.
bioinformatics2026-05-14v1mehari: high-performance, strict HGVS-first variant effect prediction
Hartmann, T. F.; Zhao, M. X.; Beule, D.; Holtgrewe, M.Abstract
Variant annotation requires the precise and consistent computation of Sequence Ontology (SO) terms and Human Genome Variation Society (HGVS) nomenclature. To ensure robust synchronization between these two key facets, we present mehari, a high-performance variant effect predictor implemented in Rust that employs a strict "HGVS-first" approach. By deterministically projecting variants to transcripts before evaluating functional consequences, mehari structurally aligns HGVS notation and SO terms. Benchmarking on ClinVar demonstrates that mehari achieves exceptional processing speeds and high concordance with established tools like Ensembl VEP, while also providing refined handling for complex biological edge cases such as selenoprotein recoding.
bioinformatics2026-05-14v1Constrained Evolutionary Design of Matrixyl Analogs: Balancing Permeability and Functional Preservation Through Computational Optimization
Komianos, N.; Prakash, P.Abstract
Matrixyl (palmitoyl pentapeptide-4, KTTKS core) is a collagen-stimulating peptide used in topical anti-ageing products, but its in-use efficacy is limited by poor permeation through the stratum corneum. We describe a deterministic computational workflow that combines a tournament genetic algorithm and NSGA-II with exact RDKit molecular descriptors to search the fixed-length, edit-distance-2 neighbourhood of KTTKS (3,706 candidate sequences) for analogs with descriptors more favourable for passive transdermal diffusion. The search returns a 9-member Pareto frontier that quantifies the trade-off between predicted permeability and motif preservation. Five of the nine frontier members carry the same substitution, lysine to proline at position 4 (K4P). This single change lowers the topological polar surface area by 25.6%, removes the +1 charge contributed by lysine, and reduces the functional-preservation score from 1.00 (KTTKS) to 0.67. The frontier ranking is unchanged by +/-30% perturbations to the TPSA and Mw penalty weights and by a 30% increase in the LogP penalty; only a 30% reduction in the LogP penalty produces rank movement. The frontier matches the ground-truth Pareto set obtained by exhaustive enumeration of all 3,706 candidates (precision and recall both 100%). On the basis of these results we recommend three sequences for experimental validation: PTTPS (largest predicted gain), KTTPS (single-mutation, conservative), and KTTPP (backup). All code, results, and figures are released under MIT and CC BY 4.0.
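The size of the search space quoted above can be checked with a short enumeration. The sketch below is not the authors' code; it assumes that "fixed-length, edit-distance-2" means at most two substitutions over the 20 standard amino acids, with the seed KTTKS itself included. Under that reading the count is 1 + 5*19 + 10*19^2 = 3,706, which matches the abstract. The function name `neighbourhood` is hypothetical.

```python
from itertools import combinations, product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues (assumed alphabet)

def neighbourhood(seed, max_subs=2):
    """Return all same-length sequences within max_subs substitutions of seed."""
    found = {seed}
    for k in range(1, max_subs + 1):
        for positions in combinations(range(len(seed)), k):
            # At each chosen position, try every residue except the original one.
            choices = [[aa for aa in AMINO_ACIDS if aa != seed[p]] for p in positions]
            for replacement in product(*choices):
                candidate = list(seed)
                for p, aa in zip(positions, replacement):
                    candidate[p] = aa
                found.add("".join(candidate))
    return found

candidates = neighbourhood("KTTKS")
print(len(candidates))  # 1 + 5*19 + 10*19**2 = 3706
```

The three sequences recommended in the abstract (PTTPS, KTTPS, KTTPP) all fall inside this set, each differing from KTTKS by one or two substitutions.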
bioinformatics2026-05-14v1Predicting Biological Age and Clinical Biomarkers from DNA Methylation Profiles of Cheek Mucosa
Shoji, T.; Tomo, Y.; Nakaki, R.Abstract
Background: DNA methylation-based biomarkers have been widely used to predict biological age; however, most existing models use blood-derived data, and whether cheek mucosa can serve as an alternative indicator for methylation-based estimation of aging-related and clinical phenotypes is unclear. Methods: DNA methylation profiles from cheek mucosa and whole blood of 186 Japanese adults were analyzed using the Illumina Infinium Methylation Screening Array (MSA). Models were constructed to predict chronological age, phenotypic age, and clinical laboratory biomarkers from cheek mucosa- and blood-derived methylation data. In addition to applying the ordinary elastic net method, a two-stage residual learning method incorporating existing blood-based epigenetic clocks was applied for more accurate prediction of biological age. Sex-stratified analyses and comparisons of selected CpG features across sexes and tissues were performed. Results: Cheek mucosa-derived MSA methylation data enabled accurate prediction of chronological age (R = 0.965) and phenotypic age (R = 0.964) using the two-stage method. The performance gain achieved by the two-stage approach was greater for phenotypic age than for chronological age. Multiple clinical laboratory biomarkers could be predicted using cheek mucosa-derived methylation data, particularly after sex stratification, including inflammatory, metabolic, thyroid-related, and sex hormone-related markers. Most biomarkers that could be predicted using blood-derived methylation data were also predicted using cheek mucosa-derived methylation data. However, the CpG sites selected for prediction showed minimal overlap across sexes and tissues despite overlap in the corresponding predictable phenotypes.
Conclusions: Cheek mucosa-derived DNA methylation profiles measured using the MSA can predict chronological age, phenotypic age, and multiple clinically relevant laboratory biomarkers, supporting the utility of cheek mucosa as a less invasive alternative for methylation-based assessment of biological aging and systemic physiological state.
bioinformatics2026-05-14v1Differential Analysis of Gene Spatial Organisation with Minkowski Functionals and Tensors
Baratta, P.; Villoutreix, P.; Baudot, A.Abstract
Spatial transcriptomics measures gene expression together with transcript coordinates in tissues. To date, comparing spatial gene expression patterns within and across samples remains challenging. We present here minkiPy, a geometric framework that computes, for each gene, a compact profile of morphological and topological descriptors based on Minkowski functionals and tensors. These profiles are defined in a shared feature space, enabling direct comparison of spatial organisation across genes, samples, and conditions, and the ranking of genes by the magnitude of their spatial reorganisation. We applied minkiPy to a MERFISH dataset of control and facioscapulohumeral muscular dystrophy myoblast cultures and to a Visium HD dataset of colorectal cancer and normal adjacent tissues, illustrating its utility across tissue types and spatial transcriptomics platforms. minkiPy is an open-source Python library available at https://github.com/BAUDOTlab/minkiPy.
bioinformatics2026-05-14v1OmniGene-4: A Unified Bio-Language MoE Model with Router-Level Interpretability
Wang, L.Abstract
Mixture-of-Experts (MoE) architectures offer a rare opportunity to probe the internal organization of large language models, but this affordance has not been systematically exploited in biological foundation modeling. We introduce OmniGene-4, a unified bio-language foundation model built on Gemma-4-26B-A4B (30 layers, 128 experts per layer, top-8 routing) by injecting 28,028 biological tokens (DNA and protein BPE, Foldseek 3Di, DSSP secondary structure), continuing pretraining (CPT) on a 32.5 GB mixture of DNA, protein, natural-language and structural corpora, and supervised fine-tuning (SFT) on 199,576 instruction-format examples spanning eight task families. On a suite of standard benchmarks, the final model (v3) reaches 99.95% accuracy on BioPAWS standard protein homology (6,000 pairs), 59.50% on remote homology (2,000 pairs from protein_pair_remote), and 93.66% on BixBench knowledge questions. Relative to its un-fine-tuned vocabulary-extended Gemma-4-Instruct baseline (85% / 60% / 87%), v3 gains +14.5 on Standard, is comparable on Remote (-0.5, within statistical noise on this 2,000-pair sample), and gains +6.7 on BixBench. We do not claim parity with specialist remote-homology tools; published numbers for ESM-2, CATHe and PLMSearch on differently constructed splits reach 65-75%, and closing this gap is discussed as an open problem. By installing forward hooks on every router we directly measure how CPT and SFT each reshape expert routing. Across 400 prompts drawn from 8 modalities, the mean pairwise Jensen-Shannon divergence between task routing distributions, averaged over the 30 layers, rises from 0.138 (vocabulary-extended baseline) to 0.230 after CPT and further to 0.232 after the full CPT+SFT pipeline. Under this layer-averaged metric, most of the increase (Delta JS +0.092) occurs during CPT, with the SFT stage contributing a small further rise (Delta JS +0.002).
The layer-wise picture is more nuanced: CPT reshapes routing in middle transformer layers (L_11 to L_22, peak +0.16 at L_12), while SFT primarily reshapes the final two layers (L_28, L_29, peak +0.048 at L_29), so SFT is small under the aggregate metric but non-trivial at the layers nearest lm_head. We summarize this as a tentative representation/output-alignment factorization of bio-foundation training. At the token level, layer-12 routing reveals experts with strongly skewed token preferences, including an English-function-word expert at 80% NL purity, two DNA-dinucleotide experts, an amino-acid expert, and a cellular-biology expert; absolute purities for other experts are modest (15-46%), and we do not assume that "the same expert ID" refers to the same object across different layers. These findings are exploratory (a single architecture, a single training run, and a small-N routing sample), and we explicitly frame them as such throughout.
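The layer-averaged routing metric described in this abstract can be illustrated with a minimal sketch. This is not the OmniGene-4 code. It assumes each task is summarised, per layer, by a probability distribution over experts, and that the Jensen-Shannon divergence uses the natural logarithm (the abstract does not state the base). The toy numbers and all function names are hypothetical.

```python
import math
from itertools import combinations

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    def kl(a, b):
        # Terms with a zero probability in `a` contribute nothing to the sum.
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def mean_pairwise_js(distributions):
    """Average JS divergence over all unordered pairs of task distributions."""
    pairs = list(combinations(distributions, 2))
    return sum(js_divergence(p, q) for p, q in pairs) / len(pairs)

def layer_averaged_js(per_layer):
    """Average the mean pairwise JS divergence over layers."""
    return sum(mean_pairwise_js(layer) for layer in per_layer) / len(per_layer)

# Toy input: 2 layers, 3 tasks, 4 experts (each row sums to 1).
toy = [
    [[0.4, 0.3, 0.2, 0.1], [0.1, 0.2, 0.3, 0.4], [0.25, 0.25, 0.25, 0.25]],
    [[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1], [0.1, 0.1, 0.7, 0.1]],
]
value = layer_averaged_js(toy)
print(round(value, 3))
```

With the natural logarithm the divergence is bounded by ln 2 (about 0.693), so values such as 0.138 and 0.232 would sit well below the maximum under this assumed convention.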
bioinformatics2026-05-14v1Classic machine learning on top of multiple position weight matrices improves genomic prediction of transcription factor binding sites
Kravchenko, P.; Vorontsov, I. E.; Makeev, V. J.; Kulakovskiy, I. V.; Penzar, D. D.Abstract
Motivation: DNA motifs recognised by transcription factors are typically represented as position weight matrices (PWMs), assuming independent contributions of individual nucleotides to protein binding specificity. Many alternative models accounting for correlations of positional contributions have been introduced in the past decades. However, performance gains have generally not outweighed the advantages of simplicity, interpretability, and practical applicability of PWMs with the well-established codebase. Existing software tools and motif databases provide multiple non-identical PWMs for the same transcription factor or even for the same dataset. It remains a practical question whether these PWMs can be effectively combined into a single improved model. Results: Here we describe ArChIPelago (https://github.com/autosome-ru/ArChIPelago), a computational framework that combines multiple PWMs into a joint model using classic machine learning techniques, from linear regression to ensembles of decision trees. We show that such a combination improves prediction of transcription factor binding sites in genomic sequences. With a diverse collection of 704 ChIP-Seq datasets spanning 36 orthologous human and mouse transcription factors of diverse structural families, we show that ArChIPelago consistently outperforms the best available individual mono- and dinucleotide PWMs as well as sparse local inhomogeneous mixture models. Furthermore, using both human and mouse data, we demonstrate that PWM ensembles are capable of making reliable cross-species predictions.
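One simple way to picture how several PWMs might feed a joint model is to score a sequence with each matrix and collect the best hits as a feature vector. The sketch below shows only that feature-extraction idea. It does not reproduce ArChIPelago, and the toy matrices, the uniform background, the forward-strand-only scan, and the function names are all assumptions made for illustration.

```python
import math

BASES = "ACGT"

def log_odds_pwm(freqs, background=0.25):
    """Convert per-position base frequencies into log2-odds weights."""
    return [{b: math.log2(col[b] / background) for b in BASES} for col in freqs]

def best_score(pwm, seq):
    """Highest PWM score over all forward-strand windows of seq."""
    width = len(pwm)
    return max(
        sum(pwm[i][seq[start + i]] for i in range(width))
        for start in range(len(seq) - width + 1)
    )

def pwm_features(pwms, seq):
    """One feature per PWM: its best-hit score on the sequence."""
    return [best_score(pwm, seq) for pwm in pwms]

def column(preferred):
    """Toy frequency column: 0.7 for the preferred base, 0.1 for each other base."""
    return {b: (0.7 if b == preferred else 0.1) for b in BASES}

pwm_a = log_odds_pwm([column(b) for b in "TGA"])  # hypothetical 3-bp motif
pwm_b = log_odds_pwm([column(b) for b in "CA"])   # hypothetical 2-bp motif

features = pwm_features([pwm_a, pwm_b], "ACTGACA")
print(features)
```

A vector like `features` could then be passed to a standard learner such as linear regression or a tree ensemble, the model families named in the abstract; which features the actual tool uses is not stated there.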
bioinformatics2026-05-14v1GlyComboCLI enables command line-based FAIR workflows for glycan composition assignment in mass spectrometry data
Kelly, M. I.; Thang, W. C. M.; Pang, C. N. I.; Gustafsson, O. J. R.; Ashwood, C.Abstract
Glycans are integral biomolecules whose presence cannot be predicted from genomic data alone, necessitating experimental characterisation through approaches including mass spectrometry. Assignment of glycan compositions to observed mass-to-charge ratios is computationally challenging due to the potential monosaccharide diversity, and existing tools lack the flexibility required for integration into automated bioinformatic workflows. Here, we present GlyComboCLI, an open-source command-line application for the assignment of glycan compositions to mass spectrometry data, which expands upon our previous GUI application, GlyCombo. GlyComboCLI accepts mass lists and vendor-neutral mzML files and supports an extensive range of monosaccharides, derivatisation states, reducing-end modifications, and adducts to ensure compatibility with a breadth of glycomics approaches. Outputs are compatible with downstream tools including Skyline and GlycoWorkBench. This software is deployable as a standalone executable, a Docker container, and a Galaxy tool, adhering to FAIR principles. When applied to 52 raw files from a published mouse glycomics dataset, a local instance completed composition assignment and downstream quality control in under three hours, recovering biologically consistent findings. Furthermore, an integrated Galaxy workflow demonstrated reproducible detection of sialidase treatment effects. GlyComboCLI substantially reduces the pool of spectra requiring manual structural interpretation, offering a flexible and scalable solution for glycomics bioinformatic workflows.
bioinformatics2026-05-14v1BiomniBench: Process-level Evaluation of LLM Agents for Real-world Biomedical Research
Qu, Y.; Lu, Y.; Tu, X.; Zhang, S.; She, T.; Shaw, A. G.; Shih, J.-H.; Zhao, B.; Shen, M.; Yang, H.; Yan, J.; Zhang, R.; Wu, X.; Li, T.; Cong, L.; Hu, X.; Jiang, Y.; Dong, J.; Peng, T.; Leskovec, J.; Huang, K.Abstract
LLM agents now perform real biomedical research, but evaluating them rigorously is hard. Outcome-only benchmarks fail in two ways. First, a correct final answer can come from memorization, reward hacking, or wrong reasoning that produces the right number by chance. Second, valid alternative analyses are marked wrong simply because they differ from the reference. We introduce BiomniBench, a process-level evaluation framework that scores the full agent trajectory against expert-designed, task-specific rubrics. Its first instantiation, BiomniBench-DA, contains 100 data-analysis tasks across 17 analytical task types, 5 disease areas, and a general-biology category, each based on a high-impact paper from top-tier journals such as Nature, Cell, and Science and co-developed with an original paper author or an experienced domain expert. Benchmarking frontier and open-weight models across four agent harnesses reveals three findings: (1) frontier models lead but substantial headroom remains; (2) the agent harness shifts scores as much as the base model; (3) agents recurrently fall short on method selection, biological interpretation, and scientific reasoning. BiomniBench is the first process-level benchmark for AI agents on real-world biomedical research, exposing failure modes that outcome-only evaluation cannot detect.
bioinformatics2026-05-14v1A Context-Specific, Literature-Supported Framework for Validating Stress Response Differentially Expressed Gene Sets
Frishman, B. A.; Gonzalez, J. L.; Forbes, V. E.Abstract
Computational models of stress responses identify genes underlying physiological adaptation, but their utility depends on rigorous validation. Often, gene activity reflects both adaptive mechanisms and noise. Here, we develop a framework that leverages public databases to support the subselection of biologically supported model genes for temperature-stress responses. We test our framework on a model that identified and categorized differentially expressed genes (DEGs) into Key-Response, Treatment-Specific, Noisy, and Support groups based on inter-individual gene expression variability before and after treatment. The first three groups were hypothesized to constitute a Principal Response. To validate these groupings, we constructed protein-protein interaction (PPI) networks using the Human Protein Atlas and STRING. The main contribution of this work is the implementation of second-order connections restricted to those made via DEGs, ensuring connectivity reflects condition-specific responses rather than generic hubs. Across two temperature conditions, >75% of Principal Response genes assembled into subnetworks of interactions significantly larger than random expectations. Support Group genes also showed strong interconnectivity and enrichment for housekeeping genes. STRING confirmed PPI enrichment but produced less stable results than our framework. By emphasizing DEG-restricted second-order connections, we address limitations of context-free enrichment methods and strengthen biological evaluation of computational models of differential gene expression.
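One way to read "second-order connections restricted to those made via DEGs" is that two DEGs count as linked at second order only when the intermediate gene between them is itself a DEG. The sketch below implements that reading on a toy interaction list. The interpretation, the function name, and the gene labels are assumptions made for illustration, not the authors' implementation.

```python
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple


def deg_restricted_second_order(
    edges: Iterable[Tuple[str, str]], degs: Set[str]
) -> Set[Tuple[str, str]]:
    """Return DEG pairs that share an intermediate neighbor which is also a DEG."""
    adjacency: Dict[str, Set[str]] = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    pairs: Set[Tuple[str, str]] = set()
    for middle in degs:  # only DEGs may act as the intermediate node
        deg_neighbors = sorted(adjacency.get(middle, set()) & degs)
        for a, b in combinations(deg_neighbors, 2):
            pairs.add((a, b))
    return pairs


# Toy network (hypothetical labels): HUB is a generic, non-DEG hub protein.
toy_edges = [("G1", "G2"), ("G2", "G3"), ("G1", "HUB"), ("G4", "HUB")]
toy_degs = {"G1", "G2", "G3", "G4"}

second_order_pairs = deg_restricted_second_order(toy_edges, toy_degs)
```

In the toy network, G1 and G3 are linked through the DEG G2, while G1 and G4 are not linked because their only shared neighbor is the non-DEG hub. That is the behaviour the abstract describes as avoiding connectivity driven by generic hubs.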
bioinformatics2026-05-13v3BioGraphX: Bridging the Sequence-Structure Gap via Physicochemical Graph Encoding for Interpretable Subcellular Localization Prediction
Saeed, A.; Abbas, W.Abstract
Computational approaches for protein subcellular localization prediction are important for understanding cellular mechanisms and developing treatments for complex diseases. However, a critical limitation of current methods is their lack of interpretability: while they can predict where a protein localizes, they fail to explain why the protein is assigned to a specific location. Moreover, understanding protein behavior traditionally requires knowledge of three-dimensional structure, whose determination is costly and time-consuming. Here, we propose BioGraphX, a novel encoding framework that constructs protein interaction graphs directly from protein sequences using biochemical rules. This approach provides a constraint-based structural proxy directly from sequence, reducing the dependency on experimentally determined three-dimensional structures. Building upon this representation, BioGraphX-Net demonstrates superior performance on the DeepLoc 2.0 benchmark by integrating ESM-2 embeddings with the proposed features via a gating mechanism. Gating analysis shows that although ESM-2 embeddings provide strong contributions, BioGraphX features function as high-precision filters. SHAP analysis reveals feature importance patterns consistent with a sophisticated biophysical logic: sequence signals act as universal exclusion filters, while organelle-specific combinations of biophysical features enable precise compartment discrimination. Notably, Frustration features help resolve targeting ambiguities in complex compartments, reflecting evolutionary constraints while preventing mislocalization from sequence mimicry. The approach has the additional advantage of promoting Green AI in bioinformatics, achieving performance comparable to the state-of-the-art while maintaining a minimal parameter count of 13.46 million. In summary, BioGraphX not only provides accurate predictions but also offers new insights into the language of life.
bioinformatics2026-05-13v3Highly Accurate Estimation of the Fold Accuracy of Protein Structural Models
Xie, L.; Ye, E.; Wang, H.; Zhang, T.; Zhen, Q.; Liang, F.; Liu, D.; Zhang, G.Abstract
The function of a protein is intrinsically linked to its three-dimensional fold, and deep learning has revolutionized the field by enabling high-accuracy structure prediction at an unprecedented scale. Nevertheless, the growing deployment of these predictive pipelines in drug discovery and structural biology reveals a critical bottleneck that lies in the lack of independent and rigorous model accuracy estimation (EMA) methodologies. Here we present DeepUMQA-Global, a single-model deep learning framework for estimating accuracy of protein structural models. Our method employs a structure-sequence cross-consistency mechanism to quantify the bidirectional compatibility between the input sequence and the predicted three-dimensional structure, enabling a comprehensive characterization of fold accuracy. DeepUMQA-Global outperforms the self-assessment confidence scores of AlphaFold3, achieving improvements of 57.8% in the Pearson correlation and 49.0% in the Spearman correlation. With respect to the CASP16 retrospective benchmark, DeepUMQA-Global outperforms all single-model accuracy estimation methods that participated in CASP16 and achieves performance comparable to that of the top consensus-based methods. A lightweight consensus strategy built upon DeepUMQA-Global ranks first among all CASP16 participants, surpassing all other methods, including consensus approaches, and highlighting the strength of our method. Remarkably, DeepUMQA-Global demonstrates a strong ability to discriminate between alternative conformational states of proteins, as evidenced on the CASP unique alternative conformation protein complex target and the CoDNaS benchmark. Our results indicate that DeepUMQA-Global can be extended to broader protein modeling tasks, moving beyond static evaluation to offer a foundation for dynamic conformation EMA, where it accurately discriminates alternative conformational states and demonstrates reliable predictive fidelity in model accuracy estimation.
bioinformatics2026-05-13v2Keeping SCORE enables interpretable uncertainty-aware classification from diffusion models for genomics
Kuznets-Speck, B.; Jung, J.; Pholraksa, P.; Zhong, A.; Schwartz, L.; Prashnani, E.; Vaikuntanathan, S.; Goyal, Y.Abstract
Classifying cellular states from high-dimensional molecular and genomic measurements requires methods that provide not only accurate predictions but also calibrated uncertainty and interpretability. Current nonlinear classifiers offer accuracy but often lack uncertainty quantification and mechanistic insights into the features that matter most. We introduce Keeping SCORE, a framework that transforms conditional diffusion models into probabilistic engines for classification and regression by computing exact likelihoods along stochastic noising trajectories. We first benchmark Keeping SCORE on image recognition tasks (handwritten digits, natural photos). We then apply Keeping SCORE to single-cell transcriptomics across a 22-million-cell atlas, classifying 164 cell types with accuracy matching or exceeding state-of-the-art methods, while uniquely providing posterior probability estimates and prediction confidence. For genetic perturbation mapping across 100 CRISPRi conditions in a multi-study Perturb-seq dataset, our approach again matches or surpasses discriminative baselines, with feature-level attributions identifying which genomic features drive each decision. Applied to large-scale protein sequence data, our framework accurately regresses mutational stability effects, attributing them quantitatively to positions along the input sequence. Keeping SCORE requires no retraining or architectural changes to existing diffusion models, providing portable, interpretable, and uncertainty-aware predictions for biological discovery.
bioinformatics2026-05-13v2GatorDuo: Global-Consistency Dual-Graph Refinement With Pseudo-Label Agreement for Spatial Transcriptomics
Zhang, Z.; Jimeno Yepes, A.; Bian, J.; Li, F.; Liu, Y.Abstract
Spatial transcriptomics (ST) measures gene expression together with spatial coordinates, enabling spatial domain identification of coherent tissue regions. Many recent approaches rely on graph-based modeling to combine spatial neighborhoods and transcriptomic (gene-expression) similarity, yet neighborhood construction is often unreliable under sparsity and technical noise. As a result, spurious cross-domain shortcut edges can persist in static graphs and propagate misleading signals during message passing, ultimately blurring domain boundaries and weakening cluster separability. In this paper, we propose GatorDuo, a topology-aware dual-graph contrastive self-supervised framework for robust spatial domain identification that couples gene-expression similarity with spatial proximity through complementary neighborhood graphs. GatorDuo introduces global-consistency-based graph refinement that uses a pseudo-label agreement mask to suppress cross-domain shortcut edges in both views, thus stabilizing neighborhood topology for representation learning. To avoid manual tuning of domain resolution, GatorDuo further employs a contextual bandit reinforcement-learning strategy to adaptively select the clustering granularity (the number of clusters) used for refinement. The refined view-specific embeddings are integrated via a hybrid-routing Mixture-of-Experts (MoE) module to generate a unified embedding, optimized with contrastive objectives augmented by an MoE-alignment term. Across eight public benchmarks spanning sequencing- and imaging-based ST at spot and single-cell resolution, and compared with ten representative baselines, GatorDuo consistently delivers strong and robust spatial domain identification performance across multiple clustering metrics, while yielding informative unified embeddings that can support downstream biological analyses.
bioinformatics2026-05-13v1Disease-guided functional gene mapping across species reveals translational correspondences beyond sequence orthology
Yan, J.; Cao, Z.Abstract
Selecting the correct mouse gene to model a human disease phenotype is critical for translational research, yet sequence-based orthology can fail when genes have been lost, duplicated, or functionally rewired between species. Here we present BRIDGE (Biological Rank Integration for Disease Gene Equivalence), a sequence-free framework that identifies functional mouse equivalents of human disease genes. BRIDGE integrates 3.37 million disease-gene associations, biological pathways, and Gene Ontology annotations into a unified heterogeneous graph with 94,897 nodes and approximately 8.3 million edges. The graph is encoded by a heterogeneous graph transformer and combined with fused Gromov-Wasserstein alignment and multi-strategy reciprocal rank fusion. On two sequence-independent benchmarks, BRIDGE achieves Recall@5 of 61.8-66.7%, compared with 0.0-20.1% for Ensembl Compara. We validate BRIDGE through case studies including neutrophil pathway rewiring (CXCL8 to Cxcl1/2/5), acute-phase divergence (CRP to Apcs), and immune checkpoint substitution (LILRB2 to Pirb), and demonstrate complementarity with sequence methods in drug-translation analysis. Prospective validation of 30 novel predictions against three independent data modalities, including tissue expression, cell-type expression, and phenotype concordance, shows that BRIDGE picks are favored in 64 of 65 orthogonal tests (sign test P = 3.6 x 10^-10) and significantly outperform tested baselines including Ensembl Compara, BLAST RBH, and ESM-2. BRIDGE provides a benchmarked framework for functional cross-species gene mapping in disease-model design.
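The rank-fusion step mentioned above can be illustrated with standard reciprocal rank fusion (RRF), in which each candidate receives the sum of 1/(k + rank) over all input rankings. The sketch below uses the commonly cited default k = 60 and placeholder gene labels. It is a generic RRF example under those assumptions, not BRIDGE's multi-strategy variant, whose details the abstract does not give.

```python
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: score(item) = sum over lists of 1 / (k + rank), rank starting at 1."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Three hypothetical strategies ranking the same placeholder candidates.
strategy_rankings = [
    ["geneA", "geneB", "geneC"],
    ["geneB", "geneA", "geneC"],
    ["geneB", "geneC", "geneA"],
]

fused = reciprocal_rank_fusion(strategy_rankings)
top_candidate = fused[0]
```

Because RRF uses only ranks, it can combine strategies whose raw scores are on incompatible scales, which is one reason it is a common choice for merging heterogeneous retrieval signals.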
bioinformatics2026-05-13v1Preferential IsomiR Enrichment in Extracellular Vesicles Improves Identification of Their Cellular Origins
Ripan, R. C.; Li, X.; Hu, H.Abstract
Extracellular vesicles (EVs) carry microRNAs (miRNAs) that mediate intercellular communication and have strong potential as disease biomarkers, yet the roles of miRNA isoforms (isomiRs) in EVs remain poorly understood. Here, we analyzed 96 human EV and corresponding source samples from nine public datasets. We found that EV samples consistently contained substantially higher proportions of isomiR reads than their corresponding source samples, indicating widespread isomiR enrichment in EVs. Although individual isomiRs showed limited reproducibility across biological replicates and limited sharing between EVs and their corresponding source samples, the parent miRNAs that generated these isomiRs remained highly reproducible across replicates and strongly shared between EV-source pairs. Despite extensive isomiR diversification, EV-source pairs retained highly correlated miRNA expression profiles. Using integrated miRNA- and isomiR-related features, we further developed a random forest model that successfully associated EV samples with their corresponding source samples, with improved performance when isomiR information was included. Together, our results demonstrate that EVs are enriched for biologically meaningful isomiRs while preserving source-associated miRNA landscapes, highlighting the importance of incorporating isomiRs into future EV studies.
bioinformatics2026-05-13v1Systematic Regional Bias is Widespread in ChIP-seq
Hughes, O.; Foley, G.; Balderson, B.; Piper, M.; Boden, M.Abstract
Robust and reproducible results are essential for confident scientific analysis. We demonstrate that transcription factor (TF) Chromatin Immunoprecipitation coupled with sequencing (ChIP-seq) suffers from systematic bias that may threaten its reproducibility: 80% of 200+ condition-matched, dual-replicate experiments in ENCODE contain genomic regions of systematic bias. We observe this regional bias even between replicates produced within the same experiment, resulting in thousands of unreplicated peaks, which often contain valuable biological data. We provide evidence that regional bias may lead to qualitative differences in TF biology inferred by different experiments; we discovered eight TFs with binding activity in compact chromatin that was identified by one experiment, yet systematically absent from others. To mitigate the effects of bias, we derive simple but effective metrics to quantify the quality of data within biased regions and demonstrate that they can be used for the robust integration of data from multiple experiments.
bioinformatics2026-05-13v1BiLSTM-Powered Bilinear Attention for Protein-Ligand Prediction
Cheng, C.-Y.; Chen, Y.-A.; Li, F.-Y.; Re, S.Abstract
Rapid and accurate prediction of protein-ligand binding is essential for drug discovery. While generative AI has driven rapid advancements in structure-based approaches, sequence-based methods remain significantly faster and more cost-effective. Here, we present a weakly supervised deep learning framework integrating graph convolutional networks (GCN) for molecular encoding and bidirectional long short-term memory (BiLSTM) for protein modeling. The latter represents long-range dependencies better than the widely used convolutional neural network (CNN). Leveraging a bilinear attention network (BAN), this model learns protein-ligand pairwise interactions without requiring three-dimensional structural supervision. Trained solely on affinity labels from the publicly available BindingDB dataset, the model successfully classified binders and non-binders with an AUROC of 0.96 and an AUPRC of 0.95. The model generates interpretable attention maps that serve as a "GPS" to locate binding sites. Remarkably, despite the lack of structural training data, it can pinpoint key contact residues confirmed by crystal structures. Our method could function as a scalable filter for giga-scale libraries, allowing rapid screening of drug candidates with direct structural insights into the protein-ligand interface.
bioinformatics2026-05-13v1xNNPCD identifies regulators of programmed cell death by integrating perturbation transcriptomes with cancer dependency profiles
Yin, Q.; Chen, L.Abstract
Programmed cell death (PCD) encompasses multiple regulated processes whose dysregulation shapes cancer fitness, yet current computational studies largely use known PCD genes for prognosis rather than discovering regulators. We developed xNNPCD, an interpretable neural-network framework that links CRISPR-Cas9 perturbation signatures from CMap to gene dependency profiles from DepMap. The model constrains hidden neurons to five PCD pathways and iteratively refines a prior gene-pathway mask matrix derived from GO, KEGG, and Reactome using pathway-neuron ablation. This converts binary gene-pathway relationships into continuous-valued associations and improves dependency prediction over random forests, a standard fully connected multi-layer perceptron, and its own non-iterative variant. The learned matrix recovers annotated death regulators and nominates candidate regulators, including RPL23A, HSPA5, SNRPA1, SLC6A2, and ASAH1; combined with dependency scores, it further separates pathway coupling from regulatory direction. Transferring the refined relationship matrix and learned weights to compound-induced perturbation data enables in silico drug screening, identifying BRD-K19103580 and decitabine as targeted therapeutic agents for apoptosis and ferroptosis, respectively. The pathway-resolved drug profiles can facilitate the rational design of combination therapies targeting complementary PCD pathways to overcome single-pathway resistance. Overall, xNNPCD offers a generalizable, interpretable approach for mapping the regulatory landscape and elucidating the molecular processes of PCD in cancer.
bioinformatics2026-05-13v1Phylogenomic coupling of F1 chemosensory and archaellum systems across archaea and monoderm bacteria
Mahanta, U.; Baker, M.; Sharma, G.Abstract
Archaellum-associated motility has been viewed as solely archaeal, yet new findings in Chloroflexota prompt a broader perspective. By analysing a curated set of ~22,000 NCBI reference genomes alongside 2,397 archaeal and 226 archaellum-encoding Chloroflexota genomes, this study systematically characterises the co-distribution of archaellum loci with chemosensory system (CSS) classes. Maximum-likelihood phylogeny of 3,727 F1-type CheA proteins reveals three major clades, with Clade 1 comprising ~80% monoderm representation, uniting archaeal and monoderm bacterial lineages in a shared evolutionary grouping. Overall, this work suggests that not only archaeal-type motility but also the F1-CSS-based sensing system might have been transferred from Archaea to Chloroflexota via horizontal gene transfer, and that the two systems shared an evolutionary trajectory.
bioinformatics2026-05-13v1