Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
Large mRNA language foundation modeling with NUWA for unified sequence perception and generation
Zhong, Y.; Yan, W.; Zhang, Y.; Tan, K.; Saito, Y.; Bian, B.
AI Summary
- This study introduces NUWA, a large mRNA language foundation model using a BERT-like architecture, trained on extensive mRNA sequences from bacteria, eukaryotes, and archaea for unified sequence perception and generation.
- NUWA excels in various downstream tasks, including RNA-related perception and cross-modal protein tasks, and uses an entropy-guided strategy for generating natural-like mRNA sequences.
- Fine-tuned NUWA can generate functional, novel mRNA sequences for applications in biomanufacturing, vaccine development, and therapeutics.
Abstract
The mRNA serves as a crucial bridge between DNA and proteins. Compared to DNA, mRNA sequences are much more concise and information-dense, which makes mRNA an ideal language through which to explore various biological principles. In this study, we present NUWA, a large mRNA language foundation model leveraging a BERT-like architecture, trained with curriculum masked language modeling and supervised contrastive loss for unified mRNA sequence perception and generation. For pretraining, we utilized large-scale mRNA coding sequences comprising approximately 80 million sequences from 19,676 bacterial species, 33 million from 4,688 eukaryotic species, and 2.1 million from 702 archaeal species, and pre-trained a domain-specific model for each of the three domains. This enables NUWA to learn coding sequence patterns across the entire tree of life. The fine-tuned NUWA demonstrates strong performance across a variety of downstream tasks, excelling not only in RNA-related perception tasks but also exhibiting robust capability in cross-modal protein-related tasks. On the generation front, NUWA pioneers an entropy-guided strategy that enables BERT-like models to generate mRNA sequences, producing natural-like sequences that accurately recapitulate species-specific codon usage patterns. Moreover, NUWA can be effectively fine-tuned on small, task-specific datasets to generate functional mRNAs with desired properties, including sequences that do not exist in nature, and to design coding sequences for diverse proteins in biomanufacturing, vaccine development, and therapeutic applications. To our knowledge, NUWA represents the first mRNA language model for unified sequence perception and generation, providing a versatile and programmable platform for mRNA design.
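To make the entropy-guided generation idea concrete, here is a minimal, hypothetical sketch of entropy-ordered unmasking with a BERT-like masked language model. The toy `logits_fn`, vocabulary, and per-step schedule are illustrative stand-ins, not NUWA's actual implementation:

```python
import numpy as np

VOCAB = ["A", "C", "G", "U"]
MASK = -1  # sentinel for masked positions

def toy_logits_fn(tokens):
    """Stand-in for a BERT-like model: returns per-position logits.
    In practice this would be a forward pass of the trained masked LM."""
    rng = np.random.default_rng(abs(hash(tuple(tokens))) % (2**32))
    return rng.normal(size=(len(tokens), len(VOCAB)))

def entropy_guided_decode(length, logits_fn, per_step=4):
    tokens = [MASK] * length
    while MASK in tokens:
        logits = logits_fn(tokens)
        probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
        ent = -(probs * np.log(probs + 1e-9)).sum(-1)
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        # commit the lowest-entropy (most confident) masked positions first
        for i in sorted(masked, key=lambda i: ent[i])[:per_step]:
            tokens[i] = int(probs[i].argmax())
    return "".join(VOCAB[t] for t in tokens)

print(entropy_guided_decode(30, toy_logits_fn))
```

The design intuition is that positions the model is most certain about are fixed first, so later predictions condition on increasingly complete context.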
bioinformatics · 2026-02-04 · v3
FlashDeconv enables atlas-scale, multi-resolution spatial deconvolution via structure-preserving sketching
Yang, C.; Chen, J.; Zhang, X.
AI Summary
- FlashDeconv uses leverage-score importance sampling and sparse spatial regularization to enable rapid, high-resolution spatial deconvolution, processing 1.6 million bins in 153 seconds.
- It identifies a tissue-specific resolution horizon in mouse intestine at 8-16 µm, where cell-type co-localization sign inversions occur, validated by Xenium data.
- In human colorectal cancer, FlashDeconv reveals neutrophil inflammatory microdomains in low-UMI regions, recovering spatial biology from previously uninformative data.
Abstract
Coarsening Visium HD resolution from 8 to 64 µm can flip cell-type co-localization from negative to positive (r = -0.12 → +0.80), yet investigators are routinely forced to coarsen because current deconvolution methods cannot scale to million-bin datasets. Here we introduce FlashDeconv, which combines leverage-score importance sampling with sparse spatial regularization to match top-tier Bayesian accuracy while processing 1.6 million bins in 153 seconds on a standard laptop. Systematic multi-resolution analysis of Visium HD mouse intestine reveals a tissue-specific resolution horizon (8-16 µm)--the scale at which this sign inversion occurs--validated by Xenium ground truth. Below this horizon, FlashDeconv provides the first sequencing-based quantification of Tuft cell chemosensory niches (15.3-fold stem cell enrichment). In a 1.6-million-bin human colorectal cancer cohort, FlashDeconv further uncovers neutrophil inflammatory microdomains in low-UMI regions that classification-based methods discard, recovering spatially organized biology from measurements previously considered uninformative.
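Leverage-score importance sampling itself is a generic sketching technique: keep rows (here, spatial bins) with probability proportional to their leverage under a low-rank factorization, so the sample preserves the dominant row structure. A minimal sketch assuming a dense bins-by-genes matrix (not the authors' implementation):

```python
import numpy as np

def leverage_score_sample(X, n_keep, rng=None):
    """Sample rows of X (bins x genes) with probability proportional to
    their statistical leverage, preserving dominant row structure."""
    rng = rng or np.random.default_rng(0)
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    scores = (U ** 2).sum(axis=1)          # leverage of each row
    p = scores / scores.sum()
    idx = rng.choice(X.shape[0], size=n_keep, replace=False, p=p)
    return X[idx], idx

X = np.random.default_rng(1).poisson(2.0, size=(10_000, 200)).astype(float)
sketch, kept = leverage_score_sample(X, n_keep=500)
print(sketch.shape)  # (500, 200)
```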
bioinformatics · 2026-02-04 · v2
Peptide-to-protein data aggregation using Fisher's method improves target identification in chemical proteomics
Lyu, H.; Gharibi, H.; Meng, Z.; Sokolova, B.; Zhang, X.; Zubarev, R.
AI Summary
- This study compares two methods for protein-level statistical testing in chemical proteomics: traditional aggregation of peptide data versus Fisher's method of combining peptide p-values.
- Fisher's method, using the top four peptides by p-value, was tested across various datasets and consistently outperformed traditional methods by avoiding biases from deviant or missing peptide data.
- The approach improved the identification of regulated or shifted proteins in diverse proteomics assays.
Abstract
Protein-level statistical tests in proteomics aimed at obtaining p-values are conventionally performed on protein abundances aggregated from peptide data. This integral approach overlooks peptide-level heterogeneity and ignores important information encoded in individual peptide data, whereas a protein p-value can also be obtained by Fisher's method of combining peptide p-values using chi-square statistics. Here we test this latter approach across diverse chemical proteomics datasets based on assessments of protein expression, solubility and protease accessibility. Using the top four peptides ranked by their p-values consistently outperformed protein-level analysis and avoided biases introduced by inclusion of deviant peptides or imputation of missing peptide values. Fisher's method provides a simple and robust strategy, improving identification of regulated/shifted proteins in diverse proteomics assays.
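Fisher's method itself is standard: the combined statistic is X² = -2 Σ ln pᵢ, referred to a chi-square distribution with 2k degrees of freedom for k peptides. A minimal sketch of the top-four variant described above (note that selecting only the smallest p-values alters the null calibration, which the paper presumably accounts for; this sketch just illustrates the mechanics):

```python
import numpy as np
from scipy import stats

def fisher_protein_pvalue(peptide_pvalues, top_k=4):
    """Combine peptide-level p-values into one protein-level p-value
    with Fisher's method, using the top-k smallest p-values."""
    p = np.sort(np.asarray(peptide_pvalues))[:top_k]
    chi2_stat = -2.0 * np.log(p).sum()
    return stats.chi2.sf(chi2_stat, df=2 * len(p))

print(fisher_protein_pvalue([0.01, 0.03, 0.20, 0.40, 0.90]))
```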
bioinformatics · 2026-02-04 · v1
EvoPool: Evolution-Guided Pooling of Protein Language Model Embeddings
NaderiAlizadeh, N.; Singh, R.
AI Summary
- The study introduces EvoPool, a self-supervised pooling framework that integrates evolutionary information from homologous sequences into protein language model (PLM) embeddings using optimal transport.
- EvoPool constructs a fixed-size evolutionary anchor and uses sliced Wasserstein distances to enhance PLM representations for protein-level prediction tasks.
- Experiments on the ProteinGym benchmark showed that EvoPool outperforms standard pooling methods in variant effect prediction, highlighting the benefit of evolutionary guidance.
Abstract
Protein language models (PLMs) encode amino acid sequences into residue-level embeddings that must be pooled into fixed-size representations for downstream protein-level prediction tasks. Although these embeddings implicitly reflect evolutionary constraints, existing pooling strategies operate on single sequences and do not explicitly leverage information from homologous sequences or multiple sequence alignments. We introduce EvoPool, a self-supervised pooling framework that integrates evolutionary information from homologs directly into aggregated PLM representations using optimal transport. Our method constructs a fixed-size evolutionary anchor from an arbitrary number of homologous sequences and uses sliced Wasserstein distances to derive query protein embeddings that are geometrically informed by homologous sequence embeddings. Experiments across multiple state-of-the-art PLM families on the ProteinGym benchmark show that EvoPool consistently outperforms standard pooling baselines for variant effect prediction, demonstrating that explicit evolutionary guidance substantially enhances the functional utility of PLM representations. Our implementation code is available at https://github.com/navid-naderi/EvoPool.
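The sliced Wasserstein distance used here is a generic construction: project both embedding sets onto random directions and compare the resulting one-dimensional distributions. A minimal NumPy sketch (illustrative only; EvoPool's anchor construction and pooling are not reproduced):

```python
import numpy as np

def sliced_wasserstein(a, b, n_projections=128, rng=None):
    """Approximate sliced Wasserstein-1 distance between two sets of
    residue embeddings a (n x d) and b (m x d) via random 1-D projections."""
    rng = rng or np.random.default_rng(0)
    theta = rng.normal(size=(n_projections, a.shape[1]))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    q = np.linspace(0, 1, 100)       # common quantile grid (n may differ from m)
    total = 0.0
    for t in theta:
        pa, pb = a @ t, b @ t
        total += np.abs(np.quantile(pa, q) - np.quantile(pb, q)).mean()
    return total / n_projections

rng = np.random.default_rng(2)
query = rng.normal(size=(120, 64))   # residue embeddings of a query protein
anchor = rng.normal(size=(300, 64))  # pooled homolog embeddings (toy anchor)
print(sliced_wasserstein(query, anchor))
```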
bioinformatics · 2026-02-04 · v1
CAMUS: Scalable Phylogenetic Network Estimation
Willson, J.; Warnow, T.
AI Summary
- CAMUS is a scalable method for estimating phylogenetic networks, designed to handle larger datasets by maximizing quartet trees within a given constraint tree.
- Simulation studies under the Network Multi-Species Coalescent showed CAMUS to be highly accurate, fast, and scalable, processing up to 201 species in minutes.
- Compared to PhyloNet-MPL and SNaQ, CAMUS is slightly less accurate when PhyloNet-MPL is used without a fixed tree but significantly faster and capable of handling larger datasets.
Abstract
Motivation: Phylogenetic networks are models of evolution that go beyond trees and can represent reticulate events such as horizontal gene transfer or hybridization, which are frequently found in many taxa. Yet the estimation of phylogenetic networks is extremely computationally challenging, and nearly all methods are limited to very small datasets with perhaps 10 to 15 species (some limited to even smaller numbers). Results: We introduce CAMUS (Constrained Algorithm Maximizing qUartetS), a scalable method for phylogenetic network estimation. CAMUS takes an input constraint tree T as well as a set Q of unrooted quartet trees that it derives from the input, and returns a level-1 phylogenetic network N that is built upon T through the addition of edges, in order to maximize the number of quartet trees in Q that are induced in N. We perform a simulation study under the Network Multi-Species Coalescent and show that a simple pipeline using CAMUS provides high accuracy and outstanding speed and scalability in comparison to two leading methods, PhyloNet-MPL used with a fixed tree and SNaQ. CAMUS is slightly less accurate than PhyloNet-MPL used without a fixed tree, but is much faster (minutes instead of hours) and can complete on inputs with 201 species, while PhyloNet-MPL fails to complete on inputs with more than 51 species. Availability and Implementation: The source code is available at https://github.com/jsdoublel/camus.
bioinformatics · 2026-02-04 · v1
SPCoral: diagonal integration of spatial multi-omics across diverse modalities and technologies
Wang, H.; Yuan, J.; Li, K.; Chen, X.; Yan, X.; Lin, P.; Tang, Z.; Wu, B.; Nan, H.; Lai, Y.; Lv, Y.; Esteban, M. A.; Xie, L.; Wang, G.; Hui, L.; Li, H.
AI Summary
- SPCoral was developed to integrate spatial multi-omics data across different slices, modalities, and technologies using graph attention networks and optimal transport.
- It employs a cross-modality attention network for feature integration and cross-omics prediction, showing superior performance in benchmarks.
- The integration enhances spatial domain identification, data augmentation, cross-modal analysis, and cell-cell communication, revealing insights unattainable with single modality data.
Abstract
Spatial multi-omics is indispensable for decoding the comprehensive molecular landscape of biological systems. However, the integration of multi-omics remains largely unresolved due to inherent disparities in molecular features, spatial morphology, and resolution. Here we developed SPCoral for diagonal integration of spatial multi-omics across adjacent slices. SPCoral extracts spatial covariation patterns via graph attention networks, followed by the use of optimal transport to identify high-confidence anchors in an unsupervised, feature-independent manner. SPCoral utilizes a cross-modality attention network to enable seamless cross-resolution feature integration alongside robust cross-omics prediction. Comprehensive benchmarking demonstrates SPCoral's superior performance across different technologies, modalities and varied resolutions. The integrated multi-omics representation further improves spatial domain identification, effectively augments experimental data, enables cross-modal association analysis, and facilitates cell-cell communication analysis. SPCoral exhibits good scalability with data size and reveals biological insights that are not attainable using a single modality. In summary, SPCoral offers a powerful framework for spatial multi-omics integration across various technologies and biological scenarios.
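Anchor finding via entropically regularized optimal transport can be sketched with the POT library. This is a simplified illustration only: it uses a shared embedding space for the cost matrix, whereas SPCoral's anchor identification is feature-independent; the toy data and threshold are placeholders:

```python
import numpy as np
import ot  # POT: Python Optimal Transport

rng = np.random.default_rng(5)
# Toy spot embeddings from two adjacent slices (different modalities)
slice_a = rng.normal(size=(50, 8))
slice_b = rng.normal(size=(60, 8))

# Uniform marginals and a pairwise cost matrix between cross-slice spots
a = np.full(50, 1 / 50)
b = np.full(60, 1 / 60)
M = ot.dist(slice_a, slice_b)          # squared Euclidean costs

# Entropically regularized OT; large plan entries suggest candidate anchors
plan = ot.sinkhorn(a, b, M / M.max(), reg=0.05)
anchors = np.argwhere(plan > plan.mean() + 3 * plan.std())
print(anchors[:5])
```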
bioinformatics · 2026-02-04 · v1
Common Pitfalls in CircRNA Detection and Quantification
Weyrich, M.; Trummer, N.; Boehm, F.; Furth, P. A.; Hoffmann, M.; List, M.
AI Summary
- This study compares circRNA detection in poly(A)-enriched versus ribosomal RNA-depleted RNA-seq data, finding that poly(A) data often yield false positives.
- The quality of sample processing, indicated by ribosomal read fraction, impacts circRNA detection sensitivity.
- Best practices include using total RNA sequencing with rRNA depletion, employing multiple detection tools, and focusing on back-splice junctions for reliable circRNA analysis.
Abstract
Circular RNAs have garnered considerable interest, as they have been implicated in numerous biological processes and diseases. Through their stability, they are often considered promising biomarker candidates or therapeutic targets. Due to the lack of a poly(A) tail, circRNAs are best detected in total RNA-seq data after depleting ribosomal RNA. However, we observe that the application of circRNA detection in the vastly more ubiquitous poly(A)-enriched RNA-seq data still occurs. In this study, we systematically compare the detection of circRNAs in two matched poly(A) and ribosomal RNA-depleted data sets. Our results indicate that the comparably few circRNAs detected in poly(A) data are likely false positives. In addition, we demonstrate that the quality of sample processing, as measured by the fraction of ribosomal reads, significantly affects the sensitivity of circRNA detection, leading to a bias in downstream analysis. Our findings establish best practices for circRNA research: total RNA sequencing with effective rRNA depletion is the preferred approach for accurate circRNA profiling, whereas poly(A)-enriched data are unsuitable for comprehensive detection. Employing multiple circRNA detection tools and prioritizing back-splice junctions identified by several algorithms enhances confidence in the selection of candidates. These recommendations, validated across diverse datasets and tissue types, provide generalizable principles for robust circRNA analysis.
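The multi-tool recommendation reduces to a simple consensus over back-splice junctions. A toy sketch with hypothetical junction calls (tool names stand in for real detectors; coordinates are placeholders):

```python
# Back-splice junctions keyed by (chrom, donor, acceptor, strand)
ciri2 = {("chr1", 1000, 5000, "+"), ("chr2", 800, 900, "-")}
circexplorer2 = {("chr1", 1000, 5000, "+"), ("chr3", 42, 77, "+")}
find_circ = {("chr1", 1000, 5000, "+"), ("chr2", 800, 900, "-")}

calls = [ciri2, circexplorer2, find_circ]
min_tools = 2  # require support from at least two detection tools

support = {}
for tool_calls in calls:
    for bsj in tool_calls:
        support[bsj] = support.get(bsj, 0) + 1

high_confidence = {bsj for bsj, n in support.items() if n >= min_tools}
print(sorted(high_confidence))
```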
bioinformatics · 2026-02-04 · v1
Ophiuchus-Ab: A Versatile Generative Foundation Model for Advanced Antibody-Based Immunotherapy
Zhu, Y.; Ma, J.; Yin, M.; Wu, J.; Tang, L.; Zhang, Z.; Li, Q.; Feng, S.; Liu, H.; Qin, T.; Yan, J.; Hsieh, C.-Y.; Hou, T.
AI Summary
- The study addresses the challenge of antibody design by modeling the sequence space of paired heavy and light chains to understand inter-chain dependencies.
- Ophiuchus-Ab, a generative foundation model, was developed using a diffusion language modeling framework, trained on large-scale paired antibody repertoires.
- This model excels in tasks like CDR infilling, antibody humanization, and light-chain pairing, and predicts antibody properties like developability and binding affinity, enhancing antibody-based immunotherapy.
Abstract
Antibodies exhibit extraordinary specificity and diversity in antigen recognition and have become a central class of therapeutics across a wide range of diseases. Despite this clinical success, antibody design remains fundamentally challenging. Antibody function emerges from intricate and highly coupled interactions between heavy and light chains, which complicate sequence-function relationships and limit the rational design of developable antibodies. Here, we reveal that modeling antibody sequence space at the level of paired heavy and light chains is essential to faithfully capture inter-chain dependencies, enabling a deeper understanding of antibody function and facilitating antibody discovery. We present Ophiuchus-Ab, a generative foundation model pre-trained on large-scale paired antibody repertoires within a diffusion language modeling framework, unifying antibody generation and representation learning in a single probabilistic formulation. This framework excels at diverse antibody design tasks, including CDR infilling, antibody humanization, and light-chain pairing. Beyond generation, diffusion-based pre-training yields transferable representations that enable accurate prediction of antibody properties, including developability, binding affinity, and specificity, even in low-data regimes. Together, these results establish Ophiuchus-Ab as a versatile foundation model for antibodies and a foundation for next-generation antibody-based immunotherapy.
bioinformatics · 2026-02-04 · v1
Embarrassingly_FASTA: Enabling Recomputable, Population-Scale Pangenomics by Reducing Commercial Genome Processing Costs from $100 to less than $1
Walsh, D. J.; Njie, E. G.
AI Summary
- The study introduces Embarrassingly_FASTA, a GPU-accelerated preprocessing pipeline that reduces genome processing costs from ~$100 to less than $1 per genome by using transient intermediates and ephemeral cloud infrastructure.
- This approach enables the retention of raw FASTQ data, facilitating recomputable, population-scale pangenomics.
- The efficiency was demonstrated through simulated large-cohort pangenome builds in C. elegans and humans, showcasing the potential for capturing unsampled genetic diversity.
Abstract
Computational preprocessing has become the dominant bottleneck in genomics, frequently exceeding sequencing costs and constraining population-scale analysis, even as large repositories grow from tens of petabytes toward exabyte-scale storage to support World Genome Models. Legacy CPU-based workflows require many hours to days per 30x human genome, driving many repositories to distribute aligned or derived intermediates such as BAM and VCF files rather than raw FASTQ data. These intermediates embed reference- and model-dependent assumptions that limit reproducibility and impede reanalysis as reference genomes, including pangenomes, continue to evolve. Although recent work has established that GPUs can dramatically accelerate genomic pipelines, enabling large-cohort processing to shrink from years to days given sufficient parallelism, such workflows remain cost-prohibitive. Here, we introduce Embarrassingly_FASTA, a GPU-accelerated preprocessing pipeline built on NVIDIA Parabricks that fundamentally changes the economics of genomic data management. By rendering intermediate files transient rather than archival, Embarrassingly_FASTA enables retention of raw FASTQ data and reliable use of highly discounted ephemeral cloud infrastructure such as spot instances, reducing compute spend from ~$17/genome (CPU on-demand) to <$1/genome (GPU spot), and commercial secondary-analysis pricing from ~$120/genome to under $1/genome. We demonstrate the impact of this efficiency using a simulated large-cohort pangenome build-up (using variant-union accumulation as a proxy for diversity growth) in Caenorhabditis elegans and humans, highlighting the long tail of unsampled human genetic diversity. Beyond GPU kernels, Embarrassingly_FASTA contributes a transient-intermediate lifecycle and spot-friendly orchestration that make FASTQ retention and routine recomputation economically viable. Embarrassingly_FASTA thus provides enabling infrastructure for recomputable, population-scale pangenomics and next-generation genomic models.
Keywords: Genome preprocessing, GPU acceleration, Whole-genome sequencing (WGS), Population genomics, Pangenomics, World Genome Models, Genomic infrastructure, Variant calling, Recomputable genomics
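The headline economics follow from simple arithmetic: cost per genome is the hourly instance rate times runtime. A toy calculation whose rates and runtimes are placeholders chosen only to match the abstract's orders of magnitude, not figures from the paper:

```python
# Illustrative cost model: cost per genome = hourly rate x runtime hours.
def cost_per_genome(hourly_rate_usd, runtime_hours):
    return hourly_rate_usd * runtime_hours

cpu_on_demand = cost_per_genome(hourly_rate_usd=0.70, runtime_hours=24)  # ~ $17
gpu_spot = cost_per_genome(hourly_rate_usd=1.50, runtime_hours=0.5)      # < $1
print(f"CPU on-demand: ${cpu_on_demand:.2f}, GPU spot: ${gpu_spot:.2f}")
```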
bioinformatics · 2026-02-04 · v1
Joint Modeling of Transcriptomic and Morphological Phenotypes for Generative Molecular Design
Verma, S.; Wang, M.; Jayasundara, S.; Malusare, A. M.; Wang, L.; Grama, A.; Kazemian, M.; Lanman, N. A.
AI Summary
- The study introduces Pert2Mol, a framework for integrating transcriptomic and morphological data from paired control-treatment experiments to generate molecular structures.
- Pert2Mol uses bidirectional cross-attention and a rectified flow transformer to model perturbation dynamics, achieving a Fréchet ChemNet Distance of 4.996 on the GDP dataset, outperforming diffusion and transcriptomics-only methods.
- The model offers high molecular validity, good physicochemical property distributions, 84.7% scaffold diversity, and is 12.4 times faster than diffusion methods for generation.
Abstract
Motivation: Phenotypic drug discovery generates rich multi-modal biological data from transcriptomic and morphological measurements, yet translating complex cellular responses into molecular design remains a computational bottleneck. Existing generative methods operate on single modalities and condition on post-treatment measurements without leveraging paired control-treatment dynamics to capture perturbation effects. Results: We present Pert2Mol, the first framework for multi-modal phenotype-to-structure generation that integrates transcriptomic and morphological features from paired control-treatment experiments. Pert2Mol employs bidirectional cross-attention between control and treatment states to capture perturbation dynamics, conditioning a rectified flow transformer that generates molecular structures along straight-line trajectories. We introduce Student-Teacher Self-Representation (SERE) learning to stabilize training in high-dimensional multi-modal spaces. On the GDP dataset, Pert2Mol achieves a Fréchet ChemNet Distance of 4.996 compared to 7.343 for diffusion baselines and 59.114 for transcriptomics-only methods, while maintaining perfect molecular validity and appropriate physicochemical property distributions. The model demonstrates 84.7% scaffold diversity and 12.4 times faster generation than diffusion approaches, with deterministic sampling suitable for hypothesis-driven validation. Availability: Code and pretrained models will be available at https://github.com/wangmengbo/Pert2Mol.
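The rectified-flow objective referenced above is standard: interpolate linearly between a noise sample x0 and a data sample x1, and regress the constant velocity x1 - x0 that transports one to the other. A minimal PyTorch sketch with toy dimensions (the conditioning vector and architecture are placeholders, not Pert2Mol's):

```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Toy velocity field: input = noisy sample x_t, time t, condition c."""
    def __init__(self, dim=32, cond_dim=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1 + cond_dim, 128), nn.SiLU(), nn.Linear(128, dim)
        )

    def forward(self, x_t, t, cond):
        return self.net(torch.cat([x_t, t, cond], dim=-1))

model = VelocityNet()
x1 = torch.randn(64, 32)     # "data": molecular latents (toy)
x0 = torch.randn(64, 32)     # noise sample
cond = torch.randn(64, 16)   # phenotype conditioning (toy)
t = torch.rand(64, 1)

# Rectified flow: straight-line interpolation, constant-velocity target
x_t = (1 - t) * x0 + t * x1
loss = nn.functional.mse_loss(model(x_t, t, cond), x1 - x0)
loss.backward()
print(float(loss))
```

Because the learned trajectories are (approximately) straight lines, sampling can be deterministic and requires few integration steps, which is the speed advantage the abstract cites.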
bioinformatics · 2026-02-04 · v1
FRED: a universal tool to generate FAIR metadata for omics experiments
Walter, J.; Kuenne, C.; Knoppik, N.; Goymann, P.; Looso, M.
AI Summary
- The study addresses the challenge of standardizing metadata in omics experiments to enhance data management according to FAIR principles.
- FRED, a new tool, was developed to generate machine-readable metadata, offering features like dialog-based creation, semantic validation, logical search, an API, and a web interface.
- FRED is designed for use by both non-computational scientists and specialized facilities, integrating easily into existing research data management systems.
Abstract
Scientific research relies on transparent dissemination of data and its associated interpretations. This task encompasses accessibility of raw data, its metadata, details concerning experimental design, and the parameters and tools employed for data interpretation. Production and handling of these data represent an ongoing challenge, extending beyond publication into individual facilities, institutes and research groups, a practice often termed Research Data Management (RDM). RDM is foundational to scientific discovery and innovation, and its guiding principles can be summarized as Findability, Accessibility, Interoperability and Reusability (FAIR). Although the majority of peer-reviewed journals require the deposition of raw data in public repositories in alignment with FAIR principles, metadata frequently lack full standardization. This critical gap in data management practices hinders effective utilization of research findings and complicates sharing of scientific knowledge. Here we present a flexible design of a machine-readable metadata format to store experimental metadata, along with an implementation of a generalized tool named FRED. It enables i) dialog-based creation of metadata files, ii) structured semantic validation, iii) logical search, iv) an external programming interface (API), and v) a standalone web front end. The tool is intended to be used by non-computational scientists as well as specialized facilities, and can be seamlessly integrated into existing RDM infrastructure.
bioinformatics · 2026-02-04 · v1
QMAP: A Benchmark for Standardized Evaluation of Antimicrobial Peptide MIC and Hemolytic Activity Regression
Lavertu, A.; Corbeil, J.; Germain, P.
AI Summary
- QMAP is introduced as a benchmark for evaluating the prediction of antimicrobial peptide (AMP) potency (MIC) and hemolytic toxicity (HC50), using homology-aware test sets to prevent overfitting.
- The benchmark reassessed existing MIC models, revealing limited progress over six years, poor performance in predicting high-potency MIC, and low predictability for hemolytic activity.
- A Python package with a Rust-accelerated engine for efficient data manipulation is provided to facilitate the adoption of QMAP.
Abstract
Antimicrobial peptides (AMPs) are promising alternatives to conventional antibiotics, but progress in computational AMP discovery has been difficult to quantify due to inconsistent datasets and evaluation protocols. We introduce QMAP, a domain-specific benchmark for predicting AMP antimicrobial potency (MIC) and hemolytic toxicity (HC50) with homology-aware, predefined test sets. QMAP enforces strict sequence homology constraints between training and test data, ensuring that model performance reflects true generalization rather than overfitting. Applying QMAP, we reassess existing MIC models and establish baselines for MIC and HC50 regression. Results show limited progress over six years, poor performance for high-potency MIC regression, and low predictability for hemolytic activity, emphasizing the need for standardized evaluation and improved modeling approaches for highly potent peptides. We release a Python package that facilitates practical adoption, with a Rust-accelerated engine enabling efficient data manipulation, installable with pip install qmap-benchmark.
bioinformatics · 2026-02-04 · v1
petVAE: A Data-Driven Model for Identifying Amyloid PET Subgroups Across the Alzheimer's Disease Continuum
Tagmazian, A. A.; Schwarz, C.; Lange, C.; Pitkänen, E.; Vuoksimaa, E.
AI Summary
- This study aimed to identify subgroups along the Alzheimer's disease (AD) continuum using Aβ PET scans by developing petVAE, a 2D variational autoencoder model.
- petVAE was trained on 3,110 scans from ADNI and A4 datasets, identifying four clusters (Aβ-, Aβ-+, Aβ+, Aβ++) that differed significantly in standardized uptake value ratio, CSF Aβ, cognitive performance, APOE ε4 prevalence, and progression rate to AD.
- The model effectively captured the AD continuum, revealing preclinical stages and offering a new framework for studying disease progression.
Abstract
Amyloid-β (Aβ) PET imaging is a core biomarker and is considered sufficient for the biological diagnosis of Alzheimer's disease (AD). However, it is typically reduced to a binary Aβ-/Aβ+ classification. In this study, we aimed to identify subgroups along the continuum of Aβ accumulation, including subgroups within Aβ- and Aβ+. We used a total of 3,110 Aβ PET scans from the Alzheimer's Disease Neuroimaging Initiative (ADNI) and Anti-Amyloid Treatment in Asymptomatic Alzheimer's Disease (A4) datasets to develop petVAE, a 2D variational autoencoder model. The model accurately reconstructed Aβ PET scans without prior labeling or pre-selection based on scanner type or region of interest. Latent representations of scans extracted from the petVAE (11,648 latent features per scan) were used to visualize, analyze, and cluster the AD continuum. We identified the latent features most representative of the continuum, and clustering of PET scans using these features produced four clusters. Post-hoc characterization revealed that two clusters (Aβ-, Aβ-+) were predominantly Aβ negative and two (Aβ+, Aβ++) were predominantly Aβ positive. All clusters differed significantly in standardized uptake value ratio (p < 1.64e-8) and cerebrospinal fluid (CSF) Aβ (p < 0.02), demonstrating petVAE's ability to assign scans along the Aβ continuum. The clusters at the extremes of the continuum (Aβ-, Aβ++) resembled the conventional Aβ negative and Aβ positive groups and differed significantly in cognitive performance, Apolipoprotein E (APOE) ε4 prevalence, and Aβ, tau and phosphorylated tau CSF biomarkers (p < 3e-6). The two intermediate clusters (Aβ-+, Aβ+) showed significantly higher odds of carrying at least one APOE ε4 allele compared with the Aβ- cluster (p < 0.026). Participants in the Aβ+ or Aβ++ clusters exhibited a significantly faster rate of progression to AD compared to the Aβ- group (hazard ratio = 2.42 and 9.43 for groups Aβ+ and Aβ++, respectively, p < 1.17e-7). Thus, petVAE was capable of reconstructing PET scans while also extracting latent features that effectively represented the AD continuum and defined biologically meaningful clusters. By capturing subtle Aβ-related changes in brain PET scans, petVAE-based classification enables the detection of preclinical AD stages and offers a new data-driven framework for studying disease progression.
bioinformatics · 2026-02-04 · v1
RareCapsNet: An explainable capsule networks enable robust discovery of rare cell populations from large-scale single-cell transcriptomics
Ray, S.; Lall, S.
AI Summary
- RareCapsNet uses capsule networks to identify rare cell populations in large-scale single-cell RNA-seq data.
- It leverages explainable AI to interpret lower-level capsules, identifying novel marker genes for rare cell types.
- Evaluations on simulated and real data show RareCapsNet outperforms other methods in specificity, selectivity, and can transfer knowledge across batches.
Abstract
In-silico analysis of single cell data (downstream analysis) has attracted considerable attention from machine learning researchers in recent years. Recent technological advances and increases in throughput capabilities open up great new opportunities to discover rare cell types. We develop RareCapsNet, a rare cell identification technique based on capsule networks for large single cell RNA-seq data. RareCapsNet aims to leverage the landmark advantages of capsule networks in the single cell domain by identifying novel rare cell populations through marker genes explained via human-interpretable analysis of the lower-level (primary) capsules. We demonstrate the explainability of the capsule network for identifying novel markers that act as signatures of certain rare cell populations. A comprehensive evaluation on simulated and real-life single cell data demonstrates the efficacy of RareCapsNet for finding rare populations in large scRNA-seq data. RareCapsNet not only outperforms the other state-of-the-art methods in specificity and selectivity for identifying rare cell types, it can also successfully extract the transcriptomic signature of the cell population. We further demonstrate RareCapsNet on a multi-batch dataset, where the model can store knowledge from one batch and transfer it to find rare cells in another batch without retraining. Availability and Implementation: RareCapsNet is available at: https://github.com/sumantaray/RareCapsNet.
bioinformatics · 2026-02-04 · v1
coelsch: Platform-agnostic single-cell analysis of meiotic recombination events
Parker, M. T.; Amar, S.; Freudigmann, J.; Walkemeier, B.; Dong, X.; Solier, V.; Marek, M.; Huettel, B.; Mercier, R.; Schneeberger, K.
AI Summary
- The study benchmarks single-cell sequencing methods (droplet-based chromatin accessibility, RNA sequencing, and plate-based whole-genome amplification) for mapping meiotic recombination in Arabidopsis thaliana.
- Novel tools, coelsch_mapping_pipeline and coelsch, were developed for haplotype-aware alignment and crossover detection, successfully mapping recombination in 34 out of 40 F1 hybrids.
- The analysis revealed significant variation in recombination rates and identified a large ~10 Mb pericentric inversion in accession Zin-9, the largest known in A. thaliana.
Abstract
Background: Meiotic recombination creates genetic diversity through reciprocal exchange of haplotypes between homologous chromosomes. Scalable and robust methods for mapping recombination breakpoints are essential for understanding meiosis and for genetic mapping. Single-cell sequencing of gametes offers a direct approach to recombination mapping, yet the effect of technical differences between single-cell sequencing methods on crossover detection remains unclear. Results: We benchmark droplet-based single-cell chromatin accessibility sequencing, droplet-based single-cell RNA sequencing, and plate-based whole-genome amplification for mapping meiotic recombination in Arabidopsis thaliana. For this purpose we introduce two novel open-source tools, coelsch_mapping_pipeline and coelsch, for haplotype-aware alignment and per-cell crossover detection, using them to recover known recombination frequencies and quantify the effects of coverage sparsity. We subsequently apply our approach to a panel of 40 recombinant F1 hybrids derived from crosses of 22 diverse natural accessions, successfully recovering genetic maps for 34 F1s in a single dataset. This analysis reveals substantial variation in recombination rate and identifies a ~10 Mb pericentric inversion in the accession Zin-9, the largest natural inversion reported in A. thaliana to date. Conclusions: These results demonstrate the applicability and scalability of single-cell gamete sequencing for high-throughput mapping of meiotic recombination, and highlight the strengths and limitations of different single-cell modalities. The accompanying open-source tools provide a framework for haplotyping and crossover detection using sparse single-cell sequencing data. Our methodology enables parallel analysis of large numbers of hybrids in a single dataset, removing a major technical barrier to large-scale studies of natural variation in recombination rate.
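Per-cell crossover detection from sparse data boils down to finding switch points in noisy parental-haplotype assignments along a chromosome. A toy sketch using a simple sliding majority vote (an illustration of the general idea, not the coelsch algorithm):

```python
import numpy as np

def call_crossovers(positions, alleles, window=25, min_support=0.8):
    """Toy crossover caller: smooth sparse per-marker parental assignments
    (0 = parent A, 1 = parent B) with a sliding majority vote, then report
    positions where the smoothed haplotype state switches."""
    alleles = np.asarray(alleles, dtype=float)
    smoothed = np.array([
        alleles[max(0, i - window): i + window + 1].mean()
        for i in range(len(alleles))
    ])
    state = np.where(smoothed >= 0.5, 1, 0)
    conf = np.maximum(smoothed, 1 - smoothed)
    return [positions[i] for i in range(1, len(state))
            if state[i] != state[i - 1] and conf[i] >= min_support]

pos = np.arange(0, 10_000, 10)
true = (pos > 4_000).astype(int)                   # one crossover near 4,000
noisy = np.where(np.random.default_rng(3).random(len(pos)) < 0.05,
                 1 - true, true)                   # 5% genotyping noise
print(call_crossovers(pos, noisy))
```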
bioinformatics · 2026-02-04 · v1
LoReMINE: Long Read-based Microbial genome mining pipeline
Agrawal, A. A.; Bader, C. D.; Kalinina, O. V.
AI Summary
- The study introduces LoReMINE, a pipeline for microbial genome mining that automates the process from long-read sequencing data to the prediction and clustering of biosynthetic gene clusters (BGCs).
- LoReMINE integrates various tools to provide a scalable, reproducible workflow for natural product discovery, addressing the limitations of existing methods that require manual curation.
Abstract
Microbial natural products represent a chemically diverse repertoire of small molecules with major pharmaceutical potential. Despite the increasing availability of microbial genome sequences, large-scale natural product discovery remains challenging because existing genome mining approaches lack integrated workflows for rapid dereplication of known compounds and prioritization of novel candidates, forcing researchers to rely on multiple tools that require extensive manual curation and expert intervention at each step. To address these limitations, we introduce LoReMINE (Long Read-based Microbial genome mining pipeline), a fully automated end-to-end pipeline that generates high-quality assemblies, performs taxonomic classification, predicts biosynthetic gene clusters (BGCs) responsible for the biosynthesis of natural products, and clusters them into gene cluster families (GCFs) directly from long-read sequencing data. By integrating state-of-the-art tools into a seamless pipeline, LoReMINE enables scalable, reproducible, and comprehensive genome mining across diverse microbial taxa. The pipeline is openly available at https://github.com/kalininalab/LoReMINE and can be installed via Conda (https://anaconda.org/kalininalab/loremine), facilitating broad adoption by the natural product research community.
bioinformatics · 2026-02-04 · v1
Generative deep learning expands apo RNA conformational ensembles to include ligand-binding-competent cryptic conformations: a case study of HIV-1 TAR
Kurisaki, I.; Hamada, M.
AI Summary
- The study used Molearn, a hybrid molecular-dynamics-generative deep-learning model, to explore cryptic conformations of apo HIV-1 TAR RNA that could bind ligands.
- Molearn was trained on apo TAR conformations and generated a diverse ensemble, from which potential MV2003-binding conformations were identified.
- Docking simulations showed these conformations had RNA-ligand interaction scores similar to NMR-derived complexes, demonstrating the model's ability to predict ligand-binding competent RNA states.
Abstract
RNA plays vital roles in diverse biological processes and represents an attractive class of therapeutic targets. In particular, cryptic ligand-binding sites--absent in apo structures but formed upon conformational rearrangement--offer high specificity for RNA-ligand recognition, yet remain rare among experimentally-resolved RNA-ligand complex structures and difficult to predict in silico. RNA-targeted structure-based drug design (SBDD) is therefore limited by challenges in sampling cryptic states. Here, we apply Molearn, a hybrid molecular-dynamics-generative deep-learning model, to expand apo RNA conformational ensembles toward cryptic states. Focusing on the paradigmatic HIV-1 TAR-MV2003 system, Molearn was trained exclusively on apo TAR conformations and used to generate a diverse ensemble of TAR structures. Candidate cryptic MV2003-binding conformations were subsequently identified using post-generation geometric analyses. Docking simulations of these conformations with MV2003 yielded binding poses with RNA-ligand interaction scores comparable to those of NMR-derived complexes. Notably, this work provides the first demonstration that a generative modeling framework can access cryptic RNA conformations that are ligand-binding competent and have not been recovered in prior molecular-dynamics and deep-learning studies. Finally, we discuss current limitations in scalability and systematic detection, including application to the Internal Ribosome Entry Site, and outline future directions toward RNA-targeted SBDD.
bioinformatics · 2026-02-03 · v6
GCP-VQVAE: A Geometry-Complete Language for Protein 3D Structure
Pourmirzaei, M.; Morehead, A.; Esmaili, F.; Ren, J.; Pourmirzaei, M.; Xu, D.
AI Summary
- The study introduces GCP-VQVAE, a tokenizer using SE(3)-equivariant GCPNet to convert protein structures into discrete tokens while preserving chirality and orientation.
- Trained on 24 million protein structures, GCP-VQVAE achieves state-of-the-art performance with backbone RMSDs of 0.4377 Å, 0.5293 Å, and 0.7567 Å on CAMEO2024, CASP15, and CASP16 datasets respectively.
- On a zero-shot set of 1,938 new structures, it showed robust generalization with a backbone RMSD of 0.8193 Å and TM-score of 0.9673, and offers significantly reduced latency compared to previous models.
Abstract
Converting protein tertiary structure into discrete tokens via vector-quantized variational autoencoders (VQ-VAEs) creates a language of 3D geometry and provides a natural interface between sequence and structure models. While pose invariance is commonly enforced, retaining chirality and directional cues without sacrificing reconstruction accuracy remains challenging. In this paper, we introduce GCP-VQVAE, a geometry-complete tokenizer built around a strictly SE(3)-equivariant GCPNet encoder that preserves orientation and chirality of protein backbones. We vector-quantize rotation/translation-invariant readouts that retain chirality into a 4,096-token vocabulary, and a transformer decoder maps tokens back to backbone coordinates via a 6D rotation head trained with SE(3)-invariant objectives. Building on these properties, we train GCP-VQVAE on a corpus of 24 million monomer protein backbone structures gathered from the AlphaFold Protein Structure Database. On the CAMEO2024, CASP15, and CASP16 evaluation datasets, the model achieves backbone RMSDs of 0.4377 Å, 0.5293 Å, and 0.7567 Å, respectively, and achieves 100% codebook utilization on a held-out validation set, substantially outperforming prior VQ-VAE-based tokenizers and achieving state-of-the-art performance. Beyond these benchmarks, on a zero-shot set of 1,938 completely new experimental structures, GCP-VQVAE attains a backbone RMSD of 0.8193 Å and a TM-score of 0.9673, demonstrating robust generalization to unseen proteins. Lastly, we show that the Large and Lite variants of GCP-VQVAE are substantially faster than the previous SOTA (AIDO), reaching up to ~408x and ~530x lower end-to-end latency, while remaining robust to structural noise. We make the GCP-VQVAE source code, zero-shot dataset, and its pretrained weights fully open for the research community: https://github.com/mahdip72/vq_encoder_decoder
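The vector-quantization step at the core of any VQ-VAE tokenizer is textbook: assign each encoder output to its nearest codebook entry and pass gradients straight through the discrete assignment. A minimal sketch (illustrative of the generic mechanism, not GCP-VQVAE's equivariant encoder or training objectives):

```python
import torch

def vector_quantize(z, codebook):
    """Nearest-codebook-entry quantization with a straight-through
    gradient estimator, as used in VQ-VAE-style tokenizers."""
    # z: (n, d) encoder outputs; codebook: (K, d), e.g. K = 4096 tokens
    dists = torch.cdist(z, codebook)      # (n, K) pairwise distances
    tokens = dists.argmin(dim=1)          # discrete token ids
    z_q = codebook[tokens]                # quantized vectors
    # straight-through: forward uses z_q, backward passes gradients to z
    z_st = z + (z_q - z).detach()
    return z_st, tokens

codebook = torch.randn(4096, 64)
z = torch.randn(10, 64, requires_grad=True)
z_q, tokens = vector_quantize(z, codebook)
print(tokens[:5])
```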
bioinformatics · 2026-02-03 · v3
Informative Missingness in Nominal Data: A Graph-Theoretic Approach to Revealing Hidden Structure
Zangene, E.; Schwämmle, V.; Jafari, M.
AI Summary
- This study introduces a graph-theoretic approach to analyze missing data in nominal datasets, treating missing values as informative signals rather than gaps.
- By constructing bipartite graphs from nominal variables, the method reveals hidden structures through modularity, nestedness, and similarity analysis.
- Applied across various domains, the approach showed that missing data patterns can distinguish between random and non-random missingness, enhancing structural understanding and aiding in tasks like clustering.
Abstract
Missing data is often treated as a nuisance, routinely imputed or excluded from statistical analyses, especially in nominal datasets where its structure cannot be easily modeled. However, the form of missingness itself can reveal hidden relationships, substructures, and biological or operational constraints within a dataset. In this study, we present a graph-theoretic approach that reinterprets missing values not as gaps to be filled, but as informative signals. By representing nominal variables as nodes and encoding observed or missing associations as edges, we construct both weighted and unweighted bipartite graphs to analyze modularity, nestedness, and projection-based similarities. This framework enables downstream clustering and structural characterization of nominal data based on the topology of observed and missing associations; edge prediction via multiple imputation strategies is included as an optional downstream analysis to evaluate how well inferred values preserve the structure identified in the non-missing data. Across a series of biological, ecological, and social case studies, including proteomics data, the BeatAML drug screening dataset, ecological pollination networks, and HR analytics, we demonstrate that the structure of missing values can be highly informative. These configurations often reflect meaningful constraints and latent substructures, providing signals that help distinguish between data missing at random and not at random. When analyzed with appropriate graph-based tools, these patterns can be leveraged to improve the structural understanding of data and provide complementary signals for downstream tasks such as clustering and similarity analysis. Our findings support a conceptual shift: missing values are not merely analytical obstacles but valuable sources of insight that, when properly modeled, can enrich our understanding of complex nominal systems across domains.
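The core construction is easy to sketch: build a bipartite graph whose edges mark missing (sample, variable) cells, then project onto one node set to see which samples share missingness. A toy example with networkx (the table, column names, and values are hypothetical):

```python
import pandas as pd
import networkx as nx

# Toy table with informative missingness (None = missing measurement)
df = pd.DataFrame(
    {"drug_A": [1.0, None, 2.0, None],
     "drug_B": [None, 3.0, None, 4.0]},
    index=["sample1", "sample2", "sample3", "sample4"],
)

# Bipartite graph: samples on one side, variables on the other;
# an edge marks a *missing* (sample, variable) pair.
B = nx.Graph()
B.add_nodes_from(df.index, bipartite=0)
B.add_nodes_from(df.columns, bipartite=1)
for s in df.index:
    for v in df.columns:
        if pd.isna(df.loc[s, v]):
            B.add_edge(s, v)

# Project onto samples: samples link when they share missing variables.
proj = nx.bipartite.weighted_projected_graph(B, list(df.index))
print(proj.edges(data=True))
```

Modularity or nestedness statistics computed on such graphs are then what distinguish structured (informative) missingness from missingness that is random.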
bioinformatics · 2026-02-03 · v2
Automated Segmentation of Kidney Nephron Structures by Deep Learning Models on Label-free Autofluorescence Microscopy for Spatial Multi-omics Data Acquisition and Mining
Patterson, N. H.; Neumann, E. K.; Sharman, K.; Allen, J. L.; Harris, R. C.; Fogo, A. B.; deCaestecker, M. P.; Van de Plas, R.; Spraggins, J. M.
AI Summary
- Developed deep learning models for automated segmentation of kidney nephron structures using label-free autofluorescence microscopy.
- Models accurately segmented functional tissue units and gross kidney morphology with F1-scores >0.85 and Dice-Sørensen coefficients >0.80.
- Enabled quantitative association of lipids with segmented structures and spatial transcriptomics data acquisition from collecting ducts, showing differential gene expression in medullary regions.
Abstract
Automated spatial segmentation models can enrich spatio-molecular omics analyses by providing a link to relevant biological structures. We developed segmentation models that use label-free autofluorescence (AF) microscopy to recognize multicellular functional tissue units (FTUs) (glomerulus, proximal tubule, descending thin limb, ascending thick limb, distal tubule, and collecting duct) and gross morphological structures (cortex, outer medulla, and inner medulla) in the human kidney. Annotations were curated using highly specific multiplex immunofluorescence and transferred to co-registered AF for model training. All FTUs (except the descending thin limb) and gross kidney morphology were segmented with high accuracy: >0.85 F1-score and Dice-Sørensen coefficients >0.80, respectively. This workflow allowed lipids, profiled by imaging mass spectrometry, to be quantitatively associated with segmented FTUs. The segmentation masks were also used to acquire spatial transcriptomics data from collecting ducts. Consistent with previous literature, we demonstrated differing transcript expression of collecting ducts in the inner and outer medulla.
bioinformatics · 2026-02-03 · v2
Transcriptomic and protein analysis of human cortex reveals genes and pathways linked to NPTX2 disruption in Alzheimer's disease
Lao, Y.; Xiao, M.-F.; Ji, S.; Piras, I. S.; Kim, K.; Bonfitto, A.; Song, S.; Aldabergenova, A.; Sloan, J.; Trejo, A.; Geula, C.; Na, C.-H.; Rogalski, E. J.; Kawas, C. H.; Corrada, M. M.; Serrano, G. E.; Beach, T. G.; Troncoso, J. C.; Huentelman, M. J.; Barnes, C. A.; Worley, P. F.; Colantuoni, C.
AI Summary
- This study used bulk RNA sequencing and targeted proteomics on human cortex samples to explore genes and pathways associated with NPTX2 disruption in Alzheimer's disease (AD).
- NPTX2 expression was significantly reduced in AD, correlating with BDNF, VGF, SST, and SCG2, indicating a role in synaptic and mitochondrial functions.
- In AD, NPTX2-related synaptic and mitochondrial pathways weakened, while stress-linked transcriptional regulators increased, suggesting a shift in regulatory dynamics.
Abstract
The expression of NPTX2, a neuronal immediate early gene (IEG) essential for excitatory-inhibitory balance, is altered in the earliest stages of cognitive decline that precede Alzheimer's disease (AD). Here, we use NPTX2 as a point of reference to identify genes and pathways linked to its role in AD onset and progression. We performed bulk RNA sequencing on 575 middle temporal gyrus (MTG) samples across four cohorts, together with targeted proteomics in 135 of these same samples, focusing on 20 curated proteins spanning synaptic, trafficking, lysosomal, and regulatory categories. NPTX2 RNA and protein were significantly reduced in AD, and to a lesser extent in mild cognitive impairment (MCI) samples. RNA expression of BDNF, VGF, SST, and SCG2 correlated with both NPTX2 mRNA and protein levels. We identified NPTX2-correlated synaptic and mitochondrial programs that were negatively correlated with lysosomal and chromatin/stress modules. Gene set enrichment analysis (GSEA) of NPTX2 correlations across all samples confirmed broad alignment with synaptic and mitochondrial compartments, and more NPTX2-specific associations with proteostasis and translation regulator pathways, all of which were weakened in AD. In contrast, correlation of NPTX2 protein with transcriptomic profiles revealed negative associations with stress-linked transcription regulator RNAs (FOXJ1, ZHX3, SMAD5, JDP2, ZIC4), which were strengthened in AD. These results position NPTX2 as a hub of an activity-regulated "plasticity cluster" (BDNF, VGF, SST, SCG2) that encompasses interneuron function and is embedded on a neuronal/mitochondrial integrity axis that is inversely coupled to lysosomal and chromatin-stress programs. In AD, these RNA-level correlations broadly weaken, and stress-linked transcriptional regulators become more prominent, suggesting a role in NPTX2 loss of function. Individual gene-level data from the bulk RNA-seq in this study can be freely explored at [INSERT LINK].
bioinformatics · 2026-02-03 · v2
SpaCEy: Discovery of Functional Spatial Tissue Patterns by Association with Clinical Features Using Explainable Graph Neural Networks
Rifaioglu, A. S.; Ervin, E. H.; Sarigun, A.; Germen, D.; Bodenmiller, B.; Tanevski, J.; Saez-Rodriguez, J.
AI Summary
- SpaCEy uses explainable graph neural networks to analyze spatial tissue patterns from molecular marker expression, linking these patterns to clinical outcomes without predefined cell types.
- Applied to lung cancer, SpaCEy identified spatial cell arrangements and protein marker expressions linked to disease progression.
- In breast cancer datasets, SpaCEy stratified patients by overall survival, revealing key spatial patterns of protein markers across and within clinical subtypes.
Abstract
Tissues are complex ecosystems tightly organized in space. This organization influences their function, and its alteration underpins multiple diseases. Spatial omics allows us to profile its molecular basis, but how to leverage these data to link spatial organization and molecular patterns to clinical practice remains a challenge. We present SpaCEy (SpatialClinicalExplainability), an explainable graph neural network that uncovers organizational tissue patterns predictive of clinical outcomes. SpaCEy learns directly from molecular marker expression by modelling tissues as spatial graphs of cells and their interactions, without requiring predefined cell types or anatomical regions. Its embeddings capture intercellular relationships and molecular dependencies that enable accurate prediction of variables such as overall survival and disease progression. SpaCEy integrates a specialized explainer module that reveals recurring spatial patterns of cell organisation and coordinated marker expression that are most relevant to predictions of the models. Applied to a spatially resolved proteomic lung cancer cohort, SpaCEy discovers distinct spatial arrangements of cells together with coordinated expression of protein markers associated with disease progression. Across multiple breast cancer proteomic datasets, it consistently stratifies patients according to overall survival, both across and within established clinical subtypes. SpaCEy also highlights spatial patterns of a small set of key protein markers underlying this patient stratification.
bioinformatics · 2026-02-03 · v2
ImmunoPheno: A Computational Framework for Data-Driven Design and Analysis of Immunophenotyping Experiments
Wu, L.; Nguyen, M. A.; Yang, Z.; Potluri, S.; Sivagnanam, S.; Kirchberger, N.; Joshi, A.; Ahn, K. J.; Tumulty, J. S.; Cruz Cabrera, E.; Romberg, N.; Tan, K.; Coussens, L. M.; Camara, P. G.
AI Summary
- ImmunoPheno is a computational framework that uses single-cell proteo-transcriptomic data to automate the design of antibody panels, gating strategies, and cell identity annotation for immunophenotyping.
- It was used to create a reference (HICAR) with 390 antibodies and 93 immune cell populations, enabling the design of minimal panels for isolating rare cells like MAIT cells and pDCs, validated experimentally.
- The framework accurately annotates cell identities across various cytometry datasets, enhancing the accuracy, reproducibility, and resolution of immunophenotyping.
Abstract
Immunophenotyping is fundamental to characterizing tissue cellular composition, pathogenic processes, and immune infiltration, yet its accuracy and reproducibility remain constrained by heuristic antibody panel design and manual gating. Here, we present ImmunoPheno, an open-source computational platform that repurposes large-scale single-cell proteo-transcriptomic data to guide immunophenotyping experimental design and analysis. ImmunoPheno integrates existing datasets to automate the design of optimal antibody panels, gating strategies, and cell identity annotation. We used ImmunoPheno to construct a harmonized reference (HICAR) comprising 390 monoclonal antibodies and 93 human immune cell populations. Leveraging this resource, we algorithmically designed minimal panels to isolate rare populations, such as MAIT cells and pDCs, which we validated experimentally. We further demonstrate accurate cell identity annotation across publicly available and newly generated cytometry datasets spanning diverse technologies, including spatial platforms like CODEX. ImmunoPheno complements expert curation and supports continual expansion, providing a scalable framework to enhance the accuracy, reproducibility, and resolution of immunophenotyping.
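Minimal-panel design of this kind is naturally framed as a set-cover-style problem: greedily add the marker that separates the most still-confusable population pairs. A toy sketch with hypothetical binary marker profiles (not ImmunoPheno's actual algorithm, which works from quantitative reference data):

```python
from itertools import combinations

# Toy binary profiles: population -> markers scored positive (hypothetical)
profiles = {
    "MAIT": {"CD3", "CD161", "TCR-Va7.2"},
    "pDC":  {"CD123", "CD303"},
    "NK":   {"CD56", "CD161"},
    "T":    {"CD3"},
}
markers = set().union(*profiles.values())
pairs = set(combinations(sorted(profiles), 2))

panel = set()
while pairs:
    def resolved(m):  # pairs a marker separates (status differs between the two)
        return {p for p in pairs
                if (m in profiles[p[0]]) != (m in profiles[p[1]])}
    best = max(markers - panel, key=lambda m: len(resolved(m)))
    panel.add(best)
    pairs -= resolved(best)

print(sorted(panel))  # a small panel separating every population pair
```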
bioinformatics · 2026-02-03 · v1
A modality gap in personal-genome prediction by sequence-to-function models
Mostafavi, S.; Tu, X.; Spiro, A.; Chikina, M.
AI Summary
- The study evaluated AlphaGenome's ability to predict personal genome variations in gene expression and chromatin accessibility.
- AlphaGenome performed near the heritability ceiling for chromatin accessibility but significantly underperformed for gene expression compared to baseline.
- Findings suggest chromatin accessibility is influenced by local regulatory elements, while gene expression requires integration of long-range regulatory effects, which current models struggle with.
Abstract
Sequence-to-function (S2F) models trained on reference genomes have achieved strong performance on regulatory prediction and variant-effect benchmarks, yet they still struggle to predict inter-individual variation in gene expression from personal genomes. We evaluated AlphaGenome on personal genome prediction in two molecular modalities--gene expression and chromatin accessibility--and observed a striking dichotomy: AlphaGenome approaches the heritability ceiling for chromatin accessibility variation, but remains far below baseline for gene-expression variation, despite improving over Borzoi. Context truncation and fine-mapped QTL analyses indicate that accessibility is governed by local regulatory grammar captured by current architectures, whereas gene-expression variation requires long-range regulatory integration that remains challenging.
bioinformatics · 2026-02-03 · v1
GAISHI: A Python Package for Detecting Ghost Introgression with Machine Learning
Huang, X.; Hackl, J.; Kuhlwilm, M.
AI Summary
- GAISHI is a Python package designed to detect ghost introgression using machine learning techniques like logistic regression and UNet++.
- It addresses the limitation of previous studies by providing a software implementation for identifying introgressed segments and alleles.
- The package's utility was demonstrated in a Human-Neanderthal introgression scenario.
Abstract
Summary: Ghost introgression is a challenging problem in population genetics. Recent studies have explored supervised learning models, namely logistic regression and UNet++, to detect genomic footprints of ghost introgression. However, their applicability is limited because existing implementations are tailored to the tasks in their respective publications and are not available as reusable software. Here, we present GAISHI, a Python package for identifying introgressed segments and alleles using machine learning, and demonstrate its usage in a Human-Neanderthal introgression scenario. Availability and implementation: GAISHI is available on GitHub under the GNU General Public License v3.0. The source code can be found at https://github.com/xin-huang/gaishi.
bioinformatics · 2026-02-03 · v1
PepMCP: A Graph-Based Membrane Contact Probability Predictor for Membrane-Lytic Antimicrobial Peptides
Dong, R.; Awang, T.; Cao, Q.; Kang, K.; Wang, L.; Zhu, Z.; Song, C.
AI Summary
- This study introduces PepMCP, a graph-based model for predicting membrane contact probability (MCP) of short antimicrobial peptides (AMPs) targeting bacterial membranes.
- Over 500 membrane-lytic AMPs were used to train PepMCP, employing coarse-grained molecular dynamics simulations and the GraphSAGE framework.
- PepMCP achieved a Pearson correlation coefficient of 0.883 and RMSE of 0.123, enhancing mechanism-driven AMP discovery with the MemAMPdb database and a web server for access.
Abstract
Motivation: The membrane-lytic mechanism of antimicrobial peptides (AMPs) is often overlooked during their in silico discovery process, largely due to the lack of a suitable metric for the membrane-binding propensity of peptides. Previously, we proposed a characteristic called membrane contact probability (MCP) and applied it to the identification of membrane proteins and membrane-lytic AMPs. However, previous MCP predictors were not trained on short peptides targeting bacterial membranes, which may result in unsatisfactory performance for peptide studies. Results: In this study, we present PepMCP, a peptide-tailored model for predicting MCP values of short peptides. We collected more than 500 membrane-lytic AMPs from the literature, conducted coarse-grained molecular dynamics (MD) simulations for these AMPs, and extracted their residue MCP labels from MD trajectories to train PepMCP. PepMCP employs the GraphSAGE framework to address this node regression task, encoding each peptide sequence as a graph with 4-hop edges. PepMCP achieved a Pearson correlation coefficient of 0.883 and an RMSE of 0.123 on the node-level test set. It can recognize membrane-lytic AMPs with the predicted MCP values for each sequence, thereby facilitating mechanism-driven AMP discovery. Additionally, we provide a database, MemAMPdb, which includes the membrane-lytic AMPs, as well as the PepMCP web server for easy access. Availability and Implementation: The code and data are available at https://github.com/ComputBiophys/PepMCP.
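A GraphSAGE-style node regression on a peptide chain with 4-hop edges can be sketched in a few lines: each residue aggregates the mean of its sequence neighbors before a linear update, ending in a sigmoid for a per-residue probability. A self-contained toy sketch (dimensions, features, and architecture are placeholders, not PepMCP's):

```python
import torch
import torch.nn as nn

def hop_neighbors(n, k=4):
    """Neighbor lists for a peptide chain graph with k-hop edges:
    residues i and j are connected when 0 < |i - j| <= k."""
    return [[j for j in range(max(0, i - k), min(n, i + k + 1)) if j != i]
            for i in range(n)]

class SageLayer(nn.Module):
    """One GraphSAGE-style update: h_i <- relu(W [h_i ; mean(h_neighbors)])."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.lin = nn.Linear(2 * d_in, d_out)

    def forward(self, h, nbrs):
        agg = torch.stack([h[n].mean(0) for n in nbrs])
        return torch.relu(self.lin(torch.cat([h, agg], dim=-1)))

n, d = 20, 16                 # toy peptide: 20 residues, 16-dim features
h = torch.randn(n, d)
nbrs = hop_neighbors(n)
layer1, layer2 = SageLayer(d, 32), SageLayer(32, 32)
head = nn.Linear(32, 1)
mcp = torch.sigmoid(head(layer2(layer1(h, nbrs), nbrs))).squeeze(-1)
print(mcp.shape)              # per-residue contact probabilities in [0, 1]
```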
bioinformatics2026-02-03v1Predicting mutation-rate variation across the genome using epigenetic data
Katori, M.; Kobayashi, T. J.; Nordborg, M.; Shi, S.AI Summary
- The study integrates epigenetic data (histone marks, DNA methylation, chromatin accessibility) with de novo mutation data in Arabidopsis thaliana to model mutation probability at the coding sequence level.
- Using non-negative matrix factorization, 15 epigenetic patterns were identified, stratifying coding sequences into six classes with different mutation probabilities.
- A predictive model based on these patterns outperformed others, showing that epigenetic context significantly influences local mutation rates, with changes under hypoxia indicating dynamic chromatin effects on mutation probability.
Abstract
Mutation rate variation is a fundamental driver of evolution, yet how it is locally patterned across genomes and structured by chromatin context remains unresolved. Here, we integrate genome-wide profiles of histone marks, DNA methylation and chromatin accessibility in Arabidopsis thaliana with de novo mutation data to model mutation probability at the level of coding sequence (CDS). Using non-negative matrix factorization, we identify 15 combinatorial epigenetic patterns whose graded mixtures stratify CDSs into six classes with distinct mutation probabilities. A generalized linear model based on pattern weights predicts local mutation probability and outperforms models based on sequence context, expression and classical genomic categories. These patterns capture context-dependent variation that is obscured by gene-level summaries and single-feature analyses. Cluster-level differences are partly retained in mutation-accumulation lines, indicating persistence into heritable mutational input. Under hypoxia, stress-responsive chromatin remodeling redistributes epigenetic contexts associated with higher predicted mutation probability toward hypoxia-responsive genes and DNA-repair pathways. Together, our results provide a CDS-resolved and interpretable framework linking combinatorial epigenomic context to mutational input, clarifying how dynamic chromatin states shape local mutation-rate heterogeneity.
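The two-stage design (NMF patterns, then a GLM on pattern weights) can be sketched with standard tooling; the matrix contents and the simulated mutation labels below are placeholders, not the study's data.

```python
# Sketch: NMF on a CDS-by-epigenetic-feature matrix, then a GLM on the
# resulting pattern weights. All data here are simulated placeholders.
import numpy as np
from sklearn.decomposition import NMF
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
# Hypothetical matrix: 1,000 coding sequences x 30 non-negative epigenetic
# features (histone marks, DNA methylation, accessibility).
E = rng.gamma(shape=2.0, scale=1.0, size=(1000, 30))

nmf = NMF(n_components=15, init="nndsvda", max_iter=500, random_state=0)
W = nmf.fit_transform(E)          # per-CDS weights over 15 patterns
H = nmf.components_               # 15 patterns x 30 features

# Simulated per-CDS de novo mutation indicator; in the study this response
# comes from observed mutation data and is modeled on the pattern weights.
y = rng.binomial(1, 1 / (1 + np.exp(-(W @ rng.normal(size=15) - 1))))
glm = LogisticRegression(max_iter=1000).fit(W, y)
print("pattern coefficients:", np.round(glm.coef_.ravel(), 2))
```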
bioinformatics2026-02-03v1Predicting unknown binding sites for transition metal based compounds in proteins
Levy, A.; Rothlisberger, U.AI Summary
- This study evaluates the use of Metal3D and Metal1D, tools originally designed for zinc ion binding prediction, to identify binding sites for transition metal complexes in proteins.
- Both tools successfully predicted several known binding sites from apo protein structures, despite limitations like sensitivity to side-chain conformations.
- The research suggests a computational pipeline where these tools could initially identify potential binding sites, followed by refinement with more precise methods.
Abstract
Transition metal based compounds are promising therapeutic agents, particularly in cancer treatment. However, predicting their binding sites remains a major challenge. In this work, we investigate the applicability of two tools, Metal3D and Metal1D, for this purpose. Although originally trained to predict zinc ion binding sites only, both predictors successfully identify several experimentally observed binding sites for transition metal complexes directly from apo protein structures. At the same time, we highlight current limitations, such as the sensitivity to side-chain conformations, and discuss possible strategies for improvement. This work provides a first step toward establishing a robust computational pipeline in which rapid and low-cost predictors are able to identify putative hotspots for transition metal binding, which can then be refined using more accurate but computationally demanding methods.
bioinformatics2026-02-03v1PPGLomics: An Interactive Platform for Pheochromocytoma and Paraganglioma Transcriptomics
Alkaissi, H.; Gordon, C. M.; Pacak, K.AI Summary
- PPGLomics is an interactive web platform for analyzing pheochromocytoma and paraganglioma (PPGL) transcriptomics, addressing the lack of disease-specific bioinformatics resources.
- It integrates the TCGA-PCPG (n=160) and A5 consortium SDHB (n=91) datasets, offering tools for differential expression, correlation, survival analysis, and various visualizations.
- The platform is designed for use by scientists and healthcare professionals without requiring bioinformatics expertise and is freely accessible online.
Abstract
Pheochromocytoma and paraganglioma (PPGL) are rare neuroendocrine tumors with unique biological behavior and remarkably high heritability, yet dedicated bioinformatics resources for these diagnoses remain limited. Existing cancer multi-omics platforms are pan-cancer in scope, often lacking the disease-specific annotations, granularity, and cross-database harmonization required for meaningful stratification and hypothesis generation. Here we introduce PPGLomics, an interactive web-based platform designed for comprehensive PPGL transcriptomics analysis. PPGLomics v1.0 integrates two major datasets, the TCGA-PCPG cohort (n=160) spanning multiple molecular subtypes, and the A5 consortium SDHB cohort (n=91) with detailed clinicopathological and molecular annotations. The platform provides basic and clinical scientists, as well as a broad range of healthcare professionals, with tools for differential expression analysis, correlation analysis, survival analysis, and visualization, including boxplots, heatmaps, volcano plots, and Kaplan-Meier survival plots, enabling exploration of gene expression patterns across PPGL subtypes without requiring bioinformatics expertise. PPGLomics v1.0 is freely available at https://alkaissilab.shinyapps.io/PPGLomics.
bioinformatics2026-02-03v1PlotGDP: an AI Agent for Bioinformatics Plotting
Luo, X.; Shi, Y.; Huang, H.; Wang, H.; Cao, W.; Zuo, Z.; Zhao, Q.; Zheng, Y.; Xie, Y.; Jiang, S.; Ren, J.AI Summary
- PlotGDP is an AI agent-based web server designed for creating high-quality bioinformatics plots using natural language commands, eliminating the need for coding or environment setup.
- It leverages large language models (LLMs) to process user-uploaded data on a remote server, ensuring ease of use.
- The platform uses curated template scripts to reduce the risk of errors from LLMs, aiming to enhance bioinformatics visualization for global research.
Abstract
High-quality bioinformatics plotting is important for biology research, especially when preparing publications. However, the long learning curve and complex coding environment configuration are often unavoidable costs on the way to publication-ready plots. Here, we present PlotGDP (https://plotgdp.biogdp.com/), an AI agent-based web server for bioinformatics plotting. Built on large language models (LLMs), the intelligent plotting agent is designed to accommodate various types of bioinformatics plots, while offering easy usage with simple natural language commands from users. No coding experience or environment deployment is required, since all the user-uploaded data is processed by LLM-generated code on our remote high-performance server. Additionally, all plotting sessions are based on curated template scripts to minimize the risk of hallucinations from the LLM. Aided by PlotGDP, we hope to contribute to the global biology research community by constructing an online platform for fast and high-quality bioinformatics visualization.
bioinformatics2026-02-03v1HiChIA-Rep quantifies the similarity between enrichment-based chromatin interaction datasets
Kim, S. S.; Jackson, J. T.; Zhang, H. B.; Kim, M.AI Summary
- HiChIA-Rep is an algorithm designed to quantify the similarity between datasets from enrichment-based 3D genome mapping technologies like ChIA-PET and HiChIP.
- It uses both 1D and 2D signals through graph signal processing to assess data reproducibility.
- HiChIA-Rep effectively distinguishes biological replicates from non-replicates and outperforms tools designed for Hi-C data.
Abstract
3D genome mapping technologies such as ChIA-PET, HiChIP, PLAC-seq, HiCAR, and ChIATAC yield pairwise contacts and a one-dimensional signal indicating protein binding or chromatin accessibility. However, a lack of computational tools to quantify the reproducibility of these enrichment-based 3C data prevents rigorous data quality assessment and interpretation. We developed HiChIA-Rep, an algorithm incorporating both 1D and 2D signals to measure similarity via graph signal processing methods. HiChIA-Rep distinguishes biological replicates from non-replicates, as well as datasets from different cell lines and protein factors, outperforming tools designed for Hi-C data. As ever-larger numbers of multi-ome datasets are generated, HiChIA-Rep will likely be a fundamental tool for the 3D genomics community.
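As a rough illustration of the graph-signal-processing idea (not HiChIA-Rep's actual algorithm), one can diffuse the 1D signal over the contact graph and correlate the filtered signals of two samples:

```python
# Hedged sketch: low-pass filter a 1D binding signal over a 2D contact graph,
# then correlate the filtered signals of two samples. Purely illustrative.
import numpy as np

def graph_lowpass(contacts: np.ndarray, signal: np.ndarray, steps: int = 3):
    """Diffuse a 1D signal over the contact graph via the random-walk matrix."""
    row_sums = contacts.sum(axis=1, keepdims=True)
    P = contacts / np.maximum(row_sums, 1e-12)     # row-stochastic transitions
    s = signal.astype(float)
    for _ in range(steps):
        s = P @ s                                  # one diffusion step
    return s

rng = np.random.default_rng(3)
base = rng.poisson(5, size=(200, 200)).astype(float)
contacts_a = base + rng.poisson(1, size=base.shape)   # simulated replicate A
contacts_b = base + rng.poisson(1, size=base.shape)   # simulated replicate B
peaks = rng.gamma(2.0, 1.0, size=200)                 # shared 1D signal

fa = graph_lowpass((contacts_a + contacts_a.T) / 2, peaks)
fb = graph_lowpass((contacts_b + contacts_b.T) / 2, peaks)
print(f"similarity (Pearson r) = {np.corrcoef(fa, fb)[0, 1]:.3f}")
```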
bioinformatics2026-02-03v1MOSAIC: A Structured Multi-level Framework for Probabilistic and Interpretable Cell-type Annotation
Yang, M.; Qi, J.; Lan, M.; Huang, J.; Jin, S.AI Summary
- MOSAIC is a multi-level framework for cell-type annotation in single-cell RNA sequencing that integrates cell-level marker evidence with cluster-level population context.
- It uses a probabilistic approach to handle uncertainty, mixed states, and population structure, improving upon single-level annotation methods.
- Across six tissues and under dropout perturbations, MOSAIC matched or outperformed other methods, providing structured uncertainty estimates and identifying stable intermediate cell states.
Abstract
Accurate cell-type annotation is a foundational task in single-cell RNA sequencing analysis, yet remains fundamentally challenged by cellular heterogeneity, gradual lineage transitions, and technical noise. As single-cell atlases expand in scale and resolution, most existing annotation approaches operate at a single analytical level and encode cell identity as fixed categorical labels, limiting their ability to represent uncertainty, mixed biological states, and population-level structure. Here we introduce MOSAIC (Multi-level prObabilistic and Structured Adaptive IdentifiCation), a structured multi-level annotation framework that integrates cell-level marker evidence with cluster-level population context within a unified probabilistic system. Rather than treating annotation as an independent per-cell prediction task, MOSAIC formulates cell-type assignment as a coordinated multi-level inference process, in which probabilistic evidence at the single-cell level is aggregated, constrained, and refined by population context. MOSAIC integrates direction-aware marker scoring with dual-layer probabilistic representation and adaptive cross-level refinement, enabling uncertainty to be quantified and propagated across biological scales. This design yields coherent annotations that preserve fine-grained single-cell variation while maintaining population-level consistency, and allows ambiguous or transitional states to be represented explicitly rather than collapsed into hard labels. Across six diverse tissues and under controlled dropout perturbations, MOSAIC consistently matches or outperforms representative marker-based, reference-based, and machine-learning annotation methods. Beyond accuracy, MOSAIC provides structured uncertainty estimates and coherent population-level structure, enabling the identification of stable intermediate cell states that arise from gradual lineage transitions rather than technical noise. Together, MOSAIC advances cell-type annotation from a single-level classification task to a structured multi-level inference problem, and establishes a general, interpretable, and uncertainty-aware computational framework for large-scale single-cell analysis.
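A toy version of the multi-level idea, assuming softmax marker scores and a fixed mixing weight -- both illustrative simplifications of MOSAIC's probabilistic machinery:

```python
# Sketch: per-cell marker evidence becomes a probability, is pooled within
# clusters, and the two levels are blended. Scores, clusters, and the mixing
# weight alpha are hypothetical stand-ins for MOSAIC's actual components.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(4)
n_cells, n_types = 300, 4
marker_scores = rng.normal(size=(n_cells, n_types))   # per-cell evidence
clusters = rng.integers(0, 5, size=n_cells)           # cluster labels

p_cell = softmax(marker_scores)                       # cell-level posterior
p_cluster = np.vstack([p_cell[clusters == c].mean(axis=0)
                       for c in range(5)])            # population context
alpha = 0.6                                           # cell vs. cluster weight
p_final = alpha * p_cell + (1 - alpha) * p_cluster[clusters]

entropy = -(p_final * np.log(p_final + 1e-12)).sum(axis=1)
labels = p_final.argmax(axis=1)     # hard label, reported with its uncertainty
print(labels[:10], np.round(entropy[:3], 2))
```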
bioinformatics2026-02-03v1Attractor Landscape Analysis Distinguishes Aging Markers from Rejuvenation Targets in Human Keratinocytes
Copes, N.; Canfield, C.-A. E.AI Summary
- The study used PRISM, a computational pipeline integrating pseudotime trajectory and Boolean network analysis, to identify rejuvenation targets in aging human keratinocytes from single-cell RNA sequencing data.
- Two distinct aging trajectories were identified: one where cells converge to an aged state (Y_272) and another where cells depart from a youthful state (Y_308).
- Key findings included BACH2 knockdown as the top rejuvenation target for Y_272, improving the aging score by 98.9%, and ASCL2 knockdown for Y_308, with enhanced effects when combined with ATF6 perturbation.
Abstract
Cellular aging is characterized by progressive changes in gene expression that contribute to tissue dysfunction; however, identifying genes that regulate the aging process, rather than merely serve as biomarkers, remains a significant challenge. Here we present PRISM (Pseudotime Reversion via In Silico Modeling), a computational pipeline that integrates pseudotime trajectory analysis with Boolean network analysis to identify cellular rejuvenation targets from single-cell RNA sequencing data. We applied PRISM to a published dataset of human skin comprising 47,060 cells from nine donors aged 18 to 76 years. Analysis of keratinocytes revealed two distinct aging trajectories with fundamentally different regulatory architectures. One trajectory (labeled Y_272) exhibited "aging as convergence," where cells were driven toward a single dominant aged attractor (aging score +2.181). A second trajectory (labeled Y_308) exhibited "aging as departure," where cells escaped from a dominant youthful attractor basin (aging score -0.536). Systematic perturbation analysis revealed a critical distinction between genes exhibiting age-related expression changes (phenotypic markers) and genes controlling attractor landscape architecture (regulatory controllers). Switch genes marking the aging trajectories proved largely ineffective as intervention targets, while master regulators operating at higher levels of the regulatory hierarchy produced substantial rejuvenation effects. BACH2 knockdown was identified as the dominant intervention for Y_272, shifting the aging score by Δ = -3.746 (98.9% improvement). ASCL2 knockdown was identified as the top target for Y_308, with synergistic enhancement observed through combinatorial perturbation with ATF6. These findings demonstrate that attractor-based analysis identifies different and potentially superior therapeutic targets compared to expression-based approaches and provide specific hypotheses for experimental validation of cellular rejuvenation strategies in human skin.
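The attractor-versus-perturbation logic can be made concrete with a toy synchronous Boolean network; the three-gene rules and the knockdown below are invented for illustration, whereas PRISM infers its networks from data.

```python
# Toy synchronous Boolean network: attractor detection from all start states,
# plus an in silico knockdown that clamps one gene OFF. Rules are invented.
import itertools

RULES = {  # each update rule maps the current state dict to 0/1
    "A": lambda s: s["B"] and not s["C"],
    "B": lambda s: s["A"] or s["B"],
    "C": lambda s: not s["A"],
}

def step(state, knockdown=None):
    nxt = {g: int(bool(f(state))) for g, f in RULES.items()}
    if knockdown is not None:
        nxt[knockdown] = 0              # clamp the perturbed gene OFF
    return nxt

def attractor(state, knockdown=None):
    seen = []
    while state not in seen:            # synchronous updates until a repeat
        seen.append(state)
        state = step(state, knockdown)
    cycle = seen[seen.index(state):]    # the states on the attractor cycle
    return frozenset(tuple(sorted(s.items())) for s in cycle)

genes = list(RULES)
starts = [dict(zip(genes, bits)) for bits in itertools.product([0, 1], repeat=3)]
print("unperturbed attractors:", {attractor(s) for s in starts})
print("with B knocked down:   ", {attractor(dict(s), "B") for s in starts})
```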
bioinformatics2026-02-03v1An agentic framework turns patient-sourced records into a multimodal map of ALS heterogeneity
Li, Z.; Gao, C.; Kong, J.; Fu, Y.; Wen, S.; Li, G.; Cao, Y.; Fu, Y.; Zhang, H.; Jia, S.; Liu, X.; Cai, L.; Yan, F.; Liu, X.; Tian, L.AI Summary
- The study introduces MEDSTREM, an LLM-based agent that transforms patient-sourced document images into standardized electronic health records, facilitating cohort building and linkage to trials and multi-omics data.
- By analyzing 8,298 individuals' clinical reports, MEDSTREM generated 17,602 records and multi-omics profiles, identifying five ALS subtypes and a continuous degeneration score.
- Key findings include functional loss tracking with hand-grip strength and forced vital capacity, malnutrition as a modifiable factor, and epigenetic changes like cell-cycle suppression and chromatin opening linked to clinical severity.
Abstract
ALS shows marked clinical heterogeneity, yet much real-world evidence remains trapped in unstructured reports. Here we introduce MEDSTREM, a large-language-model (LLM)-based agent that converts patient-sourced document images into standardized longitudinal electronic health records, enabling bottom-up cohort building and linkage to trials and multi-omics. By applying MEDSTREM to clinical report images from 8,298 individuals collected via AskHelpU and harmonizing with PRO-ACT and Answer ALS, we generated 17,602 standardized records and multi-omics profiles from 940 induced motor neuron lines. Progression modelling resolved five subtypes and a continuous degeneration score with interpretable anchors: hand-grip strength and forced vital capacity tracked functional loss, and malnutrition emerged as a modifiable correlate. Across RNA-seq and ATAC-seq, clinical severity aligns with suppression of cell-cycle programmes, declining histone-gene activity and genome-wide chromatin opening, suggesting distinct epigenetic trajectories. These findings establish an agentic AI framework that turns unstructured clinical records into mechanistic insight and links them to multi-omics, reframing ALS studies from top-down, trial-centric analyses to a bottom-up, patient-sourced approach that reveals actionable heterogeneity.
bioinformatics2026-02-03v1Computational insights into the interaction between Topoisomerase I and Rpc82 subunit of RNA Polymerase III in Saccharomyces cerevisiae
Nandi, P.; Kamal, I. M.; Chakrabarti, S.; Sengupta, S.AI Summary
- This study modeled the full-length yeast Topoisomerase I (Top1) to investigate its interaction with Rpc82, a subunit of RNA Polymerase III in Saccharomyces cerevisiae.
- Using molecular docking and dynamics simulations, the study identified critical residues at the Top1-Rpc82 interface, providing insights into how Top1 might regulate Pol III-mediated transcription.
Abstract
The process of DNA transcription leads to the generation of torsional stress, which must be resolved for smooth progression of the transcription machinery. In Saccharomyces cerevisiae, DNA topoisomerase I (Top1), a type IB topoisomerase, plays a critical role in relaxing supercoils and mitigating the topological strain associated with transcription. While several proteins from the transcription machinery have been reported to interact with yeast Top1, detailed characterization and functional relevance of these interactions have remained underexplored. This gap is partly due to the absence of a complete three-dimensional structure of the full-length enzyme, which hinders structure-based computational analyses of its interactome. In this study, we present a template-based model of full-length yeast Top1. Leveraging this model, we investigated its molecular interaction with Rpc82, a key subunit of the RNA polymerase III enzyme responsible for transcribing small non-coding RNAs such as tRNAs and 5S rRNA. Through molecular docking and molecular dynamics simulations, critical residues at the Top1-Rpc82 interface were identified that likely mediate their interaction. Our findings provide new insights into the structural basis of Top1's association with RNA polymerase III and its potential role in regulating Pol III-mediated transcription. The Top1 model developed here offers a valuable framework for future in silico studies aimed at elucidating the broader interactome and regulatory mechanisms of this essential enzyme.
bioinformatics2026-02-03v1Cell type-specific functions of nucleic acid-binding proteins revealed by deep learning on co-expression networks
Osato, N.; Sato, K.AI Summary
- This study uses a deep learning framework to infer the regulatory influence of nucleic acid-binding proteins (NABPs) across different cellular contexts by integrating gene co-expression data, improving prediction accuracy over traditional binding-based methods.
- The model's predictions were validated against ChIP-seq and eCLIP datasets, showing strong concordance.
- Analysis revealed cell type-specific regulatory programs, such as cancer pathways in K562 cells and differentiation in neural progenitor cells, highlighting the framework's utility in functional annotation of NABPs.
Abstract
Nucleic acid-binding proteins (NABPs) play central roles in gene regulation, yet their functional targets and regulatory programs remain incompletely characterized due to the limited scope and context specificity of experimental binding assays. Here, we present a deep learning framework that integrates gene co-expression-derived interactions with contribution-based model interpretation to infer NABP regulatory influence across diverse cellular contexts, without relying on predefined binding motifs or direct binding evidence. Replacing low-informative binding-based features with co-expression-derived interactions significantly improved gene expression prediction accuracy. Model-inferred regulatory targets showed strong and reproducible concordance with independent ChIP-seq and eCLIP datasets, exceeding random expectations across multiple genomic regions and threshold definitions. Functional enrichment and gene set enrichment analyses revealed coherent, cell type-specific regulatory programs, including cancer-associated pathways in K562 cells and differentiation-related processes in neural progenitor cells. Notably, we demonstrate that DeepLIFT-derived contribution scores capture relative regulatory importance in a background-dependent but biologically robust manner, enabling systematic identification of context-dependent NABP regulatory roles. Together, this framework provides a scalable strategy for functional annotation of NABPs and highlights the utility of combining expression-driven inference with interpretable deep learning to dissect gene regulatory architectures at scale.
bioinformatics2026-02-02v9ELITE: E3 Ligase Inference for Tissue-specific Elimination: An LLM-Based E3 Ligase Prediction System for Precise Targeted Protein Degradation
Patjoshi, S.; Froehlich, H.; Madan, S.AI Summary
- The study introduces ELITE, an AI-driven system using a BERT-based model to predict tissue-specific E3 ligases for targeted protein degradation (TPD).
- ELITE integrates protein embeddings with tissue-specific interaction data to identify E3 ligases that can selectively degrade pathogenic proteins in relevant tissues.
- This approach aims to expand the E3 ligase repertoire, enhancing precision in TPD and reducing systemic toxicity.
Abstract
Targeted protein degradation (TPD) has transformed modern drug discovery by harnessing the ubiquitin proteasome system to eliminate disease-driving proteins previously deemed undruggable. However, current approaches predominantly rely on a narrow set of ubiquitously expressed E3 ligases, such as Cereblon (CRBN) and Von Hippel-Lindau (VHL), which limits tissue specificity, increases systemic toxicity, and fosters resistance. Here, we present an AI-driven framework for the rational identification of tissue-specific E3 ligases suitable for precision-targeted degradation. Our model leverages a BERT-based protein language architecture trained on billions of sequences to generate contextual embeddings that capture structural and functional motifs relevant for E3 substrate compatibility. By integrating these embeddings with tissue-resolved protein-protein interaction data, the framework predicts ligase/target interactions that are both biologically plausible and context restricted. This enables the prioritization of ligases capable of driving selective degradation of pathogenic proteins within disease-relevant tissues. The proposed approach offers a scalable path to expand the E3 ligase repertoire and advance TPD toward true precision medicine.
bioinformatics2026-02-02v9rnaends: an R package to study exact RNA ends at nucleotide resolution
Caetano, T.; Redder, P.; Fichant, G.; Barriot, R.AI Summary
- The rnaends R package is designed for analyzing RNA-end sequencing data, focusing on the exact nucleotide resolution of RNA ends.
- It provides tools for preprocessing, mapping, quantification, and post-processing of RNA-end data, including TSS identification, analysis of translation speed, and post-transcriptional modifications.
- The package's utility is demonstrated through workflows on published datasets, highlighting its application in RNA metabolism studies.
Abstract
5' and 3' RNA-end sequencing protocols have unlocked new opportunities to study aspects of RNA metabolism such as synthesis, maturation and degradation, by enabling the quantification of exact ends of RNA molecules in vivo. From RNA-Seq data that have been generated with one of the specialized protocols, it is possible to identify transcription start sites (TSS) and/or endoribonucleolytic cleavage sites, and even, in some cases, co-translational 5' to 3' degradation dynamics. Furthermore, post-transcriptional addition of ribonucleotides at the 3' end of RNA can be studied at nucleotide resolution. While different RNA-end sequencing library protocols exist that have been adapted to a specific organism (prokaryote or eukaryote) or specific biological question, the generated RNA-Seq data are very similar and share common processing steps. Most importantly, in RNA-end sequencing only the mapped location of the 5' or 3' end is of interest, contrary to conventional RNA sequencing that considers genomic ranges for gene expression analysis. This translates to a simple representation of the quantitative data as a count matrix of RNA-end locations on the reference sequences. This representation seems under-exploited and is, to our knowledge, not available in a generic package focused on analyses of exact transcriptome ends. Here, we present the rnaends R package, which is dedicated to RNA-end sequencing analysis. It offers functions for raw read pre-processing, RNA-end mapping and quantification, RNA-end count matrix post-processing, and further downstream count matrix analyses such as TSS identification, fast Fourier transform for signal periodic pattern analysis, or differential proportion of RNA-end analysis. The use of rnaends is illustrated here with applications in RNA metabolism studies through selected rnaends workflows on published RNA-end datasets: (i) TSS identification, (ii) ribosome translation speed and co-translational degradation, (iii) post-transcriptional modification analysis and differential proportion analysis.
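Although rnaends itself is an R package, its central data structure -- a count matrix of exact end positions per sample -- is easy to sketch. The Python toy below uses simulated read coordinates and an arbitrary enrichment threshold purely for illustration:

```python
# Illustrative (Python) sketch of an RNA-end count matrix: rows are exact
# genomic positions, columns are samples, and values count mapped 5' ends.
# Coordinates and the TSS threshold are invented placeholders.
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
genome_len, samples = 1000, ["wt_rep1", "wt_rep2"]

counts = pd.DataFrame(0, index=pd.RangeIndex(genome_len, name="position"),
                      columns=samples)
for s in samples:
    # Simulated mapped 5'-end coordinates, with one enriched position (a TSS).
    ends = np.concatenate([rng.integers(0, genome_len, size=5000),
                           np.full(500, 123)])
    pos, n = np.unique(ends, return_counts=True)
    counts.loc[pos, s] = n

# Candidate TSS: positions whose 5'-end count far exceeds the mean coverage.
background = counts.mean(axis=0)
tss = counts[(counts > 10 * background).any(axis=1)]
print(tss)
```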
bioinformatics2026-02-02v3Near perfect identification of half sibling versus niece/nephew avuncular pairs without pedigree information or genotyped relatives
Sapin, E.; Kelly, K.; Keller, M. C.AI Summary
- The study addresses the challenge of distinguishing half-siblings from niece/nephew-avuncular pairs in large genomic biobanks without pedigree information.
- A novel method using across-chromosome phasing and haplotype-level sharing features was developed, achieving over 98% classification accuracy.
- This approach also enhances long-range phasing accuracy, aiding in pedigree reconstruction and managing cryptic relatedness in genomic studies.
Abstract
Motivation: Large-scale genomic biobanks contain thousands of second-degree relatives with missing pedigree metadata. Accurately distinguishing half-sibling (HS) from niece/nephew-avuncular (N/A) pairs--both sharing approximately 25% of the genome--remains a significant challenge. Current SNP-based methods rely on Identical-By-Descent (IBD) segment counts and age differences, but substantial distributional overlap leads to high misclassification rates. There is a critical need for a scalable, genotype-only method that can resolve these "half-degree" ambiguities without requiring observed pedigrees or extensive relative information. Results: We present a novel computational framework that achieves near-complete separation of HS and N/A pairs using only genotype data. Our approach utilizes across-chromosome phasing to derive haplotype-level sharing features that summarize how IBD is distributed across parental homologues. By modeling these features with a Gaussian mixture model (GMM), we demonstrate near-perfect classification accuracy (> 98%) in biobank-scale data. Furthermore, we show that these high-confidence relationship labels can serve as long-range phasing anchors, providing structural constraints that improve the accuracy of across-chromosome homologue assignment. This method provides a robust, scalable solution for pedigree reconstruction and the control of cryptic relatedness in large-scale genomic studies.
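The classification step reduces to fitting a two-component Gaussian mixture over haplotype-level sharing features; the two simulated features below merely mimic the asymmetry the paper exploits (half-siblings concentrate IBD sharing on one parental homologue):

```python
# Sketch of the GMM classification over haplotype-level IBD features.
# Both features are simulated stand-ins for the paper's across-chromosome
# phasing summaries, which are the actual source of separation.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(6)
hs = np.column_stack([rng.normal(0.9, 0.05, 500),    # share on homologue 1
                      rng.normal(0.25, 0.03, 500)])  # genome-wide IBD fraction
na = np.column_stack([rng.normal(0.6, 0.08, 500),
                      rng.normal(0.25, 0.03, 500)])
X = np.vstack([hs, na])

gmm = GaussianMixture(n_components=2, covariance_type="full",
                      random_state=0).fit(X)
labels = gmm.predict(X)
truth = np.repeat([0, 1], 500)
acc = max((labels == truth).mean(), (labels != truth).mean())  # label-swap safe
print(f"unsupervised separation accuracy: {acc:.3f}")
```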
bioinformatics2026-02-02v3cheCkOVER: An open framework and AI-ready global crayfish database for next-generation biodiversity knowledge
Parvulescu, L.; Livadariu, D.; Bacu, V. I.; Nandra, C. I.; Stefanut, T. T.; World of Crayfish Contributors,AI Summary
- The study introduces cheCkOVER, an open framework that transforms species occurrence data into structured, AI-ready formats, focusing on crayfish.
- cheCkOVER processes 111,729 crayfish records from 465 species, producing biogeographic descriptors, dynamic maps, and JSON geo-narratives with provenance metadata.
- This framework supports conservation metrics, tracks invasive species, and enhances biodiversity data utility for AI applications and public platforms like World of Crayfish.
Abstract
Background Species occurrence records represent the backbone of biodiversity science, yet their utility is often limited to spatial analyses, distribution maps, or presence-absence models. Current biodiversity infrastructures rarely provide computational formats directly usable by modern artificial intelligence (AI) systems, such as large language models (LLMs), which increasingly mediate scientific communication and knowledge synthesis. Open frameworks that convert biodiversity occurrences into structured, machine-accessible, provenance-rich knowledge are therefore essential--particularly those enabling rapid integration of new records, near real-time generation of spatial metrics, and production of both human interpretable reports and AI-consumable outputs. Such capabilities substantially reduce latency between data acquisition and decision support, while ensuring biodiversity knowledge remains traceable and verifiable in AI-mediated workflows. Results We introduce cheCkOVER, an open framework that converts raw species occurrence datasets into standardized, API-ready, multi-layered outputs: biogeographic descriptors, dynamic distribution maps, summary metrics, and structured JSON geo-narratives following a canonical template. The framework stratifies processing by population origin (indigenous vs. non-indigenous), enabling IUCN-aligned conservation metrics while simultaneously tracking invasion dynamics. Each output embeds standardized citation metadata ensuring full provenance traceability. We applied the pipeline to 111,729 validated crayfish (Astacidea) occurrence records from 465 species, generating comprehensive species packages including indigenous-range classifications (171 endemic, 287 regional, 5 cosmopolitan taxa) and non-indigenous range tracking for 30 invasive species. This proof-of-concept demonstrates how the framework transforms minimal datapoints--validated species occurrences--into interoperable knowledge consumable by both humans and computational systems. The JSON outputs are optimized for retrieval-augmented generation, enabling AI systems to dynamically access and cite biodiversity knowledge with explicit source attribution. Conclusions cheCkOVER is taxon-agnostic and establishes a reproducible pathway from biodiversity occurrences to narrative-ready, AI-interoperable knowledge with immediate public utility via the World of Crayfish® platform (https://world.crayfish.ro/), where each species page integrates structured outputs. The open-source framework (GPL-3) combines a generalizable processing pipeline with taxon-specific knowledge products, enabling flexible reuse across conservation research, policy reporting, and AI-driven applications. This minimalist-to-complex design extends the reach of biodiversity data beyond traditional analyses, positioning occurrence repositories as active knowledge engines for next-generation biodiversity informatics.
bioinformatics2026-02-02v2WITHDRAWN: OKR-Cell: Open World Knowledge Aided Single-Cell Foundation Model with Robust Cross-Modal Cell-Language Pre-training
Wang, H.; Zhang, X.; Fang, S.; Ran, L.; Deng, Z.; Zhang, Y.; Li, Y.; Li, S.AI Summary
- The manuscript titled "OKR-Cell: Open World Knowledge Aided Single-Cell Foundation Model with Robust Cross-Modal Cell-Language Pre-training" was withdrawn due to duplicate posting on arXiv.
- The authors request that this work not be cited as a reference.
Abstract
The authors have withdrawn this manuscript because of a duplicate posting of a preprint on arXiv. Therefore, the authors do not wish this work to be cited as reference for the project. If you have any questions, please contact the corresponding author. The original preprint can be found at arXiv:2601.05648
bioinformatics2026-02-02v2MLMarker: A machine learning framework for tissue inference and biomarker discovery
Claeys, T.; van Puyenbroeck, S.; Gevaert, K.; Martens, L.AI Summary
- MLMarker uses a Random Forest model to compute tissue similarity scores from proteomics data, trained on 34 healthy tissues.
- It employs SHAP for protein-level explanations and a penalty factor for missing proteins, enhancing robustness for sparse datasets.
- Testing on three datasets, MLMarker identified brain-like signatures in cerebral melanoma, achieved high accuracy in pan-cancer analysis, and traced origins in biofluids.
Abstract
MLMarker is a machine learning tool that computes continuous tissue similarity scores for proteomics data, addressing the challenge of interpreting complex or sparse datasets. Trained on 34 healthy tissues, its Random Forest model generates probabilistic predictions with SHAP-based protein-level explanations. A penalty factor corrects for missing proteins, improving robustness for low-coverage samples. Across three public datasets, MLMarker revealed brain-like signatures in cerebral melanoma metastases, achieved high accuracy in a pan-cancer cohort, and identified brain and pituitary origins in biofluids. MLMarker provides an interpretable framework for tissue inference and hypothesis generation, available as a Python package and Streamlit app.
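A minimal sketch of the prediction path, assuming a toy protein matrix and a simple coverage-based penalty (the package's actual penalty form and its SHAP explanations are richer):

```python
# Sketch: Random Forest per-tissue probabilities, down-weighted by a simple
# coverage penalty when many model proteins are missing. All data, tissue
# labels, and the penalty form are hypothetical simplifications.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(7)
X = rng.poisson(3, size=(200, 50)).astype(float)   # 200 samples x 50 proteins
y = rng.integers(0, 3, size=200)                   # 3 stand-in tissue classes
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

sample = X[0].copy()
missing = rng.random(50) < 0.4                     # 40% proteins not observed
sample[missing] = 0.0                              # sparse, low-coverage input

proba = rf.predict_proba(sample.reshape(1, -1))[0]
penalty = 1.0 - missing.mean()                     # simple coverage penalty
print("tissue scores:", np.round(proba * penalty, 3))
```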
bioinformatics2026-02-02v2MultiGEOmics: Graph-Based Integration of Multi-Omics via Biological Information Flows
Alipour Pijani, B.; Rifat, J. I. M.; Bozdag, S.AI Summary
- MultiGEOmics is a graph-based framework designed to integrate multi-omics data by incorporating cross-omics regulatory signals and handling missing data.
- It learns robust embeddings across omics types, maintaining performance under varying data completeness scenarios.
- Evaluations on 11 datasets showed MultiGEOmics consistently performs well and provides interpretability by highlighting key omics features for predictions.
Abstract
Motivation: Multi-omics datasets capture complementary aspects of biological systems and are central to modern machine learning applications in biology and medicine. Existing graph-based integration methods typically construct separate graphs for each omics type and focus primarily on intra-omic relationships. As a result, they often overlook cross-omics regulatory signals--bidirectional interactions across omics layers--that are critical for modeling complex cellular processes. A second major challenge is missing or incomplete omics data; many current approaches degrade substantially in performance or exclude patients lacking one or more omics modalities. To address these limitations, we introduce MultiGEOmics, an intermediate-level graph integration framework that explicitly incorporates regulatory signals across omics types during graph representation learning and models biologically inspired omics-specific and cross-omics dependencies. MultiGEOmics learns robust cross-omics embeddings that remain reliable even when some modalities are partially missing. Results: We evaluated MultiGEOmics across eleven datasets spanning cancer and Alzheimer's disease, under zero, moderate, and high missing-rate scenarios. MultiGEOmics consistently maintains strong predictive performance across all missing-data conditions while offering interpretability by identifying the most influential omics types and features for each prediction task.
bioinformatics2026-02-02v1Batch correction for large-scale mass spectrometry imaging experiments
Thomsen, A. A.; Jensen, O. N.AI Summary
- This study evaluates batch correction methods for MALDI mass spectrometry imaging experiments.
- ComBat was found to reduce batch-related technical variance, preserve biological variation, and enhance the overall score by 19.4%.
Abstract
We assess batch correction methods for MALDI mass spectrometry imaging experiments. ComBat reduced batch-related technical variance, maintained biological variation, and improved the overall score by 19.4%.
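For orientation, the location-scale core of a ComBat-style adjustment looks like the following; real ComBat additionally shrinks the batch parameters with empirical Bayes and preserves biological covariates:

```python
# Simplified location-scale batch adjustment: align each batch's per-feature
# mean and variance to the pooled values. Core intuition only, not ComBat.
import numpy as np

def location_scale_correct(X: np.ndarray, batches: np.ndarray) -> np.ndarray:
    """X: samples x features intensity matrix; batches: per-sample labels."""
    out = np.empty_like(X, dtype=float)
    grand_mu, grand_sd = X.mean(axis=0), X.std(axis=0) + 1e-12
    for b in np.unique(batches):
        rows = batches == b
        mu, sd = X[rows].mean(axis=0), X[rows].std(axis=0) + 1e-12
        out[rows] = (X[rows] - mu) / sd * grand_sd + grand_mu
    return out

rng = np.random.default_rng(8)
# Three simulated batches of 20 samples with additive batch offsets.
X = rng.normal(size=(60, 100)) + np.repeat([0.0, 2.0, -1.0], 20)[:, None]
batches = np.repeat([0, 1, 2], 20)
Xc = location_scale_correct(X, batches)
print(np.round([X[batches == b].mean() for b in range(3)], 2),
      np.round([Xc[batches == b].mean() for b in range(3)], 2))
```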
bioinformatics2026-02-02v1Evaluating the applicability of kinship analyses for sedimentary ancient DNA datasets
Cohen, P.; Johnson, S.; Zavala, E. I.; Moorjani, P.; Slon, V.AI Summary
- This study evaluates the feasibility of kinship inference using sedimentary ancient DNA (sedaDNA), focusing on Neandertals, through extensive simulations.
- The main challenge identified was the presence of DNA from multiple individuals in samples, which complicates accurate kinship analysis.
- A heterozygosity-based test was developed to detect multi-individual DNA, and practical limits were assessed using Neandertal sedaDNA from the Galeria de las Estatuas site.
Abstract
Kinship reconstruction in ancient populations provides key insights into past social organization and evolutionary history. Sedimentary ancient DNA (sedaDNA) enables access to deep-time human populations in the absence of skeletal remains. However, it is characterized by severe degradation and the potential mixture of genetic material from multiple individuals, raising questions about its suitability for kinship inference. Here, we use extensive simulations to evaluate the feasibility and limitations of kinship inference in sparse and damaged sedaDNA data, with a focus on Neandertals. We find that the main obstacle to accurate kinship inference in sedaDNA is the presence of multiple contributors to a given sample. To address this, we introduce a simple heterozygosity-based test to identify samples containing DNA from multiple individuals. Guided by these results, we analyze published Neandertal sedaDNA from the Galeria de las Estatuas site to assess the practical limits of kinship inference in real sedimentary ancient DNA data. Together, our results define methodological considerations and practical limits for kinship inference in sedimentary ancient DNA.
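A bare-bones version of such a heterozygosity screen, with invented depths and thresholds (the paper calibrates its test by simulation), could look like this:

```python
# Sketch of a heterozygosity-style screen for multiple contributors: compare
# the fraction of sites showing both alleles against a single-individual
# expectation. Counts, depths, and genotype frequencies are invented.
import numpy as np

def apparent_heterozygosity(ref_counts, alt_counts, min_depth=4):
    depth = ref_counts + alt_counts
    ok = depth >= min_depth
    both = (ref_counts > 0) & (alt_counts > 0) & ok
    return both.sum() / max(ok.sum(), 1)

rng = np.random.default_rng(9)
n_sites = 20000
# Single contributor: each site is hom-ref, het, or hom-alt.
g = rng.choice([0.0, 0.5, 1.0], p=[0.7, 0.2, 0.1], size=n_sites)
depth = rng.poisson(6, n_sites)
alt = rng.binomial(depth, g)
print("single  :", round(apparent_heterozygosity(depth - alt, alt), 3))
# Two contributors mixed 50:50 push many sites toward intermediate ratios.
g2 = (g + rng.choice([0.0, 0.5, 1.0], p=[0.7, 0.2, 0.1], size=n_sites)) / 2
alt2 = rng.binomial(depth, g2)
print("mixture :", round(apparent_heterozygosity(depth - alt2, alt2), 3))
```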
bioinformatics2026-02-02v1DyGraphTrans: A temporal graph representation learning framework for modeling disease progression from Electronic Health Records
Rahman, M. T.; Al Olaimat, M.; Bozdag, S.; Alzheimer's Disease Neuroimaging Initiative,AI Summary
- DyGraphTrans is a framework that models disease progression using EHR data by representing it as temporal graphs, where nodes are patients, features are clinical attributes, and edges show patient similarity.
- It addresses high memory use and lack of interpretability in existing models by employing a sliding-window mechanism and capturing both local and global temporal trends.
- Evaluations on ADNI, NACC, and MIMIC-IV datasets showed DyGraphTrans had strong predictive performance and interpretability aligned with clinical risk factors.
Abstract
Motivation: Electronic Health Records (EHRs) contain vast amounts of longitudinal patient medical history data, making them highly informative for early disease prediction. Numerous computational methods have been developed to leverage EHR data; however, many process multiple patient records simultaneously, resulting in high memory consumption and computational cost. Moreover, these models often lack interpretability, limiting insight into the factors driving their predictions. Efficiently handling large-scale EHR data while maintaining predictive accuracy and interpretability therefore remains a critical challenge. To address this gap, we propose DyGraphTrans, a dynamic graph representation learning framework that represents patient EHR data as a sequence of temporal graphs. In this representation, nodes correspond to patients, node features encode temporal clinical attributes, and edges capture patient similarity. DyGraphTrans models both local temporal dependencies and long-range global trends, while a sliding-window mechanism reduces memory consumption without sacrificing essential temporal context. Unlike existing dynamic graph models, DyGraphTrans jointly captures patient similarity and temporal evolution in a memory-efficient and interpretable manner. Results: We evaluated DyGraphTrans on Alzheimer's Disease Neuroimaging Initiative (ADNI) and National Alzheimer's Coordinating Center (NACC) for disease progression prediction, as well as on the Medical Information Mart for Intensive Care (MIMIC-IV) dataset for early mortality prediction. We further assessed the model on multiple benchmark dynamic graph datasets to evaluate its generalizability. DyGraphTrans achieved strong predictive performance across diverse datasets. We also demonstrated interpretability of DyGraphTrans aligned with known clinical risk factors.
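The temporal-graph representation can be sketched directly: each sliding window of visits yields a k-nearest-neighbour patient-similarity graph. Window size, k, and the distance measure below are illustrative choices, not the paper's settings.

```python
# Sketch: build a sequence of patient-similarity graphs over sliding windows
# of simulated EHR measurements. All sizes and parameters are hypothetical.
import numpy as np

def knn_similarity_graph(feats: np.ndarray, k: int = 5) -> np.ndarray:
    d = np.linalg.norm(feats[:, None] - feats[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    adj = np.zeros_like(d)
    nn = np.argsort(d, axis=1)[:, :k]          # k most similar patients
    rows = np.repeat(np.arange(len(feats)), k)
    adj[rows, nn.ravel()] = 1.0
    return np.maximum(adj, adj.T)              # symmetrize the edge set

rng = np.random.default_rng(11)
ehr = rng.normal(size=(100, 12, 6))            # 100 patients x 12 visits x 6 vars
window = 3
graphs = [knn_similarity_graph(ehr[:, t:t + window].reshape(100, -1))
          for t in range(ehr.shape[1] - window + 1)]
print(len(graphs), graphs[0].shape)            # 10 temporal graphs, 100x100 each
```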
bioinformatics2026-02-02v1Bridging the gap between genome-wide association studies and network medicine with GNExT
Arend, L.; Woller, F.; Rehor, B.; Emmert, D.; Frasnelli, J.; Fuchsberger, C.; Blumenthal, D. B.; List, M.AI Summary
- GNExT is a web-based platform designed to integrate GWAS data into network medicine, enhancing the interpretation of genetic variants within biological systems.
- It incorporates tools like MAGMA and Drugst.One to explore genetic variants at a network level, identifying potential drug repurposing candidates.
- The platform was demonstrated using a GWAS meta-analysis of human olfactory identification, translating genetic signals into pharmacological targets.
Abstract
Motivation: A growing volume of large-scale genome-wide association study (GWAS) datasets offers unprecedented power to uncover the genetic determinants of complex traits, but existing web-based platforms for GWAS data exploration provide limited support for interpreting these findings within broader biological systems. Systems medicine is particularly well-suited to fill this gap, as its network-oriented view of molecular interactions enables the integration of genetic signals into coherent network modules, thereby opening opportunities for disease mechanism mining and drug repurposing. Results: We introduce GNExT (GWAS network exploration tool), a web-based platform that moves beyond the variant-level effect and significance exploration provided by existing solutions. By including MAGMA and Drugst.One, GNExT allows its users to study genetic variants on the network level down to the identification of potential drug repurposing candidates. Moreover, GNExT advances over the current state of the art by offering a highly standardized Nextflow pipeline for data import and preprocessing, allowing researchers to easily deploy their study results on a web interface. We demonstrate the utility of GNExT using a genome-wide association meta-analysis of human olfactory identification, in which the framework translated isolated GWAS signals to potential pharmacological targets in human olfaction. Availability and Implementation: The complete GNExT ecosystem, including the Nextflow preprocessing pipeline, the backend service, and frontend interface, is publicly available on GitHub (https://github.com/dyhealthnet/gnext_nf_pipeline, https://github.com/dyhealthnet/gnext_platform). The public instance of the GNExT platform on olfaction is available under http://olfaction.gnext.gm.eurac.edu.
bioinformatics2026-02-02v1PHoNUPS: Open-Source Software for Standardized Analysis and Visualization of Multi-Instrument Extracellular Vesicle Measurements
Melykuti, B.; Bustos-Quevedo, G.; Prinz, T.; Nazarenko, I.AI Summary
- PHoNUPS is open-source software developed in R to standardize the analysis and visualization of extracellular vesicle (EV) measurements from various instruments.
- It processes data to compute statistics and generate standardized histograms and contour plots for EV size and zeta potential, aiding in transparent reporting and cross-study comparisons.
- The software supports multiple file formats, produces publication-ready figures, and is designed for extensibility with community contributions.
Abstract
Accurate and transparent characterization of extracellular vesicle (EV) preparations is essential to ensure reproducibility, comparability, and adherence to MISEV reporting standards. However, data outputs from commonly used instruments for assessing EV size, concentration, and surface charge (zeta potential) vary widely in format and structure, complicating standardized analysis and integration across platforms. We present PHoNUPS (Plotting the Histogram of Non-Uniform Particles' Sizes), free and open-source software (FOSS) developed in R, that enables unified processing, analysis, and visualization of EV characterization data. PHoNUPS computes statistics and generates standardized histograms and contour plots (for size against zeta potential) suitable for transparent reporting and cross-study comparison. The software produces high-quality, publication-ready figures. Third-party graphical editing tools allow users to refine and annotate visualizations for presentation or manuscript preparation. PHoNUPS supports multiple measurement file formats, thereby facilitating dataset integration from different instruments. PHoNUPS was developed with extensibility at its core, providing a basis for user-driven growth. We invite the EV community - researchers, analysts, and tool developers - to use PHoNUPS, share feedback on their experience and needs, and contribute to the platform by integrating additional input data formats, analytical routines, and visualization functionalities.
bioinformatics2026-02-02v1scDiagnostics: systematic assessment of cell type annotation in single-cell transcriptomics data
Christidis, A.; Ghazi, A. R.; Chawla, S.; Turaga, N.; Gentleman, R.; Geistlinger, L.AI Summary
- The study addresses the challenge of assessing computational cell type annotations in single-cell transcriptomics by introducing scDiagnostics, a software package designed to detect complex or ambiguous annotations.
- scDiagnostics uses novel diagnostic methods compatible with major annotation tools and was tested on simulated and real-world datasets.
- The tool effectively identifies misleading annotations that could distort downstream analysis, enhancing the reliability of single-cell data interpretation.
Abstract
Although cell type annotation has become an integral part of single-cell analysis workflows, the assessment of computational annotations remains challenging. Many annotation tools transfer labels from an annotated reference dataset to a new query dataset of interest, but blindly transferring labels from one dataset to another has its own set of challenges. Often enough there is no perfect alignment between datasets, especially when transferring annotations from a healthy reference atlas for the discovery of disease states. We present scDiagnostics, a new open-source software package that facilitates the detection of complex or ambiguous annotation cases that may otherwise go unnoticed, thus addressing a critical unmet need in current single-cell analysis workflows. scDiagnostics is equipped with novel diagnostic methods that are compatible with all major cell type annotation tools. We demonstrate that scDiagnostics reliably detects complex or conflicting annotations using both carefully designed simulated datasets and diverse real-world single-cell datasets. Our evaluation demonstrates that scDiagnostics reliably identifies misleading annotations that systematically distort downstream analysis and interpretation and that would otherwise remain undetected. The scDiagnostics R package is available from Bioconductor (https://bioconductor.org/packages/scDiagnostics).
bioinformatics2026-02-02v1An Explainable Machine Learning Approach to study the positional significance of histone post-translational modifications in gene regulation
Ramachandran, S.; Ramakrishnan, N.AI Summary
- This study used XGBoost classifiers to analyze ChIP-seq data for 26 histone PTMs in yeast, focusing on their positional significance from -3 to 8 in genes.
- The approach predicted gene transcription rates and identified critical histone modifications and nucleosomal positions for gene expression using SHAP for explainability.
- Key findings highlighted the importance of specific histone modifications and their positions in yeast gene regulation, with potential for extension to other organisms.
Abstract
Epigenetic mechanisms regulate gene expression by altering the structure of the chromatin without modifying the underlying DNA sequence. Histone post-translational modifications (PTMs) are critical epigenetic signals that influence transcriptional activity, promoting or repressing gene expression. Understanding the impact of individual PTMs and their combinatorial effects is essential to deciphering gene regulatory mechanisms. In this study, we analyzed ChIP-seq data for 26 PTMs in yeast, examining the PTM intensities gene-wise from positions -3 to 8 in each gene. Using XGBoost classifiers, we predicted gene transcription rates and identified key histone modifications and nucleosomal positions that are critical in gene expression using explainability measures (such as SHAP). Our study provides a comprehensive insight into the histone modifications, their positions and their combinations that are most critical in gene regulation in yeast. The proposed explainable Machine Learning models can be easily extended to other model organisms to provide meaningful insights into gene regulation by epigenetic mechanisms.
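A schematic of this setup, assuming the xgboost and shap packages and a simulated genes-by-(PTM, position) matrix in place of the ChIP-seq data:

```python
# Sketch: XGBoost classifier on a genes x (PTM, nucleosome position) feature
# matrix, explained with SHAP. Data and the toy label rule are simulated;
# the study uses ChIP-seq intensities for 26 PTMs at positions -3..8.
import numpy as np
import xgboost as xgb
import shap

rng = np.random.default_rng(10)
n_genes, n_ptms, n_pos = 500, 26, 12            # positions -3..8
X = rng.normal(size=(n_genes, n_ptms * n_pos))  # flattened PTM-x-position grid
names = [f"PTM{p}_pos{i}" for p in range(n_ptms) for i in range(-3, 9)]
y = (X[:, 5] + 0.5 * X[:, 40] > 0).astype(int)  # toy "high transcription" label

model = xgb.XGBClassifier(n_estimators=200, max_depth=3,
                          eval_metric="logloss").fit(X, y)
expl = shap.TreeExplainer(model)
sv = expl.shap_values(X)                        # (genes, features) attributions
top = np.argsort(np.abs(sv).mean(axis=0))[::-1][:5]
print("most influential PTM-position features:", [names[i] for i in top])
```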
bioinformatics2026-02-02v1