Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
Delta-Marches: Generative AI-based image synthesis to decode disease-driving morphologic transformations.
Nguyen, T. H.; Panwar, V.; Jarmale, V.; Perny, A.; Dusek, C.; Cai, Q.; Kapur, P. H.; Danuser, G.; Rajaram, S.
AI Summary
- Delta-Marches uses generative AI to simulate morphological changes between disease classes, focusing on interpretability.
- Applied to renal carcinoma grading, it identifies key morphological features like tumor-cell nuclear phenotypes and reduced vasculature with increasing grade.
- This approach reduces variability and provides insights into disease mechanisms not captured by standard grading.
Abstract
Deep learning has revealed that tissue morphology contains rich biological information beyond human understanding. However, approaches to convert these spatially distributed signals into precise subcellular insights informing disease mechanism are lacking. We introduce Delta-Marches, an interpretability-first approach that nominates distinguishing morphological features rather than explaining existing models' decisions. Delta-Marches leverages a generative AI framework with latent-space traversals that simulate idealized morphological changes between classes. Comparing each image to its class-shifted counterpart allows downstream feature extractors to infer aspects most affected by the shift, reducing sample-to-sample variability and yielding interpretable morphological transformations at subcellular resolution. Prototyped in renal carcinoma histopathological grading, Delta-Marches generates realistic grade transitions and pinpoints tumor-cell nuclear phenotypes as key properties of tumor grades. It also reveals reduced vasculature associated with increasing grade, a pattern reported in studies but absent from standard grading rubrics. These results indicate Delta-Marches' ability to parse complex image phenotypes and catalyze hypothesis generation.
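The class-shift comparison at the heart of this approach can be sketched in a few lines. A minimal illustration, assuming a pretrained generative model: `encode`, `decode`, `extract_features`, and the inter-class latent direction are hypothetical stand-ins, not the authors' implementation.

```python
# Hypothetical sketch of a latent-space class shift: encode an image, move it
# along a learned inter-class direction, decode, and diff extracted features.
import numpy as np

def class_shift(z: np.ndarray, class_direction: np.ndarray,
                strength: float = 1.0) -> np.ndarray:
    """Traverse latent space toward the other class (idealized grade shift)."""
    return z + strength * class_direction

def feature_delta(encode, decode, extract_features, image,
                  class_direction: np.ndarray) -> np.ndarray:
    """Compare an image to its class-shifted counterpart in feature space."""
    z = encode(image)
    original = extract_features(decode(z))
    shifted = extract_features(decode(class_shift(z, class_direction)))
    return shifted - original  # per-feature morphological change for this image
```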
bioinformatics · 2026-02-09 · v4
Protein Language Models in Directed Evolution
Maguire, R.; Bloznelyte, K.; Adepoju, F.; Armean-Jones, M.; Dewan, S.; Goddard, S. E.; Gupta, A.; Jones, F. P.; Lalli, P.; Schooneveld, A.; Thompson, S.; Ebrahimi, E.; Fozzard, S.; Berman, D.; Rossoni, L.; Addison, W.; Taylor, I.
AI Summary
- The study investigates the use of zero-shot and few-shot protein language models to guide directed evolution for improving protein fitness, specifically PET degradation.
- Using a few-shot simulated annealing approach, the models recommended enzyme variants that achieved a 1.62x improvement in PET degradation over 72 hours, surpassing the literature's top engineered variant, which was 1.40x fitter than wild-type.
- In the second round, with 240 training examples and 32 homologous sequences, 39% of the 176 evaluated variants were fitter than the wild-type.
Abstract
The dominant paradigms for integrating machine learning into protein engineering are de novo protein design and guided directed evolution. Guiding directed evolution requires a model of protein fitness, but most models are only evaluated in silico on datasets comprising few mutations. Due to the limited number of mutations in these datasets, it is unclear how well these models can guide directed evolution efforts. We demonstrate in vitro how zero-shot and few-shot protein language models of fitness can be used to guide two rounds of directed evolution with simulated annealing. Our few-shot simulated annealing approach recommended enzyme variants with 1.62x improved PET degradation over a 72 h period, outperforming the top engineered variant from the literature, which was 1.40x fitter than wild-type. In the second round, 240 in vitro examples were used for training, 32 homologous sequences were used for evolutionary context, and 176 variants were evaluated for improved PET degradation, achieving a hit-rate of 39% of variants fitter than wild-type.
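For readers unfamiliar with the optimization loop, a minimal simulated-annealing sketch over protein variants follows; the `fitness` callable stands in for a few-shot language-model score, and the mutation scheme and cooling schedule are illustrative assumptions, not the authors' protocol.

```python
# Simulated annealing over sequence space, guided by an arbitrary fitness model.
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def mutate(seq: str) -> str:
    """Propose a single random substitution."""
    i = random.randrange(len(seq))
    return seq[:i] + random.choice(AMINO_ACIDS) + seq[i + 1:]

def anneal(seq: str, fitness, steps: int = 1000, t0: float = 1.0) -> str:
    best, best_f = seq, fitness(seq)
    cur, cur_f = best, best_f
    for step in range(steps):
        temp = t0 * (1.0 - step / steps) + 1e-6  # linear cooling (assumption)
        cand = mutate(cur)
        cand_f = fitness(cand)
        # Accept improvements always; worse moves with Boltzmann probability.
        if cand_f >= cur_f or random.random() < math.exp((cand_f - cur_f) / temp):
            cur, cur_f = cand, cand_f
            if cur_f > best_f:
                best, best_f = cur, cur_f
    return best
```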
bioinformatics · 2026-02-09 · v2
A high-fat hypertensive diet induces a coordinated perturbation signature across cell types in thoracic perivascular adipose tissue
Terrian, L.; Thompson, J. M.; Bowman, D. E.; Panda, V.; Contreras, G. A.; Rockwell, C. E.; Sather, L.; Fink, G. D.; Lauver, D. A.; Nault, R.; Watts, S. W.; Bhattacharya, S.
AI Summary
- This study used single nucleus RNA-sequencing to examine how a high-fat (HF) hypertensive diet affects gene expression in thoracic aortic perivascular adipose tissue (PVAT) of Dahl SS rats.
- The HF diet led to sex-specific changes in cell-type proportions and gene expression related to extracellular matrix dynamics, vascular integrity, and cell communication pathways.
- Analysis identified potential nuclear receptor targets for reversing these diet-induced changes, with deep learning models predicting a hypertensive disease signature across cell types.
Abstract
Perivascular adipose tissue (PVAT), an intriguing layer of fat surrounding blood vessels, regulates vascular tone and mediates vascular dysfunction through mechanisms that are not well understood. Here we show with single nucleus RNA-sequencing of thoracic aortic PVAT from Dahl SS rats that a high-fat (HF) hypertensive diet induces coordinated changes in gene expression across the diverse cell types within PVAT. HF diet produced sex-specific alterations in cell-type proportions and genes related to remodeling of extracellular matrix dynamics and vascular integrity and stiffness, as well as changes in cell-cell communication pathways involved in angiogenesis, vascular remodeling, and mechanotransduction. Gene regulatory network analysis with virtual transcription factor knockout in adipocytes identified specific nuclear receptors that could be targeted for suppression or potential reversal of HF diet-induced changes. Interestingly, generative deep learning models were able to predict cross-cell-type perturbations in gene expression, indicating a hypertensive disease signature that characterizes HF-diet-induced perturbations in PVAT.
bioinformatics · 2026-02-09 · v2
Unveiling the Terra Cognita of Sequence Spaces using Cartesian Projection of Asymmetric Distances
Ramette, A.
AI Summary
- CAPASYDIS is introduced as a method to visualize relationships in large biological sequence datasets by projecting sequences into a fixed, low-dimensional "seqverse" using asymmetric distances.
- Applied to rRNA sequences across Bacteria, Archaea, and Eukaryota, CAPASYDIS showed these domains occupy distinct spatial regions with unique variation patterns.
- The method allows instant mapping of new sequences and retains taxonomic information from broad to fine scales, providing a scalable framework for sequence analysis.
Abstract
Visualizing relationships within massive biological datasets remains a significant challenge, particularly as sequence length and volume increase. We introduce CAPASYDIS (Cartesian Projections of Asymmetric Distances), a scalable approach designed to map the explored regions of a given sequence space. Unlike traditional dimensionality reduction methods, CAPASYDIS calculates asymmetric distances which account for both the position and type of sequence variations. It projects sequences into a fixed, low-dimensional coordinate system, termed a "seqverse", where each sequence occupies a permanent location. This design allows for the instant mapping of new sequences without the need to recalculate the global structure, transforming sequence analysis from a relative comparison into navigation on a standardized map. We applied this method to a large rRNA sequence dataset spanning the three domains of life. Our results demonstrate that the sequences of Bacteria, Archaea, and Eukaryota occupy spatially distinct regions characterized by fundamentally different shapes and patterns of variation. Furthermore, the resulting seqverses retain a high amount of taxonomic information when analyzed from broad domain levels down to single-base differences. Overall, CAPASYDIS provides a reproducible, scalable framework for defining the boundaries and topography of biological sequence universes.
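The "permanent location" property can be illustrated with a landmark-style projection, where each coordinate is a distance to a fixed anchor sequence, so a new sequence maps instantly without refitting the global structure. The position- and base-weighted score below is a hypothetical stand-in for the actual CAPASYDIS distance.

```python
# Landmark-style projection sketch: coordinates are distances to fixed anchors.
def asymmetric_distance(query: str, reference: str) -> float:
    n = max(len(query), len(reference))
    score = 0.0
    for i in range(min(len(query), len(reference))):
        if query[i] != reference[i]:
            # Weight by position and by the query's base identity, so that
            # d(query, ref) != d(ref, query) in general (illustrative asymmetry).
            score += ((i + 1) / n) * (1 + "ACGU".find(query[i])) / 4
    score += abs(len(query) - len(reference))  # length-difference penalty
    return score

def project(query: str, anchors: list[str]) -> tuple[float, ...]:
    """Fixed low-dimensional coordinates: one axis per anchor sequence."""
    return tuple(asymmetric_distance(query, a) for a in anchors)

anchors = ["ACGUACGU", "UUUUCCCC", "GGGGAAAA"]  # hypothetical fixed anchors
print(project("ACGUACGA", anchors))
```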
bioinformatics · 2026-02-09 · v2
Target-site Dynamics and Alternative Polyadenylation Explain a Large Share of Apparent MicroRNA Differential Expression
Cihan, M.; More, P.; Sprang, M.; Marini, F.; Andrade, M.
AI Summary
- The study introduces MIRNAPEX, a machine learning framework that integrates target-gene expression and 3'UTR isoform usage to assess miRNA regulatory activity from RNA-seq data.
- Using pan-cancer datasets, MIRNAPEX showed that alternative polyadenylation (APA) significantly enhances prediction of miRNA differential expression beyond gene expression alone.
- Findings indicate that changes in miRNA abundance can result from APA-driven alterations in target-site availability, rather than changes in miRNA transcription, highlighting the importance of considering APA in miRNA expression analysis.
Abstract
MicroRNA (miRNA) abundance reflects a dynamic balance between biogenesis, target engagement and decay, yet differential expression (DE) analyses typically ignore changes in target-site availability driven by alternative polyadenylation (APA). We introduce MIRNAPEX, an interpretable expression-stratification-based machine learning framework that quantifies the effect size of miRNA regulatory activity from RNA-seq by integrating target-gene expression with 3'UTR isoform usage to infer binding-site dosage. Using pan-cancer training sets, we fit regularized linear models to learn robust relationships between transcriptomic features and miRNA log-fold changes, with APA patterns adding clear predictive power beyond expression alone. When applied to knockdowns of core APA regulators, MIRNAPEX captured widespread 3'UTR shortening and correctly anticipated distinct, miRNA-specific shifts whose direction and magnitude mirrored the APA-driven change in site availability. Analysis of target-directed miRNA degradation interactions further showed that loss of distal decay-trigger sites coincides with higher miRNA abundance, consistent with a reduced degradation rate. Together these findings reveal that apparent DE of miRNAs can arise from dynamic changes in target-site landscapes rather than altered miRNA transcription, and that ignoring this aspect in conventional analysis workflows can lead to misestimation of the true effect size of gene-expression regulation.
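A toy version of the regression core is easy to state: predict miRNA log-fold change from concatenated target-expression and APA features with a regularized linear model. The random stand-in data and ridge penalty below are illustrative, not the MIRNAPEX configuration.

```python
# Regularized linear model linking transcriptomic features to miRNA log-fold change.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X_expr = rng.normal(size=(200, 30))   # target-gene expression features (stand-in)
X_apa = rng.normal(size=(200, 10))    # 3'UTR isoform-usage (APA) features (stand-in)
X = np.hstack([X_expr, X_apa])
y = rng.normal(size=200)              # observed miRNA log-fold changes (stand-in)

model = Ridge(alpha=1.0).fit(X, y)
# Inspect how much weight the fitted model places on the APA block.
apa_weight = np.abs(model.coef_[30:]).mean()
print(apa_weight)
```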
bioinformatics · 2026-02-09 · v2
COMPASS: A Web-Based COMPosite Activity Scoring System to Navigate Health and Disease Through Deterministic Digital Biomarkers
Sinha, S.; Ghosh, P.
AI Summary
- COMPASS is a web-based system that quantifies pathway activation by extracting gene-specific activation thresholds from expression data, standardizing deviations, and aggregating into composite activity scores.
- It allows users to upload expression matrices, define gene sets, and generate activity plots and ROC-AUC statistics for comparisons.
- Across various datasets, COMPASS provides stable, interpretable digital biomarkers that assess model system relevance, differentiation, immune states, and therapeutic responses.
Abstract
Quantifying pathway activation in absolute, reproducible terms is central to systems biology and precision medicine. COMPASS (COMPosite Activity Scoring System) provides a deterministic, ontology-free framework that extracts gene-specific activation thresholds from expression data, standardizes deviation from these boundaries, and aggregates direction-encoded genes into per-sample composite activity scores. Implemented as an intuitive web application, COMPASS enables users without programming experience to upload expression matrices, define custom gene sets, and instantly generate activity plots and ROC-AUC statistics for biological or clinical comparisons. Across diverse datasets, COMPASS yields stable, interpretable, and transferable digital biomarkers that benchmark the "humanness" of model systems, quantify differentiation or immune states, and track therapeutic response trajectories. By directly linking expression, threshold, deviation, and directionality, COMPASS replaces permutation-based enrichment with closed-form logic, delivering a transparent, mechanistic, and reproducible quantification system for pathway activity.
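The scoring logic described above (threshold, standardize, direction-encode, aggregate) reduces to a few array operations. In the sketch below, the median threshold and MAD scaling are assumptions for illustration; COMPASS's exact threshold extraction may differ.

```python
# Composite activity scoring sketch: per-gene thresholds, standardized
# deviations, and a direction-aware mean per sample.
import numpy as np

def compass_like_scores(expr: np.ndarray, directions: np.ndarray) -> np.ndarray:
    """expr: genes x samples matrix; directions: +1 (up) / -1 (down) per gene."""
    thresholds = np.median(expr, axis=1, keepdims=True)            # per-gene boundary
    mad = np.median(np.abs(expr - thresholds), axis=1, keepdims=True) + 1e-9
    deviations = (expr - thresholds) / mad                         # standardized deviation
    signed = deviations * directions[:, None]                      # encode direction
    return signed.mean(axis=0)                                     # per-sample composite score

expr = np.random.default_rng(0).normal(5, 2, size=(10, 6))
directions = np.array([1, 1, -1, 1, -1, 1, 1, -1, 1, 1])
print(compass_like_scores(expr, directions))
```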
bioinformatics · 2026-02-09 · v2
LFQ Benchmark Dataset - Generation Beta: Assessing Modern Proteomics Instruments and Acquisition Workflows with High-Throughput LC Gradients
Van Puyvelde, B. R.; Devreese, R.; Chiva, C.; Sabido, E.; Pfammatter, S.; Panse, C.; Rijal, J. B.; Keller, C.; Batruch, I.; Pribil, P.; Vincendet, J.-B.; Fontaine, F.; Lefever, L.; Magalhaes, P.; Deforce, D.; Nanni, P.; Ghesquiere, B.; Perez-Riverol, Y.; Martens, L.; Carapito, C.; Bouwmeester, R.; Dhaenens, M.
AI Summary
- This study extends a previous benchmark dataset to evaluate modern LC-MS platforms for high-throughput proteomics, using a hybrid human-yeast-E. coli proteome with short LC gradients (5 and 15 min).
- Data was collected across four platforms with standardized protocols, focusing on sensitivity, reproducibility, and cross-instrument consistency.
- Key findings include insights into how technological advancements and reduced LC gradients impact proteome depth, quantitative precision, and support for algorithm development and standardization in proteomics.
Abstract
Recent advances in liquid chromatography mass spectrometry (LC-MS) have accelerated the adoption of high-throughput workflows that deliver deep proteome coverage using minimal sample amounts. This trend is largely driven by clinical and single-cell proteomics, where sensitivity and reproducibility are essential. Here, we extend our previous benchmark dataset (PXD028735) using next-generation LC-MS platforms optimized for rapid proteome analysis. We generated an extensive DDA/DIA dataset using a human-yeast-E. coli hybrid proteome. The proteome sample was distributed across multiple laboratories together with standardized analytical protocols specifying two short LC gradients (5 and 15 min) and low sample input amounts. This dataset includes data acquired on four different platforms, and features new scanning quadrupole-based implementations, extending coverage across different instruments and acquisition strategies. Our comprehensive evaluation highlights how technological advances and reduced LC gradients may affect proteome depth, quantitative precision, and cross-instrument consistency. The release of this benchmark dataset via ProteomeXchange (PXD070049 and PXD071205) allows for the acceleration of cross-platform algorithm development, enhances data mining strategies, and supports the standardization of short-gradient, high-throughput LC-MS-based proteomics.
bioinformatics · 2026-02-09 · v2
Enumerating the chemical exposome using in-silico transformation analysis: an example using insecticides
Jothiramajayam, M.; Barupal, D. K.
AI Summary
- This study uses an integrated workflow of RXNMapper, Rxn-INSIGHT, and RDChiral to enumerate transformation products of insecticides in-silico.
- From 181 insecticide structures, 19,392 unique transformation products were generated using over 80,000 reaction templates from PubChem.
- Products were prioritized based on thermodynamic stability, species association, enzyme information, and ADMET properties, enhancing exposomic knowledgebases.
Abstract
The exposome encompasses a vast chemical space that can originate from the consumer industry and environmental sources. Once these chemicals enter cells (human or of other organisms), they can also be transformed into products that differ in terms of toxicity and health effects. Recent developments in machine learning methods and chemical data science resources have enabled the in-silico enumeration of transformation products. Here, we report an integrated workflow of these existing resources (RXNMapper, Rxn-INSIGHT and RDChiral) to enumerate the transformation products of a chemical. We have generated a large library of reaction templates from > 80,000 reactions sourced from the PubChem database. The utility of the reaction screening and transformation enumeration workflow has been demonstrated for insecticide structures (n=181), yielding 19,392 unique transformation products. Filtering and ranking by thermodynamic stability, species association, enzyme information and ADMET properties can prioritize the products relevant for different contexts. Many of these products have PubChem entries but have not yet been linked with the parent compounds. The presented approach can be helpful in enumerating the chemical space relevant to the exposome using known reaction chemistry, which may ultimately contribute to expanding exposomic knowledgebases.
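The template-application step underlying such workflows can be reproduced with RDKit alone; RXNMapper, Rxn-INSIGHT, and RDChiral layer atom mapping, template extraction, and stereochemistry handling on top of this idea. The ester-hydrolysis SMARTS template and the toy parent molecule below are illustrative.

```python
# Template-based transformation enumeration with RDKit.
from rdkit import Chem
from rdkit.Chem import AllChem

template = "[C:1](=[O:2])[O:3][C:4]>>[C:1](=[O:2])[O:3].O[C:4]"  # ester hydrolysis
rxn = AllChem.ReactionFromSmarts(template)

parent = Chem.MolFromSmiles("CC(=O)OC")          # methyl acetate as a toy parent
products = set()
for product_set in rxn.RunReactants((parent,)):
    for mol in product_set:
        Chem.SanitizeMol(mol)
        products.add(Chem.MolToSmiles(mol))

print(products)  # {'CC(=O)O', 'CO'}: acid + alcohol transformation products
```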
bioinformatics · 2026-02-09 · v2
LoReMINE: Long Read-based Microbial genome mining pipeline
Agrawal, A. A.; Bader, C. D.; Garcia, R.; Mueller, R.; Kalinina, O. V.
AI Summary
- The study introduces LoReMINE, a pipeline for microbial genome mining that automates the process from long-read sequencing data to predicting and clustering biosynthetic gene clusters (BGCs).
- LoReMINE integrates various tools to provide a scalable, reproducible workflow for natural product discovery, addressing the limitations of existing methods that require manual curation.
Abstract
Microbial natural products represent a chemically diverse repertoire of small molecules with major pharmaceutical potential. Despite the increasing availability of microbial genome sequences, large-scale natural product discovery remains challenging because the existing genome mining approaches lack integrated workflows for rapid dereplication of known compounds and prioritization of novel candidates, forcing researchers to rely on multiple tools that require extensive manual curation and expert intervention at each step. To address these limitations, we introduce LoReMINE (Long Read-based Microbial genome mining pipeline), a fully automated end-to-end pipeline that generates high-quality assemblies, performs taxonomic classification, predicts biosynthetic gene clusters (BGCs) responsible for the biosynthesis of natural products, and clusters them into gene cluster families (GCFs) directly from long-read sequencing data. By integrating state-of-the-art tools into a seamless pipeline, LoReMINE enables scalable, reproducible, and comprehensive genome mining across diverse microbial taxa. The pipeline is openly available at https://github.com/kalininalab/LoReMINE and can be installed via Conda (https://anaconda.org/kalininalab/loremine), facilitating broad adoption by the natural product research community.
bioinformatics · 2026-02-09 · v2
Protenix-v1: Toward High-Accuracy Open-Source Biomolecular Structure Prediction
Xiao, W.; Zhang, Y.; Gong, C.; Zhang, H.; Ma, W.; Liu, Z.; Chen, X.; Guan, J.; Wang, L.
AI Summary
- Protenix-v1 (PX-v1) is introduced as an open-source biomolecular structure prediction model that outperforms AlphaFold3 with the same constraints, enhancing prediction quality with increased sampling.
- It includes features like protein template integration and RNA MSA support, with a variant, Protenix-v1-20250630, trained on a larger dataset for better accuracy.
- The study also addresses benchmarking limitations by providing updated tools and year-stratified benchmarks for more reliable assessments.
Abstract
We introduce Protenix-v1 (PX-v1), the first open-source structure prediction model to attain superior performance to AlphaFold3 while strictly adhering to the same training data cutoff, model size, and inference budget. Beyond standard evaluations, we highlight the effectiveness of inference-time scaling behavior, demonstrating that increasing the sampling budget yields consistent improvements in prediction quality--a behavior previously seen in AlphaFold3 but not in other open-source models. In addition to improved accuracy, Protenix-v1 incorporates key capabilities including protein template integration and RNA MSA support. Furthermore, to better support real-world applications such as drug discovery, we additionally release Protenix-v1-20250630, a variant trained on a larger dataset (cutoff: June 30, 2025), delivering further improved prediction accuracy. Finally, we identify the limitations of current benchmarking tools and we provide updated evaluation tools and year-stratified benchmarks to facilitate more reliable and transparent assessment within the community. Collectively, these contributions provide a robust foundation for the Protenix series and the broader field.
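The inference-time scaling behavior amounts to best-of-N sampling; a minimal sketch, where `predict_structure` and its returned confidence score are hypothetical placeholders for the model's sampling interface:

```python
# Best-of-N sampling: spend more inference budget, keep the best candidate.
def best_of_n(predict_structure, sequence: str, n_samples: int) -> dict:
    """predict_structure(seq) -> {'structure': ..., 'confidence': float} (assumed)."""
    candidates = [predict_structure(sequence) for _ in range(n_samples)]
    # A larger sampling budget can only improve the best candidate found.
    return max(candidates, key=lambda c: c["confidence"])
```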
bioinformatics · 2026-02-09 · v1
Exercise-conditioned tear fluid suppresses myopia progression
Yao, H.; Liang, M.; Fei, Q.; Cao, J.; Liang, T.; Zhou, X.; Zhang, S.; Cui, Q.
AI Summary
- Researchers hypothesized that tear fluid post-exercise (TA) could protect against myopia, unlike pre-exercise tear fluid (TB), based on reverse transcriptomic analysis.
- In a guinea pig model of form deprivation myopia, TA was administered via periocular injection, significantly reducing myopic progression by limiting refractive shifts, vitreous chamber depth, and axial length elongation.
- TB did not show any protective effects, highlighting a potential new therapeutic approach involving exercise-induced changes in tear fluid for myopia management.
Abstract
Myopia represents a major global public health challenge with rapidly rising prevalence. It is thus important to explore novel therapeutics for the treatment of myopia. Here, using reverse transcriptomic analysis, we predicted that tear fluid collected after (TA), but not before (TB), moderate-intensity aerobic exercise might protect against myopia. To experimentally validate this hypothesis, TA or TB was administered by periocular injection in a guinea pig model of form deprivation myopia. As a result, TA treatment significantly attenuated myopic refractive shifts and suppressed vitreous chamber depth and axial length elongation, whereas TB showed no protective effects. This study proposes a novel therapeutic avenue for myopia intervention and also suggests a previously unrecognized tear fluid-mediated mechanism linking exercise to myopia. URL: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE315463 Token: snspgwygltifrqx
bioinformatics · 2026-02-09 · v1
Fuzzifier*: Robust and Sensitive Multi-omics Data Analysis
Offensperger, F.; Pan, C.; Sinn, E.; Zimmer, R.
AI Summary
- Fuzzifier* is a pipeline for differential analysis of multi-omics data, allowing categorization through custom fuzzy concepts at any analysis step.
- It computes multiple analysis paths to identify both consensus and path-specific features, enhancing reliability and sensitivity.
- Applied to TCGA data, Fuzzifier* validated known cancer-specific miRNAs and identified new candidates, focusing on value distributions and foldchange from small sample sizes.
Abstract
Motivation: Categorization is an important means for interpreting data and drawing conclusions. Often, the derived categories provide evidence for diagnostic or even therapeutic approaches. The standard pipelines for differential analysis of multi-omic high-throughput data, and in particular single-cell data, yield (ranked) lists of possibly differential features after applying appropriate effect-size or significance thresholds to computed p-values and/or foldchanges. Results: We propose the Fuzzifier* pipeline for the differential analysis of any type of high-throughput data, either raw input data or foldchange data of groups of a (small or large) number of replicates. In Fuzzifier*, categorization can be applied to any step of the analysis pipeline according to custom-designed fuzzy concepts (Fuzzifier). Thus, any (fuzzified) analysis option corresponds to a path in a commutative diagram specifying the Fuzzifier* pipeline. Fuzzifier* computes a user-defined set of paths and presents an overview of the results, thereby identifying both highly reliable (consensus) and sensitive (path-specific) features. Fuzzifier* is a method that can be applied to any analysis pipeline to obtain different views on the data and yield more reliable results. This is demonstrated by the identification of context-specific miRNAs for individual cancer types from TCGA data. Fuzzifier* could both validate known cancer-specific miRNAs and identify novel candidates. In comparison to statistical tests, Fuzzifier* focuses on value distributions of tumor and normal samples as well as paired foldchange distributions and, thus, identifies condition-specific features from a relatively small number of replicates. Availability and Implementation: https://github.com/zimmerlab/fuzzifier Contact: offensperger@bio.ifi.lmu.de and zimmer@ifi.lmu.de
bioinformatics · 2026-02-09 · v1
Generalise or Memorise? Benchmarking Ligand-Conditioned Protein Generation from Sequence-Only Data
Vicente, A.; Dornfeld, L.; Coines, J.; Ferruz, N.
AI Summary
- The study investigates designing proteins that bind specific ligands using sequence-only data by framing it as a sequence-to-sequence translation problem with protein language models (pLMs).
- Models were trained on large datasets (>17M pairs) with varying parameter sizes, revealing a trade-off: fewer protein pairs per ligand result in more foldable but less diverse sequences, while more pairs increase diversity but reduce foldability.
- The research highlights dataset redundancy and incompleteness as critical challenges, providing datasets, models, and tools for further research.
Abstract
Proteins can bind small molecules with high specificity. However, designing proteins that bind user-defined ligands remains a challenge, typically relying on structural information and costly experimental iteration. While protein language models (pLMs) have shown promise for unconditional generation and conditioning on coarse functional labels, instance-level conditioning on a specific ligand has not been evaluated using purely textual inputs. Here we frame small-molecule protein binder design as a sequence-to-sequence translation problem and train ligand-conditioned pLMs that map molecular strings to candidate binder sequences. We curate large-scale ligand-protein datasets (>17M ligand-protein pairs) covering different data regimes and train a suite of models, spanning 16 to 700M parameters. Results reveal a consistent trade-off driven by supervision ambiguity: when each ligand is paired with few proteins, models generate near-neighbour, foldable sequences; when each ligand is paired with many proteins, generations are more diverse but less consistently foldable. Our study exposes how annotation diversity and sampling choices elicit this behaviour and how it changes with the data distribution. These insights highlight dataset redundancy and incompleteness as key bottlenecks for sequence-only binder design. We release the curated datasets, trained models, and evaluation tools to support future work on ligand-conditioned protein generation.
bioinformatics · 2026-02-09 · v1
From Structure to Dynamics: Activation Mechanism of the G Protein-Coupled Bile Acid Receptor 1-Gs Complex
Fiorillo, B.; Moraca, F.; Di Leva, F. S.; Sepe, V.; Fiorucci, S.; Limongelli, V.; Zampella, A.; Catalanotti, B.
AI Summary
- This study explored the activation mechanism of the GPBAR1-Gs complex by lithocholic acid (LCA) using homology modeling, molecular docking, and MD simulations.
- Findings indicate that LCA binding stabilizes the active state of GPBAR1, influencing TM5 and TM6 conformations and enhancing the coupling with Gs.
- The study provides insights into how LCA modulates GPBAR1 activation, aiding in the development of GPBAR1-targeted compounds.
Abstract
The G protein-coupled bile acid receptor 1 (GPBAR1, also known as TGR5) is a key mediator of bile acid signaling, exerting its physiological effects through coupling with the stimulatory G protein (Gs). This interaction is essential for stabilizing the receptor's active conformation and triggering downstream signaling. Among endogenous ligands, lithocholic acid (LCA) is the most potent natural agonist. However, the dynamic features underlying its binding and activation mechanisms remain poorly defined. In this study, we investigated the molecular basis of the interaction between LCA and GPBAR1, as well as the functional consequences of this interaction on receptor activation by integrating homology modelling, molecular docking, and molecular dynamics (MD) simulations. Our calculations reveal that LCA binding stabilizes the active state of GPBAR1, biasing the conformational ensemble of TM5 and TM6, as well as the main microswitches. These ligand-induced rearrangements enhance the coupling interface with the α5 helix of Gs and facilitate allosteric communication between the orthosteric and intracellular sites. Overall, our findings provide dynamic insight into how LCA modulates GPBAR1 activation and G protein engagement, highlighting its role as a molecular effector in bile acid signaling, and furnishing molecular detail relevant to ongoing efforts in GPBAR1-targeted compound development.
bioinformatics · 2026-02-09 · v1
HiCInterpolate: 4D Spatiotemporal Interpolation of Hi-C Data for Genome Architecture Analysis.
Chowdhury, H. M. A. M.; Oluwadare, O.
AI Summary
- HiCInterpolate was developed to interpolate intermediate Hi-C contact matrices between two timestamps, addressing the need for continuous genomic data in genome architecture analysis.
- It uses a deep learning approach with a flow predictor and U-Net-like architecture to predict high-resolution intermediate Hi-C maps.
- The tool supports analysis of 3D genomic features like A/B compartments and TADs, showing strong performance in metrics like PSNR and SSIM, and preserving key chromatin organization features.
Abstract
Motivation: Studying the three-dimensional (3D) structure of a genome, including chromatin loops and Topologically Associating Domains (TADs), is essential for understanding how the genome is organized and functions, for example in gene activation, cell development, and protein-protein interaction. The Hi-C protocol enables us to study 3D genome structure and organization. Chromatin 3D structure changes dynamically over time, and modeling these continuous changes is crucial for downstream analysis in domains such as disease diagnosis and vaccine development. The high expense and impracticality of continuous genome sequencing, particularly of what evolves between two timestamps, limit effective genomic analysis. To address these constraints, it is crucial to develop a straightforward and cost-efficient method for generating genomic data between two timestamps. Results: In this study, we developed HiCInterpolate, a 4D spatiotemporal interpolation architecture that accepts Hi-C contact matrices from two timestamps and interpolates intermediate Hi-C contact matrices at high resolution. HiCInterpolate predicts the intermediate Hi-C contact map using a deep learning-based flow predictor and a feature encoder-decoder architecture similar to U-Net. In addition, HiCInterpolate supports downstream analysis of multiple 3D genomic features, including A/B compartments, chromatin loops, TADs, and 3D genome structure, through an integrated analysis pipeline. Across multiple evaluation metrics, including PSNR, SSIM, GenomeDISCO, HiCRep, and LPIPS, HiCInterpolate achieved consistently strong performance. Biological validation further demonstrated preservation of key chromatin organization features, such as chromatin loops, A/B compartments, and TADs. Together, these results indicate that HiCInterpolate provides a robust computer vision-based framework for high-resolution interpolation of intermediate Hi-C contact matrices and facilitates biologically meaningful downstream analyses. Availability: HiCInterpolate is publicly available at https://github.com/OluwadareLab/HiCInterpolate.
bioinformatics · 2026-02-09 · v1
UniFacePoint-FM: A Foundation Model for Generalizable 3D Facial Representation Learning and Multi-Attribute Prediction
Li, D.; Fu, C.-H.; Tang, K.
AI Summary
- UniFacePoint-FM is a 3D facial foundation model using a self-supervised Point-MAE framework for learning from point clouds, addressing limitations of 2D and task-specific 3D models.
- Pretrained on a custom dataset, it was fine-tuned and evaluated on three datasets for tasks like gender classification, age regression, BMI prediction, and facial expression recognition.
- It achieves state-of-the-art performance in several tasks, showing high generalizability across different datasets and scanning platforms.
Abstract
The human face is a rich medium for biometric, behavioral, and clinical information. However, technologies based on 2D facial images lack critical geometric details and are susceptible to pose and illumination interference, while 3D facial deep learning frameworks are hindered by complex annotation, preprocessing, and task-specific designs with poor cross-domain generalization. To address these challenges, we propose UniFacePoint-FM, a 3D facial foundation model built on a self-supervised Point-MAE framework, tailored for high-fidelity point cloud representation learning. The model was pretrained on a self-constructed dataset of high-resolution 3D facial scans, followed by supervised fine-tuning and comprehensive evaluation across three independent datasets for diverse downstream tasks. Experimental results demonstrate that UniFacePoint-FM is both pretraining-efficient and highly generalizable: it achieves state-of-the-art performance on gender classification, age regression, and BMI prediction, and matches the accuracy of the ResMLP model (while outperforming other baselines) in facial expression recognition. Notably, by learning high-quality, fine-grained representations directly from raw point clouds, UniFacePoint-FM delivers robust generalization and transferability across tasks, datasets, and even different face scanning platforms. Overall, our work establishes an effective foundation model paradigm for 3D facial analysis, with promising implications for biometric security, health monitoring, and advanced human-computer interaction systems.
bioinformatics · 2026-02-09 · v1
SpliceRead: Improving Canonical and Non-Canonical Splice Site Prediction with Residual Blocks and Synthetic Data Augmentation
Thapa, S.; Samderiya, K.; Menon, R.; Oluwadare, O.
AI Summary
- SpliceRead uses residual convolutional blocks and synthetic data augmentation to improve the prediction of both canonical and non-canonical splice sites.
- Trained on a multi-species dataset, SpliceRead outperforms existing models in key metrics like F1-score, accuracy, precision, and recall, particularly reducing non-canonical misclassification rates.
- Evaluations confirmed SpliceRead's robustness through cross-validation, cross-species testing, and input-length generalization.
Abstract
Accurate splice site prediction is fundamental to understanding gene expression and its associated disorders. However, most existing models are biased toward frequent canonical sites, limiting their ability to detect rare but biologically important non-canonical variants. These models often rely heavily on large, imbalanced datasets that fail to capture the sequence diversity of non-canonical sites, leading to high false-negative rates. Here, we present SpliceRead, a novel deep learning model designed to improve the classification of both canonical and non-canonical splice sites using a combination of residual convolutional blocks and synthetic data augmentation. SpliceRead employs a data augmentation method to generate diverse non-canonical sequences and uses residual connections to enhance gradient flow and capture subtle genomic features. Trained and tested on a multi-species dataset of 400- and 600-nucleotide sequences, SpliceRead consistently outperforms state-of-the-art models across all key metrics, including F1-score, accuracy, precision, and recall. Notably, it achieves a substantially lower non-canonical misclassification rate than baseline methods. Extensive evaluations, including cross-validation, cross-species testing, and input-length generalization, confirm its robustness and adaptability. SpliceRead offers a powerful, generalizable framework for splice site prediction, particularly in challenging, low-frequency sequence scenarios, and paves the way for more accurate gene annotation in both model and non-model organisms. The open-sourced code of SpliceRead and detailed documentation are available at https://github.com/OluwadareLab/SpliceRead .
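A residual convolutional block of the general kind SpliceRead describes can be written compactly in PyTorch; the channel count and kernel size below are illustrative, not the published architecture (see the repository for that).

```python
# 1D residual convolutional block for one-hot-encoded sequence input.
import torch
import torch.nn as nn

class ResidualBlock1D(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 7):
        super().__init__()
        pad = kernel_size // 2
        self.body = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size, padding=pad),
            nn.BatchNorm1d(channels),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size, padding=pad),
            nn.BatchNorm1d(channels),
        )
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The skip connection keeps gradients flowing through deep stacks.
        return self.act(x + self.body(x))

x = torch.randn(8, 32, 400)  # batch x channels x sequence length
print(ResidualBlock1D(32)(x).shape)
```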
bioinformatics · 2026-02-09 · v1
A new cancer progression model: from synthetic tumors to real data and back
Volpatto, D.; Contaldo, S. G.; Pernice, S.; Beccuti, M.; Cordero, F.; Sirovich, R.
AI Summary
- The study introduces a stochastic model for tumor evolution that integrates genotypic inheritance, phenotype-driven traits, and resource competition to understand intratumor heterogeneity (ITH).
- The model uses a simulation algorithm with an open-source GUI for parameter configuration, allowing exploration of clonal dynamics and population changes.
- Findings suggest early tumor growth is stochastic, while later stages show selection for traits that mitigate environmental constraints, aligning with observed biological patterns.
Abstract
Intratumor heterogeneity (ITH) arises from the combined effects of genetic alterations, clonal interactions, and environmental constraints, and plays a central role in therapeutic resistance and disease progression. While ITH has been extensively documented in empirical tumor data, the scientific debate regarding the biological mechanisms underlying this heterogeneity remains complex, highlighting the need for cancer evolution models that are sufficiently flexible and sophisticated to reproduce the observed behaviors and to give insights on the unobserved ones. Here, we present a stochastic modelling framework for tumor evolution that integrates genotypic inheritance with phenotype-driven functional traits and resource-mediated competition. Mutational events are associated with functional capabilities such as altered proliferation, increased mutation rates, limit-evasion potential, or enhanced control over shared resources, allowing multiple genotypes to converge on similar phenotypes. The model explicitly tracks subclonal lineages while incorporating environmental constraints that modulate growth and competition. The framework is defined through a mathematically rigorous construction and is accompanied by an efficient simulation algorithm. To facilitate exploration and reproducibility, we provide an open-source graphical user interface that allows users to configure model parameters, run simulations, and inspect clonal genealogies and population dynamics without requiring direct interaction with the underlying code. Using this model, we illustrate how ecological feedbacks can shape clonal dynamics over time, supporting an interpretation in which early tumor growth is dominated by stochastic expansion, while later evolution increasingly reflects selection for traits that alleviate environmental constraints. Rather than constituting a new evolutionary paradigm, this behaviour demonstrates how well-documented biological patterns can emerge naturally from a unified stochastic and ecological description. Overall, our approach offers a flexible and extensible platform for investigating how chance, functional traits, and environmental interactions jointly govern tumor heterogeneity.
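The stochastic core of such models is a birth-death-mutation process; a bare-bones Gillespie sketch follows, with illustrative rates and without the phenotype and resource-competition machinery the framework adds.

```python
# Gillespie simulation of clonal birth/death/mutation dynamics.
import random

def gillespie_clones(t_max=10.0, birth=1.0, death=0.3, mu=0.01):
    clones = {0: 1}                  # clone id -> cell count
    t, next_id = 0.0, 1
    while t < t_max and clones:
        total = sum(clones.values())
        t += random.expovariate(total * (birth + death))   # time to next event
        # Pick a cell uniformly at random, then choose its event type.
        cid = random.choices(list(clones), weights=list(clones.values()))[0]
        if random.random() < birth / (birth + death):
            if random.random() < mu:                       # division with mutation
                clones[next_id] = 1                        # founds a new subclone
                next_id += 1
            else:
                clones[cid] += 1
        else:
            clones[cid] -= 1
            if clones[cid] == 0:
                del clones[cid]                            # clone extinction
    return clones

print(gillespie_clones())
```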
bioinformatics · 2026-02-09 · v1
seq2ribo: Structure-aware integration of machine learning and simulation to predict ribosome location profiles from RNA sequences
Kaynar, G.; Kingsford, C.
AI Summary
- seq2ribo integrates machine learning with a structure-aware simulation (sTASEP) to predict ribosome A-site locations from mRNA sequences.
- It outperforms existing methods by reducing transcript-level errors up to 35.8% and improving correlation with experimental data across various cell types.
- This approach enables de novo mRNA sequence design for applications like synthetic biology without requiring expression data or genomic context.
Abstract
Motivation: Ribosome dynamics are vital in the process of protein expression. Current methods rely on ribosome profiling (Ribo-seq), RNA-seq profiles, and full genomic context. This restricts their use in de novo sequence design, like messenger RNA (mRNA) vaccines. Simulation-only approaches like the Totally Asymmetric Simple Exclusion Process (TASEP) oversimplify translation by focusing solely on codon elongation times. Results: We present seq2ribo, a hybrid simulation and machine learning framework that predicts ribosome A-site locations using only an mRNA sequence as input. Our method first employs a novel structure-aware TASEP (sTASEP), which models translation using a comprehensive set of fitted parameters that include codon wait times and structural features, such as local angles, base-pairing, and discrete positional buckets. The ribosome locations generated by sTASEP are then processed by a polisher model, which learns to refine the simulated ribosome distributions. seq2ribo provides high-fidelity predictions of ribosome locations across diverse cell types (iPSC, HEK293, LCL, and RPE-1), significantly outperforming baselines. When benchmarked against sequence-only Translatomer, seq2ribo achieves reductions in transcript-level error up to 35.8%, while simultaneously attaining the highest Pearson and Spearman correlations in every cell line and reducing structural errors between 43.3% and 97.3%. By adding a task-specific head, seq2ribo achieves Spearman correlations up to 0.795 with experimental translation efficiency (TE) across several cell lines, and 0.689 with measured protein expression. By operating from sequence alone, seq2ribo provides a new tool for synthetic biology, enabling the rational design and optimization of mRNA sequences without the need for expression-level data or genomic context. Availability: seq2ribo is available at https://github.com/Kingsford-Group/seq2ribo. Contact: gkaynar@cs.cmu.edu, carlk@cs.cmu.edu. Supplementary information: Supplementary data are available.
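A plain TASEP with per-codon wait times, the baseline that sTASEP extends with structural features, fits in a short simulation; the footprint size, time step, and rates below are illustrative assumptions.

```python
# Discrete-time TASEP: ribosomes hop codon by codon, excluded by footprints.
import random

def tasep(wait_times, footprint=10, initiation=0.1, t_max=500.0, dt=0.05):
    n = len(wait_times)
    ribosomes, occupancy = [], [0.0] * n   # positions sorted most-advanced first
    t = 0.0
    while t < t_max:
        t += dt
        # Attempt initiation if the 5' end is clear of the last ribosome.
        if (not ribosomes or ribosomes[-1] >= footprint) and random.random() < initiation * dt:
            ribosomes.append(0)
        for i, pos in enumerate(ribosomes):
            blocked = i > 0 and ribosomes[i - 1] - pos <= footprint
            if not blocked and random.random() < dt / wait_times[pos]:
                ribosomes[i] += 1                          # hop one codon forward
        ribosomes = [p for p in ribosomes if p < n]        # terminate at stop
        for p in ribosomes:
            occupancy[p] += 1                              # accumulate A-site density
    return occupancy

waits = [random.uniform(1, 5) for _ in range(60)]          # per-codon wait times
print(max(tasep(waits)))
```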
bioinformatics · 2026-02-09 · v1
cspray: Distributed Single Cell Transcriptome Analysis
Hawkins, P. G.; Swanson, E. M.; Feichtel, M.
AI Summary
- The study introduces cspray, a distributed method for processing large-scale single cell RNA data, addressing computational throughput and scalability issues.
- cspray handles data ingestion, preprocessing, gene annotation, PCA, and clustering without needing per-file compute sizing.
- The method enables large-scale processing, allowing for LLM-based reference-free cluster annotation, facilitating the development of scalable single cell data discovery platforms.
Abstract
The size of individual single cell samples continues to grow with advancing technologies, as does the number of samples included in individual experiments and across organizations. This presents challenges for processing this data at scale, both in terms of computational throughput and the required size of the machines that must process this data. We present a single cell RNA processing method that is fully distributed, capable of processing arbitrarily large files, and numbers of files, without requiring per-file compute sizing. Our method, cspray, includes data ingestion, pre-processing, highly variable gene annotation, PCA, and clustering. We also show that this processing at scale permits LLM-based reference-free cluster annotation on low resolution clusters, which demonstrates these techniques can be used to build single cell data discovery platforms at scale.
bioinformatics · 2026-02-09 · v1
A ML-framework for the discovery of next-generation IBD targets using a harmonized single-cell atlas of patient tissue
Joglekar, A.; Joseph, A.; Honsa, P.; Ruppova, K.; Pizzarella, V.; Honan, A.; Mediratta, D.; Vollmer, E.; Geller, E.; Valny, M.; Macuchova, E.; Zheng, S.; Greenberg, A.; Taus, P.; Kline-Schoder, A.; Konickova, R.; Cerna, L.; Sharim, H.; Ness, L.; Camilli, G.; Chouri, E.; Kaymak, I.; D'Rozario, J.; Castiblanco, D.; Oliveira, J.; Prandi, F.; Popov, N.; Moldoveanu, A. L.; Oliphant, C.; Escudero-Ibarz, L.; Uhlitz, F.; Freinkman, E.; Sponarova, J.; Vijay, P.; Joyce, C.; Leonardi, I.; Nayar, S.; Platt, A.; Ort, T.; De Baets, G.; Corridoni, D.; Wroblewska, A.; Rahman, A.
AI Summary
- This study developed a machine learning framework (IPR) using a harmonized single-cell atlas of the human intestine to discover novel therapeutic targets for IBD.
- The framework identified 85 disease-associated transcriptional programs and prioritized 400 cell type-specific gene targets.
- Validation confirmed that targeting PTGIR in myeloid cells and IL6ST in fibroblasts reduced inflammatory and fibrotic pathways, suggesting new therapeutic avenues distinct from current treatments.
Abstract
Target discovery for IBD has traditionally relied on genetic associations, which lack the cellular resolution needed to identify novel, actionable, cell type-specific disease pathways. Here, we describe an integrated analytical and experimental framework that leverages harmonized single-cell data to systematically discover novel therapeutic strategies for IBD. We used AMICA DB™, Immunai's harmonized database of single-cell RNA datasets, to construct a harmonized atlas of 1 million single cells from the human intestine. We applied a machine learning framework (Immune Patient Representation, IPR) to identify disease-associated transcriptional programs and cell type-specific gene targets. Candidate targets were prioritized using atlas-derived metrics, refined using custom criteria emphasizing translational actionability, and validated across independent clinical cohorts. Select candidates were evaluated in human primary-cell models reflecting the target's cell-type context. The IPR framework identified 85 disease-associated transcriptional programs and ranked 400 cell type-specific target genes across immune and stromal lineages. Disease-associated programs were interpreted using an AI-assisted framework for structured biological reasoning, linking them to IBD-relevant pathways and guiding the identification of novel, promising gene targets. Functional validation of two cell-type-specific candidates, PTGIR in myeloid cells and IL6ST in fibroblasts, confirmed the reduction of inflammatory and fibrotic pathways linked to IBD pathology. Multi-omic profiling and projection of in vitro phenotypes to patient datasets demonstrated the reversal of disease-associated programs via mechanisms distinct from those of existing biologics. Our single-cell anchored, machine-learning framework integrates in silico discovery with experimental validation, revealing new cell type-specific therapeutic opportunities and supporting a scalable approach for precision target discovery in IBD and other immune-mediated diseases.
bioinformatics · 2026-02-09 · v1
DEPower: approximate power analysis with DESeq2
Gorin, G.; Guruge, D.; Goodman, L.
AI Summary
- The study addresses the need for power analysis in RNA-seq experiments by developing DEPower, a tool based on the DESeq2 framework.
- DEPower calculates the minimum sample size required for detecting effects in both single-cell and bulk RNA-seq experiments.
- It is accessible as a web-based tool at https://poweranalysis-fb.streamlit.app/, facilitating rigorous experimental design for researchers.
Abstract
Rigorous experimental design, including formal power analysis, is a cornerstone of reproducible RNA sequencing (RNA-seq) research. The design of RNA-seq experiments requires computing the minimum sample number required to identify an effect of a particular size at a predefined significance level. Ideally, the statistical test used for the analysis of experimental data should match the test used for sample size determination; however, few tools use the assumptions of the popular differential expression testing framework DESeq2, and most opt for simulation-based rather than analytical approaches. Grounded in the DESeq2 model framework, we derive sample size requirements for both single-cell and bulk RNA-seq experiments delivered as a web-based tool for power analysis, DEPower, available at https://poweranalysis-fb.streamlit.app/, that makes rigorous RNA-seq study design accessible to all researchers.
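For intuition, one widely used closed-form approximation for negative-binomial sample size (Hart et al., 2013) is shown below; it shares the modeling family of DESeq2, though DEPower's exact derivation may differ in detail.

```python
# Closed-form per-group sample size for a negative-binomial Wald test.
import math
from scipy.stats import norm

def nb_sample_size(mean_count, dispersion, fold_change,
                   alpha=0.05, power=0.8):
    """Per-group n to detect `fold_change` at two-sided significance `alpha`."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    effect = math.log(fold_change) ** 2
    # Variance of a log-count contrast under the NB model: 1/mu + dispersion.
    return math.ceil(2 * z**2 * (1 / mean_count + dispersion) / effect)

# e.g., mean count 100, dispersion 0.1, detect a 2-fold change:
print(nb_sample_size(100, 0.1, 2.0))   # per-group replicates
```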
bioinformatics · 2026-02-09 · v1
A shape-constrained regression and wild bootstrap framework for reproducible drug synergy testing
Asiaee, A.; Long, J. P.; Pal, S.; Pua, H. H.; Coombes, K. R.
AI Summary
- The study introduces a nonparametric framework for drug synergy testing using shape-constrained regression and wild bootstrap to address limitations of traditional heuristic scores.
- The method defines interaction as deviation from a monotone-additive model, using isotonic regression to fit surfaces and a wild bootstrap for statistical inference.
- It showed higher replicate concordance (median correlation 0.91) and lower failure rates compared to existing methods, also predicting missing data with a median RMSE of 0.040.
Abstract
High-throughput drug combination screens motivate computational methods to identify synergistic pairs, yet synergy is typically quantified by heuristic scores (Bliss, HSA, Loewe, ZIP) that provide no statistical inference and can be unstable or undefined when parametric dose-response fits fail. We present a nonparametric, assumption-light framework that defines interaction as the deviation from a monotone-additive null within a shared monotone model class. We fit a monotone surface by two-dimensional isotonic regression and a monotone-additive surface, compute an interaction surface, and summarize global interaction by a stable "interaction energy" statistic. A degrees-of-freedom-corrected wild bootstrap yields calibrated p-values for testing interaction in each dose-response matrix, enabling principled hit calling and multiple-testing control. On DrugCombDB, our method yields higher replicate concordance of interaction surfaces (median correlation 0.91 across 1,839 replicate pairs) than Bliss, HSA, Loewe, or ZIP (0.53-0.74), while avoiding the 20.9% Loewe and 3.6% ZIP failure rates. Because the fitted surface is generative, the method also predicts missing wells (median holdout RMSE 0.040 in viability units). By turning synergy scoring into statistically grounded outcomes (effect sizes with uncertainty), the framework provides more reliable targets for downstream machine learning models of combination response.
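A simplified version of the test is sketched below: the monotone-additive null is fit by isotonic backfitting, and Rademacher-weighted residuals calibrate the interaction-energy statistic. The degrees-of-freedom correction and the full two-dimensional isotonic fit are omitted here for brevity.

```python
# Monotone-additive null fit plus wild-bootstrap calibration (simplified).
import numpy as np
from sklearn.isotonic import IsotonicRegression

def additive_monotone_fit(Y, iters=20):
    """Fit Y[i, j] ~ g1(i) + g2(j) with g1, g2 monotone, by backfitting."""
    iso = IsotonicRegression(increasing=True)
    g1, g2 = np.zeros(Y.shape[0]), np.zeros(Y.shape[1])
    rows, cols = np.arange(Y.shape[0]), np.arange(Y.shape[1])
    for _ in range(iters):
        g1 = iso.fit_transform(rows, (Y - g2[None, :]).mean(axis=1))
        g2 = iso.fit_transform(cols, (Y - g1[:, None]).mean(axis=0))
    return g1[:, None] + g2[None, :]

def interaction_test(Y, n_boot=199, seed=0):
    rng = np.random.default_rng(seed)
    null = additive_monotone_fit(Y)
    resid = Y - null
    energy = np.mean(resid**2)                    # "interaction energy" statistic
    boots = np.empty(n_boot)
    for b in range(n_boot):
        # Rademacher multipliers regenerate data under the additive null.
        Yb = null + resid * rng.choice([-1.0, 1.0], size=Y.shape)
        boots[b] = np.mean((Yb - additive_monotone_fit(Yb))**2)
    return energy, (1 + np.sum(boots >= energy)) / (n_boot + 1)

Y = np.add.outer(np.sort(np.random.rand(6)), np.sort(np.random.rand(8)))
print(interaction_test(Y + 0.01 * np.random.randn(6, 8)))  # additive: large p expected
```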
bioinformatics · 2026-02-09 · v1
Batch Effect Correction in a Functional Colorectal Cancer Organoid Clinical Correlation Study
Oliver, G. R.; de Jesus Domingues, A.; Barnett, C. C.
AI Summary
- The study focused on detecting, characterizing, and correcting batch effects in a retrospective clinical colorectal cancer organoid drug-response study.
- Methods included exploratory diagnostics, experimental drift detection, and statistical adjustments to remove technical artifacts while preserving biological signals.
- The findings highlight the necessity of addressing batch effects to ensure data reproducibility and accurate interpretation in organoid research.
Abstract
Batch effects are recognized as major sources of technical confounding in high-throughput assays. However, their impact on organoid studies receives little attention in the literature. As organoids gain prominence as a class of emerging new approach methodologies (NAMs), consideration of batch variation will become increasingly important to ensure data reproducibility and accurate interpretation in pre-clinical and clinical studies. In this manuscript, we provide a practical description of our work in detecting, characterizing, and correcting batch effects in a previously published retrospective clinical colorectal cancer organoid drug-response study. We outline the workflow we employed, including exploratory diagnostics, experimental drift detection, and statistical adjustment. We detail the methods employed to evaluate batch effects, monitor longitudinal drift, and select approaches to remove technical artifacts, preserve biological signal, and test for robustness. Our experience demonstrates that, even in modestly sized studies, results can be adversely affected by insufficient consideration and amelioration of batch effects. By documenting the challenges we encountered and the solutions implemented within our study, we hope to provide a seminal practical reference for organoid researchers and enable increased discussion and adoption of robust batch-compensation practices in the organoid field, ensuring that the topic is more routinely addressed, improved, and eventually standardized.
bioinformatics · 2026-02-09 · v1
Predicting Obstetric and Non-obstetric Diagnoses Co-occurrences during Pregnancy
Singh, A.; Infante, S.; Kim, S.; Kabir, A.
AI Summary
- This study models the co-occurrence of obstetric and non-obstetric diagnoses during pregnancy using a network-based approach, treating it as a link prediction problem on a diagnosis-level graph.
- Various graph neural network (GNN) architectures were tested, with GraphSAGE and hybrid models (GCN+GraphSAGE, GAT+GraphSAGE) showing the best performance.
- The GCN+GraphSAGE hybrid model achieved an AUROC and AUPRC of approximately 0.90, revealing clinically plausible associations between pregnancy stages and related diagnoses.
Abstract
Pregnancy care often involves simultaneous obstetric and other medical conditions, but their co-occurrence patterns are rarely modeled explicitly in a systematic, network-based approach. In this work, we formulate obstetric and non-obstetric diagnosis co-occurrence as a link prediction problem on a diagnosis-level homogeneous graph constructed from pregnancy encounters. Diagnoses are represented as nodes connected by co-occurrence edges, with node features capturing graph structure and demographic statistics. We address this challenge by leveraging collected electronic health records data and study several standalone and hybrid graph neural network (GNN) architectures, including GCN, GAT, GraphSAGE, and three hybrid encoders that combine complementary aggregation mechanisms, namely GCN+GraphSAGE, GCN+GAT, and GAT+GraphSAGE. All models used consistent train-validation-test splits and were evaluated on 5-fold cross-validation sets. Among standalone models, GraphSAGE achieved the strongest performance, whereas hybrid GraphSAGE-based models (GCN+GraphSAGE and GAT+GraphSAGE) were the best performers. The GCN+GraphSAGE hybrid, reaching an AUROC and AUPRC of approximately 0.90, consistently outperformed all other architectures. Further analysis of top-ranked predicted links revealed clinically plausible associations between pregnancy stage and risk-related diagnoses and common endocrine, metabolic, and hematological conditions. These findings indicate that graph-based link prediction may effectively prioritize obstetric and non-obstetric diagnosis pairs, providing a scalable framework for identifying clinically meaningful comorbidity patterns. They may further support hypothesis generation and downstream obstetric risk stratification efforts. Availability: All code, including data preparation scripts, training and validation recipes, and experimental configurations, is available at: https://github.com/kabir-ai2bio-lab/ob-nonob-diagnoses-cooccurrences
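A compact GraphSAGE link-prediction encoder/decoder of the kind benchmarked above can be written with PyTorch Geometric; the dimensions, random toy graph, and dot-product decoder are illustrative, not the authors' exact configuration.

```python
# Two-layer GraphSAGE encoder with a dot-product link decoder.
import torch
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv

class SAGELinkPredictor(torch.nn.Module):
    def __init__(self, in_dim, hidden=64):
        super().__init__()
        self.conv1 = SAGEConv(in_dim, hidden)
        self.conv2 = SAGEConv(hidden, hidden)

    def encode(self, x, edge_index):
        h = F.relu(self.conv1(x, edge_index))
        return self.conv2(h, edge_index)

    def decode(self, z, pairs):
        # Score a candidate diagnosis pair by embedding dot product.
        return (z[pairs[0]] * z[pairs[1]]).sum(dim=-1)

x = torch.randn(100, 16)                      # 100 diagnosis nodes, 16 features
edge_index = torch.randint(0, 100, (2, 400))  # observed co-occurrence edges (toy)
model = SAGELinkPredictor(16)
z = model.encode(x, edge_index)
print(torch.sigmoid(model.decode(z, torch.tensor([[0, 1], [2, 3]]))))
```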
bioinformatics · 2026-02-09 · v1
TM-Vec 2: Accelerated Protein Homology Detection for Structural Similarity
Keluskar, A.; Batra, P.; Bezshapkin, V.; Morton, J. T.; Zhu, Q.
AI Summary
- The study addresses the challenge of structural homology detection in protein sequences by introducing TM-Vec 2 and TM-Vec 2s, which optimize the computationally intensive embedding step.
- These models were benchmarked using CATH and SCOPe domains, showing TM-Vec 2s provides speedups of up to 258x over TM-Vec and 56x over Foldseek, with improved accuracy.
Abstract
Understanding protein function is an essential aspect of many biological applications. The exponential growth of protein sequence databases has created a critical bottleneck for structural homology detection. While billions of protein sequences have been identified from sequencing data, the number of protein folds underlying biology is surprisingly limited, likely numbering tens of thousands. The "sequence-fold gap" limits the success of functional annotation methods that rely on sequence homology, especially for newly sequenced, divergent microbial genomes. TM-Vec is a deep learning architecture that can predict TM-scores as a metric of structural similarity directly from sequence pairs, bypassing true structural alignment. However, the computational demands of its protein language model (PLM) embeddings create a significant bottleneck for large-scale database searches. In this work, we present two innovations: TM-Vec 2, a new architecture that optimizes the computationally heavy sequence embedding step, and TM-Vec 2s, a highly efficient model created by distilling the knowledge of the TM-Vec 2 model. Our new models were benchmarked for both accuracy and speed using the CATH and SCOPe domains for large-scale database queries. We compare them to state-of-the-art models and observe that TM-Vec 2s achieves speedups of up to 258x over the original TM-Vec and 56x over Foldseek for large-scale database queries, while achieving higher accuracy compared to the original TM-Vec model.
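The distillation step that produces a small, fast model from a large one has a generic form: the student regresses onto the frozen teacher's embeddings. The placeholder MLPs and random features below are assumptions for illustration, not the TM-Vec architectures.

```python
# Embedding distillation: a small student matches a frozen teacher's outputs.
import torch
import torch.nn as nn

teacher = nn.Sequential(nn.Linear(1024, 512), nn.ReLU(), nn.Linear(512, 256)).eval()
student = nn.Sequential(nn.Linear(1024, 128), nn.ReLU(), nn.Linear(128, 256))
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

for step in range(100):
    batch = torch.randn(32, 1024)          # stand-in for PLM sequence features
    with torch.no_grad():
        target = teacher(batch)            # teacher embedding (frozen)
    loss = nn.functional.mse_loss(student(batch), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
```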
bioinformatics2026-02-09v1MetaKnogic-Alpha: A Hyper-Relational Knowledge Base for Grounded Metabolic Reasoning
Dang, P.; Swaminathan, P.; Guo, T.; Wan, C.; Cao, S.; Zhang, C.AI Summary
- MetaKnogic-Alpha addresses the synthesis gap in metabolic research by transforming over 100K full-text articles into a hyper-relational hypergraph structure for grounded metabolic reasoning.
- It uses a hierarchical discovery protocol with an autonomous reasoning agent to enhance query precision and explore metabolic pathways, ensuring biological accuracy by grounding insights against a metabolic reaction network.
- Benchmarking showed a mechanistic accuracy of 0.98, significantly reducing errors and providing traceability to original literature, thus aiding in rapid discovery of metabolic interactions for precision oncology.
Abstract
The exponential trajectory of biomedical literature has precipitated a fundamental "synthesis gap" in metabolic research, where critical mechanistic insights remain fragmented across hundreds of thousands of disjointed full-text articles, preventing the consolidation of a global mechanistic view. Here, we present MetaKnogic-Alpha, a foundational mechanistic knowledge substrate designed to bridge this gap by transforming unstructured literature into a navigable, logic-based resource. MetaKnogic-Alpha synthesizes over 100K full-text articles into a hyper-relational hypergraph structure, preserving the n-ary relational logic inherent in complex metabolic pathways. To ensure biological rigor, we implemented a hierarchical discovery protocol: an autonomous reasoning agent first enriches query nomenclature for domain-specific precision, followed by a multi-hop topological expansion within the hypergraph to surface functional neighbors, such as enzymatic co-factors and distal regulators, often lost in traditional search paradigms. Crucially, the system subjects all literature-derived insights to a deterministic biochemical grounding against a curated metabolic reaction network, significantly mitigating the risk of probabilistic hallucinations common in standalone generative models. In rigorous benchmarking, MetaKnogic-Alpha achieved a mechanistic accuracy of 0.98 in scenarios where supporting evidence was present, providing a robustly attributable audit trail back to the primary literature via PubMed Central Identifiers. We designate this primary release as "alpha" to establish the foundational architectural logic for a burgeoning million-scale resource. By compressing the synthesis of thousands of papers from a multi-month manual effort into several hours of automated discovery, MetaKnogic-Alpha serves as a high-fidelity research companion that augments the human expert's ability to resolve complex metabolic interactions and identify novel therapeutic drivers in precision oncology.
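The multi-hop topological expansion step is straightforward to illustrate on a toy hyper-relational store. The sketch below (plain Python; the entities and relations are hypothetical, not drawn from MetaKnogic-Alpha) does a breadth-first walk over hyperedges that share an entity with the frontier, which is how n-ary relations surface functional neighbours such as co-factors:

```python
from collections import deque

# Toy hyper-relational store: each hyperedge links n entities (an n-ary relation).
hyperedges = [
    {"relation": "catalyzes", "members": {"HK2", "glucose", "G6P", "ATP"}},
    {"relation": "inhibits",  "members": {"G6P", "HK2"}},
    {"relation": "catalyzes", "members": {"G6PD", "G6P", "NADPH"}},
]

def expand(seed_entities, hops=2):
    """Multi-hop topological expansion: follow hyperedges sharing any entity."""
    frontier = deque((e, 0) for e in seed_entities)
    seen = set(seed_entities)
    while frontier:
        entity, depth = frontier.popleft()
        if depth == hops:
            continue
        for he in hyperedges:
            if entity in he["members"]:
                for other in he["members"] - seen:
                    seen.add(other)
                    frontier.append((other, depth + 1))
    return seen

print(expand({"glucose"}, hops=2))   # surfaces co-factors and distal neighbours
```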
bioinformatics2026-02-09v1Order of Message and Address Domain Engagement Determines Productive β-Endorphin Binding to the μ-Opioid Receptor
Ciemny, M. P.; Kmiecik, S.AI Summary
- The study investigated how the order of message and address domain engagement affects β-endorphin binding to the μ-opioid receptor (μOR) using 1,000 CABS-dock simulations.
- Message-first binding was more common but less successful (5.0%), while address-first binding, though less frequent, was 3.8 times more likely to achieve native-like binding (18.8%, p < 0.001).
- Results suggest that early engagement of the address domain enhances productive binding to μOR.
Abstract
Understanding β-endorphin binding to the μ-opioid receptor (μOR) is crucial for designing safer analgesics. The peptide comprises a message domain mediating activation and an address domain conferring selectivity. Using 1,000 independent CABS-dock simulations, without prior binding-site knowledge, we analysed binding trajectories to compare alternative binding pathways. Message-first binding is most frequently sampled but rarely reaches native-like structures (5.0%). In contrast, address-first binding occurs less often yet shows a 3.8-fold higher success rate (18.8%, p < 0.001). These results refine the message-address model and suggest that early address-domain engagement promotes productive μOR binding.
bioinformatics2026-02-09v1Near perfect identification of half sibling versus niece/nephew avuncular pairs without pedigree information or genotyped relatives
Sapin, E.; Kelly, K.; Keller, M. C.AI Summary
- The study addresses the challenge of distinguishing half-siblings from niece/nephew-avuncular pairs in large genomic biobanks without pedigree information.
- A novel method using across-chromosome phasing and haplotype-level sharing features was developed, achieving over 98% classification accuracy.
- This approach also enhances long-range phasing by providing structural constraints for homologue assignment.
Abstract
Motivation: Large-scale genomic biobanks contain thousands of second-degree relatives with missing pedigree metadata. Accurately distinguishing half-sibling (HS) from niece/nephew-avuncular (N/A) pairs--both sharing approximately 25% of the genome--remains a significant challenge. Current SNP-based methods rely on Identical-By-Descent (IBD) segment counts and age differences, but substantial distributional overlap leads to high misclassification rates. There is a critical need for a scalable, genotype-only method that can resolve these "half-degree" ambiguities without requiring observed pedigrees or extensive relative information. Results: We present a novel computational framework that achieves near-complete separation of HS and N/A pairs using only genotype data. Our approach utilizes across-chromosome phasing to derive haplotype-level sharing features that summarize how IBD is distributed across parental homologues. By modeling these features with a Gaussian mixture model (GMM), we demonstrate near-perfect classification accuracy (> 98%) in biobank-scale data. Furthermore, we show that these high-confidence relationship labels can serve as long-range phasing anchors, providing structural constraints that improve the accuracy of across-chromosome homologue assignment. This method provides a robust, scalable solution for pedigree reconstruction and the control of cryptic relatedness in large-scale genomic studies.
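The classification step the abstract describes, fitting a Gaussian mixture to haplotype-level sharing features, can be sketched in a few lines. Below is a toy version with simulated two-dimensional features; the feature definitions, locations, and separations are invented for illustration (the paper derives its features from across-chromosome phasing):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)

# Hypothetical 2-D features per pair, e.g. (fraction of IBD concentrated on one
# parental homologue, across-chromosome sharing consistency). HS and N/A pairs
# are simulated as two Gaussian clouds purely for illustration.
hs  = rng.normal(loc=[0.95, 0.80], scale=0.04, size=(500, 2))
n_a = rng.normal(loc=[0.55, 0.50], scale=0.08, size=(500, 2))
X = np.vstack([hs, n_a])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
labels = gmm.fit_predict(X)            # unsupervised: no pedigree labels used

# Agreement with the simulation truth, up to component label switching.
truth = np.array([0] * 500 + [1] * 500)
acc = max(np.mean(labels == truth), np.mean(labels == 1 - truth))
print(f"classification accuracy on toy data: {acc:.3f}")
```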
bioinformatics2026-02-08v4Proteins as Statistical Languages: Information-Theoretic Signatures of Proteomes Across the Tree of Life
Alegre, E. O. T.AI Summary
- This study explores proteins as statistical languages by analyzing informational descriptors like composition entropy, mutual information, and separation-dependent information across 20 UniProt reference proteomes.
- Using bootstrap resampling and synthetic controls, real proteomes showed dependencies beyond local transition statistics, differing from composition-matched i.i.d. and Markov-1 models.
- Findings suggest proteomes function as constrained statistical languages, with gzip compressibility providing an additional signature of redundancy.
Abstract
Protein sequences are commonly interpreted through biochemical and evolutionary lenses, emphasizing structure-function relationships and selection in sequence space. Here we develop a complementary viewpoint: proteins as statistical languages, strings over a finite alphabet generated by constrained stochastic processes. We formalize intrinsic informational descriptors of protein ensembles, including composition entropy H1, adjacent mutual information I1, and separation-dependent information profiles Id. A null-model ladder (uniform, composition-matched i.i.d., and Markov-1) separates compositional effects from genuine positional dependence. We then evaluate these descriptors empirically across 20 UniProt reference proteomes spanning major clades, using protein-level bootstrap resampling and matched synthetic controls. Real proteomes consistently depart from composition-matched i.i.d. baselines and exhibit information profiles that remain elevated beyond the decay expected under first-order Markov surrogates, indicating dependencies beyond local transition statistics. Finally, a compressibility proxy (gzip) provides an orthogonal signature of redundancy relative to i.i.d. controls at matched composition. Together, these results support the view of proteomes as constrained statistical languages and provide model-agnostic fingerprints for comparing sequence ensembles. These signatures provide a lightweight diagnostic layer for comparing proteomes prior to mechanistic modeling.
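The named descriptors all have short plug-in estimators. A minimal sketch using only the standard library (the example sequence is arbitrary, and the gzip ratio is only meaningful on proteome-scale input, where compression-header overhead is negligible):

```python
import gzip
import math
from collections import Counter

def h1(seq):
    """Composition entropy H1 (bits per residue)."""
    counts, n = Counter(seq), len(seq)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def info_at_separation(seq, d):
    """Mutual information I_d between residues d positions apart (I_1 = adjacent MI)."""
    pairs = Counter(zip(seq, seq[d:]))
    n = len(seq) - d
    left, right = Counter(seq[:-d]), Counter(seq[d:])
    return sum(c / n * math.log2((c / n) / ((left[a] / n) * (right[b] / n)))
               for (a, b), c in pairs.items())

def gzip_ratio(seq):
    """Compressibility proxy: compressed size / raw size (lower = more redundant)."""
    raw = seq.encode()
    return len(gzip.compress(raw)) / len(raw)

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQ"
print(h1(seq), info_at_separation(seq, 1), gzip_ratio(seq))
```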
bioinformatics2026-02-08v3DGAT: A Dual-Graph Attention Network for Inferring Spatial Protein Landscapes from Transcriptomics
Wang, H.; Cody, B. A.; Saavedra, M.; Faccioli, L.; Florentino, R. M.; Soto-Gutierrez, A.; Osmanbeyoglu, H. U.AI Summary
- DGAT is a deep learning framework that infers spatial protein expression from transcriptomics-only spatial transcriptomics (ST) data by learning RNA-protein relationships.
- It uses dual-graph attention networks to integrate transcriptomic, proteomic, and spatial data, with task-specific decoders for mRNA reconstruction and protein prediction.
- DGAT outperforms existing methods in protein imputation accuracy and reveals new insights into cell states, immune phenotypes, and tissue architectures from ST data lacking protein measurements.
Abstract
Spatial transcriptomics (ST) technologies provide genome-wide transcriptomic profiles in tissue context but lack direct protein-level measurements, which are critical for interpreting cellular function and microenvironmental organization. We present DGAT (Dual-Graph Attention Network), a deep learning framework that imputes spatial protein expression from transcriptomics-only ST data by learning RNA-protein relationships from spatially resolved transcriptomic and proteomic datasets. The model constructs heterogeneous graphs integrating transcriptomic, proteomic, and spatial information, encoded using graph attention networks. Task-specific decoders reconstruct mRNA and predict protein abundance from a shared latent representation. Benchmarking across public and in-house datasets demonstrates that DGAT outperforms existing methods in protein imputation accuracy. Applied to ST datasets lacking protein measurements, the framework reveals spatially distinct cell states, immune phenotypes, and tissue architectures not evident from transcriptomics alone. Here, we show that this framework accurately reconstructs spatial protein landscapes, reveals biologically meaningful tissue organization, and enables protein-level interpretation from transcriptomics-only spatial data.
bioinformatics2026-02-08v2Cell type-specific functions of nucleic acid-binding proteins revealed by deep learning on co-expression networks
Osato, N.; Sato, K.AI Summary
- This study uses a deep learning framework to infer the regulatory influence of nucleic acid-binding proteins (NABPs) across different cellular contexts by analyzing gene co-expression networks.
- The approach improved gene expression prediction accuracy and showed strong agreement with ChIP-seq and eCLIP data.
- The analysis revealed cell type-specific regulatory programs, such as cancer pathways in K562 cells and differentiation in neural progenitor cells, highlighting the context-dependent roles of NABPs.
Abstract
Nucleic acid-binding proteins (NABPs) play central roles in gene regulation, yet their functional targets and regulatory programs remain incompletely characterized due to the limited scope and context specificity of experimental binding assays. Here, we present a deep learning framework that integrates gene co-expression-derived interactions with contribution-based model interpretation to infer NABP regulatory influence across diverse cellular contexts, without relying on predefined binding motifs or direct binding evidence. Replacing low-informative binding-based features with co-expression-derived interactions significantly improved gene expression prediction accuracy. Model-inferred regulatory targets showed strong and reproducible concordance with independent ChIP-seq and eCLIP datasets, exceeding random expectations across multiple genomic regions and threshold definitions. Functional enrichment and gene set enrichment analyses revealed coherent, cell type-specific regulatory programs, including cancer-associated pathways in K562 cells and differentiation-related processes in neural progenitor cells. Notably, we demonstrate that DeepLIFT-derived contribution scores capture relative regulatory importance in a background-dependent but biologically robust manner, enabling systematic identification of context-dependent NABP regulatory roles. Together, this framework provides a scalable strategy for functional annotation of NABPs and highlights the utility of combining expression-driven inference with interpretable deep learning to dissect gene regulatory architectures at scale.
bioinformatics2026-02-07v10NanoDel: Identification of large-scale mitochondrial DNA deletions using long-read sequencing
Fearn, C.; Poulton, J.; Fratter, C.; Oliva, C.; Griguer, C.; Baldock, R.; Robson, S.; McGeehan, R.AI Summary
- NanoDel is a long-read sequencing pipeline developed to detect large-scale mitochondrial DNA deletions (LSMDs) with enhanced sensitivity and accuracy.
- It outperformed other pipelines on artificial datasets and identified known and novel LSMDs in mitochondrial disease samples without prior information.
- Analysis showed LSMD breakpoints near repeat and G-quadruplex motifs, suggesting a shared vulnerability in mtDNA across various tissues.
Abstract
Motivation: Traditional methods for detecting large-scale mitochondrial DNA (mtDNA) deletions (LSMDs) in cells present challenges: they can require a priori information and high DNA inputs, have poor sensitivity, and are not always quantitative. Mitigation can be achieved through high-throughput DNA sequencing, e.g. Illumina and Oxford Nanopore Technologies (ONT), in combination with LSMD breakpoint identification and quantification using bioinformatic tools. Splice-aware RNA alignment tools increase the sensitivity for detecting LSMD breakpoints compared with DNA aligners. Long-read sequencing (LRS) also offers potential advantages over short-read sequencing, e.g. greater read lengths and capturing variants on single reads. No existing pipelines capture the benefits of both a splice-aware alignment tool and LRS. Results: We developed NanoDel, an LRS pipeline, to sensitively and accurately detect cellular LSMDs. Using artificial datasets, NanoDel was more sensitive and accurate than other pipelines. In samples diagnosed with mitochondrial disease, it identified both known and previously uncharacterised LSMDs (including mixtures), without a priori information. LSMD breakpoints were found in the mt-co1, mt-cyb, mt-nd6 and mt-nd5 genes. Analysis of selected LSMDs revealed proximity to repeat and putative G-quadruplex motifs, and occurrence in a range of healthy and pathological tissues, indicating potential for a shared vulnerability landscape in mtDNA, shaped by sequence motifs and structural constraints. NanoDel combined with one-amplicon, not two-amplicon, LR-PCR offers a robust strategy with clinical application for detecting LSMDs across a variety of cell/tissue samples, and its application across a broader range of samples will yield new mechanistic insights into LSMD formation and further our understanding of mtDNA instability. Availability and implementation: NanoDel is available at https://github.com/uopbioinformatics/NanoDel and raw read data are available through the NCBI Sequence Read Archive (SRA) under BioProject accession code PRJNA1369153 (https://www.ncbi.nlm.nih.gov/bioproject/1369153).
bioinformatics2026-02-07v3PlotGDP: an AI Agent for Bioinformatics Plotting
Luo, X.; Shi, Y.; Huang, H.; Wang, H.; Cao, W.; Zuo, Z.; Zhao, Q.; Zheng, Y.; Xie, Y.; Jiang, S.; Ren, J.AI Summary
- PlotGDP is an AI agent-based web server for creating bioinformatics plots, designed to simplify the process for users without coding skills.
- It uses large language models (LLMs) to generate plots from user-uploaded data via natural language commands on a remote server.
- The platform employs curated template scripts to reduce errors from LLMs, aiming to enhance bioinformatics visualization for research publications.
Abstract
High-quality bioinformatics plotting is important for biology research, especially when preparing for publications. However, the long learning curve and complex coding environment configuration often appear as inevitable costs towards the creation of publication-ready plots. Here, we present PlotGDP at https://plotgdp.biogdp.com/, an AI agent-based web server for bioinformatics plotting. Built on large language models (LLMs), the intelligent plotting agent is designed to accommodate various types of bioinformatics plots, while offering easy usage with simple natural language commands from users. No coding experience or environment deployment is required, since all the user-uploaded data is processed by LLM-generated codes on our remote high-performance server. Additionally, all plotting sessions are based on curated template scripts to minimize the risk of hallucinations from the LLM. Aided by PlotGDP, we hope to contribute to the global biology research community by constructing an online platform for fast and high-quality bioinformatics visualization.
bioinformatics2026-02-07v3Single-cell disentangled representations for perturbation modeling and treatment effect estimation
Sun, J.; Stojanov, P.; Zhang, K.AI Summary
- The study introduces scDRP, a generative framework using disentangled representation learning to separate perturbation-dependent and independent variables in single-cell data.
- scDRP employs conditional optimal transport to estimate individualized treatment effects (ITEs) and infer counterfactual states, with guarantees of asymptotic correctness.
- Applied to both simulated and real data, scDRP accurately estimates treatment effects, revealing cell type-specific responses to various perturbations like rhinovirus, cigarette smoke, interferon, and CRISPR knockouts in leukemia cells.
Abstract
Dissecting cell-state-specific changes in gene regulation induced by perturbations is crucial for understanding biological mechanisms. However, single-cell sequencing provides only unmatched snapshots of cells under different conditions. This destructive measurement process hinders the estimation of individualized treatment effects (ITEs), which are essential for pinpointing these heterogeneous mechanistic responses. We develop scDRP, a generative framework that leverages disentangled representation learning with asymptotic correctness guarantees to separate perturbation-dependent and perturbation-independent latent variables via a sparsity regularized $\beta$-VAE. Assuming quantile-preserving effects of perturbations conditional on confounders, scDRP performs conditional optimal transport in the latent space to infer counterfactual states and estimate ITEs. Applied to simulated and real single-cell perturbation data, scDRP accurately estimates treatment effects and individual counterfactual responses, revealing cell type-specific functional gene module dynamics. Specifically, it captures distinct cellular patterns under rhinovirus and cigarette-smoke extract exposures, reveals heterogeneous responses to interferon stimulation across diverse immune cell types, and identifies distinct functional module activation in chronic myeloid leukemia cells following CRISPR knockouts targeting different genes. scDRP also generalizes to unseen perturbation doses and combinations. Our framework provides a principled computational approach to extracting heterogeneous causal relationships from single-cell perturbation data, enabling a deeper understanding of cellular and molecular mechanisms.
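Under the quantile-preserving assumption stated above, the one-dimensional optimal transport map is simply the monotone rearrangement: a control cell is sent to the same quantile of the perturbed distribution, and the individualized effect is the difference. A toy 1-D sketch with simulated values (scDRP applies this conditionally in a learned latent space, not per gene as here):

```python
import numpy as np

rng = np.random.default_rng(3)

control = rng.gamma(shape=2.0, scale=1.0, size=2000)       # stand-in expression values
treated = rng.gamma(shape=2.0, scale=1.0, size=2000) + 1.5 # shifted by perturbation

def counterfactual(x, source, target):
    """Monotone (quantile-preserving) 1-D OT map: F_target^{-1}(F_source(x))."""
    q = np.searchsorted(np.sort(source), x) / len(source)  # empirical CDF rank of x
    return np.quantile(target, np.clip(q, 0.0, 1.0))

x = control[:5]
x_cf = counterfactual(x, control, treated)
print("ITE estimates:", x_cf - x)   # individualized effect = counterfactual - observed
```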
bioinformatics2026-02-07v2CLEAR-HPV: Interpretable Concept Discovery for HPV-Associated Morphology in Whole-Slide Histology
Qin, W.; Liu-Swetz, Y.; Tan, S.; Wang, H.AI Summary
- The study introduces CLEAR-HPV, a framework to enhance morphologic interpretability in HPV-related histopathology by restructuring the latent space of multiple instance learning (MIL) models using attention.
- CLEAR-HPV automatically identifies key morphologic concepts like keratinizing, basaloid, and stromal, and reduces the feature space to 10 interpretable concepts while maintaining predictive accuracy.
- The framework was validated across TCGA-HNSCC, TCGA-CESC, and CPTAC-HNSCC datasets, showing consistent performance and providing concept-level interpretability.
Abstract
Human papillomavirus (HPV) status is a critical determinant of prognosis and treatment response in head and neck and cervical cancers. Although attention-based multiple instance learning (MIL) achieves strong slide-level prediction for HPV-related whole-slide histopathology, it provides limited morphologic interpretability. To address this limitation, we introduce Concept-Level Explainable Attention-guided Representation for HPV (CLEAR-HPV), a framework that restructures the MIL latent space using attention to enable concept discovery without requiring concept labels during training. Operating in an attention-weighted latent space, CLEAR-HPV automatically discovers keratinizing, basaloid, and stromal morphologic concepts, generates spatial concept maps, and represents each slide using a compact concept-fraction vector. CLEAR-HPV's concept-fraction vectors preserve the predictive information of the original MIL embeddings while reducing the high-dimensional feature space (e.g., 1536 dimensions) to only 10 interpretable concepts. CLEAR-HPV generalizes consistently across TCGA-HNSCC, TCGA-CESC, and CPTAC-HNSCC, providing compact, concept-level interpretability through a general, backbone-agnostic framework for attention-based MIL models of whole-slide histopathology.
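The concept-discovery step reduces to weighted clustering in the attention-weighted latent space, followed by per-slide concept occupancy. A hedged sketch with random stand-in patch embeddings and attention weights (K=10 matches the paper's concept count; everything else is invented and this is not the authors' pipeline):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)

# Stand-ins: per-patch embeddings and MIL attention weights for three slides.
slides = [rng.normal(size=(300, 64)) for _ in range(3)]
attn   = [rng.dirichlet(np.ones(300)) for _ in range(3)]

# Discover K concepts by clustering patches, weighting each by its attention,
# so decision-relevant morphology dominates the concept space.
X, w, K = np.vstack(slides), np.concatenate(attn), 10
km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X, sample_weight=w)

def concept_fractions(patches, weights):
    """Compact slide representation: attention-weighted concept occupancy."""
    c = km.predict(patches)
    return np.bincount(c, weights=weights, minlength=K) / weights.sum()

for patches, weights in zip(slides, attn):
    print(np.round(concept_fractions(patches, weights), 3))
```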
bioinformatics2026-02-07v2Visual-like 2D Geometric Template Diffusion for Boosting Single-Sequence Protein Structure Prediction
Wang, X.; Zhang, T.; Cui, Z.; Guo, X.; Wang, F.; Wang, Y.; Cai, X.; Zheng, W.AI Summary
- The study introduces TDFold, a method using visual-like 2D geometric template diffusion for single-sequence protein structure prediction, adapting stable diffusion from text-vision to sequence-geometry.
- TDFold outperforms existing models like ESMFold and AlphaFold2 on datasets with limited homology, showing superior performance on CASP benchmarks.
- It offers lower GPU memory usage and significantly faster training and inference times, making it efficient for resource-limited settings.
Abstract
Single-sequence protein structure prediction has drawn increasing attention due to the high computational costs associated with obtaining homologous information. Here, we propose a visual-like 2D geometric template diffusion method, named TDFold, to generate high-quality pairwise geometries (including pairwise distances and orientations) for achieving accurate and highly efficient single-sequence 3D structure prediction for proteins. Given a protein sequence, TDFold initially generates high-quality inter-residue geometries from a probabilistic diffusion perspective. Since inter-residue geometries can be encoded as multi-channel feature matrices, analogous to image feature maps, we construct an image-level 2D geometric template diffusion module by adapting the stable diffusion (SD) model from text-vision generation to sequence-geometry diffusion for proteins. Subsequently, a lightweight sequence-geometry collaborative learning (SCL) network is constructed to facilitate accurate and efficient protein structure prediction. As a result, TDFold possesses three highlights: (i) better single-sequence prediction performance: TDFold greatly outperforms existing protein language models (PLMs, e.g. ESMFold and OmegaFold) and homology-based methods (e.g. AlphaFold2, AlphaFold3 and RoseTTAFold) on homology-insufficient datasets such as Orphan and Orphan25, while also achieving promising results on the popular CASP14, CASP15 and CASP16 benchmarks; (ii) low resource consumption: By utilizing the lightweight SCL architecture, the GPU memory consumption of TDFold is generally lower than that of popular methods such as AlphaFold2 and ESMFold; (iii) higher efficiency in training and inference: TDFold can be trained within a week using a single NVIDIA 4090 GPU. Furthermore, the inference time of TDFold is significantly shorter (about 10x to 100x) than that of existing methods (ESMFold, AlphaFold2 and AlphaFold3) for long protein sequences. This work demonstrates the effectiveness of leveraging powerful vision diffusion models to enhance protein 2D geometric template generation, thereby establishing a new paradigm for single-sequence protein structure prediction. It also accelerates protein-related research, particularly for resource-limited universities and academic institutions. The code has been released to speed up biological research.
bioinformatics2026-02-07v2AbNovoBench: a resource and benchmarking platform for monoclonal antibody de novo sequencing
Jiang, W.; Luo, L.; Xiong, Y.; Xiao, J.; Lin, Z.; Huang, L.; Zhang, S.; Wang, J.; Wang, C.; Xia, N.; Yuan, Q.; Yu, R.AI Summary
- AbNovoBench is introduced as a benchmarking platform for evaluating monoclonal antibody (mAb) de novo sequencing strategies, addressing the lack of standardized benchmarks.
- It includes the largest dataset to date with 1,638,248 peptide-spectrum matches from 131 mAbs, used to benchmark 13 deep learning algorithms and three assembly strategies.
- The platform offers online resources, pre-trained models, and customizable workflows to enhance antibody sequencing, algorithm development, and reproducibility in proteomics.
Abstract
Monoclonal antibodies (mAbs) are critical in disease diagnostics and therapeutics, yet the performance of mass spectrometry (MS)-based de novo sequencing remains incompletely characterized due to limited antibody-specific datasets and the absence of a standardized benchmark framework. Here we present AbNovoBench, a comprehensive framework for evaluating data analysis strategies for mAb de novo sequencing. It features the largest high-quality dataset to date, generated in-house, comprising 1,638,248 peptide-spectrum matches from 131 mAbs across six species and 11 proteases, supplemented by eight mAbs with known full-length sequences for end-to-end reconstruction assessment. Employing a unified training dataset, we systematically benchmarked 13 deep learning-based de novo peptide sequencing algorithms and three assembly strategies across peptide sequencing metrics (accuracy, robustness, efficiency, error types) and assembly metrics (coverage depth, assembly score). AbNovoBench (https://abnovobench.com) provides an online platform enriched with curated antibody MS resources and pre-trained models, enabling customizable antibody sequencing workflows, accelerating antibody-specific algorithm development, and improving reproducibility in proteomics.
bioinformatics2026-02-07v2Improved Ensemble Performance by Weight Optimisation for the Genomic Prediction of Maize Flowering Time Traits
Tomura, S.; Powell, O. M.; Wilkinson, M. J.; Lefevre, J.; Cooper, M.AI Summary
- This study investigated the impact of weight optimization on ensemble models for predicting maize flowering time traits using TeoNAM and MaizeNAM datasets.
- Three weight optimization methods (linear transformation, Nelder-Mead, Bayesian) were compared, showing that optimized weights improved prediction performance over naive equal-weighted ensembles.
- No single optimization method was consistently superior, suggesting further research into integrating weight optimization with hyperparameter tuning could be beneficial.
Abstract
Ensembles of multiple genomic prediction models have demonstrated improved prediction performance over the individual models contributing to the ensemble. The outperformance of ensemble models is expected from the Diversity Prediction Theorem, which states that for ensembles constructed with diverse prediction models, the ensemble prediction error becomes lower than the mean prediction error of the individual models. While a naive ensemble-average model provides baseline performance improvement by aggregating all individual prediction models with equal weights, optimising weights for each individual model could further enhance ensemble prediction performance. The weights can be optimised based on their level of informativeness regarding prediction error and diversity. Here, we evaluated weighted ensemble-average models with three possible weight optimisation approaches (linear transformation, Nelder-Mead and Bayesian) using flowering time traits from two maize nested association mapping (NAM) datasets: TeoNAM and MaizeNAM. The three proposed weighted ensemble-average approaches improved prediction performance in several of the prediction scenarios investigated. In particular, the weighted ensemble models enhanced prediction performance when the adjusted weights differed substantially from the equal weights used by the naive ensemble models. For performance comparisons within the weighted ensembles, no approach was clearly superior in either prediction accuracy or error across the prediction scenarios. Weight optimisation in ensembles warrants further investigation to explore the opportunities to improve their prediction performance; for example, integration of a weighted ensemble with a simultaneous hyperparameter tuning process may offer a promising direction for further research.
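Of the three approaches, Nelder-Mead is the easiest to sketch: parameterize the weights through a softmax so they stay positive and sum to one, then minimize validation error. A toy version with simulated predictions (not the study's models or data):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)

# Stand-ins: validation-set predictions of four genomic prediction models.
y = rng.normal(size=200)                                   # observed phenotypes
P = np.column_stack([y + rng.normal(scale=s, size=200)     # models of varying skill
                     for s in (0.3, 0.5, 0.8, 1.2)])

def ensemble_mse(z):
    w = np.exp(z) / np.exp(z).sum()    # softmax keeps weights positive, summing to 1
    return np.mean((P @ w - y) ** 2)

res = minimize(ensemble_mse, x0=np.zeros(P.shape[1]), method="Nelder-Mead")
w_opt = np.exp(res.x) / np.exp(res.x).sum()

naive = np.mean((P.mean(axis=1) - y) ** 2)                 # equal-weight baseline
print("weights:", np.round(w_opt, 3), "MSE:", res.fun, "vs naive:", naive)
```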
bioinformatics2026-02-06v1StrainCascade: An automated, modular workflow for high-throughput long-read bacterial genome reconstruction and characterization
Jordi, S. B. U.; Baertschi, I.; Li, J.; Fasel, N.; Misselwitz, B.; Yilmaz, B.AI Summary
- StrainCascade is an automated, modular workflow designed to streamline high-throughput long-read bacterial genome reconstruction.
- It integrates genome assembly, annotation, and functional profiling into a single framework, enhancing reproducibility.
- Key findings include improved resolution of strain-level variability, facilitating comparative genomics on diversity, host-microbe interactions, resistance mechanisms, and mobile genetic elements.
Abstract
Long-read sequencing offers unprecedented opportunities for high-resolution bacterial genome reconstruction, yet fragmented bioinformatics workflows hinder biological insights. StrainCascade addresses this gap by providing a fully automated, modular pipeline that integrates genome assembly, accurate annotation, and comprehensive functional profiling into a single, reproducible framework. Leveraging deterministic computational execution strategies, StrainCascade systematically resolves strain-level structural and functional variability, enabling robust comparative genomics of strain diversity, host-microbe interactions, antimicrobial resistance mechanisms, and mobile genetic element dynamics.
bioinformatics2026-02-06v1Unified imputation of missing data modalities and features in multi-omic data via shared representation learning
Nambiar, A.; Melendez, C.; Noble, W. S.AI Summary
- The study introduces MIMIR, a deep learning framework for imputing missing data in multi-omic studies by addressing both missing modalities and missing values through shared representation learning.
- MIMIR uses masked autoencoders to learn modality-specific representations, which are then projected into a common latent space for reconstruction from any observed modality subset.
- Evaluated on The Cancer Genome Atlas data, MIMIR outperformed baseline methods in various missing data scenarios, revealing structured cross-modal dependencies that influence imputation accuracy.
Abstract
Multi-omic studies promise a more comprehensive view of biological systems by jointly measuring multiple molecular layers. In practice, however, such datasets are rarely complete: entire molecular modalities may be missing for many samples, and observed modalities often contain substantial feature-level missingness. Existing imputation approaches typically address only one of these two problems, relying either on feature-level imputation within a single modality or on pairwise translation models that cannot accommodate arbitrary combinations of missing modalities. As a result, there is, to our knowledge, no unified framework for reconstructing both missing data modalities and missing values within those modalities. We present MIMIR, a deep learning framework for unified multi-omic imputation that addresses both missing modalities and missing values through shared representation learning. MIMIR first learns modality-specific representations using masked autoencoders and then projects these representations into a common latent space, enabling reconstruction from any subset of observed modalities. Evaluated on pan-cancer multi-omic data from The Cancer Genome Atlas, MIMIR consistently outperforms baseline methods across a range of missing-modality and missing-value scenarios, including missing completely at random and missing not at random settings. Analysis of the learned shared space reveals structured cross-modal dependencies that explain modality-specific differences in imputation accuracy, with transcriptional and epigenetic modalities forming a strongly aligned core and copy number variation contributing more distinct signal. Together, these results demonstrate that shared representation learning provides an effective and flexible foundation for multi-omic imputation under heterogeneous missingness.
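The first stage, modality-specific masked autoencoding, can be sketched compactly: hide a random subset of feature values, reconstruct from the corrupted input, and score the loss only on the hidden entries. A minimal single-modality sketch in PyTorch with invented dimensions (MIMIR additionally projects the modality codes into a shared latent space, which is omitted here):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(256, 100)                      # stand-in: one omics modality

enc = nn.Sequential(nn.Linear(100, 32), nn.ReLU(), nn.Linear(32, 16))
dec = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 100))
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

for step in range(200):
    mask = torch.rand_like(X) < 0.3            # hide 30% of entries each step
    x_in = X.masked_fill(mask, 0.0)            # zero-out masked values at the input
    x_hat = dec(enc(x_in))
    loss = ((x_hat - X)[mask] ** 2).mean()     # reconstruction loss on masked entries only
    opt.zero_grad()
    loss.backward()
    opt.step()

print(float(loss))   # at inference, dec(enc(x_observed)) fills in the missing values
```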
bioinformatics2026-02-06v1Reference genome choice impacts SNP recovery but not evolutionary inference in young species
Soares, L. S.; Goncalves, L. T.; Guzman-Rodriguez, S.; Bombarely, A.; Freitas, L. B.AI Summary
- This study investigates how the choice of reference genome affects SNP recovery and evolutionary inference in RAD-seq analyses of young species like Petunia and Calibrachoa.
- Using congeneric reference genomes resulted in consistent mapping rates, SNP recovery, and population genomic patterns, while distantly related genomes showed lower mapping rates and affected summary statistics.
- Despite these differences, the broader genetic structure, diversity, and evolutionary relationships remained consistent, suggesting that closely related reference genomes are sufficient for robust analyses in recent radiations.
Abstract
Reduced-representation sequencing approaches such as RAD-seq are widely used in population genomics and phylogenetics, particularly for non-model organisms. However, bioinformatics choices during data processing can strongly influence downstream analyses. One key but underexplored factor is the reference genome used for read alignment and SNP discovery. Here, we evaluate the effects of reference genome choice on RAD-seq analyses using multiple datasets spanning recent radiations in Petunia and Calibrachoa, and reference genomes that differ in phylogenetic relatedness. When using congeneric reference genomes, we observed highly consistent mapping rates, SNP recovery, and downstream population genomic patterns. In contrast, mapping to more distantly related genomes resulted in lower mapping rates and stronger effects on summary statistics. Despite these quantitative reductions, broader patterns of genetic structure and diversity, as well as evolutionary relationships, remained largely congruent across reference genomes. Overall, our results indicate that reference genome choice matters most when genomes are distantly related or when analyses target fine-scale genomic signals. For recent radiations with largely conserved genome structure, closely related reference genomes yield comparable SNP datasets and lead to the same biological conclusions regarding population structure and phylogenetic relationships. These findings provide practical guidance for RAD-seq studies in non-model systems, showing that congeneric reference genomes are sufficient for robust population and phylogenetic inference, and that more distantly related genomes can remain informative when no close reference is available.
bioinformatics2026-02-06v1Scaling Variant-Aware Multiplex Primer Design
Han, Y.; Boucher, C.AI Summary
- The study addresses the challenge of designing primers for multiplex PCR that are effective across diverse and evolving pathogen genomes by introducing a near-linear algorithm for Primer Design Region (PDR) optimization with provable guarantees.
- A reference-free risk model based on Gini impurity was developed to ensure PDRs are robust to sequence diversity, and a local-search heuristic was used for optimizing primer subsets for thermodynamic stability.
- Testing on Foot-and-Mouth Disease and Zika virus datasets showed that the method, Δ-PRO, produced more compact and robust PDR sets with reduced predicted dimerization, enhancing multiplex PCR efficiency.
Abstract
Motivation: Robust primer design is essential for reliable multiplex PCR in diverse and evolving pathogen, microbial, and host genomes. Traditional methods optimized for a single reference often fail on emerging variants, leading to reduced efficiency. Variant-aware design seeks primers that remain effective across diverse targets, but this introduces two key challenges: identifying robust candidates and selecting an optimal subset of primers. Although there are methods for the first challenge, namely the Primer Design Region (PDR) optimization problem, existing approaches lack optimality guarantees. Results: We introduce a near-linear algorithm with provable guarantees for efficient PDR optimization. Complementing this, we propose a reference-free risk model based on Gini impurity that provides a stable, biologically interpretable measure of site-specific variation and yields PDRs that are robust to sequence diversity across datasets without ad hoc smoothing. For the second challenge related to thermodynamic stability, we optimize predicted ΔG and cast subset selection as a k-partite maximum-weight clique problem, which is NP-hard. We then design an efficient local-search heuristic with linear-time updates. Together, these advances yield a principled, scalable framework for variant-aware primer design. Across Foot-and-Mouth Disease virus and Zika virus datasets, Δ-PRO produces more compact and robust PDR sets and multiplex panels with reduced predicted dimerization compared to existing tools, demonstrating the practical gains of principled and scalable variant-aware primer design for high-throughput multiplex PCR assays. Availability: The proposed methods are implemented in a software package. The implementation and results are publicly available at https://github.com/yhhan19/variant-aware-primer-design.
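The reference-free risk model is easy to illustrate: per alignment column, the Gini impurity 1 - sum_b p_b^2 is zero when the site is fully conserved and grows with variation, and candidate regions can be ranked by summed window risk. A toy sketch on an invented alignment (the paper's actual algorithm adds near-linear optimization with guarantees, which this does not attempt):

```python
import numpy as np
from collections import Counter

# Toy multiple sequence alignment of target variants (rows = sequences).
aln = np.array([list(s) for s in [
    "ACGTACGTACGTTTGACGT",
    "ACGTACGAACGTTTGACGT",
    "ACGTACGTACGTTTGTCGT",
    "ACGTTCGTACGTTTGACGT",
]])

def gini(column):
    """Gini impurity of one alignment column: 1 - sum_b p_b^2 (0 = conserved)."""
    n = len(column)
    return 1.0 - sum((c / n) ** 2 for c in Counter(column).values())

site_risk = np.array([gini(aln[:, j]) for j in range(aln.shape[1])])

# Score every candidate primer-length window by its summed site risk and
# take the lowest-risk window as a robust design region.
L = 8
window_risk = np.convolve(site_risk, np.ones(L), mode="valid")
best = int(np.argmin(window_risk))
print(f"best {L}-mer region: positions {best}-{best + L - 1}, risk {window_risk[best]:.3f}")
```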
bioinformatics2026-02-06v1Learning a Continuous Progression Trajectory of Amyloid in Alzheimer's disease
Tong, M.; Mehfooz, F.; Zhang, S.; Wang, Y.; Fang, S.; Saykin, A. J.; Wang, X.; Yan, J.; Alzheimer's Disease Neuroimaging Initiative,AI Summary
- Researchers developed SLOPE, an unsupervised method to model amyloid progression in Alzheimer's disease (AD) continuously using longitudinal amyloid PET data.
- SLOPE generated a two-dimensional trajectory that better preserved temporal progression and showed greater sensitivity to early amyloid changes than global measures.
- The method revealed biologically consistent amyloid spreading patterns, enhancing disease modeling and monitoring in early AD stages.
Abstract
BACKGROUND: Understanding Alzheimer's disease (AD) progression is critical for timely diagnosis and treatment evaluation, but traditional discrete diagnostic groups often lack sensitivity to subtle early-stage changes. METHODS: We developed SLOPE, an unsupervised dimensionality reduction method that models amyloid progression in AD on a continuous scale while preserving the temporal order of longitudinal follow-up visits. Applied to longitudinal amyloid PET data, SLOPE generated a two-dimensional trajectory capturing global amyloid accumulation across the AD continuum. RESULTS: SLOPE-derived staging scores better preserved temporal progression across diagnostic groups and longitudinal follow-up visits and generalized to held-out subjects. The learned trajectory revealed biologically consistent amyloid spreading patterns and greater sensitivity to early progression than global amyloid SUVR. DISCUSSION: SLOPE provides a continuous staging of amyloid pathology that complements global amyloid measures by capturing early localized progression. These properties highlight its potential in disease modeling and monitoring, particularly in early and preclinical stages of AD.
bioinformatics2026-02-06v1Ecological context structures duplication and mobilization of antibiotic and metal resistance genes in bacteria
Tran, E.; Xu, P. N.; Assis, R.AI Summary
- The study investigated how ecological contexts influence the duplication and mobilization of antibiotic resistance genes (ARGs) and metal resistance genes (MRGs) in bacteria across clinical, agricultural, and wastewater environments.
- Resistance gene profiles varied significantly by environment, with distinct duplication patterns observed.
- Duplication of resistance genes was often linked with mobile genetic elements, but the dynamics of ARGs and MRGs were not uniformly coupled, highlighting the role of ecological context in resistance gene evolution.
Abstract
Antibiotic resistance is a global challenge driven by the persistence and spread of resistance genes across ecological contexts. While mobile genetic elements (MGEs) facilitate horizontal gene transfer, gene duplication represents an additional mechanism through which resistance genes can be amplified, diversified, and maintained under selection. How these processes interact across environments remains poorly understood. Here, we examined genome-level patterns of resistance gene abundance, duplication, and mobilization across clinical, agricultural, and wastewater settings, focusing on both antibiotic resistance genes (ARGs) and metal resistance genes (MRGs). Resistance gene profiles were strongly structured by environment, with distinct duplication patterns emerging across sources. Duplicate genes were frequently associated with MGEs, although the strength of this relationship varied by resistance type and ecological context. Despite frequent co-occurrence of ARGs and MRGs, their duplication and mobilization dynamics were not uniformly coupled at the genome level. Together, these findings highlight gene duplication as a context-dependent contributor to resistance evolution and underscore the importance of ecological setting in shaping how resistance genes persist and spread across microbial communities.
bioinformatics2026-02-06v1ChromBERT-tools: A versatile toolkit for context-specific embedding of transcription regulators across different cell types
Chen, Q.; Yu, Z.; Zhang, Y.AI Summary
- ChromBERT-tools is a toolkit designed to generate context-specific embeddings of transcription regulators, addressing the need for embeddings that consider genome-wide context and cell-type specificity.
- It uses a pre-trained foundation model to produce both cell-type-agnostic and cell-type-specific embeddings, enhancing transcription regulation modeling.
- The toolkit offers command-line interfaces and Python APIs for embedding generation, adaptation to different cell types, and interpretation, facilitating biological inferences like regulator interactions and cell state transitions.
Abstract
Motivation: Representations that capture the genome-wide context of transcription regulators are critical for establishing a shared backbone for flexible transcription modeling and in silico regulatory analysis. Yet current embeddings predominantly rely on limited modalities, such as gene co-expression or static protein features, offering an incomplete perspective that ignores context-dependent transcription regulator activities across the genome. The lack of transcription regulation-informed embeddings, paired with the absence of a user-friendly and lightweight toolkit for their generation, adaptation to different cell types and interpretation, impedes the capture of the regulatory logic that underpins cellular states and functions. Results: To address this need, we present ChromBERT-tools, a lightweight toolkit designed to operationalize regulation-informed embeddings derived from a foundation model pre-trained on the comprehensive landscapes of human and mouse transcription regulators. ChromBERT-tools provides user-friendly command-line interfaces (CLIs) and Python APIs to achieve two primary goals: (i) generating cell-type-agnostic embeddings that capture the semantic representations of individual regulators and their combinatorial interactions, serving as biological priors of the transcription regulator modality to enhance transcription regulation modeling and rule interpretation; and (ii) generating cell-type-specific embeddings via fine-tuned model variants, which support in silico inference of regulatory roles of transcription regulators in cell types with scarce experimental data. The toolkit streamlines end-to-end workflows for embedding generation, adaptation to different cell types and interpretation towards biological inferences such as regulator-regulator interactions across the genome and key regulators determining cell identity or cell state transitions.
bioinformatics2026-02-06v1Rapid gene exchange explains differences in bacterial pangenome structure
Horsfield, S. T.; Peng, A.; Russell, M. J.; von Wachsmann, J.; Toussaint, J.; D'Aeth, J. C.; Qin, C.; Pesonen, H.; Tonkin-Hill, G.; Corander, J.; Croucher, N. J.; Lees, J. A.AI Summary
- The study developed Pansim and PopPUNK-mod to model pangenome dynamics, analyzing over 600,000 genomes from 400 bacterial species.
- Findings indicate that variation in the number of rapidly exchanged genes primarily drives differences in pangenome structure between species.
- Bacterial phylogeny, not ecology, was found to correlate with pangenome dynamics, suggesting the need for pan-species gene-level analyses.
Abstract
The size and diversity of bacterial gene repertoires, known as pangenomes, vary widely across species. The evolutionary forces driving the maintenance of pangenomes are an open topic of debate, with contradictory theories suggesting that pangenomes exist as a result of neutral evolution, with all genes gained and lost at random, or that all genes provide a fitness benefit to the host and are maintained by positive selection. Modelling of pangenome dynamics has provided insight into how gene exchange explains observed gene frequency distributions, and stands as the only means of jointly inferring contributions of individual gene selection effects and mobility on the maintenance of pangenomes. However, previous modelling studies have not included both gene-level selection and mobility, and do not consider broadly sampled genome datasets for many species. To differentiate neutral and selective forces maintaining pangenomes, we developed a mechanistic model of gene-level evolution, Pansim, and a scalable model fitting framework, PopPUNK-mod. Together, these tools leverage rapid genome distance calculation to fit models of pangenome dynamics to datasets containing hundreds of thousands of genomes. We used this framework to compare the pangenome dynamics of over 400 different bacterial species, using over 600,000 genomes. We find that diversity in pangenome characteristics between species is driven predominantly by variation in the number of rapidly exchanged genes, while the rate of exchange of remaining genes is conserved. We find that bacterial phylogeny, rather than ecology, correlates with pangenome dynamics. We argue that pan-species gene-level analyses are now needed to understand selection across accessory genes. Our work highlights the importance of gene exchange rate differences in governing differences in pangenome characteristics between species.
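For intuition about the neutral end of the debate, here is a toy gene gain/loss simulation of the kind such mechanistic models are compared against (Wright-Fisher reproduction with constant gain and loss rates; an illustration only, not Pansim, and all parameter values are invented):

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(6)

def simulate_pangenome(n_genomes=100, n_gen=500, gain=0.5, loss=0.01):
    """Toy neutral gene gain/loss model with Wright-Fisher reproduction.

    Each generation every genome copies a random parent, loses each gene
    independently with probability `loss`, and gains Poisson(`gain`) novel genes.
    """
    genomes = [set() for _ in range(n_genomes)]
    next_gene = 0
    for _ in range(n_gen):
        parents = rng.integers(n_genomes, size=n_genomes)
        offspring = []
        for p in parents:
            g = {x for x in genomes[p] if rng.random() > loss}
            for _ in range(rng.poisson(gain)):
                g.add(next_gene)
                next_gene += 1
            offspring.append(g)
        genomes = offspring
    return genomes

genomes = simulate_pangenome()
freq = np.array(list(Counter(x for g in genomes for x in g).values()))
print(f"pangenome size: {len(freq)}, core: {(freq == len(genomes)).sum()}, "
      f"singletons: {(freq == 1).sum()}")
```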
bioinformatics2026-02-06v1A Systematic Benchmark of Antibiotic Resistance Gene Detection Tools for Shotgun Metagenomic Datasets
Tiwari, S. K.; Ponsero, A. J.; Talas, J.; Grimes, K. P.; Haynes, S.; Telatin, A.AI Summary
- This study benchmarked five ARG detection tools (ARGprofiler, KARGA, ARIBA, GROOT, SRST2) on simulated metagenomic datasets with varying sequencing coverages and microbial complexities.
- Sequencing coverage significantly affects ARG detection accuracy, with reliable detection at 10x coverage; ARGprofiler had the highest F1-score (0.891) at ≥10x.
- Increased community complexity reduced accuracy for all tools, with KARGA showing the highest mean F1-score (0.122 ± 0.067) under realistic uneven coverage, while computational efficiency varied, with ARGprofiler, SRST2, and GROOT being the most efficient.
Abstract
Accurate detection of antimicrobial resistance genes (ARGs) from metagenomic data is essential for understanding resistance dissemination within microbial communities, yet tool performance remains influenced by sequencing coverage, community complexity and dataset variability. In this study, we systematically benchmarked five widely used read-based ARG detection tools (ARGprofiler, KARGA, ARIBA, GROOT and SRST2) across simulated metagenomic datasets representing varying sequencing coverages, microbial complexities and an approximately realistic metagenomic dataset. The results demonstrated that sequencing coverage is a major determinant of ARG detection accuracy, with reliable detection achieved at 10x coverage and performance stabilizing between 20x and 30x. ARGprofiler exhibited the highest overall F1-score (0.891) at ≥10x, whereas KARGA showed higher recall at low coverage levels but lower precision compared to ARGprofiler. Increasing community complexity led to a decline in accuracy across all tools, and under realistic uneven coverage, performance variability increased substantially, with KARGA achieving the highest mean F1-score (0.122 ± 0.067). Runtime evaluation further revealed substantial differences in computational efficiency, with ARGprofiler, SRST2 and GROOT being the most resource-efficient, while KARGA imposed the highest computational burden. Collectively, these findings highlight that both sequencing coverage and community complexity profoundly shape ARG detection outcomes, and that tool selection should balance accuracy with computational efficiency. The study also emphasizes the need for standardized benchmarking datasets that reflect true metagenomic complexity to ensure robust and comparable ARG surveillance across analytical pipelines.
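The accuracy metrics used throughout reduce to set comparisons between detected and spiked-in ARGs. A minimal sketch (the gene names and tool output are hypothetical):

```python
def prf1(predicted, truth):
    """Precision, recall and F1 for a set of detected ARGs vs. ground truth."""
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

truth = {"blaTEM-1", "tetA", "sul1", "aadA1"}    # spiked-in ARGs (hypothetical)
detected = {"blaTEM-1", "tetA", "mecA"}          # one tool's output (hypothetical)
print("P=%.2f R=%.2f F1=%.2f" % prf1(detected, truth))
```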
bioinformatics2026-02-06v1Decomposing multi-scale dynamic regulation from single-cell multiomics with scMagnify
Chen, X.; Yan, X.; Shen, B.; Wang, H.; Tang, Z.; Zang, Y.; Lin, P.; Zhang, H.; Li, Y.; Li, H.AI Summary
- scMagnify is a deep-learning framework that uses multiomic single-cell data to reconstruct and decompose multi-scale gene regulatory networks (GRNs) via nonlinear Granger causality.
- It employs tensor decomposition to identify combinatorial transcription factor modules and their activation profiles across different time-lags, providing insights into regulatory logic.
- Applied to human hematopoiesis, mouse pancreas development, and kidney injury, scMagnify revealed known regulators and new insights into cell fate decisions and pathological changes.
Abstract
Deciphering the highly coupled regulatory circuits that drive cellular dynamics remains a fundamental goal in biology. However, capturing the multi-scale time-lagged dynamics and combinatorial regulatory logic of gene regulation remains computationally challenging. Here we present scMagnify, a deep-learning-based framework that leverages multiomic single-cell assays of chromatin accessibility and gene expression via nonlinear Granger causality to reconstruct and decompose multi-scale gene regulatory networks (GRNs). Benchmarking on both simulated and real datasets demonstrates that scMagnify achieves superior performance. scMagnify employs tensor decomposition to systematically identify combinatorial TF modules and their activation profiles across different time-lags. It enables a hierarchical dissection of the regulatory landscape, from the activity of individual regulators to the combinatorial logic of regulatory modules and intercellular communications. We applied scMagnify to human hematopoiesis and mouse pancreas development, where it successfully recovered known lineage-driving regulators and provided novel insights into the combinatorial logic that governs cell fate decisions. Furthermore, in the context of kidney injury, scMagnify's intercellular communication module mapped key signaling-to-transcription cascades linking microenvironment cues to pathological epithelial cell changes. In summary, scMagnify provides a powerful and versatile computational framework for dissecting the multi-scale regulatory logic that governs complex biological processes in development and disease.
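The underlying notion of Granger causality is compact enough to sketch: x Granger-causes y if adding lagged x to an autoregression of y significantly reduces residual error. scMagnify uses a nonlinear deep-learning analogue; the classical linear F-test version below, on simulated series, is for intuition only:

```python
import numpy as np
from scipy import stats

def granger_f_test(x, y, lags=2):
    """Linear Granger test: do past values of x improve prediction of y?"""
    n = len(y)
    Y = y[lags:]
    ylags = np.column_stack([y[lags - k:n - k] for k in range(1, lags + 1)])
    xlags = np.column_stack([x[lags - k:n - k] for k in range(1, lags + 1)])
    ones = np.ones((n - lags, 1))
    X_r = np.hstack([ones, ylags])            # restricted: own lags only
    X_u = np.hstack([ones, ylags, xlags])     # unrestricted: plus lags of x
    rss = lambda X: np.sum((Y - X @ np.linalg.lstsq(X, Y, rcond=None)[0]) ** 2)
    df1, df2 = lags, len(Y) - X_u.shape[1]
    F = ((rss(X_r) - rss(X_u)) / df1) / (rss(X_u) / df2)
    return F, stats.f.sf(F, df1, df2)

rng = np.random.default_rng(7)
x = rng.normal(size=500)
y = np.zeros(500)
for t in range(2, 500):                       # y driven by lagged x (a "TF -> target" toy)
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + 0.1 * rng.normal()
print(granger_f_test(x, y))                   # small p-value: x Granger-causes y
```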
bioinformatics2026-02-06v1Identification and Characterization of Metastasis-initiating cells
Wu, S.; Wei, J.; Liu, X.; Zhang, J.; Wen, J.; Huang, L.; Zhou, X.AI Summary
- The study introduces scMIC, a computational framework for identifying metastasis-initiating cells (MICs) from single-cell data, addressing limitations of current methods.
- scMIC uses embedding-based representation, unbalanced optimal transport, and top-k selection to reliably identify MICs across various cancer types and datasets.
- Key findings include the framework's validation, its clinical utility in metastasis prognosis, and its role in discovering metastasis-related gene programs and biomarkers.
Abstract
Metastasis, the primary cause of cancer-related mortality, is a dynamic and complex process driven by a subset of cells known as metastasis-initiating cells (MICs). Accurate identification of MICs is therefore critical for metastasis diagnosis and therapeutic decision-making. However, current approaches rely either on mouse tracing experiments, which are difficult to translate to human systems, or on indirect strategies such as stemness, trajectory, pathway, and biomarker analyses that often yield inconsistent results. To address these limitations, we propose scMIC, a computational framework designed to explicitly and reliably identify MICs from single-cell data (available at https://github.com/swu13/scMIC). scMIC integrates an embedding-based representation, unbalanced optimal transport, and a top-k selection strategy to robustly capture metastasis-initiating potential. The framework was validated and applied across multiple cancer types, species, and multi-omics datasets. Our results demonstrate the reliability of scMIC for MIC identification, its potential clinical utility in metastasis prognosis, and its effectiveness in discovering metastasis-related gene programs and molecular biomarkers. Elucidating the mechanisms of metastasis initiation not only advances our understanding of metastatic progression but also enables the development of therapeutic strategies that target the more aggressive MIC population rather than non-MICs, thereby avoiding unintended increases in metastatic risk. Collectively, scMIC provides a powerful tool for cancer metastasis research and drug discovery.
bioinformatics2026-02-06v1