Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
SCALPEL: A pipeline for processing large-scale spatial transcriptomics data
Kunst, M.; Ching, L.; Quon, J.; Mathieu, R.; Hewitt, M.; Seeman, S.; Ayala, A.; Gelfand, E.; Long, B.; Martin, N.; Nagra, J.; Olsen, P.; Oyama, A.; Valera, N.; Pagen, C.; Sunkin, S.; Ariza, J.; Smith, K.; McMillen, D.; Zeng, H.; Waters, J.
AI Summary
- SCALPEL is a pipeline designed for processing large-scale spatial transcriptomics data, featuring 3D segmentation, refined filtering, doublet detection, and cell type label transfer.
- It includes spatial domain detection and registration to the Allen Mouse Brain CCFv3, with genome-wide expression imputation from scRNAseq.
- Benchmarking against a previous dataset showed improvements in cell number, expression clarity, and spatial registration, setting a new standard for spatial transcriptomics studies.
Abstract
Spatial transcriptomics enables the precise mapping of gene expression patterns within tissue architecture, offering unprecedented insights into cellular interactions, tissue heterogeneity, and disease pathology that are unattainable with traditional transcriptomic approaches. We present a tool for processing spatial transcriptomics data, SCALPEL (Spatial Cell Analysis, Labeling, Processing, and Expression Linking). SCALPEL is specifically designed to support the analysis of large, atlas-level datasets. Our new workflow features advanced 3D segmentation optimized for dense and heterogeneous tissues, refined filtering criteria, and transcriptome-based doublet detection to remove low-quality or artifactual cells. Cell type label transfer from existing taxonomies is further improved through updated filtering thresholds. Spatial domain detection is incorporated to capture local transcriptomic organization, and tissue sections are registered to the Allen Mouse Brain Common Coordinate Framework version 3 (CCFv3) for precise anatomical alignment. Genome-wide expression imputation from single-cell RNA-sequencing (scRNAseq) further enriches the dataset. Crucially, we benchmark the performance of this updated pipeline against a previously published version of our whole-mouse-brain (WMB) dataset (Yao et al., 2023b), demonstrating substantial improvements in cell number, expression profile clarity, and spatial registration. These advances provide a robust foundation for downstream spatial analyses and set a new standard for large-scale spatial transcriptomics studies.
bioinformatics · 2026-02-10 · v3

Systems Level Analysis of Gene, Pathway and Phytochemical Associations with Psoriasis
Ray, S.; Dutta, O.; Kousoulas, K. G.; Apostolopoulos, N.; Chamcheu, J. C.; Kaur, R.
AI Summary
- The study used a systems biology approach to analyze gene expression and pathways in psoriatic lesions, identifying key roles of type I/III interferon signaling, AP-1, and CREB1.
- It highlighted seven phytochemicals with potential multi-target activity against psoriasis, focusing on the IL-17/TNF-interferon-AP-1/CREB1-COX-2/MMP9 axis.
- Protopine and atractylon were suggested as promising candidates for topical treatment due to favorable ADMET properties, with further validation needed in skin models.
Abstract
Psoriasis is an inflammatory skin disorder driven by abnormal immune activation that promotes excessive proliferation and accelerated turnover of epidermal keratinocytes. IL-17 and TNF pathways are well known in psoriasis, but the other mechanisms that keep the disease active and link it to systemic comorbidities are not yet fully understood. A combined transcriptomic and systems biology framework was applied to map regulatory circuits in psoriatic lesions and to identify phytochemical candidates capable of multi-target modulation for topical intervention. Differential gene expression between lesional and healthy skin was analyzed, followed by pathway enrichment, upstream regulator inference, protein-protein interaction network, and chemical-gene interaction mapping. This integrative strategy revealed a transcriptional landscape dominated by type I/III interferon signaling, antiviral and antimicrobial responses, immune metabolic dysregulation, and transcriptional hubs centered on AP-1 and CREB1. Several genes and upstream regulators not previously associated with psoriasis were identified within inflammatory and cell migration-related modules, indicating unexplored regulatory layers in disease control. Network-guided chemical prioritization and direction-of-effect filtering highlighted seven phytochemicals (mahanine, atractylon, protopine, annomontine, taraxasterol, tricin, and tamarixetin) with multi-target activity across key disease axes. ADMET-based screening suggested protopine and atractylon as favorable candidates for topical delivery, while synergy modeling supported flavonoid-alkaloid combination designs. This multi-layered approach provides mechanistically informed phytochemicals targeting the IL-17/TNF-interferon-AP-1/CREB1-COX-2/MMP9 axis in psoriasis. Experimental validation in keratinocyte and organotypic skin models will be required to determine whether these compounds, individually or in combination, can effectively restore psoriatic signaling in vivo.
bioinformatics · 2026-02-10 · v2

ETSAM: Effectively Segmenting Cell Membranes in cryo-Electron Tomograms
Selvaraj, J.; Cheng, J.
AI Summary
- This study introduces ETSAM, a two-stage AI method based on SAM2, designed to segment cell membranes in cryo-ET tomograms.
- ETSAM was trained on 83 experimental and 28 simulated tomograms, achieving state-of-the-art performance on an independent test set of 10 tomograms.
- It significantly outperforms existing methods by providing high sensitivity and precision in membrane segmentation despite challenges like low signal-to-noise ratio and missing wedge artifacts.
Abstract
Cryogenic Electron Tomography (cryo-ET) is an emerging experimental technique to visualize cell structures and macromolecules in their native cellular environment. Accurate segmentation of cell structures in cryo-ET tomograms, such as cell membranes, is crucial to advance our understanding of cellular organization and function. However, several inherent limitations in cryo-ET tomograms, including the very low signal-to-noise ratio, missing wedge artifacts from limited tilt angles, and other noise artifacts, collectively hinder the reliable identification and delineation of these structures. In this study, we introduce ETSAM - a two-stage Segment Anything Model 2 (SAM2)-based fine-tuned AI method that effectively segments cell membranes in cryo-ET tomograms. It is trained on a diverse dataset comprising 83 experimental tomograms from the CryoET Data Portal (CDP) database and 28 simulated tomograms generated using PolNet. ETSAM achieves state-of-the-art performance on an independent test set comprising 10 experimental tomograms for which ground-truth annotations are available. It robustly segments cell membranes with high sensitivity and precision, significantly outperforming existing deep learning methods.
bioinformatics · 2026-02-10 · v2

Reading TEA leaves for de novo protein design
Pantolini, L.; Durairaj, J.
AI Summary
- The study explores de novo protein design using a 20-letter structure-inspired alphabet from protein language model embeddings to enhance Monte Carlo sampling efficiency.
- This approach allows for rapid template-guided and unconditional design, producing novel protein sequences that meet designability criteria without known homologues.
- The method significantly reduces the time required for protein design, opening new avenues for therapeutic and industrial applications.
Abstract
De novo protein design expands the functional protein universe beyond natural evolution, offering vast therapeutic and industrial potential. Monte Carlo sampling remains under-explored in protein design, largely because of the long simulation times it requires and the prohibitive cost of current structure prediction oracles. Here we make use of a 20-letter structure-inspired alphabet derived from protein language model embeddings to score random mutagenesis-based Metropolis sampling of amino acid sequences. This facilitates fast template-guided and unconditional design, generating sequences that satisfy in silico designability criteria without known homologues. Ultimately, this unlocks a new path to fast de novo protein design.
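A minimal sketch of the Metropolis step described above, with a toy placeholder standing in for the structure-alphabet score (the paper's alphabet and scoring are not reproduced here); the `score` function, sequence length, and temperature are all illustrative assumptions.

```python
import math
import random

AA = "ACDEFGHIKLMNPQRSTVWY"

def score(seq: str) -> float:
    """Placeholder for a sequence score; the paper derives its score from a
    20-letter structure-inspired alphabet, which is not reproduced here."""
    return -abs(seq.count("A") - seq.count("L"))  # toy objective

def metropolis(seq: str, n_steps: int = 1000, temperature: float = 1.0, seed: int = 0) -> str:
    rng = random.Random(seed)
    current, f_cur = seq, score(seq)
    for _ in range(n_steps):
        # Propose a random single-site mutation.
        pos = rng.randrange(len(current))
        proposal = current[:pos] + rng.choice(AA) + current[pos + 1:]
        f_prop = score(proposal)
        # Metropolis criterion: accept improvements, otherwise accept with
        # probability exp(delta / T).
        if f_prop >= f_cur or rng.random() < math.exp((f_prop - f_cur) / temperature):
            current, f_cur = proposal, f_prop
    return current

print(metropolis("M" * 60))
```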
bioinformatics · 2026-02-10 · v1

Autoregressive forecasting of future single-cell state transitions
Luo, E.; Gao, H.; BIAN, H.; Li, Y.; Li, C.; Hao, M.; Chen, M.; She, Y.; Wei, L.; Liu, K.; Zhang, X.
AI Summary
- The study introduces CellTempo, a temporal generative AI model designed to forecast future cellular dynamics from static single-cell RNA-sequencing data.
- CellTempo uses learned semantic codes and an autoregressive decoder to predict long-range cell-state transitions.
- Experiments demonstrated that CellTempo accurately forecasts cell state evolutions and reconstructs cell-state landscapes post-perturbations, aligning well with biological realities.
Abstract
Existing methods for dynamic analysis of static single-cell RNA-sequencing data can reconstruct temporal structures covered by observed cells, but cannot forecast unobserved future state transitions. We propose a temporal generative AI model, CellTempo, to forecast future cellular dynamics by representing cells as learned semantic codes and training an autoregressive generation decoder to predict ordered code sequences. It can forecast long-range cell-state transition trajectories and landscapes from snapshot data. To train the model, we constructed a comprehensive single-cell trajectory dataset scBaseTraj by integrating RNA velocity, pseudotime, and inferred transition probabilities to compose multi-step cellular sequences. Experiments on multiple real datasets showed that CellTempo can forecast cell state evolutions from individual cells, and reconstruct nuanced cell-state potential landscapes and their varied progressions after genetic or chemical perturbations, all with high fidelity to biological truth. This work opens a route for forecasting unseen future dynamics of cell state transitions from static observations.
bioinformatics · 2026-02-10 · v1

Optimizing Protein Tokenization: Reduced Amino Acid Alphabets for Efficient and Accurate Protein Language Models
Rannon, E.; Burstein, D.
AI Summary
- This study explores the use of reduced amino acid alphabets combined with Byte Pair Encoding (BPE) tokenization in protein language models (pLMs) to optimize efficiency.
- RoBERTa-based pLMs were pre-trained using various reduced alphabets and evaluated on multiple tasks.
- Results indicated that reduced alphabets significantly shortened input sequences, sped up training and inference, and maintained or improved performance compared to models using the full 20-amino-acid alphabet.
Abstract
Protein language models (pLMs) typically tokenize sequences at the single-amino-acid level using a 20-residue alphabet, resulting in long input sequences and high computational cost. Sub-word tokenization methods such as Byte Pair Encoding (BPE) can reduce sequence length but are limited by the sparsity of long patterns in proteins encoded by the standard amino acid alphabet. Reduced amino acid alphabets, which group residues by physicochemical properties, offer a potential solution but their performances with sub-word tokenization have not been systematically studied. In this work, we investigate the combined use of reduced amino acid alphabets and BPE tokenization in protein language models. We pre-trained RoBERTa-based pLMs de novo using multiple reduced alphabets and evaluated them across diverse downstream tasks. Our results show that reduced alphabets enable substantially shorter input sequences and faster training and inference, while maintaining comparable, and in some cases improved, performance relative to models trained on the full 20-amino-acid alphabet. These findings demonstrate that alphabet reduction facilitates more effective sub-word tokenization and provides a favorable trade-off between efficiency and predictive accuracy.
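A minimal sketch of the idea, assuming a hypothetical 8-group physicochemical reduction (not one of the paper's alphabets) and the Hugging Face `tokenizers` package for BPE training; the vocabulary size and toy corpus are illustrative.

```python
# pip install tokenizers
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

# Illustrative 8-group reduction by rough physicochemical class;
# the paper's reduced alphabets are not reproduced here.
GROUPS = {
    "AVLIMC": "h",  # hydrophobic
    "FWY":    "a",  # aromatic
    "KRH":    "p",  # positively charged
    "DE":     "n",  # negatively charged
    "STNQ":   "o",  # polar
    "G":      "g",
    "P":      "r",
}
REDUCE = {aa: code for aas, code in GROUPS.items() for aa in aas}

def reduce_seq(seq: str) -> str:
    """Map a protein sequence onto the reduced alphabet."""
    return "".join(REDUCE.get(aa, "x") for aa in seq)

corpus = ["MKTLLILAVLCLGSA", "MSDNGPQNQRNAPRITFGGP"]  # toy sequences
reduced = [reduce_seq(s) for s in corpus]

# Train a BPE vocabulary on the reduced sequences; merges over the smaller
# alphabet are denser, so longer sub-words emerge from the same data.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(vocab_size=200, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train_from_iterator(reduced, trainer=trainer)

print(tokenizer.encode(reduce_seq("MKTLLILAVLCLGSA")).tokens)
```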
bioinformatics · 2026-02-10 · v1

A multi-agent platform for assessment and improvement of bioinformatics software documentation
Ma, A.; Feng, S.; Gu, S.; Wang, C.; Ma, Q.
AI Summary
- The study introduces BioGuider, a multi-agent platform designed to evaluate and enhance documentation quality in bioinformatics software by treating documentation as a testable object.
- BioGuider uses a modular pipeline for documentation assessment, reporting, and correction, with agents simulating user interactions, and evaluates against task-oriented criteria.
- Testing on 47 bioinformatics tools showed BioGuider's effectiveness in error detection and correction, with a correlation between improved documentation quality and increased software adoption.
Abstract
Rapid advances in bioinformatics have transformed biomedical research in areas such as single-cell and spatial omics, digital pathology, and multi-modal data integration, yet software usability and reproducibility have not kept pace with the growing complexity and proliferation of computational tools. Inconsistent, incomplete, or inaccessible documentation remains a pervasive and underappreciated barrier, limiting tool adoption, hindering reproducibility across laboratories, and reducing the long-term impact of computational methods. Here, we introduce BioGuider, a multi-agent platform designed to systematically evaluate and improve documentation quality in bioinformatics software. Rather than treating documentation as ancillary text, BioGuider models it as a first-class, testable object. The platform implements a modular pipeline for documentation collection, assessment, reporting, and optional correction, with specialized agents that emulate real-world user interactions. BioGuider evaluates documentation against standardized, task-oriented criteria spanning installation, configuration, usage, and tutorials, and supports iterative, constraint-aware refinement while preserving code integrity and biological context. We benchmark BioGuider using a controlled error-injection framework that introduces realistic documentation failures across general, biology-specific, and configuration-related categories. Across multiple large language models, BioGuider demonstrates robust error detection and correction, with strong performance maintained under severe documentation degradation. Applying BioGuider to 47 widely used bioinformatics tools, we observe a positive association between documentation quality and citation frequency, highlighting documentation as a previously under-quantified driver of software adoption and scientific impact.
bioinformatics · 2026-02-10 · v1

PEhub resolves the hierarchical regulatory architecture of multi-way enhancer hubs in the human brain
Tan, J.; Sun, Y.
AI Summary
- PEhub is a new framework that resolves multi-way enhancer hubs from chromatin interaction data by modeling synergistic enhancer cooperation and accounting for interaction decay.
- Using H3K27ac HiChIP data, PEhub identified and validated promoter-anchored enhancer hubs in six human brain regions, showing they correspond to real multi-way chromatin assemblies.
- Enhancer hubs were found to be associated with increased transcription, hierarchical organization, and linked to genetic risk and transcription factor deployment in brain regions.
Abstract
Chromatin interaction assays capture regulatory architecture as stochastic pairwise contacts, limiting the ability to resolve how multiple enhancers cooperatively regulate transcription. Here we introduce a promoter-centric quantitative framework, termed PEhub, that resolves multi-way enhancer hubs as higher-order regulatory units from chromatin interaction data. By reparameterizing stochastic pairwise ligation events into promoter-conditioned enhancer networks, our approach explicitly models synergistic enhancer cooperation while accounting for distance-dependent interaction decay through a statistically principled null model. Using H3K27ac HiChIP data, we identify promoter-anchored enhancer hubs and validate their physical existence with single-molecule Pore-C, demonstrating that inferred hubs correspond to bona fide multi-way chromatin assemblies. Application to six human brain regions reveals that enhancer hubs are associated with elevated transcriptional output and exhibit a hierarchical organization spanning shared, circuit-specific, and region-restricted regulatory programs. This architecture hierarchically stratifies genetic risk and transcription factor deployment, linking three-dimensional genome organization to transcriptional control and disease-associated variation. Together, this promoter-centric framework provides a generalizable strategy for resolving higher-order regulatory architecture from 3D genome data and establishes multi-way enhancer hubs as a functionally and genetically meaningful layer of transcriptional regulation in complex tissues.
bioinformatics · 2026-02-10 · v1

Token Alignment for Verifying LLM-Extracted Text
Booeshaghi, A. S.; Streets, A. M.
AI Summary
- The study investigates improving the verification of text extracted by large language models (LLMs) by aligning extracted text with the original source, focusing on discontiguous phrases.
- Using LLM-specific tokenizers and ordered alignment algorithms, the approach improved alignment accuracy by about 50% over word-level tokenization.
- The effectiveness was demonstrated with the introduction of the BOAT and BIO-BOAT datasets, showing ordered alignment as the most practical method for this task.
Abstract
Large language models excel at text extraction, but they sometimes hallucinate. A simple way to avoid hallucinations is to remove any extracted text that does not appear in the original source. This is easy when the extracted text is contiguous (findable with exact string matching), but much harder when it is discontiguous. Techniques for finding discontiguous phrases depend heavily on how the text is split, i.e., how it is tokenized. In this study, we show that splitting text along subword boundaries, with LLM-specific tokenizers, and aligning extracted text with ordered alignment algorithms, improves alignment by about 50% compared to word-level tokenization. To demonstrate this, we introduce the Berkeley Ordered Alignment of Text (BOAT) dataset, a modification of the Stanford Question Answering Dataset (SQuAD) that includes non-contiguous phrases, and BIO-BOAT, a biomedical variant built from 51 bioRxiv preprints. We show that text-alignment methods form a partially ordered set, and that ordered alignment is the most practical choice for verifying LLM-extracted text. We implement this approach in taln, which enumerates ordinal subword alignments.
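A minimal sketch of ordered alignment as a longest-common-subsequence over token lists; for self-containment the tokens here come from a naive character split rather than an LLM subword tokenizer, and the coverage metric is an illustrative stand-in for the paper's evaluation.

```python
def ordered_alignment(source, extracted):
    """Longest common subsequence over token lists; returns the indices in
    `source` that an ordered alignment of `extracted` can be anchored to."""
    n, m = len(source), len(extracted)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if source[i] == extracted[j]:
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Greedy traceback recovers the aligned source positions in order.
    i = j = 0
    matched = []
    while i < n and j < m:
        if source[i] == extracted[j]:
            matched.append(i)
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return matched

source_tokens = list("the cell divides rapidly under stress")
extracted_tokens = list("cell divides under stress")  # discontiguous phrase
idx = ordered_alignment(source_tokens, extracted_tokens)
coverage = len(idx) / len(extracted_tokens)
print(f"aligned {coverage:.0%} of extracted tokens in order")
```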
bioinformatics · 2026-02-10 · v1

bMINTY: Enabling Reproducible Management of High-Throughput Sequencing Analysis Results and their Metadata
Kapelios, K.; Xiropotamos, P.; Manousaki, H.; Sinnis, C.; Kotsira, V.; Dalamagas, T.; GEORGAKILAS, G. K.
AI Summary
- The study addresses the challenge of managing high-throughput sequencing data by introducing bMINTY, a web application for structured management of post-alignment data and metadata.
- bMINTY allows for the integration of study, assay, and analysis metadata into a single, portable, queryable resource, enhancing data reuse and reproducibility.
- Users can export data in RO-Crate format, facilitating machine-readable data packages for publication, thereby promoting FAIR science principles.
Abstract
Due to the large scale of high-throughput sequencing data generation, the community and publishers have established standards for the dissemination of studies that produce and analyze these data. Despite efforts towards Findable, Accessible, Interoperable and Reproducible (FAIR) science, critical obstacles remain. Best practices are not consistently enforced by scientific publishers, and when they are, essential information is fragmented across the methods section, supplementary materials, and public repositories. When attempting to reproduce scientific findings or reuse published data or analyses, researchers often avoid analyzing sequencing data from the ground up. Instead, they prefer to start directly from the post-sequence-alignment information (e.g., gene expression matrices in transcriptomics). However, existing repositories and workflow-oriented solutions rarely provide a single, portable, queryable resource that integrates this information with the metadata required for downstream reuse. We introduce bMINTY, a locally deployed web application with an intuitive user interface, for structured management of post-alignment workflow data outputs. bMINTY supports metadata for studies, assays, and analysis assets, including workflows, genome assemblies, genomic intervals, and cell-level entities for single-cell assays. Users may export query results in RO-Crate format, providing machine readable data packages and metadata. To the best of current knowledge, bMINTY is the first framework to bundle all this information in publication-ready, portable packaging designed for reuse. These packages can be included as supplementary material with each publication, accompanied by analysis code deposited in public repositories for downstream ad hoc analyses. Together, these practices can promote transparency, efficient reuse of published data, and support FAIR-aligned scientific reproducibility.
bioinformatics · 2026-02-10 · v1

SenNet Portal: Build, Optimization and Usage
Borner, K.; Blood, P. D.; Silverstein, J. C.; Ruffalo, M.; Satija, R.; Gehlenborg, N.; Honick, B.; Bueckle, A.; Jain, Y.; Qaurooni, D.; Shirey, B.; Sibilla, M.; Metis, K.; Bisciotti, J.; Morgan, R. S.; Betancur, D.; Sablosky, G. R.; Turner, M. L.; Kim, S.-J.; Lee, P. J.; Bartz, J.; Domanskyi, S.; Peters, S. T.; Enninful, A.; Farzad, N.; Fan, R.; SenNet Team; Herr, B. W.
AI Summary
- The SenNet Program addresses the challenge of studying cellular senescence by generating multimodal datasets across human and mouse tissues.
- The SenNet Data Portal provides open access to these datasets, including single-cell, spatial, imaging, transcriptomic, and proteomic data, along with senescence biomarker catalogs and standardized protocols.
- The portal, built on a scalable hybrid cloud architecture, supports data submission, analysis, and cross-species mapping, with applications in biomarker discovery and spatial analysis.
Abstract
Cellular senescence is a hallmark of aging and a driver of functional decline across tissues, yet its heterogeneity and context dependence have limited systematic study. The Common Fund Cellular Senescence Network (SenNet) Program addresses this challenge by generating multimodal, multi-tissue datasets that profile senescent cells across the human lifespan and complementary mouse models. The SenNet Data Portal (https://data.sennetconsortium.org) serves as the public gateway to these resources, providing open access to harmonized single-cell, spatial, imaging, transcriptomic, and proteomic data; senescence biomarker catalogs; and standardized protocols that can be used to comprehensively identify and characterize senescent cells in mouse and human tissue. As of January 2026, the portal hosts 1,753 publicly available human and mouse datasets across 15 organs using 6 general assay types. Experts from 13 Tissue Mapping Centers (TMCs) and 12 Technology Development and Application (TDAs) components contribute tissue data, analyze data, identify senescent biomarkers, and agree on panels for cross-tissue antibody harmonization. They also register human tissue data into the Human Reference Atlas (HRA) and develop user interfaces for the multiscale and multimodal exploration of this data. Built on a scalable hybrid cloud microservices architecture by the Consortium Organization and Data Coordinating Center (CODCC), the Portal enables data submission, management, integrated analysis, spatial context mapping, and cross-species senescence mapping critical for aging research. This paper presents user needs, the Portal architecture, data processing workflows, and senescence-focused analytical tools. The paper also presents usage scenarios illustrating applications in biomarker discovery, quality benchmarking, hypothesis generation, spatial analysis, cost-efficient profiling, and cell distance distribution analysis. Current limitations and planned extensions, including expanded spatial-omics releases and improved tools for senotype characterization, are discussed. SenNet protocols, code, and user interfaces are freely available on https://docs.sennetconsortium.org/apis.
bioinformatics · 2026-02-10 · v1

Using user-centered design to better understand challenges faced during genetic analyses by novice genomic researchers
Patel, H.; Crosslin, D.; Jarvik, G. P.; Hall, T.; Veenstra, D.; Xie, S.
AI Summary
- This study aimed to understand the challenges novice genomic researchers (NGRs) face with bioinformatics tools by using a user-centered design approach.
- A literature review and semi-structured interviews were conducted to identify issues like poor documentation, installation difficulties, and unclear error messages.
- An evaluation rubric was developed to assess bioinformatics tools, aiming to improve usability for both NGRs and experienced users.
Abstract
The lack of user-centered design principles in the current landscape of commonly used bioinformatics software tools poses challenges for novice genomics researchers (NGRs) entering the genomics ecosystem. Comparing the usability of one analysis software to that of another is a non-trivial task and requires evaluation criteria that incorporate perspectives from both existing literature and a diverse, underrepresented user base of NGRs. To better characterize these barriers, we utilized a two-pronged approach consisting of a literature review of existing bioinformatics tools and semi-structured interviews about the needs of NGRs. From both knowledge sources, the key attributes that resulted in poor adoption and limited sustained use of most bioinformatics tools included poor documentation, lack of readily accessible informational content, challenges with installation and dependency coordination, and inconsistent error messages/progress indicators. Combining the findings from the literature review and the insights gained by interviewing the NGRs, an evaluation rubric was created that can be utilized to grade existing and future bioinformatics tools. This rubric acts as a summary of key components needed for software tools to cater to the diverse needs of both NGRs and experienced users. Due to the rapidly evolving nature of genomics research, it becomes increasingly important to critically evaluate existing tools and develop new ones that will help build a strong foundation for future exploration.
bioinformatics · 2026-02-10 · v1

PRIZM: Combining Low-N Data and Zero-shot Models to Design Enhanced Protein Variants
Harding-Larsen, D.; Lax, B. M.; Garcia, M. E.; Mendonca, C.; Mejia-Otalvaro, F.; Welner, D. H.; Mazurenko, S.
AI Summary
- PRIZM is a two-phase workflow that uses a small, high-quality dataset to select the best pre-trained zero-shot model for predicting protein variant effects.
- It then applies this model to rank and prioritize variants for experimental testing.
- In case studies, PRIZM improved enzyme variants, achieving a 3°C increase in thermostability and a 20% increase in activity.
Abstract
Machine learning has repeatedly shown the ability to accelerate protein engineering, but many approaches demand large amounts of robust, high-quality training data as well as substantial computational expertise. While large pre-trained models can function as zero-shot proxies for predicting variant effects, selecting the best model for a given protein property is often non-trivial. Here, we introduce Protein Ranking using Informed Zero-shot Modelling (PRIZM), a two-phase workflow that first uses a high-quality low-N dataset to identify the most suitable pre-trained zero-shot model for a target protein property and then applies that model to rank and prioritize an in silico variant library for experimental testing. Across diverse benchmark datasets spanning multiple protein properties, PRIZM reliably separated low- from high-performing models using datasets of ~20 labelled variants. We further demonstrate PRIZM in enzyme engineering case studies targeting sucrose synthase thermostability and glycosyltransferase activity, where PRIZM-guided selection identified improved variants, including gains of ~3°C in apparent melting temperature and ~20% higher relative activity. PRIZM provides an accessible, data-efficient route to leverage foundation models for protein design while requiring minimal experimental data.
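A minimal sketch of the selection phase under stated assumptions: candidate zero-shot models are represented by precomputed per-variant scores (hypothetical values below), and Spearman correlation against the ~20 labelled variants picks the model used to rank the design library; PRIZM's actual selection criteria may differ.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
variants = [f"V{i}" for i in range(20)]             # low-N labelled set
measured = {v: rng.normal() for v in variants}       # hypothetical assay values

# Hypothetical zero-shot scores from three pre-trained models.
zero_shot = {
    "model_A": {v: measured[v] + rng.normal(scale=0.5) for v in variants},
    "model_B": {v: measured[v] + rng.normal(scale=2.0) for v in variants},
    "model_C": {v: rng.normal() for v in variants},
}

def rank_models(measured, zero_shot):
    """Rank candidate models by Spearman correlation with the low-N data."""
    y = np.array([measured[v] for v in measured])
    ranking = []
    for name, scores in zero_shot.items():
        x = np.array([scores[v] for v in measured])
        rho, _ = spearmanr(x, y)
        ranking.append((rho, name))
    return sorted(ranking, reverse=True)

best_rho, best_model = rank_models(measured, zero_shot)[0]
print(f"selected {best_model} (Spearman rho = {best_rho:.2f}) to rank the design library")
```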
bioinformatics · 2026-02-10 · v1

An Integrated Pipeline for Cell-Type Annotation, Metabolic Profiling, and Spatial Communication Analysis in the Liver using Spatial Transcriptomics
Zhang, C.; Li, J.; Luo, O.; Andrews, T.; Steinberg, G. R.; WANG, D.
AI Summary
- The study presents a protocol for analyzing spatial transcriptomics (ST) data in liver tissues from MASLD mouse models to understand liver metabolism.
- The approach includes single-cell RNA-seq referencing, manual annotation with curated liver cell type markers, and metabolic gene set analysis.
- Key findings include the provision of tools for researchers to decode metabolic reprogramming and cellular heterogeneity in liver health and disease.
Abstract
The liver acts as a central metabolic hub, integrating systemic signals through a spatially organized pattern known as zonation, driven by the coordinated activity of diverse cell types including hepatocytes, stellate cells, Kupffer cells, endothelial cells, and immune populations. Spatial transcriptomics (ST) enables the profiling of thousands of cells with spatial resolution in a single experiment, facilitating the identification of novel gene markers, cell types, cellular states, and tissue neighborhoods across diverse tissues and organisms. By simultaneously capturing transcriptional and spatial heterogeneity, ST has become a powerful tool for understanding cellular and tissue biology. Given its advantages, there is growing demand for applying ST to uncover novel biological insights in the liver under various physiological and pathological conditions including obesity, diabetes, and metabolic dysfunction-associated steatotic liver disease (MASLD). However, no comprehensive and practical protocols currently exist for analyzing ST data specifically in the context of liver metabolism. Herein, we present a systematic and detailed protocol for ST data analysis using liver tissues from MASLD mouse models. This guide offers practical support for metabolism-focused researchers without advanced expertise in coding, mathematics, or statistics, providing single-cell RNA-seq references for deconvolution-based annotation, curated liver cell type markers for manual annotation, and a GMT file of metabolic gene sets together with flux balance analysis to assess liver metabolic activity. This framework and its integrated computational resources for decoding metabolic reprogramming and cellular heterogeneity will empower researchers to uncover novel biological pathways regulating liver metabolism in health and disease.
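A small sketch, under assumptions, of how a curated metabolic gene set in GMT format might be scored on an annotated liver ST object with Scanpy; the file names (`metabolic_pathways.gmt`, `liver_st_annotated.h5ad`) and the minimum-gene threshold are placeholders, not the protocol's actual resources.

```python
import scanpy as sc

def read_gmt(path):
    """Parse a GMT file into {gene_set_name: [genes]}."""
    gene_sets = {}
    with open(path) as fh:
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            name, _description, genes = fields[0], fields[1], fields[2:]
            gene_sets[name] = [g for g in genes if g]
    return gene_sets

# Placeholder paths; substitute the protocol's curated metabolic GMT
# and an annotated liver ST AnnData object.
gene_sets = read_gmt("metabolic_pathways.gmt")
adata = sc.read_h5ad("liver_st_annotated.h5ad")

for name, genes in gene_sets.items():
    present = [g for g in genes if g in adata.var_names]
    if len(present) >= 10:  # skip sparsely captured pathways
        sc.tl.score_genes(adata, gene_list=present, score_name=f"score_{name}")
```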
bioinformatics · 2026-02-10 · v1

HORDCOIN: A Software Library for Higher Order Connected Information and Entropic Constraints Approximation
Raffaelli, G. T.; Kislinger, J.; Kroupa, T.; Hlinka, J.
AI Summary
- The study introduces HORDCOIN, a software library for approximating higher-order connected information in complex systems like neuronal populations, using an entropic-constraint approach to simplify computational complexity.
- This method transforms the problem into a linear program, allowing efficient estimation even with limited data.
- Applications to symbolic sequences, neuronal recordings, and DNA sequences showed accurate detection of higher-order interactions, demonstrating the library's utility in biomedical data analysis.
Abstract
Background and objective: Quantifying higher-order statistical dependencies in multivariate biomedical data is essential for understanding collective dynamics in complex systems such as neuronal populations. The connected information framework provides a principled decomposition of the total information content into contributions from interactions of increasing order. However, its application has been limited by the computational complexity of conventional maximum entropy formulations. In this work, we present a generalised formulation of connected information based on maximum entropy problems constrained by entropic quantities. Methods: The entropic-constraint approach, contrasting with the original constraints based on marginals or moments, transforms the original nonconvex optimisation into a tractable linear program defined over polymatroid cones. This simplification enables efficient, robust estimation even under undersampling conditions. Results: We present theoretical foundations, algorithmic implementation, and validation through numerical experiments and real-world data. Applications to symbolic sequences, large-scale neuronal recordings, and DNA sequences demonstrate that the proposed method accurately detects higher-order interactions and remains stable even with limited data. Conclusions: The accompanying open-source software library, HORDCOIN (Higher ORDer COnnected INformation), provides user-friendly tools for computing connected information using both marginal- and entropy-based formulations. Overall, this work bridges the gap between abstract information-theoretic measures and practical biomedical data analysis, enabling scalable investigation of higher-order dependencies in neurophysiological and other complex biological systems such as the genome.
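For orientation, the marginal-constrained connected-information decomposition that the library generalizes can be stated compactly (standard maximum-entropy formulation; the notation here is ours, not the paper's):

```latex
% \tilde{p}^{(k)}: maximum-entropy distribution consistent with all marginals
% of order <= k; H[.]: Shannon entropy; p: the full joint distribution over N variables.
\begin{aligned}
I_C^{(k)} &= D_{\mathrm{KL}}\!\left(\tilde{p}^{(k)} \,\Vert\, \tilde{p}^{(k-1)}\right)
           = H\!\left[\tilde{p}^{(k-1)}\right] - H\!\left[\tilde{p}^{(k)}\right],\\[2pt]
H\!\left[\tilde{p}^{(1)}\right] - H[p] &= \sum_{k=2}^{N} I_C^{(k)} .
\end{aligned}
```

The entropic-constraint formulation described in the abstract replaces the marginal constraints defining the maximum-entropy projections with constraints on entropic quantities, which is what turns the problem into a linear program over polymatroid cones.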
bioinformatics · 2026-02-10 · v1

LineageSim: A Single-Cell Lineage Simulator with Fate-Aware Gene Expression
Lai, H.; Sadria, M.
AI Summary
- LineageSim is introduced as a simulator for single-cell lineage data with gene expression that incorporates fate-aware signals, addressing the limitations of existing Markovian models.
- The simulator generates data where progenitor states show early signs of future cell fate, providing a benchmark for cell fate prediction algorithms.
- Validation through logistic regression showed a 68.3% balanced accuracy, confirming the presence of predictive fate information in the simulated data.
Abstract
Single-cell lineage data paired with gene expression are critical for developing computational methods in developmental biology. Since experimental lineage tracing is often technically limited, robust simulations are necessary to provide the ground truth for rigorous validation. However, existing simulators generate largely Markovian gene expression, failing to encode the fate bias observed in real biological systems, where progenitor states exhibit early signatures of future commitment. Consequently, they cannot support the training and evaluation of computational methods that model long-range temporal dependencies. We present LineageSim, a generative framework that introduces fate-aware gene expression, where progenitor states carry latent signals of their descendants' terminal fates. This framework establishes a new class of benchmarks for cell fate prediction algorithms. We validate the presence of these temporal signals by training a logistic regression baseline, which achieves 68.3% balanced accuracy. This confirms that the generated data contain subtle but recoverable fate information, in contrast to existing simulators, where such predictive signals are systematically absent.
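A minimal sketch of the kind of baseline check described, assuming toy simulated data in place of LineageSim output: a logistic regression on progenitor-state expression predicting terminal fate, scored with balanced accuracy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_cells, n_genes = 2000, 50

# Toy stand-in for simulator output: progenitor expression profiles with a
# weak, fate-correlated signal embedded in a handful of genes.
fate = rng.integers(0, 2, size=n_cells)
X = rng.normal(size=(n_cells, n_genes))
X[:, :5] += 0.4 * fate[:, None]  # subtle fate-aware component

X_tr, X_te, y_tr, y_te = train_test_split(X, fate, test_size=0.3, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"balanced accuracy: {balanced_accuracy_score(y_te, clf.predict(X_te)):.3f}")
```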
bioinformatics · 2026-02-10 · v1

Multi-compartment spatiotemporal metabolic modeling of the chicken gut guides dietary intervention design
Utkina, I.; Alizadeh, M.; Sharif, S.; Parkinson, J.
AI Summary
- Researchers developed a multi-compartment metabolic model of the chicken gut to understand how diet influences microbial metabolism.
- The model identified cellulose, starch, and L-threonine as effective dietary supplements for enhancing short-chain fatty acid production.
- Validation through a feeding trial confirmed model predictions, particularly for butyrate, highlighting the importance of microbial community composition in metabolic outcomes.
Abstract
Understanding how diet shapes microbial metabolism along the gastrointestinal tract is essential for improving poultry gut health and reducing reliance on antibiotic growth promoters. Yet dietary interventions often yield inconsistent outcomes because their efficacy depends on baseline conditions, including diet composition and microbiota structure. To address this, we developed the first multi-compartment, spatiotemporally resolved metabolic model of the chicken gastrointestinal tract. Our six-compartment framework integrates avian-specific physiological features including bidirectional flow, feeding-fasting cycles, and compartment-specific environmental parameters. The model captured distinct metabolic specialization along the gut, with upper compartments enriched for biosynthetic pathways and lower compartments specialized for fermentation. Systematic in silico screening of 34 dietary supplements revealed context-dependent metabolic responses and identified cellulose, starch, and L-threonine as robust enhancers of short-chain fatty acid production. A controlled feeding trial validated key predictions, particularly for butyrate, and integrating trial-specific microbial community data substantially improved prediction accuracy for several metabolites. Our findings demonstrate that community composition is a major driver of metabolic outcomes and underscore the need for context-specific modeling. Our framework provides a mechanistic platform for rational dietary intervention design and is broadly adaptable to other animal or human gastrointestinal systems.
bioinformatics · 2026-02-10 · v1

Parsimonious cell co-localization scoring for spatial transcriptomics
Gingerich, I. K.; Frost, H. R.
AI Summary
- The study introduces the Neighborhood Product Co-localization (NPC) score for spatial transcriptomics to quantify cell type co-occurrence in local neighborhoods.
- Using a mouse ovary MERFISH dataset, NPC was shown to localize co-localization hotspots, recapitulate global associations, and identify specific niches like follicle boundaries.
- NPC extends to multivariate analysis, demonstrating coordinated co-localization of endothelial, stroma, and theca cells.
Abstract
Spatial transcriptomics (ST) preserves tissue architecture while profiling gene expression, motivating methods that quantify whether annotated labels (such as cell types) preferentially co-occur in local neighborhoods. We introduce the Neighborhood Product Co-localization (NPC) score, a simple per-cell metric computed on a pruned spatial neighbor graph: for a set of m ≥ 2 labels, NPC is the product of their neighborhood proportions, optionally normalized by expected co-occurrence under independence and paired with permutation-based significance testing. NPC is interpretable (maximized under balanced neighborhoods), efficient to compute, and extends naturally from pairwise to multivariate microenvironment definitions. Using a mouse ovary MERFISH dataset, we show that NPC complements established Squidpy co-occurrence and neighborhood enrichment analyses by localizing co-localization hotspots in tissue space, recapitulating prominent global associations, and highlighting spatially restricted niches such as follicle boundaries; we further demonstrate multivariate NPC scoring by identifying coordinated endothelial-stroma-theca co-localization. Overall, NPC provides a practical framework for interpretable, single-cell resolution co-localization analysis in ST cohorts.
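A minimal sketch of the pairwise NPC computation under assumptions: a k-nearest-neighbor graph stands in for the paper's pruned neighbor graph, and the permutation test shuffles labels globally; toy coordinates and labels are illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

def npc_scores(coords, labels, targets, k=10):
    """Per-cell NPC: product of neighborhood proportions of each target label."""
    tree = cKDTree(coords)
    _, nbrs = tree.query(coords, k=k + 1)  # first neighbor is the cell itself
    nbrs = nbrs[:, 1:]
    props = np.ones(len(coords))
    for t in targets:
        props *= (labels[nbrs] == t).mean(axis=1)
    return props

def permutation_pvalue(coords, labels, targets, k=10, n_perm=200, seed=0):
    """Compare the observed mean NPC against a label-permutation null."""
    rng = np.random.default_rng(seed)
    observed = npc_scores(coords, labels, targets, k).mean()
    null = np.array([
        npc_scores(coords, rng.permutation(labels), targets, k).mean()
        for _ in range(n_perm)
    ])
    return observed, (np.sum(null >= observed) + 1) / (n_perm + 1)

# Toy example with three cell types on random coordinates.
rng = np.random.default_rng(1)
coords = rng.uniform(size=(500, 2))
labels = rng.choice(np.array(["endothelial", "stroma", "theca"]), size=500)
score, p = permutation_pvalue(coords, labels, ["endothelial", "stroma"], k=10)
print(f"mean NPC = {score:.4f}, permutation p = {p:.3f}")
```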
bioinformatics · 2026-02-10 · v1

CoPrimeEEG: CRT-Guided Dual-Branch Reconstruction from Co-Prime Sub-Nyquist EEG
Yu, Y.; Liu, D.; Wu, Y. N.
AI Summary
- CoPrimeEEG integrates co-prime sub-Nyquist sampling with a CRT-guided learning objective to reconstruct high-rate EEG from two low-rate streams.
- The framework uses a dual-branch convolutional encoder, upsampling to reconstruct EEG, predict a temporal mask, and extract bandpower features.
- It achieves superior reconstruction quality on real EEG data with fewer parameters, offering a low-power solution for EEG acquisition.
Abstract
We present CoPrimeEEG, a neural reconstruction framework that unifies co-prime sub-Nyquist sampling theory with a CRT-guided learning objective for EEG. Two low-rate streams obtained by co-prime decimations feed a dual-branch convolutional encoder whose fused representation is upsampled to reconstruct high-rate EEG while jointly predicting a temporal usefulness mask and canonical bandpower features. We derive a principled loss with four terms: (i) waveform fidelity, (ii) mask sparsity and smoothness, (iii) bandpower supervision in the log-domain, and (iv) a CRT-consistency term enforcing agreement between the reconstruction and its co-prime downsampled counterparts. On real EEG data, CoPrimeEEG achieves state-of-the-art reconstruction quality across MSE, MAE, correlation, SNR, and PSNR while using fewer parameters. The approach provides a practical path to low-power EEG acquisition with high-fidelity downstream analysis.
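One plausible way to write the four-term objective described above; the weights, the total-variation smoothness penalty, and the decimation notation are assumptions rather than the paper's exact formulation:

```latex
% x: high-rate EEG, \hat{x}: reconstruction, m: temporal usefulness mask,
% P_b, \hat{P}_b: canonical bandpowers, D_p, D_q: the two co-prime decimations
% with observed low-rate streams y_p, y_q; \lambda_1..\lambda_4: scalar weights.
\mathcal{L} =
    \lambda_1 \lVert \hat{x} - x \rVert_2^2
  + \lambda_2 \bigl( \lVert m \rVert_1 + \mathrm{TV}(m) \bigr)
  + \lambda_3 \sum_b \bigl( \log \hat{P}_b - \log P_b \bigr)^2
  + \lambda_4 \bigl( \lVert D_p(\hat{x}) - y_p \rVert_2^2 + \lVert D_q(\hat{x}) - y_q \rVert_2^2 \bigr)
```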
bioinformatics · 2026-02-10 · v1

A methodological framework for accommodating Cancer Genomics Information in OMOP-CDM using Variation Representation Specification (VRS).
Benetti, E.; Scicolone, G.; Tajwar, M.; Masciullo, C.; Bucci, G.; Riba, M.
AI Summary
- The study proposes a framework to integrate cancer genomics data into the OMOP Common Data Model (OMOP CDM) using the Variation Representation Specification (VRS).
- The approach involves a scalable strategy for storing genomic variants, from simple biomarker data to complex genome sequencing data, using standardized identifiers.
- KOIOS-VRS, a pipeline, was developed to automate the conversion of VCF files into an OMOP-compatible format.
Abstract
The OMOP Common Data Model (OMOP CDM), in which observational health data are organized and stored, is a broadly accepted data standard that supports clinical research by facilitating federated study protocols. In cancer studies, there is a growing need to incorporate cancer genomics data in a standardized way. Starting from a brief overview of the basic features of the OMOP CDM, we envision a path of increasing complexity, from including known biomarker genomic data coming from pathology reports or clinical laboratory findings to storing thousands of known and unknown variants coming from genome sequencing data. Data should be stored using standardized identifiers, including those defined by the Global Alliance for Genomics and Health (GA4GH). We propose a scalable strategy for storing genomic variants in increasingly complex scenarios and present KOIOS-VRS, a pipeline that automates the conversion of VCF files into an OMOP-compatible format.
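A small sketch of the GA4GH digest underlying VRS computed identifiers (sha512t24u: base64url encoding of the first 24 bytes of a SHA-512 digest); the serialized allele string below is a placeholder, since real VRS identifiers are computed over the canonical JSON form of the Allele object rather than an HGVS-like string.

```python
import base64
import hashlib

def sha512t24u(blob: bytes) -> str:
    """GA4GH truncated digest: base64url of the first 24 bytes of SHA-512."""
    return base64.urlsafe_b64encode(hashlib.sha512(blob).digest()[:24]).decode("ascii")

# Placeholder serialization for illustration only; real VRS identifiers are
# derived from the canonical JSON form of the variant object.
allele_blob = b"NC_000007.14:g.140753336A>T"
print("ga4gh:VA." + sha512t24u(allele_blob))
```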
bioinformatics · 2026-02-10 · v1

thematicGO: A Keyword-Based Framework for Interpreting Gene Ontology Enrichment via Biological Themes
Wang, Z.; Sudlow, L. C.; Du, J.; Berezin, M. Y.
AI Summary
- ThematicGO is a framework that organizes Gene Ontology (GO) enriched terms into biological themes using a keyword-based approach to reduce redundancy and enhance interpretability.
- It uses the g:Profiler API for GO enrichment of differentially expressed genes, followed by theme-based score aggregation.
- Compared to traditional GO annotation, thematicGO improves readability and provides a user-friendly GUI for exploring results, making it suitable for RNA-seq studies.
Abstract
Background: Gene Ontology (GO) enrichment analysis is a widely used approach for interpreting high-throughput transcriptomic and genomic data. However, conventional GO over-representation analyses typically yield long, redundant lists of enriched terms that are difficult to apply to biological problems and identify the most relevant biological pathways. Results: We present thematicGO, a customizable framework that organizes enriched GO terms into biological themes using a curated keyword-based matching strategy. In this approach, GO enrichment of differentially expressed genes is performed using the g:Profiler Application Programming Interface (API), followed by the score aggregation within each theme from contributing individual GO terms. Side-by-side interpretation against conventional GO annotation workflows demonstrates that thematicGO captures related biological outcomes but at the same time substantially reduces redundancy and improves readability. To enhance accessibility, we implemented an interactive, web-deployed graphical user interface (GUI) that enables users to upload gene lists and explore thematic enrichment results. Conclusion: thematicGO simplifies functional enrichment analysis by bridging the gap between granular GO term outputs and higher-level biological interpretation using a theme concept, which can be especially useful for RNA-seq studies that identify differentially expressed genes. The new approach complements an orthogonal standard GO enrichment technique with transparent, theme-based aggregation and comparison against classical GO annotation approaches. thematicGO provides an easy, understandable, and reproducible tool for transcriptomic studies, particularly those involving RNA-seq data and complex biological responses.
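A minimal sketch of keyword-to-theme aggregation over an enrichment table; the themes, keywords, and use of summed -log10(p) as the theme score are illustrative assumptions, not thematicGO's curated configuration.

```python
import math

# Hypothetical enriched GO terms (name, adjusted p-value), e.g. as returned
# by the g:Profiler API for a differentially expressed gene list.
enriched = [
    ("inflammatory response", 1e-8),
    ("cytokine-mediated signaling pathway", 3e-6),
    ("collagen fibril organization", 2e-5),
    ("extracellular matrix organization", 4e-7),
    ("response to oxidative stress", 1e-3),
]

themes = {
    "inflammation": ["inflammat", "cytokine", "interleukin"],
    "matrix remodeling": ["collagen", "extracellular matrix"],
    "oxidative stress": ["oxidative", "reactive oxygen"],
}

def theme_scores(enriched, themes):
    """Sum -log10(p) of every enriched term whose name matches a theme keyword."""
    scores = {name: 0.0 for name in themes}
    for term, p in enriched:
        for theme, keywords in themes.items():
            if any(kw in term.lower() for kw in keywords):
                scores[theme] += -math.log10(p)
    return scores

for theme, score in sorted(theme_scores(enriched, themes).items(), key=lambda kv: -kv[1]):
    print(f"{theme:>18}: {score:.1f}")
```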
bioinformatics · 2026-02-10 · v1

Delta Marches: Generative AI based image synthesis to decode disease-driving morphologic transformations.
Nguyen, T. H.; Panwar, V.; Jarmale, V.; Perny, A.; Dusek, C.; Cai, Q.; Kapur, P. H.; Danuser, G.; Rajaram, S.
AI Summary
- Delta-Marches uses generative AI to simulate morphological changes between disease classes, focusing on interpretability.
- Applied to renal carcinoma grading, it identifies key morphological features like tumor-cell nuclear phenotypes and reduced vasculature with increasing grade.
- This approach reduces variability and provides insights into disease mechanisms not captured by standard grading.
Abstract
Deep learning has revealed that tissue morphology contains rich biological information beyond human understanding. However, approaches to convert these spatially distributed signals into precise subcellular insights informing disease mechanism are lacking. We introduce Delta-Marches, an interpretability-first approach that nominates distinguishing morphological features rather than explaining existing models' decisions. Delta-Marches leverages a generative AI framework with latent-space traversals that simulate idealized morphological changes between classes. Comparing each image to its class-shifted counterpart allows downstream feature extractors to infer aspects most affected by the shift, reducing sample-to-sample variability and yielding interpretable morphological transformations at subcellular resolution. Prototyped in renal carcinoma histopathological grading, Delta-Marches generates realistic grade transitions and pinpoints tumor-cell nuclear phenotypes as key properties of tumor grades. It also reveals reduced vasculature associated with increasing grade, a pattern reported in prior studies but absent from standard grading rubrics. These results indicate Delta-Marches' ability to parse complex image phenotypes and catalyze hypothesis generation.
bioinformatics · 2026-02-09 · v4

LFQ Benchmark Dataset - Generation Beta: Assessing Modern Proteomics Instruments and Acquisition Workflows with High-Throughput LC Gradients
Van Puyvelde, B. R.; Devreese, R.; Chiva, C.; Sabido, E.; Pfammatter, S.; Panse, C.; Rijal, J. B.; Keller, C.; Batruch, I.; Pribil, P.; Vincendet, J.-B.; Fontaine, F.; Lefever, L.; Magalhaes, P.; Deforce, D.; Nanni, P.; Ghesquiere, B.; Perez-Riverol, Y.; Martens, L.; Carapito, C.; Bouwmeester, R.; Dhaenens, M.
AI Summary
- This study extends a previous benchmark dataset to evaluate modern LC-MS platforms for high-throughput proteomics, using a hybrid human-yeast-E. coli proteome with short LC gradients (5 and 15 min).
- Data was collected across four platforms with standardized protocols, focusing on sensitivity, reproducibility, and cross-instrument consistency.
- Key findings include insights into how technological advancements and reduced LC gradients impact proteome depth, quantitative precision, and support for algorithm development and standardization in proteomics.
Abstract
Recent advances in liquid chromatography mass spectrometry (LC-MS) have accelerated the adoption of high-throughput workflows that deliver deep proteome coverage using minimal sample amounts. This trend is largely driven by clinical and single-cell proteomics, where sensitivity and reproducibility are essential. Here, we extend our previous benchmark dataset (PXD028735) using next-generation LC-MS platforms optimized for rapid proteome analysis. We generated an extensive DDA/DIA dataset using a human-yeast-E. coli hybrid proteome. The proteome sample was distributed across multiple laboratories together with standardized analytical protocols specifying two short LC gradients (5 and 15 min) and low sample input amounts. This dataset includes data acquired on four different platforms, and features new scanning quadrupole-based implementations, extending coverage across different instruments and acquisition strategies. Our comprehensive evaluation highlights how technological advances and reduced LC gradients may affect proteome depth, quantitative precision, and cross-instrument consistency. The release of this benchmark dataset via ProteomeXchange (PXD070049 and PXD071205) accelerates cross-platform algorithm development, enhances data mining strategies, and supports standardization of short-gradient, high-throughput LC-MS-based proteomics.
bioinformatics · 2026-02-09 · v2

Enumerating the chemical exposome using in-silico transformation analysis: an example using insecticides
Jothiramajayam, M.; Barupal, D. K.
AI Summary
- This study uses an integrated workflow of RXNMmapper, Rxn-INSIGHT, and RDChiral to enumerate transformation products of insecticides in-silico.
- From 181 insecticide structures, 19,392 unique transformation products were generated using over 80,000 reaction templates from PubChem.
- Products were prioritized based on thermodynamic stability, species association, enzyme information, and ADMET properties, enhancing exposomic knowledgebases.
Abstract
The exposome encompasses a vast chemical space that can originate from the consumer industry and environmental sources. Once these chemicals enter cells (human or other organisms), they can also be transformed into products that differ in terms of toxicity and health effects. Recent developments in machine learning methods and chemical data science resources have enabled the in-silico enumeration of transformation products. Here, we report an integrated workflow of these existing resources (RXNMmapper, Rxn-INSIGHT and RDChiral) to enumerate the transformation products of a chemical. We have generated a large library of reaction templates from > 80,000 reactions sourced from the PubChem database. Utility of the reaction screening and transformation enumeration workflow has been demonstrated for insecticide structures (n=181), yielding 19,392 unique transformation products. Use of filters and ranking by thermodynamic stability, species association, enzyme information and ADMET properties can prioritize the products relevant for different contexts. Many of these products have PubChem entries but have not yet been linked with the parent compounds. The presented approach can be helpful in enumerating the relevant chemical space for the exposome using known reaction chemistry, which may ultimately contribute to expanding exposomic knowledgebases.
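A small sketch of the core template-application step using RDKit (RDChiral wraps the same operation with stereochemistry handling); the ester-hydrolysis SMARTS and the methyl salicylate parent are toy examples, not templates or compounds from the PubChem-derived library.

```python
# pip install rdkit
from rdkit import Chem
from rdkit.Chem import AllChem

# Toy template: ester hydrolysis (not one of the >80,000 mined templates).
template = AllChem.ReactionFromSmarts(
    "[C:1](=[O:2])[O:3][C:4]>>[C:1](=[O:2])[O:3].[C:4]O"
)

parent = Chem.MolFromSmiles("COC(=O)c1ccccc1O")  # methyl salicylate as a stand-in parent

products = set()
for product_tuple in template.RunReactants((parent,)):
    for prod in product_tuple:
        try:
            Chem.SanitizeMol(prod)
            products.add(Chem.MolToSmiles(prod))
        except Exception:
            continue  # skip chemically invalid matches

print(sorted(products))  # expected: the free acid and the released alcohol
```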
bioinformatics · 2026-02-09 · v2

LoReMINE: Long Read-based Microbial genome mining pipeline
Agrawal, A. A.; Bader, C. D.; Garcia, R.; Mueller, R.; Kalinina, O. V.
AI Summary
- The study introduces LoReMINE, a pipeline for microbial genome mining that automates the process from long-read sequencing data to predicting and clustering biosynthetic gene clusters (BGCs).
- LoReMINE integrates various tools to provide a scalable, reproducible workflow for natural product discovery, addressing the limitations of existing methods that require manual curation.
Abstract
Microbial natural products represent a chemically diverse repertoire of small molecules with major pharmaceutical potential. Despite the increasing availability of microbial genome sequences, large-scale natural product discovery remains challenging because the existing genome mining approaches lack integrated workflows for rapid dereplication of known compounds and prioritization of novel candidates, forcing researchers to rely on multiple tools that require extensive manual curation and expert intervention at each step. To address these limitations, we introduce LoReMINE (Long Read-based Microbial genome mining pipeline), a fully automated end-to-end pipeline that generates high-quality assemblies, performs taxonomic classification, predicts biosynthetic gene clusters (BGCs) responsible for biosynthesis of natural products, and clusters them into gene cluster families (GCFs) directly from long-read sequencing data. By integrating state-of-the-art tools into a seamless pipeline, LoReMINE enables scalable, reproducible, and comprehensive genome mining across diverse microbial taxa. The pipeline is openly available at https://github.com/kalininalab/LoReMINE and can be installed via Conda (https://anaconda.org/kalininalab/loremine), facilitating broad adoption by the natural product research community.
bioinformatics · 2026-02-09 · v2

Protein Language Models in Directed Evolution
Maguire, R.; Bloznelyte, K.; Adepoju, F.; Armean-Jones, M.; Dewan, S.; Goddard, S. E.; Gupta, A.; Jones, F. P.; Lalli, P.; Schooneveld, A.; Thompson, S.; Ebrahimi, E.; Fozzard, S.; Berman, D.; Rossoni, L.; Addison, W.; Taylor, I.
AI Summary
- The study investigates the use of zero-shot and few-shot protein language models to guide directed evolution for improving protein fitness, specifically PET degradation.
- Using a few-shot simulated annealing approach, the models recommended enzyme variants that achieved a 1.62x improvement in PET degradation over 72 hours, surpassing the literature's top variant by 1.40x.
- In the second round, with 240 training examples and 32 homologous sequences, 39% of the 176 evaluated variants were fitter than the wild-type.
Abstract
The dominant paradigms for integrating machine learning into protein engineering are de novo protein design and guided directed evolution. Guiding directed evolution requires a model of protein fitness, but most models are only evaluated in silico on datasets comprising few mutations. Due to the limited number of mutations in these datasets, it is unclear how well these models can guide directed evolution efforts. We demonstrate in vitro how zero-shot and few-shot protein language models of fitness can be used to guide two rounds of directed evolution with simulated annealing. Our few-shot simulated annealing approach recommended enzyme variants with 1.62x improved PET degradation over a 72 h period, outperforming the top engineered variant from the literature, which was 1.40x fitter than wild-type. In the second round, 240 in vitro examples were used for training, 32 homologous sequences were used for evolutionary context and 176 variants were evaluated for improved PET degradation, achieving a hit rate of 39% of variants fitter than wild-type.
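A minimal sketch of simulated annealing over single-site mutations, with a toy objective standing in for the zero-/few-shot language-model fitness score; the cooling schedule, step count, and candidate-batch size are illustrative assumptions.

```python
import math
import random

AA = "ACDEFGHIKLMNPQRSTVWY"

def fitness(seq: str) -> float:
    """Placeholder for a protein language-model fitness score;
    the paper's PLM scoring is not reproduced here."""
    return sum(1.0 for a, b in zip(seq, "MKTAYIAKQR" * 6) if a == b)  # toy target

def anneal(wild_type: str, n_steps: int = 2000, t_start: float = 2.0,
           t_end: float = 0.05, top_n: int = 8, seed: int = 0):
    rng = random.Random(seed)
    current, f_cur = wild_type, fitness(wild_type)
    candidates = {current: f_cur}
    for step in range(n_steps):
        # Geometric cooling schedule from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (step / max(n_steps - 1, 1))
        pos = rng.randrange(len(current))
        proposal = current[:pos] + rng.choice(AA) + current[pos + 1:]
        f_prop = fitness(proposal)
        if f_prop >= f_cur or rng.random() < math.exp((f_prop - f_cur) / t):
            current, f_cur = proposal, f_prop
            candidates[current] = f_cur
    # Top-scoring variants recommended for in vitro testing.
    return sorted(candidates.items(), key=lambda kv: -kv[1])[:top_n]

for seq, f in anneal("M" * 60):
    print(f"{f:6.1f}  {seq}")
```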
bioinformatics · 2026-02-09 · v2

A high-fat hypertensive diet induces a coordinated perturbation signature across cell types in thoracic perivascular adipose tissue
Terrian, L.; Thompson, J. M.; Bowman, D. E.; Panda, V.; Contreras, G. A.; Rockwell, C. E.; Sather, L.; Fink, G. D.; Lauver, D. A.; Nault, R.; Watts, S. W.; Bhattacharya, S.
AI Summary
- This study used single nucleus RNA-sequencing to examine how a high-fat (HF) hypertensive diet affects gene expression in thoracic aortic perivascular adipose tissue (PVAT) of Dahl SS rats.
- The HF diet led to sex-specific changes in cell-type proportions and gene expression related to extracellular matrix dynamics, vascular integrity, and cell communication pathways.
- Analysis identified potential nuclear receptor targets for reversing these diet-induced changes, with deep learning models predicting a hypertensive disease signature across cell types.
Abstract
Perivascular adipose tissue (PVAT), an intriguing layer of fat surrounding blood vessels, regulates vascular tone and mediates vascular dysfunction through mechanisms that are not well understood. Here we show with single nucleus RNA-sequencing of thoracic aortic PVAT from Dahl SS rats that a high-fat (HF) hypertensive diet induces coordinated changes in gene expression across the diverse cell types within PVAT. HF diet produced sex-specific alterations in cell-type proportions and genes related to remodeling of extracellular matrix dynamics and vascular integrity and stiffness, as well as changes in cell-cell communication pathways involved in angiogenesis, vascular remodeling, and mechanotransduction. Gene regulatory network analysis with virtual transcription factor knockout in adipocytes identified specific nuclear receptors that could be targeted for suppression or potential reversal of HF diet-induced changes. Interestingly, generative deep learning models were able to predict cross-cell-type perturbations in gene expression, indicating a hypertensive disease signature that characterizes HF-diet-induced perturbations in PVAT.
bioinformatics2026-02-09v2Unveiling the Terra Cognita of Sequence Spaces using Cartesian Projection of Asymmetric Distances
Ramette, A.AI Summary
- CAPASYDIS is introduced as a method to visualize relationships in large biological sequence datasets by projecting sequences into a fixed, low-dimensional "seqverse" using asymmetric distances.
- Applied to rRNA sequences across Bacteria, Archaea, and Eukaryota, CAPASYDIS showed these domains occupy distinct spatial regions with unique variation patterns.
- The method allows instant mapping of new sequences and retains taxonomic information from broad to fine scales, providing a scalable framework for sequence analysis.
Abstract
Visualizing relationships within massive biological datasets remains a significant challenge, particularly as sequence length and volume increase. We introduce CAPASYDIS (Cartesian Projections of Asymmetric Distances), a scalable approach designed to map the explored regions of a given sequence space. Unlike traditional dimensionality reduction methods, CAPASYDIS calculates asymmetric distances that account for both the position and type of sequence variations. It projects sequences into a fixed, low-dimensional coordinate system, termed a "seqverse", where each sequence occupies a permanent location. This design allows for the instant mapping of new sequences without the need to recalculate the global structure, transforming sequence analysis from a relative comparison into navigation on a standardized map. We applied this method to a large rRNA sequence dataset spanning the three domains of life. Our results demonstrate that the sequences of Bacteria, Archaea, and Eukaryota occupy spatially distinct regions characterized by fundamentally different shapes and patterns of variation. Furthermore, the resulting seqverses retain a high amount of taxonomic information when analyzed from broad domain levels down to single-base differences. Overall, CAPASYDIS provides a reproducible, scalable framework for defining the boundaries and topography of biological sequence universes.
bioinformatics2026-02-09v2Target-site Dynamics and Alternative Polyadenylation Explain a Large Share of Apparent MicroRNA Differential Expression
Cihan, M.; More, P.; Sprang, M.; Marini, F.; Andrade, M.AI Summary
- The study introduces MIRNAPEX, a machine learning framework that integrates target-gene expression and 3'UTR isoform usage to assess miRNA regulatory activity from RNA-seq data.
- Using pan-cancer datasets, MIRNAPEX showed that alternative polyadenylation (APA) significantly enhances prediction of miRNA differential expression beyond gene expression alone.
- Findings indicate that changes in miRNA abundance can result from APA-driven alterations in target-site availability, rather than changes in miRNA transcription, highlighting the importance of considering APA in miRNA expression analysis.
Abstract
MicroRNA (miRNA) abundance reflects a dynamic balance between biogenesis, target engagement and decay, yet differential expression (DE) analyses typically ignore changes in target-site availability driven by alternative polyadenylation (APA). We introduce MIRNAPEX, an interpretable expression-stratification-based machine learning framework that quantifies the effect size of miRNA regulatory activity from RNA-seq by integrating target-gene expression with 3'UTR isoform usage to infer binding-site dosage. Using pan-cancer training sets, we fit regularized linear models to learn robust relationships between transcriptomic features and miRNA log-fold changes, with APA patterns adding clear predictive power beyond expression alone. When applied to knockdowns of core APA regulators, MIRNAPEX captured widespread 3'UTR shortening and correctly anticipated distinct, miRNA-specific shifts whose direction and magnitude mirrored the APA-driven change in site availability. Analysis of target-directed miRNA degradation interactions further showed that loss of distal decay-trigger sites coincides with higher miRNA abundance, consistent with a reduced degradation rate. Together these findings reveal that apparent DE of miRNAs can arise from dynamic changes in target-site landscapes rather than altered miRNA transcription, and that ignoring this aspect in conventional analysis workflows can lead to misestimation of the true effect size of gene-expression regulation.
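The following schematic shows the kind of regularized linear fit described above: a ridge regression relating per-miRNA transcriptomic features to an observed miRNA log-fold change. The feature names, effect sizes, and data are invented for illustration and are not MIRNAPEX's actual feature set.

```python
# Schematic only: ridge regression linking transcriptomic features to a
# miRNA log-fold change, in the spirit of the framework described above.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_mirnas, n_features = 500, 3
# Hypothetical features per miRNA: mean target log-fold change,
# mean change in 3'UTR (APA) site dosage, number of conserved targets.
X = rng.normal(size=(n_mirnas, n_features))
true_w = np.array([-0.8, 0.5, 0.1])  # made-up effect sizes
y = X @ true_w + rng.normal(scale=0.3, size=n_mirnas)  # simulated miRNA logFC

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = Ridge(alpha=1.0).fit(X_tr, y_tr)
print("held-out R^2:", round(model.score(X_te, y_te), 3))
print("coefficients:", model.coef_)
```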
bioinformatics2026-02-09v2COMPASS: A Web-Based COMPosite Activity Scoring System to Navigate Health and Disease Through Deterministic Digital Biomarkers
Sinha, S.; Ghosh, P.AI Summary
- COMPASS is a web-based system that quantifies pathway activation by extracting gene-specific activation thresholds from expression data, standardizing deviations, and aggregating into composite activity scores.
- It allows users to upload expression matrices, define gene sets, and generate activity plots and ROC-AUC statistics for comparisons.
- Across various datasets, COMPASS provides stable, interpretable digital biomarkers that assess model system relevance, differentiation, immune states, and therapeutic responses.
Abstract
Quantifying pathway activation in absolute, reproducible terms is central to systems biology and precision medicine. COMPASS (COMPosite Activity Scoring System) provides a deterministic, ontology-free framework that extracts gene-specific activation thresholds from expression data, standardizes deviation from these boundaries, and aggregates direction-encoded genes into per-sample composite activity scores. Implemented as an intuitive web application, COMPASS enables users without programming experience to upload expression matrices, define custom gene sets, and instantly generate activity plots and ROC-AUC statistics for biological or clinical comparisons. Across diverse datasets, COMPASS yields stable, interpretable, and transferable digital biomarkers that benchmark the "humanness" of model systems, quantify differentiation or immune states, and track therapeutic response trajectories. By directly linking expression, threshold, deviation, and directionality, COMPASS replaces permutation-based enrichment with closed-form logic, delivering a transparent, mechanistic, and reproducible quantification system for pathway activity.
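A minimal numeric sketch of a threshold-then-standardize-then-aggregate score is shown below. The per-gene threshold rule (median across samples) and equal-weight averaging are assumptions for illustration only, not the published COMPASS definition.

```python
# Illustrative composite activity score: per-gene threshold, standardized
# deviation, sign flip for repressed genes, then per-sample aggregation.
# The threshold rule (median across samples) is an assumption.
import numpy as np


def composite_scores(expr: np.ndarray, directions: np.ndarray) -> np.ndarray:
    """expr: genes x samples matrix; directions: +1 (activating) / -1 (repressed)."""
    thresholds = np.median(expr, axis=1, keepdims=True)  # per-gene activation boundary
    scale = expr.std(axis=1, keepdims=True) + 1e-9
    deviation = (expr - thresholds) / scale               # standardized deviation
    signed = deviation * directions[:, None]              # encode gene direction
    return signed.mean(axis=0)                            # one composite score per sample


expr = np.random.default_rng(1).normal(size=(10, 4))       # 10 genes, 4 samples
directions = np.array([1, 1, 1, -1, 1, -1, 1, 1, -1, 1])
print(composite_scores(expr, directions))
```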
bioinformatics2026-02-09v2A ML-framework for the discovery of next-generation IBD targets using a harmonized single-cell atlas of patient tissue
Joglekar, A.; Joseph, A.; Honsa, P.; Ruppova, K.; Pizzarella, V.; Honan, A.; Mediratta, D.; Vollmer, E.; Geller, E.; Valny, M.; Macuchova, E.; Zheng, S.; Greenberg, A.; Taus, P.; Kline-Schoder, A.; Konickova, R.; Cerna, L.; Sharim, H.; Ness, L.; Camilli, G.; Chouri, E.; Kaymak, I.; D'Rozario, J.; Castiblanco, D.; Oliveira, J.; Prandi, F.; Popov, N.; Moldoveanu, A. L.; Oliphant, C.; Escudero-Ibarz, L.; Uhlitz, F.; Freinkman, E.; Sponarova, J.; Vijay, P.; Joyce, C.; Leonardi, I.; Nayar, S.; Platt, A.; Ort, T.; De Baets, G.; Corridoni, D.; Wroblewska, A.; Rahman, A.AI Summary
- This study developed a machine learning framework (IPR) using a harmonized single-cell atlas of the human intestine to discover novel therapeutic targets for IBD.
- The framework identified 85 disease-associated transcriptional programs and prioritized 400 cell type-specific gene targets.
- Validation confirmed that targeting PTGIR in myeloid cells and IL6ST in fibroblasts reduced inflammatory and fibrotic pathways, suggesting new therapeutic avenues distinct from current treatments.
Abstract
Target discovery for IBD has traditionally relied on genetic associations, which lack the cellular resolution needed to identify novel, actionable, cell type-specific disease pathways. Here, we describe an integrated analytical and experimental framework that leverages harmonized single-cell data to systematically discover novel therapeutic strategies for IBD. We used AMICA DBTM, Immunai's harmonized database of single-cell RNA datasets, to construct a harmonized 1-million-cell single-cell atlas of the human intestine. We applied a machine learning framework (Immune Patient Representation, IPR) to identify disease-associated transcriptional programs and cell type-specific gene targets. Candidate targets were prioritized using atlas-derived metrics, refined using custom criteria emphasizing translational actionability, and validated across independent clinical cohorts. Select candidates were evaluated in human primary-cell models reflecting the target's cell-type context. The IPR framework identified 85 disease-associated transcriptional programs and ranked 400 cell type-specific target genes across immune and stromal lineages. Disease-associated programs were interpreted using an AI-assisted framework for structured biological reasoning, linking them to IBD-relevant pathways and guiding the identification of novel, promising gene targets. Functional validation of two cell-type-specific candidates, PTGIR in myeloid cells and IL6ST in fibroblasts, confirmed the reduction of inflammatory and fibrotic pathways linked to IBD pathology. Multi-omic profiling and projection of in vitro phenotypes to patient datasets demonstrated the reversal of disease-associated programs via mechanisms distinct from those of existing biologics. Our single-cell anchored, machine-learning framework integrates in silico discovery with experimental validation, revealing new cell type-specific therapeutic opportunities and supporting a scalable approach for precision target discovery in IBD and other immune-mediated diseases.
bioinformatics2026-02-09v1DEPower: approximate power analysis with DESeq2
Gorin, G.; Guruge, D.; Goodman, L.AI Summary
- The study addresses the need for power analysis in RNA-seq experiments by developing DEPower, a tool based on the DESeq2 framework.
- DEPower calculates the minimum sample size required for detecting effects in both single-cell and bulk RNA-seq experiments.
- It is accessible as a web-based tool at https://poweranalysis-fb.streamlit.app/, facilitating rigorous experimental design for researchers.
Abstract
Rigorous experimental design, including formal power analysis, is a cornerstone of reproducible RNA sequencing (RNA-seq) research. The design of RNA-seq experiments requires computing the minimum sample number required to identify an effect of a particular size at a predefined significance level. Ideally, the statistical test used for the analysis of experimental data should match the test used for sample size determination; however, few tools use the assumptions of the popular differential expression testing framework DESeq2, and most opt for simulation-based rather than analytical approaches. Grounded in the DESeq2 model framework, we derive sample size requirements for both single-cell and bulk RNA-seq experiments delivered as a web-based tool for power analysis, DEPower, available at https://poweranalysis-fb.streamlit.app/, that makes rigorous RNA-seq study design accessible to all researchers.
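For orientation, the sketch below uses one common normal-approximation for per-group sample size under a negative-binomial model (delta method on the log mean, two-sided Wald-style test). This is a generic textbook-style approximation, not necessarily DEPower's exact derivation; the chosen mean count, dispersion, and target fold change are illustrative inputs.

```python
# A common normal-approximation for RNA-seq sample size under a
# negative-binomial model (delta method on the log mean); this is a generic
# approximation, not necessarily DEPower's exact derivation.
import math
from scipy.stats import norm


def samples_per_group(mean_count: float, dispersion: float,
                      log2_fc: float, alpha: float = 0.05,
                      power: float = 0.8) -> int:
    effect = log2_fc * math.log(2)                 # work on the natural-log scale
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    # Approx. variance of a per-sample log mean under NB: 1/mu + dispersion.
    per_sample_var = 2 * (1 / mean_count + dispersion)  # two groups, equal n
    return math.ceil(z**2 * per_sample_var / effect**2)


# Example: mean count 20, dispersion 0.1, targeting a 2-fold change.
print(samples_per_group(mean_count=20, dispersion=0.1, log2_fc=1.0))
```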
bioinformatics2026-02-09v1A shape-constrained regression and wild bootstrap framework for reproducible drug synergy testing
Asiaee, A.; Long, J. P.; Pal, S.; Pua, H. H.; Coombes, K. R.AI Summary
- The study introduces a nonparametric framework for drug synergy testing using shape-constrained regression and wild bootstrap to address limitations of traditional heuristic scores.
- The method defines interaction as deviation from a monotone-additive model, using isotonic regression to fit surfaces and a wild bootstrap for statistical inference.
- It showed higher replicate concordance (median correlation 0.91) and lower failure rates compared to existing methods, also predicting missing data with a median RMSE of 0.040.
Abstract
High-throughput drug combination screens motivate computational methods to identify synergistic pairs, yet synergy is typically quantified by heuristic scores (Bliss, HSA, Loewe, ZIP) that provide no statistical inference and can be unstable or undefined when parametric dose-response fits fail. We present a nonparametric, assumption-light framework that defines interaction as the deviation from a monotone-additive null within a shared monotone model class. We fit a monotone surface by two-dimensional isotonic regression and a monotone-additive surface, compute an interaction surface, and summarize global interaction by a stable "interaction energy" statistic. A degrees-of-freedom-corrected wild bootstrap yields calibrated p-values for testing interaction in each dose-response matrix, enabling principled hit calling and multiple-testing control. On DrugCombDB, our method yields higher replicate concordance of interaction surfaces (median correlation 0.91 across 1,839 replicate pairs) than Bliss, HSA, Loewe, or ZIP (0.53-0.74), while avoiding the 20.9% Loewe and 3.6% ZIP failure rates. Because the fitted surface is generative, the method also predicts missing wells (median holdout RMSE 0.040 in viability units). By turning synergy scoring into statistically grounded outcomes (effect sizes with uncertainty), the framework provides more reliable targets for downstream machine learning models of combination response.
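The simplified sketch below illustrates the wild-bootstrap idea on a dose-response viability matrix: a monotone-additive null built here from 1D isotonic fits of the row and column marginals, an "interaction energy" statistic, and Rademacher-weighted resampling. The published method fits a full 2D isotonic surface and applies a degrees-of-freedom correction, which this sketch does not.

```python
# Simplified sketch: monotone-additive null from 1D isotonic marginal fits,
# an "interaction energy" statistic, and a Rademacher wild-bootstrap p-value.
import numpy as np
from sklearn.isotonic import IsotonicRegression


def additive_null(mat: np.ndarray) -> np.ndarray:
    iso = IsotonicRegression(increasing=False)  # viability decreases with dose
    rows = iso.fit_transform(np.arange(mat.shape[0]), mat.mean(axis=1))
    cols = iso.fit_transform(np.arange(mat.shape[1]), mat.mean(axis=0))
    return rows[:, None] + cols[None, :] - mat.mean()


def interaction_energy(mat: np.ndarray) -> float:
    return float(np.mean((mat - additive_null(mat)) ** 2))


def wild_bootstrap_pvalue(mat: np.ndarray, n_boot: int = 999, seed: int = 0) -> float:
    rng = np.random.default_rng(seed)
    null = additive_null(mat)
    resid = mat - null
    observed = interaction_energy(mat)
    hits = 0
    for _ in range(n_boot):
        signs = rng.choice([-1.0, 1.0], size=mat.shape)  # Rademacher weights
        hits += interaction_energy(null + signs * resid) >= observed
    return (1 + hits) / (1 + n_boot)
```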
bioinformatics2026-02-09v1Batch Effect Correction in a Functional Colorectal Cancer Organoid Clinical Correlation Study
Oliver, G. R.; de Jesus Domingues, A.; Barnett, C. C.AI Summary
- The study focused on detecting, characterizing, and correcting batch effects in a retrospective clinical colorectal cancer organoid drug-response study.
- Methods included exploratory diagnostics, experimental drift detection, and statistical adjustments to remove technical artifacts while preserving biological signals.
- The findings highlight the necessity of addressing batch effects to ensure data reproducibility and accurate interpretation in organoid research.
Abstract
Batch effects are recognized as major sources of technical confounding in high-throughput assays. However, their impact on organoid studies receives little attention in the literature. As organoids gain prominence as a class of emerging new approach methodologies (NAMs), consideration of batch variation will become increasingly important to ensure data reproducibility and accurate interpretation in pre-clinical and clinical studies. In this manuscript, we provide a practical description of our work in detecting, characterizing, and correcting batch effects in a prior published retrospective clinical colorectal cancer organoid drug-response study. We outline the workflow we employed, including exploratory diagnostics, experimental drift detection, and statistical adjustment. We detail the methods employed to evaluate batch effects, monitor longitudinal drift, and select approaches to remove technical artifacts, preserve biological signal and test for robustness. Our experience demonstrates that in even modestly sized studies, results can be adversely affected by insufficient consideration and attempts at ameliorating batch effects. By documenting the challenges we encountered and the solutions implemented within our study, we hope that we can provide a seminal practical reference for organoid researchers and enable increased discussion and adoption of robust batch-compensation practices in the organoid field, ensuring that the topic is more routinely addressed, improved, and eventually standardized.
bioinformatics2026-02-09v1Predicting Obstetric and Non-obstetric Diagnoses Co-occurrences during Pregnancy
Singh, A.; Infante, S.; Kim, S.; Kabir, A.AI Summary
- This study models the co-occurrence of obstetric and non-obstetric diagnoses during pregnancy using a network-based approach, treating it as a link prediction problem on a diagnosis-level graph.
- Various graph neural network (GNN) architectures were tested, with GraphSAGE and hybrid models (GCN+GraphSAGE, GAT+GraphSAGE) showing the best performance.
- The GCN+GraphSAGE hybrid model achieved an AUROC and AUPRC of approximately 0.90, revealing clinically plausible associations between pregnancy stages and related diagnoses.
Abstract
Pregnancy care often involves simultaneous obstetric and other medical conditions, but their co-occurrence patterns are rarely modeled explicitly in a systematic, network-based approach. In this work, we formulate obstetric and non-obstetric diagnosis co-occurrence as a link prediction problem on a diagnosis-level homogeneous graph constructed from pregnancy encounters. Diagnoses are represented as nodes connected by co-occurrence edges, with node features capturing graph structure and demographic statistics. We address this challenge by leveraging collected electronic health record data and studying several standalone and hybrid graph neural network (GNN) architectures, including GCN, GAT, GraphSAGE, and three hybrid encoders that combine complementary aggregation mechanisms, namely GCN+GraphSAGE, GCN+GAT, and GAT+GraphSAGE. All models used consistent train-validation-test splits and were evaluated with 5-fold cross-validation. Among standalone models, GraphSAGE achieved the strongest performance, while the hybrid GraphSAGE-based models (GCN+GraphSAGE and GAT+GraphSAGE) were the best performers overall. The GCN+GraphSAGE hybrid, reaching an AUROC and AUPRC of approximately 0.90, consistently outperformed all other architectures. Further analysis of top-ranked predicted links revealed clinically plausible associations between pregnancy stage and risk-related diagnoses and common endocrine, metabolic, and hematological conditions. These findings indicate that graph-based link prediction may effectively prioritize obstetric and non-obstetric diagnosis pairs, providing a scalable framework for identifying clinically meaningful comorbidity patterns. They may further support hypothesis generation and downstream obstetric risk stratification efforts. Availability: All code, including data preparation scripts, training and validation recipes, and experimental configurations, is available at: https://github.com/kabir-ai2bio-lab/ob-nonob-diagnoses-cooccurrences.
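The sketch below shows, in plain NumPy, the GraphSAGE-style idea underlying such link prediction: one mean-aggregation layer producing node embeddings on a small co-occurrence graph, with a dot product between embeddings as the link score. Shapes, weights, and the random toy graph are illustrative only and do not reproduce the authors' trained models.

```python
# Schematic of GraphSAGE-style link scoring on a diagnosis co-occurrence graph:
# one mean-aggregation layer, then a dot product between node embeddings.
import numpy as np


def sage_layer(features: np.ndarray, adj: np.ndarray,
               w_self: np.ndarray, w_neigh: np.ndarray) -> np.ndarray:
    """features: nodes x d; adj: binary nodes x nodes adjacency matrix."""
    deg = adj.sum(axis=1, keepdims=True).clip(min=1)
    neigh_mean = adj @ features / deg                 # mean of neighbour features
    h = features @ w_self + neigh_mean @ w_neigh      # combine self and neighbourhood
    h = np.maximum(h, 0)                              # ReLU
    return h / (np.linalg.norm(h, axis=1, keepdims=True) + 1e-9)


def link_score(h: np.ndarray, i: int, j: int) -> float:
    return float(h[i] @ h[j])                         # higher = more likely co-occurrence


rng = np.random.default_rng(0)
n, d = 6, 8
adj = (rng.random((n, n)) > 0.6).astype(float)
np.fill_diagonal(adj, 0)
adj = np.maximum(adj, adj.T)                          # undirected co-occurrence graph
x = rng.normal(size=(n, d))
h = sage_layer(x, adj, rng.normal(size=(d, d)), rng.normal(size=(d, d)))
print(link_score(h, 0, 1))
```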
bioinformatics2026-02-09v1TM-Vec 2: Accelerated Protein Homology Detection for Structural Similarity
Keluskar, A.; Batra, P.; Bezshapkin, V.; Morton, J. T.; Zhu, Q.AI Summary
- The study addresses the challenge of structural homology detection in protein sequences by introducing TM-Vec 2 and TM-Vec 2s, which optimize the computationally intensive embedding step.
- These models were benchmarked using CATH and SCOPe domains, showing TM-Vec 2s provides speedups of up to 258x over TM-Vec and 56x over Foldseek, with improved accuracy.
Abstract
Understanding protein function is an essential aspect of many biological applications. The exponential growth of protein sequence databases has created a critical bottleneck for structural homology detection. While billions of protein sequences have been identified from sequencing data, the number of protein folds underlying biology is surprisingly limited, likely numbering tens of thousands. The "sequence-fold gap" limits the success of functional annotation methods that rely on sequence homology, especially for newly sequenced, divergent microbial genomes. TM-Vec is a deep learning architecture that can predict TM scores as a metric of structural similarity directly from sequence pairs, bypassing true structural alignment. However, the computational demands of its protein language model (PLM) embeddings create a significant bottleneck for large-scale database searches. In this work, we present two innovations: TM-Vec 2, a new architecture that optimizes the computationally heavy sequence embedding step, and TM-Vec 2s, a highly efficient model created by distilling the knowledge of the TM-Vec 2 model. Our new models were benchmarked for both accuracy and speed using the CATH and SCOPe domains for large-scale database queries. We compare them to state-of-the-art models and observe that TM-Vec 2s achieves speedups of up to 258x over the original TM-Vec and 56x over Foldseek for large-scale database queries, while achieving higher accuracy compared to the original TM-Vec model.
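The generic knowledge-distillation step behind a "2s"-style compressed model can be sketched as below: a small student network is trained to reproduce a larger teacher's predicted scores. The model classes, dimensions, and random features here are placeholders, not TM-Vec 2 internals.

```python
# Generic knowledge-distillation step for a score regressor: the small
# "student" learns to reproduce the large "teacher" model's predictions.
import torch
import torch.nn as nn

teacher = nn.Sequential(nn.Linear(128, 512), nn.ReLU(), nn.Linear(512, 1))
student = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

pair_embeddings = torch.randn(256, 128)       # stand-in for sequence-pair features
with torch.no_grad():
    soft_targets = teacher(pair_embeddings)   # teacher's predicted similarity scores

for _ in range(100):                          # distillation loop
    opt.zero_grad()
    loss = loss_fn(student(pair_embeddings), soft_targets)
    loss.backward()
    opt.step()
print("final distillation loss:", float(loss))
```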
bioinformatics2026-02-09v1MetaKnogic-Alpha: A Hyper-Relational Knowledge Base for Grounded Metabolic Reasoning
Dang, P.; Swaminathan, P.; Guo, T.; Wan, C.; Cao, S.; Zhang, C.AI Summary
- MetaKnogic-Alpha addresses the synthesis gap in metabolic research by transforming over 100K full-text articles into a hyper-relational hypergraph structure for grounded metabolic reasoning.
- It uses a hierarchical discovery protocol with an autonomous reasoning agent to enhance query precision and explore metabolic pathways, ensuring biological accuracy by grounding insights against a metabolic reaction network.
- Benchmarking showed a mechanistic accuracy of 0.98, significantly reducing errors and providing traceability to original literature, thus aiding in rapid discovery of metabolic interactions for precision oncology.
Abstract
The exponential trajectory of biomedical literature has precipitated a fundamental "synthesis gap" in metabolic research, where critical mechanistic insights remain fragmented across hundreds of thousands of disjointed full-text articles, preventing the consolidation of a global mechanistic view. Here, we present MetaKnogic-Alpha, a foundational mechanistic knowledge substrate designed to bridge this gap by transforming unstructured literature into a navigable, logic-based resource. MetaKnogic-Alpha synthesizes over 100K full-text articles into a hyper-relational hypergraph structure, preserving the n-ary relational logic inherent in complex metabolic pathways. To ensure biological rigor, we implemented a hierarchical discovery protocol: an autonomous reasoning agent first enriches query nomenclature for domain-specific precision, followed by a multi-hop topological expansion within the hypergraph to surface functional neighbors, such as enzymatic co-factors and distal regulators, often lost in traditional search paradigms. Crucially, the system subjects all literature-derived insights to a deterministic biochemical grounding against a curated metabolic reaction network, significantly mitigating the risk of probabilistic hallucinations common in standalone generative models. In rigorous benchmarking, MetaKnogic-Alpha achieved a mechanistic accuracy of 0.98 in scenarios where supporting evidence was present, providing a robustly attributable audit trail back to the primary literature via PubMed Central Identifiers. We designate this primary release as "alpha" to establish the foundational architectural logic for a burgeoning million-scale resource. By compressing the synthesis of thousands of papers from a multi-month manual effort into several hours of automated discovery, MetaKnogic-Alpha serves as a high-fidelity research companion that augments the human expert's ability to resolve complex metabolic interactions and identify novel therapeutic drivers in precision oncology.
bioinformatics2026-02-09v1Order of Message and Address Domain Engagement Determines Productive β-Endorphin Binding to the μ-Opioid Receptor
Ciemny, M. P.; Kmiecik, S.AI Summary
- The study investigated how the order of message and address domain engagement affects β-endorphin binding to the μ-opioid receptor (μOR) using 1,000 CABS-dock simulations.
- Message-first binding was more common but less successful (5.0%), while address-first binding, though less frequent, was 3.8 times more likely to achieve native-like binding (18.8%, p < 0.001).
- Results suggest that early engagement of the address domain enhances productive binding to μOR.
Abstract
Understanding β-endorphin binding to the μ-opioid receptor (μOR) is crucial for designing safer analgesics. The peptide comprises a message domain mediating activation and an address domain conferring selectivity. Using 1,000 independent CABS-dock simulations, without prior binding-site knowledge, we analysed binding trajectories to compare alternative binding pathways. Message-first binding is most frequently sampled but rarely reaches native-like structures (5.0%). In contrast, address-first binding occurs less often yet shows a 3.8-fold higher success rate (18.8%, p < 0.001). These results refine the message-address model and suggest that early address-domain engagement promotes productive μOR binding.
bioinformatics2026-02-09v1Protenix-v1: Toward High-Accuracy Open-Source Biomolecular Structure Prediction
Xiao, W.; Zhang, Y.; Gong, C.; Zhang, H.; Ma, W.; Liu, Z.; Chen, X.; Guan, J.; Wang, L.AI Summary
- Protenix-v1 (PX-v1) is introduced as an open-source biomolecular structure prediction model that outperforms AlphaFold3 with the same constraints, enhancing prediction quality with increased sampling.
- It includes features like protein template integration and RNA MSA support, with a variant, Protenix-v1-20250630, trained on a larger dataset for better accuracy.
- The study also addresses benchmarking limitations by providing updated tools and year-stratified benchmarks for more reliable assessments.
Abstract
We introduce Protenix-v1 (PX-v1), the first open-source structure prediction model to attain superior performance to AlphaFold3 while strictly adhering to the same training data cutoff, model size, and inference budget. Beyond standard evaluations, we highlight the effectiveness of inference-time scaling behavior, demonstrating that increasing the sampling budget yields consistent improvements in prediction quality, a behavior previously seen in AlphaFold3 but not in other open-source models. In addition to improved accuracy, Protenix-v1 incorporates key capabilities including protein template integration and RNA MSA support. Furthermore, to better support real-world applications such as drug discovery, we additionally release Protenix-v1-20250630, a variant trained on a larger dataset (cutoff: June 30, 2025), delivering further improved prediction accuracy. Finally, we identify limitations of current benchmarking tools and provide updated evaluation tools and year-stratified benchmarks to facilitate more reliable and transparent assessment within the community. Collectively, these contributions provide a robust foundation for the Protenix series and the broader field.
bioinformatics2026-02-09v1Exercise-conditioned tear fluid suppresses myopia progression
Yao, H.; Liang, M.; Fei, Q.; Cao, J.; Liang, T.; Zhou, X.; Zhang, S.; Cui, Q.AI Summary
- Researchers hypothesized that tear fluid post-exercise (TA) could protect against myopia, unlike pre-exercise tear fluid (TB), based on reverse transcriptomic analysis.
- In a guinea pig model of form deprivation myopia, TA was administered via periocular injection, significantly reducing myopic progression by limiting refractive shifts, vitreous chamber depth, and axial length elongation.
- TB did not show any protective effects, highlighting a potential new therapeutic approach involving exercise-induced changes in tear fluid for myopia management.
Abstract
Myopia represents a major global public health challenge with rapidly rising prevalence. It is thus important to explore novel therapeutics for the treatment of myopia. Here, using reverse transcriptomic analysis, we predicted that tear fluid collected after (TA), but not before (TB), moderate-intensity aerobic exercise might protect against myopia. To experimentally validate this hypothesis, TA or TB was administered by periocular injection in a guinea pig model of form deprivation myopia. As a result, TA treatment significantly attenuated myopic refractive shifts and suppressed vitreous chamber depth and axial length elongation, whereas TB showed no protective effects. This study proposes a novel therapeutic avenue for myopia intervention and also suggests a previously unrecognized tear fluid-mediated mechanism linking exercise to myopia. URL: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE315463 Token: snspgwygltifrqx
bioinformatics2026-02-09v1cspray: Distributed Single Cell Transcriptome Analysis
Hawkins, P. G.; Swanson, E. M.; Feichtel, M.AI Summary
- The study introduces cspray, a distributed method for processing large-scale single cell RNA data, addressing computational throughput and scalability issues.
- cspray handles data ingestion, preprocessing, gene annotation, PCA, and clustering without needing per-file compute sizing.
- The method enables large-scale processing, allowing for LLM-based reference-free cluster annotation, facilitating the development of scalable single cell data discovery platforms.
Abstract
The size of individual single cell samples continues to grow with advancing technologies, as do the number of samples included in individual experiments and across organizations. This presents challenges for processing this data at scale, both in terms of computational throughput and the required size of the machines that must process this data. We present a single cell RNA processing method that is fully distributed, capable of processing arbitrarily large files, and numbers of files, without requiring per-file based compute sizing. Our method, cspray, includes data ingestion, pre-processing, highly variable gene annotation, PCA, and clustering. We also show that this processing at scale permits LLM based reference-free cluster annotation on low resolution clusters, which demonstrates these techniques can be used to build single cell data discovery platforms at scale.
bioinformatics2026-02-09v1Fuzzifier*: Robust and Sensitive Multi-omics Data Analysis
Offensperger, F.; Pan, C.; Sinn, E.; Zimmer, R.AI Summary
- Fuzzifier* is a pipeline for differential analysis of multi-omics data, allowing categorization through custom fuzzy concepts at any analysis step.
- It computes multiple analysis paths to identify both consensus and path-specific features, enhancing reliability and sensitivity.
- Applied to TCGA data, Fuzzifier* validated known cancer-specific miRNAs and identified new candidates, focusing on value distributions and foldchange from small sample sizes.
Abstract
Motivation: Categorization is an important means for interpreting data and drawing conclusions. Often, the derived categories provide evidence for diagnostic or even therapeutic approaches. The standard pipelines for differential analysis of multi-omic high-throughput, and in particular single-cell data, yield (ranked) lists of possibly differential features after applying appropriate effect sizes or significance thresholds of computed p-value and/or foldchange. Results: We propose the Fuzzifier* pipeline for the differential analysis of any type of high-throughput data, either raw input data or foldchange data of groups of a (small or large) number of replicates. In Fuzzifier*, categorization can be applied to any step of the analysis pipeline according to custom-designed fuzzy concepts (Fuzzifier). Thus, any (fuzzified) analysis option corresponds to a path in a commutative diagram specifying the Fuzzifier* pipeline. Fuzzifier* computes a user-defined set of paths and presents an overview of the results, thereby identifying both highly reliable (consensus) and sensitive (path-specific) features. Fuzzifier* is a method that can be applied to any analysis pipeline to obtain different views on the data and yield more reliable results. This is demonstrated by the identification of context-specific miRNAs for individual cancer types from TCGA data. Fuzzifier* could both validate known cancer-specific miRNAs and identify novel candidates. In comparison to statistical tests, Fuzzifier* focuses on value distributions of tumor and normal samples as well as paired foldchange distributions and, thus, identifies condition-specific features from a relatively small number of replicates. Availability and Implementation: https://github.com/zimmerlab/fuzzifier Contact: offensperger@bio.ifi.lmu.de and zimmer@ifi.lmu.de
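As a toy illustration of fuzzifying a quantity into overlapping categories, the snippet below maps a log2 fold change to "down"/"unchanged"/"up" memberships with piecewise-linear membership functions. The breakpoints and category names are illustrative assumptions, not Fuzzifier*'s built-in concepts.

```python
# Toy fuzzification of a log2 fold change into three overlapping categories
# ("down", "unchanged", "up") via piecewise-linear membership functions.
import numpy as np


def ramp(x: float, lo: float, hi: float, rising: bool = True) -> float:
    m = float(np.clip((x - lo) / (hi - lo), 0.0, 1.0))
    return m if rising else 1.0 - m


def fuzzify_log2fc(lfc: float) -> dict:
    return {
        "down": ramp(lfc, -2.0, -0.5, rising=False),
        "unchanged": min(ramp(lfc, -2.0, -0.5), ramp(lfc, 0.5, 2.0, rising=False)),
        "up": ramp(lfc, 0.5, 2.0),
    }


for value in (-3.0, -1.0, 0.0, 1.2, 2.5):
    print(value, fuzzify_log2fc(value))
```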
bioinformatics2026-02-09v1Generalise or Memorise? Benchmarking Ligand-Conditioned Protein Generation from Sequence-Only Data
Vicente, A.; Dornfeld, L.; Coines, J.; Ferruz, N.AI Summary
- The study investigates designing proteins that bind specific ligands using sequence-only data by framing it as a sequence-to-sequence translation problem with protein language models (pLMs).
- Models were trained on large datasets (>17M pairs) with varying parameter sizes, revealing a trade-off: fewer protein pairs per ligand result in more foldable but less diverse sequences, while more pairs increase diversity but reduce foldability.
- The research highlights dataset redundancy and incompleteness as critical challenges, providing datasets, models, and tools for further research.
Abstract
Proteins can bind small molecules with high specificity. However, designing proteins that bind user-defined ligands remains a challenge, typically relying on structural information and costly experimental iteration. While protein language models (pLMs) have shown promise for unconditional generation and conditioning on coarse functional labels, instance-level conditioning on a specific ligand has not been evaluated using purely textual inputs. Here we frame small-molecule protein binder design as a sequence-to-sequence translation problem and train ligand-conditioned pLMs that map molecular strings to candidate binder sequences. We curate large-scale ligand-protein datasets (>17M ligand-protein pairs) covering different data regimes and train a suite of models, spanning 16 to 700M parameters. Results reveal a consistent trade-off driven by supervision ambiguity: when each ligand is paired with few proteins, models generate near-neighbour, foldable sequences; when each ligand is paired with many proteins, generations are more diverse but less consistently foldable. Our study exposes how annotation diversity and sampling choices elicit this behaviour and how it changes with the data distribution. These insights highlight dataset redundancy and incompleteness as key bottlenecks for sequence-only binder design. We release the curated datasets, trained models, and evaluation tools to support future work on ligand-conditioned protein generation.
bioinformatics2026-02-09v1From Structure to Dynamics: Activation Mechanism of the G Protein-Coupled Bile Acid Receptor 1-Gs Complex
Fiorillo, B.; Moraca, F.; Di Leva, F. S.; Sepe, V.; Fiorucci, S.; Limongelli, V.; Zampella, A.; Catalanotti, B.AI Summary
- This study explored the activation mechanism of the GPBAR1-Gs complex by lithocholic acid (LCA) using homology modeling, molecular docking, and MD simulations.
- Findings indicate that LCA binding stabilizes the active state of GPBAR1, influencing TM5 and TM6 conformations and enhancing the coupling with Gs.
- The study provides insights into how LCA modulates GPBAR1 activation, aiding in the development of GPBAR1-targeted compounds.
Abstract
The G protein-coupled bile acid receptor 1 (GPBAR1, also known as TGR5) is a key mediator of bile acid signaling, exerting its physiological effects through coupling with the stimulatory G protein (Gs). This interaction is essential for stabilizing the receptor's active conformation and triggering downstream signaling. Among endogenous ligands, lithocholic acid (LCA) is the most potent natural agonist. However, the dynamic features underlying its binding and activation mechanisms remain poorly defined. In this study, we investigated the molecular basis of the interaction between LCA and GPBAR1, as well as the functional consequences of this interaction on receptor activation by integrating homology modelling, molecular docking, and molecular dynamics (MD) simulations. Our calculations reveal that LCA binding stabilizes the active state of GPBAR1, biasing the conformational ensemble of TM5 and TM6, as well as the main microswitches. These ligand-induced rearrangements enhance the coupling interface with the α5 helix of Gs and facilitate allosteric communication between the orthosteric and intracellular sites. Overall, our findings provide dynamic insight into how LCA modulates GPBAR1 activation and G protein engagement, highlighting its role as a molecular effector in bile acid signaling, and furnishing molecular detail relevant to ongoing efforts in GPBAR1-targeted compound development.
bioinformatics2026-02-09v1HiCInterpolate: 4D Spatiotemporal Interpolation of Hi-C Data for Genome Architecture Analysis.
Chowdhury, H. M. A. M.; Oluwadare, O.AI Summary
- HiCInterpolate was developed to interpolate intermediate Hi-C contact matrices between two timestamps, addressing the need for continuous genomic data in genome architecture analysis.
- It uses a deep learning approach with a flow predictor and U-Net-like architecture to predict high-resolution intermediate Hi-C maps.
- The tool supports analysis of 3D genomic features like A/B compartments and TADs, showing strong performance in metrics like PSNR and SSIM, and preserving key chromatin organization features.
Abstract
Motivation: Studying the three-dimensional (3D) structure of a genome, including chromatin loops and Topologically Associating Domains (TADs), is essential for understanding how the genome is organized, such as gene activation, cell development, protein-protein interaction, etc. The Hi-C protocol enables us to study 3D genome structure and organization. Chromatin 3D structure changes dynamically over time, and modeling these continuous changes is crucial for downstream analysis in various domains such as disease diagnosis, vaccine development, etc. The high expense and impracticality of continuous genome sequencing, particularly what evolves between two timestamps, limit the most effective genomic analysis. It is crucial to develop a straightforward and cost-efficient method for constantly generating genomic data between two timestamps in order to address these constraints. Results: In this study, we developed HiCInterpolate, a 4D spatiotemporal interpolation architecture that accepts two timestamp Hi-C contact matrices to interpolate intermediate Hi-C contact matrices at high resolution. HiCInterpolate predicts the intermediate Hi-C contact map using a deep learning-based flow predictor and a feature encoder and decoder architecture similar to U-Net. In addition, HiCInterpolate supports downstream analysis of multiple 3D genomic features, including A/B compartments, chromatin loops, TADs, and 3D genome structure, through an integrated analysis pipeline. Across multiple evaluation metrics, including PSNR, SSIM, GenomeDISCO, HiCRep, and LPIPS, HiCInterpolate achieved consistently strong performance. Biological validation further demonstrated preservation of key chromatin organization features, such as chromatin loops, A/B compartments, and TADs. Together, these results indicate that HiCInterpolate provides a robust computer vision-based framework for high-resolution interpolation of intermediate Hi-C contact matrices and facilitates biologically meaningful downstream analyses. Availability: HiCInterpolate is publicly available at https://github.com/OluwadareLab/HiCInterpolate.
bioinformatics2026-02-09v1UniFacePoint-FM: A Foundation Model for Generalizable 3D Facial Representation Learning and Multi-Attribute Prediction
Li, D.; Fu, C.-H.; Tang, K.AI Summary
- UniFacePoint-FM is a 3D facial foundation model using a self-supervised Point-MAE framework for learning from point clouds, addressing limitations of 2D and task-specific 3D models.
- Pretrained on a custom dataset, it was fine-tuned and evaluated on three datasets for tasks like gender classification, age regression, BMI prediction, and facial expression recognition.
- It achieves state-of-the-art performance in several tasks, showing high generalizability across different datasets and scanning platforms.
Abstract
The human face is a rich medium for biometric, behavioral, and clinical information. However, 2D facial images based technologies lack critical geometric details and are susceptible to pose and illumination interference, while 3D facial deep learning frameworks are hindered by complex annotation, preprocessing, and task-specific designs with poor cross-domain generalization. To address these challenges, we propose UniFacePoint-FM, a 3D facial foundation model built on a self-supervised Point-MAE framework, tailored for high-fidelity point cloud representation learning. The model was pretrained on a self-constructed dataset of high-resolution 3D facial scans, followed by supervised fine-tuning and comprehensive evaluation across three independent datasets for diverse downstream tasks. Experimental results demonstrate that UniFacePoint-FM is both pretraining-efficient and highly generalizable: it achieves state-of-the-art performance on gender classification, age regression, and BMI prediction, and matches the accuracy of the ResMLP model (while outperforming other baselines) in facial expression recognition. Notably, by learning high-quality, fine-grained representations directly from raw point clouds, UniFacePoint-FM delivers robust generalization and transferability across tasks, datasets, and even different face scanning platforms. Overall, our work establishes an effective foundation model paradigm for 3D facial analysis, with promising implications for biometric security, health monitoring, and advanced human-computer interaction systems.
bioinformatics2026-02-09v1SpliceRead: Improving Canonical and Non-Canonical Splice Site Prediction with Residual Blocks and Synthetic Data Augmentation
Thapa, S.; Samderiya, K.; Menon, R.; Oluwadare, O.AI Summary
- SpliceRead uses residual convolutional blocks and synthetic data augmentation to improve the prediction of both canonical and non-canonical splice sites.
- Trained on a multi-species dataset, SpliceRead outperforms existing models in key metrics like F1-score, accuracy, precision, and recall, particularly reducing non-canonical misclassification rates.
- Evaluations confirmed SpliceRead's robustness through cross-validation, cross-species testing, and input-length generalization.
Abstract
Accurate splice site prediction is fundamental to understanding gene expression and its associated disorders. However, most existing models are biased toward frequent canonical sites, limiting their ability to detect rare but biologically important non-canonical variants. These models often rely heavily on large, imbalanced datasets that fail to capture the sequence diversity of non-canonical sites, leading to high false-negative rates. Here, we present SpliceRead, a novel deep learning model designed to improve the classification of both canonical and non-canonical splice sites using a combination of residual convolutional blocks and synthetic data augmentation. SpliceRead employs a data augmentation method to generate diverse non-canonical sequences and uses residual connections to enhance gradient flow and capture subtle genomic features. Trained and tested on a multi-species dataset of 400- and 600-nucleotide sequences, SpliceRead consistently outperforms state-of-the-art models across all key metrics, including F1-score, accuracy, precision, and recall. Notably, it achieves a substantially lower non-canonical misclassification rate than baseline methods. Extensive evaluations, including cross-validation, cross-species testing, and input-length generalization, confirm its robustness and adaptability. SpliceRead offers a powerful, generalizable framework for splice site prediction, particularly in challenging, low-frequency sequence scenarios, and paves the way for more accurate gene annotation in both model and non-model organisms. The open-sourced code of SpliceRead and detailed documentation are available at https://github.com/OluwadareLab/SpliceRead .
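The core building block named above, a residual convolutional block for 1D sequence input, can be sketched in a few lines of PyTorch; channel count, kernel width, and normalization choices below are arbitrary illustrative settings, not SpliceRead's published architecture.

```python
# Minimal 1D residual convolutional block of the kind used in splice-site
# sequence models; channel sizes and kernel width are arbitrary choices.
import torch
import torch.nn as nn


class ResidualBlock1D(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 7):
        super().__init__()
        pad = kernel_size // 2
        self.body = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size, padding=pad),
            nn.BatchNorm1d(channels),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size, padding=pad),
            nn.BatchNorm1d(channels),
        )
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(x + self.body(x))  # skip connection eases gradient flow


x = torch.randn(2, 32, 400)  # batch, channels, sequence length (e.g. 400 nt window)
print(ResidualBlock1D(32)(x).shape)
```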
bioinformatics2026-02-09v1A new cancer progression model: from synthetic tumors to real data and back
Volpatto, D.; Contaldo, S. G.; Pernice, S.; Beccuti, M.; Cordero, F.; Sirovich, R.AI Summary
- The study introduces a stochastic model for tumor evolution that integrates genotypic inheritance, phenotype-driven traits, and resource competition to understand intratumor heterogeneity (ITH).
- The model uses a simulation algorithm with an open-source GUI for parameter configuration, allowing exploration of clonal dynamics and population changes.
- Findings suggest early tumor growth is stochastic, while later stages show selection for traits that mitigate environmental constraints, aligning with observed biological patterns.
Abstract
Intratumor heterogeneity (ITH) arises from the combined effects of genetic alterations, clonal interactions, and environmental constraints, and plays a central role in therapeutic resistance and disease progression. While ITH has been extensively documented in empirical tumor data, the scientific debate regarding the biological mechanisms underlying this heterogeneity remains complex, highlighting the need for cancer evolution models that are sufficiently flexible and sophisticated to reproduce the observed behaviors and to give insights on the unobserved ones. Here, we present a stochastic modelling framework for tumor evolution that integrates genotypic inheritance with phenotype-driven functional traits and resource-mediated competition. Mutational events are associated with functional capabilities such as altered proliferation, increased mutation rates, limit evasion potential or enhanced control over shared resources, allowing multiple genotypes to converge on similar phenotypes. The model explicitly tracks subclonal lineages while incorporating environmental constraints that modulate growth and competition. The framework is defined through a mathematically rigorous construction and is accompanied by an efficient simulation algorithm. To facilitate exploration and reproducibility, we provide an open-source graphical user interface that allows users to configure model parameters, run simulations, and inspect clonal genealogies and population dynamics without requiring direct interaction with the underlying code. Using this model, we illustrate how ecological feedbacks can shape clonal dynamics over time, supporting an interpretation in which early tumor growth is dominated by stochastic expansion, while later evolution increasingly reflects selection for traits that alleviate environmental constraints. Rather than constituting a new evolutionary paradigm, this behaviour demonstrates how well-documented biological patterns can emerge naturally from a unified stochastic and ecological description. Overall, our approach offers a flexible and extensible platform for investigating how chance, functional traits, and environmental interactions jointly govern tumor heterogeneity.
bioinformatics2026-02-09v1seq2ribo: Structure-aware integration of machine learning and simulation to predict ribosome location profiles from RNA sequences
Kaynar, G.; Kingsford, C.AI Summary
- seq2ribo integrates machine learning with a structure-aware simulation (sTASEP) to predict ribosome A-site locations from mRNA sequences.
- It outperforms existing methods by reducing transcript-level errors up to 35.8% and improving correlation with experimental data across various cell types.
- This approach enables de novo mRNA sequence design for applications like synthetic biology without requiring expression data or genomic context.
Abstract
Motivation: Ribosome dynamics are vital in the process of protein expression. Current methods rely on ribosome profiling (Ribo-seq), RNA-seq profiles, and full genomic context. This restricts their use in de novo sequence design, like messenger RNA (mRNA) vaccines. Simulation-only approaches like the Totally Asymmetric Simple Exclusion Process (TASEP) oversimplify translation by focusing solely on codon elongation times. Results: We present seq2ribo, a hybrid simulation and machine learning framework that predicts ribosome A-site locations using only an mRNA sequence as input. Our method first employs a novel structure-aware TASEP (sTASEP), which models translation using a comprehensive set of fitted parameters that include codon wait times and structural features, such as local angles, base-pairing, and discrete positional buckets. The ribosome locations generated by sTASEP are then processed by a polisher model, which learns to refine the simulated ribosome distributions. seq2ribo provides high-fidelity predictions of ribosome locations across diverse cell types (iPSC, HEK293, LCL, and RPE-1), significantly outperforming baselines. When benchmarked against sequence-only Translatomer, seq2ribo achieves reductions in transcript-level error up to 35.8%, while simultaneously attaining the highest Pearson and Spearman correlations in every cell line and reducing structural errors between 43.3% and 97.3%. By adding a task-specific head, seq2ribo achieves Spearman correlations up to 0.795 with experimental translation efficiency (TE) across several cell lines, and 0.689 with measured protein expression. By operating from sequence alone, seq2ribo provides a new tool for synthetic biology, enabling the rational design and optimization of mRNA sequences without the need for expression-level data or genomic context. Availability: seq2ribo is available at https://github.com/Kingsford-Group/seq2ribo. Contact: gkaynar@cs.cmu.edu, carlk@cs.cmu.edu. Supplementary information: Supplementary data are available.
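For readers unfamiliar with TASEP, the bare-bones simulation below captures the exclusion idea the abstract builds on: ribosomes initiate at the 5' end, hop forward one codon at a time, and are blocked by the ribosome ahead. It uses a single-codon footprint and uniform rates, with no codon wait times or structural features, unlike the structure-aware sTASEP described above.

```python
# Bare-bones TASEP over codon positions with ribosome exclusion.
import numpy as np


def tasep_occupancy(n_codons=300, steps=200_000, init_rate=0.1,
                    hop_rate=1.0, seed=0):
    rng = np.random.default_rng(seed)
    lattice = np.zeros(n_codons, dtype=bool)   # True = ribosome A-site at this codon
    occupancy = np.zeros(n_codons)
    for _ in range(steps):
        # Attempt initiation at codon 0 if it is free.
        if not lattice[0] and rng.random() < init_rate:
            lattice[0] = True
        # Attempt elongation for a randomly chosen occupied codon.
        occupied = np.flatnonzero(lattice)
        if occupied.size:
            i = rng.choice(occupied)
            if rng.random() < hop_rate:
                if i == n_codons - 1:
                    lattice[i] = False          # termination at the stop codon
                elif not lattice[i + 1]:
                    lattice[i], lattice[i + 1] = False, True
        occupancy += lattice
    return occupancy / steps                    # time-averaged A-site density per codon


print(tasep_occupancy(n_codons=50, steps=20_000)[:10].round(3))
```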
bioinformatics2026-02-09v1Near perfect identification of half sibling versus niece/nephew avuncular pairs without pedigree information or genotyped relatives
Sapin, E.; Kelly, K.; Keller, M. C.AI Summary
- The study addresses the challenge of distinguishing half-siblings from niece/nephew-avuncular pairs in large genomic biobanks without pedigree information.
- A novel method using across-chromosome phasing and haplotype-level sharing features was developed, achieving over 98% classification accuracy.
- This approach also enhances long-range phasing by providing structural constraints for homologue assignment.
Abstract
Motivation: Large-scale genomic biobanks contain thousands of second-degree relatives with missing pedigree metadata. Accurately distinguishing half-sibling (HS) from niece/nephew-avuncular (N/A) pairs, both sharing approximately 25% of the genome, remains a significant challenge. Current SNP-based methods rely on Identical-By-Descent (IBD) segment counts and age differences, but substantial distributional overlap leads to high misclassification rates. There is a critical need for a scalable, genotype-only method that can resolve these "half-degree" ambiguities without requiring observed pedigrees or extensive relative information. Results: We present a novel computational framework that achieves near-complete separation of HS and N/A pairs using only genotype data. Our approach utilizes across-chromosome phasing to derive haplotype-level sharing features that summarize how IBD is distributed across parental homologues. By modeling these features with a Gaussian mixture model (GMM), we demonstrate near-perfect classification accuracy (> 98%) in biobank-scale data. Furthermore, we show that these high-confidence relationship labels can serve as long-range phasing anchors, providing structural constraints that improve the accuracy of across-chromosome homologue assignment. This method provides a robust, scalable solution for pedigree reconstruction and the control of cryptic relatedness in large-scale genomic studies.
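The modeling step can be pictured as below: fit a two-component Gaussian mixture to haplotype-level sharing features and read off posterior class probabilities. The simulated 2D features are stand-ins for the across-chromosome phasing summaries described in the abstract, and the component means are invented for illustration.

```python
# Schematic: two-component Gaussian mixture over haplotype-level IBD-sharing
# features, then posterior probabilities as HS vs N/A class assignments.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Hypothetical 2D feature: (sharing on homologue 1, sharing on homologue 2).
hs_like = rng.normal(loc=[0.25, 0.00], scale=0.03, size=(200, 2))  # via one parent
na_like = rng.normal(loc=[0.15, 0.10], scale=0.03, size=(200, 2))  # via both grandparental lines
features = np.vstack([hs_like, na_like])

gmm = GaussianMixture(n_components=2, covariance_type="full",
                      random_state=0).fit(features)
posterior = gmm.predict_proba(features)     # per-pair class probabilities
labels = posterior.argmax(axis=1)
print("component means:\n", gmm.means_.round(3))
print("pairs assigned to each component:", np.bincount(labels))
```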
bioinformatics2026-02-08v4