Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
Decoding the Molecular Language of Proteins with Evolla
Zhou, X.; Han, C.; Zhang, Y.; Du, H.; Tian, J.; Su, J.; Liu, R.; Zhuang, K.; Jiang, S.; Gitter, A.; Liu, L.; Li, H.; Wu, M.; You, S.; Yuan, Z.; Ju, F.; Zhang, H.; Zheng, W.; Dai, F.; Zhou, Y.; Tao, Y.; Wu, D.; Shao, Z.; Liu, Y.; Lu, H.; Yuan, F.
AI Summary
- Evolla is an interactive protein-language model trained on 546 million protein-text pairs, designed to interpret protein function through natural language queries.
- It outperforms general large language models in functional inference and matches state-of-the-art supervised models in zero-shot performance.
- Applications include identifying eukaryotic signature proteins in Asgard archaea and discovering a novel PET hydrolase, PsPETase, validated for plastic degradation.
Abstract
Proteins, nature's intricate molecular machines, are the products of billions of years of evolution and play fundamental roles in sustaining life. Yet, deciphering their molecular language - understanding how sequences and structures encode biological functions - remains a cornerstone challenge. Here, we introduce Evolla, an interactive protein-language model designed to transcend static classification by interpreting protein function through natural language queries. Trained on 546 million protein-text pairs and refined via Direct Preference Optimization, Evolla couples high-dimensional molecular representations with generative semantic decoding. Benchmarking establishes Evolla's superiority over general large language models in functional inference, demonstrates zero-shot performance parity with the state-of-the-art supervised model, and exposes remote functional relationships invisible to conventional alignment. We validate Evolla through two distinct applications: identifying candidate eukaryotic signature proteins in Asgard archaea, with functional Vps4 homologs validated via yeast complementation; and interactively discovering a novel deep-sea polyethylene terephthalate (PET) hydrolase, PsPETase, confirmed to degrade plastic films. These results position Evolla not merely as a predictor, but as a generative engine capable of complex hypothesis formulation, shifting the paradigm from static annotation to interactive, actionable discovery. The Evolla online service is available at http://www.chat-protein.com/.
bioinformatics · 2026-02-11 · v4
scPRINT-2: Towards the next-generation of cell foundation models and benchmarks
Kalfon, J.; Peyre, G.; Cantini, L.
AI Summary
- The study introduces scPRINT-2, a single-cell Foundation Model pre-trained on 350 million cells from 16 organisms, aiming to enhance performance in cell biology tasks.
- scPRINT-2 was developed using an additive benchmark across various tasks, leading to state-of-the-art results in expression denoising, cell embedding, and cell type prediction.
- The model's capabilities include generative functions like expression imputation and counterfactual reasoning, with demonstrated generalization to new modalities and organisms.
Abstract
Cell biology has been booming with foundation models trained on large single-cell RNA-seq databases, but benchmarks and capabilities remain unclear. We propose an additive benchmark across a gymnasium of tasks to discover which features improve performance. From these findings, we present scPRINT-2, a single-cell Foundation Model pre-trained across 350 million cells and 16 organisms. Our contributions in pre-training tasks, tokenization, and losses made scPRINT-2 state-of-the-art in expression denoising, cell embedding, and cell type prediction. Furthermore, with our cell-level architecture, scPRINT-2 becomes generative, as demonstrated by our expression imputation and counterfactual reasoning results. Finally, thanks to our pre-training database, we uncover generalization to unseen modalities and organisms. These studies, together with improved abilities in gene embeddings and gene network inference, place scPRINT-2 as a next-generation cell foundation model.
bioinformatics · 2026-02-11 · v3
Prediction of Antibody Non-Specificity using Protein Language Models and Biophysical Parameters
Sakhnini, L. I.; Beltrame, L.; Fulle, S.; Sormanni, P.; Henriksen, A.; Lorenzen, N.; Vendruscolo, M.; Granata, D.
AI Summary
- This study predicts antibody non-specificity using protein language models (PLMs) and biophysical descriptors, focusing on human and mouse antibody data.
- The best prediction model, ESM 1v LogisticReg, achieved 71% accuracy in 10-fold cross-validation, highlighting the heavy variable domain's importance.
- Biophysical analysis revealed the isoelectric point as a significant factor in non-specificity, with implications for developing therapeutic antibodies and nanobodies.
Abstract
The development of therapeutic antibodies requires optimizing target binding affinity and pharmacodynamics, while ensuring high developability potential, including minimizing non-specific binding. In this study, we address this problem by predicting antibody non-specificity by two complementary approaches: (i) antibody sequence embeddings by protein language models (PLMs), and (ii) a comprehensive set of sequence-based biophysical descriptors. These models were trained on human and mouse antibody data from Boughter et al. (2020) and tested on three public datasets: Jain et al. (2017), Shehata et al. (2019) and Harvey et al. (2022). We show that non-specificity is best predicted from the heavy variable domain and heavy-chain complementarity-determining regions (CDRs). The top performing PLM, a heavy variable domain-based ESM 1v LogisticReg model, achieved a 10-fold cross-validation accuracy of up to 71%. Our biophysical descriptor-based analysis identified the isoelectric point as a key driver of non-specificity. Our findings underscore the importance of biophysical properties in predicting antibody non-specificity and highlight the potential of protein language models for the development of antibody-based therapeutics. To illustrate the use of our approach in the development of lead candidates with high developability potential, we show that it can be extended to therapeutic antibodies and nanobodies.
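The modeling recipe here (frozen PLM embeddings plus a linear classifier, scored by 10-fold cross-validation) is easy to sketch. A minimal illustration with scikit-learn, assuming mean-pooled ESM-1v embeddings have already been computed; the random arrays below are placeholders for real embeddings and labels, not the authors' data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical inputs: X holds mean-pooled ESM-1v embeddings of heavy
# variable domains (n_antibodies x 1280); y holds 0/1 non-specificity flags.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1280))   # placeholder for real embeddings
y = rng.integers(0, 2, size=200)   # placeholder for real labels

# Logistic regression on embeddings, scored with 10-fold cross-validation,
# mirroring the accuracy protocol reported in the abstract.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
print(f"10-fold CV accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```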
bioinformatics · 2026-02-11 · v2
A multi-component power-law penalty corrects distance bias in single-cell co-accessibility and deep-learning chromatin interaction predictions
Schlegel, L.; Gomez-Cano, F.; Marand, A. P.; Johannes, F.
AI Summary
- The study addresses the overestimation of long-range interactions in single-cell co-accessibility and deep learning predictions by introducing a distance-based penalty function.
- Using Hi-C data from maize, rice, and soybean, the researchers developed tissue-specific and global consensus penalties based on multi-regime power-law exponents.
- Applying these corrections to scATAC-seq data reduced long-range false positives by 73% with tissue-specific penalties and 66% with global consensus, aligning predictions more closely with Hi-C data.
Abstract
Scalable proxies for 3D genome contacts - such as single-cell co-accessibility and deep learning predictions - have emerged as powerful alternatives to chromatin capture-based methods, but predictions systematically overestimate long-range interactions. Here we show how to correct this bias using distance-based penalty functions informed by Gaussian mixture modeling and polymer-physics scaling. Using Hi-C datasets from maize, rice, and soybean, we derive tissue-specific and global consensus penalties parameterized by multi-regime power-law exponents. Applying these corrections to scATAC-seq co-accessibility scores improves their distance profiles in concordance with Hi-C and reduces long-range false positives by an average of 73% with tissue-specific penalties and 66% with the global consensus. We provide open-source code and fitted parameters to support adoption in maize, rice, and soybean.
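To make the correction concrete, here is a minimal sketch of a multi-regime power-law penalty applied to co-accessibility scores. The breakpoints and exponents are illustrative placeholders; the fitted tissue-specific and consensus parameters come from the authors' released code, not from here.

```python
import numpy as np

# Hypothetical multi-regime power-law decay: contact probability is modeled
# as P(d) ~ d**(-alpha), with a different alpha per genomic-distance regime.
# Breakpoints (bp) and exponents are placeholders, not the paper's fits.
BREAKPOINTS = np.array([20_000, 200_000, 2_000_000])
ALPHAS = np.array([0.6, 1.0, 1.5, 2.0])   # one exponent per regime

def penalty(distance_bp, d0=20_000):
    """Relative expected contact probability at distance_bp, vs. d0."""
    alpha = ALPHAS[np.searchsorted(BREAKPOINTS, distance_bp)]
    return (distance_bp / d0) ** (-alpha)

def correct_scores(scores, distances_bp):
    """Down-weight co-accessibility scores by the expected distance decay."""
    return np.array([s * penalty(max(d, 20_000))
                     for s, d in zip(scores, distances_bp)])

scores = np.array([0.8, 0.8, 0.8])
dists = np.array([25_000, 500_000, 5_000_000])
print(correct_scores(scores, dists))  # longest-range pair is penalized most
```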
bioinformatics · 2026-02-11 · v2
ModSeqR: An R package for efficient analysis of modified nucleotide data
Zimmerman, H. E.; Moore, J.; Miller, R. H.; Stirland, I.; Jenkins, A.; Saito, E.; Jenkins, T.; Hill, J. T.
AI Summary
- The study addresses the computational challenges in analyzing large datasets from long-read technologies for DNA methylation.
- They introduce the CH3 file format, reducing file sizes by over 95%, and the ModSeqR R package, which uses this format and a database backend for efficient epigenetic analyses.
- These tools facilitate high-throughput methylation analysis with reduced computational demands.
Abstract
DNA methylation regulates a wide range of biological processes, including gene expression, disease progression, and cell identity. Long-read technologies now enable more comprehensive and accurate methylome analyses than ever before, but they are hindered by the computational resources needed to analyze the massive datasets. Here, we present the CH3 file format, which aids data storage and transfer by reducing file sizes by more than 95%, and the ModSeqR R package, which builds on the CH3 format and a database backend to enable a broad range of epigenetic analyses. Together, these tools enable high-throughput methylation analysis while minimizing computational resource requirements.
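A >95% reduction is plausible for per-site methylation calls because sorted positions compress well as deltas and methylation fractions fit in a single byte. The toy sketch below illustrates only that packing idea; it is not the CH3 specification.

```python
import gzip
import numpy as np

rng = np.random.default_rng(0)

# Toy per-chromosome methylation table (illustration only, not the CH3
# spec): sorted CpG positions plus a methylation fraction per site.
positions = np.cumsum(rng.integers(10, 500, size=200_000))
fractions = rng.random(size=positions.size)

# Delta-encode positions (small gaps compress far better than absolute
# coordinates) and quantize fractions to one byte each before gzip.
deltas = np.diff(positions, prepend=0).astype(np.uint32)
quantized = np.round(fractions * 255).astype(np.uint8)
packed = gzip.compress(deltas.tobytes() + quantized.tobytes())

# Compare against a naive text representation of the same table.
text = "".join(f"chr1\t{p}\t{f:.3f}\n" for p, f in zip(positions, fractions))
print(f"text-like size: {len(text.encode()) / 1e6:.2f} MB")
print(f"packed + gzip:  {len(packed) / 1e6:.2f} MB")
```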
bioinformatics · 2026-02-11 · v2
Multi-compartment spatiotemporal metabolic modeling of the chicken gut guides the design of dietary interventions
Utkina, I.; Alizadeh, M.; Sharif, S.; Parkinson, J.
AI Summary
- The study developed a multi-compartment, spatiotemporally resolved metabolic model of the chicken gut to understand how diet influences microbial metabolism.
- The model identified cellulose, starch, and L-threonine as effective dietary supplements for enhancing short-chain fatty acid production, particularly butyrate, through in silico screening.
- Validation through a feeding trial confirmed model predictions, highlighting the importance of microbial community composition in metabolic outcomes.
Abstract
Understanding how diet shapes microbial metabolism along the gastrointestinal tract is essential for improving poultry gut health and reducing reliance on antibiotic growth promoters. Yet dietary interventions often yield inconsistent outcomes because their efficacy depends on baseline conditions, including diet composition and microbiota structure. To address this, we developed the first multi-compartment, spatiotemporally resolved metabolic model of the chicken gastrointestinal tract. Our six-compartment framework integrates avian-specific physiological features including bidirectional flow, feeding-fasting cycles, and compartment-specific environmental parameters. The model captured distinct metabolic specialization along the gut, with upper compartments enriched for biosynthetic pathways and lower compartments specialized for fermentation. Systematic in silico screening of 34 dietary supplements revealed context-dependent metabolic responses and identified cellulose, starch, and L-threonine as robust enhancers of short-chain fatty acid production. A controlled feeding trial validated key predictions, particularly for butyrate, and integrating trial-specific microbial community data substantially improved prediction accuracy for several metabolites. Our findings demonstrate that community composition is a major driver of metabolic outcomes and underscore the need for context-specific modeling. Our framework provides a mechanistic platform for rational dietary intervention design and is broadly adaptable to other animal or human gastrointestinal systems.
bioinformatics · 2026-02-11 · v2
Verifying LLM-extracted text with token alignment
Booeshaghi, A. S.; Streets, A. M.
AI Summary
- This study investigates improving the verification of text extracted by large language models (LLMs) by aligning extracted text with the original source, focusing on discontiguous phrases.
- Using LLM-specific tokenizers and ordered alignment algorithms, the approach improved alignment accuracy by about 50% compared to traditional word-level tokenization.
- The study introduced the BOAT and BIO-BOAT datasets for testing, demonstrating that ordered alignment is the most practical method for this task.
Abstract
Large language models excel at text extraction, but they sometimes hallucinate. A simple way to avoid hallucinations is to remove any extracted text that does not appear in the original source. This is easy when the extracted text is contiguous (findable with exact string matching), but much harder when it is discontiguous. Techniques for finding discontiguous phrases depend heavily on how the text is split, i.e., how it is tokenized. In this study, we show that splitting text along subword boundaries, with LLM-specific tokenizers, and aligning extracted text with ordered alignment algorithms, improves alignment by about 50% compared to word-level tokenization. To demonstrate this, we introduce the Berkeley Ordered Alignment of Text (BOAT) dataset, a modification of the Stanford Question Answering Dataset (SQuAD) that includes non-contiguous phrases, and BIO-BOAT, a biomedical variant built from 51 bioRxiv preprints. We show that text-alignment methods form a partially ordered set, and that ordered alignment is the most practical choice for verifying LLM-extracted text. We implement this approach in taln, which enumerates ordinal subword alignments.
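The verification step reduces to checking that the extracted text's subword tokens occur, in order, within the source's token stream. A minimal sketch using tiktoken as a stand-in LLM tokenizer and a greedy ordered match; taln's actual enumeration of ordinal subword alignments is more general.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # any LLM tokenizer slots in here

def is_ordered_alignment(extracted: str, source: str) -> bool:
    """Greedy check that the extracted text's subword tokens appear as an
    ordered (possibly discontiguous) subsequence of the source's tokens.
    A sketch of the ordered-alignment idea, not taln's algorithm."""
    src = enc.encode(source)
    # Leading space so word-initial tokens match their mid-sentence forms
    # under GPT-style pre-tokenization.
    need = enc.encode(" " + extracted.strip())
    i = 0
    for tok in src:
        if i < len(need) and tok == need[i]:
            i += 1
    return i == len(need)

source = "The mitochondrion, often called the powerhouse, makes ATP."
print(is_ordered_alignment("often called the powerhouse", source))  # True
print(is_ordered_alignment("often called the nucleus", source))     # False
```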
bioinformatics · 2026-02-11 · v2
Augmented prediction of multi-species protein-RNA interactions using evolutionary conservation of RNA-binding proteins
He, J.; Zhou, T.; Hu, L.-F.; Jiao, Y.; Wang, J.; Yan, S.; Jia, S.; Chen, Q.; Zhu, W.; Zhang, J.; Jia, M.; Li, Y.; Wang, X.; Wang, Y.; Yang, Y. T.; Sun, L.
AI Summary
- The study introduces MuSIC, a deep learning framework to predict multi-species RBP--RNA interactions by using evolutionary conservation across 11 species.
- MuSIC outperforms existing methods, accurately predicting RBP-binding peaks with higher confidence in closely related species.
- The framework also quantifies the impact of genetic variants on RBP binding, validated experimentally, revealing disruptions in ubiquitination pathways.
Abstract
RNA-binding proteins (RBPs) play critical roles in gene expression regulation. Recent studies have begun to detail the RNA recognition mechanisms of diverse RBPs. However, given the array of RBPs studied so far, it is implausible to experimentally profile RBP-binding peaks for hundreds of RBPs in multiple non-model organisms. Here, we introduce MuSIC (Multi-Species RBP-RNA Interactions using Conservation), a deep learning-based framework for predicting cross-species RBP-RNA interactions by leveraging label smoothing and evolutionary conservation of RBPs across 11 diverse species ranging from human to yeast. MuSIC outperforms state-of-the-art computational methods, and provides predicted RBP-binding peaks across species with high accuracy. The prediction confidence is higher in closely related species, partially due to the RBP conservation patterns. Finally, the effects of homologous genetic variants on RBP binding can be computationally quantified across species, followed by experimental validations. The target transcripts with disrupted binding events are enriched with ubiquitination-associated pathways. To summarize, MuSIC provides a useful computational framework for predicting RBP-RNA interactions across species and quantifying the effects of genetic variants on RBP binding, offering novel insights into the RBP-mediated regulatory mechanisms implicated in human diseases.
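Of the two ingredients, label smoothing is the easiest to show in isolation: hard 0/1 peak labels are softened so the model is not trained to full confidence on noisy cross-species targets. A minimal PyTorch sketch; the smoothing value is illustrative, not MuSIC's setting.

```python
import torch
import torch.nn.functional as F

def smoothed_bce(logits, targets, smoothing=0.1):
    """Binary cross-entropy with label smoothing: hard 0/1 peak labels
    are pulled toward 0.5, tempering overconfident predictions on noisy
    cross-species training targets. The 0.1 value is illustrative."""
    soft = targets * (1 - smoothing) + 0.5 * smoothing
    return F.binary_cross_entropy_with_logits(logits, soft)

logits = torch.randn(8)                         # placeholder model outputs
targets = torch.randint(0, 2, (8,)).float()     # placeholder peak labels
print(float(smoothed_bce(logits, targets)))
```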
bioinformatics · 2026-02-11 · v2
Siderophore identification in microorganisms associated with marine sponges by LC-HRMS and a data analytic approach in R.
Rios, A. G.; Kato, M. J.; Yamaguchi, L. F.; Esposito, B. P.; Arenas, A. F.
AI Summary
- The study aimed to identify siderophores in the microbiomes of three marine sponge species using LC-HRMS and an R-based analytical workflow.
- A total of 59 potential siderophores were annotated, with 41 confirmed through chromatographic profiling and rigorous validation criteria.
- The approach revealed a diverse set of iron-chelating metabolites, including Ferricrocin, Aeruginic acid, and Madurastatin, without significant impact from iron supplementation during extraction.
Abstract
Siderophores are pivotal iron-acquisition biomolecules integral to microbial survival, pathogenicity, and ecology. Elucidating these compounds offers critical insights into the microbial dynamics of marine holobionts and potential therapeutic applications. In this study, we present a culture-independent, data-centric strategy to identify siderophores from the microbiome of three marine sponge species: Dragmacidon reticulatum, Aplysina fulva, and Amphimedon viridis. Utilizing Liquid Chromatography-High Resolution Mass Spectrometry (LC-HRMS) coupled with a custom R-based analytical workflow (XCMS and MetaboAnnotation), we successfully annotated 59 potential siderophores, 41 of which were confirmed via chromatographic profiling. We employed a rigorous validation pipeline, utilizing multiple iron-adduct calculations ([M-2H+Fe]+, [M-H+Fe]2+, [2M-2H+Fe]+), high mass accuracy thresholds (<3 ppm), and retention time precision (CV < 2%). Notably, iron supplementation during extraction did not significantly alter siderophore detection, suggesting constitutive production or environmental saturation. This workflow bypasses the limitations of traditional cultivation, revealing a diverse landscape of iron-chelating metabolites, including Ferricrocin, Aeruginic acid, and Madurastatin, directly within the sponge holobiont.
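The adduct arithmetic behind the validation pipeline is simple to reproduce: shift a candidate's neutral monoisotopic mass by each iron-adduct formula and test the observed m/z within a ppm window. A sketch using standard atomic masses; the demo mass is hypothetical, not a database value.

```python
# Exact monoisotopic masses in daltons (standard atomic mass tables).
M_H = 1.00782503207   # 1H
M_FE = 55.93493633    # 56Fe
M_E = 0.00054857991   # electron

ADDUCTS = {
    "[M-2H+Fe]+":  lambda m: m - 2 * M_H + M_FE - M_E,
    "[M-H+Fe]2+":  lambda m: (m - M_H + M_FE - 2 * M_E) / 2,
    "[2M-2H+Fe]+": lambda m: 2 * m - 2 * M_H + M_FE - M_E,
}

def match_adducts(neutral_mass, observed_mz, tol_ppm=3.0):
    """Return adducts whose theoretical m/z lies within tol_ppm of the
    observed m/z. A sketch of the validation arithmetic only."""
    hits = []
    for label, mz_of in ADDUCTS.items():
        theo = mz_of(neutral_mass)
        ppm = abs(observed_mz - theo) / theo * 1e6
        if ppm <= tol_ppm:
            hits.append((label, round(theo, 4), round(ppm, 2)))
    return hits

demo_mass = 500.0                            # hypothetical neutral mass
demo_mz = ADDUCTS["[M-2H+Fe]+"](demo_mass)   # simulate a matching feature
print(match_adducts(demo_mass, demo_mz * (1 + 1e-6)))  # hit within ~1 ppm
```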
bioinformatics · 2026-02-11 · v1
Adaptive and Spandrel-like Constraints at Functional Sites in Protein Folds
Poley-Gil, M.; Fernandez-Martin, M.; Banka, A.; Heinzinger, M.; Rost, B.; Valencia, A.; Parra, R. G.
AI Summary
- The study investigates how amino acid sequences contribute to protein structure and function, focusing on the role of evolutionary and physical constraints.
- Using reverse folding and structure prediction, researchers found that some evolutionarily conserved frustration in proteins cannot be removed, suggesting these are spandrels arising from physical constraints.
- These findings suggest that functional specificity in proteins might evolve from these constraints, providing insight into the interplay between evolution, structure, and biophysics.
Abstract
Understanding the relationships among amino acid sequences, structures and functions in proteins, and how they evolve, remains a central challenge in molecular biology. It is still unclear which sequence elements differentially contribute to structural integrity or molecular function. Moreover, there is ongoing debate about whether protein folds emerge as a result of evolution or as a consequence of physical laws. Energy landscape theory states that proteins are minimally frustrated systems, i.e., they fold by minimising their energetic conflicts. However, some local frustration, believed to be selected for functional reasons, remains in the native state of proteins. Here, we combine reverse folding and structure prediction methods with sequence and local frustration analysis to address these ideas. We found that reverse folding techniques are unable to erase evolutionarily conserved frustration from certain residues, even when it is detrimental to structural integrity. We propose that certain frustration hotspots behave like architectural spandrels: not directly shaped by selection, but emerging from physical constraints on protein folds that evolution can later co-opt for function. Our results provide a new perspective on how sequence variation and functional specificity could evolve from evolutionary, structural and biophysical constraints.
bioinformatics · 2026-02-11 · v1
Cigarette smoke induces colon cancer by regulating the gut microbiota and related metabolites
Li, W.; Bao, Y.-n.; Zhao, Q.; Yang, X.; Gong, Y.; Gan, B.
AI Summary
- This study investigated the link between cigarette smoke and colorectal cancer (CRC) using a mouse model, finding that smoke exposure increases CRC incidence by altering gut microbiota and related metabolites.
- Smoke exposure decreased beneficial bacteria like Lactobacillus, increased harmful bacteria like Firmicutes and Clostridium, and altered metabolites, while also downregulating tumor suppressor genes PARG, CPT2, and ALDH1A1.
- Functional assays confirmed that reduced CPT2 expression in CRC cells enhanced malignancy, and clinical data showed these genes were downregulated in smoking-related CRC patients, offering diagnostic potential.
Abstract
The causal relationship between smoking and colorectal cancer (CRC) remains unclear. In this study, a cigarette smoke-exposed mouse model demonstrated that smoking significantly increased CRC incidence by inducing gut microbiota dysbiosis and altering related metabolites. Smoke exposure reduced beneficial bacteria (e.g., Lactobacillus), increased harmful bacteria (e.g., Firmicutes and Clostridium), elevated metabolites such as histamine, and suppressed the tumor suppressor genes PARG, CPT2, and ALDH1A1, thereby promoting tumor development. Functional assays in CRC cell lines further confirmed that CPT2 knockdown enhanced malignant phenotypes, including proliferation, migration, and invasion. Clinical analysis showed that these genes were markedly downregulated in smoking-related CRC patients, with strong diagnostic value (AUC > 0.8).
bioinformatics · 2026-02-11 · v1
A global survey of Systems Biology-based predictions of gene-rare disease associations to enhance new diagnoses
Benitez, Y.; Uria-Regojo, G.; Minguez, P.
AI Summary
- The study aimed to enhance rare disease diagnosis by predicting gene-disease associations using a global, network-based approach.
- By analyzing functional neighborhoods of known disease genes, the research identified 192 genes linked to single diseases and 251 genes associated with specific disease classes.
- These findings were used to develop a gene-disease specificity score to improve variant prioritization in genetic diagnostics.
Abstract
In rare disease diagnosis, described genotype-phenotype associations are evaluated first. In the absence of strong evidence, WES and WGS provide hundreds to millions of additional genetic variants, most of them poorly annotated, that need to be prioritized. While several in silico approaches leverage existing gene-disease knowledge to predict novel associations, doing so in isolation can hide how different genes are represented across other predictions. We hypothesize that a global perspective, accounting for differences in the knowledge accumulated in the gene collections, can refine predictions. Using a network-based algorithm, we explored functional neighborhoods of known disease-associated genes to predict novel candidates for over 200 rare diseases. A global analysis of gene and protein family behavior across predictions identified genes and functions broadly associated with multiple conditions, 192 genes linked to a single disease and 251 genes functionally associated with specific classes of rare diseases. These findings are integrated into a gene-disease specificity score, aimed at enhancing variant prioritization and guiding geneticists in advancing candidate genes toward functional validation.
bioinformatics · 2026-02-11 · v1
SIPdb: A stable isotope probing database and analytical dashboard for linking amplicon sequences to microbial activity using a reverse ecology approach
Trentin, A. B.; Simpson, A.; Kimbrel, J. A.; Blazewicz, S. J.; Wilhelm, R. C.
AI Summary
- SIPdb is introduced as a SQLite database and RShiny dashboard for integrating stable isotope probing (SIP) data with microbial sequence data, standardizing 22 studies across 21 isotopolog substrates.
- The database uses a standardized pipeline to analyze SIP data, identifying over 42,000 unique amplicon sequence variants as isotope incorporators across 62 phyla, with ALDEx2 showing the highest specificity in performance.
- Validation showed SIPdb recovered 70.1% of reported incorporator taxa, and reanalysis of a non-SIP study identified additional candidate taxa for 1,4-dioxane degradation, enhancing ecological interpretation in microbiome research.
Abstract
Stable isotope probing (SIP) provides a powerful means to connect microbial sequence data with diverse metabolic activities, but the lack of a framework for SIP-derived data has limited its integration into broader strategies for ecological inference. Here, we introduce the SIPdb, an extensible SQLite database of curated nucleic acid SIP experiments (also in phyloseq format) paired with an interactive RShiny dashboard for analysis and visualization. The initial release compiles 22 studies covering 21 isotopolog substrates across diverse environments, with data standardized using the MISIP metadata standard. In creating the SIPdb, we have provided a standardized pipeline that accommodates the three most common SIP gradient fractionation strategies (binary, multi-fraction, and density-resolved), two isotope incorporator designation strategies (fixed- and sliding-window), and four complementary differential abundance methods (DESeq2, edgeR, limma-voom, and ALDEx2). Using our pipeline, we identified more than 42,000 unique amplicon sequence variants as isotope incorporators across 62 phyla. Benchmarking with synthetic datasets demonstrated consistent performance across incorporator designation strategies, with ALDEx2 providing the highest specificity. Validation against original publications showed that, on average, SIPdb recovered 70.1% of author-reported incorporator taxa, with discrepancies arising from differences in phylotyping or classification approaches. Finally, our reanalysis of a non-SIP study of 1,4-dioxane degradation showed how SIPdb can both validate known degraders and uncover additional candidate taxa involved in community metabolism. The SIPdb establishes a scalable platform for reverse ecology, enabling hypothesis generation, cross-study meta-analysis, and linking taxa to metabolic processes, while serving as an open, extensible resource to accelerate ecological interpretation in microbiome research.
bioinformatics · 2026-02-11 · v1
DIA-CLIP: a universal representation learning framework for zero-shot DIA proteomics
Liao, Y.; Wen, H.; E, W.; Zhang, W.
AI Summary
- The study introduces DIA-CLIP, a framework for zero-shot DIA proteomics that uses universal cross-modal representation learning to overcome the limitations of semi-supervised, run-specific training in DIA-MS analysis.
- DIA-CLIP employs a dual-encoder contrastive learning approach to align peptide sequences with spectral features, enabling high-precision peptide-spectrum match inference without run-specific retraining.
- Evaluations show DIA-CLIP increases protein identification by up to 45% and reduces false discovery rates by 17%, demonstrating superior performance over existing tools in diverse proteomic applications.
Abstract
Data-independent acquisition mass spectrometry (DIA-MS) has emerged as an indispensable cornerstone of quantitative proteomics, offering unparalleled depth and reproducibility. Current DIA identification pipelines, however, rely on semi-supervised, run-specific training for peptide-spectrum match (PSM) re-scoring. This approach is prone to overfitting and lacks generalizability across heterogeneous species and experimental conditions. Here, we present DIA-CLIP, a pre-trained framework that shifts the DIA analysis paradigm from run-specific refinement to universal cross-modal representation learning. By employing dual-encoder contrastive learning on large-scale PSM datasets, DIA-CLIP aligns peptide sequences with spectral features within a shared latent space, enabling high-precision, zero-shot PSM inference without run-specific re-training or fine-tuning. Extensive evaluations demonstrate that DIA-CLIP consistently outperforms state-of-the-art tools across diverse benchmarks, increasing protein identification by up to 45% for single-cell proteomics and reducing false discovery rates by 17% under challenging entrapment experiments. Moreover, DIA-CLIP holds strong potential for practical applications such as single-cell and spatial proteomics, where its enhanced identification depth facilitates the discovery of novel biomarkers and the elucidation of intricate cellular mechanisms.
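The dual-encoder contrastive objective described here is the CLIP-style symmetric InfoNCE loss over matched peptide/spectrum pairs. A minimal PyTorch sketch with toy MLP encoders; the architectures and dimensions are placeholders, not DIA-CLIP's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    """Toy peptide/spectrum encoders projecting into a shared latent space."""
    def __init__(self, pep_dim=128, spec_dim=256, latent=64):
        super().__init__()
        self.pep = nn.Sequential(nn.Linear(pep_dim, latent), nn.ReLU(),
                                 nn.Linear(latent, latent))
        self.spec = nn.Sequential(nn.Linear(spec_dim, latent), nn.ReLU(),
                                  nn.Linear(latent, latent))

    def forward(self, pep_x, spec_x):
        return (F.normalize(self.pep(pep_x), dim=-1),
                F.normalize(self.spec(spec_x), dim=-1))

def clip_loss(z_pep, z_spec, temperature=0.07):
    """Symmetric InfoNCE: matched pairs sit on the diagonal of the
    similarity matrix; all other in-batch pairs act as negatives."""
    logits = z_pep @ z_spec.t() / temperature
    targets = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

model = DualEncoder()
pep = torch.randn(32, 128)    # placeholder peptide features
spec = torch.randn(32, 256)   # placeholder spectrum features
loss = clip_loss(*model(pep, spec))
loss.backward()
print(float(loss))
```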
bioinformatics · 2026-02-11 · v1
BiOS: An Open-Source Framework for the Integration of Heterogeneous Biodiversity Data
Roldan, A.; Duran, T. G.; Far, A. J.; Capa, M.; Arboleda, E.; Cancellario, T.
AI Summary
- The study addresses the challenge of integrating heterogeneous biodiversity data by introducing BiOS, an open-source framework designed to harmonize datasets from taxonomy, genetics, to species distribution.
- BiOS features a modular architecture with a decoupled back-end for data management and a user-friendly front-end, offering both an API for developers and a web interface for general users.
- Key findings include BiOS's adherence to FAIR principles, enabling seamless data integration, and enhancing collaborative conservation efforts by overcoming data fragmentation.
Abstract
The era of Big Data has revolutionised biodiversity research, yet the potential of this information is frequently constrained by data heterogeneity, incompatible schemas, and the fragmentation of resources. Whilst standards such as Darwin Core have improved interoperability, significant barriers persist in harmonising multi-typology datasets ranging from taxonomy and genetics to species distribution. Here, we present the Biodiversity Observatory System (BiOS), a comprehensive, open-source software stack designed to address these impediments through a modular, community-driven architecture. BiOS departs from monolithic database designs by decoupling the back-end data management from the front-end presentation layer. This architectural separation supports a dual-access model tailored to diverse stakeholder needs. For researchers and developers, the system offers a comprehensive Application Programming Interface (API) that exposes all back-end functionalities, enabling seamless programmatic access, automated data retrieval, and integration with external analytical workflows. Simultaneously, the platform features a user-centric web interface designed to lower the technical barrier to entry. This interface facilitates intuitive data exploration through agile taxonomic navigation, advanced geospatial map viewers for species occurrence filtering, and dedicated dashboards for visualising genetic markers and legislative status. Strictly adhering to the FAIR principles (Findable, Accessible, Interoperable, Reusable), BiOS acts as a relational engine capable of integrating heterogeneous data streams. By providing a flexible, interoperable core that supports the "seven shortfalls" framework of biodiversity knowledge, BiOS offers a turnkey solution to overcome data fragmentation and enhance collaborative conservation efforts.
bioinformatics · 2026-02-11 · v1
VC-RDAgent: An efficient rare disease diagnosis agent via virtual case construction informed by hybrid statistical-metric and hyperbolic-semantic prioritization
Liu, Y.; Li, H.; Jiang, P.; Wu, L.; Xie, Z.; Ning, C.; Kong, X.; Wang, Y.; Zhang, X.; Huang, Z.
AI Summary
- VC-RDAgent addresses the challenge of rare disease diagnosis by creating virtual standardized cases, avoiding the need for real-world patient data due to its scarcity and privacy issues.
- The system uses VC-Ranker, which combines statistical-metric measures with hyperbolic-semantic embeddings to generate high-fidelity virtual references from knowledge bases.
- Testing on four datasets showed VC-RDAgent improved Top-1 hit rates by 8.7% to 85.9%, with VC-Ranker achieving a Top-10 hit rate of 0.819, surpassing previous methods by 6%.
Abstract
While Large Language Models (LLMs) have shown promise in clinical decision support, current Retrieval-Augmented Generation (RAG) paradigms face a fundamental bottleneck in rare disease diagnosis: the scarcity, privacy restrictions, and extreme heterogeneity of real-world patient records. This reliance on sparse or inaccessible data leads to a severe "retrieval mismatch," where the lack of high-quality reference cases causes diagnostic performance to degrade sharply. To break this deadlock, we propose VC-RDAgent, a privacy-preserving and offline-capable framework that decouples diagnostic reasoning from sensitive real-world records by synthesizing virtual standardized cases. The system is powered by VC-Ranker, a multi-dimensional engine that integrates statistical-metric measures with hyperbolic-semantic embeddings to capture deep hierarchical ontology relationships. This approach allows for the dynamic generation of high-fidelity virtual references directly from authoritative knowledge bases. Extensive benchmarking across four diverse datasets demonstrates that VC-RDAgent effectively functions as a "performance equalizer." It boosts average Top-1 hit rates by 8.7% to 85.9% over zero-case baselines, enabling lightweight open-source models to rival frontier commercial systems. Notably, VC-Ranker alone achieved an aggregate Top-10 hit rate of 0.819, outperforming prior state-of-the-art methods by 6%. By eliminating the dependency on real-time web retrieval and private case sharing, VC-RDAgent provides a scalable, robust, and clinically deployable solution to shorten the diagnostic odyssey, which is made accessible through an intuitive, chat-based web application https://rarellm.service.bio-it.tech/rdagent/.
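The hyperbolic-semantic component relies on distances in the Poincaré ball, which represent tree-like ontology hierarchies more faithfully than Euclidean space. A minimal sketch of Poincaré-distance ranking; the embeddings are random placeholders, and the exact scoring that VC-Ranker combines with its statistical metrics may differ.

```python
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """Geodesic distance in the Poincare ball:
    d(u, v) = arcosh(1 + 2*|u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))."""
    sq = np.sum((u - v) ** 2)
    denom = (1 - np.sum(u ** 2)) * (1 - np.sum(v ** 2))
    return float(np.arccosh(1 + 2 * sq / max(denom, eps)))

rng = np.random.default_rng(0)
query = rng.normal(size=16) * 0.1  # placeholder phenotype-profile embedding
diseases = {f"disease_{i}": rng.normal(size=16) * 0.1 for i in range(5)}

# Rank candidate diseases by hyperbolic proximity to the query profile.
ranked = sorted(diseases, key=lambda d: poincare_distance(query, diseases[d]))
print("closest candidates:", ranked[:3])
```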
bioinformatics · 2026-02-11 · v1
PlantMDCS: A code-free, modular toolkit for rapid deployment of plant multi-omics databases
Chen, C.; Liu, Y.; Wang, L.; Sai, J.; Wang, Y.; Yue, W.; Sun, J.; Li, Z.; Wang, F.; Tian, J.; Xu, D.; Fang, Y.
AI Summary
- PlantMDCS is a user-friendly, code-free toolkit designed for rapid deployment of plant multi-omics databases, addressing the challenge of managing and analyzing diverse omics data.
- It features a decoupled front-end/back-end architecture where the back end manages data storage, preprocessing, and integration, while the front end supports the entire research workflow from data import to visualization without programming.
- Benchmarking showed that PlantMDCS can construct databases in minutes across various plant species, enhancing efficiency, reproducibility, and data security through local deployment and controlled access.
Abstract
With the rapid accumulation of diverse omics datasets, achieving efficient management and integrative analysis of plant multi-omics data remains a major challenge. Conventional solutions rely on constructing web-based databases, which often demand substantial programming expertise and long-term financial support. To address these limitations, we developed the Plant Multi-omics Database Construction System (PlantMDCS), a locally deployable, user-friendly, and collaborative platform that unifies database construction and downstream multi-omics analysis within a graphical environment. PlantMDCS adopts a decoupled front-end/back-end architecture. The back end serves as the core engine for data management and computation, and is responsible for the storage, preprocessing, integration, and hierarchical association of multi-omics data. Once initialized, the front end supports the complete research workflow, including data import, querying, integrative analysis and visualization. All operations can be performed without programming, while local resource usage is dominated by disk storage required for user-provided datasets rather than sustained computational overhead. Benchmarking across plant species ranging from Arabidopsis to hexaploid wheat demonstrated that database construction can be completed within minutes, independent of genome size or data complexity. PlantMDCS is designed for local deployment to ensure data security, while allowing multi-user collaboration within local networks and supporting controlled remote access for teams distributed across different regions. Overall, PlantMDCS offers a secure and sustainable framework that integrates data management and analysis within a unified system. This design shifts multi-omics research away from fragmented file-based processing toward persistent, database-driven exploration, thereby enhancing analytical efficiency and reproducibility.
bioinformatics · 2026-02-11 · v1
A machine learning approach to identify key Epigenetic Transcripts for Ageing research in human blood (Epitage)
Benazzi Maia, T.; Pfeffer, U.
AI Summary
- This study analyzed the GSE87571 dataset to explore the relationship between transcript-level DNA methylation and chronological age in human blood, leading to the creation of Epitage.
- Epitage consists of 48 transcripts from 13 genes, identified via machine learning, with strong age correlation (R^2 >= 0.8), including novel markers like KCNS1, SPTBN4, and VTRNA1-2.
- An R package, ugPlot, was developed to automate model validation, enhancing reproducibility and efficiency in ageing research.
Abstract
DNA methylation is an established biomarker of human ageing, and analysing CpGs grouped by transcript as functional units may reveal new insights into the processes of ageing. In this study, we analyzed the GSE87571 dataset (714 samples from individuals aged 14-94 years) to assess the relationship between transcript-level methylation profiles and chronological age in human blood. This approach led to the creation of Epitage, a curated set of 48 transcripts from 13 genes identified through machine learning as having methylation profiles that strongly correlate with age (R^2 >= 0.8). This analysis highlighted transcripts from the genes KCNS1, SPTBN4, and VTRNA1-2, which have been only rarely mentioned as age-related methylation markers in humans, suggesting them as underexplored candidates for future investigation. In addition, the list includes genes already implicated in aging or related pathways, such as ELOVL2, FHL2, KLF14, TRIM59, MIR29B2CHG, CALB1, OBSCN, PRRT1, OTUD7A, and SYNGR3. To validate models efficiently while ensuring reproducibility, we developed ugPlot, an open-source R package with a graphical user interface (GUI) that automates routine steps for training and testing hundreds of machine-learning models. The tool also streamlines dataset import and manipulation, reducing human error and generating publication-ready plots. Epitage thus provides a focused and accessible starting point for experimental and translational studies into the roles of DNA methylation and transcript regulation in human ageing.
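The screening logic amounts to regressing age on the CpGs grouped under each transcript and keeping transcripts whose cross-validated fit clears R^2 >= 0.8. A minimal sketch with scikit-learn; the beta-value matrix, CpG-to-transcript grouping, and model are placeholders (on random data nothing should pass the cut).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n_samples = 714                             # sample count as in GSE87571
age = rng.uniform(14, 94, size=n_samples)   # placeholder ages

# Hypothetical beta-value matrix and CpG -> transcript grouping.
betas = rng.random(size=(n_samples, 300))
transcript_cpgs = {f"TX_{i}": list(range(i * 10, (i + 1) * 10))
                   for i in range(30)}

selected = {}
for tx, cols in transcript_cpgs.items():
    pred = cross_val_predict(LinearRegression(), betas[:, cols], age, cv=10)
    r2 = r2_score(age, pred)
    if r2 >= 0.8:                           # Epitage-style inclusion cut
        selected[tx] = r2
print(f"{len(selected)} transcripts pass R^2 >= 0.8")  # 0 expected on noise
```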
bioinformatics · 2026-02-11 · v1
A spectral framework for measuring diversity in multiple sequence alignments
Opuu, V.
AI Summary
- This study introduces Leff, a spectral measure to quantify the effective diversity in multiple sequence alignments (MSAs) by estimating the number of independent positions needed to capture observed diversity.
- Applied to RNA and protein MSAs, Leff reveals that evolutionary constraints significantly reduce diversity, with proteins showing even lower effective diversity due to stronger constraints.
- Leff correlates with protein structure prediction accuracy and quantifies diversity in experimental libraries, serving as a tool to guide future protein and RNA design.
Abstract
Machine learning (ML) methods for proteins and RNAs rely on multiple sequence alignments (MSAs) and related datasets such as experimental mutagenesis libraries, yet the amount of usable information they contain remains unclear. Here, a spectral measure of information is recast into an interpretable quantity for MSAs, denoted Leff, defined as the number of fully independent alignment positions that reproduce the observed sequence diversity. Applied to RNA MSAs, this measure shows that evolutionary constraints nearly halve diversity relative to the secondary structure alone, quantifying functional and phylogenetic restrictions beyond base pairing. The same analysis indicates even lower effective diversity in proteins, quantifying stronger physicochemical and evolutionary constraints on amino acids. Leff further correlates with protein structure prediction accuracy, anticipating cases with insufficient evolutionary signal. When applied to experimentally and computationally generated libraries, it measures both produced diversity and cross-library overlap, quantifying novelty rather than redundant sampling. Together, these results establish Leff as an operational tool to estimate effective information in MSAs, anticipate modeling difficulties, and guide future protein and RNA design.
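One natural reading of Leff is an effective-rank quantity: the exponential of the Shannon entropy of the normalized eigenvalue spectrum of the one-hot-encoded alignment. The sketch below implements that reading; it is an interpretation for illustration, not necessarily the paper's exact construction.

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY-"

def effective_positions(msa):
    """Effective number of independent positions, read as exp(spectral
    entropy) of the centered one-hot encoding. An interpretation of the
    idea behind Leff, not necessarily the paper's exact definition."""
    idx = {a: i for i, a in enumerate(AA)}
    X = np.zeros((len(msa), len(msa[0]) * len(AA)))
    for s, seq in enumerate(msa):
        for p, a in enumerate(seq):
            X[s, p * len(AA) + idx[a]] = 1.0
    X -= X.mean(axis=0)
    lam = np.linalg.svd(X, compute_uv=False) ** 2   # covariance eigenvalues
    lam = lam[lam > 1e-12]
    p = lam / lam.sum()
    return float(np.exp(-(p * np.log(p)).sum()))

msa = ["ACDE", "ACDF", "AGDE", "ACHE", "TCDE"]      # toy alignment
print(f"Leff-like effective positions: {effective_positions(msa):.2f}")
```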
bioinformatics · 2026-02-11 · v1
Metadiffusion: inference-time meta-energy biasing of biomolecular diffusion models
Lam, H. Y. I.; Pujalte Ojeda, S.; Brezinova, M.; Hanke, J.; Ong, X. E.; Mu, Y.; Vendruscolo, M.
AI Summary
- Metadiffusion introduces a meta-energy biasing layer to guide pretrained biomolecular diffusion models, enhancing exploration of conformational landscapes without retraining.
- The method generates diverse conformational ensembles that align with molecular dynamics simulations, supporting optimization, targeted steering, and exploration.
- It facilitates the study of collective variables, alternative binding poses, and ensemble generation consistent with SAXS and NMR data.
Abstract
Biomolecular function often depends on conformational ensembles, yet modern diffusion-based structure generators are biased toward the compact conformations prevalent in structural databases, limiting their ability to explore broad conformational landscapes. This work introduces metadiffusion, where an additional meta-energy biasing layer on top of diffusion steers pretrained biomolecular diffusion models through gradient-guided denoising. Without retraining, metadiffusion generates diverse conformational ensembles whose residue-level flexibility patterns closely match molecular dynamics simulations. The method supports three complementary modes: optimisation, steering to user-specified targets, and exploration via inter-sample repulsion. This approach enables controlled exploration of collective variables, enumeration of alternative binding poses across proteins, nucleic acids and ligands, and conformational ensemble generation consistent with SAXS and NMR chemical shifts. Metadiffusion thus provides a practical route to connect diffusion-based structure generation with ensemble-level, experimentally-restrained structural analysis.
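Gradient-guided denoising with a meta-energy bias means adding the negative gradient of a user-defined energy to each reverse step. A toy 2-D Langevin sketch with an analytic base score standing in for a pretrained diffusion model; the harmonic bias plays the role a SAXS/NMR restraint or collective-variable target would.

```python
import numpy as np

rng = np.random.default_rng(0)

def base_score(x):
    """Score of a unit Gaussian at the origin (stand-in for the pretrained
    model's learned score, which favors compact database-like states)."""
    return -x

def meta_energy_grad(x, target=np.array([3.0, 0.0]), k=0.8):
    """Gradient of a harmonic meta-energy pulling samples toward `target`
    (illustrative bias; experimental restraints would enter the same way)."""
    return k * (x - target)

def sample(steps=500, dt=0.01, bias=True):
    x = rng.normal(size=2) * 3
    for _ in range(steps):
        drift = base_score(x) - (meta_energy_grad(x) if bias else 0.0)
        x = x + drift * dt + np.sqrt(2 * dt) * rng.normal(size=2)
    return x

unbiased = np.mean([sample(bias=False) for _ in range(200)], axis=0)
biased = np.mean([sample(bias=True) for _ in range(200)], axis=0)
print("unbiased mean:", unbiased.round(2), "| biased mean:", biased.round(2))
```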
bioinformatics · 2026-02-11 · v1
Large-scale quantum computing framework enhances drug discovery in multiple stages
Wen, K.; Zha, J.; Chen, S.; Zhong, J.; Yuan, L.; Cui, Y.; Shi, X.; Qin, W.; Lan, X.; Liu, Y.; Yang, X.; Qin, H.; Li, M.; Guo, P.; Xiao, Q.; Wu, T.; Zhou, Y.; Cao, C.; Ning, S.; Wu, C.; Gao, Q.; He, H.; Ma, Y.; An, Z.; Liu, X.; Chen, Y.; Zheng, Z.; Wei, H.; Ma, Y.; Zhang, J.
AI Summary
- The study improved the stability of a 2000-node Coherent Ising Machine (CIM), named QBoson-CPQC-3Gen, through enhanced vibration isolation and temperature control, allowing stable solutions for over an hour.
- A CIM-based framework was developed for computer-aided drug discovery (CADD), incorporating graph-based encoding for tasks like allosteric site detection and protein-peptide docking.
- This framework outperformed heuristic algorithms in speed and accuracy, identifying 2 novel druggable sites and bioactive compounds for 6 targets, validated through in vitro, in-cell, and crystallographic methods.
Abstract
Coherent Ising machines (CIMs) excel at solving large-scale combinatorial optimization problems (COPs), but their insufficient long-term stability has hindered their application to compute-intensive tasks like computer-aided drug discovery (CADD). By improving fiber vibration isolation and the temperature control system, we have implemented a 2000-node CIM, named QBoson-CPQC-3Gen, that achieves stable solutions for over one hour on large-scale COPs. Graph-based encoding schemes were further introduced to realize a CIM-based CADD workflow including allosteric site detection, protein-peptide docking and intermolecular similarity calculation. CIM-based methods demonstrated superior speed and accuracy compared with heuristic algorithms. Notably, QBoson-CPQC-3Gen identified 2 novel druggable sites and bioactive compounds for 6 targets, which were further validated in vitro, in cells, and by crystal structures. Our contributions established a quantum-computing framework for multi-stage drug discovery, representing a significant advancement in both quantum computing applications and pharmaceutical research.
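Graph-based encoding for a CIM means writing the task as an Ising Hamiltonian H(s) = -Σ J_ij s_i s_j - Σ h_i s_i over spins s_i in {-1, +1}. The brute-force toy below builds and minimizes a small random instance; on hardware, the CIM performs the same minimization analogically on problem-derived couplings.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 10
J = np.triu(rng.normal(size=(n, n)), k=1)  # random couplings (toy instance)
h = rng.normal(size=n)

def ising_energy(s):
    """H(s) = -sum_ij J_ij s_i s_j - sum_i h_i s_i, with s_i in {-1, +1}."""
    return -s @ J @ s - h @ s

# Exhaustive search over 2**n spin configurations stands in for the CIM's
# analog ground-state search.
best = min((np.array(bits) * 2 - 1
            for bits in itertools.product([0, 1], repeat=n)),
           key=ising_energy)
print("ground-state spins:", best)
print("energy:", round(float(ising_energy(best)), 3))
```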
bioinformatics · 2026-02-11 · v1
BRIDGE: Biological Antimicrobial Resistance Inference via Domain-Knowledge Graph Embeddings
Iyer, A.; Kazeem, Y.; Kafaie, S.; Rajabi, E.
AI Summary
- The study introduces BRIDGE, a knowledge graph-based framework to enhance the prediction of antimicrobial resistance genes (ARGs) by integrating gene neighbourhood information and protein-protein interactions.
- Focused on Klebsiella pneumoniae and Escherichia coli, BRIDGE uses data from CARD, STRING, and DrugBank to construct a knowledge graph.
- Applying graph embedding models and deep neural networks, BRIDGE achieved a classification accuracy of up to 97% in predicting novel AMR links, demonstrating improved predictive accuracy and interpretability.
Abstract
Antimicrobial resistance (AMR) is a growing global health crisis, responsible for an estimated 1.27 million deaths in 2019 alone. Traditional approaches to identifying antibiotic resistance genes (ARGs) are often labour-intensive and limited in their ability to detect novel resistance mechanisms. In this study, we propose BRIDGE, a knowledge graph-based framework, to improve AMR gene prediction by integrating gene neighbourhood information and protein-protein interaction networks. Focusing on Klebsiella pneumoniae and Escherichia coli, we construct a comprehensive and biologically grounded knowledge graph using curated data from CARD, STRING, and DrugBank. We apply knowledge graph embedding models which are fed into deep neural networks to infer novel AMR links, achieving classification accuracy of up to 97%. Our results demonstrate that incorporating biologically meaningful relationships, such as gene neighbourhood information and protein interactions, enhances the predictive accuracy and interpretability of AMR link predictions. This work contributes to the development of scalable and data-integrated approaches for advancing antimicrobial resistance surveillance and drug discovery.
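Knowledge-graph embedding followed by link classification can be sketched with a TransE-style scorer, where a triple (gene, relation, drug) is plausible when head + relation lands near tail. The entities and vectors below are random placeholders; BRIDGE feeds embeddings of its CARD/STRING/DrugBank graph into deep neural networks rather than thresholding raw scores.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 32

# Hypothetical entities and a single relation; in practice these vectors
# come from training a KG embedding model on the curated graph.
entities = {e: rng.normal(scale=0.1, size=DIM)
            for e in ["blaKPC", "acrB", "K.pneumoniae",
                      "meropenem", "tigecycline"]}
relations = {"confers_resistance_to": rng.normal(scale=0.1, size=DIM)}

def transe_score(head, rel, tail):
    """TransE plausibility: higher (less negative) when head + rel ~ tail."""
    return -np.linalg.norm(entities[head] + relations[rel] - entities[tail])

# Rank candidate drugs for a gene; downstream, such scores or the raw
# embeddings would feed a neural link classifier.
for drug in ["meropenem", "tigecycline"]:
    print(drug, round(transe_score("blaKPC", "confers_resistance_to", drug), 3))
```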
bioinformatics · 2026-02-11 · v1
SCALPEL: A pipeline for processing large-scale spatial transcriptomics data
Kunst, M.; Ching, L.; Quon, J.; Mathieu, R.; Hewitt, M.; Seeman, S.; Ayala, A.; Gelfand, E.; Long, B.; Martin, N.; Nagra, J.; Olsen, P.; Oyama, A.; Valera, N.; Pagen, C.; Sunkin, S.; Ariza, J.; Smith, K.; McMillen, D.; Zeng, H.; Waters, J.
AI Summary
- SCALPEL is a pipeline designed for processing large-scale spatial transcriptomics data, featuring 3D segmentation, refined filtering, doublet detection, and cell type label transfer.
- It includes spatial domain detection and registration to the Allen Mouse Brain CCFv3, with genome-wide expression imputation from scRNAseq.
- Benchmarking against a previous dataset showed improvements in cell number, expression clarity, and spatial registration, setting a new standard for spatial transcriptomics studies.
Abstract
Spatial transcriptomics enables the precise mapping of gene expression patterns within tissue architecture, offering unprecedented insights into cellular interactions, tissue heterogeneity, and disease pathology that are unattainable with traditional transcriptomic approaches. We present a tool for processing spatial transcriptomics data, SCALPEL (Spatial Cell Analysis, Labeling, Processing, and Expression Linking). SCALPEL is specifically designed to support the analysis of large, atlas-level datasets. Our new workflow features advanced 3D segmentation optimized for dense and heterogeneous tissues, refined filtering criteria, and transcriptome-based doublet detection to remove low-quality or artifactual cells. Cell type label transfer from existing taxonomies is further improved through updated filtering thresholds. Spatial domain detection is incorporated to capture local transcriptomic organization, and tissue sections are registered to the Allen Mouse Brain Common Coordinate Framework version 3 (CCFv3) for precise anatomical alignment. Genome-wide expression imputation from single-cell RNA-sequencing (scRNAseq) further enriches the dataset. Crucially, we benchmark the performance of this updated pipeline against a previously published version of our whole-mouse-brain (WMB) dataset (Yao et al., 2023b), demonstrating substantial improvements in cell number, expression profile clarity, and spatial registration. These advances provide a robust foundation for downstream spatial analyses and set a new standard for large-scale spatial transcriptomics studies.
bioinformatics · 2026-02-10 · v3
ETSAM: Effectively Segmenting Cell Membranes in cryo-Electron Tomograms
Selvaraj, J.; Cheng, J.
AI Summary
- This study introduces ETSAM, a two-stage AI method based on SAM2, designed to segment cell membranes in cryo-ET tomograms.
- ETSAM was trained on 83 experimental and 28 simulated tomograms, achieving state-of-the-art performance on an independent test set of 10 tomograms.
- It significantly outperforms existing methods by providing high sensitivity and precision in membrane segmentation despite challenges like low signal-to-noise ratio and missing wedge artifacts.
Abstract
Cryogenic Electron Tomography (cryo-ET) is an emerging experimental technique to visualize cell structures and macromolecules in their native cellular environment. Accurate segmentation of cell structures in cryo-ET tomograms, such as cell membranes, is crucial to advance our understanding of cellular organization and function. However, several inherent limitations in cryo-ET tomograms, including the very low signal-to-noise ratio, missing wedge artifacts from limited tilt angles, and other noise artifacts, collectively hinder the reliable identification and delineation of these structures. In this study, we introduce ETSAM - a two-stage Segment Anything Model 2 (SAM2)-based fine-tuned AI method that effectively segments cell membranes in cryo-ET tomograms. It is trained on a diverse dataset comprising 83 experimental tomograms from the CryoET Data Portal (CDP) database and 28 simulated tomograms generated using PolNet. ETSAM achieves state-of-the-art performance on an independent test set comprising 10 experimental tomograms for which ground-truth annotations are available. It robustly segments cell membranes with high sensitivity and precision, significantly outperforming existing deep learning methods.
bioinformatics · 2026-02-10 · v2
Systems Level Analysis of Gene, Pathway and Phytochemical Associations with Psoriasis
Ray, S.; Dutta, O.; Kousoulas, K. G.; Apostolopoulos, N.; Chamcheu, J. C.; Kaur, R.
AI Summary
- The study used a systems biology approach to analyze gene expression and pathways in psoriatic lesions, identifying key roles of type I/III interferon signaling, AP-1, and CREB1.
- It highlighted seven phytochemicals with potential multi-target activity against psoriasis, focusing on the IL-17/TNF-interferon-AP-1/CREB1-COX-2/MMP9 axis.
- Protopine and atractylon were suggested as promising candidates for topical treatment due to favorable ADMET properties, with further validation needed in skin models.
Abstract
Psoriasis is an inflammatory skin disorder driven by abnormal immune activation that promotes excessive proliferation and accelerated turnover of epidermal keratinocytes. IL-17 and TNF pathways are well known in psoriasis, but the other mechanisms that keep the disease active and link it to systemic comorbidities are not yet fully understood. A combined transcriptomic and systems biology framework was applied to map regulatory circuits in psoriatic lesions and to identify phytochemical candidates capable of multi-target modulation for topical intervention. Differential gene expression between lesional and healthy skin was analyzed, followed by pathway enrichment, upstream regulator inference, protein-protein interaction network, and chemical-gene interaction mapping. This integrative strategy revealed a transcriptional landscape dominated by type I/III interferon signaling, antiviral and antimicrobial responses, immune metabolic dysregulation, and transcriptional hubs centered on AP-1 and CREB1. Several genes and upstream regulators not previously associated with psoriasis were identified within inflammatory and cell migration-related modules, indicating unexplored regulatory layers in disease control. Network-guided chemical prioritization and direction-of-effect filtering highlighted seven phytochemicals (mahanine, atractylon, protopine, annomontine, taraxasterol, tricin, and tamarixetin) with multi-target activity across key disease axes. ADMET-based screening suggested protopine and atractylon as favorable candidates for topical delivery, while synergy modeling supported flavonoid-alkaloid combination designs. This multi-layered approach provides mechanistically informed phytochemicals targeting the IL-17/TNF-interferon-AP-1/CREB1-COX-2/MMP9 axis in psoriasis. Experimental validation in keratinocyte and organotypic skin models will be required to determine whether these compounds, individually or in combination, can effectively restore psoriatic signaling in vivo.
bioinformatics · 2026-02-10 · v2
Autoregressive forecasting of future single-cell state transitions
Luo, E.; Gao, H.; Bian, H.; Li, Y.; Li, C.; Hao, M.; Chen, M.; She, Y.; Wei, L.; Liu, K.; Zhang, X.
AI Summary
- The study introduces CellTempo, a temporal generative AI model designed to forecast future cellular dynamics from static single-cell RNA-sequencing data.
- CellTempo uses learned semantic codes and an autoregressive decoder to predict long-range cell-state transitions.
- Experiments demonstrated that CellTempo accurately forecasts cell state evolutions and reconstructs cell-state landscapes post-perturbations, aligning well with biological realities.
Abstract
Existing methods for dynamic analysis of static single-cell RNA-sequencing data can reconstruct temporal structures covered by observed cells, but cannot forecast unobserved future state transitions. We propose a temporal generative AI model, CellTempo, to forecast future cellular dynamics by representing cells as learned semantic codes and training an autoregressive generation decoder to predict ordered code sequences. It can forecast long-range cell-state transition trajectories and landscapes from snapshot data. To train the model, we constructed a comprehensive single-cell trajectory dataset scBaseTraj by integrating RNA velocity, pseudotime, and inferred transition probabilities to compose multi-step cellular sequences. Experiments on multiple real datasets showed that CellTempo can forecast cell state evolutions from individual cells, and reconstruct nuanced cell-state potential landscapes and their varied progressions after genetic or chemical perturbations, all with high fidelity to biological truth. This work opens a route for forecasting unseen future dynamics of cell state transitions from static observations.
bioinformatics · 2026-02-10 · v1
Optimizing Protein Tokenization: Reduced Amino Acid Alphabets for Efficient and Accurate Protein Language Models
Rannon, E.; Burstein, D.
AI Summary
- This study explores the use of reduced amino acid alphabets combined with Byte Pair Encoding (BPE) tokenization in protein language models (pLMs) to optimize efficiency.
- RoBERTa-based pLMs were pre-trained using various reduced alphabets and evaluated on multiple tasks.
- Results indicated that reduced alphabets significantly shortened input sequences, sped up training and inference, and maintained or improved performance compared to models using the full 20-amino-acid alphabet.
Abstract
Protein language models (pLMs) typically tokenize sequences at the single-amino-acid level using a 20-residue alphabet, resulting in long input sequences and high computational cost. Sub-word tokenization methods such as Byte Pair Encoding (BPE) can reduce sequence length but are limited by the sparsity of long patterns in proteins encoded by the standard amino acid alphabet. Reduced amino acid alphabets, which group residues by physicochemical properties, offer a potential solution, but their performance with sub-word tokenization has not been systematically studied. In this work, we investigate the combined use of reduced amino acid alphabets and BPE tokenization in protein language models. We pre-trained RoBERTa-based pLMs de novo using multiple reduced alphabets and evaluated them across diverse downstream tasks. Our results show that reduced alphabets enable substantially shorter input sequences and faster training and inference, while maintaining comparable, and in some cases improved, performance relative to models trained on the full 20-amino-acid alphabet. These findings demonstrate that alphabet reduction facilitates more effective sub-word tokenization and provides a favorable trade-off between efficiency and predictive accuracy.
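A minimal sketch of the idea: remap residues into a reduced alphabet, then learn BPE merges over the remapped corpus. The six-symbol physicochemical grouping and the toy BPE learner below are illustrative assumptions, not the alphabets or tokenizer implementation evaluated in the paper:

```python
# Sketch: remap a protein sequence to a reduced alphabet, then learn a few
# BPE merges. The six-group alphabet below is one illustrative physicochemical
# grouping; the paper evaluates several alternatives.
from collections import Counter

GROUPS = {"A": "h", "V": "h", "L": "h", "I": "h", "M": "h", "C": "h",  # hydrophobic
          "F": "a", "W": "a", "Y": "a",                               # aromatic
          "K": "+", "R": "+", "H": "+",                               # basic
          "D": "-", "E": "-",                                         # acidic
          "S": "p", "T": "p", "N": "p", "Q": "p",                     # polar
          "G": "g", "P": "g"}                                         # special

def reduce_seq(seq: str) -> list[str]:
    return [GROUPS[aa] for aa in seq]

def learn_bpe(corpus: list[list[str]], n_merges: int) -> list[tuple[str, str]]:
    """Greedily merge the most frequent adjacent symbol pair (toy BPE)."""
    merges = []
    for _ in range(n_merges):
        pairs = Counter(p for seq in corpus for p in zip(seq, seq[1:]))
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        merged = best[0] + best[1]
        for i, seq in enumerate(corpus):
            out, j = [], 0
            while j < len(seq):
                if j + 1 < len(seq) and (seq[j], seq[j + 1]) == best:
                    out.append(merged); j += 2
                else:
                    out.append(seq[j]); j += 1
            corpus[i] = out
    return merges

corpus = [reduce_seq(s) for s in ["MKVLAA", "MKVLSS", "MKVLDE"]]
print(learn_bpe(corpus, n_merges=3), corpus)  # shorter token sequences
```

Because the reduced alphabet collapses residues into recurring classes, long patterns become far less sparse, which is exactly what lets BPE find useful multi-residue tokens.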
bioinformatics2026-02-10v1PEhub resolves the hierarchical regulatory architecture of multi-way enhancer hubs in the human brain
Tan, J.; Sun, Y.AI Summary
- PEhub is a new framework that resolves multi-way enhancer hubs from chromatin interaction data by modeling synergistic enhancer cooperation and accounting for interaction decay.
- Using H3K27ac HiChIP data, PEhub identified and validated promoter-anchored enhancer hubs in six human brain regions, showing they correspond to real multi-way chromatin assemblies.
- Enhancer hubs were found to be associated with increased transcription, hierarchical organization, and linked to genetic risk and transcription factor deployment in brain regions.
Abstract
Chromatin interaction assays capture regulatory architecture as stochastic pairwise contacts, limiting the ability to resolve how multiple enhancers cooperatively regulate transcription. Here we introduce a promoter-centric quantitative framework, termed PEhub, that resolves multi-way enhancer hubs as higher-order regulatory units from chromatin interaction data. By reparameterizing stochastic pairwise ligation events into promoter-conditioned enhancer networks, our approach explicitly models synergistic enhancer cooperation while accounting for distance-dependent interaction decay through a statistically principled null model. Using H3K27ac HiChIP data, we identify promoter-anchored enhancer hubs and validate their physical existence with single-molecule Pore-C, demonstrating that inferred hubs correspond to bona fide multi-way chromatin assemblies. Application to six human brain regions reveals that enhancer hubs are associated with elevated transcriptional output and exhibit a hierarchical organization spanning shared, circuit-specific, and region-restricted regulatory programs. This architecture hierarchically stratifies genetic risk and transcription factor deployment, linking three-dimensional genome organization to transcriptional control and disease-associated variation. Together, this promoter-centric framework provides a generalizable strategy for resolving higher-order regulatory architecture from 3D genome data and establishes multi-way enhancer hubs as a functionally and genetically meaningful layer of transcriptional regulation in complex tissues.
bioinformatics2026-02-10v1Token Alignment for Verifying LLM-Extracted Text
Booeshaghi, A. S.; Streets, A. M.AI Summary
- The study investigates improving the verification of text extracted by large language models (LLMs) by aligning extracted text with the original source, focusing on discontiguous phrases.
- Using LLM-specific tokenizers and ordered alignment algorithms, the approach improved alignment accuracy by about 50% over word-level tokenization.
- The effectiveness was demonstrated with the introduction of the BOAT and BIO-BOAT datasets, showing ordered alignment as the most practical method for this task.
Abstract
Large language models excel at text extraction, but they sometimes hallucinate. A simple way to avoid hallucinations is to remove any extracted text that does not appear in the original source. This is easy when the extracted text is contiguous (findable with exact string matching), but much harder when it is discontiguous. Techniques for finding discontiguous phrases depend heavily on how the text is split, i.e., how it is tokenized. In this study, we show that splitting text along subword boundaries with LLM-specific tokenizers, and aligning extracted text with ordered alignment algorithms, improves alignment by about 50% compared to word-level tokenization. To demonstrate this, we introduce the Berkeley Ordered Alignment of Text (BOAT) dataset, a modification of the Stanford Question Answering Dataset (SQuAD) that includes non-contiguous phrases, and BIO-BOAT, a biomedical variant built from 51 bioRxiv preprints. We show that text-alignment methods form a partially ordered set, and that ordered alignment is the most practical choice for verifying LLM-extracted text. We implement this approach in taln, which enumerates ordinal subword alignments.
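The verification idea can be sketched as an ordered alignment (longest common subsequence) between token lists: a discontiguous extraction is supported if its tokens appear in order in the source. The sketch below uses a plain whitespace tokenizer for brevity; the paper's result is that LLM-specific subword tokenizers (as implemented in taln) do substantially better:

```python
# Sketch of ordered-alignment verification: how much of an LLM-extracted
# phrase can be matched, in order, inside the source text? A plain word
# tokenizer is used here for brevity; swapping in an LLM-specific subword
# tokenizer is the paper's recommendation.

def lcs_len(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence (ordered alignment)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def verify(extracted: str, source: str) -> float:
    ext, src = extracted.split(), source.split()
    return lcs_len(ext, src) / len(ext)  # 1.0 = fully supported by the source

source = "the enzyme degrades PET films at elevated temperature"
print(verify("enzyme degrades films", source))     # 1.0: discontiguous but in order
print(verify("films degrade the enzyme", source))  # 0.5: order violated
```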
bioinformatics2026-02-10v1bMINTY: Enabling Reproducible Management of High-Throughput Sequencing Analysis Results and their Metadata
Kapelios, K.; Xiropotamos, P.; Manousaki, H.; Sinnis, C.; Kotsira, V.; Dalamagas, T.; Georgakilas, G. K.AI Summary
- The study addresses the challenge of managing high-throughput sequencing data by introducing bMINTY, a web application for structured management of post-alignment data and metadata.
- bMINTY allows for the integration of study, assay, and analysis metadata into a single, portable, queryable resource, enhancing data reuse and reproducibility.
- Users can export data in RO-Crate format, facilitating machine-readable data packages for publication, thereby promoting FAIR science principles.
Abstract
Due to the large scale of high-throughput sequencing data generation, the community and publishers have established standards for the dissemination of studies that produce and analyze these data. Despite efforts towards Findable, Accessible, Interoperable and Reproducible (FAIR) science, critical obstacles remain. Best practices are not consistently enforced by scientific publishers, and when they are, essential information is fragmented across the methods section, supplementary materials, and public repositories. When attempting to reproduce scientific findings or reuse published data or analyses, researchers often avoid analyzing sequencing data from the ground up. Instead, they prefer to start directly from the post-sequence-alignment information (e.g., gene expression matrices in transcriptomics). However, existing repositories and workflow-oriented solutions rarely provide a single, portable, queryable resource that integrates this information with the metadata required for downstream reuse. We introduce bMINTY, a locally deployed web application with an intuitive user interface for structured management of post-alignment workflow data outputs. bMINTY supports metadata for studies, assays, and analysis assets, including workflows, genome assemblies, genomic intervals, and cell-level entities for single-cell assays. Users may export query results in RO-Crate format, providing machine-readable data packages and metadata. To the best of our knowledge, bMINTY is the first framework to bundle all this information in publication-ready, portable packaging designed for reuse. These packages can be included as supplementary material with each publication, accompanied by analysis code deposited in public repositories for downstream ad hoc analyses. Together, these practices can promote transparency, efficient reuse of published data, and FAIR-aligned scientific reproducibility.
bioinformatics2026-02-10v1SenNet Portal: Build, Optimization and Usage
Borner, K.; Blood, P. D.; Silverstein, J. C.; Ruffalo, M.; Satija, R.; Gehlenborg, N.; Honick, B.; Bueckle, A.; Jain, Y.; Qaurooni, D.; Shirey, B.; Sibilla, M.; Metis, K.; Bisciotti, J.; Morgan, R. S.; Betancur, D.; Sablosky, G. R.; Turner, M. L.; Kim, S.-J.; Lee, P. J.; Bartz, J.; Domanskyi, S.; Peters, S. T.; Enninful, A.; Farzad, N.; Fan, R.; SenNet Team; Herr, B. W.AI Summary
- The SenNet Program addresses the challenge of studying cellular senescence by generating multimodal datasets across human and mouse tissues.
- The SenNet Data Portal provides open access to these datasets, including single-cell, spatial, imaging, transcriptomic, and proteomic data, along with senescence biomarker catalogs and standardized protocols.
- The portal, built on a scalable hybrid cloud architecture, supports data submission, analysis, and cross-species mapping, with applications in biomarker discovery and spatial analysis.
Abstract
Cellular senescence is a hallmark of aging and a driver of functional decline across tissues, yet its heterogeneity and context dependence have limited systematic study. The Common Fund Cellular Senescence Network (SenNet) Program addresses this challenge by generating multimodal, multi-tissue datasets that profile senescent cells across the human lifespan and complementary mouse models. The SenNet Data Portal (https://data.sennetconsortium.org) serves as the public gateway to these resources, providing open access to harmonized single-cell, spatial, imaging, transcriptomic, and proteomic data; senescence biomarker catalogs; and standardized protocols that can be used to comprehensively identify and characterize senescent cells in mouse and human tissue. As of January 2026, the portal hosts 1,753 publicly available human and mouse datasets across 15 organs using 6 general assay types. Experts from 13 Tissue Mapping Centers (TMCs) and 12 Technology Development and Application (TDA) components contribute tissue data, analyze data, identify senescence biomarkers, and agree on panels for cross-tissue antibody harmonization. They also register human tissue data into the Human Reference Atlas (HRA) and develop user interfaces for the multiscale and multimodal exploration of this data. Built on a scalable hybrid cloud microservices architecture by the Consortium Organization and Data Coordinating Center (CODCC), the Portal enables data submission, management, integrated analysis, spatial context mapping, and cross-species senescence mapping critical for aging research. This paper presents user needs, the Portal architecture, data processing workflows, and senescence-focused analytical tools. The paper also presents usage scenarios illustrating applications in biomarker discovery, quality benchmarking, hypothesis generation, spatial analysis, cost-efficient profiling, and cell distance distribution analysis. Current limitations and planned extensions, including expanded spatial-omics releases and improved tools for senotype characterization, are discussed. SenNet protocols, code, and user interfaces are freely available at https://docs.sennetconsortium.org/apis.
bioinformatics2026-02-10v1Using user-centered design to better understand challenges faced during genetic analyses by novice genomic researchers
Patel, H.; Crosslin, D.; Jarvik, G. P.; Hall, T.; Veenstra, D.; Xie, S.AI Summary
- This study aimed to understand the challenges novice genomic researchers (NGRs) face with bioinformatics tools by using a user-centered design approach.
- A literature review and semi-structured interviews were conducted to identify issues like poor documentation, installation difficulties, and unclear error messages.
- An evaluation rubric was developed to assess bioinformatics tools, aiming to improve usability for both NGRs and experienced users.
Abstract
The lack of user-centered design principles in the current landscape of commonly-used bioinformatics software tools poses challenges for novice genomics researchers (NGRs) entering the genomics ecosystem. Comparing the usability of one analysis software package to that of another is a non-trivial task and requires evaluation criteria that incorporate perspectives from both existing literature and a diverse, underrepresented user base of NGRs. To better characterize these barriers, we utilized a two-pronged approach consisting of a literature review of existing bioinformatics tools and semi-structured interviews about the needs of NGRs. From both knowledge sources, the key attributes that led to poor adoption and limited sustained use of most bioinformatics tools included poor documentation, lack of readily-accessible informational content, challenges with installation and dependency coordination, and inconsistent error messages/progress indicators. Combining the findings from the literature review and the insights gained by interviewing the NGRs, an evaluation rubric was created that can be utilized to grade existing and future bioinformatics tools. This rubric acts as a summary of key components needed for software tools to cater to the diverse needs of both NGRs and experienced users. Due to the rapidly evolving nature of genomics research, it becomes increasingly important to critically evaluate existing tools and develop new ones that will help build a strong foundation for future exploration.
bioinformatics2026-02-10v1PRIZM: Combining Low-N Data and Zero-shot Models to Design Enhanced Protein Variants
Harding-Larsen, D.; Lax, B. M.; Garcia, M. E.; Mendonca, C.; Mejia-Otalvaro, F.; Welner, D. H.; Mazurenko, S.AI Summary
- PRIZM is a two-phase workflow that uses a small, high-quality dataset to select the best pre-trained zero-shot model for predicting protein variant effects.
- It then applies this model to rank and prioritize variants for experimental testing.
- In case studies, PRIZM improved enzyme variants, achieving a 3°C increase in thermostability and a 20% increase in activity.
Abstract
Machine learning has repeatedly shown the ability to accelerate protein engineering, but many approaches demand large amounts of robust, high-quality training data as well as substantial computational expertise. While large pre-trained models can function as zero-shot proxies for predicting variant effects, selecting the best model for a given protein property is often non-trivial. Here, we introduce Protein Ranking using Informed Zero-shot Modelling (PRIZM), a two-phase workflow that first uses a high-quality low-N dataset to identify the most suitable pre-trained zero-shot model for a target protein property and then applies that model to rank and prioritize an in silico variant library for experimental testing. Across diverse benchmark datasets spanning multiple protein properties, PRIZM reliably separated low- from high-performing models using datasets of ~20 labelled variants. We further demonstrate PRIZM in enzyme engineering case studies targeting sucrose synthase thermostability and glycosyltransferase activity, where PRIZM-guided selection identified improved variants, including gains of ~3°C in apparent melting temperature and ~20% higher relative activity. PRIZM provides an accessible, data-efficient route to leverage foundation models for protein design while requiring minimal experimental data.
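The two-phase logic is simple enough to sketch: score a small labelled set with several candidate models, keep the model with the best rank correlation, then rank the full library with that model alone. The snippet below fakes the zero-shot scores with random numbers; it shows the selection mechanics, not real pLM outputs:

```python
# Sketch of the two-phase PRIZM idea (illustrative; model scores are faked
# with random numbers rather than real zero-shot pLM outputs).
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
n_low_n, n_library = 20, 1000

# Phase 1: ~20 labelled variants and zero-shot scores from several models.
fitness = rng.normal(size=n_low_n)
model_scores = {"model_A": fitness + rng.normal(scale=0.5, size=n_low_n),  # informative
                "model_B": rng.normal(size=n_low_n)}                       # uninformative
best = max(model_scores, key=lambda m: spearmanr(model_scores[m], fitness)[0])
print("selected:", best)  # rank correlation picks the informative model

# Phase 2: rank the full in silico library with the selected model only.
library_scores = rng.normal(size=n_library)  # stand-in for the best model's scores
top = np.argsort(library_scores)[::-1][:10]
print("variants to test:", top)
```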
bioinformatics2026-02-10v1An Integrated Pipeline for Cell-Type Annotation, Metabolic Profiling, and Spatial Communication Analysis in the Liver using Spatial Transcriptomics
Zhang, C.; Li, J.; Luo, O.; Andrews, T.; Steinberg, G. R.; Wang, D.AI Summary
- The study presents a protocol for analyzing spatial transcriptomics (ST) data in liver tissues from MASLD mouse models to understand liver metabolism.
- The approach includes single-cell RNA-seq referencing, manual annotation with curated liver cell type markers, and metabolic gene set analysis.
- Key findings include the provision of tools for researchers to decode metabolic reprogramming and cellular heterogeneity in liver health and disease.
Abstract
The liver acts as a central metabolic hub, integrating systemic signals through a spatially organized pattern known as zonation, driven by the coordinated activity of diverse cell types including hepatocytes, stellate cells, Kupffer cells, endothelial cells, and immune populations. Spatial transcriptomics (ST) enables the profiling of thousands of cells with spatial resolution in a single experiment, facilitating the identification of novel gene markers, cell types, cellular states, and tissue neighborhoods across diverse tissues and organisms. By simultaneously capturing transcriptional and spatial heterogeneity, ST has become a powerful tool for understanding cellular and tissue biology. Given its advantages, there is growing demand for applying ST to uncover novel biological insights in the liver under various physiological and pathological conditions including obesity, diabetes, and metabolic dysfunction-associated steatotic liver disease (MASLD). However, no comprehensive and practical protocols currently exist for analyzing ST data specifically in the context of liver metabolism. Herein, we present a systematic and detailed protocol for ST data analysis using liver tissues from MASLD mouse models. This guide offers practical support for metabolism-focused researchers without advanced expertise in coding, mathematics, or statistics, enabling single-cell RNA-seq referencing for deconvolution-based annotation, curated liver cell type markers for manual annotation, and a GMT file of metabolic gene sets and flux balance analysis to analyze liver metabolic activity. This framework and its integrated computational resources for decoding metabolic reprogramming and cellular heterogeneity will empower researchers to uncover novel biological pathways regulating liver metabolism in health and disease.
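One concrete step from such a protocol, scoring metabolic gene sets from a GMT file per cell or spot, might look as follows with Scanpy; the file names and the AnnData object are placeholders, and the minimum-gene threshold is an arbitrary choice:

```python
# Sketch: score per-cell/per-spot activity of metabolic gene sets from a GMT
# file with Scanpy (file names and the AnnData object are placeholders).
import scanpy as sc

adata = sc.read_h5ad("liver_visium.h5ad")  # hypothetical ST dataset

def read_gmt(path: str) -> dict[str, list[str]]:
    """GMT format: name <tab> description <tab> gene1 <tab> gene2 ..."""
    gene_sets = {}
    with open(path) as fh:
        for line in fh:
            name, _desc, *genes = line.rstrip("\n").split("\t")
            gene_sets[name] = genes
    return gene_sets

for name, genes in read_gmt("metabolic_pathways.gmt").items():
    present = [g for g in genes if g in adata.var_names]
    if len(present) >= 5:  # skip sets barely represented in the panel
        sc.tl.score_genes(adata, present, score_name=f"score_{name}")
```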
bioinformatics2026-02-10v1HORDCOIN: A Software Library for Higher Order Connected Information and Entropic Constraints Approximation
Raffaelli, G. T.; Kislinger, J.; Kroupa, T.; Hlinka, J.AI Summary
- The study introduces HORDCOIN, a software library for approximating higher-order connected information in complex systems like neuronal populations, using an entropic-constraint approach to simplify computational complexity.
- This method transforms the problem into a linear program, allowing efficient estimation even with limited data.
- Applications to symbolic sequences, neuronal recordings, and DNA sequences showed accurate detection of higher-order interactions, demonstrating the library's utility in biomedical data analysis.
Abstract
Background and objective: Quantifying higher-order statistical dependencies in multivariate biomedical data is essential for understanding collective dynamics in complex systems such as neuronal populations. The connected information framework provides a principled decomposition of the total information content into contributions from interactions of increasing order. However, its application has been limited by the computational complexity of conventional maximum entropy formulations. In this work, we present a generalised formulation of connected information based on maximum entropy problems constrained by entropic quantities. Methods: The entropic-constraint approach, contrasting with the original constraints based on marginals or moments, transforms the original nonconvex optimisation into a tractable linear program defined over polymatroid cones. This simplification enables efficient, robust estimation even under undersampling conditions. Results: We present theoretical foundations, algorithmic implementation, and validation through numerical experiments and real-world data. Applications to symbolic sequences, large-scale neuronal recordings, and DNA sequences demonstrate that the proposed method accurately detects higher-order interactions and remains stable even with limited data. Conclusions: The accompanying open-source software library, HORDCOIN (Higher ORDer COnnected INformation), provides user-friendly tools for computing connected information using both marginal- and entropy-based formulations. Overall, this work bridges the gap between abstract information-theoretic measures and practical biomedical data analysis, enabling scalable investigation of higher-order dependencies in neurophysiological and other complex biological systems such as the genome.
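For orientation, the classical marginal-constrained form of connected information, which the entropic-constraint formulation generalizes, is:

```latex
% Classical connected information (marginal-constrained formulation).
% \tilde{P}^{(k)} is the maximum-entropy distribution consistent with all
% k-th order marginals of P; H[\cdot] denotes Shannon entropy.
\begin{align*}
  I_C^{(k)} &= H\!\left[\tilde{P}^{(k-1)}\right] - H\!\left[\tilde{P}^{(k)}\right],
  \qquad k = 2, \dots, n, \\
  \sum_{k=2}^{n} I_C^{(k)}
  &= H\!\left[\tilde{P}^{(1)}\right] - H\!\left[\tilde{P}^{(n)}\right]
  = \sum_{i=1}^{n} H[X_i] - H[X_1, \dots, X_n].
\end{align*}
```

The telescoping sum over orders recovers the total correlation, which is what makes the decomposition a principled split of the total information content into contributions from interactions of increasing order.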
bioinformatics2026-02-10v1LineageSim: A Single-Cell Lineage Simulator with Fate-Aware Gene Expression
Lai, H.; Sadria, M.AI Summary
- LineageSim is introduced as a simulator for single-cell lineage data with gene expression that incorporates fate-aware signals, addressing the limitations of existing Markovian models.
- The simulator generates data where progenitor states show early signs of future cell fate, providing a benchmark for cell fate prediction algorithms.
- Validation through logistic regression showed a 68.3% balanced accuracy, confirming the presence of predictive fate information in the simulated data.
Abstract
Single-cell lineage data paired with gene expression are critical for developing computational methods in developmental biology. Since experimental lineage tracing is often technically limited, robust simulations are necessary to provide the ground truth for rigorous validation. However, existing simulators generate largely Markovian gene expression, failing to encode the fate bias observed in real biological systems, where progenitor states exhibit early signatures of future commitment. Consequently, they cannot support the training and evaluation of computational methods that model long-range temporal dependencies. We present LineageSim, a generative framework that introduces fate-aware gene expression, where progenitor states carry latent signals of their descendants' terminal fates. This framework establishes a new class of benchmarks for cell fate prediction algorithms. We validate the presence of these temporal signals by training a logistic regression baseline, which achieves 68.3% balanced accuracy. This confirms that the generated data contain subtle but recoverable fate information, in contrast to existing simulators, where such predictive signals are systematically absent.
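The validation step is a straightforward linear probe: train logistic regression on progenitor expression to predict terminal fate and check balanced accuracy against chance. A self-contained sketch on synthetic data (LineageSim's simulated profiles would replace X and y):

```python
# Sketch of the validation step: can a linear probe on progenitor expression
# predict terminal fate better than chance? (Synthetic data; LineageSim's
# simulated profiles would replace X and y.)
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_cells, n_genes = 2000, 50
y = rng.integers(0, 2, size=n_cells)   # terminal fate label
X = rng.normal(size=(n_cells, n_genes))
X[:, :5] += 0.4 * y[:, None]           # weak fate-aware signal in 5 genes

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(balanced_accuracy_score(y_te, clf.predict(X_te)))  # roughly 0.65-0.70
```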
bioinformatics2026-02-10v1Multi-compartment spatiotemporal metabolic modeling of the chicken gut guides dietary intervention design
Utkina, I.; Alizadeh, M.; Sharif, S.; Parkinson, J.AI Summary
- Researchers developed a multi-compartment metabolic model of the chicken gut to understand how diet influences microbial metabolism.
- The model identified cellulose, starch, and L-threonine as effective dietary supplements for enhancing short-chain fatty acid production.
- Validation through a feeding trial confirmed model predictions, particularly for butyrate, highlighting the importance of microbial community composition in metabolic outcomes.
Abstract
Understanding how diet shapes microbial metabolism along the gastrointestinal tract is essential for improving poultry gut health and reducing reliance on antibiotic growth promoters. Yet dietary interventions often yield inconsistent outcomes because their efficacy depends on baseline conditions, including diet composition and microbiota structure. To address this, we developed the first multi-compartment, spatiotemporally resolved metabolic model of the chicken gastrointestinal tract. Our six-compartment framework integrates avian-specific physiological features including bidirectional flow, feeding-fasting cycles, and compartment-specific environmental parameters. The model captured distinct metabolic specialization along the gut, with upper compartments enriched for biosynthetic pathways and lower compartments specialized for fermentation. Systematic in silico screening of 34 dietary supplements revealed context-dependent metabolic responses and identified cellulose, starch, and L-threonine as robust enhancers of short-chain fatty acid production. A controlled feeding trial validated key predictions, particularly for butyrate, and integrating trial-specific microbial community data substantially improved prediction accuracy for several metabolites. Our findings demonstrate that community composition is a major driver of metabolic outcomes and underscore the need for context-specific modeling. Our framework provides a mechanistic platform for rational dietary intervention design and is broadly adaptable to other animal or human gastrointestinal systems.
bioinformatics2026-02-10v1Parsimonious cell co-localization scoring for spatial transcriptomics
Gingerich, I. K.; Frost, H. R.AI Summary
- The study introduces the Neighborhood Product Co-localization (NPC) score for spatial transcriptomics to quantify cell type co-occurrence in local neighborhoods.
- Using a mouse ovary MERFISH dataset, NPC was shown to localize co-localization hotspots, recapitulate global associations, and identify specific niches like follicle boundaries.
- NPC extends to multivariate analysis, demonstrating coordinated co-localization of endothelial, stroma, and theca cells.
Abstract
Spatial transcriptomics (ST) preserves tissue architecture while profiling gene expression, motivating methods that quantify whether annotated labels (such as cell types) preferentially co-occur in local neighborhoods. We introduce the Neighborhood Product Co-localization (NPC) score, a simple per-cell metric computed on a pruned spatial neighbor graph: for a set of m ≥ 2 labels, NPC is the product of their neighborhood proportions, optionally normalized by expected co-occurrence under independence and paired with permutation-based significance testing. NPC is interpretable (maximized under balanced neighborhoods), efficient to compute, and extends naturally from pairwise to multivariate microenvironment definitions. Using a mouse ovary MERFISH dataset, we show that NPC complements established Squidpy co-occurrence and neighborhood enrichment analyses by localizing co-localization hotspots in tissue space, recapitulating prominent global associations, and highlighting spatially restricted niches such as follicle boundaries; we further demonstrate multivariate NPC scoring by identifying coordinated endothelial-stroma-theca co-localization. Overall, NPC provides a practical framework for interpretable, single-cell resolution co-localization analysis in ST cohorts.
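As defined in the abstract, the score is easy to sketch: per cell, multiply the neighborhood proportions of the chosen labels and normalize by the expected product under independence. The version below simplifies the pruned spatial graph to plain k-nearest neighbors and omits the permutation test:

```python
# Minimal sketch of the NPC score as described: per cell, the product of
# neighborhood proportions of the m chosen labels, normalized by the expected
# product under independence. Graph construction is simplified to k-nearest
# neighbors; the paper uses a pruned spatial graph plus a permutation test.
import numpy as np
from scipy.spatial import cKDTree

def npc_scores(coords, labels, targets, k=10):
    labels = np.asarray(labels)
    tree = cKDTree(coords)
    _, nbrs = tree.query(coords, k=k + 1)        # first neighbor is the cell itself
    nbrs = nbrs[:, 1:]
    global_p = {t: np.mean(labels == t) for t in targets}
    scores = np.ones(len(labels))
    for t in targets:
        prop = (labels[nbrs] == t).mean(axis=1)  # neighborhood proportion of label t
        scores *= prop / global_p[t]             # normalize by independence expectation
    return scores

rng = np.random.default_rng(0)
coords = rng.uniform(size=(500, 2))
labels = rng.choice(["endothelial", "stroma", "theca"], size=500)
print(npc_scores(coords, labels, targets=["endothelial", "stroma"])[:5])
```

Passing three or more labels in targets gives the multivariate variant directly, which is the "coordinated endothelial-stroma-theca" use case.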
bioinformatics2026-02-10v1CoPrimeEEG: CRT-Guided Dual-Branch Reconstruction from Co-Prime Sub-Nyquist EEG
Yu, Y.; Liu, D.; Wu, Y. N.AI Summary
- CoPrimeEEG integrates co-prime sub-Nyquist sampling with a CRT-guided learning objective to reconstruct high-rate EEG from two low-rate streams.
- The framework uses a dual-branch convolutional encoder, upsampling to reconstruct EEG, predict a temporal mask, and extract bandpower features.
- It achieves superior reconstruction quality on real EEG data with fewer parameters, offering a low-power solution for EEG acquisition.
Abstract
We present CoPrimeEEG, a neural reconstruction framework that unifies co-prime sub-Nyquist sampling theory with a CRT-guided learning objective for EEG. Two low-rate streams obtained by co-prime decimations feed a dual-branch convolutional encoder whose fused representation is upsampled to reconstruct high-rate EEG while jointly predicting a temporal usefulness mask and canonical bandpower features. We derive a principled loss with four terms: (i) waveform fidelity, (ii) mask sparsity and smoothness, (iii) bandpower supervision in the log-domain, and (iv) a CRT-consistency term enforcing agreement between the reconstruction and its co-prime downsampled counterparts. On real EEG data, CoPrimeEEG achieves state-of-the-art reconstruction quality across MSE, MAE, correlation, SNR, and PSNR while using fewer parameters. The approach provides a practical path to low-power EEG acquisition with high-fidelity downstream analysis.
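Co-prime decimation and the CRT-consistency term can be illustrated in a few lines: the two observed streams are decimations of the high-rate signal by co-prime factors, and any candidate reconstruction must agree with both streams when re-decimated. A numpy sketch with illustrative rates:

```python
# Sketch of co-prime sub-Nyquist sampling and the CRT-consistency check used
# as a loss term: decimate by co-prime factors p and q, and require any
# reconstruction, when re-decimated, to agree with the observed streams.
import numpy as np

p, q = 3, 5                       # co-prime decimation factors (gcd = 1)
t = np.arange(0, 1, 1 / 300)      # 300 Hz "high-rate" grid
x = np.sin(2 * np.pi * 10 * t) + 0.5 * np.sin(2 * np.pi * 22 * t)

stream_p, stream_q = x[::p], x[::q]   # two low-rate observations (100 and 60 Hz)

def crt_consistency(x_hat: np.ndarray) -> float:
    """Mean squared disagreement between a candidate reconstruction and its
    co-prime downsampled counterparts (0 for a perfect reconstruction)."""
    return (np.mean((x_hat[::p] - stream_p) ** 2)
            + np.mean((x_hat[::q] - stream_q) ** 2))

print(crt_consistency(x))                 # 0.0: the true signal is consistent
print(crt_consistency(np.zeros_like(x)))  # > 0: inconsistent candidate
```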
bioinformatics2026-02-10v1A methodological framework for accommodating Cancer Genomics Information in OMOP-CDM using Variation Representation Specification (VRS).
Benetti, E.; Scicolone, G.; Tajwar, M.; Masciullo, C.; Bucci, G.; Riba, M.AI Summary
- The study proposes a framework to integrate cancer genomics data into the OMOP Common Data Model (OMOP CDM) using the Variation Representation Specification (VRS).
- The approach involves a scalable strategy for storing genomic variants, from simple biomarker data to complex genome sequencing data, using standardized identifiers.
- KOIOS-VRS, a pipeline, was developed to automate the conversion of VCF files into an OMOP-compatible format.
Abstract
The OMOP Common Data Model (OMOP CDM), in which observational health data are organized and stored, is a broadly accepted data standard that facilitates federated study protocols in clinical research. In the case of cancer studies, there is a growing need to incorporate cancer genomics data in a standardized way. Starting from a brief overview of the basic features of the OMOP CDM, we imagine a path of increasing complexity for including known biomarker genomic data coming from pathology reports or clinical laboratory findings, towards storing thousands of known and unknown variants coming from genome sequencing data. Data should be stored using standardized identifiers, including those defined by the Global Alliance for Genomics and Health (GA4GH). We propose a scalable strategy for storing genomic variants in increasingly complex scenarios and present KOIOS-VRS, a pipeline that automates the conversion of VCF files into an OMOP-compatible format.
bioinformatics2026-02-10v1thematicGO: A Keyword-Based Framework for Interpreting Gene Ontology Enrichment via Biological Themes
Wang, Z.; Sudlow, L. C.; Du, J.; Berezin, M. Y.AI Summary
- ThematicGO is a framework that organizes Gene Ontology (GO) enriched terms into biological themes using a keyword-based approach to reduce redundancy and enhance interpretability.
- It uses the g:Profiler API for GO enrichment of differentially expressed genes, followed by theme-based score aggregation.
- Compared to traditional GO annotation, thematicGO improves readability and provides a user-friendly GUI for exploring results, making it suitable for RNA-seq studies.
Abstract
Background Gene Ontology (GO) enrichment analysis is a widely used approach for interpreting high-throughput transcriptomic and genomic data. However, conventional GO over-representation analyses typically yield long, redundant lists of enriched terms that are difficult to apply to biological problems and to distill into the most relevant biological pathways. Results We present thematicGO, a customizable framework that organizes enriched GO terms into biological themes using a curated keyword-based matching strategy. In this approach, GO enrichment of differentially expressed genes is performed using the g:Profiler Application Programming Interface (API), followed by score aggregation within each theme from the contributing individual GO terms. Side-by-side interpretation against conventional GO annotation workflows demonstrates that thematicGO captures related biological outcomes while substantially reducing redundancy and improving readability. To enhance accessibility, we implemented an interactive, web-deployed graphical user interface (GUI) that enables users to upload gene lists and explore thematic enrichment results. Conclusion thematicGO simplifies functional enrichment analysis by bridging the gap between granular GO term outputs and higher-level biological interpretation using a theme concept, which can be especially useful for RNA-seq studies that identify differentially expressed genes. The approach complements standard GO enrichment techniques with transparent, theme-based aggregation and comparison against classical GO annotation approaches. thematicGO provides an easy, understandable, and reproducible tool for transcriptomic studies, particularly those involving RNA-seq data and complex biological responses.
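The keyword-matching and aggregation step can be sketched directly; the theme keyword map, term names, and p-values below are illustrative, and thematicGO's curated lists and g:Profiler integration are not reproduced:

```python
# Sketch of keyword-based theme aggregation over enriched GO terms (term names
# and the theme keyword map are illustrative; thematicGO uses curated lists
# and g:Profiler results).
import math

themes = {"inflammation": ["inflammat", "cytokine", "interleukin"],
          "matrix remodeling": ["extracellular matrix", "collagen", "metallopept"]}

enriched = [("inflammatory response", 1e-12),
            ("cytokine-mediated signaling pathway", 1e-8),
            ("collagen catabolic process", 1e-6)]

scores = {theme: 0.0 for theme in themes}
for term, pval in enriched:
    for theme, keywords in themes.items():
        if any(kw in term.lower() for kw in keywords):
            scores[theme] += -math.log10(pval)  # aggregate -log10(p) per theme

print(scores)  # {'inflammation': 20.0, 'matrix remodeling': 6.0}
```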
bioinformatics2026-02-10v1Reading TEA leaves for de novo protein design
Pantolini, L.; Durairaj, J.AI Summary
- The study explores de novo protein design using a 20-letter structure-inspired alphabet from protein language model embeddings to enhance Monte Carlo sampling efficiency.
- This approach allows for rapid template-guided and unconditional design, producing novel protein sequences that meet designability criteria without known homologues.
- The method significantly reduces the time required for protein design, opening new avenues for therapeutic and industrial applications.
Abstract
De novo protein design expands the functional protein universe beyond natural evolution, offering vast therapeutic and industrial potential. Monte Carlo sampling is under-explored in protein design because of the long simulation times typically required and the prohibitive cost of current structure prediction oracles. Here we make use of a 20-letter structure-inspired alphabet derived from protein language model embeddings to score random mutagenesis-based Metropolis sampling of amino acid sequences. This facilitates fast template-guided and unconditional design, generating sequences that satisfy in silico designability criteria without known homologues. Ultimately, this unlocks a new path to fast, de novo protein design.
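For reference, a sampler of this kind accepts a proposed point mutation x → x' with the standard Metropolis criterion, where s(·) stands in for the structure-alphabet score and T is the sampling temperature (the paper's exact scoring function and schedule are not given here):

```latex
% Metropolis acceptance rule for a proposed point mutation x \to x',
% with score s(\cdot) (here, the structure-alphabet match to the template)
% and temperature T controlling exploration:
\[
  p_{\mathrm{accept}}(x \to x') \;=\;
  \min\!\left( 1,\; \exp\!\left( \frac{s(x') - s(x)}{T} \right) \right).
\]
```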
bioinformatics2026-02-10v1A multi-agent platform for assessment and improvement of bioinformatics software documentation
Ma, A.; Feng, S.; Gu, S.; Wang, C.; Ma, Q.AI Summary
- The study introduces BioGuider, a multi-agent platform designed to evaluate and enhance documentation quality in bioinformatics software by treating documentation as a testable object.
- BioGuider uses a modular pipeline for documentation assessment, reporting, and correction, with agents simulating user interactions, and evaluates against task-oriented criteria.
- Testing on 47 bioinformatics tools showed BioGuider's effectiveness in error detection and correction, with a correlation between improved documentation quality and increased software adoption.
Abstract
Rapid advances in bioinformatics have transformed biomedical research in areas such as single-cell and spatial omics, digital pathology, and multi-modal data integration, yet software usability and reproducibility have not kept pace with the growing complexity and proliferation of computational tools. Inconsistent, incomplete, or inaccessible documentation remains a pervasive and underappreciated barrier, limiting tool adoption, hindering reproducibility across laboratories, and reducing the long-term impact of computational methods. Here, we introduce BioGuider, a multi-agent platform designed to systematically evaluate and improve documentation quality in bioinformatics software. Rather than treating documentation as ancillary text, BioGuider models it as a first-class, testable object. The platform implements a modular pipeline for documentation collection, assessment, reporting, and optional correction, with specialized agents that emulate real-world user interactions. BioGuider evaluates documentation against standardized, task-oriented criteria spanning installation, configuration, usage, and tutorials, and supports iterative, constraint-aware refinement while preserving code integrity and biological context. We benchmark BioGuider using a controlled error-injection framework that introduces realistic documentation failures across general, biology-specific, and configuration-related categories. Across multiple large language models, BioGuider demonstrates robust error detection and correction, with strong performance maintained under severe documentation degradation. Applying BioGuider to 47 widely used bioinformatics tools, we observe a positive association between documentation quality and citation frequency, highlighting documentation as a previously under-quantified driver of software adoption and scientific impact.
bioinformatics2026-02-10v1Delta Marches: Generative AI-based image synthesis to decode disease-driving morphologic transformations.
Nguyen, T. H.; Panwar, V.; Jarmale, V.; Perny, A.; Dusek, C.; Cai, Q.; Kapur, P. H.; Danuser, G.; Rajaram, S.AI Summary
- Delta-Marches uses generative AI to simulate morphological changes between disease classes, focusing on interpretability.
- Applied to renal carcinoma grading, it identifies key morphological features like tumor-cell nuclear phenotypes and reduced vasculature with increasing grade.
- This approach reduces variability and provides insights into disease mechanisms not captured by standard grading.
Abstract
Deep learning has revealed that tissue morphology contains rich biological information beyond human understanding. However, approaches to convert these spatially distributed signals into precise subcellular insights informing disease mechanism are lacking. We introduce Delta-Marches, an interpretability-first approach that nominates distinguishing morphological features rather than explaining existing models' decisions. Delta-Marches leverages a generative AI framework with latent-space traversals that simulate idealized morphological changes between classes. Comparing each image to its class-shifted counterpart allows downstream feature extractors to infer aspects most affected by the shift, reducing sample-to-sample variability and yielding interpretable morphological transformations at subcellular resolution. Prototyped in renal carcinoma histopathological grading, Delta-Marches generates realistic grade transitions and pinpoints tumor-cell nuclear phenotypes as key properties of tumor grades. It also reveals reduced vasculature associated with increasing grade, a pattern reported in studies but absent from standard grading rubrics. These results indicate Delta-Marches' ability to parse complex image phenotypes and catalyze hypothesis generation.
bioinformatics2026-02-09v4Enumerating the chemical exposome using in-silico transformation analysis: an example using insecticides
Jothiramajayam, M.; Barupal, D. K.AI Summary
- This study uses an integrated workflow of RXNMapper, Rxn-INSIGHT, and RDChiral to enumerate transformation products of insecticides in-silico.
- From 181 insecticide structures, 19,392 unique transformation products were generated using over 80,000 reaction templates from PubChem.
- Products were prioritized based on thermodynamic stability, species association, enzyme information, and ADMET properties, enhancing exposomic knowledgebases.
Abstract
The exposome encompasses a vast chemical space that can originate from the consumer industry and environmental sources. Once these chemicals enter cells (human or of other organisms), they can also be transformed into products that differ in toxicity and health effects. Recent developments in machine learning methods and chemical data science resources have enabled the in-silico enumeration of transformation products. Here, we report an integrated workflow of these existing resources (RXNMapper, Rxn-INSIGHT and RDChiral) to enumerate the transformation products for a chemical. We have generated a large library of reaction templates from > 80,000 reactions sourced from the PubChem database. The utility of the reaction screening and transformation enumeration workflow is demonstrated for insecticide structures (n=181), yielding 19,392 unique transformation products. Filters and ranking by thermodynamic stability, species association, enzyme information and ADMET properties can prioritize the products relevant for different contexts. Many of these products have PubChem entries but have not yet been linked with the parent compounds. The presented approach can be helpful in enumerating the relevant chemical space for the exposome using known reaction chemistry, which may ultimately contribute to expanding exposomic knowledgebases.
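Template-based enumeration of transformation products can be sketched with RDKit; the single ester-hydrolysis SMARTS and the substrate below are illustrative stand-ins for the paper's library of more than 80,000 PubChem-derived templates:

```python
# Sketch of template-based transformation enumeration with RDKit. The one
# reaction SMARTS below (generic ester hydrolysis) and the substrate are
# illustrative; the paper's library holds >80,000 templates extracted from
# PubChem reactions via RXNMapper and Rxn-INSIGHT.
from rdkit import Chem
from rdkit.Chem import AllChem

templates = ["[C:1](=[O:2])O[#6:3]>>[C:1](=[O:2])[OH].[OH][#6:3]"]  # ester hydrolysis

parent = Chem.MolFromSmiles("CC(=O)OC1=CC=CC=C1")  # illustrative ester substrate

products = set()
for smarts in templates:
    rxn = AllChem.ReactionFromSmarts(smarts)
    for prods in rxn.RunReactants((parent,)):
        for mol in prods:
            try:
                Chem.SanitizeMol(mol)
                products.add(Chem.MolToSmiles(mol))  # canonical SMILES dedupes
            except Exception:  # skip products that fail valence/aromaticity checks
                continue

print(products)  # e.g. {'CC(=O)O', 'Oc1ccccc1'}
```

Running every template over every parent, then deduplicating by canonical SMILES, is what yields the enumerated product library that the filters and rankings operate on.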
bioinformatics2026-02-09v2Protein Language Models in Directed Evolution
Maguire, R.; Bloznelyte, K.; Adepoju, F.; Armean-Jones, M.; Dewan, S.; Goddard, S. E.; Gupta, A.; Jones, F. P.; Lalli, P.; Schooneveld, A.; Thompson, S.; Ebrahimi, E.; Fozzard, S.; Berman, D.; Rossoni, L.; Addison, W.; Taylor, I.AI Summary
- The study investigates the use of zero-shot and few-shot protein language models to guide directed evolution for improving protein fitness, specifically PET degradation.
- Using a few-shot simulated annealing approach, the models recommended enzyme variants that achieved a 1.62x improvement in PET degradation over 72 hours, surpassing the literature's top engineered variant (itself 1.40x fitter than wild-type).
- In the second round, with 240 training examples and 32 homologous sequences, 39% of the 176 evaluated variants were fitter than the wild-type.
Abstract
The dominant paradigms for integrating machine learning into protein engineering are de novo protein design and guided directed evolution. Guiding directed evolution requires a model of protein fitness, but most models are only evaluated in silico on datasets comprising few mutations. Due to the limited number of mutations in these datasets, it is unclear how well these models can guide directed evolution efforts. We demonstrate in vitro how zero-shot and few-shot protein language models of fitness can be used to guide two rounds of directed evolution with simulated annealing. Our few-shot simulated annealing approach recommended enzyme variants with 1.62x improved PET degradation over a 72 h period, outperforming the top engineered variant from the literature, which was 1.40x fitter than wild-type. In the second round, 240 in vitro examples were used for training, 32 homologous sequences were used for evolutionary context, and 176 variants were evaluated for improved PET degradation, achieving a hit rate of 39% of variants fitter than wild-type.
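The optimization loop itself is plain simulated annealing over sequence space, with the pLM fitness model as the scorer. In the sketch below a toy match-to-hidden-optimum score stands in for the language model:

```python
# Sketch of pLM-guided simulated annealing over sequence space. The scorer is
# a stand-in (it rewards matching a hidden "ideal" sequence); in the paper the
# score comes from a zero-/few-shot protein language model of fitness.
import math
import random

AAS = "ACDEFGHIKLMNPQRSTVWY"
random.seed(0)
ideal = "".join(random.choice(AAS) for _ in range(50))      # hidden optimum (toy)
wild_type = "".join(random.choice(AAS) for _ in range(50))

def score(seq: str) -> float:
    return sum(a == b for a, b in zip(seq, ideal))          # stand-in for pLM score

def anneal(seq: str, steps=5000, t0=2.0, t1=0.01) -> str:
    s = score(seq)
    for i in range(steps):
        temp = t0 * (t1 / t0) ** (i / steps)                # geometric cooling
        pos = random.randrange(len(seq))
        cand = seq[:pos] + random.choice(AAS) + seq[pos + 1:]
        s_cand = score(cand)
        # Metropolis rule: always accept improvements, sometimes accept losses.
        if s_cand >= s or random.random() < math.exp((s_cand - s) / temp):
            seq, s = cand, s_cand
    return seq

best = anneal(wild_type)
print(score(wild_type), score(best))  # the annealed variant scores higher
```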
bioinformatics2026-02-09v2A high-fat hypertensive diet induces a coordinated perturbation signature across cell types in thoracic perivascular adipose tissue
Terrian, L.; Thompson, J. M.; Bowman, D. E.; Panda, V.; Contreras, G. A.; Rockwell, C. E.; Sather, L.; Fink, G. D.; Lauver, D. A.; Nault, R.; Watts, S. W.; Bhattacharya, S.AI Summary
- This study used single nucleus RNA-sequencing to examine how a high-fat (HF) hypertensive diet affects gene expression in thoracic aortic perivascular adipose tissue (PVAT) of Dahl SS rats.
- The HF diet led to sex-specific changes in cell-type proportions and gene expression related to extracellular matrix dynamics, vascular integrity, and cell communication pathways.
- Analysis identified potential nuclear receptor targets for reversing these diet-induced changes, with deep learning models predicting a hypertensive disease signature across cell types.
Abstract
Perivascular adipose tissue (PVAT), an intriguing layer of fat surrounding blood vessels, regulates vascular tone and mediates vascular dysfunction through mechanisms that are not well understood. Here we show with single nucleus RNA-sequencing of thoracic aortic PVAT from Dahl SS rats that a high-fat (HF) hypertensive diet induces coordinated changes in gene expression across the diverse cell types within PVAT. HF diet produced sex-specific alterations in cell-type proportions and genes related to remodeling of extracellular matrix dynamics and vascular integrity and stiffness, as well as changes in cell-cell communication pathways involved in angiogenesis, vascular remodeling, and mechanotransduction. Gene regulatory network analysis with virtual transcription factor knockout in adipocytes identified specific nuclear receptors that could be targeted for suppression or potential reversal of HF diet-induced changes. Interestingly, generative deep learning models were able to predict cross-cell-type perturbations in gene expression, indicating a hypertensive disease signature that characterizes HF-diet-induced perturbations in PVAT.
bioinformatics2026-02-09v2LoReMINE: Long Read-based Microbial genome mining pipeline
Agrawal, A. A.; Bader, C. D.; Garcia, R.; Mueller, R.; Kalinina, O. V.AI Summary
- The study introduces LoReMINE, a pipeline for microbial genome mining that automates the process from long-read sequencing data to predicting and clustering biosynthetic gene clusters (BGCs).
- LoReMINE integrates various tools to provide a scalable, reproducible workflow for natural product discovery, addressing the limitations of existing methods that require manual curation.
Abstract
Microbial natural products represent a chemically diverse repertoire of small molecules with major pharmaceutical potential. Despite the increasing availability of microbial genome sequences, large-scale natural product discovery remains challenging because existing genome mining approaches lack integrated workflows for rapid dereplication of known compounds and prioritization of novel candidates, forcing researchers to rely on multiple tools that require extensive manual curation and expert intervention at each step. To address these limitations, we introduce LoReMINE (Long Read-based Microbial genome mining pipeline), a fully automated end-to-end pipeline that generates high-quality assemblies, performs taxonomic classification, predicts biosynthetic gene clusters (BGCs) responsible for the biosynthesis of natural products, and clusters them into gene cluster families (GCFs) directly from long-read sequencing data. By integrating state-of-the-art tools into a seamless pipeline, LoReMINE enables scalable, reproducible, and comprehensive genome mining across diverse microbial taxa. The pipeline is openly available at https://github.com/kalininalab/LoReMINE and can be installed via Conda (https://anaconda.org/kalininalab/loremine), facilitating broad adoption by the natural product research community.
bioinformatics2026-02-09v2Unveiling the Terra Cognita of Sequence Spaces using Cartesian Projection of Asymmetric Distances
Ramette, A.AI Summary
- CAPASYDIS is introduced as a method to visualize relationships in large biological sequence datasets by projecting sequences into a fixed, low-dimensional "seqverse" using asymmetric distances.
- Applied to rRNA sequences across Bacteria, Archaea, and Eukaryota, CAPASYDIS showed these domains occupy distinct spatial regions with unique variation patterns.
- The method allows instant mapping of new sequences and retains taxonomic information from broad to fine scales, providing a scalable framework for sequence analysis.
Abstract
Visualizing relationships within massive biological datasets remains a significant challenge, particularly as sequence length and volume increase. We introduce CAPASYDIS (Cartesian Projections of Asymmetric Distances), a scalable approach designed to map the explored regions of a given sequence space. Unlike traditional dimensionality reduction methods, CAPASYDIS calculates asymmetric distances which account for both the position and type of sequence variations. It projects sequences into a fixed, low-dimensional coordinate system, termed a "seqverse", where each sequence occupies a permanent location. This design allows for the instant mapping of new sequences without the need to recalculate the global structure, transforming sequence analysis from a relative comparison into navigation on a standardized map. We applied this method to a large rRNA sequence dataset spanning the three domains of life. Our results demonstrate that the sequences of Bacteria, Archaea, and Eukaryota occupy spatially distinct regions characterized by fundamentally different shapes and patterns of variation. Furthermore, the resulting seqverses retain a high amount of taxonomic information when analyzed from broad domain levels down to single-base differences. Overall, CAPASYDIS provides a reproducible, scalable framework for defining the boundaries and topography of biological sequence universes.
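The abstract gives no formulas, so the following is a speculative reading: if each coordinate is an asymmetric distance to a fixed reference sequence, new sequences can be placed instantly without recomputing anything global. Both the reference anchors and the position-weighted, direction-dependent toy distance below are assumptions for illustration only:

```python
# Speculative sketch (the paper does not publish formulas here): coordinates
# defined as asymmetric distances to fixed reference sequences, so a new
# sequence is mapped instantly without touching previously placed ones.
# The position-weighted, direction-dependent mismatch distance is a toy stand-in.
def asym_dist(query: str, ref: str) -> float:
    """Toy asymmetric distance: mismatch cost depends on position in the query
    and on the direction of the substitution, so d(a, b) != d(b, a) in general."""
    d = 0.0
    for i in range(len(query)):
        r = ref[i] if i < len(ref) else None
        if r != query[i]:
            cost = 1.0 if (r is None or r > query[i]) else 0.5  # type/direction matters
            d += cost * (i + 1) / len(query)                    # position matters
    return d / len(query)

REFERENCES = ["ACGTACGT", "TTTTAAAA"]  # fixed anchors spanning the toy seqverse

def coordinates(seq: str) -> tuple[float, ...]:
    return tuple(asym_dist(seq, r) for r in REFERENCES)

print(coordinates("ACGTTTTT"), coordinates("TTTTACGT"))  # distinct fixed points
```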
bioinformatics2026-02-09v2Target-site Dynamics and Alternative Polyadenylation Explain a Large Share of Apparent MicroRNA Differential Expression
Cihan, M.; More, P.; Sprang, M.; Marini, F.; Andrade, M.AI Summary
- The study introduces MIRNAPEX, a machine learning framework that integrates target-gene expression and 3'UTR isoform usage to assess miRNA regulatory activity from RNA-seq data.
- Using pan-cancer datasets, MIRNAPEX showed that alternative polyadenylation (APA) significantly enhances prediction of miRNA differential expression beyond gene expression alone.
- Findings indicate that changes in miRNA abundance can result from APA-driven alterations in target-site availability, rather than changes in miRNA transcription, highlighting the importance of considering APA in miRNA expression analysis.
Abstract
MicroRNA (miRNA) abundance reflects a dynamic balance between biogenesis, target engagement and decay, yet differential expression (DE) analyses typically ignore changes in target-site availability driven by alternative polyadenylation (APA). We introduce MIRNAPEX, an interpretable expression-stratification-based machine learning framework that quantifies the effect size of miRNA regulatory activity from RNA-seq by integrating target-gene expression with 3'UTR isoform usage to infer binding-site dosage. Using pan-cancer training sets, we fit regularized linear models to learn robust relationships between transcriptomic features and miRNA log-fold changes, with APA patterns adding clear predictive power beyond expression alone. When applied to knockdowns of core APA regulators, MIRNAPEX captured widespread 3'UTR shortening and correctly anticipated distinct, miRNA-specific shifts whose direction and magnitude mirrored the APA-driven change in site availability. Analysis of target-directed miRNA degradation interactions further showed that loss of distal decay-trigger sites coincides with higher miRNA abundance, consistent with a reduced degradation rate. Together these findings reveal that apparent DE of miRNAs can arise from dynamic changes in target-site landscapes rather than altered miRNA transcription, and that ignoring this aspect in conventional analysis workflows can lead to misestimation of the true effect size of gene-expression regulation.
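The modelling idea reduces to a regularized regression of miRNA log-fold change on target-side features, where an APA-driven change in 3'UTR site dosage enters alongside target-gene expression. A synthetic-data sketch (MIRNAPEX's actual feature construction is far richer):

```python
# Sketch of the modelling idea: predict miRNA log-fold change from target-side
# features (target-gene expression change plus an APA-driven change in 3'UTR
# site dosage). Synthetic data; MIRNAPEX's feature construction is far richer.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n = 500
target_lfc = rng.normal(size=n)   # mean expression change of target genes
site_dosage = rng.normal(size=n)  # APA-driven change in 3'UTR site dosage
# Toy generative model: miRNA abundance rises when decay-trigger sites are lost.
mirna_lfc = -0.3 * target_lfc - 0.6 * site_dosage + rng.normal(scale=0.3, size=n)

X = np.column_stack([target_lfc, site_dosage])
model = Ridge(alpha=1.0).fit(X, mirna_lfc)
print(model.coef_)  # the APA term carries weight beyond expression alone (~[-0.3, -0.6])
```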
bioinformatics2026-02-09v2