Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
Fast, accurate construction of multiple sequence alignments from protein language embeddings
Hoang, M.; Armour-Garb, I.; Singh, M.
Abstract
Multiple sequence alignment (MSA) is a foundational task in computational biology, underpinning protein structure prediction, evolutionary analysis, and domain annotation. Traditional MSA algorithms rely on pairwise amino acid substitution matrices derived from conserved protein families. While effective for aligning closely related sequences, these scoring schemes struggle in the low-identity "twilight zone." Here, we present a new approach for constructing MSAs leveraging amino acid embeddings generated by protein language models (PLMs), which capture rich evolutionary and contextual information from massive and diverse sequence datasets. We introduce a windowed reciprocal-weighted embedding similarity metric that is surprisingly effective in identifying corresponding amino acids across sequences. Building on this metric, we develop ARIES (Alignment via RecIprocal Embedding Similarity), an algorithm that constructs a PLM-generated template embedding and aligns each sequence to this template via dynamic time warping in order to build a global MSA. Across diverse benchmark datasets, ARIES achieves higher accuracies than existing state-of-the-art approaches, especially in low-identity regimes where traditional methods degrade, while scaling almost linearly with the number of sequences to be aligned. Together, these results provide the first large-scale demonstration of the power of PLMs for accurate and scalable MSA construction across protein families of varying sizes and levels of similarity, highlighting the potential of PLMs to transform comparative sequence analysis. Code availability: https://github.com/Singh-Lab/ARIES
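The abstract above aligns each sequence to a template via dynamic time warping (DTW). As a rough illustration only, the following sketch runs classic DTW over two lists of per-residue embedding vectors. It is not the ARIES implementation: plain cosine distance stands in for the paper's windowed reciprocal-weighted similarity (whose definition is not given here), and the function names `cosine_distance` and `dtw` are hypothetical.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0

def dtw(seq_a, seq_b, dist=cosine_distance):
    """Classic DTW: return (total cost, warping path) for two vector sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i items of seq_a
    # with the first j items of seq_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],  # match both positions
                                 cost[i - 1][j],      # advance seq_a only
                                 cost[i][j - 1])      # advance seq_b only
    # Trace back from the end to recover which positions were paired.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path

# Toy 2-dimensional "embeddings" (made up for illustration).
template = [[1.0, 0.0], [0.0, 1.0]]
query = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
total, path = dtw(template, query)
print(total, path)
```

In the toy example the first template position is paired with the first two query positions, which is how DTW absorbs an insertion without an explicit gap penalty.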
bioinformatics 2026-03-13 v3
User-driven development and evaluation of an agentic framework for analysis of large pathway diagrams
Corradi, M.; Djidrovski, I.; Ladeira, L.; Staumont, B.; Verhoeven, A.; Sanz Serrano, J.; Rougny, A.; Vaez, A.; Hemedan, A.; Mazein, A.; Niarakis, A.; de Carvalho e Silva, A.; Auffray, C.; Wilighagen, E.; Kuchovska, E.; Schreiber, F.; Balaur, I.; Calzone, L.; Matthews, L.; Veschini, L.; Gillespie, M. E.; Kutmon, M.; Koenig, M.; van Welzen, M.; Hiroi, N.; Lopata, O.; Klemmer, P.; Overall, R.; Hofer, T.; Satagopam, V.; Schneider, R.; Teunis, M.; Geris, L.; Ostaszewski, M.
Abstract
As biomedical knowledge keeps growing, resources storing available information multiply and grow in size and complexity. Such resources can be in the format of molecular interaction maps, which represent cellular and molecular processes under normal or pathological conditions. However, these maps can be complex and hard to navigate, especially to novice users. Large Language Models (LLMs), particularly in the form of agentic frameworks, have emerged as a promising technology to support this exploration. In this article, we describe a user-driven process of prototyping, development, and user testing of Llemy, an LLM-based system for exploring these molecular interaction maps. By involving domain experts from the very first prototyping in the form of a hackathon and collecting both fine-grained and general feedback on more refined versions, we were able to evaluate the perceived utility and quality of the developed system, in particular for summarising maps and pathways, as well as prioritise the development of future features. We recommend continued user-driven development and benchmarking to keep the community engaged. This will also facilitate the transition towards open-weight LLMs to support the needs of the open research environment in an ever-changing technology landscape.
bioinformatics 2026-03-13 v2
Per-residue optimisation of protein structures: Rapid alternative to optimisation with constrained alpha carbons
Schindler, O.; Bucekova, G.; Svoboda, T.; Svobodova, R.
Abstract
In recent years, the number of known protein structures has increased significantly. Predictive algorithms and experimental methods provide the positions of protein residues relative to each other with high accuracy. However, the local quality of the protein structure, including bond lengths, angles, and positions of individual atoms, often lacks the same level of precision. For this reason, protein structures are usually optimised by a force field prior to their application in further research sensitive to structural quality. Protein structure optimisation, however, is computationally challenging. In this paper, we introduce a general method, Per-residue optimisation of protein structures: Rapid alternative to optimisation with constrained alpha carbons (PROPTIMUS RAPHAN). Rather than optimising the entire protein structure at once, PROPTIMUS RAPHAN divides the structure into overlapping residual substructures and optimises each substructure individually. This approach results in computational time that scales linearly with the size of the structure. Additionally, we present PROPTIMUS RAPHAN_GFN-FF, a reference implementation of our method employing a generic, almost QM-accurate force field, GFN-FF. We tested PROPTIMUS RAPHAN_GFN-FF on 461 AlphaFold DB structures and demonstrated that our approach achieves results comparable to the optimisation of the structure with constrained alpha carbons in significantly less time.
bioinformatics 2026-03-13 v2
Cross-etiology transcriptomic conservation in hepatocellular carcinoma reveals opposing proliferation and hepatocyte-loss programs validated across cohorts
Romero, R.; Toledo, C.
Abstract
Background: Hepatocellular carcinoma (HCC) arises from diverse etiologies, but the balance between conserved and specific transcriptomic programs remains unclear. Methods: HBV and HCV cohorts were analyzed using GSVA to quantify Hallmark shifts. Biology was distilled into proliferation (ProlifHub) and hepatocyte-loss (HepLoss) modules, forming a composite HCCStateScore. An HBV injury axis was adjusted for proliferative state (E2F/G2M). Validation was performed using GSE14520 and GEPIA3. Results: Hallmark analysis revealed conserved proliferative activation and hepatocyte function suppression across etiologies. In HBV-HCC, the injury axis remained significantly elevated after adjusting for proliferation (p ≈ 0.0147), indicating an injury component independent of the cell cycle. HCCStateScore robustly separated tumor from non-tumor tissue (AUC ≈ 0.986, p=0). GEPIA3 confirmed concordant expression and survival associations for module genes. Conclusions: HCC features conserved opposing proliferation and hepatocyte-loss programs. HBV-associated tumors retain a distinct injury-linked component not fully explained by cell division. This validated score provides a framework for cross-cohort analysis and mechanistic prioritization in liver cancer research.
bioinformatics 2026-03-13 v1
An explainable boosting machine model for identifying artifacts caused by formalin-fixed paraffin embedding
Grether, V.; Goldstein, Z. R.; Shelton, J. M.; Chu, T. R.; Hooper, W. F.; Geiger, H.; Corvelo, A.; Martini, R.; Davis, M. B.; Robine, N.; Liao, W.
Abstract
Background: Formalin-fixed paraffin-embedding (FFPE) is a widely used, cost-effective method for long-term storage of clinical samples. However, fixation is known to introduce damage to nucleic acids that can present as artifactual bases in sequencing otherwise absent from higher fidelity storage methods such as fresh freezing (FF). Various machine learning methods exist for filtering these variant artifacts, but benchmarking performance can be difficult without reliable truth sets. In this study, we employ a collection of 90 paired fresh-frozen and formalin-fixed paraffin embedded samples from the same tumor to robustly define real and FFPE-derived, artifactual variation and enable objective evaluation of filtering methods. To address existing shortcomings, we propose a novel explainable boosting machine (EBM) model that improves performance, can be easily updated with new data, requires modest computational resources, and is analysis pipeline agnostic, making it broadly accessible. Results: We evaluated several methods for limiting FFPE-derived variant artifacts using cohorts of B-cell lymphoma samples. We found capturing local context around variants to be a highly informative, under-utilized feature set not commonly incorporated into many existing machine learning methods. Consequently, we developed a novel algorithm, FIFA, for filtering FFPE artifacts, which uses an EBM model, an interpretable decision-tree-based learning algorithm, to address some of the existing shortcomings. We used four independent cohorts composed of paired lymphoma and cervical cancer samples and a breast cancer cell line with both FF and FFPE samples to define clearly annotated training and test sets and demonstrated improved performance over existing methods. Additionally, FIFA filtering increased relevant biological signals in FFPE breast cancer datasets distinct from the training and testing sets. 
The EBM framework employed by FIFA is computationally efficient, and its generalized additive modeling of features makes it straightforward to incorporate new data into existing models over time. Conclusions: Our novel FFPE variant artifact filtering tool, FIFA, is a marked improvement over existing methods. It can be easily implemented, post hoc, to supplement existing somatic calling pipelines; training and inference can be run quickly across most compute environments; and it can be easily updated online as new training data becomes available. Accordingly, FIFA represents an important advance in retrospective cancer genomics research by further enhancing access to the vast stores of FFPE-archived tumor samples currently in existence.
bioinformatics 2026-03-13 v1
Fast and accurate resolution of ecDNA sequence using Cycle-Extractor
Faizrahnemoon, M.; Luebeck, J.; Hung, K. L.; Rao, S.; Prasad, G.; Tsz-Lo Wong, I.; G. Jones, M.; S. Mischel, P.; Y. Chang, H.; Zhu, K.; Bafna, V.
Abstract
Extrachromosomal DNA (ecDNA) plays a key role in cancer pathology. EcDNAs mediate high oncogene amplification and expression and are associated with worse patient outcomes. Accurately determining the structure of these circular molecules is essential for understanding their function, yet reconstructing ecDNA cycles from sequencing data remains challenging. We introduce Cycle-Extractor (CE) for reconstruction. CE accepts a breakpoint graph derived from either short- or long-read sequencing data as input and extracts a cycle with the maximum length-weighted copy number. CE utilizes a mixed-integer linear program (MILP) and a separate traversal procedure, enabling fast optimization and compatibility with free solvers. We evaluated CE against CoRAL (long-read-based quadratic optimization), Decoil (long reads), and AmpliconArchitect (AA, for short reads) on both simulated data and real cancer cell lines. On simulated ecDNA, CE achieves performance comparable to CoRAL across three accuracy metrics and consistently outperforms AA and Decoil. On cancer cell lines, CE produces longer and heavier cycles than AA, and achieves performance similar to CoRAL. Moreover, CE is, on average, 40× faster than CoRAL. These results demonstrate that CE accurately reconstructs ecDNA from both short- and long-read sequencing data, while long-read inputs allow CE to recover more complete and higher-confidence ecDNA structures. CE improved the prediction of many ecDNA structures. On a PC3 ecDNA containing MYC, CE uses ONT data to reconstruct a substantially larger and higher-copy sequence (4.2 Mbp) compared to the short-read-derived reconstruction (690 Kbp). CRISPR-CATCH experiments confirm the presence of a large ecDNA molecule, validating the long-read-based CE reconstruction.
bioinformatics 2026-03-13 v1
Multiscale conformational sampling of multidomain fusion proteins by a physics informed diffusion model
Su, Z.; Wang, B.; Wu, Y.
Abstract
Multidomain fusion proteins, such as bispecific antibodies, rely on highly flexible linker regions for their therapeutic efficacy. Characterizing these vast conformational ensembles is crucial for rational drug design; however, while all-atom molecular dynamics (MD) is the traditional gold standard, its immense computational cost makes simulating large-scale domain motions prohibitive. Recently, deep generative diffusion models have emerged as a rapid alternative for sampling protein dynamics. Yet, being trained primarily on massive databases of structured, static domains, these generic models often lack the biophysical constraints required to thoroughly sample the large-scale dynamics of highly flexible multidomain architectures. To overcome this, we leverage microsecond MD trajectories of a multidomain protein construct with various linkers to train a multiscale diffusion framework utilizing an Equivariant Graph Neural Network (EGNN). To efficiently model the dynamics of the large molecular complexes, we employ a coarse-grained spatial graph that condenses rigid domains into center-of-mass anchors while preserving explicit backbone resolution for the flexible linker. By further integrating foundational rules in biophysics directly into both the training objective and the inference process, our model generates high-fidelity conformational ensembles that reproduce the thermodynamic distributions of long-timescale MD. This physics-informed approach provides a mathematically stable, highly scalable platform for the rapid multiscale characterization of flexible biologics, significantly accelerating the rational design of fusion protein therapeutics.
bioinformatics 2026-03-13 v1
EnsAgent: a tool-ensemble multiple Agent system for robust annotation in spatial transcriptomics
Zhang, D.; Zhang, M.; Li, N.; Zheng, C.; Liang, L.; Ke, X.; Dong, Q.
Abstract
Motivation: Automated domain annotation in spatially resolved transcriptomics (SRT) remains challenging since it depends on gene expression, morphology, and clinical conventions, which vary across cohorts and platforms. While Large Language Model (LLM)-driven agents show promise, current approaches typically condition semantic reasoning on static, single-method partitions. This reliance makes annotation pipelines fragile to upstream partition errors and prone to hallucinations when molecular evidence is ambiguous. A robust framework integrating ensemble intelligence with iterative, evidence-based reasoning is required to ensure reproducibility and accuracy. Results: We introduce EnsAgent, a tool-ensemble multi-agent system designed for robust SRT annotation. Uniquely, EnsAgent decouples structural partitioning from semantic labeling via a Consultation-Review workflow. A Tool-Runner Agent orchestrates a diverse portfolio of clustering algorithms via the Model Context Protocol (MCP), generating a consensus partition optimized by a multimodal Scoring Agent. Subsequently, a Proposer-Critic feedback loop coordinates four specialized experts (Marker, Pathway, Spatiality, and Visual) to formulate annotations with explicit evidence trails and uncertainty estimates. Benchmarking on three SRT datasets demonstrates that EnsAgent effectively neutralizes batch effects and resolves subtle tumor microenvironment niches missed by single-paradigm baselines, delivering state-of-the-art accuracy and interpretability. Availability and Implementation: EnsAgent is available at github.com/keviccz/ensAgent.
bioinformatics 2026-03-13 v1
MetaResNet: Enhancing Microbiome-Based Disease Classification through Colormap Optimization and Imbalance Handling
Qureshi, A.; Wahid, A.; Qazi, S.; Khattak, H. A.; Hussain, S. F.
Abstract
Image-based representations of metagenomic data enable convolutional neural network (CNN) applications for microbiome disease classification. However, the impact of colormap selection on model performance remains unexplored. Current approaches arbitrarily select visualization parameters despite evidence that colormap choices can suppress minority-class features in imbalanced microbiome datasets. This study systematically evaluates colormap effects on metagenomic disease classification to establish evidence-based visualization guidelines. We developed MetaResNet, a custom CNN architecture incorporating residual blocks and attention mechanisms, to assess five colormap schemes (Jet, YlGnBu, Reds, Paired and nipy_spectral) across four benchmark datasets: inflammatory bowel disease (n=110), colon cancer (n=121), women type 2 diabetes (n=96), and obesity (n=253). Class imbalance was addressed using Synthetic Minority Over-sampling Technique (SMOTE) versus class weighting strategies. Custom data augmentation preserved taxonomic abundance relationships while enhancing generalization. Performance evaluation employed F1-score, area under the receiver operating characteristic curve (AUC-ROC), Matthews correlation coefficient (MCC), precision, and recall to address accuracy limitations in imbalanced scenarios. Results identified the Jet colormap coupled with SMOTE as the optimal global configuration, maximizing signal retention and achieving peak performance (AUC 1.00 in Colon). SMOTE significantly improved minority-class recall over class weighting (0.81 ± 0.09 vs. 0.69 ± 0.11, p=0.003). MetaResNet achieved performance comparable to current state-of-the-art frameworks, while statistically outperforming established deep learning baselines (e.g., DeepMicro, PopPhy-CNN; p=0.025) in discriminatory power (AUC), with peak values exceeding 0.96.
These findings demonstrate that visualization efficacy is strategy-dependent, establishing MetaResNet as a robust framework for microbiome-based diagnostics that supports evidence-based visualization strategies for precision medicine.
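The abstract above reports MCC alongside accuracy-style metrics because accuracy can mislead on imbalanced data. A minimal sketch of that point, using the standard MCC formula on confusion-matrix counts; the counts below are made up for illustration and are not taken from the paper, and the helper names are hypothetical.

```python
import math

def matthews_corrcoef(tp, tn, fp, fn):
    """MCC from confusion-matrix counts; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

def accuracy(tp, tn, fp, fn):
    """Fraction of all samples classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical imbalanced cohort: 90 controls, 10 patients,
# and a degenerate classifier that labels every sample "control".
counts = dict(tp=0, tn=90, fp=0, fn=10)
print(accuracy(**counts))           # 0.9
print(matthews_corrcoef(**counts))  # 0.0
```

Accuracy looks strong at 0.9 even though no patient is detected, while MCC stays at 0.0, which is why it is a common companion metric when classes are unbalanced.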
bioinformatics 2026-03-13 v1
Nanoscale Material Size Shapes Distinct Immune Transcriptional States Under Physiological Flow
Kovacevic, V.; Milivojevic Dimitrijevic, N.; Mihailovich, M.; Zivanovic, M.; Ivanovic, M.; Zivic, A.; Jankovic, M. G.; Kovacevic, A.; Zmrzljak, U. P.; Puac, F.; Filipovic, N.; Ljujic, B.
Abstract
Nanoscale materials interact with circulating immune cells, yet how material size and exposure complexity shape transcriptional state organization under physiological flow conditions remains poorly understood. Controlled microfluidic exposure is combined with single-cell RNA sequencing to examine how size-defined polystyrene nanoplastics (PSNPs; 40 nm, 200 nm) and their combination modulate transcriptional programs in primary human peripheral blood mononuclear cells (PBMCs) under dynamic flow conditions. Across immune populations, PSNP exposure induces a conserved translational and RNA-regulatory program, indicating a shared intracellular adaptation framework. Upon this backbone, innate and adaptive immune compartments exhibit distinct organizational principles. Monocytes undergo size-dependent, pathway-coherent state remodeling, whereas B cells and CD4+ T cells display distributed, lineage-preserving transcriptional tuning without discrete state transitions. Combined exposure to different particle sizes does not produce additive responses but instead generates integrated transcriptional states in monocytes, revealing non-linear immune adaptation to heterogeneous material cues. These findings demonstrate that nanoscale material size and exposure complexity shape immune transcriptional state architecture under physiological flow and establish a framework for understanding dynamic material-immune interfaces at single-cell resolution.
bioinformatics 2026-03-13 v1
Learning the All-Atom Equilibrium Distribution of Biomolecular Interactions at Scale
Wang, Y.; Xu, Y.; Li, W.; Yu, H.; Tan, W.; Li, S.; Huang, Q.; Chen, N.; Wu, X.; Wu, Q.; Liu, K.
Abstract
Biomolecular functions are governed by dynamic conformational ensembles rather than static structures. While models like AlphaFold have revolutionized static structure prediction, accurately capturing the equilibrium distribution of all-atom biomolecular interactions remains a significant challenge due to the high computational cost of molecular dynamics (MD). We present AnewSampling, a transferable generative foundation framework designed for the high-fidelity sampling of all-atom equilibrium distributions, which is the first model to faithfully reproduce MD at the all-atom level. It uses a novel quotient-space generative framework to ensure mathematical consistency and leverages the largest self-curated database of protein-ligand trajectories to date, with over 15 million conformations. Statistically, AnewSampling consistently outperforms all prior generative methods on the ATLAS monomer benchmark, and the all-atom capabilities of AnewSampling enable close statistical alignment with ground-truth MD for evaluating atomic biomolecular interactions in protein-ligand dynamics. Furthermore, AnewSampling successfully recovers coupled ligand and side-chain motions in CDK2 systems, overcoming a major sampling hurdle inherent to conventional MD. AnewSampling enables rapid exploration of conformational landscapes prior to intensive simulations, elucidating fundamental biophysical mechanisms and accelerating the broader design of functional biomolecules.
bioinformatics 2026-03-13 v1
SuperSurv: A Unified Framework for Machine Learning Ensembles in Survival Analysis
Lyu, Y.; Huang, X.; Lin, S. H.; Li, Z.
Abstract
This paper introduces SuperSurv, a user-friendly R package for building, evaluating, and interpreting ensemble models for right-censored survival data. Although many survival modeling methods are available, existing tools are often model-specific and lack a unified platform for systematically integrating, comparing, and ensembling heterogeneous learners. SuperSurv addresses this gap by providing a unified interface for diverse survival learners, including models that return full survival curves as well as methods that produce only risk scores. All learner outputs are mapped to calibrated survival probability curves on a common evaluation time grid, enabling direct comparison and ensemble construction across heterogeneous model classes. SuperSurv implements stacking of survival models using inverse-probability-of-censoring weighted (IPCW) Brier risk to estimate ensemble weights in the presence of right censoring. The framework integrates hyperparameter tuning, time-dependent benchmarking metrics, and visualization tools for survival model evaluation. In addition, the package provides post-hoc interpretability utilities based on SHAP values and supports covariate-adjusted restricted mean survival time (RMST) contrasts through g-computation. By bridging the gap between theoretical rigor and clinical application, SuperSurv offers researchers a comprehensive ecosystem for modern survival analysis. The SuperSurv package is open-source and freely available at https://github.com/yuelyu21/SuperSurv. An empirical example using the METABRIC breast cancer dataset demonstrates a complete workflow from model training and benchmarking to explainability and clinically interpretable survival contrasts.
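The abstract above mentions an inverse-probability-of-censoring weighted (IPCW) Brier risk. As a hedged illustration of the general idea, the sketch below computes a textbook IPCW Brier score at a single time point in plain Python. It is not code from the SuperSurv R package: the function names are hypothetical, the censoring distribution is estimated with a simple Kaplan-Meier step function, and tied event and censoring times get no special handling.

```python
def censoring_survival(times, events):
    """Kaplan-Meier estimate of the censoring survival function G.

    events[i] is 1 for an observed event and 0 for censoring.
    Returns G(t, left=False); left=True gives the left limit G(t-).
    """
    steps = []  # (censoring time, value of G just after that time)
    g = 1.0
    for c in sorted({t for t, e in zip(times, events) if e == 0}):
        at_risk = sum(1 for t in times if t >= c)
        n_cens = sum(1 for t, e in zip(times, events) if t == c and e == 0)
        g *= 1.0 - n_cens / at_risk
        steps.append((c, g))

    def G(t, left=False):
        value = 1.0
        for c, g_c in steps:
            if c < t or (c == t and not left):
                value = g_c
            else:
                break
        return value

    return G

def ipcw_brier(times, events, surv_probs, t):
    """IPCW Brier score at time t.

    surv_probs[i] is the model's predicted P(T_i > t) for subject i.
    """
    G = censoring_survival(times, events)
    total = 0.0
    for T, e, s in zip(times, events, surv_probs):
        if T <= t and e == 1:
            # Event before t: the true survival status at t is 0.
            total += s ** 2 / G(T, left=True)
        elif T > t:
            # Still at risk at t: the true survival status at t is 1.
            total += (1.0 - s) ** 2 / G(t)
        # Subjects censored before t contribute 0; the weights compensate.
    return total / len(times)

# Made-up data: four subjects, the second one censored at time 2.
times = [1, 2, 3, 4]
events = [1, 0, 1, 1]
preds = [0.2, 0.4, 0.7, 0.9]  # predicted survival probabilities at t = 2.5
print(ipcw_brier(times, events, preds, t=2.5))
```

With no censoring every weight equals 1 and the score reduces to the ordinary mean squared error between predicted survival probabilities and observed status at time t.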
bioinformatics 2026-03-13 v1
DNA-MGC+: A versatile codec for reliable and resource-efficient data storage on synthetic DNA
Khabbaz, R.; Mateos, J.; Antonini, M.; Kas Hanna, S.
Abstract
The biochemical processes underlying DNA data storage, including synthesis, amplification, and sequencing, are inherently noisy. Consequently, base-level insertion, deletion, and substitution (IDS) errors, as well as sequence-level dropouts, occur and pose major challenges for reliable data retrieval. Here we introduce DNA-MGC+, a DNA storage codec designed to enable reliable and resource-efficient data retrieval under diverse operating conditions. We evaluate DNA-MGC+ across a wide range of in silico and in vitro settings, including experiments with both Illumina and Nanopore sequencing, and show that it consistently outperforms existing codecs. In particular, DNA-MGC+ achieves simultaneous gains in sequencing depth requirements, read cost, decoding time, storage density, and error-correction capability under explicit reliability constraints. Notable results include reliable decoding under IDS error rates of up to 24% in synthetic scenarios, and reliable retrieval at sequencing depths below 3x with read costs below 3.5 bits/nt under electrochemical synthesis for both Illumina and Nanopore sequencing.
bioinformatics 2026-03-13 v1
StrainVis: interactive visual strain-level analysis of microbiome data
Paz, I.; Ley, R. E.; Enav, H.
Abstract
Background: Microbiomes contain multiple conspecific strains whose genomic differences arise from both single nucleotide variants (SNVs) and structural variation (insertions, deletions, recombination). Recently, computational tools to assess strain-level differences became available, based either on average nucleotide identity (ANI) or on the average pairwise synteny (APSS) of strains, which are sensitive, respectively, to either SNVs or to structural variation. However, strain-level analyses remain technically challenging and fragmented across approaches and combining these complementary signals typically requires substantial bioinformatic expertise. Results: Here we present StrainVis, a web-based analysis and visualization platform that integrates outputs from both ANI- and APSS-based strain tracking tools to enable unified, interactive exploration of within-species diversity. StrainVis allows users to perform per-species and multi-species comparisons, incorporate metadata and gene annotations, and generate statistical summaries and publication-ready figures without programming. Conclusions: By lowering technical barriers and enabling joint interpretation of sequence and structural variation, StrainVis makes advanced strain-level microbiome analysis accessible to a broader community and facilitates discovery of evolutionary patterns that would be missed by single-method approaches alone.
bioinformatics 2026-03-13 v1
Orally Delivered dsRNA-Derived siRNAs Reach the Central Nervous System in Leptinotarsa decemlineata
Amineni, V. P. S.; Cedden, D.
Abstract
RNA interference (RNAi) has emerged as an eco-friendly approach to pest management and relies on the processing of exogenous double-stranded RNA (dsRNA). RNAi-based pest management is highly effective in the Colorado potato beetle (Leptinotarsa decemlineata); however, the tissue-specific distribution and processing of exogenous dsRNA following oral uptake remain incompletely understood. In this study, we investigated whether ingested dsRNA reaches the central nervous system (CNS) and is processed into active small interfering RNAs (siRNAs). Adult beetles were fed dsmGFP-coated leaf disks, and RISC-bound small RNAs were isolated from midgut, CNS, and remaining body tissues using a RISC-enrichment approach. Small RNA sequencing revealed abundant 21-nucleotide antisense guide-strand siRNAs in all analysed tissues, with relative proportions following the order midgut > CNS > remaining tissues. Notably, antisense siRNAs of consistent size were detected in CNS samples, indicating that exogenous dsRNA or its processed products can access neural tissue and enter the RNAi silencing machinery. These findings provide strong biochemical evidence that orally taken-up dsRNA is processed into AGO-loaded siRNAs in the L. decemlineata CNS. Together, our results offer a tissue-resolved view of functional RNAi activity in this species and contribute to a mechanistic understanding of systemic dsRNA transport in coleopteran pests.
bioinformatics 2026-03-13 v1
Context-dependent genetic regulation of gene expression in pigs
Wang, F.; Wang, C.; Teng, J.; Fang, L.; Ionita-Laza, I.
Abstract
Production livestock provide a natural system for studying gene regulation under physiologically demanding conditions shaped by rapid growth, environmental exposure, and immune challenges. Using farm pigs from the PigGTEx resource as a model, we applied quantile regression to uncover latent, context-dependent genetic effects on gene expression across tissues. This approach identifies quantile-specific expression quantitative trait loci (eQTLs) that are not detected by standard linear regression and are enriched in distal regulatory elements and three-dimensional genome architectural features rather than promoter-proximal regions. Genes with quantile-dependent eQTLs are more intolerant to loss-of-function variants and exhibit stronger enrichment in GO functional categories, indicating their likely functional significance. Cross-species comparisons reveal substantial overlap between pig and human eGenes across tissues, indicating conservation of regulatory architecture. Notably, many quantile-specific eQTLs influence tail expression states and involve genes relevant to human disease. For example, we identify a cis-eQTL affecting the conserved transcriptional regulator BCL6B in pig blood that modulates enhancer activity and reduces expression at lower quantiles. In contrast, BCL6B is minimally expressed in resting human blood and lacks detectable cis-regulatory variation under baseline conditions, consistent with its reported induction during immune activation. These findings demonstrate that pig eQTL maps can reveal context-dependent regulatory variation at loci that remain silent or weakly variable in human cohorts.
bioinformatics 2026-03-13 v1
scprocess: a pipeline for processing, integrating and visualising atlas-scale single cell data
Koderman, M.; Pilarski, J.; Bianco, E.; Gonzalez, D.; Robinson, M. D.; Macnair, W.
Abstract
The transition toward "atlas-scale" single cell research has resulted in datasets comprising millions of cells across hundreds of samples, creating significant challenges for data management, computational efficiency, and reproducibility. While numerous methods are available for individual steps in single cell data processing, the highly complex nature of the analysis makes it challenging to maintain a clear record of every tool and parameter used. This makes final results difficult to reproduce, highlighting the need for a unified workflow that integrates multiple steps into a cohesive framework. We introduce scprocess, a Snakemake pipeline designed to streamline and automate the complex steps involved in processing single cell RNA sequencing data. Specifically optimized for data generated using the 10x Genomics technology, it provides a comprehensive solution that transforms raw sequencing files into standardized outputs suitable for a variety of downstream tasks. The pipeline is built to support the analysis of datasets comprising multiple (e.g. 100+) samples via a simple CLI, allowing researchers to efficiently explore their datasets while ensuring reproducibility and scalability in their workflows. scprocess can be installed via GitHub (https://github.com/marusakod/scprocess) under the MIT license. Documentation, including setup instructions and tutorials on example datasets is available at https://marusakod.github.io/scprocess/.
bioinformatics 2026-03-13 v1
Expression-based annotation identifies and enables quantification of small vault RNAs (svtRNAs) in human cells
Sheppard, J. D.; Smircich, P.; Duhagon, M. A.; Fort, R. S.
Abstract
Background: Small non-coding RNAs (sncRNAs) play central roles in post-transcriptional gene regulation. In addition to canonical microRNAs (miRNAs), fragments derived from vault RNAs (vtRNAs), called small vault RNAs (svtRNAs), have been reported in human cells. However, the absence of a standardized annotation framework has hindered their systematic detection, quantification, and comparison across small RNA sequencing (small RNA-seq) studies. Methods: We developed an expression-based annotation strategy to identify svtRNAs from human small RNA-seq datasets. Using FlaiMapper followed by structure- and expression-based filtering, we generated two annotation sets: (i) a stringent miRNA-like set enriched in Argonaute-associated datasets, and (ii) a broader Total set derived from total small RNA-seq libraries under relaxed structural constraints. We explored the expression of the annotated svtRNAs across the datasets analyzed: multiple normal and tumor-derived human cell lines, including Argonaute immunoprecipitation datasets. Results: We identified a repertoire of svtRNAs that are detected across independent datasets and, in several cases, reach abundance levels comparable to canonical miRNAs. Several highly abundant svtRNAs correspond to molecules with experimental validation from prior studies, supporting the robustness of our annotation strategy. Importantly, the same dominant (in terms of gene expression) svtRNAs emerged independently from Argonaute-associated and total small RNA datasets, supporting the idea of enzymatically consistent, reproducible svtRNA processing. We further identified svtRNAs derived from distinct vtRNA precursors that could share identical seed sequences, suggesting the possibility of svtRNA families with potential miRNA-like regulatory properties. We provide a standardized annotation that enables reproducible svtRNA quantification. Conclusions: Our study establishes a comprehensive expression-based annotation resource for human svtRNAs.
By enabling their systematic detection and reproducible quantification, we show that svtRNAs appear to represent an abundant component of the human small RNA landscape.
bioinformatics2026-03-13v1STEVE: Single-cell Transcriptomics Expression Visualization and Evaluation
Torbenson, E. J.; Ma, X.; Lin, J.-R.; Garry, D.; Jameson, S. C.; Zhang, Z.; Niedernhofer, L. J.; Zhang, L.; Li, M.; Dong, X.Abstract
Single-cell RNA sequencing (scRNA-seq) has become a key technology for characterizing cell-type heterogeneity in complex tissues. However, its utility depends on accurate and reproducible cell-type annotation, which remains a major analytical challenge. Although hundreds of computational tools have been developed for automated annotation, there is currently no systematic framework to evaluate annotation robustness in a dataset-specific manner or within the context of complete analytical pipelines. Here, we present STEVE (Single-cell Transcriptomics Expression Visualization and Evaluation), a quantitative framework designed to assess the accuracy, robustness, and reproducibility of cell-type annotation in scRNA-seq studies. STEVE implements three complementary in silico evaluation modules: (i) Subsampling Evaluation to quantify annotation stability under varying reference sizes and data partitions; (ii) Novel Cell Evaluation to assess the ability to detect previously unseen cell types; and (iii) Annotation Benchmarking to compare alternative annotation tools against ground-truth labels. In addition, STEVE includes a Reference Transfer Annotation module that enables cross-dataset cell-type mapping using external reference datasets. All modules are built upon a unified probabilistic framework that provides consistent confidence estimation across evaluation scenarios. We evaluated STEVE across four independent scRNA-seq datasets with experimentally defined or expert-curated cell-type labels. Our results show that annotation robustness is strongly influenced by the annotation method, biological separability, dataset complexity, and batch effects. STEVE provides a practical framework for quantifying annotation uncertainty and improving reproducibility in single-cell transcriptomic analyses. STEVE is freely available at GitHub (https://github.com/XiaoDongLab/STEVE).
bioinformatics2026-03-13v1EoRNA2: Autonomous Data Discovery and Processing for Databasing of Gene Expression Data
Milne, L.; Simpson, C. G.; Guo, W.; Mayer, C.-D.; Milne, I.; Bayer, M.Abstract
We describe a major new release of the EoRNA database, a gene expression database for barley based on public data, first published in 2021. EoRNA v.2 (https://ics.hutton.ac.uk/eorna2/index.html) features an order of magnitude more samples and is based on a new automated workflow of sample discovery and processing, which has enabled a dramatic scale-up of the original database. EoRNA v.2 also features a major rebuild of the web user interface with rich new functionality. All infrastructure-related code, database schemas, and web components are now species agnostic and publicly available for reuse with other taxa. A dedicated new reference transcript dataset has been created for EoRNA v.2, which is largely based on the recently published barley pan-transcriptome and represents the most comprehensive dataset of its kind to date.
bioinformatics2026-03-13v1Immune Transcriptional Signatures Across Human Cardiomyopathy Subtypes: A Multi-Cohort Integrative Computational Analysis
Adegboyega, B. B.; Okorie, B.; Courage, P.Abstract
Background: Heart failure, arrhythmia, and sudden cardiac death are common outcomes of cardiomyopathies, which are molecularly diverse heart muscle disorders marked by structural and functional myocardial dysfunction. The lack of sensitive molecular biomarkers that precede overt physiological deterioration makes early diagnosis difficult despite advancements in imaging and clinical classification. The immune transcriptional landscape across cardiomyopathy subtypes is still poorly understood, despite growing evidence implicating both innate and adaptive immune dysregulation, such as macrophage activation and T-cell and inflammatory cytokine networks, as active contributors to myocardial remodelling and disease progression. Methods: We performed a multi-cohort integrative transcriptomic analysis of 1,068 cardiac tissue samples from five publicly available GEO datasets (GSE57338, GSE5406, GSE36961, GSE141910, GSE47495) spanning dilated, ischemic, hypertrophic, and peripartum cardiomyopathy. Using a fully scripted R and Python pipeline, we conducted differential expression analysis (limma), immune cell deconvolution (xCell), pathway enrichment (clusterProfiler), weighted gene co-expression network analysis (WGCNA), and regularised machine learning classification (LASSO, Random Forest). Cross-dataset validation was performed in two independent cohorts on different microarray platforms. Results: Differential expression analysis identified 43 primary DEGs (FDR < 0.05, |log2FC| > 1.0), revealing a coherent immune-fibrotic program characterized by loss of anti-inflammatory macrophage markers (CD163, VSIG4), complement dysregulation (FCN3), innate interferon activation (IFI44L, IFIT2), and ECM remodelling (ASPN, SFRP4, LUM). xCell deconvolution identified coordinated depletion of adaptive immune populations in failing myocardium. WGCNA defined a fibrosis hub module (brown; CTSK, SULF1, SFRP4) and an immune collapse module (turquoise; MYD88, TNFRSF1A, LAPTM5). 
A nine-gene LASSO classifier achieved a cross-validated AUC of 0.986, with HMOX2 as the top-discriminating feature, implicating ferroptosis in cardiomyocyte death. Cross-platform validation in an independent HCM cohort (GSE36961) demonstrated a directional concordance of 34.9%. Conclusions: This study defines a reproducible immune-fibrotic transcriptional signature of human cardiomyopathy, nominates HMOX2 and ferroptosis as central pathomechanisms, and provides a validated nine-gene biomarker panel for future translational investigation.
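One way to read the directional concordance figure reported above is as the fraction of shared genes whose log2 fold change has the same sign in the discovery and validation cohorts. The minimal Python sketch below illustrates that reading. The definition itself, the function name `directional_concordance`, and the toy gene names and values are assumptions made for illustration; the abstract does not state how the authors computed the metric.

```python
def directional_concordance(discovery_lfc, validation_lfc):
    """Fraction of shared genes whose log2FC sign agrees between cohorts.

    Assumed definition for illustration only. Genes with a zero fold
    change in either cohort are skipped because they have no direction.
    """
    shared = [
        gene for gene in discovery_lfc
        if gene in validation_lfc
        and discovery_lfc[gene] != 0
        and validation_lfc[gene] != 0
    ]
    if not shared:
        raise ValueError("no shared genes with a non-zero fold change")
    agree = sum(
        1 for gene in shared
        if (discovery_lfc[gene] > 0) == (validation_lfc[gene] > 0)
    )
    return agree / len(shared)


# Hypothetical log2 fold changes; gene names and values are placeholders.
discovery = {"GENE_A": -1.4, "GENE_B": -1.2, "GENE_C": 1.8, "GENE_D": 2.1}
validation = {"GENE_A": -0.6, "GENE_B": 0.3, "GENE_C": 0.9, "GENE_E": 1.1}

score = directional_concordance(discovery, validation)
print(f"directional concordance: {score:.3f}")
```

In this toy input three genes are shared and two of them agree in direction, so the score is 2/3. Under this definition, unrelated fold-change directions would be expected to agree about half the time, which is a useful reference point when interpreting a reported value.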
bioinformatics2026-03-13v1ATAClone: Cancer Clone Identification and Copy Number Estimation from Single-cell ATAC-seq
Cain, L. D.; Trigos, A. S.Abstract
Single-cell analyses of cancer typically begin by identifying distinct populations of cancer cells by unsupervised clustering. However, in many cases this clustering is explained simply by differences in DNA copy number, which affects the interpretation of differential expression results and tumour heterogeneity studies. To detect and estimate these differences in copy number, we have developed ATAClone. Applicable to both standalone and multiome scATAC-seq assays, ATAClone first identifies cancer cells with shared DNA copy number profiles (i.e. clones), then estimates their copy number jointly. Importantly, ATAClone can determine an optimal clustering resolution automatically using simulations. By utilising only stably accessible regions, ATAClone maximises copy number signal while minimising unrelated biological and technical noise. Additionally, by leveraging differences in total DNA between cells, ATAClone can infer absolute copy number, even in the presence of polyploidy. Using cancer cell mixture experiments, we verify the ability of ATAClone to accurately separate clones based on copy number differences. Moreover, using matched scATAC-seq and bulk whole genome sequencing, we show that copy number estimates from ATAClone are more accurate than those derived with existing methods, achieving Pearson correlations between 0.75-0.95 with their bulk-derived estimates. ATAClone represents an important tool for disentangling the genetic and non-genetic contributions to gene expression in cancer, providing deeper insight into the evolutionary history and adaptive forces driving a tumour.
bioinformatics2026-03-13v1BioPipelines: Accessible Computational Protein and Ligand Design for Chemical Biologists
Quargnali, G.; Rivera-Fuentes, P.Abstract
Deep learning methods for protein structure generation, sequence design, and structure and property prediction have created unprecedented opportunities for protein engineering and drug discovery. However, using these tools often requires navigating incompatible software environments, diverse input/output formats, and high-performance computing infrastructure, any of which may hinder adoption by primarily experimental chemical biology laboratories. Here we present BioPipelines, an open-source Python framework that allows researchers to define multi-step computational design workflows in a few lines of code. Additionally, its robust yet modular architecture provides a straightforward way to expand the toolkit with different functionalities, particularly by leveraging coding agents, with little effort. The framework currently integrates over 30 tools encompassing structure generation, sequence design, structure prediction, compound screening, and analysis. The same workflow code can be prototyped interactively in a Jupyter notebook and then submitted for production-scale runs without modification. We demonstrate applications in inverse folding, gene synthesis, de novo protein design, compound library screening, iterative binding site optimization, and fusion-protein linker optimization. We hope this framework will empower researchers, allowing them to focus on the scientific question rather than computational logistics. BioPipelines is available under the MIT license at https://github.com/locbp-uzh/biopipelines.
bioinformatics2026-03-13v1SAMWOOD: An automated method to measure wood cells along growth orientation
Verlingue, K.; Brunel, G.; Decombeix, A.-L.; Ramel, M.; Tresson, P.Abstract
Quantitative wood anatomy requires precise measurement of wood cells. This step is often laborious and limiting for further analysis. We introduce Samwood, a tool based on the zero-shot Segment-Anything Model to easily segment cells on microscopic images without the need for a training dataset. The reconstruction of cell files then allows for the analysis of wood along growth orientation and precise measurement of anatomical properties of the wood, such as lumen areas. We tested our pipeline on an example dataset of fossil woods featuring deformation, heterogeneous preservation, and frequent artefacts, to assess the robustness of our approach. The model achieves a precision of 0.78 and a recall of 0.80, often producing segmentation of better quality and more consistent than a human operator. This approach substantially reduces analysis time, minimizes operator bias, and provides a robust and extensible framework for large-scale anatomical studies. The complete code pipeline is available at https://github.com/umr-amap/samwood.
bioinformatics2026-03-13v1Improving Local Ancestry Inference through Neural Networks
Medina Tretmanis, J.; Avila-Arcos, M. C.; Jay, F.; Huerta-Sanchez, E.Abstract
Motivation: Local Ancestry Inference (LAI) allows us to study evolutionary processes in admixed populations, uncover ancestry-specific disease risk factors, and better understand the demographic history of these populations. Many methods for LAI exist; however, these methods usually focus on cases of intercontinental admixture. In this work, we evaluate both existing and novel methods in challenging scenarios, such as downsampled reference panels, intracontinental admixture, and distant admixture events. Results: We present four novel LAI implementations based on neural network architectures, including Bidirectional Long Short-Term Memory and Transformer networks, which have not previously been used for LAI. We compare these novel implementations to existing methods for LAI across a variety of scenarios using the 1000 Genomes dataset and other synthetic datasets. We find that while all networks achieve high performance for intercontinental admixture scenarios, inference power is comparatively low for scenarios of intracontinental or distant admixture. We further show how our implementations achieve the best performance of all methods through specialized preprocessing and inference smoothing steps.
bioinformatics2026-03-13v1Descriptron-GBIF Annotator: A browser-based platform for crowdsourced morphological annotation of biodiversity images to help accelerate morphology based biodiversity data
Van Dam, A. R.; Hita Garcia, F.Abstract
The accelerating biodiversity crisis demands new approaches to taxonomic description that can scale beyond the capacity of professional taxonomists alone. We present the Descriptron-GBIF Annotator, a zero-installation, browser-based tool for morphological annotation of biodiversity specimen images retrieved directly from the Global Biodiversity Information Facility (GBIF). The application runs entirely client-side as a single HTML file, integrating SAM2 (Segment Anything Model 2) for AI-assisted segmentation, ontology-linked anatomical region templates covering 25 major taxonomic groups across 124 standardized views, 335 ontology Compact URI Expressions (CURIEs) with 745 possible ontology mentions, and structured trait attribute recording. The annotator supports multiple export formats including Darwin Core JSON, COCO JSON, traits CSV, and a novel JSON-LD knowledge graph linking specimens to anatomical regions and morphological traits via UBERON and domain-specific ontologies. A built-in Zenodo publishing pipeline enables users to deposit annotations as citable datasets with DOIs directly from the browser. Additionally, users can annotate images from Zenodo BioSysLit, enabling direct annotation of taxonomic treatments. We position this public-facing tool as the first tier of a two-tier architecture complementing the Descriptron Portal, a GPU-accelerated professional workbench for taxonomists providing tools for fine-tuning AI models, geometric morphometrics, and automated species descriptions. Together, these tiers create a feedback loop where public annotations generate training data for expert AI models, while expert-validated outputs improve the public tool. 
This approach draws on the citizen science model pioneered by Notes from Nature and iNaturalist to engage diverse audiences in structured morphological data collection, addressing a critical gap in biodiversity informatics where specimen images exist in abundance but structured morphological annotations remain scarce. To learn more go here: https://descriptrongbifannotator.org
bioinformatics2026-03-13v1Fleming: An AI Agent for Antibiotic Discovery in Mycobacterium Tuberculosis
Wei, Z.; Ektefaie, Y.; Zhou, A.; Negatu, D.; Aldridge, B. B.; Dick, T. B.; Skarlinski, M.; White, A.; Rodriques, S. G.; Hosseiniporgham, S.; Parai, M.; Flores, A.; Inna, K. V.; Zitnik, M.; Sacchettini, J.; Farhat, M. R.Abstract
Antibiotic development is challenged by high costs and failure rates. Artificial intelligence (AI) holds promise to overcome these challenges by predicting inhibitory properties of novel compounds, generating new candidates, and contextualizing property predictions in the biological background. Fleming is an integrative AI agent that explores novel chemical space to identify lead compounds meeting multiple criteria. The discriminative and generative AI models for Mycobacterium tuberculosis (Mtb) inhibition were trained on a set of 114,900 diverse compounds and fragments based on in vitro growth inhibition. We combined both models as well as molecular optimization, ADMET prediction and literature search functions to make Fleming an integrated agent for Mtb preclinical lead identification. Fleming has 17% higher discrimination between known Mtb leads and leads for other diseases than a generic LLM agent along with 13% higher discrimination than molecular property prediction alone on challenging ADMET tasks. Fleming demonstrates an 83% in vitro hit rate of predicted inhibition and a 100% hit rate of de novo generative design. Fleming's generative designs also demonstrate an 83% rate of favorable ADMET profiles. Fleming is an integrative AI agent able to explore new regions of the chemical space to select lead compounds that simultaneously meet several desirable criteria.
bioinformatics2026-03-12v2Harnessing methylation signals inherent in long-read sequencing data for improved variant phasing
Pfennig, A.; Akey, J. M.Abstract
Accurate phasing of genetic and epigenetic variation is crucial for many downstream analyses, including association testing, clinical variant interpretation, and inference of population history. Although long-read sequencing significantly improves the continuity and completeness of genome sequencing, reconstructing chromosome-scale haplotypes remains challenging, often requiring the integration of multiple technologies, such as PacBio HiFi and Oxford Nanopore Technologies (ONT) sequencing. While these sequencing platforms detect the epigenetic modification 5-methylcytosine (5mC), current read-based phasing algorithms do not incorporate this information. We developed a read-based phasing method named LongHap that seamlessly integrates sequence and methylation data and show that it significantly improves haplotype reconstruction. LongHap first creates phase blocks based on overlapping heterozygous sequence variants, accurately phasing complex variants by embedding them into the broader haplotype context through belief propagation. LongHap then dynamically identifies differentially methylated sites that are informative for phasing to refine and extend initial phase blocks. Through extensive analyses, we demonstrate that LongHap outperforms existing tools, including WhatsHap, HapCUT2, LongPhase, and MethPhaser, by achieving lower switch error rates and greater phase block contiguity. Crucially, we show that LongHap also improves variant phasing in challenging, medically relevant genes. In summary, by leveraging native methylation signals from long-read sequencing data, LongHap enhances long-range haplotype reconstruction, enabling more accurate haplotype-based genome analysis. LongHap is available from: https://github.com/AkeyLab/LongHap.
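For readers unfamiliar with the switch error rate used above to compare phasing tools, the following minimal Python sketch shows a common way to compute it within a single phase block. It is not LongHap's evaluation code. The function name `switch_error_rate` and the encoding (one 0/1 label per heterozygous site, recording which allele sits on the first haplotype) are assumptions for illustration.

```python
def switch_error_rate(truth, predicted):
    """Switch error rate within one phase block.

    Each list holds one 0/1 label per heterozygous site. A switch error is
    counted between consecutive sites when the predicted phase relationship
    (same or different label) disagrees with the true relationship.
    """
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must cover the same sites")
    if len(truth) < 2:
        raise ValueError("need at least two heterozygous sites")
    switches = sum(
        1 for i in range(len(truth) - 1)
        if (truth[i] != truth[i + 1]) != (predicted[i] != predicted[i + 1])
    )
    return switches / (len(truth) - 1)


# Toy example: the prediction follows the truth for three sites,
# then continues on the opposite haplotype (a single switch).
truth = [0, 0, 1, 1, 0, 1]
predicted = [0, 0, 1, 0, 1, 0]
rate = switch_error_rate(truth, predicted)
print(f"switch error rate: {rate:.2f}")
```

Because only the relationship between neighbouring sites is compared, a prediction that swaps the two haplotype labels across the whole block scores zero, which matches the fact that haplotype labels are arbitrary.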
bioinformatics2026-03-12v1DEX: a consensus-based amino acid exchangeability measure for improved codon substitution modelling
Douglas, G. M.; Bobay, L.-M.Abstract
Physicochemically similar amino acids undergo more frequent substitutions compared to dissimilar amino acid pairs. Despite their clear potential, amino acid similarity matrices remain underused in molecular evolution, partially due to the high number of proposed amino acid distance measures and the lack of agreement on which are most accurate. In this study, we assessed the performance of 30 amino acid distance measures, including a new amino acid distance measure we developed based on recent deep mutational scanning data. We compared these measures across codon substitution models fit to alignments spanning Streptococcus, Drosophila, and mammalian lineages, as well as segregating variants across Escherichia coli strains and human genotypes. We further constructed consensus measures from combinations of top-performing measures in this analysis using the DISTATIS approach and retested these matrices. Our results show that experimentally-derived measures, particularly our new measure and the existing experimental exchangeability (EX) measure, best fit codon substitution patterns across diverse lineages. We found that a consensus measure based on these two approaches, which we named DEX, performed best overall. In addition, although site-specific variant effect predictors are intended to identify deleterious mutations, the representative tools we tested did not outperform amino acid distance measures for predicting mean substitution frequencies. They were however substantially more informative for identifying individual highly deleterious mutations. Overall, we provide a systematic comparison of the performance of existing measures, and we introduce an improved general-purpose amino acid distance measure for molecular evolution models.
bioinformatics2026-03-12v1Rational Design of Selective IL-2-based Activators for CAR T Cells Using AlphaFold3 and Physics-Informed Machine Learning
Dahmani, L. Z.; Banerjee, A.Abstract
Recombinant human Interleukin-2 (rhIL-2, Aldesleukin) is used in immunotherapy for metastatic melanoma and renal cell carcinoma. Low-dose IL-2 has been investigated for administration after adoptive T cell transfer to enhance CAR T expansion and sustain effector function. However, systemic IL-2 can cause severe toxicities and promote expansion of regulatory T cells (Tregs). Previous attempts at mitigating cytokine-mediated side effects involved isolating CAR T cell signaling from endogenous immune responses by developing IL-2/IL-2Rβ-based selective ligand-receptor systems. Expressing these variant orthogonal (ortho)IL-2Rβ receptors in CAR T cells and supplying variant orthoIL-2 was shown to dramatically improve selectivity in CAR T cell expansion and anti-tumoral potency in a leukemia mouse model. This study describes the computational design of synthetic orthogonal cytokine receptor-ligand systems based on the scaffolds of the human canonical IL-2 and IL-2Rβ. Leveraging state-of-the-art AlphaFold3 (AF3) structure prediction capabilities and a physics-informed constrained sequence generator (CSG), the pipeline generates, filters and ranks sets of putative orthoIL-2/orthoIL-2Rβ mutant designs. Variants displaying minimal predicted off-target interactions and enhanced on-target contacts are prioritized for structural modelling. Top designs showed outstanding AF3 structural and interfacial quality metrics ipTM and pTM, with averages between cognate pairs of 0.724 ± 0.05 and 0.770 ± 0.042, respectively. All in-silico hits showed ipTM < 0.5 for non-cognates, indicating a good likelihood of orthogonality. Additionally, putative hits showed high levels of predicted structural fidelity to wild-type (WT) human IL-2/IL-2Rβ (PDB: 2ERJ), with an average structural root-mean-square deviation (RMSD) of 0.843 ± 0.375 Å. These mutants incorporated 7-26 interfacial mutations derived from multiple interface selection strategies. 
Altogether, the results support the putative foldability and selective affinity of top-ranking mutants, which display metrics close to or within the experimental reference range. Finally, strengths and limitations are discussed, alongside the experimental implications of coupling a constrained protein design pipeline to the discovery and validation of selective binders based on naturally occurring scaffolds.
bioinformatics2026-03-12v1Cyclic peptides space: The methodology of sequence selection to cover the comprehensive physical properties
Tsuchihashi, R.; Kinoshita, M.Abstract
Cyclic peptides have emerged as a pivotal modality for next-generation therapeutics, due to their superior biocompatibility, high selectivity, and structural stability. While AI-driven peptide design has advanced rapidly, conventional optimization algorithms are often constrained by initialization biases, which impede the efficient exploration of the vast chemical space. Here, we propose a novel methodology that integrates the protein language model ESM-2 with cyclic permutation averaging of embeddings to resolve this bottleneck. This approach establishes a comprehensive "peptide space", a high-dimensional vector representation that encapsulates the physicochemical and structural attributes of cyclic peptides. Our analysis reveals that random sequence selection results in a heterogeneous distribution within this space, potentially underrepresenting specific functional regions. Conversely, navigating this defined peptide space enables the selection of libraries that uniformly span diverse molecular properties. In a proof-of-concept study designing binders for β2-microglobulin (β2m), we demonstrate that initial sequences uniformly sampled from our peptide space yield superior candidates more efficiently than those derived from random selection. Furthermore, this framework facilitates the quantitative assessment of mutational perturbations on global peptide properties, supporting rational decision-making for both broad exploration and local optimization. This "peptide space" concept provides a foundational framework for defining appropriate search boundaries and enhancing computational efficiency in AI-mediated drug discovery.
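The idea of cyclic permutation averaging mentioned above can be sketched in a few lines of Python. A cyclic peptide has no defined start residue, so averaging the embeddings of every rotation of its linear sequence gives a representation that does not depend on where the ring is opened. This is a minimal sketch, not the authors' implementation. `toy_embed` is a hypothetical, position-sensitive stand-in for a protein language model such as ESM-2, and the sketch assumes one fixed-length vector per sequence; the abstract does not say whether per-residue or pooled embeddings are averaged.

```python
import math


def toy_embed(seq, dim=8):
    """Hypothetical stand-in for a protein language model embedding.

    Deliberately position-dependent, so different rotations of the same
    ring produce different vectors (as a real linear-sequence model would).
    """
    vec = [0.0] * dim
    for pos, aa in enumerate(seq):
        vec[(ord(aa) + pos) % dim] += 1.0 / (pos + 1)
    return vec


def cyclic_average_embedding(seq, embed=toy_embed):
    """Average the embeddings of all cyclic rotations of `seq`."""
    n = len(seq)
    if n == 0:
        raise ValueError("sequence must not be empty")
    rotations = [seq[i:] + seq[:i] for i in range(n)]
    vectors = [embed(rot) for rot in rotations]
    dim = len(vectors[0])
    # math.fsum gives a correctly rounded sum, independent of term order.
    return [math.fsum(v[k] for v in vectors) / n for k in range(dim)]


# Two linearisations of the same hypothetical six-residue ring.
a = cyclic_average_embedding("ACDEFG")
b = cyclic_average_embedding("DEFGAC")
same_ring = all(math.isclose(x, y) for x, y in zip(a, b))
print("rotation-invariant:", same_ring)
```

With a real model, `embed` would be replaced by a call that returns the model's sequence-level embedding. The averaging step stays the same, at the cost of one forward pass per rotation.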
bioinformatics2026-03-12v1Benchmarking zero-shot single-cell foundation model embeddings for cellular dynamics reconstruction
Zhou, X.; Wang, Z.; Ling, Y.; Tian, Q.; Zhang, Z.; Li, Y.; Zhou, P.; Chen, L.Abstract
Reconstructing cellular trajectories from time-resolved single-cell transcriptomics is fundamental to understanding processes from embryonic development to cancer progression. While single-cell foundation models (scFMs) promise universal biological representations through large-scale pretraining, their capacity to capture the non-linear dynamics governing cell-fate decisions remains uncharacterized. Here we systematically benchmark multiple scFMs across challenging biomedical scenarios involving branching lineages and continuous state transitions. By coupling zero-shot scFM embeddings with dynamic optimal transport, we evaluated their performance against a traditional highly variable gene (HVG) baseline in backtracking progenitor states, interpolating transition intermediates, and extrapolating future fates. We find that zero-shot scFM embeddings underperform the HVG baseline across diverse biological systems, particularly in recovering the distributional complexity of unobserved cells. Mechanistic analysis reveals that current scFM architectures tend to over-compress subtle temporal signals, causing an artificial "linearization" of branched biological structures that may obscure critical divergence points in disease progression. Our findings suggest that while scFMs provide unified cell-state views, the HVG baseline remains more robust for trajectory inference, identifying a fundamental "temporal-compression" bottleneck that must be addressed to develop next-generation, dynamics-aware foundation models.
bioinformatics2026-03-12v1Benchmarking BEAGLE to find optimal parameters for BEAST X
Fosse, S.; Duchene, S.; Duitama Gonzalez, C.Abstract
Bayesian phylogenetic analyses are notoriously time-consuming, largely because exploring the posterior distribution requires computing Felsenstein's likelihood. The BEAGLE library is a high-performance computational tool that dramatically accelerates the calculation of such likelihoods by leveraging parallel processing on GPUs, multicore CPUs, and SSE vectorisation. Here, we present results from benchmarking a widely popular phylogenetics package, BEAST X, using BEAGLE integration, focusing on how hardware allocation affects running times. We demonstrate substantial differences among BEAGLE settings on real Dengue Virus (DENV) data, both with and without partitioning. Using simulated sequences, we establish guidelines for GPU usage in BEAST X runs. These guidelines can be used for effective resource allocation for empirical analyses and simulation studies.
bioinformatics2026-03-12v1Directional Variant Tension (Tv): A Causal Framework for Quantifying Substitution Asymmetry
Karagöl, A.; Karagöl, T.Abstract
Amino acid substitutions are often directionally asymmetric due to underlying biophysical constraints and diverse evolutionary pressures. We introduce Tv (variant tension), a kernel regression-based metric that quantifies this directional asymmetry directly from multiple sequence alignments (MSAs). Tv leverages empirical amino acid frequencies and a non-parametric Gaussian kernel to capture nonlinear substitution flows, providing a causality-inspired framework for understanding evolutionary dynamics. We also present a web-based application that implements the calculation, allowing users to input MSAs, adjust parameters (kernel bandwidth, smoothing window size), and visualize results, including global tension scores and high-tension sites. Applying Tv to the human glutamate transporter (EAA1), we identify significant substitution asymmetries, localize high-tension sites, and reveal correlations between elevated Tv and known pathogenic variants. This framework integrates statistical learning with protein evolution, offering a powerful tool for bridging protein design principles with evolutionary inference. Beyond variant prioritization, Tv offers a scalable framework for simulating evolution under directional constraints, enabling predictive modeling of protein adaptation. The free web application is openly accessible at https://www.karagolresearch.com/variantt
bioinformatics2026-03-12v1GCN-Mamba: Graph Convolutional Network with Mamba for Antibacterial Synergy Prediction
Su, H.; Liang, Y.; Xiao, W.; Li, H.; Liu, X.; Yang, Z.; Yuan, M.; Liu, X.Abstract
The escalating crisis of antimicrobial resistance necessitates novel therapeutic strategies, among which drug combination therapy shows great promise by enhancing efficacy and reducing toxicity. However, identifying effective synergistic pairs from the vast combinatorial space remains experimentally challenging and resource-intensive. To address this, we introduce GCN-Mamba, a deep learning framework that integrates Graph Convolutional Networks (GCN) with the Mamba State Space Model. This architecture captures both local molecular topological structures and global implicit interactions by leveraging Extended 3-Dimensional Fingerprints (E3FP) and bacterial gene expression profiles. Evaluation on a comprehensive dataset demonstrated that GCN-Mamba significantly outperforms classical machine learning models in predictive accuracy. In a targeted case study against Methicillin-resistant Staphylococcus aureus (MRSA), the model successfully rediscovered known synergistic pairs, such as Quercetin and Curcumin, consistent with recent literature. Furthermore, prospective in vitro validation confirmed a novel synergistic combination of Shikimic acid and Oxacillin, validating the model's practical utility. By efficiently prioritizing potential candidates, GCN-Mamba serves as a powerful and reliable tool for accelerating the discovery of synergistic antimicrobial combinations, effectively bridging the gap between computational prediction and experimental validation.
bioinformatics2026-03-12v1HitAnno: Atlas-level cell type annotation based on scATAC-seq data via a hierarchical language model
Wang, Z.; Chen, X.; Cui, X.; Gao, Z.; Li, Z.; Li, K.; Jiang, R.Abstract
The single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq) has emerged as a core technology for dissecting cellular epigenomic heterogeneity and gene regulatory programs. With the emergence of atlas-level scATAC-seq datasets, cell type annotation increasingly faces challenges arising from unprecedented data scale and increased cell-type diversity, which together place stringent demands on model reliability and robustness. Here, we introduce HitAnno, a hierarchical language model capable of accurate and scalable cell type annotation in atlas-level scATAC-seq data. Leveraging selected cell-type-specific peaks to construct cell sentences, HitAnno employs a two-level attention mechanism that captures accessibility profiles hierarchically. Extensive evaluations show that HitAnno robustly annotates both major and rare cell types across multiple settings, including intra-dataset, cross-donor and inter-dataset annotation. The hierarchical attention mechanisms of the model reveal co-accessibility patterns among peaks and dependencies across higher-order peak sets, ensuring an interpretable annotation process. Training on a 31-cell-type human atlas, HitAnno can directly annotate new query datasets without retraining and is accessible through an online interface. Our model identifies heterogeneous subgroups within mixed labeled cells from unseen datasets, demonstrating its potential to assist researchers in refining existing cell atlases.
bioinformatics2026-03-12v1mnDINO: Accurate and robust segmentation of micronuclei with vision transformer networks
Ren, Y.; Morlot, L.; Andrews, J. O.; Thrane Hertz, E. P.; Mailand, N.; Caicedo, J. C.Abstract
Recent advances in cell segmentation successfully produce models that generalize across various cell lines and imaging types. However, these methods still fail to recognize subcellular structures such as micronuclei (MN), which are rare and tiny DNA-containing structures found outside of the main nucleus and observable under the microscope. While they can be hard to recognize in images, studying MN formation is of great interest because of their relationship to chromosome instability, genotoxicity, and cancer progression. Here we present a segmentation model, mnDINO, to segment micronuclei in DNA-stained images under diverse experimental conditions with very high efficiency and accuracy. To train this model, we collected a heterogeneous set of images with more than five thousand annotated micronuclei. Trained with this diverse resource, the mnDINO model improves the accuracy of MN segmentation, and exhibits strong generalization across microscopes and cell lines. The dataset, code, and pre-trained model are made publicly available to facilitate future research in MN biology.
bioinformatics2026-03-12v1Comparative Analysis of Structural and Dynamical Properties of Lipid Membranes Simulated with the AMBER Lipid21 ForceField Using SPC/E, TIP3P, TIP3P-FB, TIP4P-FB, TIP4P-Ew, TIP4P/2005, TIP4P-D, and OPC Water Models
Chakraborty, D. S.; Singh, P. P.; Dey, C.; Kaur, J.Abstract
We conducted all-atom molecular dynamics simulations of POPC and DPPC lipid bilayers using the AMBER Lipid21 force field with eight water models (SPC/E, TIP3P, TIP3P-FB, TIP4P-FB, TIP4P-Ew, TIP4P/2005, TIP4P-D, and OPC) to identify the one most compatible with the force field without any modification. To characterize bilayer structure, we computed the area per lipid, isothermal compressibility modulus, average volume per lipid, electron density profile, bilayer thickness, X-ray and neutron scattering form factors, deuterium order parameter, and radial distribution function. The estimated area per lipid, isothermal compressibility modulus, volume per lipid, and bilayer thickness are highly consistent with experimental results for the SPC/E water model, indicating that it is suitable for use with the AMBER Lipid21 force field without any modification. The electron density profiles of both bilayers show slightly greater water penetration relative to the membrane surface for the TIP4P-D water model. However, the simulated X-ray and neutron scattering form factors agree well with experiment for all water models studied, and TIP4P-D shows better agreement with the X-ray data. The deuterium order parameters of the lipid acyl chains are below 0.25 for all water models, indicating disordered chains in both bilayers. The lateral diffusion and reorientational autocorrelation function of the lipid molecules in both bilayers were computed to characterize their dynamics across all water models. Compared with the other water models, the SPC/E simulations predict better structural properties and reasonably fair dynamic properties. The TIP4P-Ew water model reproduces the lateral diffusion coefficient in close agreement with experiment.
Reorientational dynamics of both lipids were examined for all eight water models; the slow and slowest time components correspond to lipid axial (wobble) motion and twist/splay motions. In view of the overall performance of the different water models with the AMBER Lipid21 all-atom force field in reproducing membrane physical properties, the SPC/E water model appears to be an optimal choice.
bioinformatics2026-03-12v1DiaReport: Reproducible Workflow for Differential Expression Analysis and Interactive Reporting in DIA-based Proteomics
Argentini, A.; Fernandez Fernandez, E.; Pauwels, J.; Gevaert, K.Abstract
Data-independent acquisition (DIA) has become the preferred data acquisition method for mass spectrometry-based proteomics, yet reproducible workflows for differential expression (DE) analysis and reporting results remain limited. We present DiaReport, an R package that performs precursor- and protein-level DE analysis from DIA-NN output using MSqRob and QFeatures, while generating high-quality, interactive HTML reports through Quarto. DiaReport integrates precursor data, filtering of missing values, normalization, protein summarization and statistical modeling within a single function, supporting both simple pairwise and complex experimental designs. The package provides structured outputs and configuration files to ensure computational reproducibility across different studies. To accommodate diverse research needs, DiaReport includes multiple reporting templates tailored to different proteomic applications. Applying DiaReport to an extracellular vesicle (EV) proteomics dataset demonstrates its ability to efficiently analyze DIA data and provide rapid insights into sample quality and protein level differences. Availability and Implementation: DiaReport is an open-source R package available at https://github.com/Gevaert-Lab/diareport. The package is platform-independent and distributed under the MIT license. Reports are generated using Quarto and require only standard R dependencies. Detailed documentation, installation guides and usage vignettes are provided within the repository. The interactive HTML reports discussed in this study, including the UPS2 benchmark and EV case study, are archived on Zenodo (https://doi.org/10.5281/zenodo.18632744 and https://doi.org/10.5281/zenodo.18632731).
bioinformatics2026-03-12v1Evaluating transformer-based models for structural characterization of orphan proteins
Seckin, E.; Colinet, D.; Danchin, E.; Sarti, E.Abstract
Transformer-based models (TBMs) are state-of-the-art deep learning architectures that predict protein structural and functional features with high accuracy. Despite methodological differences, they all rely on large protein sequence datasets structured by homology, as homologous proteins typically share structure and function. However, 5-30% of eukaryotic proteomes consist of orphan proteins - sequences without detectable similarity to known families. Although they may share structural or functional traits with characterized proteins, their lack of homology makes them ideal for evaluating TBM generalization beyond familiar sequence space. We compared predictions from several widely used TBM architectures on an expert-curated set of orphan proteins from the Meloidogyne genus. None of these proteins has an experimentally determined structure. To assess model performance, we conducted consistency analyses, comparing predicted features with those observed in sets of known homologous proteins and across models. Multiple sequence alignment-based approaches such as AlphaFold2 performed poorly on orphan proteins, as did single-sequence or embedding-based language models including ESMFold, OmegaFold, and ProtT5. This limited performance cannot be fully attributed to intrinsic disorder, as confirmed by independent non-TBM disorder predictors. While accurate tertiary structure prediction remains out of reach, secondary structure is more reliably captured: predictors share about 70% of secondary structure elements on average, regardless of global fold similarity, and these elements are consistently identified by dedicated secondary structure tools.
bioinformatics2026-03-12v1GE-BiCross: A Hierarchical Bidirectional Cross-Attention Framework for Genotype-by-Environment Prediction in Maize
Zhou, S.; Zhao, T.Abstract
Genotype-by-environment interactions are central to crop adaptation and yield stability, yet they remain difficult to model for robust prediction across heterogeneous environments. Although enviromic profiling has improved the characterization of dynamic field conditions, most existing genomic prediction methods adopt a late-fusion strategy that encodes genomic and environmental information independently before global integration, thereby limiting their ability to resolve fine-scale, context-dependent G x E effects. Here, we developed GE-BiCross, a hierarchical bidirectional cross-attention framework for maize prediction. GE-BiCross incorporates a dual-path feature extraction module to disentangle independent and cooperative effects, a tokenized bidirectional cross-attention module to enable reciprocal genotype-environment interaction learning, and a mixture-of-experts module to adaptively capture heterogeneous response patterns across environments. Using a large-scale dataset of approximately 360,000 observations from 4,923 maize hybrids evaluated in 241 environments, GE-BiCross consistently outperformed conventional genomic prediction, machine learning, and deep learning baselines across six agronomic traits. The greatest improvements were observed for environmentally responsive and genetically complex traits. In particular, GE-BiCross achieved an R² of 0.672 for grain yield and 0.880 for grain moisture, significantly surpassing all comparison models. Ablation analyses demonstrated that the three core modules make distinct and complementary contributions to predictive performance. These results show that deep, bidirectional integration of genomic and enviromic information can substantially improve modeling of complex G x E interactions, providing a powerful framework for interpretable genomic prediction and climate-smart crop breeding.
bioinformatics2026-03-12v1Sassy2: Batch Searching of Short DNA Patterns
Beeloo, R.; Groot Koerkamp, R.Abstract
Motivation. Searching short DNA patterns such as barcodes, primers, or CRISPR spacers within sequencing reads or genomes is a fundamental task in bioinformatics. These problems are instances of multiple approximate string matching (MASM) [Baeza-Yates and Navarro, 1997], which requires locating all occurrences with up to k errors of multiple patterns of length m in a text of length n. Classical approaches based on seeding with exact matches become inefficient for short patterns (m ≤ 64 bp) as k increases, producing either many spurious hits or missing true matches. Our previous work, Sassy1, showed that careful hardware optimization drastically accelerates single-pattern searches in long texts by distributing chunks of the text across SIMD lanes. Methods. Sassy2 distributes multiple patterns across SIMD lanes to maximize parallelism when searching batches of short patterns. When k is small, often only a short substring of the pattern of length O(k) is needed to reject a possible match. Thus, Sassy2 first examines short suffixes of the patterns (e.g., the last 16 bp of 32 bp patterns), allowing more (but smaller) parallel SIMD lanes. Only positions passing this suffix filter undergo full pattern verification. Results. On synthetic data, Sassy2 achieves 10-50x speedups over Sassy1 for short texts (n ≤ 200 bp) and 2-4x for large texts (n ≥ 1 Mbp). On real-world tasks with 16 threads, Sassy2 reaches over 100 Gbp/s text throughput per guide when searching 312 gRNAs across the human genome and 116 Gbp/s throughput when demultiplexing Nanopore reads with 96 barcodes. In both cases, Sassy2 outperforms Sassy1 by 2-5x and Edlib by 20-45x. Availability. Sassy2 is implemented in Rust and available at github.com/RagnarGrootKoerkamp/sassy.
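Illustrative note (not from the preprint): the suffix-filter idea in the abstract above can be sketched in a few lines of scalar Python. This is not the Sassy2 implementation, which is written in Rust and uses SIMD; it is a minimal sketch that assumes plain edit distance and uses the classical Sellers dynamic program. The function names `end_costs` and `search_with_suffix_filter` and the parameter `suffix_len` are hypothetical. The filter loses no matches because any alignment of the full pattern with at most k errors ending at a position also aligns the pattern's suffix with at most k errors ending there.

```python
def end_costs(pattern: str, text: str) -> list[int]:
    """Sellers' algorithm: for each end position in `text`, the minimum edit
    distance between `pattern` and any substring of `text` ending there."""
    m = len(pattern)
    col = list(range(m + 1))  # column for the empty text prefix: D[i][0] = i
    costs = []
    for ch in text:
        prev_diag = col[0]  # D[0][j-1], always 0 because matches may start anywhere
        for i in range(1, m + 1):
            old = col[i]  # D[i][j-1]
            col[i] = min(
                prev_diag + (pattern[i - 1] != ch),  # match or substitution
                old + 1,                             # extra text character
                col[i - 1] + 1,                      # skipped pattern character
            )
            prev_diag = old
        costs.append(col[m])
    return costs


def search_with_suffix_filter(pattern: str, text: str, k: int, suffix_len: int):
    """Return (end, cost) for every end position (exclusive, 0-based) where
    `pattern` matches with at most k errors, checking a short suffix first."""
    m = len(pattern)
    suffix = pattern[-suffix_len:]
    hits = []
    for idx, suffix_cost in enumerate(end_costs(suffix, text)):
        if suffix_cost > k:
            continue  # the suffix already needs more than k errors: reject cheaply
        end = idx + 1
        # A match with at most k errors spans at most m + k text characters.
        window = text[max(0, end - (m + k)):end]
        cost = end_costs(pattern, window)[-1]
        if cost <= k:
            hits.append((end, cost))
    return hits


text = "GGGACGTACGTGGG"
print(search_with_suffix_filter("ACGTACGT", text, k=0, suffix_len=4))  # [(11, 0)]
```

In this toy input the 4 bp suffix matches exactly at two end positions (7 and 11), but only the second survives full verification, which is the cheap-reject-then-verify pattern the abstract describes.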
bioinformatics2026-03-12v1AlphaFind v2: Similarity Search in AlphaFold DB and TED Domains across Structural Contexts
Slaninakova, T.; Rosinec, A.; Cillik, J.; Krenek, A.; Gresova, K.; Porubska, J.; Marsalkova, E.; Olha, J.; Prochazka, D.; Hejtmanek, L.; Dohnal, V.; Berka, K.; Svobodova, R.; Antol, M.Abstract
The availability of large-scale protein structure collections enables structure-based analysis of their function and evolution beyond what is possible from sequence alone. However, applying three-dimensional structure comparison at scale remains computationally demanding and limits practical exploration of large experimental and predicted collections. This creates a need for fast, structure-based search methods that retain biological relevance while enabling large-scale exploration. In this paper, we present AlphaFind v2, an application for finding structurally similar proteins in the AlphaFold Database (https://alphafold.ebi.ac.uk/) of predicted structures. AlphaFind v2 uses fast pre-filtering via state-of-the-art protein embeddings that preserve structural information, followed by refinement with US-align. The application presents multiple complementary search modes, including (i) search over full protein chains, (ii) search aware of the AlphaFold pLDDT metric, restricting similarity computation to the most stable and structurally relevant regions, (iii) search over protein domains from the TED database (https://ted.cathdb.info/), and (iv) a multidomain search mode, combining multiple chain-level domain matches within a single score and alignment. The application accepts protein identifiers and returns similar proteins with metrics, rich metadata, and interactive superpositions. AlphaFind v2 additionally allows searching within an organism or CATH label and matches the proteins with experimental structures. AlphaFind v2 is accessible at https://alphafind.ics.muni.cz/.
bioinformatics2026-03-12v1Joint Geometric–Chemical Distance for Protein Surfaces
Swami, H.; Eckmann, J.-P.; McBride, J. M.; Tlusty, T.Abstract
Protein function is executed at the molecular surface, where shape and chemistry act together to govern interaction. Yet most comparison methods treat these aspects separately, privileging either global fold or local descriptors and missing their coupled organization. Here we introduce IFACE (Intrinsic Field–Aligned Coupled Embedding), a correspondence-based framework that aligns protein surfaces through probabilistic coupling of intrinsic geometry with spatially distributed chemical fields. From this alignment, we derive a joint geometric-chemical distance that integrates structural and physicochemical discrepancies within a single formulation. Across diverse proteins, this distance separates conformational variability from true structural divergence more effectively than fold-based similarity measures. Applied to the cytochrome P450 family, it reveals coherent family-level organization and identifies conserved buried catalytic pockets despite the complex topology. By linking interpretable surface correspondences with a unified distance, IFACE establishes a principled basis for comparing protein interfaces and detecting functionally related interaction patches across proteins.
bioinformatics2026-03-12v1MultiPopPred: A Trans-Ethnic Disease Risk Prediction Method, and its Application to the South Asian Population
Kamal, R.; Narayanan, M.Abstract
Genome-wide association studies (GWAS) have guided significant contributions towards identifying disease-associated Single Nucleotide Polymorphisms (SNPs) in Caucasian populations, albeit with limited focus on other understudied low-resource non-Caucasian populations. There have been active efforts over the years to understand and exploit the population specific versus shared aspects of the genotype-phenotype relation across different populations or ethnicities to bridge this gap. However, the efficacy of transfer learning models that are simpler than existing approaches and utilize individual-level data remains an open question. We propose MultiPopPred, a novel and simple trans-ethnic polygenic risk score (PRS) estimation method that taps into the shared genetic risk across populations and transfers information learned from multiple well-studied auxiliary populations to a less-studied target population. The default version of MultiPopPred (MPP-PRS+) harnesses individual-level data using a specially designed Nesterov-smoothed penalized shrinkage model and an L-BFGS optimization routine. Extensive comparative analyses performed on simulated genotype-phenotype data, assuming an infinitesimal model, reveal that MPP-PRS+ improves PRS prediction in the South Asian population by 38% on average across all simulation settings when compared to state-of-the-art trans-ethnic PRS estimation methods. This improvement is enhanced in settings with low target sample sizes and in semi-simulated settings. Furthermore, MPP-PRS+ produces better or comparable PRS predictions than state-of-the-art methods across 12 out of 16 evaluated quantitative and binary traits in UK Biobank, with the exception being 4 lipid-related traits. This performance trend is promising and encourages application of MultiPopPred for reliable PRS estimation in low-resource populations with individual-level data for complex omnigenic traits.
bioinformatics2026-03-11v3Generalise or Memorise? Benchmarking Ligand-Conditioned Protein Generation from Sequence-Only Data
Vicente, A.; Dornfeld, L.; Coines, J.; Ferruz, N.Abstract
Proteins can bind small molecules with high specificity. However, designing proteins that bind user-defined ligands remains a challenge, typically relying on structural information and costly experimental iteration. While protein language models (pLMs) have shown promise for unconditional generation and conditioning on coarse functional labels, instance-level conditioning on a specific ligand has not been evaluated using purely textual inputs. Here we frame small-molecule protein binder design as a sequence-to-sequence translation problem and train ligand-conditioned pLMs that map molecular strings to candidate binder sequences. We curate large-scale ligand-protein datasets (>17M ligand-protein pairs) covering different data regimes and train a suite of models, spanning 16 to 700M parameters. Results reveal a consistent trade-off driven by supervision ambiguity: when each ligand is paired with few proteins, models generate near-neighbour, foldable sequences; when each ligand is paired with many proteins, generations are more diverse but less consistently foldable. Our study exposes how annotation diversity and sampling choices elicit this behaviour and how it changes with the data distribution. These insights highlight dataset redundancy and incompleteness as key bottlenecks for sequence-only binder design. We release the curated datasets, trained models, and evaluation tools to support future work on ligand-conditioned protein generation.
bioinformatics2026-03-11v2Modularity, ecology, and theoretical evolution of the ribozyme body plan
Bachelet, I.Abstract
Ribozymes are relics of molecular life forms from the primitive earth that are embedded within modern genomes across all kingdoms of life. Despite significant knowledge from decades of bioinformatic and biochemical research, a gap remains in our understanding of the world in which ribozymes existed, their interactions, ecology, and possibly also evolution. The present study proposes a new theoretical basis for understanding these aspects of ribozyme biology by adopting a zoological frame of thought. Seven families of small self-cleaving ribozymes are each mapped to a primitive marine animal analog based on topological architecture, and classified into body plan grades paralleling cnidarian, ctenophore, and bilaterian organization. A formal notation describing ribozyme regions as bodies, cavities, and limbs enables systematic comparison with animal body plans and highlights reusability of parts across ribozyme groups, in turn enabling the construction of a connectivity network and a putative body plan-based evolutionary ordering. This ordering of body plans identifies systematic gaps corresponding to undiscovered ribozyme forms, one of which, a planktonic form of hammerhead, was bioinformatically found in 16.2% of all hammerhead sequences. Computational cross-cleavage analysis across all 49 pairwise interactions (including conspecific) suggests that the hammerhead was a generalist apex predator in the RNA world, while the hatchet was a vulnerable, filter-feeding or scavenger prey species. Conspecific analysis suggests that cannibalism was also a prevalent feeding strategy. Evolutionary avoidance signatures suggest ancient predator-prey coevolution. This theory emphasizes behavior, modularity, and ecological interactions as primary drivers of early ribozyme evolution, offering a new pathway for inferring ancient RNA forms independent of sequence-first assumptions.
bioinformatics2026-03-11v1HAETAE: A highly accurate and efficient epigenome transformer for tissue-specific histone modification prediction
Park, S.-J.; Im, S.-H.; Kim, S.-Y.; Kim, J.-Y.Abstract
While genomic models trained on four bases often fail to capture cell-type specificity, we introduce HAETAE, which integrates 5-methylcytosine from long-read sequencing into a 5-base framework. By explicitly modeling epigenetic context, HAETAE achieves state-of-the-art accuracy (>0.95) with orders of magnitude fewer parameters, challenging the prevailing scaling-law paradigm. Furthermore, HAETAE deciphers tissue-specific regulatory logic, as demonstrated by revealing the distinct, context-dependent functional impact of the TERT promoter mutation across diverse tissues.
bioinformatics2026-03-11v1Beyond Binding Affinity: The Kinetic-Compatibility Hypothesis for Nipah Virus Neutralization
Bozkurt, C.Abstract
Nipah virus (40-75% fatality) has no approved treatments. Its highly dynamic fusion (F) protein presents a severe challenge for static binder design. We analyzed 1,194 validated computational binders, focusing on 22 functionally tested candidates (8 neutralizers, 14 non-neutralizers) to identify features associated with live-virus neutralization. We initially hypothesized that maximizing binding affinity would be the primary driver of success. However, we observed an affinity-neutralization mismatch: higher static affinity did not stratify neutralizers from non-neutralizers, and ultra-tight static affinity did not correlate with functional success. We found that successful neutralizers were instead enriched for specific architectural patterns, including computational structural flexibility and terminal sequence motifs. These findings motivate a "Kinetic Compatibility Hypothesis," suggesting that neutralization may require a state-dependent, multi-feature profile rather than maximum affinity alone. Furthermore, we report exploratory developability associations - such as a 0.48-0.55 amyloid propensity "sweet spot" and secondary structure constraints - specific to the 15 kDa miniprotein scaffolds in this dataset. This 10-point framework integrates empirical sequence data with Orbion's Astra ML model suite predictions to propose an exploratory lead-triage heuristic, though it does not yet definitively prove mechanism.
bioinformatics2026-03-11v1Automated extraction and optimization of protein purification protocols using multi-agent large language models
Ye, J.; DeRocher, A.; Khim, M.; Subramanian, S.; Cron, L.; Myler, P. J.; Phan, I. Q.Abstract
Recent advances in Large Language Models (LLMs) present new opportunities for automating critical bottlenecks in scientific workflows such as literature reviews or protocol design. One such bottleneck is the purification of recombinant proteins, a vital aspect of biomedical research that frequently fails. To improve success rates, researchers must manually define optimal large-scale purification conditions and establish robust rescue protocols for proteins with low stability or solubility, a time-intensive process. To address this gap, we introduce a multi-agent LLM system that automates the creation and optimization of protein purification protocols to facilitate the production of high-concentration, high-purity protein samples. Our application streamlines the labor-intensive manual process of sequence similarity searches, literature reviews, and protocol comparison. Operating in a tool-like constrained workflow, the system identifies analogous proteins, leverages specialized LLM agents to extract successful purification methodologies from primary source literature, and cross-references them against failed protocols to generate optimization recommendations. Evaluation on a select number of targets demonstrated high accuracy in protocol extraction and the generation of scientifically sound, expert-validated optimization recommendations. While this system reduces complex analysis time from hours to minutes, we identify the lack of programmatic open access to literature, specifically primary citations in the Protein Data Bank, as a fundamental limitation to LLM agent-based scientific workflows. Ultimately, this system demonstrates the feasibility of using LLM agents to streamline wet-lab workflows while preserving methodological transparency and reproducibility.
bioinformatics2026-03-11v1