Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
MultiPopPred: A Trans-Ethnic Disease Risk Prediction Method, and its Application to the South Asian Population
Kamal, R.; Narayanan, M.
Abstract
Genome-wide association studies (GWAS) have driven significant progress in identifying disease-associated Single Nucleotide Polymorphisms (SNPs) in Caucasian populations, albeit with limited focus on understudied, low-resource non-Caucasian populations. There have been active efforts over the years to understand and exploit the population-specific versus shared aspects of the genotype-phenotype relation across different populations or ethnicities to bridge this gap. However, the efficacy of transfer learning models that are simpler than existing approaches and utilize individual-level data remains an open question. We propose MultiPopPred, a novel and simple trans-ethnic polygenic risk score (PRS) estimation method that taps into the shared genetic risk across populations and transfers information learned from multiple well-studied auxiliary populations to a less-studied target population. The default version of MultiPopPred (MPP-PRS+) harnesses individual-level data using a specially designed Nesterov-smoothed penalized shrinkage model and an L-BFGS optimization routine. Extensive comparative analyses performed on simulated genotype-phenotype data, assuming an infinitesimal model, reveal that MPP-PRS+ improves PRS prediction in the South Asian population by 38% on average across all simulation settings when compared to state-of-the-art trans-ethnic PRS estimation methods. This improvement is enhanced in settings with low target sample sizes and in semi-simulated settings. Furthermore, MPP-PRS+ produces PRS predictions better than or comparable to those of state-of-the-art methods across 12 out of 16 evaluated quantitative and binary traits in UK Biobank, with the exception being 4 lipid-related traits. This performance trend is promising and encourages application of MultiPopPred for reliable PRS estimation in low-resource populations with individual-level data for complex omnigenic traits.
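The core ingredients named in the abstract — a Nesterov-smoothed shrinkage penalty minimized with L-BFGS — can be sketched in a few lines. This is an illustrative toy, not MultiPopPred itself: the μ-smoothed L1 penalty, the single-population least-squares loss, and the names `nesterov_smoothed_l1` and `fit_prs` are assumptions for demonstration; the paper's model additionally pools multiple auxiliary populations.

```python
import numpy as np
from scipy.optimize import minimize

def nesterov_smoothed_l1(beta, mu=0.01):
    """Nesterov smoothing of the L1 norm: quadratic within mu of zero,
    linear beyond, so the penalty is differentiable everywhere."""
    a = np.abs(beta)
    return np.where(a <= mu, beta**2 / (2 * mu), a - mu / 2).sum()

def grad_smoothed_l1(beta, mu=0.01):
    # Derivative: beta/mu inside the quadratic zone, sign(beta) outside.
    return np.clip(beta / mu, -1.0, 1.0)

def fit_prs(X, y, lam=0.1, mu=0.01):
    """Least-squares loss plus smoothed-L1 shrinkage, solved with L-BFGS."""
    n = X.shape[0]
    def obj(b):
        r = X @ b - y
        return r @ r / (2 * n) + lam * nesterov_smoothed_l1(b, mu)
    def grad(b):
        return X.T @ (X @ b - y) / n + lam * grad_smoothed_l1(b, mu)
    res = minimize(obj, np.zeros(X.shape[1]), jac=grad, method="L-BFGS-B")
    return res.x
```

Because the smoothed penalty has a Lipschitz-continuous gradient, quasi-Newton methods such as L-BFGS apply directly, which is the practical reason for smoothing the non-differentiable L1 term.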
bioinformatics · 2026-03-11 · v3
Generalise or Memorise? Benchmarking Ligand-Conditioned Protein Generation from Sequence-Only Data
Vicente, A.; Dornfeld, L.; Coines, J.; Ferruz, N.
Abstract
Proteins can bind small molecules with high specificity. However, designing proteins that bind user-defined ligands remains a challenge, typically relying on structural information and costly experimental iteration. While protein language models (pLMs) have shown promise for unconditional generation and conditioning on coarse functional labels, instance-level conditioning on a specific ligand has not been evaluated using purely textual inputs. Here we frame small-molecule protein binder design as a sequence-to-sequence translation problem and train ligand-conditioned pLMs that map molecular strings to candidate binder sequences. We curate large-scale ligand-protein datasets (>17M ligand-protein pairs) covering different data regimes and train a suite of models, spanning 16 to 700M parameters. Results reveal a consistent trade-off driven by supervision ambiguity: when each ligand is paired with few proteins, models generate near-neighbour, foldable sequences; when each ligand is paired with many proteins, generations are more diverse but less consistently foldable. Our study exposes how annotation diversity and sampling choices elicit this behaviour and how it changes with the data distribution. These insights highlight dataset redundancy and incompleteness as key bottlenecks for sequence-only binder design. We release the curated datasets, trained models, and evaluation tools to support future work on ligand-conditioned protein generation.
bioinformatics · 2026-03-11 · v2
Automated extraction and optimization of protein purification protocols using multi-agent large language models
Ye, J.; DeRocher, A.; Khim, M.; Subramanian, S.; Cron, L.; Myler, P. J.; Phan, I. Q.
Abstract
Recent advances in Large Language Models (LLMs) present new opportunities for automating critical bottlenecks in scientific workflows such as literature reviews or protocol design. One such bottleneck is the purification of recombinant proteins, a vital aspect of biomedical research that frequently fails. To improve success rates, researchers must manually define optimal large-scale purification conditions and establish robust rescue protocols for proteins with low stability or solubility -- a time-intensive process. To address this gap, we introduce a multi-agent LLM system that automates the creation and optimization of protein purification protocols to facilitate the production of high-concentration, high-purity protein samples. Our application streamlines the labor-intensive manual process of sequence similarity searches, literature reviews, and protocol comparison. Operating in a tool-like constrained workflow, the system identifies analogous proteins, leverages specialized LLM agents to extract successful purification methodologies from primary source literature, and cross-references them against failed protocols to generate optimization recommendations. Evaluation on a select number of targets demonstrated high accuracy in protocol extraction and the generation of scientifically sound, expert-validated optimization recommendations. While this system reduces complex analysis time from hours to minutes, we identify the lack of programmatic open access to literature, specifically primary citations in the Protein Data Bank, as a fundamental limitation to LLM agent-based scientific workflows. Ultimately, this system demonstrates the feasibility of using LLM agents to streamline wet-lab workflows while preserving methodological transparency and reproducibility.
bioinformatics · 2026-03-11 · v1
Beyond Binding Affinity: The Kinetic-Compatibility Hypothesis for Nipah Virus Neutralization
Bozkurt, C.
Abstract
Nipah virus (40-75% fatality) has no approved treatments. Its highly dynamic fusion (F) protein presents a severe challenge for static binder design. We analyzed 1,194 validated computational binders, focusing on 22 functionally tested candidates (8 neutralizers, 14 non-neutralizers) to identify features associated with live-virus neutralization. We initially hypothesized that maximizing binding affinity would be the primary driver of success. However, we observed an affinity-neutralization mismatch: higher static affinity did not stratify neutralizers from non-neutralizers, and ultra-tight static affinity did not correlate with functional success. We found that successful neutralizers were instead enriched for specific architectural patterns, including computational structural flexibility and terminal sequence motifs. These findings motivate a "Kinetic Compatibility Hypothesis," suggesting that neutralization may require a state-dependent, multi-feature profile rather than maximum affinity alone. Furthermore, we report exploratory developability associations - such as a 0.48-0.55 amyloid propensity "sweet spot" and secondary structure constraints - specific to the 15 kDa miniprotein scaffolds in this dataset. This 10-point framework integrates empirical sequence data with Orbion's Astra ML model suite predictions to propose an exploratory lead-triage heuristic, though it does not yet definitively prove mechanism.
bioinformatics · 2026-03-11 · v1
HAETAE: A highly accurate and efficient epigenome transformer for tissue-specific histone modification prediction
Park, S.-J.; Im, S.-H.; Kim, S.-Y.; Kim, J.-Y.
Abstract
While genomic models trained on four bases often fail to capture cell-type specificity, we introduce HAETAE, which integrates 5-methylcytosine from long-read sequencing into a 5-base framework. By explicitly modeling epigenetic context, HAETAE achieves state-of-the-art accuracy (>0.95) with orders of magnitude fewer parameters, challenging the prevailing scaling-law paradigm. Furthermore, HAETAE deciphers tissue-specific regulatory logic, as demonstrated by revealing the distinct, context-dependent functional impact of the TERT promoter mutation across diverse tissues.
bioinformatics · 2026-03-11 · v1
Modularity, ecology, and theoretical evolution of the ribozyme body plan
Bachelet, I.
Abstract
Ribozymes are relics of molecular life forms from the primitive earth that are embedded within modern genomes across all kingdoms of life. Despite significant knowledge from decades of bioinformatic and biochemical research, a gap remains in our understanding of the world in which ribozymes existed, their interactions, ecology, and possibly also evolution. The present study proposes a new theoretical basis for understanding these aspects of ribozyme biology by adopting a zoological frame of thought. Seven families of small self-cleaving ribozymes are each mapped to a primitive marine animal analog based on topological architecture, and classified into body plan grades paralleling cnidarian, ctenophore, and bilaterian organization. A formal notation describing ribozyme regions as bodies, cavities, and limbs enables systematic comparison with animal body plans and highlights reusability of parts across ribozyme groups, in turn enabling the construction of a connectivity network and a putative body plan-based evolutionary ordering. This ordering of body plans identifies systematic gaps corresponding to undiscovered ribozyme forms, one of which, a planktonic form of hammerhead, was bioinformatically found in 16.2% of all hammerhead sequences. Computational cross-cleavage analysis across all 49 pairwise interactions (including conspecific) suggests that the hammerhead was a generalist apex predator in the RNA world, while the hatchet was a vulnerable, filter-feeding or scavenger prey species. Conspecific analysis suggests that cannibalism was also a prevalent feeding strategy. Evolutionary avoidance signatures suggest ancient predator-prey coevolution. This theory emphasizes behavior, modularity, and ecological interactions as primary drivers of early ribozyme evolution, offering a new pathway for inferring ancient RNA forms independent of sequence-first assumptions.
bioinformatics · 2026-03-11 · v1
Unsupervised identification of low-frequency antigen-specific TCRs using distance-based anomaly scoring
Kinoshita, K.; Kobayashi, T. J.
Abstract
Identifying antigen-specific T cell receptors (TCRs) within the diverse human repertoire remains challenging due to their extremely low frequencies, often as rare as one per million cells. Here, we propose a novel unsupervised approach that detects low-frequency antigen-specific TCRs through distance-based anomaly detection in TCR sequence space. Our method is based on the observation that antigen-specific TCRs preferentially localize at the periphery of V gene clusters rather than cluster centers. Using TCRdist3 to quantify sequence distances, we identify query TCRs that are anomalous compared to reference repertoires within their V-J gene combinations. We validated this approach across three immunological contexts: COVID-19 infection, influenza vaccination, and yellow fever vaccination. For SARS-CoV-2-specific TCR detection in a COVID-19 patient, our method demonstrated 34.3% accuracy, significantly outperforming similarity-based (ALICE: 8.0%) and frequency-based methods (edgeR: 5.8%, the Pogorelyy method: 6.3%), and uniquely detected low-frequency antigen-specific TCRs at clone count one. The minimal overlap with conventional approaches (≤6.7%) indicates our method captures distinct TCR clones overlooked by existing analyses. This spatial distribution-based paradigm provides a complementary strategy for TCR specificity detection, particularly valuable for identifying rare antigen-specific clones essential for understanding immune responses.
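The distance-based anomaly score described above can be sketched generically: score each query point by its mean distance to its k nearest reference points, so that sequences at a cluster's periphery score high. In the paper's setting the distances come from TCRdist3 within each V-J combination; the Euclidean distance and the function name `knn_anomaly_scores` below are stand-ins for illustration.

```python
import numpy as np

def knn_anomaly_scores(query, reference, k=5):
    """Anomaly score = mean distance to the k nearest reference points.
    query: (m, d) array; reference: (n, d) array. Peripheral points,
    far from the dense reference cluster, receive high scores."""
    # Pairwise distances between every query and every reference point.
    d = np.linalg.norm(query[:, None, :] - reference[None, :, :], axis=-1)
    d.sort(axis=1)                 # nearest reference distances first
    return d[:, :k].mean(axis=1)   # average over the k nearest
```

Thresholding these scores (e.g., at an upper percentile of scores computed on the reference itself) then flags candidate antigen-specific clones without any labeled training data.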
bioinformatics · 2026-03-11 · v1
MESSI: Multimodal Experiments with SyStematic Interrogation using nextflow
Liang, C.; Grewal, T.; Singh, A.; Singh, A.
Abstract
Background: Multimodal biomedical studies increasingly profile multiple molecular and clinical modalities from the same samples, creating new opportunities for disease prediction and biological discovery. However, benchmarking multimodal integration methods remains difficult because studies often use inconsistent preprocessing, unequal tuning strategies, and non-comparable evaluation schemes, limiting fair assessment across methods. Results: We developed MESSI (Multimodal Experiments with SyStematic Interrogation), a reproducible Nextflow-based benchmarking framework for multimodal outcome prediction that standardizes data preparation, supports interoperable R and Python workflows, and enforces leakage-free nested cross-validation for model selection and model assessment. MESSI currently implements representative intermediate- and late-integration methods and supports bulk multiomics, bulk multimodal, and single-cell multiomics datasets. In simulation studies with known ground truth, most methods were well calibrated in the absence of signal and achieved high performance under strong signal, whereas differences emerged under weaker signal and in feature recovery. We then applied MESSI to 19 real datasets spanning cancer, neurodevelopmental, neurodegenerative, infectious, renal, transplant, and metastatic disease settings, with diverse modality combinations including transcriptomic, epigenomic, proteomic, imaging, electrical, clinical, and single-cell-derived features. Across bulk multimodal datasets, classification differences were generally modest, although DIABLO and multiview cooperative learning tended to rank highest, while MOFA+glmnet and MOGONET were weaker overall. Biological enrichment analyses revealed clearer differences: DIABLO, RGCCA, MOFA, and IntegrAO more consistently recovered significant Reactome, oncogenic, and tissue-relevant gene signatures. 
In single-cell multiomics benchmarks, method rankings were more dataset dependent, but DIABLO performed consistently well across all case studies, while RGCCA also showed strong performance in specific settings. Computational analyses further showed that DIABLO and MOFA had the most favorable runtime and memory profiles, whereas multiview was the most time-intensive and IntegrAO the most memory-demanding. Conclusions: MESSI provides a reproducible, extensible, and equitable framework for benchmarking multimodal integration methods under a common model assessment strategy. Our results indicate that no single method is uniformly optimal across datasets and objectives; instead, method choice should balance predictive performance, biological interpretability, and computational efficiency. MESSI establishes a foundation for transparent benchmarking and future extensions to broader multimodal learning tasks.
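The leakage-free nested cross-validation that MESSI enforces can be illustrated with scikit-learn: hyperparameters are tuned only on inner folds, while each outer test fold stays unseen by the search. This is a minimal sketch of the principle, not MESSI's Nextflow implementation; the model, grid, and fold counts are arbitrary choices for demonstration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=150, n_features=20, random_state=0)

# Inner loop: model selection. Outer loop: model assessment.
# No outer test sample ever influences the hyperparameter choice.
inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      {"C": [0.1, 1.0, 10.0]}, cv=inner)
scores = cross_val_score(search, X, y, cv=outer)  # one score per outer fold
```

Reporting the mean and spread of `scores`, rather than the inner-loop best score, is what keeps the assessment unbiased.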
bioinformatics · 2026-03-11 · v1
CESAR: High-Sensitivity Detection of Copy Number Variations in ctDNA Using Segmentation and Anchor Recalibration
Ni, S.; Kan, K.; Wang, L.; Wu, N.; Jiang, X.
Abstract
Background: Detecting copy number variations (CNVs) in circulating tumor DNA (ctDNA) is crucial for companion diagnostics and resistance monitoring of various solid tumors (e.g., NSCLC, glioblastoma). However, when tumor-derived DNA fractions are extremely low (often <1%), traditional depth-based methods frequently fail due to non-linear sequencing depth fluctuations and probe-specific capture biases inherent to targeted Next-Generation Sequencing (NGS). Methods: We developed CESAR (CNV Estimation with Segmentation and Anchor Recalibration), a novel computational tool optimized for ultra-sensitive, tumor-only CNV detection in targeted NGS panels. CESAR utilizes Circular Binary Segmentation (CBS) to re-partition target regions based on relative capture efficiency. It then introduces a dynamic "anchor" selection algorithm that identifies a personalized set of genomic segments mirroring the non-linear coverage behavior of each target gene. By minimizing the Coefficient of Variation (CV) through iterative anchor selection, CESAR effectively recalibrates the baseline to suppress technical noise. Results: Validation using standard DNA reference materials demonstrated that CESAR successfully identified both amplifications (e.g., MET, ERBB2, EGFR) and relative copy number deletions at ultra-low tumor fractions. Notably, CESAR achieved stable detection of focal alterations as subtle as 2.18 copies (a mere 1.09x fold change relative to the diploid baseline), while maintaining zero false positives in control regions. Evaluation across distinct clinical biofluids (36 clinical plasma samples and 41 glioma cerebrospinal fluid (CSF) samples) identified critical, previously undetected CNV events, including subtle ERBB2 gains and distinct MET deletions.
Furthermore, comprehensive benchmarking revealed that CESAR consistently outperformed the widely used CNVkit, particularly in suppressing technical variance and resolving ultra-low-level copy number gains that CNVkit failed to distinguish from background noise. Conclusions: CESAR provides a highly stable and sensitive algorithmic framework for tumor-only CNV calling in liquid biopsies, facilitating precise therapeutic decision-making in precision oncology.
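The iterative, CV-minimizing anchor selection described above can be sketched as a greedy search: at each step, add the segment that most reduces the coefficient of variation of the target-to-anchor coverage ratio across control samples. This is a simplified illustration under assumed inputs (per-sample coverage arrays); CESAR's actual criterion and data model may differ, and `select_anchors` is a hypothetical name.

```python
import numpy as np

def select_anchors(target_cov, segment_cov, n_anchors=5):
    """Greedily pick anchor segments whose mean coverage best tracks the
    target gene, minimizing CV of the target/anchor ratio.
    target_cov: (n_samples,); segment_cov: (n_samples, n_segments)."""
    def cv(idx):
        baseline = segment_cov[:, idx].mean(axis=1)   # per-sample anchor baseline
        ratio = target_cov / baseline                 # recalibrated signal
        return ratio.std() / ratio.mean()
    chosen, remaining = [], list(range(segment_cov.shape[1]))
    while len(chosen) < n_anchors and remaining:
        best = min(remaining, key=lambda j: cv(chosen + [j]))
        chosen.append(best)
        remaining.remove(best)
    return chosen, cv(chosen)
```

Segments sharing the target's capture fluctuations cancel them out in the ratio, so the final CV reflects residual technical noise rather than sample-level depth variation.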
bioinformatics · 2026-03-11 · v1
Pairing Data Independent Acquisition and High-Resolution Full Scan for Fast Urinary Tract Infection Diagnosis
Coyle, E.; Lacombe-Rastoll, A.; Roux-Dalvai, F.; Leclercq, M.; Bories, P.; Berube, E.; Gotti, C.; Bekker-Jensen, D.; Bache, N.; Isabel, S.; Droit, A.
Abstract
Background: Rapid and accurate identification of urinary tract infection (UTI) pathogens is critical for effective treatment and combating antimicrobial resistance. Conventional culture-based diagnostics are slow, and standard tandem mass spectrometry workflows are resource-intensive. Methods: We present a proof-of-concept workflow that integrates high-resolution data-independent acquisition (DIA) MS/MS on the Thermo Scientific Orbitrap Astral with MS1-only spectra from the Orbitrap Exploris 480. DIA data establish a reference panel of pathogen-specific peptides, which are then identified in MS1 spectra from urine samples. Machine learning models trained on these matched MS1 features were used to classify eight common uropathogens and non-infected controls across synthetic inoculations, pure cultures, and clinical patient samples. Results: The approach accurately distinguished bacterial species in both controlled inoculated samples and clinical patient samples, achieving a Matthews Correlation Coefficient (MCC) of 0.924 on held-out test data and 0.77 on patient samples. Conclusions: This proof-of-concept demonstrates that pairing DIA-derived peptide panels with MS1-only data acquired on a cost-effective instrument suitable for routine analysis enables rapid, culture-free identification of UTI pathogens. The method provides a scalable, high-throughput platform suitable for clinical applications and establishes a foundation for broader biomarker discovery and potential quantitative workflows.
bioinformatics · 2026-03-11 · v1
Making Biorisk Measurable: A Bayesian Framework for Laboratory Risk Management
Prodanov, D.
Abstract
Biosafety risk assessment traditionally relies on categorical scales embodied by the four WHO Risk Groups and biocontainment levels. Mapping such categories to quantitative metrics is an open problem for the field: the classifications are too coarse for operational decision-making, yet strictly probabilistic language remains inaccessible to most safety professionals, laboratory managers, and decision-makers. To bridge these gaps, the present work develops a quantitative Bayesian framework for laboratory risk management that combines WHO Risk Group classification as a prior with a Markov chain model of the incident--disaster escalation chain. Risk is reported on a log-risk scale that transforms multiplicative probabilities into additive quantities, mirroring the decibel scale in acoustics. The framework accommodates longitudinal updating with local incident data and quantifies the separate contributions of training, preventive maintenance, and inspection to system-level safety. Resource allocation recommendations are derived that complement existing compliance frameworks with auditable, evidence-based prioritisation. The framework is illustrated on synthetic BSL-3 scenarios and shifts the perspective of biorisk governance from static compliance assessment to dynamic risk and resource management.
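The log-risk scale described above — multiplicative probabilities becoming additive, mirroring the decibel — can be shown in a few lines. The unit convention (-10·log10 p) and the example escalation probabilities below are illustrative assumptions, not values from the paper.

```python
import math

def log_risk(p):
    """Decibel-like log-risk: -10 * log10(p).
    Rarer events get larger log-risk; products of probabilities
    along an escalation chain become sums of log-risks."""
    return -10.0 * math.log10(p)

# Illustrative (made-up) incident -> exposure -> infection chain.
chain = [1e-2, 1e-1, 5e-2]
p_total = math.prod(chain)                       # multiplicative on the raw scale
additive = sum(log_risk(p) for p in chain)       # additive on the log-risk scale
```

On this scale, a mitigation that cuts one link's probability tenfold simply adds 10 units to the chain's total log-risk, which makes the separate contributions of training, maintenance, and inspection directly comparable.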
bioinformatics · 2026-03-11 · v1
Rational in silico discovery and serological validation of Trypanosoma cruzi-specific B-cell epitopes for high-precision Chagas disease diagnosis
Candia Puma, M. A.; Goyzueta Mamani, L. D.; Barazorda Ccahuana, H. L.; S B Camara, R.; A.G. Pereira, I.; L Silva, A.; M Rodrigues, M.; P N Assis, B.; Chaves, A. T.; A V A Correa, L.; O da Costa Rocha, M.; U Goncalves, D.; Maia Goncalves, A. A.; B de Moura, A.; Galdino, A.; Machado de Avila, R.; Cordeiro Giunchetti, R.; Ferraz Coelho, E. A.; Chavez Fumagalli, M. A.
Abstract
Chagas disease is caused by the parasite Trypanosoma cruzi and remains a neglected tropical disease presenting a substantial global health burden. Crude antigen-based assays have historically been limited in specificity; however, even contemporary recombinant-antigen tests may exhibit residual cross-reactivity, depending on antigen composition and geographic context. To overcome this limitation, this study developed a novel diagnostic strategy that integrates computational and experimental approaches to identify specific linear B-cell epitopes within the T. cruzi proteome. The strategy was developed to exclude sequences homologous to H. sapiens and Leishmania spp. proteins, thereby minimizing potential cross-reactivity. Using a consensus approach across five prediction algorithms, B-cell epitopes were identified and subsequently clustered to reveal conserved, immunoreactive consensus sequences. The peptide sequences were characterized for optimal physicochemical properties and subsequently modeled to interact with a human antibody using protein-peptide docking and molecular dynamics simulations to assess complex stability. The most promising candidates were chemically synthesized and validated using ELISA against a cohort comprising Chagas disease patients (chronic indeterminate and cardiac forms), healthy donors, and a cross-reactive control group (visceral and tegumentary leishmaniasis and leprosy). From the initial set of 19,245 proteins, the multi-tiered bioinformatic analysis identified 4,431 unique, non-homologous sequences. Consensus prediction yielded 401 high-confidence epitopes, which were refined to 179 structurally stable candidates. Computational analyses identified five top-ranking epitopes capable of forming high-affinity, stable complexes with a human antibody. Experimental validation confirmed the high diagnostic accuracy of two epitopes, which demonstrated exceptional diagnostic performance: Epitope 4 and Epitope 5 achieved 100% sensitivity. 
Notably, Epitope 5 exhibited superior specificity, reaching 96.67% against healthy controls and 90.91% against the cross-reactive group. This study establishes a basis for the development of an improved immunoassay for Chagas disease and provides a reproducible framework for targeted epitope discovery. Consequently, this study validates a high-precision computational pipeline capable of discovering T. cruzi-specific antigens that effectively circumvent cross-reactivity with Leishmania spp., proposing Epitope 5 as a qualified candidate for reliable serological diagnosis in co-endemic regions.
bioinformatics · 2026-03-11 · v1
BICEP: an extension to indels and copy number variants for rare variant prioritisation in pedigree analysis
Ormond, C.; Ryan, N. M.; Corvin, A.; Heron, E. A.
Abstract
Summary: BICEP is a Bayesian inference model that evaluates how likely a rare variant is to be causal for a genomic trait in pedigree-based analyses. The original prior model in BICEP was designed for single nucleotide variants only. Here, we have developed an extension of the prior models for more comprehensive genomic analysis to include indels and copy number variants. We benchmark the performance of these new priors and show comparable performance accuracy with the existing single nucleotide variant prior model. For copy number variants we evaluate four different input predictors to the models and recommend the best performing ones as the default. Availability and implementation: the updated prior models have been implemented in the current version of BICEP available from: https://github.com/cathaloruaidh/BICEP.
bioinformatics · 2026-03-11 · v1
Cell DiffErential Expression by Pooling (CellDEEP) highlights issues in differential gene expression in scRNA-seq
Cheng, Y.; Kettlewell, T.; Laidlaw, R. F.; Hardy, O. M.; McCluskey, A.; Otto, T. D.; Somma, D.
Abstract
Accurate identification of differentially expressed genes (DEGs) in single-cell RNA sequencing (scRNA-seq) data remains challenging. Single-cell-specific statistical models often report large numbers of candidate genes but can exhibit inflated false positive rates, whereas pseudobulk approaches improve false discovery control at the cost of reduced sensitivity. To reduce the noise and bias of existing tools and give users more control over the DEG analysis process, we present CellDEEP, which uses a cell aggregation (metacell) approach. This tool provides a framework for flexible selection of pooling strategies and parameterisation for differential expression (DE) analysis. Benchmarking on simulated and real datasets, including COVID-19 and rheumatoid arthritis, shows that CellDEEP often outperforms other methods, consistently reduces false positives compared to single-cell methods and recovers more true positives than pseudobulk methods. Our work shifts the focus from selecting a single "best" method to an approach that reduces cell-level noise while preserving biological signal, together with a transparent validation framework, advancing more reliable differential expression analysis in single-cell transcriptomics.
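The metacell pooling at the heart of this approach reduces, in its simplest form, to summing the raw counts of cells that share a metacell label before running DE. The sketch below shows only that aggregation step under assumed inputs (a dense count matrix and precomputed labels); CellDEEP's pooling strategies are more flexible, and `pool_metacells` is a hypothetical name.

```python
import numpy as np

def pool_metacells(counts, groups):
    """Aggregate single-cell counts into metacells by summing the raw
    counts of all cells assigned to the same metacell label.
    counts: (n_cells, n_genes) integer array; groups: (n_cells,) labels."""
    labels = np.unique(groups)
    pooled = np.vstack([counts[groups == g].sum(axis=0) for g in labels])
    return pooled, labels
```

The pooled matrix can then be handed to a bulk-style DE test, trading cell-level noise for fewer, higher-depth observations.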
bioinformatics · 2026-03-11 · v1
FishMamba-1: A Linear-Complexity Foundation Model for Deciphering Polyploid Cyprinid Genomes
Lu, S.; Fang, C.; Wang, C.; Qian, Y.; Fang, W.; Li, T.; Zeng, H.; He, S.
Abstract
The Cypriniformes order, comprising essential aquaculture species like carps and minnows, presents unique genomic challenges due to complex whole-genome duplication (WGD) events and extensive repetitive elements. Conventional annotation tools and Transformer-based foundation models often struggle to capture long-range dependencies in these expanded genomes due to quadratic computational complexity. Here, we introduce FishMamba, the first genomic foundation model tailored for the aquatic clade, built upon the selective state-space model (SSM) architecture. By leveraging Mamba-2's linear scaling efficiency, FishMamba processes context windows of 32,768 base pairs (32k) - significantly surpassing the 4-6k limit of standard DNA Transformers - enabling the modeling of distal regulatory patterns on a single GPU. We curated Cypri-24, a comprehensive dataset comprising 28.8 Gb of high-quality genome assemblies from 24 representative species, to pre-train FishMamba on 15 billion tokens. Subsequent fine-tuning for genome segmentation (FishSegmenter) demonstrates the model's capability to annotate gene structures at single-nucleotide resolution with remarkable precision. Evaluation on a held-out test set reveals that FishMamba achieves a precision of 64.6% in exon identification, effectively distinguishing coding regions from the vast non-coding background without relying on RNA-seq evidence. Furthermore, interpretability analysis confirms that the model captures biological syntax such as splice acceptor motifs. FishMamba provides a scalable, open-source framework for decoding the complex genomes of non-model organisms and a computational resource to support downstream applications in molecular breeding and ecological monitoring. The complete source code, pre-trained model weights, and datasets are freely available at https://github.com/lu1000001/FishMamba.
Additionally, the FishMamba Hub, a web-based inference platform, is accessible at https://huggingface.co/spaces/lu1000001/FishMamba-Hub to facilitate real-time genomic segmentation for the aquatic research community.
bioinformatics · 2026-03-11 · v1
The Genomic Legacy of Ancient Polyploidy in Crop Domestication
McKibben, M. T. W.; Barker, M. S.
Abstract
Species that have an ancestry of whole-genome duplications (WGDs) are more likely to be domesticated, but the underlying mechanisms remain unclear. We tested whether paleologs--genes duplicated during ancient WGDs--are enriched in candidate domestication lists across 22 crop species. Paleologs were significantly enriched in 14 species, with single-copy paleologs showing the most consistent overrepresentation. This finding provides the first empirical test of an assumption in plant genome evolution: models based on retention patterns inferred that genes rapidly returning to single-copy status are under strong purifying selection, potentially limiting their adaptive potential. We find instead that constraint on copy number does not appear to preclude selection on gene function. Several non-mutually exclusive processes could explain this pattern, including accumulated genetic diversity becoming available upon return to single-copy, selection to maintain essential functions, and greater selection efficiency on unmasked loci. Ancient WGDs thus provide a persistent genomic substrate for crop evolution millions of years later.
bioinformatics · 2026-03-11 · v1
MSstatsResponse: Semi-parametric statistical model enhances detection of drug-protein interactions in chemoproteomics experiments
Szvetecz, S.; Kohler, D.; Federspiel, J.; Field, D. S.; Jean-Beltran, P.; Seward, R. J.; Suh, H.; Xue, L.; Vitek, O.
Abstract
Chemoproteomics is a popular approach for the identification of small molecule-protein interactions in biological systems. Several chemoproteomics workflows leverage functionalized chemical probes and mass spectrometry to measure protein engagement through direct protein enrichment or competition using a range of small molecule concentrations. Statistical methods for analysis of such dose-response chemoproteomics datasets are limited. For example, existing methods rely on fixed curve shapes and are sensitive to experimental variation, particularly when the number of doses or replicates is limited. Here, we present MSstatsResponse, a semi-parametric statistical framework for analyzing chemoproteomic dose-response experiments that uses isotonic regression that does not require a fixed curve shape. This approach improves the accuracy and robustness of curve fitting, target identification, and half-response estimation across diverse experimental designs. We evaluate MSstatsResponse by generating a benchmark chemoproteomic dataset that profiled the competition between the kinase-binding probe XO44 and the drug Dasatinib using three mass spectrometry acquisition strategies: data-independent acquisition, tandem mass tag-based data-dependent acquisition, and selected reaction monitoring. We further evaluate the method on simulated datasets that vary the number of doses, number of replicates, and levels of noise, and demonstrate that MSstatsResponse consistently improves sensitivity, specificity, and reproducibility compared to existing methods, particularly in low-replicate and low-dose settings. MSstatsResponse is implemented as an open-source R/Bioconductor package that integrates with the MSstats ecosystem for quantitative proteomics. It provides a unified workflow for preprocessing, curve fitting, target identification, and experimental design, enabling researchers to select the number of doses and replicates appropriate to their experimental goals. 
The software and documentation are freely available at https://bioconductor.org/packages/MSstatsResponse.
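The shape-free curve fitting described above rests on isotonic regression: the response is constrained only to be monotone in dose, not to follow a sigmoid. A minimal sketch with scikit-learn, using made-up dose-response values (MSstatsResponse itself is an R/Bioconductor package; this only illustrates the statistical idea, including a simple half-response estimate by interpolation):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Toy competition experiment: higher drug dose -> lower probe engagement.
log_dose = np.array([-9.0, -8.0, -7.0, -6.0, -5.0])
response = np.array([1.02, 0.95, 0.55, 0.20, 0.05])

iso = IsotonicRegression(increasing=False)    # monotone-decreasing, no fixed shape
fitted = iso.fit_transform(log_dose, response)

# Half-response dose: where the fitted curve crosses its midpoint.
mid = (fitted.max() + fitted.min()) / 2.0
half_dose = np.interp(-mid, -fitted, log_dose)  # negate so xp is increasing
```

Because isotonic regression imposes no parametric form, it stays robust when the true curve is asymmetric or partially observed, which is where fixed-shape (e.g., four-parameter logistic) fits tend to fail with few doses or replicates.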
bioinformatics · 2026-03-11 · v1
Landscape of 8q24.3-Encoded microRNAs and Their Prognostic Impact in Ovarian Cancer
Filipek, K.; Merelli, I.; Chiappori, F.; Penzo, M.
Abstract
Ovarian cancer is the most lethal gynecological malignancy, largely because of late diagnosis and marked genomic instability, with high-grade serous ovarian cancer (HGSOC) representing its most common and aggressive subtype. Amplification of chromosome 8q24.3 is a recurrent event in HGSOC, yet the regulation and clinical relevance of the non-coding RNA output from this locus remain poorly defined. Here, we performed an integrative analysis of 8q24.3-encoded miRNAs in ovarian cancer using copy-number, transcriptomic, isoform-resolved, and clinical data from TCGA and NCBI datasets. We identified pronounced heterogeneity in miRNA abundance and strand usage across this locus. Copy-number gain broadly associated with increased miRNA expression, although this effect was not uniform across all candidates. Intronic miRNAs showed variable coupling with their host genes, indicating that mature miRNA output is shaped by both genomic dosage and post-transcriptional regulation. Isoform-level analysis revealed marked strand asymmetry and regulatory complexity, but did not strengthen copy-number or histotype associations compared with total miRNA measurements. Clinically, higher expression of miR-937, miR-4664, and miR-6849 was associated with improved overall survival in HGSOC. Functional enrichment of validated targets highlighted pathways related to cellular stress responses, senescence, p53 signaling, endocytosis, and metabolic adaptation. Together, these findings define 8q24.3 as a heterogeneous non-coding regulatory hub in ovarian cancer and provide a basis for future mechanistic and biomarker studies.
bioinformatics · 2026-03-11 · v1
SwiftTCR: Efficient Computational Docking protocol of TCRpMHC-I Complexes Using Restricted Rotation Matrices
Parizi, F. M.; Aarts, Y. J. M.; Smit, N.; Roran A R, D.; Diepenbroek, D.; Krösschell, W. A.; Thijs, L.; Tepperik, J.; Eerden, S.; Marzella, D. F.; Ramakrishnan, G.; Xue, L. C.
Abstract
The T cell's ability to discern self and non-self depends on its T cell receptor (TCR), which recognizes peptides presented by MHC molecules. Understanding this TCR-peptide-MHC (TCRpMHC) interaction is important for cancer immunotherapy design, tissue transplantation, pathogen identification, and autoimmune disease treatments. Understanding the intricacies of TCR recognition, encapsulated in TCRpMHC structures, remains challenging due to the immense diversity of TCRs (>10^8 per individual), rendering experimental determination and general-purpose computational docking impractical. Addressing this gap, we have developed a rapid integrative modeling protocol leveraging unique docking patterns in TCRpMHC complexes. Built upon PIPER, our pipeline significantly cuts down FFT rotation sets, exploiting the consistent polarized docking angle of TCRs at pMHC. Additionally, our ultra-fast structure superimposition tool, GradPose, accelerates clustering. It models a case in 3-4 minutes on 12 CPUs, showcasing a speedup of up to 25-40 times compared to the ClusPro webserver. On a benchmark set of 38 TCRpMHC class I (TCRpMHC-I) complexes, our protocol outperforms the state-of-the-art docking tools in model quality. This protocol can potentially provide structural information to TCR repertoires targeting specific peptides. Its computational efficiency can also enrich existing pMHC-specific single-cell sequencing TCR data, facilitating the development of structure-based deep learning (DL) algorithms. These insights are essential for understanding T cell recognition and specificity, advancing the development of therapeutic interventions.
bioinformatics · 2026-03-10 · v3
FASTiso: Fast Algorithm on Search state Tree for subgraph ISOmorphism in graphs of any size and density
Agbeto, W.; Coti, C.; Reinharz, V.
Abstract
Subgraph isomorphism is a fundamental combinatorial problem that involves finding one or more occurrences of a pattern graph within a target graph. It arises in a wide range of application domains, including biology, chemistry, social network analysis, and pattern recognition. Although subgraph isomorphism is NP-complete in the general case, many exact algorithms allow it to be solved in practice on many instances. However, the increasing size and structural diversity of graph datasets continue to pose significant challenges in terms of robustness and scalability. In this article, we propose FASTiso, an exact subgraph isomorphism algorithm that emphasizes a strong consistency between the variable ordering strategy and the pruning rules used during search. This design enables a unified exploitation of structural information throughout the exploration process, leading to improved efficiency and stable performance across heterogeneous graph structures. An extensive experimental evaluation on widely used synthetic and real-world benchmarks shows that FASTiso consistently outperforms reference solvers such as VF3, VF3L, and RI, and achieves competitive performance compared to constraint programming-based approaches (Glasgow, PathLad+), while outperforming them on most datasets. The results further demonstrate that FASTiso remains highly efficient on small instances and scales well to large graphs, while maintaining a lower memory footprint than most evaluated solvers. The peak memory usage is 7.74 GB for FASTiso, 36.19 GB for PathLad+, over 500 GB for Glasgow, 9.62 GB for VF3/VF3L, and 4.31 GB for RI. FASTiso code is available at https://gitlab.info.uqam.ca/cbe/fastiso as a C++ implementation, a Python module, and an integration within an extended version of NetworkX. The implementations support simple graphs and multigraphs, directed or undirected, with labels on nodes, edges, or both.
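FASTiso itself is a C++ solver with far richer ordering and pruning strategies; to make the underlying backtracking search concrete, here is a deliberately naive Python sketch. The adjacency-dict representation, the static degree-based ordering, and the degree pruning rule are illustrative assumptions, not FASTiso's actual data structures.

```python
def subgraph_matches(p_adj, t_adj):
    """Backtracking search for subgraph monomorphisms: every pattern
    edge must map onto a target edge under an injective node mapping.
    p_adj and t_adj map each node to a set of its neighbors."""
    # Static variable ordering: highest-degree pattern nodes first.
    order = sorted(p_adj, key=lambda v: -len(p_adj[v]))
    results = []

    def extend(mapping, used):
        if len(mapping) == len(order):
            results.append(dict(mapping))
            return
        u = order[len(mapping)]
        for cand in t_adj:
            # Prune: candidate already used, or degree too small.
            if cand in used or len(t_adj[cand]) < len(p_adj[u]):
                continue
            # Consistency: edges to already-mapped neighbors must exist.
            if all(mapping[w] in t_adj[cand] for w in p_adj[u] if w in mapping):
                mapping[u] = cand
                used.add(cand)
                extend(mapping, used)
                del mapping[u]
                used.discard(cand)

    extend({}, set())
    return results
```

The interplay FASTiso emphasizes, keeping the variable ordering consistent with the pruning rules, is exactly what this naive version lacks: here the ordering is fixed up front and each pruning check is local.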
bioinformatics · 2026-03-10 · v3
Non-consensus flanking sequence of hundreds of base pairs around in vivo binding sites: statistical beacons for transcription factor scanning
Faltejskova, K.; Sulc, J.; Vondrasek, J.
Abstract
It has long been suspected that the flanks of binding motifs can play an important role in specific DNA binding by a transcription factor. Through a thorough analysis of the DNA sequence in a broad context (±5000 bp) around in vivo binding sites (as identified in ChIP-seq or CUT&Tag experiments), we show that the average GC content is, in most cases, statistically significantly increased around the binding site in a patch spanning 1000-1500 bp. This increase was observed consistently across experiments targeting the same TF in different cell lines. The surroundings of the binding sites of certain TFs, such as MYC, display a directional alteration of dinucleotide frequencies. We attempt to explain these preferences by alterations in DNA shape features as well as by potential cooperation with other TFs. We observed differences in sequence affinity to various potential cooperating TFs between cell lines. Altogether, we propose that the observed feature distortion is indicative of a coarse scanning mechanism that helps TFs find the target binding site.
bioinformatics · 2026-03-10 · v3
GREmLN: A Cellular Graph Structure Aware Transcriptomics Foundation Model
Zhang, M.; Swamy, V.; Cassius, R.; Dupire, L.; Kanatsoulis, C.; Paull, E.; AlQuraishi, M.; Karaletsos, T.; Califano, A.
Abstract
The ever-increasing availability of large-scale single-cell profiles presents an opportunity to develop foundation models that capture cell properties and behavior. However, standard language models such as transformers benefit from sequentially structured data with well-defined absolute or relative positional relationships, whereas single-cell RNA data have orderless gene features. Molecular-interaction graphs, such as gene regulatory networks (GRNs) or protein-protein interaction (PPI) networks, offer a graph structure that effectively encodes both non-local gene token dependencies and potential causal relationships. We introduce GREmLN (Gene Regulatory Embedding-based Large Neural model), a foundation model that leverages graph signal processing to embed gene token graph structure directly within its attention mechanism, producing biologically informed, single-cell-specific gene embeddings. Our model faithfully captures transcriptomic landscapes and achieves superior performance relative to state-of-the-art baselines on cell type annotation, graph structure understanding, and fine-tuned reverse perturbation prediction tasks. It offers a unified and interpretable framework for learning high-capacity foundational representations that capture complex, long-range regulatory dependencies from high-dimensional single-cell transcriptomic data. Moreover, the incorporation of graph-structured inductive biases enables more parameter-efficient architectures and accelerates training convergence.
bioinformatics · 2026-03-10 · v3
Sassy: Fuzzy Searching DNA Sequences using SIMD
Beeloo, R.; Groot Koerkamp, R.
Abstract
Motivation. Approximate string matching (ASM) is the problem of finding all occurrences of a pattern in a text while allowing up to k errors. Many modern methods use seed-chain-extend, which is fast in practice but does not guarantee finding all matches with ≤ k errors. However, applications such as CRISPR off-target detection require exhaustive results. Methods. We introduce Sassy, a library and tool for ASM of short patterns in long texts. Sassy splits the text into 4 parts that are searched in parallel, and uses bitvectors in the text direction rather than the pattern direction. This has complexity O(k⌈n/W⌉) when searching a random text of length n, where W = 256 is the SIMD width, and provides significant speedups for small k. Separately, we allow matches of the pattern to extend beyond the text for an overhang cost of e.g. 0.5 per character, to find matches near contig or read ends. Results. Sassy is 4x to 15x faster than Edlib for patterns ≤ 1000 bp, and can search text with a throughput near 2 Gbp/s. Likewise, Sassy is over 100x faster than parasail. We apply Sassy to CRISPR off-target detection by searching 61 guide sequences in a human genome. Sassy is 100x faster than SWOffinder and only slightly slower (for k ≤ 3) than CHOPOFF, for which building its index takes 20 minutes. Sassy also scales well to larger k, unlike CHOPOFF, whose index took over 10 hours to build for k = 5. Availability. Sassy is available as a library and binary at https://github.com/RagnarGrootKoerkamp/sassy, and archived at swh:1:dir:e884758dce5777a441bc2799dc8824e563c5f97b.
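Sassy's SIMD text-direction bitvectors are beyond a short sketch, but the scalar pattern-direction baseline it departs from, Myers' bit-parallel algorithm, fits in a few lines. This Python sketch (semi-global variant; the function name is hypothetical) reports every text position where the pattern ends with at most k errors:

```python
def myers_search(pattern, text, k):
    """Myers' bit-parallel approximate search (semi-global variant).
    The last DP column is tracked via +1/-1 vertical-delta bitvectors
    (pv/mv); reports (end_index, distance) wherever distance <= k."""
    m = len(pattern)
    all_ones = (1 << m) - 1
    top = 1 << (m - 1)
    # peq[c]: bitmask of pattern positions holding character c.
    peq = {}
    for i, c in enumerate(pattern):
        peq[c] = peq.get(c, 0) | (1 << i)
    pv, mv, score = all_ones, 0, m
    hits = []
    for j, c in enumerate(text):
        eq = peq.get(c, 0)
        xv = eq | mv
        xh = (((eq & pv) + pv) ^ pv) | eq
        ph = mv | (~(xh | pv) & all_ones)
        mh = pv & xh
        if ph & top:
            score += 1
        elif mh & top:
            score -= 1
        ph = (ph << 1) & all_ones  # shift in 0: text starts are free
        mh = (mh << 1) & all_ones
        pv = mh | (~(xv | ph) & all_ones)
        mv = ph & xv
        if score <= k:
            hits.append((j, score))
    return hits
```

This baseline advances one text character per iteration regardless of k; per the abstract, Sassy instead orients its bitvectors along the text so that W = 256 positions advance per step, which is where the O(k⌈n/W⌉) behavior for small k comes from.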
bioinformatics · 2026-03-10 · v2
Computed atlas of the human GPCR-G protein signaling complexes
Miglionico, P.; Matic, M.; Franchini, L.; Arai, H.; Nemati Fard, L. A.; Arora, C.; Gherghinescu, M.; DeOliveira Rosa, N.; Ryoji, K.; Gutkind, J. S.; Orlandi, C.; Inoue, A.; Raimondi, F.
Abstract
Experimental mapping of G protein-coupled receptor (GPCR)-G protein signaling coupling has illuminated hundreds of receptors, yet the coupling specificity of a large fraction of this large receptor family remains unknown, thereby preventing the development of new GPCR-targeting therapies. Here, we used AlphaFold3 (AF3) to predict the 3D structures of the human GPCRome in complex with heterotrimeric G proteins. We used experimental GPCR-G protein binding data to show that AF3 predictions significantly discriminate between positive and negative binders, and used 3D structural features to train a machine learning (ML) algorithm to predict coupling potency. Interpretation of the ML model helped discriminate universal features governing the strength of G protein coupling from those determining binding specificity. We computationally illuminated the coupling preferences of 180 non-olfactory GPCRs (non-OR) with previously unreported transduction mechanisms and experimentally validated the predicted couplings for multiple previously uncharacterized GPCRs, including QRFPR, GPR50, GPR37, GPR37L1 and GPRC5A. Our predictions established that Gi/o is the most prevalent coupling among non-OR GPCRs, and that it often co-occurs with Gq/11 and, to a lesser extent, G12/13 signaling. Gs coupling is less common and restricted to specific clusters within the non-OR GPCRome phylogeny, likely due to stricter structural requirements for its binding. We also computed G protein complexes for over 400 ORs, establishing Gs as the most prevalent coupling. ORs are predicted to bind to Gs with a simpler interface compared to non-ORs, ultimately leading to energetically less stable complexes. Additionally, we predict recurrent bindings to Gq/11 and Gi/o proteins for ORs, suggesting potentially novel OR signaling mechanisms. We exploited the GPCRome coupling atlas to interpret healthy and cancer expression data, revealing the coupling of most GPCR-G protein co-expressed pairs.
This analysis highlights a richer coupling repertoire in healthy tissues compared to cancer, likely reflecting the high signaling requirements of specialized normal cell functions, which are lost in most cancer cells due to their de-differentiated state or under cancer selection processes. In summary, this study provides the first computational 3D atlas of the human GPCR-G protein transductome, thereby illuminating the signaling mechanisms of neglected GPCR classes and providing the basis for interpreting omics datasets from a myriad of pathological conditions, thus enabling the development of novel precision therapeutics.
bioinformatics · 2026-03-10 · v1
STAR Suite: Integrating transcriptomics through AI software engineering in the NIH MorPhiC consortium
Hung, L.-H.; Yeung, K. Y.
Abstract
To accommodate rapid methodological turnover, bioinformatics pipelines typically consist of discrete binaries linked via scripts. While flexible, this architecture relies on intermediate files, sacrificing performance and treating complex codebases as static silos. For example, the STAR aligner, the standard engine for transcriptomics, uses an external script for adapter trimming, necessitating the decompression and re-compression of large files. These limitations presented scalability problems for uniform processing of data in the NIH MorPhiC consortium. We present our solution, STAR Suite, a human-engineered and AI-implemented modernization that integrates functionality directly into the C++ source. In just four months, a single developer added over 92,000 lines to the original 28,000-line codebase to produce four unified modules (STAR-core, STAR-Flex, STAR-Perturb, and STAR-SLAM) that can be installed as a pre-compiled binary without introducing any new dependencies. This work demonstrates a new paradigm for the rapid evolution of high-performance bioinformatics software.
bioinformatics · 2026-03-10 · v1
AQuA2-Cloud: a web platform for fluorescence bioimaging activity analysis
Bright, M.; Mi, X.; Duarte, D.; Carey, E.; Lyu, B.; Wang, Y.; Nimmerjahn, A.; Yu, G.
Abstract
Advanced biological imaging analysis platforms such as Activity Quantification and Analysis (AQuA2) enable accurate spatiotemporal activity analysis across diverse cell populations within many species. These tools are increasingly important for investigating cellular signaling dynamics and behavior. However, despite advances in the accuracy and species capability of AQuA2, it remains computationally demanding for analysis of long time-series datasets and requires all users to maintain a MATLAB license, which may limit accessibility and large-scale deployment. To address these limitations, we have designed and made available AQuA2-Cloud, a portable software stack and web platform developed as an improvement and further evolution of AQuA2. This container-deployable system permits multi-user cloud-based high accuracy activity quantification with intuitive workflows, export of analysis data and project files, and comparable processing times. The platform offers integrated features such as in-browser analysis control interfaces, asynchronous program state control, multiple users and user management, support for unreliable connections, file uploading and downloading via web browsers and File Transfer Protocol, and centralized organization of analysis output. AQuA2-Cloud constitutes a cloud-native solution for laboratories or research groups seeking to centralize analysis of spatiotemporal biological imaging datasets while reducing software installation and licensing barriers for end users. The platform enables researchers with minimal technical expertise to perform advanced bioimaging analysis through standard web browsers while maintaining the analytical capabilities of AQuA2. AQuA2-Cloud source code, deployment procedures, and documentation are freely available at (https://github.com/yu-lab-vt/AQuA2-Cloud).
bioinformatics · 2026-03-10 · v1
Automatic Generation of Model Sequences for Complex Regions in Assembly Graphs
Antipov, D.; Chen, Y.; Sollitto, M.; Phillippy, A. M.; Formenti, G.; Koren, S.
Abstract
Recent developments in genome sequencing and assembly technologies have enabled the automated assembly of vertebrate chromosomes from telomere to telomere. However, for some long, highly similar repeats, genome assemblers may lack sufficient information to unambiguously resolve the sequence, leaving tangles in the assembly graph and gaps in the final assembly. In recently published genomes, such gaps are often closed by manual graph curation, a process that is labor-intensive, error-prone, and sometimes infeasible. This can leave important genomic repeats, such as recently duplicated genes, misassembled or excluded from the final assembly. Here we present the Trivial Tangle Traverser (TTT) algorithm that finds optimized resolutions of assembly graph tangles. TTT uses depth of coverage and read-to-graph alignment information in a two-stage process to identify evidence-based traversals that are consistent with the underlying data. First, sequence multiplicities are estimated through mixed-integer linear programming, after which an Eulerian path is found in the derived multigraph and optimized through a gradient-descent-like approach. We evaluate TTT traversals on the HG002 human reference genome and demonstrate its use to characterize a previously unassembled amplified gene array in the zebra finch genome. Availability: TTT is available at https://github.com/marbl/TTT
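TTT's second stage, finding an Eulerian path in the multigraph implied by the MILP multiplicity estimates, can be illustrated with Hierholzer's classic algorithm. This Python sketch is a generic illustration, not TTT's implementation; the edge-list input (one entry per unit of estimated multiplicity) is an assumption for the example.

```python
from collections import defaultdict

def eulerian_path(edges):
    """Hierholzer's algorithm: traverse a directed multigraph using
    every edge exactly once. In tangle resolution, nodes would be
    unambiguous assembly-graph segments and repeated edges would
    encode the coverage-estimated multiplicities."""
    adj = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree
    for u, v in edges:
        adj[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # Start at a node with out-degree surplus if one exists (open
    # path), otherwise anywhere on a circuit.
    start = next((n for n in adj if balance[n] > 0), next(iter(adj)))
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if adj[u]:
            stack.append(adj[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())    # dead end: emit node
    return path[::-1]
```

The sketch assumes an Eulerian path exists, which is exactly what the MILP stage is meant to guarantee by choosing consistent multiplicities; TTT then further optimizes the traversal against read alignments.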
bioinformatics · 2026-03-10 · v1
Measuring Amorphous Motion: Application of Optical Flow to Three-Dimensional Fluorescence Microscopy Images
Lee, R. M.; Eisenman, L. R.; Hobson, C.; Aaron, J. S.; Chew, T.-L.
Abstract
Motion is an essential component of any living system. It is rich with information, but it is often challenging to quantitatively extract biologically informative results from the motion apparent in microscopy images. This challenge is exacerbated by the wide variety in biological movement, which often takes the form of difficult-to-segment amorphous structures undergoing complex motion. An image processing technique known as optical flow can capture motion at each pixel in an image, thus bypassing the need for object segmentation or a priori definition of motion types. This makes it a powerful tool for quantitative assessment of biological systems from the protein to organism scale. However, despite its flexibility and strengths for analyzing fluorescence microscopy images, its adoption in the bioimaging community has been limited by the availability of easy-to-use tools and guidance in results interpretation. Here we describe an optical flow tool, OpticalFlow3D, that can be run in Python or MATLAB and is compatible with three-dimensional microscopy images. Using biological examples across length scales, we illustrate how OpticalFlow3D can enable new biological insight.
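OpticalFlow3D operates on 3D image volumes; the core idea, solving for per-pixel velocity from spatial and temporal intensity derivatives, is easiest to see in one dimension. Below is a minimal Lucas-Kanade-style sketch in Python; the window size, derivative scheme, and function name are illustrative choices, not the tool's implementation.

```python
def lucas_kanade_1d(frame0, frame1, window=5):
    """1D Lucas-Kanade: estimate per-pixel displacement from the
    spatial derivative Ix and temporal derivative It by least squares
    over a local window: v = -sum(Ix*It) / sum(Ix*Ix).
    Yields dense flow without segmenting any object."""
    n = len(frame0)
    # Central differences for Ix (clamped at the boundaries).
    ix = [(frame0[min(i + 1, n - 1)] - frame0[max(i - 1, 0)]) / 2
          for i in range(n)]
    it = [frame1[i] - frame0[i] for i in range(n)]
    half = window // 2
    flow = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        num = sum(ix[j] * it[j] for j in range(lo, hi))
        den = sum(ix[j] * ix[j] for j in range(lo, hi))
        # Flat regions carry no gradient information; report zero.
        flow.append(-num / den if den > 1e-12 else 0.0)
    return flow
```

Shifting a triangular intensity profile by half a pixel, for example, yields an estimated flow of 0.5 along its slopes, recovering sub-pixel motion with no object boundary ever defined.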
bioinformatics · 2026-03-10 · v1
In silico analysis of the human titin protein (Immunoglobulin-like, fibronectin type III, and Protein kinase domains) as a potential forensic marker for postmortem interval (PMI) estimation
Gill, M. U.; Akhtar, M.
Abstract
Due to the limited availability of reliable and well-validated molecular markers, the determination of postmortem interval (PMI) is still a major obstacle for forensic investigators when resolving a case. The largest human protein, titin, has never undergone a domain-level examination of its postmortem degradation patterns. This study focused on the in silico analysis of the Immunoglobulin-like, fibronectin type III, and Protein kinase domains of human titin to assess their potential utility in PMI estimation. Sequence data for the studied domains were retrieved from UniProt; 2D and 3D models were generated by PSIPRED and SWISS-MODEL, respectively, followed by analysis of physicochemical properties, solubility assessment, and structural comparison. This study revealed that the Ig-like domain is the most stable, followed by the Fn-III and Protein kinase domains. These findings indicate that titin domains may degrade at different rates in the postmortem period. This study introduces the first computational basis for considering titin as a multi-domain candidate biomarker for PMI estimation, laying the groundwork for upcoming laboratory validation.
bioinformatics · 2026-03-10 · v1
SpatioCAD: Context-aware graph diffusion model for pinpointing spatially variable genes in heterogeneous tissues
Zhang, S.; Wen, H.; Shen, Q.
Abstract
Spatial transcriptomics enables comprehensive characterization of tissue architecture, and the identification of spatially variable genes (SVGs) is a critical step for defining region-specific molecular markers and uncovering spatially regulated mechanisms across diverse biological contexts. However, most existing methods for SVG detection overlook cell density variations, a major confounding factor in complex tissues such as tumors, where heterogeneous cellularity frequently introduces false-positive calls. Here we present SpatioCAD, a computational framework that explicitly decouples genuine spatial expression patterns from confounding effects driven by cellularity. SpatioCAD leverages and extends a graph diffusion model to simulate expression propagation under cell-density-aware conditions, thereby ensuring unbiased detection of SVGs across all expression levels. Systematic evaluations on simulated datasets demonstrate its superior statistical power and specificity. Applied to breast cancer, lung cancer, and glioma datasets, SpatioCAD identifies functionally diverse SVGs, including low-abundance transcripts with established roles in tumor progression, while also recapitulating biologically meaningful tissue architecture features.
bioinformatics · 2026-03-10 · v1
MOZAIC: Compound Growth via In Silico Reactions and Global Optimization using Conformational Space Annealing
Yoo, J.; Shin, W.-H.
Abstract
Motivation: Fragment-based drug discovery (FBDD) is an efficient strategy that leverages small molecular fragments to explore broader chemical space by combining them. Advances in computational methods have enabled the calculation of molecular properties and docking scores, thereby accelerating the development of algorithm- and AI-based approaches in FBDD. However, certain methods do not provide synthetic pathways to obtain the proposed compounds; consequently, these molecules might not be easily synthesized. Results: To address this limitation, we propose MOZAIC, a novel framework that explores chemical space using reaction-based fragment growing and Conformational Space Annealing, a powerful global optimization algorithm. Our results show that MOZAIC effectively produces chemically diverse molecules with balanced improvements in lead-like properties, including QED, synthetic accessibility, and binding affinity. Furthermore, its flexible objective function allows fine-tuning for specific design goals, such as enhancing solubility alongside binding affinity. These capabilities position MOZAIC as a valuable platform for advancing fragment-to-lead and lead optimization efforts in drug discovery. Availability and implementation: MOZAIC is available at https://github.com/kucm-lsbi/MOZAIC/. Supplementary Information: Supplementary data are available at Bioinformatics online.
bioinformatics · 2026-03-10 · v1
InversePep: Diffusion-Driven Structure-Based Inverse Folding for Functional Peptides
Chilakamarri, S. K.; Kasturi, S. R.; Yerrabandla, S. P. R.; Gogte, S.; Kondaparthi, V.
Abstract
Designing functional peptides with specific structural and biochemical properties is critical for applications in protein engineering and therapeutic discovery. However, most peptide design approaches rely on evolutionary or local sequence optimization methods, which are limited when adapting to peptides' shorter length, high conformational flexibility, and unique physicochemical constraints. While recent structure-based inverse folding models have shown success for proteins, these models often underperform on peptides because sequence recovery alone is not a reliable indicator of stability or foldability in short, flexible backbones. To address this challenge, we introduce InversePep, a generative diffusion model for structure-based peptide inverse folding. InversePep learns the conditional distribution of sequences that can adopt a given backbone conformation, enabling direct generation of peptides tailored to target structural geometries. The framework integrates a geometric graph neural network to encode 3D backbone features with a Transformer-based sequence refinement module that iteratively denoises candidate sequences during diffusion. Trained on a diverse set of peptide backbones sourced from Propedia and SATPdb, InversePep effectively captures structural and biochemical diversity across peptide families. In systematic evaluations on held-out peptide structures and the PepBDB benchmark, InversePep achieves a mean TM score of 0.38 and a median of 0.28, outperforming ProteinMPNN and ESM-IF1 in generating geometry-consistent peptide sequences. In-silico folding analyses confirm that sampled peptides reliably adopt the target conformations. These results highlight InversePep's capability for designing structurally stable and sequence-diverse peptides, demonstrating its potential in antimicrobial peptide discovery, peptide therapeutics, and molecular probe development.
bioinformatics · 2026-03-10 · v1
Neurotox: Deep learning decodes conserved hallmarks of neurotoxicity across venomous species
Bedraoui, A.; El Mejjad, S.; Enezari, S.; El Hajji, F. Z.; Galan, J.; El Fatimy, R.; Daouda, T.
Abstract
Neurotoxic proteins drive most of the pathophysiological effects of animal envenomation, yet it remains unclear whether neurotoxicity is encoded directly within the protein sequence or emerges from higher-order structure and interactions with the target receptor. To address this, we developed Neurotox, a sequence-based deep learning framework trained on 200,000 curated protein sequences, with balanced representation of neurotoxic and non-neurotoxic proteins across taxa, achieving high classification accuracy (96%) with strong performance on unseen toxin families. We further introduced a controlled sequence-representation warping strategy that selectively perturbs neurotoxicity-relevant features, inducing a systematic loss of predicted neurotoxicity while preserving primary sequence identity. Structural modeling using AlphaFold 3 showed that, for most top-ranked toxins, warping disrupted beta-sheet architectures and reduced interface precision, with all top candidates showing highly significant effects (p < 0.0001). These structural changes were accompanied by recurrent cysteine-centered substitutions, implicating disruption of conserved disulfide frameworks. A single exception retained its global fold (RMSD = 2.8 Angstrom), maintained low PAE, high pLDDT, and high pDockQ scores, and preserved a close arginine-glutamate contact (Arg53-Glu75), yet still exhibited marked attenuation of predicted neurotoxicity. These results suggest that neurotoxicity arises from distributed sequence features that shape secondary-structure organization and receptor interaction, rather than from isolated contact residues alone.
bioinformatics · 2026-03-10 · v1
NeuroNarrator: A Generalist EEG-to-Text Foundation Model for Clinical Interpretation via Spectro-Spatial Grounding and Temporal State-Space Reasoning
Wang, G.; Yang, S.; Ding, J.-e.; Zhu, H.; Liu, F.
Abstract
Electroencephalography (EEG) provides a non-invasive window into neural dynamics at high temporal resolution and plays a pivotal role in clinical neuroscience research. Despite this potential, prevailing computational approaches to EEG analysis remain largely confined to task-specific classification objectives or coarse-grained pattern recognition, offering limited support for clinically meaningful interpretation. To address these limitations, we introduce NeuroNarrator, the first generalist EEG-to-text foundation model designed to translate electrophysiological segments into precise clinical narratives. A cornerstone of this framework is the curation of NeuroCorpus-160K, the first harmonized large-scale resource pairing over 160,000 EEG segments with structured, clinically grounded natural-language descriptions. Our architecture first aligns temporal EEG waveforms with spatial topographic maps via a rigorous contrastive objective, establishing spectro-spatially grounded representations. Building on this grounding, we condition a Large Language Model through a state-space-inspired formulation that integrates historical temporal and spectral context to support coherent clinical narrative generation. This approach establishes a principled bridge between continuous signal dynamics and discrete clinical language, enabling interpretable narrative generation that facilitates expert interpretation and supports clinical reporting workflows. Extensive evaluations across diverse benchmarks and zero-shot transfer tasks highlight NeuroNarrator's capacity to integrate temporal, spectral, and spatial dynamics, positioning it as a foundational framework for time-frequency-aware, open-ended clinical interpretation of electrophysiological data.
bioinformatics · 2026-03-10 · v1
Counting strands in outer membrane beta-barrels
Lim, S.; Nimmagadda, T.; Khamis, A.; Montezano, D.; Feehan, R.; Copeland, M.; Slusky, J.
Abstract
Beta-barrel structures are critical components of bacterial outer membranes, where they facilitate transport, cell signaling, antibiotic resistance, and structural integrity. A key feature of beta-barrels is their strand count, which influences pore diameter, binding site locations, and functional properties. However, because of breaks in strands and the presence of strands in periplasmic domains and plug domains, manual counting is inefficient and current algorithms do not accurately determine barrel strand count. To address this, we refined our previous beta-barrel structural assessment tool, PolarBearal, to improve strand number identification in large-scale datasets. To enhance the accuracy of barrel strand number labeling, our updated algorithm integrates three structural criteria, namely inter-residue vector angles, hydrogen-bonding distances, and strand connectivity. Using this algorithm, we labeled strand numbers for 571,760 predicted outer membrane beta-barrel structures obtained from the AlphaFold2 database. Our algorithm has 97% accuracy in strand number assignments, and the resulting dataset facilitates assessment of the homogeneity of strand counts for different types of outer membrane proteins. The strand labeling also provides insights on beta-barrel strand distribution and evolutionary patterns, supporting further research in protein structure prediction and design.
bioinformatics · 2026-03-10 · v1
From General-Purpose to Disease-Specific Features: Aligning LLM Embeddings on a Disease-Specific Biomedical Knowledge Graph for Drug Repurposing
Pandey, S.; Talo, M.; Siderovski, D. P.; Sumien, N.; Bozdag, S.
Abstract
Identifying new therapeutic uses for existing drugs is a major challenge in biomedicine, especially for complex neurodegenerative conditions such as Alzheimer disease and related dementias (ADRD), where treatment options remain limited and relevant data are often sparse, heterogeneous, and difficult to integrate. Although general-purpose Large Language Model (LLM) embeddings encode rich semantic information, they often lack the task-specific biomedical context needed for inference tasks such as computational drug repurposing. We introduce Contextualizing LLM Embeddings via Attention-based gRaph learning (CLEAR), a multimodal representation-fusion framework that aligns LLM embeddings with the topological structure of a context-specific Knowledge Graph (KG). Across five benchmark datasets, CLEAR achieved state-of-the-art results, improving predictive performance (e.g., F1 score) by up to 30% over prior methods. We further applied CLEAR to identify FDA-approved drugs with potential for repurposing for ADRD, including Parkinson disease-related dementia and Lewy Body dementia. CLEAR learned a biologically coherent embedding space, prioritized leading ADRD drug candidates, and accurately summarized known therapeutic relationships for FDA-approved Alzheimer disease drugs. Overall, CLEAR shows that grounding biomedical LLM embeddings with context-specific KG signals can improve drug repurposing in data-sparse, real-world settings.
bioinformatics · 2026-03-10 · v1
Exploring per-base quality scores as a surrogate marker of the cell-free DNA fragmentome
Volkov, H. H. V.; Raitses-Gurevich, M.; Grad, M.; Shlayem, R.; Leibowitz, D.; Rubinek, T.; Golan, T.; Shomron, N.
Abstract
Per-base quality scores are widely treated as technical metadata in next-generation sequencing. Here, we show that in rigorously controlled whole-genome sequencing of cell-free DNA, quality profiles may encode fragmentomic signals that enable classification of cancer samples against matched controls. Analyzing four independent batches (23 cancer samples: pancreatic and breast; 22 matched controls) sequenced in a within-lane regime and further normalized per flow-cell tile to reduce technical confounders, we demonstrate through unsupervised analysis that boundary-enriched dynamics captured in these quality scores consistently separate cancer from control samples. A leave-one-batch-out classifier trained on quality-derived scores achieved a pooled area under the curve of 0.81. Furthermore, we show that the quality-derived metric correlates with short-fragment enrichment and tumor-associated 5′-end motifs, performing comparably to established, motif-based orthogonal methods. These results provide initial evidence that quality scores could serve as a low-cost, alignment-free biomarker for cfDNA-based cancer detection.
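The per-tile normalization step can be sketched as follows. The quality matrix, tile labels, and the mean-quality summary score are all invented stand-ins for the paper's quality-derived metric; the point is only the shape of the computation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy per-base quality matrix: rows = reads, columns = sequencing cycles.
qual = rng.normal(35, 3, size=(100, 50))
# Hypothetical flow-cell tile label per read, used for normalization.
tile = rng.integers(0, 4, size=100)

# Per-tile z-score normalization to damp tile-level technical effects.
norm = np.empty_like(qual)
for t in np.unique(tile):
    m = tile == t
    norm[m] = (qual[m] - qual[m].mean(axis=0)) / qual[m].std(axis=0)

# A simple quality-derived score per read: the mean normalized quality,
# which could then feed a downstream cancer-vs-control classifier.
score = norm.mean(axis=1)
print(score.shape)  # (100,)
```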
bioinformatics · 2026-03-10 · v1
NanoVI: a Bayesian variational inference Nextflow pipeline for species-level taxonomic classification from full-length 16S rRNA Nanopore reads
Curiqueo, C.; Fuentes-Santander, F.; Ugalde, J. A.
Abstract
NanoVI is a Nextflow pipeline for species-level taxonomic classification of full-length 16S rRNA Oxford Nanopore reads. Unlike existing tools that rely on expectation-maximization (EM) algorithms, NanoVI employs Bayesian variational inference with a Dirichlet-Categorical conjugate model, yielding abundance estimates accompanied by Bayesian 95% credible intervals that quantify estimation uncertainty, along with automatic shrinkage that suppresses spurious taxa. NanoVI integrates the Genome Taxonomy Database (GTDB) r226, providing phylogenetically consistent taxonomy while maintaining compatibility with NCBI-style databases. Benchmarked against a standardized mock community, NanoVI achieves species-detection metrics comparable to Emu, with 25 to 62% lower execution time and fewer false-positive assignments. Validation on 20 clinical vaginal microbiome samples confirms reproducibility against previously published Emu-based analyses.
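The Dirichlet-Categorical conjugacy underlying this model can be sketched directly: when read assignments are unambiguous, the posterior is available in closed form, and Monte Carlo draws give 95% credible intervals. NanoVI itself uses variational inference to handle the full model with ambiguous alignments; the counts and prior below are invented for illustration.

```python
import numpy as np

# Dirichlet-Categorical conjugacy: with prior Dirichlet(alpha) and observed
# read counts n_k per species, the posterior is Dirichlet(alpha + n).
counts = np.array([880, 95, 20, 5, 0])       # hypothetical reads per species
alpha = np.full(counts.size, 0.5)            # assumed sparsity-inducing prior
post = alpha + counts

rng = np.random.default_rng(42)
samples = rng.dirichlet(post, size=20_000)   # Monte Carlo posterior draws

mean = post / post.sum()                     # closed-form posterior mean
lo, hi = np.percentile(samples, [2.5, 97.5], axis=0)
for k in range(counts.size):
    print(f"species {k}: {mean[k]:.3f} [{lo[k]:.3f}, {hi[k]:.3f}]")
```

Note how the species with zero observed reads is shrunk toward a near-zero abundance with a tight interval, the behavior that suppresses spurious taxa.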
bioinformatics · 2026-03-10 · v1
DIA-NN EasyFilter workflow for the fast and user-friendly critical assessment and visualization of DIA-NN proteomics analysis outcome
Moagi, M. G.; Thatiana, F. F.; Kristof, E. K.; Arda, A. G.; Arianti, R.; Horvatovich, P.; Csosz, E.
Abstract
Liquid chromatography-tandem mass spectrometry (LC-MS/MS) based proteomics, particularly data-independent acquisition (DIA), has become widely adopted in One Health approaches to biological and clinical research for quantitative protein characterization. Among the many computational tools available, DIA-NN has demonstrated superior performance; however, the primary output of the current versions is provided as a compact, compressed PARQUET file that can be difficult to interrogate without programming expertise. To address this limitation, we developed DIA-NN EasyFilter (DEF), a fast, user-friendly, KNIME-based workflow for comprehensive protein filtering and visualization. DEF integrates chromatographic peak-based filtering, curated contaminant libraries, and quantity-quality assessment, along with interactive modules for qualitative and quantitative data exploration. The workflow is optimized for efficient execution within the KNIME local desktop environment and is designed to support end-users in improving accuracy and interpretability without requiring coding skills. We provide a detailed description of how to run DEF and demonstrate its utility and robustness using published large-scale proteomics datasets, showing high comparability across studies regardless of instrument platform or dataset size.
bioinformatics · 2026-03-10 · v1
Improving Causal Gene Identification Using Large Language Models
Ofer, D.; Kaufman, H.
Abstract
Genome-Wide Association Studies (GWAS) have successfully identified numerous loci associated with complex traits and diseases, yet pinpointing causal genes remains a significant challenge. The reliance on simple proximity-based heuristics is often insufficient due to linkage disequilibrium, gene interactions, and regulatory effects. Recent advancements in Large Language Models (LLMs) have demonstrated potential in automating causal gene identification, but their effectiveness remains limited by knowledge representation and retrieval mechanisms. This study builds on previous research by evaluating LLMs for causal gene identification, with a focus on enhancing performance through Retrieval-Augmented Generation (RAG) and the incorporation of genomic distance information. We replicate prior results using the smaller Qwen2.5 model, assessing its predictive accuracy on a benchmark dataset from Open Targets. Performance improved when integrating RAG-based literature retrieval (F1 = 0.795) and gene-distance information (F1 = 0.806). However, the combined approach yielded diminishing returns, suggesting interactions between these enhancements. Error analysis revealed that genomic distance features improved predictions by reinforcing established heuristics, while RAG enhanced domain knowledge but occasionally led to semantic biases. These findings highlight the potential of hybrid approaches in leveraging both structured genomic features and unstructured textual data.
bioinformatics · 2026-03-10 · v1
FAMUS: A Few-Shot Learning Framework for Large-Scale Protein Annotation
Shur, G.; Burstein, D.
Abstract
Predicting gene function is a pivotal and challenging step in genomic and metagenomic data analysis. Current automatic annotation tools typically rely on the single most similar sequence from the query database. The sparsity of data per annotation makes it challenging to confidently assign gene function for underrepresented genes. Here, we present a contrastive learning framework for functional annotation. FAMUS (Functional Annotation Method Using Supervised contrastive learning) compares query sequences to profile Hidden Markov Model databases and transforms the similarity scores into a condensed vector space that minimizes the distance of proteins from the same family. The similarity scores of a query to all profiles are used for its representation instead of considering only the top-ranking hit. In a protein family assignment task, FAMUS outperformed KEGG's native KofamScan for KEGG Orthology annotation and InterPro's InterProScan for PANTHER family annotation. We thus created four protein annotation models using protein families from the KEGG Orthology, InterPro family, OrthoDB, and EggNOG databases. All four models are available as a conda package and via our user-friendly web server, allowing users to annotate large-scale datasets. FAMUS is the first comprehensive and modular annotation framework based on contrastive learning. It supports both pre-defined and user-specific databases for tailored annotation, and can be easily integrated into any genomic and metagenomic analysis pipeline to facilitate accurate, large-scale functional annotation.
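The core idea of representing a query by its full vector of profile similarity scores, rather than only the top-ranking hit, can be sketched with a nearest-centroid toy. The learned contrastive projection is omitted here, and all scores and family names are random stand-ins, not FAMUS internals.

```python
import numpy as np

rng = np.random.default_rng(1)
n_profiles = 30  # number of HMM profiles in the hypothetical database

# Hypothetical per-family "score signatures": each family tends to hit a
# characteristic subset of profiles (a stand-in for learned structure).
centroids = {f"FAM{i}": rng.random(n_profiles) for i in range(3)}

def assign(query_scores):
    """Nearest-centroid family assignment on full profile-score vectors."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(centroids, key=lambda f: cos(query_scores, centroids[f]))

# A query whose score vector resembles FAM1's signature is assigned to FAM1,
# even if no single profile score alone would be decisive.
query = centroids["FAM1"] + rng.normal(0, 0.05, n_profiles)
print(assign(query))  # FAM1
```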
bioinformatics · 2026-03-10 · v1
Ensemble-based genomic prediction for maize flowering-time improves prediction accuracy and reveals novel insights into trait genetic variation
Tomura, S.; Powell, O. M.; Wilkinson, M. J.; Cooper, M.
Abstract
While various genomic prediction models have been evaluated for their potential to accelerate genetic gain for multiple traits, no individual genomic prediction model has outperformed all others across all applications. As an alternative approach, ensembles of multiple individual genomic prediction models can be applied to utilise the complementary strengths of individual prediction models and offset the prediction errors of each. We used the EasiGP (Ensemble AnalySis with Interpretable Genomic Prediction) pipeline to investigate the performance of an ensemble approach, targeting flowering-time traits measured in two maize nested association mapping datasets. For both datasets, the ensemble-based prediction approach achieved higher prediction accuracy and lower prediction error across the flowering-time traits compared to each individual model. Multiple genomic regions known to contain key flowering-time genes were repeatedly included as features across individual genomic prediction models, indicating that the models successfully captured SNPs associated with these regions. Although repeatability was high for some genomic regions, estimated marker effects varied across many genomic regions, suggesting that the models might also have captured different aspects of the genetic variation underlying the traits. The ensemble combination of these diverse views likely contributed to the improvement in prediction performance over the individual prediction models. Ensemble-based prediction can thus help overcome the limitations of the continued search for a single best genomic prediction model that consistently achieves the highest prediction performance, potentially improving prediction accuracy for applications in crop breeding.
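The ensemble principle, averaging predictions so that partially independent model errors offset each other, can be sketched on synthetic data. The two "models" and all values below are invented; EasiGP combines real genomic prediction models rather than noisy copies of the truth.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
y = rng.normal(size=n)  # toy "true" breeding values

# Two hypothetical individual genomic prediction models with independent errors.
pred_a = y + rng.normal(0, 1.0, n)
pred_b = y + rng.normal(0, 1.0, n)
ensemble = (pred_a + pred_b) / 2  # simple unweighted model averaging

def rmse(p):
    return float(np.sqrt(np.mean((p - y) ** 2)))

# With independent errors of equal variance, the ensemble error shrinks by
# roughly a factor of sqrt(2), so it beats either individual model.
print(rmse(ensemble) < rmse(pred_a), rmse(ensemble) < rmse(pred_b))  # True True
```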
bioinformatics · 2026-03-09 · v4
ChatSpatial: Schema-Enforced Agentic Orchestration for Reproducible and Cross-Platform Spatial Transcriptomics
Yang, C.; Zhang, X.; Chen, J.
Abstract
Spatial transcriptomics has transformed our ability to study tissue architecture at molecular resolution, yet analyzing these data demands navigating dozens of computational methods across incompatible Python and R ecosystems---forcing researchers to devote more effort to making tools function than to pursuing biological questions. We present ChatSpatial, a platform in which the LLM selects from pre-validated tool schemas rather than generating free-form code, with domain expertise embedded in schema descriptions for context-aware parameter inference. Built on the Model Context Protocol (MCP), ChatSpatial unifies 60+ methods across 15 analytical categories into a single conversational workflow spanning Python and R ecosystems. Replication of two published studies---recovering subclonal heterogeneity in ovarian cancer and tumor microenvironment organization in oral squamous cell carcinoma---and validation across seven LLM platforms demonstrate that schema-enforced orchestration yields near-deterministic reproducibility at the workflow level for multi-step spatial analyses. Beyond replication, exploratory cross-method analyses illustrate practical triangulation across independent analytical frameworks.
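Schema-enforced tool selection can be sketched with a plain dictionary schema: the LLM may only name a registered tool and supply parameters that validate against its schema, rather than emitting free-form code. The tool name, parameters, and allowed values below are invented for illustration; ChatSpatial's actual MCP schemas are richer.

```python
# Registry of pre-validated tool schemas (hypothetical tool and parameters).
SCHEMAS = {
    "cluster_spots": {
        "method": {"type": str, "choices": {"leiden", "louvain"}},
        "resolution": {"type": float},
    }
}

def validate_call(tool, params):
    """Accept a tool call only if the tool is registered and every parameter
    matches its declared type and allowed values."""
    if tool not in SCHEMAS:
        raise ValueError(f"unknown tool: {tool}")
    schema = SCHEMAS[tool]
    for name, value in params.items():
        spec = schema.get(name)
        if spec is None:
            raise ValueError(f"unknown parameter: {name}")
        if not isinstance(value, spec["type"]):
            raise TypeError(f"{name} must be {spec['type'].__name__}")
        if "choices" in spec and value not in spec["choices"]:
            raise ValueError(f"{name} must be one of {sorted(spec['choices'])}")
    return tool, params

print(validate_call("cluster_spots", {"method": "leiden", "resolution": 1.0}))
```

Because every accepted call is drawn from this fixed, typed vocabulary, repeated runs of the same conversation produce the same workflow, which is the source of the near-deterministic reproducibility described above.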
bioinformatics · 2026-03-09 · v2
singIST: an R/Bioconductor library and Quarto dashboard for automated single-cell comparative transcriptomics analysis of disease models and humans
Moruno Cuenca, A.; Picart-Armada, S.; Perera-Lluna, A.; Fernandez-Albert, F.
Abstract
Preclinical disease models often diverge from human pathophysiology at single-cell resolution, complicating model selection and limiting translational value. We present singIST, an R/Bioconductor package for quantitative and explainable comparison of disease model scRNA-seq data against a human reference. For each superpathway, singIST fits an adaptive sparse multi-block PLS-DA model on human pseudobulk expression, integrating one-to-one orthology and cell-type mapping, and translates model fold changes into the human expression space to compute signed recapitulation at the superpathway, cell type, and gene levels. To streamline interpretation and reporting, we provide singIST Visualizer, a companion Quarto/Shiny dashboard that loads singIST outputs and offers interactive exploration with export-ready plots and tables, avoiding manual figure coding across many superpathways and models. We illustrate an end-to-end workflow on an oxazolone mouse model against a human atopic dermatitis reference for two representative pathways: Dendritic Cells in regulating Th1/Th2 Development [BIOCARTA] and Cytokine-cytokine receptor interaction [KEGG]. singIST is distributed under the MIT License via Bioconductor, and the Visualizer is available on GitHub.
bioinformatics · 2026-03-09 · v1
MapMyCells: High-performance mapping of unlabeled cell-by-gene data to reference brain taxonomies
Daniel, S. F.; Lee, C.; Mollenkopf, T.; Lee, M.; Arbuckle, J.; Fiabane, E.; Gabitto, M. I.; Johansen, N.; Kapen, I.; Kraft, A. W.; Lai, J.; Li, S. Y.; McGinty, R.; Miller, J. A.; Welch-Moosman, S.; Otto, S.; Sawyer, L.; Shepard, N.; Thompson, C. L.; Tjaernberg, A.; Waters, J.; Zhen, X.; Macosko, E.; Lein, E.; Ng, L.; Zeng, H.; Mufti, S.; Yao, Z.; Hawrylycz, M.
Abstract
Single-cell mapping methods convert raw, heterogeneous single-cell datasets into interpretable and comparable representations of biological identity. As reference cell-type taxonomies mature, mapping new datasets to shared references has become a central strategy for enabling cross-study integration, reproducible annotation, and cumulative biological knowledge. Here we present MapMyCells, an open-source framework designed to align diverse single-cell omics datasets to hierarchical reference taxonomies with minimal preprocessing. MapMyCells provides out-of-the-box support for an expanding set of high-quality brain cell-type references generated by the Allen Institute for Brain Science, the BRAIN Initiative, and the Seattle Alzheimer's Disease Brain Cell Atlas, including whole-brain mouse and human atlases, aging and Alzheimer's disease cohorts, and a cross-species consensus taxonomy initially focused on the basal ganglia. MapMyCells enables efficient mapping of hundreds of thousands of cells on standard workstations without specialized hardware, providing a deterministic, scalable, and modality-agnostic approach that is robust across species and molecular assays. The framework produces interpretable confidence metrics and quantitative summaries of mapping performance, allowing users to evaluate assignment precision and accuracy. We demonstrate the mapping of unlabeled transcriptomic, epigenomic, and spatial datasets to reference taxonomies and describe a general workflow for preparing arbitrary hierarchical taxonomies for reference-based mapping. As the ecosystem of single-cell reference atlases expands, MapMyCells offers a practical and reproducible solution for community-scale cell-type annotation and cross-dataset integration, supporting the development of unified and extensible brain cell atlases.
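Mapping a cell to a hierarchical taxonomy with an interpretable confidence metric can be sketched as nearest-centroid descent through the hierarchy. The taxonomy, coordinates, and confidence formula below are invented for illustration and are not MapMyCells' actual algorithm or any Allen taxonomy content.

```python
import numpy as np

# Toy two-level taxonomy: class -> subtype centroids in a 2-gene expression space.
taxonomy = {
    "neuron": {"excitatory": np.array([5.0, 1.0]), "inhibitory": np.array([4.0, 2.0])},
    "glia": {"astrocyte": np.array([0.5, 4.0]), "microglia": np.array([1.0, 5.0])},
}

def map_cell(x):
    """Assign the top level first, then descend; report a distance-based confidence."""
    top_centroids = {k: np.mean(list(v.values()), axis=0) for k, v in taxonomy.items()}
    top = min(top_centroids, key=lambda k: np.linalg.norm(x - top_centroids[k]))
    leaves = taxonomy[top]
    leaf = min(leaves, key=lambda k: np.linalg.norm(x - leaves[k]))
    # Confidence: margin between the two closest leaf centroids (1 = unambiguous).
    d = sorted(np.linalg.norm(x - c) for c in leaves.values())
    conf = d[1] / (d[0] + d[1])
    return [top, leaf], round(float(conf), 2)

print(map_cell(np.array([4.8, 1.2])))  # (['neuron', 'excitatory'], 0.8)
```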
bioinformatics · 2026-03-09 · v1
Benchmarking tissue- and cell type-of-origin deconvolution in cell-free transcriptomics
Ioannou, A.; Friman, E. T.; Daub, C. O.; Bickmore, W. A.; Biddie, S. C.
Abstract
Plasma cell-free RNA (cfRNA) reflects tissue- and cell-type-specific activity across pathological states and is a promising biomarker for organ injury and disease. Computational deconvolution methods are widely used to infer organ and cell-type contributions to cfRNA profiles. However, most were originally developed for single-tissue bulk transcriptomes and their performance in body-wide cfRNA settings, where any tissue or cell type can contribute, remains poorly characterised. Here, we present a systematic benchmarking of tissue- and cell type-of-origin deconvolution for plasma cfRNA that considers both methodological and reference-related sources of variability under realistic cfRNA simulation settings. We evaluated seven commonly used deconvolution methods across distinct algorithmic classes and multi-organ reference configurations derived from bulk and single-cell atlases. We assessed performance using simulation frameworks that model multi-organ mixtures, technical noise, and transcript degradation. We further examined deconvolution methods across multiple previously published clinical cfRNA cohorts spanning diverse disease contexts. Across both tissue- and cell-type-level analyses, deconvolution performance was strongly influenced by both method choice and reference parameters. Tissue-of-origin inference was comparatively robust across simulated and clinical datasets, recovering disease-associated organ signals and showing concordance with biochemical markers. In contrast, cell type-of-origin inference showed greater variability and reduced consistency across analytical settings, leading to divergent interpretations in both simulations and published clinical cfRNA cohorts. Together, these findings demonstrate that methodological and reference-related variability are major sources of uncertainty in cfRNA deconvolution, with tissue-level inference being more robust than cell-type-level inference. Our benchmarking framework provides guidance for reference selection and comparative interpretation in cfRNA deconvolution.
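The core deconvolution problem, recovering tissue fractions from a mixed cfRNA profile given a reference signature matrix, can be sketched with unconstrained least squares on a noise-free simulation. The reference matrix and fractions are synthetic; the benchmarked tools additionally impose non-negativity or use regression variants such as NNLS or support-vector regression.

```python
import numpy as np

rng = np.random.default_rng(3)
# Reference: genes x tissues signature matrix (a synthetic stand-in for an atlas).
ref = rng.random((100, 4))
frac_true = np.array([0.6, 0.25, 0.1, 0.05])   # tissue fractions in plasma
cfrna = ref @ frac_true                         # noise-free simulated mixture

# Least-squares deconvolution, then clip and renormalize to valid fractions.
frac_hat, *_ = np.linalg.lstsq(ref, cfrna, rcond=None)
frac_hat = np.clip(frac_hat, 0, None)
frac_hat /= frac_hat.sum()
print(np.round(frac_hat, 3))  # recovers approximately [0.6, 0.25, 0.1, 0.05]
```

In the noise-free case recovery is exact; the benchmarking question above is precisely how this degrades once technical noise, transcript degradation, and mismatched references are introduced.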
bioinformatics · 2026-03-09 · v1
Fractal: Towards FAIR bioimage analysis at scale with OME-Zarr-native workflows
Lüthi, J.; Cerrone, L.; Comparin, T.; Hess, M.; Hornbachner, R.; Tschan, A.; Glasner de Medeiros, G. Q.; Repina, N. A.; Cantoni, L. K.; Steffen, F. D.; Bourquin, J.-P.; Liberali, P.; Pelkmans, L.; Uhlmann, V.
Abstract
The rapid growth in microscopy data volume, dimensionality, and diversity urgently calls for scalable and reproducible analysis frameworks. While efforts on the open OME-Zarr format have helped standardize the storage of large microscopy datasets, solutions for standardized processing are still lacking. Here, we introduce two complementary contributions to address this gap: 1) the Fractal task specification, defining OME-Zarr processing units that can interoperate across computational environments and workflow engines, and 2) the Fractal platform, using this specification to enable scalable and modular OME-Zarr-native analysis workflows. We demonstrate their use across diverse biological research data, including terabyte-scale multiplexed, volumetric, and time-lapse imaging. In a clinical setting, we show that Fractal workflows achieve near-identical quantification of millions of cells across independent deployments, demonstrating the reproducibility required for translational applications. With its growing community of contributors, the Fractal ecosystem provides a foundation for FAIR microscopy image analysis relying on open file formats.
bioinformatics · 2026-03-09 · v1
Quantum Hamiltonian Learning using Time-Resolved Measurement Data and its Application to Gene Regulatory Network Inference
Sohail, M. A.; Sudharshan, R. R.; Pradhan, S. S.; Rao, A.
Abstract
We present a new Hamiltonian-learning framework based on time-resolved measurement data from a fixed local IC-POVM and its application to inferring gene regulatory networks. We introduce the quantum Hamiltonian-based gene-expression model (QHGM), in which gene interactions are encoded as a parameterized Hamiltonian that governs gene expression evolution over pseudotime. We derive finite-sample recovery guarantees and establish upper bounds on the number of time and measurement samples required for accurate parameter estimation with high probability, scaling polynomially with system size. To recover the QHGM parameters, we develop a scalable variational learning algorithm based on empirical risk minimization. Our method recovers network structure efficiently on synthetic benchmarks and reveals novel, biologically plausible regulatory connections in Glioblastoma single-cell RNA sequencing data, highlighting its potential in cancer research. This framework opens new directions for applying quantum-like modeling to biological systems beyond the limits of classical inference.
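The QHGM premise, gene interactions entering as terms of a parameterized Hamiltonian whose unitary evolution drives expression over pseudotime, can be illustrated with a single two-level toy system. The Hamiltonian and parameter below are invented; the paper's contribution is learning such parameters from IC-POVM measurement data, which this sketch does not attempt.

```python
import numpy as np

# Toy parameterized Hamiltonian on one two-level "gene": H = theta * sigma_x.
theta = 0.3
sx = np.array([[0, 1], [1, 0]], dtype=complex)

def evolve(psi0, t):
    """Apply U(t) = exp(-i t H) via eigendecomposition of the Hermitian H."""
    w, v = np.linalg.eigh(theta * sx)
    return v @ np.diag(np.exp(-1j * w * t)) @ v.conj().T @ psi0

psi = evolve(np.array([1.0, 0.0], dtype=complex), t=1.0)
probs = np.abs(psi) ** 2
# For H = theta*sigma_x the occupations oscillate as cos^2(theta*t), sin^2(theta*t).
print(np.allclose(probs, [np.cos(0.3) ** 2, np.sin(0.3) ** 2]))  # True
```

Time-resolved measurements of such occupation probabilities are the data from which the Hamiltonian parameters (here, theta) would be estimated.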
bioinformatics · 2026-03-09 · v1
Defining mutational signatures of lung cancer-associated carcinogens through in vitro exposure of human airway epithelial cells
Gurevich, N. Q.; Chiu, D. J.; Yajima, M.; Huggins, J.; Mazzilli, S. A.; Campbell, J. D.
Abstract
While distinct environmental exposures imprint unique mutational signatures on cancer genomes, the specific causal patterns for many known carcinogens remain uncharacterized in relevant human tissues. To address this gap, we developed a novel, physiologically relevant system that uses a combination of airway epithelial cells and whole genome sequencing to characterize mutational patterns induced by genotoxic carcinogens associated with lung cancer. After validating the platform's accuracy by successfully recapitulating the known signature for Benzo(a)pyrene (BaP), we used this system to gain detailed insights into the types of mutations that occur with exposure to N-nitrosotris-(2-chloroethyl) urea (NTCU) and 4-(methylnitrosamino)-1-(3-pyridyl)-1-butanone (NNK), genotoxic compounds that induce lung squamous cell carcinoma and lung adenocarcinoma in mouse models, respectively. Cells exposed to NTCU had significantly more somatic SNVs compared to control samples. An average of 82.3% of mutations in NTCU samples were attributed to a novel mutational signature distinct from those in the COSMIC database but highly correlated with recent in vivo mouse models. In contrast, NNK exposure did not produce a distinct mutational pattern above background at either high or low concentrations. Ultimately, this in vitro system provides a robust platform to define causal links between environmental exposures and mutational patterns in lung cancer mutagenesis.
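Comparing an observed mutation spectrum to a candidate signature typically uses cosine similarity over the 96 trinucleotide mutation channels. A minimal sketch follows; the spectra here are random stand-ins, not COSMIC signatures or the paper's data.

```python
import numpy as np

rng = np.random.default_rng(5)
# Toy 96-channel trinucleotide mutation spectra, each normalized to sum to 1.
signature = rng.random(96); signature /= signature.sum()     # candidate signature
background = rng.random(96); background /= background.sum()  # background process

# A sample dominated by the signature plus some background mutations.
sample = 0.8 * signature + 0.2 * background

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The sample's spectrum is closer to the dominant signature than to background.
print(cosine(sample, signature) > cosine(sample, background))  # True
```

Signature extraction tools generalize this by decomposing many samples' spectra jointly (e.g. via non-negative matrix factorization) before any cosine comparison to reference catalogs.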
bioinformatics · 2026-03-09 · v1
Application of large language models to the annotation of cell lines and mouse strains in genomics data
Rogic, S.; Mancarci, B. O.; Xu, B.; Xiao, A.; Yang, C.; Pavlidis, P.
Abstract
Accurate, consistent and comprehensive metadata are essential for the reuse of functional genomics data deposited in repositories such as the Gene Expression Omnibus (GEO); however, achieving this often requires careful manual curation that is time-consuming, costly and prone to errors. In this paper, we evaluate the performance of Large Language Models (LLMs), specifically OpenAI's GPT-4o, as an assistive tool for entity-to-ontology annotation of two commonly encountered descriptors in transcriptomic experiments: mouse strains and cell lines. Using over 9,000 manually curated experiments from the Gemma database and over 5,000 associated journal articles, we assess the model's ability to identify relevant free-text entries and map them to appropriate ontology terms. Using zero-shot prompting and retrieval-augmented generation (RAG) to incorporate domain-specific ontology knowledge, GPT-4o correctly annotated 77% of mouse strain and 59% of cell line experiments, and uncovered manual curation errors in Gemma for over 200 experiments. GPT-4o substantially outperformed a regular expression-based string-matching method, which correctly annotated only 6% of mouse strain experiments due to low precision. Model errors often arose from typographical mistakes or inconsistent naming in the GEO record or publication, and resembled those made by human curators. Along with annotations, our approach requests that the model output supporting context and quotes from the sources. These were typically accurate and enabled rapid curator verification. These findings suggest that LLMs are not ready to fully replace manual curators, but can already effectively support them. A human-in-the-loop workflow, in which LLM annotations are provided to human curators for validation, may improve the efficiency and quality of large-scale biomedical metadata curation.
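A string-matching baseline of the kind the authors compare against can be sketched with stdlib fuzzy matching (difflib rather than regular expressions, so this is a variant, not their exact baseline). The strain names follow real mouse nomenclature, but the mapped IDs are invented placeholders, not real ontology terms.

```python
import difflib

# Toy lookup from free-text strain names to ontology terms; the IDs are
# invented placeholders, not real ontology identifiers.
ONTOLOGY = {
    "C57BL/6J": "STRAIN:0001",
    "BALB/cJ": "STRAIN:0002",
    "DBA/2J": "STRAIN:0003",
}

def match_strain(free_text, cutoff=0.6):
    """Fuzzy-match a free-text strain entry to a canonical label, or None."""
    hits = difflib.get_close_matches(free_text, ONTOLOGY, n=1, cutoff=cutoff)
    return (hits[0], ONTOLOGY[hits[0]]) if hits else None

print(match_strain("C57BL6/J"))         # tolerates a misplaced slash
print(match_strain("unrelated text"))   # None
```

Such lexical matching handles typographical variants but, as the abstract notes for the regex baseline, it has no access to context, which is where LLM-based annotation with RAG gains its advantage.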
bioinformatics · 2026-03-07 · v1