Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
Deciphering context-dependent epigenetic program by network-based prediction of clustered open regulatory elements from single-cell chromatin accessibility
Park, S.; Ma, S.; Lee, W.; Park, S. H.
Abstract
Large cis-regulatory domains, spanning tens to hundreds of kilobases, are pivotal in orchestrating cell-state-specific transcriptional programs that define cellular identity. However, existing single-cell analytical frameworks lack the capacity to identify these higher-order structures, thereby obscuring the coordinated, domain-level epigenetic regulation essential for complex biological processes. To address this, we introduce enCORE, a computational framework that leverages enhancer-enhancer interaction networks to determine Clustered Open Regulatory Elements (COREs) solely from single-cell ATAC-sequencing data. Our approach faithfully recapitulates established hematopoietic hierarchies and resolves lineage-specific regulatory programs by recovering canonical master transcription factors, frequent chromatin interactions, and enrichment of fine-mapped immune-related disease-associated genome-wide association study (GWAS) variants. In colorectal cancer, enCORE captures tumor-associated H3K27ac landscapes and prioritizes USP7 as a potential therapeutic candidate, supported by in silico perturbation. Collectively, our framework provides a powerful and scalable platform for deciphering the complex epigenetic architectures underlying human development and disease.
bioinformatics 2026-06-02 v11
Fold or flop: quality assessment of AlphaFold predictions on whole proteomes
Sarti, E.; Cazals, F.
Abstract
MOTIVATION: Reliability of AlphaFold2 predictions is mainly assessed using the predicted Local Distance Difference Test (pLDDT). For model organisms, 30-40% of residues fall into the low-confidence pLDDT range. Moreover, pLDDT sometimes fails to flag physically implausible structures. This raises two questions: can more robust reliability indicators be identified, and do unreliable predictions share common structural or biophysical features? RESULTS: We characterize protein structures through histograms of per-residue neighbor counts, and use Wasserstein principal component analysis to define the arity map, a lightweight and informative 2D embedding of the proteins in a dataset. Using AlphaFold-DB, we show that the arity map reveals three structurally and biophysically distinct populations (well-folded proteins, intrinsically disordered proteins, and physically implausible predictions). We also use our packing-based encoding at the residue level to define Abstraqt (Arity-Based STRuctural Arrangement Quality assessmenT), a per-residue scoring function that complements the pLDDT, assigning low scores to hallucinated helices and distorted beta strands while correctly scoring native-like predictions. AVAILABILITY: The code to compute arity maps is available within the Structural Bioinformatics Library, see https://sbl.inria.fr/doc/Alphafold_analysis-user-manual.html and https://sbl.inria.fr/data/AlphaFold-assessment.
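The two building blocks named in the abstract, per-residue neighbor-count histograms and a Wasserstein distance between them, can be sketched as follows. This is an illustrative sketch only, not the Structural Bioinformatics Library code: the function names, the 10 Å radius, and the 40-neighbor cap are assumptions, and the Wasserstein PCA step that produces the 2D arity map is not shown.

```python
import numpy as np

def neighbor_counts(coords, radius=10.0):
    """For each residue, count the other residues within `radius` (assumed cutoff)."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return (dist <= radius).sum(axis=1) - 1  # subtract the residue itself

def arity_histogram(counts, max_count=40):
    """Normalized histogram of neighbor counts over bins 0..max_count (assumed cap)."""
    clipped = np.clip(counts, 0, max_count)
    hist = np.bincount(clipped, minlength=max_count + 1).astype(float)
    return hist / hist.sum()

def wasserstein_1d(p, q):
    """1-Wasserstein distance between two histograms on the same unit-spaced bins."""
    return float(np.abs(np.cumsum(p) - np.cumsum(q)).sum())

# Toy coordinates (one point per residue, e.g. C-alpha positions); not real data.
toy = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [5.0, 0.0, 0.0]]
hist = arity_histogram(neighbor_counts(toy, radius=2.0), max_count=3)
```

Pairwise Wasserstein distances between such histograms are the kind of input a Wasserstein-space embedding would consume.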
bioinformatics 2026-06-02 v3
GlycoForge generates realistic glycomics data under known ground truth for rigorous method benchmarking
Hu, S.; Bojar, D.
Abstract
Quantifying all complex carbohydrates in a sample produces glycomics data, which constitutes compositional data and is stymied by biosynthetic dependencies between glycans, requiring dedicated analytic workflows. Properly assessing such methods frequently requires simulated data with known ground truths and injectable effects. However, simulating glycomics data, especially with control over effects and biases, is still unsolved. Here, we present GlycoForge, a feature-complete solution for simulating comparative glycomics data. GlycoForge supports simulating fully synthetic glycomics data and templated simulations based on real-world data, with specified motif-level effects, based on Gaussian copulas and estimated covariances. We further support injection of batch effects, both mean and variance shifts, via center-log ratio transformations to maintain compositional closure, and realistic missing data simulation. We showcase the utility of GlycoForge by evaluating batch effect correction algorithms for glycomics data, with automated guidelines for when to use such methods on real-world data. GlycoForge is available as an open-access Python package at https://github.com/BojarLab/GlycoForge.
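The idea of injecting a batch effect in centered log-ratio (CLR) space so that samples stay compositional can be sketched as below. This is a generic illustration, not the GlycoForge API: the function names and parameters are hypothetical, and it assumes strictly positive abundances (zeros would need a pseudocount first).

```python
import numpy as np

def clr(x):
    """Centered log-ratio transform of strictly positive compositions (rows)."""
    logx = np.log(x)
    return logx - logx.mean(axis=-1, keepdims=True)

def inverse_clr(z):
    """Map CLR coordinates back to compositions whose rows sum to 1."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

def inject_batch_effect(abundances, mean_shift, variance_scale=1.0):
    """Apply a per-glycan mean shift and a variance scaling in CLR space."""
    z = clr(np.asarray(abundances, dtype=float))
    center = z.mean(axis=0, keepdims=True)
    z_shifted = center + variance_scale * (z - center) + np.asarray(mean_shift)
    return inverse_clr(z_shifted)

# Toy relative abundances for two samples and three glycans; not real data.
samples = np.array([[0.5, 0.3, 0.2], [0.6, 0.3, 0.1]])
shifted = inject_batch_effect(samples, mean_shift=[0.5, 0.0, -0.5], variance_scale=1.5)
```

Because the shift is applied before the inverse transform, every output row still sums to 1, which is the closure property the abstract refers to.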
bioinformatics 2026-06-02 v2
UMITIC: An unsupervised framework for the joint characterization of cellular phenotypes and spatial neighborhoods in multiplex and hyperplex immunofluorescence imaging data
Sangüesa Recalde, M.; De Andrea, C. E.; Ariz, M.
Abstract
Multiplexed imaging technologies enable the simultaneous measurement of dozens of protein markers while preserving context, providing a high-resolution view of tissue organization schemes. However, extracting meaningful insights from these high-dimensional datasets--particularly in hyperplex settings (>20 markers)--remains a major computational challenge, especially in the absence of annotated data. Here, we present UMITIC (Unsupervised Analysis of Multiplex Images via TIssue Characterization), a modular and unsupervised computational framework for the joint characterization of cell phenotypes and tissue neighborhoods from multiplex imaging data. UMITIC integrates three components: (i) CellCut, a strategy that combines nuclear and cytoplasmic predictions to improve the delineation capabilities of the framework; (ii) CellMap, a contrastive learning approach that generates low-dimensional representations of single-cell image crops that are enriched with morphological features; and (iii) TissueNet, a graph neural network that models spatial cell-cell interactions to identify tissue neighborhoods. We evaluated UMITIC across four datasets of increasing complexity to assess its robustness, scalability and biological relevance. With respect to a 7-plex human tonsil dataset, the framework identified canonical immune cell populations and reconstructed well-established anatomical regions. When applied to a 43-plex tonsil image, UMITIC preserved these tissue-level structures while enabling a finer cell subtype stratification process driven by increased marker dimensionality. We further validated our method on a 58-plex colorectal cancer cohort, where UMITIC was able to recover previously reported immune composition differences and spatial organization variations between patient groups with different prognoses. 
Finally, when an expert-annotated mass cytometry imaging dataset concerning human lung tissue was used, UMITIC achieved higher agreement with the reference tissue annotations than the existing approaches did, demonstrating improved lung microanatomy reconstruction accuracy. Together, these results show that UMITIC enables consistent and interpretable analyses of both cellular phenotypes and tissue architectures across diverse multiplex and hyperplex imaging datasets without the need for manual annotations.
bioinformatics 2026-06-02 v2
MorphOTU: image-derived morphological operational units for open-set biodiversity assessment
Zhan, Z.; Ye, M.; Orr, M. C.; Chen, W.; Liu, X.; Yue, L.; Sun, X.; Zhang, F.
Abstract
The absence of a scalable system for organizing the vast majority of unidentified species is a central obstacle in biodiversity science. Molecular methods can generate OTUs without species names but require sequencing infrastructure and often remain difficult to link to observable morphology, whereas most computer-vision methods still rely on closed-set species labels. These limitations hamper biodiversity quantification under the open, incomplete conditions that characterize real ecosystems. Here, we introduce morphOTUs, a general image-based framework that constructs operational units of biodiversity directly from phenotypes. Using morphOTU, we derive image-based OTUs across five standardized benchmark datasets spanning flowers, wood anatomy, and beetle dorsal habitus. These units closely approximate reference species-level groupings, including closely related species, retain coherent structure when most species are "unseen" during training, and accurately approximate -diversity metrics under sparse labeling or limited sampling. Furthermore, morphOTUs remain effective on a heterogeneous, long-tailed real-world insect survey dataset, demonstrating robustness beyond standardized imaging conditions. Visual explanations reveal that morphOTU consistently focuses on biologically meaningful traits and captures continuous phenotypic variation. By providing a scalable and open-set framework for quantifying phenotypic diversity, morphOTUs enable biodiversity assessment that includes unnamed species and unlock the ecological value of rapidly expanding digital image repositories.
bioinformatics 2026-06-02 v2
ATI_Box: A Simple tool for convolutional neural network-based image semantic segmentation
Przygodzki, T.
Abstract
Quantitative analysis of microscopic images has become standard in basic biological and biomedical research. Deep machine learning has provided a powerful tool that facilitates this process. However, practical adoption of deep machine learning for image analysis may be difficult for a researcher who lacks basic coding skills. This is caused by the limited number of no-code solutions, specifically in the domain of convolutional neural networks (CNNs). This scarcity may be explained by the following paradox. Training CNNs is a relatively complex process. Researchers who are familiar with this process are also skilled enough to code the full pipeline of CNN implementation, from annotation through model training and evaluation to its use in laboratory practice. Any alternative solution acceptable to a broader group of researchers who are unfamiliar with CNN concepts must inevitably simplify the entire process, specifically the training step. Such simplification may in turn limit the tool's ability to solve specific problems. The author believes, however, that a compromise between complexity and simplicity can be found that is sufficient to solve some basic problems in basic biological and biomedical research. To address this challenge, the author proposes ATI_Box (Annotation, Training, Inference in One Box), a unified, user-oriented platform for end-to-end image semantic segmentation. The system integrates data annotation, storage, model training, evaluation, and quantitative analysis into a single workflow, significantly simplifying the model development process. Image and annotation data are managed through an S3-compatible object storage system (MinIO), enabling scalable and transparent data handling. The annotation process is implemented through Label Studio. Model training is based on the U-Net convolutional neural network architecture with ResNet as the encoder.
Model evaluation is performed on a ground-truth dataset held out during training and provides pixel-level and object-level evaluation metrics. Batch analysis mode enables automated quantification of model predictions such as object counts and coverage areas. The usability of the platform is demonstrated on examples from laboratory practice. The platform intentionally lacks model-tuning capabilities, as it is intended for users unfamiliar with advanced machine learning concepts. At the same time, access to basic training features, such as setting the number of epochs or saving and deploying trained model versions, enables users to perform some basic analytical experiments. As such, the platform may serve not only as an analytical tool but also as an educational resource for explaining the practical basics of the semantic segmentation process.
bioinformatics 2026-06-02 v1
PepForge: Hierarchical HELM-Based Peptide Generation
Wang, Q.; Suessmuth, R. D.
Abstract
Peptides carrying special connections such as macrocyclizations and various other structural modifications constitute a major class among peptide therapeutics, yet their chemical space remains largely inaccessible to computational generation methods. Here we present PepForge, a deep learning platform for peptide generation that exploits Hierarchical Editing Language for Macromolecules (HELM) notation to access the chemical space of modified peptides, through a Layout-Content-Connection (LCC) cascade decomposing the generation task into block layout, monomer content, and special connection prediction. The LCC cascade is trained on 383,817 HELM peptides covering 425 monomers and nine connection types. Beyond de novo generation, the LCC cascade supports masked infilling for targeted scaffold modification and multi-level constrained generation. Both the monomer library and the connection-type set support user-defined extensions for exploring a broader chemical space. The prediction module is decoupled from generation and accepts arbitrary scoring heads for downstream tasks. As a demonstration, we built an antimicrobial potency ensemble predictor trained on 11,026 peptides with minimum inhibitory concentration (MIC) values, alongside the external PeptiVerse predictor. Applied at scale, we generated 4.78 million novel HELM peptides and obtained 799 structurally novel hit antimicrobial peptide (AMP) candidates after potency and safety filtering. All code, pre-trained models, and a web interface for interactive use are publicly available at https://github.com/wqx1999/PepForge.
bioinformatics 2026-06-02 v1
RAD: A Read-structure Agnostic Demultiplexer for Single-Cell Long-Read Sequencing and Analysis
Vaidya, C. M.; Carpenter, M. C.; Abdullah, L.; Kolling, F. W.; Huang, Y. H.; Song, L.; Ackerman, M. E.
Abstract
Single-cell long-read sequencing (LRS) techniques enable the analysis of full transcript sequences within a cell. However, the high error rate inherent to LRS introduces computational challenges for parsing information such as the cell barcode, and custom workflows are often required to handle complex read layouts, such as split combinatorial barcodes. We introduce an error-robust, read-structure agnostic demultiplexer (RAD). In RAD, users can easily specify the read structure, such as the adapter sequence and the relative position of the barcode, and can rapidly extract these elements from each read. In addition to finding the barcode, RAD implements efficient barcode correction strategies for cases in which the full barcode whitelist is known or unknown, or in which paired short-read single-cell sequencing data provide a short whitelist. In synthetic and real-world benchmarks, RAD is faster and achieves significantly higher sensitivity than existing pipelines while having comparable precision. We show that RAD can be applied to high-definition long-read spatial transcriptomic data and demonstrate single-cell and spatial analysis of B cell isotype and secretion states.
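Whitelist-based barcode correction, in its simplest generic form, can be sketched as below. This is not RAD's algorithm or interface: the function names are hypothetical, and Hamming distance is a deliberate simplification, since long-read errors include insertions and deletions that an edit-distance comparison would handle better.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

def correct_barcode(observed, whitelist, max_distance=1):
    """Return the unique whitelist barcode within max_distance, else None."""
    if observed in whitelist:
        return observed
    hits = [
        bc for bc in whitelist
        if len(bc) == len(observed) and hamming(observed, bc) <= max_distance
    ]
    # Ambiguous matches (more than one candidate) are rejected, not guessed.
    return hits[0] if len(hits) == 1 else None

# Toy whitelist; not real barcodes.
whitelist = {"ACGT", "TTTT"}
fixed = correct_barcode("ACGA", whitelist)
```

Rejecting ambiguous matches trades sensitivity for precision; a real tool may rank candidates differently.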
bioinformatics 2026-06-02 v1
Bridging Ancestry Gaps in Genomic Risk Prediction with Tabular Foundation Models
Das, A.; Cui, Y.
Abstract
Motivation: Models deployed for genomic prediction of diseases perform unevenly across populations, limiting clinical utility. Two factors drive this limitation: large imbalances in sample availability across ancestry groups and non-stationarity of genotype-phenotype effect sizes across the ancestry continuum. While tabular foundation models with in-context learning (ICL) have shown strong sample efficiency in other domains, their effectiveness for genotype-to-phenotype prediction and their robustness to ancestry-driven effect heterogeneity remain unclear. Results: Using large, ancestrally diverse biobank data, we show that ICL-capable tabular foundation models reduce performance degradation in under-sampled ancestry groups compared to conventional supervised approaches. However, we find that prevailing models trained on existing synthetic tabular tasks fail when allele effect sizes vary across ancestry space. Treating genetic ancestry as a continuous variable, we introduce an instruction-tuning framework that exposes models to synthetic tasks with ancestry-dependent non-stationary effects. Instruction-tuned models achieve improved and more stable predictive performance across the genetic ancestry continuum, including for individuals distant from in-context exemplars in ancestry space.
bioinformatics 2026-06-02 v1
Hierarchical refinements of cis-regulatory inputs improve scalable gene expression prediction
Zhang, Q.; Xing, M.; Liao, Q.; Li, Z.; Huang, D.-S.
Abstract
Deciphering the relationships between cis-regulatory elements (CREs) and target gene expression has long been a challenging problem in molecular biology. However, predicting gene expression from hundreds of candidate cis-regulatory elements (cCREs) requires models that scale to long, noisy inputs while retaining interpretable regulatory structure. Existing Transformer-based approaches typically attend over all nucleotides and all surrounding cCREs, diluting causal signals when hundreds of elements compete for limited model capacity. Here we introduce a two-stage selective framework (TSSF) that performs hierarchical refinements: nucleotide-level masking within each cCRE, followed by cCRE-level selection around each gene, implemented with information-bottleneck priors and a fully Transformer-based architecture. Across 70 human cell types and tissues, TSSF and lightweight variants improve expression prediction and enhancer-gene prioritization relative to strong baselines, including on cross-cell-line and cell-type-specific benchmarks. Prediction-stratified analysis motivates a distance-decay prior that aligns attention with long-range regulatory geometry, and chromatin-contact augmentation improves recovery of distal links. Motif analyses of high-confidence predictions recover proximal and distal regulatory programs, supporting mechanistic interpretability. TSSF offers a general strategy for scalable, interpretable modeling of high-dimensional regulatory inputs in genomics.
bioinformatics 2026-06-02 v1
Deciphering functional dark matter: Machine and deep learning-based processing of protein embeddings enables targeted function discoveries
Wiegand, S.; Kaster, A.-K.
Abstract
The ever-expanding catalogue of uncharacterized proteins, the so-called functional dark matter, poses a major challenge for biotechnological and biomedical exploitation. Functional assessment of most proteins is hindered by the technical limitations of annotation transfer and by the propagation of erroneous annotations in databases. The common denominator here is the reliance on sequence similarities. However, these become inaccurate below certain thresholds, and functions can diverge even at sequence identities around 70%. To address this challenge, we implemented a strategy using embeddings generated by protein language models for targeted function discovery (PE-TFD). Datasets of proteins representing target as well as non-target functions were used to train supervised learning models. The resulting ensemble models yielded interpretable prediction scores, enabling the exploration of databases without relying on multiple sequence alignments or structural information. As a proof of concept, we tested PE-TFD for the discovery of novel hydrogenases, identifying 773 [NiFe] and 1,929 [FeFe] hydrogenases that were not detected by established sequence- or profile-based approaches. Structural analyses supported their non-random nature and further revealed a significant number of enzymes lacking prior functional annotation. Our framework therefore enables interpretable function discovery in large-scale datasets and the exploitation of functional dark matter.
bioinformatics 2026-06-02 v1
An integrated resource for systems-level analysis of aging hallmarks and associated genes
Tiwari, R.; Balaji, M.; Chivukula, N.; Sil, P.; Samal, A.
Abstract
Aging is a complex biological process involving progressive cellular dysfunction, tissue decline, and increased susceptibility to multiple chronic diseases. A systemic view of aging through its established hallmarks provides a structured framework to understand this complexity and drive therapeutic discovery. Towards this, we present AgingHallmarksDB, an interactive web platform that enables systems-level analysis of hallmark-associated gene sets. Aging-related genes were first curated from seven established resources, and those present in at least 2 of these resources were considered as consensus aging-related genes. Using functional annotations derived from GO, KEGG, and Reactome, a total of 3111 genes were mapped to the 11 aging hallmarks, of which 2593 were supported by additional experimental or manually curated evidence, with 1089 of these forming the consensus set. Further, AgingHallmarksDB supplements gene annotations with tissue or cell type class specificity, exosomal profiles, and regulatory interactions. The platform allows users to interactively perform systems-level hallmark enrichment analysis across multiple condition-associated gene sets, while seamlessly integrating functional annotations and complex regulatory interactions to elucidate mechanistic hallmark-gene associations. The utility of the resource was explored through hallmark enrichment and network proximity analysis of gene sets corresponding to 11 chronic age-related diseases and PM2.5-associated skin transcriptome to explore relationships between aging hallmarks and disease mechanisms or environmental aging-related signatures. Overall, AgingHallmarksDB will support longevity research by enabling aging hallmark centered analysis, and the resource is accessible at https://cb.imsc.res.in/aginghallmarksdb/.
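The consensus rule described above (a gene counts as consensus if it appears in at least 2 of the source resources) can be illustrated with a short sketch. The resource names and gene identifiers below are placeholders, not the curated data behind AgingHallmarksDB.

```python
from collections import Counter

def consensus_genes(resources, min_support=2):
    """Return genes listed in at least `min_support` of the given resources."""
    counts = Counter(
        gene
        for genes in resources.values()
        for gene in set(genes)  # count each gene once per resource
    )
    return {gene for gene, n in counts.items() if n >= min_support}

# Placeholder resources and gene names, for illustration only.
toy_resources = {
    "resource_1": ["geneA", "geneB", "geneB"],
    "resource_2": ["geneB", "geneC"],
    "resource_3": ["geneC", "geneD"],
}
consensus = consensus_genes(toy_resources)
```

Deduplicating within each resource before counting keeps a gene repeated inside one source from passing the threshold on its own.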
bioinformatics 2026-06-02 v1
SNV and indel error modeling of deep targeted cell-free DNA sequencing data for sensitive detection of circulating tumor DNA in colorectal cancer
Diekema, M. H.; Rasmussen, M. H.; Drue, S. O.; Frydendahl, A.; Andersen, C. L.; Pedersen, J.
Abstract
Circulating tumor DNA (ctDNA) is a promising biomarker for cancer detection, but low tumor burden makes it difficult to distinguish true signal from background noise. To aggregate and better evaluate weak mutational signals, we propose PyDREAMS, which incorporates both single-nucleotide variants (SNVs) and insertions and deletions (indels) for ctDNA detection and quantification. To distinguish signal from noise, a neural network background error model is learned from healthy controls. It captures the joint effects of cell-free DNA (cfDNA)-specific lesions and sequencing errors, accounting for both genomic context and read-level features. Finally, a statistical test is used to evaluate the presence of mutational signals. We evaluate the method in a tumor-informed setting, using cohorts of colorectal cancer samples with deep targeted plasma cfDNA sequencing across 12 cancer driver genes. We trained PyDREAMS on 46 healthy controls, with feature analysis revealing that both SNV and indel error rates were lowest at mononucleosomal fragment lengths, suggesting that nucleosomes protect cfDNA and reduce lesion accumulation during circulation and sample handling. In the validation cohort, combining SNVs with indels improved detection, with indels contributing approximately 1.5-fold more evidence per mutation than SNVs. On a test cohort of 209 stage I to III colorectal cancer (CRC) patients and 24 healthy controls, PyDREAMS outperformed a Shearwater-based caller, with an area under the receiver operating characteristic curve (AUC) of 0.917 compared with 0.909. In stage III post-operative (Post-OP) samples (n = 26), where ctDNA was expected only in non-cured patients, PyDREAMS detected ctDNA in 5 patients, including 3 of 9 with later recurrence, while Shearwater detected none. Together, these results show that PyDREAMS improves evaluation of ultra-low-frequency tumor signals through unified read-level modelling of SNV and indel background error.
bioinformatics 2026-06-02 v1
Combining transcriptomic resolutions and machine learning strategies uncovers new OXPHOS genes in Caenorhabditis elegans
Zeballos - Goron, S.; Salinas, G.; Pazos Obregon, F.
Abstract
Assigning functions to genes remains a major challenge in biology, as a large fraction of genes remain unannotated despite the availability of complete genomes. Oxidative phosphorylation (OXPHOS), the primary source of ATP in eukaryotes, exemplifies this gap: although it has been extensively studied in mammals, our understanding of this process in other lineages remains limited. In general, research in other organisms has relied on the identification of sequence homologs of genes previously characterized in mammals. While this strategy has enabled the inference of certain conserved functions, it may overlook genes with key roles that lack detectable homology. This highlights the need to explore alternative approaches, such as the integration of transcriptomic data, to better understand the specific features and adaptations of this process across different evolutionary lineages. Caenorhabditis elegans provides a powerful framework to address this problem, combining conservation of mitochondrial pathways with extensive transcriptomic resources. Studying this organism also has translational relevance for parasitic helminths, where OXPHOS represents a promising therapeutic target. We hypothesized that genes involved in OXPHOS share transcriptional signatures that can be exploited for functional prediction. Using a curated set of 65 well-established OXPHOS genes, we applied two complementary machine learning strategies to identify new candidates. We trained an ensemble of supervised learning models on a time-resolved bulk RNA-seq transcriptome of C. elegans. To address uncertainty in functional annotations, we implemented a novel informed bagging strategy combined with a two-round training scheme, in which weak positives were initially excluded and subsequently incorporated based on model predictions. In parallel, we performed cluster-based functional inference using embryonic and adult single-cell RNA-seq datasets. 
Integration of both approaches produced a list of candidate genes supported by strong predictive performance on an independent evaluation set. Several candidates lack prior functional annotation. A mutant strain in ril-1, one of the highly supported predictions, showed decreased respiration rates compared to the wild-type strain. Our results highlight the value of integrating biological priors, complementary learning paradigms, and multi-resolution transcriptomic data to enable systematic gene function discovery.
bioinformatics 2026-06-02 v1
Equitable Health Intelligence: An Open Benchmark of Multi-Population Machine Learning for Omics-Based Cancer Prognosis
Sharma, T.; Chopra, A. P.; Agrawal, L.; Verma, N. K.; Starlard-Davenport, A.; Wang, J.; Hayes, D. N.; Cui, Y.
Abstract
Purpose: Machine learning (ML) models for omics-based cancer prognosis are often trained on data from predominantly European-ancestry populations, producing biased predictions for other populations and undermining equitable genomic medicine. Existing fairness benchmarks mainly focus on outcome parity rather than predictive performance parity across populations. Public benchmark resources are needed for systematically detecting and mitigating such performance disparities in multi-population cancer prognosis. Methods: We developed Equitable Health Intelligence (EHI, https://ehiportal.org), an open-source benchmark of multi-population ML for omics-based cancer prognosis. EHI contains 1,475 ML tasks across 40 cancer/pan-cancer types, 4 omics feature sets, 4 clinical endpoints, 5 event-time thresholds, and 3 data-disadvantaged population (DDP) groups relative to a majority European Ancestry population group. Deep neural network models are trained under three multi-population ML schemes (Mixture, Independent, and Transfer Learning), with Naive Transfer included as a no-adaptation control, comprising a total of 10,325 ML experiments. Results: The EHI platform provides an interactive environment with visualization and exploratory tools for users to inspect predictive performance disparities between the majority European-ancestry group and data-disadvantaged populations, evaluate the extent to which transfer learning mitigates these disparities, and examine the impact of feature engineering methods across cancer types, omics features, and clinical endpoints. Conclusion: EHI is an open, interactive, and extensible benchmark for identifying and addressing performance disparities in multi-population ML for omics-based cancer prognosis. It provides a foundation for a growing ecosystem of methods targeting ML performance disparities arising from biomedical data inequality and population-level distribution shifts, thereby advancing equitable AI in precision oncology.
bioinformatics 2026-06-02 v1
A Pan-Cancer Multi-Omic SuperLearner for Regulated Cell Death Survival Topologies
Rodrigues de Souza, E.; Almeida Cordeiro Nogueira, H.; dos Santos Lopes, V.; Medina-Acosta, E.
Abstract
Introduction: Regulated cell death (RCD) pathways profoundly influence tumor progression and immune modulation. In prior work, we constructed a comprehensive database mapping 25 forms of RCD across seven multi-omic layers encompassing 33 tumor types (CancerRCDShiny). Despite their robust ability to identify risk populations, translating these prognostic signatures into personalized clinical workflows requires a shift from generalized cohort stratification to individualized risk mapping. This necessitates mapping the complex geometric landscape of patient risk - Survival Topologies - to accurately capture the non-linear dynamics of RCD signatures. Methods: We engineered a Pan-Cancer Multi-Omic SuperLearner pipeline evaluating 33 cancer types. Phase I performed zero-leakage data harmonization and groupwise imputation to prevent cross-cohort amalgamation. Phase II utilized Elastic Net - regularized Cox (CoxNet) regression as an audit-compliant CANARY diagnostic to map mathematical proportional-hazards failures. Admissible strata enforcing a rigid 35% topological missingness barrier entered Phase III, deploying an advanced non-linear Quadripartite Base-Learner Ensemble (Random Survival Forests (RSF), Extreme Gradient Boosting (XGBoost), insulated Survival-Boruta, and Multi-Task Logistic Regression (MTLR)) - fused within an Elastic Net Multi-View Meta-Learner (MVL) - with local interpretability guaranteed via post-hoc SHAPley Additive exPlanations (TreeSHAP) and Local Interpretable Model-agnostic Explanations (LIME). Results: The CANARY diagnostic empirically proved the structural invalidity of pan-cancer geometric proportional-hazards. Advancing 96 verified matrices into the Quadripartite Machine Learning Ensemble, Phase III executed a structural algorithmic displacement: dense continuous multi-omic topologies computationally suppressed static genomic mutations and Copy Number Variations (CNVs) during multidimensional competition (85.7% vs 0.0% apex retention). 
Furthermore, the MVL stabilized global predictions against extreme biological variance, while surrogate LIME validations (R-squared < 0.10) confirmed the absolute failure of linear interpretative proxies. Extracting N-dimensional TreeSHAP interactions natively bypassed generalized risk parameters, mapping exact Survival Topologies. This dynamically exposed multi-omic synergistic (lethal peaks) and antagonistic (protective valleys) rescue trajectories invisible to additive models. We integrated this architecture into CancerRCDPredictor, a Shiny application operating as a digital tumor board. Conclusion: Deploying a Pan-Cancer Multi-Omic SuperLearner to bypass linear topological failures, this study advances beyond generalized cohort stratifications, establishing a deterministically mapped architecture for predicting RCD-related Survival Topologies. Through the CancerRCDPredictor interface, we directly translate multi-omic insights into individualized precision oncology interception.
bioinformatics 2026-06-02 v1
Ground Truth-Based Evaluation of False Discovery Rate and Statistical Power in DIA Proteomics
Yarbro, J. M.; Huang, Y.; Pagala, V.; Fu, Y.; Wang, Z.; Wu, L.; Wang, X.; High, A. A.; Byrum, S.; Peng, J.; Yuan, Z.-F.
Abstract
Data-independent acquisition (DIA) mass spectrometry enables rapid proteomic quantification, yet the reliability of statistical inference in DIA-based protein quantification remains incompletely understood. Here, we systematically evaluated missingness, false discovery rate (FDR), and statistical power, defined as true positive rate (i.e. sensitivity or recall), using technical replicates and a spike-in benchmark with known ground truth. Analysis of 18 HeLa replicates revealed persistent, abundance-dependent missingness. In the spike-in experiment with five replicates, human peptides were titrated against a stable yeast background, allowing fold changes (FCs) to be compared with expected values. Across comparisons with log2FCs ranging from 0.2 to 2.5, the nominal BH-FDR substantially underestimated the true FDR. For example, at a BH-FDR threshold of 0.05, the true FDR was ~0.2. Statistical power was ~40% for a log2FC of 0.2 and increased to nearly 100% for a log2FC of 2.5. Additional incorporation of FC thresholds improved the true FDR for large-FC comparisons, with slight loss of power, but markedly reduced sensitivity for small-FC comparisons. Together, these results indicate that nominal FDR does not necessarily reflect actual error rates in DIA proteomics and that DIA performance is influenced by protein abundance and expected fold changes. This study provides a framework for experimental design and data interpretation in DIA-based proteomic studies.
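The comparison of nominal BH-FDR with the true FDR described above can be illustrated with a minimal sketch. It assumes per-protein p-values and a ground-truth label for each protein (for example, spiked-in human versus background yeast); the function names `bh_reject` and `empirical_fdr_and_power` are hypothetical and are not taken from the study.

```python
from typing import List, Sequence, Tuple


def bh_reject(pvalues: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Benjamini-Hochberg step-up procedure: one reject flag per p-value."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest 1-based rank k with p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    reject = [False] * m
    for idx in order[:cutoff_rank]:
        reject[idx] = True
    return reject


def empirical_fdr_and_power(
    reject: Sequence[bool], is_true_change: Sequence[bool]
) -> Tuple[float, float]:
    """True FDR and power (recall) given ground-truth labels."""
    discoveries = sum(reject)
    false_pos = sum(r and not t for r, t in zip(reject, is_true_change))
    true_pos = sum(r and t for r, t in zip(reject, is_true_change))
    positives = sum(is_true_change)
    fdr = false_pos / discoveries if discoveries else 0.0
    power = true_pos / positives if positives else 0.0
    return fdr, power
```

With a known ground truth, the fraction of rejected proteins that are actually unchanged can be read off directly and set against the nominal threshold `alpha`.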
bioinformatics 2026-06-02 v1
Quantifying and Predicting the Difficulty of Multiple Sequence Alignment with AlDiScore
Bodynek, M.; Martin-Fernandez, L.; Bettisworth, B.; Haag, J.; Stamatakis, A.
Abstract
Multiple Sequence Alignment (MSA) constitutes an important and frequent operation in molecular sequence data analysis. There exist numerous tools, algorithms, and criteria to infer an MSA. This plethora of available approaches to MSA may induce an ensemble of divergent MSAs for the same underlying unaligned sequence set. Even a single MSA tool may infer distinct MSAs when varying the input parameters. Hence, when using a diversified set of MSA algorithms and parameterizations, the observed dispersion within an MSA ensemble expresses the difficulty of inferring a robust alignment. We refer to this notion as MSA difficulty. As downstream analyses heavily rely on the MSA, characterizing MSA difficulty for a given unaligned sequence set is critical. Initially, we show that measures of dispersion within diversified MSA ensembles can reliably predict MSA difficulty. We then assess the adequacy of these measures by computing the average reference-based distance between the MSAs in the MSA ensemble and its corresponding structural reference MSA and subsequently comparing this distance to the corresponding reference-free average distance over all MSA pairs in the ensemble. We find that Blackburne and Whelan's dpos alignment metric is most appropriate as its reference-free counterpart most accurately approximates the reference-based difficulty computed on BAliBASE reference data. We therefore use the average pairwise distance measured by dpos to quantify MSA difficulty on a scale from 0 (easy) to 1 (difficult) given an MSA ensemble. Next, we introduce the AlDiScore open-source tool, which uses machine learning to directly and reliably predict reference-free difficulty scores from unaligned sequence sets to completely omit expensive MSA computations. The underlying regression model relies upon a large set of features, including sampling-based measures of transitive consistency.
We trained our AlDiScore models on a diverse collection of empirical datasets from BAliBASE, TreeBASE, and published studies. Subsequently, we demonstrate that AlDiScore attains an R² of 0.89 and 0.84 on unseen AA and DNA sequence sets, respectively, extracted from the PANDIT v17 database. Finally, we show that there is no correlation between MSA difficulty and the corresponding phylogenetic difficulty of the respective MSA.
bioinformatics 2026-06-02 v1
miDGD: a multi-modal deep generative model predicts miRNA expression from bulk or single-cell mRNA expression
Zamani, F.; Rasmussen, A. M.; Schuster, V.; Diekema, M. H.; Krogh, A.; Pedersen, J. S.
Abstract
MicroRNAs (miRNAs) are important post-transcriptional regulators, yet their expression is typically unobserved in single-cell and most bulk RNA-seq datasets. We present miDGD, a deep generative decoder model that predicts miRNA abundance directly from gene expression alone. Trained on bulk and single-cell datasets from TCGA, GTEx, and human cell lines, miDGD learned a shared latent representation of matched mRNA and miRNA profiles that organized samples into biologically meaningful clusters reflecting tissue and cancer types. The model reconstructed both tissue-specific and broadly expressed miRNAs, recapitulated known miRNA-target relationships, and showed robust performance in sparse and single-cell data. miDGD outperformed miRSCAPE and recent miRNA activity inference methods, with improved cross-dataset generalization. These results establish a deep generative model as an improved framework for predicting miRNA expression when direct measurements are unavailable.
bioinformatics 2026-06-02 v1
Mechanistic Interpretability for Protein Language Models: A Validation Framework
Chon, P.; ANDREOPOULOS, W. B.
Abstract
Protein language models (PLMs) have been shown to be powerful predictors of protein structure and function, but their internal mechanisms remain poorly understood. Recent mechanistic interpretability methods have decomposed PLM representations into interpretable features, but they have not combined methods on a single biologically meaningful task. This paper tests whether an InterPLM sparse autoencoder and a ProtoMech cross-layer transcoder can discover features in ESM-2 (6 layers, 8M) that discriminate mainly between Class A β-lactamase and Class B β-lactamase, with Classes C and D used as more challenging comparisons. The main goal is to find distinct features for Class A β-lactamase that are not shared by other classes. We find that both methods find distinct features for Class A β-lactamase, but the cross-layer transcoders show that the concepts for Class A β-lactamase seem to be distributed among nodes, such as in layers 4 and 6, rather than concentrated in one node. We also showcase a validation framework to prevent overclaiming the role of a node, and we use it to show that several strong nodes fail in some stages of the framework, meaning that they cannot be the sole node that defines Class A β-lactamase.
bioinformatics 2026-06-02 v1
Decoding the Grammar of Protein-Protein Interaction Interfaces with Multimodal Representations
Cuturello, F.; Senci, S.; Di Vora, D.; Gardinazzi, Y.; Villegas Garcia, E. N.; Feltrin, A.
Abstract
Protein-protein interactions (PPI) govern essential cellular processes, making the computational identification of interacting sites a central challenge in structural biology, with important implications for protein engineering and the development of targeted therapeutics. Existing prediction algorithms include sequence-based methods, which lack structural information, or structure-based approaches, which often struggle to effectively integrate evolutionary context. Here, we present ESM3-PPISites, a supervised model for residue-level classification of PPI interfaces, leveraging the multimodal representations of the ESM3 Protein Language Model. To ensure a bias-free evaluation, we adopt a stringent redundancy filtering protocol, systematically eliminating latent homology between the training data and a curated benchmark set in both sequence and structural space. Our findings demonstrate that while the largest proprietary version of ESM3 yields the highest predictive power, targeted fine-tuning of its small open-weight counterpart significantly narrows the performance gap. Requiring only primary sequence data at inference, ESM3-PPISites achieves unprecedented accuracy, vastly outperforming current approaches. Crucially, we demonstrate the practical impact of these predictions by integrating them as spatial restraints within the HADDOCK3 docking platform. When evaluated on an independent subset of 12 complexes from the Docking Benchmark v5, our prediction-guided pipeline strongly enhances the identification of near-native binding poses over ab initio blind docking, while reducing computational runtime by an order of magnitude. This framework establishes a scalable paradigm for high-throughput structural interactomics.
bioinformatics 2026-06-02 v1
BacTaxID: A universal framework for standardized bacterial classification
Fernandez-de-Bobadilla, M. D.; Lanza, V. F.
Abstract
Bacterial strain typing is key to surveillance, outbreak investigation and microbial ecology, yet current systems remain species-specific, reference-dependent and lack a universal, interpretable metric of genomic relatedness. Here, we introduce BacTaxID, a fully configurable, whole-genome k-mer-based framework that encodes each genome as a numeric sketch and organizes strains into hierarchical clusters with user-defined similarity thresholds. BacTaxID distances are strictly proportional to Average Nucleotide Identity (ANI), providing a direct quantitative link between vectorial typing and genome-wide divergence. Applied to 2.3 million genomes from All the Bacteria across 67 genera, BacTaxID demonstrates universal concordance with species and sub-species classification systems, while capturing finer strain-level diversity than traditional reference-based approaches. In simulated surveillance and real outbreak datasets, BacTaxID reproduces SNP and cgMLST-based definitions while enabling rapid, scalable screening. Precomputed genus-level schemes and an open implementation provide a practical, genus-agnostic alternative to classical typing systems for standardized bacterial classification.
bioinformatics 2026-06-01 v4
Systems Level Analysis of Gene, Pathway and Phytochemical Associations with Psoriasis
Ray, S.; Dutta, O.; Kousoulas, K. G.; Apostolopoulos, N.; Chamcheu, J. C.; Kaur, R.
Abstract
Psoriasis is an inflammatory skin disorder driven by abnormal immune activation that promotes excessive proliferation and accelerated turnover of epidermal keratinocytes. IL-17 and TNF pathways are well established in psoriasis, but the other mechanisms that keep the disease active and link it to systemic comorbidities are not yet fully understood. A combined transcriptomic and systems biology framework was applied to map regulatory circuits in psoriatic lesions and to identify phytochemical candidates capable of multi-target modulation for topical intervention. Differential gene expression between lesional and healthy skin was analyzed, followed by functional characterization, employing Qiagen's Ingenuity Pathway Analysis (IPA) for pathway and upstream regulator inference, protein-protein interaction network, and chemical-gene interaction mapping. This integrative strategy revealed a transcriptional landscape dominated by type I/III interferon signaling, antiviral and antimicrobial responses, immunometabolic dysregulation, and transcriptional hubs centered on AP-1 and CREB1. Several previously unreported genes and upstream regulators without prior documented association with psoriasis were identified within inflammatory and cell migration-related modules, indicating unexplored regulatory layers in disease control. Network-guided chemical prioritization and direction-of-effect filtering highlighted seven phytochemicals (mahanine, atractylon, protopine, annomontine, taraxasterol, tricin, and tamarixetin) with multi-target activity across key disease axes. ADMET-based screening suggested protopine and atractylon as favorable candidates for topical delivery, while synergy modeling identified compatible phytochemical combinations, with flavonoid-alkaloid pairings among the top candidates. This multi-layered approach provides mechanistically informed phytochemicals targeting the IL-17/TNF-interferon-AP-1/CREB1-COX-2/MMP9 axis in psoriasis. 
Experimental validation in keratinocyte and organotypic skin models will be required to determine whether these compounds, individually or in combination, can effectively modulate psoriatic signaling in vivo.
bioinformatics 2026-06-01 v3
A unified transcriptome database to accelerate gene discovery in Amaryllidoideae species
Goncalves dos Santos, K. C.; Merindol, N.; Desgagne-Penix, I.
Abstract
Amaryllidoideae plants produce structurally diverse and unique alkaloids with potent anti-cholinesterase, antiviral, and antitumor activities, making this subfamily a rich source of pharmaceutical leads. Despite the absence of reference genomes for any Amaryllidoideae species, many enzyme characterization and pathway reconstruction efforts to date have been made possible through transcriptome mining, often requiring bioinformatic expertise and data preprocessing. To facilitate new studies in this subfamily, here we present AmarylOmicBase, a unified transcriptomic dataset that integrates assemblies, annotations, and expression profiles from 39 studies, covering 27 species and four hybrid cultivars across 13 genera of Amaryllidoideae. The AmarylOmicBase includes both published and de novo assemblies generated from published raw data using Trinity or IsoSeq workflows and provides standardized functional annotation and quantitative expression datasets. AmarylOmicBase provides ready-to-use datasets that support gene discovery, comparative transcriptomics, and pathway-level investigations for specialized metabolism, including Amaryllidaceae alkaloid biosynthesis. By providing ready-to-use datasets and fully reproducible analysis scripts, this resource reduces computational barriers and expands access to transcriptomic information for researchers working on non-model plant species. AmarylOmicBase provides a centralized resource for transcriptomic data that can be reused in studies of enzyme function, pathway evolution, and regulatory processes in Amaryllidoideae.
bioinformatics 2026-06-01 v2
Hierarchical latent representations reveal protein organization for functional discovery and design
Guo, Z.; Wang, Z.; Wang, S.; Chai, Y.; XU, K.; Li, M.; Li, W.; Ou, G.
Abstract
Proteins can preserve conserved functions despite extensive sequence and structural divergence, suggesting that functional organization is governed by distributed constraints not captured by conventional representations. Here we develop a hierarchical sequence-based representation framework that compresses proteins into context-dependent latent states while preserving multiscale organizational information. Using this framework, we identified previously uncharacterized ciliary proteins lacking detectable sequence and structure homology, including ADMAP1, which is required for normal sperm axonemal organization and motility in mice. Discrete latent protein states captured species-level organizational signatures correlated with major evolutionary groups and revealed expansion of intrinsically disordered regulatory environments in eukaryotes. Autoregressive sampling within this latent space further enabled design of synthetic actin-remodeling proteins that maintained robust F-actin severing activity despite extensive sequence rewiring across key functional interfaces. These findings demonstrate that distributed protein organization can be inferred directly from sequence, linking functional discovery, evolutionary analysis, and protein design within a shared representational framework.
bioinformatics 2026-06-01 v2
Evolutionary constraints improve protein large language model predictions for protein stability, binding regions and epistasis
Tzavella, K.; Olsen, C.; Vranken, W. F.
Abstract
Our understanding of protein function and evolution is largely based on the relationship between amino acid sequence and overall fold, now effectively captured by computational models. Yet predicting how mutations--shaped by epistasis--alter protein behavior, especially in dynamic or structurally ambiguous regions, remains difficult. Here we present D2D, which combines a self-supervised protein language model with protein-specific evolutionary information to predict mutational effects using little to no task-specific labeled data. D2D captures long-range epistatic interactions, accurately predicts single and higher-order mutation effects on protein thermostability and binding, without being trained on the task. When fine-tuned, D2D outperforms state-of-the-art methods on latent driver cancer mutations and co-occurring proliferation-enhancing mutations across independent experimental studies. Unlike most existing approaches, D2D avoids biases linked to solvent accessibility or to multiple sequence alignment depth and quality, making it particularly effective for disordered or surface binding regions where structure-based predictors typically falter. Overall, D2D provides a general framework for modeling mutational effects in proteins with limited experimental or structural information.
bioinformatics 2026-06-01 v2
Cross-etiology transcriptomic conservation in hepatocellular carcinoma reveals opposing proliferation and hepatocyte-loss programs validated across cohorts
Romero, R.; Toledo, C.
Abstract
Background: Hepatocellular carcinoma (HCC) arises from diverse etiologies, but the extent to which viral etiologies converge on reproducible transcriptomic state axes remains incompletely resolved. Methods: We analyzed HBV- and HCV-associated HCC discovery cohorts using Hallmark GSVA, limma-based differential modeling, and cross-cohort meta-analysis. Conserved tumor-upregulated and tumor-downregulated genes were distilled into ProlifHub and HepLoss modules, combined as HCCStateScore = ProlifHubScore - HepLossScore. Module performance was evaluated across multiple independent GEO cohorts, module-size robustness was tested across alternative top-N definitions, and TCGA-LIHC was used for continuous Cox survival modeling. An HBV-derived injury axis was constructed from an ordinal ALT/AST/HBV-DNA injury index in GSE83148 and tested in GSE121248 with adjustment for E2F/G2M activity and CIBERSORTx-inferred immune composition. Results: HBV- and HCV-associated HCC showed conserved activation of proliferation/repair programs and suppression of hepatocyte functional programs. The HCCStateScore validated across independent HCC cohorts with consistently positive tumor-non-tumor deltas and high discrimination, and module-size sensitivity analysis showed that performance was not dependent on the top-20 cutoff. In TCGA-LIHC, higher ProlifHubScore and HCCStateScore were associated with poorer overall survival in continuous Cox models, including after age/sex/stage adjustment. A compact HBV injury program remained tumor-associated after simultaneous adjustment for E2F/G2M activity and CIBERSORTx-derived immune-composition covariates, with concordant results using an extended FDR-defined injury set. Conclusions: HCC exhibits a robust cross-etiology transcriptomic state characterized by opposing proliferation and hepatocyte-loss programs. 
The module framework provides a portable bulk transcriptomic state score and supports a residual tumor-associated HBV injury component that is not fully explained by proliferation or inferred immune composition.
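The combined score defined above, HCCStateScore = ProlifHubScore - HepLossScore, can be sketched as follows. The abstract does not state how each module score is computed, so this sketch assumes the mean of per-gene z-scores across the module; the gene identifiers in the usage note and all function names are hypothetical placeholders.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence


def zscore_rows(expr: Dict[str, Sequence[float]]) -> Dict[str, List[float]]:
    """Z-score each gene across samples (an assumed normalization)."""
    out = {}
    for gene, values in expr.items():
        mu, sd = mean(values), pstdev(values)
        # Genes with zero variance carry no signal and are set to 0.
        out[gene] = [(v - mu) / sd if sd > 0 else 0.0 for v in values]
    return out


def module_score(z: Dict[str, List[float]], genes: Sequence[str]) -> List[float]:
    """Per-sample mean z-score over the module genes present in the data."""
    present = [g for g in genes if g in z]
    n_samples = len(next(iter(z.values())))
    return [mean(z[g][s] for g in present) for s in range(n_samples)]


def hcc_state_score(
    expr: Dict[str, Sequence[float]],
    prolif_genes: Sequence[str],
    hep_loss_genes: Sequence[str],
) -> List[float]:
    """HCCStateScore = ProlifHubScore - HepLossScore, one value per sample."""
    z = zscore_rows(expr)
    prolif = module_score(z, prolif_genes)
    hep = module_score(z, hep_loss_genes)
    return [p - h for p, h in zip(prolif, hep)]
```

For a toy matrix with one placeholder gene per module, `hcc_state_score({"GENE_P1": [1.0, 3.0], "GENE_H1": [3.0, 1.0]}, ["GENE_P1"], ["GENE_H1"])` gives a higher score to the second sample, where the proliferation gene is up and the hepatocyte gene is down.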
bioinformatics 2026-06-01 v2
GeneKnow: AI-powered literature synthesis for gene-context analysis
Zhang, H.; Zang, C.
Abstract
Interpreting gene function in specific biological contexts is essential for biomedical research, yet manual literature review is labor-intensive. We developed GeneKnow, a source-grounded framework that uses generative AI models within a controlled hybrid workflow to produce reliable, traceable literature synthesis supported by authentic citations. Through systematic benchmarking, we showed that GeneKnow outperforms mainstream web-interface AI tools in generating trustworthy context-specific gene function syntheses, without fabricated citations and with minimal hallucinations.
bioinformatics 2026-06-01 v1
Species- and Topic-aware Representation Learning for Antimicrobial Peptide Discovery
Padi, S.; Mondal, K.; Kaur, N.; Hoogerheide, D. P.; Heinrich, F.; Mihailescu, E.; Klauda, J. B.; Cardone, A.; Keyrouz, W.
Abstract
Antimicrobial resistance poses a major global health challenge, necessitating efficient strategies to discover potent antimicrobial peptides (AMPs). While recent generative models can produce many candidate sequences, experimentally validating all generated peptides in wet labs is impractical due to the high costs and time involved in such measurements. As a result, there is a strong demand for accurate predictions of peptide efficacy, typically measured as the minimum inhibitory concentration (MIC). We introduce STAMP, a framework for Species- and Topic-aware Representation Learning in AMP Discovery. STAMP integrates protein language model embeddings with species conditioning and topic-aware representations that capture sequence-level patterns, enabling generalizable predictions across multiple bacterial species within a single model. We evaluated STAMP on three benchmark datasets, which include two previously published datasets and a newly curated dataset derived from DBAASP, addressing duplicates and inconsistencies systematically. STAMP achieved strong predictive performance across these datasets, demonstrating a Pearson correlation coefficient (PCC) of 0.837 and an R² of 0.70, outperforming several baseline models. Importantly, we further validated our prediction model using peptides that were experimentally tested for their antimicrobial activity against E. coli and S. epidermidis, demonstrating its real-world applicability. Furthermore, residue-level importance analyses provide insights into the sequence determinants governing antimicrobial activity.
bioinformatics 2026-06-01 v1
Trustworthy ML/AI for Aging Clocks: Preventing Systematic Prediction Bias in Biological Age Estimation
Lee, H.; Ye, Z.; Yang, Y.; Pan, Y.; Maron, B.; Wang, Z.; Kochunov, P.; Thompson, P.; Hong, L. E.; MA, T.; Chen, C.; Chen, S.
Abstract
Machine learning (ML)- and artificial intelligence (AI)-based aging clocks are increasingly used to quantify physiological and molecular aging from omics and medical imaging data as distinct from chronological age. Here, we characterize a fundamental but underappreciated computational limitation of commonly used ML/AI regression models: systematic prediction bias and its propagation to downstream association estimates. We demonstrate that systematic prediction bias can distort, and in some cases reverse, biomedical conclusions drawn from aging-clock analyses. For example, it can produce spurious associations suggesting that older predicted brain age is linked to better cognitive performance, or that older epigenetic age is associated with better kidney function. To address this problem, we introduce a principled and broadly applicable ML/AI regression framework based on constrained optimization, ensuring trustworthy aging-clock estimation and biomedical inference.
bioinformatics 2026-06-01 v1
Ultra-efficient High Resolution 3D Reconstruction of Spatial Omics Data with Neural Transcriptomic Field
Gong, Y.; Yuan, X.; Gao, R.; Chen, J.; Yu, Z.
Abstract
Biological tissues are inherently three-dimensional (3D) ecosystems where spatial architecture dictates cellular function. While spatial omics technologies have revolutionized molecular profiling, they are largely restricted to isolated two-dimensional (2D) tissue sections. Existing computational methods attempting to reconstruct 3D volumes from sparse slices rely heavily on local slice-to-slice interpolation, struggling to balance high-fidelity reconstruction, noise reduction, and atlas-scale efficiency. Here, we present Neural Transcriptomic Field (NTF), a deep learning framework employing multi-resolution hash-grid encoding and implicit neural representations. Unlike interpolation-based approaches that merely bridge adjacent observations, NTF learns a global, continuous 3D representation of the tissue. By modeling the underlying latent biological patterns, NTF intrinsically decouples true molecular signals from technical artifacts, naturally enabling robust denoising and high-fidelity reconstructions. This global field paradigm shatters traditional scalability limits: NTF achieves up to a 1,000x speedup over existing methods, notably reconstructing a 100-million-cell scale 3D whole-mouse embryo atlas in under 15 minutes. Furthermore, NTF can generate super-resolved volumes from sparse input (e.g., utilizing only 10% of slices) and robustly extrapolate into unseen tissue regions. We demonstrate NTF's versatility across diverse transcriptomic and proteomic datasets, capturing complex spatiotemporal dynamics in Drosophila and mouse embryogenesis, and mapping intra-tumoral functional gradients in human breast cancer. Ultimately, NTF provides an unprecedentedly fast, scalable, and robust computational engine for constructing the next generation of comprehensive 3D tissue atlases.
bioinformatics 2026-06-01 v1
Recursive exploration of metabolic yield space
Mores, W.; Bhonsale, S.; Floros, S.; Logist, F.; Van Impe, J. F. M.
Abstract
Genome-scale metabolic network reconstructions contain extremely detailed and valuable information regarding cellular metabolism. For many applications such as finding genetic engineering targets and reduced kinetic model construction, metabolic network analysis techniques exist. Yield spaces based on the extreme rays of solution cones related to the metabolic network are frequently constructed for these types of analyses. However, for genome-scale networks, full enumeration of these extreme rays is not computationally feasible. In this work, a novel direct generation method for yield spaces is presented. This allows the application of many metabolic network analysis techniques to even the most recent genome-scale metabolic networks. Inspired by principles from multi-objective optimization algorithms, the proposed method performs highly efficient recursive exploration that is specifically adapted to the mathematical properties of yield spaces. Two case studies showcase both the efficiency of the method and its applicability for analysis of genome-scale metabolic networks.
bioinformatics 2026-06-01 v1
A Foundation Model for the Cancer Genome
Sidhom, J.-W.; Baras, A. S.; Elemento, O.; Shah, M. A.
Abstract
Cancer is a disease of the genome, in which somatic mutations and copy-number alterations determine tumour identity, clinical behaviour, and response to therapy. Consortium-scale sequencing has profiled hundreds of thousands of tumours, yet clinical interpretation still proceeds one alteration at a time against hand-curated knowledgebases, often ignoring co-occurring alterations and the genome-wide copy-number pattern. Self-supervised foundation models pretrained on unlabelled corpora have produced transferable representations in adjacent biological domains by learning joint structure across many features, yet no comparable model exists for the cancer genome. Here we present TESSERA (Tumour Embeddings via Self-Supervised Encoding and Reconstruction of Alterations), a foundation model for the cancer genome; we pretrain it on somatic single-nucleotide variants and copy-number segments through masked-token reconstruction within each modality and a contrastive objective across modalities. A single representation, produced once and reused without retraining, supports variant pathogenicity prediction, pan-cancer tumour typing, unsupervised molecular subtyping, prognostic stratification, and counterfactual treatment-effect estimation that yields predictive chemotherapy-selection biomarkers in real-world cohorts. These biomarkers are interpretable: each surfaces the co-occurring alterations underlying the prediction, exposing biology that single-gene rules miss. In metastatic colorectal cancer, where the FOLFOX-vs-FOLFIRI choice is currently guided by toxicity rather than tumour biology, the model uncovers a candidate predictive biomarker: a three-feature rule (TP53+/KRAS+/17p-) selecting patients who derive substantially greater benefit from FOLFOX than FOLFIRI.
bioinformatics 2026-06-01 v1
fourSynergy: Ensemble-based interaction calling on 4C-seq data using gradient-free optimization
Wind, S.-M.; Plagwitz, L.; Dix, J.; Heidtmann, G.; Heider, D.; Walter, C.
Abstract
Motivation: Chromatin organization plays a crucial role in gene regulation and is associated with various severe diseases such as cancer. Since chromatin changes are potentially reversible, a deeper understanding of the alterations needs to be harnessed for the development of new therapies. Circular Chromosome Conformation Capture Sequencing (4C-seq) is a sequencing technique enabling the identification of chromatin interactions between genes and regulatory elements. This work aims to develop an ensemble algorithm that utilizes synergies among available 4C-seq tools, which in turn allows superior predictive performance in interaction calling. Results: We employed existing 4C-seq algorithms using a weighted-voting approach. By optimizing the tool weights according to various predictive metrics using gradient-free optimization strategies, we demonstrate the potential of combining multiple 4C-seq analysis tools for interaction calling. Our results indicate that a weighted-voting based ensemble approach can outperform individual algorithms in various datasets. Although the optimal solutions differ across the 4C-seq datasets, we successfully identified global solutions that outperform the individual algorithms for all datasets analyzed. Availability: https://github.com/sophiewind/fourSynergy, https://github.com/sophiewind/fourSynergy_pip
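The weighted-voting idea described above can be sketched in a few lines. This is an illustration under stated assumptions, not the fourSynergy implementation: each tool is assumed to emit a 0/1 call per genomic region, F1 stands in for the "various predictive metrics", and plain random search stands in for the gradient-free optimizer, which the abstract does not name. All function names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def weighted_vote(
    calls: Sequence[Sequence[int]], weights: Sequence[float], threshold: float = 0.5
) -> List[bool]:
    """Call a region an interaction if the weighted share of tools voting 1
    reaches the threshold. calls[t][i] is tool t's 0/1 call for region i."""
    total = sum(weights)
    n_regions = len(calls[0])
    return [
        sum(w * tool[i] for w, tool in zip(weights, calls)) / total >= threshold
        for i in range(n_regions)
    ]


def f1_score(pred: Sequence[bool], truth: Sequence[int]) -> float:
    """F1 of predicted calls against 0/1 ground-truth labels."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if t and not p)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def random_search_weights(
    calls: Sequence[Sequence[int]], truth: Sequence[int], n_iter: int = 200, seed: int = 0
) -> Tuple[List[float], float]:
    """Gradient-free search: sample random weight vectors, keep the best F1.
    Equal weights are the starting point, so the result is never worse."""
    rng = random.Random(seed)
    best_w = [1.0] * len(calls)
    best_f1 = f1_score(weighted_vote(calls, best_w), truth)
    for _ in range(n_iter):
        w = [rng.random() + 1e-9 for _ in calls]  # strictly positive weights
        score = f1_score(weighted_vote(calls, w), truth)
        if score > best_f1:
            best_w, best_f1 = w, score
    return best_w, best_f1
```

Random search is used here only because it is the simplest gradient-free strategy; the F1 objective is piecewise constant in the weights, which is why gradient-based optimizers are a poor fit for this kind of problem.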
bioinformatics 2026-06-01 v1
Assessing and Optimizing Low-Frequency Somatic Mutation Detection: A Multi-Platform High-Throughput Sequencing Perspective
Feng, B. N.; Lin, Y.; Liu, L.; Lin, Q.; Lin, Y.; Liu, Y.; Li, J.; Lei, C.; Chen, C.; Yang, M.; Peng, X.; Zhou, Z.; Yan, Q.; Sun, L.; Li, Q.
Abstract
The availability of multiple commercial short-read sequencing platforms necessitates systematic cross-platform performance comparisons, particularly for challenging applications such as low-frequency somatic mutation detection. Here, a large-scale targeted sequencing dataset from five Genome in a Bottle (GIAB) human genomic DNA reference standards, HG001 to HG005, alongside Twist Bioscience cfDNA reference standards featuring 1% variant allele frequency (VAF), was generated by six platforms (NovaSeq 6000, NovaSeq X, FASTASeq 300, GenoLab M, SURFSeq 5000, and MGISEQ-T7). To build a realistic benchmark while keeping authentic sequencing backgrounds, we developed PosMix, a simulation tool that generates position-specific VAFs. To overcome the limitations of conventional variant callers (high recall with poor precision for VarScan2, higher precision with lower recall for Strelka2/Mutect2), we developed SomaticXGB, a machine learning-based caller. In this study, SURFSeq 5000 consistently exhibited the lowest error rates and achieved superior accuracy for VAFs as low as 0.5%, outperforming all other sequencing platforms. In addition, SomaticXGB attained F1 scores of approximately 0.92 on simulated datasets with VAFs ranging from 0.5% to 1.5% and 0.89 on Twist 1% standards, substantially outperforming conventional methods. This work delivers a rich multi-platform data resource, offering a standardized pipeline for performance benchmarking and a machine learning-based strategy for optimized somatic mutation detection.
bioinformatics 2026-06-01 v1
A Multi-Epitope Vaccine Design for Human Pasteurellosis using Outer Membrane β-barrel Proteins of Pasteurella multocida
Panda, A.; Kapoor, J.; Kumar, S.; Bandyopadhyay, A.
Abstract
Pasteurella multocida is a facultative anaerobic, Gram-negative coccobacillus that causes pasteurellosis in companion animals (cats and dogs), livestock, and poultry. Close contact with infected animals poses a significant zoonotic risk to humans through bite wounds, scratches, licking and transfer of bodily fluids. Current treatment relies mainly on antibiotics, and the lack of a licensed human vaccine further exacerbates the challenge. In the present study, a consensus-based computational approach was employed on the P. multocida Past 9 proteome. A total of 29 outer membrane β-barrel (OMBB) proteins, including TonB-dependent receptors, porins, autotransporters, adhesins and efflux pumps, were identified and used to design a multi-epitope vaccine (MEV) construct. B-cell and T-cell epitopes were predicted from the identified proteins. Ten epitopes each of cytotoxic T-lymphocyte (CTL) and helper T-lymphocyte (HTL), and three B-cell epitopes were selected based on their antigenicity, non-allergenicity, non-toxicity, surface accessibility, and conservation across eight P. multocida human-infecting strains. The MEV was supplemented with suitable adjuvants at the N-terminus to enhance its immunogenicity. The MEV construct, with a length of 459 amino acids, was predicted to be antigenic, non-allergenic, non-toxic and soluble upon expression. The MEV structural model was generated and subsequently validated, which indicated good structural quality. Molecular docking between MEV and human toll-like receptor 4 (TLR4) demonstrated strong binding affinity, and molecular dynamics simulation confirmed the structural stability of the MEV-TLR4 complex. Immune simulation of the MEV construct elicited a strong immune response. This study proposes a designed MEV candidate against human pasteurellosis and highlights OMBB proteins as potential immunogenic targets for vaccine development.
bioinformatics2026-06-01v1Reproducible and shareable bioinformatics pipelines from natural-language prompts
Kim, H.-M.; Jeong, H.; Mekonnen, A. M.; Kim, Y.; Oh, Y.; Lee, H.; Jung, C.; Park, J.Abstract
Large language models (LLMs) are increasingly used to generate bioinformatics pipelines and to carry out analyses from natural-language prompts. However, the resulting analyses are often difficult to reproduce across sessions, owing to the non-deterministic nature of LLM-driven conversations and heterogeneity of local execution environments, and cannot run on remote high-performance computing (HPC) servers or be shared and reused. We present Autopipe, a platform that guides any Model Context Protocol (MCP)-compatible LLM to produce, execute, and publish source-preserved, re-executable containerized pipelines. Autopipe enables users to execute bioinformatics pipelines on any on-premises remote server (supported by comprehensive setup documentation aimed at researchers without prior server-administration experience) and to visualize results through an extensible web-based viewer. The Autopipe platform comprises four components: a desktop application with an embedded MCP server for pipeline management and remote execution, an online registry for pipeline and plugin discovery, a web-based result viewer, and a CLI tool for customizing viewer plugins. Autopipe turns conversational analysis into re-executable and shareable workflows. Autopipe is freely available at https://autopipe.org/.
bioinformatics2026-06-01v1FlashDeconv reveals resolution horizons in atlas-scale spatial transcriptomics
Yang, C.; Chen, J.; Zhang, X.Abstract
Coarsening Visium HD resolution from 8 to 64 µm can flip cell-type co-localization from negative to positive (r = -0.12 → +0.80), yet many widely used compositional deconvolution workflows require coarsening or subsampling at million-bin scale. Here we introduce FlashDeconv, which combines leverage-score importance sampling with sparse spatial regularization to achieve competitive benchmark accuracy while processing 1.6 million bins in 153 seconds on commodity hardware. Systematic multi-resolution analysis of Visium HD mouse intestine reveals a tissue-specific resolution horizon (8-16 µm), the scale at which this sign inversion occurs, validated by Xenium ground truth. Below this horizon, FlashDeconv provides, to our knowledge, the first sequencing-based quantification of Tuft cell chemosensory niches (15.3-fold stem cell enrichment). In a 1.6-million-bin human colorectal cancer cohort, FlashDeconv uncovers neutrophil inflammatory microdomains co-localized with immunoregulatory dendritic cells (mRegDC) at the tumor-stroma interface, spatial niches largely missed by discrete-label summaries, with RCTD doublet mode labeling only 2.3% of hotspot bins as neutrophil singlets.
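The sign inversion under coarsening can be reproduced on toy data. The sketch below is purely illustrative and unrelated to FlashDeconv's code: the one-dimensional bin layout and the counts are invented so that two cell types alternate at fine scale but share a neighbourhood at coarse scale.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def coarsen(values, factor):
    """Sum consecutive bins in groups of `factor` (1D stand-in for 2D binning)."""
    return [sum(values[i:i + factor]) for i in range(0, len(values), factor)]


# Toy 1D tissue: the two types never share a fine bin, but both occupy
# the left half of the strip and are absent from the right half.
type_a = [1, 0, 1, 0, 1, 0, 1, 0] + [0] * 8
type_b = [0, 1, 0, 1, 0, 1, 0, 1] + [0] * 8

r_fine = pearson(type_a, type_b)
r_coarse = pearson(coarsen(type_a, 2), coarsen(type_b, 2))
print(round(r_fine, 2), round(r_coarse, 2))
```

At fine resolution the types exclude each other within bins and the correlation is negative; after merging adjacent bins they always co-occur and the correlation becomes positive.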
bioinformatics2026-05-31v4Resolving taxonomic boundaries and revealing genetic diversity in Enterococcus casseliflavus and closely related taxa
Soares de Medeiros Lima, M. M.; Prichula, J.; Sakamoto, T.Abstract
Enterococcus casseliflavus is a commensal bacterium that can occasionally cause human infections. A concern is that all strains of this species carry the vanC operon on their chromosomes, which reduces susceptibility to vancomycin. Aside from that, the classification of E. casseliflavus remains challenging because it shares up to 99% 16S rRNA sequence identity with other enterococci species. Here, we reassess the taxonomy and genomic diversity of E. casseliflavus and closely related taxa by comprehensively analyzing publicly available genomes to clarify their evolutionary relationships, ecological distributions, and clinical relevance. We retrieved 156 genomes of clinical and non-clinical isolates that showed ANI >90% with E. casseliflavus from public databases. Using ANI, core genome-based phylogeny, and pangenome reconstruction, we consistently identified three well-defined clusters corresponding to E. casseliflavus, E. entomosocium, and E. innesii. Our results indicate that a higher ANI threshold (>96.6%) is required to accurately discriminate between species within this group, as many strains showed more than 95% ANI with multiple species. Our results also support that E. flavescens and E. casseliflavus represent a single species and should be treated as synonyms. Genes related to biocide resistance, metal tolerance, and virulence were identified across the three species. E. casseliflavus exhibited the greatest diversity of resistance determinants. All species harbored the vanC operon associated with intrinsic vancomycin resistance, while two genomes additionally carried the vanD operon linked to high-level glycopeptide resistance. Overall, our findings refine species boundaries and highlight the importance of genome-based approaches for accurate classification and surveillance within the Enterococcus genus.
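A threshold-based species assignment of the kind discussed above might look like the following sketch. Only the 96.6% cut-off comes from the abstract; the function name, the labels "ambiguous" and "unassigned", and the ANI values are hypothetical.

```python
def assign_species(ani_to_refs, threshold=96.6):
    """Assign a genome to a species from its ANI (%) to reference genomes.

    ani_to_refs maps species name -> ANI of the query genome to that
    species' reference. A species is accepted only if it is the single
    reference exceeding the threshold.
    """
    hits = [species for species, ani in ani_to_refs.items() if ani > threshold]
    if len(hits) == 1:
        return hits[0]
    return "ambiguous" if hits else "unassigned"


# Invented ANI values: above 95% for two species, above 96.6% for only one.
query = {"E. casseliflavus": 97.8, "E. innesii": 95.4, "E. entomosocium": 93.1}
print(assign_species(query, threshold=95.0))  # two hits at the 95% cut-off
print(assign_species(query))                  # one hit at 96.6%
```

The toy query shows why a 95% cut-off can leave a genome matching multiple species while the stricter threshold resolves it.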
bioinformatics2026-05-31v2Cohort-HMM marker recruitment with per-OG orthology QC for phylogenomic supermatrices
Nielsen, T. N.Abstract
OrthoFinder's all-vs-all DIAMOND step systematically misses single-copy orthogroups (SC OGs) at deep taxonomic divergence: a marker recovered cleanly within a tightly defined cohort is dropped when the same marker is searched against phylum-broad metagenome-assembled genome (MAG) sets, because pairwise sequence similarity falls below DIAMOND's detection threshold even when the underlying ortholog is present. The result is biased dropout - supermatrices that retain genomes near the cohort but lose genomes from the deeper, more diverged corners of the same phylum. We describe a two-stage cohort-HMM recruitment pipeline (per-OG profile HMMs built from cohort alignments, then hmmsearch against the broader proteome set) followed by an independent per-OG gene-tree QC step that classifies each recruited hit relative to the cohort's most recent common ancestor (MRCA) descendant set, with a per-MAG paralog-rate filter applied before supermatrix concatenation. We characterize the pipeline across three taxonomic ranks. At phylum scale (Omnitrophota, 97 cohort OGs, 714 NCBI MAGs), the recruitment recovers MAGs that the OrthoFinder-only supermatrix would otherwise drop, and the QC identifies 2 deep-peripheral MAGs - divergent genomes whose per-OG tips repeatedly place outside the cohort MRCA descendant set despite being orthologs - that the per-MAG filter removes. At family scale (Pelagibacteraceae, 146 cohort OGs, 366 NCBI MAGs) and at genus scale (Actinomarina, 289 cohort OGs, 23 NCBI MAGs), the per-tip paralog-candidate rate drops to 0.0 %. The pipeline addresses two independent failure modes. Cohort paralog density breaks strict-SC OG discovery at the cohort step (the family-rank case, where every candidate marker has at least one cohort species carrying multiple copies; the relaxed cohort criterion supplies the marker set and HMM recruitment disambiguates which copy each NCBI MAG contributes). 
DIAMOND-reach attrition breaks OG assignment for the most divergent NCBI MAGs (the phylum-rank case, where pairwise similarities fall below DIAMOND's detection threshold; HMM recruitment recovers the dropouts and the per-OG QC step filters residual paralog candidates). At genus rank both modes are inactive and OrthoFinder suffices directly; HMM recruitment runs but finds no new orthologs. Code and per-case data products are released as a community resource at Zenodo (DOI 10.5281/zenodo.20422348).
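The per-MAG paralog-rate filter can be pictured with a short sketch. This is not the released pipeline: the class labels, the MAG identifiers, and the 0.10 cut-off are hypothetical, and the gene-tree classification that produces the labels is assumed to have been done upstream.

```python
def paralog_rate(hit_classes):
    """Fraction of a MAG's recruited hits labelled as paralog candidates."""
    if not hit_classes:
        return 0.0
    flagged = sum(1 for c in hit_classes if c == "paralog_candidate")
    return flagged / len(hit_classes)


def filter_mags(per_mag_hits, max_rate=0.10):
    """Keep MAGs whose paralog-candidate rate does not exceed max_rate.

    per_mag_hits maps MAG id -> list of per-OG class labels.
    max_rate is a hypothetical cut-off chosen for illustration.
    """
    return sorted(mag for mag, classes in per_mag_hits.items()
                  if paralog_rate(classes) <= max_rate)


# Invented example: mag_3 is flagged in 3 of 5 orthogroups.
per_mag_hits = {
    "mag_1": ["ortholog"] * 5,
    "mag_2": ["ortholog"] * 19 + ["paralog_candidate"],
    "mag_3": ["ortholog", "paralog_candidate", "paralog_candidate",
              "paralog_candidate", "ortholog"],
}
kept = filter_mags(per_mag_hits)
print(kept)
```

Only MAGs that pass this filter would contribute columns to the concatenated supermatrix.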
bioinformatics2026-05-31v1Rare RNA Polymerase II failure modes mark the cancer-driving genes most affected by epigenetic perturbation
Asante, Y.; Gryder, B. E.Abstract
RNA Polymerase II (Pol2) transcribes genes through a complex life cycle (initiation, pausing, elongation, co-transcriptional splicing, termination, and recycling). Chromatin immunoprecipitation of Pol2 before and after chemical perturbation has identified promoter-proximal accumulation (pausing) as a critical step in transcription genome-wide. However, the full landscape of Pol2 responses has not been well characterized. Here, we introduce a tool for comparing Pol2 Activity State Shifts (compPASS), a computational pipeline that uses data from paired ChIP-based approaches to assign genes to one of eight distinct modes by Pol2 response under different forms of perturbation. In multiple cancer types and drug contexts, we show that compPASS identifies previously undescribed Pol2 failure modes with important implications for gene regulation. By looking past pausing, compPASS exposes Pol2 failure modes (clogging, entry, gain, loss) that are rare but pinpoint the genes most relevant to cancer cell state changes in response to therapy, turning a single paired Pol2 ChIP-seq into a mechanistic map of shifting transcriptional states.
bioinformatics2026-05-31v1MINA: linear probes reveal coding-sequence family signal in frozen DNA encoders
Wijaya, A. S.; Leung, H.; Yoo, H.Abstract
Frozen DNA encoders are often used as genomic feature extractors, but downstream fine-tuning does not show what information is already linearly accessible in their unchanged embeddings. We introduce MINA (Model Interrogation of Nucleotide Architectures), a lightweight probing benchmark for testing whether frozen DNA embeddings can recover (i) a 5-way protein-family label for each gene and (ii) the 1,536-dimensional GenePT embedding of each gene's natural-language summary. We compare recoverability between canonical coding sequence and TSS-centred genomic contexts. In 3,244 human protein-coding genes from five families, frozen encoders recovered the family-annotation target most clearly from coding sequence. NT-v2 with meanD pooling reached macro-F1 0.828 / kappa 0.821, compared with 0.672 / kappa 0.702 for a CDS 4-mer baseline. Alignment to GenePT natural-language descriptions was weaker. Replacing CDS with 196,608 bp TSS-centred windows substantially reduced performance across all four encoders, indicating that the recoverable signal is primarily coding-sequence family signal rather than generic gene-function signal from arbitrary genomic context.
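For readers unfamiliar with the headline metric, macro-F1 averages the per-class F1 scores so that each family counts equally regardless of size. The sketch below is a plain implementation of that standard definition; the family labels and predictions are invented and do not come from MINA.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical three-family toy example.
y_true = ["fam_a", "fam_a", "fam_b", "fam_b", "fam_c", "fam_c"]
y_pred = ["fam_a", "fam_b", "fam_b", "fam_b", "fam_c", "fam_a"]
score = macro_f1(y_true, y_pred)
print(round(score, 3))
```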
bioinformatics2026-05-30v2Genotype and methylation interact to reconfigure transcriptional regulation in colorectal cancer
Kim, B.; Kim, H.; Kwon, M.-K.; Hannenhalli, S.; Choi, S. S.Abstract
Background: Transcriptional regulation is shaped by both genomic variants and the environment. Yet, how the regulatory effects of genomic variants are reconfigured by dynamic epigenomic changes during tumorigenesis remains incompletely understood. Methods: We investigated methylation context-dependent links between genotype and gene expression in colorectal cancer (CRC) using paired tumor and normal-adjacent tissue (NAT) from 80 patients, thereby controlling for germline genomic background. By integrating promoter-targeted bisulfite sequencing with RNA-seq, we systematically compared expression quantitative trait loci (eQTLs) and methylation quantitative trait loci (mQTLs). To capture regulatory complexity beyond simple mediation, we implemented a memo-eQTL framework that explicitly models genotype x DNA methylation (GxM) interactions. Results: We observed extensive tissue specificity in both eQTL and mQTL landscapes; tumor-specific eGenes were significantly enriched for hallmark oncogenic pathways, including WNT and MAPK signaling. Standard mediation models explained only a minority of genotype-expression relationships, whereas our explicit interaction framework revealed widespread reconfiguration of methylation-dependent genetic effects in tumors. Memo-eQTL mapping (FDR < 0.05) identified 18 NAT and 73 tumor eGenes with significant GxM interactions, and results were consistent at a more permissive threshold (FDR < 0.2). We further developed a patient-level memo-eQTL score and found that interaction-based regulatory disruption in NAT, but not in tumor, significantly correlated with clinical stage (P = 0.035). Conclusions: Genetic regulation in cancer is reorganized through context-dependent GxM interactions. Importantly, GxM signatures in NAT are specifically linked to disease progression, offering new insights into field cancerization and the clinical consequences of regulatory reprogramming in CRC.
Keywords: genetic regulation; DNA methylation; eQTL; context-dependent regulation; colorectal cancer
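A genotype x methylation interaction is commonly written as a linear model with a product term. The form below is a generic textbook sketch, not the authors' memo-eQTL specification, which the abstract does not give; all symbols are assumptions introduced here, with E_i the expression of a gene in sample i, G_i the genotype dosage, and M_i the promoter methylation level.

```latex
E_i = \beta_0 + \beta_G\, G_i + \beta_M\, M_i + \beta_{GM}\,(G_i \times M_i) + \varepsilon_i
```

Under this form, the genotype effect on expression is \beta_G + \beta_{GM} M_i, so it varies with methylation, and testing \beta_{GM} = 0 asks whether an interaction is present. A mediation model instead assumes genotype acts through methylation without such a product term.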
bioinformatics2026-05-30v1Improving viral protein clustering using both diversified protein profiles and structural information
Nugier, Q.; Bouras, G.; Galiez, C.; Petit, M.-A.; Enault, F.Abstract
Viruses are abundant, ancestral and potentially fast-evolving biological entities. As a result, their encoded proteins are diverse and identifying homologous relationships between sequences is as important for phylogeny and functional annotation as it is challenging. Traditional methods group viral proteins by sequence similarity, build HMM profiles for each protein family, and cluster further via profile comparisons. Here, we present an improved framework where HMM sensitivity is boosted by enriching reference virus HMM profiles with tens of millions of metagenomic sequences. This increases diversity within most protein families, raising the diversity index from less than 2 for 92.7% of clusters to a median value of 6. This enrichment of the profiles more than triples the number of homologies detected compared to the raw profiles. First-step clusters are then grouped more effectively using these relationships and further unified via structural predictions and comparisons. The sequence-enrichment strategy excels at linking small proteins, while structures better connect highly structured ones like tail and head proteins. Applied to 1.42 million proteins, our method yields 56,560 families, far fewer than 200,018 (sequence-based) or 135,048 (raw HMM), revealing that prior approaches vastly overestimated viral protein diversity. The strategy of enriching the diversity of sequences of interest with external sequences, combined with the complementary use of structural information, highlights deep evolutionary links, offering a more accurate picture of viral protein evolution.
bioinformatics2026-05-30v1BasalCell: A project scaffold generator for bioinformatics analysis
Okano, Y.; Ishikawa, T.; Sakurada, K.Abstract
In the current bioinformatics landscape, where R-centric and Python-centric ecosystems coexist and overlap, there is an increasing demand for the organic integration of these disparate development environments. As bioinformatics practices become ubiquitous, it is crucial to lower the technical barriers for biologists to adopt software engineering standards, such as version control, environment reproducibility, code readability, continuous integration/continuous deployment (CI/CD), and comprehensive documentation, which often present significant implementation hurdles for biologists with limited programming experience. To address these challenges, we developed BasalCell (https://github.com/yo-aka-gene/BasalCell), a project scaffolding system designed to provide a standardized, easily reproducible template that integrates these essential features by default, enabling researchers to seamlessly manage multi-language environments and automate rigorous development workflows, ultimately fostering greater transparency and reliability in biological data science.
bioinformatics2026-05-30v1SpatialDataAgent: Autonomous Spatial Omics Data Curation at Decade Scale
Ji, J.-H.; Zou, Q.; Cheng, J.; She, Z.; Hao, Y.; Liu, W.; Zhang, D.; Wang, Z.; Yu, J.-T.; Yuan, Z.Abstract
Fragmented metadata in spatial omics archives has rendered large volumes of multimodal molecular-histological data inaccessible as 'dark data'. Here, we introduce SpatialDataAgent, an agentic workflow for autonomous spatial omics data curation, combining schema-constrained evidence evaluation with a self-refining standardization agent. Applied to a decade of GEO records, SpatialDataAgent identified 769 paired H&E-spatial transcriptomics (ST) datasets, representing a 6.4-fold scale expansion over existing manually curated baselines. Within the benchmarking window, the framework achieved a 141% increase in high-confidence (Class A) paired datasets, which were automatically filtered and assembled to establish HESRT (a datalake containing 29.2 million spots/cells), establishing a blueprint for evidence-grounded autonomous curation of multimodal biomedical archives.
bioinformatics2026-05-30v1reComBat-seq: Regularized negative binomial regression for batch-effect correction in underdetermined transcriptomics datasets
Stoyanova, Z.; Malzl, D.; Menche, J.Abstract
Motivation: Batch effect correction is essential for the integration of large-scale transcriptomics datasets such as single-cell RNA-seq or multi-study bulk RNA-seq datasets, reducing technical noise that may mask biological signal. Existing correction methods either do not produce count data output, which is crucial for state-of-the-art downstream analyses such as differential expression analysis, or fail to converge in underdetermined study designs. Results: We present reComBat-seq, a method that extends the Negative Binomial regression framework of ComBat-seq by incorporating Elastic Net regularization. This approach resolves problems with rank-deficient design matrices while also preserving the integer nature of count data. Benchmarking on simulated and real datasets such as single-cell RNA-seq data demonstrates that reComBat-seq successfully removes batch effects in complex study designs while maintaining compatibility with downstream differential expression tools. Availability and Implementation: reComBat-seq source code can be found at https://github.com/menchelab/reComBat-seq. All code to reproduce the presented analyses can be found at https://github.com/menchelab/reComBatseq_Studies. Data produced in this study are available at https://doi.org/10.5281/zenodo.19736515. The single-cell RNA-seq data used can be found at https://doi.org/10.5281/zenodo.14234956.
bioinformatics2026-05-30v1Personalized reference genome-based pipeline reveals comprehensive haplotype-resolved views of cancer genomes
Sakamoto, Y.; Ochi, Y.; Kogure, Y.; Kato, S.; Sato-Otsubo, A.; Sugawa, M.; Tanaka, Y.; Tsujimura, T.; Mikami, T.; Nagae, G.; Chiba, K.; Okada, A.; Ito, Y.; Suzuki, H.; Aburatani, H.; Koga, Y.; Kato, I.; Takita, J.; Mano, H.; Ogawa, S.; Kataoka, K.; Kato, M.; Shiraishi, Y.Abstract
Cancer genome analysis relies on standard human reference genomes but detecting somatic alterations in highly repetitive or individual-specific regions remains challenging. We developed the Personalized Reference genome-based Cancer Genome Analysis Pipeline (PRCGAP), to our knowledge, the first comprehensive pipeline integrating haplotype-resolved analyses of somatic point mutations, structural variants, copy number, and DNA methylation on personalized diploid reference genomes. We applied PRCGAP to eight tumor-normal cell line pairs and three pediatric B-cell acute lymphoblastic leukemia (B-ALL) clinical samples. PRCGAP detected most variants identified by GRCh38- and T2T-CHM13-based pipelines while uncovering variants in centromeric and telomeric regions. Building on PRCGAP outputs, we identified L1 retrotransposition source sites absent from standard references and showed that the IGH::DUX4 fusion in a B-ALL sample originated from an internal DUX4 pseudogene within an internal D4Z4 repeat unit, rather than the canonical full-length DUX4 gene. PRCGAP extends comprehensive cancer genome analysis from cell lines to clinical settings.
bioinformatics2026-05-30v1Integrated Multi-Omics Analysis for the Identification of Disease-Associated Variations and Prognostic Biomarkers in Triple-Negative Breast Cancer (TNBC)
MANNEKUNTA, N.; NATRAJAN, E.Abstract
Background: Triple-negative breast cancer (TNBC) exhibits high molecular heterogeneity. While multi-omic panels capture disease complexity, translating these profiles into actionable, cost-effective prognostic tools remains analytically challenging. Objective: To mathematically distill a high-dimensional multi-omic profile into a lean, highly predictive biomarker panel. Furthermore, we aimed to construct, validate, and clinically anchor a prognostic survival nomogram. Methods: Matched TCGA-BRCA transcriptomic and epigenomic data (n=5546) were integrated utilizing MOFA2. Functional pathways were mapped via the Enrichr database against Reactome, KEGG, and WikiPathways libraries. A machine learning ensemble (LightGBM, Random Forest) optimized the discovery signature. Prognostic stability was validated via Kaplan-Meier stratification, continuous Z-score Multivariate Cox Regression, and Time-Dependent ROC modeling. The tumor microenvironment was profiled via ssGSEA, and immunotherapy checkpoint correlation was assessed. External validation was executed on a microarray cohort (GSE58812). Results: A 47-gene discovery signature was computationally optimized into a 15-gene clinical panel (Internal AUC = 0.9898). Kaplan-Meier analysis demonstrated profound prognostic separation (p < 0.0001). Multivariate Cox regression confirmed the signature's independent prognostic value (Hazard Ratio = 10.67, p < 0.001). Immune profiling revealed the signature is driven by tumor-intrinsic factors, showing no significant correlation with local checkpoint expressions like PD-L1 (p = 0.72). External validation achieved an integrated multi-covariate AUC of 0.6874. Conclusion: This optimized 15-gene signature and the associated clinical-genomic nomogram provide an accurate, independent, and generalizable framework for individualized TNBC survival prediction.
bioinformatics2026-05-29v3Effects of Structural Reward Shaping on Biophysical Properties in RL-Trained Plasmid Generators
Thiel, M.; Cunningham, A.; Barnes, C. P.Abstract
We compare the efficacy and distributional effects of supervised fine-tuning (SFT) and reinforcement learning (RL) post-training for PlasmidGPT, a foundation model for whole-plasmid generation, using Group Relative Policy Optimization (GRPO) for the RL model. Using a biologically motivated reward function encoding functional annotations, length constraints, and repeat penalties, the RL model achieves a 71.6% quality-control pass rate across 8 prompts on 4,000 sequences, compared to 4.3% for the pretrained baseline and 11.0% for SFT. A five-model reward ablation identifies the cassette arrangement bonus, which rewards correct promoter→CDS→terminator ordering, as the critical reward component. Rejection sampling baselines indicate that the gain is not recovered by sampling more heavily from the base model. Beyond directly optimized features, RL-generated sequences converge toward real plasmid distributions in 3-mer composition and minimum free energy density, neither of which is directly optimized by the reward function. Minimum free energy density independently converges to the real-plasmid regime under both SFT and RL despite these being parallel post-training paths. On a small curated hold-out set, RL improves continuation log-likelihood over the pretrained baseline on all 29 held-out sequences (mean Δ = +0.83 nats).
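The "group relative" part of GRPO refers to scoring each sampled sequence against the other samples drawn for the same prompt, rather than against a learned value function. The sketch below shows that normalization step only. It is not the authors' training code: the reward values are invented, and the use of the population standard deviation and the small epsilon are assumptions, since implementations differ on both.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    eps guards against division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Invented rewards for four plasmid sequences sampled from one prompt.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Samples scoring above their group mean receive positive advantages and are reinforced; those below are discouraged, which is how a component such as an ordering bonus can shift the policy.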
bioinformatics2026-05-29v3