Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
The RdRp Thumb-1 Pocket is a Conserved Target for Broad-Spectrum Antiviral Development
Woods, V.; Umansky, T.; Russell, S. M.; Gallay, P.; Smith, D.; Haders, D.
Abstract
Single-stranded RNA (ssRNA) viruses cause human diseases ranging from mild colds to deadly pandemics. Broad-spectrum non-nucleoside antivirals have been characterized as impossible to develop because allosteric binding sites are poorly conserved. The Thumb-1 allosteric site identified in HCV's RNA-dependent RNA polymerase (RdRp) governs an essential conformational change in the Λ1-loop required for polymerase initiation. The only approved Thumb-1 inhibitor, beclabuvir, has been shown to be inactive against a broad panel of non-HCV viruses, including poliovirus, rhinovirus, coronavirus, coxsackievirus, influenza virus, and HIV. It subsequently failed to inhibit SARS-CoV-2 despite favorable docking predictions. A conserved, homologous allosteric site on RdRp that spans multiple viral families has not been reported. Here, we demonstrate that the Thumb-1 pocket and its associated Λ1-loop are conserved across ssRNA viral families through comparative structural analysis and multiple sequence alignments. We demonstrate that beclabuvir's dependence on its indole C6 carbonyl to interact with the HCV-specific residue R503 and its C3 cyclohexyl chemistry restricts its activity to HCV. We validate the target discovery with MDL-001, which does not contain a C6 carbonyl or a C3 cycloalkyl substituent. MDL-001 directly blocks viral RNA synthesis in isolated replication complexes and selects for the canonical Thumb-1 resistance mutation P495S in HCV. MDL-001 demonstrates broad-spectrum in vitro inhibition of both HCV and SARS-CoV-2. Preclinical proof of concept and development of MDL-001 across HCV, HBV, HDV, influenza, SARS-CoV-2, and RSV are reported in a companion manuscript. These findings establish RdRp Thumb-1 as a conserved allosteric pocket and a druggable target for broad-spectrum antiviral development.
bioinformatics 2026-06-23 v5
OmniCell: Unified Foundation Modeling of Single-Cell and Spatial Transcriptomics for Cellular and Molecular Insights
Pang, J.; Qiu, P.; He, Y.; Deng, Y.; Tang, W.; Zhi, H.; Yan, J.; Li, B.; Lin, A.; Cao, L.; Teng, F.; Fang, S.; Li, S.; Deng, Z.; Zhang, Y.; Li, Y.; Li, S.; Xu, X.
Abstract
A cell's transcriptional programme is not fully defined by gene expression alone, but by the tissue context in which that programme is enacted. Single-cell RNA sequencing resolves molecular identity after dissociation, whereas spatial transcriptomics preserves tissue architecture but remains constrained by assay-specific sparsity and gene coverage. Here we present OmniCell, a tissue-contextual transcriptomic foundation model pretrained on 67 million dissociated and spatially resolved profiles. By integrating gene identity, expression magnitude and tissue context, OmniCell links transcriptional programmes to the cellular neighbourhoods and anatomical contexts in which they operate. OmniCell organised transcriptomes across molecular, cellular and tissue scales. It recovered cell-type-specific programmes and tissue-aligned gene modules, preserved robust cell-state structure across batches, species and rare populations, and improved the reconstruction of spatial cell identity, anatomical domains and cell-type composition. In human liver cancer Stereo-seq data, OmniCell resolved a tumour-margin transition zone characterised by immune infiltration, acute-phase inflammation, coagulation/complement activity and metallothionein-linked metal-ion detoxification. Contextual gene-embedding similarity analysis showed that gene relationships differed across tumour core, transition-zone and paratumour/adjacent non-malignant niches, indicating that OmniCell captures tissue-dependent gene function rather than expression similarity alone. In mouse brain development and macaque cortex, spatial virtual perturbations mapped regulatory genes onto stage- and region-specific anatomical programmes. Together, these results establish tissue context as a primary axis of transcriptomic representation and provide a framework for studying how cellular programmes acquire context-dependent biological meaning in intact tissues.
bioinformatics 2026-06-23 v3
A tailored variant filtering procedure for multi-breed and multi-species unbalanced animal SNP collections
Lazzari, B.; Milanesi, M.; Talenti, A.; Bionda, A.; Li, Y.; Jiang, L.; Lenstra, J. A.; Bardou, P.; Tosser Klopp, G.; Crepaldi, P.; Colli, L.
Abstract
Technological advancements and decreasing costs of whole-genome sequencing have generated a huge amount of resequencing data. Large-sized datasets, encompassing the molecular variation of several species and/or populations can now be assembled easily. However, these are extremely variable in terms of geographical provenance and sample sizes, with taxonomic groups varying from one single to hundreds of entries. Consequently, the application of standard filtering approaches may bias the representation of groups or gene pools. Commonly adopted variant filtering approaches relying on minor allele frequency (MAF) and linkage disequilibrium (LD) are not adequate because of remarkable differences in LD structure and frequency of allele variants within datasets representing both local and global diversity of multiple populations and species. Thus, by using the VarGoats 1000 goat genome project, we devised a novel approach which avoids the biases of the standard filtering procedures by adopting within-population subsampling, minor allele count (MAC) and marker spacing (bp-space) as filters. Starting from a quality-filtered dataset of >28M SNPs from 1372 animals, we generated a dataset of <14M markers and 750 individuals, complying with the initial requirements and facilitating further computational steps.
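The three filters named in the abstract (within-population subsampling, minor allele count, and bp-spacing) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `Snp` record, the function names, and the thresholds are hypothetical and are not taken from the VarGoats pipeline, whose actual parameters the abstract does not give.

```python
import random
from typing import NamedTuple

class Snp(NamedTuple):
    chrom: str
    pos: int        # position in base pairs
    alt_count: int  # copies of the alternate allele observed
    n_alleles: int  # total called alleles (2 per genotyped diploid animal)

def subsample_per_population(samples_by_pop, max_per_pop, seed=0):
    """Cap every population at max_per_pop animals (hypothetical cap)."""
    rng = random.Random(seed)
    capped = {}
    for pop, samples in samples_by_pop.items():
        if len(samples) <= max_per_pop:
            capped[pop] = sorted(samples)
        else:
            capped[pop] = sorted(rng.sample(list(samples), max_per_pop))
    return capped

def minor_allele_count(snp):
    """Count of the rarer allele, independent of sample size."""
    return min(snp.alt_count, snp.n_alleles - snp.alt_count)

def filter_snps(snps, min_mac, min_spacing_bp):
    """Drop SNPs below min_mac, then thin markers so that kept SNPs on
    the same chromosome are at least min_spacing_bp apart."""
    kept = []
    last_kept_pos = {}
    for snp in sorted(snps, key=lambda s: (s.chrom, s.pos)):
        if minor_allele_count(snp) < min_mac:
            continue
        prev = last_kept_pos.get(snp.chrom)
        if prev is not None and snp.pos - prev < min_spacing_bp:
            continue
        kept.append(snp)
        last_kept_pos[snp.chrom] = snp.pos
    return kept
```

Unlike a MAF threshold, a count-based threshold does not scale with the number of animals, and a fixed spacing rule does not depend on population-specific LD, which is the motivation the abstract gives for these filters.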
bioinformatics 2026-06-23 v2
HoloCell: A Generative Foundation Model for Holistic Cellular Modeling
Jiang, Q.; Li, Z.; Hu, B.; Bie, Y.; Li, K.; Li, Q.; Jin, P.; He, Y.; Deng, P.; Wang, Z.; Chen, X.; Qin, T.; Liu, H.; Jiang, R.; Yin, Q.
Abstract
Single-cell multi-omics technologies have recently advanced to enable the profiling of epigenomic, transcriptomic, and proteomic layers within individual cells, offering new opportunities to characterize cellular states as integrated biological systems. However, developing a unified framework that can seamlessly integrate diverse omics modalities and remain robust to heterogeneous modality missingness remains challenging. Existing methods are often designed for specific modalities or modality pairs, relying on dataset-specific training or paired measurements. Here we present HoloCell, to our knowledge the first generative foundation model for joint representation learning and generative modeling across all three major single-cell omics modalities, i.e., epigenomics, transcriptomics, and proteomics. HoloCell contains over 860 million parameters and is pretrained on the Human-Multi-Omics-Corpus, which comprises approximately 468 million single-cell profiles across these three omics layers, corresponding to over 425 billion tokens. HoloCell introduces a simple yet biologically motivated hierarchical tokenization strategy that encodes cis-regulatory elements, genes, and proteins as structured tokens within a shared modeling framework. We evaluated HoloCell across single-omics representation learning, paired multi-omics integration, unpaired multi-omics alignment, and cross-modal generation via iterative diffusion and remasking, demonstrating its superior performance and flexibility across diverse omics tasks. From a representation perspective, HoloCell provides a unified digital mapping of cellular states across multiple omics layers, capturing cell heterogeneity as an integrated system. From a generation perspective, its iterative diffusion and remasking framework permits flexible generation orders beyond fixed left-to-right causality, enabling in silico simulation of multi-omics information flow. 
Together, these capabilities position HoloCell as a versatile foundation model toward the emerging concept of a virtual cell, offering both systematic characterization and generative simulation of cellular systems within a unified framework.
bioinformatics 2026-06-23 v2
Structural Pockets and Interacting RNA-Associated Ligands (SPIRAL): A DSSR-enabled Meta-Analysis of RNA-Small Molecule Recognition
Lu, X.-J.; Wang, Y.
Abstract
Small molecules that target structured RNA hold therapeutic promise across a wide range of diseases, yet the structural principles governing RNA-ligand recognition remain poorly defined. We present SPIRAL (Structural Pockets and Interacting RNA-Associated Ligands), a curated database of 1,098 RNA-small molecule structures from the Protein Data Bank covering 1,137 ligand-binding events across six functional RNA categories. A customized pipeline built on DSSR (Dissecting the Spatial Structure of RNA) extracts structural interaction parameters from each complex, capturing stacking geometry, hydrogen-bond topology resolved by RNA moiety, groove engagement, and tertiary motif context. Unsupervised clustering of these fingerprints resolves six mechanistically distinct binding modes, the distribution of which is strongly governed by RNA functional class. To enable category-independent comparison of interaction quality across these diverse modes, we introduce the Composite Binding Quality Score (CBQS), a seven-metric framework that ranks riboswitches highest and regulatory RNA motifs lowest among the six categories. Across 275 affinity-characterized entries, C2'-endo sugar pucker count and total buried contact surface area emerge as the dominant predictors of binding affinity, converging with the structural features most underengaged by current regulatory RNA motif binders. SPIRAL provides a data-driven foundation for the rational design of next-generation RNA-targeted therapeutics.
bioinformatics 2026-06-23 v2
GenoME: a MoE-based generative model for individualized, multimodal prediction and perturbation of genomic profiles
Wei, J.; Xue, Y.; Chai, H.; Gao, Y. Q.
Abstract
The non-coding genome operates through a complex, multiscale regulatory system where regulated gene expressions are closely associated with cell-type-specific histone modifications, transcription factor binding and 3D conformation. Developing computational models that can integrate these patterns to predict and interpret the regulatory system remains challenging. Here, we present GenoME, a Mixture of Experts (MoE)-based generative model that uses DNA sequence and cell-type-specific ATAC-seq signals to predict a unified genomic profile encompassing epigenomics, transcriptomics, and chromatin architecture at base-pair to kilobase resolutions. GenoME enables multiscale predictions for held-out genomic regions and, critically, generalizes to predict the full regulatory landscape of unseen or individualized cell types from a single ATAC-seq input. We equip GenoME with an in silico perturbation framework that accurately forecasts the multimodal consequences of genetic perturbations and identifies functional enhancer-promoter connections, outperforming specialized models like Activity-by-Contact. These predictions can also be used to decipher the transcription factor grammar of cell-type-specific enhancers. GenoME thus provides a versatile, all-in-one platform for generative modeling, cross-cell-type generalization, and causal mechanistic investigation of the multiscale regulatory genome.
bioinformatics 2026-06-23 v2
Automated Segmentation of Prostatic Gold Fiducial Markers for MR-Only Radiotherapy Planning Using Multi-Modal Consensus Deep Learning
Stewart, A. W.; Goodwin, J.; Richardson, M.; Robinson, S. D.; O'Brien, K.; Jin, J.; Barth, M.
Abstract
Purpose: To develop and evaluate a multi-model consensus deep learning approach for automated gold fiducial marker (FM) segmentation in T1-weighted prostate MRI. Materials and Methods: In this retrospective study, T1-weighted MRI and CT-derived reference standard segmentations were collected from 127 prostate cancer patients (all male; mean age, 70 years +/- 7 [standard deviation]; age range, 50-88 years; collected between October 2020 and January 2026) who each had three implanted gold FMs. A 3D U-Net was trained on 93 subjects using four random seeds to produce an ensemble. At inference, marker-class probability maps were averaged across models and the top three connected components selected. Performance was evaluated on 34 temporally held-out subjects (9 tuning, 25 test) using marker-level sensitivity and precision with exact (Clopper-Pearson) 95% confidence intervals (CIs). A model count ablation study was performed. The pipeline was deployed for on-scanner processing on Siemens MRI systems via the OpenRecon framework and as a browser-based application using WebAssembly, executing entirely client-side. Results: The four-model consensus achieved 96% (70 of 73) sensitivity and 95% (70 of 74) precision on 25 test subjects, with 29 of 34 (85%) subjects achieving perfect marker detection. Single models had a mean sensitivity of 84% (SD, 9%), improving to 96% with four-model consensus (SD, <1%). Conclusion: Multi-model consensus deep learning substantially improved FM segmentation reliability over individual models, achieving high sensitivity and precision using only routinely acquired T1-weighted MRI.
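The abstract reports exact (Clopper-Pearson) 95% confidence intervals for marker-level sensitivity and precision. As a rough sketch of how such an interval can be computed from counts like 70 of 73, the standard-library code below inverts the binomial tail probabilities by bisection. The function names are illustrative, not the authors' code, and no interval values are quoted here because the abstract does not report them.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def _solve_decreasing(f, target, iters=100):
    """Find p in [0, 1] with f(p) = target, for f decreasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid  # f is still too large, so the root lies at larger p
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for k successes in n trials."""
    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else _solve_decreasing(
        lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    # Upper bound: the p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else _solve_decreasing(
        lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper

# Counts taken from the abstract: 70 of 73 markers detected.
sens_lo, sens_hi = clopper_pearson(70, 73)
```

The exact interval is conservative by construction, which suits small marker counts where a normal approximation near 100% would be unreliable.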
bioinformatics 2026-06-23 v1
Model-based inference of gene expression noise from single-cell RNA-sequencing data
Giersdorf, F.; Rogers, D. W.; Christensen, S.; Dutheil, J. Y.
Abstract
The heterogeneity of expression levels among genetically identical cells, termed gene expression noise, is a property of the gene expression process whose importance in the biology of organisms and their evolution is increasingly recognized. Measuring gene expression noise requires single-cell expression data, as obtained from single-cell RNA sequencing (scRNASeq). Its estimation, however, is challenging owing to (i) the presence of technical noise in addition to biological noise, and (ii) the heterogeneity of cell types in the sampled population. We propose a maximum-likelihood framework to infer biological noise from scRNASeq data, while accounting for technical noise, dropout probabilities, and distinct cell sequencing depths. Using simulations, we demonstrate parameter identifiability and show that the resulting noise estimates are uncorrelated with mean gene expression and therefore need no extra correction in downstream analyses, easing intra- and inter-genome comparisons. Using two technical replicates of scRNASeq data from the wild yeast *Saccharomyces paradoxus*, we show that expression noise can be inferred in a reproducible manner.
bioinformatics 2026-06-23 v1
CellOS: Learning a World Model of Cellular State through Joint Embedding Prediction
Zhou, Q.; Le, Y.; Qi, X.; Chang, S.; Lu, H.; Wu, Y.; Wang, H.; Ran, R.; Li, X.
Abstract
Foundation models learned from single-cell transcriptomes are central to the prospect of an AI virtual cell that can represent, query and predict cellular state. However, most current single-cell foundation models learn from a single view of gene expression and are optimized primarily through reconstruction or next-token prediction. As a result, they capture expression abundance but cannot explicitly reconcile complementary views of cellular state. Here we present CellOS, a multi-view foundation model that learns cellular representations from paired expression and perception views. CellOS integrates complementary views through a scalable three-stage training strategy that combines causal cell-sentence language modelling, function-preserving dense-to-mixture-of-experts expansion and latent-space alignment via an LLM-JEPA objective. Using this framework, we trained a 12-billion-parameter model on 390.5 million single-cell transcriptomes. Across diverse benchmarks spanning cell-state annotation, batch integration and perturbation-response prediction, CellOS consistently outperformed state-of-the-art single-cell foundation models in cell-state annotation and perturbation-response prediction while preserving robust batch integration. Together, these results suggest that predictive alignment between complementary cellular views provides a scalable path toward representation-centric cellular world models and transferable AI virtual cells.
bioinformatics 2026-06-23 v1
Systematic benchmarking of zero-shot utility and robustness in single-cell transcriptomic foundation models
Liu, T.; Feng, T.; Pan, X.; Chen, Y.; Ren, L.; Ye, X.; Sakurai, T.; Lin, H.; Zhang, Y.
Abstract
Single-cell foundation models (scFMs) have been proposed as reusable representations for transcriptomic analysis, yet their practical utility and robustness when applied without task-specific fine-tuning remain incompletely characterized. Here, we systematically evaluated single-cell transcriptomic representations in zero-shot settings across 20 methods, 6 downstream tasks and 1,607 datasets comprising nearly 21.8 million cells. We characterized model behavior along three complementary dimensions: baseline utility, structural robustness, and dataset-level drivers of performance variability. Our large-scale analysis reveals a decoupling between utility and robustness: methods ranking highly on standard benchmarks often show marked instability under shifts in dataset structure. Furthermore, no single model performs uniformly well across tasks. In several tasks, classical statistical representations based on highly variable genes remain competitive under zero-shot conditions. Together, these results define the practical boundaries of zero-shot use in scFMs and provide a large-scale benchmark and decision framework for representation selection in single-cell genomics.
bioinformatics 2026-06-23 v1
Early Tracheal and Salivary miRNAs in Extremely Preterm Infants Predict BPD-related Pulmonary Hypertension
Li, T.; Zhang, S.; Aluquin, V.; Donnelly, A.; Stephens, H.; Sharma, S.; Hicks, S. D.; Liu, D.; Austin, E.; Siddaiah, R.
Abstract
Pulmonary hypertension associated with bronchopulmonary dysplasia (BPD) in preterm infants (BPD-PH) carries high morbidity and mortality within the first two years of life. In a previous unbiased study, we identified a panel of miRNAs in tracheal aspirates (TA) that were differentially expressed in extremely low gestational age newborns (ELGANs) with BPD-PH compared to those with BPD but no PH. To explore the predictive potential of these miRNAs, we studied TA exosomes from 7-day-old ELGANs, analysed a curated panel of 16 miRNAs through logistic regression, and calculated the predictive AUROC to diagnose BPD-PH at 36 weeks PMA. The AUROC of TA miRNAs was 0.76, with sensitivity and specificity of 53% and 93%, respectively. Adding sex and gestational age to the variables improved the AUROC to 0.78, with sensitivity and specificity of 61% and 87%, respectively. Because TA is difficult to obtain from non-invasively ventilated infants, we collected saliva samples from ELGANs at 7 days of age, compared the log expression of these 16 miRNAs in both biofluids, and found a significant correlation in their expression (Pearson r=0.92, p<0.001). We calculated the predictive AUROC of the same miRNAs to diagnose BPD-PH at 36 weeks PMA. The AUROC of these miRNAs in saliva was 0.85, with sensitivity and specificity of 82% and 72%, respectively; adding biological sex and gestational age improved the AUROC to 0.86, with sensitivity and specificity of 79% and 76%, respectively. Leave-one-sample-out sensitivity analysis demonstrated stable training performance with reduced performance in testing samples, supporting the need for validation in larger independent cohorts. In conclusion, early salivary miRNAs have great potential for stratifying ELGANs by their risk of developing BPD-PH, while also providing the opportunity to identify target molecules and mechanisms that modulate molecular function.
bioinformatics 2026-06-23 v1
biomeStat: Using Agentic AI for Scalable Genomic Epidemiology Demonstrated Through End-to-End Analysis of 1,000 Asian Dengue Virus Genomes
Ariyaratne, D.; Somaratna, N.; Malavige, G. N.
Abstract
Genomic epidemiology workflows typically require expert curation of multiple specialized tools, extensive manual parameter tuning, and access to heterogeneous compute infrastructure. While standard generative AI models often hallucinate in complex biological domains, we introduce biomeStat: an autonomous AI agent that functions as a strict deterministic orchestrator. By automatically writing code to execute established bioinformatics tools in sandboxed environments, biomeStat dynamically provisions compute resources (CPU and GPU) and guarantees reproducibility, making it immediately useful for scientists without requiring command-line expertise. To demonstrate the platform, we performed a fully autonomous genomic epidemiology and structural analysis of 1,000 Dengue virus (DENV) genomes sampled from 16 Asian countries between 2000 and 2025. The agent seamlessly orchestrated phylogenetic reconstruction (IQ-TREE, TreeTime), Bayesian phylodynamics (BEAST2 via NVIDIA H200 GPU), selection pressure analysis (HyPhy), and structural mapping (PyMOL). The analysis was completed in under 24 hours of wall-clock time, revealing endemic stability (R_e ~1.0) and identifying 1,869 candidate immune escape sites structurally colocalized with B-cell and T-cell epitopes. Furthermore, the agent validated 176 highly conserved drug target residues across the viral replication complex, confirming that resistance-associated positions for emerging antivirals JNJ-1802 and NITD-688 remain absolutely conserved across all four serotypes. By bridging the gap between natural language intent and deterministic computational execution, biomeStat reduces weeks of expert effort into a single-session analysis with full methodological transparency.
bioinformatics 2026-06-23 v1
Comorbidity structure as an inductive bias: Comparing output-head designs for multi-label prediction of diabetes and myocardial infarction complications
Asumboya, W. A.; Agbenorhevi, P. K.; Adams, C. F.; Ayariga, D. A.; Adjadeh, T.; Adams Ziblim, S.; Kwofie, S. K.
Abstract
Background: Clinical complications are often predicted with separate sigmoid outputs, even when the target labels arise from related pathophysiological processes. This paper asks whether output-layer choice should reflect both predictive convenience and the biological structure assumed among complications. The central premise is that label-dependence mechanisms are explicit hypotheses about comorbidity, not generic modelling additions. Methods: Output-head assumptions were compared across two clinically distinct multi-label prediction tasks. In Type 2 diabetes (T2D), six heads were evaluated for nephropathy, neuropathy, and retinopathy: independent baseline, linear additive, multiplicative, symmetric conditional random field (CRF), residual multilayer perceptron (MLP), and combined additive-multiplicative. In myocardial infarction (MI), four heads were evaluated for ventricular tachycardia, ventricular fibrillation, and atrioventricular block: independent baseline, linear additive, multiplicative, and symmetric CRF. All experiments used five training data fractions and seven independent seeds, with the same shared-backbone protocol within each disease setting. Results: In T2D, the symmetric CRF gave the most consistent improvement pattern, ranking highest at full data and at the two lowest data fractions while adding only three interaction parameters. At 20% training data, it was the only interaction head whose aggregate mean exceeded the independent baseline. The residual MLP, despite 123 interaction parameters, remained below the baseline across all T2D fractions. In MI, rankings changed across fractions: the multiplicative head led at 80% and 60%, the CRF led at 100% and 20%, and the baseline led at 40%. The combined additive-multiplicative head did not improve robustness in T2D and showed the largest negative baseline-relative deviations at lower fractions. Conclusions: The findings support a biology-guided view of output-layer design. 
A small constrained mechanism was most useful when its symmetry matched the shared microvascular structure of T2D, whereas the heterogeneous electrophysiology of MI produced no stable winner. Output-layer choice should therefore be reported and defended as an assumption about disease structure instead of a routine hyperparameter decision.
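To make the idea of a symmetric CRF output head concrete, here is a minimal sketch of a pairwise binary CRF over three labels, solved by exact enumeration of the eight label configurations. It is a generic textbook formulation and an assumption on our part: the abstract does not specify the authors' parameterization. With three labels there are three unordered label pairs, which is at least consistent with the "three interaction parameters" mentioned above. The function name and inputs are hypothetical.

```python
from itertools import product
from math import exp

def crf_marginals(unary_logits, pair_weights):
    """Per-label marginals of a pairwise binary CRF.

    unary_logits: one logit per label (e.g. from a shared backbone).
    pair_weights: {(i, j): w} with i < j; each w is shared by both
                  directions of the pair, which makes the head symmetric.
    """
    n = len(unary_logits)
    weights = {}
    for y in product((0, 1), repeat=n):
        score = sum(u * yi for u, yi in zip(unary_logits, y))
        score += sum(w * y[i] * y[j] for (i, j), w in pair_weights.items())
        weights[y] = exp(score)
    z = sum(weights.values())  # partition function over all 2^n configurations
    return [sum(v for y, v in weights.items() if y[i]) / z for i in range(n)]

# With all pair weights at zero the head reduces to independent sigmoids.
independent = crf_marginals([0.2, -1.0, 0.5], {(0, 1): 0.0, (0, 2): 0.0, (1, 2): 0.0})
# A positive weight between labels 0 and 1 raises both of their marginals.
coupled = crf_marginals([0.2, -1.0, 0.5], {(0, 1): 1.5, (0, 2): 0.0, (1, 2): 0.0})
```

Exact enumeration is only practical for a handful of labels, which matches the three-complication settings described in the abstract.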
bioinformatics 2026-06-23 v1
Learning interpretable structural similarity from tandem mass spectra for small molecule analog discovery
Piedrahita Giraldo, J. S.; Da Silva, K. M.; Zare Shahneh, M. R.; Wang, M.; Laukens, K.; De Vijlder, T.; Bittremieux, W.
Abstract
Analog discovery remains a central bottleneck in mass spectrometry-based untargeted metabolomics, as conventional spectral similarity scores poorly reflect molecular structure. We introduce SIMBA, a transformer-based model that infers two interpretable graph-based distances, maximum common edge subgraph and substructure edit distance, directly from tandem mass spectra. SIMBA consistently retrieves structurally closer analogs than existing methods, enabling structure-aware small molecule identification beyond exact spectral matching.
bioinformatics 2026-06-23 v1
VCBench: A Multi-Dimensional Benchmark for Single-Cell Foundation Models
Weidener, L. S.; Brkic, M.; Jovanovic, M.; Ulgac, E.; Meduri, A.
Abstract
Single-cell foundation models are increasingly positioned as virtual cells, yet their capabilities are assessed by fragmented, largely single-task benchmarks that obscure where these models improve on simple baselines. VCBench addresses this by synthesizing four independent virtual-cell frameworks into seven capability dimensions: perturbation response prediction, cross-species universality, gene regulatory network (GRN) inference, modality integration, temporal dynamics, multi-scale integration, and in silico experimentation. Each dimension is assessed for operational testability under current architectures and datasets: five admit direct or proxy evaluation, while multi-scale integration and in silico experimentation are structurally untestable as end-to-end tasks. We evaluate five foundation models (Geneformer, scGPT, UCE, TranscriptFormer, Arc State) against pre-registered linear and nearest-neighbor baselines across the five testable dimensions, and report three findings. First, the baselines match or exceed every foundation model on four of the five scored dimensions, replicating the reported competitiveness of linear baselines on perturbation prediction and extending it to cross-species transfer, GRN inference, and temporal ordering. Second, TranscriptFormer alone exceeds the strongest baseline on cross-modal RNA-to-protein prediction (53% Pearson improvement, with a documented contamination caveat) and is the only model to reach Level 2 in the pre-registered Virtual Cell (VC) Level rubric; the architectural choice behind this advantage simultaneously causes a spectral collapse that destroys its temporal-ordering performance, a tradeoff invisible to single-task benchmarks. Third, no foundation model publishes a complete cell-level training manifest, leaving data contamination undetectable to users. 
Alongside the benchmark, VCBench releases a Contamination Reporting Schema and contributes two further methodological tools: a common-label-set protocol that controls for class-count confounds in cross-species transfer, and a spread-error correlation probe for epistemic calibration.
bioinformatics 2026-06-23 v1
Measuring peptide-MHC generalization to unseen alleles across both HLA classes
Mysore, V.
Abstract
Reported peptide-MHC (pMHC) AUROCs of 0.85-0.95 overstate generalization to unseen alleles: because immunopeptidome data are dense on a few well-studied alleles and sparse on the rest, training and test sets come to share near-identical alleles, so the numbers partly reflect interpolation rather than extrapolation to new MHC grooves. This is a property of the data, not of any one method. We assembled an open, harmonized corpus of 5.8 million experimental measurements across both HLA classes and use it to control the leakage explicitly: alleles held out at the sequence and cluster level, peptide-disjoint splits, and provenance-matched negatives. On strictly novel alleles, generalization is in the high 0.7s rather than the 0.9s a conventional split returns. Against this benchmark we trained a predictor that spans both classes in one model and factors presentation into a peptide-only ligand-likeness term and an allele-specific term; it exceeds eight published predictors by per-allele ΔAUROC = +0.22 to +0.37 (p < 10^-9), most on the least-studied genes. Corpus, benchmark, and model are released.
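The leakage controls described in the abstract (holding out alleles at the cluster level and keeping splits peptide-disjoint) can be sketched as follows. This is an illustration only: the record layout, the function name, and the precomputed `allele_cluster` mapping are assumptions, and the abstract does not say how the alleles were clustered or how negatives were matched.

```python
def leakage_controlled_split(records, allele_cluster, held_out_clusters):
    """Split (peptide, allele, label) records into train and test sets.

    allele_cluster: dict mapping each allele to a cluster id
                    (assumed to be computed beforehand).
    held_out_clusters: cluster ids reserved entirely for testing.
    """
    records = list(records)
    held_out = set(held_out_clusters)
    # Every allele in a held-out cluster goes to test, so near-identical
    # alleles can never sit on both sides of the split.
    test = [r for r in records if allele_cluster[r[1]] in held_out]
    test_peptides = {peptide for peptide, _, _ in test}
    # Peptide-disjoint: drop training rows whose peptide appears in test.
    train = [
        r for r in records
        if allele_cluster[r[1]] not in held_out and r[0] not in test_peptides
    ]
    return train, test
```

A conventional random split would skip both steps, which is the situation in which the abstract says reported AUROCs partly reflect interpolation.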
bioinformatics 2026-06-23 v1
WITHDRAWN: Generating Structurally Diverse Therapeutic Peptides with GFlowNet
Wijaya, E.
Abstract
The authors have withdrawn this manuscript because the submitter did not have the rights to agree to the distribution license at the time of submission. Therefore, the authors do not wish this work to be cited as reference for the project. If you have any questions, please contact the corresponding author.
bioinformatics 2026-06-22 v5
WITHDRAWN: Distilling Protein Language Models with Complementary Regularizers
Wijaya, E.
Abstract
The authors have withdrawn this manuscript because, at the time of submission, the submitter did not have the rights required to agree to the distribution license. Accordingly, the authors request that this work not be cited as a reference for the project. Please contact the corresponding author with any questions.
bioinformatics 2026-06-22 v4
WITHDRAWN: Agent-Guided Ranking Policy Improvement for Peptide Drug Candidate Prioritization
Wijaya, E.
Abstract
The authors have withdrawn this manuscript because the submitter did not have the rights to agree to the distribution license at the time of submission. Therefore, the authors do not wish this work to be cited as reference for the project. If you have any questions, please contact the corresponding author.
bioinformatics 2026-06-22 v2
Proteomics-constrained deconvolution reveals spatial cell-type programs in tumours
Isik, E. B.; Haley, M. J.; Anbaki, A. A.; Bere, L.; Roncaroli, F.; Piper Hanley, K.; Couper, K.; Wedge, D. C.; Sellers, R.; Baker, A.; Oliveira, P.; Ashton, J.; Bristow, R. G.; Alvarez, M. A.; Georgaka, S.; Rattray, M.
Abstract
Accurately resolving cell-type mixtures in spatial transcriptomics remains challenging, particularly in heterogeneous tumours where cell populations are intermixed and matched single-cell references may be unavailable or poorly aligned. Current deconvolution approaches either require high-quality scRNA-seq references, suffer from scalability limitations, or lack interpretability. We introduce PISTACHIO, a proteomics-informed spatial transcriptomics deconvolution framework based on constrained non-negative matrix factorization with a negative-binomial likelihood. Rather than using probabilistic priors, PISTACHIO incorporates spatial cell-type constraints derived from paired Imaging Mass Cytometry, enforcing biologically grounded sparsity and explicit spatial feasibility of cell-type presence. PISTACHIO improved recovery of spatial cell-type distributions compared with Cell2location and STdeconvolve across synthetic and real tumour datasets. Our approach remains robust under cell-type assignment errors, maintaining high correlation with ground-truth under moderate noise, and achieves fast runtime on standard hardware, enabling practical large-scale deployment.
bioinformatics 2026-06-22 v2
ATLAS: a scverse-compatible package for multi-omic single-cell trajectory inference integration
Leclercq, A.; Martini, L.; Bardini, R.; Savino, A.; Di Carlo, S.
Abstract
Single-cell trajectory inference is widely used to study cellular differentiation and fate decisions, yet most existing approaches rely on transcriptomic information alone, limiting their ability to capture the regulatory processes underlying cell-state transitions. This work presents ATLAS (Advanced Trajectory Learning from multi-omics At Single-cell resolution), a scverse-compatible framework for trajectory inference in paired single-cell RNA-seq and ATAC-seq data. ATLAS integrates transcriptomic and chromatin accessibility information through Weighted Nearest Neighbor graphs, enabling both molecular layers to jointly inform pseudotime estimation, terminal-state identification, and fate probability inference within a unified multi-omic representation. Across synthetic and real datasets, ATLAS reconstructs coherent developmental trajectories, captures progressive fate commitment, and resolves biologically meaningful lineage structures, demonstrating the effectiveness of multi-omic integration for characterizing cellular dynamics. In addition, ATLAS enables the joint exploration of transcription factor expression and target gene activity along pseudotime, providing direct access to regulatory programs and chromatin-associated transitions that are not detectable from transcriptomic data alone. Overall, ATLAS provides a scalable and biologically informative framework for studying dynamic cellular processes in single-cell multi-omics experiments.
bioinformatics2026-06-22v2WITHDRAWN: Preprint Commons: A platform for the systematic tracking of preprint trends and impact
Behera, B. P.; panda, B.Abstract
The authors have withdrawn their manuscript because it was posted without the consent of all authors. Therefore, the authors do not wish this work to be cited as a reference for the project. If you have any questions, please contact the corresponding author.
bioinformatics2026-06-22v2When Less Is Not More: DICEPro Mitigates the Impact of Incomplete Reference Matrices on Cellular Frequency Deconvolution.
BA, K.; Thiebaut, R.; Hinaut, X.; Hejblum, B. P.Abstract
Cellular deconvolution aims to estimate the frequencies of different cell populations from gene expression measurements in a biological sample. Supervised approaches, such as CIBERSORTx and DISSECT, critically depend on the reference signature matrix, which encodes the gene expression profiles of cell-types based on prior knowledge. Despite numerous deconvolution methods, the impact of missing cell populations in the reference matrix remains understudied. Here, we evaluate the robustness of state-of-the-art deconvolution approaches using simulations based on real dataset examples combined with statistical modeling, validated against published data, and multiple real benchmark datasets. Results show that deconvolution performance remains stable when the reference matrix includes most cell-types, but declines sharply as the matrix becomes incomplete, especially for abundant cell populations. To address the limitations of incomplete reference matrices, we introduce DICEPro, an optimization-based framework designed to enhance existing deconvolution methods. By systematically adjusting the reference signatures, DICEPro better accounts for missing or underrepresented cell populations, leading to improved precision and robustness. We show that DICEPro consistently boosts deconvolution performance across both simulated datasets, derived from real data examples, and multiple real biological datasets, offering a practical solution when standard methods are hindered by incomplete references.
bioinformatics2026-06-22v1Benchmarking cell type annotation in spatial transcriptomics: resolving cellular hierarchies, biological fidelity, and dynamic cell states
Zhu, Y.; Hu, Y.; Xie, M. B.; Qin, H.; Szul, Z. J.; Young, D. M.; Yuan, W.; Wang, Q.; Liu, Y. H.; Shen, W.; Meltzer, S.; Zhou, X. M.Abstract
Spatial transcriptomics enables the quantification of gene expression within its native tissue context, providing unprecedented insight into tissue architecture, cellular ecosystems, and local cell-cell interactions at regional and single-cell resolution. Accurate cell type annotation is a critical prerequisite for interpreting these data and is often the first and most essential step in downstream analysis. Despite rapid advances in computational methods, cell type annotation remains challenging and frequently requires extensive expert-driven manual curation based on marker-gene expression, spatial context, and prior biological knowledge. While early approaches relied primarily on transcriptional similarity, newer methods increasingly incorporate spatial information, histological features, and multimodal data to improve annotation accuracy. Nevertheless, reliable annotation remains difficult when biological interpretation requires fine-grained subtype resolution, particularly for platforms with limited gene panels, tissues undergoing dynamic cellular state transitions, and studies in which reference and query datasets differ substantially in biological context or technical modality. Here, we present a systematic benchmark of 20 state-of-the-art cell type annotation methods across four spatial transcriptomics datasets spanning diverse technologies, experimental conditions, cell numbers, and gene panel sizes. Importantly, all benchmark datasets contain expert-curated cell type labels, including well-resolved cell populations and subtype annotations, providing high-quality biological ground truth for evaluation. The benchmark encompasses both reference-based and reference-free methods representing a broad range of computational frameworks. 
Performance was assessed using conventional classification metrics, including accuracy and F1-based measures, together with structure-aware metrics that evaluate both cell-level annotation accuracy and preservation of higher-order biological organization. Across datasets, annotation performance varied substantially according to tissue context, reference-query similarity, and annotation granularity. Fine-grained subtype annotation and recovery of rare cell populations remained challenging for many methods, particularly in datasets capturing injury, repair, developmental, and regenerative processes characterized by continuous cellular state transitions. Notably, high classification accuracy did not necessarily correspond to preservation of global cellular relationships or biologically coherent downstream pathway and gene-set enrichment analyses. Overall, scANVI, Seurat, and TACCO consistently ranked among the top-performing methods, although their relative advantages were context dependent. Together, our results provide a comprehensive assessment of current annotation strategies for spatial transcriptomics and offer practical guidance for selecting methods that best align with specific biological questions, dataset characteristics, and analytical priorities.
bioinformatics2026-06-22v1CellTosg2Sequence: A Unified Text-Omics-Signaling-Graph Large Language Model for Single-Cell Analysis
chen, w.; Ye, M.; Xu, T.; Huang, D.; Zhang, H.; Li, H.; Li, W.; Chen, Y.; Payne, P. R.; Li, F.Abstract
In single-cell (sc)-based scientific discovery, text-formatted biomedical prior knowledge and signaling graphs are essential for annotating and interpreting numeric sc-omics data and for generating novel testable hypotheses. A major limitation of existing single-cell large language models (scLLMs) is that they rely on numeric expression data with gene names as the only textual signal, while comprehensive biomedical priors (cellular localization, gene function, disease associations, and signaling interaction patterns) remain absent from the model input. We introduce CellTosg2Sequence, a textual-prior- and signaling-graph-augmented cell-omics-sentence language model. A lightweight heterogeneous graph encoder maps a curated 62,507-node biomedical knowledge graph (KG) into compact virtual tokens that are prepended to each cell sentence, allowing the language model to condition on biological structure with minimal sequence-length overhead. We train CellTosg2Sequence with a three-stage objective: Stage I anchors the KG channel under autoregressive language-model pretraining, leveraging Qwen2.5-32B's own language reasoning for rapid KG alignment; Stage II aligns labels via supervised fine-tuning with KG-anchored InfoNCE; Stage III applies Group Relative Policy Optimization (GRPO) with an ontology-hierarchy reward, enabling free-generation cell-type prediction that generalizes beyond the closed training vocabulary. Across multiple benchmarks and ablation experiments, CellTosg2Sequence outperforms strong baselines. All results are achieved with lightweight LoRA training and a single unified checkpoint.
bioinformatics2026-06-22v1Complex-valued representations of time-series gene expression profiles for network analysis
Sun, J.; Cao, W.; Ikumi, K.; Shimizu, K. K.; Sese, J.Abstract
Time-series RNA sequencing provides a powerful framework for studying dynamic gene regulation, yet conventional analyses usually represent gene expression profiles as real-valued vectors in Euclidean space and quantify similarity using correlation or distance. Inspired by quantum information theory, we present a framework for encoding time-series gene expression profiles as complex-valued vectors comprising amplitude and phase components in Hilbert space. We designed multiple encoding models to represent gene expression in the amplitude of complex-valued vectors, encode temporal differences in the phase, and extend the phase representation to incorporate the direction of local expression changes. Gene-gene similarity was then quantified using fidelity, which measures the overlap between two encoded vectors. Evaluation using time-series RNA-seq datasets across diverse species and biological contexts showed that different encoding models produced distinct fidelity distributions that were related to, but distinct from, conventional correlation measures. We then constructed gene-gene networks using pairwise fidelity values and detected communities containing genes with similar temporal profiles. Although fidelity distributions differed across encoding models, the resulting communities captured major temporal expression programs, and functional annotations based on gene ontology and Kyoto encyclopedia of genes and genomes pathway analyses provided exploratory biological context. The detected communities were comparable to those obtained using conventional methods, including weighted correlation network analysis and fuzzy c-means clustering. Furthermore, as a proof-of-concept, we performed SWAP-test circuit simulations to mimic fidelity computation on a quantum computer; under noise-aware conditions, these simulations produced less accurate fidelity estimates with higher computational cost than classical computation. 
As a proof-of-concept, this study provides a complementary view of temporal transcriptome organization, rather than a uniformly superior alternative to conventional methods.
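As a rough illustration of the amplitude-and-phase encoding and the fidelity measure described in this abstract, the Python sketch below encodes a profile as a unit-norm complex vector and computes fidelity as the squared magnitude of the inner product. The specific encoding (square root of each time point's share of total expression for the amplitude, range-scaled change from the previous time point for the phase) and the names `encode`, `fidelity` and `phase_scale` are assumptions made for illustration; they are not taken from the authors' encoding models.

```python
import cmath
import math


def encode(profile, phase_scale=math.pi):
    """Encode a non-negative expression time series as a unit-norm complex vector.

    Assumed encoding (illustrative only):
      amplitude = sqrt(share of total expression at each time point)
      phase     = phase_scale * (change from previous time point / profile range)
    """
    total = sum(profile)
    if total <= 0:
        raise ValueError("profile must contain some positive expression")
    # Squared amplitudes sum to 1, so the encoded vector has unit norm.
    amps = [math.sqrt(x / total) for x in profile]
    # Scale differences by the profile's range; fall back to 1.0 for flat profiles.
    span = (max(profile) - min(profile)) or 1.0
    diffs = [0.0] + [(b - a) / span for a, b in zip(profile, profile[1:])]
    return [a * cmath.exp(1j * phase_scale * d) for a, d in zip(amps, diffs)]


def fidelity(u, v):
    """Squared magnitude of the inner product of two unit-norm complex vectors."""
    inner = sum(a.conjugate() * b for a, b in zip(u, v))
    return abs(inner) ** 2


rising = [1.0, 2.0, 3.0, 4.0]
falling = [4.0, 3.0, 2.0, 1.0]
print(fidelity(encode(rising), encode(rising)))   # close to 1 for identical profiles
print(fidelity(encode(rising), encode(falling)))  # lower for opposite trends
```

Under this assumed encoding, rescaling a profile by a constant leaves its encoded vector unchanged, so fidelity compares profile shape rather than absolute expression level.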
bioinformatics2026-06-22v1EventHorizon: A Foundation Model for Clinical Flow Cytometry
Medina Grespan, M.; Morrison, M.; O'Fallon, B.; Shean, R.; Spies, N. C.; Ng, D.Abstract
Flow cytometry is an essential tool for diagnosis of hematologic malignancies, but existing clinical workflows are highly dependent on expert manual interpretation. Existing machine learning approaches typically require extensive labeled data and are sensitive to variability in panel design, instrumentation, and laboratory workflows, limiting their generalizability. We present EventHorizon, a self-supervised foundation model for clinical flow cytometry that produces unified specimen-level representations from heterogeneous multi-panel data. EventHorizon employs a two-stage hierarchical transformer architecture with marker-aware tokenization, enabling seamless integration of cells measured across different antibody panels into a single shared latent space. We pre-train the model using a DINO-inspired self-distillation strategy with a variety of flow cytometry-specific augmentations on a dataset of more than 100,000 clinical specimens across 17 distinct panels. We evaluate the resulting embeddings on three clinically relevant classification tasks spanning common and rare panels, demonstrating that simple k-nearest neighbor probing of frozen EventHorizon embeddings achieves performance comparable to a fully supervised baseline model and a prior panel-specific self-supervised model. To ensure EventHorizon is not simply shortcut learning on features such as the markers/panels run for a given specimen, we perform a graph-theoretic analysis of EventHorizon's latent space which argues that specimen embeddings are organized primarily by biological diagnosis. Taken together, these results demonstrate that EventHorizon produces biologically meaningful, panel-agnostic specimen representations from clinical flow cytometry data which, with further development and validation, could provide a potential basis for scalable, reproducible diagnostic support across diverse clinical laboratory settings.
bioinformatics2026-06-22v1From hotspot dependence to distributed robustness in resistance-aware lead optimization
Wang, Y.; Xiao, B.; Kang, J.; Cui, H.; Fu, Y.; Li, W.; Perea, S. E.; Han, W.Abstract
Drug resistance remains a recurrent failure mode in targeted anticancer and antiviral therapy, and resistance evidence often enters only after compound selection. ResistAgent is an evidence-constrained framework that converts mutational liabilities into design-time objectives through site- and combo-aware resistance mapping, deterministic mechanism diagnosis and robust counter-design. In EGFR-Erlotinib and HIV-RT-Rilpivirine, the framework separated residue-level liabilities from observed HIV combination liabilities and linked prioritized mutations to anchor loss, pocket rearrangement, electrostatic shifts and contact redistribution. Same-budget paired searches showed that robust objectives changed lower-tail mutant-panel behavior and interaction-dependence profiles while prioritizing robustness over average-affinity behavior. Under predefined liability panels, selected robust-best trajectories shifted support away from mutable hotspot contacts toward more distributed interaction networks. Supplementary physical summaries and ranking-first benchmarks support the scope of this resistance-aware design strategy while preserving clear boundaries for prospective validation.
bioinformatics2026-06-22v1Reference-guided immune recovery matching prioritizes traditional Chinese medicine ingredients
Hu, C.; Xiao, B.; Chen, C. Y.-C.Abstract
Therapeutic prioritization from single-cell transcriptomes requires a target that is closer to treatment response than disease-signature reversal. In immune diseases, post-treatment recovery may follow patient- and cell-type-specific trajectories rather than a simple return along the pretreatment disease axis. We developed ImmuneNavi, a healthy-reference-anchored recovery-matching workflow for ranking traditional Chinese medicine ingredients from paired PBMC data. The workflow maps heterogeneous PBMC cohorts to a common healthy immune coordinate system, constructs patient-cell-type disease and recovery states, and processes ITCM treated-control profiles into a fixed ingredient perturbation bank. Patient and ingredient states are represented in matched gene, pathway and transcription-factor views, allowing the model to combine local transcriptional direction with more stable program-level features. A matcher trained on one paired treatment cohort preserved recovery-aligned ingredient rankings in independent PBMC cohorts without redefining the feature space, candidate set or preprocessing procedure. This provides a reusable transcriptomic pipeline for moving from paired immune-state measurements to prioritized natural-product candidates for experimental follow-up.
bioinformatics2026-06-22v1PhaseWY: A pipeline for haplotype phasing, sex chromosome identification and extraction of sex-limited sequences
Ellerstrand, S. J.; Churcher, A. M. J.; Kutschera, V. E.; Hansson, B.Abstract
Sex chromosomes are central to many ecological and evolutionary processes. Evidence has accumulated that sex chromosome systems vary extensively in age, turnover and transitions, motivating renewed efforts to study the diversity of sex chromosome systems across the tree of life. However, successful genomic detection of sex chromosomes depends on several factors, including the size and divergence time, background genetic diversity, and the number of sequenced females and males. In addition, technical challenges associated with sequencing and analysing the sex-limited Y/W chromosome remain. Here, we present PhaseWY, an automated Snakemake pipeline that uses whole-genome sequencing data from multiple female and male individuals to identify sex-chromosomal regions and extract the corresponding Y/W sequences. PhaseWY (i) detects sex differences in alignment depth, (ii) applies read-based and statistical haplotype phasing, (iii) identifies sex-linked regions using haplotype clustering, and (iv) subsets autosomal, X/Z- and Y/W-linked variants for downstream analyses. We applied PhaseWY to simulated data to benchmark factors influencing sex-linkage detection and successful extraction of Y/W-linked variants. To demonstrate its practical utility, we further applied PhaseWY to the neo-sex chromosome system in Alauda larks (Alaudidae) and performed a range of downstream analyses demonstrating the scope of applications of the PhaseWY output. We conclude that PhaseWY provides an easy-to-use and reproducible tool for population-genomic analyses in non-model organisms, with particular importance for advancing our understanding of sex-chromosome evolution.
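Step (i) above, detecting sex differences in alignment depth, can be illustrated with a small Python sketch. It assumes per-window depths that have already been averaged within each sex and normalized so that autosomal windows sit near 1.0. The function name `classify_window`, the thresholds `tol` and `min_depth`, and the category labels are hypothetical and are not part of PhaseWY, which additionally relies on haplotype phasing and clustering.

```python
def classify_window(female, male, heterogametic="male", tol=0.2, min_depth=0.1):
    """Label one genomic window from normalized mean depth in each sex.

    Expected patterns (assuming autosomal depth is normalized to ~1.0):
      autosomal: heterogametic / homogametic depth ratio near 1.0
      X/Z-like:  ratio near 0.5 (one copy in the heterogametic sex)
      Y/W-like:  covered only in the heterogametic sex
    """
    if heterogametic == "male":      # XY system
        het, hom = male, female
    else:                            # ZW system
        het, hom = female, male
    if hom < min_depth:
        return "Y/W-like" if het >= min_depth else "no coverage"
    ratio = het / hom
    if abs(ratio - 1.0) <= tol:
        return "autosomal-like"
    if abs(ratio - 0.5) <= tol:
        return "X/Z-like"
    return "unclear"


# Hypothetical windows: (female depth, male depth)
for f_depth, m_depth in [(1.0, 1.02), (1.0, 0.5), (0.0, 0.45)]:
    print(f_depth, m_depth, classify_window(f_depth, m_depth))
```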
bioinformatics2026-06-22v1πDIA-CLIP: efficient identification of highly heterogeneous proteomics data via a generalized zero-shot framework
Liao, Y.; Li, Y.; Xiao, Z.; Miao, C.; Yi, T.; Zhao, X.; Zhang, Y.; Wen, H.; E, W.; Chang, C.; Zhang, W.Abstract
Data-independent acquisition mass spectrometry has increasingly emerged as a cornerstone for characterizing highly heterogeneous biological systems, such as single-cell proteomics, metaproteomics, and spatial proteomics, offering unparalleled identification depth and quantification reproducibility. Current DIA analysis frameworks, however, require semi-supervised training within each run for peptide-spectrum match (PSM) re-scoring, which is prone to overfitting and lacks generalizability across diverse species and experimental conditions. Here, we present πDIA-CLIP, a generalized framework shifting the DIA analysis strategy from semi-supervised training to zero-shot cross-modal representation learning through integrating dual-encoder contrastive learning and encoder-decoder architectures to establish a unified, high-precision representation for spectral features and peptides. Notably, the generalized zero-shot nature of πDIA-CLIP facilitates an inference-only architecture, streamlining the analysis to achieve exceptional computational efficiency. Extensive evaluations across five distinct benchmarks demonstrate that πDIA-CLIP consistently outperforms existing tools, yielding up to a 44.6% increase in protein identifications alongside a reduction in entrapment identifications of up to 52.5%. Furthermore, the enhanced identification depth facilitates the discovery of novel biomarkers and the elucidation of intricate cellular mechanisms.
bioinformatics2026-06-21v4SIEVEseq: One-stop differential expression, variability, and skewness analyses using RNA-Seq data
Li, H.; Khang, T. F.Abstract
RNA-Seq data analysis is commonly biased towards detecting differentially expressed genes and insufficiently conveys the complexity of gene expression changes between biological conditions. This bias arises because discrete count models cannot fully and independently parameterize the mean, variance, and skewness of gene expression distributions. Therefore, a unified statistical framework that simultaneously tests differential expression, variability, and skewness is needed. We present SIEVEseq, a statistical methodology that provides such a framework. SIEVEseq embraces a compositional data analysis strategy to transform discrete RNA-Seq counts into continuous form with a distribution well-fitted by the skew-normal distribution. Both parametric and nonparametric simulations show that SIEVEseq better controls the false discovery rate and Type II error than existing differential expression methods. Analysis of the Mayo RNA-Seq dataset for Alzheimer's disease demonstrates that gene sets with significant differences in mean, variance, and skewness between control and disease groups strongly predict disease state. Furthermore, functional enrichment analysis indicates that relying solely on differentially expressed genes identifies only part of the biological spectrum, whereas incorporating genes with differential variability and skewness reveals additional disease-related aspects. Cross-data and cross-methodology validation suggest the detected biological signals are genuine. The SIEVEseq R package and source codes are available at: https://github.com/Divo-Lee/SIEVEseq.
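To make the idea of testing mean, variance, and skewness on transformed counts more concrete, here is a minimal Python sketch. The abstract does not name the compositional transform, so the centered log-ratio (CLR) with a pseudocount is used here purely as an assumption. The helper names `clr` and `moments` are hypothetical and do not come from the SIEVEseq package, and no skew-normal fitting or hypothesis testing is attempted.

```python
import math


def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of one sample's gene counts.

    The pseudocount avoids log(0); its value is an arbitrary illustrative choice.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [x - mean_log for x in logs]


def moments(values):
    """Return (mean, variance, skewness) using population (biased) moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    skew = m3 / m2 ** 1.5 if m2 > 0 else 0.0
    return mean, m2, skew


# Hypothetical counts: rows are samples, columns are genes.
samples = [[0, 5, 10, 100], [2, 4, 12, 80], [1, 9, 8, 150]]
transformed = [clr(s) for s in samples]
# Per-gene summary across samples for the last gene (column index 3).
print(moments([t[3] for t in transformed]))
```

Comparing these three per-gene summaries between conditions is the kind of contrast the abstract describes; the statistical tests themselves are outside the scope of this sketch.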
bioinformatics2026-06-21v3Hierarchical classification of immune cell transcriptomes at population-scale
Beltz, C.; Qiu, Z.; Sadowski, L.; Kraske, J. A.; Aggarwal, A.; Quintanal-Villalonga, A.; Manoj, P.; Littbarski, A.; Bajaj, S.; Meskauskaite, B.; Umeda, S.; Mazutis, L.; Rose, S. A.; Chan, J. M.; Nawy, T.; Nainys, J.; Chaligne, R.; de Stanchina, E.; Kaelber, K. A.; Cussigh, C. S.; Kallenberger, S. M.; Williams, A.; Jenzer, M.; Pompecki, T.; Kahle, S.; Hohmann, N.; Nussbaum, D. P.; Moss, N. S.; Ziv, E.; Berger, A. K.; Haag, G. M.; Springfeld, C.; Zschaebitz, S.; Hassel, J. C.; Debus, J.; Jaeger, D.; Iacobuzio-Donahue, C. A.; Ganesh, K.; Peer, D.; Ungerechts, G.; Rudin, C. M.; Huber, P. E.; WalleAbstract
Accurate immune cell classification is essential for interpreting single-cell RNA sequencing (scRNA-seq) data. However, progress in automating cell type annotation is constrained by the lack of independent, high-resolution benchmarks, as routine data integration introduces statistical dependencies that inflate model generalizability. Here, we present the single-cell universal classification omnibus (Suco), a resource of independent, uniform expert annotations, and Compocyte, a modular hierarchical classifier. Together, they establish a framework that substantially outperforms existing classifiers while facilitating expert review of ambiguous annotations. Applying Compocyte across 50 studies, including three newly generated datasets, we classified 15.6 million leukocytes from 3,965 patients. Within this cohort, we identified a new tumor-associated resorptive macrophage phenotype, a non-canonical monocyte subtype in subclinical cytokine release syndrome, and the programmatic erosion of T cell memory stemness across metastatic sites. Suco and Compocyte thus provide a generalizable framework to uncover the principles governing human immunity at population scale.
bioinformatics2026-06-21v2Antibody-Antigen Affinity Prediction with Chain-Aware Protein Language Modeling
Singh, H.; Malhotra, A.; Srivastava, S. P.; SINGH, R. K.; Gorantla, R.Abstract
Motivation: Antibody-antigen affinity determines which antibodies advance in therapeutic discovery, repertoire analysis and affinity maturation, but experimental measurements are sparse relative to the scale of sequence libraries. Structure-based predictors can exploit interface geometry when reliable complexes are available, yet early discovery often requires ranking many heavy-light chain pairs against antigens for which no complex structure exists. Existing sequence-based models are scalable, but frequently compress heavy and light chains into a single antibody representation or concatenate antibody and antigen features obscuring the chain-specific and epitope-specific signals that drive binding. Results: We present AbAffinity, a sequence-only chain-aware three-stream architecture that maintains heavy chain, light chain and antigen as distinct streams. It integrates frozen ESM-2 embeddings with heavy-chain CDR-focused pooling, heavy-light self-attention, adaptive fusion gating and gated cross-attention, training only a compact interaction module. On the SAAINT-DB benchmark, AbAffinity achieves strong predictive performance under ten-fold cross-validation and maintains robust accuracy on novel antigens. It consistently outperforms recent sequence-based models across external benchmarks including SAbDab, AB-Bind and SKEMPI 2.0. Ablation studies highlight the contributions of chain-specific representations, CDR-focused pooling and the gated interaction pathway. Integrated Gradients attributions recover known paratope and epitope residues at structurally validated interfaces. AbAffinity provides a lightweight, explainable sequence-first framework for antibody triage and prioritisation when structural information is limited or unavailable.
bioinformatics2026-06-21v1Fast Multi-objective RNA Optimization with Autoregressive Reinforcement Learning
Huang, J.; Feng, N.; Bai, H.; Fang, Y.; Liu, X.; Wang, S.; Yan, J.; Shen, H.-B.; Qiu, Z.; Yuan, Y.; Hu, R.; Pan, X.Abstract
Codon optimization is essential in mRNA vaccine development, but existing tools are limited in computational efficiency, sequence diversity and universality. To address these challenges, we develop RNAJog (RNA Joint Optimization with autoregressive Generative model), a framework integrating autoregressive generation with reinforcement learning to optimize codon sequences for minimum free energy (MFE), codon adaptation index (CAI) and GC content, even enabling sequence design without requiring annotated training data. Evaluations in both in silico and wet-lab experiments have confirmed RNAJog's effectiveness and efficiency: it runs two orders of magnitude faster than a traditional algorithm (LinearDesign) on long RNA sequences and yields about a 10-fold increase in antibody titer compared with the wild-type mRNA in influenza virus hemagglutinin (HA) mRNA vaccine design in mice. RNAJog also supports biological constraints for sequence optimization. Using this feature, we minimized m6A modification motifs in the Bmp2 coding sequence to enhance translational efficiency and RNA stability, which were validated in cell-based experiments.
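Two of the three objectives named in this abstract, GC content and the codon adaptation index, are simple to compute and can be sketched in Python as follows. CAI is taken here in its usual form, the geometric mean of per-codon relative adaptiveness weights. The weight table below is a made-up toy example rather than a real codon usage table, and the function names are hypothetical. MFE is omitted because it requires an RNA folding tool.

```python
import math


def gc_content(seq):
    """Fraction of G and C bases in an RNA or DNA sequence."""
    seq = seq.upper()
    return sum(1 for base in seq if base in "GC") / len(seq)


def cai(seq, weights):
    """Codon adaptation index: geometric mean of relative adaptiveness weights.

    Codons absent from `weights` are skipped (an illustrative simplification).
    """
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    logs = [math.log(weights[c]) for c in codons if c in weights]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))


# Toy weights for two alanine codons (hypothetical values).
toy_weights = {"GCU": 1.0, "GCC": 0.25}
print(gc_content("GCUGCC"))
print(cai("GCUGCC", toy_weights))
```

A multi-objective optimizer would combine scores like these into a reward; how RNAJog weights them is not stated in the abstract.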
bioinformatics2026-06-20v2The recount3 Python package for programmatic access to uniformly processed RNA-seq data
Alsalihi, A.; Flight, R. M.; Moseley, H. N. B.Abstract
The recount3 online resource provides tens of thousands of uniformly processed RNA-seq samples across human and mouse from major sequencing repositories like the Sequence Read Archive. While access to these datasets has traditionally been centered in the R/Bioconductor ecosystem, the growing prominence of Python in bioinformatics and machine learning necessitates native, efficient tooling for Python users. Therefore, we present the recount3 Python package with a robust application programming interface (API) and command-line interface (CLI) for discovering, downloading, and materializing recount3 resources. The software orchestrates uniform resource locator (URL) resolution, persistent on-disk caching, and the automatic parsing of data into analysis-ready data structures, including Pandas DataFrames and BiocPy RangedSummarizedExperiment objects. The recount3 Python package drastically lowers the barrier to entry for large-scale utilization of RNA-seq data in Python-based computational pipelines, bridging the gap between massive public transcriptomic data and modern machine learning ecosystems.
bioinformatics2026-06-20v1A network approach to DNA methylation clocks
Carcedo, A.; Yang, S.-G.; Smiljanic, J.; Neunman, M.; Wennstedt, S.; Degerman, S.; Lizana, L.Abstract
Biological age predicts health and lifespan better than chronological age, but remains difficult to measure. One leading molecular proxy for biological age is DNA methylation, which underlies age predictors known as "clocks". These clocks use penalized linear regression to predict chronological age from methylation levels using selected cytosine-guanine pairs (CpGs) along DNA. Although they predict chronological age within a few years and track mortality risk, there are several issues. Different clocks share a vanishingly small number of CpG sites, many of which show weak associations with age. Also, the clocks often do not transfer across methylation array platforms. This paper takes a network approach to better understand these issues. By using 12 public datasets from human blood, we build a co-methylation network of the sites that show the strongest age correlation. After pruning weak links, we find that it has a small number of large modules of covarying CpGs surrounded by many small modules and singleton sites. These modules are biologically interpretable, as they are associated with CpG island contexts and enriched for distinct Gene Ontology functions. We also map five established clocks onto this network (Horvath, Hannum, AltumAge, Skin & Blood, and Han) and find that they select some CpGs from the same module. This suggests that they are more similar than they appear. The network structure also suggests new ways to build clocks. A simple clock that retains one CpG per module matches the performance of established clocks. A second one, built from module-level principal components, outperforms all five established clocks in three validation cohorts and is transferable across array platforms (Illumina Infinium Methylation 450K or EPIC arrays). Overall, the network perspective shifts attention from individual CpG sites to modules of covarying sites. 
This perspective helps explain why DNA methylation clocks perform so well despite their differences and provides a more systematic approach for developing the next generation of aging biomarkers.
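The "one CpG per module" idea can be sketched in a few lines of Python. The abstract does not say how the representative site is chosen, so this sketch assumes the CpG with the strongest absolute Pearson correlation with age is kept for each module. The CpG identifiers, module labels, and function names are hypothetical, and the regression step that would turn the selected sites into a clock is left out.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def pick_representatives(beta, modules, ages):
    """For each module, keep the CpG whose methylation correlates most with age.

    beta:    dict mapping CpG id -> methylation values, one per sample
    modules: dict mapping module id -> list of CpG ids in that module
    ages:    chronological ages, in the same sample order as beta values
    """
    return {
        module: max(cpgs, key=lambda c: abs(pearson(beta[c], ages)))
        for module, cpgs in modules.items()
    }


# Hypothetical toy data: four samples, two modules of two CpGs each.
ages = [20, 40, 60, 80]
beta = {
    "cgA": [0.10, 0.20, 0.30, 0.40],
    "cgB": [0.30, 0.10, 0.40, 0.20],
    "cgC": [0.90, 0.70, 0.50, 0.35],
    "cgD": [0.50, 0.50, 0.60, 0.50],
}
modules = {"M1": ["cgA", "cgB"], "M2": ["cgC", "cgD"]}
print(pick_representatives(beta, modules, ages))
```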
bioinformatics2026-06-20v1Ribosomes are covered by a coat of flexible protein fragments
McGrath, H.; Kvasnovsky, R.; Kolar, M.Abstract
Ribosomal proteins contain flexible terminal regions that are averaged out during electron density reconstructions, rendering them absent from experimental models derived by X-ray crystallography or cryogenic electron microscopy. These flexible protein fragments (FPFs) collectively form an invisible coat on the ribosome surface whose presence has been systematically overlooked. Here we analysed FPFs from 36 ribosomes spanning bacteria, eukaryotes, and mitochondria. We found that mitoribosomes harbour the most numerous and longest FPFs. Structural predictions confirmed that FPFs are predominantly disordered across all ribosome classes. Comparison of FPF amino acid composition against proteome-wide background frequencies revealed strong and domain-specific compositional biases. The balance between arginine and lysine content tracks the cardiolipin content of the membrane each ribosome class contacts. The arginine enrichment in mitoribosomal FPFs may additionally reflect selection arising from the RNA-rich environment of mitochondrial RNA granules, membraneless condensates where mitoribosomes are assembled. FPFs are uniformly depleted in aromatic residues, arguing against protein-driven liquid-liquid phase separation propensity. Our findings suggest that the flexibly tethered coat is a highly functional intrinsic part of all ribosomes.
bioinformatics2026-06-20v1Finding stable clusterings of single-cell RNA-seq data
Klebanoff, V. F.Abstract
Run a UMI count matrix through a pipeline to obtain n cell clusters. Suppose that counts for an equal number of additional cells from the same experiment become available. Would including them change the result? Form the matrix containing both sets of counts, obtain n clusters, restrict this clustering to the initial cells and compare it with the initial clustering. If they are not consistent, conclude that the initial clustering is unstable. This is unrealistic, but reverse the perspective: given a clustering, process samples of half of the cells. If their clusters are consistent with those of all cells restricted to the samples, conclude that the clustering is stable. We use divisive hierarchical spectral clustering and define what may be a novel mapping of the dendrogram to nested clusterings. Counts are transformed to points in low-dimensional Euclidean space. Positive affinities are defined for points that are k-nearest neighbors. The affinity equals the inverse of the distance between points. Ng, Jordan, and Weiss' algorithm divides the points into two clusters. The normalized cut measures the clusters' separation. Recursion generates a dendrogram. Set the length of the branch between a node and its daughters to the normalized cut. Nodes' distances from the root define the mapping to nested clusterings. Analysis is performed for all cells and for multiple pairs of complementary samples of cells. For a given number of clusters, each sample's clustering and clusters are compared with those of the full data set (restricted to the sample). This provides measures of the stability of the clustering and its clusters. For three large data sets, this yielded clusterings compatible with published results, though with fewer clusters. Clusterings of two were judged to be stable. We conclude that it is feasible to identify stable clusterings of as many as 100,000 cells. Future research should explore using differential expression for validation.
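The affinity and normalized-cut definitions in this abstract can be illustrated with a short Python sketch. It builds k-nearest-neighbor affinities equal to inverse distance and evaluates the normalized cut of a given two-way split, using the standard definition cut/vol(A) + cut/vol(B). Symmetrizing the affinity matrix and flooring distances at a small epsilon are assumptions of this sketch. The spectral step that actually finds the split (Ng, Jordan, and Weiss) is not implemented here.

```python
import math


def knn_affinity(points, k):
    """Symmetric affinity matrix: 1/distance for k-nearest-neighbor pairs, else 0."""
    n = len(points)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        nearest = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )[:k]
        for d, j in nearest:
            w = 1.0 / max(d, 1e-12)  # floor guards against duplicate points
            W[i][j] = W[j][i] = w    # symmetrize (assumption)
    return W


def normalized_cut(W, labels):
    """Normalized cut of a bipartition given by 0/1 labels."""
    n = len(W)
    cut = sum(W[i][j] for i in range(n) for j in range(n)
              if labels[i] == 0 and labels[j] == 1)
    vol0 = sum(W[i][j] for i in range(n) for j in range(n) if labels[i] == 0)
    vol1 = sum(W[i][j] for i in range(n) for j in range(n) if labels[i] == 1)
    if vol0 == 0 or vol1 == 0:
        raise ValueError("each side of the split needs positive total affinity")
    return cut / vol0 + cut / vol1


# Two well-separated toy groups of three points each.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
W = knn_affinity(pts, k=2)
print(normalized_cut(W, [0, 0, 0, 1, 1, 1]))  # clean split
print(normalized_cut(W, [0, 1, 0, 1, 0, 1]))  # split that breaks both groups
```

A small normalized cut indicates well-separated clusters, which is why the abstract uses it as the branch length between a dendrogram node and its daughters.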
bioinformatics2026-06-19v5damidBind: an R/Bioconductor package for differential DamID analysis and data exploration
Marshall, O. J.Abstract
DamID and its cell-type-specific adaptations, including Targeted DamID (TaDa) and Chromatin Accessibility TaDa (CATaDa), are now widely adopted techniques for the genome-wide profiling of DNA-binding proteins. Despite this popularity, no dedicated software solution exists for identifying differentially bound or accessible loci, or differentially transcribed genes, between cell types using DamID. The R/Bioconductor package damidBind provides these functions, allowing an end user to move from processed binding profiles to identifying differentially bound loci in a reproducible, statistically appropriate and straightforward workflow. Availability and Implementation: damidBind is an open-source R/Bioconductor package, freely available from Bioconductor at https://bioconductor.org/packages/damidBind/ and from GitHub at https://github.com/marshall-lab/damidBind. It is released under the GPLv3 licence.
bioinformatics2026-06-19v3PLncFire enables genome wide identification and annotation of plant long noncoding RNAs from RNA sequencing data
Mistry, S. D.; Saxena, S.; Rizvi, A. Z.Abstract
Long non-coding RNAs (lncRNAs) are key regulators of plant biology, yet their discovery is hindered by low sequence conservation and a lack of comprehensive annotations. To overcome these challenges, we developed PLncFire, a modular computational pipeline that automates the genome-wide identification and annotation of lncRNAs from standard RNA-seq data. PLncFire integrates quality control, transcript assembly, and a robust consensus coding-potential assessment using CPC2, PlantLncPipe, and FEELnc to generate high-confidence predictions. It classifies lncRNAs as known or novel, facilitates their prioritisation through differential expression analysis, and is designed for scalability and reproducibility across diverse plant species. PLncFire provides a standardised framework to empower large-scale lncRNA discovery and advance comparative functional genomics. The source code is available at https://github.com/ahsan-rizvi/PLncFire.git.
bioinformatics2026-06-19v2From Scarce Functional Labels to Label-Aware Generation in Homologous Protein Families
Rosset, L.; Weigt, M.; Zamponi, F.Abstract
Accurately annotating and controlling protein function from sequence data remains a major challenge in protein engineering, especially when functional labels are scarce within large homologous families. Here, we study a two-stage light-supervision strategy for fine-grained functional annotation and label-aware sequence generation. First, we compare several sequence representations, including one-hot encodings, Restricted Boltzmann Machines (RBMs), and ESM2-based protein language model embeddings, for predicting intra-family specificity labels from limited supervision. By using train/test splits that explicitly reduce phylogenetic leakage, we show that ESM2-based representations do not systematically outperform family-specific RBM embeddings or even simple one-hot baselines in this regime. Second, we use the inferred annotations to train an annotation-aware RBM capable of generating artificial homologs conditioned on prescribed labels. Across several protein families, we quantify how the number and quality of available labels determine the reliability of conditional generation. Our results show that scarce annotations can support label-aware protein design when they are accurately propagated, while also highlighting the importance of phylogeny-aware evaluation for assessing functional annotation methods within homologous families.
bioinformatics2026-06-19v2Perturbation Curve models continuous transcriptional response trajectories and improves prediction of genetic modulations
Zhong, Y.; Wang, L.; Yang, G.; Yu, L.; Qi, X.; Jiang, H.Abstract
Single-cell CRISPR screens such as Perturb-seq have revolutionized functional genomics by revealing biological causality. However, although perturbation assignments are typically represented as discrete labels, the cell-level effective strength of perturbations is often continuous and diverse. Current analytical frameworks struggle to decouple the variability in perturbation strength from the diversity of downstream responses. Here, we present Perturbation Curve (PertCurve), a nonlinear, curve-based computational framework that models the trajectories of transcriptomic responses by explicitly incorporating diverse perturbation magnitudes and strengths. By ordering cells by perturbation strength, we demonstrate that PertCurve accurately recapitulates the response magnitudes and reveals the distinct modularity and asynchrony patterns of downstream gene behaviors. These patterns are categorized into archetypes, including proportional, sensitive, and threshold responses. By applying this framework across CRISPRi/a modalities, we identify universal response patterns in viral infection, apoptosis, and proliferation genes, and reveal previously overlooked context-specific regulatory features in cell differentiation. Finally, incorporating PertCurve into perturbation prediction models and evaluation metrics enhances predictive performance, delivering actionable insights for refining established models.
bioinformatics2026-06-19v1SteerAF: Distogram-based Steering of AlphaFold2 toward Alternative Conformations
Tang, J.; Zhu, Z.; Yang, S.; Song, C.Abstract
End-to-end structure predictors, such as AlphaFold2, typically output only the dominant conformational state of a given protein, which is biased by the training data set. Existing strategies for recovering alternative conformations are often computationally expensive and offer limited biological interpretability. Here, we present SteerAF, an inference-time optimization framework based on AlphaFold2 that leverages information encoded in the distogram derived from deep multiple sequence alignments (MSAs) to predict alternative protein conformations. Across four benchmark datasets, SteerAF matches or surpasses existing methods in predicting alternative conformations for the majority of systems. Sparse MSA-feature modifications generated via block gradient ascent exhibit a strong correlation with experimentally characterized functional residues, recovering them with approximately 50% precision in the tested proteins. Furthermore, SteerAF enables effective decoy selection in the absence of experimental structures, and its predictions can serve as seed structures for molecular dynamics simulations to map conformational landscapes. Thus, SteerAF provides an efficient and interpretable approach for predicting alternative conformations, offering a framework that can be extended to other similar predictors and problems.
bioinformatics2026-06-19v1OmniPath Metabo: chemical structures, interactions and mechanisms to study the metabolome
Schaul, J.; Bai, Y.; Franken, J.; Lawrence, T.; Palacio-Escat, N.; Bottazzi, D.; Carreno, E.; Daley, M.; Gul, L.; Sahin, A.; Mananes, D.; Bohar, B.; Dugourd, A.; Korcsmaros, T.; Turei, D.; Schmidt, C.; Saez-Rodriguez, J.Abstract
Mechanistic and functional analysis of omics data largely relies on the incorporation of prior knowledge; however, connecting metabolomics data and knowledge is a major methodological challenge. This is largely because prior knowledge is fragmented across many databases, so records must be merged across chemical structures, identifiers, and varying levels of structural specificity. This fragmentation limits mechanistic interpretation and functional characterisation of the metabolome. Here, we present OmniPath Metabo, a comprehensive, harmonised, metabolome-centric database covering metabolites, lipids, food-derived compounds, and small molecule drugs, along with their associated receptors, transporters, enzymes, reactions, allosteric regulators, and disease associations. OmniPath Metabo harmonises attributes using controlled vocabularies and ontologies, and uses structures and built-in cheminformatics to map identifiers and track ambiguity. OmniPath Metabo is built directly from 40+ original resources and is freely accessible via an interactive web app and API at metabo.omnipathdb.org. OmniPath Metabo enables dynamic, context-specific construction of subnetworks to serve dedicated purposes, such as cell-cell communication or integrated multi-omics metabolite-driven regulation, connecting reactions, allosteric regulation, metabolite-receptor and metabolite-transporter interactions. Combined with the more than 170 other resources in OmniPath, it can be used to build integrated networks of signaling, gene regulation, and metabolism. We showcase the application of OmniPath Metabo by analysing publicly available metabolomics data of lung cancer cell lines and relating metabolic footprints to mutational patterns. In summary, OmniPath Metabo transforms fragmented resources into a harmonised prior knowledge framework for a mechanistic and functional analysis of the metabolome.
bioinformatics2026-06-19v1Simulation-based Bayesian deep learning enables uncertainty-aware tumor fraction estimation in cell-free DNA
Volkov, H.; Raitses-Gurevich, M.; Grad, M.; Shlayem, R.; Danilevsky, A.; Rubinek, T.; Gorfine, M.; Shomron, N.Abstract
Background: Estimating tumor fraction from whole-genome cell-free DNA sequencing is critical for liquid biopsy, but is hampered by weak signals and baseline noise at low tumor fractions. Existing computational methods often require matched controls or large labeled datasets for training and lack uncertainty quantification. To address these gaps, we developed purNPE, a Bayesian deep-learning framework trained without labeled cancer cell-free DNA samples. Specifically, purNPE leverages a two-part generative model: one component simulates diverse tumor copy-number profiles based on evolutionary genealogies, while a second, data-driven component learns and replicates realistic sequencing background patterns from cancer-free cell-free DNA. By training a Neural Posterior Estimator on synthetic tumor profiles augmented with learned noise, purNPE performs amortized inference in milliseconds without needing a reference sample set at inference. Results: In a real-world pan-cancer cohort, purNPE achieved comparable performance with existing methods against orthogonal mutant-allele-fraction validation (MAE = 0.066). In silico and semi-synthetic experiments suggested analytical sensitivity around 1% tumor fraction under the evaluated conditions and showed strong classification accuracy in low tumor fractions (AUC = 0.98 for TF ≤ 3% versus controls). Conclusions: This work provides a framework for using simulation-based inference to derive calibrated, uncertainty-aware TF estimates, offering a potential alternative to traditional data-dependent methods.
bioinformatics2026-06-19v1ContinuumCellAgent: A Framework-Guided Agent for Long-Horizon Scientific Research
Li, H.; Lu, Y.; Fang, K.; Xu, Z.; Li, F.Abstract
AI-scientist systems are beginning to automate parts of scientific research. We present ContinuumCellAgent, an autonomous agent that executes literature review, hypothesis formation, computational experimentation, manuscript drafting, and adversarial peer review as a single unattended run. Existing AI scientist systems remain difficult to diagnose because they lack modularity, systematic prompt grounding, and observability into long-running behavior. ContinuumCellAgent addresses these gaps with a modular supernode architecture for stage-wise backend swapping, protocols grounded in curated research-method checklists that also define reviewer rubrics, and a diagnostics layer that records file-based artifacts, message traces, and state transitions. We evaluate the system on open-domain QA benchmarks and biomedical/longevity case studies, showing that it can produce checkable research artifacts while exposing pipeline dynamics for rigorous AI co-scientist research.
bioinformatics2026-06-19v1Nickel-Driven Dynamics of Urease in Sporosarcina pasteurii: Integrated Computational and Experimental Insights
Al-Thawadi, S. M.Abstract
Urease is a nickel-dependent enzyme that plays an important role in urea hydrolysis and in microbial-induced calcium carbonate precipitation (MICP), a process widely used in sustainable environmental biotechnology. Urease powers Biogrout (biocementation), a promising green technology for soil stabilization and infrastructure repair. Yet despite this importance, the relationship between nickel availability, enzyme activation, and bacterial fitness remains poorly understood. In this study, we reveal a striking dual effect of nickel on Sporosarcina pasteurii: while high Ni²⁺ concentrations strongly inhibit growth (IC50 ≈ 637.7 µM), they simultaneously boost specific urease activity up to six-fold. This uncoupling between biomass and enzymatic efficiency highlights a previously overlooked adaptive strategy under metal stress. Using structural bioinformatics and molecular docking, we show that Ure1, the catalytic subunit, exhibits the strongest nickel affinity (-4.3 kcal·mol⁻¹), supported by highly conserved active-site residues, whereas the accessory proteins UreE and UreG display moderate and weak binding, respectively, consistent with their roles in metal delivery and GTP-dependent maturation. In addition, microscopic observations confirmed that calcium carbonate precipitation was most pronounced at intermediate nickel concentrations (approximately 400-1000 µM), whereas higher concentrations (≥1000-1300 µM) led to reduced mineral formation due to loss of viable cells. Taken together, these results indicate that nickel availability controls both urease activation and bacterial fitness, and that an optimal balance is required to maximize biomineralization efficiency in environmental applications, particularly in biocementation technology.
bioinformatics2026-06-19v1StickForStats: automated statistical assumption validation for reproducible computational biology
Bharti, V.; Chakraborty, D.Abstract
Reproducible computational biology depends on statistical decisions that routine workflows often skip: verifying that a differential-expression test's assumptions hold across all genes, that a strategy-comparison ANOVA is robust to non-normality, or that a meta-analysis is not distorted by publication bias. Surveys consistently find that fewer than 20% of published biomedical studies report checking these assumptions, and existing statistical software leaves validation to the analyst as an optional step. We present StickForStats, an open-source web platform that reframes assumption validation as a default precondition for every analysis. Its Guardian system, a middleware pipeline of eight validators (normality, variance homogeneity, independence, outliers, sample size, modality, linearity, homoscedasticity), checks assumptions before execution and, on critical violations, reroutes to an appropriate nonparametric alternative with a documented decision trail. At genome scale, applying Guardian to a 91-sample synovial-sarcoma RNA-seq study (GSE271517) cascaded 90.6% of 27,221 genes to a rank-based test and flipped the differential-expression verdict for 553 genes (479 rescued from an under-powered t-test and 74 outlier-driven false positives rejected), materially changing the gene list a biologist would act on. The same automatic validation generalizes across domains: a CRISPR editing-strategy comparison (ANOVA F = 1122, with Guardian recommending Kruskal-Wallis H = 36.6), an ordinal correlation (Pearson r = 0.476 corrected to Spearman ρ = 0.479), and a sixteen-trial clinical meta-analysis revealing severe publication bias (Egger's t = -5.78, p < 0.001); a complementary module extends the same validators to published manuscripts, checking claims against CONSORT, STROBE, ICH-E9, and JARS-Quant reporting standards. By making assumption validation automatic and transparent, StickForStats targets a tractable, under-served contributor to irreproducibility.
The platform is MIT-licensed, validated against SciPy and R, and freely available at https://github.com/visvikbharti/stickforstats_new.
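The validate-then-reroute pattern described in this abstract can be illustrated with a small standard-library sketch. This is not StickForStats code or its API. The function names, the single outlier validator, and the 1.5 × IQR rule are all assumptions chosen for illustration. The sketch runs one check before a correlation and falls back to a rank-based (Spearman) estimate when the check fails, recording the reason.

```python
import math
import statistics


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def has_iqr_outliers(values, factor=1.5):
    """Hypothetical validator: flag points beyond factor * IQR from the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return any(v < q1 - factor * iqr or v > q3 + factor * iqr for v in values)


def guarded_correlation(x, y):
    """Run the check first, then choose the estimator and log the decision."""
    if has_iqr_outliers(x) or has_iqr_outliers(y):
        r = pearson(average_ranks(x), average_ranks(y))  # Spearman = Pearson on ranks
        return {"method": "spearman", "r": r, "reason": "outliers flagged by IQR rule"}
    return {"method": "pearson", "r": pearson(x, y), "reason": "no outliers flagged"}


# Toy data: a single extreme value in x triggers the rank-based fallback.
print(guarded_correlation([1, 2, 3, 4, 5, 6, 7, 100], [1, 2, 3, 4, 5, 6, 7, 8]))
```

Returning the method and the reason alongside the estimate is one simple way to keep the kind of documented decision trail the abstract mentions.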
bioinformatics2026-06-19v1Accurate detection of tumor clonality and ongoing expansion mode from genomic data
Chen, Y.; Jaksik, R.; Terranova, P.; El Baghdadi, S.; Koval, A.; Kurpas, M. K.; Tavare, S.; Kimmel, M.; Dinh, K. N.Abstract
Recent evidence shows that despite considerable effort, currently available algorithms for estimating intra-tumor heterogeneity (ITH) remain limited. We developed DECODE (Deciphering Cancer Origin from DNA Evolution), a novel mutation clustering method that incorporates the impact of sample-specific sequencing coverage and mutation calling biases. On synthetic data, DECODE outperformed existing methods across multiple clonality metrics and accurately detected and characterized the neutral tail in the site frequency spectrum (SFS), which encodes the tumor's ongoing expansion mode. In acute myeloid leukemia, accounting for the neutral tail enabled DECODE to yield more parsimonious clonal decompositions that align more closely with known subclonal dynamics that drive relapse. Applied to data from The Cancer Genome Atlas, DECODE not only detected a neutral SFS tail in most samples across tumor types but also uncovered a clinically meaningful link between ITH and survival in low-grade glioma. By jointly inferring clonality and expansion mode, DECODE provides two complementary and prognostically relevant readouts of tumor evolution from single tumor genomic samples.
bioinformatics2026-06-19v1