Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
Development of Deep-Learning Models that Predict Quantitative Protein-Ligand Interactions in Glycobiology as a part of a Capstone Course
Yin, H.; Liu, W.; Zhou, W.; Chang, Z.; Carpenter, E. J.; Satyajith, A.; Haregu, S.; Greiner, R.; Derda, R.
Abstract
Glycans coat the surface of all cells, and every glycan is recognised by specific glycan-binding proteins (GBPs). There are no general tools that can accurately estimate the binding strength between a glycan and a GBP from the amino acid sequence of the GBP and the molecular structure of the glycan, represented as a SMILES string. We describe models for predicting such binding strengths, developed as part of a Capstone Course at the University of Alberta. The models are trained on a dataset that combines BindingDB, a published database of small-molecule-protein interactions, and data from glycan arrays measured by the Consortium for Functional Glycomics (CFG). In this hybrid dataset of protein-ligand interactions, the ligands are both glycans from CFG and small molecules from BindingDB; similarly, the proteins include GBPs and proteins from BindingDB. Three models are presented: (i) ProMax, which fuses ESM-2, MolFormer, and MolCLR features; (ii) APEX, which constrains learning to a predetermined form, a physical model of binding; and (iii) UltraMax, which adds inter-atomic distances for the ligands. To address the dataset's severe long-tail distribution, the models employ tail-aware losses for rare high-binding instances. Trained and evaluated on approximately one million protein-ligand pairs using hold-out splits for unseen molecules, the three models provide a unified framework for quantitative glycan-protein binding prediction. We observed that learning glycan-protein binding is harder than the similar task of learning small-molecule-protein interactions. Simple mirror-inversion tests led us to postulate that insufficient use of chiral features is an important source of difficulty in learning these interactions.
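The abstract does not say how the mirror-inversion tests were built. One plausible way to produce a mirrored ligand input is to invert every tetrahedral chirality tag in the SMILES string, since swapping `@` and `@@` throughout yields the enantiomer. The sketch below assumes only those two tags are present (extended classes such as `@TH1` or `@SP1` are not handled), and the function name `mirror_smiles` is hypothetical.

```python
import re


def mirror_smiles(smiles: str) -> str:
    """Return the SMILES of the mirror image by inverting each '@' / '@@' tag.

    Assumption: only the plain tetrahedral tags '@' and '@@' occur in the input.
    """
    # Match '@@' before '@' so a double tag is never split into two singles.
    return re.sub(r"@@?", lambda m: "@" if m.group() == "@@" else "@@", smiles)


# One enantiomer of alanine and its mirror image.
original = "N[C@@H](C)C(=O)O"
mirrored = mirror_smiles(original)
print(mirrored)
```

A model that makes real use of chirality would be expected to change its prediction for at least some mirrored ligands; identical predictions for `original` and `mirrored` would point to the insensitivity the authors describe.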
bioinformatics | 2026-06-26 | v2

CoLa-VAE: A Cell-Cell Communication-Aware Variational Autoencoder for Representation Learning and Expression Denoising
Chen, Y.; Qi, C.; Fang, H.; Luan, F.; Zhang, Z.; Arya, S.; Wei, Z.
Abstract
Single-cell RNA sequencing provides a powerful view of cellular heterogeneity, but its sparsity and dropout noise remain major obstacles for recovering biologically meaningful gene expression programs and for downstream analyses that depend on reliable expression measurements. Ligand-receptor-based cell-cell communication inference is one such analysis: missing ligand or receptor expression can cause substantial false negatives in sparse single-cell data. Here, we present CoLa-VAE, a cell-cell communication-aware variational autoencoder that jointly learns latent representations and denoised expression profiles by incorporating ligand-receptor-derived communication topology through dynamic graph Laplacian regularization. Rather than treating denoising as a secondary output of representation learning, CoLa-VAE uses denoised expression to iteratively refine communication estimates and uses the resulting communication structure to guide both latent organization and expression reconstruction. In addition to improving latent space organization and producing robust denoised expression matrices, CoLa-VAE improved downstream biological analyses, including the detection of robust differential cell-cell communication programs, mitigation of batch-associated variation and enhanced spatial transcriptomic deconvolution when spatially constrained communication structure was incorporated. Together, these results establish CoLa-VAE as a communication-guided denoising and representation learning framework that recovers biologically meaningful expression signals from sparse single-cell and spatial transcriptomic data, enabling more sensitive and reliable downstream analysis.
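To make the regularizer concrete, here is a minimal sketch of a generic graph Laplacian penalty, not CoLa-VAE's actual loss. For a symmetric communication weight matrix W with Laplacian L = D - W, the penalty tr(ZᵀLZ) equals half the weighted sum of squared distances between latent vectors, so strongly communicating cells are pulled together. The "dynamic" part described above would correspond to recomputing W from the denoised expression at each iteration, which is not shown. All names are hypothetical.

```python
def graph_laplacian(weights):
    """Build L = D - W from a symmetric weight matrix given as nested lists."""
    n = len(weights)
    degrees = [sum(row) for row in weights]
    return [
        [(degrees[i] if i == j else 0.0) - weights[i][j] for j in range(n)]
        for i in range(n)
    ]


def laplacian_penalty(latent, weights):
    """Compute tr(Z^T L Z) = sum_ij L_ij * <z_i, z_j> for latent rows z_i."""
    lap = graph_laplacian(weights)
    n = len(latent)
    return sum(
        lap[i][j] * sum(a * b for a, b in zip(latent[i], latent[j]))
        for i in range(n)
        for j in range(n)
    )


# Two communicating cells whose latent vectors are 5 units apart.
weights = [[0.0, 1.0], [1.0, 0.0]]
latent = [[0.0, 0.0], [3.0, 4.0]]
print(laplacian_penalty(latent, weights))  # squared distance times the edge weight
```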
bioinformatics | 2026-06-26 | v2

Cell-free DNA Fragmentation Profiling at Transcription Start Sites Improves upon Cancer-Type-Specific Region Selection for Cancer Detection
Pronk, B.; Makrodimitris, S.; Wilting, S.; Reinders, M.
Abstract
Accurate discrimination between healthy individuals and patients with cancer using minimally invasive liquid biopsies could improve cancer diagnosis and monitoring. Circulating cell-free DNA (cfDNA) is a promising biomarker, since fragmentation patterns reflect chromatin organization and have been used to interrogate regulatory regions such as transcription start sites (TSSs). Classification approaches typically rely on hypothesis-driven selection of genomic regions based on literature or external tissue data. Therefore, they assume that tumor-derived cfDNA constitutes the dominant diagnostic signal, potentially overlooking a systemic, genome-wide shift in the cfDNA pool. We present a data-driven framework that identifies discriminative genomic loci directly from cfDNA whole-genome sequencing data. Using fragmentomic features captured at TSSs within a nested cross-validation framework, the model outperforms ichorCNA and hypothesis-driven baselines in distinguishing healthy from colorectal and breast cancer samples (AUROC 0.95 ± 0.039). Performance was maintained in a pan-cancer setting across seven malignancies (AUROC 0.946 ± 0.032) and generalized to previously unseen cancer types within the same cohorts (AUROC 0.934 ± 0.006). While validation in an independent external cohort showed a performance gap (AUROC 0.694), the data-driven model was consistently competitive with baseline methods. These results indicate that robust cancer detection is enabled by integrating distributed genome-wide fragmentation patterns rather than restricting analysis to predefined regions.
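Nested cross-validation matters here because the discriminative loci are selected from the data: selection and tuning must happen only inside each outer training fold, or the outer test estimate is optimistically biased. The sketch below shows only the index bookkeeping, under the assumption of simple unstratified folds. The fold counts and function names are hypothetical and not taken from the paper.

```python
def k_fold_indices(indices, k):
    """Yield (train, test) index lists for k interleaved folds."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


def nested_cv_splits(n_samples, outer_k, inner_k):
    """Yield (outer_train, outer_test, inner_splits) for nested cross-validation.

    Inner splits are drawn only from outer_train, so feature selection and
    hyperparameter tuning never see the outer test samples.
    """
    for outer_train, outer_test in k_fold_indices(list(range(n_samples)), outer_k):
        inner_splits = list(k_fold_indices(outer_train, inner_k))
        yield outer_train, outer_test, inner_splits


for outer_train, outer_test, inner_splits in nested_cv_splits(12, 3, 2):
    print(outer_test, [val for _, val in inner_splits])
```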
bioinformatics | 2026-06-26 | v1

Comp2GPR: A Sequence-Driven Framework for Gene-Protein-Reaction Rule Reconstruction
Castillo, S.
Abstract
Accurate gene-protein-reaction (GPR) associations are essential for the predictive performance of genome-scale metabolic models (GEMs), as they define the mapping between genes, enzymes, and metabolic reactions. However, GPR rules are often incomplete or inconsistent due to limitations in annotation transfer and the ambiguous representation of multi-subunit protein complexes, leading to errors in downstream analyses such as gene essentiality prediction. Here, I introduce Comp2GPR, an automated pipeline for reconstructing GPR rules that integrates curated protein complex information with sequence-level evidence. Protein complexes were sourced from the Complex Portal and subjected to an AI-assisted curation workflow to retain only metabolically relevant assemblies. Comp2GPR combines deterministic sequence similarity mapping with explicit rule construction to generate Boolean GPR expressions that accurately represent obligate subunit relationships and isoenzyme redundancy. I evaluated the impact of the reconstructed GPR rules by integrating them into the Yeast9 metabolic model and comparing gene essentiality predictions with the original model. While global performance metrics remained largely unchanged, the updated model achieved a net improvement in prediction accuracy through gene-level corrections. Overall, Comp2GPR demonstrates that combining curated protein complex data with sequence-based validation improves the accuracy, interpretability, and reproducibility of GPR rules. The method provides a robust framework for enhancing metabolic model annotations and supports more reliable simulation-based analyses.
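The Boolean convention behind GPR rules can be illustrated in a few lines: obligate subunits of one complex are joined with AND, and alternative enzymes (isoenzymes) are joined with OR. This sketch is not the Comp2GPR implementation; the gene identifiers `G1` to `G3` and both function names are hypothetical.

```python
def build_gpr(isoenzymes):
    """Render a GPR rule from a list of enzymes, each a list of subunit genes."""
    terms = []
    for subunits in isoenzymes:
        term = " and ".join(subunits)
        # Parenthesise multi-subunit complexes when they sit inside an OR.
        if len(subunits) > 1 and len(isoenzymes) > 1:
            term = f"({term})"
        terms.append(term)
    return " or ".join(terms)


def reaction_active(isoenzymes, deleted_genes):
    """A reaction stays active if at least one enzyme keeps all its subunits."""
    return any(
        all(gene not in deleted_genes for gene in subunits)
        for subunits in isoenzymes
    )


# A two-subunit complex (G1, G2) with a single-gene isoenzyme (G3).
enzymes = [["G1", "G2"], ["G3"]]
print(build_gpr(enzymes))
print(reaction_active(enzymes, {"G1"}))        # isoenzyme G3 rescues the reaction
print(reaction_active(enzymes, {"G1", "G3"}))  # no intact enzyme remains
```

This is why misrepresenting a complex as a set of isoenzymes (OR instead of AND) changes essentiality calls: deleting one subunit would then wrongly leave the reaction active.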
bioinformatics | 2026-06-26 | v1

MYC and RNA Polymerase II Binding Near Transcriptional End Sites Regulate the Expression of Functionally-Related Genes
Prochownik, E. V.; Henchy, C. M.; Wang, H.
Abstract
MYC oncoprotein binding at promoters and enhancers influences RNA polymerase II (RNAPII)-driven gene expression. Numerous genes also bind MYC near their transcriptional end sites (TESs). This often allows direct promoter-TES contact via looping and further regulates total and 'read-through' transcription that extends beyond standard termination sites. We aimed here to clarify the rules governing TES-associated MYC and/or RNAPII binding cross-talk in human and murine cells. Using ChIP-seq and RNA-seq datasets from the ENCODE portal and elsewhere, MYC and RNAPII binding profiles were found to differ around TESs and transcriptional start sites (TSSs). Variations in E-box flanking sequences likely accounted for the somewhat lower affinities of MYC for TES-associated sites. Motifs for numerous other transcription factors were also observed to cluster non-randomly and in close proximity to MYC and RNAPII binding site peak summits. On average, genes with TES-proximal MYC or RNAPII sites were more highly expressed than those without, although co-binding tended to be suppressive. Both normal and neoplastic proliferative stimuli altered the MYC and RNAPII binding patterns of many genes, indicating that 'category switching' was common, subject to disparate external signals and often reversible. Functionally related gene sets with high levels of read-through transcription were uniformly marked by significant amounts of TES-associated MYC and/or RNAPII binding. These findings indicate that, both independently and together, MYC and RNAPII binding near TESs dynamically impact total and read-through transcription while also coordinating the expression of many common-purpose gene sets.
bioinformatics | 2026-06-26 | v1

PlantGeneAnn: a strand-specific genome foundation model for ab initio gene structure annotation of plant genomes
Qizhe, Z.; Zhengyang, Z.; Kepeng, L.; Wang, J.; Kaixuan, D.; Xianglei, X.; Wei, X.; Xuehai, H.
Abstract
High-quality plant genome assemblies are rapidly increasing, but accurate structural annotation remains reliant on transcript and homology evidence, limiting applications in newly sequenced and non-model species. Here, we present PlantGeneAnn, a plant-optimized, strand-specific genome foundation model for ab initio gene structure annotation. Fine-tuned on only nine high-quality model plant annotations, PlantGeneAnn outperformed a multi-species model trained on 42 species, showing that annotation quality is more important than token volume. On a stringent 13-species benchmark covering rosids, asterids, and monocots, PlantGeneAnn surpassed four state-of-the-art baselines across five evaluation levels, from base-level classification to complete transcript recovery. It achieved higher intron precision and better captured complex gene structures. In zero-shot variant effect prediction, PlantGeneAnn identified cryptic splice donors and premature stop codons in maize and rice, with saturation mutagenesis confirming single-nucleotide, context-dependent sensitivity. It also retained generalizability for epigenomic track prediction, highlighting its value for pan-genomics, crop improvement, and non-model plant research.
bioinformatics | 2026-06-26 | v1

Consistent consensus-based annotation of spatial adaptive immune receptor repertoires from long-read sequencing using LongAIRR
Schuck, J.; Ortega Iannazzo, S.; Mahmoud, Z.; Gwellem Anchang, C.; Hasse, L. M.; Weber, K.; Imkeller, K.
Abstract
The combination of spatial transcriptomics with long-read sequencing enables spatial characterization of full-length transcripts within solid tissue sections. However, standardized computational analysis frameworks are lacking, and it remains unclear whether available long-read sequencing platforms from Oxford Nanopore Technologies and Pacific Biosciences yield comparable results. Here, we present a computational strategy for spatial full-length transcript analysis, focusing on the spatial profiling of adaptive immune receptor repertoires (AIRR). Our approach introduces an adaptive filtering strategy that dynamically refines read selection and significantly improves consensus accuracy, enabling high-confidence sequence reconstruction independent of platform-specific sequencing error profiles. We further derive evidence-based guidelines tailored to the consistent and robust analysis of spatial AIRR data. The resulting software LongAIRR is modular and interoperable with existing spatial transcriptomics and AIRR analysis frameworks. This work establishes a methodological foundation for spatial immunology, enabling precise mapping of immune repertoires within their native tissue microenvironments.
bioinformatics | 2026-06-26 | v1

Learning Perturbation Effects Through Contrastive Alignment of Multimodal Biological Embeddings
Long, W.; Liu, T.; Szalata, A.; Theis, F. J.; Xue, L.; Zhao, H.
Abstract
Multimodal single-cell perturbation screens offer a scalable approach for characterizing the effects of genetic and chemical interventions on cellular state. However, most existing representation learning methods are tailored to a single perturbation modality and fail to explicitly incorporate external semantic knowledge, which limits their ability to generalize across datasets and perturbation types. Here, we introduce PertOmni, a CLIP-style multimodal representation learning framework that aligns transcriptomic perturbation signatures with text-derived embeddings of curated gene and compound descriptions, as well as image-derived embeddings from cell paintings. PertOmni jointly trains a shared transcriptomic encoder and dataset-specific text encoders using a masked contrastive objective that emphasizes within-cell-type discrimination while mitigating confounding effects arising from cell type heterogeneity. We evaluate the produced joint embedding space on bidirectional retrieval, drug-gene interaction inference, and perturbation prediction across both small-molecule and CRISPRi perturbation datasets, and demonstrate consistent improvements over strong baseline methods.
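The abstract does not define the masking precisely. One reading, assumed here, is a symmetric CLIP-style InfoNCE loss in which each sample's negatives are restricted to other samples of the same cell type, so the model cannot lower the loss merely by telling cell types apart. The sketch below is a plain-Python illustration of that assumed reading, not PertOmni's objective; the function names and the temperature value are hypothetical.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [a / norm for a in v]


def _log_sum_exp(values):
    top = max(values)
    return top + math.log(sum(math.exp(x - top) for x in values))


def masked_clip_loss(expr_emb, text_emb, cell_types, temperature=0.1):
    """Symmetric InfoNCE where candidates are masked to the same cell type."""
    expr = [_normalize(v) for v in expr_emb]
    text = [_normalize(v) for v in text_emb]
    n = len(expr)
    total = 0.0
    for i in range(n):
        # Candidates for sample i: itself plus same-cell-type samples only.
        allowed = [j for j in range(n) if cell_types[j] == cell_types[i]]
        positive = _dot(expr[i], text[i]) / temperature
        # Expression-to-text direction.
        total += _log_sum_exp([_dot(expr[i], text[j]) / temperature for j in allowed]) - positive
        # Text-to-expression direction.
        total += _log_sum_exp([_dot(text[i], expr[j]) / temperature for j in allowed]) - positive
    return total / (2 * n)


aligned = masked_clip_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]], ["A", "A"])
shuffled = masked_clip_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]], ["A", "A"])
print(aligned < shuffled)
```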
bioinformatics | 2026-06-26 | v1

Computational reconstruction of hierarchical cis-regulatory networks reveals synergistic transcription control and disease-associated rewiring
Zhu, X.; Zhou, X.; Zhang, Y.; Cai, G.; Zhao, W.; Zhou, B.; Zhou, J.; Tang, Z.; Liu, J.; Zhu, Q.; Cao, J.; Yang, B.; Gu, X.; Zhou, Z.
Abstract
Gene regulation emerges from coordinated interactions among dispersed cis-regulatory elements, yet how these elements integrate into functional regulatory networks and collectively regulate gene transcription remains poorly understood. Here, we present ORIGAMI, a multi-omics, gene-centric deep learning framework that reconstructs functional cis-regulatory networks constrained by transcriptional output. ORIGAMI formulates cis-regulatory modeling as a latent graph inference task, which integrates DNA sequence, epigenomic signals, and three-dimensional chromatin priors to infer denoised regulatory graphs that capture functional interactions rather than structural proximity alone. The inferred regulatory graphs exhibit distinct topological regimes, where hierarchical and modular organization encodes cell-state-specific functional demands and enables synergistic transcriptional control. Furthermore, we show that these regulatory architectures undergo measurable state-dependent rewiring across disease contexts. Finally, ORIGAMI accurately predicts the transcriptional consequences of both cis- and trans-regulatory perturbations and links the rearrangement of regulatory architecture to perturbation response. Together, ORIGAMI advances a network-based view of gene regulation and establishes a foundation for virtual cell modeling of regulatory dynamics.
bioinformatics | 2026-06-26 | v1

A Generalised Epigenetic Clock Reveals Therapeutic Vulnerabilities Linked to Ageing in Cancer Cells
Fernandez-Rebollo, I.; Digilio, A.; Oikonomou, A.; Trastulla, L.; Esteller, M.; Iorio, F.
Abstract
Epigenetic clocks estimate biological age from DNA methylation patterns but perform poorly in cancer due to extensive epigenetic reprogramming, limiting the study of ageing in tumour biology. Here, we develop GepiClock, an epigenetic clock trained on DNA methylation data from 32 cancer types in The Cancer Genome Atlas. Based on 4,862 CpG sites, GepiClock accurately predicts age across both tumour and normal samples, indicating that core ageing-associated methylation programmes remain detectable despite malignant transformation. Applying GepiClock to molecularly profiled cancer cell lines with matched drug response and CRISPR screening data revealed age-associated vulnerabilities. Younger-predicted cell lines were more sensitive to mTOR, MEK1/2 and HSP90 inhibitors, whereas older lines showed increased sensitivity to AKT and PI3K inhibitors. Additional cancer-type-specific patterns and age-associated genetic dependencies were identified. These findings establish a framework to quantify biological age in cancer and link ageing-associated states to therapeutic vulnerabilities.
bioinformatics | 2026-06-26 | v1

Efficient evidence-based genome annotation with EviAnn
Zimin, A. V.; Puiu, D.; Pertea, M.; Yorke, J.; Salzberg, S.
Abstract
For many years, machine learning-based ab initio gene finding approaches have been central components of eukaryotic genome annotation pipelines, and they remain so today. The reliance on these approaches was originally sustained by the high cost and low availability of gene expression data, a primary source of evidence for gene annotation along with protein homology. However, innovations in modern sequencing technologies have revolutionized the acquisition of gene expression data, allowing scientists to rely more heavily on this class of evidence. In addition, proteins found in a multitude of well-annotated genomes represent another invaluable resource for gene annotation. Existing annotation packages often underutilize these data sources, which prompted us to develop EviAnn (Evidence-based Annotator), a novel evidence-based eukaryotic gene annotation system. EviAnn takes a strongly data-driven approach, building the exon-intron structure of genes from transcript alignments or protein-sequence homology rather than from purely ab initio gene finding techniques. We show that when provided with the same input data, EviAnn consistently outperforms current state-of-the-art packages including BRAKER3, MAKER2, and FINDER, while utilizing considerably less computer time. Annotation of a mammalian genome can be completed in less than an hour on a single multi-core server. EviAnn is freely available under an open-source license from https://github.com/alekseyzimin/EviAnn_release and from Bioconda as "eviann".
bioinformatics | 2026-06-25 | v3

Dynamic genomic constraints reveal fitness trade-offs underlying bacterial resistance evolution.
Dillon, L.; McInerney, J. O.; Creevey, C. J.
Abstract
Antimicrobial resistance (AMR) is often modelled as the accumulation of resistance genes leading to multidrug resistance (MDR). We show that gene co-occurrence patterns in two opportunistic pathogens are consistent with fitness trade-offs that constrain which combinations of resistance mechanisms coexist. We applied a combined pangenomic and machine-learning analysis to 9,584 Escherichia coli genomes (99.2% phylogroup B2) and 7,057 Pseudomonas aeruginosa genomes. In E. coli, we identified eight cases of mutually exclusive gene pairs that independently predicted the same MDR phenotype, suggesting alternative routes to resistance whose components are typically not co-inherited. In a separate dataset of 352 strains with paired minimum inhibitory concentration (MIC) data, these dissociated combinations co-occurred more often in resistant than in susceptible strains, consistent with the constraints being conditional on antibiotic selection. Thirty-three gene pairs showed opposing association patterns between the two species, with combinations significantly associated in one species and significantly dissociated in the other (e.g. associated in E. coli and dissociated in P. aeruginosa, or vice versa). This indicates that genomic context modifies the contribution of individual genes to resistance phenotypes, and offers one explanation for the observation that 106 antimicrobial resistance genes (ARGs) are present in >95% of strains yet do not predict resistance phenotype on their own. The findings are consistent with resistance evolution being shaped by fitness trade-offs and suggest that the dissociation patterns we identify could be targets for follow-up experimental work on resistance-associated fitness costs.
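As a rough illustration of what "associated" and "dissociated" mean for a gene pair, the sketch below compares the observed number of genomes carrying both genes with the number expected if the two genes were distributed independently. This is a simplification: the abstract does not state the statistic used, and a real analysis would need a significance test and a correction for phylogenetic structure, neither of which is attempted here. Gene names and the function name are hypothetical.

```python
def cooccurrence(genomes, gene_a, gene_b):
    """Return (observed, expected) counts of genomes carrying both genes.

    `genomes` is a list of sets of gene names; the expected count assumes the
    two genes occur independently of each other.
    """
    n = len(genomes)
    has_a = sum(gene_a in genome for genome in genomes)
    has_b = sum(gene_b in genome for genome in genomes)
    observed = sum(gene_a in genome and gene_b in genome for genome in genomes)
    expected = has_a * has_b / n
    return observed, expected


def label_pair(observed, expected):
    """Crude label by direction only; no statistical test is applied."""
    if observed < expected:
        return "dissociated"
    if observed > expected:
        return "associated"
    return "independent"


# Toy pangenome in which geneA and geneB never share a genome.
toy = [{"geneA"}, {"geneA"}, {"geneB"}, {"geneB"}]
print(label_pair(*cooccurrence(toy, "geneA", "geneB")))
```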
bioinformatics | 2026-06-25 | v2

FoldARE, an RNA secondary structure analysis and prediction tool via generative pseudo-SHAPE modeling
Marino, S. M.; Husak, V.; Tebaldi, T.
Abstract
RNA secondary structure prediction is limited by conformational heterogeneity and the scarcity of experimental data, as many RNAs populate ensembles of near-isoenergetic folds and SHAPE data are often unavailable. We present FoldARE (Folding and Analysis of RNA Ensembles), a two-step framework that derives pseudo-SHAPE constraints from in silico structural ensembles and uses them to guide SHAPE-aware secondary structure prediction. In the first step, an ensemble is generated and parsed nucleotide by nucleotide to estimate single-strandedness frequencies, which are converted into a pseudo-SHAPE reactivity profile using a weight-and-threshold scheme. In the second step, this profile is provided as a constraint to a SHAPE-compatible folding algorithm to improve the final prediction. We systematically evaluated all combinations of four ensemble-capable predictors: ViennaRNA, RNAstructure, LinearFold and EternaFold. After parameter optimization on a structurally diverse 25-RNA training set and validation using multiple scoring schemes, the best configuration combined EternaFold as ensembler and RNAstructure as predictor. Across external benchmark datasets (RNAstrand, ArchiveII and bpRNA) and the experimentally derived eFold dataset, FoldARE achieved the highest accuracy. Beyond prediction, FoldARE provides modules for ensemble-focused comparative analysis, including pairwise and multi-tool consensus assessment, per-nucleotide variability metrics, and interactive visualizations. Notably, it also supports the evaluation of m6A modification effects on structural ensembles. FoldARE is freely available on GitHub (https://github.com/TebaldiLab/FoldARE) and as a web-accessible version (https://rdds.it/foldare/).
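The first step described above can be sketched directly from dot-bracket strings: count how often each position is unpaired across the ensemble, then map those frequencies to pseudo-reactivities. The abstract does not give the weights, thresholds or output values of the weight-and-threshold scheme, so everything numeric below is a placeholder assumption, as are the function names. The real implementation is in the FoldARE repository.

```python
def unpaired_frequency(structures, weights=None):
    """Per-nucleotide weighted frequency of '.' across dot-bracket structures."""
    if weights is None:
        weights = [1.0] * len(structures)
    length = len(structures[0])
    totals = [0.0] * length
    for structure, weight in zip(structures, weights):
        if len(structure) != length:
            raise ValueError("all structures must have the same length")
        for position, symbol in enumerate(structure):
            if symbol == ".":  # unpaired nucleotide
                totals[position] += weight
    weight_sum = sum(weights)
    return [value / weight_sum for value in totals]


def pseudo_shape(frequencies, low=0.4, high=0.6):
    """Map frequencies to pseudo-reactivities (assumed scheme, not FoldARE's).

    At or below `low` -> 0.0 (treated as paired); at or above `high` -> 1.0
    (treated as single-stranded); in between the frequency is kept as is.
    """
    profile = []
    for frequency in frequencies:
        if frequency >= high:
            profile.append(1.0)
        elif frequency <= low:
            profile.append(0.0)
        else:
            profile.append(frequency)
    return profile


ensemble = ["((..))", "(....)", "......"]
print(pseudo_shape(unpaired_frequency(ensemble)))
```

The resulting profile would then be passed to a SHAPE-aware folding program as soft constraints. That second step depends on the chosen tool's input format and is not shown.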
bioinformatics | 2026-06-25 | v2

CellOS: Learning a World Model of Cellular State through Joint Embedding Prediction
Zhou, Q.; Le, Y.; Qi, X.; Chang, S.; Lu, H.; Wu, Y.; Wang, H.; Ran, R.; li, x.
Abstract
Foundation models learned from single-cell transcriptomes are central to the prospect of an AI virtual cell that can represent, query and predict cellular state. However, most current single-cell foundation models learn from a single view of gene expression and are optimized primarily through reconstruction or next-token prediction. As a result, they capture expression abundance but cannot explicitly reconcile complementary views of cellular state. Here we present CellOS, a multi-view foundation model that learns cellular representations from paired expression and perception views. CellOS integrates complementary views through a scalable three-stage training strategy that combines causal cell-sentence language modelling, function-preserving dense-to-mixture-of-experts expansion and latent-space alignment via an LLM-JEPA objective. Using this framework, we trained a 12-billion-parameter model on 390.5 million single-cell transcriptomes. Across diverse benchmarks spanning cell-state annotation, batch integration and perturbation-response prediction, CellOS consistently outperformed state-of-the-art single-cell foundation models. Together, these results suggest that predictive alignment between complementary cellular views provides a scalable path toward representation-centric cellular world models and transferable AI virtual cells.
bioinformatics | 2026-06-25 | v2

LNGCN: a distance-aware continuous-time graph protocol for prioritizing protein-protein interaction candidates
Xiao, Y.; Zheng, Y.; Hua, Y.; Peng, J.; Liu, J.; Qu, Y.; Xu, J.; Fu, R.; Qian, Q.; Zhao, M.; Zhang, X.; Zhao, J.; Yao, Y.; Kosar, M.; Ke, Y.; Chi, Y.
Abstract
Accurate high-throughput prediction of protein-protein interactions (PPIs) is essential for mapping cellular mechanisms and prioritizing experimental validation. Current graph-based methods often rely on discrete message passing, suffer from representation over-smoothing, and provide poorly calibrated confidence scores. We present LNGCN, a distance-aware continuous-time graph protocol that integrates residue-level structural graphs with liquid neural dynamics to model spatially heterogeneous interaction patterns. Residue radial distance is used as an explicit driving signal for continuous graph evolution, while hierarchical calibration converts raw model outputs into interpretable interaction probabilities. Across balanced, highly imbalanced, and cross-species benchmarks, LNGCN achieved robust predictive performance. Importantly, the calibrated scores supported biologically coherent prioritization in the FGF23-FGFR1c-Klotho complex, SHP2-associated signaling interactions and Tdk1 oligomeric-state-dependent binding. In a TPR-centered experimental case study, LNGCN recovered known TPR-associated partners and prioritized ELAVL1-TPR and RALY-TPR, whose physical interactions were subsequently confirmed experimentally. These results indicate that LNGCN can serve as a practical prioritization protocol for PPI candidates.
bioinformatics | 2026-06-25 | v2

DextraDemixer enables accurate identification of antigen-specific T cells from pMHC multimer experiments
An, Y.; Drost, F.; Bonafonte-Pardas, I.; Grotz, M.; Schober, K.; Schubert, B.
Abstract
Antigen specificity of T cells defines the adaptive immune response, yet the vast majority of known T cell receptors (TCRs) lack annotated antigen targets. Single-cell peptide-MHC (pMHC) multimer assays offer a scalable approach to map TCR-antigen interactions. Still, their utility is limited by pervasive non-specific binding and severe overlap between signal and noise, which confound the accurate identification of antigen-specific cells. To address these limitations, we present DextraDemixer, a Bayesian hierarchical mixture model that disentangles antigen-specific T cells from background noise in pMHC multimer data. The model integrates information from negative controls and clonotype structure while providing calibrated uncertainty estimates for classification. We further introduce a dynamic thresholding scheme that enables credible interval-bounded control of the false discovery rate. Extensive benchmarking on simulated datasets and antigen-specific spike-in experiments demonstrated the model's robustness and improved accuracy over established methods. In a longitudinal SARS-CoV-2 vaccine study, DextraDemixer identified antigen-specific TCRs characterized by high sequence similarity, elevated antigen-specificity prediction scores, and strong clonal purity. Annotations showed high concordance with external validation data and supported the identification of antigen-specific motifs. Overall, DextraDemixer provides a principled probabilistic framework for reliable identification of antigen-specific TCRs from single-cell pMHC-multimer assays.
bioinformatics | 2026-06-25 | v1

The RdRp Thumb-1 Pocket is a Conserved Target for Broad-Spectrum Antiviral Development
Woods, V.; Umansky, T.; Russell, S. M.; Gallay, P.; Smith, D.; Haders, D.
Abstract
Single-stranded RNA (ssRNA) viruses cause human diseases ranging from mild colds to deadly pandemics. Broad-spectrum non-nucleoside antivirals have been characterized as impossible to develop because allosteric binding sites are poorly conserved. The Thumb-1 allosteric site identified in HCV's RNA-dependent RNA polymerase (RdRp) governs an essential conformational change in the Λ1-loop required for polymerase initiation. The only approved Thumb-1 inhibitor, beclabuvir, has been shown to be inactive against a broad panel of non-HCV viruses, including poliovirus, rhinovirus, coronavirus, coxsackievirus, influenza virus, and HIV. It subsequently failed to inhibit SARS-CoV-2 despite favorable docking predictions. A conserved, homologous allosteric site on RdRp that spans multiple viral families has not been reported. Here, we demonstrate that the Thumb-1 pocket and its associated Λ1-loop are conserved across ssRNA viral families through comparative structural analysis and multiple sequence alignments. We demonstrate that beclabuvir's dependence on its indole C6 carbonyl to interact with the HCV-specific residue R503 and its C3 cyclohexyl chemistry restricts its activity to HCV. We validate the target discovery with MDL-001, which does not contain a C6 carbonyl or a C3 cycloalkyl substituent. MDL-001 directly blocks viral RNA synthesis in isolated replication complexes and selects for the canonical Thumb-1 resistance mutation P495S in HCV. MDL-001 demonstrates broad-spectrum in vitro inhibition of both HCV and SARS-CoV-2. Preclinical proof of concept and development of MDL-001 across HCV, HBV, HDV, influenza, SARS-CoV-2, and RSV are reported in a companion manuscript. These findings establish RdRp Thumb-1 as a conserved allosteric pocket and a druggable target for broad-spectrum antiviral development.
bioinformatics | 2026-06-23 | v5

OmniCell: Unified Foundation Modeling of Single-Cell and Spatial Transcriptomics for Cellular and Molecular Insights
Pang, J.; Qiu, P.; He, Y.; Deng, Y.; Tang, W.; Zhi, H.; Yan, J.; Li, B.; Lin, A.; Cao, L.; Teng, F.; Fang, S.; Li, S.; Deng, Z.; Zhang, Y.; Li, Y.; Li, S.; Xu, X.
Abstract
A cell's transcriptional programme is not fully defined by gene expression alone, but by the tissue context in which that programme is enacted. Single-cell RNA sequencing resolves molecular identity after dissociation, whereas spatial transcriptomics preserves tissue architecture but remains constrained by assay-specific sparsity and gene coverage. Here we present OmniCell, a tissue-contextual transcriptomic foundation model pretrained on 67 million dissociated and spatially resolved profiles. By integrating gene identity, expression magnitude and tissue context, OmniCell links transcriptional programmes to the cellular neighbourhoods and anatomical contexts in which they operate. OmniCell organised transcriptomes across molecular, cellular and tissue scales. It recovered cell-type-specific programmes and tissue-aligned gene modules, preserved robust cell-state structure across batches, species and rare populations, and improved the reconstruction of spatial cell identity, anatomical domains and cell-type composition. In human liver cancer Stereo-seq data, OmniCell resolved a tumour-margin transition zone characterised by immune infiltration, acute-phase inflammation, coagulation/complement activity and metallothionein-linked metal-ion detoxification. Contextual gene-embedding similarity analysis showed that gene relationships differed across tumour core, transition-zone and paratumour/adjacent non-malignant niches, indicating that OmniCell captures tissue-dependent gene function rather than expression similarity alone. In mouse brain development and macaque cortex, spatial virtual perturbations mapped regulatory genes onto stage- and region-specific anatomical programmes. Together, these results establish tissue context as a primary axis of transcriptomic representation and provide a framework for studying how cellular programmes acquire context-dependent biological meaning in intact tissues.
bioinformatics | 2026-06-23 | v3

GenoME: a MoE-based generative model for individualized, multimodal prediction and perturbation of genomic profiles
Wei, J.; Xue, Y.; Chai, H.; Gao, Y. Q.
Abstract
The non-coding genome operates through a complex, multiscale regulatory system where regulated gene expressions are closely associated with cell-type-specific histone modifications, transcription factor binding and 3D conformation. Developing computational models that can integrate these patterns to predict and interpret the regulatory system remains challenging. Here, we present GenoME, a Mixture of Experts (MoE)-based generative model that uses DNA sequence and cell-type-specific ATAC-seq signals to predict a unified genomic profile encompassing epigenomics, transcriptomics, and chromatin architecture at base-pair to kilobase resolutions. GenoME enables multiscale predictions for held-out genomic regions and, critically, generalizes to predict the full regulatory landscape of unseen or individualized cell types from a single ATAC-seq input. We equip GenoME with an in silico perturbation framework that accurately forecasts the multimodal consequences of genetic perturbations and identifies functional enhancer-promoter connections, outperforming specialized models like Activity-by-Contact. These predictions can also be used to decipher the transcription factor grammar of cell-type-specific enhancers. GenoME thus provides a versatile, all-in-one platform for generative modeling, cross-cell-type generalization, and causal mechanistic investigation of the multiscale regulatory genome.
bioinformatics2026-06-23v2Structural Pockets and Interacting RNA-Associated Ligands (SPIRAL): A DSSR-enabled Meta-Analysis of RNA-Small Molecule Recognition
Lu, X.-J.; Wang, Y.Abstract
Small molecules that target structured RNA hold therapeutic promise across a wide range of diseases, yet the structural principles governing RNA-ligand recognition remain poorly defined. We present SPIRAL (Structural Pockets and Interacting RNA-Associated Ligands), a curated database of 1,098 RNA-small molecule structures from the Protein Data Bank covering 1,137 ligand-binding events across six functional RNA categories. A customized pipeline built on DSSR (Dissecting the Spatial Structure of RNA) extracts structural interaction parameters from each complex, capturing stacking geometry, hydrogen-bond topology resolved by RNA moiety, groove engagement, and tertiary motif context. Unsupervised clustering of these fingerprints resolves six mechanistically distinct binding modes, the distribution of which is strongly governed by RNA functional class. To enable category-independent comparison of interaction quality across these diverse modes, we introduce the Composite Binding Quality Score (CBQS), a seven-metric framework that ranks riboswitches highest and regulatory RNA motifs lowest among the six categories. Across 275 affinity-characterized entries, C2'-endo sugar pucker count and total buried contact surface area emerge as the dominant predictors of binding affinity, converging with the structural features most underengaged by current regulatory RNA motif binders. SPIRAL provides a data-driven foundation for the rational design of next-generation RNA-targeted therapeutics.
bioinformatics2026-06-23v2HoloCell: A Generative Foundation Model for Holistic Cellular Modeling
Jiang, Q.; Li, Z.; Hu, B.; Bie, Y.; Li, K.; Li, Q.; Jin, P.; He, Y.; Deng, P.; Wang, Z.; Chen, X.; Qin, T.; Liu, H.; Jiang, R.; Yin, Q.Abstract
Single-cell multi-omics technologies have recently advanced to enable the profiling of epigenomic, transcriptomic, and proteomic layers within individual cells, offering new opportunities to characterize cellular states as integrated biological systems. However, developing a unified framework that can seamlessly integrate diverse omics modalities and remain robust to heterogeneous modality missingness remains challenging. Existing methods are often designed for specific modalities or modality pairs, relying on dataset-specific training or paired measurements. Here we present HoloCell, to our knowledge the first generative foundation model for joint representation learning and generative modeling across all three major single-cell omics modalities, i.e., epigenomics, transcriptomics, and proteomics. HoloCell contains over 860 million parameters and is pretrained on the Human-Multi-Omics-Corpus, which comprises approximately 468 million single-cell profiles across these three omics layers, corresponding to over 425 billion tokens. HoloCell introduces a simple yet biologically motivated hierarchical tokenization strategy that encodes cis-regulatory elements, genes, and proteins as structured tokens within a shared modeling framework. We evaluated HoloCell across single-omics representation learning, paired multi-omics integration, unpaired multi-omics alignment, and cross-modal generation via iterative diffusion and remasking, demonstrating its superior performance and flexibility across diverse omics tasks. From a representation perspective, HoloCell provides a unified digital mapping of cellular states across multiple omics layers, capturing cell heterogeneity as an integrated system. From a generation perspective, its iterative diffusion and remasking framework permits flexible generation orders beyond fixed left-to-right causality, enabling in silico simulation of multi-omics information flow. 
Together, these capabilities position HoloCell as a versatile foundation model toward the emerging concept of a virtual cell, offering both systematic characterization and generative simulation of cellular systems within a unified framework.
bioinformatics2026-06-23v2A tailored variant filtering procedure for multi-breed and multi-species unbalanced animal SNP collections
Lazzari, B.; Milanesi, M.; Talenti, A.; Bionda, A.; Li, Y.; Jiang, L.; Lenstra, J. A.; Bardou, P.; Tosser Klopp, G.; Crepaldi, P.; Colli, L.Abstract
Technological advancements and decreasing costs of whole-genome sequencing have generated a huge amount of resequencing data. Large datasets encompassing the molecular variation of several species and/or populations can now be assembled easily. However, these are extremely variable in terms of geographical provenance and sample size, with taxonomic groups represented by anywhere from a single entry to hundreds of entries. Consequently, the application of standard filtering approaches may bias the representation of groups or gene pools. Commonly adopted variant filtering approaches relying on minor allele frequency (MAF) and linkage disequilibrium (LD) are not adequate because of remarkable differences in LD structure and frequency of allele variants within datasets representing both local and global diversity of multiple populations and species. Thus, using data from the VarGoats 1000 goat genome project, we devised a novel approach which avoids the biases of the standard filtering procedures by adopting within-population subsampling, minor allele count (MAC) and marker spacing (bp-space) as filters. Starting from a quality-filtered dataset of >28M SNPs from 1372 animals, we generated a dataset of <14M markers and 750 individuals, complying with the initial requirements and facilitating further computational steps.
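As a rough illustration of the two marker-level filters named in this abstract, the Python sketch below keeps SNPs whose minor allele count (MAC) reaches a threshold and then thins them so that retained markers on the same chromosome are at least a minimum number of base pairs apart. The thresholds, the record layout and the function names are assumptions made for illustration only; the abstract does not state the values or the implementation used for VarGoats, and within-population subsampling is not covered here.

```python
from collections import Counter


def minor_allele_count(genotypes):
    """Count copies of the rarer allele among diploid, biallelic genotypes.

    genotypes: iterable of (a1, a2) tuples with alleles coded 0/1,
    or None for a missing genotype.
    """
    counts = Counter()
    for gt in genotypes:
        if gt is None:  # skip missing calls
            continue
        counts.update(gt)
    return min(counts.get(0, 0), counts.get(1, 0))


def filter_snps(snps, min_mac=2, min_spacing_bp=100):
    """Apply a MAC filter, then a bp-spacing filter (hypothetical thresholds).

    snps: list of dicts with keys 'chrom', 'pos' and 'genotypes'.
    """
    kept = []
    last_pos = {}  # last retained position per chromosome
    for snp in sorted(snps, key=lambda s: (s["chrom"], s["pos"])):
        if minor_allele_count(snp["genotypes"]) < min_mac:
            continue  # too few copies of the minor allele
        prev = last_pos.get(snp["chrom"])
        if prev is not None and snp["pos"] - prev < min_spacing_bp:
            continue  # too close to the previously retained marker
        kept.append(snp)
        last_pos[snp["chrom"]] = snp["pos"]
    return kept
```

Unlike a frequency threshold, a count threshold does not scale with sample size, which is presumably why it suits collections in which some groups contribute a single animal and others hundreds.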
bioinformatics2026-06-23v2Early Tracheal and Salivary miRNAs in Extremely Preterm Infants Predict BPD-related Pulmonary Hypertension
Li, T.; Zhang, S.; Aluquin, V.; Donnelly, A.; Stephens, H.; Sharma, S.; Hicks, S. D.; Liu, D.; Austin, E.; Siddaiah, R.Abstract
Bronchopulmonary dysplasia (BPD)-associated pulmonary hypertension (BPD-PH) in preterm infants carries high morbidity and mortality within the first two years of life. In a previous unbiased study, we identified a panel of miRNAs in tracheal aspirates (TA) that were differentially expressed in extremely low gestational age newborns (ELGANs) with BPD-PH compared to those with BPD but no PH. To explore the predictive potential of these miRNAs, we studied TA exosomes from 7-day-old ELGANs, analysed a curated panel of 16 miRNAs through logistic regression and calculated the predictive AUROC for diagnosing BPD-PH at 36 weeks PMA. The AUROC of TA miRNAs was 0.76, with sensitivity and specificity of 53% and 93%, respectively. Adding sex and gestational age to the variables improved the AUROC to 0.78, with sensitivity and specificity of 61% and 87%, respectively. Because TA is difficult to obtain in non-invasively ventilated infants, we collected saliva samples from ELGANs at 7 days of age, compared the log expression of these 16 miRNAs in both biofluids and found a significant correlation in their expression (Pearson r=0.92, p<0.001). We then calculated the predictive AUROC of the same miRNAs for diagnosing BPD-PH at 36 weeks PMA. The AUROC of these miRNAs in saliva was 0.85, with sensitivity and specificity of 82% and 72%, respectively; adding biological sex and gestational age improved the AUROC to 0.86, with sensitivity and specificity of 79% and 76%, respectively. Leave-one-sample-out sensitivity analysis demonstrated stable training performance with reduced performance in testing samples, supporting the need for validation in larger independent cohorts. In conclusion, early salivary miRNAs have great potential for stratifying ELGANs by risk of developing BPD-PH, while also providing the opportunity to identify target molecules and mechanisms that modulate molecular function.
bioinformatics2026-06-23v1CellOS: Learning a World Model of Cellular State through Joint Embedding Prediction
Zhou, Q.; Le, Y.; Qi, X.; Chang, S.; Lu, H.; Wu, Y.; Wang, H.; Ran, R.; li, x.Abstract
Foundation models learned from single-cell transcriptomes are central to the prospect of AI virtual cells that can represent, query and predict cellular state. However, most current single-cell foundation models learn from a single view of gene expression and are optimized primarily through reconstruction or next-token prediction. As a result, they capture expression abundance but cannot explicitly reconcile complementary views of cellular state. Here we present CellOS, a multi-view foundation model that learns cellular representations from paired expression and perception views. CellOS integrates complementary views through a scalable three-stage training strategy that combines causal cell-sentence language modelling, function-preserving dense-to-mixture-of-experts expansion and latent-space alignment via an LLM-JEPA objective. Using this framework, we trained a 12-billion-parameter model on 390.5 million single-cell transcriptomes. Across diverse benchmarks spanning cell-state annotation, batch integration and perturbation-response prediction, CellOS consistently outperformed state-of-the-art single-cell foundation models in cell-state annotation and perturbation-response prediction while preserving robust batch integration. Together, these results suggest that predictive alignment between complementary cellular views provides a scalable path toward representation-centric cellular world models and transferable AI virtual cells.
bioinformatics2026-06-23v1Systematic benchmarking of zero-shot utility and robustness in single-cell transcriptomic foundation models
Liu, T.; Feng, T.; Pan, X.; Chen, Y.; Ren, L.; Ye, X.; Sakurai, T.; Lin, H.; Zhang, Y.Abstract
Single-cell foundation models (scFMs) have been proposed as reusable representations for transcriptomic analysis, yet their practical utility and robustness when applied without task-specific fine-tuning remain incompletely characterized. Here, we systematically evaluated single-cell transcriptomic representations in zero-shot settings across 20 methods, 6 downstream tasks and 1,607 datasets comprising nearly 21.8 million cells. We characterized model behavior along three complementary dimensions: baseline utility, structural robustness, and dataset-level drivers of performance variability. Our large-scale analysis reveals a decoupling between utility and robustness: methods ranking highly on standard benchmarks often show marked instability under shifts in dataset structure. Furthermore, no single model performs uniformly well across tasks. In several tasks, classical statistical representations based on highly variable genes remain competitive under zero-shot conditions. Together, these results define the practical boundaries of zero-shot use in scFMs and provide a large-scale benchmark and decision framework for representation selection in single-cell genomics.
bioinformatics2026-06-23v1Learning interpretable structural similarity from tandem mass spectra for small molecule analog discovery
Piedrahita Giraldo, J. S.; Da Silva, K. M.; Zare Shahneh, M. R.; Wang, M.; Laukens, K.; De Vijlder, T.; Bittremieux, W.Abstract
Analog discovery remains a central bottleneck in mass spectrometry-based untargeted metabolomics, as conventional spectral similarity scores poorly reflect molecular structure. We introduce SIMBA, a transformer-based model that infers two interpretable graph-based distances, maximum common edge subgraph and substructure edit distance, directly from tandem mass spectra. SIMBA consistently retrieves structurally closer analogs than existing methods, enabling structure-aware small molecule identification beyond exact spectral matching.
bioinformatics2026-06-23v1biomeStat: Using Agentic AI for Scalable Genomic Epidemiology Demonstrated Through End-to-End Analysis of 1,000 Asian Dengue Virus Genomes
Ariyaratne, D.; Somaratna, N.; Malavige, G. N.Abstract
Genomic epidemiology workflows typically require expert curation of multiple specialized tools, extensive manual parameter tuning, and access to heterogeneous compute infrastructure. While standard generative AI models often hallucinate in complex biological domains, we introduce biomeStat: an autonomous AI agent that functions as a strict deterministic orchestrator. By automatically writing code to execute established bioinformatics tools in sandboxed environments, biomeStat dynamically provisions compute resources (CPU and GPU) and guarantees reproducibility, making it immediately useful for scientists without requiring command-line expertise. To demonstrate the platform, we performed a fully autonomous genomic epidemiology and structural analysis of 1,000 Dengue virus (DENV) genomes sampled from 16 Asian countries between 2000 and 2025. The agent seamlessly orchestrated phylogenetic reconstruction (IQ-TREE, TreeTime), Bayesian phylodynamics (BEAST2 via NVIDIA H200 GPU), selection pressure analysis (HyPhy), and structural mapping (PyMOL). The analysis was completed in under 24 hours of wall-clock time, revealing endemic stability (R_e ~1.0) and identifying 1,869 candidate immune escape sites structurally colocalized with B-cell and T-cell epitopes. Furthermore, the agent validated 176 highly conserved drug target residues across the viral replication complex, confirming that resistance-associated positions for emerging antivirals JNJ-1802 and NITD-688 remain absolutely conserved across all four serotypes. By bridging the gap between natural language intent and deterministic computational execution, biomeStat reduces weeks of expert effort into a single-session analysis with full methodological transparency.
bioinformatics2026-06-23v1VCBench: A Multi-Dimensional Benchmark for Single-Cell Foundation Models
Weidener, L. S.; Brkic, M.; Jovanovic, M.; Ulgac, E.; Meduri, A.Abstract
Single-cell foundation models are increasingly positioned as virtual cells, yet their capabilities are assessed by fragmented, largely single-task benchmarks that obscure where these models improve on simple baselines. VCBench addresses this by synthesizing four independent virtual-cell frameworks into seven capability dimensions: perturbation response prediction, cross-species universality, gene regulatory network (GRN) inference, modality integration, temporal dynamics, multi-scale integration, and in silico experimentation. Each dimension is assessed for operational testability under current architectures and datasets: five admit direct or proxy evaluation, while multi-scale integration and in silico experimentation are structurally untestable as end-to-end tasks. We evaluate five foundation models (Geneformer, scGPT, UCE, TranscriptFormer, Arc State) against pre-registered linear and nearest-neighbor baselines across the five testable dimensions, and report three findings. First, the baselines match or exceed every foundation model on four of the five scored dimensions, replicating the reported competitiveness of linear baselines on perturbation prediction and extending it to cross-species transfer, GRN inference, and temporal ordering. Second, TranscriptFormer alone exceeds the strongest baseline on cross-modal RNA-to-protein prediction (53% Pearson improvement, with a documented contamination caveat) and is the only model to reach Level 2 in the pre-registered Virtual Cell (VC) Level rubric; the architectural choice behind this advantage simultaneously causes a spectral collapse that destroys its temporal-ordering performance, a tradeoff invisible to single-task benchmarks. Third, no foundation model publishes a complete cell-level training manifest, leaving data contamination undetectable to users. 
Alongside the benchmark, VCBench releases a Contamination Reporting Schema and contributes two further methodological tools: a common-label-set protocol that controls for class-count confounds in cross-species transfer, and a spread-error correlation probe for epistemic calibration.
bioinformatics2026-06-23v1Comorbidity structure as an inductive bias: Comparing output-head designs for multi-label prediction of diabetes and myocardial infarction complications
Asumboya, W. A.; Agbenorhevi, P. K.; Adams, C. F.; Ayariga, D. A.; Adjadeh, T.; Adams Ziblim, S.; Kwofie, S. K.Abstract
Background: Clinical complications are often predicted with separate sigmoid outputs, even when the target labels arise from related pathophysiological processes. This paper asks whether output-layer choice should reflect both predictive convenience and the biological structure assumed among complications. The central premise is that label-dependence mechanisms are explicit hypotheses about comorbidity, not generic modelling additions. Methods: Output-head assumptions were compared across two clinically distinct multi-label prediction tasks. In Type 2 diabetes (T2D), six heads were evaluated for nephropathy, neuropathy, and retinopathy: independent baseline, linear additive, multiplicative, symmetric conditional random field (CRF), residual multilayer perceptron (MLP), and combined additive-multiplicative. In myocardial infarction (MI), four heads were evaluated for ventricular tachycardia, ventricular fibrillation, and atrioventricular block: independent baseline, linear additive, multiplicative, and symmetric CRF. All experiments used five training data fractions and seven independent seeds, with the same shared-backbone protocol within each disease setting. Results: In T2D, the symmetric CRF gave the most consistent improvement pattern, ranking highest at full data and at the two lowest data fractions while adding only three interaction parameters. At 20% training data, it was the only interaction head whose aggregate mean exceeded the independent baseline. The residual MLP, despite 123 interaction parameters, remained below the baseline across all T2D fractions. In MI, rankings changed across fractions: the multiplicative head led at 80% and 60%, the CRF led at 100% and 20%, and the baseline led at 40%. The combined additive-multiplicative head did not improve robustness in T2D and showed the largest negative baseline-relative deviations at lower fractions. Conclusions: The findings support a biology-guided view of output-layer design. 
A small constrained mechanism was most useful when its symmetry matched the shared microvascular structure of T2D, whereas the heterogeneous electrophysiology of MI produced no stable winner. Output-layer choice should therefore be reported and defended as an assumption about disease structure instead of a routine hyperparameter decision.
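To make the "three interaction parameters" of the symmetric CRF head concrete, here is a minimal Python sketch under a stated assumption: an Ising-style pairwise form in which three binary complication labels receive unary logits from a shared backbone and each unordered label pair shares one symmetric weight. The abstract does not give the exact parameterisation, so the functional form and the names `crf_marginals`, `unary` and `pair` are illustrative rather than the authors' implementation.

```python
import math
from itertools import product


def crf_marginals(unary, pair):
    """Marginal P(y_i = 1) under p(y) proportional to
    exp(sum_i u_i*y_i + sum_{i<j} w_ij*y_i*y_j).

    unary: list of per-label logits (e.g. from a shared backbone).
    pair:  dict {(i, j): w_ij} with i < j; one symmetric weight per label pair,
           so three labels give three interaction parameters.
    """
    n = len(unary)
    scores = {}
    for y in product((0, 1), repeat=n):  # enumerate all 2**n label vectors
        s = sum(u * yi for u, yi in zip(unary, y))
        s += sum(w * y[i] * y[j] for (i, j), w in pair.items())
        scores[y] = s
    m = max(scores.values())  # subtract the max for numerical stability
    weights = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(weights.values())  # partition function
    return [
        sum(wt for y, wt in weights.items() if y[i] == 1) / z
        for i in range(n)
    ]
```

With all pairwise weights set to zero this reduces to independent sigmoid outputs, which is the baseline the abstract compares against; a positive weight between two labels raises both marginals, encoding the hypothesis that the complications co-occur.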
bioinformatics2026-06-23v1Measuring peptide-MHC generalization to unseen alleles across both HLA classes
Mysore, V.Abstract
Reported peptide-MHC (pMHC) AUROCs of 0.85-0.95 overstate generalization to unseen alleles: because immunopeptidome data are dense on a few well-studied alleles and sparse on the rest, training and test sets come to share near-identical alleles, so the numbers partly reflect interpolation rather than extrapolation to new MHC grooves. This is a property of the data, not of any one method. We assembled an open, harmonized corpus of 5.8 million experimental measurements across both HLA classes and use it to control the leakage explicitly: alleles held out at the sequence and cluster level, peptide-disjoint splits, and provenance-matched negatives. On strictly novel alleles, generalization is in the high 0.7s rather than the 0.9s a conventional split returns. Against this benchmark we trained a predictor that spans both classes in one model and factors presentation into a peptide-only ligand-likeness term and an allele-specific term; it exceeds eight published predictors by per-allele ΔAUROC = +0.22 to +0.37 (p < 10^-9), most on the least-studied genes. Corpus, benchmark, and model are released.
bioinformatics2026-06-23v1Automated Segmentation of Prostatic Gold Fiducial Markers for MR-Only Radiotherapy Planning Using Multi-Modal Consensus Deep Learning
Stewart, A. W.; Goodwin, J.; Richardson, M.; Robinson, S. D.; O'Brien, K.; Jin, J.; Barth, M.Abstract
Purpose: To develop and evaluate a multi-model consensus deep learning approach for automated gold fiducial marker (FM) segmentation in T1-weighted prostate MRI. Materials and Methods: In this retrospective study, T1-weighted MRI and CT-derived reference standard segmentations were collected from 127 prostate cancer patients (all male; mean age, 70 years +/- 7 [standard deviation]; age range, 50-88 years; collected between October 2020 and January 2026) who each had three implanted gold FMs. A 3D U-Net was trained on 93 subjects using four random seeds to produce an ensemble. At inference, marker-class probability maps were averaged across models and the top three connected components selected. Performance was evaluated on 34 temporally held-out subjects (9 tuning, 25 test) using marker-level sensitivity and precision with exact (Clopper-Pearson) 95% confidence intervals (CIs). A model count ablation study was performed. The pipeline was deployed for on-scanner processing on Siemens MRI systems via the OpenRecon framework and as a browser-based application using WebAssembly, executing entirely client-side. Results: The four-model consensus achieved 96% (70 of 73) sensitivity and 95% (70 of 74) precision on 25 test subjects, with 29 of 34 (85%) subjects achieving perfect marker detection. Single models had a mean sensitivity of 84% (SD, 9%), improving to 96% with four-model consensus (SD, <1%). Conclusion: Multi-model consensus deep learning substantially improved FM segmentation reliability over individual models, achieving high sensitivity and precision using only routinely acquired T1-weighted MRI.
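For readers unfamiliar with the exact (Clopper-Pearson) interval mentioned in the Methods, the sketch below computes it with the Python standard library only, by bisecting on the binomial tail probabilities. It is an illustration of the general technique, not the authors' code; the function names are hypothetical.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(
        math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1)
    )


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) confidence interval for k successes in n trials."""

    def solve(f, target):
        # f is increasing in p on [0, 1]; bisect until f(p) == target
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if f(mid) < target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: the p at which P(X >= k) equals alpha / 2
    lower = 0.0 if k == 0 else solve(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2
    )
    # Upper bound: the p at which P(X <= k) equals alpha / 2
    upper = 1.0 if k == n else solve(
        lambda p: 1 - binom_cdf(k, n, p), 1 - alpha / 2
    )
    return lower, upper
```

Calling `clopper_pearson(70, 73)` gives the interval for the reported marker-level sensitivity and `clopper_pearson(70, 74)` the one for precision. In practice a statistics library would normally be used in place of the hand-rolled bisection.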
bioinformatics2026-06-23v1Model-based inference of gene expression noise from single-cell RNA-sequencing data
Giersdorf, F.; Rogers, D. W.; Christensen, S.; Dutheil, J. Y.Abstract
The heterogeneity of expression levels among genetically identical cells, termed gene expression noise, is a property of the gene expression process whose importance in the biology of organisms and their evolution is increasingly recognized. Measuring gene expression noise requires single-cell expression data, as obtained from single-cell RNA sequencing (scRNASeq). Its estimation, however, is challenging owing to (i) the presence of technical noise in addition to biological noise, and (ii) the heterogeneity of cell types in the sampled population. We propose a maximum-likelihood framework to infer biological noise from scRNASeq data, while accounting for technical noise, dropout probabilities, and distinct cell sequencing depths. Using simulations, we demonstrate parameter identifiability and show that the resulting noise estimates are uncorrelated with mean gene expression and therefore need no extra correction in downstream analyses, easing intra- and inter-genome comparisons. Using two technical replicates of scRNASeq data from the wild yeast *Saccharomyces paradoxus*, we show that expression noise can be inferred in a reproducible manner.
bioinformatics2026-06-23v1WITHDRAWN: Generating Structurally Diverse Therapeutic Peptides with GFlowNet
Wijaya, E.Abstract
The authors have withdrawn this manuscript because the submitter did not have the rights to agree to the distribution license at the time of submission. Therefore, the authors do not wish this work to be cited as reference for the project. If you have any questions, please contact the corresponding author.
bioinformatics2026-06-22v5WITHDRAWN: Distilling Protein Language Models with Complementary Regularizers
Wijaya, E.Abstract
The authors have withdrawn this manuscript because, at the time of submission, the submitter did not have the rights required to agree to the distribution license. Accordingly, the authors request that this work not be cited as a reference for the project. Please contact the corresponding author with any questions.
bioinformatics2026-06-22v4Proteomics-constrained deconvolution reveals spatial cell-type programs in tumours
Isik, E. B.; Haley, M. J.; Anbaki, A. A.; Bere, L.; Roncaroli, F.; Piper Hanley, K.; Couper, K.; Wedge, D. C.; Sellers, R.; Baker, A.; Oliveira, P.; Ashton, J.; Bristow, R. G.; Alvarez, M. A.; Georgaka, S.; Rattray, M.Abstract
Accurately resolving cell-type mixtures in spatial transcriptomics remains challenging, particularly in heterogeneous tumours where cell populations are intermixed and matched single-cell references may be unavailable or poorly aligned. Current deconvolution approaches either require high-quality scRNA-seq references, suffer from scalability limitations, or lack interpretability. We introduce PISTACHIO, a proteomics-informed spatial transcriptomics deconvolution framework based on constrained non-negative matrix factorization with a negative-binomial likelihood. Rather than using probabilistic priors, PISTACHIO incorporates spatial cell-type constraints derived from paired Imaging Mass Cytometry, enforcing biologically grounded sparsity and explicit spatial feasibility of cell-type presence. PISTACHIO improved recovery of spatial cell-type distributions compared with Cell2location and STdeconvolve across synthetic and real tumour datasets. Our approach remains robust under cell-type assignment errors, maintaining high correlation with ground-truth under moderate noise, and achieves fast runtime on standard hardware, enabling practical large-scale deployment.
bioinformatics2026-06-22v2ATLAS: a scverse-compatible package for multi-omic single-cell trajectory inference integration
Leclercq, A.; Martini, L.; Bardini, R.; Savino, A.; Di Carlo, S.Abstract
Single-cell trajectory inference is widely used to study cellular differentiation and fate decisions, yet most existing approaches rely on transcriptomic information alone, limiting their ability to capture the regulatory processes underlying cell-state transitions. This work presents ATLAS (Advanced Trajectory Learning from multi-omics At Single-cell resolution), a scverse-compatible framework for trajectory inference in paired single-cell RNA-seq and ATAC-seq data. ATLAS integrates transcriptomic and chromatin accessibility information through Weighted Nearest Neighbor graphs, enabling both molecular layers to jointly inform pseudotime estimation, terminal-state identification, and fate probability inference within a unified multi-omic representation. Across synthetic and real datasets, ATLAS reconstructs coherent developmental trajectories, captures progressive fate commitment, and resolves biologically meaningful lineage structures, demonstrating the effectiveness of multi-omic integration for characterizing cellular dynamics. In addition, ATLAS enables the joint exploration of transcription factor expression and target gene activity along pseudotime, providing direct access to regulatory programs and chromatin-associated transitions that are not detectable from transcriptomic data alone. Overall, ATLAS provides a scalable and biologically informative framework for studying dynamic cellular processes in single-cell multi-omics experiments.
bioinformatics2026-06-22v2WITHDRAWN: Agent-Guided Ranking Policy Improvement for Peptide Drug Candidate Prioritization
Wijaya, E.Abstract
The authors have withdrawn this manuscript because the submitter did not have the rights to agree to the distribution license at the time of submission. Therefore, the authors do not wish this work to be cited as reference for the project. If you have any questions, please contact the corresponding author.
bioinformatics2026-06-22v2WITHDRAWN: Preprint Commons: A platform for the systematic tracking of preprint trends and impact
Behera, B. P.; panda, B.Abstract
The authors have withdrawn their manuscript because it was posted without the consent of all authors. Therefore, the authors do not wish this work to be cited as reference for the project. If you have any questions, please contact the corresponding author.
bioinformatics2026-06-22v2From hotspot dependence to distributed robustness in resistance-aware lead optimization
Wang, Y.; Xiao, B.; Kang, J.; Cui, H.; Fu, Y.; Li, W.; Perea, S. E.; Han, W.Abstract
Drug resistance remains a recurrent failure mode in targeted anticancer and antiviral therapy, and resistance evidence often enters only after compound selection. ResistAgent is an evidence-constrained framework that converts mutational liabilities into design-time objectives through site- and combo-aware resistance mapping, deterministic mechanism diagnosis and robust counter-design. In EGFR-Erlotinib and HIV-RT-Rilpivirine, the framework separated residue-level liabilities from observed HIV combination liabilities and linked prioritized mutations to anchor loss, pocket rearrangement, electrostatic shifts and contact redistribution. Same-budget paired searches showed that robust objectives changed lower-tail mutant-panel behavior and interaction-dependence profiles while prioritizing robustness over average-affinity behavior. Under predefined liability panels, selected robust-best trajectories shifted support away from mutable hotspot contacts toward more distributed interaction networks. Supplementary physical summaries and ranking-first benchmarks support the scope of this resistance-aware design strategy while preserving clear boundaries for prospective validation.
bioinformatics2026-06-22v1Reference-guided immune recovery matching prioritizes traditional Chinese medicine ingredients
Hu, C.; Xiao, B.; Chen, C. Y.-C.Abstract
Therapeutic prioritization from single-cell transcriptomes requires a target that is closer to treatment response than disease-signature reversal. In immune diseases, post-treatment recovery may follow patient- and cell-type-specific trajectories rather than a simple return along the pretreatment disease axis. We developed ImmuneNavi, a healthy-reference-anchored recovery-matching workflow for ranking traditional Chinese medicine ingredients from paired PBMC data. The workflow maps heterogeneous PBMC cohorts to a common healthy immune coordinate system, constructs patient-cell-type disease and recovery states, and processes ITCM treated-control profiles into a fixed ingredient perturbation bank. Patient and ingredient states are represented in matched gene, pathway and transcription-factor views, allowing the model to combine local transcriptional direction with more stable program-level features. A matcher trained on one paired treatment cohort preserved recovery-aligned ingredient rankings in independent PBMC cohorts without redefining the feature space, candidate set or preprocessing procedure. This provides a reusable transcriptomic pipeline for moving from paired immune-state measurements to prioritized natural-product candidates for experimental follow-up.
bioinformatics2026-06-22v1Complex-valued representations of time-series gene expression profiles for network analysis
Sun, J.; Cao, W.; Ikumi, K.; Shimizu, K. K.; Sese, J.Abstract
Time-series RNA sequencing provides a powerful framework for studying dynamic gene regulation, yet conventional analyses usually represent gene expression profiles as real-valued vectors in Euclidean space and quantify similarity using correlation or distance. Inspired by quantum information theory, we present a framework for encoding time-series gene expression profiles as complex-valued vectors comprising amplitude and phase components in Hilbert space. We designed multiple encoding models to represent gene expression in the amplitude of complex-valued vectors, encode temporal differences in the phase, and extend the phase representation to incorporate the direction of local expression changes. Gene-gene similarity was then quantified using fidelity, which measures the overlap between two encoded vectors. Evaluation using time-series RNA-seq datasets across diverse species and biological contexts showed that different encoding models produced distinct fidelity distributions that were related to, but distinct from, conventional correlation measures. We then constructed gene-gene networks using pairwise fidelity values and detected communities containing genes with similar temporal profiles. Although fidelity distributions differed across encoding models, the resulting communities captured major temporal expression programs, and functional annotations based on Gene Ontology and Kyoto Encyclopedia of Genes and Genomes pathway analyses provided exploratory biological context. The detected communities were comparable to those obtained using conventional methods, including weighted correlation network analysis and fuzzy c-means clustering. Furthermore, as a proof-of-concept, we performed SWAP-test circuit simulations to mimic fidelity computation on a quantum computer; under noise-aware conditions, these simulations produced less accurate fidelity estimates with higher computational cost than classical computation. 
As a proof-of-concept, this study provides a complementary view of temporal transcriptome organization, rather than a uniformly superior alternative to conventional methods.
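A minimal sketch of the fidelity idea described in this abstract, under assumptions that are not taken from the paper: expression values are non-negative with at least one positive value, amplitudes are the square roots of sum-normalised expression, and phases are the point-to-point differences rescaled to at most `phase_scale` radians. The names `encode`, `fidelity`, and `phase_scale` are hypothetical, and the authors' actual encoding models may differ.

```python
import numpy as np

def encode(expr, phase_scale=np.pi):
    """Encode a non-negative time series as a unit-norm complex vector.

    Assumed (illustrative) encoding: amplitude carries the expression level,
    phase carries the scaled difference from the previous time point.
    """
    expr = np.asarray(expr, dtype=float)
    # Amplitudes: squared magnitudes sum to 1, as for a quantum state vector.
    amp = np.sqrt(expr / expr.sum())
    # Temporal differences; the first time point gets a difference of 0.
    diffs = np.diff(expr, prepend=expr[0])
    max_abs = np.max(np.abs(diffs))
    if max_abs > 0:
        phase = phase_scale * diffs / max_abs
    else:
        phase = np.zeros_like(diffs)  # flat profile: no phase information
    return amp * np.exp(1j * phase)

def fidelity(a, b):
    """Fidelity of two pure states: squared magnitude of their inner product."""
    return float(np.abs(np.vdot(a, b)) ** 2)  # vdot conjugates its first argument

# Usage: compare a rising profile with itself and with a falling profile.
rising = encode([1, 2, 4, 8])
falling = encode([8, 4, 2, 1])
print(fidelity(rising, rising), fidelity(rising, falling))
```

Because both vectors have unit norm, the fidelity lies between 0 and 1, and a profile compared with itself scores 1 up to floating-point error.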
bioinformatics 2026-06-22 v1
PhaseWY: A pipeline for haplotype phasing, sex chromosome identification and extraction of sex-limited sequences
Ellerstrand, S. J.; Churcher, A. M. J.; Kutschera, V. E.; Hansson, B.
Abstract
Sex chromosomes are central to many ecological and evolutionary processes. Evidence has accumulated that sex chromosome systems vary extensively in age, turnover and transitions, motivating renewed efforts to study the diversity of sex chromosome systems across the tree of life. However, successful genomic detection of sex chromosomes depends on several factors, including the size and divergence time, background genetic diversity, and the number of sequenced females and males. In addition, technical challenges associated with sequencing and analysing the sex-limited Y/W chromosome remain. Here, we present PhaseWY, an automated Snakemake pipeline that uses whole-genome sequencing data from multiple female and male individuals to identify sex-chromosomal regions and extract the corresponding Y/W sequences. PhaseWY (i) detects sex differences in alignment depth, (ii) applies read-based and statistical haplotype phasing, (iii) identifies sex-linked regions using haplotype clustering, and (iv) subsets autosomal, X/Z- and Y/W-linked variants for downstream analyses. We applied PhaseWY to simulated data to benchmark factors influencing sex-linkage detection and successful extraction of Y/W-linked variants. To demonstrate its practical utility, we further applied PhaseWY to the neo-sex chromosome system in Alauda larks (Alaudidae) and performed a range of downstream analyses demonstrating the scope of applications of the PhaseWY output. We conclude that PhaseWY provides an easy-to-use and reproducible tool for population-genomic analyses in non-model organisms, with particular importance for advancing our understanding of sex-chromosome evolution.
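Step (i) of the abstract above, detecting sex differences in alignment depth, can be illustrated with a short sketch. This is not PhaseWY's implementation: the function name `log2_depth_ratio`, the per-sample median normalisation, and the `pseudocount` value are all assumptions chosen for illustration, and the input is assumed to be per-window mean depths already computed for each individual.

```python
import numpy as np

def log2_depth_ratio(female_depth, male_depth, pseudocount=0.01):
    """Per-window log2 ratio of normalised female to male alignment depth.

    Inputs are arrays of shape (samples, windows). Each sample is divided by
    its own median depth so that library size does not drive the ratio.
    """
    f = np.asarray(female_depth, dtype=float)
    m = np.asarray(male_depth, dtype=float)
    f_norm = f / np.median(f, axis=1, keepdims=True)
    m_norm = m / np.median(m, axis=1, keepdims=True)
    # The pseudocount keeps windows with zero depth in one sex finite.
    return np.log2((f_norm.mean(axis=0) + pseudocount) /
                   (m_norm.mean(axis=0) + pseudocount))

# Usage with toy data for an XY system: three autosomal windows,
# one X-linked window, one Y-linked window.
females = [[30, 30, 30, 30, 0], [60, 60, 60, 60, 0]]
males = [[30, 30, 30, 15, 15], [40, 40, 40, 20, 20]]
print(log2_depth_ratio(females, males))
```

In this toy XY example, autosomal windows give ratios near 0, the X-linked window gives a ratio near 1, and the Y-linked window gives a strongly negative ratio; in a ZW system the signs would be reversed.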
bioinformatics 2026-06-22 v1
When Less Is Not More: DICEPro Mitigates the Impact of Incomplete Reference Matrices on Cellular Frequency Deconvolution.
BA, K.; Thiebaut, R.; Hinaut, X.; Hejblum, B. P.
Abstract
Cellular deconvolution aims to estimate the frequencies of different cell populations from gene expression measurements in a biological sample. Supervised approaches, such as CIBERSORTx and DISSECT, critically depend on the reference signature matrix, which encodes the gene expression profiles of cell-types based on prior knowledge. Despite numerous deconvolution methods, the impact of missing cell populations in the reference matrix remains understudied. Here, we evaluate the robustness of state-of-the-art deconvolution approaches using simulations based on real dataset examples combined with statistical modeling, validated against published data, and multiple real benchmark datasets. Results show that deconvolution performance remains stable when the reference matrix includes most cell-types, but declines sharply as the matrix becomes incomplete, especially for abundant cell populations. To address the limitations of incomplete reference matrices, we introduce DICEPro, an optimization-based framework designed to enhance existing deconvolution methods. By systematically adjusting the reference signatures, DICEPro better accounts for missing or underrepresented cell populations, leading to improved precision and robustness. We show that DICEPro consistently boosts deconvolution performance across both simulated datasets, derived from real data examples, and multiple real biological datasets, offering a practical solution when standard methods are hindered by incomplete references.
bioinformatics 2026-06-22 v1
Benchmarking cell type annotation in spatial transcriptomics: resolving cellular hierarchies, biological fidelity, and dynamic cell states
Zhu, Y.; Hu, Y.; Xie, M. B.; Qin, H.; Szul, Z. J.; Young, D. M.; Yuan, W.; Wang, Q.; Liu, Y. H.; Shen, W.; Meltzer, S.; Zhou, X. M.
Abstract
Spatial transcriptomics enables the quantification of gene expression within its native tissue context, providing unprecedented insight into tissue architecture, cellular ecosystems, and local cell-cell interactions at regional and single-cell resolution. Accurate cell type annotation is a critical prerequisite for interpreting these data and is often the first and most essential step in downstream analysis. Despite rapid advances in computational methods, cell type annotation remains challenging and frequently requires extensive expert-driven manual curation based on marker-gene expression, spatial context, and prior biological knowledge. While early approaches relied primarily on transcriptional similarity, newer methods increasingly incorporate spatial information, histological features, and multimodal data to improve annotation accuracy. Nevertheless, reliable annotation remains difficult when biological interpretation requires fine-grained subtype resolution, particularly for platforms with limited gene panels, tissues undergoing dynamic cellular state transitions, and studies in which reference and query datasets differ substantially in biological context or technical modality. Here, we present a systematic benchmark of 20 state-of-the-art cell type annotation methods across four spatial transcriptomics datasets spanning diverse technologies, experimental conditions, cell numbers, and gene panel sizes. Importantly, all benchmark datasets contain expert-curated cell type labels, including well-resolved cell populations and subtype annotations, providing high-quality biological ground truth for evaluation. The benchmark encompasses both reference-based and reference-free methods representing a broad range of computational frameworks. 
Performance was assessed using conventional classification metrics, including accuracy and F1-based measures, together with structure-aware metrics that evaluate both cell-level annotation accuracy and preservation of higher-order biological organization. Across datasets, annotation performance varied substantially according to tissue context, reference-query similarity, and annotation granularity. Fine-grained subtype annotation and recovery of rare cell populations remained challenging for many methods, particularly in datasets capturing injury, repair, developmental, and regenerative processes characterized by continuous cellular state transitions. Notably, high classification accuracy did not necessarily correspond to preservation of global cellular relationships or biologically coherent downstream pathway and gene-set enrichment analyses. Overall, scANVI, Seurat, and TACCO consistently ranked among the top-performing methods, although their relative advantages were context dependent. Together, our results provide a comprehensive assessment of current annotation strategies for spatial transcriptomics and offer practical guidance for selecting methods that best align with specific biological questions, dataset characteristics, and analytical priorities.
bioinformatics 2026-06-22 v1
CellTosg2Sequence: A Unified Text-Omics-Signaling-Graph Large Language Model for Single-Cell Analysis
Chen, W.; Ye, M.; Xu, T.; Huang, D.; Zhang, H.; Li, H.; Li, W.; Chen, Y.; Payne, P. R.; Li, F.
Abstract
In single-cell (sc)-based scientific discovery, text-formatted biomedical prior knowledge and signaling graphs are essential for annotating and interpreting numeric sc-omics data and for generating novel testable hypotheses. A major limitation of existing single-cell large language models (scLLMs) is that they rely on numeric expression data with gene names as the only textual signal, while comprehensive biomedical priors -- cellular localization, gene function, disease associations, and signaling interaction patterns -- remain absent from the model input. We introduce CellTosg2Sequence, a textual-prior- and signaling-graph-augmented cell-omics-sentence language model. A lightweight heterogeneous graph encoder maps a curated 62,507-node biomedical knowledge graph (KG) into compact virtual tokens that are prepended to each cell sentence, allowing the language model to condition on biological structure with minimal sequence-length overhead. We train CellTosg2Sequence with a three-stage objective: Stage I anchors the KG channel under autoregressive language-model pretraining, leveraging Qwen2.5-32B's own language reasoning for rapid KG alignment; Stage II aligns labels via supervised fine-tuning with KG-anchored InfoNCE; Stage III applies Group Relative Policy Optimization (GRPO) with an ontology-hierarchy reward, enabling free-generation cell-type prediction that generalizes beyond the closed training vocabulary. Across multiple benchmarks and ablation experiments, CellTosg2Sequence outperforms strong baselines. All results are achieved with lightweight LoRA training and a single unified checkpoint.
bioinformatics 2026-06-22 v1
EventHorizon: A Foundation Model for Clinical Flow Cytometry
Medina Grespan, M.; Morrison, M.; O'Fallon, B.; Shean, R.; Spies, N. C.; Ng, D.
Abstract
Flow cytometry is an essential tool for diagnosis of hematologic malignancies, but existing clinical workflows are highly dependent on expert manual interpretation. Existing machine learning approaches typically require extensive labeled data and are sensitive to variability in panel design, instrumentation, and laboratory workflows, limiting their generalizability. We present EventHorizon, a self-supervised foundation model for clinical flow cytometry that produces unified specimen-level representations from heterogeneous multi-panel data. EventHorizon employs a two-stage hierarchical transformer architecture with marker-aware tokenization, enabling seamless integration of cells measured across different antibody panels into a single shared latent space. We pre-train the model using a DINO-inspired self-distillation strategy with a variety of flow cytometry-specific augmentations on a dataset of more than 100,000 clinical specimens across 17 distinct panels. We evaluate the resulting embeddings on three clinically relevant classification tasks spanning common and rare panels, demonstrating that simple k-nearest neighbor probing of frozen EventHorizon embeddings achieves performance comparable to a fully supervised baseline model and a prior panel-specific self-supervised model. To ensure EventHorizon is not simply shortcut learning on features such as the markers/panels run for a given specimen, we perform a graph-theoretic analysis of EventHorizon's latent space which argues that specimen embeddings are organized primarily by biological diagnosis. Taken together, these results demonstrate that EventHorizon produces biologically meaningful, panel-agnostic specimen representations from clinical flow cytometry data which, with further development and validation, could provide a potential basis for scalable, reproducible diagnostic support across diverse clinical laboratory settings.
bioinformatics 2026-06-22 v1
πDIA-CLIP: efficient identification of highly heterogeneous proteomics data via a generalized zero-shot framework
Liao, Y.; Li, Y.; Xiao, Z.; Miao, C.; Yi, T.; Zhao, X.; Zhang, Y.; Wen, H.; E, W.; Chang, C.; Zhang, W.
Abstract
Data-independent acquisition (DIA) mass spectrometry has increasingly emerged as a cornerstone for characterizing highly heterogeneous biological systems, such as single-cell proteomics, metaproteomics, and spatial proteomics, offering unparalleled identification depth and quantification reproducibility. Current DIA analysis frameworks, however, require semi-supervised training within each run for peptide-spectrum match (PSM) re-scoring, which is prone to overfitting and lacks generalizability across diverse species and experimental conditions. Here, we present πDIA-CLIP, a generalized framework that shifts the DIA analysis strategy from semi-supervised training to zero-shot cross-modal representation learning by integrating dual-encoder contrastive learning and encoder-decoder architectures to establish a unified, high-precision representation for spectral features and peptides. Notably, the generalized zero-shot nature of πDIA-CLIP facilitates an inference-only architecture, streamlining the analysis to achieve exceptional computational efficiency. Extensive evaluations across five distinct benchmarks demonstrate that πDIA-CLIP consistently outperforms existing tools, yielding up to a 44.6% increase in protein identification alongside a reduction in entrapment identifications of up to 52.5%. Furthermore, the enhanced identification depth facilitates the discovery of novel biomarkers and the elucidation of intricate cellular mechanisms.
bioinformatics 2026-06-21 v4
SIEVEseq: One-stop differential expression, variability, and skewness analyses using RNA-Seq data
Li, H.; Khang, T. F.
Abstract
RNA-Seq data analysis is commonly biased towards detecting differentially expressed genes and insufficiently conveys the complexity of gene expression changes between biological conditions. This bias arises because discrete count models cannot fully and independently parameterize the mean, variance, and skewness of gene expression distributions. Therefore, a unified statistical framework that simultaneously tests differential expression, variability, and skewness is needed. We present SIEVEseq, a statistical methodology that provides such a framework. SIEVEseq embraces a compositional data analysis strategy to transform discrete RNA-Seq counts into continuous form with a distribution well-fitted by the skew-normal distribution. Both parametric and nonparametric simulations show that SIEVEseq better controls the false discovery rate and Type II error than existing differential expression methods. Analysis of the Mayo RNA-Seq dataset for Alzheimer's disease demonstrates that gene sets with significant differences in mean, variance, and skewness between control and disease groups strongly predict disease state. Furthermore, functional enrichment analysis indicates that relying solely on differentially expressed genes identifies only part of the biological spectrum, whereas incorporating genes with differential variability and skewness reveals additional disease-related aspects. Cross-data and cross-methodology validation suggest the detected biological signals are genuine. The SIEVEseq R package and source code are available at: https://github.com/Divo-Lee/SIEVEseq.
bioinformatics 2026-06-21 v3
Hierarchical classification of immune cell transcriptomes at population-scale
Beltz, C.; Qiu, Z.; Sadowski, L.; Kraske, J. A.; Aggarwal, A.; Quintanal-Villalonga, A.; Manoj, P.; Littbarski, A.; Bajaj, S.; Meskauskaite, B.; Umeda, S.; Mazutis, L.; Rose, S. A.; Chan, J. M.; Nawy, T.; Nainys, J.; Chaligne, R.; de Stanchina, E.; Kaelber, K. A.; Cussigh, C. S.; Kallenberger, S. M.; Williams, A.; Jenzer, M.; Pompecki, T.; Kahle, S.; Hohmann, N.; Nussbaum, D. P.; Moss, N. S.; Ziv, E.; Berger, A. K.; Haag, G. M.; Springfeld, C.; Zschaebitz, S.; Hassel, J. C.; Debus, J.; Jaeger, D.; Iacobuzio-Donahue, C. A.; Ganesh, K.; Peer, D.; Ungerechts, G.; Rudin, C. M.; Huber, P. E.; Walle
Abstract
Accurate immune cell classification is essential for interpreting single-cell RNA sequencing (scRNA-seq) data. However, progress in automating cell type annotation is constrained by the lack of independent, high-resolution benchmarks, as routine data integration introduces statistical dependencies that inflate apparent model generalizability. Here, we present the single-cell universal classification omnibus (Suco), a resource of independent, uniform expert annotations, and Compocyte, a modular hierarchical classifier. Together, they establish a framework that substantially outperforms existing classifiers while facilitating expert review of ambiguous annotations. Applying Compocyte across 50 studies, including three newly generated datasets, we classified 15.6 million leukocytes from 3,965 patients. Within this cohort, we identified a new tumor-associated resorptive macrophage phenotype, a non-canonical monocyte subtype in subclinical cytokine release syndrome, and the programmatic erosion of T cell memory stemness across metastatic sites. Suco and Compocyte thus provide a generalizable framework to uncover the principles governing human immunity at population scale.
bioinformatics 2026-06-21 v2
Antibody-Antigen Affinity Prediction with Chain-Aware Protein Language Modeling
Singh, H.; Malhotra, A.; Srivastava, S. P.; SINGH, R. K.; Gorantla, R.
Abstract
Motivation: Antibody-antigen affinity determines which antibodies advance in therapeutic discovery, repertoire analysis and affinity maturation, but experimental measurements are sparse relative to the scale of sequence libraries. Structure-based predictors can exploit interface geometry when reliable complexes are available, yet early discovery often requires ranking many heavy-light chain pairs against antigens for which no complex structure exists. Existing sequence-based models are scalable, but frequently compress heavy and light chains into a single antibody representation or concatenate antibody and antigen features, obscuring the chain-specific and epitope-specific signals that drive binding. Results: We present AbAffinity, a sequence-only, chain-aware three-stream architecture that maintains heavy chain, light chain and antigen as distinct streams. It integrates frozen ESM-2 embeddings with heavy-chain CDR-focused pooling, heavy-light self-attention, adaptive fusion gating and gated cross-attention, training only a compact interaction module. On the SAAINT-DB benchmark, AbAffinity achieves strong predictive performance under ten-fold cross-validation and maintains robust accuracy on novel antigens. It consistently outperforms recent sequence-based models across external benchmarks including SAbDab, AB-Bind and SKEMPI 2.0. Ablation studies highlight the contributions of chain-specific representations, CDR-focused pooling and the gated interaction pathway. Integrated Gradients attributions recover known paratope and epitope residues at structurally validated interfaces. AbAffinity provides a lightweight, explainable sequence-first framework for antibody triage and prioritisation when structural information is limited or unavailable.
bioinformatics 2026-06-21 v1