Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
Svirlpool: structural variant detection from long read sequencing by local assembly
May, V.; Hartmann, T.; Beule, D.; Holtgrewe, M.
Abstract
Motivation: Long-Read Sequencing (LRS), and Oxford Nanopore Technologies (ONT) in particular, has greatly improved the detection of structural genome variants (SVs). Fast alignment-based ONT callers achieve strong benchmark performance, but they necessarily reduce the read sequence to alignment-derived signals when deciding whether variants are shared across samples. This can be limiting for cohort and clinical analyses, especially for insertions and repeat regions where sequence representation matters. We present Svirlpool, a multi-sample SV caller for ONT data that builds local consensus assemblies of candidate SV regions and retains the assembled sequence up to the final joint-calling step, where merging tolerances are scaled by a reference-independent noise estimate derived from the reads. Results: We validated Svirlpool on two ONT family datasets: the recent high-quality HG002 Ashkenazi trio and the older Platinum Pedigree family, using the Genome in a Bottle and T2TQ100 benchmarks on the GRCh38, GRCh37, and CHM13v2 references and the Mendelian consistency of native multi-sample calls. We compare against current native joint callers and post-hoc merging workflows. Svirlpool produces highly Mendelian-consistent insertion calls in trio analyses (95.2% on GRCh38 and 95.1% on CHM13v2 at 30x), and on CHM13v2 it reaches the highest insertion and deletion consistency among all tested approaches. Sawfish and Sniffles achieve the highest SV benchmark F1 scores on recent high-quality ONT data, whereas Svirlpool enters the competition with more conservative SV calls. Svirlpool features native, sequence-aware joint calling with retained local consensus sequences and shows a very high Mendelian consistency with sequencing data from different batches and chemistries, which is a common situation in clinical application. Availability and Implementation: Source code, container images, and documentation available at https://github.com/bihealth/svirlpool
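The Mendelian consistency figures quoted above are trio-level rates. As a rough illustration only, the sketch below shows one common way such a rate can be computed for biallelic diploid calls. The genotype encoding (tuples of 0/1 alleles) and the function names are assumptions made for this example and are not taken from Svirlpool.

```python
from itertools import product


def is_mendelian_consistent(child, father, mother):
    """True if the child genotype can be built from one allele of each parent."""
    return any(sorted((a, b)) == sorted(child) for a, b in product(father, mother))


def mendelian_consistency(trio_calls):
    """Fraction of sites whose child genotype is explainable by the parents."""
    if not trio_calls:
        raise ValueError("no trio calls given")
    ok = sum(is_mendelian_consistent(c, f, m) for c, f, m in trio_calls)
    return ok / len(trio_calls)


# Hypothetical toy calls: (child, father, mother), 0 = reference, 1 = variant.
calls = [
    ((0, 1), (0, 0), (0, 1)),  # consistent: variant allele inherited from mother
    ((1, 1), (0, 1), (0, 1)),  # consistent: one variant allele from each parent
    ((1, 1), (0, 0), (0, 1)),  # inconsistent: father carries no variant allele
    ((0, 1), (0, 0), (0, 0)),  # inconsistent: variant absent from both parents
]
rate = mendelian_consistency(calls)
```

This sketch covers autosomal sites only and treats every inconsistent site as an error, so genuine de novo variants would also lower the rate.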
bioinformatics 2026-06-29 v3
NPTX2-Centered Cognitive Resilience Mechanisms in the Context of AD Pathology
Lao, Y.; Xiao, M.-F.; Ji, S.; Piras, I. S.; Kim, K.; Bonfitto, A.; Song, S.; Aldabergenova, A.; Sloan, J.; Trejo, A.; Geula, C.; Na, C.-H.; Rogalski, E. J.; Kawas, C. H.; Corrada, M. M.; Serrano, G. E.; Beach, T. G.; Troncoso, J. C.; Huentelman, M. J.; Barnes, C. A.; Worley, P. F.; Colantuoni, C.
Abstract
Background Cognitive resilience to Alzheimer's disease (AD) pathology is associated with preserved expression of NPTX2, an activity-regulated synaptic protein involved in circuit plasticity, excitation-inhibition balance, and complement-linked synapse regulation. However, the broader molecular programs coordinated with NPTX2 in resilient individuals remain unclear. Methods We analyzed postmortem middle temporal gyrus tissue using targeted PRM-MS proteomics in 135 individuals and bulk RNA-seq in an expanded 575-sample cohort. NPTX2-associated molecular coordination was assessed within cognitively normal low-pathology controls (CN-Lo), cognitively normal high-pathology controls (CN-Hi), mild cognitive impairment (MCI), and AD. Correlation-based approaches were applied using NPTX2 protein and NPTX2 mRNA expression as anchors to define resilience mechanisms in CN-Hi subjects. Results NPTX2 protein abundance was preserved across all controls regardless of age and pathology but reduced in MCI and AD. NPTX2 mRNA expression was also invariant across pathology within controls and reduced in MCI and AD but decreased markedly with age. Targeted proteomics identified NPTX2 relationships with synaptic and inhibitory-circuit proteins that were preserved across control groups, alongside CN-Hi-specific recruitment of trafficking, lysosomal, metabolic, and proteostasis-associated proteins. Transcriptome-wide correlations with NPTX2 revealed differences in gene co-expression between groups, identifying a prominent activity-dependent program including BDNF, VGF, SCG2, SST, SERTM1, DUSP4, and EGR4, that was preserved in both CN-Lo and CN-Hi subjects, while genes recruited to the NPTX2 network specifically in CN-Hi implicated immune, neuroprotective, translation, and proteostasis-related pathways. 
Coupling differential gene expression analysis with co-expression, we further identified five candidate resilience genes whose expression and NPTX2 correlation were preserved across controls but lost in MCI and AD: SST, MAL2, TAC1, SERTM1, and RFK. Expression of genes in distinct NPTX2 co-expression classes can be freely explored in our bulk RNA-seq data and other public AD transcriptomic datasets at NeMO Analytics. Conclusion Findings suggest that cognitive resilience in the context of AD neuropathology engages a coordinated molecular state distinct from both preserved cognition without pathology and MCI/AD, which is organized around preserved and selectively remodeled NPTX2 associations. Rather than reflecting broad transcript abundance changes, resilience was characterized by maintained synaptic and inhibitory programs, and by adaptive proteostasis and trafficking pathways that distinguish resilient high-pathology individuals from low-pathology controls or symptomatic AD.
bioinformatics 2026-06-29 v3
The structural context of mutations in proteins predicts their effect on antibiotic resistance
Green, A. G.; Tasmin, M.; Vargas, R.; Farhat, M. R.
Abstract
In Mycobacterium tuberculosis, a prevalent and deadly pathogen, resistance to antibiotics evolves primarily through non-synonymous mutations in proteins. Sequence-based analyses are currently used to understand the genetic basis of antibiotic resistance, either via genotype-phenotype association or via signals of convergent evolution. These methods focus on primary sequence and often neglect other biological signals such as protein structural information. We hypothesize that integrating the structural context of mutations improves the prediction of effects on function and phenotype. We curate high-confidence structural annotations for the M. tuberculosis proteome from 1,371 crystallographic structures and 2,316 AlphaFold predictions, and combine the structures with mutations from over 31,000 clinical M. tuberculosis isolates. We demonstrate that mutations in proteins known to cause resistance are clustered in 3D space, even in proteins where inactivating mutations at any position are thought to cause resistance. We develop a statistic to search the M. tuberculosis proteome for signals of clustered mutations, finding over 450 proteins that display this signal, many of which have a known relationship with antibiotic resistance. We show that a supervised classifier trained on 3D distance to known resistance sites alone has an F1 score of 94.6% at classifying mutations as resistance-conferring across proteins. This work demonstrates that protein structure provides useful information for categorizing which variants may cause antibiotic resistance, even when the majority of structures are AI-predicted.
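The abstract does not define its clustering statistic, so the following is only a generic stand-in: a permutation test asking whether the mutated residues of one protein sit closer together in 3D than randomly chosen residues. The mean pairwise distance statistic, the function names, and the toy coordinates are all assumptions for illustration.

```python
import math
import random


def mean_pairwise_distance(coords):
    """Average Euclidean distance over all unordered pairs of 3D points."""
    pairs = [(a, b) for i, a in enumerate(coords) for b in coords[i + 1:]]
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def clustering_p_value(residue_coords, mutated, n_perm=999, seed=0):
    """Permutation p-value that mutated residues are unusually close together."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    positions = list(residue_coords)
    observed = mean_pairwise_distance([residue_coords[i] for i in mutated])
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(positions, len(mutated))
        if mean_pairwise_distance([residue_coords[i] for i in sample]) <= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical protein: 20 residues laid out on a line, 1 unit apart.
toy_coords = {i: (float(i), 0.0, 0.0) for i in range(20)}
observed, p_value = clustering_p_value(toy_coords, mutated=[0, 1, 2, 3])
```

In practice the coordinates would come from a crystal structure or an AlphaFold model, and at least two mutated residues are needed for the statistic to be defined.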
bioinformatics 2026-06-29 v2
eRNAformer enables genome-wide de novo mapping of enhancer-derived RNA loci
Yu, H.; Li, W.; Li, W.; Liu, Y.; Chen, Y.; Zhang, X.; He, S.; Chen, Z.; Wang, H.; Ni, J.; Gao, T.; Li, F.; Lu, L.
Abstract
Enhancer-derived RNAs (eRNAs) are critical regulators of gene transcription, yet their genome-wide annotation remains challenging. Here, we present eRNAformer, a multi-modal deep learning framework that integrates convolutional neural networks with transformers, specifically designed to capture long-range genetic features associated with bidirectional transcription. This approach enables de novo mapping of eRNA loci using DNA sequence and aggregated conventional RNA-seq data. When evaluated on ENCODE datasets, eRNAformer demonstrated high sensitivity and specificity in discriminating known eRNA loci from non-eRNA loci. Notably, the newly identified eRNA loci were enriched with evolutionarily constrained variants and genetic risk factors for complex diseases, and exhibited potential relevance for cancer therapy. Applied to GEO datasets, eRNAformer identified between 14,219 and 56,451 eRNA loci across multiple hematologic malignancies, facilitating the construction of a comprehensive eRNA database for blood cancers. We further identified and experimentally validated FOXO1e, a cluster of eRNAs located approximately 120 kb upstream of FOXO1, a known oncogene that drives the t(8;21) acute myeloid leukemia (AML) preleukemic program. Together, these findings establish eRNAformer as a powerful tool for genome-wide eRNA annotation, provide a valuable resource for eRNA studies in hematologic cancers, and underscore the functional importance of eRNAs in AML pathogenesis.
bioinformatics 2026-06-29 v2
SCiMS: Sex Calling in Metagenomic Sequences
Tran, H. N.; Kirven, K. J.; Davenport, E. R.
Abstract
Background: Host sex is a critical determinant of microbial community structure across many host species, influenced by hormonal profiles, physiology, and sex-stratified behaviors. Despite its importance, sex metadata is frequently missing in microbiome studies, including for animal-associated samples. Host chromosomal sex can be inferred from the host-derived reads present in metagenomic data, but existing genomic sex prediction tools rely on fixed coverage thresholds calibrated for human XY chromosomes and require relatively high host read counts, limiting their use on low host-biomass samples such as stool and on organisms with other sex-determination systems. Results: Here, we present SCiMS (Sex Calling in Metagenomic Sequences), a bioinformatic tool that leverages host-derived DNA within shotgun metagenomic data to predict host chromosomal sex, even at low host coverage. SCiMS uses a multinomial likelihood computed from observed read counts under each sex and reports chromosomal sex calls. Because the expected read distribution is derived directly from chromosome lengths and ploidy under each candidate karyotype, SCiMS applies to any organism with a heterogametic sex-determination system. We benchmarked SCiMS against existing tools on simulated metagenomic data, human metagenomic samples spanning multiple body sites, and metagenomic samples from seven animal species. SCiMS matched or outperformed existing tools, with a noticeable advantage under low host-read conditions. Conclusions: SCiMS provides an accurate, scalable, and cross-species generalizable solution for host chromosomal sex classification, even when host DNA is minimal. By enabling recovery of missing sex metadata, it serves as a quality-control tool for analyses in microbiome research. SCiMS is freely available at http://github.com/davenport-lab/SCiMS.
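To make the likelihood idea above concrete, here is a minimal sketch of a multinomial comparison between two candidate karyotypes. It is not the SCiMS implementation: the three read categories, the approximate human chromosome lengths, the small background term `eps` (added here so that stray reads on an absent chromosome do not give a zero likelihood), and all function names are assumptions for illustration.

```python
import math


def expected_proportions(lengths, copies, eps=1e-3):
    """Expected read fractions, proportional to chromosome length x copy number."""
    weights = {c: lengths[c] * copies[c] for c in lengths}
    total = sum(weights.values())
    k = len(weights)
    # Mix in a uniform background so zero-copy chromosomes keep nonzero mass.
    return {c: (1 - eps) * w / total + eps / k for c, w in weights.items()}


def log_likelihood(counts, probs):
    """Multinomial log-likelihood, dropping the term shared by all karyotypes."""
    return sum(n * math.log(probs[c]) for c, n in counts.items())


def call_sex(counts, lengths, karyotypes):
    """Return the best-scoring karyotype name and the scores of all candidates."""
    scores = {
        name: log_likelihood(counts, expected_proportions(lengths, copies))
        for name, copies in karyotypes.items()
    }
    return max(scores, key=scores.get), scores


# Approximate human lengths in bp, rounded; used only as illustrative values.
lengths = {"autosomes": 2_875_000_000, "X": 156_000_000, "Y": 57_000_000}
karyotypes = {
    "XX": {"autosomes": 2, "X": 2, "Y": 0},
    "XY": {"autosomes": 2, "X": 1, "Y": 1},
}
female_call, _ = call_sex({"autosomes": 9500, "X": 500, "Y": 1}, lengths, karyotypes)
male_call, _ = call_sex({"autosomes": 9650, "X": 260, "Y": 90}, lengths, karyotypes)
```

Swapping in the lengths and copy numbers of a ZW system would follow the same pattern, which is the sense in which a length-and-ploidy model generalizes across heterogametic organisms. Mappability and reference bias are ignored in this sketch.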
bioinformatics 2026-06-29 v2
Anatomy-Guided 3D Graph Networks for Couinaud Segmentation in Tumor Affected Livers
You, L.; Dang, H.; Wang, H.; Matta, E.; Zhou, X.
Abstract
Image-based liver Couinaud segmentation is designed to automatically provide the locations of suspicious objects in liver CT/MR images. Once this is achieved, physicians can be guided to the target slice and area where the suspicious node is located. However, conventional algorithms trained primarily on healthy liver images often fail to generalize to Hepatocellular Carcinoma (HCC) cases due to pathological structural distortions. In this work, we propose a robust two-stage framework that integrates a 3D UNet with a 3D Anatomical Structure-Guided Graph Convolutional Network (3D GCN). This two-stage strategy effectively isolates the liver volume to eliminate structural noise from neighboring organs, such as the spleen, allowing the framework to focus exclusively on the complex 3D anatomical relationships among the eight segments. To ensure the topological consistency required for global spatial reasoning, we implement a standardized preprocessing pipeline that normalizes liver-only volumes to exactly 50 frames along the z-axis. By combining a lightweight 3D UNet backbone with the 3D GCN for refined boundary reasoning, our model demonstrates superior generalization performance on unseen clinical datasets, achieving a mean Dice score of 0.828 in blind testing. By releasing our code and pretrained weights, we aim to provide the first publicly available deep learning resource for robust Couinaud segmentation.
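For readers unfamiliar with the metric, a mean Dice score over segment labels can be computed roughly as follows. This is a minimal sketch on flattened label lists; the function names and toy labels are assumptions, and the authors' evaluation code may differ (for example in how absent labels are handled).

```python
def dice_per_label(pred, truth, labels):
    """Dice coefficient 2|P ∩ T| / (|P| + |T|) for each label."""
    scores = {}
    for lab in labels:
        p = sum(1 for v in pred if v == lab)
        t = sum(1 for v in truth if v == lab)
        inter = sum(1 for a, b in zip(pred, truth) if a == lab and b == lab)
        if p + t == 0:
            continue  # label absent from both volumes: skipped in this sketch
        scores[lab] = 2 * inter / (p + t)
    return scores


def mean_dice(pred, truth, labels):
    """Unweighted mean of the per-label Dice scores."""
    scores = dice_per_label(pred, truth, labels)
    return sum(scores.values()) / len(scores)


# Hypothetical flattened voxel labels (0 = background, 1-3 = segments).
pred = [1, 1, 2, 2, 3, 0]
truth = [1, 2, 2, 2, 3, 3]
per_label = dice_per_label(pred, truth, labels=[1, 2, 3])
overall = mean_dice(pred, truth, labels=[1, 2, 3])
```

For Couinaud segmentation the label set would be the eight liver segments, with the mean taken over them.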
bioinformatics 2026-06-29 v2
Confounding effects of inferring gene co-expression networks from pooled data from different biological populations
Runghen, R.; Eliassi-Rad, T.; Bolnick, D. I.
Abstract
Weighted Gene Co-expression Network Analysis (WGCNA) is routinely applied to pooled datasets from multiple biological populations, genotypes, or treatment groups, implicitly assuming a shared module structure across groups. While the distortion of pairwise correlations by pooling heterogeneous groups is well established statistically, three aspects of this problem have received little systematic attention in the context of co-expression network analysis: the extent to which pooling disrupts the discrete module-level community structure inferred by WGCNA; whether this disruption is detectable from the global topology metrics researchers routinely report; and how prevalent the pooling practice is in published multi-group WGCNA studies. Using analytical toy examples and a four-scenario simulation framework, we address all three questions. Module preservation Zsummary scores declined progressively with between-population divergence, from full preservation under identical populations (mean median Zsummary = 25.2 ± 3.3, 95% interval 19.0-30.7 across 20 simulation replicates) to substantial disruption when both network structure and mean expression differed (mean median Zsummary = 11.9 ± 1.0, 95% interval 10.2-13.5). This disruption was undetectable from global topology metrics: modularity and clustering coefficient remained stable across all scenarios, while edge density was sensitive but non-specific. These findings were corroborated in an empirical reanalysis of divergent lake and stream stickleback transcriptomes, where merged analysis collapsed 26 lake-specific and 59 stream-specific modules into only 19 merged modules. A survey of 100 publications found that 78.7% (95% CI 69.4-87.9%) of multi-group WGCNA studies with sufficient methodological reporting used a single merged analysis. Results were robust across network sizes of 250-1,000 genes and rewiring rates of 10-50%.
We provide concrete recommendations including module preservation testing in both directions, population-specific baseline networks, and consensus WGCNA as a principled alternative.
bioinformatics 2026-06-29 v1
Can a Tissue-derived Progression Signature Accurately Predict Colorectal Cancer Stage Transitions in Blood?
Sarkar, P.; Sarkar, P.
Abstract
Colorectal cancer (CRC) is challenging to track because its molecular changes are very complex as the disease progresses, creating significant challenges for robust biomarker discovery. In this study, we developed a machine learning framework by integrating monotonic progression and the StepMiner approach. We conducted external validation to identify reproducible, consistent transcriptomic biomarkers associated with CRC progression. Gene expression datasets from publicly available GEO series were analyzed across four disease states: normal colon, adenoma, primary colorectal cancer, and metastasis. First, we identified genes with monotonic expression, then used the StepMiner approach to identify genes that act as 'switches' between stages. A balanced 74-gene signature was used for machine-learning classification with a Random Forest. External validation showed strong performance in tissue-based datasets. However, tissue-derived signatures performed poorly on plasma and blood-based datasets, highlighting biological differences between transcriptomic profiles. Cross-filtering between tissue-derived genes and blood expression datasets was performed, which resulted in the selection of a 62-gene blood-compatible signature. Leakage-free retraining on GSE164191 achieved a mean AUC of 0.868 with balanced precision. Functional enrichment analysis showed that these genes are highly active in cancer growth. Specifically, genes CBX3, S100A11, PDK4, NCOR1, and SOX4 demonstrated stable and reliable performance across the validation folds. Overall, our study presents a progression-aware transcriptomic framework for CRC biomarker discovery and demonstrates the importance of external validation. Additionally, we evaluate whether tissue-derived signatures can predict blood profiles. This proposed approach may help the future development of tissue-based diagnostics and minimally invasive liquid-biopsy strategies for CRC.
To ensure reproducibility, our proposed workflow was automated as a Nextflow pipeline. The tissue-derived model was deployed as an application utilizing Angular, ASP.NET Core, and Plumber (R).
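The 'switch' genes mentioned above are found by fitting step functions to stage-ordered expression. The sketch below shows only the core idea in a simplified one-step form: try every split point and keep the one with the smallest squared error. The published StepMiner method also considers two-step patterns and a significance test, both omitted here; the function name and the toy values are assumptions.

```python
def fit_one_step(values):
    """Find the split of an ordered series that best fits a single step."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two ordered values")
    best = None
    for k in range(1, n):  # step sits between index k - 1 and index k
        left, right = values[:k], values[k:]
        m1 = sum(left) / len(left)
        m2 = sum(right) / len(right)
        sse = sum((v - m1) ** 2 for v in left) + sum((v - m2) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, k, m1, m2)
    sse, k, m1, m2 = best
    return {
        "step_index": k,
        "mean_before": m1,
        "mean_after": m2,
        "direction": "up" if m2 > m1 else "down",
        "sse": sse,
    }


# Hypothetical expression of one gene, two samples per stage, ordered
# normal, adenoma, primary cancer, metastasis.
expression = [2.1, 1.9, 2.0, 2.2, 5.1, 4.9, 5.3, 5.0]
step = fit_one_step(expression)
```

Here the fitted step falls between the adenoma and primary cancer samples, which is the kind of stage transition a switch-like gene would mark.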
bioinformatics 2026-06-29 v1
Placental pathology, circadian biology, and pathogenesis of spontaneous preterm birth: a pilot study of human placental gene expression profiling using a targeted HTG transcriptome panel
Zhou, G.; Hoffmann, H.; Yamamoto, H. S.; Woods, K.; Adkins, M.; Barbieri, R.; Fichorova, R. N.
Abstract
BACKGROUND: Spontaneous preterm birth (sPTB) remains the foremost cause of neonatal morbidity and mortality worldwide. Although histologic chorioamnionitis (HCA) and placental vascular abnormalities are frequently observed in sPTB, the molecular cascades linking these lesions to labor initiation remain poorly understood. Emerging evidence implicates circadian dysregulation and trophoblast dysfunction as additional drivers of sPTB. OBJECTIVE: This study aims to map placental pathology to distinct transcriptomic functional signatures that may precipitate sPTB, delineate the contribution of circadian regulation - both core-clock genes and circadian transcription-factor target sets (TFTs) - to sPTB, and identify placental cell-type-enriched and developmental pathway signatures that differ between sPTB and term deliveries. STUDY DESIGN: We performed bulk RNA sequencing on 32 formalin-fixed, paraffin-embedded placental specimens from 12 selected women (9 sPTB and 3 Term) in the POUCH Study cohort. Samples were selected for white ethnicity, maternal age 23-33 years, and parity 1-4 to reduce heterogeneity within groups. An extraction-free HTG transcriptome panel assayed 19,398 protein-coding genes. Log2-fold changes of all genes were computed with limma adjusted for maternal age, gestational age, parity, placental region, placental pathology, and POUCHID (a clustering variable) for sPTB vs. Term and HCA/vascular lesion vs. no pathology (no placental pathology adjustment). Gene-set enrichment used 50 Hallmark sets (MSigDB) plus curated placental circadian, circadian TFT, cell-type, and developmental pathways or gene sets. RESULTS: sPTB placentas displayed a global suppression of metabolic, secretory, and immune pathways (e.g., protein secretion, oxidative phosphorylation, Interferon responses, Complement, ROS, MYC Targets, TGF-β, mTORC1, and Coagulation) while KRAS Signaling Down and EMT were up-regulated.
HCA-enriched sets (TNF/NF-κB, ROS, KRAS Up, IL-2/STAT5, Hypoxia, Interferon-γ) were up-regulated, with EMT and Notch remaining down. Vascular abnormalities alone showed up-regulation of 12 Hallmark sets - including TGF-β, TNF/NF-κB, ROS, pancreatic β-cell stress, Hypoxia, Oxidative Phosphorylation, EMT, and mTORC1 - while Notch was down-regulated. When HCA co-exists with vascular abnormalities, the Hallmark profile becomes more inflammatory, highlighting a synergistic exacerbation of innate immunity, oxidative stress, and programmed cell death with the 12 up-regulated sets (Complement, Interferon α/γ, TNF, ROS, Apoptosis, and Heme Metabolism). The exclusive downregulation of DNA Repair suggests compromised genomic integrity. Circadian gene-set analysis revealed an up-regulated Regulation of Circadian Sleep Wake Cycle in sPTB but down-regulation of the core clock pathway and suppressed circadian TF targets. Cell-type enrichment reveals increased trophoblast giant cells and IGFBP1-DKK1 positive fetal cells, with marked suppression of extravillous trophoblasts, syncytiotrophoblasts, villous cytotrophoblasts, and fetal myeloid cells. Placental developmental pathways were downregulated, indicating arrested trophoblast maturation. CONCLUSION: Our pilot analysis demonstrates that sPTB placentas exhibit a global suppression of metabolic, secretory, and immune-modulatory programs and maladaptive trophoblast remodeling, whereas HCA and vascular abnormalities drove distinct inflammatory or hypoxic signatures. The shared and opposing Hallmark pathways across phenotypes highlight distinct yet overlapping pathogenic mechanisms. Dysregulated circadian pathways, consistently downregulated transcription factor target gene sets, and trophoblast-specific signatures implicate circadian misalignment and impaired placental maturation as key contributors to preterm parturition.
These findings provide a mechanistic atlas linking placental pathology to sPTB and highlight potential targets for chronotherapeutic and cell type specific interventions.
bioinformatics 2026-06-29 v1
EnzyKAN: Protein Language Model Embeddings and Kolmogorov-Arnold Network Variants for Enzyme Commission Classification with a Proposed Electron-Transfer Physics Feature Framework
R, S.; Reddy, B. R. R.
Abstract
Motivation: Computational enzyme classification has previously utilised sequence homology features and protein language model embeddings. The Kolmogorov-Arnold Network (KAN) paradigm, which uses learnable edge functions rather than fixed ones, has shown promising results in biological sequence tasks. Results: We report a fully reproducible investigation of KAN variants for seven-class EC classification on up to 9,516 labelled sequences from the CLEAN benchmark (9,386 for language model experiments). In the sequence-only setting, fixed-basis KAN variants moderately outperformed an MLP baseline (macro F1 = 0.17-0.29). Using ESM-2 650M embeddings greatly improved results under 5-fold cross-validation: MLP macro F1 = 0.750 +/- 0.009, accuracy = 0.823 +/- 0.009; learnable SineKAN macro F1 = 0.716 +/- 0.023, accuracy = 0.788 +/- 0.019. MLP performed comparably but did not exceed conventional baselines. As an aside, we introduce but do not investigate an approach to EC oxidoreductase sub-classification through the use of a Marcus theory-based electron transfer feature framework. Availability: Code and result files are available at https://github.com/sanjuz-cas/ENZYKAN.
bioinformatics 2026-06-29 v1
Metrics for Distinguishing Biological and Interventional Change in AI Models
Ewing, M. A.
Abstract
Statistical and machine-learning models of longitudinal biological data evaluate change by comparing each new observation against the trajectory implied by prior observations, assuming the process generating that trajectory is stable. We use data substrate to mean the underlying structure of the longitudinal data that determines what any such model can recover, independent of its architecture or capacity. When the generating process changes, whether through a biological transition or through an external intervention, the prior trajectory ceases to be a valid reference, and extrapolated predictions can be confidently wrong with no internal signal that the reference has failed. A distinct and recognised difficulty is that biological change and interventional change, observed only through serial intertemporal comparison under an assumed trajectory, are readily conflated; existing approaches address this through causal assumptions or hidden-confounder models rather than from the data substrate itself. Here we ask whether the two can be distinguished at the substrate level, and we introduce two subject-level metrics that quantify the geometric signature an interventional change leaves in the data: Curvature Shift, the change in trajectory slope across the event, and Deformation Risk, the departure of post-event observations from the prior-trajectory reference. We evaluate the condition on longitudinal cognitive measurements from 309 human subjects in the Alzheimer Disease Neuroimaging Initiative (ADNI), a large longitudinal dataset containing two distinct, ex-ante-defined regime-change events in the same subjects: a biological transition and an intervention. 
A model extrapolating the pre-event trajectory assigned the wrong direction of change to roughly two-thirds of post-event observations (post-event sign accuracy 0.341 after the biological event and 0.350 after the intervention, against a chance value of 0.50); only 11% of post-biological-event and 12% of post-intervention readings remained concordant with prior dynamics, and a higher-capacity multilayer perceptron reproduced rather than resolved the error. Curvature Shift was 2.23-fold higher after the biological event (p = 4.4e-8) and 2.26-fold higher after the intervention (p = 7.4e-8), and the two metrics were coupled (rho = 0.500; 95% CI, 0.407 to 0.587). Findings replicated on an independent endpoint and survived propensity matching, permutation, and leave-one-out. The metrics detect, per subject, when the reference of a fitted model has stopped governing the data and whether the departure carries the geometric signature of an interventional change.
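The abstract describes the two metrics only in words, so the sketch below is one possible reading and not the author's definition. It assumes Curvature Shift is the absolute difference between least-squares slopes fitted before and after the event, and Deformation Risk is the mean absolute gap between post-event observations and the extrapolated pre-event line. All names and values are hypothetical.

```python
def fit_line(times, values):
    """Ordinary least-squares slope and intercept for one subject."""
    n = len(times)
    mt = sum(times) / n
    mv = sum(values) / n
    sxx = sum((t - mt) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("need at least two distinct time points")
    slope = sum((t - mt) * (v - mv) for t, v in zip(times, values)) / sxx
    return slope, mv - slope * mt


def curvature_shift(pre, post):
    """Assumed form: absolute change in fitted slope across the event."""
    slope_pre, _ = fit_line(*zip(*pre))
    slope_post, _ = fit_line(*zip(*post))
    return abs(slope_post - slope_pre)


def deformation_risk(pre, post):
    """Assumed form: mean absolute departure from the pre-event extrapolation."""
    slope, intercept = fit_line(*zip(*pre))
    return sum(abs(v - (slope * t + intercept)) for t, v in post) / len(post)


# Hypothetical (time, score) pairs for one subject around an event at t = 2.5.
pre_event = [(0, 30.0), (1, 29.0), (2, 28.0)]
post_event = [(3, 25.0), (4, 22.0), (5, 19.0)]
shift = curvature_shift(pre_event, post_event)
risk = deformation_risk(pre_event, post_event)
```

In this toy subject the decline steepens from 1 to 3 points per visit, giving a slope change of 2 and a mean departure of 4 points from the extrapolated pre-event line.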
bioinformatics 2026-06-29 v1
MxSure: a mixture model for inferring within-host substitution rates and transmission SNP thresholds
Khurram, Z.; Chaguza, C.; Kwambana-Adams, B. A.; Shao, Y.; Lawley, T.; Yong, M.; Davies, M. R.; Zarebski, A. E.; Tonkin-Hill, G.
Abstract
Quantifying short-term evolutionary rates of microbial genomes is essential for understanding the processes that shape within-host evolution and for establishing thresholds needed to track transmission. In studies of short-term evolutionary rates, samples are often collected from closely related clusters (e.g. longitudinally from the same host or from transmission pairs), with substantial time intervals separating genomes between clusters. Distinguishing strain replacement from persistence presents an additional challenge in these studies. In addition, many public health and metagenomic bacterial strain tracking pipelines output pairwise SNP distances rather than the multiple sequence alignments required by common substitution rate estimation pipelines. This makes it challenging to estimate within-host evolutionary rates in many commensal bacterial species that are difficult to culture and isolate. To address these challenges, we introduce MxSure, a tool for estimating substitution rates and transmission thresholds while accounting for strain replacement from pairwise SNP distance data, as commonly generated by transmission tracking and metagenomic analysis pipelines. We demonstrate the accuracy of MxSure through extensive simulations and by analysing species with previously estimated substitution rates from longitudinal metagenomic datasets. Using MxSure, we estimated within-host substitution rates and transmission SNP thresholds for multiple commensal bacterial species including Bifidobacterium longum and Bifidobacterium bifidum from a longitudinal study of the infant gut microbiome.
bioinformatics 2026-06-29 v1
Single-cell transcriptomics reveals chondrocyte state transitions and ECM remodeling in osteoarthritic knee cartilage
Bo, Z.; Xu, H.; Liang, Y.
Abstract
Osteoarthritis cartilage has heterogeneous chondrocyte states, yet their transitions remain unresolved from public single-cell data. We retrospectively reanalyzed a public knee cartilage single-cell RNA-seq dataset GSE255460 from 8 osteoarthritis and 3 non-osteoarthritis donors totaling 19 samples. After sample-wise quality control and doublet removal we performed batch-corrected clustering, chondrocyte subclustering with marker-based annotation, and trajectory inference using Slingshot. Regulatory chondrocytes were tested for osteoarthritis versus control differential expression, followed by Gene Ontology and KEGG enrichment with Benjamini-Hochberg false discovery rate <0.05, and protein-protein interaction hub screening. We retained 27,036 cells. Chondrocytes exhibited branching continuous states; regulatory cells localized near the main manifold and adjacent to inferred branches, suggesting a transition-adjacent state. In regulatory cells, osteoarthritis-upregulated genes were enriched for collagen-containing extracellular matrix organization, endoplasmic reticulum secretory/proteostasis, cell-matrix adhesion including focal adhesion, and TGFbeta/SMAD signaling. Protein-protein interaction analysis identified five high-connectivity hubs: COL5A1, COL5A2, COL6A1, COL1A2, and COL3A1. Our findings support a transition-adjacent regulatory program in OA with coordinated extracellular matrix remodeling and secretory/adhesion/TGFbeta signatures, nominating collagen hubs for validation.
bioinformatics 2026-06-29 v1
Context-dependent correlations mislead transcriptomic network inference in bulk and single-cell data
Asiaee, A.; Bombina, P.; McGee, R. L.; Reed, J.; Abrams, Z. B.; Abruzzo, L. V.; Coombes, K. R.
Abstract
Background. Correlation is the dominant input to co-expression module discovery and miRNA-target inference. Both rely on an implicit assumption: a Pearson coefficient pooled across heterogeneous samples, whether tissues, cancer types, or cell types, estimates one biologically meaningful quantity. Simpson's paradox makes this assumption fragile in principle, since between-group mean shifts can dominate or reverse within-group associations. How often this happens in real transcriptomic data has not been quantified. Results. Across 8,890 TCGA tumors from 31 cancer cohorts and 23,170,038 miRNA-mRNA pairs, 94.8% of pairs showed both positive and negative within-cohort correlations. Restricting to the high-variance domain of one million pairs, 13.3% of pooled correlations with |r_global| >= 0.2 reversed against the within-cohort majority at sign tolerance epsilon = 0.05. Heterogeneity was the rule rather than the exception (median I2 = 0.86, IQR 0.80-0.90), and 99.5% of pairs rejected equal correlation across cohorts at FDR < 0.05. Of 692,770 experimentally validated miRTarBase v10 targets measurable in our data, only 0.9% were uniformly negative across cohorts. The pattern recurred across modalities. In GTEx, 21.0% of pooled signs disagreed with the tissue majority, and 23.5% of pairs flipped sign after tissue-mean removal. In 10x PBMC scRNA-seq, 13.1% of gene-gene correlations flipped after cell-type-mean removal; in CITE-seq, 37.9% of protein-RNA pairs flipped under a joint WNN partition of cells. Refining context reduced reversal, though by how much depended on the partition: within BRCA, 5.5% of pairs reversed under molecular PAM50 subtypes versus 0.35% under clinical IHC receptor status, and refining T cells into transcriptome-defined subtypes cut PBMC reversal from 11.8% to 0.13%. Conclusions. A single pooled correlation coefficient can invert direction relative to its within-context constituents at rates that are not negligible. 
Correlations should be reported with their context: the within-context distribution, a heterogeneity statistic, and a diagnostic that separates between-context mean shifts from within-context association. We provide a small R interface that computes these summaries.
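The sign reversal described above is easy to reproduce on toy data. The sketch below is a Python illustration, not the authors' R interface: it compares a pooled Pearson correlation with the correlation after subtracting each group's mean. The group labels and values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def remove_group_means(values, groups):
    """Subtract each group's own mean from its members."""
    sums, counts = {}, {}
    for v, g in zip(values, groups):
        sums[g] = sums.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    return [v - sums[g] / counts[g] for v, g in zip(values, groups)]


# Two hypothetical contexts: within each, x and y are anti-correlated,
# but context B has much higher means for both variables.
groups = ["A", "A", "A", "B", "B", "B"]
x = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
y = [3.0, 2.0, 1.0, 13.0, 12.0, 11.0]

r_pooled = pearson(x, y)
r_within = pearson(remove_group_means(x, groups), remove_group_means(y, groups))
```

The pooled coefficient is strongly positive because it is dominated by the mean shift between contexts, while the mean-removed coefficient recovers the negative within-context association.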
bioinformatics 2026-06-29 v1
Retention, not flux: endpoint confounding caps computational prediction of peptide skin penetration, with a delivery-aware reframing
Komianos, N.; Prakash, P.
Abstract
Bioactive peptides are now central to cosmetic and dermatological actives, yet predicting whether a given sequence will reach its site of action in skin remains unsolved. We contend that the dominant framing, predicting a single binary "skin permeability" label from sequence, is ill-posed, and that this, rather than a shortage of modelling power, explains the field's stalled predictive performance. The scope of the claim is narrow: barrier-crossing propensity is a legitimate, learnable function of molecular structure, whereas the vehicle- and endpoint-agnostic binary label that the literature supplies is not. We support this with a first-principles analysis and a study of public-source data. First, the experimental endpoint most commonly reported, transdermal flux into a diffusion-cell receptor compartment (OECD Test Guideline 428), conflates two opposite outcomes (genuine deep delivery and undesired systemic transport) and is, for a cosmetic active, frequently a failure signal rather than a success signal. That receptor flux is an imperfect measure of cutaneous bioavailability is long established in dermatopharmacokinetics; our contribution is to show that the same confound, inherited through scraped labels, is what caps machine learning from sequence. Second, reported "permeability" is a property of the sequence x delivery-vehicle x measurement-compartment triad, two terms of which are usually unrecorded. Third, on public-source data, a physicochemical intrinsic-permeability estimate (Potts-Guy) carries no positive predictive signal for scraped penetration labels (grouped AUC 0.45, 95% CI 0.40-0.51); sequence-only classifiers plateau in the mid-0.70s with diminishing returns as labels accumulate (AUC 0.70-0.77); and the same descriptor pipeline on a clean single-endpoint membrane dataset scores materially higher (AUC 0.83, non-overlapping CI). 
Our proposed reframing separates barrier-crossing (data-driven, sequence-level) from depth-and-retention (physics-driven, delivery-aware) and treats intrinsic transdermal flux as a regulatory risk axis; we close by proposing a triad-annotated reporting schema and a seed benchmark.
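For readers unfamiliar with the Potts-Guy estimate named in the abstract, a minimal sketch follows. It assumes the commonly cited regression form log10 Kp [cm/s] = -6.3 + 0.71 log Kow - 0.0061 MW; the coefficients come from the general dermatopharmacokinetics literature, not from this preprint, and the function name and inputs are hypothetical.

```python
def potts_guy_log_kp(log_kow: float, mol_weight: float) -> float:
    """Return log10 of the skin permeability coefficient Kp (cm/s).

    Assumed form of the Potts-Guy regression:
        log10 Kp = -6.3 + 0.71 * log Kow - 0.0061 * MW
    Inputs are the octanol-water partition coefficient (log scale)
    and the molecular weight in Da.
    """
    return -6.3 + 0.71 * log_kow - 0.0061 * mol_weight


# Hypothetical inputs: a small lipophilic molecule versus a heavier, polar peptide.
small_molecule = potts_guy_log_kp(log_kow=2.0, mol_weight=150.0)
peptide = potts_guy_log_kp(log_kow=-1.0, mol_weight=800.0)
```

The estimate depends only on lipophilicity and molecular weight, so it cannot see the vehicle or the measurement compartment, which is consistent with the abstract's point about the unrecorded terms of the triad.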
bioinformatics2026-06-29v1Causally measuring aging and rejuvenation through transcriptomic damage
Zhang, S.; Iqbal, S.; Tyshkovskiy, A.; Gladyshev, V. N.Abstract
Aging is caused, fully or in large part, by the progressive accumulation of damage, yet quantifying age-related damage across tissues and conditions remains a challenge. Here, we present a computational framework to quantify damage from standard RNA-sequencing data. It captures four classes of aberrant transcript structures, including premature termination upon intron retention, domain-disrupting splice variants, repeat elements, and gene fusion events, each reflecting distinct forms of RNA integrity loss. Using this method, we revealed a robust age-associated increase in transcriptomic damage across tissues. To integrate these measurements into a unified biomarker, we constructed a transcriptomic damage-based aging (tDamAge) clock using machine learning models trained across mouse tissues or human peripheral blood. It could predict age and detect transcriptomic shifts under both pro-aging and anti-aging conditions. Progeroid models exhibited accelerated tDamAge, whereas interventions such as caloric restriction, rapamycin, and methionine restriction lowered tDamAge. Cross-dataset analysis showed that diverse anti-aging interventions converge on shared transcriptomic signatures, particularly RNA processing and chromatin organization pathways, and these age-associated patterns could be reversed by interventions. We further identified elevated damage age acceleration in Alzheimer's disease and observed rejuvenation-like reductions during embryonic development. Together, our findings establish transcriptomic damage as a causal, quantifiable and biologically interpretable feature of aging and demonstrate that tDamAge could detect age progression, acceleration, deceleration, and reversal.
bioinformatics2026-06-29v1SentryPath: a mechanistic protocol-ranking simulator with leave-one-trial-out cross-validation across 13 phase-III oncology randomised controlled trials and a pre-registered prospective forecast
Kumar, M. D.; Kumar, M.Abstract
Background. Pivotal oncology trials cost a median of ~$19 million each (oncology often $45 million or more) and contribute to a capitalised cost of ~$2.6 billion per approved drug, yet most candidate protocols never reach trial. Existing in-silico screening tools either rely on closed proprietary PK/PD modelling or require patient-level data; a transparent, cohort-level, cross-validated mechanistic alternative is missing. Methods. SentryPath is a physics-based stochastic differential equation simulator built on a Gompertzian tumour-growth term with Emax pharmacodynamic kill and Bliss-independence combination modelling, scored at the cohort level. Validation against 13 published phase-3 randomised controlled trials covering six cancer types uses the 2-year overall-survival (OS) rate ratio as the primary endpoint, cross-checked against ClinicalTrials.gov posted results. For cancer types with >=2 trials we apply leave-one-trial-out cross-validation: two shared efficacy scalars per cancer type are fit on training trials and used to predict the held-out trial cold. Results. With the per-drug efficacy proxies held fixed from the literature, two shared cancer-type scalars fit on the training trials transfer to the held-out trial with a mean held-out error of 3.7 % (range 0.7-7.3 %) on 2-year OS rate ratios across three NSCLC trials; extending the same method to RCC, HCC, and ESCC yields a 5.4 % aggregate across nine folds (per-fold range 0.2-11.2 %), reported with per-cancer stratification. We are explicit that only the two scalars are held out - the per-protocol efficacy proxies underneath are literature-anchored to drug classes that include the benchmark trials, so this is a test of scalar transfer, not of the whole engine cold. 
Cross-validation improves on the same engine without it (16.4 % with production cancer priors; 21.9 % with no efficacy modifiers); a matched in-sample fit of the same two-scalar model gives 4.4 %, slightly below the 5.4 % held-out, the expected direction. Two prospective forecasts are pre-registered on the Open Science Framework with falsification envelopes and pre-readout bias disclosure. The first forecast (NCT04770896) reaches its primary data cutoff on 2026-06-30; the observed outcome and its mapping to the pre-committed interpretation will be reported in a versioned update to this preprint. Conclusion. A transparent mechanistic simulator, with a literature-anchored efficacy library and only two cross-validated scalars per cancer type, transfers those scalars across held-out NSCLC trials at 3.7 % mean error (range 0.7-7.3 %) and extends to other cancers with documented per-cancer stratification. The validation is pilot-scale (3-9 folds) and the scalars sit on a fixed, trial-informed substrate; its distinguishing contribution is less the error magnitude than the public predict-verify-disclose cycle that goes beyond retrospective fit. Keywords: mechanistic simulation; oncology; clinical trial design; cross-validation; pre-registration; protocol prioritisation.
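The three model ingredients named in the Methods (Gompertzian growth, Emax pharmacodynamic kill, Bliss-independence combination) can be sketched as below. This is a deterministic forward-Euler illustration under assumed parameter names and values, not SentryPath's implementation; the stochastic term and the cohort-level scoring described in the abstract are omitted.

```python
import math


def emax_effect(conc: float, emax: float, ec50: float) -> float:
    """Fractional drug effect from the Emax model: E = Emax * C / (EC50 + C)."""
    return emax * conc / (ec50 + conc)


def bliss_combined(effect_a: float, effect_b: float) -> float:
    """Bliss independence for two fractional effects in [0, 1]."""
    return effect_a + effect_b - effect_a * effect_b


def simulate_tumour(v0, growth_rate, capacity, kill_rate, days, dt=0.1):
    """Integrate dV/dt = r * V * ln(K / V) - kill_rate * V with forward Euler."""
    v = v0
    for _ in range(int(round(days / dt))):
        dv = growth_rate * v * math.log(capacity / v) - kill_rate * v
        v = max(v + dv * dt, 1e-9)  # keep volume positive so the log stays defined
    return v


# Hypothetical two-drug protocol; every number here is illustrative only.
combo = bliss_combined(emax_effect(10.0, 0.8, 10.0), emax_effect(5.0, 0.6, 5.0))
untreated = simulate_tumour(1.0, 0.05, 1000.0, kill_rate=0.0, days=100)
treated = simulate_tumour(1.0, 0.05, 1000.0, kill_rate=0.4 * combo, days=100)
```

Scaling the combined fractional effect by a maximum kill rate (0.4 per day above) is one assumed way to couple the pharmacodynamic term to the growth equation; the preprint may couple them differently.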
bioinformatics2026-06-29v1G-LATO: Inference of Spatial Latent Ordering via Deep Gaussian Processes
Zago, M.; Mukherjee, S.; Schleicher, J. T.; Bürkner, P.; Tabatabai, G.; Claassen, M.Abstract
Spatial transcriptomics enables the study of cells within their native tissue context, yet identifying gradients of cellular development remains challenging. We introduce a deep Gaussian process model to address this gap. Our method recovers spatially smooth gradients explaining observed gene expression. We illustrate our method on healthy liver and glioblastoma data in reconstructing known spatial organisation and uncovering new pathological gradients, thus providing robust inference for spatial biology.
bioinformatics2026-06-29v1A hyperbolic topological atlas reveals polyamine steering of a shared developmental manifold in Arabidopsis
Zdrazil, J.; Kong, L.; Flores-Hernandez, E.; Rodriguez Kessler, M.; Klimes, P.; Spichal, L.; De Diego, N.; Snasel, V.Abstract
High-throughput plant phenotyping captures development at scale, yet image-rich screens are still often reduced to static trait summaries. We tested whether nutrient availability, polyamine priming, concentration, and their transport reshape Arabidopsis rosette development by generating distinct morphologies or by changing residence along a common trajectory. We analyzed 138,223 time-resolved rosette images from Col-0 and five mutants involved in polyamine transport (put1-5) primed to putrescine, spermidine, spermine, dose, and nutrient regimes using a self-supervised vision backbone, Poincare embedding, hyperbolic Mapper, and manifold straightening. The data form a single connected developmental manifold with 410 nodes and 746 edges, organized from an early, low-nutrient-biased hub through high-betweenness transition corridors to two late, nutrient-enriched terminal regions. Polyamine identity stratifies this manifold by developmental phase: putrescine enriches early states, spermidine occupies transition corridors, and spermine marks late compact rosettes. Nutrient richness and dose change distal occupancy, whereas put genotypes alter dwell time within shared regions rather than producing separate topologies. Manifold straightening resolves these effects into a short early lateral deflection followed by convergence, yielding two scalar readouts, early transverse offset and distal occupancy, that summarize treatment action on a common morphodynamic scale. The framework converts large image screens into interpretable developmental geometry for image-based phenomics.
bioinformatics2026-06-29v1Practical Use of Advanced AI Frameworks on Real-Life Scientific Problems: Three Case Studies
Gulluoglu, H. S. A.; Baby, J.; Bagul, K. M.; Basangari, B. R.; Bathini, S. A.; Chalamalla, N. K. R.; Dcunha, J.; Gupta, O.; Huang, L.; Jiang, X.; Naidu, Y. R.; Sathishkumar, G.; Sehrawat, M.; Thota, S. L.; Thuvara, D.; Vanguri, M. B.; Yin, J.; Jugder, B.-E.; Lusky, I. E.; Li, J.; Sinitskiy, A.Abstract
Agentic artificial intelligence (AI) systems increasingly claim to automate scientific research, yet independent evaluations report persistent gaps between those claims and demonstrated capability. We tested frontier agentic AI systems on three practical problems: prediction of treatment non-response in immune-mediated inflammatory diseases, optical chemical structure recognition for literature mining, and prediction of drug-design-related properties from small datasets. Each problem was first assigned to autonomous frameworks and then reattempted as human-led, AI-assisted work. Autonomous runs failed in most cases, while human-led work produced reusable resources and modest but defensible performance, including new evidence for possible mechanisms of treatment resistance and a more practical benchmark for mining chemical structures from scientific papers. Property prediction was the single task on which one autonomous AI framework matched the human expert. We conclude that current frameworks can carry out engineering and analysis once a human expert leads the project, but cannot yet engineer a novel solution without oversight. The use of AI on real-life scientific problems remains an art rather than a routine technology.
bioinformatics2026-06-29v1Learning Fragmentation Physics or Exploiting Sequence Priors? Benchmarking Bias in Deep Learning Models for De Novo Peptide Sequencing
Li, J.; Rost, H.Abstract
Deep learning models have advanced de novo peptide sequencing, but their predictions may reflect both physics-based spectral evidence and learned peptide-sequence priors. Systematically measuring such prior-associated behavior is important for benchmarking model robustness beyond conventional proteomics data. Here, we introduce the Prior Bias Index (PBI), a general framework for measuring the extent to which model behavior shifts toward prior-associated reference patterns under controlled conditions, and implement it as DeNovo-PBI, a benchmark for quantifying prior bias in de novo peptide sequencing models. DeNovo-PBI combines benchmark dataset construction, in silico sequence and spectral perturbation workflows, PBI-based metrics, and analysis algorithms to evaluate three forms of prior-associated behavior: sequence-distribution dependence, database amino-acid-pair order preference, and mutation-group prediction consistency under shared sequence context. In addition to experimentally acquired peptide spectra, we generated in silico spectra from random, natural, and mutated peptide sequences and selectively removed fragment ions that distinguish N-terminal residue orders. Across these assays, deep learning models showed peptide-sequence-distribution-dependent performance and strong directional amino-acid-pair order preferences even when order-diagnostic spectral evidence was removed. DeNovo-PBI provides a quantitative benchmark for measuring, comparing, and interpreting learned bias in de novo peptide sequencing models.
bioinformatics2026-06-29v1Lineage-aware stochastic modeling reveals gene-expression dynamics in development and disease
Xing, J.; Staklinski, S. J.; Liu, Z.; Nowak, D.; Siepel, A.Abstract
Gene expression evolves dynamically along cell lineages, yet most analysis methods treat single-cell RNA-seq (scRNA-seq) data as static snapshots and fail to exploit phylogenetic relationships among cells. Recent advances in cell-lineage tracing now enable the reconstruction of high-resolution lineage phylogenies, providing a natural framework for identifying when and where transcriptional changes arise during development, differentiation, and disease progression. Some models of gene expression have begun to consider phylogenetic structure, but they generally rely on imprecise Gaussian assumptions, focus on endpoint-level comparisons, or fail to consider sparse and overdispersed scRNA-seq read counts. Here, we present LaVOUS (Lineage-aware Variational Ornstein-Uhlenbeck Single-cell RNA-seq analysis), a probabilistic framework that couples lineage-based models of latent dynamics derived from the Brownian motion and Ornstein-Uhlenbeck stochastic processes with negative-binomial observation models and scalable variational inference. LaVOUS enables likelihood-based tests for cellular heritability and branch-specific shifts in gene expression, as well as phylogenetic reconstruction of latent expression histories. In simulations, LaVOUS outperformed Gaussian methods in detecting lineage-associated expression changes and produced accurate reconstructions of expression histories across expression levels. We additionally applied LaVOUS to paired single-cell lineage and transcriptomic data from metastatic lung cancer, class-switching B cells, and the developing brain. Across these settings, LaVOUS identified lineage-associated expression changes related to metastatic progression, B-cell isotype switching, and dopaminergic and glutamatergic neuron differentiation. 
By providing an expressive framework for modeling sparse count data on lineage trees, LaVOUS establishes a foundation for studying single-cell expression dynamics across developmental and disease contexts, with natural extensions to multi-gene regulation, lineage uncertainty, and multi-modal integration.
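As background for the latent dynamics named in the abstract, the sketch below gives the standard conditional mean and variance of an Ornstein-Uhlenbeck process along one branch of a lineage tree, with Brownian motion as the theta = 0 limit. The function and parameter names are hypothetical; the negative-binomial observation layer and the variational inference used by LaVOUS are not reproduced here.

```python
import math


def ou_transition(x_parent, branch_length, theta, mu, sigma):
    """Mean and variance of the latent value at a child node given its parent.

    theta: strength of the pull toward the optimum mu (theta = 0 gives Brownian motion)
    sigma: diffusion scale
    """
    if theta == 0.0:
        # Brownian motion: no pull toward an optimum, variance grows linearly.
        return x_parent, sigma ** 2 * branch_length
    decay = math.exp(-theta * branch_length)
    mean = mu + (x_parent - mu) * decay
    variance = sigma ** 2 / (2.0 * theta) * (1.0 - decay ** 2)
    return mean, variance


# Hypothetical branch: the parent sits above the optimum and the child relaxes toward it.
child_mean, child_var = ou_transition(x_parent=5.0, branch_length=1.0,
                                      theta=0.5, mu=2.0, sigma=1.0)
```

On long branches the mean approaches mu and the variance approaches sigma^2 / (2 theta), which is what distinguishes the mean-reverting model from Brownian drift.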
bioinformatics2026-06-28v1LOESS and DE-SWAN can induce artifactual "waves" of molecular aging
Carbonneau, M.; Shutta, K. H.; Miller, J.; Shen, X.; Snyder, M.; Quackenbush, J.Abstract
A growing literature has investigated the relationship between age and biomolecular changes, leading to conclusions that aging occurs in discrete molecular "waves." Data summary tools such as LOESS and sliding window analyses like DE-SWAN are common approaches that have gained acceptance in recent years. We demonstrate via simple simulations that these tools can identify non-linear patterns of aging where they do not exist. Specifically, we show that (i) clustering of molecular trajectories using LOESS can lead to artifactual characteristic patterns of molecular aging, (ii) "waves" of aging identified using the combination of LOESS and DE-SWAN in real data are not robust to changes in the underlying age distribution and are not supported by valid permutation testing, and (iii) DE-SWAN alone can generate pronounced "waves" of nonlinear molecular aging in linear data due to differences in statistical power along the age continuum. Our results specifically challenge the statistical support for discrete aging crests inferred in the literature, but do not rule out nonlinear molecular aging or age-associated transitions that may be detectable using other cohorts and statistical models.
bioinformatics2026-06-28v1Client-server interfaces enable efficient agent-driven variant calling
Yu, X.; Zheng, Z.; Chen, L.; Qin, Z.; Guo, X.; He, M.; Luo, R.Abstract
Background: Large language model (LLM) agents increasingly automate bioinformatics analyses, but most existing bioinformatics tools were built for standalone use by human experts. An agent driving such a tool must reason about its installation, configuration, and execution from documentation written for humans, spending many turns, tokens, and tool calls per result. How a method is exposed to an agent can therefore matter as much as the method itself. By designing agentic interfaces for these tools, developers can reduce such overhead and improve the reliability of agent-driven analyses. Findings: To test this design, we re-architected Clair3, a widely used deep-learning-based long-read variant caller, into a client-server system, Clair3-Connect. The client performs all genomics-related processing and holds the identifiable data. The server runs only neural-network inference, and the client sends only feature tensors to the server, while sample identifiers and genomic context remain on the client. The client exposes schema-defined agent-facing tools that an agent invokes through single structured calls. On an APOE diplotyping task, all 60 agent runs were correct. The agentic tools used 12K tokens in 3 turns, 6.8 to 14 times fewer tokens than the shell-driven baselines (81K-163K tokens), at about a quarter the wall-clock time and far more stably (4% versus 35% token usage variation). Dropping the pileup and phasing stages to keep the client light left SNP F1 within 0.1-0.3 points of standard Clair3 by 50x coverage, while mutual TLS and AES-256-GCM encryption added 7.2% to end-to-end runtime. Conclusions: Recasting an established algorithm as developer-built, agentic tools behind a secure client-server boundary makes it more efficient, reliable, and easier to deploy for an LLM agent than a third-party wrapper, which cannot recover the defaults and conventions only its developers know. Agentic interfaces should be a first-class deliverable of bioinformatics tool development.
bioinformatics2026-06-28v1PARROT: Phase-Altering Regulatory Rewiring Over Time
Chen, C.; Padi, M.; Quackenbush, J.Abstract
Motivation: Gene regulatory networks undergo dynamic restructuring during development and disease. Identifying when and how these networks change is crucial for understanding developmental and disease transitions, yet existing change-point detection methods often ignore network structure or lack interpretable community assignments. Results: We present PARROT (Phase-Altering Regulatory Rewiring Over Time), a framework for detecting change-points in dynamic networks using Stochastic Block Models. PARROT jointly estimates change-point locations and community structure across four network classes: unipartite and bipartite with either Gaussian or Bernoulli edge models. Simulations demonstrate improved performance and community recovery compared to other methods. Applications to human cardiac differentiation and mouse lung development data successfully recovered known phase boundaries. PARROT identifies both which genes are reassigned across modules and how the connections change between states.
bioinformatics2026-06-28v1Spatial co-expression and cell-cell communication inference from spatially resolved transcriptomics with CONCISE
Zhao, J.; Shan, X.; Wang, G.; Chu, T.; Lin, C.; Chang, R.; Zhao, H.Abstract
Cell-cell communication is fundamental to tissue organization, homeostasis, and disease progression. Recent advances in spatial transcriptomics provide unprecedented opportunities to systematically characterize ligand-receptor interactions directly within intact tissues. However, robust inference of spatial ligand-receptor interactions remains challenging because intrinsic features of spatial transcriptomics data, including spatial autocorrelation, variation in total molecular counts, and measurement errors, can induce spurious spatial co-expression and lead to inflated false-positive results. Most existing methods do not adequately account for these confounding factors, limiting the reliability of inferred cellular communication. Here, we present CONCISE, a statistical method for spatially constrained co-expression and ligand-receptor interaction inference that jointly models spatial autocorrelation, variation in total molecular counts, measurement errors, and spatial proximity constraints. CONCISE combines efficient moment-based parameter estimation with analytical hypothesis testing, enabling fast and statistically rigorous inference without restrictive distributional assumptions. Through extensive simulations, real-data permutation experiments, and biologically motivated negative-control analyses across different spatial transcriptomics platforms, we show that most existing methods presented inflated false-positive rates, whereas CONCISE achieved well-calibrated inference, robust false-positive control, and improved detection power. Application of CONCISE to high-resolution MERFISH and CosMx datasets from intestinal inflammation and non-small cell lung cancer further highlights its biological utility in disease contexts. CONCISE uncovered inflammation-associated fibroblast-specific interactions during intestinal inflammation and delineated complex tumor-immune and tumor-stromal signaling networks within the tumor microenvironment.
bioinformatics2026-06-28v1Short-Read Sequencing Benchmarking with Donor-Specific Assemblies
McGee, S. R.; Smith, J. D.; Frazar, C. D.; Ryke, E.; Vollger, M. R.; Kwon, Y.; Bennett, J. T.; Eichler, E. E.; Stergachis, A.; Wei, C.-L.Abstract
Background: High-throughput short-read sequencing has become a core technology for genomics, but the rapid expansion of available platforms has made it increasingly important to benchmark them under standardized conditions. A major challenge is that conventional reference-based comparisons confound true sequencing errors with inherited variation and reference bias, making it difficult to isolate platform-intrinsic performance. Results: We benchmarked nine short-read chemistries across seven DNA sequencers using two highly characterized benchmark samples, HG002 and COLO829BL, together with donor-specific assemblies to measure sequencing errors against sample-matched genomic references. This strategy separated authentic platform errors from biological divergence and revealed substantial differences in substitution, indel, read-position, and sequence-context error profiles. Element AVITI UltraQ and Roche SBX-D showed the lowest substitution error rates, whereas Ultima and Roche chemistries exhibited the strongest indel-associated biases. We also found pronounced platform-specific effects in low-complexity regions and trinucleotide contexts, including homopolymer-associated errors and context-dependent substitution skews that are directly relevant to rare-variant detection. In addition, we show that donor-specific references are essential for unbiased base-quality recalibration because they minimize reference bias and more faithfully support cross-platform comparison and low-frequency variant-calling thresholds. Conclusions: Donor-specific assembly-based benchmarking provides a robust framework for measuring true short-read sequencing errors and comparing platforms on a common, sample-matched basis. Our results establish a comprehensive reference for the community and show that authentic error profiles can guide platform selection, quality filtering, and improved detection of rare somatic variation.
bioinformatics2026-06-28v1CoLa-VAE: A Cell-Cell Communication-Aware Variational Autoencoder for Representation Learning and Expression Denoising
Chen, Y.; Qi, C.; Fang, H.; Luan, F.; Zhang, Z.; Arya, S.; Wei, Z.Abstract
Single-cell RNA sequencing provides a powerful view of cellular heterogeneity, but its sparsity and dropout noise remain major obstacles for recovering biologically meaningful gene expression programs and for downstream analyses that depend on reliable expression measurements. Ligand-receptor-based cell-cell communication inference is one such analysis: missing ligand or receptor expression can cause substantial false negatives in sparse single-cell data. Here, we present CoLa-VAE, a cell-cell communication-aware variational autoencoder that jointly learns latent representations and denoised expression profiles by incorporating ligand-receptor-derived communication topology through dynamic graph Laplacian regularization. Rather than treating denoising as a secondary output of representation learning, CoLa-VAE uses denoised expression to iteratively refine communication estimates and uses the resulting communication structure to guide both latent organization and expression reconstruction. In addition to improving latent space organization and producing robust denoised expression matrices, CoLa-VAE-denoised matrices also improved downstream biological analyses, including the detection of robust differential cell-cell communication programs, mitigation of batch-associated variation and enhanced spatial transcriptomic deconvolution when spatially constrained communication structure was incorporated. Together, these results establish CoLa-VAE as a communication-guided denoising and representation learning framework that recovers biologically meaningful expression signals from sparse single-cell and spatial transcriptomic data, enabling more sensitive and reliable downstream analysis.
bioinformatics2026-06-26v2Development of Deep-Learning Models that Predict Quantitative Protein-Ligand Interactions in Glycobiology as a part of a Capstone Course
Yin, H.; Liu, W.; Zhou, W.; Chang, Z.; Carpenter, E. J.; Satyajith, A.; Haregu, S.; Greiner, R.; Derda, R.Abstract
Glycans coat the surface of all cells, and every glycan is recognised by specific glycan-binding proteins (GBPs). There are no general tools that can accurately estimate the binding strength between glycan and GBP from the amino acid sequence of the GBP and the molecular structure of the glycan, represented as a SMILES string. We describe models for predicting such binding strengths developed as a part of a Capstone Course at the University of Alberta. The models are trained on a dataset that combines BindingDB, a published database of small-molecule protein interactions, and data from glycan arrays measured by the Consortium for Functional Glycomics (CFG). In this hybrid dataset of protein-ligand interactions the ligands are both glycans from CFG and small molecules from BindingDB; similarly, proteins include GBPs and proteins from BindingDB. Three models are presented: (i) ProMax, which fuses ESM-2, MolFormer, and MolCLR features; (ii) APEX, which constrains learning to a predetermined form, a physical model of binding; and (iii) UltraMax, which adds inter-atomic distances for the ligands. To address the dataset's severe long-tail distribution, the models employ tail-aware losses for rare high-binding instances. Trained and evaluated on approximately one million protein-ligand pairs using hold-out splits for unseen molecules, the three models provide a unified framework for quantitative glycan-protein binding prediction. We observed that learning glycan-protein binding is harder than the similar task of learning small-molecule-protein interactions. Simple mirror-inversion tests led us to postulate that insufficient use of chiral features is an important source of difficulty in learning these interactions.
bioinformatics2026-06-26v2PlantGeneAnn: a strand-specific genome foundation model for ab initio gene structure annotation of plant genomes
Qizhe, Z.; Zhengyang, Z.; Kepeng, L.; Wang, J.; Kaixuan, D.; Xianglei, X.; Wei, X.; Xuehai, H.Abstract
High-quality plant genome assemblies are rapidly increasing, but accurate structural annotation remains reliant on transcript and homology evidence, limiting applications in newly sequenced and non-model species. Here, we present PlantGeneAnn, a plant-optimized, strand-specific genome foundation model for ab initio gene structure annotation. Fine-tuned on only nine high-quality model plant annotations, PlantGeneAnn outperformed a multi-species model trained on 42 species, showing that annotation quality is more important than token volume. On a stringent 13-species benchmark covering rosids, asterids, and monocots, PlantGeneAnn surpassed four state-of-the-art baselines across five evaluation levels, from base-level classification to complete transcript recovery. It achieved higher intron precision and better captured complex gene structures. In zero-shot variant effect prediction, PlantGeneAnn identified cryptic splice donors and premature stop codons in maize and rice, with saturation mutagenesis confirming single-nucleotide, context-dependent sensitivity. It also retained generalizability for epigenomic track prediction, highlighting its value for pan-genomics, crop improvement, and non-model plant research.
bioinformatics2026-06-26v1Computational reconstruction of hierarchical cis-regulatory networks reveals synergistic transcription control and disease-associated rewiring
Zhu, X.; Zhou, X.; Zhang, Y.; Cai, G.; Zhao, W.; Zhou, B.; Zhou, J.; Tang, Z.; Liu, J.; Zhu, Q.; Cao, J.; Yang, B.; Gu, X.; Zhou, Z.Abstract
Gene regulation emerges from coordinated interactions among dispersed cis-regulatory elements, yet how these elements integrate into functional regulatory networks and collectively regulate gene transcription remains poorly understood. Here, we present ORIGAMI, a multi-omics, gene-centric deep learning framework that reconstructs functional cis-regulatory networks constrained by transcriptional output. ORIGAMI formulates cis-regulatory modeling as a latent graph inference task, which integrates DNA sequence, epigenomic signals, and three-dimensional chromatin priors to infer denoised regulatory graphs that capture functional interactions rather than structural proximity alone. The inferred regulatory graphs exhibit distinct topological regimes, where hierarchical and modular organization encodes cell-state-specific functional demands and enables synergistic transcriptional control. Furthermore, we show that these regulatory architectures undergo measurable state-dependent rewiring across disease contexts. Finally, ORIGAMI accurately predicts the transcriptional consequences of both cis- and trans-regulatory perturbations and links the rearrangement of regulatory architecture to perturbation response. Together, ORIGAMI advances a network-based view of gene regulation and establishes a foundation for virtual cell modeling of regulatory dynamics.
bioinformatics2026-06-26v1A Generalised Epigenetic Clock Reveals Therapeutic Vulnerabilities Linked to Ageing in Cancer Cells
Fernandez-Rebollo, I.; Digilio, A.; Oikonomou, A.; Trastulla, L.; Esteller, M.; Iorio, F.Abstract
Epigenetic clocks estimate biological age from DNA methylation patterns but perform poorly in cancer due to extensive epigenetic reprogramming, limiting the study of ageing in tumour biology. Here, we develop GepiClock, an epigenetic clock trained on DNA methylation data from 32 cancer types in The Cancer Genome Atlas. Based on 4,862 CpG sites, GepiClock accurately predicts age across both tumour and normal samples, indicating that core ageing-associated methylation programmes remain detectable despite malignant transformation. Applying GepiClock to molecularly profiled cancer cell lines with matched drug response and CRISPR screening data revealed age-associated vulnerabilities. Younger-predicted cell lines were more sensitive to mTOR, MEK1/2 and HSP90 inhibitors, whereas older lines showed increased sensitivity to AKT and PI3K inhibitors. Additional cancer-type-specific patterns and age-associated genetic dependencies were identified. These findings establish a framework to quantify biological age in cancer and link ageing-associated states to therapeutic vulnerabilities.
bioinformatics2026-06-26v1Cell-free DNA Fragmentation Profiling at Transcription Start Sites Improves upon Cancer-Type-Specific Region Selection for Cancer Detection
Pronk, B.; Makrodimitris, S.; Wilting, S.; Reinders, M.Abstract
Accurate discrimination between healthy individuals and patients with cancer using minimally invasive liquid biopsies could improve cancer diagnosis and monitoring. Circulating cell-free DNA (cfDNA) is a promising biomarker, since fragmentation patterns reflect chromatin organization and have been used to interrogate regulatory regions such as transcription start sites (TSSs). Classification approaches typically rely on hypothesis-driven selection of genomic regions based on literature or external tissue data. Therefore, they assume that tumor-derived cfDNA constitutes the dominant diagnostic signal, potentially overlooking a systemic, genome-wide shift in the cfDNA pool. We present a data-driven framework that identifies discriminative genomic loci directly from cfDNA whole-genome sequencing data. Using fragmentomic features captured at TSSs within a nested cross-validation framework, the model outperforms ichorCNA and hypothesis-driven baselines in distinguishing healthy from colorectal and breast cancer samples (AUROC 0.95 ± 0.039). Performance was maintained in a pan-cancer setting across seven malignancies (AUROC 0.946 ± 0.032) and generalized to previously unseen cancer types within the same cohorts (AUROC 0.934 ± 0.006). While validation in an independent external cohort showed a performance gap (AUROC 0.694), the data-driven model was consistently competitive with baseline methods. These results indicate that robust cancer detection is enabled by integrating distributed genome-wide fragmentation patterns rather than restricting analysis to predefined regions.
bioinformatics2026-06-26v1Comp2GPR: A Sequence-Driven Framework for Gene-Protein-Reaction Rule Reconstruction
Castillo, S.Abstract
Accurate gene-protein-reaction (GPR) associations are essential for the predictive performance of genome-scale metabolic models (GEMs), as they define the mapping between genes, enzymes, and metabolic reactions. However, GPR rules are often incomplete or inconsistent due to limitations in annotation transfer and the ambiguous representation of multi-subunit protein complexes, leading to errors in downstream analyses such as gene essentiality prediction. Here, I introduce Comp2GPR, an automated pipeline for reconstructing GPR rules that integrates curated protein complex information with sequence-level evidence. Protein complexes were sourced from the Complex Portal and subjected to an AI-assisted curation workflow to retain only metabolically relevant assemblies. Comp2GPR combines deterministic sequence similarity mapping with explicit rule construction to generate Boolean GPR expressions that accurately represent obligate subunit relationships and isoenzyme redundancy. I evaluated the impact of the reconstructed GPR rules by integrating them into the Yeast9 metabolic model and comparing gene essentiality predictions with the original model. While global performance metrics remained largely unchanged, the updated model achieved a net improvement in prediction accuracy through gene-level corrections. Overall, Comp2GPR demonstrates that combining curated protein complex data with sequence-based validation improves the accuracy, interpretability, and reproducibility of GPR rules. The method provides a robust framework for enhancing metabolic model annotations and supports more reliable simulation-based analyses.
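The Boolean rule construction described in this abstract can be illustrated with a minimal sketch. It is not the Comp2GPR implementation: it only shows the common convention of joining obligate subunits with AND and alternative enzymes (isoenzymes or alternative complexes) with OR. The function names `build_gpr` and `reaction_active`, and the gene identifiers, are hypothetical.

```python
def build_gpr(alternatives):
    """Build a Boolean GPR string for one reaction.

    `alternatives` is a list of enzymes that can each catalyse the reaction;
    every enzyme is given as a list of gene IDs that are all required
    (obligate subunits). Illustrative helper, not part of Comp2GPR.
    """
    # Drop empty entries; sort and de-duplicate genes for a deterministic rule.
    cleaned = [sorted(set(genes)) for genes in alternatives if genes]
    terms = []
    for genes in cleaned:
        term = " and ".join(genes)
        # Parenthesise complexes only when they sit next to other alternatives.
        if len(genes) > 1 and len(cleaned) > 1:
            term = f"({term})"
        terms.append(term)
    return " or ".join(terms)


def reaction_active(alternatives, deleted_genes):
    """Return True if at least one enzyme keeps all of its subunits."""
    return any(
        bool(genes) and all(g not in deleted_genes for g in genes)
        for genes in alternatives
    )


# Hypothetical reaction: a two-subunit complex (G1, G2) or an isoenzyme (G3).
example = [["G1", "G2"], ["G3"]]
print(build_gpr(example))
print(reaction_active(example, {"G1"}))
```

Under this representation a single-gene deletion is predicted to be essential for the reaction only if it disables every alternative, which is why the distinction between subunits and isoenzymes matters for gene essentiality prediction.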
bioinformatics2026-06-26v1Consistent consensus-based annotation of spatial adaptive immune receptor repertoires from long-read sequencing using LongAIRR
Schuck, J.; Ortega Iannazzo, S.; Mahmoud, Z.; Gwellem Anchang, C.; Hasse, L. M.; Weber, K.; Imkeller, K.Abstract
The combination of spatial transcriptomics with long-read sequencing enables spatial characterization of full-length transcripts within solid tissue sections. However, standardized computational analysis frameworks are lacking, and it remains unclear whether available long-read sequencing platforms from Oxford Nanopore Technologies and Pacific Biosciences yield comparable results. Here, we present a computational strategy for spatial full-length transcript analysis, focusing on the spatial profiling of adaptive immune receptor repertoires (AIRR). Our approach introduces an adaptive filtering strategy that dynamically refines read selection and significantly improves consensus accuracy, enabling high-confidence sequence reconstruction independent of platform-specific sequencing error profiles. We further derive evidence-based guidelines tailored to the consistent and robust analysis of spatial AIRR data. The resulting software LongAIRR is modular and interoperable with existing spatial transcriptomics and AIRR analysis frameworks. This work establishes a methodological foundation for spatial immunology, enabling precise mapping of immune repertoires within their native tissue microenvironments.
bioinformatics2026-06-26v1MYC and RNA Polymerase II Binding Near Transcriptional End Sites Regulate the Expression of Functionally-Related Genes
Prochownik, E. V.; Henchy, C. M.; Wang, H.Abstract
MYC oncoprotein binding at promoters and enhancers influences RNA polymerase II (RNAPII)-driven gene expression. Numerous genes also bind MYC near their transcriptional end sites (TESs). This often allows direct promoter-TES contact via looping and further regulates total and 'read-through' transcription that extends beyond standard termination sites. We aimed here to better clarify the rules governing TES associated MYC and/or RNAPII binding cross-talk in human and murine cells. Using ChIPseq and RNAseq datasets from the ENCODE portal and elsewhere, MYC and RNAPII binding profiles were found to differ around TESs and transcriptional start sites (TSSs). Variations in E box flanking sequences likely accounted for the somewhat lower affinities of MYC for TES-associated sites. Motifs for numerous other transcription factors were also observed to cluster non-randomly and in close proximity to MYC and RNAPII binding site peak summits. On average, genes with TES-proximal MYC or RNAPII sites were more highly expressed than those without, although co-binding tended to be suppressive. Both normal and neoplastic proliferative stimuli altered the MYC and RNAPII binding patterns of many genes, indicating that 'category switching' was common, subject to disparate external signals and often reversible. Functionally related gene sets with high levels of read-through transcription were uniformly marked by significant amounts of TES-associated MYC and/or RNAPII binding. These findings indicate that, both independently and together, MYC and RNAPII binding near TESs dynamically impact total and read-through transcription while also coordinating the expression of many common purpose gene sets.
bioinformatics2026-06-26v1Learning Perturbation Effects Through Contrastive Alignment of Multimodal Biological Embeddings
Long, W.; Liu, T.; Szalata, A.; Theis, F. J.; Xue, L.; Zhao, H.Abstract
Multimodal single-cell perturbation screens offer a scalable approach for characterizing the effects of genetic and chemical interventions on cellular state. However, most existing representation learning methods are tailored to a single perturbation modality and fail to explicitly incorporate external semantic knowledge, which limits their ability to generalize across datasets and perturbation types. Here, we introduce PertOmni, a CLIP-style multimodal representation learning framework that aligns transcriptomic perturbation signatures with text-derived embeddings of curated genes and compound descriptions, as well as image-derived embeddings from cell paintings. PertOmni jointly trains a shared transcriptomic encoder and dataset-specific text encoders using a masked contrastive objective that emphasizes within-cell-type discrimination while mitigating confounding effects arising from cell-type heterogeneity. We evaluate the produced joint embedding space on bidirectional retrieval, drug-gene interaction inference, and perturbation prediction across both small-molecule and CRISPRi perturbation datasets, and demonstrate consistent improvements over strong baseline methods.
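One way to read the masked contrastive objective mentioned above is as an InfoNCE-style loss in which each anchor is contrasted only against candidates from the same cell type. The sketch below follows that reading; it is an assumption about the general idea, not PertOmni's actual loss. The function name `masked_info_nce`, the temperature value, and the single-direction formulation (CLIP-style training typically averages both directions) are all illustrative choices.

```python
import math


def masked_info_nce(similarity, cell_types, temperature=0.1):
    """One-directional contrastive loss with a same-cell-type mask.

    similarity[i][j] is the similarity between transcriptomic embedding i
    and text embedding j; the matching pair sits on the diagonal.
    Candidates from other cell types are masked out of the denominator.
    """
    n = len(similarity)
    total = 0.0
    for i in range(n):
        # Keep only candidates that share the anchor's cell type.
        allowed = [j for j in range(n) if cell_types[j] == cell_types[i]]
        logits = [similarity[i][j] / temperature for j in allowed]
        # Numerically stable log-sum-exp over the allowed candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(v - peak) for v in logits))
        total += log_denominator - similarity[i][i] / temperature
    return total / n


# Toy similarities (made up): two perturbations in the same cell type.
toy = [[1.0, 0.0], [0.0, 1.0]]
print(masked_info_nce(toy, ["T cell", "T cell"], temperature=1.0))
```

With this mask, a pair from a different cell type never acts as an easy negative, so the loss cannot be reduced merely by separating cell types.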
bioinformatics2026-06-26v1Efficient evidence-based genome annotation with EviAnn
Zimin, A. V.; Puiu, D.; Pertea, M.; Yorke, J.; Salzberg, S.Abstract
For many years, machine learning-based ab initio gene finding approaches have been central components of eukaryotic genome annotation pipelines, and they remain so today. The reliance on these approaches was originally sustained by the high cost and low availability of gene expression data, a primary source of evidence for gene annotation along with protein homology. However, innovations in modern sequencing technologies have revolutionized the acquisition of gene expression data, allowing scientists to rely more heavily on this class of evidence. In addition, proteins found in a multitude of well-annotated genomes represent another invaluable resource for gene annotation. Existing annotation packages often underutilize these data sources, which prompted us to develop EviAnn (Evidence-based Annotator), a novel evidence-based eukaryotic gene annotation system. EviAnn takes a strongly data-driven approach, building the exon-intron structure of genes from transcript alignments or protein-sequence homology rather than from purely ab initio gene finding techniques. We show that when provided with the same input data, EviAnn consistently outperforms current state-of-the-art packages including BRAKER3, MAKER2, and FINDER, while utilizing considerably less computer time. Annotation of a mammalian genome can be completed in less than an hour on a single multi-core server. EviAnn is freely available under an open-source license from https://github.com/alekseyzimin/EviAnn_release and from Bioconda as "eviann".
bioinformatics2026-06-25v3Dynamic genomic constraints reveal fitness trade-offs underlying bacterial resistance evolution.
Dillon, L.; McInerney, J. O.; Creevey, C. J.Abstract
Antimicrobial resistance (AMR) is often modelled as the accumulation of resistance genes leading to multidrug resistance (MDR). We show that gene co-occurrence patterns in two opportunistic pathogens are consistent with fitness trade-offs that constrain which combinations of resistance mechanisms coexist. We applied a combined pangenomic and machine-learning analysis to 9,584 Escherichia coli genomes (99.2% phylogroup B2) and 7,057 Pseudomonas aeruginosa genomes. In E. coli, we identified eight cases of mutually exclusive gene pairs that independently predicted the same MDR phenotype, suggesting alternative routes to resistance whose components are typically not co-inherited. In a separate dataset of 352 strains with paired minimum inhibitory concentration (MIC) data, these dissociated combinations co-occurred more often in resistant than susceptible strains, consistent with the constraints being conditional on antibiotic selection. 33 gene pairs showed opposing association patterns between the two species, with combinations significantly associated in one species and significantly dissociated in the other (e.g. associated in E. coli and dissociated in P. aeruginosa, or vice versa). This indicates that genomic context modifies the contribution of individual genes to resistance phenotypes, and offers one explanation for the observation that 106 ARGs are present in >95% of strains yet do not predict resistance phenotype on their own. The findings are consistent with resistance evolution being shaped by fitness trade-offs and suggest that the dissociation patterns we identify could be targets for follow-up experimental work on resistance-associated fitness costs.
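The notion of a dissociated (mutually exclusive) gene pair can be made concrete with a small co-occurrence test. The sketch below counts presence/absence combinations across genomes and computes a one-sided hypergeometric (Fisher-type) probability of seeing as few co-occurrences as observed. The abstract does not state which statistic the authors used, so the choice of test, the function names, and the toy data are assumptions.

```python
from math import comb


def cooccurrence_table(presence_a, presence_b):
    """Count genomes with both genes, only A, only B, or neither."""
    pairs = list(zip(presence_a, presence_b))
    both = sum(1 for a, b in pairs if a and b)
    only_a = sum(1 for a, b in pairs if a and not b)
    only_b = sum(1 for a, b in pairs if b and not a)
    neither = sum(1 for a, b in pairs if not a and not b)
    return both, only_a, only_b, neither


def dissociation_p_value(both, only_a, only_b, neither):
    """One-sided hypergeometric P(co-occurrences <= observed)."""
    n = both + only_a + only_b + neither
    with_a = both + only_a
    with_b = both + only_b
    lowest = max(0, with_a + with_b - n)  # smallest possible overlap
    favourable = sum(
        comb(with_a, k) * comb(n - with_a, with_b - k)
        for k in range(lowest, both + 1)
    )
    return favourable / comb(n, with_b)


# Toy presence/absence vectors (made up): the genes never co-occur.
gene_a = [1, 1, 1, 1, 0, 0, 0, 0]
gene_b = [0, 0, 0, 0, 1, 1, 1, 1]
print(dissociation_p_value(*cooccurrence_table(gene_a, gene_b)))
```

A real analysis over thousands of genomes would also need to account for phylogenetic structure and multiple testing, which this sketch ignores.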
bioinformatics2026-06-25v2FoldARE, an RNA secondary structure analysis and prediction tool via generative pseudo-SHAPE modeling
Marino, S. M.; Husak, V.; Tebaldi, T.Abstract
RNA secondary structure prediction is limited by conformational heterogeneity and the scarcity of experimental data, as many RNAs populate ensembles of near-isoenergetic folds and SHAPE data are often unavailable. We present FoldARE (Folding and Analysis of RNA Ensembles), a two-step framework that derives pseudo-SHAPE constraints from in silico structural ensembles and uses them to guide SHAPE-aware secondary structure prediction. In the first step, an ensemble is generated and parsed nucleotide by nucleotide to estimate single-strandedness frequencies, which are converted into a pseudo-SHAPE reactivity profile using a weight-and-threshold scheme. In the second step, this profile is provided as a constraint to a SHAPE-compatible folding algorithm to improve the final prediction. We systematically evaluated all combinations of four ensemble-capable predictors, ViennaRNA, RNAstructure, LinearFold and EternaFold. After parameter optimization on a structurally diverse 25-RNA training set and validation using multiple scoring schemes, the best configuration combined EternaFold as ensembler and RNAstructure as predictor. Across external benchmark datasets (RNAstrand, ArchiveII and bpRNA) and the experimentally derived eFold dataset, FoldARE achieved the highest accuracy. Beyond prediction, FoldARE provides modules for ensemble-focused comparative analysis, including pairwise and multi-tool consensus assessment, per-nucleotide variability metrics, and interactive visualizations. Notably, it also supports the evaluation of m6A modification effects on structural ensembles. FoldARE is freely available on GitHub (https://github.com/TebaldiLab/FoldARE) and as a web accessible version (https://rdds.it/foldare/)
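The first step described above, turning an ensemble into a pseudo-SHAPE profile, can be sketched as follows. Structures are assumed to be in dot-bracket notation, where `.` marks an unpaired nucleotide. The abstract does not give the details of the weight-and-threshold scheme, so the mapping used here (zero below a threshold, otherwise frequency times a weight) and its default values are placeholders, not FoldARE's parameters.

```python
def unpaired_frequencies(ensemble):
    """Per-nucleotide fraction of ensemble structures that are unpaired."""
    if not ensemble:
        raise ValueError("ensemble must contain at least one structure")
    length = len(ensemble[0])
    if any(len(structure) != length for structure in ensemble):
        raise ValueError("all structures must have the same length")
    return [
        sum(structure[i] == "." for structure in ensemble) / len(ensemble)
        for i in range(length)
    ]


def pseudo_shape(frequencies, weight=2.0, threshold=0.5):
    """Map single-strandedness frequencies to pseudo-reactivities.

    Hypothetical scheme: positions below `threshold` get 0.0,
    the rest get `weight * frequency`.
    """
    return [
        round(weight * f, 3) if f >= threshold else 0.0
        for f in frequencies
    ]


# Toy ensemble of three dot-bracket structures for a 6-nt RNA (made up).
ensemble = ["((..))", "(....)", "......"]
profile = pseudo_shape(unpaired_frequencies(ensemble))
print(profile)
```

The resulting per-nucleotide profile is the kind of input that a SHAPE-aware folding program would then take as a soft constraint in the second step.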
bioinformatics2026-06-25v2LNGCN: a distance-aware continuous-time graph protocol for prioritizing protein-protein interaction candidates
Xiao, Y.; Zheng, Y.; Hua, Y.; Peng, J.; Liu, J.; Qu, Y.; Xu, J.; Fu, R.; Qian, Q.; Zhao, M.; Zhang, X.; Zhao, J.; Yao, Y.; Kosar, M.; Ke, Y.; Chi, Y.Abstract
Accurate high-throughput prediction of protein-protein interactions (PPIs) is essential for mapping cellular mechanisms and prioritizing experimental validation. Current graph-based methods often rely on discrete message passing, suffer from representation over-smoothing, and provide poorly calibrated confidence scores. We present LNGCN, a distance-aware continuous-time graph protocol that integrates residue-level structural graphs with liquid neural dynamics to model spatially heterogeneous interaction patterns. Residue radial distance is used as an explicit driving signal for continuous graph evolution, while hierarchical calibration converts raw model outputs into interpretable interaction probabilities. Across balanced, highly imbalanced, and cross-species benchmarks, LNGCN achieved robust predictive performance. Importantly, the calibrated scores supported biologically coherent prioritization in the FGF23-FGFR1c-Klotho complex, SHP2-associated signaling interactions and Tdk1 oligomeric-state-dependent binding. In a TPR-centered experimental case study, LNGCN recovered known TPR-associated partners and prioritized ELAVL1-TPR and RALY-TPR, whose physical interactions were subsequently confirmed via experimental validation. These results indicate that LNGCN can serve as a practical prioritization protocol for PPI candidates.
bioinformatics2026-06-25v2CellOS: Learning a World Model of Cellular State through Joint Embedding Prediction
Zhou, Q.; Le, Y.; Qi, X.; Chang, S.; Lu, H.; Wu, Y.; Wang, H.; Ran, R.; li, x.Abstract
Foundation models learned from single-cell transcriptomes are central to the prospect of an AI virtual cell that can represent, query and predict cellular state. However, most current single-cell foundation models learn from a single view of gene expression and are optimized primarily through reconstruction or next-token prediction. As a result, they capture expression abundance but cannot explicitly reconcile complementary views of cellular state. Here we present CellOS, a multi-view foundation model that learns cellular representations from paired expression and perception views. CellOS integrates complementary views through a scalable three-stage training strategy that combines causal cell-sentence language modelling, function-preserving dense-to-mixture-of-experts expansion and latent-space alignment via an LLM-JEPA objective. Using this framework, we trained a 12-billion-parameter model on 390.5 million single-cell transcriptomes. Across diverse benchmarks spanning cell-state annotation, batch integration and perturbation-response prediction, CellOS consistently outperformed state-of-the-art single-cell foundation models. Together, these results suggest that predictive alignment between complementary cellular views provides a scalable path toward representation-centric cellular world models and transferable AI virtual cells.
bioinformatics2026-06-25v2DextraDemixer enables accurate identification of antigen-specific T cells from pMHC multimer experiments
An, Y.; Drost, F.; Bonafonte-Pardas, I.; Grotz, M.; Schober, K.; Schubert, B.Abstract
Antigen specificity of T cells defines the adaptive immune response, yet the vast majority of known T cell receptors (TCRs) lack annotated antigen targets. Single-cell peptide-MHC (pMHC) multimer assays offer a scalable approach to map TCR-antigen interactions. Still, their utility is limited by pervasive non-specific binding and severe overlap between signal and noise, which confound the accurate identification of antigen-specific cells. To address these limitations, we present DextraDemixer, a Bayesian hierarchical mixture model that disentangles antigen-specific T cells from background noise in pMHC multimer data. The model integrates information from negative controls and clonotype structure while providing calibrated uncertainty estimates for classification. We further introduce a dynamic thresholding scheme that enables credible interval-bounded control of the false discovery rate. Extensive benchmarking on simulated datasets and antigen-specific spike-in experiments demonstrated the model's robustness and improved accuracy over established methods. In a longitudinal SARS-CoV-2 vaccine study, DextraDemixer identified antigen-specific TCRs characterized by high sequence similarity, elevated antigen-specificity prediction scores, and strong clonal purity. Annotations showed high concordance with external validation data and supported the identification of antigen-specific motifs. Overall, DextraDemixer provides a principled probabilistic framework for reliable identification of antigen-specific TCRs from single-cell pMHC-multimer assays.
bioinformatics2026-06-25v1The RdRp Thumb-1 Pocket is a Conserved Target for Broad-Spectrum Antiviral Development
Woods, V.; Umansky, T.; Russell, S. M.; Gallay, P.; Smith, D.; Haders, D.Abstract
Single-stranded RNA (ssRNA) viruses cause human diseases ranging from mild colds to deadly pandemics. Broad-spectrum non-nucleoside antivirals have been characterized as impossible to develop because allosteric binding sites are poorly conserved. The Thumb-1 allosteric site identified in HCV's RNA-dependent RNA polymerase (RdRp) governs an essential conformational change in the Λ1-loop required for polymerase initiation. The only approved Thumb-1 inhibitor, beclabuvir, has been shown to be inactive against a broad panel of non-HCV viruses, including poliovirus, rhinovirus, coronavirus, coxsackievirus, influenza virus, and HIV. It subsequently failed to inhibit SARS-CoV-2 despite favorable docking predictions. A conserved, homologous allosteric site on RdRp that spans multiple viral families has not been reported. Here, we demonstrate that the Thumb-1 pocket and its associated Λ1-loop are conserved across ssRNA viral families through comparative structural analysis and multiple sequence alignments. We demonstrate that beclabuvir's dependence on its indole C6 carbonyl to interact with the HCV-specific residue R503 and its C3 cyclohexyl chemistry restricts its activity to HCV. We validate the target discovery with MDL-001, which does not contain a C6 carbonyl or a C3 cycloalkyl substituent. MDL-001 directly blocks viral RNA synthesis in isolated replication complexes and selects for the canonical Thumb-1 resistance mutation P495S in HCV. MDL-001 demonstrates broad-spectrum in vitro inhibition of both HCV and SARS-CoV-2. Preclinical proof of concept and development of MDL-001 across HCV, HBV, HDV, influenza, SARS-CoV-2, and RSV are reported in a companion manuscript. These findings establish RdRp Thumb-1 as a conserved allosteric pocket and a druggable target for broad-spectrum antiviral development.
bioinformatics2026-06-23v5OmniCell: Unified Foundation Modeling of Single-Cell and Spatial Transcriptomics for Cellular and Molecular Insights
Pang, J.; Qiu, P.; He, Y.; Deng, Y.; Tang, W.; Zhi, H.; Yan, J.; Li, B.; Lin, A.; Cao, L.; Teng, F.; Fang, S.; Li, S.; Deng, Z.; Zhang, Y.; Li, Y.; Li, S.; Xu, X.Abstract
A cell's transcriptional programme is not fully defined by gene expression alone, but by the tissue context in which that programme is enacted. Single-cell RNA sequencing resolves molecular identity after dissociation, whereas spatial transcriptomics preserves tissue architecture but remains constrained by assay-specific sparsity and gene coverage. Here we present OmniCell, a tissue-contextual transcriptomic foundation model pretrained on 67 million dissociated and spatially resolved profiles. By integrating gene identity, expression magnitude and tissue context, OmniCell links transcriptional programmes to the cellular neighbourhoods and anatomical contexts in which they operate. OmniCell organised transcriptomes across molecular, cellular and tissue scales. It recovered cell-type-specific programmes and tissue-aligned gene modules, preserved robust cell-state structure across batches, species and rare populations, and improved the reconstruction of spatial cell identity, anatomical domains and cell-type composition. In human liver cancer Stereo-seq data, OmniCell resolved a tumour-margin transition zone characterised by immune infiltration, acute-phase inflammation, coagulation/complement activity and metallothionein-linked metal-ion detoxification. Contextual gene-embedding similarity analysis showed that gene relationships differed across tumour core, transition-zone and paratumour/adjacent non-malignant niches, indicating that OmniCell captures tissue-dependent gene function rather than expression similarity alone. In mouse brain development and macaque cortex, spatial virtual perturbations mapped regulatory genes onto stage- and region-specific anatomical programmes. Together, these results establish tissue context as a primary axis of transcriptomic representation and provide a framework for studying how cellular programmes acquire context-dependent biological meaning in intact tissues.
bioinformatics2026-06-23v3GenoME: a MoE-based generative model for individualized, multimodal prediction and perturbation of genomic profiles
Wei, J.; Xue, Y.; Chai, H.; Gao, Y. Q.Abstract
The non-coding genome operates through a complex, multiscale regulatory system where regulated gene expressions are closely associated with cell-type-specific histone modifications, transcription factor binding and 3D conformation. Developing computational models that can integrate these patterns to predict and interpret the regulatory system remains challenging. Here, we present GenoME, a Mixture of Experts (MoE)-based generative model that uses DNA sequence and cell-type-specific ATAC-seq signals to predict a unified genomic profile encompassing epigenomics, transcriptomics, and chromatin architecture at base-pair to kilobase resolutions. GenoME enables multiscale predictions for held-out genomic regions and, critically, generalizes to predict the full regulatory landscape of unseen or individualized cell types from a single ATAC-seq input. We equip GenoME with an in silico perturbation framework that accurately forecasts the multimodal consequences of genetic perturbations and identifies functional enhancer-promoter connections, outperforming specialized models like Activity-by-Contact. These predictions can also be used to decipher the transcription factor grammar of cell-type-specific enhancers. GenoME thus provides a versatile, all-in-one platform for generative modeling, cross-cell-type generalization, and causal mechanistic investigation of the multiscale regulatory genome.
bioinformatics2026-06-23v2Structural Pockets and Interacting RNA-Associated Ligands (SPIRAL): A DSSR-enabled Meta-Analysis of RNA-Small Molecule Recognition
Lu, X.-J.; Wang, Y.Abstract
Small molecules that target structured RNA hold therapeutic promise across a wide range of diseases, yet the structural principles governing RNA-ligand recognition remain poorly defined. We present SPIRAL (Structural Pockets and Interacting RNA-Associated Ligands), a curated database of 1,098 RNA-small molecule structures from the Protein Data Bank covering 1,137 ligand-binding events across six functional RNA categories. A customized pipeline built on DSSR (Dissecting the Spatial Structure of RNA) extracts structural interaction parameters from each complex, capturing stacking geometry, hydrogen-bond topology resolved by RNA moiety, groove engagement, and tertiary motif context. Unsupervised clustering of these fingerprints resolves six mechanistically distinct binding modes, the distribution of which is strongly governed by RNA functional class. To enable category-independent comparison of interaction quality across these diverse modes, we introduce the Composite Binding Quality Score (CBQS), a seven-metric framework that ranks riboswitches highest and regulatory RNA motifs lowest among the six categories. Across 275 affinity-characterized entries, C2'-endo sugar pucker count and total buried contact surface area emerge as the dominant predictors of binding affinity, converging with the structural features most underengaged by current regulatory RNA motif binders. SPIRAL provides a data-driven foundation for the rational design of next-generation RNA-targeted therapeutics.
bioinformatics2026-06-23v2HoloCell: A Generative Foundation Model for Holistic Cellular Modeling
Jiang, Q.; Li, Z.; Hu, B.; Bie, Y.; Li, K.; Li, Q.; Jin, P.; He, Y.; Deng, P.; Wang, Z.; Chen, X.; Qin, T.; Liu, H.; Jiang, R.; Yin, Q.Abstract
Single-cell multi-omics technologies have recently advanced to enable the profiling of epigenomic, transcriptomic, and proteomic layers within individual cells, offering new opportunities to characterize cellular states as integrated biological systems. However, developing a unified framework that can seamlessly integrate diverse omics modalities and remain robust to heterogeneous modality missingness remains challenging. Existing methods are often designed for specific modalities or modality pairs, relying on dataset-specific training or paired measurements. Here we present HoloCell, to our knowledge the first generative foundation model for joint representation learning and generative modeling across all three major single-cell omics modalities, i.e., epigenomics, transcriptomics, and proteomics. HoloCell contains over 860 million parameters and is pretrained on the Human-Multi-Omics-Corpus, which comprises approximately 468 million single-cell profiles across these three omics layers, corresponding to over 425 billion tokens. HoloCell introduces a simple yet biologically motivated hierarchical tokenization strategy that encodes cis-regulatory elements, genes, and proteins as structured tokens within a shared modeling framework. We evaluated HoloCell across single-omics representation learning, paired multi-omics integration, unpaired multi-omics alignment, and cross-modal generation via iterative diffusion and remasking, demonstrating its superior performance and flexibility across diverse omics tasks. From a representation perspective, HoloCell provides a unified digital mapping of cellular states across multiple omics layers, capturing cell heterogeneity as an integrated system. From a generation perspective, its iterative diffusion and remasking framework permits flexible generation orders beyond fixed left-to-right causality, enabling in silico simulation of multi-omics information flow. 
Together, these capabilities position HoloCell as a versatile foundation model toward the emerging concept of a virtual cell, offering both systematic characterization and generative simulation of cellular systems within a unified framework.
bioinformatics2026-06-23v2A tailored variant filtering procedure for multi-breed and multi-species unbalanced animal SNP collections
Lazzari, B.; Milanesi, M.; Talenti, A.; Bionda, A.; Li, Y.; Jiang, L.; Lenstra, J. A.; Bardou, P.; Tosser Klopp, G.; Crepaldi, P.; Colli, L.Abstract
Technological advancements and decreasing costs of whole-genome sequencing have generated a huge amount of resequencing data. Large-sized datasets, encompassing the molecular variation of several species and/or populations can now be assembled easily. However, these are extremely variable in terms of geographical provenance and sample sizes, with taxonomic groups varying from one single to hundreds of entries. Consequently, the application of standard filtering approaches may bias the representation of groups or gene pools. Commonly adopted variant filtering approaches relying on minor allele frequency (MAF) and linkage disequilibrium (LD) are not adequate because of remarkable differences in LD structure and frequency of allele variants within datasets representing both local and global diversity of multiple populations and species. Thus, by using the VarGoats 1000 goat genome project, we devised a novel approach which avoids the biases of the standard filtering procedures by adopting within-population subsampling, minor allele count (MAC) and marker spacing (bp-space) as filters. Starting from a quality-filtered dataset of >28M SNPs from 1372 animals, we generated a dataset of <14M markers and 750 individuals, complying with the initial requirements and facilitating further computational steps.
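The three filters named in this abstract (within-population subsampling, minor allele count and marker spacing) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' pipeline: genotypes are coded as alternate-allele counts per diploid individual, and the thresholds (`max_per_pop`, `min_mac`, `min_spacing`) are placeholder values that the abstract does not specify.

```python
import random


def subsample_populations(samples_by_pop, max_per_pop, seed=0):
    """Cap every population at `max_per_pop` randomly chosen individuals."""
    rng = random.Random(seed)  # fixed seed keeps the subsample reproducible
    kept = []
    for pop in sorted(samples_by_pop):
        members = sorted(samples_by_pop[pop])
        if len(members) > max_per_pop:
            members = sorted(rng.sample(members, max_per_pop))
        kept.extend(members)
    return kept


def minor_allele_count(genotypes):
    """MAC from alt-allele counts (0, 1, 2); None marks a missing call."""
    called = [g for g in genotypes if g is not None]
    alt = sum(called)
    ref = 2 * len(called) - alt
    return min(alt, ref)


def filter_snps(snps, min_mac=2, min_spacing=100):
    """Keep SNPs passing the MAC filter and spaced >= min_spacing bp apart.

    `snps` is a list of (chrom, pos, genotypes) sorted by chromosome
    and position.
    """
    kept = []
    last_kept_pos = {}
    for chrom, pos, genotypes in snps:
        if minor_allele_count(genotypes) < min_mac:
            continue
        if chrom in last_kept_pos and pos - last_kept_pos[chrom] < min_spacing:
            continue
        kept.append((chrom, pos))
        last_kept_pos[chrom] = pos
    return kept


# Toy data (made up): four SNPs on one chromosome, four individuals.
snps = [
    ("1", 100, [0, 1, 1, 0]),
    ("1", 150, [0, 1, 1, 2]),
    ("1", 300, [0, 0, 0, 1]),
    ("1", 400, [2, 2, 1, 1]),
]
print(filter_snps(snps))
```

Unlike a MAF threshold, a count-based filter does not depend on the total sample size, so an allele carried by a few animals of a small breed is treated the same way as in a large one.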
bioinformatics2026-06-23v2Early Tracheal and Salivary miRNAs in Extremely Preterm Infants Predict BPD-related Pulmonary Hypertension
Li, T.; Zhang, S.; Aluquin, V.; Donnelly, A.; Stephens, H.; Sharma, S.; Hicks, S. D.; Liu, D.; Austin, E.; Siddaiah, R.Abstract
Bronchopulmonary dysplasia (BPD)-associated pulmonary hypertension (BPD-PH) in preterm infants is associated with high morbidity and mortality within the first two years of life. In a previous unbiased study, we identified a panel of miRNAs in tracheal aspirates (TA) that were differentially expressed in extremely low gestational age newborns (ELGANs) with BPD-PH compared to those with BPD but no PH. To explore the predictive potential of these miRNAs, we studied TA exosomes from 7-day-old ELGANs, analysed a curated panel of 16 miRNAs through logistic regression, and calculated the predictive AUROC to diagnose BPD-PH at 36 weeks PMA. The AUROC of TA miRNAs was 0.76, with sensitivity and specificity of 53% and 93%, respectively. Adding sex and gestational age to the variables improved the AUROC to 0.78, with sensitivity and specificity of 61% and 87%, respectively. Because TA is difficult to obtain in non-invasively ventilated infants, we collected saliva samples from ELGANs at 7 days of age, compared the log expression of these 16 miRNAs in both biofluids, and found a significant correlation in their expression (Pearson r=0.92, p<0.001). We calculated the predictive AUROC of the same miRNAs to diagnose BPD-PH at 36 weeks PMA. The AUROC of these miRNAs in saliva was 0.85, with sensitivity and specificity of 82% and 72%, respectively; adding biological sex and gestational age improved the AUROC to 0.86, with sensitivity and specificity of 79% and 76%, respectively. Leave-one-sample-out sensitivity analysis demonstrated stable training performance with reduced performance in testing samples, supporting the need for validation in larger independent cohorts. In conclusion, early salivary miRNAs have great potential for stratifying the risk of ELGANs developing BPD-PH, while also providing the opportunity to identify target molecules and mechanisms that modulate molecular function.
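The metrics reported in this abstract (AUROC, sensitivity, specificity) can be computed from predicted risk scores as in the minimal sketch below. The scores, labels and cutoff are invented toy values, not study data, and the function names are illustrative. AUROC is computed with the rank-based (Mann-Whitney) formulation, counting ties as half.

```python
def auroc(scores, labels):
    """Probability that a random case scores higher than a random control."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = sum(
        1.0 if c > k else 0.5 if c == k else 0.0
        for c in cases
        for k in controls
    )
    return wins / (len(cases) * len(controls))


def sensitivity_specificity(scores, labels, cutoff):
    """Classify score >= cutoff as positive and return (sens, spec)."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cutoff)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < cutoff)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cutoff)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


# Toy predicted probabilities and outcomes (1 = BPD-PH), made up.
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 0, 0]
print(auroc(scores, labels))
print(sensitivity_specificity(scores, labels, cutoff=0.5))
```

AUROC summarises ranking quality over all cutoffs, whereas a sensitivity/specificity pair describes a single cutoff, which is how a model with a higher AUROC can still show lower specificity at its chosen operating point.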
bioinformatics2026-06-23v1