Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
Quantifying the contribution of DNA conformational flexibility to transcription factor binding on nucleosomal DNA uncovers indirect readout across diverse TF families
Dey, U.; Martinez, G. S.; Kumar, R.; Yella, V. R.; Kumar, A.
Abstract
Background: Eukaryotic gene regulation depends on transcription factors (TFs) recognizing short DNA motifs within chromatin. Many of these motifs lie within nucleosomes, where DNA is sharply bent, rotationally phased, and constrained by histone-DNA contacts. Yet only a subset is occupied in any cellular context. Motif identity alone, therefore, cannot fully explain selective TF engagement with nucleosomal DNA. We asked whether sequence-derived DNA conformational flexibility provides an interpretable representation of sequence context relevant to TF recognition on nucleosomes. Results: We compiled five DNA flexibility descriptors in the Python package DNAflexpy, representing bendability, torsional deformation, backbone conformational variability, and stiffness. We built quantitative models of TF binding affinity across 226 datasets from a high-throughput in vitro TF-nucleosome binding assay. Flexibility-augmented models improved prediction over mononucleotide baselines in most datasets, with smaller but reproducible gains over trinucleotide baselines. The gains were not uniform: they varied across TF families and were concordant with DNA shape-fluctuation features, suggesting that DNAflexpy descriptors capture a sequence-encoded structural signal. In PIONEAR-seq data, model performance generalized across nucleosomal templates in a TF- and sequence-dependent manner. Beyond prediction, position-resolved flexibility footprints revealed deformation signatures at cognate motifs and flanking regions across diverse TF families. For SOX11, model-derived footprints aligned with DNA shape fluctuations from nanosecond-to-microsecond molecular dynamics trajectories of SOX11-bound nucleosomes, consistent with independently observed DNA conformational dynamics and bound-state stabilization. The in vivo data showed a similar but more context-dependent pattern. 
OCT4 occupancy tended to correlate with local flexibility, whereas GATA3-pioneered regions showed flexibility coupled with altered rotational positioning of cognate motifs. Flexibility-augmented classifiers further improved discrimination of occupied nucleosomal motifs across ENCODE datasets. Torsional flexibility features, particularly twist dispersion and trx, were most informative for classification. Conclusions: Sequence-derived DNA conformational flexibility provides a quantitative and interpretable representation of sequence context in TF recognition on nucleosomes. By augmenting the sequence with structural information, these models help quantify and interpret an indirect-readout contribution in which DNA deformation tendencies may complement motif sequence and DNA shape. This framework may help explain why only selected motif instances are engaged in chromatin, without treating flexibility as independent of primary sequence.
bioinformatics 2026-06-06 v2
Cross Dataset Transcriptomic Analysis Identifies Oxidative Stress Inflammation Gene Networks Modulated by Nutrigenomic Interventions in Parkinson Disease
Rafiee, M.; Abaj, F.; Ghiasvand, R.
Abstract
Inflammation and oxidative stress (OS) are central to Parkinson's disease (PD). We performed a cross-dataset integrative transcriptomic analysis of publicly available data to identify OS- and inflammation-related hub genes consistently dysregulated in PD and to explore gene-compound relationships using nutrigenomic profiles. Four GEO datasets (GSE7621, GSE20141, GSE20146, GSE49036) were analysed to identify differentially expressed genes (DEGs), which were intersected with GeneCards OS and inflammation gene sets. Functional enrichment analyses, including gene ontology (GO), pathway over-representation analysis (ORA), and protein-protein interaction (PPI) analysis, were used to identify key pathways and hub genes. Gene-food bioactive compound (FBC) associations were explored by integrating PD signatures with nutrigenomic profiles from NutriGenomeDB. We identified 183 DEGs in PD, enriched in synaptic, dopaminergic, OS, and inflammatory pathways. Intersection analysis yielded 26 OS- and inflammation-related genes and 10 central regulators, including TH, DDC, SNCA, LRRK2, HSPB1, and HSPA1B. Integration with nutrigenomic datasets revealed opposing-direction transcriptional patterns, with several FBC-associated signatures showing lower expression of stress-related genes and higher expression of dopaminergic markers such as TH, GCH1, and DDC. Overall, this integrative analysis highlights OS and inflammation gene networks in PD and identifies candidate diet-gene associations that warrant further experimental and clinical validation.
bioinformatics 2026-06-06 v2
Single-Cell Multi-Omics Dissection of Malignant Evolutionary Mechanisms and Construction of a Prognostic Model for Clear Cell Renal Cell Carcinoma
Liu, R.; Shi, Y.; Xiao, Y.; Ren, B.; Li, L.; Qi, B.; Li, T.; Zhang, Y.; Gao, J.
Abstract
Clear cell renal cell carcinoma (ccRCC) exhibits pronounced heterogeneity across WHO histological grades, yet systematic single-cell multi-omics studies characterizing these transitions remain limited. We integrated scRNA-seq and scATAC-seq data across ccRCC WHO grades to establish a multi-omics framework encompassing tumor cells and immune populations. Using pseudotime trajectory analysis and machine learning ensembles, we developed a prognostic signature (CBG) from core nodes of transcriptional regulatory networks. We found that in tumor cells, epigenetic alterations consistently precede metabolic reprogramming and invasive adaptation. CD8+ T cell exhaustion followed a trajectory shifting from IRF7- to ZNF683-regulated states, while monocytes differentiated toward M1 and M2 macrophages orchestrated by NFIC/IL1B and CEBPD/GLI2. Intercellular communication networks showed a temporal progression from inflammation, through vascular remodeling, to immunosuppression dominance. The CBG signature demonstrated robust performance in independent cohorts, identifying SLC11A1 and SH3YL1 as antagonistic survival determinants. This study elucidates the dynamic molecular and immunological mechanisms underlying ccRCC grade progression, providing a robust framework for subtype-specific prognostication and precision therapeutic targeting.
bioinformatics 2026-06-06 v1
HOPE: Interpretable Histology Analysis with Spatial Omics-Derived Signatures for Precision Oncology
Wang, T.; Bieniosek, M.; Krpicak, T. J.; Luan, M.; Ruf, B.; Schürch, C. M.; Mayer, A. T.; Luo, R.; Trevino, A. E.; Wu, Z.
Abstract
Hematoxylin and eosin (H&E) stained images are fundamental clinical tools for disease assessment. However, even with advanced computational models, their prognostic capabilities remain limited. Spatial omics characterizes tumor microenvironments (TME) in detail yet remains clinically inaccessible due to cost and complexity. In this study, we present HOPE, a lightweight framework that learns TME signatures from paired H&E and spatial omics data during training, then applies these to H&E alone at inference. Leveraging H&E foundation models, HOPE consistently outperforms identical architectures trained without spatial omics guidance across cancer types and cohorts. It further generates interpretable annotations of TME signature on H&E regions, stratifying patients into biologically coherent groups with different prognostic outcomes. HOPE establishes a practical route to translate high-content spatial omics discoveries into scalable, clinically deployable tools.
bioinformatics 2026-06-06 v1
Compositional and interpretable representation of histology using AI foundation models and sparse autoencoders
Zhao, Z.; Maliga, Z.; Ogbonna, E. C.; Talemi, S. R.; Coy, S.; Gagne, A.; Lumamba, K.; Solomon, I. H.; Santagata, S.; Steyn, A. J. C.; Naidoo, T.; Sorger, P. K.
Abstract
Light microscopy of tissue sections stained with hematoxylin and eosin (H&E) has been the foundation of histopathology for over 150 years and remains essential for diagnosis and research. The development of high-plex spatial profiling approaches able to measure protein and RNA expression at single-cell resolution augments but does not replace H&E imaging, even in research. Computational pathology (CPath) models based on deep learning promise to further increase the value of H&E imaging but interpreting these models in biological terms remains challenging. As a result, they are not widely used in spatial profiling studies. Here we describe a human-in-the-loop computational framework that leverages CPath foundation models (FMs) and sparse autoencoders (SAEs) to decompose FM embeddings and automatically identify diverse, human-interpretable histopathology features in H&E images. When FM-SAE modeling was applied to pulmonary diseases such as tuberculosis and lung cancer, human-machine interaction augmented and accelerated expert interpretation. Moreover, the resulting annotations provide a morphology-aware approach to integrating 2D and 3D mesoscale tissue architectures with molecular spatial profiling.
bioinformatics 2026-06-06 v1
TetraFuse: A Synergistic Four-Dimensional Dynamic Fusion Framework for Efficient and Robust Medical Image Classification
Gao, Y.; Li, J.; Xu, J.; Li, Q.; Li, Z.; Shi, Y.; ZHao, G.; Wu, X.; Zhang, Y.
Abstract
Accurate and robust classification of medical pathology images is pivotal for computer-aided diagnosis. However, the deployment of deep learning models in high-throughput clinical screening faces a fundamental challenge: the trade-off between diagnostic accuracy and computational efficiency. Current lightweight architectures, while reducing parameter complexity through grouped convolutions, often lead to cross-channel information isolation and diminished representational capacity. In this paper, we propose TetraFuse, a novel framework that systematically integrates features from four complementary domains: space, channel, statistics, and frequency. TetraFuse introduces a novel Cross-Channel Dynamic Aggregation (CCDA) paradigm that reconstructs global channel topology with negligible computational overhead, resolving the inter-group isolation issue. To balance perceptual fidelity and efficiency, we design a stage-aware local enhancement mechanism: Local Variance-Guided Enhancer (LVGE) is employed to filter out shallow-stage background noise, while High-Frequency Boundary Injection (HFBI) reinforces deep-stage pathological contours, preventing spatial over-smoothing. Experimental results on the COVID-19, ISIC 2018, and Kvasir datasets confirm that TetraFuse outperforms state-of-the-art (SOTA) methods. Notably, TetraFuse-Tiny achieves a transformative 91.53% reduction in FLOPs compared to ResNet50; on the Kvasir dataset, it achieved an accuracy of 0.926 and an AUC of 0.994 with only 0.345G FLOPs. By combining high representational power with minimal computational demand, TetraFuse offers a scalable solution for large-scale medical image analysis, especially in resource-constrained clinical environments.
bioinformatics 2026-06-06 v1
Revised Adaptive Immune Receptor Data in the Immune Epitope Database
Scheffer, L.; Richardson, E. M.; Vita, R.; Zarebski, L.; Blazeska, N.; Wheeler, D. K.; Cantrell, J. R.; Deleuran, S. N.; Lees, W. D.; Christley, S.; Corrie, B.; Cowell, L. G.; Sette, A.; Peters, B.
Abstract
The Immune Epitope Database (IEDB, iedb.org) is a freely available resource that catalogs experimentally defined immune epitopes and - if available - the immune receptors that recognize them. Currently, the IEDB records ~185,000 T cell receptors and ~5,000 B cell receptors/antibodies with experimentally verified epitope specificity. Because these receptor data were manually curated from ~3,300 references spanning decades, nomenclature inconsistencies present challenges for computational analyses and user queries. To support integrated analysis of the entire dataset, we revised the IEDB receptor data standardization and validation pipeline to flag and correct inaccuracies. Anomalous receptors from over 800 studies were flagged for re-curation. The updated receptor dataset shows greater conformity through consistent gene nomenclature formatting and harmonized CDR sequence delimitation. Taking advantage of the increased receptor data consistency, the IEDB web interface was expanded to include receptor search features directly on the homepage, support V/J gene and species options in the refined receptor search, and allow direct data export in the Adaptive Immune Receptor Repertoire (AIRR) format. We anticipate that the improved receptor data quality will simplify bioinformatics analyses, and facilitate integration of IEDB data into cross-repository data resources, such as the AIRR Knowledge Commons.
bioinformatics 2026-06-06 v1
Comparative Proteomics Across Tissues and Crop Agroecosystems Reveals Agricultural Stressor Responses in the Western Honey Bee
Zhong, H.; ZHONG, P.; Park, J.; Kozlova-Ryabova, A.; Moravcova, R.; Rogalski, J. C.; Jamieson, A.; Lansing, L.; Fang, W. W. T.; Moon, K.-M.; Yuan, X.; Ovinge, L. P.; Kearns, J. D.; Gregoris, A. S.; Higo, H.; Common, J.; Conflitti, I. M.; Pepinelli, M.; Tran, L.; Cunningham, M.; Jabbari, H.; Bukhari, S. A.; French, S. K.; Ho, J.; Deckers, T. B.; Zorz, J.; Polo, R. O.; Hoover, S. E.; Pernal, S. F.; Giovenazzo, P.; Currie, R. W.; Guarna, M. M.; Zayed, A.; Foster, L. J.
Abstract
Maintaining honey bee health in crop production systems is increasingly difficult because worker bees encounter multiple chemical and biological pressures from pesticides and pathogens. How these field-realistic pressures affect molecular physiology across functionally distinct tissues remains poorly understood. Here, we tested whether tissue-resolved proteomics could separate stable tissue-specific patterns from crop-associated molecular changes. To do this, we profiled abdomen, gut, and head proteomes from honey bees collected across four Canadian crop ecosystems over two consecutive years, and integrated these data with pesticide-residue and pathogen-load measurements. Proteomic variation was structured by both tissue identity and crop environment. Tissue-specific proteomic profiles were characterized across samples, whereas crop-associated effects were detected in both years and were stronger in 2021, the second year of the study. Tissue-specific enrichment and network analyses linked the abdomen to lipid catabolism and ubiquitin-proteasome proteostasis, the gut to central carbon metabolism, membrane transport, vesicle trafficking, and cytoskeletal organization, and the head to neurosensory and mitochondrial functions, together with amino-sugar metabolism and vesicle-associated quality-control modules. Among the measured pesticide residues, boscalid was the most reproducible chemical correlate of proteomic variation, with the strongest signal in the gut. Cross-year validation associated boscalid exposure with reduced abundance of gut proteins involved in mitochondrial metabolism, protein quality control, vesicle trafficking, nutrient transport, and biosynthetic pathways. Additionally, integrated proteome-transcriptome-microbiome factor analysis further identified gut-centered components associated with measured stressor variables and linked protein-level variation to coordinated transcriptomic and microbial shifts. 
Independent-year validation showed that compact crop-associated protein signatures detected in 2020 were also present in 2021. Together, these results show that honey bee tissues maintain stable proteomic identities while showing tissue- and year-specific responses to pesticide and pathogen pressures encountered in crop ecosystems. The gut proteome may specifically provide a sensitive molecular indicator of pesticide-associated perturbation under field conditions.
bioinformatics 2026-06-06 v1
samsampleX: Distribution-aware downsampling for benchmarking next-generation sequencing data
Demiriz, S.; Taliun, D.
Abstract
High-throughput next-generation sequencing (NGS) is essential for genetic variant discovery across diverse applications. As NGS technologies evolve, there is a growing need for benchmarking tools that support realistic data simulation and downsampling. Existing downsampling tools apply uniform sampling of sequencing reads, which inadequately models realistic coverage distributions, particularly in difficult-to-sequence regions and hybrid sequencing designs. Here we present samsampleX, a Python-based tool implementing a novel distribution-aware downsampling algorithm that dynamically adjusts read retention probabilities to emulate coverage profiles derived from real sequencing data. Using ultra-high-coverage reference datasets, samsampleX accurately reproduces coverage patterns observed in typical sequencing experiments, outperforming uniform downsampling methods at preserving depth variability across genomic regions such as the HLA locus and hybrid whole-exome/genome sequencing configurations. samsampleX extends current downsampling strategies by offering enhanced flexibility for specialized NGS benchmarking scenarios, facilitating improved assessment of sequencing data analysis methods.
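The idea of adjusting read retention probabilities to follow a target coverage profile can be illustrated with a minimal sketch. This is not the samsampleX algorithm, whose details the abstract does not give; it assumes reads are simple `(name, start)` tuples, coverage is summarized in fixed-size bins, and each read is kept with probability `target / observed` for the bin containing its start. The function names and the binning scheme are hypothetical.

```python
import random


def retention_probabilities(observed_depth, target_depth):
    """Per-bin probability of keeping a read (hypothetical helper).

    Keeping each read with probability target/observed makes the expected
    depth in a bin equal to the target; the ratio is capped at 1.0 because
    reads cannot be created, and bins with no observed reads get 0.0.
    """
    probs = []
    for obs, tgt in zip(observed_depth, target_depth):
        probs.append(0.0 if obs == 0 else min(1.0, tgt / obs))
    return probs


def downsample(reads, probs, bin_size, seed=0):
    """Keep each (name, start) read with the probability of its bin.

    Assumption: a read is assigned to the bin containing its start
    coordinate; starts past the last bin fall into the last bin.
    """
    rng = random.Random(seed)  # seeded for reproducible benchmarks
    kept = []
    for name, start in reads:
        bin_index = min(start // bin_size, len(probs) - 1)
        if rng.random() < probs[bin_index]:
            kept.append((name, start))
    return kept


if __name__ == "__main__":
    observed = [100, 200, 0]   # depth per bin in the high-coverage input
    target = [30, 30, 10]      # depth per bin in the profile to emulate
    probs = retention_probabilities(observed, target)
    reads = [("r%d" % i, i * 10) for i in range(30)]
    print(probs, len(downsample(reads, probs, bin_size=100)))
```

Uniform downsampling corresponds to the special case where every bin shares one probability; the per-bin version keeps depth variability, at the cost of needing a target profile estimated from real data.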
bioinformatics 2026-06-06 v1
Multivariate integration of histological images and gene expression data: a comparative review
Ma, C.; Mao, J.; Le Cao, K.-A.
Abstract
Integrating histological images with gene expression data offers a promising approach for linking tissue morphologies to molecular signatures and improving disease subtyping. However, such integration remains challenging due to the high dimensionality of these datasets, cross-modal heterogeneity, and limited interpretability. Multivariate methods such as Sparse Canonical Correlation Analysis (Sparse CCA), Joint Nonnegative Matrix Factorisation (Joint NMF), and Angle-based Joint and Individual Variation Explained (AJIVE) have been used to address these challenges by reducing dimensionality while identifying features associated with latent factors, thereby enhancing biological interpretability. Despite increasing application in imaging-omics research, systematic comparisons of their methodological properties remain limited. Consequently, users often lack guidance on how to appropriately select these methods in practice, and these approaches are frequently treated as interchangeable despite differing modelling assumptions. Here, we use paired H&E images and gene expression data from breast cancer as a representative case study to examine the methodological characteristics, interpretability, and complementary properties of these integration approaches. Our results show that each method captures distinct yet complementary aspects of the underlying information. Although the biological findings are derived from the TCGA-BRCA datasets, the methodological insights identified here extend more broadly to imaging-omics integration studies. Overall, this comparative review highlights the strengths and limitations of each approach and outlines considerations for future methodological development.
bioinformatics 2026-06-06 v1
Mapping Chemical Diversity: Descriptor-Guided Clustering of Natural Products in the COCONUT Database
Shreyasree, G.; Dileep, A.; Namani, A.; Karunakar, P.
Abstract
Natural products represent a major source of bioactive compounds for drug discovery, yet their exploration remains challenging due to extensive structural complexity and scaffold diversity. Using the COCONUT database, we developed a cluster-oriented framework to systematically map and characterize the natural product chemical space through feature engineering, molecular clustering, and representative-based analysis. Descriptor selection identified a greedy maximum coverage strategy with a 0.35-0.85 correlation threshold range and 20 descriptors as the optimal feature set, enriched in physicochemical and graph-topological properties. Comparative evaluation of clustering approaches identified UMAP-HDBSCAN as the best-performing pipeline, generating 1,683 clusters with silhouette scores of 0.42 before and 0.24 after noise reassignment. Cluster profiling revealed a highly heterogeneous scaffold landscape, with 67.56% of clusters exhibiting low scaffold dominance and only 15.21% representing highly scaffold-dominated regions, supporting a chemical space composed largely of interconnected transitional clusters. Descriptor analyses showed that natural product clusters were generally enriched in saturated, low-aromaticity chemotypes with moderate lipophilicity and constrained molecular flexibility. Representative-based analyses demonstrated that central representatives (medoid and centroid-closest molecules) closely captured cluster-average properties, whereas diverse representatives better reflected structural breadth, findings further supported through descriptor-based and docking-based validation. Collectively, the results reinforce the natural product chemical space as a continuous yet structured manifold and provide a representative-guided framework for its efficient exploration in drug discovery applications. The complete data can be accessed at: https://github.com/shrek-28/DescriptorClusteringNPSpace
bioinformatics 2026-06-06 v1
Ignet 2.0 and Vignet: An Ontology-Driven Web Platform for Biomedical Gene Interaction Discovery and Visualization
Asaduzzaman, S.; Bansal, B.; Combs, P.; Zhang, J.; Rehana, H.; McGregor, B.; He, Y.; Hur, J.
Abstract
Background: The expansion of biomedical literature demands systematic ontology-guided discovery of gene interactions, vaccine mechanisms, drug associations, and adverse events. Existing platforms such as STRING, DisGeNET, and PubTator fall short of providing a unified, freely accessible system that integrates ontology-based semantic interaction classification, vaccine-focused heterogeneous network construction, and Artificial Intelligence-assisted evidence retrieval. Results: Ignet 2.0 and Vignet are freely accessible dual-platform systems that combine PubMed literature mining with BioBERT-based interaction scoring for millions of gene-gene co-occurrence pairs, and that integrate three biomedical ontologies and one curated drug resource: Interaction Network Ontology (INO), Vaccine Ontology (VO), Human Disease Ontology (HDO), and DrugBank. Ignet 2.0 supports gene interaction discovery, gene set enrichment retrieval of BioBERT-scored GenePair evidence, and AI-assisted summarization through BioSummarAI. Vignet extends these features with VO-guided Vaccine Exploration, VacPair interaction scoring, and the creation of vaccine, gene, drug, and disease networks in VacNet. A public Representational State Transfer Application Programming Interface (REST API) and Model Context Protocol (MCP) endpoint enable real-time integration, fostering trust in biomedical knowledge discovery. Conclusion: Ignet 2.0 and Vignet are scalable, ontology-guided biomedical knowledge platforms that facilitate evidence-based gene interaction analysis, vaccine-focused semantic exploration, and AI-assisted knowledge discovery. Their real-time PubMed data integration ensures up-to-date insights; however, users should consider validation processes and potential lags in incorporating the latest experimental data, which may affect the reliability of immediate data. Availability: Ignet 2.0: https://ignet.org/ignet; Vignet: https://ignet.org/vignet/
bioinformatics 2026-06-06 v1
Correcting for Global Synonymous Selection Improves the Accuracy of Episodic Positive Selection Inference
Verdonk, H. E.; Pivirotto, A.; Hey, J.; Kosakovsky Pond, S. L.
Abstract
The ratio of nonsynonymous to synonymous substitution rates (ω) constitutes a fundamental parameter for inferring adaptive protein evolution, predicated upon the assumption that synonymous substitutions are selectively inert. This premise, however, is increasingly untenable given evidence of selection acting on synonymous substitutions, driven by various biological processes such as translational efficiency and mRNA stability. In this study, we demonstrate that unmodelled synonymous selection introduces substantial bias into ω estimation, resulting in elevated false positive rates in tests for positive selection. To rectify this, we present BUSTED+S+MSS, a statistical framework incorporating Multiclass Synonymous Substitution (MSS) models into BUSTED, a method for detecting episodic selection. By partitioning synonymous codons into empirically derived rate classes, this approach accounts for global synonymous constraints. Application to five diverse clades - Drosophila, Caenorhabditis, Enterobacteria, Saccharomyces, and Primates - reveals that the inclusion of MSS components consistently improves model fit and reduces the proportion of genes inferred to be under positive selection. In Enterobacteria, genes retaining significance under the corrected model exhibit weaker constraint on synonymous substitutions (dSs), consistent with the hypothesis that unmodelled purifying selection drives spurious signals of adaptation. Furthermore, an information-theoretic analysis indicates that whilst site-specific variation (SRV) provides the primary correction, global synonymous rate variation (MSS) contributes a distinct second-order correction. In highly divergent alignments, these signals act in concert to improve model fit. The BUSTED+S+MSS framework, especially when coupled with an "error-sink" to absorb alignment artifacts, thus offers a computationally feasible means to disentangle adaptive nonsynonymous substitution from the confounding effects of synonymous constraint.
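The bias described in this abstract can be seen from the standard definition of the rate ratio. The block below is a schematic illustration only, not the BUSTED+S+MSS likelihood; the factor γ is a hypothetical symbol introduced here to stand for the fraction of the neutral synonymous rate that survives purifying selection on synonymous sites.

```latex
% Standard definition and interpretation of the rate ratio
\omega = \frac{d_N}{d_S}, \qquad
\omega > 1 \ \text{(positive selection)}, \quad
\omega = 1 \ \text{(neutrality)}, \quad
\omega < 1 \ \text{(purifying selection)}

% Assumption: selection on synonymous sites scales the neutral
% synonymous rate by a hypothetical factor 0 < \gamma < 1
d_S^{\mathrm{obs}} = \gamma \, d_S^{\mathrm{neutral}}
\quad\Longrightarrow\quad
\omega^{\mathrm{obs}} = \frac{d_N}{\gamma \, d_S^{\mathrm{neutral}}}
  = \frac{\omega}{\gamma} > \omega
```

Under this simplification, a gene with a true ω slightly below 1 can appear to have ω above 1 once γ is small enough, which matches the direction of the false positives the abstract attributes to unmodelled synonymous selection.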
bioinformatics 2026-06-06 v1
PAG-Agent: a biologist-oriented research assistant for context-aware pathway-level analysis and interpretation
Nguyen, Q.-H.; Zhang, Z.; Le, D.-H.; Chen, J. Y.; Ku, W.-S.; Chen, H.; Yue, Z.
Abstract
Pathway analysis is a critical step for translating gene-level omics results into biological mechanisms, yet existing workflows often leave researchers with long lists of statistically significant pathways that are difficult to interpret, validate, and connect to experimental context. We developed PAG-Agent, a biologist-oriented virtual research assistant that integrates pathway-level statistical analysis, context-aware biological interpretation, literature-supported reasoning, and scientific writing support within a unified workflow. PAG-Agent supports bulk and single-cell transcriptomic data and enables users to perform data preprocessing, differential expression analysis, pathway analysis, pathway-level consensus analysis, and pathway-level meta-analysis through click-based and chat-based interactions. Unlike conventional pathway-analysis tools that analyze gene sets largely in isolation, PAG-Agent incorporates experimental conditions and research objectives to prioritize biologically relevant pathways and generate interpretable hypotheses. The system also provides gene and pathway annotation, citation retrieval, visualization, and writing refinement functions. In Alzheimer's disease case studies using three transcriptomic datasets, PAG-Agent consistently identified neurodegeneration-related pathways across multiple analysis methods and datasets. In citation-retrieval benchmarking, PAG-Agent outperformed six competing LLMs across five common literature-support scenarios, demonstrating improved ability to provide contextually relevant and valid references. Overall, PAG-Agent lowers technical barriers for pathway-level analysis and helps researchers move from transcriptomic data to biologically grounded interpretation, hypothesis generation, and scientific communication.
bioinformatics 2026-06-06 v1
STITCH: Spatial Transcriptomics Imputation via Flow Matching with Internal Learning
Wang, S.; Wang, X.; Peng, Q.; Li, T.
Abstract
Spatial transcriptomics datasets frequently suffer from spatial gaps and missing regions due to sectioning artifacts, tissue damage, and the high cost of sequencing that limits tissue coverage. We present STITCH, a scalable and robust generative framework for multidimensional virtual spatial transcriptomics reconstruction. STITCH models intrinsic spatial-transcriptomic patterns directly from individual tissue samples, enabling reconstruction without requiring external reference atlases or matched histological image priors. The framework adopts a decoupled architecture that separates spatial morphology restoration from transcriptomic generation. STITCH first compresses high-dimensional transcriptomic profiles into a low-dimensional latent representation through a spatial-aware graph autoencoder. For 3D cross-slice gaps, STITCH employs optimal transport-conditioned flow matching for spatial reconstruction, whereas 2D in-slice damage is repaired through an internal learning strategy. To generate the corresponding transcriptomic profiles, STITCH further establishes a point-wise conditional flow matching model in the latent space. This module achieves linear computational complexity, enabling continuous 3D atlas reconstruction of over 11 million cells within 5 hours on a single commodity GPU. Extensive evaluations across diverse spatial transcriptomics platforms, spanning both single-cell and spot-level technologies, demonstrate that STITCH consistently preserves transcriptomic identities, spatial topologies, and anatomical continuity. Overall, STITCH provides a scalable and platform-compatible computational framework for reconstructing high-resolution continuous spatial transcriptomic atlases.
bioinformatics 2026-06-06 v1
EnzOracle: Mechanism-aware prediction of enzyme environmental adaptation via a classification-guided mixture-of-experts framework
Wei, D.-Q.; Gao, Q.; Fang, Z.; Yuan, Y.; Jin, M.; Sun, H.; Peng, Z.; Yang, L.; Li, J.
Abstract
Industrial biocatalysis increasingly requires enzymes capable of operating under extreme physicochemical conditions, yet most natural sequence data reflect adaptation to mild environments, leading conventional predictive models to suffer from regression-to-the-mean effects in extremophilic regimes. Here we present EnzOracle, a classification-guided mixture-of-experts framework that enables distribution-aware prediction of enzyme melting temperature (Tm), optimal catalytic temperature (Topt), and optimal pH (pHopt) directly from sequence. EnzOracle demonstrated robust predictive accuracy across diverse benchmarks, achieving RMSE of 5.245 for Tm, 11.458 for Topt, and 0.781 for pHopt. Beyond predictive accuracy, we introduce a trait-resolved molecular simulation strategy to evaluate whether EnzOracle-derived attribution patterns correspond to independent physical mechanisms. Across representative systems, attention hotspots mapped onto rigidity-conferring interaction networks for Tm, dynamically preorganized active-site ensembles for Topt, and pH-dependent electrostatic and hydration networks for pHopt. These orthogonal validations indicate that EnzOracle captures transferable biophysical principles of enzyme environmental adaptation rather than merely exploiting dataset-specific correlations, positioning sequence-based learning as a mechanism-aware framework for discovering stability and activity determinants across diverse catalytic landscapes.
bioinformatics 2026-06-06 v1
Temporal Biodynamics: An AI Platform for Identification of Stage-Relevant Targets and Biomarkers
Natekar, P.; Yao, B.; Mohammad-Taheri, S.; Rusnak, A.; Gort-Freitas, N. A.; Fillatre, J.; Raymond, J. J.; Saksena, S. D.; Lipnick, S.; Sokolov, A.
Abstract
Temporal modeling of disease progression is poised to revolutionize the process of target identification, leading to better characterization of and intervention at the critical early stages of chronic conditions. Temporal Biodynamics is an artificial intelligence-driven platform that leverages within-tissue heterogeneity in cross-sectional cohorts to assemble a single, continuous trajectory of transcriptomic changes between health and disease. We demonstrate that the platform enriches for known disease-associated genes and proteins by more than 50% over the conventional case-control comparisons. When compared to other published pseudotime methods, our models were better at extracting disease-relevant signals in the presence of confounders and co-morbidities. The Temporal Biodynamics platform enables rich profiling of a disease continuum, providing temporal insights that are otherwise hidden by the traditional discrete staging of chronic diseases. This includes detecting cascades of molecular events, providing clues regarding causality, and increasing confidence in blood-based protein biomarkers using tissue-based context.
bioinformatics2026-06-06v1An inflammatory gene set driven epigenetic clock tracks down disease progression and rejuvenation
Sandor, P.; Kerepesi, C.; Castro, J. P.Abstract
Chronic, low-level inflammation, characterized by elevated pro-inflammatory programs, including epigenetic changes, in the absence of infection, is a major driver of aging and age-related diseases. On the other side of the spectrum, aging interventions work, at least in part, by decreasing inflammation. However, the molecular connection between epigenetic aging and inflammatory profiles in chronic diseases and rejuvenation has not been established yet. This study aimed to investigate the role of a newly described inflammatory signature gene set (ISig) in aging, previously associated with accelerated aging, in the progression of chronic diseases and rejuvenation. To achieve this, we developed inflammation-derived epigenetic aging clocks using ElasticNet regression models trained on CpG sites from ISig promoter regions. The newly developed inflammation aging clocks were validated on healthy samples and tested for their capacity to detect accelerated aging in diseased samples and rejuvenation during cellular reprogramming. The data demonstrate that the ISig inflammatory clocks accurately predict age, detect rejuvenation, and identify accelerated aging in disease contexts. Furthermore, we have demonstrated that it is possible to use a curated inflammatory gene-set with biological relevance to estimate biological age acceleration. We also developed a web application, the GeneClock Studio (available at https://ilab.sztaki.hu/geneclockstudio/), that allows researchers to apply the inflammatory aging clocks to their own DNA methylation datasets without requiring any programming expertise. Furthermore, the GeneClockStudio supports the training of new aging clocks based on an arbitrarily selected gene set in a similar way as in the case of the ISig inflammatory clocks.
bioinformatics2026-06-06v1Chromap Suite: an open-source single-binary platform for agentic multiomic RNA + ATAC profiling
Hung, L.-H.; Yeung, K. Y.Abstract
Background. Single-cell multiomic profiling of RNA expression and chromatin accessibility is now a standard tool for resolving regulatory state in single cells, but existing analysis toolchains have lagged. Cell Ranger ARC, the proprietary multiomic pipeline, uses a custom broad peak caller rather than the MACS3 narrow peaks that the ATAC field has consolidated on, and its restrictive end-user licence forbids redistribution of analysis pipelines that include it. A fully open-source, permissively-licensed alternative anchored on community-standard methods (Chromap for ATAC alignment and MACS3 for narrow peak calling) has been impractical to assemble because the two codebases are written in different languages with incompatible runtimes, leaving practitioners to chain them together with ad-hoc scripts. Results. We present Chromap Suite, the chromatin-accessibility side of an open-source multiomic stack built in support of the NIH Molecular Phenotypes of Null Alleles in Cells (MorPhiC) consortium's multiomic production pipeline. We extended Chromap with native BAM output and coordinate sorting, in-process narrow peak calling, optional Y-chromosome filtering, and native input from the compressed binary CBQ sequencing format alongside FASTQ, and hardened the result with a regression-test matrix that auto-validates the four upstream Chromap presets (bulk ATAC, scATAC, ChIP-seq, Hi-C). We reimplemented MACS3's narrow peak caller in portable C++ as libMACS3, byte-identical to MACS3 v3.0.3 and free of any Python interpreter dependency. Finally, we extracted Chromap's alignment and fragment-generation paths into a callable C++ library (libchromap) and embedded both libchromap and libMACS3 into STAR Suite, so that one STAR invocation runs alignment, peak calling, and cell calling for both RNA and ATAC modalities concurrently. To our knowledge this is the first true single-binary RNA + ATAC multiomic implementation. 
On the public 3K PBMC Multiome at 32 threads, the platform completes in 18 minutes 55 seconds wall time and 44.6 GB peak resident memory, against 40 minutes 4 seconds and 79.1 GB resident memory for Cell Ranger ARC v2.2.0 (a 2.12x wall speedup with 1.8x less peak memory), and produces 50,274 peaks that are byte-identical to MACS3 v3.0.3. To support deployment by both research scientists and the AI agents increasingly used in bioinformatics analysis, Chromap Suite ships a Model Context Protocol (MCP) server and a browser-based Launchpad driven by a shared set of composable YAML recipes that humans and agents drive the same way. Conclusions. Chromap Suite delivers a unified, freely redistributable multiomic pipeline that produces the MACS3 narrow peaks downstream ATAC analyses already rely on, with substantially lower wall time and memory than the proprietary alternative. The MIT- and BSD-3-licensed code carries no redistribution restrictions, the constituent libraries are independently embeddable in other open-source tools, and the MCP server plus Launchpad recipes make the platform straightforward to drive both by humans and by AI agents.
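The speedup and memory figures quoted above can be checked with simple arithmetic. The sketch below uses only the wall times and peak memory values stated in the abstract; the helper name `to_seconds` is introduced here for illustration.

```python
# Reproduce the reported ratios from the numbers given in the abstract.
def to_seconds(minutes, seconds):
    """Convert a minutes/seconds wall time to seconds."""
    return minutes * 60 + seconds

suite_s = to_seconds(18, 55)   # Chromap Suite wall time
arc_s = to_seconds(40, 4)      # Cell Ranger ARC v2.2.0 wall time

speedup = arc_s / suite_s      # wall-time speedup
memory_ratio = 79.1 / 44.6     # peak resident memory ratio (GB / GB)

print(f"wall speedup: {speedup:.2f}x")
print(f"peak memory ratio: {memory_ratio:.1f}x")
```

Both values round to the figures reported in the abstract (2.12x and 1.8x).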
bioinformatics2026-06-06v1Predicting Clinical Phenotypes by Growth Curve Modeling of Transcriptomic Signatures during Disease Progression
Akhlaghi, M.; Ghasemi, E.; Ray, M. S.; Pyne, S.Abstract
High-throughput gene expression data analysis has benefited from many statistical tests of differential expression across two or more groups, such as t tests and ANOVA. Yet, in complex transcriptomic datasets such as longitudinal or repeated measures, few studies have addressed such key issues as group effects and temporal dependency in expression profiles with a single model that is both practically effective and theoretically grounded. In this study, we used the Growth Curve Model (GCM), a generalization of MANOVA, to identify differentially expressed longitudinal gene profiles, and thereby predicted the associated clinical phenotypes, of pediatric lupus during disease progression across two different racial groups. In particular, we detected a module of histone genes which was shown to be linked with lupus. Key words: Growth Curve Model; Trace test; Longitudinal gene expression; Pediatric lupus; Overrepresentation analysis; Clinical phenotypes
bioinformatics2026-06-05v2Mycol: A user-friendly app for automating analysis of microscopy images
Bradley, S. A.; Schiesaro, G.; Webel, H.; Skumantz, M.; Novillo-Sanjuan, O.; Panagou, A.; Lucena-Marin, R.; Jensen, E. D.; Di Pietro, A.; Acevedo-Rocha, C. G.Abstract
Microscopy image analysis is central to modern biology, yet many available platforms remain inaccessible to non-specialist users because they require advanced technical expertise, code-based workflows, extensive setup, or paid access. This creates a barrier for researchers who need reliable and fast image quantification but lack dedicated computational support. Here, we introduce Mycol, an open-source, machine-learning-assisted image analysis platform designed to be accessible and run on standard laptops with minimal setup. Mycol supports end-to-end workflows in which users annotate microscopy images, perform human-in-the-loop fine-tuning of machine learning models for automated segmentation and classification, deploy machine learning models, quality control predictions and quantitatively compare morphological and class frequency descriptors through a single intuitive interface. By combining machine-learning analysis with efficient quality control by humans, Mycol makes rapid and high-quality image quantification available to biologists without requiring specialist training. We demonstrate the utility of Mycol in diverse workflows using two economically important organisms, the crop pathogen (Fusarium oxysporum) and the blue mussel (Mytilus edulis). Through Mycol, curated training sets were generated and high quality segmentation and classification models were obtained in each case. Deploying these models through Mycol decreased the time requirements and increased traceability of established cell counting workflows and facilitated a quantitative comparison of morphological parameters that reveals new patterns in early M. edulis larval development.
bioinformatics2026-06-05v1inGSEA: An Improved Method for Gene Set Enrichment Analysis Using a Weighted Integral Statistic
Zhang, Q.; Li, Q.Abstract
Gene Set Enrichment Analysis (GSEA) is one of the most popular methods for transcriptomic analysis, yet its statistical power is limited when the biological pathways exhibit heterogeneous or non-concordant expression patterns. We propose an improved GSEA method, integral-based GSEA (inGSEA). inGSEA introduces a novel enrichment score based on the Anderson-Darling weighted integral statistic. The new enrichment score enhances detection power for complex signals, particularly sparse and bidirectional ones, while the Cauchy combination of integral and classic maximum statistics provides robustness across diverse expression patterns. Extensive numerical studies demonstrate that inGSEA achieves superior power and well-calibrated false discoveries. Application to real-world datasets reveals biologically relevant pathways missed by the standard GSEA. inGSEA reduces the computational burden of permutation testing by employing a generalized gamma distribution to approximate the null distribution. inGSEA is accessible as a user-friendly web-based software tool (https://amss-stat.github.io/inGSEA).
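The Cauchy combination mentioned in the abstract can be illustrated with the standard Cauchy combination test for merging p-values. This is a minimal sketch of the generic method, assuming equal weights by default; how inGSEA weights or applies it is not stated here, and the function name `cauchy_combination` is hypothetical.

```python
import math

def cauchy_combination(p_values, weights=None):
    """Combine p-values with the standard Cauchy combination test.

    Each p-value is mapped to a Cauchy variate tan((0.5 - p) * pi); the
    weighted sum is again Cauchy-distributed under the null, so the
    combined p-value is read off the Cauchy tail.
    Assumes every p-value lies strictly between 0 and 1.
    """
    n = len(p_values)
    if weights is None:
        weights = [1.0 / n] * n  # assumption: equal weights
    t = sum(w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, p_values))
    return 0.5 - math.atan(t) / math.pi

# Illustrative inputs: one p-value from an integral-type statistic and one
# from a maximum-type statistic (values chosen only for demonstration).
combined = cauchy_combination([0.01, 0.2])
print(f"combined p-value: {combined:.4f}")
```

The combined value is dominated by the smaller p-value, which is why this kind of combination stays sensitive when only one of the two statistics detects a signal.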
bioinformatics2026-06-05v1Cellpin enables reference-based imputation and denoising of spatial transcriptomes
Putze, P.; Lucarelli, D.; Wellappili, D.; Bahrami, M.; Luecken, M. D.; Theis, F. J.; Saur, D.Abstract
Spatially resolved transcriptomics enables gene expression profiling within tissue architecture, but targeted panels leave much of the transcriptome unmeasured and spatial artifacts such as RNA diffusion and segmentation errors introduce technical noise. These limitations necessitate computational imputation and denoising, yet existing methods typically incorporate spatial measurements during training, limiting scalability and risking the embedding of technology-specific artifacts into learned representations. To address this, we present cellpin, a variational autoencoder trained exclusively on single-cell RNA sequencing data, using teacher-student latent distillation and noise-simulating augmentations to jointly impute unmeasured genes and denoise spatial profiles without requiring cross-modality alignment. Benchmarked against six methods across multiple paired datasets, cellpin achieves superior held-out gene prediction while scaling efficiently to atlas-size references and multi-sample cohorts. In full-transcriptome Atera data, cellpin reduces residual spatial noise and improves cell-state resolution, providing a scalable and principled foundation for biological discovery from spatial transcriptomics data.
bioinformatics2026-06-05v1A Reproducible and Extensible Benchmark of Supervised Cell Type Annotation Tools for Cytometry Data
Kirk, F.; Sonnenholzner, A.; Herranz del Cerro, J.; Scheel Wegener, H.; Modvig, S.; Olsen, L. R.Abstract
High-dimensional cytometry technologies such as flow cytometry (FCM) and mass cytometry (CyTOF) are central to immunophenotyping in research and clinical practice. While manual gating remains the standard for cell population annotation, it is time-consuming, difficult to scale, and subject to inter-operator variability. Supervised annotation methods have emerged as a way of scaling manual annotation work, yet independent benchmarks for comparing these tools remain limited and quickly become outdated. This study presents a reproducible and extensible benchmark of supervised cytometry annotation tools implemented within the OmniBenchmark framework. Five supervised annotation methods were evaluated, spanning linear models, nearest-neighbor approaches, tree-based classifiers, mixture-rule systems, and deep learning, across eight publicly available datasets carefully selected to cover technologies, tissues, panel designs, and healthy and disease contexts. Using a sample-centric cross-validation design that reflects common reference-mapping scenarios, overall and per-population F1 scores, performance on rare populations, runtime, and robustness to reduced training set sizes were tested. Performance varied substantially across datasets and was not fully explained by dataset size or dimensionality, highlighting both operator dependence in annotation and the importance of biological context, cohort heterogeneity, and population imbalance. Less prevalent populations (<1%) remained a key challenge for most methods. Downsampling analyses showed that moderate reference sizes were often sufficient to achieve near-maximum performance. Rather than ranking methods, this benchmark provides a standardized and transparent framework for evaluating annotation tools under realistic deployment conditions. As a living resource, the OmniBenchmark implementation supports continuous integration of new datasets, tools, and metrics for both tool developers and end users annotating datasets. 
This enables ongoing, reproducible method comparison and informed tool selection for diverse cytometry applications.
bioinformatics2026-06-05v1Development of the Mitochondrial Base Editor Analysis Package (MitoBEAP).
Mutti, C. D.; Nash, P.; Silva-Pinheiro, P.; Minczuk, M.; Van Haute, L.Abstract
For many years, the genetic manipulation of mitochondrial DNA was largely hampered by inefficient delivery of nucleic acids to mitochondria. However, the development of mitoCBEs, such as mitochondrial cytosine base editors (DdCBEs), which catalyse C-to-T and G-to-A conversions, and more recently, mitoABEs, such as transcription-activator-like effector (TALE)-linked deaminases (TALEDs) enabling A-to-G and T-to-C conversion, has transformed this field. Generally, mitochondrial base editors exhibit high on-target efficiency and are straightforward to design and use. Nonetheless, unintended off-target effects cannot be overlooked and should be assessed consistently with each experiment, which can be challenging without specialised bioinformatic expertise. Here, we introduce Mitochondrial Base Editor Analysis Package (MitoBEAP), which, to our knowledge, is the first R package specifically designed to analyse next-generation sequencing data from base-edited mtDNA samples. The package facilitates the analysis of potential off-target effects, offers multiple visualisation options, and allows customisation of graphics and thresholds for calculations. As a proof of concept, this study demonstrates how MitoBEAP can be utilised to measure the efficiency of DdCBE treatment targeting human 12S rRNA, as well as to identify potentially harmful off-target conversions across the mtDNA.
bioinformatics2026-06-05v1Towards Generalizable Protein-ligand Co-folding with ACER
Vithayapalert, N.; Grisoni, F.Abstract
Predicting protein-ligand complex structures is a central challenge in drug discovery. While recent co-folding models such as AlphaFold-3 achieve accurate structure prediction, they fail to generalize to underexplored binding interfaces, systematically misplacing ligands, particularly for allosteric or structurally novel targets. To address this gap, we present ACER (Adaptive Co-folding via pocket Exploration and pose Ranking), a training-free framework that (a) enables co-folding models to systematically explore alternative binding pockets, and (b) leverages the discovered pockets to increase pose accuracy. Our method enables the efficient discovery of non-prevalent pockets without prior expert knowledge. ACER improves pocket discovery and pose accuracy on allosteric targets and structurally novel complexes, successfully modeling binding interfaces that are under-represented or absent from the training set. Our results demonstrate how improved sampling dynamics enhance the generalisability of co-folding models without retraining.
bioinformatics2026-06-05v1A mitochondrial-immune axis drives the transcriptomic transition from brain aging to Alzheimer's disease
Pal, A.; Arif, S.; Karthikeyan, I.; Waisberg, E.; Guarnieri, J. W.Abstract
Aging is the primary risk factor for Alzheimer's disease (AD), yet the molecular transitions linking normal brain aging to neurodegeneration remain poorly defined. Here, we performed integrative bulk transcriptomic analyses across a multi-region mouse aging atlas, a human aging-to-AD cohort, and an independent human AD validation dataset. Aging is associated with a progressive, region-specific increase in transcriptional perturbation, with the entorhinal cortex and choroid plexus showing the most pronounced age-associated remodeling. Females develop more extensive late-stage remodeling than males, characterized by stronger immune activation and greater suppression of mitochondrial metabolic pathways. Across cohorts, aging drives a coordinated shift toward immune activation and suppression of oxidative phosphorylation and respiratory-chain programs that is amplified in AD. Aged brains occupy an intermediate molecular state between young and AD conditions, supporting a continuum model. Together, our findings define a sex-modulated mitochondrial-immune axis linking normal aging to AD and highlight early immune-metabolic changes as potential intervention targets.
bioinformatics2026-06-05v1TSvelo: Comprehensive RNA velocity by modeling cascade of gene regulation, transcription and splicing
Li, J.; Wang, Z.; Shen, H.-B.; Yuan, Y.Abstract
RNA velocity approaches fit gene dynamics and infer cell fate by modeling the splicing process using single-cell RNA sequencing (scRNA-seq) data. However, due to the short time scale of splicing and the high noise and complexity of the data, existing RNA velocity methods often fail to precisely capture the complex velocity dynamics of individual genes and single cells, which makes downstream analysis less reliable and less robust. We propose TSvelo, a comprehensive mathematical framework for RNA velocity that models the cascade of gene regulation, Transcription and Splicing using highly interpretable neural Ordinary Differential Equations (ODEs). TSvelo can precisely capture the transcription-unspliced-spliced 3D dynamics of all genes simultaneously, infer a unified latent time shared by genes within a single cell, and be applied to multi-lineage datasets. Experiments on six scRNA-seq datasets, including two multi-lineage datasets, demonstrate TSvelo's superiority.
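For orientation, the splicing kinetics that RNA velocity methods conventionally build on can be written as two linear ODEs: du/dt = alpha - beta*u for unspliced RNA and ds/dt = beta*u - gamma*s for spliced RNA. The sketch below integrates that conventional model with a forward Euler step. It is not TSvelo's neural ODE, and the rate values are arbitrary illustrative choices.

```python
def simulate(alpha, beta, gamma, t_end=50.0, dt=0.01):
    """Forward-Euler integration of the conventional splicing model.

    alpha: transcription rate, beta: splicing rate, gamma: degradation rate.
    Returns the unspliced (u) and spliced (s) abundances at t_end,
    starting from u = s = 0.
    """
    u, s = 0.0, 0.0
    for _ in range(int(t_end / dt)):
        du = alpha - beta * u        # unspliced: produced, then spliced away
        ds = beta * u - gamma * s    # spliced: gained by splicing, lost by decay
        u += dt * du
        s += dt * ds
    return u, s

# Illustrative rates (assumptions, not taken from any dataset).
u, s = simulate(alpha=2.0, beta=1.0, gamma=0.5)
velocity = 1.0 * u - 0.5 * s  # ds/dt at the final state, the "RNA velocity"
print(u, s, velocity)
```

With constant rates the system relaxes to the steady state u = alpha/beta and s = alpha/gamma, where the velocity is zero; nonzero velocity indicates a gene that is still being induced or repressed.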
bioinformatics2026-06-04v5SWARM resolves nanopore signal interference between RNA modification types and reveals splicing-shaped pseudouridylation
Prodic, S.; Cleynen, A.; Mahmud, S.; Srivastava, A.; Ravindran, A.; Kanchi, M.; Sethi, A. J.; Corovic, M.; Jain, R.; Santos-Rodriguez, G.; Vieira, G.; Preiss, T.; Weatheritt, R. J.; Hayashi, R.; Martinez, N. M.; Burgio, G.; Shirokikh, N. E.; Eyras, E.Abstract
Nanopore direct RNA sequencing promises to decode the epitranscriptome by detecting multiple modifications on individual RNA molecules, but its potential for biological discovery is hampered by high false-positive rates. We present SWARM, an AI-based framework designed to overcome this fundamental limitation. Its key innovation is a crosstalk-aware training strategy that incorporates non-target modifications and orthogonally validated cellular signals, enabling high-precision detection of m6A, pseudouridine (Ψ), and m5C at single-nucleotide and single-molecule resolution. Using rigorous in vitro and cellular RNA benchmarks, SWARM outperforms existing tools and maintains strong agreement with orthogonal methods. Applying SWARM across mammalian tissues reveals thousands of novel modification sites with confirmed motifs and localisation patterns. Our high-resolution multi-tissue modification map revealed no evidence of widespread m6A-Ψ interplay in predominant writer contexts, challenging models of a coordinated epitranscriptomic code. We further discovered a previously unrecognised splicing-shaped mode of Ψ deposition, whereby TRUB1-mediated pseudouridylation preferentially occurs after exon-exon ligation, consistent with local RNA structure stabilisation. SWARM provides a robust, universally applicable tool for epitranscriptome discovery.
bioinformatics2026-06-04v4STAR Suite: Transcriptomics processing in a single binary through AI-assisted development
Hung, L.-H.; Baker, D.; Flynn, B.; Huangfu, D.; Luo, R.; Robson, P.; Zhou, T.; Yeung, K. Y.Abstract
The STAR aligner plays a key role in complex transcriptomics pipelines consisting of multiple analytical tools. We present STAR Suite, a drop-in replacement for STAR that internalizes entire pipelines for bulk RNA-seq, scRNA-seq, Perturb-seq, 10x Flex, and SLAM-seq. Deployed by the NIH MorPhiC consortium, STAR Suite provides an open-source alternative to proprietary Cell Ranger pipelines, achieving gene-level Pearson correlations of 0.99-1.0 and 3.8- to 5.7-fold faster speeds for Perturb-seq and Flex analysis through improved methodologies. Integrating multi-module workflows into a single executable makes STAR Suite ready-to-use for both human researchers and the AI agents increasingly used in analytical workflows. STAR Suite was developed using AI agents, enabling a single developer to add 97,000 lines of code to the 28,000-line codebase in four months - illustrating a modern paradigm for large-scale integration of complex open-source codebases by individual research groups. Utilities are included to facilitate future community contributions using AI assistants.
bioinformatics2026-06-04v3OmniGene-4: A Unified Bio-Language MoE Model with Router-Level Interpretability
Wang, L.Abstract
How do multi-modal large language models that jointly process natural language and biological sequences (DNA, protein, structural alphabets) actually answer biological questions, especially sequence-grounded questions whose answer depends on residue-level patterns rather than literature recall? We introduce OmniGene-4, a unified bio-language Mixture-of-Experts foundation model on Gemma-4-26B-A4B (128 experts/layer, top-8 routing), and use its discrete router state to dissect this question. By hooking every router across eight task families, we provide the first router-level decomposition for a biological MoE: continued pretraining (CPT) accounts for 96% of cross-task expert differentiation and supervised fine-tuning (SFT) for 4%, reshaping middle and output layers respectively. Within the protein-homology task family, per-pair routing divergence stays below 0.04 (vs 0.23 cross-task), implying that sequence-grounded decisions occur inside expert computation rather than at the gate: the gate selects the modality, the experts compute the answer. The pipeline yields strong benchmarks: remote-homology 82.60% (ahead of ESM-2 3B, MMseqs2, and DIAMOND by 28-31 pp); standard homology 99.40%; BixBench (general biological-knowledge) 93.66%. A dual-head architecture adds per-residue 3Di/DSSP classifiers (78.6%/100%). To probe whether the discovered transfer mechanism is robust under modality scaling, we further extend the model to OmniGene-4-MM, adding four vision modalities (chemical-structure images, medical/pathology imagery, charts) via a vision tower and a three-stage LoRA pipeline at 1.5 GPU-days total. The multi-modal model preserves the homology capability (85% standard, 69.5% remote) and acquires chemist-readable structure understanding (96% on Vis-CheBI20 functional-group captioning) while consuming roughly four orders of magnitude less compute than recent specialized MoE bio-models. 
The work characterizes how multi-modal bio-foundation models acquire, route, and preserve sequence-aware capability, which is central to the next generation of scientific large language models.
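The "128 experts/layer, top-8 routing" configuration refers to a gate that scores every expert per token and keeps only the highest-scoring few. A minimal sketch of generic top-k routing follows, assuming a softmax renormalized over the selected experts; the actual gating details of the underlying model are not described in the abstract, and `top_k_route` and the random logits are purely illustrative.

```python
import math
import random

def top_k_route(logits, k=8):
    """Select the k highest-scoring experts and softmax-normalize their weights.

    Returns a dict mapping expert index -> routing weight (weights sum to 1).
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)                 # subtract max for stability
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}

# Hypothetical router logits for one token over 128 experts.
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(128)]
routing = top_k_route(logits, k=8)
print(sorted(routing))  # indices of the 8 selected experts
```

Because the selected set is discrete, it can be logged per token and compared across tasks, which is the kind of router state a router-level analysis would inspect.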
bioinformatics2026-06-04v3SciCore-Omics: a tri-modal foundation model unifying histology, spatial transcriptomics and language for spatial biology
Xiao, X.; Li, Y.; Zeng, Z.; Yan, Y.; Liu, Z.; Liu, Z.; Xiang, Y.; Ye, Z.; Ying, J.; Li, Y.; Xie, L.; He, F.Abstract
Histomorphology and spatial transcriptomics capture complementary aspects of tissue biology, but their relationships remain difficult to extract, align, and interpret at scale. Existing foundation models typically connect histology, omics, or language only pairwise, which limits their capacity to jointly infer molecular states, decode spatial tissue organization, and generate biologically grounded explanations. Here, we present SciCore-Omics, the first tri-modal foundation model linking histology images, spatial transcriptomics, and biological language. We constructed a spatially paired image-gene-text dataset comprising 151,182 spots across multiple tissues and performed a three-stage progressive training of SciCore-Omics on this dataset. Across gene expression prediction and spatial domain recognition, SciCore-Omics achieved 23.6-80.9% relative gains in task-specific metrics over the strongest external baselines. It further showed robust zero-shot generalization in histopathology classification, outperforming GPT-5 by 6.16 percentage points in mean accuracy across four benchmarks. Expert evaluation in 10 breast cancer cases confirmed its H&E-only case-level molecular reasoning capability. Together, our method demonstrates that a tri-modal framework can effectively bridge histomorphology and molecular state, providing a more general and interpretable foundation model for computational pathology and omics analysis.
bioinformatics2026-06-04v2Skiver: Reference-free quality control of metagenomic sequencing datasets using (k,v)-mer sketches
Gu, Z.; Sharma, P.; Wong, L.; Nagarajan, N.Abstract
Background. Quality control of sequencing datasets is an important first step in numerous bioinformatics pipelines such as mapping, variant calling, and assembly. Existing methods typically rely on alignment results or quality scores. However, the reference genome is not always available for mapping, and uncalibrated quality scores may yield biased estimates of error rates. Results. We present skiver, a reference-free and alignment-free algorithm that estimates sequencing error rates and calibrates Phred quality scores using (k,v)-mer sketches. By identifying the consensus from the sketched (k,v)-mers, skiver estimates survival and hazard rates that capture positional information of sequencing errors. Across simulated and real datasets from various sequencing platforms, skiver accurately recovers sequencing error rates and the proportion of different error types. We further demonstrate its ability to calibrate Phred scores. It also reliably handles complex datasets containing multiple strains, alleles, and repetitive regions through an iterative outlier filtering strategy. Skiver is computationally efficient and supports tools that need accurate sequencing error rate estimates or quality scores as prior knowledge. Availability and Implementation. An implementation of skiver is available at https://github.com/GZHoffie/skiver, and dataset and scripts for reproducibility are available at https://github.com/GZHoffie/skiver-test.
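Phred quality scores and error rates are related by the standard definition Q = -10 log10(p). The sketch below shows that conversion and how an observed error rate translates into an empirical Phred score. It does not reproduce skiver's (k,v)-mer-based estimation, and the counts are illustrative assumptions rather than results.

```python
import math

def phred_to_error(q):
    """Error probability implied by a Phred quality score."""
    return 10 ** (-q / 10)

def error_to_phred(p):
    """Phred quality score implied by an error probability (p > 0)."""
    return -10 * math.log10(p)

# Hypothetical counts: 50 erroneous bases observed among 100,000 bases
# that all carried the same reported quality score.
errors, bases = 50, 100_000
empirical_q = error_to_phred(errors / bases)

print(phred_to_error(20))      # error probability at Q20
print(round(empirical_q, 2))   # empirical quality for the counts above
```

If the reported score for those bases differed noticeably from the empirical value, that gap is what a calibration step would correct.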
bioinformatics2026-06-04v2Predicting P-glycoprotein Substrate Status Using a Pretrained Graph Neural Network: A TDC Benchmark Study
Yan, J.; Duan, W.Abstract
P-glycoprotein (Pgp/ABCB1) is a critical efflux transporter that significantly impacts drug bioavailability and multidrug resistance. Accurate prediction of Pgp substrate status is essential for early-stage drug discovery. In this study, we evaluate a pretrained Graph Isomorphism Network (GIN) with attribute masking on the Pgp_Broccatelli benchmark from the Therapeutics Data Commons (TDC). Our approach fine-tunes a GIN encoder pretrained on approximately 2 million molecules using a self-supervised attribute masking strategy, followed by a multilayer perceptron (MLP) classification head. On the TDC benchmark, our model achieves an AUROC of 0.937 +/- 0.004 across five independent runs, ranking second on the leaderboard, as of May 2026. We further compare this approach against an XGBoost baseline using Morgan fingerprints (AUROC 0.912 +/- 0.007), demonstrating the advantage of graph-based molecular representations with transfer learning for small-dataset ADMET prediction tasks.
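AUROC, the metric reported above, is the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. A minimal sketch of that pairwise definition follows; the labels and scores are toy values for illustration, not benchmark data.

```python
def auroc(labels, scores):
    """AUROC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    Quadratic in the number of examples; fine for small illustrations.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy example: 1 = substrate, 0 = non-substrate (illustrative only).
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
print(auroc(labels, scores))
```

Here eight of the nine positive/negative pairs are ordered correctly, so the result is 8/9.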
bioinformatics2026-06-04v1Proteomics-constrained deconvolution reveals spatial cell-type programs in tumours
Isik, E. B.; Haley, M. J.; Anbaki, A. A.; Bere, L.; Roncaroli, F.; Piper Hanley, K.; Couper, K.; Wedge, D. C.; Sellers, R.; Oliveira, P.; Ashton, J.; Bristow, R. G.; Alvarez, M. A.; Georgaka, S.; Rattray, M.Abstract
Accurately resolving cell-type mixtures in spatial transcriptomics remains challenging, particularly in heterogeneous tumours where cell populations are intermixed and matched single-cell references may be unavailable or poorly aligned. Current deconvolution approaches either require high-quality scRNA-seq references, suffer from scalability limitations, or lack interpretability. We introduce PISTACHIO, a proteomics-informed spatial transcriptomics deconvolution framework based on constrained non-negative matrix factorization with a negative-binomial likelihood. Rather than using probabilistic priors, PISTACHIO incorporates spatial cell-type constraints derived from paired Imaging Mass Cytometry, enforcing biologically grounded sparsity and explicit spatial feasibility of cell-type presence. PISTACHIO improved recovery of spatial cell-type distributions compared with Cell2location and STdeconvolve across synthetic and real tumour datasets. Our approach remains robust under cell-type assignment errors, maintaining high correlation with ground-truth under moderate noise, and achieves fast runtime on standard hardware, enabling practical large-scale deployment.
bioinformatics2026-06-04v1Language Modeling Materializes a World Model of Protein Biology
Candido, S.; Hayes, T.; Derry, A.; Rao, R.; Lin, Z.; Verkuil, R.; Wu, B. Z.; Lee, J. S.; Bruguera, E. S.; Keval, J. A.; Kopylov, M.; Pak, J. E.; Wu, W.; Thomas, N.; Mataraso, S.; Hsu, A.; Trotman-Grant, A. C.; Fatras, K.; dos Santos Costa, A.; Badkundri, R.; Akin, H.; Oktay, D.; Deaton, J.; Montabana, E.; Sitwala, H.; Yu, Y.; Wiggert, M.; Carlin, D. A.; Goering, A. W.; Blazejewski, T.; Sandora, M.; Hla, M.; Jia, T. Z.; Kloker, L. H.; Sofroniew, N. J.; Uehara, M.; Pannu, J.; Bachas, S.; Liu, D. S.; Sercu, T.; Rives, A.Abstract
Proteins are fundamental to life. The full extent of their biology is beyond our ability to characterize with experimental approaches in the physical laboratory. Accurate digital representations could accelerate the discovery of protein biology through virtual experiments. We propose language modeling to learn unified and general representations that can be scaled to all of protein biology. Building on these representations, we develop a structure prediction model that exceeds the performance of established methods for biomolecular complex prediction across benchmarks, including for the interactions of antibodies with their targets. A simple search procedure yields high experimental success rates for the discovery of proteins with nanomolar binding affinities for both miniproteins and single-chain antibodies, a modality critical for therapeutic design. Study of the concepts in the language model's representation space reveals a systematic organization aligned with the reductionist understanding of proteins developed through empirical science. Leveraging this organization, we generate a comprehensive map of protein biology encompassing over 6.8 billion sequences and 1.1 billion predicted structures, identifying connections across known and unknown biology. As a whole, this shows language modeling as a powerful substrate for representing the biology of proteins, operating across scales from the prediction and design of protein interactions at the atomic level, to identifying properties of proteins at different levels of granularity and abstraction, to the scale of mapping connections between proteins across billions of years of evolution.
bioinformatics2026-06-04v1An interpretable machine learning framework for dog breed inference and ancestry decomposition
Bian, Y.; Bierman, R.; Snyder-Mackler, N.; Promislow, D.; Karlsson, E.; Dog Aging Project Consortium; Akey, J. M.Abstract
The over 300 currently recognized breeds of domesticated dogs are the culmination of centuries of intense artificial selection and recurrent population bottlenecks. While breed labels are widely used in genetic and veterinary studies, inferring breed identity from genomic data remains challenging due to the high dimensionality of genotype data, uneven sampling across breeds, and admixture resulting in mixed-breed individuals. Here, we present an interpretable machine learning framework to infer dog breed labels from genome-wide SNP data. Our approach combines dimensionality reduction with a multi-output random forest model that maps genetic variation to a continuous representation of breed membership, enabling both classification and mixed-breed inference. We apply this framework to the Dog Aging Project (DAP) dataset of 6,572 purebred and mixed-breed dogs across 100 breed classes, achieving 91.7% accuracy with an overlap-based metric, outperforming an ADMIXTURE-based benchmark that achieved 87.8% accuracy. Notably, we find that as few as 150 informative SNPs are sufficient to achieve near-maximal predictive performance, highlighting the highly structured nature of canine genetic variation. We also introduce a SNP importance score metric that links model predictions back to individual genetic variants. Analysis of top-ranked variants reveals loci previously associated with morphological, pigmentation, and behavioral traits, as well as candidate loci lacking prior phenotypic annotation, supporting both the biological relevance and discovery potential of the framework. Together, these results demonstrate that our framework provides an accurate, flexible, and interpretable approach to predict breed ancestry, with applications in veterinary genomics, canine population genetics, and the identification of loci underlying hallmark breed phenotypes.
bioinformatics2026-06-04v1Hierarchical classification of immune cell transcriptomes at population-scale
Beltz, C.; Qiu, Z.; Sadowski, L.; Kraske, J. A.; Aggarwal, A.; Quintanal-Villalonga, A.; Manoj, P.; Littbarski, A.; Bajaj, S.; Meskauskaite, B.; Umeda, S.; Mazutis, L.; Rose, S. A.; Chan, J. M.; Nawy, T.; Nainys, J.; Chaligne, R.; de Stanchina, E.; Kaelber, K. A.; Cussigh, C. S.; Kallenberger, S. M.; Williams, A.; Jenzer, M.; Pompecki, T.; Kahle, S.; Hohmann, N.; Nussbaum, D. P.; Moss, N. S.; Ziv, E.; Berger, A. K.; Springfeld, C.; Zschaebitz, S.; Hassel, J. C.; Debus, J.; Jaeger, D.; Iacobuzio-Donahue, C. A.; Ganesh, K.; Peer, D.; Ungerechts, G.; Rudin, C. M.; Huber, P. E.; Walle, T.Abstract
Accurate immune cell classification is essential for interpreting single-cell RNA sequencing (scRNA-seq) data. However, progress is constrained by the lack of independent, high-resolution benchmarks, as the routine integration of datasets introduces statistical dependencies that artificially inflate model generalizability. Here, we present the single-cell universal classification omnibus (Suco), a resource of independent, uniform expert annotations, and Compocyte, a modular hierarchical classifier. Together, they establish a framework designed for the scale of human population immunology. This approach substantially outperforms existing classifiers while facilitating expert review of ambiguous annotations. Applying Compocyte across 50 studies, including three newly generated datasets, we classified 15.6 million leukocytes from 3,965 patients. Within this expansive cohort, we identified a new tumor-associated resorptive macrophage phenotype, a non-canonical monocyte subtype in subclinical cytokine release syndrome, and the programmatic erosion of T cell memory stemness across metastatic sites. Suco and Compocyte thus provide a generalizable architecture and benchmark capable of sustaining high-resolution annotation across massive clinical cohorts.
bioinformatics2026-06-04v1Genomic, Transcriptomic, and Regulomic Analyses Do Not Support Profound Autism as a Distinct Biological Category
Eicher, T. D.; Ne'eman, A.; Quackenbush, J. D.Abstract
The Lancet Commission on the Future of Care and Clinical Research in Autism proposed the construct of "profound autism" as a recognizable subtype of autism. Supporters argue that this classification is necessary to ensure that autistic persons with severe impairment receive appropriate research attention and policy support, whereas critics contend that the construct lacks scientific validity and may reflect social or political considerations more than biological distinction. To inform this debate, we evaluate whether the proposed "profound autism" category represents a distinct genetic phenotype using multiple molecular data types collected in a large cohort. Across genomic, transcriptomic, and regulatory analyses, we find no evidence supporting "profound autism" as a biologically distinct phenotypic group. Instead, differences emerge primarily in inferred gene regulatory networks distinguishing nonspeaking from speaking autistic children, suggesting potential regulatory mechanisms contributing to speech ability. These findings suggest that future research into severe impairment may be more productive if focused on specific traits -- such as speech impairment -- rather than attempting to define a distinct biological subtype within the multidimensional phenomenon of autism.
bioinformatics2026-06-04v1UnBlender: validating individual analyses in respiratory bulk RNA-seq cell type deconvolution
Gillett, T. E.; van den Berge, M.; Nawijn, M. C.; Koppelman, G. H.Abstract
Analysis of RNA-seq data of respiratory samples has contributed much to our understanding of lung disease. However, bulk RNA-seq data are dependent on both cell type composition and the transcriptional activity of these samples' constituent cells, which complicates interpretation. Cell type deconvolution is frequently used to estimate cell type proportions of bulk transcriptomic gene expression data and improve interpretation of bulk transcriptomics data. However, accuracy of the estimated cell type proportions reported after deconvolution is unknown, which may have a negative impact on the validity of the conclusions drawn. Here, we present UnBlender, a pipeline that enables respiratory scientists to perform cell type deconvolution and routinely evaluate deconvolution accuracy of their approach. UnBlender allows for custom cell type deconvolution tailored to the research question at hand, using consensus cell type labels and validating the approach to promote accurate, reproducible results.
bioinformatics2026-06-04v1Nanobodies versus canonical antibodies: an updated comparison of their binding modes
Hauser, A.; Dangla-Pelissier, G.; Cazals, F.Abstract
Heavy-chain-only antibodies, produced by the adaptive immune systems of camelids and cartilaginous fish, complement canonical antibodies that contain variable domains from both heavy and light chains. We refine previous studies by providing a detailed analysis of the binding modes of VHHs versus canonical antibodies, using a dataset with a 20-fold increase in the number of cases. We show that VHHs exhibit a larger buried surface area than double-domain antibodies, despite relying on a single variable domain. This property can be attributed to contributions from both framework regions and CDR3. We further demonstrate that the binding modes of VHHs, characterized by the number of FR and CDR regions contacting the antigen, are more diverse than previously reported. In addition, we find that VHH and canonical antibody interfaces display similar solvation properties, although VHH interfaces are more tightly packed. Finally, we discuss the thermodynamic and kinetic implications of these findings for the design of high-affinity VHHs, an issue of particular importance in protein engineering and design.
bioinformatics2026-06-04v1De Novo Design and Computational Validation of a High-Affinity Peptide Inhibitor Targeting the HPV E1-E2 Interface
Fletcher, S.; Biswas-Fiss, E. E.; Biswas, S. B.Abstract
The oncogenic progression of high-risk Human Papillomavirus (HPV) strains relies fundamentally on the cooperative interaction between the E1 replicative helicase and the E2 origin-binding protein to initiate viral DNA amplification. Disrupting this essential protein-protein interaction presents a highly promising, yet clinically unrealized, therapeutic paradigm for treating established HPV infections prior to malignant transformation. This research presents a comprehensive computational pipeline for evaluating and screening de novo generated peptide inhibitors. We utilize the HPV E1-E2 protein interface as a proof of concept, specifically targeting a highly conserved arginine triad located on the solvent-exposed surface of the E1 helicase. Utilizing the AlphaProteo generative model for sequence discovery and AlphaFold 3 for complex structural prediction, a library of candidate binders was generated and subsequently subjected to dual-scale Molecular Dynamics simulations and thermodynamic validation utilizing GROMACS. The results establish Binder 8 as the lead candidate, yielding a predicted binding free energy (-59.1 +/- 0.7 kcal/mol) that indicates a significantly stronger theoretical affinity than the native E1-E2 baseline. Energy decomposition confirms that Binder 8 binds the E1 interface via precise interactions involving the arginine triad. Furthermore, deep-learning-based physicochemical profiling utilizing CSM-Toxin and AlgPred 2.0 confirms that Binder 8 possesses an optimal safety profile, exhibiting zero predicted toxicity and non-allergenic properties. Protein sequence alignment confirms the evolutionary conservation of the targeted arginine triad across the vast majority of oncogenic Alpha-papillomavirus genotypes, highlighting Binder 8 as a promising candidate scaffold for broad-spectrum antiviral development. 
The study demonstrates a computational solution for E1-E2 disruption, setting the stage for future in vitro validation via Bio-layer interferometry to confirm physical inhibition.
bioinformatics2026-06-04v1A comparative analysis of promoter-proximal pausing reveals kinetic and distributional dimensions of variation
Zeng, X.; Barshad, G.; Hassett, R.; Rice, E. J.; Danko, C. G.; Siepel, A.; Zhao, Y.Abstract
Promoter-proximal pausing of RNA polymerase II is a key regulatory checkpoint in metazoan transcription. Despite extensive study of this process, quantitative methods for comparing pausing dynamics across biological contexts have been lacking. Here we introduce a model-based framework for rigorous comparative analysis of both pause-escape kinetics and pause-site distributions across genes, cell types, and species. An application to available PRO-seq datasets revealed striking differences across perturbations, and comparative analyses across cell types and species highlighted distinct patterns of variation in both pause-escape kinetics and pause-site distributions, with only weak coupling between them. Integration with chromatin and sequence features showed that lower pause-escape rates are associated with stronger promoter-proximal nucleosome occupancy, whereas changes in pause-site dispersion are associated with sequence features such as GC skew. Together, these results establish a quantitative framework for comparative analysis of promoter-proximal pausing and reveal kinetic and distributional dimensions of pausing variation across biological contexts.
bioinformatics2026-06-04v1Learning residue-level context for modeling protein-protein interactions
Zhang, Z.; Yang, Z.; Liu, A.; Yu, K.-H.; Zhao, J.; Yang, Y.; Neale, B.; Chen, S.Abstract
Protein language models (PLMs) enable prediction of protein properties by learning residue-level features from sequence, yet most PLM-based approaches to protein-protein interactions aggregate information across entire proteins, limiting resolution and interpretability. Here we present ReCLIP, a transformer-based framework that learns interaction-specific representations at the level of individual residues by combining intra-protein residue neighborhoods with residue-conditioned representations of interaction partners. We show that residue-centered context provides a general framework for modeling protein interactions across diverse biological settings. ReCLIP accurately predicts mutation-induced perturbations (AUROC = 0.973), generalizes to post-translational modifications that do not alter sequence (AUROC = 0.822), and enables zero-shot prediction of peptide-MHC binding across unseen alleles (AUROC up to 0.972). Analysis of learned residue neighborhoods reveals structurally and functionally coherent patterns aligned with known determinants of binding. Applied to clinically annotated genetic variants, ReCLIP identifies disease-associated interaction perturbations that link pathogenic variants to specific molecular interaction contexts. Our results establish a generalizable and interpretable framework for modeling protein interactions and provide insights into how residue-level context shapes interaction specificity and its perturbation.
bioinformatics2026-06-04v1Assessing and Optimizing Low-Frequency Somatic Mutation Detection: A Multi-Platform High-Throughput Sequencing Perspective
Feng, B.; Lin, Y.; Liu, L.; Lin, Q.; Lin, Y.; Liu, Y.; Li, J.; Lei, C.; Chen, C.; Yang, M.; Peng, X.; Zhou, Z.; Yan, Q.; Sun, L.; Li, Q.Abstract
The availability of multiple commercial short-read sequencing platforms necessitates systematic cross-platform performance comparisons, particularly for challenging applications such as low-frequency somatic mutation detection. Here, a large-scale targeted sequencing dataset from five Genome in a Bottle (GIAB) human genomic DNA reference standards, HG001 to HG005, alongside Twist Biosciences cfDNA reference standards featuring 1% variant allele frequency (VAF), was generated by six platforms (NovaSeq 6000, NovaSeq X, FASTASeq 300, GenoLab M, SURFSeq 5000, and MGISEQ-T7). To build a realistic benchmark while keeping authentic sequencing backgrounds, we developed PosMix, a simulation tool that generates position-specific VAFs. To overcome the limitations of conventional variant callers (high recall with poor precision for VarScan2, higher precision with lower recall for Strelka2/Mutect2), we developed SomaticXGB, a machine learning-based caller. In this study, SURFSeq 5000 consistently exhibited the lowest error rates and achieved superior accuracy for VAFs as low as 0.5%, outperforming all other sequencing platforms. On the other hand, SomaticXGB attained F1 scores of approximately 0.92 on simulated datasets with VAFs ranging from 0.5% to 1.5% and 0.89 on Twist 1% standards, substantially outperforming conventional methods. This work delivers a rich multi-platform data resource, offering a standardized pipeline for performance benchmarking and a machine learning-based strategy for optimized somatic mutation detection.
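The precision/recall trade-off described in the abstract above can be made concrete with the standard F1 formula. The following is a minimal sketch, not part of the study: the function name `f1_score` and all counts are hypothetical and chosen only to show how a caller with high recall but poor precision ends up with a low F1.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return the F1 score (harmonic mean of precision and recall).

    tp, fp, fn are counts of true positive, false positive, and
    false negative variant calls.
    """
    if tp == 0:
        # No correct calls: precision and recall are both zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only (not from the study).
# A caller that finds almost everything but reports many false positives:
high_recall_low_precision = f1_score(tp=95, fp=200, fn=5)
# A caller with balanced precision and recall:
balanced = f1_score(tp=90, fp=10, fn=10)

print(f"high recall, low precision: {high_recall_low_precision:.3f}")
print(f"balanced:                   {balanced:.3f}")
```

Because F1 is a harmonic mean, it is pulled toward the weaker of the two quantities, which is why a single score can summarize both failure modes of a variant caller.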
bioinformatics2026-06-03v2STAR Suite: Transcriptomics processing in a single binary through AI-assisted development
Hung, L.-H.; Baker, D.; Flynn, W. F.; Huangfu, D. F.; Luo, R.; Robson, P.; Zhou, T.; Yeung, K. Y.Abstract
The STAR aligner plays a key role in complex transcriptomics pipelines consisting of multiple analytical tools. We present STAR Suite, a drop-in replacement for STAR that internalizes entire pipelines for bulk RNA-seq, scRNA-seq, Perturb-seq, 10x Flex, and SLAM-seq. Deployed by the NIH MorPhiC consortium, STAR Suite provides an open-source alternative to proprietary Cell Ranger pipelines, achieving gene-level Pearson correlations of 0.99-1.0 and 3.8- to 5.7-fold faster speeds for Perturb-seq and Flex analysis through improved methodologies. Integrating multi-module workflows into a single executable makes STAR Suite ready-to-use for both human researchers and the AI agents increasingly used in analytical workflows. STAR Suite was developed using AI agents, enabling a single developer to add 97,000 lines of code to the 28,000-line codebase in four months - illustrating a modern paradigm for large-scale integration of complex open-source codebases by individual research groups. Utilities are included to facilitate future community contributions using AI assistants.
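For readers unfamiliar with the concordance measure quoted above, the sketch below computes a Pearson correlation between two vectors of per-gene counts. It is an illustration under stated assumptions: the function name `pearson` and the example counts are hypothetical, and the abstract does not say how counts were transformed before correlation.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of centered cross-products and squares.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-gene counts from two pipelines (illustration only).
pipeline_a = [120, 0, 35, 980, 14]
pipeline_b = [118, 1, 36, 975, 15]
print(f"gene-level Pearson r = {pearson(pipeline_a, pipeline_b):.4f}")
```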
bioinformatics2026-06-03v2The machine-learning classifier ALLCatchR2 identifies 20 T-ALL subtypes across cohorts and age groups
Beder, T.; Wolgast, N.; Walter, W.; Bendig, S.; Hartmann, A. M.; Barz, M. J.; Zaliova, M.; Reitzel, E.; Baden, D.; Schwartz, S. M.; Gökbuget, N.; Kester, L.; Trka, J.; Haferlach, C.; Brüggemann, M.; Baldus, C. D.; Neumann, M.; Bastian, L.Abstract
T-cell acute lymphoblastic leukemia (T-ALL) comprises molecularly diverse subtypes, but robust cross-cohort validations and operational gene-expression definitions are lacking. To establish a gene-expression-anchored framework for T-ALL subtyping, we aggregated 2,314 transcriptomes (15 cohorts, age: 0.8 to 90.8 years). An extended unsupervised approach defined 17 main clusters and 3 subclusters in samples with high blast fractions. Supervised analyses added an overarching immature T-ALL (ETP-like) definition and resolved the LMO2 γδ-like subtype. All clusters contained samples from at least two cohorts. Characteristic genomic driver enrichments were consistent across cohorts, while gene expression clusters did not correspond exclusively to single driver events but also reflected developmental origins. A machine learning classifier based on ALLCatchR, our B-ALL classifier, identified these 20 transcriptomic subtypes and the immature T-ALL (ETP-like) signature with 0.995-1.0 accuracy in a validation set (n=203). Testing the classifier on a second hold-out data set (n=265 samples) showed that 92.7% of predictions matched with corresponding driver alterations. Across all samples, 83.2% of cases received high-confidence predictions, 7.3% candidate predictions, and 9.5% remained unclassified, largely because of low blast fractions. We identified a novel gene expression cluster markedly enriched (P<0.001) for clonal hematopoiesis mutations (IDH2 R140Q, DNMT3A) and a stem-/progenitor cell-like gene expression. This novel "clonal hematopoiesis-related" T-ALL subtype was observed in six cohorts representing 8.9% of adults and 39.5% of patients aged >50 years. We advanced ALLCatchR as a free R package that now enables B-/T-lineage separation, gene-expression subtyping, blast estimation, and developmental annotation to harmonize T-ALL classification across studies and clinical contexts.
bioinformatics2026-06-03v2ROTS 2.0: A reproducibility-driven framework for robust statistical modeling across diverse high-throughput omics study designs
Suomi, T.; Kettunen, J.; Pusa, T.; Elo, L. L.Abstract
Reproducibility is fundamental to reliable scientific discoveries. The reproducibility-optimized test statistic (ROTS) is a robust framework designed to identify reproducible features (e.g. genes or proteins) in high-dimensional differential expression analyses such as transcriptomics and proteomics. This is achieved by optimizing the reproducibility of feature rankings under resampling. While originally implemented for univariate settings, ROTS now accommodates multi-group comparisons, survival analysis, linear models, and linear mixed-effects models, broadening its applicability to more complex and clinically relevant experimental designs. Using diverse simulations, benchmark datasets, and real-world case studies, we demonstrate the benefits of ROTS reproducibility optimization compared to the corresponding conventional test statistics. Additionally, we illustrate the utility of the reproducibility characteristics in assessing the overall reliability of the results. To facilitate widespread adoption, ROTS is provided as an open-source software package available through R/Bioconductor. Furthermore, to broaden the user base, we now also provide a Python interface available at pypi.org/project/PyROTS/.
bioinformatics2026-06-03v1Information Geometry of Intracellular Compartment Coupling Reveals Transcriptomic State Transitions in Single Cells
Sung, J.-Y.; Cheong, J.-H.Abstract
Single-cell transcriptomic analyses typically characterize cellular states using gene-expression variability, dimensionality reduction, and trajectory inference. However, existing approaches provide limited insight into how transcriptomic information is organized across interacting intracellular compartments. Here we introduce Compartment Coupling Entropy (CCE), an information-geometric framework that quantifies the organization of transcriptomic coupling between spliced and unspliced RNA compartments. CCE constructs a cross-compartment coupling operator from compartment-resolved transcriptomic profiles and characterizes its singular-value spectrum using coupling entropy, effective coupling dimension, and coupling susceptibility. These metrics measure how transcriptomic information is distributed across coupling modes and provide a quantitative description of transcriptomic organization beyond conventional expression-based statistics. Applying CCE to pancreatic endocrine differentiation revealed substantial remodeling of coupling architecture along developmental trajectories. Coupling entropy and effective coupling dimension underwent transient collapse and re-expansion during lineage progression, while coupling susceptibility identified discrete intervals of rapid transcriptomic reorganization corresponding to candidate cell-state transition regimes. Across cell states, coupling entropy showed weak correspondence with classical mutual information, indicating that spectral coupling organization captures information not represented by conventional information-theoretic measures. An organization ratio and spectral excess information further quantified the divergence between classical and coupling-based descriptions of transcriptomic structure. Robustness analyses demonstrated stability of the framework under bootstrap resampling, gene subsampling, spectral truncation, and trajectory discretization. 
Application to an independent dentate gyrus developmental dataset revealed similar hierarchical coupling spectra and susceptibility-defined transition regimes, suggesting that transient reorganization of compartment-coupling architecture may represent a general feature of cellular state transitions. CCE provides a general methodology for quantifying the information geometry of intracellular transcriptomic organization and complements existing single-cell analytical approaches by revealing coupling architectures that are inaccessible to conventional expression-based analyses.
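The abstract above does not give formulas for coupling entropy or effective coupling dimension. One common way to summarize a singular-value spectrum, assumed here purely for illustration, is the Shannon entropy of the normalized singular values and its exponential as an effective dimension. The function name `coupling_spectrum_summary` and the normalization choice are hypothetical and may differ from the authors' definitions.

```python
import math


def coupling_spectrum_summary(singular_values):
    """Summarize a singular-value spectrum.

    Assumed definitions (not taken from the paper):
      weights       p_i = s_i / sum_j s_j   over positive singular values
      entropy       H   = -sum_i p_i * ln(p_i)
      effective dim D   = exp(H)
    Returns (entropy, effective_dimension).
    """
    positive = [s for s in singular_values if s > 0]
    if not positive:
        raise ValueError("need at least one positive singular value")
    total = sum(positive)
    weights = [s / total for s in positive]
    entropy = -sum(w * math.log(w) for w in weights)
    return entropy, math.exp(entropy)


# Hypothetical spectra: coupling spread evenly over four modes
# versus coupling dominated by a single mode.
spread = coupling_spectrum_summary([1.0, 1.0, 1.0, 1.0])
dominated = coupling_spectrum_summary([10.0, 0.1, 0.1, 0.1])
print("spread:    entropy=%.3f, effective dimension=%.2f" % spread)
print("dominated: entropy=%.3f, effective dimension=%.2f" % dominated)
```

Under these assumed definitions, a drop in effective dimension corresponds to coupling concentrating in fewer modes. In practice the singular values would come from a decomposition of the cross-compartment coupling operator, for example with `numpy.linalg.svd`.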
bioinformatics2026-06-03v1