Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
STAR Suite: Transcriptomics processing in a single binary through AI-assisted development
Hung, L.-H.; Baker, D.; Flynn, W. F.; Huangfu, D. F.; Luo, R.; Robson, P.; Zhou, T.; Yeung, K. Y.
Abstract
The STAR aligner plays a key role in complex transcriptomics pipelines consisting of multiple analytical tools. We present STAR Suite, a drop-in replacement for STAR that internalizes entire pipelines for bulk RNA-seq, scRNA-seq, Perturb-seq, 10x Flex, and SLAM-seq. Deployed by the NIH MorPhiC consortium, STAR Suite provides an open-source alternative to proprietary Cell Ranger pipelines, achieving gene-level Pearson correlations of 0.99-1.0 and 3.8- to 5.7-fold faster speeds for Perturb-seq and Flex analysis through improved methodologies. Integrating multi-module workflows into a single executable makes STAR Suite ready to use for both human researchers and the AI agents increasingly used in analytical workflows. STAR Suite was developed using AI agents, enabling a single developer to add 97,000 lines of code to the 28,000-line codebase in four months, illustrating a modern paradigm for large-scale integration of complex open-source codebases by individual research groups. Utilities are included to facilitate future community contributions using AI assistants.
bioinformatics 2026-06-03 v2
The machine-learning classifier ALLCatchR2 identifies 20 T-ALL subtypes across cohorts and age groups
Beder, T.; Wolgast, N.; Walter, W.; Bendig, S.; Hartmann, A. M.; Barz, M. J.; Zaliova, M.; Reitzel, E.; Baden, D.; Schwartz, S. M.; Gökbuget, N.; Kester, L.; Trka, J.; Haferlach, C.; Brüggemann, M.; Baldus, C. D.; Neumann, M.; Bastian, L.
Abstract
T-cell acute lymphoblastic leukemia (T-ALL) comprises molecularly diverse subtypes, but robust cross-cohort validations and operational gene-expression definitions are lacking. To establish a gene-expression-anchored framework for T-ALL subtyping, we aggregated 2,314 transcriptomes (15 cohorts, age: 0.8 to 90.8 years). An extended unsupervised approach defined 17 main clusters and 3 subclusters in samples with high blast fractions. Supervised analyses added an overarching immature T-ALL (ETP-like) definition and resolved the LMO2 γδ-like subtype. All clusters contained samples from at least two cohorts. Characteristic genomic driver enrichments were consistent across cohorts, while gene expression clusters did not correspond exclusively to single driver events but also reflected developmental origins. A machine learning classifier based on ALLCatchR, our B-ALL classifier, identified these 20 transcriptomic subtypes and the immature T-ALL (ETP-like) signature with 0.995-1.0 accuracy in a validation set (n=203). Testing the classifier on a second hold-out data set (n=265 samples) showed that 92.7% of predictions matched the corresponding driver alterations. Across all samples, 83.2% of cases received high-confidence predictions, 7.3% candidate predictions, and 9.5% remained unclassified, largely because of low blast fractions. We identified a novel gene expression cluster markedly enriched (P<0.001) for clonal hematopoiesis mutations (IDH2 R140Q, DNMT3A) and stem-/progenitor cell-like gene expression. This novel "clonal hematopoiesis-related" T-ALL subtype was observed in six cohorts, representing 8.9% of adults and 39.5% of patients aged >50 years. We advanced ALLCatchR as a free R package that now enables B-/T-lineage separation, gene-expression subtyping, blast estimation, and developmental annotation to harmonize T-ALL classification across studies and clinical contexts.
bioinformatics 2026-06-03 v2
Assessing and Optimizing Low-Frequency Somatic Mutation Detection: A Multi-Platform High-Throughput Sequencing Perspective
Feng, B.; Lin, Y.; Liu, L.; Lin, Q.; Lin, Y.; Liu, Y.; Li, J.; Lei, C.; Chen, C.; Yang, M.; Peng, X.; Zhou, Z.; Yan, Q.; Sun, L.; Li, Q.
Abstract
The availability of multiple commercial short-read sequencing platforms necessitates systematic cross-platform performance comparisons, particularly for challenging applications such as low-frequency somatic mutation detection. Here, a large-scale targeted sequencing dataset from five Genome in a Bottle (GIAB) human genomic DNA reference standards, HG001 to HG005, alongside Twist Biosciences cfDNA reference standards featuring 1% variant allele frequency (VAF), was generated by six platforms (NovaSeq 6000, NovaSeq X, FASTASeq 300, GenoLab M, SURFSeq 5000, and MGISEQ-T7). To build a realistic benchmark while keeping authentic sequencing backgrounds, we developed PosMix, a simulation tool that generates position-specific VAFs. To overcome the limitations of conventional variant callers (high recall with poor precision for VarScan2, higher precision with lower recall for Strelka2/Mutect2), we developed SomaticXGB, a machine learning-based caller. In this study, SURFSeq 5000 consistently exhibited the lowest error rates and achieved superior accuracy for VAFs as low as 0.5%, outperforming all other sequencing platforms. Separately, SomaticXGB attained F1 scores of approximately 0.92 on simulated datasets with VAFs ranging from 0.5% to 1.5% and 0.89 on Twist 1% standards, substantially outperforming conventional methods. This work delivers a rich multi-platform data resource, offering a standardized pipeline for performance benchmarking and a machine learning-based strategy for optimized somatic mutation detection.
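The precision/recall trade-off that the abstract attributes to conventional callers is what the F1 score summarizes, since F1 is the harmonic mean of the two. A minimal Python sketch follows; the input values are illustrative only and are not results from the study.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical callers: these numbers are made up to show the trade-off.
high_recall_low_precision = f1_score(precision=0.50, recall=0.95)
balanced = f1_score(precision=0.92, recall=0.92)
print(round(high_recall_low_precision, 3), round(balanced, 3))
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a caller with very high recall but poor precision still scores well below a balanced one.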
bioinformatics 2026-06-03 v2
GalaxyVS: Exploring 100-Billion Compounds in Seconds
Hong, X.; Li, P.; Zhu, W.; Wu, C.; Guo, H.; Tan, H.; Wu, Q.; Wu, K.; Chen, L.; Jia, Y.; Gao, B.; Jian, X.; Lai, Z.; Lu, Y.; Meng, X.; Lan, Y.
Abstract
We present GalaxyVS, a hardware-software co-designed virtual screening framework built to explore the 100-billion commercially accessible chemical space in seconds, deployed at the National Supercomputing Center in Tianjin. Built upon the dense vector retrieval paradigm of DrugCLIP, GalaxyVS bypasses the structural dependencies and computational overhead of classical docking to enable rapid screening against experimentally determined as well as geometrically feasible pockets on AlphaFold-predicted structures. To scale this paradigm to the 100-billion level, the system must overcome the significant computational burden of offline representation encoding, critical memory and I/O bottlenecks during online retrieval, and the risks of diversity collapse and precision loss within final screening results. Utilizing the heterogeneous supercomputing infrastructure, GalaxyVS accelerates the offline encoding through deep operator adaptations and resolves online retrieval bottlenecks via disk-native vector indexing coupled with in-memory staging to ensure both broad accessibility and high throughput. Concurrently, a two-stage refinement protocol effectively mitigates diversity collapse and ensures high-fidelity affinity ranking. Consequently, GalaxyVS achieves a daily scoring throughput of 1.5 × 10^16 target-ligand pairs, representing a six-orders-of-magnitude leap over previous supercomputing records. Driven by this throughput, we screened nearly 100,000 protein structures across six species against the 100-billion compound library in just 16 hours. The resulting comprehensive cross-species interaction landscape, GalaxyDB, will be openly released at https://galaxyvs.drugclip.com.
bioinformatics 2026-06-03 v1
Deep Proteoform Sequencing with Top-Down Direct Mass Technology
Durbin, K. R.; Su, T.; Fellers, R. T.; McGee, J. P.; Fisher, N. P.; Hollas, M. A. R.; Kafader, J. O.; Kelleher, N. L.
Abstract
Individual Ion Mass Spectrometry (I2MS) using Direct Mass Technology mode on an Orbitrap mass spectrometer (DMTm) increases sensitivity, resolution, and mass range for protein analysis. Here, we present an end-to-end workflow for deep proteoform sequencing using top-down mass spectrometry with DMTm. By assigning the charge of individual fragment ions and converting spectra from the m/z to the mass domain, DMTm resolves overlapping isotopic distributions that have limited conventional top-down mass spectrometry. Across different fragmentation modes on Orbitrap mass spectrometers, top-down DMTm significantly outperformed conventional top-down mass spectrometry methods. For a glycosylated 50.8 kDa antibody heavy chain, sequence coverage was greatly increased, from 27.5% to 83.3%, in 10 minutes of acquisition using a single fragmentation mode. Coverage of the middle 350 residues improved from 0% to >95%, demonstrating near-complete coverage of the difficult-to-characterize internal region of a large protein. The fragmentation patterns of DMTm were found to be complementary to conventional top-down, with higher internal coverage for DMTm and higher terminal coverage for conventional. Accordingly, aggregation of the data from the two modes further increased heavy chain sequence coverage to 90.2%. A new software platform, Proteoform Studio, provided optimized ion processing for improved sequence coverage and enabled real-time experimental monitoring as individual ions were accumulated. The platform automatically integrates conventional and DMTm data to provide the most comprehensive sequence coverage possible. Together, these advances enable substantially deeper proteoform sequencing and establish a straightforward, complete top-down DMTm workflow to confidently define proteoforms in biological systems and biotherapeutic development.
bioinformatics 2026-06-03 v1
Improving the Accuracy of Forensic Age Estimation Through Bias Reduction
Flores, M.; Pellegrini, M.
Abstract
Chronological age estimation can provide supporting information in forensic casework when traditional identification methods are limited. DNA methylation, a stable epigenetic mark, has emerged as a promising tool for predicting chronological age from trace samples. However, many existing age estimation models rely on linear regression approaches, which often yield biased prediction errors across the age distribution (i.e., model residuals show a significant age dependence). In this study, we compared three approaches for age estimation modeling: multivariable linear regression, random forest regression, and maximum likelihood estimation. While the first two approaches are well established, for the third we constructed and validated a DNA methylation-based LOESS regression maximum likelihood model for age estimation utilizing forensically relevant CpG markers. In all cases, model performance was evaluated through Leave-One-Out Cross-Validation (LOOCV). We used three independent, publicly accessible methylation datasets collected using droplet digital PCR (ddPCR) to identify which method performed best in terms of accuracy and bias in age estimation. Notably, the maximum likelihood approach showed less bias in its age-associated residuals than either multivariable linear regression or random forest regression. These findings highlight the utility of non-linear modeling techniques in reducing the biases of epigenetic age estimation for forensic applications.
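The evaluation described in this abstract combines two ideas: leave-one-out cross-validation, and checking whether prediction residuals depend on age. A minimal Python sketch of both follows, using a single-predictor linear model. The toy data, the function names, and the use of a single CpG are assumptions made for illustration. This is not the authors' LOESS maximum likelihood model.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x with a single predictor."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def loocv_predictions(methylation, age):
    """Predict each sample's age from a model trained on all other samples."""
    preds = []
    for i in range(len(age)):
        train_x = methylation[:i] + methylation[i + 1:]
        train_y = age[:i] + age[i + 1:]
        intercept, slope = fit_line(train_x, train_y)
        preds.append(intercept + slope * methylation[i])
    return preds


def residual_age_slope(age, preds):
    """Slope of (predicted - true) against true age; near 0 means little age-dependent bias."""
    residuals = [p - a for p, a in zip(preds, age)]
    return fit_line(age, residuals)[1]


# Hypothetical toy data: one CpG's methylation fraction and donor age in years.
methylation = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60]
age = [12.0, 21.0, 33.0, 39.0, 52.0, 60.0]
preds = loocv_predictions(methylation, age)
mae = mean(abs(p - a) for p, a in zip(preds, age))
print(f"LOOCV MAE: {mae:.2f} years, residual slope: {residual_age_slope(age, preds):.3f}")
```

A residual slope that differs clearly from zero would indicate the kind of age-dependent bias the abstract refers to, for example overestimating young donors and underestimating old ones.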
bioinformatics 2026-06-03 v1
Topology-aware reconstruction of cellular state landscapes from microscopy using self-supervised learning
Messori, E.; Taha, D. M.; Fournier, L.; Foix Romero, A.; Uhlmann, V.; Frossard, P.; Vincent-Cuaz, C.; Patani, R.; Luisier, R.
Abstract
Morphology and spatial organisation provide complementary readouts of cellular state. However, reconstructing continuous cellular state landscapes from imaging data remains challenging, particularly in dense biological cultures. Here we present SI-SimCLR, a spatially informed self-supervised learning framework that learns biologically informative representations directly from fluorescence microscopy images without requiring segmentation or manual annotation. Combined with a graph-based partial optimal transport framework, SI-SimCLR enables reconstruction of cellular phenotypic landscapes from static imaging data, revealing how phenotypic substates are organised and connected. To establish and validate this framework, we generated a multimodal dataset of human iPSC-derived astrocytes using high-content imaging and matched bulk transcriptomics. SI-SimCLR resolved distinct interconnected astrocyte substates associated with disease and inflammatory states. ALS astrocytes occupied constrained regions of the morphological landscape. Strikingly, morphology and transcriptomics captured distinct and complementary aspects of astrocyte state variation. Together, our framework establishes a scalable and annotation-free strategy for reconstructing cellular phenotypic landscapes from microscopy data, enabling analysis of cellular heterogeneity, landscape connectivity and phenotypic responses across biological systems.
bioinformatics 2026-06-03 v1
CellClick: an interactive platform for adjustable and accurate cell type annotation in single-cell and spatial omics data
Shi, L.; Dai, M.; Zhang, Y.-b.; Wu, S.; Wang, M.; Wang, X.-j.
Abstract
Single-cell omics and spatial omics technologies are nowadays widely used in biological and medical research. In both single-cell and spatial omics data analysis, accurate cell type annotation is a key step for downstream analysis and scientific discoveries. However, high-quality cell annotation usually requires multiple rounds of manual analysis for result refinement, which poses great challenges to most researchers. Here, we present CellClick, an interactive platform for convenient and accurate cell type annotation in single-cell and spatial omics data. CellClick provides Data Preprocessing, Data Visualization, Cell Annotation, Annotation Validation, and Cell Reannotation modules, which facilitate automatic or user-guided cell selection and annotation. The feasibility of using CellClick to generate more accurate cell annotation results was exemplified by both scRNA-seq and spatial transcriptomics data.
bioinformatics 2026-06-03 v1
ORIGAMI: Orientation-Aware Graph Neural Network for Assessing Multimeric Interfaces of Protein Complex Structures
Wang, X.; Bhattacharya, D.
Abstract
Deep learning-based protein structure prediction methods have led to a paradigm-shift in computational structural biology, yet reliably assessing the quality of computationally predicted multimeric structures remains challenging. Recent methods have demonstrated benefits of employing graph neural networks for assessing multimeric interfaces of protein complexes, but ignore geometric orientational features naturally occurring in 3-dimensional protein conformational space and act only on scalar weights. We present ORIGAMI, an orientation-aware graph neural network for assessing multimeric interfaces of protein complex structures that leverages both scalar and 3D vector node representations to perform symmetry-aware geometric operations while maintaining SO(3)-equivariance by capturing fine-grained orientational relationships between residues across protein-protein interfaces to estimate the interface Local Distance Difference Test (iLDDT) score. Tested on targets from multiple rounds of Critical Assessment of Structure Prediction (CASP) challenges, ORIGAMI achieves superior performance across multiple interface quality assessment benchmarks, with particularly strong gains in the expanded CASP16 interface-level evaluation and in controlled comparisons against both non-equivariant and equivariant graph neural network baselines. It also demonstrates robust cross-metric generalization by reproducing superposition-based DockQ scores with high fidelity, despite being trained only to estimate the superposition-free iLDDT score. ORIGAMI is freely available at https://github.com/Bhattacharya-Lab/ORIGAMI.
bioinformatics 2026-06-03 v1
CodeCytos: AI-assisted spatial molecular imaging analysis via code-augmented agent action space
Vo, H. Q.; Vo, H. Q.; Ly, S. T.; Wan, Z.; Nguyen, A.-V.; Zhao, H.; Sheng, J.; Wong, S. T. C.; Nguyen, H. V.
Abstract
Conventional tissue image analysis software provides foundational capabilities for cellular analysis, including segmentation, morphological feature extraction, and spatial organization analysis; however, these tools often require manual intervention and lack seamless integration with code-driven automation, limiting efficiency and scalability for complex spatial tissue studies, while also offering limited flexibility for custom analyses by supporting only a fixed set of predefined spatial cellular features. To address these limitations, we propose CodeCytos, a coding-based reasoning agent framework that enables dynamic, programmable interaction with spatial molecular imaging data and streamlines the exploration of custom spatial cellular features across diverse research needs. We demonstrate its utility through case studies on four expert-curated datasets spanning distinct tissue types, including frontal cortex, non-small-cell lung cancer, pancreas, and tonsil, and evaluate it under a realistic minimal prompt setting in which bioscientists pose simple questions without task-specific instructions or prior contextual knowledge, benchmarking multiple large language model backbones with strong coding capabilities. We further show that incorporating domain-agnostic few-shot in-context coding-reasoning examples, randomly sampled from outside the spatial analysis domain, substantially improves performance without requiring costly expert-crafted in-domain demonstrations; overall, CodeCytos outperforms baseline approaches, highlighting the potential of code-driven reasoning agents to support custom feature exploration in spatial molecular imaging and accelerate biomarker discovery.
bioinformatics 2026-06-03 v1
sstar2: A Python Package for S*-based Archaic Introgression Detection with Machine Learning
Koca, A.; Stöckl, A.; Chen, S.; Kuhlwilm, M.; Huang, X.
Abstract
Detecting introgressed genomic fragments from unsampled or extinct source populations remains challenging. The S* statistic is widely used for this purpose, but the original sstar implementation relies on generalized additive models to smooth quantile-specific values precomputed from fixed count bins, requiring simulations with fixed numbers of segregating sites. Here, we present sstar2, a Python update that replaces this procedure with quantile regression to directly estimate S* thresholds at specified null quantiles from simulated genomic windows. We benchmarked sstar2 against the original sstar, linear quantile regression, and random forest quantile regression across three demographic models with both phased and unphased simulated data. sstar2 showed the best overall performance among the evaluated methods, with the most pronounced improvement under a challenging demographic model of ghost introgression in bonobos. These results show that sstar2 improves S* threshold calibration while making S*-based introgression analyses more flexible and compatible with modern simulation workflows.
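Quantile regression, as mentioned in this abstract, estimates a chosen quantile by minimizing the pinball (quantile) loss. The Python sketch below shows only that core idea in its simplest form: a single constant threshold fitted to null scores, with no covariates. The scores, function names, and quantile level are hypothetical, and this is not the sstar2 implementation, which the abstract does not describe in detail.

```python
def pinball_loss(y_true, y_pred, tau):
    """Mean pinball (quantile) loss at quantile level tau in (0, 1)."""
    total = 0.0
    for y, q in zip(y_true, y_pred):
        diff = y - q
        total += tau * diff if diff >= 0 else (tau - 1) * diff
    return total / len(y_true)


def best_constant_threshold(null_scores, tau):
    """Pick the observed score that minimizes pinball loss: an empirical tau-quantile."""
    return min(
        sorted(null_scores),
        key=lambda c: pinball_loss(null_scores, [c] * len(null_scores), tau),
    )


# Hypothetical S*-like scores from null (no-introgression) simulated windows.
null_scores = [10, 12, 15, 18, 20, 22, 25, 30, 41, 60]
threshold = best_constant_threshold(null_scores, tau=0.85)
print("windows scoring above", threshold, "would be flagged as candidates")
```

In a full quantile regression the threshold would be a function of window-level covariates rather than one constant, but the loss being minimized is the same.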
bioinformatics 2026-06-03 v1
CpG Atlas: A centralized multi-layer database and AI interface for DNA methylation research
Armstrong, J. F.; Wahi, S.; Borrus, D.; Sehgal, R.; Rizvi, S.; Zhang, S.; Jacques, M.; Eynon, N.; van Dijk, D.; Higgins-Chen, A.
Abstract
DNA methylation research has vastly expanded over the past decade, producing a wealth of epigenome-wide association studies, biomarker algorithms such as epigenetic clocks, technical performance analyses, and functional annotations for CpG sites. However, these resources remain fragmented across dozens of databases and supplementary files within manuscripts, forcing researchers to spend time and effort on data cleaning and integration prior to meaningful analyses. No single resource currently unifies this information into a centralized, easy-to-query framework. Here, we present CpG Atlas, a curated relational database that integrates 18 distinct annotation layers encompassing over 1.2 million CpG sites across all four generations of Illumina methylation arrays (HM450K, EPIC v1, EPIC v2, and MSA). Built on a snowflake schema with a canonical probe identifier hub implemented in SQL, CpG Atlas consolidates over 800,000 CpG-trait associations, results from Mendelian randomization analyses, CpG membership across 81 epigenetic clocks, array manifest information, and probe reliability data. It further includes specialized layers such as solo-WCGW, CoRSIVs, PRC2 binding, transposon and retroelement annotations, tissue-specific differentially methylated positions across 17 tissues, and hallmarks of aging and cancer. To maximize utility and ease of use, the database is paired with an interactive web tool and a natural language-to-SQL query interface, enabling users to quickly perform complex multi-dimensional queries. Detailed documentation about every data source and table is also provided, facilitating the identification and interpretation of relevant studies. We demonstrate the utility of CpG Atlas through two case studies: a systematic enrichment analysis revealing distinct functional signatures across 16 epigenetic clocks, and an iterative biomarker discovery workflow for IBD that leverages cross-layer integration. 
Because it is readily scalable simply by adding or updating tables in the database, CpG Atlas provides a continuously evolving and extensible infrastructure for the epigenetics community that supports collaborative research, interpretable biomarker development, and integrative analyses across the growing landscape of epigenetic data.
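The idea of annotation layers joined through a canonical probe identifier hub can be illustrated with a small in-memory SQLite database. The Python sketch below is a toy layout in that spirit. Every table name, column name, and row is hypothetical and does not reflect the actual CpG Atlas schema or data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hub: one row per canonical probe identifier (hypothetical schema).
    CREATE TABLE probe (probe_id TEXT PRIMARY KEY, chrom TEXT, pos INTEGER);
    -- Annotation layers reference the hub rather than each other.
    CREATE TABLE trait_association (
        probe_id TEXT REFERENCES probe(probe_id), trait TEXT, p_value REAL);
    CREATE TABLE clock_membership (
        probe_id TEXT REFERENCES probe(probe_id), clock TEXT);
    """
)
conn.executemany("INSERT INTO probe VALUES (?, ?, ?)",
                 [("cg_demo_1", "chr1", 1000), ("cg_demo_2", "chr2", 2000)])
conn.executemany("INSERT INTO trait_association VALUES (?, ?, ?)",
                 [("cg_demo_1", "demo_trait", 1e-9), ("cg_demo_2", "demo_trait", 0.2)])
conn.executemany("INSERT INTO clock_membership VALUES (?, ?)",
                 [("cg_demo_1", "demo_clock")])

# Cross-layer query: probes in a clock that also carry a significant trait association.
rows = conn.execute(
    """
    SELECT p.probe_id, c.clock, t.trait
    FROM probe AS p
    JOIN clock_membership AS c ON c.probe_id = p.probe_id
    JOIN trait_association AS t ON t.probe_id = p.probe_id
    WHERE t.p_value < 1e-6
    """
).fetchall()
print(rows)
```

Because each layer joins only to the hub, adding a new annotation layer means adding one table keyed on the probe identifier, which matches the extensibility argument made in the abstract.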
bioinformatics 2026-06-03 v1
Mapping the structural coverage of Arabidopsis thaliana plant developmental proteins: Insights from Experimental and AlphaFold Approaches
Rode, S. S.; Sudarsanam, K.; Bhalla, H.; Srivastava, A.; Sankaranarayanan, S.
Abstract
Background: Plant development is a multifaceted process governed by intricate protein regulatory networks. High-throughput sequencing methods have vastly expanded plant transcriptomic and proteomic datasets, yet there is a large discrepancy between structural information for plant developmental proteins and the UniProt sequence entries. Advances in X-ray crystallography, NMR spectroscopy, and Cryo-EM have enabled the determination of protein complex structures and their dynamics. AI-driven tools like AlphaFold have revolutionized analysis of protein structural intricacies. However, available three-dimensional structural models predominantly prioritize the human proteome and other mammals over plants. Assessing structural coverage of plant developmental proteins is thus essential to identify research gaps, guide structure-function studies, and advance agriculture.
Results: Here, we focus on mapping the structural coverage of developmental proteins in Arabidopsis thaliana. We observed a substantial disparity in the Protein Data Bank (PDB) representation of Arabidopsis thaliana proteins compared to those of Homo sapiens. Our analysis identified 16,389 reviewed UniProt entries, of which only 1,038 have experimentally determined structures. Functional mapping using PlantGSEA revealed 3,485 proteins associated with plant developmental processes, of which only 337 (9.67%) have experimentally determined structures. In contrast, analysis of the AlphaFold database showed that 69.85% of the 39,278 Arabidopsis thaliana UniProt protein entries have predicted structures. Notably, all 3,485 plant developmental proteins (100%) from Arabidopsis thaliana are covered by AlphaFold models. The substantially higher structural coverage provided by AlphaFold for Arabidopsis thaliana, relative to Homo sapiens, highlights the strength of computational approaches in addressing the challenges of structural studies of difficult-to-crystallize proteins. Furthermore, 79.15% of reviewed A. thaliana protein models exhibit high confidence (pLDDT > 70), indicating reliable structural predictions. Although the experimental structural coverage of Arabidopsis thaliana developmental proteins remains limited, AlphaFold has markedly expanded the accessible structural landscape.
Conclusion: This study investigated the structural coverage of Arabidopsis thaliana plant developmental proteins, underscoring the critical need for structural studies using both experimental and AlphaFold approaches. It provides research directions for bridging the knowledge gap in understanding molecular mechanisms of plant development.
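The experimental coverage figures in this abstract can be re-derived from the reported counts. The short Python sketch below does that arithmetic. The helper name `percent` is illustrative, and the share for reviewed entries is not stated in the abstract; it is simply computed here from the two counts given.

```python
def percent(part: int, whole: int) -> float:
    """Share of `part` in `whole`, as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


# Counts as reported in the abstract.
reviewed_entries, reviewed_with_structure = 16_389, 1_038
developmental, developmental_with_structure = 3_485, 337

print(percent(reviewed_with_structure, reviewed_entries))
print(percent(developmental_with_structure, developmental))
```

The second value reproduces the 9.67% reported for developmental proteins, and the first shows that experimental coverage of reviewed entries overall is lower still, at roughly 6.3%.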
bioinformatics 2026-06-03 v1
Convergent Evolution in Tumor Genomes Targets Functional Domains
Chen, H.; Liu, L.
Abstract
Tumor evolution is shaped by selective pressures that repeatedly favor similar functional outcomes across genetically distinct cancers. While convergent evolution in cancer has been studied at the gene level, this work investigates selection on smaller functional units, namely protein domains. Using >9,500 primary tumor exomes from The Cancer Genome Atlas, we quantified selection strengths acting on missense and truncating mutations aggregated by protein domain. This analysis identified 818 domains under significant positive selection across tumor types. Notably, approximately half of these domains belonged to genes that would be difficult to implicate using conventional gene-centric approaches due to low mutational recurrence or mutations outside functionally critical regions. We classified positively selected domains by evolutionary antiquity. The most ancient domains trace back to pre-eukaryotes and are involved in core cellular processes (e.g., DNA mismatch repair and metabolism) and tend to accumulate the highest numbers of mutations. The majority of positively selected domains originated in early eukaryotes and are enriched for regulatory control and cellular organization, whereas metazoan-specific domains are primarily associated with signaling and cell-cell communication. These results suggest that cancer preferentially exploits deeply conserved biology, with regulatory complexity driving tumor adaptation, while recent evolutionary innovations are relatively fragile and dispensable. Collectively, these findings establish a domain-centered framework for understanding disease mechanisms and developing therapeutic strategies. By focusing on shared functional domains, this framework enables the identification of functionally convergent therapeutic targets and provides a new perspective for interpreting drug resistance, tumor recurrence, and relapse.
bioinformatics 2026-06-03 v1
Applying Spatial Statistics to Spatial Transcriptomics Reveals Local Association Between M2-like Macrophages and Fibrosis in Diabetic Kidney Disease
Terakawa, K.; Kawaguchi, H.; Nangaku, M.; Mimura, I.
Abstract
Renal fibrosis is the common final pathway of chronic kidney disease (CKD), driven in part by myofibroblast-mediated extracellular matrix deposition. M2 macrophages have been implicated as a source of myofibroblasts through macrophage-to-myofibroblast transition (MMT), yet whether M2 macrophages are pro- or anti-fibrotic remains controversial, and the spatial context in which MAC-M2-fibrosis coupling occurs is unknown. Here, we applied geographically weighted regression (GWR), a spatial statistical method, to Visium spatial transcriptomics data from diabetic kidney disease (DKD) to characterize spatially resolved high-coupling spots where MAC-M2-fibrosis coupling is significantly positive. In a small DKD cohort (n=6), GWR identified high-coupling spots enriched for B cell/tertiary lymphoid structure (TLS)-like immune signatures, supporting the biological relevance of the analytical framework. To gain statistical power for differentially expressed gene (DEG) analysis, we then applied the same pipeline to the larger Kidney Precision Medicine Project (KPMP) DKD cohort (n=30), in which high-coupling spots showed upregulation of IgE-related immune genes (IGHE, FCER1A) together with the mast cell tryptase TPSB2. These findings suggest that IgE-related immune responses may be present within DKD fibrotic microenvironments characterized by local MAC-M2-fibrosis coupling. As a disease comparison, we further applied the pipeline to a KPMP hypertensive kidney disease (HKD) cohort (n=27), where high-coupling spot signatures were distinct from DKD and did not show enrichment of IgE-related genes. Together, this study provides the first application of GWR to kidney spatial transcriptomics and suggests that IgE-related immune responses may be a feature of DKD fibrotic microenvironments in which M2 macrophages are locally associated with fibrosis.
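Geographically weighted regression fits a separate regression at each location, weighting neighbouring observations by distance so that the coefficient can vary across space. The Python sketch below shows that idea with one predictor and a Gaussian kernel. The coordinates, scores, bandwidth, and function name are hypothetical, and the significance testing used in the study to call high-coupling spots is not included.

```python
import math


def gwr_local_slopes(coords, x, y, bandwidth):
    """Local slope of y on x at each spot, using Gaussian distance weights."""
    slopes = []
    for cx, cy in coords:
        # Gaussian kernel: nearby spots count more than distant ones.
        w = [math.exp(-((px - cx) ** 2 + (py - cy) ** 2) / (2 * bandwidth ** 2))
             for px, py in coords]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slopes.append(sxy / sxx)
    return slopes


# Hypothetical spots: a left cluster with positive coupling, a right cluster with negative.
coords = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 0), (10, 1), (11, 0), (11, 1)]
m2_score = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0]
fibrosis = [2.0, 4.0, 6.0, 8.0, 5.0, 4.0, 3.0, 2.0]
for spot, slope in zip(coords, gwr_local_slopes(coords, m2_score, fibrosis, bandwidth=1.0)):
    print(spot, round(slope, 2))
```

A single global regression over these toy spots would average the two opposite relationships away, which is why a locally weighted fit is useful when coupling is expected to differ between tissue regions.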
bioinformatics 2026-06-03 v1
Information Geometry of Intracellular Compartment Coupling Reveals Transcriptomic State Transitions in Single Cells
Sung, J.-Y.; Cheong, J.-H.
Abstract
Single-cell transcriptomic analyses typically characterize cellular states using gene-expression variability, dimensionality reduction, and trajectory inference. However, existing approaches provide limited insight into how transcriptomic information is organized across interacting intracellular compartments. Here we introduce Compartment Coupling Entropy (CCE), an information-geometric framework that quantifies the organization of transcriptomic coupling between spliced and unspliced RNA compartments. CCE constructs a cross-compartment coupling operator from compartment-resolved transcriptomic profiles and characterizes its singular-value spectrum using coupling entropy, effective coupling dimension, and coupling susceptibility. These metrics measure how transcriptomic information is distributed across coupling modes and provide a quantitative description of transcriptomic organization beyond conventional expression-based statistics. Applying CCE to pancreatic endocrine differentiation revealed substantial remodeling of coupling architecture along developmental trajectories. Coupling entropy and effective coupling dimension underwent transient collapse and re-expansion during lineage progression, while coupling susceptibility identified discrete intervals of rapid transcriptomic reorganization corresponding to candidate cell-state transition regimes. Across cell states, coupling entropy showed weak correspondence with classical mutual information, indicating that spectral coupling organization captures information not represented by conventional information-theoretic measures. An organization ratio and spectral excess information further quantified the divergence between classical and coupling-based descriptions of transcriptomic structure. Robustness analyses demonstrated stability of the framework under bootstrap resampling, gene subsampling, spectral truncation, and trajectory discretization. 
Application to an independent dentate gyrus developmental dataset revealed similar hierarchical coupling spectra and susceptibility-defined transition regimes, suggesting that transient reorganization of compartment-coupling architecture may represent a general feature of cellular state transitions. CCE provides a general methodology for quantifying the information geometry of intracellular transcriptomic organization and complements existing single-cell analytical approaches by revealing coupling architectures that are inaccessible to conventional expression-based analyses.
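The abstract characterizes a coupling operator through the entropy of its singular-value spectrum and an effective dimension, but does not give formulas. The Python sketch below shows one common way such quantities are defined: Shannon entropy of the normalized singular values, and its exponential as an effective number of modes. Normalizing by the sum of singular values (rather than, say, their squares), the function names, and the example spectra are all assumptions, not the authors' definitions.

```python
import math


def coupling_entropy(singular_values):
    """Shannon entropy of the normalized singular-value spectrum (natural log)."""
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    return -sum(p * math.log(p) for p in probs)


def effective_dimension(singular_values):
    """exp(entropy): the number of equally weighted modes with the same entropy."""
    return math.exp(coupling_entropy(singular_values))


# Hypothetical spectra of a cross-compartment coupling operator.
spread = [1.0, 1.0, 1.0, 1.0]      # coupling shared evenly across four modes
collapsed = [3.7, 0.1, 0.1, 0.1]   # one dominant mode
print(effective_dimension(spread), effective_dimension(collapsed))
```

Under this definition, a transient "collapse" of entropy and effective dimension corresponds to coupling concentrating in a few dominant modes, and "re-expansion" to it spreading out again.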
bioinformatics 2026-06-03 v1
ViTAMIn-O: Democratizing computer vision-based machine learning for stem cell research
Hamurcu, F.; Breunig, M.; Varga, A.; Bosch, B.; Lindenmayer, J.; Kanakapaddy, A. T.; Achberger, K.; Pashkovskaia, N.; Kleger, A.; Liebau, S.; Klingenstein, S.; Klingenstein, M.
Abstract
Deep Learning (DL) holds exciting potential in automating the prediction of organoid differentiation results. Nevertheless, current models lack adaptability, openness, and robustness in performance. Additionally, broad deployment of predictive models in wet-lab settings requires machine learning expertise, which is often not readily available in biologically oriented laboratories. To offer an intuitive solution, we present ColabViTAMIn-O, a code-free platform, together with ViTAMIn-O. ViTAMIn-O is a fully open organoid-specific DL model trained and tested on a total of 34 organoid categories, incorporating annotated images across transmitted light microscopy (TLM) modalities at single-organoid resolution. It is adaptable to downstream prediction tasks of varying dataset sizes and outperforms established models even with linear probing. It performs reliably within a few-shot framework and is even extensible to human embryo TLM imaging data at the single-specimen level. By releasing our platform, centralized model hub, and datasets, we hope to encourage broader deployment of specialized DL models in stem-cell laboratories.
bioinformatics2026-06-03v1ROTS 2.0: A reproducibility-driven framework for robust statistical modeling across diverse high-throughput omics study designs
Suomi, T.; Kettunen, J.; Pusa, T.; Elo, L. L.Abstract
Reproducibility is fundamental to reliable scientific discoveries. The reproducibility-optimized test statistic (ROTS) is a robust framework designed to identify reproducible features (e.g. genes or proteins) in high-dimensional differential expression analyses such as transcriptomics and proteomics. This is achieved by optimizing the reproducibility of feature rankings under resampling. While originally implemented for univariate settings, ROTS now accommodates multi-group comparisons, survival analysis, linear models, and linear mixed-effects models, broadening its applicability to more complex and clinically relevant experimental designs. Using diverse simulations, benchmark datasets, and real-world case studies, we demonstrate the benefits of ROTS reproducibility optimization compared to the corresponding conventional test statistics. Additionally, we illustrate the utility of the reproducibility characteristics in assessing the overall reliability of the results. To facilitate widespread adoption, ROTS is provided as an open-source software package available through R/Bioconductor. Furthermore, to broaden the user base, we now also provide a Python interface available at pypi.org/project/PyROTS/.
bioinformatics2026-06-03v1Loss of tissue specificity and recurrent pan-cancer activation define a conserved oncogenic microRNA class
Poptsova, M.; Ismailov, A.; Belogurov, A.; Evpak, A.Abstract
MicroRNAs (miRNAs) act as crucial post-transcriptional regulators of large gene networks, and their aberrant expression drives key oncogenic processes such as epithelial-mesenchymal transition (EMT), angiogenesis, immune evasion, and metastasis. Oncogenic miRNAs that lose tissue specificity during malignant transformation represent promising therapeutic targets, as their restricted expression in healthy organs could minimize off-target effects. To identify these candidates, this study performed a comprehensive pan-cancer analysis integrating tissue-specificity profiles of healthy tissues from the GTEx project with tumor data from the TCGA, TARGET, CGCI, and CPTAC cohorts. By combining profiling with differential expression analysis between tumor and matched normal samples, cross-cohort integration revealed that malignant transformation is characterized by a widespread loss of tissue-specific miRNA expression. Among these altered patterns, a cluster of nine oncomiRs was identified: miR-105-5p, miR-1269a, miR-196a-5p, miR-9-5p, miR-96-5p, miR-210-3p, miR-301b-3p, miR-592, and miR-135b-5p. These specific miRNAs were significantly and recurrently upregulated across various solid tumors. Functional enrichment analysis of their experimentally validated targets demonstrated a clear convergence on shared oncogenic pathways, particularly those governing hypoxia response, PI3K/AKT signaling, EMT, angiogenesis, and immune modulation.
bioinformatics2026-06-03v1Reachability-Preserving Minimum Edge Cut Problem and Applications in Biology
Xie, J.; Duan, Q.Abstract
Biological pathway analysis often requires identifying interventions that block reachability to an undesirable state, such as a disease-associated module, toxic byproduct, or adverse phenotype, while preserving reachability among essential biological functions. Motivated by this setting, we study the Reachability Preserving Minimum Edge Cut (RPMEC) problem: given protected terminals \(s_1\) and \(s_2\) and a target terminal \(t\), the goal is to remove a minimum-cost set of edges that separates \(s_1\) and \(s_2\) from \(t\) while keeping \(s_1\) and \(s_2\) connected. This formulation naturally models pathway-level intervention design, where one seeks to disrupt harmful signaling, metabolic, or interaction routes without breaking required functional connectivity. We revisit the three-terminal undirected edge-cut case and analyze a Dijkstra-style dynamic programming algorithm that is exact on planar graphs but fails on general graphs. We characterize the structural condition required for exactness, namely frontier-realizability of optimal source-side regions, and identify biological graph representations where this condition is likely to hold after appropriate preprocessing, including curated planar pathway maps, Reactome-style hierarchy trees, SCC-contracted feedback modules, metabolic building-block DAGs with dominator structure, and functional-module quotients of protein interaction networks. We further discuss directed variants, approximation strategies, and exact alternatives based on ASP, MILP, bounded-treewidth dynamic programming, and important separators. The results provide a graph-theoretic foundation for deciding when fast greedy computation is reliable for biological pathway intervention problems and when more expressive exact optimization methods are needed.
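To make the RPMEC definition concrete, here is a minimal brute-force sketch for tiny undirected graphs. It is not the Dijkstra-style dynamic programming algorithm analyzed in the paper; it simply enumerates edge subsets and keeps the cheapest one that separates \(s_1\) and \(s_2\) from \(t\) while leaving \(s_1\) and \(s_2\) connected. The function names, the edge-list format, and the toy graph are illustrative assumptions.

```python
from itertools import combinations


def reachable(n_nodes, edges, start):
    """Return the set of nodes reachable from `start` in an undirected graph."""
    adj = {v: [] for v in range(n_nodes)}
    for u, v, _cost in edges:
        adj[u].append(v)
        adj[v].append(u)
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def rpmec_brute_force(n_nodes, edges, s1, s2, t):
    """Exhaustive RPMEC solver (exponential time, toy instances only).

    `edges` is a list of (u, v, cost) tuples. Returns (best_cost, removed_edges),
    or (None, None) if no feasible cut exists.
    """
    best_cost, best_cut = None, None
    for k in range(len(edges) + 1):
        for removed in combinations(range(len(edges)), k):
            removed_set = set(removed)
            kept = [e for i, e in enumerate(edges) if i not in removed_set]
            component = reachable(n_nodes, kept, s1)
            # Feasible: s2 stays reachable from s1, and t is cut off from both.
            if s2 in component and t not in component:
                cost = sum(edges[i][2] for i in removed)
                if best_cost is None or cost < best_cost:
                    best_cost, best_cut = cost, [edges[i] for i in removed]
    return best_cost, best_cut


# Hypothetical toy graph: node 0 = s1, node 1 = s2, node 2 = t, node 3 = relay.
EDGES = [(0, 3, 1), (3, 1, 1), (3, 2, 5), (0, 2, 2), (1, 2, 2)]
cost, cut = rpmec_brute_force(4, EDGES, s1=0, s2=1, t=2)
print(cost, cut)
```

In this toy graph an ordinary minimum cut between {s1, s2} and t costs 6 (remove every edge touching nodes 0 and 1), but that cut also disconnects s1 from s2. The reachability constraint forces the more expensive cut of cost 9, which keeps the path 0-3-1 intact.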
bioinformatics2026-06-03v1AdventML: Advanced Enzyme Temperature Prediction with Transformer-Based Embeddings and Resampling Strategies
Francois, J.; De Moor, B.; van Noort, V.Abstract
Accurate prediction of enzymes' optimal catalytic temperature (Topt) is crucial in biotechnology, as enzymes with extreme Topt values are highly desirable for reactions at extreme temperatures and for their general stability. However, experimental determination of Topt is costly, labor-intensive, and time-consuming. Meanwhile, existing computational methods suffer from small and imbalanced datasets, suboptimal predictions at extreme temperatures, and insufficient validation. In this study, we address these challenges by expanding the Topt dataset and validating on an independent test set based on sequence similarity. We further tackle these limitations by comparing multiple resampling techniques to improve predictions at extremes and by considering diverse protein representations and multiple machine learning architectures. Overall, the best performing models reached R² ≈ 0.64 with MAE ≈ 7-8 °C, while extreme resampling improved tail performance, reducing tail MAE by up to approximately 1.8 °C. Notably, our models show improved performance over state-of-the-art prediction models. We also demonstrate that accurate prediction of Topt is achievable even in the absence of organism growth temperature (OGT). Our Topt prediction models are made freely available as AdventML on GitHub.
bioinformatics2026-06-03v1HyperNiche: Learning Heterophilic Cellular Niches with Hypergraph Neural Networks
Mahmud, M. I.; Banerjee, T.Abstract
We propose HyperNiche, a hypergraph-based framework for modeling higher-order, heterogeneous cellular niches from spatial transcriptomics data. Unlike conventional graph-based methods that rely on pairwise similarity and tend to produce homogeneous clusters, HyperNiche learns anchor-centered hyperedges through a compatibility-driven mechanism that captures both homophilic and heterophilic relationships among cells. By decoupling node roles into anchor and member representations and integrating spatial geometry into hyperedge construction, the model enables the discovery of multicellular niches that span diverse cell types. We evaluate HyperNiche on high-plex Xenium spatial transcriptomics datasets from breast and lung cancer tissue microarrays, demonstrating improvements over state-of-the-art graph-based baselines in clustering performance (ARI, NMI) and biological interpretability. Further analysis shows that HyperNiche produces hyperedges with significantly higher intra-edge feature diversity, indicating an enhanced ability to capture heterogeneous cellular niches compared to similarity-based models. These results highlight the importance of higher-order relational modeling for understanding complex spatial tissue organization and tumor microenvironments.
bioinformatics2026-06-03v1SciCore-Omics: a tri-modal foundation model unifying histology, spatial transcriptomics and language for spatial biology
Xiao, X.; Li, Y.; Zeng, Z.; Yan, Y.; Liu, Z.; Liu, Z.; Xiang, Y.; Ye, Z.; Ying, J.; Li, Y.; Xie, L.; He, F.Abstract
Histomorphology and spatial transcriptomics capture complementary aspects of tissue biology, but their relationships remain difficult to extract, align, and interpret at scale. Existing foundation models typically connect histology, omics, or language only pairwise, which limits their capacity to jointly infer molecular states, decode spatial tissue organization, and generate biologically grounded explanations. Here, we present SciCore-Omics, the first tri-modal foundation model linking histology images, spatial transcriptomics, and biological language. We constructed a spatially paired image-gene-text dataset comprising 151,182 spots across multiple tissues and performed a three-stage progressive training of SciCore-Omics on this dataset. Across gene expression prediction and spatial domain recognition, SciCore-Omics achieved 23.6-80.9% relative gains in task-specific metrics over the strongest external baselines. It further showed robust zero-shot generalization in histopathology classification, outperforming GPT-5 by 6.16 percentage points in mean accuracy across four benchmarks. Expert evaluation in 10 breast cancer cases confirmed its H&E-only case-level molecular reasoning capability. Together, our method demonstrates that a tri-modal framework can effectively bridge histomorphology and molecular state, providing a more general and interpretable foundation model for computational pathology and omics analysis.
bioinformatics2026-06-03v1Deciphering context-dependent epigenetic program by network-based prediction of clustered open regulatory elements from single-cell chromatin accessibility
Park, S.; Ma, S.; Lee, W.; Park, S. H.Abstract
Large cis-regulatory domains, spanning tens to hundreds of kilobases, are pivotal in orchestrating cell-state-specific transcriptional programs that define cellular identity. However, existing single-cell analytical frameworks lack the capacity to identify these higher-order structures, thereby obscuring the coordinated, domain-level epigenetic regulation essential for complex biological processes. To address this, we introduce enCORE, a computational framework that leverages enhancer-enhancer interaction networks to determine Clustered Open Regulatory Elements (COREs) solely from single-cell ATAC-sequencing data. Our approach faithfully recapitulates established hematopoietic hierarchies and resolves lineage-specific regulatory programs by recovering canonical master transcription factors, frequent chromatin interactions, and enrichment of fine-mapped immune-related disease-associated genome-wide association study (GWAS) variants. In colorectal cancer, enCORE captures tumor-associated H3K27ac landscapes and prioritizes USP7 as a potential therapeutic candidate, supported by in silico perturbation. Collectively, our framework provides a powerful and scalable platform for deciphering the complex epigenetic architectures underlying human development and disease.
bioinformatics2026-06-02v11Fold or flop: quality assessment of AlphaFold predictions on whole proteomes
Sarti, E.; Cazals, F.Abstract
MOTIVATION: Reliability of AlphaFold2 predictions is mainly assessed using the predicted Local Distance Difference Test (pLDDT). For model organisms, 30-40% of residues fall into the low-confidence pLDDT range. Moreover, pLDDT sometimes fails to flag physically implausible structures. This raises two questions: can more robust reliability indicators be identified, and do unreliable predictions share common structural or biophysical features? RESULTS: We characterize protein structures through histograms of per-residue neighbor counts, and use Wasserstein principal component analysis to define the arity map, a lightweight and informative 2D embedding of the proteins in a dataset. Using AlphaFold-DB, we show that the arity map reveals three structurally and biophysically distinct populations (well-folded proteins, intrinsically disordered proteins, and physically implausible predictions). We also use our packing-based encoding at the residue level to define Abstraqt (Arity-Based STRuctural Arrangement Quality assessmenT), a per-residue scoring function complementing the pLDDT, assigning low scores to hallucinated helices and distorted beta strands while correctly scoring native-like predictions. AVAILABILITY: The code to compute arity maps is available within Structural Bioinformatics Library, see https://sbl.inria.fr/doc/Alphafold_analysis-user-manual.html and https://sbl.inria.fr/data/AlphaFold-assessment.
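The first two ingredients named in the abstract, per-residue neighbor-count histograms and a Wasserstein comparison between them, can be sketched as follows. This is not the Structural Bioinformatics Library implementation and it stops short of the Wasserstein PCA that produces the arity map. The 10 Å cutoff, the use of one point per residue, the histogram range, and all function names are assumptions made for illustration.

```python
import numpy as np


def neighbor_counts(coords, radius=10.0):
    """Count, for each residue, how many other residues lie within `radius`."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return (dist <= radius).sum(axis=1) - 1  # subtract 1 to exclude the residue itself


def arity_histogram(counts, max_count=30):
    """Normalized histogram of neighbor counts on the support 0..max_count."""
    clipped = np.clip(counts, 0, max_count)
    hist = np.bincount(clipped, minlength=max_count + 1).astype(float)
    return hist / hist.sum()


def wasserstein_1d(p, q):
    """1-Wasserstein distance between two histograms on the same unit-width bins."""
    return float(np.abs(np.cumsum(p) - np.cumsum(q)).sum())


# Two synthetic 27-residue "structures": a straight chain and a compact 3x3x3 lattice.
spacing = 3.8  # assumed spacing between consecutive points, in angstroms
extended = np.array([[i * spacing, 0.0, 0.0] for i in range(27)])
grid = np.arange(3) * spacing
compact = np.array([[x, y, z] for x in grid for y in grid for z in grid])

h_ext = arity_histogram(neighbor_counts(extended))
h_cmp = arity_histogram(neighbor_counts(compact))
print(wasserstein_1d(h_ext, h_cmp))
```

The extended chain concentrates its histogram at low neighbor counts while the compact lattice sits at high counts, so the distance between them is large. That contrast between loosely and densely packed residues is the kind of signal a packing-based embedding can separate.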
bioinformatics2026-06-02v3UMITIC: An unsupervised framework for the joint characterization of cellular phenotypes and spatial neighborhoods in multiplex and hyperplex immunofluorescence imaging data
Sangüesa Recalde, M.; De Andrea, C. E.; Ariz, M.Abstract
Multiplexed imaging technologies enable the simultaneous measurement of dozens of protein markers while preserving context, providing a high-resolution view of tissue organization. However, extracting meaningful insights from these high-dimensional datasets, particularly in hyperplex settings (>20 markers), remains a major computational challenge, especially in the absence of annotated data. Here, we present UMITIC (Unsupervised Analysis of Multiplex Images via TIssue Characterization), a modular and unsupervised computational framework for the joint characterization of cell phenotypes and tissue neighborhoods from multiplex imaging data. UMITIC integrates three components: (i) CellCut, a strategy that combines nuclear and cytoplasmic predictions to improve the delineation capabilities of the framework; (ii) CellMap, a contrastive learning approach that generates low-dimensional representations of single-cell image crops that are enriched with morphological features; and (iii) TissueNet, a graph neural network that models spatial cell-cell interactions to identify tissue neighborhoods. We evaluated UMITIC across four datasets of increasing complexity to assess its robustness, scalability and biological relevance. On a 7-plex human tonsil dataset, the framework identified canonical immune cell populations and reconstructed well-established anatomical regions. When applied to a 43-plex tonsil image, UMITIC preserved these tissue-level structures while enabling finer cell subtype stratification driven by increased marker dimensionality. We further validated our method on a 58-plex colorectal cancer cohort, where UMITIC recovered previously reported differences in immune composition and spatial organization between patient groups with different prognoses. Finally, on an expert-annotated imaging mass cytometry dataset of human lung tissue, UMITIC achieved higher agreement with the reference tissue annotations than existing approaches, demonstrating more accurate reconstruction of lung microanatomy. Together, these results show that UMITIC enables consistent and interpretable analyses of both cellular phenotypes and tissue architectures across diverse multiplex and hyperplex imaging datasets without the need for manual annotations.
bioinformatics2026-06-02v2MorphOTU: image-derived morphological operational units for open-set biodiversity assessment
Zhan, Z.; Ye, M.; Orr, M. C.; Chen, W.; Liu, X.; Yue, L.; Sun, X.; Zhang, F.Abstract
The absence of a scalable system for organizing the vast majority of unidentified species is a central obstacle in biodiversity science. Molecular methods can generate OTUs without species names but require sequencing infrastructure and often remain difficult to link to observable morphology, whereas most computer-vision methods still rely on closed-set species labels. These limitations hamper biodiversity quantification under the open, incomplete conditions that characterize real ecosystems. Here, we introduce morphOTUs, a general image-based framework that constructs operational units of biodiversity directly from phenotypes. Using morphOTU, we derive image-based OTUs across five standardized benchmark datasets spanning flowers, wood anatomy, and beetle dorsal habitus. These units closely approximate reference species-level groupings, including closely related species, retain coherent structure when most species are "unseen" during training, and accurately approximate -diversity metrics under sparse labeling or limited sampling. Furthermore, morphOTUs remain effective on a heterogeneous, long-tailed real-world insect survey dataset, demonstrating robustness beyond standardized imaging conditions. Visual explanations reveal that morphOTU consistently focuses on biologically meaningful traits and captures continuous phenotypic variation. By providing a scalable and open-set framework for quantifying phenotypic diversity, morphOTUs enable biodiversity assessment that includes unnamed species and unlock the ecological value of rapidly expanding digital image repositories.
bioinformatics2026-06-02v2GlycoForge generates realistic glycomics data under known ground truth for rigorous method benchmarking
Hu, S.; Bojar, D.Abstract
Quantifying all complex carbohydrates in a sample produces glycomics data, which constitutes compositional data and is stymied by biosynthetic dependencies between glycans, requiring dedicated analytic workflows. Properly assessing such methods frequently requires simulated data with known ground truths and injectable effects. However, simulating glycomics data, especially with control over effects and biases, is still unsolved. Here, we present GlycoForge, a feature-complete solution for simulating comparative glycomics data. GlycoForge supports simulating fully synthetic glycomics data and templated simulations based on real-world data, with specified motif-level effects, based on Gaussian copulas and estimated covariances. We further support injection of batch effects, both mean and variance shifts, via centered log-ratio (CLR) transformations to maintain compositional closure, and realistic missing data simulation. We showcase the utility of GlycoForge by evaluating batch effect correction algorithms for glycomics data, with automated guidelines for when to use such methods on real-world data. GlycoForge is available as an open-access Python package at https://github.com/BojarLab/GlycoForge.
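The idea of injecting batch effects in log-ratio space while keeping compositional closure can be sketched with NumPy alone. This is an illustrative sketch and does not use or reproduce the GlycoForge API. The function names, the dictionary-based batch parameters, and the Dirichlet toy data are assumptions, and the input is assumed to be strictly positive (zeros would first need a pseudocount).

```python
import numpy as np


def clr(x):
    """Centered log-ratio transform of strictly positive compositions (rows)."""
    logx = np.log(x)
    return logx - logx.mean(axis=-1, keepdims=True)


def inverse_clr(z):
    """Map CLR coordinates back to compositions whose rows sum to 1."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # subtract row max for stability
    return e / e.sum(axis=-1, keepdims=True)


def inject_batch_effect(comp, batch_labels, mean_shift, var_scale):
    """Apply per-batch mean and variance shifts in CLR space, then re-close.

    mean_shift: dict mapping batch -> additive shift vector (one entry per glycan)
    var_scale:  dict mapping batch -> multiplicative scale of within-batch spread
    """
    z = clr(comp)
    out = z.copy()
    for b in np.unique(batch_labels):
        rows = batch_labels == b
        center = z[rows].mean(axis=0, keepdims=True)
        out[rows] = center + (z[rows] - center) * var_scale[b] + mean_shift[b]
    return inverse_clr(out)


# Toy data: 6 samples x 5 glycans, two batches; batch 0 is left untouched.
rng = np.random.default_rng(0)
comp = rng.dirichlet(np.ones(5), size=6)
labels = np.array([0, 0, 0, 1, 1, 1])
shift = {0: np.zeros(5), 1: np.array([1.0, 0.0, 0.0, 0.0, -1.0])}
scale = {0: 1.0, 1: 2.0}

shifted = inject_batch_effect(comp, labels, shift, scale)
print(shifted.sum(axis=1))
```

Because the perturbation is applied to log-ratios and the result is renormalized, every simulated sample still sums to 1. Adding the same shift directly to relative abundances would not guarantee this.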
bioinformatics2026-06-02v2ATI_Box: A Simple tool for convolutional neural network-based image semantic segmentation
Przygodzki, T.Abstract
Quantitative analysis of microscopic images has become standard in basic biological and biomedical research. Deep machine learning provides a powerful tool that facilitates this process. However, practical adoption of deep machine learning for image analysis may be difficult for researchers who lack basic coding skills. This is caused by the limited number of no-code solutions, specifically in the domain of convolutional neural networks (CNNs). This scarcity may be explained by the following paradox. Training CNNs is a relatively complex process. Researchers who are familiar with this process are also skilled enough to code the full pipeline of CNN implementation, from annotation through model training and evaluation to its use in laboratory practice. Any alternative solution acceptable to a broader group of researchers who are unfamiliar with CNN concepts must inevitably simplify the entire process, specifically the training step. Such simplification may in turn limit the range of problems the tool can solve. The author believes, however, that a compromise between complexity and simplicity can be found that is sufficient to solve some basic problems in basic biological and biomedical research. To address this challenge, the author proposes ATI_Box (Annotation, Training, Inference in One Box), a unified, user-oriented platform for end-to-end image semantic segmentation. The system integrates data annotation, storage, model training, evaluation, and quantitative analysis into a single workflow, significantly simplifying the model development process. Image and annotation data are managed through an S3-compatible object storage system (MinIO), enabling scalable and transparent data handling. Annotation is implemented through Label Studio. Model training is based on the U-Net convolutional neural network architecture with ResNet as the encoder. Model evaluation is performed on a ground-truth dataset held out during training and provides pixel-level and object-level evaluation metrics. Batch analysis mode enables automated quantification of model predictions, such as object counts and coverage areas. The usability of the platform is demonstrated on examples from laboratory practice. The platform intentionally omits model-tuning capabilities, as it is aimed at users unfamiliar with advanced machine learning concepts. At the same time, access to basic training features, such as setting the number of epochs or saving and reusing trained model versions, enables basic analytical experiments. As such, the platform may serve not only as an analytical tool but also as an educational resource for explaining the practical basics of semantic segmentation.
bioinformatics2026-06-02v1PepForge: Hierarchical HELM-Based Peptide Generation
Wang, Q.; Suessmuth, R. D.Abstract
Peptides carrying special connections such as macrocyclizations and various other structural modifications constitute a major class among peptide therapeutics, yet their chemical space remains largely inaccessible to computational generation methods. Here we present PepForge, a deep learning platform for peptide generation that exploits Hierarchical Editing Language for Macromolecules (HELM) notation to access the chemical space of modified peptides, through a Layout-Content-Connection (LCC) cascade decomposing the generation task into block layout, monomer content, and special connection prediction. The LCC cascade is trained on 383,817 HELM peptides covering 425 monomers and nine connection types. Beyond de novo generation, the LCC cascade supports masked infilling for targeted scaffold modification and multi-level constrained generation. Both the monomer library and the connection-type set support user-defined extensions for exploring a broader chemical space. The prediction module is decoupled from generation and accepts arbitrary scoring heads for downstream tasks. As a demonstration, we built an antimicrobial potency ensemble predictor trained on 11,026 peptides with minimum inhibitory concentration (MIC) values, alongside the external PeptiVerse predictor. Applied at scale, we generated 4.78 million novel HELM peptides and obtained 799 structurally novel hit antimicrobial peptide (AMP) candidates after potency and safety filtering. All code, pre-trained models, and a web interface for interactive use are publicly available at https://github.com/wqx1999/PepForge.
bioinformatics2026-06-02v1RAD: A Read-structure Agnostic Demultiplexer for Single-Cell Long-Read Sequencing and Analysis
Vaidya, C. M.; Carpenter, M. C.; Abdullah, L.; Kolling, F. W.; Huang, Y. H.; Song, L.; Ackerman, M. E.Abstract
Single-cell long-read sequencing (LRS) techniques enable the analysis of full transcript sequences within a cell. However, the high error rate inherent to LRS introduces computational challenges for parsing information such as cell barcodes, and custom workflows are often required to handle complex read layouts, such as split combinatorial barcodes. We introduce an error-robust, read-structure agnostic demultiplexer (RAD). In RAD, users can easily specify the read structure, such as adapter sequences and relative barcode positions, and can rapidly extract these elements from each read. In addition to locating the barcode, RAD implements efficient barcode correction strategies for three scenarios: the full barcode whitelist is known, the whitelist is unknown, or paired short-read single-cell sequencing data provide a short whitelist. In synthetic and real-world benchmarks, RAD is faster and achieves significantly higher sensitivity than existing pipelines while having comparable precision. We show that RAD can be applied to high-definition long-read spatial transcriptomic data and demonstrate single-cell and spatial analysis of B cell isotype and secretion states.
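For the known-whitelist scenario, a common generic approach is to assign each observed barcode to the unique whitelist entry within a small edit distance, which tolerates the insertions and deletions typical of long reads. The sketch below illustrates that generic technique only. It is not RAD's algorithm, and the function names, the distance threshold, and the example barcodes are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance (substitutions, insertions, deletions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion from a
                           cur[j - 1] + 1,             # insertion into a
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return prev[-1]


def correct_barcode(observed, whitelist, max_dist=1):
    """Return the unique whitelist barcode within `max_dist` edits, else None.

    Ambiguous cases (two or more equally close candidates) are rejected
    rather than guessed.
    """
    if observed in whitelist:
        return observed
    best, best_d = [], max_dist + 1
    for bc in whitelist:
        d = edit_distance(observed, bc)
        if d < best_d:
            best, best_d = [bc], d
        elif d == best_d and d <= max_dist:
            best.append(bc)
    return best[0] if len(best) == 1 else None


# Hypothetical whitelist and observed barcodes.
WHITELIST = ["ACGTACGT", "TTTTCCCC", "GGGGAAAA"]
for obs in ["ACGTACGT", "ACGTACG", "ACGAACGT", "CCCCCCCC"]:
    print(obs, "->", correct_barcode(obs, WHITELIST))
```

This linear scan costs one dynamic-programming comparison per whitelist entry, so it is only practical for small whitelists. Large whitelists usually call for an index such as k-mer seeding.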
bioinformatics2026-06-02v1Bridging Ancestry Gaps in Genomic Risk Prediction with Tabular Foundation Models
Das, A.; Cui, Y.Abstract
Motivation: Models deployed for genomic prediction of diseases perform unevenly across populations, limiting clinical utility. Two factors drive this limitation: large imbalances in sample availability across ancestry groups and non-stationarity of genotype-phenotype effect sizes across the ancestry continuum. While tabular foundation models with in-context learning (ICL) have shown strong sample efficiency in other domains, their effectiveness for genotype-to-phenotype prediction and their robustness to ancestry-driven effect heterogeneity remain unclear. Results: Using large, ancestrally diverse biobank data, we show that ICL-capable tabular foundation models reduce performance degradation in under-sampled ancestry groups compared to conventional supervised approaches. However, we find that prevailing models trained on existing synthetic tabular tasks fail when allele effect sizes vary across ancestry space. Treating genetic ancestry as a continuous variable, we introduce an instruction-tuning framework that exposes models to synthetic tasks with ancestry-dependent non-stationary effects. Instruction-tuned models achieve improved and more stable predictive performance across the genetic ancestry continuum, including for individuals distant from in-context exemplars in ancestry space.
bioinformatics2026-06-02v1Hierarchical refinements of cis-regulatory inputs improve scalable gene expression prediction
Zhang, Q.; Xing, M.; Liao, Q.; Li, Z.; Huang, D.-S.Abstract
Deciphering the relationships between cis-regulatory elements (CREs) and target gene expression has long been a challenging problem in molecular biology. However, predicting gene expression from hundreds of candidate cis-regulatory elements (cCREs) requires models that scale to long, noisy inputs while retaining interpretable regulatory structure. Existing Transformer-based approaches typically attend over all nucleotides and all surrounding cCREs, diluting causal signals when hundreds of elements compete for limited model capacity. Here we introduce a two-stage selective framework (TSSF) that performs hierarchical refinements: nucleotide-level masking within each cCRE, followed by cCRE-level selection around each gene, implemented with information-bottleneck priors and a fully Transformer-based architecture. Across 70 human cell types and tissues, TSSF and lightweight variants improve expression prediction and enhancer-gene prioritization relative to strong baselines, including on cross-cell-line and cell-type-specific benchmarks. Prediction-stratified analysis motivates a distance-decay prior that aligns attention with long-range regulatory geometry, and chromatin-contact augmentation improves recovery of distal links. Motif analyses of high-confidence predictions recover proximal and distal regulatory programs, supporting mechanistic interpretability. TSSF offers a general strategy for scalable, interpretable modeling of high-dimensional regulatory inputs in genomics.
bioinformatics2026-06-02v1Deciphering functional dark matter: Machine and deep learning-based processing of protein embeddings enables targeted function discoveries
Wiegand, S.; Kaster, A.-K.Abstract
The ever-expanding catalogue of uncharacterized proteins - the so-called functional dark matter - poses a major challenge for biotechnological and biomedical exploitation. Functional assessment of most proteins is hindered by the technical limitations of annotation transfer and by the propagation of erroneous annotations in databases. The common denominator here is the reliance on sequence similarities. However, these become inaccurate below certain thresholds and can diverge even at sequence identities around 70%. To approach this challenge, we implemented a strategy using embeddings generated by protein language models for targeted function discovery (PE-TFD). Datasets of proteins representing target as well as non-target functions were used to train supervised learning models. The resulting ensemble models yielded interpretable prediction scores, enabling the exploration of databases without relying on multiple sequence alignments or structural information. Here, we tested PE-TFD on the discovery of novel hydrogenases as a proof of concept, resulting in the discovery of 773 [NiFe] and 1,929 [FeFe] hydrogenases that were not detected by established sequence- or profile-based approaches. Structural analyses supported their non-random nature and further revealed a significant number of enzymes lacking prior functional annotation. Our framework therefore enables interpretable function discovery in large-scale datasets and the exploitation of functional dark matter.
bioinformatics2026-06-02v1An integrated resource for systems-level analysis of aging hallmarks and associated genes
Tiwari, R.; Balaji, M.; Chivukula, N.; Sil, P.; Samal, A.Abstract
Aging is a complex biological process involving progressive cellular dysfunction, tissue decline, and increased susceptibility to multiple chronic diseases. A systemic view of aging through its established hallmarks provides a structured framework to understand this complexity and drive therapeutic discovery. Towards this, we present AgingHallmarksDB, an interactive web platform that enables systems-level analysis of hallmark-associated gene sets. Aging-related genes were first curated from seven established resources, and those present in at least 2 of these resources were considered as consensus aging-related genes. Using functional annotations derived from GO, KEGG, and Reactome, a total of 3111 genes were mapped to the 11 aging hallmarks, of which 2593 were supported by additional experimental or manually curated evidence, with 1089 of these forming the consensus set. Further, AgingHallmarksDB supplements gene annotations with tissue or cell type class specificity, exosomal profiles, and regulatory interactions. The platform allows users to interactively perform systems-level hallmark enrichment analysis across multiple condition-associated gene sets, while seamlessly integrating functional annotations and complex regulatory interactions to elucidate mechanistic hallmark-gene associations. The utility of the resource was explored through hallmark enrichment and network proximity analysis of gene sets corresponding to 11 chronic age-related diseases and PM2.5-associated skin transcriptome to explore relationships between aging hallmarks and disease mechanisms or environmental aging-related signatures. Overall, AgingHallmarksDB will support longevity research by enabling aging hallmark centered analysis, and the resource is accessible at https://cb.imsc.res.in/aginghallmarksdb/.
bioinformatics2026-06-02v1SNV and indel error modeling of deep targeted cell-free DNA sequencing data for sensitive detection of circulating tumor DNA in colorectal cancer
Diekema, M. H.; Rasmussen, M. H.; Drue, S. O.; Frydendahl, A.; Andersen, C. L.; Pedersen, J.Abstract
Circulating tumor DNA (ctDNA) is a promising biomarker for cancer detection, but low tumor burden makes it difficult to distinguish true signal from background noise. To aggregate and better evaluate weak mutational signals, we propose PyDREAMS, which incorporates both single-nucleotide variants (SNVs) and insertions and deletions (indels) for ctDNA detection and quantification. To distinguish signal from noise, a neural network background error model is learned from healthy controls. It captures the joint effects of cell-free DNA (cfDNA)-specific lesions and sequencing errors, accounting for both genomic context and read-level features. Finally, a statistical test is used to evaluate the presence of mutational signals. We evaluate the method in a tumor-informed setting, using cohorts of colorectal cancer samples with deep targeted plasma cfDNA sequencing across 12 cancer driver genes. We trained PyDREAMS on 46 healthy controls, with feature analysis revealing that both SNV and indel error rates were lowest at mononucleosomal fragment lengths, suggesting that nucleosomes protect cfDNA and reduce lesion accumulation during circulation and sample handling. In the validation cohort, combining SNVs with indels improved detection, with indels contributing approximately 1.5-fold more evidence per mutation than SNVs. On a test cohort of 209 stage I to III colorectal cancer (CRC) patients and 24 healthy controls, PyDREAMS outperformed a Shearwater-based caller, with an area under the receiver operating characteristic curve (AUC) of 0.917 compared with 0.909. In stage III post-operative (Post-OP) samples (n = 26), where ctDNA was expected only in non-cured patients, PyDREAMS detected ctDNA in 5 patients, including 3 of 9 with later recurrence, while Shearwater detected none. Together, these results show that PyDREAMS improves evaluation of ultra-low-frequency tumor signals through unified read-level modelling of SNV and indel background error.
bioinformatics2026-06-02v1Combining transcriptomic resolutions and machine learning strategies uncovers new OXPHOS genes in Caenorhabditis elegans
Zeballos - Goron, S.; Salinas, G.; Pazos Obregon, F.Abstract
Assigning functions to genes remains a major challenge in biology, as a large fraction of genes remain unannotated despite the availability of complete genomes. Oxidative phosphorylation (OXPHOS), the primary source of ATP in eukaryotes, exemplifies this gap: although it has been extensively studied in mammals, our understanding of this process in other lineages remains limited. In general, research in other organisms has relied on the identification of sequence homologs of genes previously characterized in mammals. While this strategy has enabled the inference of certain conserved functions, it may overlook genes with key roles that lack detectable homology. This highlights the need to explore alternative approaches, such as the integration of transcriptomic data, to better understand the specific features and adaptations of this process across different evolutionary lineages. Caenorhabditis elegans provides a powerful framework to address this problem, combining conservation of mitochondrial pathways with extensive transcriptomic resources. Studying this organism also has translational relevance for parasitic helminths, where OXPHOS represents a promising therapeutic target. We hypothesized that genes involved in OXPHOS share transcriptional signatures that can be exploited for functional prediction. Using a curated set of 65 well-established OXPHOS genes, we applied two complementary machine learning strategies to identify new candidates. We trained an ensemble of supervised learning models on a time-resolved bulk RNA-seq transcriptome of C. elegans. To address uncertainty in functional annotations, we implemented a novel informed bagging strategy combined with a two-round training scheme, in which weak positives were initially excluded and subsequently incorporated based on model predictions. In parallel, we performed cluster-based functional inference using embryonic and adult single-cell RNA-seq datasets. 
Integration of both approaches produced a list of candidate genes supported by strong predictive performance on an independent evaluation set. Several candidates lack prior functional annotation. A mutant strain in ril-1, one of the highly supported predictions, showed decreased respiration rates compared to the wild-type strain. Our results highlight the value of integrating biological priors, complementary learning paradigms, and multi-resolution transcriptomic data to enable systematic gene function discovery.
bioinformatics2026-06-02v1Ground Truth-Based Evaluation of False Discovery Rate and Statistical Power in DIA Proteomics
Yarbro, J. M.; Huang, Y.; Pagala, V.; Fu, Y.; Wang, Z.; Wu, L.; Wang, X.; High, A. A.; Byrum, S.; Peng, J.; Yuan, Z.-F.Abstract
Data-independent acquisition (DIA) mass spectrometry enables rapid proteomic quantification, yet the reliability of statistical inference in DIA-based protein quantification remains incompletely understood. Here, we systematically evaluated missingness, false discovery rate (FDR), and statistical power, defined as true positive rate (i.e. sensitivity or recall), using technical replicates and a spike-in benchmark with known ground truth. Analysis of 18 HeLa replicates revealed persistent, abundance-dependent missingness. In the spike-in experiment with five replicates, human peptides were titrated against a stable yeast background, allowing fold changes (FCs) to be compared with expected values. Across comparisons with log2FCs ranging from 0.2 to 2.5, the nominal BH-FDR substantially underestimated the true FDR. For example, at a BH-FDR threshold of 0.05, the true FDR was ~0.2. Statistical power was ~40% for a log2FC of 0.2 and increased to nearly 100% for a log2FC of 2.5. Additional incorporation of FC thresholds improved the true FDR for large-FC comparisons, with slight loss of power, but markedly reduced sensitivity for small-FC comparisons. Together, these results indicate that nominal FDR does not necessarily reflect actual error rates in DIA proteomics and that DIA performance is influenced by protein abundance and expected fold changes. This study provides a framework for experimental design and data interpretation in DIA-based proteomic studies.
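The comparison between nominal BH-FDR and true FDR described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the p-values, and the ground-truth labels (`True` for a spiked-in protein that truly changes, `False` for a background protein) are all hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def empirical_fdr_and_power(pvalues, is_true_change, threshold=0.05):
    """Compare calls made at a nominal BH-FDR threshold with known truth."""
    q = bh_adjust(pvalues)
    called = [qi <= threshold for qi in q]
    tp = sum(c and t for c, t in zip(called, is_true_change))
    fp = sum(c and not t for c, t in zip(called, is_true_change))
    n_true = sum(is_true_change)
    fdr = fp / (tp + fp) if (tp + fp) else 0.0   # false calls / all calls
    power = tp / n_true if n_true else 0.0        # true positive rate
    return fdr, power


# Hypothetical toy data: 8 proteins, 3 of which truly change.
pvals = [0.001, 0.004, 0.01, 0.02, 0.03, 0.2, 0.5, 0.8]
truth = [True, True, False, True, False, False, False, False]
fdr, power = empirical_fdr_and_power(pvals, truth, threshold=0.05)
print(f"true FDR = {fdr:.2f}, power = {power:.2f}")
```

In this toy case five proteins pass the nominal 0.05 threshold but two of them are background, so the empirical FDR is far above the nominal level, which is the kind of gap a ground-truth benchmark is designed to expose.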
bioinformatics2026-06-02v1Equitable Health Intelligence: An Open Benchmark of Multi-Population Machine Learning for Omics-Based Cancer Prognosis
Sharma, T.; Chopra, A. P.; Agrawal, L.; Verma, N. K.; Starlard-Davenport, A.; Wang, J.; Hayes, D. N.; Cui, Y.Abstract
Purpose: Machine learning (ML) models for omics-based cancer prognosis are often trained on data from predominantly European-ancestry populations, producing biased predictions for other populations and undermining equitable genomic medicine. Existing fairness benchmarks mainly focus on outcome parity rather than predictive performance parity across populations. Public benchmark resources are needed for systematically detecting and mitigating such performance disparities in multi-population cancer prognosis. Methods: We developed Equitable Health Intelligence (EHI, https://ehiportal.org), an open-source benchmark of multi-population ML for omics-based cancer prognosis. EHI contains 1,475 ML tasks across 40 cancer/pan-cancer types, 4 omics feature sets, 4 clinical endpoints, 5 event-time thresholds, and 3 data-disadvantaged population (DDP) groups relative to a majority European Ancestry population group. Deep neural network models are trained under three multi-population ML schemes (Mixture, Independent, and Transfer Learning), with Naive Transfer included as a no-adaptation control, comprising a total of 10,325 ML experiments. Results: The EHI platform provides an interactive environment with visualization and exploratory tools for users to inspect predictive performance disparities between the majority European-ancestry group and data-disadvantaged populations, evaluate the extent to which transfer learning mitigates these disparities, and examine the impact of feature engineering methods across cancer types, omics features, and clinical endpoints. Conclusion: EHI is an open, interactive, and extensible benchmark for identifying and addressing performance disparities in multi-population ML for omics-based cancer prognosis. It provides a foundation for a growing ecosystem of methods targeting ML performance disparities arising from biomedical data inequality and population-level distribution shifts, thereby advancing equitable AI in precision oncology.
bioinformatics2026-06-02v1A Pan-Cancer Multi-Omic SuperLearner for Regulated Cell Death Survival Topologies
Rodrigues de Souza, E.; Almeida Cordeiro Nogueira, H.; dos Santos Lopes, V.; Medina-Acosta, E.Abstract
Introduction: Regulated cell death (RCD) pathways profoundly influence tumor progression and immune modulation. In prior work, we constructed a comprehensive database mapping 25 forms of RCD across seven multi-omic layers encompassing 33 tumor types (CancerRCDShiny). Despite their robust ability to identify risk populations, translating these prognostic signatures into personalized clinical workflows requires a shift from generalized cohort stratification to individualized risk mapping. This necessitates mapping the complex geometric landscape of patient risk (Survival Topologies) to accurately capture the non-linear dynamics of RCD signatures. Methods: We engineered a Pan-Cancer Multi-Omic SuperLearner pipeline evaluating 33 cancer types. Phase I performed zero-leakage data harmonization and groupwise imputation to prevent cross-cohort amalgamation. Phase II utilized Elastic Net-regularized Cox (CoxNet) regression as an audit-compliant CANARY diagnostic to map mathematical proportional-hazards failures. Admissible strata enforcing a rigid 35% topological missingness barrier entered Phase III, deploying an advanced non-linear Quadripartite Base-Learner Ensemble (Random Survival Forests (RSF), Extreme Gradient Boosting (XGBoost), insulated Survival-Boruta, and Multi-Task Logistic Regression (MTLR)), fused within an Elastic Net Multi-View Meta-Learner (MVL), with local interpretability guaranteed via post-hoc SHapley Additive exPlanations (TreeSHAP) and Local Interpretable Model-agnostic Explanations (LIME). Results: The CANARY diagnostic empirically proved the structural invalidity of pan-cancer geometric proportional-hazards. Advancing 96 verified matrices into the Quadripartite Machine Learning Ensemble, Phase III executed a structural algorithmic displacement: dense continuous multi-omic topologies computationally suppressed static genomic mutations and Copy Number Variations (CNVs) during multidimensional competition (85.7% vs 0.0% apex retention). 
Furthermore, the MVL stabilized global predictions against extreme biological variance, while surrogate LIME validations (R-squared < 0.10) confirmed the absolute failure of linear interpretative proxies. Extracting N-dimensional TreeSHAP interactions natively bypassed generalized risk parameters, mapping exact Survival Topologies. This dynamically exposed multi-omic synergistic (lethal peaks) and antagonistic (protective valleys) rescue trajectories invisible to additive models. We integrated this architecture into CancerRCDPredictor, a Shiny application operating as a digital tumor board. Conclusion: Deploying a Pan-Cancer Multi-Omic SuperLearner to bypass linear topological failures, this study advances beyond generalized cohort stratifications, establishing a deterministically mapped architecture for predicting RCD-related Survival Topologies. Through the CancerRCDPredictor interface, we directly translate multi-omic insights into individualized precision oncology interception.
bioinformatics2026-06-02v1Quantifying and Predicting the Difficulty of Multiple Sequence Alignment with AlDiScore
Bodynek, M.; Martin-Fernandez, L.; Bettisworth, B.; Haag, J.; Stamatakis, A.Abstract
Multiple Sequence Alignment (MSA) constitutes an important and frequent operation in molecular sequence data analysis. There exist numerous tools, algorithms, and criteria to infer an MSA. This plethora of available approaches to MSA may induce an ensemble of divergent MSAs for the same underlying unaligned sequence set. Even a single MSA tool may infer distinct MSAs when varying the input parameters. Hence, when using a diversified set of MSA algorithms and parameterizations, the observed dispersion within an MSA ensemble expresses the difficulty of inferring a robust alignment. We refer to this notion as MSA difficulty. As downstream analyses heavily rely on the MSA, characterizing MSA difficulty for a given unaligned sequence set is critical. Initially, we show that measures of dispersion within diversified MSA ensembles can reliably predict MSA difficulty. We then assess the adequacy of these measures by computing the average reference-based distance between the MSAs in the MSA ensemble and its corresponding structural reference MSA and subsequently comparing this distance to the corresponding reference-free average distance over all MSA pairs in the ensemble. We find that Blackburne and Whelan's dpos alignment metric is most appropriate as its reference-free counterpart most accurately approximates the reference-based difficulty computed on BAliBASE reference data. We therefore use the average pairwise distance measured by dpos to quantify MSA difficulty on a scale from 0 (easy) to 1 (difficult) given an MSA ensemble. Next, we introduce the AlDiScore open-source tool, which uses machine learning to directly and reliably predict reference-free difficulty scores from unaligned sequence sets to completely omit expensive MSA computations. The underlying regression model relies upon a large set of features, including sampling-based measures of transitive consistency. 
We trained our AlDiScore models on a diverse collection of empirical datasets from BAliBASE, TreeBASE, and published studies. Subsequently, we demonstrate that AlDiScore attains an R² of 0.89 and 0.84 on unseen AA and DNA sequence sets, respectively, extracted from the PANDIT v17 database. Finally, we show that there is no correlation between MSA difficulty and the corresponding phylogenetic difficulty of the respective MSA.
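The reference-free difficulty described above, the average distance over all MSA pairs in an ensemble, can be sketched as follows. This is an illustration only: `toy_distance` is a hypothetical stand-in that counts differing positions row by row and is not the dpos metric, and the function names and example alignments are assumptions rather than part of AlDiScore.

```python
from itertools import combinations


def toy_distance(msa_a, msa_b):
    """Hypothetical stand-in for dpos: fraction of row positions that differ.

    Each MSA is a list of gapped strings, one per sequence, in the same order.
    """
    total = diff = 0
    for row_a, row_b in zip(msa_a, msa_b):
        width = max(len(row_a), len(row_b))
        a, b = row_a.ljust(width, "-"), row_b.ljust(width, "-")
        total += width
        diff += sum(x != y for x, y in zip(a, b))
    return diff / total if total else 0.0


def ensemble_difficulty(msas, distance):
    """Average pairwise distance over all MSA pairs (0 = easy, 1 = difficult)."""
    pairs = list(combinations(msas, 2))
    if not pairs:
        return 0.0
    return sum(distance(a, b) for a, b in pairs) / len(pairs)


# Three alignments of the same two sequences, e.g. from different tools.
ensemble = [
    ["AC-G", "ACTG"],
    ["ACG-", "ACTG"],
    ["AC-G", "ACTG"],
]
print(ensemble_difficulty(ensemble, toy_distance))
```

Passing the distance as a parameter keeps the averaging step separate from the choice of metric, so a real dpos implementation could be swapped in without changing `ensemble_difficulty`.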
bioinformatics2026-06-02v1miDGD: a multi-modal deep generative model predicts miRNA expression from bulk or single-cell mRNA expression
Zamani, F.; Rasmussen, A. M.; Schuster, V.; Diekema, M. H.; Krogh, A.; Pedersen, J. S.Abstract
MicroRNAs (miRNAs) are important post-transcriptional regulators, yet their expression is typically unobserved in single-cell and most bulk RNA-seq datasets. We present miDGD, a deep generative decoder model that predicts miRNA abundance directly from gene expression alone. Trained on bulk and single-cell datasets from TCGA, GTEx, and human cell lines, miDGD learned a shared latent representation of matched mRNA and miRNA profiles that organized samples into biologically meaningful clusters reflecting tissue and cancer types. The model reconstructed both tissue-specific and broadly expressed miRNAs, recapitulated known miRNA-target relationships, and showed robust performance in sparse and single-cell data. miDGD outperformed miRSCAPE and recent miRNA activity inference methods, with improved cross-dataset generalization. These results establish a deep generative model as an improved framework for predicting miRNA expression when direct measurements are unavailable.
bioinformatics2026-06-02v1Mechanistic Interpretability for Protein Language Models: A Validation Framework
Chon, P.; ANDREOPOULOS, W. B.Abstract
Protein language models (PLMs) are powerful predictors of protein structure and function, but their internal mechanisms remain poorly understood. Recent mechanistic interpretability methods have decomposed PLM representations into interpretable features, but they have not combined methods on a single biologically meaningful task. This paper tests whether an InterPLM sparse autoencoder and a ProtoMech cross-layer transcoder can discover features in ESM-2 (6 layers, 8M) that primarily discriminate between Class A β-lactamase and Class B β-lactamase, with Classes C and D used as more challenging comparisons. The main goal is to find distinct features for Class A β-lactamase that are not shared by other classes. We find that both methods find distinct features for Class A β-lactamase, but the cross-layer transcoders show that the concepts for Class A β-lactamase seem to be distributed among nodes, such as in layers 4 and 6, rather than concentrated in one node. We also showcase a validation framework to prevent overclaiming the role of a node, and we use it to show that several strong nodes fail in some stages of the framework, meaning that they cannot be the sole node that defines Class A β-lactamase.
bioinformatics2026-06-02v1Decoding the Grammar of Protein-Protein Interaction Interfaces with Multimodal Representations
Cuturello, F.; Senci, S.; Di Vora, D.; Gardinazzi, Y.; Villegas Garcia, E. N.; Feltrin, A.Abstract
Protein-protein interactions (PPI) govern essential cellular processes, making the computational identification of interacting sites a central challenge in structural biology, with important implications for protein engineering and the development of targeted therapeutics. Existing prediction algorithms are either sequence-based methods, which lack structural information, or structure-based approaches, which often struggle to effectively integrate evolutionary context. Here, we present ESM3-PPISites, a supervised model for residue-level classification of PPI interfaces, leveraging the multimodal representations of the ESM3 Protein Language Model. To ensure a bias-free evaluation, we adopt a stringent redundancy filtering protocol, systematically eliminating latent homology between the training data and a curated benchmark set in both sequence and structural space. Our findings demonstrate that while the largest proprietary version of ESM3 yields the highest predictive power, targeted fine-tuning of its small open-weight counterpart significantly narrows the performance gap. Requiring only primary sequence data at inference, ESM3-PPISites achieves unprecedented accuracy, vastly outperforming current approaches. Crucially, we demonstrate the practical impact of these predictions by integrating them as spatial restraints within the HADDOCK3 docking platform. When evaluated on an independent subset of 12 complexes from the Docking Benchmark v5, our prediction-guided pipeline strongly enhances the identification of near-native binding poses over ab initio blind docking, while reducing computational runtime by an order of magnitude. This framework establishes a scalable paradigm for high-throughput structural interactomics.
bioinformatics2026-06-02v1BacTaxID: A universal framework for standardized bacterial classification
Fernandez-de-Bobadilla, M. D.; Lanza, V. F.Abstract
Bacterial strain typing is key to surveillance, outbreak investigation and microbial ecology, yet current systems remain species-specific, reference-dependent and lack a universal, interpretable metric of genomic relatedness. Here, we introduce BacTaxID, a fully configurable, whole-genome k-mer-based framework that encodes each genome as a numeric sketch and organizes strains into hierarchical clusters with user-defined similarity thresholds. BacTaxID distances are strictly proportional to Average Nucleotide Identity (ANI), providing a direct quantitative link between vectorial typing and genome-wide divergence. Applied to 2.3 million genomes from All the Bacteria across 67 genera, BacTaxID demonstrates universal concordance with species and sub-species classification systems, while capturing finer strain-level diversity than traditional reference-based approaches. In simulated surveillance and real outbreak datasets, BacTaxID reproduces SNP and cgMLST-based definitions while enabling rapid, scalable screening. Precomputed genus-level schemes and an open implementation provide a practical, genus-agnostic alternative to classical typing systems for standardized bacterial classification.
bioinformatics2026-06-01v4Systems Level Analysis of Gene, Pathway and Phytochemical Associations with Psoriasis
Ray, S.; Dutta, O.; Kousoulas, K. G.; Apostolopoulos, N.; Chamcheu, J. C.; Kaur, R.Abstract
Psoriasis is an inflammatory skin disorder driven by abnormal immune activation that promotes excessive proliferation and accelerated turnover of epidermal keratinocytes. IL-17 and TNF pathways are well established in psoriasis, but the other mechanisms that keep the disease active and link it to systemic comorbidities are not yet fully understood. A combined transcriptomic and systems biology framework was applied to map regulatory circuits in psoriatic lesions and to identify phytochemical candidates capable of multi-target modulation for topical intervention. Differential gene expression between lesional and healthy skin was analyzed, followed by functional characterization, employing Qiagen's Ingenuity Pathway Analysis (IPA) for pathway and upstream regulator inference, protein-protein interaction network, and chemical-gene interaction mapping. This integrative strategy revealed a transcriptional landscape dominated by type I/III interferon signaling, antiviral and antimicrobial responses, immunometabolic dysregulation, and transcriptional hubs centered on AP-1 and CREB1. Several previously unreported genes and upstream regulators without prior documented association with psoriasis were identified within inflammatory and cell migration-related modules, indicating unexplored regulatory layers in disease control. Network-guided chemical prioritization and direction-of-effect filtering highlighted seven phytochemicals (mahanine, atractylon, protopine, annomontine, taraxasterol, tricin, and tamarixetin) with multi-target activity across key disease axes. ADMET-based screening suggested protopine and atractylon as favorable candidates for topical delivery, while synergy modeling identified compatible phytochemical combinations, with flavonoid-alkaloid pairings among the top candidates. This multi-layered approach provides mechanistically informed phytochemicals targeting the IL-17/TNF-interferon-AP-1/CREB1-COX-2/MMP9 axis in psoriasis. 
Experimental validation in keratinocyte and organotypic skin models will be required to determine whether these compounds, individually or in combination, can effectively modulate psoriatic signaling in vivo.
bioinformatics2026-06-01v3Cross-etiology transcriptomic conservation in hepatocellular carcinoma reveals opposing proliferation and hepatocyte-loss programs validated across cohorts
Romero, R.; Toledo, C.Abstract
Background: Hepatocellular carcinoma (HCC) arises from diverse etiologies, but the extent to which viral etiologies converge on reproducible transcriptomic state axes remains incompletely resolved. Methods: We analyzed HBV- and HCV-associated HCC discovery cohorts using Hallmark GSVA, limma-based differential modeling, and cross-cohort meta-analysis. Conserved tumor-upregulated and tumor-downregulated genes were distilled into ProlifHub and HepLoss modules, combined as HCCStateScore = ProlifHubScore - HepLossScore. Module performance was evaluated across multiple independent GEO cohorts, module-size robustness was tested across alternative top-N definitions, and TCGA-LIHC was used for continuous Cox survival modeling. An HBV-derived injury axis was constructed from an ordinal ALT/AST/HBV-DNA injury index in GSE83148 and tested in GSE121248 with adjustment for E2F/G2M activity and CIBERSORTx-inferred immune composition. Results: HBV- and HCV-associated HCC showed conserved activation of proliferation/repair programs and suppression of hepatocyte functional programs. The HCCStateScore validated across independent HCC cohorts with consistently positive tumor-non-tumor deltas and high discrimination, and module-size sensitivity analysis showed that performance was not dependent on the top-20 cutoff. In TCGA-LIHC, higher ProlifHubScore and HCCStateScore were associated with poorer overall survival in continuous Cox models, including after age/sex/stage adjustment. A compact HBV injury program remained tumor-associated after simultaneous adjustment for E2F/G2M activity and CIBERSORTx-derived immune-composition covariates, with concordant results using an extended FDR-defined injury set. Conclusions: HCC exhibits a robust cross-etiology transcriptomic state characterized by opposing proliferation and hepatocyte-loss programs. 
The module framework provides a portable bulk transcriptomic state score and supports a residual tumor-associated HBV injury component that is not fully explained by proliferation or inferred immune composition.
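The combined score stated above, HCCStateScore = ProlifHubScore - HepLossScore, can be sketched as below. The abstract does not say how each module score is computed, so this sketch assumes a per-sample mean of per-gene z-scores; the gene labels `P1`, `P2`, `H1`, `H2` and the expression values are hypothetical placeholders, not members of the published modules.

```python
from statistics import mean, pstdev


def zscore_rows(expr):
    """expr maps gene -> list of values per sample; z-score each gene."""
    out = {}
    for gene, values in expr.items():
        mu, sd = mean(values), pstdev(values)
        out[gene] = [(v - mu) / sd if sd > 0 else 0.0 for v in values]
    return out


def module_score(z, genes):
    """Assumed scoring rule: mean z-score over module genes, per sample."""
    present = [g for g in genes if g in z]
    n_samples = len(next(iter(z.values())))
    return [mean(z[g][s] for g in present) for s in range(n_samples)]


def hcc_state_score(expr, prolif_genes, heploss_genes):
    """HCCStateScore = ProlifHubScore - HepLossScore, per sample."""
    z = zscore_rows(expr)
    prolif = module_score(z, prolif_genes)
    heploss = module_score(z, heploss_genes)
    return [p - h for p, h in zip(prolif, heploss)]


# Hypothetical two-sample example: sample 0 non-tumor, sample 1 tumor.
expr = {
    "P1": [1, 3], "P2": [2, 4],   # placeholder proliferation genes
    "H1": [5, 1], "H2": [6, 2],   # placeholder hepatocyte-function genes
}
print(hcc_state_score(expr, ["P1", "P2"], ["H1", "H2"]))
```

Because the two modules move in opposite directions in tumor tissue, subtracting one from the other widens the tumor versus non-tumor gap relative to either module alone, which is the motivation for a difference score.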
bioinformatics2026-06-01v2Evolutionary constraints improve protein large language model predictions for protein stability, binding regions and epistasis
Tzavella, K.; Olsen, C.; Vranken, W. F.Abstract
Our understanding of protein function and evolution is largely based on the relationship between amino acid sequence and overall fold, now effectively captured by computational models. Yet predicting how mutations--shaped by epistasis--alter protein behavior, especially in dynamic or structurally ambiguous regions, remains difficult. Here we present D2D, which combines a self-supervised protein language model with protein-specific evolutionary information to predict mutational effects using little to no task-specific labeled data. D2D captures long-range epistatic interactions and accurately predicts single and higher-order mutation effects on protein thermostability and binding without being trained on the task. When fine-tuned, D2D outperforms state-of-the-art methods on latent driver cancer mutations and co-occurring proliferation-enhancing mutations across independent experimental studies. Unlike most existing approaches, D2D avoids biases linked to solvent accessibility or to multiple sequence alignment depth and quality, making it particularly effective for disordered or surface binding regions where structure-based predictors typically falter. Overall, D2D provides a general framework for modeling mutational effects in proteins with limited experimental or structural information.
bioinformatics2026-06-01v2A unified transcriptome database to accelerate gene discovery in Amaryllidoideae species
Goncalves dos Santos, K. C.; Merindol, N.; Desgagne-Penix, I.Abstract
Amaryllidoideae plants produce structurally diverse and unique alkaloids with potent anti-cholinesterase, antiviral, and antitumor activities, making this subfamily a rich source of pharmaceutical leads. Despite the absence of reference genomes for any Amaryllidoideae species, many enzyme characterization and pathway reconstruction efforts to date have been made possible through transcriptome mining, often requiring bioinformatic expertise and data preprocessing. To facilitate new studies in this subfamily, here we present AmarylOmicBase, a unified transcriptomic dataset that integrates assemblies, annotations, and expression profiles from 39 studies, covering 27 species and four hybrid cultivars across 13 genera of Amaryllidoideae. The AmarylOmicBase includes both published and de novo assemblies generated from published raw data using Trinity or IsoSeq workflows and provides standardized functional annotation and quantitative expression datasets. AmarylOmicBase provides ready-to-use datasets that support gene discovery, comparative transcriptomics, and pathway-level investigations for specialized metabolism, including Amaryllidaceae alkaloid biosynthesis. By providing ready-to-use datasets and fully reproducible analysis scripts, this resource reduces computational barriers and expands access to transcriptomic information for researchers working on non-model plant species. AmarylOmicBase provides a centralized resource for transcriptomic data that can be reused in studies of enzyme function, pathway evolution, and regulatory processes in Amaryllidoideae.
bioinformatics2026-06-01v2Hierarchical latent representations reveal protein organization for functional discovery and design
Guo, Z.; Wang, Z.; Wang, S.; Chai, Y.; XU, K.; Li, M.; Li, W.; Ou, G.Abstract
Proteins can preserve conserved functions despite extensive sequence and structural divergence, suggesting that functional organization is governed by distributed constraints not captured by conventional representations. Here we develop a hierarchical sequence-based representation framework that compresses proteins into context-dependent latent states while preserving multiscale organizational information. Using this framework, we identified previously uncharacterized ciliary proteins lacking detectable sequence and structure homology, including ADMAP1, which is required for normal sperm axonemal organization and motility in mice. Discrete latent protein states captured species-level organizational signatures correlated with major evolutionary groups and revealed expansion of intrinsically disordered regulatory environments in eukaryotes. Autoregressive sampling within this latent space further enabled design of synthetic actin-remodeling proteins that maintained robust F-actin severing activity despite extensive sequence rewiring across key functional interfaces. These findings demonstrate that distributed protein organization can be inferred directly from sequence, linking functional discovery, evolutionary analysis, and protein design within a shared representational framework.
bioinformatics2026-06-01v2