Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
TSvelo: Comprehensive RNA velocity by modeling cascade of gene regulation, transcription and splicing
Li, J.; Wang, Z.; Shen, H.-B.; Yuan, Y.
RNA velocity approaches fit gene dynamics and infer cell fate by modeling the splicing process using single-cell RNA sequencing (scRNA-seq) data. However, due to the short time scale of splicing, high noise, and the complexity of the data, existing RNA velocity methods often fail to precisely capture the complex velocity dynamics of individual genes and single cells, which makes their downstream analysis less reliable and less robust. We propose TSvelo, a comprehensive mathematical framework for RNA velocity that models the cascade of gene regulation, Transcription and Splicing using highly interpretable neural ordinary differential equations (ODEs). TSvelo can precisely capture the transcription-unspliced-spliced 3D dynamics of all genes simultaneously, infer a unified latent time shared by genes within a single cell, and be applied to multi-lineage datasets. Experiments on six scRNA-seq datasets, including two multi-lineage datasets, demonstrate TSvelo's superiority.
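For orientation, the conventional splicing model that RNA velocity methods build on can be written as a pair of linear ODEs. This is the standard formulation and not TSvelo's own equations, which the abstract does not give; the symbols $u$, $s$, $\alpha$, $\beta$, and $\gamma$ follow common usage and are assumptions here.

```latex
\begin{align}
\frac{du}{dt} &= \alpha(t) - \beta\,u(t), \\
\frac{ds}{dt} &= \beta\,u(t) - \gamma\,s(t),
\end{align}
% u: unspliced mRNA abundance, s: spliced mRNA abundance,
% \alpha: transcription rate, \beta: splicing rate, \gamma: degradation rate.
% The RNA velocity of a gene is ds/dt.
```

In this picture, a framework that also models regulation and transcription would presumably make $\alpha(t)$ itself a learned function instead of a fixed rate.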
bioinformatics 2026-06-04 v5
SWARM resolves nanopore signal interference between RNA modification types and reveals splicing-shaped pseudouridylation
Prodic, S.; Cleynen, A.; Mahmud, S.; Srivastava, A.; Ravindran, A.; Kanchi, M.; Sethi, A. J.; Corovic, M.; Jain, R.; Santos-Rodriguez, G.; Vieira, G.; Preiss, T.; Weatheritt, R. J.; Hayashi, R.; Martinez, N. M.; Burgio, G.; Shirokikh, N. E.; Eyras, E.
Nanopore direct RNA sequencing promises to decode the epitranscriptome by detecting multiple modifications on individual RNA molecules, but its potential for biological discovery is hampered by high false-positive rates. We present SWARM, an AI-based framework designed to overcome this fundamental limitation. Its key innovation is a crosstalk-aware training strategy that incorporates non-target modifications and orthogonally validated cellular signals, enabling high-precision detection of m6A, pseudouridine (Ψ), and m5C at single-nucleotide and single-molecule resolution. Using rigorous in vitro and cellular RNA benchmarks, SWARM outperforms existing tools and maintains strong agreement with orthogonal methods. Applying SWARM across mammalian tissues reveals thousands of novel modification sites with confirmed motifs and localisation patterns. Our high-resolution multi-tissue modification map revealed no evidence of widespread m6A-Ψ interplay in predominant writer contexts, challenging models of a coordinated epitranscriptomic code. We further discovered a previously unrecognised splicing-shaped mode of Ψ deposition, whereby TRUB1-mediated pseudouridylation preferentially occurs after exon-exon ligation, consistent with local RNA structure stabilisation. SWARM provides a robust, universally applicable tool for epitranscriptome discovery.
bioinformatics 2026-06-04 v4
STAR Suite: Transcriptomics processing in a single binary through AI-assisted development
Hung, L.-H.; Baker, D.; Flynn, B.; Huangfu, D.; Luo, R.; Robson, P.; Zhou, T.; Yeung, K. Y.
The STAR aligner plays a key role in complex transcriptomics pipelines consisting of multiple analytical tools. We present STAR Suite, a drop-in replacement for STAR that internalizes entire pipelines for bulk RNA-seq, scRNA-seq, Perturb-seq, 10x Flex, and SLAM-seq. Deployed by the NIH MorPhiC consortium, STAR Suite provides an open-source alternative to proprietary Cell Ranger pipelines, achieving gene-level Pearson correlations of 0.99-1.0 and 3.8- to 5.7-fold faster speeds for Perturb-seq and Flex analysis through improved methodologies. Integrating multi-module workflows into a single executable makes STAR Suite ready-to-use for both human researchers and the AI agents increasingly used in analytical workflows. STAR Suite was developed using AI agents, enabling a single developer to add 97,000 lines of code to the 28,000-line codebase in four months - illustrating a modern paradigm for large-scale integration of complex open-source codebases by individual research groups. Utilities are included to facilitate future community contributions using AI assistants.
bioinformatics 2026-06-04 v4
OmniGene-4: A Unified Bio-Language MoE Model with Router-Level Interpretability
Wang, L.
How do multi-modal large language models that jointly process natural language and biological sequences (DNA, protein, structural alphabets) actually answer biological questions, especially sequence-grounded questions whose answer depends on residue-level patterns rather than literature recall? We introduce OmniGene-4, a unified bio-language Mixture-of-Experts foundation model built on Gemma-4-26B-A4B (128 experts/layer, top-8 routing), and use its discrete router state to dissect this question. By hooking every router across eight task families, we provide the first router-level decomposition for a biological MoE: continued pretraining (CPT) accounts for 96% of cross-task expert differentiation and supervised fine-tuning (SFT) for 4%, reshaping middle and output layers respectively. Within the protein-homology task family, per-pair routing divergence stays below 0.04 (vs 0.23 cross-task), implying that sequence-grounded decisions occur inside expert computation rather than at the gate: the gate selects the modality, the experts compute the answer. The pipeline yields strong benchmark results: remote-homology 82.60% (ahead of ESM-2 3B, MMseqs2, and DIAMOND by 28-31 pp); standard homology 99.40%; BixBench (general biological-knowledge) 93.66%. A dual-head architecture adds per-residue 3Di/DSSP classifiers (78.6%/100%). To probe whether the discovered transfer mechanism is robust under modality scaling, we further extend the model to OmniGene-4-MM, adding four vision modalities (chemical-structure images, medical/pathology imagery, charts) via a vision tower and a three-stage LoRA pipeline at 1.5 GPU-days total. The multi-modal model preserves the homology capability (85% standard, 69.5% remote) and acquires chemist-readable structure understanding (96% on Vis-CheBI20 functional-group captioning) while consuming roughly four orders of magnitude less compute than recent specialized MoE bio-models.
The work characterizes how multi-modal bio-foundation models acquire, route, and preserve sequence-aware capability, which is central to the next generation of scientific large language models.
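The "128 experts/layer, top-8 routing" configuration mentioned in this abstract can be illustrated with a minimal sketch of top-k gating. This is a generic Mixture-of-Experts routing step, not the model's actual router: the function name `top_k_route`, the softmax-over-selected-logits normalisation, and the synthetic logits are all assumptions made for illustration.

```python
import math

def top_k_route(logits, k=8):
    """Pick the k highest-scoring experts and renormalise their weights.

    Generic top-k gating sketch; the real router's normalisation may differ.
    """
    if not 0 < k <= len(logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Indices of the k largest router logits (ties broken by lower index).
    chosen = sorted(range(len(logits)), key=lambda i: (-logits[i], i))[:k]
    # Softmax over the selected logits only, shifted for numerical stability.
    peak = max(logits[i] for i in chosen)
    exps = [math.exp(logits[i] - peak) for i in chosen]
    total = sum(exps)
    return {i: e / total for i, e in zip(chosen, exps)}

# Hypothetical router logits for one token over 128 experts.
logits = [math.sin(0.37 * i) for i in range(128)]
weights = top_k_route(logits, k=8)
```

The set of selected expert indices is the discrete router state; comparing such sets (or the weights) across inputs is one plausible way to measure the kind of routing divergence the abstract reports.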
bioinformatics 2026-06-04 v3
STAR Suite: Transcriptomics processing in a single binary through AI-assisted development
Hung, L.-H.; Baker, D.; Flynn, W. F.; Huangfu, D. F.; Luo, R.; Robson, P.; Zhou, T.; Yeung, K. Y.
The STAR aligner plays a key role in complex transcriptomics pipelines consisting of multiple analytical tools. We present STAR Suite, a drop-in replacement for STAR that internalizes entire pipelines for bulk RNA-seq, scRNA-seq, Perturb-seq, 10x Flex, and SLAM-seq. Deployed by the NIH MorPhiC consortium, STAR Suite provides an open-source alternative to proprietary Cell Ranger pipelines, achieving gene-level Pearson correlations of 0.99-1.0 and 3.8- to 5.7-fold faster speeds for Perturb-seq and Flex analysis through improved methodologies. Integrating multi-module workflows into a single executable makes STAR Suite ready-to-use for both human researchers and the AI agents increasingly used in analytical workflows. STAR Suite was developed using AI agents, enabling a single developer to add 97,000 lines of code to the 28,000-line codebase in four months - illustrating a modern paradigm for large-scale integration of complex open-source codebases by individual research groups. Utilities are included to facilitate future community contributions using AI assistants.
bioinformatics 2026-06-04 v3
Skiver: Reference-free quality control of metagenomic sequencing datasets using (k,v)-mer sketches
Gu, Z.; Sharma, P.; Wong, L.; Nagarajan, N.
Background. Quality control of sequencing datasets is an important first step in numerous bioinformatics pipelines such as mapping, variant calling, and assembly. Existing methods typically rely on alignment results or quality scores. However, the reference genome is not always available for mapping, and uncalibrated quality scores may yield biased estimates of error rates. Results. We present skiver, a reference-free and alignment-free algorithm that estimates sequencing error rates and calibrates Phred quality scores using (k,v)-mer sketches. By identifying the consensus from the sketched (k,v)-mers, skiver estimates survival and hazard rates that capture positional information of sequencing errors. Across simulated and real datasets from various sequencing platforms, skiver accurately recovers sequencing error rates and the proportion of different error types. We further demonstrate its ability to calibrate Phred scores. It also reliably handles complex datasets containing multiple strains, alleles, and repetitive regions through an iterative outlier filtering strategy. Skiver is computationally efficient and supports tools that need accurate sequencing error rate estimates or quality scores as prior knowledge. Availability and Implementation. An implementation of skiver is available at https://github.com/GZHoffie/skiver, and dataset and scripts for reproducibility are available at https://github.com/GZHoffie/skiver-test.
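The relationship between Phred quality scores and error probabilities that this abstract relies on is the standard definition Q = -10 log10(p). The sketch below shows only that conversion; it is not skiver's calibration procedure, and the function names are hypothetical.

```python
import math

def phred_to_error(q):
    """Error probability implied by a Phred quality score Q."""
    return 10 ** (-q / 10)

def error_to_phred(p):
    """Phred quality score implied by an error probability p."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)

def mean_error_rate(quals):
    """Expected per-base error rate implied by a list of Phred scores."""
    if not quals:
        raise ValueError("need at least one quality score")
    return sum(phred_to_error(q) for q in quals) / len(quals)
```

A calibration step would compare the error rate implied by the reported scores, as computed by `mean_error_rate`, against an independently estimated error rate and adjust the scores where the two disagree.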
bioinformatics 2026-06-04 v2
SciCore-Omics: a tri-modal foundation model unifying histology, spatial transcriptomics and language for spatial biology
Xiao, X.; Li, Y.; Zeng, Z.; Yan, Y.; Liu, Z.; Liu, Z.; Xiang, Y.; Ye, Z.; Ying, J.; Li, Y.; Xie, L.; He, F.
Histomorphology and spatial transcriptomics capture complementary aspects of tissue biology, but their relationships remain difficult to extract, align, and interpret at scale. Existing foundation models typically connect histology, omics, or language only pairwise, which limits their capacity to jointly infer molecular states, decode spatial tissue organization, and generate biologically grounded explanations. Here, we present SciCore-Omics, the first tri-modal foundation model linking histology images, spatial transcriptomics, and biological language. We constructed a spatially paired image-gene-text dataset comprising 151,182 spots across multiple tissues and performed a three-stage progressive training of SciCore-Omics on this dataset. Across gene expression prediction and spatial domain recognition, SciCore-Omics achieved 23.6-80.9% relative gains in task-specific metrics over the strongest external baselines. It further showed robust zero-shot generalization in histopathology classification, outperforming GPT-5 by 6.16 percentage points in mean accuracy across four benchmarks. Expert evaluation in 10 breast cancer cases confirmed its H&E-only case-level molecular reasoning capability. Together, these results demonstrate that a tri-modal framework can effectively bridge histomorphology and molecular state, providing a more general and interpretable foundation model for computational pathology and omics analysis.
bioinformatics 2026-06-04 v2
De Novo Design and Computational Validation of a High-Affinity Peptide Inhibitor Targeting the HPV E1-E2 Interface
Fletcher, S.; Biswas-Fiss, E. E.; Biswas, S. B.
The oncogenic progression of high-risk Human Papillomavirus (HPV) strains relies fundamentally on the cooperative interaction between the E1 replicative helicase and the E2 origin-binding protein to initiate viral DNA amplification. Disrupting this essential protein-protein interaction presents a highly promising, yet clinically unrealized, therapeutic paradigm for treating established HPV infections prior to malignant transformation. This research presents a comprehensive computational pipeline for evaluating and screening de novo generated peptide inhibitors. We use the HPV E1-E2 protein interface as a proof of concept, specifically targeting a highly conserved arginine triad located on the solvent-exposed surface of the E1 helicase. Using the AlphaProteo generative model for sequence discovery and AlphaFold 3 for complex structural prediction, a library of candidate binders was generated and subsequently subjected to dual-scale Molecular Dynamics simulations and thermodynamic validation with GROMACS. The results establish Binder 8 as the lead candidate, yielding a predicted binding free energy (-59.1 ± 0.7 kcal/mol) that indicates a significantly stronger theoretical affinity than the native E1-E2 baseline. Energy decomposition confirms that Binder 8 binds the E1 interface via precise interactions involving the arginine triad. Furthermore, deep-learning-based physicochemical profiling with CSM-Toxin and AlgPred 2.0 predicts that Binder 8 has a favorable safety profile, with zero predicted toxicity and non-allergenic properties. Protein sequence alignment confirms the evolutionary conservation of the targeted arginine triad across the vast majority of oncogenic Alpha-papillomavirus genotypes, highlighting Binder 8 as a promising candidate scaffold for broad-spectrum antiviral development.
The study demonstrates a computational solution for E1-E2 disruption, setting the stage for future in vitro validation via Bio-layer interferometry to confirm physical inhibition.
bioinformatics 2026-06-04 v1
Nanobodies versus canonical antibodies: an updated comparison of their binding modes
Hauser, A.; Dangla-Pelissier, G.; Cazals, F.
Heavy-chain-only antibodies, produced by the adaptive immune systems of camelids and cartilaginous fish, complement canonical antibodies that contain variable domains from both heavy and light chains. We refine previous studies by providing a detailed analysis of the binding modes of VHHs versus canonical antibodies, using a dataset with a 20-fold increase in the number of cases. We show that VHHs exhibit a larger buried surface area than two-domain antibodies, despite relying on a single variable domain. This property can be attributed to contributions from both framework regions and CDR3. We further demonstrate that the binding modes of VHHs, characterized by the number of FR and CDR regions contacting the antigen, are more diverse than previously reported. In addition, we find that VHH and canonical antibody interfaces display similar solvation properties, although VHH interfaces are more tightly packed. Finally, we discuss the thermodynamic and kinetic implications of these findings for the design of high-affinity VHHs, an issue of particular importance in protein engineering and design.
bioinformatics 2026-06-04 v1
UnBlender: validating individual analyses in respiratory bulk RNA-seq cell type deconvolution
Gillett, T. E.; van den Berge, M.; Nawijn, M. C.; Koppelman, G. H.
Analysis of RNA-seq data of respiratory samples has contributed much to our understanding of lung disease. However, bulk RNA-seq data are dependent on both cell type composition and the transcriptional activity of these samples' constituent cells, which complicates interpretation. Cell type deconvolution is frequently used to estimate cell type proportions of bulk transcriptomic gene expression data and improve interpretation of bulk transcriptomics data. However, accuracy of the estimated cell type proportions reported after deconvolution is unknown, which may have a negative impact on the validity of the conclusions drawn. Here, we present UnBlender, a pipeline that enables respiratory scientists to perform cell type deconvolution and routinely evaluate deconvolution accuracy of their approach. UnBlender allows for custom cell type deconvolution tailored to the research question at hand, using consensus cell type labels and validating the approach to promote accurate, reproducible results.
bioinformatics 2026-06-04 v1
Genomic, Transcriptomic, and Regulomic Analyses Do Not Support Profound Autism as a Distinct Biological Category
Eicher, T. D.; Ne'eman, A.; Quackenbush, J. D.
The Lancet Commission on the Future of Care and Clinical Research in Autism proposed the construct of "profound autism" as a recognizable subtype of autism. Supporters argue that this classification is necessary to ensure that autistic persons with severe impairment receive appropriate research attention and policy support, whereas critics contend that the construct lacks scientific validity and may reflect social or political considerations more than biological distinction. To inform this debate, we evaluate whether the proposed "profound autism" category represents a distinct genetic phenotype using multiple molecular data types collected in a large cohort. Across genomic, transcriptomic, and regulatory analyses, we find no evidence supporting "profound autism" as a biologically distinct phenotypic group. Instead, differences emerge primarily in inferred gene regulatory networks distinguishing nonspeaking from speaking autistic children, suggesting potential regulatory mechanisms contributing to speech ability. These findings suggest that future research into severe impairment may be more productive if focused on specific traits, such as speech impairment, rather than attempting to define a distinct biological subtype within the multidimensional phenomenon of autism.
bioinformatics 2026-06-04 v1
Predicting P-glycoprotein Substrate Status Using a Pretrained Graph Neural Network: A TDC Benchmark Study
Yan, J.; Duan, W.
P-glycoprotein (Pgp/ABCB1) is a critical efflux transporter that significantly impacts drug bioavailability and multidrug resistance. Accurate prediction of Pgp substrate status is essential for early-stage drug discovery. In this study, we evaluate a pretrained Graph Isomorphism Network (GIN) with attribute masking on the Pgp_Broccatelli benchmark from the Therapeutics Data Commons (TDC). Our approach fine-tunes a GIN encoder pretrained on approximately 2 million molecules using a self-supervised attribute masking strategy, followed by a multilayer perceptron (MLP) classification head. On the TDC benchmark, our model achieves an AUROC of 0.937 +/- 0.004 across five independent runs, ranking second on the leaderboard, as of May 2026. We further compare this approach against an XGBoost baseline using Morgan fingerprints (AUROC 0.912 +/- 0.007), demonstrating the advantage of graph-based molecular representations with transfer learning for small-dataset ADMET prediction tasks.
bioinformatics 2026-06-04 v1
An interpretable machine learning framework for dog breed inference and ancestry decomposition
Bian, Y.; Bierman, R.; Snyder-Mackler, N.; Promislow, D.; Karlsson, E.; Dog Aging Project Consortium; Akey, J. M.
The over 300 currently recognized breeds of domesticated dogs are the culmination of centuries of intense artificial selection and recurrent population bottlenecks. While breed labels are widely used in genetic and veterinary studies, inferring breed identity from genomic data remains challenging due to the high dimensionality of genotype data, uneven sampling across breeds, and admixture resulting in mixed-breed individuals. Here, we present an interpretable machine learning framework to infer dog breed labels from genome-wide SNP data. Our approach combines dimensionality reduction with a multi-output random forest model that maps genetic variation to a continuous representation of breed membership, enabling both classification and mixed-breed inference. We apply this framework to the Dog Aging Project (DAP) dataset of 6,572 purebred and mixed-breed dogs across 100 breed classes, achieving 91.7% accuracy with an overlap-based metric, outperforming an ADMIXTURE-based benchmark that achieved 87.8% accuracy. Notably, we find that as few as 150 informative SNPs are sufficient to achieve near-maximal predictive performance, highlighting the highly structured nature of canine genetic variation. We also introduce a SNP importance score metric that links model predictions back to individual genetic variants. Analysis of top-ranked variants reveals loci previously associated with morphological, pigmentation, and behavioral traits, as well as candidate loci lacking prior phenotypic annotation, supporting both the biological relevance and discovery potential of the framework. Together, these results demonstrate that our framework provides an accurate, flexible, and interpretable approach to predict breed ancestry, with applications in veterinary genomics, canine population genetics, and the identification of loci underlying hallmark breed phenotypes.
bioinformatics 2026-06-04 v1
Language Modeling Materializes a World Model of Protein Biology
Candido, S.; Hayes, T.; Derry, A.; Rao, R.; Lin, Z.; Verkuil, R.; Wu, B. Z.; Lee, J. S.; Bruguera, E. S.; Keval, J. A.; Kopylov, M.; Pak, J. E.; Wu, W.; Thomas, N.; Mataraso, S.; Hsu, A.; Trotman-Grant, A. C.; Fatras, K.; dos Santos Costa, A.; Badkundri, R.; Akin, H.; Oktay, D.; Deaton, J.; Montabana, E.; Sitwala, H.; Yu, Y.; Wiggert, M.; Carlin, D. A.; Goering, A. W.; Blazejewski, T.; Sandora, M.; Hla, M.; Jia, T. Z.; Kloker, L. H.; Sofroniew, N. J.; Uehara, M.; Pannu, J.; Bachas, S.; Liu, D. S.; Sercu, T.; Rives, A.
Proteins are fundamental to life. The full extent of their biology is beyond our ability to characterize with experimental approaches in the physical laboratory. Accurate digital representations could accelerate the discovery of protein biology through virtual experiments. We propose language modeling to learn unified and general representations that can be scaled to all of protein biology. Building on these representations, we develop a structure prediction model that exceeds the performance of established methods for biomolecular complex prediction across benchmarks, including for the interactions of antibodies with their targets. A simple search procedure yields high experimental success rates for the discovery of proteins with nanomolar binding affinities for both miniproteins and single-chain antibodies, a modality critical for therapeutic design. Study of the concepts in the language model's representation space reveals a systematic organization aligned with the reductionist understanding of proteins developed through empirical science. Leveraging this organization, we generate a comprehensive map of protein biology encompassing over 6.8 billion sequences and 1.1 billion predicted structures, identifying connections across known and unknown biology. As a whole, this shows language modeling as a powerful substrate for representing the biology of proteins, operating across scales from the prediction and design of protein interactions at the atomic level, to identifying properties of proteins at different levels of granularity and abstraction, to the scale of mapping connections between proteins across billions of years of evolution.
bioinformatics 2026-06-04 v1
Proteomics-constrained deconvolution reveals spatial cell-type programs in tumours
Isik, E. B.; Haley, M. J.; Anbaki, A. A.; Bere, L.; Roncaroli, F.; Piper Hanley, K.; Couper, K.; Wedge, D. C.; Sellers, R.; Oliveira, P.; Ashton, J.; Bristow, R. G.; Alvarez, M. A.; Georgaka, S.; Rattray, M.
Accurately resolving cell-type mixtures in spatial transcriptomics remains challenging, particularly in heterogeneous tumours where cell populations are intermixed and matched single-cell references may be unavailable or poorly aligned. Current deconvolution approaches either require high-quality scRNA-seq references, suffer from scalability limitations, or lack interpretability. We introduce PISTACHIO, a proteomics-informed spatial transcriptomics deconvolution framework based on constrained non-negative matrix factorization with a negative-binomial likelihood. Rather than using probabilistic priors, PISTACHIO incorporates spatial cell-type constraints derived from paired Imaging Mass Cytometry, enforcing biologically grounded sparsity and explicit spatial feasibility of cell-type presence. PISTACHIO improved recovery of spatial cell-type distributions compared with Cell2location and STdeconvolve across synthetic and real tumour datasets. Our approach remains robust under cell-type assignment errors, maintaining high correlation with ground-truth under moderate noise, and achieves fast runtime on standard hardware, enabling practical large-scale deployment.
bioinformatics 2026-06-04 v1
Hierarchical classification of immune cell transcriptomes at population-scale
Beltz, C.; Qiu, Z.; Sadowski, L.; Kraske, J. A.; Aggarwal, A.; Quintanal-Villalonga, A.; Manoj, P.; Littbarski, A.; Bajaj, S.; Meskauskaite, B.; Umeda, S.; Mazutis, L.; Rose, S. A.; Chan, J. M.; Nawy, T.; Nainys, J.; Chaligne, R.; de Stanchina, E.; Kaelber, K. A.; Cussigh, C. S.; Kallenberger, S. M.; Williams, A.; Jenzer, M.; Pompecki, T.; Kahle, S.; Hohmann, N.; Nussbaum, D. P.; Moss, N. S.; Ziv, E.; Berger, A. K.; Springfeld, C.; Zschaebitz, S.; Hassel, J. C.; Debus, J.; Jaeger, D.; Iacobuzio-Donahue, C. A.; Ganesh, K.; Peer, D.; Ungerechts, G.; Rudin, C. M.; Huber, P. E.; Walle, T.
Accurate immune cell classification is essential for interpreting single-cell RNA sequencing (scRNA-seq) data. However, progress is constrained by the lack of independent, high-resolution benchmarks, as the routine integration of datasets introduces statistical dependencies that artificially inflate model generalizability. Here, we present the single-cell universal classification omnibus (Suco), a resource of independent, uniform expert annotations, and Compocyte, a modular hierarchical classifier. Together, they establish a framework designed for the scale of human population immunology. This approach substantially outperforms existing classifiers while facilitating expert review of ambiguous annotations. Applying Compocyte across 50 studies, including three newly generated datasets, we classified 15.6 million leukocytes from 3,965 patients. Within this expansive cohort, we identified a new tumor-associated resorptive macrophage phenotype, a non-canonical monocyte subtype in subclinical cytokine release syndrome, and the programmatic erosion of T cell memory stemness across metastatic sites. Suco and Compocyte thus provide a generalizable architecture and benchmark capable of sustaining high-resolution annotation across massive clinical cohorts.
bioinformatics 2026-06-04 v1
Learning residue-level context for modeling protein-protein interactions
Zhang, Z.; Yang, Z.; Liu, A.; Yu, K.-H.; Zhao, J.; Yang, Y.; Neale, B.; Chen, S.
Protein language models (PLMs) enable prediction of protein properties by learning residue-level features from sequence, yet most PLM-based approaches to protein-protein interactions aggregate information across entire proteins, limiting resolution and interpretability. Here we present ReCLIP, a transformer-based framework that learns interaction-specific representations at the level of individual residues by combining intra-protein residue neighborhoods with residue-conditioned representations of interaction partners. We show that residue-centered context provides a general framework for modeling protein interactions across diverse biological settings. ReCLIP accurately predicts mutation-induced perturbations (AUROC = 0.973), generalizes to post-translational modifications that do not alter sequence (AUROC = 0.822), and enables zero-shot prediction of peptide-MHC binding across unseen alleles (AUROC up to 0.972). Analysis of learned residue neighborhoods reveals structurally and functionally coherent patterns aligned with known determinants of binding. Applied to clinically annotated genetic variants, ReCLIP identifies disease-associated interaction perturbations that link pathogenic variants to specific molecular interaction contexts. Our results establish a generalizable and interpretable framework for modeling protein interactions and provide insights into how residue-level context shapes interaction specificity and its perturbation.
bioinformatics 2026-06-04 v1
A comparative analysis of promoter-proximal pausing reveals kinetic and distributional dimensions of variation
Zeng, X.; Barshad, G.; Hassett, R.; Rice, E. J.; Danko, C. G.; Siepel, A.; Zhao, Y.
Promoter-proximal pausing of RNA polymerase II is a key regulatory checkpoint in metazoan transcription. Despite extensive study of this process, quantitative methods for comparing pausing dynamics across biological contexts have been lacking. Here we introduce a model-based framework for rigorous comparative analysis of both pause-escape kinetics and pause-site distributions across genes, cell types, and species. An application to available PRO-seq datasets revealed striking differences across perturbations, and comparative analyses across cell types and species highlighted distinct patterns of variation in both pause-escape kinetics and pause-site distributions, with only weak coupling between them. Integration with chromatin and sequence features showed that lower pause-escape rates are associated with stronger promoter-proximal nucleosome occupancy, whereas changes in pause-site dispersion are associated with sequence features such as GC skew. Together, these results establish a quantitative framework for comparative analysis of promoter-proximal pausing and reveal kinetic and distributional dimensions of pausing variation across biological contexts.
bioinformatics 2026-06-04 v1
STAR Suite: Transcriptomics processing in a single binary through AI-assisted development
Hung, L.-H.; Baker, D.; Flynn, W. F.; Huangfu, D. F.; Luo, R.; Robson, P.; Zhou, T.; Yeung, K. Y.
The STAR aligner plays a key role in complex transcriptomics pipelines consisting of multiple analytical tools. We present STAR Suite, a drop-in replacement for STAR that internalizes entire pipelines for bulk RNA-seq, scRNA-seq, Perturb-seq, 10x Flex, and SLAM-seq. Deployed by the NIH MorPhiC consortium, STAR Suite provides an open-source alternative to proprietary Cell Ranger pipelines, achieving gene-level Pearson correlations of 0.99-1.0 and 3.8- to 5.7-fold faster speeds for Perturb-seq and Flex analysis through improved methodologies. Integrating multi-module workflows into a single executable makes STAR Suite ready-to-use for both human researchers and the AI agents increasingly used in analytical workflows. STAR Suite was developed using AI agents, enabling a single developer to add 97,000 lines of code to the 28,000-line codebase in four months - illustrating a modern paradigm for large-scale integration of complex open-source codebases by individual research groups. Utilities are included to facilitate future community contributions using AI assistants.
bioinformatics 2026-06-03 v2
Assessing and Optimizing Low-Frequency Somatic Mutation Detection: A Multi-Platform High-Throughput Sequencing Perspective
Feng, B.; Lin, Y.; Liu, L.; Lin, Q.; Lin, Y.; Liu, Y.; Li, J.; Lei, C.; Chen, C.; Yang, M.; Peng, X.; Zhou, Z.; Yan, Q.; Sun, L.; Li, Q.
The availability of multiple commercial short-read sequencing platforms necessitates systematic cross-platform performance comparisons, particularly for challenging applications such as low-frequency somatic mutation detection. Here, a large-scale targeted sequencing dataset from five Genome in a Bottle (GIAB) human genomic DNA reference standards, HG001 to HG005, alongside Twist Biosciences cfDNA reference standards featuring 1% variant allele frequency (VAF), was generated on six platforms (NovaSeq 6000, NovaSeq X, FASTASeq 300, GenoLab M, SURFSeq 5000, and MGISEQ-T7). To build a realistic benchmark while keeping authentic sequencing backgrounds, we developed PosMix, a simulation tool that generates position-specific VAFs. To overcome the limitations of conventional variant callers (high recall with poor precision for VarScan2, higher precision with lower recall for Strelka2/Mutect2), we developed SomaticXGB, a machine learning-based caller. In this study, SURFSeq 5000 consistently exhibited the lowest error rates and achieved superior accuracy for VAFs as low as 0.5%, outperforming all other sequencing platforms. In addition, SomaticXGB attained F1 scores of approximately 0.92 on simulated datasets with VAFs ranging from 0.5% to 1.5% and 0.89 on Twist 1% standards, substantially outperforming conventional methods. This work delivers a rich multi-platform data resource, offering a standardized pipeline for performance benchmarking and a machine learning-based strategy for optimized somatic mutation detection.
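The precision/recall trade-off and F1 scores quoted in this abstract follow the usual definitions, which can be sketched as below. The function name and the example counts are hypothetical and are not taken from the study.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from true-positive, false-positive,
    and false-negative variant-call counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical caller with high recall but poor precision.
p, r, f1 = precision_recall_f1(tp=95, fp=60, fn=5)
```

Because F1 is a harmonic mean, a caller that is strong on only one of the two measures scores well below one that balances them.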
bioinformatics2026-06-03v2The machine-learning classifier ALLCatchR2 identifies 20 T-ALL subtypes across cohorts and age groups
Beder, T.; Wolgast, N.; Walter, W.; Bendig, S.; Hartmann, A. M.; Barz, M. J.; Zaliova, M.; Reitzel, E.; Baden, D.; Schwartz, S. M.; Gökbuget, N.; Kester, L.; Trka, J.; Haferlach, C.; Brüggemann, M.; Baldus, C. D.; Neumann, M.; Bastian, L.Abstract
T-cell acute lymphoblastic leukemia (T-ALL) comprises molecularly diverse subtypes, but robust cross-cohort validations and operational gene-expression definitions are lacking. To establish a gene-expression-anchored framework for T-ALL subtyping, we aggregated 2,314 transcriptomes (15 cohorts, age: 0.8 to 90.8 years). An extended unsupervised approach defined 17 main clusters and 3 subclusters in samples with high blast fractions. Supervised analyses added an overarching immature T-ALL (ETP-like) definition and resolved the LMO2 γδ-like subtype. All clusters contained samples from at least two cohorts. Characteristic genomic driver enrichments were consistent across cohorts, while gene expression clusters did not correspond exclusively to single driver events but also reflected developmental origins. A machine learning classifier based on ALLCatchR, our B-ALL classifier, identified these 20 transcriptomic subtypes and the immature T-ALL (ETP-like) signature with 0.995-1.0 accuracy in a validation set (n=203). Testing the classifier on a second hold-out data set (n=265 samples) showed that 92.7% of predictions matched the corresponding driver alterations. Across all samples, 83.2% of cases received high-confidence predictions, 7.3% candidate predictions, and 9.5% remained unclassified, largely because of low blast fractions. We identified a novel gene expression cluster markedly enriched (P<0.001) for clonal hematopoiesis mutations (IDH2 R140Q, DNMT3A) and stem-/progenitor cell-like gene expression. This novel "clonal hematopoiesis-related" T-ALL subtype was observed in six cohorts, representing 8.9% of adults and 39.5% of patients aged >50 years. We advanced ALLCatchR as a free R package that now enables B-/T-lineage separation, gene-expression subtyping, blast estimation, and developmental annotation to harmonize T-ALL classification across studies and clinical contexts.
bioinformatics2026-06-03v2Reachability-Preserving Minimum Edge Cut Problem and Applications in Biology
Xie, J.; Duan, Q.Abstract
Biological pathway analysis often requires identifying interventions that block reachability to an undesirable state, such as a disease-associated module, toxic byproduct, or adverse phenotype, while preserving reachability among essential biological functions. Motivated by this setting, we study the Reachability Preserving Minimum Edge Cut (RPMEC) problem: given protected terminals \(s_1\) and \(s_2\) and a target terminal \(t\), the goal is to remove a minimum-cost set of edges that separates \(s_1\) and \(s_2\) from \(t\) while keeping \(s_1\) and \(s_2\) connected. This formulation naturally models pathway-level intervention design, where one seeks to disrupt harmful signaling, metabolic, or interaction routes without breaking required functional connectivity. We revisit the three-terminal undirected edge-cut case and analyze a Dijkstra-style dynamic programming algorithm that is exact on planar graphs but fails on general graphs. We characterize the structural condition required for exactness, namely frontier-realizability of optimal source-side regions, and identify biological graph representations where this condition is likely to hold after appropriate preprocessing, including curated planar pathway maps, Reactome-style hierarchy trees, SCC-contracted feedback modules, metabolic building-block DAGs with dominator structure, and functional-module quotients of protein interaction networks. We further discuss directed variants, approximation strategies, and exact alternatives based on ASP, MILP, bounded-treewidth dynamic programming, and important separators. The results provide a graph-theoretic foundation for deciding when fast greedy computation is reliable for biological pathway intervention problems and when more expressive exact optimization methods are needed.
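To make the RPMEC definition above concrete, the following is a minimal brute-force sketch in Python for the undirected three-terminal case. It is not the Dijkstra-style dynamic programming algorithm the abstract analyzes; it simply enumerates edge subsets and is only practical for toy graphs. The function names, the graph, and the edge costs are all hypothetical illustrations.

```python
from itertools import combinations


def connected(nodes, edges, a, b):
    """Return True if a and b are joined by a path in the undirected graph."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, stack = {a}, [a]
    while stack:
        u = stack.pop()
        if u == b:
            return True
        for w in adj[u] - seen:
            seen.add(w)
            stack.append(w)
    return False


def rpmec_bruteforce(nodes, cost, s1, s2, t):
    """Exhaustive search over edge subsets (exponential; toy graphs only).

    cost maps each undirected edge (u, v) to a positive removal cost.
    Returns (total_cost, removed_edges), or None if no feasible cut exists.
    """
    edges = list(cost)
    best = None
    for k in range(len(edges) + 1):
        for removed in combinations(edges, k):
            removed = set(removed)
            kept = [e for e in edges if e not in removed]
            # Both protected terminals must be separated from the target t.
            if connected(nodes, kept, s1, t) or connected(nodes, kept, s2, t):
                continue
            # Reachability between the protected terminals must be preserved.
            if not connected(nodes, kept, s1, s2):
                continue
            total = sum(cost[e] for e in removed)
            if best is None or total < best[0]:
                best = (total, removed)
    return best


# Hypothetical toy pathway: both protected nodes reach t through hub "a".
nodes = ["s1", "s2", "a", "t"]
cost_with_link = {("s1", "s2"): 1, ("s1", "a"): 1, ("s2", "a"): 1, ("a", "t"): 3}
cost_without_link = {("s1", "a"): 1, ("s2", "a"): 1, ("a", "t"): 3}

result_with_link = rpmec_bruteforce(nodes, cost_with_link, "s1", "s2", "t")
result_without_link = rpmec_bruteforce(nodes, cost_without_link, "s1", "s2", "t")
```

In the first toy graph the two cheap edges into the hub can be cut because a direct s1-s2 link keeps the protected terminals connected. In the second graph that link is absent, so the reachability constraint forces the more expensive cut next to the target.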
bioinformatics2026-06-03v1HyperNiche: Learning Heterophilic Cellular Niches with Hypergraph Neural Networks
Mahmud, M. I.; Banerjee, T.Abstract
We propose HyperNiche, a hypergraph-based framework for modeling higher-order, heterogeneous cellular niches from spatial transcriptomics data. Unlike conventional graph-based methods that rely on pairwise similarity and tend to produce homogeneous clusters, HyperNiche learns anchor-centered hyperedges through a compatibility-driven mechanism that captures both homophilic and heterophilic relationships among cells. By decoupling node roles into anchor and member representations and integrating spatial geometry into hyperedge construction, the model enables the discovery of multicellular niches that span diverse cell types. We evaluate HyperNiche on high-plex Xenium spatial transcriptomics datasets from breast and lung cancer tissue microarrays, demonstrating improvements over state-of-the-art graph-based baselines in clustering performance (ARI, NMI) and biological interpretability. Further analysis shows that HyperNiche produces hyperedges with significantly higher intra-edge feature diversity, indicating an enhanced ability to capture heterogeneous cellular niches compared to similarity-based models. These results highlight the importance of higher-order relational modeling for understanding complex spatial tissue organization and tumor microenvironments.
bioinformatics2026-06-03v1SciCore-Omics: a tri-modal foundation model unifying histology, spatial transcriptomics and language for spatial biology
Xiao, X.; Li, Y.; Zeng, Z.; Yan, Y.; Liu, Z.; Liu, Z.; Xiang, Y.; Ye, Z.; Ying, J.; Li, Y.; Xie, L.; He, F.Abstract
Histomorphology and spatial transcriptomics capture complementary aspects of tissue biology, but their relationships remain difficult to extract, align, and interpret at scale. Existing foundation models typically connect histology, omics, or language only pairwise, which limits their capacity to jointly infer molecular states, decode spatial tissue organization, and generate biologically grounded explanations. Here, we present SciCore-Omics, the first tri-modal foundation model linking histology images, spatial transcriptomics, and biological language. We constructed a spatially paired image-gene-text dataset comprising 151,182 spots across multiple tissues and trained SciCore-Omics on this dataset in three progressive stages. Across gene expression prediction and spatial domain recognition, SciCore-Omics achieved 23.6-80.9% relative gains in task-specific metrics over the strongest external baselines. It further showed robust zero-shot generalization in histopathology classification, outperforming GPT-5 by 6.16 percentage points in mean accuracy across four benchmarks. Expert evaluation in 10 breast cancer cases confirmed its H&E-only case-level molecular reasoning capability. Together, our results demonstrate that a tri-modal framework can effectively bridge histomorphology and molecular state, providing a more general and interpretable foundation model for computational pathology and omics analysis.
bioinformatics2026-06-03v1GalaxyVS: Exploring 100-Billion Compounds in Seconds
Hong, X.; Li, P.; Zhu, W.; Wu, C.; Guo, H.; Tan, H.; Wu, Q.; Wu, K.; Chen, L.; Jia, Y.; Gao, B.; Jian, X.; Lai, Z.; Lu, Y.; Meng, X.; Lan, Y.Abstract
We present GalaxyVS, a hardware-software co-designed virtual screening framework built to explore a commercially accessible chemical space of 100 billion compounds in seconds, deployed at the National Supercomputing Center in Tianjin. Built upon the dense vector retrieval paradigm of DrugCLIP, GalaxyVS bypasses the structural dependencies and computational overhead of classical docking to enable rapid screening against experimentally determined as well as geometrically feasible pockets on AlphaFold-predicted structures. To scale this paradigm to the 100-billion level, the system must overcome the significant computational burden of offline representation encoding, critical memory and I/O bottlenecks during online retrieval, and the risks of diversity collapse and precision loss within final screening results. Utilizing the heterogeneous supercomputing infrastructure, GalaxyVS accelerates the offline encoding through deep operator adaptations and resolves online retrieval bottlenecks via disk-native vector indexing coupled with in-memory staging to ensure both broad accessibility and high throughput. Concurrently, a two-stage refinement protocol effectively mitigates diversity collapse and ensures high-fidelity affinity ranking. Consequently, GalaxyVS achieves a daily scoring throughput of 1.5 × 10^16 target-ligand pairs, representing a six-orders-of-magnitude leap over previous supercomputing records. Driven by this throughput, we screened nearly 100,000 protein structures across six species against the 100-billion compound library in just 16 hours. The resulting comprehensive cross-species interaction landscape, GalaxyDB, will be openly released at https://galaxyvs.drugclip.com.
bioinformatics2026-06-03v1Deep Proteoform Sequencing with Top-Down Direct Mass Technology
Durbin, K. R.; Su, T.; Fellers, R. T.; McGee, J. P.; Fisher, N. P.; Hollas, M. A. R.; Kafader, J. O.; Kelleher, N. L.Abstract
Individual Ion Mass Spectrometry (I2MS) using Direct Mass Technology mode on an Orbitrap mass spectrometer (DMTm) increases sensitivity, resolution, and mass range for protein analysis. Here, we present an end-to-end workflow for deep proteoform sequencing using top-down mass spectrometry with DMTm. By assigning the charge of individual fragment ions and converting spectra from the m/z to the mass domain, DMTm resolves overlapping isotopic distributions that have limited conventional top-down mass spectrometry. Across different fragmentation modes on Orbitrap mass spectrometers, top-down DMTm significantly outperformed conventional top-down mass spectrometry methods. For a glycosylated 50.8 kDa antibody heavy chain, sequence coverage was greatly increased, from 27.5% to 83.3%, in 10 minutes of acquisition using a single fragmentation mode. Coverage of the middle 350 residues improved from 0% to >95%, demonstrating near-complete coverage of the difficult-to-characterize internal region of a large protein. The fragmentation patterns of DMTm were found to be complementary to conventional top-down, with higher internal coverage for DMTm and higher terminal coverage for conventional. Accordingly, aggregation of the data from the two modes further increased heavy chain sequence coverage to 90.2%. A new software platform, Proteoform Studio, provided optimized ion processing for improved sequence coverage and enabled real-time experimental monitoring as individual ions were accumulated. The platform automatically integrates conventional and DMTm data to provide the most comprehensive sequence coverage possible. Together, these advances enable substantially deeper proteoform sequencing and establish a straightforward, complete top-down DMTm workflow to confidently define proteoforms in biological systems and biotherapeutic development.
bioinformatics2026-06-03v1Improving the Accuracy of Forensic Age Estimation Through Bias Reduction
Flores, M.; Pellegrini, M.Abstract
Chronological age estimation can provide supporting information in forensic casework when traditional identification methods are limited. DNA methylation, a stable epigenetic mark, has emerged as a promising tool for predicting chronological age from trace samples. However, many existing age estimation models rely on linear regression approaches, which often yield biased prediction errors across the age distribution (i.e., model residuals show a significant age dependence). In this study, we compared three approaches for age estimation modeling: multivariable linear regression, random forest regression, and maximum likelihood estimation. While the first two approaches are well established, for the third we constructed and validated a DNA methylation-based LOESS regression maximum likelihood model for age estimation utilizing forensically relevant CpG markers. In all cases, model performance was evaluated through Leave-One-Out Cross-Validation (LOOCV). We utilized three independent, publicly accessible methylation datasets collected using droplet digital PCR (ddPCR) to identify the most effective method in terms of accuracy and bias in age estimation. Notably, the maximum likelihood approach yielded less bias in the age-associated residuals than either multivariable linear regression or random forest regression. These findings highlight the utility of non-linear modeling techniques in reducing the biases of epigenetic age estimation for forensic applications.
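The evaluation idea described above (LOOCV followed by a check for age dependence in the residuals) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a single hypothetical CpG marker, synthetic values, and ordinary least squares as the model. It is not the authors' LOESS maximum likelihood model or their data.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def loocv_residuals(meth, age):
    """Leave-one-out residuals (predicted age minus true age)."""
    residuals = []
    for i in range(len(age)):
        train_x = meth[:i] + meth[i + 1:]
        train_y = age[:i] + age[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        residuals.append(intercept + slope * meth[i] - age[i])
    return residuals


def residual_age_slope(residuals, age):
    """Slope of residuals regressed on true age; near 0 means little age bias."""
    return fit_line(age, residuals)[0]


# Synthetic example (assumption): methylation saturates with age, so a
# straight-line model cannot follow it across the whole age range.
ages = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0]
saturating_meth = [math.sqrt(a) for a in ages]
linear_meth = [a / 100.0 for a in ages]

bias_nonlinear = residual_age_slope(loocv_residuals(saturating_meth, ages), ages)
bias_linear = residual_age_slope(loocv_residuals(linear_meth, ages), ages)
```

With the saturating marker the residual slope is negative, meaning younger samples tend to be overestimated and older samples underestimated. With the perfectly linear marker the residuals carry no age trend.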
bioinformatics2026-06-03v1Topology-aware reconstruction of cellular state landscapes from microscopy using self-supervised learning
Messori, E.; Taha, D. M.; Fournier, L.; Foix Romero, A.; Uhlmann, V.; Frossard, P.; Vincent-Cuaz, C.; Patani, R.; Luisier, R.Abstract
Morphology and spatial organisation provide complementary readouts of cellular state. However, reconstructing continuous cellular state landscapes from imaging data remains challenging, particularly in dense biological cultures. Here we present SI-SimCLR, a spatially informed self-supervised learning framework that learns biologically informative representations directly from fluorescence microscopy images without requiring segmentation or manual annotation. Combined with a graph-based partial optimal transport framework, SI-SimCLR enables reconstruction of cellular phenotypic landscapes from static imaging data, revealing how phenotypic substates are organised and connected. To establish and validate this framework, we generated a multimodal dataset of human iPSC-derived astrocytes using high-content imaging and matched bulk transcriptomics. SI-SimCLR resolved distinct interconnected astrocyte substates associated with disease and inflammatory states. ALS astrocytes occupied constrained regions of the morphological landscape. Strikingly, morphology and transcriptomics captured distinct and complementary aspects of astrocyte state variation. Together, our framework establishes a scalable and annotation-free strategy for reconstructing cellular phenotypic landscapes from microscopy data, enabling analysis of cellular heterogeneity, landscape connectivity and phenotypic responses across biological systems.
bioinformatics2026-06-03v1AdventML: Advanced Enzyme Temperature Prediction with Transformer-Based Embeddings and Resampling Strategies
Francois, J.; De Moor, B.; van Noort, V.Abstract
Accurate prediction of enzymes' optimal catalytic temperature (Topt) is crucial in biotechnology, as enzymes with extreme Topt values are highly desirable for reactions at extreme temperatures and for their general stability. However, experimental determination of Topt is costly, labor-intensive, and time-consuming. Meanwhile, existing computational methods suffer from small and imbalanced datasets, suboptimal predictions at extreme temperatures, and insufficient validation. In this study, we address these challenges by expanding the Topt dataset and validating on an independent test set based on sequence similarity. We further tackle these limitations by comparing multiple resampling techniques to improve predictions at extremes and by considering diverse protein representations and multiple machine learning architectures. Overall, the best performing models reached an R² of approximately 0.64 with an MAE of approximately 7-8 °C, while extreme resampling improved tail performance, reducing tail MAE by up to approximately 1.8 °C. Notably, our models show improved performance over state-of-the-art prediction models. We also demonstrate that accurate prediction of Topt is achievable even in the absence of organism growth temperature (OGT). Our Topt prediction models are made freely available as AdventML on GitHub.
bioinformatics2026-06-03v1ORIGAMI: Orientation-Aware Graph Neural Network for Assessing Multimeric Interfaces of Protein Complex Structures
Wang, X.; Bhattacharya, D.Abstract
Deep learning-based protein structure prediction methods have led to a paradigm shift in computational structural biology, yet reliably assessing the quality of computationally predicted multimeric structures remains challenging. Recent methods have demonstrated benefits of employing graph neural networks for assessing multimeric interfaces of protein complexes, but they ignore geometric orientational features naturally occurring in 3-dimensional protein conformational space and act only on scalar weights. We present ORIGAMI, an orientation-aware graph neural network for assessing multimeric interfaces of protein complex structures that leverages both scalar and 3D vector node representations to perform symmetry-aware geometric operations while maintaining SO(3)-equivariance. It captures fine-grained orientational relationships between residues across protein-protein interfaces to estimate the interface Local Distance Difference Test (iLDDT) score. Tested on targets from multiple rounds of Critical Assessment of Structure Prediction (CASP) challenges, ORIGAMI achieves superior performance across multiple interface quality assessment benchmarks, with particularly strong gains in the expanded CASP16 interface-level evaluation and in controlled comparisons against both non-equivariant and equivariant graph neural network baselines. It also demonstrates robust cross-metric generalization by reproducing superposition-based DockQ scores with high fidelity, despite being trained only to estimate the superposition-free iLDDT score. ORIGAMI is freely available at https://github.com/Bhattacharya-Lab/ORIGAMI.
bioinformatics2026-06-03v1CodeCytos: AI-assisted spatial molecular imaging analysis via code-augmented agent action space
Vo, H. Q.; Vo, H. Q.; Ly, S. T.; Wan, Z.; Nguyen, A.-V.; Zhao, H.; Sheng, J.; Wong, S. T. C.; Nguyen, H. V.Abstract
Conventional tissue image analysis software provides foundational capabilities for cellular analysis, including segmentation, morphological feature extraction, and spatial organization analysis. However, these tools often require manual intervention and lack seamless integration with code-driven automation, limiting efficiency and scalability for complex spatial tissue studies. They also offer limited flexibility for custom analyses because they support only a fixed set of predefined spatial cellular features. To address these limitations, we propose CodeCytos, a coding-based reasoning agent framework that enables dynamic, programmable interaction with spatial molecular imaging data and streamlines the exploration of custom spatial cellular features across diverse research needs. We demonstrate its utility through case studies on four expert-curated datasets spanning distinct tissue types (frontal cortex, non-small-cell lung cancer, pancreas, and tonsil). We evaluate it under a realistic minimal prompt setting in which bioscientists pose simple questions without task-specific instructions or prior contextual knowledge, benchmarking multiple large language model backbones with strong coding capabilities. We further show that incorporating domain-agnostic few-shot in-context coding-reasoning examples, randomly sampled from outside the spatial analysis domain, substantially improves performance without requiring costly expert-crafted in-domain demonstrations. Overall, CodeCytos outperforms baseline approaches, highlighting the potential of code-driven reasoning agents to support custom feature exploration in spatial molecular imaging and accelerate biomarker discovery.
bioinformatics2026-06-03v1sstar2: A Python Package for S*-based Archaic Introgression Detection with Machine Learning
Koca, A.; Stöckl, A.; Chen, S.; Kuhlwilm, M.; Huang, X.Abstract
Detecting introgressed genomic fragments from unsampled or extinct source populations remains challenging. The S* statistic is widely used for this purpose, but the original sstar implementation relies on generalized additive models to smooth quantile-specific values precomputed from fixed count bins, requiring simulations with fixed numbers of segregating sites. Here, we present sstar2, a Python update that replaces this procedure with quantile regression to directly estimate S* thresholds at specified null quantiles from simulated genomic windows. We benchmarked sstar2 against the original sstar, linear quantile regression, and random forest quantile regression across three demographic models with both phased and unphased simulated data. sstar2 showed the best overall performance among the evaluated methods, with the most pronounced improvement under a challenging demographic model of ghost introgression in bonobos. These results show that sstar2 improves S* threshold calibration while making S*-based introgression analyses more flexible and compatible with modern simulation workflows.
bioinformatics2026-06-03v1CpG Atlas: A centralized multi-layer database and AI interface for DNA methylation research
Armstrong, J. F.; Wahi, S.; Borrus, D.; Sehgal, R.; Rizvi, S.; Zhang, S.; Jacques, M.; Eynon, N.; van Dijk, D.; Higgins-Chen, A.Abstract
DNA methylation research has vastly expanded over the past decade, producing a wealth of epigenome-wide association studies, biomarker algorithms such as epigenetic clocks, technical performance analyses, and functional annotations for CpG sites. However, these resources remain fragmented across dozens of databases and supplementary files within manuscripts, forcing researchers to spend time and effort on data cleaning and integration prior to meaningful analyses. No single resource currently unifies this information into a centralized, easy-to-query framework. Here, we present CpG Atlas, a curated relational database that integrates 18 distinct annotation layers encompassing over 1.2 million CpG sites across all four generations of Illumina methylation arrays (HM450K, EPIC v1, EPIC v2, and MSA). Built on a snowflake schema with a canonical probe identifier hub implemented in SQL, CpG Atlas consolidates over 800,000 CpG-trait associations, results from Mendelian randomization analyses, CpG membership across 81 epigenetic clocks, array manifest information, and probe reliability data. It further includes specialized layers such as solo-WCGW, CoRSIVs, PRC2 binding, transposon and retroelement annotations, tissue-specific differentially methylated positions across 17 tissues, and hallmarks of aging and cancer. To maximize utility and ease of use, the database is paired with an interactive web tool and a natural language-to-SQL query interface, enabling users to quickly perform complex multi-dimensional queries. Detailed documentation about every data source and table is also provided, facilitating the identification and interpretation of relevant studies. We demonstrate the utility of CpG Atlas through two case studies: a systematic enrichment analysis revealing distinct functional signatures across 16 epigenetic clocks, and an iterative biomarker discovery workflow for IBD that leverages cross-layer integration. 
Because it is readily scalable simply by adding or updating tables in the database, CpG Atlas provides a continuously evolving and extensible infrastructure for the epigenetics community that supports collaborative research, interpretable biomarker development, and integrative analyses across the growing landscape of epigenetic data.
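As a rough illustration of the hub-and-layer layout described above, the sketch below builds a tiny in-memory SQLite database with a probe hub table and two annotation layers keyed on the probe identifier, then runs one cross-layer query. The table names, column names, probe identifiers, and values are all hypothetical placeholders and are not taken from the actual CpG Atlas schema.

```python
import sqlite3


def build_demo_db():
    """Create a toy database: one probe hub plus two annotation layers."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE probe (
            probe_id TEXT PRIMARY KEY,   -- canonical identifier (hub)
            chrom    TEXT,
            pos      INTEGER
        );
        CREATE TABLE clock_membership (
            probe_id   TEXT REFERENCES probe(probe_id),
            clock_name TEXT
        );
        CREATE TABLE trait_association (
            probe_id TEXT REFERENCES probe(probe_id),
            trait    TEXT,
            p_value  REAL
        );
        """
    )
    # Placeholder rows; none of these identifiers or values are real data.
    conn.executemany(
        "INSERT INTO probe VALUES (?, ?, ?)",
        [("demo_probe_1", "chr1", 1000),
         ("demo_probe_2", "chr2", 2000),
         ("demo_probe_3", "chr3", 3000)],
    )
    conn.executemany(
        "INSERT INTO clock_membership VALUES (?, ?)",
        [("demo_probe_1", "demo_clock_A"),
         ("demo_probe_2", "demo_clock_A"),
         ("demo_probe_3", "demo_clock_B")],
    )
    conn.executemany(
        "INSERT INTO trait_association VALUES (?, ?, ?)",
        [("demo_probe_1", "demo_trait", 1e-8),
         ("demo_probe_2", "demo_trait", 0.2),
         ("demo_probe_3", "demo_trait", 1e-9)],
    )
    return conn


def probes_in_clock_with_trait(conn, clock_name, trait, max_p):
    """Cross-layer query: clock members associated with a trait at p <= max_p."""
    return conn.execute(
        """
        SELECT p.probe_id, t.p_value
        FROM probe AS p
        JOIN clock_membership AS c ON c.probe_id = p.probe_id
        JOIN trait_association AS t ON t.probe_id = p.probe_id
        WHERE c.clock_name = ? AND t.trait = ? AND t.p_value <= ?
        ORDER BY t.p_value
        """,
        (clock_name, trait, max_p),
    ).fetchall()


demo_conn = build_demo_db()
hits = probes_in_clock_with_trait(demo_conn, "demo_clock_A", "demo_trait", 1e-5)
```

Because every layer joins through the same hub key, adding a new annotation layer in this layout only requires one more table that references `probe_id`, which is the kind of extensibility the abstract describes.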
bioinformatics2026-06-03v1Mapping the structural coverage of Arabidopsis thaliana plant developmental proteins: Insights from Experimental and AlphaFold Approaches
Rode, S. S.; Sudarsanam, K.; Bhalla, H.; Srivastava, A.; Sankaranarayanan, S.Abstract
Background: Plant development is a multifaceted process governed by intricate protein regulatory networks. High-throughput sequencing methods have vastly expanded plant transcriptomic and proteomic datasets, yet there is a large discrepancy between structural information for plant developmental proteins and the UniProt sequence entries. Advances in X-ray crystallography, NMR spectroscopy, and cryo-EM have enabled the determination of protein complex structures and their dynamics. AI-driven tools like AlphaFold have revolutionized analysis of protein structural intricacies. However, available three-dimensional structural models predominantly prioritize the proteomes of humans and other mammals over those of plants. Assessing structural coverage of plant developmental proteins is thus essential to identify research gaps, guide structure-function studies, and advance agriculture. Results: Here, we focus on mapping the structural coverage of developmental proteins in Arabidopsis thaliana. We observed a substantial disparity in the Protein Data Bank (PDB) representation of Arabidopsis thaliana proteins compared to those of Homo sapiens. Our analysis identified 16,389 reviewed UniProt entries, of which only 1,038 have experimentally determined structures. Functional mapping using PlantGSEA revealed 3,485 proteins associated with plant developmental processes, of which only 337 (9.67%) have experimentally determined structures. In contrast, analysis of the AlphaFold database showed that 69.85% of the 39,278 Arabidopsis thaliana UniProt protein entries have predicted structures. Notably, all 3,485 plant developmental proteins (100%) from Arabidopsis thaliana are covered by AlphaFold models. The substantially higher structural coverage provided by AlphaFold for Arabidopsis thaliana, relative to Homo sapiens, highlights the strength of computational approaches in addressing the challenges of structural studies of difficult-to-crystallize proteins. Furthermore, 79.15% of reviewed A. thaliana protein models exhibit high confidence (pLDDT > 70), indicating reliable structural predictions. Although the experimental structural coverage of Arabidopsis thaliana developmental proteins remains limited, AlphaFold has markedly expanded the accessible structural landscape. Conclusion: This study investigated the structural coverage of Arabidopsis thaliana plant developmental proteins, underscoring the critical need for structural studies using both experimental and AlphaFold approaches. It provides research directions for bridging the knowledge gap in understanding molecular mechanisms of plant development.
bioinformatics2026-06-03v1Convergent Evolution in Tumor Genomes Targets Functional Domains
Chen, H.; Liu, L.Abstract
Tumor evolution is shaped by selective pressures that repeatedly favor similar functional outcomes across genetically distinct cancers. While convergent evolution in cancer has been studied at the gene level, this work investigates selection on smaller functional units, namely protein domains. Using >9,500 primary tumor exomes from The Cancer Genome Atlas, we quantified selection strengths acting on missense and truncating mutations aggregated by protein domain. This analysis identified 818 domains under significant positive selection across tumor types. Notably, approximately half of these domains belonged to genes that would be difficult to implicate using conventional gene-centric approaches due to low mutational recurrence or mutations outside functionally critical regions. We classified positively selected domains by evolutionary antiquity. The most ancient domains trace back to pre-eukaryotes and are involved in core cellular processes (e.g., DNA mismatch repair and metabolism) and tend to accumulate the highest numbers of mutations. The majority of positively selected domains originated in early eukaryotes and are enriched for regulatory control and cellular organization, whereas metazoan-specific domains are primarily associated with signaling and cell-cell communication. These results suggest that cancer preferentially exploits deeply conserved biology, with regulatory complexity driving tumor adaptation, while recent evolutionary innovations are relatively fragile and dispensable. Collectively, these findings establish a domain-centered framework for understanding disease mechanisms and developing therapeutic strategies. By focusing on shared functional domains, this framework enables the identification of functionally convergent therapeutic targets and provides a new perspective for interpreting drug resistance, tumor recurrence, and relapse.
bioinformatics2026-06-03v1Applying Spatial Statistics to Spatial Transcriptomics Reveals Local Association Between M2-like Macrophages and Fibrosis in Diabetic Kidney Disease
Terakawa, K.; Kawaguchi, H.; Nangaku, M.; Mimura, I.Abstract
Renal fibrosis is the common final pathway of chronic kidney disease (CKD), driven in part by myofibroblast-mediated extracellular matrix deposition. M2 macrophages have been implicated as a source of myofibroblasts through macrophage-to-myofibroblast transition (MMT), yet whether M2 macrophages are pro- or anti-fibrotic remains controversial, and the spatial context in which MAC-M2-fibrosis coupling occurs is unknown. Here, we applied geographically weighted regression (GWR), a spatial statistical method, to Visium spatial transcriptomics data from diabetic kidney disease (DKD) to characterize spatially resolved high-coupling spots where MAC-M2-fibrosis coupling is significantly positive. In a small DKD cohort (n=6), GWR identified high-coupling spots enriched for B cell/tertiary lymphoid structure (TLS)-like immune signatures, supporting the biological relevance of the analytical framework. To gain statistical power for differentially expressed gene (DEG) analysis, we then applied the same pipeline to the larger Kidney Precision Medicine Project (KPMP) DKD cohort (n=30), in which high-coupling spots showed upregulation of IgE-related immune genes (IGHE, FCER1A) together with the mast cell tryptase TPSB2. These findings suggest that IgE-related immune responses may be present within DKD fibrotic microenvironments characterized by local MAC-M2-fibrosis coupling. As a disease comparison, we further applied the pipeline to a KPMP hypertensive kidney disease (HKD) cohort (n=27), where high-coupling spot signatures were distinct from DKD and did not show enrichment of IgE-related genes. Together, this study provides the first application of GWR to kidney spatial transcriptomics and suggests that IgE-related immune responses may be a feature of DKD fibrotic microenvironments in which M2 macrophages are locally associated with fibrosis.
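The core idea of geographically weighted regression can be sketched as a separate weighted least-squares fit at each spot, with nearby spots weighted more heavily. The Python sketch below assumes a single predictor, a Gaussian distance kernel, and a fixed bandwidth; the variable names and toy values are hypothetical, and the significance testing used to call high-coupling spots in the study is omitted.

```python
import math


def gwr_local_fit(coords, x, y, target, bandwidth):
    """Weighted least squares for y = intercept + slope * x around `target`.

    coords: (row, col) position of each spot; x, y: per-spot values.
    Weights follow a Gaussian kernel of the distance to `target`.
    """
    weights = [
        math.exp(-((cx - target[0]) ** 2 + (cy - target[1]) ** 2)
                 / (2 * bandwidth ** 2))
        for cx, cy in coords
    ]
    total = sum(weights)
    mean_x = sum(w * xi for w, xi in zip(weights, x)) / total
    mean_y = sum(w * yi for w, yi in zip(weights, y)) / total
    sxx = sum(w * (xi - mean_x) ** 2 for w, xi in zip(weights, x))
    sxy = sum(w * (xi - mean_x) * (yi - mean_y)
              for w, xi, yi in zip(weights, x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Toy tissue (assumption): two distant regions with opposite local coupling
# between a macrophage score and a fibrosis score.
coords = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 0), (10, 1), (11, 0), (11, 1)]
macrophage_score = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0]
fibrosis_score = [1.0, 2.0, 3.0, 4.0, 4.0, 3.0, 2.0, 1.0]

slope_left, _ = gwr_local_fit(coords, macrophage_score, fibrosis_score, (0, 0), 1.0)
slope_right, _ = gwr_local_fit(coords, macrophage_score, fibrosis_score, (10, 0), 1.0)
```

A single global regression over these eight spots would report essentially no association, whereas the local fits recover a positive slope in one region and a negative slope in the other, which is why a spatially local coefficient is informative here.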
bioinformatics2026-06-03v1Information Geometry of Intracellular Compartment Coupling Reveals Transcriptomic State Transitions in Single Cells
Sung, J.-Y.; Cheong, J.-H.Abstract
Single-cell transcriptomic analyses typically characterize cellular states using gene-expression variability, dimensionality reduction, and trajectory inference. However, existing approaches provide limited insight into how transcriptomic information is organized across interacting intracellular compartments. Here we introduce Compartment Coupling Entropy (CCE), an information-geometric framework that quantifies the organization of transcriptomic coupling between spliced and unspliced RNA compartments. CCE constructs a cross-compartment coupling operator from compartment-resolved transcriptomic profiles and characterizes its singular-value spectrum using coupling entropy, effective coupling dimension, and coupling susceptibility. These metrics measure how transcriptomic information is distributed across coupling modes and provide a quantitative description of transcriptomic organization beyond conventional expression-based statistics. Applying CCE to pancreatic endocrine differentiation revealed substantial remodeling of coupling architecture along developmental trajectories. Coupling entropy and effective coupling dimension underwent transient collapse and re-expansion during lineage progression, while coupling susceptibility identified discrete intervals of rapid transcriptomic reorganization corresponding to candidate cell-state transition regimes. Across cell states, coupling entropy showed weak correspondence with classical mutual information, indicating that spectral coupling organization captures information not represented by conventional information-theoretic measures. An organization ratio and spectral excess information further quantified the divergence between classical and coupling-based descriptions of transcriptomic structure. Robustness analyses demonstrated stability of the framework under bootstrap resampling, gene subsampling, spectral truncation, and trajectory discretization. 
Application to an independent dentate gyrus developmental dataset revealed similar hierarchical coupling spectra and susceptibility-defined transition regimes, suggesting that transient reorganization of compartment-coupling architecture may represent a general feature of cellular state transitions. CCE provides a general methodology for quantifying the information geometry of intracellular transcriptomic organization and complements existing single-cell analytical approaches by revealing coupling architectures that are inaccessible to conventional expression-based analyses.
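The spectral summaries named in this abstract can be illustrated with a minimal sketch. The abstract gives no formulas, so the definitions below are assumptions: coupling entropy is taken as the Shannon entropy of the singular values normalized to sum to one, and effective coupling dimension as the exponential of that entropy. The function names are hypothetical, and the authors may normalize differently (for example, using squared singular values).

```python
import math


def coupling_entropy(singular_values):
    """Shannon entropy (in nats) of a singular-value spectrum.

    Assumption: the spectrum is normalized so the values sum to 1;
    zero singular values contribute nothing to the entropy.
    """
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    return -sum(p * math.log(p) for p in probs)


def effective_dimension(singular_values):
    """Assumed definition: exp(entropy), the effective number of coupling modes."""
    return math.exp(coupling_entropy(singular_values))


# A flat spectrum spreads coupling over all modes; a single dominant
# mode collapses the effective dimension toward 1.
flat = effective_dimension([1.0, 1.0, 1.0, 1.0])
collapsed = effective_dimension([5.0, 0.0, 0.0, 0.0])
```

Under these assumed definitions, a "transient collapse" of coupling entropy corresponds to the spectrum concentrating on a few modes, and "re-expansion" to it flattening again.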
bioinformatics2026-06-03v1ViTAMIn-O: Democratizing computer vision-based machine learning for stem cell research
Hamurcu, F.; Breunig, M.; Varga, A.; Bosch, B.; Lindenmayer, J.; Kanakapaddy, A. T.; Achberger, K.; Pashkovskaia, N.; Kleger, A.; Liebau, S.; Klingenstein, S.; Klingenstein, M.Abstract
Deep Learning (DL) holds exciting potential in automating the prediction of organoid differentiation results. Nevertheless, current models lack adaptability, openness, and robustness in performance. Additionally, broad deployment of predictive models in wet-lab settings requires machine learning expertise, which is often not readily available in biologically oriented laboratories. To offer an intuitive solution, we present ColabViTAMIn-O, a code-free platform, together with ViTAMIn-O. ViTAMIn-O is a fully open organoid-specific DL model trained and tested on a total of 34 organoid categories, incorporating annotated images across transmitted light microscopy (TLM) modalities at single-organoid resolution. It is adaptable to downstream prediction tasks of varying dataset sizes and outperforms established models even with linear probing. It performs reliably within a few-shot framework and is even extensible to human embryo TLM imaging data at the single-specimen level. By releasing our platform, centralized model hub, and datasets, we hope to encourage broader deployment of specialized DL models in stem-cell laboratories.
bioinformatics2026-06-03v1ROTS 2.0: A reproducibility-driven framework for robust statistical modeling across diverse high-throughput omics study designs
Suomi, T.; Kettunen, J.; Pusa, T.; Elo, L. L.Abstract
Reproducibility is fundamental to reliable scientific discoveries. The reproducibility-optimized test statistic (ROTS) is a robust framework designed to identify reproducible features (e.g. genes or proteins) in high-dimensional differential expression analyses such as transcriptomics and proteomics. This is achieved by optimizing the reproducibility of feature rankings under resampling. While originally implemented for univariate settings, ROTS now accommodates multi-group comparisons, survival analysis, linear models, and linear mixed-effects models, broadening its applicability to more complex and clinically relevant experimental designs. Using diverse simulations, benchmark datasets, and real-world case studies, we demonstrate the benefits of ROTS reproducibility optimization compared to the corresponding conventional test statistics. Additionally, we illustrate the utility of the reproducibility characteristics in assessing the overall reliability of the results. To facilitate widespread adoption, ROTS is provided as an open-source software package available through R/Bioconductor. Furthermore, to broaden the user base, we now also provide a Python interface available at pypi.org/project/PyROTS/.
bioinformatics2026-06-03v1Loss of tissue specificity and recurrent pan-cancer activation define a conserved oncogenic microRNA class
Poptsova, M.; Ismailov, A.; Belogurov, A.; Evpak, A.Abstract
MicroRNAs (miRNAs) act as crucial post-transcriptional regulators of large gene networks, and their aberrant expression drives key oncogenic processes such as epithelial-mesenchymal transition (EMT), angiogenesis, immune evasion, and metastasis. Oncogenic miRNAs that lose tissue specificity during malignant transformation represent promising therapeutic targets, as their restricted expression in healthy organs could minimize off-target effects. To identify these candidates, this study performed a comprehensive pan-cancer analysis integrating tissue-specificity profiles of healthy tissues from the GTEx project with tumor data from the TCGA, TARGET, CGCI, and CPTAC cohorts. By combining profiling with differential expression analysis between tumor and matched normal samples, cross-cohort integration revealed that malignant transformation is characterized by a widespread loss of tissue-specific miRNA expression. Among these altered patterns, a cluster of nine oncomiRs was identified: miR-105-5p, miR-1269a, miR-196a-5p, miR-9-5p, miR-96-5p, miR-210-3p, miR-301b-3p, miR-592, and miR-135b-5p. These specific miRNAs were significantly and recurrently upregulated across various solid tumors. Functional enrichment analysis of their experimentally validated targets demonstrated a clear convergence on shared oncogenic pathways, particularly those governing hypoxia response, PI3K/AKT signaling, EMT, angiogenesis, and immune modulation.
bioinformatics2026-06-03v1CellClick: an interactive platform for adjustable and accurate cell type annotation in single-cell and spatial omics data
Shi, L.; Dai, M.; Zhang, Y.-b.; Wu, S.; Wang, M.; Wang, X.-j.Abstract
Single-cell omics and spatial omics technologies are nowadays widely used in biological and medical research. In both single-cell and spatial omics data analysis, accurate cell type annotation is a key step for downstream analysis and scientific discoveries. However, high-quality cell annotation usually requires multiple rounds of manual analysis for result refinement, which poses great challenges to most researchers. Here, we present CellClick, an interactive platform for convenient and accurate cell type annotation in single-cell and spatial omics data. CellClick provides Data Preprocessing, Data Visualization, Cell Annotation, Annotation Validation, and Cell Reannotation modules, which facilitate automatic or user-guided cell selection and annotation. The feasibility of using CellClick to generate more accurate cell annotation results was exemplified by both scRNA-seq and spatial transcriptomics data.
bioinformatics2026-06-03v1Deciphering context-dependent epigenetic program by network-based prediction of clustered open regulatory elements from single-cell chromatin accessibility
Park, S.; Ma, S.; Lee, W.; Park, S. H.Abstract
Large cis-regulatory domains, spanning tens to hundreds of kilobases, are pivotal in orchestrating cell-state-specific transcriptional programs that define cellular identity. However, existing single-cell analytical frameworks lack the capacity to identify these higher-order structures, thereby obscuring the coordinated, domain-level epigenetic regulation essential for complex biological processes. To address this, we introduce enCORE, a computational framework that leverages enhancer-enhancer interaction networks to determine Clustered Open Regulatory Elements (COREs) solely from single-cell ATAC-sequencing data. Our approach faithfully recapitulates established hematopoietic hierarchies and resolves lineage-specific regulatory programs by recovering canonical master transcription factors, frequent chromatin interactions, and enrichment of fine-mapped immune-related disease-associated genome-wide association study (GWAS) variants. In colorectal cancer, enCORE captures tumor-associated H3K27ac landscapes and prioritizes USP7 as a potential therapeutic candidate, supported by in silico perturbation. Collectively, our framework provides a powerful and scalable platform for deciphering the complex epigenetic architectures underlying human development and disease.
bioinformatics2026-06-02v11Fold or flop: quality assessment of AlphaFold predictions on whole proteomes
Sarti, E.; Cazals, F.Abstract
MOTIVATION: Reliability of AlphaFold2 predictions is mainly assessed using the predicted Local Distance Difference Test (pLDDT). For model organisms, 30-40% of residues fall into the low-confidence pLDDT range. Moreover, pLDDT sometimes fails to flag physically implausible structures. This raises two questions: can more robust reliability indicators be identified, and do unreliable predictions share common structural or biophysical features? RESULTS: We characterize protein structures through histograms of per-residue neighbor counts, and use Wasserstein principal component analysis to define the arity map, a lightweight and informative 2D embedding of proteins in a dataset. Using AlphaFold-DB, we show that the arity map reveals three structurally and biophysically distinct populations (well-folded proteins, intrinsically disordered proteins, and physically implausible predictions). We also use our packing-based encoding at the residue level to define Abstraqt (Arity-Based STRuctural Arrangement Quality assessmenT), a per-residue scoring function complementing the pLDDT, assigning low scores to hallucinated helices and distorted beta strands while correctly scoring native-like predictions. AVAILABILITY: The code to compute arity maps is available within Structural Bioinformatics Library, see https://sbl.inria.fr/doc/Alphafold_analysis-user-manual.html and https://sbl.inria.fr/data/AlphaFold-assessment.
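The idea of describing a structure by a histogram of per-residue neighbor counts, and comparing such histograms with a Wasserstein distance, can be sketched as follows. This is an illustration under stated assumptions, not the Structural Bioinformatics Library implementation: the 10 Å cutoff, the use of one representative point per residue, and all function names are hypothetical. Only the 1D Wasserstein distance is shown, not the Wasserstein principal component analysis used to build the arity map.

```python
import math
from collections import Counter


def neighbor_counts(coords, radius=10.0):
    """Count, for each residue, the other residues within `radius`.

    `coords` holds one 3D point per residue (e.g. C-alpha atoms);
    the radius is an illustrative assumption.
    """
    return [
        sum(1 for j, q in enumerate(coords) if j != i and math.dist(p, q) <= radius)
        for i, p in enumerate(coords)
    ]


def arity_histogram(counts, max_arity):
    """Normalized histogram over neighbor counts 0..max_arity (larger counts are clipped)."""
    tally = Counter(min(c, max_arity) for c in counts)
    return [tally[k] / len(counts) for k in range(max_arity + 1)]


def wasserstein_1d(hist_a, hist_b):
    """1-Wasserstein distance between two histograms on the same unit-spaced bins.

    In one dimension this equals the summed absolute difference of the CDFs.
    """
    cdf_a = cdf_b = distance = 0.0
    for a, b in zip(hist_a, hist_b):
        cdf_a += a
        cdf_b += b
        distance += abs(cdf_a - cdf_b)
    return distance


# Toy example: four residues on a line, 1 unit apart, with a 1.5-unit cutoff.
line = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
hist = arity_histogram(neighbor_counts(line, radius=1.5), max_arity=3)
```

Loosely packed or extended predictions shift histogram mass toward low neighbor counts, which is the kind of signal such an embedding could separate from compact, well-folded structures.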
bioinformatics2026-06-02v3UMITIC: An unsupervised framework for the joint characterization of cellular phenotypes and spatial neighborhoods in multiplex and hyperplex immunofluorescence imaging data
Sangüesa Recalde, M.; De Andrea, C. E.; Ariz, M.Abstract
Multiplexed imaging technologies enable the simultaneous measurement of dozens of protein markers while preserving context, providing a high-resolution view of tissue organization schemes. However, extracting meaningful insights from these high-dimensional datasets--particularly in hyperplex settings (>20 markers)--remains a major computational challenge, especially in the absence of annotated data. Here, we present UMITIC (Unsupervised Analysis of Multiplex Images via TIssue Characterization), a modular and unsupervised computational framework for the joint characterization of cell phenotypes and tissue neighborhoods from multiplex imaging data. UMITIC integrates three components: (i) CellCut, a strategy that combines nuclear and cytoplasmic predictions to improve the delineation capabilities of the framework; (ii) CellMap, a contrastive learning approach that generates low-dimensional representations of single-cell image crops that are enriched with morphological features; and (iii) TissueNet, a graph neural network that models spatial cell-cell interactions to identify tissue neighborhoods. We evaluated UMITIC across four datasets of increasing complexity to assess its robustness, scalability and biological relevance. With respect to a 7-plex human tonsil dataset, the framework identified canonical immune cell populations and reconstructed well-established anatomical regions. When applied to a 43-plex tonsil image, UMITIC preserved these tissue-level structures while enabling a finer cell subtype stratification process driven by increased marker dimensionality. We further validated our method on a 58-plex colorectal cancer cohort, where UMITIC was able to recover previously reported immune composition differences and spatial organization variations between patient groups with different prognoses. 
Finally, when an expert-annotated mass cytometry imaging dataset concerning human lung tissue was used, UMITIC achieved higher agreement with the reference tissue annotations than the existing approaches did, demonstrating improved lung microanatomy reconstruction accuracy. Together, these results show that UMITIC enables consistent and interpretable analyses of both cellular phenotypes and tissue architectures across diverse multiplex and hyperplex imaging datasets without the need for manual annotations.
bioinformatics2026-06-02v2MorphOTU: image-derived morphological operational units for open-set biodiversity assessment
Zhan, Z.; Ye, M.; Orr, M. C.; Chen, W.; Liu, X.; Yue, L.; Sun, X.; Zhang, F.Abstract
The absence of a scalable system for organizing the vast majority of unidentified species is a central obstacle in biodiversity science. Molecular methods can generate OTUs without species names but require sequencing infrastructure and often remain difficult to link to observable morphology, whereas most computer-vision methods still rely on closed-set species labels. These limitations hamper biodiversity quantification under the open, incomplete conditions that characterize real ecosystems. Here, we introduce morphOTUs, a general image-based framework that constructs operational units of biodiversity directly from phenotypes. Using morphOTU, we derive image-based OTUs across five standardized benchmark datasets spanning flowers, wood anatomy, and beetle dorsal habitus. These units closely approximate reference species-level groupings, including closely related species, retain coherent structure when most species are "unseen" during training, and accurately approximate -diversity metrics under sparse labeling or limited sampling. Furthermore, morphOTUs remain effective on a heterogeneous, long-tailed real-world insect survey dataset, demonstrating robustness beyond standardized imaging conditions. Visual explanations reveal that morphOTU consistently focuses on biologically meaningful traits and captures continuous phenotypic variation. By providing a scalable and open-set framework for quantifying phenotypic diversity, morphOTUs enable biodiversity assessment that includes unnamed species and unlock the ecological value of rapidly expanding digital image repositories.
bioinformatics2026-06-02v2GlycoForge generates realistic glycomics data under known ground truth for rigorous method benchmarking
Hu, S.; Bojar, D.Abstract
Quantifying all complex carbohydrates in a sample produces glycomics data, which constitutes compositional data and is stymied by biosynthetic dependencies between glycans, requiring dedicated analytic workflows. Properly assessing such methods frequently requires simulated data with known ground truths and injectable effects. However, simulating glycomics data, especially with control over effects and biases, is still unsolved. Here, we present GlycoForge, a feature-complete solution for simulating comparative glycomics data. GlycoForge supports simulating fully synthetic glycomics data and templated simulations based on real-world data, with specified motif-level effects, based on Gaussian copulas and estimated covariances. We further support injection of batch effects, both mean and variance shifts, via center-log ratio transformations to maintain compositional closure, and realistic missing data simulation. We showcase the utility of GlycoForge by evaluating batch effect correction algorithms for glycomics data, with automated guidelines for when to use such methods on real-world data. GlycoForge is available as an open-access Python package at https://github.com/BojarLab/GlycoForge.
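Injecting a batch effect through a centered log-ratio (CLR) transformation while keeping compositional closure can be sketched in a few lines. This is a generic illustration of the technique, not GlycoForge's API: the function names are hypothetical, only a mean shift is shown, and all abundances are assumed to be strictly positive.

```python
import math


def clr(composition):
    """Centered log-ratio transform: log of each part minus the mean log."""
    logs = [math.log(x) for x in composition]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def inverse_clr(coords):
    """Map CLR coordinates back to a composition that sums to 1 (closure)."""
    exps = [math.exp(value) for value in coords]
    total = sum(exps)
    return [e / total for e in exps]


def inject_batch_shift(composition, shift):
    """Add a per-glycan mean shift in CLR space, then close the result.

    Hypothetical helper: `shift` has one entry per glycan.
    """
    shifted = [c + s for c, s in zip(clr(composition), shift)]
    return inverse_clr(shifted)


# Relative abundances of three glycans; the batch effect inflates the first.
sample = [0.5, 0.3, 0.2]
batched = inject_batch_shift(sample, [1.0, 0.0, 0.0])
```

Because the shift is applied in log-ratio space and the result is renormalized, the perturbed sample remains a valid composition, which adding an offset directly to the relative abundances would not guarantee.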
bioinformatics2026-06-02v2Mechanistic Interpretability for Protein Language Models: A Validation Framework
Chon, P.; ANDREOPOULOS, W. B.Abstract
Protein language models (PLMs) have been shown to be powerful predictors of protein structure and function, but their internal mechanisms remain poorly understood. Recent mechanistic interpretability methods have decomposed PLM representations into interpretable features, but they have not combined methods on a single biologically meaningful task. This paper tests whether an InterPLM sparse autoencoder and a ProtoMech cross-layer transcoder can discover features in ESM-2 (6 layers, 8M parameters) that discriminate mainly between Class A β-lactamase and Class B β-lactamase, with Classes C and D used as more challenging comparisons. The main goal is to find distinct features for Class A β-lactamase that are not shared by other classes. We find that both methods identify distinct features for Class A β-lactamase, but the cross-layer transcoders show that the concepts for Class A β-lactamase seem to be distributed among nodes, such as those in layers 4 and 6, rather than residing in one node. We also showcase a validation framework to prevent overclaiming the role of a node, and we use it to show that several strong nodes fail in some stages of the framework, meaning that they cannot be the sole node that defines Class A β-lactamase.
bioinformatics2026-06-02v1miDGD: a multi-modal deep generative model predicts miRNA expression from bulk or single-cell mRNA expression
Zamani, F.; Rasmussen, A. M.; Schuster, V.; Diekema, M. H.; Krogh, A.; Pedersen, J. S.Abstract
MicroRNAs (miRNAs) are important post-transcriptional regulators, yet their expression is typically unobserved in single-cell and most bulk RNA-seq datasets. We present miDGD, a deep generative decoder model that predicts miRNA abundance directly from gene expression alone. Trained on bulk and single-cell datasets from TCGA, GTEx, and human cell lines, miDGD learned a shared latent representation of matched mRNA and miRNA profiles that organized samples into biologically meaningful clusters reflecting tissue and cancer types. The model reconstructed both tissue-specific and broadly expressed miRNAs, recapitulated known miRNA-target relationships, and showed robust performance in sparse and single-cell data. miDGD outperformed miRSCAPE and recent miRNA activity inference methods, with improved cross-dataset generalization. These results establish a deep generative model as an improved framework for predicting miRNA expression when direct measurements are unavailable.
bioinformatics2026-06-02v1Decoding the Grammar of Protein-Protein Interaction Interfaces with Multimodal Representations
Cuturello, F.; Senci, S.; Di Vora, D.; Gardinazzi, Y.; Villegas Garcia, E. N.; Feltrin, A.Abstract
Protein-protein interactions (PPI) govern essential cellular processes, making the computational identification of interacting sites a central challenge in structural biology, with important implications for protein engineering and the development of targeted therapeutics. Existing prediction algorithms include sequence-based methods, which lack structural information, and structure-based approaches, which often struggle to effectively integrate evolutionary context. Here, we present ESM3-PPISites, a supervised model for residue-level classification of PPI interfaces, leveraging the multimodal representations of the ESM3 Protein Language Model. To ensure a bias-free evaluation, we adopt a stringent redundancy filtering protocol, systematically eliminating latent homology between the training data and a curated benchmark set in both sequence and structural space. Our findings demonstrate that while the largest proprietary version of ESM3 yields the highest predictive power, targeted fine-tuning of its small open-weight counterpart significantly narrows the performance gap. Requiring only primary sequence data at inference, ESM3-PPISites achieves unprecedented accuracy, vastly outperforming current approaches. Crucially, we demonstrate the practical impact of these predictions by integrating them as spatial restraints within the HADDOCK3 docking platform. When evaluated on an independent subset of 12 complexes from the Docking Benchmark v5, our prediction-guided pipeline strongly enhances the identification of near-native binding poses over ab initio blind docking, while reducing computational runtime by an order of magnitude. This framework establishes a scalable paradigm for high-throughput structural interactomics.
bioinformatics2026-06-02v1Quantifying and Predicting the Difficulty of Multiple Sequence Alignment with AlDiScore
Bodynek, M.; Martin-Fernandez, L.; Bettisworth, B.; Haag, J.; Stamatakis, A.Abstract
Multiple Sequence Alignment (MSA) constitutes an important and frequent operation in molecular sequence data analysis. There exist numerous tools, algorithms, and criteria to infer an MSA. This plethora of available approaches to MSA may induce an ensemble of divergent MSAs for the same underlying unaligned sequence set. Even a single MSA tool may infer distinct MSAs when varying the input parameters. Hence, when using a diversified set of MSA algorithms and parameterizations, the observed dispersion within an MSA ensemble expresses the difficulty of inferring a robust alignment. We refer to this notion as MSA difficulty. As downstream analyses heavily rely on the MSA, characterizing MSA difficulty for a given unaligned sequence set is critical. Initially, we show that measures of dispersion within diversified MSA ensembles can reliably predict MSA difficulty. We then assess the adequacy of these measures by computing the average reference-based distance between the MSAs in the MSA ensemble and its corresponding structural reference MSA and subsequently comparing this distance to the corresponding reference-free average distance over all MSA pairs in the ensemble. We find that Blackburne and Whelan's dpos alignment metric is most appropriate as its reference-free counterpart most accurately approximates the reference-based difficulty computed on BAliBASE reference data. We therefore use the average pairwise distance measured by dpos to quantify MSA difficulty on a scale from 0 (easy) to 1 (difficult) given an MSA ensemble. Next, we introduce the AlDiScore open-source tool, which uses machine learning to directly and reliably predict reference-free difficulty scores from unaligned sequence sets to completely omit expensive MSA computations. The underlying regression model relies upon a large set of features, including sampling-based measures of transitive consistency.
We trained our AlDiScore models on a diverse collection of empirical datasets from BAliBASE, TreeBASE, and published studies. Subsequently, we demonstrate that AlDiScore attains an R² of 0.89 and of 0.84 on unseen AA and DNA sequence sets extracted from the PANDIT v17 database. Finally, we show that there is no correlation between MSA difficulty and the corresponding phylogenetic difficulty of the respective MSA.
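The reference-free difficulty described in this abstract, an average distance over all pairs of MSAs in an ensemble on a 0 to 1 scale, can be sketched as below. The distance used here is a stand-in, not Blackburne and Whelan's dpos metric: it is one minus the Jaccard similarity of the sets of residue pairs that each alignment places in the same column. All function names are hypothetical and this is not AlDiScore's code.

```python
from itertools import combinations


def aligned_pairs(msa):
    """Return the set of residue pairs that share a column in an MSA.

    `msa` is a list of equal-length rows with '-' as the gap character.
    Each residue is identified as (row index, ungapped position).
    """
    positions = [0] * len(msa)
    pairs = set()
    for col in range(len(msa[0])):
        residues = []
        for row, seq in enumerate(msa):
            if seq[col] != "-":
                residues.append((row, positions[row]))
                positions[row] += 1
        pairs.update(combinations(residues, 2))
    return pairs


def pair_distance(msa_a, msa_b):
    """Stand-in distance in [0, 1]: 1 - Jaccard similarity of aligned pairs."""
    a, b = aligned_pairs(msa_a), aligned_pairs(msa_b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def ensemble_difficulty(ensemble):
    """Average pairwise distance over an ensemble of at least two MSAs."""
    distances = [pair_distance(x, y) for x, y in combinations(ensemble, 2)]
    return sum(distances) / len(distances)


# Two alignments of the same sequences (ACG and ATG) that disagree on one column.
gapped = ["AC-G", "A-TG"]
ungapped = ["ACG", "ATG"]
difficulty = ensemble_difficulty([gapped, ungapped])
```

An ensemble whose members agree on every aligned pair scores 0, and the score grows as the tools and parameterizations disagree, which is the dispersion the abstract interprets as MSA difficulty.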
bioinformatics2026-06-02v1