Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
Easydecon: Efficient Cell Type Mapping for High-Definition Spatial Transcriptomic Data
Umu, S. U.; Karlsen, V. T.; Baekkevold, E. S.; Jahnsen, F. L.; Domanska, D.
Abstract
The emergence of high-resolution spatial transcriptomics platforms, such as VisiumHD, has enabled transcriptome-wide spatial profiling at near-single-cell resolution. However, existing analysis tools often lack scalability or compatibility with this new resolution, limiting their utility for multimodal cell type analysis. We present Easydecon, a lightweight and modular computational framework for spatial transcriptomics analysis using marker genes from single-cell RNA sequencing datasets. Easydecon uses a two-phase strategy, which first detects expression hotspots and then refines cell type assignments with similarity-based methods. We demonstrate its efficacy by resolving cell type subsets with high accuracy. Easydecon supports integration with segmentation tools and outperforms established methods in speed, usability and cell type recovery.
bioinformatics 2026-05-11 v4
barbieQ: An R software package for analysing barcode count data from clonal tracking experiments
Fei, L.; Maksimovic, J.; Oshlack, A.
Abstract
Motivation: A clone encompasses a progenitor cell and its progeny cells. Tracking clonal composition as cells differentiate or evolve is useful in many fields. Various single-cell lineage tracing (clonal tracking) technologies use unique DNA barcodes that are passed from progenitor cells to their offspring. The barcode count for each sample indicates cell number in clones. However, analysis of barcode count data is often bespoke and relies on visualisations and heuristics. A generalized workflow for preprocessing and robust statistical analysis of barcode count data across protocols is needed. Results: We introduce barbieQ, a Bioconductor R package for analysing barcode count data across groups of samples. It provides data-driven quality control and filtering, extensive visualisations, and two statistical tests: 1) Differential barcode proportion (differences in proportions between sample groups), and 2) Differential barcode occurrence (differences in presence/absence odds between groups). Both tests handle complex experimental designs using regression models and rigorously account for sample-to-sample variability. We validated both tests on semi-simulated, real data and a case study, demonstrating that they hold their size, are sufficiently powered to detect true differences, and outperform existing approaches.
bioinformatics 2026-05-11 v3
A Permutation-Based Framework for Evaluating Bias in Microbiome Differential Abundance Analysis
Zeng, K.; Fodor, A. A.
Abstract
Background: In microbiome research, differential abundance analysis aids in identifying significant differences in microbial taxa across two or more conditions. Statistical approaches used for this purpose include classical tests such as the t-test and Wilcoxon test, as well as methods designed to account for the compositional nature of microbiome data, including ALDEx2, ANCOM-BC2, and metagenomeSeq. In addition, methods originally developed for RNA sequencing data, such as DESeq2 and edgeR, have been frequently applied to microbiome studies. However, the use of these methods has been controversial. One area of concern is whether different modeling frameworks produce accurate p-values when the null hypothesis is true. Results: We evaluated seven methods across six datasets. Four permutation strategies were applied to generate data under the null hypothesis: shuffling sample names, shuffling counts within samples, shuffling counts within taxa, and fully randomizing the counts table. Methods based on the negative binomial distribution (DESeq2 and edgeR) produced p-values that were consistently smaller than expected under the null hypothesis. In contrast, methods that attempt to correct for compositionality (ALDEx2, ANCOM-BC2, and metagenomeSeq) tended to produce larger-than-expected p-values, even when only sample labels were shuffled, a permutation strategy that does not alter compositional structure. These deviations were dependent on dataset characteristics and permutation strategy, suggesting complex interactions between underlying data structure and algorithm performance. Generating data to follow the expected negative binomial distribution did not eliminate the tendency of DESeq2 and edgeR to exaggerate statistical significance. Although similar patterns were observed in RNA sequencing (RNAseq) datasets, the deviations were less pronounced than in microbiome data.
In contrast, the classical t-test and Wilcoxon test yielded p-value distributions consistent with theoretical expectations across datasets and permutation strategies. Conclusions: These results indicate that the performance of several widely used differential abundance methods can be problematic under null conditions and may affect biological interpretation. Our findings emphasize the importance of careful method selection and highlight the robustness of simpler statistical approaches for reliable inference.
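The four permutation strategies named in the abstract above can be illustrated with a minimal sketch. It assumes a counts table stored as a list of rows with taxa as rows and samples as columns; that layout, the function names, and the toy data are illustrative assumptions, not the authors' code.

```python
import random

def shuffle_sample_labels(labels, rng):
    """Permute group labels across samples; the counts table is untouched."""
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return shuffled

def shuffle_within_samples(counts, rng):
    """Permute counts across taxa inside each sample (each column)."""
    n_taxa, n_samples = len(counts), len(counts[0])
    out = [row[:] for row in counts]
    for j in range(n_samples):
        column = [counts[i][j] for i in range(n_taxa)]
        rng.shuffle(column)
        for i in range(n_taxa):
            out[i][j] = column[i]
    return out

def shuffle_within_taxa(counts, rng):
    """Permute counts across samples inside each taxon (each row)."""
    out = []
    for row in counts:
        new_row = list(row)
        rng.shuffle(new_row)
        out.append(new_row)
    return out

def shuffle_full_table(counts, rng):
    """Permute every cell of the table, ignoring row and column structure."""
    n_samples = len(counts[0])
    flat = [value for row in counts for value in row]
    rng.shuffle(flat)
    return [flat[i:i + n_samples] for i in range(0, len(flat), n_samples)]

# Toy data (hypothetical): 3 taxa x 4 samples, two groups of two samples.
counts = [[10, 0, 3, 7], [5, 5, 1, 0], [0, 2, 8, 4]]
labels = ["case", "case", "control", "control"]
rng = random.Random(0)

null_labels = shuffle_sample_labels(labels, rng)
null_within_samples = shuffle_within_samples(counts, rng)
null_within_taxa = shuffle_within_taxa(counts, rng)
null_full = shuffle_full_table(counts, rng)
```

In this sketch, shuffling labels and shuffling within samples both preserve each sample's total count, whereas shuffling within taxa and full randomization do not. Each null dataset would then be passed to the method under evaluation and the resulting p-values compared with a uniform distribution.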
bioinformatics 2026-05-11 v2
miR-128 Regulates Hypertensive Vascular Remodeling via PPAR-γ
Zhoufei, F.; Han, C.; Liu, R.; Yu, L.; Chen, C.; Chen, S.; Li, L.; Chen, Q.; Cai, H.; Su, J.; Peng, F.
Abstract
This study investigated the role and mechanism of microRNA-128 (miR-128) in hypertensive vascular remodeling, focusing on peroxisome proliferator-activated receptor γ (PPAR-γ) and the Toll-like receptor 4/nuclear factor-κB (TLR4/NF-κB) pathway. Ten-week-old male spontaneously hypertensive rats (SHRs) were randomly divided into renal denervation (RDN), sacubitril/valsartan, and sham groups; age-matched Wistar-Kyoto rats served as normotensive controls. Eight weeks after intervention, mesenteric arteries were collected for histological, functional, and molecular analyses. Serum miR-128 levels were measured by quantitative real-time polymerase chain reaction (qRT-PCR). Protein expression was determined by immunofluorescence, immunohistochemistry, and Western blotting. Compared with the sham group, SHRs showed elevated blood pressure, severe vascular remodeling, and impaired vasodilation, accompanied by downregulated miR-128 and activated TLR4/NF-κB signaling (all p < 0.0001). RDN markedly restored miR-128 expression, suppressed the TLR4/NF-κB pathway and pro-inflammatory cytokines (IL-1β, IL-6, TNF-α), and improved vasodilatory function (all p < 0.0001). Mechanistically, miR-128 negatively regulated the TLR4/NF-κB pathway by upregulating PPAR-γ (p < 0.05). In conclusion, RDN attenuates hypertension and vascular remodeling. miR-128 alleviates vascular inflammation and remodeling via the PPAR-γ/TLR4/NF-κB axis, representing a promising therapeutic target for hypertension.
bioinformatics 2026-05-11 v1
eSkip2 prioritizes exon-skipping antisense oligonucleotide target regions across exon–intron contexts
Chiba, S.; Kunitake, K.; Shirakaki, S.; Haque, U. S.; Wilton-Clark, H.; Shah, M. N. A.; Leckie, J. N.; Matsui, K.; Uno-Ono, F.; Yokota, T.; Aoki, Y.; Okuno, Y.
Abstract
Antisense oligonucleotides (ASOs) for exon skipping are increasingly used to correct pathogenic splicing; however, rational target-region selection remains difficult because regulatory information is distributed across exons, introns, and splice junctions. Here we present eSkip2, a framework for prioritizing exon-skipping ASO target regions from joint exon–intron sequence context. eSkip2 combines transfer learning from a genome-pretrained foundation model with joint training on ASO activity and SNV-derived splicing perturbation data and can be adapted to a target locus without experimental ASO labels. Across multi-gene benchmarks spanning canonical exons, pseudoexons, cell types, chemistries, and exonic, intronic, and exon–intron-spanning targets, eSkip2 robustly prioritized active regions; in exon-confined comparisons, it showed improved overall performance compared with applicable existing models. It also supported prospective design of dual-targeting ASOs for DMD exon 46, where top-ranked candidates were enriched for active ASOs and yielded dose-dependent dystrophin restoration. eSkip2 narrows the experimental search space across diverse target architectures.
bioinformatics 2026-05-11 v1
Efficient and Tidy Manipulation of Annotated Matrix Data with plyxp
Landis, J. T.; Love, M. I.
Abstract
Manipulating high-dimensional omics data, such as bulk or single cell gene expression counts matrices, typically requires a bioinformatics analyst to learn domain-specific functions and syntax. These matrix-centric functions and syntax can be less intuitive than working with tidy data analytic principles, as exemplified by tools such as dplyr applied to tabular data. We propose an expressive grammar for manipulating annotated matrix data, with syntax to access, modify, and append matrix data and tabular row and column metadata, including row-wise or column-wise grouped operations. This grammar defines multiple contexts and provides pronouns for specific recall and assignment within and across these contexts. The plyxp package is an implementation of this grammar for the R/Bioconductor ecosystem, with efficient abstractions for the SummarizedExperiment class. We demonstrate plyxp's efficiency compared to alternative approaches on data manipulation tasks requiring computation across contexts.
bioinformatics 2026-05-11 v1
BioMADE: Predicting Torsades de Pointes from molecular structures through biologically informed representations
Acitores Cortina, J. M.; Schut, M. C.; Tatonetti, N. P.
Abstract
Drug-induced arrhythmias, particularly Torsades de Pointes (TdP), pose a significant risk to patient safety and can sometimes have life-threatening outcomes. They remain a major concern in drug development and regulation. Machine learning (ML) has become a powerful tool for analyzing complex biological and chemical datasets, enabling researchers to identify subtle patterns that differentiate safe compounds from those likely to cause dangerous cardiac effects. However, most existing in silico approaches do not sufficiently incorporate biological elements, relying heavily on chemical and structural properties or on computationally expensive simulations. Here, we introduce BioMADE, a novel ML framework that harnesses small-molecule-protein activity profiles from publicly available datasets to predict TdP risk without requiring exhaustive mechanistic annotation. Activity data from ChEMBL were used to train individual models for each gene, which predict activity values for any given compound. A curated set of arrhythmia-relevant genes was then used to construct a latent biological embedding (BioMADE embedding) for each molecule. We validated the performance of these features in distinguishing biological elements such as ATC3 class, showing superior classification performance compared with representations such as Molformer (lacks biological information) and MACCS (limited chemical properties) (0.85 AUROC vs 0.81 and 0.73, respectively). BioMADE representations served as input to a support vector machine classifier to discriminate TdP-inducing drugs from safe compounds. BioMADE achieved an AUROC of 0.91 in internal validation, indicating strong predictive performance. Against state-of-the-art models such as ADMEThyst, BioMADE achieved an AUROC of 0.74 on ADMEThyst's validation set (vs. 0.72 for ADMEThyst). When we combined both approaches, the AUROC reached 0.77. 
These results demonstrate that BioMADE provides a scalable, biology-informed, and generalizable approach for predicting drug-induced toxicities. By integrating protein activity profiles into toxicology modeling, our framework highlights the critical role of human biology in adverse drug reaction prediction, an aspect often overshadowed by purely chemical or structural descriptors.
bioinformatics 2026-05-11 v1
EVd3x: a source-attributed multi-omic platform for mapping extracellular vesicle cargo evidence
Ait Ouares, K.; Weerakkody, J. S.
Abstract
Extracellular vesicle (EV) studies increasingly generate mixed cargo lists that include genes, proteins, miRNAs, biofluids, cell contexts, disease labels, pathways, and interaction networks. The central interpretive challenge is determining which source supports each record and what level of biological claim that source can justify. We developed EVd3x, a source-attributed multi-omic platform that integrates 28 public resources into 17 canonical Apache Parquet analysis tables and converts molecule, disease, or natural-language queries into a reusable analysis state. The same state can be inspected across linked evidence layers for EV cargo, disease aggregation, pathway enrichment, cell context, ligand-receptor evidence, miRNA target support, STRING protein-protein interactions, and exportable source rows. We evaluated EVd3x using the disease-first query: early onset Alzheimers disease with behavioral disturbance. The query resolved a PSEN1-centered state with 5 seeds, 109 nodes, and 197 edges, and exported 647 EV evidence rows, 4,053 disease rows, 2,204 pathway rows, 3,555 cell-context/communication/ligand-receptor rows, and 4,032 bridge rows. EVd3x recovered familial Alzheimer disease type 3, gamma-secretase and Notch context, nervous-system pathway terms, oligodendrocyte-to-astrocyte communication hypotheses, and PSEN1 bridges in which six queried miRNAs, including hsa-miR-107, target PSEN1 directly. These outputs are reported as separable evidence layers rather than as a composite proof score. A table-backed research assistant fine-tuned from Qwen2.5-1.5B-Instruct with QLoRA routes natural-language requests through deterministic retrieval before optional synthesis. EVd3x supports transparent EV hypothesis generation by preserving source attribution from query to export.
bioinformatics 2026-05-11 v1
Autoresearch Discovery of Interpretable Filter Rules for Antibody Binder Classification
Landajuela, M.
Abstract
Antibody design campaigns increasingly generate many candidates before only a small subset can be tested experimentally, making candidate filtering a central bottleneck. We study whether an autoresearch loop can discover better training-free filters for antibody binder classification by iteratively proposing rule variants, evaluating them under a fixed Leave-One-System-Out protocol, recording each experiment in version control, and using the results to guide the next iteration. Across 75 unique logged filter variants on seven antibody-antigen systems, the loop improves average ROC-AUC from 0.6371 for the initial baseline to 0.8060 for a compact final rule that we call the RMSD-Tuned Triad rule, an absolute gain of 0.1689 and a relative improvement of 26.5%. The discovered filter is competitive with supervised machine learning baselines and prompted LLM baselines evaluated on the same systems: it exceeds logistic regression (0.7144), feature-selected balanced logistic regression (0.7536), and GPT-4o tabular few-shot prompting (0.7640), and it comes within 0.0044 ROC-AUC of the strongest GPT-5 tabular few-shot result (0.8104). Unlike the LLM baseline, the final rule requires no prompted examples and no LLM inference once the numeric structure-derived features are available. These results show that systematic autoresearch can turn simple structural-confidence signals into compact, interpretable filters that are useful when target-specific training data are scarce.
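The evaluation described above, a training-free rule scored per antibody-antigen system and averaged, can be sketched in a few lines, together with a check of the reported gains. The feature name `conf`, the toy systems, and the scoring rule are hypothetical; this is not the RMSD-Tuned Triad rule, whose definition the abstract does not give.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_auc_across_systems(systems, rule):
    """Score each held-out system with a fixed rule and average the AUCs."""
    aucs = []
    for features, labels in systems.values():
        scores = [rule(f) for f in features]
        aucs.append(roc_auc(labels, scores))
    return sum(aucs) / len(aucs)

# Hypothetical toy data: two systems, one structure-derived feature each.
systems = {
    "system_A": ([{"conf": 0.9}, {"conf": 0.2}, {"conf": 0.6}], [1, 0, 1]),
    "system_B": ([{"conf": 0.4}, {"conf": 0.7}, {"conf": 0.5}, {"conf": 0.6}],
                 [1, 0, 0, 1]),
}
mean_auc = mean_auc_across_systems(systems, rule=lambda f: f["conf"])

# Arithmetic check of the figures quoted in the abstract.
baseline, final, gpt5 = 0.6371, 0.8060, 0.8104
absolute_gain = round(final - baseline, 4)                  # 0.1689
relative_gain_pct = round(100 * absolute_gain / baseline, 1)  # 26.5
gap_to_gpt5 = round(gpt5 - final, 4)                        # 0.0044
```

Because the rule has no fitted parameters, holding a system out only determines which data it is scored on; a rule tuned on the remaining systems would need that tuning step inside the loop.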
bioinformatics 2026-05-11 v1
Partner determination from protein sequences using class information with CLAPP
Gennai, L.; Caredda, F.; Rebeaud, M. E.; Pagnani, A.; De Los Rios, P.
Abstract
Protein-protein interactions underpin nearly all cellular processes, making their accurate identification a central challenge in biology. With the rapid expansion of genomic data, sequence-based computational approaches have emerged as a powerful route to infer such interactions, complementing experimental methods that are often prohibitively time- and resource-intensive. This challenge becomes particularly acute in the presence of paralogs, which arise through gene duplication and typically diversify toward distinct, though sometimes overlapping, functions. Reconstructing their interaction networks is therefore essential for understanding a wide range of biological processes. Protein paralogs within a family can often be subdivided into classes based on a range of properties, including functional, structural and architectural features. When interactions between these classes are conserved across organisms, such that sequences from one class interact exclusively with sequences from another, this information can be used to solve the paralog matching problem. We introduce here CLAPP (CLAss Pooling for Paralog matching), a method for predicting interacting paralogs by pooling interaction scores from different subclasses across organisms. We apply it to scores extracted using coevolution-based methods. Pooling scores at the class level reduces noise in the interaction scores and replaces organism-specific assignments with a single shared assignment, improving performance and substantially reducing computational cost. We apply CLAPP to bacterial systems including histidine kinases and response regulators, as well as interacting families of chaperones and co-chaperones, and recover known interaction partners.
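The pooling idea described above can be illustrated with a small sketch: class-level interaction scores are averaged across organisms, and one shared class-to-class assignment is chosen from the pooled table. The data layout, the averaging, and the brute-force search over assignments are assumptions made for illustration; the abstract does not specify how CLAPP computes or optimizes these quantities.

```python
from collections import defaultdict
from itertools import permutations

def pool_scores(per_organism_scores):
    """Average each (class_a, class_b) score over all organisms reporting it."""
    sums, counts = defaultdict(float), defaultdict(int)
    for scores in per_organism_scores.values():
        for pair, score in scores.items():
            sums[pair] += score
            counts[pair] += 1
    return {pair: sums[pair] / counts[pair] for pair in sums}

def best_shared_assignment(pooled, classes_a, classes_b):
    """Exhaustively pick the one-to-one class pairing with the highest total score."""
    best, best_total = None, float("-inf")
    for perm in permutations(classes_b, len(classes_a)):
        total = sum(pooled.get((a, b), 0.0) for a, b in zip(classes_a, perm))
        if total > best_total:
            best, best_total = dict(zip(classes_a, perm)), total
    return best

# Hypothetical coevolution-style scores for two organisms and two classes per family.
per_organism_scores = {
    "organism_1": {("A1", "B1"): 0.9, ("A1", "B2"): 0.8,
                   ("A2", "B1"): 0.3, ("A2", "B2"): 0.4},
    "organism_2": {("A1", "B1"): 0.7, ("A1", "B2"): 0.2,
                   ("A2", "B1"): 0.5, ("A2", "B2"): 0.9},
}
pooled = pool_scores(per_organism_scores)
assignment = best_shared_assignment(pooled, ["A1", "A2"], ["B1", "B2"])
```

The exhaustive search is only practical for a handful of classes; it stands in here for whatever matching procedure the method actually uses.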
bioinformatics 2026-05-11 v1
Benchmarking long-read simulators against Oxford Nanopore whole-genome sequencing data
Taouk, M. L.; Ingle, D. J.; Wick, R. R.
Abstract
Background: Oxford Nanopore Technologies (ONT) sequencing is increasingly used for whole-genome sequencing (WGS) across a wide range of applications. However, the platform has evolved rapidly through updates to flow cell chemistry and basecalling algorithms, altering the characteristics of the resulting sequencing data. Read simulators provide synthetic datasets with known ground truth, enabling controlled development and evaluation of methods. However, many existing simulators were developed for earlier versions of ONT sequencing or use generic long-read assumptions, and their realism for contemporary ONT data is unclear. Results: We benchmarked six ONT-compatible read simulators (Badread, LongISLND, lrsim, NanoSim, PBSIM3 and SimLoRD) using a microbial genome reference and ONT R10.4.1 reads as the empirical standard. Each tool was configured to maximise realism, including training on empirical reads when supported. We compared simulated and real datasets with respect to read length, read accuracy, FASTQ quality scores and sequence error profiles. No simulator reproduced all metrics of the real data well. PBSIM3 most closely reproduced read length, read accuracy and FASTQ quality scores, making it a strong simulator for broad read-level realism. However, it did not capture important features of the real error profile, including context-dependent substitution rates and homopolymer-length errors. Badread and LongISLND better reproduced some aspects of the error profile, but showed other departures from the real data. Conclusion: PBSIM3 is a good general-purpose choice for many ONT WGS simulation tasks because it reproduced several key read-level properties well. However, Badread or LongISLND may be preferable for applications where error structure is more important. No evaluated tool was realistic across all tested metrics, highlighting a gap for improved long-read simulators.
bioinformatics 2026-05-11 v1
Pathway-informed Universal Domain Adaptation for Single-cell RNA-seq Data
Wei, X.; Li, X.; Liu, H.; Du, G.; Wei, F.; Shang, X.
Abstract
The rapid accumulation of single-cell atlases has yielded datasets of unprecedented scale, encompassing samples across diverse platforms, locations and laboratories. This multidimensional complexity drives an urgent need for universal domain adaptation methods capable of achieving precise cell-type annotation. However, existing methods lack computational scalability and fail to integrate biological priors. Here, we develop scPathOT, a pathway-informed universal domain adaptation framework that leverages pathway activation transformations to harmonize single-cell datasets across disparate conditions. We demonstrate the versatility of scPathOT across diverse technological platforms, tissues, disease contexts, cellular senescence and treatment conditions. Crucially, this pathway-informed alignment not only accurately resolves cellular identities but also uncovers functional mechanisms. In pancreatic islets, scPathOT delineates a shared stress-repair axis traversed by beta-cells prior to their divergence into type 1 and type 2 diabetes-specific states. In aging bone marrow, scPathOT disentangles lineage-specific senescence modules that unexpectedly converge onto unified inflammatory and oxidative-stress programs. Furthermore, application to an in-house pancreatic ductal adenocarcinoma cohort uncovers the mechanistic basis underlying the neoadjuvant chemotherapy-induced reorganization of stromal-immune crosstalk. By coupling biological priors with universal domain adaptation, scPathOT provides a scalable, mechanistically interpretable framework to accelerate biological discovery from atlas-level single-cell data.
bioinformatics 2026-05-11 v1
Investigation of Protein Melting Temperature Prediction with Cross-Method Validation on Biophysical Data
Pailozian, K.; Kohout, P.; Damborsky, J.; Mazurenko, S.
Abstract
Motivation: Protein melting temperature (Tm) prediction accelerates the discovery of thermostable enzymes, which are crucial for industrial biotechnology, where harsh reaction conditions are often required. Experimental determination of Tm remains labour-intensive and varies across techniques, motivating the development of in silico predictors. Mass-spectrometry datasets such as Meltome Atlas now enable large-scale Tm prediction with models based on deep learning, but model generalisation across diverse experimental datasets has not been systematically tested. Results: We evaluated the generalisability of state-of-the-art deep learning approaches and explored ESM-based embeddings for Tm prediction. To this end, we assembled the ProMelt training dataset (45 441 proteins) and five independent biophysics-based validation datasets. Our analysis revealed substantial differences between proteomics- and biophysics-based Tm measurements, highlighting the challenge of cross-domain generalisation. Existing state-of-the-art predictors trained on large-scale proteomics datasets showed reduced performance on biophysics-based validation sets. Our fine-tuned embedding-based models, particularly LoRA-adapted ESM-2 (TmProt 1.0), outperformed state-of-the-art predictors in identifying thermostable proteins (Tm ≥ 60 °C) across heterogeneous datasets, achieving AUC scores of 0.75–0.77. We also demonstrated that the available models could be used efficiently in the sequence prioritization task. Availability: The TmProt web server is available at https://loschmidt.chemi.muni.cz/tmprot/.
bioinformatics 2026-05-11 v1
Nanopore event detection in a simple and adaptive way
Wei, P.; Kansari, M.; Mierzejewski, M.; Ensslen, T.; Lin, C.-Y.; Kavetsky, K.; Jones, P. D.; Behrends, J. C.; Drndic, M.; Fyta, M.
Abstract
Nanopore read-out, that is, the current signals measured across nanometer-sized openings in dielectric membranes or through natural protein channels, enables the detection, identification and sequencing of individual molecules. The detection can take place by analyzing the events of single biomolecules interacting with the pore. The accuracy in the detection of these single events is key for identification of physicochemical properties of analyte molecules. To this end, we further develop a very simple, fast, almost parameter-free, and adaptable cluster-based event detection (CBED) algorithm that clusters the nanopore signals prior to detecting nanopore events. The algorithm is validated against two other event detection schemes with respect to simplicity and efficiency. For this, nanopore data from four different experiments stemming from different laboratories that vary in the nanopore type, size, and analyte are considered. The comparison is made on the basis of the number of events detected, their quality, and the most important features extracted from nanopore events. Our results underline the higher efficiency and lower noise of CBED-detected events for biological nanopore data and the need for on-the-fly adaptivity of the baseline current for a class of solid-state nanopore data.
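The general idea of clustering the signal before detecting events can be shown with a minimal sketch. It assumes a one-dimensional current trace in which blockades lower the current, splits the samples into two clusters with a simple two-means iteration, and reports contiguous runs in the low-current cluster as events. This is a generic illustration, not the CBED algorithm itself, whose details the abstract does not give; the function names and the toy trace are hypothetical.

```python
def two_means_1d(values, iters=50):
    """Split 1-D samples into a low and a high cluster; return both centroids."""
    lo, hi = min(values), max(values)
    for _ in range(iters):
        mid = (lo + hi) / 2
        low = [v for v in values if v < mid]
        high = [v for v in values if v >= mid]
        if not low or not high:
            break  # only one level present, nothing to separate
        new_lo, new_hi = sum(low) / len(low), sum(high) / len(high)
        if new_lo == lo and new_hi == hi:
            break  # centroids have converged
        lo, hi = new_lo, new_hi
    return lo, hi

def detect_events(signal):
    """Return (start, end) index pairs, end exclusive, of low-current runs."""
    blocked, baseline = two_means_1d(signal)
    threshold = (blocked + baseline) / 2
    events, start = [], None
    for i, value in enumerate(signal):
        if value < threshold and start is None:
            start = i
        elif value >= threshold and start is not None:
            events.append((start, i))
            start = None
    if start is not None:  # trace ended during an event
        events.append((start, len(signal)))
    return events

# Hypothetical trace in arbitrary current units with two blockades.
signal = [100, 101, 99, 60, 58, 61, 100, 102, 55, 57, 99]
events = detect_events(signal)
```

The threshold here comes from the data rather than from a user-set parameter, which is the sense in which a cluster-based scheme can be nearly parameter-free. A drifting baseline, as mentioned for solid-state pores, would need the clustering repeated over a sliding window.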
bioinformatics 2026-05-11 v1
Cadence: A Benchmark Evaluation of the Narrative Velocity Framework for Next Clinical Event Prediction in MIMIC-IV
Rouhollahi, A.; Nezami, F. R.
Abstract
Objective: How structured clinical features and cluster-semantic embeddings interact under self-distillation in EHR prediction models is unknown. Existing approaches treat these sources separately (gradient-boosted trees exploit tabular features while sequence models process text), and their interaction under self-distillation regularisation remains uncharacterised. We introduce the Narrative Velocity (NV) framework and evaluate this interaction in a 7-model benchmark. Materials and Methods: Cadence is a ~5.86M-parameter residual multilayer perceptron (MLP) combining structured EHR features with frozen PubMedBERT embeddings of cluster-label strings under born-again self-distillation from a prior Cadence checkpoint (seed-42 teacher). Cadence is benchmarked against six comparators on MIMIC-IV v3.1 with dual-sex TRIPOD+AI reporting (5 student seeds for Cadence; 2–3 seeds for baselines). Results: At full-cohort scale, Cadence achieves 38.04 ± 0.04% male and 35.66 ± 0.04% female top-1 accuracy, exceeding the strongest non-neural baseline (XGBoost-2420, trained on the identical 2,420-dimensional input) by +1.35 pp male and +0.82 pp female (paired t-test on shared seeds 42–44: t(2)=69.06, p = 2.10 × 10^-4 male; t(2)=25.32, p = 1.56 × 10^-3 female). On time-to-next-event regression Cadence lowers MAE by 7.68 d male and 7.30 d female versus XGBoost-2420; FT-Transformer attains the lowest absolute MAE at full scale (27.58 d male, 36.63 d female), revealing a classification-regression trade-off across model families. A controlled 2x2 random-vector ablation isolates the self-distillation–embedding interaction at +0.49 pp top-1 (95% CI [0.35, 0.64] pp; bootstrap, n = 10,000 resamples; 3-teacher-seed mean +0.513 ± 0.010 pp) under a matched-dimensionality null. A 3-teacher-seed validation (multi_teacher_02) confirms the interaction is robust to teacher-seed identity (per-seed values +0.525, +0.509, +0.507 pp; mean +0.513 ± 0.010 pp).
Cadence achieves the best Brier score among evaluated models (0.774 male / 0.798 female) but its raw probabilities are systematically miscalibrated (ECE 0.077 vs. XGBoost-884's 0.010); after a single scalar temperature scaling step (T* ~0.81), ECE drops to ~0.028 while Brier remains best. On a small (n = 1,120 patients, 39,120 events) external OCR-extracted BWH cohort, Cadence ranked 3rd of 7 models with three confounded sources of error (institutional shift, OCR noise, centroid mapping); we therefore report this as a generalisation probe rather than a definitive external validation. At the longer h30 evaluation horizon Cadence's MAE advantage reverses (47.35 d versus XGBoost 45.06 d), reflecting the absence of a matched-horizon self-distillation teacher. Discussion: The 2x2 random-vector ablation confirms that the self-distillation gain on PubMedBERT embeddings (+0.78 pp) exceeds that on matched-dimensionality random vectors (+0.29 pp) by +0.49 pp, isolating the interaction to semantic content rather than feature dimensionality. The factorial decomposition (+0.49–0.51 pp interaction) and the sequential pipeline-level decomposition (Supplementary Table S3) are complementary triangulations under different reference frames and are not directly additive. Conclusion: This 7-model benchmark establishes a dual-sex, dual-metric, cross-institutional reference for next clinical event prediction under the TRIPOD+AI reporting framework. These results characterise discrimination and calibration on a single retrospective cohort; prospective evaluation, decision-curve analysis, and harm-benefit assessment are required before clinical deployment. Keywords: clinical event prediction, electronic health records, MIMIC-IV, Narrative Velocity, residual MLP, PubMedBERT, knowledge distillation, TRIPOD+AI
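The calibration step mentioned above (a single scalar temperature, with expected calibration error reported before and after) can be sketched as follows. The sketch assumes access to per-class logits, fits the temperature by a grid search that minimises negative log-likelihood, and uses ten equal-width confidence bins for ECE; the grid, the bin count, and the toy logits are assumptions, not details taken from the paper.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a scalar temperature (numerically stable)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def negative_log_likelihood(logits, labels, temperature):
    """Summed negative log-probability of the true class at a given temperature."""
    return -sum(math.log(softmax(z, temperature)[y])
                for z, y in zip(logits, labels))

def fit_temperature(logits, labels, grid):
    """Return the grid value that minimises negative log-likelihood."""
    return min(grid, key=lambda t: negative_log_likelihood(logits, labels, t))

def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted mean |accuracy - confidence| over equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        confidence = max(p)
        correct = p.index(confidence) == y
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, correct))
    ece = 0.0
    for members in bins:
        if members:
            mean_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(ok for _, ok in members) / len(members)
            ece += len(members) / len(labels) * abs(accuracy - mean_conf)
    return ece

# Hypothetical validation logits for a 3-class problem.
logits = [[2.0, 1.0, 0.1], [0.5, 1.5, 0.2], [1.2, 0.3, 2.2], [0.9, 1.0, 0.8]]
labels = [0, 1, 2, 0]
grid = [0.5 + 0.05 * i for i in range(31)]  # candidate temperatures 0.5 to 2.0

t_star = fit_temperature(logits, labels, grid)
ece_raw = expected_calibration_error([softmax(z) for z in logits], labels)
ece_scaled = expected_calibration_error([softmax(z, t_star) for z in logits], labels)
```

A fitted temperature below 1, such as the reported T* of about 0.81, sharpens the predicted distribution, which is the correction expected when raw probabilities are underconfident. Scaling by a positive scalar does not change the predicted class, so top-1 accuracy is unaffected.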
bioinformatics 2026-05-11 v1
G-SPRI: A Structure-Centric Graph Model for Comprehensive Prediction of Cancer Driver Events from Missense Mutations
Wang, B.; Ye, B.; Farhat, A.; Liang, J.; Yu, L.; Lu, Z.; Wang, X.; Xu, L.
Abstract
In silico approaches for predicting the functional impact of missense mutations are critical for interpreting personal genomes and identifying disease-related biomarkers. Existing methods largely rely on sequence-based information or intuitive structural features, but often overlook the complex biophysical patterns encoded in protein 3D structures. Here, we present G-SPRI, a multilevel framework built on a novel alpha shape protein graph that accurately captures residue connectivity from atomic-resolution geometry and enables precise message passing around mutation sites. Using this graph representation, G-SPRI integrates wild-type structural properties and mutation-specific perturbation signals derived from the Protein Data Bank (PDB) universe to support graph-based learning for distinguishing pathogenic from benign missense variants. G-SPRI performs strongly across multiple key tasks. On the binary prediction benchmark, G-SPRI delivers improved pathogenicity prediction for individual mutations. By integrating mutation recurrence across the pan-cancer cohort, G-SPRI recovers more known cancer driver genes than state-of-the-art methods from more than 2.3 million mutations. Furthermore, by jointly quantifying site-specific pathogenicity and co-clustering influence within higher order structural organization units, G-SPRI provides comprehensive evidence for pinpointing likely driver mutations and structurally susceptible regions within disease genes.
bioinformatics 2026-05-11 v1
sxRaep: A Rapid and Accurate Enzyme Predictor for high-throughput mining of enzymatic sequences
Duan, H.; Han, X.; Mo, Y.; Ren, B.; Xia, L. C.
Abstract
Metagenomic sequencing generates petabyte-scale sequence datasets that strain both deep learning and alignment-based enzyme annotation tools. A lightweight, rapid, and accurate filter is needed to identify enzymatic sequences prior to resource-intensive functional prediction. We present sxRaep (Rapid and Accurate Enzyme Predictor), a resource-efficient framework using lightweight physicochemical features for enzyme pre-screening. sxRaep achieves 6,604-fold speedup over Diamond (0.002 seconds per inference) with 62.1% memory reduction relative to Diamond (372 MB peak), while maintaining 99.4% accuracy and the highest recall in remote homology detection. This lightweight approach identifies enzymatic candidates missed by alignment-based methods without sacrificing accuracy.
bioinformatics 2026-05-11 v1
A fine-tuned genomic language model adds complementary nucleotide-context information to missense variant interpretation
Su, Y.; Lin, Y.-J.
Abstract
Missense variant interpretation remains a central challenge in clinical genomics. Missense pathogenicity predictors achieve strong performance, but many emphasize protein-level consequences or overlapping annotation priors. Whether genomic language models add non-redundant nucleotide-context signal to missense interpretation remains unclear. Here, we systematically adapted genomic language models to ClinVar missense pathogenicity prediction across backbone architectures, representation strategies, classifier heads, and adaptation regimes. In our analysis, variant-position embeddings consistently outperformed pooled sequence representations, multi-species pretraining provided the strongest backbone-level advantage, and low-rank adaptation generalized better than full fine-tuning. The resulting fine-tuned model, GLM-Missense, substantially outperformed zero-shot scoring from the same pretrained model. To test whether GLM-Missense contributes information beyond existing methods, we built MetaMissense, an XGBoost ensemble combining GLM-Missense with AlphaMissense, ESM1b, REVEL, CADD, SIFT, and PolyPhen-2. GLM-Missense showed the lowest concordance with other predictors, retained the strongest partial association with pathogenicity after controlling for the other predictors, and ranked as the most informative non-ensemble input to MetaMissense. MetaMissense achieved the best performance in both cross-validation and held-out testing. Analyses of variants correctly classified by GLM-Missense but misclassified by several established predictors suggested two patterns. First, part of the GLM-Missense signal may reflect splice-relevant exonic context. Second, GLM-Missense appears to add value in settings where other predictors may overweight allele frequency, gene-level constraint, or amino-acid-change severity. However, these features explained only about 10% of the distinction between the GLM-Missense-correct subset and the background.
Together, our results demonstrate that fine-tuned genomic language models contribute complementary nucleotide-context information to missense variant interpretation.
bioinformatics2026-05-11v1Predicting Discrete Structural Transformations in Small Molecules from Tandem Mass Spectrometry
Wang, X.; Kiler, G.; Herrera-Rosero, D.; Shahneh, M. R.; Strobel, M.; Geibel, C.; El Abiead, Y.; Phelan, V. V.; Petras, D.; Wang, M.Abstract
Tandem mass spectrometry (MS/MS) fragments molecules into smaller pieces, generating spectra composed of m/z values and intensities that encode structural information for molecular annotation. With increasing mass spectrometry data acquisition speeds, manual annotation from MS/MS lags far behind data generation and remains a bottleneck in metabolite annotation. Current computational methods, such as molecular networking, address this challenge by organizing similar structures into families of related compounds. However, they generally provide only similarity scores, offering weak actionable insights for structural annotation. To address this limitation, we present the Molecular Transformation Graph Edit Measure (MT-GEM), a distance metric that quantifies discrete structural transformations between molecules through graph edge removals that approximate structural modifications. Building on this metric, we developed an ensemble machine learning architecture, the Spectrum Transformation Edit Predictor (STEP), that builds upon TransExION and DREAMS to predict MT-GEM distances from MS/MS spectra. STEP achieves an average precision of 48.4% for identifying single structural transformations between MS/MS pairs, representing more than a tenfold improvement over state-of-the-art similarity metrics, including spectral entropy similarity (3.8%) and modified cosine (2.5%). On experimental human gut microbial community data, STEP identifies 3 times more single-transformation metabolite pairs than feature-based molecular networking at equivalent precision. In a discovery application, STEP highlights one drug metabolite and two new natural product analogs missed by modified cosine in feature-based molecular networking. 
By providing discrete transformation predictions rather than continuous similarity scores, MT-GEM and STEP enable hypothesis-driven metabolite annotation with testable structural modifications, which we envision will accelerate discovery of new molecules from MS/MS metabolomics datasets.
bioinformatics2026-05-11v1conMItion: an R package adjusting confounding factors for associations in multi-omics
Wang, G.; Liu, F.; Chen, Z.; Davoli, T.Abstract
Association measurements, such as mutual information (MI), are fundamental in the analysis of cancer multi-omics data for identifying cancer-related genes, gene signatures, and gene regulatory networks, thereby shedding light on tumor development, progression, and treatment. Confounding factors, including tumor purity and mutation burden, can bias association measurements in MI, potentially leading to the misclassification of passenger events as drivers. Conditional mutual information (CMI) provides a robust framework for assessing both linear and non-linear associations while effectively accounting for different confounding factors. An R package called conMItion is introduced to estimate CMI and its statistical significance for multi-omics data, with flexibility to adjust for one or two confounding factors. We demonstrated the utilization of conMItion through two use cases. First, we identified co-occurring somatic alterations in bladder cancer genomic data. Second, we applied conMItion to a single-cell RNA sequencing dataset of lung cancer patients and identified positively or negatively associated cell types within the lung cancer tumor microenvironment.
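To illustrate the idea behind the conMItion abstract, the following is a minimal Python sketch of a plug-in estimator for mutual information and conditional mutual information on discrete data. It is not the conMItion API (the package is written in R and its estimator is not described in this listing); the function names and the toy data are hypothetical. The toy samples are constructed so that X and Y are independent within each level of a confounder Z, yet appear associated when Z is ignored.

```python
from collections import Counter
from itertools import product
from math import log


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in nats from paired discrete samples."""
    n = len(xs)
    n_xy = Counter(zip(xs, ys))
    n_x = Counter(xs)
    n_y = Counter(ys)
    # p(x,y) * log( p(x,y) / (p(x) p(y)) ), written with raw counts
    return sum((c / n) * log(c * n / (n_x[x] * n_y[y]))
               for (x, y), c in n_xy.items())


def conditional_mutual_information(xs, ys, zs):
    """Plug-in estimate of I(X;Y|Z) in nats from discrete samples."""
    n = len(xs)
    n_xyz = Counter(zip(xs, ys, zs))
    n_xz = Counter(zip(xs, zs))
    n_yz = Counter(zip(ys, zs))
    n_z = Counter(zs)
    # p(x,y,z) * log( p(z) p(x,y,z) / (p(x,z) p(y,z)) ), written with raw counts
    return sum((c / n) * log(c * n_z[z] / (n_xz[(x, z)] * n_yz[(y, z)]))
               for (x, y, z), c in n_xyz.items())


# Hypothetical toy data: within each stratum of the confounder z, x and y are
# independent (a full cross product), but both are more often 1 when z == 1.
samples = []
for z, levels in ((0, (0, 0, 0, 1)), (1, (0, 1, 1, 1))):
    for x, y in product(levels, repeat=2):
        samples.append((x, y, z))
xs, ys, zs = zip(*samples)

mi = mutual_information(xs, ys)
cmi = conditional_mutual_information(xs, ys, zs)
print(f"MI  = {mi:.4f}")   # positive: spurious association induced by z
print(f"CMI = {cmi:.4f}")  # zero: the association disappears once z is conditioned on
```

In this toy case the marginal association comes entirely from the confounder, which is the situation the abstract describes for tumor purity and mutation burden. A real analysis would also need a significance test, which this sketch omits.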
bioinformatics2026-05-11v1pyTrance finds co-localizing RNAs in subcellular spatial transcriptomics data
Strenger, L.; Cerda-Jara, C. A.; Karaiskos, N.; Rajewsky, N.Abstract
Regulation of RNA subcellular localization is crucial for cellular functions in health and disease. For example, local translation of co-localized RNAs is crucial for neural biology. However, it is challenging to identify RNA co-localization events. Here, we present pyTrance, a computational framework that predicts and quantifies subcellular RNA co-localization from spatial transcriptomics data, leveraging latent embeddings learned by a graph neural network. Based on extensive benchmarking, detection of co-localizing RNAs was more accurate and robust compared to existing methods. In mouse brain tissue, pyTrance found several RNA co-localization patterns. Co-localized RNAs were often functionally related and validated by biological knowledge. Interestingly, among novel patterns, pyTrance identified co-localization of GABAergic markers, including Gad1, in neuronal projections. Experimental validation led to the discovery of a spatial overlap between Gad1 mRNA/protein, strongly suggesting local translation. Our results establish pyTrance as a state-of-the-art method to discover biologically important RNA co-localization at subcellular resolution.
bioinformatics2026-05-11v1Cosine Similarity Conflates Clinically Distinct Cancer Variants: A Case for Typed-Graph Retrieval in Precision Oncology Decision Support
Khan, U. A.Abstract
Cancer variant interpretation increasingly relies on retrieval from biomedical knowledge bases, with cosine similarity over neural text embeddings now the dominant retrieval substrate. Whether these embeddings preserve the entity-level distinctions that variant interpretation requires (that BRAF V600E and V600K are distinct alleles, that EGFR L858R is a sensitizing mutation while T790M is a resistance mutation) has not been systematically measured. We hypothesize that cosine-similarity retrieval over biomedical embeddings conflates clinically distinct cancer variants at high rates, while a typed-graph approach in which each variant is a discrete node preserves variant identity by construction. We constructed a benchmark of 9 cancer variant pairs known to have differential FDA-approved therapy indications or distinct molecular biology, curated from the CIViC clinical evidence database and primary clinical literature. Pairs included BRAF V600E vs V600K (melanoma), EGFR L858R vs T790M (NSCLC, the canonical sensitivity-vs-resistance pair), EGFR exon 19 deletion vs L858R, KRAS G12C vs G12D (only G12C has FDA-approved targeted therapy), KRAS G12C vs G12V, ERBB2 amplification vs activating point mutation, two PIK3CA hotspot pairs, and NTRK1 fusion vs point mutation. For each pair we computed cosine similarity across three open-source embedding models (PubMedBERT, MedCPT, BGE-large-en-v1.5) over three text formats (short, medium, long). A 6-pair positive control verified that the embeddings recognize equivalent variant-form pairs (e.g., "EGFR L858R" vs "EGFR p.L858R") as similar. In the medium format (gene + variant + tumor type), 100% of clinically distinct variant pairs had cosine similarity ≥ 0.95 under both biomedical encoders (PubMedBERT, MedCPT). The general-purpose encoder (BGE-large-en-v1.5) conflated 33% of pairs in the medium format, rising to 100% with added clinical context.
At the more stringent τ = 0.99 (averaged across formats), PubMedBERT conflated 56% of pairs and MedCPT conflated 22%. The biomedically pre-trained encoders performed worse, not better, than the general-purpose encoder. A typed-graph retrieval baseline (each variant a discrete node) achieves zero conflation by construction. The conflation behavior is a property of the embedding architecture, not a coverage gap fixable by domain fine-tuning. We argue that bioinformatics applications that route on variant identity (clinical decision support, variant-trial matching, pharmacogenomic recommendation) should use typed-graph retrieval, not vector retrieval, as the routing substrate.
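The conflation measurement described in this abstract can be sketched as follows in Python. This is only an illustration under stated assumptions: a character-trigram count vector stands in for a neural text embedding (the abstract's models are PubMedBERT, MedCPT and BGE-large-en-v1.5, which are not used here), the threshold of 0.8 is chosen for this stand-in and is not the paper's τ, and `VariantNode`, `conflation_rate` and the other names are hypothetical. The resulting numbers do not reproduce the reported rates.

```python
from collections import Counter
from dataclasses import dataclass
from math import sqrt


def trigram_vector(text):
    """Stand-in 'embedding': counts of character trigrams (NOT a neural encoder)."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v)


def conflation_rate(pairs, tau):
    """Fraction of clinically distinct pairs whose similarity is >= tau."""
    hits = sum(cosine(trigram_vector(a), trigram_vector(b)) >= tau
               for a, b in pairs)
    return hits / len(pairs)


@dataclass(frozen=True)
class VariantNode:
    """Typed-graph node: a variant is identified by its exact (gene, alteration) key."""
    gene: str
    alteration: str


# Two of the variant pairs named in the abstract, in a "gene + variant + tumor type" format.
text_pairs = [
    ("BRAF V600E melanoma", "BRAF V600K melanoma"),
    ("EGFR L858R NSCLC", "EGFR T790M NSCLC"),
]
node_pairs = [
    (VariantNode("BRAF", "V600E"), VariantNode("BRAF", "V600K")),
    (VariantNode("EGFR", "L858R"), VariantNode("EGFR", "T790M")),
]

braf_similarity = cosine(trigram_vector(text_pairs[0][0]),
                         trigram_vector(text_pairs[0][1]))
vector_rate = conflation_rate(text_pairs, tau=0.8)
# Exact-key comparison: distinct nodes are never equal, so nothing is conflated.
graph_conflations = sum(a == b for a, b in node_pairs)

print(f"BRAF V600E vs V600K similarity: {braf_similarity:.3f}")
print(f"vector conflation rate at tau=0.8: {vector_rate:.2f}")
print(f"typed-graph conflations: {graph_conflations}")
```

The design point is that similarity thresholds operate on a continuous score, so near-identical strings can cross the threshold, whereas a typed node is compared by identity and cannot.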
bioinformatics2026-05-11v1Lense: Optimizing data preprocessing in single-cell omics using LLMs
Liu, J.; Ji, Z.Abstract
Data preprocessing is critical for single-cell omics analyses, but default pipelines often underperform on diverse datasets, especially from emerging platforms like spatial transcriptomics. We introduce Lense, a language-model-guided method that automatically selects optimal preprocessing by comparing plots that visualize low-dimensional representations across pipeline variants. Integrated with Seurat, Lense streamlines analysis and improves preprocessing robustness without requiring manual tuning.
bioinformatics2026-05-11v1LAMPrEY: a Python-based automated quality control tool for large-scale proteomics datasets
Valdes-Tresanco, M. E.; Wacker, S.; Valdes-Tresanco, M. S.; Plakhotnyk, A.; Brodie, N. I.; Hepburn, M.; Ulke-Lemee, A.; Huttlin, E. L.; Lewis, I. A.Abstract
Over the past years, proteomics has moved increasingly towards the analysis of large cohorts of biological specimens. This has been made possible by significant improvements in mass spectrometry technology, chromatographic separation methods, and improved data acquisition strategies. These technological advances now routinely enable experiments that yield vast datasets that substantially outstrip the capacity of existing proteomics data analysis approaches. Processing such large datasets requires purpose-built quality-control tools designed to organize and analyze the data while recording all processing parameters for reproducibility. To address this need, we developed an open-source, Python-based software platform, Large-scale Automated Multi-level Proteomics Evaluation by Python (LAMPrEY), a comprehensive quality-control pipeline for quantitative proteomics analyses of large cohorts of samples. LAMPrEY features GUI-based file submission, automated processing with MaxQuant and RawTools, an interactive analytics dashboard, and an application programming interface (API) for programmatic usage that collectively enable rapid, reproducible analysis and interpretation of proteomics data. We demonstrate the longitudinal monitoring and analytical capabilities of LAMPrEY using TMT11 quantitative proteomics data generated from 910 Enterococcus faecium isolates collected from bloodstream infection patients. LAMPrEY is open-source software available at www.lewisresearchgroup.org/software.
bioinformatics2026-05-11v1PTM-dCN: Latent Space Control for Post-translational Modification-aware Protein Design
Zhang, S.; Huang, T.; Chen, E.; Qing, R.Abstract
Post-translational modifications (PTMs) are critical for protein function, yet their precise design by harnessing site-specific information derived from native proteins remains challenging. Here, we present a deep learning-based PTM design framework that integrates latent diffusion models with ControlNet for sequence generation with site-specific PTM control. The framework incorporates a PTM-aware protein language model feature extractor, trained on a curated SwissProt PTM dataset with specialized modification tokens. Through de novo generation of protein sequences with designated PTM sites, our framework facilitates the exploration of PTM-driven functional landscapes and advances position-aware protein engineering.
bioinformatics2026-05-11v1On the state of protein function prediction: a report on the fourth CAFA challenge
Ramola, R.; De Paolis Klauza, M. C.; Piovesan, D.; Peng, Y.; Joshi, P.; Mehdiabadi, M.; Quaglia, F.; Pancsa, R.; Chemes, L. B.; Ahmadi, M.; Ahn, H.; Altenhoff, A. M.; Asgari, E.; Aspromonte, M. C.; Atalay, V.; Babbi, G.; Baldazzi, D.; Barot, M. M.; Ben-Hur, A.; Benso, A.; Berenberg, D.; Bjorne, J.; Boecker, F.; Boldi, P.; Bonello, J.; Bordin, N.; Borole, P.; Ebrahimpour Boroojeny, A.; Cao, R.; Di Carlo, S.; Casadio, R.; Casiraghi, E.; Chang, J.-M.; Chen, C.; Chen, T.-M.; Cheng, J.; Chiu, S.; Dalkiran, A.; Davidovic, R. S.; Dessimoz, C.; Diao, R.; Djeddi, W. E.; Dogan, T.; Flannery, S. T.; FontAbstract
Background: The Critical Assessment of Functional Annotation (CAFA) is a community effort held to understand the field of computational protein function prediction. Every three years, since 2010, the organizers initiate an experiment to collect function predictions on a large set of proteins and then evaluate the performance of predicting methods on a subset of proteins that have accumulated experimental annotations between the submission deadline and the evaluation time. CAFA provides an independent and rigorous assessment of the current state of the art, thus leveling the playing field, highlighting successes, revealing bottlenecks, and offering a forum for the exchange of ideas in protein science. Here, we report the results of the fourth CAFA experiment (CAFA4). Results: CAFA4 featured the participation of 148 methods from 70 research groups on a total of 46,205 unique proteins over a 5-year annotation accumulation phase, the longest in any CAFA. In a comparison across CAFA2-CAFA4 methods, the prediction of Gene Ontology (GO) terms has clearly improved across all three GO aspects and traditional evaluation settings. While not achieving the first rank, several CAFA2 and CAFA3 methods featured in the top ten methods in many evaluations, suggesting that earlier methods still hold relevance. The performance is weaker in the newly introduced "partial knowledge" evaluation category (proteins with experimental annotations before submission deadline that gained additional annotations in the same GO aspect during the annotation accumulation phase), highlighting the need for a new class of methods. The rankings of the methods were stable over the years in traditional evaluation settings, but less so in the new partial knowledge evaluation. Overall, the field continues to progress with some influx of new participants. Sustained efforts will be necessary to substantially advance it.
bioinformatics2026-05-11v1SPIFEE - A pipeline for analyzing traces of live-cell fluorescence microscopy data
Hogendorn, C.; R. Aragon, I.; Dallon, S.; Batchelor, E.Abstract
To properly respond to their environment, cells adjust the activity of key regulatory proteins and rates of gene expression. Methods to detect and quantify these forms of regulatory dynamics in living cells are of central importance for understanding cellular signaling events in both physiological and pathological conditions. Current technologies in this field make use of fluorescent probes to track cell signaling dynamics. Although these technologies have been used for decades, challenges remain. In particular, the segmentation, tracking, and interpretation of single cell dynamic data are time-consuming, prone to subjective errors, and often lacking in standardization across experiments. Here, we present SPIFEE, a data pipeline that uses experiment-dependent parameters to smooth noise and quantify key features of fluorescence data from time-lapse imaging studies. Processing data in this manner enhances and accelerates quantification of live-cell gene and protein expression, simplifies data analysis, and facilitates hypothesis generation.
bioinformatics2026-05-11v1ProteinFlux: accurate, rapid and scalable generative prediction of protein dynamics driven by post-translational modifications
Qian, Q.; Peng, J.; Ma, D.; Liu, K.; Cheng, Y.; Deng, Y.; Zhao, J.; Su, S.; Yao, Y.; Qu, Y.; Fu, R.; Liu, J.; Zhao, M.; Xiao, Y.; Wang, K.; Wu, Y.; Wang, Y.; Xu, Q.; Wang, J.; Hay, D. C.; Ke, Y.; Wang, Y.; Shipston, M. J.; Chi, Y.Abstract
The function of proteins, the building blocks of life, in health and disease depends not only on their 3D-conformational states but most importantly on the dynamic transition between states controlled by a wide array of post-translational modifications (PTMs). Recent major advances have been made in our ability to predict static 3D structures; however, understanding and predicting the impact of PTMs on protein conformational dynamics remains a major question and challenge in the field. Molecular dynamics (MD) simulation remains the major computational approach for studying protein dynamics. However, the high computational cost, lack of integration of PTMs as conditioning inputs and inefficient generation of continuous protein dynamics largely preclude the study of PTM-regulated conformational dynamics and of slow conformational processes. To address this critical bottleneck, we developed ProteinFlux, a flow-matching generative framework that links PTM-conditioned conformational dynamics to evolutionary constraints encoded by PTM sites. Evolutionary information plays a critical role in capturing conformational dynamics beyond sequence identity, and PTM sites inherently encode evolutionary constraints critical to protein functional regulation. We therefore built FluxSite, a dual-modal PTM site predictor that integrates sequence evolutionary information and 3D structural features to generate a continuous conditional signal encoding conservation and functional importance for each predicted site. FluxSite achieves robust generalization across 18 PTM types and 30 disease-associated proteomes. ProteinFlux generates phosphorylation-conditioned, all-atom conformational trajectories across diverse protein fold classes, faithfully reproducing both thermodynamic properties such as free energy landscapes and kinetic features such as conformational transition pathways.
It outperforms state-of-the-art predictors while achieving inference speeds several orders of magnitude faster than traditional MD. In addition, we introduce DynaMo-phos, a benchmark dataset of phosphorylated protein MD simulations. Together, ProteinFlux, FluxSite and DynaMo-phos provide a scalable, high-throughput platform for elucidating PTM-driven conformational mechanisms, with potential applications across allosteric drug design, functional annotation of disease-associated modifications and mechanism-guided therapeutic development.
bioinformatics2026-05-11v1SRSA-VAE: Self-Attention-Based Feature Learning for Single-Cell Multimodal Clustering
Das, R.; Dey, A.; Maulik, U.; Bandyopadhyay, S.Abstract
Clustering plays a critical role in the analysis of single-cell omics data for identifying cellular heterogeneity and uncovering biological mechanisms. However, the high dimensionality, sparsity, and multimodal nature of single-cell datasets such as single-cell RNA sequencing (scRNA-seq) and Cellular Indexing of Transcriptomes and Epitopes by Sequencing (CITE-seq) pose significant challenges for effective feature learning and representation learning. Traditional dimensionality reduction methods often rely on linear transformations and fail to capture complex nonlinear relationships between gene and protein expression profiles. In this work, we propose SRSA-VAE, a scalable variational autoencoder framework that integrates a residual self-attention encoder for context-aware feature learning and multimodal representation learning. The proposed model dynamically contextualizes gene and protein representations through a self-attention mechanism, enabling the encoder to capture inter-cell relationships and emphasize biologically informative signals. A scalable residual connection further stabilizes training and preserves essential input information during latent representation learning. We evaluate SRSA-VAE on five large-scale publicly available single-cell datasets, including both scRNA-seq and CITE-seq data, and compare its performance with established deep generative models. Experimental results demonstrate that SRSA-VAE consistently outperforms existing methods in Adjusted Rand Index (ARI) across benchmark datasets, with particularly strong gains on complex immune cell populations. Ablation studies further confirm the importance of the self-attention mechanism and residual connection in enhancing model stability and clustering accuracy. The proposed model offers a generalizable, robust, and scalable solution for single-cell clustering tasks.
bioinformatics2026-05-11v1InterScale reveals multi-scale cellular interaction programs in spatial transcriptomics
Drummer, F. K.; Jimenez, S.; Marco, F. D.; Schaar, A. C.; Pentimalli, T. M.; Beckmann, J.; Rajewsky, N.; Theis, F. J.Abstract
Tissue homeostasis and disease emerge from cell-cell interactions operating across spatial scales: from autocrine and juxtacrine signals within micrometers to paracrine gradients coordinating responses across tissues. While these can be read out from spatial transcriptomics, existing computational methods capture either local adjacency-based or long-range dependencies, but rarely both within a single framework. We introduce InterScale, a graph-transformer approach that jointly models local and global cellular interactions from spatial transcriptomics data. By integrating a Graph Convolutional Network as a local component with a global transformer encoder, InterScale learns multi-scale representations of cellular communication. A downstream workflow enables scale-resolved interpretation of interactions from gene to tissue level. Applied to Sonic Hedgehog morphogen patterning in neural organoids, InterScale resolves spatially restricted neuronal differentiation programs and broader progenitor regulatory states along the morphogen gradient. In a human pancreatic dataset contrasting healthy and type 1 diabetic tissue, it reveals disease-associated spatial reorganization and tissue remodeling. InterScale's modular architecture supports diverse spatial transcriptomics platforms and provides a scalable, unbiased, and biologically interpretable framework for studying cellular interactions across scales.
bioinformatics2026-05-11v1CLCNet: a contrastive learning and chromosome-aware network for genomic prediction in plants
Huang, J.; Yang, Z.; Yin, M.; Li, C.; Li, J.; Wang, Y.; Huang, L.; Li, M.; Liang, C.; He, F.; Han, R.; Jiang, Y.Abstract
Genomic selection (GS) leverages genome-wide markers and phenotypes to predict breeding values, with its effectiveness largely dependent on the accuracy of genomic prediction (GP) models. However, GP methods often struggle to capture inter-individual variability and are limited by the curse of dimensionality, where the number of SNPs far exceeds the sample size. To address these challenges, we present CLCNet (Contrastive Learning and Chromosome-aware Network), a novel deep learning framework that integrates contrastive learning and chromosome-aware feature modeling. CLCNet comprises two key components: (i) a contrastive learning module that enhances the model's ability to capture fine-grained, genotype-dependent phenotypic differences among individuals, and (ii) a chromosome-aware module that captures structured feature selection at both chromosome and genome levels, thereby distilling the most informative SNPs. We evaluated CLCNet across four crop species, covering ten agronomically important traits, and compared it with a diverse set of classical linear, machine learning, and deep learning models. CLCNet achieved superior prediction performance, with statistically significant improvements in Pearson correlation coefficient (PCC), ranging from 0.34% to 12.19% over baseline, together with reduced mean squared error (MSE). Performance gains were more pronounced for traits with moderate linkage disequilibrium (LD; r² = 0.21-0.36) and high heritability (h² > 0.66), such as those in maize, rapeseed, and soybean. For cotton traits characterized by high LD (r² = 0.74) and lower heritability (h² < 0.50), CLCNet maintained robust performance without degradation. Overall, these results demonstrate that CLCNet is an effective framework for improving genomic prediction accuracy and holds strong potential for practical applications in plant breeding.
bioinformatics2026-05-10v7A ML-framework for the discovery of next-generation IBD targets using a harmonized single-cell atlas of patient tissue
Joglekar, A.; Joseph, A.; Honsa, P.; Ruppova, K.; Pizzarella, V.; Honan, A.; Mediratta, D.; Vollmer, E.; Geller, E.; Valny, M.; Macuchova, E.; Zheng, S.; Greenberg, A.; Taus, P.; Kline-Schoder, A.; Konickova, R.; Cerna, L.; Sharim, H.; Ness, L.; Camilli, G.; Chouri, E.; Kaymak, I.; D'Rozario, J.; Castiblanco, D.; Oliveira, J.; Prandi, F.; Popov, N.; Moldoveanu, A. L.; Oliphant, C.; Escudero-Ibarz, L.; Uhlitz, F.; Freinkman, E.; Sponarova, J.; Vijay, P.; Joyce, C.; Leonardi, I.; Nayar, S.; Raveh-Sadka, T.; Solomon, N.; Platt, A.; Ort, T.; De Baets, G.; Corridoni, D.; Wroblewska, A.; Rahman, A.Abstract
Target discovery for IBD has traditionally relied on genetic associations, which lack the cellular resolution needed to identify novel, actionable, cell type-specific disease pathways. Here, we describe an integrated analytical and experimental framework that leverages harmonized single-cell data to systematically discover novel therapeutic strategies for IBD. We used AMICA DB™, Immunai's harmonized database of single-cell RNA datasets, to construct a harmonized atlas of 1 million single cells from the human intestine. We applied a machine learning framework (Immune Patient Representation, IPR) to identify disease-associated transcriptional programs and cell type-specific gene targets. Candidate targets were prioritized using atlas-derived metrics, refined using custom criteria emphasizing translational actionability, and validated across independent clinical cohorts. Select candidates were evaluated in human primary-cell models reflecting the target's cell-type context. The IPR framework identified 85 disease-associated transcriptional programs and ranked 400 cell type-specific target genes across immune and stromal lineages. Disease-associated programs were interpreted using a structured AI-assisted framework for biological reasoning, linking them to IBD-relevant pathways and guiding the identification of novel, promising gene targets. Functional validation of two cell-type-specific candidates, PTGIR in myeloid cells and IL6ST in fibroblasts, confirmed the reduction of inflammatory and fibrotic pathways linked to IBD pathology. Multi-omic profiling and projection of in vitro phenotypes to patient datasets demonstrated the reversal of disease-associated programs via mechanisms distinct from those of existing biologics.
Our single-cell anchored, machine-learning framework integrates in silico discovery with experimental validation, revealing new cell type-specific therapeutic opportunities and supporting a scalable approach for precision target discovery in IBD and other immune-mediated diseases.
bioinformatics2026-05-10v3OTRec: Deep learning recommender for prospective druggable disease-target associations
Ofer, D.; Linial, M.Abstract
Identifying druggable disease-target associations remains a central challenge in translational medicine, limiting therapeutic discovery and repurposing. Here, we present OTRec, a deep learning-based recommender system that ranks such associations at scale and evaluates them in a temporal hold-out setting. Unlike approaches that rely on manually curated or aggregated evidence scores, OTRec employs a two-tower architecture to learn latent representations from 663,351 disease-target pairs. The model integrates heterogeneous inputs, including textual descriptions, ontology-derived features, and biological annotations such as tractability, Gene Ontology (GO) terms, and pathway information. We perform temporal validation by training on the 2022 Open Targets (OT) release and evaluating on clinical trial data from 2025. OTRec improves on the retrospective OT association score (ROC-AUC: 0.872 ± 0.005 vs. 0.559; PR-AUC: 0.288 ± 0.009 vs. 0.08). In 5x5 target-disjoint cross-validation, OTRec reaches ROC-AUC 0.950 and PR-AUC 0.844, improving on the OT evidence score (ROC-AUC 0.91; PR-AUC 0.45). We rank the druggable genome across ~19,000 OT platform (OTP) diseases and release ~282,500 candidate associations above a 0.65 score threshold (in-distribution CV precision 0.92), covering 4,346 diseases including 2,322 orphan diseases, through an interactive prediction platform.
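As a rough illustration of the temporal hold-out evaluation mentioned in this abstract, the Python sketch below splits records by year and computes ROC-AUC as the probability that a positive example outscores a negative one. It is not the OTRec code; the record layout, the function names, the cutoff year used in the example and the toy scores are all hypothetical.

```python
def temporal_split(records, cutoff_year):
    """Split (year, label, score) records into those up to and after cutoff_year."""
    train = [r for r in records if r[0] <= cutoff_year]
    test = [r for r in records if r[0] > cutoff_year]
    return train, test


def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy records: (year first observed, label, model score).
records = [
    (2021, 1, 0.90), (2022, 0, 0.20),                   # available at training time
    (2024, 0, 0.10), (2024, 0, 0.40),                   # held out: later than the cutoff
    (2025, 1, 0.35), (2025, 1, 0.80),
]
train, test = temporal_split(records, cutoff_year=2022)
test_auc = roc_auc([r[1] for r in test], [r[2] for r in test])
print(f"held-out ROC-AUC: {test_auc:.2f}")
```

Evaluating only on records that postdate the training snapshot avoids rewarding a model for associations that were already visible in its training data.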
bioinformatics2026-05-10v2CRISPR-HAWK: Haplotype- and Variant-aware Guide Design Toolkit for CRISPR-Cas
Kumbara, A.; Tognon, M.; Carone, G.; Fontanesi, A.; Bombieri, N.; Giugno, R.; Pinello, L.Abstract
Current CRISPR guide RNA design tools rely on reference genomes, overlooking how genetic variation impacts editing outcomes. As genome editing advances toward clinical applications, incorporating population diversity becomes essential for ensuring therapeutic efficacy across diverse populations. We present CRISPR-HAWK, a framework integrating individual- and population-scale variants and haplotypes into gRNA design. Analyzing therapeutic targets across 79,648 genomes reveals that genetic variants substantially alter guide performance. For the clinically approved sickle cell disease therapeutic guide targeting BCL11A, we identify haplotypes that completely abolish predicted cutting activity. Across seven therapeutic loci, 82.5% of guides contain variants modifying on-target activity. Variants also create novel protospacer adjacent motif sites generating individual-specific guides invisible to reference-based design. These findings demonstrate that variant-aware selection is critical for equitable genome editing. CRISPR-HAWK is available at https://github.com/pinellolab/CRISPR-HAWK and https://github.com/InfOmics/CRISPR-HAWK
bioinformatics2026-05-10v2RNAcomp2D: a web visualization tool for comparing multiple predictions of RNA secondary structure
Vitale, R.; Milone, D. H.; Stegmayer, G.Abstract
Ribonucleic acids (RNAs) are involved in many important biological processes. In particular, non-coding RNAs are crucial regulators of cellular processes, playing a significant role in gene expression. RNA secondary structure is key to inferring their specific function and to understanding how they interact with other molecules. Many computational models have been developed in the last decade to predict the secondary structure, achieving increasingly higher success rates. However, each new method has its own input-output interface, programming language, computational requirements and, sometimes, a dedicated server to run the model or just source code in a repository. Thus, it is currently very hard to obtain predictions from multiple methods and compare them at once. A unified interface is urgently needed, one that allows accessing several methods at the same time and visualizing and comparing their predictions with each other and with a reference structure when available. We introduce here RNAcomp2D, a web-based tool that allows users to enter an RNA sequence, or select one from RNAcentral, and obtain predictions of RNA secondary structures using several state-of-the-art methods. Both classical thermodynamic methods and the latest deep learning models are packaged in containers and accessible in a unified website. All the predictions, and the reference structure if available, are shown at the same time in a single graphical interface. Moreover, as new models continue to be developed, this tool is designed to be scalable, allowing the addition of more prediction methods in the future. The web server is available at https://sinc.unl.edu.ar/web-demo/rnacomp2d/. Data and source code are available at https://github.com/sinc-lab/RNAcomp2D
bioinformatics2026-05-10v2Predicting Pre-treatment Resistance or Post-treatment Effect? A Systematic Benchmarking of Single-Cell Drug Response Models
Shen, L.; Sun, X.; Zheng, S.; Hashmi, A.; Eriksson, J.; Mustonen, H.; Seppänen, H.; Shen, B.; Li, M.; Vähä-Koskela, M.; Tang, J.Abstract
Intratumoral heterogeneity drives variable drug responses in cancer. Single-cell RNA sequencing (scRNA-seq) enables characterization of such heterogeneity and prediction of drug response at single-cell resolution. Accordingly, various computational models have been developed to infer drug response from scRNA-seq data. However, their performance, robustness, and generalizability across different biological contexts remain insufficiently evaluated. To address this gap, we benchmarked representative single-cell drug response prediction models using 26 curated datasets comprising over 760,000 cells across 12 cancer types and 21 therapeutic agents. We constructed balanced and imbalanced scenarios to reflect realistic drug-response label distributions. To address the lack of ground-truth labels in conventional scRNA-seq datasets, we incorporated lineage-tracing data with experimentally validated drug-response annotations, enabling evaluation in a clinically relevant pre-treatment prediction setting. Our results show that prediction performance was markedly higher in cell lines than in tissue samples. Under imbalanced conditions, most methods exhibited sharp performance declines, whereas scDEAL demonstrated the highest robustness. Independent validation using an in-house pancreatic ductal adenocarcinoma dataset further confirmed scDEAL's robustness and ability to capture biologically meaningful state transitions. A label-substitution experiment revealed that this robustness was partially driven by the model's specific training-label construction. However, benchmarking with lineage-tracing data revealed a fundamental limitation: most models capture drug-induced transcriptional changes but struggle to predict intrinsic resistance before treatment.
In summary, our study defines the performance boundaries of current approaches and highlights their limitations in addressing intratumoral heterogeneity, class imbalance, and intrinsic resistance prediction, emphasizing the need for next-generation single-cell drug response models with stronger clinical relevance.
bioinformatics2026-05-10v2A high-quality, chromosome-scale genome assembly of the shade-tolerant wild rice, Oryza granulata
Zhang, F.; Yang, Y.-h.; Li, W.; Shi, C.; Zhu, X.-g.; Gao, L.-z.Abstract
Oryza granulata Nees et Arn. ex Watt, a diploid wild rice (GG genome), possesses exceptional shade tolerance and is a key genetic resource for rice improvement. However, previous genome assemblies lacked continuity and completeness. Here we present a chromosome-scale reference genome of O. granulata using PacBio SMRT (113×), Hi-C (95×), and Illumina sequencing. The final assembly is ~764.24 Mb, with a scaffold N50 of ~59.32 Mb, and ~96.47% of the sequence anchored to 12 chromosomes. BUSCO completeness is ~98.6%. We annotated ~42,064 protein-coding genes, of which ~95.39% were functionally annotated, along with ~73.46% repetitive elements. The genome assembly and raw sequencing data are available at NGDC (PRJCA061980), NGDC GSA (CRA068332), and NGDC GWH (GWHISVE00000000.1). This high-quality genome will serve as a fundamental resource for evolutionary genomics, conservation biology, and breeding of shade-tolerant rice cultivars.
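The scaffold N50 quoted in the abstract above is a standard contiguity statistic: the length of the scaffold at which the cumulative length of scaffolds, taken from longest to shortest, first reaches half of the total assembly size. A minimal sketch of that calculation follows; the function name `n50` and the toy lengths are illustrative assumptions, not part of the authors' pipeline.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths.

    Scaffolds are visited from longest to shortest; the N50 is the length of
    the scaffold at which the running total first reaches half the assembly.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must contain at least one scaffold")


# Toy example with hypothetical scaffold lengths (arbitrary units).
toy_lengths = [2, 3, 4, 5, 6]
toy_n50 = n50(toy_lengths)
```

For a real assembly the lengths would typically be read from a FASTA index, which is omitted here to keep the sketch self-contained.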
bioinformatics2026-05-10v2Scalable integration and prediction of unpaired single-cell and spatial multi-omics via regularized disentanglement
Sun, J.; Liang, C.; Wei, R.; Zheng, P.; Yan, H.; Bai, L.; Zhang, K.; Ouyang, W.; Ye, P.Abstract
Deciphering cellular states requires methods capable of integrating large-scale heterogeneous single-cell and spatial omics data. However, these data are typically unpaired due to destructive assays and further confounded by modality heterogeneity, technical noise, and immense scale. Here we present scMRDR, a scalable computational framework based on regularized disentangled representation learning for integrating fully unpaired single-cell and spatial multi-omics datasets. Built on a unified and structure-preserving architecture, scMRDR removes the need for pairing supervision while maintaining computational efficiency, enabling scaling to large datasets spanning multiple disparate omics modalities. Across diverse real-world benchmarks, scMRDR demonstrates strong performance in batch correction, modality alignment, and biological signal preservation. The framework further supports cross-modal translation across omics modalities and enables spatial coordinate imputation for non-spatial single-cell datasets using a reference atlas. The resulting spatial mapping allows spatially resolved analyses, including identification of spatially variable genes and characterization of epigenetic regulatory programs in their native tissue context. These capabilities position scMRDR as a scalable and versatile framework for large-scale multi-omics integration.
bioinformatics2026-05-10v2scLASER: a robust framework for simulating and detecting time-dependent single-cell dynamics in longitudinal studies
Vanderlinden, L. A.; Vargas, J.; Inamo, J.; Young, J.; Wang, C.; Zhang, F.Abstract
Longitudinal single-cell clinical studies enable tracking within-individual cellular dynamics, but methods for modeling temporal phenotypic changes and estimating power remain limited. We present scLASER, a framework detecting time-dependent cellular neighborhood dynamics and simulating longitudinal single-cell datasets for power estimation. Across benchmark experiments, scLASER shows consistently higher sensitivity than traditional cluster-based approaches, with particularly pronounced gains in rare cell types and non-linear temporal patterns. Applications to inflammatory bowel disease (95,813 cells, 38 patients) reveal treatment-responsive NOTCH3+ stromal trajectories with high cell type discrimination (AUC > 0.92), while analysis of COVID-19 data (188,181 cells, 84 patients) identifies three distinct axes of T cell activity (cytotoxic effector, NK immunoreceptor signaling, and interferon-stimulated gene programs) over disease progression. scLASER enables robust longitudinal single-cell analysis and optimization of study design.
bioinformatics2026-05-10v2Interpretable neural networks prioritize cancer driver genes from genome-wide dependency landscapes
Yin, Q.; Chen, L.Abstract
Identifying cancer driver genes and their therapeutic impact remains a core challenge in computational cancer biology. We introduce xNNDriver and xAEDriver, two interpretable neural network frameworks that connect cancer mutations with genome-wide DepMap gene dependencies, pathway activity, and drug-response patterns. xNNDriver is a supervised pathway-guided model that evaluates whether a gene's mutation status is encoded in the genome-wide dependency landscape; we interpret model fitness as a driver potential score, which quantifies the strength of this mutation-dependency signal and prioritizes genes with broad functional footprints. Across 3,008 candidate genes, xNNDriver recovers major established drivers and highlights literature-supported candidates, while pathway analyses reveal biologically coherent programs related to metabolism, growth factor signaling, and immune regulation. To capture combinatorial functional states, xAEDriver uses an unsupervised autoencoder to learn Driver Variant Representations (DVRs), latent binary features guided by the frequency distribution of known driver mutations. DVRs capture cell-line-specific dependency patterns and expression patterns and are associated with drug sensitivity and pathway activity. Together, these interpretable deep learning models demonstrate that gene dependency landscapes encode rich, interpretable signals of oncogenic function and provide a hypothesis-generating framework for prioritizing drivers, pathways, and therapeutic vulnerabilities for further experimental validation.
bioinformatics2026-05-10v2WSInsight: a cloud-native, agent-callable platform for single-cell whole-slide pathology
Huang, C. H.; Awosika, O. E.; Fernandez, D.Abstract
Translational study of the tumour microenvironment increasingly demands single-cell phenotyping at cohort scale. WSInsight is an open, reusable, cloud-native platform that performs patch- and single-cell H&E inference on giga-pixel slides streamed from local, S3, or NCI GDC storage, and returns QuPath- and OMERO-ready outputs with neighborhood-composition features. Validated on TCGA-BRCA and TCGA-CRC, it is callable from pathology viewers and AI agents through a standards-conformant MCP interface.
bioinformatics2026-05-10v2Entropy Sorting Feature Selection: information-theoretic gene set identification improves single-cell RNA sequencing data interpretability
Radley, A.; Boezio, G.; Shand, C.; Perez-Carrasco, R.; Briscoe, J.Abstract
Single-cell RNA sequencing (scRNA-seq) has transformed our ability to resolve cellular heterogeneity, but extracting meaningful signals remains challenging due to technical noise and batch effects. Most methods for denoising scRNA-seq data have focused on using latent representations such as principal component analysis and deep learning to prioritise biological signals. By contrast, despite its influence on downstream analyses, feature selection has received relatively limited attention, leading to widespread reliance on the comparatively simplistic strategy of highly variable gene selection. Here we present Entropy Sorting Feature Selection (ESFS), a modular, user-friendly framework that substantially improves the interpretability of scRNA-seq data. Notably, ESFS reveals complex expression dynamics that are obscured in latent representations. We demonstrate the utility of ESFS in diverse data: identifying coherent developmental programs across eight independent human embryo datasets without batch integration; resolving spatial gene expression in mouse colon missed by conventional analyses; disambiguating shared and tumour-specific microenvironments in glioblastoma; and disentangling spatial, temporal, and neurogenic programs in the developing mouse neural tube. Beyond delivering powerful and user-friendly software that deepens insight into complex biological systems, our work establishes Entropy Sorting as a novel information-theoretic approach for advanced data analysis methods.
bioinformatics2026-05-10v2Reconstructing True 3D Spatial Omics at Single-Cell Resolution
Yang, Y.; Luo, Y.; Zhang, K.; Bu, Y.; Xia, Z.; Peng, H.; Yan, R.; Liu, Q.; Chen, Y.; Shen, L.; Chen, E.Abstract
Capturing the three-dimensional (3D) organization of cells is essential for deciphering complex biological processes, yet comprehensive 3D spatial omics is severely hindered by the destructive nature of physical sectioning and the depth limitations of intact tissue imaging. Current computational methods rely on 2.5D stacking of discrete slices, which inherently disrupts tissue topology and fails to resolve continuous depth-dependent molecular gradients. To bridge this gap, we introduce DeepSpatial, an Optimal Transport flow matching framework that models tissue evolution as a continuous dynamic vector field. By solving the underlying probability flow ODEs, DeepSpatial enables the direct extraction of uninterrupted, infinitely resolvable tissue states at arbitrary spatial depths. Using Deep STAR/RIBOmap 3D technologies, we demonstrate that DeepSpatial achieves improved 3D reconstruction fidelity relative to 2.5D approaches, yielding structures that more closely recapitulate native tissue microenvironments in real-world datasets. Across diverse spatial omics modalities, including spatial proteomics using imaging mass cytometry in human breast cancer and spatial transcriptomics using openST in head and neck squamous cell carcinoma metastatic lymph nodes, DeepSpatial produces biologically interpretable and high-fidelity reconstructions across datasets. We evaluated the scalability and robustness of DeepSpatial on a large-scale mouse brain dataset, reconstructing a continuous 3D cellular atlas comprising 39 million cells within 41.6 hours. Systematic downstream characterization validated its ability to recapitulate consistent spatial architectures, cell-type distributions, transcriptomic patterns, and microenvironmental structures across brain regions. Collectively, these results demonstrate DeepSpatial as a generalizable and efficient solution for true 3D spatial reconstruction across scales and modalities.
bioinformatics2026-05-10v2Haplotype-resolved diploid genome inference on pangenome graphs
Chandra, G.; Doan, W. T.; Gibney, D.Abstract
Recent algorithmic advancements have shown how to utilize pangenome graphs in combination with the haplotype reconstruction framework of Li and Stephens to accurately reconstruct a haplotype from a reference pangenome graph and a set of input reads. However, significant work remains in developing techniques that utilize a pangenome graph to obtain a pair of phased haplotypes, called a diploid pair. We introduce new problem formulations and scalable algorithms for inferring phased diploid genomes from a pangenome graph and a set of input reads. We implement them in our tool, DipGenie. The key idea is to jointly optimize genotyping and phasing along global paths through the pangenome graph, guided by a biologically motivated recombination budget that constrains inferred haplotypes to plausible mosaics of reference haplotypes. We evaluate DipGenie on real Illumina short-read data from the highly polymorphic MHC region in 22 leave-one-out diploid experiments, benchmarking against three tools that also operate on graph structures: VG, which samples haplotypes directly from the pangenome graph, and PanGenie + Beagle and Paragraph + Beagle, which derive local graphs from a VCF panel for per-site genotyping and delegate phasing to a statistical method. At full coverage, DipGenie achieves a geometric mean switch error rate (SER) of 0.86%, which is 5.7× lower than PanGenie + Beagle (4.88%), 7.9× lower than VG (6.77%), and 13.2× lower than Paragraph + Beagle (11.35%). For structural variant calling, DipGenie leads with a geometric mean F1-score of 0.571, compared to 0.470 (PanGenie + Beagle), 0.450 (VG), and 0.379 (Paragraph + Beagle). These advantages hold at every coverage level tested.
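To make the switch error rate (SER) reported above concrete, here is a minimal sketch of one common way to compute it. It assumes each haplotype is given as a list of 0/1 phase labels over the same ordered heterozygous sites, which is a simplification; the abstract does not describe the exact evaluation procedure, so the function name and input encoding are illustrative assumptions.

```python
def switch_error_rate(truth, predicted):
    """Fraction of adjacent heterozygous-site pairs whose relative phase flips.

    truth and predicted hold one 0/1 label per heterozygous site, stating which
    parental haplotype carries the alternate allele. A switch error is counted
    wherever agreement with the truth changes between consecutive sites, so a
    prediction that is globally flipped (all labels inverted) scores zero.
    """
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must cover the same sites")
    if len(truth) < 2:
        return 0.0
    agree = [t == p for t, p in zip(truth, predicted)]
    switches = sum(1 for a, b in zip(agree, agree[1:]) if a != b)
    return switches / (len(truth) - 1)


# Hypothetical six-site example: the prediction switches phase after site 2
# and switches back before site 6, giving 2 switches over 5 adjacent pairs.
example_ser = switch_error_rate([0, 1, 0, 1, 1, 0], [0, 1, 1, 0, 0, 0])
```

Because only changes in agreement are counted, the measure is insensitive to which haplotype is arbitrarily labelled first.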
bioinformatics2026-05-10v2LIVIA: a browser-based tool for assessing and visualizing predicted protein interactions
Kim, A.-R.; Perrimon, N.Abstract
As protein structure prediction tools become widely adopted across biology, there is a growing need for accessible methods to assess and visualize predicted protein-protein interactions (PPIs). Here we present LIVIA (Local Interaction Visualization and Analysis), a browser-based tool that computes local PPI confidence metrics across multiple prediction platforms, identifies predicted interface residues, embeds an interactive Mol-star 3D viewer, and generates visualization scripts for ChimeraX and PyMOL. The tool automatically detects prediction formats; all parsing and computation occur locally on the user's machine. LIVIA is freely available at https://flyark.github.io/LIVIA.
bioinformatics2026-05-10v1Cross Dataset Transcriptomic Analysis Identifies Oxidative Stress Inflammation Gene Networks Modulated by Nutrigenomic Interventions in Parkinson Disease
Rafiee, M.; Abaj, F.; Mahdevar, M.; Rashidian, A.; Ghaedi, K.; Ghiasvand, R.Abstract
Inflammation and oxidative stress (OS) are key to Parkinson's disease (PD). We performed a cross-dataset integrative transcriptomic analysis to identify OS- and inflammation-related hub genes persistently dysregulated in PD and to evaluate their response to nutrigenomic interventions using publicly available datasets. Four GEO datasets (GSE7621, GSE20141, GSE20146, GSE49036) were analysed to identify differentially expressed genes (DEGs), which were intersected with GeneCards OS-inflammation gene sets. Functional enrichment analyses, including gene ontology (GO), pathway over-representation analysis (ORA), and protein-protein interaction (PPI) analysis, were used to identify key pathways and hub genes. Gene-food bioactive compound (FBC) associations were explored by integrating PD signatures with nutrigenomic profiles from NutriGenomeDB. We identified 183 DEGs in PD, enriched in synaptic, dopaminergic, OS, and inflammatory pathways. Intersection analysis yielded 26 OS-inflammation-related genes and 10 central regulators, including TH, DDC, SNCA, LRRK2, HSPB1, and HSPA1B. The gene-FBC analysis revealed opposing transcriptional patterns, with several FBCs suppressing stress-related genes and upregulating dopaminergic markers such as TH, GCH1, and DDC. Overall, this integrative analysis highlights OS-inflammation gene networks in PD and identifies candidate diet-gene interactions that warrant further experimental validation.
bioinformatics2026-05-09v1Machine learning cross-platform proteomic imputation enables protein quality scoring and replication of epidemiological associations
Li, L.; Alaa, A.; Tan, Y.; Demirel, I.; Friedman, S.; Zha, Q.; Trac, R. P.; Taylor, K. D.; Yu, B.; Ballantyne, C. M.; Deo, R.; Dubin, R.; Tsai, M. Y.; Peloso, G. M.; Brody, J.; Austin, T.; Psaty, B. M.; Nicholas, J.; Raffield, L. M.; Tahir, U.; Coresh, J.; Hornsby, W.; Chan, A.; Rich, S. S.; Rotter, J. I.; Ganz, P.; Gerszten, R.; Philippakis, A.; Natarajan, P.; Yu, Z.Abstract
High-throughput affinity-based proteomics has advanced biomedical research, yet fundamental, persistent discordance between mainstream platforms (SomaScan and Olink) routinely undermines the replication of findings. This platform-driven non-replication complicates downstream biological validation and biomarker prioritization. Here, we develop a machine learning-based framework for cross-platform protein value imputation to resolve this translational bottleneck. Using paired proteomic data measured by both SomaScan and Olink from 5,325 participants of the Multi-Ethnic Study of Atherosclerosis, we developed models to impute cross-platform measurements and applied them to two independent and demographically distinct cohorts (Cardiovascular Health Study [N=3,171] and UK Biobank [UKB; N=41,405]) for external validation. Our bi-directional model 1) established an imputation performance-based protein fidelity index, validated against gold-standard measurements from Atherosclerosis Risk in Communities study (N=101) and Nurses' Health Study (N=54), 2) enabled imputation of platform-exclusive protein measurements, and 3) facilitated calibration of overlapping proteins. We demonstrate the utility of this framework through three applications: 1) fidelity-informed analyses enhanced the replication of biomarker discovery, 2) recovery of SomaScan signals that were previously inaccessible in UKB's original Olink measurements, and 3) improved replication performance for overlapping proteins. Our study offers a translational roadmap that allows researchers to achieve reliable epidemiological replication, target specific assays for future optimization, and prioritize biological signal over platform noise.
bioinformatics2026-05-09v1A Fractal-Dimension Framework for Quantifying Self-Similarity in Chromatin Folding
El-Yaagoubi, A.; Balubaid, A. O.; Chung, M. K.; Tegner, J.; Ombao, H.Abstract
The three-dimensional folding of DNA is essential for genome function, but its organization remains difficult to summarize quantitatively across genomic scales. Here, we study DNA folding from Hi-C contact data using a network-based notion of fractal dimension. In this representation, genomic loci are treated as nodes, and observed Hi-C contacts define weighted edges, so that frequently interacting loci are closer in the resulting network. We then estimate fractal dimension using two complementary graph-based methods: the correlation dimension and the sandbox dimension. Validation on synthetic networks shows that the proposed estimators detect clear scaling behavior in hierarchical fractal-like networks, while distinguishing them from networks with local clustering but no stable multiscale self-similarity. Applied to intrachromosomal Hi-C data from the IMR90 human cell line, the method reveals approximate linear scaling regimes on log-log plots, suggesting fractal-like organization in chromatin contact networks. At the chromosome level, estimated fractal dimension tends to increase with chromosome size: larger chromosomes often have dimensions closer to 3, consistent with more compact and space-filling organization, whereas shorter chromosomes tend to have lower dimensions, closer to 1, consistent with simpler and more open folding patterns. A sliding-window analysis at 5 kb resolution further shows that fractal organization varies substantially along chromosomes rather than remaining uniform across genomic position. These results suggest that graph-based fractal dimension provides an interpretable summary of DNA folding complexity at both global and local scales. More broadly, the proposed framework offers a quantitative way to study multiscale genome organization from Hi-C data using tools from network geometry.
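The correlation-dimension idea described above can be sketched on a generic weighted network: compute shortest-path distances between all node pairs, measure the fraction C(r) of pairs within distance r, and read the dimension off the slope of log C(r) against log r. The sketch below is an illustration under stated assumptions, not the authors' estimator. It takes a matrix of edge lengths as input; how Hi-C contact counts are turned into edge lengths (for example, an inverse-count transform) is an assumption left outside the code, and all function names are hypothetical.

```python
import numpy as np


def shortest_path_lengths(edge_lengths):
    """All-pairs shortest paths via Floyd-Warshall on a dense matrix.

    edge_lengths[i, j] is the length of edge (i, j), or np.inf if absent.
    """
    dist = np.array(edge_lengths, dtype=float)
    np.fill_diagonal(dist, 0.0)
    for k in range(dist.shape[0]):
        # Allow paths that pass through intermediate node k.
        dist = np.minimum(dist, dist[:, k, None] + dist[None, k, :])
    return dist


def correlation_dimension(dist, radii):
    """Slope of log C(r) versus log r, where C(r) is the fraction of node
    pairs whose shortest-path distance is at most r."""
    pair_dist = dist[np.triu_indices_from(dist, k=1)]
    corr_sum = np.array([(pair_dist <= r).mean() for r in radii])
    slope, _intercept = np.polyfit(np.log(radii), np.log(corr_sum), 1)
    return slope


def chain_graph(n):
    """Unit-length path graph on n nodes, a one-dimensional sanity check."""
    lengths = np.full((n, n), np.inf)
    idx = np.arange(n - 1)
    lengths[idx, idx + 1] = 1.0
    lengths[idx + 1, idx] = 1.0
    return lengths


# A simple chain should give an estimate close to 1.
chain_dim = correlation_dimension(
    shortest_path_lengths(chain_graph(200)), np.arange(1, 21)
)
```

The choice of radii matters in practice: the slope is only meaningful over a range where the log-log plot is approximately linear, which is why the abstract speaks of scaling regimes.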
bioinformatics2026-05-09v1Building an open ecosystem for molecular neuroimaging: standards and tools from the OpenNeuroPET initiative
Ganz, M.; Norgaard, M.; Pernet, C.; Matheson, G. J.; Galassi, A.; Ceballos, E. G.; Wighton, P.; Bilgel, M.; Eierud, C.; Gonzalez-Escamilla, G.; Buckholtz, J.; Blair, R.; Markiewicz, C. J.; Hardcastle, N.; Greve, D. N.; Thomas, A. G.; Poldrack, R. A.; Calhoun, V. D.; Innis, R. B.; Knudsen, G. M.Abstract
Molecular neuroimaging with positron emission tomography (PET) and single-photon emission computed tomography (SPECT) enables quantification of specific molecular targets in the living brain. Despite its scientific impact, molecular neuroimaging research has historically faced challenges due to high costs, small sample sizes, laboratory-specific analysis pipelines, and limited large-scale data sharing. These factors have hindered reproducibility and the broader reuse of valuable PET datasets. The OpenNeuroPET initiative was established to address these barriers by developing standards, infrastructure, and open-source tools for organizing, sharing, and analyzing molecular neuroimaging data. Through collaborations across Europe and North America, OpenNeuroPET has supported the PET extension of the Brain Imaging Data Structure (PET-BIDS), providing a standardized framework for PET datasets and metadata. Building on PET-BIDS, tools such as PET2BIDS, ezBIDS, and BIDSCoin facilitate data conversion and curation. In parallel, OpenNeuro now hosts PET-BIDS datasets for open sharing, while complementary platforms such as PublicnEUro enable GDPR-compliant controlled access. Emerging open-source workflows and BIDS applications further support automated, reproducible PET preprocessing and quantitative analysis, promoting harmonized processing across centers. Together, these developments mark an important step toward an open molecular neuroimaging ecosystem in which datasets, software, and workflows can be transparently shared, reused, and scaled for collaborative research.
bioinformatics2026-05-09v1A structural grammar of truncation across the human homodimer landscape
Karagöl, T.; Karagöl, A.Abstract
Alternative splicing and proteolytic truncation generate tens of thousands of protein isoforms in the human proteome, but the structural consequences for quaternary state, the level at which most signaling, enzymatic and regulatory function operates, have largely been examined one molecule at a time. Leveraging the recent expansion of the AlphaFold Database to predicted human homodimers, we systematically compared 5,168 canonical-versus-truncated homodimer pairs across the human proteome. In high-confidence canonical homodimers, truncation is associated with predicted structural conservation in 56.4% of pairs (mean 85 residues lost), complete interface ablation in 26.1% (mean 178 residues lost), and partial destabilization in 17.5% (mean 134 residues lost); a distinct fourth class (4.0% of the dataset, n = 208) shows truncation-associated emergence of a predicted high-confidence interface from a sub-threshold canonical baseline. Two reproducible rules govern these transitions: a topological asymmetry in which N-terminal losses are preferentially enriched ~1.6-fold in interface preservation while C-terminal losses are rare overall (~6% of pairs) and modestly under-represented in the conservation class, and a biophysical rule in which emergence-class proteins show substantially elevated intrinsic disorder content relative to ablation-class proteins, as measured by both AlphaFold pLDDT-defined disorder of the canonical structure (Cohen's d ≈ 1.39) and AIUPred peak binding propensity of the truncated isoform (Cohen's d ≈ 0.65). Formal pathway enrichment recovered only a small nucleotide-metabolism signal, indicating that these rules operate across diverse gene-functional categories. Truncation-associated remodeling of homodimer architecture thus constitutes a structural grammar of the human proteome rather than a specialty of any single regulatory family.
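The effect sizes above are reported as Cohen's d, the difference between two group means expressed in units of standard deviation. A minimal sketch follows, assuming the common pooled-standard-deviation form; the abstract does not state which variant was used, and the function name and toy values are illustrative only.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation (an assumed variant).

    Positive values mean group_a has the larger mean.
    """
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Toy values: means differ by 1 and both sample variances equal 4,
# so the pooled standard deviation is 2.
toy_d = cohens_d([2, 4, 6], [1, 3, 5])
```

Under this definition, a d of about 1.39 would mean the two class means sit roughly 1.4 pooled standard deviations apart.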
bioinformatics2026-05-09v1