Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
Exploring molecular signatures of senescence with markeR, an R toolkit for evaluating gene sets as phenotypic markers
Martins-Silva, R.; Kaizeler, A.; Barbosa-Morais, N. L.
Abstract
Many biological processes, including cellular senescence, manifest as diverse phenotypes that vary across cell types and conditions. In the absence of single, definitive markers, researchers often rely on the expression of sets of genes to identify these complex states. However, there are multiple ways to summarise gene set expression into quantitative metrics (i.e., signatures), each with its own strengths and limitations, and we know of no consensus framework to systematically evaluate their performance across datasets. We therefore developed markeR (https://bioconductor.org/packages/markeR), an open-source, modular R package that evaluates gene sets as phenotypic markers using various scoring and enrichment-based approaches. markeR generates interpretable metrics and intuitive visualisations that enable benchmarking of gene signatures and exploration of their associations with chosen study variables. As a case study, we applied markeR to 9 published senescence-related gene sets across 25 RNA-seq datasets, covering 6 human cell types and 12 senescence-inducing conditions. There was wide variability in gene set performance, as some signatures (e.g., SenMayo) were robust senescence markers across contexts, while others (e.g., those from MSigDB) performed poorly as such. We also used markeR to analyse gene expression in 49 GTEx tissues, revealing tissue- and age-related differences in senescence-associated signals. Together, these findings emphasise the difficulty of characterising molecular phenotypes and demonstrate the potential of markeR in facilitating the systematic evaluation of gene sets in various biological contexts.
bioinformatics | 2026-04-15 | v3
Longevity Bench: Are SotA LLMs ready for aging research?
Zhavoronkov, A.; Sidorenko, D.; Naumov, V.; Pushkov, S.; Zagirova, D.; Aladinskiy, V.; Unutmaz, D.; Aliper, A.; Galkin, F.
Abstract
Aging is a core biological process observed in most species and tissues, which is studied with a vast array of technologies. We argue that the abilities of AI systems to emulate aging and to accurately interpret biodata in its context are the key criteria to judge an LLM's utility in biomedical research. Here, we present LongevityBench -- a collection of tasks designed to assess whether foundation models grasp the fundamental principles of aging biology and can use low-level biodata to arrive at phenotype-level conclusions. The benchmark covers a variety of prediction targets including human time-to-death, mutations' effects on lifespan, and age-dependent omics patterns. It spans all common biodata types used in longevity research: transcriptomes, DNA methylation profiles, proteomes, genomes, clinical blood tests and biometrics, as well as natural language annotations. After ranking state-of-the-art foundation models using LongevityBench, we highlight their weaknesses and outline procedures to maximize their utility in aging research and life sciences.
bioinformatics | 2026-04-15 | v3
Functional-space alignment resolves the eco-evolutionary landscape of siderophore biosynthesis across bacteria
Shao, J.; Wu, Y.; Tian, S.; Xu, R.; Luo, H.; He, R.; Shao, Y.; Yu, L.; Xiong, G.; Guo, P.; Nan, R.; Wei, Z.; Gu, S.; Li, Z.
Abstract
Siderophores are central mediators of microbial iron acquisition, competition, and ecological adaptation, yet their biosynthetic diversity remains difficult to resolve across species because existing sequence-based BGC comparison is strongly constrained by phylogenetic background. Here we combine large-language-model-assisted literature mining, functional-space comparison, and genome-scale analysis to resolve the global organization of siderophore biosynthesis across bacteria. We first built SideroBank, a manually curated cross-species benchmark of siderophore biosynthetic gene clusters (BGCs), and used it to show that many identical products recur across distant taxa whereas the corresponding BGCs often fail to cluster in sequence space. We then developed BGC Block Aligner, which compares BGCs as ordered systems of functionally meaningful blocks and thereby converts comparison from sequence space to functional space. Applied to 97,432 bacterial genomes, this framework produced the Siderophore Atlas, revealing that siderophore synthesis is a remarkably pervasive trait encoded by over 60% of the analyzed genomes, with certain clusters being the most widely disseminated secondary metabolites across the bacterial domain. This global landscape suggests that the adoption of specific biosynthetic strategies is predominantly driven by ecological lifestyle rather than strict phylogenetic relatedness. Furthermore, a stark macro-evolutionary dichotomy was observed between the continuous structural diversification of NRPS pathways and the standardized, HGT-driven dissemination of NIS systems, linking functional-space genomics to the global ecology and evolution of siderophore biosynthesis.
bioinformatics | 2026-04-15 | v2
A little longer, a lot better: simulation-guided exploration of extended-length single-end barcoded reads for structural variant detection
Luo, C.; Liu, Y. H.; Liu, H.; Zhang, Z.; Zhang, L.; Peters, B. A.; Zhou, X. M.
Abstract
Accurate detection of genetic variants, including single nucleotide polymorphisms (SNPs), small insertions and deletions (INDELs), and structural variants (SVs), is essential for comprehensive genomic analysis. While short-read sequencing performs well for SNP and INDEL detection, it remains limited in resolving SVs, particularly in complex genomic regions, due to its short read length. Linked-read sequencing technologies, such as single-tube Long Fragment Read (stLFR), partially address this limitation by incorporating molecular barcodes to provide long-range information. In this study, we evaluate conventional paired-end linked reads (PE100_stLFR) and explore a conceptual extension: long single-end barcoded reads of 500 bp (SE500_stLFR) and 1000 bp (SE1000_stLFR). We developed stLFR-sim, a Python-based simulator that reproduces the stLFR workflow and enables realistic benchmarking. Using a high-quality T2T assembly of HG002, we generated multiple datasets across 12 sequencing configurations. SVs were called using Aquila_stLFR (v2) and benchmarked against the Genome in a Bottle (GIAB) HG002 SV truth set with Truvari. We show that simulated PE100_stLFR closely matches real data, validating the simulation framework. Increasing read length consistently improves SV detection accuracy, with SE1000_stLFR achieving the best performance and approaching long-read methods while outperforming short-read and pangenome-based approaches. Collectively, our results highlight the strong potential of long single-end barcoded reads for improving SV detection, and suggest that even modest increases in read length, when combined with barcode information, can provide a cost-effective and practical strategy for enhancing future sequencing technologies and SV discovery.
bioinformatics | 2026-04-15 | v2
TFBindFormer: A Cross-Attention Transformer for Transcription Factor-DNA Binding Prediction
Liu, P.; Wang, L.; Basnet, S.; Cheng, J.
Abstract
Transcription factors (TFs) are central regulators of gene expression, and their selective recognition of genomic DNA underlies various biological processes. Experimental profiling of TF-DNA interactions using chromatin immunoprecipitation followed by sequencing (ChIP-seq) provides high-resolution maps of in vivo TF-DNA binding but remains costly, labor-intensive, and inherently low-throughput, limiting its scalability across different transcription factors, cell types, and regulatory conditions. Computational modeling therefore plays an essential role in inferring TF-DNA interactions at genome scale. However, most existing computational models rely solely on DNA sequence and chromatin features to predict TF-DNA binding, neglecting TF-specific protein information. This omission limits their ability to capture protein-dependent binding specificity. Here, we present TFBindFormer, a hybrid cross-attention transformer that explicitly integrates genomic DNA features with TF-specific representations derived from protein sequences and structures. By modeling protein-conditioned, position-specific TF-DNA interactions, TFBindFormer enables direct learning of the molecular determinants underlying DNA recognition. Evaluated across hundreds of cell-type-specific TFs and hundreds of millions of genome-wide DNA bins, TFBindFormer consistently outperforms DNA-only baselines, achieving substantial gains in both area under the precision-recall curve (AUPRC) and area under the receiver operating characteristic curve (AUROC). Together, these results demonstrate that integrating TF and DNA features via cross-attention enables TFBindFormer to serve as an effective and scalable framework for large-scale TF-DNA binding prediction.
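The protein-conditioned cross-attention idea can be sketched generically: DNA-bin features act as queries attending over TF residue embeddings, so the output at each DNA position is conditioned on the specific protein. A minimal NumPy illustration in which random projection matrices stand in for learned weights; this is a generic sketch, not TFBindFormer's actual architecture:

```python
import numpy as np

def cross_attention(dna_feats, tf_feats, d_k=16, seed=0):
    """Scaled dot-product cross-attention: DNA positions (queries)
    attend to TF residue embeddings (keys/values). The projection
    matrices here are random stand-ins for learned weights."""
    rng = np.random.default_rng(seed)
    d_dna, d_tf = dna_feats.shape[1], tf_feats.shape[1]
    Wq = rng.normal(size=(d_dna, d_k)) / np.sqrt(d_dna)
    Wk = rng.normal(size=(d_tf, d_k)) / np.sqrt(d_tf)
    Wv = rng.normal(size=(d_tf, d_k)) / np.sqrt(d_tf)
    Q, K, V = dna_feats @ Wq, tf_feats @ Wk, tf_feats @ Wv
    scores = Q @ K.T / np.sqrt(d_k)                    # (L_dna, L_tf)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)      # softmax over TF residues
    return weights @ V                                 # (L_dna, d_k)

dna = np.random.default_rng(1).normal(size=(200, 4))   # DNA-bin features
tf = np.random.default_rng(2).normal(size=(350, 32))   # TF residue embeddings
out = cross_attention(dna, tf)
print(out.shape)  # (200, 16)
```

Swapping in a different TF's embeddings changes the attention weights, and hence the DNA-position representations, without retraining anything on the DNA side.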
bioinformatics | 2026-04-15 | v2
PoolParty: streamlined design of DNA sequence libraries in Python
Liu, Z.; Cordero, A.; Kinney, J. B.
Abstract
Motivation: Computationally designed DNA sequence libraries are essential components of massively parallel reporter assays (MPRAs), deep mutational scanning (DMS) experiments, and other multiplex assays of variant effect (MAVEs). They are also increasingly used in silico to analyze genomic AI models. Designing these libraries, however, remains tedious and error-prone due to the lack of purpose-built software. Results: Here we describe PoolParty, a Python package that streamlines the design of complex oligo pools using a simple but flexible API. In PoolParty, each library is represented by a computational graph that can be specified in just a few lines of code. Over 50 built-in operations cover nucleotide- and codon-level mutagenesis, motif insertion, barcode generation, and more. PoolParty automatically generates informative names for each sequence and provides "design cards" detailing how each sequence was generated. Visualization methods let users quickly audit library content and inspect the underlying graph. PoolParty thus transforms oligo pool design from a tedious task requiring custom functions and scripts into a structured, transparent, and reproducible process. Availability and implementation: PoolParty is freely available and can be installed using pip. It is compatible with Python ≥ 3.10. Documentation is provided at https://poolparty.readthedocs.io; source code is available at https://github.com/jbkinney/poolparty-statetracker. A static release is archived at DOI 10.5281/zenodo.19445098.
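The computational-graph idea behind this style of library design can be illustrated in plain Python: each node applies an operation to every (name, sequence) pair flowing through it, and names accumulate a record of how each sequence was made. The function names below (`source`, `point_mutagenize`, `add_barcode`) are hypothetical stand-ins for illustration, not PoolParty's actual API:

```python
# Toy computational graph for oligo-pool design: generators chained
# node-to-node, each emitting (informative_name, sequence) pairs.
import itertools

def source(seqs):
    for s in seqs:
        yield ("wt", s)

def point_mutagenize(stream, alphabet="ACGT"):
    # Emit every single-nucleotide variant of each incoming sequence.
    for name, s in stream:
        for i, base in enumerate(s):
            for b in alphabet:
                if b != base:
                    yield (f"{name}_pos{i}{base}>{b}", s[:i] + b + s[i+1:])

def add_barcode(stream, n=2, length=4, alphabet="ACGT"):
    # Append n distinct barcodes per sequence (deterministic enumeration).
    barcodes = ["".join(p) for p in itertools.product(alphabet, repeat=length)]
    for name, s in stream:
        for j in range(n):
            yield (f"{name}_bc{j}", s + barcodes[j])

library = list(add_barcode(point_mutagenize(source(["ACGT"]))))
print(len(library))  # 4 positions x 3 variants x 2 barcodes = 24
```

The name attached to each oligo ("wt_pos0A>C_bc1", etc.) plays the role of a minimal "design card": the provenance of every sequence is recoverable from the graph traversal that produced it.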
bioinformatics | 2026-04-15 | v2
KyDab - a comprehensive database of antibody discovery selection campaigns.
Zhou, Q.; Chomicz, D.; Melvin, D.; Griffiths, M.; Yahiya, S.; Reece, S.; Le Pannerer, M.-M.; Krawczyk, K.
Abstract
Preclinical antibody discovery relies on progressive screening and down-selection of candidate antibodies from large immune repertoires, yet this critical process is poorly represented in existing public databases. Here we introduce KyDab (Kymouse Antibody Database), a well-curated database of antibody discovery selection data generated using standardized workflows on the Kymouse humanized mouse platform. The current release includes 11 immunisation studies in Kymouse platform mice covering 51 immunogens, more than 120,000 paired heavy-light chain sequences, and binding measurements for a selected subset of experimentally characterized clones. By capturing full-funnel selection data with consistent metadata and both positive and negative experimental outcomes, KyDab provides a valuable data resource for the development and evaluation of artificial intelligence models for antibody discovery. KyDab is accessible at https://kydab.naturalantibody.com, and the database will be continuously updated as new datasets become available.
bioinformatics | 2026-04-15 | v2
Fast and accurate resolution of ecDNA sequence using Cycle-Extractor
Faizrahnemoon, M.; Luebeck, J.; Hung, K. L.; Rao, S.; Prasad, G.; Tsz-Lo Wong, I.; G. Jones, M.; S. Mischel, P.; Y. Chang, H.; Zhu, K.; Bafna, V.
Abstract
Extrachromosomal DNA (ecDNA) plays a key role in cancer pathology. EcDNAs mediate high oncogene amplification and expression and are associated with worse patient outcomes. Accurately determining the structure of these circular molecules is essential for understanding their function, yet reconstructing ecDNA cycles from sequencing data remains challenging. We introduce Cycle-Extractor (CE) for ecDNA cycle reconstruction. CE accepts a breakpoint graph derived from either short- or long-read sequencing data as input and extracts a cycle with the maximum length-weighted copy number. CE utilizes a mixed-integer linear program (MILP) and a separate traversal procedure, enabling fast optimization and compatibility with free solvers. We evaluated CE against CoRAL (long-read-based quadratic optimization), Decoil (long reads), and AmpliconArchitect (AA; short reads) on both simulated data and real cancer cell lines. On simulated ecDNA, CE achieves performance comparable to CoRAL across three accuracy metrics and consistently outperforms AA and Decoil. On cancer cell lines, CE produces longer and heavier cycles than AA, and achieves performance similar to CoRAL. Moreover, CE is, on average, 40x faster than CoRAL. These results demonstrate that CE accurately reconstructs ecDNA from both short- and long-read sequencing data, while long-read inputs allow CE to recover more complete and higher-confidence ecDNA structures. CE improved the prediction of many ecDNA structures. On a PC3 ecDNA containing MYC, CE uses ONT data to reconstruct a substantially larger and higher-copy sequence (4.2 Mbp) compared to the short-read-derived reconstruction (690 Kbp). CRISPR-CATCH experiments confirm the presence of a large ecDNA molecule, validating the long-read-based CE reconstruction.
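The objective CE optimizes can be sketched on a toy breakpoint graph: among all cycles, pick the one maximizing total length-weighted copy number. A brute-force cycle enumeration stands in for the MILP here, and the graph is hypothetical:

```python
# Toy sketch of CE's objective: among cycles in a breakpoint graph,
# maximize the summed edge weight (segment length x copy number).
# Brute-force DFS enumeration stands in for the MILP formulation.
def best_cycle(edges):
    # edges: {node: [(next_node, length*copy_number weight), ...]}
    best = (0.0, None)

    def dfs(start, node, path, weight, visited):
        nonlocal best
        for nxt, w in edges.get(node, []):
            if nxt == start and len(path) > 1:
                if weight + w > best[0]:
                    best = (weight + w, path + [start])
            elif nxt not in visited:
                dfs(start, nxt, path + [nxt], weight + w, visited | {nxt})

    for start in edges:
        dfs(start, start, [start], 0.0, {start})
    return best

graph = {"a": [("b", 5.0)], "b": [("c", 2.0), ("a", 1.0)], "c": [("a", 4.0)]}
score, cycle = best_cycle(graph)
print(score)  # 11.0 (the cycle a->b->c->a beats the shorter a->b->a at 6.0)
```

Real breakpoint graphs add copy-number balance constraints and grow far beyond what enumeration can handle, which is why CE's MILP plus traversal decomposition matters for speed.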
bioinformatics | 2026-04-15 | v2
DIA-CLIP: a universal representation learning framework for zero-shot DIA proteomics
Liao, Y.; Wen, H.; E, W.; Zhang, W.
Abstract
Data-independent acquisition mass spectrometry (DIA-MS) has established itself as a cornerstone of proteomic profiling and large-scale systems biology, offering unparalleled depth and reproducibility. Current DIA analysis frameworks, however, require semi-supervised training within each run for peptide-spectrum match (PSM) re-scoring. This approach is prone to overfitting and lacks generalizability across diverse species and experimental conditions. Here, we present DIA-CLIP, a pre-trained model shifting the DIA analysis paradigm from semi-supervised training to universal cross-modal representation learning. By integrating a dual-encoder contrastive learning framework with an encoder-decoder architecture, DIA-CLIP establishes a unified cross-modal representation for peptides and their corresponding spectral features, achieving high-precision, zero-shot PSM inference. Extensive evaluations across diverse benchmarks demonstrate that DIA-CLIP consistently outperforms state-of-the-art tools, yielding up to a 45% increase in protein identification while achieving a 12% reduction in entrapment identifications. Moreover, DIA-CLIP holds immense potential for diverse practical applications, such as single-cell and spatial proteomics, where its enhanced identification depth facilitates the discovery of novel biomarkers and the elucidation of intricate cellular mechanisms.
bioinformatics | 2026-04-15 | v2
Differential co-localisation analysis of multi-sample and multi-condition experiments with spatialFDA
Emons, M.; Scheipl, F.; Gunz, S.; Purdom, E.; Robinson, M. D.
Abstract
Advances in spatial omics data generation have led to an explosion in new datasets that record the spatial location of transcripts and proteins. However, challenges remain in the analysis of spatial omics data. One important analysis is differential cellular co-localisation (CCoL): the quantification of the clustering, or spacing, of one or more cell types across multiple conditions. Our framework spatialFDA combines methodology from spatial statistics with functional data analysis to accurately quantify and test for differences between conditions in CCoL across spatial scales. Using two simulation studies, we show that spatialFDA performs well in controlled settings. Furthermore, spatialFDA recovers known biological processes in type 1 diabetes and adds insights about the CCoL strength in space. spatialFDA is readily available as an open-source Bioconductor R package.
bioinformatics | 2026-04-15 | v1
U-Probe: universal agentic probe design for imaging-based spatial-omics
Zhang, Q.; Cai, H.; Zhang, J.; Zhang, L.; Wu, X.; Wei, Y.; Chen, Y.; Wu, X.; Su, W.; Qi, W.; Qiu, X.; Cao, G.; Xu, W.
Abstract
Probe design for fluorescence in situ hybridization (FISH) underpins spatial transcriptomics, three-dimensional genome studies, and clinical diagnostics, yet remains constrained by two challenges: dependence on expert knowledge for parameter selection and quality evaluation, and the inability of existing tools to accommodate the diverse probe architectures introduced by rapidly emerging methods. Here we present U-Probe, a universal and agentic probe design platform. U-Probe employs a declarative configuration system with a directed acyclic graph (DAG)-based assembly engine that supports arbitrary probe structures from established protocols such as MERFISH, seqFISH, and MiP-seq to entirely novel architectures without code modifications. Integrated LLM-based AI agents enable conversational design workflows, allowing users to specify experimental goals in natural language and receive synthesis-ready probe sequences. We validated U-Probe in three scenarios: agent-driven MiP-seq panel design for influenza-infected mouse lung tissue, genome-tiling DNA-FISH for herpesvirus detection, and a novel RCA-based ligation probe for single-nucleotide mutation discrimination. U-Probe is available as an open-source tool with CLI, Web, and agent interfaces.
bioinformatics | 2026-04-15 | v1
Discovery of Selective Nrf2 Activators from Natural Products: A Computational Screening Approach to Minimize Off-Target Effects on PXR and CYP2D6
Wang, Y.; Gong, Y.; Li, R.; Li, Z.; Cai, H.; Fan, L.; Ma, H.
Abstract
Nuclear factor erythroid 2-related factor 2 (Nrf2) is a central regulator of cellular antioxidant responses and a highly promising therapeutic target for a range of oxidative stress-related diseases. However, the clinical translation of Nrf2 activators has been hampered by significant off-target effects, notably unintended activation of the pregnane X receptor (PXR) and inhibition of cytochrome P450 2D6 (CYP2D6), which can lead to dangerous drug-drug interactions and metabolic complications. To overcome this critical barrier, we conducted the first large-scale computational screening of 628,898 natural products from the COCONUT database, integrating molecular docking with a rigorous three-tier selectivity strategy designed to prioritize compounds that strongly bind KEAP1 (the primary Nrf2 repressor) while minimizing interactions with PXR and CYP2D6. Our innovative approach identified 10 ultraselective candidates that demonstrate potent KEAP1 affinity, negligible PXR engagement, and only moderate CYP2D6 binding, achieving up to 12.29-fold selectivity for Nrf2 pathway activation. These top hits are structurally novel, enriched in lipid-like and nucleoside-inspired scaffolds, and exhibit promising drug-like properties. By providing both a curated set of chemically diverse, selectivity-optimized leads and a publicly accessible screening dataset, this work establishes a new foundation for the rational development of safer, more precise Nrf2-targeted therapies, bridging a crucial gap between target potential and clinical viability. By prioritizing compounds with minimal off-target effects on PXR and CYP2D6, our approach offers a scalable template for reducing drug development failures and advancing safer therapeutics for oxidative stress-related diseases.
bioinformatics | 2026-04-15 | v1
CROssBARv2: A Unified Computational Framework for Heterogeneous Biomedical Data Representation and LLM-Driven Exploration
Sen, B.; Ulusoy, E.; Darcan, M.; Ergun, M.; Lobentanzer, S.; Rifaioglu, A. S.; Turei, D.; Saez-Rodriguez, J.; Dogan, T.
Abstract
Biomedical discovery is hindered by fragmented, modality-specific repositories and uneven metadata, limiting integrative analysis, accessibility, and reproducibility. To address these challenges, we present CROssBARv2, a provenance-rich biomedical data-and-knowledge integration platform that unifies heterogeneous sources into a maintainable, scalable system. By consolidating diverse data types into an extensive knowledge graph enriched with standardised ontologies, rich metadata, and deep learning based vector embeddings, CROssBARv2 alleviates the need for researchers to navigate multiple siloed databases and can facilitate downstream tasks, including predictive modelling and mechanistic reasoning, enabling applications such as drug repurposing and protein function prediction. The platform offers interactive graph exploration and embedding-based semantic search with CROssBAR-LLM, an intuitive natural language question-answering system that grounds large language model (LLM) outputs in the underlying knowledge graph to mitigate hallucinations. We assess CROssBARv2 through (i) multiple use-case analyses to test biological coherence and relational validity; (ii) knowledge-augmented biomedical question-answering benchmarks comparing CROssBAR-LLM against generalist LLMs; and (iii) a deep learning based predictive modelling experiment for protein function prediction leveraging the heterogeneous structure of CROssBARv2. Collectively, CROssBARv2 provides a scalable, AI-ready, and user-friendly foundation that facilitates hypothesis generation, knowledge discovery, and translational research.
bioinformatics | 2026-04-15 | v1
RapCluster: Bridging the Reproducibility Gap in Clustering Analysis
Lutfi, A.; Warneke, R.; Fischer, L.; Rappsilber, J.
Abstract
Clustering is ubiquitous across science, yet a text-mining audit of 736,399 open-access articles identified as using clustering (2000-2025) reveals that common practice leaves key parameters undocumented or untuned, contributing to the reproducibility crisis in science. We developed an interactive web platform featuring 11 widely adopted clustering algorithms to enable transparent clustering analysis and reporting, aligning practical use with best practices in computational research.
bioinformatics | 2026-04-15 | v1
Decoding Single-Cell Omics of Perturbation Responses Using DeSCOPE
Wu, P.; Wei, H.; Li, Y.; Zheng, X.; Zhou, C.; Hu, X.; Wang, C.
Abstract
Deciphering cellular responses to genetic perturbations is fundamental to modeling gene regulatory networks and understanding mechanisms that change cellular phenotypes. However, current computational approaches often fail to outperform simple baseline models, highlighting a critical bottleneck in their generalizability and robustness. Here, we present DeSCOPE, a lightweight conditional variational autoencoder framework for predicting genetic perturbation responses spanning transcriptomic, epigenomic, and broader multi-modal landscapes. We systematically benchmarked DeSCOPE across diverse datasets under two challenging out-of-distribution settings: unseen genes and unseen cell types. DeSCOPE uniquely surpasses simple baselines in the unseen gene scenario, and achieves substantially improved performance for unseen cell types while requiring fine-tuning with far fewer perturbed genes. Finally, DeSCOPE demonstrates superior performance in predicting combinatorial multi-gene perturbations. Overall, DeSCOPE serves as a versatile multi-modal virtual cell model that can effectively guide the design of therapeutic targets that change cellular phenotypes. DeSCOPE is available at https://github.com/wanglabtongji/DeSCOPE.
bioinformatics | 2026-04-15 | v1
Sex-biased gene expression shapes sex differences in gene essentiality
Rocca, C.; DeCasien, A. R.
Abstract
Sex differences in disease incidence and progression are well documented, yet their underlying molecular mechanisms remain poorly understood. Multiple models suggest that baseline gene expression levels shape the impact of gene disruption, raising the possibility that sex-biased expression itself contributes to sex differences in cellular vulnerability. Here, we test this hypothesis by integrating sex-biased transcriptomic profiles with large-scale CRISPR loss-of-function screens to determine whether sex-biased expression predicts sex-biased gene essentiality across the genome. We find that gene expression level and sex chromosome dosage each explain a modest fraction of variance in essentiality, with substantially larger effects for sex chromosome genes than for autosomes. Across genes, sex effects on expression and essentiality are small in magnitude but directionally aligned, suggesting that sex differences in transcription can influence functional dependency. To resolve how these relationships arise, we applied gene-level mediation analyses to decompose sex effects on essentiality into expression-mediated and expression-independent components. This approach revealed multiple mechanistic architectures. On autosomes, most genes exhibited either sex-biased essentiality from direct sex effects (independent of expression) or sex-biased expression without functional consequence, while expression-mediated sex differences accounted for a smaller but substantial fraction of genes. In contrast, X chromosome genes were dominated by direct, expression-independent sex differences, consistent with strong effects of sex chromosome dosage, but also showed enrichment of expression-mediated architectures, particularly among X gametologs. Together, our results demonstrate that while sex-biased expression can generate sex-biased gene essentiality, this mechanism is not the default. Instead, sex-biased functional dependency is often driven by direct, expression-independent effects, particularly on the X chromosome, where dosage and compensatory mechanisms play a dominant role.
bioinformatics | 2026-04-15 | v1
Benchmarking precision matrix estimation methods for differential co-expression network analysis
Overmann, M.; Grabert, G.; Kacprowski, T.
Abstract
Background: Gene expression profiling is widely used to investigate disease mechanisms, but classical approaches such as differential expression or pairwise correlation analyses provide limited interpretability. Network-based differential co-expression methods that model conditional dependencies through partial correlations offer richer insights, yet their application in high-dimensional settings requires estimation of precision matrices. Numerous precision matrix estimation methods (PMEMs) have been proposed, but their relative performance under various conditions remains unclear. Results: Simulated gene expression datasets with known ground truth correlation structures were used to benchmark a broad set of PMEMs. Performance was strongly affected by data characteristics, including covariance structure, matrix density, covariance values, sample size-to-dimension ratio, and sampling distribution. Among the evaluated methods, GLassoElnetFast consistently showed the highest accuracy in recovering differential edges, although high signal-to-noise ratios and sufficient sample sizes remain essential for reliable inference. Conclusions: Evaluation across diverse simulation conditions demonstrated that no single metric or condition was sufficient to assess PMEM performance. Therefore, previous less extensive evaluations risked misleading conclusions. Our simulation and benchmarking framework supports future method development and ensures reproducible evaluation of newly developed approaches.
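The quantity these benchmarks target is the partial correlation derived from a precision matrix P, r_ij = -P_ij / sqrt(P_ii * P_jj), which encodes conditional dependencies between genes. A minimal NumPy sketch using naive covariance inversion in a low-dimensional, well-sampled toy case; the benchmarked PMEMs replace this inversion when the number of genes approaches or exceeds the sample size:

```python
import numpy as np

rng = np.random.default_rng(0)
# Ground-truth precision matrix with one conditional dependency (genes 0-1).
P_true = np.array([[2.0, 0.8, 0.0],
                   [0.8, 2.0, 0.0],
                   [0.0, 0.0, 2.0]])
# Sample expression data from the implied Gaussian (covariance = inv(P)).
X = rng.multivariate_normal(np.zeros(3), np.linalg.inv(P_true), size=5000)
# Naive precision estimate: invert the sample covariance (only valid n >> p).
P_hat = np.linalg.inv(np.cov(X, rowvar=False))
# Partial correlations: r_ij = -P_ij / sqrt(P_ii * P_jj).
d = np.sqrt(np.diag(P_hat))
partial_corr = -P_hat / np.outer(d, d)
np.fill_diagonal(partial_corr, 1.0)
print(partial_corr.round(2))  # ~-0.4 between genes 0 and 1, ~0 elsewhere
```

In the differential setting, this estimation is repeated per condition and the edge-wise differences in partial correlation are tested, which is where the choice of PMEM and the signal-to-noise regime become critical.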
bioinformatics | 2026-04-15 | v1
Beyond Structure and Affinity: Context-Dependent Signals for de novo Binder Success
Bozkurt, C.
Abstract
De novo protein binder design has advanced rapidly, yet most designs fail experimentally and current structure- and affinity-centred evaluation does not reliably predict which candidates will succeed. Here we show that biology-informed sequence features, derived from models trained on natural proteins, identify transferable and context-dependent associations with binder expression and binding that are not captured by structural scoring alone. We re-analysed two public benchmarks - the Bits to Binders CAR-T CD20 competition (11,984 designs; expression, proliferation, and T cell function gates) and the Adaptyv EGFR competition (603 designs; expression and binding affinity) - using five biology-informed ML models predicting disorder, amyloidogenicity, topology, PTM sites, and protein classification. Every feature was tested at each gate with FDR-corrected statistics. We identify three layers of signal. Transferable: lower aggregation propensity is the most robust cross-benchmark signal; PTM-site density recurs univariately but is partly length-confounded in EGFR. Architecture-dependent: topology, disorder, and disulfide-related descriptors are significant in both datasets but flip direction, consistent with the different requirements of CAR extracellular domains versus standalone binders. Context-specific: phosphorylation-related associations with CAR-T depletion and low-disorder dominance in EGFR binding are tied to individual assay or format contexts. In the CAR-T benchmark, stacking biology-informed filters raises the enrichment hit rate from 13.8% to 38.6% (2.8x lift) after controlling for known sequence-level predictors. These results suggest that pre-synthesis screening of de novo binders may benefit from being multi-gate and context-aware, using biology-informed sequence descriptors not only to rank candidates but also to help flag likely failure modes earlier and reduce wasted synthesis and testing.
bioinformatics | 2026-04-15 | v1
π-MSNet: A billion-scale, AI-ready living proteomics data portal
Dai, C.; Liu, Y.; Ling, T.; Qiu, Y.; Xu, H.; Zhang, Q.; Huang, X.; Zhu, Y.; Sachsenberg, T.; Bai, M.; He, F.; Perez-Riverol, Y.; Xie, L.; Chang, C.
Abstract
Artificial intelligence (AI) is reshaping proteomics workflows, delivering remarkable gains in both peptide identification sensitivity and quantitative performance. However, the potential of deep learning models in proteomics has not been fully exploited due to the scarcity of large-scale, high-quality and consistently labeled datasets. Here, we present π-MSNet, a billion-scale, AI-ready living mass spectrometry (MS) data portal. Using a uniform identification and quality control workflow, it comprises over 1.66 billion MS/MS spectra, 501 million peptide-spectrum matches (PSMs), and 9 million precursors from 36,356 LC-MS/MS runs across ten instrument types and 55 diverse species. Through community collaboration, the data are shared via international, interactive, and living web resources. Enabled by the built-in MSNetLoader Python API for seamless and scalable data access, with native support for PyTorch and TensorFlow, π-MSNet provides an AI-ready data framework for efficient training and systematic benchmarking of multiple models across three representative tasks (MS/MS spectrum prediction, retention time prediction, and de novo peptide sequencing). In particular, by retraining multiple models on π-MSNet, we achieved consistent performance improvements over their original versions. These improved models were subsequently integrated into the π-MSNet agent to enable interactive, deployment-free use. Through SDRF (Sample and Data Relationship Format) metadata, an open-source cloud analysis workflow, and a community-driven interactive data portal that supports continuous data submission, π-MSNet serves as a living, AI-ready resource for reproducible benchmarking, robust model training, and accelerated AI innovation in proteomics.
bioinformatics2026-04-15v1TSvelo: Comprehensive RNA velocity by modeling cascade of gene regulation, transcription and splicing
Li, J.; Wang, Z.; Shen, H.-B.; Yuan, Y.Abstract
RNA velocity approaches fit gene dynamics and infer cell fate by modeling the splicing process using single-cell RNA sequencing (scRNA-seq) data. However, due to the short time scale of splicing and the high noise and complexity of the data, existing RNA velocity methods often fail to precisely capture the velocity dynamics of individual genes and single cells, which makes downstream analyses less reliable and less robust. We propose TSvelo, a comprehensive mathematical RNA velocity framework that models the cascade of gene regulation, Transcription and Splicing using highly interpretable neural Ordinary Differential Equations (ODEs). TSvelo can precisely capture the transcription-unspliced-spliced 3D dynamics of all genes simultaneously, infer a unified latent time shared by genes within each single cell, and be applied to multi-lineage datasets. Experiments on six scRNA-seq datasets, including two multi-lineage datasets, demonstrate TSvelo's superiority.
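For context, the classical splicing-kinetics ODEs that RNA velocity methods build on, and that form the final stage of TSvelo's regulation-transcription-splicing cascade, can be sketched as below; the constant rates and forward-Euler integration are a textbook illustration, not TSvelo's neural-ODE implementation.

```python
import numpy as np

def simulate_splicing(alpha, beta, gamma, t_max=10.0, dt=0.01):
    """Forward-Euler integration of the standard two-equation splicing model:
        du/dt = alpha - beta * u       (transcription minus splicing)
        ds/dt = beta * u - gamma * s   (splicing minus degradation)
    Returns unspliced (u) and spliced (s) abundances over time; the RNA
    velocity of a gene is ds/dt evaluated at the observed (u, s)."""
    n = int(t_max / dt)
    u, s = np.zeros(n), np.zeros(n)
    for i in range(1, n):
        u[i] = u[i - 1] + dt * (alpha - beta * u[i - 1])
        s[i] = s[i - 1] + dt * (beta * u[i - 1] - gamma * s[i - 1])
    return u, s

u, s = simulate_splicing(alpha=2.0, beta=1.0, gamma=0.5)
# trajectories approach the steady state u* = alpha/beta, s* = alpha/gamma
```

TSvelo additionally makes the transcription rate itself a learned function of regulator expression, which is what lets it model the upstream regulation stage of the cascade.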
bioinformatics2026-04-14v4TPCAV: Interpreting deep learning genomics models via concept attribution
Yang, J.; Mahony, S.Abstract
Interpreting genomics deep learning models remains challenging. Existing feature attribution methods are largely restricted to one-hot DNA inputs and therefore cannot assess the influence of more general genomic features such as chromatin states or genomic repeats. Concept attribution methods offer an input-agnostic global interpretation framework, yet they have not been systematically applied to interpret neural network applications in genomics. We present the first application of concept attribution to interpret genomics deep learning models by adapting the Testing with Concept Activation Vectors (TCAV) method. We improve upon the original TCAV method by incorporating a PCA-based decorrelation transformation to address correlated and redundant embedding features commonly observed in genomics deep learning models, resulting in the Testing with PCA-projected Concept Activation Vectors (TPCAV) approach. We also introduce a strategy for extracting concept-specific input attribution maps. We evaluate our approach by interpreting influential biological concepts across a diverse set of genomics models spanning multiple input representations and prediction tasks. We demonstrate that TPCAV provides comparable motif feature interpretation to TF-MoDISco on one-hot encoded DNA-based transcription factor binding prediction models. TPCAV also enables robust interpretive analysis of how more general biological concepts such as repetitive elements and chromatin state annotations contribute towards predictions. TPCAV uniquely generalizes to interpret features learned by tokenized foundation models as well as models incorporating chromatin signals as inputs. We further show that TPCAV can identify representative regions associated with specific concepts, motivating downstream investigation of distinct regulatory mechanisms. TPCAV provides a flexible and robust complement to existing model interpretation techniques.
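The TCAV recipe that TPCAV builds on, with the added PCA decorrelation step, can be sketched on synthetic data; the embeddings, the logistic-regression separator, and the stand-in "gradients" below are illustrative assumptions, not the authors' pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic layer embeddings: concept examples shifted along one direction
random_emb = rng.normal(size=(200, 50))
concept_emb = rng.normal(size=(200, 50)) + np.eye(50)[0] * 3.0

# Step 1 (the "P" in TPCAV): decorrelate embeddings with a PCA projection
pca = PCA(n_components=20).fit(np.vstack([random_emb, concept_emb]))
Xc, Xr = pca.transform(concept_emb), pca.transform(random_emb)

# Step 2: the concept activation vector (CAV) is the normal of a linear
# separator between concept and random examples
clf = LogisticRegression(max_iter=1000).fit(
    np.vstack([Xc, Xr]), np.r_[np.ones(200), np.zeros(200)]
)
cav = clf.coef_[0] / np.linalg.norm(clf.coef_[0])

# Step 3: TCAV score = fraction of input gradients (synthetic here) whose
# directional derivative along the CAV is positive
grads = rng.normal(size=(100, 50)) + np.eye(50)[0]
grads_p = grads @ pca.components_.T   # map gradients, no mean-centering
tcav_score = float(np.mean(grads_p @ cav > 0))
```

In a real setting the embeddings are a model's internal-layer activations for concept versus random inputs, and the directional derivatives come from the model's prediction gradients.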
bioinformatics2026-04-14v4GRASP: Gene-relation adaptive soft prompt for scalable and generalizable gene network inference with large language models
Feng, Y.; Deng, K.; Guan, Y.Abstract
Gene networks (GNs) encode diverse molecular relationships and are central to interpreting cellular function and disease. The heterogeneity of interaction types has led to computational methods specialized for particular network contexts. Large language models (LLMs) offer a unified, language-based formulation of GN inference by leveraging biological knowledge from large-scale text corpora, yet their effectiveness remains sensitive to prompt design. Here, we introduce Gene-Relation Adaptive Soft Prompt (GRASP), a parameter-efficient and trainable framework that conditions inference on each gene pair through only three virtual tokens. Using factorized gene-specific and relation-aware components, GRASP learns to map each pair's biological context into compact soft prompts that combine pair-specific signals with shared interaction patterns. Across diverse GN inference tasks, GRASP consistently outperforms alternative prompting strategies. It also shows a stronger ability to recover unannotated interactions from synthetic negative sets, suggesting its capacity to identify biologically meaningful relationships beyond existing databases. Together, these results establish GRASP as a scalable and generalizable prompting framework for LLM-based GN inference.
bioinformatics2026-04-14v2Beyond Single Algorithms: A Framework for Validating and Aggregating Active Modules in Genetic Interaction Networks
Liu, J.; Xu, M.; Xing, J.Abstract
High-throughput sequencing methods have generated vast amounts of genetic data for candidate gene studies. However, the complexity of disease genetic architecture often results in a large number of candidate genes and poses a significant challenge for these studies. To explore multi-gene interactions and elucidate genetic mechanisms, candidate genes are often analyzed through Gene-Gene Interaction (GGI) networks. These networks can become very large, necessitating efficient methods to reduce their complexity. Active Module Identification (AMI) is a common approach to analyzing GGI networks by identifying enriched subnetworks representing relevant biological processes. Multiple AMI algorithms have been developed for biological datasets, and a comparative analysis of their behaviors across a variety of datasets is crucial to their application. In this study, we introduce a framework to compare and aggregate the modules produced by multiple AMI algorithms. We first used a modified Empirical Pipeline to validate the output of four AMI algorithms -- PAPER, DOMINO, FDRnet, and HotNet2 -- and found that no single algorithm performs well across the different datasets. Using the Earth Mover's Distance to measure pairwise module similarity, we found that the outputs of different algorithms are structurally distinct, suggesting that each captures different aspects of the underlying biology. These findings suggest that a comprehensive analysis requires the aggregation of outputs from multiple algorithms. We propose two methods to this end: a spectral clustering approach for module aggregation, and an algorithm that combines modules with similar network structures, called Greedy Conductance-based Merging (GCM). The merging algorithm not only allows researchers to obtain a set of cohesive modules from multiple algorithms, but also has the potential to identify "hidden" genes from the network that are not present in the original input data.
Overall, our results advance our understanding of AMI algorithms and how they should be applied. Tools and workflows developed in this study will facilitate researchers working with GGI and AMI algorithms to enhance their analyses. Our code is freely available at https://github.com/LiuJ0/AMI-Benchmark/.
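The conductance score underlying the Greedy Conductance-based Merging step measures how well a module separates from the rest of the network; a minimal sketch on a toy graph (not the paper's GGI data) might look like:

```python
from itertools import combinations

def conductance(edges, module, nodes):
    """Conductance of a node set S: boundary edges / min(vol(S), vol(rest)).
    Lower values indicate a more cohesive, well-separated module."""
    S = set(module)
    cut = sum(1 for u, v in edges if (u in S) != (v in S))
    deg = {n: 0 for n in nodes}
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    vol_s = sum(deg[n] for n in S)
    vol_rest = sum(deg[n] for n in nodes if n not in S)
    return cut / max(min(vol_s, vol_rest), 1)

# Two 5-cliques joined by a single bridge edge: each clique is a strong module
edges = (list(combinations(range(5), 2))
         + list(combinations(range(5, 10), 2))
         + [(0, 5)])
score = conductance(edges, range(5), range(10))  # 1 cut edge over volume 21
```

A greedy merger in this spirit would repeatedly combine candidate modules whenever the merge keeps conductance low; that acceptance rule is an assumption here, as the paper defines GCM's exact criterion.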
bioinformatics2026-04-14v2From Movement to METs: A Validation of ActTrust(R) for Energy Expenditure Estimation and Physical Activity Classification in Young Adults
dos Santos Batista, E.; Basilio Gomes, S. R.; Bruno de Morais Ferreira, A.; Franca, L. G. S.; Fontenele Araujo, J.; Mortatti, A. L.; Leocadio-Miguel, M. A.Abstract
Estimating physical activity (PA) levels is a challenging and expensive task. An alternative is the use of actigraphy devices to estimate PA, which has previously been done for a number of devices, including the ActiGraph(R) GT3X+. In this study, we validated the ActTrust(R) against the widely used GT3X+ and compared activity counts to metabolic equivalents (METs) derived from indirect calorimetry during treadmill walking and running. Fifty-six young adults (34 men, 22 women) participated in controlled effort exercises spanning light, moderate, vigorous, and very vigorous activity intensities. We developed a linear model to estimate energy expenditure (EE) from movement counts for combinations of devices placed at the hip or wrist. We then estimated cut-off points for each intensity range. Our results showed correlations between treadmill speed and both METs (r = 0.95, p < 0.05) and movement counts from both GT3X+ and ActTrust devices placed either on the hip (r = 0.94, p < 0.05; r = 0.93, p < 0.05) or on the wrist (r = 0.88, p < 0.05; r = 0.88, p < 0.05), respectively. Our proposed model performed well, with balanced accuracies above 0.77 for all intensity ranges and over 0.9 for light and moderate activity. This is the first study to model, estimate, and validate PA intensity thresholds for ActTrust(R) devices. Our findings support the use of ActTrust(R) devices as a simple, cost-effective tool for 24-hour assessments of EE.
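The count-to-MET regression and intensity cut-offs described above can be sketched as follows; all counts, MET values, and the derived thresholds are hypothetical placeholders, not the coefficients fitted in the study.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical activity counts (counts/min) with corresponding measured METs
counts = np.array([[500.0], [1500.0], [3000.0], [5000.0], [7000.0], [9000.0]])
mets = np.array([1.8, 2.9, 4.1, 5.8, 7.4, 9.0])

model = LinearRegression().fit(counts, mets)

def count_cutoff(met_threshold):
    """Invert the fitted line to find the count value at a MET threshold
    (conventionally: light < 3 METs, moderate 3-6 METs, vigorous >= 6 METs)."""
    return float((met_threshold - model.intercept_) / model.coef_[0])

moderate_cut = count_cutoff(3.0)   # counts/min where activity turns moderate
vigorous_cut = count_cutoff(6.0)
```

With per-intensity cut-offs in hand, classifying a recording reduces to bucketing each epoch's count against these thresholds, which is how balanced accuracy per intensity range can then be evaluated.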
bioinformatics2026-04-14v2found: Inferring cell-level perturbation from structured label noise in single-cell data
Afanasiev, E.; Goeva, A.Abstract
Recent work by Goeva et al. introduced HiDDEN, a method for refining batch-level labels to infer cell-level perturbation without prior knowledge of affected populations, addressing the mismatch between sample-level labels and heterogeneous perturbation effects across cells. Here, we present found, a Python and R implementation of HiDDEN, supporting pipeline customization, by-factor grouping, hyperparameter selection, and visualization. Through benchmarking across diverse datasets, we show that performance depends strongly on modeling choices, particularly regression, grouping, and embedding dimensionality. found provides a practical, flexible, and accessible framework for robust cell-level perturbation analysis.
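The relabeling idea behind HiDDEN that found implements can be sketched with synthetic data and a plain logistic regression; the data shapes and classifier below are illustrative assumptions, not found's API.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Structured label noise: every cell in the perturbed batch is labeled 1,
# but only half of those cells actually respond to the perturbation.
control = rng.normal(0.0, 1.0, size=(300, 10))          # label 0
unaffected = rng.normal(0.0, 1.0, size=(150, 10))       # label 1, no effect
affected = rng.normal(0.0, 1.0, size=(150, 10)) + 3.0   # label 1, responders
X = np.vstack([control, unaffected, affected])
y = np.r_[np.zeros(300), np.ones(300)]

# Fit on the noisy batch-level labels; per-cell predicted probabilities then
# rank true responders above mislabeled non-responders, which is the signal
# the label-refinement step exploits.
p = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]
```

In practice the features would be an embedding of the expression matrix, and the probability-to-label step is where modeling choices (regression type, grouping, dimensionality) matter, as the benchmarking in the abstract emphasizes.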
bioinformatics2026-04-14v1Identification of the novel inhibitors against M. tuberculosis ESX-1 secretion system EccA1 enzyme using virtual screening, docking and dynamics simulation techniques
Kumar, R.; Saxena, A. K.Abstract
The M. tuberculosis ESX-1 secretion system EccA1 enzyme is involved in the secretion of virulence factors and is essential for virulence and bacterial survival within the phagosome. Development of small-molecule inhibitors abolishing EccA1 function could yield new antivirulence drugs. In this study, we modeled the full-length EccA1 (573 residues, Mw ~62.4 kDa) structure, which contains an N-terminal TPR domain and a C-terminal CbxX/CfqX-type ATPase domain. Through virtual screening of ZINC compounds targeting the C-terminal ATPase pocket of EccA1, we identified five compounds with the following binding energies: Z1 (ZINC000004513760, -43.45 kcal/mol), Z2 (ZINC000000001793, -49.56 kcal/mol), Z3 (ZINC000005390388, -55.83 kcal/mol), Z4 (ZINC000257294577, -52.33 kcal/mol), and Z5 (ZINC000004824264, -44.44 kcal/mol). The Z1-Z5 compounds were compared against the ADP substrate (adenosine diphosphate, -35.00 kcal/mol) and the p97 ATPase inhibitors NMS873 (3-[3-cyclopentylsulfanyl-5-[[3-methyl-4-(4-methylsulfonylphenyl)phenoxy]methyl]-1,2,4-triazol-4-yl]pyridine, -48.68 kcal/mol) and CB5083 (1-[4-(benzylamino)-5H,7H,8H-pyrano[4,3-d]pyrimidin-2-yl]-2-methyl-1H-indole-4-carboxamide, -50.88 kcal/mol) bound to EccA1. The Z1-Z5 compounds exhibited good Absorption, Distribution, Metabolism, Excretion, and Toxicity (ADMET) properties, and their pharmacokinetic profiles and compliance with Lipinski's rule of five indicated drug-likeness. A 100 ns dynamics simulation analysis of EccA1 complexed with (i) the Z1-Z5 compounds, (ii) the ADP substrate, and (iii) the NMS873 and CB5083 inhibitors showed high stability and biologically relevant conformations. These data indicate that the Z1-Z5 compounds may act as potential inhibitors of EccA1 and provide avenues for new antivirulence drug development, pending in vitro and in vivo validation and clinical trials.
bioinformatics2026-04-14v1A Machine Learning Approach for Physiological Role Prediction in Protein Contact Networks: a large-scale analysis on the human proteome
Cervellini, M.; Martino, A.Abstract
Proteins are fundamental macromolecules involved in virtually all biological processes. Their physiological roles are tightly linked to their three-dimensional structure, which can be naturally abstracted as Protein Contact Networks (PCNs), i.e., graphs where residues are nodes and edges encode spatial proximity. This representation enables the application of Graph Machine Learning to address the protein functional annotation gap at proteome scale. In this work, protein function prediction is studied on the majority of the human proteome, focusing on enzymatic activity and enzyme class assignment as well-defined and biologically meaningful targets. A large-scale supervised analysis was conducted on PCNs derived from experimentally resolved human protein structures. Multiple graph-based learning paradigms were systematically compared under a unified evaluation protocol, including handcrafted graph embeddings, kernel methods, and end-to-end Graph Neural Networks (GNNs). Feature engineering approaches comprised (i) spectral density embeddings of the normalized graph Laplacian and (ii) higher-order topological representations based on simplicial complexes, with optional INDVAL-based feature selection. These representations were paired with linear, ensemble, and kernel classifiers, while GNNs were trained directly on raw PCNs exploiting a diverse set of message-passing architectures. Two tasks were considered: binary classification of enzymatic versus non-enzymatic proteins and multiclass prediction of first-level Enzyme Commission (EC) classes. Performance was assessed using repeated stratified splits to ensure robust and variance-aware evaluation. In the binary enzymatic classification task, the Jaccard-based graph kernel achieved the best performance with an adjusted balanced accuracy of 0.90, closely followed by GNNs trained end-to-end on PCNs. 
In the multiclass EC prediction task, GNNs demonstrated superior discriminative power, reaching an adjusted balanced accuracy of 0.92 and outperforming all explicit embedding and kernel-based approaches. Overall, results indicate that EC class prediction is intrinsically more complex than binary enzymatic discrimination and benefits from the higher expressivity of deep message-passing architectures. The findings demonstrate that graph-based representations of protein structure support competitive functional prediction at proteome scale, with classical kernel methods and modern GNNs offering complementary strengths in terms of accuracy, scalability, and flexibility.
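The Protein Contact Network abstraction used throughout the study can be sketched directly: residues become nodes and spatial proximity defines edges. The distance window below follows a common PCN convention (exclude covalently adjacent residues, keep mid-range contacts), not necessarily the exact thresholds the authors used.

```python
import numpy as np

def contact_network(coords, lo=4.0, hi=8.0):
    """Build a Protein Contact Network adjacency matrix from residue
    coordinates (e.g., C-alpha atoms): residues i and j are connected when
    their Euclidean distance falls within [lo, hi] Angstroms."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    A = ((d >= lo) & (d <= hi)).astype(int)
    np.fill_diagonal(A, 0)
    return A

# Toy chain of residues spaced 3.8 Angstroms apart along a straight line
coords = np.arange(6, dtype=float)[:, None] * np.array([3.8, 0.0, 0.0])
A = contact_network(coords)
# covalent neighbours (3.8 A) fall below the lower cutoff; second
# neighbours (7.6 A) land inside the window and become edges
```

Graph-level features (spectral densities, simplicial complexes) or GNN message passing then operate on adjacency matrices of this form.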
bioinformatics2026-04-14v1Predicting Pre-treatment Resistance or Post-treatment Effect? A Systematic Benchmarking of Single-Cell Drug Response Models
Shen, L.; Sun, X.; Zheng, S.; Hashmi, A.; Eriksson, J.; Mustonen, H.; Seppänen, H.; Shen, B.; Li, M.; Vähä-Koskela, M.; Tang, J.Abstract
Intratumoral heterogeneity is a major driver of variable drug responses in cancer. Single-cell RNA sequencing (scRNA-seq) enables the characterization of such heterogeneity, providing an opportunity to predict drug response at single-cell resolution. As a result, a growing number of computational models have been developed to infer drug response from scRNA-seq datasets. However, their performance, robustness, and generalizability across different biological contexts have not been systematically evaluated. To address this gap, we conducted a comprehensive benchmarking of representative single-cell drug response prediction models. Using 26 curated datasets comprising over 760,000 cells across 12 cancer types and 21 therapeutic agents, we constructed balanced and imbalanced scenarios to reflect more realistic distributions of drug response labels. To address the lack of ground-truth drug-response labels in conventional scRNA-seq datasets, we further incorporated lineage-tracing data with experimentally validated drug-response annotations, enabling model evaluation in a clinically relevant pre-treatment prediction setting. Our results show that across the tested methods, prediction performance is markedly higher in cell lines than in tissue samples. Under imbalanced conditions, most methods exhibited sharp performance declines, whereas scDEAL demonstrated the highest robustness. Independent validation using an in-house pancreatic ductal adenocarcinoma dataset further confirms the robustness of scDEAL and its ability to capture biologically meaningful state transitions. A label-substitution experiment revealed that this robust performance is partially driven by the model's specific training label construction. However, the benchmarking with lineage-tracing data reveals a fundamental limitation: most models capture drug-induced transcriptional changes but struggle to predict a cell's intrinsic resistance state prior to treatment.
In summary, our study not only defines the performance boundaries of current approaches but also highlights their limitations in addressing intratumoral heterogeneity, extreme class imbalance, and the prediction of intrinsic cellular resistance, emphasizing the need for the development of next-generation single-cell drug response models with stronger clinical relevance.
bioinformatics2026-04-14v1A correlational study of ABCA3 and SCN4B as exercise-related biomarkers of patients with Stanford type A aortic dissection
Qiao, S.; Chen, T.; Xie, B.; Han, Y.; Wang, B.; Li, Y.; Jia, B.; Wu, N.Abstract
Background: Accumulating evidence indicates that moderate exercise may reduce the incidence of Stanford type A aortic dissection (TAAD), but the specific mechanisms remain unclear. This study aims to identify exercise-related biomarkers in TAAD patients and to investigate their underlying mechanisms. Methods: Transcriptome data related to TAAD and exercise-related genes were obtained from publicly available databases. Candidate biomarkers for TAAD were identified through an integrative approach incorporating differential expression analysis, machine learning, and expression level assessment, leading to the construction of a diagnostic model. Subsequently, functional enrichment, immune infiltration, regulatory network analysis, and computational drug prediction were conducted to systematically investigate the pathological mechanisms and translational potential of the identified biomarkers. Results: ABCA3 and SCN4B were identified as exercise-related biomarkers in TAAD progression. A nomogram incorporating these two biomarkers exhibited strong diagnostic performance for identifying the disease. Functional enrichment analysis revealed potential involvement of these biomarkers in disease progression through pathways including circadian rhythm regulation and ribosome biosynthesis. Additionally, immune cells such as M1 macrophages and naive B cells, as well as regulatory factors including hsa-miR-1343-3p and XIST, were found to be involved in this process. Finally, zonisamide and MRS1097 were identified through computational prediction as potential therapeutic drugs. Conclusion: ABCA3 and SCN4B were identified as exercise-related biomarkers associated with TAAD and represent potentially valuable targets for both diagnosis and treatment strategies.
bioinformatics2026-04-14v1SPEAR: Predicting Gene Expression from Single-Cell Chromatin Accessibility
Walter-Angelo, T.; Uzun, Y.Abstract
Single-cell multiome assays enable direct measurement of chromatin accessibility and gene expression within the same cell. Still, most experimental designs remain constrained to two (and, less commonly, three) modalities per cell. This limitation motivates computational models that can predict unmeasured layers and, simultaneously, help dissect how cis-regulatory accessibility relates to transcription at gene resolution. Existing cross-modal methods often prioritize latent alignment or modality reconstruction, making it difficult to isolate the impact of model inductive bias under a shared cis-regulatory feature definition. We present SPEAR, a configuration-driven framework for gene-centric regression of single-cell gene expression from chromatin accessibility using a fixed transcription-start-site-centered representation shared across model families. Here we show that, under identical features, splits, and evaluation, model performance stratifies reproducibly across two multiome systems (mouse embryonic development and human hemogenic endothelium), with transformer encoders achieving the strongest mean test correlations (0.546 and 0.470, respectively). Per-gene performance distributions reveal substantial heterogeneity in predictability, indicating that accessibility-driven signal is concentrated in a subset of genes across contexts. Shapley value-based feature attribution further localizes predictive signal to promoter-proximal bins, with feature importance decaying with distance from the transcription start site, supporting a promoter-centered regime of cis-regulatory control within the modeled window. Together, these results provide a controlled comparison of inductive biases for chromatin-to-expression prediction and deliver analysis-ready outputs for gene-level interpretation. SPEAR is open source and publicly available for use at https://github.com/UzunLab/SPEAR.
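A TSS-centered accessibility representation of the kind SPEAR fixes across model families can be sketched as simple binned aggregation; the window and bin sizes below are illustrative placeholders, not SPEAR's configuration.

```python
import numpy as np

def tss_features(peaks, tss, window=50_000, bin_size=5_000):
    """Aggregate ATAC peak signal into fixed-width bins centered on a gene's
    transcription start site, giving every gene, and every model family, the
    same cis-regulatory feature definition. `peaks` is a list of
    (start, end, signal) tuples on the gene's chromosome."""
    n_bins = 2 * window // bin_size
    feats = np.zeros(n_bins)
    for start, end, signal in peaks:
        mid = (start + end) // 2
        offset = mid - (tss - window)       # position relative to window start
        if 0 <= offset < 2 * window:
            feats[offset // bin_size] += signal
    return feats

# Two peaks around a TSS at position 100,000: one upstream, one downstream
feats = tss_features([(99_000, 99_500, 2.0), (120_000, 121_000, 1.0)],
                     tss=100_000)
```

Because the feature definition is identical for every model, any performance difference between, say, a linear head and a transformer encoder can be attributed to inductive bias rather than input representation.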
bioinformatics2026-04-14v1MAJEC: unified gene, isoform, and locus-level transposable element quantification from RNA-seq
Lim, T.-Y.; Firestone, A. J.Abstract
Background: The study of transposable elements (TEs) has become increasingly central to fields such as cancer biology, immunology, and aging. Accurately quantifying disease- or laboratory-mediated perturbations in these elements is critical to support this expanding research, yet current RNA-seq pipelines struggle with the pervasive overlap between TEs and protein-coding genes. Existing tools either aggregate to the subfamily level with no locus resolution (TEtranscripts), or provide locus-level quantification without modeling gene overlap (Telescope), with the latter attributing over 40% of TE signal to the 1.1% of loci that overlap gene exons. Results: We present MAJEC (Momentum Accelerated Junction Enhanced Counting), a unified Expectation-Maximization (EM) framework that jointly quantifies genes, transcript isoforms, and individual TE loci from BAM alignments in a single pass. Splice junction evidence informs transcript-level priors, enabling MAJEC to probabilistically distinguish genic from TE-derived reads. This approach was independently validated against Salmon and RSEM on isoform quantification benchmarks. The joint feature space reduces exon-overlap contamination of locus-level TE estimates from 43% of total signal (Telescope) to 5% (MAJEC), while preserving subfamily-level accuracy (differential expression r = 0.987 vs TEtranscripts). Using paired biological vignettes, we demonstrate that MAJEC correctly resolves both the false TE reactivation artifacts endemic to TE-only models, and the false gene upregulation artifacts that occur when heuristic rules misassign genuine intragenic TE transcription. Conclusion: MAJEC simultaneously produces the isoform and locus-level resolution that TEtranscripts lacks, with greater accuracy than Telescope, and runs faster than either.
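The Expectation-Maximization reassignment at the heart of such quantifiers can be sketched in a few lines; this toy version shows the general scheme for multi-mapping reads, not MAJEC's joint gene/isoform/TE model or its junction-informed priors and momentum acceleration.

```python
import numpy as np

def em_quantify(compat, n_iter=200):
    """Toy EM for multi-mapping read assignment: compat[r, f] is 1 when read
    r aligns to feature f. E-step: split each read across its compatible
    features in proportion to current abundances. M-step: abundances are the
    renormalized expected counts."""
    n_reads, n_feat = compat.shape
    theta = np.full(n_feat, 1.0 / n_feat)
    for _ in range(n_iter):
        w = compat * theta                        # E-step numerators
        resp = w / w.sum(axis=1, keepdims=True)   # per-read responsibilities
        theta = resp.sum(axis=0) / n_reads        # M-step
    return theta

# 6 reads unique to feature 0, 2 unique to feature 1, 4 ambiguous between both
compat = np.array([[1, 0]] * 6 + [[0, 1]] * 2 + [[1, 1]] * 4, dtype=float)
theta = em_quantify(compat)
# ambiguous reads are pulled toward the better-supported feature 0
```

In a joint gene/isoform/TE feature space, the same responsibilities are what let genic evidence (e.g., splice junctions) claim reads that a TE-only model would misattribute to a locus.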
bioinformatics2026-04-14v1SpaceExpander: Automated Drafting and Evaluation of Markush Claims for Chemical Space Expansion
Wu, R.; Mao, L.; Diao, Y.; Li, H.Abstract
Drafting Markush claims for chemical patents remains difficult because manual claim writing is slow, error prone, and often fails to capture related chemical space in a systematic manner. We developed SpaceExpander, a computational method that converts disclosed compounds into generalized Markush claims by extracting core scaffolds, defining variable positions, decomposing complex substituents, and expanding substituent space through fragment matching. We evaluated the method on 24 publicly available chemical patents and compared its performance with IntelliPatent. SpaceExpander achieved a mean atom level scaffold accuracy of 0.92 and exactly recovered the reference scaffold in 19 of 24 patents. By contrast, IntelliPatent could process only 2 patents from the same set, indicating more limited applicability to structurally diverse cases. We further examined practical claim coverage in a case study based on the Osimertinib patent. Using representative disclosed compounds as input, SpaceExpander drafted a Markush claim that covered 5 of 7 additional approved third-generation EGFR inhibitors beyond Osimertinib. These results show that SpaceExpander is a validated method for automated Markush claim drafting and chemical space expansion.
bioinformatics2026-04-14v1Harnessing AI to Build Virtual Cells
Cheng, X.; Li, P.; Guo, H.; Liang, Y.; Gong, J.; de Vazelhes, W.; Gou, C.; Xie, P.; Song, L.; Xing, E. P.Abstract
A virtual cell is a world model of a cell: a computational system that predicts, simulates and programs cellular processes across modalities and scales. An important path toward this goal is to model how genetic and chemical perturbations give rise to transcriptional responses, a core capability for disease understanding and drug discovery. However, current approaches remain expert-intensive, relying on iterative manual model design, training and debugging over months. Here we present VCHarness, an autonomous AI system that constructs perturbation-response models by combining an AI coding agent with multimodal biological foundation models. The system explores large spaces of architectures and training pipelines with minimal human intervention, iteratively generating, evaluating and refining candidate models. Across multiple perturbation-response benchmarks, VCHarness identifies architectures that outperform expert-designed approaches while reducing development time from months to days. It further uncovers non-obvious architectural patterns associated with improved performance, indicating that automated search can extend beyond conventional design strategies. These results suggest a shift from manually engineered models toward autonomous systems for constructing components of virtual cell world models, enabling scalable and data-driven exploration of cellular systems.
bioinformatics2026-04-14v1Reconstructing intra-tumor fitness landscapes from scSeq CNA genotypes via simulation-based Bayesian inference and Deep Learning
KafiKang, M.; Skums, P.Abstract
Inferring the selective effects of copy-number alterations (CNAs) from clonal tumor data is essential for understanding tumor evolution. In practice, intra-tumor evolutionary parameters are typically estimated by fitting population genetic models to observed data using maximum likelihood or Bayesian methods. However, realistic mechanistic models often lead to intractable likelihoods, limiting the applicability of conventional inference approaches. Here, we introduce a likelihood-free, simulation-based framework for inferring intra-tumor selection coefficients directly from clonal CNA profiles. Our approach employs neural posterior estimation to amortize inference across simulated tumors and uses normalizing flows to flexibly parameterize high-dimensional posterior distributions while enabling robust uncertainty quantification. Our primary model, CloneMLP-NPE, learns representations of whole-tumor CNA genotypes using a multilayer perceptron (MLP)-based encoder. We compare this model against two baselines: (i) a Set Transformer encoder applied to the same whole-tumor CNA profiles, and (ii) a consensus-based approach that relies only on the CNA profile of the most abundant clone. On held-out simulations, CloneMLP-NPE achieves the strongest overall performance, yielding well-calibrated posterior distributions and more accurate posterior mean estimates than both baselines.
bioinformatics2026-04-14v1Multi-Agent Orchestration for Knowledge Extraction and Retrieval: AI Expert System for GPCRs
Spieser, J. C.; Kogan, P.; Yang, J.; Meller, J.; Patra, K.; Shamsaei, B.Abstract
We present GPCR-Nexus, an AI-driven platform for integrated exploration of G protein-coupled receptor (GPCR) biology that unifies structured databases with unstructured scientific literature. The system combines a GPCR/ligand knowledge graph with vector-based semantic retrieval to enable comprehensive, up-to-date information access. Central to GPCR-Nexus is a multi-agent architecture in which specialized components coordinate query planning, evidence retrieval, validation, and synthesis. This design ensures that generated responses are grounded in verifiable sources while maintaining coherence across heterogeneous data modalities. By jointly leveraging curated databases and primary literature, GPCR-Nexus enables context-aware reasoning over molecular interactions, functional mechanisms, and disease associations. The platform produces citation-backed outputs with traceable evidence, addressing limitations of conventional database queries and standalone language models. We detail the system architecture, data integration strategy, and agent orchestration framework, and demonstrate its utility through representative query scenarios. GPCR-Nexus provides a scalable approach to combining structured and unstructured biomedical knowledge using agent-based AI, offering improved accuracy, interpretability, and coverage. This work establishes a foundation for trustworthy, AI-assisted knowledge synthesis in GPCR research and drug discovery.
bioinformatics2026-04-14v1A Hierarchy-aware Gene Exploration Platform for Multi-layered Toxicogenomic Analysis: A Case Study on Acetaminophen-induced Hepatotoxicity
Kim, M.; Cui, Y.; Kim, M. G.Abstract
Background: The interpretation of high-dimensional transcriptomic data remains a major challenge in mechanistic toxicology and drug safety assessment. Conventional clustering approaches based solely on expression profiles often fail to capture intrinsic biological relationships among genes, limiting interpretability and downstream analysis. Methods: We developed a hierarchy-aware gene exploration platform that integrates structured biological knowledge from the HUGO Gene Nomenclature Committee (HGNC). The core of the framework is a similarity kernel based on a single-step hyperdiffusion formulation (H K H^T), which embeds gene family hierarchy into the similarity space. The platform is implemented as an interactive web application supporting Uniform Manifold Approximation and Projection (UMAP) visualization, Leiden clustering, functional enrichment analysis, and hierarchy-based gene recommendation. Results: Applied to a transcriptomic dataset of acetaminophen-induced acute liver failure (APAP ALF), the proposed approach achieved a 33.8-fold improvement in functional coherence compared to an expression-only baseline. The hierarchy-aware embedding produced compact and biologically consistent clusters, enabling identification of key toxicological modules, including disruption of RNA processing, extracellular matrix remodeling, and impairment of lipid transport. In addition, the framework detected small but highly significant regulatory modules associated with epigenetic reprogramming. Conclusion: By incorporating biological hierarchy into gene similarity, the proposed platform enhances the interpretability of transcriptomic analysis and enables structured exploration of functional relationships. This approach provides a practical framework for mechanistic insight generation and supports more transparent and reproducible analysis in toxicogenomics.
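The single-step hyperdiffusion kernel can be illustrated with small matrices; H and K below are toy stand-ins, with H mixing genes within a shared family and K a base expression-similarity kernel (the platform's actual H is derived from the HGNC gene-family hierarchy).

```python
import numpy as np

# Genes 0 and 1 share a family; gene 2 stands alone. H blends each gene's
# profile with its family members in a single diffusion step.
H = np.array([[0.8, 0.2, 0.0],
              [0.2, 0.8, 0.0],
              [0.0, 0.0, 1.0]])

# Base expression-similarity kernel K (symmetric PSD; values illustrative)
expr = np.array([[1.0, 0.0, 0.9],
                 [0.1, 1.0, 0.0],
                 [0.9, 0.1, 0.2]])
K = expr @ expr.T

S = H @ K @ H.T   # hierarchy-aware similarity
# S remains a symmetric PSD kernel and pulls family members 0 and 1 together
```

A kernel of this form can then feed directly into UMAP or Leiden clustering in place of the raw expression similarity, which is how hierarchy reshapes the resulting clusters.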
bioinformatics2026-04-14v1IMAS enables target-aware integration of tumour multiomics to resolve communication-guided regulatory mechanisms
Deyang, W.; Yamashiro, T.; Inubushi, T.Abstract
Tumour multiomic datasets are often sparse, heterogeneous and limited in size, hindering robust and interpretable discovery of regulatory mechanisms. Here we present IMAS, a target-aware integrative framework for multiomic data augmentation and mechanism prioritization that leverages a pan-cancer single-cell multiomic resource to contextualize new tumour datasets and identify reliable sample-specific mechanistic hypotheses. IMAS combines shared latent-space modelling with target-domain adaptation to improve correspondence between predicted and observed RNA and TF profiles while concentrating explanatory predictive supports within the target dataset. Building on this adapted representation, IMAS reconstructs structured RNA-TF coupling networks, refines intercellular signaling through ligand-informed communication modelling, and organizes regulatory programs along communication-associated ordering. In independent colon cancer data, IMAS improved cluster-resolved correspondence and revealed communication-guided regulatory cascades across malignant epithelial states. A LAMB1-centred analysis further demonstrates how the framework supports progressive reinforcement of local regulatory structure and enables perturbation-based probing of context-specific dependencies. Rather than exhaustively predicting all possible outcomes, IMAS provides a target-aware and interpretable strategy to construct consistent mechanism-discovery scaffolds and prioritize regulatory dependencies in data-limited tumour systems.
bioinformatics · 2026-04-13 · v1
TB-Bench: A Systematic Benchmark of Machine Learning and Deep Learning Methods for Second-Line TB Drug Resistance Prediction
VP, B.; Jaiswal, S.; Meshram, A.; PVS, D.; S C, S.; Narayanan, M.
Abstract
Drug-resistant tuberculosis (TB), characterized by prolonged treatment regimens and suboptimal treatment outcomes, remains a major obstacle to global TB elimination. Advances in sequencing technologies have enabled the development of machine-learning (ML) approaches, including deep-learning (DL) methods, to predict drug resistance directly from genomic data. However, a significant gap remains in translating these advances into clinical practice. While current approaches reliably predict resistance to first-line drugs, they show consistently lower and more variable performance for second-line drugs compared with traditional drug-susceptibility testing. To characterize these limitations and assess practical utility, we conducted a comprehensive survey and standardized benchmarking of current approaches for predicting TB drug resistance using whole-genome sequencing (WGS) data. Using systematic selection criteria, we identified 20 traditional ML and DL models from 8 studies and evaluated drug-specific versions across 14 second-line drugs within a unified framework. To account for methodological heterogeneity, the models were evaluated using three distinct feature sets reflecting variability in input representations. We trained and evaluated the models on different subsets of the WHO dataset, comprising 50,801 samples, and assessed generalizability using an external validation dataset comprising 1,199 samples. In the internal evaluation on the held-out WHO test dataset, traditional ML models using binary features achieved higher predictive performance than DL models. For example, XGBoost achieved the highest area under the precision-recall curve (PRAUC) scores (46%-93%) for 10 of the 14 drugs. However, performance varied substantially across drugs. Notably, the superior performance of traditional ML models - even with limited feature sets - highlights their applicability in low-resource settings. 
When evaluated on the external validation dataset, the performance of traditional ML and DL models was comparable, and neither class of models demonstrated substantial improvement over catalogue-based approaches, underscoring challenges in cross-dataset generalization. Overall, this benchmarking study provides a comprehensive and systematic evaluation of current approaches, establishes a rigorous evaluation framework for future comparisons, and identifies key methodological considerations necessary to advance robust drug resistance prediction in clinical settings. To enhance reproducibility and facilitate the application of TB-Bench to additional datasets and models, we have made the source code publicly available at https://github.com/BIRDSgroup/TB-Bench.
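The binary feature representation that the best-performing traditional ML models consumed can be illustrated as follows. The variant names and isolates below are invented for the sketch; this is not TB-Bench code.

```python
import numpy as np

# Illustrative encoding: one column per candidate resistance variant,
# 1 if the isolate carries it, 0 otherwise. Names are hypothetical.
variants = ["rrs_a1401g", "gyrA_d94g", "eis_c-14t"]
isolates = {
    "s1": {"rrs_a1401g", "gyrA_d94g"},
    "s2": {"eis_c-14t"},
    "s3": set(),
}

# Samples x variants binary matrix, ready for a tree-based classifier.
X = np.array([[int(v in muts) for v in variants]
              for muts in isolates.values()])

assert X.shape == (3, 3)
assert X.sum() == 3
```

A matrix like this, paired with per-drug phenotype labels, is the kind of input a gradient-boosted model such as XGBoost would be trained on.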
bioinformatics · 2026-04-13 · v1
Introducing the digital PCR data essentials standard to harmonize data structure for clinical and research use
Trypsteen, W.; Vynck, M.; Untergrasser, A.; Whale, A. S.; Rodiger, S.; Dobnik, D.; Bogozalec Kosir, A.; Milavec, M.; Kubista, M.; Pfaffl, M. W.; Nour, A. A.; Young-Kyung, B.; Bustin, S. A.; Calin, G.; Chen, Y.; Cleveland, M. H.; De Falco, A.; Forootan, A.; O'Sullivan, D. M.; Devonshire, A. S.; Foy, C. A.; Fraley, S. I.; Gleerup, D. G.; He, H.-J.; Hellemans, J.; Lievens, A.; Lind, G. E.; Porco, D.; Romsos, E. L.; Thas, O.; Drandi, D.; de Tayrac, M.; Taly, V.; Huggett, J. F.; Vandesompele, J.; De Spiegelaere, W.
Abstract
Digital PCR (dPCR) is a powerful technology for absolute quantification of nucleic acids, valued for its accuracy, sensitivity, and repeatability. Yet, the commercialization of different instruments with proprietary software has introduced challenges to data analysis, interoperability, and comparability. Therefore, we present the Digital PCR Data Essentials Standard (DDES), a lightweight, human- and machine-readable, and cross-platform data standard developed in collaboration with the dPCR community. The standard consists of three file types designed to enable both manual inspection and automated analysis: (i) a main file summarizing experiment and reaction-level (meta-)data; (ii) an assay file describing targets and detection chemistry; and (iii) intensity files capturing partition-level raw fluorescence data per reaction. DDES supports a wide range of current dPCR applications, including singleplex and multiplex assays, endpoint and real-time readouts, and will be curated to accommodate future dPCR developments. By harmonizing the data structure, DDES lays the foundation for FAIR dPCR data practices and supports improved software compatibility, collaborative and reproducible research, and future dPCR data repositories.
bioinformatics · 2026-04-13 · v1
VeloTrace Reconciles Divergent Velocity and Trajectory in Single-cell Transcriptomics with Deep Neural ODE
Cheng, H.; Qiao, Y.; Feng, Y.; Wei, Y.; Li, J.; Cai, J.; Zheng, S.; Chen, S.; Li, G.; Simons, B. D.; Lian, Q.; Xin, H.
Abstract
Cellular identity and fate transitions are governed by continuous molecular processes that form dynamic trajectories within a high-dimensional transcriptomic landscape. Existing methods attempt to model these dynamics from two complementary perspectives: trajectory inference and velocity modeling. Ideally, velocity and trajectory are dual aspects of transcriptomic dynamics where velocity is tangent to trajectory everywhere. This inherent connection between velocity and trajectory is currently absent in transcriptomic analysis. Splicing-based velocity estimates are precision-limited for inadequately sequenced genes, while trajectory inference prioritizes the modeling of global trends and omits local dynamics. This divergence breaks the geometric continuity between local velocities and global trajectories, hindering the reliable interpretation of developmental dynamics. To reconcile trajectory inference and RNA velocity, we introduce VeloTrace, a framework that unifies them through Neural Ordinary Differential Equations (NeuralODEs). VeloTrace learns a continuous-time velocity field whose integral curves constitute the trajectory itself, while ensuring that velocities are tangent to integral paths everywhere. Leveraging a splicing quality score, VeloTrace incorporates high-quality splicing velocity as partial supervision for velocity orientation and grounding. During optimization, VeloTrace incorporates a Monte Carlo multi-time-frame supervision strategy to ensure coherence between local and global trajectories and suppress sequencing-induced stochastic diffusion. By refining the velocity field and cell-specific parameters for pseudo-time, expression, and velocity, VeloTrace reconstructs a smooth, local-and-global-coherent velocity-vector-guided flow in the transcriptomic latent space.
This strategy ensures a complementary integration of velocity and trajectory, imputing transcriptional kinetics for weakly sequenced genes whose kinetics cannot be accurately portrayed by splicing velocity. In simulation benchmarks, VeloTrace captured the transcriptional dynamics of all expressed genes, even those with inadequate sequencing coverage, producing velocity directions that were most consistent with the true direction and everywhere tangential across the entire process, outperforming state-of-the-art methods, including scVelo, UniTVelo, VeloVI and scTour. VeloTrace uniquely reconciles RNA velocity and trajectory inference, creating a velocity field in which each cell can infer past and future transitions from its current state. Moreover, VeloTrace extends reliable velocity estimation to a broader set of genes. When applied to mouse neural stem cell differentiation data, it successfully recovers dynamics of driver genes for two developmental lineages, including those with low expression, shedding light on their regulatory roles during differentiation. This unified framework lays the foundation for more accurate modeling of gene regulation and cell fate decisions in complex biological systems.
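The core geometric property VeloTrace enforces — that a trajectory's tangent equals the velocity field everywhere along its integral curves — can be demonstrated with a toy ODE standing in for the trained neural network (a minimal numpy sketch, not the VeloTrace model):

```python
import numpy as np

def velocity(x):
    """Toy latent-space velocity field (stand-in for the neural ODE)."""
    A = np.array([[0.0, -1.0], [1.0, 0.0]])   # rotational flow
    return A @ x

# Integrate dx/dt = velocity(x) with explicit Euler: the resulting
# path IS an integral curve of the field.
dt, steps = 1e-3, 5000
x = np.array([1.0, 0.0])
traj = [x]
for _ in range(steps):
    x = x + dt * velocity(x)
    traj.append(x)
traj = np.array(traj)

# Tangency check: finite-difference tangent vs. field direction.
tangent = (traj[1:] - traj[:-1]) / dt
field = np.array([velocity(p) for p in traj[:-1]])
cos = np.sum(tangent * field, axis=1) / (
    np.linalg.norm(tangent, axis=1) * np.linalg.norm(field, axis=1))

assert np.all(cos > 0.999)  # tangent aligned with velocity everywhere
```

In the real method the field is a learned network and the integration is done by an ODE solver, but the tangency property holds by the same construction.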
bioinformatics · 2026-04-13 · v1
BrainPET Studio: An Atlas-Based, User-Friendly Desktop Tool for Quantitative PET Neuroimaging Analysis
Nabizadeh, F.
Abstract
Quantitative analysis of positron emission tomography (PET) neuroimaging data is essential for studying neurodegenerative diseases, yet existing processing pipelines often rely on computationally intensive software packages such as FreeSurfer, limiting accessibility for many research groups. Here I introduce BrainPET Studio, an open-source desktop application for atlas-based regional PET quantification that operates entirely in Montreal Neurological Institute (MNI) standard space. BrainPET Studio integrates affine registration, optional Müller-Gärtner (MG) partial volume correction (PVC), interactive quality control (QC), and standardized uptake value ratio (SUVR) calculation into a single graphical user interface (GUI), eliminating the requirement for FreeSurfer-based cortical reconstruction. I validated BrainPET Studio against two established pipelines: (1) the UC Berkeley Alzheimer's Disease Neuroimaging Initiative (ADNI) AV1451 (flortaucipir) pipeline, which employs FreeSurfer v7.1.1 parcellation, SPM-based coregistration, and Geometric Transfer Matrix (GTM) PVC in native subject space, and (2) the volBrain/petBrain online platform. Region-of-interest (ROI) SUVR values were compared across 322 subjects. Overall Pearson correlation coefficients for meta-ROI composites ranged from r = 0.83-0.96 versus ADNI and r = 0.86-0.94 versus volBrain/petBrain. Detailed per-subject validation on four representative cases across 112 FreeSurfer-defined regions demonstrated strong agreement for large cortical composites and acceptable variability for smaller medial temporal structures. These results establish BrainPET Studio as a reliable, accessible, and extensible tool for multi-site PET research, educational applications, and studies where FreeSurfer-based processing is impractical.
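The SUVR step is the standard ratio of mean uptake in a target region to mean uptake in a reference region. A minimal sketch of that calculation (toy volume and atlas labels, not BrainPET Studio code):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical PET volume and atlas labels on a shared (MNI-like) grid.
pet = rng.uniform(0.5, 2.5, size=(4, 4, 4))
atlas = rng.integers(0, 3, size=(4, 4, 4))  # 0 = reference, 1-2 = target ROIs

def suvr(pet, atlas, roi_label, ref_label):
    """SUVR: mean ROI uptake divided by mean reference-region uptake."""
    roi_mean = pet[atlas == roi_label].mean()
    ref_mean = pet[atlas == ref_label].mean()
    return roi_mean / ref_mean

value = suvr(pet, atlas, roi_label=1, ref_label=0)
assert value > 0
```

In practice the reference label would be a structure such as the cerebellar grey matter, chosen per tracer.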
bioinformatics · 2026-04-13 · v1
HEIMDALL: Disentangling tokenizer design for robust transfer in single-cell foundation models
Haber, E.; Alam, S.; Ho, N.; Liu, R.; Trop, E.; Liang, S.; Yang, M.; Krieger, S.; Ma, J.
Abstract
Foundation models for single-cell RNA-sequencing (scRNA-seq) data are emerging as powerful tools for single-cell analysis, yet their performance depends critically on how cells are tokenized into model inputs. Single-cell data lack a canonical tokenization scheme, and many design choices in current single-cell foundation models (scFMs) remain heuristic, entangled, and difficult to evaluate. Here, we introduce HEIMDALL, a unified framework for dissecting and redesigning tokenizers in scFMs. By decomposing existing tokenization strategies into individual design choices, HEIMDALL enables attribution of the components that underlie robust generalization, allowing more principled design of improved tokenizers. Combining HEIMDALL with a minimal transformer backbone, we find that tokenizer design is instrumental for generalization in challenging distribution-shift settings such as cross-tissue, cross-species, and cross-gene-panel cell type classification, as well as reverse perturbation prediction. We show that, while tokenizer choice has little effect in scenarios with matched train and test data, it becomes critical under distribution shift. Rather than identifying a single globally optimal tokenizer, HEIMDALL reveals that robust transfer depends on a small number of tokenization design axes - especially gene identity, expression encoding, and ordering - that expose different biological priors to the model. In this sense, universal transferability in scFMs still depends on a non-universal tokenizer interface. Together, these findings establish tokenization as a critical design axis in scFMs and provide design principles and reusable infrastructure for more robust scFMs.
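The three design axes named above can be made concrete with a toy tokenizer (an illustrative sketch, not a HEIMDALL component):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy cell: expression values over a small gene vocabulary.
genes = np.array(["G1", "G2", "G3", "G4", "G5"])
expr = rng.exponential(size=5)

def tokenize(genes, expr, order="rank", n_bins=3):
    """Illustrative tokenizer exposing three design axes:
    gene identity (vocabulary symbol), expression encoding (binning),
    and ordering (expression rank vs. fixed gene order)."""
    idx = np.argsort(-expr) if order == "rank" else np.arange(len(genes))
    edges = np.quantile(expr, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(expr, edges)          # expression-encoding axis
    return [(genes[i], int(bins[i])) for i in idx]

tokens = tokenize(genes, expr)
assert len(tokens) == 5
assert tokens[0][1] >= tokens[-1][1]   # highest-expressed gene gets top bin
```

Switching `order` or `n_bins` changes which biological prior the downstream transformer sees, which is the kind of axis-by-axis comparison the framework performs.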
bioinformatics · 2026-04-12 · v3
On the correctness of gene tree tagging under a unified model of gene duplication, loss, and coalescence
Parsons, R.; Liu, Y.; Dua, P.; Markin, A.; Molloy, E.
Abstract
ASTRAL-pro is the leading method for reconstructing species trees under complex evolutionary scenarios involving gene duplication, loss, and coalescence, commonly modeled by DLCoal. A unique aspect of ASTRAL-pro is that it utilizes rooted gene trees, with internal vertices labeled as duplications or speciations, to modify its objective function compared to the traditional ASTRAL method. Although there is a natural event-based definition of correct tagging when genes evolve with only gene duplications and losses, it cannot be applied when there is deep coalescence. Here, we introduce a definition of correct tagging that is broadly applicable, proposing that a gene tree vertex is correctly tagged as a duplication if it is the most recent common ancestor of at least one pair of gene copies related via a duplication event. Using this definition, we study some statistical properties of ASTRAL-pro's objective function under the DLCoal model and evaluate the accuracy of ASTRAL-pro's tagging algorithm in simulations.
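The proposed tagging definition is straightforward to operationalize once the duplication-related leaf pairs are known. A minimal sketch (toy tree; the pairs are assumed as given input):

```python
# An internal vertex is tagged as a duplication iff it is the MRCA of at
# least one pair of gene copies related through a duplication event.

parent = {"a": "u", "b": "u", "u": "r", "c": "r"}  # toy rooted gene tree

def ancestors(v):
    """Path from a vertex up to the root (inclusive)."""
    path = [v]
    while v in parent:
        v = parent[v]
        path.append(v)
    return path

def mrca(x, y):
    """Most recent common ancestor of two leaves."""
    anc_x = ancestors(x)
    return next(v for v in ancestors(y) if v in anc_x)

dup_pairs = [("a", "b")]   # leaf pairs related via duplication (assumed input)
dup_tags = {mrca(x, y) for x, y in dup_pairs}

assert dup_tags == {"u"}   # u is a duplication; r remains a speciation
```

Under this definition a vertex's tag depends only on which pairs of copies descend from it, so it remains well defined even when deep coalescence reshuffles the event history.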
bioinformatics · 2026-04-12 · v2
KNexPHENIX: A PHENIX-Based Workflow for Improving Cryo-EM and Crystallographic Structural Models
Nandi, S.; Conn, G. L.
Abstract
New and improved methods for visualizing complex macromolecules in atomic detail continue to expand structural information in the Protein Data Bank, but accurately refining atomic models from experimental maps remains a challenge due to efficiency limitations of current refinement approaches. Standard PHENIX refinement can partially address these limitations with its speed and accessibility but often fails to yield the best model compared to more computationally demanding approaches. We therefore developed KNexPHENIX, a customized PHENIX-based workflow, to support optimal macromolecular model building. KNexPHENIX can be used to refine macromolecular structures obtained via cryo-electron microscopy (cryo-EM) or X-ray crystallography, regardless of molecular size or composition. KNexPHENIX was evaluated on deposited structures and de novo models and consistently produced models with lower MolProbity scores, indicating improved model stereochemistry, compared to default PHENIX, REFMAC Servalcat, REFMAC, or CERES refinement. Importantly, this was accomplished while maintaining model-to-map correlation for cryo-EM datasets and maintaining or reducing the Rfree-Rwork difference below accepted thresholds for X-ray crystallographic structures, thus limiting overfitting while preserving refinement accuracy. These results establish the KNexPHENIX workflow as a practical, accessible approach for refining both cryo-EM and crystallographic structures, enabling the generation of high-quality models for deposition and guiding further experimental studies.
bioinformatics · 2026-04-12 · v2
CRIS: A Centralized Resource for High-Quality RNA Structure and Interaction Data in the AI Era
Lee, W. H.; Dharmawan, C.; Li, K.; Bai, J.; Solanki, P.; Sharma, A.; Zhang, M.; Lu, Z.
Abstract
As interest in RNA-based therapeutics expands, there is a growing demand for elucidating RNA structures and RNA-RNA interactions in both academic and clinical settings. Despite rapid advances in methods for RNA structure determination, the field faces persistent challenges in data reproducibility, quality control, and accessibility, largely due to inconsistencies in data processing and analysis workflows. Concurrently, methodological improvements have generated increasingly complex datasets, which necessitate a standardized framework. Here, we present the Crosslinking-based RNA Interactomes and Structuromes (CRIS) database, a comprehensive resource designed to address these limitations. Among existing experimental and computational approaches for RNA structure characterization, crosslinking-based technologies offer superior reliability, high throughput, and high resolution. CRIS provides rigorously curated datasets, standardized workflows, and user-friendly tools, together with built-in quality metrics and detailed visualization guidance to ensure reproducibility and transparency while pairing seamlessly with existing experimental pipelines. By delivering high-complexity RNA datasets alongside accessible computational tools, CRIS serves as a standardized reference for both new and existing data, facilitating investigation through comparative analyses and providing a training resource for deep learning-based computational exploration. This will enable integration into machine learning workflows for large-scale, novel RNA structure discovery.
bioinformatics · 2026-04-12 · v2
Spectral Graph Features for Reference-free RNA 3D Quality Assessment
Zhu, Y.; Zhang, H.; Calhoun, V. D.; Bi, Y.
Abstract
Motivation: Existing RNA 3D structure quality assessment (QA) methods rely on local geometric descriptors or statistical potentials that evaluate atomic-level contacts but are blind to global topological coherence. This creates a critical failure mode--structures that are "locally correct but globally wrong"--where well-formed local helices mask misplaced domains and incorrect overall packing. Results: We introduce SpecRNA-QA, a lightweight method that scores RNA 3D models using multi-scale spectral features derived from the graph Laplacian of inter-nucleotide contact networks. By computing eigenvalue distributions, heat-kernel traces, and spectral entropy across four distance scales with binary and Gaussian kernels, SpecRNA-QA captures global structural coherence inaccessible to conventional descriptors. In leave-one-out cross-validation on CASP16 (42 targets, 7,368 models), spectral features achieve median per-target Spearman ρ = 0.69 [95% CI: 0.64-0.73], significantly outperforming an internal geometry baseline (ρ = 0.47, Δρ = +0.22, Wilcoxon p = 1.2×10⁻¹⁰). Compared against established unsupervised statistical potentials (which require no labeled data, unlike the supervised spectral model), rsRNASP outperforms SpecRNA-QA on small-to-medium RNAs (ρ = 0.67 vs. 0.57, ≤200 nt). However, rsRNASP times out on most large RNAs (>200 nt), where SpecRNA-QA provides the strongest available quality signal (ρ = 0.72 vs. DFIRE 0.52), revealing clear complementarity between global-topological and local-energy scoring. A training-free heuristic using only three spectral statistics enables quality estimation without any labeled data. Availability: SpecRNA-QA is available as a Python package at https://github.com/yudabitrends/specrnaq. Contact: ybi3@gsu.edu. Supplementary information: Supplementary data are available online.
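The spectral summaries the method relies on can be computed in a few lines. The sketch below builds one binary contact graph and derives eigenvalues, a heat-kernel trace, and spectral entropy (toy coordinates and a single distance scale; an illustration of the feature family, not the SpecRNA-QA implementation):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy "RNA model": random 3D coordinates for 30 nucleotides (stand-in).
coords = rng.normal(size=(30, 3))

def spectral_features(coords, cutoff=1.5, t=1.0):
    """Graph-Laplacian spectral summaries of a contact network:
    eigenvalues, heat-kernel trace, and spectral entropy (one scale)."""
    d = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    A = ((d < cutoff) & (d > 0)).astype(float)       # binary contact graph
    L = np.diag(A.sum(1)) - A                        # combinatorial Laplacian
    eig = np.linalg.eigvalsh(L)
    heat_trace = np.exp(-t * eig).sum()              # heat-kernel trace
    p = np.exp(-t * eig) / heat_trace
    entropy = -(p * np.log(p)).sum()                 # spectral entropy
    return eig, heat_trace, entropy

eig, heat_trace, entropy = spectral_features(coords)
assert eig.min() > -1e-8          # Laplacian is positive semidefinite
assert entropy > 0
```

Because the Laplacian spectrum reflects the whole connectivity pattern, features like these respond to misplaced domains even when every local helix looks well formed.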
bioinformatics · 2026-04-12 · v2
rnaends: an R package to study exact RNA ends at nucleotide resolution
Caetano, T.; Redder, P.; Fichant, G.; Barriot, R.
Abstract
5' and 3' RNA-end sequencing protocols have unlocked new opportunities to study aspects of RNA metabolism such as synthesis, maturation and degradation, by enabling the quantification of exact ends of RNA molecules in vivo. From RNA-Seq data that have been generated with one of the specialized protocols, it is possible to identify transcription start sites (TSS) and/or endoribonucleolytic cleavage sites, and even, in some cases, co-translational 5' to 3' degradation dynamics. Furthermore, post-transcriptional addition of ribonucleotides at the 3' end of RNA can be studied at nucleotide resolution. While different RNA-end sequencing library protocols exist that have been adapted to a specific organism (prokaryote or eukaryote) or specific biological question, the generated RNA-Seq data are very similar and share common processing steps. Most importantly, the major aspect of RNA-end sequencing is that only the 5' or 3' end mapped location is of interest, contrary to conventional RNA sequencing that considers genomic ranges for gene expression analysis. This translates to a simple representation of the quantitative data as a count matrix of RNA-end locations on the reference sequences. This representation seems under-exploited and is, to our knowledge, not available in a generic package focused on the analysis of exact transcriptome ends. Here, we present the rnaends R package, which is dedicated to RNA-end sequencing analysis. It offers functions for raw read pre-processing, RNA-end mapping and quantification, RNA-end count matrix post-processing, and further downstream count matrix analyses such as TSS identification, fast Fourier transform for periodic signal pattern analysis, or differential RNA-end proportion analysis.
The use of rnaends is illustrated here with applications in RNA metabolism studies through selected rnaends workflows on published RNA-end datasets: (i) TSS identification, (ii) ribosome translation speed and co-translational degradation, (iii) post-transcriptional modification analysis and differential proportion analysis.
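The fast Fourier transform step for periodicity detection can be sketched outside R as well. The toy profile below has a planted 3-nt period of the kind expected from co-translational degradation (illustrative Python, not the rnaends package):

```python
import numpy as np

# Synthetic 5'-end count profile: a peak every 3 nucleotides.
pos = np.arange(300)
counts = 5 + 4 * (pos % 3 == 0)

# FFT of the mean-centered profile; the dominant frequency reveals
# the periodic spacing of the RNA-end signal.
spec = np.abs(np.fft.rfft(counts - counts.mean()))
freqs = np.fft.rfftfreq(len(counts), d=1)  # cycles per nucleotide
peak = freqs[spec.argmax()]

assert np.isclose(1 / peak, 3.0)           # dominant period is 3 nt
```

On real data the same spectrum would be computed from per-position 5'-end counts within coding regions, where a 3-nt period is the footprint of ribosome-coupled degradation.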
bioinformatics · 2026-04-11 · v4
COMPASS: A Web-Based COMPosite Activity Scoring System to Navigate Health and Disease Through Deterministic Digital Biomarkers
Sinha, S.; Ghosh, P.
Abstract
Quantifying pathway activity in a reproducible and interpretable manner remains a central challenge in systems biology and precision medicine. Here, we introduce COMPASS (COMPosite Activity Scoring System), a deterministic, ontology-free, threshold-based framework that converts gene expression into per-sample pathway activity scores without reliance on permutation or reference cohorts. COMPASS derives gene-specific activation thresholds directly from data, standardizes deviations from these boundaries, and integrates directionally opposing genes into a single composite score using closed-form logic. Implemented as an accessible web application, COMPASS enables users to upload expression matrices, define gene signatures, and perform activity scoring, statistical comparisons, and survival analyses without coding. Across diverse biological and clinical datasets, COMPASS generates stable and transferable digital biomarkers that quantify cellular states, benchmark the humanness and relevance of model systems, and enable outcome stratification. In head-to-head comparisons with widely used single-sample enrichment methods (GSVA and ssGSEA), COMPASS shows consistent performance across multi-cohort datasets, with improved discrimination when integrating bidirectional gene programs. Stratified bootstrap analyses further demonstrate reduced variability and increased robustness. By directly linking expression thresholds, deviation, and gene directionality, COMPASS provides a transparent and generalizable framework for ontology-free pathway activity quantification and outcome modeling.
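The threshold, standardized-deviation, and composite steps described above can be sketched as follows (the median threshold and equal weighting are assumptions made for illustration, not COMPASS's exact closed-form score):

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy cohort: 100 samples x 6 genes; a signature with up- and
# down-regulated members (indices are hypothetical).
expr = rng.normal(size=(100, 6))
up_genes = [0, 1, 2]
down_genes = [3, 4, 5]

# Gene-specific activation thresholds derived directly from the data.
thresholds = np.median(expr, axis=0)
scale = expr.std(axis=0, ddof=1)

# Standardized deviation of each sample from each gene's threshold.
z = (expr - thresholds) / scale

# Composite per-sample score: up-genes add, down-genes subtract.
score = z[:, up_genes].mean(axis=1) - z[:, down_genes].mean(axis=1)

assert score.shape == (100,)
```

Because every step is a closed-form function of the data, the score is deterministic: rerunning on the same matrix reproduces it exactly, with no permutation null needed.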
bioinformatics · 2026-04-11 · v3
Coherent Cross-modal Generation of Synthetic Biomedical Data to Advance Multimodal Precision Medicine
Marchesi, R.; Lazzaro, N.; Endrizzi, W.; Leonardi, G.; Pozzi, M.; Ragni, F.; Bovo, S.; Moroni, M.; Osmani, V.; Jurman, G.
Abstract
Integration of multimodal, multi-omics data is critical for advancing precision medicine, yet its application is frequently limited by incomplete datasets where one or more modalities are missing. To address this challenge, we developed a generative framework capable of synthesizing any missing modality from an arbitrary subset of available modalities. We introduce Coherent Denoising, a novel ensemble-based generative diffusion method that aggregates predictions from multiple specialized, single-condition models and enforces consensus during the sampling process. We compare this approach against a multi-condition generative model that uses a flexible masking strategy to handle arbitrary subsets of inputs. The results show that our architectures successfully generate high-fidelity data that preserve the complex biological signals required for downstream tasks. We demonstrate that the generated synthetic data can be used to maintain the performance of predictive models on incomplete patient profiles and can leverage counterfactual analysis to guide the prioritization of diagnostic tests. We validated the framework's efficacy on a large-scale multimodal, multi-omics cohort from The Cancer Genome Atlas (TCGA) of over 10,000 samples spanning 20 tumor types, using data modalities such as copy-number alterations (CNA), transcriptomics (RNA-Seq), proteomics (RPPA), and histopathology (WSI). This work establishes a robust and flexible generative framework to address sparsity in multimodal datasets, providing a key step toward improving precision oncology.
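The consensus-during-sampling idea can be caricatured with stand-in denoisers whose predictions are averaged at every step (a toy numpy sketch under strong simplifying assumptions; these linear "models" are not the paper's diffusion networks):

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy target modality each single-condition "model" tries to recover.
target = np.array([1.0, -1.0, 0.5])

def make_denoiser(bias):
    # Stand-in denoiser: nudges the sample toward a slightly biased target.
    return lambda x: x + 0.1 * ((target + bias) - x)

denoisers = [make_denoiser(b) for b in (-0.05, 0.0, 0.05)]

x = rng.normal(size=3)                # start from noise
for _ in range(200):                  # reverse "diffusion" loop (toy)
    # Consensus step: average the ensemble's denoised estimates.
    x = np.mean([f(x) for f in denoisers], axis=0)

assert np.allclose(x, target, atol=1e-3)
```

The point of the sketch is the aggregation step: averaging across single-condition predictors cancels their individual biases, which is the consensus the sampling loop enforces.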
bioinformatics · 2026-04-11 · v3
scLongTree: an accurate computational tool to infer the longitudinal tree for scDNAseq data
Khan, R.; Bhattarai, P.; Zhang, L.; Zhou, X. M.; Mallory, X.
Abstract
Longitudinal single-cell DNA sequencing (scDNA-seq) refers to single-cell data sequenced at different time points, providing more knowledge of the order of mutations than scDNA-seq taken at only one time point. The technique can facilitate the inference of subclonal trees that depict the evolution of cancer cells and improve understanding of how cancer grows, with implications for prognosis and treatment. There is currently a scarcity of tools that can infer subclonal trees based on longitudinal scDNA-seq, and existing tools are limited in accuracy and scale. We therefore introduce scLongTree, a computational tool that can accurately infer a subclonal tree based on longitudinal scDNA-seq. ScLongTree is scalable to hundreds of mutations, and outperforms state-of-the-art tools such as LACE, SCITE, and SiCloneFit on a comprehensive simulated dataset. Tests on a real dataset, SA501, showed that scLongTree can more accurately interpret the progressive growth of the tumor than LACE, and is more robust to different numbers of mutations being used. Tests on a large AML dataset, AML107, which has 4,617 cells, show that scLongTree is scalable to thousands of cells. ScLongTree is freely available at https://github.com/compbio-mallory/sc_longitudinal_infer.
bioinformatics · 2026-04-11 · v2