Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
GAP-MS: Automated validation of gene predictions using integrated mass spectrometry evidence
Abbas, Q.; Wilhelm, M.; Kuster, B.; Frischman, D.
Abstract
Accurate genome annotation is fundamental to modern biology, yet distinguishing authentic protein-coding sequences from prediction artifacts remains challenging, particularly in complex plant genomes where automated methods are error-prone and manual curation is rarely feasible due to prohibitive time and costs. Here, we present GAP-MS (Gene model Assessment using Peptides from Mass Spectrometry), an automated proteogenomic pipeline that leverages mass spectrometry evidence to systematically validate the protein-level accuracy of predicted gene models. Applied across 9 major crop species, GAP-MS consistently improved prediction precision for four widely used gene prediction tools. In addition to filtering erroneous models, the pipeline identified hundreds of previously missing gene models from current standard reference annotations. These peptide-supported loci were further verified by transcriptional evidence, well-supported functional annotations, and high coding-potential scores. Together, these results demonstrate that direct proteomic evidence provides a robust framework for resolving annotation ambiguities, defining high-confidence reference proteomes, and uncovering overlooked protein-coding genes, while facilitating the identification of sequences that may require further investigation.
bioinformatics 2026-05-26 v3
Unveiling the Terra Cognita of Sequence Spaces using Cartesian Projection of Asymmetric Distances
Ramette, A.
Abstract
Visualizing relationships within massive biological datasets remains a significant challenge, particularly as sequence length and volume increase. We introduce CAPASYDIS (Cartesian Projections of Asymmetric Distances), a scalable approach designed to map the explored regions of a given sequence space. Unlike traditional dimensionality reduction methods, CAPASYDIS calculates asymmetric distances which account for both the position and type of sequence variations. It projects sequences into a fixed, low-dimensional coordinate system, termed a "seqverse", where each sequence occupies a permanent location. This design allows for the instant mapping of new sequences without the need to recalculate the global structure, transforming sequence analysis from a relative comparison into navigation on a standardized map. We applied this method to a large rRNA sequence dataset spanning the three domains of life. Our results demonstrate that the sequences of Bacteria, Archaea, and Eukaryota occupy spatially distinct regions characterized by fundamentally different shapes and patterns of variation. Furthermore, the resulting seqverses retain a high amount of taxonomic information when analyzed from broad domain levels to single-base differences. Overall, CAPASYDIS provides a reproducible, scalable framework for defining the boundaries and topography of biological sequence universes.
bioinformatics 2026-05-26 v3
WITHDRAWN: Scalable Microbiome Network Inference: Mitigating Sparsity and Computational Bottlenecks in Random Effects Models
Roy, D.; Ghosh, T. S.
Abstract
The authors have withdrawn their manuscript because the biological validations associated with the inferred microbial interaction directions are currently incomplete and require further verification. We are actively validating these biological directions and ensuring the scientific validity of the reported findings before any further dissemination. Therefore, the authors do not wish this work to be cited as a reference for the project at this stage. If you have any questions, please contact the corresponding author.
bioinformatics 2026-05-26 v2
Prioritizing peptides for targeted mass spectrometry experiments using deep learning
Sonthalia, S.; Dasgupta, P.; Hsu, C.; Wen, B.; MacCoss, M. J.; Noble, W. S.
Abstract
One critical step in any targeted mass spectrometry experiment is selecting, from each protein of interest, a small number of peptides that respond well in the mass spectrometer and can serve as reliable proxies for protein quantification. Existing methods select target peptides either by relying on prior empirical measurements, limiting their applicability to previously observed peptides, or using machine learning to predict peptide behavior from sequence alone. However, current machine learning tools suffer from various limitations, including using detectability as an indirect proxy for intensity, relying on small training sets, or ignoring the precursor charge state. In this study, we introduce Bromo, a transformer-based deep learning model that ranks peptide precursors from a given protein by their relative response, taking charge state into account. Trained on millions of annotated peptide pairs derived from large-scale, publicly available data-independent acquisition mass spectrometry data, Bromo consistently outperforms existing sequence-based methods across diverse, independent datasets. Furthermore, we show that fine-tuning Bromo on experiment-specific data can account for differences in sample preparation, sample matrix, and instrument platform, all of which influence which peptides serve as optimal targets. This adaptability makes Bromo a practical tool for selecting target peptides for selected reaction monitoring and parallel reaction monitoring assay development across a wide range of experimental conditions.
bioinformatics 2026-05-26 v1
Faithful Supervised Dimensionality Reduction for Biomedical Data via Decision Geometry
Wang, Z.; Zhou, Z.; Zhan, Q.; Shen, L.
Abstract
Unsupervised dimensionality reduction methods aim to preserve intrinsic data geometry by maintaining local neighborhoods and approximate global relationships in low-dimensional embeddings, but they do not use label information and therefore may fail to reflect task-relevant class structure in biomedical and health applications. Supervised dimensionality reduction (SDR) incorporates labels to improve class organization, yet existing approaches often face a trade-off between discrimination and geometric faithfulness. Linear supervised methods are stable and interpretable but are limited in their ability to capture nonlinear structure, whereas many nonlinear methods impose supervision directly in the embedding space, which can over-separate classes and distort the underlying manifold. In biomedical applications, labels such as cell types in single-cell data or patient status in clinical cohorts provide meaningful biological signal, and supervised dimensionality reduction can use this information to produce more informative low-dimensional representations. Here we propose a new framework, DG-UMAP (Decision-Geometry UMAP), for faithful supervised dimensionality reduction via decision geometry. We first fit a classifier in the original feature space and use its boundary-local decision geometry to construct a low-rank metric deformation that emphasizes discriminative directions while limiting geometric distortion. Parametric UMAP is then applied to the transformed space, so supervision acts through the ambient geometry rather than by directly forcing class separation in the embedding. Across synthetic and multiple real-world biomedical datasets, our method yields embeddings with improved agreement with class structure and global organization while preserving local neighborhood quality.
bioinformatics 2026-05-26 v1
SynFit: Synergistic Contrastive Learning for Multi-Objective Protein Fitness Prediction and Optimization
Tu, T.; Huang, W.; Li, Z.; Ding, K.; Yang, Y.; Luo, Y.
Abstract
Proteins function through a complex interplay of structural and biochemical properties, and mutations can reshape these properties to generate fitness landscapes spanning multiple functional objectives. A central challenge in protein engineering is the need to simultaneously optimize multiple properties. In biocatalysis, for example, practical enzyme development routinely requires the concurrent optimization of catalytic activity, selectivity, stability, and substrate generality. However, despite recent advances in computational protein design and fitness prediction, most existing approaches treat these properties independently and do not explicitly capture the dependencies and trade-offs that govern real-world protein performance. We present SynFit, a multi-objective learning framework that integrates pretrained protein language models with experimental fitness measurements for protein fitness prediction and engineering. SynFit learns both shared and property-specific protein sequence representations through a synergistic contrastive learning strategy, enabling the identification of variants that simultaneously optimize multiple functional properties. Across a large-scale multi-fitness deep mutational scanning benchmark, SynFit consistently outperforms state-of-the-art supervised models trained on individual objectives and more accurately identifies variants that balance competing functional constraints. We further applied SynFit to multi-objective enzyme design for a new-to-nature biocatalytic enantioselective borylation reaction, providing a diverse array of novel cytochrome c sextuple variants in a single round of design with simultaneously improved catalytic activity and enantioselectivity that rival the best variants obtained through directed evolution. Together, these results establish SynFit as a general framework for multidimensional protein fitness prediction and highlight its potential to enable efficient multi-objective optimization in protein engineering, particularly in biocatalysis.
bioinformatics 2026-05-26 v1
Gene-Specific Analysis of Clonal Hematopoiesis Identifies ASXL1 as a Risk Factor for Lung Cancer
Zhang, Z.; Dong, J.; Huang, Y.; Liu, Y.; Amos, C. I.; Cheng, C.
Abstract
Clonal hematopoiesis of indeterminate potential (CHIP) is a recognized risk factor for hematologic malignancies, but its contribution to different types of solid cancers remains incompletely defined. Here, we performed a systematic, gene-specific analysis of CHIP across 19 common solid cancer types using two large population-based cohorts, the UK Biobank and All of Us. Using Cox proportional hazards models and nested case-control logistic models, we demonstrate that the relationship between CHIP and solid tumors is highly cancer-type specific, with lung cancer exhibiting the strongest association. In lung cancer, this association is largely driven by ASXL1-mutant clones. Specifically, high variant allele fraction (high-VAF) ASXL1 CHIP conferred a significantly increased risk (hazard ratio = 3.2), and the associations remained robust after adjustment for age, sex, body mass index (BMI), smoking status, and genetic ancestry. Notably, ASXL1 CHIP was substantially enriched among smokers, and its association with lung cancer risk was restricted to ever-smokers, highlighting a key interaction between CHIP and environmental exposure. The enrichment of ASXL1 CHIP in lung cancer was further validated in two independent cancer-only cohorts, MSK-IMPACT and TCGA. In addition, rare germline variant association analysis revealed that germline variation in ASXL1 had the strongest association with lung cancer susceptibility among all solid tumors. Collectively, our findings support a model in which smoking-associated expansion of ASXL1-mutant clones contributes to lung cancer development and suggest that gene-specific CHIP metrics may enhance risk stratification and early detection strategies.
bioinformatics 2026-05-26 v1
Application of Computer Vision Tools to Maize Genomic Data for Trait Prediction and Gene Discovery
Higgins, S. A.; Anible, E.; Muthupari, M.; Dibble, C.; Murdoch, R. W.
Abstract
Artificial intelligence and machine learning for computer vision (CV) and image recognition is a rapidly evolving field with multiple potential applications in plant genomics. While CV has been widely adopted by the research community for plant phenotyping and disease surveillance, applications of CV tools to plant genome analysis are underrepresented. CV tools may complement traditional statistical classification tools used in plant genomics, since CV perceives problems holistically rather than granularly (in terms of pattern recognition), which is particularly applicable to analysis of large, complex eukaryotic genomes. In this study, we report on a new strategy to apply existing CV tools to classify plant genotypes and predict genotype-phenotype relationships. A technique was developed for converting maize genome resequencing data into a set of images reminiscent of a quick response (QR) code. Several hundred maize genomes were processed and it was demonstrated that CV models can successfully categorize genome images into heterotic groups (accuracy and recall > 0.8). Models for classifying genome images into phenotypic trait groups (such as short, medium, and high plant height) performed with moderate success for the most heritable trait analyzed (ear height; accuracy and recall > 0.5). Querying model results permitted identification of genome regions that were important for model classification predictions. The CV model results revealed enriched metabolic pathways consistent with traits under consideration. Overall, our initial application of CV tools to plant genome analysis highlights its applicability to genomic data. Design of new CV architectures optimized for genome-derived images may further improve upon our initial results generated using only off-the-shelf CV tools optimized for unrelated image analysis tasks.
bioinformatics 2026-05-26 v1
Pathogen-specific antimicrobial activity prediction with biological large language model-based methods
Ucar, B.; Demirsoy, E.; Salehi, A.; Sutherland, D.; Yanai, A.; Coombe, L.; Thompson, V. C.; Warren, R. L.; Helbing, C. C.; Birol, I.
Abstract
Driven by the rise of antimicrobial resistance, antimicrobial peptides (AMPs) have emerged as promising therapeutics capable of targeting multidrug-resistant pathogens. Because identifying AMPs and their specific targets requires costly and labor-intensive wet-lab experiments, in silico methods to prioritize candidates are highly valuable. However, current computational methods often lack pathogen specificity or fail to incorporate crucial targeted proteomic and genomic contexts. To bridge this gap, we developed triAMPh, a robust, zero-shot framework for pathogen-specific peptide bioactivity prediction. triAMPh integrates a heterogeneous graph attention network-based link predictor (HLP), Extreme Gradient Boosting, and a multilayer perceptron trained on features from biological large language models (bLLMs). Our novel HLP constructs a knowledge graph that maps peptides and pathogens as distinct nodes, connected by similarity and bioactivity edges. The model extracts information through semantic traversals, prioritizing neighboring nodes and their biological contexts. Benchmarking shows that triAMPh provides unbiased, peptide- and pathogen-centered zero-shot predictions, matching or outperforming state-of-the-art methods across all metrics except precision. Ultimately, triAMPh offers a powerful computational tool to accelerate wet-lab AMP discovery while demonstrating the capability of bLLMs to capture complex, pathogen-specific bioactivity patterns.
bioinformatics 2026-05-26 v1
Decoding Multicellular Communication Motifs from Spatial Transcriptomics with ALARMIST
Fan, J.; Hood, J.; Strong, J.; Quinn, J. F.; Dai, Y.; Data Science TeamLab; Schein, A.; Yu, K. K. H.; Tansey, W.
Abstract
Cellular organization is driven by recurrent, coordinated interactions between multiple cell types, each sending and receiving multiple signals. Existing computational methods for spatial profiling data consider only individual ligand-receptor interactions and fail to capture the higher-order interactions governing the tissue microenvironment. To address this gap, we developed ALARMIST (Assessment of Ligand And Receptor Motifs And Impacts in Spatial Transcriptomics), a probabilistic framework that infers interpretable multicellular communication patterns from spatial data. ALARMIST decomposes neighborhood-level signaling patterns into motifs: recurrent communication subnetworks involving multiple cell types and sets of enriched ligand-receptor interactions. For each cell, ALARMIST identifies its active motifs and estimates the downstream phenotypic effects of each motif on active cells. We applied ALARMIST to spatial datasets of lung adenocarcinoma (LUAD) and glioblastoma (GBM) to identify microenvironmental drivers of tumor progression. In paired LUAD and adenocarcinoma-in-situ (AIS) samples, ALARMIST identified an immune-active vascular motif at the tumor-normal boundary and implicated motif-active plasmacytoid dendritic cells as drivers of inflammation in early carcinogenesis. In matched low- and high-grade glioma samples, ALARMIST identified a hub-and-spoke motif centered on a malignant macrophage subpopulation, implicating a GRN-SORT1 signaling axis with a downstream impact gene set predictive of survival in low-grade glioma patients. Code for ALARMIST is available at https://github.com/tansey-lab/alarmist.
bioinformatics 2026-05-26 v1
GAE-Δ: A Graph-Learning Framework for Gene Network Rewiring and Clinical Outcome Prediction from Multi-Omics Data
Tang, Z.; Chen, Z.; Chen, M.; Wang, Y.; Ennis, S.; Niranjan, M.; Ewing, R.
Abstract
Cancer progression and outcomes are driven in part by changes to molecular networks that result from genetic and/or environmental perturbations. These network changes manifest across multiple interconnected network layers and include accumulation of somatic mutations, altered protein-protein interactions and dysregulated gene expression. Here we describe a graph autoencoder-based framework, Graph Autoencoder-Delta (GAE-Δ), for characterizing phenotype-specific gene role shifts across multiomics data. Given samples stratified into two contrasting phenotypic groups and a prior gene interaction network, GAE-Δ constructs group-specific gene graphs for each omics modality and trains, for each modality, a single graph autoencoder jointly on both group graphs, so that the two group-conditional embeddings share a common latent space. Contrasting these embeddings defines a multiomics embedding-shift representation for each gene that reflects how its network role reorganizes across phenotypic contexts. These gene-level shifts are subsequently used for unsupervised gene prioritization, multiomics late fusion and sample-level classification. Applied to five TCGA cancer types with a survival endpoint, GAE-Δ achieves competitive or superior predictive performance compared with classical network-based methods and multiomics matrix factorisation methods (MOFA+, iNMF), with statistically significant AUC gains over MOFA+ in three of five cohorts and statistical ties on the remaining two. Beyond predictive performance, the consensus shift genes are significantly enriched for known cancer drivers in three of five cohorts (hypergeometric p < 0.01; 11- to 17-fold enrichment), whereas matrix factorisation baselines reach p < 0.05 in zero of five cohorts (best per cancer p = 0.06), indicating that GAE-Δ captures biological signal that linear factor methods miss. In summary, the GAE-Δ approach provides both improved outcome classification and biological and mechanistic discovery through deep network-based integration of disease-associated multi-omics data.
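The hypergeometric enrichment statistic quoted in the abstract can be illustrated with a minimal sketch. This is the textbook upper-tail test, not the authors' code; the function name and every number in the usage line are hypothetical and chosen only for illustration.

```python
from math import comb

def hypergeom_enrichment(population, drivers, selected, overlap):
    """Upper-tail hypergeometric p-value P(X >= overlap) and fold enrichment.

    population: number of genes in the background
    drivers:    number of background genes annotated as known drivers
    selected:   number of prioritized genes (e.g. a consensus shift-gene list)
    overlap:    number of prioritized genes that are known drivers
    """
    total = comb(population, selected)
    upper = min(drivers, selected)
    p_value = sum(
        comb(drivers, k) * comb(population - drivers, selected - k)
        for k in range(overlap, upper + 1)
    ) / total
    # Fold enrichment: observed overlap relative to the overlap expected by chance.
    expected = selected * drivers / population
    return p_value, overlap / expected

# Hypothetical inputs, not taken from the paper.
p_value, fold = hypergeom_enrichment(population=20000, drivers=500, selected=100, overlap=10)
print(f"p = {p_value:.2e}, fold enrichment = {fold:.1f}")
```

Exact integer arithmetic via `math.comb` avoids the rounding problems of factorial-based formulas; for very large backgrounds a library routine such as a survival function from a statistics package may be more convenient.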
bioinformatics 2026-05-26 v1
Vision-Based Genomic Model for Copy Number Variant Pathogenicity Prediction
Buralkin, I.; Botas, J.; Chang, K.-L.; Deng, Y.; Papastathopoulos-Katsaros, A.; Liu, Z.; Park, J.
Abstract
Copy number variants (CNVs) are a major class of structural genomic alterations underlying rare disease, including neurodevelopmental delay and intellectual disability, yet predicting their pathogenicity remains challenging. Existing methods reduce CNVs to region-level numerical features, discarding the positional structure and cross-track patterns that expert clinical reviewers use to interpret genomic evidence. To address this, we introduce Tesseract for CNV, a track-based spatial representation for CNV pathogenicity prediction, which represents each variant as a base-pair-resolution multi-track image and models spatial genomic patterns across annotation tracks while preserving positional structure and cross-track dependencies. Trained on a chromosome-level hold-out split of the ClinVar dataset, Tesseract outperforms prior methods on held-out and curated noncoding benchmarks, improving AUROC by up to 0.10 over the state-of-the-art baseline. On the independent DECIPHER cohort, the model demonstrates generalizability by maintaining the highest AUROC and the highest F1 score across baselines. Furthermore, the model localizes pathogenic signals to clinically meaningful genomic subregions, providing track-annotated evidence that supports practical clinical interpretation.
bioinformatics 2026-05-26 v1
Benchmarking sequence performance on the DNBSEQ-T7 using Genome in a Bottle reference genomes
van Coller, A.; Taukobong, S.; Malima, M.; Ghoor, S.; Nangammbi, N.; Roode, E.; Naicker, M.; Cole, V.; Glanzmann, B.; Kinnear, C.; Carstens, N.
Abstract
Advances in sequencing technologies have improved the accuracy, throughput, and completeness of human genome characterization, enabling more reliable detection of genetic variation. Well-characterized reference genomes are critical for benchmarking sequencing platforms and bioinformatics analysis pipelines. Here, we present whole genome sequencing datasets generated for the Ashkenazi Jewish trio reference samples from the Genome in a Bottle Consortium. Libraries were prepared using three distinct MGI-based workflows: PCR-free library preparation, FastFS DNA library preparation, and Universal DNA library preparation. Sequencing was performed on the MGI DNBSEQ-T7 platform, generating a minimum of 400 million paired-end reads per sample, corresponding to 30X mean genome coverage. Raw reads were processed using a standardized GATK bioinformatics workflow. Sequencing performance and variant detection accuracy were evaluated using the Genome in a Bottle high-confidence benchmark variant sets. All workflows demonstrated high sequencing quality and concordance with GIAB benchmark truth sets, with PCR-free libraries showing the strongest indel calling performance and lowest Mendelian violation rates across the Ashkenazi trio. This dataset provides a resource for benchmarking DNBSEQ-T7 sequencing and bioinformatics workflows, and for evaluating the impact of library preparation strategies on whole genome variant detection performance.
bioinformatics 2026-05-26 v1
CoSTAR: Coarse Stem-Topology Alignment of Pseudoknotted RNA Structures by Relation-Constrained Search
Archinuk, F.; Jabbari, H.
Abstract
RNA structural alignment is a central task in comparative RNA analysis, but many efficient methods achieve tractability by restricting the class of admissible structures, often excluding pseudoknots. This exclusion is limiting for viral and regulatory RNAs, where conserved structure can remain informative even when sequence conservation is weak. We introduce a coarse RNA structural alignment algorithm that aligns secondary structures by searching over partial maps between stems rather than nucleotides. Each input structure is decomposed into stems, annotated with nucleotide-level features, and encoded by pairwise topological relations among stems. Alignment is formulated as a cost-minimizing partial stem map with skip operations, and the search tree is pruned by RNA-specific directionality and topological constraints derived from already aligned stems. For the stated cost function and over the class of injective, direction-preserving, topologically consistent stem maps, the search is exact. This shifts the dominant computational dependence from sequence length to the number and arrangement of stems. We evaluated the method on 2100 pairwise alignments sampled from seven Rfam families spanning 40-224 nucleotides and 2-15 stems. Across these benchmarks, the algorithm returned terminal coarse alignments in which every stem was either matched or skipped. We measured running time and search-tree width to characterize performance on diverse family-to-family comparisons. The experiments also show that ordering the input structures affects efficiency: using the structure with more stems as the search-driving structure reduces tree width. The resulting partial stem map is directly interpretable for RNA annotation and can be projected to nucleotide resolution for downstream sequence-structure analysis. The code for CoSTAR is available at: https://github.com/TheCOBRALab/CoSTAR
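The decomposition of a structure into stems, and the pairwise topological relation between two stems, can be sketched as follows. This is an illustrative simplification and not the CoSTAR implementation: it assumes extended dot-bracket input, defines a stem as a run of directly stacked base pairs with no bulges, and reduces the topological relation to a single crossing test. All function names are hypothetical.

```python
def parse_pairs(structure):
    """Return sorted base pairs (i, j) from dot-bracket notation.

    Several bracket types are tracked on separate stacks so that
    pseudoknotted (crossing) pairs can be represented.
    """
    openers = {"(": ")", "[": "]", "{": "}", "<": ">"}
    closers = {close: open_ for open_, close in openers.items()}
    stacks = {open_: [] for open_ in openers}
    pairs = []
    for pos, ch in enumerate(structure):
        if ch in openers:
            stacks[ch].append(pos)
        elif ch in closers:
            pairs.append((stacks[closers[ch]].pop(), pos))
    return sorted(pairs)

def stems_from_pairs(pairs):
    """Group directly stacked pairs (i, j), (i+1, j-1), ... into stems."""
    pair_set = set(pairs)
    stems = []
    for i, j in pairs:
        if (i - 1, j + 1) in pair_set:
            continue  # an inner pair; its stem starts further out
        stem = [(i, j)]
        while (stem[-1][0] + 1, stem[-1][1] - 1) in pair_set:
            stem.append((stem[-1][0] + 1, stem[-1][1] - 1))
        stems.append(stem)
    return stems

def crosses(stem_a, stem_b):
    """True if the outermost pairs interleave (i < k < j < l), i.e. a pseudoknot."""
    (i, j), (k, l) = sorted([stem_a[0], stem_b[0]])
    return i < k < j < l

stems = stems_from_pairs(parse_pairs("((..[[..))..]]"))
print(len(stems), crosses(stems[0], stems[1]))
```

Working at this level means an alignment search compares a handful of stems rather than every nucleotide, which is the shift in computational dependence the abstract describes.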
bioinformatics 2026-05-26 v1
How flat is your sample? An opportunistic survey of 3D tilt in public fluorescence microscopy data
Brocard, J.
Abstract
Sample planarity is rarely monitored in fluorescence microscopy quality control, yet focal plane deviations across the field of view are a potential source of measurement error. Here I describe FlatStat, a tool that estimates sample tilt automatically from any 3D fluorescence stack, without prior knowledge of sample content, by fitting a plane to the Z-map of maximum intensity. Applied to an Argolight calibration slide and biological samples on a laser-scanning confocal system, FlatStat yielded reproducible slope and direction measurements attributable to the instrument rather than the sample. To establish community reference values, FlatStat was extended to Python and applied opportunistically to 1204 image stacks from 22 projects in the Image Data Resource, yielding 4670 tilt measurements. Slopes spanned several orders of magnitude across projects; inter-channel coherence confirmed that measured tilt reflects physical stage and mounting geometry rather than channel-specific biological topography. Unfortunately, instrument and sample preparation metadata were largely absent from the corpus, limiting causal inference. Finally, controlled tilt experiments on fluorescent beads showed that chromatic shift increased modestly with tilt (~57 nm over the full range tested), while lateral and axial resolutions were essentially unaffected.
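The plane-fitting idea can be sketched in a few lines of NumPy. This is an assumption-laden illustration, not FlatStat itself: the function name, the (Z, Y, X) axis order, the voxel-size parameters, and the absence of any masking or outlier rejection are all choices made here for brevity.

```python
import numpy as np

def estimate_tilt(stack, voxel_xy=1.0, voxel_z=1.0):
    """Fit a plane z = a*x + b*y + c to the Z-map of maximum intensity.

    stack is a 3D array ordered (Z, Y, X). Returns the slope (z units per
    xy unit) and the direction of steepest ascent in degrees from the x axis.
    """
    # Z-map: for each (y, x) pixel, the slice index where intensity peaks.
    z_map = np.argmax(stack, axis=0) * voxel_z
    ny, nx = z_map.shape
    yy, xx = np.mgrid[0:ny, 0:nx]
    design = np.column_stack(
        [xx.ravel() * voxel_xy, yy.ravel() * voxel_xy, np.ones(nx * ny)]
    )
    # Ordinary least squares for the plane coefficients (a, b, c).
    (a, b, _c), *_ = np.linalg.lstsq(design, z_map.ravel(), rcond=None)
    slope = float(np.hypot(a, b))
    direction = float(np.degrees(np.arctan2(b, a)))
    return slope, direction

# Synthetic stack: a bright sheet that rises by one slice per pixel along x.
stack = np.zeros((20, 16, 16))
for x in range(16):
    stack[2 + x, :, x] = 1.0
print(estimate_tilt(stack))
```

On real data the Z-map is noisy wherever the sample is dim, so some form of intensity thresholding or robust fitting would likely be needed before the slope is trusted.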
bioinformatics 2026-05-26 v1
Multi-Algorithm Machine Learning Benchmarking for Pan-Cancer Classification from Tumour-Educated Platelet RNA Sequencing
Ray, S.; Zalawadia, D. H.; Bhate, V.; Chakravarthy, T. D.; Chetty, A. G.
Abstract
Tumour-educated platelets (TEPs) carry cancer-type-specific RNA signatures accessible through whole-blood RNA sequencing, but systematic multi-algorithm benchmarking with quantified statistical uncertainty had not been applied to the GSE68086 dataset. We applied an end-to-end transcriptomic and machine learning framework to 280 whole-blood platelet RNA-seq samples from six cancer types (non-small cell lung cancer, colorectal cancer, glioblastoma multiforme, hepatobiliary cancer, breast cancer, and pancreatic cancer) and healthy donors. After a standardised preprocessing and normalisation pipeline, seven supervised classifiers (Logistic Regression, SVM (RBF), XGBoost, LightGBM, Random Forest, K-Nearest Neighbours, and a Multilayer Perceptron) were benchmarked using stratified 5-fold cross-validation and a held-out test set. Statistical uncertainty was quantified via 2,000-resample percentile bootstrap confidence intervals. Multinomial Logistic Regression achieved the highest test macro F1-score (0.522) and macro-averaged ROC-AUC (0.869), both substantially above the seven-class chance level (1/7 ≈ 0.14). SHAP analysis of the Random Forest classifier identified IFITM3 as the globally dominant TEP biomarker; cancer-type-specific discriminators included ATP5PD (hepatobiliary cancer), C6orf62 (NSCLC and pancreatic cancer), VPS13C (healthy donors), and TMSB4Y (breast cancer). Gene Ontology and KEGG pathway enrichment corroborated the biological specificity of identified transcriptomic signatures. These results support the diagnostic potential of TEP transcriptomics as a multi-class liquid biopsy platform and provide a methodologically transparent, reproducible reference framework for future blood-based cancer classification studies.
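A percentile bootstrap confidence interval for macro F1 can be sketched with the standard library alone. This is a generic illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, resampling is over test samples with replacement, and a class that happens to be absent from a resample contributes an F1 of 0.

```python
import random

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores over a fixed label set."""
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def bootstrap_ci(y_true, y_pred, labels, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval: resample pairs, recompute, take quantiles."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(
            macro_f1([y_true[i] for i in idx], [y_pred[i] for i in idx], labels)
        )
    stats.sort()
    # Rank-based approximation of the alpha/2 and 1 - alpha/2 quantiles.
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[max(int((1 - alpha / 2) * n_resamples) - 1, 0)]
    return lower, upper

# Hypothetical labels for illustration only.
y_true = [0, 1, 2] * 20
y_pred = [0, 1, 2] * 15 + [1, 2, 0] * 5
print(macro_f1(y_true, y_pred, [0, 1, 2]), bootstrap_ci(y_true, y_pred, [0, 1, 2]))
```

With only a few samples per class, as in a small held-out test set, the interval can be wide, which is why reporting it alongside the point estimate is informative.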
bioinformatics 2026-05-26 v1
GenesetDiseaseDrugNetwork (GDDN): a web server for disease enrichment and drug prioritization
More, P.; Fontaine, J.-F.; Ten Cate, V.; Wild, P. S.; Andrade-Navarro, M. A.
Abstract
Summary: Omics technologies profile thousands of genetic and molecular features to provide a comprehensive and quantitative measure of the cellular state. Transcriptomics and proteomics, in particular, have guided discoveries of the most important biomarkers and therapeutic targets. By virtue of ongoing developments in single-cell and spatial technologies, the fields of targeted therapeutics and personalized medicine are rapidly advancing. However, downstream functional analysis and disease association remain daunting tasks in bioinformatics. We address these challenges with the GenesetDiseaseDrugNetwork (GDDN) web server. GDDN facilitates functional discovery by connecting gene-sets to enriched diseases and their corresponding therapeutics in a single step. Using a ranking system that incorporates regulatory impact, specificity, and potency, GDDN effectively prioritizes drugs with the highest clinical relevance. Our platform facilitates the interpretation of omics outputs into disease associations and personalized drug identification.
Availability and Implementation: The GDDN web server is implemented in R Shiny and is freely accessible at https://cbdm-01.zdv.uni-mainz.de/shiny/piyusmor/GDDN/, supporting all major web browsers.
bioinformatics 2026-05-26 v1
Beyond natural amino acids: Extending immunogenicity risk assessment to non-canonical peptide drugs through chemical feature encoding
Cairoli, M.; Nielsen, M.; Betts, C.; Obrezanova, O.; De Maria, L.
Abstract
Peptide therapeutics are increasingly used to treat challenging diseases, but immunogenicity risks limit their clinical success. In silico tools enable immunogenicity screening through prediction of peptide-MHCII binding, yet current methods fail to capture chemical properties of non-natural amino acids routinely incorporated to improve drug properties. Here, we present a machine learning approach combining chemical fingerprints with sequence information to predict MHC class II binding for both canonical and modified peptides. We propose two molecular representations (direct-encoding and similarity-based chemical fingerprints) that preserve positional information while encoding chemical diversity. These representations achieved performance comparable to sequence-based encodings (BLOSUM62 and one-hot) for canonical peptides while accurately identifying binding cores and motifs. When tested on citrullinated peptides, chemical fingerprints substantially improved quantitative prediction accuracy while maintaining comparable linear correlation across encoding methods, demonstrating the importance of explicit chemical representation for accurate absolute binding affinity prediction. These descriptors can be integrated into pan-allele prediction frameworks, enabling immunogenicity risk assessment across diverse modifications and therapeutic modalities, including peptide therapeutics, antibody-drug conjugates, and synthetic vaccines. The proposed chemistry-informed framework addresses a critical gap in preclinical drug development, facilitating early mitigation strategies before costly clinical trials.
bioinformatics 2026-05-26 v1
LVentiView: An Open-Source Software for Automated 3D Left Ventricular Mesh Reconstruction and Analysis from Cardiac MRI
Braun, I.; Wang, Y.; Ecker, A. S.; Bodenschatz, E.
Abstract
Patient-specific cardiac modeling requires accurate three-dimensional representations of the left ventricle (LV) reconstructed from cardiac magnetic resonance imaging (MRI). Here, we present LVentiView, an open-source software that bridges medical imaging and cardiac simulation by automating the full pipeline from MRI segmentation to simulation-ready volumetric meshes, with integrated tools for volumetric analysis and regional myocardial thickness calculation. We validate LVentiView on the Sunnybrook Cardiac Dataset, comprising healthy subjects and three cardiac pathologies. LVentiView achieves blood pool segmentation at the inter-expert level. The generated meshes are verified by comparing LV volumes extracted from the meshes to those computed from expert manual segmentation masks, with volumes and cardiac parameters agreeing within inter-expert variability across all four subject groups. In addition, mesh-derived regional thickness maps capture pathology-specific patterns, including wall thickening in hypertrophic cases. LVentiView is freely available on GitHub and provides an accessible, validated foundation for patient-specific cardiac modeling.
bioinformatics2026-05-26v1Prediction and evaluation of Split-ORFs using Ribo-seq data
Kalk, C.; Murtagh, J.; Despic, V.; Mueller-McNicoll, M.; Schulz, M.Abstract
Split open reading frames (Split-ORFs) occur in transcripts containing at least two open reading frames, each encoding a part of the same full-length protein. These multiple open reading frames arise from alternatively spliced transcript isoforms. Split-ORFs have been described in the SR protein family of splicing factors, where the resulting protein halves play important autoregulatory roles. Here, we present the Split-ORF pipeline, a computational tool that predicts Split-ORFs from transcript sequences and identifies regions unique to the predicted Split-ORF products. Using this pipeline, we predicted more than 14,000 Split-ORF transcripts from alternatively spliced human transcripts containing premature termination codons or retained introns. Hundreds of the Split-ORF unique regions show significant Ribo-seq coverage across diverse cell types and diseases. The candidate Split-ORF genes with significant Ribo-seq coverage are enriched for RNA-binding and RNA-processing functions, and the majority of them encode RNA-binding proteins. Together, these results suggest that Split-ORFs are more widespread than previously assumed and are expressed across diverse cellular contexts. This work paves the way for future studies of the Split-ORF candidates, the mechanisms of their biogenesis and their functions within the RNA-binding protein class.
bioinformatics2026-05-26v1ARACoFusion: Uncertainty-aware calibrated deep learning for protein-protein interaction network prediction in Arabidopsis thaliana
Sarkar, D.; Sarkar, C.Abstract
Accurate mapping of the Arabidopsis thaliana protein-protein interaction (PPI) network is essential for deciphering the complexity of plant systems biology. Here, we present ARACoFusion, a specialized deep learning architecture designed to predict inter-protein connectivity directly from primary sequences. To capture the asymmetric dependencies between plant proteins, the framework utilizes a reciprocal cross-attention encoder combined with latent interaction projections and multi-source feature fusion. Addressing the severe class imbalance inherent in plant interactomes, the model integrates uncertainty-aware variance regularization and focal loss with label smoothing, further enhancing reliability through post hoc probability calibration via temperature scaling. Extensive benchmarking on gold-standard Arabidopsis datasets demonstrates that ARACoFusion significantly outperforms existing plant-specific predictors, achieving superior scores in Area Under the Precision-Recall Curve (AUPRC), Balanced Accuracy, and Matthews Correlation Coefficient (MCC). Additionally, the model exhibits robust cross-species generalization and clear class separability in t-SNE latent space visualizations. To facilitate community-wide usage, we provide a dedicated web server for scalable network-level inference at https://ARAcofusion.compbiosysnbu.in/.
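The post hoc temperature-scaling step mentioned above can be illustrated with a minimal sketch. It is not ARACoFusion's implementation: it assumes a binary interaction classifier that emits one logit per protein pair, and it fits the temperature by a simple grid search over held-out data rather than by gradient descent. The function names, the grid, and the toy validation values are all hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def mean_nll(logits, labels, temperature):
    """Mean binary negative log-likelihood after dividing logits by a temperature."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for z, y in zip(logits, labels):
        p = min(max(sigmoid(z / temperature), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(logits)


def fit_temperature(logits, labels, grid=None):
    """Pick the temperature on a grid that minimises validation NLL.

    The grid (0.05 to 10.0 in steps of 0.05) is an illustrative assumption.
    """
    if grid is None:
        grid = [0.05 * k for k in range(1, 201)]
    return min(grid, key=lambda t: mean_nll(logits, labels, t))


# Toy held-out set (hypothetical values): confident logits, two of them wrong.
val_logits = [8.0, 6.0, -7.0, 5.0, -6.0, 7.0, -5.0, 9.0]
val_labels = [1, 0, 0, 1, 1, 1, 0, 1]

T = fit_temperature(val_logits, val_labels)
calibrated = [sigmoid(z / T) for z in val_logits]
```

Because a single scalar rescales every logit, the ranking of predictions (and therefore AUPRC) is unchanged; only the confidence of the probabilities moves. An overconfident model, as in the toy data, yields a fitted temperature above 1.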
bioinformatics2026-05-26v1Cycle-consistent deep generative modeling unifies cellular states across unpaired spatial and single-cell modalities
Zhang, H.; Quinn, J. F.; Data Science TeamLab; Tansey, W.Abstract
Current spatial and single-cell technologies capture complementary but incomplete views of cellular state, with transcriptomic, proteomic, and spatial information distributed across distinct platforms. Integration is challenged by unpaired measurements, mismatched feature spaces, and modality-specific biases. We present MultiTME, a multimodal framework that integrates heterogeneous spatial and single-cell data using a spatially-regularized, cycle-consistent deep generative model. By enforcing consistency of bidirectional mappings, MultiTME learns a shared latent representation that enables translation between modalities without requiring paired observations or shared features. Across benchmarks, MultiTME outperforms existing methods, produces accurate cross-modal cell typing, improves spatial transcriptomic panel completion, and transfers whole-transcriptome information to generate spatially resolved maps at cellular resolution. Applied to a multimodal colorectal cancer dataset, MultiTME integration reveals a spatially coherent proliferative-invasive tumor axis not directly observable within single modalities. Across five multimodal spatial datasets, we show that MultiTME can correct for platform-specific biases between Xenium and CosMx, thereby facilitating cross-dataset harmonization and enabling pan-cancer spatial studies.
bioinformatics2026-05-26v1NAP: an open-source pipeline for cross-domain microbiome profiling using Nanopore sequencing-derived amplicon data
Jones, L. B.; Bagby, S.Abstract
Background Nanopore sequencing offers a cost-effective and portable platform for microbiome analysis, but amplicon-based approaches remain limited by higher sequencing error rates and a lack of workflows tailored to mixed domain ribosomal RNA profiling. While short-read technologies dominate microbial community analysis, their portability and flexibility are constrained. There is therefore a need for robust pipelines designed specifically for cross-domain Nanopore amplicon data. Results We introduce the Nanopore sequencing-based Amplicon Pipeline (NAP; https://github.com/Luke-B-Jones/NAP), an open-source workflow optimised for flexible mixed domain primer sets such as 515Y/926R. NAP performs adaptive quality filtering, chimera removal, centroid generation, BLAST-based taxonomic classification, hierarchical consensus correction, and domain-aware post-processing, outputting decontaminated abundance tables suitable for downstream analysis. Initial validation against two complementary commercial mock communities showed that NAP achieved strong genus-level performance across both low complexity logarithmic and more compositionally complex gut mock communities. Detection was most reliable above ca. 1% relative abundance, and replicate outputs showed strong agreement with expected composition under Bray-Curtis, Jaccard, agreement-plot, and Bland-Altman analyses. Benchmarking of NAP's internal filtering modes showed that the default adaptive setting provided the most robust balance of read quality, retained depth, and downstream taxonomic fidelity across heterogeneous inputs. Direct comparison against QIIME2 and Kraken2/Bracken further showed that NAP most accurately preserved expected community structure, with markedly fewer false positive assignments at genus level and substantially stronger species-level behaviour under the tested conditions.
Species-level assignments were informative for some taxa, but remained less robust than genus-level outputs with the default V4-V5 amplicon. Conclusions NAP provides a robust and flexible workflow for cross-domain Nanopore amplicon profiling, with strongest performance at genus level and competitive species-level behaviour for well resolved taxa. Although analysis of field-derived data was not assessed here, NAP compatibility with portable Nanopore sequencing supports accurate mixed domain microbiome profiling under the tested conditions.
bioinformatics2026-05-26v1Sparse, trainable subnetworks for multi-omics integration: a cross-validated evaluation of the Lottery Ticket Hypothesis across nutrigenomic, toxicogenomic, and oncogenomic datasets
Miszczak, R.Abstract
Multi-omics integration, the joint analysis of two or more high-dimensional molecular data types collected on the same biological samples, is now a standard analytical approach across nutrigenomics, toxicogenomics, microbiome research, and disease genomics. Existing methods sit on a trade-off between expressiveness and interpretability: latent-variable methods such as MOFA and DIABLO yield compact, biologically interpretable signatures but assume a restrictive linear structure; tree ensembles such as Random Forests achieve strong predictive performance but resist mechanistic interpretation; deep neural networks combine the drawbacks of both, with large numbers of opaque weights and no built-in feature selection. I ask whether the Lottery Ticket Hypothesis (LTH), the conjecture that a randomly initialised dense network contains a sparse subnetwork that matches its accuracy when trained from the original initialisation, can help reconcile this trade-off in the multi-omics setting. I apply Iterative Magnitude Pruning with weight rewinding for 25 rounds (cumulative sparsity 99.6%) on a multi-input fused multi-layer perceptron across eight datasets spanning four biological domains (n=40 to n=1,492), with 5-fold outer cross-validation and inner-validation winning-ticket selection to avoid test-set leakage. On the largest task, TCGA Pan-Cancer (4-class tissue-of-origin, n=1,492), a 2,952-weight subnetwork (83% sparsity) reached 84% +/- 3% test accuracy compared with 86% +/- 2% for the dense network. Pruning improved test accuracy on two TCGA staging tasks (TCGA-LUAD: 51% +/- 1% vs 45% +/- 5%; TCGA-KIRC: 50% +/- 4% vs 48% +/- 7%). Networks compressed by 6x to 270x while retaining task-level signal on well-specified tasks. I suggest LTH as a domain-agnostic, prior-free option for sparse neural integration of multi-omics data, complementary to graph-based and pathway-constrained methods.
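The sparsity figures quoted above follow from the geometric schedule of Iterative Magnitude Pruning. The sketch below assumes a per-round pruning rate of 20% of the surviving weights; the abstract does not state the rate, but this value reproduces the quoted 99.6% after 25 rounds (and about 83% after 8 rounds). The helper names and the toy weights are hypothetical, and the training and weight-rewinding steps between rounds are omitted.

```python
def cumulative_sparsity(rate, rounds):
    """Fraction of weights removed after `rounds` rounds, each pruning
    `rate` of the weights that are still alive."""
    return 1.0 - (1.0 - rate) ** rounds


def magnitude_prune(weights, mask, rate):
    """One pruning round: zero the mask for the smallest-magnitude
    surviving weights. Returns a new mask; `weights` is not modified."""
    alive = [i for i, m in enumerate(mask) if m]
    n_prune = int(rate * len(alive))
    to_prune = sorted(alive, key=lambda i: abs(weights[i]))[:n_prune]
    new_mask = list(mask)
    for i in to_prune:
        new_mask[i] = 0
    return new_mask


# Assumed 20% per round (not stated in the abstract).
after_25 = cumulative_sparsity(0.2, 25)   # about 0.9962
after_8 = cumulative_sparsity(0.2, 8)     # about 0.832

# Toy example with five weights and a 40% pruning rate.
toy_weights = [0.5, -0.1, 0.3, -0.9, 0.05]
toy_mask = magnitude_prune(toy_weights, [1, 1, 1, 1, 1], 0.4)
```

In a full IMP loop, each round would train the masked network, call a pruning step like `magnitude_prune`, and then rewind the surviving weights to their early-training values before the next round.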
bioinformatics2026-05-26v1Tandem: a bioinformatics tool for detection, mechanism classification, and population quantification of bacterial tandem gene duplications
Ngan, W. Y.; Smith, E. S. J.Abstract
Motivation: Tandem gene duplication drives antibiotic resistance, metabolic adaptation, and gene-family expansion in bacteria, but no single tool detects duplications in reference genomes, discovers their junctions in isolate sequencing, and quantifies those junctions in population samples. Existing callers (e.g. breseq) detect duplications without classifying formation mechanisms and often fail to quantify them. Results: Tandem has three modules. Module 1 detects reference-genome duplications by NUCmer self-alignment and classifies each by homologous-recombination signature and junction microhomology length. Module 2 confirms junctions in whole-genome sequencing at user-nominated coordinates after the user inspects the coverage plot. Module 3 quantifies known junctions in population sequencing using the novel Junction Read Ratio (JRR). On 280 artificial population tests across seven bacterial species, Tandem achieves 100% recall and 4.3% mean absolute error. Applied to experimentally evolved Pseudomonas fluorescens SBW25 populations, Tandem resolves multiple co-segregating duplication fragments.
bioinformatics2026-05-26v1Constrained protein Large Language Model illustrated in protein stability, function and epistasis
Tzavella, K.; Olsen, C.; Vranken, W. F.Abstract
Our understanding of protein function and evolution is largely based on the relationship between amino acid sequence and overall fold, now effectively captured by computational models. Yet predicting how mutations--shaped by epistasis--alter protein behavior, especially in dynamic or structurally ambiguous regions, remains difficult. Here we present D2D, which combines a self-supervised protein language model with protein-specific evolutionary information to predict mutational effects using little to no task-specific labeled data. D2D captures long-range epistatic interactions and accurately predicts the effects of single and higher-order mutations on protein thermostability and binding without being trained on the task. When fine-tuned, D2D outperforms state-of-the-art methods on latent driver cancer mutations and co-occurring proliferation-enhancing mutations across independent experimental studies. Unlike most existing approaches, D2D avoids biases linked to solvent accessibility or to multiple sequence alignment depth and quality, making it particularly effective for disordered or surface binding regions where structure-based predictors typically falter. Overall, D2D provides a general framework for modeling mutational effects in proteins with limited experimental or structural information.
bioinformatics2026-05-26v1Precision survival estimation in acute myeloid leukemia using evolutionary learning-derived microRNA signature
Yerukala Sathipati, S.; Agustriawan, D.; Gopireddy, N. S. R.; Popat, A.; Moat, L.; Aimalla, N.; Elugoti, M. R.; Kampa, S. A.; Sharma, P.; Ho, S.-Y.; Sharma, R.Abstract
Background Acute myeloid leukemia (AML) remains the most lethal acute leukemia in adults, with 5-year overall survival below 32% despite recent advances including venetoclax-, FLT3-, IDH1/2-, and Menin-targeted therapies. Clinical outcomes remain highly heterogeneous across patients, highlighting the need for robust molecular biomarkers capable of improving prognostic precision. MicroRNAs (miRNAs) are critical regulators of hematopoietic differentiation, apoptosis, and therapeutic resistance and are differentially expressed across AML subtypes. However, their clinical translation has been limited by high dimensionality, feature redundancy, and relatively small cohort sizes. Methods We developed and evaluated the AML Survival Estimator (AMLS), an inheritable bi-objective combinatorial genetic algorithm integrated with support vector regression (SVR), using TCGA-LAML miRNA expression profiles (n = 156). AMLS was benchmarked against ten widely used machine-learning approaches, including penalized regression, tree-based ensembles, support-vector regression, k-nearest neighbors, and multilayer perceptron models. Performance was assessed using stratified cross-validation with Pearson correlation (R), Harrell's concordance index (C-index), and mean absolute error (MAE). Functional characterization of the derived miRNA signature was performed through consensus target integration followed by pathway enrichment, gene ontology analysis, network reconstruction, and Kaplan-Meier risk stratification. Results AMLS achieved superior prognostic performance with pooled out-of-fold metrics of Pearson R = 0.86, C-index = 0.788, and MAE = 7.49 months, substantially outperforming all comparator models. 
Restricting analyses to the AMLS-derived 28-miRNA signature improved all baseline learners by approximately 2-4-fold, with the multilayer perceptron achieving R = 0.674; however, none matched the native AMLS framework, indicating that the evolutionary optimization strategy contributes predictive information beyond feature selection alone. The prognostic signature included biologically established AML-associated miRNAs, including hsa-miR-191, hsa-miR-29c, hsa-miR-125b, hsa-miR-148a, hsa-miR-15b, hsa-miR-10b, and hsa-miR-30c, linked to DNA methylation, apoptosis, cell-cycle regulation, and oncogenic Wnt/MAPK signaling pathways. Functional analyses demonstrated significant enrichment of canonical AML-associated pathways, including p53, PI3K-AKT, TGF-Beta, JAK-STAT, FoxO, and hematopoietic lineage signaling. Conclusions Our findings demonstrate that evolutionary learning integrated with SVR can recover a compact and biologically interpretable miRNA prognostic signature that substantially outperforms conventional machine-learning approaches for AML survival prediction. The identified miRNA network converged on key leukemogenic pathways involved in apoptosis, cell-cycle regulation, and oncogenic signaling, supporting both the biological relevance and prognostic utility of the framework. Given the minimally invasive and quantitatively scalable nature of miRNA profiling, this approach may provide a practical molecular adjunct for improving prognostic assessment and precision medicine strategies in AML.
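For readers unfamiliar with Harrell's concordance index reported above, the following is a minimal sketch of how it can be computed when the model outputs a predicted survival time, as a regression model does. It is not the AMLS code. It assumes the common convention that a pair is comparable only when the patient with the shorter observed time had an event, it skips pairs with tied observed times for simplicity, and the toy data are hypothetical.

```python
def concordance_index(times, events, predicted):
    """Harrell's C-index for predicted survival times.

    times     : observed follow-up times
    events    : 1 if the event (death) was observed, 0 if censored
    predicted : predicted survival times (longer means better prognosis)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if predicted[i] < predicted[j]:
                    concordant += 1.0   # ordering agrees with the outcome
                elif predicted[i] == predicted[j]:
                    concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy cohort: times in months, third patient censored.
obs_times = [5, 10, 15, 20]
obs_events = [1, 1, 0, 1]
pred_times = [6, 12, 9, 25]

c_index = concordance_index(obs_times, obs_events, pred_times)  # 4 of 5 pairs agree
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ranking, which is why the C-index complements MAE: it scores the ordering of patients rather than the size of the error in months.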
bioinformatics2026-05-26v1Integrated optimization of experimental and computational workflows improves genome recovery in long-read gut metagenomics
Hu, Y.; Sun, L.; Huang, Y.; Jiang, F.; Tong, X.; Yang, J.; Ju, Y.; Yang, Z.; Liufu, S.; Hu, Y.; Ma, W.; Guo, R.; Li, W.; Zhang, T.; Zhu, X.; Zhang, Z.Abstract
Short-read metagenomic sequencing is widely applied in microbiome research due to its high quality and increasingly affordable cost. However, it produces short, fragmented reads, which limit assembly contiguity and the recovery of complete microbial genomes. In contrast, long-read sequencing, with substantially longer read lengths, can help overcome these limitations. Achieving complete and accurate genome recovery is a central goal in metagenomics. To advance this goal, we present a systematic effort to unify and optimize the long-read sequencing workflow, from experimental sample processing to computational genome assembly, using the CycloneSEQ platform.
bioinformatics2026-05-26v1OryzaG3: A Single-species Genomic Foundation Model Pretrained on Rice Pangenome
Yang, L.; Xia, Y.; Yang, Z.; Xia, C.; Wu, T.; Zou, M.; Xia, Z.Abstract
While multi-species genomic language models have advanced biological representation learning, high-quality, single-species foundation models for crops remain scarce. Leveraging recently expanded rice pangenome resources, we introduce OryzaG3, a species-specific DNA language model with 700M parameters. OryzaG3 was pretrained on 59.20 Gb of chromosome-level sequences from 149 high-quality rice genomes using a non-overlapping 3-mer tokenization strategy and a causal language modeling objective, featuring context-length variants up to 32k tokens. On the Plants Genomic Benchmark polyA prediction task, OryzaG3 achieves competitive predictive performance against leading multi-species models while delivering a four-fold increase in inference throughput under identical long-context conditions. Ultimately, OryzaG3 demonstrates that lightweight, single-species foundation models trained on high-quality pangenomes can match multi-species benchmarks while significantly reducing computational overhead. This work provides a scalable framework for rice functional genomics, molecular breeding, and targeted crop foundation model development.
bioinformatics2026-05-26v1IID-KG: An ontology-aligned literature-derived knowledge graph for infectious and immune-mediated diseases
PAN, F.; Zhang, Y.; Wang, J.; Liu, M.-C.; Sui, X.; Yue, H.; Zhang, J.Abstract
Infectious and immune-mediated diseases (IIDs) represent a broad and rapidly expanding biomedical literature domain in which scalable evidence extraction, disease ontology refinement, and interpretable knowledge integration are essential for biomedical discovery. We constructed an IID-specific biomedical knowledge graph (IID KG) from PubMed abstracts and PMC full-text articles by integrating nested named entity recognition, ontology-guided identifier assignment, full-text relation extraction, and relation-resolution strategies. A gold-standard corpus of 500 PubMed abstracts and 8 PMC full-text articles was manually annotated for nested biomedical entities across six entity types. The resulting models were applied to 30,128,068 PubMed abstracts and 1,385,500 IID-related PMC full-text articles. A unified IID ontology was developed from 411,341 disease terms using hierarchical text classification, large language model-based refinement, ontology cross-referencing, and expert review, yielding 179,657 confirmed MeSH mappings. The final IID KG contains approximately 1,837,513 unique entities and 16,295,390 unique relations across eight relation types. The resource was released publicly together with repurposing workflows, supporting ontology-aligned literature mining, disease mechanism analysis, and drug-repurposing hypothesis generation for IID research.
bioinformatics2026-05-26v1misosoup: A metabolic modeling tool for identifying minimal microbial communities provides valuable insights into microbial ecology and biotechnological applications
Ochsner, N.; San Roman, M.; Jimenez-Fernandez, A.; Bonhoeffer, S.; Pascual-Garcia, A.Abstract
Microbial survival and function often depend on metabolic interactions within communities. Therefore, a central question in disentangling microbial organization is determining which minimal groups of species are able to thrive in a given medium--referred to as 'minimal communities'. Answering this question is essential for understanding microbial distribution, enhancing laboratory cultivation, and designing synthetic communities (SynComs). Here, we introduce misosoup, a Python package for identifying minimal communities (MInimal Supplying cOmmunity Search). Through genome-scale constraint-based metabolic modeling, misosoup enables the systematic identification of communities that support microbial growth in environments where individual species fail to survive alone. We validate misosoup against experimentally verified minimal communities, demonstrating its ability to predict known cooperative interactions, cocultures, and consortia with biotechnological potential. We further illustrate the use of misosoup to investigate broad microbial ecology questions by applying it to a set of 60 marine microbes, revealing pervasive cross-feeding-driven niche expansion and showing how the detailed outputs provided by misosoup facilitate research on topics such as the identification of functional groups. In summary, misosoup provides a powerful tool for microbial ecology and community design, with potential applications in both research and biotechnological innovation.
bioinformatics2026-05-25v2Read-Consistent Minimum Unique Substrings: A Parameter-Free, Linear-Time Framework for Genomic Sequence Representation
Adu, A. F.; Menkah, E. S.; Amoako-Yirenkyi, P.; Pandam Salifu, S.Abstract
Fixed-length k-mers have been the standard unit of genomic sequence representation for over two decades. However, they impose a uniform resolution on genomes whose complexity varies across loci. We introduce Minimum Unique Substrings (MUSs), variable-length sequence units defined by the local uniqueness structure of the genome rather than predefined parameters. We first extend MUS theory from single contiguous strings to fragmented sequencing reads by formalizing a definition of uniqueness that is consistent with these reads. Next, we present a linear-time extraction algorithm that runs in O(n) time using the generalized suffix tree. In this context, we introduce outpost nodes, topological anchors within the suffix tree that accurately localize MUS boundaries in fragmented sequencing reads. Finally, we empirically characterize the distributions of MUS lengths in E. coli K-12 and human chromosome 11. Our results demonstrate that MUS lengths naturally mirror genomic architectural complexity without the need for user-defined parameters. Notably, the MUS framework achieves 100% unique positional coverage with a mean length of only 36.08 bp. In contrast, fixed-length k=61 coverage reaches only 69.4%, despite being 1.69 times the MUS average. We show that increasing k from 21 to 61 triples the unique k-mer count from 2.35M to 6.86M. This k-paradox occurs because repetitive sequences are fragmented into spuriously unique tokens without improving true genomic resolution. MUSs escape this artifact entirely by adapting dynamically to local sequence complexity. These results establish MUSs as a biologically grounded, computationally tractable foundation for parameter-free genome assembly, repeat characterization, and alignment-free genomics.
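To make the idea of a minimum unique substring concrete, here is a brute-force sketch on a single short string. It assumes the usual definition (a substring that occurs exactly once while every proper substring of it occurs more than once), and it is deliberately naive: it does not implement the read-consistent definition, the generalized suffix tree, or the linear-time algorithm described above. The function names are hypothetical.

```python
def count_occurrences(text, pattern):
    """Count occurrences of `pattern` in `text`, including overlapping ones."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def minimal_unique_substrings(text):
    """Return (start, substring) for every minimal unique substring.

    For each start position, take the shortest substring that occurs once.
    Its prefix (one character shorter) is repeated by construction, so it is
    minimal exactly when its suffix (first character dropped) is also repeated.
    """
    result = []
    n = len(text)
    for i in range(n):
        for length in range(1, n - i + 1):
            candidate = text[i:i + length]
            if count_occurrences(text, candidate) == 1:
                suffix = candidate[1:]
                if not suffix or count_occurrences(text, suffix) > 1:
                    result.append((i, candidate))
                break  # longer substrings from this start cannot be minimal
    return result


mus = minimal_unique_substrings("banana")
```

On "banana" this yields "b" at position 0 and "nan" at position 2: the lengths differ because the uniqueness structure differs locally, which is the property that a fixed k cannot capture. This naive version is far slower than linear time and is meant only to show the definition at work.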
bioinformatics2026-05-25v2HiCPotts: An R/Bioconductor package to identify significant interactions in chromosome conformation capture data and model sources of biases.
Osuntoki, I. G.; Harrison, A. P.; Dai, H.; Bao, Y.; Zabet, N. R.Abstract
Motivation: Chromosome Conformation Capture methods, including Hi-C, micro-C or Capture-C, are used to map chromatin interactions genome-wide. Most of the existing computational methods do not account for sources of biases (such as DNA accessibility, GC content or TE content) in the data. Results: We previously developed ZipHiC, a Bayesian method based on the hidden Markov random field (HMRF) model and Approximate Bayesian Computation (ABC), that uses a zero-inflated Poisson distribution to model the noise, signal and false signal of the data, and showed that this approach was able to detect biases from DNA accessibility, GC content and TE content in both Hi-C and micro-C data. Here, we present HiCPotts, another Bayesian method based on the HMRF model and ABC that instead uses a zero-inflated Negative Binomial distribution to model the noise and signal of the data. We systematically show that HiCPotts reduces false positives and increases recovery of true interactions compared to ZipHiC, and also compared to other methods such as FastHiC, Juicer and HiCExplorer. Most importantly, we provide an R/Bioconductor package that allows modelling the noise, signal and false signal using various distributions such as the zero-inflated Negative Binomial (ZINB) and the zero-inflated Poisson distribution (ZIP). Availability: https://bioconductor.org/packages/HiCPotts/
bioinformatics2026-05-25v1MSLipidMapper: a pathway-centered lipidome analysis environment linking lipid class, acyl-chain subsets, and multi-omics data
Oka, T.; Nishida, K.; Harayama, T.; Tsugawa, H.Abstract
Lipids exhibit extensive structural diversity arising from variation in lipid classes, subclasses, and acyl-chain compositions, making systematic interpretation of lipidomics data challenging. Although untargeted lipidomics enables the quantification of hundreds to thousands of lipid molecular species, downstream analyses often treat pathway-level summaries, molecular-species visualization, structural subsetting, and multi-omics interpretation as separate steps. Here, we present MSLipidMapper, an R/Shiny-based lipidomics data exploration environment for pathway-centered and structure-aware analysis of annotated lipidomics datasets. MSLipidMapper reconstructs annotated lipid peak tables as Bioconductor SummarizedExperiment objects, thereby organizing quantitative lipid abundance values, sample metadata, lipid subclass annotations, and parsed acyl-chain features within a unified data structure. Lipid molecular species are summarized on static, curated lipid metabolic pathway maps at the subclass level while retaining direct links to the underlying molecular species and acyl-chain annotations. This design enables users to inspect molecular-species patterns underlying each pathway node, define lipid subsets based on structural features such as specific acyl chains, and re-project these subsets onto the same pathway context. Gene or protein expression data can also be overlaid on pathway-associated reactions to support multi-layer interpretation of lipid metabolism. The program is showcased using publicly available aging lipidome datasets of mice, illustrating how subclass-level pathway summaries can be connected to molecular-species heatmaps, acyl-chain-defined subsets, and transcriptome or proteome information.
bioinformatics2026-05-25v1Cell-type-specific transposable element transcription tracks symbiosis and calcification programs in the reef-building coral Acropora hemprichii
Zhong, H.; Konciute, M. K.; Hu, J.; Menzies, J.; Cui, G.; Aranda, M.Abstract
Transposable elements (TEs) are pervasive components of eukaryotic genomes and major drivers of genome evolution, yet their contribution to cell-type-specific regulatory landscapes remains poorly understood, particularly in non-model marine invertebrates. Here, we integrated single-cell RNA sequencing with pseudo-aligned TE expression profiling to examine how TE transcription relates to cell type identity in the reef-building coral Acropora hemprichii. We constructed a cell atlas comprising 4,716 cells across eight major cell types. Notably, TE expression alone was sufficient to accurately resolve all major cell types, indicating that cell-type-specific transcriptional states are robustly reflected in TE activity patterns. We identified 9,759 expressed TEs, of which 333 exhibited strong cell-type-specific activity. These differentially expressed TE features were associated with nearby expressed genes and transcription factor loci, suggesting a relationship between cell-type-specific TE activity and local gene regulatory programs. Genes associated with cell-type-specific TEs were enriched for core coral physiological processes, including calcification, metabolite transport, and symbiosis-related functions. Together, these findings indicate that TE transcription is structured along coral cell-type identity and physiological specialization. Our study provides a single-cell-resolved framework for investigating TE-gene relationships in early-diverging metazoans and a community resource for future functional interrogation in reef-building corals.
bioinformatics2026-05-25v1SpatialClaw: A Memory-Augmented Autonomous Ecosystem for Spatial Omics Analysis
Du, G.; Lan, O.; Wei, X.; Wu, Y.; Meng, G.; Wu, J.; Li, Z.; Li, X.; Shang, X.Abstract
While the expansion of spatial omics has revolutionized our ability to dissect tissue architecture, the accumulation of incompatible computational methods has heavily fragmented end-to-end analysis, rendering complex workflows irreproducible. Generic conversational agents lack the domain-specific precision necessary to navigate the intricate biological pipelines. To overcome this, we present SpatialClaw, a memory-augmented autonomous ecosystem to unify spatial omics analysis under a single natural-language interaction. SpatialClaw integrates 30 specialized skills, spanning raw data preprocessing, spatial domain identification, deconvolution, spatially variable gene detection, cell-cell communication analysis, multi-sample and cross-modality integration. Distinct from existing agents, SpatialClaw introduces a graph-based persistent memory architecture that stores dataset metadata, analysis lineage, biological insights, and user preferences as versioned nodes and edges across three hierarchical layers (Session, Episodic, and Semantic), governed by a deterministic promotion policy. A Memory-Augmented Reasoning (MAR) Operator bridges the memory store and the main agent, synthesizing retrieved experiences into task-specific guidance for each query. In rigorous benchmarking spanning three memory-sensitive scenarios across 10 spatial-omics skills, SpatialClaw outperforms both a standard large language model and the memory-only configuration. Furthermore, we demonstrate its robust biological utility by dissecting the complex tumor microenvironment of a 15-section human triple-negative breast cancer cohort. In merely three conversational turns and with zero direct scripting, SpatialClaw executes a comprehensive end-to-end workflow, yielding standardized output bundles. 
Ultimately, by synergizing comprehensive analytical tools with structured persistent memory, SpatialClaw elevates spatial omics from disjointed computational stitching to a fully traceable, reproducible, and self-improving discovery ecosystem.
bioinformatics2026-05-25v1E-InfertilityTest: An Explainable AI Framework for Male Infertility Assessment
Das, G.; Ghosh, B.; Ghosh, Z.Abstract
Male infertility has emerged as a significant concern in modern society, with genetic defects among its major underlying causes. These defects negatively impact sperm motility and morphology, leading to conditions such as Asthenozoospermia (reduced sperm motility), Teratozoospermia (abnormal sperm morphology) and sometimes Asthenoteratozoospermia (both motility and morphology defects). Assisted reproductive technologies (ART), such as in-vitro fertilization (IVF), offer a potential solution for such cases but with a low success rate. Classical semen analysis provides only a phenotypic snapshot without revealing the fertilizing potential of the sperm. Hence, to screen the functional sperm population and to gain deeper insight into the reasons underlying the aberrant sperm population, it is important to study their genetic profile. In this work, we performed a meta-analysis comparing transcriptomic data of infertile sperm from Asthenozoospermia and Teratozoospermia patients with those of fertile sperm from normal individuals. We then screened a signature gene set, which was used to develop a prediction model named Explainable Infertility Test (E-InfertilityTest) to classify fertile versus infertile sperm at the preliminary level. For each prediction, the model also reports the set of genes that play a dominant role in that prediction. It thus provides a patient-specific dominant gene expression profile responsible for the aberration. This work warrants future validation experiments to substantiate the model performance in a clinical setting. Users can access E-InfertilityTest as a standalone version on GitHub. GitHub link: https://github.com/zglabDIB/einfertility.git
bioinformatics2026-05-25v1OAC-PCA: orthogonal adjustment of confounding effects in principal component analysis for metabolomics data mining
Kurata, M.; Yamamoto, H.; Tsugawa, H.Abstract
Principal component analysis (PCA) is widely used in mass spectrometry-based metabolomics for exploratory data mining. Statistical testing of loading values can extract metabolite features associated with score patterns, but this approach requires principal components (PCs) to remain orthogonal while loadings are defined as correlation coefficients between PC scores and variables. Adjustment for Confounding PCA (AC-PCA) was previously developed to explore biologically meaningful components from data matrices affected by biological and technical confounders. However, AC-PCA does not simultaneously ensure PC orthogonality and a correlation-coefficient definition of loadings, limiting the statistical interpretation of its loadings. Here, we reformulated AC-PCA as Orthogonal Adjustment for Confounding effects in PCA (OAC-PCA). In OAC-PCA, PCs remain orthogonal, and loadings retain this correlation-coefficient interpretation. These properties enable statistical testing of metabolite associations while accounting for confounding effects.
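The two properties the abstract asks for (orthogonal PC scores, and loadings defined as correlation coefficients) can be illustrated on ordinary PCA. The sketch below is not OAC-PCA and performs no confounder adjustment; the function names are hypothetical and NumPy is assumed to be available.

```python
import numpy as np

def pca_correlation_loadings(X):
    """Plain PCA: return scores and loadings defined as corr(score_k, variable_j)."""
    Xc = X - X.mean(axis=0)                       # center each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U * s                                # columns are mutually orthogonal
    n = X.shape[0]
    cov = scores.T @ Xc / (n - 1)                 # covariance of scores with variables
    score_sd = scores.std(axis=0, ddof=1)
    var_sd = Xc.std(axis=0, ddof=1)
    loadings = cov / np.outer(score_sd, var_sd)   # Pearson correlations
    return scores, loadings

def loading_t_statistic(r, n):
    """t statistic for H0: correlation = 0, with n - 2 degrees of freedom."""
    return r * np.sqrt((n - 2) / (1.0 - r ** 2))

# Toy data standing in for a samples-by-metabolites matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
scores, loadings = pca_correlation_loadings(X)
t_pc1 = loading_t_statistic(loadings[0], n=X.shape[0])
```

Because each loading is a correlation coefficient, the usual t test for a correlation applies to it directly, which is the kind of statistical testing the abstract refers to.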
bioinformatics2026-05-25v1CardioSeg: An interactive platform for integrated spatial transcriptomics data and nuclear morphological analysis of mouse heart tissue
Kancherla, S. K.; Melleby, A. O.; Aronsen, J. M.Abstract
Motivation: Spatial transcriptomics enables gene expression profiling within its spatial context in intact tissue sections. Existing workflows for segmentation, spatial annotation, and morphological analysis are often code-heavy and poorly integrated. This limits the joint analysis of spatial gene expression at single-nucleus resolution and the corresponding nuclear morphology. Results: We present CardioSeg, a Python-based graphical interface for nuclei segmentation, spatial annotation, and interactive analysis of myocardial histology. CardioSeg integrates multi-threshold Cellpose-based segmentation with nuclei-level transcriptomic mapping and interactive visualisation. CardioSeg achieved robust segmentation performance across heterogeneous imaging conditions, with union-based inference outperforming the individual parameter configurations. For cell-type annotation, CardioSeg achieved an accuracy of 0.88 and a balanced accuracy of 0.85 against reference labels, while also resolving spatial heterogeneity not captured by spot-based approaches. Application to pressure-overloaded cardiac tissue revealed uncharacterized intra-ventricular variations in nuclear morphology, indicating the potential of CardioSeg to couple disease-specific nuclear morphology with the associated transcriptomics. Availability and Implementation: Source code is available at GitHub under the CC BY 4.0 license (https://github.com/SrijanKancherla/CardioSeg). A versioned release was archived in Zenodo (DOI: 10.5281/zenodo.20177171). Keywords: Spatial transcriptomics, nuclei segmentation, cardiac histology, single-cell annotation, bioimage analysis, interactive visualization
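The abstract reports both accuracy and balanced accuracy. As a reminder of why the two can differ when cell types are imbalanced, here is a minimal sketch; the labels are made up for illustration and are not CardioSeg output.

```python
from collections import defaultdict

def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so rare classes weigh as much as common ones."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += (t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

# Hypothetical labels: 8 nuclei of a common type, 2 of a rare type.
y_true = ["common"] * 8 + ["rare"] * 2
y_pred = ["common"] * 8 + ["common", "rare"]
acc = accuracy(y_true, y_pred)            # 9 of 10 correct
bal = balanced_accuracy(y_true, y_pred)   # mean of recall 1.0 and recall 0.5
```

A balanced accuracy below the plain accuracy, as in the reported 0.85 versus 0.88, is what one would expect if the less frequent classes are recovered somewhat less well than the frequent ones.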
bioinformatics2026-05-25v1Highly Constrained Kinetic Models for Single-Cell Gene Expression Analysis
Cho, H. J.; Bohrer, C. H.; Trzaskoma, P.; Kim, J. M.; Pekowska, A.; Casellas, R. C.; Patro, R.; Chow, C. C.; Larson, D. R.Abstract
Advances in single-cell RNA sequencing (scRNA-seq) and high-resolution imaging techniques, such as single-molecule tracking (SMT) of RNA and transcription factors, allow researchers to quantitatively explore dynamics and variation, but these data types have never been integrated into a single coherent model. In this study, we propose a kinetic model that takes in multiple data types, including steady-state and time-resolved datasets, to simulate and fit stochastic models of gene transcription to experimental data. We find that 3-state models provide an essential improvement over the widely used 2-state model for most genes and have the property of kinetic proofreading, which we argue is advantageous in the cellular context. We further identify two dimensionless quantities derived from the rate equations which are broadly conserved across genes. Finally, we extend this model to scRNA-seq datasets to infer kinetic rates under defined perturbations and reveal biochemical insight into the mechanism of action of transcription factors.
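For readers unfamiliar with the 2-state baseline mentioned above, the following is a minimal Gillespie simulation of the standard telegraph model (a promoter that switches ON and OFF, transcription only while ON, first-order mRNA decay). It is a generic textbook sketch, not the authors' model or fitting code, and the rate names and values are assumptions chosen for illustration.

```python
import random

def simulate_telegraph(k_on, k_off, k_syn, k_deg, t_end, seed=0):
    """Gillespie simulation of the 2-state model; returns the time-averaged mRNA count."""
    rng = random.Random(seed)
    t, gene_on, mrna = 0.0, False, 0
    area = 0.0                                   # integral of mRNA count over time
    while True:
        rates = [
            k_off if gene_on else k_on,          # promoter switching
            k_syn if gene_on else 0.0,           # transcription, only when ON
            k_deg * mrna,                        # first-order degradation
        ]
        total = sum(rates)
        dt = rng.expovariate(total)              # waiting time to the next event
        if t + dt >= t_end:
            area += mrna * (t_end - t)
            break
        area += mrna * dt
        t += dt
        u = rng.random() * total                 # pick an event in proportion to its rate
        if u < rates[0]:
            gene_on = not gene_on
        elif u < rates[0] + rates[1] or mrna == 0:
            mrna += 1
        else:
            mrna -= 1
    return area / t_end

def telegraph_mean(k_on, k_off, k_syn, k_deg):
    """Steady-state mean mRNA count of the 2-state model."""
    return k_syn * k_on / (k_on + k_off) / k_deg

simulated = simulate_telegraph(1.0, 1.0, 20.0, 1.0, t_end=2000.0, seed=1)
analytic = telegraph_mean(1.0, 1.0, 20.0, 1.0)
```

A 3-state variant would add one more promoter state and its switching rates to the `rates` list; the event-selection logic stays the same.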
bioinformatics2026-05-25v1Decoding Condition-Specific Cellular Crosstalk in Spatial Omics via Bilinear Edge Classification
Karin, J.; Friedman, R.; Nitzan, M.Abstract
Tissues are multicellular structured communities whose function emerges from a combination of individual cellular characteristics along with their corresponding spatial configuration, affecting their interactions and response patterns. During processes such as disease progression or aging, tissues can undergo structural reorganization, including changes in co-localization of different cell types, assembly or destruction of functional niches, and disruption of intercellular communication axes. Such changes can manifest primarily in the spatial reorganization of cells rather than in the transcriptional states of individual cells. While computational tools for spatial transcriptomics have made significant progress in characterizing tissue architecture, most approaches for characterizing changes in tissue states across biological conditions operate at the level of individual cells or rely on discrete cell type labels, thus limiting the ability to detect coordinated transcriptional changes between neighboring cells that distinguish one condition from another. We present Casei, a bilinear classification framework operating on cellular proximity graphs, which directly models condition-specific cell-cell interactions in spatial omics data by focusing on interactions (edges), rather than cells (nodes), as the fundamental unit of biological inference. To capture such condition-specific signals, we leverage a model whose inductive bias aligns with cellular interactions through coordinated gene-gene relationships of neighboring cells. Casei enables the discovery of condition-associated multicellular interactions and spatial expression programs, and characterizes the loss of multicellular function and structure. 
Applied to mammalian liver fibrosis, atherosclerosis, and brain aging, Casei reveals biologically meaningful spatial reorganization, including the shift from endothelial- to macrophage-dominated networks in atherosclerotic plaques, disruption of hepatocyte zonation in fibrosis, and oligodendrocyte-microglia crosstalk in aging white matter.
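To make the idea of a bilinear classifier on edges concrete, here is a minimal sketch in which the score of an edge between neighboring cells i and j is sigmoid(x_i^T W x_j + b). This is a generic bilinear form, not Casei's implementation; the function name, the weight matrix `W`, and the toy expression vectors are all assumptions for illustration.

```python
import math

def bilinear_edge_score(x_i, x_j, W, b=0.0):
    """Score an edge (i, j) as sigmoid(x_i^T W x_j + b).

    W[g][h] weights the co-occurrence of gene g in cell i with gene h in its neighbor j.
    """
    z = b + sum(
        x_i[g] * W[g][h] * x_j[h]
        for g in range(len(x_i))
        for h in range(len(x_j))
    )
    return 1.0 / (1.0 + math.exp(-z))

# Two toy cells with two genes each, and a single nonzero gene-gene weight.
cell_a = [1.0, 0.0]
cell_b = [0.0, 2.0]
W = [[0.0, 1.5],
     [0.0, 0.0]]
forward = bilinear_edge_score(cell_a, cell_b, W)   # gene 0 in a paired with gene 1 in b
reverse = bilinear_edge_score(cell_b, cell_a, W)   # the same pair in the other direction
```

Because the prediction is made per edge, each entry of `W` can be read as a gene-gene relationship between neighboring cells, which is what makes the edge, rather than the cell, the unit of inference.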
bioinformatics2026-05-24v2BioGraphX-RNA: A Universal Physicochemical Graph Encoding for Interpretable RNA Subcellular Localization Prediction
Saeed, A.; Abbas, W.Abstract
RNA subcellular localization is a critical determinant of cellular function. However, current computational approaches often operate as "black boxes," overlooking the complex interplay among sequence, structure, and physicochemical interactions that govern RNA localization. Building upon BioGraphX, originally developed for proteins, we introduce BioGraphX-RNA, a universal physicochemical graph-encoding framework that provides structure-informed encoding by translating primary nucleotide sequences into multi-scale interaction graphs using explicit biophysical rules. When combined with frozen RiNALMo embeddings through an interpretable gated fusion layer, BioGraphX-RNA outperforms DeepLocRNA and, uniquely, quantifies the relative contributions of sequence and structure for each RNA type, achieving macro-AUROC improvements of 0.0172 for mRNA, 0.0545 for miRNA, and 0.0422 for lncRNA on human datasets. In a blind cross-species prediction task on mouse data, the model demonstrates promising zero-shot transfer performance, suggesting that biophysical localization cues are evolutionarily conserved. Notably, the BioGraphX graph-only model outperforms RNAfold-derived secondary-structure graphs for miRNA (macro-AUROC 0.9482 vs. 0.8787), validating the structural proxy hypothesis under the most stringent possible conditions. Explainability analyses further reveal RNA-type-specific structural dependencies. In particular, miRNA exhibits a near-equilibrium balance between sequence and structure. SHAP-based interpretations provide mechanistic insights, identifying patterned GC content as a potential nuclear retention signal and an anti-structure profile as indicative of exosome-mediated targeting. These advances are achieved with only 2.05 million trainable parameters, aligning with Green AI principles. 
BioGraphX-RNA therefore demonstrates that explicitly integrating biophysical constraints into graph-based encodings enables accurate, generalizable, and interpretable predictions, advancing structure-aware RNA biology and laying a foundation for precision medicine.
bioinformatics2026-05-24v2Interpreting Omics Data Analysis with Large Language Models for Disease Target and Drug Discovery
XU, Z.; Chen, W.; Ren, W.; Xu, T.; Amaechin, S.; Khan, R.; Chen, Y.; Province, M.; Payne, P.; Li, F.Abstract
In biomedical scientific discovery, synthesizing prior knowledge from the literature is an essential component of interpreting numerical omics data analyses for disease target identification and drug discovery. Large language models (LLMs) alone can rapidly retrieve disease mechanisms from biomedical text, but text-only outputs are general and unreliable for target and drug prioritization without cohort-specific quantitative evidence. Herein, we propose a provenance-aware Text-to-Target framework that couples schema-constrained multi-model LLM retrieval with numeric omics data analysis. The key design is a modality-aware fusion step: candidates are partitioned into overlap-supported anchors, retrieval-only hidden hubs, and network-emergent novelty nodes, then propagated into staged hypothesis and strategy generation under topology constraints. We evaluate the model in Alzheimer's disease (AD) and pancreatic ductal adenocarcinoma (PDAC). In PDAC, the workflow produced a balanced 75-gene candidate universe and a 23-strategy portfolio, with significant DepMap support at both target level and strategy level. In AD, stricter candidate controls yielded a compact 34-gene universe and 14 strategies; under an expanded CRISPRbrain registry, both target-level axes were significant, with strong strategy-level enrichment. Across both diseases, final strategies preserved full provenance closure to the candidate pool, enabling end-to-end auditability from retrieval artifacts to validation outputs. These results support a transferable discovery architecture in which omics evidence constrains biological activity, LLM retrieval expands mechanistic search space, and network-aware fusion preserves interpretability. The framework provides a reproducible basis for dual-disease target prioritization and motivates continuous literature-mechanism concordance with agentic evidence-refresh loops.
bioinformatics2026-05-23v2Time-Resolved Phosphoproteomics-Guided BFS Beam Search Reveals Cell-Type-Specific EGFR Signaling Architectures and SHP2 Inhibitor-Induced Pathway Rewiring
Lee, H.; Lee, G.Abstract
Adaptive resistance to kinase- and phosphatase-targeted therapies is frequently driven by pharmacological rewiring of intracellular signaling networks, yet systematic computational methods for quantifying cross-condition pathway changes from phosphoproteomic data remain limited. We present an algorithmic framework for reconstructing cell-type-specific signaling pathways from time-resolved phosphoproteomic data using Breadth-First Search (BFS) combined with interaction-weight-guided Beam Search over the STRING protein-protein interaction database. The framework integrates three components: data-adaptive Median Absolute Deviation (MAD)-based binary-state assignment; BFS Beam Search traversal anchored to experimentally supported active nodes at zone boundaries and terminals, with STRING-inferred bridge proteins permitted as intermediate connectors; and a post-enumeration path cleaning pipeline that produces biologically interpretable, acyclic signaling routes, with edge-level validation against Human Protein Atlas-based cell-line expression data. Real-time access to the STRING REST API (v12.5) enables network construction without local database installation. Benchmarked across five published phosphoproteomic datasets spanning three cell types (HeLa, MDA-MB-468, EGFR Flp-In HEK293T), the framework captures cell-type-specific EGFR signaling architectures and quantifies drug-induced pathway rewiring. Applied to MDA-MB-468 cells under three pharmacological conditions, SHP2 inhibition abolished PTPN11-mediated pathways and shifted first-hop effector distribution toward ERBB3 (21.5% to 25.2% of paths) and PIK3CA engagement (9.2% to 14.3% of paths), while SHP2 inhibitor washout revealed partial PTPN11 recovery with ERBB2 re-emerging as the dominant first-hop effector (30.3% of paths). 
This framework provides a systematic, reproducible approach for transforming time-resolved phosphoproteomic measurements into mechanistically interpretable signaling hypotheses, with direct applicability to drug resistance modeling and combination therapy design.
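The two algorithmic ingredients named in this abstract, MAD-based binary-state assignment and breadth-first beam search over a weighted interaction graph, can be sketched generically as follows. The thresholding rule (deviation greater than k times the MAD), the path score (product of edge weights), and the toy graph are assumptions made for illustration; they are not taken from the authors' code and the weights are not STRING scores.

```python
from statistics import median

def mad_active(values, k=2.0):
    """Flag values whose deviation from the median exceeds k * MAD (assumed rule)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    return [abs(v - med) > k * mad for v in values]

def beam_paths(graph, source, targets, beam_width=2, max_depth=4):
    """Layer-by-layer (BFS) expansion keeping the beam_width best partial paths.

    graph: {node: {neighbor: weight in (0, 1]}}; a path's score is the product of weights.
    """
    beam = [(1.0, [source])]
    found = []
    for _ in range(max_depth):
        candidates = []
        for score, path in beam:
            for nxt, w in graph.get(path[-1], {}).items():
                if nxt in path:                  # keep paths acyclic
                    continue
                candidates.append((score * w, path + [nxt]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = []
        for score, path in candidates[:beam_width]:
            if path[-1] in targets:
                found.append((score, path))      # complete path, stop extending it
            else:
                beam.append((score, path))
        if not beam:
            break
    return sorted(found, key=lambda c: c[0], reverse=True)

# Toy graph with generic node names and made-up weights.
toy_graph = {
    "R": {"A": 0.9, "B": 0.8, "C": 0.3},
    "A": {"T": 0.5},
    "B": {"T": 0.9},
    "C": {"T": 0.99},
}
paths = beam_paths(toy_graph, "R", {"T"}, beam_width=2)
flags = mad_active([1.0, 1.1, 0.9, 1.0, 5.0])
```

In the toy graph the branch through `C` is pruned at the first layer because its first edge is weak, even though its second edge is strong. That is the usual trade-off of beam search: it bounds the number of enumerated paths at the cost of possibly missing some routes.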
bioinformatics2026-05-23v2Reproducible transcriptional modules define glioblastoma ecosystems across independent cohorts.
Seo, H.Abstract
Glioblastoma (GBM) comprises a complex ecosystem of malignant, immune, vascular and neural transcriptional states. However, it remains difficult to determine which gene expression programmes are reproducibly recovered across independent cohorts and profiling platforms, because programme-level analyses are sensitive to cohort composition, technical context and factorization rank. Here, we analyzed three public GBM datasets--GLASS and IVYGAP bulk RNA-seq cohorts and the HEILAND Visium spatial transcriptomics cohort--to examine whether deconvolution-derived programmes could be organized into shared cross-cohort modules. Integrating 279 programmes inferred by consensus non-negative matrix factorization identified eight transcriptional communities, including myeloid immune microenvironment, neuronal and synaptic, oligodendrocyte and myelin, developmental, tumour-associated mesenchymal or hypoxic, proliferative, and ciliated or ependymal-like modules, as well as one cohort-restricted community. Community activity showed coherent associations with independent annotations: the myeloid community correlated with ESTIMATE immune score and inversely with tumour purity; the oligodendrocyte and myelin community was reduced in recurrent tumours; and ciliated or ependymal-like and neuronal communities showed modest exploratory associations with overall survival. Spatial projection onto Visium data provided qualitative support for the histological coherence of several modules, while also highlighting the limits of spot-level interpretation. Together, these results provide a proof-of-concept that cross-cohort integration can recover recurrent transcriptional structure across heterogeneous GBM datasets and offer an interpretable framework for comparing gene expression programmes while preserving cohort-specific signal and uncertainty in biological assignment.
bioinformatics2026-05-23v1Atlas-Level Single-Cell and Spatial Transcriptomics Data Integration via PRIME
Wu, X.; Wang, X.; Wang, J.; Wan, S.Abstract
Single-cell RNA sequencing (scRNA-seq) and spatial transcriptomics (ST) have enabled atlas-scale cellular cartography, with consortium efforts now assembling millions of cells across diverse tissues, donors, and technologies to build comprehensive references for cell identity and disease mechanism. Yet the scientific value of these atlases hinges on robust computational integration across heterogeneous data sources. Unlike pairwise batch correction, atlas-level integration must jointly reconcile heterogeneous and often hierarchically nested batch effects across many datasets whose cell-type compositions are highly imbalanced, all while preserving subtle biological variation and remaining computationally tractable at the scale of millions of cells. Existing approaches often prioritize either batch mixing or preservation of local biological structure, and most cannot natively accommodate spatial coordinates. Here we introduce PRIME (Projection-based Robust Integration via Manifold Embedding), an ensemble integration framework that combines random-projection-based consensus anchoring, graph-Laplacian correction, and optional spatial-neighborhood regularization. Across multiple random projections of the expression manifold, PRIME uses consensus voting to keep only cell pairs that are repeatedly matched, reducing false anchors caused by projection-specific distortions. For ST, PRIME couples this expression-based anchor graph with a coordinate-derived spatial neighborhood graph in a unified graph-Laplacian objective with closed-form solution, enabling simultaneous cross-batch alignment and local spatial coherence. Based on extensive benchmarking spanning diverse datasets, we show that PRIME consistently outperforms state-of-the-art methods in both batch correction and biological conservation across scRNA-seq and ST integration scenarios and downstream tasks including trajectory inference, spatial-domain preservation, and perturbation-response analysis. 
In particular, when integrating a human hematopoiesis benchmark spanning eight donors and approximately 33,000 cells, PRIME preserves biologically coherent developmental trajectories. It also maintains cortical laminar architecture across dorsolateral prefrontal cortex sections in an ST dataset and recovers known drug-target relationships in a perturbation atlas of more than 1 million cells while suppressing batch-associated confounders. Together, these results establish PRIME as a versatile and scalable framework for atlas-level integration of scRNA-seq and ST across diverse biological applications.
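A graph-Laplacian objective with a closed-form solution, as mentioned in the PRIME abstract, can be illustrated with the standard Laplacian-smoothing problem: minimize ||Z - X||_F^2 + lam * tr(Z^T L Z), whose minimizer is Z = (I + lam L)^{-1} X. The sketch below shows only this generic form. Whether PRIME's objective has exactly this shape is an assumption, and the function name, edge list, and `lam` value are hypothetical. NumPy is assumed to be available.

```python
import numpy as np

def laplacian_smooth(X, edges, lam=1.0):
    """Closed-form minimizer of ||Z - X||_F^2 + lam * tr(Z^T L Z).

    edges: iterable of (i, j, weight) pairs, e.g. cross-batch anchors or spatial neighbors.
    """
    n = X.shape[0]
    A = np.zeros((n, n))
    for i, j, w in edges:
        A[i, j] += w
        A[j, i] += w
    L = np.diag(A.sum(axis=1)) - A               # unnormalized graph Laplacian
    # Setting the gradient 2(Z - X) + 2 lam L Z to zero gives (I + lam L) Z = X.
    return np.linalg.solve(np.eye(n) + lam * L, X)

# Three toy cells; cells 0 and 1 are linked by a single anchor edge.
X = np.array([[0.0, 0.0],
              [2.0, 0.0],
              [10.0, 10.0]])
Z = laplacian_smooth(X, edges=[(0, 1, 1.0)], lam=10.0)
```

Linked cells are pulled toward each other while unlinked cells stay where they are. Adding spatial-neighbor edges to the same edge list would regularize toward local spatial coherence within the same linear solve.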
bioinformatics2026-05-23v1Widespread use of invalid statistical tests in biomedical machine learning
Zeng, T.; Li, H.; Zhang, S.; Tan, Y. Q.; Tian, F.; Orban, C.; An, L.; Che, W.; Cheng, J.; Chong, J. S. X.; Dehestani, N.; Dong, Z.; Li, X.; Li, Z.; Lim, M. J. R.; Lin, Y.; Ling, Q.; Ling, Z.; Low, X. Z.; Mansour L., S.; Ng, K. K.; Nguyen, T. T.; Ooi, L. Q. R.; Pande, S.; Qian, X.; Ruan, J.; Wang, Z.; Xie, Y.; Zhang, C.; Zhang, Y.; Patil, K.; Parkes, L.; Dhamala, E.; Chopra, S.; Zalesky, A.; Holmes, A.; Eickhoff, S.; Zhou, J. H.; Renaud, O.; Dosenbach, N.; Kording, K. P.; Bzdok, D.; Nichols, T.; Yeo, B. T. T.Abstract
Machine learning is accelerating biomedical research. Cross-validation is widely used to compare predictive performance -- not only to benchmark algorithms, but also to inform scientific applications, such as ranking biomarkers. However, prediction performance estimates across cross-validation folds are not independent. Standard tests for comparing prediction performance (e.g., paired t-test) assume independence and can therefore inflate false positive rates. In a PRISMA-guided meta-analysis of 210 studies (impact factor ≥15, 1 June 2020 - 1 June 2025), we find that 97% ignored fold dependence when comparing prediction performance. This problem is ubiquitous across scientific fields and unaffected by impact factor, rigor-promoting policies, or open science practices. Simulations across 420 scenarios spanning four diverse datasets show that ignoring fold dependence leads to invalid false positive control in most settings. Repeated cross-validation further compounds this problem, with false positive rates rising toward 100% as the number of repetitions grows. Existing fold-dependence-aware tests rely on strong assumptions because the variance of fold-level statistics and the between-fold correlation cannot be disentangled under standard cross-validation. We therefore propose the SHARP (Split-HAlf RePeated) test, a simple modification to standard cross-validation that enables direct estimation of variance and correlation. Benchmarked against 12 tests, SHARP provides the best overall balance of false-positive control, statistical power, and confidence-interval calibration across simulation schemes. We conclude by providing best practices and reporting guidelines for valid model comparison inference in biomedical machine learning and beyond.
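To show how ignoring fold dependence inflates a test statistic, the sketch below compares the naive paired t statistic on per-fold score differences with the corrected resampled t statistic of Nadeau and Bengio, one example of an existing fold-dependence-aware test. It is not the SHARP test, whose procedure is not described here, and the per-fold differences and sample sizes are made-up illustrative numbers.

```python
import math
from statistics import mean, variance

def naive_paired_t(diffs):
    """Paired t statistic that treats the k fold-level differences as independent."""
    k = len(diffs)
    return mean(diffs) / math.sqrt(variance(diffs) / k)

def corrected_resampled_t(diffs, n_train, n_test):
    """Nadeau-Bengio correction: inflate the variance term by n_test / n_train."""
    k = len(diffs)
    return mean(diffs) / math.sqrt((1.0 / k + n_test / n_train) * variance(diffs))

# Hypothetical accuracy differences (model A minus model B) over 5 folds of 100 samples.
diffs = [0.02, 0.03, 0.01, 0.04, 0.02]
t_naive = naive_paired_t(diffs)
t_corrected = corrected_resampled_t(diffs, n_train=80, n_test=20)
```

With 5 folds the naive statistic is larger by a factor of sqrt((1/5 + 1/4) / (1/5)) = 1.5, so a difference that looks significant under the naive test may not be under the corrected one. The correction rests on an assumed form for the between-fold correlation, which is the kind of strong assumption the abstract points out.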
bioinformatics2026-05-22v2IDEAL-Age: an interpretable deep learning framework for single-cell resolution profiling of immunological aging
Xu, Y.; Luo, Z.; He, K.; Zhang, F.; Zhang, Y.; Wang, J.; Wen, H.; Li, Y.; Han, D.Abstract
Immunosenescence increases susceptibility to infection and reduces vaccine responsiveness, yet bulk transcriptomic clocks obscure the cellular heterogeneity underlying this process. Here, we present IDEAL-Age, an interpretable deep learning framework that operates directly on single-cell PBMC transcriptomes. Benchmarking against 31 methods across independent cohorts demonstrates superior predictive performance. The framework's interpretability uncovers linear and non-linear transcriptomic dynamics that reveal phase-specific physiological transitions, and identifies pro-youthful or pro-aging cellular contributions. Application to systemic lupus erythematosus (SLE) reveals accelerated immunological aging driven by interferon-associated monocyte shifts. IDEAL-Age establishes a high-resolution computational framework for deciphering systemic immune aging.
bioinformatics2026-05-22v2Large-Scale Assessment of Animal-to-Human Drug Translation Using Natural Language Processing
Doneva, S. E.; Ellendorff, T. R.; Schneider, G.; Held, L.; von Wyl, V.; Simpson, I.; Sick, B.; Ineichen, B. V.Abstract
Background: Large-scale estimates of animal-to-human drug translation and the study characteristics associated with successful translation remain limited. The expanding preclinical literature also challenges manual evidence synthesis. We developed a natural language processing (NLP) pipeline to structure and link preclinical and clinical evidence at scale. Methods: In this retrospective meta-research study, we analysed more than 500,000 neuroscience-related animal drug studies from PubMed and linked them to clinical trial and regulatory approval data. NLP methods extracted drug, disease, and experimental design characteristics from abstracts and full texts. Translation was defined as progression to completed phase III/IV trials or regulatory approval. Logistic regression assessed associations between preclinical study characteristics and successful translation. Findings: Among 291,624 drug entities identified in animal studies, 6.7% entered clinical development and 3.1% reached phase III/IV trials or regulatory approval. At the drug-disease level, 4.4% entered clinical development and 1.9% achieved translation. Restricting analyses to successfully linked ontology entities increased estimates to 11.3% and 4.1%, respectively. Male-only animal studies predominated, whereas reporting of randomisation, blinding, and sample size calculations remained limited. Testing across multiple species and reporting blinding were associated with higher odds of successful translation. Interpretation: Only a minority of interventions tested in animals progress to advanced clinical development or regulatory approval. Greater species diversity and blinding were associated with improved translational success. NLP-based evidence synthesis may support scalable evaluation of translational research and identification of potentially modifiable research practices.
bioinformatics2026-05-22v1A community machine learning challenge to predict the effects of gene perturbations on T cell differentiation for cancer immunotherapy
Zhang, J.; Schwartz, M. A.; Mutaher, M.; Olajide, O.; Pritykin, Y.; Ashenberg, O.; Hacohen, N.; Uhler, C.Abstract
Perturbations of genes with functional importance in T cells could be used to change the distribution of CD8 T cell states to enhance anti-tumor functions for cancer immunotherapies. We launched a worldwide computational challenge to predict the effects of gene perturbations and to devise objective functions for prioritizing gene perturbations that lead to desired T-cell state distributions. We supported the challenge by generating a single-cell Perturb-seq dataset profiling the effect of knocking out 73 individual expert-defined genes in T cells transferred into a mouse melanoma model. We compared the top algorithms developed by participants and found that performance was primarily determined by the prior data used for gene feature representation, with features derived from perturbational data proving most effective. Experimental validation of the top 61 genes nominated by the algorithms revealed that perturbation of Ndufv2 and Dimt1 reached the defined objective and biased T cell differentiation toward desired states.
bioinformatics2026-05-22v1