Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
Pathway redistribution across cellular states reveals a shared signaling backbone and context-dependent regulatory modules in RNA-binding protein networks
Osato, N.; Sato, K.
Abstract
Understanding how regulatory architectures are reorganized across cellular contexts remains a central challenge in functional genomics. Here, we integrate co-expression-derived candidate regulatory interactions with interpretable deep learning to generate gene-level contribution scores and introduce delta NES (normalized enrichment score difference) to quantify pathway redistribution between cellular states. Because gene expression reflects the combined effects of multiple regulatory inputs, contribution scores capture relative regulatory influence rather than transcriptional abundance itself. Applying this framework to neural progenitor cells and K562 leukemia cells, we identify systematic redistribution of functional modules across multiple RNA-binding proteins, including PKM, HNRNPK, and NELFE. Neural System- and Immune System-associated modules are differentially positioned along the delta NES spectrum, indicating context-dependent redistribution of regulatory influence rather than isolated pathway activation events. At the pathway level, Signal Transduction consistently forms a shared signaling backbone across proteins and cellular contexts, while modules related to neuronal functions, immune responses, and developmental processes exhibit context-dependent redistribution. Subpathway analysis further reveals convergence on receptor-mediated signaling processes, including FGFR/RTK-, IRS-, and MAPK-related pathways. These redistribution patterns are preserved under alternative DeepLIFT background settings despite polarity changes in contribution-expression correlations, indicating that pathway-level contrasts arise from stable rank-structure differences rather than background-dependent score artifacts. Together, our findings demonstrate that contribution score-based pathway ranking reveals a conserved signaling backbone alongside context-dependent functional modules, providing a framework for interpreting regulatory architecture beyond expression-centric analyses.
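The delta NES described above is a per-pathway difference of normalized enrichment scores between two cellular states. A minimal sketch of that arithmetic follows; the function names, the direction of subtraction (state A minus state B), and the NES values are illustrative assumptions, not taken from the paper.

```python
def delta_nes(nes_a, nes_b):
    """Return NES(state A) - NES(state B) for pathways scored in both states."""
    shared = nes_a.keys() & nes_b.keys()
    return {pathway: nes_a[pathway] - nes_b[pathway] for pathway in shared}


def rank_by_delta(delta):
    """Order pathways from most A-biased to most B-biased."""
    return sorted(delta.items(), key=lambda item: item[1], reverse=True)


# Hypothetical NES values for one RNA-binding protein in two cell types.
nes_npc = {"Signal Transduction": 2.1, "Neuronal System": 1.8, "Immune System": 0.4}
nes_k562 = {"Signal Transduction": 2.0, "Neuronal System": 0.3, "Immune System": 1.9}

delta = delta_nes(nes_npc, nes_k562)
ranked = rank_by_delta(delta)
for pathway, value in ranked:
    print(f"{pathway}: {value:+.2f}")
```

In this toy input, a pathway with a delta near zero sits in the middle of the spectrum, which is how a shared backbone would appear, while context-dependent modules fall toward either end.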
bioinformatics 2026-05-27 v12

Finding stable clusterings of single-cell RNA-seq data
Klebanoff, V. F.
Abstract
Run a UMI count matrix through a pipeline to obtain n cell clusters. Suppose that counts for an equal number of additional cells from the same experiment become available. Would including them change the result? Form the matrix containing both sets of counts, obtain n clusters, restrict this clustering to the initial cells and compare it with the initial clustering. If they are not consistent, conclude that the initial clustering is unstable. This is unrealistic, but reverse the perspective: given a clustering, process samples of half of the cells. If their clusters are consistent with those of all cells restricted to the samples, conclude that the clustering is stable. We use divisive hierarchical spectral clustering and define what may be a novel mapping of the dendrogram to nested clusterings. Counts are transformed to points in low-dimensional Euclidean space. Positive affinities are defined for points that are k-nearest neighbors. The affinity equals the inverse of the distance between points. Ng, Jordan, and Weiss' algorithm divides the points into two clusters. The normalized cut measures the clusters' separation. Recursion generates a dendrogram. Set the length of the branch between a node and its daughters to the normalized cut. Nodes' distances from the root define the mapping to nested clusterings. Analysis is performed for all cells and for multiple pairs of complementary samples of cells. For a given number of clusters, each sample's clustering and clusters are compared with those of the full data set (restricted to the sample). This provides measures of the stability of the clustering and its clusters. For three large data sets, this yielded clusterings compatible with published results, though with fewer clusters. Clusterings of two were judged to be stable. We conclude that it is feasible to identify stable clusterings of as many as 100,000 cells. Future research should explore using differential expression for validation.
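Two ingredients named above, the inverse-distance affinity between k-nearest neighbors and the normalized cut of a two-way split, can be sketched in plain Python. This is an illustration under stated assumptions, not the author's pipeline: the affinity is symmetrized (an edge exists if either point is among the other's k nearest), coincident points are skipped, and the spectral step of Ng, Jordan, and Weiss is omitted, so the splits are supplied by hand.

```python
import math


def knn_affinity(points, k):
    """Symmetric affinity matrix: 1/distance for k-nearest-neighbor pairs, else 0."""
    n = len(points)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        for d, j in dists[:k]:
            if d > 0:  # assumption: coincident points get no edge
                W[i][j] = W[j][i] = 1.0 / d
    return W


def normalized_cut(W, in_a):
    """Ncut(A, B) = cut(A, B) / vol(A) + cut(A, B) / vol(B)."""
    n = len(W)
    cut = sum(W[i][j] for i in range(n) for j in range(n) if in_a[i] and not in_a[j])
    vol_a = sum(sum(W[i]) for i in range(n) if in_a[i])
    vol_b = sum(sum(W[i]) for i in range(n) if not in_a[i])
    return cut / vol_a + cut / vol_b


# Hypothetical low-dimensional points forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
W = knn_affinity(points, k=3)
good = [True, True, True, False, False, False]   # split between the groups
bad = [True, False, True, False, True, False]    # split through both groups
print(normalized_cut(W, good), normalized_cut(W, bad))
```

A small normalized cut indicates a well-separated pair of clusters, which is why the abstract uses it as the branch length in the dendrogram.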
bioinformatics 2026-05-27 v4

Sequence-independent protein domain detection and classification with PRISM
Tan, A.; Seedorf, H.
Abstract
The explosion of predicted protein structures has revealed countless novel domain families. However, gold-standard segmentation tools like Chainsaw and Merizo are trained on rapidly obsoleting CATH databases, lack automatic domain classification, and cannot be easily fine-tuned without deep learning expertise. We introduce PRISM, a unified framework enabling sequence-independent, one-shot fine-tuning for simultaneous domain segmentation and classification, bypassing traditional constraints to accurately resolve complex, novel protein architectures.
bioinformatics 2026-05-27 v1

growthcurves: User-friendly tools for quality-controlled cellular growth analysis
Bradley, S. A.; Webel, H.; Donati, S.; Acevedo-Rocha, C.
Abstract
Biological growth curves are widely used but inconsistently analyzed due to fragmented workflows and limited quality control. We present growthcurves, a Python package for extracting growth parameters, and two open-source web applications (MicroGrowth and AutoGrowth) enabling human-in-the-loop analysis of datasets from microplate reader or mini-bioreactor experiments in either batch or turbidostat cultivation mode. By combining automated fitting with convenient quality control, the platform improves reproducibility and reliability of growth-curve analysis.
bioinformatics 2026-05-27 v1

Ordered Gromov-Hausdorff Metric: A New Tool for Comparative Analysis of Protein Structures
Timofeev, A.; Anufriev, A.
Abstract
Motivation: Classical protein structure comparison metrics such as RMSD and TM-score effectively assess geometric similarity but ignore the linear order of amino acid residues (Zhang and Skolnick, 2004). The Gromov-Hausdorff (GH) metric compares metric spaces by shape but also does not account for order (Gromov, 1981). This can lead to incorrectly identifying proteins with swapped domains as similar. We introduce the Ordered Gromov-Hausdorff (OGH) metric, defined on ordered metric spaces, to incorporate residue order into the comparison. Results: OGH combines coordinate normalization, an exponential penalty for order violations, and a monotonic alignment algorithm with computational complexity O(n*w), where w is the search window width. It is proven that OGH satisfies all metric axioms for > 0. Analytical properties include invariance under isometries, upper boundedness, Lipschitz continuity under small coordinate perturbations, and concavity in the weight parameter. On the VAD dataset (28 viral proteins from HIV-1, SARS-CoV-2, MERS-CoV), OGH increases monotonically with residue shuffling (up to 0.363 at 100% shuffling) and correlates strongly with TM-score (r = 0.706). In the task of separating homologs at fixed global similarity (TM-score ≈ 0.5), OGH achieves AUC = 0.800, whereas TM-score gives AUC = 0.467, demonstrating that OGH detects conserved order even when global geometry is not conserved. Availability: The Python source code for OGH is freely available at https://github.com/andytimoffilim/OGH. The VAD dataset (PDB IDs listed in the paper) is publicly accessible from the RCSB Protein Data Bank (Berman et al., 2000; wwPDB, 2019).
bioinformatics 2026-05-27 v1

There and back again: a multi-omics tale of thyroid co-expression network rewiring
Pozhidaeva, M.; Bussmann, H.; Huisinga, M.; Buesen, R.; Hackermüller, J.; Canzler, S.
Abstract
The integration of multi-omics data offers unprecedented insight into complex biological systems but presents significant analytical challenges. In this study, we propose a best-practice framework for constructing simultaneous weighted gene co-expression networks (WGCNA) from transcriptomics, proteomics, and metabolomics data. Using a rodent model of thyroid toxicity induced by propylthiouracil (PTU), we analyzed thyroid tissues from control, treated, and recovery groups. We demonstrate that concatenating individually processed omics layers at the sample level--without additional scaling--preserves meaningful correlation structures and reflects best practices for biologically interpretable network construction. Co-expression networks were constructed for each group, revealing extensive disruption of molecular interactions under treatment and partial restoration during recovery. We highlight the complementary strengths of two analytical strategies: module preservation analysis identifies disrupted co-regulatory structures, while differential connectivity analysis detects feature-level rewiring events. As a methodological advance, we introduce a permutation-based approach for calculating feature-specific p-values for differential connectivity (DiffK), enabling robust statistical inference. This strategy uncovered over 4,400 significantly rewired features, many of which showed stable expression, underscoring the added value of network-based analyses. Our findings demonstrate the utility of integrated multi-omics WGCNA and differential network analysis in capturing dynamic, system-wide regulatory changes.
bioinformatics 2026-05-27 v1

Slivka and Slivka-bio: a lightweight framework for presenting executables as web services and its application in bioinformatics.
Warowny, M.; Down, T.; Macgowan, S. A.; Mukhyala, K.; Barton, G. J.; Procter, J. B.
Abstract
Motivation. Execution of code is critical for computational biology, but technical requirements can prevent others from running it. Public web-apps and services thus remain the most effective way to make code accessible, but no fully reusable infrastructure exists to help researchers do this. Results. We developed Slivka to enable easy provision of robust HTTP-based execution services backed by local or distributed hardware and accessible via curl and dedicated clients. We demonstrate it with Slivka-bio, which provides semantically annotated services for Jalview 2.12 (https://www.jalview.org/development/jalview_develop/) and includes 15+ tools for protein and RNA analysis. Slivka has been in production in academic and industry environments for 5 years and has run more than 1.5M jobs. Availability and Implementation. Slivka and Slivka-bio are released under the Apache 2.0 License. A public Slivka-bio instance is available at https://www.compbio.dundee.ac.uk/slivka with links to documentation, Docker containers, and GitHub repositories for Slivka-bio and Slivka.
bioinformatics 2026-05-27 v1

ATLAS: a scverse-compatible package for multi-omic single-cell trajectory inference integration
Leclercq, A.; Martini, L.; Bardini, R.; Savino, A.; Di Carlo, S.
Abstract
Single-cell trajectory inference is widely used to study cellular differentiation and fate decisions, yet most existing approaches rely on transcriptomic information alone, limiting their ability to capture the regulatory processes underlying cell-state transitions. This work presents ATLAS (Advanced Trajectory Learning from multi-omics At Single-cell resolution), a scverse-compatible framework for trajectory inference in paired single-cell RNA-seq and ATAC-seq data. ATLAS integrates transcriptomic and chromatin accessibility information through Weighted Nearest Neighbor graphs, enabling both molecular layers to jointly inform pseudotime estimation, terminal-state identification, and fate probability inference within a unified multi-omic representation. Across synthetic and real datasets, ATLAS reconstructs coherent developmental trajectories, captures progressive fate commitment, and resolves biologically meaningful lineage structures, demonstrating the effectiveness of multi-omic integration for characterizing cellular dynamics. In addition, ATLAS enables the joint exploration of transcription factor expression and target gene activity along pseudotime, providing direct access to regulatory programs and chromatin-associated transitions that are not detectable from transcriptomic data alone. Overall, ATLAS provides a scalable and biologically informative framework for studying dynamic cellular processes in single-cell multi-omics experiments.
bioinformatics 2026-05-27 v1

ClusToRa: A niche-centric framework for identifying structural recruitment and infiltration in spatial omics
Githaka, J. M.; Lerner, E. P.
Abstract
Spatial omics maps cellular landscapes, yet current tools might conflate stochastic proximity with organized niches. We present ClusToRa (Cluster-to-Randomization), a framework that identifies high-density cellular territories and quantifies cell-type recruitment using a fixed-position null model. Benchmarked against graph-based neighborhood-enrichment and point-pattern statistics, ClusToRa reduced false-positive enrichment in simulations and resolved core-vs-boundary interactions. Applied to cirrhotic MASH liver, ClusToRa identifies stellate-cell territories with immune/endothelial infiltration and stress-, Notch-, and PPAR-associated programs, providing a niche-centric framework for distinguishing structural cellular infiltration from boundary adjacency or density-driven colocalization.
bioinformatics 2026-05-27 v1

Transcriptomic Profiling and Regulatory Network Analysis of Ten Metabolic Transporters Across Five Diabetic Complications: A Multi-Dataset, Twelve-Phase GEO Bioinformatics Study
Adegboyega, B. B.; Ekanem, P. C.; Awolaja, O. O.; Osarietin, E.; Okorie, B.
Abstract
Objective: Diabetic complications collectively represent one of the most urgent unresolved problems in medicine, yet the field continues to study them in near-complete isolation from one another. No unified framework has systematically characterised the shared and divergent molecular signatures of ten clinically critical metabolic transporters across all five major complications, cardiomyopathy (DCM), nephropathy (DN), retinopathy (DR), peripheral neuropathy (DPN), and atherosclerosis and vasculopathy (DAD), through an integrated, multi-method computational pipeline. This study was designed to address that gap directly. Methods: Eleven GEO microarray datasets comprising 118 diabetic and 76 control samples were analysed through twelve sequential phases: differential expression analysis, pan-complication overlap, weighted gene co-expression network analysis (WGCNA), GO/KEGG functional enrichment with gene set enrichment analysis (GSEA), STRING protein-protein interaction (PPI) network construction, competing endogenous RNA (ceRNA) network mapping, transcription factor activity inference using a VIPER-style algorithm, immune cell infiltration estimation by single-sample GSEA, diagnostic biomarker modelling using LASSO logistic regression and Random Forest classification, CMap-style drug repurposing by connectivity scoring, and two-sample Mendelian randomisation (MR) employing four independent estimators (inverse-variance weighted [IVW], MR-Egger, weighted median, and weighted mode). Results: CD36 was the only transporter to achieve significant dysregulation across three independently sourced tissue types (DN, DR, DPN; logFC range 0.88 to 2.18), whilst TLR4 exhibited the highest fold-change in the study (logFC = 3.88, DPN) and the greatest WGCNA module membership (kME = 0.976, DPN). 
SERCA2 was significantly downregulated in three complications (DCM, DN, and DR) at formal significance thresholds and trended negatively in the remaining two (DPN and DAD), constituting the most consistently suppressed transporter in the study. Its universal downregulation was explicable through four convergent mechanisms spanning transcriptional, oxidative, ceRNA-mediated, and transcription factor-level regulation, and was confirmed as causally relevant to diabetic cardiomyopathy by eQTL Mendelian randomisation (beta = -0.085, p = 0.005). miR-21-5p was identified as the dominant ceRNA regulatory bridge (betweenness centrality = 0.428; 6.7-fold above the second-ranked miRNA), with MALAT1 as the sole lncRNA hub active in all five complications. PPARgamma and TP53 repression emerged as the leading transcription factor-level explanations for the simultaneous metabolic and inflammatory dysregulation characteristic of the diabetic transcriptome. Immune deconvolution revealed DCM as immunologically quiescent, DN as comprehensively infiltrated (ten enriched cell types), and DPN as mast-cell-dominated, identifying a cellular mechanism for TLR4-driven neuroinflammation that has not previously been systematically characterised. GLUT4 achieved perfect diagnostic discrimination for DPN (AUC = 1.000, p < 0.001; LASSO coefficient = -2.143), whilst SGLT2 was the leading DAD diagnostic marker (AUC = 1.000, p = 0.002). Epalrestat was the sole pan-complication drug repurposing candidate (significant connectivity reversal in four of five complications). Mendelian randomisation confirmed causal effects of T2DM genetic liability on all five complications (all p < 0.0001, all four estimators concordant), and eQTL-MR identified TLR4 (beta = +0.073, p = 0.006) and CD36 (beta = +0.070, p = 0.008) as causal risk factors for DN, SERCA2 reduced expression as a causal driver of DCM (beta = -0.085, p = 0.005), and SGLT2 expression as a causal protector against DN (beta = -0.070, p = 0.013). 
Conclusions: This twelve-phase investigation identifies a pan-complication CD36/TLR4 inflammatory dyad and a SERCA2 calcium-mitochondrial effector axis, both confirmed at seven independent analytical levels, including causal genomic inference. GLUT4 downregulation defines DPN at the diagnostic level with perfect accuracy and is explicable through a five-layer mechanistic chain from MODY transcription factor inactivation to ceRNA competitive pressure. Epalrestat warrants prospective evaluation beyond its established DPN indication. These findings collectively constitute the most comprehensive computational characterisation of metabolic transporter biology in diabetic complications to date.
bioinformatics 2026-05-27 v1

TIMS-Bench: Towards community standards for benchmarking untargeted trapped ion mobility metabolomics tools and datasets
Rajkumar, P.; Gadiya, Y.; Deleray, V.; Roux, A.; West, K. A.; Allen, A.; Dorrestein, P.; Domingo-Fernandez, D.; Misra, B. B.
Abstract
Untargeted liquid chromatography-tandem mass spectrometry (LC-MS/MS)-based metabolomics is an important technology for unbiased discovery of small molecules in biomedical (e.g., drug discovery to diagnostics), animal, plant, environmental, and microbial research. Over the past decade, ion mobility has added a further dimension to the triplet of MS1, MS2, and retention time, helping resolve co-eluting or isomeric features in LC-MS/MS data and thereby aiding compound identification. Here, we focused on evaluating the current trapped ion mobility spectrometry (TIMS)-amenable feature-finding tools (MZmine 4.9, MS-DIAL 5.5, and MetaboScape 2025 14.0.3) for pre-processing of metabolomics-scale data generated using TIMS, a popular ion mobility mass spectrometry (IM-MS) technique. We leveraged ten public and three benchmark TIMS datasets to evaluate these tools for their strengths and weaknesses. Our results show that MZmine consistently identified the highest number of features and confidently annotated features; however, this performance was accompanied by an increased number of false positives, due to peak splitting, as well as reduced accuracy in collision cross section (CCS) measurements. In contrast, MetaboScape achieved the highest fraction of high-quality MS2 spectra, reflecting a more conservative feature detection strategy. MS-DIAL demonstrated balanced performance, identifying features that other tools missed. Finally, we publicly release the ground-truth datasets and code to support future developments in improving IMS data analysis.
bioinformatics 2026-05-27 v1

Training Strategy Optimization to Mitigate Shortcut Learning in Pan-Cancer Drug Response Prediction
Shimamoto, K.; Ito, T.; Lysenko, A.; Tsunoda, T.
Abstract
Background: Prediction of in vivo drug response is a central challenge in precision medicine, but the scarcity of labeled clinical data still necessitates the use of large-scale cancer cell line resources for model training. Domain adaptation methods, which aim to transfer knowledge learned from a source domain (cell lines) to a target domain (patients) by aligning feature distributions across domains, are a promising approach to bridge the gap between in vitro models and in vivo patients. However, we observed that these methods can exhibit a significant discrepancy between pan-cancer evaluation metrics and cancer type-specific prediction accuracy. This performance gap warrants a detailed investigation into their underlying predictive characteristics. Results: We discovered that cancer-type-specific class imbalances in training data can lead domain adaptation models to engage in shortcut learning, where they primarily discriminate between cancer types rather than capturing the actual biological determinants of drug sensitivity. To address this, we propose a strategy of combining two approaches: (1) excluding cancer types causing imbalance from the training data, and (2) adjusting class balance through oversampling and class weighting while retaining cancer types causing the imbalance. Among all configurations tested in conjunction with the CODE-AE (Context-aware Deconfounding AutoEncoder) framework, the combination of moderate oversampling (30% non-responder ratio) with class weighting achieved the best performance, significantly improving prediction accuracy in 5 out of 11 external patient cohorts from TCGA and GEO. Conclusions: Our findings demonstrate that appropriate class imbalance correction, rather than wholesale exclusion of imbalanced cancer subtypes, enables effective utilization of biologically relevant information shared across cancer types for drug response prediction. 
This study highlights the critical importance of jointly optimizing training data composition and class balance adjustment strategies in developing robust pan-cancer drug response prediction models for precision medicine applications.
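The class-balance adjustment described above (oversampling to a 30% non-responder ratio, combined with class weighting) can be illustrated with a small stand-alone sketch. It assumes non-responders are the minority class, that the ratio means their share of the resampled training set, and that weights follow the common inverse-frequency heuristic n_samples / (n_classes * class_count); none of these details are confirmed by the abstract, and the function names are hypothetical.

```python
import math
import random
from collections import Counter


def oversample_to_ratio(labels, minority, target_ratio, seed=0):
    """Return sample indices with minority items repeated until they reach target_ratio."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    if not minority_idx:
        raise ValueError("no minority samples to oversample")
    majority_n = len(labels) - len(minority_idx)
    # Solve m / (m + majority_n) >= target_ratio for the minority count m.
    needed = math.ceil(target_ratio * majority_n / (1.0 - target_ratio))
    extra = max(0, needed - len(minority_idx))
    return list(range(len(labels))) + [rng.choice(minority_idx) for _ in range(extra)]


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


# Hypothetical training labels: 1 = responder, 0 = non-responder (10% of samples).
labels = [1] * 90 + [0] * 10
idx = oversample_to_ratio(labels, minority=0, target_ratio=0.30)
resampled = [labels[i] for i in idx]
weights = balanced_class_weights(resampled)
print(Counter(resampled), weights)
```

Applying the weights after a moderate oversampling step leaves a residual imbalance for the loss function to correct, which is one way to read "moderate oversampling with class weighting".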
bioinformatics 2026-05-27 v1

NeuroFate: endpoint-locked transcriptomic axis scoring for neurodegeneration risk research
Ghosh, N.; Sinha, K.
Abstract
Motivation: AD and PD transcriptomic cohorts can reveal disease-associated neuronal, glial, mitochondrial, myelin, proteostasis, vascular, and immune programs, but these signals are difficult to compare reproducibly across studies without endpoint-locked, sample-level biological summaries. Results: We present NeuroFate, a command-line research package that converts compact transcriptomic cohorts into curated neurodegeneration-axis scores, exploratory research-use risk scores, and conservative evidence reports. The software locks disease-state endpoints before scoring, maps genes or probes onto a 10-axis NeuroFate panel, records axis-gene coverage, and grades external cohort evidence by direction, effect size, nominal/FDR support, and claim-safety rules. Demonstrations across AD and PD resources show nominal independent AD support for a neuronal vulnerability axis, mixed PD convergence, and a PD-divergent synuclein-mitochondrial example while avoiding clinical or mechanism-overstating claims. Availability and implementation: NeuroFate is implemented in Python and available at https://github.com/sinhakrishnendu/NeuroFate.git. Contact: nabanitaghosh89@gmail.com; dr.krishnendusinha@gmail.com. Supplementary information: Documentation, examples, tests, and reproducibility notes are included in the repository.
bioinformatics 2026-05-27 v1

GraphTox: A Semi-Supervised Pre-Trained Framework for Peptide Toxicity Prediction using Geometric Graph Transformer and LORA-Based Finetuning
Bhaduri, S.; Das, D.; Mitra, P.
Abstract
Peptides are widely used as potential therapeutic agents in drug discovery and biotechnology because they are specific, effective, and relatively inexpensive to produce. They are used in drug development, vaccines, and antimicrobial treatments. However, peptide toxicity remains a major concern because it can cause adverse effects such as membrane rupture, haemolysis, tissue damage, and adverse immunological responses. Early detection of toxic peptide candidates is vital for the development of safe and effective therapies. Current computational methods for predicting peptide toxicity are largely based on hand-crafted sequence descriptors or sequence-only deep learning architectures that may not fully account for the underlying three-dimensional structural determinants of peptide toxicity. We introduce GraphTox, a structure-aware geometric deep learning framework which combines self-supervised graph representation learning with hierarchical structural modelling to accurately predict peptide toxicity. Our framework learns geometry-aware embeddings from peptide structural graphs via self-supervised masked residue reconstruction, based on a Masked Graph Autoencoder (MGAE) built on a Geometric Graph Transformer (GGT) encoder. The pretrained structural representations are cross-fused via a multi-scale U-Net architecture to capture both local residue-level interactions and global conformational patterns associated with peptide toxicity. GraphTox explicitly models spatial relationships between residues, thereby efficiently capturing structural aspects that are generally neglected by sequence-based predictors, such as residue clustering, hydrophobic interactions, and electrostatic organization. On benchmark datasets our framework shows superior performance and interpretability over the existing state-of-the-art methods. 
Our hybrid hierarchical structural modelling framework is a superior computational platform to improve the prediction of peptide toxicity and expedite the creation of safer peptide therapies. https://github.com/debraj-55555/GraphTox
bioinformatics 2026-05-27 v1

Toward Large-Scale Numerical Modeling of the Cardiovascular System with up to 34 Billion Vessels
Newhauser, W.; Cole, M.; Diehl, P.; Moreno, J.; Kaiser, H.; Tohid, R.; Nader, N.; Chancellor, J.
Abstract
Cardiovascular diseases, such as stroke and heart attacks, are the leading cause of death worldwide. Computational models like cardiovascular digital twins (CVDTs) offer a promising path for research and intervention but are challenged by the complexity of simulating the full human vasculature. This study evaluates the feasibility of simulating blood flow through a vascular network containing 34 billion vessels (the estimated number in the human body) using first-principles physics and simplified geometry, as a first step towards CVDTs. We synthesized 3D vasculature using a fractal model and computed blood flow rates via the Poiseuille equation and steady-state fluid dynamics, implemented with high-performance computing. Simulations were conducted for networks ranging from 6 vessels to 34 billion vessels. The results demonstrated high accuracy (within 1% of benchmarks), reproducibility across platforms, and strong scalability. Simulating the full vasculature required 156 node-hours on the second-fastest supercomputer in the world, using 29 TB of memory and 84 TFLOPS. The maximum speedup factor was 80, with parallel efficiency no lower than 0.48. These findings show it is computationally feasible to simulate blood flow through a full-body vascular network at scale. The approach is well suited to parallel computing, suggesting that with continued development, CVDTs could enable whole-organism modeling for applications such as stroke, trauma, radiation injury, and cancer metastasis.
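For orientation, the Poiseuille relation for steady laminar flow in a rigid tube is Q = pi * dP * r^4 / (8 * mu * L), equivalently a hydraulic resistance R = 8 * mu * L / (pi * r^4). The sketch below applies it to a toy symmetric bifurcating tree. The tree layout, the Murray-type scaling ratios, and all parameter values are assumptions chosen for illustration; they are not the authors' fractal model or code.

```python
import math


def poiseuille_resistance(radius, length, viscosity):
    """Hydraulic resistance of one cylindrical vessel: 8 * mu * L / (pi * r^4)."""
    return 8.0 * viscosity * length / (math.pi * radius ** 4)


def symmetric_tree_flow(delta_p, root_radius, root_length, viscosity, generations,
                        radius_ratio=2 ** (-1 / 3), length_ratio=2 ** (-1 / 3)):
    """Steady flow through a symmetric binary tree under a total pressure drop.

    Generation g holds 2**g identical vessels in parallel, and generations are
    connected in series. The scaling ratios are illustrative assumptions.
    """
    total_resistance = 0.0
    radius, length = root_radius, root_length
    for g in range(generations):
        single = poiseuille_resistance(radius, length, viscosity)
        total_resistance += single / (2 ** g)  # identical vessels in parallel
        radius *= radius_ratio
        length *= length_ratio
    return delta_p / total_resistance


# Hypothetical SI values: 100 Pa drop, 1 mm radius, 1 cm length, 3.5 mPa*s viscosity.
q_single = symmetric_tree_flow(100.0, 1e-3, 1e-2, 3.5e-3, generations=1)
q_tree = symmetric_tree_flow(100.0, 1e-3, 1e-2, 3.5e-3, generations=5)
print(q_single, q_tree)
```

With these particular ratios each generation contributes the same resistance as the root vessel, so the five-generation flow is one fifth of the single-vessel flow. The fourth-power dependence on radius is what makes the smallest vessels dominate the resistance in less idealized geometries.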
bioinformatics 2026-05-27 v1

GAP-MS: Automated validation of gene predictions using integrated mass spectrometry evidence
Abbas, Q.; Wilhelm, M.; Kuster, B.; Frischman, D.
Abstract
Accurate genome annotation is fundamental to modern biology, yet distinguishing authentic protein-coding sequences from prediction artifacts remains challenging, particularly in complex plant genomes where automated methods are error-prone and manual curation is rarely feasible due to prohibitive time and costs. Here, we present GAP-MS (Gene model Assessment using Peptides from Mass Spectrometry), an automated proteogenomic pipeline that leverages mass spectrometry evidence to systematically validate the protein-level accuracy of predicted gene models. Applied across 9 major crop species, GAP-MS consistently improved prediction precision for four widely used gene prediction tools. In addition to filtering erroneous models, the pipeline identified hundreds of previously missing gene models from current standard reference annotations. These peptide-supported loci were further verified by transcriptional evidence, well-supported functional annotations, and high coding-potential scores. Together, these results demonstrate that direct proteomic evidence provides a robust framework for resolving annotation ambiguities, defining high-confidence reference proteomes, and uncovering overlooked protein-coding genes, while facilitating the identification of sequences that may require further investigation.
bioinformatics 2026-05-26 v3

Unveiling the Terra Cognita of Sequence Spaces using Cartesian Projection of Asymmetric Distances
Ramette, A.
Abstract
Visualizing relationships within massive biological datasets remains a significant challenge, particularly as sequence length and volume increase. We introduce CAPASYDIS (Cartesian Projections of Asymmetric Distances), a scalable approach designed to map the explored regions of a given sequence space. Unlike traditional dimensionality reduction methods, CAPASYDIS calculates asymmetric distances which account for both the position and type of sequence variations. It projects sequences into a fixed, low-dimensional coordinate system, termed a "seqverse", where each sequence occupies a permanent location. This design allows for the instant mapping of new sequences without the need to recalculate the global structure, transforming sequence analysis from a relative comparison into navigation on a standardized map. We applied this method to a large rRNA sequence dataset spanning the three domains of life. Our results demonstrate that the sequences of Bacteria, Archaea, and Eukaryota occupy spatially distinct regions characterized by fundamentally different shapes and patterns of variation. Furthermore, the resulting seqverses retain a high amount of taxonomic information when analyzed from broad domain levels down to single-base differences. Overall, CAPASYDIS provides a reproducible, scalable framework for defining the boundaries and topography of biological sequence universes.
bioinformatics 2026-05-26 v3

WITHDRAWN: Scalable Microbiome Network Inference: Mitigating Sparsity and Computational Bottlenecks in Random Effects Models
Roy, D.; Ghosh, T. S.
Abstract
The authors have withdrawn their manuscript because the biological validations associated with the inferred microbial interaction directions are currently incomplete and require further verification. We are actively validating these biological directions and ensuring the scientific validity of the reported findings before any further dissemination. Therefore, the authors do not wish this work to be cited as a reference for the project at this stage. If you have any questions, please contact the corresponding author.
bioinformatics 2026-05-26 v2

Beyond natural amino acids: Extending immunogenicity risk assessment to non-canonical peptide drugs through chemical feature encoding
Cairoli, M.; Nielsen, M.; Betts, C.; Obrezanova, O.; De Maria, L.
Abstract
Peptide therapeutics are increasingly used to treat challenging diseases, but immunogenicity risks limit their clinical success. In silico tools enable immunogenicity screening through prediction of peptide-MHCII binding, yet current methods fail to capture chemical properties of non-natural amino acids routinely incorporated to improve drug properties. Here, we present a machine learning approach combining chemical fingerprints with sequence information to predict MHC class II binding for both canonical and modified peptides. We propose two molecular representations (direct-encoding and similarity-based chemical fingerprints) that preserve positional information while encoding chemical diversity. These representations achieved performance comparable to sequence-based encodings (BLOSUM62 and one-hot) for canonical peptides while accurately identifying binding cores and motifs. Testing on citrullinated peptides, chemical fingerprints substantially improved quantitative prediction accuracy while maintaining comparable linear correlation across encoding methods, demonstrating the importance of explicit chemical representation for accurate absolute binding affinity prediction. These descriptors can be integrated into pan-allele prediction frameworks, enabling immunogenicity risk assessment across diverse modifications and therapeutic modalities, including peptide therapeutics, antibody-drug conjugates, and synthetic vaccines. The proposed chemistry-informed framework addresses a critical gap in preclinical drug development, facilitating early mitigation strategies before costly clinical trials.
bioinformatics2026-05-26v1LVentiView: An Open-Source Software for Automated 3D Left Ventricular Mesh Reconstruction and Analysis from Cardiac MRI
Braun, I.; Wang, Y.; Ecker, A. S.; Bodenschatz, E.Abstract
Patient-specific cardiac modeling requires accurate three-dimensional representations of the left ventricle (LV) reconstructed from cardiac magnetic resonance imaging (MRI). Here, we present LVentiView, an open-source software that bridges medical imaging and cardiac simulation by automating the full pipeline from MRI segmentation to simulation-ready volumetric meshes, with integrated tools for volumetric analysis and regional myocardial thickness calculation. We validate LVentiView on the Sunnybrook Cardiac Dataset, comprising healthy subjects and three cardiac pathologies. LVentiView achieves blood pool segmentation at the inter-expert level. The generated meshes are verified by comparing LV volumes extracted from the meshes to those computed from expert manual segmentation masks, with volumes and cardiac parameters agreeing within inter-expert variability across all four subject groups. In addition, mesh-derived regional thickness maps capture pathology-specific patterns, including wall thickening in hypertrophic cases. LVentiView is freely available on GitHub and provides an accessible, validated foundation for patient-specific cardiac modeling.
bioinformatics2026-05-26v1Prediction and evaluation of Split-ORFs using Ribo-seq data
Kalk, C.; Murtagh, J.; Despic, V.; Mueller-McNicoll, M.; Schulz, M.Abstract
Split open reading frames (Split-ORFs) occur in transcripts containing at least two open reading frames, each encoding a part of the same full-length protein. These multiple open reading frames arise from alternatively spliced transcript isoforms. Split-ORFs have been described in the SR protein family of splicing factors, where the resulting protein halves play important autoregulatory roles. Here, we present the Split-ORF pipeline, a computational tool that predicts Split-ORFs from transcript sequences and identifies regions unique to the predicted Split-ORF products. Using this pipeline, we predicted more than 14,000 Split-ORF transcripts from alternatively spliced human transcripts containing premature termination codons or retained introns. Hundreds of the Split-ORF unique regions show significant Ribo-seq coverage across diverse cell types and diseases. The candidate Split-ORF genes with significant Ribo-seq coverage are enriched for RNA-binding and RNA-processing functions, and the majority of them encode RNA-binding proteins. Together, these results suggest that Split-ORFs are more widespread than previously assumed and are expressed across diverse cellular contexts. This work paves the way for future studies of the Split-ORF candidates, the mechanisms of their biogenesis and their functions within the RNA-binding protein class.
bioinformatics2026-05-26v1ARACoFusion: Uncertainty-aware calibrated deep learning for protein-protein interaction network prediction in Arabidopsis thaliana
Sarkar, D.; Sarkar, C.Abstract
Accurate mapping of the Arabidopsis thaliana protein-protein interaction (PPI) network is essential for deciphering the complexity of plant systems biology. Here, we present ARACoFusion, a specialized deep learning architecture designed to predict inter-protein connectivity directly from primary sequences. To capture the asymmetric dependencies between plant proteins, the framework utilizes a reciprocal cross-attention encoder combined with latent interaction projections and multi-source feature fusion. Addressing the severe class imbalance inherent in plant interactomes, the model integrates uncertainty-aware variance regularization and focal loss with label smoothing, further enhancing reliability through post hoc probability calibration via temperature scaling. Extensive benchmarking on gold-standard Arabidopsis datasets demonstrates that ARACoFusion significantly outperforms existing plant-specific predictors, achieving superior scores in Area Under the Precision-Recall Curve (AUPRC), Balanced Accuracy, and Matthews Correlation Coefficient (MCC). Additionally, the model exhibits robust cross-species generalization and clear class separability in t-SNE latent space visualizations. To facilitate community-wide usage, we provide a dedicated web server for scalable network-level inference at https://ARAcofusion.compbiosysnbu.in/.
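The abstract above names temperature scaling as its post hoc calibration step. The following is a minimal, generic sketch of that technique and not ARACoFusion's own code: a single temperature T is chosen on held-out logits by minimizing the negative log-likelihood. The function names, the toy validation logits, and the grid search (used here in place of a gradient-based fit) are all illustrative assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature (T > 1 softens, T < 1 sharpens)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def negative_log_likelihood(logit_rows, labels, temperature):
    """Summed NLL of the true labels under temperature-scaled probabilities."""
    return -sum(
        math.log(softmax(row, temperature)[label])
        for row, label in zip(logit_rows, labels)
    )

def fit_temperature(logit_rows, labels, grid):
    """Pick the temperature from a candidate grid that minimizes validation NLL."""
    return min(grid, key=lambda t: negative_log_likelihood(logit_rows, labels, t))

# Hypothetical held-out logits: the model is very confident, but one of the
# four predictions is wrong, so the fitted temperature should exceed 1.
val_logits = [[4.0, 0.0], [4.0, 0.0], [4.0, 0.0], [0.0, 4.0]]
val_labels = [0, 0, 1, 1]
best_t = fit_temperature(val_logits, val_labels, grid=[0.5, 1.0, 2.0, 4.0, 8.0])
calibrated = softmax([4.0, 0.0], best_t)
```

Because dividing logits by a positive constant never changes their ordering, this step leaves predicted classes untouched and only adjusts the confidence attached to them.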
bioinformatics2026-05-26v1Cycle-consistent deep generative modeling unifies cellular states across unpaired spatial and single-cell modalities
Zhang, H.; Quinn, J. F.; Data Science TeamLab, ; Tansey, W.Abstract
Current spatial and single-cell technologies capture complementary but incomplete views of cellular state, with transcriptomic, proteomic, and spatial information distributed across distinct platforms. Integration is challenged by unpaired measurements, mismatched feature spaces, and modality-specific biases. We present MultiTME, a multimodal framework that integrates heterogeneous spatial and single-cell data using a spatially-regularized, cycle-consistent deep generative model. By enforcing consistency of bidirectional mappings, MultiTME learns a shared latent representation that enables translation between modalities without requiring paired observations or shared features. Across benchmarks, MultiTME outperforms existing methods, produces accurate cross-modal cell typing, improves spatial transcriptomic panel completion, and transfers whole-transcriptome information to generate spatially resolved maps at cellular resolution. Applied to a multimodal colorectal cancer dataset, we demonstrate that MultiTME integration reveals a spatially coherent proliferative-invasive tumor axis not directly observable within single modalities. Across five multimodal spatial datasets, we show MultiTME can correct for platform-specific biases between Xenium and CosMx, thereby facilitating cross-dataset harmonization and enabling pan-cancer spatial studies.
bioinformatics2026-05-26v1NAP: an open-source pipeline for cross-domain microbiome profiling using Nanopore sequencing-derived amplicon data
Jones, L. B.; Bagby, S.Abstract
Background Nanopore sequencing offers a cost-effective and portable platform for microbiome analysis, but amplicon-based approaches remain limited by higher sequencing error rates and a lack of workflows tailored to mixed domain ribosomal RNA profiling. While short-read technologies dominate microbial community analysis, their portability and flexibility are constrained. There is therefore a need for robust pipelines designed specifically for cross-domain Nanopore amplicon data. Results We introduce the Nanopore sequencing-based Amplicon Pipeline (NAP; https://github.com/Luke-B-Jones/NAP), an open-source workflow optimised for flexible mixed domain primer sets such as 515Y/926R. NAP performs adaptive quality filtering, chimera removal, centroid generation, BLAST-based taxonomic classification, hierarchical consensus correction, and domain-aware post-processing, outputting decontaminated abundance tables suitable for downstream analysis. Initial validation against two complementary commercial mock communities showed that NAP achieved strong genus-level performance across both low complexity logarithmic and more compositionally complex gut mock communities. Detection was most reliable above ca. 1% relative abundance, and replicate outputs showed strong agreement with expected composition under Bray-Curtis, Jaccard, agreement-plot, and Bland-Altman analyses. Benchmarking of NAP's internal filtering modes showed that the default adaptive setting provided the most robust balance of read quality, retained depth, and downstream taxonomic fidelity across heterogeneous inputs. Direct comparison against QIIME2 and Kraken2/Bracken further showed that NAP most accurately preserved expected community structure, with markedly fewer false positive assignments at genus level and substantially stronger species-level behaviour under the tested conditions. 
Species-level assignments were informative for some taxa, but remained less robust than genus-level outputs with the default V4-V5 amplicon. Conclusions NAP provides a robust and flexible workflow for cross-domain Nanopore amplicon profiling, with strongest performance at genus level and competitive species-level behaviour for well resolved taxa. Although analysis of field-derived data was not assessed here, NAP compatibility with portable Nanopore sequencing supports accurate mixed domain microbiome profiling under the tested conditions.
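The mock-community validation above compares observed and expected composition with Bray-Curtis and Jaccard measures. As a minimal sketch of those two standard measures (not NAP's own code), the comparison for a single sample could look like the following; the genus names and abundances are invented for illustration.

```python
def bray_curtis(expected, observed):
    """Bray-Curtis dissimilarity between two abundance profiles (taxon -> abundance)."""
    taxa = set(expected) | set(observed)
    diff = sum(abs(expected.get(t, 0.0) - observed.get(t, 0.0)) for t in taxa)
    total = sum(expected.get(t, 0.0) + observed.get(t, 0.0) for t in taxa)
    return diff / total if total else 0.0

def jaccard_distance(expected, observed):
    """Jaccard distance on presence/absence: 1 - |shared taxa| / |all taxa|."""
    present_expected = {t for t, v in expected.items() if v > 0}
    present_observed = {t for t, v in observed.items() if v > 0}
    union = present_expected | present_observed
    if not union:
        return 0.0
    return 1.0 - len(present_expected & present_observed) / len(union)

# Hypothetical genus-level relative abundances for one mock-community replicate.
expected = {"Bacillus": 0.50, "Escherichia": 0.30, "Listeria": 0.20}
observed = {"Bacillus": 0.45, "Escherichia": 0.35, "Listeria": 0.15, "Pseudomonas": 0.05}

bc = bray_curtis(expected, observed)       # abundance-weighted disagreement
jd = jaccard_distance(expected, observed)  # penalizes the one false-positive genus
```

Bray-Curtis weighs how far abundances drift from the expected values, while Jaccard only registers which taxa are present, so a low-abundance false positive affects the second measure far more than the first.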
bioinformatics2026-05-26v1Sparse, trainable subnetworks for multi-omics integration: a cross-validated evaluation of the Lottery Ticket Hypothesis across nutrigenomic, toxicogenomic, and oncogenomic datasets
Miszczak, R.Abstract
Multi-omics integration, the joint analysis of two or more high-dimensional molecular data types collected on the same biological samples, is now a standard analytical approach across nutrigenomics, toxicogenomics, microbiome research, and disease genomics. Existing methods sit on a trade-off between expressiveness and interpretability: latent-variable methods such as MOFA and DIABLO yield compact, biologically interpretable signatures but assume a restrictive linear structure; tree ensembles such as Random Forests achieve strong predictive performance but resist mechanistic interpretation; deep neural networks combine the drawbacks of both, with large numbers of opaque weights and no built-in feature selection. I ask whether the Lottery Ticket Hypothesis (LTH), the conjecture that a randomly initialised dense network contains a sparse subnetwork that matches its accuracy when trained from the original initialisation, can help reconcile this trade-off in the multi-omics setting. I apply Iterative Magnitude Pruning with weight rewinding for 25 rounds (cumulative sparsity 99.6%) on a multi-input fused multi-layer perceptron across eight datasets spanning four biological domains (n=40 to n=1,492), with 5-fold outer cross-validation and inner-validation winning-ticket selection to avoid test-set leakage. On the largest task, TCGA Pan-Cancer (4-class tissue-of-origin, n=1,492), a 2,952-weight subnetwork (83% sparsity) reached 84% +/- 3% test accuracy compared with 86% +/- 2% for the dense network. Pruning improved test accuracy on two TCGA staging tasks (TCGA-LUAD: 51% +/- 1% vs 45% +/- 5%; TCGA-KIRC: 50% +/- 4% vs 48% +/- 7%). Networks compressed by 6x to 270x while retaining task-level signal on well-specified tasks. I suggest LTH as a domain-agnostic, prior-free option for sparse neural integration of multi-omics data, complementary to graph-based and pathway-constrained methods.
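The sparsity figures quoted above follow from simple compounding. The abstract does not state the per-round pruning rate; the sketch below assumes a constant 20% of the remaining weights is removed each round, a common choice for Iterative Magnitude Pruning that happens to reproduce the reported 99.6% after 25 rounds. The function names are illustrative.

```python
def remaining_fraction(prune_rate, rounds):
    """Fraction of weights left after pruning `prune_rate` of the survivors each round."""
    return (1.0 - prune_rate) ** rounds

def cumulative_sparsity(prune_rate, rounds):
    """Fraction of the original weights removed after the given number of rounds."""
    return 1.0 - remaining_fraction(prune_rate, rounds)

def compression_factor(prune_rate, rounds):
    """How many times smaller the surviving subnetwork is than the dense network."""
    return 1.0 / remaining_fraction(prune_rate, rounds)

# Assumed rate: 20% per round (not stated in the abstract).
for rounds in (8, 25):
    sparsity = cumulative_sparsity(0.2, rounds)
    factor = compression_factor(0.2, rounds)
    print(f"round {rounds}: sparsity {sparsity:.1%}, compression {factor:.0f}x")
```

Under this assumption, round 8 gives roughly 83% sparsity (about 6x compression) and round 25 gives roughly 99.6% (a few hundred-fold compression), in line with the ranges reported in the abstract.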
bioinformatics2026-05-26v1Tandem: a bioinformatics tool for detection, mechanism classification, and population quantification of bacterial tandem gene duplications
Ngan, W. Y.; Smith, E. S. J.Abstract
Motivation: Tandem gene duplication drives antibiotic resistance, metabolic adaptation, and gene-family expansion in bacteria, but no single tool detects these duplications in reference genomes, discovers their junctions in isolate sequencing, and quantifies the junctions in population samples. Existing callers (e.g. breseq) detect duplications without classifying formation mechanisms and often fail to quantify the duplication. Results: Tandem has three modules. Module 1 detects reference-genome duplications by NUCmer self-alignment and classifies each by homologous-recombination signature and the junction microhomology length. Module 2 confirms junctions in whole-genome sequencing at user-nominated coordinates after the user inspects the coverage plot. Module 3 quantifies known junctions in population sequencing using the novel Junction Read Ratio (JRR). On 280 artificial population tests across seven bacterial species, Tandem achieves 100% recall and 4.3% mean absolute error. Applied to experimentally evolved Pseudomonas fluorescens SBW25 populations, Tandem resolves multiple co-segregating duplication fragments.
bioinformatics2026-05-26v1Constrained protein Large Language Model illustrated in protein stability, function and epistasis
Tzavella, K.; Olsen, C.; Vranken, W. F.Abstract
Our understanding of protein function and evolution is largely based on the relationship between amino acid sequence and overall fold, now effectively captured by computational models. Yet predicting how mutations--shaped by epistasis--alter protein behavior, especially in dynamic or structurally ambiguous regions, remains difficult. Here we present D2D, which combines a self-supervised protein language model with protein-specific evolutionary information to predict mutational effects using little to no task-specific labeled data. D2D captures long-range epistatic interactions, accurately predicts single and higher-order mutation effects on protein thermostability and binding, without being trained on the task. When fine-tuned, D2D outperforms state-of-the-art methods on latent driver cancer mutations and co-occurring proliferation-enhancing mutations across independent experimental studies. Unlike most existing approaches, D2D avoids biases linked to solvent accessibility or to multiple sequence alignment depth and quality, making it particularly effective for disordered or surface binding regions where structure-based predictors typically falter. Overall, D2D provides a general framework for modeling mutational effects in proteins with limited experimental or structural information.
bioinformatics2026-05-26v1Precision survival estimation in acute myeloid leukemia using evolutionary learning-derived microRNA signature
Yerukala Sathipati, S.; Agustriawan, D.; Gopireddy, N. S. R.; Popat, A.; Moat, L.; Aimalla, N.; Elugoti, M. R.; Kampa, S. A.; Sharma, P.; Ho, S.-Y.; Sharma, R.Abstract
Background Acute myeloid leukemia (AML) remains the most lethal acute leukemia in adults, with 5-year overall survival below 32% despite recent advances including venetoclax-, FLT3-, IDH1/2-, and Menin-targeted therapies. Clinical outcomes remain highly heterogeneous across patients, highlighting the need for robust molecular biomarkers capable of improving prognostic precision. MicroRNAs (miRNAs) are critical regulators of hematopoietic differentiation, apoptosis, and therapeutic resistance and are differentially expressed across AML subtypes. However, their clinical translation has been limited by high dimensionality, feature redundancy, and relatively small cohort sizes. Methods We developed and evaluated the AML Survival Estimator (AMLS), an inheritable bi-objective combinatorial genetic algorithm integrated with support vector regression (SVR), using TCGA-LAML miRNA expression profiles (n = 156). AMLS was benchmarked against ten widely used machine-learning approaches, including penalized regression, tree-based ensembles, support-vector regression, k-nearest neighbors, and multilayer perceptron models. Performance was assessed using stratified cross-validation with Pearson correlation (R), Harrell's concordance index (C-index), and mean absolute error (MAE). Functional characterization of the derived miRNA signature was performed through consensus target integration followed by pathway enrichment, gene ontology analysis, network reconstruction, and Kaplan-Meier risk stratification. Results AMLS achieved superior prognostic performance with pooled out-of-fold metrics of Pearson R = 0.86, C-index = 0.788, and MAE = 7.49 months, substantially outperforming all comparator models. 
Restricting analyses to the AMLS-derived 28-miRNA signature improved all baseline learners by approximately 2-4-fold, with the multilayer perceptron achieving R = 0.674; however, none matched the native AMLS framework, indicating that the evolutionary optimization strategy contributes predictive information beyond feature selection alone. The prognostic signature included biologically established AML-associated miRNAs, including hsa-miR-191, hsa-miR-29c, hsa-miR-125b, hsa-miR-148a, hsa-miR-15b, hsa-miR-10b, and hsa-miR-30c, linked to DNA methylation, apoptosis, cell-cycle regulation, and oncogenic Wnt/MAPK signaling pathways. Functional analyses demonstrated significant enrichment of canonical AML-associated pathways, including p53, PI3K-AKT, TGF-Beta, JAK-STAT, FoxO, and hematopoietic lineage signaling. Conclusions Our findings demonstrate that evolutionary learning integrated with SVR can recover a compact and biologically interpretable miRNA prognostic signature that substantially outperforms conventional machine-learning approaches for AML survival prediction. The identified miRNA network converged on key leukemogenic pathways involved in apoptosis, cell-cycle regulation, and oncogenic signaling, supporting both the biological relevance and prognostic utility of the framework. Given the minimally invasive and quantitatively scalable nature of miRNA profiling, this approach may provide a practical molecular adjunct for improving prognostic assessment and precision medicine strategies in AML.
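Of the metrics reported above, Harrell's concordance index is the one that accounts for censored follow-up. The following is a minimal sketch of the standard pairwise definition, written for a model that predicts survival time (longer predicted time should mean later death); it is not the AMLS code, and the toy data are invented.

```python
def harrell_c_index(times, events, predicted_times):
    """Harrell's C for predicted survival times.

    A pair (i, j) is comparable when subject i had an observed event
    (events[i] == 1) strictly before subject j's follow-up time. The pair is
    concordant when the model also predicts a shorter survival for i.
    Tied predictions count as half-concordant.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if predicted_times[i] < predicted_times[j]:
                    concordant += 1.0
                elif predicted_times[i] == predicted_times[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Hypothetical follow-up (months), event flags (1 = death, 0 = censored),
# and model-predicted survival times.
times = [5, 10, 15, 20]
events = [1, 0, 1, 1]
predicted = [6, 14, 30, 25]
c_index = harrell_c_index(times, events, predicted)
```

In this toy example three of the four comparable pairs are ordered correctly, giving C = 0.75; a value of 0.5 corresponds to random ordering and 1.0 to perfect ranking.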
bioinformatics2026-05-26v1Integrated optimization of experimental and computational workflows improves genome recovery in long-read gut metagenomics
Hu, Y.; Sun, L.; Huang, Y.; Jiang, F.; Tong, X.; Yang, J.; Ju, Y.; Yang, Z.; Liufu, S.; Hu, Y.; Ma, W.; Guo, R.; Li, W.; Zhang, T.; Zhu, X.; Zhang, Z.Abstract
Short-read metagenomic sequencing is widely applied in microbiome research because of its high quality and increasingly affordable cost. However, its short, fragmented reads limit assembly contiguity and the recovery of complete microbial genomes. In contrast, long-read sequencing, with substantially longer read lengths, can help overcome these limitations. Achieving complete and accurate genome recovery is a central goal in metagenomics. To advance this goal, we present a systematic effort to unify and optimize the long-read sequencing workflow, from experimental sample processing to computational genome assembly, using the CycloneSEQ platform.
bioinformatics2026-05-26v1OryzaG3: A Single-species Genomic Foundation Model Pretrained on Rice Pangenome
Yang, L.; Xia, Y.; Yang, Z.; Xia, C.; Wu, T.; Zou, M.; Xia, Z.Abstract
While multi-species genomic language models have advanced biological representation learning, high-quality, single-species foundation models for crops remain scarce. Leveraging recently expanded rice pangenome resources, we introduce OryzaG3, a species-specific DNA language model with 700M parameters. OryzaG3 was pretrained on 59.20 Gb of chromosome-level sequences from 149 high-quality rice genomes using a non-overlapping 3-mer tokenization strategy and a causal language modeling objective, featuring context-length variants up to 32k tokens. On the Plants Genomic Benchmark polyA prediction task, OryzaG3 achieves competitive predictive performance against leading multi-species models while delivering a four-fold increase in inference throughput under identical long-context conditions. Ultimately, OryzaG3 demonstrates that lightweight, single-species foundation models trained on high-quality pangenomes can match multi-species benchmarks while significantly reducing computational overhead. This work provides a scalable framework for rice functional genomics, molecular breeding, and targeted crop foundation model development.
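The non-overlapping 3-mer tokenization mentioned above can be illustrated in a few lines. This is a generic sketch, not OryzaG3's tokenizer: how the model treats trailing bases, ambiguous bases such as N, or special tokens is not stated in the abstract, so dropping an incomplete final 3-mer and building the vocabulary on the fly are assumptions made here.

```python
def tokenize_kmers(sequence, k=3):
    """Split a DNA sequence into consecutive, non-overlapping k-mers.

    Assumption: bases left over after the last full k-mer are dropped.
    """
    sequence = sequence.upper()
    usable = len(sequence) - len(sequence) % k
    return [sequence[i:i + k] for i in range(0, usable, k)]

def encode(tokens, vocab):
    """Map k-mer tokens to integer ids, extending the vocabulary as needed."""
    return [vocab.setdefault(tok, len(vocab)) for tok in tokens]

vocab = {}
tokens = tokenize_kmers("ATGCGTACGT")  # 10 bases: three 3-mers, final T dropped
ids = encode(tokens, vocab)
```

With a stride equal to k, each token covers three bases, so a 32k-token context spans roughly 96 kb of sequence, which is one reason this scheme suits long-context genomic modeling.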
bioinformatics2026-05-26v1IID-KG: An ontology-aligned literature-derived knowledge graph for infectious and immune-mediated diseases
PAN, F.; Zhang, Y.; Wang, J.; Liu, M.-C.; Sui, X.; Yue, H.; Zhang, J.Abstract
Infectious and immune-mediated diseases (IIDs) represent a broad and rapidly expanding biomedical literature domain in which scalable evidence extraction, disease ontology refinement, and interpretable knowledge integration are essential for biomedical discovery. We constructed an IID-specific biomedical knowledge graph (IID KG) from PubMed abstracts and PMC full-text articles by integrating nested named entity recognition, ontology-guided identifier assignment, full-text relation extraction, and relation-resolution strategies. A gold-standard corpus of 500 PubMed abstracts and 8 PMC full-text articles was manually annotated for nested biomedical entities across six entity types. The resulting models were applied to 30,128,068 PubMed abstracts and 1,385,500 IID-related PMC full-text articles. A unified IID ontology was developed from 411,341 disease terms using hierarchical text classification, large language model-based refinement, ontology cross-referencing, and expert review, yielding 179,657 confirmed MeSH mappings. The final IID KG contains approximately 1,837,513 unique entities and 16,295,390 unique relations across eight relation types. The resource was released publicly together with repurposing workflows, supporting ontology-aligned literature mining, disease mechanism analysis, and drug-repurposing hypothesis generation for IID research.
bioinformatics2026-05-26v1Prioritizing peptides for targeted mass spectrometry experiments using deep learning
Sonthalia, S.; Dasgupta, P.; Hsu, C.; Wen, B.; MacCoss, M. J.; Noble, W. S.Abstract
One critical step in any targeted mass spectrometry experiment is selecting, from each protein of interest, a small number of peptides that respond well in the mass spectrometer and can serve as reliable proxies for protein quantification. Existing methods select target peptides either by relying on prior empirical measurements, limiting their applicability to previously observed peptides, or using machine learning to predict peptide behavior from sequence alone. However, current machine learning tools suffer from various limitations, including using detectability as an indirect proxy for intensity, relying on small training sets, or ignoring the precursor charge state. In this study, we introduce Bromo, a transformer-based deep learning model that ranks peptide precursors from a given protein by their relative response, taking charge state into account. Trained on millions of annotated peptide pairs derived from large-scale, publicly available data-independent acquisition mass spectrometry data, Bromo consistently outperforms existing sequence-based methods across diverse, independent datasets. Furthermore, we show that fine-tuning Bromo on experiment-specific data can account for differences in sample preparation, sample matrix, and instrument platform, all of which influence which peptides serve as optimal targets. This adaptability makes Bromo a practical tool for selecting target peptides for selected reaction monitoring and parallel reaction monitoring assay development across a wide range of experimental conditions.
bioinformatics2026-05-26v1Faithful Supervised Dimensionality Reduction for Biomedical Data via Decision Geometry
Wang, Z.; Zhou, Z.; Zhan, Q.; Shen, L.Abstract
Unsupervised dimensionality reduction methods aim to preserve intrinsic data geometry by maintaining local neighborhoods and approximate global relationships in low-dimensional embeddings, but they do not use label information and therefore may fail to reflect task-relevant class structure in biomedical and health applications. Supervised dimensionality reduction (SDR) incorporates labels to improve class organization, yet existing approaches often face a trade-off between discrimination and geometric faithfulness. Linear supervised methods are stable and interpretable but are limited in their ability to capture nonlinear structure, whereas many nonlinear methods impose supervision directly in the embedding space, which can over-separate classes and distort the underlying manifold. In biomedical applications, labels such as cell types in single-cell data or patient status in clinical cohorts provide meaningful biological signal, and supervised dimensionality reduction can use this information to produce more informative low-dimensional representations. Here we propose a new framework, DG-UMAP (Decision-Geometry UMAP), for faithful supervised dimensionality reduction via decision geometry. We first fit a classifier in the original feature space and use its boundary-local decision geometry to construct a low-rank metric deformation that emphasizes discriminative directions while limiting geometric distortion. Parametric UMAP is then applied to the transformed space, so supervision acts through the ambient geometry rather than by directly forcing class separation in the embedding. Across synthetic and multiple real-world biomedical datasets, our method yields embeddings with improved agreement with class structure and global organization while preserving local neighborhood quality.
bioinformatics2026-05-26v1SynFit: Synergistic Contrastive Learning for Multi-Objective Protein Fitness Prediction and Optimization
Tu, T.; Huang, W.; Li, Z.; Ding, K.; Yang, Y.; Luo, Y.Abstract
Proteins function through a complex interplay of structural and biochemical properties, and mutations can reshape these properties to generate fitness landscapes spanning multiple functional objectives. A central challenge in protein engineering is the need to simultaneously optimize multiple properties. In biocatalysis, for example, practical enzyme development routinely requires the concurrent optimization of catalytic activity, selectivity, stability, and substrate generality. However, despite recent advances in computational protein design and fitness prediction, most existing approaches treat these properties independently and do not explicitly capture the dependencies and trade-offs that govern real-world protein performance. We present SynFit, a multi-objective learning framework that integrates pretrained protein language models with experimental fitness measurements for protein fitness prediction and engineering. SynFit learns both shared and property-specific protein sequence representations through a synergistic contrastive learning strategy, enabling the identification of variants that simultaneously optimize multiple functional properties. Across a large-scale multi-fitness deep mutational scanning benchmark, SynFit consistently outperforms state-of-the-art supervised models trained on individual objectives and more accurately identifies variants that balance competing functional constraints. We further applied SynFit to multi-objective enzyme design for a new-to-nature biocatalytic enantioselective borylation reaction, providing a diverse array of novel cytochrome c sextuple variants in a single round of design with simultaneously improved catalytic activity and enantioselectivity that rival the best variants obtained through directed evolution. 
Together, these results establish SynFit as a general framework for multidimensional protein fitness prediction and highlight its potential to enable efficient multi-objective optimization in protein engineering, particularly in biocatalysis.
bioinformatics2026-05-26v1Gene-Specific Analysis of Clonal Hematopoiesis Identifies ASXL1 as a Risk Factor for Lung Cancer
Zhang, Z.; Dong, J.; Huang, Y.; Liu, Y.; Amos, C. I.; Cheng, C.Abstract
Clonal hematopoiesis of indeterminate potential (CHIP) is a recognized risk factor for hematologic malignancies, but its contribution to different types of solid cancers remains incompletely defined. Here, we performed a systematic, gene-specific analysis of CHIP across 19 common solid cancer types using two large population-based cohorts, the UK Biobank and All of Us. Using Cox proportional hazards models and nested case-control logistic models, we demonstrate that the relationship between CHIP and solid tumors is highly cancer-type specific, with lung cancer exhibiting the strongest association. In lung cancer, this association is largely driven by ASXL1-mutant clones. Specifically, high variant allele fraction (high-VAF) ASXL1 conferred a significantly increased risk (hazard ratio = 3.2), and the associations remained robust after adjustment for age, sex, body mass index (BMI), smoking status, and genetic ancestry. Notably, ASXL1 CHIP was substantially enriched among smokers, and its association with lung cancer risk was restricted to ever-smokers, highlighting a key interaction between CHIP and environmental exposure. The enrichment of ASXL1 CHIP in lung cancer was further validated in two independent cancer-only cohorts, MSK-IMPACT and TCGA. In addition, rare germline variant association analysis revealed that germline variation in ASXL1 had the strongest association with lung cancer susceptibility among all solid tumors. Collectively, our findings support a model in which smoking-associated expansion of ASXL1-mutant clones contributes to lung cancer development and suggest that gene-specific CHIP metrics may enhance risk stratification and early detection strategies.
bioinformatics2026-05-26v1Application of Computer Vision Tools to Maize Genomic Data for Trait Prediction and Gene Discovery
Higgins, S. A.; Anible, E.; Muthupari, M.; Dibble, C.; Murdoch, R. W.Abstract
Artificial intelligence and machine learning for computer vision (CV) and image recognition is a rapidly evolving field with multiple potential applications in plant genomics. While CV has been widely adopted by the research community for plant phenotyping and disease surveillance, applications of CV tools to plant genome analysis are underrepresented. CV tools may complement traditional statistical classification tools used in plant genomics, since CV perceives problems holistically rather than granularly (in terms of pattern recognition), which is particularly applicable to analysis of large, complex eukaryotic genomes. In this study, we report on a new strategy to apply existing CV tools to classify plant genotypes and predict genotype-phenotype relationships. A technique was developed for converting maize genome resequencing data into a set of images reminiscent of a quick response (QR) code. Several hundred maize genomes were processed and it was demonstrated that CV models can successfully categorize genome images into heterotic groups (accuracy and recall > 0.8). Models for classifying genome images into phenotypic trait groups (such as short, medium, and high plant height) performed with moderate success for the most heritable trait analyzed (ear height; accuracy and recall > 0.5). Querying model results permitted identification of genome regions that were important for model classification predictions. The CV model results revealed enriched metabolic pathways consistent with traits under consideration. Overall, our initial application of CV tools to plant genome analysis highlights its applicability to genomic data. Design of new CV architectures optimized for genome-derived images may further improve upon our initial results generated using only off-the-shelf CV tools optimized for unrelated image analysis tasks.
bioinformatics2026-05-26v1Pathogen-specific antimicrobial activity prediction with biological large language model-based methods
Ucar, B.; Demirsoy, E.; Salehi, A.; Sutherland, D.; Yanai, A.; Coombe, L.; Thompson, V. C.; Warren, R. L.; Helbing, C. C.; Birol, I.Abstract
Driven by the rise of antimicrobial resistance, antimicrobial peptides (AMPs) have emerged as promising therapeutics capable of targeting multidrug-resistant pathogens. Because identifying AMPs and their specific targets requires costly and labor-intensive wet-lab experiments, in silico methods to prioritize candidates are highly valuable. However, current computational methods often lack pathogen specificity or fail to incorporate crucial targeted proteomic and genomic contexts. To bridge this gap, we developed triAMPh, a robust, zero-shot framework for pathogen-specific peptide bioactivity prediction. triAMPh integrates a heterogeneous graph attention network-based link predictor (HLP), Extreme Gradient Boosting, and a multilayer perceptron trained on features from biological large language models (bLLMs). Our novel HLP constructs a knowledge graph that maps peptides and pathogens as distinct nodes, connected by similarity and bioactivity edges. The model extracts information through semantic traversals, prioritizing neighboring nodes and their biological contexts. Benchmarking shows that triAMPh provides unbiased, peptide- and pathogen-centered zero-shot predictions, matching or outperforming state-of-the-art methods across all metrics except precision. Ultimately, triAMPh offers a powerful computational tool to accelerate wet-lab AMP discovery while demonstrating the capability of bLLMs to capture complex, pathogen-specific bioactivity patterns.
bioinformatics2026-05-26v1Decoding Multicellular Communication Motifs from Spatial Transcriptomics with ALARMIST
Fan, J.; Hood, J.; Strong, J.; Quinn, J. F.; Dai, Y.; Data Science TeamLab, ; Schein, A.; Yu, K. K. H.; Tansey, W.Abstract
Cellular organization is driven by recurrent, coordinated interactions between multiple cell types, each sending and receiving multiple signals. Existing computational methods for spatial profiling data consider only individual ligand-receptor interactions and fail to capture the higher-order interactions governing the tissue microenvironment. To address this gap, we developed ALARMIST (Assessment of Ligand And Receptor Motifs And Impacts in Spatial Transcriptomics), a probabilistic framework that infers interpretable multicellular communication patterns from spatial data. ALARMIST decomposes neighborhood-level signaling patterns into motifs: recurrent communication subnetworks involving multiple cell types and sets of enriched ligand-receptor interactions. For each cell, ALARMIST identifies its active motifs and estimates the downstream phenotypic effects of each motif on active cells. We applied ALARMIST to spatial datasets of lung adenocarcinoma (LUAD) and glioblastoma (GBM) to identify microenvironmental drivers of tumor progression. In paired LUAD and adenocarcinoma-in-situ (AIS) samples, ALARMIST identified an immune-active vascular motif at the tumor-normal boundary and implicated motif-active plasmacytoid dendritic cells as drivers of inflammation in early carcinogenesis. In matched low- and high-grade glioma samples, ALARMIST identified a hub-and-spoke motif centered on a malignant macrophage subpopulation, implicating a GRN-SORT1 signaling axis with a downstream impact gene set predictive of survival in low-grade glioma patients. Code for ALARMIST is available at https://github.com/tansey-lab/alarmist.
bioinformatics2026-05-26v1GAE-Δ: A Graph-Learning Framework for Gene Network Rewiring and Clinical Outcome Prediction from Multi-Omics Data
Tang, Z.; Chen, Z.; Chen, M.; Wang, Y.; Ennis, S.; Niranjan, M.; Ewing, R.Abstract
Cancer progression and outcomes are driven in part by changes to molecular networks that result from genetic and/or environmental perturbations. These network changes manifest across multiple interconnected network layers and include accumulation of somatic mutations, altered protein-protein interactions and dysregulated gene expression. Here we describe a graph autoencoder-based framework, Graph Autoencoder-Delta (GAE-Δ), for characterizing phenotype-specific gene role shifts across multiomics data. Given samples stratified into two contrasting phenotypic groups and a prior gene interaction network, GAE-Δ constructs group-specific gene graphs for each omics modality and trains, for each modality, a single graph autoencoder jointly on both group graphs, so that the two group-conditional embeddings share a common latent space. Contrasting these embeddings defines a multiomics embedding-shift representation for each gene that reflects how its network role reorganizes across phenotypic contexts. These gene-level shifts are subsequently used for unsupervised gene prioritization, multiomics late fusion and sample-level classification. Applied to five TCGA cancer types with a survival endpoint, GAE-Δ achieves competitive or superior predictive performance compared with classical network-based methods and multiomics matrix factorisation methods (MOFA+, iNMF), with statistically significant AUC gains over MOFA+ in three of five cohorts and statistical ties on the remaining two. Beyond predictive performance, the consensus shift genes are significantly enriched for known cancer drivers in three of five cohorts (hypergeometric p < 0.01; 11-17-fold enrichment), whereas matrix factorisation baselines reach p < 0.05 in zero of five cohorts (best per cancer p = 0.06), indicating that GAE-Δ captures biological signal that linear factor methods miss.
In summary, the GAE-Δ approach provides both improved outcome classification and biological and mechanistic discovery through deep network-based integration of disease-associated multi-omics data.
bioinformatics2026-05-26v1Vision-Based Genomic Model for Copy Number Variant Pathogenicity Prediction
Buralkin, I.; Botas, J.; Chang, K.-L.; Deng, Y.; Papastathopoulos-Katsaros, A.; Liu, Z.; Park, J.Abstract
Copy number variants (CNVs) are a major class of structural genomic alterations underlying rare disease, including neurodevelopmental delay and intellectual disability, yet predicting their pathogenicity remains challenging. Existing methods reduce CNVs to region-level numerical features, discarding the positional structure and cross-track patterns that expert clinical reviewers use to interpret genomic evidence. To address this, we introduce Tesseract for CNV, a track-based spatial representation for CNV pathogenicity prediction, which represents each variant as a base-pair-resolution multi-track image and models spatial genomic patterns across annotation tracks while preserving positional structure and cross-track dependencies. Trained on a chromosome-level hold-out split of the ClinVar dataset, Tesseract outperforms prior methods on held-out and curated noncoding benchmarks, improving AUROC by up to 0.10 over the state-of-the-art baseline. On the independent DECIPHER cohort, the model demonstrates generalizability by maintaining the highest AUROC and the highest F1 score across baselines. Furthermore, the model localizes pathogenic signals to clinically meaningful genomic subregions, providing track-annotated evidence that supports practical clinical interpretation.
bioinformatics2026-05-26v1Benchmarking sequence performance on the DNBSEQ-T7 using Genome in a Bottle reference genomes
van Coller, A.; Taukobong, S.; Malima, M.; Ghoor, S.; Nangammbi, N.; Roode, E.; Naicker, M.; Cole, V.; Glanzmann, B.; Kinnear, C.; Carstens, N.Abstract
Advances in sequencing technologies have improved the accuracy, throughput, and completeness of human genome characterization, enabling more reliable detection of genetic variation. Well-characterized reference genomes are critical for benchmarking sequencing platforms and bioinformatics analysis pipelines. Here, we present whole genome sequencing datasets generated for the Ashkenazi Jewish trio reference samples from the Genome in a Bottle Consortium. Libraries were prepared using three distinct MGI-based workflows: PCR-free library preparation, FastFS DNA library preparation, and Universal DNA library preparation. Sequencing was performed on the MGI DNBSEQ-T7 platform, generating a minimum of 400 million paired-end reads per sample, corresponding to 30X mean genome coverage. Raw reads were processed using a standardized GATK bioinformatics workflow. Sequencing performance and variant detection accuracy were evaluated using the Genome in a Bottle high-confidence benchmark variant sets. All workflows demonstrated high sequencing quality and concordance with GIAB benchmark truth sets, with PCR-free libraries showing the strongest indel calling performance and lowest Mendelian violation rates across the Ashkenazi trio. This dataset provides a resource for benchmarking DNBSEQ-T7 sequencing and bioinformatics workflows, and for evaluating the impact of library preparation strategies on whole genome variant detection performance.
bioinformatics2026-05-26v1CoSTAR: Coarse Stem-Topology Alignment of Pseudoknotted RNA Structures by Relation-Constrained Search
Archinuk, F.; Jabbari, H.Abstract
RNA structural alignment is a central task in comparative RNA analysis, but many efficient methods achieve tractability by restricting the class of admissible structures, often excluding pseudoknots. This exclusion is limiting for viral and regulatory RNAs, where conserved structure can remain informative even when sequence conservation is weak. We introduce a coarse RNA structural alignment algorithm that aligns secondary structures by searching over partial maps between stems rather than nucleotides. Each input structure is decomposed into stems, annotated with nucleotide-level features, and encoded by pairwise topological relations among stems. Alignment is formulated as a cost-minimizing partial stem map with skip operations, and the search tree is pruned by RNA-specific directionality and topological constraints derived from already aligned stems. For the stated cost function and over the class of injective, direction-preserving, topologically consistent stem maps, the search is exact. This shifts the dominant computational dependence from sequence length to the number and arrangement of stems. We evaluated the method on 2100 pairwise alignments sampled from seven Rfam families spanning 40-224 nucleotides and 2-15 stems. Across these benchmarks, the algorithm returned terminal coarse alignments in which every stem was either matched or skipped. We measured running time and search-tree width to characterize performance on diverse family-to-family comparisons. The experiments also show that ordering the input structures affects efficiency: using the structure with more stems as the search-driving structure reduces tree width. The resulting partial stem map is directly interpretable for RNA annotation and can be projected to nucleotide resolution for downstream sequence-structure analysis. The code for CoSTAR is available at: https://github.com/TheCOBRALab/CoSTAR
bioinformatics2026-05-26v1How flat is your sample? An opportunistic survey of 3D tilt in public fluorescence microscopy data
Brocard, J.Abstract
Sample planarity is rarely monitored in fluorescence microscopy quality control, yet focal plane deviations across the field of view are a potential source of measurement error. Here I describe FlatStat, a tool that estimates sample tilt automatically from any 3D fluorescence stack, without prior knowledge of sample content, by fitting a plane to the Z-map of maximum intensity. Applied to an Argolight calibration slide and biological samples on a laser-scanning confocal system, FlatStat yielded reproducible slope and direction measurements attributable to the instrument rather than the sample. To establish community reference values, FlatStat was extended to Python and applied opportunistically to 1204 image stacks from 22 projects in the Image Data Resource, yielding 4670 tilt measurements. Slopes spanned several orders of magnitude across projects; inter-channel coherence confirmed that measured tilt reflects physical stage and mounting geometry rather than channel-specific biological topography. Unfortunately, instrument and sample preparation metadata were largely absent from the corpus, limiting causal inference. Finally, controlled tilt experiments on fluorescent beads showed that chromatic shift increased modestly with tilt (~57 nm over the full range tested), while lateral and axial resolutions were essentially unaffected.
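The plane-fitting step described in this abstract can be illustrated with a minimal sketch. It assumes a stack indexed as (Z, Y, X), takes the per-pixel Z index of maximum intensity as the Z-map, and fits z = a·x + b·y + c by ordinary least squares. The function name `estimate_tilt`, the `voxel_size_zyx` parameter, and the absence of any filtering or outlier handling are assumptions made for illustration; this is not FlatStat's actual code.

```python
import numpy as np

def estimate_tilt(stack, voxel_size_zyx=(1.0, 1.0, 1.0)):
    """Estimate slope and direction of a plane fitted to the max-intensity Z-map.

    stack: 3D array indexed as (Z, Y, X). Hypothetical helper, not FlatStat itself.
    """
    dz, dy, dx = voxel_size_zyx
    # Z-map: for every (y, x) pixel, the Z position of the brightest voxel.
    z_map = np.argmax(stack, axis=0) * dz
    ny, nx = z_map.shape
    yy, xx = np.mgrid[0:ny, 0:nx]
    # Design matrix for z = a*x + b*y + c, in physical units.
    design = np.column_stack(
        [xx.ravel() * dx, yy.ravel() * dy, np.ones(z_map.size)]
    )
    coeffs, *_ = np.linalg.lstsq(design, z_map.ravel(), rcond=None)
    a, b, _c = coeffs
    slope = float(np.hypot(a, b))                    # rise per unit lateral distance
    direction_deg = float(np.degrees(np.arctan2(b, a)))  # direction of steepest ascent
    return slope, direction_deg

# Synthetic check: a bright sheet lying on the plane z = x + 2.
ys, xs = np.mgrid[0:16, 0:16]
demo_stack = np.zeros((20, 16, 16))
demo_stack[xs + 2, ys, xs] = 1.0
slope, direction_deg = estimate_tilt(demo_stack)
```

On real data the Z-map is noisy where the signal is weak, so a practical version would likely mask dim pixels or use a robust fit; those choices are left out here.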
bioinformatics2026-05-26v1Multi-Algorithm Machine Learning Benchmarking for Pan-Cancer Classification from Tumour-Educated Platelet RNA Sequencing
Ray, S.; Zalawadia, D. H.; Bhate, V.; Chakravarthy, T. D.; Chetty, A. G.Abstract
Tumour-educated platelets (TEPs) carry cancer-type-specific RNA signatures accessible through whole-blood RNA sequencing, but systematic multi-algorithm benchmarking with quantified statistical uncertainty had not been applied to the GSE68086 dataset. We applied an end-to-end transcriptomic and machine learning framework to 280 whole-blood platelet RNA-seq samples from six cancer types (non-small cell lung cancer, colorectal cancer, glioblastoma multiforme, hepatobiliary cancer, breast cancer, and pancreatic cancer) and healthy donors. After a standardised preprocessing and normalisation pipeline, seven supervised classifiers (Logistic Regression, SVM (RBF), XGBoost, LightGBM, Random Forest, K-Nearest Neighbours, and a Multilayer Perceptron) were benchmarked using stratified 5-fold cross-validation and a held-out test set. Statistical uncertainty was quantified via 2,000-resample percentile bootstrap confidence intervals. Multinomial Logistic Regression achieved the highest test macro F1-score (0.522) and macro-averaged ROC-AUC (0.869), both substantially above the seven-class chance level (1/7 ≈ 0.14). SHAP analysis of the Random Forest classifier identified IFITM3 as the globally dominant TEP biomarker; cancer-type-specific discriminators included ATP5PD (hepatobiliary cancer), C6orf62 (NSCLC and pancreatic cancer), VPS13C (healthy donors), and TMSB4Y (breast cancer). Gene Ontology and KEGG pathway enrichment corroborated the biological specificity of identified transcriptomic signatures. These results support the diagnostic potential of TEP transcriptomics as a multi-class liquid biopsy platform and provide a methodologically transparent, reproducible reference framework for future blood-based cancer classification studies.
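For readers unfamiliar with percentile bootstrap intervals on a test-set metric, the following is a minimal standard-library sketch. The function names, the toy labels, the fixed seed, and the order-statistic approximation to the percentiles are assumptions for illustration only; the abstract does not describe the authors' implementation, and the labels below are not data from the study.

```python
import random

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the labels present in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def percentile_bootstrap_ci(y_true, y_pred, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample test samples with replacement,
    recompute the metric, and read off the alpha/2 and 1 - alpha/2 order statistics."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(macro_f1([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * (n_resamples - 1))]
    upper = stats[int((1 - alpha / 2) * (n_resamples - 1))]
    return lower, upper

# Toy labels for illustration only.
toy_true = ["a", "a", "b", "b", "c", "c", "a", "b", "c", "a"]
toy_pred = ["a", "b", "b", "b", "c", "a", "a", "b", "c", "c"]
point_estimate = macro_f1(toy_true, toy_pred)
ci_lower, ci_upper = percentile_bootstrap_ci(toy_true, toy_pred)
```

Note that a class missing from a given resample simply drops out of that resample's average here; fixing the label set across resamples is an equally defensible choice and would change the interval slightly.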
bioinformatics2026-05-26v1GenesetDiseaseDrugNetwork (GDDN): a web server for disease enrichment and drug prioritization
More, P.; Fontaine, J.-F.; Ten Cate, V.; Wild, P. S.; Andrade-Navarro, M. A.Abstract
Summary: Omics technologies profile thousands of genetic and molecular features to provide a comprehensive and quantitative measure of the cellular state. Transcriptomics and proteomics, in particular, have guided discoveries of the most important biomarkers and therapeutic targets. By virtue of ongoing developments in single-cell and spatial technologies, the fields of targeted therapeutics and personalized medicine are rapidly advancing. However, downstream functional analysis and disease association remain daunting tasks in bioinformatics. We address these challenges with the GenesetDiseaseDrugNetwork (GDDN) web server. GDDN facilitates functional discovery by connecting gene sets to enriched diseases and their corresponding therapeutics in a single step. Using a ranking system that incorporates regulatory impact, specificity, and potency, GDDN effectively prioritizes drugs with the highest clinical relevance. Our platform facilitates the interpretation of omics outputs into disease associations and personalized drug identification. Availability and Implementation: The GDDN web server is implemented in R Shiny and is freely accessible at https://cbdm-01.zdv.uni-mainz.de/shiny/piyusmor/GDDN/, supporting all major web browsers.
bioinformatics2026-05-26v1misosoup: A metabolic modeling tool for identifying minimal microbial communities provides valuable insights into microbial ecology and biotechnological applications
Ochsner, N.; San Roman, M.; Jimenez-Fernandez, A.; Bonhoeffer, S.; Pascual-Garcia, A.Abstract
Microbial survival and function often depend on metabolic interactions within communities. Therefore, a central question in disentangling microbial organization is determining which minimal groups of species are able to thrive in a given medium--referred to as 'minimal communities'. Answering this question is essential for understanding microbial distribution, enhancing laboratory cultivation, and designing synthetic communities (SynComs). Here, we introduce misosoup, a Python package for identifying minimal communities (MInimal Supplying cOmmunity Search). Through genome-scale constraint-based metabolic modeling, misosoup enables the systematic identification of communities that support microbial growth in environments where individual species fail to survive alone. We validate misosoup against experimentally verified minimal communities, demonstrating its ability to predict known cooperative interactions, cocultures, and consortia with biotechnological potential. We further illustrate the use of misosoup to investigate broad microbial ecology questions by applying it to a set of 60 marine microbes, revealing pervasive cross-feeding-driven niche expansion and showing how the detailed outputs provided by misosoup facilitate research on topics such as the identification of functional groups. In summary, misosoup provides a powerful tool for microbial ecology and community design, with potential applications in both research and biotechnological innovation.
bioinformatics2026-05-25v2Read-Consistent Minimum Unique Substrings: A Parameter-Free, Linear-Time Framework for Genomic Sequence Representation
Adu, A. F.; Menkah, E. S.; Amoako-Yirenkyi, P.; Pandam Salifu, S.Abstract
Fixed-length k-mers have been the standard unit of genomic sequence representation for over two decades. However, they impose a uniform resolution on genomes whose complexity varies across loci. We introduce Minimum Unique Substrings (MUSs), variable-length sequence units defined by the local uniqueness structure of the genome rather than predefined parameters. We first extend MUS theory from single contiguous strings to fragmented sequencing reads by formalizing a definition of uniqueness that is consistent with these reads. Next, we present a linear-time extraction algorithm that runs in O(n) time using the generalized suffix tree. In this context, we introduce outpost nodes, topological anchors within the suffix tree that accurately localize MUS boundaries in fragmented sequencing reads. Finally, we empirically characterize the distributions of MUS lengths in E. coli K-12 and human chromosome 11. Our results demonstrate that MUS lengths naturally mirror genomic architectural complexity without the need for user-defined parameters. Notably, the MUS framework achieves 100% unique positional coverage with a mean length of only 36.08 bp. In contrast, fixed-length k=61 coverage reaches only 69.4%, despite being 1.69 times the MUS average. We show that increasing k from 21 to 61 triples the unique k-mer count from 2.35M to 6.86M. This k-paradox occurs because repetitive sequences are fragmented into spuriously unique tokens without improving true genomic resolution. MUSs escape this artifact entirely by adapting dynamically to local sequence complexity. These results establish MUSs as a biologically grounded, computationally tractable foundation for parameter-free genome assembly, repeat characterization, and alignment-free genomics.
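To make the notion of a minimal unique substring concrete, here is a brute-force sketch on a single string. It uses the common textbook definition (a substring that occurs exactly once and whose proper substrings all occur more than once), which is assumed here rather than taken from the paper; it is not the linear-time suffix-tree algorithm the abstract describes, and it does not model the read-consistent extension. The function names are hypothetical.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, allowing overlaps."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count

def minimal_unique_substrings(text):
    """Return (start, substring) pairs for every minimal unique substring.

    Brute force, far from linear time; for illustration on short strings only.
    """
    result = []
    n = len(text)
    for i in range(n):
        for j in range(i + 1, n + 1):
            candidate = text[i:j]
            if count_occurrences(text, candidate) == 1:
                # candidate is the shortest unique substring starting at i, so
                # dropping its last character gives a repeated (or empty) string.
                # It is minimal if dropping its first character does too.
                inner = text[i + 1:j]
                if inner == "" or count_occurrences(text, inner) > 1:
                    result.append((i, candidate))
                break  # longer substrings starting at i contain this one
    return result

mus_banana = minimal_unique_substrings("banana")
```

For "banana" this yields "b" at position 0 and "nan" at position 2: the unit length adapts to how repetitive the surrounding sequence is, which is the property the abstract contrasts with fixed-length k-mers.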
bioinformatics2026-05-25v2HiCPotts: An R/Bioconductor package to identify significant interactions in chromosome conformation capture data and model sources of biases.
Osuntoki, I. G.; Harrison, A. P.; Dai, H.; Bao, Y.; Zabet, N. R.Abstract
Motivation: Chromosome Conformation Capture methods, including Hi-C, micro-C or Capture-C, are used to map chromatin interactions genome-wide. Most existing computational methods do not account for sources of bias (such as DNA accessibility, GC content or TE content) in the data. Results: We previously developed ZipHiC, a Bayesian method based on the hidden Markov random field (HMRF) model and Approximate Bayesian Computation (ABC) that uses a zero-inflated Poisson distribution to model the noise, signal and false signal of the data, and showed that this approach was able to detect biases from DNA accessibility, GC content and TE content in both Hi-C and micro-C data. Here, we present HiCPotts, another Bayesian method based on the HMRF model and ABC that instead uses a zero-inflated Negative Binomial distribution to model the noise and signal of the data. We systematically show that HiCPotts reduces false positives and increases recovery of true interactions compared to ZipHiC, and also compared to other methods such as FastHiC, Juicer and HiCExplorer. Most importantly, we provide an R/Bioconductor package that allows modelling the noise, signal and false signal using various distributions such as the zero-inflated Negative Binomial (ZINB) and the zero-inflated Poisson (ZIP) distributions. Availability: https://bioconductor.org/packages/HiCPotts/
bioinformatics2026-05-25v1MSLipidMapper: a pathway-centered lipidome analysis environment linking lipid class, acyl-chain subsets, and multi-omics data
Oka, T.; Nishida, K.; Harayama, T.; Tsugawa, H.Abstract
Lipids exhibit extensive structural diversity arising from variation in lipid classes, subclasses, and acyl-chain compositions, making systematic interpretation of lipidomics data challenging. Although untargeted lipidomics enables the quantification of hundreds to thousands of lipid molecular species, downstream analyses often treat pathway-level summaries, molecular-species visualization, structural subsetting, and multi-omics interpretation as separate steps. Here, we present MSLipidMapper, an R/Shiny-based lipidomics data exploration environment for pathway-centered and structure-aware analysis of annotated lipidomics datasets. MSLipidMapper reconstructs annotated lipid peak tables as Bioconductor SummarizedExperiment objects, thereby organizing quantitative lipid abundance values, sample metadata, lipid subclass annotations, and parsed acyl-chain features within a unified data structure. Lipid molecular species are summarized on static, curated lipid metabolic pathway maps at the subclass level while retaining direct links to the underlying molecular species and acyl-chain annotations. This design enables users to inspect molecular-species patterns underlying each pathway node, define lipid subsets based on structural features such as specific acyl chains, and re-project these subsets onto the same pathway context. Gene or protein expression data can also be overlaid on pathway-associated reactions to support multi-layer interpretation of lipid metabolism. The program is showcased using publicly available aging lipidome datasets of mice, illustrating how subclass-level pathway summaries can be connected to molecular-species heatmaps, acyl-chain-defined subsets, and transcriptome or proteome information.
bioinformatics2026-05-25v1Cell-type-specific transposable element transcription tracks symbiosis and calcification programs in the reef-building coral Acropora hemprichii
Zhong, H.; Konciute, M. K.; Hu, J.; Menzies, J.; Cui, G.; Aranda, M.Abstract
Transposable elements (TEs) are pervasive components of eukaryotic genomes and major drivers of genome evolution, yet their contribution to cell-type-specific regulatory landscapes remains poorly understood, particularly in non-model marine invertebrates. Here, we integrated single-cell RNA sequencing with pseudo-aligned TE expression profiling to examine how TE transcription relates to cell type identity in the reef-building coral Acropora hemprichii. We constructed a cell atlas comprising 4,716 cells across eight major cell types. Notably, TE expression alone was sufficient to accurately resolve all major cell types, indicating that cell-type-specific transcriptional states are robustly reflected in TE activity patterns. We identified 9,759 expressed TEs, of which 333 exhibited strong cell-type-specific activity. These differentially expressed TE features were associated with nearby expressed genes and transcription factor loci, suggesting a relationship between cell-type-specific TE activity and local gene regulatory programs. Genes associated with cell-type-specific TEs were enriched for core coral physiological processes, including calcification, metabolite transport, and symbiosis-related functions. Together, these findings indicate that TE transcription is structured along coral cell-type identity and physiological specialization. Our study provides a single-cell-resolved framework for investigating TE-gene relationships in early-diverging metazoans and a community resource for future functional interrogation in reef-building corals.
bioinformatics2026-05-25v1