Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
Anchors for Homology-Based Scaffolding
Kaether, K.-K.; Gatter, T.; Lemke, S.; Stadler, P. F.
Abstract
Homology-based scaffolding orders contigs based on conserved collinearity of homologous sequences across related species. Existing methods often rely on costly whole-genome alignments or show limited robustness when integrating multiple references. Here, we introduce an anchor-based scaffolding framework that adapts synteny anchors to efficiently infer contig order and orientation relative to one or more reference genomes. Our approach leverages precomputed, sufficiently unique anchors and their respective high-confidence homology matches in a greedy approach, combining single-reference scaffolds into multi-reference scaffolds using a maximum matching. Across simulated and real datasets, anchor-based scaffolding achieves accuracy comparable to state-of-the-art methods. Notably, the approach shows particular strengths in multi-reference settings. These results demonstrate that synteny-anchor-based scaffolding provides an additional tool for homology-based scaffolding with robust accuracy and superior performance in multi-reference scenarios.
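As a rough illustration of the ordering step described in this abstract, the Python sketch below places contigs along a single reference from anchor matches. It is a simplified stand-in and not the authors' implementation: the tuple layout `(contig, ref_pos, strand)`, the function name `order_contigs`, the median-position ordering, and the majority-vote orientation are all assumptions made for this example, and the multi-reference maximum-matching step is not shown.

```python
from collections import defaultdict
from statistics import median


def order_contigs(anchors):
    """Order and orient contigs along one reference (illustrative only).

    anchors: iterable of (contig, ref_pos, strand), where strand is +1 if the
    anchor matches the reference in the same orientation and -1 otherwise.
    This input layout is hypothetical.
    """
    by_contig = defaultdict(list)
    for contig, ref_pos, strand in anchors:
        by_contig[contig].append((ref_pos, strand))

    placed = []
    for contig, hits in by_contig.items():
        # Median reference position is robust to a few spurious anchors.
        center = median(pos for pos, _ in hits)
        # Majority vote over anchor strands decides the orientation.
        orientation = "+" if sum(s for _, s in hits) >= 0 else "-"
        placed.append((center, contig, orientation))

    placed.sort()
    return [(contig, orientation) for _, contig, orientation in placed]


anchors = [
    ("ctgB", 9000, -1), ("ctgB", 8700, -1),
    ("ctgA", 1000, 1), ("ctgA", 1250, 1),
    ("ctgC", 5000, 1), ("ctgC", 5200, 1), ("ctgC", 5100, -1),
]
scaffold = order_contigs(anchors)
print(scaffold)
```

The median and the vote are chosen here only because they tolerate a minority of conflicting anchors, such as the single reverse-strand hit on `ctgC`.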
bioinformatics 2026-05-28 v4

Blender tissue cartography: an intuitive tool for the analysis of dynamic 3D microscopy data
Claussen, N. H.; Regis, C.; Wopat, S.; Lefebvre, M. F.; Streichan, S. J.
Abstract
Volumetric microscopy can image complex 3D tissues, but 3D image data remains difficult to visualize and quantify. Many biological systems are organized as thin, curved sheets (for example, epithelia). Tissue cartography extracts and cartographically projects these curved surfaces from volumetric images. This converts 3D into 2D image data, greatly facilitating visualization, analysis, and computational processing. Existing tools, however, demand advanced coding expertise and are limited to simple tissue geometries. Here, we present Blender tissue cartography (btc), an interactive add-on for the 3D editor Blender that makes tissue cartography user-friendly through a graphical interface, and handles complex biological shapes using powerful computer graphics algorithms. An accompanying Python library supports faithful 3D measurements in 2D cartographic projections and custom analysis pipelines. Time-lapse data can be batch-processed by algorithmically aligning all time points to a single "key frame". We demonstrate btc on diverse and complex tissue shapes from Drosophila, stem-cell organoids, Arabidopsis, and zebrafish. btc enables quantitative cartographic analysis of complex 3D tissues, broadening access to methods previously restricted to specialists, while leveraging tools from computer graphics to unlock new capabilities.
bioinformatics 2026-05-28 v3

Unimeth: A unified transformer framework for accurate DNA methylation detection from nanopore reads
Wang, S.; Xiao, Y.; Sheng, T.; Huang, N.; Shu, Y.; Zhai, J.; Luo, F.; Ni, P.
Abstract
Nanopore sequencing enables direct detection of DNA modifications from native DNA. However, accurate methylation calling across species, sequence contexts, modification types and chemistries remains challenging. We present Unimeth, a transformer-based framework that jointly processes raw signals and basecalled sequences in read patches and predicts all target methylation sites within each patch. Unimeth uses a three-phase training strategy that combines signal pre-training, methylation fine-tuning and site-level calibration using methylation frequency information. We evaluated Unimeth for 5mC and 6mA detection using public and in-house datasets spanning 14 species, three nanopore chemistries and wild-type, mutant and enzyme-treated samples. Unimeth improved plant 5mC detection in non-CpG contexts, reduced false-positive calls in low-methylation samples and maintained high 5mCpG performance in mammalian datasets. For 6mA, Unimeth reduced background calls while preserving signals for Fiber-seq nucleosome and gene-level analyses. Unimeth provides a unified framework for nanopore-based methylation detection across methylation types and biological contexts.
bioinformatics 2026-05-28 v3

Data Representation Bias and Conditional Distribution Shift Drive Predictive Performance Disparities in Multi-Population Machine Learning
Kumar, S.; Cui, Y.
Abstract
Machine learning frequently encounters challenges when applied to population-stratified datasets, where data representation bias and data distribution shifts substantially impact model performance and generalizability across different population groups. These challenges are well illustrated in the context of polygenic prediction for diverse ancestry groups, and the underlying mechanisms are broadly applicable to machine learning with population-stratified data across domains. Using synthetic genotype-phenotype datasets representing five continental populations, we evaluate three approaches for utilizing population-stratified data (mixture learning, independent learning, and transfer learning) to systematically investigate how data representation bias and distribution shifts influence multi-population machine learning. Our results show that conditional distribution shifts, in combination with data representation bias, significantly influence machine learning performance across diverse populations and the effectiveness of transfer learning as a disparity mitigation strategy, while the effect of marginal distribution shifts is limited. The joint effects of data representation bias and distribution shifts demonstrate distinct patterns under different multi-population machine learning approaches, providing critical insights for the development of effective and equitable machine learning models for population-stratified data.
bioinformatics 2026-05-28 v2

CardioSeg: An interactive platform for integrated spatial transcriptomics data and nuclear morphological analysis of mouse heart tissue
Kancherla, S. K.; Melleby, A. O.; Aronsen, J. M.
Abstract
Motivation: Spatial transcriptomics enables gene expression profiling within its spatial context in intact tissue sections. Existing workflows for segmentation, spatial annotation, and morphological analysis are often code-heavy and poorly integrated. This limits the joint analysis of spatial gene expression at single-nucleus resolution and the corresponding nuclear morphology. Results: We present CardioSeg, a Python-based graphical interface for nuclei segmentation, spatial annotation, and interactive analysis of myocardial histology. CardioSeg integrates multi-threshold Cellpose-based segmentation with nuclei-level transcriptomic mapping and interactive visualisation. CardioSeg achieved robust segmentation performance across heterogeneous imaging conditions, with union-based inference outperforming the individual parameter configurations. For cell-type annotation, CardioSeg achieved 0.88 in accuracy and 0.85 in balanced accuracy against reference labels, while also resolving spatial heterogeneity not captured by spot-based approaches. Application to pressure-overloaded cardiac tissue revealed uncharacterized intra-ventricular variations in nuclear morphology, indicating the potential of CardioSeg to couple disease-specific nuclear morphology with the associated transcriptomics. Availability and Implementation: Source code is available at GitHub under the CC BY 4.0 license (https://github.com/SrijanKancherla/CardioSeg). A versioned release was archived in Zenodo (DOI: 10.5281/zenodo.20177171).
bioinformatics 2026-05-28 v2

Atlas-Level Single-Cell and Spatial Transcriptomics Data Integration via PRIME
Wu, X.; Wang, X.; Wang, J.; Wan, S.
Abstract
Single-cell RNA sequencing (scRNA-seq) and spatial transcriptomics (ST) have enabled atlas-scale cellular cartography, with consortium efforts now assembling millions of cells across diverse tissues, donors, and technologies to build comprehensive references for cell identity and disease mechanism, yet the scientific value of these atlases hinges on robust computational integration across heterogeneous data sources. Unlike pairwise batch correction, atlas-level integration must jointly reconcile heterogeneous and often hierarchically nested batch effects across many datasets whose cell-type compositions are highly imbalanced, all while preserving subtle biological variation and remaining computationally tractable at the scale of millions of cells. Existing approaches often prioritize either batch mixing or preservation of local biological structure, and most cannot natively accommodate spatial coordinates. Here we introduce PRIME (Projection-based Robust Integration via Manifold Embedding), an ensemble integration framework that combines random-projection-based consensus anchoring, graph-Laplacian correction, and optional spatial-neighborhood regularization. Across multiple random projections of the expression manifold, PRIME uses consensus voting to keep only cell pairs that are repeatedly matched, reducing false anchors caused by projection-specific distortions. For ST, PRIME couples this expression-based anchor graph with a coordinate-derived spatial neighborhood graph in a unified graph-Laplacian objective with a closed-form solution, enabling simultaneous cross-batch alignment and local spatial coherence. Based on extensive benchmarking spanning diverse datasets, we show that PRIME consistently outperforms state-of-the-art methods in both batch correction and biological conservation across scRNA-seq and ST integration scenarios and downstream tasks including trajectory inference, spatial-domain preservation, and perturbation-response analysis.
In particular, when integrating a human hematopoiesis benchmark spanning eight donors and approximately 33,000 cells, PRIME preserves biologically coherent developmental trajectories. It also maintains cortical laminar architecture across dorsolateral prefrontal cortex sections in an ST dataset and recovers known drug-target relationships in a perturbation atlas of more than 1 million cells while suppressing batch-associated confounders. Together, these results establish PRIME as a versatile and scalable framework for atlas-level integration of scRNA-seq and ST across diverse biological applications.
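The consensus-voting idea in this abstract can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the PRIME implementation: the abstract does not say how cell pairs are matched within each projection, so mutual nearest neighbours are assumed here, and the names `consensus_anchors`, `n_proj`, `out_dim`, and `min_frac` are hypothetical.

```python
import random
from collections import Counter


def project(points, matrix):
    """Apply a linear projection (list of rows) to every point."""
    return [[sum(m * x for m, x in zip(row, p)) for row in matrix] for p in points]


def nearest(query, targets):
    """Index of the target closest to query in squared Euclidean distance."""
    return min(range(len(targets)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(query, targets[j])))


def mutual_nn(batch_a, batch_b):
    """Pairs (i, j) where a_i and b_j are each other's nearest neighbour."""
    a_to_b = [nearest(p, batch_b) for p in batch_a]
    b_to_a = [nearest(q, batch_a) for q in batch_b]
    return {(i, j) for i, j in enumerate(a_to_b) if b_to_a[j] == i}


def consensus_anchors(batch_a, batch_b, n_proj=20, out_dim=2, min_frac=0.8, seed=0):
    """Keep pairs matched in at least min_frac of the random projections."""
    rng = random.Random(seed)
    in_dim = len(batch_a[0])
    votes = Counter()
    for _ in range(n_proj):
        # Fresh Gaussian random projection for every voting round.
        matrix = [[rng.gauss(0, 1) for _ in range(in_dim)] for _ in range(out_dim)]
        votes.update(mutual_nn(project(batch_a, matrix), project(batch_b, matrix)))
    return sorted(pair for pair, n in votes.items() if n >= min_frac * n_proj)


# Two toy "batches": batch_b is batch_a plus a small constant shift.
batch_a = [[0, 0, 0, 0], [100, 0, 0, 0], [0, 100, 0, 0], [0, 0, 100, 0]]
batch_b = [[x + 0.1 for x in p] for p in batch_a]
anchors = consensus_anchors(batch_a, batch_b)
print(anchors)
```

A pair that is matched only because one projection happens to collapse two distant cells onto each other collects few votes and is dropped by the `min_frac` cut-off.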
bioinformatics 2026-05-28 v2

Accelerated Aging Signatures in 3D Genome Organization and Transcriptome in Schizophrenia
Ulianov, K. A.; Zagirova, D. R.; Kononkova, A. D.; Dudkovskaia, A. V.; Molodova, M. N.; Morozov, K. V.; Efimova, O. I.; Bazarevich, M.; Cherkasov, A. V.; Morozova, P. D.; Tvorogova, A. V.; Pletenev, I. A.; Kondratyev, N.; Golimbet, V. E.; Razin, S. V.; Khaitovich, P. E.; Ulianov, S. V.; Khrameeva, E.
Abstract
Schizophrenia is a severe neuropsychiatric disorder that affects the behavioral, emotional and cognitive state of patients. Despite its substantial heritability, the molecular etiology of the disease remains poorly understood. Many schizophrenia-associated genetic variants reside in non-coding regions and exert their effects through distal regulatory elements of the genome. In this context, the three-dimensional organization of the genome is expected to play a decisive role in establishing contacts between these regulatory elements and their target genes, thereby mediating schizophrenia-associated dysregulation of gene expression. Here, we present a novel Hi-C dataset providing an unprecedented view of three-dimensional genome organization in post-mortem schizophrenia brain samples. Our findings indicate that most changes occur at long-range genomic distances while the local architecture of topologically associated domains remains largely intact. However, neurons display localized and functionally relevant loop differences, particularly in regulatory regions associated with neurodevelopmental processes. Global characteristics of higher-order chromatin organization show an accelerated-aging alteration pattern in schizophrenia, and downstream analysis of transcriptomic data in schizophrenia brain samples further confirms that schizophrenia is associated with accelerated aging.
bioinformatics 2026-05-28 v1

Minimal Computational Framework for Systematic Identification of Antimicrobial Targets
Hassan, S. A.
Abstract
Systematic identification of antimicrobial targets remains a major challenge, as discovery still relies largely on empirical, resource-intensive approaches with limited efficiency. We present a method for identifying antimicrobial targets based on protein dynamics, enabling rational polypharmacology. The approach spans multiple biological scales, from taxa (genus and species) to biological networks, including network hubs and edges, their constituent proteins, protein binding sites, and their conformational states. It is grounded in the premise that coordinated intervention across multiple, optimally selected targets, using combinations of compounds at safe or submaximal doses, can achieve therapeutic effects while reducing toxicity and limiting mutational escape. A survey of known antimicrobials indicates that a small number of recurrent protein-level mechanisms account for most disruptions of microbial survival. We introduce metrics to detect these mechanisms across a pathogen proteome and describe a streamlined, modular workflow for target identification and prioritization that is optimized for ease of deployment and naturally interfaces with downstream applications such as molecular screening and de novo design.
bioinformatics 2026-05-28 v1

SQANTI-browser: visualization and curation of SQANTI3-classified long-read transcriptomes within the UCSC Genome Browser
Paniagua, A.; Blanco-Gomez, C.; Colomer Fernandez, A.; Diekhans, M.; Conesa, A.; Monzo, C.
Abstract
Long-read sequencing enables transcriptome-wide isoform discovery. However, it generates substantial technical and structural ambiguity that complicates transcript interpretation. Here, we present SQANTI-browser, a classification-aware visualization framework that converts SQANTI3 outputs into interactive UCSC Genome Browser Track Hubs, preserving full transcript structural metadata. By integrating SQANTI classifications directly within the UCSC ecosystem, SQANTI-browser enables dynamic filtering and evidence-guided curation alongside public resource tracks. Furthermore, its adaptive architecture natively supports non-reference genomes, orthogonal data, and custom metadata fields. Applied to clinical, noisy, and synthetic datasets, SQANTI-browser resolves alignment artifacts and rescues actionable novel isoforms, providing a robust framework for long-read transcriptome curation.
bioinformatics 2026-05-28 v1

In silico characterization of unique fungal modular rhodopsin expands the horizon of novel optobiological and biomedical applications
Kateriya, S.; Kumari, A.; Kumar, A.; Sharma, K.; Pati, S. R.; Mohanty, S.
Abstract
Microbial modular rhodopsins, in which light-sensing rhodopsin domains are fused with effector modules, have emerged as promising tools for optogenetic regulation in algae and other systems. However, the diversity and potential regulatory roles of fungal modular rhodopsins remain largely unexplored. Here, we performed a comprehensive in-silico analysis to identify previously uncharacterized fungal modular rhodopsins that pair a conserved light-sensing core with diverse effector domains, including RPEL-motif, NADP-binding Rossmann fold domain, MCM (Mini-Chromosome Maintenance) domain, and GC-cAT (Carnitine O-Acetyltransferase) modules. In Aureobasidium pullulans, the representative modular rhodopsin (ApRh-RPEL) contains an RPEL motif associated with actin-related and transcriptional regulatory processes, suggesting a light-driven fungal signaling pathway involved in transcriptional and cellular regulation, respectively. Rhodopsins fused with NADP-binding Rossmann fold and MCM domains further indicate possible applications in light-programmable metabolic and cell-cycle signaling. Genome mining additionally revealed that A. pullulans harbours a diverse but underexplored array of biosynthetic gene clusters (BGCs), raising the intriguing possibility that light perception may regulate secondary metabolite pathways. Supporting this, multisource protein-protein interaction network analysis links ApRh-RPEL to enzymes involved in terpenoid and sphingolipid biosynthesis, indicating potential cross-talk between the light-sensing module and metabolic regulation. These findings outline a computationally derived model in which fungal modular rhodopsins (ApRh-RPEL) function as opto-synthetic regulators of biosynthetic processes. Structural predictions confirmed a conserved Schiff-base lysine and retinal-binding pocket, highlighting functional diversity across fungal rhodopsins.
Together, these findings expand the optogenetic toolkit and provide a framework for engineering light-driven signaling in fungi, with potential optobiological and biomedical applications.
bioinformatics 2026-05-28 v1

UcTCRp: a TCRβ-based framework for quantitative MAIT- and iNKT-associated repertoire-state profiling
Chen, L.; Li, Y.; Shan, S.; Wang, K.; Feng, C.; Dou, Y.; Xu, Q.; Cai, L.; Wang, H.; Wang, H.; Bo, X.; Zhang, J.
Abstract
MAIT and iNKT cells are conventionally identified using invariant or semi-invariant TCR chains, antigen-loaded tetramers, or transcriptomic phenotypes. These requirements limit their detection in public and clinical immune-repertoire datasets that contain only TCRβ sequences. Here we present UcTCRp, a TCRβ-only framework for profiling MAIT- and iNKT-associated repertoire states in bulk immune repertoires. UcTCRp integrates V-gene context and CDR3β sequence features using a transformer-based representation pretrained on more than one million TCRβ sequences and supervised with curated cross-species MAIT, iNKT and conventional T cell references. The framework defines conserved model-informative TCRβ features, uses V-matched negative sampling to reduce germline-segment shortcuts, and generalizes across independent human and mouse datasets. In paired scRNA-seq/scTCR-seq datasets, UcTCRp recovered transcriptome-defined MAIT and iNKT cells and identified additional MAIT-like candidates supported by receptor evidence but missed by expression-only annotation. Bulk calibration against paired single-cell references and synthetic spike-in experiments established operating characteristics for repertoire-level abundance estimation. These results establish unpaired TCRβ repertoires as an actionable substrate for reconstructing unconventional T cell-associated immune states, enabling archived repertoire resources to be repurposed for systems-level studies of tissue immunity, disease and therapeutic response.
bioinformatics 2026-05-28 v1

Individual-Specific Gaussian Graphical Models for Heterogeneous Populations with Application to Epigenetic Gene Regulation in Lung Adenocarcinoma
Saha, E.
Abstract
Inter-patient molecular heterogeneity is a fundamental challenge in precision oncology: population-level multi-omics networks reveal average biology aggregated across the population but obscure individual variations that drive differential clinical outcomes. We introduce SIREN (Sample-specific Inference via Regularized Empirical-Bayes Networks), a method that estimates one partial correlation network per sample across omics layers by combining a population-level empirical Bayes prior with a rank-1 individual-specific update. Since a sample-specific precision matrix cannot be estimated from a single observation, SIREN uses a conjugate Inverse Wishart prior whose mean is the Oracle Approximating Shrinkage estimator, yielding closed-form individual-specific posteriors without MCMC. On simulated heterogeneous populations, SIREN achieves superior edge recovery over population-average methods including OAS, Ledoit-Wolf, and graphical Lasso, while remaining competitive in homogeneous settings. Applied to paired transcriptomic and methylomic profiles from lung adenocarcinoma, SIREN identifies individual-specific gene-methylation regulatory edges that stratify patients by survival in ways population-level analysis cannot, implicating chromatin remodeling and WNT signaling pathways in epigenetic heterogeneity. SIREN is computationally scalable and available as a Python package.
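The closed-form update mentioned in this abstract follows from standard Inverse-Wishart conjugacy, which the sketch below works through for one sample. It is an illustration under assumptions, not the SIREN package: it assumes a zero-mean (already centred) Gaussian observation, takes the population-level estimate `prior_mean` as given rather than computing an OAS estimate, and treats the prior degrees of freedom `nu` as a hypothetical tuning parameter. With prior IW(Ψ, ν) and Ψ = (ν − p − 1)·S0, one observation x gives posterior IW(Ψ + x xᵀ, ν + 1), whose mean is (Ψ + x xᵀ) / (ν − p).

```python
def posterior_mean_covariance(prior_mean, x, nu):
    """Posterior mean of the covariance after one centred observation.

    prior_mean: p x p list of lists, population-level estimate S0
    x:          length-p centred observation for one individual
    nu:         prior degrees of freedom (assumed parameter, must exceed p + 1)
    """
    p = len(x)
    if nu <= p + 1:
        raise ValueError("nu must exceed p + 1 for the prior mean to exist")
    scale = nu - p - 1  # Psi = scale * S0, so that the prior mean equals S0
    return [
        # Rank-1 update x x^T added to Psi, divided by (nu + 1) - p - 1 = nu - p.
        [(scale * prior_mean[i][j] + x[i] * x[j]) / (nu - p) for j in range(p)]
        for i in range(p)
    ]


s0 = [[1.0, 0.0], [0.0, 1.0]]
sigma_i = posterior_mean_covariance(s0, [1.0, 2.0], nu=5)
print(sigma_i)
```

Inverting the resulting matrix and rescaling its off-diagonal entries would give sample-specific partial correlations; that step is omitted here to keep the example dependency-free.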
bioinformatics 2026-05-28 v1

Leveraging AI and structural proteomics for rational design of a KAT6A degrader
Arad, G.; Simchi, N.; Brodsky, S.; Shtrikman, A.; Kedem, Y.; Alchanati, I.; Otonin, G.; Shenoy, A.; Kovalerchik, D.; Ran Shchory, M.; Ben Shoshan-Galeczki, Y.; Cohen, N.; Lange, K.; Seger, E.; Pevzner, K.
Abstract
While targeted protein degraders such as PROTACs are a clinically proven therapeutic strategy, the discovery of novel degraders remains hampered by a trial-and-error process. To address this challenge, we developed the AIMS™ platform, which combines structural proteomics with AI models for rational PROTAC design. AIMS™ is an end-to-end toolkit for PROTAC optimization, encompassing structure solving using proteomics and AI, prediction of ADME and degradation properties, and prospective ranking of compound design ideas. Altogether, this integrated platform successfully enabled the multi-parameter optimization of a potent and bioavailable in vivo validated KAT6A degrader, establishing a versatile framework for PROTAC development across various targets.
bioinformatics 2026-05-28 v1

MINA: linear probes reveal coding-sequence family signal in frozen DNA encoders
Wijaya, A. S.; Leung, H.; Yoo, H.
Abstract
Frozen DNA encoders are often used as genomic feature extractors, but downstream fine-tuning does not show what information is already linearly accessible in their unchanged embeddings. We introduce MINA (Model Interrogation of Nucleotide Architectures), a lightweight probing benchmark for testing whether frozen DNA embeddings can recover (i) a 5-way protein-family label for each gene and (ii) the 1,536-dimensional GenePT embedding of each gene's natural-language summary. We compare recoverability between canonical coding sequence and TSS-centred genomic contexts. In 3,244 human protein-coding genes from five families, frozen encoders recovered the family-annotation target most clearly from coding sequence. NT-v2 with meanD pooling reached macro-F1 0.828 / kappa 0.821, compared with 0.672 / kappa 0.702 for a CDS 4-mer baseline. Alignment to GenePT natural-language descriptions was weaker. Replacing CDS with 196,608 bp TSS-centred windows substantially reduced performance across all four encoders, indicating that the recoverable signal is primarily coding-sequence family signal rather than generic gene-function signal from arbitrary genomic context.
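For readers unfamiliar with the two metrics quoted in this abstract, the following Python sketch computes macro-averaged F1 and Cohen's kappa from a list of true and predicted labels using their standard definitions. It is not the MINA evaluation code; the function names and the toy labels are made up for illustration.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def cohen_kappa(y_true, y_pred):
    """Agreement between truth and prediction, corrected for chance."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    # Chance agreement from the product of the two marginal label frequencies.
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # degenerate case: only one label on both sides
    return (observed - expected) / (1 - expected)


y_true = ["a", "a", "b", "b", "c", "c"]
y_pred = ["a", "b", "b", "b", "c", "a"]
print(round(macro_f1(y_true, y_pred), 3), cohen_kappa(y_true, y_pred))
```

Macro-F1 weights every class equally regardless of size, while kappa discounts the agreement expected from the label frequencies alone, which is why the two numbers can differ on the same predictions.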
bioinformatics 2026-05-28 v1

BIFO: A Biological Information Flow Ontology for Directed Propagation in Heterogeneous Biomedical Knowledge Graphs
Taylor, D. M.; Mohseni Ahooyi, T.; Stear, B.; Zhang, Y.; Lahiri, A. M.; Simmons, J. A.; Chinwalla, A.; Nemarich, C.; Callahan, T. J.; Silverstein, J. C.
Abstract
Biomedical knowledge graphs integrate heterogeneous data by connecting many entity types through many relationship types. Computational analyses that propagate signal across these graphs (random walks, diffusion, and message passing) implicitly assume that every traversable edge can carry a biological signal. In a heterogeneous KG this is rarely true: hierarchical, lexical, and purely statistical edges do not, by themselves, define an admissible directed state transformation, and traversing them propagates signal along paths that are not biologically meaningful. We present the Biological Information Flow Ontology (BIFO), a graph-agnostic specification of which directed transformations are biologically admissible for computable information flow. BIFO defines fourteen entity classes, a taxonomy of flow classes organized around the backbone G+CH→RNA→P→PW→C→PH→DS, a set of admissibility constraints, and a two-level CURIE mapping that can be applied without schema-specific code to any graph whose identifiers and predicates are resolvable through, or extendable to, the BIFO mapping tables. A four-step conditioning protocol converts a raw property graph into a conditioned propagation graph in which only admissible, direction-aware edges remain. We provide a reference implementation on the Data Distillery Knowledge Graph (DDKG); conditioning a cohort-independent, gene-anchored subgraph as a BIFO substrate of 33.6 million edges retained 23.7 million (70.7%) as BIFO-classified relationships, cleanly separating 13.3 million propagating mechanistic edges from 10.5 million retained-but-non-propagating observational associations, and confirming that pathway concepts are configured as scoring accumulation endpoints for BIFO-PPR pathway scoring. BIFO is an admissibility specification for computable propagation of signal over knowledge graphs.
It is released as an open specification with versioned mapping tables and tooling, providing a reusable substrate for biologically interpretable, direction-aware analysis of biomedical knowledge graphs.
bioinformatics 2026-05-28 v1

Mapping Genetic Risk Associations to Cellular Contexts via Deep Learning and Biological Ontologies
Margalit, T.; Levi, H.; Shamir, R.; Elkon, R.
Abstract
Translating genome-wide association studies (GWAS) signals into trait-relevant cellular contexts remains challenging due to the complexity of the genomic regulatory code and linkage disequilibrium among associated variants. We present a novel computational framework that aggregates deep learning-based predictions of the functional effects of noncoding variants on transcriptional regulatory elements across GWAS loci and empirically evaluates their statistical significance. By organizing these aggregated signals within biological ontologies, our approach enables statistically calibrated interpretation of GWAS associations, highlighting relevant cell-type and tissue contexts across human traits.
bioinformatics 2026-05-28 v1

A New Hybrid Method for Brain Tumor Detection Based on Deep Learning
Sharbaf, S.
Abstract
Brain tumor detection using Magnetic Resonance Imaging (MRI) remains a challenging task due to tumor heterogeneity and imaging variability. This paper presents a novel hybrid Deep Convolutional Neural Network Whale Optimization Algorithm (DCNN-WOA) framework for automated brain tumor detection and classification. The proposed method consists of four main stages: MRI data preprocessing and augmentation, deep feature extraction using multi-layer Convolutional Neural Networks (CNN), feature selection and hyperparameter optimization via the Whale Optimization Algorithm (WOA), and final classification with comprehensive performance evaluation. By jointly optimizing deep features and training parameters, the framework effectively reduces feature redundancy, accelerates convergence, and enhances model generalization. Experimental results on a publicly available MRI dataset demonstrate that the DCNN-WOA model outperforms conventional CNN and state-of-the-art Deep Learning (DL) architectures, achieving an accuracy of 97.8%, sensitivity of 96.4%, specificity of 98.1%, and F1-score of 97.2%. The practical impact of this approach makes it a promising solution for real-time clinical decision-support systems in neuroimaging.
bioinformatics 2026-05-28 v1

Inferring Multi-Stage Pathway Progression Models from Tumor Phylogenies
Cankosyan, M.; Khan, S. R.; Sashittal, P.
Abstract
Cancer progression is an evolutionary process driven by the accumulation and selection of somatic mutations, giving rise to genetically diverse subclonal populations within tumors. Understanding the dependencies among mutations and identifying recurrent evolutionary trajectories is critical for understanding cancer progression and informing therapeutic strategies. Recent advances in genomic sequencing and phylogenetic reconstruction now enable large-scale inference of tumor phylogenies, providing detailed representations of intratumor evolutionary histories across patient cohorts. However, modeling cancer progression from these data remains challenging due to extensive inter- and intratumor heterogeneity, often arising from mutations in different genes within the same pathway that confer similar fitness advantages. Existing methods to infer pathway-level progression models summarize each tumor by a single consensus genotype, ignoring intra-tumor heterogeneity, while phylogeny-based methods typically focus on individual mutations and do not model pathways. We introduce PhyloStage, an algorithm for inferring multi-stage pathway-level cancer progression models from large cohorts of tumor phylogenies. PhyloStage represents progression as a partial order over pathways, permitting independent mutations in incomparable pathways while constraining the order of mutations within the same or dependent pathways. The framework also incorporates uncertainty in tumor phylogenies, resolves mutation clusters with unknown ordering, and stratifies patients by progression stage. Applied to a cohort of 120 acute myeloid leukemia (AML) tumor phylogenies, PhyloStage infers progression models that are aligned with known AML progression. On 99 non-small cell lung cancer (NSCLC) patients, PhyloStage stratifies patients into progression stages such that later stages have larger tumor sizes, corroborating phenotypic tumor progression.
bioinformatics 2026-05-28 v1

Sequence-Based Prioritization of Promoter Regulatory Variants in Colorectal Cancer Using a DNA Foundation Model
Shome, S.; Vajinepalli, S.; Saraf, A.
Abstract
Noncoding regulatory variants contribute to colorectal cancer (CRC) susceptibility, yet their functional interpretation remains difficult. This is mainly because regulatory effects are context-dependent and most noncoding regions lack reliable genomic annotations. We have developed a computational framework that aids in prioritizing promoter-associated variants using Evo2, a large-scale autoregressive DNA foundation model. In the framework, variants were mapped to promoter regions across ~1,250 CRC-associated genes and scored using Evo2-derived delta scores, the difference in sequence probability between reference and alternate alleles. Promoter variants showed greater predicted regulatory impact than non-promoter variants (median delta = 0.015 vs. 0.002; overall mean = 0.018, SD = 0.011). Applying a distributional threshold (delta > 0.020; top ~25%) identified 287 high-impact variants across 198 CRC-associated genes. These genes were enriched in CRC-relevant pathways such as Wnt signaling, p53 signaling, and cell cycle regulation, and 36.4% (72/198) overlapped known cancer genes. Independent validation showed high-impact variants were enriched at CRC GWAS loci and overlapped transcription factor binding sites (~32%) and motif-disrupting positions (~21%), supporting their functional relevance. Together, these results show that sequence-based foundation models can scalably prioritize noncoding regulatory candidates in CRC without supervised training or predefined annotations.
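The delta-score thresholding described in this abstract reduces to a small filtering step, sketched below in Python. The example makes several assumptions: the per-allele scores are hypothetical precomputed numbers (no Evo2 call is made or implied), the delta is taken as an absolute difference because the abstract does not state a sign convention, and the names `prioritize`, `ref_score`, and `alt_score` are illustrative. Only the 0.020 cut-off comes from the abstract.

```python
def prioritize(variants, threshold=0.020):
    """Return variants whose delta exceeds the threshold, largest first.

    variants: iterable of (variant_id, ref_score, alt_score), where the two
    scores are model-derived sequence probabilities for each allele
    (hypothetical inputs, computed elsewhere).
    """
    hits = []
    for variant_id, ref_score, alt_score in variants:
        # Assumed convention: magnitude of the reference/alternate difference.
        delta = abs(ref_score - alt_score)
        if delta > threshold:
            hits.append((variant_id, delta))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


variants = [
    ("var1", 0.90, 0.85),  # delta about 0.05, kept
    ("var2", 0.50, 0.49),  # delta about 0.01, dropped
    ("var3", 0.70, 0.40),  # delta about 0.30, kept
]
high_impact = prioritize(variants)
print([variant_id for variant_id, _ in high_impact])
```

Because the threshold is described as distributional (roughly the top quarter of scores), a real analysis would presumably derive it from the observed delta distribution rather than hard-code it.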
bioinformatics 2026-05-28 v1

Design of a Multi-epitope Vaccine Against Human Glanders Targeting Outer Membrane β-barrel Proteins of Burkholderia mallei
Kapoor, J.; Panda, A.; Kumar, S.; Bandyopadhyay, A.
Abstract
Burkholderia mallei, a facultative intracellular Gram-negative pathogen, is the causative agent of glanders, which primarily affects solipeds and is sporadically transmitted to humans. Current interventions mainly rely on antibiotics; however, increasing resistance and the lack of a licensed vaccine further complicate disease management. In the present study, a consensus-based computational framework was employed on the B. mallei turkey2 proteome. In total, 59 proteins (including porins, TonB receptors, autotransporters, and efflux components) were identified as surface-exposed outer membrane β-barrel (OMBB) proteins and were used to design a multi-epitope vaccine (MEV) construct. B- and T-cell epitopes were predicted from the 59 proteins, and ten epitopes each of cytotoxic T-lymphocyte (CTL), helper T-lymphocyte (HTL), and B-cell were chosen based on their antigenicity, non-allergenicity, non-toxicity, surface accessibility, and conservation across 32 B. mallei strains. The MEV was included with suitable adjuvants at the N-terminus to enhance its immunogenicity. The 780 amino acid MEV construct was predicted to be antigenic and soluble upon overexpression, with 62.69% random coils, while the rest formed α-helices and β-strands. The tertiary structure of the MEV was generated and subsequently validated, indicating good structural quality. Molecular docking of the MEV with toll-like receptor 4 (TLR4) demonstrated strong affinity, and molecular dynamics simulation confirmed the structural stability of the MEV-TLR4 complex. In-silico immune simulation showed the capability of the MEV to induce a strong immune response. The study proposes an MEV construct that utilizes surface-exposed OMBB proteins, which directly interact with the host and serve as effective immunogenic targets against B. mallei infection.
bioinformatics2026-05-28v1gTranslate: rapid and accurate translation table prediction for prokaryotic genomes
Chaumeil, P.-A.; Hugenholtz, P.; Parks, D. H.Abstract
Background: Bioinformatic tools often require the prediction of protein-coding genes to make inferences about prokaryotic genomes. Typically, the genetic code used for translating genes to proteins must be specified by the user based on the taxonomic classification of a genome assembly or, for some widely used tools, established using a heuristic rule based on gene coding densities. Manual specification is at best inconvenient; more problematic is that many bioinformatic tools are applied before taxonomic classifications have been established, which makes specifying the translation table impractical. Methods: Here we provide a computationally efficient tool, gTranslate, that uses an ensemble of five machine learning methods to accurately predict translation tables for prokaryotic genomes. The feature vector used by gTranslate takes advantage of differences in gene coding densities when predicting genes under different translation tables, along with features that consider the number and ratio of UGA stop codon reassignments to tryptophan or glycine. Results: We demonstrate that gTranslate correctly predicts the translation table of prokaryotic genomes >99.99% of the time (i.e. <1 error per 10,000 genomes) and outperforms a more computationally expensive prediction method and a coding density heuristic used by popular bioinformatic tools. Using gTranslate, we identify a basal lineage of Ca. Stammera capleta that uses the standard bacterial genetic code instead of the UGA stop codon to tryptophan reassignment common to other members of this species. We also identify the first instances of UGA-to-tryptophan reassignment in the Patescibacteriota, making this the first bacterial phylum with members capable of using translation tables 4, 11, and 25.
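The coding-density signal mentioned in the abstract can be illustrated with a toy sketch. This is not gTranslate's feature extraction, which relies on gene prediction and a machine learning ensemble; the function name `coding_density`, the forward-strand-only scan, the ATG-only start rule, and the `min_codons` threshold are all assumptions made for illustration. The idea shown is simply that reading UGA as a sense codon (table 4) instead of a stop (table 11) lengthens open reading frames and raises the fraction of bases that appear to be coding.

```python
# Toy illustration only: a naive forward-strand ORF scan, not gTranslate's method.
STOPS_TABLE_11 = {"TAA", "TAG", "TGA"}  # standard bacterial code: TGA is a stop
STOPS_TABLE_4 = {"TAA", "TAG"}          # table 4: TGA is read as tryptophan


def coding_density(seq, stops, min_codons=3):
    """Fraction of bases covered by ATG...stop ORFs in the three forward frames.

    An ORF is kept if it spans at least `min_codons` codons, counting the
    start and stop codons (a hypothetical threshold chosen for the demo).
    """
    seq = seq.upper()
    if not seq:
        return 0.0
    covered = [False] * len(seq)
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in stops:
                if (i + 3 - start) // 3 >= min_codons:
                    for j in range(start, i + 3):
                        covered[j] = True
                start = None
    return sum(covered) / len(seq)


# ATG GCT TGA GCT GCT TAA: the in-frame TGA truncates the ORF under table 11.
demo = "ATGGCTTGAGCTGCTTAA"
density_11 = coding_density(demo, STOPS_TABLE_11)
density_4 = coding_density(demo, STOPS_TABLE_4)
```

On this contrived sequence the density is higher under table 4 than under table 11; on real genomes the contrast between tables is what a classifier would consume as one feature among several.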
bioinformatics2026-05-28v1Distinct fibrotic, epithelial and immune transcriptomic programs in phenotypes of chronic lung allograft dysfunction
Ishiwata, T.; Berra, G.; Allen, J.; Burman, A.; Wilson, G.; Carter, Z.; Watanabe, T.; Solomon, M.; Keshavjee, S.; Yeung, J.; Juvet, S. C.; Martinu, T.Abstract
Background: Chronic lung allograft dysfunction (CLAD) is the major cause of late mortality after lung transplantation and includes two principal phenotypes, bronchiolitis obliterans syndrome (BOS) and restrictive allograft syndrome (RAS). RAS and other phenotypes with RAS-like opacities (RLO) on chest imaging have a poorer prognosis. Despite clear clinical and pathological differences, molecular distinctions between phenotypes remain poorly defined. We aimed to explore gene transcriptional profiles across CLAD phenotypes and relevant controls. Methods: We performed bulk RNA sequencing on explanted lung tissue from 45 lung transplant recipients with end-stage CLAD (20 with RLO and 25 without RLO). Samples from twenty-seven control donor and lobectomy lungs and sixteen idiopathic pulmonary fibrosis (IPF) lungs served as comparators. Non-negative matrix factorization (NMF) was used to identify latent transcriptomic signatures, which were correlated with clinical, radiologic, and histopathologic features. Results: NMF identified seven distinct gene signatures that segregated CLAD phenotypes. RLO-CLAD lungs were enriched for extracellular matrix remodeling and B cell/plasma cell-associated signatures, overlapping partly with IPF, whereas non-RLO-CLAD showed relative enrichment of epithelial injury and surfactant-response pathways. Signatures related to epithelial homeostasis and ciliary/microtubule function were progressively reduced from control lungs to non-RLO-CLAD and were most suppressed in RLO-CLAD. Conclusions: RLO-CLAD and non-RLO-CLAD, aligning with RAS and BOS phenotypes, show distinct transcriptomic signatures. RLO-CLAD is characterized by profibrotic and humoral immune signatures with profound epithelial dysfunction, whereas non-RLO-CLAD shows relative enrichment of epithelial injury responses. These data provide molecular stratification of CLAD and support the development of phenotype-specific biomarkers and targeted therapies.
bioinformatics2026-05-28v1Pathway redistribution across cellular states reveals a shared signaling backbone and context-dependent regulatory modules in RNA-binding protein networks
Osato, N.; Sato, K.Abstract
Understanding how regulatory architectures are reorganized across cellular contexts remains a central challenge in functional genomics. Here, we integrate co-expression-derived candidate regulatory interactions with interpretable deep learning to generate gene-level contribution scores and introduce delta NES (normalized enrichment score difference) to quantify pathway redistribution between cellular states. Because gene expression reflects the combined effects of multiple regulatory inputs, contribution scores capture relative regulatory influence rather than transcriptional abundance itself. Applying this framework to neural progenitor cells and K562 leukemia cells, we identify systematic redistribution of functional modules across multiple RNA-binding proteins, including PKM, HNRNPK, and NELFE. Neural System- and Immune System-associated modules are differentially positioned along the delta NES spectrum, indicating context-dependent redistribution of regulatory influence rather than isolated pathway activation events. At the pathway level, Signal Transduction consistently forms a shared signaling backbone across proteins and cellular contexts, while modules related to neuronal functions, immune responses, and developmental processes exhibit context-dependent redistribution. Subpathway analysis further reveals convergence on receptor-mediated signaling processes, including FGFR/RTK-, IRS-, and MAPK-related pathways. These redistribution patterns are preserved under alternative DeepLIFT background settings despite polarity changes in contribution-expression correlations, indicating that pathway-level contrasts arise from stable rank-structure differences rather than background-dependent score artifacts. Together, our findings demonstrate that contribution score-based pathway ranking reveals a conserved signaling backbone alongside context-dependent functional modules, providing a framework for interpreting regulatory architecture beyond expression-centric analyses.
bioinformatics2026-05-27v12Finding stable clusterings of single-cell RNA-seq data
Klebanoff, V. F.Abstract
Run a UMI count matrix through a pipeline to obtain n cell clusters. Suppose that counts for an equal number of additional cells from the same experiment become available. Would including them change the result? Form the matrix containing both sets of counts, obtain n clusters, restrict this clustering to the initial cells and compare it with the initial clustering. If they are not consistent, conclude that the initial clustering is unstable. This is unrealistic, but reverse the perspective: given a clustering, process samples of half of the cells. If their clusters are consistent with those of all cells restricted to the samples, conclude that the clustering is stable. We use divisive hierarchical spectral clustering and define what may be a novel mapping of the dendrogram to nested clusterings. Counts are transformed to points in low-dimensional Euclidean space. Positive affinities are defined for points that are k-nearest neighbors. The affinity equals the inverse of the distance between points. Ng, Jordan, and Weiss' algorithm divides the points into two clusters. The normalized cut measures the clusters' separation. Recursion generates a dendrogram. Set the length of the branch between a node and its daughters to the normalized cut. Nodes' distances from the root define the mapping to nested clusterings. Analysis is performed for all cells and for multiple pairs of complementary samples of cells. For a given number of clusters, each sample's clustering and clusters are compared with those of the full data set (restricted to the sample). This provides measures of the stability of the clustering and its clusters. For three large data sets, this yielded clusterings compatible with published results, though with fewer clusters. Clusterings of two were judged to be stable. We conclude that it is feasible to identify stable clusterings of as many as 100,000 cells. Future research should explore using differential expression for validation.
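The affinity and normalized-cut steps described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the author's implementation: the graph is symmetrized (an edge exists if either point is among the other's k nearest neighbors), the normalized cut follows the usual cut/vol(A) + cut/vol(B) definition, and the spectral bipartition itself (Ng, Jordan, and Weiss) is omitted, so the split is supplied by hand. The function names are hypothetical.

```python
import math


def knn_affinity(points, k):
    """Symmetric affinity matrix: w[i][j] = 1 / distance for k-nearest-neighbor pairs."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        for j in others[:k]:
            if dist[i][j] > 0:  # skip coincident points to avoid division by zero
                w[i][j] = w[j][i] = 1.0 / dist[i][j]
    return w


def normalized_cut(w, labels):
    """Normalized cut of a two-way split; labels are 0 or 1 per point."""
    n = len(w)
    vol = [0.0, 0.0]
    cut = 0.0
    for i in range(n):
        for j in range(n):
            vol[labels[i]] += w[i][j]
            if labels[i] != labels[j]:
                cut += w[i][j]
    cut /= 2.0  # each crossing edge was visited in both directions
    if vol[0] == 0 or vol[1] == 0:
        return math.inf
    return cut / vol[0] + cut / vol[1]


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
w = knn_affinity(points, k=1)
good_split = normalized_cut(w, [0, 0, 1, 1])  # separates the two tight pairs
bad_split = normalized_cut(w, [0, 1, 0, 1])   # cuts through both pairs
```

A small normalized cut indicates well-separated clusters, which is why the abstract uses it as the branch length between a dendrogram node and its daughters.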
bioinformatics2026-05-27v4Training Strategy Optimization to Mitigate Shortcut Learning in Pan-Cancer Drug Response Prediction
Shimamoto, K.; Ito, T.; Lysenko, A.; Tsunoda, T.Abstract
Background: Prediction of in vivo drug response is a central challenge in precision medicine, but the scarcity of labeled clinical data still necessitates the use of large-scale cancer cell line resources for model training. Domain adaptation methods, which aim to transfer knowledge learned from a source domain (cell lines) to a target domain (patients) by aligning feature distributions across domains, are a promising approach to bridge the gap between in vitro models and in vivo patients. However, we observed that these methods can exhibit a significant discrepancy between pan-cancer evaluation metrics and cancer type-specific prediction accuracy. This performance gap warrants a detailed investigation into their underlying predictive characteristics. Results: We discovered that cancer-type-specific class imbalances in training data can lead domain adaptation models to engage in shortcut learning, where they primarily discriminate between cancer types rather than capturing the actual biological determinants of drug sensitivity. To address this, we propose a strategy of combining two approaches: (1) excluding cancer types causing imbalance from the training data, and (2) adjusting class balance through oversampling and class weighting while retaining cancer types causing the imbalance. Among all configurations tested in conjunction with the CODE-AE (Context-aware Deconfounding AutoEncoder) framework, the combination of moderate oversampling (30% non-responder ratio) with class weighting achieved the best performance, significantly improving prediction accuracy in 5 out of 11 external patient cohorts from TCGA and GEO. Conclusions: Our findings demonstrate that appropriate class imbalance correction, rather than wholesale exclusion of imbalanced cancer subtypes, enables effective utilization of biologically relevant information shared across cancer types for drug response prediction. 
This study highlights the critical importance of jointly optimizing training data composition and class balance adjustment strategies in developing robust pan-cancer drug response prediction models for precision medicine applications.
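The class-balance adjustment described above (oversampling to a target non-responder ratio, then class weighting) could look roughly like the following sketch. It is not the CODE-AE training code: the function names are hypothetical, non-responders are assumed to be the minority class, oversampling is assumed to mean random duplication, and the weights use the common "balanced" formula n / (n_classes * class_count).

```python
import random
from collections import Counter


def oversample_minority(labels, minority, target_percent, seed=0):
    """Return row indices in which `minority` makes up at least target_percent of rows.

    Minority rows are duplicated at random (an assumption; other oversampling
    schemes exist). target_percent is an integer strictly between 0 and 100.
    """
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    if not minority_idx:
        raise ValueError("no minority samples to duplicate")
    n_majority = len(labels) - len(minority_idx)
    # Smallest m with m / (m + n_majority) >= target_percent / 100 (ceiling division).
    needed = -(-target_percent * n_majority // (100 - target_percent))
    extra = max(0, needed - len(minority_idx))
    return list(range(len(labels))) + [rng.choice(minority_idx) for _ in range(extra)]


def balanced_class_weights(labels):
    """Inverse-frequency weights: n / (n_classes * count) for each class."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}


labels = [0] * 10 + [1] * 70          # 0 = non-responder, 1 = responder (toy data)
idx = oversample_minority(labels, minority=0, target_percent=30)
resampled = [labels[i] for i in idx]
weights = balanced_class_weights(resampled)
```

Because the oversampling stops at 30% rather than 50%, the residual imbalance is still compensated by the class weights, which is one way to read the combination the abstract reports as best performing.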
bioinformatics2026-05-27v1NeuroFate: endpoint-locked transcriptomic axis scoring for neurodegeneration risk research
Ghosh, N.; Sinha, K.Abstract
Motivation: AD and PD transcriptomic cohorts can reveal disease-associated neuronal, glial, mitochondrial, myelin, proteostasis, vascular, and immune programs, but these signals are difficult to compare reproducibly across studies without endpoint-locked, sample-level biological summaries. Results: We present NeuroFate, a command-line research package that converts compact transcriptomic cohorts into curated neurodegeneration-axis scores, exploratory research-use risk scores, and conservative evidence reports. The software locks disease-state endpoints before scoring, maps genes or probes onto a 10-axis NeuroFate panel, records axis-gene coverage, and grades external cohort evidence by direction, effect size, nominal/FDR support, and claim-safety rules. Demonstrations across AD and PD resources show nominal independent AD support for a neuronal vulnerability axis, mixed PD convergence, and a PD-divergent synuclein-mitochondrial example while avoiding clinical or mechanism-overstating claims. Availability and implementation: NeuroFate is implemented in Python and available at https://github.com/sinhakrishnendu/NeuroFate.git. Contact: nabanitaghosh89@gmail.com; dr.krishnendusinha@gmail.com. Supplementary information: Documentation, examples, tests, and reproducibility notes are included in the repository.
bioinformatics2026-05-27v1GraphTox: A Semi-Supervised Pre-Trained Framework for Peptide Toxicity Prediction using Geometric Graph Transformer and LORA-Based Finetuning
BHADURI, S.; Das, D.; MITRA, P.Abstract
Peptides are widely used as potential therapeutic agents in drug discovery and biotechnology because they are specific, effective, and relatively inexpensive to produce. They are used in drug development, vaccines, and antimicrobial treatments. However, peptide toxicity remains a major concern as it offers unwanted toxic consequences, such as membrane rupture, haemolysis, tissue damage and adverse immunological response. Early detection of toxic peptide candidates is vital for the development of safe and effective therapies. Current computational methods for predicting peptide toxicity are largely based on hand-crafted sequence descriptors or sequence-only deep learning architectures that may not fully account for the underlying 3-dimensional structural determinants of peptide toxicity. We introduce GraphTox, a structure-aware geometric deep learning framework which combines self-supervised graph representation learning with hierarchical structural modelling to accurately predict peptide toxicity. Our framework learns geometry-aware embeddings from peptide structural graphs via self-supervised masked residue reconstruction, based on a Masked Graph Autoencoder (MGAE) built on a Geometric Graph Transformer (GGT) encoder. The pretrained structural representations are cross fused via a multi-scale U-Net architecture to capture both local residue-level interactions and global conformational patterns associated with peptide toxicity. GraphTox explicitly models spatial relationships between residues, thereby efficiently capturing structural aspects that are generally neglected by sequence-based predictors, such as residue clustering, hydrophobic interactions and electrostatic organization. On benchmark datasets our framework shows superior performance and interpretability over the existing state-of-the-art methods. 
Our hybrid hierarchical structural modelling framework is a superior computational platform to improve the prediction of peptide toxicity and expedite the creation of safer peptide therapies. https://github.com/debraj-55555/GraphTox
bioinformatics2026-05-27v1Toward Large-Scale Numerical Modeling of the Cardiovascular System with up to 34 Billion Vessels
Newhauser, W.; Cole, M.; Diehl, P.; Moreno, J.; Kaiser, H.; Tohid, R.; Nader, N.; Chancellor, J.Abstract
Cardiovascular diseases, such as stroke and heart attacks, are the leading cause of death worldwide. Computational models like cardiovascular digital twins (CVDTs) offer a promising path for research and intervention but are challenged by the complexity of simulating the full human vasculature. This study evaluates the feasibility of simulating blood flow through a vascular network containing 34 billion vessels (the estimated number in the human body) using first-principles physics and simplified geometry, a first step towards CVDTs. We synthesized 3D vasculature using a fractal model and computed blood flow rates via the Poiseuille equation and steady-state fluid dynamics, implemented with high-performance computing. Simulations were conducted for networks ranging from 6 vessels to 34 billion vessels. The results demonstrated high accuracy (within 1% of benchmarks), reproducibility across platforms, and strong scalability. Simulating the full vasculature required 156 node-hours on the second-fastest supercomputer in the world, using 29 TB of memory and 84 TFLOPS. The maximum speedup factor was 80, with parallel efficiency no lower than 0.48. These findings show it is computationally feasible to simulate blood flow through a full-body vascular network at scale. The approach is well suited to parallel computing, suggesting that with continued development, CVDTs could enable whole-organism modeling for applications such as stroke, trauma, radiation injury, and cancer metastasis.
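For readers unfamiliar with the Poiseuille relation, the sketch below shows how steady-state flow through a simple branching network follows from vessel resistances. It is a toy model, not the authors' code: the symmetric bifurcating tree, the scaling ratios, and all numeric values are illustrative assumptions. The only physics used is the Poiseuille resistance R = 8 mu L / (pi r^4) and the series/parallel combination of resistances, with flow Q = delta P / R.

```python
import math


def poiseuille_resistance(radius, length, viscosity):
    """Hydraulic resistance of a rigid cylindrical vessel under laminar flow."""
    return 8.0 * viscosity * length / (math.pi * radius ** 4)


def tree_resistance(radius, length, viscosity, generations,
                    radius_ratio, length_ratio):
    """Total resistance of a symmetric bifurcating tree (illustrative geometry).

    Generation g holds 2**g identical vessels in parallel; generations are in
    series. Radius and length shrink by fixed ratios per generation.
    """
    total = 0.0
    for g in range(generations):
        r = radius * radius_ratio ** g
        vessel_length = length * length_ratio ** g
        total += poiseuille_resistance(r, vessel_length, viscosity) / (2 ** g)
    return total


# Illustrative values only (SI units); none are taken from the paper.
viscosity = 3.5e-3                 # Pa*s, a commonly quoted order of magnitude for blood
root_radius, root_length = 1e-2, 1e-1
pressure_drop = 1.0e4              # Pa
r_total = tree_resistance(root_radius, root_length, viscosity, generations=10,
                          radius_ratio=2 ** (-1 / 3), length_ratio=0.8)
flow = pressure_drop / r_total     # m^3/s through the root vessel
```

The fourth-power dependence on radius is the key point: halving a vessel's radius raises its resistance sixteenfold, so the smallest vessels dominate the total even though they are numerous and in parallel.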
bioinformatics2026-05-27v1ClusToRa: A niche-centric framework for identifying structural recruitment and infiltration in spatial omics
Githaka, J. M.; Lerner, E. P.Abstract
Spatial omics maps cellular landscapes, yet current tools might conflate stochastic proximity with organized niches. We present ClusToRa (Cluster-to-Randomization), a framework that identifies high-density cellular territories and quantifies cell-type recruitment using a fixed-position null model. Benchmarked against graph-based neighborhood-enrichment and point-pattern statistics, ClusToRa reduced false-positive enrichment in simulations and resolved core-vs-boundary interactions. Applied to cirrhotic MASH liver, ClusToRa identifies stellate-cell territories with immune/endothelial infiltration and stress-, Notch-, and PPAR-associated programs, providing a niche-centric framework for distinguishing structural cellular infiltration from boundary adjacency or density-driven colocalization.
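One way to picture a fixed-position null model is a label permutation test: cell coordinates (and therefore the territory) stay where they are, while cell-type labels are shuffled across cells. The sketch below is an illustration of that general idea, not ClusToRa's algorithm; the function name, the choice of a simple count as the test statistic, and the representation of a territory as a set of cell indices are all assumptions.

```python
import random


def recruitment_pvalue(labels, territory, cell_type, n_perm=999, seed=0):
    """Empirical p-value for enrichment of `cell_type` inside a territory.

    labels    : cell-type label per cell (positions are implicit and never move)
    territory : indices of the cells that lie inside the territory
    Returns (observed_count, p_value) with the usual +1 correction.
    """
    rng = random.Random(seed)
    observed = sum(1 for i in territory if labels[i] == cell_type)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute labels only; territory membership is fixed
        null_count = sum(1 for i in territory if shuffled[i] == cell_type)
        if null_count >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Toy data: 100 cells, 10 immune cells that all sit inside a 10-cell territory.
labels = ["immune"] * 10 + ["other"] * 90
territory = set(range(10))
observed, p_value = recruitment_pvalue(labels, territory, "immune")
```

Because positions are held fixed, local cell density cannot inflate the statistic under the null, which is the kind of density-driven colocalization the abstract aims to rule out.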
bioinformatics2026-05-27v1Transcriptomic Profiling and Regulatory Network Analysis of Ten Metabolic Transporters Across Five Diabetic Complications: A Multi-Dataset, Twelve-Phase GEO Bioinformatics Study
Adegboyega, B. B.; Ekanem, P. C.; Awolaja, O. O.; Osarietin, E.; Okorie, B.Abstract
Objective: Diabetic complications collectively represent one of the most urgent unresolved problems in medicine, yet the field continues to study them in near-complete isolation from one another. No unified framework has systematically characterised the shared and divergent molecular signatures of ten clinically critical metabolic transporters across all five major complications, cardiomyopathy (DCM), nephropathy (DN), retinopathy (DR), peripheral neuropathy (DPN), and atherosclerosis and vasculopathy (DAD), through an integrated, multi-method computational pipeline. This study was designed to address that gap directly. Methods: Eleven GEO microarray datasets comprising 118 diabetic and 76 control samples were analysed through twelve sequential phases: differential expression analysis, pan-complication overlap, weighted gene co-expression network analysis (WGCNA), GO/KEGG functional enrichment with gene set enrichment analysis (GSEA), STRING protein-protein interaction (PPI) network construction, competing endogenous RNA (ceRNA) network mapping, transcription factor activity inference using a VIPER-style algorithm, immune cell infiltration estimation by single-sample GSEA, diagnostic biomarker modelling using LASSO logistic regression and Random Forest classification, CMap-style drug repurposing by connectivity scoring, and two-sample Mendelian randomisation (MR) employing four independent estimators (inverse-variance weighted [IVW], MR-Egger, weighted median, and weighted mode). Results: CD36 was the only transporter to achieve significant dysregulation across three independently sourced tissue types (DN, DR, DPN; logFC range 0.88 to 2.18), whilst TLR4 exhibited the highest fold-change in the study (logFC = 3.88, DPN) and the greatest WGCNA module membership (kME = 0.976, DPN). 
SERCA2 was significantly downregulated in three complications (DCM, DN, and DR) at formal significance thresholds and trended negatively in the remaining two (DPN and DAD), constituting the most consistently suppressed transporter in the study. Its universal downregulation was explicable through four convergent mechanisms spanning transcriptional, oxidative, ceRNA-mediated, and transcription factor-level regulation, and was confirmed as causally relevant to diabetic cardiomyopathy by eQTL Mendelian randomisation (beta = -0.085, p = 0.005). miR-21-5p was identified as the dominant ceRNA regulatory bridge (betweenness centrality = 0.428; 6.7-fold above the second-ranked miRNA), with MALAT1 as the sole lncRNA hub active in all five complications. PPARgamma and TP53 repression emerged as the leading transcription factor-level explanations for the simultaneous metabolic and inflammatory dysregulation characteristic of the diabetic transcriptome. Immune deconvolution revealed DCM as immunologically quiescent, DN as comprehensively infiltrated (ten enriched cell types), and DPN as mast-cell-dominated, identifying a cellular mechanism for TLR4-driven neuroinflammation that has not previously been systematically characterised. GLUT4 achieved perfect diagnostic discrimination for DPN (AUC = 1.000, p < 0.001; LASSO coefficient = -2.143), whilst SGLT2 was the leading DAD diagnostic marker (AUC = 1.000, p = 0.002). Epalrestat was the sole pan-complication drug repurposing candidate (significant connectivity reversal in four of five complications). Mendelian randomisation confirmed causal effects of T2DM genetic liability on all five complications (all p < 0.0001, all four estimators concordant), and eQTL-MR identified TLR4 (beta = +0.073, p = 0.006) and CD36 (beta = +0.070, p = 0.008) as causal risk factors for DN, SERCA2 reduced expression as a causal driver of DCM (beta = -0.085, p = 0.005), and SGLT2 expression as a causal protector against DN (beta = -0.070, p = 0.013). 
Conclusions: This twelve-phase investigation identifies a pan-complication CD36/TLR4 inflammatory dyad and a SERCA2 calcium-mitochondrial effector axis, both confirmed at seven independent analytical levels, including causal genomic inference. GLUT4 downregulation defines DPN at the diagnostic level with perfect accuracy and is explicable through a five-layer mechanistic chain from MODY transcription factor inactivation to ceRNA competitive pressure. Epalrestat warrants prospective evaluation beyond its established DPN indication. These findings collectively constitute the most comprehensive computational characterisation of metabolic transporter biology in diabetic complications to date.
bioinformatics2026-05-27v1TIMS-Bench: Towards community standards for benchmarking untargeted trapped ion mobility metabolomics tools and datasets
Rajkumar, P.; Gadiya, Y.; Deleray, V.; Roux, A.; West, K. A.; Allen, A.; Dorrestein, P.; Domingo-Fernandez, D.; Misra, B. B.Abstract
Untargeted liquid chromatography-tandem mass spectrometry (LC-MS/MS)-based metabolomics is an important technology for unbiased discovery of small molecules in biomedical (e.g., drug discovery to diagnostics), animal, plant, environmental, and microbial research. Over the past decade, ion mobility has added a further dimension to the triplet of MS1, MS2, and retention time, helping to resolve co-eluting or isomeric features in LC-MS/MS data and thereby aiding compound identification. Here, we focused on evaluating the current trapped ion mobility spectrometry (TIMS)-amenable feature-finding tools (MZmine 4.9, MS-DIAL 5.5, and MetaboScape 2025 14.0.3) for pre-processing of metabolomics-scale data generated using a popular ion mobility mass spectrometry (IM-MS) technique, TIMS. We leveraged ten public and three benchmark TIMS datasets to evaluate these tools for their strengths and weaknesses. Our results show that MZmine consistently identified the highest number of features and confidently annotated features; however, this performance was accompanied by an increased number of false positives, due to peak splitting, as well as reduced accuracy in collision cross section (CCS) measurements. In contrast, MetaboScape achieved the highest fraction of high-quality MS2 spectra, reflecting a more conservative feature detection strategy. MS-DIAL demonstrated balanced performance, identifying features that other tools missed. Finally, we publicly release the ground-truth datasets and code to support future developments in improving IMS data analysis.
bioinformatics2026-05-27v1Slivka and Slivka-bio: a lightweight framework for presenting executables as web services and its application in bioinformatics.
Warowny, M.; Down, T.; Macgowan, S. A.; Mukhyala, K.; Barton, G. J.; Procter, J. B.Abstract
Motivation. Execution of code is critical for computational biology, but technical requirements can prevent others from running it. Public web-apps and services thus remain the most effective way to make code accessible, but no fully reusable infrastructure exists to help researchers do this. Results. We developed Slivka to enable easy provision of robust HTTP-based execution services backed by local or distributed hardware and accessible via curl and dedicated clients. We demonstrate it with Slivka-bio, which provides semantically annotated services for Jalview 2.12 (https://www.jalview.org/development/jalview_develop/) and includes 15+ tools for protein and RNA analysis. Slivka has been in production in academic and industry environments for 5 years and has run more than 1.5M jobs. Availability and Implementation. Slivka and Slivka-bio are released under the Apache 2.0 License. A public Slivka-bio instance is available at https://www.compbio.dundee.ac.uk/slivka with links to documentation, docker containers, and github repositories for Slivka-bio and Slivka.
bioinformatics2026-05-27v1ATLAS: a scverse-compatible package for multi-omic single-cell trajectory inference integration
Leclercq, A.; Martini, L.; Bardini, R.; Savino, A.; Di Carlo, S.Abstract
Single-cell trajectory inference is widely used to study cellular differentiation and fate decisions, yet most existing approaches rely on transcriptomic information alone, limiting their ability to capture the regulatory processes underlying cell-state transitions. This work presents ATLAS (Advanced Trajectory Learning from multi-omics At Single-cell resolution), a scverse-compatible framework for trajectory inference in paired single-cell RNA-seq and ATAC-seq data. ATLAS integrates transcriptomic and chromatin accessibility information through Weighted Nearest Neighbor graphs, enabling both molecular layers to jointly inform pseudotime estimation, terminal-state identification, and fate probability inference within a unified multi-omic representation. Across synthetic and real datasets, ATLAS reconstructs coherent developmental trajectories, captures progressive fate commitment, and resolves biologically meaningful lineage structures, demonstrating the effectiveness of multi-omic integration for characterizing cellular dynamics. In addition, ATLAS enables the joint exploration of transcription factor expression and target gene activity along pseudotime, providing direct access to regulatory programs and chromatin-associated transitions that are not detectable from transcriptomic data alone. Overall, ATLAS provides a scalable and biologically informative framework for studying dynamic cellular processes in single-cell multi-omics experiments.
bioinformatics2026-05-27v1Sequence-independent protein domain detection and classification with PRISM
Tan, A.; Seedorf, H.Abstract
The explosion of predicted protein structures has revealed countless novel domain families. However, gold-standard segmentation tools like Chainsaw and Merizo are trained on CATH databases that are rapidly becoming obsolete, lack automatic domain classification, and cannot be easily fine-tuned without deep learning expertise. We introduce PRISM, a unified framework enabling sequence-independent, one-shot fine-tuning for simultaneous domain segmentation and classification, bypassing traditional constraints to accurately resolve complex, novel protein architectures.
bioinformatics2026-05-27v1growthcurves: User-friendly tools for quality-controlled cellular growth analysis
Bradley, S. A.; Webel, H.; Donati, S.; Acevedo-Rocha, C.Abstract
Biological growth curves are widely used but inconsistently analyzed due to fragmented workflows and limited quality control. We present growthcurves, a Python package for extracting growth parameters, and two open-source web applications (MicroGrowth and AutoGrowth) enabling human-in-the-loop analysis of datasets from microplate reader or mini-bioreactor experiments in either batch or turbidostat cultivation mode. By combining automated fitting with convenient quality control, the platform improves reproducibility and reliability of growth-curve analysis.
bioinformatics2026-05-27v1Ordered Gromov-Hausdorff Metric: A New Tool for Comparative Analysis of Protein Structures
Timofeev, A.; Anufriev, A.Abstract
Motivation: Classical protein structure comparison metrics such as RMSD and TM-score effectively assess geometric similarity but ignore the linear order of amino acid residues (Zhang and Skolnick, 2004). The Gromov-Hausdorff (GH) metric compares metric spaces by shape but also does not account for order (Gromov, 1981). This can lead to incorrectly identifying proteins with swapped domains as similar. We introduce the Ordered Gromov-Hausdorff (OGH) metric, defined on ordered metric spaces, to incorporate residue order into the comparison. Results: OGH combines coordinate normalization, an exponential penalty for order violations, and a monotonic alignment algorithm with computational complexity O(n*w), where w is the search window width. It is proven that OGH satisfies all metric axioms for positive values of the weight parameter. Analytical properties include invariance under isometries, upper boundedness, Lipschitz continuity under small coordinate perturbations, and concavity in the weight parameter. On the VAD dataset (28 viral proteins from HIV-1, SARS-CoV-2, MERS-CoV), OGH increases monotonically with residue shuffling (up to 0.363 at 100% shuffling) and correlates strongly with TM-score (r = 0.706). In the task of separating homologs at fixed global similarity (TM-score ≈ 0.5), OGH achieves AUC = 0.800, whereas TM-score gives AUC = 0.467, demonstrating that OGH detects conserved order even when global geometry is not conserved. Availability: The Python source code for OGH is freely available at https://github.com/andytimoffilim/OGH. The VAD dataset (PDB IDs listed in the paper) is publicly accessible from the RCSB Protein Data Bank (Berman et al., 2000; wwPDB, 2019).
bioinformatics2026-05-27v1There and back again: a multi-omics tale of thyroid co-expression network rewiring
Pozhidaeva, M.; Bussmann, H.; Huisinga, M.; Buesen, R.; Hackermüller, J.; Canzler, S.Abstract
The integration of multi-omics data offers unprecedented insight into complex biological systems but presents significant analytical challenges. In this study, we propose a best-practice framework for constructing simultaneous weighted gene co-expression networks (WGCNA) from transcriptomics, proteomics, and metabolomics data. Using a rodent model of thyroid toxicity induced by propylthiouracil (PTU), we analyzed thyroid tissues from control, treated, and recovery groups. We demonstrate that concatenating individually processed omics layers at the sample level--without additional scaling--preserves meaningful correlation structures and reflects best practices for biologically interpretable network construction. Co-expression networks were constructed for each group, revealing extensive disruption of molecular interactions under treatment and partial restoration during recovery. We highlight the complementary strengths of two analytical strategies: module preservation analysis identifies disrupted co-regulatory structures, while differential connectivity analysis detects feature-level rewiring events. As a methodological advance, we introduce a permutation-based approach for calculating feature-specific p-values for differential connectivity (DiffK), enabling robust statistical inference. This strategy uncovered over 4,400 significantly rewired features, many of which showed stable expression, underscoring the added value of network-based analyses. Our findings demonstrate the utility of integrated multi-omics WGCNA and differential network analysis in capturing dynamic, system-wide regulatory changes.
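A permutation test for per-feature differential connectivity could be organized along the following lines. This sketch is only a generic illustration and should not be read as the authors' DiffK procedure: the connectivity definition (sum of absolute Pearson correlations raised to a soft-threshold power, scaled by the maximum), the default power of 6, the two-sided comparison, and every function name are assumptions. The core idea shown is that shuffling samples between two groups yields a null distribution for each feature's connectivity difference.

```python
import random


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def connectivity(rows, power=6):
    """Scaled connectivity per feature; rows are samples, columns are features."""
    cols = list(zip(*rows))
    k = [sum(abs(pearson(ci, cj)) ** power
             for j, cj in enumerate(cols) if j != i)
         for i, ci in enumerate(cols)]
    top = max(k) or 1.0  # avoid dividing by zero when nothing is correlated
    return [v / top for v in k]


def diffk(rows_a, rows_b, power=6):
    """Difference in scaled connectivity between two sample groups."""
    ka, kb = connectivity(rows_a, power), connectivity(rows_b, power)
    return [a - b for a, b in zip(ka, kb)]


def diffk_pvalues(rows_a, rows_b, n_perm=200, power=6, seed=0):
    """Two-sided permutation p-value per feature, shuffling samples across groups."""
    rng = random.Random(seed)
    observed = diffk(rows_a, rows_b, power)
    pooled = list(rows_a) + list(rows_b)
    hits = [0] * len(observed)
    for _ in range(n_perm):
        rng.shuffle(pooled)
        null = diffk(pooled[:len(rows_a)], pooled[len(rows_a):], power)
        for i, (obs, val) in enumerate(zip(observed, null)):
            if abs(val) >= abs(obs):
                hits[i] += 1
    return observed, [(h + 1) / (n_perm + 1) for h in hits]
```

With thousands of features, the resulting p-values would normally be corrected for multiple testing before calling a feature rewired; that step is left out here.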
bioinformatics2026-05-27v1Unveiling the Terra Cognita of Sequence Spaces using Cartesian Projection of Asymmetric Distances
Ramette, A.Abstract
Visualizing relationships within massive biological datasets remains a significant challenge, particularly as sequence length and volume increase. We introduce CAPASYDIS (Cartesian Projections of Asymmetric Distances), a scalable approach designed to map the explored regions of a given sequence space. Unlike traditional dimensionality reduction methods, CAPASYDIS calculates asymmetric distances which account for both the position and type of sequence variations. It projects sequences into a fixed, low-dimensional coordinate system, termed a "seqverse", where each sequence occupies a permanent location. This design allows for the instant mapping of new sequences without the need to recalculate the global structure, transforming sequence analysis from a relative comparison into navigation on a standardized map. We applied this method to a large rRNA sequence dataset spanning the three domains of life. Our results demonstrate that the sequences of Bacteria, Archaea, and Eukaryota occupy spatially distinct regions characterized by fundamentally different shapes and patterns of variation. Furthermore, the resulting seqverses retain a high amount of taxonomic information when analyzed from broad domain levels down to single-base differences. Overall, CAPASYDIS provides a reproducible, scalable framework for defining the boundaries and topography of biological sequence universes.
bioinformatics2026-05-26v3GAP-MS: Automated validation of gene predictions using integrated mass spectrometry evidence
Abbas, Q.; Wilhelm, M.; Kuster, B.; Frischman, D.Abstract
Accurate genome annotation is fundamental to modern biology, yet distinguishing authentic protein-coding sequences from prediction artifacts remains challenging, particularly in complex plant genomes where automated methods are error-prone and manual curation is rarely feasible due to prohibitive time and costs. Here, we present GAP-MS (Gene model Assessment using Peptides from Mass Spectrometry), an automated proteogenomic pipeline that leverages mass spectrometry evidence to systematically validate the protein-level accuracy of predicted gene models. Applied across 9 major crop species, GAP-MS consistently improved prediction precision for four widely used gene prediction tools. In addition to filtering erroneous models, the pipeline identified hundreds of previously missing gene models from current standard reference annotations. These peptide-supported loci were further verified by transcriptional evidence, well-supported functional annotations, and high coding-potential scores. Together, these results demonstrate that direct proteomic evidence provides a robust framework for resolving annotation ambiguities, defining high-confidence reference proteomes, and uncovering overlooked protein-coding genes, while facilitating the identification of sequences that may require further investigation.
bioinformatics2026-05-26v3WITHDRAWN: Scalable Microbiome Network Inference: Mitigating Sparsity and Computational Bottlenecks in Random Effects Models
Roy, D.; Ghosh, T. S.Abstract
The authors have withdrawn their manuscript because the biological validations associated with the inferred microbial interaction directions are currently incomplete and require further verification. We are actively validating these biological directions and ensuring the scientific validity of the reported findings before any further dissemination. Therefore, the authors do not wish this work to be cited as a reference for the project at this stage. If you have any questions, please contact the corresponding author.
bioinformatics2026-05-26v2OryzaG3: A Single-species Genomic Foundation Model Pretrained on Rice Pangenome
Yang, L.; Xia, Y.; Yang, Z.; Xia, C.; Wu, T.; Zou, M.; Xia, Z.Abstract
While multi-species genomic language models have advanced biological representation learning, high-quality, single-species foundation models for crops remain scarce. Leveraging recently expanded rice pangenome resources, we introduce OryzaG3, a species-specific DNA language model with 700M parameters. OryzaG3 was pretrained on 59.20 Gb of chromosome-level sequences from 149 high-quality rice genomes using a non-overlapping 3-mer tokenization strategy and a causal language modeling objective, featuring context-length variants up to 32k tokens. On the Plants Genomic Benchmark polyA prediction task, OryzaG3 achieves competitive predictive performance against leading multi-species models while delivering a four-fold increase in inference throughput under identical long-context conditions. Ultimately, OryzaG3 demonstrates that lightweight, single-species foundation models trained on high-quality pangenomes can match multi-species benchmarks while significantly reducing computational overhead. This work provides a scalable framework for rice functional genomics, molecular breeding, and targeted crop foundation model development.
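Non-overlapping 3-mer tokenization, as named in the abstract above, can be illustrated with a short sketch. The vocabulary layout, the handling of ambiguous bases, and the decision to drop a trailing remainder shorter than three bases are all assumptions made here for illustration; the abstract does not specify how OryzaG3 treats these cases, and the names `VOCAB`, `UNK_ID`, `tokenize` and `encode` are hypothetical.

```python
from itertools import product

K = 3
# 4^3 = 64 possible 3-mers over A, C, G, T, in lexicographic order.
VOCAB = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=K))}
UNK_ID = len(VOCAB)  # assumed id for any 3-mer containing a non-ACGT base


def tokenize(seq, k=K):
    """Split a DNA string into consecutive, non-overlapping k-mers.

    A trailing remainder shorter than k is dropped (an assumption of this sketch).
    """
    seq = seq.upper()
    usable = len(seq) - len(seq) % k
    return [seq[i:i + k] for i in range(0, usable, k)]


def encode(seq):
    """Map a DNA string to integer token ids, using UNK_ID for unknown 3-mers."""
    return [VOCAB.get(tok, UNK_ID) for tok in tokenize(seq)]


print(tokenize("ACGTACGT"))  # ['ACG', 'TAC']
print(encode("AAAACGNNN"))   # [0, 6, 64]
```

Because each token covers three bases, a context of 32k tokens would span on the order of 96 kb of sequence under this scheme, which is one reason a non-overlapping tokenization suits long-context modeling.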
bioinformatics2026-05-26v1IID-KG: An ontology-aligned literature-derived knowledge graph for infectious and immune-mediated diseases
PAN, F.; Zhang, Y.; Wang, J.; Liu, M.-C.; Sui, X.; Yue, H.; Zhang, J.Abstract
Infectious and immune-mediated diseases (IIDs) represent a broad and rapidly expanding biomedical literature domain in which scalable evidence extraction, disease ontology refinement, and interpretable knowledge integration are essential for biomedical discovery. We constructed an IID-specific biomedical knowledge graph (IID-KG) from PubMed abstracts and PMC full-text articles by integrating nested named entity recognition, ontology-guided identifier assignment, full-text relation extraction, and relation-resolution strategies. A gold-standard corpus of 500 PubMed abstracts and 8 PMC full-text articles was manually annotated for nested biomedical entities across six entity types. The resulting models were applied to 30,128,068 PubMed abstracts and 1,385,500 IID-related PMC full-text articles. A unified IID ontology was developed from 411,341 disease terms using hierarchical text classification, large language model-based refinement, ontology cross-referencing, and expert review, yielding 179,657 confirmed MeSH mappings. The final IID-KG contains approximately 1,837,513 unique entities and 16,295,390 unique relations across eight relation types. The resource was released publicly together with repurposing workflows, supporting ontology-aligned literature mining, disease mechanism analysis, and drug-repurposing hypothesis generation for IID research.
bioinformatics2026-05-26v1Prioritizing peptides for targeted mass spectrometry experiments using deep learning
Sonthalia, S.; Dasgupta, P.; Hsu, C.; Wen, B.; MacCoss, M. J.; Noble, W. S.Abstract
One critical step in any targeted mass spectrometry experiment is selecting, from each protein of interest, a small number of peptides that respond well in the mass spectrometer and can serve as reliable proxies for protein quantification. Existing methods select target peptides either by relying on prior empirical measurements, limiting their applicability to previously observed peptides, or using machine learning to predict peptide behavior from sequence alone. However, current machine learning tools suffer from various limitations, including using detectability as an indirect proxy for intensity, relying on small training sets, or ignoring the precursor charge state. In this study, we introduce Bromo, a transformer-based deep learning model that ranks peptide precursors from a given protein by their relative response, taking charge state into account. Trained on millions of annotated peptide pairs derived from large-scale, publicly available data-independent acquisition mass spectrometry data, Bromo consistently outperforms existing sequence-based methods across diverse, independent datasets. Furthermore, we show that fine-tuning Bromo on experiment-specific data can account for differences in sample preparation, sample matrix, and instrument platform, all of which influence which peptides serve as optimal targets. This adaptability makes Bromo a practical tool for selecting target peptides for selected reaction monitoring and parallel reaction monitoring assay development across a wide range of experimental conditions.
bioinformatics2026-05-26v1Gene-Specific Analysis of Clonal Hematopoiesis Identifies ASXL1 as a Risk Factor for Lung Cancer
Zhang, Z.; Dong, J.; Huang, Y.; Liu, Y.; Amos, C. I.; Cheng, C.Abstract
Clonal hematopoiesis of indeterminate potential (CHIP) is a recognized risk factor for hematologic malignancies, but its contribution to different types of solid cancers remains incompletely defined. Here, we performed a systematic, gene-specific analysis of CHIP across 19 common solid cancer types using two large population-based cohorts, the UK Biobank and All of Us. Using Cox proportional hazards models and nested case-control logistic models, we demonstrate that the relationship between CHIP and solid tumors is highly cancer-type specific, with lung cancer exhibiting the strongest association. In lung cancer, this association is largely driven by ASXL1-mutant clones. Specifically, high variant allele fraction (high-VAF) ASXL1 conferred a significantly increased risk (hazard ratio = 3.2), and the associations remained robust after adjustment for age, sex, body mass index (BMI), smoking status, and genetic ancestry. Notably, ASXL1 CHIP was substantially enriched among smokers, and its association with lung cancer risk was restricted to ever-smokers, highlighting a key interaction between CHIP and environmental exposure. The enrichment of ASXL1 CHIP in lung cancer was further validated in two independent cancer-only cohorts, MSK-IMPACT and TCGA. In addition, rare germline variant association analysis revealed that germline variation in ASXL1 had the strongest association with lung cancer susceptibility among all solid tumors. Collectively, our findings support a model in which smoking-associated expansion of ASXL1-mutant clones contributes to lung cancer development and suggest that gene-specific CHIP metrics may enhance risk stratification and early detection strategies.
bioinformatics2026-05-26v1SynFit: Synergistic Contrastive Learning for Multi-Objective Protein Fitness Prediction and Optimization
Tu, T.; Huang, W.; Li, Z.; Ding, K.; Yang, Y.; Luo, Y.Abstract
Proteins function through a complex interplay of structural and biochemical properties, and mutations can reshape these properties to generate fitness landscapes spanning multiple functional objectives. A central challenge in protein engineering is the need to simultaneously optimize multiple properties. In biocatalysis, for example, practical enzyme development routinely requires the concurrent optimization of catalytic activity, selectivity, stability, and substrate generality. However, despite recent advances in computational protein design and fitness prediction, most existing approaches treat these properties independently and do not explicitly capture the dependencies and trade-offs that govern real-world protein performance. We present SynFit, a multi-objective learning framework that integrates pretrained protein language models with experimental fitness measurements for protein fitness prediction and engineering. SynFit learns both shared and property-specific protein sequence representations through a synergistic contrastive learning strategy, enabling the identification of variants that simultaneously optimize multiple functional properties. Across a large-scale multi-fitness deep mutational scanning benchmark, SynFit consistently outperforms state-of-the-art supervised models trained on individual objectives and more accurately identifies variants that balance competing functional constraints. We further applied SynFit to multi-objective enzyme design for a new-to-nature biocatalytic enantioselective borylation reaction, providing a diverse array of novel cytochrome c sextuple variants in a single round of design with simultaneously improved catalytic activity and enantioselectivity that rival the best variants obtained through directed evolution.
Together, these results establish SynFit as a general framework for multidimensional protein fitness prediction and highlight its potential to enable efficient multi-objective optimization in protein engineering, particularly in biocatalysis.
bioinformatics2026-05-26v1Faithful Supervised Dimensionality Reduction for Biomedical Data via Decision Geometry
Wang, Z.; Zhou, Z.; Zhan, Q.; Shen, L.Abstract
Unsupervised dimensionality reduction methods aim to preserve intrinsic data geometry by maintaining local neighborhoods and approximate global relationships in low-dimensional embeddings, but they do not use label information and therefore may fail to reflect task-relevant class structure in biomedical and health applications. Supervised dimensionality reduction (SDR) incorporates labels to improve class organization, yet existing approaches often face a trade-off between discrimination and geometric faithfulness. Linear supervised methods are stable and interpretable but are limited in their ability to capture nonlinear structure, whereas many nonlinear methods impose supervision directly in the embedding space, which can over-separate classes and distort the underlying manifold. In biomedical applications, labels such as cell types in single-cell data or patient status in clinical cohorts provide meaningful biological signal, and supervised dimensionality reduction can use this information to produce more informative low-dimensional representations. Here we propose a new framework, DG-UMAP (Decision-Geometry UMAP), for faithful supervised dimensionality reduction via decision geometry. We first fit a classifier in the original feature space and use its boundary-local decision geometry to construct a low-rank metric deformation that emphasizes discriminative directions while limiting geometric distortion. Parametric UMAP is then applied to the transformed space, so supervision acts through the ambient geometry rather than by directly forcing class separation in the embedding. Across synthetic and multiple real-world biomedical datasets, our method yields embeddings with improved agreement with class structure and global organization while preserving local neighborhood quality.
bioinformatics2026-05-26v1Application of Computer Vision Tools to Maize Genomic Data for Trait Prediction and Gene Discovery
Higgins, S. A.; Anible, E.; Muthupari, M.; Dibble, C.; Murdoch, R. W.Abstract
Artificial intelligence and machine learning for computer vision (CV) and image recognition is a rapidly evolving field with multiple potential applications in plant genomics. While CV has been widely adopted by the research community for plant phenotyping and disease surveillance, applications of CV tools to plant genome analysis are underrepresented. CV tools may complement traditional statistical classification tools used in plant genomics, since CV perceives problems holistically rather than granularly (in terms of pattern recognition), which is particularly applicable to analysis of large, complex eukaryotic genomes. In this study, we report on a new strategy to apply existing CV tools to classify plant genotypes and predict genotype-phenotype relationships. A technique was developed for converting maize genome resequencing data into a set of images reminiscent of a quick response (QR) code. Several hundred maize genomes were processed and it was demonstrated that CV models can successfully categorize genome images into heterotic groups (accuracy and recall > 0.8). Models for classifying genome images into phenotypic trait groups (such as short, medium, and high plant height) performed with moderate success for the most heritable trait analyzed (ear height; accuracy and recall > 0.5). Querying model results permitted identification of genome regions that were important for model classification predictions. The CV model results revealed enriched metabolic pathways consistent with traits under consideration. Overall, our initial application of CV tools to plant genome analysis highlights its applicability to genomic data. Design of new CV architectures optimized for genome-derived images may further improve upon our initial results generated using only off-the-shelf CV tools optimized for unrelated image analysis tasks.
bioinformatics2026-05-26v1Pathogen-specific antimicrobial activity prediction with biological large language model-based methods
Ucar, B.; Demirsoy, E.; Salehi, A.; Sutherland, D.; Yanai, A.; Coombe, L.; Thompson, V. C.; Warren, R. L.; Helbing, C. C.; Birol, I.Abstract
Driven by the rise of antimicrobial resistance, antimicrobial peptides (AMPs) have emerged as promising therapeutics capable of targeting multidrug-resistant pathogens. Because identifying AMPs and their specific targets requires costly and labor-intensive wet-lab experiments, in silico methods to prioritize candidates are highly valuable. However, current computational methods often lack pathogen specificity or fail to incorporate crucial targeted proteomic and genomic contexts. To bridge this gap, we developed triAMPh, a robust, zero-shot framework for pathogen-specific peptide bioactivity prediction. triAMPh integrates a heterogeneous graph attention network-based link predictor (HLP), Extreme Gradient Boosting, and a multilayer perceptron trained on features from biological large language models (bLLMs). Our novel HLP constructs a knowledge graph that maps peptides and pathogens as distinct nodes, connected by similarity and bioactivity edges. The model extracts information through semantic traversals, prioritizing neighboring nodes and their biological contexts. Benchmarking shows that triAMPh provides unbiased, peptide- and pathogen-centered zero-shot predictions, matching or outperforming state-of-the-art methods across all metrics except precision. Ultimately, triAMPh offers a powerful computational tool to accelerate wet-lab AMP discovery while demonstrating the capability of bLLMs to capture complex, pathogen-specific bioactivity patterns.
bioinformatics2026-05-26v1Decoding Multicellular Communication Motifs from Spatial Transcriptomics with ALARMIST
Fan, J.; Hood, J.; Strong, J.; Quinn, J. F.; Dai, Y.; Data Science TeamLab, ; Schein, A.; Yu, K. K. H.; Tansey, W.Abstract
Cellular organization is driven by recurrent, coordinated interactions between multiple cell types, each sending and receiving multiple signals. Existing computational methods for spatial profiling data consider only individual ligand-receptor interactions and fail to capture the higher-order interactions governing the tissue microenvironment. To address this gap, we developed ALARMIST (Assessment of Ligand And Receptor Motifs And Impacts in Spatial Transcriptomics), a probabilistic framework that infers interpretable multicellular communication patterns from spatial data. ALARMIST decomposes neighborhood-level signaling patterns into motifs: recurrent communication subnetworks involving multiple cell types and sets of enriched ligand-receptor interactions. For each cell, ALARMIST identifies its active motifs and estimates the downstream phenotypic effects of each motif on active cells. We applied ALARMIST to spatial datasets of lung adenocarcinoma (LUAD) and glioblastoma (GBM) to identify microenvironmental drivers of tumor progression. In paired LUAD and adenocarcinoma-in-situ (AIS) samples, ALARMIST identified an immune-active vascular motif at the tumor-normal boundary and implicated motif-active plasmacytoid dendritic cells as drivers of inflammation in early carcinogenesis. In matched low- and high-grade glioma samples, ALARMIST identified a hub-and-spoke motif centered on a malignant macrophage subpopulation, implicating a GRN-SORT1 signaling axis with a downstream impact gene set predictive of survival in low-grade glioma patients. Code for ALARMIST is available at https://github.com/tansey-lab/alarmist.
bioinformatics2026-05-26v1Integrated optimization of experimental and computational workflows improves genome recovery in long-read gut metagenomics
Hu, Y.; Sun, L.; Huang, Y.; Jiang, F.; Tong, X.; Yang, J.; Ju, Y.; Yang, Z.; Liufu, S.; Hu, Y.; Ma, W.; Guo, R.; Li, W.; Zhang, T.; Zhu, X.; Zhang, Z.Abstract
Short-read metagenomic sequencing is widely applied in microbiome research because of its high quality and increasingly affordable cost. However, its short, fragmented reads limit assembly contiguity and the recovery of complete microbial genomes. In contrast, long-read sequencing, with substantially longer read lengths, can help overcome these limitations. Achieving complete and accurate genome recovery is a central goal in metagenomics. To advance this goal, we present a systematic effort to unify and optimize the long-read sequencing workflow, from experimental sample processing to computational genome assembly, using the CycloneSEQ platform.
bioinformatics2026-05-26v1