Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
Integrated Multi-Omics Analysis for the Identification of Disease-Associated Variations and Prognostic Biomarkers in Triple-Negative Breast Cancer (TNBC)
MANNEKUNTA, N.; NATRAJAN, E.
Abstract
Background: Triple-negative breast cancer (TNBC) exhibits high molecular heterogeneity. While multi-omic panels capture disease complexity, translating these profiles into actionable, cost-effective prognostic tools remains analytically challenging. Objective: To mathematically distill a high-dimensional multi-omic profile into a lean, highly predictive biomarker panel. Furthermore, we aimed to construct, validate, and clinically anchor a prognostic survival nomogram. Methods: Matched TCGA-BRCA transcriptomic and epigenomic data (n=5546) were integrated utilizing MOFA2. Functional pathways were mapped via the Enrichr database against Reactome, KEGG, and WikiPathways libraries. A machine learning ensemble (LightGBM, Random Forest) optimized the discovery signature. Prognostic stability was validated via Kaplan-Meier stratification, continuous Z-score Multivariate Cox Regression, and Time-Dependent ROC modeling. The tumor microenvironment was profiled via ssGSEA, and immunotherapy checkpoint correlation was assessed. External validation was executed on a microarray cohort (GSE58812). Results: A 47-gene discovery signature was computationally optimized into a 15-gene clinical panel (Internal AUC = 0.9898). Kaplan-Meier analysis demonstrated profound prognostic separation (p < 0.0001). Multivariate Cox regression confirmed the signature's independent prognostic value (Hazard Ratio = 10.67, p < 0.001). Immune profiling revealed the signature is driven by tumor-intrinsic factors, showing no significant correlation with local checkpoint expressions like PD-L1 (p = 0.72). External validation achieved an integrated multi-covariate AUC of 0.6874. Conclusion: This optimized 15-gene signature and the associated clinical-genomic nomogram provide an accurate, independent, and generalizable framework for individualized TNBC survival prediction.
bioinformatics 2026-05-29 v3
Effects of Structural Reward Shaping on Biophysical Properties in RL-Trained Plasmid Generators
Thiel, M.; Cunningham, A.; Barnes, C. P.
Abstract
We compare the efficacy and distributional effects of supervised fine-tuning (SFT) and reinforcement learning (RL) post-training for PlasmidGPT, a foundation model for whole-plasmid generation, using Group Relative Policy Optimization (GRPO) for the RL model. Using a biologically motivated reward function encoding functional annotations, length constraints, and repeat penalties, the RL model achieves a 71.6% quality-control pass rate across 8 prompts on 4,000 sequences, compared to 4.3% for the pretrained baseline and 11.0% for SFT. A five-model reward ablation identifies the cassette arrangement bonus, which rewards correct promoter→CDS→terminator ordering, as the critical reward component. Rejection sampling baselines indicate that the gain is not recovered by sampling more heavily from the base model. Beyond directly optimized features, RL-generated sequences converge toward real plasmid distributions in 3-mer composition and minimum free energy density, neither of which is directly optimized by the reward function. Minimum free energy density independently converges to the real-plasmid regime under both SFT and RL despite these being parallel post-training paths. On a small curated hold-out set, RL improves continuation log-likelihood over the pretrained baseline on all 29 held-out sequences (mean Δ = +0.83 nats).
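As a rough illustration of the "group relative" part of GRPO mentioned above: in its commonly described form, each sampled sequence's reward is normalized against the other samples drawn for the same prompt. The sketch below assumes that standard formulation; the abstract does not say exactly how the authors compute advantages, and the reward values are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and std of its own sampling group.

    `rewards` holds the scalar rewards of all sequences sampled for one prompt.
    `eps` (an assumption) avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four plasmids generated from the same prompt.
advantages = group_relative_advantages([0.2, 0.9, 0.4, 0.5])
```

Sequences that score above their group's mean receive a positive advantage and are reinforced; those below the mean are pushed down, without needing a separately trained value model.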
bioinformatics 2026-05-29 v3
WITHDRAWN: Rescuing true protein binders from AI hallucinations via zero-shot, ensemble-driven statistical physics scoring
Chou, C.-H.; Hong, X.; Xu, J.
Abstract
The authors have withdrawn their manuscript because it contains proprietary data and was submitted without proper corporate authorization from MoleculeMind. Therefore, the authors do not wish this work to be cited as reference for the project. If you have any questions, please contact the corresponding author.
bioinformatics 2026-05-29 v2
A Conditional Variational Autoencoder with QSAR-Guided Surrogate-Weighted Fine-Tuning and Cross-Entropy Optimization for Targeted Antimicrobial Peptide Generation
Castanon, I.; Wan, F.; de la Fuente-Nunez, C.; Pini, A.; Falciani, C.
Abstract
Machine learning frameworks have emerged as a promising tool for antimicrobial peptide design; however, generative models remain limited by two persistent problems: the limited availability of experimentally validated peptides and the circular dependency of the models. In this work we present a conditional variational autoencoder pipeline that addresses both limitations through a modular architecture that combines binary and quantitative experimental data and implements a multimodal approach to externally guide the generation. A transformer-based encoder generated a discriminative 64-dimensional latent space (test AUROC 0.968, F1 0.919) separating antimicrobial from non-antimicrobial sequences. This latent representation conditions a species-specific LoRA fine-tuned ProtGPT2 decoder through a scalar gating function, which generates balanced antimicrobial peptides through two different modes, prior and perturb, depending on their generation starting points. We introduced a Surrogate Weighted Fine-Tuning (SWF) ensemble to eliminate the circular dependency and a Cross-Entropy Method to explore and exploit the latent space, leading to successful antimicrobial peptide generation. The best candidates exhibited competitive physicochemical characteristics, a mean helical fraction of 0.874 (mean pLDDT 83.7), and externally predicted efficacy evaluated by APEX.
bioinformatics 2026-05-29 v2
nail: software for high-speed sequence annotation with profile hidden Markov models
Roddy, J. W.; Rich, D. H.; Wheeler, T. J.
Abstract
Background: Profile hidden Markov models (pHMMs) deliver state-of-the-art sensitivity for sequence annotation, but the Forward/Backward algorithm fills a dynamic programming matrix sized by the product of model and sequence lengths, making pHMM search slower than fast heuristic alignment tools like MMseqs2 by an order of magnitude or more. Results: We introduce nail, which approximates Forward/Backward by computing only a sparse cloud of high-probability matrix cells, recovering accurate pHMM scores, E-values, and alignments at a fraction of the cost. nail annotates the ~2.4 billion protein MGnify metagenomic dataset with all of Pfam in 73.9 hours on a single 48 core machine, recovering most of HMMER's recall advantage over MMseqs2, with run time ~8.7x faster than HMMER3. Detailed analysis of HMMER-only multi-domain hits suggests that many of these matches missed by nail are the result of accumulating score across short, often fragmentary alignments to repetitive regions of the target, consistent with spurious hits rather than genuine homology. We also derive a closed-form approximation for single-sequence E-value calibration, eliminating a per-model simulation step. nail is released under the open BSD-3-clause license at https://github.com/TravisWheelerLab/nail.
bioinformatics 2026-05-29 v2
bMAE: Masked Autoencoder Latent Representations for Bulk RNA-seq Tissues
Wan, Z.; Untalan, M. Z. G.; Vasconcellos Vargas, D.
Abstract
Bulk tissue RNA-sequencing data from large-scale consortia such as GTEx provide comprehensive gene expression profiles across diverse human tissues. However, the high-dimensional nature of bulk RNA-seq data, combined with technical noise and batch effects, poses challenges for downstream analyses. While dimensionality reduction methods are routinely applied, standard approaches such as PCA often fail to optimally preserve tissue-discriminative information and exhibit poor generalization to unseen tissue types. We developed a masked autoencoder for bulk tissue RNA-seq that learns compressed latent representations through self-supervised learning with variable masking schedules. Evaluated on GTEx data (31 tissue types, 19,788 samples, 19,308 genes), our method substantially outperformed all baselines in leave-one-tissue-out (LOTO) cross-validation. We achieved mean silhouette 0.20, ARI 0.58, and NMI 0.84 versus best baseline UMAP (0.007, 0.25, 0.47), representing 28.6-fold, 2.3-fold, and 1.8-fold improvements. Remarkably, within held-out tissue categories containing subtissues, our latent space revealed hierarchical structure with enhanced subtissue separation (silhouette 0.16, ARI 0.35, NMI 0.36) versus baselines (silhouette 0.15, ARI 0.20, NMI 0.20), despite never observing these distinctions during training. The method compressed 19,308 genes to 128 dimensions while preserving multi-scale structure.
bioinformatics 2026-05-29 v2
Accurate Identification of Functional Residues Across the Human Proteome with TAMALE
Van Riper, J.; Corsaro, B. J.; Pillon, M. C.
Abstract
The central challenge of the post-AlphaFold era is the "functional gap". Despite having structure predictions for nearly every human protein, we remain unable to systematically distinguish residues that drive activity from those that merely maintain structural integrity. Here we introduce TAMALE, a machine learning model that calculates graded residue-level functional scores across the human proteome without prior annotation. TAMALE transforms structure models and variant effect predictions to identify residues involved in catalysis, ligand binding, nucleic acid interactions, and regulation. The model also distinguishes pseudo-enzymes from catalytically active homologs. The model was validated across 20 case studies along with experimental characterization of the FASTKD5 ribonuclease, demonstrating its utility for functional discovery. Applied proteome-wide to 19,528 human proteins, TAMALE generates testable hypotheses enabling mechanistic discovery at scale.
bioinformatics 2026-05-29 v1
Just Add Structure: Protein Language Models Combined with Structural Equivariance Excel at Protein Tasks
Deane, C.; Cagiada, M.; Qurat-ul-ain, Q.; Whye Teh, Y.; Outeiral Rubiera, C.
Abstract
Accurate in silico prediction of protein properties, functional fitness, and mutational effects remains a central challenge in protein engineering and therapeutic design. While Protein Language Models (PLMs) successfully capture rich evolutionary and functional constraints from sequence data, they only indirectly encode the spatial and geometric information that fundamentally governs protein function. Consequently, state-of-the-art approaches typically rely on extensive fine-tuning, ensembling, or the incorporation of handcrafted structural features to achieve competitive accuracy, making them computationally expensive and difficult to scale. In this work, we demonstrate that explicit geometric modeling can substitute for, and in most cases outperform, large-scale PLM fine-tuning, with much higher parameter efficiency. Our approach, ProtEGNN, pairs PLM residue representations with a lightweight E(3) Equivariant Graph Neural Network, competing with or achieving state-of-the-art performance across eight different benchmarks in protein property, mutational effect and function prediction, while needing 100-1000x fewer parameters than competing methods. Even when protein structure is combined with representations from ESM2-T6, a small 8M-parameter PLM, ProtEGNN matches fine-tuned sequence-only approaches based on substantially larger PLM backbones, while training orders of magnitude fewer parameters. Together, these results highlight geometric inductive bias as a powerful and scalable alternative to task-specific fine-tuning of large PLMs for protein modeling.
bioinformatics 2026-05-29 v1
CellExLink: End-to-end cell-type recognition and normalization in biomedical text
Nabijiang, A.; Shahriyari, L.
Abstract
Cell-type extraction is an important task in biomedical text mining because biomedical literature contains evidence about cell types and cell-type-related biological interactions that supports studies of disease mechanisms, therapeutic response, and translational biomedical modeling. However, current biomedical text-mining systems either do not explicitly support cell-type extraction, provide limited support for Cell Ontology normalization, or achieve limited accuracy for end-to-end cell-type extraction. These limitations can affect downstream tasks that depend on reliable cell-type information. Here, we present CellExLink, an end-to-end biomedical natural language processing pipeline designed specifically for cell-type recognition and Cell Ontology normalization in biomedical text. The pipeline is designed to improve extraction accuracy and practical usability in literature-mining workflows, while accounting for computational efficiency in its recognition and normalization design. We evaluate CellExLink across heterogeneous biomedical corpora and compare it with established and recent biomedical text-mining tools. The results show that CellExLink provides reliable cell-type recognition, Cell Ontology normalization, and end-to-end extraction across these corpora. By addressing the need for reliable end-to-end cell-type recognition and Cell Ontology normalization, CellExLink can support downstream tasks such as curation, search, relation extraction, and knowledge graph construction.
bioinformatics 2026-05-29 v1
DINMC: A Deep Learning Framework for Interpretable Normative Model Construction and Pathological Brain Alteration Detection
Ge, Z.; Liu, S.; Dou, W.
Abstract
Background and Objective: Normative modeling is a key tool for understanding brain alterations in neurodegenerative diseases, such as cerebellar-type multiple system atrophy. However, existing methods lack interpretability and fail to capture clinically meaningful pathological changes. This study presents DINMC, a Deep Interpretable Normative Model Construction framework, which combines autoencoder-based learning with statistical hypothesis testing to better capture and interpret disease-specific neuroanatomical changes. Methods: The DINMC framework constructs normative models using neuroimaging data from large multi-site healthy cohorts. It utilizes a U-shaped convolutional autoencoder to train these models, which are then applied to reconstruct brain features from both patients and healthy controls within the same study cohort. Pathological confidence values are derived by fusing original and deviation feature spaces, offering a measure of disease-related pathology reflected in each dimension of the features. The framework was validated through statistical analysis and prognostic classification and regression tasks. Results: The pathological confidence provides valuable insights into the neuroanatomical regions most affected by the disease, as well as the correlation between changes in these regions and clinical assessment scales. Our optimal model outperforms traditional methods in prognostic prediction tasks, with an AUC of 0.972 for classification tasks and an R-squared of 0.432 for regression tasks. Conclusion: DINMC provides a novel and interpretable framework for neuroimaging analysis. By combining deep learning and statistical hypothesis testing, this framework offers a unique solution to improving both the interpretability and performance of normative models in neuroimaging. The approach is scalable to other neuroimaging datasets, offering a versatile tool for broader biomedical applications.
bioinformatics 2026-05-29 v1
ctSpyderFields: A Python package for visual field reconstruction in spiders
De Agro, M.; Caradonna, D.; Pande, A.; Falotico, E.; Sumner-Rooney, L.
Abstract
The measurement of visual fields in arachnology has a long-standing history. Given the wide variety of eye positions, orientations and structures, the topic is fundamental for studies of taxonomy, evolution, ecology and behavior. The existing methods for measuring visual fields deploy ophthalmoscopic measurements, which require custom microscopes, anatomical structures like the reflective tapetum, which may not always be present, or the capacity to detect photoreceptor autofluorescence. Here we present the ctSpyderFields Python package: a tool for geometrically predicting the visual fields of arachnids from digital images of the lens and retina. The tool uses images coming from computed tomography (CT) scans of specimens, but could be applied to other 3D microscopy techniques, to virtually project the boundaries of the retina through the geometrically predicted nodal point of the lens, deriving a rough per-eye visual field in both Cartesian and spherical coordinates. The extracted data can then be used to calculate likely visual field overlap between eyes and angular spans, which can be compared within or between species. We also provide a use case, reporting the visual field data extracted from a museum specimen of Philaeus chrysops. We propose that the tool will allow a wider comparative analysis of visual fields across spider species, unlocking the potential for a deeper understanding of visual ecology and evolution.
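The projection step described above can be pictured with a small geometric sketch: a ray is cast from a retina boundary point through the lens nodal point, and its direction is expressed in spherical angles. This is not the ctSpyderFields API; the function names and the axis convention (x forward, y left, z up) are assumptions made only for illustration.

```python
import math


def view_direction(retina_point, nodal_point):
    """Unit vector of the ray from a retina boundary point through the nodal point."""
    d = [n - r for r, n in zip(retina_point, nodal_point)]
    norm = math.sqrt(sum(c * c for c in d))
    return [c / norm for c in d]


def to_spherical(direction):
    """Return (azimuth, elevation) in degrees for a unit direction vector.

    Assumed convention: azimuth is measured in the x-y plane from the x axis,
    elevation is measured from the x-y plane toward z.
    """
    x, y, z = direction
    azimuth = math.degrees(math.atan2(y, x))
    elevation = math.degrees(math.asin(max(-1.0, min(1.0, z))))
    return azimuth, elevation


# Hypothetical coordinates: the nodal point sits at the origin.
azimuth, elevation = to_spherical(view_direction((0.0, -1.0, -1.0), (0.0, 0.0, 0.0)))
```

Repeating this for every point on the retina boundary would trace the outline of one eye's visual field on a sphere, which is the kind of per-eye field the abstract describes.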
bioinformatics 2026-05-29 v1
Complex Indel Detection: A Simulation-Based Framework and Parsing with FreeBayes
Loh, Y. H. E.; Lieber, M. R.; Hsieh, C.-L.; Manojlovic, Z.
Abstract
In contrast to simple deletions and simple insertions, most complex indels involve both deletions and insertions, often with base changes within a few nucleotides of the indel's left and right boundaries. These complex indels often arise from double-strand breaks (DSB), which in normal somatic cells are predominantly repaired by nonhomologous DNA end joining (NHEJ). Such complex indels pose a difficult analytical problem for existing indel callers because the observed VCF representation may be locally shifted, extended with matching flanking bases, or fragmented into several closely spaced calls. To evaluate complex indel representation, we tested six variant calling approaches: FreeBayes, HaplotypeCaller, Mutect2, Strelka2, DRAGEN Germline, and DRAGEN Somatic pipelines. Among the approaches evaluated, FreeBayes most consistently represented simulated complex indels as single nearby variant records. We then developed a parsing workflow that derives effective deleted and inserted sequences from FreeBayes VCF output and enriches for candidate complex indels. This approach supports analysis of naturally occurring DSB repair events in single human colon crypts.
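One way to picture how "effective deleted and inserted sequences" might be derived from a VCF record is to trim the bases that REF and ALT share at both ends. The sketch below is an assumption about the general idea, not the authors' parsing workflow; the function names and the candidate filter are hypothetical.

```python
def effective_indel(pos, ref, alt):
    """Trim bases shared by REF and ALT and return (start, deleted, inserted).

    `pos` is the 1-based VCF POS of the first REF base. The shared prefix is
    trimmed first, then the shared suffix of what remains.
    """
    prefix = 0
    while prefix < min(len(ref), len(alt)) and ref[prefix] == alt[prefix]:
        prefix += 1
    ref_rest, alt_rest = ref[prefix:], alt[prefix:]

    suffix = 0
    while (suffix < min(len(ref_rest), len(alt_rest))
           and ref_rest[-1 - suffix] == alt_rest[-1 - suffix]):
        suffix += 1

    deleted = ref_rest[:len(ref_rest) - suffix]
    inserted = alt_rest[:len(alt_rest) - suffix]
    return pos + prefix, deleted, inserted


def is_candidate_complex_indel(deleted, inserted):
    """Illustrative filter: bases both removed and added, with a net length change."""
    return bool(deleted) and bool(inserted) and len(deleted) != len(inserted)


# Hypothetical record: REF and ALT padded with matching flanking bases.
start, deleted, inserted = effective_indel(100, "ACGTTGCA", "ACCCGCA")
```

Here the matching flanks `AC` and `GCA` are stripped, leaving `GTT` deleted and `CC` inserted at position 102. In repetitive sequence the trimming order changes where the event is placed, so a real workflow would need an explicit alignment convention.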
bioinformatics 2026-05-29 v1
Transcriptomics-Conditioned Virtual Tissue Synthesis via Diffusion Transformers
Vlachas, P.; Nonchev, K.; Koelzer, V.; Ratsch, G.
Abstract
Spatial transcriptomics couples hematoxylin and eosin (H&E) tissue morphology with spatially resolved gene expression (GE). However, generative models that exploit this coupling to synthesize tissue images from transcriptomic profiles remain scarce. We present STMDiT (Spatial Transcriptomics and Morphology Diffusion Transformer), a diffusion transformer that synthesizes H&E histopathology patches conditioned jointly on morphological embeddings and transcriptomic profiles. Building on PixCell (Yellapragada et al., 2025), we integrate gene expression from a frozen CancerFoundation encoder (Theus et al., 2024) through adaptive layer normalization and per-block cross-attention, and we train under dual classifier-free guidance with independent modality dropout. On the 10x TuPro Visium melanoma cohort, GE conditioning improves both image quality over the no-GE PixCell-B baseline (best FID = 252.9 vs 330.7) and transcriptomic fidelity (best AUC = 0.267 vs 0.229, reaching 82% of the real-tile ceiling). Training with DeepSpot's predicted-transcriptomics pseudo-labels (PTPL) uniquely transfers zero-shot to TCGA SKCM, an out-of-distribution (OOD) H&E-only melanoma cohort: PTPL-XAttn-PMA-B reaches FID = 690.0, a 57-point improvement over the no-GE baseline (747.1), with a within-model GE-ablation effect of ΔOOD = +309.5, enabling virtual tissue synthesis beyond native spatial-transcriptomics coverage. Our results indicate that gene-expression conditioning produces morphologically distinct tissue images and supports virtual tissue simulation for hypothesis testing in computational pathology. Code availability: https://github.com/ratschlab/stmdit
bioinformatics 2026-05-29 v1
Sensitive long-read amplicon sequence variant recovery with savont
Shaw, J.; Riisgaard-Jensen, M.; Andersen, K. S.; Kirkegaard, R.; Dueholm, M. K. D.; Li, H.
Abstract
Long-read amplicon sequencing can profile longer sequences compared to short reads, but recovering amplicon sequence variants (ASVs) is a challenge for long, noisy reads. We present savont, an algorithm for recovering ASVs from modern long-read amplicons with mean accuracy ≥ 98%, including Oxford Nanopore Technologies (ONT) R10.4 and PacBio HiFi reads. Savont requires 5 to 16 times lower sequencing depth for full-length 16S rRNA ONT reads than previous methods and generates up to 95 times more ASVs for complex environments. Savont makes ASV-based analysis from shallow and noisy long-read amplicon sequencing feasible.
bioinformatics 2026-05-29 v1
MICAFlow: Fast and Robust MRI Preprocessing Bridging Research Neuroimaging and Clinical Practice
Goodall-Halliwell, I.; DeKraker, J.; Bautin, P.; Mendelson, D.; Cabalo, D. G.; Sahlas, E.; Ngo, A.; Xie, K.; Lam, J.; Smith, M.; Hwang, Y.; Vavassori, L.; Milano, P.; Chen, J.; Dascal, A.; Ding, R.; Zhou, G.; Naish, M.; Mo, J.; Fadaie, F.; Cruces, R. R.; Bernhardt, B. C.
Abstract
MICAFlow is a fully automated MRI preprocessing pipeline designed to translate advanced neuroimaging workflows from research into routine clinical practice. The pipeline emphasizes speed, robustness, and ease of use, focusing on structural and diffusion MRI. Key innovations include a Label-Augmented Modality-Agnostic Registration (LAMAReg) technique driven by deep learning segmentations for reliable cross-modal alignment, integration of state-of-the-art distortion corrections, and adherence to reproducible standards (Snakemake workflow, BIDSApp specifications). We describe the design of MICAFlow and evaluate its performance across heterogeneous datasets. First, accessibility: MICAFlow processes a multimodal MRI exam in minutes with clinically accessible hardware and without requiring GPU access, making it feasible for same-day clinical use. Second, registration accuracy: LAMAReg achieves cutting-edge multi-modal registration accuracy, yielding accurate alignment of diffusion MRI, FLAIR, and intra-subject T1-weighted images while remaining generally robust to common artifacts. Third, data reliability: Using identifiability, we show MICAFlow maintains consistent performance across diverse datasets, including subjects with pathology, and is closely comparable to contemporary pipelines. In sum, MICAFlow's combination of machine learning and efficient workflows produces research-grade data quality with clinical-grade speed. This work demonstrates that advanced MRI preprocessing can be done fast and robustly, helping close the gap between research neuroimaging and broad clinical application of quantitative MRI techniques. The source code for MICAFlow is available here: https://github.com/MICA-MNI/micaflow, and for LAMAReg here: https://github.com/MICA-MNI/LAMAReg.
bioinformatics 2026-05-29 v1
A Multi-Agent RAG Framework for Biomedical Literature Analysis
Palem, R. R.; Chen, H.; Yue, Z.
Abstract
Background: The biomedical literature is expanding at an unprecedented rate, with over 4,000 new articles indexed on PubMed each day. Clinicians and researchers frequently lack the time to review this volume before making decisions. Retrieval-Augmented Generation (RAG) systems attempt to bridge this gap by grounding language model responses in relevant documents, but standard implementations rank all retrieved passages solely by semantic similarity, treating a case report and a meta-analysis as equally authoritative. Objective: We aimed to develop and pilot-evaluate a RAG variant that incorporates evidence quality and publication recency into the retrieval scoring function, and to determine whether these signals improve answer quality on biomedical questions compared with standard cosine similarity RAG and a full-context baseline. Methods: We developed ET-RAG (Evidence-Temporal RAG), which scores each retrieved chunk using a weighted combination of cosine similarity (50%), evidence quality based on the GRADE hierarchy (30%), and temporal recency (20%). We evaluated ET-RAG alongside two baselines: a full-context agent powered by Gemini 2.0 Flash and a standard cosine RAG agent using GPT-4o-mini. All agents were tested on 40 benchmark questions (10 single-choice, 10 multiple-choice, 10 short-answer, and 10 long-answer) drawn from 10 peer-reviewed Alzheimer's disease papers published between 2021 and 2025. Results: ET-RAG achieved the highest scores across all four question categories: single choice (0.90), multiple choice (0.74), short answer (0.92), and long answer (0.89), with a combined average of 0.86. Cosine RAG scored 0.80, 0.48, 0.82, and 0.69, respectively (average 0.70), while the full-context agent scored 0.60, 0.59, 0.71, and 0.53 (average 0.61). The full-context agent, despite having access to the entire corpus through Gemini's large context window, struggled with consistent answer extraction and was prone to rate limiting under heavy query loads. A control question on forestry was correctly rejected by all three agents, suggesting no hallucination on this control item. Conclusions: In this pilot Alzheimer's disease benchmark, incorporating evidence quality and recency into RAG retrieval improved answer quality relative to pure cosine similarity retrieval and full-corpus prompting. The evidence-temporal scoring function is lightweight to implement and adds minimal computational overhead to existing vector search pipelines, but broader validation across domains, evidence levels, and stronger retrieval baselines is required before claims of generalizable biomedical reliability can be made.
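The weighted scoring described in the Methods can be sketched in a few lines. The 50/30/20 weights come from the abstract; everything else is an assumption made for illustration, including the function name, the scaling of evidence quality and recency to the range 0 to 1, and the example chunk values.

```python
def et_rag_score(similarity, evidence_quality, recency, weights=(0.5, 0.3, 0.2)):
    """Weighted combination of cosine similarity, evidence quality, and recency.

    All three inputs are assumed to be pre-scaled to [0, 1]; how the authors map
    GRADE levels and publication dates onto that range is not stated here.
    """
    w_sim, w_evd, w_rec = weights
    return w_sim * similarity + w_evd * evidence_quality + w_rec * recency


# Hypothetical retrieved chunks: the case report is slightly more similar to the
# query, but the meta-analysis carries a higher evidence-quality value.
chunks = [
    {"id": "case_report", "similarity": 0.9, "evidence_quality": 0.2, "recency": 0.5},
    {"id": "meta_analysis", "similarity": 0.8, "evidence_quality": 1.0, "recency": 0.5},
]

ranked = sorted(
    chunks,
    key=lambda c: et_rag_score(c["similarity"], c["evidence_quality"], c["recency"]),
    reverse=True,
)
```

With these made-up inputs the meta-analysis scores 0.80 and the case report 0.61, so the ranking flips relative to pure cosine similarity, which is the behavior the abstract attributes to ET-RAG.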
bioinformatics 2026-05-29 v1
FlowTransOP: Distributional Translation of Omics Signatures via Constrained Deep Flow Matching
Meimetis, N.; Magliacane, S.; Hoang, T. N.; Lauffenburger, D. A.
Abstract
Observations from pre-clinical models rarely generalize to human patients, leading to many failures in clinical trials. Most existing methods cannot handle domains with non-overlapping features and no paired samples. Here, we developed FlowTransOP to translate biological observations across such domains without requiring 1-to-1 feature mappings or paired data, while providing a guideline for model selection across four translational regimes. We use flow matching to align full domain distributions in a pre-aligned latent space, with a structural regularization term that keeps similar conditions proximate after transformation. FlowTransOP remains competitive with gold-standard approaches requiring paired samples, but outperforms them when pairs become scarce (< 35 pairs) or when cross-domain features are only moderately correlated (r ≤ 0.58). Overall, FlowTransOP can translate perturbations between pre-clinical models and patients when direct correspondences are unavailable, enabling reliable therapeutic inference. As a proof-of-concept, we trained a foundational mouse-human transcriptomic map on ARCHS4 and applied it to liver disease predictions.
bioinformatics 2026-05-29 v1
MetworkPy: A Python Package for Graph- and Information-theoretic Investigation of Metabolic Networks
Griebel, B. T.; Ma, S.
Abstract
We present MetworkPy, a Python package for investigating in silico genome-scale models of metabolism (GSMM). By using novel graph- and information-theoretic methods to explore the feasible reaction flux space, MetworkPy quantifies network context and simulates metabolic relationships between sets of enzyme-encoding genes without imposing assumptions of optimal growth. To demonstrate utility, we used MetworkPy to identify metabolic features perturbed by the transcription factor ArgR, a known regulator of arginine biosynthesis in Mycobacterium tuberculosis, based on published transcriptome data generated from an argR mutant strain. MetworkPy successfully linked reaction flux shifts in ArgR's transcriptome-constrained GSMM to arginine biosynthesis, a link that cannot be easily ascertained by conventional constraint-based optimization modeling approaches. MetworkPy offers a flexible toolbox for metabolic contextualization of genes of interest in microbial, eukaryotic, and multi-organism systems with potential applications for medicine and bioengineering.
bioinformatics 2026-05-29 v1
Multiple versus pairwise sequence alignments for protein phylogenetics using foundation models
Alibutud, R. F.; Kumar, S.
Abstract
Phylogenetic inference is a common task in molecular and evolutionary biology and has conventionally required a multiple sequence alignment (MSA), a statistical model of amino acid substitutions, and an optimality principle. Recently, global models of amino acid substitutions have been inferred from millions of MSAs using transformer-based deep learning, resulting in protein foundation models (pFMs), also known as protein language models (PLMs). Training pFMs on MSAs hypothetically enables them to encode residue dependencies and the phylogenetic structure of the MSA collection. In contrast, pFMs trained on individual sequences lack access to such phylogenetic structure. Here, we assess the phylogeny inference gains offered by the use of MSAs for training pFMs by comparing the relative accuracies of phylogenies inferred using two types of pFMs: one trained on a large collection of MSAs (msat-pFM, [1]) and the other trained using a collection of single sequences (esm-pFM). For msat-pFM analysis, we inferred neighbor-joining trees using pairwise distances estimated directly from the sequence attention matrices. For esm-pFM [2], pairwise distances were obtained using the correlation of attentions of homologous residues, where pairwise sequence alignments (PSA) were used to establish residue homologies. Surprisingly, MSA phylogenies inferred using the msat-pFM were less accurate than PSA phylogenies inferred using the esm-pFM. This pattern was seen across datasets spanning both small and large numbers of species and proteins. Also, PSA phylogenies obtained using residue attentions from early esm-pFM layers were much more accurate. These results suggest that the multiple sequence alignment step, which is obligatory to establish residue homologies across multiple sequences, may not add information when using evolutionary distances based on attentions in pFMs.
bioinformatics 2026-05-29 v1
Cophenetic Spatial Topology Embedding reveals multiscale tissue architecture in spatial omics
Long, M.; Hu, T.; Sountoulidis, A.; Samakovlis, C.; Nilsson, M.
Abstract
The spatial organization of tissues emerges from cell interactions across multiple scales, yet current spatial omics analysis tools often emphasize local neighborhoods and may not summarize broader tissue architecture. Here we introduce Cophenetic Spatial Topology Embedding (COSTE), a computational framework that embeds directed nearest-neighbor distance profiles into a hierarchical metric space without requiring the user to define a spatial radius or neighborhood cutoff. COSTE can be applied to cell-level and single-transcript inputs without requiring cell segmentation. It constructs directed distance profiles between cell populations and uses hierarchical clustering to quantify tissue topology. This yields a Spatial Separation Score (SSS), a sample-normalized score from 0 to 1 that summarizes relative spatial separation within an analyzed tissue. We apply COSTE to spatial transcriptomics datasets of pulmonary fibrosis and triple-negative breast cancer (TNBC), where it delineates tissue structures, nominates spatially defined cell states, and highlights disease- or treatment-associated architectural patterns that are not readily captured by local neighborhood-based analyses. Our approach provides an interpretable framework for exploring tissue architecture and cell-cell spatial relationships in spatial omics data.
bioinformatics 2026-05-29 v1
SPACKLE: A spatial-first framework for multi-layer spatial transcriptomic analysis
Maynard, T. M.
Abstract
Background: The emergence of accessible spatial transcriptomic platforms such as 10x Genomics Visium HD and Xenium has created demand for analysis tools that can handle the complexity and scale of spatial datasets. Current frameworks approach spatial data primarily as an extension of single-cell RNA-seq pipelines, where spatial coordinates are retained as metadata rather than treated as a first-class organizing principle. As a result, common tasks such as multi-modal data alignment, region-of-interest selection, and cross-resolution visualization require manually managing disparate data types, coordinates, and scales, making spatial analysis unnecessarily time-consuming and error-prone. Results: We present SPACKLE (Spatial Platform for Analysis of Composite stacKs and Layered data Extraction), a Python-based 'spatial-first' framework that treats absolute physical micron coordinates as the organizing principle for all data types. All data (morphology images, transcript point clouds, expression matrices, segmented cells, and user-defined regions) are stored as typed objects ('Channels') that carry their own spatial metadata, keeping all layers in automatic registration regardless of platform, resolution, or analysis operation. Two complementary interfaces simplify access to underlying data: the ViewPort, a compositing engine for efficient multi-channel visualization, and the DataPort, which extracts raw data in its native format for downstream analysis. A set of spatial analysis tools demonstrates the practical benefits of the framework, including ROI-based expression binning, cortical unfolding, and sub-micron fine alignment of transcript and image data. The use of modern Python data management methods helps maintain the efficiency of the framework, allowing for quick visualizations and analysis with a low memory footprint.
Conclusions: SPACKLE is designed to complement rather than replace widely used tools in the spatial analysis ecosystem (Scanpy, Squidpy, CellPose, StarDist) by handling the spatial mechanics of large datasets so that the analyst can focus on the biology. SPACKLE is freely available under the MIT license at https://github.com/maynardt/spackle.
bioinformatics 2026-05-29 v1
Transcriptomic phase transitions along the Alzheimer's disease continuum using Riemannian tensor model: a bifurcation-intermediate variance framework on the ROSMAP cohort
Choi, M.; Bauermeister, S.; Kim, D.-G.
Abstract
Alzheimer's disease (AD) progression involves systemic network transitions. To capture these using ROSMAP bulk RNA-seq (n=624), we focused on the geometry of the covariance structure, performing a Riemannian (Log-Euclidean) analysis of stage-wise covariance matrices as points on the manifold of symmetric positive-definite (SPD) matrices. On the SPD manifold the three stages were non-collinear: geodesic distances were non-uniform and MCI was displaced from the NCI-AD chord, while the von Neumann entropy of the covariance structure dipped at MCI (S = 2.760, 2.639, and 2.647 for no cognitive impairment (NCI), mild cognitive impairment (MCI), and AD, respectively) and the path-curvature profile reached a minimum there. Together, these observations identify MCI as a saddle/bifurcation state. The differential covariance spectrum (C_AD - C_NCI) separated AD-amplified ("structural collapse") from AD-suppressed ("protective loss") modes. Ultimately, second-order statistics analyzed through Riemannian geometry, rather than Euclidean summaries, reveal AD progression structure invisible to mean-level analysis.
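For readers unfamiliar with the quantities named above, the block below writes out standard textbook definitions of the Log-Euclidean distance, the Log-Euclidean geodesic, and the von Neumann entropy of a covariance matrix. The trace normalisation, the natural logarithm, and the symbols C_a, C_b, C_s, rho_s and lambda_i are assumptions made here for illustration; the abstract does not state which normalisation or logarithm base the authors used.

```latex
% Log-Euclidean distance between two SPD covariance matrices C_a and C_b
d_{\mathrm{LE}}(C_a, C_b) = \left\lVert \log C_a - \log C_b \right\rVert_F

% Log-Euclidean geodesic from C_a (t = 0) to C_b (t = 1)
\gamma(t) = \exp\!\big( (1 - t)\,\log C_a + t\,\log C_b \big), \qquad t \in [0, 1]

% von Neumann entropy of stage s after trace normalisation (assumed);
% \lambda_i are the eigenvalues of \rho_s
\rho_s = \frac{C_s}{\operatorname{tr}(C_s)}, \qquad
S(\rho_s) = -\operatorname{tr}(\rho_s \ln \rho_s) = -\sum_i \lambda_i \ln \lambda_i
```

Under these definitions, three stages are collinear only if the middle one lies on the geodesic between the other two, in which case the two shorter distances add up exactly to the longest; a strict triangle inequality is one way to read "displaced from the NCI-AD chord".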
bioinformatics 2026-05-29 v1
Memory-safe high-performance sequence mapping with rammap
Wang, J. R.; Li, H.
Abstract
We introduce rammap, a reimplementation of the widely used mapping tool minimap2 in Rust. We demonstrate perfect concordance with minimap2, enabling its use as a backwards-compatible drop-in replacement in minimap2-based workflows. Additionally, rammap implements performance optimizations for modern architectures and applications, including AVX512 and WASM v128 SIMD support for dynamic programming alignment and SIMD-accelerated chaining. These achieve comparable or better performance than minimap2 across diverse mapping workloads while maintaining Rust's stronger memory safety constraints. The rammap API exposes both SIMD-accelerated sequence-sequence alignment modules and full mapping pipelines for use as an integrated library. Lastly, we describe the modular architecture and provide examples illustrating the extensibility of major mapping components, including seeding, chaining, and gap-filling/extension, to support development of improved or domain-specific mapping components within a high-performance framework.
bioinformatics 2026-05-29 v1
HESTA: a curated and reusable database for the human early organogenesis spatiotemporal transcriptome atlas
Xu, Z.; Wang, W.; Li, Y.; Zhang, Y.; Chen, J.; Du, W.; Yang, T.
Abstract
Background: Human organogenesis is orchestrated by precise spatiotemporal gene expression. Mapping these dynamic processes requires transcriptomic data that preserve native anatomical context across continuous developmental stages. Results: We present a spatiotemporal transcriptome database of human embryogenesis, profiling 77 sagittal sections from 13 euploid embryos (CS12-CS23) using Stereo-seq, yielding 14,744,703 bin50 spots. The atlas annotates 50 organs and maps 198 molecularly distinct substructures, complemented by 607,093 snRNA-seq cells. The database features a Spatial Exploration module for locating sections and visualizing spatial distributions of organs and substructures, and an Organ Atlas module for visualizing gene expression, regulon activities, and pathway enrichment at the single-organ level across embryos. Conclusions: This database provides an interactive resource to access spatial gene expression, substructures, and regulatory networks across 50 developing human organs, supporting further research into the mechanisms of human organogenesis.
bioinformatics 2026-05-29 v1
ProtXAI: Explainable AI Reveals Structural Determinants of Protein Dynamics
Haddadi, F.; Planas Iglesias, J.; Mican, J.; Horackova, J.; Marques, S. M.; Demovic, M.; Kohout, P.; Damborsky, J.; Bednar, D.; Mazurenko, S.
Abstract
Molecular dynamics simulations provide atomistic views of protein motions, but conventional analyses often struggle with extracting subtle mechanistic insights from complex trajectories. Here, we present an integrated framework, ProtXAI, combining molecular dynamics and explainable artificial intelligence (XAI), to identify residue-level determinants of conformational change across diverse protein systems. By leveraging inter-residue distance dynamics, deep learning, and sequential relevance propagation, the approach captures both local fluctuations and long-range communication pathways within protein structures. We applied this framework to three mechanistically distinct systems: apolipoprotein E4 (ApoE4), staphylokinase (SAK) variants, and an ancestral luciferase. Across these applications, our XAI-based approach recovered experimentally supported dynamic hotspots: ligand-responsive hinges in ApoE4, mutation-dependent flexibility shifts in SAK, and evolutionary redistribution of motions in the luciferase. ProtXAI also revealed additional long-range couplings not accessible to classical analysis. Together, these findings demonstrate that combining molecular dynamics with XAI provides a general and scalable strategy for dissecting protein dynamics and uncovering structural determinants of function, stability, and evolutionary changes without prior bias. This approach thus advances the current methodological repertoire for analysing proteins and their intrinsic properties.
bioinformatics 2026-05-29 v1
SwiftNJ: Fast Exact Neighbour Joining via Correctness-Gated Coding Agents
Christensen, J.
Abstract
The capability profile of frontier coding agents in 2026 varies sharply across technical domains, motivating domain-specific empirical study of where, and under what oversight conditions, such systems can contribute to specialised technical work. This paper presents one such study in computational phylogenetics. Neighbour joining (NJ) is a widely used distance-based method for inferring evolutionary trees in microbial epidemiology, comparative genomics, and large-scale sequence clustering. Its constant-factor runtime is set by hand-tuned native implementations; RapidNJ is a widely cited representative of that class and serves here as the comparison baseline. We ask whether a current-generation coding agent, operating under an optimisation harness with deterministic correctness gates calibrated against a QuickTree reference, can advance that constant factor on a fixed benchmark. The resulting implementation, SwiftNJ, achieves a geometric-mean runtime ratio of 0.565 against a locally rebuilt RapidNJ-native binary across a 59-matrix corpus, with a ratio below parity (that is, below 1) on 58 of 59 matrices. On 400 shuffled inputs drawn from 16 small matrices (n <= 2000), SwiftNJ matched the QuickTree reference at Robinson-Foulds distance zero. In this domain, a correctness-gated coding agent meaningfully improved on a strong native baseline, suggesting that harness-guided optimisation holds promise for performance-critical bioinformatics tools; further work is needed to establish how broadly the approach generalises.
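As background for the abstract above, here is a minimal sketch of the canonical cubic-time neighbour-joining procedure (pair selection by the Q criterion, followed by the usual branch-length and distance-update formulas). It is a textbook illustration only and is not the SwiftNJ, RapidNJ, or QuickTree implementation; the function name, the frozenset-keyed distance dictionary, and the internal node labels U0, U1, ... are assumptions made for this example.

```python
from itertools import combinations

def neighbor_joining(names, dist):
    """Return tree edges (node, node, length) from pairwise distances.

    names: list of at least two taxon labels.
    dist:  dict mapping frozenset({a, b}) -> distance between a and b.
    """
    d = dict(dist)              # working copy, keyed by unordered pair
    active = list(names)
    edges = []
    next_id = 0

    def get(a, b):
        return d[frozenset((a, b))]

    while len(active) > 2:
        n = len(active)
        # Row sums r_i over all currently active nodes.
        r = {a: sum(get(a, b) for b in active if b != a) for a in active}
        # Join the pair minimising Q(i, j) = (n - 2) * d(i, j) - r_i - r_j.
        i, j = min(combinations(active, 2),
                   key=lambda p: (n - 2) * get(*p) - r[p[0]] - r[p[1]])
        u = f"U{next_id}"       # hypothetical label for the new internal node
        next_id += 1
        dij = get(i, j)
        li = 0.5 * dij + (r[i] - r[j]) / (2 * (n - 2))
        lj = dij - li
        edges += [(i, u, li), (j, u, lj)]
        # Distance from the new node to every remaining node.
        for k in active:
            if k not in (i, j):
                d[frozenset((u, k))] = 0.5 * (get(i, k) + get(j, k) - dij)
        active = [k for k in active if k not in (i, j)] + [u]

    a, b = active
    edges.append((a, b, get(a, b)))
    return edges

# Toy additive distances from a tree with leaf branches A=2, B=3, C=4, D=5
# and an internal branch of length 1 separating {A, B} from {C, D}.
names = ["A", "B", "C", "D"]
pairs = {("A", "B"): 5, ("A", "C"): 7, ("A", "D"): 8,
         ("B", "C"): 8, ("B", "D"): 9, ("C", "D"): 9}
dist = {frozenset(p): float(v) for p, v in pairs.items()}
edges = neighbor_joining(names, dist)

leaf_len = {}
for a, b, w in edges:
    for x in (a, b):
        if x in names:
            leaf_len[x] = w
print(edges)
```

On additive input such as this toy matrix, neighbour joining recovers the generating tree, which is the kind of property a deterministic correctness gate can check against a reference implementation.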
bioinformatics 2026-05-29 v1
TopOmics: Topic Modelling for All Omics
Sanguinetti, G.; El Kazwini, N.; Caretti, F.
Abstract
Topic models have emerged as a popular paradigm to analyse and interpret complex single-cell and spatial data. Yet, current implementations are usually data-type specific and rely on different modelling and estimation approaches, hindering usability and interoperability. In this work we introduce TopOmics, a library to perform efficient and flexible topic modelling with any combination of -omics data at scale. The framework leverages standard libraries of the Python ecosystem, guaranteeing seamless integration with existing pipelines, and shows competitive performance against state-of-the-art methods while preserving interpretability. We provide several examples of TopOmics on diverse data sets, including a novel topic model for spatial multi-omic data, and an analysis of a very large VisiumHD data set.
bioinformatics 2026-05-29 v1
DMPKformer: An Interpretable Multimodal Deep Learning Framework for Reliable ADMET Property Prediction
A. S., B. G.; Singh, A.; Kanchan, S.; Anapat, S.; Gurram, K.; Kulkarni, N. M.
Abstract
Accurate prediction of absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties remains a critical challenge in drug discovery. Traditional single-modality approaches often fail to capture the complex, multi-scale relationships governing molecular behaviour across physicochemical, structural, and pharmacokinetic dimensions. In this work, we propose a multi-modal deep learning framework that integrates complementary molecular representations (MACCS fingerprints, molecular graphs, and physicochemical descriptors) to achieve robust ADMET property prediction. Each modality is modelled using a specialized neural subnetwork tailored to its structural characteristics: a self-attention-based Transformer encoder for MACCS fingerprints, a Graph Attention Network (GAT) for molecular graph representations, and a tanh-activated multilayer perceptron for RDKit, PaDEL, and Mordred-derived descriptors. Each modality-specific subnetwork is independently trained for binary classification, and latent embeddings extracted from internal layers serve as transferable molecular representations. These embeddings are subsequently fused and fine-tuned via a tanh-activated dense network and shared prediction head to form a unified ADMET predictor. The proposed framework achieves competitive performance across multiple TDC ADMET benchmarks while providing enhanced interpretability through modality-specific attention mechanisms. In addition, the incorporation of latent-space out-of-distribution (OOD) confidence estimation enables identification of high-confidence operating regions, improving the reliability and practical applicability of the framework for molecular property prediction in drug discovery workflows.
bioinformatics 2026-05-29 v1
Variation in seagrass habitat use by fishery-important nekton across the Gulf of Mexico revealed by deep transfer learning with trophic group priors
Li, L.; Rodemann, J.; Hayes, C.; Belgrad, B.; Darnell, K. M.; Martin, C. W.; Furman, B. T.; Smee, D. L.; Darnell, M. Z.
Abstract
Understanding how species use habitats across environmental gradients is central to guiding fisheries management and habitat restoration, yet inference is often limited by heterogeneous data and inconsistent observations. In coastal ecosystems, variation in relationships between nekton and seagrass habitats remains unresolved, in part because habitat structure, environmental context, and sampling methods are rarely integrated in predictive models. Here, we combine multigear monitoring data across the Gulf of Mexico with a transfer learning framework incorporating trophic priors to quantify how nekton respond to seagrass structure under varying environmental conditions. We show that environmental gradients, including temperature, salinity, and water clarity, define broad-scale distributions, while seagrass structure refines habitat use at local scales. Apparent inconsistencies in seagrass-nekton relationships are largely attributable to environmental context and differences in observational processes. By integrating observations collected using different sampling methods, our approach reveals consistent species-environment relationships across sites and, by using trophic priors, improves predictive performance, particularly for data-limited species. We further show that species differ in their responses to environmental gradients, with some exhibiting consistent patterns across sites and others showing strong context dependence. These results demonstrate that combining heterogeneous datasets can strengthen ecological inference and provide a pathway for scalable, data-driven conservation and restoration in rapidly changing coastal systems.
bioinformatics 2026-05-29 v1
Anchors for Homology-Based Scaffolding
Kaether, K.-K.; Gatter, T.; Lemke, S.; Stadler, P. F.
Abstract
Homology-based scaffolding orders contigs based on conserved collinearity of homologous sequences across related species. Existing methods often rely on costly whole-genome alignments or show limited robustness when integrating multiple references. Here, we introduce an anchor-based scaffolding framework that adapts synteny anchors to efficiently infer contig order and orientation relative to one or more reference genomes. Our approach leverages precomputed, sufficiently unique anchors and their respective high-confidence homology matches in a greedy approach, combining single-reference scaffolds into multi-reference scaffolds using a maximum matching. Across simulated and real datasets, anchor-based scaffolding achieves accuracy comparable to state-of-the-art methods. Notably, the approach shows particular strengths in multi-reference settings. These results demonstrate that synteny-anchor-based scaffolding provides an additional tool for homology-based scaffolding with robust accuracy and superior performance in multi-reference scenarios.
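To make the idea of inferring contig order and orientation from anchor matches concrete, the sketch below handles the simplest case: one reference, with contigs sorted by the median reference position of their anchors and oriented by a majority vote over anchor strands. This is a simplification for illustration and not the authors' greedy procedure or their maximum-matching step for multiple references; the function name, the tuple layout of a match, and the toy coordinates are all assumptions.

```python
from collections import defaultdict
from statistics import median

def scaffold_by_anchors(matches):
    """Order and orient contigs from anchor matches against one reference.

    matches: iterable of (contig, ref_pos, strand), strand being +1 or -1.
    Returns a list of (contig, orientation) sorted along the reference.
    """
    positions = defaultdict(list)
    strand_sum = defaultdict(int)
    for contig, ref_pos, strand in matches:
        positions[contig].append(ref_pos)
        strand_sum[contig] += strand
    # Median position is less sensitive to a stray anchor than the mean.
    order = sorted(positions, key=lambda c: median(positions[c]))
    # Majority vote over anchor strands; ties default to forward.
    return [(c, "+" if strand_sum[c] >= 0 else "-") for c in order]

# Hypothetical anchor matches: (contig, position on reference, strand).
matches = [
    ("ctg2", 5200, +1), ("ctg1", 100, -1), ("ctg1", 900, -1),
    ("ctg2", 6100, +1), ("ctg3", 3000, +1), ("ctg1", 500, +1),
]
scaffold = scaffold_by_anchors(matches)
print(scaffold)
```

Using uniquely placed anchors instead of a whole-genome alignment keeps the input to this step small, which is the efficiency argument the abstract makes.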
bioinformatics 2026-05-28 v4
Blender tissue cartography: an intuitive tool for the analysis of dynamic 3D microscopy data
Claussen, N. H.; Regis, C.; Wopat, S.; Lefebvre, M. F.; Streichan, S. J.
Abstract
Volumetric microscopy can image complex 3D tissues, but 3D image data remains difficult to visualize and quantify. Many biological systems are organized as thin, curved sheets (for example, epithelia). Tissue cartography extracts and cartographically projects these curved surfaces from volumetric images. This converts 3D into 2D image data, greatly facilitating visualization, analysis, and computational processing. Existing tools, however, demand advanced coding expertise and are limited to simple tissue geometries. Here, we present Blender tissue cartography (btc), an interactive add-on for the 3D editor Blender that makes tissue cartography user-friendly through a graphical interface and handles complex biological shapes using powerful computer graphics algorithms. An accompanying Python library supports faithful 3D measurements in 2D cartographic projections and custom analysis pipelines. Time-lapse data can be batch-processed by algorithmically aligning all time points to a single "key frame". We demonstrate btc on diverse and complex tissue shapes from Drosophila, stem-cell organoids, Arabidopsis, and zebrafish. btc enables quantitative cartographic analysis of complex 3D tissues, broadening access to methods previously restricted to specialists, while leveraging tools from computer graphics to unlock new capabilities.
bioinformatics 2026-05-28 v3
Unimeth: A unified transformer framework for accurate DNA methylation detection from nanopore reads
Wang, S.; Xiao, Y.; Sheng, T.; Huang, N.; Shu, Y.; Zhai, J.; Luo, F.; Ni, P.
Abstract
Nanopore sequencing enables direct detection of DNA modifications from native DNA. However, accurate methylation calling across species, sequence contexts, modification types and chemistries remains challenging. We present Unimeth, a transformer-based framework that jointly processes raw signals and basecalled sequences in read patches and predicts all target methylation sites within each patch. Unimeth uses a three-phase training strategy that combines signal pre-training, methylation fine-tuning and site-level calibration using methylation frequency information. We evaluated Unimeth for 5mC and 6mA detection using public and in-house datasets spanning 14 species, three nanopore chemistries and wild-type, mutant and enzyme-treated samples. Unimeth improved plant 5mC detection in non-CpG contexts, reduced false-positive calls in low-methylation samples and maintained high 5mCpG performance in mammalian datasets. For 6mA, Unimeth reduced background calls while preserving signals for Fiber-seq nucleosome and gene-level analyses. Unimeth provides a unified framework for nanopore-based methylation detection across methylation types and biological contexts.
bioinformatics 2026-05-28 v3
Atlas-Level Single-Cell and Spatial Transcriptomics Data Integration via PRIME
Wu, X.; Wang, X.; Wang, J.; Wan, S.
Abstract
Single-cell RNA sequencing (scRNA-seq) and spatial transcriptomics (ST) have enabled atlas-scale cellular cartography, with consortium efforts now assembling millions of cells across diverse tissues, donors, and technologies to build comprehensive references for cell identity and disease mechanism. Yet the scientific value of these atlases hinges on robust computational integration across heterogeneous data sources. Unlike pairwise batch correction, atlas-level integration must jointly reconcile heterogeneous and often hierarchically nested batch effects across many datasets whose cell-type compositions are highly imbalanced, all while preserving subtle biological variation and remaining computationally tractable at the scale of millions of cells. Existing approaches often prioritize either batch mixing or preservation of local biological structure, and most cannot natively accommodate spatial coordinates. Here we introduce PRIME (Projection-based Robust Integration via Manifold Embedding), an ensemble integration framework that combines random-projection-based consensus anchoring, graph-Laplacian correction, and optional spatial-neighborhood regularization. Across multiple random projections of the expression manifold, PRIME uses consensus voting to keep only cell pairs that are repeatedly matched, reducing false anchors caused by projection-specific distortions. For ST, PRIME couples this expression-based anchor graph with a coordinate-derived spatial neighborhood graph in a unified graph-Laplacian objective with closed-form solution, enabling simultaneous cross-batch alignment and local spatial coherence. Based on extensive benchmarking spanning diverse datasets, we show that PRIME consistently outperforms state-of-the-art methods in both batch correction and biological conservation across scRNA-seq and ST integration scenarios and downstream tasks including trajectory inference, spatial-domain preservation, and perturbation-response analysis.
In particular, when integrating a human hematopoiesis benchmark spanning eight donors and approximately 33,000 cells, PRIME preserves biologically coherent developmental trajectories. It also maintains cortical laminar architecture across dorsolateral prefrontal cortex sections in an ST dataset and recovers known drug-target relationships in a perturbation atlas of more than 1 million cells while suppressing batch-associated confounders. Together, these results establish PRIME as a versatile and scalable framework for atlas-level integration of scRNA-seq and ST across diverse biological applications.
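The consensus-voting idea described in this abstract (keep only cross-batch cell pairs that match in most random projections) can be sketched as follows. The abstract does not say how pairs are matched within a projection, so mutual nearest neighbours are used here as an assumption; the function names, the Gaussian projection matrices, and the parameters n_proj, dim_out and min_frac are likewise hypothetical, and the graph-Laplacian correction step is not shown.

```python
import random

def project(points, matrix):
    """Apply a linear projection (rows of `matrix`) to each point."""
    return [[sum(w * x for w, x in zip(row, p)) for row in matrix]
            for p in points]

def nearest(query, refs):
    """Index of the reference point closest to `query` (squared Euclidean)."""
    return min(range(len(refs)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(query, refs[j])))

def mutual_nn(batch_a, batch_b):
    """Pairs (i, j) that are each other's nearest neighbour across batches."""
    a_to_b = [nearest(p, batch_b) for p in batch_a]
    b_to_a = [nearest(q, batch_a) for q in batch_b]
    return {(i, j) for i, j in enumerate(a_to_b) if b_to_a[j] == i}

def consensus_anchors(batch_a, batch_b, n_proj=20, dim_out=2,
                      min_frac=0.8, seed=0):
    """Keep pairs matched in at least `min_frac` of the random projections."""
    rng = random.Random(seed)
    dim_in = len(batch_a[0])
    votes = {}
    for _ in range(n_proj):
        matrix = [[rng.gauss(0.0, 1.0) for _ in range(dim_in)]
                  for _ in range(dim_out)]
        for pair in mutual_nn(project(batch_a, matrix),
                              project(batch_b, matrix)):
            votes[pair] = votes.get(pair, 0) + 1
    return {pair for pair, v in votes.items() if v >= min_frac * n_proj}

# Hypothetical toy data: batch_b repeats batch_a, so each cell has an exact
# counterpart and should anchor to its own copy in every projection.
batch_a = [[0.0, 0.0, 0.0], [5.0, 0.0, 1.0], [0.0, 5.0, 2.0]]
batch_b = [row[:] for row in batch_a]
anchors = consensus_anchors(batch_a, batch_b)
print(sorted(anchors))
```

A pair that matches only because one projection happens to squash two unrelated cells together collects few votes and is dropped, which is the "projection-specific distortion" the abstract refers to.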
bioinformatics 2026-05-28 v2
CardioSeg: An interactive platform for integrated spatial transcriptomics data and nuclear morphological analysis of mouse heart tissue
Kancherla, S. K.; Melleby, A. O.; Aronsen, J. M.
Abstract
Motivation: Spatial transcriptomics enables gene expression profiling within its spatial context in intact tissue sections. Existing workflows for segmentation, spatial annotation, and morphological analysis are often code-heavy and poorly integrated. This limits the joint analysis of spatial gene expression at single-nucleus resolution and the corresponding nuclear morphology. Results: We present CardioSeg, a Python-based graphical interface for nuclei segmentation, spatial annotation, and interactive analysis of myocardial histology. CardioSeg integrates multi-threshold Cellpose-based segmentation with nuclei-level transcriptomic mapping and interactive visualisation. CardioSeg achieved robust segmentation performance across heterogeneous imaging conditions, with union-based inference outperforming the individual parameter configurations. For cell-type annotation, CardioSeg achieved an accuracy of 0.88 and a balanced accuracy of 0.85 against reference labels, while also resolving spatial heterogeneity not captured by spot-based approaches. Application to pressure-overloaded cardiac tissue revealed uncharacterized intra-ventricular variations in nuclear morphology, indicating the potential of CardioSeg to couple disease-specific nuclear morphology with the associated transcriptomics. Availability and Implementation: Source code is available on GitHub under the CC BY 4.0 license (https://github.com/SrijanKancherla/CardioSeg). A versioned release was archived on Zenodo (DOI: 10.5281/zenodo.20177171).
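The abstract reports both accuracy and balanced accuracy, which diverge when cell types are unevenly represented. The sketch below computes both from label lists, taking balanced accuracy as the unweighted mean of per-class recall (the usual definition, assumed here). The labels CM and FB and the counts are invented for illustration and are not CardioSeg results.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / support[c] for c in support) / len(support)

# Hypothetical example: 8 cardiomyocyte nuclei (CM) and 2 fibroblast nuclei (FB).
y_true = ["CM"] * 8 + ["FB"] * 2
y_pred = ["CM"] * 8 + ["FB", "CM"]   # one fibroblast is mislabelled

acc = accuracy(y_true, y_pred)
bal = balanced_accuracy(y_true, y_pred)
print(acc, bal)
```

A gap between the two numbers, as in this toy case, indicates that errors are concentrated in the rarer classes.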
bioinformatics 2026-05-28 v2
Data Representation Bias and Conditional Distribution Shift Drive Predictive Performance Disparities in Multi-Population Machine Learning
Kumar, S.; Cui, Y.
Abstract
Machine learning frequently encounters challenges when applied to population-stratified datasets, where data representation bias and data distribution shifts substantially impact model performance and generalizability across different population groups. These challenges are well illustrated in the context of polygenic prediction for diverse ancestry groups, and the underlying mechanisms are broadly applicable to machine learning with population-stratified data across domains. Using synthetic genotype-phenotype datasets representing five continental populations, we evaluate three approaches for utilizing population-stratified data (mixture learning, independent learning, and transfer learning) to systematically investigate how data representation bias and distribution shifts influence multi-population machine learning. Our results show that conditional distribution shifts, in combination with data representation bias, significantly influence both machine learning performance across diverse populations and the effectiveness of transfer learning as a disparity mitigation strategy, while the effect of marginal distribution shifts is limited. The joint effects of data representation bias and distribution shifts demonstrate distinct patterns under different multi-population machine learning approaches, providing critical insights for the development of effective and equitable machine learning models for population-stratified data.
bioinformatics 2026-05-28 v2
UcTCRp: a TCRβ-based framework for quantitative MAIT- and iNKT-associated repertoire-state profiling
Chen, L.; Li, Y.; Shan, S.; Wang, K.; Feng, C.; Dou, Y.; Xu, Q.; Cai, L.; Wang, H.; Wang, H.; Bo, X.; Zhang, J.
Abstract
MAIT and iNKT cells are conventionally identified using invariant or semi-invariant TCR chains, antigen-loaded tetramers, or transcriptomic phenotypes. These requirements limit their detection in public and clinical immune-repertoire datasets that contain only TCRβ sequences. Here we present UcTCRp, a TCRβ-only framework for profiling MAIT- and iNKT-associated repertoire states in bulk immune repertoires. UcTCRp integrates V-gene context and CDR3β sequence features using a transformer-based representation pretrained on more than one million TCRβ sequences and supervised with curated cross-species MAIT, iNKT and conventional T cell references. The framework defines conserved model-informative TCRβ features, uses V-matched negative sampling to reduce germline-segment shortcuts, and generalizes across independent human and mouse datasets. In paired scRNA-seq/scTCR-seq datasets, UcTCRp recovered transcriptome-defined MAIT and iNKT cells and identified additional MAIT-like candidates supported by receptor evidence but missed by expression-only annotation. Bulk calibration against paired single-cell references and synthetic spike-in experiments established operating characteristics for repertoire-level abundance estimation. These results establish unpaired TCRβ repertoires as an actionable substrate for reconstructing unconventional T cell-associated immune states, enabling archived repertoire resources to be repurposed for systems-level studies of tissue immunity, disease and therapeutic response.
bioinformatics 2026-05-28 v1
In silico characterization of unique fungal modular rhodopsin expands the horizon of novel optobiological and biomedical applications
Kateriya, S.; Kumari, A.; Kumar, A.; Sharma, K.; Pati, S. R.; Mohanty, S.
Abstract
Microbial modular rhodopsins, in which light-sensing rhodopsin domains are fused with effector modules, have emerged as promising tools for optogenetic regulation in algae and other systems. However, the diversity and potential regulatory roles of fungal modular rhodopsins remain largely unexplored. Here, we performed a comprehensive in silico analysis to identify previously uncharacterized fungal modular rhodopsins that pair a conserved light-sensing core with diverse effector domains, including an RPEL motif, an NADP-binding Rossmann fold domain, an MCM (Mini-Chromosome Maintenance) domain, and GC-cAT (Carnitine O-Acetyltransferase) modules. In Aureobasidium pullulans, the representative modular rhodopsin (ApRh-RPEL) contains an RPEL motif associated with actin-related and transcriptional regulatory processes, suggesting a light-driven fungal signaling pathway involved in cellular and transcriptional regulation. Rhodopsins fused with NADP-binding Rossmann fold and MCM domains further indicate possible applications in light-programmable metabolic and cell-cycle signaling. Genome mining additionally revealed that A. pullulans harbours a diverse but underexplored array of biosynthetic gene clusters (BGCs), raising the intriguing possibility that light perception may regulate secondary metabolite pathways. Supporting this, multisource protein-protein interaction network analysis links ApRh-RPEL to enzymes involved in terpenoid and sphingolipid biosynthesis, indicating potential cross-talk between the light-sensing module and metabolic regulation. These findings outline a computationally derived model in which fungal modular rhodopsins such as ApRh-RPEL function as opto-synthetic regulators of biosynthetic processes. Structural predictions confirmed a conserved Schiff-base lysine and retinal-binding pocket, highlighting functional diversity across fungal rhodopsins.
Together, these findings expand the optogenetic toolkit and provide a framework for engineering light-driven signaling in fungi, with potential optobiological and biomedical applications.
bioinformatics 2026-05-28 v1
SQANTI-browser: visualization and curation of SQANTI3-classified long-read transcriptomes within the UCSC Genome Browser
Paniagua, A.; Blanco-Gomez, C.; Colomer Fernandez, A.; Diekhans, M.; Conesa, A.; Monzo, C.
Abstract
Long-read sequencing enables transcriptome-wide isoform discovery. However, it generates substantial technical and structural ambiguity that complicates transcript interpretation. Here, we present SQANTI-browser, a classification-aware visualization framework that converts SQANTI3 outputs into interactive UCSC Genome Browser Track Hubs, preserving full transcript structural metadata. By integrating SQANTI classifications directly within the UCSC ecosystem, SQANTI-browser enables dynamic filtering and evidence-guided curation alongside public resource tracks. Furthermore, its adaptive architecture natively supports non-reference genomes, orthogonal data, and custom metadata fields. Applied to clinical, noisy, and synthetic datasets, SQANTI-browser resolves alignment artifacts and rescues actionable novel isoforms, providing a robust framework for long-read transcriptome curation.
bioinformatics 2026-05-28 v1
Minimal Computational Framework for Systematic Identification of Antimicrobial Targets
Hassan, S. A.
Abstract
Systematic identification of antimicrobial targets remains a major challenge, as discovery still relies largely on empirical, resource-intensive approaches with limited efficiency. We present a method for identifying antimicrobial targets based on protein dynamics, enabling rational polypharmacology. The approach spans multiple biological scales, from taxa (genus and species) to biological networks, including network hubs and edges, their constituent proteins, protein binding sites, and their conformational states. It is grounded in the premise that coordinated intervention across multiple, optimally selected targets, using combinations of compounds at safe or submaximal doses, can achieve therapeutic effects while reducing toxicity and limiting mutational escape. A survey of known antimicrobials indicates that a small number of recurrent protein-level mechanisms account for most disruptions of microbial survival. We introduce metrics to detect these mechanisms across a pathogen proteome and describe a streamlined, modular workflow for target identification and prioritization that is optimized for ease of deployment and naturally interfaces with downstream applications such as molecular screening and de novo design.
bioinformatics 2026-05-28 v1
Accelerated Aging Signatures in 3D Genome Organization and Transcriptome in Schizophrenia
Ulianov, K. A.; Zagirova, D. R.; Kononkova, A. D.; Dudkovskaia, A. V.; Molodova, M. N.; Morozov, K. V.; Efimova, O. I.; Bazarevich, M.; Cherkasov, A. V.; Morozova, P. D.; Tvorogova, A. V.; Pletenev, I. A.; Kondratyev, N.; Golimbet, V. E.; Razin, S. V.; Khaitovich, P. E.; Ulianov, S. V.; Khrameeva, E.
Abstract
Schizophrenia is a severe neuropsychiatric disorder that affects the behavioral, emotional and cognitive state of patients. Despite its substantial heritability, the molecular etiology of the disease remains poorly understood. Many schizophrenia-associated genetic variants reside in non-coding regions and exert their effects through distal regulatory elements of the genome. In this context, the three-dimensional organization of the genome is expected to play a decisive role in establishing contacts between these regulatory elements and their target genes, thereby mediating schizophrenia-associated dysregulation of gene expression. Here, we present a novel Hi-C dataset providing an unprecedented view of three-dimensional genome organization in post-mortem schizophrenia brain samples. Our findings indicate that most changes occur at long-range genomic distances while the local architecture of topologically associated domains remains largely intact. However, neurons display localized and functionally relevant loop differences, particularly in regulatory regions associated with neurodevelopmental processes. Global characteristics of higher-order chromatin organization show an alteration pattern consistent with accelerated aging in schizophrenia, and downstream analysis of transcriptomic data from schizophrenia brain samples further confirms that schizophrenia is associated with accelerated aging.
bioinformatics 2026-05-28 v1
Mapping Genetic Risk Associations to Cellular Contexts via Deep Learning and Biological Ontologies
Margalit, T.; Levi, H.; Shamir, R.; Elkon, R.
Abstract
Translating genome-wide association study (GWAS) signals into trait-relevant cellular contexts remains challenging due to the complexity of the genomic regulatory code and linkage disequilibrium among associated variants. We present a novel computational framework that aggregates deep learning-based predictions of the functional effects of noncoding variants on transcriptional regulatory elements across GWAS loci and empirically evaluates their statistical significance. By organizing these aggregated signals within biological ontologies, our approach enables statistically calibrated interpretation of GWAS associations, highlighting relevant cell-type and tissue contexts across human traits.
bioinformatics 2026-05-28 v1
gTranslate: rapid and accurate translation table prediction for prokaryotic genomes
Chaumeil, P.-A.; Hugenholtz, P.; Parks, D. H.
Abstract
Background: Bioinformatic tools often require the prediction of protein-coding genes to make inferences about prokaryotic genomes. Typically, the genetic code used for translating genes to proteins must be specified by the user based on the taxonomic classification of a genome assembly or, for some widely used tools, established using a heuristic rule based on gene coding densities. Manual specification is at best inconvenient. A greater difficulty is that many bioinformatic tools are applied before taxonomic classifications have been established, which makes specifying the translation table impractical. Methods: Here we provide a computationally efficient tool, gTranslate, that uses an ensemble of five machine learning methods to accurately predict translation tables for prokaryotic genomes. The feature vector used by gTranslate takes advantage of differences in gene coding densities when predicting genes under different translation tables, along with features that consider the number and ratio of UGA stop codon reassignments to tryptophan or glycine. Results: We demonstrate that gTranslate correctly predicts the translation table of prokaryotic genomes >99.99% of the time (i.e. <1 error per 10,000 genomes) and outperforms a more computationally expensive prediction method and a coding density heuristic used by popular bioinformatic tools. Using gTranslate, we identify a basal lineage of Ca. Stammera capleta that uses the standard bacterial genetic code instead of the UGA stop codon to tryptophan reassignment common to other members of this species. We also identify the first instances of UGA-to-tryptophan reassignment in the Patescibacteriota, making this the first bacterial phylum with members capable of using translation tables 4, 11, and 25.
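The coding-density idea mentioned in the Methods can be illustrated with a toy example. The sketch below is not gTranslate's feature extraction: it assumes a single forward reading frame, ATG-initiated open reading frames, and a hand-made sequence, and only contrasts translation table 11 (UGA is a stop codon) with table 4 (UGA encodes tryptophan). The function names and the sequence are hypothetical.

```python
# Stop codons (DNA alphabet) under two NCBI translation tables.
# Table 11: standard bacterial code. Table 4: TGA is read as tryptophan.
STOPS = {
    11: {"TAA", "TAG", "TGA"},
    4: {"TAA", "TAG"},
}


def orf_lengths(seq: str, stops: set, frame: int = 0) -> list:
    """Lengths (in codons, stop excluded) of ATG-initiated ORFs in one frame."""
    lengths = []
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in stops:
            lengths.append((i - start) // 3)
            start = None
    return lengths


def coding_density(seq: str, table: int) -> float:
    """Fraction of nucleotides covered by ORFs (stop codon included)."""
    coding_nt = sum(3 * (n + 1) for n in orf_lengths(seq, STOPS[table]))
    return coding_nt / len(seq)


# Toy sequence with two in-frame TGA codons before a final TAA.
TOY = "ATGGCTTGAGCTGCTTGAGCTTAA"
print(coding_density(TOY, 11))  # TGA ends the ORF early
print(coding_density(TOY, 4))   # TGA is read through
```

On a genome that actually uses table 4, predicting genes under table 11 truncates ORFs at every in-frame TGA, so the density gap between the two tables becomes an informative feature.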
bioinformatics 2026-05-28 v1
Design of a Multi-epitope Vaccine Against Human Glanders Targeting Outer Membrane β-barrel Proteins of Burkholderia mallei
Kapoor, J.; Panda, A.; Kumar, S.; Bandyopadhyay, A.
Abstract
Burkholderia mallei, a facultative intracellular Gram-negative pathogen, is the causative agent of glanders, which primarily affects solipeds and is sporadically transmitted to humans. Current interventions mainly rely on antibiotics; however, increasing resistance and the lack of a licensed vaccine further complicate disease management. In the present study, a consensus-based computational framework was applied to the B. mallei turkey2 proteome. A total of 59 proteins, including porins, TonB receptors, autotransporters, and efflux components, were identified as surface-exposed outer membrane β-barrel (OMBB) proteins and used to design a multi-epitope vaccine (MEV) construct. B- and T-cell epitopes were predicted from the 59 proteins, and ten epitopes each of cytotoxic T-lymphocyte (CTL), helper T-lymphocyte (HTL), and B-cell were chosen based on their antigenicity, non-allergenicity, non-toxicity, surface accessibility, and conservation across 32 B. mallei strains. Suitable adjuvants were added at the N-terminus of the MEV to enhance its immunogenicity. The 780-amino-acid MEV construct was predicted to be antigenic and soluble upon overexpression, with 62.69% random coils, while the rest formed α-helices and β-strands. The tertiary structure of the MEV was generated and subsequently validated, indicating good structural quality. Molecular docking of the MEV with toll-like receptor 4 (TLR4) demonstrated strong affinity, and molecular dynamics simulation confirmed the structural stability of the MEV-TLR4 complex. In silico immune simulation showed the capability of the MEV to induce a strong immune response. The study proposes an MEV construct built from surface-exposed OMBB proteins, which directly interact with the host and serve as effective immunogenic targets against B. mallei infection.
bioinformatics 2026-05-28 v1
Sequence-Based Prioritization of Promoter Regulatory Variants in Colorectal Cancer Using a DNA Foundation Model
Shome, S.; Vajinepalli, S.; Saraf, A.
Abstract
Noncoding regulatory variants contribute to colorectal cancer (CRC) susceptibility, yet their functional interpretation remains difficult. This is mainly because regulatory effects are context-dependent and most noncoding regions lack reliable genomic annotations. We have developed a computational framework that aids in prioritizing promoter-associated variants using Evo2, a large-scale autoregressive DNA foundation model. In the framework, variants were mapped to promoter regions across ~1,250 CRC-associated genes and scored using Evo2-derived delta scores, the difference in sequence probability between reference and alternate alleles. Promoter variants showed greater predicted regulatory impact than non-promoter variants (median delta = 0.015 vs. 0.002; overall mean = 0.018, SD = 0.011). Applying a distributional threshold (delta > 0.020; top ~25%) identified 287 high-impact variants across 198 CRC-associated genes. These genes were enriched in CRC-relevant pathways such as Wnt signaling, p53 signaling, and cell cycle regulation, and 36.4% (72/198) overlapped known cancer genes. Independent validation showed high-impact variants were enriched at CRC GWAS loci and overlapped transcription factor binding sites (~32%) and motif-disrupting positions (~21%), supporting their functional relevance. Together, these results show that sequence-based foundation models can scalably prioritize noncoding regulatory candidates in CRC without supervised training or predefined annotations.
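The delta-score thresholding described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the per-allele scores are invented placeholders standing in for model outputs, no Evo2 call is made, and taking the absolute difference is an assumption, since the abstract does not state a sign convention. Only the 0.020 cutoff comes from the abstract.

```python
from statistics import median

# Hypothetical records: (variant id, in promoter?, reference score, alternate score).
# All scores are invented for illustration.
VARIANTS = [
    ("var1", True, -1.200, -1.225),
    ("var2", True, -0.980, -0.990),
    ("var3", False, -1.100, -1.102),
    ("var4", True, -1.050, -1.071),
    ("var5", False, -0.870, -0.871),
]

THRESHOLD = 0.020  # distributional cutoff quoted in the abstract


def delta_score(ref_score: float, alt_score: float) -> float:
    """Absolute difference between allele scores (sign convention assumed)."""
    return abs(ref_score - alt_score)


def prioritize(variants, threshold: float = THRESHOLD) -> list:
    """Return ids of variants whose delta score exceeds the threshold."""
    return [vid for vid, _, ref, alt in variants
            if delta_score(ref, alt) > threshold]


def median_delta(variants, in_promoter: bool) -> float:
    """Median delta score for promoter or non-promoter variants."""
    return median(delta_score(ref, alt)
                  for _, prom, ref, alt in variants if prom == in_promoter)


print(prioritize(VARIANTS))
print(median_delta(VARIANTS, True), median_delta(VARIANTS, False))
```

Because the cutoff is distributional (roughly the top quartile of scores), it would need to be recomputed for any new variant set rather than reused as a fixed constant.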
bioinformatics 2026-05-28 v1
Inferring Multi-Stage Pathway Progression Models from Tumor Phylogenies
Cankosyan, M.; Khan, S. R.; Sashittal, P.
Abstract
Cancer progression is an evolutionary process driven by the accumulation and selection of somatic mutations, giving rise to genetically diverse subclonal populations within tumors. Understanding the dependencies among mutations and identifying recurrent evolutionary trajectories is critical for understanding cancer progression and informing therapeutic strategies. Recent advances in genomic sequencing and phylogenetic reconstruction now enable large-scale inference of tumor phylogenies, providing detailed representations of intratumor evolutionary histories across patient cohorts. However, modeling cancer progression from these data remains challenging due to extensive inter- and intratumor heterogeneity, often arising from mutations in different genes within the same pathway that confer similar fitness advantages. Existing methods to infer pathway-level progression models summarize each tumor by a single consensus genotype, ignoring intra-tumor heterogeneity, while phylogeny-based methods typically focus on individual mutations and do not model pathways. We introduce PhyloStage, an algorithm for inferring multi-stage pathway-level cancer progression models from large cohorts of tumor phylogenies. PhyloStage represents progression as a partial order over pathways, permitting independent mutations in incomparable pathways while constraining the order of mutations within the same or dependent pathways. The framework also incorporates uncertainty in tumor phylogenies, resolves mutation clusters with unknown ordering, and stratifies patients by progression stage. Applied to a cohort of 120 acute myeloid leukemia (AML) tumor phylogenies, PhyloStage infers progression models that are aligned with known AML progression. On 99 non-small cell lung cancer (NSCLC) patients, PhyloStage stratifies patients into progression stages such that later stages have larger tumor sizes, corroborating phenotypic tumor progression.
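To make the "partial order over pathways" concrete, the sketch below checks whether one tumor phylogeny is compatible with a given pathway order. It is a simplified illustration and not the PhyloStage algorithm: the tree encoding (child to parent), the gene-to-pathway map, and the rule that a mutation may not descend from a mutation in a later pathway are all assumptions made for this example, and the gene and pathway names are placeholders.

```python
def transitive_closure(pairs: set) -> set:
    """Close a set of (earlier, later) pathway pairs under transitivity."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def ancestors(parent: dict, node: str):
    """Yield all strict ancestors of a node in a child -> parent tree."""
    while node in parent:
        node = parent[node]
        yield node


def is_consistent(parent: dict, pathway_of: dict, order: set) -> bool:
    """True if no mutation sits below a mutation from a later pathway."""
    closure = transitive_closure(order)
    for child in parent:
        for anc in ancestors(parent, child):
            if (pathway_of[child], pathway_of[anc]) in closure:
                return False
    return True


# Placeholder pathways P1 < P2 < P3; P4 is incomparable to all of them.
ORDER = {("P1", "P2"), ("P2", "P3")}
PATHWAY_OF = {"geneA": "P1", "geneB": "P2", "geneC": "P3", "geneD": "P4"}

# geneA -> geneB -> geneC, with geneD branching off geneA.
tree_ok = {"geneB": "geneA", "geneC": "geneB", "geneD": "geneA"}
# geneC acquired first, geneA later: contradicts P1 < P3.
tree_bad = {"geneA": "geneC"}

print(is_consistent(tree_ok, PATHWAY_OF, ORDER))
print(is_consistent(tree_bad, PATHWAY_OF, ORDER))
```

Incomparable pathways such as P4 impose no constraint, which is what allows independent mutations to occur in any order across branches.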
bioinformatics 2026-05-28 v1
Distinct fibrotic, epithelial and immune transcriptomic programs in phenotypes of chronic lung allograft dysfunction
Ishiwata, T.; Berra, G.; Allen, J.; Burman, A.; Wilson, G.; Carter, Z.; Watanabe, T.; Solomon, M.; Keshavjee, S.; Yeung, J.; Juvet, S. C.; Martinu, T.
Abstract
Background: Chronic lung allograft dysfunction (CLAD) is the major cause of late mortality after lung transplantation and includes two principal phenotypes, bronchiolitis obliterans syndrome (BOS) and restrictive allograft syndrome (RAS). RAS and other phenotypes with RAS-like opacities (RLO) on chest imaging have a poorer prognosis. Despite clear clinical and pathological differences, molecular distinctions between phenotypes remain poorly defined. We aimed to explore gene transcriptional profiles across CLAD phenotypes and relevant controls. Methods: We performed bulk RNA sequencing on explanted lung tissue from 45 lung transplant recipients with end-stage CLAD (20 with RLO and 25 without RLO). Samples from 27 control donor and lobectomy lungs and 16 idiopathic pulmonary fibrosis (IPF) lungs served as comparators. Non-negative matrix factorization (NMF) was used to identify latent transcriptomic signatures, which were correlated with clinical, radiologic, and histopathologic features. Results: NMF identified seven distinct gene signatures that segregated CLAD phenotypes. RLO-CLAD lungs were enriched for extracellular matrix remodeling and B cell/plasma cell-associated signatures, overlapping partly with IPF, whereas non-RLO-CLAD showed relative enrichment of epithelial injury and surfactant-response pathways. Signatures related to epithelial homeostasis and ciliary/microtubule function were progressively reduced from control lungs to non-RLO-CLAD and were most suppressed in RLO-CLAD. Conclusions: RLO-CLAD and non-RLO-CLAD, aligning with RAS and BOS phenotypes, show distinct transcriptomic signatures. RLO-CLAD is characterized by profibrotic and humoral immune signatures with profound epithelial dysfunction, whereas non-RLO-CLAD shows relative enrichment of epithelial injury responses. These data provide molecular stratification of CLAD and support the development of phenotype-specific biomarkers and targeted therapies.
bioinformatics 2026-05-28 v1
BIFO: A Biological Information Flow Ontology for Directed Propagation in Heterogeneous Biomedical Knowledge Graphs
Taylor, D. M.; Mohseni Ahooyi, T.; Stear, B.; Zhang, Y.; Lahiri, A. M.; Simmons, J. A.; Chinwalla, A.; Nemarich, C.; Callahan, T. J.; Silverstein, J. C.
Abstract
Biomedical knowledge graphs integrate heterogeneous data by connecting many entity types through many relationship types. Computational analyses that propagate signal across these graphs (random walks, diffusion, and message passing) implicitly assume that every traversable edge can carry a biological signal. In a heterogeneous KG this is rarely true: hierarchical, lexical, and purely statistical edges do not, by themselves, define an admissible directed state transformation, and traversing them propagates signal along paths that are not biologically meaningful. We present the Biological Information Flow Ontology (BIFO), a graph-agnostic specification of which directed transformations are biologically admissible for computable information flow. BIFO defines fourteen entity classes, a taxonomy of flow classes organized around the backbone G+CH → RNA → P → PW → C → PH → DS, a set of admissibility constraints, and a two-level CURIE mapping that can be applied without schema-specific code to any graph whose identifiers and predicates are resolvable through, or extendable to, the BIFO mapping tables. A four-step conditioning protocol converts a raw property graph into a conditioned propagation graph in which only admissible, direction-aware edges remain. We provide a reference implementation on the Data Distillery Knowledge Graph (DDKG); conditioning a cohort-independent, gene-anchored subgraph as a BIFO substrate of 33.6 million edges retained 23.7 million (70.7%) as BIFO-classified relationships, cleanly separating 13.3 million propagating mechanistic edges from 10.5 million retained-but-non-propagating observational associations, and confirming that pathway concepts are configured as scoring accumulation endpoints for BIFO-PPR pathway scoring. BIFO is an admissibility specification for computable propagation of signal over knowledge graphs. It is released as an open specification with versioned mapping tables and tooling, providing a reusable substrate for biologically interpretable, direction-aware analysis of biomedical knowledge graphs.
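The effect of conditioning a graph on edge admissibility can be sketched in a few lines. This is an illustration of the general idea only, not the BIFO four-step protocol or its mapping tables: the predicate names, the class labels, and the node identifiers below are hypothetical.

```python
# Hypothetical predicate -> class table (the real BIFO mapping tables are
# not reproduced here). Predicates missing from the table are unclassified.
PREDICATE_CLASS = {
    "transcribed_to": "propagating",
    "translated_to": "propagating",
    "participates_in": "propagating",
    "correlated_with": "observational",
}

# Toy property graph as (source, predicate, target) triples.
EDGES = [
    ("GENE:1", "transcribed_to", "RNA:1"),
    ("RNA:1", "translated_to", "PROT:1"),
    ("PROT:1", "participates_in", "PW:1"),
    ("GENE:1", "correlated_with", "PH:1"),   # statistical association
    ("GENE:1", "is_a", "CLASS:gene"),        # hierarchical edge
    ("GENE:1", "has_synonym", "LABEL:1"),    # lexical edge
]


def condition(edges, table=PREDICATE_CLASS):
    """Split edges into propagating and retained-but-non-propagating sets."""
    propagating, observational = [], []
    for src, pred, dst in edges:
        cls = table.get(pred)
        if cls == "propagating":
            propagating.append((src, dst))
        elif cls == "observational":
            observational.append((src, dst))
        # unclassified edges are dropped from the conditioned graph
    return propagating, observational


def reachable(start: str, directed_edges) -> set:
    """Nodes reachable from start by following edges in their direction."""
    adjacency = {}
    for src, dst in directed_edges:
        adjacency.setdefault(src, []).append(dst)
    seen, stack = set(), [start]
    while stack:
        for nxt in adjacency.get(stack.pop(), []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


propagating, observational = condition(EDGES)
print(reachable("GENE:1", propagating))
```

In this toy graph, signal seeded at the gene reaches the pathway node through mechanistic edges but never leaks to the phenotype through the correlation edge or to ontology labels through hierarchical and lexical edges.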
bioinformatics 2026-05-28 v1
MINA: linear probes reveal coding-sequence family signal in frozen DNA encoders
Wijaya, A. S.; Leung, H.; Yoo, H.
Abstract
Frozen DNA encoders are often used as genomic feature extractors, but downstream fine-tuning does not show what information is already linearly accessible in their unchanged embeddings. We introduce MINA (Model Interrogation of Nucleotide Architectures), a lightweight probing benchmark for testing whether frozen DNA embeddings can recover (i) a 5-way protein-family label for each gene and (ii) the 1,536-dimensional GenePT embedding of each gene's natural-language summary. We compare recoverability between canonical coding sequence and TSS-centred genomic contexts. In 3,244 human protein-coding genes from five families, frozen encoders recovered the family-annotation target most clearly from coding sequence. NT-v2 with meanD pooling reached macro-F1 0.828 / kappa 0.821, compared with 0.672 / kappa 0.702 for a CDS 4-mer baseline. Alignment to GenePT natural-language descriptions was weaker. Replacing CDS with 196,608 bp TSS-centred windows substantially reduced performance across all four encoders, indicating that the recoverable signal is primarily coding-sequence family signal rather than generic gene-function signal from arbitrary genomic context.
bioinformatics 2026-05-28 v1
A New Hybrid Method for Brain Tumor Detection Based on Deep Learning
Sharbaf, S.
Abstract
Brain tumor detection using Magnetic Resonance Imaging (MRI) remains a challenging task due to tumor heterogeneity and imaging variability. This paper presents a novel hybrid Deep Convolutional Neural Network Whale Optimization Algorithm (DCNN-WOA) framework for automated brain tumor detection and classification. The proposed method consists of four main stages: MRI data preprocessing and augmentation, deep feature extraction using multi-layer Convolutional Neural Networks (CNN), feature selection and hyperparameter optimization via the Whale Optimization Algorithm (WOA), and final classification with comprehensive performance evaluation. By jointly optimizing deep features and training parameters, the framework effectively reduces feature redundancy, accelerates convergence, and enhances model generalization. Experimental results on a publicly available MRI dataset demonstrate that the DCNN-WOA model outperforms conventional CNN and state-of-the-art Deep Learning (DL) architectures, achieving an accuracy of 97.8%, sensitivity of 96.4%, specificity of 98.1%, and F1-score of 97.2%. The practical impact of this approach makes it a promising solution for real-time clinical decision-support systems in neuroimaging.
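For readers comparing the four reported figures, the sketch below shows how accuracy, sensitivity, specificity, and F1-score relate to a binary confusion matrix. It uses the standard definitions; the counts are invented for illustration and are not the paper's data, and treating the task as binary is an assumption.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)   # recall, true-positive rate
    specificity = tn / (tn + fp)   # true-negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


# Invented counts: 100 tumor scans and 100 non-tumor scans.
metrics = classification_metrics(tp=90, fp=20, tn=80, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

F1 depends on precision as well as sensitivity, so it cannot be derived from sensitivity and specificity alone without knowing the class balance.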
bioinformatics 2026-05-28 v1
Leveraging AI and structural proteomics for rational design of a KAT6A degrader
Arad, G.; Simchi, N.; Brodsky, S.; Shtrikman, A.; Kedem, Y.; Alchanati, I.; Otonin, G.; Shenoy, A.; Kovalerchik, D.; Ran Shchory, M.; Ben Shoshan-Galeczki, Y.; Cohen, N.; Lange, K.; Seger, E.; Pevzner, K.
Abstract
While targeted protein degraders such as PROTACs are a clinically proven therapeutic strategy, the discovery of novel degraders remains hampered by a trial-and-error process. To address this challenge, we developed the AIMS™ platform, which combines structural proteomics with AI models for rational PROTAC design. AIMS™ is an end-to-end toolkit for PROTAC optimization, encompassing structure solving using proteomics and AI, prediction of ADME and degradation properties, and prospective ranking of compound design ideas. Altogether, this integrated platform successfully enabled the multi-parameter optimization of a potent and bioavailable in vivo validated KAT6A degrader, establishing a versatile framework for PROTAC development across various targets.
bioinformatics 2026-05-28 v1