Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
Using Variable Window Sizes for Phylogenomic Analyses of Whole Genome Alignments
Ivan, J.; Lanfear, R.
Many phylogenomic studies used non-overlapping windows to address gene tree discordance across a set of aligned genomes. Recently, Ivan et al. (2025) proposed an information theoretic approach to choose an optimal window size given the alignment. However, this approach selects only a single fixed window size per chromosome, which is a useful first step but fails to account for variation in the size of non-recombining regions along each chromosome. Such variation is expected to occur due to the stochastic nature of recombination as well as the variation in recombination rates along chromosomes. In this study, we extend the approach of Ivan et al. (2025) to allow window sizes to vary across the chromosome, using a splitting-and-merging strategy that allows for each window to be of an arbitrary length. We showed that the new method outperformed the fixed-window approach in recovering gene tree topologies on a wide range of simulated datasets. Applying the new method on the genomes of seven Heliconius butterflies, we found that the average window sizes for the group ranged between 538-808bp, but with a very similar distribution of gene tree topologies compared to previous studies that used fixed window sizes. For the genomes of great apes, the average window sizes ranged from 4.2kb to 6.2kb, with the proportion of the major topology (i.e., grouping human and chimpanzee together) reaching approximately 80%. In conclusion, our study highlights the limitations of using a fixed window size when recombination rates vary across the chromosomes, and proposes a splitting-and-merging approach that allows for variable window sizes across whole genome alignments.
bioinformatics 2026-03-19 v2
Frequency-domain kernels enable atlas-scale detection of spatially variable genes
Yang, C.; Zhang, X.; Chen, J.
Identifying spatially variable genes in spatial transcriptomics requires methods that are accurate, well calibrated and scalable, yet current approaches trade expressive kernels for tractable computation. We present FlashS, which moves spatial testing to the frequency domain: Random Fourier Features and sparse sketching enable multi-scale kernel testing on zero-inflated data without constructing distance matrices, and a kurtosis-corrected null preserves calibration. Across 50 datasets from 9 platforms, FlashS achieves a mean Kendall τ of 0.935, exceeding the next-best method by 0.049. On the Allen Brain MERFISH atlas of 3.94 million cells, it completes in 12.6 minutes using 21.5 GB memory and maintains near-nominal false-positive rates under permutation. In human cardiac tissue, this improved ranking recovers a ventricular cardiomyocyte-associated mitochondrial biogenesis program that largely eludes parametric alternatives and replicates in an independent cohort.
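The idea of testing with a kernel without building a pairwise distance matrix can be illustrated with plain Random Fourier Features. The sketch below is a generic, standard-library illustration for a Gaussian kernel and is not FlashS's implementation; the function names, the length scale, and the feature count are all assumptions chosen for the example.

```python
import math
import random


def make_rff(dim, n_features, lengthscale, seed=0):
    """Return a feature map whose inner products approximate a Gaussian kernel."""
    rng = random.Random(seed)
    # Frequencies are drawn from the kernel's spectral density, N(0, 1/lengthscale^2).
    freqs = [[rng.gauss(0.0, 1.0 / lengthscale) for _ in range(dim)]
             for _ in range(n_features)]
    # Random phases are uniform on [0, 2*pi).
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def features(x):
        return [scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
                for w, b in zip(freqs, phases)]

    return features


def gaussian_kernel(x, y, lengthscale):
    """Exact Gaussian kernel, used here only to check the approximation."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * lengthscale ** 2))


# Two hypothetical spot coordinates at distance 1.
features = make_rff(dim=2, n_features=4000, lengthscale=1.0)
x, y = (0.0, 0.0), (0.6, 0.8)
approx = sum(a * b for a, b in zip(features(x), features(y)))
exact = gaussian_kernel(x, y, 1.0)
```

Each location is mapped to a fixed-length feature vector once, so kernel similarities come from inner products of features rather than from an n-by-n distance matrix.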
bioinformatics 2026-03-19 v2
Composition and higher-order structure in nucleic acids sequenced from a chondrite
Farage, C.; Church, G. M.; Bachelet, I.
The known tree of life occupies an infinitesimal region of the space of all mathematically possible evolutionary histories, yet our sequence analysis frameworks are implicitly calibrated to it and to its associated compositional and grammatical regularities. Here we analyze nucleic acid molecules sequenced from the Zag meteorite as part of a broader effort to understand how nucleic acid sequence composition and higher-order structure are shaped under chemically divergent environments. We characterize these sequences across multiple analytical layers, and show that they lack signatures of protein-coding organization, translational periodicity, or known biological grammar. At the same time, they deviate significantly from random or composition-only null models, displaying constrained complexity and low-dimensional structure in k-mer frequency space. Multiple tests place amplification and sequencing-driven artifacts and metagenomic contaminants at a low likelihood. Taken together, these findings indicate that the Zag sequences occupy an unusual region of sequence space that is not readily accounted for by known biological or technical models, thereby narrowing, but not resolving, the range of plausible explanations and motivating independent replication and further investigation.
bioinformatics 2026-03-19 v2
WITHDRAWN: Beyond Binding Affinity: The Kinetic-Compatibility Hypothesis for Nipah Virus Neutralization
Bozkurt, C.
The authors have withdrawn their manuscript because of a fundamental error in the identification of the biological target protein. The analysis was originally framed around the mechanical transitions of the Nipah virus Fusion (F) protein; however, the empirical functional data utilized (from the 2025 AdaptyvBio competition) was directed toward the Attachment (G) glycoprotein. While the sequence-level characterization of the binders remains internally consistent, the mechanical analogies used are not applicable to the Attachment (G) protein architecture. Therefore, the authors do not wish this work to be cited as reference for the project. If you have any questions, please contact the corresponding author.
bioinformatics 2026-03-19 v2
ProteinSage: From implicit learning to explicit structural constraints for efficient protein language modeling
Shen, L.; Chao, L.; Liu, T.; Liu, Q.; Zhou, G.; Wang, H.; Dong, X.; Li, T.; Zhang, X.; Ni, J.
While protein language models typically rely on sequence-only pretraining objectives, this approach often fails to capture structural regularities and demands large datasets. To address this, we introduce ProteinSage, a pretraining framework that learns protein representations under explicit structural constraints. ProteinSage incorporates structural signals via structure-guided masking and a causal objective designed to model long-range dependencies. This structure-constrained pretraining equips ProteinSage with transferable representations using less data and computation, yet achieves competitive or superior performance across diverse structure-aware and general protein modeling benchmarks. To determine whether these gains stem from genuine structural generalization rather than task-specific fitting, we applied ProteinSage to a structure-driven protein discovery task, focusing on proteins with multi-pass transmembrane helical architectures such as distantly related microbial rhodopsins. The model successfully identified six previously unannotated microbial rhodopsin homologs. Together, our work establishes structure-constrained pretraining as an effective pathway toward data-efficient and structurally faithful protein representation learning.
bioinformatics 2026-03-19 v1
evedesign: accessible biosequence design with a unified framework
Hopf, T. A.; Gazizov, A.; Garcia Busto, S.; Eschbach, E.; Lee, S.; Mirdita, M.; Orenbuch, R.; Belahsen, K.; Ross, D.; Sander, C.; Steinegger, M.; d'Oelsnitz, S.; Marks, D.
Machine learning methods for protein engineering are rarely interoperable, require bespoke workflows, and remain inaccessible to non-experts. Yet the design problems that matter most - conditional design subject to real-world constraints, multi-objective optimization, and iterative lab-in-the-loop workflows where experimental data continuously refines successive design rounds - demand exactly the kind of flexible, composable infrastructure that no single tool provides. We present evedesign, a unified open-source framework that formalizes conditional biosequence design in a method-agnostic way, enabling complex multiobjective workflows combining supervised and unsupervised models from standardized specifications, and built from the outset to support iterative experimental integration. An interactive web interface facilitates end-to-end design for a broad scientific audience at https://evedesign.bio. We demonstrate evedesign's utility in antibody engineering, enzyme design, and natural enzyme discovery, and invite open-source community contributions.
bioinformatics 2026-03-19 v1
SELFormerMM: multimodal molecular representation learning via SELFIES, structure, text, and knowledge graph integration
Ulusoy, E.; Bostanci, S.; Deniz, B. E.; Dogan, T.
Motivation: Molecular representation learning is central to computational drug discovery. However, most existing models rely on single-modality inputs, such as molecular sequences or graphs, which capture only limited aspects of molecular behaviour. Yet unifying these modalities with complementary resources such as textual descriptions and biological interaction networks into a coherent multimodal framework remains non-trivial, hindering more informative and biologically grounded representations. Results: We introduce SELFormerMM, a multimodal molecular representation learning framework that integrates SELFIES notations with structural graphs, textual descriptions, and knowledge graph-derived biological interaction data. By aligning these heterogeneous views, SELFormerMM effectively captures complementary signals that unimodal approaches often overlook. Our performance evaluation has revealed that SELFormerMM outperforms structure-, sequence-, and knowledge-based models on multiple molecular property prediction tasks. Ablation analyses further indicate that effective cross-modal alignment and modality coverage improve the model's ability to exploit complementary information. Overall, integrating SELFIES with structural, textual, and biological context enables richer molecular representations and provides a promising framework for hypothesis-driven drug discovery. Availability: SELFormerMM is available as a programmatic tool, together with datasets, pretrained models, and precomputed embeddings at https://github.com/HUBioDataLab/SELFormerMM. Contact: tuncadogan@gmail.com
bioinformatics 2026-03-19 v1
PhyloRNA: a database of RNA secondary structures with associated phylogenies
Quadrini, M.; Tesei, L.
The ability to access, search, and analyse large collections of RNA molecules together with their secondary structure and evolutionary context is essential for comparative and phylogeny-driven studies. Although RNA secondary structure is known to be more conserved than primary sequence, no existing resource systematically associates individual RNA molecules with curated phylogenetic classifications. Here, we introduce PhyloRNA, a curated meta-database that provides large-scale access to RNA secondary structures collected from public resources or derived from experimentally resolved 3D structures. PhyloRNA allows users to search, select, and download extensive sets of RNA molecules in multiple textual formats, each entry being explicitly linked to phylogenetic annotations derived from five curated taxonomy systems. In addition to taxonomic information, each RNA molecule is accompanied by a rich set of descriptors, including pseudoknot order, genus, and three levels of structural abstraction - Core, Core Plus, and Shape - which facilitate comparative analyses across sets of molecules. PhyloRNA is publicly available at https://bdslab.unicam.it/phylorna/ and is regularly updated to incorporate newly available data and revised taxonomic annotations.
bioinformatics 2026-03-19 v1
noHiC: A Pipeline for Plant Contig Scaffolding Using Personalized References from Pangenome Graphs
Nguyen-Hoang, A.; Arslan, K.; Kopalli, V.; Windpassinger, S.; Perovic, D.; Stahl, A.; Golicz, A.
Hi-C data is commonly used for reference-free de novo scaffolding. However, with the rapid increase in high-quality reference genomes, reference-guided workflows are now more practical for assembling large numbers of target genomes without relying on costly and labor-intensive Hi-C sequencing. Recently, a pangenome graph-based haplotype sampling algorithm was introduced to generate personalized graphs for target genomes. Such graphs have strong potential as references for reference-guided contig scaffolding. Here, we present noHiC, a reference-guided scaffolding pipeline supporting key steps of plant contig scaffolding. A distinctive feature of noHiC is the nohic-refpick script, generating a best-fit synthetic reference (synref) from a pangenome graph that is genetically close to the target contigs. This enables the integration of genetic information from many references (up to 48 in our tests) without using them separately during scaffolding. Synrefs showed advantages over highly contiguous conventional references in reducing false contig breaking during reference-based correction. Additionally, nohic-refpick can be combined with fast scaffolders (ntJoin) to rapidly produce highly contiguous assemblies using synrefs derived from pangenome graphs. The noHiC pipeline, used alone or in combination with ntJoin, can generally produce assemblies that are structurally consistent with public Hi-C-based or manually curated genomes. The pipeline is publicly available at https://github.com/andyngh/noHiC.
bioinformatics 2026-03-19 v1
Translating Histopathology Foundation Model Embeddings into Cellular and Molecular Features for Clinical Studies
Cui, S.; Sui, Z.; Li, Z.; Matkowskyj, K. A.; Yu, M.; Grady, W. M.; Sun, W.
AI-powered pathology foundation models provide general-purpose representations of histopathological images by encoding image tiles into numerical embeddings. However, these embeddings are not directly interpretable in biological or clinical terms and must be translated into biologically meaningful features, such as cell-type composition or gene expression, to enable downstream clinical applications. To bridge this gap, we developed STpath, a framework that integrates histopathology image embeddings derived from existing pathology foundation models with matched, spatially resolved transcriptomics data. STpath consists of cancer-specific XGBoost models trained to infer cell-type compositions and gene expression from histopathology image tiles. We evaluated STpath in colorectal and breast cancer datasets and showed that it provides accurate estimates of the composition of major cell types and the expression of a subset of genes, with further performance gains achieved by combining embeddings from multiple foundation models. Finally, we demonstrated that STpath-inferred features can be used in downstream studies to evaluate their associations with clinical outcomes.
bioinformatics 2026-03-19 v1
G-VEP: GPU-Accelerated Variant Effect Prediction for Clinical Whole-Genome Sequencing Analysis
Green, E.; Mardinoglu, A.
Whole-genome sequencing (WGS) has transformed clinical diagnostics, yet variant annotation remains a computational bottleneck. The Variant Effect Predictor (VEP) integrates pathogenicity predictors and population databases essential for ACMG/AMP variant classification, but these annotation plugins are fundamentally I/O-bound, consuming over 70% of total pipeline runtime. Here, we present G-VEP, a GPU-accelerated annotation framework built on a custom CUDA kernel that replaces sequential per-variant database lookups with massively parallel binary search across precomputed indices. By executing annotation lookups for all input variants simultaneously, G-VEP reduces plugin runtime from 72 minutes to 4 minutes (17-fold acceleration) and total annotation runtime from 100 minutes to 33 minutes (3-fold acceleration), while maintaining complete concordance with standard VEP output. Benchmarking across 75 clinical WGS samples demonstrated consistent performance, with no annotation discrepancies; validation on samples containing known pathogenic variants confirmed the preservation of all clinically significant findings. The 8.8 GB index footprint fits within consumer-grade 16 GB GPUs. G-VEP addresses an unmet need in clinical WGS analysis: while GPU suites such as NVIDIA Parabricks accelerate alignment and variant calling, they do not provide the Ensembl VEP plugin ecosystem used in clinical interpretation. G-VEP removes this final bottleneck and enables accelerated WGS interpretation. G-VEP is freely available through a web-based user interface with REST API documentation at https://www.phenomeportal.org/gvep, and source code for local installation and deployment at https://github.com/Phenome-Longevity/G-VEP.
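The lookup pattern described above, binary search over a precomputed sorted index instead of per-variant database queries, can be sketched on the CPU as follows. This is only an illustration of the access pattern: the key layout `(chrom, pos, ref, alt)`, the function names, and the annotation strings are hypothetical, and the loop stands in for what a GPU kernel would run with one thread per query.

```python
from bisect import bisect_left


def build_index(records):
    """Sort annotation records once so every later lookup is a binary search."""
    items = sorted(records.items())
    keys = [key for key, _ in items]
    values = [value for _, value in items]
    return keys, values


def lookup_batch(keys, values, queries):
    """Look up each query independently; lookups share no state, so they parallelize."""
    results = []
    for query in queries:
        i = bisect_left(keys, query)
        hit = i < len(keys) and keys[i] == query
        results.append(values[i] if hit else None)
    return results


# Hypothetical index keyed by (chrom, pos, ref, alt).
keys, values = build_index({
    ("1", 100, "A", "G"): "benign",
    ("2", 50, "C", "T"): "pathogenic",
    ("1", 250, "G", "A"): "uncertain",
})
annotations = lookup_batch(keys, values, [("2", 50, "C", "T"), ("1", 101, "A", "G")])
```

Because the index is built once and each query is independent, the batch can be split across as many workers as the hardware provides.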
bioinformatics 2026-03-19 v1
ST-PARM: Pareto-Complete Inference-Time Alignment for Multi-Objective Protein Design
Yin, R.; Shen, Y.
Motivation: Protein engineering is inherently multi-objective: improving one property can degrade others, so practical workflows require generating non-dominated (Pareto-optimal) candidates spanning a trade-off surface. Linear objective scalarization and deterministic pairwise preference learning can under-explore non-convex Pareto regions and amplify noise from uncertain evaluators, limiting Pareto coverage and trade-off controllability. Results: We introduce Smooth Tchebycheff Preference-Aware Reward Model (ST-PARM), an inference-time alignment framework that steers a frozen protein language model along user-specified trade-offs with a lightweight reward model trained only once. ST-PARM combines (i) a reward-calibrated pairwise preference loss that is uncertainty-aware by down-weighting ambiguous comparisons under noise, (ii) a smooth Tchebycheff scalarization that is Pareto-complete in principle and improves empirical trade-off coverage, and (iii) latent-space pair-construction strategies. On GFP fluorescence–stability (full-length design) and IL-6 nanobody stability–solubility (CDR3+suffix design), ST-PARM delivers broader Pareto coverage and stronger preference tracking than baselines PARM and MosPro. For GFP, a conservative structural screen for local confidence and global fold preservation retains a broad frontier and strong controllability, yielding an actionable cohort for downstream assays. We also provide cross-evaluator robustness checks, a three-objective extension, and a natural-language alignment generality check in the Supplement, establishing a practical foundation for controllable sequence generation under competing objectives and noisy measurements. Availability and Implementation: https://github.com/Shen-Lab/ST-PARM.
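For readers unfamiliar with the scalarization named above, the following sketch contrasts the classic (hard-max) Tchebycheff scalarization with its smooth log-sum-exp relaxation, written for objectives to be minimized. It shows the generic textbook form under assumed names and example values; how ST-PARM applies it to rewards and preferences is not specified here and is not reproduced.

```python
import math


def tchebycheff(losses, weights, ideal):
    """Classic Tchebycheff scalarization: worst weighted gap to the ideal point."""
    return max(w * (f - z) for f, w, z in zip(losses, weights, ideal))


def smooth_tchebycheff(losses, weights, ideal, mu=0.1):
    """Smooth relaxation: mu * log(sum(exp(w_i * (f_i - z_i) / mu))).

    Smaller mu tracks the hard max more closely; larger mu is smoother.
    """
    terms = [w * (f - z) / mu for f, w, z in zip(losses, weights, ideal)]
    shift = max(terms)  # subtract the max before exponentiating for stability
    return mu * (shift + math.log(sum(math.exp(t - shift) for t in terms)))


# Hypothetical two-objective example with equal preference weights.
losses, weights, ideal = [0.2, 0.8], [0.5, 0.5], [0.0, 0.0]
hard = tchebycheff(losses, weights, ideal)
sharp = smooth_tchebycheff(losses, weights, ideal, mu=0.01)
loose = smooth_tchebycheff(losses, weights, ideal, mu=1.0)
```

The smooth value always lies between the hard maximum and the hard maximum plus `mu * log(m)` for `m` objectives, which is why shrinking `mu` recovers the classic form while keeping the function differentiable.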
bioinformatics 2026-03-19 v1
RiboBA: a bias-aware probabilistic framework for robust ORF identification across diverse ribosome profiling protocols
Bai, J.; Yang, R.
By mapping ribosome-protected fragments (RPFs) genome-wide, ribosome profiling (Ribo-seq) has uncovered extensive translation beyond conventional coding sequences, revealing non-canonical ORFs (ncORFs) with emerging roles in diverse biological processes. However, protocol-induced biases introduced during library construction can substantially distort RPF signals. Most existing ORF callers are not designed to explicitly account for such artifacts, limiting robust ncORF identification. Here, we present RiboBA, a bias-aware probabilistic framework to address this challenge. RiboBA consists of two main components: a generative module that recovers protocol-induced biases and codon-level ribosome occupancy, and a supervised module that identifies translated ORFs and initiation sites using the resulting bias-adjusted profiles. Evaluated through simulations and on a range of Ribo-seq datasets, with particular support from cell-type-specific immunopeptidomics, RiboBA robustly recovers protocol-induced parameters and achieves superior accuracy and sensitivity in ncORF identification. Notably, RiboBA performs particularly well on RNase I libraries with attenuated three-nucleotide periodicity, as well as on MNase and nuclease P1 libraries, while maintaining competitive runtimes. In a Drosophila case study, RiboBA identifies conserved ncORFs with coding potential, including recurrent upstream translation of ThrRS and Mettl2 that suggests a potential threonine-specific translational control axis.
bioinformatics 2026-03-19 v1
SNMF: Ultrafast, Spatially-Aware Deconvolution for Spatial Transcriptomics
Alonso, L.; Ochoa, I.; Rubio, A.
Sequencing-based spatial transcriptomics has revolutionized the study of tissue architecture, but its 'spots' often contain multiple cells, creating a key computational challenge, termed deconvolution, to decipher each spot's cell-type composition. Reference-free deconvolution methods avoid the need for a matched single-cell RNA-seq dataset, but typically neglect the spatial correlation between neighboring spots and do not leverage modern hardware for efficient computation. Here, we propose SNMF (Spatial Non-negative Matrix Factorization): a rapid, accurate, and reference-free deconvolution method. SNMF extends the standard NMF framework with a spatial mixing matrix that models neighborhood influences, guiding the factorization toward spatially coherent solutions. Our R package is, to our knowledge, the first spatial transcriptomics deconvolution tool to natively support GPU execution, completing benchmark analyses in under one minute (over two orders of magnitude faster than the slowest competing methods) with moderate memory requirements. On synthetic and real benchmark datasets, SNMF significantly outperforms state-of-the-art methods in deconvolution accuracy, and on a human melanoma dataset it recovers biologically meaningful cell-type signatures, including a tumor-boundary transition zone, without any reference input. The proposed method is publicly available at https://github.com/ML4BM-Lab/SNMF.
bioinformatics 2026-03-19 v1
Identification and classification of all Cytochrome P450 deposits in the Protein Data Bank
Smieja, P.; Zadrozna, M.; Syed, K.; Nelson, D.; Gront, D.
Cytochrome P450 monooxygenases (CYPs/P450s) form a highly diverse enzyme superfamily central to biotechnology, pharmacology, and environmental science. Despite the large number of available structures, identifying and comparing P450 entries in structural repositories remains challenging due to their extreme sequence divergence and inconsistent annotation practices. In particular, many deposits lack the standardized nomenclature (CYPid) and instead rely on legacy or author-defined common names (like P450cam, P450BM-3 and P450-PCN1), which are often inconsistent in formatting and specificity. This is particularly difficult for a superfamily as sequentially diverse as P450s. This hinders reliable retrieval and cross-referencing, making even the identification of all P450 structures in the database nontrivial. To overcome these obstacles, we developed a structure-guided discovery and validation workflow combining keyword search, Hidden Markov Models, and structural alignment, enabling robust detection and annotation. This strategy identified 1,513 deposits representing 674 unique sequences. All sequences were reannotated using the P450Atlas server and manually verified, confirming high assignment accuracy. In the process, we have also identified five new CYP subfamilies. The resulting dataset constitutes the first rigorously curated, structure-linked registry of P450 enzymes, integrated into a publicly accessible resource and supported by an automated pipeline that periodically scans newly released entries. By unifying structurally validated identification with standardized CYP nomenclature, this work establishes a reliable framework for accurate retrieval, comparison, and future large-scale analyses of P450 enzymes.
bioinformatics 2026-03-19 v1
Super Bloom: Fast and precise filter for streaming k-mer queries
Conchon-Kerjan, E.; Rouze, T.; Robidou, L.; Ingels, F.; Limasset, A.
Approximate membership query structures are used throughout sequence bioinformatics, from read screening and metagenomic classification to assembly, indexing, and error correction. Among them, Bloom filters remain the default choice. They are not the most efficient structures in either time or memory, but they provide an effective compromise between compactness, speed, simplicity, and dynamic insertions, which explains their widespread adoption in practice. Their main drawback is poor cache locality, since each query typically requires several random memory accesses. Blocked Bloom filters alleviate this issue by restricting accesses for any given element to a single memory block, but this usually comes with a loss in accuracy at fixed memory. In this work, we introduce the Super Bloom Filter, a Bloom filter variant designed for streaming k-mer queries on biological sequences. Super Bloom uses minimizers to group adjacent k-mers into super-k-mers and assigns all k-mers of a group to the same memory block, thereby amortizing random accesses over consecutive k-mer queries and improving cache efficiency. We further combine this layout with the findere scheme, which reduces false positives by requiring consistent evidence across overlapping subwords. We provide a theoretical analysis of the construction of Super Bloom filters, showing how minimizer density controls the expected reduction in memory transfers, and derive a practical parameterization strategy linking memory budget, block size, collision overhead, and the number of hash functions to robust false-positive control. Across a broad range of memory budgets and numbers of hash functions, Super Bloom consistently outperforms existing Bloom filter implementations, with several-fold time improvements. 
As a practical validation, we integrated it into a Rust reimplementation of BioBloom Tools, a sequence screening tool that builds filters from reference genomes and classifies reads through k-mer membership queries for applications such as host removal and contamination filtering. This replacement yields substantially faster indexing and querying than both the original C++ implementation and Rust variants based on Bloom filters and blocked Bloom filters. The findere scheme also reduces false positives by several orders of magnitude, with some configurations yielding no observed false positives among 10^9 random queried k-mers. Code is available at https://github.com/EtienneC-K/SuperBloom and https://github.com/Malfoy/SBB
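The block-assignment idea, routing every k-mer that shares a minimizer to the same memory block, can be sketched as below. This is a toy illustration and not the authors' implementation: the class name, the default parameters, the lexicographic minimizer ordering, and the SHA-256 hashing are assumptions made for readability, and the findere scheme is omitted.

```python
import hashlib


def _hash(text, seed):
    """Deterministic 64-bit hash of a string under a given seed."""
    digest = hashlib.sha256(f"{seed}:{text}".encode()).digest()
    return int.from_bytes(digest[:8], "little")


class MinimizerBlockedBloom:
    """Toy blocked Bloom filter whose block is chosen by the k-mer's minimizer."""

    def __init__(self, n_blocks=1024, block_bits=512, n_hashes=4, k=21, m=9):
        self.n_blocks, self.block_bits = n_blocks, block_bits
        self.n_hashes, self.k, self.m = n_hashes, k, m
        self.blocks = [0] * n_blocks  # each block is an integer used as a bitset

    def minimizer(self, kmer):
        # Lexicographically smallest m-mer (an assumed, simple ordering).
        return min(kmer[i:i + self.m] for i in range(self.k - self.m + 1))

    def block_of(self, kmer):
        # K-mers sharing a minimizer (one super-k-mer) land in the same block.
        return _hash(self.minimizer(kmer), 0) % self.n_blocks

    def _bits(self, kmer):
        return [_hash(kmer, s) % self.block_bits for s in range(1, self.n_hashes + 1)]

    def add(self, kmer):
        block = self.block_of(kmer)
        for bit in self._bits(kmer):
            self.blocks[block] |= 1 << bit

    def __contains__(self, kmer):
        block = self.blocks[self.block_of(kmer)]
        return all((block >> bit) & 1 for bit in self._bits(kmer))

    def add_sequence(self, seq):
        for i in range(len(seq) - self.k + 1):
            self.add(seq[i:i + self.k])


# Every 21-mer of this sequence contains the run of nine A's, so all share one minimizer.
seq = "T" * 12 + "A" * 9 + "T" * 12
bloom = MinimizerBlockedBloom()
bloom.add_sequence(seq)
kmers = [seq[i:i + 21] for i in range(len(seq) - 21 + 1)]
```

When consecutive k-mers of a read are queried, they tend to share a minimizer, so the block fetched for one query is reused by the next ones instead of triggering a new random memory access.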
bioinformatics 2026-03-19 v1
STiLE: Automated Tissue Microarray Dearraying for Spatial Transcriptomics
Sinha, H.; Das, A.; Chiu, Y.-C.; Gao, S.-J.; Huang, Y.
Tissue microarrays (TMAs) enable high-throughput spatial transcriptomic profiling of dozens of tissue cores on a single slide. However, existing dearraying methods operate on histological images and do not support the coordinate-based outputs of spatial transcriptomics platforms. Therefore, the task of assigning cells to their respective cores (dearraying) remains a manual bottleneck. We present STiLE, a tool for automated TMA dearraying that operates solely on cell centroid coordinates. By eliminating dependence on image data, STiLE is robust to artifacts such as variable staining quality and uneven illumination. The algorithm combines connectivity-based component detection, density-based clustering (HDBSCAN), component-guided cluster merging, and optional grid-based peak detection. Validation on eleven public TMA samples (50-150 cores, three platforms) achieved ARI > 0.99, while systematic benchmarking on 396 synthetic datasets with realistic artifacts demonstrated consistently robust performance (mean ARI = 0.992). STiLE accepts standard formats (AnnData, CSV) and is platform-agnostic, supporting diverse platforms including Vizgen MERSCOPE, 10x Xenium, and NanoString CosMx. An interactive Streamlit interface enables parameter tuning, visual inspection, and region-based processing for large slides.
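The adjusted Rand index (ARI) reported above scores agreement between two assignments of cells to cores, independent of how the cores are labeled. The following is a minimal standard-library sketch of the standard pair-counting formula; the function name and the toy labels are illustrative and are not taken from STiLE.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(truth, pred):
    """Adjusted Rand index between two labelings of the same items."""
    if len(truth) != len(pred):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(truth), 2)
    if total_pairs == 0:
        return 1.0
    # Pairs of items grouped together in both, in truth only, and in pred only.
    together_both = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    together_truth = sum(comb(c, 2) for c in Counter(truth).values())
    together_pred = sum(comb(c, 2) for c in Counter(pred).values())
    expected = together_truth * together_pred / total_pairs
    maximum = 0.5 * (together_truth + together_pred)
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings put everything in one cluster
    return (together_both - expected) / (maximum - expected)


# Same partition with swapped core labels versus a completely crossed partition.
same_partition = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed_partition = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

A score of 1 means the two partitions are identical up to relabeling, values near 0 correspond to chance-level agreement, and negative values indicate less agreement than chance.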
bioinformatics 2026-03-19 v1
High Resolution Solvated Models Reveal Mechanisms of Allosteric Activation of mTORC1 by RHEB
Ghosh, P.; Maity, A.; Kutti, V. R.; Venkatramani, R.
The mechanistic target of rapamycin complex 1 (mTORC1) is a ~1.2 MDa dimeric assembly comprising mTOR, mLST8, and RAPTOR that integrates nutrient, energy, and stress signals to regulate cell growth. While cryo-EM structures have provided insights into allosteric activation of the complex by the small GTPase RHEB, their limited resolution has constrained a full mechanistic understanding. Here, we combine deep learning-based AlphaFold-3 models with Molecular Dynamics Flexible Fitting and simulations to generate refined, atomistic solvated models of mTORC1±ATP±RHEB. Simulations reveal a global remodelling of the complex by RHEB, which strengthens mTOR-RAPTOR interactions while weakening mTOR-mLST8 contacts. These drive the reorganization of kinase N- and C-lobes into a catalytically competent state in which ATP binding is stabilized enthalpically with improved magnesium ion coordination. Our studies present structural, energetic and dynamic changes induced by RHEB binding which collectively cause allosteric preorganization of mTORC1 for catalysis prior to substrate binding.
bioinformatics 2026-03-19 v1
StrucTTY: An Interactive, Terminal-Native Protein Structure Viewer
Jang, L. S.-e.; Cha, S.; Steinegger, M.
Terminal-based workflows are central to large-scale structural biology, particularly in high-performance computing (HPC) environments and SSH sessions. Yet no existing tool enables real-time, interactive visualization of protein backbone structures directly within a text-only terminal. To address this gap, we present StrucTTY, a fully interactive, terminal-native protein structure viewer. StrucTTY is a single self-contained executable that loads multiple PDB and mmCIF files, normalizes three-dimensional coordinates, and renders protein structures as ASCII graphics. Users can rotate, translate, and zoom in on structures, adjust visualization modes, inspect chain-level features and view secondary structure assignments. The tool supports simultaneous visualization of up to nine protein structures and can directly display structural alignments using Foldseek's output, enabling rapid comparative analysis in headless environments. The source code is available at https://github.com/steineggerlab/StrucTTY
bioinformatics 2026-03-19 v1
ChiMER: Integrating chromatin architecture into splicing graphs for chimeric enhancer RNAs detection
Xiang, Y.; Xiao, X.; Zhou, B.; Xie, L.
Motivation: Enhancer-derived RNAs (eRNAs) and their fusion with protein coding genes represent a crucial yet understudied layer of transcriptional regulation. eRNAs are typically expressed at low levels, which makes fusion events difficult to detect with conventional fusion detection tools. In addition, these tools are not designed to capture fusion transcripts arising from spatial proximity between distal regulatory elements and gene loci. Reads spanning such regions are also frequently filtered as mapping artifacts. As a result, computational approaches for systematically identifying spatially mediated enhancer-exon fusion transcripts remain lacking. Methods: We developed ChiMER, a graph-based framework for detecting ChiMeric Enhancer RNAs from short-read RNA-seq data. ChiMER constructs splice graphs with chromatin contact information to introduce enhancer-exon edges and uses graph alignment to search for potential transcriptional paths. A ranking-based scoring module then prioritizes high-confidence events. Evaluations on simulated and real RNA-seq datasets show that ChiMER achieves higher sensitivity than conventional linear fusion detection methods while maintaining low false-positive rates. Results: Applied to cancer cell line RNA-seq datasets, ChiMER identified multiple enhancer-exon chimeric transcripts, several associated with super-enhancer regions. Multi-omics analysis further shows that fusion transcripts occur in transcriptionally active regulatory environments and frequently coincide with strong R-loop signals, suggesting a potential role of RNA-DNA hybrid structures in facilitating long-range transcriptional joining events.
bioinformatics2026-03-19v1Semantic-Aware Energy-Efficient Operation inSmart Capsule Endoscopy
Zoofaghari, M.; Rahaimifard, A.; Chatterjee, S.; Balasingham, I.Abstract
Goal-oriented semantic communication has recently emerged in wireless sensor-actuator networks, emphasizing the meaning and relevance of information over raw data delivery, thereby enabling resource-efficient telecommunication. This paradigm offers significant benefits for intra-body or implantable sensor-actuator networks, including dramatic reductions in bandwidth requirements, latency, and power consumption. In this paper, we present a patch-based energy-efficient anomaly detection method for smart capsule endoscopy. We propose a deep learning-based algorithm that employs the similarity between features extracted from measured images and a reference (normal) image as the detection metric. The algorithm is evaluated using a clinical dataset of capsule-captured images, combined with a simulated intra-body channel model. The results demonstrate that even with only 60% of the transmission power (relative to a standard link design for QPSK modulation) and 65% of the light intensity, the probability of anomaly detection remains above 85%, and it gradually improves as power and illumination levels increase. This improvement translates into a potential battery life extension of over 43%. The findings highlight the potential of semantic-aware, energy-efficient intra-body devices for more sustainable and effective medical interventions.
bioinformatics2026-03-19v1ABAG-Rank: Improving Model Selection of AlphaFold Antibody-Antigen Complexes by Learning to Rank
Tadiello, M.; Ludaic, M.; Viliuga, V.; Elofsson, A.Abstract
Motivation: AlphaFold has transformed structural biology with an unprecedented accuracy in modeling protein structures and their interactions with biomolecules, with AlphaFold3 (AF3) achieving state-of-the-art performance. However, AF3 and other methods often struggle to accurately predict the structure of protein complexes that lack strong co-evolutionary information, such as antibody-antigen (Ab-Ag) complexes. One of the fundamental issues is that AF3 often generates accurate predictions, but fails to reliably distinguish them from the much larger set of incorrect ones. Results: To address this, we propose ABAG-Rank, a deep neural network that provides an efficient and robust solution for model selection of Ab-Ag interactions from a pool of structural ensembles predicted with AlphaFold. Built on the permutation-invariant DeepSets architecture, ABAG-Rank can process variable-sized ensembles of structural decoys and is directly applicable to prediction settings in which the number of candidates may vary. We train a model on a redundancy-reduced set of all known antibody-antigen complexes and find that simple geometric descriptors, along with confidence scores from AlphaFold, provide rich information about interface quality without requiring intensive physics-based calculations. Our experiments demonstrate that ABAG-Rank significantly outperforms AF3 internal scoring and the ranking performance of existing deep learning baselines. Implementation: Source code can be found at: https://github.com/tadteo/ABAG-Rank
bioinformatics2026-03-19v1GAP-MS: Automated validation of gene predictions using integrated mass spectrometry evidence
Abbas, Q.; Wilhelm, M.; Kuster, B.; Frischman, D.Abstract
Accurate genome annotation is fundamental to modern biology, yet distinguishing authentic protein-coding sequences from prediction artifacts remains challenging, particularly in complex plant genomes where automated methods are error-prone and manual curation is rarely feasible due to prohibitive time and costs. Here, we present GAP-MS (Gene model Assessment using Peptides from Mass Spectrometry), an automated proteogenomic pipeline that leverages mass spectrometry evidence to systematically validate the protein-level accuracy of predicted gene models. Applied across 9 major crop species, GAP-MS consistently improved prediction precision for four widely used gene prediction tools. In addition to filtering erroneous models, the pipeline identified hundreds of previously missing gene models from current standard reference annotations. These peptide-supported loci were further verified by transcriptional evidence, well-supported functional annotations, and high coding-potential scores. Together, these results demonstrate that direct proteomic evidence provides a robust framework for resolving annotation ambiguities, defining high-confidence reference proteomes, and uncovering overlooked protein-coding genes, while facilitating the identification of sequences that may require further investigation. GAP-MS is freely available as a web interface at https://webclu.bio.wzw.tum.de/gapms/.
bioinformatics2026-03-19v1A Cross-Study Multi-Organ Cell Atlas of Macaca fascicularis Informed by Human Foundation Model Annotation: A Resource for Translational Target Assessment
Souza, T. M.; Gamse, J. T.; Moreno, L.; van Rumpt, M.; Nunez-Moreno, G.; Khatri, I.; van Asten, S. D.; Khusial, N. V.; Baltasar-Perez, E.; Adhav, R.; Abdelaal, T.; Wojtuszkiewicz, A.; Calis, J. J. A.; Csala, A.; Dahlman, A.; Fuller, C. L.; Thalhauser, C. J.; Kolder, I. C. R. M.Abstract
Non-human primates (NHPs), particularly Macaca fascicularis (cynomolgus macaque), represent an essential model for preclinical assessment of biologics due to their high genetic and physiological similarity to humans. However, mounting regulatory pressure to reduce NHP use and the lack of a unified, well-annotated single-cell atlas currently limit both target qualification and mechanistic interpretation of toxicity in this species. To address this gap, we assembled and harmonized the largest single-cell transcriptomic atlas of M. fascicularis to date, integrating 30 publicly available studies spanning 57 anatomical regions, 43 organs and 14 physiological systems. We implemented a scalable framework for cross-species cell type annotation by embedding both cynomolgus monkey and human (Tabula Sapiens V2) datasets into a shared reference space using Universal Cell Embeddings (UCE), enabling consistent harmonization of cell identities. In total, 27 organs were annotated using human reference labels, while the remaining sets retained author-provided annotations or labels transferred from other cynomolgus studies with available annotations. The resulting atlas comprises over 2.5 million high-quality cells and demonstrates strong concordance in cell-type-specific expression patterns between cynomolgus monkeys and humans, including tissue-specific markers and targets relevant for biologics development. Through multiple translational use cases, we illustrate how this resource can be applied to assess target expression in tissues affected by concordant human-NHP toxicities, investigate ocular adverse events associated with antibody-drug conjugates (ADCs), and identify species-specific features of immune cell subtypes with known safety implications.
By enabling scalable, high-resolution, cross-species comparisons of gene expression across organs, tissues, and cell states, this atlas supports improved target qualification, more mechanistic interpretation of toxicities, and evidence-based decisions on the relevance and design of NHP studies. Collectively, this work provides a unified cross-species single-cell resource for cynomolgus monkey and a modular computational framework that advances new approach methodologies and contributes to the refinement and reduction of NHP use in preclinical research.
bioinformatics2026-03-19v1DOTSeq enables genome-wide detection of differential ORF usage
Lim, C. S.; Chieng, G. S. W.Abstract
Protein synthesis is regulated by multiple cis-regulatory elements, including small ORFs, yet current differential translation methods assume uniform changes at the gene level. We present DOTSeq, a Differential ORF Translation statistical framework that resolves ORF-level regulation in bulk ribosome profiling (Ribo-seq) experiments and provides ORF-level read summarisation for single-cell Ribo-seq. DOTSeq's core module, Differential ORF Usage (DOU), quantifies changes in an ORF's relative contribution to a gene's translation output, using a beta-binomial GLM with flexible dispersion modelling. DOTSeq also implements ORF-level Differential Translation Efficiency (DTE) using a standard approach to complement DOU. Benchmarks show that DOU achieves superior sensitivity with near-nominal FDR across effect sizes, while DTE and some existing methods excel when technical noise is low. DOTSeq introduces an ORF-aware, quantitative framework for ribosome profiling, delivering end-to-end workflows for ORF annotation, read summarisation, contrast estimation, and visualisation to uncover translational control events at scale.
bioinformatics2026-03-18v3plsMD: A plasmid reconstruction tool from short-read assemblies
Lotfi, M.; Jalal, D.; Sayed, A. A.Abstract
While whole genome sequencing (WGS) has become a cornerstone of antimicrobial resistance (AMR) surveillance, the reconstruction of plasmid sequences from short-read WGS data remains a challenge due to repetitive sequences and assembly fragmentation. Current computational tools for plasmid identification and binning, such as PlasmidFinder, cBAR, PlasmidSPAdes, and MOB-recon, have limitations in reconstructing full plasmid sequences, hindering downstream analyses like phylogenetic studies and AMR gene tracking. To address this gap, we present plsMD, a tool designed for full plasmid reconstruction from short-read assemblies. plsMD integrates Unicycler assemblies with replicon and full plasmid sequence databases (PlasmidFinder, MOB-typer and PLSDB) to guide plasmid reconstruction through a series of contig manipulations. On two datasets, an established benchmark dataset used in previous benchmarking studies and a novel dataset of newly sequenced bacterial isolates, plsMD outperformed existing tools. In the benchmark dataset, it achieved excellent recall, precision, and F1 scores of 91.3%, 95.5%, and 92.0%, respectively. In the novel dataset, it achieved good recall, precision, and F1 scores of 77.6%, 88.9%, and 74.5%, respectively. plsMD supports two usage modalities: single-sample analysis for plasmid reconstruction and gene annotation, and batch-sample analysis for phylogenetic investigations of plasmid transmission. This computational tool represents a significant advancement in plasmid analysis, offering a robust solution for utilizing existing short-read WGS data to study plasmid-mediated AMR spread and evolution.
bioinformatics2026-03-18v2InSTaPath: Integrating Spatial Transcriptomics and histoPathology Images via Multimodal Topic Learning
Xiao, W.; Chen, H.; Osakwe, A.; Zhang, Q.; Li, Y.Abstract
Spatial transcriptomic (ST) technologies enable the measurement of gene expression directly within tissue sections while preserving spatial context. Many ST platforms additionally generate paired histological images alongside spatially resolved transcriptomic profiles. However, most existing computational approaches only incorporate histology images as auxiliary features in representation learning models and typically produce latent embeddings that are difficult to interpret. We present InSTaPath (Integrating Spatial Transcriptomics and histoPathology images), a multimodal topic modeling framework that links transcriptional programs with tissue morphology. InSTaPath converts token-level embeddings extracted from pretrained histology foundation models into discrete image words through vector quantization, enabling histological morphology to be represented in a count-based form analogous to gene expression. InSTaPath then jointly analyzes image-word and gene expression counts to infer shared latent topics that are interpretable through both topic-gene and topic-image-word associations. Across multiple ST datasets, InSTaPath improves spatial domain identification and uncovers biologically meaningful relationships between gene programs and tissue morphology through pathway enrichment and in silico perturbation analyses.
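The vector-quantization step described in this abstract can be illustrated with a minimal sketch. It assumes a fixed codebook and nearest-neighbour assignment under squared Euclidean distance; the abstract does not say how InSTaPath builds its codebook or measures distance, so the function names, the codebook, and the toy embeddings below are all hypothetical.

```python
from collections import Counter


def nearest_code(vec, codebook):
    """Return the index of the codebook vector closest to vec (squared Euclidean)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sqdist(vec, codebook[i]))


def image_word_counts(token_embeddings, codebook):
    """Map each token embedding to its nearest code and count code usage.

    The result has one entry per codebook vector, so a spot's morphology is
    expressed as a count vector, analogous to per-gene expression counts.
    """
    counts = Counter(nearest_code(v, codebook) for v in token_embeddings)
    return [counts.get(i, 0) for i in range(len(codebook))]


# Hypothetical 2-D codebook with two "image words" and three token embeddings.
codebook = [[0.0, 0.0], [1.0, 1.0]]
tokens = [[0.1, 0.0], [0.9, 1.0], [1.2, 0.8]]
print(image_word_counts(tokens, codebook))
```

In practice the embeddings would be high-dimensional and the codebook learned (for example by clustering), but the count-vector output has the same shape either way.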
bioinformatics2026-03-18v1A Permutation-Based Framework for Evaluating Bias in Microbiome Differential Abundance Analysis
Zeng, K.; Fodor, A. A.Abstract
Background: In microbiome research, differential abundance analysis aids in identifying significant differences in microbial taxa across two or more conditions. Statistical approaches used for this purpose include classical tests such as the t-test and Wilcoxon test, as well as methods designed to account for the compositional nature of microbiome data, including ALDEx2, ANCOM-BC2, and metagenomeSeq. In addition, methods originally developed for RNA sequencing data, such as DESeq2 and edgeR, have been frequently applied to microbiome studies. However, the use of these methods has been controversial. One area of concern is whether different modeling frameworks produce accurate p-values when the null hypothesis is true. Results: We evaluated eight methods across six publicly available datasets. Four permutation strategies were applied to generate data under the null hypothesis: shuffling sample names, shuffling counts within samples, shuffling counts within taxa, and fully randomizing the counts table. Methods based on the negative binomial distribution (DESeq2 and edgeR) produced p-values that were consistently smaller than expected under the null hypothesis. In contrast, methods that attempt to correct for compositionality (ALDEx2, ANCOM-BC2, and metagenomeSeq) tended to produce larger-than-expected p-values, even when only sample labels were shuffled, a permutation strategy that does not alter compositional structure. These deviations were dependent on dataset characteristics and permutation strategy, suggesting complex interactions between underlying data structure and algorithm performance. Generating data to follow the expected negative binomial distribution did not eliminate the tendency of DESeq2 and edgeR to exaggerate statistical significance. Although similar patterns were observed in RNA sequencing (RNAseq) datasets, the deviations were less pronounced than in microbiome data. 
In contrast, the classical t-test and Wilcoxon test yielded p-value distributions consistent with theoretical expectations across datasets and permutation strategies. Conclusions: These results indicate that the performance of several widely used differential abundance methods can be problematic under null conditions and may affect biological interpretation. Our findings emphasize the importance of careful method selection and highlight the robustness of simpler statistical approaches for reliable inference.
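The four permutation strategies named in this abstract can be sketched as follows. This is an illustrative reading, not the authors' code: it assumes a counts table stored as a list of rows (taxa) by columns (samples), and the function names are hypothetical.

```python
import random


def shuffle_sample_labels(labels, rng):
    """Permute condition labels across samples; the counts table is untouched."""
    out = list(labels)
    rng.shuffle(out)
    return out


def shuffle_within_samples(table, rng):
    """Permute counts across taxa inside each sample (column totals preserved)."""
    cols = [list(col) for col in zip(*table)]
    for col in cols:
        rng.shuffle(col)
    return [list(row) for row in zip(*cols)]


def shuffle_within_taxa(table, rng):
    """Permute counts across samples inside each taxon (row totals preserved)."""
    out = [list(row) for row in table]
    for row in out:
        rng.shuffle(row)
    return out


def shuffle_entire_table(table, rng):
    """Permute every count across the whole table (only the grand total preserved)."""
    n_cols = len(table[0])
    flat = [v for row in table for v in row]
    rng.shuffle(flat)
    return [flat[i:i + n_cols] for i in range(0, len(flat), n_cols)]


# Toy table: 3 taxa (rows) x 4 samples (columns), two conditions.
rng = random.Random(0)
table = [[5, 0, 12, 7], [30, 22, 18, 25], [1, 0, 0, 3]]
labels = ["case", "case", "control", "control"]
print(shuffle_sample_labels(labels, rng))
print(shuffle_within_samples(table, rng))
```

Under each strategy the null hypothesis holds by construction, so a well-calibrated test applied to many permuted tables should return approximately uniform p-values.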
bioinformatics2026-03-18v1DeSCENT: Deconvolutional Single-Cell RNA-seq Enhances Transcriptome-based Cancer Survival Analysis
Zhao, Y.; You, Z.; Shen, Y.; Chu, J.; Gong, X.; Li, T.; Wang, Z.; Xu, C.; Luo, Z.; He, Y.Abstract
Motivation: Accurate cancer survival prediction requires modeling tumor heterogeneity across both population and cell levels. Most cancer survival analyses use tumor transcriptomes only, since cohorts are usually measured with bulk RNA-seq but are rarely recorded with single-cell RNA-seq. This prevents the direct use of cell-level transcriptomes in cancer survival analysis. Results: To bridge this gap, we propose using bulk RNA-seq deconvolution algorithms to reconstruct each subject's scRNA-seq profile from their bulk data. Then, by combining both scRNA-seq and bulk RNA-seq together with their survival labels (paired to bulk), we perform multimodal transcriptome-based survival analysis. We built this framework as DeSCENT and evaluated it with common survival models on eight TCGA cancer cohorts. Results showed notable and consistent improvements in C-index over bulk-only models or models using cellular information alone. Availability: Our code is available at GitHub: https://github.com/YonghaoZhao722/DeSCENT.
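For readers unfamiliar with the C-index used as the evaluation metric here, the sketch below computes Harrell's concordance index in plain Python. It assumes right-censored data and counts tied risk scores as half-concordant; whether DeSCENT's evaluation uses exactly this variant is not stated in the abstract.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored survival data.

    times  : observed follow-up time per subject
    events : 1 if the event was observed, 0 if censored
    risks  : predicted risk score (higher means shorter expected survival)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i has an observed event
            # strictly before subject j's follow-up time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy cohort: the first subject is censored and so never anchors a pair.
print(concordance_index([1, 2, 3], [0, 1, 1], [0.0, 2.0, 1.0]))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking of subjects by risk.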
bioinformatics2026-03-18v1PREMISE: A Quality-Aware Probabilistic Framework for Pathogen Resolution and Source Assignment in Viral mNGS
Vijendran, S.; Dorman, K.; Anderson, T. K.; Eulenstein, O.Abstract
The circulation of Influenza A viruses (IAVs) in wildlife and livestock presents a significant public health threat due to their zoonotic potential and rapid genomic diversification. Accurate classification of viral subtypes and characterization of within-host diversity are crucial for risk assessment and vaccine development. Although metagenomic sequencing facilitates early detection, prevalent memory-efficient k-mer-based pipelines often discard critical linkage information. This loss of information can result in missed or imprecise pathogen identification, potentially delaying clinical and public health responses. We introduce premise (Pathogen Resolution via Expectation Maximization In Sequencing Experiments), a probabilistic, alignment-based framework implemented in Rust for high-resolution viral genome identification. By integrating advanced string data structures for efficient alignment with a quality-score-aware Expectation-Maximization algorithm, premise accurately identifies source strains, estimates relative abundances, and performs precise read assignments. This framework provides superior source estimation with statistical confidence, enabling the identification of mixed infections, recombination, and IAV-reassortment directly from raw data. Validated against simulated and empirical datasets, premise outperforms state-of-the-art k-mer methods. Ultimately, this framework represents a significant advancement in viral identification, providing a foundation for novel approaches that can automatically flag reassorted viruses or recombination events in the future, thereby improving the detection of emerging pathogens with zoonotic potential. Availability: https://github.com/sriram98v/premise under an MIT license. Contact: sriramv@iastate.edu
bioinformatics2026-03-18v1Interpolating and Extrapolating Node Counts in Colored Compacted de Bruijn Graphs for Pangenome Diversity
Parmigiani, L.; Peterlongo, P.Abstract
A pangenome is a collection of taxonomically related genomes, often from the same species, serving as a representation of their genomic diversity. The study of pangenomes, or pangenomics, aims to quantify and compare this diversity, which has significant relevance in fields such as medicine and biology. Originally conceptualized as sets of genes, pangenomes are now commonly represented as pangenome graphs. These graphs consist of nodes representing genomic sequences and edges connecting consecutive sequences within a genome. Among possible pangenome graphs, a common option is the compacted de Bruijn graph. In our work, we focus on the colored compacted de Bruijn graph, where each node is associated with a set of colors that indicate the genomes traversing it. In response to the evolution of pangenome representation, we introduce a novel method for comparing pangenomes by their node counts, addressing two main challenges: the variability in node counts arising from graphs constructed with different numbers of genomes, and the large influence of rare genomic sequences. We propose an approach for interpolating and extrapolating node counts in colored compacted de Bruijn graphs, adjusting for the number of genomes. To tackle the influence of rare genomic sequences, we apply Hill numbers, a well-established diversity index previously utilized in ecology and metagenomics for similar purposes, to proportionally weight both rare and common nodes according to the frequency of genomes traversing them.
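As a rough illustration of how Hill numbers weight rare and common items, the sketch below computes the Hill number of order q from a list of abundances. Treating the number of genomes traversing each node as that node's abundance is an assumption made here for illustration; the abstract does not give the exact definition or the interpolation and extrapolation estimators the authors use.

```python
import math


def hill_number(abundances, q):
    """Hill number (effective number of items) of order q.

    q = 0 counts every item equally (richness), q = 1 is the exponential of
    Shannon entropy, and q = 2 is the inverse Simpson index. Larger q puts
    more weight on common items.
    """
    total = sum(abundances)
    props = [a / total for a in abundances if a > 0]
    if abs(q - 1.0) < 1e-12:
        # The general formula is undefined at q = 1; use its limit.
        return math.exp(-sum(p * math.log(p) for p in props))
    return sum(p ** q for p in props) ** (1.0 / (1.0 - q))


# Hypothetical graph: three nodes shared by 10 genomes, four seen in one genome.
node_genome_counts = [10, 10, 10, 1, 1, 1, 1]
for q in (0, 1, 2):
    print(q, round(hill_number(node_genome_counts, q), 3))
```

As q grows, the four rare nodes contribute less and the value falls from 7 toward the number of common nodes.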
bioinformatics2026-03-18v1Hierarchical genomic feature annotation with variable-length queries
Alanko, J. N.; Ranallo-Benavidez, T. R.; Barthel, F. P.; Puglisi, S. J.; Marchet, C.Abstract
K-mer-based methods are widely used for sequence classification in metagenomics, pangenomics, and RNA-seq analysis, but existing tools face important limitations: they typically require a fixed k-mer length chosen at index construction time, handle multi-matching k-mers (whose origin in the indexed data is ambiguous) in ad-hoc ways, and some resort to lossy approximations, complicating interpretation. We present HKS, a data structure for exact hierarchical variable-length k-mer annotation. Building on the Spectral Burrows-Wheeler Transform (SBWT), a single HKS index is constructed for a specified maximum query length s, and supports queries at any length k ≤ s. HKS associates each k-mer with exactly one label from a user-defined category hierarchy, where multi-matching k-mers are resolved to their most specific common node in the hierarchy. We formalize a feature assignment framework that partitions indexed k-mers into disjoint sets according to a user-defined category hierarchy. To recover specificity lost to multi-matching and novel k-mers, we introduce a hierarchy-aware smoothing algorithm that makes use of flanking sequence context. We validate the approach by assigning each query k-mer to a specific chromosome across human genome assemblies, including the T2T-CHM13v2.0 reference as a positive control and two diploid genomes of different ancestries (HG002, NA19185). Smoothing increases overall concordance from ~81% to ~97%, with residual errors attributable to known biological phenomena including acrocentric short-arm recombination and subtelomeric duplications. In performance benchmarks against Kraken2, HKS provides comparable query throughput while providing exact, lossless annotation across all k-mer lengths simultaneously from a single index. A prototype implementation is available at https://github.com/jnalanko/HKS.
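Resolving a multi-matching k-mer to its most specific common node amounts to a lowest-common-ancestor query on the category hierarchy. The sketch below shows that idea on a small hypothetical hierarchy stored as a child-to-parent map; it does not reflect how HKS stores or queries its index, and all names are illustrative.

```python
def ancestors(node, parent):
    """Path from node up to the root, inclusive, ordered leaf to root."""
    path = [node]
    while parent.get(node) is not None:
        node = parent[node]
        path.append(node)
    return path


def most_specific_common_node(labels, parent):
    """Deepest hierarchy node that is an ancestor-or-self of every label."""
    labels = list(labels)
    common = ancestors(labels[0], parent)
    for label in labels[1:]:
        seen = set(ancestors(label, parent))
        common = [n for n in common if n in seen]
    # Paths run leaf to root, so the first surviving node is the deepest.
    return common[0]


# Hypothetical hierarchy: chromosomes grouped under broader categories.
parent = {
    "root": None,
    "autosome": "root",
    "sex": "root",
    "chr1": "autosome",
    "chr2": "autosome",
    "chrX": "sex",
}
print(most_specific_common_node(["chr1", "chr2"], parent))
print(most_specific_common_node(["chr1", "chrX"], parent))
```

A k-mer found only on chr1 keeps the label chr1, while one shared by chr1 and chr2 falls back to their common parent, so every k-mer still receives exactly one label.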
bioinformatics2026-03-18v1HARVEST: Unlocking the Dark Bioactivity Data of Pharmaceutical Patents via Agentic AI
Shepard, V.; Musin, A.; Chebykina, K.; Zeninskaya, N. A.; Mistryukova, L.; Avchaciov, K.; Fedichev, P. O.Abstract
Pharmaceutical patents contain vast Structure-Activity Relationship tables documenting protein-ligand binding data that are technically public yet computationally inaccessible, rendering this wealth of data effectively dark - trapped in unstructured archives no existing database has systematically captured. We present HARVEST, a multi-agent large language model pipeline that autonomously extracts structured bioactivity records from USPTO patent archives at $0.11 per document. Applied to 164,877 patents, HARVEST produced 3.36 million activity records, recovering 365,713 unique scaffolds and 1,108 protein targets absent from BindingDB - completing in under a week a task requiring over 55 years of continuous expert labor. Automated extraction achieves 91% agreement with human curators while exhibiting lower unit-conversion error rates. We further introduce H-Bench, a structurally guaranteed held-out benchmark built from this recovered data. Evaluation of the leading open-source model Boltz-2 on H-Bench reveals a two-dimensional generalization gap: performance degrades both on novel chemical scaffolds and on uncharacterized protein targets, exposing fundamental limitations of models trained on existing public repositories.
bioinformatics2026-03-18v1Sex Checking by Zygosity Distributions
Molina-Sedano, O.; Mas Montserrat, D.; Ioannidis, A. G.Abstract
Motivation: In genomic and clinical studies, verifying concordance between self-reported and genotype-inferred sex is a crucial quality control step, since mismatches arising from mislabeling or aneuploidies can bias downstream analyses and affect diagnostic accuracy. Existing approaches typically require substantial auxiliary data, and often require manual threshold tuning. There remains a need for a streamlined, reference-free method that generalizes across different data modalities, including whole-genome, single-sample and array, without requiring additional files or parameter tuning. Results: We present Zigo, a novel ML-based sex-checking method that operates solely on a standard VCF file, designed using X-chromosome genotype class distributions across sexes. Our model was trained on synthetic data incorporating standard demographic models and empirical recombination maps to ensure realistic genetic architecture and population structure. We simulate WGS, array, and single-sample files for broad applicability. Unlike traditional methods, we eliminate manual thresholding by distilling learned discriminative patterns into a single polynomial equation that determines genetic sex directly from normalized genotype counts. We validated Zigo on independent datasets, including 1000 Genomes, UK Biobank, and HGDP. Additional experiments assessed robustness under reduced variant availability through random SNP subsampling and allele-frequency filtering. Across all evaluations, the model achieved state-of-the-art accuracy, high time efficiency, and strong generalization, even with severely limited variant sets.
bioinformatics2026-03-18v1usiGrabber: Automating the curation of proteomics spectra data at scale, making large datasets ready for use in machine learning systems
Auge, G.; Clausen, M.; Ketterer, K.; Schaefer, J.; Schmitt, N.; Altenburg, T.; Hartmaring, Y.; Raetz, H.; Schlaffner, C. N.; Renard, B. Y.Abstract
Motivation: An unprecedented amount of mass spectrometry-based proteomics data is publicly available through repositories such as the PRoteomics IDEntifications Database (PRIDE), and the field is increasingly leveraging machine learning approaches. However, the available data is not ready to be reused in a scalable way beyond the original acquisition purpose. Existing machine learning models commonly rely on a few manually curated datasets that require deep domain expertise and tedious technical work to construct. Importantly, these datasets have not been updated in recent years, so that newly published data remains inaccessible. We present usiGrabber, a scalable framework for assembling large proteomic datasets. usiGrabber is designed around portability and extensibility. It extracts spectra identification data from mzIdentML files, stores additional project-level metadata retrieved through the PRIDE API, indexes raw spectra using Universal Spectrum Identifiers (USIs), and offers download utilities to retrieve spectra data at scale. Results: Within 49 hours, we parsed over 800 million peptide spectrum matches and corresponding USIs from over 1,200 projects. As a proof of concept, we used usiGrabber to construct a phosphorylation-specific training dataset of nearly 11 million spectra in under two days and used it to retrain a binary phosphorylation classifier based on the AHLF model architecture. With a balanced accuracy of 0.78, our model achieves comparable performance to the original model on an independent test set, showing that automated data extraction is an alternative to manual curation of static datasets. Availability: All code is available at https://github.com/usiGrabber/usiGrabber; the data is available at https://zenodo.org/records/18853258.
bioinformatics2026-03-18v1Ryder: Epigenome normalization using a two-tier model and internal reference regions
Cao, Y.; Ge, G.; Zhao, K.Abstract
Motivation: Sequencing-based epigenomic profiling methods are powerful but suffer from technical variability that complicates cross-sample comparisons and can obscure true biological signals. While existing normalization methods using spike-in controls or computational approaches have been proposed, they often rely on assumptions that may not hold across diverse experimental conditions or require additional data types. Results: We present Ryder, a flexible and robust Python package for the normalization and differential analysis of epigenomic data. Ryder introduces a normalization strategy that leverages stable internal reference regions, such as invariant CTCF binding sites, to correct for technical artifacts genome-wide. Our results show that it effectively models and adjusts both background noise and signal intensity, ensuring accurate signal alignment across samples. We demonstrate that Ryder performs robust, genome-wide normalization -- correcting signals in both peak and background regions -- across a range of assays including DNase-seq, CUT&RUN, ATAC-seq, MNase-seq, and ChIP-seq, with or without spike-in controls. By reducing technical noise, we show that Ryder improves the detection of genuine biological changes, such as quantitative reduction of chromatin accessibility at key enhancer elements by depletion of BRG1, a key subunit of the chromatin remodeling BAF complexes. Availability and Implementation: The Ryder source code and documentation are freely available at: https://github.com/YaqiangCao/ryder .
bioinformatics2026-03-18v1Millisecond Prediction of Protein Contact Maps from Amino Acid Sequences
Lin, R.; Ahnert, S. E.Abstract
Protein structure prediction typically outputs static coordinates, often obscuring the underlying physical principles and conformational flexibility. In this work, we present a coarse-grained generative framework to recover the Circuit Topology (CT) of proteins using Generative Flow Matching. We represent protein architecture using highly compressed Secondary Structure Elements (SSEs), reducing the sequence length to roughly 1/13 of the original amino acid sequence. We show that this minimal representation captures the essential "topological fingerprint" required to determine the global fold. By employing a joint-prediction head, our model simultaneously generates contact probabilities and asymmetric topological features, achieving a mean F1 score of 0.822 at the SSE level. Notably, our results demonstrate a counter-intuitive robustness in capturing long-range interactions, suggesting that global topology acts as a stable constraint compared to local residue packing. Furthermore, we show that these coarse-grained predictions can be mapped back to residue-level contact maps with sub-helical precision, yielding a mean alignment error of 2.69 residues. The probabilistic nature of the flow model effectively separates the stable structural signal of the folding core from flexible regions, providing a physically interpretable view of the protein's conformational ensemble. This pipeline is extremely fast, capable of completing a contact map prediction from amino acid sequence in an average of 110 milliseconds on a single GPU. These ultra-fast and accurate predictions provide a valuable tool for identifying conserved protein folding cores, facilitating the exploration of the protein structural genotype-phenotype (GP) map through large-scale sampling of mutants with highly similar folding cores.
bioinformatics2026-03-18v1Developing a Standard Definition for Sequences of Concern
Alexanian, T.; Beal, J.; Bartling, C.; Berlips, J.; Carr, P. A.; Clore, A.; Cozzarini, H.; Diggans, J.; El Moubayed, Y.; Esvelt, K.; Flyangolts, K.; Foner, L.; Fullerton, P. A.; Gemler, B. T.; Jagla, C. A.; Lababidi, R.; Mitchell, T.; Murphy, S. T.; Parker, M. T.; Roehner, N.; Rusch, A.; Talley, K.; Timmerman, T.; Wheeler, N. E.Abstract
Readily available nucleic acid synthesis is both critical for the bioeconomy and an increasingly pressing security concern due to the potential for accidental or deliberate misuse. While biosecurity experts broadly agree that nucleic acid providers should screen orders for potential "sequences of concern," there has previously been no agreed standard for how to define and recognize such sequences. To address this gap, we first organized a test set of 1.1 million sequences from pathogens and toxins on the Australia Group Common Control Lists and their non-controlled relatives, along with model organisms and synthetic constructs. An initial categorization of sequences as to whether or not they were sequences of concern was produced by comparing the results of four biosecurity screening systems for each of these sequences, finding that these systems already agreed on the categorization of more than 80% of sequences. We then refined these results through a science-based stakeholder review process to define a rubric for determining whether a sequence should be flagged as a potential sequence of concern, then applied this rubric to improve the categorization of test sets. The result is a rubric that identifies sequences of concern with respect to human pandemic-potential viruses, key classes of low-risk genes, and controlled toxins. Applying this rubric to the test set collection has reduced the number of test sequences with disputed categorization by 44.3% for controlled viruses and 10.7% across the test set as a whole. Together, these results provide a concrete "sequence of concern" definition that can be used as a foundation for development of biosecurity screening standards and policy.
bioinformatics2026-03-18v1Deciphering context-dependent epigenetic program by network-based prediction of clustered open regulatory elements from single-cell chromatin accessibility
Park, S.; Ma, S.; Lee, W.; Park, S. H.Abstract
Large-scale cis-regulatory domains, such as super-enhancers, are pivotal in orchestrating robust and cell-state-specific transcriptional programs that define cellular identity. However, current single-cell methods do not effectively identify these higher-order structures, obscuring the coordinated, domain-level regulation essential for complex biological processes. Identifying such domain-scale representations at the single-cell level is critical for understanding the regulatory logic underlying development and disease. Here, we introduce enCORE, a computational framework that leverages enhancer-enhancer interaction networks to determine Clustered Open Regulatory Elements (COREs) from single-cell ATAC-sequencing data. Our approach faithfully recapitulated established hematopoietic hierarchies and resolved lineage-specific regulatory programs by recovering canonical lineage-defining regulators, frequent chromatin interactions, and enrichment of fine-mapped autoimmune disease-associated genome-wide association study (GWAS) variants. In colorectal cancer, enCORE successfully captured tumor-associated H3K27ac programs and prioritized cancer-relevant regulators, pointing to USP7 as a potential therapeutic candidate supported by in silico perturbation. Our framework provides a fruitful approach for deciphering context-dependent epigenetic reprogramming.
bioinformatics2026-03-18v1GOTFlow: Learning Directed Population Transitions from Cross-Sectional Biomedical Data with Optimal Transport
Wright, G.; Alzaid, E.; Muter, J.; Brosens, J.; Minhas, F.Abstract
Motivation: Many biological and clinical processes are dynamic, yet most datasets are cross-sectional, capturing populations at discrete states rather than tracking individuals over time. This makes it difficult to quantify how populations change across developmental, physiological, or disease-associated conditions. Existing trajectory and transport-based methods often rely on fixed feature spaces, assumptions tailored to transcriptomic time-course data, or approximately linear progression, limiting their ability to model heterogeneous and unbalanced transitions across diverse biomedical modalities. Flexible methods are needed that can infer directed population-level change from cross-sectional data while retaining biological interpretability. Results: We present GOTFlow, a framework for learning directed population transitions from cross-sectional biomedical data using graph-constrained optimal transport in a learned latent space. GOTFlow integrates representation learning with unbalanced optimal transport to jointly estimate embeddings and transport couplings between biological states. This enables hypothesis-driven modelling of progression structures while accommodating non-linear geometry, branching relationships, and changes in population mass. From the inferred transport plans, GOTFlow derives interpretable summaries of dynamics, including drift vectors quantifying transitions, and feature-level transported changes that highlight molecular drivers of progression. In synthetic data, GOTFlow recovered known transitions with strong agreement between inferred and ground-truth drifts. Across three biological applications, endometrial remodelling, breast cancer risk progression, and prion disease, GOTFlow identified state-to-state transitions and biologically meaningful feature shifts reflecting impaired decidualisation, increasing cancer risk, and neurodegenerative progression. 
These results establish GOTFlow as a general and interpretable framework for analysing directed population dynamics from cross-sectional data. Availability: Code available at: https://github.com/wgrgwrght/GOTFlow Supplementary information: Available online.
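To illustrate the transport step in isolation, the following is a minimal sketch of entropic unbalanced optimal transport between two point clouds, using the standard scaling iterations with KL-relaxed marginals. It is not the GOTFlow implementation: the learned latent space and the graph constraints are omitted, the function names and the parameters `eps` and `rho` are hypothetical, and treating a drift vector as the barycentric displacement of each source point is an assumption made for this example.

```python
import numpy as np


def unbalanced_sinkhorn(a, b, cost, eps=0.5, rho=10.0, n_iter=500):
    """Entropic unbalanced OT with KL-relaxed marginals (scaling iterations).

    a, b  : non-negative masses of source and target points (need not sum to the same total)
    cost  : pairwise cost matrix of shape (len(a), len(b))
    eps   : entropic regularisation strength (assumed value)
    rho   : marginal relaxation strength; larger values enforce the marginals more strictly
    """
    K = np.exp(-cost / eps)
    u = np.ones_like(a)
    v = np.ones_like(b)
    power = rho / (rho + eps)
    for _ in range(n_iter):
        u = (a / (K @ v)) ** power
        v = (b / (K.T @ u)) ** power
    return u[:, None] * K * v[None, :]


def drift_vectors(plan, source, target):
    """Barycentric displacement of each source point under the transport plan."""
    mass = plan.sum(axis=1, keepdims=True)
    return (plan @ target) / mass - source


# Toy data: the target state is shifted along the first axis and carries 50% more mass.
rng = np.random.default_rng(0)
source = rng.normal(size=(40, 2))
target = rng.normal(size=(30, 2)) + np.array([2.0, 0.0])
a = np.full(40, 1.0 / 40)
b = np.full(30, 1.5 / 30)
cost = ((source[:, None, :] - target[None, :, :]) ** 2).sum(axis=-1)

plan = unbalanced_sinkhorn(a, b, cost)
drift = drift_vectors(plan, source, target)
print("mean drift:", drift.mean(axis=0))
```

Because the marginals are only softly enforced, the total transported mass can differ from both input totals, which is what lets this formulation accommodate changes in population mass between states.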
bioinformatics2026-03-18v1New Space-Time Tradeoffs for Subset Rank and k-mer Lookup
Diseth, A. C.; Puglisi, S. J.Abstract
Given a sequence S of subsets of symbols drawn from a fixed alphabet, a subset rank query srank(i, c) asks for the number of subsets before the ith subset that contain the symbol c. It was recently shown (Alanko et al., Proc. SIAM ACDA, 2023) that subset rank queries on the spectral Burrows-Wheeler transform (SBWT) lead to efficient k-mer lookup queries, an essential and widespread task in genomic sequence analysis. In this paper we design faster subset rank data structures that use small space, less than 3 bits per k-mer. Our experiments show that this translates to new Pareto-optimal SBWT-based k-mer lookup structures at the low-memory end of the space-time spectrum.
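To make the query concrete, here is a naive sketch of srank as defined above. It assumes 0-based subset positions (the abstract does not fix an indexing convention) and a DNA alphabet; the class name `NaiveSubsetRank` is hypothetical. It keeps one prefix-count array per symbol, so it uses far more space than the structures the paper targets and serves only to illustrate the query semantics.

```python
from typing import Dict, Iterable, List, Sequence


class NaiveSubsetRank:
    """Prefix-count table answering srank(i, c) in constant time.

    Illustrative only: stores O(n * sigma) integers, nowhere near 3 bits per k-mer.
    """

    def __init__(self, subsets: Sequence[Iterable[str]], alphabet: str = "ACGT") -> None:
        self.n = len(subsets)
        # counts[c][i] = number of subsets among the first i that contain symbol c
        self.counts: Dict[str, List[int]] = {c: [0] * (self.n + 1) for c in alphabet}
        for i, subset in enumerate(subsets):
            members = set(subset)
            for c in alphabet:
                self.counts[c][i + 1] = self.counts[c][i] + (1 if c in members else 0)

    def srank(self, i: int, c: str) -> int:
        """Number of subsets strictly before position i (0-based) that contain c."""
        if not 0 <= i <= self.n:
            raise IndexError("i out of range")
        return self.counts[c][i]


example = NaiveSubsetRank([{"A", "C"}, set(), {"C"}, {"A", "G", "T"}])
print(example.srank(3, "C"))  # subsets 0 and 2 contain C, so this prints 2
```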
bioinformatics2026-03-18v1Homology-based perspective on pangenome graphs
Lisiecka, A.; Kowalewska, A.; Dojer, N.Abstract
Pangenome graphs conveniently represent genetic variation within a population. Several types of such graphs have been proposed, with varying properties and potential applications. Among them, variation graphs (VGs) seem best suited to replace reference genomes in sequencing data processing, while whole genome alignments (WGAs) are particularly practical for comparative genomics applications. For both models, no widely accepted optimization criteria for a graph representing a given set of genomes have been proposed. In the current paper we introduce the concept of homology relation induced by a pangenome graph on the characters of represented genomic sequences and define such relations for both VG and WGA model. Then, we use this concept to propose homology-based metrics for comparing different graphs representing the same genome collection, and to formulate the desired properties of transformations between VG and WGA models. Moreover, we propose several such transformations and examine their properties on pangenome graph data. Finally, we provide implementations of these transformations in a package WGAtools, available at https://github.com/anialisiecka/WGAtools.
bioinformatics2026-03-18v1scTimeBench: A streamlined benchmarking platform for single-cell time-series analysis
Osakwe, A.; Huang, E. H.; Li, Y.Abstract
Temporal modelling of single-cell gene expression is essential for capturing dynamic cellular processes, yet a systematic framework for evaluating time-aware trajectory inference methods has not yet been established. Here, we present a modular and scalable benchmark designed to assess methods across three critical tasks: forecast accuracy (temporal cell alignment) for projecting cells to unseen time points, embedding coherence between original and projected data, and cell-type lineage fidelity. We evaluated nine state-of-the-art methods, which are broadly categorized into 7 forecasting-based and 2 optimal transport (OT)-based methods across eight diverse datasets spanning four species. Our results show that while several methods achieve high forecast accuracy, they often fail to preserve biological signals, both in their latent spaces and in cell lineage reconstruction. Notably, most methods confer low lineage fidelity and often underperform compared to a correlation baseline. We further demonstrate that integrating pseudotime can effectively denoise trajectories by aligning the data snapshots with the intrinsic biological clock in each cell. Finally, to streamline benchmarking for temporal single-cell analysis, we built one of the first self-contained Python packages for the research community: https://github.com/li-lab-mcgill/scTimeBench.
bioinformatics2026-03-18v1drFrankenstein: An Automated Pipeline for the Parameterisation of Non-Canonical Amino Acids
Shrimpton-Phoenix, E.; Notari, E.; Wood, C. W.Abstract
The incorporation of non-canonical amino acids (ncAAs) is a powerful strategy for introducing novel chemical functions into proteins. Molecular dynamics (MD) simulations are essential for understanding the structural and dynamic effects of these modifications, yet the creation of accurate force field parameters for ncAAs remains a significant bottleneck. Current parameterisation methods are often inaccurate or computationally expensive. To address this, we present drFrankenstein, an automated pipeline for generating AMBER force field parameters for ncAAs. drFrankenstein is a robust and accessible tool that streamlines the parameterisation workflow, enabling the routine use of MD simulations to study the behaviour of ncAA-containing proteins.
bioinformatics2026-03-18v1SpeciefAI: Multi-species mRNA-level Antibody Framework Generation using Transformers
Grabarczyk, D.; Kocikowski, M.; Parys, M.; Cohen, S. B.; Alfaro, J. A.Abstract
Motivation: Encoding antibodies (Abs) and nanobodies (Nbs) as mRNA enables in vivo production of therapeutic proteins. However, this approach requires meeting two species-dependent requirements: the mRNA encoding must support efficient expression in the host species, and the encoded protein sequence must resemble the natural Ab repertoire of the recipient species to minimize immunogenicity. These requirements motivate species-conditioned generative models for joint mRNA and protein design. Results: We propose SpeciefAI, a transformer-based model for multi-species Ab and Nb sequence harmonisation by generation of novel Framework Regions (FRs) tailored to input Complementarity-Determining Regions (CDRs). Our model works directly in the mRNA space and learns the correspondence between FRs and CDRs in six species. The model is capable of generating sequences with a highly similar distribution to natural sequences and a mean absolute difference in codon adaptation index (CAI) of 0.013 and 0.033 for humans and dogs, respectively. We show that the generated human sequences are highly human (0.95 T20 score) and canine sequences highly canine (0.95 cT20 score). We furthermore demonstrate that we can generate diverse candidate sequences using our method. Availability and Implementation: Source code is available at https://github.com/Dominko/SpeciefAI. OAS and COGNANO data are publicly available at https://opig.stats.ox.ac.uk/webapps/oas/ and https://cognanous.com/datasets/vhh-corpus (preprocessed versions available upon request). Canine data is available at https://zenodo.org/records/18301526.
bioinformatics2026-03-18v1Mutation-centric Network Construction using Long-Range Interactions
Huseynov, R.; Otlu, B.Abstract
Somatic mutations can alter normal cells and lead to cancer development. Yet distinguishing functional driver mutations from neutral passenger mutations remains a significant challenge. Traditional genomic tools often prioritize linear overlap searches, failing to capture the complex, three-dimensional regulatory environment of the genome. We present a graph-based framework, MutationNetwork, for constructing mutation-centric networks by integrating long-range intrachromosomal interactions with local genomic overlaps. Our method utilizes a unique positive and negative indexing scheme to represent interacting genomic intervals as nodes. By encoding both interactions and overlaps as edges, we enable constant-time retrieval of complex relationship data. By iteratively expanding the graph from a seed mutation, we can quantify a mutation's influence on the genomic landscape and assess its proximity to genes. We applied this framework to a dataset of 560 breast cancer whole-genome sequences, focusing on Triple-Negative Breast Cancer (TNBC) and Luminal A subtypes. Our results demonstrate that the generated mutation embeddings successfully cluster samples according to their biological subtypes, with the highest classification performance achieved at specific ranges. This approach provides a comprehensive view of mutation impact, offering a scalable solution for cancer patient stratification and the prioritization of potential non-coding driver mutations by assessing their network-level impact.
bioinformatics2026-03-18v1PalmaClust: A graph-fusion framework leveraging the Palma ratio for robust ultra-rare cell type detection in scRNA-seq data
Niu, X.; Wang, J.; Wan, S.Abstract
Motivation: Single-cell RNA sequencing (scRNA-seq) is routinely used to build atlases of tissues, resolve developmental trajectories, and characterize disease microenvironments. Yet many biologically and clinically meaningful populations (including transient progenitors, therapy-resistant tumor subclones, and antigen-specific lymphocytes) occur at very low frequencies (<1%) and are easily missed by standard clustering pipelines. Existing approaches often require extensive manual curation, rely on known marker genes, or trade sensitivity for unacceptable false positive rates due to the insensitivity of metrics like the Gini index to heavy-tailed distributions. A scalable, statistically grounded method is needed to sensitively detect rare populations while providing calibrated confidence and interpretable molecular signatures. Results: We present PalmaClust, a graph-fusion clustering framework that repurposes the Palma ratio, a tail-sensitive inequality metric in sociology, to identify marker genes driven by extreme sparsity. PalmaClust constructs and fuses multiple K-Nearest Neighbor (KNN) graphs derived from complementary gene-selection statistics including the Palma ratio, Gini index, and Fano factor. It employs a local refinement strategy that re-prioritizes Palma-ranked genes within parent clusters. Benchmarking across diverse public scRNA-seq datasets confirms that PalmaClust consistently outperforms state-of-the-art baselines, improving rare-class F1 scores by at least 20% (absolute) while maintaining high global clustering stability. Further studies demonstrate that the Palma ratio-derived graph layer is essential for capturing ultra-rare signatures that other views miss. Availability: https://github.com/wan-mlab/PalmaClust.
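The three gene-selection statistics named above can be sketched per gene as follows. This is an illustration under stated assumptions rather than the PalmaClust code: the Palma ratio is taken in its usual form (total of the top 10% of values divided by the total of the bottom 40%), and the pseudocount `eps` that guards against an all-zero bottom 40% is a hypothetical choice, since the abstract does not say how sparse genes are handled.

```python
import numpy as np


def palma_ratio(x, eps=1e-9):
    """Sum of the top 10% of values divided by the sum of the bottom 40%.

    eps is an assumed pseudocount so that genes whose bottom 40% is all zero
    yield a large finite score instead of a division by zero.
    """
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    n_top = max(1, n // 10)
    n_bottom = (2 * n) // 5
    return float(x[n - n_top:].sum() / (x[:n_bottom].sum() + eps))


def gini_index(x):
    """Gini index computed from the sorted values (0 = perfectly even)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    total = x.sum()
    if total == 0:
        return 0.0
    ranks = np.arange(1, n + 1)
    return float(2 * (ranks * x).sum() / (n * total) - (n + 1) / n)


def fano_factor(x):
    """Variance-to-mean ratio."""
    x = np.asarray(x, dtype=float)
    mean = x.mean()
    return float(x.var() / mean) if mean > 0 else 0.0


# Toy genes over 1000 cells: one expressed evenly, one expressed in 0.5% of cells.
even_gene = np.full(1000, 5.0)
rare_gene = np.zeros(1000)
rare_gene[:5] = 5.0

for name, gene in [("even", even_gene), ("rare", rare_gene)]:
    print(name, palma_ratio(gene), gini_index(gene), fano_factor(gene))
```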
bioinformatics2026-03-18v110-minimizers: a promising class of constant-space minimizers
Shur, A.; Tziony, I.; Orenstein, Y.Abstract
Minimizers are sampling schemes which are ubiquitous in almost any high-throughput sequencing analysis. Assuming a fixed alphabet of size σ, a minimizer is defined by two positive integers k, w and a linear order ρ on k-mers. A sequence is processed by a sliding window algorithm that chooses in each window of length w + k - 1 its minimal k-mer with respect to ρ. A key characteristic of a minimizer is its density, which is the expected frequency of chosen k-mers among all k-mers in a random infinite σ-ary sequence. Minimizers of smaller density are preferred as they produce smaller samples, which lead to reduced runtime and memory usage in downstream applications. Recent studies developed methods to generate minimizers with optimal and near-optimal densities, but they require explicitly storing k-mer ranks in Ω(2^k) space. While constant-space minimizers exist, and some of them are proven to be asymptotically optimal, no constant-space minimizer was proven to guarantee lower density compared to a random minimizer in the non-asymptotic regime, and many minimizer schemes suffer from long k-mer key-retrieval times due to complex computation. In this paper, we introduce 10-minimizers, which constitute a class of minimizers with promising properties. First, we prove that for every k > 1 and every w ≥ k - 2, a random 10-minimizer has, in expectation, lower density than a random minimizer. This is the first provable guarantee for a class of minimizers in the non-asymptotic regime. Second, we present spacers, which are particular 10-minimizers combining three desirable properties: they are constant-space, low-density, and have small k-mer key-retrieval time. In terms of density, spacers are competitive with the best known constant-space minimizers; in certain (k, w) regimes they achieve the lowest density among all known (not necessarily constant-space) minimizers.
Notably, we are the first to benchmark constant-space minimizers in the time spent for k-mer key retrieval, which is the most fundamental operation in many minimizers-based methods. Our empirical results show that spacers can retrieve k-mer keys in competitive time (a few seconds per genome-size sequence, which is less than required by random minimizers), for all practical values of k and w. We expect 10-minimizers to improve minimizers-based methods, especially those using large window sizes. We also propose the k-mer key-retrieval benchmark as a standard objective for any new minimizer.
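The sliding-window definition and the density measure described above can be sketched as follows. This is a generic minimizer with a pluggable order, not the 10-minimizer or spacer construction from the paper; the function names are hypothetical, ties are broken by the leftmost position (an assumption), and the random order is stored in a table, so it is deliberately not constant-space.

```python
import random
from typing import Any, Callable, List


def minimizer_positions(seq: str, k: int, w: int, rank: Callable[[str], Any]) -> List[int]:
    """Start positions of the k-mers selected by a (k, w) minimizer.

    Each window covers w consecutive k-mers (w + k - 1 characters); the k-mer with
    the smallest rank is chosen, taking the leftmost one on ties.
    """
    n_kmers = len(seq) - k + 1
    if n_kmers < w:
        return []
    ranks = [rank(seq[i:i + k]) for i in range(n_kmers)]
    selected = set()
    for start in range(n_kmers - w + 1):
        window = range(start, start + w)
        selected.add(min(window, key=lambda i: (ranks[i], i)))
    return sorted(selected)


def density(seq: str, k: int, w: int, rank: Callable[[str], Any]) -> float:
    """Fraction of k-mers in seq that are selected (empirical density)."""
    n_kmers = len(seq) - k + 1
    return len(minimizer_positions(seq, k, w, rank)) / n_kmers


def make_random_order(seed: int = 0) -> Callable[[str], int]:
    """A random order on k-mers, memoised in a table (not constant-space)."""
    rng = random.Random(seed)
    table = {}

    def rank(kmer: str) -> int:
        if kmer not in table:
            table[kmer] = rng.getrandbits(64)
        return table[kmer]

    return rank


seq_rng = random.Random(1)
sequence = "".join(seq_rng.choice("ACGT") for _ in range(5000))
print(density(sequence, k=5, w=8, rank=make_random_order(0)))
```

For a random order, the classical estimate of the density is about 2/(w + 1), which gives a rough baseline against which a lower-density scheme would be compared.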
bioinformatics2026-03-18v1Outperforming the Majority-Rule Consensus Tree Using Fine-Grained Dissimilarity Measures
Takazawa, Y.; Takeda, A.; Hayamizu, M.; Gascuel, O.Abstract
Phylogenetic analyses often require the summarization of multiple trees, e.g., in Bayesian analyses to obtain the centroid of the posterior distribution of trees, or to determine the consensus of a set of bootstrap trees. The majority-rule consensus tree is the most commonly used. It is easy to compute and minimizes the sum of Robinson-Foulds (RF) distances to the input trees. In mathematical terms, the majority-rule consensus tree is the median of the input trees with respect to the RF distance. However, due to the coarse nature of RF distance, which only considers whether two branches induce exactly the same bipartition of the taxa or not, highly unresolved trees can be produced when the phylogenetic signal is low. To overcome this limitation, we propose using median trees with respect to finer-grained dissimilarity measures between trees. These measures include a quartet distance between tree topologies, and transfer distances, which quantify the similarity between bipartitions, in contrast to the 0/1 view of RF. We describe fast heuristic consensus algorithms for transfer-based tree dissimilarities, capable of efficiently processing trees with thousands of taxa. Through evaluations on simulated datasets in both Bayesian and bootstrapping maximum-likelihood frameworks, our results show that our methods improve consensus tree resolution in scenarios with low to moderate phylogenetic signal, while providing better or comparable dissimilarities to the true phylogeny. Applying our methods to Mammal phylogeny and a large HIV dataset of over nine thousand taxa confirms the improvement with real data. These results demonstrate the usefulness of our new consensus tree methods for analyzing the large datasets that are available today. Our software, PhyloCRISP, is available from https://github.com/yukiregista/PhyloCRISP.
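As a point of reference for the baseline being improved upon, the sketch below computes Robinson-Foulds distances and the majority-rule bipartitions for a small set of trees. It covers only the RF side of the comparison; the quartet and transfer distances, and the PhyloCRISP heuristics, are not implemented here. Representing trees as nested tuples of taxon names and all function names are assumptions made for this example.

```python
from collections import Counter
from typing import FrozenSet, List, Set, Tuple, Union

Tree = Union[str, Tuple["Tree", ...]]
Split = FrozenSet[str]


def leaves(tree: Tree) -> FrozenSet[str]:
    """Set of taxon names below a node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree: Tree) -> Set[Split]:
    """Non-trivial bipartitions of the tree, viewed as unrooted.

    Each bipartition is stored as the side that excludes the alphabetically
    first taxon, so the same split always has the same representation.
    """
    taxa = leaves(tree)
    anchor = min(taxa)
    found: Set[Split] = set()

    def visit(node: Tree) -> FrozenSet[str]:
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = taxa - below if anchor in below else below
        if 1 < len(side) < len(taxa) - 1:
            found.add(side)
        return below

    visit(tree)
    return found


def rf_distance(a: Set[Split], b: Set[Split]) -> int:
    """Number of bipartitions present in exactly one of the two trees."""
    return len(a ^ b)


def majority_rule_splits(trees: List[Tree]) -> Set[Split]:
    """Bipartitions that occur in more than half of the input trees."""
    counts = Counter(s for tree in trees for s in splits(tree))
    return {s for s, c in counts.items() if c > len(trees) / 2}


t1 = (("A", "B"), "C", ("D", "E"))
t2 = (("A", "B"), "D", ("C", "E"))
t3 = (("A", "C"), "B", ("D", "E"))
consensus = majority_rule_splits([t1, t2, t3])
print(sorted(sorted(s) for s in consensus))
print([rf_distance(consensus, splits(t)) for t in (t1, t2, t3)])
```

The 0/1 nature of RF is visible in `rf_distance`: two bipartitions that differ by a single taxon count as fully different, which is the coarseness that the transfer-based measures are meant to soften.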
bioinformatics2026-03-18v1SpatialFusion: A lightweight multimodal foundation model for pathway-informed spatial niche mapping
Yates, J.; Shavakhi, M.; Choueiri, T. K.; Van Allen, E.; Uhler, C.Abstract
Foundation models enable knowledge transfer across data modalities and tasks, yet foundation models for spatial biology remain in their early stages, largely centered on encoding single-cell representations in spatial context without fully integrating transcriptomic and morphological information to delineate functional niches. Here we introduce SpatialFusion, a lightweight multimodal foundation model that identifies biologically coherent microenvironments defined by distinct pathway activation patterns rather than spatial proximity alone. SpatialFusion integrates paired histopathology, gene expression, and inferred pathway activity into a unified representation. Compared with two specialist niche-detection methods and four spatial foundation models, SpatialFusion performs competitively and consistently resolves fine-grained spatial niches with unique pathway-level signatures. Applying the model to two Visium HD cohorts uncovered a pre-malignant niche in morphologically normal mucosa adjacent to colorectal tumors and revealed distinct malignant microenvironments in non-small cell lung cancer that were predictive of tumor stage. Overall, SpatialFusion offers a versatile framework for multimodal spatial analysis, enabling the discovery of new morpho-molecular niches with significant biological and clinical relevance.
bioinformatics2026-03-18v1