Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
SwiftTCR: Efficient Computational Docking protocol of TCRpMHC-I Complexes Using Restricted Rotation Matrices
Parizi, F. M.; Aarts, Y. J. M.; Smit, N.; Roran A R, D.; Diepenbroek, D.; Krösschell, W. A.; Thijs, L.; Tepperik, J.; Eerden, S.; Marzella, D. F.; Ramakrishnan, G.; Xue, L. C.
Abstract
The T cell's ability to discern self and non-self depends on its T cell receptor (TCR), which recognizes peptides presented by MHC molecules. Understanding this TCR-peptide-MHC (TCRpMHC) interaction is important for cancer immunotherapy design, tissue transplantation, pathogen identification, and autoimmune disease treatments. Understanding the intricacies of TCR recognition, encapsulated in TCRpMHC structures, remains challenging due to the immense diversity of TCRs (>10^8 per individual), rendering experimental determination and general-purpose computational docking impractical. Addressing this gap, we have developed a rapid integrative modeling protocol leveraging unique docking patterns in TCRpMHC complexes. Built upon PIPER, our pipeline significantly cuts down FFT rotation sets, exploiting the consistent polarized docking angle of TCRs at pMHC. Additionally, our ultra-fast structure superimposition tool, GradPose, accelerates clustering. It models a case in 3-4 minutes on 12 CPUs, showcasing a speedup of up to 25-40 times compared to the ClusPro webserver. On a benchmark set of 38 TCRpMHC class I (TCRpMHC-I) complexes, our protocol outperforms the state-of-the-art docking tools in model quality. This protocol can potentially provide structural information to TCR repertoires targeting specific peptides. Its computational efficiency can also enrich existing pMHC-specific single-cell sequencing TCR data, facilitating the development of structure-based deep learning (DL) algorithms. These insights are essential for understanding T cell recognition and specificity, advancing the development of therapeutic interventions.
bioinformatics 2026-03-10 v3
FASTiso: Fast Algorithm on Search state Tree for subgraph ISOmorphism in graphs of any size and density
Agbeto, W.; Coti, C.; Reinharz, V.
Abstract
Subgraph isomorphism is a fundamental combinatorial problem that involves finding one or more occurrences of a pattern graph within a target graph. It arises in a wide range of application domains, including biology, chemistry, social network analysis, and pattern recognition. Although subgraph isomorphism is NP-complete in the general case, many exact algorithms allow it to be solved in practice on many instances. However, the increasing size and structural diversity of graph datasets continue to pose significant challenges in terms of robustness and scalability. In this article, we propose FASTiso, an exact subgraph isomorphism algorithm that emphasizes a strong consistency between the variable ordering strategy and the pruning rules used during search. This design enables a unified exploitation of structural information throughout the exploration process, leading to improved efficiency and stable performance across heterogeneous graph structures. An extensive experimental evaluation on widely used synthetic and real-world benchmarks shows that FASTiso consistently outperforms reference solvers such as VF3, VF3L, and RI, and achieves competitive performance compared to constraint programming-based approaches (Glasgow, PathLad+), while outperforming them on most datasets. The results further demonstrate that FASTiso remains highly efficient on small instances and scales well to large graphs, while maintaining a lower memory footprint than most evaluated solvers. The peak memory usage is 7.74 GB for FASTiso, 36.19 GB for PathLad+, over 500 GB for Glasgow, 9.62 GB for VF3/VF3L, and 4.31 GB for RI. FASTiso code is available at https://gitlab.info.uqam.ca/cbe/fastiso as a C++ implementation, a Python module, and an integration within an extended version of NetworkX. The implementations support simple graphs and multigraphs, directed or undirected, with labels on nodes, edges, or both.
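To make the problem statement concrete, here is a minimal backtracking sketch of (non-induced) subgraph matching on small undirected graphs. It is not FASTiso and does not use its ordering or pruning rules; the function name `subgraph_matches`, the adjacency-set representation, and the simple degree-based ordering are assumptions chosen for illustration only.

```python
from typing import Dict, Iterator, Set

Graph = Dict[int, Set[int]]  # undirected graph as adjacency sets (illustrative choice)


def subgraph_matches(pattern: Graph, target: Graph) -> Iterator[Dict[int, int]]:
    """Yield injective maps pattern -> target that preserve every pattern edge."""
    # Static variable ordering: try high-degree pattern nodes first,
    # since they are the hardest to place and prune the search earliest.
    order = sorted(pattern, key=lambda u: -len(pattern[u]))

    def extend(i: int, mapping: Dict[int, int], used: Set[int]):
        if i == len(order):
            yield dict(mapping)  # copy, because mapping is mutated while backtracking
            return
        u = order[i]
        for v in target:
            # Pruning 1: injectivity and a degree lower bound.
            if v in used or len(target[v]) < len(pattern[u]):
                continue
            # Pruning 2: already-mapped neighbours of u must land on neighbours of v.
            if all(mapping[w] in target[v] for w in pattern[u] if w in mapping):
                mapping[u] = v
                used.add(v)
                yield from extend(i + 1, mapping, used)
                del mapping[u]
                used.remove(v)

    yield from extend(0, {}, set())


# Usage: a triangle pattern inside a triangle with one pendant node.
triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
target = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
matches = list(subgraph_matches(triangle, target))
```

Each match is reported once per automorphism of the pattern, so the triangle above is found six times. Exact solvers differ mainly in how they choose the node order and how aggressively they prune, which is the design axis the abstract emphasizes.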
bioinformatics 2026-03-10 v3
Non-consensus flanking sequence of hundreds of base pairs around in vivo binding sites: statistical beacons for transcription factor scanning
Faltejskova, K.; Sulc, J.; Vondrasek, J.
Abstract
It was long suspected that for specific DNA binding by a transcription factor, the flanks of the binding motifs can play an important role. By a thorough analysis of the DNA sequence in the broad context (±5000 bp) of in vivo binding sites (as identified in a ChIP-seq or a Cut&Tag experiment), we show that the average GC content is in most cases statistically significantly increased around the binding site in a patch spanning 1000-1500 bp. This increase was observed consistently in experiments targeting the same TF in different cell lines. The surroundings of binding sites of certain TFs like MYC display a directional alteration of dinucleotide frequencies. We attempt to explain these preferences by alteration in DNA shape features as well as by potential cooperation with other TFs. We observed differences in sequence affinity to various potential cooperating TFs between cell lines. Altogether, we propose that the observed feature distortion is indicative of a coarse scanning mechanism that helps TFs find the target binding site.
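As an illustration of the kind of measurement described, the sketch below computes GC fraction in fixed-size windows along a sequence, which is one simple way to profile GC content around a site. The function name `gc_profile` and the window and step parameters are assumptions; the authors' actual statistics and window sizes are not given in the abstract.

```python
from typing import List


def gc_profile(seq: str, window: int, step: int) -> List[float]:
    """GC fraction of each `window`-bp window, advancing `step` bp at a time."""
    seq = seq.upper()
    profile = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        # Count G and C only; ambiguous bases such as N count as non-GC here.
        profile.append(sum(base in "GC" for base in chunk) / window)
    return profile


# Usage: two non-overlapping 4-bp windows over a toy sequence.
toy = gc_profile("GGCCAATT", window=4, step=4)
```

For a real analysis one would extract the flanking region around each binding-site midpoint and average the profiles across sites before testing for enrichment.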
bioinformatics 2026-03-10 v3
GREmLN: A Cellular Graph Structure Aware Transcriptomics Foundation Model
Zhang, M.; Swamy, V.; Cassius, R.; Dupire, L.; Kanatsoulis, C.; Paull, E.; AlQuraishi, M.; Karaletsos, T.; Califano, A.
Abstract
The ever-increasing availability of large-scale single-cell profiles presents an opportunity to develop foundation models to capture cell properties and behavior. However, standard language models such as transformers benefit from sequentially structured data with well-defined absolute or relative positional relationships, while single-cell RNA data have orderless gene features. Molecular-interaction graphs, such as gene regulatory networks (GRN) or protein-protein interaction (PPI) networks, offer graph structure-based models that effectively encode both non-local gene token dependencies and potential causal relationships. We introduce GREmLN (Gene Regulatory Embedding-based Large Neural model), a foundation model that leverages graph signal processing to embed gene token graph structure directly within its attention mechanism, producing biologically informed single-cell-specific gene embeddings. Our model faithfully captures transcriptomics landscapes and achieves superior performance relative to state-of-the-art baselines on cell type annotation, graph structure understanding, and fine-tuned reverse perturbation prediction tasks. It offers a unified and interpretable framework for learning high-capacity foundational representations that capture complex, long-range regulatory dependencies from high-dimensional single-cell transcriptomic data. Moreover, the incorporation of graph-structured inductive biases enables more parameter-efficient architectures and accelerates training convergence.
bioinformatics 2026-03-10 v3
Sassy: Fuzzy Searching DNA Sequences using SIMD
Beeloo, R.; Groot Koerkamp, R.
Abstract
Motivation. Approximate string matching (ASM) is the problem of finding all occurrences of a pattern in a text while allowing up to k errors. Many modern methods use seed-chain-extend, which is fast in practice, but does not guarantee finding all matches with ≤k errors. However, applications such as CRISPR off-target detection require exhaustive results. Methods. We introduce Sassy, a library and tool for ASM of short patterns in long texts. Sassy splits the text into 4 parts that are searched in parallel, and uses bitvectors in the text direction rather than the pattern direction. This has complexity O(k⌈n/W⌉) when searching a random text of length n, where W = 256 is the SIMD width, and provides significant speedups for small k. Separately, we allow matches of the pattern to extend beyond the text for an overhang cost of e.g. 0.5 per character, to find matches near contig or read ends. Results. Sassy is 4x to 15x faster than Edlib for patterns ≤1000 bp, and can search text with a throughput near 2 Gbp/s. Likewise, Sassy is over 100x faster than parasail. We apply Sassy to CRISPR off-target detection by searching 61 guide sequences in a human genome. Sassy is 100x faster than SWOffinder and only slightly slower (for k ≤ 3) than CHOPOFF, for which building its index takes 20 minutes. Sassy also scales well to larger k, unlike CHOPOFF whose index took over 10 hours to build for k = 5. Availability. Sassy is available as library and binary at https://github.com/RagnarGrootKoerkamp/sassy, and archived at swh:1:dir:e884758dce5777a441bc2799dc8824e563c5f97b.
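For readers unfamiliar with the ASM problem itself, the following is a sketch of the textbook dynamic-programming formulation (often attributed to Sellers), which reports every text position where a match with at most k edits ends. It runs in O(mn) time for a pattern of length m and is not Sassy's SIMD bitvector method; the function name `approx_find` and the return format are assumptions for illustration.

```python
from typing import List, Tuple


def approx_find(pattern: str, text: str, k: int) -> List[Tuple[int, int]]:
    """Return (end, distance) for each text prefix length `end` at which an
    occurrence of `pattern` with edit distance <= k ends (end is exclusive)."""
    m = len(pattern)
    # col[i] = best edit distance between pattern[:i] and any suffix of text[:j].
    col = list(range(m + 1))
    hits = []
    for j, c in enumerate(text, start=1):
        prev_diag = col[0]
        col[0] = 0  # a match may start at any text position for free
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == c else 1
            new = min(prev_diag + cost,  # match or substitution
                      col[i] + 1,        # extra character in the text
                      col[i - 1] + 1)    # missing pattern character
            prev_diag = col[i]
            col[i] = new
        if col[m] <= k:
            hits.append((j, col[m]))
    return hits


# Usage: exact search (k = 0) and search allowing one error (k = 1).
exact = approx_find("ACGT", "TTACGTTT", 0)
fuzzy = approx_find("ACGT", "TTACGTTT", 1)
```

Because every end position within the error budget is reported, a single true occurrence typically produces a small cluster of neighbouring hits; exhaustive tools usually post-process these into distinct matches.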
bioinformatics 2026-03-10 v2
Computed atlas of the human GPCR-G protein signaling complexes
Miglionico, P.; Matic, M.; Franchini, L.; Arai, H.; Nemati Fard, L. A.; Arora, C.; Gherghinescu, M.; DeOliveira Rosa, N.; Ryoji, K.; Gutkind, J. S.; Orlandi, C.; Inoue, A.; Raimondi, F.
Abstract
Experimental mapping of G protein-coupled receptors (GPCR)-G protein signaling coupling has illuminated hundreds of receptors, yet the coupling specificity of a large fraction of this large receptor family remains unknown, thereby preventing the development of new GPCR-targeting therapies. Here, we used AlphaFold3 (AF3) to predict the 3D structures of the human GPCRome in complex with heterotrimeric G proteins. We used experimental GPCR-G protein binding data to show that AF3 predictions significantly discriminate between positive and negative binders, and used 3D structural features to train a machine learning (ML) algorithm to predict coupling potency. Interpretation of the ML model helped discriminate universal features governing the strength of G protein coupling from those determining binding specificity. We computationally illuminated the coupling preferences of 180 non-olfactory GPCRs (non-OR) with previously unreported transduction mechanisms and experimentally validate the predicted couplings for multiple previously uncharacterized GPCRs, including QRFPR, GPR50, GPR37, GPR37L1 and GPRC5A. Our predictions established that Gi/o is the most prevalent coupling among non-OR GPCRs, which is often co-occurring with Gq/11 and, to a lesser extent, G12/13 signaling. Gs coupling is less common and restricted to specific clusters within the non-OR GPCRome phylogeny, likely due to stricter structural requirements for its binding. We also computed G protein complexes for over 400 ORs, establishing Gs as the most prevalent coupling. ORs are predicted to bind to Gs with a simpler interface compared to non-ORs, ultimately leading to energetically less stable complexes. Additionally, we predict recurrent bindings to Gq/11 and Gi/o proteins for ORs, suggesting potentially novel ORs signaling mechanisms. We exploited the GPCRome coupling atlas to interpret healthy and cancer expression data, revealing the coupling of most GPCR-G protein co-expressed pairs. 
This analysis highlights a richer coupling repertoire in healthy tissues compared to cancer, likely reflecting the high signaling requirements of specialized normal cell functions, which are lost in most cancer cells due to their de-differentiated state or under cancer selection processes. In summary, this study provides the first computational 3D atlas of the human GPCR-G protein transductome, thereby illuminating the signaling mechanisms of neglected GPCR classes and providing the basis for interpreting omics datasets from a myriad of pathological conditions, thus enabling the development of novel precision therapeutics.
bioinformatics 2026-03-10 v1
STAR Suite: Integrating transcriptomics through AI software engineering in the NIH MorPhiC consortium
Hung, L.-H.; Yeung, K. Y.
Abstract
To accommodate rapid methodological turnover, bioinformatics pipelines typically consist of discrete binaries linked via scripts. While flexible, this architecture relies on intermediate files, sacrificing performance and treating complex codebases as static silos. For example, the STAR aligner (Dobin et al., 2013), the standard engine for transcriptomics, uses an external script for adapter trimming, necessitating the decompression and re-compression of large files. These limitations presented scalability problems for uniform processing of data in the NIH MorPhiC consortium. We present our solution, STAR Suite, a human-engineered and AI-implemented modernization that integrates functionality directly into the C++ source. In just four months, a single developer added over 92,000 lines to the original 28,000-line codebase to produce four unified modules (STAR-core, STAR-Flex, STAR-Perturb, and STAR-SLAM) that can be installed as a pre-compiled binary without introducing any new dependencies. This work demonstrates a new paradigm for the rapid evolution of high-performance bioinformatics software.
bioinformatics 2026-03-10 v1
AQuA2-Cloud: a web platform for fluorescence bioimaging activity analysis
Bright, M.; Mi, X.; Duarte, D.; Carey, E.; Lyu, B.; Wang, Y.; Nimmerjahn, A.; Yu, G.
Abstract
Advanced biological imaging analysis platforms such as Activity Quantification and Analysis (AQuA2) enable accurate spatiotemporal activity analysis across diverse cell populations within many species. These tools are increasingly important for investigating cellular signaling dynamics and behavior. However, despite advances in the accuracy and species capability of AQuA2, it remains computationally demanding for analysis of long time-series datasets and requires all users to maintain a MATLAB license, which may limit accessibility and large-scale deployment. To address these limitations, we have designed and made available AQuA2-Cloud, a portable software stack and web platform developed as an improvement and further evolution of AQuA2. This container-deployable system permits multi-user cloud-based high accuracy activity quantification with intuitive workflows, export of analysis data and project files, and comparable processing times. The platform offers integrated features such as in-browser analysis control interfaces, asynchronous program state control, multiple users and user management, support for unreliable connections, file uploading and downloading via web browsers and File Transfer Protocol, and centralized organization of analysis output. AQuA2-Cloud constitutes a cloud-native solution for laboratories or research groups seeking to centralize analysis of spatiotemporal biological imaging datasets while reducing software installation and licensing barriers for end users. The platform enables researchers with minimal technical expertise to perform advanced bioimaging analysis through standard web browsers while maintaining the analytical capabilities of AQuA2. AQuA2-Cloud source code, deployment procedures, and documentation are freely available at (https://github.com/yu-lab-vt/AQuA2-Cloud).
bioinformatics 2026-03-10 v1
Automatic Generation of Model Sequences for Complex Regions in Assembly Graphs
Antipov, D.; Chen, Y.; Sollitto, M.; Phillippy, A. M.; Formenti, G.; Koren, S.
Abstract
Recent developments in genome sequencing and assembly technologies have enabled the automated assembly of vertebrate chromosomes from telomere to telomere. However, for some long, highly similar repeats, genome assemblers may lack sufficient information to unambiguously resolve the sequence, leaving tangles in the assembly graph and gaps in the final assembly. In recently published genomes, such gaps are often closed by manual graph curation, a process that is labor-intensive, error-prone, and sometimes infeasible. This can leave important genomic repeats, such as recently duplicated genes, misassembled or excluded from the final assembly. Here we present the Trivial Tangle Traverser (TTT) algorithm that finds optimized resolutions of assembly graph tangles. TTT uses depth of coverage and read-to-graph alignment information in a two-stage process to identify evidence-based traversals that are consistent with the underlying data. First, sequence multiplicities are estimated through mixed-integer linear programming, after which an Eulerian path is found in the derived multigraph and optimized through a gradient-descent-like approach. We evaluate TTT traversals on the HG002 human reference genome and demonstrate its use to characterize a previously unassembled amplified gene array in the zebra finch genome. Availability: TTT is available at https://github.com/marbl/TTT
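The second stage mentions finding an Eulerian path in a multigraph. As a simplified illustration, the sketch below applies Hierholzer's algorithm to a directed multigraph in which an edge repeated r times stands for a segment traversed r times. This is a generic textbook routine, not TTT's implementation; real assembly graphs are bidirected, and the function name `eulerian_path` and the edge-list input are assumptions.

```python
from collections import defaultdict
from typing import Hashable, List, Tuple


def eulerian_path(edges: List[Tuple[Hashable, Hashable]], start: Hashable) -> List[Hashable]:
    """Hierholzer's algorithm: a walk from `start` using every edge exactly once."""
    out = defaultdict(list)
    for u, v in edges:
        out[u].append(v)  # repeated edges encode multiplicity
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if out[u]:
            stack.append(out[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())    # dead end: node is final in the walk
    path.reverse()
    if len(path) != len(edges) + 1:
        raise ValueError("no Eulerian path from this start node")
    return path


# Usage: a two-copy repeat loop B -> C -> B between unique flanks A and D.
toy_edges = [("A", "B"), ("B", "C"), ("B", "C"), ("C", "B"), ("C", "B"), ("B", "D")]
walk = eulerian_path(toy_edges, "A")
```

In a tangle with several Eulerian paths, this routine returns just one of them; choosing among the alternatives using read alignments is the optimization step the abstract describes.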
bioinformatics 2026-03-10 v1
Measuring Amorphous Motion: Application of Optical Flow to Three-Dimensional Fluorescence Microscopy Images
Lee, R. M.; Eisenman, L. R.; Hobson, C.; Aaron, J. S.; Chew, T.-L.
Abstract
Motion is an essential component of any living system. It is rich with information, but it is often challenging to quantitatively extract biologically informative results from the motion apparent in microscopy images. This challenge is exacerbated by the wide variety in biological movement, which often takes the form of difficult-to-segment amorphous structures undergoing complex motion. An image processing technique known as optical flow can capture motion at each pixel in an image, thus bypassing the need for object segmentation or a priori definition of motion types. This makes it a powerful tool for quantitative assessment of biological systems from the protein to organism scale. However, despite its flexibility and strengths for analyzing fluorescence microscopy images, its adoption in the bioimaging community has been limited by the availability of easy-to-use tools and guidance in results interpretation. Here we describe an optical flow tool, OpticalFlow3D, that can be run in Python or MATLAB and is compatible with three-dimensional microscopy images. Using biological examples across length scales, we illustrate how OpticalFlow3D can enable new biological insight.
bioinformatics 2026-03-10 v1
In silico analysis of the human titin protein (Immunoglobulin-like, fibronectin type III, and Protein kinase domains) as a potential forensic marker for postmortem interval (PMI) estimation
Gill, M. U.; Akhtar, M.
Abstract
Due to the limited availability of reliable and well-validated molecular markers, the determination of postmortem interval (PMI) is still a major obstacle for forensic investigators trying to resolve a case. The largest human protein, titin, has never undergone domain-level examination of its postmortem degradation patterns. This study focused on the in silico analysis of the Immunoglobulin-like, fibronectin type III, and Protein kinase domains of human titin to assess their potential utility in PMI estimation. Sequence data for the studied domains were retrieved from UniProt; 2D and 3D models were generated by PSIPRED and SWISS-MODEL, respectively, followed by analysis of physicochemical properties, solubility assessment, and structural comparison. This study revealed that the Ig-like domain is the most stable, followed by the Fn-III and Protein kinase domains. These findings indicate that titin domains may degrade at different rates in the postmortem period. This study introduces the first computational basis for considering titin as a multi-domain candidate biomarker for PMI estimation, laying the groundwork for upcoming laboratory validation.
bioinformatics 2026-03-10 v1
SpatioCAD: Context-aware graph diffusion model for pinpointing spatially variable genes in heterogeneous tissues
Zhang, S.; Wen, H.; Shen, Q.
Abstract
Spatial transcriptomics enables comprehensive characterization of tissue architecture, and the identification of spatially variable genes (SVGs) is a critical step for defining region-specific molecular markers and uncovering spatially regulated mechanisms across diverse biological contexts. However, most existing methods for SVG detection overlook cell density variations, a major confounding factor in complex tissues such as tumors, where heterogeneous cellularity frequently introduces false-positive calls. Here we present SpatioCAD, a computational framework that explicitly decouples genuine spatial expression patterns from confounding effects driven by cellularity. SpatioCAD leverages and extends a graph diffusion model to simulate expression propagation under cell-density-aware conditions, thereby ensuring unbiased detection of SVGs across all expression levels. Systematic evaluations on simulated datasets demonstrate its superior statistical power and specificity. Applied to breast cancer, lung cancer, and glioma datasets, SpatioCAD identifies functionally diverse SVGs, including low-abundance transcripts with established roles in tumor progression, while also recapitulating biologically meaningful tissue architecture features.
bioinformatics 2026-03-10 v1
MOZAIC: Compound Growth via In Silico Reactions and Global Optimization using Conformational Space Annealing
Yoo, J.; Shin, W.-H.
Abstract
Motivation: Fragment-based drug discovery (FBDD) is an efficient strategy that leverages small molecular fragments to explore broader chemical space by combining them. Advances in computational methods have enabled the calculation of molecular properties and docking scores, thereby accelerating the development of algorithm- and AI-based approaches in FBDD. However, it should be noted that certain methods do not provide synthetic pathways to obtain the proposed compounds. Consequently, these molecules might not be synthesized easily. Results: In light of these developments, we propose MOZAIC, a novel framework that explores chemical space using a reaction-based fragment growing and Conformational Space Annealing, a powerful global optimization algorithm. Our results show that MOZAIC effectively produces chemically diverse molecules with balanced improvements in lead-like properties, including QED, synthetic accessibility, and binding affinity. Furthermore, its flexible objective function allows fine-tuning for specific design goals, such as enhancing solubility with binding affinity. These capabilities position MOZAIC as a valuable platform for advancing fragment-to-lead and lead optimization efforts in drug discovery. Availability and implementation: MOZAIC is available at https://github.com/kucm-lsbi/MOZAIC/. Supplementary Information: Supplementary data are available at Bioinformatics online.
bioinformatics 2026-03-10 v1
InversePep: Diffusion-Driven Structure-Based Inverse Folding for Functional Peptides
Chilakamarri, S. K.; Kasturi, S. R.; Yerrabandla, S. P. R.; Gogte, S.; Kondaparthi, V.
Abstract
Designing functional peptides with specific structural and biochemical properties is critical for applications in protein engineering and therapeutic discovery. However, most peptide design approaches rely on evolutionary or local sequence optimization methods, which are limited when adapting to peptides' shorter length, high conformational flexibility, and unique physicochemical constraints. While recent structure-based inverse folding models have shown success for proteins, these models often underperform on peptides because sequence recovery alone is not a reliable indicator of stability or foldability in short, flexible backbones. To address this challenge, we introduce InversePep, a generative diffusion model for structure-based peptide inverse folding. InversePep learns the conditional distribution of sequences that can adopt a given backbone conformation, enabling direct generation of peptides tailored to target structural geometries. The framework integrates a geometric graph neural network to encode 3D backbone features with a Transformer-based sequence refinement module that iteratively denoises candidate sequences during diffusion. Trained on a diverse set of peptide backbones sourced from Propedia and SATPdb, InversePep effectively captures structural and biochemical diversity across peptide families. In systematic evaluations on held-out peptide structures and the PepBDB benchmark, InversePep achieves a mean TM score of 0.38 and a median of 0.28, outperforming ProteinMPNN and ESM-IF1 in generating geometry-consistent peptide sequences. In-silico folding analyses confirm that sampled peptides reliably adopt the target conformations. These results highlight InversePep's capability for designing structurally stable and sequence-diverse peptides, demonstrating its potential in antimicrobial peptide discovery, peptide therapeutics, and molecular probe development.
bioinformatics 2026-03-10 v1
Neurotox: Deep learning decodes conserved hallmarks of neurotoxicity across venomous species
Bedraoui, A.; El Mejjad, S.; Enezari, S.; El Hajji, F. Z.; Galan, J.; El Fatimy, R.; Daouda, T.
Abstract
Neurotoxic proteins drive the most pathophysiological effects of animal envenomation, yet it remains unclear whether neurotoxicity is encoded directly within the protein sequence or emerges from higher-order structure binding and interactions with their target receptor. To address this, we developed Neurotox, a sequence-based deep learning framework trained on 200,000 curated protein sequences, with balanced representation of neurotoxic and non-neurotoxic proteins across taxa, achieving high classification accuracy (96%) with strong performance on unseen toxin families. We further introduced a controlled sequence-representation warping strategy that selectively perturbs neurotoxicity-relevant features, inducing a systematic loss of predicted neurotoxicity while preserving primary sequence identity. Structural modeling using AlphaFold 3 showed that, for most top-ranked toxins, warping disrupted beta sheet architectures and reduced interface precision, with all top candidates showing highly significant effects (p < 0.0001). These structural changes were accompanied by recurrent cysteine-centered substitutions, implicating disruption of conserved disulfide frameworks. A single exception retained its global fold (RMSD = 2.8 Angstrom), maintained low PAE, high pLDDT, and high pDockQ scores, and preserved a close arginine-glutamate contact (Arg53-Glu75), yet still exhibited marked attenuation of predicted neurotoxicity. These results suggest that neurotoxicity arises from distributed sequence features that shape secondary-structure organization and receptor interaction, rather than from isolated contact residues alone.
bioinformatics 2026-03-10 v1
NeuroNarrator: A Generalist EEG-to-Text Foundation Model for Clinical Interpretation via Spectro-Spatial Grounding and Temporal State-Space Reasoning
Wang, G.; Yang, S.; Ding, J.-e.; Zhu, H.; Liu, F.
Abstract
Electroencephalography (EEG) provides a non-invasive window into neural dynamics at high temporal resolution and plays a pivotal role in clinical neuroscience research. Despite this potential, prevailing computational approaches to EEG analysis remain largely confined to task-specific classification objectives or coarse-grained pattern recognition, offering limited support for clinically meaningful interpretation. To address these limitations, we introduce NeuroNarrator, the first generalist EEG-to-text foundation model designed to translate electrophysiological segments into precise clinical narratives. A cornerstone of this framework is the curation of NeuroCorpus-160K, the first harmonized large-scale resource pairing over 160,000 EEG segments with structured, clinically grounded natural-language descriptions. Our architecture first aligns temporal EEG waveforms with spatial topographic maps via a rigorous contrastive objective, establishing spectro-spatially grounded representations. Building on this grounding, we condition a Large Language Model through a state-space-inspired formulation that integrates historical temporal and spectral context to support coherent clinical narrative generation. This approach establishes a principled bridge between continuous signal dynamics and discrete clinical language, enabling interpretable narrative generation that facilitates expert interpretation and supports clinical reporting workflows. Extensive evaluations across diverse benchmarks and zero-shot transfer tasks highlight NeuroNarrator's capacity to integrate temporal, spectral, and spatial dynamics, positioning it as a foundational framework for time-frequency-aware, open-ended clinical interpretation of electrophysiological data.
bioinformatics 2026-03-10 v1
Counting strands in outer membrane beta-barrels
Lim, S.; Nimmagadda, T.; Khamis, A.; Montezano, D.; Feehan, R.; Copeland, M.; Slusky, J.
Abstract
Beta-barrel structures are critical components of bacterial outer membranes, where they facilitate transport, cell signaling, antibiotic resistance, and structural integrity. A key feature of beta-barrels is their strand count, which influences pore diameter, binding site locations, and functional properties. However, because of breaks in strands and the presence of strands in periplasmic domains and plug domains, manual counting is inefficient and current algorithms do not accurately determine barrel strand count. To address this, we refined our previous beta-barrel structural assessment tool, PolarBearal, to improve strand number identification in large-scale datasets. To enhance the accuracy of barrel strand number labeling, our updated algorithm integrates three structural criteria, namely inter-residue vector angles, hydrogen-bonding distances, and strand connectivity. Using this algorithm, we labeled strand numbers for 571,760 predicted outer membrane beta-barrel structures obtained from the AlphaFold2 database. Our algorithm has 97% accuracy in strand number assignments, and the resulting dataset facilitates assessment of the homogeneity of strand counts for different types of outer membrane proteins. The strand labeling also provides insights on beta-barrel strand distribution and evolutionary patterns, supporting further research in protein structure prediction and design.
bioinformatics 2026-03-10 v1
From General-Purpose to Disease-Specific Features: Aligning LLM Embeddings on a Disease-Specific Biomedical Knowledge Graph for Drug Repurposing
Pandey, S.; Talo, M.; Siderovski, D. P.; Sumien, N.; Bozdag, S.
Abstract
Identifying new therapeutic uses for existing drugs is a major challenge in biomedicine, especially for complex neurodegenerative conditions such as Alzheimer disease and related dementias (ADRD), where treatment options remain limited and relevant data are often sparse, heterogeneous, and difficult to integrate. Although general-purpose Large Language Model (LLM) embeddings encode rich semantic information, they often lack the task-specific biomedical context needed for inference tasks such as computational drug repurposing. We introduce Contextualizing LLM Embeddings via Attention-based gRaph learning (CLEAR), a multimodal representation-fusion framework that aligns LLM embeddings with the topological structure of a context-specific Knowledge Graph (KG). Across five benchmark datasets, CLEAR achieved state-of-the-art results, improving predictive performance (e.g., F1 score) by up to 30% over prior methods. We further applied CLEAR to identify FDA-approved drugs with potential for repurposing for ADRD, including Parkinson disease-related dementia and Lewy Body dementia. CLEAR learned a biologically coherent embedding space, prioritized leading ADRD drug candidates, and accurately summarized known therapeutic relationships for FDA-approved Alzheimer disease drugs. Overall, CLEAR shows that grounding biomedical LLM embeddings with context-specific KG signals can improve drug repurposing in data-sparse, real-world settings.
bioinformatics 2026-03-10 v1
Exploring per-base quality scores as a surrogate marker of cell-free DNA fragmentome
Volkov, H. H. V.; Raitses-Gurevich, M.; Grad, M.; Shlayem, R.; Leibowitz, D.; Rubinek, T.; Golan, T.; Shomron, N.
Abstract
Per-base quality scores are widely treated as technical metadata in next-generation sequencing. Here, we show that in rigorously controlled whole-genome sequencing of cell-free DNA, quality profiles may encode fragmentomic signals that enable classification of cancer samples against matched controls. Analyzing four independent batches (23 cancer samples: pancreatic and breast; 22 matched controls) sequenced in a within-lane regime and further normalized per flow-cell tile to reduce technical confounders, we demonstrate through unsupervised analysis that boundary-enriched dynamics captured in these quality scores consistently separate cancer from control samples. A leave-one-batch-out classifier trained on quality-derived scores achieved a pooled area under the curve of 0.81. Furthermore, we show that the quality-derived metric correlates with short-fragment enrichment and tumor-associated 5′-end motifs, performing comparably to established, motif-based orthogonal methods. These results provide initial evidence that quality scores could serve as a low-cost, alignment-free biomarker for cfDNA-based cancer detection.
bioinformatics 2026-03-10 v1
NanoVI: a Bayesian variational inference Nextflow pipeline for species-level taxonomic classification from full-length 16S rRNA Nanopore reads
Curiqueo, C.; Fuentes-Santander, F.; Ugalde, J. A.
Abstract
NanoVI is a Nextflow pipeline for species-level taxonomic classification of full-length 16S rRNA Oxford Nanopore reads. Unlike existing tools that rely on expectation-maximization (EM) algorithms, NanoVI employs Bayesian variational inference with a Dirichlet-Categorical conjugate model, yielding abundance estimates accompanied by Bayesian 95% credible intervals that quantify estimation uncertainty, along with automatic shrinkage that suppresses spurious taxa. NanoVI integrates the Genome Taxonomy Database (GTDB) r226, providing phylogenetically consistent taxonomy while maintaining compatibility with NCBI-style databases. Benchmarked against a standardized mock community, NanoVI achieves species-detection metrics comparable to Emu, with 25 to 62% lower execution time and fewer false-positive assignments. Validation on 20 clinical vaginal microbiome samples confirms reproducibility against previously published Emu-based analyses.
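To make the Dirichlet-Categorical idea concrete, the sketch below covers only the simplest case, in which every read is assigned to exactly one taxon, so the posterior over abundances is available in closed form as Dirichlet(prior + counts). It is not NanoVI's algorithm (which uses variational inference to handle ambiguous read assignments); the function name, the symmetric prior of 0.5, and the read counts are hypothetical. Credible intervals are estimated by Monte Carlo sampling with the standard library.

```python
import random

def dirichlet_posterior_summary(counts, prior=0.5, n_draws=4000, seed=0):
    """Posterior mean and 95% credible interval per taxon.

    counts: dict taxon -> number of reads assigned (assumed unambiguous).
    prior:  symmetric Dirichlet concentration (hypothetical default).
    """
    rng = random.Random(seed)
    taxa = list(counts)
    # Conjugacy: Dirichlet(prior) prior + categorical counts -> Dirichlet(prior + counts).
    alpha = [prior + counts[t] for t in taxa]
    total = sum(alpha)
    draws = {t: [] for t in taxa}
    for _ in range(n_draws):
        # A Dirichlet draw is a vector of independent Gamma(alpha_i, 1) draws, normalized.
        gammas = [rng.gammavariate(a, 1.0) for a in alpha]
        norm = sum(gammas)
        for t, g in zip(taxa, gammas):
            draws[t].append(g / norm)
    summary = {}
    for t, a in zip(taxa, alpha):
        xs = sorted(draws[t])
        lo = xs[int(0.025 * n_draws)]       # empirical 2.5th percentile
        hi = xs[int(0.975 * n_draws) - 1]   # empirical 97.5th percentile
        summary[t] = {"mean": a / total, "ci95": (lo, hi)}
    return summary

# Invented read counts for three placeholder taxa.
reads = {"taxon_A": 820, "taxon_B": 175, "taxon_C": 5}
result = dirichlet_posterior_summary(reads)
```

In this toy setting a taxon supported by only a handful of reads gets a small posterior mean with a comparatively wide interval, which is the kind of uncertainty statement a point estimate alone does not give.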
bioinformatics 2026-03-10 v1
DIA-NN EasyFilter workflow for the fast and user-friendly critical assessment and visualization of DIA-NN proteomics analysis outcome
Moagi, M. G.; Thatiana, F. F.; Kristof, E. K.; Arda, A. G.; Arianti, R.; Horvatovich, P.; Csosz, E.
Abstract
Liquid chromatography-tandem mass spectrometry (LC-MS/MS)-based proteomics, particularly data-independent acquisition (DIA), has become widely adopted in One Health approaches to biological and clinical research for quantitative protein characterization. Among the many computational tools available, DIA-NN has demonstrated superior performance; however, the primary output of the current versions is provided as a compact, compressed PARQUET file that can be difficult to interrogate without programming expertise. To address this limitation, we developed DIA-NN EasyFilter (DEF), a fast, user-friendly, KNIME-based workflow for comprehensive protein filtering and visualization. DEF integrates chromatographic peak-based filtering, curated contaminant libraries, and quantity-quality assessment, along with interactive modules for qualitative and quantitative data exploration. The workflow is optimized for efficient execution within the KNIME local desktop environment and is designed to support end-users in improving accuracy and interpretability without requiring coding skills. We provide a detailed description of how to run DEF and demonstrate its utility and robustness using published large-scale proteomics datasets, showing high comparability across studies regardless of instrument platform or dataset size.
bioinformatics 2026-03-10 v1
Improving Causal Gene Identification Using Large Language Models
Ofer, D.; Kaufman, H.
Abstract
Genome-Wide Association Studies (GWAS) have successfully identified numerous loci associated with complex traits and diseases, yet pinpointing causal genes remains a significant challenge. The reliance on simple proximity-based heuristics is often insufficient due to linkage disequilibrium, gene interactions, and regulatory effects. Recent advancements in Large Language Models (LLMs) have demonstrated potential in automating causal gene identification, but their effectiveness remains limited by knowledge representation and retrieval mechanisms. This study builds on previous research by evaluating LLMs for causal gene identification, with a focus on enhancing performance through Retrieval-Augmented Generation (RAG) and the incorporation of genomic distance information. We replicate prior results using a smaller model, Qwen2.5, and assess its predictive accuracy using a benchmark dataset from Open Targets. Performance improved when integrating RAG-based literature retrieval (F1 = 0.795) and gene distance information (F1 = 0.806). However, the combined approach yielded diminishing returns, suggesting interactions between these enhancements. Error analysis revealed that genomic distance features improved predictions by reinforcing established heuristics, while RAG enhanced domain knowledge but occasionally led to semantic biases. These findings highlight the potential of hybrid approaches in leveraging both structured genomic features and unstructured textual data.
bioinformatics 2026-03-10 v1
FAMUS: A Few-Shot Learning Framework for Large-Scale Protein Annotation
Shur, G.; Burstein, D.
Abstract
Predicting gene function is a pivotal and challenging step in genomic and metagenomic data analysis. Current automatic annotation tools typically rely on the single most similar sequence from the query database. The sparsity of data per annotation makes it challenging to confidently assign gene function for underrepresented genes. Here, we present a contrastive learning framework for functional annotation. FAMUS (Functional Annotation Method Using Supervised contrastive learning) compares query sequences to profile Hidden Markov Model databases and transforms the similarity scores into a condensed vector space that minimizes the distance of proteins from the same family. The similarity scores of a query to all profiles are used for its representation instead of considering only the top-ranking hit. In a protein family assignment task, FAMUS outperformed KEGG's native KofamScan for KEGG Orthology annotation and InterPro's InterProScan for PANTHER family annotation. We thus created four protein annotation models using protein families from the KEGG Orthology, InterPro family, OrthoDB, and EggNOG databases. All four models are available as a conda package and via our user-friendly web server, allowing users to annotate large-scale datasets. FAMUS is the first comprehensive and modular annotation framework based on contrastive learning. It supports both pre-defined and user-specific databases for tailored annotation, and can be easily integrated into any genomic and metagenomic analysis pipeline to facilitate accurate, large-scale functional annotation.
bioinformatics 2026-03-10 v1
Ensemble-based genomic prediction for maize flowering-time improves prediction accuracy and reveals novel insights into trait genetic variation
Tomura, S.; Powell, O. M.; Wilkinson, M. J.; Cooper, M.
Abstract
While various genomic prediction models have been evaluated for their potential to accelerate genetic gain for multiple traits, no individual genomic prediction model has outperformed all others across all applications. As an alternative approach, ensembles of multiple individual genomic prediction models can be applied to utilise the complementary strengths of individual prediction models and offset the prediction errors of each. We used the EasiGP (Ensemble AnalySis with Interpretable Genomic Prediction) pipeline to investigate the performance of an ensemble approach, targeting flowering-time traits measured in two maize nested association mapping datasets. For both datasets, the ensemble-based prediction approach achieved higher prediction accuracy and lower prediction error across the flowering-time traits compared to each individual model. Multiple genomic regions known to contain key flowering-time related genes were repeatedly included as features across individual genomic prediction models, indicating the models successfully captured SNPs as features that are associated with genomic regions known to contain flowering-time genes. Although repeatability was high for some genomic regions, estimated marker effects varied across many genomic regions, suggesting that the models might also have captured different aspects of the genetic variation underlying the traits. The ensemble combination of the diverse views likely contributed to the improvement of prediction performance by the ensemble-based approach over the individual prediction models. Ensemble-based prediction can be applied to overcome limitations observed in the continuous exploration for the best individual genomic prediction models that can consistently achieve the highest prediction performance, thereby potentially contributing to improved prediction accuracy for applications in crop breeding.
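The intuition that an ensemble can offset the errors of its individual members can be sketched with a toy example. The code below assumes an equal-weight average of model predictions, which may differ from how EasiGP actually combines models, and the numbers are invented and deliberately chosen so that the two models err in opposite directions.

```python
from statistics import mean

def ensemble_predict(model_predictions):
    """Average per-individual predictions across models (equal weights, an assumption)."""
    return [mean(values) for values in zip(*model_predictions)]

def mse(predicted, observed):
    """Mean squared prediction error."""
    return mean((p - o) ** 2 for p, o in zip(predicted, observed))

# Invented toy values (for example, days to flowering for four lines).
observed = [70.0, 75.0, 80.0, 85.0]
model_a = [72.0, 73.0, 83.0, 84.0]  # errors: +2, -2, +3, -1
model_b = [68.0, 78.0, 78.0, 87.0]  # errors: -2, +3, -2, +2

ensemble = ensemble_predict([model_a, model_b])
errors = {
    "model_a": mse(model_a, observed),
    "model_b": mse(model_b, observed),
    "ensemble": mse(ensemble, observed),
}
```

Averaging helps here only because the two models' errors are negatively correlated; if the models made the same mistakes, the ensemble would inherit them. That dependence on model diversity is consistent with the abstract's remark that the models appear to capture different aspects of the underlying genetic variation.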
bioinformatics 2026-03-09 v4
ChatSpatial: Schema-Enforced Agentic Orchestration for Reproducible and Cross-Platform Spatial Transcriptomics
Yang, C.; Zhang, X.; Chen, J.
Abstract
Spatial transcriptomics has transformed our ability to study tissue architecture at molecular resolution, yet analyzing these data demands navigating dozens of computational methods across incompatible Python and R ecosystems, forcing researchers to devote more effort to making tools function than to pursuing biological questions. We present ChatSpatial, a platform in which the LLM selects from pre-validated tool schemas rather than generating free-form code, with domain expertise embedded in schema descriptions for context-aware parameter inference. Built on the Model Context Protocol (MCP), ChatSpatial unifies 60+ methods across 15 analytical categories into a single conversational workflow spanning Python and R ecosystems. Replication of two published studies (recovering subclonal heterogeneity in ovarian cancer and tumor microenvironment organization in oral squamous cell carcinoma) and validation across seven LLM platforms demonstrate that schema-enforced orchestration yields near-deterministic reproducibility at the workflow level for multi-step spatial analyses. Beyond replication, exploratory cross-method analyses illustrate practical triangulation across independent analytical frameworks.
bioinformatics 2026-03-09 v2
singIST: an R/Bioconductor library and Quarto dashboard for automated single-cell comparative transcriptomics analysis of disease models and humans
Moruno Cuenca, A.; Picart-Armada, S.; Perera-Lluna, A.; Fernandez-Albert, F.
Abstract
Preclinical disease models often diverge from human pathophysiology at single-cell resolution, complicating model selection and limiting translational value. We present singIST, an R/Bioconductor package for quantitative and explainable comparison of disease model scRNA-seq data against a human reference. For each superpathway, singIST fits an adaptive sparse multi-block PLS-DA model on human pseudobulk expression, integrates one-to-one orthology and cell type mapping, and translates model fold changes into the human expression space to compute signed recapitulation at the superpathway, cell type, and gene levels. To streamline interpretation and reporting, we provide singIST Visualizer, a companion Quarto/Shiny dashboard that loads singIST outputs and offers interactive exploration with export-ready plots and tables, avoiding manual figure coding across many superpathways and models. We illustrate an end-to-end workflow on an oxazolone mouse model against a human atopic dermatitis reference for two representative pathways: Dendritic Cells in regulating Th1/Th2 Development [BIOCARTA] and Cytokine-cytokine receptor interaction [KEGG]. singIST is distributed under the MIT License via Bioconductor, and the Visualizer is available on GitHub.
bioinformatics 2026-03-09 v1
MapMyCells: High-performance mapping of unlabeled cell-by-gene data to reference brain taxonomies
Daniel, S. F.; Lee, C.; Mollenkopf, T.; Lee, M.; Arbuckle, J.; Fiabane, E.; Gabitto, M. I.; Johansen, N.; Kapen, I.; Kraft, A. W.; Lai, J.; Li, S. Y.; McGinty, R.; Miller, J. A.; Welch-Moosman, S.; Otto, S.; Sawyer, L.; Shepard, N.; Thompson, C. L.; Tjaernberg, A.; Waters, J.; Zhen, X.; Macosko, E.; Lein, E.; Ng, L.; Zeng, H.; Mufti, S.; Yao, Z.; Hawrylycz, M.
Abstract
Single-cell mapping methods convert raw, heterogeneous single-cell datasets into interpretable and comparable representations of biological identity. As reference cell-type taxonomies mature, mapping new datasets to shared references has become a central strategy for enabling cross-study integration, reproducible annotation, and cumulative biological knowledge. Here we present MapMyCells, an open-source framework designed to align diverse single-cell omics datasets to hierarchical reference taxonomies with minimal preprocessing. MapMyCells provides out-of-the-box support for an expanding set of high-quality brain cell-type references generated by the Allen Institute for Brain Science, the BRAIN Initiative, and the Seattle Alzheimer's Disease Brain Cell Atlas, including whole-brain mouse and human atlases, aging and Alzheimer's disease cohorts, and a cross-species consensus taxonomy initially focused on the basal ganglia. MapMyCells enables efficient mapping of hundreds of thousands of cells on standard workstations without specialized hardware, providing a deterministic, scalable, and modality-agnostic approach that is robust across species and molecular assays. The framework produces interpretable confidence metrics and quantitative summaries of mapping performance, allowing users to evaluate assignment precision and accuracy. We demonstrate the mapping of unlabeled transcriptomic, epigenomic, and spatial datasets to reference taxonomies and describe a general workflow for preparing arbitrary hierarchical taxonomies for reference-based mapping. As the ecosystem of single-cell reference atlases expands, MapMyCells offers a practical and reproducible solution for community-scale cell-type annotation and cross-dataset integration, supporting the development of unified and extensible brain cell atlases.
bioinformatics 2026-03-09 v1
Benchmarking tissue- and cell type-of-origin deconvolution in cell-free transcriptomics
Ioannou, A.; Friman, E. T.; Daub, C. O.; Bickmore, W. A.; Biddie, S. C.
Abstract
Plasma cell-free RNA (cfRNA) reflects tissue- and cell-type-specific activity across pathological states and is a promising biomarker for organ injury and disease. Computational deconvolution methods are widely used to infer organ and cell-type contributions to cfRNA profiles. However, most were originally developed for single-tissue bulk transcriptomes and their performance in body-wide cfRNA settings, where any tissue or cell type can contribute, remains poorly characterised. Here, we present a systematic benchmarking of tissue- and cell type-of-origin deconvolution for plasma cfRNA that considers both methodological and reference-related sources of variability under realistic cfRNA simulation settings. We evaluated seven commonly used deconvolution methods across distinct algorithmic classes and multi-organ reference configurations derived from bulk and single-cell atlases. We assessed performance using simulation frameworks that model multi-organ mixtures, technical noise, and transcript degradation. We further examined deconvolution methods across multiple previously published clinical cfRNA cohorts spanning diverse disease contexts. Across both tissue- and cell-type-level analyses, deconvolution performance was strongly influenced by both method choice and reference parameters. Tissue-of-origin inference was comparatively robust across simulated and clinical datasets, recovering disease-associated organ signals and concordance with biochemical markers. In contrast, cell type-of-origin inference showed greater variability and reduced consistency across analytical settings, leading to divergent interpretations in both simulations and published clinical cfRNA cohorts. Together, these findings demonstrate that methodological and reference-related variability are major sources of uncertainty in cfRNA deconvolution, with tissue-level inference being more robust than cell-type-level inference. 
Our benchmarking framework provides guidance for reference selection and comparative interpretation in cfRNA deconvolution.
bioinformatics 2026-03-09 v1
Fractal: Towards FAIR bioimage analysis at scale with OME-Zarr-native workflows
Lüthi, J.; Cerrone, L.; Comparin, T.; Hess, M.; Hornbachner, R.; Tschan, A.; Glasner de Medeiros, G. Q.; Repina, N. A.; Cantoni, L. K.; Steffen, F. D.; Bourquin, J.-P.; Liberali, P.; Pelkmans, L.; Uhlmann, V.
Abstract
The rapid growth in microscopy data volume, dimensionality, and diversity urgently calls for scalable and reproducible analysis frameworks. While efforts on the open OME-Zarr format have helped standardize the storage of large microscopy datasets, solutions for standardized processing are still lacking. Here, we introduce two complementary contributions to address this gap: 1) the Fractal task specification, defining OME-Zarr processing units that can interoperate across computational environments and workflow engines, and 2) the Fractal platform, using this specification to enable scalable and modular OME-Zarr-native analysis workflows. We demonstrate their use across diverse biological research data, including terabyte-scale multiplexed, volumetric, and time-lapse imaging. In a clinical setting, we show that Fractal workflows achieve near-identical quantification of millions of cells across independent deployments, demonstrating the reproducibility required for translational applications. With its growing community of contributors, the Fractal ecosystem provides a foundation for FAIR microscopy image analysis relying on open file formats.
bioinformatics 2026-03-09 v1
Quantum Hamiltonian Learning using Time-Resolved Measurement Data and its Application to Gene Regulatory Network Inference
Sohail, M. A.; Sudharshan, R. R.; Pradhan, S. S.; Rao, A.
Abstract
We present a new Hamiltonian-learning framework based on time-resolved measurement data from a fixed local IC-POVM and its application to inferring gene regulatory networks. We introduce the quantum Hamiltonian-based gene-expression model (QHGM), in which gene interactions are encoded as a parameterized Hamiltonian that governs gene expression evolution over pseudotime. We derive finite-sample recovery guarantees and establish upper bounds on the number of time and measurement samples required for accurate parameter estimation with high probability, scaling polynomially with system size. To recover the QHGM parameters, we develop a scalable variational learning algorithm based on empirical risk minimization. Our method recovers network structure efficiently on synthetic benchmarks and reveals novel, biologically plausible regulatory connections in Glioblastoma single-cell RNA sequencing data, highlighting its potential in cancer research. This framework opens new directions for applying quantum-like modeling to biological systems beyond the limits of classical inference.
bioinformatics 2026-03-09 v1
Defining mutational signatures of lung cancer-associated carcinogens through in vitro exposure of human airway epithelial cells
Gurevich, N. Q.; Chiu, D. J.; Yajima, M.; Huggins, J.; Mazzilli, S. A.; Campbell, J. D.
Abstract
While distinct environmental exposures imprint unique mutational signatures on cancer genomes, the specific causal patterns for many known carcinogens remain uncharacterized in relevant human tissues. To address this gap, we developed a novel, physiologically relevant system that uses a combination of airway epithelial cells and whole genome sequencing to characterize mutational patterns induced by genotoxic carcinogens associated with lung cancer. After validating the platform's accuracy by successfully recapturing the known signature for Benzo(a)pyrene (BaP), we used this system to gain detailed insights into the types of mutations that occur with exposure to N-nitrosotris-(2-chloroethyl) urea (NTCU) and 4-(methylnitrosamino)-1-(3-pyridyl)-1-butanone (NNK), genotoxic compounds that induce lung squamous cell carcinoma and lung adenocarcinoma in mouse models, respectively. Cells exposed to NTCU had significantly more somatic SNVs compared to control samples. An average of 82.3% of mutations in NTCU samples were attributed to a novel mutational signature distinct from those in the COSMIC database but highly correlated with recent in vivo mouse models. In contrast, NNK exposure did not demonstrate a distinct mutational pattern above background at both high and low concentrations. Ultimately, this in vitro system provides a robust platform to define causal links between environmental exposures and mutational patterns in lung cancer mutagenesis.
bioinformatics 2026-03-09 v1
Application of large language models to the annotation of cell lines and mouse strains in genomics data
Rogic, S.; Mancarci, B. O.; Xu, B.; Xiao, A.; Yang, C.; Pavlidis, P.
Abstract
Accurate, consistent and comprehensive metadata are essential for the reuse of functional genomics data deposited in repositories such as the Gene Expression Omnibus (GEO); however, achieving this often requires careful manual curation that is time-consuming, costly and prone to errors. In this paper, we evaluate the performance of Large Language Models (LLMs), specifically OpenAI's GPT-4o, as an assistive tool for entity-to-ontology annotation of two commonly encountered descriptors in transcriptomic experiments: mouse strains and cell lines. Using over 9,000 manually curated experiments from the Gemma database and over 5,000 associated journal articles, we assess the model's ability to identify relevant free-text entries and map them to appropriate ontology terms. Using zero-shot prompting and retrieval-augmented generation (RAG) to incorporate domain-specific ontology knowledge, GPT-4o correctly annotated 77% of mouse strain and 59% of cell line experiments, and uncovered manual curation errors in Gemma for over 200 experiments. GPT-4o substantially outperformed a regular expression-based string-matching method, which correctly annotated only 6% of mouse strain experiments due to low precision. Model errors often arose from typographical mistakes or inconsistent naming in the GEO record or publication, and resembled those made by human curators. Along with annotations, our approach requests that the model output supporting context and quotes from the sources. These were typically accurate and enabled rapid curator verification. These findings suggest that LLMs are not ready to fully replace manual curators, but can already effectively support them. A human-in-the-loop workflow, in which LLM annotations are provided to human curators for validation, may improve the efficiency and quality of large-scale biomedical metadata curation.
bioinformatics 2026-03-07 v1
Telomere-to-telomere assembly and haplotype analysis of tetraploid Dendrobium officinale illuminate Orchidaceae polyploid evolution and mycorrhizal symbiosis genes
Chen, E.; Xu, J.; Liu, Y.; Li, Y.; Feng, Y.; Lu, Q.; Ding, X.; Niu, Z.; Qin, S.; Niu, S.; Luo, Y.; Guo, X.; Luo, X.
Abstract
Dendrobium officinale is a typical epiphytic orchid. We report the telomere-to-telomere (T2T) genome assembly for D. officinale, representing the first T2T reference genome within the Orchidaceae family. The assembly is anchored to 19 chromosomes and contains 38 complete telomeres and 15 characterized centromeres. We further generated haplotype-resolved assemblies of the autotetraploid genome, identifying 12,761 sets of tetra-allelic genes. Based on synonymous substitution analysis, we inferred that the autotetraploidization event occurred approximately 0.86 million years ago. A systematic analysis of the SWEET gene family across the genus Dendrobium revealed that the gene family size is shaped primarily by epiphytic types and environmental factors. In D. officinale from Langshan, eight SWEET genes were specifically expressed in roots, suggesting they may play specialized roles in the root mycorrhizal system, potentially contributing to D. officinale's ability to recruit and maintain fungal partners. Together, these resources provide valuable foundations for studies of orchid evolution, functional genomics, and molecular breeding.
bioinformatics 2026-03-07 v1
Hybrid molecular dynamics-deep generative framework expands apo RNA ensembles toward cryptic ligand-binding conformations: application to HIV-1 TAR
Kurisaki, I.; Hamada, M.
Abstract
RNA plays vital roles in diverse biological processes and represents an attractive class of therapeutic targets. In particular, cryptic ligand-binding sites, which are absent in apo structures but formed upon conformational rearrangement, offer high specificity for RNA-ligand recognition, yet remain rare among experimentally resolved RNA-ligand complex structures and difficult to predict in silico. RNA-targeted structure-based drug design (SBDD) is therefore limited by challenges in sampling cryptic states. Here, we apply Molearn, a hybrid molecular dynamics-deep generative framework, to expand apo RNA conformational ensembles toward cryptic states. Focusing on the paradigmatic HIV-1 TAR-MV2003 system, Molearn was trained exclusively on apo TAR conformations and used to generate a diverse ensemble of TAR structures. Candidate cryptic MV2003-binding conformations were subsequently identified using post-generation geometric analyses. Docking simulations of these conformations with MV2003 yielded binding poses with RNA-ligand interaction scores comparable to those of NMR-derived complexes. Notably, this work provides the first demonstration that a generative model can access cryptic RNA conformations that are ligand-binding competent and have not been recovered in prior molecular dynamics and deep generative modeling studies. Finally, we discuss current limitations in scalability and systematic detection, including application to the Internal Ribosome Entry Site, and outline future directions toward RNA-targeted SBDD.
bioinformatics 2026-03-06 v8
scExploreR: a flexible platform for democratized analysis of multimodal single-cell data by non-programmers
Showers, W.; Desai, J.; Gipson, S. R.; Engel, K. L.; Smith, C.; Jordan, C. T.; Gillen, A. E.
Abstract
Single-cell sequencing has revolutionized biomedical research by uncovering cellular heterogeneity in disease mechanisms, with significant potential for advancing personalized medicine. However, participation in single-cell data analysis is limited by the programming experience required to access data. Several existing browsers allow the interrogation of single-cell data through a point-and-click interface accessible to non-programmers, but many of these browsers are limited in the depth of analysis that can be performed, or the flexibility of input data formats accepted. Thus, programming experience is still required for comprehensive data analysis. We developed scExploreR to address these limitations and extend the range of analysis tasks that can be performed by non-programmers. scExploreR is implemented as a packaged R Shiny app that can be run locally or easily deployed for multiple users on a server. scExploreR offers extensive customization options for plots, allowing users to generate publication quality figures. Leveraging our SCUBA package, scExploreR seamlessly handles multimodal data, providing identical plotting capabilities regardless of input format. By empowering researchers to directly explore and analyze single-cell data, scExploreR bridges communication gaps between biological and computational scientists, streamlining insight generation.
bioinformatics 2026-03-06 v2
Multi-omics Profiling Identifies Molecular and Cellular Signatures of Regular Physical Activity in Human Peripheral Blood
Song, X.; Lv, J.; Ge, S.; Xu, S.; Wu, Y.; Zheng, Y.; Zhou, W.; Li, L.; Zhang, Y.; Zhang, J.; Gao, P.; Chen, Z.; Yin, P.; Yin, J.; Liu, C.
Abstract
Regular physical activity is well established to protect against metabolic disorders and bolster immunity; yet, the underlying molecular and cellular mechanisms remain incompletely understood. We integrated plasma metabolomics and lipidomics with single-cell transcriptomic and chromatin accessibility profiles to decode the systemic impact of physical activity on human immunity and metabolism. Our data reveal that regular physical activity is linked to a coordinated metabolic signature marked by enhanced fatty acid oxidation and antioxidant defense. In circulating immune cells, regularly active individuals exhibited synchronous enhancement at both the chromatin accessibility and transcriptional levels for antigen presentation-related genes in antigen-presenting cells (APCs), particularly in classical monocytes, naive B cells, and switched memory B cells. Meanwhile, cytotoxic programs in CD8+ cytotoxic T and mature NK cells showed epigenetic pre-activation of effector function regulators. Intercellular communication analysis further revealed that regular exercise enhanced MHC-I/II signaling between APCs and T cells and suppressed inflammatory signaling networks. Together, these findings elucidate molecular mechanisms underlying the health benefits of regular exercise and offer a theoretical basis for enhancing public health and preventing chronic diseases.
bioinformatics 2026-03-06 v2
Getting over ANOVA: Estimation graphics for multi-group comparisons
Lu, Z.; Anns, J.; Mai, Y.; Zhang, R.; Lian, K.; Lee, N. M.; Hashir, S.; Wang Zhouyu, L.; Li, Y.; Gonzalez, A. R. C.; Ho, J.; Choi, H.; Xu, S.; Claridge-Chang, A.
Abstract
Data analysis in experimental science mainly relies on null-hypothesis significance testing, despite its well-known limitations. A powerful alternative is estimation statistics, which focuses on effect-size quantification. However, current estimation tools struggle with the complex, multi-group comparisons common in biological research. Here we introduce DABEST 2.0, an estimation framework for complex experimental designs, including shared-control, repeated-measures, two-way factorial experiments, and meta-analysis of replicates.
bioinformatics 2026-03-06 v2
Mutation Reporter: Protein-Level Identification of Single and Compound Mutations in NGS Data
Teodoro, M.; das Chagas, R. V.; Yunes, J. A.; Migita, N. A.; Meidanis, J.
Abstract
Next-generation sequencing (NGS) has accelerated precision medicine by enabling simultaneous analysis of multiple genes and detection of low-frequency mutations. However, few open-source tools allow non-specialized users to transparently adjust quality parameters during mutation analysis. Mutation Reporter was developed to identify both single and compound amino acid alterations directly from raw FASTQ files produced by sequencing RNA or exon sequences. The software provides full parameter control, including alignment e-value, minimum read length, minimum read depth, and minimum variant allele frequency (VAF). The software is freely available under a GNU license on GitHub (https://github.com/meidanis-lab/mutation-reporter) and as a Code Ocean capsule (https://codeocean.com/capsule/0121109/tree).
bioinformatics 2026-03-06 v2
Rigidity-Aware Geometric Pretraining for Protein Design and Conformational Ensembles
Ni, Z.; Li, Y.; Qiu, Z.; Schölkopf, B.; Guo, H.; Liu, W.; Liu, S.
Abstract
Generative models have recently advanced de novo protein design by learning the statistical regularities of natural structures. However, current approaches face three key limitations: (1) Existing methods cannot jointly learn protein geometry and design tasks, where pretraining can be a solution; (2) Current pretraining methods mostly rely on local, non-rigid atomic representations for property prediction downstream tasks, limiting global geometric understanding for protein generation tasks; and (3) Existing approaches have yet to effectively model the rich dynamic and conformational information of protein structures. To overcome these issues, we introduce RigidSSL (Rigidity-Aware Self-Supervised Learning), a geometric pretraining framework that front-loads geometry learning prior to generative finetuning. Phase I (RigidSSL-Perturb) learns geometric priors from 432K structures from the AlphaFold Protein Structure Database with simulated perturbations. Phase II (RigidSSL-MD) refines these representations on 1.3K molecular dynamics trajectories to capture physically realistic transitions. Underpinning both phases is a bi-directional, rigidity-aware flow matching objective that jointly optimizes translational and rotational dynamics to maximize mutual information between conformations. Empirically, RigidSSL variants improve designability by up to 43% while enhancing novelty and diversity in unconditional generation. Furthermore, RigidSSL-Perturb improves the success rate by 5.8% in zero-shot motif scaffolding and RigidSSL-MD captures more biophysically realistic conformational ensembles in G protein-coupled receptor modeling. The code is available at: https://github.com/ZhanghanNi/RigidSSL.git.
bioinformatics2026-03-06v2Automated Cell Type Annotation with Reference Cluster Mapping
Galanti, V.; Shi, L.; Azizi, E.; Liu, Y.; Blumberg, A. J.Abstract
Single-cell RNA sequencing has transformed the field of cellular biology by providing unprecedented insights into cellular heterogeneity. However, characterizing scRNA-seq datasets remains a significant challenge. We introduce RefCM, a novel computational method that combines optimal transport and integer programming to enhance the annotation of scRNA clusters using established reference datasets. Our method produces highly accurate cross-technology, cross-tissue, and cross-species mappings while remaining tractable at atlas scale, outperforming existing methods across all these tasks. By providing precise annotations, RefCM can enable the discovery of new cell types, states, and relationships in single-cell transcriptomic data.
bioinformatics2026-03-06v2DPGT: A spark based high-performance joint variant calling tool for large cohort sequencing
Gong, C.; Yang, Q.; Wan, R.; Li, S.; Zhang, Y.; Li, Y.Abstract
Background: Joint variant calling is a crucial step in population-scale sequencing analysis. While population-scale sequencing is a powerful tool for genetic studies, achieving fast and accurate joint variant calling on large cohorts remains computationally challenging. Findings: To meet this challenge, we developed Distributed Population Genetics Tool (DPGT), an efficient computing framework and a robust tool for joint variant calling on large cohorts based on Apache Spark. DPGT simplifies joint calling tasks for large cohorts with a single command on a local computer or a computing cluster, eliminating the need for users to create complex parallel workflows. We evaluated the performance of DPGT using 2,504 1000 Genomes Project (1KGP), 6 Genome in a Bottle (GIAB) and 9,158 internal whole genome sequencing (WGS) samples together with existing methods. As a result, DPGT produced results comparable in accuracy to existing methods, with less time and better scalability. Conclusions: DPGT is a fast, scalable, and accurate tool for joint variant calling. The source code is available under a GPLv3 license at https://github.com/BGI-flexlab/DPGT, implemented in Java and C++. Keywords: SNP/INDEL, joint calling, computational performance optimization, parallel computing
bioinformatics2026-03-06v2What Do Biological Foundation Models Compute? Sparse Autoencoders from Feature Recovery to Mechanistic Interpretability
Orlov, A. V.; Makus, Y. V.; Ashniev, G. A.; Orlova, N. N.; Nikitin, P. I.Abstract
Foundation models trained on protein and DNA sequences are increasingly deployed for variant interpretation, drug design, and gene regulation prediction, yet their internal representations remain opaque, limiting both biological insight and trust in model-guided decisions. Existing interpretation approaches establish what these models encode but cannot reveal how biological knowledge is internally organized and computed. Sparse autoencoders (SAEs) offer a complementary approach by decomposing model activations into interpretable features, each capturing a distinct biological concept. Over the past year, SAEs have been applied to protein language models, genomic language models, pathology vision transformers, single-cell foundation models, and protein structure generators. Here we provide a systematic review of sparse dictionary learning across biological foundation models. We find that independent studies using different architectures and evaluation strategies consistently recover features spanning biological scales (from secondary structure elements and functional domains in proteins to transcription factor binding sites and regulatory elements in genomes), providing convergent evidence that these models learn interpretable representations accessible through sparse decomposition. However, we identify a critical gap: validation relies almost exclusively on matching features against existing annotations, risking circularity when those annotations derive from the same sequence databases used for model training. We propose a three-level interpretability framework (representational, computational, and causal mechanistic) and argue that the field's most distinctive opportunity lies in experimental validation through deep mutational scanning, massively parallel reporter assays, and structural characterization, which can establish whether these models have learned genuine biological mechanisms rather than training set statistics.
bioinformatics2026-03-06v1TFBSpedia: a comprehensive human and mouse transcription factor binding sites database
Li, S.; Chou, E.; Wang, K.; Boyle, A. P.; Sartor, M. A.Abstract
Mapping the genomic locations and patterns of transcription factor binding sites (TFBS) is essential for understanding gene regulation and advancing treatments for diseases driven by DNA modifications, including epigenetic changes and sequence variants. Although several TFBS databases exist, no study has systematically benchmarked these databases across different sequencing technologies and computational algorithms. In this study, we addressed this gap by constructing a TFBS database that integrates all available ENCODE cell line ATAC-seq and Cistrome Data Browser ChIP-seq datasets, comprising 11.3 million human and 1.87 million mouse TFBS. We also integrated previously published TFBS resources (Factorbook, Unibind, RegulomeDB, and ENCODE_footprint) and found each contains a substantial fraction of unique TFBS predictions, highlighting significant discrepancies among existing resources. To assess the accuracy of the combined TFBS regions, we assembled ten independent genomic annotation datasets for evaluation and found that TFBS regions predicted by multiple databases are more likely to represent true and biologically meaningful binding sites. For each predicted TFBS region, we define two scores: the confidence score reflects prediction reliability, while the importance score represents biological functional relevance. Finally, we introduce TFBSpedia, a lightweight and efficient search engine that enables rapid retrieval of TFBS regions and comprehensive annotation information across the integrated databases.
bioinformatics2026-03-06v1A latent space thermodynamic model of cell differentiation
Poursina, A.; Hajhashemi, S.; Mikaeili Namini, A.; Saberi, A.; Emad, A.; Najafabadi, H. S.Abstract
Inferring the governing dynamics of differentiation that capture cell state evolution remains a central challenge in single-cell biology. We present Latent Space Dynamics (LSD), a thermodynamics-inspired framework that models cell differentiation as evolution on a learned Waddington landscape in latent space. LSD jointly infers a low-dimensional cell state, a differentiable potential function governing developmental flow, and a local entropy term that quantifies cellular plasticity. Using a neural ordinary differential equation, LSD reconstructs continuous differentiation trajectories from time-ordered single-cell data. Across diverse developmental systems, LSD accurately recovers lineage hierarchies, predicts fate commitment for unseen cell types, and outperforms existing trajectory inference approaches in directional accuracy. Moreover, in silico gene perturbations reveal how individual regulators reshape the landscape, and entropy provides a quantitative measure of plasticity in development and cancer.
bioinformatics2026-03-06v1Reliable prediction of short linear motifs in the human proteome
Pancsa, R.; Ficho, E.; Kalman, Z. E.; Gerdan, C.; Remenyi, I.; Zeke, A.; Tusnady, G. E.; Dobson, L.Abstract
Short linear motifs (SLiMs) are small, often transient interaction modules within intrinsically disordered regions (IDRs) of proteins that interact with particular domains and thereby regulate numerous biological processes. The limited sequence information within these short peptides leads to frequent false positive hits in both computational and experimental SLiM identification methods. This makes the description of novel SLiMs challenging and has limited the number of known cases to a few thousand, even though SLiMs play widespread roles in cellular functions. We present SLiMMine, a deep learning-based method to identify SLiMs in the human proteome. By refining the annotations of known, annotated motif classes, we created a high-quality dataset for model training. Using protein embeddings and neural networks, SLiMMine reliably predicts novel SLiM candidates in known classes and eliminates ~80% of the pattern matching-based motif hits as false positives. It can also be used as a discovery tool to find uncharacterized SLiMs based on optimal sequence environment. In addition, we narrowed the highly general interactor-domain definitions of known SLiM classes to specific human proteins, enabling more precise prediction of a wide range of potential protein-protein interactions (PPIs) in the human interactome. SLiMMine is available in the form of an appealing, user-friendly, multi-purpose web-server at https://slimmine.pbrg.hu/.
bioinformatics2026-03-06v1RNA-seq analysis in seconds using GPUs
Melsted, P.; Guthnyjarson, E. M.; Nordal, J.Abstract
We present a GPU implementation of kallisto for RNA-seq transcript quantification. By redesigning the core algorithms (pseudoalignment, equivalence class intersection, and the EM algorithm) for massively parallel execution on GPUs, we achieve a 30-50x speedup over multithreaded CPU kallisto. On a benchmark of 100 Geuvadis samples from human cell lines, the GPU version processes paired-end reads at a rate of 3.6 million per second, completing a typical sample in seconds rather than minutes. For a large dataset of 295 million reads, runtime drops from 40 minutes to 50 seconds. Our implementation demonstrates that careful algorithmic redesign, rather than naive porting of software, is necessary to fully exploit the computing power of GPUs in sequence analysis.
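For readers unfamiliar with the EM step named in this abstract, the following is a minimal single-threaded Python sketch of expectation-maximization over equivalence classes. It is not the authors' GPU code and makes no claim about how they parallelize it; the function name `em_abundances`, the input layout, and the toy counts are assumptions made for illustration.

```python
def em_abundances(eq_classes, eff_lengths, n_iter=200):
    """Estimate the fraction of reads originating from each transcript.

    eq_classes: dict mapping a tuple of transcript indices (the set of
                transcripts a group of reads is compatible with) to a read count.
    eff_lengths: effective length of each transcript.
    """
    n = len(eff_lengths)
    alpha = [1.0 / n] * n  # start from a uniform guess
    for _ in range(n_iter):
        new = [0.0] * n
        for members, count in eq_classes.items():
            # E-step: split the class's reads in proportion to alpha_t / length_t.
            weights = [alpha[t] / eff_lengths[t] for t in members]
            denom = sum(weights)
            if denom == 0.0:
                continue
            for t, w in zip(members, weights):
                new[t] += count * w / denom
        # M-step: renormalize expected read counts into fractions.
        total = sum(new)
        alpha = [x / total for x in new]
    return alpha


# Toy input (assumption): two transcripts of equal effective length.
toy_classes = {(0,): 30, (0, 1): 60, (1,): 10}
abundances = em_abundances(toy_classes, eff_lengths=[1000.0, 1000.0])
print([round(a, 3) for a in abundances])  # approaches 0.75 and 0.25
```

In this toy case the 60 ambiguous reads are repeatedly reassigned according to the current estimates until the split stabilizes. Each equivalence class is processed independently within an iteration, which is the property that makes this step a natural candidate for parallel execution.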
bioinformatics2026-03-06v1From expansion to consolidation: two decades of Gene Ontology evolution
Pitarch, B.; Pazos, F.; Chagoyen, M.Abstract
The Gene Ontology (GO) is a long-standing, community-maintained knowledge resource that underpins the functional annotation of gene products across numerous biological databases. Released regularly, GO and its associated annotations form a large, continuously evolving dataset whose temporal dynamics have direct consequences for data reuse, versioning, and reproducibility. Because analytical results derived from GO are inherently tied to specific ontology and annotation releases, a systematic understanding of how GO changes over time is essential for transparent interpretation and long-term reuse of GO-based analyses. Here, we present a comprehensive temporal characterization of the Gene Ontology and its annotations spanning 21 years of publicly available releases. Treating successive ontology and annotation versions as longitudinal research data, we quantify changes in ontology structure, term composition, relationships, and annotation content across time and across three representative annotation resources. Our analysis reveals sustained growth of GO over its lifetime, accompanied by marked structural reorganization, particularly affecting high-level, general ontology terms. Notably, across multiple structural and annotation metrics, we identify a transition toward increased stability beginning around 2017, consistent with a maturation phase of the resource. This work provides a reference framework for researchers who rely on GO releases for data integration, benchmarking, and reproducible functional analysis.
bioinformatics2026-03-06v1In silico drug repurposing and in vitro validation of cestode fatty acid binding proteins
Rodriguez, S.; Alberca, L. N.; Gavernet, L.; Franchini, G. R.; Talevi, A.Abstract
Echinococcosis is a Neglected Tropical Disease (NTD) caused by Echinococcus granulosus and Echinococcus multilocularis, the etiological agents of cystic and alveolar echinococcosis, respectively. These infections pose a significant public health burden, particularly in endemic regions. Cestodes lack key enzymes involved in lipid metabolism and must acquire lipids from their hosts. Fatty Acid Binding Proteins (FABPs), which mediate lipid trafficking and intracellular transport, have therefore emerged as essential and potentially druggable targets. In this study, we implemented an integrated virtual screening strategy combining ligand-based and structure-based approaches to identify novel FABP binders as potential therapeutic agents against Echinococcus spp. High-specificity screening of approximately 435,000 compounds yielded a limited number of prioritized in silico hits. Four compounds (hydrochlorothiazide, naratriptan, fenticonazole, and montelukast) were selected for experimental validation, prioritizing repurposing candidates. Fluorescence displacement assays confirmed that hydrochlorothiazide binds to three cestode FABPs (EgFABP1, EmFABP1, and EmFABP3), validating the predictive performance of the computational workflow. These findings support the value of parallel in silico screening strategies and drug repurposing approaches for the discovery of new therapeutic candidates against neglected tropical diseases. Keywords: Drug discovery; Drug repositioning; Drug repurposing; Echinococcus spp; FABP; Medicinal chemistry; Neglected tropical diseases; Virtual screening
bioinformatics2026-03-06v1CLAMP: Curated Latent-variable Analysis with Molecular Priors
Subirana-Granes, M.; Nandi, S.; Zhang, H.; Chikina, M.; Pividori, M.Abstract
Gene expression analysis has long been fundamental for elucidating molecular pathways and gene-disease relationships, but traditional single-gene approaches cannot capture the coordinated regulatory networks underlying complex phenotypes; although unsupervised matrix factorization methods (e.g., PCA, NMF) reveal coexpression patterns, they lack the ability to incorporate prior biological knowledge and often struggle with interpretability and technical noise correction. Semi-supervised strategies such as PLIER have improved interpretability by integrating pathway annotations during latent variable extraction, yet the original PLIER implementation is prohibitively slow and memory-intensive, making it impractical for modern large-scale resources like ARCHS4 or recount3. Here, we introduce CLAMP, which overcomes these constraints through a two-phase algorithmic design (an unsupervised CLAMPbase initialization followed by a CLAMPfull regression that incorporates priors via glmnet), rigorous internal cross-validation to tune regularization parameters for each latent variable, and efficient on-disk data handling using memory-mapped matrices from the bigstatsr package. Benchmarking on GTEx, recount2, and ARCHS4 demonstrates that CLAMP achieves 7x-41x speedups over PLIER, succeeds in modeling hundreds of thousands of samples that PLIER cannot handle, and maintains or improves biological specificity of latent variables as shown by tissue-alignment and pathway enrichment analyses. By filling the gap in scalable, biologically informed latent variable extraction, CLAMP enables comprehensive analysis of modern transcriptomic compendia and paves the way for deeper insights into gene regulatory networks and downstream applications in translational genomics.
bioinformatics2026-03-05v2GraTools, an user-friendly tool for exploring and manipulating pangenome variation graphs
Ravel, S.; Marthe, N.; Carrette, C.; Mohamed, M.; Sabot, F.; Tranchant-Dubreuil, C.Abstract
Background: Pangenome variation graphs (PVGs), which represent genomic diversity through multiple-genome alignment, are powerful tools for studying genomic variations in populations. However, current tools often lack integration or efficiency, or require format conversions, which hinders their usability. Results: Here, we introduce GraTools, a fast and user-friendly command-line tool for manipulating PVGs using the original GFA file. After a one-time graph import, GraTools enables rapid subgraph extraction, fasta sequence retrieval, and comprehensive analyses, including core/dispensable genome ratio calculation or group-specific segment identification. The import step converts the graph into standard data formats (BAM/BED), enabling the reuse of well-optimized existing tools and allowing efficient storage and querying of the large, complex data structures of PVGs. Scalability is ensured by a modular architecture supporting parallel processing and asynchronous I/O operations. GraTools supports coordinates defined on both the primary reference and alternative genomes within the graph without re-importing, and its outputs can be easily visualized or manipulated using external tools. Using an Asian rice pangenome graph (13 accessions), we demonstrate its ability to easily extract subgraphs, compute depth statistics, and identify subspecies-specific segments. An intuitive command-line interface, real-time execution feedback and a detailed logging system make this tool suitable for a wide range of applications, from population genetics to breeding and genomic medicine, for both biologists and bioinformaticians. Conclusions: Through its unified graph manipulation interface, GraTools offers an interesting alternative to the few existing tools for manipulating PVGs, facilitating rapid, efficient and flexible downstream analyses. It is available as an open-source tool (GNU GPLv3), with its documentation available at https://gratools.readthedocs.io.
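The core/dispensable genome ratio mentioned in this abstract can be sketched in a few lines of Python. This is not GraTools code and does not use its interface. It assumes the common pangenome convention that a segment is "core" when every genome in the graph traverses it and "dispensable" otherwise, and it counts segments rather than base pairs. The function name and the toy presence table are hypothetical.

```python
def core_dispensable_ratio(segment_genomes, n_genomes):
    """Return (core, dispensable, ratio), counted in graph segments.

    segment_genomes: dict mapping a segment id to the set of genomes
                     whose paths traverse that segment.
    n_genomes: total number of genomes in the graph.
    """
    # Assumed definition: core = present in all genomes, dispensable = the rest.
    core = sum(1 for genomes in segment_genomes.values() if len(genomes) == n_genomes)
    dispensable = len(segment_genomes) - core
    ratio = core / dispensable if dispensable else float("inf")
    return core, dispensable, ratio


# Toy presence table for three genomes A, B, C (made-up data).
presence = {
    "s1": {"A", "B", "C"},
    "s2": {"A", "B"},
    "s3": {"A", "B", "C"},
    "s4": {"C"},
}
core, dispensable, ratio = core_dispensable_ratio(presence, n_genomes=3)
print(core, dispensable, ratio)
```

Weighting each segment by its sequence length instead of counting segments would give a ratio in base pairs; which of the two the tool reports is not stated in the abstract.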
bioinformatics2026-03-05v2