Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
Near perfect identification of half sibling versus niece/nephew avuncular pairs without pedigree information or genotyped relatives
Sapin, E.; Kelly, K.; Keller, M. C.
Abstract
Motivation: Large-scale genomic biobanks contain thousands of second-degree relatives with missing pedigree metadata. Accurately distinguishing half-sibling (HS) from niece/nephew-avuncular (N/A) pairs--both sharing approximately 25% of the genome--remains a significant challenge. Current SNP-based methods rely on Identical-By-Descent (IBD) segment counts and age differences, but substantial distributional overlap leads to high misclassification rates. There is a critical need for a scalable, genotype-only method that can resolve these "half-degree" ambiguities without requiring observed pedigrees or extensive relative information. Results: We present a novel computational framework that achieves near-complete separation of HS and N/A pairs using only genotype data. Our approach utilizes across-chromosome phasing to derive haplotype-level sharing features that summarize how IBD is distributed across parental homologues. By modeling these features with a Gaussian mixture model (GMM), we demonstrate near-perfect classification accuracy (> 98%) in biobank-scale data. Furthermore, we show that these high-confidence relationship labels can serve as long-range phasing anchors, providing structural constraints that improve the accuracy of across-chromosome homologue assignment. This method provides a robust, scalable solution for pedigree reconstruction and the control of cryptic relatedness in large-scale genomic studies.
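As an illustration of the mixture-model step only, the sketch below fits a two-component Gaussian mixture to a single sharing feature using plain EM. The feature values, the function names, and the reduction to one dimension are assumptions made for this example; the abstract does not specify the features or the fitting procedure, and which component corresponds to HS or N/A pairs would have to be determined separately.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_two_component_gmm(values, n_iter=50, var_floor=1e-6):
    """Fit a 1-D, two-component Gaussian mixture with plain EM.

    Returns (means, variances, weights, responsibilities).
    """
    n = len(values)
    grand_mean = sum(values) / n
    grand_var = max(sum((v - grand_mean) ** 2 for v in values) / n, var_floor)
    means = [min(values), max(values)]  # start the components at the extremes
    variances = [grand_var, grand_var]
    weights = [0.5, 0.5]
    resp = [[0.5, 0.5] for _ in values]

    for _ in range(n_iter):
        # E-step: posterior probability of each component for each pair.
        for i, x in enumerate(values):
            dens = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(dens)
            if total > 0.0:
                resp[i] = [d / total for d in dens]
        # M-step: re-estimate parameters from the soft assignments.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue  # leave an empty component unchanged
            means[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            spread = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, values)) / nk
            variances[k] = max(spread, var_floor)
            weights[k] = nk / n
    return means, variances, weights, resp


# Hypothetical per-pair feature values (not taken from the paper).
feature = [0.08, 0.10, 0.12, 0.09, 0.11, 0.88, 0.90, 0.92, 0.91, 0.89]
means, variances, weights, resp = fit_two_component_gmm(feature)
labels = [0 if r[0] >= r[1] else 1 for r in resp]
print(means, labels)
```

With well-separated features the posterior responsibilities are close to 0 or 1; overlapping features would instead yield intermediate values that can be thresholded to keep only high-confidence labels.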
bioinformatics 2026-04-08 v6
Local and Global Patterns Support Medical Imaging as a Biomarker of Ageing
Mueller, T. T.; Starck, S.; Llalloshi, R.; Kaissis, G.; Ziller, A.; Graf, R.; Schlett, C.; Ringhof, S.; Bamberg, MD, MPH, F.; Wielpuetz, M.; Völzke, H.; Leitzmann, M.; Niendorf, T.; Keil, T.; Krist, L.; Pischon, T.; Karch, A.; Berger, K.; Kirschke, J.; Rueckert, D.; Braren, R.
Abstract
Background: Understanding human ageing across multiple organs is essential for characterising individual health trajectories and identifying abnormal ageing processes. Multi-organ imaging provides an opportunity to quantify biological ageing beyond chronological age. The aim of this study is to assess organ-specific and whole-body ageing patterns and their associations with disease and lifestyle factors. Methods: In this large-scale study, we evaluate biological ageing patterns using 70,000 MRI scans from the UK Biobank and the German National Cohort. We employ 3D ResNet-18 models to predict chronological age from various body regions (brain, heart, liver, spine, lungs, muscle, and intestine) and the whole body. From these predictions, we derive age gaps relative to a strictly healthy reference cohort, which enables the identification of accelerated ageing patterns. We then evaluate associations with chronic diseases and lifestyle factors, and develop a virtual ageing framework that explores counterfactual scenarios by substituting anatomical regions across subjects, quantifying local impacts on global biological age. Results: We show significant associations between detected accelerated ageing and specific chronic diseases, including multiple sclerosis and chronic obstructive pulmonary disease, as well as lifestyle factors such as smoking and physical activity. Virtual substitution of anatomical regions demonstrates that local substitutions can influence global ageing patterns. Conclusions: This study demonstrates that multi-organ imaging enables the detection of abnormal ageing patterns at both local and global levels. The presented framework provides a foundation for improved risk stratification and supports the development of personalised approaches to health assessment and disease prevention.
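The age-gap idea can be sketched in a few lines. The version below assumes the simplest possible reference correction, subtracting the mean raw gap of the healthy cohort from every subject; the abstract does not state how the reference cohort is used, so both the correction and the function name `age_gaps` are assumptions of this example.

```python
from statistics import mean


def age_gaps(predicted, chronological, is_healthy):
    """Predicted-minus-chronological age, centred on a healthy reference cohort.

    The constant-offset correction is an assumption of this sketch.
    """
    raw = [p - c for p, c in zip(predicted, chronological)]
    reference_offset = mean(g for g, healthy in zip(raw, is_healthy) if healthy)
    return [g - reference_offset for g in raw]


# Hypothetical subjects: model-predicted age, true age, healthy-reference flag.
predicted = [52.0, 61.0, 70.0, 48.0]
chronological = [50.0, 60.0, 62.0, 47.0]
is_healthy = [True, True, False, True]

gaps = age_gaps(predicted, chronological, is_healthy)
print(gaps)
```

A positive corrected gap marks a subject whose organ looks older than expected for healthy individuals of the same age, which is the quantity one would then test for association with disease or lifestyle factors.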
bioinformatics 2026-04-08 v3
TPCAV: Interpreting deep learning genomics models via concept attribution
Yang, J.; Mahony, S.
Abstract
Interpreting genomics deep learning models remains challenging. Existing feature attribution methods are largely restricted to one-hot DNA inputs and therefore cannot assess the influence of more general genomic features such as chromatin states or genomic repeats. Concept attribution methods offer an input-agnostic global interpretation framework, yet they have not been systematically applied to interpret neural network applications in genomics. We present the first application of concept attribution to interpret genomics deep learning models by adapting the Testing with Concept Activation Vectors (TCAV) method. We improve upon the original TCAV method by incorporating a PCA-based decorrelation transformation to address correlated and redundant embedding features commonly observed in genomics deep learning models, resulting in the Testing with PCA-projected Concept Activation Vectors (TPCAV) approach. We also introduce a strategy for extracting concept-specific input attribution maps. We evaluate our approach by interpreting influential biological concepts across a diverse set of genomics models spanning multiple input representations and prediction tasks. We demonstrate that TPCAV provides comparable motif feature interpretation to TF-MoDISco on one-hot encoded DNA-based transcription factor binding prediction models. TPCAV also enables robust interpretive analysis of how more general biological concepts such as repetitive elements and chromatin state annotations contribute towards predictions. TPCAV uniquely generalizes to interpret features learned by tokenized foundation models as well as models incorporating chromatin signals as inputs. We further show that TPCAV can identify representative regions associated with specific concepts, motivating downstream investigation of distinct regulatory mechanisms. TPCAV provides a flexible and robust complement to existing model interpretation techniques.
bioinformatics 2026-04-08 v3
A longitudinal data framework for context-specific genotype-to-phenotype mapping
Veith, T.; Beck, R. J.; Tagal, V.; Li, T.; Alahmari, S.; Cole, J.; Hannaby, D.; Kyei, J.; Yu, X.; Maksin, K.; Schultz, A.; Lee, H.; Diaz, A.; Lupo, J.; El Naqa, I.; Eschrich, S. A.; Ji, H.; Andor, N.
Abstract
Molecular assays can resolve clonal structure, but they are expensive and typically sparse in time, whereas phenotypic observations such as imaging can be collected frequently but often are not preserved in the context needed for later interpretation. We present CLONEID, an event-based framework for organizing clone-resolved phenotypic, molecular, and specimen-context records so that genotype-to-phenotype interpretation can be maintained across time. CLONEID links time-stamped Events, assay-specific Perspectives, and reconciled Identities through structured ingestion, provenance-aware retrieval, and reproducible export, complementing upstream clone-calling methods. In a long-term gastric cancer density-selection experiment, CLONEID linked repeated culture events, growth measurements, and late karyotypic profiling within a shared record, supporting longitudinal interpretation of phenotypic adaptation together with underlying chromosomal state.
bioinformatics 2026-04-08 v3
Reconstructing biologically coherent cellular profiles from imaging-based spatial transcriptomics
Yuan, L.; Zheng, Y.; Zhang, S.; Beroukhim, R.; Deshpande, A.
Abstract
In imaging-based spatial transcriptomics, transcript-to-cell assignment shapes downstream biological interpretation including cell typing, ligand-receptor inference, and niche characterization. However, two-dimensional segmentation of volumetric tissue often yields mixed cellular profiles, while cells without detected nuclei are missed entirely, distorting the aforementioned downstream analyses. We present TRACER, which refines cellular representations in imaging-based transcriptomics by leveraging gene-gene coherence and spatial co-localization of transcripts observed directly in the data, without requiring external annotations or reference atlases. TRACER resolves mixed cellular profiles and reconstructs partial cells whose nuclei are not detected, enabling more complete representation of cells within the tissue section. We also introduce coherence-based metrics that quantify transcriptional purity and conflict, enabling platform-agnostic benchmarking of segmentation quality. Across diverse platforms, tissues, and segmentation methodologies, TRACER consistently and reproducibly improves the coherence of cellular profiles and the quality of downstream analyses.
bioinformatics 2026-04-08 v2
Horse, not zebra: accounting for lineage abundance in maximum likelihood phylogenetics
De Maio, N.
Abstract
Maximum likelihood phylogenetic methods are popular approaches for estimating evolutionary histories from genome data. These methods do not make prior assumptions regarding strategies used for deciding which genomes were sequenced. However, in genomic epidemiology the sequencing rate is often agnostic to the specific pathogen strain considered. In this scenario, a pathogen strain's prevalence should be reflected in its relative abundance in the genome data. Here, I show that this simple assumption, when appropriate and incorporated within maximum likelihood phylogenetics, greatly improves the accuracy of phylogenetic inference. I introduce and assess two separate approaches to achieve this. The first approach rescales the likelihood of a phylogenetic tree by the number of distinct binary topologies obtainable by arbitrarily resolving multifurcations in the tree. This approach interprets multifurcations as the result of lack of signal for resolving a bifurcating topology rather than as an instantaneous multifurcating event. The second approach instead includes a tree prior that assumes that genomes are sequenced at a rate proportional to their abundance. Both approaches favor phylogenetic placement at abundant lineages, and dramatically improve the accuracy of phylogenetic inference in scenarios like SARS-CoV-2 phylogenetics, where large multifurcations are common. This considerable impact is also observed in real pandemic-scale SARS-CoV-2 genome data, where accounting for lineage prevalence reduces phylogenetic uncertainty by around one order of magnitude. Both approaches were implemented in the open source phylogenetic software MAPLE v0.7.5.4 (https://github.com/NicolaDM/MAPLE).
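To make the counting in the first approach concrete: a node with k children can be resolved into (2k-3)!! distinct rooted binary subtrees, and the count for a whole tree is the product over its internal nodes. The sketch below computes the log of that count for a tree written as nested tuples. It assumes rooted counting and a toy tree encoding, and it is not taken from MAPLE; how the count enters the likelihood is left to the paper.

```python
import math


def binary_resolutions(num_children):
    """Number of rooted binary topologies resolving a node with k children: (2k-3)!!."""
    if num_children < 2:
        raise ValueError("an internal node needs at least two children")
    # Product of the odd numbers 1, 3, ..., 2k-3.
    return math.prod(range(1, 2 * num_children - 2, 2))


def log_resolution_count(tree):
    """Log of the number of binary resolutions of a whole tree.

    `tree` is a nested tuple of children; leaves are strings (toy encoding).
    """
    if isinstance(tree, str):
        return 0.0
    own = math.log(binary_resolutions(len(tree)))
    return own + sum(log_resolution_count(child) for child in tree)


# Hypothetical tree: a root with four children, one of which is a resolved cherry.
example_tree = ("A", "B", "C", ("D", "E"))
log_count = log_resolution_count(example_tree)
print(binary_resolutions(4), math.exp(log_count))
```

Because the count grows super-exponentially with the size of a multifurcation, any rescaling by it weighs heavily on where new samples attach in trees with large polytomies.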
bioinformatics 2026-04-08 v2
Genetic demultiplexing and transcript start site identification from nanopore sequencing of 10x Genomics multiome libraries
Mears, J.; Orchard, P.; Varshney, A.; Bose, M. L.; Robertson, C. C.; Piper, M.; Pashos, E.; Dolgachev, V.; Manickam, N.; Jean, P.; Kitzman, D. W.; Fauman, E.; Damilano, F.; Roth Flach, R. J.; Nicklas, B.; Parker, S. C.
Abstract
Short-read Illumina sequencing of 10x Genomics single-nucleus multiome libraries captures only the 3' end of RNA transcripts, losing transcription start site (TSS) information. Here we demonstrate nanopore sequencing of 10x multiome libraries, which enables the profiling of full length transcripts. We show concordance with common short-read sequencing based workflows including successful genetic demultiplexing of nanopore data despite its higher error rate. We compare TSS identified using nanopore sequencing of multiome cDNA to those identified using a short-read 5' assay, and provide an optimized approach for the preprocessing of nanopore reads prior to TSS identification. We find that nanopore sequencing of multiome cDNA captures a median of 63% of the TSS detected by the 5' assay.
bioinformatics 2026-04-08 v2
GAP-MS: Automated validation of gene predictions using integrated mass spectrometry evidence
Abbas, Q.; Wilhelm, M.; Kuster, B.; Frischman, D.
Abstract
Accurate genome annotation is fundamental to modern biology, yet distinguishing authentic protein-coding sequences from prediction artifacts remains challenging, particularly in complex plant genomes where automated methods are error-prone and manual curation is rarely feasible due to prohibitive time and costs. Here, we present GAP-MS (Gene model Assessment using Peptides from Mass Spectrometry), an automated proteogenomic pipeline that leverages mass spectrometry evidence to systematically validate the protein-level accuracy of predicted gene models. Applied across 9 major crop species, GAP-MS consistently improved prediction precision for four widely used gene prediction tools. In addition to filtering erroneous models, the pipeline identified hundreds of previously missing gene models from current standard reference annotations. These peptide-supported loci were further verified by transcriptional evidence, well-supported functional annotations, and high coding-potential scores. Together, these results demonstrate that direct proteomic evidence provides a robust framework for resolving annotation ambiguities, defining high-confidence reference proteomes, and uncovering overlooked protein-coding genes, while facilitating the identification of sequences that may require further investigation.
bioinformatics 2026-04-08 v2
A mathematical model for inflammation and demyelination in multiple sclerosis
Jenner, A. L.; Weatherley, G. R.; Frascoli, F.
Abstract
Multiple sclerosis (MS) is an incurable, lifelong disease caused by the demyelination of neurons in the brain and spine. MS is often characterised by relapses of inflammation and demyelination, which are followed by periods of remission. Symptoms can be highly debilitating and there are still many open questions about the origin and progression of the disease. Mathematical modelling is well-placed to capture the dynamics of MS and provide insight into disease aetiology. In this work, we present a minimal model for MS disease onset and progression driven by inflammation and demyelination. The model dynamics are capable of describing a typical evolution of the illness, with changes from a healthy state to a diseased scenario captured by certain ranges of parameter values. Our model also describes the non-uniform oscillatory nature of the disease, born from a Hopf bifurcation due to the strength of the inflammatory response. In particular, using experimental data for Contrast Enhancing Lesions obtained from MS patients, we are able to reproduce some of the typical relapsing-remitting behaviours of this disease. We hope that the model presented here can serve as a baseline for more complex approaches and as a tool to predict possible evolutions of the disease.
bioinformatics 2026-04-08 v2
A geometric criterion links HIV-1 capsid topography to its biophysical properties and function
Li, W.; Peeples, C. A.; Rey, J. S.; Perilla, J. R.; Twarock, R.
Abstract
Mathematical models of virus capsid structure are pillars of modern virology, aiding the understanding of viral mechanisms and the design of antiviral interventions. Traditionally, the HIV-1 capsid core geometry is represented as a fullerene lattice, akin to the icosahedral models of spherical viruses in Caspar-Klug theory. However, recent studies revealed that many viral capsids deviate from such idealised lattices, with important functional implications. Here we demonstrate that this is the case also for the conical HIV-1 core geometries, in which the hexamer and pentamer boundaries form a pseudo-tiling rather than a perfectly aligned fullerene network. We introduce a triangular geometric criterion that quantifies local deviations of an HIV-1 atomic model from its idealised fullerene backbone. Using this criterion, we demonstrate that this difference in geometric organisation between idealised (fullerene) and actual (data-derived) capsid models has implications for the capsid's biophysical properties. We also discuss the use of the geometric criterion as a predictive tool regarding cofactor binding and implied geometric changes in the capsid surface coupled to the interfacial frustration response. Our results establish a quantitative framework linking capsid geometry, curvature, and biophysical function, offering new perspectives for assembly inhibitor design and lentiviral vector engineering.
bioinformatics 2026-04-08 v2
Sampling protein structural token space enables accurate prediction of multiple conformations
Wang, Z.; Yu, Y.; Yu, C.; Bu, D.
Abstract
Protein function is fundamentally mediated by ensembles of distinct metastable states. However, existing methods, such as AlphaFold 3, typically exhibit a bias toward predicting a single dominant state, failing to capture alternative conformations or provide robust metrics for identifying high-quality multi-state conformations. Here, we present MultiStateFold (MSFold), a framework that integrates Parallel Tempering into the discrete structure token space of the ESM3 protein language model. By conceptualizing the model's latent space as an implicit energy landscape, MSFold enables global exploration and barrier crossing, thereby overcoming the local sampling limitations inherent in base generative models. Across a benchmark of 313 multi-conformation pairs, MSFold sets a new performance standard: it achieves the highest success rate in modeling native states and substantially outperforms leading methods, including AlphaFold 3, on challenging alternative conformations, while maintaining competitive accuracy for primary structures. Furthermore, we propose Sequence Log-Likelihood (SLL), a novel confidence metric derived from sequence-structure consistency. Our results demonstrate that SLL offers a modest improvement over standard metrics such as pTM and pLDDT. This work establishes a new paradigm for conformational sampling, bridging classical statistical physics with protein language models.
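For readers unfamiliar with Parallel Tempering, the core step is the replica-exchange move: two replicas at inverse temperatures β_i and β_j with energies E_i and E_j swap states with probability min(1, exp((β_i - β_j)(E_i - E_j))). The sketch below shows that generic rule only. How MSFold defines an energy over ESM3 structure tokens is not described in the abstract, so the energies here are plain numbers and all names are hypothetical.

```python
import math
import random


def swap_acceptance(beta_i, beta_j, energy_i, energy_j):
    """Metropolis acceptance probability for exchanging two replicas."""
    delta = (beta_i - beta_j) * (energy_i - energy_j)
    return 1.0 if delta >= 0.0 else math.exp(delta)


def attempt_neighbour_swaps(states, energies, betas, rng):
    """Try one exchange between each pair of adjacent temperatures, in place."""
    for k in range(len(betas) - 1):
        prob = swap_acceptance(betas[k], betas[k + 1], energies[k], energies[k + 1])
        if rng.random() < prob:
            states[k], states[k + 1] = states[k + 1], states[k]
            energies[k], energies[k + 1] = energies[k + 1], energies[k]


# Hypothetical ladder: the cold replica (beta = 1.0) currently holds the worse state.
betas = [1.0, 0.5]
states = ["conformation_a", "conformation_b"]
energies = [3.0, 1.0]
attempt_neighbour_swaps(states, energies, betas, random.Random(0))
print(states, energies)
```

In this toy case the swap is always accepted because it moves the lower-energy state to the colder replica; the reverse move would be accepted with probability exp(-1). This asymmetry is what lets hot replicas cross barriers while cold replicas refine good states.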
bioinformatics 2026-04-08 v2
Adaptive Integration of Heterogeneous Foundation Models to Find Histologically Predictable Genes in Breast Cancer
Nguyen, H.; Li, C.; Peng, C.; Simpson, P.; Ye, N.; Nguyen, Q.
Abstract
Foundation models for computational pathology have rapidly emerged as powerful tools for extracting rich biological and morphological representations from histopathology images. However, variations in model architecture, pre-training data, and optimization objectives often lead to task-dependent performance, rather than universal generalization. As a result, effective strategies for integrating their complementary strengths are essential to fully realize the potential of foundation models for robust histopathology analysis. Meanwhile, recent breakthroughs such as spatial transcriptomics provide an unprecedented opportunity to integrate genetic and histopathology information from the same patient sample, thereby maximizing both molecular and anatomical pathology insights. Specifically, each model's embedding is first mapped to gene-level predictions via a dedicated prediction head, enabling model-specific feature utilization. A lightweight weighting network then adaptively aggregates these predictions to produce a unified and robust output at gene and spatial location levels. Across multiple spatial transcriptomics datasets, our approach consistently outperforms both individual foundation models and classical ensembling methods. Focusing on breast cancer, we observe substantial gains in prediction accuracy for clinically relevant PAM50 subtype markers and drug-target genes. Moreover, the proposed framework improves interpretability by revealing model-specific contributions and specialization at the gene level. Overall, our work presents an effective solution to integrating multiple foundation models for enhancing the genetic analyses of histopathology images.
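The aggregation step can be pictured as a gated weighted average of per-model predictions. The sketch below assumes a softmax over gate scores, which is one common choice; the abstract does not say how the weighting network normalises its outputs, and the function names and numbers are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_predictions(per_model_predictions, gate_scores):
    """Combine one gene's predictions from several models using softmax weights."""
    weights = softmax(gate_scores)
    combined = sum(w * p for w, p in zip(weights, per_model_predictions))
    return combined, weights


# Hypothetical expression predictions for one gene at one spot from three models,
# with gate scores that a weighting network might emit for that gene and spot.
predictions = [1.0, 3.0, 2.0]
gate_scores = [0.0, 0.0, 0.0]
combined, weights = aggregate_predictions(predictions, gate_scores)
print(combined, weights)
```

Equal gate scores reduce this to a plain average, the classical ensemble baseline; unequal scores let one model dominate for genes it predicts well, and the weights themselves show which model contributed to which gene.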
bioinformatics 2026-04-08 v1
Spatially Anchored Regulatory State Inference in Melanoma
Dwarampudi, J. M. R.; Kochat, V.; Satpati, S.; Mahmud, M. I.; Anzum, H.; Wani, K.; Lazar, A.; Saw, A. K.; Malke, J.; Nguyen, H. V.; Rai, K.; Banerjee, T.
Abstract
Spatial transcriptomics (ST) captures gene expression within tissue architecture but lacks direct regulatory information, while single-cell multiome assays profile transcriptional and chromatin states without spatial context. We present a framework for spatially anchored regulatory inference that integrates Visium ST with single-cell multiome data to infer spatially resolved regulatory programs. Building upon GraphST, we introduce spatially regularized cell-to-spot mapping and propagate chromatin accessibility and transcription factor motif activity into tissue space. Regulatory analysis is performed at the spatial domain level via joint differential expression and accessibility testing, along with quantitative concordance assessment. Applied to melanoma tissue sections, the framework reveals spatially localized regulatory programs and shows that assignment strategy substantially affects downstream regulatory stability. This modular approach enables interpretable gene-, peak-, and transcription factor-level outputs for multimodal spatial analysis.
bioinformatics 2026-04-08 v1
UBL3 UBL domain exhibits distinct helix-centered dynamic control among ubiquitin-like proteins
Matsuda, K.; Moriya, Y.; Xu, L.; Ohmagari, R.; Aramaki, S.; Zhang, C.; Baba, A.; Hirayama, S.; Kahyo, T.; Setou, M.
Abstract
Ubiquitin-like protein 3 (UBL3) is a post-translational modifier that sorts proteins into small extracellular vesicles and regulates the trafficking of disease-associated proteins such as α-synuclein. The structural and dynamic features of the UBL domain that underlie these functions, however, remain poorly understood. Here we performed in silico structural dynamics analysis of the UBL3 UBL domain using an NMR structure ensemble combined with anisotropic network modeling (ANM) and perturbation response scanning (PRS). Principal component analysis and residue-wise fluctuation analysis consistently revealed high flexibility in the C-terminal region of UBL3. Comparative ANM analysis across 20 ubiquitin-like proteins (UBLs) further showed that C-terminal flexibility is a conserved yet variable property within the UBL family. PRS analysis demonstrated that residues forming the central α-helix of the β-grasp fold exert greater dynamic control over collective motions than β-sheet residues. Notably, UBL3 displayed the highest helix/sheet PRS effectiveness ratio among all UBLs analyzed, highlighting the prominent dynamic contribution of helix residues in this domain. Together, these results provide a structural basis for understanding UBL3-dependent protein interactions and disease-related mechanisms, and suggest that helix-centered dynamic control in the UBL domain may represent a potential target for modulating UBL3 function.
bioinformatics 2026-04-08 v1
Geometry-enhanced protein language modeling enables discovery of novel antibiotic resistance genes
Lin, X.; Guan, J.; Hong, Y.; Guo, Y.; Yang, Y.; Xie, P.; Zhao, Z.; Liu, X.; Huang, Y.; Ye, Y.; Tang, Y.; Lee, T.-Y.; Chiang, Y.-C.; Wei, L.; Liu, X.; Wang, J.; Pan, Y.; Tang, J.; Pei, Y.; Yao, L.
Abstract
The global antibiotic resistome remains largely unexplored, not because antibiotic resistance genes (ARGs) are rare in the environment, but because many are evolutionarily distant from known ARGs. Current computational approaches primarily rely on sequence homology, and thus miss distant homologues. We develop GeoARG, a geometry-enhanced framework that integrates structural features with protein language models through knowledge distillation, enabling efficient large-scale screening using sequence input alone. Across multiple benchmarks, GeoARG substantially improves the detection of remotely homologous ARGs, particularly under low sequence identity and fragmented conditions. Large-scale metagenomic analysis uncovers 1,485 high-confidence ARG candidates that are highly divergent from known ARGs, expanding the phylogenetic and functional landscape of the resistome. Structural analyses further show that these candidates preserve active-site geometry and maintain stable ligand-binding configurations consistent with known resistance mechanisms. These results demonstrate that geometric constraints enable systematic expansion of the resistome and facilitate the discovery of evolutionarily distant yet functionally conserved genes. A public web server is available at https://ycclab.cuhk.edu.cn/GeoARG/
bioinformatics 2026-04-08 v1
Geometry-aware ligand-receptor analysis distinguishes interface association from spatial localization and reveals a continuum of tumor communication
Yepes, S.
Abstract
Spatial transcriptomics enables inference of cell-cell communication through ligand-receptor (LR) interactions, but current prioritization strategies often rely on expression strength or interface-associated enrichment without explicitly modeling tissue geometry. As a result, interactions associated with population interfaces are frequently interpreted as spatially localized even when their underlying expression is broadly distributed. Here, we present a geometry-aware framework for LR prioritization that explicitly separates interface structure from spatial localization within a locked and reproducible analysis pipeline. We quantify interface-associated communication using a distance-weighted boundary score defined on a spatial neighbor graph, evaluate interface specificity using a label-permutation null model that preserves spatial geometry, and compute an LR-specific localization score that captures the proximity of ligand and receptor expression to the corresponding interface. This framework distinguishes interface-associated compatibility from interaction-level spatial concentration. Across spatial transcriptomics datasets from breast cancer, colorectal cancer, melanoma, and pancreatic ductal adenocarcinoma, interface-aware ranking consistently recovers pathway families associated with extracellular matrix, adhesion, inflammatory, and immune-related processes. However, interface enrichment frequently shows limited separation from the null model, indicating that interface structure alone does not establish spatial specificity. Incorporating geometric localization substantially alters LR prioritization, distinguishing interactions that are concentrated near interfaces from those that are more diffusely distributed. Under a fixed, deterministic pipeline applied identically across datasets without parameter tuning, discrete spatial communication regimes were not reproducibly recovered. 
Instead, variation across samples is more consistently captured as continuous differences in geometry-aware attenuation, reflecting the degree to which inferred interactions are spatially constrained by tissue architecture. Together, these results demonstrate that interface-associated enrichment and spatial localization are distinct properties of inferred LR interactions, and that accurate interpretation of spatial communication requires explicit modeling of tissue geometry. Under this framework, tumor communication is more consistently described as a continuum of spatial constraint.
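A label-permutation null of the kind described can be sketched as follows: compute a distance-weighted score over graph edges that cross a population interface, then recompute it after shuffling population labels while keeping the graph and expression values fixed. The exponential distance weight, the symmetric ligand-receptor product, and every name below are assumptions of this example and not the pipeline's actual definitions.

```python
import math
import random


def boundary_score(edges, labels, ligand, receptor, length_scale=1.0):
    """Distance-weighted ligand-receptor signal across edges joining different labels."""
    score = 0.0
    for u, v, dist in edges:
        if labels[u] != labels[v]:
            weight = math.exp(-dist / length_scale)
            score += weight * (ligand[u] * receptor[v] + ligand[v] * receptor[u])
    return score


def permutation_p_value(edges, labels, ligand, receptor, n_perm=999, seed=0):
    """One-sided p-value from shuffling labels; geometry and expression stay fixed."""
    rng = random.Random(seed)
    observed = boundary_score(edges, labels, ligand, receptor)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if boundary_score(edges, shuffled, ligand, receptor) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return observed, (hits + 1) / (n_perm + 1)


# Toy tissue: four spots in a line, unit spacing, one tumor/stroma interface.
edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0)]
labels = ["tumor", "tumor", "stroma", "stroma"]
ligand = [0.0, 2.0, 0.0, 0.0]
receptor = [0.0, 0.0, 3.0, 0.0]
observed, p_value = permutation_p_value(edges, labels, ligand, receptor)
print(observed, p_value)
```

On a graph this small many shuffles reproduce the observed score, so the p-value is large. This mirrors the abstract's point that a high interface score alone does not establish spatial specificity.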
bioinformatics 2026-04-08 v1
Analysis of multicellular anatomical structures from spatial omics data using sosta
Gunz, S.; Crowell, H. L.; Robinson, M. D.
Abstract
Spatial omics technologies enable high-resolution, large-scale quantification of molecular features while preserving the spatial context within tissues. Existing analysis methods largely focus on spatial arrangements of single cells, whereas biological function often emerges from multicellular arrangements. Here, we introduce structure-based analysis of spatial omics data, which focuses on the direct analysis of multicellular, anatomical structures. We illustrate this type of analysis using two publicly available datasets and provide sosta, an open-source Bioconductor package for broad community use.
bioinformatics 2026-04-07 v3
STDrug enables spatially informed personalized drug repurposing from spatial transcriptomics
Yang, Y.; Unjitwattana, T.; Zhou, S.; Kadomoto, S.; Yang, X.; Chen, T.; Karaaslanli, A.; Du, Y.; Zhang, W.; Liang, H.; Guo, X.; Keller, E. T.; Garmire, L. X.
Abstract
Drug repurposing offers a scalable route to accelerate therapeutic discovery, yet existing approaches based on single-cell RNA sequencing (scRNA-seq) often overlook spatial tissue context, limiting their ability to capture microenvironment-dependent drug responses. Here we present STDrug, a spatially informed computational framework that integrates spatial transcriptomics, graph-based modeling, and multimodal learning to enable patient-specific therapeutic prioritization. STDrug identifies and aligns disease and control spatial domains using graph convolutional networks and coherent point drift, and prioritizes candidate drugs through an integrative scoring scheme combining tumor-reversible gene signatures, perturbation-based reversal scores, and knowledge-guided gene weighting within a machine learning framework. By modeling spatial domain interactions alongside predicted drug efficacy and toxicity, STDrug generates robust patient-level drug scores. Across hepatocellular carcinoma and prostate cancer datasets, STDrug outperforms existing single-cell and spatial transcriptomics-based drug repurposing methods, achieving significantly improved predictive accuracy (AUCs = 0.81-0.82) across patients. Validation using large-scale electronic health records and in vitro assays further supports the translational relevance of top-ranked candidates. Taken together, STDrug establishes a generalizable framework for incorporating spatial omics into therapeutic discovery, advancing spatially informed and personalized drug repurposing.
bioinformatics 2026-04-07 v1
Representation Methods of Transcriptomics with Applications in Neuroimmune Biology
Abbasi, M.; Ochoa Zermeno, S.; Spendlove, M. D.; Tashi, Z.; Plaisier, C. L.; Bartelle, B. B.
Abstract
Interpretable representations of gene expression are used to define cellular identities and the molecular programs active within cells, two related, but distinct phenomena. In the case of microglia, a cell type with high transcriptomic, functional, and morphological heterogeneity, the predominant representation of transcriptomic data presumes the adoption of distinct molecular identities, despite a lack of easily separable transcriptional states. Here, we explore alternative transcriptomic representations by comparing two single-cell analysis methods: differential expression analysis for identities and co-expression network analysis for molecular programs. For microglia, co-expression network analysis identifies highly significant functional ontologies not resolved by differential expression analysis. The identified co-expression modules are preserved across transcriptomic datasets and suggest reducible functional programs that activate and modulate depending on context. We conclude that co-expression analysis constitutes a best practice for single cell analysis of an individual cell type and describing microglia function as concurrent molecular programs offers a more parsimonious model of microglia function.
bioinformatics 2026-04-07 v1
Locat: Joint enrichment and depletion testing identifies localized marker genes in single-cell transcriptomics
Lewis, W. R.; Aizenbud, Y.; Strino, F.; Kluger, Y.; Parisi, F.
Abstract
Several methods have been developed to identify marker genes that delineate cell populations in single-cell transcriptomic data, yet most emphasize enrichment within candidate populations without testing whether expression is significantly reduced outside those populations. We present Locat, a framework for identifying highly specific localized genes by testing whether expression is concentrated within compact regions of the cellular embedding and depleted elsewhere. For each gene, Locat fits weighted Gaussian mixture models to gene-specific and background densities, computes test statistics for concentration within compact regions and depletion outside those regions, and integrates the results into a unified localization score. Across synthetic benchmarks with controlled ground truth, Locat detects localized genes spanning uni-modal, multi-modal, and sparse expression patterns, and appropriately loses significance when simulated expression becomes indistinguishable from background structure. In biological datasets spanning developmental, perturbation, and differentiation contexts, Locat identifies compact marker sets that capture lineage organization, condition-specific programs, and temporal regulatory dynamics. Localized gene sets are often smaller than conventional feature selections such as highly variable genes, and embeddings constructed from localized gene sets tend to preserve separation of major cell populations and developmental programs. In murine dermis, embeddings computed using localized genes preserve differentiation and cell-cycle trajectories observed in the full dataset. In interferon-β-treated PBMCs, independent localization analysis of control and stimulated samples reveals stimulus-responsive programs and markers of shared immune populations without requiring batch correction or data integration. In retinoic acid-induced embryonic stem cell differentiation, localized genes exhibit reproducible stage-specific patterns across time points.
Together, these results demonstrate that jointly assessing concentration and depletion yields specific, interpretable marker genes that enable direct cross-condition and multi-sample comparisons across diverse biological settings.
bioinformatics2026-04-07v1A Context-Aware Single-Cell Proteomics Analysis pipeline.
Salomo Coll, C.; Makar, A. N.; Brenes, A. J.; Inns, J.; Trost, M.; Rajan, N.; Wilkinson, S.; von Kriegsheim, A.Abstract
Single-cell proteomics (SCP) by mass spectrometry can now quantify hundreds to thousands of proteins per cell, but the field still lacks standardised analytical pipelines that accommodate the diversity of instruments, sample preparation workflows and biological contexts encountered in practice. Existing workflows, largely adapted from single-cell transcriptomics, do not account for the informative missingness, pervasive ambient protein contamination and limited feature space that distinguish proteomic from transcriptomic data. In addition, cell type annotation remains a manual bottleneck that is subjective, difficult to reproduce and hard to scale. Here we present an end-to-end pipeline that integrates adaptive quality control, entropy-guided iterative batch correction, multi-modal marker discovery that exploits detection patterns unique to proteomics, and context-aware annotation by large language models (LLMs) coupled to structured contradiction reasoning and orthogonal data-driven validation. Benchmarking on published single-cell proteomic datasets from developing human brain and glioblastoma-associated neutrophils revealed systematic LLM failure modes, including context-insensitive marker vocabulary and misinterpretation of phagocytic or lytic cell states. We addressed these errors using a three-round prompt architecture that combines general biological principles with auto-generated dataset-specific constraints. In held-out validation on a skin tumour dataset acquired, the pipeline showed high concordance with FACS-sorted ground truth. In the caerulein-injured pancreas, orthogonal immunohistochemistry further supported annotations of macrophage, stellate and immune populations. 
The pipeline is fully automated under fixed settings, and available as Context-Aware Single-Cell Proteomics Analysis (CASPA), providing SCP laboratories and facilities with a reproducible workflow that delivers interpretable, confidence-quantified annotations suitable for downstream expert review.
bioinformatics2026-04-07v1DrugPlayGround: Benchmarking Large Language Models and Embeddings for Drug Discovery
Liu, T.; Jiang, S.; Zhang, F.; Sun, K.; Head-Gordon, T.; Zhao, H.Abstract
Large language models (LLMs) are in the ascendancy for research in drug discovery, offering unprecedented opportunities to reshape drug research by accelerating hypothesis generation, optimizing candidate prioritization, and enabling more scalable and cost-effective drug discovery pipelines. However, there is currently a lack of objective assessments of LLM performance to ascertain their advantages and limitations over traditional drug discovery platforms. To tackle this emergent problem, we have developed DrugPlayGround, a framework to evaluate and benchmark LLM performance for generating meaningful text-based descriptions of physicochemical drug characteristics, drug synergism, drug-protein interactions, and the physiological response to perturbations introduced by drug molecules. Moreover, DrugPlayGround is designed to work with domain experts to provide detailed explanations for justifying the predictions of LLMs, thereby testing LLMs for chemical and biological reasoning capabilities to push their greater use at the frontier of drug discovery at all of its stages.
bioinformatics2026-04-07v1Correlation Between Information Entropy and Functions of Gene Sequences in the Evolutionary Context: A New Way to Construct Gene Regulatory Networks from Sequence
Pan, L.; Chen, M.; Tanik, M.Abstract
The information encoded in DNA sequences can be rigorously quantified using Shannon entropy and related measures. When placed in an evolutionary context, this quantification offers a principled yet underexplored route to constructing gene regulatory networks (GRNs) directly from sequence data. While most GRN inference methods rely exclusively on gene expression profiles, the regulatory code is ultimately written in the DNA sequence itself. Here we review the mathematical foundations of information theory as applied to gene sequences, survey existing computational methods for GRN inference with emphasis on information-theoretic and sequence-based approaches, and examine how evolutionary conservation constrains sequence entropy to preserve biological function. We then propose a four-layer integrative framework that combines per-position Shannon entropy profiles, evolutionary conservation scoring via Jensen-Shannon divergence, expression-based mutual information and transfer entropy, and DNA foundation model embeddings to construct GRNs from sequence. Through worked examples on the Escherichia coli SOS regulatory sub-network, we demonstrate how conservation-weighted mutual information improves edge discrimination and how transfer entropy resolves regulatory directionality. The framework generates testable predictions: edges supported by low-entropy regulatory regions should show higher experimental validation rates, and cross-species entropy profile conservation should predict GRN topology conservation. This work bridges three scales of biological information (nucleotide-level entropy, evolutionary constraint patterns, and network-level regulatory logic), establishing information entropy as the natural mathematical language for sequence-to-network regulatory inference.
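As an illustration of the first two layers named in the abstract above, the following is a minimal sketch of per-position Shannon entropy and a Jensen-Shannon divergence score against a background distribution. The toy alignment, the uniform background, and all function names are assumptions made for this example; they are not taken from the paper.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def column_distribution(column):
    """Relative base frequencies of one alignment column (gaps and N ignored)."""
    counts = Counter(base for base in column if base in ALPHABET)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("column contains no A/C/G/T symbols")
    return [counts[base] / total for base in ALPHABET]

def shannon_entropy(dist):
    """Shannon entropy in bits; terms with zero probability contribute nothing."""
    return sum(-p * math.log2(p) for p in dist if p > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical toy alignment of a regulatory region (4 sequences, 5 positions).
alignment = ["ACGTA", "ACGAA", "ACCTA", "ACTCA"]
columns = ["".join(seq[i] for seq in alignment) for i in range(len(alignment[0]))]

background = [0.25, 0.25, 0.25, 0.25]  # assumed uniform background
entropies = [shannon_entropy(column_distribution(c)) for c in columns]
conservation = [js_divergence(column_distribution(c), background) for c in columns]

for i, (h, js) in enumerate(zip(entropies, conservation)):
    print(f"position {i}: entropy={h:.3f} bits, JS divergence={js:.3f} bits")
```

Fully conserved columns have zero entropy and the largest divergence from the background, which is the property a conservation-weighted score would exploit.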
bioinformatics2026-04-07v1Accurate estimation of canine inbreeding using ultra low-coverage whole genome sequencing
Pellegrini, M.; Kim, R.; Rubbi, L.; Kislik, G.; Smith, D.Abstract
The measurement of inbreeding has gained significance across diverse fields, including population and conservation genetics, agricultural genetics, breeding programs for animals and plants, and wildlife management. This is because inbreeding leads to increased homozygosity and results in lower genetic diversity, rendering populations more vulnerable to environmental changes, diseases, and other stressors. High or mid-coverage whole genome sequencing (WGS) has been widely used for inbreeding estimation, but it is resource-intensive. We aimed to investigate the use of ultra low-coverage whole genome sequencing (ulcWGS) as a cost-effective alternative for inbreeding analysis. Domestic dogs were used for our study as their extensive breeding histories lead to populations with a wide range of inbreeding levels. We constructed a multi-breed reference panel from high-coverage WGS samples. Inbreeding in independent ulcWGS samples was then estimated using runs of homozygosity (RoH) and inbreeding coefficients (F). We modeled the relationship between these measures and sequencing depth using nonlinear regression, to generate inbreeding estimates relative to sequencing depth. Resulting relative RoH and F measurements were significantly correlated, with purebred dogs exhibiting more runs of homozygosity and higher inbreeding coefficients compared to mixed-breed dogs. Our findings demonstrate that ulcWGS can provide reliable and economical estimations of inbreeding, expanding accessibility to genetic monitoring.
bioinformatics2026-04-07v1MitoChontrol: Adaptive mitochondrial filtering for robust single-cell RNA sequencing quality control
Strassburg, C.; Pitlor, D.; Singhi, A. D.; Gottschalk, R.; Uttam, S.Abstract
Mitochondrial transcript abundance is a standard quality control metric in single-cell RNA sequencing, but fixed percentage thresholds fail to account for the substantial variation in mitochondrial content across cell types and tissues, risking both retention of compromised cells and exclusion of transcriptionally active viable cell populations. We present MitoChontrol, a cell-type-aware probabilistic framework for mitochondrial quality control that models the mitochondrial transcript fraction within transcriptionally coherent clusters as a Gaussian mixture distribution. Compromised-cell components are identified from the upper tail of each cluster-specific distribution, and filtering thresholds are defined as the point at which the posterior probability of cellular compromise exceeds a user-defined confidence value. Applied to controlled perturbation experiments and a pancreatic ductal adenocarcinoma single-cell dataset, MitoChontrol selectively removes transcriptionally compromised cells while preserving biologically elevated but viable populations, outperforming fixed-threshold and outlier-based approaches.
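The thresholding rule described in the abstract above can be sketched for a single cluster as follows. This is a minimal illustration and not the MitoChontrol implementation: it assumes a two-component Gaussian mixture whose parameters have already been fitted, and the parameter values, function names, and example cells are all hypothetical.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posterior_compromised(x, viable, compromised):
    """Posterior probability that mitochondrial fraction x belongs to the
    compromised component. Each component is a (weight, mean, sd) tuple."""
    w_v, m_v, s_v = viable
    w_c, m_c, s_c = compromised
    num = w_c * normal_pdf(x, m_c, s_c)
    den = num + w_v * normal_pdf(x, m_v, s_v)
    return num / den if den > 0 else 0.0

def mito_threshold(viable, compromised, confidence=0.9, tol=1e-6):
    """Smallest x between the two component means at which the posterior of
    compromise reaches `confidence`, found by bisection. The posterior is
    increasing between the means when the compromised mean is the larger one.
    Returns None if the posterior never reaches `confidence` in that interval."""
    lo, hi = viable[1], compromised[1]
    if posterior_compromised(hi, viable, compromised) < confidence:
        return None
    if posterior_compromised(lo, viable, compromised) >= confidence:
        return lo
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if posterior_compromised(mid, viable, compromised) >= confidence:
            hi = mid
        else:
            lo = mid
    return hi

# Hypothetical fitted parameters for one cluster: (weight, mean, sd).
viable = (0.85, 0.05, 0.02)
compromised = (0.15, 0.25, 0.08)

threshold = mito_threshold(viable, compromised, confidence=0.9)
cells = [0.03, 0.06, 0.10, 0.18, 0.30]  # mitochondrial fractions of five cells
kept = [x for x in cells if x < threshold]
print(f"threshold={threshold:.4f}, kept={kept}")
```

Because the threshold depends on the cluster-specific mixture, a cluster with naturally high mitochondrial content would receive a higher cut-off than a fixed percentage would give it.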
bioinformatics2026-04-07v1Estimation of metabolite levels in cheese from microbial gene expression
Mansouri, A.; Mekuli, R.; Swennen, D.; Durazzi, F.; Remondini, D.Abstract
Characterizing aroma and flavours generated during cheese production is of high relevance for the food industry. A deeper comprehension of flavour generation can be achieved by understanding the role of the microbial population governing milk processing, and in particular its metabolic activity as governed by gene expression. In this work we considered two independent experiments in which gene expression of the microbial population involved in cheese processing was sampled, together with quantification of the final volatile products. We estimated the final volatile compound profile from the measured metatranscriptomic expression by using machine learning with two different strategies for model training and validation, and we were able to associate specific biochemical pathways with the identified gene signatures.
bioinformatics2026-04-07v1FunctionaL Assigning Sequence Homing (FLASH) maps phenotype to sequence with deep and machine learning
Cotter, D. J.; Harrison, M.-C.; Rustagi, A.; Wang, P. L.; Kokot, M.; Carey, A. F.; Deorowicz, S.; Salzman, J.Abstract
Genome-wide association studies (GWAS) map genetic variation to a reference genome and correlate variants to phenotypes. Yet, GWAS and similar procedures have limitations, including an inability to predict phenotype on variants never seen during the discovery phase and difficulty integrating structural variants. Deep and machine learning alternatives have not been successful at consistent prediction of resistance phenotypes (Hu et al. 2024). Here, we introduce FLASH: a new interpretable, statistically-based deep learning framework that operates directly on raw sequencing reads. In over 35,000 isolates of bacteria, fungi and viruses, FLASH achieves uniformly high accuracy on independent test data, including on variation never seen in training, meeting or exceeding bespoke state of the art methods. FLASH identifies canonical drug targets ab initio and new pan-species predictors of virulence, including those lacking annotation and those only partially aligned to NCBI reference databases. Further, FLASH can predict phenotypes beyond the possibility of GWAS, such as bacterial host range of phage, a task that to our knowledge is impossible today. FLASH is simple to run, highly efficient and constitutes a new approach for predicting gene function and phenotype across the tree of life. It is especially valuable when bioethical concerns and the vast genetic complexity of pathogenic microbes limit the feasibility of experimental validation.
bioinformatics2026-04-07v1Flow molecular dynamics simulations reveal mechanosensitive regulation of von Willebrand factor through glycan-modulated autoinhibitory modules
Richard Louis, N. E. L.; Zhao, Y. C.; Ju, L. A.Abstract
Force-induced protein conformational changes govern many essential biological processes, yet their molecular mechanisms remain difficult to resolve. Von Willebrand factor (VWF), a central regulator of haemostasis, is activated by hydrodynamic forces in blood flow, but how mechanical signals propagate across its multidomain architecture is poorly understood. Here, we use flow molecular dynamics (FMD), a simulation framework that applies fluid forces via controlled solvent flow to interrogate mechanosensitive proteins. Using VWF as a model system, we reconstructed the complete mechanomodule (DD3A1A2A3; 1,109 residues) with native glycosylation by integrating crystallographic data and AlphaFold predictions. FMD simulations capture a force-driven transition from a compact, autoinhibited bird-nest ensemble to an extended, activated state, revealing asymmetric autoinhibitory strengths within the NAIM and CAIM modules of the A1 domain. By directly linking static structures to dynamic, force-regulated behaviour, this work establishes a generalizable platform for dissecting protein mechanosensitivity and enabling the rational design of force-responsive therapeutics.
bioinformatics2026-04-07v1Integrative AlphaFold Modeling, Fragment Mapping, and Microsecond Molecular Dynamics Reveal Ligand-Specific Structural Plasticity at the Human Urotensin II Receptor
Torbey, A. G.Abstract
The peptide ligands Urotensin II (hUII, human) and hUII-related peptide (URP), and their cognate human receptor (hUT), are known for their implications in cardiovascular pathophysiology, yet the lack of experimentally resolved hUT structures has limited a deep mechanistic understanding of ligand binding and receptor activation. Here, we leverage recent breakthroughs in multistate AlphaFold predictions, long-timescale molecular dynamics (MD) simulations, and site identification by ligand competitive saturation (SILCS)-based pocket mapping, together with solving the ligand-bound conformations, to illuminate the dynamic interaction of hUII and URP with hUT. By analyzing hUT dynamics in its intracellular transducer binding pocket, and residue-level interaction probabilities in each simulation, we capture subtle distinctions in the way hUII and URP anchor key pocket residues and modulate transmembrane (TM) domain tilts. Results indicate that hUII imposes stronger conformational constraints on TM5 and TM6 relative to URP, both potentially stabilizing different active-like receptor configurations. At the same time, interaction maps highlight unique aromatic and polar networks that each ligand exploits. These findings reinforce the concept that relatively small differences in GPCR peptide ligand structure may lead to large effects on receptor-state selection and signal specificity, ultimately reflecting different clinical outcomes. By integrating computational modeling with per-residue dynamics, this work not only reconciles prior mutagenesis and docking data but also provides validated 3D models and MD simulations of the endogenous ligands bound to hUT, offering new opportunities to selectively harness ligand-dependent signaling in the urotensinergic system.
bioinformatics2026-04-07v1REBEL, Reproducible Environment Builder for Explicit Library resolution
Martelli, E.; Ratto, M. L.; Nuvolari, B.; Arigoni, M.; Tao, J.; Micocci, F. M. A.; Alessandri, L.Abstract
Background: Achieving FAIR-compliant computational research in bioinformatics is systematically undermined by two compounding challenges that existing tools leave unresolved: long-term reproducibility and accessibility. Standard package managers re-download dependencies from live repositories at every build, making environments vulnerable to library disappearance and version drift, and pinning a package version does not pin the versions of its transitive dependencies, causing divergences between builds performed at different points in time. Compounding this, packages from repositories such as CRAN, Bioconductor, and PyPI frequently omit critical system-level dependencies from their installation metadata, leaving users to manually discover which underlying library is missing or which version is required. Beyond these technical failures, constructing a truly reproducible environment demands expertise in containerization, making reproducibility in practice a privilege and not a standard. Findings: We present REBEL (Reproducible Environment Builder for Explicit Library Resolution), a framework that addresses both challenges through three dependency inference heuristics: (i) Deep Inspection of source code, (ii) Fuzzy Matching against a manually curated knowledge base, and (iii) Conservative Dependency Locking. The resolved dependency stack is then archived into a self-contained local store, enabling offline and deterministic rebuilds at any future time. We compared the installation of 1,000 randomly sampled CRAN packages in isolated Docker containers versus the standard package manager; REBEL resolved 149 of 328 standard installation failures (45.4%). Moreover, through its DockerBuilder component, REBEL generates fully reproducible Docker images from a plain text requirements file, making deterministic environment construction accessible without expertise in containerization. 
Conclusions: REBEL provides a practical foundation for FAIR-compliant, long-term reproducible bioinformatics analyses, making deterministic environment construction accessible to researchers regardless of their technical background. REBEL is freely available at https://github.com/Rebel-Project-Core Keywords: reproducibility, bioinformatics, dependency resolution, Docker, FAIR, software environments, package management
bioinformatics2026-04-07v1Multistage Machine Learning Reveals Circadian Gene Programs and Supports a Retina-Choroid Axis in Myopia Development
Watcharapalakorn, A.; Poyomtip, T.; Tawonkasiwattanakun, P.; Dewi, P. K. K.; Thomrongsuwannakij, T.; Mahawan, T.Abstract
Purpose: To determine whether circadian timing defines critical molecular windows in myopia development and to assess the transferability of circadian gene programs across ocular tissues, disease stages, and species. Methods: Publicly available retinal and choroidal RNA-seq datasets from chick models of form-deprivation myopia were analyzed using unsupervised transcriptomic profiling and multistage machine-learning classification. Circadian windows were defined based on Zeitgeber time, and samples were grouped accordingly for downstream analyses. Classification model robustness was evaluated through cross-tissue and cross-stage validation and further assessed using external validation in an independent dataset. Functional translation to humans was examined using ortholog-based Gene Ontology enrichment analysis to identify conserved biological processes and higher-order regulatory pathways. Results: A circadian critical window at ZT8-ZT12 exhibited the strongest transcriptional divergence during both myopia onset and progression. Gene signatures derived from this window generalized across retina and choroid and remained predictive across disease stages, supporting coordinated molecular regulation between ocular tissues. External validation confirmed the reproducibility of these signatures despite differences in experimental design and gene coverage. Functional mapping revealed that conserved molecular components in chicks are reorganized into more complex neuroendocrine and regulatory networks in humans, indicating cross-species conservation with increased functional complexity. Conclusions: Circadian timing strongly shapes myopia-related gene expression and underlies coordinated retina-choroid signaling. These findings highlight circadian biology as a key factor in refractive development and suggest that time-dependent mechanisms may influence myopia susceptibility, progression, and response to treatment.
bioinformatics2026-04-06v1MolClaw: An Autonomous Agent with Hierarchical Skills for Drug Molecule Evaluation, Screening, and Optimization
Zhang, L.; Wang, L.; Sun, X.; Tang, W.; Su, H.; Qian, Y.; Yang, Q.; Li, Q.; Tang, Z.; Sun, H.; Han, Y.; Jiang, Y.; Lou, W.; Zhou, B.; Wang, X.; Bai, L.; Xie, Z.Abstract
Computational drug discovery, particularly the complex workflows of drug molecule screening and optimization, requires orchestrating dozens of specialized tools in multi-step workflows, yet current AI agents struggle to maintain robust performance and consistently underperform in these high-complexity scenarios. Here we present MolClaw, an autonomous agent that leads drug molecule evaluation, screening, and optimization. It unifies over 30 specialized domain resources through a three-tier hierarchical skill architecture (70 skills in total) that facilitates agent long-term interaction at runtime: tool-level skills standardize atomic operations, workflow-level skills compose them into validated pipelines with quality check and reflection, and a discipline-level skill supplies scientific principles governing planning and verification across all scenarios in the field. Additionally, we introduce MolBench, a benchmark comprising molecular screening, optimization, and end-to-end discovery challenges spanning 8 to 50+ sequential tool calls. MolClaw achieves state-of-the-art performance across all metrics, and ablation studies confirm that gains concentrate on tasks that demand structured workflows while vanishing on those solvable with ad hoc scripting, establishing workflow orchestration competence as the primary capability bottleneck for AI-driven drug discovery.
bioinformatics2026-04-06v1From Parametric Guessing to Graph-Grounded Answers: Building Reliable ChatGPT-like tools for Plant Science
Itharajula, M.; Lim, S. C.; Mutwil, M.Abstract
Large language models (LLMs) are increasingly used by plant biologists to summarize literature, generate hypotheses, and interpret experimental results. However, LLMs are unreliable sources of exhaustive, source-attributed facts, a critical limitation for the list-style queries that pervade plant biology (e.g., "list all transcription factors regulating secondary cell wall (SCW) biosynthesis in Arabidopsis"). Here, we query ChatGPT, Claude, and Gemini with such queries and demonstrate that none return complete gene lists with reliable citations. We trace these failures to how LLMs store knowledge: as statistical patterns distributed across billions of internal parameters, with no mechanism to guarantee completeness, provenance, or reproducibility. We also review fine-tuning mitigation strategies, including multi-task instruction tuning, parameter-efficient methods, and context engineering, that alleviate but do not resolve these limitations. We then discuss retrieval-augmented generation (RAG), which feeds relevant documents to the LLM at query time, and argue that while it improves source attribution, it remains impractical when answers require synthesizing information scattered across hundreds of papers. As an alternative, we advocate graph retrieval-augmented generation (GraphRAG), in which the LLM serves as a reasoning and language interface over a structured, provenance-linked knowledge graph (KG) that returns complete result sets reproducibly. We outline a practical GraphRAG architecture and survey existing plant KG resources. Finally, we discuss open challenges, including entity disambiguation, relation normalization and evidence grading, and propose a roadmap for building open, continuously updated plant KGs that can turn "read 1,000 papers" into a single reproducible query.
bioinformatics2026-04-06v1Statistical signals indicate a dependence between amino acid backbone conformation and the translated synonymous codon
Rosenberg, A.; Marx, A.; Bronstein, A. M.Abstract
Synonymous codons encode the same amino acid but can differ in their usage and translational properties. In previous work we reported statistical differences in backbone dihedral angle distributions associated with synonymous codons in the Escherichia coli proteome. This finding has been questioned due to concerns regarding the statistical methodology used. Here we revisit the dataset using corrected statistical procedures and alternative statistical tests. Across multiple frameworks, the real dataset consistently shows an excess of small p-values relative to randomized controls, indicating detectable codon-associated differences in backbone conformation.
bioinformatics2026-04-06v1EV-Net: A computational framework to model extracellular vesicles-mediated communication
Torrejon, E.; Sleegers, J.; Matthiesen, R.; Macedo, M. P.; Baudot, A.; Machado de Oliveira, R.Abstract
Summary: Extracellular vesicles (EVs) are bilayer vesicles that carry a diverse cargo of molecules, such as nucleic acids, proteins and metabolites. These EVs can be transported throughout the organism to specific recipient tissues. For this reason, EVs have been recognized as pivotal mediators of cell-to-cell communication (CCC). Importantly, alterations in EV-mediated communication have been linked to pathological processes, further highlighting their biological relevance. However, the in silico exploration of the functional effects of EV cargo in recipient tissues remains limited due to the lack of dedicated tools that can be applied to EV omics datasets. Most current bioinformatics tools for assessing CCC rely on ligand-mediated communication and therefore cannot be used to explore EV-mediated communication. To address this gap, we developed EV-Net, a bioinformatics tool designed to explore the effects of EV cargo on recipient tissues. EV-Net was built by adapting NicheNet, a CCC bioinformatics tool that relies on ligand-receptor mediated communication, for the analysis of EV proteomics and RNA-seq data. The EV-Net framework enables the identification and prioritization of EV cargo molecules with high regulatory potential in a recipient tissue of interest. This prioritization facilitates the systematic translation of EV cargo profiles into testable biological hypotheses. Availability and documentation: The source code of EV-Net is available on GitHub at https://github.com/torrejoNia/EV-Net, alongside installation instructions. Comprehensive tutorials and additional documentation are available at https://torrejonia.github.io/EV-Net/. The datasets used in the use cases were deposited in Zenodo. The corresponding Zenodo links are provided in the tutorials for each use case. This software is distributed under a GPL3 licence.
bioinformatics2026-04-06v1NovoTax: prokaryotic strain identification from mass spectrometry-based proteomics data
Svedberg, D.; Mateus, A.Abstract
Traditional mass spectrometry-based proteomics typically requires prior knowledge of sample composition to match spectra to peptides. Yet, novel de novo peptide sequencing approaches can provide peptide sequences to identify the organism. Here, we introduce an end-to-end pipeline (NovoTax) to identify the closest prokaryotic genome directly from raw bottom-up proteomics data. The approach combines peptide sequencing tools with an optimized implementation of peptide searching through an extensive genome database. On a benchmark dataset of species isolates, we identified the reported species and strain in the majority of the cases, and showed that in discordant cases NovoTax was likely correct. Interestingly, NovoTax was also able to identify contaminating species in some samples. The algorithm also identified the most abundant species in bacterial communities. In summary, NovoTax provides strain level identification of microbial samples enabling the downstream use of traditional proteomics search engines for a deeper proteome analysis.
bioinformatics2026-04-06v1Multimodal Fusion of Circular Functional Data on High-resolution Neuroretinal Phenotypes
Pyne, S.; Wainwright, B.; Ali, M. H.; Lee, H.; Ray, M. S.; Senthil, S.; Jammalamadaka, S. R.Abstract
Progressive optic neuropathies, particularly glaucoma, represent a significant global health challenge, and the need for precise understanding of the heterogeneous neurodegenerative phenotypes cannot be overstated. Here, we brought together two complementary sources of unstructured yet clinically-relevant information about neuroretinal rim (NRR) thinning, a common clinical marker of such decay. These are based on a new dataset of fundus digital images and a corresponding one of optical coherence tomography, both collected from a large clinical cohort of healthy eyes. First, we represented them using a common data structure that imposed a high-resolution scale of 180 equally-spaced and registered measurements on a 360° circular axis. We modeled such NRR data-points of each eye as circular curves, and aligned these multimodal curves to obtain a fused NRR curve for each eye. Unsupervised clustering of these fused curves identified 4 clusters of eyes with structural heterogeneity, which were also found to have distinctive clinical covariates. The computation of functional derivatives revealed the troughs in the curves of each cluster. Using circular statistics, we estimated the directional distributions of such troughs as potentially clinically-relevant regions of NRR decay. We also demonstrated that multimodal fusion leads to improvement in the robustness of baseline NRR data obtained from fundus imaging.
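To make the circular representation in the abstract above concrete, here is a small sketch that locates troughs on a curve of 180 equally spaced samples (2° apart) and summarizes trough directions with a circular mean. It is a simplification under stated assumptions: troughs are found by comparing each sample with its circular neighbours rather than by functional derivatives, and the synthetic curve and function names are invented for illustration.

```python
import math

def trough_angles(curve):
    """Angles (degrees) of strict local minima on a closed circular curve.
    Index -1 and the modulo wrap treat the first and last samples as neighbours."""
    n = len(curve)
    step = 360.0 / n
    return [
        i * step
        for i in range(n)
        if curve[i] < curve[i - 1] and curve[i] < curve[(i + 1) % n]
    ]

def circular_mean(angles_deg):
    """Mean direction (degrees in [0, 360)) and mean resultant length in [0, 1]."""
    s = sum(math.sin(math.radians(a)) for a in angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg)
    mean = math.degrees(math.atan2(s, c)) % 360.0
    resultant = math.hypot(s, c) / len(angles_deg)
    return mean, resultant

# Synthetic rim-thickness curve: 180 samples, thinnest at 270 degrees.
curve = [1.0 + 0.3 * math.cos(math.radians(2 * i - 90)) for i in range(180)]
troughs = trough_angles(curve)
print("troughs at:", troughs)

# Directions on either side of 0 degrees average to about 0 (equivalently 360),
# not to 180, which is what a naive arithmetic mean would give.
mean_dir, resultant = circular_mean([350.0, 10.0])
print(f"mean direction={mean_dir:.2f}, resultant length={resultant:.4f}")
```

The resultant length is close to 1 when trough directions are tightly concentrated and close to 0 when they are spread around the circle.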
bioinformatics2026-04-06v1Domain classification of archaeal proteomes reveals conserved fold repertoire
Schaeffer, R. D.; Pei, J.; Guo, R.; Zhang, J.; Medvedev, K.; Cong, Q.; Grishin, N.Abstract
Archaea represent one of the three domains of cellular life and yet account for fewer than 1% of experimentally determined protein structures, leaving the extent of their structural novelty unknown. Here we present a systematic domain-level classification of 124,075 proteins from 65 archaeal classes spanning 21 phyla and all major lineages, using both AFDB and newly predicted AlphaFold3 structures classified against the Evolutionary Classification of protein Domains (ECOD). We assigned 204,758 domains, of which 76.8% received high-confidence classifications, spanning 987 ECOD X-groups (40% of known structural diversity within a single domain of life). Clustering by Foldseek recovered structural relationships for 63% of domains that are singletons by sequence comparison. To characterize the 21% of proteins lacking high-confidence classification, we applied successive filters for structure prediction confidence, protein length, and structural cluster context, reducing 8,452 domain-free proteins to a small number of well-folded structural orphans (less than 0.1% of the dataset). The unclassified fraction is dominated by sub-threshold matches to known folds (14% of all proteins) and low-confidence structure predictions (5%), not by novel structures. These results demonstrate that the protein fold repertoire at the single-domain level is broadly conserved across the deepest phylogenetic distances in cellular life, and that the gap between archaeal and well-characterized proteomes reflects classification sensitivity for divergent sequences rather than unexplored structural diversity.
bioinformatics2026-04-06v1BABAPPASnake: a workflow for episodic selection analysis with robustness-aware summaries
Singha, S.; Panda, P.; Panda, A.; Das, S. K.; Das, A.; Ghosh, N.; Sinha, K.Abstract
Episodic selection analyses are often assembled from fragmented toolchains in which ortholog discovery, codon alignment, phylogeny, exploratory scans, branch-site testing, and reporting are handled separately, making reproducibility and sensitivity tracking difficult. We introduce BABAPPASnake as an integrated workflow contribution for orthogroup-centered episodic selection analysis. The workflow combines orthogroup construction logic, CDS quality-aware mapping, multi-engine alignment pathways, phylogenetic inference, exploratory nomination, and branch-site follow-up testing in one reproducible execution framework. It also supports optional HyPhy GARD recombination screening as a conservative preprocessing report layer without forcing fragment-level rerouting by default. It generates pathway-level and cross-pathway robustness outputs, including matrix, consensus, narrative, and provenance summaries, to support sensitivity-aware interpretation. A four-gene mosquito melanization-associated module is analyzed as a real-data empirical demonstration of end-to-end workflow behavior. In this demonstration, branch/site signals show both recurrent and method-sensitive components across six method-trim pathways, with a directional core-tier tendency in several summaries. These case-study patterns are interpreted as workflow-based empirical evidence and hypothesis-generating asymmetry, not decisive pathway-level confirmation. Overall, BABAPPASnake provides a practical and reproducible framework for episodic selection studies where analytical sensitivity must be explicitly reported.
bioinformatics 2026-04-05 v7
RAMBO: Resolving Amplicons in Mixed Samples for Accurate DNA Barcoding with Oxford Nanopore
Kolter, A.; Hebert, P. D. N.
Abstract
DNA barcoding, the use of short genetic markers to identify and differentiate species, is a foundational tool for ecological and taxonomic research. The method has scaled rapidly, with next-generation sequencing technologies enabling the processing of thousands of specimens in parallel. Nanopore sequencing not only offers a flexible, low-cost alternative to other platforms but also produces full-length reads in real time and can be used in remote settings. However, its comparatively high error rate complicates downstream processing, particularly when PCR amplifies multiple templates from a single specimen, reflecting pseudogenes, paralogs, or contaminants. We present a novel pipeline for DNA barcoding that resolves mixed sequence signals from Nanopore reads using unsupervised clustering and staged consensus generation, without relying on curated reference databases, taxonomic priors, or error models. While existing methods to curate Nanopore sequence data assume a single dominant amplicon per sample or require deep sequence divergence among amplicons, our pipeline can distinguish variants differing by as little as 0.15%. It combines column-weighted encodings, UMAP projection, and HDBSCAN clustering, followed by conservative consensus refinement. The pipeline was benchmarked and validated using datasets with known composition, including high-fidelity PacBio sequences. The results show that Nanopore barcoding, when paired with appropriate analysis, can recover biologically meaningful variation even in technically complex samples. The pipeline is particularly suited for specimens where divergent templates are co-amplified, including mitochondrial pseudogenes or multicopy nuclear regions like ITS. As such, it provides a generalizable framework for high-resolution Nanopore analysis of complex amplicon mixtures.
bioinformatics 2026-04-05 v2
Unravelling genome-wide mosaic microsatellite mutations at single-cell resolution
Wang, C.; Fan, W.; Wang, W.; Xia, Y.; Lu, J.; Ma, X.; Yu, J.; Zheng, Y.; Luo, Y.; Li, W.; Yang, Q.; Lin, M.; Liu, H.; Lan, Y.; Li, C.; Liu, X.; HE, D.; Cai, S.; Yu, X.; Zhou, D.; Kellis, M.; Xiong, X.; Xie, Q.; Dou, Y.
Abstract
Short tandem repeats (STRs), or microsatellites, are highly mutable genomic elements that modulate gene regulation and are implicated in a range of human diseases. However, detecting mosaic STR mutations at single-cell resolution remains challenging due to both technical and biological complexities. To address this, we developed BayesMonSTR, a robust algorithm that enables accurate detection of mosaic STR mutations. Using this tool in single-cell analysis of human tissues, we reveal an accumulation of longer mosaic STR insertions and deletions (indels) in aging mitotic and post-mitotic cells. Strikingly, prefrontal cortex (PFC) neurons accumulate a higher burden of STR mutations than B cells or lung epithelium, with aged neurons exhibiting a particularly pronounced increase in longer STR deletions. These mutations are enriched at transcription start sites (TSSs) and active enhancers of highly expressed genes. Our work establishes a foundation for genome-wide, hypothesis-free discovery of disease-associated mosaic STR mutations and reveals a previously unexplored landscape of mosaic STR variation in development and aging.
bioinformatics 2026-04-05 v2
Benchmarking long-read RNA-seq across modalities, methods, and sequencing depth in iNeurons
Schubert, R.
Abstract
Long-read RNA sequencing (lrRNA-seq) provides advantages for transcript discovery and quantification through the sequencing of full-length transcripts. Although recent benchmarks have evaluated long-read technologies and quantification tools, to the best of our knowledge, no study to date has jointly compared sequencing technology, quantification choice, and depth across both bulk and single-cell platforms. Here, we generate a matched dataset using NGN2-induced neurons derived from Fragile X syndrome and isogenic rescue lines, profiled with bulk and single-cell Illumina, Oxford Nanopore Technologies (ONT), and Pacific Biosciences (PB) Kinnex technologies. All platforms and technologies capture the expected FMR1 reactivation signal. We find that PB bulk under-detects and under-quantifies short transcripts (less than 1.25 kb), ONT bulk under-detects and under-quantifies long transcripts (greater than 5 kb), and single-cell long-read technologies detect a large number of single-cell-specific transcripts associated with truncations. Across six bulk and four single-cell long-read quantification tools, Isosceles, Miniquant, and Oarfish provide the best compromise between computational efficiency and performance, with quantification accuracy measured by spike-ins, comparisons to Illumina, and consensus-based downstream tasks such as differential transcript expression (DTE). Depth-equivalency analyses reveal that PB single-cell sequencing requires approximately three- to four-fold greater depth than bulk to reach comparable power for transcript discovery and differential transcript expression. Our work aims to offer practical guidance for study design, including the choice of technology, sequencing depth, and quantification method. In addition, we hope our data may serve as a reference dataset to evaluate emerging long-read transcriptomic protocols and methods, as well as to investigate FMR1 biology more closely.
bioinformatics 2026-04-04 v1
Correlate: A Web Application for Analyzing Gene Sets and Exploring Gene Dependencies Using CRISPR Screen Data
Deolankar, S.; Wermeling, F.
Abstract
CRISPR screen data provides a valuable resource for understanding gene function and identifying potential drug targets. Here, we present Correlate, a freely accessible web application (https://correlate.cmm.se) that enables exploration of the Cancer Dependency Map (DepMap) CRISPR screen gene effects, hotspot mutations, and translocation/fusion data across more than 1,000 human cancer cell lines. The application supports two main use cases: (i) analysis of user-defined gene sets (e.g. CRISPR screen hits) to identify functionally linked genes based on correlations while providing an overview based on essentiality or user-provided screen statistics; and (ii) exploration of genes of interest in defined biological contexts, such as specific cancer types or mutational backgrounds, to generate hypotheses about gene function and dependencies. Additionally, Correlate supports experimental design by providing rapid overviews of gene essentiality and enabling the identification of cell lines with relevant mutational profiles. In contrast to knowledge-based approaches such as STRING and GSEA, which rely on prior biological annotations and curated interaction networks, Correlate identifies gene connections directly from functional CRISPR screen readouts, offering a complementary and data-driven perspective on gene network analysis. The application runs entirely in the browser, requires no installation or login, and integrates with the Green Listed v2.0 tool family for custom CRISPR screen design.
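As a rough illustration of what identifying functionally linked genes from correlations can mean, the sketch below ranks genes by how closely their gene effect scores track a query gene across cell lines. It is not Correlate's implementation: the abstract does not name the statistic, so Pearson correlation is an assumption, and the gene names and scores are invented for the example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists; 0.0 if either is constant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_partners(query, effects):
    """Rank all other genes by correlation with the query gene, highest first."""
    query_scores = effects[query]
    scored = [
        (gene, pearson(query_scores, scores))
        for gene, scores in effects.items()
        if gene != query
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical gene effect scores across five cell lines
# (more negative = stronger dependency). Not real DepMap values.
gene_effects = {
    "GENE_A": [-1.0, -0.8, -0.1, 0.0, -0.9],
    "GENE_B": [-0.9, -0.7, -0.2, 0.1, -1.0],  # tracks GENE_A closely
    "GENE_C": [0.0, -0.1, -0.9, -1.0, 0.1],   # opposite pattern
    "GENE_D": [-0.5, -0.5, -0.5, -0.5, -0.5], # constant, uninformative
}

ranked = rank_partners("GENE_A", gene_effects)
for gene, r in ranked:
    print(f"{gene}\t{r:+.2f}")
```

Genes at the top of such a ranking share a dependency profile with the query gene; strongly negative correlations at the bottom can be informative too, so a real analysis would likely inspect both ends.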
bioinformatics 2026-04-04 v1
muat: portable transformer-based method for tumour classification and representation learning from somatic variants
Sanjaya, P.; Pitkänen, E.
Abstract
Motivation: Deep neural networks have proven effective in classifying tumour types using next-generation sequencing data. However, developing transferable models that work across heterogeneous operating environments remains challenging due to differences in cohort compositions and data generation protocols, privacy concerns, and limited computational capabilities. Results: We introduce muat, a transformer-based software tool for tumour classification using somatic variant data from whole-genome (WGS) and whole-exome sequencing (WES). Building on the previously developed MuAt and MuAt2 models, we distribute the software via Docker containers and Bioconda for deployment in high-performance computing (HPC) systems and Secure Processing Environments (SPEs). Using a downloadable MuAt checkpoint, we reproduce the performance reported in the original study on whole-genome (PCAWG; 89% accuracy in histological tumour typing) and exome sequencing data (TCGA; 64% accuracy). Cross-cohort evaluation in the Genomics England SPE achieved 81% accuracy without retraining and 89% following fine-tuning. As a demonstration of the software's adaptability, we also deployed muat within the iCAN Digital Precision Cancer Medicine Flagship's SPE and integrated it into a Nextflow-managed workflow. Availability and implementation: muat is available through conda (www.anaconda.org/bioconda/muat) and GitHub (https://github.com/primasanjaya/muat), under the Apache 2.0 License. Contact: prima.sanjaya@helsinki.fi, esa.pitkanen@helsinki.fi; website: mlbiomed.net
bioinformatics 2026-04-03 v1
Conserved water molecules as structural ligands modulating pathogenic variation in human protein binding sites
Konc, J.; Recer, K.; Kunej, T.; Janezic, D.
Abstract
Conserved water molecules (CWMs) are tightly bound solvent molecules that occupy well-defined and recurrent positions in protein structures. Although they are known to influence protein stability, function, and ligand binding, their contribution to human genetic disease has remained largely unexplored. Here, we demonstrate that CWMs substantially contribute to the pathogenicity of single nucleotide polymorphisms (SNPs). By systematically mapping SNPs onto ligand-binding and conserved water sites across human protein structures in the Protein Data Bank, we find that pathogenic variants are strongly enriched at CWM positions. Enrichment is particularly pronounced at CWM sites within ligand-binding regions, exceeding that observed for ligand-binding sites as a whole. To establish a mechanistic link, we performed molecular dynamics simulations on human lysosomal acid glucosylceramidase (GCase), encoded by GBA1 and associated with Gaucher disease and Parkinson's disease risk. Removal of a single conserved water molecule in the wild-type protein recapitulates key structural features of the pathogenic L444P variant, whereas stabilization of this water in the mutant restores native-like behavior. These findings demonstrate that disruption of a conserved water molecule can induce long-range structural changes consistent with disease-associated mutations. Together, our results identify conserved water molecules as functional structural elements whose disruption represents a recurrent mechanism of protein dysfunction and provide direct mechanistic evidence for their pathogenic role in Gaucher disease.
bioinformatics 2026-04-03 v1
LigandForge: A Web Server for Structure-Guided De Novo Drug Design
Nada, H.; Sipos-Szabo, L.; Bajusz, D.; Keseru, G.; Gabr, M.
Abstract
Despite advances in computational drug discovery, de novo drug design remains hindered by high licensing costs and the need for specialized programming expertise. We present LigandForge, a web server for structure-guided de novo ligand generation. LigandForge integrates structural validation and binding-site characterization; voxel-based property grid construction for spatial mapping of electrostatics and hydrophobicity; chemistry-aware fragment assembly; multi-objective lead optimization; and retrosynthetic feasibility analysis. The platform utilizes a structure-guided framework to assemble molecules from curated fragment libraries while enforcing physicochemical constraints, including molecular weight, LogP, and hybridization states. Generated molecules are refined via reinforcement learning and genetic algorithms, then evaluated using composite metrics such as the quantitative estimate of drug-likeness. By leveraging RDKit for cheminformatics and NGL viewer for real-time 3D visualization, LigandForge provides a synthesis-aware environment that bridges the gap between macromolecular structural data and experimentally feasible lead compounds without requiring local software installation.
bioinformatics 2026-04-03 v1
Anonymized Somatic Tumor Twins (STTs) enable open genome data sharing and use in research and clinical oncology
Gaitan, N.; Martin, R.; Tello, D.; Benetti, E.; Riba, M.; Licata, L.; Arbones, M.; Royo, R.; Olmos, D.; Morelli, M. J.; Tonon, G.; Castro, E.; Torrents, D.
Abstract
The study of somatic variants from tumor genomes is fundamental to cancer research and clinical decision-making. However, existing data protection frameworks impose restrictions on the use and sharing of these variants in conjunction with sensitive germline information. To overcome these challenges, we developed GenomeAnonymizer, the first method to anonymize short-read DNA sequences from tumor-normal pairs. This generates Somatic Tumor Twins (STTs), an anonymized version of the original data that preserves the donor's privacy while retaining somatic tumor information and sequencing noise. This method successfully removed all detectable germline variants from the 47 PCAWG-Pilot samples. We further demonstrate that Whole-Genome Sequencing (WGS) STTs preserve more than 98% of the original somatic variants, enabling reliable downstream analysis that replicates somatic-related findings from the original samples, including cancer driver genes, mutational signatures, and intratumor heterogeneity. Importantly, we also show that STTs can reproduce the identification of actionable genes and downstream clinical interpretations and decision-making. We generated a cancer cohort of STTs matched with synthetic clinical data that could be openly shared and used across projects and centers worldwide. This paradigm-shifting approach will accelerate discovery and clinical translation in oncology and enable the robust benchmarking of genome analysis and large-scale data infrastructures.
bioinformatics 2026-04-03 v1
Importance of taking Single Amino Acid Variant and accessory proteome variability into account in Data Independent Acquisition Proteomics: illustrated with Legionella pneumophila analysis
Dupas, A.; Ibranosyan, M.; Ginevra, C.; Jarraud, S.; Lemoine, J.
Abstract
Understanding allelic variability is crucial for elucidating intrinsic bacterial mechanisms and distinguishing phenotypic profiles. However, such variability poses a major challenge for the reliable identification of proteins in data-independent acquisition (DIA) proteomics. To address this, we developed an analytical workflow that integrates protein sequence variability to enhance proteome coverage. Fifteen Legionella pneumophila isolates were analyzed using DIA-NN, with spectral libraries generated either from a reference proteome or incorporating allelic variability. Our workflow includes protein clustering and subsequent protein inference from these clusters, allowing the accurate assignment of shared and variant-specific peptides. Integration of variability identified a number of proteins comparable to that obtained with the reference proteome while capturing between 28% and 77% of variant-specific sequences in each isolate, all while maintaining a low false positive rate. These findings demonstrate that accounting for allelic variability substantially improves proteomic coverage and identification confidence, providing a more comprehensive view of the proteome. This approach facilitates a deeper understanding of biological mechanisms and enables precise bacterial proteotyping of Legionella pneumophila isolates.
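To make the distinction between shared and variant-specific peptides concrete, the sketch below digests a reference sequence and a single-amino-acid variant in silico and compares the resulting peptide sets. It is only an illustration under stated assumptions: the two sequences are invented, the cleavage rule is the standard trypsin rule (cut after K or R unless the next residue is P), and the abstract does not say how the authors' workflow performs digestion or clustering.

```python
def tryptic_peptides(sequence):
    """Split a protein sequence after K or R, except when followed by P."""
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # Trailing peptide that does not end in K or R.
        peptides.append(sequence[start:])
    return peptides


# Hypothetical sequences: the variant carries one substitution (A -> V).
reference = "MKTAYIAKQRPLSEVK"
variant = "MKTAYIVKQRPLSEVK"

ref_peptides = set(tryptic_peptides(reference))
var_peptides = set(tryptic_peptides(variant))

shared = ref_peptides & var_peptides            # cannot tell the alleles apart
variant_specific = var_peptides - ref_peptides  # only detectable if the library includes the variant
reference_specific = ref_peptides - var_peptides

print("shared:", sorted(shared))
print("variant-specific:", sorted(variant_specific))
print("reference-specific:", sorted(reference_specific))
```

A spectral library built from the reference alone would contain no entry for the variant-specific peptide, which is one way to see why incorporating allelic variability can raise coverage.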
bioinformatics 2026-04-03 v1
Proteome analyses reveal Endoplasmic Reticulum stress-induced changes in protein abundance associated with Ube2j2 deficiency in human cell culture
Dahlberg, C. L.; Zinkgraf, M.; Laugesen, S. H.; Soltoft, C. L.; Ginebra, Q.; Bennett, E. P.; Hartmann-Petersen, R.; Ellgaard, L.
Abstract
The unfolded protein response (UPR) helps reinstate cellular proteostasis upon an accumulation of misfolded proteins in the endoplasmic reticulum (ER), in part through ER-associated degradation (ERAD). Ube2j2 is an ER-localized E2 ubiquitin-conjugating enzyme that participates in ERAD. We used mass spectrometry analysis of cultured U2OS cells to investigate how the loss of Ube2j2 affects the cellular proteome in response to tunicamycin-induced ER stress. We constructed a network of twelve statistically distinct modules of protein abundance profiles across conditions. We describe the Gene Ontology annotations for each module along with the hub gene proteins whose abundance levels most closely adhere to each module's protein abundance profile. Our analysis identifies known Ube2j2-associated pathways (e.g., the UPR and ERAD) and cellular functions that were previously unassociated with Ube2j2 (e.g., RNA metabolism, ER-Golgi transport, and cell-cycle progression). These data are available via ProteomeXchange with identifier PXD076153 and provide avenues for further investigation into the cellular functions of Ube2j2 under basal and ER-stressed conditions.
bioinformatics 2026-04-03 v1
PANDA: Read-Level Phased Analysis of DNA Amplicons for Methylation Studies
Kubota, A.; Kobayashi, H.; Tajima, A.
Abstract
DNA methylation analysis using bisulfite sequencing is widely used to investigate epigenetic regulation at single-base resolution; however, conventional analysis workflows primarily rely on site-wise averaging, which obscures contiguous methylation patterns encoded within individual DNA molecules and limits interpretation of epiallelic heterogeneity in targeted amplicon studies. Here, we present PANDA (Phased ANalysis of DNA Amplicons), an end-to-end graphical pipeline that restores contiguous single-molecule methylation patterns by linking unmerged paired-end reads to reconstruct epiallelic patterns across unsequenced regions. PANDA supports both Sanger and next-generation sequencing inputs, providing a unified workflow for alignment, read-level methylation calling, phased visualization, and quantification of within-sample methylation heterogeneity. Using synthetic benchmarking datasets, we demonstrated that in silico motif filtering isolates specific target reads, enabling the accurate detection of allele-specific methylation and loss of imprinting. Furthermore, the re-analysis of primate placentae datasets confirmed that long-range phasing across unsequenced regions successfully restored the original epiallelic architectures. PANDA establishes a robust, practical approach to single-molecule epigenomic profiling using targeted bisulfite amplicon sequencing.
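The contrast between site-wise averaging and read-level patterns can be shown with a toy example. The sketch below is not PANDA's code; it assumes forward-strand reads that are already aligned to the amplicon without indels, and the reference and reads are invented. After bisulfite conversion, a cytosine that still reads as C is called methylated and one that reads as T is called unmethylated.

```python
from collections import Counter


def cpg_positions(reference):
    """0-based positions of the C in every CpG dinucleotide."""
    return [i for i in range(len(reference) - 1) if reference[i:i + 2] == "CG"]


def read_pattern(read, sites):
    """Per-read methylation string: M = C retained, U = converted to T, ? = other."""
    calls = []
    for i in sites:
        if read[i] == "C":
            calls.append("M")
        elif read[i] == "T":
            calls.append("U")
        else:
            calls.append("?")
    return "".join(calls)


def site_means(patterns, n_sites):
    """Fraction of methylated calls at each site, ignoring uncalled bases."""
    means = []
    for s in range(n_sites):
        calls = [p[s] for p in patterns if p[s] in "MU"]
        means.append(calls.count("M") / len(calls) if calls else float("nan"))
    return means


# Hypothetical amplicon with three CpG sites and six aligned reads:
# three fully methylated molecules and three fully unmethylated ones.
reference = "ATCGGACGTTCGA"
reads = ["ATCGGACGTTCGA"] * 3 + ["ATTGGATGTTTGA"] * 3

sites = cpg_positions(reference)
per_read = [read_pattern(r, sites) for r in reads]
patterns = Counter(per_read)
averages = site_means(per_read, len(sites))

print("site-wise averages:", averages)
print("read-level patterns:", dict(patterns))
```

Here every site averages 50% methylation, which is also what a population of randomly half-methylated molecules would give; only the per-read patterns reveal that the sample consists of two distinct epialleles.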
bioinformatics 2026-04-03 v1