Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
The FAIRSCAPE AI-readiness Framework for Biomedical Research
Al Manir, S.; Levinson, M. A.; Niestroy, J.; Churas, C.; Sheffield, N. C.; Sullivan, B.; Fairchild, K.; Torres, M. M.; Ratcliffe, S. J.; Parker, J. A.; Ideker, T.; Clark, T.
Abstract
Objective: Biomedical datasets intended for use in AI applications require packaging with rich pre-model metadata to support model development that is explainable, ethical, epistemically grounded and FAIR (Findable, Accessible, Interoperable, Reusable). Methods: We developed FAIRSCAPE, a digital commons environment, using agile methods, in close alignment with the team developing the AI-readiness criteria and with Bridge2AI data production teams. Work was initially based on an existing provenance-aware framework for clinical machine learning. We incrementally added RO-Crate data+metadata packaging and exchange methods, client-side packaging support, provenance visualization, and supporting metadata mapped to the AI-readiness criteria, with automated AI-readiness evaluation. LinkML semantic enrichment and Croissant ML-ecosystem translations were also incorporated. Results: The FAIRSCAPE framework generates, packages, evaluates, and manages critical pre-model AI-readiness and explainability information with descriptive metadata and deep provenance graphs for biomedical datasets. It provides ethical, schema, statistical, and semantic characterization of dataset releases, licensing and availability information, and an automated AI-readiness evaluation across all 28 AI-readiness criteria. We applied this framework to successive, large-scale releases of multimodal datasets, progressively increasing dataset AI-readiness to full compliance. Conclusion: FAIRSCAPE enables AI-readiness in biomedical datasets using standard metadata components and has been used to establish this pattern across a major, multimodal NIH data generation program. It eliminates the early-stage opacity apparent in many biomedical AI applications and provides a basis for establishing end-to-end AI explainability.
bioinformatics · 2026-03-04 · v4
PopGenAgent: Tool-Aware, Reproducible, Report-Oriented Workflows for Population Genomics
Su, H.; Long, W.; Feng, J.; Hou, Y.; Zhang, Y.
Abstract
Population-genetic inference routinely requires coordinating many specialized tools, managing brittle file formats, iterating through diagnostics, and converting intermediate results into interpretable figures and written summaries. Although workflow frameworks improve reproducibility, substantial last-mile effort remains for parameterization, troubleshooting, and report preparation. Here we present PopGenAgent, a turnkey, report-oriented delivery system that packages a curated library of population-genetics toolchains into validated execution and visualization templates with standardized I/O contracts and full provenance capture. PopGenAgent separates retrieval-grounded user assistance for interpretation and write-up from conservative, template-driven execution that emphasizes auditable commands, artefact integrity checks, and report-ready figure generation. To control operating cost, an economical language model is used for template selection, parameter instantiation, and minor repairs, while higher-capacity models can be invoked selectively for narrative report generation grounded in recorded artefacts. We evaluate PopGenAgent on a broad panel of routine and advanced tasks spanning preprocessing, population structure analysis, and allele-sharing statistics, and we further demonstrate end-to-end replication of standard analyses on 26 populations from the 1000 Genomes Project, reproducing canonical summaries including ROH/heterozygosity profiles, LD decay, PCA, ADMIXTURE structure, TreeMix diagnostics, and f-statistics. Together, these results indicate that a validated template library coupled with provenance-aware reporting can substantially reduce manual scripting and coordination overhead while preserving reproducibility and step-level inspectability for population-genomic studies.
bioinformatics · 2026-03-04 · v1
Uncovering Latent Structure in Gliomas Using Multi-Omics Factor Analysis
Carvalho, C. G.; Carvalho, A. M.; Vinga, S.
Abstract
Background: Gliomas are the most common malignant brain tumors in adults and are characterized by a poor prognosis. Although the current World Health Organization (WHO) classification provides clear guidelines for classifying oligodendroglioma, astrocytoma, and glioblastoma patients, significant heterogeneity persists within each class, limiting the effectiveness of current treatment strategies. With the growth of large-scale multi-omics datasets, driven by advances in sequencing technologies, and of online databases such as The Cancer Genome Atlas (TCGA) that provide them, it is now possible to investigate these tumors at multiple molecular levels. Methods: In this work, we apply integrative multi-omics analysis to explore the interplay between genomic (mutations), epigenomic (DNA methylation), and transcriptomic (mRNA and miRNA) layers. Our approach relies on Multi-Omics Factor Analysis (MOFA), a Bayesian latent factor analysis model designed to capture sources of variation across different omics types. Results: Our results highlight distinct molecular profiles across the three glioma types and identify potential relationships between methylation and genetic expression. In particular, we uncover novel candidate biomarkers with prognostic value, as well as a transcriptional profile associated with neural system development. Conclusions: These findings may contribute to more personalized therapeutic strategies, potentially enhancing treatment effectiveness and improving survival outcomes for this disease.
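The core idea behind MOFA, decomposing multiple omics views into shared latent factors and asking how much variance each view explains, can be illustrated with a toy sketch. This is a simplified stand-in using a plain SVD rather than MOFA's Bayesian inference (the mofapy2 implementation is not used here); all data, dimensions, and view names are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy multi-omics data: 100 patients, two "views" (e.g. mRNA and methylation)
# generated from 3 shared latent factors plus noise.
n, k = 100, 3
Z = rng.normal(size=(n, k))                      # latent factor scores
W_mrna = rng.normal(size=(k, 50))                # view-specific loadings
W_meth = rng.normal(size=(k, 30))
Y_mrna = Z @ W_mrna + 0.1 * rng.normal(size=(n, 50))
Y_meth = Z @ W_meth + 0.1 * rng.normal(size=(n, 30))

# Recover shared factors by SVD on the concatenated, standardized views
# (MOFA instead infers these with a sparse Bayesian model).
Y = np.hstack([(V - V.mean(0)) / V.std(0) for V in (Y_mrna, Y_meth)])
U, S, Vt = np.linalg.svd(Y, full_matrices=False)
factors = U[:, :k] * S[:k]                       # estimated factor scores

# Per-view variance explained by the top factors, analogous to MOFA's
# view-wise R^2 decomposition.
for name, view in [("mRNA", Y_mrna), ("methylation", Y_meth)]:
    Xv = (view - view.mean(0)) / view.std(0)
    coef, *_ = np.linalg.lstsq(factors, Xv, rcond=None)
    r2 = 1 - ((Xv - factors @ coef) ** 2).sum() / (Xv ** 2).sum()
    print(f"{name}: R^2 = {r2:.2f}")
```

With low noise, a handful of shared factors explains nearly all variance in both views; in real data the per-view R^2 profile is what distinguishes shared from modality-specific sources of variation.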
bioinformatics · 2026-03-04 · v1
LLMsFold: Integrating Large Language Models and Biophysical Simulations for De Novo Drug Design
Waththe Liyanage, W. W.; Bove, F.; Righelli, D.; Romano, S.; Visone, R.; Iorio, M. V.; Lio, P.; Taccioli, C.
Abstract
The discovery of novel small molecules is challenging because of the vastness of chemical space and the complexity of protein-ligand interactions, leading to low success rates and time-consuming workflows. Here, we present LLMsFold, a computational framework that combines Large Language Models (LLMs) and biophysical foundation tools to design and validate new small molecules targeting pathogenic proteins. The pipeline starts by identifying viable binding pockets on a target protein through geometry-based pocket detection. A 70-billion-parameter transformer model from the LLaMA family then generates candidate molecules as SMILES strings under prompt constraints that enforce drug-likeness. Each molecule is evaluated by Boltz-2, a diffusion-based model for protein-ligand co-folding that predicts bound 3D structure and binding affinity. Promising candidates are iteratively optimized through a reinforcement learning loop that prioritizes high predicted affinity and synthetic accessibility. We demonstrate the approach on two challenging targets: ACVR1 (Activin A Receptor Type 1), implicated in fibrodysplasia ossificans progressiva (FOP), and CD19, a surface antigen expressed on most B-cell lymphoma and leukemia cells. Top candidates show strong in silico binding predictions and favorable drug-like profiles. All code and models are made available to support reproducibility and further development.
bioinformatics · 2026-03-04 · v1
Deciphering the links between metabolism and health by building small-scale knowledge graphs: application to endometriosis and persistent pollutants
Mathe, M.; Laisney, G.; Filangi, O.; Giacomoni, F.; Delmas, M.; Cano-Sancho, G.; Jourdan, F.; Frainay, C.
Abstract
Knowledge graphs (KGs) are a robust formalism for structuring biomedical knowledge, but large-scale KGs often require complex queries, are difficult for non-experts to explore, and lack real-world context (such as experimental data, clinical conditions, and patient symptoms). This limits their usability for addressing specific research questions. We present Kg4j, a computational framework built on FORVM (a large-scale KG containing 82 million compound-biological concept associations) that constructs local, keyword-based sub-graphs tailored to address biomedical research questions. The resulting graphs support hypothetical relationships and can integrate experimental datasets, enabling the discovery of plausible but as-yet unknown connections. Starting from a conceptual definition of a research field of interest (e.g., disease, symptoms, exposure), the framework extracts relevant associations from FORVM and identifies potential biological mechanisms and chemical compounds. We applied this approach to endometriosis, exploring links between exposure to Persistent Organic Pollutants (POPs) and disease risk. We propose a novel validation strategy comparing the resulting sub-graph (2,706 nodes and 23,243 edges, 0.002% of FORVM) with recent scientific literature, showing consistency with known findings while also revealing new hypothetical associations requiring further investigation. We also showed that removing duplicated nodes and edges from the KG improves the proportion of validated nodes (from 8.4% to 16%) and doubles the precision (from 0.085 to 0.197) while maintaining the recall (0.954 to 0.952), illustrating a trade-off between the loss of potentially relevant but redundant information and the reliability of the remaining associations. By combining automated knowledge mining with experimental data integration, this framework supports reproducible, context-based exploration of biomedical knowledge and systematic hypothesis generation.
Applied to endometriosis, it highlights potential mechanisms linking exposure to POPs to the aetiology of the disease, offering a scalable strategy for constructing disease-specific KGs.
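The keyword-seeded sub-graph extraction described above amounts to a bounded traversal from seed concepts. A minimal stdlib sketch, assuming a toy adjacency-list graph (the nodes and edges below are illustrative placeholders, not taken from FORVM or Kg4j):

```python
from collections import deque

# Toy knowledge graph as an adjacency list of compound/biological-concept
# associations (contents are invented for illustration).
kg = {
    "endometriosis": ["estrogen signaling", "inflammation"],
    "inflammation": ["TNF", "dioxin"],
    "estrogen signaling": ["bisphenol A"],
    "dioxin": ["AhR pathway"],
    "TNF": [],
    "bisphenol A": [],
    "AhR pathway": [],
}

def keyword_subgraph(graph, seeds, max_hops=2):
    """Extract the local sub-graph reachable from seed keywords
    within max_hops edges, breadth-first."""
    seen = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    edges = []
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for nb in graph.get(node, []):
            edges.append((node, nb))
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, depth + 1))
    return seen, edges

nodes, edges = keyword_subgraph(kg, ["endometriosis"])
print(len(nodes), "nodes,", len(edges), "edges")
```

Capping the hop count is what keeps the sub-graph small relative to the full KG; here "AhR pathway" sits three hops from the seed and is excluded at `max_hops=2`.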
bioinformatics · 2026-03-04 · v1
Formalized scientific methodology enables rigorous AI-conducted research across domains
Zhang, Y.; Zhao, J.
Abstract
We formalize scientific methodology, the end-to-end process from question formulation to evidence-grounded writing, as a phase-gated research protocol with explicit return paths and persistent constraints, and instantiate it for general-purpose language models as executable protocol specifications. The formalization decomposes methodology into three complementary layers: a procedural workflow, an integrity discipline, and project governance. Encoded as protocol and activated across the lifecycle, these constraints externalize planning and verification artifacts and make integrity-relevant interventions auditable. We validate the approach in six end-to-end projects, including a matched controlled study, where the same agent produced two complete papers with and without the protocol. Across domains, the protocol-constrained agent produced evidence-backed, auditable research outputs - including closed-form derivations, quantitative ablations that resolve modeling design choices, and algorithmic refactors that preserve the objective while changing the computational primitive. In population-genomic applications, it also recovered well-studied biological signals as validity checks, including known admixture targets in the 1000 Genomes Project and Neanderthal-introgressed immune loci on chromosome 21 consistent with prior catalogs. In the controlled study, the protocol-free baseline could still produce a complete manuscript, but integrity-relevant risks were easier to introduce and harder to detect when constraints and artifacts were absent.
bioinformatics · 2026-03-04 · v1
T cell-Macrophage Interactions Potentially Influence Chemotherapeutic Response in Ovarian Cancer Patients
Hameed, S. A.; Kolch, W.; Zhernovkov, V.
Abstract
Tumor development and progression involve complex cell-cell interactions and dynamic co-evolution between cancer cells, immune cells, and stromal cells in the tumour microenvironment, which may influence therapeutic resistance. A large proportion of this network relies on direct physical interactions between cells, particularly T-cell-mediated interactions. Cell-cell communication inference has become routine in downstream scRNA-seq analysis, but it mostly fails to capture physical cell-cell interactions due to tissue dissociation. Doublets occur naturally in scRNA-seq and are usually excluded from analysis. However, they may represent directly interacting cells that remain undissociated during library preparation. In the present study, we uncover the physical interaction landscape of the ovarian tumour microenvironment using scRNA-seq datasets from 13 treatment-naive ovarian cancer patients. Focusing on T cell-macrophage (T-Mac) interaction doublets, we reveal the modulatory effect of macrophages on T cells and the potential influence of this interaction on therapeutic response. Our findings show that T-Macs from resistant patients are functionally polarized to the M2 phenotype and engage T cells to induce T-cell exhaustion. In contrast, T-Macs from sensitive patients are predominantly of the M1 polarized phenotype, physically engaging T cells that lack exhaustion signatures. We also demonstrate that T cells and macrophages in T-Mac doublets interact primarily for the purpose of antigen presentation, with enrichment of several ligand-receptor pairs involved in TCR-MHC interactions and immune synapse formation. We partly validated these findings using a spatial transcriptomics dataset of ovarian cancer patients from a separate cohort.
bioinformatics · 2026-03-04 · v1
Direct pathway enrichment prediction from histopathological whole slide images and comparison with gene expression mediated models
Jabin, A.; Ahmad, S.
Abstract
Molecular profiling of tumours via RNA sequencing (RNA-seq) enables clinically actionable stratification but remains costly, tissue-intensive, and time-consuming. Recent advances in computational pathology suggest that routine H&E whole-slide images (WSIs) can be used to estimate the transcriptomic states of cancer cells. However, because WSI-derived predictions of transcriptional signatures are noisy, their use for accurate biological interpretation faces challenges. On the other hand, pathway enrichment analysis has routinely been used to describe biologically meaningful cellular states from noisy gene expression data, and some studies have evaluated the ability of WSI-predicted gene expression profiles to reconstruct enriched pathways in experiments where the two data modalities were concurrently available. It remains unclear, however, whether a model designed to predict enriched pathways directly from WSIs would outperform the current approach of first predicting gene expression. Here, we develop and evaluate these two complementary approaches for predicting pathway enrichment profiles from WSIs in TCGA Breast Invasive Carcinoma (TCGA-BRCA) by training parallel models: those that predict pathway enrichment directly from image features and those that rely on predicted gene expression profiles, the current state of the art. Our results suggest that, under controlled experiments, direct prediction of a selected pool of enriched pathways outperforms models trained to predict gene expression and then infer enrichment from the predicted values. These findings will help prioritize the goals of predictive modeling of WSIs and improve diagnostic outcomes for cancer patients.
bioinformatics · 2026-03-04 · v1
Towards Useful and Private Synthetic Omics: Community Benchmarking of Generative Models for Transcriptomics Data
Öztürk, H.; Afonja, T.; Jälkö, J.; Binkyte, R.; Rodriguez-Mier, P.; Lobentanzer, S.; Wicks, A.; Kreuer, J.; Ouaari, S.; Pfeifer, N.; Menzies, S.; Pentyala, S.; Filienko, D.; Golob, S.; McKeever, P.; Banerjee, J.; Foschini, L.; De Cock, M.; Saez-Rodriguez, J.; Fritz, M.; Stegle, O.; Honkela, A.
Abstract
Background: The synthesis of anonymized data derived from real-world cohorts offers a promising strategy for regulatory-compliant and privacy-preserving biological data sharing, potentially facilitating model development that can improve predictive performance. However, the extent to which generative models can preserve biological signals while remaining resilient to adversarial privacy attacks in high-dimensional omics contexts remains underexplored. To address this gap, the CAMDA 2025 Health Privacy Challenge launched a community-driven effort to systematically benchmark synthetic and privacy-preserving data generation for bulk RNA-seq cohorts. Results: Building on this initiative, we systematically benchmarked 11 generative methods across two cancer cohorts (~1,000 and ~5,000 patients) over 978 landmark genes. Methods were evaluated across complementary axes of distributional fidelity, downstream utility, biological plausibility and empirical privacy risk, with emphasis on trade-offs between vulnerability to membership inference attacks (MIA) and other evaluation dimensions. Expressive deep generative models achieved strong predictive utility and differential expression recovery, but were often more vulnerable to membership inference risk. Differentially private methods improved resistance to attacks at the cost of reduced utility, while simpler statistical approaches offered competitive utility with moderate privacy risk and fast training. Conclusions: Synthetic bulk RNA-seq quality is inherently multi-dimensional and shaped by trade-offs between utility, biological preservation and privacy. Our results indicate that differences in model architecture drive distinct trade-offs across these axes, suggesting that model choice should align with dataset characteristics, intended downstream use and privacy requirements. Privacy risk should also be assessed using multiple complementary attack methods and, where possible, formal differential privacy protection.
bioinformatics · 2026-03-04 · v1
A comprehensive benchmark of discrepancies across microbial genome reference databases
Boldirev, G.; Aguma, P.; Munteanu, V.; Koslicki, D.; Alser, M.; Zelikovsky, A.; Mangul, S.
Abstract
Metagenomic analysis of microbial communities relies significantly on the quality and completeness of reference genomes, which allow researchers to compare sequencing reads against reference genome collections to reveal essential community characteristics. However, the reliability of these analyses is often compromised by substantial discrepancies across existing reference resources, including differences in genome content, assembly fragmentation, taxonomic representation, and metadata completeness. While these inconsistencies are known to introduce bias, the extent of divergence between major databases remains largely unknown. Here, we present a comprehensive benchmark of discrepancies across multiple widely used microbial genome reference resources. We developed the Cross-DB Genomic Comparator (CDGC), which utilizes reference genome alignments to systematically capture discrepancies in genome assemblies across reference databases. Applying this framework, we found that 99% of viral genomes were identical across databases, indicating strong consistency in viral reference resources. In contrast, fungal genomes showed substantially greater variability: although 82% of assemblies exhibited at least 90% similarity, only 7% were identical across databases. More concerning, we identified a subset of 461 assemblies with less than 50% similarity, suggesting the presence of technical artifacts, incomplete assemblies, or damaged genome files that require closer examination. Collectively, these results demonstrate that systematic cross-database benchmarking provides a critical mechanism for refining the accuracy of individual reference databases and advancing efforts towards more unified and reliable universal reference genomes.
bioinformatics · 2026-03-04 · v1
gSV: a general structural variant detector using the third-generation sequencing data
Hao, J.; Shi, J.; Lian, S.; Zhang, Z.; Luo, Y.; Hu, T.; Ishibashi, T.; Wang, D.; Wang, S.; Fan, X.; Yu, W.
Abstract
Structural variants (SVs) are of increasing significance with the advancement of the third-generation sequencing technologies, but detecting complex SVs is still challenging for existing SV detection tools. In this paper, we propose gSV, a general SV detector that integrates alignment-based and assembly-based approaches with the maximum exact match (MEM) strategy. Without predefined assumptions about SV types, gSV captures all potential variant signals, enabling the detection of SVs with complex alignment patterns that are usually missed by other tools. Evaluations using both simulated and real datasets demonstrate that gSV outperforms state-of-the-art tools in detecting both simple and complex SVs. Unique SV discoveries in four breast cancer cell lines, particularly in cancer-associated genes, validate the clinical utility of gSV. The application in a breast cancer cohort from the Chinese population further illustrates the usefulness of our new tool in genomic studies.
bioinformatics · 2026-03-04 · v1
Rigidity-Aware Geometric Pretraining for Protein Design and Conformational Ensembles
Ni, Z.; Li, Y.; Qiu, Z.; Schölkopf, B.; Guo, H.; Liu, W.; Liu, S.
Abstract
Generative models have recently advanced de novo protein design by learning the statistical regularities of natural structures. However, current approaches face three key limitations: (1) existing methods cannot jointly learn protein geometry and design tasks, where pretraining can be a solution; (2) current pretraining methods mostly rely on local, non-rigid atomic representations for downstream property prediction tasks, limiting the global geometric understanding needed for protein generation tasks; and (3) existing approaches have yet to effectively model the rich dynamic and conformational information of protein structures. To overcome these issues, we introduce RigidSSL (Rigidity-Aware Self-Supervised Learning), a geometric pretraining framework that front-loads geometry learning prior to generative finetuning. Phase I (RigidSSL-Perturb) learns geometric priors from 432K structures from the AlphaFold Protein Structure Database with simulated perturbations. Phase II (RigidSSL-MD) refines these representations on 1.3K molecular dynamics trajectories to capture physically realistic transitions. Underpinning both phases is a bi-directional, rigidity-aware flow matching objective that jointly optimizes translational and rotational dynamics to maximize mutual information between conformations. Empirically, RigidSSL variants improve designability by up to 43% while enhancing novelty and diversity in unconditional generation. Furthermore, RigidSSL-Perturb improves the success rate by 5.8% in zero-shot motif scaffolding, and RigidSSL-MD captures more biophysically realistic conformational ensembles in G protein-coupled receptor modeling. The code is available at: https://github.com/ZhanghanNi/RigidSSL.git.
bioinformatics · 2026-03-04 · v1
EvoStructCLIP: A Mutation-Centered Multimodal Embedding Model for CAGI7 Variant Effect Prediction
Chung, K.; Lee, J.; Kim, Y.; Lee, J.; Park, J.; Lee, H.
Abstract
We present EvoStructCLIP, a mutation-centered multimodal embedding model that integrates local 3D structural windows and evolutionary constraints to predict missense variant effects. EvoStructCLIP combines two encoders: a structure voxel encoder derived from AlphaFold residue neighborhoods and an MSA-based evolutionary encoder. It aligns the modalities through CLIP-style contrastive learning, with FuseMix regularization and an auxiliary pathogenicity loss trained on 153,787 ClinVar variants. Evaluations using lightweight regressors demonstrate that EvoStructCLIP embeddings capture highly transferable predictive signals across diverse phenotypes, including gene-specific functional readouts of BRCA1, KCNQ4, and PTEN/TPMT. This transferability is further supported in the CAGI7 blind competition setting, where models generalized to predicting different gene-specific readouts for BARD1, FGFR, and TSC2 without target-specific retraining and achieved competitive performance across heterogeneous biological tasks.
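The CLIP-style alignment step described above can be sketched with a symmetric InfoNCE contrastive loss. This is a generic NumPy illustration of the objective, not the authors' model or training setup; the embeddings, batch size, and temperature below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)

def info_nce(a, b, temperature=0.07):
    """Symmetric CLIP-style contrastive loss between two batches of
    embeddings; row i of `a` is the positive match for row i of `b`."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature          # cosine similarities, scaled

    def xent(l):
        # cross-entropy where the diagonal entries are the positives
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.diag(logp).mean()

    return 0.5 * (xent(logits) + xent(logits.T))

# Aligned modality pairs (structure embedding ~ evolutionary embedding)
# yield a low loss; random pairings yield a high loss.
struct = rng.normal(size=(8, 16))
evo_aligned = struct + 0.01 * rng.normal(size=(8, 16))
evo_random = rng.normal(size=(8, 16))
print(info_nce(struct, evo_aligned), info_nce(struct, evo_random))
```

Minimizing this loss pulls matched structure/evolution embeddings together while pushing mismatched pairs apart, which is what makes the joint embedding usable by lightweight downstream regressors.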
bioinformatics · 2026-03-04 · v1
MiGenPro: A linked data workflow for phenotype-genotype prediction of microbial traits using machine learning
Loomans, M.; Suarez-Diez, M.; Schaap, P. J.; Saccenti, E.; Koehorst, J. J.
Abstract
The availability of microbial genomic data and the development of machine learning methods have created a unique opportunity to establish associations between genetic information and phenotypes. Here, we introduce a computational workflow for Microbial Genome Prospecting (MiGenPro) that combines phenotypic and genomic information. MiGenPro serves as a workflow for training machine learning models that predict microbial traits from annotated genomes. Microbial genomes were consistently annotated, and features were stored in a semantic framework that is easy to query using SPARQL. These data were used to train machine learning models that successfully predicted microbial traits such as motility, Gram stain, optimal temperature range, and sporulation capabilities. To ensure robustness, a hyperparameter halving grid search was used to determine optimal parameter settings, followed by five-fold cross-validation, which demonstrated consistent model performance across iterations without overfitting. Effectiveness was further validated through comparison with existing models, showing comparable accuracy, with modest variations attributed to differences in datasets rather than methodology. Classifications can be further explored using feature importance characterisation to identify biologically relevant genomic features. MiGenPro provides an easy-to-use, interoperable workflow to build and validate models that predict phenotypes of microbes from their annotated genomes.
bioinformatics · 2026-03-03 · v2
Characterizing and Mitigating Protocol-Dependent Gene Expression Bias in 3' and 5' Single-Cell RNA Sequencing
Shydlouskaya, V.; Haeryfar, S. M. M.; Andrews, T. S.
Abstract
Single-cell RNA sequencing (scRNA-seq) has enabled large-scale characterization of cellular heterogeneity; yet, integrating datasets generated through different library preparation protocols remains challenging. For instance, comparisons between 10X Genomics 3' and 5' chemistries are complicated by protocol-dependent technical biases imposed by differences in transcript end capture and amplification. While normalization, and often batch correction, is an integral step in preprocessing scRNA-seq datasets, it remains unclear which correction is most appropriate, or even necessary, for reliable cross-protocol comparisons. Here, we systematically characterize protocol-related expression differences using 35 matched donors across six tissues profiled with both 3' and 5' scRNA-seq approaches. We find that gene expression discrepancies are not pervasive across the whole transcriptome, but driven instead by a relatively small, reproducible subset of protocol-biased genes. Excluding these genes improves cross-protocol concordance, indicating that most genes are directly comparable without aggressive correction. We then benchmark commonly employed normalization approaches and show that while several methods, such as fastMNN, improve statistical alignment when cell populations are well matched, they can distort gene-level signals and inflate differential expression in biologically realistic settings with incomplete cell-type overlap. Taken together, our results demonstrate that protocol bias between 3' and 5' scRNA-seq is limited in scope and that targeted handling of a small set of biased genes presents an alternative approach to normalization or batch correction strategies. This work provides a practical guideline for integrating 3' and 5' scRNA-seq data and highlights the importance of matching normalization strategies to the structure of technical variation and the intended downstream analyses.
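The "flag a small set of protocol-biased genes, then compare the rest directly" strategy can be sketched on simulated matched-donor data. This is a simplified stand-in for the paper's procedure (the cutoff, dimensions, and bias magnitude are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy matched-donor expression: 35 donors x 200 genes per protocol.
# Most genes agree between chemistries; 15 carry a reproducible 3'/5' bias.
donors, genes, biased = 35, 200, 15
base = rng.normal(5, 1, size=(donors, genes))
expr_3p = base + 0.1 * rng.normal(size=(donors, genes))
expr_5p = base + 0.1 * rng.normal(size=(donors, genes))
expr_5p[:, :biased] += 2.0                 # protocol-biased genes

# Flag genes whose mean cross-protocol difference exceeds a cutoff
# (a simple stand-in for the paper's biased-gene identification).
delta = (expr_5p - expr_3p).mean(axis=0)
flagged = np.abs(delta) > 1.0
print("flagged genes:", int(flagged.sum()))

# Cross-protocol concordance (mean per-donor correlation) before and
# after excluding the flagged genes.
def concordance(a, b):
    return np.mean([np.corrcoef(a[i], b[i])[0, 1] for i in range(donors)])

before = concordance(expr_3p, expr_5p)
after = concordance(expr_3p[:, ~flagged], expr_5p[:, ~flagged])
print(f"concordance: {before:.3f} -> {after:.3f}")
```

Because the bias is confined to a small gene subset, exclusion raises concordance without transforming the remaining genes, which is the alternative to global normalization or batch correction that the abstract argues for.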
bioinformatics · 2026-03-03 · v1
selscape: A Snakemake Workflow for Investigating Genomic Landscapes of Natural Selection
Chen, S.; Huang, X.
Abstract
Analyzing natural selection is a central task in evolutionary genomics, yet applying multiple tools across populations in a reproducible and scalable manner is often complicated by heterogeneous input formats, parameter settings, and tool dependencies. Here, we present selscape, a Snakemake workflow that automates end-to-end genome-wide selection analysis--from input preparation and statistic calculation to functional annotation, downstream visualization, and summary reporting. We demonstrate selscape on high-coverage genomes from the 1000 Genomes Project, illustrating how the workflow enables efficient, large-scale analyses and streamlined comparisons across populations. By unifying diverse tools with Snakemake, selscape lowers the barrier to robust genome-wide analyses and provides a flexible framework for future extensions and integration with complementary population genetic analyses.
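The shape of such a workflow, per-population rules chained from input preparation to a summary report, can be sketched in Snakemake syntax. This is a config-style fragment for illustration only: the rule names, file layout, and `scripts/compute_stats.py` helper are hypothetical and are not selscape's actual rule set.

```python
# Illustrative Snakefile sketch (not selscape's real rules).
POPULATIONS = ["CEU", "YRI", "CHB"]

rule all:
    input:
        "report/summary.tsv"

rule prepare_vcf:
    input:
        "data/{pop}.vcf.gz"
    output:
        "work/{pop}.filtered.vcf.gz"
    shell:
        "bcftools view -m2 -M2 -v snps {input} -Oz -o {output}"

rule selection_scan:
    input:
        "work/{pop}.filtered.vcf.gz"
    output:
        "work/{pop}.stats.tsv"
    shell:
        # hypothetical per-population statistic computation
        "python scripts/compute_stats.py {input} > {output}"

rule summarize:
    input:
        expand("work/{pop}.stats.tsv", pop=POPULATIONS)
    output:
        "report/summary.tsv"
    shell:
        "cat {input} > {output}"

```

Expressing each tool as a rule with declared inputs and outputs is what lets Snakemake resolve the dependency graph, parallelize across populations, and rerun only stale steps.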
bioinformatics · 2026-03-03 · v1
The limits of Bayesian estimates of divergence times in measurably evolving populations
Ivanov, S.; Fosse, S.; dos Reis, M.; Duchene, S.
Abstract
Bayesian inference of divergence times for extant species using molecular data is an unconventional statistical problem: divergence times and molecular rates are confounded, and only their product, the molecular branch length, is statistically identifiable. This means we must use priors on times and rates to break this confounding. As a consequence, there is a lower bound on the uncertainty that can be attained under infinite data for estimates of evolutionary timescales using the molecular clock. With infinite data (i.e., an infinite number of sites and loci in the alignment), uncertainty in the ages of nodes in phylogenies increases proportionally with their mean age, such that older nodes have higher uncertainty than younger nodes. On the other hand, if extinct taxa are present in the phylogeny, and if their sampling times are known (i.e., 'heterochronous' data), then times and rates are identifiable and the uncertainties of inferred times and rates go to zero with infinite data. However, in real heterochronous datasets (such as viruses and bacteria), alignments tend to be small, and how much uncertainty is present and how it can be reduced as a function of data size are questions that have not been explored. This is clearly important for our understanding of the tempo and mode of microbial evolution using the molecular clock. Here we conducted extensive simulation experiments and analyses of empirical data to develop the infinite-sites theory for heterochronous data. Contrary to expectations, we find that uncertainty in the ages of internal nodes scales positively with the distance to their closest tip with known age (i.e., calibration age), not with their absolute age. Our results also demonstrate that estimation uncertainty decreases with calibration age more slowly in data sets with more, rather than fewer, site patterns, although overall uncertainty is lower in the former.
Our statistical framework establishes the minimum uncertainty that can be attained with perfect calibrations and sequence data that are effectively infinitely informative. Finally, we discuss the implications for viral sequence data sets. In the vast majority of cases, viral data from outbreaks are not sufficiently informative to display infinite-sites behaviour, and thus all estimates of evolutionary timescales will be associated with a degree of uncertainty that depends on the size of the data set, its information content, and the complexity of the model. We anticipate that our framework will be useful for determining such theoretical limits in empirical analyses of microbial outbreaks.
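The time-rate confounding at the heart of this abstract can be stated compactly with the standard infinite-sites argument; the notation below (independent priors on a time t and a rate r) is ours, not the authors':

```latex
% The likelihood depends on the rate r and time t only through
% the branch length b = r t, so any pair (r', t') with r' t' = r t
% fits the data equally well:
\[
  p(D \mid r, t) \;=\; p(D \mid b), \qquad b = r t .
\]
% With infinitely many sites the likelihood collapses onto the
% ridge b = \hat{b}. Under independent priors p(t) and p(r), the
% marginal posterior of t is the prior mass along that ridge
% (with the Jacobian 1/t from substituting r = \hat{b}/t):
\[
  p(t \mid D) \;\propto\; p(t)\, p_r\!\left(\tfrac{\hat{b}}{t}\right) \frac{1}{t} .
\]
% This distribution does not degenerate to a point, so posterior
% uncertainty in t has a nonzero floor set by the priors, even
% with infinite sequence data.
```

Heterochronous sampling removes this floor because tips observed at known, distinct times constrain r directly, which is why uncertainty then scales with distance to the nearest calibrating tip rather than with absolute node age.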
bioinformatics · 2026-03-03 · v1
A comprehensive assessment of tandem repeat genotyping methods for Nanopore long-read genomes
Aliyev, E.; Avvaru, A.; De Coster, W.; Arner, G. M.; Nyaga, D. M.; Gibson, S. B.; Weisburd, B.; Gu, B.; Gonzaga-Jauregui, C.; 1000 Genomes Long-Read Sequencing Consortium; Chaisson, M. J. P.; Miller, D. E.; Ostrowski, E.; Dashnow, H.
Abstract
Background: Tandem repeats (TRs) play critical roles in human disease and phenotypic diversity but are among the most challenging classes of genomic variation to measure accurately. While it is possible to identify TR expansions using short-read sequencing, these methods are limited because they often cannot accurately determine repeat length or sequence composition. Long-read sequencing (LRS) has the potential to accurately characterize long TRs, including the identification of non-canonical motifs and complex structures. However, while there are an increasing number of genotyping methods available, no systematic effort has been undertaken to evaluate their length and sequence-level accuracy, performance across motifs from STRs to VNTRs and across allele lengths, and, critically, how usable these tools are in practice.
Results: We reviewed 25 available bioinformatic tools, and selected seven that are actively maintained for benchmarking using publicly available Oxford Nanopore genome sequencing data from more than 100 individuals. Our benchmarking catalog included ~43k TR loci genome-wide, selected to represent a range of simple and challenging TR loci. As no "truth" exists for this purpose, we used four complementary strategies to assess accuracy: concordance with high-quality haplotype-resolved Human Pangenome Reference Consortium (HPRC) assemblies, Mendelian consistency in Genome in a Bottle trios, cross-tool consistency, and sensitivity in individuals with pathogenic TR expansions confirmed by molecular methods. For all comparisons, we assess both total allele length and full sequence similarity using the Levenshtein distance. We also evaluated installation, documentation, computational requirements, and output characteristics to reflect real-world use. We provide a complete analysis workflow for all tools to support community reuse. Tool performance varied substantially across both accuracy and usability.
Most methods achieved high concordance with HPRC assemblies, with higher accuracy when using the R10 ONT pore chemistry. Accuracy generally declined with increasing allele length, and most tools performed worse on homopolymers, likely reflecting underlying sequencing accuracy. Tools generally performed worse at heterozygous loci and at alleles that differed from the reference genome. Interestingly, concordance with assembly in population samples did not predict sensitivity to pathogenic expansions, with different genotypers performing best in each category. Similarly, Mendelian consistency was highest in the tool that performed worst in assembly concordance.
Conclusions: No single genotyper emerged as consistently best across all assessments, but strong contenders emerged in each. Our results demonstrate that length accuracy (a typical benchmarking approach) alone overestimates TR genotyping performance. Sequence-level benchmarking is essential for selecting tools best-suited for population studies and clinical diagnostics. This work provides practical guidance for tool selection and highlights key priorities for future long-read TR genotyping method development.
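The sequence-level comparison in this benchmark uses the Levenshtein (edit) distance. A minimal pure-Python version (illustrative, not the benchmark's own implementation) shows why length-only comparison can miss composition differences between repeat alleles:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Two repeat alleles of identical length but different motif composition:
# a pure length-based comparison would call these genotypes identical.
ref = "CAG" * 10
alt = "CAG" * 8 + "CAA" * 2
assert len(ref) == len(alt)
assert levenshtein(ref, alt) == 2
```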
bioinformatics · 2026-03-03 · v1

iGS: A Zero-Code Dual-Engine Graphical Software for Polygenic Trait Prediction
Zhang, J.; Chen, F.
Abstract
Genomic selection (GS) has become the core driving force in modern plant and animal breeding. However, state-of-the-art comprehensive GS tools often rely on complex underlying environment configurations and command-line operations, posing significant technical barriers for breeders lacking programming expertise. To address this critical pain point, this study developed a fully "zero-code" graphical user interface (GUI) decision support system for genomic selection. The platform innovatively employs a "portable dual-engine architecture" (R-Portable and Python-Portable) to achieve completely dependency-free, "out-of-the-box" deployment, and integrates a standardized six-step end-to-end workflow from data quality control to result export. Furthermore, the platform comprehensively integrates 33 cutting-edge prediction models across four major paradigms (linear, Bayesian, machine learning, and deep learning) and features an original intelligent parameter configuration system that dynamically renders algorithm parameters to provide a minimalist UI interaction experience. Benchmark testing on the Wheat2000 dataset across six complex agronomic and quality traits, including thousand-kernel weight (TKW) and grain protein content (PROT), demonstrated that classic linear models remain highly robust for polygenic additive traits, while tree-based machine learning and hybrid deep learning architectures exhibit superior predictive potential and noise resilience when resolving complex epistatic effects and low-heritability traits. The successful deployment of this platform fundamentally liberates biologists from the constraints of computational science, providing robust digital infrastructure to accelerate the popularization and practical application of GS technologies in agricultural production.
bioinformatics · 2026-03-03 · v1

Phenotypic Bioactivity Prediction as Open-set Biological Assay Querying
Sun, Y.; Zhang, X.; Zheng, Q.; Li, H.; Zhang, J.; Hong, L.; Wang, Y.; Zhang, Y.; Xie, W.
Abstract
The traditional drug discovery pipeline is severely bottlenecked by the need to design and execute bespoke biological assays for every new target and compound, which is both time-consuming and prohibitively expensive. While machine learning has accelerated virtual screening, current models remain confined to "closed-set" paradigms, unable to generalize to entirely novel biological assays without target-specific experimental data. Here, we present OpenPheno, a groundbreaking multimodal foundation model that fundamentally redefines bioactivity prediction as an open-set, visual-language question-answering (QA) task. By integrating chemical structures (SMILES), universal phenotypic profiles (Cell Painting images), and natural language descriptions of biological assays, OpenPheno unlocks the highly coveted "profile once, predict many" paradigm. Instead of conducting countless target-specific wet-lab experiments, researchers only need to capture a single, low-cost Cell Painting image of a novel compound. OpenPheno then evaluates this universal phenotypic "fingerprint" against the text-based description of any unseen assay, predicting bioactivity in a zero-shot manner. On 54 entirely unseen assays, it achieves strong zero-shot performance (mean AUROC 0.75), exceeding supervised baselines trained with full labeled data, and few-shot adaptation further improves predictions. In the most stringent setting where both compounds and assays are novel, OpenPheno maintains robust generalization (mean AUROC 0.66), opening up a new paradigm for a highly scalable, cost-effective, and universal engine for next-generation drug discovery.
bioinformatics · 2026-03-03 · v1

Large-Scale Statistical Dissection of Sequence-Derived Biochemical Features Distinguishing Soluble and Insoluble Proteins
Vu, N. H. H.; Nguyen Bao, L.
Abstract
Protein solubility critically influences recombinant expression efficiency and downstream biotechnological applications. While deep learning models have improved predictive accuracy, the intrinsic magnitude, redundancy, and interpretability of classical sequence-derived determinants remain insufficiently characterized. We performed a statistically rigorous large-scale univariate analysis on a curated dataset of 78,031 proteins (46,450 soluble; 31,581 insoluble). Thirty-six biochemical descriptors were evaluated using Mann-Whitney U tests with Benjamini-Hochberg false discovery rate correction. Effect sizes were quantified using Cliff's δ, and discriminative performance was assessed by ROC-AUC. Although 34 features remained significant after correction, most exhibited small effect sizes and substantial class overlap, consistent with a weak-signal regime. The strongest effects were associated with size-related features (sequence length and molecular weight; δ ≈ -0.21), whereas charge-related descriptors, particularly the proportion of negatively charged residues (δ = 0.150; AUC = 0.575), showed consistent but modest shifts. Spearman correlation analysis revealed near-complete redundancy among major size-related variables (ρ up to 0.998). Applying a redundancy threshold (|ρ| ≥ 0.85), we derived a parsimonious composite integrating sequence length and negative charge proportion, achieving AUC = 0.624 (MCC = 0.1746). These findings demonstrate that sequence-level solubility information is intrinsically low-dimensional and governed by coordinated weak effects, establishing a transparent statistical baseline for large-scale solubility characterization.
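Cliff's δ, the effect size used in this study, counts how often values in one group exceed values in the other across all cross-group pairs. A small pure-Python sketch (quadratic in sample size, fine for illustration; the toy numbers below are invented, not the paper's data):

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs.
    Ranges from -1 (ys dominate) to +1 (xs dominate); 0 = complete overlap."""
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))

# Toy example: hypothetical sequence lengths for soluble vs insoluble proteins.
soluble   = [120, 150, 180, 200, 210]
insoluble = [190, 220, 240, 260, 300]
d = cliffs_delta(soluble, insoluble)
assert -1.0 <= d <= 1.0
assert d < 0  # soluble proteins tend to be shorter in this toy sample
```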
bioinformatics · 2026-03-03 · v1

h5adify: neuro-symbolic metadata harmonization enables scalable AnnData integration with local large language models
Rincon de la Rosa, L.; Mouazer, A.; Navidi, M.; Degroodt, E.; Künzle, T.; Geny, S.; Idbaih, A.; Verrault, M.; Labreche, K.; Hernandez-Verdin, I.; Alentorn, A.
Abstract
Background: The rapid growth of public single-cell and spatial transcriptomics repositories has shifted the main bottleneck for atlas-scale integration from data generation to metadata heterogeneity. Even when datasets are released in the AnnData H5AD format, inconsistent column naming, partial annotations, and mixed gene identifier conventions frequently prevent reproducible merging, downstream benchmarking, and reuse in foundation model training. Automated approaches that resolve semantic inconsistency while preserving biological validity are therefore essential for scalable data reuse. Results: We present h5adify, a neuro-symbolic toolkit that combines deterministic biological inference with locally deployed large language models to transform heterogeneous AnnData objects into schema-normalized, integration-ready representations. The framework performs metadata field discovery, gene identifier harmonization, optional paper-aware extraction, and consensus resolution with explicit uncertainty logging. Benchmarking four open-weight model families deployed through Ollama (Gemma, Llama, Mistral, and Qwen) demonstrates that small local models achieve high semantic accuracy in metadata resolution with low hallucination rates and modest computational requirements. In controlled simulations introducing annotation noise into single-cell and Visium-like datasets, harmonization improves integration benchmarking and reduces spurious batch effects. Application to sex-annotated glioblastoma datasets recovers biologically coherent microenvironmental patterns and cell type-specific genomic differences not explained by differential expression alone. Conclusions: Together, h5adify provides a reproducible framework for evaluating LLM-assisted biocuration and enables scalable, privacy-preserving metadata harmonization for modern single-cell atlases and foundation model pipelines. 
These results demonstrate that modular neuro-symbolic integration of deterministic biological inference and small local language models can effectively resolve semantic heterogeneity while remaining computationally accessible.
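The deterministic half of such a harmonization pipeline can be sketched as a synonym-table lookup over AnnData `.obs` column names. The table and function below are hypothetical illustrations of the idea, not h5adify's actual rules (which are combined with LLM consensus and uncertainty logging):

```python
# Hypothetical synonym table mapping heterogeneous .obs column names
# to canonical schema fields; illustrative only.
CANONICAL = {
    "cell_type": {"cell_type", "celltype", "cell.types", "annotation"},
    "sex": {"sex", "gender", "donor_sex"},
    "sample_id": {"sample", "sample_id", "sampleid", "orig.ident"},
}

def harmonize_columns(columns):
    """Map observed column names to canonical schema fields; unmapped names
    are returned unchanged so nothing is silently dropped."""
    lookup = {syn.lower(): canon
              for canon, syns in CANONICAL.items() for syn in syns}
    return {col: lookup.get(col.lower(), col) for col in columns}

mapping = harmonize_columns(["CellType", "gender", "orig.ident", "n_genes"])
assert mapping == {"CellType": "cell_type", "gender": "sex",
                   "orig.ident": "sample_id", "n_genes": "n_genes"}
```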
bioinformatics · 2026-03-03 · v1

snputils: A High-Performance Python Library for Genetic Variation and Population Structure
Bonet, D.; Comajoan Cara, M.; Barrabes, M.; Smeriglio, R.; Agrawal, D.; Aounallah, K.; Geleta, M.; Dominguez Mantes, A.; Thomassin, C.; Shanks, C.; Huang, E. C.; Franquesa Mones, M.; Luis, A.; Saurina, J.; Perera, M.; Lopez, C.; Sabat, B. O.; Abante, J.; Moreno-Grau, S.; Mas Montserrat, D.; Ioannidis, A. G.
Abstract
The increasing size and resolution of genomic and population genetic datasets offer unprecedented opportunities to study population structure and uncover the genetic basis of complex traits and diseases. The collection of existing analytical tools, however, is characterized by format incompatibilities, limited functionality, and computational inefficiencies, forcing researchers to construct fragile pipelines that chain together fragmented command-line utilities and ad hoc scripts. These are difficult to maintain, scale, and reproduce. To address such limitations, we present snputils, a Python library that unifies high-performance I/O, transformation, and analysis of genotype, ancestry, and phenotypic information within a single framework suitable for biobank-scale research. The library provides efficient tools for essential operations, including querying, cleaning, merging, and statistical analysis. In addition, it offers classical population genetic statistics with optional ancestry-specific masking. An identity-by-descent module supports reading of multiple formats, filtering, and ancestry-restricted segment trimming for relatedness and demographic inference. snputils also incorporates ancestry-masking and multi-array functionalities for dimensionality reduction methods, as well as efficient implementations of admixture simulation, admixture mapping, and advanced visualization capabilities. With support for the most commonly used file formats, snputils integrates smoothly with existing tools and clinical databases. At the same time, its modular and optimized design reduces technical overhead, facilitating reproducible workflows that accelerate discoveries in population genetics, genomic research, and precision medicine. Benchmarking demonstrates a significant reduction in genotype data loading time compared to existing Python libraries.
The open-source library is available at https://github.com/AI-sandbox/snputils, with full documentation and tutorials at https://snputils.org.
bioinformatics · 2026-03-03 · v1

Minimum Unique Substrings as a Context-Aware k-mer Alternative for Genomic Sequence Analysis
Adu, A. F.; Menkah, E. S.; Amoako-Yirenkyi, P.; Pandam Salifu, S.
Abstract
Fixed-length k-mers have long been the standard in sequence analysis. However, they impose a uniform resolution across heterogeneous genomes, often resulting in significant redundancy and a loss of contextual sensitivity. To address these limitations, we introduce Minimum Unique Substrings (MUSs), which are variable-length sequence units that adapt to the local complexity of the genome. MUSs function as context-aware markers that naturally define repeat boundaries by extending only until uniqueness is achieved. We build upon the theoretical relationship between MUSs and maximal repeats, extending this framework to sequencing reads by establishing a read-consistent definition of uniqueness. We present a linear-time (O(n)) algorithm based on a generalized suffix tree and introduce the concept of outposts. These outposts act as anchors for uniqueness, enabling precise localization of MUS boundaries within the sequencing data. Empirical studies of E. coli K-12 and human HiFi reads reveal distinct distributions in MUS lengths that reflect their respective genomic architectures. The compact bacterial genome produces a highly dense set of MUSs with a narrow length distribution (averaging 30.44 bp). In contrast, the repeat-rich human genome requires longer substrings to resolve uniqueness, resulting in an increased mean length (36.08 bp) and a broader distribution that delineates complex repetitive elements. The MUS framework achieves 100% unique coverage with an average length of 36.08 bp, surpassing the 69% coverage of k = 61. By reducing the total number of tokens by over 99%, it provides higher resolution and superior data compression compared to fixed-length k-mer sampling. These results demonstrate that MUSs provide a biologically meaningful, context-sensitive alternative to k-mers, with direct applications in genome assembly, repeat characterization, and comparative genomics.
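The defining property of a MUS, extending a substring only until it becomes unique, can be shown with a brute-force sketch (quadratic, for illustration only; the paper's algorithm is suffix-tree based and linear-time):

```python
def occurrences(s: str, pat: str) -> int:
    """Count possibly-overlapping occurrences of pat in s."""
    count, start = 0, 0
    while (idx := s.find(pat, start)) != -1:
        count += 1
        start = idx + 1
    return count

def minimum_unique_substrings(s: str):
    """For each start position, the shortest substring beginning there that
    occurs exactly once in s (None if no suffix starting there is unique)."""
    result = []
    for i in range(len(s)):
        mus = None
        for j in range(i + 1, len(s) + 1):
            if occurrences(s, s[i:j]) == 1:
                mus = s[i:j]
                break
        result.append(mus)
    return result

mus = minimum_unique_substrings("ACGTACGA")
# "ACG" occurs twice, so position 0 must extend to "ACGT" to become unique,
assert mus[0] == "ACGT"
# while "T" is already unique on its own: MUS length adapts to local complexity.
assert mus[3] == "T"
```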
bioinformatics · 2026-03-03 · v1

Improved prediction of virus-human protein-protein interactions by incorporating network topology and viral molecular mimicry
Zhang, Z.; Feng, Y.; Meng, X.; Peng, Y.
Abstract
The protein-protein interactions (PPIs) between viruses and humans play crucial roles in viral infections. Although numerous computational approaches have been proposed for predicting virus-human PPIs, their performances remain suboptimal and may be overestimated due to the lack of benchmark datasets. To address these limitations, we first constructed a carefully curated benchmark dataset, ensuring non-overlapping PPIs and minimal sequence similarity of both human and viral proteins between the training and test sets. Based on this dataset, we developed vhPPIpred, a machine learning-based prediction method that not only incorporated sequence embedding and evolutionary information but also leveraged network topology and viral molecular mimicry of human PPIs. Comparative experiments demonstrated that vhPPIpred outperformed five state-of-the-art methods on both our benchmark dataset and three independent datasets. vhPPIpred also achieved high computational efficiency, requiring relatively low runtime and memory. Finally, vhPPIpred was demonstrated to have great potential in identifying human virus receptors and in inferring virus phenotypes, as its predicted virus-human PPIs can be used to effectively infer virus virulence. In summary, this study provides a valuable benchmark dataset and an effective tool for virus-human PPI prediction, with potential applications in antiviral drug discovery, host-pathogen interaction research and early warnings of emerging viruses.
bioinformatics · 2026-03-03 · v1

LLPSight: enhancing prediction of LLPS-driving proteins using machine learning and protein Language Models
GONAY, V.; VITALE, R.; STEGMAYER, G.; Dunne, M. P.; KAJAVA, A. V.
Abstract
In eukaryotic cells, essential functions are often confined within organelles enclosed by lipid membranes. Increasing evidence, however, highlights the role of membrane-less organelles (MLOs), formed through liquid-liquid phase separation (LLPS). MLO assemblies are typically initiated by "driver" proteins, which form a scaffold to recruit additional "client" molecules. By leveraging expanding MLO datasets and modern machine learning approaches, we developed LLPSight, an ML-based predictor of LLPS-driving proteins. The model was trained using rigorously curated datasets: a positive set of proteins experimentally confirmed to drive LLPS in vivo and a negative set of soluble, unstructured proteins not associated with LLPS. For the features, we employed a cutting-edge approach using embeddings from protein Language Models. LLPSight achieves the highest F1 score (0.885) among existing tools, enabling more efficient discovery of new LLPS drivers eagerly awaited by researchers for experimental validation. An additional key feature of LLPSight is its ability to perform proteome-wide analyses; application to the human proteome yielded promising targets. LLPSight can be obtained from authors upon request.
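For readers comparing the reported F1 score against other tools, F1 is the harmonic mean of precision and recall computed from confusion counts. A quick sketch with invented toy counts (not the paper's evaluation data):

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision (tp/(tp+fp)) and recall (tp/(tp+fn))."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy confusion counts for a driver-protein classifier (illustrative only).
assert abs(f1_score(tp=85, fp=10, fn=12) - 0.8854) < 1e-3
```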
bioinformatics · 2026-03-03 · v1

In Silico Screening of Indian Medicinal Herb Compounds for Intestinal α-Glucosidase Inhibition with ADMET and Toxicity Assessment for Postprandial Glucose Management in Type-2 Diabetes
Roy, D. A. C.; GHOSH, D. I.
Abstract
Postprandial hyperglycemia is a major concern in type 2 diabetes, and inhibition of intestinal alpha-glucosidases is an established method for controlling post-meal glucose excursions. In this study, we conducted an in-silico screening of phytochemicals from different well-known medicinal plants (Withania somnifera, Rauwolfia serpentina, Curcuma longa, and Camellia sinensis) against MGAM, using the clinically approved inhibitor miglitol as a reference for docking protocol validation. Molecular docking revealed that miglitol binds to MGAM with a binding energy of -6.86 kcal/mol and an RMSD of 1.04 (with the co-crystal structure; PDB ID: 3L4W); however, several phytochemicals exhibited binding affinities equal to or stronger than miglitol. Among these, Withanolide B (-9.25 kcal/mol) and Withanone (-7.57 kcal/mol) from Withania somnifera showed the highest predicted affinities, indicating robust engagement of the MGAM catalytic pocket. Rauwolfia serpentina alkaloids such as yohimbine (-8.50 kcal/mol) and raubasine (-8.46 kcal/mol) also displayed strong binding energies, whereas curcuminoids (curcumin -6.36 kcal/mol; deoxycurcumin -6.35 kcal/mol) and tea catechins (e.g., epicatechin gallate -6.85 kcal/mol) demonstrated moderate affinity. Interaction analysis showed that top-ranking compounds formed extensive hydrogen-bonding and hydrophobic interactions with key catalytic residues of MGAM, suggesting stable occupancy of the active site. In-silico ADME profiling predicted favorable gastrointestinal absorption for lead phytochemicals, supporting their potential for oral intestinal action.
Collectively, these results identify plant-derived ligands with binding energies comparable to or exceeding that of miglitol, highlighting Withania somnifera withanolides as priority candidates for experimental validation in enzyme inhibition assays and glucose tolerance models, and providing a focused set of natural MGAM inhibitors for further translational investigation in postprandial glucose control.
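Docking energies in kcal/mol can be read as rough predicted dissociation constants via ΔG = RT ln Kd. The conversion below is a back-of-envelope sketch assuming T ≈ 298 K and that the docking score approximates a binding free energy (which scoring functions only loosely do):

```python
import math

R = 0.0019872  # gas constant, kcal/(mol*K)
T = 298.15     # assumed temperature, K

def kd_from_dg(dg_kcal_per_mol: float) -> float:
    """Approximate dissociation constant (M) from a binding free energy:
    dG = RT ln(Kd)  =>  Kd = exp(dG / RT)."""
    return math.exp(dg_kcal_per_mol / (R * T))

kd_miglitol    = kd_from_dg(-6.86)  # reference inhibitor
kd_withanolide = kd_from_dg(-9.25)  # strongest-scoring phytochemical

# A ~2.4 kcal/mol more favourable score corresponds to a much tighter
# predicted Kd (roughly 50-fold here).
assert kd_withanolide < kd_miglitol
```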
bioinformatics · 2026-03-03 · v1

scUnify: A Unified Framework for Zero-shot Inference of Single-Cell Foundation Models
KIM, D.; Jeong, K.; KIM, K.
Abstract
Foundation models (FMs) pre-trained on large-scale single-cell RNA sequencing (scRNA-seq) data provide powerful cell embeddings, but their practical usability and systematic comparison are limited by model-specific environments, preprocessing pipelines, and execution procedures. To address these challenges, we introduce scUnify, a unified zero-shot inference framework for single-cell foundation models. scUnify accepts a standard AnnData object and automatically manages environment isolation, preprocessing, and tokenization through a registry-based modular design. It employs a hierarchical distributed inference strategy that combines Ray-based task scheduling with multi-GPU data-parallel execution via HuggingFace Accelerate, enabling scalable inference on datasets containing up to one million cells. In addition, built-in integration of scIB and scGraph metrics enables standardized cross-model embedding evaluation within a single workflow. Benchmarking results demonstrate substantial reductions in inference time compared with the original model implementations, while preserving embedding quality and achieving near-linear multi-GPU scaling. scUnify is implemented in Python and is publicly available at https://github.com/DHKim327/scUnify.
bioinformatics · 2026-03-03 · v1

RankMap: Rank-based reference mapping for fast and robust cell type annotation in spatial and single-cell transcriptomics
Cheng, J.; Li, S.; Kim, S.; Ang, C. H.; Chew, S. C.; Chow, P. K.-H.; Liu, N.
Abstract
Accurate cell type annotation is essential for the analysis of single-cell and spatial transcriptomics data. While reference-based annotation methods have been widely adopted, many existing approaches rely on full-transcriptome profiles and incur substantial computational cost, limiting their applicability to large-scale spatial datasets and platforms with partial gene panels. Here, we present RankMap (https://github.com/jinming-cheng/RankMap), an efficient and flexible R package for reference-based cell type annotation across both single-cell and spatial transcriptomics. RankMap transforms gene expression profiles into rank-based representations using the top expressed genes per cell, improving robustness to platform-specific biases and expression scale differences. A multinomial regression model trained with elastic net regularization is then used to predict cell types and associated confidence scores. We benchmarked RankMap on five spatial transcriptomics datasets, including Xenium, MERFISH, and Stereo-seq, as well as two single-cell datasets, and compared it with established methods such as SingleR, Azimuth, and RCTD. RankMap achieved competitive or superior annotation accuracy while consistently reducing runtime compared to existing methods, particularly for large spatial datasets. These results demonstrate that RankMap provides a scalable and robust solution for reference-based cell type annotation in modern single-cell and spatial transcriptomics studies.
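The core idea of a rank-based representation, keep the top expressed genes per cell and replace expression values by their ranks, can be sketched as follows (a hypothetical simplification, not RankMap's actual encoding):

```python
def rank_representation(expr: dict, top_k: int = 3) -> dict:
    """Replace expression values by ranks over the top_k most-expressed genes;
    everything else is dropped. Rank 1 = highest expression."""
    top = sorted(expr.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    return {gene: rank for rank, (gene, _) in enumerate(top, start=1)}

# Two cells with very different expression scales but the same marker ordering
# map to the same rank profile -- the point of rank-based robustness.
cell_a = {"CD3E": 50.0, "CD8A": 30.0, "GZMB": 10.0, "MKI67": 1.0}
cell_b = {"CD3E": 5.0,  "CD8A": 3.0,  "GZMB": 1.0,  "MKI67": 0.1}
assert rank_representation(cell_a) == rank_representation(cell_b) == \
       {"CD3E": 1, "CD8A": 2, "GZMB": 3}
```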
bioinformatics · 2026-03-03 · v1

Enabling Megascale Microbiome Analysis with DartUniFrac
Zhao, J.; McDonald, D.; Sfiligoi, I.; Lladser, M. E.; Patel, L.; Weng, Y.; Khatib, L.; Degregori, S.; Gonzalez, A.; Lozupone, C.; Knight, R.
Abstract
We introduce a new algorithm, DartUniFrac, and a near-optimal implementation with GPU acceleration, up to three orders of magnitude faster than the state of the art and scaling to millions of samples (pairwise) and billions of taxa. DartUniFrac connects UniFrac with weighted Jaccard similarity and exploits sketching algorithms for fast computation. We benchmark DartUniFrac against exact UniFrac implementations, demonstrating that DartUniFrac is statistically indistinguishable from them on real-world microbiome and metagenomic datasets.
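The weighted Jaccard similarity that DartUniFrac builds on is, for two non-negative vectors, the sum of element-wise minima over the sum of element-wise maxima. A minimal sketch (toy branch-weight vectors; the real algorithm estimates this via sketching rather than computing it exactly):

```python
def weighted_jaccard(u, v):
    """Weighted Jaccard similarity of two equal-length non-negative vectors:
    sum of element-wise minima over sum of element-wise maxima."""
    num = sum(min(a, b) for a, b in zip(u, v))
    den = sum(max(a, b) for a, b in zip(u, v))
    return num / den if den else 1.0

# Toy branch-weight vectors for two samples over a shared set of tree edges.
sample1 = [0.5, 0.2, 0.0, 0.3]
sample2 = [0.5, 0.0, 0.1, 0.3]
sim = weighted_jaccard(sample1, sample2)
assert abs(sim - 0.8 / 1.1) < 1e-9
```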
bioinformatics · 2026-03-03 · v1

Evaluating Few-Shot Meta-Learning using STUNT for Microbiome-Based Disease Classification
Peng, C.; Abeel, T.
Abstract
The human gut microbiome is increasingly explored as a diagnostic indicator for disease, yet machine learning models trained on metagenomic data are often constrained by limited sample sizes and poor cross-cohort generalizability. Meta-learning, a machine learning paradigm that optimizes models for rapid adaptation to new tasks with limited examples, offers a promising strategy to address this by leveraging the potential shared microbial structure across publicly available metagenomic datasets. Here, we evaluated STUNT, a framework combining self-supervised pretraining with metric-based meta-learning (Prototypical Networks), for few-shot microbiome-based disease classification. Using over 5,000 species-level gut metagenomic profiles from 57 cohorts in GMrepo v2, we meta-trained STUNT on 52 cohorts and evaluated the pretrained embedding on five held-out disease cohorts covering rheumatoid arthritis (RA), gestational diabetes mellitus during pregnancy (GDM), non-alcoholic fatty liver disease (NAFLD), diabetes mellitus, type 1 (T1D), and inflammatory bowel disease (IBD). We compared Prototypical Networks, Logistic Regression, and Random Forest with and without STUNT-derived embeddings across shot sizes of 1 to 10 samples per class. We found that STUNT-derived embeddings provided a modest benefit only under extreme data scarcity (one labeled sample per class) and this advantage rapidly diminished and reversed with additional samples, indicating that the meta-learned representations impose an information bottleneck limiting access to task-specific signals. Classification performance varied substantially across cohorts, consistent with PERMANOVA-estimated microbiome-disease separability. These results highlight the need for representation learning approaches that preserve disease- and cohort-specific variation and suggest that intrinsic biological signal strength is the primary determinant of classification success.
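At the core of the Prototypical Networks used here, each class prototype is the mean of its support-set embeddings, and a query is assigned to the nearest prototype. A minimal pure-Python sketch of that classification step (toy 2-D embeddings, not the study's learned representations):

```python
import math

def prototype(embeddings):
    """Class prototype = element-wise mean of support embeddings."""
    dim = len(embeddings[0])
    return [sum(e[d] for e in embeddings) / len(embeddings) for d in range(dim)]

def classify(query, prototypes):
    """Assign the query to the class with the nearest (Euclidean) prototype."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(prototypes, key=lambda label: dist(query, prototypes[label]))

# 2-shot toy episode: two classes in a 2-D embedding space.
support = {
    "healthy": [[0.0, 0.1], [0.2, -0.1]],
    "disease": [[1.0, 1.1], [0.8, 0.9]],
}
protos = {label: prototype(embs) for label, embs in support.items()}
assert classify([0.9, 1.0], protos) == "disease"
assert classify([0.1, 0.0], protos) == "healthy"
```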
bioinformatics · 2026-03-03 · v1

Towards Cross-Sample Alignment for Multi-Modal Representation Learning in Spatial Transcriptomics
Dai, J.; Nonchev, K.; Koelzer, V. H.; Raetsch, G.
Abstract
The growing number of spatial transcriptomics (ST) datasets enables comprehensive multi-modal characterization of cell types across diverse biological and clinical contexts. However, integration across patient cohorts remains challenging, as local microenvironment, patient-specific variability, and technical batch effects can dominate signals. Here, we hypothesize that combining specialized transcriptomics correction methods with deep representation learning can jointly align morphology, transcriptomics, and spatial information across multiple tissue samples. This approach benefits from recent transcriptomics and pathology foundation models, projecting cells into a shared embedding space where they cluster by cell type rather than dataset-specific conditions. Applying this framework to 18 skin melanoma, 12 human brain, and 4 lung cancer datasets, we demonstrate that it outperforms conventional batch-correction approaches by 58%, 38%, and 2-fold, respectively. Together, this framework enables efficient integration of multi-modal ST data across modalities and samples, facilitating the systematic discovery of conserved cellular programs and spatial niches while remaining robust to cohort-specific batch effects. Code availability: https://github.com/ratschlab/aestetik
bioinformatics · 2026-03-03 · v1

Pinc: a simple probabilistic AlphaFold interaction score
Toth-Petroczy, A.; Badonyi, M.
Abstract
Motivation: Screening of interacting proteins with AlphaFold has become widespread in biological research owing to its utility in generating and testing hypotheses. While several model quality and interaction confidence metrics have been developed, their interpretation is not always straightforward.
Results: Here, building on a previously published method, we address this limitation by converting predicted aligned errors of an AlphaFold model into conditional contact probabilities. We show that, without additional parametrisation, the contact probabilities are readily calibrated to the fraction of native contacts observed across experimentally determined protein dimers. We find that the average contact probability for interacting chains, termed Pinc (probability of interface native contacts), is more sensitive to interactions involving smaller interfaces than many commonly used scores. We provide an R script to calculate Pinc for AlphaFold models, and propose its use as an alternative scoring metric for interaction screens and for prioritising interface residues for experimental validation.
Availability and implementation: An R script and a Colab notebook are available at https://git.mpi-cbg.de/tothpetroczylab/Pinc
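The averaging step, turning per-residue-pair contact probabilities into one interface score, can be sketched as below. The PAE-to-probability sigmoid here is purely hypothetical (Pinc's actual transform is derived and calibrated in the paper); only the idea of averaging over inter-chain pairs is illustrated:

```python
import math

def pae_to_prob(pae: float, midpoint: float = 8.0, slope: float = 1.5) -> float:
    """Hypothetical monotone map from predicted aligned error (Angstroms) to a
    contact probability: low PAE -> high probability. Illustrative only; the
    real Pinc transform is calibrated against native contacts."""
    return 1.0 / (1.0 + math.exp((pae - midpoint) / slope))

def interface_score(pae_matrix, chain_a_idx, chain_b_idx):
    """Average contact probability over all inter-chain residue pairs."""
    probs = [pae_to_prob(pae_matrix[i][j])
             for i in chain_a_idx for j in chain_b_idx]
    return sum(probs) / len(probs)

# Toy 4x4 PAE matrix: residues 0-1 = chain A, residues 2-3 = chain B.
pae = [
    [0.5, 1.0, 4.0, 5.0],
    [1.0, 0.5, 3.5, 6.0],
    [4.0, 3.5, 0.5, 1.0],
    [5.0, 6.0, 1.0, 0.5],
]
score = interface_score(pae, [0, 1], [2, 3])
assert 0.0 < score < 1.0
assert score > 0.5  # low inter-chain PAE -> confident interface
```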
bioinformatics · 2026-03-03 · v1

Navigating the peptide sequence space in search for peptide binders with BoPep
Hartman, E.; Samsudin, F.; Siljehag Alencar, M.; Tang, D.; Bond, P. J.; Schmidtchen, A.; Malmstrom, J.
AI Summary
- The study developed BoPep, a framework using Bayesian optimization to efficiently explore peptide sequence space for protein binders, reducing the need for extensive docking evaluations.
- BoPep was applied to peptides from clinical wound fluids, the human proteome, and de novo designs, identifying novel peptide classes that bind CD14 and neutralize pneumolysin's hemolytic activity.
Abstract
Peptides are short amino-acid chains that mediate essential biological processes, including antimicrobial defence, immune modulation and cell signalling. Their high degree of modularity, biocompatibility and capacity to bind proteins with high specificity make them attractive therapeutic candidates. However, identifying peptides that bind and modulate the function of specific proteins remains challenging due to the immense size of the peptide sequence space. To address this challenge, we developed BoPep (Bayesian Optimization for Peptides), an end-to-end modular framework that effectively navigates the landscape of peptide-protein interactions by directing the search toward informative regions of sequence space and prioritizing candidates with high binding potential. By focusing computational effort where it is most informative and using calibrated uncertainty to balance exploration and exploitation, BoPep reduces the number of expensive docking evaluations by orders of magnitude. We demonstrate the utility of BoPep by applying it to three sources of peptides: endogenous proteolytic fragments from clinical wound fluids, the complete human proteome, and a de novo designed peptide landscape generated by diffusion-based backbone sampling. Using these sources, we uncover novel encrypted peptide classes that bind CD14 and identify peptides that neutralize the hemolytic activity of pneumolysin, a major bacterial virulence factor. Together, these findings show that BoPep accelerates the identification of testable therapeutic leads from large and diverse peptide collections. BoPep is available on GitHub.
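The exploration/exploitation trade-off central to Bayesian optimization can be illustrated with an upper-confidence-bound (UCB) acquisition over candidates with a predicted mean and uncertainty. All names and numbers below are hypothetical, and BoPep's actual surrogate model and acquisition function may differ:

```python
def select_next(candidates, beta: float = 2.0):
    """Pick the candidate maximizing mean + beta * std (UCB acquisition):
    favours high predicted binding (exploitation) or high uncertainty
    (exploration), depending on beta."""
    return max(candidates, key=lambda c: c["mean"] + beta * c["std"])

# Hypothetical surrogate predictions for three peptide candidates.
candidates = [
    {"seq": "KLWKKWAKKW", "mean": 0.70, "std": 0.05},  # good, well-explored
    {"seq": "GASNIFDAGQ", "mean": 0.40, "std": 0.30},  # uncertain region
    {"seq": "PPPGGGSSSA", "mean": 0.30, "std": 0.02},  # poor and certain
]
chosen = select_next(candidates, beta=2.0)
# With beta=2: 0.70+0.10=0.80 vs 0.40+0.60=1.00 vs 0.30+0.04=0.34,
# so the uncertain candidate is sent to docking next.
assert chosen["seq"] == "GASNIFDAGQ"
```

Only the candidate chosen by the acquisition is docked; the result updates the surrogate, and the loop repeats, which is how the expensive evaluations are saved.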
bioinformatics | 2026-03-02 | v2
A Query-to-Dashboard Framework for Reproducible PubMed-Scale Bibliometrics and Trend Intelligence
Kidder, B. L.
AI Summary
- The study introduces PubMed Atlas, a platform for conducting topic-specific bibliometric analyses using PubMed E-utilities, which retrieves and organizes metadata into a SQLite database for analysis.
- An interactive Streamlit dashboard allows for the exploration of publication trends, journal distributions, MeSH term frequencies, and author geography.
- The framework was applied to cancer stem cell biology and stem cell transcriptional regulatory networks, demonstrating its utility in identifying research trends and gaps.
Abstract
The rapid expansion of biomedical literature necessitates computational approaches for systematic analysis of publication patterns, identification of emerging scientific themes, and characterization of field evolution. We present PubMed Atlas, an integrated command-line and web-based platform for conducting topic-specific bibliometric analyses through programmatic access to PubMed E-utilities. This workflow retrieves PubMed identifiers matching user-defined queries, downloads comprehensive metadata in batch mode, extracts structured information including titles, abstracts, author affiliations, Medical Subject Headings, publication classifications, funding acknowledgments, and digital object identifiers, then organizes these data within a local SQLite relational database optimized for rapid queries and visualization. An accompanying Streamlit-based interactive dashboard enables exploration of temporal publication patterns, journal distribution profiles, MeSH term frequencies, geographic author distributions, and direct linking to recent publications. We demonstrate the application of PubMed Atlas to cancer stem cell biology and stem cell transcriptional regulatory network research, providing a framework for reproducible bibliometric investigation and systematic identification of research gaps within dynamically evolving scientific domains.
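The retrieve-then-store pattern described above reduces, at its core, to a small relational schema plus aggregate queries. The sketch below is illustrative only: the table layout and column names are assumptions, not PubMed Atlas's actual schema, and the E-utilities download is replaced by hard-coded rows.

```python
import sqlite3

con = sqlite3.connect(":memory:")          # the real workflow uses an on-disk DB
con.execute("""
    CREATE TABLE articles (
        pmid    INTEGER PRIMARY KEY,       -- PubMed identifier
        title   TEXT,
        journal TEXT,
        year    INTEGER,
        mesh    TEXT,                      -- Medical Subject Headings
        doi     TEXT
    )""")

# In the real workflow these rows come from batched E-utilities metadata fetches.
rows = [
    (1111, "Cancer stem cell plasticity", "J Example A", 2023, "Neoplastic Stem Cells", None),
    (2222, "ESC regulatory networks",     "J Example B", 2024, "Gene Regulatory Networks", None),
    (3333, "CSC niche signalling",        "J Example A", 2024, "Neoplastic Stem Cells", None),
]
con.executemany("INSERT INTO articles VALUES (?, ?, ?, ?, ?, ?)", rows)

# The kind of temporal trend query the dashboard plots:
per_year = dict(con.execute(
    "SELECT year, COUNT(*) FROM articles GROUP BY year ORDER BY year"))
```

Journal distributions, MeSH frequencies, and geographic breakdowns are the same pattern with different `GROUP BY` columns.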
bioinformatics | 2026-03-02 | v1
Density-guided AlphaFold3 uncovers unmodelled conformations in β2-microglobulin
Maddipatla, S. A.; Vedula, S.; Bronstein, A. M.; Marx, A.
AI Summary
- The study uses density-guided AlphaFold3 to model alternative backbone conformations of β2-microglobulin from crystallographic maps, which are typically obscured in standard X-ray crystallography models.
- Findings show that the approach can reveal conformational heterogeneity influenced by electron density quality, crystallization conditions, and lattice packing.
- This method enhances the ability to capture the full structural landscape of proteins, improving macromolecular crystallography interpretation.
Abstract
Although X-ray crystallography captures the ensemble of conformations present within the crystal lattice, models typically depict only the most dominant conformation, obscuring the existence of alternative states. Applying the electron density-guided AlphaFold3 approach to β2-microglobulin highlights how ensembles of alternate backbone conformations can be systematically modeled directly from crystallographic maps. This study also highlights how the detection of conformational ensembles is affected by the local quality of electron density and subtle variations in crystallization conditions and lattice packing. These results demonstrate that density-guided AlphaFold3 can uncover conformational heterogeneity missed by conventional refinement, offering a robust, systematic framework to capture the full structural landscape of proteins in crystals and enhancing the interpretive power of macromolecular crystallography.
bioinformatics | 2026-03-02 | v1
Synora: vector-based boundary detection for spatial omics
Li, J.-T.; Liang, Z.; Fu, Z.; Chen, H.; Liang, Y.-L.; Liu, N.; Wu, Q.-N.; Liu, Z.; Zheng, Y.; Huo, J.; Li, X.; Zuo, Z.; Zhao, Q.; Liu, Z.-X.
AI Summary
- Synora is a computational framework for detecting tumor-stroma boundaries in spatial omics data, using only cell coordinates and binary annotations.
- It introduces 'orientedness' to differentiate true boundary cells from infiltrated regions, integrating this with diversity measures into a BoundaryScore.
- Synora effectively identifies boundaries in synthetic and real datasets, revealing gene signatures and spatial patterns, and performs well under data perturbations.
Abstract
Tumor-stroma boundaries are critical microenvironmental niches where malignant and non-malignant cells exchange signals that shape invasion, immune modulation and therapeutic response. Spatial omics platforms now resolve these interfaces at single-cell scale, but computational boundary detection remains challenging because heterogeneous neighborhoods can arise either from true compartment interfaces or from unstructured immune infiltration. Here we present Synora, a modality-agnostic computational framework that identifies tumor boundaries using only cell coordinates and binary tumor/non-tumor annotations, making it readily applicable across a broad range of spatial omics modalities. Synora introduces 'orientedness', a novel metric that quantifies directional neighborhood asymmetry and distinguishes true boundary cells, where neighbors are spatially segregated by type, from infiltrated regions where cell types intermingle randomly. By integrating orientedness with traditional diversity measures into a unified BoundaryScore, Synora achieves robust boundary identification across synthetic datasets with ground-truth boundaries, maintaining performance under realistic perturbations including 50% missing cells and 25% infiltration. Application to 15 Visium HD spatial transcriptomic datasets across multiple cancer types reveals consistent boundary-enriched gene signatures and cell-type spatial gradients. Validation on a CODEX multiplexed protein dataset demonstrates that Synora's precise boundary identification enables discovery of clinically relevant cellular neighborhoods and disease-associated spatial patterns missed by frequency-based approaches. Synora enables boundary-aware spatial analyses by making tissue interfaces quantifiable from minimal inputs, helping to standardize interface detection and comparison across spatial omics platforms and biological contexts.
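The intuition behind 'orientedness' can be illustrated with mean neighbor directions: at a true boundary, tumor and non-tumor neighbors sit on opposite sides of the focal cell, so their mean direction vectors point apart. The formula below is a plausible reading for illustration only, not Synora's published definition.

```python
import numpy as np

def orientedness(focal, neighbors, labels):
    """Directional asymmetry of a neighborhood (illustrative formula):
    near 1 when tumor (label 1) and non-tumor (label 0) neighbors lie on
    opposite sides of the focal cell, near 0 when the types intermingle."""
    d = neighbors - focal
    u = d / np.linalg.norm(d, axis=1, keepdims=True)   # unit direction vectors
    mean_tumor = u[labels == 1].mean(axis=0)
    mean_other = u[labels == 0].mean(axis=0)
    return 0.5 * float(np.linalg.norm(mean_tumor - mean_other))

focal = np.zeros(2)
segregated = orientedness(                 # boundary-like: types on opposite sides
    focal,
    np.array([[1.0, 0.0], [1.0, 1.0], [-1.0, 0.0], [-1.0, -1.0]]),
    np.array([1, 1, 0, 0]))
mixed = orientedness(                      # infiltration-like: types intermingled
    focal,
    np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]),
    np.array([1, 1, 0, 0]))
```

A plain diversity measure scores both neighborhoods identically (two of each type); only the directional term separates them, which is the gap the BoundaryScore is designed to close.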
bioinformatics | 2026-03-02 | v1
STCS: A Platform-Agnostic Framework for Cell-Level Reconstruction in Sequencing-Based Spatial Transcriptomics
Chen Wu, L.; Hu, X.; Zhan, F.; Sun, C.; Gonzales, J.; Ofer, R.; Tran, T.; Verzi, M. P.; Liu, L.; Yang, J.
AI Summary
- The study introduces STCS, a platform-agnostic framework for reconstructing single-cell expression profiles from sequencing-based spatial transcriptomics data by integrating transcriptomic and spatial data from H&E images.
- STCS uses two interpretable parameters for optimization, selected via internal metrics, and outperforms existing methods in reconstructing cell-level data from Visium HD and Stereo-seq datasets.
Abstract
Sequencing-based spatial transcriptomics platforms such as Visium HD and Stereo-seq achieve transcriptome-wide coverage at subcellular resolution, yet their measurements are defined over spatially barcoded units rather than biologically segmented cells. Reconstructing coherent cell-level expression profiles from these data remains a central computational challenge. Here, we introduce Spatial Transcriptomics Cell Segmentation (STCS), a platform-agnostic framework that reconstructs single-cell expression profiles by assigning spatial units to nuclei segmented from paired H&E images, using a combined transcriptomic and spatial distance. STCS is governed by two interpretable parameters that can be selected using reference-free internal metrics. On both Visium HD human lung cancer data with matched Xenium references and Stereo-seq mouse brain data, STCS achieves consistent improvements over existing methods across multiple evaluation dimensions. STCS is fully open-source and designed for broad applicability across sequencing-based spatial transcriptomics technologies.
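The core assignment step, spatial units to nuclei via a combined distance, can be sketched as below. The weighting scheme, normalization, and Euclidean metrics here are placeholders for illustration, not STCS's published formulation.

```python
import numpy as np

def assign_units(units_xy, units_expr, nuclei_xy, nuclei_expr, w=0.5):
    """Assign each spatial unit to the nucleus minimizing a weighted sum of
    normalized spatial and transcriptomic distances (illustrative rule)."""
    sp = np.linalg.norm(units_xy[:, None, :] - nuclei_xy[None, :, :], axis=2)
    tr = np.linalg.norm(units_expr[:, None, :] - nuclei_expr[None, :, :], axis=2)
    score = w * sp / sp.max() + (1.0 - w) * tr / tr.max()
    return score.argmin(axis=1)            # nucleus index per spatial unit

units_xy    = np.array([[1.0, 0.0], [9.0, 0.0]])      # barcoded unit positions
units_expr  = np.array([[0.9, 0.1], [0.1, 0.9]])      # toy 2-gene profiles
nuclei_xy   = np.array([[0.0, 0.0], [10.0, 0.0]])     # H&E-segmented nuclei
nuclei_expr = np.array([[1.0, 0.0], [0.0, 1.0]])
labels = assign_units(units_xy, units_expr, nuclei_xy, nuclei_expr)
```

The parameter `w` plays the role of one of the two interpretable knobs the abstract mentions: at `w=1` the rule degenerates to nearest-nucleus assignment, at `w=0` to purely transcriptomic matching.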
bioinformatics | 2026-03-02 | v1
STEQ: A statistically consistent quartet distance based species tree estimation method
Saha, P.; Saha, A.; Roddur, M. S.; Sikdar, S.; Anik, N. H.; Reaz, R.; Bayzid, M. S.
AI Summary
- The study introduces STEQ, a new method for estimating species trees from multi-locus data using a quartet-based distance metric, which is statistically consistent under the multi-species coalescent model.
- STEQ offers faster computation, with a time complexity of O(kn^2 log n) for n taxa and k genes, outperforming methods like ASTRAL in speed.
- Evaluations on simulated and empirical datasets show STEQ maintains competitive accuracy with leading methods like ASTRAL and wQFM-TREE while significantly reducing inference time.
Abstract
Accurate estimation of large-scale species trees from multi-locus data in the presence of gene tree discordance remains a major challenge in phylogenomics. Although maximum likelihood, Bayesian, and statistically consistent summary methods can infer species trees with high accuracy, most of these methods are slow and do not scale to large numbers of taxa and genes. One promising route to large-scale phylogeny estimation is distance-based methods. Here, we present STEQ, a new statistically consistent, fast, and accurate distance-based method to estimate species trees from a collection of gene trees. We used a quartet-based distance metric that is statistically consistent under the multi-species coalescent (MSC) model. The running time of STEQ scales as $\mathcal{O}(kn^2 \log n)$ for $n$ taxa and $k$ genes, which is asymptotically faster than leading summary-based methods such as ASTRAL. We evaluated the performance of STEQ in comparison with ASTRAL and wQFM-TREE -- two of the most popular and accurate coalescent-based methods. Experimental findings on a collection of simulated and empirical datasets suggest that STEQ enables significantly faster inference of species trees while maintaining competitive accuracy with the best current methods. STEQ is publicly available at \url{https://github.com/prottoysaha99/STEQ}.
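The quartet distance underlying such methods can be made concrete on toy trees: for any four taxa, a tree induces the pairing that minimizes within-pair path length (the four-point condition), and the distance between two trees is the fraction of 4-taxon subsets on which their induced topologies disagree. The brute-force sketch below is for illustration and is nothing like STEQ's efficient algorithm.

```python
from itertools import combinations

def quartet_topology(d, q):
    # For taxa (a, b, c, e), pick the pairing xy|zw with the smallest
    # within-pair path-length sum (four-point condition).
    a, b, c, e = q
    options = {(a, b): d[a][b] + d[c][e],
               (a, c): d[a][c] + d[b][e],
               (a, e): d[a][e] + d[b][c]}
    return min(options, key=options.get)

def quartet_distance(d1, d2, taxa):
    # Fraction of 4-taxon subsets on which the two trees disagree.
    quartets = list(combinations(taxa, 4))
    diff = sum(quartet_topology(d1, q) != quartet_topology(d2, q)
               for q in quartets)
    return diff / len(quartets)

def as_matrix(pair_dists):
    # Expand {(x, y): distance} pairs into a symmetric lookup table.
    d = {}
    for (x, y), v in pair_dists.items():
        d.setdefault(x, {})[y] = v
        d.setdefault(y, {})[x] = v
    return d

# Path lengths of two 4-taxon trees: t1 induces ab|cd, t2 induces ac|bd.
t1 = as_matrix({("a", "b"): 2, ("c", "d"): 2, ("a", "c"): 3,
                ("a", "d"): 3, ("b", "c"): 3, ("b", "d"): 3})
t2 = as_matrix({("a", "c"): 2, ("b", "d"): 2, ("a", "b"): 3,
                ("a", "d"): 3, ("b", "c"): 3, ("c", "d"): 3})
qd = quartet_distance(t1, t2, "abcd")      # the two trees disagree everywhere
```

Enumerating all quartets costs O(n^4) per tree pair, which is exactly why a subquadratic-per-gene formulation like STEQ's matters at scale.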
bioinformatics | 2026-03-02 | v1
Genomic language models improve cross-species gene expression prediction and accurately capture regulatory variant effects in Brachypodium mutant lines
Vahedi Torghabeh, B.; Moslemi, C.; Dybdal Jensen, J.; Hentrup, S.; Li, T.; Yu, X.; Wang, H.; Asp, T.; Ramstein, G. P.
AI Summary
- This study developed deep learning sequence-to-expression (S2E) models using context-aware sequence embeddings from PlantCaduceus to predict gene expression across 17 plant species, incorporating chromatin accessibility data.
- The models showed superior performance over PhytoExpr in predicting gene expression across species (Pearson R=0.82 vs. R=0.74) and in Brachypodium mutant lines for between-gene expression differences (β=0.78 vs. β=0.57).
- Notably, the models accurately predicted single-nucleotide mutation effects on within-gene expression, outperforming existing models (β=0.38 vs. β=0.08).
Abstract
Predicting gene expression from cis-regulatory DNA sequences at the promoter and terminator regions is a central challenge in plant genomics. This capability is also a prerequisite for assessing the effects of regulatory mutations on gene expression. Here, we developed deep learning sequence-to-expression (S2E) models that leverage context-aware sequence embeddings from the PlantCaduceus genomic language model instead of one-hot encoding of sequences, to predict gene expression across 17 plant species. To further improve predictions, we integrated chromatin accessibility data as auxiliary regulatory features. First, we evaluated our models to predict gene expression on unseen gene families via cross-validation, demonstrating our model's prediction accuracy across all species outperforms PhytoExpr, the current state-of-the-art (SOTA) S2E model in plants (Pearson R=0.82 vs. R=0.74). We then validated variant effect predictions using an experimental dataset across 796 Brachypodium mutant lines, specifically designed to test predictions at single-base resolution. Our models outperformed SOTA S2E models in predicting between-gene expression differences (regression coefficient β=0.78 vs. β=0.57). Remarkably, they also accurately predicted the effects of single-nucleotide mutations on within-gene expression, while SOTA S2E models showed only weak associations (regression coefficient β=0.38 vs. β=0.08). Our results demonstrated the value of context-aware DNA sequence embeddings for predicting regulatory variant effects in plants. They also reveal a persistent accuracy gap in S2E models when moving from between-gene to allelic variation, a challenge that needs to be addressed in future S2E studies.
bioinformatics | 2026-03-02 | v1
DNA fragment length analysis using machine learning assisted vibrational spectroscopy
Fatayer, R.; Ahmed, W.; Szeto, I.; Sammut, S.-J.; Senthil Murugan, G.
AI Summary
- This study introduces a rapid, label-free method using ATR-FTIR and Raman spectroscopy combined with machine learning to quantify DNA fragment lengths from 50-300 bp.
- Machine learning models achieved high accuracy in predicting DNA length (R2=0.92-0.96), with multimodal fusion enhancing performance.
- The approach requires minimal sample (4 µL), short processing time (15 minutes), and allows full sample recovery, making it a scalable alternative for DNA length analysis.
Abstract
DNA length analysis is essential for genomic workflows including next-generation sequencing and fragmentomics-based diagnostics. Conventional approaches typically require large, expensive instrumentation and sample-destructive protocols with long processing times. Here we present a rapid, label-free approach integrating vibrational spectroscopy with deep learning to quantify DNA fragment length distributions. We demonstrate that ATR-FTIR and Raman spectroscopy capture length-dependent spectral features arising from phosphate backbone, nucleobase, and structural vibrations. Machine learning models trained on spectra acquired from purified monodisperse DNA (50-300 bp) predicted DNA length with high accuracy (R2=0.92-0.94), with multimodal fusion improving performance to R2=0.96. A convolutional neural network trained on 35 DNA mixtures comprising molecules of different lengths also successfully deconvoluted their fragment length profiles. Transfer learning enabled adaptation to biological samples, achieving low prediction error (RMSE=0.3-7.2%, Δ=12 bp). Importantly, the method requires only 4 µL of sample and 15 minutes of passive drying, with no consumables beyond cleaning materials, and allows full sample recovery. This establishes vibrational spectroscopy as a scalable alternative for DNA length quantification.
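The spectra-to-length regression setup can be mimicked on synthetic data. The mock spectra below (a single length-proportional band near 1240 cm^-1 plus noise) and the ridge model are illustrative stand-ins for the paper's measured ATR-FTIR/Raman spectra and deep models.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
lengths = rng.uniform(50, 300, 200)              # fragment lengths in bp
wavenumbers = np.linspace(900, 1800, 120)        # mock wavenumber axis (cm^-1)

# Toy spectra: a phosphate-backbone-like band whose intensity scales with
# fragment length, plus measurement noise.
band = np.exp(-((wavenumbers - 1240) ** 2) / 800.0)
spectra = (lengths[:, None] / 300.0) * band \
          + 0.02 * rng.standard_normal((200, 120))

X_tr, X_te, y_tr, y_te = train_test_split(spectra, lengths, random_state=0)
model = Ridge(alpha=1.0).fit(X_tr, y_tr)
r2 = r2_score(y_te, model.predict(X_te))         # held-out length prediction
```

The toy signal is far cleaner than real spectra, so the held-out R² here is only a sanity check on the pipeline shape, not a reproduction of the reported 0.92-0.96.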
bioinformatics | 2026-03-02 | v1
Evaluation of deep learning tools for chromatin contact prediction
Nguyen, T. H. T.; Vermeirssen, V.
AI Summary
- This study evaluates five deep learning models (C.Origami, Epiphany, ChromaFold, HiCDiffusion, GRACHIP) for predicting Hi-C contact maps from genomic and epigenomic data.
- Epiphany was found to have the best performance in terms of accuracy, generalization across cell types, and biological relevance.
- Key findings include the importance of CTCF binding and chromatin co-accessibility in prediction accuracy, with only a subset of omics inputs significantly contributing to model performance.
Abstract
Three-dimensional chromatin organization is essential for gene regulation and is commonly measured using Hi-C contact maps. Recent deep learning models have been developed to predict Hi-C maps from genomic and epigenomic features. However, their relative performance and biological interpretability remain poorly understood due to the lack of systematic evaluation. Here, we present a comprehensive benchmarking framework that evaluates five Hi-C prediction models: C.Origami, Epiphany, ChromaFold, HiCDiffusion, and GRACHIP, across predictive accuracy, visual fidelity, and downstream biological analyses. Among them, Epiphany consistently achieved the best overall performance, combining high accuracy, cross-cell-type generalization, realistic map quality, and reliable loop recovery. The framework further shows that epigenomic features, particularly CTCF binding and chromatin co-accessibility, are the primary drivers of accurate Hi-C pattern prediction. Notably, although many models incorporate multiple omics inputs, only a limited subset substantially contributes to performance. This manuscript clarifies model behaviour and provides guidance for developing and interpreting Hi-C prediction methods.
bioinformatics | 2026-03-02 | v1
miREA: a network-based tool for microRNA-oriented enrichment analysis
Zhang, Z.; Lai, X.
AI Summary
- miREA is a network-based tool designed for miRNA-oriented enrichment analysis, focusing on miRNA-gene interactions (MGIs) to interpret miRNA function at the pathway level.
- It employs five edge-based enrichment methods, integrating expression and interactome data with pathway networks, outperforming traditional node-based methods in sensitivity and biological interpretability.
- Benchmarking in various cancer types, including bladder cancer, demonstrated miREA's effectiveness in identifying relevant pathways and generating mechanistic hypotheses for experimental validation.
Abstract
MicroRNAs (miRNAs) regulate gene expression at the post-transcriptional level. To interpret the function of miRNAs at the pathway level, it is necessary to use enrichment analysis tools that employ gene regulatory networks. However, existing network node-centric methods focus predominantly on gene expression profiles, neglecting the role of regulatory information encoded in miRNA-gene interactions (MGIs) that constitute network edges. This omission introduces analytical bias and limits the methods' biological interpretability. Here, we present miREA, a network-based tool for miRNA enrichment analysis that leverages MGIs to characterize miRNA function at the pathway level. miREA implements five edge-based enrichment methods spanning over-representation, scoring-based, topology-aware, and network propagation approaches by integrating expression and interactome profiles with pathway networks. Benchmarking across multiple cancer types shows that the edge-based methods outperform node-based methods in improving sensitivity to identify relevant pathways and biological interpretability while maintaining controlled false positive rates. We further demonstrate the utility of miREA in elucidating miRNA-gene-pathway regulatory mechanisms in bladder cancer. miREA is a versatile enrichment analysis tool that provides pathway-level interpretation of human miRNA function and facilitates mechanistic hypothesis generation for experimental validation.
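An edge-based over-representation test, one of the method families miREA spans, reduces to a hypergeometric tail computed over interactions rather than genes. The function below is a generic sketch of that statistic, not miREA's exact implementation.

```python
from math import comb

def edge_ora_p(N, K, n, k):
    """Hypergeometric upper-tail p-value for edge-based over-representation:
    with N interactome edges (MGIs), K of them in a pathway, and n deregulated
    edges of which k hit the pathway, how surprising is a count of at least k?"""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)
```

Deregulated miRNA-gene interactions play the role of the n draws; a small p-value flags the pathway's edge set as enriched, which is the edge-level analogue of classic gene-set ORA.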
bioinformatics | 2026-03-02 | v1
Evaluating genome assemblies with HMM-Flagger
Asri, M.; Eizenga, J. M.; Hebbar, P.; Real, T. D.; Lucas, J.; Loucks, H.; Calicchio, A.; Diekhans, M.; Eichler, E. E.; Salama, S.; Miga, K. H.; Paten, B.
AI Summary
- HMM-Flagger uses a hidden Markov model with a Gaussian autoregressive process to detect structural errors in genome assemblies by analyzing read coverage.
- It achieved F1 scores of 78.4% and 60.4% for synthetic errors with Pacific Biosciences HiFi and Oxford Nanopore Technologies R10 data, respectively.
- Applied to real assemblies, it identified large misassemblies in HG002 and showed significant error rate reduction from 0.94% to 0.38% between HPRC releases, validating NOTCH2NL assemblies.
Abstract
HMM-Flagger is a reference-free tool for detecting structural errors in haplotype-resolved genome assemblies based upon the coverage of mapped reads. It models read coverage with a hidden Markov model augmented by a Gaussian autoregressive process, which enables classifying coverage anomalies as erroneous blocks, false duplications, or collapsed blocks. Trained and tested on synthetic misassemblies, it detected synthetic errors using Pacific Biosciences HiFi and Oxford Nanopore Technologies R10 data with F1 scores of 78.4% and 60.4%, respectively. When applied to six HG002 assemblies it revealed multiple large misassemblies including false duplications and collapse events in human satellites. Applied to assemblies from the Human Pangenome Reference Consortium (HPRC), HMM-Flagger demonstrated substantial improvements from release 1 (0.94% error rate) to release 2 (0.38%), reflecting technological advances. HMM-Flagger also validated NOTCH2NL assemblies in HPRC release 2 and confirmed the correctness of three novel structural configurations.
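The coverage-classification idea can be illustrated with a far simpler HMM than HMM-Flagger's Gaussian autoregressive model: Poisson emissions around expected coverages for erroneous (~0x), correctly assembled (~30x), and collapsed (~60x) blocks, decoded with Viterbi. All numbers, state names, and the sticky-transition scheme here are illustrative assumptions.

```python
import math

STATES = ["error", "ok", "collapsed"]
LAMBDA = [0.5, 30.0, 60.0]             # expected read coverage per state

def log_poisson(k, lam):
    # log P(coverage = k) under a Poisson(lam) emission model
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def viterbi(coverage, stay=0.95):
    # Sticky transitions: staying in a state is much likelier than switching,
    # so short coverage blips do not fragment the decoded blocks.
    s = len(LAMBDA)
    keep, move = math.log(stay), math.log((1 - stay) / (s - 1))
    scores = [log_poisson(coverage[0], l) for l in LAMBDA]
    backptr = []
    for k in coverage[1:]:
        row, ptr = [], []
        for j in range(s):
            best = max(range(s),
                       key=lambda i: scores[i] + (keep if i == j else move))
            row.append(scores[best] + (keep if best == j else move)
                       + log_poisson(k, LAMBDA[j]))
            ptr.append(best)
        scores = row
        backptr.append(ptr)
    path = [max(range(s), key=scores.__getitem__)]
    for ptr in reversed(backptr):       # trace the best path backwards
        path.append(ptr[path[-1]])
    return [STATES[i] for i in reversed(path)]

blocks = viterbi([29, 31, 30, 0, 0, 1, 62, 58, 60])
```

A run of near-zero coverage decodes as an erroneous block and a run of doubled coverage as a collapse, which is the qualitative behaviour the real tool builds on.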
bioinformatics | 2026-03-02 | v1
Benchmarking niche identification via domain segmentation for spatial transcriptomics data
Wang, Y.; Chen, Y.; Yang, L.; Wang, C.; Cai, J.; Xin, H.
AI Summary
- This study benchmarks 16 domain segmentation algorithms on high-resolution CosMx ST data from a human lymph node to identify tissue niches, revealing that most algorithms fail to accurately define niche boundaries in their default settings.
- The primary challenge identified is the reduction in spatial signal-to-noise ratio due to stochastic infiltration of peripheral cell types, which obscures key functional lineage distributions.
- Strategic weighting of core functional lineages improved niche resolution, highlighting the need for specialized computational methods for functional microenvironment analysis.
Abstract
Tissue niches are spatially organized microenvironments in which coordinated multicellular interactions shape cellular states and biological functions. Currently, niche identification is routinely performed using domain segmentation frameworks. While interrelated, spatial domains and niches are not fundamentally equivalent. The former emphasizes intra-domain compositional consistency and transcriptomic homogeneity, whereas the latter is defined by the emergent properties of localized signaling gradients and the functional reciprocity between key cell lineages. Here, we present a high-resolution reference by thoroughly annotating single-cell resolution CosMx ST data of a human follicular lymphoid hyperplasia lymph node, a dynamic, non-compartmentalized tissue containing several critical immune niches defined by specific lineage architectures. We systematically benchmarked 16 contemporary domain segmentation algorithms, demonstrating that most methods in their default configurations fail to recapitulate biologically defined niche boundaries. Our analysis reveals that the definitive, disjoint spatial distributions of key functional lineages are frequently obscured by the stochastic infiltration of peripheral cell types. Such reduction in the spatial signal-to-noise ratio represents a primary bottleneck for existing algorithms, which prioritize local transcriptomic variance over global architectural logic. Following this observation, we demonstrate that strategic weighting of core functional lineages can restore the resolution of spatial niches in select domain segmentation frameworks. Cross-comparison against compartmentalized tissues further underscores the unique challenges of niche identification in non-mechanically separated environments and clarifies the fundamental divergence between structural domain segmentation and functional niche discovery. 
Our work delineates the limitations of current paradigms and advocates for the development of specialized computational approaches tailored specifically to the complexity of functional microenvironments.
bioinformatics | 2026-03-02 | v1
GTA-5: A Unified Graph Transformer Framework for Ligands and Protein Binding Sites - Part I: Constructing the PDB Pocket and Ligand Space
Ciambur, B. C.; Pageau, R.; Sperandio, O.
AI Summary
- GTA-5 is a graph transformer auto-encoder framework that integrates ligands and protein binding sites into a unified latent space by representing them as 3D point clouds with Tripos atom type labels.
- Trained on 64,124 liganded pockets and 23,133 unique ligands, GTA-5 clusters functional protein families coherently while capturing physicochemical properties like volume and hydrophobicity.
- The framework supports applications like scaffold hopping, QSAR/QSPR modeling, and drug repurposing by enabling structural reasoning based on spatial context rather than bond connectivity.
Abstract
Structural recognition between a protein target and a ligand underpins therapeutic innovation, yet computational representations of protein binding sites and small molecules remain largely disjoint. Here we introduce GTA-5, a unified graph transformer auto-encoder framework designed to capture the geometric structure and chemical composition of ligands and protein binding pockets, embedding them into multidimensional latent spaces where proximity reflects functional compatibility. Ligands and pockets are represented as three-dimensional point clouds annotated with Tripos atom type labels, omitting explicit bond connectivity to enable structural reasoning based on spatial context rather than predefined connectivity graphs. By not enforcing bond topology, GTA-5 maintains representational flexibility across molecular modalities while preserving chemically meaningful local environments. The model was trained on a curated dataset from the Protein Data Bank comprising 64,124 liganded pockets and 23,133 unique ligands spanning 2,257 protein families. We find that functional protein families cluster coherently in both pocket and ligand latent spaces while retaining biologically meaningful heterogeneity. The model captures physicochemical pocket properties such as volume, exposure, and hydrophobicity directly from raw structural data, while ligands with distinct scaffolds co-localise when occupying similar binding environments. This provides a basis for several downstream applications including scaffold hopping in ligand-based virtual screening, QSAR/QSPR modelling using embedding-derived descriptors, and drug repurposing via pocket similarity. More broadly, the GTA-5 framework establishes a foundation for structural reasoning across molecular modalities in drug discovery.
bioinformatics | 2026-03-02 | v1
ProPrep: An Interactive and Instructional Interface for Proper Protein Preparation with AMBER
Walker, A.; Guberman-Pfeffer, M. J.
AI Summary
- ProPrep is an interactive interface designed to guide users through the process of preparing proteins for molecular dynamics (MD) simulations using AMBER, addressing the need for accessible yet expert-quality preparation.
- It integrates multiple functions including structure downloading, homology searches, alignment, structural repair, mutation application, and simulation setup, all within a single workspace.
- The tool was demonstrated on a 64-heme cytochrome 'nanowire' bundle, completing the preparation from a PDB file to energy minimization in 18 minutes, showcasing its efficiency and transparency through an interactive session log.
Abstract
Millions of experimental and AI-predicted protein structures are now available, and the biosynthetic promise of bespoke proteins is increasingly within reach. The functional characterization challenge thus posed cannot be addressed by experimental techniques alone. Molecular dynamics (MD) simulations offer functional screening with atomic resolution, yet accessibility remains limited. Existing computational chemistry software presents stark trade-offs whereby powerful tools require extensive expertise and manual effort, or user-friendly programs function as black boxes that obscure critical preparation decisions. Herein, we present ProPrep, an interactive workflow manager that guides users through expert-quality MD preparation by showing the 'what, why, and how' of each step while automating tedious manual operations. Within a single workspace, ProPrep integrates (1) downloading structures from multiple sources (PDB, AlphaFold, AlphaFill), (2) performing homology searches, (3) aligning structures, (4) curating and repairing structural issues, (5) applying mutations, (6) parameterizing specialized residues, (7) converting redox-active sites to forcefield-compatible forms, (8) generating topology and coordinate files, and (9) configuring, executing, and analyzing simulations with active monitoring of key quantities via ASCII visualizations. A key innovation is ProPrep's extensible transformer framework for detecting, defining, and transforming redox-active sites--including mono- and polynuclear metal centers, organic cofactors, and redox-active amino acids--for forcefield compatibility. We demonstrate the full workflow on a 64-heme cytochrome 'nanowire' bundle (PDB: 9YUQ), proceeding from a PDB file to energy minimization of the solvated system (467,635 atoms) for constant-pH molecular dynamics--a process demanding 4,819 PDB record modifications and 610 bond definitions--in 18 minutes of user interaction.
The entire process is recorded in an interactive session log that can be shared and replayed for reproducibility, making simulation setup a fully transparent process that relies on what was done instead of what was remembered and reported.
bioinformatics | 2026-03-02 | v1
Assessment of Generative De Novo Peptide Design Methods for G Protein-Coupled Receptors
Junker, H.; Schoeder, C. T.
AI Summary
- The study assessed the effectiveness of deep learning methods (AlphaFold2 Initial Guess, Boltz-2, RosettaFold3) in designing de novo peptides for G protein-coupled receptors (GPCRs) by validating 124 known GPCR-peptide complexes.
- Generative methods (BindCraft, BoltzGen, RFdiffusion3) were evaluated for their peptide sampling capabilities, revealing issues with confidence overestimation and memorization in both prediction and generation.
- While backbone sampling was adequate, sequence generation was less effective, though improved by ProteinMPNN.
Abstract
G protein-coupled receptors (GPCRs) play a ubiquitous role in the transduction of extracellular stimuli into intracellular responses and therefore represent a major target for the development of novel peptide-based therapeutics. In fact, approximately 30% of all non-sensory GPCRs are peptide-targeted, representing a blueprint for the design of de novo peptides, both as pharmacological tools and therapeutics. Recent advances in deep learning-based protein structure generation and structure prediction offer a multitude of peptide design strategies for GPCRs, yet confidence metrics rarely correlate with experimental success. In the context of peptides, this problem is exacerbated by the lack of elaborate tertiary structures in peptides, raising the question of whether this is due to inadequate sampling or insufficient scoring. In this two-part benchmark, we addressed this question by first simulating the validation process of 124 unique known GPCR-peptide complexes using AlphaFold2 Initial Guess, Boltz-2 and RosettaFold3. We then assessed the peptide sampling capabilities of the respective generative methods BindCraft, BoltzGen and RFdiffusion3. Our results indicate that current design pipelines primarily suffer from significant confidence overestimation for misplaced peptides in the validation phase across all three prediction methods. We further highlight occurrences of significant memorization in both prediction and generation of peptides. While all generative methods sample backbone space sufficiently, their simultaneous sequence generation remains subpar and can be partially recovered through the use of ProteinMPNN. Taken together, our benchmark offers guidance for the design of peptides specifically using deep learning-based pipelines.
bioinformatics | 2026-03-02 | v1
Spatially patterned podocyte state transitions coordinate aging of the glomerulus
Chaney, C.; Pippin, J. W.; Tran, U.; Eng, D.; Wang, J.; Carroll, T. J.; Shankland, S. J.; Wessely, O.
AI Summary
- The study investigated how aging affects the glomerulus by analyzing single nuclei transcriptomics from kidneys of mice at different ages, focusing on regional and cell type-specific responses.
- Results showed that aging in podocytes is characterized by a transition from expressing canonical podocyte genes to showing inflammatory and senescent signatures, predominantly in the juxtamedullary region.
- Unlike podocytes, other glomerular cell types showed minimal age-related changes, indicating that podocyte aging is selective and coordinated rather than a universal degeneration.
Abstract
Background: As the US population lives longer, the risk, incidence, prevalence and severity of chronic kidney disease increase. Glomerular diseases are the leading cause of chronic and end-stage kidney disease. Yet the cellular responses and underlying mechanisms of progressive glomerular disease, which ultimately leads to glomerulosclerosis and loss of kidney function with advancing age, are poorly understood. Methods: Kidneys of young (4-month-old), middle-aged (20-month-old) and aged (24-month-old) mice were separated into outer cortex and juxtamedullary region and processed for single-nuclei transcriptomics. Focusing on the aging glomerulus, data were analyzed using a state-of-the-art pipeline dissecting cellular age- and kidney region-specific responses. Results: Global analysis of the transcriptome revealed region-specific differences detectable across multiple cell types, exemplified by the expression of Napsa as a bona fide juxtamedullary marker. In contrast, aging elicited largely cell type-specific responses. In the glomerulus, healthy podocytes were characterized by expression of canonical podocyte genes; conversely, senescent, aged podocytes were characterized by down-regulation of canonical podocyte genes and the emergence of inflammatory and senescent signatures. Interestingly, these senescent podocytes were located primarily in the juxtamedullary region, suggesting that juxtamedullary podocytes are more sensitive to aging. Rather than aging being defined by distinct cell states, the expression profiles, together with ligand-receptor and pseudotime analyses, suggest that podocyte aging is selective and coordinated, not universal degeneration. This differed from the other glomerular cell types (parietal epithelial cells, glomerular endothelial cells and mesangial cells): although they also existed in distinct subpopulations, they exhibited few regional or age-dependent changes. Finally, proximal tubular aging manifested as discrete cellular states. Conclusions: Single-nuclei transcriptomics of the aging kidney provides a mechanistic explanation for the regional susceptibility of nephrons and suggests that future therapeutic strategies need to consider the cellular and spatial complexity of the glomerulus.
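The contrast the abstract draws (down-regulation of canonical podocyte genes alongside emerging inflammatory/senescent signatures) is typically quantified with per-cell gene-set scores. The sketch below is a generic illustration of that idea, not the study's pipeline; the gene lists and the z-score-based scoring are illustrative assumptions.

```python
import numpy as np

def signature_score(expr, genes, gene_index):
    """Mean z-scored expression of a gene set per cell.

    expr : (cells x genes) matrix of normalized expression;
    genes : gene symbols in the signature;
    gene_index : dict mapping gene symbol -> column index.
    The gene sets used below are illustrative stand-ins, not the
    study's curated signatures.
    """
    cols = [gene_index[g] for g in genes if g in gene_index]
    sub = expr[:, cols]
    # z-score each gene across cells, then average within the set
    z = (sub - sub.mean(axis=0)) / (sub.std(axis=0) + 1e-9)
    return z.mean(axis=1)
```

A healthy podocyte would score high on a canonical set (e.g. Nphs1, Nphs2) and low on a senescence set (e.g. Cdkn1a), with the aged state showing the reverse.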
bioinformatics · 2026-03-02 · v1
Detecting Extrachromosomal DNA from Routine Histopathology
Khalid, M. A.; Gratius, M.; Brown, C.; Younis, R.; Ahmadi, Z.; Chavez, L.
AI Summary
- This study developed a deep learning framework to detect extrachromosomal DNA (ecDNA) from standard histopathology images across twelve cancer types.
- The approach successfully distinguished ecDNA-amplified tumors from chromosomally amplified or non-amplified ones, with notable results in glioblastoma.
- The method identified histomorphologic changes associated with ecDNA, correlating with poor survival outcomes, suggesting potential for routine diagnostic integration.
Abstract
Extrachromosomal DNA (ecDNA) is a major driver of oncogene amplification, tumour heterogeneity and poor clinical outcomes [1-3], yet its detection relies on specialised genomic assays that are not integrated into routine diagnostics. Here, we show that ecDNA status can be inferred directly from standard haematoxylin and eosin-stained whole-slide pathology images. We develop an end-to-end, weakly supervised deep learning framework that aggregates thousands of high-magnification patches per slide with slide-level augmentation and interpretable attention. Across twelve cancer types from The Cancer Genome Atlas, the approach identifies tumours with genomic amplifications and, critically, distinguishes ecDNA-amplified from chromosomally amplified or non-amplified tumours, with the strongest signal in glioblastoma. Attention maps localise regions enriched for nuclei with altered chromatin intensity and texture, and predicted ecDNA status recapitulates its adverse association with survival. These results indicate that ecDNA amplifications leave reproducible histomorphologic footprints detectable by routine pathology, enabling scalable screening to prioritise tumours for confirmatory molecular testing.
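The weakly supervised aggregation of thousands of patches into one slide-level prediction is commonly done with attention-based multiple-instance learning. The sketch below shows that pooling step in the style of Ilse et al. (2018); the weight shapes and the plain (non-gated) tanh attention are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def attention_pool(patch_feats, V, w):
    """Attention-MIL pooling: aggregate patch embeddings into one slide vector.

    patch_feats : (n_patches, d) patch embeddings from a feature extractor
    V : (d, h) hidden projection, w : (h,) attention vector
    Returns the attention-weighted slide embedding and the per-patch
    attention weights (which can be rendered as an attention map).
    """
    h = np.tanh(patch_feats @ V)      # (n, h) hidden representation
    scores = h @ w                    # (n,) unnormalized attention scores
    a = np.exp(scores - scores.max())
    a = a / a.sum()                   # softmax attention over patches
    slide_vec = a @ patch_feats       # (d,) weighted average of patch features
    return slide_vec, a
```

A slide-level classifier head on `slide_vec` then predicts amplification status, while `a` highlights the patches (e.g. regions with altered nuclear chromatin) that drove the prediction.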
bioinformatics · 2026-03-02 · v1