Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
VLab4Mic: prediction of structural resolvability in super-resolution microscopy
Martinez, D.; Saraiva, B. M.; Shakespeare, T.; Bates, M.; Owen, D. M.; Leterrier, C.; Del Rosario, M.; Henriques, R.
Abstract
Determining whether a microscopy experiment can resolve a specific feature of a protein assembly remains difficult because researchers must balance imaging modality, labelling strategy, and probe choice. We present VLab4Mic, a simulation platform that predicts structural resolvability before experiments. Starting from atomic models from the PDB or AlphaFold predictions, VLab4Mic places antibodies, nanobodies, chemical linkers, or fluorescent proteins on epitopes, applies stochastic labelling and steric constraints, and generates virtual samples for widefield, confocal, AiryScan, Stimulated Emission Depletion (STED), and Single-Molecule Localisation Microscopy (SMLM). Comparisons with nuclear pore complex data show realistic agreement across modalities. Case studies show that HIV capsid appearance depends strongly on orientation, and that STED and SMLM distinguish domed from flat clathrin lattices, whereas confocal and AiryScan struggle. VLab4Mic thereby helps researchers predict which biological questions are experimentally tractable with a given imaging configuration before spending time fine-tuning imaging parameters at the microscope.
bioinformatics 2026-06-16 v3
Robust integration of weakly anchored spatial multi-omics
Wang, C.; Liu, Y.; Wang, Z.; Sun, P.; Li, Z.; Li, J.; Wang, X.; Chen, K.; Zou, Q.; Daoliang, Z.; Hu, Z.; Du, Y.; Qian, B.; Feng, X.; Yuan, Z.; Guan, R.
Abstract
Spatial multi-omics holds great promise for dissecting complex biological processes, though inherent technical constraints continue to limit its widespread adoption. Currently, most studies therefore measure distinct omics features on separate tissue sections, necessitating spatial diagonal integration. An emerging practical solution is to leverage hematoxylin and eosin (H&E) images as an integration anchor, given their ubiquity, low cost, and compatibility across tissue preparations. However, this anchor is frequently compromised in real-world settings by variations in H&E staining style, absence of reliable histological landmarks, and mismatches in spatial resolutions across omics modalities. To address this, we introduce SpaWeaver, a computational framework that couples a pathology foundation model with a graph Transformer and a latent feature aligner module, providing a highly robust solution for weakly anchored spatial omics data diagonal integration. Extensive experiments demonstrate that SpaWeaver exhibits superior robustness against isolated or synergistic weak-anchoring factors. The spatial multi-omics profiles generated by SpaWeaver link molecular features originally separated on two sections, unlocking diverse downstream analyses once exclusive to co-assayed spatial multi-omics data, including niche-aware cell-cell communication inference and multi-omics resolved cell state. In this study, it unveils tumor-distance-dependent fibroblast-CD4+ T-cell signaling in human colon adenocarcinoma and identifies a hypoxic glycolytic tumor state with pyknotic nuclei in human ovarian cancer. Overall, our approach bridges readily accessible single-omics measurements across weakly anchored tissue sections, enabling unified spatial multi-omics characterization and system-level tissue analysis.
bioinformatics 2026-06-16 v3
Sparse Autoencoders Reveal Interpretable Features in Single-Cell Foundation Models
Pedrocchi, F.; Barkmann, F.; Joudaki, A.; Boeva, V.
Abstract
Single-cell foundation models (scFMs) hold promise for applications in cell type annotation, data integration, and prediction of the effects of cell perturbations, but their internal mechanisms remain poorly understood. We investigate the structure of these models by training sparse autoencoders (SAEs) on the hidden representations of three widely used scFMs: scGPT, scFoundation, and Geneformer. The learned features reveal diverse and complex biological and technical signals, which emerge even in pre-trained models. We also observe that the encoding of this information differs between scFMs with distinct training protocols and architectures. Finally, we demonstrate that SAE-derived features are functionally related to model behavior and can be intervened upon. Suppressing batch-associated features reduces unwanted technical variation and improves data integration while preserving the core biological signal. Activating drug-encoding features steers control cells toward drug-perturbed states in a concentration-dependent manner. These findings provide a path toward more interpretable and controllable single-cell foundation models.
bioinformatics 2026-06-16 v3
RareFold: Structure prediction and design of proteins with noncanonical amino acids
Li, Q.; Daumiller, D.; Zuo, F.; Marcotte, H.; Pan-Hammarstrom, Q.; Bryant, P.
Abstract
Protein structure prediction and design have traditionally been confined to the 20 canonical amino acids. Expanding this chemical space to include non-canonical amino acids (ncAAs) is essential for engineering proteins with novel chemical and functional properties. However, existing methods are not designed to generalise across chemically diverse residue types. Here, we present RareFold, a deep learning architecture for structure prediction and design of proteins containing the 20 canonical amino acids and 29 ncAAs. By representing each residue as an independent token, RareFold learns context-dependent atomic interaction patterns across chemically diverse sequence spaces, enabling modelling of non-standard chemistries within a unified framework. We apply this capability in EvoBindRare, a generative framework for de novo design of linear and cyclic peptide binders with an efficient implementation that substantially reduces computational requirements compared to existing architectures. We demonstrate its performance by designing binders against Ribonuclease A, yielding novel linear and cyclic peptides incorporating ncAAs within predicted interfaces with low-micromolar affinities (KD ~2-9 µM), comparable to the native ligand (KD ~2 µM). Hydrogen-deuterium exchange mass spectrometry confirms that the designed peptides engage the target at regions consistent with predicted binding interfaces. In addition, immunogenicity profiling in human-derived organoid models shows no detectable immune activation. By extending deep learning-based protein design to non-canonical chemical spaces, RareFold enables programmable access to expanded amino acid alphabets and broadens the scope of de novo protein engineering.
bioinformatics 2026-06-16 v3
Prediction and analysis of new HisKA-like domains
Silly, L.; Perriere, G.; Ortet, P.
Abstract
Histidine kinases (HKs) participate in many signaling pathways as part of two-component systems (TCSs). Through autophosphorylation and phosphotransfer to a response regulator (RR), they enable organisms to adapt to their environment. Most HKs are transmembrane proteins with a sensing domain outside the cell and two catalytic domains called HisKA and HATPase. HATPase is required for interaction with ATP, and HisKA contains the phosphorylated histidine residue. HKs are involved in various environmental adaptation mechanisms, such as sensing light or biochemical changes. Studying their diversity is therefore important to better understand how cells interact with their environment. There exist incomplete HKs (iHKs) lacking either the HisKA or HATPase domain. Some iHKs with an HATPase domain possess a section of their sequence where an HisKA domain could be expected. These iHKs may include "true" HKs with an unknown HisKA domain that could fill gaps in various signaling pathways. In this study we analyzed 869 964 sequences of iHKs having an HATPase domain but lacking an HisKA domain. We identified 18 HisKA-like profiles and performed multiple meta-studies to assess their HisKA-like characteristics. We found that their 3D structures matched the structure of known HisKA domains. We saw that the genomic context of the genes associated with these profiles contained genes implicated in signal transduction pathways. We cross-validated some of our profiles with curated annotations, as well as with a "negative dataset" made of non-HK proteins. We believe that our work could help improve the annotation of regulation pathways in prokaryotes.
bioinformatics 2026-06-16 v2
Identifying Modulators of Cellular Responses by Heterogeneity-sequencing
Berg, K.; Sakellaridi, L.; Rummel, T.; Hennig, T.; Whisnant, A.; Lodha, M.; Krammer, T.; Toussaint, C.; Szymanska-De Wijs, K.; Zheng, Y.; Prusty, B. K.; Doelken, L.; Saliba, A.-E.; Erhard, F.
Abstract
The response of individual cells to drug treatment, virus infections or other molecular stimuli is highly heterogeneous and depends on the cell's initial state. Library preparation for single-cell transcriptomics is destructive, precluding a direct comparison between the initial state and the stimulus outcome. Consequently, current methods are restricted to identifying correlative associations rather than resolving causal drivers of heterogeneous outcomes. We developed Heterogeneity-seq, which combines single-cell RNA-seq with metabolic RNA labeling (scSLAM-seq) and double machine learning to overcome this limitation. By leveraging simultaneous measurements of unlabeled and labeled RNA in individual cells, Heterogeneity-seq uncovers the transition from pre-stimulated cell states to distinct stimulation outcomes across thousands of cells. These links enable the identification of factors that causally govern heterogeneous cellular responses. We used Heterogeneity-seq to identify both known and novel genes that drive responses to drug treatment, as well as pro- and antiviral host factors governing cytomegalovirus infection.
bioinformatics 2026-06-16 v2
A Transformer-derived transcriptomic score associates with ex-vivo drug response in AML
Barman, J.; Adhikari, S.; Heckman, C.; Vaha-Koskela, M.
Abstract
Background. Drug-tolerant persister (DTP) cell states have been implicated in relapse across multiple cancers, including acute myeloid leukaemia (AML) [1,2]. Methods that score such states from transcriptomic data, generalise to held-out samples, expose calibrated probability outputs, and link predictions to candidate biology are useful for prioritising follow-up experimental work. Existing transcriptomic methods for scoring drug-tolerant or persister-like states largely rely on fixed gene signatures or general-purpose cell-type classifiers adapted post hoc (scPred, scANVI, scClassify); deep-learning approaches developed specifically for AML drug-tolerant persister scoring with calibrated probability outputs, prespecified thresholds, and transparent external validation against ex-vivo drug-response data are, to our knowledge, lacking. Our approach addresses this gap by combining a Transformer teacher with a knowledge-distilled 1,000-gene student, a prespecified threshold τ = 0.31, and direct evaluation against BeatAML drug-AUC. This in silico approach is intended to identify and mark DTP cells, for which dedicated analytical methods do not yet exist.
Methods. We trained a Transformer classifier on a pooled scRNA-seq corpus of nine samples (six from GSE123902, covering lung adenocarcinoma metastasis, normal, and primary tumour [4], plus three primary AML samples; 32,342 cells, 13,369 common genes), with stratified 5-fold cross-validation at the cell level, a 20% held-out test split, and a prespecified probability threshold selected on out-of-fold predictions. A 1,000-gene student model was trained by knowledge distillation [5]. For every input cell, the student outputs a probability between 0 and 1 (hereafter "the score") representing predicted membership in the positive training class.
The trained model was applied without re-tuning to five external or independent application cohorts: 39 primary AML donors [in-house]; GSE74246 [6]; BeatAML (n = 452 with linked ex-vivo drug-AUC; n = 405 with overall-survival metadata) [7]; TCGA-LAML (n = 149) [8]; and an in-house n = 10 scRNA-seq cohort with linked survival. Survival and drug-response data were not used during training, threshold selection, or tuning. The score was anchored mechanistically against CRISPR/DepMap essentiality [9], pathway enrichment, and a normal-tissue-filtered surface-protein candidate list (HPA [11], GTEx [12]). To assess concordance between transcriptomic prioritisation and protein-level evidence, each ranked candidate was additionally annotated with two HPA-derived flags: HPA_surface_protein (Yes/No, derived from HPA Protein class and Subcellular location fields, identifying genes annotated as plasma-membrane, GPCR, ion-channel, transporter, receptor, or CD-marker) and HPA_antibody_reliability (Enhanced, Supported, Approved, Uncertain, or Not available, per HPA antibody validation tier). Annotations were merged on HGNC symbol; 248 of 250 candidates (99.2%) matched. Two candidates using the older CORF nomenclature did not auto-match HPA's lowercase convention and were resolved manually. HPA's per-gene RNA-protein numeric correlation is published only on per-gene web pages and not in the bulk download; we therefore used the detection-level and antibody-reliability tiers as the operational concordance filter.
Results. Cross-validation area under the receiver operating characteristic curve (AUROC) was 0.936 ± 0.014 (held-out test 0.941, Matthews correlation coefficient (MCC) 0.696, F1-score 0.895). The 1,000-gene student showed Spearman ρ ≈ 0.96 with the teacher and >85% class agreement at the prespecified threshold.
The principal external result was in BeatAML: the score correlated with ex-vivo drug-response AUC across seven AML-relevant drugs, with consistent per-drug Spearman correlations (r = 0.41-0.53, all p < 0.05). The aggregate correlation across 3,164 patient-drug pairs from 452 patients was r = +0.482 and is reported as a summary, recognising that pairs from the same patient are not fully independent. The score did not stratify overall survival in TCGA-LAML or in the in-house n = 10 cohort, in part because predicted high-score fractions saturated. At the prespecified threshold the score did not separate cell types in GSE74246, indicating that absolute calibration is cohort-dependent. Compared against logistic regression, random forest, the LSC17 stemness signature, and a mean-expression baseline on the same gene panel, the Transformer was the most stable model under aliquot-grouped cross-validation and the only one to transfer with strong, positive correlation to BeatAML drug-AUC. The mechanistic candidate-target pipeline produced a 250-candidate ranked surface-protein list (full breakdown in Results); FLT3 and CD33 were recovered from the unbiased ranking as positive controls.
Conclusion. We present a Transformer-derived transcriptomic score that addresses the lack of validated computational methods for identifying drug-tolerant persister-like states in AML. The score shows external rank-order association with ex-vivo drug response, providing a research-use tool for prioritising candidate persister-associated transcriptional programs for follow-up. Together, these results support the score as a research-use transcriptomic ranking tool for AML drug-response-associated states. The strongest external support comes from the consistent association with BeatAML ex-vivo drug-response AUC. The fixed probability threshold did not transfer reliably across all cohorts, so threshold-based classification requires cohort-specific recalibration.
The score is not validated for clinical decision-making and is not proposed as a survival predictor. The candidate-target list is a starting point for functional follow-up. Keywords. AML; ex-vivo drug response; single-cell RNA-seq; Transformer; knowledge distillation; transcriptomic score; BeatAML; surface-protein target prioritisation.
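The per-drug rank correlations reported in this abstract can be illustrated with a small sketch. This is a minimal, standard-library Spearman implementation applied per drug, not the authors' pipeline; the record layout `(drug, score, auc)` and the function names are assumptions made for illustration, and the example values in the usage note are invented.

```python
from collections import defaultdict
from math import sqrt

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions pos..end
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def per_drug_spearman(records):
    """records: iterable of (drug, score, auc) tuples (hypothetical layout)."""
    by_drug = defaultdict(lambda: ([], []))
    for drug, score, auc in records:
        by_drug[drug][0].append(score)
        by_drug[drug][1].append(auc)
    return {drug: spearman(s, a) for drug, (s, a) in by_drug.items()}
```

Computing the correlation separately for each drug avoids treating several pairs from one patient as independent observations, which is the caveat the abstract raises for the aggregate figure. For example, `per_drug_spearman([("drugA", 0.1, 100), ("drugA", 0.5, 150), ("drugA", 0.9, 210)])` gives a correlation of 1 for the invented `drugA` records.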
bioinformatics 2026-06-16 v1
scIsoAgent enables autonomous isoform-resolved characterization and sequence-informed interpretation of long-read single-cell transcriptomes
Zhao, C.; Liu, M.; Li, X.; Li, D.; Xu, Y.; Wang, Z.
Abstract
Alternative isoform usage can alter gene function independently of total gene expression, creating a need to resolve transcript isoforms at single-cell resolution. Long-read single-cell RNA sequencing meets this need by linking cellular identity to transcript isoforms and sequence-level features. Realizing its full biological value requires reproducible workflows that connect specialized long-read analysis with biological interpretation. Existing large language model (LLM)-based biomedical agents support general omics analysis, but are not designed for isoform-resolved long-read single-cell workflows. Here, we present scIsoAgent, an autonomous LLM-powered scientific agent for long-read single-cell RNA-seq analysis. scIsoAgent turns heterogeneous long-read single-cell inputs into traceable isoform-resolved workflows, using stage-aware planning and persistent computational context to support both execution and interpretation. Across complementary evaluations, this design improved the continuity from analysis planning to executable, interactive workflows compared with general-purpose LLM baselines. In real-data reanalysis, scIsoAgent recovered major findings from published long-read single-cell resources and extended a representative differential transcript usage event into a sequence-informed functional hypothesis. By linking full-length isoform sequences with model-inferred transcript properties, scIsoAgent connects observed isoform usage with potential sequence-level functional consequences. These results demonstrate that autonomous scientific agents can transform fragmented long-read single-cell analysis into coherent, reproducible workflows for isoform-resolved discovery and biological interpretation.
bioinformatics 2026-06-16 v1
MetaPilot: genome-aware adaptive search-space refinement for unified DDA and DIA metaproteomics
Cheng, K.; Figeys, D.
Abstract
Metaproteomic peptide identification is constrained by the structure and size of the protein search space. Pooled gene catalogues provide coverage but obscure genome-level evidence, and current workflows for data-dependent (DDA) and data-independent (DIA) acquisition diverge in their database strategies. We present MetaPilot, a genome-aware workflow that uses conserved marker-protein evidence to rank candidate genomes from MGnify catalogues and construct adaptive, sample-specific search spaces. Applied to paired DDA/DIA datasets of defined mixtures and fecal samples, MetaPilot adapted genome selection to community complexity and reproduced published peptide evidence while expanding the detectable peptide space. In DDA-independent reanalysis of Orbitrap human gut DIA data, MetaPilot identified 24.4% more peptides than the published DDA-derived library and 2.06-fold more than the matched DDA-assisted DIA search. On timsTOF DIA-PASEF mouse intestinal data, it outperformed uMetaP by 41.8-119.7%, enabling genome-resolved functional interpretation without DDA-PASEF input.
bioinformatics 2026-06-16 v1
OmicOS: A Comprehensive Omics Ecosystem Infrastructure and Agent System for the AI Era
Zeng, Z.; Meng, X.; Hu, L.; Li, C.; Liu, P.; Shi, Y.; Ma, X.; Gao, L.; Wang, X.; Luo, Z.; Zheng, Y.; Xian, J.; Lin, Z.; Zhu, H.; Jiang, Z.; Mao, S.; Lu, Y.; Tang, W.; Peng, Q.; Ma, Y.; Zhou, L.; Xing, C.; Zhang, X.; Xiong, Y.; Du, H.
Abstract
Biology has accumulated a vast ecosystem of omics methods, but much of this ecosystem remains built for expert humans rather than scientific agents. Methods are scattered across Python packages, R/Bioconductor and CRAN workflows, command-line tools, incompatible data containers and implicit object states, making even routine analyses difficult for an AI system to choose, execute and verify reliably. Here we introduce OmicOS, a comprehensive omics ecosystem infrastructure and agent system that turns OmicVerse V2, an open-source omics community, into an executable foundation for agentic biology. OmicVerse V2 provides the community substrate: scalable AnnDataOOM-compatible Rust backends, agent-friendly Python algorithms for single-cell, spatial, bulk and multi-omics analysis, interfaces to single-cell foundation models, and Python-native reconstructions of historically R-centred Bioconductor/CRAN-style workflows. OmicOS makes this substrate actionable by registering analytical functions as state-aware capability contracts, allowing agents to inspect live data objects, select valid methods, execute controlled workflows and record provenance. The result is not a fixed pipeline, but a programmable omics environment in which agents compose real analyses from verified community methods rather than inventing tools. Across external and purpose-built benchmarks, OmicOS ranked first among the evaluated systems, reaching 81.2% on BiomniBench. Adding OmicVerse to a minimal agent improved task completion by up to 34.2 percentage points with qwen-3.6-35b, and controlled ablations showed that the gains came from registry-grounded execution rather than from larger models, documentation retrieval or unrestricted tool exposure. The same infrastructure scaled to atlas-sized data, reproduced R-centred workflows in Python and converted external pathology software into agent-usable skills.
In a discovery task starting from a whole-body spatial map and the term Alzheimer disease, OmicOS composed a non-canonical workflow that integrated spatial expression, genetic association, eQTL and colocalization evidence to nominate a colon epithelial risk axis centred on PICALM, CD2AP and CR1. Together, OmicVerse and OmicOS define an open foundation for AI-era omics, showing how a community of biological methods can be transformed into a reliable, extensible and agent-operable system for discovery.
bioinformatics 2026-06-16 v1
THEOBROMA: an aggregated open database of 1.13 million natural products with per-compound license auditing, three-tier classification, and stereochemistry-aware deduplication
Klamt, T.; Jaczkowski, A.; Franke, J.; Nejdl, W.
Abstract
Natural products remain one of the most productive sources of pharmacologically active compounds for drug discovery, yet the current open aggregator landscape attributes licenses at database rather than compound granularity, with consequences that have become tangible as the field grows. A recent relicensing event in one constituent source (the September 2024 transition of the Natural Products Atlas to CC BY-NC 4.0) demonstrates how database-level licensing propagates across an aggregate and motivates the per-compound audit framework presented here. The same peer cohort separately leaves classification provenance and stereoisomer-family relations coarser than either layer warrants. THEOBROMA, accessible at https://theobroma.l3s.uni-hannover.de, integrates 1,133,004 natural products from 29 open sources under a per-compound license audit that resolves each compound's license tier across all attesting sources under a most-restrictive-wins rule, identifying 900,170 compounds (79.4%) under open-use licenses and exposing the per-source attestation chain and resolved tier through a dedicated audit endpoint and a query-time license filter. A three-tier classification stratifies 89.3% coverage into 35.1% curated, 43.9% high-confidence inferred, and 10.3% exploratory tiers, with 486,215 stereoisomer families preserved by full 27-character InChIKey deduplication and exposed via a dedicated /api/stereoisomers/<comp_id> endpoint and a radial-family display. Per-compound license provenance is the primary differentiator. Classification stratification and stereoisomer-family exposure add finer-grained access to two related axes, supporting license-compatible virtual screening and isomer-specific bioactivity analysis at corpus scale. As an evolving open resource, THEOBROMA pairs continuous pipeline maintenance with interactive geographic, taxonomic, and chemical-space exploration.
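The most-restrictive-wins rule and the stereochemistry-aware deduplication described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not THEOBROMA's code: the tier names and their ordering in `TIER_RANK` are hypothetical, and grouping stereoisomer families by the first 14-character InChIKey block (which encodes connectivity without stereochemistry) is one plausible reading of "families", not a confirmed detail of the database.

```python
# Hypothetical license tiers; a higher rank means more restrictive (assumption).
TIER_RANK = {"open": 0, "non-commercial": 1, "restricted": 2}

def resolve_license(attestations):
    """Most-restrictive-wins: attestations maps source name -> license tier."""
    return max(attestations.values(), key=TIER_RANK.__getitem__)

def connectivity_block(inchikey):
    """First block of an InChIKey (14 characters), shared by stereoisomers."""
    return inchikey.split("-")[0]

def group_families(inchikeys):
    """Deduplicate on the full 27-character key, then group by connectivity."""
    families = {}
    for key in sorted(set(inchikeys)):
        families.setdefault(connectivity_block(key), []).append(key)
    return families

# Synthetic keys with the right shape (14 + 10 + 1 characters), not real compounds.
KEY_A1 = "A" * 14 + "-" + "B" * 8 + "SA-N"
KEY_A2 = "A" * 14 + "-" + "C" * 8 + "SA-N"  # same skeleton, different stereo block
KEY_B1 = "D" * 14 + "-" + "B" * 8 + "SA-N"

families = group_families([KEY_A1, KEY_A2, KEY_A1, KEY_B1])
tier = resolve_license({"source_1": "open", "source_2": "non-commercial"})
```

Deduplicating on the full key keeps distinct stereoisomers as separate records, while the connectivity block still lets them be retrieved together. A compound attested as open by one source and non-commercial by another resolves to the non-commercial tier.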
bioinformatics 2026-06-16 v1
cuBayes: GPU accelerated FreeBayes that achieves 1-minute whole-genome SNV calling while maintaining algorithmic semantics
Pitman, A.; Yang, C.; Qiao, Y.
Abstract
Next-generation sequencing now produces whole-genome data in hours, but downstream variant calling remains a multi-hour to multi-day bottleneck that excludes genomic analysis from time-critical clinical settings. GPU acceleration offers a natural path forward -- variant calling is inherently parallelizable across genomic positions -- yet open-source infrastructure for porting existing algorithms to GPU hardware remains limited, leaving many widely-used tools without accelerated implementations. FreeBayes, a haplotype-based variant caller central to the 1000 Genomes Project and to multi-sample tumor evolution analyses, exemplifies this gap: it is natively single-threaded despite its algorithmic suitability for parallelization. We present cuBayes, a CUDA implementation of FreeBayes germline SNV calling that completes HG002 and HG004 2x250bp Illumina 60x whole-genome analysis in one minute (as opposed to hours if not days with manual region-based CPU parallelization) on a single NVIDIA RTX 6000 Ada GPU, while producing variant calls with >99.9% concordance to the CPU reference. cuBayes is structured around an atom/molecule architecture in which reusable functional units (BAM decompression, position-wise pileup, batch coordination) are cleanly separated from algorithm-specific logic, providing a foundation intended to support acceleration of additional sequence analysis algorithms without redundant low-level engineering.
bioinformatics 2026-06-16 v1
Infectious Disease Forecasting via Physics-Informed Machine Learning
Hart, J. C.; Smith, H.; McMahan, C.; Rennert, L.
Abstract
Infectious disease transmission evolves as a dynamic process shaped by biological mechanisms, population behavior, and intervention policies, yet public health responses are often driven by lagging indicators. Accurate short- and long-term disease forecasting is essential for the timely deployment of intervention strategies, healthcare capacity planning, and uncertainty-aware, risk-informed decision-making. To address this challenge, three broad classes of forecasting models have traditionally been used: statistical, machine learning, and mechanistic approaches. However, each of these modeling paradigms faces fundamental limitations. In particular, traditional statistical models often lack the flexibility needed to capture complex disease dynamics, machine learning approaches require large, high-quality data streams, and mechanistic models are notoriously difficult to calibrate. To overcome these challenges, we propose a novel physics-informed machine learning (PIML) framework for forecasting infectious disease dynamics. Our approach simultaneously forecasts new case and hospitalization counts, along with other key epidemiological quantities such as the time-varying reproduction number. This is achieved through the design of a machine learning model and estimation strategy regularized by a system of differential equations that encode disease dynamics of the SIHR model, thereby bridging the gap between purely data-driven and mechanistic models. We demonstrate the proposed methodology through in-depth numerical studies and an application to COVID-19 data collected in the state of South Carolina.
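The idea of regularizing a data-driven model with SIHR differential equations can be sketched as a loss with two terms: a data-fit term and a penalty on how far a predicted trajectory departs from the compartment dynamics. The sketch below is an assumption-heavy illustration, not the authors' formulation: the specific SIHR equations, the parameter names (`beta`, `gamma`, `eta`, `delta`), the weight `lam`, and the forward-difference residual (standing in for the automatic differentiation commonly used in physics-informed networks) are all chosen here for illustration.

```python
def sihr_rhs(state, beta, gamma, eta, delta):
    """One common SIHR form on population fractions (assumed, not from the paper)."""
    s, i, h, r = state
    new_infections = beta * s * i
    return (
        -new_infections,                       # dS/dt
        new_infections - (gamma + eta) * i,    # dI/dt
        eta * i - delta * h,                   # dH/dt
        gamma * i + delta * h,                 # dR/dt
    )

def physics_residual(traj, dt, **params):
    """Mean squared mismatch between forward differences and the ODE right-hand side."""
    total, count = 0.0, 0
    for cur, nxt in zip(traj, traj[1:]):
        rhs = sihr_rhs(cur, **params)
        for x0, x1, f in zip(cur, nxt, rhs):
            total += ((x1 - x0) / dt - f) ** 2
            count += 1
    return total / count

def total_loss(data_loss, traj, dt, lam, **params):
    """Data-fit term plus a physics penalty weighted by lam (hypothetical weight)."""
    return data_loss + lam * physics_residual(traj, dt, **params)

def euler_trajectory(state, dt, steps, **params):
    """Forward-Euler reference trajectory, used here only to exercise the residual."""
    traj = [tuple(state)]
    for _ in range(steps):
        rhs = sihr_rhs(traj[-1], **params)
        traj.append(tuple(x + dt * f for x, f in zip(traj[-1], rhs)))
    return traj

PARAMS = dict(beta=0.4, gamma=0.1, eta=0.05, delta=0.1)  # illustrative values
consistent = euler_trajectory((0.99, 0.01, 0.0, 0.0), dt=0.1, steps=50, **PARAMS)
shifted = [(s, i + 0.01 * k, h, r) for k, (s, i, h, r) in enumerate(consistent)]
```

A trajectory that follows the assumed dynamics, such as `consistent`, incurs essentially no physics penalty, while `shifted` is penalized in proportion to its departure from the equations. In a full model the trajectory would come from a neural network and the epidemiological parameters could themselves be learned.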
bioinformatics 2026-06-16 v1
Better data, better trees: GenBank-GISAID deduplication and source-specific artifact masking in viral genomics
de Moraes, L.; de Alencar, A. L.; Brusselmans, M.; Candido, D. d. S.; Faria, N. R.; Dellicour, S.; Lemey, P.; Khouri, R.
Abstract
GenBank and GISAID are the primary repositories for viral genomic data, but integrating records across them remains a challenge. The same sequence could be made available in both databases without any cross-reference linking the two entries. Consequently, there is no systematic way to identify this redundancy, which compromises the compilation of representative, non-redundant large-scale datasets. In parallel, the growth of viral genomic data has increased the risk of systematic technical artifacts introduced during sequencing or assembly. These artifacts can inflate substitution rate estimates and degrade temporal signal, biasing evolutionary rate estimates. To address both challenges, here we present a formal, reproducible workflow integrating two newly developed complementary tools: G2G matcher for cross-repository harmonization and Lab-Specific Bias FILTer (LSBFILT) for masking of laboratory-specific artifacts. Using the Eastern/Central/South African (ECSA) chikungunya virus lineage as a proof-of-concept, we demonstrate that our integrated workflow restores temporal signal and provides a robust, curated dataset for downstream phylodynamic analyses. Critically, restricting masking of homoplastic sites to specific sequences reduces the substitution rate estimate from an inflated 8.517 x 10^-4 to 5.078 x 10^-4 substitutions/site/year and increases the coefficient of determination (R^2) of the root-to-tip regression analysis from 0.353 to 0.677. By enabling systematic cross-repository harmonization and source-specific artifact masking, we provide the molecular epidemiological community with scalable tools to reconcile fragmented genomic data and reduce technical biases, fostering more accurate and reproducible phylogenetic analysis. G2G matcher is available at https://github.com/andrezaleite/G2G-Matcher, and LSBFILT at https://github.com/khourious/LSBFILT.
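The root-to-tip regression mentioned in this abstract is an ordinary least-squares fit of each tip's genetic distance from the root against its sampling date: the slope estimates the substitution rate and R^2 summarizes the temporal signal. The sketch below is a generic illustration, not the authors' workflow; the function name and the example numbers are invented and are not taken from the chikungunya dataset.

```python
def root_to_tip_regression(dates, distances):
    """Least-squares fit of root-to-tip distance (subs/site) on sampling date (years).

    Returns (slope, intercept, r_squared); the slope is the rate in subs/site/year.
    """
    n = len(dates)
    mean_t = sum(dates) / n
    mean_d = sum(distances) / n
    sxx = sum((t - mean_t) ** 2 for t in dates)
    syy = sum((d - mean_d) ** 2 for d in distances)
    sxy = sum((t - mean_t) * (d - mean_d) for t, d in zip(dates, distances))
    slope = sxy / sxx
    intercept = mean_d - slope * mean_t
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared

# Invented, perfectly clock-like example: 5e-4 subs/site/year, root dated 1990.
dates = [2000.0, 2005.0, 2010.0, 2015.0]
distances = [5e-4 * (t - 1990.0) for t in dates]
rate, intercept, r2 = root_to_tip_regression(dates, distances)
root_date = -intercept / rate  # x-intercept: implied date of the root
```

Artifactual substitutions concentrated in sequences from particular laboratories add distance that is unrelated to sampling date, which can both steepen the fitted slope and scatter the points around the line; this is consistent with the lower rate and higher R^2 the abstract reports after masking.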
bioinformatics 2026-06-16 v1
FlowBench: separating planning, fault recovery and interpretation in agentic bioinformatics
Kurjan, A.; Cribbs, A. P.
Abstract
Agentic large language model (LLM) systems are being deployed in bioinformatics faster than they are understood, and single-metric evaluations conflate capabilities that fail independently. We introduce FlowBench, a benchmark that decomposes agentic bioinformatics performance into planning, fault recovery, biological interpretation, and end-to-end output-fidelity. Existing systems achieve high plan completeness, but their closed, single-provider designs prevent attribution of performance to scaffolding versus the underlying model. We therefore built FlowAgent, a modular, provider-agnostic framework whose components can be selectively disabled and whose backbone model can be swapped across providers on a shared harness, and used it to evaluate 23 models from three main providers. Three findings emerge. First, generating a valid workflow plan from a named toolchain is largely solved, whereas inferring an appropriate toolchain from biological intent alone is uniformly difficult regardless of model tier, compressing all models into a narrow 44-57% pass-rate band. Second, ablation shows that the dependency-structured plan and a completeness-reflection step drive performance, while adding a same-context validator-driven retry makes structural quality worse. Third, fault recovery and data-grounded interpretation remain unsolved. Models frequently propose fixes that force a clean exit while leaving the underlying data invalid, and data-grounded interpretation lags internal-knowledge recall by a consistent margin. Safety does not emerge from capability, and reasoning-tier models were among the least reliable at recognising unrecoverable faults. Once planning saturates, agent architecture and refusal calibration, not model scale, are the productive frontier.
bioinformatics 2026-06-16 v1
RetroMol: Parsing a shared encoding from natural products and their biosynthetic gene clusters
Meijer, D.; Williams, S. E.; Terlouw, B.; Charusanti, P.; Kok, L.; Skinnider, M. A.; Weber, T.; van der Hooft, J. J. J.; Healy, A. R.; Medema, M. H.
Abstract
Natural products such as polyketides and nonribosomal peptides (NRPs) are important sources of bioactive compounds, including many antibiotics. Many of them are assembled by modular enzyme complexes and further modified and diversified by tailoring reactions encoded by biosynthetic gene clusters (BGCs). Although natural products and their coding BGCs describe different data modalities of the same biochemical process, a unified language to jointly describe their biochemistry is lacking. Here we introduce a sequence-based representation of the core biosynthesis of modular natural products, which we call primary sequences, that bridges chemical structures and BGCs. We also present RetroMol, an algorithm that parses either natural product structures or their encoding BGCs into their primary sequences of natural product building blocks. RetroMol allows for similarity scoring between natural products and BGCs, enabling the retrieval of compounds, BGCs, and a combination of the two, based on their biosynthetic similarity. This can, for instance, be used to retrieve biosynthetically similar but structurally dissimilar compounds, or link natural products to candidate coding BGCs in large experimental datasets. We demonstrate the latter by rediscovering the nocardichelin B BGC as a proof of principle. We also exemplify the utility of biosynthetic similarity by showing various pairs of biosynthetically similar compounds with low structural similarity. Together, these results establish primary sequences as a shared biosynthetic encoding for natural product comparison and BGC prioritization.
bioinformatics2026-06-16v1Integrative Transfer Network: Deep Transfer Learning Across Populations and Prediction Targets
Gao, Y.; Cui, Y.Abstract
Large-scale clinical and biomedical datasets increasingly contain both diverse subgroup attributes (e.g., demographic or clinical subgroups) and multiple prediction targets. Although various machine learning approaches can address subgroup differences or multi-target prediction, they often consider these aspects independently rather than jointly. To more effectively capture the shared and subgroup-specific information in such complex datasets, we propose the Integrative Transfer Network (ITN), a deep neural network designed to leverage data across subgroups and multiple related outcomes simultaneously. In extensive experiments, including time-to-event and classification tasks where demographic subgroups and multiple disease endpoints are prevalent, ITN demonstrates consistent improvements in subgroup-specific prediction by borrowing strength from other subgroups and outcomes. We envision ITN as a unified framework for learning from heterogeneous datasets where subgroup-specific insights are critical.
bioinformatics2026-06-16v1Programmatic access to ICTV virus taxonomy through a public ontology API
Lieutaud, P.; McLaughlin, J.; Hendrickson, R. C.; David, R.; Parkinson, H.; Lefkowitz, E.; Dempsey, D.; Coutard, B.Abstract
The International Committee on Taxonomy of Viruses (ICTV) is responsible for developing and maintaining a universal virus taxonomy. As the reference framework for organising the viral world, it is essential for virology and related fields. Despite its widespread use in research and public health, programmatic access to ICTV taxonomy has remained limited, posing challenges for integration, versioning, and interoperability across databases and bioinformatics resources requiring up-to-date virus taxonomy. To address this, we developed a public and sustainable solution leveraging ontology-based APIs. Successive ICTV Master Species List (MSL) releases were transformed into a structured ontology and deployed as a unified representation through the Ontology Lookup Service (OLS). The framework also provides ICTV-NCBI mappings and helper libraries for integration into downstream systems. This enables, for the first time, public programmatic retrieval of current and historical virological taxon names, taxonomic relationships, metadata, and persistent identifiers through stable endpoints. More broadly, this work illustrates a general strategy for transforming structured biological datasets into semantically enriched graph resources exposed through scalable public APIs. These developments enhance interoperability, reduce manual curation, and support FAIR-aligned taxonomic data management in virology and pandemic preparedness.
bioinformatics2026-06-16v1Evidence for recombination in dengue virus genomes
de Paula Oliveira, H.; Jacob Machado, D.; Prieto Oliveira, P.; Ocana, K.Abstract
Recombination is a key driver of RNA virus evolution, yet its extent and evolutionary implications in dengue virus (DENV) remain incompletely understood. We conducted a comprehensive, genome-wide recombination screen across 6,905 complete DENV genomes representing all four serotypes, 82 countries, and eight decades of sampling (1944-2023) retrieved from the Bacterial and Viral Bioinformatics Resource Center. Using seven complementary recombination detection methods implemented in RDP5, we identified 66 recombination events across 53 unique recombinant sequences, of which 29 are newly described. Events included intra-genotypic (n = 18), inter-genotypic (n = 32), and inter-serotypic (n = 16) exchanges spanning 14 genotypes and four continents, with no meaningful serotype-level enrichment (Cramer's V = 0.054). Recombination was concentrated in non-structural genes, most frequently NS3 (19 events), NS5 (17), and NS2 (12), while the capsid gene contained no recombination events, consistent with strong functional constraint. Single-nucleotide polymorphism analyses confirmed low divergence between recombinants and their inferred parents in both recombinant and non-recombinant regions. Phylogenomic analysis of 6,642 sequences revealed that recombinants cluster significantly closer to their major parents (p = 8.9 x 10^-6) and that their removal does not significantly alter tree topology (p = 0.898), suggesting that the short length of recombinant regions limits phylogenetic conflict. We also introduce RECOSIM, an unsupervised machine-learning tool for recombination detection that achieved higher precision than RDP5 on both simulated (93.4% vs. 80.0%) and empirical (98.1% vs. 39.3%) datasets. Collectively, these results establish recombination as a widespread, pan-serotypic phenomenon in DENV with implications for genomic surveillance, vaccine evaluation, and evolutionary inference.
bioinformatics2026-06-16v1Expanding gene regulatory networks from transcriptome data through graphical modeling with heterogeneous priors
Kokaji, T.; Suzuki, K. T.; Kunida, K.; Sakumura, Y.Abstract
Gene regulatory network inference is widely used to reconstruct large-scale networks and identify functional genes from transcriptome data. Meanwhile, in many biological fields, core regulatory genes have been extensively studied, leading to the establishment of small-scale gene regulatory networks, and novel genes connected to these networks remain to be identified. However, methods for expanding existing gene networks by identifying novel regulatory interactions, rather than reconstructing the entire network, are not well established. Here, we propose a method for gene network expansion that incorporates known regulatory relationships and evaluates each candidate gene individually to infer its regulatory connections to the existing network. Using simulated datasets from the DREAM4 benchmark and the PRECISE-1K experimental dataset, our method outperformed conventional methods by incorporating prior knowledge. In particular, it improved the ability to distinguish true regulatory interactions from indirect associations arising from strong correlations among genes in the existing network. The method also showed strong performance for interactions involving genes with high outdegree or centrality. Furthermore, it maintained stable performance as the size of the existing network increased and was robust to noise in prior information. These results demonstrate that our method provides an effective framework for expanding existing gene regulatory networks by leveraging prior knowledge.
bioinformatics2026-06-16v1DynamicDemiLog: A Single Sketch for Ultrafast Similarity, Frequency, and Cardinality Estimation
Bushnell, B. J.Abstract
Probabilistic cardinality estimators (HyperLogLog), similarity sketches (MinHash), and frequency estimators (Count-Min Sketch) are fundamental approximate data structures that each target one primary problem. We present DynamicDemiLog (DDL), a sketch that unifies cardinality estimation, set similarity, containment, element frequency and composition in one tiny data structure built from a single pass over the input stream. Using an inverted index over 200,687 RefSeq sketches (159,567 organisms), DDL performs all-to-all sketch similarity comparison of the full database in 30 seconds (128 threads, indexed) - over 375x faster per query than Mash's brute-force all-to-all comparison of 91,282 sketches, or 31x faster without the index, at double the sketch resolution. DDL extends the LogLog register with a mantissa: each register stores a floating-point-encoded hash value consisting of an integer exponent (the leading-zero count) and a fractional mantissa (the sub-leading-zero bits), rather than the integer leading-zero count alone. This preserves enough hash information for meaningful register-by-register comparison - a property that standard 6-bit registers lack - while improving on LogLog's cardinality estimation machinery, including DynamicLogLog's early exit mask for high-throughput streaming. With a default 10 mantissa bits (16-bit registers, 2,048 buckets, 4 KB), DDL achieves a per-register false-match rate of 0.018% on unrelated random same-size sets (compared to 17.0% for LL6, a basic HyperLogLog implementation), enabling Weighted Kmer Identity (WKID), Average Nucleotide Identity (ANI), containment, and completeness estimation from register comparison alone. A 16-bit per-register observation counter provides element frequency information at trivial additional computation cost, and an additional byte tracks element composition (GC content, for biological data). 
Furthermore, DDL's high-specificity registers enable an inverted index structure (DDLIndex) that answers similarity queries against a database of N sketches in O(B + M) time, where M is the number of matching index entries, compared to O(NxB) for pairwise comparison.
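The register layout described above (a leading-zero exponent plus a few mantissa bits per bucket) can be illustrated with a minimal Python sketch. It is not the author's implementation: the BLAKE2b hash, the inverted-mantissa encoding, the reserved empty value, and the names `hash64`, `encode`, `build_sketch`, and `match_fraction` are all assumptions made for illustration. Only the 2,048 buckets, the 10 mantissa bits, and the 16-bit register width come from the abstract.

```python
import hashlib

HASH_BITS = 64
BUCKET_BITS = 11                      # 2**11 = 2,048 buckets, as in the abstract
MANTISSA_BITS = 10                    # default mantissa width from the abstract
VALUE_BITS = HASH_BITS - BUCKET_BITS  # bits left after choosing the bucket
MANTISSA_MASK = (1 << MANTISSA_BITS) - 1


def hash64(item):
    """64-bit hash of a string (BLAKE2b is an arbitrary, assumed choice)."""
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def encode(value):
    """Pack a VALUE_BITS-wide hash value into exponent + mantissa.

    The exponent is the leading-zero count; the mantissa is the first
    MANTISSA_BITS bits below the leading one. The mantissa is inverted so
    that a larger code always means a smaller hash (assumed convention),
    and 1 is added so that 0 can mark an empty register.
    """
    if value == 0:
        nlz, mantissa = VALUE_BITS, 0
    else:
        nlz = VALUE_BITS - value.bit_length()
        below = value.bit_length() - 1          # bits under the leading one
        frac = value & ((1 << below) - 1)
        if below >= MANTISSA_BITS:
            mantissa = frac >> (below - MANTISSA_BITS)
        else:
            mantissa = frac << (MANTISSA_BITS - below)
    return ((nlz << MANTISSA_BITS) | (MANTISSA_MASK - mantissa)) + 1


def build_sketch(items):
    """Single pass: keep the largest code (smallest hash) seen per bucket."""
    registers = [0] * (1 << BUCKET_BITS)
    for item in items:
        h = hash64(item)
        bucket = h & ((1 << BUCKET_BITS) - 1)
        code = encode(h >> BUCKET_BITS)
        if code > registers[bucket]:
            registers[bucket] = code
    return registers


def match_fraction(a, b):
    """Fraction of buckets, filled in both sketches, whose registers agree."""
    both = [(x, y) for x, y in zip(a, b) if x and y]
    if not both:
        return 0.0
    return sum(x == y for x, y in both) / len(both)
```

With 53 value bits the exponent needs 6 bits, so exponent plus mantissa fit in one 16-bit register. The match fraction here is only a raw register-agreement rate; converting it into WKID, ANI, or containment estimates would need the corrections described in the paper.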
bioinformatics2026-06-16v1Super Learner Ensemble Modeling of CPTAC Proteomic Data for Survival Prediction in Head and Neck Squamous Cell Carcinoma
Park, E.; Lee, H.; Oh, E. J.; Tham, T.; Ahn, S.Abstract
Survival analysis in head and neck squamous cell carcinoma (HNSCC) is traditionally performed using Cox proportional hazards models, alongside some exploration into black-box machine learning methods. The Super Learner (SL) algorithm addresses this model selection dilemma by combining diverse candidate algorithms into a weighted ensemble to perform comparably to the best candidate method. This study evaluates the performance of SL in HNSCC. Proteomic features as well as clinical covariates from 96 CPTAC HNSCC samples were modeled with three candidate algorithms (Cox LASSO, Cox Ridge, and Random Survival Forest) as well as the ensemble SL method. Models were optimized via Uno's time-dependent Concordance Index (C-index) and tested at 1- and 3-year time horizons using 2000 bootstrap resamples. The Cox Ridge regression model achieved the highest predictive accuracy among the four total methods. However, the SL demonstrated stable performance over both time horizons (1-year C-index: 0.985; 3-year C-index: 0.960). Variable importance analysis of the Cox Ridge model successfully identified malignant proteins (ATR, MAML1, MIEN1) alongside novel potential prognostic indicators (ZNF800, KERA). This analysis emphasizes the statistical necessity for larger cohorts for ensemble learning, while providing a benchmark of proteomic indicators in HNSCC.
bioinformatics2026-06-16v1PhenoBIC: operator-free single-cell spatial phenotyping in multiplex imaging data using deep learning of cell staining patterns
Sankaranarayanan, A.; Zhao, C.; Hernandez, M. G.; Clemens, E. A.; Smythe, K. S.; Kazerouni, A. S.; Carr, L. L.; Li, C. I.; Partridge, S. C.; Vinayak, S.; Mittal, S.Abstract
Multiplex imaging is a valuable tool for spatially examining tissue microenvironments at the single-cell level to uncover biological and clinical insights. However, most multiplex image analysis workflows currently require manual intervention for cell phenotyping, which slows progress, demands human effort, and yields operator-dependent outputs. Here, we developed PhenoBIC, a pre-trained deep learning model for image classification of the multiplexed biomarker signals in a cell (Biomarker Imprint of a Cell) to classify cell phenotypes. We show that PhenoBIC (F1-score ~0.88) outperforms manual gating (widely used) and other machine learning-based computational approaches for cell marker expression classification. We validated this across multiple biomarkers, tissue sampling strategies (whole biopsies and tissue microarrays), multiplex panels, imaging platforms, and tissue types. We have released our in-house training and validation datasets of ~1.4 million manually curated cell expression ground truth labels. We have also open-sourced PhenoBIC and enabled its community-wide deployment via the QuPath interface.
bioinformatics2026-06-16v1Physics-Driven Zero-Shot Reconstruction of Isotropic 3D Fluorescence Microscopy under Undersampled Acquisition
Cao, R.; Jin, T.; Xin, F.; Hou, Y.; Fu, Y.; Jin, B.; Li, L.; Gao, S.; Wang, H.; Li, Y.; Saimi, D.; Ren, W.; Wang, W.; Xin, G.; Yuan, K.; Chen, Z.; Su, X.; Kim, D.; Li, M.; Xi, P.Abstract
Three-dimensional (3D) imaging represents the next generation of fluorescence microscopy. However, routine axial down-sampling makes isotropic resolution unrealistic. Here, we propose DeepUI, a physics-driven zero-shot framework designed to reconstruct isotropic 3D fluorescence images from a low axial sampling rate. DeepUI fully leverages the intrinsic characteristics of 3D images through physics-guided degradation, which incorporates spatial-frequency joint learning to generate a scaled optical transfer function, combined with noise degradation and an up-sampling branch. DeepUI typically requires just 5 minutes for training and 0.5 minutes for high-throughput prediction. We demonstrate its superior performance in producing isotropic results and its exclusivity to axial down-sampling conditions, even under more challenging conditions, including defocused background, noise, and resolution blur.
bioinformatics2026-06-16v1Accelerating String Comparison in RLZ Compressed Sequences via LCE Jumps
Varki, R.; Boucher, C.Abstract
Relative Lempel-Ziv (RLZ) is an effective compression method for large, repetitive collections; however, the fundamental primitives required to elevate it from a passive archival format to a tractable representation for compressed construction have yet to be fully established. In this paper, we introduce an algorithmic framework for structurally comparing and lexicographically sorting sequences of RLZ factors. We characterize when direct factor comparisons are necessary and when they can be bypassed using RLZ specific shortcuts. We further introduce a method for extending truncated factors into right-maximal matches, enabling the recovery of matching statistics from the RLZ parse. Experimentally, RLZ sorting achieved speedups of up to 3.93x over character-based sorting. Together, these results advance the use of the RLZ format as a foundation for compressed construction.
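As a rough illustration of comparing RLZ-parsed sequences by jumping over longest common extensions (LCE) instead of comparing character by character, the following Python sketch may help. It is a generic textbook-style formulation, not the algorithm from the paper: the factor format `(position, length)`, the naive greedy parser, the linear-scan `lce`, and all function names are assumptions. A practical implementation would answer LCE queries with an index over the reference rather than by scanning.

```python
def rlz_parse(ref, text):
    """Naive greedy RLZ parse of text into (position, length) factors of ref.

    Assumes every character of text occurs somewhere in ref.
    """
    factors = []
    i = 0
    while i < len(text):
        best_pos, best_len = -1, 0
        for p in range(len(ref)):
            n = 0
            while p + n < len(ref) and i + n < len(text) and ref[p + n] == text[i + n]:
                n += 1
            if n > best_len:
                best_pos, best_len = p, n
        if best_len == 0:
            raise ValueError("character not present in reference")
        factors.append((best_pos, best_len))
        i += best_len
    return factors


def lce(ref, i, j, limit):
    """Longest common extension of ref[i:] and ref[j:], capped at limit.

    Linear scan for clarity; a real index would answer this much faster.
    """
    n = 0
    while n < limit and ref[i + n] == ref[j + n]:
        n += 1
    return n


def compare_rlz(ref, a, b):
    """Lexicographic comparison of two factor lists: returns -1, 0, or 1."""
    ia = ib = 0          # current factor in a and in b
    oa = ob = 0          # characters already consumed inside those factors
    while ia < len(a) and ib < len(b):
        pa, la = a[ia]
        pb, lb = b[ib]
        remaining = min(la - oa, lb - ob)
        match = lce(ref, pa + oa, pb + ob, remaining)
        if match < remaining:
            # First mismatch found inside the reference itself.
            ca, cb = ref[pa + oa + match], ref[pb + ob + match]
            return -1 if ca < cb else 1
        # Jump over the whole matched stretch in one step.
        oa += match
        ob += match
        if oa == la:
            ia, oa = ia + 1, 0
        if ob == lb:
            ib, ob = ib + 1, 0
    if ia == len(a) and ib == len(b):
        return 0
    return -1 if ia == len(a) else 1
```

Wrapping `compare_rlz` with `functools.cmp_to_key` would allow a list of factor lists to be sorted with Python's built-in `sorted`, without ever decompressing the sequences.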
bioinformatics2026-06-16v1Rapid and consistent clustering of millions of genomes highlights the diversity of prokaryotic life
von Wachsmann, J. H.; Lorenz, L. J.; Gurbich, T. A.; Russell, M. J.; Rodriguez Bouza, V.; Horsfield, S. T.; Lees, J. A.; Finn, R. D.Abstract
Bacterial genome and metagenome databases collectively contain over 5 million high-quality assemblies. However, the redundancy of these databases and the limited scalability of existing tools create bottlenecks for fully comprehensive, tree-of-life-scale genomic analyses. A fundamental task is to first break this data into smaller chunks, guided by their genome similarity. However, alignment-based comparative methods struggle to handle more than a few tens of thousands of genomes at a time, making the global organisation computationally complex and expensive. Here, we present gemsparcl (https://github.com/johannahelene/gemsparcl), a tool that clusters bacterial genomes into genomic cohesive units (GCUs), at approximately species-level resolution, over 500 times faster than existing methods. As part of developing gemsparcl, we developed sketchlib.rust, a one-permutation MinHash approach that implements an auxiliary inverted index to further accelerate all-versus-all comparisons. We added a statistical correction for incomplete metagenome-assembled genomes (MAGs) to enable accurate distance estimation and network-based edge quality filtering. After genome completeness quality control, we clustered 5.6 million high-quality bacterial genomes (2.88 million isolates and 2.77 million MAGs) into 92,954 GCUs in ~14 hours using 48 CPU threads and less than 16.5 GB of memory. Using taxonomic validation of the GCUs, the method achieves very high (99.76%) cluster purity (meaning only one species label occurs per GCU). We demonstrate that the clustering also highlights cases where taxonomic naming can be potentially harmonised or improved. Furthermore, we identify the most frequently reconstructed MAGs that lack a corresponding isolate genome and are thus priorities for culturing. The enhanced speed of gemsparcl enables routine database updates to incorporate the latest genomes. 
It also makes reference-free microbiome analysis across millions of genomes computationally tractable for the first time.
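The combination of one-permutation MinHash with an inverted index mentioned above can be sketched in a few lines of Python. This is a generic illustration under stated assumptions, not the sketchlib.rust implementation: the BLAKE2b hash, the bin count, the `(bin, value)` index key, and the names `kmers`, `oph_sketch`, `build_index`, `query`, and `jaccard_estimate` are all hypothetical, and the MAG completeness correction described in the abstract is not modelled.

```python
import hashlib
from collections import defaultdict


def kmers(seq, k):
    """Set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def oph_sketch(items, bins=64):
    """One-permutation MinHash: hash each item once, keep the minimum per bin."""
    sketch = [None] * bins
    for item in items:
        digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        b, value = h % bins, h // bins
        if sketch[b] is None or value < sketch[b]:
            sketch[b] = value
    return sketch


def build_index(sketches):
    """Inverted index: (bin, value) -> names of sketches holding that entry."""
    index = defaultdict(list)
    for name, sketch in sketches.items():
        for b, value in enumerate(sketch):
            if value is not None:
                index[(b, value)].append(name)
    return index


def query(index, sketch):
    """Count shared bins per indexed sketch, touching only matching entries."""
    hits = defaultdict(int)
    for b, value in enumerate(sketch):
        if value is not None:
            for name in index.get((b, value), ()):
                hits[name] += 1
    return dict(hits)


def jaccard_estimate(a, b):
    """Matching bins divided by bins that are non-empty in at least one sketch."""
    used = sum(1 for x, y in zip(a, b) if x is not None or y is not None)
    if used == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x is not None and x == y)
    return matches / used
```

The point of the index is that a query only visits sketches that share at least one bin value with it, so unrelated genomes are never compared at all.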
bioinformatics2026-06-15v2SMLMFlow: Improving Structural Resolution in Single Molecule Localization Microscopy with Flow Matching
Bauer, S.; Panconi, L.; Cunha, I.; Latron, E.; Sage, D.; Peters, R.; Griffie, J.Abstract
While Single Molecule Localization Microscopy (SMLM) aims to generate precise coordinates of molecular targets in cells, the resulting point clouds are inherently blurred by additive noise sources across the experimental, imaging, and processing workflow. This blurring often limits SMLM's ability to accurately quantify complex assembled structures required to address biological issues, despite reported localization precision down to a couple of nanometers. Here, we present SMLMFlow, a machine learning framework for improving structural resolution in SMLM datasets that combines a graph neural network and a hierarchical transformer with flow matching. We show that SMLMFlow improves structural resolution and downstream quantification across different structures, including filaments and protein nano-clusters, and generalizes to new unseen photophysics models.
bioinformatics2026-06-15v1SMS: Symmetric Mediation Statistics for Powerful High-Dimensional Mediation Analysis
Wang, Y.; Yan, S.; Wang, H.-J.; Hu, Y.-J.Abstract
Background: Mediation analysis of high-dimensional features, particularly molecular-level omics features, provides important opportunities to uncover biological mechanisms underlying human health and disease. However, two central statistical challenges remain: testing the composite-null hypothesis and maintaining power when the exposure-mediator and mediator-outcome associations differ substantially in statistical significance. Existing methods typically rely on accurate estimation of the proportions of the three null types or on the maximum of the two association p-values, and may not always control the FDR well and may have limited power under imbalanced significance. Methods: We propose SMS, a new statistical framework based on symmetric mediation statistics. By exploiting symmetry, SMS calibrates the composite null distribution as a whole for FDR control. It also allows flexible combinations of the two association p-values, including the maximum, and then enables construction of an omnibus test. Moreover, it permits direct use of effect-size estimates, bypassing the need to compute p-values. Results: SMS controlled the FDR across a wide range of simulation scenarios while achieving a substantial sensitivity gain, often around 20 percentage points, over existing methods including HDMT, DACT, and DEI-B. Applications to a metabolomics dataset and a DNA methylation dataset further corroborated these findings. Notably, SMS discovered five plausible mediators in the metabolomics dataset that were missed by all existing methods considered.
bioinformatics2026-06-15v1oxo-flow: compiled, memory-safe bioinformatics workflow orchestration
Wang, S.Abstract
Bioinformatics analyses depend on workflow engines to coordinate dozens of computational tools across complex dependency chains. The most widely adopted engines (Snakemake, Nextflow, the Common Workflow Language (CWL), and the Workflow Description Language (WDL)) run on interpreted or just-in-time (JIT) compiled language runtimes, incurring hundreds of milliseconds of startup latency and providing no compile-time safety guarantees from the host language. We developed oxo-flow, a workflow engine written in Rust that compiles to a single native binary. On an Apple M5 processor, oxo-flow parses, validates, and dry-runs a production-scale workflow in roughly 22 milliseconds, before Snakemake or Nextflow have finished loading their runtime environments. Peak memory usage is 16 megabytes, representing six- to seven-fold reductions relative to Snakemake and Nextflow. Dry-run latency is essentially independent of workflow size: a hundred-fold increase in rule count adds approximately 0.4 milliseconds. oxo-flow integrates 31 command-line tools, a REST interface with 60 endpoints, an embedded web application, and native cluster submission into a single 10-megabyte binary. It provides per-rule environment isolation across seven backends, checkpoint-based fault tolerance with cryptographic output verification, and a formal installation and operational qualification protocol for regulated laboratory environments. Ten curated workflows and three demonstration pipeline repositories are available. oxo-flow is freely available under Apache License 2.0 at https://github.com/Traitome/oxo-flow.
bioinformatics2026-06-15v1AliceDB database and pipeline for identification of natural protein variants based on mass spectrometry measurement data
Thiel, M.; Rozycka, A.; Puchalski, M.; Oldziej, S.Abstract
The natural variation that distinguishes living organisms within a single species is currently being studied intensively, primarily at the genetic level. Unfortunately, studies of natural variants at the level of protein gene products are not very common, mainly due to the lack of appropriate databases and bioinformatics tools. The main research technique used to study proteomes/peptidomes is mass spectrometry (MS). A classic method for interpreting raw mass spectrometry data in proteomic/peptidomic studies involves the use of databases containing representative (canonical) sequences that define the proteome of the organism under study. In this paper, we present the AliceDB database, which contains information on over 7 million natural variants of protein sequences described in the scientific literature for Homo sapiens. The data contained in the AliceDB database can be utilized using widely available and commonly used software for interpreting proteomic data. Test results regarding the use of the AliceDB database for the interpretation of proteomic data indicate that accounting for the presence of natural variants increases both the number and quality of identified proteins. Furthermore, it is easy to identify protein sequence variants that may, for example, be of significance in medicine.
bioinformatics2026-06-15v1RepGene: Toward a Unified Gene Representation Space Robust to Missing Biological Views
Hou, H.; Xia, T.; Hu, L.; Qin, H.; Zhang, Y.; Li, Y.; Fang, S.; Cao, L.Abstract
Genes can be described through multiple heterogeneous biological views, including genomic sequence, transcript sequence, protein sequence, textual knowledge, and single-cell expression context, yet existing gene embeddings remain largely modality-specific and difficult to compare or reuse when many views are unavailable. We study a narrower but practically important question: whether pretrained embeddings from these distinct sources can be organized into a shared gene representation interface that remains usable under severe missing-modality conditions. To investigate this question, we introduce RepGene, a lightweight single-branch framework that combines modality adapters, a shared encoder, presence-aware fusion, and self-supervised cross-view objectives to map five biological views into one latent space. Our goal is not to claim a new multimodal learning principle or to establish superiority over all simpler fusion strategies, but to provide an initial technical instantiation for testing whether such a shared interface is feasible in a fixed-feature setting. Under a two-stage protocol in which RepGene is trained self-supervised on frozen upstream embeddings and evaluated by downstream linear probing, we find preliminary evidence that the learned representation is broadly competitive in the full-modality setting and remains informative when only partial modality subsets are observed at inference time. The strongest signal in our study is robustness under missing views: average performance changes are often limited when one modality is removed, and even single-view inference remains non-trivial in the evaluated benchmark regime. These results do not resolve unified biological representation learning, and they should be interpreted in light of incomplete simple-fusion baselines, limited architectural ablation, benchmark dependence, and possible upstream feature exposure. 
We therefore position RepGene as a feasibility study and a starting point for stronger comparisons, broader benchmarks, and leakage-aware validation.
bioinformatics2026-06-15v1VrySure: A Multi-Task AI Scientific Fraud Detection Platform for Identifying Manipulated and AI-Generated Biomedical Research Images
Sun, J.; Li, B.; Kalluri, R.Abstract
Integrity of scientific data is critical in biomedical research, where images often serve as primary evidence for experimental observations and conclusions. Advances in image-editing technologies and generative artificial intelligence (AI) have increased the accessibility and realism of visual manipulation, making detection through manual review increasingly challenging. To empower our laboratory researchers to continuously monitor and uphold scientific rigor and data integrity, and serve the global scientific community, we developed VrySure, an easy-to-deploy, AI-driven multi-task platform for automated image-integrity screening in biomedical research. VrySure integrates four detection modules: cross-image transformation detection, within-image copy-move detection, splicing detection in blot and gel images, and AI-generated image detection. The system identifies potentially manipulated images and, when possible, localizes suspicious regions using bounding-box outputs to support downstream verification. To support development and evaluation, we constructed task-specific datasets by combining public biomedical image resources, curated manipulated examples, and synthetic images generated by multiple generative AI systems. We evaluated VrySure using region-level F1 score, recall, precision, false negative rate (FNR), and false discovery rate (FDR) across multiple manipulation categories and compared its performance with two commonly used commercial image-integrity screening platforms under a predefined benchmark protocol. Under the tested conditions, VrySure achieved a higher F1 score and recall, lower FNR, and maintained a low FDR for within-image copy-move detection, splicing detection, and AI-generated image detection, while showing comparable performance in transformation detection. Beyond automated screening, VrySure is designed to support source-data comparison and evidence-based assessment in scientific integrity investigations. 
By integrating multiple detection capabilities into a unified and scalable workflow, VrySure provides a practical framework to improve the efficiency and consistency of image-integrity screening in biomedical research.
bioinformatics2026-06-15v1Maternal BMI and Placental Transcriptomic Changes: A Meta-Analysis of Gene Expression at the Maternal-Fetal Interface
Tangri, R.; Regnault, T. R. H.; Shooshtari, P.Abstract
Objective: Maternal body mass index (BMI) is often used as a measure of metabolic status and increased or decreased maternal BMI is associated with a heightened risk of cardiometabolic diseases across generations. The placenta mediates these maternal metabolic cues; however, its genome-wide transcriptional adaptations in response to maternal BMI remain incompletely defined. Methods: We sought to delineate placental genes, pathways, and interaction clusters whose transcript abundance varies with maternal prepregnancy BMI through a genome-wide meta-analysis of human placental RNA sequencing datasets. Placental RNA-seq reads from four publicly available cohorts (n=146) were mapped to the GRCh38 reference genome and differentially expressed genes were identified. An independent microarray cohort (n=19) was reanalysed separately to facilitate cross-platform comparison. Functional enrichment employed GO, KEGG, and STRING protein interaction resources. Results: Meta-analysis of 146 RNA-seq samples identified eight genes with genome-wide significance in placentae from underweight pregnancies including inflammatory signaling gene MAP4K1 and metabolic enzyme PSPH, while overweight and obese categories revealed nominally significant differential expression. KEGG analysis demonstrated significant downregulation of oxidative phosphorylation with increasing maternal BMI, and protein-protein interaction networks revealed inflammatory mediators as central nodes in overweight and obese groups. Independent microarray validation corroborated key findings, including consistent downregulation of oxidative phosphorylation in obesity. Conclusion: Maternal BMI is associated with placental transcriptomic signatures involving inflammatory, metabolic, and hormonal pathways, with consistent downregulation of oxidative phosphorylation across platforms. 
This genome-wide meta-analysis provides a reproducible catalogue of BMI-responsive placental transcripts that may contribute to developmental programming of offspring health.
bioinformatics2026-06-15v1Multi-platform reassessment of human mitochondrial DNA methylation reveals signals consistent with technical artifacts
Basrai, S.; Bahcheli, A. T.; Tan, D.; Zuzarte, P. C.; Bevan, A.; Chan, T.; Ng, K.; Lam, B.; Arruda, A.; Das, S.; Minden, M. D.; Simpson, J. T.; Reimand, J.; Abelson, S.Abstract
The existence and functional relevance of mitochondrial DNA methylation remain controversial. Here, we systematically profiled cytosine methylation and hydroxymethylation across human brain and blood tissues spanning healthy and malignant states using orthogonal sequencing approaches that avoid chemical conversion during library preparation. While nuclear DNA exhibited canonical methylation patterns, mitochondrial DNA consistently showed negligible signal, indistinguishable from background technical noise. By mapping cytosine-guanine sites between mitochondrial DNA and nuclear-embedded mitochondrial sequences, we demonstrate the potential of these nuclear counterparts to confound not only cytosine methylation but also hydroxymethylation measurements, corroborating and extending prior findings implicating nuclear contamination as a potential source of apparent mitochondrial epigenetic signals. Additional technical factors that inflate apparent mtDNA methylation signals were identified, including sequence context biases, flow cell chemistries, and coverage-dependent discrepancies between the heavy and light strands. Collectively, these results provide convergent evidence against the presence of biologically meaningful cytosine methylation or hydroxymethylation in mitochondrial DNA. These findings caution against interpreting apparent mtDNA methylation signals in human adult tissues as meaningful without rigorous orthogonal validation and comprehensive consideration of technical and analytical confounding factors.
bioinformatics2026-06-15v1Multiple Fault Analysis and Drug Therapy on Signaling Pathways Using Dynamic Bayesian Network-based Model
Chowdhury, T.; Maitra, A.; Agarwal, A.; Sur, A.; Sarkar, S.; Majumder, S.; Lodh, E.Abstract
Cell growth is an intricate biological phenomenon that is closely regulated by the interplay between various growth factors and transcription factors. Signaling pathways are the main mediators in this event, which provide the driving force for mitosis or sometimes meiosis. However, when malfunctions occur within the biological network, they can cause uncontrolled cell division, regardless of external stimuli. By employing Dynamic Bayesian Networks (DBNs), these malfunctions can be explicitly simulated, offering insights into their effects on cellular behavior and growth regulation. To a significant extent, the resultant outcomes can be mitigated through the use of reduced drug combinations. This study delves into the intricacies of signaling pathway behavior under the influence of concurrent malfunctions. Initially, we replicate the effects of these dysfunctions within DBNs. Subsequently, drug therapy is applied to alleviate their impact. Our methodology introduces a parameter known as efficiency_score, enabling the identification of optimized drug combinations without prior knowledge of specific dysfunctions. Particularly relevant in the context of realistic cancer conditions, these tailored drug inhibition points demonstrate enhanced efficacy compared to conventional treatments. Leveraging GPU acceleration throughout the modeling process accelerates the analysis of multiple faults within the biological networks, rendering our approach notably faster and more efficient.
bioinformatics2026-06-15v1COMPASS enables cohort-independent digital biomarker discovery and pathway quantification
Sinha, S.; Ghosh, P.Abstract
Reproducible and clinically transferable quantification of pathway activity remains a major barrier in precision medicine, where biomarker performance often depends on cohort composition and normalization strategies. Here, we introduce COMPASS (COMPosite Activity Scoring System), a deterministic threshold-based framework that converts gene expression into quantitative pathway activity scores without reliance on reference cohorts. COMPASS derives gene-specific activation thresholds directly from data, standardizes deviations from thresholds, and integrates directionally opposing genes into a single composite score. This enabled transparent activity scoring, statistical comparisons, and survival analyses without coding. Across diverse biological and clinical datasets, COMPASS robustly quantified cellular states, benchmarked the humanness and disease relevance of new approach methodologies, and stratified outcomes. Compared to GSVA and ssGSEA, COMPASS demonstrated greater consistency across datasets and improved robustness in bootstrap analyses, particularly for bidirectional programs, including regulatory-approved sepsis gene signatures. COMPASS therefore addresses a critical unmet need for exact, interpretable, and clinically transferable biomarker discovery and outcome modeling across diverse biological and clinical settings.
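The three steps named in the abstract (gene-specific thresholds, standardized deviations, a composite of directionally opposing genes) can be illustrated with a minimal sketch. The abstract does not say how COMPASS derives thresholds or scales, so here they are simply passed in as hypothetical inputs; the function name, the mean-based aggregation, and the toy numbers are all assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def composite_score(expr, thresholds, scales, up_idx, down_idx):
    """Toy composite activity score, one value per sample.

    expr       : samples x genes expression matrix
    thresholds : per-gene activation thresholds (ASSUMED given; the
                 abstract does not state how they are derived)
    scales     : per-gene scale used to standardize deviations (ASSUMED)
    up_idx     : column indices of genes expected to rise with activity
    down_idx   : column indices of genes expected to fall with activity
    """
    expr = np.asarray(expr, dtype=float)
    # Standardized deviation of each gene from its own threshold.
    deviations = (expr - np.asarray(thresholds)) / np.asarray(scales)
    # Combine opposing directions: up genes add, down genes subtract.
    up = deviations[:, up_idx].mean(axis=1)
    down = deviations[:, down_idx].mean(axis=1)
    return up - down

# Hypothetical data: 3 samples, gene 0 is "up", gene 1 is "down".
expr = np.array([[1.0, 9.0],
                 [5.0, 5.0],
                 [9.0, 1.0]])
scores = composite_score(expr, thresholds=[5.0, 5.0], scales=[2.0, 2.0],
                         up_idx=[0], down_idx=[1])
print(scores)
```

Because each sample is scored only against fixed per-gene thresholds in this sketch, adding or removing other samples does not change its score.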
bioinformatics2026-06-14v4Transposable elements as evolutionary substrates of protein disorder in the human proteome
Mac Donagh, J.; Vergesio, N.; Aguilar, A.; Nores, R.; Lagares, A.; Fornasari, M. S.; Parisi, G.Abstract
Intrinsically disordered regions (IDRs) are central contributors to protein function, evolution and human disease, yet the evolutionary routes that seed new disordered segments within pre-existing proteins are still poorly understood. Sequence insertions provide a powerful mechanism for disorder expansion, but the genomic donors of inserted IDRs and their long-term conformational fate remain largely unknown. Transposable elements (TEs), abundant mobile genetic elements with distinctive compositional biases, represent compelling candidates for generating disorder within proteins. Here, we systematically mapped TE-derived segments across human proteins and isoforms, and we found that these insertions are strongly enriched in intrinsic disorder. The structural consequences of their insertion are shaped by TE class and family, reflecting the sequence biases of the elements from which they originate. Recent, primate-specific insertions preferentially generate disordered segments, whereas older insertions more frequently occupy ordered structural contexts, revealing an age-dependent transition in the conformational state of TE-derived sequences. TE-containing isoforms are expressed at lower levels than TE-free isoforms, particularly when insertions are young and disorder-rich, suggesting that intrinsic disorder may constrain the cellular tolerance of newly exonized sequences. These findings identify TEs as a major evolutionary mechanism linking genome mobility to the emergence of new disordered conformational ensembles in the human proteome.
bioinformatics2026-06-14v1Somatic variant detection in normal tissues from single-cell sequencing data
Luo, R.; Wang, Z.; Dou, J.; Bhamidipati, S. V.; Kalra, D.; Grochowski, C. M.; Doddapaneni, H. V.; Gibbs, R. A.; Chen, K.; Chen, R.Abstract
A crucial advantage of single-cell sequencing (SCS) is its ability to identify somatic variants in individual cells, enabling phylogenetic analysis of cellular populations within bulk tissues. While identifying somatic variants in tumor tissues via SCS has become a common practice, doing so in normal tissues remains challenging due to the rarity of somatic variants in normal cells. To evaluate the feasibility of somatic variant calling from widely available single-nucleus RNA-seq (snRNA-seq) and single-nucleus ATAC-seq (snATAC-seq) data, we profiled a cell-line mix of six HapMap samples prepared by the SMaHT consortium using 10x Genomics 5' snRNA-seq (12k cells with 36k mean reads per cell) and snATAC-seq (11k cells with 14k median high-quality fragments per cell) for variant calling. PacBio long-read whole genome sequencing (WGS) data (109x) generated from individual cell lines were used as ground truth. Two computational tools, Monopogen and SComatic, were used for somatic variant calling from the SCS data. Monopogen achieved single nucleotide variant (SNV) detection accuracies of 93.30% in the snRNA-seq and 99.64% in the snATAC-seq data, both of which outperformed SComatic (74.35% and 94.29%, respectively). Monopogen also consistently detected somatic SNVs at cellular fractions as low as 0.5% (2.54% in snRNA and 0.81% in snATAC) in individual samples. Notably, snATAC-seq exhibited higher genomic coverage breadth and a larger number of detected variants than snRNA-seq. While the SCS data have lower overall genome coverage than the bulk WGS data, the single-cell-level variant resolution allows Monopogen to assign variants to their cells of origin with over 80% accuracy in both RNA and ATAC modalities, thereby facilitating studies of clonal evolution and cell-type-specific mutagenesis. Other methods (DeepVariant, Cellsnp-lite and Mutect2) were also evaluated for comparison.
In conclusion, our study demonstrated the feasibility of performing reliable single-cell somatic mutation calling in a cell-line mixture and discussed the strengths and limitations of current computational methods when applied to normal tissues.
bioinformatics2026-06-14v1TopoMIL: Topology Improves Multiple Instance Learning in Diagnostic Microscopic Images
Kazeminia, S.; Dasdelen, M. F.; Rieck, B.; Marr, C.Abstract
Microscopic images of cells and tissues are central to disease diagnosis. In computational pathology, multiple instance learning (MIL) has emerged as a key paradigm for analyzing numerous images within a single patient sample. While the representative distribution of cells in a sample is important for diagnosis, existing MIL frameworks largely overlook it. We introduce TopoMIL, a framework that extracts the representative topological structure of the sample and integrates it into the MIL classifier. Three topological representations are assessed, each with distinct advantages and computational costs. We evaluate TopoMIL on four histopathology and cytomorphology datasets, each presenting unique challenges. Integrating the sample's topological information into MIL enhances classification across average, max, attention-based, and transformer pooling, yielding AUCROC gains of 3.3%, 4.2%, 5.9%, and 0.5%, respectively, with moderate computational cost. Our work underscores the potential of TopoMIL as a scalable extension to existing morphology-based models in computational pathology.
bioinformatics2026-06-14v1Variant annotation across homologous proteins (Paralogue Annotation) identifies disease-causing missense variants with high precision, and is widely applicable across protein families
Li, N.; Zhang, X.; Mazaika, E.; Theotokis, P.; Jang, M.; Ahmad, M.; Powell, G.; Heyne, H. O.; Lal, D.; Barton, P. J.; Walsh, R.; Whiffin, N.; Ware, J. S.Abstract
Background: Distinguishing pathogenic variants from those that are rare but benign remains a key challenge in clinical genetics, especially for variants not previously observed and characterised in humans. In vitro and in vivo functional characterisation are typically resource intensive, and model systems may not accurately predict influence on human disease. Many in silico tools have been developed to predict which variants are disease-causing, but typically lack precision. Here we demonstrate the applicability of a framework, called Paralogue Annotation, that draws on information from previously-characterised variants in homologous proteins to predict whether variants in a gene of interest are likely disease causing. Methods: We assessed the performance of Paralogue Annotation through three orthogonal approaches: (1) comparison to established in silico variant prediction tools using 47,360 missense variants from ClinVar across 3,524 genes representing a broad range of diverse protein classes, by calculating precision and sensitivity; (2) evaluation against large-scale functional assays of variant effect in TP53 and PPARG; and (3) comparison of odds ratios calculated from case-control association tests for inherited cardiac arrhythmia syndromes, and neurodevelopmental disorders with epilepsy, stratifying variants by Paralogue Annotation. Results: Paralogue Annotation correctly annotates 4,328 ClinVar pathogenic variants, with 245 false positives, yielding a precision of 0.95. This increases to 0.99 with more stringent annotation parameters (requiring greater conservation of amino acids between annotated orthologues) at the expense of sensitivity. Compared to established tools, Paralogue Annotation has higher precision for identification of pathogenic variants, albeit with lower sensitivity across diverse test sets. Extending the technique by transferring annotations between homologous protein domains, rather than full-length protein paralogues, increases sensitivity.
Rare variants predicted pathogenic by Paralogue Annotation were more strongly disease-associated (increased odds ratio) than unstratified rare variants for six out of eight genes tested with case-control cohort approaches. Conclusions: Paralogue Annotation has high precision for detection of pathogenic missense variants, outperforming in silico methods where data are available to make a prediction. As the number of characterised variants increases in reference datasets such as ClinVar, Paralogue Annotation will further increase in sensitivity and applicability.
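The headline precision can be checked directly from the counts given in the Results, treating the 4,328 correctly annotated pathogenic variants as true positives and the 245 as false positives. The function name below is illustrative only.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of positive calls that are correct: TP / (TP + FP)."""
    return true_positives / (true_positives + false_positives)

# Counts as reported in the abstract.
paralogue_precision = precision(4328, 245)
print(f"{paralogue_precision:.2f}")  # 0.95
```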
bioinformatics2026-06-13v2Phylogenetic detection of protein sites associated with continuous traits
Duchemin, L.; Muntane, G.; Boussau, B.; Veber, P.Abstract
Comparative genomic data can be used to look for substitutions in coding sequences that are associated with the variation of a particular phenotypic trait. A few statistical methods have been proposed to do so for phenotypes represented by discrete values. For continuous traits, no such statistical approach has been proposed, and researchers have resorted to sensible but uncharacterized criteria. Here, we investigate a phylogenetic model for coding sequences where amino acid preferences at a site are given by a continuous function of a quantitative trait. This function is inferred from the amino acids and the trait values in extant species and requires inferred point estimates of ancestral values of the trait at internal nodes. For detecting sites whose evolution is associated with this trait, we use a significance test against the hypothesis that amino acid preference does not depend on the trait. This procedure is compared to simpler strategies on simulated alignments. It displays an increased recall for low false positive rates, which is of special importance for performing whole-genome scans. This comes however at a much higher computational cost, and we suggest using a simple test to filter promising candidate sites. We then revisit a dataset of alignments for 62 species of mammals, using longevity as a phenotypic trait. We apply our method to three protein families that have previously been proposed to display sites associated with variation in lifespan in mammals. Using a graphical representation extracted from the detailed phylogenetic analysis of candidate sites, we suggest that the evidence for this in the sequence data alone is weak. The proposed method has been added to our Pelican software. It is available at https://gitlab.in2p3.fr/phoogle/pelican and can now be used with both discrete and continuous phenotypes to search for sites associated with phenotypic variation, on data sets with thousands of alignments.
bioinformatics2026-06-12v4RSTG: Robust Generation of High Quality Spatial Transcriptomics Data using Beta Divergence Based AutoEncoder
Halder, A.; Ghosh, A.; Bandyopadhyay, S.Abstract
One of the key challenges in spatial transcriptomics data analysis is the lack of sufficient data to train models. To address this shortcoming, multiple generative models have been developed to generate synthetic spatial transcriptomics samples in a controlled environment. However, these models often fail in out-of-the-box generation in the presence of noise (such as outliers). To tackle this challenge, we propose RSTG (Robust Spatial Transcriptomic Generator), an autoencoder that incorporates the β-ELBO loss, to generate realistic and high-quality spatial transcriptomic sequences. Our model uncovers the data's intrinsic structure by approximating its underlying distribution through variational inference, resulting in more interpretable and robust density estimation. We validate the effectiveness of RSTG across multiple tasks, including the recovery of cellular positions in both the 2D spatial and location domains. Our method shows improved performance, both qualitatively and quantitatively, across multiple datasets, including brain and liver samples generated using MERFISH, MERSCOPE, and Visium technologies. We further illustrate the robustness of RSTG to outliers by contaminating a portion of the data with anomalies (such as white noise, batch effects, and dropouts) as well as on a real-life degraded sample. The results show that our proposal maintains high quality and stability even when the training data are contaminated, across a variety of experimental settings and in comparison with existing approaches.
bioinformatics2026-06-12v2DyMoTree decodes early cell state transitions and drivers from single-cell transcriptomes using a tree-structured neural network
Wang, J.; Li, R.; Guo, C.; Qiang, M.; Wang, S.; Wang, G.; Tu, K.; Xu, Y.Abstract
Inferring early cell fate from single-cell RNA-sequencing data is essential for identifying cellular origins and fate plasticity in development and disease. However, existing methods often fail to exploit tree-structured lineage trajectories, limiting the accuracy and interpretability of fate mapping. Here we present DyMoTree, a computational framework that models cell fate decisions as nonlinear mappings between progenitor and terminal cell states under explicit lineage constraints. By integrating lineage graphs with a tree-structured neural architecture, DyMoTree learns lineage-resolved cell-state transition maps from single-cell transcriptomes, enabling robust inference of early fate bias and identification of fate-specific progenitor substates and driver genes. Across simulations, lineage-tracing experiments, and in vivo systems, DyMoTree outperformed existing methods in resolving early fate biases. Applications to mouse embryogenesis, lung adenocarcinoma progression, and CAR-T immunotherapy revealed regulatory programs underlying developmental and disease-associated transitions. DyMoTree provides a general framework for modeling lineage-resolved cell-state dynamics underlying development and disease progression.
bioinformatics2026-06-12v2Evaluating cell type annotations in single-cell omics in the absence of ground truth
Garnica, J.; Andreatta, M.; Carmona, S. J.Abstract
Accurate cell type annotation is essential for single-cell transcriptomics, directly shaping downstream analyses and biological interpretations. Yet, objective evaluation of annotation quality remains a major challenge. Here, we argue that a cell type or cell state label has practical utility only if it captures a molecular pattern that is reproducible across biological replicates. Based on this principle, we introduce inter-sample consistency (ISC), a quantitative framework to assess annotation quality in single-cell RNA-seq datasets. Unlike existing cluster validation approaches, ISC distinguishes annotations that generalize across samples and individuals from those driven by technical or unwanted variation, thereby providing principled criteria for annotation quality and transferability. When applied to published single-cell atlases, ISC reveals widespread reproducibility gaps and provides actionable guidance for repairing inconsistent annotations. Notably, ISC enables benchmarking of automated cell type annotation tools even when ground-truth labels are unavailable, providing interpretable metrics to guide their development and evaluation. Implemented as the scTypeEval Bioconductor package, this framework offers a broadly applicable resource for evaluating and improving cell type annotations in single-cell RNA-seq experiments.
bioinformatics2026-06-12v1A Graph-based QSAR Modeling Pipeline for Predicting In vitro PubChem Assays and In vivo Human Hepatotoxicity: Mechanistic Analysis of Caspase-3/7 Activation
Chitikela, Y.; Zhu, C.; Jia, Z.Abstract
Background: Caspase-3 and -7 are key effector caspases in the apoptotic pathway, a form of programmed cell death, and their activities serve as a well-established biomarker for evaluating environmental chemical toxicity and informing chemical risk assessment. Loss of mitochondrial membrane potential is a key event in the activation of Caspase-3/7 signaling and the subsequent induction of apoptosis. Therefore, simultaneous assessment of mitochondrial membrane potential and Caspase-3/7 activity enables elucidation of the mechanisms and pathways through which apoptosis is initiated. Rapid and accurate assessment of the potential toxicity of environmental chemicals and drugs remains a major challenge. Quantitative Structure Activity Relationship (QSAR) modeling has been widely used for toxicity prediction. Graph-based approaches encode compounds directly as molecular graphs, allowing structure-activity relationships to be learnt from molecular topology without the information loss in binary fingerprints. While advanced graph models such as graph transformers (GTs) have shown outstanding performance in many domains, they have not been fully leveraged in QSAR modeling of Caspase and mitochondrial toxicity. Methods: We propose a QSAR modeling pipeline that encompasses assay data preprocessing, feature representations (fingerprints and molecular graphs), and benchmarking of machine learning (ML) models, including classic ML models, graph neural networks (GNNs), GTs, and their consensus ensembles. Based on in vitro Caspase and mitochondrial assays in PubChem, we applied the pipeline to predict Caspase-3/7 activation and mitochondrial membrane potential (MMP). Beyond in vitro assays, we also built in vivo QSAR models for the FDA Drug-Induced Liver Injury (DILI) gold standard on human hepatotoxicity.
Moreover, mechanistic analysis on Caspase-3/7 activation was conducted by comparing with MMP disruption to identify chemical substructures that may be responsible for dual activations. We also investigated cell-line-specific responses by identifying structural motifs that selectively induce Caspase-3/7 activation in individual cell lines. Results: Experimental evaluations show that GTs and GNNs outperformed classic ML models when the number of active compounds is large, such as MMP disruption, while classic ML models and GTs performed well for highly imbalanced data with limited active compounds, such as Caspase-3/7 activation. For DILI prediction, the full consensus model achieved the highest AUC (0.69) and Graphormer had the highest F1 score (0.79), both surpassing the previous best model (AUC 0.63, F1 0.65) by a large margin. Our mechanistic analysis shows that phenolic compounds bearing a para-hydroxyphenyl motif, as well as members of the lipophilic chain family with long alkyl chains, can trigger the collapse of MMP, leading to the activation of caspases-3 and -7. Human embryonic kidney (HEK293) was the only cell line with a distinct structural motif: 1,1-dichloroethane and chlorobenzene. Human neuroblastoma (SK-N-SH) is uniquely impacted by an epoxide fragment and rat hepatoma (H-4-II-E) is uniquely impacted by a tetramethylcyclohexene motif and an acetaldehyde fragment. Conclusions: The proposed pipeline for QSAR modeling, including data preprocessing, feature representations, and incorporation of advanced graph ML approaches, is highly effective in predicting not only Caspase-3/7 activation and membrane potential collapse, but also FDA DILI human hepatotoxicity.
As future research directions, we will leverage extra information, e.g., biological activity and findings in existing toxicity literature, and recent advances in large language models and agentic AI to further improve the predictive performance and enable a sensitive and specific framework for assessing human hepatotoxicity of environmental compounds.
bioinformatics2026-06-12v1CAREPath: Semantic Context-Aware Reasoning Paths with Mechanism-Augmented Embeddings for Drug Repurposing
Song, H.; Bang, D.; Koo, B.; Kim, S.; Lee, S.Abstract
Biomedical knowledge graphs (BKGs) that include drugs, genes, and diseases support drug repurposing by connecting drugs to diseases through gene-mediated multi-hop paths, thereby enabling mechanism-of-action reasoning. However, deeper traversal does not necessarily improve mechanistic reasoning: long paths grow combinatorially and frequently pass through hub genes, producing irrelevant gene regulatory signals, whereas overly constrained or sparse paths may miss broader biological context. We propose CAREPath, a KG-LLM framework inspired by depth-first search (DFS)-like and breadth-first search (BFS)-like reasoning to balance mechanistic specificity, scalability, and context recovery. The DFS-like module constrains traversal to short disease-gene-drug paths, converts each path into a structured prompt, and encodes it with a biomedical language model to generate semantic path embeddings. Complementarily, the BFS-like module constructs entity-level mechanism-context embeddings from one-hop gene neighborhoods and enriches them through similarity-guided augmentation using pharmacologically related drugs and gene-signature-similar diseases. Across five biomedical KGs, CAREPath achieves the best overall AUPRC among 18 baselines, improving performance by up to 3.8%. Additional analyses show that semantic short-path encoding contributes most to performance, while mechanism-context augmentation improves robustness under sparse evidence and strengthens Gene Ontology functional agreement. Case studies and recently FDA-approved indications further demonstrate its practical relevance, positioning CAREPath as an interpretable framework for scalable and mechanism-aware drug repurposing. Source code is available at https://github.com/hamppy-song/CAREPath.
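The idea of restricting traversal to short disease-gene-drug paths and turning each path into a structured prompt can be sketched on a toy graph. This is not the CAREPath code (see the repository above for that): the data structures, function names, prompt wording, and placeholder entities such as `disease_A` are all hypothetical, and the language-model encoding step is omitted.

```python
from typing import Dict, List, Set, Tuple

Path = Tuple[str, str, str]  # (disease, gene, drug)

def short_paths(disease: str,
                disease_genes: Dict[str, Set[str]],
                gene_drugs: Dict[str, Set[str]]) -> List[Path]:
    """Enumerate only two-hop disease -> gene -> drug paths.

    Fixing the path length avoids the combinatorial growth of deeper
    traversal mentioned in the abstract.
    """
    paths: List[Path] = []
    for gene in sorted(disease_genes.get(disease, set())):
        for drug in sorted(gene_drugs.get(gene, set())):
            paths.append((disease, gene, drug))
    return paths

def path_to_prompt(path: Path) -> str:
    """Render a path as a structured text prompt (ASSUMED format)."""
    disease, gene, drug = path
    return f"Disease: {disease}; Gene: {gene}; Drug: {drug}"

# Toy knowledge graph with placeholder entities (not real biology).
disease_genes = {"disease_A": {"gene_1", "gene_2"}}
gene_drugs = {"gene_1": {"drug_X"}, "gene_2": {"drug_X", "drug_Y"}}

paths = short_paths("disease_A", disease_genes, gene_drugs)
prompts = [path_to_prompt(p) for p in paths]
for prompt in prompts:
    print(prompt)
```

In a fuller pipeline each prompt string would then be passed to a text encoder to obtain a path embedding; that step depends on the chosen model and is left out here.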
bioinformatics2026-06-12v1Systematic functional annotation of thousands of BAHD acyltransferases in plant genomes using Protein Language Model and phylogenomic tools
Smith, N.; Yuan, X.; Melissinos, C.; Satani, S.; Grissom, C.; Moghe, G. D.Abstract
The functional annotation of plant genes lags significantly behind their genomic annotation. Closing this gap requires thorough cataloging of reported protein activities alongside predictive methods that scale beyond sequence-similarity inference. Focusing on the BAHD acyltransferase enzyme family as a model, we assembled FuncZymeDB-BAHD, a large database of 2,705 LLM-retrieved and curated enzyme-acceptor-donor activities covering 336 BAHDs from 156 plant species, a 2-to-6-fold expansion over Swiss-Prot and prior compilations. We further developed FuncPred-OG, which maps queries to orthologous groups and previously characterized enzymes in FuncZymeDB-BAHD, returning hits with high evidence provenance. FuncPred-OG enabled functional prediction of over half of BAHDs across 85 plant proteomes, of which five novel predictions were validated via in vitro assays and recent studies. For the remaining BAHDs without FuncPred-OG annotation, we developed FuncPred-AI, where logistic-regression classifiers trained on protein language model embeddings achieved high Area-Under-the-Precision-Recall-curve (AUPR) scores and correct-hit rates up to 93%. FuncPred-AI yielded >1 probable donor/acceptor annotation for 99.9% (8894/8897) of BAHDs in our pan-plant dataset. Finally, the FuncPred workflow and datasets were deployed on a web portal for broader utilization, potentially reducing experimentalist efforts for selecting candidates from days to minutes. Overall, this framework provides a generalizable template for functional annotation of entire enzyme families.
bioinformatics2026-06-12v1From Proteome Mining to Structural Validation: Phosphopyruvate Hydratase as a Structurally Tractable Drug Target in Kinetoplastid Parasites
Goyzueta Mamani, L. D.; Barazorda Ccahuana, H. L.; G Ng, M.; Pineda R, L.; Medina Franco, J. L.; Florin Christensen, M.; Ferraz Coelho, E. A.; Spadafora, C.; Chavez Fumagalli, M. A.Abstract
Chagas disease, caused by Trypanosoma cruzi, demands novel therapeutic strategies that overcome the toxicity and limited efficacy of current treatments. To address this need, herein we report an integrative, target-centric strategy that combines parasite proteome mining, structural modeling, and experimental validation. Functional enrichment and druggability analyses identified phosphopyruvate hydratase (PPH) as a promising candidate due to its essential metabolic role and limited similarity to human homologs. Notably, proteome mining revealed the presence and conservation of PPH across kinetoplastid parasites, including Leishmania donovani, supporting its evaluation beyond T. cruzi. For the selected PPH sequences, AlphaFold-derived three-dimensional models underwent extensive molecular dynamics refinement, yielding stable conformational ensembles suitable for structure-based studies. Using this validated model, virtual screening of the Latin American Natural Products Database - LANaPDB - identified aptosimon as a top-ranked compound candidate. Molecular dynamics simulations further showed ligand-dependent binding behavior, suggesting alternative binding modes distinct from the canonical substrate configuration. In vitro assays demonstrated consistent antiparasitic activity against intracellular T. cruzi amastigotes (IC50 = 3.52 ug/mL) and Leishmania donovani promastigotes (IC50 = 13.06 ug/mL), supporting the biological relevance of the aptosimon-related lignan chemotype, hinokinin, across two kinetoplastid parasite models. Together, these results support PPH as a structurally tractable and biologically relevant candidate target, while identifying an aptosimon-related lignan chemotype, represented experimentally by hinokinin, as a cross-species antiparasitic scaffold that warrants further biochemical target-validation studies.
bioinformatics2026-06-12v1Generalisable tissue-wide molecular reconstruction from histology
Zhang, A.; Yu, L.; Bian, B.; Cao, Y.; Ye, S.; Han, E.; Robertson, H.; Dong, Y.; Mao, Y.; Liu, B.; Patrick, E.; Kim, J.; Yang, J. Y. H.Abstract
Spatial transcriptomics technologies measure gene expression within intact tissues but remain difficult to scale across large tissue sections and patient cohorts. Consequently, many studies rely on tissue microarrays (TMAs) or sparse spatial profiling designs, where molecular measurements are available for only limited tissue regions and are often generated using heterogeneous gene panels. Existing H&E to spatial gene expression prediction methods remain challenged by sparse molecular measurements, partially overlapping gene panels and tissue-wide reconstruction across heterogeneous spatial datasets. Here, we present GHIST+, a framework for tissue-wide reconstruction of single-cell molecular states from H&E histology. GHIST+ integrates cellular morphology, local tissue context and shared tissue representations to extend sparse molecular measurements into tissue-wide molecular maps across heterogeneous spatial datasets. Across multiple cancer types and GTEx breast tissues, GHIST+ reconstructs biologically meaningful tissue-wide molecular organisation from sparse TMA-derived measurements while preserving spatial tissue structure, cell-type organisation and age-associated tissue states across cancer and non-cancer settings. GHIST+ establishes a scalable framework for transforming sparse spatial profiling experiments into tissue-wide molecular maps, enabling cohort-scale molecular reconstruction from routine histology under heterogeneous spatial transcriptomic settings.
bioinformatics2026-06-12v1