Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
VaxjoGNN: A Graph Neural Network for Ontology-Grounded Vaccine Adjuvant Recommendation
He, Y.; Zheng, Y.
Abstract
Selecting an effective adjuvant remains a bottleneck in vaccine development, but most computational efforts have targeted antigen discovery rather than adjuvant prioritization. We frame disease-adjuvant matching as a top-k recommendation task on a heterogeneous knowledge graph grounded in biomedical ontologies, integrating curated facts, mechanistic pathways, and textual evidence. We introduce VaxjoGNN, a graph neural network trained with a listwise ranking objective. On a public benchmark, VaxjoGNN achieves NDCG@10 of 0.59 on seen diseases and 0.27 on previously unseen diseases (a 5.4 times improvement over a random baseline). The framework provides an ontology-anchored approach to adjuvant prioritization that complements existing antigen-focused tools.
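For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how NDCG@10 is conventionally computed for one ranked list. It is not the authors' evaluation code: the binary relevance labels and the example ranking are hypothetical, and the ideal ordering is assumed to be derived from the same candidate list.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k ranked items."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    """NDCG@k: DCG of the predicted order divided by DCG of the ideal order."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical ranking for one disease: 1 = known adjuvant, 0 = not known.
example = [0, 1, 0, 0, 1, 0, 0, 0, 0, 0]
score = ndcg_at_k(example, k=10)
print(f"NDCG@10 = {score:.3f}")
```

A benchmark-level figure such as the ones reported would typically be the mean of this per-disease score over all test diseases.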
bioinformatics 2026-06-18 v3
Impact of the N-glycosylation on full-length IgG2 and IgG4 antibodies: a comparative study using molecular dynamics simulations.
LEON FOUN LIN, R.; Bellaiche, A.; Diharce, J.; Etchebest, C.
Abstract
Like other proteins, monoclonal antibodies (important biodrugs) are subject to post-translational modifications, especially N-glycosylation. However, the effect of N-glycosylation remains poorly studied, and atomistic details about its influence are rarely available. Moreover, the few existing studies focus on the prevalent immunoglobulin G1. To better understand the impact of glycosylation, we carried out a comparative exploration of the effect of N-glycosylation on two different subclasses of antibodies, namely Mab231, an IgG2, and pembrolizumab, an IgG4. The two antibodies differ in their sequences, length, and 3D structure, and also in the location and composition of their glycans. In the present work, detailed information was gained through molecular dynamics simulations in which both monoclonal antibodies were studied with and without their glycans. The results of 1.5 microseconds of sampling for each system show that glycosylation does not drastically alter the overall conformational landscape of either antibody, whatever the metric considered. However, it measurably modulates local flexibility, inter-domain correlated motions, and the relative orientation of the Fab arms with respect to the Fc domain, with statistically significant shifts in key geometric descriptors. Importantly, contact analysis reveals that glycan interactions extend beyond the Fc region to reach Fab residues. The allosteric network calculations demonstrate that the influence of Fc-bound glycans propagates as far as the Fab framework regions in both mAbs, which could affect antigen binding. The nature and magnitude of these effects are subclass-dependent, reflecting differences in glycan composition, hinge architecture, and three-dimensional organization. Our findings challenge the prevailing view that Fc glycosylation uniformly promotes CH2 domain opening. More importantly, they underscore the necessity of considering full-length structures and IgG subclass diversity in glyco-engineering strategies.
bioinformatics 2026-06-18 v3
Cross-platform nanopore benchmarking reveals methylation-associated substitution errors in bacterial reads
Liu, X.; Ding, Q.; Shao, Y.; GUO, Z.; Ni, Y.; Fan, L.; Yang, Y.; Chen, K.; Yang, M.; Li, R.
Abstract
Nanopore sequencing enables long-read genome assembly and direct detection of DNA modifications, but emerging platforms require systematic evaluation against established technologies. We benchmarked CycloneSEQ against Oxford Nanopore Technologies R9.4.1 and R10.4.1 using matched native whole-genome sequencing and methylation-free whole-genome amplification libraries from six bacterial species. Updated CycloneSEQ chemistry and basecalling improved mean observed read accuracy to 96.0%, approaching R10.4.1. Across platforms, error spectra were non-random, with adenine-to-guanine and guanine-to-adenine substitutions consistently overrepresented. Comparisons with methylation-free controls showed that bacterial DNA methylation contributes substantially to these substitution patterns, highlighting a source of systematic nanopore error relevant to variant analysis. CycloneSEQ reads, when combined with short-read polishing, produced near-finished bacterial assemblies. We further show that CycloneSEQ supports bacterial methylation profiling: strand-specific basecalling errors enabled de novo discovery of 12 methylation-associated motifs, and two signal-to-reference alignment strategies enabled raw-signal comparison between native and amplification-derived reads. These results establish a cross-platform framework for nanopore benchmarking and extend bacterial epigenomic analysis to CycloneSEQ.
bioinformatics 2026-06-18 v2
Multiple Fault Analysis and Drug Therapy on Signaling Pathways Using Dynamic Bayesian Network-based Model
Chowdhury, T.; Majumder, S.; Lodh, E.; Maitra, A.; Agarwal, A.; Sur, A.; Sarkar, S.
Abstract
Cell growth is an intricate biological phenomenon that is closely regulated by the interplay between various growth factors and transcription factors. Signaling pathways are the main mediators in this event, which provide the driving force for mitosis or sometimes meiosis. However, when malfunctions occur within the biological network, they can cause uncontrolled cell division, regardless of external stimuli. By employing Dynamic Bayesian Networks (DBNs), these malfunctions can be explicitly simulated, offering insights into their effects on cellular behavior and growth regulation. To a significant extent, the resultant outcomes can be mitigated through the use of reduced drug combinations. This study delves into the intricacies of signaling pathway behavior under the influence of concurrent malfunctions. Initially, we replicate the effects of these dysfunctions within DBNs. Subsequently, drug therapy is applied to alleviate their impact. Our methodology introduces a parameter known as efficiency_score, enabling the identification of optimized drug combinations without prior knowledge of specific dysfunctions. Particularly relevant in the context of realistic cancer conditions, these tailored drug inhibition points demonstrate enhanced efficacy compared to conventional treatments. Leveraging GPU acceleration throughout the modeling process accelerates the analysis of multiple faults within the biological networks, rendering our approach notably faster and more efficient.
bioinformatics 2026-06-18 v2
Global StationaryOT: Trajectory inference for aging time courses of single-cell snapshots
Boyle, C.; Ventre, E.; Schiebinger, G.
Abstract
Trajectory inference (TI) methods for single-cell snapshots of developmental systems have yielded numerous insights into the gene regulatory networks (GRNs) that control cell differentiation. Many TI algorithms have been proposed for recovering cell trajectories from single samples containing cells spanning a spectrum of differentiation states; however, these methods cannot leverage temporal information when a time course of such diverse samples is available. As interest grows in understanding how the regulation of GRNs changes as an organism ages, current TI theory and methods must be adapted to take advantage of all information in aging time courses of single-cell data. In this paper, we present our novel age-conscious method, global StationaryOT, which exploits the temporal information in aging time courses to simultaneously reconstruct debiased cell trajectories at all ages. We demonstrate that this first-of-its-kind method achieves more accurate, biologically consistent trajectories in synthetic and real biological contexts where data sparsity produces significant noise in the outputs of current TI methods when they are applied to time course samples independently.
bioinformatics 2026-06-18 v2
Identification of environmental factors and growth stages in the prediction of fibre yield and fibre quality traits in rain-grown cotton
Feng, Q.; Rafter, P.; Wilson, I.; Li, Z.; Conaty, W.
Abstract
Context: Understanding how and when environmental conditions influence overall crop performance is crucial for optimising the development of genotypes to a specific breeding target environment. We focused on economically important traits of Australian rain-grown cotton, including fibre yield and quality traits, which have not been investigated comprehensively. The aim of the study was to identify relevant environmental factors, and the timing and extent of their impact on rain-grown cotton production.
Methods: We used a data-driven approach to analyse the relationship between ten climate-related environmental factors across various plant growth stages and eight fibre yield and quality traits, using a large-scale field dataset of 9,283 records collected over 23 years at 4 locations, with 53 unique year-location combinations. We applied eight complementary statistical models, including stepwise, penalised and Bayesian linear regression, regression-tree based ensemble methods and deep learning frameworks, to (1) select the most essential environmental covariates affecting rain-grown cotton production, and (2) evaluate the predictive performance of these models.
Results: The environmental impacts on rain-grown cotton production were trait and growth-stage specific. Number of rainy days and solar radiation were identified as the most influential environmental factors for fibre yield traits, whereas vapour pressure deficit at maximum daily temperature was the most influential factor for the majority of fibre quality traits. However, each analysed trait was influenced by multiple environmental factors across multiple growth stages (rather than a single factor or a single growth stage). These influential covariates accounted for 5.8% to 68.2% of the variation in the traits. Using the best-fit random forest model, our findings revealed non-linear relationships between key environmental covariates and the traits.
Conclusions: Environmental factors at different rain-grown cotton growth stages are key determinants for the performance of end-of-season fibre yield and fibre quality parameters. These findings highlight the need to account for environmental conditions when developing cotton varieties optimised for rain-grown production systems. Potential strategies are proposed whereby these key environmental factors can be used to increase the rate of genetic gain in rain-grown cotton production systems.
Implications: The results of this study will be crucial for future genetic evaluations and analyses of genotype-by-environment interaction effects in rain-grown cotton, which must account for the influence of the environment on plant performance. Furthermore, these methods can be applied to other species to identify critical growth stages and environmental factors which most influence crop performance.
bioinformatics 2026-06-18 v1
A data-driven rediscovery of the specificity-conferring code of adenylation domains in nonribosomal peptide synthetases
Li, Z.; Bozhuyuk, K. A. J.; Kalinina, O. V.; Klakow, D.
Abstract
Nonribosomal peptide synthetases (NRPSs) are large modular enzymes that assemble structurally diverse peptides, many of pharmacological importance, including antibiotics and immunosuppressants. Within each NRPS module, the adenylation (A) domain selects the substrate to be incorporated, a choice governed by a small set of residues lining the binding pocket. For two decades, computational prediction of A-domain substrate specificity has relied on residue sets - most prominently the Stachelhaus code and the 34-residue "8 Angstrom code" - that were defined by spatial proximity to the substrate rather than by demonstrated predictive value. Here we revisit which residues govern substrate specificity from a purely data-driven perspective. We assembled a non-redundant dataset of 5,366 A-domain sequences (4,693 bacterial and 673 fungal) and used information-theoretic measures to rank alignment positions by their statistical association with substrate identity, without restricting candidate positions to any predefined structural shell. This procedure yielded two compact, kingdom-specific codes: IG15B (15 positions) for bacterial and IG13F (13 positions) for fungal A-domains. Both match or exceed the predictive accuracy of the 34-residue 8 Angstrom code while using fewer than half its positions, and both independently recover the majority of the classical Stachelhaus positions. Notably, our analysis identifies four positions (242, 280, 281, and 284) that lie outside all conventional codes yet carry non-redundant specificity information and co-localize with classical determinants on two helices flanking the binding pocket. These positions provide new candidate sites for the rational engineering of A-domain specificity.
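One common information-theoretic way to rank alignment positions by their association with a label is information gain (the mutual information between the residue at a column and the substrate). The sketch below illustrates that general idea only. The abstract does not say which measures the authors used, and the toy sequences, substrate labels, and function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of categorical labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(column, substrates):
    """H(substrate) - H(substrate | residue at this alignment column)."""
    n = len(substrates)
    groups = {}
    for residue, substrate in zip(column, substrates):
        groups.setdefault(residue, []).append(substrate)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(substrates) - conditional

def rank_positions(aligned_seqs, substrates):
    """Return (position, gain) pairs sorted from most to least informative."""
    columns = list(zip(*aligned_seqs))
    gains = [(i, information_gain(col, substrates)) for i, col in enumerate(columns)]
    return sorted(gains, key=lambda pair: pair[1], reverse=True)

# Toy aligned pocket residues (hypothetical) with their substrates.
seqs = ["DAK", "DAK", "GVK", "GVK"]
labels = ["Phe", "Phe", "Val", "Val"]
ranking = rank_positions(seqs, labels)
```

In this toy case the first two columns separate the substrates perfectly and the invariant third column carries no information, so a compact code would keep only informative, non-redundant columns.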
bioinformatics 2026-06-18 v1
Metrics for Evaluating Biological AI Model Predictive Accuracy at the Data-Substrate Level
Ewing, M. A.
Abstract
Reports in the biological literature disagree on whether a given model can predict a biological outcome from a given data sample: one study finds a model capable; another, on the same kind of data, finds that it is not. This is a particular challenge for LLMs, where the models are large and opaque and their weights and training data are inaccessible. Such disagreements cannot be settled by directly inspecting the model. To address this challenge, we consider an alternative approach: assessing whether the data sample is adequate to support the prediction asserted. For a given dataset, its substrate (the underlying structure of the data) determines what any model can recover, independent of architecture or capacity. At the same time, predicting the present state of a biological process and predicting the direction of its future change are different tasks. The second is supportable among AI models only where the data encode direction as determinable from the state, a property we call encoding, and is unsupportable where the same observed state precedes change in opposite directions, a property we call non-identifiability, in the informational rather than the statistical sense. We introduce two generic metrics, Predictive Blindness Risk (PBR) and Prediction Indeterminacy Measure (PIM), that evaluate a data substrate for predictive accuracy directly, without access to model weights, architecture, or training data, and locate the regions of a data substrate where a predictive claim can be supported and where it cannot.
Using the Yale Brain Metastases Longitudinal Data (1,430 human subjects; 11,892 MRI studies; four sequences), we show that direction of change was non-identifiable across regions encompassing the majority of transitions; a nonlinear AI model gained essentially nothing over majority-direction prediction there while recovering direction near-perfectly where the state encoded it; and model accuracy tracked data-substrate resolvability continuously (Spearman ρ = -0.95 to -1.00). The metrics adjudicate, before any model is trusted and from the data alone, where claims of predictive accuracy (of state, or of the law of change) can be supported.
bioinformatics 2026-06-18 v1
Looking beyond stereotyped neuron structures reveals links between beading and morphological rearrangements in aging phenotypes.
Gomez, K.; Nguyen, K.; Lagergren, J.; Flores, K.; San Miguel, A.
Abstract
Understanding how neuronal morphology changes during aging and acute stress is essential for elucidating mechanisms of neurodegeneration. The highly branched PVD neuron of Caenorhabditis elegans provides a powerful model for studying dendritic remodeling and degeneration-associated phenotypes such as dendritic beading. However, the complexity of this arbor presents substantial challenges for automated segmentation and quantitative analysis. In this study, we adapted a convolutional neural network (CNN)-guided region growing framework for automated dendrite tracing, coupled with two topology-based algorithms for categorizing dendritic segments by branching degree. The segmentation algorithm achieved high accuracy relative to manual tracing, with a median Dice coefficient of 0.82, while reducing analysis time by approximately tenfold. Automated dendrite categorization demonstrated strong agreement with manual annotations across branching orders, though position-based mapping performance declined with age due to progressive morphological distortion. Leveraging this platform, we investigated mechanistic differences in dendritic beading patterns observed during aging and cold shock. Consistent with prior work, aging was associated with decreased inter-bead spacing, whereas cold shock produced increased bead dispersion with stress severity. Structural analysis revealed that these trends were not driven by dendritic pruning or reduced arbor complexity. Instead, while a traditional anatomically inflexible paradigm falsely implicated lower-degree dendrites as highly vulnerable, our branching-informed framework revealed that age-dependent beading is fundamentally dictated by a segment's history of successive branching events. Conversely, acute cold shock triggered systemic beading that expanded across all dendritic orders in a severity-dependent manner.
Together, these findings demonstrate that chronic aging and acute stress engage distinct degenerative pathways (compartment-specific lineage vulnerability versus global architectural collapse) rather than gross morphological loss, as well as highlighting the need for paradigms that enable reliable analysis of changing morphologies.
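The Dice coefficient used above to score segmentation against manual tracing has a simple definition, 2|A ∩ B| / (|A| + |B|). The sketch below computes it for two small binary masks. The masks are hypothetical toy data, and returning 1.0 for two empty masks is an assumed convention, not something the abstract specifies.

```python
def dice_coefficient(pred_mask, true_mask):
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for two same-shaped binary masks."""
    pred = [bool(v) for row in pred_mask for v in row]
    true = [bool(v) for row in true_mask for v in row]
    if len(pred) != len(true):
        raise ValueError("masks must have the same number of pixels")
    overlap = sum(p and t for p, t in zip(pred, true))
    total = sum(pred) + sum(true)
    # Two empty masks agree perfectly by convention (an assumption here).
    return 1.0 if total == 0 else 2.0 * overlap / total

# Toy 2x3 masks: automated tracing vs. manual tracing (hypothetical).
automated = [[1, 1, 0],
             [0, 1, 0]]
manual = [[1, 0, 0],
          [0, 1, 1]]
dice = dice_coefficient(automated, manual)
```

Here two of the three foreground pixels in each mask overlap, so the score is 4/6.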
bioinformatics 2026-06-18 v1
Calculation of sequence space coverage in a mutagenesis library
Florez Prada, A.; Uguzzoni, G.; Hart, D. J.
Abstract
Directed evolution requires screening of large mutagenesis libraries, but accurate calculation of library sizes needed to discover functional variants remains challenging. Existing models provide baseline estimates, yet current computational approaches for finding the best variants scale poorly with library complexity. Here, we introduce a scalable algorithmic framework to compute exact discovery probabilities in saturation mutagenesis libraries with no requirement for explicit sequence enumeration. By aggregating variants into a composition log-sum distribution and applying log-space convolution across randomisation blocks, it is possible to extend this to massive sequence spaces and mixed codon schemes. By inverting these calculations, absolute mathematical ceilings for experimental design are established. Ultimately, this framework provides a rapid, quantitative tool to balance the statistical coverage-diversity trade-off within the limitations of laboratory screening. Finally, this is implemented as an open-source web application (SSCC) that allows researchers to construct heterogeneous library designs and compute required sampling depths, coverage probabilities, and absolute randomisation limits.
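As background for the kind of quantity involved, the textbook single-variant calculation can be sketched as follows: if one variant is drawn with probability p per clone, the chance of seeing it at least once among N independent clones is 1 - (1 - p)^N, and inverting that gives the sampling depth for a target probability. This is not the authors' log-sum convolution algorithm. The example assumes a hypothetical library of three NNK positions with all 32 codons equally likely, and the function names are illustrative.

```python
import math

def discovery_probability(p_variant, n_clones):
    """P(variant seen at least once) = 1 - (1 - p)^N, evaluated in log space."""
    return -math.expm1(n_clones * math.log1p(-p_variant))

def clones_required(p_variant, target_probability):
    """Smallest N whose discovery probability reaches the target."""
    return math.ceil(math.log1p(-target_probability) / math.log1p(-p_variant))

# Assumption: 3 NNK-randomised positions, 32 equiprobable codons each,
# and we ask about one specific codon combination.
p_specific = (1 / 32) ** 3
depth_95 = clones_required(p_specific, 0.95)
print(depth_95, discovery_probability(p_specific, depth_95))
```

Working with `log1p` and `expm1` keeps the result accurate when p is tiny, which is the regime large libraries live in.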
bioinformatics 2026-06-18 v1
novelBGC: An interactive dual-score framework for biosynthetic gene cluster novelty assessment and candidate prioritisation
Shukla, G.; Merugu, B.; Sharma, G.
Abstract
Genome mining now yields tens of thousands of putative biosynthetic gene clusters (BGCs) per project, yet separating genuinely novel candidates from rediscoveries of known compounds remains the rate-limiting step before experimental validation. Single-axis prioritisation tools, namely antiSMASH similarity, BiG-FAM GCF distance, and self-resistance-enzyme (SRE) filters such as ARTS, each surface a different facet of evidence, yet their isolated use systematically over-ranks rediscovery-prone BGCs and overlooks genuinely orphan clusters. We present novelBGC, a web-hosted framework that converts these disparate outputs into two deliberately non-inverse continuous metrics per BGC, a Novelty (N) score and a Reference Similarity (RS) score, which together define a 2D decision plane that resolves rediscoveries, divergent family members, contig-edge artefacts, and uncharted chemistry with interactive visualisations, with all component weights user-tuneable at submission. Retrospective validation across three independent experimental datasets demonstrates the utility of the framework for candidate prioritization. Within the first 186-BGC SRE-guided cloning study, every confirmed bioactive product fell within the low-to-mid N band, whereas 55 high-N (N ≥ 0.50) BGCs were never selected. Moreover, in the other two studies, it correctly prioritised the fully orphan lariocidin BGC of Paenibacillus sp. M2 and the divergent within-family indanopyrrole-A idp BGC of Streptomyces sp. CNX-425. Together, these case studies demonstrate that the joint (N, RS) space facilitates prioritization decisions that are difficult to achieve using any single criterion alone, from identical input data. novelBGC requires no command-line expertise, no local tool installation, and no manual integration of intermediate output formats, addressing a well-documented accessibility barrier for wet-laboratory researchers engaging with genome-mining workflows.
novelBGC is freely available at https://project.iith.ac.in/sharmaglab/novelbgc/.
bioinformatics 2026-06-18 v1
A unified smoothing framework for protein domain bigram model
Cui, X.; Iyer, G.; Durand, D.
Abstract
Biomolecular sequences can be represented as strings over an alphabet, an analogy that has motivated many applications of computational linguistic techniques to biological problems. However, such methods must be adapted to the characteristic scale and organization of biomolecular data. Here, we consider the problem of bigram smoothing for multidomain protein architectures, where domain bigram frequency data is extremely sparse and differs from textual data in alphabet size, string length distribution, the relationship between bigram and unigram frequencies, tandem repeat lengths, and the distribution of domain adjacencies. Moreover, some domain combinations are unobserved because they are biologically incompatible, others because the data are incomplete. A smoothing method that distinguishes these two cases is required. We propose a unified smoothing framework based on interpolation that can be tuned to accommodate different bigram data characteristics. Within this framework, we design specific model variants suited to protein domain bigram data: these assign low adjusted counts to pairs that are likely incompatible, while making appropriate adjustments for undersampled pairs. We demonstrate empirically that this approach distinguishes the two cases while preserving the characteristic signatures of multidomain data.
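To make the notion of interpolation-based bigram smoothing concrete, here is a minimal sketch of classic Jelinek-Mercer interpolation applied to domain architectures, which mixes the maximum-likelihood bigram estimate with the unigram frequency. It is a generic baseline, not the authors' framework: it does not distinguish incompatible from undersampled pairs, and the weight `lam`, the function name, and the toy architectures are assumptions for illustration.

```python
from collections import Counter

def interpolated_bigram_model(architectures, lam=0.7):
    """Return prob(left, right) = lam * P_ML(right | left) + (1 - lam) * P(right)."""
    unigrams = Counter(d for arch in architectures for d in arch)
    bigrams = Counter(pair for arch in architectures for pair in zip(arch, arch[1:]))
    left_counts = Counter()
    for (left, _), count in bigrams.items():
        left_counts[left] += count
    total = sum(unigrams.values())

    def prob(left, right):
        p_unigram = unigrams[right] / total
        if left_counts[left] == 0:
            # No adjacency observed after `left`: fall back to the unigram estimate.
            return p_unigram
        p_ml = bigrams[(left, right)] / left_counts[left]
        return lam * p_ml + (1 - lam) * p_unigram

    return prob, sorted(unigrams)

# Toy multidomain architectures, N- to C-terminus (hypothetical data).
toy = [("SH3", "SH2", "Kinase"), ("SH2", "Kinase"), ("PH", "Kinase")]
prob, vocab = interpolated_bigram_model(toy, lam=0.7)
p_seen = prob("SH2", "Kinase")    # observed adjacency
p_unseen = prob("SH2", "PH")      # never observed, still gets some mass
```

The unseen pair receives probability purely from the unigram term, which is exactly the behaviour a domain-aware method would want to modulate when a pair is likely to be biologically incompatible.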
bioinformatics 2026-06-18 v1
Trajectory inference of epithelial-centered neighborhood profiles reconstructs a pseudo-temporal continuum in idiopathic pulmonary fibrosis
Nakamura, S.; Tsubouchi, K.; Yamamoto, Y.; Takano, T.; Nakatsuru, K.; Takenaka, T.; Hashisako, M.; Oda, Y.; Okamoto, I.
Abstract
Idiopathic pulmonary fibrosis (IPF) is characterized by complex lung architecture and spatially heterogeneous remodeling, which have hindered integrated analysis of cell-intrinsic activity and intercellular communication during disease progression. Here we profiled six IPF lung specimens comprising more than 630,000 cells using the Xenium 5k panel and developed an epithelial-centered neighborhood profiling framework based on the local cellular composition around each epithelial cell. This approach captured fibrosis-associated variation in epithelial niches without requiring predefined histological regions. Pseudo-temporal continuum inference of these profiles reconstructed a continuous axis that reflected the spatial progression of fibrotic remodeling from relatively preserved alveolar regions to fibrotic and airway-like remodeled regions. Within this spatial dataset, we mapped coordinated changes in epithelial states, local microenvironments, epithelial intracellular pathway activities, and directional interactions with neighboring cell types along the same axis. Our findings provide a spatial framework that generates testable hypotheses for progressive epithelial niche remodeling in IPF.
bioinformatics 2026-06-18 v1
Predicting optimal growth temperatures of bacteria using learned structural information from a single protein
Hoffert, M.; Myerscough, D.; Dragone, N. B.; Gebert, M. J.; Silberg, J. J.; Fierer, N.
Abstract
Temperature is a fundamental determinant of bacterial physiology and ecology. Optimal growth temperature (OGT) is highly variable across species, contributing to differences in where and when species are most likely to thrive. Although the OGTs for most bacteria remain unknown, the increasing availability of genomes from uncultivated and cultivated taxa has made it advantageous to build genomic, cultivation-independent models to infer OGT. However, pre-existing genomic models often lack the generalizability and mechanistic grounding required for robust inferences of OGT. We propose a novel framework for predicting bacterial OGT which uses learned protein structural signatures of thermal adaptation. We hypothesize that biophysical tradeoffs which dictate enzymatic functions across variable temperatures provide a more robust empirical basis for OGT prediction than broad genomic features. Our OGT-predicting model, ROSEATE, is based on a single gene, adenylate kinase (ADK), that encodes a ubiquitous enzyme essential for energy homeostasis. ROSEATE uses high-dimensional latent space encoding via MSA Transformer, a protein language model that embeds ADKs in a manner that preserves biophysical information about embedded proteins. We show that the accuracy of the ROSEATE model is on par with other genome-based models, that the model has a high degree of phylogenetic generalizability, and that the ESM embeddings effectively capture key temperature-adaptive enzyme characteristics derived from AlphaFold structures. Because ROSEATE is based on analyses of a single ubiquitous protein, it can be used with metagenomic data to infer the community-level variation in bacterial OGTs. We demonstrate this feature of ROSEATE by reconstructing ADK sequences from over 500 environmental and host-associated metagenomes, successfully distinguishing community-wide thermal preferences across diverse habitats, from polar oceans to mammalian guts.
By transitioning from genomic proxies to informationally dense protein structural features, this work provides an efficient, interpretable tool for predicting bacterial OGTs across taxa and whole communities.
bioinformatics 2026-06-18 v1
Benchmarking gene expression reconstruction from single-cell latent representations
Fu, X.; Klein, D.; Antipov, E.; Palma, A.; Tejada-Lapuerta, A.; Bahrami, M.; Kummerle, L. B.; Lubetzki, M.; Casale, F. P.; Luecken, M. D.; Theis, F. J.
Abstract
Single-cell transcriptomics is typically modeled in low-dimensional latent representations that improve the signal-to-noise ratio of the data. Such representations underpin data integration, cell state discovery, and perturbation prediction, with applications ranging from large-scale organ atlases to latent trajectory modeling. Recent virtual cell approaches further leverage these representations to predict cellular responses as distributional shifts in latent space. Each of these applications ultimately requires faithful gene expression reconstruction from latent spaces for biological interpretation, enabling gene-level analysis of predicted perturbed or batch-corrected cells. Yet representation choice is typically treated as an implementation detail rather than a primary modeling decision, with no systematic evaluation of how well latent representations support gene expression reconstruction. Here, we introduce ReconEval, a benchmark for evaluating gene expression reconstruction from single-cell latent spaces. We benchmark two classes of latent representations: end-to-end trained models such as PCA, autoencoders, and variational autoencoders, and pretrained single-cell foundation model embeddings coupled to newly trained decoders. Reconstruction is evaluated both directly and after latent-space perturbation prediction. Across perturbational and observational datasets totaling over 100 million cells, our metric suite quantifies statistical fidelity; biological signal preservation, including differential expression, coexpression, cell-cycle structure, cytokine response and pathway activity; and perturbation-specific effects. We find that autoencoders achieve the strongest stand-alone reconstruction at low dimensionality, while variational regularization does not improve generalization in reconstruction. 
Frozen foundation model embeddings retain recoverable gene-level information, with reconstruction quality depending strongly on decoder architecture and pretraining objective. In latent perturbation modeling, high-dimensional PCA matches foundation model embeddings, while low-dimensional AE embeddings are optimal for flow-based generative models. Overall, reconstruction depends critically on the interplay between representation and downstream model, and simpler representations can outperform complex alternatives given appropriate capacity. Our benchmark establishes reconstruction as a critical evaluation axis for single-cell foundation models. We envision it improving the biological interpretability of latent-space modeling, a prerequisite for future virtual cell models to be validated by domain experts and grounded in biology.
bioinformatics 2026-06-18 v1
MorphoStat: A Statistics-Aware Pipeline for Morphological Profiling Analysis
Altobi, A.; Heo, D.
Abstract
High-content imaging produces thousands of morphological measurements per cell. Interpreting these measurements requires normalization to remove plate effects, statistical tests selected on the basis of data distribution, and control over false discoveries across many features tested at once. MorphoStat is an open-source Python pipeline that applies this sequence of steps automatically. Given a CSV file from CellProfiler or a compatible imaging platform, it removes low-quality wells, normalizes each plate against DMSO controls using a MAD-scaled z-score, routes each feature to a parametric or nonparametric test based on a distributional check, applies Benjamini-Hochberg correction, and writes out results and publication-ready figures. On the BBBC021 benchmark (MCF-7 breast-cancer cells, 632 wells, 473 features), MorphoStat recovered 12 of 13 known mechanism-of-action classes in principal component space, confirming that the normalization and statistical routing work as intended. The tool is available at https://github.com/Almunthir334/morphostat (DOI: 10.5281/zenodo.20354069) under the MIT license.
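Two of the steps named above, MAD-scaled z-scoring against controls and Benjamini-Hochberg correction, can be sketched with the standard library alone. This is an illustration of the general techniques, not MorphoStat's code or API: the function names, the 1.4826 consistency constant (the usual choice for normally distributed data), and the toy numbers are assumptions.

```python
import statistics

def mad_z_scores(values, control_values):
    """Robust z-score: (x - median(controls)) / (1.4826 * MAD(controls))."""
    center = statistics.median(control_values)
    mad = statistics.median(abs(v - center) for v in control_values)
    if mad == 0:
        raise ValueError("MAD of controls is zero; feature cannot be scaled")
    scale = 1.4826 * mad  # makes MAD comparable to a standard deviation
    return [(v - center) / scale for v in values]

def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values (q-values) in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical DMSO control wells and one treated well for a single feature.
dmso = [10, 12, 11, 13, 9]
z = mad_z_scores([14], dmso)[0]

# Hypothetical per-feature p-values.
q = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```

Scaling by the controls' median and MAD instead of mean and standard deviation keeps a few outlier wells from distorting the normalization of the whole plate.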
bioinformatics 2026-06-18 v1
Bayesian modeling of longitudinal metatranscriptomes of broiler meat spoilage microbiomes shows shared predictive signature associated with spoilage at refrigerated temperatures
Nushi, E.; Manninen, J.; Johansson, P.; Honkela, A.; Björkroth, J.
Abstract
Microbial spoilage of packaged meat is driven by complex microbial succession and related metabolic activity, yet conventional shelf-life assessment is mainly based on shelf-life studies relying on culturing and sensory analysis. In routine quality assurance, results are obtained retrospectively, and they are only indirectly linked to the metabolic activity related to sensory deterioration. Functional, time-informative approaches that capture the active metabolic state of the spoilage microbiome and predict the rate of spoilage are lacking. We developed a censoring-aware Gaussian process (CAGP) framework to model longitudinal pathway expression profiles from broiler meat metatranscriptomes collected over consecutive storage days at 4 or 6 °C. Samples were annotated using odor-based sensory scores defining fresh, early-spoilage, and late-spoilage phases. Because observed zeros in pathway-level data may reflect non-detection rather than true absence, the model treats low values as left-censored observations below a detection threshold while estimating smooth temporal trajectories with uncertainty. In leave-one-out prediction within the 4 °C time series, predicted sampling days differed from the true days by an average of 0.43 days, and predicted spoilage phases agreed with the sensory classification. Trajectories learned at 4 °C also transferred to an independent 6 °C time series at the spoilage-phase level, suggesting that shared functional spoilage programs are preserved despite temperature-dependent changes in spoilage rate. Cross-entropy ranking further identified pathway modules carrying time- and phase-informative signals across temperatures. Overall, this framework provides a probabilistic approach for linking metatranscriptomic functional dynamics to sensory spoilage progression, supporting shelf-life assessment beyond retrospective microbial enumeration.
bioinformatics2026-06-18v1Bioinf-Farma: supervised integration of epitope prediction and recombinant protein developability for automated vaccine candidate prioritization
Bondi, H.; Crespi, M.; Orlando, M.; Lescai, F.; Serapian, S. A.; Colombo, G.; Fasano, M.; Pollegioni, L.; Molla, G.Abstract
Vaccine antigen discovery requires prioritizing protein candidates according to both immunogenic potential and recombinant expression feasibility. These properties are typically evaluated using separate computational tools, requiring researchers to integrate heterogeneous outputs through ad hoc workflows. Here, we present BIOINF-farma, a modular platform integrating epitope prediction and developability assessment for rational antigen selection within a unified environment. Candidates can be submitted as amino acid sequences or three-dimensional structures. When experimental structures are unavailable, BIOINF-farma automatically searches for models in AlphaFold DB or performs structure prediction using Boltz-2, ensuring a standardized structural representation for downstream analyses. Antigenicity is quantified by combining structure-based conformational epitope signals (MLCE/REBELOT-BEPPE) and sequence-based linear epitope propensity scores (BepiPred 3.0) into a protein-level Antigenicity Score, with a classification threshold optimized on a manually curated validation dataset. Developability is evaluated through two supervised Random Forest meta-learners that integrate three solubility predictors (DeepSoluE, SoluProt, Protein-Sol) and three thermal stability predictors (TemStaPro, ProLaTherm, BertThermo), whose outputs are combined into an Expression Efficiency Score (EES). By integrating complementary predictive signals, the meta-learning framework achieves greater accuracy and robustness than individual predictors while maintaining performance across a broad range of sequence identities. The Antigenicity Score effectively discriminates antigenic from non-antigenic proteins with a large effect size, whereas EES successfully distinguishes soluble from insoluble outcomes on an independent panel of recombinant proteins expressed in Escherichia coli. BIOINF-farma jointly assesses antigenicity and expression feasibility within a single framework. 
Its modular architecture facilitates the incorporation of future predictive methods, while its web-based interface makes the full pipeline accessible to users without programming expertise, supporting rapid candidate triage in vaccine research and emerging pathogen responses.
bioinformatics2026-06-18v1Elucidating the Design Space of Generative Models for Single-Cell Perturbation Prediction
Bhattacharya, S.; Gensbigler, C.; Karim, S.; Lees, J.Abstract
Next-token prediction has produced predictable scaling in language, but the recipe presumes a sequence of tokens with a meaningful order. Single-cell RNA-seq counts have no natural gene ordering, so applying the recipe directly to raw expression fails under an ill-suited left-to-right bias. We instead ask whether a learned latent can supply the structure the recipe needs. We introduce ExpressionVAE (eVAE), a discrete-latent perturbation model that compresses each cell into a short sequence of discrete codes through a finite-scalar-quantization (FSQ) bottleneck and trains a perturbation-conditioned discrete prior over those codes. On Replogle and Parse 1M, eVAE sets a new state of the art on every distributional metric and leads on most cell-eval perturbation metrics, with Fréchet distance and MMD^2 roughly 3 to 20× lower than the strongest continuous-latent baseline. Swapping the prior between autoregressive and masked discrete diffusion leaves performance near-identical, isolating the gain to the discrete latent itself rather than the prior family. A decoder-head ablation then exposes a single design axis, the richness of the predictive distribution at inference, that splits the standard metrics into two groups, variance-sensitive and mean-sensitive, which move in opposite directions along the axis. Finally, on a held-out CRISPRi reversion benchmark of 1,732 perturbations under inflammatory cytokine stress, the frozen eVAE encoder outperforms UMAP and differential expression and matches scGPT on perturbation ranking at a fraction of the data.
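As a rough illustration of a finite-scalar-quantization bottleneck, the sketch below bounds each latent dimension with tanh and rounds it to a small set of integer levels. It is a generic FSQ forward pass, not the eVAE code: it assumes odd level counts per dimension, omits the straight-through gradient used during training, and uses hypothetical function names.

```python
import math

def fsq_quantize(z, levels):
    """Quantize each latent value to one of `levels[i]` integer codes.

    With an odd level count L, tanh bounds the value to (-1, 1), scaling by
    (L - 1) / 2 and rounding gives codes in {-(L-1)/2, ..., (L-1)/2}.
    """
    codes = []
    for value, n_levels in zip(z, levels):
        if n_levels % 2 == 0:
            raise ValueError("this sketch only handles odd level counts")
        half = (n_levels - 1) // 2
        codes.append(int(round(math.tanh(value) * half)))
    return codes

def fsq_index(codes, levels):
    """Map a vector of per-dimension codes to a single token id (mixed radix)."""
    index, base = 0, 1
    for code, n_levels in zip(codes, levels):
        half = (n_levels - 1) // 2
        index += (code + half) * base  # shift code into 0 .. n_levels - 1
        base *= n_levels
    return index
```

For example, three dimensions with five levels each give an implicit codebook of 5 × 5 × 5 = 125 tokens, without any learned codebook vectors.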
bioinformatics2026-06-18v1Deciphering shared and divergent tissue architectures from cross-species spatial transcriptomics
Zhang, B.; Zhou, X.; Zhang, S.; Zhang, S.Abstract
The integration of spatial transcriptomics (ST) data across species is essential for cross-species and translational studies, but remains challenging due to molecular divergence and anatomical differences between organisms. We present STACAME, a graph attention autoencoder-based framework to decipher shared and divergent tissue architectures from cross-species ST data by explicitly modeling both orthologous and species-specific genes. STACAME aligns ST slices in a spatially aware manner, identifies homologous and species-specific domains, and enables a suite of downstream comparative analyses. We demonstrate its utility by integrating ST datasets from diverse tissues, including hippocampus, isocortex, embryo, breast, liver, and cerebellum, across multiple species such as human, macaque, marmoset, mouse, and zebrafish. STACAME supports cross-species spatial domain alignment, the detection of shared and divergent spatially variable genes, development alignment and comparison, and the 3D integration of tissue architecture. This flexible approach facilitates the translation of findings from model organisms to humans, providing a unified computational platform for cross-species spatial transcriptomics.
bioinformatics2026-06-18v1Benchmarking attention-based methods for vision transformers' interpretability in retinal fundus imaging
Bors, S.; Beyeler, M.; Trofimova, O.; VascX Consortium, ; Presby, D.; Bontempi, D.; Bergmann, S.Abstract
Deep learning models based on Vision Transformers (ViTs) have shown strong performance in retinal fundus imaging, but their interpretability remains poorly understood. In particular, attention-based attribution methods are widely used to explain ViT predictions, despite limited evaluation of their faithfulness and biological relevance in medical imaging. Here, we systematically benchmark four attention-based interpretability methods for RETFound, a retinal ViT-based foundation model that we previously fine-tuned to predict 17 retinal vascular phenotypes from UK Biobank fundus images. We compare raw attention, attention rollout, gradient-weighted attention rollout, and Chefer's hybrid relevance-based method using both qualitative visualisation and quantitative evaluation frameworks. To assess attribution faithfulness, we perform perturbation-based deletion and insertion experiments, quantifying changes in model predictions as highly attended image regions are progressively removed or restored. To evaluate biological specificity, we run structure-aware analyses combining attribution maps with vessel segmentation and artery-vein labels through the Relative ratio of Attention Intensity (RAI) metric. Across models, attribution maps differed substantially depending on the selected interpretability method, highlighting the need for rigorous quantitative evaluation. Among the evaluated approaches, gradient-weighted attention rollout consistently achieved the strongest perturbation performance and produced attribution maps most closely aligned with the anatomical definition of the predicted retinal traits. Furthermore, vessel-type specific models systematically concentrate attention on the corresponding vascular structures despite being trained using only a single scalar value per image as supervision. 
These findings demonstrate that attention-based attribution methods capture biologically meaningful vascular representations, while also revealing method-dependent variability in attribution behaviour. This work provides a quantitative framework for evaluating interpretability methods in medical imaging with annotated segmentation and contributes toward more transparent and biologically grounded medical AI systems.
bioinformatics2026-06-18v1ScriptManager: a platform for scalable and reproducible high-resolution analysis of genomics datasets
Lang, O. W.; Beer, B.; Zhang, D.; LeSon, C.; Deen, A.; Pugh, F.; Lai, W. K.Abstract
Background: The growing diversity of genomic and epigenomic assays has driven a parallel expansion in data formats, analysis workflows, and figure-generation tools. However, tools for analyzing data and assembling publication-quality figures are often specialized to a specific assay, dramatically limiting their interoperability and reproducibility. Results: We present the v1.0 release of ScriptManager, a Java-based framework for modular and reproducible analysis and visualization workflows of genomics and epigenomics data. Unlike existing tools specialized for individual assay types, ScriptManager provides a unified and extensible framework for cross-assay visualization and workflow reproducibility. The v1.0 release adds novel analytical modules, GUI session logging, automated unit and integration testing, tutorials, and expanded documentation. It also integrates with the broader reproducibility ecosystem through Singularity containers, Anaconda packaging, and Galaxy XML wrappers. We demonstrate ScriptManager's TagPileup scaling from local single-core execution to a 10,305-job analysis distributed across the Open Science Grid (OSG), with the full workload completing in <2 hours of wall-clock time. Conclusions: ScriptManager v1.0 enhances workflow portability, transparency, and reproducibility across a diverse range of high-resolution genomic assays. By coupling a flexible module design with modern reproducibility standards, ScriptManager provides a bridge between exploratory data analysis and formal, publication-ready figure generation. These improvements enable researchers to build, share, and reproduce genomic analyses across diverse computational infrastructures with minimal configuration.
bioinformatics2026-06-18v1Accounting for allelic diversity and multicopy gene detection improves the accuracy of antibiotic resistance genotypic determination
Garcia Gonzalez, N.; Ferragud, R.; Blane, B.; Kim, J. I.; Torok, M. E.; Harrison, E. M.; Gouliouris, T.; Coll, F.Abstract
Background: Genomic prediction of antimicrobial resistance (AMR) relies on the accurate detection of resistance genes or allelic variants of core genes from raw or assembled genome sequences. For several bacterial species and antibiotics, AMR genotype-phenotype discrepancies are common, indicating that important sources of error remain unresolved. For Enterococcus faecium, we focused on identifying the sources of discrepancies for tetracycline resistance, for which genotypic detection had shown particularly low accuracy. We investigated the effect of structural variation in antibiotic resistance genes (ARGs), including gene duplications, truncations, interruptions, and mixed configurations of complete and partial gene copies, as a source of genotype-phenotype discrepancies from short-read data. We then extended the investigation to other antibiotic families and to a second bacterial species, Escherichia coli. Methods: We analyzed collections of E. faecium and E. coli genomes, integrating high-quality complete assemblies, simulated Illumina short reads, and matched AMR phenotypic data. The integrity, copy number, and allelic diversity of ARGs were examined for multiple antibiotic classes, and their impact on ARG detection and accuracy of AMR determination was assessed using several commonly used bioinformatic tools (SRST2, ARIBA and AMRFinderPlus). Results: For E. faecium, after ruling out the effect of specific tet allelic variants on tetracycline susceptibility, we found that the integrity and copy number of tet(M) had a major effect on detection accuracy. Duplicated and incomplete ARGs are also common in E. faecium genomes, particularly for macrolides (erm(B)) and aminoglycosides (ant(6)-Ia and aph(3')-IIIa). In E. coli, similar patterns were observed for tet(A), erm(B) and aminoglycoside-associated genes (aph(3')-IIIa and ant(6)-Ia). 
Across ARGs in both species, short-read mapping methods wrongly reported interrupted genes as complete in some instances, while assembly-based methods often failed to resolve complete copies of duplicated genes. Detection accuracy improved when tools were adapted to account for gene integrity and when extended AMR databases incorporating species-specific alleles were included. Conclusions: Our findings reveal that bioinformatic limitations in dealing with ARG copy number and completeness, and in accounting for allelic variation, are a substantial source of genotype-phenotype errors, highlighting the need for improved AMR databases and bioinformatic tools that consider these factors to achieve reliable genomic prediction of AMR.
bioinformatics2026-06-18v1A high-quality, chromosome-scale genome assembly of the shade-tolerant wild rice, Oryza granulata
Zhang, F.; Yang, Y.-h.; Li, W.; Shi, C.; Zhu, X.-g.; Gao, L.-z.Abstract
Oryza granulata Nees et Arn. ex Watt, a diploid wild rice (GG genome), possesses exceptional shade tolerance and is a key genetic resource for rice improvement. However, previous genome assemblies lacked continuity and completeness. Here we present a chromosome-scale reference genome of O. granulata using PacBio SMRT (113×), Hi-C (95×), and Illumina sequencing. The final assembly is ~764.24 Mb, with a scaffold N50 of ~59.32 Mb, and ~96.47% of the sequence anchored to 12 chromosomes. BUSCO completeness is ~98.6%. We annotated ~42,064 protein-coding genes, of which ~95.39% were functionally annotated, along with ~73.46% repetitive elements. The genome assembly and raw sequencing data are available at NGDC (PRJCA061980), NGDC GSA (CRA068332), and NGDC GWH (GWHISVE00000000.1). This high-quality genome will serve as a fundamental resource for evolutionary genomics, conservation biology, and breeding of shade-tolerant rice cultivars.
bioinformatics2026-06-17v3Intrinsic dataset features drive mutational effect prediction by protein language models
Vieira, L. C.; Lin, S.; Wilke, C. O.Abstract
Protein language models (pLMs) are commonly used for predicting protein fitness landscapes, but their wide range of performance across datasets remains poorly understood. We evaluated supervised transfer learning on 41 viral and 33 cellular deep-mutational-scanning (DMS) datasets using embeddings from multiple pLMs. We observed consistently lower predictive performance on viral datasets compared to cellular datasets, independent of model architecture or transfer learning strategy. Surprisingly, a simple baseline model that predicted site mean fitness matched or outperformed supervised models on many datasets, highlighting the dominant role of site effects. Analysis of site variability using two metrics, relative variability of site means (RVSM) and fraction of highly variable sites (FHVS), revealed that patterns of fitness variation within and among sites constrain model performance and largely explain the observed differences between viral and cellular datasets. Moreover, splitting training and test data by site, rather than pooling, revealed that supervised models often rely on site effects rather than capturing broader mutational patterns. These findings highlight limitations of current pLMs for mutational effect prediction and suggest that dataset composition, rather than model architecture or training, is the primary driver of predictive success.
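The site-mean baseline mentioned in this abstract can be sketched in a few lines. This is an illustrative reading rather than the authors' code: it assumes training data arrive as `(site, fitness)` pairs, and the fallback to the global mean for unseen sites is an added assumption.

```python
from collections import defaultdict

def fit_site_means(train):
    """Compute the mean fitness per site from (site, fitness) pairs."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for site, fitness in train:
        sums[site] += fitness
        counts[site] += 1
    site_means = {site: sums[site] / counts[site] for site in sums}
    global_mean = sum(fitness for _, fitness in train) / len(train)
    return site_means, global_mean

def predict_site_mean(sites, site_means, global_mean):
    """Predict each variant's fitness as the mean of its site.

    Sites absent from training fall back to the global mean (an assumption
    of this sketch).
    """
    return [site_means.get(site, global_mean) for site in sites]
```

When test variants are split by site rather than pooled, every test site is unseen and this baseline collapses to the global mean, which is why a by-site split separates site effects from broader mutational patterns.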
bioinformatics2026-06-17v2Predicting Mouse Lifespan-Extending Chemical Compounds with Machine Learning
Belikov, A. V.; Ribeiro, C.; Farmer, C. K.; Petrascheck, M.; de Magalhaes, J. P.; Freitas, A. A.Abstract
Pharmacological interventions targeting the biological processes of ageing hold significant potential to extend healthspan and promote longevity. This, to our knowledge, is the first study that uses Machine Learning models trained specifically on mouse lifespan data (from DrugAge) to predict lifespan-extending compounds. The use of mammalian data significantly elevates translational relevance compared to previously available models trained predominantly on C. elegans data. Our most successful Random Forest classifiers were trained on direct drug-target annotations, including Gene Ontology, UniProt Keywords, pathways (KEGG, Reactome, Wiki) and protein domains (InterPro), whereas models trained on gene expression (LINCS) and chemical substructures (PubChem) underperformed. Models trained on male datasets performed better than those trained on mixed-sex and female datasets, with the latter suffering from severe class imbalance due to far fewer positive-class instances. Notably, features related to G-protein coupled receptors, especially receptors for neurotransmitters, metabolic hormones and sex hormones, were identified as strong predictors of lifespan extension. We used ensemble classifiers composed of top models to screen compounds from DrugBank, highlighting novel candidates for longevity studies. Major clusters of compounds with the highest predicted longevity-promoting effects target IGF1 and insulin receptors, beta adrenergic receptors, carbonic anhydrases, dopamine and serotonin receptors, voltage-gated potassium and calcium channels, sodium-dependent dopamine, serotonin and noradrenaline transporters, muscarinic acetylcholine receptors and adenosine receptors. We tested 22 predicted compounds in C. elegans and found that 6 of them significantly extended median lifespan: dihydroergotamine, mianserin, bromocriptine, voxtalisib, BMS-754807 and solifenacin. 
We have also created a public web server with our top-performing classifier ensembles (https://www.cs.kent.ac.uk/projects/lodprime/). Our study not only provides an important contribution to the longevity pharmacology field but also informs research on the fundamental mechanisms of ageing.
bioinformatics2026-06-17v2Evaluating FoldX5.1 for MAVISp Stability Data Collection
Vliora, A.; Tiberti, M.; Papaleo, E.Abstract
MAVISp (Multi-layered Assessment of VarIants by Structure for proteins) is a structure-based framework for facilitating mechanistic interpretation of missense variants, with protein stability as one of its core analytical layers. When software tools are updated, a key consideration for database curation is whether the new version can be adopted without compromising compatibility with existing entries. This study evaluated the effect of replacing FoldX5 with FoldX5.1 on the results of the MAVISp stability workflow. We compared predicted changes in folding free energy for 539,809 shared variants across 119 proteins. We found high overall agreement with a mean Pearson correlation of 0.933 and a mean Cohen coefficient of 0.814. Most proteins showed strong concordance. The number of disagreements was higher at sites with low AlphaFold2 confidence. These outliers did not display systematic inter-version bias, as mean shifts in folding free energies between versions were minimal. Collectively, these findings support adopting FoldX5.1 for future MAVISp data collection. We will include a transition period, during which existing entries retain FoldX5 annotations until their scheduled annual update, while new or updated entries are processed with FoldX5.1. To facilitate this transition, the FoldX software version has been added as a new metadata annotation in the MAVISp database.
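For readers unfamiliar with chance-corrected agreement, the sketch below computes Cohen's kappa between two label sequences. It assumes that the "Cohen coefficient" in this abstract refers to Cohen's kappa on discretised stability classes. The class labels in the usage note are hypothetical, and this is not the MAVISp workflow itself.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two label sequences beyond chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if the two labelings were independent.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both labelings use a single identical class
    return (observed - expected) / (1.0 - expected)
```

For instance, `cohen_kappa(["destabilizing", "destabilizing", "neutral", "neutral"], ["destabilizing", "neutral", "neutral", "neutral"])` gives 0.5: observed agreement is 0.75 against 0.5 expected by chance.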
bioinformatics2026-06-17v2MetaHarmonizer: robust biomedical metadata harmonization and a contamination control for inflated LLM performance on public benchmarks
Li, C.; Dahl, A.; Gravel-Pucillo, K. D.; Long, K.; Waters, M.; de Bruijin, I.; Davis, S.; Oh, S.Abstract
Public biomedical repositories hold substantial reuse potential, but inconsistent metadata routinely blocks integration across studies. Recent LLM-based harmonization approaches address scale but suffer from non-determinism, hallucinated ontology terms, and, in their highest-accuracy configurations, dependence on proprietary APIs or labeled fine-tuning data. A more fundamental concern is that LLM accuracies on widely used public benchmarks may substantially inflate transferable capability: under a contamination-controlled evaluation protocol we developed, the apparent LLM-only advantage on the GDC schema-mapping benchmark is inverted, and three out of five LLMs recover 80-100% of GDC identifiers from zero-schema context, suggesting direct memorization. Building on this insight, we present MetaHarmonizer, an automated metadata harmonization system designed to be robust by construction: SchemaMapper aligns attribute names across schemas, and OntologyMapper standardizes values to controlled vocabularies. Both modules implement a multi-stage cascade that escalates to more resource-intensive methods only when earlier stages fall short, with all candidates grounded in pre-defined controlled vocabularies to preclude hallucinated outputs and LLMs used only as bounded preprocessing components rather than inference-time dependencies. On the GDC schema-matching benchmark, SchemaMapper with the deployment-optimized LLM-generated alias dictionary achieved 71.6% Top-1 accuracy and higher Recall@GT than the Magneto bipartite variants, recovering significantly more ground-truth mappings. With the best-performing alias dictionary, it reached the highest Top-1, Top-5, and Recall@GT and matched the best Magneto reranker (a fine-tuned LLM reranker) on MRR. It also outperformed LLM-only approaches under contamination-controlled conditions. 
On four EFO benchmarks, OntologyMapper achieved 77.9-95.5% Top-1 accuracy, outperforming text2term by up to 16.4 pp and direct LLM inference (against the smaller corpus) by 19.2 pp, because memorization is not a viable shortcut for this task. Across both modules, calibrated confidence scores separate correct from incorrect predictions (AUC 0.73-0.94), enabling principled human-in-the-loop triage. Inference is fully local, deterministic, and computationally efficient: seconds for schema mapping and under a minute for ontology mapping of up to ~7,000 terms against the pre-indexed 33,230-term corpus. Released as a Python package with a domain-agnostic architecture, MetaHarmonizer provides a scalable foundation for improving the FAIRness of biomedical data and enabling cross-study integration, alongside an evaluation methodology applicable to any LLM-augmented bioinformatics method evaluated on public benchmarks.
bioinformatics2026-06-17v1Correcting spatial transcriptomics data affected by a prevalent transcript leakage problem across platforms, species, and tissues
Shi, C. H.; Zhai, Y.; Chow, S. H.-C.; Li, L.; Carver, C. M.; Teneche, M. G.; Flores, J.; Kern, C.; Adams, P. D.; Ren, B.; Schafer, M. J.; Zhu, Q.; Wei, Y.; Yip, K. Y.Abstract
Spatial transcriptomics has been widely applied to study the spatial distribution of cell types, cell states, and specific gene expression in tissue samples. However, we show that there is a prevalent transcript leakage problem in spatial transcriptomics data, where transcripts expressed by a cell diffuse to its neighborhood and are recurrently detected in the nearby cells. By analyzing published data sets, we show that this problem is general across data produced from different tissues and different species using different imaging-based and sequencing-based spatial transcriptomics platforms. It affects both upstream tasks such as expression quantification as well as downstream tasks such as cell-type annotation and detection of spatially-dependent gene expression. To tackle the transcript leakage problem, we propose a reference-free Bayesian model-based method, DeLeakage, which cleans up the data much more effectively than existing denoising methods. DeLeakage also improves cell-type annotation and avoids false detection of spatially dependent expression.
bioinformatics2026-06-17v1Posterior-calibrated multimodal motor states reveal longitudinal and imaging-associated heterogeneity in Parkinson's disease
Tirhekar, H. M.; Yadav, P.; Bajaj, C. l.Abstract
Parkinson's disease (PD) motor heterogeneity is commonly summarized by hard subtype labels, although clinical states vary longitudinally, severity can dominate unsupervised structure, and model uncertainty is rarely calibrated. We developed a posterior and refit-stability calibrated multimodal motor state framework that assigns probabilistic MDS-UPDRS-III motor states, aggregates them at the patient level, separates global burden from residual tremor-axial profile, and tests whether imaging can recover the resulting posterior distribution. In 29,366 aligned PPMI motor-posterior visits spanning 4,773 participant identifiers, patient-level state families were stable on average (modal-family fraction 0.925; 95% CI 0.921-0.930), but 25.5% of patients transitioned state over follow-up (95% CI 24.1-26.7%). PD-only cohort definitions produced smaller denominators and are reported as sensitivity cohorts with rerun calibration and imaging-posterior checks. Severity and covariates explained substantial motor-domain variance, especially bradykinesia (R^2=0.850), but residual profile modeling retained five active components across total-severity, principal-component, leave-one-domain, non-target-burden, and clinical-only severity axes. Refit-stability calibration with 250 patient-blocked bootstrap refits showed high nominal posterior confidence (0.989) but lower empirical label consistency (0.849), quantifying overconfidence rather than hiding it. Patient-held-out temporal modeling predicted future axial burden (best XGBoost R^2=0.605) and future state transition (XGBoost AUC=0.830; 95% CI 0.822-0.837). DaTSCAN plus FreeSurfer ROI features predicted patient-level soft motor posterior vectors (RF JSD=0.209; 95% CI 0.199-0.220; macro-AUROC=0.692), while severity/demographic-adjusted imaging features further improved soft posterior recovery (JSD=0.188). 
BioFIND transfer reproduced clinically meaningful endpoint gradients after state assignment in 225 external patients, supporting external face validity rather than definitive transportability. These results support PD motor phenotypic states as calibrated, dynamic, clinically interpretable profiles with convergent imaging associations, not as definitive biological subtypes.
bioinformatics2026-06-17v1Beyond phylogeny: Genome-wide DNA sequence patterns suggest DNA physical properties associated with thermal adaptation in extremophile microbes
Safari, M.; Kari, L.Abstract
Temperature is a fundamental constraint on biological systems, yet how it is reflected in genome sequence organization remains unclear. Here, we show that genome-wide distributions of short DNA sequences contain a robust signal of thermal adaptation that is largely independent of phylogeny. Using Structural Topic Modelling (STM), a machine-learning approach for identifying groups of co-occurring sequence motifs, we analyze canonical 6-mer and 9-mer frequency profiles of bacterial and archaeal genome proxies (randomly sampled genomic regions) and identify motif families systematically associated with thermophiles and psychrophiles. In bacterial thermophiles, the identified motif families are dominated by highly specific, overrepresented and co-occurring C- and G-stacked hexamers, and a distinct family of CG-periodic hexamers recurring across multiple temperature comparisons. In contrast, bacterial psychrophile-associated motifs are dominated by low-complexity A-, T-, and AT-run hexamers. Thermophilic archaea generally exhibit a distinct CTAG-centred hexamer family, suggesting that different domains may adapt to similar environmental constraints through different sequence-level solutions. However, this domain-level contrast is not absolute: in a targeted analysis of two thermophilic bacterium-archaeon pairs, we find unusually similar frequencies of all the STM-identified thermophile-associated hexamer families, suggesting that shared high-temperature environments can, in specific cases, partially override phylogenetic divergence. Notably, the identified motif families constitute only a small and highly selective subset of the vast space of possible G+C-rich or A+T-rich sequences. This indicates that thermal adaptation is associated with specific sequence architectures rather than broad shifts in nucleotide composition. 
Accordingly, the observed signal cannot be explained by overall base composition alone, but instead arises from structured combinations and positional arrangements of nucleotides within short sequence contexts. Related motif families are recovered at both k=6 and k=9, indicating that the signal reflects systematic shifts in genome-wide sequence organization rather than isolated sequence motifs. These patterns are consistent with known sequence-dependent DNA physical properties documented in biochemical and biophysical studies, including differences in base-stacking interactions and conformational flexibility. Together, our results suggest that genome-wide sequence organization reflects sequence-dependent DNA physical properties associated with thermal adaptation, revealing a previously underappreciated physical layer of genomic information beyond phylogenetic history.
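A canonical k-mer frequency profile of the kind this abstract analyzes can be computed as sketched below. This is a generic illustration, not the authors' pipeline: it treats a k-mer and its reverse complement as one feature by keeping the lexicographically smaller of the two, and it skips windows containing ambiguous bases, which is an assumption of this sketch.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)

def canonical_kmer_frequencies(seq, k):
    """Relative frequencies of canonical k-mers in a DNA sequence."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Skip windows with ambiguous bases such as N (assumption of this sketch).
        if set(kmer) <= set("ACGT"):
            counts[canonical(kmer)] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: count / total for kmer, count in counts.items()}
```

Calling `canonical_kmer_frequencies(region, 6)` on each sampled genomic region would give one hexamer profile per region, which could then be passed to a topic model.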
bioinformatics2026-06-17v1In silico characterization of lysis and host-recognition modules in Staphylococcus aureus bacteriophage genomes
Hasugian, I. A.; Alifiyah, N. I.Abstract
Background/aim: Antimicrobial resistance in methicillin-resistant Staphylococcus aureus (MRSA) requires precision non-antibiotic therapeutics, yet phage lytic efficacy is poorly predicted by phenotypic assays, as shown by paradoxical biofilm responses. This study characterized the genomic architecture of lytic S. aureus bacteriophages, focusing on the conservation of the lysis module and the variability of host-recognition modules, to provide a rational basis for phage candidate selection. Materials and methods: Twenty-two complete S. aureus phage genomes were retrieved from NCBI GenBank. Genomic features were extracted with custom Biopython scripts. Lysis (endolysin, holin) and host-recognition (tail fiber/receptor-binding protein) modules were annotated and validated by InterPro domain analysis, with disrupted endolysins resolved by tBLASTn. Phylogeny was reconstructed from large terminase subunit (TerL) sequences using maximum likelihood. Results: Genome size spanned three classes, from 17.5 to 148.6 kb. The LysK-type endolysin (CHAP, Amidase, SH3b) was highly conserved, whereas tail fiber/RBP genes were detected in only 14 of 22 phages. Domain analysis reclassified two proteins annotated as endolysins as virion-associated peptidoglycan hydrolases, and identified two independent mechanisms, HNH endonuclease insertion and intron splitting, that interrupt lysis-module genes and confound automated annotation. Maximum likelihood analysis recovered a strongly supported, highly conserved core clade with EW and SA13 as divergent lineages. Conclusion: Lysis modules are conserved whereas host-recognition modules are variable, indicating that host recognition rather than the lytic enzyme is the principal determinant of host range and the more rational target for phage selection and engineering.
bioinformatics2026-06-17v1An Integrated Framework for Transcriptomic Characterization and Lorentzian Hyperbolic Visualization of a High-Risk Topological Branch in Alzheimer's Disease
Zeng, C.; Pu, Z.; Tao, Y.; Wei, W.; Zhao, J.; Cai, M.; Ge, S.Abstract
Alzheimer's disease (AD) is a highly heterogeneous brain disorder in which molecular alterations vary across brain regions, disease stages, and patient subgroups. This study introduces an integrated analytical framework for characterizing transcriptomic variation associated with a high-risk topological branch, which was identified based on Lorentz distance in postmortem Brodmann area 36 samples from the Mount Sinai Brain Bank cohort, where over 70% of samples were in Braak stages V-VI. The framework integrates weighted gene co-expression network analysis, repeated stability-based differential expression analysis, network-level gene filtering, Gene Ontology enrichment, and nested stratified cross-validation to evaluate whether topological branch-associated genes capture biologically meaningful signals and carry predictive information for high-Braak group status. The identified gene sets were functionally enriched for neuronal development, neuron projection organization, synaptic signaling, vesicle fusion, and regulated synaptic release, suggesting that the high-risk topological branch reflects biologically relevant transcriptomic programs linked to neurodegenerative progression. Nested cross-validation further showed that the selected genes achieved measurable internal predictive performance for distinguishing high-Braak samples. As a second methodological contribution, we introduced a Lorentzian hyperbolic variant of t-distributed stochastic neighbor embedding (Lorentz t-SNE) to explore latent non-Euclidean structure in transcriptomic data. This method embeds samples in hyperbolic space, providing an alternative to Euclidean embeddings for representing hierarchical or nonlinear structures. Compared with conventional Euclidean embeddings, the proposed Lorentz t-SNE revealed a more localized organization of high-Braak samples. 
Together, these results demonstrate the utility of the proposed analytical framework and Lorentz t-SNE for investigating heterogeneous, potentially non-Euclidean organization in AD transcriptomes.
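The abstract does not state how the Lorentz distance is computed. As a minimal sketch, assuming the standard unit-hyperboloid (Lorentz) model of hyperbolic space, the distance between two points is the inverse hyperbolic cosine of their negated Minkowski inner product. The function names below are illustrative and are not taken from the paper.

```python
import math

def lift_to_hyperboloid(v):
    """Map a Euclidean vector v onto the unit hyperboloid by setting x0 = sqrt(1 + |v|^2)."""
    x0 = math.sqrt(1.0 + sum(c * c for c in v))
    return [x0] + list(v)

def lorentz_inner(x, y):
    """Minkowski inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def lorentz_distance(x, y):
    """Geodesic distance on the hyperboloid: arccosh(-<x, y>_L)."""
    # Clamp to 1.0 so rounding cannot push the argument outside the domain of acosh.
    return math.acosh(max(1.0, -lorentz_inner(x, y)))

# One-dimensional check: points at hyperbolic coordinates 0.5 and 2.0 are 1.5 apart.
a = lift_to_hyperboloid([math.sinh(0.5)])
b = lift_to_hyperboloid([math.sinh(2.0)])
d_ab = lorentz_distance(a, b)
```

A hyperbolic t-SNE variant would use a distance of this kind in place of the Euclidean distance when computing pairwise similarities in the embedding; how the authors do so is not described in the abstract.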
bioinformatics2026-06-17v1VLab4Mic: prediction of structural resolvability in super-resolution microscopy
Martinez, D.; Saraiva, B. M.; Shakespeare, T.; Bates, M.; Owen, D. M.; Leterrier, C.; Del Rosario, M.; Henriques, R.Abstract
Determining whether a microscopy experiment can resolve a specific feature of a protein assembly remains difficult because researchers must balance imaging modality, labelling strategy, and probe choice. We present VLab4Mic, a simulation platform that predicts structural resolvability before experiments. Starting from atomic models from the PDB or AlphaFold predictions, VLab4Mic places antibodies, nanobodies, chemical linkers, or fluorescent proteins on epitopes, applies stochastic labelling and steric constraints, and generates virtual samples for widefield, confocal, AiryScan, Stimulated Emission Depletion (STED), and Single-Molecule Localisation Microscopy (SMLM). Comparisons with nuclear pore complex data show realistic agreement across modalities. Case studies show that HIV capsid appearance depends strongly on orientation, and that STED and SMLM distinguish domed from flat clathrin lattices, whereas confocal and AiryScan struggle. VLab4Mic thereby helps researchers predict which biological questions are experimentally tractable with a given imaging configuration before spending time fine-tuning imaging parameters at the microscope.
bioinformatics2026-06-16v3Sparse Autoencoders Reveal Interpretable Features in Single-Cell Foundation Models
Pedrocchi, F.; Barkmann, F.; Joudaki, A.; Boeva, V.Abstract
Single-cell foundation models (scFMs) hold promise for applications in cell type annotation, data integration, and prediction of the effects of cell perturbations, but their internal mechanisms remain poorly understood. We investigate the structure of these models by training sparse autoencoders (SAEs) on the hidden representations of three widely used scFMs: scGPT, scFoundation, and Geneformer. The learned features reveal diverse and complex biological and technical signals, which emerge even in pre-trained models. We also observe that the encoding of this information differs between scFMs with distinct training protocols and architectures. Finally, we demonstrate that SAE-derived features are functionally related to model behavior and can be intervened upon. Suppressing batch-associated features reduces unwanted technical variation and improves data integration while preserving the core biological signal. Activating drug-encoding features steers control cells toward drug-perturbed states in a concentration-dependent manner. These findings provide a path toward more interpretable and controllable single-cell foundation models.
bioinformatics2026-06-16v3Robust integration of weakly anchored spatial multi-omics
Wang, C.; Liu, Y.; Wang, Z.; Sun, P.; Li, Z.; Li, J.; Wang, X.; Chen, K.; Zou, Q.; Daoliang, Z.; Hu, Z.; Du, Y.; Qian, B.; Feng, X.; Yuan, Z.; Guan, R.Abstract
Spatial multi-omics holds great promise for dissecting complex biological processes, though inherent technical constraints continue to limit its widespread adoption. Currently, most studies therefore measure distinct omics features on separate tissue sections, necessitating spatial diagonal integration. An emerging practical solution is to leverage hematoxylin and eosin (H&E) images as an integration anchor, given their ubiquity, low cost, and compatibility across tissue preparations. However, this anchor is frequently compromised in real-world settings by variations in H&E staining style, absence of reliable histological landmarks, and mismatches in spatial resolutions across omics modalities. To address this, we introduce SpaWeaver, a computational framework that couples a pathology foundation model with a graph Transformer and a latent feature aligner module, providing a highly robust solution for weakly anchored spatial omics data diagonal integration. Extensive experiments demonstrate that SpaWeaver exhibits superior robustness against isolated or synergistic weak-anchoring factors. The spatial multi-omics profiles generated by SpaWeaver link molecular features originally separated on two sections, unlocking diverse downstream analyses once exclusive to co-assayed spatial multi-omics data, including niche-aware cell-cell communication inference and multi-omics resolved cell state. In this study, it unveils tumor-distance-dependent fibroblast-CD4+ T-cell signaling in human colon adenocarcinoma and identifies a hypoxic glycolytic tumor state with pyknotic nuclei in human ovarian cancer. Overall, our approach bridges readily accessible single-omics measurements across weakly anchored tissue sections, enabling unified spatial multi-omics characterization and system-level tissue analysis.
bioinformatics2026-06-16v3RareFold: Structure prediction and design of proteins with noncanonical amino acids
Li, Q.; Daumiller, D.; Zuo, F.; Marcotte, H.; Pan-Hammarstrom, Q.; Bryant, P.Abstract
Protein structure prediction and design have traditionally been confined to the 20 canonical amino acids. Expanding this chemical space to include non-canonical amino acids (ncAAs) is essential for engineering proteins with novel chemical and functional properties. However, existing methods are not designed to generalise across chemically diverse residue types. Here, we present RareFold, a deep learning architecture for structure prediction and design of proteins containing the 20 canonical amino acids and 29 ncAAs. By representing each residue as an independent token, RareFold learns context-dependent atomic interaction patterns across chemically diverse sequence spaces, enabling modelling of non-standard chemistries within a unified framework. We apply this capability in EvoBindRare, a generative framework for de novo design of linear and cyclic peptide binders with an efficient implementation that substantially reduces computational requirements compared to existing architectures. We demonstrate its performance by designing binders against Ribonuclease A, yielding novel linear and cyclic peptides incorporating ncAAs within predicted interfaces with low-micromolar affinities (KD ~2-9 µM), comparable to the native ligand (KD ~2 µM). Hydrogen-deuterium exchange mass spectrometry confirms that the designed peptides engage the target at regions consistent with predicted binding interfaces. In addition, immunogenicity profiling in human-derived organoid models shows no detectable immune activation. By extending deep learning-based protein design to non-canonical chemical spaces, RareFold enables programmable access to expanded amino acid alphabets and broadens the scope of de novo protein engineering.
bioinformatics2026-06-16v3Prediction and analysis of new HisKA-like domains
Silly, L.; Perriere, G.; Ortet, P.Abstract
Histidine kinases (HKs) take part in many signaling pathways as components of two-component systems (TCS). Through autophosphorylation and phosphotransfer to a response regulator (RR), they enable organisms to adapt to their environment. Most HKs are transmembrane proteins with a sensing domain outside the cell and two catalytic domains called HisKA and HATPase. HATPase is required for interaction with ATP, and HisKA contains the phosphorylated histidine residue. HKs are involved in various environmental adaptation mechanisms, such as responses to light or to biochemical changes. Studying their diversity is therefore important to better understand how cells interact with their environment. There exist incomplete HKs (iHKs) lacking either the HisKA or the HATPase domain. Some iHKs with an HATPase domain possess a section of their sequence where a HisKA domain could be expected. These iHKs may include "true" HKs, with an unknown HisKA domain, that could fill gaps in various signaling pathways. In this study we analyzed 869,964 sequences of iHKs having an HATPase domain but lacking a HisKA domain. We identified 18 HisKA-like profiles and carried out multiple meta-studies to assess their HisKA-like characteristics. We found that their 3D structures matched the structure of known HisKA domains. We saw that the genomic context of the genes associated with these profiles contained genes implicated in signal transduction pathways. We cross-validated some of our profiles with curated annotations, as well as with a "negative dataset" made of non-HK proteins. We believe that our work could help improve the annotation of regulation pathways in prokaryotes.
bioinformatics2026-06-16v2Identifying Modulators of Cellular Responses by Heterogeneity-sequencing
Berg, K.; Sakellaridi, L.; Rummel, T.; Hennig, T.; Whisnant, A.; Lodha, M.; Krammer, T.; Toussaint, C.; Szymanska-De Wijs, K.; Zheng, Y.; Prusty, B. K.; Doelken, L.; Saliba, A.-E.; Erhard, F.Abstract
The response of individual cells to drug treatment, virus infections or other molecular stimuli is highly heterogeneous and depends on the cell's initial state. Library preparation for single-cell transcriptomics is destructive, precluding a direct comparison between the initial state and the stimulus outcome. Consequently, current methods are restricted to identifying correlative associations rather than resolving causal drivers of heterogeneous outcomes. We developed Heterogeneity-seq, which combines single-cell RNA-seq with metabolic RNA labeling (scSLAM-seq) and double machine learning to overcome this limitation. By leveraging simultaneous measurements of unlabeled and labeled RNA in individual cells, Heterogeneity-seq uncovers the transition from pre-stimulated cell states to distinct stimulation outcomes across thousands of cells. These links enable the identification of factors that causally govern heterogeneous cellular responses. We used Heterogeneity-seq to identify both known and novel genes that drive responses to drug treatment, as well as pro- and antiviral host factors governing cytomegalovirus infection.
bioinformatics2026-06-16v2Infectious Disease Forecasting via Physics-Informed Machine Learning
Hart, J. C.; Smith, H.; McMahan, C.; Rennert, L.Abstract
Infectious disease transmission evolves as a dynamic process shaped by biological mechanisms, population behavior, and intervention policies, yet public health responses are often driven by lagging indicators. Accurate short- and long-term disease forecasting is essential for the timely deployment of intervention strategies, healthcare capacity planning, and uncertainty-aware, risk-informed decision-making. To address this challenge, three broad classes of forecasting models have traditionally been used: statistical, machine learning, and mechanistic approaches. However, each of these modeling paradigms faces fundamental limitations. In particular, traditional statistical models often lack the flexibility needed to capture complex disease dynamics, machine learning approaches require large, high-quality data streams, and mechanistic models are notoriously difficult to calibrate. To overcome these challenges, we propose a novel physics-informed machine learning (PIML) framework for forecasting infectious disease dynamics. Our approach simultaneously forecasts new case and hospitalization counts, along with other key epidemiological quantities such as the time-varying reproduction number. This is achieved through the design of a machine learning model and estimation strategy regularized by a system of differential equations that encode disease dynamics of the SIHR model, thereby bridging the gap between purely data-driven and mechanistic models. We demonstrate the proposed methodology through in-depth numerical studies and an application to COVID-19 data collected in the state of South Carolina.
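The regularization idea can be sketched as follows. The abstract does not give the SIHR equations, so the compartment flows and the parameter names (beta, gamma, eta, rho) below are assumptions based on a common susceptible-infectious-hospitalized-recovered formulation, and the residual function is an illustrative stand-in for a physics-informed penalty rather than the authors' estimation strategy.

```python
def sihr_rhs(state, beta, gamma, eta, rho):
    """Right-hand side of an assumed SIHR system (flows are hypothetical)."""
    S, I, H, R = state
    N = S + I + H + R
    new_infections = beta * S * I / N
    dS = -new_infections
    dI = new_infections - (gamma + eta) * I   # recover directly or get hospitalized
    dH = eta * I - rho * H                    # hospital admissions minus discharges
    dR = gamma * I + rho * H
    return (dS, dI, dH, dR)

def simulate(state, params, dt, steps):
    """Forward-Euler integration; returns the list of visited states."""
    trajectory = [state]
    for _ in range(steps):
        slopes = sihr_rhs(state, *params)
        state = tuple(x + dt * dx for x, dx in zip(state, slopes))
        trajectory.append(state)
    return trajectory

def physics_residual(trajectory, params, dt):
    """Mean squared gap between finite-difference slopes and the ODE right-hand side."""
    total, count = 0.0, 0
    for prev, nxt in zip(trajectory, trajectory[1:]):
        rhs = sihr_rhs(prev, *params)
        for x0, x1, f in zip(prev, nxt, rhs):
            total += ((x1 - x0) / dt - f) ** 2
            count += 1
    return total / count

params = (0.3, 0.1, 0.02, 0.1)  # illustrative values, not fitted to any data
trajectory = simulate((990.0, 10.0, 0.0, 0.0), params, dt=0.1, steps=500)
```

In a physics-informed setup, a term like `physics_residual` evaluated on the model's predicted trajectory would be added to the data-fitting loss, so that forecasts are pulled toward curves consistent with the differential equations.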
bioinformatics2026-06-16v1cuBayes: GPU accelerated FreeBayes that achieves 1-minute whole-genome SNV calling while maintaining algorithmic semantics
Pitman, A.; Yang, C.; Qiao, Y.Abstract
Next-generation sequencing now produces whole-genome data in hours, but downstream variant calling remains a multi-hour to multi-day bottleneck that excludes genomic analysis from time-critical clinical settings. GPU acceleration offers a natural path forward -- variant calling is inherently parallelizable across genomic positions -- yet open-source infrastructure for porting existing algorithms to GPU hardware remains limited, leaving many widely-used tools without accelerated implementations. FreeBayes, a haplotype-based variant caller central to the 1000 Genomes Project and to multi-sample tumor evolution analyses, exemplifies this gap: it is natively single-threaded despite its algorithmic suitability for parallelization. We present cuBayes, a CUDA implementation of FreeBayes germline SNV calling that completes HG002 and HG004 2x250bp Illumina 60x whole-genome analysis in one minute (as opposed to hours if not days with manual region-based CPU parallelization) on a single NVIDIA RTX 6000 Ada GPU, while producing variant calls with >99.9% concordance to the CPU reference. cuBayes is structured around an atom/molecule architecture in which reusable functional units (BAM decompression, position-wise pileup, batch coordination) are cleanly separated from algorithm-specific logic, providing a foundation intended to support acceleration of additional sequence analysis algorithms without redundant low-level engineering.
bioinformatics2026-06-16v1THEOBROMA: an aggregated open database of 1.13 million natural products with per-compound license auditing, three-tier classification, and stereochemistry-aware deduplication
Klamt, T.; Jaczkowski, A.; Franke, J.; Nejdl, W.Abstract
Natural products remain one of the most productive sources of pharmacologically active compounds for drug discovery, yet the current open aggregator landscape attributes licenses at database rather than compound granularity, with consequences that have become tangible as the field grows. A recent relicensing event in one constituent source (the September 2024 transition of the Natural Products Atlas to CC BY-NC 4.0) demonstrates how database-level licensing propagates across an aggregate and motivates the per-compound audit framework presented here. The same peer cohort separately leaves classification provenance and stereoisomer-family relations coarser than either layer warrants. THEOBROMA, accessible at https://theobroma.l3s.uni-hannover.de, integrates 1,133,004 natural products from 29 open sources under a per-compound license audit that resolves each compound's license tier across all attesting sources under a most-restrictive-wins rule, identifying 900,170 compounds (79.4%) under open-use licenses and exposing the per-source attestation chain and resolved tier through a dedicated audit endpoint and a query-time license filter. A three-tier classification stratifies 89.3% coverage into 35.1% curated, 43.9% high-confidence inferred, and 10.3% exploratory tiers, with 486,215 stereoisomer families preserved by full 27-character InChIKey deduplication and exposed via a dedicated /api/stereoisomers/<comp_id> endpoint and a radial-family display. Per-compound license provenance is the primary differentiator. Classification stratification and stereoisomer-family exposure add finer-grained access to two related axes, supporting license-compatible virtual screening and isomer-specific bioactivity analysis at corpus scale. As an evolving open resource, THEOBROMA pairs continuous pipeline maintenance with interactive geographic, taxonomic, and chemical-space exploration.
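The two mechanisms named in the abstract, most-restrictive-wins license resolution and deduplication on the full 27-character InChIKey, can be illustrated with a short sketch. The tier names and their ordering are hypothetical, the keys are synthetic placeholders rather than real compounds, and none of this is the resource's actual code. The grouping relies only on the standard InChIKey layout, in which the first 14-character block encodes connectivity, so stereoisomers share it while differing in the second block.

```python
from collections import defaultdict

# Hypothetical tiers, ordered from least to most restrictive (an assumption).
TIER_RANK = {"open": 0, "non-commercial": 1, "restricted": 2}

def resolve_license(attested_tiers):
    """Most-restrictive-wins: return the strictest tier among all attesting sources."""
    return max(attested_tiers, key=lambda tier: TIER_RANK[tier])

def stereo_families(inchikeys):
    """Deduplicate on the full key, then group by the 14-character connectivity block."""
    families = defaultdict(set)
    for key in inchikeys:
        families[key.split("-")[0]].add(key)  # set membership drops exact duplicates
    return dict(families)

# Synthetic keys with the 14-10-1 block layout of a real InChIKey.
keys = [
    "AAAAAAAAAAAAAA-BBBBBBBBSA-N",  # stereoisomer 1 of skeleton A
    "AAAAAAAAAAAAAA-CCCCCCCCSA-N",  # stereoisomer 2 of skeleton A
    "AAAAAAAAAAAAAA-BBBBBBBBSA-N",  # exact duplicate from a second source
    "DDDDDDDDDDDDDD-EEEEEEEESA-N",  # unrelated compound
]
families = stereo_families(keys)
tier = resolve_license(["open", "non-commercial", "open"])
```

Deduplicating on only the first block would collapse the two stereoisomers of skeleton A into one record, which is why keeping the full key preserves stereoisomer families.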
bioinformatics2026-06-16v1OmicOS: A Comprehensive Omics Ecosystem Infrastructure and Agent System for the AI Era
Zeng, Z.; Meng, X.; Hu, L.; Li, C.; Liu, P.; Shi, Y.; Ma, X.; Gao, L.; Wang, X.; Luo, Z.; Zheng, Y.; Xian, J.; Lin, Z.; Zhu, H.; Jiang, Z.; Mao, S.; Lu, Y.; Tang, W.; Peng, Q.; Ma, Y.; Zhou, L.; Xing, C.; Zhang, X.; Xiong, Y.; Du, H.Abstract
Biology has accumulated a vast ecosystem of omics methods, but much of this ecosystem remains built for expert humans rather than scientific agents. Methods are scattered across Python packages, R/Bioconductor and CRAN workflows, command-line tools, incompatible data containers and implicit object states, making even routine analyses difficult for an AI system to choose, execute and verify reliably. Here we introduce OmicOS, a comprehensive omics ecosystem infrastructure and agent system that turns OmicVerse V2, an open-source omics community, into an executable foundation for agentic biology. OmicVerse V2 provides the community substrate: scalable AnnDataOOM-compatible rust backends, agent-friendly Python algorithms for single-cell, spatial, bulk and multi-omics analysis, interfaces to single-cell foundation models, and Python-native reconstructions of historically R-centred Bioconductor/CRAN-style workflows. OmicOS makes this substrate actionable by registering analytical functions as state-aware capability contracts, allowing agents to inspect live data objects, select valid methods, execute controlled workflows and record provenance. The result is not a fixed pipeline, but a programmable omics environment in which agents compose real analyses from verified community methods rather than inventing tools. Across external and purpose-built benchmarks, OmicOS ranked first among the evaluated systems, reaching 81.2% on BiomniBench. Adding OmicVerse to a minimal agent improved task completion by up to 34.2 percentage points with qwen-3.6-35b, and controlled ablations showed that the gains came from registry-grounded execution rather than from larger models, documentation retrieval or unrestricted tool exposure. The same infrastructure scaled to atlas-sized data, reproduced R-centred workflows in Python and converted external pathology software into agent-usable skills. 
In a discovery task starting from a whole-body spatial map and the term Alzheimer disease, OmicOS composed a non-canonical workflow that integrated spatial expression, genetic association, eQTL and colocalization evidence to nominate a colon epithelial risk axis centred on PICALM, CD2AP and CR1. Together, OmicVerse and OmicOS define an open foundation for AI-era omics, showing how a community of biological methods can be transformed into a reliable, extensible and agent-operable system for discovery.
bioinformatics2026-06-16v1MetaPilot: genome-aware adaptive search-space refinement for unified DDA and DIA metaproteomics
Cheng, K.; Figeys, D.Abstract
Metaproteomic peptide identification is constrained by the structure and size of the protein search space. Pooled gene catalogues provide coverage but obscure genome-level evidence, and current workflows for data-dependent (DDA) and data-independent (DIA) acquisition diverge in their database strategies. We present MetaPilot, a genome-aware workflow that uses conserved marker-protein evidence to rank candidate genomes from MGnify catalogues and construct adaptive, sample-specific search spaces. Applied to paired DDA/DIA datasets of defined mixtures and fecal samples, MetaPilot adapted genome selection to community complexity and reproduced published peptide evidence while expanding the detectable peptide space. In DDA-independent reanalysis of Orbitrap human gut DIA data, MetaPilot identified 24.4% more peptides than the published DDA-derived library and 2.06-fold more than the matched DDA-assisted DIA search. On timsTOF DIA-PASEF mouse intestinal data, it outperformed uMetaP by 41.8-119.7%, enabling genome-resolved functional interpretation without DDA-PASEF input.
bioinformatics2026-06-16v1scIsoAgent enables autonomous isoform-resolved characterization and sequence-informed interpretation of long-read single-cell transcriptomes
Zhao, C.; Liu, M.; Li, X.; Li, D.; Xu, Y.; Wang, Z.Abstract
Alternative isoform usage can alter gene function independently of total gene expression, creating a need to resolve transcript isoforms at single-cell resolution. Long-read single-cell RNA sequencing meets this need by linking cellular identity to transcript isoforms and sequence-level features. Realizing its full biological value requires reproducible workflows that connect specialized long-read analysis with biological interpretation. Existing large language model (LLM)-based biomedical agents support general omics analysis, but are not designed for isoform-resolved long-read single-cell workflows. Here, we present scIsoAgent, an autonomous LLM-powered scientific agent for long-read single-cell RNA-seq analysis. scIsoAgent turns heterogeneous long-read single-cell inputs into traceable isoform-resolved workflows, using stage-aware planning and persistent computational context to support both execution and interpretation. Across complementary evaluations, this design improved the continuity from analysis planning to executable, interactive workflows compared with general-purpose LLM baselines. In real-data reanalysis, scIsoAgent recovered major findings from published long-read single-cell resources and extended a representative differential transcript usage event into a sequence-informed functional hypothesis. By linking full-length isoform sequences with model-inferred transcript properties, scIsoAgent connects observed isoform usage with potential sequence-level functional consequences. These results demonstrate that autonomous scientific agents can transform fragmented long-read single-cell analysis into coherent, reproducible workflows for isoform-resolved discovery and biological interpretation.
bioinformatics2026-06-16v1A Transformer-derived transcriptomic score associates with ex-vivo drug response in AML
Barman, J.; Adhikari, S.; Heckman, C.; Vaha-Koskela, M.Abstract
Background Drug-tolerant persister (DTP) cell states have been implicated in relapse across multiple cancers, including acute myeloid leukaemia (AML) [1,2]. Methods that score such states from transcriptomic data, generalise to held-out samples, expose calibrated probability outputs, and link predictions to candidate biology are useful for prioritising follow-up experimental work. Existing transcriptomic methods for scoring drug-tolerant or persister-like states largely rely on fixed gene signatures or general-purpose cell-type classifiers adapted post hoc (scPred, scANVI, scClassify); deep-learning approaches developed specifically for AML drug-tolerant persister scoring with calibrated probability outputs, prespecified thresholds, and transparent external validation against ex-vivo drug-response data are, to our knowledge, lacking. Our approach addresses this gap by combining a Transformer teacher with a knowledge-distilled 1,000-gene student, a prespecified threshold τ = 0.31, and direct evaluation against BeatAML drug-AUC. This in silico approach is intended to identify and flag DTP cells. Methods We trained a Transformer classifier on a pooled scRNA-seq corpus of nine samples (six from GSE123902, covering lung adenocarcinoma metastasis, normal, and primary tumour [4], plus three primary AML samples; 32,342 cells, 13,369 common genes), with stratified 5-fold cross-validation at the cell level, a 20% held-out test split, and a prespecified probability threshold selected on out-of-fold predictions. A 1,000-gene student model was trained by knowledge distillation [5]. For every input cell, the student outputs a probability between 0 and 1 (hereafter "the score") representing predicted membership in the positive training class. 
The trained model was applied without re-tuning to five external or independent application cohorts: 39 primary AML donors [in-house]; GSE74246 [6]; BeatAML (n = 452 with linked ex-vivo drug-AUC; n = 405 with overall-survival metadata) [7]; TCGA-LAML (n = 149) [8]; and an in-house n = 10 scRNA-seq cohort with linked survival. Survival and drug-response data were not used during training, threshold selection, or tuning. The score was anchored mechanistically against CRISPR/DepMap essentiality [9], pathway enrichment, and a normal-tissue-filtered surface-protein candidate list (HPA [11], GTEx [12]). To assess concordance between transcriptomic prioritisation and protein-level evidence, each ranked candidate was additionally annotated with two HPA-derived flags: HPA_surface_protein (Yes/No, derived from HPA Protein class and Subcellular location fields, identifying genes annotated as plasma-membrane, GPCR, ion-channel, transporter, receptor, or CD-marker) and HPA_antibody_reliability (Enhanced, Supported, Approved, Uncertain, or Not available, per HPA antibody validation tier). Annotations were merged on HGNC symbol; 248 of 250 candidates (99.2%) matched. Two candidates using the older CORF nomenclature did not auto-match HPA's lowercase convention and were resolved manually. HPA's per-gene RNA-protein numeric correlation is published only on per-gene web pages and not in the bulk download; we therefore used the detection-level and antibody-reliability tiers as the operational concordance filter. Results Cross-validation area under the receiver operating characteristic curve (AUROC) was 0.936 ± 0.014 (held-out test 0.941, Matthews correlation coefficient (MCC) 0.696, F1-score 0.895). The 1,000-gene student showed Spearman ρ ≈ 0.96 with the teacher and >85% class agreement at the prespecified threshold. 
The principal external result was in BeatAML: the score correlated with ex-vivo drug-response AUC across seven AML-relevant drugs, with consistent per-drug Spearman correlations (r = 0.41-0.53, all p < 0.05). The aggregate correlation across 3,164 patient-drug pairs from 452 patients was r = +0.482 and is reported as a summary, recognising that pairs from the same patient are not fully independent. The score did not stratify overall survival in TCGA-LAML or in the in-house n = 10 cohort, in part because predicted high-score fractions saturated. At the prespecified threshold the score did not separate cell types in GSE74246, indicating that absolute calibration is cohort-dependent. Compared against logistic regression, random forest, the LSC17 stemness signature, and a mean-expression baseline on the same gene panel, the Transformer was the most stable model under aliquot-grouped cross-validation and the only one to transfer with strong, positive correlation to BeatAML drug-AUC. The mechanistic candidate-target pipeline produced a 250-candidate ranked surface-protein list (full breakdown in Results); FLT3 and CD33 were recovered from the unbiased ranking as positive controls. Conclusion We present a Transformer-derived transcriptomic score that addresses the lack of validated computational methods for identifying drug-tolerant persister-like states in AML. The score shows external rank-order association with ex-vivo drug response, providing a research-use tool for prioritising candidate persister-associated transcriptional programs for follow-up. Together, these results support the score as a research-use transcriptomic ranking tool for AML drug-response-associated states. The strongest external support comes from the consistent association with BeatAML ex-vivo drug-response AUC. The fixed probability threshold did not transfer reliably across all cohorts, so threshold-based classification should require cohort-specific recalibration. 
The score is not validated for clinical decision-making and is not proposed as a survival predictor. The candidate-target list is a starting point for functional follow-up. Keywords. AML; ex-vivo drug response; single-cell RNA-seq; Transformer; knowledge distillation; transcriptomic score; BeatAML; surface-protein target prioritisation.
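The external evaluation above rests on Spearman rank correlation between the per-sample score and ex-vivo drug-response AUC. As a reminder of what that statistic measures, here is a minimal pure-Python sketch: rank both variables (ties receive the average rank), then take the Pearson correlation of the ranks. The numbers are made up for illustration and are not BeatAML data.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

scores = [0.10, 0.40, 0.35, 0.80]    # hypothetical per-sample scores
drug_auc = [10.0, 30.0, 20.0, 50.0]  # hypothetical drug-response AUC values
rho = spearman(scores, drug_auc)     # same ordering in both lists, so rho is 1
```

Because only ranks enter the calculation, a strong Spearman correlation can coexist with a probability threshold that fails to transfer between cohorts, which matches the abstract's distinction between rank-order association and absolute calibration.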
bioinformatics2026-06-16v1Accelerating String Comparison in RLZ Compressed Sequences via LCE Jumps
Varki, R.; Boucher, C.Abstract
Relative Lempel-Ziv (RLZ) is an effective compression method for large, repetitive collections; however, the fundamental primitives required to elevate it from a passive archival format to a tractable representation for compressed construction have yet to be fully established. In this paper, we introduce an algorithmic framework for structurally comparing and lexicographically sorting sequences of RLZ factors. We characterize when direct factor comparisons are necessary and when they can be bypassed using RLZ specific shortcuts. We further introduce a method for extending truncated factors into right-maximal matches, enabling the recovery of matching statistics from the RLZ parse. Experimentally, RLZ sorting achieved speedups of up to 3.93x over character-based sorting. Together, these results advance the use of the RLZ format as a foundation for compressed construction.
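The role of a longest common extension (LCE) query in string comparison can be shown with a small sketch: once the length of the shared prefix is known, the lexicographic order is decided by a single character comparison at the first mismatch. The LCE below is a naive character scan used only for illustration. It is an assumption of this sketch that a practical system would answer the query from an index over the RLZ reference instead, and the paper's RLZ-specific shortcuts are not reproduced here.

```python
from functools import cmp_to_key

def lce(a, i, b, j):
    """Length of the longest common prefix of a[i:] and b[j:] (naive scan)."""
    k = 0
    while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
        k += 1
    return k

def compare(a, b):
    """Lexicographic comparison that jumps past the shared prefix via one LCE query."""
    k = lce(a, 0, b, 0)
    if k == len(a) and k == len(b):
        return 0          # identical strings
    if k == len(a):
        return -1         # a is a proper prefix of b
    if k == len(b):
        return 1          # b is a proper prefix of a
    return -1 if a[k] < b[k] else 1

sequences = ["ACGTACGT", "ACGTAC", "ACGTTT", "AAGT", "ACGTACGT"]
ordered = sorted(sequences, key=cmp_to_key(compare))
```

With a constant- or logarithmic-time LCE structure in place of the scan, each comparison costs one query plus one character check, independent of how long the shared prefix is.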
bioinformatics2026-06-16v1Physics-Driven Zero-Shot Reconstruction of Isotropic 3D Fluorescence Microscopy under Undersampled Acquisition
Cao, R.; Jin, T.; Xin, F.; Hou, Y.; Fu, Y.; Jin, B.; Li, L.; Gao, S.; Wang, H.; Li, Y.; Saimi, D.; Ren, W.; Wang, W.; Xin, G.; Yuan, K.; Chen, Z.; Su, X.; Kim, D.; Li, M.; Xi, P.Abstract
Three-dimensional (3D) imaging is central to the next generation of fluorescence microscopy. However, routine axial down-sampling puts isotropic resolution out of reach. Here, we propose DeepUI, a physics-driven zero-shot framework designed to reconstruct isotropic 3D fluorescence images from data acquired at a low axial sampling rate. DeepUI exploits the intrinsic characteristics of 3D images through physics-guided degradation, which incorporates joint spatial-frequency learning to generate a scaled optical transfer function, combined with noise degradation and an up-sampling branch. DeepUI typically requires just 5 minutes for training and 0.5 minutes for high-throughput prediction. We demonstrate its superior performance in producing isotropic results and its specificity to axial down-sampling, even under more challenging conditions, including defocused background, noise, and resolution blur.
bioinformatics2026-06-16v1PhenoBIC: operator-free single-cell spatial phenotyping in multiplex imaging data using deep learning of cell staining patterns
Sankaranarayanan, A.; Zhao, C.; Hernandez, M. G.; Clemens, E. A.; Smythe, K. S.; Kazerouni, A. S.; Carr, L. L.; Li, C. I.; Partridge, S. C.; Vinayak, S.; Mittal, S.Abstract
Multiplex imaging is a valuable tool for spatially examining tissue microenvironments at the single-cell level to uncover biological and clinical insights. However, most multiplex image analysis workflows currently require manual intervention for cell phenotyping, which slows progress, demands human effort, and yields operator-dependent outputs. Here, we developed PhenoBIC, a pre-trained deep learning model for image classification of the multiplexed biomarker signals in a cell (Biomarker Imprint of a Cell) to classify cell phenotypes. We show that PhenoBIC (F1-score ~0.88) outperforms manual gating (widely used) and other machine learning-based computational approaches for cell marker expression classification. We validated this across multiple biomarkers, tissue sampling strategies (whole biopsies and tissue microarrays), multiplex panels, imaging platforms, and tissue types. We have released our in-house training and validation datasets of ~1.4 million manually curated cell expression ground truth labels. We have also open-sourced PhenoBIC and enabled its community-wide deployment via the QuPath interface.
bioinformatics2026-06-16v1Super Learner Ensemble Modeling of CPTAC Proteomic Data for Survival Prediction in Head and Neck Squamous Cell Carcinoma
Park, E.; Lee, H.; Oh, E. J.; Tham, T.; Ahn, S.Abstract
Survival analysis in head and neck squamous cell carcinoma (HNSCC) is traditionally performed using Cox proportional hazards models, alongside some exploration into black-box machine learning methods. The Super Learner (SL) algorithm addresses this model selection dilemma by combining diverse candidate algorithms into a weighted ensemble to perform comparably to the best candidate method. This study evaluates the performance of SL in HNSCC. Proteomic features as well as clinical covariates from 96 CPTAC HNSCC samples were modeled with three candidate algorithms (Cox LASSO, Cox Ridge, and Random Survival Forest) as well as the ensemble SL method. Models were optimized via Uno's time-dependent Concordance Index (C-index) and tested at 1- and 3-year time horizons using 2000 bootstrap resamples. The Cox Ridge regression model achieved the highest predictive accuracy among the four total methods. However, the SL demonstrated stable performance over both time horizons (1-year C-index: 0.985; 3-year C-index: 0.960). Variable importance analysis of the Cox Ridge model successfully identified malignant proteins (ATR, MAML1, MIEN1) alongside novel potential prognostic indicators (ZNF800, KERA). This analysis emphasizes the statistical necessity for larger cohorts for ensemble learning, while providing a benchmark of proteomic indicators in HNSCC.
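For readers unfamiliar with concordance, the sketch below computes Harrell's C-index, a simpler relative of the Uno's C-index used in the study: Uno's version additionally reweights pairs by inverse probability of censoring and truncates at a time horizon, neither of which is implemented here. The toy data are invented for illustration.

```python
def harrell_c_index(times, events, risks):
    """Fraction of comparable pairs in which the earlier failure has the higher risk.

    A pair (i, j) is comparable when subject i had an observed event and
    failed strictly before subject j's follow-up time. Tied risks count 0.5.
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable

times = [1.0, 2.0, 3.0, 4.0]     # follow-up times (toy values)
events = [1, 1, 0, 1]            # 1 = event observed, 0 = censored
risks = [4.0, 3.0, 2.0, 1.0]     # higher predicted risk for earlier failures
c_perfect = harrell_c_index(times, events, risks)        # every pair concordant: 1.0
c_reversed = harrell_c_index(times, events, risks[::-1])  # every pair discordant: 0.0
```

A value of 0.5 corresponds to random ordering, which is a useful yardstick when reading C-index values estimated from a cohort of 96 samples.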
bioinformatics2026-06-16v1