Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
TITAN-BBB: Predicting BBB Permeability using Multi-Modal Deep-Learning Models
de Oliveira, G. B.; Saeed, F.
Abstract
Computational prediction of blood-brain barrier (BBB) permeability has emerged as a vital alternative to traditional experimental assays, which are often too resource-intensive and low-throughput to meet the demands of early-stage drug discovery. While early machine learning approaches have shown promise, integration of traditional chemical descriptors with deep learning embeddings remains an underexplored frontier. In this paper, we introduce TITAN-BBB, a multi-modal deep-learning architecture that utilizes tabular, image, and text-based features and combines them using attention mechanisms. To evaluate it, we aggregated multiple literature sources to create the largest BBB permeability dataset to date, enabling robust training for both classification and regression tasks. Our results demonstrate that TITAN-BBB achieves a balanced accuracy of 86.5% on classification tasks and a mean absolute error of 0.436 for regression, outperforming the state of the art by 3.1 percentage points in balanced accuracy and reducing the regression error by 20%. These gains in both classification and regression demonstrate the benefits of combining deep and domain-specific representations. The source code is publicly available at https://github.com/pcdslab/TITAN-BBB. The inference-ready model is hosted on Hugging Face at https://huggingface.co/SaeedLab/TITAN-BBB, and the aggregated BBB permeability datasets are available at https://huggingface.co/datasets/SaeedLab/BBBP.
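The two headline metrics in this abstract can be computed as in the sketch below. It uses the standard definitions (balanced accuracy as the mean of per-class recall, mean absolute error as the average absolute difference); the function names and toy labels are illustrative assumptions and are not taken from the TITAN-BBB code base.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally regardless of size."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        correct = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between measured and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up labels for illustration: 1 = permeable, 0 = non-permeable.
labels = [1, 1, 1, 0]
preds = [1, 1, 0, 0]
print(balanced_accuracy(labels, preds))
print(mean_absolute_error([1.0, 2.0], [2.0, 4.0]))
```

Balanced accuracy is the natural choice when permeable and non-permeable compounds are unevenly represented, because plain accuracy would reward always predicting the majority class.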
bioinformatics 2026-06-11 v2
MargheRita: streamlining MS-DIAL output analysis and metabolite identification in R
Mosca, E.; Ulaszewska, M.; Alavikakhki, Z.; Bellini, E. N.; Mannella, V.; Frigerio, G.; Drago, D.; Andolfo, A.
Abstract
In the field of untargeted metabolomics, the deployment of high-resolution mass spectrometry technologies generates an immense volume of complex metabolite signals. This data density necessitates sophisticated computational frameworks for post-acquisition processing and the integration of specialized databases for accurate metabolite identification. Currently, many web-based data processing solutions offer fragmented workflows, covering only specific stages of the analysis and frequently requiring researchers to migrate data across multiple, often incompatible, platforms. To address these challenges, we introduced margheRita, an R package designed to streamline the workflow for untargeted metabolomic profiling. Developed to work seamlessly with MS-DIAL output, margheRita provides a comprehensive pipeline for liquid chromatography-tandem mass spectrometry (LC-MS/MS) data. This tool is particularly effective for Data-Independent Acquisition (DIA) experiments, where the high-resolution acquisition of all MS/MS spectra demands rigorous and integrated processing capabilities. A key innovation of margheRita is its ability to significantly enhance fragment matching accuracy. It achieves this by utilizing an original, curated high-quality spectral library from authentic reference standards. This library includes data acquired in both positive and negative ionization polarities using various chromatographic columns, ensuring high versatility. By bridging the gap between initial MS-DIAL processing and final biological insights, margheRita offers a holistic solution from metabolite identification to the functional interpretation of complex biological datasets.
bioinformatics 2026-06-11 v2
Reducing haystacks to needles - ViralClust: A Nextflow pipeline to cluster viral sequences
Triebel, S.; Lamkiewicz, K.; Eulenfeld, T.; Marz, M.
Abstract
The rapid accumulation of viral genome sequences presents major challenges for downstream analysis tools, including tools for multiple sequence alignments, phylogeny, and genome/alignment visualization, due to computational constraints and sampling biases caused by outbreak-driven overrepresentation. Selecting representative genomes through clustering offers a principled alternative to random subsampling, yet choosing appropriate clustering strategies remains non-trivial and context-dependent. Here, we present ViralClust, a modular Nextflow pipeline for bias-aware representative selection from large viral genome datasets. ViralClust integrates five distinct clustering algorithms (CD-HIT-EST, SUMACLUST, VSEARCH, MMseqs2, and HDBSCAN) within a unified workflow, enabling direct comparison of clustering outcomes and flexible adaptation to diverse biological questions while maintaining a balanced phylogenetic distribution of the selected sequences. We evaluated ViralClust on six RNA and DNA virus datasets ranging from 632 to 156,586 sequences and spanning genome lengths from 890 to 197,185 nucleotides. Across all datasets, clustering reduced dataset size by 95% or more while preserving genetic diversity across species, genera, and families, and effectively mitigating biases introduced by outbreaks, partial genomes, and sequence orientation artifacts. By supporting whole-genome clustering and scalable representative selection, ViralClust enables efficient and reproducible downstream analyses that would otherwise be computationally infeasible. Our framework provides a flexible foundation for large-scale viral genomics and supports future applications in comparative analysis and virus classification.
bioinformatics 2026-06-11 v2
RdRpCATCH: A unified resource for RNA virus discovery using viral RNA-dependent RNA polymerase profile Hidden Markov models
Karapliafis, D.; Neri, U.; Olendraite, I.; Charon, J.; Sakaguchi, S.; Hou, X.; de Ridder, D.; Zwart, M. P.; Kupczok, A.
Abstract
Recent advances in large-scale sequence mining have expanded our knowledge of RNA virus diversity. Most genome mining approaches for detecting RNA viruses that encode RNA-dependent RNA polymerase (RdRp) rely on identifying this conserved protein by employing profile Hidden Markov Models (pHMMs) to scan sequencing datasets. Recently, several new pHMM databases for RdRp detection have been released, each following distinct design principles. However, their relative performance is unclear and their accessibility to users without specialized computational expertise is limited. Here, we introduce the RdRp Collaborative Analysis Tool with Collections of pHMMs (RdRpCATCH: https://github.com/dimitris-karapliafis/RdRpCATCH), developed to consolidate publicly available RdRp pHMM resources into a single, accessible platform. RdRpCATCH enables the scanning of (meta)transcriptomic assemblies to discover RNA viruses and provides subsequent taxonomic annotation of detected contigs. A comparative analysis of RdRp pHMM databases reveals that most are highly effective at detecting known diversity of RNA viruses while minimizing false positives, supporting their joint use within RdRpCATCH. RdRpCATCH is distributed as both a conda package and a web server application (https://rdrpcatch.bioinformatics.nl), facilitating access for researchers with diverse expertise. By integrating multiple pHMM resources, this unified framework addresses fragmentation in the field and reduces technical barriers, enabling comprehensive viral discovery.
bioinformatics 2026-06-11 v2
Explainable protein-protein binding affinity prediction via fine-tuning protein language models
Singh, H.; SINGH, R. K.; Srivastava, S. P.; Pradhan, S.; Gorantla, R.
Abstract
Protein-protein interactions underpin virtually every aspect of cellular life, and the precise quantification of their binding affinity is fundamental to understanding immune recognition, disease mechanisms, and the rational design of therapeutic antibodies. Yet predicting binding affinity at scale remains an unsolved challenge: reliable experimental assays are low-throughput and expensive, while computational methods that depend on three-dimensional complex structures cannot be applied to the vast majority of clinically relevant targets where structural data are absent. Here we present BALM-PPI, a framework that predicts protein-protein binding affinity from amino acid sequence alone. Both proteins are encoded by a protein language model trained on evolutionary sequence data and projected into a shared representational space, where their distance directly reflects binding strength. Fine-tuning this protein language model requires updating fewer than 1% of its parameters, and we show that this targeted adaptation steers the model toward interface-relevant sequence signals rather than spurious background correlations. On a curated benchmark of over 12,000 protein complexes, BALM-PPI matches or exceeds the accuracy of structure-based methods and retains predictive power for proteins with less than 30% sequence identity to the training set. Using only a subset of project-specific assay data, BALM-PPI outperforms a recent method trained on three times the data, suggesting that the model has already encoded the underlying interaction signals and requires only minimal supervision to specialise to a new target. BALM-PPI further provides residue-level attribution maps that pinpoint the amino acid positions driving each affinity prediction, consistently recovering experimentally validated interaction hotspots across enzyme-inhibitor, signalling, and antibody-antigen systems without any structural input during training. 
This allows predictions to be cross-validated against structural and mutagenesis evidence, providing a mechanistic basis for candidate shortlisting ahead of experimental follow-up. BALM-PPI is freely accessible via an interactive web server.
bioinformatics 2026-06-11 v2
DeepSynBa: Actionable Drug Combination Prediction with Complete Dose-Response Profiles
Kuru, H. I.; Zhang, H.; Rattray, M.; Ek, C. H.; Cicek, A. E.; Tastan, O.; Milo, M.
Abstract
Many cancer monotherapies demonstrate limited clinical efficacy, making combination therapies a relevant treatment strategy. The extensive number of potential drug combinations and context-specific response profiles complicates the prediction of drug combination responses. Existing computational models are typically trained to predict a single aggregated synergy score, which summarises drug responses across different dosage combinations, such as Bliss or Loewe scores. This oversimplification of the drug-response surface leads to high prediction uncertainty and limited actionability, as these models fail to distinguish between potency and efficacy. We introduce DeepSynBa, an actionable model that predicts the complete dose-response matrix of drug pairs instead of relying on an aggregated synergy score. This is achieved by predicting parameters describing the response surface as an intermediate layer in the model. Evaluated on the NCI-ALMANAC and the O'Neil datasets, DeepSynBa outperforms the state-of-the-art methods in the dose-response matrix prediction task across most evaluation scenarios, including testing on novel drug combinations, cell lines, and drugs, across nine different tissue types. We also show that DeepSynBa yields reliable synergy score predictions. More importantly, DeepSynBa can predict drug combination responses across different dosages for untested combinations. The intermediate dose-response parameter layer enables the separation of efficacy from potency, informing the selection of dosage ranges that optimise efficacy while limiting off-target toxicity in experimental screens. The predictive capability and the downstream actionability make DeepSynBa a powerful tool for advancing drug combination research beyond the limitations of the current approaches.
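To make the contrast between an aggregated synergy score and a full dose-response matrix concrete, the sketch below computes a Bliss excess for every dose pair. The Hill-curve parameterization, the parameter names (`emax`, `ec50`, `slope`), and all numbers are illustrative assumptions; the abstract does not state which response-surface parameters DeepSynBa predicts.

```python
def hill_response(dose, emax, ec50, slope=1.0):
    """Fractional inhibition of one drug: emax sets efficacy, ec50 sets potency."""
    if dose <= 0:
        return 0.0
    return emax * dose**slope / (ec50**slope + dose**slope)


def bliss_expected(fa, fb):
    """Expected combined inhibition if the two drugs act independently."""
    return fa + fb - fa * fb


def bliss_excess_matrix(observed, doses_a, doses_b, drug_a, drug_b):
    """Observed minus Bliss-expected inhibition for every dose pair."""
    excess = []
    for i, da in enumerate(doses_a):
        row = []
        for j, db in enumerate(doses_b):
            fa = hill_response(da, **drug_a)
            fb = hill_response(db, **drug_b)
            row.append(observed[i][j] - bliss_expected(fa, fb))
        excess.append(row)
    return excess


# Hypothetical drugs with the same potency (ec50) but different efficacy (emax).
drug_a = {"emax": 0.8, "ec50": 1.0}
drug_b = {"emax": 0.5, "ec50": 1.0}
observed = [[0.7]]  # made-up measured inhibition at dose pair (1.0, 1.0)
print(bliss_excess_matrix(observed, [1.0], [1.0], drug_a, drug_b))
```

Averaging such a matrix into one number discards where along the dose axes the excess occurs, which is the loss of information the abstract attributes to single-score models.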
bioinformatics 2026-06-11 v2
HESTA: a curated and reusable database for the human early organogenesis spatiotemporal transcriptome atlas
Xu, Z.; Li, Y.; Wang, W.; Zhang, Y.; Fan, L.; Chen, J.; Du, W.; Yang, T.; Gao, Y.
Abstract
Background: Human organogenesis is orchestrated by precise spatiotemporal gene expression. Mapping these dynamic processes requires transcriptomic data that preserve native anatomical context across continuous developmental stages. Findings: We present a spatiotemporal transcriptome database of human embryogenesis, profiling 77 sagittal sections from 13 euploid embryos (CS12-CS23) using Stereo-seq, yielding 14,708,858 bin50 spots. The atlas annotates 50 organs and maps 198 molecularly distinct substructures, complemented by 607,093 snRNA-seq cells. The database features a Spatial Exploration module for locating sections and visualizing spatial distributions of organs and substructures, and an Organ Atlas module for visualizing gene expression, regulon activities, and pathway enrichment at the single-organ level across stages. Conclusions: This database provides an interactive resource to access spatial gene expression, substructures, and regulatory networks across 50 developing human organs, supporting further research into the mechanisms of human organogenesis.
bioinformatics 2026-06-11 v2
TMO: ASYMMETRIC CROSS-MODAL ATTENTION FOR LEARNING CELL-STATE-DEPENDENT REGULATORY LAGS FROM SINGLE-CELL MULTIOMIC DATA
Lopez-Delgado, P. A.; Delgado-Carlo, M. M.
Abstract
Background: Single-cell multi-omics technologies simultaneously measure chromatin accessibility (ATAC) and gene expression (RNA), providing a unique window into the temporal ordering of regulatory events during differentiation. However, most computational models treat the two modalities symmetrically, ignoring the directional relationship between chromatin and transcription, and existing lag-aware methods estimate a single global lag per gene, failing to capture cell-state-dependent dynamics. Methods and Results: We introduce Temporal Multi-Omics (TMO), a deep learning framework that learns signed, cell-state-conditional regulatory lags (Δτ) using asymmetric cross-modal attention. TMO projects RNA and ATAC into 50 latent components each, tokenises each cell as a sequence of 100 tokens, and uses a two-pass transformer in which a data-driven lag prior, derived from a sliding-window cross-correlation function, directly biases attention asymmetrically. On four independent 10x Multiome datasets (mouse brain, human brain, mouse kidney, human PBMC), the asymmetric model achieves Lag Concordance Scores (LCS) of 0.988-0.999, compared to 0.048-0.108 for an architecturally identical symmetric baseline. A stratified 80/20 held-out experiment confirms that the learned component-lag ordering generalises to unseen cells (held-out LCS 0.85-0.99). Clustered Δτ heatmaps show positive Δτ (ATAC-led priming) in early pseudotime and negative Δτ (RNA-led, activity-dependent regulation) in late pseudotime; the ATAC-RNA correlation heatmap exhibits a U-shaped pattern indicative of developmental decoupling. Components with the most positive Δτ are enriched for chromatin organization and stem cell differentiation (FDR < 0.05), while those with the most negative Δτ are enriched for synaptic signalling and immune activation. 
Ablating the cell-state information from the lag predictor reduces the LCS and collapses per-component temporal dynamics (KS p ≤ 0.039 in all four tissues), proving that TMO's dynamic lag patterns depend on cell-state conditioning. Independent ChIP-seq validation for four transcription factors (PAX5, Pax6, ASCL1, Hnf4) confirms highly significant separation between target genes and expression-matched background (p < 10^-4 in all cases). Two Multiome Perturb-seq screens provide causal validation: SMARCB1 knockout shows a directional trend (1.5-fold target shift, p = 0.056, n = 147 perturbed cells), and SMARCE1 knockout reaches statistical significance (p = 0.0089, n = 3,394 perturbed cells). Gene-level cross-correlation independently validates that the regulatory lag signal is present in the raw data, and TMO further identifies rare, statistically significant biphasic gene programs where the regulatory direction reverses across pseudotime. Conclusions: TMO is the first method to make regulatory lag a learnable, cell-state-conditional, and architecturally encoded parameter. It is scalable, interpretable, and open-source, providing a powerful tool for studying regulatory timing in development, disease, and perturbation screens.
bioinformatics 2026-06-11 v1
Robust semi-supervised scRNA-seq integration from virtual adversarial learning
He, C.; Filippidis, P.; Xing, J.; Kleinstein, S.; Guan, L.
Abstract
Single-cell RNA sequencing integration methods that rely solely on transcriptomic data often struggle to preserve fine-grained distinctions between closely related cell subtypes. As a result, cell populations that are separable in the raw data may become over-mixed after integration, reducing biological resolution and interpretability. Incorporating marker gene information can potentially address these issues; however, the variability and complexity of available marker sets limit their effective application. To address this, we introduce scCRAFT+, a semi-supervised integration model that innovatively incorporates marker gene information through Virtual Adversarial Training (VAT). By jointly optimizing marker-derived supervision and transcriptome-wide representations, VAT enforces local prediction smoothness among transcriptionally similar cells, improving robustness to noisy marker annotations while enhancing both integration quality and cell type auto-annotation. This targeted approach significantly enhances annotation accuracy and robustness, particularly when faced with incomplete or incorrect marker gene sets. Benchmarking shows that scCRAFT+ achieves consistently stronger performance than current unsupervised and supervised integration approaches, resulting in improved integration quality and biologically meaningful sub-cell type auto-annotations.
bioinformatics 2026-06-11 v1
An AI-Powered Trisomy 21 Research Assistant
NANDI, S.; Sundararajan, Z.; Subirana-Granes, M.; Espinosa, J. M.; Pividori, M.; Sullivan, K. D.; Galbraith, M. D.; Costello, J.
Abstract
Down syndrome, caused by trisomy 21, increases the risk of diverse co-occurring conditions. With more than 34,000 related publications indexed in PubMed as of early 2026, keeping pace with this expanding literature is challenging. While general-purpose large language models are widely used for information retrieval, they often rely on broad training data rather than specific evidence. Retrieval-augmented generation (RAG) improves rigor and reliability of responses by linking model outputs to source texts. In research, source texts are peer-reviewed articles. Standard implementations treat all manuscript sections equally, allowing background text to rank as highly as experimental results. To focus model outputs on experimentally supported responses, we developed the T21 Research Assistant, a section-aware RAG system that prioritizes Results sections to ground responses in primary experimental evidence. The system draws exclusively from 1,789 open-access Down syndrome publications from PubMed Central, including 327 NIH INCLUDE-funded studies, and uses a multistage pipeline for query validation, retrieval, reranking, synthesis, and citation verification. Built on NVIDIA Nemotron models, it generates structured, cited responses. Evaluation using expert-curated questions demonstrated strong performance, achieving a BERTScore F1 of 0.712 and recall of 0.758, comparable to or exceeding leading proprietary and open-source models. T21 Research Assistant is available at: https://bioinformatics.cuanschutz.edu/t21-res-assi/
bioinformatics 2026-06-11 v1
GeroQubit: a lightweight, honesty-first de-novo design platform for geroscience-native small molecules with calibrated uncertainty
k, D.; Swetha, H.
Abstract
Computational molecule generation has outpaced its own credibility. We present GeroQubit, a GPU-free de-novo design platform that organizes candidates along a target x tissue x hallmark model and reports every signal alongside its measured baseline. We treat our tissue aging-signature readout as a mechanistic structural prior that we explicitly disclose is not validated against lifespan, and we surface efficacy only through a structure-to-lifespan k-NN whose weak but real signal (leave-one-out rho ~ 0.145) is wrapped in empirically-calibrated conformal intervals (90% target, 90.3% measured coverage). On a held-out retrospective recovery of ~1,940 ChEMBL binders against decoys, the score reaches ROC-AUC 0.945 with ~20x enrichment at 1% (BEDROC 0.91) and survives a scaffold-disjoint split - yet we report that it collapses to near-random (AUC 0.62) on genuinely novel chemotypes. Molecules are assembled reaction-first, so every candidate carries a verified synthetic route and atom-level synthon provenance; ADMET is handled as a multi-objective Pareto problem. We frame the disclosed weak signals and the hard-case failures not as flaws but as the honest, decision-useful output the field's own critics demand.
bioinformatics 2026-06-11 v1
Revealing trajectories of multi-modal voxel-level changes in neurodegenerative diseases using latent event mapping
Pinnawala, S.; Hartanto, A.; Jairamani, M.; Simpson, I. J. A.; Wijeratne, P. A.
Abstract
Neurodegenerative diseases are driven by pathological mechanisms that can be indirectly measured in vivo using multi-modal neuroimaging. However, current computational methods that aim to reconstruct trajectories of voxel-level changes in the brain are either not computationally scalable or not fully interpretable, limiting their ability to reveal associations between disease progression and underlying mechanisms. Here we introduce Latent Event Mapping (LEMING), a generative unsupervised modelling technique that learns a latent map of disease events along a common pseudo-timeline. We apply LEMING to amyloid PET and structural MRI data from the Alzheimer's Disease Neuroimaging Initiative to reveal the first voxel-level trajectories of events in Alzheimer's disease. Notably, we show how LEMING can provide new insights into progression-dependent disease mechanisms. We find that acetylcholine receptor density is significantly positively associated with both late-stage amyloid and atrophy events, suggesting that either these receptors are targeted later in disease progression, or that amyloid does not play an active role. This has strong implications for therapeutics that target acetylcholine receptors, particularly for early-stage intervention strategies.
bioinformatics 2026-06-11 v1
Calibrated Uncertainty Quantification for Patient-Level AML Drug Sensitivity Prediction Using Split Conformal Prediction
Shokrzadeh, A. J.; Shokrzadeh, P.
Abstract
Accurate prediction of ex vivo drug sensitivity in acute myeloid leukemia (AML) patients from transcriptomic data is a critical challenge for precision oncology. Existing computational approaches have explored uncertainty quantification in cancer drug response prediction primarily using cell line data, while patient-level AML models typically rely on heuristic confidence measures rather than statistically calibrated uncertainty estimates. Here, we present a framework applying split conformal prediction to patient-level AML drug response modeling using the BeatAML 2.0 cohort. We trained Elastic Net and XGBoost regressors on bulk RNA-seq gene expression profiles from 318 AML patients, analyzing 34,764 patient-drug observations across 122 compounds. Baseline models achieved median Pearson R values of 0.291 (Elastic Net) and 0.281 (XGBoost) across 122 drugs. Wrapping these models with split conformal prediction yielded well-calibrated prediction intervals across three confidence levels: empirical coverages of 81.4%, 90.7%, and 95.5% against nominal targets of 80%, 90%, and 95%, respectively. Analysis of prediction interval widths revealed substantial drug-class-specific uncertainty patterns, with HDAC and BCL-2 inhibitors exhibiting markedly higher uncertainty than MDM2 inhibitors, suggesting a potential association between transcriptomic predictability and drug mechanism of action, although several drug classes were represented by only a small number of compounds. Predictive uncertainty was not significantly associated with ELN2017 molecular risk classification (Kruskal-Wallis p=0.395) or NPM1 mutation status (p=0.788). These results demonstrate that statistically valid uncertainty quantification can be achieved for patient-level AML drug response prediction despite substantial biological heterogeneity. 
To the best of our knowledge, no published study has applied split conformal prediction to patient-level ex vivo drug sensitivity prediction in the BeatAML cohort; the framework presented here provides a principled alternative to heuristic confidence scoring approaches. Keywords: Acute myeloid leukemia (AML); Ex vivo drug sensitivity; Conformal prediction; Uncertainty quantification; Precision oncology; BeatAML; Transcriptomic biomarkers; Machine learning.
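The wrapping step described in this abstract can be sketched as follows. This is a generic split conformal procedure for regression with absolute-residual scores, not the authors' code: `predict` stands in for any fitted regressor (for example an Elastic Net or XGBoost model), and the calibration data are made up.

```python
import math


def conformal_quantile(residuals, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of absolute calibration residuals."""
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points to guarantee coverage at this level.
        return math.inf
    return sorted(residuals)[k - 1]


def split_conformal_intervals(predict, x_cal, y_cal, x_test, alpha=0.1):
    """Symmetric prediction intervals with nominal coverage 1 - alpha."""
    residuals = [abs(y - predict(x)) for x, y in zip(x_cal, y_cal)]
    q = conformal_quantile(residuals, alpha)
    return [(predict(x) - q, predict(x) + q) for x in x_test]


# Hypothetical fitted model and held-out calibration set.
predict = lambda x: x
x_cal = list(range(10))
y_cal = [x + (i + 1) for i, x in enumerate(x_cal)]  # residuals 1..10
print(split_conformal_intervals(predict, x_cal, y_cal, [0], alpha=0.25))
```

The coverage guarantee relies on the calibration set being held out from model fitting and exchangeable with the test points; it holds on average over test points, not for each patient or drug individually.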
bioinformatics 2026-06-11 v1
DivQuant: Estimation of Species Richness and Entropy from Small Samples
Schmitz, J. E.; Rahmann, S.
Abstract
Estimating diversity properties of discrete distributions from a small observed sample is a fundamental problem in algorithmic statistics that has applications in many fields, in particular bioinformatics, but also in ecology or linguistics. The two most common diversity measures are the number of distinct elements in a multiset, also referred to as species richness in ecology or alpha diversity in microbial analysis, and the Shannon entropy, also referred to as evenness. Estimating these properties from a small sample is particularly challenging for distributions with many rare elements. Thus, many estimators have been proposed in the past that, in practice, work well for different types of distributions. We present DivQuant, an optimization-based, extrapolating richness and entropy estimator with three contributions. First, we formulate the upsampling problem as a convex quadratic program with a Neyman χ² objective. Unlike the linear program of its predecessor RichnEst, DivQuant admits confidence intervals via χ² test inversion that are empirically well-calibrated. Second, we replace RichnEst's fixed-threshold fingerprint truncation with the rare/abundant fingerprint split of Valiant and Valiant, which strongly reduces problem size and preserves enough degrees of freedom for the confidence-interval program to remain valid and feasible. Third, we plug the optimal population fingerprint returned by the program into Shannon's entropy formula to obtain an entropy estimate. DivQuant attains close-to-nominal 95% confidence intervals in essentially all tested regimes, including six simulated distribution families, Tara Oceans microbiome data, and 10X Genomics scRNA-seq data, while competing state-of-the-art methods (RichnEst, iNext, PreSeq) miss the true richness in up to 80% of instances, well above the nominal 5%. In addition, DivQuant outperforms classical asymptotic entropy estimators (Miller-Madow, CAE) and the extrapolating iNext estimator. 
Running times remain competitive, with DivQuant typically completing in seconds. DivQuant is available as a command-line tool at https://gitlab.com/rahmannlab/divquant.
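For orientation, the sketch below builds a sample fingerprint and evaluates two classical baselines on it: the naive plug-in Shannon entropy and the Miller-Madow bias correction. It does not implement DivQuant's quadratic program; it only illustrates what "plugging a fingerprint into Shannon's formula" means, here applied to the observed fingerprint rather than an estimated population fingerprint. Function names are illustrative.

```python
import math
from collections import Counter


def fingerprint(sample):
    """Map abundance i -> number of distinct elements observed exactly i times."""
    counts = Counter(sample)
    return Counter(counts.values())


def plugin_entropy(fp):
    """Shannon entropy (in nats) of the empirical distribution behind a fingerprint."""
    n = sum(i * f for i, f in fp.items())
    return -sum(f * (i / n) * math.log(i / n) for i, f in fp.items())


def miller_madow_entropy(fp):
    """Plug-in entropy plus the Miller-Madow correction (S_obs - 1) / (2n)."""
    n = sum(i * f for i, f in fp.items())
    s_obs = sum(fp.values())  # observed richness
    return plugin_entropy(fp) + (s_obs - 1) / (2 * n)


sample = ["a", "a", "b", "b"]  # toy sample with two equally frequent species
fp = fingerprint(sample)
print(dict(fp), plugin_entropy(fp), miller_madow_entropy(fp))
```

Both baselines only see the species present in the sample, which is why they underestimate richness and entropy when many rare elements are unobserved.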
bioinformatics 2026-06-11 v1
Integrating Spatially Adjusted Protein Summaries for Survival Prediction in Spatial Proteomics
Ahn, S.; Oh, E. J.; Prada, D.; Shojaie, A.
Abstract
Recent advances in spatial proteomics, particularly imaging mass cytometry, enable the measurement of protein expression at the single-cell level while preserving spatial context. Conventional survival analyses, however, typically rely on patient-level averages of protein intensities and therefore overlook spatial heterogeneity and tissue architecture. To address this limitation, we introduce a framework that incorporates spatial information into survival modeling by generating spatially adjusted protein summaries (SAPS). In this approach, cell-level protein intensities within each patient are modeled using spatial spline regression to capture spatial trends. From these models, we extract two complementary features: a spatially adjusted mean expression and a residual variance that reflects cell-to-cell variability unexplained by spatial effects. These summaries are then incorporated into Cox proportional hazards models in combination with clinical covariates. In simulation studies, our proposed framework achieved improved predictive performance compared with alternative methods. Application of the method to breast cancer imaging mass cytometry data indicates that spatially adjusted summaries may enhance survival prediction and reveal biologically interpretable spatial protein patterns, suggesting high translational potential. This methodology offers an efficient means of translating complex spatial proteomics data into patient-level features, providing both improved survival prediction and new insights into the role of spatial heterogeneity in cancer outcomes.
bioinformatics 2026-06-11 v1
Machine Learning-Guided Discovery of Bacterial-Selective Membrane-Active Compounds Reveals Mechanistic Bias in Antibiotic Training Datasets
Chain, C.; Ghaffari, S.; Belakaria, S.; Sheehan, J. P.; Irani, I.; Wu, C.-Y.; Kim, H.; Engelhardt, B. E.; Gitai, Z. E.
Abstract
The rise of antibiotic resistance necessitates the discovery of antibacterial compounds with novel mechanisms of action (MoAs). Recent machine learning approaches have shown promise in antibacterial compound discovery, but often identify derivatives of known antibiotic classes rather than mechanistically novel compounds. Previous approaches applied Tanimoto similarity filters at the end of screening pipelines, but this method has substantial drawbacks: Tanimoto similarity can be misleading in chemical space, and post-hoc filtering does not influence what activity models learn to prioritize. Here, we present a machine learning pipeline that addresses chemical novelty upfront by employing an XGBoost-based MoA classifier to explicitly prioritize compounds predicted to have mechanisms distinct from known antibiotic classes, combined with graph neural networks for antibacterial activity and toxicity prediction. Applied to the Zinc20 database, our approach successfully identified non-toxic antibacterial compounds structurally distinct from known antibiotics. Notably, the majority of these hits exhibited membrane-targeting activity with selectivity for bacterial cells over mammalian cells, suggesting potential for next-generation membrane-active antibiotics. However, we did not identify compounds with novel protein targets. Systematic analysis revealed that this limitation stems from mechanistic bias in training data rather than model architecture. Specifically, our activity model learned to preferentially score compounds similar to specific groups in the training data, thus overrepresenting certain MoA classes including membrane-active compounds. Even substantial model architecture and training data enhancements did not overcome this constraint. Our findings demonstrate that the primary bottleneck for discovering mechanistically novel antibiotics is the scarcity of diverse, mechanistically-annotated training data. 
This work provides both a methodological framework for mechanism-aware screening and critical insights into data requirements for genuinely novel antibiotic discovery.
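The post-hoc Tanimoto filter that this abstract argues against can be sketched as below, which also shows why it cannot influence what the activity model learns: it only discards candidates after scoring. Fingerprints are represented as sets of "on" bit indices; the bit sets, compound names, and the 0.4 threshold are all hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two binary fingerprints."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


def novelty_filter(candidates, known, threshold=0.4):
    """Keep candidates whose nearest known compound is below the similarity threshold."""
    kept = {}
    for name, bits in candidates.items():
        nearest = max((tanimoto(bits, ref) for ref in known.values()), default=0.0)
        if nearest < threshold:
            kept[name] = nearest
    return kept


# Made-up fingerprints for illustration only.
known_antibiotics = {"ref": {2, 3, 4}}
hits = {"c1": {1, 2, 3}, "c2": {7, 8}}
print(novelty_filter(hits, known_antibiotics))
```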
bioinformatics 2026-06-11 v1
AGZArank: Investigating epitope-conditioned antibody binder ranking with structure-derived synthetic supervision
Sadykov, Z.; Khamidullina, A.; Sultankulov, B.; Seitkali, D.
Abstract
Computational antibody design methods can generate large libraries of candidate binders for a target epitope, but prioritizing which candidates to test experimentally remains a major bottleneck. Existing scoring approaches, including physics-based affinity estimators, structure-prediction-derived confidence measures, and inverse-folding likelihood models, provide useful proxy signals but are not explicitly optimized for early enrichment of binders among many structurally similar candidates. Here we investigate epitope-conditioned antibody binder ranking as a dedicated learning problem and introduce AGZArank, a geometric deep learning framework trained with structure-derived synthetic supervision based on normalized pseudo-energy targets. On a benchmark of 45 experimentally validated antibody-antigen interfaces, AGZArank recovered the true binder within the top ten candidates in 44.4% of cases and showed stronger generalization on post-2021 structures than ProteinMPNN, ESM-IF, and PRODIGY. Ablation experiments indicate that ranking performance depends primarily on training scale and alignment between the optimization objective and retrieval-based evaluation, rather than architectural complexity alone. These results support candidate prioritization as a distinct and tractable problem in computational antibody design.
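The retrieval-style metric reported here (true binder recovered within the top ten candidates) can be computed as in the following sketch. The scoring convention (higher is better), the pessimistic tie handling, and the toy scores are assumptions; the abstract does not specify how AGZArank handles ties or how many decoys each case contains.

```python
def rank_of_true_binder(scores, true_index):
    """1-based rank of the true binder; ties are counted against it."""
    target = scores[true_index]
    better_or_equal = sum(
        1 for i, s in enumerate(scores) if i != true_index and s >= target
    )
    return 1 + better_or_equal


def top_k_recovery(cases, k=10):
    """Fraction of benchmark cases whose true binder ranks within the top k."""
    hits = sum(
        1 for scores, true_index in cases
        if rank_of_true_binder(scores, true_index) <= k
    )
    return hits / len(cases)


# Two made-up cases: (candidate scores, index of the true binder).
cases = [([0.9, 0.1, 0.5], 0), ([0.2, 0.8, 0.5], 0)]
print(top_k_recovery(cases, k=1))
```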
bioinformatics2026-06-11v1EditorForge: An Active-Site-Aware Framework for Inverse-Folding-Based Protein Redesign
Chen, A.; Siddiqui, J.; Taucar, W.; Tiralongo, L.; Tkachenko, M.; Xu, A.; Bawa, S.; Guo, S.; Pinska, O.; Rim, J.; Shi, J.; Wang, M.; Zhao, E.Abstract
Inverse-folding models can rapidly generate protein sequences compatible with a supplied backbone, but unconstrained redesign is poorly suited to enzyme and genome-editor-associated domains, where catalytic, substrate-proximal, and conserved structural regions must remain protected. In this paper, we present EditorForge, a modular constraint-and-audit suite for editor-domain protein redesign that wraps fixed-backbone inverse folding with explicit design masks, fixed-position enforcement, active-site-proximity auditing, active-site-shielded regeneration, and downstream structural quality control. Using full-length Moloney murine leukemia virus reverse transcriptase structure 4MH8 (MMLV RT 4MH8) as a demonstration target, EditorForge first restricted redesign to a bounded 25-position envelope while fixing 428 residues. An initial audit detected active-site-proximal failure modes despite fixed-position integrity. The Active Site Shield module then removed five unsafe design positions, replaced them with lower-contact alternatives, and regenerated candidates under stricter constraints. Post Shield Audit evaluated 24 regenerated candidates, all of which satisfied the hard sequence/mask and active-site-shield constraints. For the eight candidates that were selected or returned for structure-prediction/refolding quality control, Enhanced RefoldQC found that all eight predicted structures passed the computational structure-QC screen, with global Cα RMSD values of 1.2061-1.5555 Å, active-site Cα RMSD values of 0.4098-1.8397 Å, mutation-neighborhood Cα RMSD values of 1.3155-1.6848 Å, and average pLDDT-like confidence values of 94.87-95.11.
In short, EditorForge provides a reproducible triage layer that converts general inverse-folding output into constrained and editor-specific candidate sets for downstream structural and biological review on top of existing structural prediction tools.
bioinformatics2026-06-11v1GermRL: Alleviating The Germline Bias In Autoregressive Antibody Language Models Through Reinforcement Learning
Ludwig, L.; Chungyoun, M.; Gray, J. J.Abstract
Antibodies are powerful therapeutics whose antigen specificity arises from sequence diversity shaped during development. Recently, language models trained on large antibody repertoire datasets have enabled the generation and screening of novel candidates, but these models retain a strong germline bias. As AI adoption increases in therapeutic workflows, it is crucial to develop models that harness the diversity of antibodies necessary for the discovery of mutations that encode desirable properties. Previous work explored the germline bias in masked antibody language models, yet the bias in generative autoregressive language models has not yet been addressed. Here, we present GermRL, a lightweight and modular reinforcement learning (RL) framework capable of alleviating the germline bias in pre-trained antibody autoregressive language models through group relative policy optimization (GRPO). GermRL achieves consistent one-shot generation of antibodies that satisfy specified mutation thresholds from germline while maintaining structural plausibility. Under the lowest and highest mutation thresholds tested (5 and 35 mutations from germline), GermRL scores 0.992 and 0.950 pass@1, respectively, compared to 0.398 and 0.034 for the pre-trained language model. Within GermRL, we introduce a key pair of modifications to GRPO that increase training efficiency by discouraging reward hacking under our antibody application. Furthermore, comparison of RL generated and natural antibody sequences reveals how RL based optimization can explore alternative evolutionary mutational patterns and residue compositional strategies while preserving key global properties of natural antibodies, including identifiable germline assignments, embedding-level similarity and comparable developability profiles. 
Thus, RL-trained generative models optimized to promote antibody mutations through diversity from germline provide a promising framework for navigating the antibody sequence landscape, enabling exploration of novel yet biologically plausible candidates for therapeutic design.
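The group-relative step that gives GRPO its name can be sketched as follows. This is a minimal illustration of the standard GRPO advantage (each sampled sequence's reward standardized against the other samples drawn for the same prompt), not GermRL's implementation: the binary `mutation_reward`, the "at least `threshold` mutations" rule, and the toy mutation counts are all assumptions, and the paper's two GRPO modifications are not reproduced here.

```python
import numpy as np

def mutation_reward(n_mutations, threshold):
    """Hypothetical binary reward: 1 if the sequence has at least
    `threshold` mutations from germline, else 0 (an assumption)."""
    return 1.0 if n_mutations >= threshold else 0.0

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within each group (row) of sampled sequences."""
    rewards = np.asarray(rewards, dtype=float)
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

# Toy example: one prompt, four sampled sequences with made-up mutation counts.
mutation_counts = [[3, 7, 12, 5]]
rewards = [[mutation_reward(n, threshold=5) for n in group]
           for group in mutation_counts]
advantages = group_relative_advantages(rewards)

# Fraction of single samples that meet the threshold (a pass@1-style estimate).
pass_at_1 = float(np.mean(rewards))
```

Because advantages are centered within each group, a group in which every sample earns the same reward contributes no learning signal, which is one reason reward design matters in this setting.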
bioinformatics2026-06-11v1A multi-agent system for spine MRI report generation from multi-sequence imaging
Xiao, Z.; Yang, J.; Sun, G.; Zhang, H.; Xu, H.; Yao, Y.; Miller, Z. D.; King, W. E.; Kanani, M. M.; Andre, J. B.; Chu, S.; Zhang, M.; Kinahan, P. E.; Cross, N. M.; Wang, S.Abstract
Spinal pathology is a leading cause of pain and disability worldwide. Spine magnetic resonance imaging (MRI) is central to clinical evaluation, yet its interpretation remains complex and time-consuming, requiring integration of information across multiple imaging sequences and anatomical regions. Despite recent advances in automated MRI analysis, effectively combining multi-sequence data while preserving sequence-specific diagnostic information remains an open challenge. Here we present SpineAgent, a multi-agent framework for spine MRI report generation built upon a multi-sequence foundation model trained on routine clinical data from 32,047 patients and 453,683 MRI series, comprising a total of 13,441,191 MRI slices. To accommodate diverse modalities of sequences, we first pre-train two DINOv3-based encoders separately on T1- and T2-weighted sequences. We then introduce a continual training strategy that learns a synthesizer to embed images of other sequences using the T1 and T2 encoders, producing patient-level embedding that integrates various signals across MRI sequences. Using these embeddings, SpineAgent achieves state-of-the-art performance, with mean 10.8% AUROC improvement across 17 spinal condition-prediction tasks compared to the best competing method, and demonstrates strong generalizability under cross-manufacturer and cross-cohort evaluation. Beyond classification, SpineAgent enables pathology localization by identifying findings-relevant slices and segmenting pathological regions. It also supports multimodal image-report retrieval, providing a solid foundation for scalable and explainable MRI report generation. We further integrate these validated capabilities of SpineAgent into 37 specialized agents for condition diagnosis, pathological-region localization, and clinically-similar-cases retrieval. Finally, we incorporate their outputs as structured tokens within a Medical Report Agent trained end-to-end for report generation. 
Through both automated metrics and expert evaluation by five radiologists, SpineAgent achieves leading performance in spine MRI report generation. Together, SpineAgent introduces a continual training approach for multi-sequence spine MRI understanding. By decomposing report generation into clinically grounded subtasks addressed by specialized agents, the SpineAgent framework enables accurate, interpretable and generalizable spine MRI reporting across diverse imaging sequences and anatomical regions.
bioinformatics2026-06-11v1OCOO-T : A SIMPLE AND SCALABLE VIRTUAL CELL MODEL FOR TRANSCRIPTIONAL PERTURBATION RESPONSE PREDICTION
Zhao, Y.; Lai, L.; Jiang, D.; An, Z.Abstract
Predicting single-cell transcriptional responses to genetic, chemical and cytokine perturbations is a fundamental challenge in computational biology and AI Virtual Cell (AIVC) modeling, with direct implications for drug discovery and the elucidation of gene regulatory networks. Existing approaches often rely on auxiliary cell-state encoders, hierarchical variational autoencoders, dedicated Transformer encoder-decoder modules, or gene-interaction priors to compress high-dimensional expression profiles into latent representations. While effective, these designs increase architectural complexity and may limit scalability and generalizability. This paper introduces OCOO-T, a minimalist flow-matching-based AIVC model for transcriptional perturbation response prediction. OCOO-T utilizes a vanilla Transformer stack that operates directly on continuous gene expression profiles and formulates perturbation response prediction as a continuous-time denoising process. Perturbation embeddings, dosage information, and cell-line/cell-type specificity are integrated through adaptive layer normalization and in-context tokens. Comprehensive evaluations on Tahoe100M, Replogle, and PBMC benchmarks demonstrate that OCOO-T achieves state-of-the-art performance across diverse perturbations and cell types while effectively scaling to long transcriptional profiles through patching and depatching of cellular contexts. By leveraging the simplicity of Transformer-based denoising for single-cell omics, OCOO-T provides an effective and scalable framework for in-silico cellular simulation.
bioinformatics2026-06-11v1Pillbox: A Leakage-Aware Foundation-Model Predictor and Lineage-Ceiling Diagnostic for Cancer Drug Response
Hill, J. J. K.; Ryoo, H. J.; Ghanta, A.; Singh, S.; Anders, D.; Jiao, E.; Jeong, J.Abstract
We present Pillbox, a predictor whose pipeline is audited against the six Asiaee leakage modes with the one residual pathway shown by per-fold ablation to be non-load-bearing on hard splits. Our model combines CpGPT methylation embeddings, CLAMP drug embeddings, and per-fold-fit gene-expression principal components which are fused by Feature-wise Linear Modulation (FiLM)-conditioned graph attention on the STRING v12 protein-protein interaction graph. Then we alpha-ensemble the model against a histogram-based gradient boosting regressor baseline. On GDSC GSE68379 (987 cell lines, 375 drugs) across seeds 42, 7, and 123, the ensemble reaches test R-Squared of 0.78, 0.77, and 0.76 on random, histology-blind, and site-blind splits respectively, with cell-aware lifts above the drug-mean floor of +0.054, +0.060, and +0.037. As a quantitative diagnostic for feature-stack saturation we propose the cross-architecture residual correlation, calibrated against a same-architecture-different-initialization control. On histology-blind splits the cross-architecture value of 0.939 falls short of the same-architecture ceiling of 0.974 by approximately 0.03 in residual correlation, a gap we interpret as the headroom available to architecture choice on top of the current foundation-model representation and consistent with the long-established observation that tissue lineage dominates cell-line drug response. We integrated curated mutation, methylation, and drug-target-expression channels, but these do not improve prediction once foundation-model embeddings are in place. Cross-screen validation against PRISM matches the GDSC-to-PRISM measurement reproducibility ceiling within 0.01 Spearman.
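A residual-correlation diagnostic of the kind described above can be sketched in a few lines. This assumes the quantity is the Pearson correlation between the residuals of two models on the same test points; the function name and the synthetic data are illustrative and not taken from the Pillbox code.

```python
import numpy as np

def residual_correlation(y_true, pred_a, pred_b):
    """Pearson correlation between the residuals of two predictors."""
    y_true = np.asarray(y_true, dtype=float)
    res_a = y_true - np.asarray(pred_a, dtype=float)
    res_b = y_true - np.asarray(pred_b, dtype=float)
    return float(np.corrcoef(res_a, res_b)[0, 1])

# Synthetic example: both models share an error component (what the
# features cannot explain) plus a small model-specific component.
rng = np.random.default_rng(0)
y = rng.normal(size=200)
shared_error = rng.normal(scale=0.5, size=200)
pred_a = y + shared_error + rng.normal(scale=0.1, size=200)
pred_b = y + shared_error + rng.normal(scale=0.1, size=200)

r_cross = residual_correlation(y, pred_a, pred_b)
r_self = residual_correlation(y, pred_a, pred_a)
```

A value close to the same-architecture control suggests that the two models make nearly the same errors, so the remaining error is more plausibly limited by the input features than by the architecture.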
bioinformatics2026-06-11v1STITCH links cellular morphology and gene expression in spatial transcriptomics
Kumar, S.; Shi, Y.; Vallius, T.; Day, C.-P.; Absil, P.- A.; Srivastava, A.; Hannenhalli, S.; Gopalan, V.Abstract
In situ spatial (ISS) sequencing can uncover co-variation between cellular morphology and gene expression in vivo. However, a principled and interpretable mathematical representation of morphology has not yet been applied in this context. In particular, current deep learning-based representations of cell images confound a cell's shape with its size. We present an interpretable representation of cellular boundary contours, based on tangent principal component analysis (TPCA) in a Kendall shape manifold, that captures size-independent contour shape features. This approach recovers shape-perturbing genes in an RNAi screen more successfully than a previous metric geometry-based approach. We build on TPCA to develop STITCH (Shape-TranscriptomIc Correlation and Harmonization), an approach to reveal covariation between cell morphology and gene expression in ISS datasets. In a Xenium dataset, STITCH outperforms a deep learning-based approach in recovering both the layered organization of keratinocytes and a spatial gradient in nuclear eccentricity. Across samples in a melanoma CosMx dataset, STITCH reproducibly associates elongated and triangular fibroblasts with proximity to malignant cells and a myofibroblast-like transcriptional program. Finally, STITCH independently recovers a known link between mesenchymal-like malignant cell states and increased cell area in two melanoma cohorts. STITCH can thus yield interpretable morphology-transcriptome relationships across cell types, patients, and spatial transcriptomics platforms.
bioinformatics2026-06-11v1ANCHOR: haplotype-aware allelic and isoform inference from single-cell long-read RNA sequencing with de novo variant calling
Fu, Z.-C.; Zhang, C.; Yan, Y.; Xu, Y.; Yin, X.; Tao, T.; Lu, P.; Liang, Y.; Wu, H.; Cui, W.; Hou, R.; Chen, X.; Ke, Y.; Li, Y.; Chen, Z.-J.; Huang, T.; Wu, K.; Yuan, S.Abstract
Long-read RNA sequencing enables haplotype- and isoform-resolved allelic analysis of transcriptomes, yet extending this capability to single cells and distinct cell types remains computationally challenging due to sparse coverage, sequencing errors, incomplete variant information, and reference-biased transcript assignment. Here we present ANCHOR, a haplotype-aware framework for single-cell long-read RNA sequencing that performs de novo expressed-variant discovery, molecule-level haplotype assignment and isoform-resolved allelic quantification. ANCHOR combines a signed-graph variant caller, pair hidden Markov modelling and beta-binomial UMI aggregation to infer parental allele counts for genes and splice-resolved isoforms, without requiring a pre-existing phased genotype or deep learning. In human single-cell long-read RNA benchmarks, ANCHOR improved variant-calling performance over tested long-read RNA callers at single-cell and low-to-moderate coverage, and its beta-binomial model reduced depth-driven false positives in allele-specific expression testing. Applied to newly generated single-cell long-read RNA-seq data from reciprocal mouse crosses during gastrulation, ANCHOR resolved cell-type- and isoform-specific parent-of-origin imprinting and identified an antagonistic maternally biased Sgce isoform. ANCHOR provides a general framework for allele- and isoform-resolved analysis of diploid single-cell long-read transcriptomes.
bioinformatics2026-06-11v1Sequence-Based Therapeutic Peptide Classification with Augmented Negative Sampling
Ellerbrock, R.; Valentini, A.; Paul, A. C.; Mukhopadhyay, S.; Perelshtein, M. R.Abstract
Therapeutic peptides offer high target specificity, low toxicity, and the ability to modulate protein-protein interactions, yet experimental functional characterization remains costly and slow. Computational prediction of therapeutic function directly from sequence could accelerate peptide screening and enable generative design pipelines, but requires reliable discrimination between therapeutic and non-therapeutic peptides. Existing multi-label predictors cover few functions, rely on limited datasets, and exhibit high false positive rates (FPRs), limiting their practical utility. We present a lightweight CNN classifier trained on the most comprehensive therapeutic peptide database to date (54,655 peptides, 48 functional categories). A key contribution is a statistically motivated negative sampling strategy using Markov models to generate diverse synthetic decoys at multiple difficulty levels. When evaluated on this controlled decoy benchmark, the FPR is reduced from over 60% for previous models to 2.1% for our approach. Our fine-tuned five-model ensemble achieves 78.9% Micro F1 and 54.6% Macro F1 while requiring only amino acid sequences as inputs. Analysis using a sparse L1-constrained variant of our model shows that convolutional filters capture conserved functional motifs and statistically improbable non-therapeutic patterns, with downstream layers combining these signals, providing mechanistic evidence that the network learns biologically meaningful structure. In a generalization task on the TPpred-LE benchmark, our model achieves 55.3% Micro F1 and 38.6% Macro F1, comparable to TPpred-LE trained on its native dataset (57.9%/38.1%) while predicting four times more therapeutic functions with four times fewer parameters. Code and models will be made available at https://github.com/terra-quantum-public/tq-therapep-ai.
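To make the idea of Markov-model decoys concrete, here is a minimal sketch that fits a first-order Markov chain to a set of sequences and samples synthetic negatives from it. The model order, the restart rule for dead ends, and the toy training strings are all assumptions; how the authors define their difficulty levels is not specified in the abstract.

```python
import random
from collections import defaultdict

def fit_markov(sequences):
    """Count start residues and residue-to-residue transitions."""
    starts = defaultdict(int)
    transitions = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        if not seq:
            continue
        starts[seq[0]] += 1
        for current, following in zip(seq, seq[1:]):
            transitions[current][following] += 1
    return starts, transitions

def _draw(counts, rng):
    """Draw one symbol with probability proportional to its count."""
    symbols = list(counts)
    weights = [counts[s] for s in symbols]
    return rng.choices(symbols, weights=weights, k=1)[0]

def sample_decoy(starts, transitions, length, rng):
    """Sample a decoy of the requested length from the fitted chain."""
    seq = [_draw(starts, rng)]
    while len(seq) < length:
        successors = transitions.get(seq[-1])
        # If the last residue was never seen with a successor,
        # fall back to the start distribution (an arbitrary choice).
        seq.append(_draw(successors if successors else starts, rng))
    return "".join(seq)

# Toy strings for illustration only, not real peptides.
training = ["ACDKK", "KKLAC", "DKLLA"]
starts, transitions = fit_markov(training)
rng = random.Random(0)
decoy = sample_decoy(starts, transitions, length=8, rng=rng)
```

One plausible way to vary decoy difficulty under this scheme would be to change the Markov order, since higher-order chains reproduce more of the local sequence statistics of real peptides; that is a guess at the design space rather than a description of the paper's method.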
bioinformatics2026-06-11v1HoloCell: A Generative Foundation Model for Holistic Cellular Modeling
Jiang, Q.; Li, Z.; Hu, B.; Bie, Y.; Li, K.; Li, Q.; Jin, P.; He, Y.; Deng, P.; Wang, Z.; Chen, X.; Qin, T.; Liu, H.; Jiang, R.; Yin, Q.Abstract
Single-cell multi-omics technologies have recently advanced to enable the profiling of epigenomic, transcriptomic, and proteomic layers within individual cells, offering new opportunities to characterize cellular states as integrated biological systems. However, developing a unified framework that can seamlessly integrate diverse omics modalities and remain robust to heterogeneous modality missingness remains challenging. Here we present HoloCell, to our knowledge the first generative foundation model for joint representation learning and generative modeling across all three major single-cell omics modalities, i.e., epigenomics, transcriptomics, and proteomics. HoloCell contains over 860 million parameters and is pretrained on the Human-Multi-Omics-Corpus, which comprises approximately 468 million single-cell profiles across these three omics layers, corresponding to over 425 billion tokens. HoloCell introduces a simple yet biologically grounded hierarchical tokenization strategy that encodes cis-regulatory elements, genes, and proteins as structured tokens within a shared modeling framework. We evaluated HoloCell across single-omics representation learning, paired multi-omics integration, unpaired multi-omics alignment, and cross-modal generation via iterative diffusion and remasking, demonstrating its superior performance and flexibility across diverse omics tasks. From a representation perspective, HoloCell provides a unified digital mapping of cellular states across multiple omics layers, capturing cell heterogeneity as an integrated system. From a generation perspective, its iterative diffusion and remasking framework accounts for the inherently unordered nature of biological features, enabling in silico simulation of multi-omics information flow. 
Together, these capabilities position HoloCell as a versatile foundation model toward the emerging concept of a virtual cell, offering both systematic characterization and generative simulation of cellular systems within a unified framework.
bioinformatics2026-06-11v1Combinatorial docking and molecular generation to navigate over 100-billion molecules for prospective ligand discovery
Zhang, J.; Yang, C.; Zhang, Y.; Chen, X.; Lam, B.; Bryant, C.; Pidathala, S.; Wang, Y.; Moroz, Y.; Radchenko, D.; Alon, A.; Lee, C.-H.; Zhang, Z.; Lyu, J.Abstract
Commercially available make-on-demand libraries now exceed 100 billion compounds, requiring over 50 years to screen on 2,000 CPU cores using conventional docking. We present two complementary approaches to address this challenge. CombiDOCK, a combinatorial docking framework, enables exhaustive screening at the 100-billion scale within 40 days. MINT-Dock, a generative framework, accelerates navigation of this space by integrating CombiDOCK with Monte Carlo Tree Search. Benchmarked on 46 diverse targets, CombiDOCK matched full-molecule docking accuracy, and MINT-Dock achieved a 4,800-fold enrichment over random selection. Compared with prior billion-scale brute-force campaigns against σ2, VMAT2, and VAChT, prospective CombiDOCK screens of the 100-billion-molecule library yielded higher hit rates and more potent ligands, while MINT-Dock achieved comparable outcomes across single- and multi-target objectives with >20-fold computational cost reductions. Docking-predicted poses of the best VAChT-binding compounds were confirmed by cryo-EM structures. These methods provide exhaustive and generative paths for navigating the trillion-molecule frontier of drug discovery.
bioinformatics2026-06-11v1A high-quality chromosome-scale reference genome assembly for Asparagus racemosus var. CIM-Shakti (Shatavari), a medicinal plant of Ayurvedic importance
Tyagi, S.; Sharma, A.; Shivani, K.; Gupta, V.; Paterson, A. H.; Trivedi, P. K.Abstract
Asparagus racemosus Willd., commonly known as Shatavari, is an important medicinal plant in Ayurveda and is valued for its steroidal saponins, particularly shatavarin compounds, which contribute to its adaptogenic, galactagogue, immunomodulatory, and therapeutic properties. Despite its medicinal and economic importance, genomic resources for this species have remained limited, restricting molecular breeding, pathway discovery, and comparative evolutionary studies within Asparagaceae. Here, we report a high-quality chromosome-scale reference genome assembly of A. racemosus var. CIM-Shakti generated using PacBio HiFi long-read sequencing and Omni-C chromatin conformation scaffolding. The pseudo-haploid assembly spans 817 Mb across 53 scaffolds, with a scaffold N50 of 98.50 Mb, L50 of 5, and a largest scaffold of 113.80 Mb. Ten major chromosome-scale pseudomolecules were resolved, corresponding to the haploid chromosome complement of A. racemosus. The assembly showed high gene space completeness, with BUSCO completeness of 99.8% against the Eukaryota dataset and 98.0% against the Embryophyta dataset. BlobToolKit profiling further supported assembly quality, with GC content of approximately 39 to 40% and no major evidence of contamination. EDTA-based repeat annotation identified 580.93 Mb of interspersed repetitive elements, accounting for 71.06% of the 817.57 Mb genome assembly. The repeat landscape was dominated by LTR retrotransposons, particularly Gypsy elements, which accounted for 25.01% of the assembly, followed by unclassified LTR elements at 26.58% and Copia elements at 4.84%. Structural and functional annotation identified 29,199 protein-coding genes represented by 29,199 transcript models, 138,433 exons, and 125,201 CDS features. The annotation was structurally robust, with an average gene length of 4,605.1 bp, 4.74 exons per transcript, and 97.80% of transcripts containing multiple exons.
The CIM-Shakti reference genome provides a foundational genomic resource for investigating steroidal saponin biosynthesis, sex chromosome evolution, repeat-driven genome expansion, and comparative genomics in Asparagaceae. This assembly will support future studies on medicinal trait improvement, conservation genomics, and genomics-assisted breeding of climate-resilient Shatavari cultivars.
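For readers less familiar with the contiguity statistics quoted above, the following sketch shows how N50 and L50 are conventionally computed from a list of scaffold lengths. The scaffold lengths are made-up toy values, not the CIM-Shakti scaffolds; only the repeat-fraction arithmetic uses numbers from the abstract.

```python
def n50_l50(lengths):
    """Return (N50, L50): the length of the scaffold at which the running
    total of length-sorted scaffolds first reaches half the assembly size,
    and the number of scaffolds needed to get there."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise ValueError("lengths must be non-empty")

# Toy scaffold lengths (arbitrary units), for illustration only.
toy_lengths = [100, 80, 60, 40, 20]
n50, l50 = n50_l50(toy_lengths)

# Repeat fraction recomputed from the figures reported in the abstract.
repeat_pct = round(580.93 / 817.57 * 100, 2)
```

An L50 of 5 together with ten chromosome-scale pseudomolecules therefore means that the five largest scaffolds alone cover at least half of the assembled sequence.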
bioinformatics2026-06-11v1VFUSE: Virulent Feature Understanding with Sparse autoEncoders
Olson, M. L.; Yu, M.Abstract
Generative models have shown remarkable progress in a variety of domains such as protein design, but such power enables the opaque generation of hazardous proteins. In this work, we introduce VFUSE (Virulent Feature Understanding with Sparse autoEncoders), a mechanistic interpretability approach that trains sparse autoencoders (SAEs) on diffusion-transformer activations to audit protein models for hazard-aware features. We apply VFUSE to RoseTTAFold3 and RFDiffusion3, popular open-weight models for protein folding and synthesis. We find that for certain blocks, linear probes detect hazardous designs significantly better when fit in the SAE latent space than in the original model's representations, improving interpretability without sacrificing model performance. Furthermore, we identify monosemantic features from the SAE that fire only on hazardous designs at up to AUROC 0.84 (q < 10^-13).
bioinformatics2026-06-11v1TifBERT: a self-supervised foundation model for normalization-robust bulk RNA-seq representation learning
Hosseini, S.; Sharma, D.Abstract
Bulk RNA sequencing remains central to translational genomics, yet foundation-model development has largely focused on single-cell data. Existing transformer approaches for bulk RNA-seq often rely on expression discretization, numerical reconstruction, external gene embeddings, or restricted gene sets, limiting robustness across normalization schemes and cohorts. Here, we introduce TifBERT, a self-supervised framework for full-transcriptome bulk RNA-seq representation learning. TifBERT converts each unordered expression profile into a sample-specific gene sequence using term frequency-inverse document frequency (TF-IDF) ordering, prioritizing genes that are both highly expressed within a sample and selectively expressed across the cohort. It is then pretrained using masked gene modeling, predicting gene identities from transcriptomic context rather than reconstructing expression values. Pretrained on harmonized TCGA Pan-Cancer data spanning five RNA-seq normalization schemes, TifBERT learns contextual representations across approximately 10,000 genes without expression binning, landmark-gene restriction, or external biological embeddings. Across 33 TCGA cancer types, TifBERT achieved 90.83% accuracy, 0.996 macro AUC-ROC, and 0.903 MCC. It also captured pathway-level biology, achieving mean sample-wise and pathway-wise Pearson correlations of 0.754 and 0.762 across 1,387 PARADIGM pathway activities. Independent evaluation on GTEx healthy tissues showed preservation of tissue-level transcriptomic structure without retraining. In comparison with existing models, TifBERT achieves competitive subtype discrimination with substantially greater stability and produces markedly richer embedding geometry (effective rank 95.6 versus 6.3), without requiring expression discretization or in-distribution pretraining exposure. Together, TifBERT provides a scalable, normalization-independent foundation model for reusable bulk transcriptomic representation learning.
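The TF-IDF ordering step can be illustrated with a small sketch that turns an expression matrix into per-sample gene sequences. The exact weighting used by TifBERT is not given in the abstract, so the term-frequency definition (a gene's share of the sample total), the smoothed IDF, the "expressed" threshold, and the function name `tfidf_gene_order` are assumptions made for illustration.

```python
import numpy as np

def tfidf_gene_order(expr, gene_names, threshold=0.0):
    """Rank genes per sample by TF-IDF score, highest first.

    expr: samples x genes matrix of non-negative expression values
    (every sample is assumed to have a non-zero total).
    """
    expr = np.asarray(expr, dtype=float)
    n_samples = expr.shape[0]
    # Term frequency: each gene's share of the sample's total expression.
    tf = expr / expr.sum(axis=1, keepdims=True)
    # Document frequency: number of samples in which the gene is expressed.
    df = (expr > threshold).sum(axis=0)
    # Smoothed inverse document frequency (one common variant).
    idf = np.log((1 + n_samples) / (1 + df)) + 1.0
    scores = tf * idf
    order = np.argsort(-scores, axis=1, kind="stable")
    return [[gene_names[j] for j in row] for row in order]

# Toy cohort: gene A is expressed everywhere, B and C are selective.
genes = ["A", "B", "C"]
expr = [[5, 5, 0],
        [5, 0, 1],
        [5, 0, 0]]
sequences = tfidf_gene_order(expr, genes)
```

In the first toy sample, A and B have equal expression, but B is placed first because it is expressed in fewer samples, which is the "highly expressed and selectively expressed" prioritization the abstract describes.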
bioinformatics2026-06-11v1Beyond single markers: bacterial synergies identified by Multidimensional Feature Selection reveal conserved microbiome disease signatures
Zielinska, K.; Rudnicki, W.; Labaj, P. P.Abstract
The gut microbiome encodes disease-relevant information not only in the abundance of individual taxa and functions, but in the way they co-occur and interact. Yet metagenomic analyses have largely relied on univariate approaches that evaluate features in isolation, systematically overlooking the combinatorial signals that arise from microbial co-occurrence. Here, we introduce a framework based on the Multidimensional Feature Selection (MDFS) algorithm to identify synergistic feature pairs - combinations of taxa and functions whose joint predictive relevance substantially exceeds that of either constituent alone, including features that carry no individual signal and would be discarded by any conventional analysis. We first validated the approach on a meta-analysis of colorectal cancer (CRC) cohorts - one of the most competitive microbiome classification benchmarks available - using a leave-one-cohort-out cross-validation framework. Our framework matched state-of-the-art classification performance (AUC = 0.85) while simultaneously revealing microbial interactions that are structurally inaccessible to univariate methods. A subset of high-stability synergistic pairs showed consistently elevated model selection frequencies and robust discriminatory power across independent cohorts, confirmed under stringent per-cohort effect size testing. Extending the framework to 20 disease cohorts spanning inflammatory bowel disease, type 2 diabetes, liver cirrhosis, and atherosclerotic cardiovascular disease, we identified thousands of high-impact synergistic interactions and 21 conserved cross-cohort markers. Across all contexts examined, synergistic pairs substantially outperformed their individual constituents, establishing microbial co-occurrence as a reproducible and biologically informative axis of disease-associated variation that univariate approaches are structurally unable to detect. The framework is freely available at https://github.com/Kizielins/MDFS_synergies. 
Importance: Most microbiome studies search for individual gut bacterial species associated with disease. However, bacteria do not act in isolation, and their combined presence or relative balance may be far more informative than any single microbe considered alone. This study presents a computational framework that identifies pairs of gut microorganisms whose co-occurrence or relative abundance carries substantially greater predictive signal than either constituent feature independently. Applied to stool metagenomic data from patients with colorectal cancer and more than a dozen additional conditions, we demonstrate that these synergistic interactions are widespread, reproducible across independent patient cohorts, and reveal disease-relevant microbial relationships that standard analyses miss entirely. Our framework offers a more complete view of how the gut microbiome is altered in disease and provides a principled basis for identifying robust, interaction-based biomarkers.
bioinformatics2026-06-10v3SLiMNet: a deep learning model to detect short linear motifs using protein large language model representations and paired inputs
McFee, M. C.; Kim, P. M.Abstract
Short linear motifs (SLiMs) are short (3-15 amino acids in length) segments within intrinsically disordered regions (IDRs) that mediate transient protein-protein interactions as well as other functions such as stability and subcellular localization. Only a few thousand out of likely hundreds of thousands have been experimentally validated. SLiMs can be detected as conserved regions inside of IDRs using local alignments, though current approaches have limited sensitivity and specificity and are unable to functionally annotate their hits. Assigning function is hence a major outstanding issue in SLiM biology. Here we present SLiMNet, a deep learning model inspired by siamese networks and contrastive learning that predicts functional similarity in pairs of SLiMs. SLiMNet uses protein large language model embeddings and is trained on annotated sets of SLiMs. We show that it detects shared function in unseen, non-redundant motif pairs, and its scores correlate with experimental binding strengths from deep mutational scanning of cyclin-binding motifs. Using SLiMNet we provide repositories of putative SLiM pairs derived from annotated IDR regions to help with hypothesis generation for the functional annotation of SLiMs. This includes an atlas generated from all-by-all scoring of 16-mers from tiled IDRs from the DisProt database. We show that it captures a new nuclear localization motif recently added to MoMaP and a PRMT1 methylation motif in the literature. We also provide a repository of all IDRs scored with SLiMNet against all MoMaP instances, and an atlas of potential functional pairs for 256 known orphan motifs (motifs with only a single known instance with essential function). Collectively, these atlases are useful resources for the SLiM biology community.
bioinformatics2026-06-10v3Depth normalization for single-cell genomics count data
Booeshaghi, A. S.; Hallgrimsdottir, I. B.; Galvez-Merchan, A.; Pachter, L.Abstract
Single-cell genomics analysis requires normalization of feature counts that stabilizes variance, accounts for variable cell sequencing depth, and preserves monotonicity of within-cell feature abundances. We show that normalization via an (additive in the raw counts) proportional fitting step followed by the logarithm and then another (multiplicative in the raw counts) proportional fitting step (PFlogPF) is the only feature-relabeling-equivariant method satisfying the three desiderata. We demonstrate superior performance of this method, which is equivalent to a shifted centered-log ratio transform, in comparison to other normalizations on numerous benchmarks across hundreds of single-cell RNA-seq datasets.
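The PFlogPF recipe named above can be sketched directly: proportional fitting, a logarithm, then proportional fitting again. This is a minimal sketch under stated assumptions: each proportional fitting step rescales every cell to the mean total across cells, the logarithm is `log1p`, and no cell has zero total counts. The abstract does not spell out these details, and the count matrix is a toy example.

```python
import numpy as np

def proportional_fitting(x):
    """Rescale each row (cell) so its total equals the mean row total."""
    totals = x.sum(axis=1, keepdims=True)
    return x * (totals.mean() / totals)

def pf_log_pf(counts):
    """Proportional fitting, log1p, then proportional fitting again."""
    counts = np.asarray(counts, dtype=float)
    return proportional_fitting(np.log1p(proportional_fitting(counts)))

# Toy cells x features count matrix with very different depths.
counts = np.array([[10, 0, 5],
                   [100, 20, 30],
                   [1, 1, 2]])
normalized = pf_log_pf(counts)
row_totals = normalized.sum(axis=1)
```

After the second step every cell has the same total, and because each step applies a monotone transformation within a cell, the within-cell ordering of feature abundances is preserved.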
bioinformatics2026-06-10v3RingNet: An Interactive Platform for Multi-Modal Data Visualization in Networks
Zhang, L.; Lai, X.Abstract
The exponential growth of data in biomedicine has created an urgent need for intuitive visualization tools. These tools must be able to effectively represent complex biological networks and remain accessible to domain experts without extensive computational training. Current network visualization approaches often require specialized programming skills and/or cannot handle the scale and complexity of modern biomedical datasets, which creates significant barriers to biological discovery. We develop RingNet, a web-based interactive visualization tool that integrates computational efficiency with flexible, user-driven exploration. This tool addresses the community's need to visualize multi-modal datasets within a single, compact network representation, as well as identify patterns of interest in complex data. RingNet uses an R backend for network computation and coordinate optimization. This generates JSON data structures that feed into a JavaScript and HTML frontend, which provides real-time, interactive visualization functions. It offers dynamic layout adjustments, node and edge filtering, and customizable color schemes for representing data. It can export reproducible, publication-ready figures in SVG and PNG formats. In our case studies, we use RingNet to visualize breast cancer patients' omics profiles in a gene regulatory network and a cell-to-cell communication network in atopic dermatitis. This demonstrates RingNet's ability to reveal biological relationships across multiple data modalities. RingNet lowers the barrier to exploring, analyzing, and communicating data-driven findings, thereby accelerating research.
bioinformatics2026-06-10v3jsPCA: fast, scalable, and interpretable identification of spatial domains and variable genes across multi-slice and multi-sample spatial transcriptomics data
Assali, I.; Escande, P.; Picard, F.; Villoutreix, P.Abstract
Spatially structured cell heterogeneity within tissues is essential for healthy organ function. This heterogeneity is reflected by differential gene expression activity at various spatial locations. Spatial transcriptomics technologies record genome-wide measurements of gene expression at the scale of entire tissues with high spatial resolution. While they have revolutionized our quantitative understanding of tissue architecture, these technologies generate large and high-dimensional datasets encompassing tens of thousands of genes recorded at tens of thousands of spatial locations, requiring efficient automated methods for their analysis. In this study we introduce joint spatial PCA (jsPCA), a novel, fast, scalable and interpretable method for the automatic identification of spatial domains and variable genes in multi-slice and multi-sample spatial transcriptomics data. jsPCA relies on a simple mathematical formulation of a spatial covariance defined as the product of the gene expression covariance with the spatial autocorrelation. The principal components of this spatial covariance yield a biologically meaningful low-dimensional representation. From this representation, we can derive spatial domains by simple clustering. In addition, spatially variable genes can be identified directly from the principal component coefficients. Moreover, this approach enables the joint representation of multiple slices and samples, a frequent experimental setting. This joint representation is obtained without spatial alignment by computing common principal components via joint diagonalization of the set of spatial covariance matrices obtained for each slice. By leveraging sparsity and non-convex optimization on manifolds, jsPCA leads to computing times on the order of seconds to minutes, substantially outperforming state-of-the-art approaches.
We benchmarked jsPCA on the Visium 10x dataset of human dorsolateral prefrontal cortex and the Stereo-seq MOSTA dataset of mouse embryonic development against 10 state-of-the-art methods. Our approach demonstrated excellent performance, comparable to or better than state-of-the-art methods such as SpatialPCA, BASS, GraphPCA or Stagate, while being much faster, interpretable, and scalable to very large datasets.
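To make the idea of a spatial covariance concrete, the following single-slice sketch combines centered expression with a spatial weight matrix and takes its leading eigenvectors. It is one plausible reading of the description above, not the jsPCA formulation itself: the Gaussian kernel, the `bandwidth` parameter, the form `Xc^T W Xc`, and the function names are all assumptions, and the multi-slice joint diagonalization step is not covered.

```python
import numpy as np

def gaussian_weights(coords, bandwidth=1.0):
    """Symmetric spatial weights between spots (assumed Gaussian kernel)."""
    sq_dist = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=-1)
    weights = np.exp(-sq_dist / (2.0 * bandwidth ** 2))
    np.fill_diagonal(weights, 0.0)   # ignore self-pairs
    return weights / weights.sum()   # normalize so the weights sum to 1

def spatial_pca(expr, coords, n_components=2, bandwidth=1.0):
    """Eigen-decompose a spatially weighted gene-gene covariance.

    expr:   spots x genes expression matrix
    coords: spots x 2 spatial coordinates
    """
    centered = expr - expr.mean(axis=0)
    weights = gaussian_weights(coords, bandwidth)
    spatial_cov = centered.T @ weights @ centered   # genes x genes, symmetric
    evals, evecs = np.linalg.eigh(spatial_cov)
    order = np.argsort(evals)[::-1][:n_components]  # largest eigenvalues first
    loadings = evecs[:, order]                      # gene coefficients
    scores = centered @ loadings                    # per-spot representation
    return evals[order], loadings, scores

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grid = np.array([(i, j) for i in range(10) for j in range(10)], dtype=float)
    expr = rng.normal(size=(100, 5))
    expr[:, 0] = grid[:, 0]  # gene 0 follows a spatial gradient
    evals, loadings, scores = spatial_pca(expr, grid)
    print(np.abs(loadings[:, 0]).round(2))
```

In this toy setting, the gene with a spatial gradient carries the largest coefficient on the first component, which mirrors how spatially variable genes could be read off the loadings; the per-spot scores could then be passed to any standard clustering routine to propose domains.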
bioinformatics2026-06-10v2Candidate Molecular Subtypes of Cognitive Resilience in Alzheimer's Disease: A Multi-Cohort Machine Learning and Neuroimaging Study
Kitani, A.; Matsui, Y.Abstract
Background: Cognitive resilience (CR) in Alzheimer's disease (AD) refers to preserved cognitive function despite substantial AD pathology. Diverse biological processes have been implicated in CR, including synaptic maintenance, neuroimmune regulation, and metabolic homeostasis. However, how these mechanisms are organized into molecularly distinct CR subtypes and relate to clinical and neuroanatomical heterogeneity remains unclear. Here, we applied a machine learning framework to multi-cohort transcriptomic, proteomic, and neuroimaging data to investigate molecular subtypes of CR in AD. Methods: RNA-seq data from the Religious Orders Study and Memory and Aging Project (ROSMAP) cohort were used to train machine learning models classifying individuals with AD pathology as CR or non-CR based on residual-based resilience scores. Model development and performance estimation used nested cross-validation to minimize information leakage. Final ROSMAP-trained models were evaluated in the independent Mount Sinai Brain Bank (MSBB) cohort. Model-derived genes were used for biological interpretation and hierarchical clustering of CR individuals. The subtype structure was further evaluated in the Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort using cerebrospinal fluid proteomics, MRI-derived brain measures, and longitudinal MMSE data. Results: Machine learning models showed modest but consistent predictive performance in ROSMAP, with out-of-fold AUROC values of 0.644 to 0.688. In the independent MSBB full cohort, AUROC values were 0.586 to 0.659, with improved discrimination in a top/bottom quartile analysis. Hierarchical clustering identified two major molecular subgroups among CR individuals in ROSMAP/MSBB RNA-seq data. A reduced 22-gene/protein signature showed a partial, cluster-like resemblance to this structure in ADNI cerebrospinal fluid proteomics.
In ADNI, both projected CR subtypes showed preserved brain tissue-volume profiles and slower longitudinal MMSE decline compared with non-CR participants, whereas clear differences between CR subtypes were not observed. Differential CSF proteomic analysis suggested partially distinct molecular characteristics. Conclusions: These findings suggest that CR in AD may encompass molecularly heterogeneous, subtype-like profiles that converge on broadly preserved brain structure and slower cognitive decline. Our results provide a candidate framework for stratifying resilience-associated molecular phenotypes in AD and warrant prospective and experimental validation. We also developed the Resilience Gene Analyzer, a web-based platform for visualizing gene-level contributions to CR prediction (https://igcore.cloud/GerOmics/REsilienceGeneAnalyzer/).
bioinformatics2026-06-10v2Multi-level, multi-body atomic interaction graphs for machine learning-based prediction of protein-ligand binding energies
Le, T. T. H.; Nguyen, B. T.; Vo, H.; Nguyen, N. H.; Nguyen, D. D.Abstract
Accurate prediction of binding affinity is crucial for rational drug design and discovery. Traditional computational methods often rely on complex scoring functions that incorporate a multitude of physical and chemical descriptors, leading to high computational demands and sometimes limited generalizability. In this work, we propose GMI-Score, a novel scoring function that models multi-level, multi-body atomic interactions using graph-based representations. Our method constructs comprehensive interaction graphs that incorporate both pairwise and triplet-wise atomic features, which help capture cooperative spatial patterns essential for binding affinity prediction. By employing a feature fusion strategy, GMI-Score maintains model simplicity while enhancing accuracy. Extensive evaluation across multiple datasets, such as PDBbind v2013, PDBbind v2016, PDBbind v2020, CSAR-NRC-HiQ, and PDBbind-Redocked, demonstrates that our model consistently outperforms state-of-the-art scoring functions, achieving Pearson correlation coefficients up to 0.877. Furthermore, it retains strong predictive power under strict data leakage controls and realistic docking conditions, highlighting its robustness and generalizability.
bioinformatics2026-06-10v2Advances in protein function prediction from the fifth CAFA challenge
De Paolis Kaluza, M. C.; Ramola, R.; Joshi, P.; Piovesan, D.; Reade, W.; Orchard, S.; Martin, M. J.; Ignatchenko, A.; Rost, B.; Orengo, C. A.; Robinson-Rechavi, M.; Durand, D.; Brenner, S. E.; Greene, C. S.; Mooney, S. D.; Friedberg, I.; Radivojac, P.Abstract
The Critical Assessment of Function Annotation (CAFA) is a long-standing community effort to independently assess computational methods for protein function prediction, to highlight well-performing methodologies, to identify bottlenecks in the field, and to provide a forum for the dissemination of results and exchange of ideas. In its fifth round (CAFA 5) of triennial challenges, a partnership with Kaggle Inc. facilitated participation from a large community of data scientists and computational biologists through a competitive prospective challenge on the crowdsourcing platform. In this work, we present an in-depth analysis of the submitted predictions and report improvements in accuracy over all methods from the previous CAFA challenges. We further introduce a new evaluation setting for proteins with pre-existing (incomplete) annotations and identify the need for methods that better leverage existing annotations to predict those that will be discovered later. Finally, we characterize the prospective evaluation framework by examining performance on a strict set of unpublished annotations and across intermediate database releases. Our results indicate that recent developments in the field, such as the availability of protein language models and accurately predicted 3D structures, as well as the growth of experimental annotations through biocuration, have all contributed to performance improvements.
bioinformatics2026-06-10v2Transcriptomic profiling reveals multiple mechanisms of insecticide resistance in Aedes aegypti from Angola
Youd, H. A.; Ooi, J. M. F.; Muhammad, A.; Paine, M. J. I.; Lucas, E. R.; Grau-Bove, X.; Grigoraki, L. R.; Troco, A. D.; Parreira, R.; Sousa, C. A.; Pinto, J.; Weetman, D.Abstract
Control of arboviruses remains heavily reliant on insecticide-based vector control targeting adult Aedes aegypti, especially during outbreaks, but the effectiveness of these tools can be compromised by insecticide resistance. While the mechanisms underlying resistance have been widely studied in Latin American and South East Asian Ae. aegypti, knowledge from African populations is limited, particularly regarding metabolic resistance. To address this knowledge gap, we sequenced the transcriptomes of Ae. aegypti collected in Angola, from both unexposed individuals and survivors of exposure to the organophosphate fenitrothion, alongside two insecticide-susceptible laboratory reference strains. Many overexpressed genes belonged to the major detoxification enzyme families, including 96 cytochrome P450 monooxygenases (CYP450s), 18 glutathione S-transferases (GSTs), and 35 carboxylesterases, with multiple genes previously detected as upregulated in Latin American and Asian populations. These included frequently reported, functionally validated metabolic resistance genes such as CYP9J24, CYP9J26, and CYP6BB2. However, expression of auxiliary resistance families including hexamerins, heat shock proteins, and odorant binding proteins was linked to the insecticide resistance phenotype, whilst numerous cuticular genes differentiated the Angolan population from both susceptible laboratory strains. A novel candidate, CYP6AG7, that was overexpressed after fenitrothion exposure was experimentally validated, and surprisingly metabolised fenitrothion into its toxic oxon form, which it did not subsequently break down. The antioxidant response element (ARE) motif, to which the transcription factor Maf-S binds, was detected in all CYP450s overexpressed in the fenitrothion treatment, suggesting their potential coordinated induction.
Analysis of genetic differentiation revealed several resistance-linked genes under potential selection, and SNP screening identified both known and novel non-synonymous mutations in the voltage-gated sodium channel (VGSC) gene, the target for pyrethroid insecticides. This is the first RNA-seq dataset for Ae. aegypti from Africa in the context of insecticide resistance, providing insight into the complexity of resistance mechanisms, including some shared, and others potentially novel, compared to better-studied populations from other geographical regions.
bioinformatics2026-06-10v2GEOAgent: An AI-driven Autonomous Framework for Intelligent GEO Data Retrieval and Standardized Preprocessing
Zhao, Y.; Cai, Q.; Chen, D.; Chen, J.Abstract
Datasets in the Gene Expression Omnibus (GEO) remain difficult to reuse at scale because sample annotations are heterogeneous and raw sequencing data require assay-specific preprocessing. We present GEOAgent, an AI-driven autonomous framework designed for intelligent dataset retrieval and standardized preprocessing by coupling autonomous semantic governance with an automated Nextflow pipeline named bioStream. Metadata from 181,760 sequencing series and 84,756 associated PubMed records were organized in a relational database and semantic index to support natural-language dataset retrieval. The framework automatically determines assay modalities, resolves experimental design pairings, and standardizes sample naming to minimize manual curation overhead. Based on these parsed attributes, the framework generates deployment-ready manifests to automatically execute containerized workflows across bulk and single-cell omics modalities. In expert-curated benchmarks, the workflow achieved 96% retrieval precision alongside 100% accuracy in assay classification and sample relationship resolution. The web platform is publicly accessible, while the source code and associated databases are openly available via GitHub and Zenodo.
bioinformatics2026-06-10v1Bias-mitigated microbiome inference refines coronary artery disease signature
Honeybrook, L.Abstract
Roughly half the cells in the human body are microbial, and changes in these communities are increasingly implicated in cardiovascular, metabolic, and oncological diseases. Yet identifying which taxa truly differ in abundance, i.e. differential abundance (DA) analysis, is distorted by four major sources of bias: loss of total microbial load, taxa measurement efficiencies, arbitrary pseudocounts required to handle pervasive zeros, and contamination, which has recently driven retractions. No existing DA method accounts for all four. Here we introduce BootDA, a non-parametric bootstrap-based method that explicitly models each bias source without data transformations, pseudocounts, parametric assumptions, or assuming that most taxa are non-DA. In semi-parametric simulations preserving the sparsity (>70% zeros) and correlation structure of real 16S amplicon data, BootDA achieved the highest sensitivity among tested methods, including ANCOM-BC2, LinDA, MaAsLin 3, and Wilcoxon tests, while controlling the false discovery rate. Performance was retained in low-biomass settings when contamination contributed ~50% of counts, and without negative controls, indicating de novo decontamination capability. Applied to a coronary artery disease cohort, BootDA refined the original signature to two co-enriched genera, Klebsiella and Gemmiger, and excluded likely contaminants. BootDA is available as an R package and could generalise to other sparse, high-dimensional biological data.
bioinformatics2026-06-10v1SPARQ-MI leverages end-to-end spatial single-cell analysis of the tumor microenvironment
Kiwitz, L.; Turiello, R.; Effern, M.; Toma, M.; Landsberg, J.; Hoelzel, M.; Thurley, K.Abstract
Detailed spatial analysis of the tumor microenvironment (TME) through multiplexed fluorescence imaging requires quantitative image-processing and data-analysis methods. While data preprocessing down to segmentation of individual cells is covered by available methods, statistical analysis of single-cell features is compromised by the uneven noise distribution, especially in complex tissues such as the TME, as well as by labor-intensive manual cell-type annotation and region segmentation. Here, we present SPARQ-MI (Spatial Phenotyping, Architecture Reconstruction and Quantification from Multiplexed Imaging) for streamlined spatial single-cell analysis, along with a tissue microarray PhenoCycler dataset with 37 fluorescent channels from melanoma patients under immunotherapy. We demonstrate that SPARQ-MI enables robust reconstruction of the cellular and spatial composition in this and other tissue types. Our analysis reveals associations of the cell state and spatial location of CD8 T cells with response to immunotherapy. Overall, SPARQ-MI allows for quantitative analysis of complex fluorescence histology samples with minimal user input while accounting for spatially uneven coverage of antibody signals, setting the stage for quantitative analysis of clinical samples.
bioinformatics2026-06-10v1Is level-1 blob reconstruction under the network multispecies coalescent easy?
Dai, J.; Molloy, E.Abstract
Hybridization is an important evolutionary process, commonly modeled by the network multispecies coalescent. Reconstructing evolutionary histories under this model is notoriously costly, even for level-1 networks where hybridization events are isolated from each other. The widely used methods that combine speed with statistical guarantees rely on quartet concordance factors computed for all subsets of four species, resulting in an O(n^4 k) bottleneck that severely limits scalability to large numbers of species (n) and genes (k). Among quartet-based methods, NANUQ+ is notable because it decomposes the problem into two steps: first reconstructing a tree of blobs, which compresses each non-treelike part of the network, called a blob, into a single vertex, and second reconstructing the internal structure of each level-1 blob, specifically its circular order and hybrid vertex. Here, we investigate whether level-1 blob reconstruction is difficult once the tree of blobs is known. We present a fast and statistically consistent algorithm, called NetCS, based on two simple primitives, majority voting and merge sort, that circumvents the bottleneck of computing all quartet concordance factors. In simulations, NetCS achieved comparable accuracy to NANUQ+ and was dramatically faster, enabling analyses of 200 taxa and 1000 genes in only a few minutes. Both methods attained near-perfect accuracy when given the true tree of blobs; however, their performance degraded in end-to-end pipelines due to errors in tree of blobs reconstruction. Strikingly, even methods that reconstruct level-1 networks directly struggled to accurately predict hybrid ancestry. Our results suggest that reconstructing level-1 blobs is unexpectedly easy once the tree of blobs is known, and that a major challenge for phylogenetic network inference lies in accurate tree of blobs reconstruction.
bioinformatics2026-06-10v1APOSM: Pairwise preference learning improves generative small-molecule design
Dreisler, M. W.; Michael, R.; Hatzakis, N. S.; Boomsma, W.Abstract
Small-molecule lead refinement is constrained by the cost of synthesizing and assaying candidates, making the surrogate models that prioritize compounds for experimental testing central to the design process. The reliability of such surrogates is limited by the noise and sparsity of screening measurements. We show that training the surrogate on pairwise comparisons between candidate molecules, rather than on absolute predicted scores, yields a substantially more reliable signal for active candidate selection in this regime. We develop APOSM, an active-learning algorithm that combines a fragment-based generator, a pairwise message-passing graph neural network surrogate, and probabilistic ranking inside a batched acquisition loop. On the Practical Molecular Optimization benchmark and a GPCR ligand rediscovery task, APOSM improves target attainment and sampling efficiency over unguided fragment-based optimization, the Graph-GA genetic algorithm, and a pointwise-regression ablation, with the largest gains on tasks where absolute scores are hardest to calibrate.
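Training on pairwise comparisons rather than absolute scores can be illustrated with a Bradley-Terry style loss, in which the probability that molecule i is preferred over molecule j is the logistic function of the score difference. This is a generic sketch and an assumption about the form of the objective; the abstract does not specify the loss APOSM uses, and `pairwise_preference_loss` is a hypothetical name.

```python
import numpy as np

def pairwise_preference_loss(scores, pairs):
    """Mean negative log-likelihood of observed preferences.

    scores: per-molecule surrogate scores
    pairs:  (winner_index, loser_index) tuples, one per comparison
    Assumes P(winner beats loser) = sigmoid(score_winner - score_loser).
    """
    scores = np.asarray(scores, dtype=float)
    winners, losers = np.asarray(pairs).T
    margin = scores[winners] - scores[losers]
    # -log(sigmoid(m)) equals log(1 + exp(-m)); logaddexp keeps it stable.
    return float(np.mean(np.logaddexp(0.0, -margin)))

if __name__ == "__main__":
    pairs = [(0, 1), (1, 2)]  # molecule 0 beats 1, molecule 1 beats 2
    print(pairwise_preference_loss([2.0, 1.0, 0.0], pairs))  # consistent ranking
    print(pairwise_preference_loss([0.0, 1.0, 2.0], pairs))  # reversed ranking
```

The loss depends only on score differences, so a constant offset in the measurements does not affect it, which is one reason a pairwise objective can be easier to fit than absolute regression when scores are poorly calibrated.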
bioinformatics2026-06-10v1A Unified Spatial AI Framework for Cross-Domain Tissue-State Analysis in Trauma, Oral, and Cardiovascular Pathology
Pham, T. D.Abstract
Objective: To develop a cross-domain spatial AI framework for identifying conserved tissue-state organisation across trauma, oral disease, and cardiovascular tissue using spatial transcriptomic data. Methods: Four public spatial transcriptomic datasets spanning wound healing, periodontitis, oral squamous cell carcinoma, and cardiac tissue were integrated using recurrence modelling, graph-based spatial learning, fuzzy tissue-state analysis, and tensor decomposition. Cross-domain coupling, spatial fragmentation, recurrence structure, and permutation-based topological validation were evaluated. Results: Six conserved fuzzy tissue states were identified, dominated by extracellular matrix remodelling, fibroblast/stromal activation, endothelial signalling, and inflammatory pathways. Latent embedding analysis demonstrated strong overlap between trauma and oral domains, while cardiovascular tissue exhibited more compact spatial organisation. Oral inflammatory tissue showed the highest fragmentation, whereas cardiovascular tissue demonstrated greater recurrence coherence. Tensor decomposition identified conserved stromal-remodelling programmes across domains. Permutation testing confirmed significantly elevated graph modularity and reduced spatial entropy relative to null distributions. Conclusion: The proposed framework identified conserved spatial tissue-state architecture linking wound healing, oral pathology, and cardiovascular tissue despite differences in tissue origin, pathology, and acquisition technology. Significance: These findings demonstrate the potential of spatial AI for investigating conserved stromal and inflammatory microenvironmental organisation across clinically related disease systems and may support spatial biology research in trauma-oral-systemic health.
bioinformatics2026-06-10v1When batch correction corrupts gene expression: uncovering distortions in correlation structures
Nourisa, J.; Passemiers, A.; Moreau, Y.; Raimondi, D.Abstract
Batch correction is essential for integrating datasets and enabling population-level insights into health and disease. Embedding-based approaches are among the most widely used solutions, but here we highlight a critical, overlooked limitation: these methods can distort feature-to-feature (e.g., gene-gene) relationships, potentially undermining downstream analyses. We investigate this issue and introduce a novel metric to quantify it.
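One simple way to see what distortion of gene-gene relationships means is to compare the gene-gene correlation matrix before and after a correction step. The sketch below is an illustrative diagnostic only and is not the metric introduced by the authors, which the abstract does not describe; `correlation_distortion` is a hypothetical name, and it assumes the corrected data are available in gene space with the same cells and genes as the input.

```python
import numpy as np

def correlation_distortion(before, after):
    """Mean absolute change in pairwise gene-gene Pearson correlation.

    before, after: cells x genes matrices with matching shape and gene order.
    Returns 0 when every gene-gene correlation is preserved.
    """
    corr_before = np.corrcoef(before, rowvar=False)
    corr_after = np.corrcoef(after, rowvar=False)
    upper = np.triu_indices_from(corr_before, k=1)  # each gene pair once
    return float(np.mean(np.abs(corr_before[upper] - corr_after[upper])))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    expr = rng.normal(size=(200, 5))
    expr[:, 1] = expr[:, 0] + 0.3 * rng.normal(size=200)  # genes 0 and 1 co-vary

    rescaled = expr * 2.0 + 3.0                # per-gene affine change only
    scrambled = expr.copy()
    scrambled[:, 1] = rng.permutation(scrambled[:, 1])  # breaks the 0-1 link

    print(correlation_distortion(expr, rescaled))
    print(correlation_distortion(expr, scrambled))
```

A pure rescaling leaves the value at essentially zero, whereas breaking the link between the two co-varying genes produces a clearly positive value.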
bioinformatics2026-06-10v1Promera: a unified model for biomolecular structure prediction, filtering, and design
Jing, B.; Bafna, M.; Diaz, D. J.; Klivans, A. R.; Berger, B.Abstract
Generative models have become staple tools for modeling and designing biomolecular structures. However, although these tools have improved in structural prediction accuracy, their ability to filter designed binders, an essential use case, remains insufficient, whereas design methods have focused more on unconstrained binder generation than on capabilities enabled by controllable design. We introduce Promera, a unified generative model that combines all-atom structure prediction with improved filtering and controllable design. We find that Promera's confidence metrics are more accurate for filtering binders from non-binders for both miniproteins and nanobodies, while its co-folding performance surpasses popular open-source models (OpenFold3-p2, Boltz-2) on therapeutically relevant categories. As a design model, Promera generates binders by predicting masked protein sequences with optional epitope, paratope, and template constraints. Remarkably, our nanobody designs match the in silico success rates from backprop-based techniques (mBER) when evaluated under co-folding confidence filters. We further provide two in silico demonstrations of the versatile capabilities of our design method: epitope targeting of the Andes hantavirus glycoprotein with VHHs and active state stabilization of the beta-2 adrenergic GPCR. We conclude by proposing a scaling law for co-folding models, suggesting a path for further performance improvement.
bioinformatics2026-06-10v1ECMME: an atlas of selection pressures on the mammalian extracellular matrix reveals contrasting evolutionary dynamics
Petrov, P. B.; Oshinjo, A.; Roning, J.; Izzi, V.Abstract
The extracellular matrix (ECM) is a fundamental metazoan innovation that provides structural support and regulatory cues essential for multicellular life. While core matrisome components are subject to strong functional constraints, their evolutionary dynamics at the molecular level remain incompletely characterized. Here, we present a comprehensive per-residue analysis of selection pressures across 272 human core matrisome proteins using high-quality orthologous sequences from up to 228 placental mammal species. We developed an automated pipeline integrating ortholog identification, codon-aware alignments, and site-specific selection analyses with the MEME and FUBAR methods from the HyPhy suite. Results reveal pervasive strong purifying selection across the matrisome, consistent with its structural and functional indispensability. This is accompanied by episodic positive selection and rarer pervasive positive selection, with collagens exhibiting significantly elevated episodic positive selection compared to glycoproteins and proteoglycans. To facilitate community access, we developed ECMME (ECM Molecular Evolution) browser, an intuitive open-access web resource that visualizes selection metrics plotted directly onto protein topologies. ECMME allows researchers to seamlessly browse and investigate the data, providing a powerful framework for interpreting functional sites. It is available online and requires no local installation or set-up (https://izzilab-ecmme.share.connect.posit.cloud/).
bioinformatics2026-06-10v1NaVis: a virtual microscopy framework for interactive histological interrogation of spatial transcriptomics data
Oshinjo, A.; Wu, J.; Petrov, P.; Hashmi, A.; Englund, J. I.; Izzi, V.Abstract
Despite the widespread adoption of spatial transcriptomics (ST), revealing the alignment between transcriptional layers and tissue morphology remains technically demanding, typically requiring proficiency across multiple computational frameworks and thereby limiting accessibility for a substantial fraction of the biomedical community. Here, we introduce NaVis (https://github.com/Izzilab/NaVis), a point-and-click virtual microscopy framework that redefines ST analysis as an interactive, image-centric experience. NaVis enables rapid high-resolution inference from low-resolution whole-transcriptome platforms, producing microscopy-like visualizations while preserving transcriptome-wide coverage. It further decomposes histological images into quantitative tissue architecture priors (nuclei-rich regions, fibrillar extracellular matrix, and soft tissue) allowing direct integration of gene expression with local morphology. This unified representation supports analyses of compartment enrichment, boundary concordance, spatial cross-correlation, morphological patterning, histology/expression decoupling, and transcriptome-wide spatial similarity. By coupling transcriptomic and image-derived information within an interactive framework, NaVis shifts ST from static computational workflows to an exploratory modality, broadening its accessibility, conceptual reach and potential for biological discoveries.
bioinformatics2026-06-09v2FLAG-X: Hybrid machine learning workflows for automated gating of clinical flow cytometry data
Martini, P.; Mohammadi, M.; Thrun, M. C.; Blumenthal, D. B.; Krause, S. W.Abstract
Flow cytometry analysis is a widespread practice in cell biology, immunology and hematology. Cell populations of interest are typically identified by consecutively examining the expression levels of antigen marker pairs. Since this manual gating process lacks standardization and is time-consuming, several machine learning (ML) methods for automated gating of flow cytometry data have been proposed in recent years. However, their translation into routine workflows has been limited. To address this, we developed the Python package FLAG-X ("flow cytometry automated gating toolbox"), which supports two novel workflows that integrate manual with ML-based gating, using labeled and unlabeled training data. We selected state-of-the-art ML methods developed for automated gating for inclusion in FLAG-X, based on their gating performance in comparison to manual expert annotations. FLAG-X provides a unified interface for top-performing methods and enables seamless integration with standard software for manual gating by exporting results as FCS files. To demonstrate its practical utility, we applied FLAG-X to representative cases from clinical practice. FLAG-X is available at https://anaconda.org/channels/bioconda/packages/flagx/overview
bioinformatics2026-06-09v2