Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
BioGraphX: Bridging the Sequence-Structure Gap via Physicochemical Graph Encoding for Interpretable Subcellular Localization Prediction
Saeed, A.; Abbas, W.
Abstract
Computational approaches for protein subcellular localization prediction are important for understanding cellular mechanisms and developing treatments for complex diseases. However, a critical limitation of current methods is their lack of interpretability: while they can predict where a protein localizes, they fail to explain why the protein is assigned to a specific location. Moreover, understanding protein behavior traditionally requires knowledge of three dimensional structure, which is a costly and time-consuming process. Here, we propose BioGraphX, a novel encoding framework that constructs protein interaction graphs directly from protein sequences using biochemical rules. This approach provides a constraint-based structural proxy directly from sequence, reducing the dependency on experimentally determined three-dimensional structures. Building upon this representation, BioGraphX-Net demonstrates superior performance on the DeepLoc 2.0 benchmark by integrating ESM-2 embeddings with the proposed features via a gating mechanism. Gating analysis shows that although ESM-2 embeddings provide strong contributions, BioGraphX features function as high-precision filters. SHAP analysis reveals feature importance patterns consistent with a sophisticated biophysical logic: sequence signals act as universal exclusion filters, while organelle-specific combinations of biophysical features enable precise compartment discrimination. Notably, Frustration features help resolve targeting ambiguities in complex compartments, reflecting evolutionary constraints while preventing mislocalization from sequence mimicry. It has the additional advantage of promoting Green AI in bioinformatics, achieving performance comparable to the state-of-the-art while maintaining a minimal parameter count of 13.46 million. In summary, BioGraphX not only provides accurate predictions but also offers new insights into the language of life.
bioinformatics 2026-05-13 v3
A Context-Specific, Literature-Supported Framework for Validating Stress Response Differentially Expressed Gene Sets
Frishman, B. A.; Gonzalez, J. L.; Forbes, V. E.
Abstract
Computational models of stress responses identify genes underlying physiological adaptation, but their utility depends on rigorous validation. Often, gene activity reflects both adaptive mechanisms and noise. Here, we develop a framework that leverages public databases to support the subselection of biologically supported model genes for temperature-stress responses. We test our framework on a model that identified and categorized differentially expressed genes (DEGs) into Key-Response, Treatment-Specific, Noisy, and Support groups based on inter-individual gene expression variability before and after treatment. The first three groups were hypothesized to constitute a Principal Response. To validate these groupings, we constructed protein-protein interaction (PPI) networks using the Human Protein Atlas and STRING. The main contribution of this work is the implementation of second-order connections restricted to those made via DEGs, ensuring connectivity reflects condition-specific responses rather than generic hubs. Across two temperature conditions, >75% of Principal Response genes assembled into subnetworks of interactions significantly larger than random expectations. Support Group genes also showed strong interconnectivity and enrichment for housekeeping genes. STRING confirmed PPI enrichment but produced less stable results than our framework. By emphasizing DEG-restricted second-order connections, we address limitations of context-free enrichment methods and strengthen biological evaluation of computational models of differential gene expression.
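One possible reading of the DEG-restricted second-order connections described above is sketched here. It is an illustrative interpretation, not the authors' implementation: the function names, the toy edge list, and the rule that the bridging gene must itself be a DEG are all assumptions.

```python
from collections import defaultdict
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map from (gene_a, gene_b) pairs."""
    adjacency = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-interactions
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def deg_restricted_second_order(edges, degs):
    """Return DEG pairs linked through a shared neighbor that is also a DEG.

    Pairs that already interact directly are skipped, so the result holds
    only second-order connections. Assumption: a bridge through a non-DEG
    (for example a generic hub protein) does not count.
    """
    adjacency = build_adjacency(edges)
    degs = set(degs)
    pairs = set()
    for bridge in degs:
        deg_neighbors = sorted(adjacency.get(bridge, set()) & degs)
        for a, b in combinations(deg_neighbors, 2):
            if b not in adjacency[a]:
                pairs.add((a, b))
    return pairs


# Toy network: HUB is highly connected but not differentially expressed.
toy_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "HUB"), ("HUB", "D")]
toy_degs = {"A", "B", "C", "D"}
second_order = deg_restricted_second_order(toy_edges, toy_degs)
```

In the toy network, A and D share only the non-DEG hub as a neighbor, so that pair is not reported, which is the behavior the abstract attributes to restricting connectivity to condition-specific genes.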
bioinformatics 2026-05-13 v3
Highly Accurate Estimation of the Fold Accuracy of Protein Structural Models
Xie, L.; Ye, E.; Wang, H.; Zhang, T.; Zhen, Q.; Liang, F.; Liu, D.; Zhang, G.
Abstract
The function of a protein is intrinsically linked to its three-dimensional fold, and deep learning has revolutionized the field by enabling high-accuracy structure prediction at an unprecedented scale. Nevertheless, the growing deployment of these predictive pipelines in drug discovery and structural biology reveals a critical bottleneck that lies in the lack of independent and rigorous model accuracy estimation (EMA) methodologies. Here we present DeepUMQA-Global, a single-model deep learning framework for estimating accuracy of protein structural models. Our method employs a structure-sequence cross-consistency mechanism to quantify the bidirectional compatibility between the input sequence and the predicted three-dimensional structure, enabling a comprehensive characterization of fold accuracy. DeepUMQA-Global outperforms the self-assessment confidence scores of AlphaFold3, achieving improvements of 57.8% in the Pearson correlation and 49.0% in the Spearman correlation. With respect to the CASP16 retrospective benchmark, DeepUMQA-Global outperforms all single-model accuracy estimation methods that participated in CASP16 and achieves performance comparable to that of the top consensus-based methods. A lightweight consensus strategy built upon DeepUMQA-Global ranks first among all CASP16 participants, surpassing all other methods, including consensus approaches, and highlighting the strength of our method. Remarkably, DeepUMQA-Global demonstrates a strong ability to discriminate between alternative conformational states of proteins, as evidenced on the CASP unique alternative conformation protein complex target and the CoDNaS benchmark. Our results indicate that DeepUMQA-Global can be extended to broader protein modeling tasks, moving beyond static evaluation to offer a foundation for dynamic conformation EMA, where it accurately discriminates alternative conformational states and demonstrates reliable predictive fidelity in model accuracy estimation.
bioinformatics 2026-05-13 v2
Keeping SCORE enables interpretable uncertainty-aware classification from diffusion models for genomics
Kuznets-Speck, B.; Jung, J.; Pholraksa, P.; Zhong, A.; Schwartz, L.; Prashnani, E.; Vaikuntanathan, S.; Goyal, Y.
Abstract
Classifying cellular states from high-dimensional molecular and genomic measurements requires methods that provide not only accurate predictions but also calibrated uncertainty and interpretability. Current nonlinear classifiers offer accuracy but often lack uncertainty quantification and mechanistic insights into the features that matter most. We introduce Keeping SCORE, a framework that transforms conditional diffusion models into probabilistic engines for classification and regression by computing exact likelihoods along stochastic noising trajectories. We first benchmark Keeping SCORE on image recognition tasks (handwritten digits, natural photos). We then apply Keeping SCORE to single-cell transcriptomics across a 22-million-cell atlas, classifying 164 cell types with accuracy matching or exceeding state-of-the-art methods, while uniquely providing posterior probability estimates and prediction confidence. For genetic perturbation mapping across 100 CRISPRi conditions in a multi-study Perturb-seq dataset, our approach again matches or surpasses discriminative baselines, with feature-level attributions identifying which genomic features drive each decision. Applied to large-scale protein sequence data, our framework accurately regresses mutational stability effects, attributing them quantitatively to positions along the input sequence. Keeping SCORE requires no retraining or architectural changes to existing diffusion models, providing portable, interpretable, and uncertainty-aware predictions for biological discovery.
bioinformatics 2026-05-13 v2
Preferential IsomiR Enrichment in Extracellular Vesicles Improves Identification of Their Cellular Origins
Ripan, R. C.; Li, X.; Hu, H.
Abstract
Extracellular vesicles (EVs) carry microRNAs (miRNAs) that mediate intercellular communication and have strong potential as disease biomarkers, yet the roles of miRNA isoforms (isomiRs) in EVs remain poorly understood. Here, we analyzed 96 human EV and corresponding source samples from nine public datasets. We found that EV samples consistently contained substantially higher proportions of isomiR reads than their corresponding source samples, indicating widespread isomiR enrichment in EVs. Although individual isomiRs showed limited reproducibility across biological replicates and limited sharing between EVs and their corresponding source samples, the parent miRNAs that generated these isomiRs remained highly reproducible across replicates and strongly shared between EV-source pairs. Despite extensive isomiR diversification, EV-source pairs retained highly correlated miRNA expression profiles. Using integrated miRNA- and isomiR-related features, we further developed a random forest model that successfully associated EV samples with their corresponding source samples, with improved performance when isomiR information was included. Together, our results demonstrate that EVs are enriched for biologically meaningful isomiRs while preserving source-associated miRNA landscapes, highlighting the importance of incorporating isomiRs into future EV studies.
bioinformatics 2026-05-13 v1
xNNPCD identifies regulators of programmed cell death by integrating perturbation transcriptomes with cancer dependency profiles
Yin, Q.; Chen, L.
Abstract
Programmed cell death (PCD) encompasses multiple regulated processes whose dysregulation shapes cancer fitness, yet current computational studies largely use known PCD genes for prognosis rather than discovering regulators. We developed xNNPCD, an interpretable neural-network framework that links CRISPR-Cas9 perturbation signatures from CMap to gene dependency profiles from DepMap. The model constrains hidden neurons to five PCD pathways and iteratively refines a prior gene-pathway mask matrix derived from GO, KEGG, and Reactome using pathway-neuron ablation. This converts binary gene-pathway relationships into continuous-valued associations and improves dependency prediction over random forests, standard fully connected multi-layer perceptron, and its own non-iterative variant. The learned matrix recovers annotated death regulators and nominates candidate regulators, including RPL23A, HSPA5, SNRPA1, SLC6A2, and ASAH1; combined with dependency scores, it further separates pathway coupling from regulatory direction. Transferring the refined relationship matrix and learned weights to compound-induced perturbation data enables in silico drug screening, identifying BRD-K19103580 and decitabine as targeted therapeutic agents for apoptosis and ferroptosis, respectively. The pathway-resolved drug profiles can facilitate the rational design of combination therapies targeting complementary PCD pathways to overcome single-pathway resistance. Overall, xNNPCD offers a generalizable, interpretable approach for mapping the regulatory landscape and elucidating the molecular processes of PCD in cancer.
bioinformatics 2026-05-13 v1
Cell-Level Virtual Screening
Ellington, C. N.; Addagudi, S.; Wang, J.; Lengerich, B. J.; Xing, E. P.
Abstract
Virtual screening methods prioritize therapeutic candidates by predicting molecular properties and interactions. However, molecular models are insufficient to predict higher-order effects that arise in real biological systems, leading to late-stage failures in drug discovery. Virtual cells have been posed as a solution to this problem by predicting gene expression responses to drugs, but they remain weakly validated as screening tools; gene expression is only an intermediate in understanding drug success or failure. Despite burgeoning progress in virtual cells, some basic questions remain. Is expression even a good representation of higher-order drug effects? How can expression and other cell-level representations be applied to prioritize therapeutic candidates? Can cell-level methods be fairly compared against traditional molecular-level screens? We address these questions in a two-pronged approach. First, we curate two benchmarks, Drug-Disease Retrieval Bench (DDR-Bench) and Drug-Target Retrieval Bench (DTR-Bench), which directly compare cell-level methods against traditional molecular methods on canonical drug discovery tasks. DDR-Bench evaluates a method's ability to prioritize disease indications for drugs with novel target profiles. DTR-Bench evaluates a method's ability to reconstruct drug-target interactions from separate perturbation modalities that act on shared mechanisms, bridging the gap between cell-level methods and classic molecular screens. We identify shortcomings of existing screening methods on these benchmarks, and propose an alternative representation of drug effects: perturbed gene networks. Inferring post-perturbation gene networks on-demand for unseen drugs requires methods that generalize beyond traditional plug-in network estimators. 
We develop a scalable differentiable surrogate loss for multivariate Gaussians, which we apply to train a context-adaptive amortized estimator that maps perturbation metadata to gene-gene dependency network parameters. The resulting model, CellVS-Net, achieves SOTA on predicting how gene networks restructure under a variety of complex multivariate experimental conditions, including different cell types, small molecule therapeutics, signaling molecules, gene knockdowns, and gene over-expressions. When compared to other molecular and cell-level representations of drugs, we find that CellVS-Net achieves SOTA on both virtual screening benchmarks. Overall, CellVS-Net demonstrates that cell-level virtual screening methods are a viable alternative to molecular screening, and associated benchmarks enable hill-climbing on relevant drug discovery tasks.
bioinformatics 2026-05-13 v1
Integrated RNA-seq analysis identifies ABC transporters mediating taxane export in Taxus species
Nasiri, J.; Fotuhi Siahpirani, A.; Dong, Y.; Xu, C.; Xia, Y.; Ignea, C.
Abstract
RNA-seq datasets from medicinal yews are crucial for studying paclitaxel biosynthesis. However, cross-study data analyses are hindered by pronounced batch effects. Here, we compiled 45 RNA-seq samples from three studies across four tissues (bark, leaf, root, stem) and assessed 35 preprocessing pipelines combining six normalization strategies with five batch-effect correction approaches. Unsupervised clustering (HCA, k-means, Grade-of-Membership), evaluated using Jaccard and Adjusted Rand indices, revealed significant variability in batch effect removal. Supervised classification of tissue and project labels (Random Forest and linear/radial SVM) demonstrated improved accuracy in tissue type prediction, highlighting the effectiveness of correction methods. The processed data facilitated the identification of 189 putative ABC transporters across samples, six of which showed a strong correlation with the gene encoding 10-deacetylbaccatin-III-10β-O-acetyltransferase, a key biosynthetic enzyme in the taxol pathway. High expression levels in leaf and bark further support their role in taxane intermediate trafficking in taxol biosynthesis. Structural analysis and molecular docking further supported the selection of these candidates, and the agreement between transcriptomic ranking and docking-based prioritization suggests that these transporters may participate in taxane intermediate recognition, trafficking, or export. These findings demonstrate the importance of normalization and batch effect correction in RNA-seq analysis to advance gene discovery in Taxus species and, more broadly, in plant research.
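The two clustering agreement scores named in the abstract can be computed from pair counts, as in the minimal sketch below. The abstract does not say exactly how the Jaccard index was applied, so the pair-counting form used here is an assumption, and all function names are illustrative.

```python
from collections import Counter
from math import comb


def pair_counts(labels_a, labels_b):
    """Count sample pairs grouped together in both, only in A, or only in B."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have the same length")
    joint = Counter(zip(labels_a, labels_b))
    same_both = sum(comb(n, 2) for n in joint.values())
    same_a = sum(comb(n, 2) for n in Counter(labels_a).values())
    same_b = sum(comb(n, 2) for n in Counter(labels_b).values())
    return same_both, same_a - same_both, same_b - same_both


def pair_jaccard(labels_a, labels_b):
    """Pair-counting Jaccard index between two partitions."""
    both, only_a, only_b = pair_counts(labels_a, labels_b)
    denominator = both + only_a + only_b
    return 1.0 if denominator == 0 else both / denominator


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index: pair agreement corrected for chance."""
    both, only_a, only_b = pair_counts(labels_a, labels_b)
    same_a, same_b = both + only_a, both + only_b
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0
    expected = same_a * same_b / total_pairs
    max_index = (same_a + same_b) / 2
    if max_index == expected:  # degenerate partitions, treated as agreement
        return 1.0
    return (both - expected) / (max_index - expected)


# Toy example: clusters versus tissue labels for four samples.
tissue = ["bark", "bark", "leaf", "leaf"]
clusters = [1, 1, 0, 0]
ari = adjusted_rand_index(tissue, clusters)
jaccard = pair_jaccard(tissue, clusters)
```

Both scores depend only on which samples are grouped together, so relabeling the clusters leaves them unchanged.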
bioinformatics 2026-05-13 v1
Redesign selective protein binders using contrastive decoding
Xie, Z.; Xu, J.
Abstract
Motivation: Fixed-backbone sequence design methods such as ProteinMPNN operate on backbone coordinates alone and cannot represent target side-chains at the binding interface. Their decoding algorithm also lacks a mechanism to balance binding affinity and folding stability or to improve selectivity against structurally similar off-targets. These gaps limit the computational design of protein binders with high affinity and specificity. Results: We present RedNet, a multiscale graph neural network that encodes side-chain information of the binding target. We further develop a contrastive decoding algorithm, motivated by the thermodynamic decomposition of binding free energy, that addresses two objectives: (1) balancing binding affinity and folding stability, and (2) improving selectivity against structurally similar off-targets. RedNet reaches 43% native sequence recovery on heterodimers, compared with 37% for ProteinMPNN and 33% for ESM-IF. With contrastive decoding, it matches native-sequence co-folding success (68%) on high-confidence AlphaFold3 targets, exceeding ProteinMPNN (59%) and ESM-IF (61%). On a new benchmark of structurally similar on-/off-target pairs, RedNet with contrastive decoding reaches 64.8% energetic selectivity, ahead of PiFold (55.6%), ProteinMPNN (53.7%), and ESM-IF (53.7%).
bioinformatics 2026-05-13 v1
De novo protein discovery in non-model organisms
Ali, A.
Abstract
We developed plant (Parallel Annotation of Transcriptomes), a de novo method that can potentially compare RNA-seq data of any two species without a reference genome. plant is conceptually similar to chromatography. In the same way a complex mixture is filtered to isolate its individual components, we applied a computational method to identify, annotate, and quantify components across transcriptomes. The comparison points are universal protein domain annotations rather than species-specific genes, as would be the case for a differential gene expression analysis. We looked at several Selaginella species via the 1000 Plant transcriptomes initiative (1KP) where RNA-seq data for various plant species have been made publicly available. The raw reads were assembled via Trinity. The assembled transcripts were then searched against the Pfam protein domain database via InterProScan. The assembled transcripts were also quantified via kallisto. By merging these two aspects, we were able to see how often a predicted protein structure is expressed. These quantified annotations of protein domains are comparable across species, assuming a relatively short evolutionary distance. We were also able to identify the presence of species-specific protein domains and trace each annotation back to the gene. A bubble plot was created to visualize the distributions of Pfam annotations across species as well as GO terms.
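The merging step described above, combining per-transcript domain annotations with per-transcript abundance, might look like the sketch below. It assumes the InterProScan and kallisto outputs have already been parsed into plain dictionaries; the function names, transcript IDs, and domain labels are toy placeholders rather than anything taken from the tool itself.

```python
from collections import defaultdict


def domain_abundance(transcript_domains, transcript_tpm):
    """Sum transcript abundance (TPM) over every domain a transcript carries.

    transcript_domains: transcript ID -> list of domain identifiers
    transcript_tpm:     transcript ID -> abundance in TPM
    """
    totals = defaultdict(float)
    for transcript, domains in transcript_domains.items():
        tpm = transcript_tpm.get(transcript, 0.0)
        for domain in set(domains):  # count each domain once per transcript
            totals[domain] += tpm
    return dict(totals)


def species_specific(abundance_a, abundance_b):
    """Return domains expressed only in species A and only in species B."""
    expressed_a = {d for d, tpm in abundance_a.items() if tpm > 0}
    expressed_b = {d for d, tpm in abundance_b.items() if tpm > 0}
    return expressed_a - expressed_b, expressed_b - expressed_a


# Toy inputs for two species (identifiers are placeholders).
species1 = domain_abundance(
    {"t1": ["domain_X", "domain_Y"], "t2": ["domain_X"]},
    {"t1": 10.0, "t2": 5.0},
)
species2 = domain_abundance(
    {"u1": ["domain_X"], "u2": ["domain_Z"]},
    {"u1": 7.0, "u2": 2.0},
)
only_in_1, only_in_2 = species_specific(species1, species2)
```

Because the keys are domain identifiers rather than gene names, the two dictionaries can be compared directly without an ortholog mapping, which is the point the abstract makes about using universal annotations as comparison points.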
bioinformatics 2026-05-13 v1
A chemoinformatics-guided platform for efficient discovery of RNA-binding small molecules: Proof-of-concept for myotonic dystrophy type 1
Taghavi, A.; Shan, J.; Yao, X.; Zanon, P. R. A.; Sung, K.; Simba-Lahuas, A.; Gorlach, S.; Labuhn, H.; Salthouse, D.; Wang, Z.; Feri, A.; Disney, M. D.
Abstract
Structured RNAs cause human diseases but remain challenging to target selectively with small molecules. Here, we report a chemoinformatics-guided discovery framework that integrates fingerprint-based molecular design, experimental validation, and mechanistic profiling to identify small molecules that bind highly structured, disease-associated RNAs. Using an RNA-binder fingerprint derived from known ligands, a Tversky similarity screen of >8 million compounds yielded a 150-member library enriched in chemical space for RNA-active scaffolds. Target engagement and cell-based assays identified multiple selective ligands for the pathogenic expanded triplet repeat, r(CUG)exp, that causes myotonic dystrophy type 1 (DM1) by binding and sequestering the RNA-binding protein muscleblind-like 1 (MBNL1). Biophysical and single-molecule analyses revealed that the small molecules bind the 1x1 nucleotide U/U internal loops formed when r(CUG)exp folds, partially block MBNL1 binding, and modulate RNA folding equilibria. Two optimized scaffolds rescued MBNL1-dependent splicing in patient-derived myotubes with micromolar potency and minimal cytotoxicity. This study establishes a generalizable, data-driven platform for discovering drug-like RNA-binding lead small molecules and demonstrates its application to the toxic repeat expansion RNA underlying DM1.
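For readers unfamiliar with the Tversky index used in the screen, a minimal sketch follows. Fingerprints are represented as sets of "on" bit positions; the weights `alpha` and `beta`, the threshold, and the toy library are assumptions, since the abstract does not report the values used.

```python
def tversky(query_bits, candidate_bits, alpha=1.0, beta=0.0):
    """Tversky index between two fingerprints given as sets of on-bits.

    T = |Q & C| / (|Q & C| + alpha * |Q - C| + beta * |C - Q|)
    alpha = beta = 1 gives the Tanimoto coefficient.
    """
    query, candidate = set(query_bits), set(candidate_bits)
    common = len(query & candidate)
    only_query = len(query - candidate)
    only_candidate = len(candidate - query)
    denominator = common + alpha * only_query + beta * only_candidate
    return 0.0 if denominator == 0 else common / denominator


def screen(query_bits, library, threshold, alpha=1.0, beta=0.0):
    """Return (name, score) pairs at or above threshold, best first."""
    scored = [
        (name, tversky(query_bits, bits, alpha, beta))
        for name, bits in library.items()
    ]
    hits = [item for item in scored if item[1] >= threshold]
    return sorted(hits, key=lambda item: (-item[1], item[0]))


# Toy fingerprints (bit positions are arbitrary placeholders).
rna_binder_fingerprint = {1, 2, 3, 4}
toy_library = {
    "cmpd_a": {1, 2, 3, 4, 9, 10},  # contains every query feature
    "cmpd_b": {1, 2, 5},            # shares half of the query features
    "cmpd_c": {7, 8},               # shares nothing
}
hits = screen(rna_binder_fingerprint, toy_library, threshold=0.5)
```

With `alpha=1` and `beta=0` the score is the fraction of query features present in the candidate, so larger molecules are not penalized for extra features. That asymmetry is the usual reason to prefer Tversky over Tanimoto when searching with a reference fingerprint.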
bioinformatics 2026-05-13 v1
Cell Type-informed Characterization of Spatial Niches from Spatial Multimodal and Multi-omics Data
Du, G.; Xu, J.; Wei, X.; Liu, C.; Zhao, D.; Jia, X.; Li, X.; Shang, X.
Abstract
Cell niches play critical roles in tissue organization and orchestrate homeostasis, development, and disease progression. Advances in spatial omics technologies now allow diverse molecular and image-derived data to be jointly captured while preserving spatial context, but deciphering cell niches from such spatial multimodal and multi-omics data remains challenging. Existing computational methods are still limited in their flexibility across variable combinations of spatial modalities and omics data. Here we introduce SpaNECT, a unified and flexible framework designed to accommodate spatial multimodal and multi-omics data for cell niche characterization. SpaNECT further incorporates reference-informed cell-type information to support biologically interpretable niche analysis. Systematic evaluations across diverse tissues, disease conditions, and developmental stages showed that SpaNECT consistently outperformed representative methods in resolving cell niches. In mouse brain spatial multi-omics data, SpaNECT uncovered niche-associated molecular and regulatory programs; in developing chick heart, it tracked cross-stage niche reorganization and progressive remodeling of ventricular-associated cell states during maturation. Overall, SpaNECT establishes a general and robust framework for characterizing cell niches across spatial multimodal and multi-omics data.
bioinformatics 2026-05-13 v1
HAIRpred2: Human Host-Specific Prediction of Antibody-Interacting Residues Using Hybrid Physicochemical and Structural Features
Mehta, N. K.; Sahni, R.; Kumar, N.; Raghava, G. P. S.
Abstract
Prediction of conformational B-cell epitopes is critical for vaccine design, immunotherapy, and antibody engineering. To date, several host-independent computational methods have been developed for predicting antibody-interacting residues in antigen structures. However, it is well established that antigen-antibody (Ag-Ab) interactions vary depending on the host immune system, indicating the importance of developing host-specific prediction models. In this study, we present, for the first time, a human host-specific method, HAIRpred2, that predicts antibody-interacting residues in an antigen from its tertiary structure. The dataset was derived from HAIRpred and comprises 277 human Ag-Ab complexes, with 221 structures used for training and 56 for independent testing. Preliminary analysis revealed that residues with a relative surface accessibility (RSA) below 0.05, corresponding to buried regions, are highly likely to be non-interacting, underscoring the importance of structural accessibility in antibody recognition. To identify the most informative features, we evaluated multiple feature representations, including RSA, large language model (LLM)-based embeddings, distance-based features, and physicochemical properties. A model trained on single-residue RSA features achieved an AUC of 0.72. Incorporating a sliding window of 15 residues to capture local structural context improved performance to an AUC of 0.75. The best performance (AUC = 0.78 on the independent test set) was achieved by integrating RSA with physicochemical descriptors. Benchmarking against existing antibody-interaction prediction methods on the same independent dataset demonstrated that HAIRpred2 outperforms current tools, further highlighting the advantage of host-specific modeling. HAIRpred2 is freely available as a web server at https://webs.iiitd.edu.in/raghava/hairpred2/.
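The sliding-window RSA features and the buried-residue cutoff mentioned above can be illustrated with the sketch below. It assumes a window centered on each residue and zero padding at the chain ends; the abstract specifies only the window length (15) and the 0.05 cutoff, so the centering, the padding value, and the function names are assumptions.

```python
def rsa_windows(rsa, window=15, pad_value=0.0):
    """Return one centered window of RSA values per residue.

    Positions that fall outside the chain are filled with pad_value
    (an assumed convention, not taken from the paper).
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    padded = [pad_value] * half + list(rsa) + [pad_value] * half
    return [padded[i:i + window] for i in range(len(rsa))]


def buried_mask(rsa, cutoff=0.05):
    """Flag residues whose RSA is below the cutoff as buried."""
    return [value < cutoff for value in rsa]


# Toy RSA profile for a five-residue stretch.
toy_rsa = [0.01, 0.30, 0.60, 0.04, 0.20]
features = rsa_windows(toy_rsa, window=3)
buried = buried_mask(toy_rsa)
```

Each residue ends up with a fixed-length feature vector regardless of its position in the chain, which is what allows a standard classifier to be trained on the windows.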
bioinformatics 2026-05-13 v1
cran2crux: automatically create CRUX ports for R-packages
Petrov, P.; Izzi, V.
Abstract
Motivation: R together with CRAN and Bioconductor provides one of the richest ecosystems for bioinformatics and computational biology, with thousands of specialized packages. While GNU/Linux is a widely used operating system in this field, R-packages are typically managed independently of the system's native package manager. This separation makes installation, updates and mass rebuilds cumbersome. CRUX, a minimalist semi-source GNU/Linux distribution, offers great flexibility with its ports-based system for the seamless integration of R-packages with its native package manager. Results: The cran2crux tool presented here automatically generates CRUX ports for packages from both CRAN and Bioconductor. It performs recursive dependency resolution, handles naming conventions, extracts dependency information, and supports inclusion of optional dependencies. The tool also provides convenient functions for checking updates and regenerating outdated ports. It can generate over 140 ports for complex packages such as Seurat in approximately 11 seconds, dramatically simplifying the maintenance of large R-dedicated repositories on CRUX. Availability: cran2crux is available under the MIT license at https://github.com/izzilab/cran2crux. As of now, more than 650 R package ports, generated with the tool, are available in the CRUX ports database.
bioinformatics 2026-05-13 v1
Metagenomics-enabled proteomics reveals how AMF and PSB co-inoculation reshapes tomato rhizosphere dynamics across growth stages
Son, Y.; Craft, E. J.; Pineros, M. A.; Mathieson, O. L.; Awan, A.; Blakeley-Ruiz, J. A.; Kleiner, M.; Kao-Kniffin, J.
Abstract
Urban agriculture increasingly relies on compost-based substrates for sustainable production, yet we lack a clear characterization of how these systems respond to biological amendments aimed at introducing beneficial microbiota. Here we investigated how developmental stage and co-inoculation with arbuscular mycorrhizal fungi (AMF) and phosphate-solubilizing bacteria (PSB) reshape rhizosphere microbial function in Solanum lycopersicum grown in compost-based urban farm substrate. Using plant physiology assays, 16S rRNA amplicon sequencing, and metagenome-informed metaproteomics, we characterized tomato physiological responses and rhizosphere microbial activity during flowering and fruiting across control, single AMF, single PSB, and AMF and PSB co-inoculation treatments. Co-inoculation synergistically enriched beneficial taxa, improved fruit nutrient accumulation, elevated nutrient transporter and quorum sensing protein production, and drove stress-driven dormancy in competitively excluded taxa, with responses varying between developmental stages. Our findings establish metagenome-informed metaproteomics as essential for resolving stage-specific rhizosphere microbiome functional responses to tomato development and AMF and PSB co-inoculation.
bioinformatics 2026-05-13 v1
Transferable spatial omics deconvolution with SpaRank
Yan, X.; Zheng, R.; Chen, J.; Li, M.; Lan, W.
Abstract
By resolving cell-type compositions from multi-cellular spatial measurements, deconvolution is central to resolving the cellular landscape of complex tissues. Existing deconvolution methods fit continuous expression values and are therefore sensitive to batch effects between single-cell references and spatial data, requiring retraining for each new context. Here we present SpaRank, a context-aware framework that performs spatial deconvolution by representing spots as ranked feature sequences. Adapting the rank-based encodings of single-cell foundation models, this formulation is inherently robust to technical variation, enabling a pretrain-transfer paradigm. On simulated benchmarks, SpaRank achieves strong deconvolution accuracy, robustness to expression perturbations, and substantial computational efficiency. On experimental datasets, pretrained models generalize across diverse biological contexts: a model pretrained on a multi-organ lymphoid atlas accurately resolved cell-type distributions across distinct tissues and sequencing platforms; likewise, a model pretrained on an integrated breast atlas delineated cell-type compositions across normal and malignant disease states. Furthermore, the framework naturally extends to multimodal spatial deconvolution by employing gated fusion to adaptively integrate diverse omics signals, improving accuracy over single-modality approaches. Overall, SpaRank establishes a transferable deconvolution paradigm, enabling unified cellular atlases to support direct, context-aware inference across diverse biological states and profiling modalities.
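The idea of representing a spot as a ranked feature sequence can be illustrated with the simplified sketch below. It is not SpaRank's actual encoding: the tie-breaking rule, the removal of unexpressed genes, and the function name are assumptions, and rank-based encodings in single-cell foundation models typically also apply a per-gene normalization that is omitted here.

```python
def rank_encode(expression, top_k=None):
    """Turn a gene -> expression mapping into genes ordered by expression.

    Unexpressed genes are dropped; ties are broken alphabetically so the
    output is deterministic (an assumed convention).
    """
    expressed = [(gene, value) for gene, value in expression.items() if value > 0]
    expressed.sort(key=lambda item: (-item[1], item[0]))
    genes = [gene for gene, _ in expressed]
    return genes if top_k is None else genes[:top_k]


# Toy spot, plus the same spot measured with a threefold higher capture rate.
spot = {"CD3E": 5.0, "MS4A1": 0.0, "CD19": 9.0, "LYZ": 2.0}
rescaled_spot = {gene: value * 3 for gene, value in spot.items()}

tokens = rank_encode(spot)
rescaled_tokens = rank_encode(rescaled_spot)
```

The two token sequences are identical because ranking discards absolute magnitudes. This is one way to see why a rank-based input can be less sensitive to scale differences between a single-cell reference and spatial data than fitting continuous expression values.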
bioinformatics 2026-05-13 v1
An assessment of normalization and differential expression methods for miRNA-seq analysis using a realistic benchmark dataset
Aparicio-Puerta, E.; Baran, A. M.; Ashton, J. M.; Pritchett, E. M.; Gaca, A.; Becker, J.; Halushka, M. K.; Jun, S.-H.; McCall, M. N.
Abstract
MicroRNAs are short noncoding RNAs that regulate gene expression and are commonly profiled by small RNA sequencing (miRNA-seq). Despite the widespread use of miRNA-seq, datasets are often analyzed with RNA-seq methods such as DESeq2 or edgeR, which do not take into account the specific characteristics of miRNA-seq data. Here, we present a benchmark study of normalization and differential expression approaches using a realistic ground-truth dataset. By mixing mouse RNA of two organs, we generated expression trends while capturing biological and technical variability. Using monotonicity across the dataset and expected fold changes from the mixture design, we assessed normalization and differential expression methods. Normalization benchmarking showed that within-sample scaling, particularly Reads Per Million (RPM), best preserved the expected monotonic trends, outperforming cross-sample methods such as TMM, rlog, and VST. These approaches sometimes recovered apparent monotonicity among abundant miRNAs, but inspection of individual profiles suggested likely over-correction. Regarding differential expression, edgeR consistently ranked among the best-performing methods across several metrics, including log2 fold-change estimation, with performance comparable to miRNA-seq-specific tools such as miRglmm and NBSR. DESeq2, edgeR-v4, and limma-based approaches tended to systematically underestimate log2 fold changes. Applying a common RPM-based normalization substantially improved the performance of cross-sample methods, highlighting the strong influence of normalization on differential expression analysis. Overall, our findings support within-sample scaling methods such as RPM for normalization, and edgeR, miRglmm, or NBSR for differential expression. The dataset has been made publicly available, providing a valuable resource for objective method comparison and future miRNA-seq software development.
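As a concrete illustration of the within-sample scaling discussed above, the sketch below computes RPM for one sample and checks whether a miRNA's normalized values change monotonically across a mixture series. The helper names, miRNA labels, and counts are toy assumptions, and the monotonicity check is a simplified stand-in for whatever criterion the study used.

```python
def reads_per_million(counts):
    """Scale raw counts of one sample so that they sum to one million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    return {mirna: count * 1_000_000 / total for mirna, count in counts.items()}


def is_monotonic(values):
    """True if values never decrease or never increase along the series."""
    steps = list(zip(values, values[1:]))
    non_decreasing = all(a <= b for a, b in steps)
    non_increasing = all(a >= b for a, b in steps)
    return non_decreasing or non_increasing


# Toy mixture series: three samples with an increasing share of organ B.
samples = [
    {"miR-a": 90, "miR-b": 10},
    {"miR-a": 100, "miR-b": 100},
    {"miR-a": 30, "miR-b": 270},
]
rpm_series = [reads_per_million(sample) for sample in samples]
mir_b_trend = [rpm["miR-b"] for rpm in rpm_series]
trend_ok = is_monotonic(mir_b_trend)
```

RPM uses only each sample's own library size, so no information is shared between samples. That is the property distinguishing it from cross-sample methods such as TMM.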
bioinformatics 2026-05-13 v1
Disagreement between demultiplexing methods reveals structured cell quality gradients in multiplexed single-cell data
Sen, E.; Steiger, S.; Basic, M.; Prokoph, N.; Syed, A. P.; Seufert, I.; Rehman, U.-U.; Schumacher, S.; Baumann, A.; Feuring, M.; Weinhold, N.; Lübbert, M.; Döhner, H.; Döhner, K.; Raab, M. S.; Mallm, J.-P.; Stegle, O.; Rippe, K.
Abstract
Background: Single-cell multi-omics profiling of hematopoietic malignancies frequently involves pooling of patient samples before library preparation to reduce costs. Demultiplexing and quality control of the resulting sequencing data depend on experimental design, sequencing depth, and computational methods. Existing approaches benchmark individual tools, auto-select a single best method, or apply majority voting. However, none systematically exploit disagreement patterns among orthogonal strategies as a diagnostic signal for cell quality. Results: We introduce Split-flow, a modular Nextflow pipeline that runs hashing-based and SNP-based demultiplexing, and transcriptome-based doublet detection in parallel. It classifies cells into quality strata through a concordance-based decision framework. Validation on multiplexed CITE-seq data from 14 multiple myeloma patients across eight Chromium channels demonstrates high reproducibility and shows that discordant cells cluster within specific cell types and quality strata. TCR clonotype cross-referencing against VDJdb confirms that concordance-based classification enriches for biologically genuine immune receptor sequences, with a 5.3-fold enrichment of confirmed public TCR sequences in the high-confidence stratum. Downsampling analysis reveals that SNP-based methods are more depth-sensitive than hash-based approaches, supporting the recommendation to combine both strategies. The framework transfers to AML samples across three assay types (snMultiome-seq, scRNA-seq, scATAC-seq), where ATAC-based demultiplexing resolves donor assignment discordance under low hashing efficiency. Conclusions: Split-flow demonstrates that combining orthogonal preprocessing methods yields structured information about cell quality and offers a concordance-based framework that transforms this disagreement into a diagnostic signal.
It introduces a preprocessing approach that can be exploited beyond hematopoietic malignancies in multiplexed single-cell applications.
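The concordance idea in the Split-flow abstract can be illustrated with a minimal sketch. It assumes three per-cell inputs (a hashing-based donor call, an SNP-based donor call, and a transcriptome-based doublet flag). The function name, the stratum names, and the decision rules are hypothetical and are not taken from Split-flow's actual decision framework.

```python
from collections import Counter
from typing import Optional


def classify_cell(hash_call: Optional[str], snp_call: Optional[str], is_doublet: bool) -> str:
    """Assign a hypothetical quality stratum from three orthogonal calls.

    hash_call / snp_call: donor label, or None when the method made no assignment.
    is_doublet: flag from a transcriptome-based doublet detector.
    """
    if is_doublet:
        return "doublet"
    if hash_call is None and snp_call is None:
        return "unassigned"
    if hash_call is None or snp_call is None:
        return "single_method"      # only one method produced a donor call
    if hash_call == snp_call:
        return "high_confidence"    # both methods agree on the donor
    return "discordant"             # methods disagree: flag for inspection


# Toy input (hypothetical): (hashing call, SNP call, doublet flag) per cell.
cells = [
    ("P1", "P1", False),
    ("P1", "P2", False),
    (None, "P3", False),
    ("P2", "P2", True),
    (None, None, False),
]
strata = [classify_cell(*c) for c in cells]
summary = Counter(strata)
print(summary)
```

Keeping the "discordant" cells as their own stratum, rather than discarding them or resolving them by majority vote, is what lets disagreement be inspected as a signal in its own right.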
bioinformatics2026-05-13v1GatorDuo: Global-Consistency Dual-Graph Refinement With Pseudo-Label Agreement for Spatial Transcriptomics
Zhang, Z.; Jimeno Yepes, A.; Bian, J.; Li, F.; Liu, Y.Abstract
Spatial transcriptomics (ST) measures gene expression together with spatial coordinates, enabling spatial domain identification of coherent tissue regions. Many recent approaches rely on graph-based modeling to combine spatial neighborhoods and transcriptomic (gene-expression) similarity, yet neighborhood construction is often unreliable under sparsity and technical noise. As a result, spurious cross-domain shortcut edges can persist in static graphs and propagate misleading signals during message passing, ultimately blurring domain boundaries and weakening cluster separability. In this paper, we propose GatorDuo, a topology-aware dual-graph contrastive self-supervised framework for robust spatial domain identification that couples gene-expression similarity with spatial proximity through complementary neighborhood graphs. GatorDuo introduces global-consistency-based graph refinement that uses a pseudo-label agreement mask to suppress cross-domain shortcut edges in both views, thus stabilizing neighborhood topology for representation learning. To avoid manual tuning of domain resolution, GatorDuo further employs a contextual bandit reinforcement-learning strategy to adaptively select the clustering granularity (the number of clusters) used for refinement. The refined view-specific embeddings are integrated via a hybrid-routing Mixture-of-Experts (MoE) module to generate a unified embedding, optimized with contrastive objectives augmented by an MoE-alignment term. Across eight public benchmarks spanning sequencing- and imaging-based ST at spot and single-cell resolution, and compared with ten representative baselines, GatorDuo consistently delivers strong and robust spatial domain identification performance across multiple clustering metrics, while yielding informative unified embeddings that can support downstream biological analyses.
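As a rough illustration of the pseudo-label agreement mask, the sketch below removes edges whose endpoints carry different pseudo-labels. It assumes a dense adjacency matrix and one integer pseudo-label per spot. The function name `refine_graph` and the toy graph are hypothetical, and GatorDuo's actual refinement may differ in detail.

```python
import numpy as np


def refine_graph(adj: np.ndarray, pseudo_labels: np.ndarray) -> np.ndarray:
    """Zero out edges whose endpoints carry different pseudo-labels."""
    agree = pseudo_labels[:, None] == pseudo_labels[None, :]  # (n, n) boolean mask
    return adj * agree


# Toy 4-node graph: nodes 0-1 and 2-3 form two domains; edge 1-2 is a cross-domain shortcut.
adj = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
labels = np.array([0, 0, 1, 1])
refined = refine_graph(adj, labels)
print(refined)
```

In a dual-graph setting, the same mask would be applied to both the spatial-proximity graph and the expression-similarity graph, so that each view loses its cross-domain shortcut edges.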
bioinformatics2026-05-13v1Disease-guided functional gene mapping across species reveals translational correspondences beyond sequence orthology
Yan, J.; Cao, Z.Abstract
Selecting the correct mouse gene to model a human disease phenotype is critical for translational research, yet sequence-based orthology can fail when genes have been lost, duplicated, or functionally rewired between species. Here we present BRIDGE (Biological Rank Integration for Disease Gene Equivalence), a sequence-free framework that identifies functional mouse equivalents of human disease genes. BRIDGE integrates 3.37 million disease-gene associations, biological pathways, and Gene Ontology annotations into a unified heterogeneous graph with 94,897 nodes and approximately 8.3 million edges. The graph is encoded by a heterogeneous graph transformer and combined with fused Gromov-Wasserstein alignment and multi-strategy reciprocal rank fusion. On two sequence-independent benchmarks, BRIDGE achieves Recall@5 of 61.8-66.7%, compared with 0.0-20.1% for Ensembl Compara. We validate BRIDGE through case studies including neutrophil pathway rewiring (CXCL8 to Cxcl1/2/5), acute-phase divergence (CRP to Apcs), and immune checkpoint substitution (LILRB2 to Pirb), and demonstrate complementarity with sequence methods in drug-translation analysis. Prospective validation of 30 novel predictions against three independent data modalities, including tissue expression, cell-type expression, and phenotype concordance, shows that BRIDGE picks are favored in 64 of 65 orthogonal tests (sign test P = 3.6 x 10^-10) and significantly outperform tested baselines including Ensembl Compara, BLAST RBH, and ESM-2. BRIDGE provides a benchmarked framework for functional cross-species gene mapping in disease-model design.
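Reciprocal rank fusion and Recall@k, both mentioned in the abstract, are simple to state in code. The sketch below uses the standard RRF formula with the conventional constant k = 60. The candidate rankings are toy data, and how BRIDGE builds and weights its individual strategies is not specified here.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: score(item) = sum over lists of 1 / (k + rank), with rank starting at 1."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Sort by descending fused score; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


def recall_at_k(ranked, relevant, k=5):
    """Fraction of relevant items found in the top-k of a ranked list."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


# Hypothetical candidate rankings for one human query gene from three strategies.
rankings = [
    ["Cxcl1", "Cxcl2", "Cxcl5", "Ccl2"],
    ["Cxcl2", "Cxcl1", "Ccl2", "Cxcl5"],
    ["Cxcl1", "Cxcl5", "Cxcl2", "Il6"],
]
fused = reciprocal_rank_fusion(rankings)
recall = recall_at_k(fused, {"Cxcl1", "Cxcl2", "Cxcl5"}, k=3)
print(fused, recall)
```

Because RRF uses only ranks, it can combine strategies whose raw scores are on incomparable scales.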
bioinformatics2026-05-13v1BiLSTM-Powered Bilinear Attention for Protein-Ligand Prediction
Cheng, C.-Y.; Chen, Y.-A.; Li, F.-Y.; Re, S.Abstract
Rapid and accurate prediction of protein-ligand binding is essential for drug discovery. While generative AI has driven rapid advancements in structure-based approaches, sequence-based methods remain significantly faster and more cost-effective. Here, we present a weakly supervised deep learning framework integrating graph convolutional networks (GCN) for molecular encoding and bidirectional long short-term memory (BiLSTM) for protein modeling. The latter represents long-range dependencies better than the widely used convolutional neural network (CNN). Leveraging a bilinear attention network (BAN), this model learns protein-ligand pairwise interactions without requiring three-dimensional structural supervision. Trained solely on affinity labels from the publicly available BindingDB dataset, the model classified binders and non-binders with an AUROC of 0.96 and an AUPRC of 0.95. The model generates interpretable attention maps that serve as a "GPS" to locate binding sites. Remarkably, despite the lack of structural training data, it can pinpoint key contact residues confirmed by crystal structures. Our method could function as a scalable filter for giga-scale libraries, allowing rapid screening of drug candidates with direct structural insights into the protein-ligand interface.
bioinformatics2026-05-13v1Systematic Regional Bias is Widespread in ChIP-seq
Hughes, O.; Foley, G.; Balderson, B.; Piper, M.; Boden, M.Abstract
Robust and reproducible results are essential for confident scientific analysis. We demonstrate that transcription factor (TF) Chromatin Immunoprecipitation coupled with sequencing (ChIP-seq) suffers from systematic bias that may threaten its reproducibility: 80% of 200+ condition-matched, dual-replicate experiments in ENCODE contain genomic regions of systematic bias. We observe this regional bias even between replicates produced within the same experiment, resulting in thousands of unreplicated peaks, which often contain valuable biological data. We provide evidence that regional bias may lead to qualitative differences in TF biology inferred by different experiments; we discovered eight TFs with binding activity in compact chromatin that was identified by one experiment, yet systematically absent from others. To mitigate the effects of bias, we derive simple but effective metrics to quantify the quality of data within biased regions and demonstrate that they can be used for the robust integration of data from multiple experiments.
bioinformatics2026-05-13v1An improved generic schema for high fidelity data linkage and sample tracing across complex multi-assay medical entomology studies
Kavishe, D. R.; Msoffe, R. V.; Mmbaga, S.; Tarimo, L. J.; Butler, F.; Kaindoa, E. W.; Govella, N. J.; Kiware, S. S.; Killeen, G.Abstract
Evidence-based decision making on malaria vector control strategies increasingly relies on triangulation of data, which requires informatics systems that can integrate data from complex, multi-stage studies involving mosquitoes. This manuscript describes a performance evaluation of an extended version of the generic schema underpinning the VBDs360 platform, specifically improved to accommodate multiple distinct entomological assays spanning the field, insectary, and laboratory. The utility of this extension, with respect to high-fidelity data linkage and robust sample traceability across complex entomological workflows, was evaluated through a case study conducted in southern Tanzania. Wild female mosquitoes were collected from 40 locations across more than 4,000 square km and then reared through multiple generations in an insectary before derived iso-female lineages were tested for phenotypic susceptibility to a pyrethroid insecticide. Such multi-generational lineages (F0 to Fn; where n is greater than or equal to 2) were propagated to prevent non-heritable maternal effects on phenotype and produce enough progeny for standard WHO susceptibility assays. All samples were subsequently archived in a molecular laboratory, where all F0 specimens were tested for sibling species identity. A paper-based implementation of the extended schema enabled successful integration of 77,017 lines of data distributed across 6 different tables that spanned 3 distinct field, insectary, and laboratory workflows, implemented by three different teams working in different locations. At each step, fully independent and redundant primary and secondary keys enabled high-fidelity error correction and sample tracing. Consistently perfect linkage between assay design and sample sorting data was achieved for F0 wild-caught adults, with 100% of 66,108 records successfully linked between field capture and morphological categorization. This complete traceability extended to the propagation of derived Fn lineages, with all 100 and 243 records from 9 adult-derived and 13 larval-derived lineages, respectively, correctly linked. Insecticide susceptibility phenotype data further confirmed 100% linkage for 5,654 records between exposure history and recorded mortality outcome data in the insectary. Although such cross-cleaned linkages to sample analysis and storage data recorded by the laboratory team were not entirely perfect and could be improved, they were nevertheless of very high fidelity (97.3% (1,967/2,022) for F0 samples and 99.3% (437/440) for Fn samples). Overall, this pilot application of the extended generic schema ensured robust data provenance and minimized transcription errors in this complex study distributed across multiple teams and locations. These findings demonstrate how this generic informatics framework may be scaled and adapted to support data integrity across diverse, large-scale, multi-team entomological research workflows.
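The role of redundant keys in record linkage can be sketched as follows. The example assumes each parent record carries an independent primary key and secondary key, and each child record stores both. The field names, key formats, and fallback rule are hypothetical and do not describe the VBDs360 schema itself.

```python
def link_records(parents, children):
    """Link each child to a parent by primary key, falling back to the secondary key.

    parents: list of dicts with 'pk' and 'sk'.
    children: list of dicts with 'parent_pk' and 'parent_sk'.
    Returns (links, unlinked), where links maps child index -> parent pk.
    """
    by_pk = {p["pk"]: p for p in parents}
    by_sk = {p["sk"]: p for p in parents}
    links, unlinked = {}, []
    for i, child in enumerate(children):
        parent = by_pk.get(child["parent_pk"]) or by_sk.get(child["parent_sk"])
        if parent is None:
            unlinked.append(i)
        else:
            links[i] = parent["pk"]
    return links, unlinked


# Toy data (hypothetical): the second child has a transcription error in its primary key.
parents = [
    {"pk": "F0-001", "sk": "SITE01-A"},
    {"pk": "F0-002", "sk": "SITE02-B"},
]
children = [
    {"parent_pk": "F0-001", "parent_sk": "SITE01-A"},
    {"parent_pk": "F0-0O2", "parent_sk": "SITE02-B"},   # letter O typed instead of zero
    {"parent_pk": "F0-999", "parent_sk": "SITE09-Z"},   # no matching parent
]
links, unlinked = link_records(parents, children)
linkage_rate = len(links) / len(children)
print(links, unlinked, f"{linkage_rate:.1%}")
```

The second child is recovered through its secondary key despite the mistyped primary key, which is the kind of error correction that independent, redundant keys make possible.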
bioinformatics2026-05-13v1Phylogenomic coupling of F1 chemosensory and archaellum systems across archaea and monoderm bacteria
Mahanta, U.; Baker, M.; Sharma, G.Abstract
Archaellum-associated motility has been viewed as solely archaeal, yet new findings in Chloroflexota prompt a broader perspective. By analysing a curated set of ~22,000 NCBI reference genomes alongside 2,397 archaeal and 226 archaellum-encoding Chloroflexota genomes, this study systematically characterises the co-distribution of archaellum loci with chemosensory system (CSS) classes. Maximum-likelihood phylogeny of 3,727 F1-type CheA proteins reveals three major clades, with Clade 1 comprising ~80% monoderm representation, uniting archaeal and monoderm bacterial lineages in a shared evolutionary grouping. Overall, this work shows that not only archaeal-type motility but also the F1-CSS-based sensing system might have been transferred from Archaea to Chloroflexota via horizontal gene transfer, and that the two systems shared an evolutionary trajectory.
bioinformatics2026-05-13v1GeneCAD: Plant Genome Annotation with a DNA Foundation Model
Liu, Z.-Y.; Berthel, A.; Czech, E.; Marroquin, E.; Stitzer, M. C.; Hsu, S.-K.; Pennell, M.; Buckler, E. S.; Zhai, J.Abstract
Accurate genome annotation is fundamental to biological discovery, yet identifying gene structures directly from DNA sequence remains a major challenge in complex genomes. We introduce GeneCAD, a sequence-only framework that predicts biologically coherent gene models without requiring species-matched transcriptomic or proteomic evidence. GeneCAD integrates lineage-specific DNA representations from the PlantCAD2 foundation model with a transformer encoder and a chromosome-scale conditional random field (CRF) to enforce structural constraints, such as splice-phase and feature order. To ensure high-quality supervision, we implement a curation strategy using a sequence-based masked-motif score to filter reference transcripts. As a primary validation across diverse angiosperms, including a complex allotetraploid, GeneCAD improves transcript F1 by approximately 9% over current tools like Helixer and BRAKER3, while sharpening boundary precision and achieving a best-in-class recovery of 86% of classical coding sequences. Furthermore, we demonstrate the framework's modularity by adapting it to animal lineages through the substitution of the underlying DNA foundation model. While the long introns of vertebrates challenge full transcript reconstruction, the model remains highly effective at identifying individual exons. By connecting evolutionary signals with structured decoding, GeneCAD provides a versatile and scalable solution for high-fidelity genome annotation across the Tree of Life.
bioinformatics2026-05-12v4ConvergeCELL: An end-to-end platform from patient transcriptomics to therapeutic hypotheses
Shahar, N.; Miller, D.; Shahoha, M.; Lurie, G.; Weiner, I. N.Abstract
Translating transcriptomic data into therapeutic hypotheses remains fragmented and labor-intensive. Here we present ConvergeCELL, a platform combining a patient representation model trained on over 20 million cells across 4,479 patients, an interpretability framework for gene discovery, and a large language model-driven workflow that classifies candidates along an evidence hierarchy and constructs mechanism-of-action hypotheses. Validated on held-out cohorts spanning lupus, multiple myeloma, and sepsis across single-cell and bulk modalities, ConvergeCELL recovers known disease-associated genes at or above differential expression, machine-learning, and patient-level foundation model (PaSCient) baselines. The advantage is most pronounced for clinically validated, disease-specific drug targets: ConvergeCELL ranks TNFSF13B (Belimumab; lupus), TNFRSF17/BCMA (Belantamab; myeloma), and CXCR4 (Plerixafor; myeloma) within the top 0.3% of its gene rankings - significantly outcompeting alternative approaches. ConvergeCELL delivers an end-to-end translational workflow with state-of-the-art performance on both disease-associated gene recovery and patient-level disease classification. The pretrained ConvergeCELL patient representation model and bulk distillation module are publicly available on Hugging Face (huggingface.co/ConvergeBio/virtual-cell-patient) under the Apache 2.0 license.
bioinformatics2026-05-12v1CardioSafe: Multi-task prediction of cardiac ion channel activity with reverse-leak audited benchmarking
Jovanovic, M.; Weidener, L. S.; Brkic, M.; Ulgac, E.; Meduri, A.Abstract
Drug-induced inhibition of the hERG potassium channel is the leading cause of cardiac safety-related drug attrition, but the Comprehensive in Vitro Proarrhythmia Assay (CiPA) framework requires activity data on multiple cardiac ion channels to assess proarrhythmic risk. We present CardioSafe, a three-branch multi-task neural network with cross-attention fusion that integrates chemical fingerprints, ChemBERTa embeddings, and predicted L1000 transcriptomic features to predict blocker status and potency for hERG, Nav1.5, and Cav1.2, with an exploratory IKs head. CardioSafe was trained on the largest publicly reported multi-channel cardiac ion channel dataset, combining ChEMBL 36 with the hERGCentral database (331,127 hERG, 3,160 Nav1.5, 1,138 Cav1.2, and 115 IKs compounds), curated under a pharmacology-aware policy that retains censored measurements and inhibition-percentage votes. Under Tanimoto-similarity-controlled splits, CardioSafe outperforms the leading published comparators (CToxPred2 and CardioGenAI) on the data-rich hERG head; on the smaller Nav1.5 and Cav1.2 heads the standard evaluation is statistically inconclusive. A reverse-leak audit revealed that 22% of Nav1.5 and 21% of Cav1.2 test compounds were present in published comparators' training data (92% as exact compound matches); after removing these contaminated compounds, CardioSafe's lead on Nav1.5 and Cav1.2 also reaches statistical significance, demonstrating that prior cross-publication benchmarks for these channels were inflated by training-data overlap.
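A reverse-leak audit of this kind can be approximated with a short sketch: compute Tanimoto similarity between each test compound and a comparator's training compounds, then flag test compounds that reach a similarity threshold. Fingerprints are represented here as plain sets of "on" bit indices. The function names and toy data are hypothetical, and a real audit would more likely compare canonical structures, since identical fingerprints do not guarantee identical compounds.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def leak_audit(test_fps: dict, train_fps: dict, threshold: float = 1.0) -> dict:
    """Flag each test compound whose best similarity to any comparator-training
    compound reaches the threshold (1.0 means identical fingerprints)."""
    flagged = {}
    for name, fp in test_fps.items():
        best = max((tanimoto(fp, t) for t in train_fps.values()), default=0.0)
        flagged[name] = best >= threshold
    return flagged


# Toy fingerprints as sets of bit indices (hypothetical data).
train = {"cmpdA": {1, 2, 3, 4}, "cmpdB": {5, 6, 7}}
test = {"t1": {1, 2, 3, 4}, "t2": {1, 2, 8, 9}, "t3": {10, 11}}

flags = leak_audit(test, train)
leak_fraction = sum(flags.values()) / len(flags)
clean_test = [name for name, leaked in flags.items() if not leaked]
print(flags, leak_fraction, clean_test)
```

Re-scoring every model on `clean_test` only is the step that separates genuine generalization from overlap with a comparator's training data.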
bioinformatics2026-05-12v1MurineCyto-Det: A High-Resolution Murine BALF Cytology Dataset for Leukocyte Segmentation and Detection
Le, T. X.; Tran, L.-A. T.; Farabi, D. A.; Wang, S.; Phan, A. T. Q.; Cormier, S. A.; Taada, A.; McGrew, D.; Du, Y.; Vu, L. D.Abstract
Automated analysis of murine bronchoalveolar lavage fluid (BALF) cytology is important for preclinical respiratory research, yet progress has been limited by the lack of publicly available, well-annotated mouse BALF image datasets. We present MurineCyto-Det, a high-resolution murine BALF cytology dataset comprising 333 image tiles of size 1024 x 1024 pixels, annotated across five cytological categories with both pixel-level segmentation masks and one-to-one matched bounding boxes. The dataset contains 14,551 annotated cell instances and supports two complementary analysis tasks: morphology-oriented cell segmentation and object-level cell detection. To establish reproducible benchmark baselines, we evaluated representative segmentation and detection models. The results demonstrate the practical utility of MurineCyto-Det while highlighting realistic challenges arising from class imbalance, small object size, irregular cell morphology, and ambiguous debris-like structures. MurineCyto-Det provides a standardized resource for developing, evaluating, and comparing automated methods for murine BALF cytology analysis. The dataset is publicly available at https://doi.org/10.5281/zenodo.17608677.
bioinformatics2026-05-12v1Spurious correlation inflates performance in single-cell perturbation prediction
Nicol, P. B.; Shivakumar, S.; Irizarry, R.Abstract
The increasing number of computational methods designed to predict the effects of genetic perturbations on cellular gene expression profiles has led to a need for rigorous evaluation metrics. Recent benchmarking studies rely on correlation or cosine similarity of differential expression relative to a shared population of control cells. We show that these metrics are systematically inflated by statistical bias induced by reusing the same control population to define both quantities being compared. As a result, even non-informative methods can appear to perform well, particularly in datasets with limited numbers of control cells. Reanalysis of published datasets using a simple control-splitting procedure that removes this bias leads to a substantial reduction in performance previously attributed to biological signal.
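The bias can be reproduced in a small simulation. The sketch below draws control cells, observed perturbed cells, and a non-informative "prediction" from the same distribution, so there is no true effect. It then compares the correlation of differential expression when both profiles subtract the same control mean against the correlation when each uses a disjoint half of the controls. The sample sizes and this particular split are assumptions for illustration, not the authors' exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_ctrl, n_pert = 2000, 20, 200

# Null setting: controls and both perturbed profiles come from the same distribution,
# so there is no true perturbation effect to predict.
ctrl = rng.normal(size=(n_ctrl, n_genes))
pert_true = rng.normal(size=(n_pert, n_genes))   # observed perturbed cells
pert_pred = rng.normal(size=(n_pert, n_genes))   # stand-in for a non-informative prediction


def corr(x, y):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


# Shared control: the same control mean is subtracted from both profiles.
shared_ctrl_mean = ctrl.mean(axis=0)
r_shared = corr(pert_true.mean(axis=0) - shared_ctrl_mean,
                pert_pred.mean(axis=0) - shared_ctrl_mean)

# Split control: disjoint halves of the control cells define the two deltas.
half = n_ctrl // 2
r_split = corr(pert_true.mean(axis=0) - ctrl[:half].mean(axis=0),
               pert_pred.mean(axis=0) - ctrl[half:].mean(axis=0))

print(f"shared-control r = {r_shared:.2f}, split-control r = {r_split:.2f}")
```

With few control cells, the noise in the control mean dominates both deltas, so the shared-control correlation is high even though the prediction carries no information. Splitting the controls removes the shared noise term and the correlation falls to around zero.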
bioinformatics2026-05-12v1Generative Chemistry Platform for Small Molecules Targeting RNA: A Case Study for Chemical Optimization
Allen, T. E. H.; Bonnet, M.; Khan, R. T.Abstract
We introduce the Serna Bio GenAI platform, a generative chemistry and multiparametric optimization platform for the design of RNA-targeting small molecules. Targeting RNA with small molecules has proven historically challenging but offers notable potential upsides, including access to unique mechanisms of action and the ability to target otherwise untargetable genes. We consider a major challenge here to be designing chemistry specific to RNA targeting. Molecular design is a valuable application of AI in drug discovery, but many publicly available models use training data focused on protein targeting - the modality historically best explored in drug discovery. We showcase the difference and value in building a platform specifically for RNA targeting, comparing its performance to state-of-the-art public chemical generators and experimentally validating its chemical designs in comparison to chemistry designed by a human expert.
bioinformatics2026-05-12v1BAT: an integrated pipeline for gene tree construction, annotation, and functional inference
Sheppard, B. D.; Behnken, B.; Steinbrenner, A.Abstract
Gene family functional exploration often requires analyzing motifs, domains, and associated datasets (e.g. gene expression) in the phylogenetic context of a gene tree. As genomic resources become more abundant, local pipelines are needed to analyze gene families of interest with project-specific resources. Here we present BLAST-Align-Tree (BAT), a bioinformatic pipeline for automated gene family phylogeny construction and annotation to enable gene tree exploration. BAT combines a BLAST search of local genome databases with a robust and flexible gene tree construction pipeline that enables multiple modes of annotation. Output visualizations display experimental datasets, custom regex specified amino acid motifs, and protein HMM domain annotations. For flexibility, BAT runs locally and is independent of pre-existing databases, allowing the easy incorporation of custom genomes and datasets. Three primary case studies described here demonstrate the utility of BAT for inferring the function of homologs and orthologs within characterized gene families. BAT is suitable for fine scale phylogenomic analysis of gene families across the tree of life, and default genomes available on installation span model eukaryotes.
bioinformatics2026-05-12v1Temporal-deviation-driven community detection uncovers early-warning signals for critical transitions in complex diseases
Wang, L.; Xu, M.; Yan, H.; Zheng, Y.; Feng, S.; Zhang, Y.; Li, C.; Qiu, D.; Hu, B.; Wan, X.; Zhang, F.Abstract
Early detection of critical transitions in complex diseases is crucial for timely clinical intervention. However, because patients often provide only a single snapshot, identifying sample-specific early-warning signals (EWS) from a dynamical perspective remains challenging, a difficulty compounded by high-dimensional noise amplification. Here, we present TD-COM, a framework for detecting personalized EWS of critical transitions via single-sample community detection. By constructing a temporal perturbation map (STDN), TD-COM captures latent dynamical perturbations inferred from static individual profiles. Combining these temporal-deviation signals with static topological features, TD-COM implements a multi-level node filtering strategy during community detection, effectively suppressing single-sample noise. Validated on hour-scale, multi-year, and multi-decade transcriptomic data, TD-COM robustly detects critical states preceding clinical deterioration and uncovers their underlying molecular mechanisms. Comparative experiments demonstrate that TD-COM outperforms existing methods in accuracy and topological robustness. Thus, TD-COM provides a generalizable framework for personalized early warning of complex diseases, particularly when longitudinal sampling is infeasible.
bioinformatics2026-05-12v1SigBridgeR: An Integrative Framework and Toolkit for Comprehensive Screening and Benchmarking of Phenotype-Associated Cell Subpopulations in Single-Cell Transcriptomics
Yang, Y.; Yan, Z.; Qian, H.; Du, L.; Wang, C.; Peng, Y.; Bu, X.; Zhou, J.-G.; Wang, S.Abstract
Single-cell RNA sequencing has revolutionized our understanding of cellular heterogeneity, yet linking specific cell subpopulations to clinically relevant phenotypes remains a persistent challenge. Although multiple computational methods have been developed to bridge this gap, they are typically implemented as standalone packages with heterogeneous preprocessing pipelines, incompatible parameter conventions, and divergent output formats, thereby hindering rigorous cross-method benchmarking and reproducible multi-method workflows. Here, we present SigBridgeR, an extensible R framework and comprehensive toolkit that currently unifies eight state-of-the-art phenotype-associated cell screening algorithms within consistent workflows. We conducted a systematic benchmarking study across four cancer types (HER2-positive breast cancer, triple-negative breast cancer, lung adenocarcinoma, and ovarian cancer) using both binary phenotypes and patient survival endpoints. Our evaluation incorporated positive and negative control assessments based on differentially expressed genes and randomly selected marker panels, alongside quantitative accuracy comparisons using ground-truth cell labels. Building upon these insights, SigBridgeR provides standardized preprocessing for scRNA-seq and bulk transcriptomic data, unified algorithmic interfaces through a registry-based architecture, ensemble analysis via weighted voting, and comprehensive visualization utilities for multi-method comparison. By lowering technical barriers and promoting methodological standardization, SigBridgeR facilitates reliable discovery of phenotype-relevant cell subpopulations and enhances the translational potential of single-cell omics research.
bioinformatics2026-05-12v1Dual-view Guided Context-aware Network for Automated Bone Lesion Segmentation and Quantification in Whole-body SPECT
Chen, W.; Yang, X.; Lu, J.; Miao, M.; Huang, Y.; Zheng, S.; Zhang, C.; Xie, L.; Zhang, Y.Abstract
Whole-body SPECT bone scintigraphy reflects skeletal metabolic activity throughout the body and plays an indispensable role in the screening, treatment evaluation, and prognostic assessment of bone metastases in tumors. However, the automatic detection and segmentation of hypermetabolic bone lesions remain challenging due to low contrast, limited spatial resolution, and complex lesion distributions. In this study, we proposed Bone-Segnet, a dual-view guided automatic segmentation network for hypermetabolic bone lesions that integrated multi-scale feature modeling, global context modeling, and view-conditioned modulation. Pixel-level annotated anterior and posterior whole-body bone scintigraphy images were used for model training and prediction. The proposed network enhanced the recognition of low-contrast and small-scale lesions through small-lesion enhancement and multi-scale contextual modeling. A Transformer module was further introduced to strengthen global feature representation, while cross-view collaborative modeling was achieved by incorporating the complementary characteristics of anterior and posterior imaging. Experimental results demonstrated that the proposed method outperformed existing approaches across multiple evaluation metrics, with the Dice score improving from 0.7440 to 0.8750, indicating a substantial improvement in segmentation performance. Further quantitative analysis based on the segmentation results revealed significant differences among disease types in lesion count, pixel burden, and spatial distribution patterns, reflecting the heterogeneity of disease-related skeletal metabolic activity. Overall, the proposed method improved automatic lesion segmentation performance and enabled quantitative analysis of lesion burden and spatial distribution patterns, providing objective data support for the assessment of related diseases.
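For reference, the Dice score reported above measures overlap between a predicted mask and a ground-truth mask. The sketch below is the standard definition for binary masks. The smoothing term `eps` and the toy masks are assumptions for illustration and are not taken from the paper.

```python
import numpy as np


def dice_score(pred: np.ndarray, target: np.ndarray, eps: float = 1e-8) -> float:
    """Dice = 2|P ∩ T| / (|P| + |T|) for binary masks.

    eps avoids division by zero; two empty masks therefore score 1.0.
    """
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return float((2.0 * intersection + eps) / (pred.sum() + target.sum() + eps))


# Toy masks: 3 predicted pixels, 3 true pixels, 2 in common.
pred = np.array([[1, 1, 0], [0, 1, 0]])
target = np.array([[1, 0, 0], [0, 1, 1]])
score = dice_score(pred, target)
print(round(score, 4))
```

Pixel burden, as used in the quantitative analysis, would correspond to `pred.sum()` on a segmented image under this representation.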
bioinformatics2026-05-12v1Culsma: A Formal Language for Laboratory Protocols
Chen, Y.; Sun, M.; Tadepally, L.; Wang, J.; Barcenilla, H.; Gonzalez, L.; Brodin, P.Abstract
The application of artificial intelligence to biomedical research increasingly depends on iterative cycles in which AI systems analyze experimental data, propose follow-up conditions, and drive automated execution at scale, a paradigm central to Bio-AI and autonomous laboratory science. For such cycles to operate, laboratory protocols must be expressed in a form that is simultaneously human-readable and machine-executable. Natural-language descriptions, the current standard in laboratory practice, do not satisfy this dual requirement. We present Culsma, a formal language and execution framework that elevates laboratory protocols from informal prose to semantically explicit workflow programs that can be analyzed, validated, executed, and transferred across settings. The same protocol can be read and verified by a bench scientist, and parsed, validated, and executed by an automated pipeline without re-translation. We demonstrate an end-to-end implementation providing concrete evidence of practical viability.
bioinformatics2026-05-12v1Receptor-Anchored Olfaction Representation through Perception-Consistent Metric Learning
Tian, C.; Wang, J.; Hou, J.; Liu, W.; Luo, Y.; Wang, Y.; Yang, L.; Lin, W.Abstract
Olfactory perception arises from distributed activation across hundreds of olfactory receptors (ORs), yet our understanding of this landscape remains constrained by the scarcity of OR affinity measurements. Here, we present Receptor-Anchored Metric Supervision (RAMS), a transfer learning framework using perceptual consistency as weak supervision to predict OR activation spectra. RAMS fine-tunes a pretrained drug-target affinity model by imposing constraints derived from olfactory perception, where similar odorants are encouraged to exhibit similar OR activations. It transfers protein-ligand interaction knowledge learned from large-scale pharmacological data into the olfactory domain and reshapes it toward OR activation prediction. Evaluations against experimental measurements show that RAMS improves the accuracy of receptor-spectrum prediction and yields biologically plausible activation patterns. The predicted spectra show concordance between receptor discriminative capacity and expression level, and highlight the understudied OR52 family as a potential contributor to primary odor recognition. Together, RAMS provides a scalable framework for reconstructing receptor-anchored olfactory representations.
bioinformatics2026-05-12v1Figra: A WebAssembly-based Excel Add-in for publication-quality scientific visualization with ggplot2
Sato, Y.Abstract
Data visualization is a critical step in scientific communication. Most researchers rely on subscription-based software for this purpose, which requires ongoing licensing costs. Free alternatives such as R and Python offer publication-quality output but demand programming expertise that many researchers do not possess. Artificial intelligence tools can assist with figure generation but remain frustrating when users wish to fine-tune specific visual parameters to their preference. Meanwhile, Microsoft Excel, the most widely used tool for scientific data storage and management, offers limited visualization capabilities, forcing researchers to transfer their data to external software as an extra step before creating figures. Here we present Figra, a free Excel Office Add-in that eliminates this extra step by enabling publication-quality ggplot2-based figure generation directly within Excel, with simple and direct control over every visual option. Figra leverages WebAssembly technology (webR) to execute R code entirely within the browser, requiring no R installation, no subscription, and no server connection. The add-in supports over 20 chart types spanning distribution plots, grouped comparisons, time-series, scatter plots, and specialized curve-fitting analyses. For applicable chart types, Figra performs automated or manual statistical analysis supporting both paired and unpaired designs across two or more groups. Additionally, Figra exports simplified, executable R code that reproduces the displayed figure, serving as an educational tool for researchers wishing to learn ggplot2. Figra is open-source and freely available at https://h20gg702.github.io/figra-pages/index.html while the source code is provided at https://github.com/h20gg702/Figra.
bioinformatics2026-05-12v1Engineering a pacemaker-driven human mini-heart guided by spatial multi-omics of sinoatrial node development
Zhu, J.; Zhang, Z.; Gregorio, R. D.; Chang, K.; Dong, X.; Banerjee, K.; Liu, K.; Rea-Moreno, M.; Kizilbash, M.; Alonso, A.; Liu, J.; Tsai, S.; Chen, Y.-W.; Evans, T.; Chen, S.Abstract
The human sinoatrial node (SAN) functions as the primary pacemaker of the heart and coordinates the hierarchical electrical activity that drives cardiac contraction. However, experimental systems capable of reconstructing pacemaker-driven cardiac organization in human tissues remain limited. Here we integrate spatial multi-omics of the human fetal SAN with stem cell engineering to generate pacemaker organoids (Sinoids) and assemble them into a pacemaker-driven human mini-heart composed of sinoatrial, atrial and ventricular cardiac modules. High-resolution spatial transcriptomics and single-nucleus multiomic analyses of human fetal SAN tissues identify regulatory pathways guiding pacemaker lineage specification, which we leverage to engineer human pluripotent stem cell-derived SAN organoids with robust pacemaker identity and electrophysiological activity. When integrated with atrial and ventricular cardioids, Sinoids initiate and coordinate electrical activation across assembled cardiac tissues, establishing directional propagation of electrophysiological signals within structured mini-heart organoids. Combining AI-guided perturbation modeling with functional validation further identifies conserved regulatory pathways controlling pacemaker specification and regionalization, including YAP TEAD and NRG ERBB signaling. Together, these results establish a multiomic-guided strategy for engineering pacemaker tissues and reconstructing cardiac conduction hierarchy in vitro. The pacemaker-driven mini-heart platform provides a modular human cardiac system for studying pacemaker biology, modeling arrhythmia mechanisms and enabling electrophysiological drug discovery.
bioinformatics2026-05-12v1Dogcatcher2: Improved statistical detection of transcriptional readthrough and repetitive element analysis across sequencing platforms
Melnick, M.; Link, C. D.Abstract
Downstream of Gene (DoG) transcription occurs when RNA polymerase II fails to terminate normally at the transcription end site, resulting in extended transcription downstream of the gene. This is a widespread phenomenon linked to cellular stress, cancer and neurodegeneration. Existing tools for DoG detection from short-read RNA-seq rely on absolute coverage thresholds and sliding window approaches that are sensitive to sequencing depth and expression level. Here we present Dogcatcher2, which applies improved statistical detection methods to gene body-normalized coverage profiles. Using long-read ground truth across multiple datasets, we show that Dogcatcher2 outperforms existing methods in both detection sensitivity and boundary accuracy while maintaining high precision even at low sequencing depths. Dogcatcher2 further improves detection on pseudobulk scRNA-seq and snRNA-seq data. Analysis of DoG regions in human reveals specific enrichment for Alu elements including inverted Alu pairs capable of forming double-stranded RNA, with transposable elements within DoG regions showing elevated expression, connecting readthrough transcription to dsRNA generation and innate immune signaling.
bioinformatics2026-05-12v1Carbohydrate active enzymes in Pectobacteriaceae: coevolving enzyme sets and host adaptation
Hobbs, E. E. M.; Gloster, T. M.; Pritchard, L.Abstract
Many phytopathogenic bacteria have evolved large, diverse arsenals of Carbohydrate Active enZymes (CAZymes) that liberate simple sugars, and thus nutrition and energy, from the complex lignocellulosic matrices of their plant hosts. The CAZyme arsenals of these phytopathogens are expected to be influenced by and adapted to the cell wall composition of their plant hosts. The solutions these organisms have reached for the problem of degrading plant material may help us understand their host ranges and present a rich source of novel CAZymes for exploitation in industrial bioprocessing. Here we catalogue and analyse CAZyme complements (CAZomes) of publicly-available Enterobacterial phytopathogen genomes, including those of the economically significant and widely-studied Pectobacterium and Dickeya genera. These comprise a broad diversity of CAZymes, providing insight into host adaptation and a resource for bioprospection of industrially-relevant enzymes. We find evidence supporting coevolution of sets of CAZymes specific to bacterial genus and species and, notably, CAZymes associated with pathogen preference for either woody or soft plant tissue, suggesting adaptation of CAZomes to host plant cell wall composition.
bioinformatics2026-05-12v1WasteFams: A database of protein families from global wastewater microbiomes
Galaras, A.; Chasapi, I. N.; Aplakidou, E.; Chasapi, M. N.; Lamari, E.; Diplari, S.; Georgakopoulos-Soares, I.; Karatzas, E.; Baltoumas, F. A.; Kyrpides, N.; Pavlopoulos, G.Abstract
Wastewater surveillance has emerged as a critical tool for global epidemiology, yet the functional diversity of wastewater microbiomes remains poorly characterized at the protein level. Here, we present WasteFams, the first comprehensive database dedicated to the systematic exploration of protein families in wastewater metagenomic and metatranscriptomic studies worldwide. Integrating data from 580 metagenomes, 132 metatranscriptomes, and 1,709 reference genomes, WasteFams catalogs 3,887 non-redundant protein families (containing ≥100 members) derived from over 105 million predicted proteins. Each protein family is enriched with multi-layered annotations, including AlphaFold3 structural predictions, taxonomic classifications, and biome-specific metadata. To further expand their functional annotation, we integrated deep genomic context analysis to link protein families to Mobile Genetic Elements (MGEs), Biosynthetic Gene Clusters (BGCs), Antibiotic Resistance Genes (ARGs), and CRISPR elements. Accessible through the EnvoFams portal, WasteFams provides a user-friendly interface featuring advanced search capabilities, sequence and structural similarity tools, and interactive visualization modules. As global initiatives increasingly leverage wastewater for public health and environmental insights, WasteFams can serve as a critical resource for discovering novel microbial functions, monitoring resistance mechanisms, and exploring the biotechnological potential of secondary metabolites within wastewater-engineered ecosystems.
bioinformatics2026-05-12v1CausalKnowledgeTrace: A Novel Computational Framework for Automated Literature-Based Causal Graph Construction and Evidence-Based Variable Selection in Biomedical Research
Upadhayaya, R.; Pradhan, M. M.; Metzger, V. T.; Malec, S. A.Abstract
Background: Variable selection for causal inference from observational biomedical data is challenging, as overlooking confounders or conditioning on colliders leads to biased estimates. While vast causal knowledge exists in biomedical literature, manually extracting this information for principled variable selection is impractical at scale. Methods: We developed CausalKnowledgeTrace, a Python-based computational framework with a Django web interface that systematically leverages structured causal knowledge from the Semantic MEDLINE Database (SemMedDB) to inform variable selection in causal studies. The system implements a six-stage analysis pipeline using NetworkX for graph operations, including graph parsing, basic analysis, comprehensive cycle detection, systematic generic node removal, post-removal analysis, and formal causal inference with bias detection. Results: Analysis of the hypertension and Alzheimer's relationship across three degree neighborhoods (1 to 3) demonstrated systematic scaling of causal complexity: 361 to 866 variables, 429 to 1,442 relationships, with graph densities of 0.0033 to 0.0019. The analysis revealed complex cyclic structures with 54 to 606 baseline cycles across degree levels. Processing times ranged from 0.3 to 1.0 seconds for all three degrees, demonstrating computational efficiency for complex biomedical networks. Key confounders identified across all degrees included inflammation, diabetes, insulin resistance, obesity, and ischemia. In the third-degree graph, the pipeline structurally identified 39 confounders, 11 mediators, and 3 colliders. Among the key identified confounders and mediators (including obesity, oxidative stress, ischemia, and vascular diseases), all were found to have strong supporting evidence in established epidemiological and pathophysiological literature.
Conclusions: CausalKnowledgeTrace provides a scalable, evidence-based approach to causal graph construction that systematically identifies confounders and bias structures often missed by conventional approaches. The Python-Django architecture enables both standalone analysis and integration into larger computational workflows, representing a significant advance in computational support for causal inference in biomedical research.
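The reported densities are consistent with the standard density of a simple directed graph, edges / (n × (n − 1)), if the 361-variable graph is paired with 429 relationships and the 866-variable graph with 1,442 (that pairing is an assumption read from the abstract). The sketch below checks this arithmetic and shows one simplified, purely structural way to label confounders, mediators, and colliders by reachability. It uses only the Python standard library rather than NetworkX; the function names, the toy graph, and the role definitions are illustrative assumptions, not the CausalKnowledgeTrace implementation.

```python
def directed_density(n_nodes, n_edges):
    """Density of a simple directed graph: edges / (n * (n - 1))."""
    return n_edges / (n_nodes * (n_nodes - 1))


def reachable(graph, start):
    """Return every node reachable from `start` along directed edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for child in graph.get(node, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def classify_covariates(graph, exposure, outcome):
    """Assign one simplified structural role per node (assumed definitions):
    mediator   = downstream of exposure and upstream of outcome
    collider   = downstream of both exposure and outcome
    confounder = upstream of both exposure and outcome
    """
    nodes = set(graph) | {c for children in graph.values() for c in children}
    down = {node: reachable(graph, node) for node in nodes}
    roles = {"confounder": set(), "mediator": set(), "collider": set()}
    for node in nodes - {exposure, outcome}:
        from_exposure = node in down[exposure]
        from_outcome = node in down[outcome]
        to_exposure = exposure in down[node]
        to_outcome = outcome in down[node]
        if from_exposure and to_outcome:
            roles["mediator"].add(node)
        elif from_exposure and from_outcome:
            roles["collider"].add(node)
        elif to_exposure and to_outcome:
            roles["confounder"].add(node)
    return roles


# Hypothetical toy graph: X = exposure, Y = outcome.
TOY_GRAPH = {
    "Z": ["X", "Y"],  # Z causes both X and Y
    "X": ["M", "S"],  # X acts on Y through M, and also causes S
    "M": ["Y"],
    "Y": ["S"],       # S is a common effect of X and Y
}

if __name__ == "__main__":
    print(round(directed_density(361, 429), 4))
    print(round(directed_density(866, 1442), 4))
    print(classify_covariates(TOY_GRAPH, "X", "Y"))
```

Real literature-derived graphs contain cycles, as the abstract notes, so a node can satisfy more than one of these definitions at once; the `elif` ordering here is an arbitrary tie-break.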
bioinformatics2026-05-12v1BatchVaria: a variance-aware framework for evaluating batch correction in high-dimensional omics data
Moir, N.; Sherwood, K.; Simpson, I.Abstract
Batch effects and other unwanted technical sources of variation remain a persistent challenge in the integrative analysis of high-dimensional -omics data. Although established methods such as ComBat effectively mitigate batch-associated signal, their impact on biologically meaningful variation is frequently evaluated in an ad hoc and non-quantitative manner. This is particularly problematic in heterogeneous disease contexts, such as breast cancer transcriptomics, where technical and biological sources of variation may be partially confounded. We present BatchVaria, an R package that implements a variance-aware framework for batch correction and post-adjustment evaluation. BatchVaria integrates variance component modelling, batch adjustment, and systematic re-profiling within a unified analysis container, enabling iterative quantification and reassessment of technical and biological variance contributions while preserving analytical provenance. By supporting multiple variance profiling engines and structured storage of intermediate results, BatchVaria facilitates transparent and reproducible evaluation of batch correction strategies. We demonstrate the utility of BatchVaria using a publicly available breast cancer transcriptomic dataset with known covariate-driven structure, illustrating how iterative variance profiling can guide responsible batch correction without erosion of subtype-associated biological signal.
bioinformatics2026-05-12v1A novel vaccine and drug targets for global eradication of bovine tuberculosis: Holistic frameworks for construction of a potent vaccine and identification of drug targets
Pawar, P.; Samarasinghe, S.; Kulasiri, D.Abstract
Bovine tuberculosis (TB), caused by Mycobacterium bovis, has become a global concern over the last two decades. Bovine TB primarily affects cattle, but other domestic livestock are also affected, and it is more common in less developed and developing countries. The significant loss of livestock leads to trade restrictions and economic crises. The zoonotic potential of bovine TB raises public health concerns. Currently, no effective treatment is available, and animals are usually slaughtered to reduce the disease burden in the environment. Antibiotic therapy can be used on animals living in captivity, but it is not reliable for herd or free-grazing animals. The BCG vaccine is another option available for treating the disease, but it shows limited efficacy in cattle. The prevention of bovine TB is a long-term goal that can only be accomplished by developing a more effective vaccine than BCG and designing new drugs. In this research, we propose therapeutic drug targets and a vaccine for treating bovine TB. The conceptual framework for the vaccine developed in this study uses a number of bioinformatics approaches to identify potential vaccine candidates and construct an in-silico epitope-based vaccine. Our holistic framework identified potential therapeutic candidates by directly analysing the proteome of TB bacterial strains. Specifically, we performed a comparative proteomic analysis of 11 Mycobacterium bovis strains to cover the diversity and identify conserved proteins among those strains for developing the bovine TB vaccine. An extensive reverse vaccinology and immunoinformatics analysis provided 26 highly immunogenic, non-toxic and non-allergenic epitopes (8 CTL epitopes, 2 HTL epitopes and 16 B-cell epitopes) for Mycobacterium bovis required for three-dimensional structure construction of the TB vaccine. The constructed epitope-based vaccine showed a potent interaction inside the host, thus generating efficient cell-mediated and humoral immune responses.
Next, a framework based on a novel subtractive proteomic approach was developed for identifying bovine TB drug targets. We performed this approach on the 11 Mycobacterium bovis strains and identified nine drug targets that are conserved, essential, antigenic and have unique metabolic pathways in Mycobacterium bovis. These drug targets could further help investigate therapeutic drugs for the treatment of bovine TB. Several bioinformatics prediction tools were used together to ensure checks and balances, aiming to reduce the chance of errors and provide accurate results. The vaccine and drug targets developed in this study can be tested experimentally with confidence for further validation as therapeutics with the potential to eradicate bovine TB globally. The strategies implemented in the study are generic and can be used for other zoonotic infectious diseases. This study would be a game changer in the field of bovine tuberculosis treatment.
bioinformatics2026-05-12v1Amino Acid Insertion Energetics in a POPC Bilayer from Unbiased Molecular Dynamics
Bories, S. C. A.; Lague, P.Abstract
Membrane association is governed by the thermodynamics of amino acid partitioning between water and the lipid bilayer. Here, we quantified amino acid side-chain insertion energetics in a 1-palmitoyl-2-oleoyl-sn-glycero-3-phosphocholine (POPC) bilayer using unbiased molecular dynamics simulations. Equilibrium depth distributions of 28 analogs, including multiple protonation states, were converted into potentials of mean force (PMFs) by Boltzmann inversion. The resulting PMFs reproduced the main features of bilayer partitioning. Hydrophobic analogs favored the bilayer core, aromatic analogs were stabilized in interfacial regions, and polar or charged analogs remained unfavorable in the hydrophobic interior. A diglycine analog representing the peptide backbone behaved similarly to uncharged polar residues. Depth-dependent pKa profiles and orientational analyses further showed how protonation equilibria and aromatic-ring alignment influence insertion energetics. Agreement with experimental hydrophobicity scales supports the robustness of the approach. These results provide an efficient and internally consistent framework for characterizing bilayer insertion energetics and establish a reference for future studies in more complex lipid environments.
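Boltzmann inversion, as mentioned above, converts an equilibrium probability distribution into a free-energy profile via PMF(z) = −k_B·T·ln(P(z)/P_ref). The following is a minimal sketch of that conversion for a histogram of insertion depths. The temperature, the choice of the first bin (taken here to be bulk water) as the reference, and the example counts are assumptions for illustration and are not taken from the paper.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol*K)


def boltzmann_inversion(counts, temperature=310.0, ref_bin=0):
    """Convert a depth histogram into a PMF in kcal/mol.

    counts      -- number of frames observed in each depth bin
    temperature -- simulation temperature in K (assumed value)
    ref_bin     -- bin whose PMF is defined as zero (assumed: bulk water)
    """
    total = sum(counts)
    probs = [c / total for c in counts]
    p_ref = probs[ref_bin]
    kt = K_B * temperature
    # Bins that were never visited have an undefined (infinite) free energy.
    return [-kt * math.log(p / p_ref) if p > 0 else math.inf for p in probs]


if __name__ == "__main__":
    # Hypothetical counts from bulk water (first bin) toward the bilayer core.
    depth_counts = [100, 50, 200, 0]
    for index, energy in enumerate(boltzmann_inversion(depth_counts)):
        print(index, energy)
```

Because the estimate depends on how often each depth is visited, poorly sampled bins carry large statistical uncertainty, which is the usual caveat when deriving PMFs from unbiased trajectories.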
bioinformatics2026-05-12v1MucOneUp: A Simulation Framework for MUC1-VNTR Variant Benchmarking
Popp, B.; Saei, H.Abstract
Summary: Variable number tandem repeats (VNTRs) in the MUC1 gene cause autosomal dominant tubulointerstitial kidney disease when disrupted by frameshift variants, but the GC-rich 60-bp repeat structure (20-125 copies) challenges variant detection. While tools like VNtyper enable MUC1 variant calling, no gold-standard benchmarking datasets exist for systematic performance evaluation. We present MucOneUp, a specialized simulation framework for generating MUC1-VNTR reference sequences with targeted variants and platform-specific sequencing reads (Illumina, Oxford Nanopore, PacBio). MucOneUp employs Markov chain-based repeat generation, supports diploid simulation with customizable variant placement, and includes additional analysis modules for SNaPshot assay simulation and exploratory frameshift analysis. We validate MucOneUp through a multi-variant, cross-platform benchmark of six tool-platform combinations using 13 distinct frameshift variants and investigate VNTR length effects on detection.
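Markov chain-based repeat generation can be pictured as drawing each repeat unit conditioned on the previous one. The sketch below illustrates the general idea only: the repeat labels ("A", "B", "C"), the transition probabilities, and the function names are hypothetical and do not reflect MucOneUp's actual configuration or API. The 20 to 125 copy range is the one quoted in the abstract.

```python
import random

# Hypothetical first-order transition table between repeat-unit types.
TRANSITIONS = {
    "START": {"A": 0.6, "B": 0.4},
    "A": {"A": 0.5, "B": 0.3, "C": 0.2},
    "B": {"A": 0.4, "C": 0.6},
    "C": {"A": 0.7, "B": 0.3},
}


def generate_repeat_chain(n_repeats, rng):
    """Sample a chain of repeat-unit labels from the Markov chain."""
    chain = []
    state = "START"
    for _ in range(n_repeats):
        options = TRANSITIONS[state]
        state = rng.choices(list(options), weights=list(options.values()), k=1)[0]
        chain.append(state)
    return chain


def simulate_diploid(rng, min_copies=20, max_copies=125):
    """Draw two haplotypes with independently sampled repeat counts."""
    return tuple(
        generate_repeat_chain(rng.randint(min_copies, max_copies), rng)
        for _ in range(2)
    )


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the simulation is reproducible
    hap1, hap2 = simulate_diploid(rng)
    print(len(hap1), "".join(hap1))
    print(len(hap2), "".join(hap2))
```

In a full simulator each label would map to a concrete 60-bp repeat sequence, and a variant (for example a frameshift insertion) would be placed into a chosen repeat before read simulation.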
bioinformatics2026-05-12v1Mechanisms Matter: Transportability of Cellular Perturbation Effects
Qi, S.-a.; Chapfuwa, P.Abstract
Predicting cellular responses to genetic or chemical perturbations across biological contexts is central to drug development and disease understanding. Despite increases in data and model scale, deep learning models have not consistently outperformed simple baselines. Leveraging causal transportability theory, we show that cross-context generalization is governed by shared causal mechanisms, not merely distributional similarity. To enable controlled evaluation, we develop a causal simulator that generates realistic semi-synthetic Perturb-seq datasets with tunable mechanistic divergence, providing benchmarks with known ground-truth causal structure. Further, we adapt the Vendi diversity score to the perturbation setting as a diagnostic for mode collapse, a failure mode invisible to standard per-perturbation metrics. Extensive experiments across four deep learning models and six simple baselines on semi-synthetic and real Perturb-seq datasets reveal a cross-context generalization gap: performance under cross-context splits drops substantially, often to simple baseline levels. Notably, even on synthetic data with fully specified causal structure, no model generalized across contexts with different causal mechanisms. These results underscore the need for cross-context evaluation, diversity-aware metrics, and mechanistically grounded inductive biases.
bioinformatics2026-05-12v1misoTar: A novel approach for predicting miRNA and isomiR targets
Ripan, R. C.; Li, X.; Hu, H.Abstract
Understanding the interactions between microRNAs/isomiRs and mRNAs has long been a major challenge in RNA biology. Although numerous computational approaches have been developed to predict these interactions, most fail to account for isomiR mediated targeting. To address this limitation, we developed misoTar, a deep learning framework trained on more than 6.662 million positive and negative interaction pairs derived from 67 publicly available human samples across six independent studies. In five-fold cross-validation, misoTar achieved an average precision of 0.930 and a recall of 0.898. Evaluation on independent test datasets demonstrated consistently superior or comparable performance relative to existing tools, including TargetScan, Mimosa, DMISO, and TEC-miTarget. In addition, single-nucleotide mutation analyses of true positive interactions revealed the critical functional contributions of non-seed regions in microRNA/isomiR targeting. Overall, misoTar provides a robust and accurate framework for predicting microRNA/isomiR interactions while offering new biological insights into microRNA targeting mechanisms. The misoTar tool is publicly available at https://figshare.com/projects/misoTar/262723.
bioinformatics2026-05-12v1Identifying Context-Specific Cell-Cell Interaction Genes Without Ligand-Receptor Databases from Spatial Transcriptomics
Kim, H.; Park, B.; Jung, J.; Lee, S.; Panahandeh, S.; Kwon, S.; Li, J. J.; Madan, E.; Kim, D.; Kim, J.; Gogna, R.; Won, K. J.Abstract
Current approaches to inferring cell-cell interactions (CCIs) are largely constrained by predefined ligand-receptor databases, particularly for low-resolution spatial transcriptomics (ST) platforms such as Visium. Due to the difficulties in accurately resolving interacting cells at coarse spatial resolution, other modes of interaction are often overlooked. Low-resolution ST data, however, can serve as an alternative to high-resolution ST, which suffers from low sensitivity, and to image-based ST, which is limited by restricted gene panels. Here, we present CellNeighborEX v2, a database-free framework that directly infers CCI-associated genes from ST data by detecting deviations between observed and expected gene expression at the spot-population level. These deviations are rigorously evaluated through a hybrid statistical framework involving permutation testing and are further refined by considering the abundance of interacting cell-type pairs. Compared with other conventional approaches relying on ligand-receptor databases, CellNeighborEX v2 can capture CCI genes from a broad spectrum of interactions, including both paracrine signaling and contact-dependent communication. Across datasets from hippocampus, liver cancer, colorectal cancer, ovarian cancer, and lymph node infection, CellNeighborEX v2 accurately recapitulated previously identified CCIs. Notably, it uniquely detected interactions absent from existing ligand-receptor databases, enabling detection of context-specific CCIs from Visium data. CellNeighborEX v2 is a tool that expands the analytical spectrum of Visium data and deepens our understanding of the molecular language of intercellular communication.
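The permutation-testing step can be illustrated with a generic two-sample permutation test on expression residuals (observed minus expected). This is a sketch under stated assumptions and not the CellNeighborEX v2 procedure: the grouping into heterotypic spots versus all other spots, the test statistic (difference in mean residual), and the example values are all hypothetical.

```python
import random
from statistics import mean


def permutation_pvalue(group_a, group_b, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in group means.

    group_a -- residuals in spots containing the cell-type pair (assumed)
    group_b -- residuals in all other spots (assumed)
    """
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # randomly reassign spots to the two groups
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            hits += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Hypothetical residuals for a single gene.
    heterotypic = [5.1, 4.8, 5.3, 5.0, 4.9, 5.2]
    other_spots = [0.1, -0.2, 0.0, 0.2, -0.1, 0.1]
    print(permutation_pvalue(heterotypic, other_spots))
```

In practice such a test would be run per gene and per cell-type pair, so the resulting p-values would need multiple-testing correction before any gene is reported.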
bioinformatics2026-05-12v1