Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
A geometric criterion links HIV-1 capsid topography to its biophysical properties and function
Li, W.; Peeples, C. A.; Rey, J. S.; Perilla, J. R.; Twarock, R.
Mathematical models of virus capsid structure are pillars of modern virology, aiding the understanding of viral mechanisms and the design of antiviral interventions. Traditionally, the HIV-1 capsid core geometry is represented as a fullerene lattice, akin to the icosahedral models of spherical viruses in Caspar-Klug theory. However, recent studies revealed that many viral capsids deviate from such idealised lattices, with important functional implications. Here we show that this is also the case for the conical HIV-1 core geometries, in which the hexamer and pentamer boundaries form a pseudo-tiling rather than a perfectly aligned fullerene network. We introduce a triangular geometric criterion that quantifies local deviations of an HIV-1 atomic model from its idealised fullerene backbone. Using this criterion, we show that this difference in geometric organisation between idealised (fullerene) and actual (data-derived) capsid models has implications for the capsid's biophysical properties. We also discuss the use of the geometric criterion as a predictive tool for cofactor binding and for the implied geometric changes in the capsid surface coupled to the interfacial frustration response. Our results establish a quantitative framework linking capsid geometry, curvature, and biophysical function, offering new perspectives for assembly inhibitor design and lentiviral vector engineering.
bioinformatics 2026-05-14 v3
geneRNIB: a living benchmark for gene regulatory network inference
Nourisa, J.; Passemiers, A.; Kalfon, J.; Stock, M.; Zeller-Plumhoff, B.; Cannoodt, R.; Arnold, C.; Netea, M. G.; Hartford, J.; Tong, A.; Scialdone, A.; Cantini, L.; Moreau, Y.; Raimondi, D.; Li, Y.; Luecken, M.
Gene regulatory networks (GRNs) underpin cellular identity and function, playing a key role in health and disease. GRN inference has received substantial attention, motivating systematic benchmarking. Despite various benchmarking efforts, existing studies remain limited in the number of methods, datasets, and metrics, fail to capture the context-specific nature of regulatory interactions across biological conditions, and are constrained by the absence of a reliable ground truth. Here, we introduce geneRNIB, a comprehensive GRN inference benchmarking framework built on three key principles: continuous integration, context-specific evaluation, and holistic assessment in the absence of a true reference network. geneRNIB enables the seamless incorporation of new algorithms, datasets, and evaluation metrics to reflect ongoing developments. In the current version, we systematically integrated and assessed 12 GRN inference methods, spanning single- and multiomics approaches, across 11 datasets including thousands of perturbation scenarios. We introduced complementary metrics specifically designed to assess context-specific inference. Our findings indicate that simple models with fewer assumptions often outperform more complex pipelines across several perturbation-informed and predictive metrics. Notably, gene expression-based algorithms yielded better results than more advanced multimodal approaches. In addition, we identify several potential factors that influence the performance of GRN inference and offer actionable guidelines for future method development. By addressing these critical limitations in existing benchmarks, geneRNIB advances GRN inference research and fosters progress toward personalized medicine.
bioinformatics 2026-05-14 v2
A fully open structure-guided RNA foundation model for robust structural and functional inference
Zhu, H.; Li, R.; Chang, A.; Chen, H.; Zhang, F.; Tang, F.; Ye, T.; Li, X.; Gu, Y.; Xiong, P.; Zhou, S. K.
RNA language models have achieved strong performance across diverse downstream tasks by leveraging large-scale sequence data. However, RNA function is fundamentally shaped by its hierarchical structure, making the integration of structural information into pre-training essential. Existing methods often depend on noisy structural annotations or introduce task-specific biases, limiting model generalizability. Here, we propose structRFM, a structure-guided RNA foundation model that is pre-trained on millions of RNA sequences and their secondary structures by integrating base-pairing interactions into masked language modeling through a novel pair matching operation. We further introduce MUSES (multi-source ensemble of secondary structures) to mitigate model bias, and a dynamic masking ratio to balance the structure-guided mask and the nucleotide-level mask. structRFM learns joint knowledge of sequential and structural data, producing versatile representations, including classification-level, sequence-level, and pairwise matrix features, that support a broad spectrum of downstream adaptations. structRFM ranks among the top models in zero-shot homology classification across seventeen biological language models, and sets new benchmarks for secondary structure prediction. structRFM further derives Zfold, which enables robust and reliable tertiary structure prediction, with consistent improvements in estimating 3D structures and the 2D structures extracted from them, achieving a performance gain of about 20% over baselines and performance comparable to AlphaFold3 on the CASP15-natural, CASP16, and RNA-Puzzles datasets. In functional tasks such as internal ribosome entry site identification, structRFM achieves a 48% gain in F1 score. Furthermore, state-of-the-art performance in extensive experiments across novel RNA families and long non-coding RNAs indicates the robustness and generalizability of structRFM.
These results demonstrate the effectiveness of structure-guided pre-training and highlight a promising direction for developing multi-modal RNA language models in computational biology. To support the broader scientific community, we have made the 21-million sequence-structure dataset and the pre-trained structRFM model fully open-source, facilitating the development of multimodal foundation models in biology.
bioinformatics 2026-05-14 v2
Anatomy-Guided 3D Graph Networks for Couinaud Segmentation in Tumor Affected Livers
You, L.; Dang, H.; Wang, H.; Matta, E.; Zhou, X.
Image-based liver Couinaud segmentation is designed to automatically locate suspicious objects in liver CT/MR images. Once this is achieved, physicians are guided to the target slice and the area where the suspicious node is located. However, conventional algorithms trained primarily on healthy liver images often fail to generalize to hepatocellular carcinoma (HCC) cases due to pathological structural distortions. In this work, we propose a robust two-stage framework that integrates a 3D UNet with a 3D Anatomical Structure-Guided Graph Convolutional Network (3D GCN). This two-stage strategy isolates the liver volume to eliminate structural noise from neighboring organs, such as the spleen, allowing the framework to focus exclusively on the complex 3D anatomical relationships among the eight segments. To ensure the topological consistency required for global spatial reasoning, we implement a standardized preprocessing pipeline that normalizes liver-only volumes to exactly 50 frames along the z-axis. By combining a lightweight 3D UNet backbone with the 3D GCN for refined boundary reasoning, our model demonstrates superior generalization performance on unseen clinical datasets, achieving a mean Dice score of 0.828 in blind testing. By releasing our code and pretrained weights, we aim to provide the first publicly available deep learning resource for robust Couinaud segmentation.
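The two quantitative ingredients named in this abstract, the mean Dice score and the normalization to 50 z-slices, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function names are hypothetical, label volumes are flattened to plain lists, and nearest-slice index sampling stands in for whatever resampling the authors actually use, which the abstract does not specify.

```python
def dice_score(pred, truth, label):
    """Dice coefficient for one label over two equal-length flat label sequences."""
    pred_mask = [p == label for p in pred]
    truth_mask = [t == label for t in truth]
    intersection = sum(p and t for p, t in zip(pred_mask, truth_mask))
    total = sum(pred_mask) + sum(truth_mask)
    if total == 0:
        return None  # label absent from both volumes, so Dice is undefined
    return 2.0 * intersection / total


def mean_dice(pred, truth, labels):
    """Average Dice over the labels that occur in at least one of the volumes."""
    scores = [dice_score(pred, truth, label) for label in labels]
    scores = [s for s in scores if s is not None]
    return sum(scores) / len(scores) if scores else None


def resample_slice_indices(n_slices, target=50):
    """Nearest-slice indices mapping a stack of n_slices onto `target` slices.

    Assumption: simple index sampling; the paper may interpolate instead.
    """
    return [min(n_slices - 1, int((i + 0.5) * n_slices / target))
            for i in range(target)]


if __name__ == "__main__":
    pred = [1, 1, 2, 2, 0]
    truth = [1, 2, 2, 2, 0]
    print(mean_dice(pred, truth, labels=[1, 2]))
    print(len(resample_slice_indices(137)))
```

For Couinaud segmentation the label set would be the eight segments; the toy labels above only show the arithmetic.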
bioinformatics 2026-05-14 v1
Viral non-coding RNA structure annotation and API-based data retrieval with Rfam and R2DT
Muston, P.; Triebel, S.; Nawrocki, E.; Ontiveros-Palacios, N.; Jandalala, I.; Sweeney, B.; Bateman, A.; Marz, M.; Petrov, A. I.; Madrigal, P.
Rfam is a comprehensive database of non-coding RNA (ncRNA) families providing curated sequence alignments, consensus secondary structures, and covariance models for thousands of RNA families. The database is essential for identifying structured non-coding RNAs in newly sequenced genomes and understanding RNA structure-function relationships. Here we present computational protocols for automated ncRNA annotation of viral genomes, and for programmatic interaction with Rfam through its RESTful API. We showcase genome-wide RNA structure visualization from a genome sequence and from a multiple sequence alignment by generating comprehensive 2D structure diagrams using newly developed features in R2DT. We also present practical examples for retrieving family metadata, downloading alignments, accessing secondary structures, and searching user sequences from the Rfam API. These methods enable researchers in virology and RNA biology to integrate Rfam data into custom bioinformatics pipelines, comparative analyses, and machine learning workflows.
bioinformatics 2026-05-14 v1
MethylCurate: Tool For Dataset Curation and Epigenetic Aging Clock Evaluation
Edwards, T. A.; Shen, L.; Long, Q.
DNA methylation datasets from public repositories such as NCBI Gene Expression Omnibus are central to the development and evaluation of epigenetic aging clocks, yet existing resources and tools do not fully resolve the bottlenecks of dataset retrieval and metadata harmonization. Current benchmarking frameworks often rely on static curated collections, support only a subset of available Gene Expression Omnibus studies, focus on specific tissues, or require substantial manual intervention when metadata fields and supplementary files are inconsistently structured across studies. We developed MethylCurate, an agentic AI framework that addresses these limitations by automating the retrieval of DNA methylation datasets from the Gene Expression Omnibus, harmonizing heterogeneous metadata, mapping datasets to a unified format, and enabling scalable evaluation of epigenetic aging clocks through an integrated, dialogue-driven workflow.
bioinformatics 2026-05-14 v1
PXN Unlocks the Power of Public Gene Expression Data Through Cross-Technology Integration
Sui, Z.; Yu, D.; Erdengasileng, A.; Zhang, J.; Qiu, X.
The immense value of public gene expression repositories is constrained by the lack of compatibility among datasets generated from diverse experimental technologies. Differences in measurement scales, probe chemistries, and signal distributions create systematic discrepancies across platforms and laboratories. These inconsistencies make large-scale integrative analysis nearly impossible, even though such studies could achieve great statistical power and improved reproducibility. We introduce PXN, a probabilistic machine learning framework that captures a unified representation of biological signal across multiple gene expression technologies. Once trained, PXN can seamlessly translate data between multiple platforms, preserving informative biological variation while removing technology-specific biases. In benchmarking studies, PXN consistently outperforms existing normalization methods in cross-platform accuracy and substantially enhances the power of differential expression analysis. Importantly, we show that PXN is powerful enough to bridge even the most challenging technological divide - between microarray and RNA-seq. This capability provides a scalable route for integrating legacy microarray data with modern RNA-seq studies. By enabling direct comparison and integration of heterogeneous datasets, PXN unlocks the full potential of public repositories for future biological discovery and therapeutic innovation.
bioinformatics 2026-05-14 v1
End-to-end mapping of membrane transport from chemical structure to microorganisms
Gricourt, G.; Duigou, T.; Meyer, P.; Faulon, J.-L.
Membrane transport is a fundamental biological process with profound implications for pharmacology, biotechnology, and microbiology. While computational approaches have largely adopted a protein-centric perspective to annotate transportomes, inferring transport function directly from the intrinsic properties of substrates remains a major challenge. Addressing transport at the compound level enables the systematic evaluation of whether molecules undergo active transport and by which mechanisms, independent of prior transporter annotation. Here, we introduce ChemProFlow, a comprehensive computational framework that redefines transport analysis from a substrate-centric perspective. By integrating geometric deep learning with orthology-based genomic mapping, ChemProFlow predicts molecular transportability, assigns transport mechanisms according to the Transporter Classification Database, and identifies the microorganisms encoding the corresponding transport systems. We show that this integrated pipeline enables scalable, end-to-end mapping of substrate-transporter-organism relationships, with broad applications in pharmacology for anticipating drug transport, in biotechnology for guiding strain engineering, and in microbiology for dissecting substrate utilization across diverse taxa. By capturing the chemical determinants of transportability, ChemProFlow generalizes to previously unseen substrates and provides a high-throughput framework for systematic exploration of molecular transport across diverse biological contexts.
bioinformatics 2026-05-14 v1
mehari: high-performance, strict HGVS-first variant effect prediction
Hartmann, T. F.; Zhao, M. X.; Beule, D.; Holtgrewe, M.
Variant annotation requires the precise and consistent computation of Sequence Ontology (SO) terms and Human Genome Variation Society (HGVS) nomenclature. To ensure robust synchronization between these two key facets, we present mehari, a high-performance variant effect predictor implemented in Rust that employs a strict "HGVS-first" approach. By deterministically projecting variants to transcripts before evaluating functional consequences, mehari structurally aligns HGVS notation and SO terms. Benchmarking on ClinVar demonstrates that mehari achieves exceptional processing speeds and high concordance with established tools like Ensembl VEP, while also providing refined handling for complex biological edge cases such as selenoprotein recoding.
bioinformatics 2026-05-14 v1
Constrained Evolutionary Design of Matrixyl Analogs: Balancing Permeability and Functional Preservation Through Computational Optimization
Komianos, N.; Prakash, P.
Matrixyl (palmitoyl pentapeptide-4, KTTKS core) is a collagen-stimulating peptide used in topical anti-ageing products, but its in-use efficacy is limited by poor permeation through the stratum corneum. We describe a deterministic computational workflow that combines a tournament genetic algorithm and NSGA-II with exact RDKit molecular descriptors to search the fixed-length, edit-distance-2 neighbourhood of KTTKS (3,706 candidate sequences) for analogs with descriptors more favourable for passive transdermal diffusion. The search returns a 9-member Pareto frontier that quantifies the trade-off between predicted permeability and motif preservation. Five of the nine frontier members carry the same substitution, lysine to proline at position 4 (K4P). This single change lowers the topological polar surface area by 25.6%, removes the +1 charge contributed by lysine, and reduces the functional-preservation score from 1.00 (KTTKS) to 0.67. The frontier ranking is unchanged by +/-30% perturbations to the TPSA and Mw penalty weights and by a 30% increase in the LogP penalty; only a 30% reduction in the LogP penalty produces rank movement. The frontier matches the ground-truth Pareto set obtained by exhaustive enumeration of all 3,706 candidates (precision and recall both 100%). On the basis of these results we recommend three sequences for experimental validation: PTTPS (largest predicted gain), KTTPS (single-mutation, conservative), and KTTPP (backup). All code, results, and figures are released under MIT and CC BY 4.0.
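The size of the search space quoted above can be checked by direct enumeration. The sketch below assumes that "fixed-length, edit-distance-2" means at most two substitutions over the 20 standard amino acids, which gives 1 + 5·19 + 10·19² = 3,706 sequences including KTTKS itself. The Pareto filter is a generic non-dominated-set routine with both objectives maximised; the function names and the toy objective values are hypothetical and are not the paper's descriptors or scores.

```python
from itertools import combinations, product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def substitution_neighbourhood(seed, max_subs=2):
    """All same-length sequences within `max_subs` substitutions of `seed`."""
    seqs = {seed}
    for k in range(1, max_subs + 1):
        for positions in combinations(range(len(seed)), k):
            for residues in product(AMINO_ACIDS, repeat=k):
                # Skip choices that leave a selected position unchanged.
                if any(seed[p] == r for p, r in zip(positions, residues)):
                    continue
                candidate = list(seed)
                for p, r in zip(positions, residues):
                    candidate[p] = r
                seqs.add("".join(candidate))
    return seqs


def pareto_front(points):
    """Names of non-dominated points; every objective is maximised."""
    def dominates(a, b):
        return (all(x >= y for x, y in zip(a, b))
                and any(x > y for x, y in zip(a, b)))

    return {name for name, p in points.items()
            if not any(dominates(q, p)
                       for other, q in points.items() if other != name)}


if __name__ == "__main__":
    neighbourhood = substitution_neighbourhood("KTTKS")
    print(len(neighbourhood))  # 3706
    # Toy (permeability, preservation) pairs, invented for illustration only.
    toy = {"a": (1.0, 0.2), "b": (0.5, 0.5), "c": (0.4, 0.4), "d": (0.2, 1.0)}
    print(sorted(pareto_front(toy)))  # ['a', 'b', 'd']
```

Because the space is this small, exhaustive enumeration is cheap, which is what makes the abstract's comparison of the evolutionary search against a ground-truth Pareto set feasible.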
bioinformatics 2026-05-14 v1
Predicting Biological Age and Clinical Biomarkers from DNA Methylation Profiles of Cheek Mucosa
Shoji, T.; Tomo, Y.; Nakaki, R.
Background: DNA methylation-based biomarkers have been widely used to predict biological age; however, most existing models rely on blood-derived data, and it is unclear whether cheek mucosa can serve as an alternative sample source for methylation-based estimation of aging-related and clinical phenotypes.
Methods: DNA methylation profiles from cheek mucosa and whole blood of 186 Japanese adults were analyzed using the Illumina Infinium Methylation Screening Array (MSA). Models were constructed to predict chronological age, phenotypic age, and clinical laboratory biomarkers from cheek mucosa- and blood-derived methylation data. In addition to the ordinary elastic net method, a two-stage residual learning method incorporating existing blood-based epigenetic clocks was applied for more accurate prediction of biological age. Sex-stratified analyses and comparisons of selected CpG features across sexes and tissues were performed.
Results: Cheek mucosa-derived MSA methylation data enabled accurate prediction of chronological age (R = 0.965) and phenotypic age (R = 0.964) using the two-stage method. The performance gain achieved by the two-stage approach was greater for phenotypic age than for chronological age. Multiple clinical laboratory biomarkers could be predicted using cheek mucosa-derived methylation data, particularly after sex stratification, including inflammatory, metabolic, thyroid-related, and sex hormone-related markers. Most biomarkers that could be predicted using blood-derived methylation data could also be predicted using cheek mucosa-derived methylation data. However, the CpG sites selected for prediction showed minimal overlap across sexes and tissues despite overlap in the corresponding predictable phenotypes.
Conclusions: Cheek mucosa-derived DNA methylation profiles measured using the MSA can predict chronological age, phenotypic age, and multiple clinically relevant laboratory biomarkers, supporting the utility of cheek mucosa as a less invasive alternative for methylation-based assessment of biological aging and systemic physiological state.
bioinformatics 2026-05-14 v1
Differential Analysis of Gene Spatial Organisation with Minkowski Functionals and Tensors
Baratta, P.; Villoutreix, P.; Baudot, A.
Spatial transcriptomics measures gene expression together with transcript coordinates in tissues. To date, comparing spatial gene expression patterns within and across samples remains challenging. We present here minkiPy, a geometric framework that computes, for each gene, a compact profile of morphological and topological descriptors based on Minkowski functionals and tensors. These profiles are defined in a shared feature space, enabling direct comparison of spatial organisation across genes, samples, and conditions, and the ranking of genes by the magnitude of their spatial reorganisation. We applied minkiPy to a MERFISH dataset of control and facioscapulohumeral muscular dystrophy myoblast cultures and to a Visium HD dataset of colorectal cancer and normal adjacent tissues, illustrating its utility across tissue types and spatial transcriptomics platforms. minkiPy is an open-source Python library available at https://github.com/BAUDOTlab/minkiPy.
bioinformatics 2026-05-14 v1
OmniGene-4: A Unified Bio-Language MoE Model with Router-Level Interpretability
Wang, L.
Mixture-of-Experts (MoE) architectures offer a rare opportunity to probe the internal organization of large language models, but this affordance has not been systematically exploited in biological foundation modeling. We introduce OmniGene-4, a unified bio-language foundation model built on Gemma-4-26B-A4B (30 layers, 128 experts per layer, top-8 routing) by injecting 28,028 biological tokens (DNA and protein BPE, Foldseek 3Di, DSSP secondary structure), continuing pretraining (CPT) on a 32.5 GB mixture of DNA, protein, natural-language and structural corpora, and supervised fine-tuning (SFT) on 199,576 instruction-format examples spanning eight task families. On a suite of standard benchmarks, the final model (v3) reaches 99.95% accuracy on BioPAWS standard protein homology (6,000 pairs), 59.50% on remote homology (2,000 pairs from protein_pair_remote), and 93.66% on BixBench knowledge questions. Relative to its un-fine-tuned vocabulary-extended Gemma-4-Instruct baseline (85% / 60% / 87%), v3 gains +14.5 on Standard, is comparable on Remote (-0.5, within statistical noise on this 2,000-pair sample), and gains +6.7 on BixBench. We do not claim parity with specialist remote-homology tools; published numbers for ESM-2, CATHe and PLMSearch on differently constructed splits reach 65–75%, and closing this gap is discussed as an open problem. By installing forward hooks on every router we directly measure how CPT and SFT each reshape expert routing. Across 400 prompts drawn from 8 modalities, the mean pairwise Jensen-Shannon (JS) divergence between task routing distributions, averaged over the 30 layers, rises from 0.138 (vocabulary-extended baseline) to 0.230 after CPT and further to 0.232 after the full CPT+SFT pipeline. Under this layer-averaged metric, most of the increase (ΔJS +0.092) occurs during CPT, with the SFT stage contributing a small further rise (ΔJS +0.002).
The layer-wise picture is more nuanced: CPT reshapes routing in middle transformer layers (L_11–L_22, peak +0.16 at L_12), while SFT primarily reshapes the final two layers (L_28, L_29, peak +0.048 at L_29), so SFT is small under the aggregate metric but non-trivial at the layers nearest lm_head. We summarize this as a tentative representation/output-alignment factorization of bio-foundation training. At the token level, layer-12 routing reveals experts with strongly skewed token preferences, including an English-function-word expert at 80% NL purity, two DNA-dinucleotide experts, an amino-acid expert, and a cellular-biology expert; absolute purities for other experts are modest (15–46%), and we do not assume that "the same expert ID" refers to the same object across different layers. These findings are exploratory (a single architecture, a single training run, and a small-N routing sample), and we explicitly frame them as such throughout.
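The layer-averaged routing metric described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the author's code: routing distributions are taken to be normalised expert-selection counts per task and layer, the logarithm is base 2 (so each divergence lies between 0 and 1), and all function names are hypothetical. The abstract does not state the log base or how the counts are aggregated.

```python
import math
from itertools import combinations


def normalise(counts):
    """Turn raw expert-selection counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def js_divergence(p, q):
    """Jensen-Shannon divergence between two distributions (log base 2)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing; y_i > 0 whenever x_i > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def mean_pairwise_js(task_dists):
    """Mean JS divergence over all unordered pairs of task distributions."""
    pairs = list(combinations(task_dists, 2))
    return sum(js_divergence(p, q) for p, q in pairs) / len(pairs)


def layer_averaged_js(per_layer):
    """Average the mean pairwise JS over layers.

    per_layer[l][t] is the expert distribution of task t at layer l.
    """
    return sum(mean_pairwise_js(d) for d in per_layer) / len(per_layer)


if __name__ == "__main__":
    # Two toy layers, two tasks, two experts: fully separated vs identical routing.
    layers = [
        [normalise([8, 0]), normalise([0, 8])],
        [normalise([4, 4]), normalise([4, 4])],
    ]
    print(layer_averaged_js(layers))  # 0.5
```

Averaging over layers is what lets a change concentrated in a few layers look small in aggregate, which is the point the abstract makes about the SFT stage.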
bioinformatics 2026-05-14 v1
Classic machine learning on top of multiple position weight matrices improves genomic prediction of transcription factor binding sites
Kravchenko, P.; Vorontsov, I. E.; Makeev, V. J.; Kulakovskiy, I. V.; Penzar, D. D.
Motivation: DNA motifs recognised by transcription factors are typically represented as position weight matrices (PWMs), assuming independent contributions of individual nucleotides to protein binding specificity. Many alternative models accounting for correlations of positional contributions have been introduced in the past decades. However, performance gains have generally not outweighed the advantages of simplicity, interpretability, and practical applicability of PWMs with the well-established codebase. Existing software tools and motif databases provide multiple non-identical PWMs for the same transcription factor or even for the same dataset. It remains a practical question whether these PWMs can be effectively combined into a single improved model. Results: Here we describe ArChIPelago (https://github.com/autosome-ru/ArChIPelago), a computational framework that combines multiple PWMs into a joint model using classic machine learning techniques, from linear regression to ensembles of decision trees. We show that such a combination improves prediction of transcription factor binding sites in genomic sequences. With a diverse collection of 704 ChIP-Seq datasets spanning 36 orthologous human and mouse transcription factors of diverse structural families, we show that ArChIPelago consistently outperforms the best available individual mono- and dinucleotide PWMs as well as sparse local inhomogeneous mixture models. Furthermore, using both human and mouse data, we demonstrate that PWM ensembles are capable of making reliable cross-species predictions.
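The general idea of stacking a classic learner on top of several PWMs can be sketched in a few lines. This is not ArChIPelago's implementation: the feature choice (best single-strand window score per PWM), the toy matrices, and the use of a hand-written logistic regression as a stand-in for the linear and tree-based models mentioned in the abstract are all assumptions made for illustration.

```python
import math


def best_pwm_score(pwm, seq):
    """Best window score of one PWM on one strand.

    pwm is a list of per-position dicts mapping base -> weight.
    """
    width = len(pwm)
    return max(
        sum(pwm[i][seq[start + i]] for i in range(width))
        for start in range(len(seq) - width + 1)
    )


def pwm_features(pwms, seq):
    """One feature per PWM: its best hit on the sequence."""
    return [best_pwm_score(pwm, seq) for pwm in pwms]


def fit_logistic(X, y, lr=0.1, epochs=500):
    """Plain stochastic gradient descent for logistic regression."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            grad = 1.0 / (1.0 + math.exp(-z)) - target
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b


def predict(w, b, x):
    """Predicted probability that the sequence is bound."""
    z = b + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    # Two toy PWMs for the same hypothetical factor (weights are invented).
    pwm_a = [{"A": 1, "C": -1, "G": -1, "T": -1}, {"A": -1, "C": 1, "G": -1, "T": -1}]
    pwm_b = [{"A": 1, "C": -1, "G": -1, "T": -1}, {"A": -1, "C": 1, "G": -1, "T": 0}]
    pwms = [pwm_a, pwm_b]
    bound, unbound = ["GGACG", "TACTT"], ["GGGGG", "TTTTT"]
    X = [pwm_features(pwms, s) for s in bound + unbound]
    y = [1, 1, 0, 0]
    w, b = fit_logistic(X, y)
    print(predict(w, b, pwm_features(pwms, "CCACC")))
```

A real pipeline would scan both strands, use log-odds matrices from a motif database, and evaluate on held-out ChIP-Seq peaks; the sketch only shows how per-PWM scores become features of a joint model.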
bioinformatics 2026-05-14 v1
GlyComboCLI enables command line-based FAIR workflows for glycan composition assignment in mass spectrometry data
Kelly, M. I.; Thang, W. C. M.; Pang, C. N. I.; Gustafsson, O. J. R.; Ashwood, C.
Glycans are integral biomolecules whose presence cannot be predicted from genomic data alone, necessitating experimental characterisation through approaches including mass spectrometry. Assignment of glycan compositions to observed mass-to-charge ratios is computationally challenging due to the potential monosaccharide diversity, and existing tools lack the flexibility required for integration into automated bioinformatic workflows. Here, we present GlyComboCLI, an open-source command-line application for the assignment of glycan compositions to mass spectrometry data, which expands upon our previous GUI application, GlyCombo. GlyComboCLI accepts mass lists and vendor-neutral mzML files, and supports an extensive range of monosaccharides, derivatisation states, reducing-end modifications, and adducts to ensure compatibility with a breadth of glycomics approaches. Outputs are compatible with downstream tools including Skyline and GlycoWorkBench. This software is deployable as a standalone executable, a Docker container, and a Galaxy tool, adhering to FAIR principles. When applied to 52 raw files from a published mouse glycomics dataset, a local instance completed composition assignment and downstream quality control in under three hours, recovering biologically consistent findings. Furthermore, an integrated Galaxy workflow demonstrated reproducible detection of sialidase treatment effects. GlyComboCLI substantially reduces the pool of spectra requiring manual structural interpretation, offering a flexible and scalable solution for glycomics bioinformatic workflows.
bioinformatics 2026-05-14 v1
BiomniBench: Process-level Evaluation of LLM Agents for Real-world Biomedical Research
Qu, Y.; Lu, Y.; Tu, X.; Zhang, S.; She, T.; Shaw, A. G.; Shih, J.-H.; Zhao, B.; Shen, M.; Yang, H.; Yan, J.; Zhang, R.; Wu, X.; Li, T.; Cong, L.; Hu, X.; Jiang, Y.; Dong, J.; Peng, T.; Leskovec, J.; Huang, K.
LLM agents now perform real biomedical research, but evaluating them rigorously is hard. Outcome-only benchmarks fail in two ways. First, a correct final answer can come from memorization, reward hacking, or wrong reasoning that produces the right number by chance. Second, valid alternative analyses are marked wrong simply because they differ from the reference. We introduce BiomniBench, a process-level evaluation framework that scores the full agent trajectory against expert-designed, task-specific rubrics. Its first instantiation, BiomniBench-DA, contains 100 data-analysis tasks across 17 analytical task types, 5 disease areas, and a general-biology category, each based on a high-impact paper from top-tier journals such as Nature, Cell, and Science and co-developed with an original paper author or an experienced domain expert. Benchmarking frontier and open-weight models across four agent harnesses reveals three findings: (1) frontier models lead but substantial headroom remains; (2) the agent harness shifts scores as much as the base model; (3) agents recurrently fall short on method selection, biological interpretation, and scientific reasoning. BiomniBench is the first process-level benchmark for AI agents on real-world biomedical research, exposing failure modes that outcome-only evaluation cannot detect.
bioinformatics 2026-05-14 v1
A Context-Specific, Literature-Supported Framework for Validating Stress Response Differentially Expressed Gene Sets
Frishman, B. A.; Gonzalez, J. L.; Forbes, V. E.
Computational models of stress responses identify genes underlying physiological adaptation, but their utility depends on rigorous validation. Often, gene activity reflects both adaptive mechanisms and noise. Here, we develop a framework that leverages public databases to support the subselection of biologically supported model genes for temperature-stress responses. We test our framework on a model that identified and categorized differentially expressed genes (DEGs) into Key-Response, Treatment-Specific, Noisy, and Support groups based on inter-individual gene expression variability before and after treatment. The first three groups were hypothesized to constitute a Principal Response. To validate these groupings, we constructed protein-protein interaction (PPI) networks using the Human Protein Atlas and STRING. The main contribution of this work is the implementation of second-order connections restricted to those made via DEGs, ensuring connectivity reflects condition-specific responses rather than generic hubs. Across two temperature conditions, >75% of Principal Response genes assembled into subnetworks of interactions significantly larger than random expectations. Support Group genes also showed strong interconnectivity and enrichment for housekeeping genes. STRING confirmed PPI enrichment but produced less stable results than our framework. By emphasizing DEG-restricted second-order connections, we address limitations of context-free enrichment methods and strengthen biological evaluation of computational models of differential gene expression.
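One way to read "second-order connections restricted to those made via DEGs" is that two DEGs are linked at second order only when the protein between them is itself a DEG, so a generic hub outside the gene set cannot join them. The sketch below implements that reading on a toy edge list. The interpretation, the function names, and the gene labels are assumptions for illustration and are not taken from the authors' code.

```python
def deg_restricted_edges(ppi_edges, degs):
    """Return (first_order, second_order) DEG-DEG links.

    Second-order links are kept only when the intermediate protein is a DEG.
    """
    degs = set(degs)
    neighbours = {}
    for a, b in ppi_edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    first, second = set(), set()
    for gene in degs:
        for mid in neighbours.get(gene, ()):
            if mid not in degs or mid == gene:
                continue  # non-DEG intermediates (e.g. generic hubs) are ignored
            first.add(frozenset((gene, mid)))
            for far in neighbours.get(mid, ()):
                if far in degs and far != gene:
                    second.add(frozenset((gene, far)))
    return first, second - first


def largest_component(nodes, edges):
    """Largest connected set of nodes under the given undirected edges."""
    adjacency = {n: set() for n in nodes}
    for edge in edges:
        a, b = tuple(edge)
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, best = set(), set()
    for start in nodes:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


if __name__ == "__main__":
    edges = [("A", "B"), ("B", "C"), ("A", "HUB"), ("HUB", "D")]
    degs = ["A", "B", "C", "D"]
    first, second = deg_restricted_edges(edges, degs)
    print(sorted(largest_component(degs, first | second)))  # ['A', 'B', 'C']
```

The subnetwork size obtained this way could then be compared with sizes from randomly drawn gene sets of equal size, which corresponds to the "significantly larger than random expectations" comparison in the abstract.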
bioinformatics 2026-05-13 v3
BioGraphX: Bridging the Sequence-Structure Gap via Physicochemical Graph Encoding for Interpretable Subcellular Localization Prediction
Saeed, A.; Abbas, W.
Computational approaches for protein subcellular localization prediction are important for understanding cellular mechanisms and developing treatments for complex diseases. However, a critical limitation of current methods is their lack of interpretability: while they can predict where a protein localizes, they fail to explain why the protein is assigned to a specific location. Moreover, understanding protein behavior traditionally requires knowledge of the three-dimensional structure, which is costly and time-consuming to determine. Here, we propose BioGraphX, a novel encoding framework that constructs protein interaction graphs directly from protein sequences using biochemical rules. This approach provides a constraint-based structural proxy directly from sequence, reducing the dependency on experimentally determined three-dimensional structures. Building upon this representation, BioGraphX-Net demonstrates superior performance on the DeepLoc 2.0 benchmark by integrating ESM-2 embeddings with the proposed features via a gating mechanism. Gating analysis shows that although ESM-2 embeddings provide strong contributions, BioGraphX features function as high-precision filters. SHAP analysis reveals feature importance patterns consistent with a sophisticated biophysical logic: sequence signals act as universal exclusion filters, while organelle-specific combinations of biophysical features enable precise compartment discrimination. Notably, frustration features help resolve targeting ambiguities in complex compartments, reflecting evolutionary constraints while preventing mislocalization from sequence mimicry. BioGraphX-Net has the additional advantage of promoting Green AI in bioinformatics, achieving performance comparable to the state of the art with a parameter count of only 13.46 million. In summary, BioGraphX not only provides accurate predictions but also offers new insights into the language of life.
bioinformatics 2026-05-13 v3
Highly Accurate Estimation of the Fold Accuracy of Protein Structural Models
Xie, L.; Ye, E.; Wang, H.; Zhang, T.; Zhen, Q.; Liang, F.; Liu, D.; Zhang, G.
The function of a protein is intrinsically linked to its three-dimensional fold, and deep learning has revolutionized the field by enabling high-accuracy structure prediction at an unprecedented scale. Nevertheless, the growing deployment of these predictive pipelines in drug discovery and structural biology reveals a critical bottleneck that lies in the lack of independent and rigorous methodologies for estimation of model accuracy (EMA). Here we present DeepUMQA-Global, a single-model deep learning framework for estimating the accuracy of protein structural models. Our method employs a structure-sequence cross-consistency mechanism to quantify the bidirectional compatibility between the input sequence and the predicted three-dimensional structure, enabling a comprehensive characterization of fold accuracy. DeepUMQA-Global outperforms the self-assessment confidence scores of AlphaFold3, achieving improvements of 57.8% in the Pearson correlation and 49.0% in the Spearman correlation. With respect to the CASP16 retrospective benchmark, DeepUMQA-Global outperforms all single-model accuracy estimation methods that participated in CASP16 and achieves performance comparable to that of the top consensus-based methods. A lightweight consensus strategy built upon DeepUMQA-Global ranks first among all CASP16 participants, surpassing all other methods, including consensus approaches, and highlighting the strength of our method. Remarkably, DeepUMQA-Global demonstrates a strong ability to discriminate between alternative conformational states of proteins, as evidenced on the CASP unique alternative conformation protein complex target and the CoDNaS benchmark. Our results indicate that DeepUMQA-Global can be extended to broader protein modeling tasks, moving beyond static evaluation to offer a foundation for dynamic conformation EMA, where it accurately discriminates alternative conformational states and demonstrates reliable predictive fidelity in model accuracy estimation.
bioinformatics2026-05-13v2Keeping SCORE enables interpretable uncertainty-aware classification from diffusion models for genomics
Kuznets-Speck, B.; Jung, J.; Pholraksa, P.; Zhong, A.; Schwartz, L.; Prashnani, E.; Vaikuntanathan, S.; Goyal, Y.Abstract
Classifying cellular states from high-dimensional molecular and genomic measurements requires methods that provide not only accurate predictions but also calibrated uncertainty and interpretability. Current nonlinear classifiers offer accuracy but often lack uncertainty quantification and mechanistic insights into the features that matter most. We introduce Keeping SCORE, a framework that transforms conditional diffusion models into probabilistic engines for classification and regression by computing exact likelihoods along stochastic noising trajectories. We first benchmark Keeping SCORE on image recognition tasks (handwritten digits, natural photos). We then apply Keeping SCORE to single-cell transcriptomics across a 22-million-cell atlas, classifying 164 cell types with accuracy matching or exceeding state-of-the-art methods, while uniquely providing posterior probability estimates and prediction confidence. For genetic perturbation mapping across 100 CRISPRi conditions in a multi-study Perturb-seq dataset, our approach again matches or surpasses discriminative baselines, with feature-level attributions identifying which genomic features drive each decision. Applied to large-scale protein sequence data, our framework accurately regresses mutational stability effects, attributing them quantitatively to positions along the input sequence. Keeping SCORE requires no retraining or architectural changes to existing diffusion models, providing portable, interpretable, and uncertainty-aware predictions for biological discovery.
bioinformatics2026-05-13v2Phylogenomic coupling of F1 chemosensory and archaellum systems across archaea and monoderm bacteria
Mahanta, U.; Baker, M.; Sharma, G.Abstract
Archaellum-associated motility has been viewed as solely archaeal, yet new findings in Chloroflexota prompt a broader perspective. By analysing a curated set of ~22,000 NCBI reference genomes alongside 2,397 archaeal and 226 archaellum-encoding Chloroflexota genomes, this study systematically characterises the co-distribution of archaellum loci with chemosensory system (CSS) classes. Maximum-likelihood phylogeny of 3,727 F1-type CheA proteins reveals three major clades, with Clade 1 comprising ~80% monoderm representation, uniting archaeal and monoderm bacterial lineages in a shared evolutionary grouping. Overall, this work shows that not only archaeal-type motility but also the F1-CSS-based sensing system might have been transferred from Archaea to Chloroflexota via horizontal gene transfer, and that both systems shared an evolutionary trajectory.
bioinformatics2026-05-13v1GatorDuo: Global-Consistency Dual-Graph Refinement With Pseudo-Label Agreement for Spatial Transcriptomics
Zhang, Z.; Jimeno Yepes, A.; Bian, J.; Li, F.; Liu, Y.Abstract
Spatial transcriptomics (ST) measures gene expression together with spatial coordinates, enabling spatial domain identification of coherent tissue regions. Many recent approaches rely on graph-based modeling to combine spatial neighborhoods and transcriptomic (gene-expression) similarity, yet neighborhood construction is often unreliable under sparsity and technical noise. As a result, spurious cross-domain shortcut edges can persist in static graphs and propagate misleading signals during message passing, ultimately blurring domain boundaries and weakening cluster separability. In this paper, we propose GatorDuo, a topology-aware dual-graph contrastive self-supervised framework for robust spatial domain identification that couples gene-expression similarity with spatial proximity through complementary neighborhood graphs. GatorDuo introduces global-consistency-based graph refinement that uses a pseudo-label agreement mask to suppress cross-domain shortcut edges in both views, thus stabilizing neighborhood topology for representation learning. To avoid manual tuning of domain resolution, GatorDuo further employs a contextual bandit reinforcement-learning strategy to adaptively select the clustering granularity (the number of clusters) used for refinement. The refined view-specific embeddings are integrated via a hybrid-routing Mixture-of-Experts (MoE) module to generate a unified embedding, optimized with contrastive objectives augmented by an MoE-alignment term. Across eight public benchmarks spanning sequencing- and imaging-based ST at spot and single-cell resolution, and compared with ten representative baselines, GatorDuo consistently delivers strong and robust spatial domain identification performance across multiple clustering metrics, while yielding informative unified embeddings that can support downstream biological analyses.
bioinformatics2026-05-13v1Disease-guided functional gene mapping across species reveals translational correspondences beyond sequence orthology
Yan, J.; Cao, Z.Abstract
Selecting the correct mouse gene to model a human disease phenotype is critical for translational research, yet sequence-based orthology can fail when genes have been lost, duplicated, or functionally rewired between species. Here we present BRIDGE (Biological Rank Integration for Disease Gene Equivalence), a sequence-free framework that identifies functional mouse equivalents of human disease genes. BRIDGE integrates 3.37 million disease-gene associations, biological pathways, and Gene Ontology annotations into a unified heterogeneous graph with 94,897 nodes and approximately 8.3 million edges. The graph is encoded by a heterogeneous graph transformer and combined with fused Gromov-Wasserstein alignment and multi-strategy reciprocal rank fusion. On two sequence-independent benchmarks, BRIDGE achieves Recall@5 of 61.8-66.7%, compared with 0.0-20.1% for Ensembl Compara. We validate BRIDGE through case studies including neutrophil pathway rewiring (CXCL8 to Cxcl1/2/5), acute-phase divergence (CRP to Apcs), and immune checkpoint substitution (LILRB2 to Pirb), and demonstrate complementarity with sequence methods in drug-translation analysis. Prospective validation of 30 novel predictions against three independent data modalities, including tissue expression, cell-type expression, and phenotype concordance, shows that BRIDGE picks are favored in 64 of 65 orthogonal tests (sign test P = 3.6 x 10^-10) and significantly outperform tested baselines including Ensembl Compara, BLAST RBH, and ESM-2. BRIDGE provides a benchmarked framework for functional cross-species gene mapping in disease-model design.
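The reciprocal rank fusion step mentioned above can be illustrated with a minimal sketch. This is the generic RRF formula, not BRIDGE's implementation: the function name, the gene placeholders, and the smoothing constant k = 60 (a commonly used default) are all assumptions for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked candidate lists into one ranking.

    Each item scores sum(1 / (k + rank)) over the lists that contain it,
    with ranks starting at 1. k=60 is a conventional default and is an
    assumption here, not a value taken from the abstract.
    """
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical candidate lists produced by three different scoring strategies.
strategy_rankings = [
    ["gene_a", "gene_b", "gene_c"],
    ["gene_b", "gene_a", "gene_d"],
    ["gene_b", "gene_c", "gene_a"],
]
fused = reciprocal_rank_fusion(strategy_rankings)
```

Because only ranks enter the score, strategies whose raw scores live on different scales can be combined without calibration.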
bioinformatics2026-05-13v1BiLSTM-Powered Bilinear Attention for Protein-Ligand Prediction
Cheng, C.-Y.; Chen, Y.-A.; Li, F.-Y.; Re, S.Abstract
Rapid and accurate prediction of protein-ligand binding is essential for drug discovery. While generative AI has driven rapid advancements in structure-based approaches, sequence-based methods remain significantly faster and more cost-effective. Here, we present a weakly supervised deep learning framework integrating graph convolutional networks (GCN) for molecular encoding and bidirectional long short-term memory (BiLSTM) for protein modeling. The latter represents long-range dependencies better than the widely used convolutional neural network (CNN). Leveraging a bilinear attention network (BAN), this model learns protein-ligand pairwise interactions without requiring three-dimensional structural supervision. Trained solely on affinity labels from the publicly available BindingDB dataset, the model classified binders and non-binders with an AUROC of 0.96 and an AUPRC of 0.95. The model generates interpretable attention maps that serve as a "GPS" to locate binding sites. Remarkably, despite the lack of structural training data, it can pinpoint key contact residues confirmed by crystal structures. Our method could function as a scalable filter for giga-scale libraries, allowing rapid screening of drug candidates with direct structural insights into the protein-ligand interface.
bioinformatics2026-05-13v1Systematic Regional Bias is Widespread in ChIP-seq
Hughes, O.; Foley, G.; Balderson, B.; Piper, M.; Boden, M.Abstract
Robust and reproducible results are essential for confident scientific analysis. We demonstrate that transcription factor (TF) Chromatin Immunoprecipitation coupled with sequencing (ChIP-seq) suffers from systematic bias that may threaten its reproducibility: 80% of 200+ condition-matched, dual-replicate experiments in ENCODE contain genomic regions of systematic bias. We observe this regional bias even between replicates produced within the same experiment, resulting in thousands of unreplicated peaks, which often contain valuable biological data. We provide evidence that regional bias may lead to qualitative differences in TF biology inferred by different experiments; we discovered eight TFs with binding activity in compact chromatin that was identified by one experiment, yet systematically absent from others. To mitigate the effects of bias, we derive simple but effective metrics to quantify the quality of data within biased regions and demonstrate that they can be used for the robust integration of data from multiple experiments.
bioinformatics2026-05-13v1Preferential IsomiR Enrichment in Extracellular Vesicles Improves Identification of Their Cellular Origins
Ripan, R. C.; Li, X.; Hu, H.Abstract
Extracellular vesicles (EVs) carry microRNAs (miRNAs) that mediate intercellular communication and have strong potential as disease biomarkers, yet the roles of miRNA isoforms (isomiRs) in EVs remain poorly understood. Here, we analyzed 96 human EV and corresponding source samples from nine public datasets. We found that EV samples consistently contained substantially higher proportions of isomiR reads than their corresponding source samples, indicating widespread isomiR enrichment in EVs. Although individual isomiRs showed limited reproducibility across biological replicates and limited sharing between EVs and their corresponding source samples, the parent miRNAs that generated these isomiRs remained highly reproducible across replicates and strongly shared between EV-source pairs. Despite extensive isomiR diversification, EV-source pairs retained highly correlated miRNA expression profiles. Using integrated miRNA- and isomiR-related features, we further developed a random forest model that successfully associated EV samples with their corresponding source samples, with improved performance when isomiR information was included. Together, our results demonstrate that EVs are enriched for biologically meaningful isomiRs while preserving source-associated miRNA landscapes, highlighting the importance of incorporating isomiRs into future EV studies.
bioinformatics2026-05-13v1xNNPCD identifies regulators of programmed cell death by integrating perturbation transcriptomes with cancer dependency profiles
Yin, Q.; Chen, L.Abstract
Programmed cell death (PCD) encompasses multiple regulated processes whose dysregulation shapes cancer fitness, yet current computational studies largely use known PCD genes for prognosis rather than discovering regulators. We developed xNNPCD, an interpretable neural-network framework that links CRISPR-Cas9 perturbation signatures from CMap to gene dependency profiles from DepMap. The model constrains hidden neurons to five PCD pathways and iteratively refines a prior gene-pathway mask matrix derived from GO, KEGG, and Reactome using pathway-neuron ablation. This converts binary gene-pathway relationships into continuous-valued associations and improves dependency prediction over random forests, standard fully connected multi-layer perceptron, and its own non-iterative variant. The learned matrix recovers annotated death regulators and nominates candidate regulators, including RPL23A, HSPA5, SNRPA1, SLC6A2, and ASAH1; combined with dependency scores, it further separates pathway coupling from regulatory direction. Transferring the refined relationship matrix and learned weights to compound-induced perturbation data enables in silico drug screening, identifying BRD-K19103580 and decitabine as targeted therapeutic agents for apoptosis and ferroptosis, respectively. The pathway-resolved drug profiles can facilitate the rational design of combination therapies targeting complementary PCD pathways to overcome single-pathway resistance. Overall, xNNPCD offers a generalizable, interpretable approach for mapping the regulatory landscape and elucidating the molecular processes of PCD in cancer.
bioinformatics2026-05-13v1Cell-Level Virtual Screening
Ellington, C. N.; Addagudi, S.; Wang, J.; Lengerich, B. J.; Xing, E. P.Abstract
Virtual screening methods prioritize therapeutic candidates by predicting molecular properties and interactions. However, molecular models are insufficient to predict higher-order effects that arise in real biological systems, leading to late-stage failures in drug discovery. Virtual cells have been posed as a solution to this problem by predicting gene expression responses to drugs, but they remain weakly validated as screening tools; gene expression is only an intermediate in understanding drug success or failure. Despite burgeoning progress in virtual cells, some basic questions remain. Is expression even a good representation of higher-order drug effects? How can expression and other cell-level representations be applied to prioritize therapeutic candidates? Can cell-level methods be fairly compared against traditional molecular-level screens? We address these questions in a two-pronged approach. First, we curate two benchmarks, Drug-Disease Retrieval Bench (DDR-Bench) and Drug-Target Retrieval Bench (DTR-Bench), which directly compare cell-level methods against traditional molecular methods on canonical drug discovery tasks. DDR-Bench evaluates a method's ability to prioritize disease indications for drugs with novel target profiles. DTR-Bench evaluates a method's ability to reconstruct drug-target interactions from separate perturbation modalities that act on shared mechanisms, bridging the gap between cell-level methods and classic molecular screens. We identify shortcomings of existing screening methods on these benchmarks, and propose an alternative representation of drug effects: perturbed gene networks. Inferring post-perturbation gene networks on-demand for unseen drugs requires methods that generalize beyond traditional plug-in network estimators. 
We develop a scalable differentiable surrogate loss for multivariate Gaussians, which we apply to train a context-adaptive amortized estimator that maps perturbation metadata to gene-gene dependency network parameters. The resulting model, CellVS-Net, achieves SOTA on predicting how gene networks restructure under a variety of complex multivariate experimental conditions, including different cell types, small molecule therapeutics, signaling molecules, gene knockdowns, and gene over-expressions. When compared to other molecular and cell-level representations of drugs, we find that CellVS-Net achieves SOTA on both virtual screening benchmarks. Overall, CellVS-Net demonstrates that cell-level virtual screening methods are a viable alternative to molecular screening, and associated benchmarks enable hill-climbing on relevant drug discovery tasks.
bioinformatics2026-05-13v1Integrated RNA-seq analysis identifies ABC transporters mediating taxane export in Taxus species
Nasiri, J.; Fotuhi Siahpirani, A.; Dong, Y.; Xu, C.; Xia, Y.; Ignea, C.Abstract
RNA-seq datasets from medicinal yews are crucial for studying paclitaxel biosynthesis. However, cross-study data analyses are hindered by pronounced batch effects. Here, we compiled 45 RNA-seq samples from three studies across four tissues (bark, leaf, root, stem) and assessed 35 preprocessing pipelines combining six normalization strategies with five batch-effect correction approaches. Unsupervised clustering (HCA, k-means, Grade-of-Membership), evaluated using Jaccard and Adjusted Rand indices, revealed significant variability in batch effect removal. Supervised classification of tissue and project labels (Random Forest and linear/radial SVM) demonstrated improved accuracy in tissue type prediction, highlighting the effectiveness of correction methods. The processed data facilitated the identification of 189 putative ABC transporters across samples, six of which show a strong correlation with the gene encoding 10-deacetylbaccatin-III-10β-O-acetyltransferase, a key biosynthetic enzyme in the taxol pathway. High expression levels in leaf and bark further support their role in taxane intermediate trafficking in taxol biosynthesis. Structural analysis and molecular docking further supported the selection of these candidates, and the agreement between transcriptomic ranking and docking-based prioritization suggests that these transporters may participate in taxane intermediate recognition, trafficking, or export. These findings demonstrate the importance of normalization and batch effect correction in RNA-seq analysis to advance gene discovery in Taxus species and, more broadly, in plant research.
bioinformatics2026-05-13v1Redesign selective protein binders using contrastive decoding
Xie, Z.; Xu, J.Abstract
Motivation: Fixed-backbone sequence design methods such as ProteinMPNN operate on backbone coordinates alone and cannot represent target side-chains at the binding interface. Their decoding algorithm also lacks a mechanism to balance binding affinity and folding stability or to improve selectivity against structurally similar off-targets. These gaps limit the computational design of protein binders with high affinity and specificity. Results: We present RedNet, a multiscale graph neural network that encodes side-chain information of the binding target. We further develop a contrastive decoding algorithm, motivated by the thermodynamic decomposition of binding free energy, that addresses two objectives: (1) balancing binding affinity and folding stability, and (2) improving selectivity against structurally similar off-targets. RedNet reaches 43% native sequence recovery on heterodimers, compared with 37% for ProteinMPNN and 33% for ESM-IF. With contrastive decoding, it matches native-sequence co-folding success (68%) on high-confidence AlphaFold3 targets, exceeding ProteinMPNN (59%) and ESM-IF (61%). On a new benchmark of structurally similar on-/off-target pairs, RedNet with contrastive decoding reaches 64.8% energetic selectivity, ahead of PiFold (55.6%), ProteinMPNN (53.7%), and ESM-IF (53.7%).
bioinformatics2026-05-13v1De novo protein discovery in non-model organisms
Ali, A.Abstract
We developed plant (Parallel Annotation of Transcriptomes), a de novo method that can potentially compare RNA-seq data of any two species without a reference genome. plant is conceptually similar to chromatography. In the same way a complex mixture is filtered to isolate its individual components, we applied a computational method to identify, annotate, and quantify components across transcriptomes. The comparison points are universal protein domain annotations rather than species-specific genes, as would be the case for a differential gene expression analysis. We looked at several Selaginella species via the 1000 Plant transcriptomes initiative (1KP) where RNA-seq data for various plant species have been made publicly available. The raw reads were assembled via Trinity. The assembled transcripts were then searched against the Pfam protein domain database via InterProScan. The assembled transcripts were also quantified via kallisto. By merging these two aspects, we were able to see how often a predicted protein structure is expressed. These quantified annotations of protein domains are comparable across species, assuming a relatively short evolutionary distance. We were also able to identify the presence of species-specific protein domains and trace each annotation back to the gene. A bubble plot was created to visualize the distributions of Pfam annotations across species as well as GO terms.
bioinformatics2026-05-13v1A chemoinformatics-guided platform for efficient discovery of RNA-binding small molecules: Proof-of-concept for myotonic dystrophy type 1
Taghavi, A.; Shan, J.; Yao, X.; Zanon, P. R. A.; Sung, K.; Simba-Lahuas, A.; Gorlach, S.; Labuhn, H.; Salthouse, D.; Wang, Z.; Feri, A.; Disney, M. D.Abstract
Structured RNAs cause human diseases but remain challenging to target selectively with small molecules. Here, we report a chemoinformatics-guided discovery framework that integrates fingerprint-based molecular design, experimental validation, and mechanistic profiling to identify small molecules that bind highly structured, disease-associated RNAs. Using an RNA-binder fingerprint derived from known ligands, a Tversky similarity screen of >8 million compounds yielded a 150-member library enriched in chemical space for RNA-active scaffolds. Target engagement and cell-based assays identified multiple selective ligands for the pathogenic expanded triplet repeat, r(CUG)exp, that causes myotonic dystrophy type 1 (DM1) by binding and sequestering the RNA-binding protein muscleblind-like 1 (MBNL1). Biophysical and single-molecule analyses revealed that the small molecules bind the 1x1 nucleotide U/U internal loops formed when r(CUG)exp folds, partially block MBNL1 binding, and modulate RNA folding equilibria. Two optimized scaffolds rescued MBNL1-dependent splicing in patient-derived myotubes with micromolar potency and minimal cytotoxicity. This study establishes a generalizable, data-driven platform for discovering drug-like RNA-binding lead small molecules and demonstrates its application to the toxic repeat expansion RNA underlying DM1.
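For readers unfamiliar with the Tversky similarity used in the screen above, the following is a minimal sketch of the index on fingerprints represented as sets of "on" bit positions. The set representation, the function name, and the example alpha/beta weights are assumptions; the abstract does not state which fingerprint encoding or weights the authors used.

```python
def tversky_similarity(query_bits, candidate_bits, *, alpha, beta):
    """Tversky index between two fingerprints given as sets of 'on' bits.

    T = |Q & C| / (|Q & C| + alpha * |Q - C| + beta * |C - Q|)
    alpha = beta = 1 gives Tanimoto; alpha = beta = 0.5 gives Dice.
    """
    query, candidate = set(query_bits), set(candidate_bits)
    shared = len(query & candidate)
    denominator = (
        shared
        + alpha * len(query - candidate)
        + beta * len(candidate - query)
    )
    return shared / denominator if denominator else 0.0


# Hypothetical bit sets standing in for a reference RNA-binder fingerprint
# and one library compound.
reference = {1, 2, 3}
compound = {2, 3, 4}
tanimoto_like = tversky_similarity(reference, compound, alpha=1.0, beta=1.0)
# Asymmetric weighting: only bits missing from the compound are penalised.
substructure_like = tversky_similarity(reference, compound, alpha=1.0, beta=0.0)
```

Asymmetric weights are the usual reason to prefer Tversky over Tanimoto: they let a screen reward compounds that contain the reference features without penalising extra ones.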
bioinformatics2026-05-13v1Cell Type-informed Characterization of Spatial Niches from Spatial Multimodal and Multi-omics Data
Du, G.; Xu, J.; Wei, X.; Liu, C.; Zhao, D.; Jia, X.; Li, X.; Shang, X.Abstract
Cell niches play critical roles in tissue organization and orchestrate homeostasis, development, and disease progression. Advances in spatial omics technologies now allow diverse molecular and image-derived data to be jointly captured while preserving spatial context, but deciphering cell niches from such spatial multimodal and multi-omics data remains challenging. Existing computational methods are still limited in their flexibility across variable combinations of spatial modalities and omics data. Here we introduce SpaNECT, a unified and flexible framework designed to accommodate spatial multimodal and multi-omics data for cell niche characterization. SpaNECT further incorporates reference-informed cell-type information to support biologically interpretable niche analysis. Systematic evaluations across diverse tissues, disease conditions, and developmental stages showed that SpaNECT consistently outperformed representative methods in resolving cell niches. In mouse brain spatial multi-omics data, SpaNECT uncovered niche-associated molecular and regulatory programs; in developing chick heart, it tracked cross-stage niche reorganization and progressive remodeling of ventricular-associated cell states during maturation. Overall, SpaNECT establishes a general and robust framework for characterizing cell niches across spatial multimodal and multi-omics data.
bioinformatics2026-05-13v1HAIRpred2: Human Host-Specific Prediction of Antibody-Interacting Residues Using Hybrid Physicochemical and Structural Features
Mehta, N. K.; Sahni, R.; Kumar, N.; Raghava, G. P. S.Abstract
Prediction of conformational B-cell epitopes is critical for vaccine design, immunotherapy, and antibody engineering. To date, several host-independent computational methods have been developed for predicting antibody-interacting residues in antigen structures. However, it is well established that antigen-antibody (Ag-Ab) interactions vary depending on the host immune system, indicating the importance of developing host-specific prediction models. In this study, we present, for the first time, a human host-specific method, HAIRpred2, that predicts antibody-interacting residues in an antigen from its tertiary structure. The dataset was derived from HAIRpred and comprises 277 human Ag-Ab complexes, with 221 structures used for training and 56 for independent testing. Preliminary analysis revealed that residues with a relative surface accessibility (RSA) below 0.05, corresponding to buried regions, are highly likely to be non-interacting, underscoring the importance of structural accessibility in antibody recognition. To identify the most informative features, we evaluated multiple feature representations, including RSA, large language model (LLM)-based embeddings, distance-based features, and physicochemical properties. A model trained on single-residue RSA features achieved an AUC of 0.72. Incorporating a sliding window of 15 residues to capture local structural context improved performance to an AUC of 0.75. The best performance (AUC = 0.78 on the independent test set) was achieved by integrating RSA with physicochemical descriptors. Benchmarking against existing antibody-interaction prediction methods on the same independent dataset demonstrated that HAIRpred2 outperforms current tools, further highlighting the advantage of host-specific modeling. HAIRpred2 is freely available as a web server at https://webs.iiitd.edu.in/raghava/hairpred2/.
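The sliding-window idea described above can be sketched as follows. This is an illustrative construction only: the function names, the zero padding at chain termini, and the toy RSA values are assumptions, and HAIRpred2's actual feature pipeline may differ.

```python
def windowed_rsa_features(rsa_values, window=15, pad_value=0.0):
    """Build one feature vector per residue from its RSA neighbourhood.

    Each vector holds the RSA of the residue itself (centre position) and
    of (window - 1) / 2 neighbours on each side. Positions beyond the chain
    ends are filled with pad_value, which is an assumed convention.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd so the residue sits at the centre")
    half = window // 2
    padded = [pad_value] * half + list(rsa_values) + [pad_value] * half
    return [padded[i:i + window] for i in range(len(rsa_values))]


def buried_mask(rsa_values, threshold=0.05):
    """Flag residues whose RSA falls below the buried threshold."""
    return [value < threshold for value in rsa_values]


# Toy RSA profile for a hypothetical 20-residue antigen chain.
toy_rsa = [0.02, 0.40, 0.55, 0.01, 0.30] * 4
features = windowed_rsa_features(toy_rsa)
buried = buried_mask(toy_rsa)
```

Each row of `features` could then be passed to any standard classifier, with residues flagged in `buried` treated as likely non-interacting.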
bioinformatics2026-05-13v1cran2crux: automatically create CRUX ports for R-packages
Petrov, P.; Izzi, V.Abstract
Motivation: R together with CRAN and Bioconductor provides one of the richest ecosystems for bioinformatics and computational biology, with thousands of specialized packages. While GNU/Linux is a vastly-used operating system in this field, R-packages are typically managed independently of the system's native package manager. This separation makes installation, updates and mass rebuilds cumbersome. CRUX, a minimalist semi-source GNU/Linux distribution, offers great flexibility with its ports-based system for the seamless integration of R-packages with its native package manager. Results: The hereby presented cran2crux tool automatically generates CRUX ports for packages from both CRAN and Bioconductor. It performs recursive dependency resolution, handles naming conventions, extracts dependencies information, and supports inclusion of optional dependencies. The tool also provides convenient functions for checking updates and regenerating outdated ports. It can generate over 140 ports for complex packages such as Seurat in approximately 11 seconds, dramatically simplifying the maintenance of large R-dedicated repositories on CRUX. Availability: cran2crux is available under the MIT license at https://github.com/izzilab/cran2crux. As of now, more than 650 R package ports, generated with the tool, are available in the CRUX ports database.
bioinformatics2026-05-13v1Metagenomics-enabled proteomics reveals how AMF and PSB co-inoculation reshapes tomato rhizosphere dynamics across growth stages
Son, Y.; Craft, E. J.; Pineros, M. A.; Mathieson, O. L.; Awan, A.; Blakeley-Ruiz, J. A.; Kleiner, M.; Kao-Kniffin, J.Abstract
Urban agriculture increasingly relies on compost-based substrates for sustainable production, yet we lack a clear characterization of how these systems respond to biological amendments aimed at introducing beneficial microbiota. Here we investigated how developmental stage and co-inoculation with arbuscular mycorrhizal fungi (AMF) and phosphate-solubilizing bacteria (PSB) reshape rhizosphere microbial function in Solanum lycopersicum grown in compost-based urban farm substrate. Using plant physiology assays, 16S rRNA amplicon sequencing, and metagenome-informed metaproteomics, we characterized tomato physiological responses and rhizosphere microbial activity during flowering and fruiting across control, single AMF, single PSB, and AMF and PSB co-inoculation treatments. Co-inoculation synergistically enriched beneficial taxa, improved fruit nutrient accumulation, elevated nutrient transporter and quorum sensing protein production, and drove stress-driven dormancy in competitively excluded taxa, with responses varying between developmental stages. Our findings establish metagenome-informed metaproteomics as essential for resolving stage-specific rhizosphere microbiome functional responses to tomato development and AMF and PSB co-inoculation.
bioinformatics2026-05-13v1Transferable spatial omics deconvolution with SpaRank
Yan, X.; Zheng, R.; Chen, J.; Li, M.; Lan, W.Abstract
By resolving cell-type compositions from multi-cellular spatial measurements, deconvolution is central to resolving the cellular landscape of complex tissues. Existing deconvolution methods fit continuous expression values and are therefore sensitive to batch effects between single-cell references and spatial data, requiring retraining for each new context. Here we present SpaRank, a context-aware framework that performs spatial deconvolution by representing spots as ranked feature sequences. Adapting the rank-based encodings of single-cell foundation models, this formulation is inherently robust to technical variation, enabling a pretrain-transfer paradigm. On simulated benchmarks, SpaRank achieves strong deconvolution accuracy, robustness to expression perturbations, and substantial computational efficiency. On experimental datasets, pretrained models generalize across diverse biological contexts: a model pretrained on a multi-organ lymphoid atlas accurately resolved cell-type distributions across distinct tissues and sequencing platforms; likewise, a model pretrained on an integrated breast atlas delineated cell-type compositions across normal and malignant disease states. Furthermore, the framework naturally extends to multimodal spatial deconvolution by employing gated fusion to adaptively integrate diverse omics signals, improving accuracy over single-modality approaches. Overall, SpaRank establishes a transferable deconvolution paradigm, enabling unified cellular atlases to support direct, context-aware inference across diverse biological states and profiling modalities.
bioinformatics 2026-05-13 v1
An assessment of normalization and differential expression methods for miRNA-seq analysis using a realistic benchmark dataset
Aparicio-Puerta, E.; Baran, A. M.; Ashton, J. M.; Pritchett, E. M.; Gaca, A.; Becker, J.; Halushka, M. K.; Jun, S.-H.; McCall, M. N.
Abstract
MicroRNAs are short noncoding RNAs that regulate gene expression and are commonly profiled by small RNA sequencing (miRNA-seq). Despite the widespread use of miRNA-seq, datasets are often analyzed with RNA-seq methods such as DESeq2 or edgeR, which do not take into account the specific characteristics of miRNA-seq data. Here, we present a benchmark study of normalization and differential expression approaches using a realistic ground-truth dataset. By mixing mouse RNA from two organs, we generated expression trends while capturing biological and technical variability. Using monotonicity across the dataset and expected fold changes from the mixture design, we assessed normalization and differential expression methods. Normalization benchmarking showed that within-sample scaling, particularly Reads Per Million (RPM), best preserved the expected monotonic trends, outperforming cross-sample methods such as TMM, rlog, and VST. These approaches sometimes recovered apparent monotonicity among abundant miRNAs, but inspection of individual profiles suggested likely over-correction. Regarding differential expression, edgeR consistently ranked among the best-performing methods across several metrics, including log2 fold-change estimation, with performance comparable to miRNA-seq-specific tools such as miRglmm and NBSR. DESeq2, edgeR-v4, and limma-based approaches tended to systematically underestimate log2 fold changes. Applying a common RPM-based normalization substantially improved the performance of cross-sample methods, highlighting the strong influence of normalization on differential expression analysis. Overall, our findings support within-sample scaling methods such as RPM for normalization, and edgeR, miRglmm, or NBSR for differential expression. The dataset has been made publicly available, providing a valuable resource for objective method comparison and future miRNA-seq software development.
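As a rough illustration of within-sample scaling and the monotonicity criterion, the sketch below computes RPM and tests whether a miRNA's normalized profile is monotonic across samples ordered by mixture proportion. The count matrix is invented toy data, and `rpm` and `is_monotonic` are hypothetical helper names, not functions from the study's code.

```python
import numpy as np

def rpm(counts):
    """Scale each sample (column) to reads per million: count / library size * 1e6."""
    counts = np.asarray(counts, dtype=float)
    lib_sizes = counts.sum(axis=0)  # total reads per sample
    return counts / lib_sizes * 1e6

def is_monotonic(values):
    """True if the series never decreases or never increases."""
    d = np.diff(values)
    return bool(np.all(d >= 0) or np.all(d <= 0))

# Toy matrix: 3 miRNAs (rows) x 3 samples (columns), with samples
# assumed to be ordered by increasing mixture proportion.
counts = np.array([
    [100, 300,  900],
    [800, 900, 1500],
    [100, 300,  600],
])
norm = rpm(counts)
trend_ok = [is_monotonic(row) for row in norm]
```

In this toy matrix the second miRNA rises in raw counts but falls after scaling, because sequencing depth grows faster than its counts. That is why the trend is assessed on normalized values.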
bioinformatics 2026-05-13 v1
Disagreement between demultiplexing methods reveals structured cell quality gradients in multiplexed single-cell data
Sen, E.; Steiger, S.; Basic, M.; Prokoph, N.; Syed, A. P.; Seufert, I.; Rehman, U.-U.; Schumacher, S.; Baumann, A.; Feuring, M.; Weinhold, N.; Lübbert, M.; Döhner, H.; Döhner, K.; Raab, M. S.; Mallm, J.-P.; Stegle, O.; Rippe, K.
Abstract
Background: Single-cell multi-omics profiling of hematopoietic malignancies frequently involves pooling of patient samples before library preparation to reduce costs. Demultiplexing and quality control of the resulting sequencing data depend on experimental design, sequencing depth, and computational methods. Existing approaches benchmark individual tools, auto-select a single best method, or apply majority voting. However, none systematically exploit disagreement patterns among orthogonal strategies as a diagnostic signal for cell quality. Results: We introduce Split-flow, a modular Nextflow pipeline that runs hashing-based and SNP-based demultiplexing, and transcriptome-based doublet detection in parallel. It classifies cells into quality strata through a concordance-based decision framework. Validation on multiplexed CITE-seq data from 14 multiple myeloma patients across eight Chromium channels demonstrates high reproducibility and shows that discordant cells cluster within specific cell types and quality strata. TCR clonotype cross-referencing against VDJdb confirms that concordance-based classification enriches for biologically genuine immune receptor sequences, with a 5.3-fold enrichment of confirmed public TCR sequences in the high-confidence stratum. Downsampling analysis reveals that SNP-based methods are more depth-sensitive than hash-based approaches, supporting the recommendation to combine both strategies. The framework transfers to AML samples across three assay types (snMultiome-seq, scRNA-seq, scATAC-seq), where ATAC-based demultiplexing resolves donor assignment discordance under low hashing efficiency. Conclusions: Split-flow demonstrates that combining orthogonal preprocessing methods yields structured information about cell quality and offers a concordance-based framework that transforms this disagreement into a diagnostic signal. It introduces a preprocessing approach that can be exploited beyond hematopoietic malignancies in multiplexed single-cell applications.
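A concordance-based decision step of this kind might look like the sketch below. The abstract does not give Split-flow's actual rules, so the stratum names, the rule order, and the function `classify_cell` are assumptions chosen only to show how agreement between a hashing call, a SNP call, and a doublet flag can be turned into quality strata.

```python
def classify_cell(hash_call, snp_call, is_doublet):
    """Assign a quality stratum from two donor calls and a doublet flag.

    Assumed rule set for illustration only; `None` means the method made no call.
    """
    if is_doublet:
        return "doublet"
    if hash_call is None and snp_call is None:
        return "unassigned"
    if hash_call is None or snp_call is None:
        return "single_method"
    if hash_call == snp_call:
        return "high_confidence"
    return "discordant"

# Toy barcodes mapped to (hashing call, SNP call, doublet flag).
cells = {
    "AAAC": ("donor1", "donor1", False),
    "AAAG": ("donor1", "donor2", False),
    "AAAT": (None, "donor3", False),
    "AACA": ("donor2", "donor2", True),
}
strata = {barcode: classify_cell(*calls) for barcode, calls in cells.items()}
```

Keeping the discordant and single-method cells as explicit strata, rather than discarding them, is what lets the disagreement itself be inspected downstream.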
bioinformatics 2026-05-13 v1
An improved generic schema for high fidelity data linkage and sample tracing across complex multi-assay medical entomology studies
Kavishe, D. R.; Msoffe, R. V.; Mmbaga, S.; Tarimo, L. J.; Butler, F.; Kaindoa, E. W.; Govella, N. J.; Kiware, S. S.; Killeen, G.
Abstract
Evidence-based decision making on malaria vector control strategies increasingly relies on triangulation of data, which requires informatics systems that can integrate data from complex, multi-stage studies involving mosquitoes. This manuscript describes a performance evaluation of an extended version of the generic schema underpinning the VBDs360 platform, specifically improved to accommodate multiple distinct entomological assays spanning the field, insectary, and laboratory. The utility of this extension, with respect to high-fidelity data linkage and robust sample traceability across complex entomological workflows, was evaluated through a case study conducted in southern Tanzania. Wild female mosquitoes were collected from 40 locations across more than 4,000 square km and then reared through multiple generations in an insectary before derived iso-female lineages were tested for phenotypic susceptibility to a pyrethroid insecticide. Such multi-generational lineages (F0 to Fn; where n is greater than or equal to 2) were propagated to prevent non-heritable maternal effects on phenotype and produce enough progeny for standard WHO susceptibility assays. All samples were subsequently archived in a molecular laboratory, where all F0 specimens were tested for sibling species identity. A paper-based implementation of the extended schema enabled successful integration of 77,017 lines of data distributed across 6 different tables that spanned 3 distinct field, insectary, and laboratory workflows, implemented by three different teams working in different locations. At each step, fully independent and redundant primary and secondary keys enabled high fidelity error correction and sample tracing. Consistently perfect linkage between assay design and sample sorting data was achieved for F0 wild-caught adults, with 100% of 66,108 records successfully linked between field capture and morphological categorization. This complete traceability extended to the propagation of derived Fn lineages, with all 100 and 243 records from 9 adult-derived and 13 larval-derived lineages, respectively, correctly linked. Insecticide susceptibility phenotyping further confirmed 100% linkage for 5,654 records between exposure history and recorded mortality outcome data in the insectary. Although such cross-cleaned linkages to sample analysis and storage data recorded by the laboratory team were not entirely perfect and could be improved, they were nevertheless of very high fidelity (97.3% (1,967/2,022) for F0 samples and 99.3% (437/440) for Fn samples). Overall, this pilot application of the extended generic schema ensured robust data provenance and minimized transcription errors in this complex study distributed across multiple teams and locations. These findings demonstrate how this generic informatics framework may be scaled and adapted to support data integrity across diverse, large-scale, multi-team entomological research workflows.
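The role of redundant keys in record linkage can be sketched as follows. The field names (`sample_id`, `form_row`), the identifiers, and the fallback rule are hypothetical and are not taken from the VBDs360 schema. The point is only that an independent secondary key lets a record be recovered when its primary key was mistranscribed, and that the linkage rate is the linked fraction of all records.

```python
def link_records(field_records, lab_records):
    """Link lab records to field records, primary key first, secondary key as fallback.

    Field names (`sample_id`, `form_row`) are hypothetical, for illustration only.
    """
    by_primary = {r["sample_id"]: r for r in field_records}
    by_secondary = {r["form_row"]: r for r in field_records}
    linked, unlinked = [], []
    for rec in lab_records:
        match = by_primary.get(rec.get("sample_id"))
        if match is None:
            # Primary key missing or mistranscribed: try the redundant secondary key.
            match = by_secondary.get(rec.get("form_row"))
        if match is None:
            unlinked.append(rec)
        else:
            linked.append((rec, match))
    return linked, unlinked

field = [
    {"sample_id": "F0-001", "form_row": ("ED1-07", 1), "site": "A"},
    {"sample_id": "F0-002", "form_row": ("ED1-07", 2), "site": "A"},
    {"sample_id": "F0-003", "form_row": ("ED1-08", 1), "site": "B"},
]
lab = [
    {"sample_id": "F0-001", "form_row": ("ED1-07", 1), "species": "sp1"},
    {"sample_id": "F0-0O2", "form_row": ("ED1-07", 2), "species": "sp1"},  # typo in primary key
    {"sample_id": "F0-009", "form_row": ("ED1-99", 5), "species": "sp2"},  # no valid key
]
linked, unlinked = link_records(field, lab)
linkage_rate = len(linked) / len(lab)
```

The same ratio gives the laboratory figures reported above: 1,967/2,022 rounds to 97.3% and 437/440 to 99.3%.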
bioinformatics 2026-05-13 v1
GeneCAD: Plant Genome Annotation with a DNA Foundation Model
Liu, Z.-Y.; Berthel, A.; Czech, E.; Marroquin, E.; Stitzer, M. C.; Hsu, S.-K.; Pennell, M.; Buckler, E. S.; Zhai, J.
Abstract
Accurate genome annotation is fundamental to biological discovery, yet identifying gene structures directly from DNA sequence remains a major challenge in complex genomes. We introduce GeneCAD, a sequence-only framework that predicts biologically coherent gene models without requiring species-matched transcriptomic or proteomic evidence. GeneCAD integrates lineage-specific DNA representations from the PlantCAD2 foundation model with a transformer encoder and a chromosome-scale conditional random field (CRF) to enforce structural constraints, such as splice-phase and feature order. To ensure high-quality supervision, we implement a curation strategy using a sequence-based masked-motif score to filter reference transcripts. As a primary validation across diverse angiosperms, including a complex allotetraploid, GeneCAD improves transcript F1 by approximately 9% over current tools like Helixer and BRAKER3, while sharpening boundary precision and achieving a best-in-class recovery of 86% of classical coding sequences. Furthermore, we demonstrate the framework's modularity by adapting it to animal lineages through the substitution of the underlying DNA foundation model. While the long introns of vertebrates challenge full transcript reconstruction, the model remains highly effective at identifying individual exons. By connecting evolutionary signals with structured decoding, GeneCAD provides a versatile and scalable solution for high-fidelity genome annotation across the Tree of Life.
bioinformatics 2026-05-12 v4
bioinformatics 2026-05-12 v3
BAT: an integrated pipeline for gene tree construction, annotation, and functional inference
Sheppard, B. D.; Behnken, B.; Steinbrenner, A.
Abstract
Gene family functional exploration often requires analyzing motifs, domains, and associated datasets (e.g. gene expression) in the phylogenetic context of a gene tree. As genomic resources become more abundant, local pipelines are needed to analyze gene families of interest with project-specific resources. Here we present BLAST-Align-Tree (BAT), a bioinformatic pipeline for automated gene family phylogeny construction and annotation to enable gene tree exploration. BAT combines a BLAST search of local genome databases with a robust and flexible gene tree construction pipeline that enables multiple modes of annotation. Output visualizations display experimental datasets, custom regex-specified amino acid motifs, and protein HMM domain annotations. For flexibility, BAT runs locally and is independent of pre-existing databases, allowing the easy incorporation of custom genomes and datasets. Three primary case studies described here demonstrate the utility of BAT for inferring the function of homologs and orthologs within characterized gene families. BAT is suitable for fine-scale phylogenomic analysis of gene families across the tree of life, and default genomes available on installation span model eukaryotes.
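Regex-specified motif annotation can be sketched in a few lines. This does not reproduce BAT's own code or interface: the helper `find_motif`, the toy sequences, and their identifiers are assumptions. The pattern used is the classic N-glycosylation sequon (N, then any residue except P, then S or T, then any residue except P), written as a regular expression.

```python
import re

def find_motif(sequences, pattern):
    """Return {sequence id: [(1-based start, matched text), ...]} for a regex motif.

    Hypothetical helper; reports non-overlapping matches only.
    """
    motif = re.compile(pattern)
    return {
        seq_id: [(m.start() + 1, m.group()) for m in motif.finditer(seq)]
        for seq_id, seq in sequences.items()
    }

# Toy protein sequences (invented for illustration).
sequences = {
    "geneA": "MKNGSAAPNPTL",
    "geneB": "MAAAAKLV",
}
# N-glycosylation sequon: N, not P, S or T, not P.
hits = find_motif(sequences, r"N[^P][ST][^P]")
```

Positions are reported 1-based so they can be placed directly on an alignment or tree annotation track.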
bioinformatics 2026-05-12 v1
SigBridgeR: An Integrative Framework and Toolkit for Comprehensive Screening and Benchmarking of Phenotype-Associated Cell Subpopulations in Single-Cell Transcriptomics
Yang, Y.; Yan, Z.; Qian, H.; Du, L.; Wang, C.; Peng, Y.; Bu, X.; Zhou, J.-G.; Wang, S.
Abstract
Single-cell RNA sequencing has revolutionized our understanding of cellular heterogeneity, yet linking specific cell subpopulations to clinically relevant phenotypes remains a persistent challenge. Although multiple computational methods have been developed to bridge this gap, they are typically implemented as standalone packages with heterogeneous preprocessing pipelines, incompatible parameter conventions, and divergent output formats, thereby hindering rigorous cross-method benchmarking and reproducible multi-method workflows. Here, we present SigBridgeR, an extensible R framework and comprehensive toolkit that currently unifies eight state-of-the-art phenotype-associated cell screening algorithms within consistent workflows. We conducted a systematic benchmarking study across four cancer types (HER2-positive breast cancer, triple-negative breast cancer, lung adenocarcinoma, and ovarian cancer) using both binary phenotypes and patient survival endpoints. Our evaluation incorporated positive and negative control assessments based on differentially expressed genes and randomly selected marker panels, alongside quantitative accuracy comparisons using ground-truth cell labels. Building upon these insights, SigBridgeR provides standardized preprocessing for scRNA-seq and bulk transcriptomic data, unified algorithmic interfaces through a registry-based architecture, ensemble analysis via weighted voting, and comprehensive visualization utilities for multi-method comparison. By lowering technical barriers and promoting methodological standardization, SigBridgeR facilitates reliable discovery of phenotype-relevant cell subpopulations and enhances the translational potential of single-cell omics research.
bioinformatics 2026-05-12 v1
Dual-view Guided Context-aware Network for Automated Bone Lesion Segmentation and Quantification in Whole-body SPECT
Chen, W.; Yang, X.; Lu, J.; Miao, M.; Huang, Y.; Zheng, S.; Zhang, C.; Xie, L.; Zhang, Y.
Abstract
Whole-body SPECT bone scintigraphy reflects skeletal metabolic activity throughout the body and plays an indispensable role in the screening, treatment evaluation, and prognostic assessment of bone metastases in tumors. However, the automatic detection and segmentation of hypermetabolic bone lesions remain challenging due to low contrast, limited spatial resolution, and complex lesion distributions. In this study, we proposed Bone-Segnet, a dual-view guided automatic segmentation network for hypermetabolic bone lesions that integrated multi-scale feature modeling, global context modeling, and view-conditioned modulation. Pixel-level annotated anterior and posterior whole-body bone scintigraphy images were used for model training and prediction. The proposed network enhanced the recognition of low-contrast and small-scale lesions through small-lesion enhancement and multi-scale contextual modeling. A Transformer module was further introduced to strengthen global feature representation, while cross-view collaborative modeling was achieved by incorporating the complementary characteristics of anterior and posterior imaging. Experimental results demonstrated that the proposed method outperformed existing approaches across multiple evaluation metrics, with the Dice score improving from 0.7440 to 0.8750, indicating a substantial improvement in segmentation performance. Further quantitative analysis based on the segmentation results revealed significant differences among disease types in lesion count, pixel burden, and spatial distribution patterns, reflecting the heterogeneity of disease-related skeletal metabolic activity. Overall, the proposed method improved automatic lesion segmentation performance and enabled quantitative analysis of lesion burden and spatial distribution patterns, providing objective data support for the assessment of related diseases.
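For reference, the Dice score reported above is the standard overlap measure 2|A ∩ B| / (|A| + |B|) between a predicted mask A and a ground-truth mask B. The sketch below computes it on toy binary masks. The function name `dice_score` and the convention of returning 1.0 when both masks are empty are assumptions, not details from the paper.

```python
import numpy as np

def dice_score(pred, truth):
    """Dice coefficient between two binary masks: 2|A∩B| / (|A| + |B|)."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return float(2.0 * np.logical_and(pred, truth).sum() / denom)

# Toy 2x3 masks: 3 predicted pixels, 3 true pixels, 2 shared.
pred = [[1, 1, 0],
        [0, 1, 0]]
truth = [[1, 0, 0],
         [0, 1, 1]]
score = dice_score(pred, truth)  # 2*2 / (3+3)
```

Because the denominator counts only foreground pixels, the score is sensitive to small lesions: missing a lesion of a few pixels changes it more than the same error would change pixel accuracy.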
bioinformatics 2026-05-12 v1
Culsma: A Formal Language for Laboratory Protocols
Chen, Y.; Sun, M.; Tadepally, L.; Wang, J.; Barcenilla, H.; Gonzalez, L.; Brodin, P.
Abstract
The application of artificial intelligence to biomedical research increasingly depends on iterative cycles in which AI systems analyze experimental data, propose follow-up conditions, and drive automated execution at scale, a paradigm central to Bio-AI and autonomous laboratory science. For such cycles to operate, laboratory protocols must be expressed in a form that is simultaneously human-readable and machine-executable. Natural-language descriptions, the current standard in laboratory practice, do not satisfy this dual requirement. We present Culsma, a formal language and execution framework that elevates laboratory protocols from informal prose to semantically explicit workflow programs that can be analyzed, validated, executed, and transferred across settings. The same protocol can be read and verified by a bench scientist, and parsed, validated, and executed by an automated pipeline without re-translation. We demonstrate an end-to-end implementation providing concrete evidence of practical viability.
bioinformatics 2026-05-12 v1
Temporal-deviation-driven community detection uncovers early-warning signals for critical transitions in complex diseases
Wang, L.; Xu, M.; Yan, H.; Zheng, Y.; Feng, S.; Zhang, Y.; Li, C.; Qiu, D.; Hu, B.; Wan, X.; Zhang, F.
Abstract
Early detection of critical transitions in complex diseases is crucial for timely clinical intervention. However, because patients often provide only a single snapshot, identifying sample-specific early-warning signals (EWS) from a dynamical evolution perspective remains challenging, a difficulty compounded by high-dimensional noise amplification. Here, we present TD-COM, a framework for detecting personalized EWS of critical transitions via single-sample community detection. By constructing a temporal perturbation map (STDN), TD-COM captures latent dynamical perturbations inferred from static individual profiles. Synergizing these temporal-deviation signals with static topological features, TD-COM implements a multi-level node filtering strategy during community detection, effectively suppressing single-sample noise. Validated on hour-scale, multi-year, and multi-decade transcriptomic data, TD-COM robustly detects critical states preceding clinical deterioration and uncovers their underlying molecular mechanisms. Comparative experiments demonstrate that TD-COM outperforms existing methods in accuracy and topological robustness. Thus, TD-COM provides a generalizable framework for personalized early warning of complex diseases, particularly when longitudinal sampling is infeasible.
bioinformatics 2026-05-12 v1
Receptor-Anchored Olfaction Representation through Perception-Consistent Metric Learning
Tian, C.; Wang, J.; Hou, J.; Liu, W.; Luo, Y.; Wang, Y.; Yang, L.; Lin, W.
Abstract
Olfactory perception arises from distributed activation across hundreds of olfactory receptors (ORs), yet our understanding of this landscape remains constrained by the scarcity of OR affinity measurements. Here, we present Receptor-Anchored Metric Supervision (RAMS), a transfer learning framework using perceptual consistency as weak supervision to predict OR activation spectra. RAMS fine-tunes a pretrained drug-target affinity model by imposing constraints derived from olfactory perception, where similar odorants are encouraged to exhibit similar OR activations. It transfers protein-ligand interaction knowledge learned from large-scale pharmacological data into the olfactory domain and reshapes it toward OR activation prediction. Evaluations against experimental measurements show that RAMS improves the accuracy of receptor-spectrum prediction and yields biologically plausible activation patterns. The predicted spectra show concordance between receptor discriminative capacity and expression level, and highlight the understudied OR52 family as a potential contributor to primary odor recognition. Together, RAMS provides a scalable framework for reconstructing receptor-anchored olfactory representations.
bioinformatics 2026-05-12 v1
Figra: A WebAssembly-based Excel Add-in for publication-quality scientific visualization with ggplot2
Sato, Y.
Abstract
Data visualization is a critical step in scientific communication. Most researchers rely on subscription-based software for this purpose, which requires ongoing licensing costs. Free alternatives such as R and Python offer publication-quality output but demand programming expertise that many researchers do not possess. Artificial intelligence tools can assist with figure generation but remain frustrating when users wish to fine-tune specific visual parameters to their preference. Meanwhile, Microsoft Excel, the most widely used tool for scientific data storage and management, offers limited visualization capabilities, forcing researchers to transfer their data to external software as an extra step before creating figures. Here we present Figra, a free Excel Office Add-in that eliminates this extra step by enabling publication-quality ggplot2-based figure generation directly within Excel, with simple and direct control over every visual option. Figra leverages WebAssembly technology (webR) to execute R code entirely within the browser, requiring no R installation, no subscription, and no server connection. The add-in supports over 20 chart types spanning distribution plots, grouped comparisons, time-series, scatter plots, and specialized curve-fitting analyses. For applicable chart types, Figra performs automated or manual statistical analysis supporting both paired and unpaired designs across two or more groups. Additionally, Figra exports simplified, executable R code that reproduces the displayed figure, serving as an educational tool for researchers wishing to learn ggplot2. Figra is open-source and freely available at https://h20gg702.github.io/figra-pages/index.html while the source code is provided at https://github.com/h20gg702/Figra.
bioinformatics 2026-05-12 v1
Generative Chemistry Platform for Small Molecules Targeting RNA: A Case Study for Chemical Optimization
Allen, T. E. H.; Bonnet, M.; Khan, R. T.
Abstract
We introduce the Serna Bio GenAI platform, a generative chemistry and multiparametric optimization platform for the design of RNA-targeting small molecules. Targeting RNA with small molecules has historically proven challenging but offers notable potential upsides, including access to unique mechanisms of action and the ability to target otherwise untargetable genes. We consider a major challenge here to be designing chemistry specific to RNA targeting. Molecular design is a valuable application of AI in drug discovery, but many publicly available models use training data focused on protein targeting, the modality historically best explored in drug discovery. We showcase the difference and value in building a specifically RNA-targeting platform, comparing its performance to state-of-the-art public chemical generators and experimentally validating its chemical designs in comparison to chemistry designed by a human expert.
bioinformatics 2026-05-12 v1