Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
Fleming: An AI Agent for Antibiotic Discovery in Mycobacterium Tuberculosis
Wei, Z.; Ektefaie, Y.; Zhou, A.; Negatu, D.; Aldridge, B. B.; Dick, T. B.; Skarlinski, M.; White, A.; Rodriques, S. G.; Hosseiniporgham, S.; Parai, M.; Flores, A.; Inna, K. V.; Zitnik, M.; Sacchettini, J.; Farhat, M. R.
Antibiotic development is challenged by high costs and failure rates. Artificial intelligence (AI) holds promise to overcome these challenges by predicting inhibitory properties of novel compounds, generating new candidates, and contextualizing property predictions in the biological background. Fleming is an integrative AI agent that explores novel chemical space to identify lead compounds meeting multiple criteria. The discriminative and generative AI models for Mycobacterium tuberculosis (Mtb) inhibition were trained on a set of 114,900 diverse compounds and fragments based on in vitro growth inhibition. We combined both models as well as molecular optimization, ADMET prediction and literature search functions to make Fleming an integrated agent for Mtb preclinical lead identification. Fleming has 17% higher discrimination between known Mtb leads and leads for other diseases than a generic LLM agent along with 13% higher discrimination than molecular property prediction alone on challenging ADMET tasks. Fleming demonstrates an 83% in vitro hit rate of predicted inhibition and a 100% hit rate of de novo generative design. Fleming's generative designs also demonstrate an 83% rate of favorable ADMET profiles. Fleming is an integrative AI agent able to explore new regions of the chemical space to select lead compounds that simultaneously meet several desirable criteria.
bioinformatics 2026-03-12 v2
Harnessing methylation signals inherent in long-read sequencing data for improved variant phasing
Pfennig, A.; Akey, J. M.
Accurate phasing of genetic and epigenetic variation is crucial for many downstream analyses, including association testing, clinical variant interpretation, and inference of population history. Although long-read sequencing significantly improves the continuity and completeness of genome sequencing, reconstructing chromosome-scale haplotypes remains challenging, often requiring the integration of multiple technologies, such as PacBio HiFi and Oxford Nanopore Technologies (ONT) sequencing. While these sequencing platforms detect the epigenetic modification 5-methylcytosine (5mC), current read-based phasing algorithms do not incorporate this information. We developed a read-based phasing method named LongHap that seamlessly integrates sequence and methylation data, and we show that it significantly improves haplotype reconstruction. LongHap first creates phase blocks based on overlapping heterozygous sequence variants, accurately phasing complex variants by embedding them into the broader haplotype context through belief propagation. LongHap then dynamically identifies differentially methylated sites that are informative for phasing to refine and extend initial phase blocks. Through extensive analyses, we demonstrate that LongHap outperforms existing tools, including WhatsHap, HapCUT2, LongPhase, and MethPhaser, by achieving lower switch error rates and greater phase block contiguity. Crucially, we show that LongHap also improves variant phasing in challenging, medically relevant genes. In summary, by leveraging native methylation signals from long-read sequencing data, LongHap enhances long-range haplotype reconstruction, enabling more accurate haplotype-based genome analysis. LongHap is available from: https://github.com/AkeyLab/LongHap.
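The switch error rate used above to compare phasing tools can be illustrated with a minimal sketch. This is not LongHap's code: the function names and the 0/1 encoding (the allele carried by haplotype 1 at each heterozygous site of one phase block) are assumptions made for illustration. A switch error is counted wherever the predicted phasing changes its orientation relative to the truth between two adjacent sites.

```python
def switch_errors(truth, predicted):
    """Count switch errors between two phasings of the same heterozygous sites.

    Each phasing is a sequence of 0/1 values giving the allele carried by
    haplotype 1 at each site within a single phase block (assumed encoding).
    """
    if len(truth) != len(predicted):
        raise ValueError("phasings must cover the same sites")
    # 0 where the two phasings agree, 1 where they are flipped.
    orientation = [t ^ p for t, p in zip(truth, predicted)]
    # A switch error is a change of relative orientation between adjacent sites.
    return sum(a != b for a, b in zip(orientation, orientation[1:]))


def switch_error_rate(truth, predicted):
    """Switch errors divided by the number of adjacent site pairs."""
    pairs = len(truth) - 1
    return switch_errors(truth, predicted) / pairs if pairs > 0 else 0.0
```

Under this definition a phasing that is flipped as a whole has no switch errors, because haplotype labels are arbitrary, while a single mis-phased site in the middle of a block counts as two switches.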
bioinformatics 2026-03-12 v1
DEX: a consensus-based amino acid exchangeability measure for improved codon substitution modelling
Douglas, G. M.; Bobay, L.-M.
Physicochemically similar amino acids undergo more frequent substitutions compared to dissimilar amino acid pairs. Despite their clear potential, amino acid similarity matrices remain underused in molecular evolution, partially due to the high number of proposed amino acid distance measures and the lack of agreement on which are most accurate. In this study, we assessed the performance of 30 amino acid distance measures, including a new amino acid distance measure we developed based on recent deep mutational scanning data. We compared these measures across codon substitution models fit to alignments spanning Streptococcus, Drosophila, and mammalian lineages, as well as segregating variants across Escherichia coli strains and human genotypes. We further constructed consensus measures from combinations of top-performing measures in this analysis using the DISTATIS approach and retested these matrices. Our results show that experimentally-derived measures, particularly our new measure and the existing experimental exchangeability (EX) measure, best fit codon substitution patterns across diverse lineages. We found that a consensus measure based on these two approaches, which we named DEX, performed best overall. In addition, although site-specific variant effect predictors are intended to identify deleterious mutations, the representative tools we tested did not outperform amino acid distance measures for predicting mean substitution frequencies. They were however substantially more informative for identifying individual highly deleterious mutations. Overall, we provide a systematic comparison of the performance of existing measures, and we introduce an improved general-purpose amino acid distance measure for molecular evolution models.
bioinformatics 2026-03-12 v1
Rational Design of Selective IL-2-based Activators for CAR T Cells Using AlphaFold3 and Physics-Informed Machine Learning
Dahmani, L. Z.; Banerjee, A.
Recombinant human Interleukin-2 (rhIL-2, Aldesleukin) is used in immunotherapy for metastatic melanoma and renal cell carcinoma. Low-dose IL-2 has been investigated for administration after adoptive T cell transfer to enhance CAR T expansion and sustain effector function. However, systemic IL-2 can cause severe toxicities and promote expansion of regulatory T cells (Tregs). Previous attempts at mitigating cytokine-mediated side effects involved isolating CAR T cell signaling from endogenous immune responses by developing IL-2/IL-2Rβ-based selective ligand-receptor systems. Expressing these variant orthogonal (ortho)IL-2Rβ receptors in CAR T cells and supplying variant orthoIL-2 was shown to dramatically improve selectivity in CAR T cell expansion and anti-tumoral potency in a leukemia mouse model. This study describes the computational design of synthetic orthogonal cytokine receptor-ligand systems based on the scaffolds of the human canonical IL-2 and IL-2Rβ. Leveraging state-of-the-art AlphaFold3 (AF3) structure prediction capabilities and a physics-informed constrained sequence generator (CSG), the pipeline generates, filters and ranks sets of putative orthoIL-2/orthoIL-2Rβ mutant designs. Variants displaying minimal predicted off-target interactions and enhanced on-target contacts are prioritized for structural modelling. Top designs showed outstanding AF3 structural and interfacial quality metrics ipTM and pTM, with averages between cognate pairs of 0.724 ± 0.05 and 0.770 ± 0.042, respectively. All in-silico hits showed ipTM < 0.5 for non-cognates, indicating a good likelihood of orthogonality. Additionally, putative hits showed high levels of predicted structural fidelity to wild-type (WT) human IL-2/IL-2Rβ (PDB: 2ERJ), with an average structural root-mean-square deviation (RMSD) of 0.843 ± 0.375 Å. These mutants incorporated 7-26 interfacial mutations derived from multiple interface selection strategies.
Altogether, the results support the putative foldability and selective affinity of top-ranking mutants displaying metrics close to or within the experimental reference range. Finally, strengths and limitations are discussed, alongside the experimental implications of coupling a constrained protein design pipeline to the discovery and validation of selective binders based on naturally occurring scaffolds.
bioinformatics 2026-03-12 v1
Cyclic peptides space: The methodology of sequence selection to cover the comprehensive physical properties
Tsuchihashi, R.; Kinoshita, M.
Cyclic peptides have emerged as a pivotal modality for next-generation therapeutics, due to their superior biocompatibility, high selectivity, and structural stability. While AI-driven peptide design has advanced rapidly, conventional optimization algorithms are often constrained by initialization biases, which impede the efficient exploration of the vast chemical space. Here, we propose a novel methodology that integrates the protein language model ESM-2 with cyclic permutation averaging of embeddings to resolve this bottleneck. This approach establishes a comprehensive "peptide space", a high-dimensional vector representation that encapsulates the physicochemical and structural attributes of cyclic peptides. Our analysis reveals that random sequence selection results in a heterogeneous distribution within this space, potentially underrepresenting specific functional regions. Conversely, navigating this defined peptide space enables the selection of libraries that uniformly span diverse molecular properties. In a proof-of-concept study designing binders for β2-microglobulin (β2m), we demonstrate that initial sequences uniformly sampled from our peptide space yield superior candidates more efficiently than those derived from random selection. Furthermore, this framework facilitates the quantitative assessment of mutational perturbations on global peptide properties, supporting rational decision-making for both broad exploration and local optimization. This "peptide space" concept provides a foundational framework for defining appropriate search boundaries and enhancing computational efficiency in AI-mediated drug discovery.
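One plausible reading of "cyclic permutation averaging of embeddings" is sketched below: embed every rotation of the linear sequence and average the resulting vectors, so that all rotations of the same ring map to the same point. The abstract does not give the authors' exact procedure, so this is an assumption. `toy_embed` is a hypothetical, deliberately position-dependent stand-in for a language-model embedding; it is not ESM-2 and does not call any ESM-2 API.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_embed(sequence, dim=8):
    """Hypothetical stand-in for a language-model embedding (NOT ESM-2).

    The output depends on residue positions, so rotating the sequence changes
    it, as would be expected for a model trained on linear sequences.
    """
    rng = np.random.default_rng(0)  # fixed seed: same residue table every call
    table = rng.normal(size=(len(AMINO_ACIDS), dim))
    weights = 1.0 / (1.0 + np.arange(len(sequence)))  # position weighting
    rows = np.array([table[AMINO_ACIDS.index(aa)] for aa in sequence])
    return (weights[:, None] * rows).sum(axis=0)


def cyclic_rotations(sequence):
    """All rotations of a sequence, i.e. every linear reading of the same ring."""
    return [sequence[i:] + sequence[:i] for i in range(len(sequence))]


def cyclic_embedding(sequence, embed=toy_embed):
    """Average the embedding over all rotations of the sequence."""
    return np.mean([embed(rot) for rot in cyclic_rotations(sequence)], axis=0)
```

With this construction `cyclic_embedding("ACDKW")` and `cyclic_embedding("DKWAC")` agree up to floating-point error, whereas the raw `toy_embed` vectors of the two strings differ. A real embedding function with the same call signature could be passed through the `embed` argument.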
bioinformatics 2026-03-12 v1
Benchmarking zero-shot single-cell foundation model embeddings for cellular dynamics reconstruction
Zhou, X.; Wang, Z.; Ling, Y.; Tian, Q.; Zhang, Z.; Li, Y.; Zhou, P.; Chen, L.
Reconstructing cellular trajectories from time-resolved single-cell transcriptomics is fundamental to understanding processes from embryonic development to cancer progression. While single-cell foundation models (scFMs) promise universal biological representations through large-scale pretraining, their capacity to capture the non-linear dynamics governing cell-fate decisions remains uncharacterized. Here we systematically benchmark multiple scFMs across challenging biomedical scenarios involving branching lineages and continuous state transitions. By coupling zero-shot scFM embeddings with dynamic optimal transport, we evaluated their performance against a traditional highly variable gene (HVG) baseline in backtracking progenitor states, interpolating transition intermediates, and extrapolating future fates. We find that zero-shot scFM embeddings underperform the HVG baseline across diverse biological systems, particularly in recovering the distributional complexity of unobserved cells. Mechanistic analysis reveals that current scFM architectures tend to over-compress subtle temporal signals, causing an artificial "linearization" of branched biological structures that may obscure critical divergence points in disease progression. Our findings suggest that while scFMs provide unified cell-state views, the HVG baseline remains more robust for trajectory inference, identifying a fundamental "temporal-compression" bottleneck that must be addressed to develop next-generation, dynamics-aware foundation models.
bioinformatics 2026-03-12 v1
Benchmarking BEAGLE to find optimal parameters for BEAST X
Fosse, S.; Duchene, S.; Duitama Gonzalez, C.
Bayesian phylogenetic analyses are notoriously time-consuming, largely because exploring the posterior distribution requires computing Felsenstein's likelihood. The BEAGLE library is a high-performance computational tool that dramatically accelerates the calculation of such likelihoods by leveraging parallel processing on GPUs, multicore CPUs, and SSE vectorisation. Here, we present results from benchmarking a widely popular phylogenetics package, BEAST X, using BEAGLE integration, focusing on how hardware allocation affects running times. We demonstrate substantial differences among BEAGLE settings on real Dengue Virus (DENV) data, both with and without partitioning. Using simulated sequences, we establish guidelines for GPU usage in BEAST X runs. These guidelines can be used for effective resource allocation for empirical analyses and simulation studies.
bioinformatics 2026-03-12 v1
Directional Variant Tension (Tv): A Causal Framework for Quantifying Substitution Asymmetry
Karagöl, A.; Karagöl, T.
Amino acid substitutions are often directionally asymmetric due to underlying biophysical constraints and diverse evolutionary pressures. We introduce Tv (variant tension), a kernel regression-based metric that quantifies this directional asymmetry directly from multiple sequence alignments (MSAs). Tv leverages empirical amino acid frequencies and a non-parametric Gaussian kernel to capture nonlinear substitution flows, providing a causality-inspired framework for understanding evolutionary dynamics. We also present a web-based application that implements the calculation, allowing users to input MSAs, adjust parameters (kernel bandwidth, smoothing window size), and visualize results, including global tension scores and high-tension sites. Applying Tv to the human glutamate transporter (EAA1), we identify significant substitution asymmetries, localize high-tension sites, and reveal correlations between elevated Tv and known pathogenic variants. This framework integrates statistical learning with protein evolution, offering a powerful tool for bridging protein design principles with evolutionary inference. Beyond variant prioritization, Tv offers a scalable framework for simulating evolution under directional constraints, enabling predictive modeling of protein adaptation. The free web application is openly accessible at https://www.karagolresearch.com/variantt
bioinformatics 2026-03-12 v1
GCN-Mamba: Graph Convolutional Network with Mamba for Antibacterial Synergy Prediction
Su, H.; Liang, Y.; Xiao, W.; Li, H.; Liu, X.; Yang, Z.; Yuan, M.; Liu, X.
The escalating crisis of antimicrobial resistance necessitates novel therapeutic strategies, among which drug combination therapy shows great promise by enhancing efficacy and reducing toxicity. However, identifying effective synergistic pairs from the vast combinatorial space remains experimentally challenging and resource-intensive. To address this, we introduce GCN-Mamba, a deep learning framework that integrates Graph Convolutional Networks (GCN) with the Mamba State Space Model. This architecture captures both local molecular topological structures and global implicit interactions by leveraging Extended 3-Dimensional Fingerprints (E3FP) and bacterial gene expression profiles. Evaluation on a comprehensive dataset demonstrated that GCN-Mamba significantly outperforms classical machine learning models in predictive accuracy. In a targeted case study against Methicillin-resistant Staphylococcus aureus (MRSA), the model successfully rediscovered known synergistic pairs, such as Quercetin and Curcumin, consistent with recent literature. Furthermore, prospective in vitro validation confirmed a novel synergistic combination of Shikimic acid and Oxacillin, validating the model's practical utility. By efficiently prioritizing potential candidates, GCN-Mamba serves as a powerful and reliable tool for accelerating the discovery of synergistic antimicrobial combinations, effectively bridging the gap between computational prediction and experimental validation.
bioinformatics 2026-03-12 v1
HitAnno: Atlas-level cell type annotation based on scATAC-seq data via a hierarchical language model
Wang, Z.; Chen, X.; Cui, X.; Gao, Z.; Li, Z.; Li, K.; Jiang, R.
The single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq) has emerged as a core technology for dissecting cellular epigenomic heterogeneity and gene regulatory programs. With the emergence of atlas-level scATAC-seq datasets, cell type annotation increasingly faces challenges arising from unprecedented data scale and increased cell-type diversity, which together place stringent demands on model reliability and robustness. Here, we introduce HitAnno, a hierarchical language model capable of accurate and scalable cell type annotation in atlas-level scATAC-seq data. Leveraging selected cell-type-specific peaks to construct cell sentences, HitAnno employs a two-level attention mechanism that captures accessibility profiles hierarchically. Extensive evaluations show that HitAnno robustly annotates both major and rare cell types across multiple settings, including intra-dataset, cross-donor and inter-dataset annotation. The hierarchical attention mechanisms of the model reveal co-accessibility patterns among peaks and dependencies across higher-order peak sets, ensuring an interpretable annotation process. Training on a 31-cell-type human atlas, HitAnno can directly annotate new query datasets without retraining and is accessible through an online interface. Our model identifies heterogeneous subgroups within mixed labeled cells from unseen datasets, demonstrating its potential to assist researchers in refining existing cell atlases.
bioinformatics 2026-03-12 v1
mnDINO: Accurate and robust segmentation of micronuclei with vision transformer networks
Ren, Y.; Morlot, L.; Andrews, J. O.; Thrane Hertz, E. P.; Mailand, N.; Caicedo, J. C.
Recent advances in cell segmentation successfully produce models that generalize across various cell-lines and imaging types. However, these methods still fail to recognize subcellular structures such as micronuclei (MN), which are rare and tiny DNA-containing structures found outside of the main nucleus and observable under the microscope. While they can be hard to recognize in images, studying MN formation is of great interest because of their relationship to chromosome instability, genotoxicity, and cancer progression. Here we present a segmentation model, mnDINO, to segment micronuclei in DNA stained images under diverse experimental conditions with very high efficiency and accuracy. To train this model, we collected a heterogeneous set of images with more than five thousand annotated micronuclei. Trained with this diverse resource, the mnDINO model improves the accuracy of MN segmentation, and exhibits strong generalization across microscopes and cell lines. The dataset, code, and pre-trained model are made publicly available to facilitate future research in MN biology.
bioinformatics 2026-03-12 v1
Comparative Analysis of Structural and Dynamical Properties of Lipid Membranes Simulated with the AMBER Lipid21 Force Field Using SPC/E, TIP3P, TIP3P-FB, TIP4P-FB, TIP4P-Ew, TIP4P/2005, TIP4P-D, and OPC Water Models
Chakraborty, D. S.; Singh, P. P.; Dey, C.; Kaur, J.
We conducted all-atom molecular dynamics simulations of POPC and DPPC lipid bilayers using the AMBER Lipid21 force field with eight different water models (SPC/E, TIP3P, TIP3P-FB, TIP4P-FB, TIP4P-Ew, TIP4P/2005, TIP4P-D, and OPC) to identify the most compatible one without any modification. Several parameters were computed to characterize the structure of the lipid bilayers: area per lipid, isothermal compressibility modulus, average volume per lipid, electron density profile, bilayer thickness, X-ray and neutron scattering form factors, deuterium order parameter, and radial distribution function. The estimated area per lipid, isothermal compressibility modulus, volume per lipid, and bilayer thickness are highly consistent with experimental results for the SPC/E water model, indicating its suitability for use with the AMBER Lipid21 force field without any modification. The electron density profiles of both bilayers show slightly increased water penetration relative to the membrane surface for the TIP4P-D water model. However, the experimental X-ray and neutron scattering form factors align well with the simulated results for all water models studied, and TIP4P-D shows better agreement for X-ray data. The deuterium order parameter of the lipid acyl chains is less than 0.25 for all water models, indicating disorder in both bilayers. The lateral diffusion and reorientation autocorrelation function of the lipid molecules in both bilayers were computed to characterize their dynamics across all water models. Compared with the other water models, the simulated trajectories predict better structure and reasonably fair dynamic properties for the SPC/E water model. The TIP4P-Ew water model reproduces the lateral diffusion coefficient in close agreement with experiment.
Reorientational dynamics of both lipids were examined for all eight water models; the slow and slowest time components correspond to lipid axial (wobble) motion and twist/splay motions. Overall, considering the performance of the different water models with the AMBER Lipid21 all-atom force field in reproducing membrane physical properties, the SPC/E water model appears to be an optimal choice.
bioinformatics 2026-03-12 v1
DiaReport: Reproducible Workflow for Differential Expression Analysis and Interactive Reporting in DIA-based Proteomics
Argentini, A.; Fernandez Fernandez, E.; Pauwels, J.; Gevaert, K.
Data-independent acquisition (DIA) has become the preferred data acquisition method for mass spectrometry-based proteomics, yet reproducible workflows for differential expression (DE) analysis and reporting of results remain limited. We present DiaReport, an R package that performs precursor- and protein-level DE analysis from DIA-NN output using MSqRob and QFeatures, while generating high-quality, interactive HTML reports through Quarto. DiaReport integrates precursor data, filtering of missing values, normalization, protein summarization and statistical modeling within a single function, supporting both simple pairwise and complex experimental designs. The package provides structured outputs and configuration files to ensure computational reproducibility across different studies. To accommodate diverse research needs, DiaReport includes multiple reporting templates tailored to different proteomic applications. Applying DiaReport to an extracellular vesicle (EV) proteomics dataset demonstrates its ability to efficiently analyze DIA data and provide rapid insights into sample quality and protein-level differences. Availability and Implementation: DiaReport is an open-source R package available at https://github.com/Gevaert-Lab/diareport. The package is platform-independent and distributed under the MIT license. Reports are generated using Quarto and require only standard R dependencies. Detailed documentation, installation guides and usage vignettes are provided within the repository. The interactive HTML reports discussed in this study, including the UPS2 benchmark and EV case study, are archived on Zenodo (https://doi.org/10.5281/zenodo.18632744 and https://doi.org/10.5281/zenodo.18632731).
bioinformatics 2026-03-12 v1
Evaluating transformer-based models for structural characterization of orphan proteins
Seckin, E.; Colinet, D.; Danchin, E.; Sarti, E.
Transformer-based models (TBMs) are state-of-the-art deep learning architectures that predict protein structural and functional features with high accuracy. Despite methodological differences, they all rely on large protein sequence datasets structured by homology, as homologous proteins typically share structure and function. However, 5-30% of eukaryotic proteomes consist of orphan proteins - sequences without detectable similarity to known families. Although they may share structural or functional traits with characterized proteins, their lack of homology makes them ideal for evaluating TBM generalization beyond familiar sequence space. We compared predictions from several widely used TBM architectures on an expert-curated set of orphan proteins from the Meloidogyne genus. None of these proteins has an experimentally determined structure. To assess model performance, we conducted consistency analyses, comparing predicted features with those observed in sets of known homologous proteins and across models. Multiple sequence alignment-based approaches such as AlphaFold2 performed poorly on orphan proteins, as did single-sequence or embedding-based language models including ESMFold, OmegaFold, and ProtT5. This limited performance cannot be fully attributed to intrinsic disorder, as confirmed by independent non-TBM disorder predictors. While accurate tertiary structure prediction remains out of reach, secondary structure is more reliably captured: predictors share about 70% of secondary structure elements on average, regardless of global fold similarity, and these elements are consistently identified by dedicated secondary structure tools.
bioinformatics 2026-03-12 v1
GE-BiCross: A Hierarchical Bidirectional Cross-Attention Framework for Genotype-by-Environment Prediction in Maize
Zhou, S.; Zhao, T.
Genotype-by-environment interactions are central to crop adaptation and yield stability, yet they remain difficult to model for robust prediction across heterogeneous environments. Although enviromic profiling has improved the characterization of dynamic field conditions, most existing genomic prediction methods adopt a late-fusion strategy that encodes genomic and environmental information independently before global integration, thereby limiting their ability to resolve fine-scale, context-dependent G x E effects. Here, we developed GE-BiCross, a hierarchical bidirectional cross-attention framework for maize prediction. GE-BiCross incorporates a dual-path feature extraction module to disentangle independent and cooperative effects, a tokenized bidirectional cross-attention module to enable reciprocal genotype-environment interaction learning, and a mixture-of-experts module to adaptively capture heterogeneous response patterns across environments. Using a large-scale dataset of approximately 360,000 observations from 4,923 maize hybrids evaluated in 241 environments, GE-BiCross consistently outperformed conventional genomic prediction, machine learning, and deep learning baselines across six agronomic traits. The greatest improvements were observed for environmentally responsive and genetically complex traits. In particular, GE-BiCross achieved an R² of 0.672 for grain yield and 0.880 for grain moisture, significantly surpassing all comparison models. Ablation analyses demonstrated that the three core modules make distinct and complementary contributions to predictive performance. These results show that deep, bidirectional integration of genomic and enviromic information can substantially improve modeling of complex G x E interactions, providing a powerful framework for interpretable genomic prediction and climate-smart crop breeding.
bioinformatics 2026-03-12 v1
Sassy2: Batch Searching of Short DNA Patterns
Beeloo, R.; Groot Koerkamp, R.
Motivation. Searching short DNA patterns such as barcodes, primers, or CRISPR spacers within sequencing reads or genomes is a fundamental task in bioinformatics. These problems are instances of multiple approximate string matching (MASM) [Baeza-Yates and Navarro, 1997], which requires locating all occurrences with up to k errors of multiple patterns of length m in a text of length n. Classical approaches based on seeding with exact matches become inefficient for short patterns (m ≤ 64 bp) as k increases, either producing many spurious hits or missing true matches. Our previous work, Sassy1, showed that careful hardware optimization drastically accelerates single-pattern searches in long texts by distributing chunks of the text across SIMD lanes. Methods. Sassy2 distributes multiple patterns across SIMD lanes to maximize parallelism when searching batches of short patterns. When k is small, often only a short substring of the pattern of length O(k) is needed to reject a possible match. Thus, Sassy2 first examines short suffixes of the patterns (e.g., the last 16 bp of 32 bp patterns), allowing more (but smaller) parallel SIMD lanes. Only positions passing this suffix filter undergo full pattern verification. Results. On synthetic data, Sassy2 achieves 10-50x speedups over Sassy1 for short texts (n ≤ 200 bp) and 2-4x for large texts (n ≥ 1 Mbp). On real-world tasks with 16 threads, Sassy2 reaches over 100 Gbp/s text throughput per guide when searching 312 gRNAs across the human genome and 116 Gbp/s throughput when demultiplexing Nanopore reads with 96 barcodes. In both cases, Sassy2 outperforms Sassy1 by 2-5x and Edlib by 20-45x. Availability. Sassy2 is implemented in Rust and available at github.com/RagnarGrootKoerkamp/sassy.
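The suffix-filter idea from the Methods paragraph can be sketched in plain, scalar Python. This is an illustration only and not Sassy2's implementation, which is written in Rust and uses SIMD; the function names are hypothetical, and the classical Sellers dynamic program is swapped in for the matching step. The filter is lossless because any alignment of the full pattern with at most k edits that ends at a given text position contains an alignment of the pattern's suffix with at most k edits ending at the same position.

```python
def end_costs(pattern, text):
    """Sellers' algorithm for approximate matching.

    Returns a list where entry j is the minimum edit distance between
    `pattern` and any substring of `text` that ends just before index j.
    """
    m = len(pattern)
    col = list(range(m + 1))  # column for the empty text prefix
    costs = [col[m]]
    for ch in text:
        new = [0] * (m + 1)  # row 0 stays 0: a match may start anywhere
        for i in range(1, m + 1):
            new[i] = min(
                col[i - 1] + (pattern[i - 1] != ch),  # match or substitution
                col[i] + 1,      # extra text character
                new[i - 1] + 1,  # skipped pattern character
            )
        col = new
        costs.append(col[m])
    return costs


def search_with_suffix_filter(pattern, text, k, suffix_len):
    """Report (end, cost) for every end position where `pattern` matches
    with at most k edits, checking only a short suffix first."""
    m = len(pattern)
    suffix = pattern[-suffix_len:]
    hits = []
    for end, cost in enumerate(end_costs(suffix, text)):
        if cost > k:
            continue  # the suffix alone already needs more than k edits
        # A match with at most k edits spans at most m + k text characters.
        window = text[max(0, end - (m + k)):end]
        full_cost = end_costs(pattern, window)[-1]
        if full_cost <= k:
            hits.append((end, full_cost))
    return hits
```

For example, `search_with_suffix_filter("ACGTACGT", "TTTACGTACGTTTTACGAACGTCC", 1, 4)` returns the same end positions as thresholding `end_costs` on the full pattern, while running the full-length check only at positions that pass the four-base suffix filter.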
bioinformatics 2026-03-12 v1
AlphaFind v2: Similarity Search in AlphaFold DB and TED Domains across Structural Contexts
Slaninakova, T.; Rosinec, A.; Cillik, J.; Krenek, A.; Gresova, K.; Porubska, J.; Marsalkova, E.; Olha, J.; Prochazka, D.; Hejtmanek, L.; Dohnal, V.; Berka, K.; Svobodova, R.; Antol, M.
The availability of large-scale protein structure collections enables structure-based analysis of their function and evolution beyond what is possible from sequence alone. However, applying three-dimensional structure comparison at scale remains computationally demanding and limits practical exploration of large experimental and predicted collections. This creates a need for fast, structure-based search methods that retain biological relevance while enabling large-scale exploration. In this paper, we present AlphaFind v2, an application for finding structurally similar proteins in the AlphaFold Database (https://alphafold.ebi.ac.uk/) of predicted structures. AlphaFind v2 uses fast pre-filtering via state-of-the-art protein embeddings that preserve structural information, followed by refinement with US-align. The application presents multiple complementary search modes, including (i) search over full protein chains, (ii) search aware of the AlphaFold pLDDT metric, restricting similarity computation to the most stable and structurally relevant regions, (iii) search over protein domains from the TED database (https://ted.cathdb.info/), and (iv) a multidomain search mode, combining multiple chain-level domain matches within a single score and alignment. The application accepts protein identifiers and returns similar proteins with metrics, rich metadata, and interactive superpositions. AlphaFind v2 additionally allows searching within an organism or CATH label and matches the proteins with experimental structures. AlphaFind v2 is accessible at https://alphafind.ics.muni.cz/.
bioinformatics 2026-03-12 v1
Joint Geometric-Chemical Distance for Protein Surfaces
Swami, H.; Eckmann, J.-P.; McBride, J. M.; Tlusty, T.
Protein function is executed at the molecular surface, where shape and chemistry act together to govern interaction. Yet most comparison methods treat these aspects separately, privileging either global fold or local descriptors and missing their coupled organization. Here we introduce IFACE (Intrinsic Field-Aligned Coupled Embedding), a correspondence-based framework that aligns protein surfaces through probabilistic coupling of intrinsic geometry with spatially distributed chemical fields. From this alignment, we derive a joint geometric-chemical distance that integrates structural and physicochemical discrepancies within a single formulation. Across diverse proteins, this distance separates conformational variability from true structural divergence more effectively than fold-based similarity measures. Applied to the cytochrome P450 family, it reveals coherent family-level organization and identifies conserved buried catalytic pockets despite the complex topology. By linking interpretable surface correspondences with a unified distance, IFACE establishes a principled basis for comparing protein interfaces and detecting functionally related interaction patches across proteins.
bioinformatics2026-03-12v1MultiPopPred: A Trans-Ethnic Disease Risk Prediction Method, and its Application to the South Asian Population
Kamal, R.; Narayanan, M.Abstract
Genome-wide association studies (GWAS) have guided significant contributions towards identifying disease-associated Single Nucleotide Polymorphisms (SNPs) in Caucasian populations, albeit with limited focus on other understudied low-resource non-Caucasian populations. There have been active efforts over the years to understand and exploit the population specific versus shared aspects of the genotype-phenotype relation across different populations or ethnicities to bridge this gap. However, the efficacy of transfer learning models that are simpler than existing approaches and utilize individual-level data remains an open question. We propose MultiPopPred, a novel and simple trans-ethnic polygenic risk score (PRS) estimation method that taps into the shared genetic risk across populations and transfers information learned from multiple well-studied auxiliary populations to a less-studied target population. The default version of MultiPopPred (MPP-PRS+) harnesses individual-level data using a specially designed Nesterov-smoothed penalized shrinkage model and an L-BFGS optimization routine. Extensive comparative analyses performed on simulated genotype-phenotype data, assuming an infinitesimal model, reveal that MPP-PRS+ improves PRS prediction in the South Asian population by 38% on average across all simulation settings when compared to state-of-the-art trans-ethnic PRS estimation methods. This improvement is enhanced in settings with low target sample sizes and in semi-simulated settings. Furthermore, MPP-PRS+ produces better or comparable PRS predictions than state-of-the-art methods across 12 out of 16 evaluated quantitative and binary traits in UK Biobank, with the exception being 4 lipid-related traits. This performance trend is promising and encourages application of MultiPopPred for reliable PRS estimation in low-resource populations with individual-level data for complex omnigenic traits.
bioinformatics2026-03-11v3Generalise or Memorise? Benchmarking Ligand-Conditioned Protein Generation from Sequence-Only Data
Vicente, A.; Dornfeld, L.; Coines, J.; Ferruz, N.Abstract
Proteins can bind small molecules with high specificity. However, designing proteins that bind user-defined ligands remains a challenge, typically relying on structural information and costly experimental iteration. While protein language models (pLMs) have shown promise for unconditional generation and conditioning on coarse functional labels, instance-level conditioning on a specific ligand has not been evaluated using purely textual inputs. Here we frame small-molecule protein binder design as a sequence-to-sequence translation problem and train ligand-conditioned pLMs that map molecular strings to candidate binder sequences. We curate large-scale ligand-protein datasets (>17M ligand-protein pairs) covering different data regimes and train a suite of models, spanning 16 to 700M parameters. Results reveal a consistent trade-off driven by supervision ambiguity: when each ligand is paired with few proteins, models generate near-neighbour, foldable sequences; when each ligand is paired with many proteins, generations are more diverse but less consistently foldable. Our study exposes how annotation diversity and sampling choices elicit this behaviour and how it changes with the data distribution. These insights highlight dataset redundancy and incompleteness as key bottlenecks for sequence-only binder design. We release the curated datasets, trained models, and evaluation tools to support future work on ligand-conditioned protein generation.
bioinformatics2026-03-11v2Automated extraction and optimization of protein purification protocols using multi-agent large language models
Ye, J.; DeRocher, A.; Khim, M.; Subramanian, S.; Cron, L.; Myler, P. J.; Phan, I. Q.Abstract
Recent advances in Large Language Models (LLMs) present new opportunities for automating critical bottlenecks in scientific workflows such as literature reviews or protocol design. One such bottleneck is the purification of recombinant proteins, a vital aspect of biomedical research that frequently fails. To improve success rates, researchers must manually define optimal large-scale purification conditions and establish robust rescue protocols for proteins with low stability or solubility -- a time-intensive process. To address this gap, we introduce a multi-agent LLM system that automates the creation and optimization of protein purification protocols to facilitate the production of high-concentration, high-purity protein samples. Our application streamlines the labor-intensive manual process of sequence similarity searches, literature reviews, and protocol comparison. Operating in a tool-like constrained workflow, the system identifies analogous proteins, leverages specialized LLM agents to extract successful purification methodologies from primary source literature, and cross-references them against failed protocols to generate optimization recommendations. Evaluation on a select number of targets demonstrated high accuracy in protocol extraction and the generation of scientifically sound, expert-validated optimization recommendations. While this system reduces complex analysis time from hours to minutes, we identify the lack of programmatic open access to literature, specifically primary citations in the Protein Data Bank, as a fundamental limitation to LLM agent-based scientific workflows. Ultimately, this system demonstrates the feasibility of using LLM agents to streamline wet-lab workflows while preserving methodological transparency and reproducibility.
bioinformatics2026-03-11v1Beyond Binding Affinity: The Kinetic-Compatibility Hypothesis for Nipah Virus Neutralization
Bozkurt, C.Abstract
Nipah virus (40-75% fatality) has no approved treatments. Its highly dynamic fusion (F) protein presents a severe challenge for static binder design. We analyzed 1,194 validated computational binders, focusing on 22 functionally tested candidates (8 neutralizers, 14 non-neutralizers) to identify features associated with live-virus neutralization. We initially hypothesized that maximizing binding affinity would be the primary driver of success. However, we observed an affinity-neutralization mismatch: higher static affinity did not stratify neutralizers from non-neutralizers, and ultra-tight static affinity did not correlate with functional success. We found that successful neutralizers were instead enriched for specific architectural patterns, including computational structural flexibility and terminal sequence motifs. These findings motivate a "Kinetic Compatibility Hypothesis," suggesting that neutralization may require a state-dependent, multi-feature profile rather than maximum affinity alone. Furthermore, we report exploratory developability associations, such as a 0.48-0.55 amyloid propensity "sweet spot" and secondary structure constraints, specific to the 15 kDa miniprotein scaffolds in this dataset. This 10-point framework integrates empirical sequence data with Orbion's Astra ML model suite predictions to propose an exploratory lead-triage heuristic, though it does not yet definitively prove mechanism.
bioinformatics2026-03-11v1HAETAE: A highly accurate and efficient epigenome transformer for tissue-specific histone modification prediction
Park, S.-J.; Im, S.-H.; Kim, S.-Y.; Kim, J.-Y.Abstract
While genomic models trained on four bases often fail to capture cell-type specificity, we introduce HAETAE, which integrates 5-methylcytosine from long-read sequencing into a 5-base framework. By explicitly modeling epigenetic context, HAETAE achieves state-of-the-art accuracy (>0.95) with orders of magnitude fewer parameters, challenging the prevailing scaling-law paradigm. Furthermore, HAETAE deciphers tissue-specific regulatory logic, as demonstrated by revealing the distinct, context-dependent functional impact of the TERT promoter mutation across diverse tissues.
bioinformatics2026-03-11v1Modularity, ecology, and theoretical evolution of the ribozyme body plan
Bachelet, I.Abstract
Ribozymes are relics of molecular life forms from the primitive earth that are embedded within modern genomes across all kingdoms of life. Despite significant knowledge from decades of bioinformatic and biochemical research, a gap remains in our understanding of the world in which ribozymes existed, their interactions, ecology, and possibly also evolution. The present study proposes a new theoretical basis for understanding these aspects of ribozyme biology by adopting a zoological frame of thought. Seven families of small self-cleaving ribozymes are each mapped to a primitive marine animal analog based on topological architecture, and classified into body plan grades paralleling cnidarian, ctenophore, and bilaterian organization. A formal notation describing ribozyme regions as bodies, cavities, and limbs enables systematic comparison with animal body plans and highlights reusability of parts across ribozyme groups, in turn enabling the construction of a connectivity network and a putative body plan-based evolutionary ordering. This ordering of body plans identifies systematic gaps corresponding to undiscovered ribozyme forms, one of which, a planktonic form of hammerhead, was bioinformatically found in 16.2% of all hammerhead sequences. Computational cross-cleavage analysis across all 49 pairwise interactions (including conspecific) suggests that the hammerhead was a generalist apex predator in the RNA world, while the hatchet was a vulnerable, filter-feeding or scavenger prey species. Conspecific analysis suggests that cannibalism was also a prevalent feeding strategy. Evolutionary avoidance signatures suggest ancient predator-prey coevolution. This theory emphasizes behavior, modularity, and ecological interactions as primary drivers of early ribozyme evolution, offering a new pathway for inferring ancient RNA forms independent of sequence-first assumptions.
bioinformatics2026-03-11v1Unsupervised identification of low-frequency antigen-specific TCRs using distance-based anomaly scoring
Kinoshita, K.; Kobayashi, T. J.Abstract
Identifying antigen-specific T cell receptors (TCRs) within the diverse human repertoire remains challenging due to their extremely low frequencies, often as rare as one per million cells. Here, we propose a novel unsupervised approach that detects low-frequency antigen-specific TCRs through distance-based anomaly detection in TCR sequence space. Our method is based on the observation that antigen-specific TCRs preferentially localize at the periphery of V gene clusters rather than cluster centers. Using TCRdist3 to quantify sequence distances, we identify query TCRs that are anomalous compared to reference repertoires within their V-J gene combinations. We validated this approach across three immunological contexts: COVID-19 infection, influenza vaccination, and yellow fever vaccination. For SARS-CoV-2-specific TCR detection in a COVID-19 patient, our method demonstrated 34.3% accuracy, significantly outperforming similarity-based (ALICE: 8.0%) and frequency-based methods (edgeR: 5.8%, the Pogorelyy method: 6.3%), and uniquely detected low-frequency antigen-specific TCRs at clone count one. The minimal overlap with conventional approaches (≤6.7%) indicates our method captures distinct TCR clones overlooked by existing analyses. This spatial distribution-based paradigm provides a complementary strategy for TCR specificity detection, particularly valuable for identifying rare antigen-specific clones essential for understanding immune responses.
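As an illustration of distance-based anomaly scoring, the sketch below scores a query sequence by its mean distance to its k nearest reference sequences. It is a minimal example under stated assumptions, not the authors' method: plain Levenshtein distance on CDR3-like strings stands in for TCRdist3, the function names and toy sequences are hypothetical, and no stratification by V-J gene combination is performed.

```python
from typing import Callable, Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (assumed stand-in for a TCR-specific metric)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def knn_anomaly_score(query: str,
                      reference: Sequence[str],
                      k: int = 3,
                      dist: Callable[[str, str], int] = edit_distance) -> float:
    """Mean distance from `query` to its k nearest reference sequences.

    Higher scores mean the query sits further from the reference repertoire.
    """
    if not reference:
        raise ValueError("reference repertoire must not be empty")
    nearest = sorted(dist(query, r) for r in reference)[:k]
    return sum(nearest) / len(nearest)


# Toy, invented CDR3-like strings used only to exercise the functions.
reference = ["CASSLGQETQYF", "CASSLGQDTQYF", "CASSLGGETQYF"]
central_score = knn_anomaly_score("CASSLGQETQYF", reference)
outlier_score = knn_anomaly_score("CAWSVRDNEQFF", reference)
```

In this toy setting the sequence that differs from the reference at many positions receives the larger score. A threshold on the score would have to be calibrated against real reference repertoires.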
bioinformatics2026-03-11v1MESSI: Multimodal Experiments with SyStematic Interrogation using nextflow
Liang, C.; Grewal, T.; Singh, A.; Singh, A.Abstract
Background: Multimodal biomedical studies increasingly profile multiple molecular and clinical modalities from the same samples, creating new opportunities for disease prediction and biological discovery. However, benchmarking multimodal integration methods remains difficult because studies often use inconsistent preprocessing, unequal tuning strategies, and non-comparable evaluation schemes, limiting fair assessment across methods. Results: We developed MESSI (Multimodal Experiments with SyStematic Interrogation), a reproducible Nextflow-based benchmarking framework for multimodal outcome prediction that standardizes data preparation, supports interoperable R and Python workflows, and enforces leakage-free nested cross-validation for model selection and model assessment. MESSI currently implements representative intermediate- and late-integration methods and supports bulk multiomics, bulk multimodal, and single-cell multiomics datasets. In simulation studies with known ground truth, most methods were well calibrated in the absence of signal and achieved high performance under strong signal, whereas differences emerged under weaker signal and in feature recovery. We then applied MESSI to 19 real datasets spanning cancer, neurodevelopmental, neurodegenerative, infectious, renal, transplant, and metastatic disease settings, with diverse modality combinations including transcriptomic, epigenomic, proteomic, imaging, electrical, clinical, and single-cell-derived features. Across bulk multimodal datasets, classification differences were generally modest, although DIABLO and multiview cooperative learning tended to rank highest, while MOFA+glmnet and MOGONET were weaker overall. Biological enrichment analyses revealed clearer differences: DIABLO, RGCCA, MOFA, and IntegrAO more consistently recovered significant Reactome, oncogenic, and tissue-relevant gene signatures. 
In single-cell multiomics benchmarks, method rankings were more dataset dependent, but DIABLO performed consistently well across all case studies, while RGCCA also showed strong performance in specific settings. Computational analyses further showed that DIABLO and MOFA had the most favorable runtime and memory profiles, whereas multiview was the most time-intensive and IntegrAO the most memory-demanding. Conclusions: MESSI provides a reproducible, extensible, and equitable framework for benchmarking multimodal integration methods under a common model assessment strategy. Our results indicate that no single method is uniformly optimal across datasets and objectives; instead, method choice should balance predictive performance, biological interpretability, and computational efficiency. MESSI establishes a foundation for transparent benchmarking and future extensions to broader multimodal learning tasks.
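The nested cross-validation scheme mentioned above can be sketched as index bookkeeping: hyperparameters are tuned only on inner folds carved out of each outer training set, so the outer test fold never influences model selection. This is a hedged illustration in plain Python, not MESSI's Nextflow implementation. The function names are hypothetical, and the interleaved fold assignment is an assumption made for brevity (real pipelines would typically shuffle and stratify).

```python
from typing import Iterator, List, Tuple

Split = Tuple[List[int], List[int]]


def k_fold(indices: List[int], k: int) -> List[List[int]]:
    """Partition `indices` into k interleaved folds (no shuffling)."""
    return [indices[i::k] for i in range(k)]


def nested_cv_splits(n: int, outer_k: int, inner_k: int
                     ) -> Iterator[Tuple[List[int], List[int], List[Split]]]:
    """Yield (outer_train, outer_test, inner_splits) for n samples.

    Each inner split is (inner_train, inner_val), drawn from outer_train only.
    """
    all_idx = list(range(n))
    for outer_test in k_fold(all_idx, outer_k):
        held_out = set(outer_test)
        outer_train = [i for i in all_idx if i not in held_out]
        inner_splits = []
        for inner_val in k_fold(outer_train, inner_k):
            val = set(inner_val)
            inner_train = [i for i in outer_train if i not in val]
            inner_splits.append((inner_train, inner_val))
        yield outer_train, outer_test, inner_splits


splits = list(nested_cv_splits(n=12, outer_k=3, inner_k=2))
```

Model selection would loop over `inner_splits`, refit the chosen configuration on `outer_train`, and report performance on `outer_test` only.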
bioinformatics2026-03-11v1CESAR: High-Sensitivity Detection of Copy Number Variations in ctDNA Using Segmentation and Anchor Recalibration
Ni, S.; Kan, K.; Wang, L.; Wu, N.; Jiang, X.Abstract
Background: Detecting copy number variations (CNVs) in circulating tumor DNA (ctDNA) is crucial for the companion diagnosis and resistance monitoring of various solid tumors (e.g., NSCLC, Glioblastoma). However, when tumor-derived DNA fractions are extremely low (often <1%), traditional depth-based methods frequently fail due to non-linear sequencing depth fluctuations and probe-specific capture biases inherent to targeted Next-Generation Sequencing (NGS). Methods: We developed CESAR (CNV Estimation with Segmentation and Anchor Recalibration), a novel computational tool optimized for ultra-sensitive, tumor-only CNV detection in targeted NGS panels. CESAR utilizes Circular Binary Segmentation (CBS) to re-partition target regions based on relative capture efficiency. It then introduces a dynamic "anchor" selection algorithm that identifies a personalized set of genomic segments mirroring the non-linear coverage behavior of each target gene. By minimizing the Coefficient of Variation (CV) through iterative anchor selection, CESAR effectively recalibrates the baseline to suppress technical noise. Results: Validation using standard DNA reference materials demonstrated that CESAR successfully identified both amplifications (e.g., MET, ERBB2, EGFR) and relative copy number deletions at ultra-low tumor fractions. Notably, CESAR achieved stable detection of focal alterations as subtle as 2.18 copies (a mere 1.09x fold change relative to the diploid baseline), while maintaining zero false positives in control regions. Evaluation across distinct clinical biofluids, 36 clinical plasma samples and 41 glioma cerebrospinal fluid (CSF) samples, identified critical, previously undetected CNV events, including subtle ERBB2 gains and distinct MET deletions. 
Furthermore, comprehensive benchmarking revealed that CESAR consistently outperformed the widely used CNVkit, particularly in suppressing technical variance and resolving ultra-low-level copy number gains that CNVkit failed to distinguish from background noise. Conclusions: CESAR provides a highly stable and sensitive algorithmic framework for tumor-only CNV calling in liquid biopsies, facilitating precise therapeutic decision-making in precision oncology.
bioinformatics2026-03-11v1Pairing Data Independent Acquisition and High-Resolution Full Scan for Fast Urinary Tract Infection Diagnosis
Coyle, E.; Lacombe-Rastoll, A.; Roux-Dalvai, F.; Leclercq, M.; Bories, P.; Berube, E.; Gotti, C.; Bekker-Jensen, D.; Bache, N.; Isabel, S.; Droit, A.Abstract
Background: Rapid and accurate identification of urinary tract infection (UTI) pathogens is critical for effective treatment and combating antimicrobial resistance. Conventional culture-based diagnostics are slow, and standard tandem mass spectrometry workflows are resource-intensive. Methods: We present a proof-of-concept workflow that integrates high-resolution data-independent acquisition (DIA) MS/MS on the Thermo Scientific Orbitrap Astral with MS1-only spectra from the Orbitrap Exploris 480. DIA data establish a reference panel of pathogen-specific peptides, which are then identified in MS1 spectra from urine samples. Machine learning models trained on these matched MS1 features were used to classify eight common uropathogens and non-infected controls across synthetic inoculations, pure cultures, and clinical patient samples. Results: The approach accurately distinguished bacterial species in both controlled inoculated samples and clinical patient samples, achieving a Matthews Correlation Coefficient (MCC) of 0.924 on held-out test data and 0.77 on patient samples. Conclusions: This proof-of-concept demonstrates that pairing DIA-derived peptide panels with MS1-only data, acquired on a cost-effective instrument suitable for routine analysis, enables rapid, culture-free identification of UTI pathogens. The method provides a scalable, high-throughput platform suitable for clinical applications and establishes a foundation for broader biomarker discovery and potential quantitative workflows.
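For readers unfamiliar with the metric reported above, the following sketch computes the multiclass Matthews Correlation Coefficient from paired label lists using the standard generalisation of the binary formula. It is illustrative only: the function name and the toy labels are hypothetical and are not taken from the study's data or code.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def multiclass_mcc(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Multiclass MCC; returns 0.0 when the denominator is zero."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    s = len(y_true)                                     # total samples
    c = sum(t == p for t, p in zip(y_true, y_pred))     # correctly classified
    t_counts = Counter(y_true)                          # true count per class
    p_counts = Counter(y_pred)                          # predicted count per class
    labels = set(t_counts) | set(p_counts)
    cov_tp = c * s - sum(t_counts[k] * p_counts[k] for k in labels)
    cov_tt = s * s - sum(t_counts[k] ** 2 for k in labels)
    cov_pp = s * s - sum(p_counts[k] ** 2 for k in labels)
    denom = math.sqrt(cov_tt * cov_pp)
    return cov_tp / denom if denom else 0.0


# Toy labels: 1 = infected, 0 = control (invented for illustration).
toy_true = [1, 1, 1, 0, 0, 0]
toy_pred = [1, 1, 0, 0, 0, 1]
toy_mcc = multiclass_mcc(toy_true, toy_pred)
```

Unlike raw accuracy, the coefficient stays informative when class sizes are unbalanced, which is one reason it is often preferred for multi-pathogen classification.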
bioinformatics2026-03-11v1Making Biorisk Measurable: A Bayesian Framework for Laboratory Risk Management
Prodanov, D.Abstract
Biosafety risk assessment traditionally relies on categorical scales embodied by the four WHO Risk Groups and biocontainment levels. Mapping such categories to quantitative metrics is an open problem for the field: the classifications are too coarse for operational decision-making, yet strictly probabilistic language remains inaccessible to most safety professionals, laboratory managers, and decision-makers. To bridge these gaps, the present work develops a quantitative Bayesian framework for laboratory risk management that combines WHO Risk Group classification as a prior with a Markov chain model of the incident--disaster escalation chain. Risk is reported on a log-risk scale that transforms multiplicative probabilities into additive quantities, mirroring the decibel scale in acoustics. The framework accommodates longitudinal updating with local incident data and quantifies the separate contributions of training, preventive maintenance, and inspection to system-level safety. Resource allocation recommendations are derived that complement existing compliance frameworks with auditable, evidence-based prioritisation. The framework is illustrated on synthetic BSL-3 scenarios and shifts the perspective of biorisk governance from static compliance assessment to dynamic risk and resource management.
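The additive property of a log-risk scale can be illustrated with a short sketch. It assumes a decibel-style definition, 10·log10(p), chosen here by analogy with acoustics; the abstract does not give the exact scaling, so the constant, the function names, and the example probabilities are all assumptions. The stages of the escalation chain are treated as conditional probabilities that multiply.

```python
import math
from typing import Iterable


def log_risk_db(p: float) -> float:
    """Decibel-style log-risk of a probability p in (0, 1] (assumed scaling)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return 10.0 * math.log10(p)


def chain_log_risk_db(stage_probs: Iterable[float]) -> float:
    """Log-risk of a chain of conditional stages: the per-stage values add."""
    return sum(log_risk_db(p) for p in stage_probs)


# Hypothetical chain: incident, then escalation given incident, then release.
stages = [0.1, 0.01, 0.5]
total_db = chain_log_risk_db(stages)
direct_db = log_risk_db(0.1 * 0.01 * 0.5)
```

On this scale a control measure that halves one stage probability lowers the total by about 3 units regardless of the other stages, which is what makes separate contributions easy to compare.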
bioinformatics2026-03-11v1Rational in silico discovery and serological validation of Trypanosoma cruzi-specific B-cell epitopes for high-precision Chagas disease diagnosis
Candia Puma, M. A.; Goyzueta Mamani, L. D.; Barazorda Ccahuana, H. L.; S B Camara, R.; A.G. Pereira, I.; L Silva, A.; M Rodrigues, M.; P N Assis, B.; Chaves, A. T.; A V A Correa, L.; O da Costa Rocha, M.; U Goncalves, D.; Maia Goncalves, A. A.; B de Moura, A.; Galdino, A.; Machado de Avila, R.; Cordeiro Giunchetti, R.; Ferraz Coelho, E. A.; Chavez Fumagalli, M. A.Abstract
Chagas disease is caused by the parasite Trypanosoma cruzi and remains a neglected tropical disease presenting a substantial global health burden. Crude antigen-based assays have historically been limited in specificity; however, even contemporary recombinant-antigen tests may exhibit residual cross-reactivity, depending on antigen composition and geographic context. To overcome this limitation, this study developed a novel diagnostic strategy that integrates computational and experimental approaches to identify specific linear B-cell epitopes within the T. cruzi proteome. The strategy was developed to exclude sequences homologous to H. sapiens and Leishmania spp. proteins, thereby minimizing potential cross-reactivity. Using a consensus approach across five prediction algorithms, B-cell epitopes were identified and subsequently clustered to reveal conserved, immunoreactive consensus sequences. The peptide sequences were characterized for optimal physicochemical properties and subsequently modeled to interact with a human antibody using protein-peptide docking and molecular dynamics simulations to assess complex stability. The most promising candidates were chemically synthesized and validated using ELISA against a cohort comprising Chagas disease patients (chronic indeterminate and cardiac forms), healthy donors, and a cross-reactive control group (visceral and tegumentary leishmaniasis and leprosy). From the initial set of 19,245 proteins, the multi-tiered bioinformatic analysis identified 4,431 unique, non-homologous sequences. Consensus prediction yielded 401 high-confidence epitopes, which were refined to 179 structurally stable candidates. Computational analyses identified five top-ranking epitopes capable of forming high-affinity, stable complexes with a human antibody. Experimental validation confirmed the high diagnostic accuracy of two epitopes: Epitope 4 and Epitope 5 each achieved 100% sensitivity.
Notably, Epitope 5 exhibited superior specificity, reaching 96.67% against healthy controls and 90.91% against the cross-reactive group. This study establishes a basis for the development of an improved immunoassay for Chagas disease and provides a reproducible framework for targeted epitope discovery. Consequently, this study validates a high-precision computational pipeline capable of discovering T. cruzi-specific antigens that effectively circumvent cross-reactivity with Leishmania spp., proposing Epitope 5 as a qualified candidate for reliable serological diagnosis in co-endemic regions.
bioinformatics2026-03-11v1BICEP: an extension to indels and copy number variants for rare variant prioritisation in pedigree analysis
Ormond, C.; Ryan, N. M.; Corvin, A.; Heron, E. A.Abstract
Summary: BICEP is a Bayesian inference model that evaluates how likely a rare variant is to be causal for a genomic trait in pedigree-based analyses. The original prior model in BICEP was designed for single nucleotide variants only. Here, we have developed an extension of the prior models for more comprehensive genomic analysis to include indels and copy number variants. We benchmark the performance of these new priors and show comparable performance accuracy with the existing single nucleotide variant prior model. For copy number variants we evaluate four different input predictors to the models and recommend the best performing ones as the default. Availability and implementation: the updated prior models have been implemented in the current version of BICEP available from: https://github.com/cathaloruaidh/BICEP.
bioinformatics2026-03-11v1Cell DiffErential Expression by Pooling (CellDEEP) highlights issues in differential gene expression in scRNA-seq
Cheng, Y.; Kettlewell, T.; Laidlaw, R. F.; Hardy, O. M.; McCluskey, A.; Otto, T. D.; Somma, D.Abstract
Accurate identification of differentially expressed genes (DEGs) in single-cell RNA sequencing (scRNA-seq) data remains challenging. Single-cell-specific statistical models often report large numbers of candidate genes but can exhibit inflated false positive rates, whereas pseudobulk approaches improve false discovery control at the cost of reduced sensitivity. To reduce the noise and bias that affect other tools, and to give the user more control over the DEG analysis, we present CellDEEP, which uses a cell aggregation (metacell) approach. This tool provides a framework for flexible selection of pooling strategies and parameterisation for differential expression (DE) analysis. Benchmarking on simulated and real datasets, including COVID-19 and rheumatoid arthritis, shows that CellDEEP often outperforms other methods, consistently reduces false positives compared to single-cell methods, and recovers more true positives than pseudobulk methods. Our work shifts the focus from selecting a single "best" method to an approach that reduces cell-level noise while preserving biological signal, together with a transparent validation framework, advancing more reliable differential-expression analysis in single-cell transcriptomics.
bioinformatics2026-03-11v1FishMamba-1: A Linear-Complexity Foundation Model for Deciphering Polyploid Cyprinid Genomes
Lu, S.; Fang, C.; Wang, C.; Qian, Y.; Fang, W.; Li, T.; Zeng, H.; He, S.Abstract
The Cypriniformes order, comprising essential aquaculture species like carps and minnows, presents unique genomic challenges due to complex whole-genome duplication (WGD) events and extensive repetitive elements. Conventional annotation tools and Transformer-based foundation models often struggle to capture long-range dependencies in these expanded genomes due to quadratic computational complexity. Here, we introduce FishMamba, the first genomic foundation model tailored for the aquatic clade, built upon the selective state-space model (SSM) architecture. By leveraging Mamba-2's linear scaling efficiency, FishMamba processes context windows of 32,768 base pairs (32k), significantly surpassing the 4-6k limit of standard DNA Transformers and enabling the modeling of distal regulatory patterns on a single GPU. We curated Cypri-24, a comprehensive dataset comprising 28.8 Gb of high-quality genome assemblies from 24 representative species, to pre-train FishMamba on 15 billion tokens. Subsequent fine-tuning for genome segmentation (FishSegmenter) demonstrates the model's capability to annotate gene structures at single-nucleotide resolution with remarkable precision. Evaluation on a held-out test set reveals that FishMamba achieves a precision of 64.6% in exon identification, effectively distinguishing coding regions from the vast non-coding background without relying on RNA-seq evidence. Furthermore, interpretability analysis confirms that the model captures biological syntax such as splice acceptor motifs. FishMamba provides a scalable, open-source framework for decoding the complex genomes of non-model organisms and a computational resource to support downstream applications in molecular breeding and ecological monitoring. The complete source code, pre-trained model weights, and datasets are freely available at https://github.com/lu1000001/FishMamba.
Additionally, the FishMamba Hub, a web-based inference platform, is accessible at https://huggingface.co/spaces/lu1000001/FishMamba-Hub to facilitate real-time genomic segmentation for the aquatic research community.
bioinformatics2026-03-11v1The Genomic Legacy of Ancient Polyploidy in Crop Domestication
McKibben, M. T. W.; Barker, M. S.Abstract
Species that have an ancestry of whole-genome duplications (WGDs) are more likely to be domesticated, but the underlying mechanisms remain unclear. We tested whether paleologs--genes duplicated during ancient WGDs--are enriched in candidate domestication lists across 22 crop species. Paleologs were significantly enriched in 14 species, with single-copy paleologs showing the most consistent overrepresentation. This finding provides the first empirical test of an assumption in plant genome evolution: models based on retention patterns inferred that genes rapidly returning to single-copy status are under strong purifying selection, potentially limiting their adaptive potential. We find instead that constraint on copy number does not appear to preclude selection on gene function. Several non-mutually exclusive processes could explain this pattern, including accumulated genetic diversity becoming available upon return to single-copy, selection to maintain essential functions, and greater selection efficiency on unmasked loci. Ancient WGDs thus provide a persistent genomic substrate for crop evolution millions of years later.
bioinformatics2026-03-11v1MSstatsResponse: Semi-parametric statistical model enhances detection of drug-protein interactions in chemoproteomics experiments
Szvetecz, S.; Kohler, D.; Federspiel, J.; Field, D. S.; Jean-Beltran, P.; Seward, R. J.; Suh, H.; Xue, L.; Vitek, O.Abstract
Chemoproteomics is a popular approach for the identification of small molecule-protein interactions in biological systems. Several chemoproteomics workflows leverage functionalized chemical probes and mass spectrometry to measure protein engagement through direct protein enrichment or competition using a range of small molecule concentrations. Statistical methods for analysis of such dose-response chemoproteomics datasets are limited. For example, existing methods rely on fixed curve shapes and are sensitive to experimental variation, particularly when the number of doses or replicates is limited. Here, we present MSstatsResponse, a semi-parametric statistical framework for analyzing chemoproteomic dose-response experiments that uses isotonic regression that does not require a fixed curve shape. This approach improves the accuracy and robustness of curve fitting, target identification, and half-response estimation across diverse experimental designs. We evaluate MSstatsResponse by generating a benchmark chemoproteomic dataset that profiled the competition between the kinase-binding probe XO44 and the drug Dasatinib using three mass spectrometry acquisition strategies: data-independent acquisition, tandem mass tag-based data-dependent acquisition, and selected reaction monitoring. We further evaluate the method on simulated datasets that vary the number of doses, number of replicates, and levels of noise, and demonstrate that MSstatsResponse consistently improves sensitivity, specificity, and reproducibility compared to existing methods, particularly in low-replicate and low-dose settings. MSstatsResponse is implemented as an open-source R/Bioconductor package that integrates with the MSstats ecosystem for quantitative proteomics. It provides a unified workflow for preprocessing, curve fitting, target identification, and experimental design, enabling researchers to select the number of doses and replicates appropriate to their experimental goals. 
The software and documentation are freely available at https://bioconductor.org/packages/MSstatsResponse.
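To make the idea of shape-free curve fitting concrete, the sketch below fits a monotone dose-response curve with the pool-adjacent-violators algorithm, the standard solver for isotonic regression. It is a minimal pure-Python illustration, not the MSstatsResponse implementation (which is an R/Bioconductor package): the function names are hypothetical, the intensities are invented, and a non-increasing response is assumed because probe signal is expected to drop as the competing drug dose rises.

```python
from typing import List, Optional, Sequence


def pava_nondecreasing(y: Sequence[float],
                       w: Optional[Sequence[float]] = None) -> List[float]:
    """Weighted least-squares non-decreasing fit via pool-adjacent-violators."""
    weights = [1.0] * len(y) if w is None else list(w)
    blocks: List[List[float]] = []  # each block: [mean, total weight, size]
    for yi, wi in zip(y, weights):
        blocks.append([float(yi), float(wi), 1])
        # Merge backwards while the monotonicity constraint is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, c1 + c2])
    fitted: List[float] = []
    for mean, _, size in blocks:
        fitted.extend([mean] * int(size))
    return fitted


def isotonic_decreasing(y: Sequence[float]) -> List[float]:
    """Non-increasing fit, obtained by negating, fitting, and negating back."""
    return [-v for v in pava_nondecreasing([-v for v in y])]


# Invented relative intensities at increasing doses (ordered by dose).
observed = [1.0, 0.9, 0.95, 0.4, 0.1]
fitted_curve = isotonic_decreasing(observed)
```

The fit only pools the two neighbouring points that break monotonicity and leaves the rest untouched, so no sigmoid or other parametric shape is imposed on the data.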
bioinformatics2026-03-11v1Landscape of 8q24.3-Encoded microRNAs and Their Prognostic Impact in Ovarian Cancer
Filipek, K.; Merelli, I.; Chiappori, F.; Penzo, M.Abstract
Ovarian cancer is the most lethal gynecological malignancy, largely because of late diagnosis and marked genomic instability, with high-grade serous ovarian cancer (HGSOC) representing its most common and aggressive subtype. Amplification of chromosome 8q24.3 is a recurrent event in HGSOC, yet the regulation and clinical relevance of the non-coding RNA output from this locus remain poorly defined. Here, we performed an integrative analysis of 8q24.3-encoded miRNAs in ovarian cancer using copy-number, transcriptomic, isoform-resolved, and clinical data from TCGA and NCBI datasets. We identified pronounced heterogeneity in miRNA abundance and strand usage across this locus. Copy-number gain broadly associated with increased miRNA expression, although this effect was not uniform across all candidates. Intronic miRNAs showed variable coupling with their host genes, indicating that mature miRNA output is shaped by both genomic dosage and post-transcriptional regulation. Isoform-level analysis revealed marked strand asymmetry and regulatory complexity, but did not strengthen copy-number or histotype associations compared with total miRNA measurements. Clinically, higher expression of miR-937, miR-4664, and miR-6849 was associated with improved overall survival in HGSOC. Functional enrichment of validated targets highlighted pathways related to cellular stress responses, senescence, p53 signaling, endocytosis, and metabolic adaptation. Together, these findings define 8q24.3 as a heterogeneous non-coding regulatory hub in ovarian cancer and provide a basis for future mechanistic and biomarker studies.
bioinformatics2026-03-11v1SwiftTCR: Efficient Computational Docking protocol of TCRpMHC-I Complexes Using Restricted Rotation Matrices
Parizi, F. M.; Aarts, Y. J. M.; Smit, N.; Roran A R, D.; Diepenbroek, D.; Krösschell, W. A.; Thijs, L.; Tepperik, J.; Eerden, S.; Marzella, D. F.; Ramakrishnan, G.; Xue, L. C.Abstract
The T cell's ability to discern self and non-self depends on its T cell receptor (TCR), which recognizes peptides presented by MHC molecules. Understanding this TCR-peptide-MHC (TCRpMHC) interaction is important for cancer immunotherapy design, tissue transplantation, pathogen identification, and autoimmune disease treatments. Understanding the intricacies of TCR recognition, encapsulated in TCRpMHC structures, remains challenging due to the immense diversity of TCRs (>10^8/individual), rendering experimental determination and general-purpose computational docking impractical. Addressing this gap, we have developed a rapid integrative modeling protocol leveraging unique docking patterns in TCRpMHC complexes. Built upon PIPER, our pipeline significantly cuts down FFT rotation sets, exploiting the consistent polarized docking angle of TCRs at pMHC. Additionally, our ultra-fast structure superimposition tool, GradPose, accelerates clustering. It models a case in 3-4 minutes on 12 CPUs, showcasing a speedup of up to 25-40 times compared to the ClusPro webserver. On a benchmark set of 38 TCRpMHC class I (TCRpMHC-I) complexes, our protocol outperforms the state-of-the-art docking tools in model quality. This protocol can potentially provide structural information to TCR repertoires targeting specific peptides. Its computational efficiency can also enrich existing pMHC-specific single-cell sequencing TCR data, facilitating the development of structure-based deep learning (DL) algorithms. These insights are essential for understanding T cell recognition and specificity, advancing the development of therapeutic interventions.
bioinformatics2026-03-10v3FASTiso: Fast Algorithm on Search state Tree for subgraph ISOmorphism in graphs of any size and density
Agbeto, W.; Coti, C.; Reinharz, V.Abstract
Subgraph isomorphism is a fundamental combinatorial problem that involves finding one or more occurrences of a pattern graph within a target graph. It arises in a wide range of application domains, including biology, chemistry, social network analysis, and pattern recognition. Although subgraph isomorphism is NP-complete in the general case, many exact algorithms allow it to be solved in practice on many instances. However, the increasing size and structural diversity of graph datasets continue to pose significant challenges in terms of robustness and scalability. In this article, we propose FASTiso, an exact subgraph isomorphism algorithm that emphasizes a strong consistency between the variable ordering strategy and the pruning rules used during search. This design enables a unified exploitation of structural information throughout the exploration process, leading to improved efficiency and stable performance across heterogeneous graph structures. An extensive experimental evaluation on widely used synthetic and real-world benchmarks shows that FASTiso consistently outperforms reference solvers such as VF3, VF3L, and RI, and achieves competitive performance compared to constraint programming-based approaches (Glasgow, PathLad+), while outperforming them on most datasets. The results further demonstrate that FASTiso remains highly efficient on small instances and scales well to large graphs, while maintaining a lower memory footprint than most evaluated solvers. The peak memory usage is 7.74 GB for FASTiso, 36.19 GB for PathLad+, over 500 GB for Glasgow, 9.62 GB for VF3/VF3L, and 4.31 GB for RI. FASTiso code is available at https://gitlab.info.uqam.ca/cbe/fastiso as a C++ implementation, a Python module, and an integration within an extended version of NetworkX. The implementations support simple graphs and multigraphs, directed or undirected, with labels on nodes, edges, or both.
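To make the roles of variable ordering and pruning concrete, here is a minimal Python sketch of a generic backtracking search for subgraph matching. It is not the FASTiso algorithm: the degree-based ordering, the two pruning checks, and the name `find_subgraph_match` are illustrative assumptions, and the sketch looks for non-induced matches (pattern edges must exist in the target, while extra target edges are allowed) in undirected, unlabeled graphs.

```python
def find_subgraph_match(pattern, target):
    """Return one injective mapping pattern node -> target node, or None.

    Graphs are undirected adjacency dicts: {node: set(neighbors)}.
    Every pattern edge must map onto a target edge (non-induced matching).
    """
    # Variable ordering: try high-degree pattern nodes first.
    order = sorted(pattern, key=lambda u: -len(pattern[u]))
    mapping = {}
    used = set()

    def extend(i):
        if i == len(order):
            return True
        u = order[i]
        for v in target:
            if v in used:
                continue
            # Pruning: v needs at least as many neighbors as u.
            if len(target[v]) < len(pattern[u]):
                continue
            # Consistency: mapped neighbors of u must be neighbors of v.
            if any(n in mapping and mapping[n] not in target[v]
                   for n in pattern[u]):
                continue
            mapping[u] = v
            used.add(v)
            if extend(i + 1):
                return True
            # Backtrack and try the next candidate.
            del mapping[u]
            used.discard(v)
        return False

    return dict(mapping) if extend(0) else None


triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
# A 4-cycle 1-2-3-4 with one diagonal 1-3, which contains two triangles.
square_with_diagonal = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2, 4}, 4: {1, 3}}
match = find_subgraph_match(triangle, square_with_diagonal)
```

The ordering and the pruning rules here are chosen independently of each other; the abstract's point is that FASTiso designs the two to be consistent.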
bioinformatics2026-03-10v3Non-consensus flanking sequence of hundreds of base pairs around in vivo binding sites: statistical beacons for transcription factor scanning
Faltejskova, K.; Sulc, J.; Vondrasek, J.Abstract
It has long been suspected that the flanks of binding motifs can play an important role in specific DNA binding by a transcription factor (TF). By a thorough analysis of the DNA sequence in the broad context (±5000 bp) of in vivo binding sites (as identified in a ChIP-seq or a Cut&Tag experiment), we show that the average GC content is in most cases statistically significantly increased around the binding site in a patch spanning 1000-1500 bp. This increase was observed consistently in experiments targeting the same TF in different cell lines. The surroundings of binding sites of certain TFs, such as MYC, display a directional alteration of dinucleotide frequencies. We attempt to explain these preferences by alterations in DNA shape features as well as by potential cooperation with other TFs. We observed differences in sequence affinity to various potential cooperating TFs between cell lines. Altogether, we propose that the observed feature distortion is indicative of a coarse scanning mechanism that helps TFs find the target binding site.
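The kind of measurement behind this result, GC content as a function of distance from a binding site, can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `gc_profile`, the window size, and the toy sequence are assumptions and do not reproduce the authors' pipeline or statistics.

```python
def gc_profile(seq, center, flank=5000, window=100):
    """GC fraction in consecutive windows covering [center - flank, center + flank).

    Returns (offset_from_center, gc_fraction) pairs; the region is clipped
    to the sequence boundaries.
    """
    start = max(0, center - flank)
    end = min(len(seq), center + flank)
    profile = []
    for i in range(start, end, window):
        chunk = seq[i:min(i + window, end)].upper()
        gc = chunk.count("G") + chunk.count("C")
        profile.append((i - center, gc / len(chunk)))
    return profile


# Toy sequence: an AT-only background with a GC-only patch in the middle.
toy = "AT" * 50 + "GC" * 50 + "AT" * 50
profile = gc_profile(toy, center=150, flank=150, window=100)
```

Averaging such profiles over many binding sites, and comparing them with matched control regions, would be needed before drawing any statistical conclusion.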
bioinformatics2026-03-10v3GREmLN: A Cellular Graph Structure Aware Transcriptomics Foundation Model
Zhang, M.; Swamy, V.; Cassius, R.; Dupire, L.; Kanatsoulis, C.; Paull, E.; AlQuraishi, M.; Karaletsos, T.; Califano, A.Abstract
The ever-increasing availability of large-scale single-cell profiles presents an opportunity to develop foundation models to capture cell properties and behavior. However, standard language models such as transformers benefit from sequentially structured data with well-defined absolute or relative positional relationships, while single-cell RNA data have orderless gene features. Molecular-interaction graphs, such as gene regulatory networks (GRN) or protein-protein interaction (PPI) networks, offer graph structure-based models that effectively encode both non-local gene token dependencies and potential causal relationships. We introduce GREmLN (Gene Regulatory Embedding-based Large Neural model), a foundation model that leverages graph signal processing to embed gene token graph structure directly within its attention mechanism, producing biologically informed single-cell-specific gene embeddings. Our model faithfully captures transcriptomics landscapes and achieves superior performance relative to state-of-the-art baselines on cell type annotation, graph structure understanding, and fine-tuned reverse perturbation prediction tasks. It offers a unified and interpretable framework for learning high-capacity foundational representations that capture complex, long-range regulatory dependencies from high-dimensional single-cell transcriptomic data. Moreover, the incorporation of graph-structured inductive biases enables more parameter-efficient architectures and accelerates training convergence.
bioinformatics2026-03-10v3Sassy: Fuzzy Searching DNA Sequences using SIMD
Beeloo, R.; Groot Koerkamp, R.Abstract
Motivation. Approximate string matching (ASM) is the problem of finding all occurrences of a pattern in a text while allowing up to k errors. Many modern methods use seed-chain-extend, which is fast in practice, but does not guarantee finding all matches with ≤ k errors. However, applications such as CRISPR off-target detection require exhaustive results. Methods. We introduce Sassy, a library and tool for ASM of short patterns in long texts. Sassy splits the text into 4 parts that are searched in parallel, and uses bitvectors in the text direction rather than the pattern direction. This has complexity O(k⌈n/W⌉) when searching a random text of length n, where W = 256 is the SIMD width, and provides significant speedups for small k. Separately, we allow matches of the pattern to extend beyond the text for an overhang cost of e.g. 0.5 per character, to find matches near contig or read ends. Results. Sassy is 4x to 15x faster than Edlib for patterns ≤ 1000 bp, and can search text with a throughput near 2 Gbp/s. Likewise, Sassy is over 100x faster than parasail. We apply Sassy to CRISPR off-target detection by searching 61 guide sequences in a human genome. Sassy is 100x faster than SWOffinder and only slightly slower (for k ≤ 3) than CHOPOFF, for which building its index takes 20 minutes. Sassy also scales well to larger k, unlike CHOPOFF whose index took over 10 hours to build for k = 5. Availability. Sassy is available as a library and binary at https://github.com/RagnarGrootKoerkamp/sassy, and archived at swh:1:dir:e884758dce5777a441bc2799dc8824e563c5f97b.
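For readers unfamiliar with the ASM problem itself, the following Python sketch implements the textbook dynamic-programming solution (often attributed to Sellers), which reports every text position where the pattern ends with at most k edits. It is deliberately not Sassy's method: it runs in O(nm) time for a pattern of length m, uses no bitvectors or SIMD, and does not model the overhang cost. The function name `search_with_errors` is a hypothetical choice.

```python
def search_with_errors(pattern, text, k):
    """Return (end_position, edit_distance) for every text position where
    the pattern matches a substring ending there with at most k edits."""
    m = len(pattern)
    # col[i] = fewest edits aligning pattern[:i] to some substring of the
    # text that ends at the current position.
    col = list(range(m + 1))
    hits = []
    for j, c in enumerate(text):
        prev_diag = col[0]  # col[i - 1] as it was at the previous position
        col[0] = 0          # a match may start anywhere in the text
        for i in range(1, m + 1):
            cur = col[i]
            col[i] = min(prev_diag + (pattern[i - 1] != c),  # match or substitution
                         col[i] + 1,                          # extra text character
                         col[i - 1] + 1)                      # missing pattern character
            prev_diag = cur
        if col[m] <= k:
            hits.append((j + 1, col[m]))
    return hits


exact = search_with_errors("ACGT", "TTACGTTT", k=0)
fuzzy = search_with_errors("ACGT", "TTACCTTT", k=1)
```

Because every text position is examined, the search is exhaustive, which is the guarantee the abstract contrasts with seed-chain-extend heuristics.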
bioinformatics2026-03-10v2Computed atlas of the human GPCR-G protein signaling complexes
Miglionico, P.; Matic, M.; Franchini, L.; Arai, H.; Nemati Fard, L. A.; Arora, C.; Gherghinescu, M.; DeOliveira Rosa, N.; Ryoji, K.; Gutkind, J. S.; Orlandi, C.; Inoue, A.; Raimondi, F.Abstract
Experimental mapping of G protein-coupled receptors (GPCR)-G protein signaling coupling has illuminated hundreds of receptors, yet the coupling specificity of a large fraction of this large receptor family remains unknown, thereby preventing the development of new GPCR-targeting therapies. Here, we used AlphaFold3 (AF3) to predict the 3D structures of the human GPCRome in complex with heterotrimeric G proteins. We used experimental GPCR-G protein binding data to show that AF3 predictions significantly discriminate between positive and negative binders, and used 3D structural features to train a machine learning (ML) algorithm to predict coupling potency. Interpretation of the ML model helped discriminate universal features governing the strength of G protein coupling from those determining binding specificity. We computationally illuminated the coupling preferences of 180 non-olfactory GPCRs (non-OR) with previously unreported transduction mechanisms and experimentally validated the predicted couplings for multiple previously uncharacterized GPCRs, including QRFPR, GPR50, GPR37, GPR37L1 and GPRC5A. Our predictions established that Gi/o is the most prevalent coupling among non-OR GPCRs, which often co-occurs with Gq/11 and, to a lesser extent, G12/13 signaling. Gs coupling is less common and restricted to specific clusters within the non-OR GPCRome phylogeny, likely due to stricter structural requirements for its binding. We also computed G protein complexes for over 400 ORs, establishing Gs as the most prevalent coupling. ORs are predicted to bind to Gs with a simpler interface compared to non-ORs, ultimately leading to energetically less stable complexes. Additionally, we predict recurrent bindings to Gq/11 and Gi/o proteins for ORs, suggesting potentially novel OR signaling mechanisms. We exploited the GPCRome coupling atlas to interpret healthy and cancer expression data, revealing the coupling of most GPCR-G protein co-expressed pairs.
This analysis highlights a richer coupling repertoire in healthy tissues compared to cancer, likely reflecting the high signaling requirements of specialized normal cell functions, which are lost in most cancer cells due to their de-differentiated state or under cancer selection processes. In summary, this study provides the first computational 3D atlas of the human GPCR-G protein transductome, thereby illuminating the signaling mechanisms of neglected GPCR classes and providing the basis for interpreting omics datasets from a myriad of pathological conditions, thus enabling the development of novel precision therapeutics.
bioinformatics2026-03-10v1STAR Suite: Integrating transcriptomics through AI software engineering in the NIH MorPhiC consortium
Hung, L.-H.; Yeung, K. Y.Abstract
To accommodate rapid methodological turnover, bioinformatics pipelines typically consist of discrete binaries linked via scripts. While flexible, this architecture relies on intermediate files, sacrificing performance, and treating complex codebases as static silos. For example, the STAR aligner (Dobin et al., 2013), the standard engine for transcriptomics, uses an external script for adapter trimming, necessitating the decompression and re-compression of large files. These limitations presented scalability problems for uniform processing of data in the NIH MorPhiC consortium. We present our solution, STAR Suite, a human-engineered and AI-implemented modernization that integrates functionality directly into the C++ source. In just four months, a single developer added over 92,000 lines to the original 28,000-line codebase to produce four unified modules, STAR-core, STAR-Flex, STAR-Perturb, and STAR-SLAM, that can be installed as a pre-compiled binary without introducing any new dependencies. This work demonstrates a new paradigm for the rapid evolution of high-performance bioinformatics software.
bioinformatics2026-03-10v1AQuA2-Cloud: a web platform for fluorescence bioimaging activity analysis
Bright, M.; Mi, X.; Duarte, D.; Carey, E.; Lyu, B.; Wang, Y.; Nimmerjahn, A.; Yu, G.Abstract
Advanced biological imaging analysis platforms such as Activity Quantification and Analysis (AQuA2) enable accurate spatiotemporal activity analysis across diverse cell populations within many species. These tools are increasingly important for investigating cellular signaling dynamics and behavior. However, despite advances in the accuracy and species capability of AQuA2, it remains computationally demanding for analysis of long time-series datasets and requires all users to maintain a MATLAB license, which may limit accessibility and large-scale deployment. To address these limitations, we have designed and made available AQuA2-Cloud, a portable software stack and web platform developed as an improvement and further evolution of AQuA2. This container-deployable system permits multi-user cloud-based high accuracy activity quantification with intuitive workflows, export of analysis data and project files, and comparable processing times. The platform offers integrated features such as in-browser analysis control interfaces, asynchronous program state control, multiple users and user management, support for unreliable connections, file uploading and downloading via web browsers and File Transfer Protocol, and centralized organization of analysis output. AQuA2-Cloud constitutes a cloud-native solution for laboratories or research groups seeking to centralize analysis of spatiotemporal biological imaging datasets while reducing software installation and licensing barriers for end users. The platform enables researchers with minimal technical expertise to perform advanced bioimaging analysis through standard web browsers while maintaining the analytical capabilities of AQuA2. AQuA2-Cloud source code, deployment procedures, and documentation are freely available at (https://github.com/yu-lab-vt/AQuA2-Cloud).
bioinformatics2026-03-10v1Automatic Generation of Model Sequences for Complex Regions in Assembly Graphs
Antipov, D.; Chen, Y.; Sollitto, M.; Phillippy, A. M.; Formenti, G.; Koren, S.Abstract
Recent developments in genome sequencing and assembly technologies have enabled the automated assembly of vertebrate chromosomes from telomere to telomere. However, for some long, highly similar repeats, genome assemblers may lack sufficient information to unambiguously resolve the sequence, leaving tangles in the assembly graph and gaps in the final assembly. In recently published genomes, such gaps are often closed by manual graph curation, a process that is labor-intensive, error-prone, and sometimes infeasible. This can leave important genomic repeats, such as recently duplicated genes, misassembled or excluded from the final assembly. Here we present the Trivial Tangle Traverser (TTT) algorithm that finds optimized resolutions of assembly graph tangles. TTT uses depth of coverage and read-to-graph alignment information in a two-stage process to identify evidence-based traversals that are consistent with the underlying data. First, sequence multiplicities are estimated through mixed-integer linear programming, after which an Eulerian path is found in the derived multigraph and optimized through a gradient-descent-like approach. We evaluate TTT traversals on the HG002 human reference genome and demonstrate its use to characterize a previously unassembled amplified gene array in the zebra finch genome. Availability: TTT is available at https://github.com/marbl/TTT
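The second stage described in this abstract, finding an Eulerian path in a multigraph whose edge multiplicities are already known, can be sketched with Hierholzer's algorithm. The Python sketch below is a generic illustration, not the TTT implementation: it assumes directed edges, takes the multiplicities as given (repeated edges in the input list), and omits both the mixed-integer programming step and the later optimization. The function name `eulerian_path` and the toy tangle are hypothetical.

```python
from collections import defaultdict


def eulerian_path(edges):
    """Return a node sequence that uses every directed edge exactly once,
    or None if no such path exists.

    `edges` is a list of (u, v) pairs; repeated pairs encode multiplicity.
    """
    if not edges:
        return []
    out = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree
    for u, v in edges:
        out[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    starts = [n for n, b in balance.items() if b == 1]
    ends = [n for n, b in balance.items() if b == -1]
    if (any(abs(b) > 1 for b in balance.values())
            or len(starts) > 1 or len(ends) > 1):
        return None
    start = starts[0] if starts else edges[0][0]
    # Hierholzer's algorithm: walk until stuck, then emit nodes in reverse.
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if out[node]:
            stack.append(out[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    # A shorter path means some edges were unreachable from the start node.
    return path if len(path) == len(edges) + 1 else None


# Toy tangle: a repeat R entered from A, traversed twice, then exited to B.
tangle = [("A", "R"), ("R", "R"), ("R", "R"), ("R", "B")]
traversal = eulerian_path(tangle)
```

In a real tangle many Eulerian paths usually exist, which is why the abstract describes a further optimization step to choose among them using read alignments.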
bioinformatics2026-03-10v1Measuring Amorphous Motion: Application of Optical Flow to Three-Dimensional Fluorescence Microscopy Images
Lee, R. M.; Eisenman, L. R.; Hobson, C.; Aaron, J. S.; Chew, T.-L.Abstract
Motion is an essential component of any living system. It is rich with information, but it is often challenging to quantitatively extract biologically informative results from the motion apparent in microscopy images. This challenge is exacerbated by the wide variety in biological movement, which often takes the form of difficult-to-segment amorphous structures undergoing complex motion. An image processing technique known as optical flow can capture motion at each pixel in an image, thus bypassing the need for object segmentation or a priori definition of motion types. This makes it a powerful tool for quantitative assessment of biological systems from the protein to organism scale. However, despite its flexibility and strengths for analyzing fluorescence microscopy images, its adoption in the bioimaging community has been limited by the availability of easy-to-use tools and guidance in results interpretation. Here we describe an optical flow tool, OpticalFlow3D, that can be run in Python or MATLAB and is compatible with three-dimensional microscopy images. Using biological examples across length scales, we illustrate how OpticalFlow3D can enable new biological insight.
bioinformatics2026-03-10v1In silico analysis of the human titin protein (Immunoglobulin-like, fibronectin type III, and Protein kinase domains) as a potential forensic marker for postmortem interval (PMI) estimation
Gill, M. U.; Akhtar, M.Abstract
Owing to the limited availability of reliable and well-validated molecular markers, determination of the postmortem interval (PMI) remains a major obstacle for forensic investigators. Titin, the largest human protein, has never undergone domain-level examination of its postmortem degradation patterns. This study focused on the in silico analysis of the Immunoglobulin-like, fibronectin type III, and Protein kinase domains of human titin to assess their potential utility in PMI estimation. Sequence data for the studied domains were retrieved from UniProt; 2D and 3D models were generated by PSIPRED and SWISS-MODEL, respectively, followed by analysis of physicochemical properties, solubility assessment, and structural comparison. This study revealed that the Ig-like domain is the most stable, followed by the Fn-III and Protein kinase domains. These findings indicate that titin domains may degrade at different rates in the postmortem period. This study introduces the first computational basis for considering titin as a multi-domain candidate biomarker for PMI estimation, laying the groundwork for upcoming laboratory validation.
bioinformatics2026-03-10v1SpatioCAD: Context-aware graph diffusion model for pinpointing spatially variable genes in heterogeneous tissues
Zhang, S.; Wen, H.; Shen, Q.Abstract
Spatial transcriptomics enables comprehensive characterization of tissue architecture, and the identification of spatially variable genes (SVGs) is a critical step for defining region-specific molecular markers and uncovering spatially regulated mechanisms across diverse biological contexts. However, most existing methods for SVG detection overlook cell density variations, a major confounding factor in complex tissues such as tumors, where heterogeneous cellularity frequently introduces false-positive calls. Here we present SpatioCAD, a computational framework that explicitly decouples genuine spatial expression patterns from confounding effects driven by cellularity. SpatioCAD leverages and extends a graph diffusion model to simulate expression propagation under cell-density-aware conditions, thereby ensuring unbiased detection of SVGs across all expression levels. Systematic evaluations on simulated datasets demonstrate its superior statistical power and specificity. Applied to breast cancer, lung cancer, and glioma datasets, SpatioCAD identifies functionally diverse SVGs, including low-abundance transcripts with established roles in tumor progression, while also recapitulating biologically meaningful tissue architecture features.
bioinformatics2026-03-10v1MOZAIC: Compound Growth via In Silico Reactions and Global Optimization using Conformational Space Annealing
Yoo, J.; Shin, W.-H.Abstract
Motivation: Fragment-based drug discovery (FBDD) is an efficient strategy that leverages small molecular fragments to explore broader chemical space by combining them. Advances in computational methods have enabled the calculation of molecular properties and docking scores, thereby accelerating the development of algorithm- and AI-based approaches in FBDD. However, certain methods do not provide synthetic pathways to obtain the proposed compounds. Consequently, these molecules might not be easy to synthesize. Results: In light of these developments, we propose MOZAIC, a novel framework that explores chemical space using reaction-based fragment growing and Conformational Space Annealing, a powerful global optimization algorithm. Our results show that MOZAIC effectively produces chemically diverse molecules with balanced improvements in lead-like properties, including QED, synthetic accessibility, and binding affinity. Furthermore, its flexible objective function allows fine-tuning for specific design goals, such as enhancing solubility with binding affinity. These capabilities position MOZAIC as a valuable platform for advancing fragment-to-lead and lead optimization efforts in drug discovery. Availability and implementation: MOZAIC is available at https://github.com/kucm-lsbi/MOZAIC/. Supplementary Information: Supplementary data are available at Bioinformatics online.
bioinformatics2026-03-10v1InversePep: Diffusion-Driven Structure-Based Inverse Folding for Functional Peptides
Chilakamarri, S. K.; Kasturi, S. R.; Yerrabandla, S. P. R.; Gogte, S.; Kondaparthi, V.Abstract
Designing functional peptides with specific structural and biochemical properties is critical for applications in protein engineering and therapeutic discovery. However, most peptide design approaches rely on evolutionary or local sequence optimization methods, which are limited when adapting to peptides' shorter length, high conformational flexibility, and unique physicochemical constraints. While recent structure-based inverse folding models have shown success for proteins, these models often underperform on peptides because sequence recovery alone is not a reliable indicator of stability or foldability in short, flexible backbones. To address this challenge, we introduce InversePep, a generative diffusion model for structure-based peptide inverse folding. InversePep learns the conditional distribution of sequences that can adopt a given backbone conformation, enabling direct generation of peptides tailored to target structural geometries. The framework integrates a geometric graph neural network to encode 3D backbone features with a Transformer-based sequence refinement module that iteratively denoises candidate sequences during diffusion. Trained on a diverse set of peptide backbones sourced from Propedia and SATPdb, InversePep effectively captures structural and biochemical diversity across peptide families. In systematic evaluations on held-out peptide structures and the PepBDB benchmark, InversePep achieves a mean TM score of 0.38 and a median of 0.28, outperforming ProteinMPNN and ESM-IF1 in generating geometry-consistent peptide sequences. In-silico folding analyses confirm that sampled peptides reliably adopt the target conformations. These results highlight InversePep's capability for designing structurally stable and sequence-diverse peptides, demonstrating its potential in antimicrobial peptide discovery, peptide therapeutics, and molecular probe development.
bioinformatics2026-03-10v1