Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
BioGraphX: Bridging the Sequence-Structure Gap via Physicochemical Graph Encoding for Interpretable Subcellular Localization Prediction
Saeed, A.; Abbas, W.
AI Summary
- BioGraphX introduces a novel encoding framework that constructs protein interaction graphs from sequences using biochemical rules, bypassing the need for 3D structure determination.
- The framework integrates 158 interpretable biophysical features with ESM-2 embeddings, enhancing prediction accuracy on the DeepLoc benchmarks.
- SHAP analysis reveals that BioGraphX-Net uses sequence profiles for exclusion and specific biophysical features for precise localization, with Frustration features aiding in resolving targeting ambiguities.
Abstract
Computational approaches for protein subcellular localization prediction are important for understanding cellular mechanisms and developing treatments for complex diseases. However, a critical limitation of current methods is their lack of interpretability: while they can predict where a protein localizes, they fail to explain why the protein is assigned to a specific location. Moreover, traditional approaches rely on Anfinsen's principle, which assumes that protein behavior is determined by its native three-dimensional structure, requiring a costly and time-consuming structure-determination process. Here, we propose BioGraphX, a novel encoding framework that constructs protein interaction graphs directly from protein sequences using biochemical rules. This approach eliminates the need for three-dimensional structure determination by encoding 158 interpretable features grounded in biophysical principles. Building upon this representation, BioGraphX-Net demonstrates superior performance on the DeepLoc benchmarks by integrating ESM-2 embeddings with the proposed features via a gating mechanism. Gating analysis shows that although ESM-2 embeddings provide strong contributions, BioGraphX features function as high-precision filters. SHAP analysis shows that BioGraphX-Net encodes a sophisticated biophysical logic: sequence profiles act as universal exclusion filters, while organelle-specific combinations of biophysical features enable precise compartment discrimination. Notably, Frustration features help resolve targeting ambiguities in complex compartments, reflecting evolutionary constraints while preventing mislocalization from sequence mimicry. The framework has the additional advantage of promoting Green AI in bioinformatics, achieving performance comparable to the state-of-the-art while maintaining a minimal parameter count of 13.46 million. In summary, BioGraphX not only provides accurate predictions but also offers new insights into the language of life.
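The abstract describes fusing ESM-2 embeddings with 158 interpretable biophysical features through a gating mechanism, without specifying the architecture. As a rough illustration only, the sketch below shows one common form such a gate can take; the dimensions, layer choices, and names are hypothetical and not taken from the BioGraphX paper.

```python
# Minimal sketch of a feature-gating fusion layer, assuming a per-protein
# ESM-2 embedding and a 158-dimensional biophysical feature vector.
# All dimensions and the exact gate form are illustrative assumptions.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, esm_dim=1280, bio_dim=158, hidden=256):
        super().__init__()
        self.proj_esm = nn.Linear(esm_dim, hidden)
        self.proj_bio = nn.Linear(bio_dim, hidden)
        # The gate decides, per hidden unit, how much of each source to keep.
        self.gate = nn.Sequential(nn.Linear(esm_dim + bio_dim, hidden), nn.Sigmoid())

    def forward(self, esm_emb, bio_feats):
        g = self.gate(torch.cat([esm_emb, bio_feats], dim=-1))
        return g * self.proj_esm(esm_emb) + (1 - g) * self.proj_bio(bio_feats)

fusion = GatedFusion()
fused = fusion(torch.randn(4, 1280), torch.randn(4, 158))  # -> shape (4, 256)
```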
bioinformatics · 2026-02-18 · v2
Short linear motifs - Underexplored players driving Toxoplasma gondii infection
Alvarado Valverde, J.; Lapouge, K.; Boergel, A.; Remans, K.; Luck, K.; Gibson, T.
AI Summary
- The study explores the role of short linear motifs in Toxoplasma gondii's infection process, focusing on how these motifs facilitate interactions with host proteins.
- A computational pipeline was developed to identify motifs in Toxoplasma secreted proteins, revealing 24,291 motif matches in 295 proteins.
- Experimental validation confirmed the presence of TRAF6-binding motifs in Toxoplasma proteins RON10 and GRA15, highlighting the utility of motif predictions in understanding infection mechanisms.
Abstract
Pathogens infect hosts by interacting with host proteins and exploiting their functions to their advantage. Short linear motifs, small functional regions within intrinsically disordered protein regions, are common mediators of host-pathogen protein interactions. While motifs have been more extensively studied in viruses and bacteria, the extent to which eukaryotic unicellular parasites use motifs during infection remains unexplored. Toxoplasma gondii is a widespread intracellular Apicomplexan parasite capable of infecting all warm-blooded animals and invading any of their nucleated cells. Toxoplasma's secreted proteins are key in interacting with host proteins during infection, making them potential sources for motifs. To highlight the role of motifs in Toxoplasma gondii infection, we curated 21 known motif instances in Toxoplasma proteins from the scientific literature. To identify more motifs in Toxoplasma secreted proteins, we developed a computational pipeline that annotates putative motif matches with structural and functional features. Through this approach, we identified a set of 24,291 motif matches in 295 secreted proteins. We highlight strategies for further prioritisation of likely functional motif matches by focusing on integrin motifs, degrons and TRAF6-binding motifs. We subjected four predicted TRAF6-binding motifs to experimental validation, supporting the predicted motifs in the Toxoplasma proteins RON10 and GRA15. Our motif predictions provide a valuable resource for generating hypotheses and designing experiments to study infection mechanisms. The characterisation of motifs in Toxoplasma will be key to understanding the molecular principles underlying its broad host range and more comprehensive Apicomplexan infection strategies.
bioinformatics · 2026-02-18 · v2
Hi-Cformer enables multi-scale chromatin contact map modeling for single-cell Hi-C data analysis
Wu, X.; Chen, X.; Jiang, R.
AI Summary
- Hi-Cformer is a transformer-based method designed to model multi-scale chromatin contact maps from single-cell Hi-C data, addressing challenges like sparsity and uneven contact distribution.
- It uses a specialized attention mechanism to capture dependencies across genomic regions and scales, providing robust low-dimensional cell representations and clearer cell type separation.
- Hi-Cformer accurately imputes chromatin interactions, identifies 3D genome features, and extends to cell type annotation with high accuracy across different datasets.
Abstract
Single-cell Hi-C captures the three-dimensional organization of chromatin in individual cells and provides insights into fundamental genomic processes such as gene regulation and transcription. While analyses of bulk Hi-C data have revealed multi-scale chromatin structures like A/B compartments and topologically associating domains, single-cell Hi-C data remain challenging to analyze due to sparsity and uneven distribution of chromatin contacts across genomic distances. These characteristics lead to strong signals near the diagonal and complex multi-scale local patterns in single-cell contact maps. Here, we propose Hi-Cformer, a transformer-based method that simultaneously models multi-scale blocks of chromatin contact maps and incorporates a specially designed attention mechanism to capture the dependencies between chromatin interactions across genomic regions and scales, enabling the integration of both global and fine-grained chromatin interaction features. Building on this architecture, Hi-Cformer robustly derives low-dimensional representations of cells from single-cell Hi-C data, achieving clearer separation of cell types compared to existing methods. Hi-Cformer can also accurately impute chromatin interaction signals associated with cellular heterogeneity, including 3D genome features such as topologically associating domain-like boundaries and A/B compartments. Furthermore, by leveraging its learned embeddings, Hi-Cformer can be extended to cell type annotation, achieving high accuracy and robustness across both intra- and inter-dataset scenarios.
bioinformatics · 2026-02-18 · v2
Structural Characterization of the Type IV Secretion System in Brucella melitensis for Virtual Screening-Based Therapeutic Targeting
Kapoor, J.; Panda, A.; Rajagopal, R.; Kumar, S.; Bandyopadhyay, A.
AI Summary
- The study focused on characterizing the Type IV Secretion System (T4SS) in Brucella melitensis to explore its potential as a therapeutic target for brucellosis.
- Computational modeling and structural analysis of T4SS components were performed, revealing conserved architecture with E. coli T4SS despite low sequence identity.
- Virtual screening identified three promising drug candidates (Ezetimibe, Chlordiazepoxide, Alloin) targeting the VirB11 ATPase dimeric interface, with favorable binding energies confirmed by molecular dynamics simulations.
Abstract
Brucellosis is a globally important zoonotic disease caused by Brucella melitensis, the most virulent and clinically significant species affecting both humans and livestock. Unlike many Gram-negative pathogens, B. melitensis, a facultative intracellular pathogen, lacks conventional virulence factors and instead relies on specialized systems such as the Type IV Secretion System (T4SS) for secretion of effector proteins. In this study, an integrated computational pipeline was implemented to identify, model, and assemble the T4SS components, encoded by the virB operon, from the complete B. melitensis proteome. Template-based modeling strategies were employed to generate structures of T4SS subcomplexes, referencing crystallographic data from the E. coli T4SS. Structural superposition with E. coli homologs revealed highly conserved architecture despite only 30 to 50% sequence identity. Stereochemical validation confirmed high model quality and favorable interactions among most VirB protein pairs. Membrane insertion analysis of the membrane-embedded assemblies further corroborated the spatial orientation of the modeled T4SS. The potential of the T4SS as a drug target was explored by targeting the dimeric interface of the VirB11 ATPase to disrupt protein-protein interactions and thereby disarm the pathogen. Virtual screening of compounds from the DrugBank database revealed compounds with docking scores below -7.0 kcal/mol, which were then filtered on ADMET properties, yielding three promising candidates: Ezetimibe (Drug Id: DB00973), Chlordiazepoxide (Drug Id: DB00475), and Alloin (Drug Id: DB15477). MM-GBSA analysis estimated favorable binding free energies for these compounds, and 200 ns molecular dynamics simulations further confirmed the stability of the protein-ligand interactions. Collectively, these findings provide new insights into the architecture of the B. melitensis T4SS and identify three potential drug molecules targeting the T4SS. This supports the repurposing of FDA-approved drugs as an effective strategy for anti-virulence therapy against brucellosis.
bioinformatics · 2026-02-18 · v1
KG-Orchestra: An Open-Source Multi-Agent Framework for Evidence-Based Biomedical Knowledge Graphs Enrichment.
Mohamed, A. H.; Shalaby, K. S.; Kaladharan, A.; Atas Guvenilir, H.; Tom Kodamullil, A.
AI Summary
- KG-Orchestra is a multi-agent framework designed to enrich biomedical knowledge graphs (BKGs) by focusing on specific topics, using Retrieval-Augmented Generation (RAG) for evidence acquisition, validation, and integration.
- Evaluations on specialized contexts like Nelivaptan-Alzheimer's link and gut-brain axis interactions showed that Qwen 3 variants and hybrid retrieval strategies improved reasoning and evidence relevance.
- The framework ensures high triplet integrity and biological validity, is computationally flexible, and supports applications like drug repurposing and pathway completion.
Abstract
Biomedical Knowledge Graphs (BKGs) offer integrative representations of complex biology, yet their utility is compromised by the limitations of current construction methods: manual curation offers high fidelity but is unscalable, whereas purely automated Large Language Model (LLM) approaches often yield broad networks lacking mechanistic granularity. We present KG-Orchestra, an open-source multi-agent framework designed to build specialized, directional, cause-and-effect BKGs by enriching seed graphs. The framework focuses on increasing granularity within specific topics by leveraging Retrieval-Augmented Generation (RAG) to autonomously acquire, validate, and integrate evidence. The system orchestrates specialized agents for retrieval, schema alignment, and triplet validation with explicit, traceable provenance, transforming sparse seeds into dense, high-resolution resources. We evaluated KG-Orchestra on two specialized contexts -- the mechanistic link between Nelivaptan and Alzheimer's Disease (NADKG) and the complex probiotic interactions within the gut-brain axis (ProPreSyn-GBA) -- across varying computational budgets. Our benchmarking results demonstrate that Qwen 3 variants deliver superior reasoning performance and that hybrid retrieval strategies significantly enhance evidence relevance. Furthermore, the multi-agent architecture ensures high triplet integrity and biological validity through iterative cross-checking and self-correction. The framework remains computationally flexible, deploying from single laptop GPUs to high-performance clusters. By bridging knowledge gaps and adding context-aware entities, KG-Orchestra increases reliability while validating seed assertions against up-to-date sources. This versatility supports critical downstream applications, including completing missing mechanistic pathways, integrating novel entities for drug repurposing, constructing targeted subgraphs from entity lists, and retroactively validating graph evidence for transparent auditing.
bioinformatics · 2026-02-18 · v1
Wayfarer: A multiscale framework for spatial analysis of tumor progression
Moses, L.; Herault, A.; Cabon, L.; Dumitrascu, B.
AI Summary
- Wayfarer is a multiscale framework designed to analyze how spatial association metrics in tumor progression change across different spatial scales using spatial -omics data.
- Applied to Xenium data from lung adenocarcinoma, Wayfarer revealed that tumor progression involves shifts in spatial patterns, with increased fine-scale coherence in ERBB2-high regions and coarse-scale clustering of immune markers.
- This framework transforms spatial aggregation from a confounder into a diagnostic tool, available as an R package via Bioconductor.
Abstract
Spatial biology spans multiple length scales, from intracellular organization to tissue-level architecture. Spatial transcriptomics captures this structure, yet most analyses operate at a single spatial resolution, implicitly assuming that biological organization is scale-consistent. In practice, spatial autocorrelation and co-localization are functions of scale, and conclusions can depend on arbitrary aggregation choices. Here we present Wayfarer, a multiscale framework for spatial -omics that tracks how spatial association metrics evolve across nested spatial aggregations, enabling statistical comparison of multiscale structure across biological conditions. Using Xenium data from lung adenocarcinoma (LUAD), we show that spatial patterns often co-exist at fine and coarse scales and that progression is accompanied by reproducible shifts in scale-response profiles. These include increased fine-scale coherence of ERBB2-high tumor regions and coarse-scale clustering of immune-associated markers that are not apparent at a single resolution. Wayfarer converts spatial aggregation from a confounder into a diagnostic signal and is implemented as an R package to be released through Bioconductor.
bioinformatics · 2026-02-18 · v1
Private Information Leakage from Polygenic Risk Scores
Nikitin, K.; Gursoy, G.
AI Summary
- This study investigates the privacy risks of sharing Polygenic Risk Scores (PRSs), demonstrating that PRSs can be used to reconstruct parts of an individual's genome.
- Using dynamic programming and population-based likelihood estimation, the research shows how a single PRS value can reveal genotypes, with increased accuracy when combining multiple PRSs.
- The authors propose an analytical framework to evaluate privacy risks and suggest methods for sharing PRS models while maintaining utility.
Abstract
Polygenic Risk Scores (PRSs) estimate the likelihood that individuals will develop complex diseases based on their genetic variations. While their use in clinical practice and direct-to-consumer genetic testing is growing, the privacy implications of publicly sharing PRS values are often underestimated. In this work, we demonstrate that PRSs can be exploited to recover genotypes and to de-anonymize individuals. We describe how to reconstruct a portion of an individual's genome from a single PRS value by using dynamic programming and population-based likelihood estimation, which we experimentally demonstrate on PRS panels of up to 50 variants. We highlight the risks of combining multiple, even larger-panel PRSs to improve genotype-recovery accuracy, which can lead to the re-identification of individuals or their relatives in genomic databases or to the prediction of additional health risks not originally associated with the disclosed PRSs. We then develop an analytical framework to assess the privacy risk of releasing individual PRS values and provide a potential solution for sharing PRS models without decreasing their utility. Our tool and instructions to reproduce our calculations can be found at https://github.com/G2Lab/prs-privacy.
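To make the combinatorial core of the attack concrete: a PRS is a weighted sum of genotype dosages, so a disclosed score constrains which dosage vectors are possible. The toy sketch below enumerates vectors consistent with an invented score; the paper's actual method replaces brute force with dynamic programming and adds population-based likelihoods, neither of which is reproduced here.

```python
# Toy illustration of recovering genotypes consistent with a disclosed PRS.
# Weights, target score, and tolerance are invented; real panels (~50 variants)
# require dynamic programming over a discretized score axis instead of brute force.
from itertools import product

weights = [0.12, -0.05, 0.30, 0.07]   # per-variant effect sizes (hypothetical)
target = 0.49                         # disclosed PRS value (hypothetical)
tol = 1e-9

consistent = [g for g in product((0, 1, 2), repeat=len(weights))
              if abs(sum(w * d for w, d in zip(weights, g)) - target) < tol]
print(consistent)   # genotype dosage vectors consistent with the disclosed score
```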
bioinformatics · 2026-02-18 · v1
Construction of distinct k-mer color sets via set fingerprinting
Alanko, J. N.; Puglisi, S. J.
AI Summary
- This study introduces a Monte Carlo algorithm for constructing distinct k-mer color sets in the colored de Bruijn graph model, focusing on reducing memory usage during index construction.
- The algorithm uses on-the-fly deduplication via incremental fingerprinting, providing a strong error probability bound of 2^(-82).
- Applied to 65,536 S. enterica genomes, it compressed the color sets to 40 GiB in 7 hours and 17 minutes, using only 14 GiB of RAM.
Abstract
The colored de Bruijn graph model is the currently dominant paradigm for indexing large microbial reference genome datasets. In this model, each reference genome is assigned a unique color, typically an integer id, and each k-mer is associated with a color set, which is the set of colors of the reference genomes that contain that k-mer. This data structure supports a variety of pseudoalignment algorithms, which aim to determine the set of genomes most compatible with a query sequence. In most applications, many distinct k-mers are associated with the same color set. In current indexing algorithms, color sets are typically deduplicated and compressed only at the end of index construction. As a result, the peak memory usage can greatly exceed the size of the final data structure, making index construction a bottleneck in analysis pipelines. In this work, we present a Monte Carlo algorithm that constructs the set of distinct color sets for the k-mers directly in any individually compressed form. The method performs on-the-fly deduplication via incremental fingerprinting. We provide a strong bound on the error probability of the algorithm, even if the input is chosen adversarially, assuming that a source of random bits is available at run time. We show that given an SBWT index of 65,536 S. enterica genomes, we can enumerate and compress the distinct color sets of the genomes to 40 GiB on disk in 7 hours and 17 minutes, using only 14 GiB of RAM and no temporary disk space, with an error probability of at most 2^(-82).
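The key trick described above is deduplicating color sets on the fly by comparing short fingerprints instead of the sets themselves. The sketch below uses a classical modular-sum set fingerprint as a stand-in; the paper's exact construction and its 2^(-82) error bound are not reproduced, and all data are invented.

```python
# Sketch of on-the-fly color-set deduplication via incremental fingerprints.
# Each color id is mapped to a random residue; a set's fingerprint is the sum
# of its members' residues modulo a large prime, so it can be updated one
# color at a time. This is one classical construction, not the paper's algorithm.
import random

P = (1 << 127) - 1                    # large Mersenne prime
rng = random.Random(42)               # stands in for the required source of random bits
color_residue = {}                    # lazily assigned random value per color

def fingerprint(color_set):
    fp = 0
    for c in color_set:
        if c not in color_residue:
            color_residue[c] = rng.randrange(P)
        fp = (fp + color_residue[c]) % P
    return fp

seen = {}                             # fingerprint -> canonical color-set id
for kmer, colors in [("ACGT", {1, 5, 9}), ("CGTA", {5, 9, 1}), ("GTAC", {2, 5})]:
    set_id = seen.setdefault(fingerprint(colors), len(seen))
    print(kmer, "-> color set", set_id)   # identical sets collapse to the same id
```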
bioinformatics · 2026-02-18 · v1
Modeling the organizational heterogeneity of lipid-enriched microdomains in the neuronal membranes of gray and white matter of Alzheimer brain: A computational lipidomics study
Peesapati, S.; Chakraborty, S.
AI Summary
- This study used lipidomics and molecular dynamics simulations to model how Alzheimer's disease (AD) affects lipid composition and membrane organization in gray (GM) and white matter (WM) of the brain.
- Findings indicate that AD leads to significant changes in membrane thickness and microdomain distribution, with more pronounced alterations in GM than WM.
- The study highlights the role of lipid composition in neuronal membrane homeostasis, showing increased cholesterol/ceramide/sphingomyelin domains in GM under AD conditions.
Abstract
Alzheimer disease (AD) is a leading cause of death among the elderly, with no existing treatment. The development of therapy is further challenged by a limited understanding of molecular pathogenesis and the absence of reliable early detection biomarkers. Neuroimaging and lipidomic studies reveal structural and biochemical alterations in both gray and white matter in AD patients, including disruptions in membrane organization and neuronal signaling pathways. In the present work, we employed lipidomics guided modeling of membranes in gray and white matter regions in healthy and diseased (AD) conditions, and used all-atom molecular dynamics (MD) simulations to examine how AD-associated alterations in lipid composition influence the structure, spatial organization, and micro-heterogeneity of neuronal plasma membranes in different brain regions. Data suggest that AD associated lipid alterations in gray matter (GM) and white matter (WM) impact membrane thickness and microdomain distribution, highlighting the critical role of lipid composition in maintaining neuronal membrane homeostasis and function. Higher-order cholesterol/ceramide/sphingomyelin enriched domains are more abundant in the neuronal membranes of the GM region in diseased conditions. Under AD-mimicking conditions, lipidomic analyses demonstrate that neuronal membranes in GM experience more substantial compositional and structural remodeling than those in WM. Our results show significant changes in membrane microdomain distribution across the lipid bilayers, and, interestingly, these changes are more pronounced in the gray matter than in the white matter. This study establishes a framework for modeling the tissue-specific lipidomics data to understand how disease-driven compositional changes affect the structure, organization, and dynamics of biological membranes.
bioinformatics · 2026-02-18 · v1
The Role of Human-Specific lncRNA in Hyaline Cartilage Development
Osone, T.; Takao, T.; Takarada, T.
AI Summary
- This study investigates the role of human-specific long non-coding RNAs (lncRNAs) in hyaline cartilage development using human iPS cells differentiated into limb bud-like mesenchymal cells and then into hyaline cartilage-like tissue.
- Bulk RNA sequencing revealed that human-specific lncRNAs are significantly upregulated in the cartilage-like tissue, potentially regulating genes related to the extracellular matrix.
- These findings suggest that controlling human-specific lncRNAs could enhance regenerative cartilage tissue quality and provide insights into human-specific diseases.
Abstract
One of the distinctive characteristics of humans is their bipedalism. To achieve upright bipedal walking, the angles of the pelvis and femur have been altered. Although evolutionary hypotheses on the transition to bipedalism exist, the molecular mechanisms remain unclear. This study attempts to elucidate these mechanisms using a system for inducing hyaline cartilage-like tissue from human iPS cells via limb bud-like mesenchymal cells. Focus was placed on non-coding RNAs, known for their potential in generating biological diversity. Bulk RNA sequencing was conducted to compare the expression and functions of human-specific long non-coding RNAs between limb bud-like mesenchymal cells and induced hyaline cartilage-like tissue. The results indicated that human-specific lncRNAs, significantly upregulated in hyaline cartilage-like tissue, may regulate genes related to the extracellular matrix. These findings suggest the potential to develop regenerative cartilage tissue with enhanced ECM quality through controlling human-specific lncRNAs. Additionally, studying human-specific lncRNAs could elucidate mechanisms of diseases that are less common in other species but more prevalent in humans.
bioinformatics · 2026-02-18 · v1
Pioneer and Altimeter: Fast Analysis of DIA Proteomics Data Optimized for Narrow Isolation Windows
Wamsley, N. T.; Wilkerson, E. M.; Major, M.; Goldfarb, D.
AI Summary
- The study introduces Pioneer and Altimeter, tools designed for fast analysis of data-independent acquisition (DIA) proteomics data, specifically optimized for narrow isolation windows.
- Altimeter models fragment intensity as a function of collision energy, allowing spectral library reuse, while Pioneer re-isotopes spectra and uses advanced techniques for rapid, accurate DIA analysis.
- These tools enable high-confidence protein identification and quantification, performing analyses 2-6 times faster while controlling false-discovery rates across various experimental setups.
Abstract
Advances in mass spectrometry have enabled increasingly fast data-independent acquisition (DIA) experiments, producing datasets whose scale and complexity challenge existing analysis tools. Those same advances have also led to the use of narrow isolation windows, which alter MS2 spectra via fragment isotope effects and give rise to systematic deviations from spectral libraries. Here we introduce Pioneer and Altimeter, open-source tools for fast DIA analysis with explicit modeling of isolation-window effects. Altimeter predicts deisotoped fragment intensity as a continuous function of collision energy, allowing a single spectral library to be reused across datasets. Pioneer re-isotopes predicted spectra per scan and combines an intensity-aware fragment index, spectral deconvolution, and dual-window quantification for fast, spectrum-centric DIA analysis. Across instruments, experimental designs, and sample inputs, Pioneer enables high-confidence identification and precise quantification at scale, completing analyses 2-6x faster and maintaining conservative false-discovery rate control.
bioinformatics · 2026-02-18 · v1
Analysis of Transcriptograms in Epithelial-Mesenchymal Transition (EMT)
Santos, O. J.; Dalmolin, R. J.; de Almeida, R. M. C.
AI Summary
- This study uses a novel pipeline integrating Transcriptogram with PCA to analyze EMT in single-cell RNA-seq data, reducing noise by projecting data onto PPI-ordered gene lists.
- Applied to TGF-β1-induced MCF10A cells, the method revealed EMT as a systemic reprogramming with distinct cellular trajectories, identifying key modules like a metabolic switch, cell cycle blockade, and a detoxification program.
- The approach enhances the resolution of cellular plasticity, showing EMT involves multiple stages and pathways, not just morphological changes.
Abstract
Single-cell RNA sequencing (single-cell RNA-seq) has represented a revolution in gene expression analysis. However, high dropout rates and stochastic noise often reduce the amount of information captured in these experiments. The epithelial-mesenchymal transition (EMT), which is fundamental to tumor progression and organismal development, is particularly difficult to fully characterize due to the existence of intermediate states. In this work, we demonstrate that projecting transcriptomic data onto gene lists ordered using protein-protein interaction (PPI) information acts as a biological low-pass filter, attenuating technical noise and increasing the statistical power of the analyses. We propose and validate an innovative pipeline that integrates the Transcriptogram method with Principal Component Analysis (PCA). By applying a moving average over functionally ordered genes, we drastically increase the signal-to-noise ratio, enabling the inference of cellular trajectories. The method was applied to a public dataset of TGF-β1-induced MCF10A cells, with rigorous batch-effect correction based on biological controls. The results reveal that EMT is not merely a morphological change, but a coordinated, systemic reprogramming. This approach enabled the identification of critical modules that would remain hidden in conventional analyses: (i) a massive Metabolic Switch (Cluster 2), indicating a transition toward oxidative phosphorylation to sustain invasion; (ii) a strategic blockade of the cell cycle (Cluster 4); and (iii) a Detoxification Shield and chemoresistance program (Cluster 5), characterized by endogenous activation of metallothioneins. We conclude that the combination of PPI network topology and dimensionality reduction offers superior resolution for dissecting cellular plasticity. The method not only validates classical markers, but also reveals the hidden functional architecture of the transition, showing that EMT is not a single, uniform process, but rather one in which cells can follow distinct trajectories, halting at different stages of differentiation.
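The core of the pipeline, ordering genes by PPI information, smoothing each cell's profile along that ordering, and then reducing dimensionality, lends itself to a short sketch. Everything below (data, ordering, window size) is a placeholder rather than the authors' actual Transcriptogram implementation.

```python
# Minimal sketch of the transcriptogram idea: smooth each cell's expression
# profile with a moving average along a PPI-derived gene ordering, then run PCA
# on the smoothed profiles. The ordering and counts here are synthetic stand-ins.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
expr = rng.poisson(1.0, size=(200, 1000)).astype(float)   # cells x genes (toy counts)
ppi_order = rng.permutation(expr.shape[1])                # stand-in for the PPI ordering

window = 31
kernel = np.ones(window) / window
ordered = expr[:, ppi_order]
# The moving average over functionally ordered genes acts as a low-pass filter.
smoothed = np.apply_along_axis(lambda row: np.convolve(row, kernel, mode="same"), 1, ordered)

pcs = PCA(n_components=10).fit_transform(smoothed)
print(pcs.shape)   # (200, 10): low-noise coordinates usable for trajectory inference
```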
bioinformatics · 2026-02-18 · v1
Differential analysis of genomics count data with edge
Pachter, L.
AI Summary
- The study introduces edgePython, a Python port of the edgeR package, to facilitate differential expression analysis in the Python-dominated single-cell genomics field.
- edgePython includes a new negative binomial gamma mixed model for multi-subject single-cell analysis and applies empirical Bayes shrinkage to cell-level dispersion.
- This adaptation aims to enhance integration with Python tools while maintaining the core functionalities of edgeR.
Abstract
The edgeR Bioconductor package is one of the most widely used tools for differential expression analysis of count-based genomics data. Despite its popularity, the R-only implementation limits its integration with the Python centric ecosystem that has become dominant in single-cell genomics. We present edgePython, a Python port of edgeR 4.8.2 that extends the framework with a negative binomial gamma mixed model for multi-subject single-cell analysis and empirical Bayes shrinkage of cell-level dispersion.
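The abstract mentions empirical Bayes shrinkage of cell-level negative-binomial dispersion. As a rough, simplified illustration only, the sketch below shrinks moment-based gene-wise dispersion estimates toward a common value; edgeR and edgePython use a weighted-likelihood empirical Bayes procedure, not this stand-in, and all data and weights are invented.

```python
# Toy illustration of NB dispersion estimation with shrinkage toward a common value.
# Not edgePython's procedure; a moment-based stand-in on synthetic counts.
import numpy as np

rng = np.random.default_rng(1)
counts = rng.negative_binomial(n=10, p=0.5, size=(500, 8)).astype(float)  # genes x cells

mu = counts.mean(axis=1)
var = counts.var(axis=1, ddof=1)
# Method-of-moments NB dispersion: Var = mu + phi * mu^2
phi_gene = np.clip((var - mu) / np.maximum(mu, 1e-8) ** 2, 0, None)

phi_common = phi_gene.mean()
weight = 0.3                          # illustrative shrinkage strength
phi_shrunk = weight * phi_common + (1 - weight) * phi_gene
print(phi_common, phi_shrunk[:5])
```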
bioinformatics · 2026-02-18 · v1
Fast structural search for classification of gut bacterial mucin O-glycan degrading enzymes
Erden, M.; Schult, T.; Yanagi, K.; Sahoo, J. K.; Kaplan, D. L.; Cowen, L. J.; Lee, K.
AI Summary
- The study introduces Deep Enzyme Function Transfer (DEFT), which combines sequence- and structure-based methods to improve the classification of enzymes, particularly at the detailed levels of the EC number hierarchy.
- DEFT first uses a protein language model to assign the first two levels of the EC number, then employs structure-based prediction for the remaining levels, reducing false positives.
- Benchmarking showed DEFT's superior accuracy and efficiency, enabling high-throughput annotations, with experimental validation on glycoside hydrolase profiles of gut bacteria.
Abstract
The Enzyme Commission (EC) numbering scheme provides a hierarchical way to classify enzymes according to their catalytic functions. While recent protein language model (PLM) based approaches like CLEAN and ProteInter have improved sequence-based EC number prediction, they struggle with fine-grained classification at the deepest hierarchical level. Structure-based approaches for grouping similar proteins using alignment tools excel at finding proteins that share overall global structure, but suffer from high false positive rates when classifying proteins that are globally structurally similar yet whose functional differentiation depends on a localized region. This problem is particularly relevant to EC number prediction, as enzymatic function depends on the catalytic domain, which is a relatively small, specific region of the protein. We introduce Deep Enzyme Function Transfer (DEFT), which harmonizes sequence- and structure-based approaches through the key insight that PLM-based annotation of the first two EC number hierarchy levels vastly reduces the false positives that are likely to arise in purely structure-based EC number prediction. Given an enzyme of interest, DEFT first uses a PLM-based method to assign the first two levels of the enzyme's EC number, and then uses a structure-based method to predict the remaining two levels of the EC number. Using benchmarking datasets, we demonstrate that DEFT achieves superior accuracy compared to current state-of-the-art tools for EC number prediction. Furthermore, we show that DEFT's computational efficiency enables high-throughput, genome-wide annotation of organisms' enzyme repertoires. We illustrate this capability by experimentally validating DEFT-predicted glycoside hydrolase (GH) profiles of intestinal mucus-associated bacteria.
bioinformatics · 2026-02-18 · v1
Adaptive Tracepoints for Pangenome Alignment Compression
Kaushan, H.; Marco-Sola, S.; Garrison, E.; Prins, P.; Guarracino, A.
AI Summary
- The study introduces adaptive tracepoints, a method for compressing sequence alignments in pangenomes by segmenting alignments based on complexity metrics like edit or diagonal distance, rather than fixed intervals.
- On simulated long sequence alignments, diagonal-bounded tracepoints achieved 10.5-13.7X better compression than fixed-length encodings, while edit-bounded tracepoints offered a balance between compression and reconstruction efficiency.
- Real pangenome data showed compression improvements of 23-139X with no degradation in alignment scores and linear reconstruction time.
Abstract
Motivation: Storing millions of sequence alignments from large-scale genomic comparisons requires efficient compression methods. While fixed-size alignment encodings offer uniform spacing and bounded reconstruction cost, they cannot adapt to variable alignment complexity across sequences, missing compression opportunities in conserved regions. Results: We present adaptive tracepoints, a complexity-aware alignment encoding that segments alignments using configurable complexity metrics (edit distance or diagonal distance) rather than fixed intervals. Segments are bounded by either the number of differences or the deviation from the main diagonal, adapting to local alignment characteristics. Reconstruction guarantees that alignments maintain identical or improved alignment scores. We validate the correctness of our method on simulated and real pangenomes with varying lengths and divergences. Diagonal-bounded tracepoints achieve 10.5-13.7X better compression than fixed-length encodings (l=100) on simulated long sequence alignments (100 Kb), while edit-bounded tracepoints provide a tunable trade-off between compression and reconstruction cost, approaching diagonal-bounded compression at higher thresholds with substantially lower memory and runtime. On real pangenomes (390M alignments), these methods compress alignments by 23-139X relative to uncompressed representations, with no score degradation and reconstruction time linear in alignment length. Availability: Code and documentation are publicly available at https://github.com/AndreaGuarracino/tracepoints, https://github.com/AndreaGuarracino/tpa, and https://github.com/AndreaGuarracino/cigzip. Contact: aguarracino@tgen.org Supplementary information: Supplementary data are available at Bioinformatics online.
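As an illustration of the edit-bounded idea, the following sketch walks an extended CIGAR and closes a segment whenever accumulated differences reach a bound, keeping only per-segment query/target advances from which the alignment can be recomputed segment-by-segment. The CIGAR, bound, and encoding details are invented and do not reproduce the paper's actual format (which also includes a diagonal-bounded variant).

```python
# Sketch of edit-bounded tracepoint segmentation over an extended CIGAR.
# Ops are not split mid-run in this simplified version.
import re

def edit_bounded_tracepoints(cigar, max_diff=5):
    segments, q, t, diffs = [], 0, 0, 0
    for length, op in re.findall(r"(\d+)([=XIDM])", cigar):
        length = int(length)
        if op in "=M":
            q += length; t += length
        elif op == "X":
            q += length; t += length; diffs += length
        elif op == "I":
            q += length; diffs += length
        elif op == "D":
            t += length; diffs += length
        if diffs >= max_diff:
            segments.append((q, t))   # alignment is re-derived inside each segment
            q = t = diffs = 0
    if q or t:
        segments.append((q, t))
    return segments

print(edit_bounded_tracepoints("50=2X30=1I10=3D40=", max_diff=3))
# -> [(83, 82), (10, 13), (40, 40)]: per-segment (query, target) advances
```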
bioinformatics · 2026-02-18 · v1
Drug-Target Interaction Prediction with PIGLET
Carpenter, K. A.; Altman, R. B.
AI Summary
- The study introduces PIGLET, a novel graph transformer method for drug-target interaction (DTI) prediction, which uses a proteome-wide knowledge graph.
- PIGLET was benchmarked against existing models on the Human dataset using random and drug-based splits.
- PIGLET showed similar performance to other models on random splits but outperformed them on the more rigorous drug-based split.
Abstract
Drug-target interaction (DTI) prediction is a key task for computer-aided drug development that has been widely approached by deep learning models. Despite extremely high reported performance, these models have yet to find widespread success in accelerating real-world drug discovery. In contrast with the most common approach of creating embeddings from one-dimensional or three-dimensional representations of the input drug and input target, we create a novel graph transformer method for DTI prediction that operates on a proteome-wide knowledge graph of binding pocket similarity, protein-protein interactions, drug similarity, and known binding relationships. We benchmark our method, named PIGLET, against existing DTI prediction models on the Human dataset. We assess performance with two different splitting strategies: the frequently reported random split, and a novel, more rigorous drug-based split. All models perform similarly well on the random split, and PIGLET outperforms all models on the drug-based split. We highlight the utility of PIGLET through a real-world drug discovery case study.
bioinformatics · 2026-02-18 · v1
Influence of molecular representation and charge on protein-ligand structural predictions by popular co-folding methods
Bugrova, A.; Orekhov, P.; Gushchin, I.
AI Summary
- This study investigated how the input format (CCD or SMILES) and charge of ligands (methylamine and acetic acid) affect protein-ligand structural predictions by four algorithms: AlphaFold 3, Boltz-2, Chai-1, and Protenix-v1.
- Results showed that the input format significantly influenced prediction outcomes more than protonation, while changes in charge did not align with experimental expectations.
- The study suggests improving prediction algorithms by ensuring consistency across input formats and incorporating protonation steps in training and prediction.
Abstract
Recently developed deep learning-based tools can effectively generate structural models of complexes of proteins and non-proteinaceous compounds. While some of their predictive capabilities are truly exciting, others remain to be thoroughly tested. Here, we probe whether the ligand input format (Chemical Component Dictionary, CCD, or Simplified Molecular Input Line Entry System, SMILES) and charge (which depends on protonation) affect the results of the predictions by four popular algorithms: AlphaFold 3, Boltz-2, Chai-1, and Protenix-v1. We chose methylamine and acetic acid as two of the simplest titratable chemicals that are omnipresent in proteins as amino and carboxy moieties, and are consequently ubiquitous in the Protein Data Bank models that are most commonly used for training. Unexpectedly, we found that for both molecules, in many cases the input format affected the prediction results, and did so much more strongly than protonation, whereas changes in the formally specified charge of the molecules did not lead to the changes in binding expected from experiments. We conclude that (i) ensuring identical results irrespective of input formats and (ii) inclusion of protonation-related steps into training and prediction pipelines are the two available paths for improvement of protein-ligand structure prediction algorithms.
bioinformatics · 2026-02-18 · v1
Learning Mappings from Cryo-EM Images to Atomic Coordinates via Latent Representations
Abid, E.; Jonic, S.
AI Summary
- The study investigates if supervised learning can map noisy cryo-EM images to 3D atomic coordinates without pose recovery, using a convolutional auto-encoder to generate latent representations.
- A regression network then predicts atomic coordinates from these latents.
- Results showed mean RMSDs of 2.11 Å for adenylate kinase and 0.80 Å for nucleosome core particles, demonstrating that latent representations can effectively preserve necessary structural information.
Abstract
Single-particle cryo-electron microscopy (cryo-EM) aims to determine three-dimensional (3D) structures of biomolecular complexes from noisy two-dimensional (2D) projection images acquired at unknown orientations. The presence of pose uncertainty and continuous conformational heterogeneity makes high-resolution reconstruction challenging. Here, we investigate, in a controlled synthetic setting, whether supervised learning can map noisy cryo-EM single-particle images to atomic coordinates without pose recovery or 2D projection calculations. We propose a convolutional auto-encoder to compress particle images into their corresponding latent representations, followed by a regression network to predict 3D atomic coordinates from these image latents. We show the performance of this approach using synthetic datasets of pairs of particle images and conformational models of adenylate kinase and nucleosome core particles, generated using a realistic cryo-EM forward model based on Normal Mode Analysis for simulating dynamics. Inference yielded mean RMSDs of 2.11 Å for all-atom models of adenylate kinase (1,656 atoms) and 0.80 Å for the coarse-grained models of the nucleosome (1,041 C-P atoms). These results indicate that compact image latents preserve pose- and conformation-related information sufficiently well to support atomic coordinate regression. This provides a quantitative proof-of-principle for coupling image and structure spaces toward fast estimation of conformational variability in cryo-EM.
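The two-stage pipeline (convolutional auto-encoder for image latents, then a regressor from latents to coordinates) can be sketched in a few lines. The image size, latent dimension, and layer choices below are placeholders, not the architecture used in the paper.

```python
# Minimal sketch: compress particle images to latents with a conv auto-encoder,
# then regress 3D atomic coordinates from the latents. All sizes are illustrative.
import torch
import torch.nn as nn

img_size, latent_dim, n_atoms = 64, 128, 1656

encoder = nn.Sequential(
    nn.Conv2d(1, 16, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
    nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
    nn.Flatten(), nn.Linear(32 * 16 * 16, latent_dim),
)
decoder = nn.Sequential(
    nn.Linear(latent_dim, 32 * 16 * 16), nn.Unflatten(1, (32, 16, 16)),
    nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1),
)
regressor = nn.Sequential(
    nn.Linear(latent_dim, 512), nn.ReLU(), nn.Linear(512, n_atoms * 3),
)

imgs = torch.randn(8, 1, img_size, img_size)        # stand-in noisy projection images
latents = encoder(imgs)
recon = decoder(latents)                             # reconstruction target for AE training
coords = regressor(latents).view(-1, n_atoms, 3)     # predicted atomic coordinates
print(recon.shape, coords.shape)
```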
bioinformatics · 2026-02-18 · v1
Guided tokenization and domain knowledge enhance genomic language models' performance
Mahangade, V.; Mollerus, M.; Crandall, K. A.; Rahnavard, A.
AI Summary
- The study introduces Guided Tokenization (GT), which prioritizes biologically significant subsequences for tokenization in genomic language models (gLMs).
- GT, combined with domain adaptation, enhances the representation quality and classification accuracy of gLMs.
- This approach improves performance in tasks like DNA sequence classification, promoter detection, and antimicrobial resistance classification, particularly in smaller models.
Abstract
Adapting language models to genomic and metagenomic sequences presents unique challenges, particularly in tokenization and task-specific generalization. Standard methods, such as fixed-length k-mers or byte pair encoding, often fail to preserve biologically meaningful patterns essential for downstream tasks. We introduce Guided Tokenization (GT), a strategy that prioritizes biologically and statistically important subsequences based on importance scores, model attention, and class distributions. Combined with domain adaptation, which incorporates prior domain specific biological knowledge, this approach improves both representation quality and classification accuracy in compact genomic language models (gLMs). GT enhances biological awareness in genomic language models, particularly for effective small and mid-sized models across key tasks, including DNA sequence read classification, promoter detection, antimicrobial resistance classification, and targeted amplicon taxonomic profiling. Our results highlight the promise of guided tokenization and domain-aware modeling for building efficient, biologically grounded language models for scalable genomic applications.
bioinformatics · 2026-02-18 · v1
Supporting Metadata Curation from Public Life Science Databases Using Open-Weight Large Language Models
Shintani, M.; Andrade, D.; Bono, H.
AI Summary
- The study developed a workflow using large language models (LLMs) for automated metadata curation in public life science databases to improve data reuse.
- Benchmarking with Arabidopsis RNA sequencing data showed that LLMs significantly outperformed simple keyword searches, with open-weight models achieving near-perfect classification (F1>0.98).
- Utilizing LLM confidence scores allows for automatic processing of high-confidence cases, enhancing scalability and reproducibility in metadata curation.
Abstract
Although the Gene Expression Omnibus and other public repositories are expanding rapidly, curation across these databases has not kept pace. Data reuse is often hindered by unstandardized metadata comprising unstructured text. To address this, we developed a workflow that combines retrieval via an application programming interface with semantic filtering using large language models (LLMs) for automated curation. We benchmarked multiple LLMs using metadata from 150 candidate Arabidopsis RNA sequencing projects to classify samples treated with exogenous abscisic acid and their controls. Simple keyword searches yielded many false positives (F1=0.59); classification using LLMs significantly improved performance. Several open-weight models achieved a nearly perfect performance (F1>0.98), comparable to that of closed models. We also found that utilizing LLM confidence scores enables high-confidence cases to be processed automatically. These results suggest that open-weight LLMs can support scalable and reproducible metadata curation in local environments, providing a foundation for accelerating public dataset reuse.
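The confidence-gated routing described above is easy to picture: samples the model classifies with high confidence are accepted automatically, the rest go to manual review. In the sketch below, `classify_with_llm` is a hypothetical stand-in for the paper's LLM call, and the threshold and record are invented.

```python
# Sketch of confidence-gated metadata curation. `classify_with_llm` is a
# placeholder, not an actual API from the study's workflow.
def classify_with_llm(metadata_text):
    """Return (label, confidence); placeholder for a real LLM request."""
    return ("ABA-treated", 0.93)

def curate(records, threshold=0.9):
    auto, manual = [], []
    for rec in records:
        label, conf = classify_with_llm(rec["metadata"])
        (auto if conf >= threshold else manual).append((rec["id"], label, conf))
    return auto, manual

auto, manual = curate([{"id": "GSM000001", "metadata": "seedlings sprayed with 10 uM ABA"}])
print(len(auto), "auto-accepted;", len(manual), "sent to manual curation")
```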
bioinformatics · 2026-02-18 · v1
Cancer Driver Gene Discovery: A Patient-Level Statistical Framework
Bahari, F.; Montazeri, H.
AI Summary
- The study introduces iDriver, a statistical framework designed to identify cancer driver genes by integrating mutation recurrence and functional impact at the patient level, addressing the challenge of patient-specific mutation burden variability.
- When applied to 29 cancer types, iDriver identified both known and novel cancer drivers in coding and noncoding regions, demonstrating clinical and biological relevance.
- In benchmarks, iDriver outperformed 12 other methods, achieving top rankings for identifying known cancer drivers in both coding and noncoding genomic elements.
Abstract
Tumor genomes harbor a mixture of neutral and positively selected mutations, yet distinguishing true cancer drivers remains a major challenge. Several factors can obscure the detection of selection signals, among which patient-specific variation in mutational burden plays a significant role. Current approaches often fail to account for the heterogeneity in mutation burden across different patients; in particular, no existing method explicitly accounts for it when integrating both mutation recurrence and functional impact. Here we present iDriver, a probabilistic graphical model that integrates both mutation recurrence and functional impact at the individual-patient level, enabling an enhanced estimation of positive selection across functional genomic elements. Applying iDriver to 29 cancer types, we identify both known and previously unrecognized drivers spanning coding and noncoding regions, and provide evidence for their clinical and biological relevance. In comprehensive benchmarks against 12 established driver discovery methods, iDriver consistently outperformed all competitors, achieving the highest rankings for known cancer drivers across both coding and noncoding elements.
bioinformatics · 2026-02-18 · v1
Privacy-Preserving Pangenome Graphs
Blindenbach, J.; Soni, S.; Gursoy, G.
AI Summary
- The study introduces PanMixer, a framework for privacy-preserving pangenome graph releases, addressing privacy concerns by selectively obfuscating individual haplotypes.
- PanMixer formulates the privacy-utility trade-off as a knapsack problem, using information theory for privacy risk and graph properties for utility.
- Testing on a draft human pangenome of 47 individuals showed PanMixer reduces re-identification risk while maintaining the accuracy of downstream genomic applications.
Abstract
The human pangenome reference, often represented as a graph, promises to capture genetic diversity across populations, but open release of individual haplotypes raises significant privacy concerns, including risks of re-identification and inference of sensitive traits. To address these challenges, we introduce PanMixer, a framework for privacy-preserving pangenome graph releases that selectively obfuscates an individual's haplotypes while retaining the utility of the reference graph. PanMixer formulates the privacy-utility trade-off as a knapsack problem, where privacy risk is quantified using information theory and utility is measured using graph properties. Using the recently released draft human pangenome containing 47 individuals, we show that PanMixer robustly reduces re-identification risk under linkage attacks and genome reconstruction attempts. We also show that PanMixer preserves the accuracy of key downstream applications, including allele frequency estimation, linkage disequilibrium analysis, and read mapping. By addressing privacy concerns, PanMixer enables the inclusion of individuals, particularly those from underrepresented populations, who might otherwise be reluctant to contribute but seek representation in future genomic studies. Our results provide both a practical tool and a generalizable framework for balancing privacy and utility in future large-scale pangenome references.
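The knapsack framing mentioned in the abstract can be illustrated with a small 0/1 knapsack: each candidate obfuscation carries a privacy gain and a utility cost, and the goal is the largest gain under a utility-loss budget. The items and numbers below are invented; PanMixer's actual information-theoretic risk scores and graph-based utility measures are not reproduced.

```python
# Sketch of the privacy-utility knapsack: pick obfuscations maximizing privacy
# gain under a utility-cost budget. Values are illustrative, not from the paper.
def knapsack(items, budget):
    # items: list of (privacy_gain, utility_cost) with integer costs
    best = [0] * (budget + 1)
    for gain, cost in items:
        for b in range(budget, cost - 1, -1):
            best[b] = max(best[b], best[b - cost] + gain)
    return best[budget]

candidates = [(5.2, 3), (3.1, 2), (7.8, 5), (1.4, 1)]   # (risk removed, utility cost)
print(knapsack(candidates, budget=6))                    # ~9.7: best gain within budget
```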
bioinformatics · 2026-02-18 · v1
BOND-PEP: topology-conditioned bipartite alignment for evidence-grounded peptide binder generation
Ding, W.
AI Summary
- BOND-PEP is a novel framework for generating peptide binders that uses empirical binding evidence to condition peptide generation explicitly at the residue level.
- It achieves state-of-the-art performance in terms of low perplexity, high hit rates, and sequence novelty, outperforming existing methods.
- The approach provides a practical method for de novo peptide binder design, effective even under conditions of noisy labels and distribution shift.
Abstract
Peptide binders can modulate proteins that remain challenging for small molecules, but discovering high-affinity, selective peptides is still slow and sample-intensive. Sequence-first generators could scale design when structures are unavailable or conformationally heterogeneous, yet they often trade diversity for control: unconstrained sampling is inefficient while conditioning remains largely implicit. This limitation is exacerbated by the uneven transfer of protein language model priors to short peptides. Here we present BOND-PEP, a retrieval-augmented, bipartite-aligned, topology-conditioned framework that converts empirical binding evidence into an explicit, residue-resolved conditioning state for peptide generation. BOND-PEP shows low perplexity together with satisfactory free-generation hit rates and sequence novelty under a fair evaluation protocol and decoding budget. Compared with existing peptide generation methods, BOND-PEP achieves state-of-the-art results that match or improve upon validated peptide-protein sequence pairs. In total, BOND-PEP provides a practical, sequence-only route to controllable de novo peptide binder generation under noisy labels and distribution shift.
bioinformatics · 2026-02-18 · v1
Information-Content-Informed Kendall-tau Correlation Methodology: Interpreting Missing Values in Metabolomics as Potentially Useful Information
Flight, R. M.; Bhatt, P. S.; Moseley, H. N. B.
AI Summary
- The study introduces the Information-Content-Informed Kendall-tau (ICI-Kt) methodology to handle missing values in metabolomics data by treating them as informative, particularly when they are left-censored due to detection limits.
- Using simulated and over 700 experimental datasets, the approach was shown to enhance the interpretation of correlation by including these missing values, improving outlier detection and feature network construction.
- The methodology is implemented in R and Python, available on GitHub, facilitating fast calculations for large datasets.
Abstract
Background: Almost all correlation measures currently available are unable to directly handle missing values. Typically, missing values are either ignored completely by removing them or are imputed and used in the calculation of the correlation coefficient. In either case, the correlation value will be impacted by the implicit assumption that the missing data represent no useful information. However, missing values occur in real data sets for a variety of reasons. In metabolomics data sets, a major reason for missing values is that a specific measurable phenomenon falls below the detection limits of the analytical instrumentation (left-censored values). These missing data are not missing at random, but represent potentially useful information by virtue of their "missingness" at one end of the data distribution. Methods: To include this information due to left-censored missingness, we propose the information-content-informed Kendall-tau (ICI-Kt) methodology. We developed a statistical test and then show that most missing values in metabolomics datasets are the result of left-censorship. Next, we show how left-censored missing values can be included within the definition of the Kendall-tau correlation coefficient, and how that inclusion leads to an interpretation of information being added to the correlation. We also implement calculations for additional measures of theoretical maxima and pairwise completeness that add further layers of information interpretation in the methodology. Results: Using both simulated and over 700 experimental data sets from The Metabolomics Workbench, we demonstrate that the ICI-Kt methodology allows for the inclusion of left-censored missing data values as interpretable information, enabling both improved determination of outlier samples and improved feature-feature network construction. Conclusions: We provide explicitly parallel implementations in both R and Python that allow fast calculations of all the variables used when applying the ICI-Kt methodology on large numbers of samples. The ICI-Kt methods are available as an R package and Python module on GitHub at https://github.com/moseleyBioinformaticsLab/ICIKendallTau and https://github.com/moseleyBioinformaticsLab/icikt, respectively.
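To see why treating left-censored values as information changes the correlation, compare a complete-case Kendall's tau with one computed after pushing missing values below the observed range so they count as tied low observations. This is only a rough approximation of the behaviour; the actual ICI-Kt definition handles missing-missing and missing-observed pairs explicitly, and the data below are invented.

```python
# Rough illustration: keep left-censored missing values as tied low observations
# instead of dropping them before computing Kendall's tau. Not the ICI-Kt formula.
import numpy as np
from scipy.stats import kendalltau

x = np.array([5.1, np.nan, 2.3, np.nan, 7.8, 1.2])
y = np.array([4.9, np.nan, np.nan, 0.4, 8.1, 1.5])

def censored_fill(v):
    out = v.copy()
    out[np.isnan(out)] = np.nanmin(v) - 1.0   # push missing values below the observed range
    return out

both = ~np.isnan(x) & ~np.isnan(y)
tau_drop, _ = kendalltau(x[both], y[both])              # complete-case tau
tau_cens, _ = kendalltau(censored_fill(x), censored_fill(y))  # censoring-aware approximation
print(tau_drop, tau_cens)
```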
bioinformatics · 2026-02-17 · v5
ConNIS and labeling instability: new statistical methods for improving the detection of essential genes in TraDIS libraries
Hanke, M.; Harten, T.; Foraita, R.
AI Summary
- The study introduces ConNIS, a new method for determining gene essentiality in TraDIS data by analyzing the probability of consecutive non-insertion sites within genes.
- ConNIS was shown to outperform existing methods in simulations and real-world scenarios, especially at low to medium insertion densities.
- A subsample-based instability criterion was developed to set methodologically sound parameter values, enhancing the precision of TraDIS analyses.
Abstract
The identification of essential genes in Transposon Directed Insertion Site Sequencing (TraDIS) data relies on the assumption that transposon insertions occur randomly in non-essential regions, leaving essential genes largely insertion-free. While intragenic insertion-free sequences have been considered a reliable indicator of gene essentiality, so far, no exact probability distribution for these sequences has been proposed. Further, many methods require setting thresholds or parameter values a priori without providing any statistical basis, limiting the comparability of results. Here, we introduce Consecutive Non-Insertion Sites (ConNIS), a novel method for gene essentiality determination. ConNIS provides an analytic solution for the probability of observing insertion-free sequences within genes of given length and considers variation in insertion density across the genome. Based on an extensive simulation study and real-world scenarios, ConNIS was found to be superior to prevalent state-of-the-art methods, particularly in scenarios with a low or medium insertion density. In addition, our results show that the precision of existing methods can be improved by incorporating a simple weighting factor for the genome-wide insertion density. To set methodically grounded parameter and threshold values for TraDIS methods, a subsample-based instability criterion was developed. Application of this criterion in real and synthetic data settings demonstrated its effectiveness in selecting well-suited parameter/threshold values across methods. A ready-to-use R package and an interactive web application are provided to facilitate application and reproducibility.
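The underlying signal is simple to quantify in the easiest case: if insertions hit each site independently with probability p, an entirely insertion-free gene of length L occurs by chance with probability (1 - p)^L. ConNIS's exact distribution for insertion-free runs inside a gene is more involved and not reproduced here; the numbers below are purely illustrative.

```python
# Back-of-the-envelope chance of a fully insertion-free gene under a uniform
# insertion density p per site. Illustrative values only.
p = 0.05          # genome-wide insertion density (insertions per site)
for L in (200, 500, 1000):
    p_free = (1 - p) ** L
    print(f"gene length {L:>4}: P(no insertions by chance) = {p_free:.2e}")
```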
bioinformatics · 2026-02-17 · v3
Minimizer Density revisited: Models and Multiminimizers
Ingels, F.; Robidou, L.; Martayan, I.; Marchet, C.; Limasset, A.
AI Summary
- This study revisits the concept of density in k-mer-based sequence analysis by linking it to the gap between selected positions, proposing a probabilistic model that assumes equally distributed gaps.
- A novel technique, multiminimizers, is introduced where each k-mer is associated with multiple candidate minimizers, leading to a semi-local scheme that reduces density at the cost of increased computation time.
- The study also introduces deduplicated density, showing that multiminimizers improve this metric, though globally minimizing it is NP-complete, and provides an efficient SIMD-accelerated Rust implementation.
Abstract
High-throughput sequence analysis commonly relies on k-mers (words of fixed length k) to remain tractable at modern scales. These k-mer-based pipelines can employ a sampling step, which in turn allows grouping consecutive k-mers into larger strings to improve data locality. Although other sampling strategies exist, local schemes have become standard: such schemes map each k-mer to the position of one of its characters. A key performance measure of these schemes is their density, defined as the expected fraction of selected positions. The most widely used local scheme is the minimizer scheme: given an integer m ≤ k, a minimizer scheme associates each k-mer to the starting position of one of its m-mers, called its minimizer. Being a local scheme, the minimizer scheme guarantees covering all k-mers of a sequence, with a maximal gap between selected positions of w = k - m + 1. Recent works have established near-tight lower bounds on achievable density under standard assumptions for local schemes, and state-of-the-art schemes now operate close to these limits, suggesting that further improvements under the classical notion of density will face diminishing returns. Hence, in this work, we aim to revisit the notion of density and broaden its scope. As a first contribution, we draw a link between density and the gap between consecutive selected positions. We propose a probabilistic model allowing us to establish that the density of a local scheme is exactly the inverse of the expected gap between the positions it selects, under the minimal and only assumption that said gaps are somehow equally distributed. We emphasize here that our model makes no assumptions about how positions are selected, unlike the classical models in the literature. Our result introduces a novel method for computing the density of a local scheme, extending beyond classical settings. Based on this analysis, we introduce a novel technique, named multiminimizers, by associating each k-mer with a bounded set of candidate minimizers rather than a single one. The candidate furthest away (in a precise sense defined in the article) is selected. Since the decision is made by taking advantage of a context beyond a single k-mer, this technique is not a local scheme - as a result, we propose the concept of semi-local schemes, which provide a broader framework within which our method fits. Using the multiminimizer trick on a local scheme reduces its density at the expense of a controlled increase in computation time. We show that this method, when applied to random (hash-based) minimizers and to open-closed mod-minimizers, achieves asymptotically optimal density, representing, to our knowledge, the first construction converging to this limit. Our third contribution is the introduction of the deduplicated density, which measures the fraction of distinct minimizers used to cover all k-mers of a set of sequences. While this problem has gained traction in applications such as assembly, filtering, and pattern matching, standard minimizer schemes are often used as a proxy, blurring the distinction between the two objectives (minimizing the number of selected positions or the number of selected minimizers). Although related to the classical notion of density, deduplicated density differs in both definition and suitable constructions, and must be analyzed in its own right, together with its precise connections to standard density.
We show that multiminimizers can also improve this metric, but that globally minimizing deduplicated density in this setting is NP-complete, and we instead propose a local heuristic with strong empirical behavior. Finally, we show that multiminimizers can be computed efficiently, and provide a SIMD-accelerated Rust implementation together with proofs of concept demonstrating reduced memory footprints on core sequence-analysis tasks. We conclude with open theoretical and practical questions that remain to be addressed in the area of density.
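To make the density/gap relationship concrete, here is a minimal sketch of a classical single-minimizer (hash-based random) scheme, not the paper's multiminimizer construction or its SIMD Rust implementation. Function names and parameters are illustrative. It measures density both as the fraction of selected positions and as the inverse of the mean gap between consecutive selections; under the paper's assumption the two estimates should agree.

```python
# Minimal sketch (assumed illustration, not the authors' code): random minimizers
# and two empirical density estimates on a random sequence.
import random

def minimizer_positions(seq, k, m, seed=0):
    """Selected positions for a classical hash-based random minimizer scheme."""
    rng = random.Random(seed)
    ranks = {}  # lazily assigned pseudo-random ranks for m-mers

    def rank(mmer):
        if mmer not in ranks:
            ranks[mmer] = rng.random()
        return ranks[mmer]

    selected = set()
    for i in range(len(seq) - k + 1):                      # slide over every k-mer
        kmer = seq[i:i + k]
        j = min(range(k - m + 1), key=lambda p: rank(kmer[p:p + m]))
        selected.add(i + j)                                # absolute position of the chosen minimizer
    return sorted(selected)

seq_rng = random.Random(1)
seq = "".join(seq_rng.choice("ACGT") for _ in range(100_000))
k, m = 31, 21
pos = minimizer_positions(seq, k, m)

density = len(pos) / (len(seq) - m + 1)                    # fraction of selected m-mer positions
gaps = [b - a for a, b in zip(pos, pos[1:])]
print(f"density ~ {density:.4f}, 1/mean(gap) ~ {1 / (sum(gaps) / len(gaps)):.4f}")
```

For w = k - m + 1 = 11, both numbers should land near the classical 2/(w+1) value for random minimizers.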
bioinformatics2026-02-17v2ProteomeLM: A proteome-scale language model enables accurate and rapid prediction of protein-protein interactions and gene essentiality across taxa
Malbranke, C.; Zalaffi, G. P.; Bitbol, A.-F.AI Summary
- ProteomeLM, a transformer-based model, was developed to predict protein-protein interactions (PPI) and gene essentiality by learning from entire proteomes across various species.
- The model uses contextualized protein representations to predict PPI with high accuracy and speed, surpassing traditional coevolution-based methods.
- ProteomeLM-PPI and ProteomeLM-Ess, extensions of ProteomeLM, achieve state-of-the-art performance in PPI prediction and gene essentiality prediction across different taxa.
Abstract
Language models trained on biological sequences are advancing inference tasks from the scale of single proteins to that of genomic neighborhoods. Here, we introduce ProteomeLM, a transformer-based language model that uniquely operates on entire proteomes from species spanning the tree of life. ProteomeLM is trained to reconstruct masked protein embeddings using the whole proteomic context, yielding contextualized protein representations that reflect proteome-scale functional constraints. Notably, ProteomeLM's attention coefficients encode protein-protein interactions (PPI), despite being trained without interaction labels. Furthermore, it enables interactome-wide PPI screening that is substantially more accurate, and orders of magnitude faster, than amino-acid coevolution-based methods. We further develop ProteomeLM-PPI, a supervised model that combines ProteomeLM embeddings and attention coefficients to achieve state-of-the-art PPI prediction across benchmarks and species. Finally, we introduce ProteomeLM-Ess, a supervised gene essentiality predictor that generalizes across diverse taxa. Our results demonstrate the potential of proteome-scale language models for addressing function and interactions at the organism level.
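A minimal sketch of the masked-embedding reconstruction objective described above, assuming a generic transformer encoder over precomputed per-protein embeddings; the layer sizes, the `masked_reconstruction_loss` helper, and the toy data are assumptions, not the released ProteomeLM architecture.

```python
# Minimal sketch (assumed): mask a fraction of a proteome's protein embeddings
# and reconstruct them from the remaining proteomic context.
import torch
import torch.nn as nn

d_model = 256
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=4,
)
mask_token = nn.Parameter(torch.zeros(d_model))

def masked_reconstruction_loss(protein_embs, mask_frac=0.15):
    """protein_embs: (1, n_proteins, d_model) precomputed per-protein embeddings."""
    mask = torch.rand(protein_embs.shape[1]) < mask_frac           # proteins to hide
    x = torch.where(mask.view(1, -1, 1), mask_token.view(1, 1, -1), protein_embs)
    out = encoder(x)                                               # contextualized by the rest of the proteome
    return nn.functional.mse_loss(out[0, mask], protein_embs[0, mask])

# Toy usage: random vectors standing in for per-protein (e.g. ESM-style) embeddings.
proteome = torch.randn(1, 500, d_model)
loss = masked_reconstruction_loss(proteome)
loss.backward()
```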
bioinformatics2026-02-17v2Compressed inverted indexes for scalable sequence similarity
Ingels, F.; Vandamme, L.; Girard, M.; Agret, C.; Cazaux, B.; Limasset, A.AI Summary
- The study addresses the scalability limits of MinHash-derived sketching methods for nucleotide sequence similarity by introducing a novel framework using compressed inverted indexes, which match the space complexity of forward indexes.
- They developed algorithms for efficient all-vs-all comparisons and introduced early-pruning schemes to optimize time and memory usage, maintaining high accuracy in similarity searches.
- Onika, the resulting tool, significantly accelerates large-scale similarity searches and reduces resource usage, as demonstrated on various datasets, while maintaining sensitivity at practical similarity thresholds.
Abstract
Modern sequencing continues to drive explosive growth of nucleotide sequence archives, pushing MinHash-derived sketching methods to their practical scalability limits. State-of-the-art tools such as Mash, Dashing2, and Bindash2 provide compact sketches and accurate similarity estimates for large collections, yet ultimately rely on forward indexes that materialize sketches as explicit fingerprint vectors. As a result, large-scale similarity search and exhaustive collection-versus-collection comparison still incur quadratic resource usage. In this work, we revisit the architecture of sketch-based indexes and provide a novel framework for scalable similarity search over massive sketch collections. Our first contribution is a formal cost model for sketch comparison, within which we prove that inverted indexes on sketch fingerprints, equipped with suitably compressed posting lists, achieve the same asymptotic space complexity as standard forward indexes, thereby eliminating the perceived memory penalty traditionally associated with inverted indexes. Building on this model, we design algorithms for all-vs-all comparison between two inverted indexes whose running times are proportional to the total number of matching sketch positions, leading to output-sensitive optimality and enabling efficient large-scale similarity comparisons. Our second contribution leverages the prevalence of similarity thresholds in downstream applications such as clustering, redundancy filtering, and database search. We introduce two early-pruning schemes: an exact criterion that safely eliminates pairs guaranteed not to reach a target Jaccard similarity, and a probabilistic strategy that exploits partial match statistics to discard pairs unlikely to exceed this threshold. Together, these schemes address both time and memory bottlenecks while maintaining rigorous guarantees on retained high-similarity pairs and providing explicit control of the false-rejection probability for the probabilistic variant. Finally, we instantiate these ideas in Onika, an open-source Rust implementation based on compressed inverted posting lists available at github.com/Malfoy/Onika. Onika incorporates a similarity-aware document reordering strategy that restructures sketch identifiers to further shrink index size and improve locality, particularly for redundant collections. Experiments on large bacterial genome repositories, synthetic low-redundancy benchmarks, and long-read HiFi sequencing datasets demonstrate that Onika matches or improves upon the sketch sizes of leading tools while accelerating large-scale search and collection-versus-collection comparisons by up to several orders of magnitude in low-redundancy regimes, without compromising sensitivity at practically relevant similarity thresholds.
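The following sketch illustrates the inverted-index idea in its simplest form: posting lists keyed by (slot, fingerprint) so that all-vs-all comparison only visits sketch pairs that actually share fingerprints, which is what makes the running time output-sensitive. The function names are illustrative, the posting lists are uncompressed, and the final Jaccard threshold filter is a post-hoc stand-in rather than Onika's exact or probabilistic early-pruning criteria.

```python
# Minimal sketch (assumed, far simpler than Onika): inverted index over per-slot
# sketch fingerprints, with an estimated-Jaccard threshold applied afterwards.
from collections import defaultdict

def build_inverted_index(sketches):
    """sketches: list of equal-length fingerprint tuples (one fingerprint per slot)."""
    index = defaultdict(list)                 # (slot, fingerprint) -> posting list of sketch ids
    for sid, sketch in enumerate(sketches):
        for slot, fp in enumerate(sketch):
            index[(slot, fp)].append(sid)
    return index

def similar_pairs(sketches, index, threshold):
    slots = len(sketches[0])
    matches = defaultdict(int)                # (i, j) -> number of matching slots
    for postings in index.values():           # work proportional to matching positions
        for a in range(len(postings)):
            for b in range(a + 1, len(postings)):
                matches[(postings[a], postings[b])] += 1
    return {pair: m / slots for pair, m in matches.items() if m / slots >= threshold}

sketches = [(1, 7, 3, 9), (1, 7, 5, 9), (2, 8, 3, 0)]
index = build_inverted_index(sketches)
print(similar_pairs(sketches, index, threshold=0.5))    # {(0, 1): 0.75}
```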
bioinformatics2026-02-17v2A Discrete Language of Protein Words for Functional Discovery and Design
Guo, Z.; Wang, Z.; Chai, Y.; XU, K.; Li, M.; Li, W.; Ou, G.AI Summary
- The study introduces a framework that discretizes protein sequences into a vocabulary of "Protein Words" to capture higher-order structural and functional signals, outperforming traditional residue-level models in tasks like remote homology and mutation effect prediction.
- Analysis across 54 species showed that these words correlate with evolutionary complexity, particularly in eukaryotic disordered regions.
- The framework identified ADMAP1 as a new regulator of sperm motility and enabled the design of functional cofilin variants, demonstrating its utility in both discovery and protein engineering.
Abstract
Proteins function through hierarchical modules, yet conventional models treat sequences as linear strings of residues, overlooking the recurrent multi-residue patterns, or "Protein Words," that govern biological architecture. We introduce a physics-aware framework that discretizes protein space into a learnable vocabulary derived from the evolutionary record. By encoding proteins as sequences of discrete "words," our model captures higher-order structural and functional signals inaccessible to residue-level models, achieving highly competitive performance against widely established baselines in remote homology and mutation effect prediction. Analysis across 54 species reveals that these words track evolutionary complexity, specifically identifying the expansion of eukaryotic disordered regions. We demonstrate the discovery potential of this semantic axis by identifying ADMAP1 as a previously uncharacterized regulator of sperm motility, validated via CRISPR-Cas9 knockout mice. Finally, this vocabulary enables programmable design, generating functional cofilin variants despite high sequence divergence. This work establishes a linguistically inspired framework for deciphering the dark proteome and engineering biological function.
bioinformatics2026-02-17v1TITAN-BBB: Predicting BBB Permeability using Multi-Modal Deep-Learning Models
de Oliveira, G. B.; Saeed, F.AI Summary
- TITAN-BBB uses a multi-modal deep-learning approach integrating tabular, image, and text features to predict blood-brain barrier (BBB) permeability.
- The model was trained on the largest aggregated BBB permeability dataset, achieving 86.5% balanced accuracy in classification and 0.436 mean absolute error in regression.
- TITAN-BBB outperformed existing models by 3.1% in accuracy and reduced regression error by 20%.
Abstract
Computational prediction of blood-brain barrier (BBB) permeability has emerged as a vital alternative to traditional experimental assays, which are often too resource-intensive and low-throughput to meet the demands of early-stage drug discovery. While early machine learning approaches have shown promise, integration of traditional chemical descriptors with deep learning embeddings remains an underexplored frontier. In this paper, we introduce TITAN-BBB, a multi-modal deep-learning architecture that utilizes tabular, image, and text-based features and combines them using attention mechanisms. To evaluate, we aggregated multiple literature sources to create the largest BBB permeability dataset to date, enabling robust training for both classification and regression tasks. Our results demonstrate that TITAN-BBB achieves 86.5% balanced accuracy on classification tasks and a mean absolute error of 0.436 for regression, outperforming the state-of-the-art by 3.1 percentage points in balanced accuracy and reducing the regression error by 20%. Our approach also outperforms state-of-the-art models in both classification and regression performance, demonstrating the benefits of combining deep and domain-specific representations. The source code is publicly available at https://github.com/pcdslab/TITAN-BBB. The inference-ready model is hosted on Hugging Face at https://huggingface.co/SaeedLab/TITAN-BBB, and the aggregated BBB permeability datasets are available at https://huggingface.co/datasets/SaeedLab/BBBP.
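As a rough illustration of attention-based multi-modal fusion, the sketch below pools three per-modality embeddings with a learned query. All names (`fusion_query`, `predict_bbb`) and dimensions are assumptions; this is a generic pattern, not the TITAN-BBB architecture.

```python
# Minimal sketch (assumed): attention-weighted fusion of tabular, image, and
# text embeddings into a single BBB-permeability probability.
import torch
import torch.nn as nn

d = 64
fusion_query = nn.Parameter(torch.randn(1, 1, d))
attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)
classifier = nn.Linear(d, 1)

def predict_bbb(tabular_emb, image_emb, text_emb):
    """Each argument: (batch, d) output of a modality-specific encoder."""
    modalities = torch.stack([tabular_emb, image_emb, text_emb], dim=1)   # (batch, 3, d)
    query = fusion_query.expand(modalities.shape[0], -1, -1)
    fused, _ = attn(query, modalities, modalities)                         # attend over modalities
    return torch.sigmoid(classifier(fused.squeeze(1)))                     # permeability probability

print(predict_bbb(torch.randn(2, d), torch.randn(2, d), torch.randn(2, d)))
```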
bioinformatics2026-02-17v1FiCOPS: Hardware/Software Co-Design of FPGA Computational Framework for Mass Spectrometry-Based Peptide Database Search
Kumar, S.; Zambreno, J.; Khokhar, A.; Akram, S.; Saeed, F.AI Summary
- The study aimed to enhance the speed and efficiency of peptide database search from mass spectrometry data by developing FiCOPS, an FPGA-based computational framework.
- FiCOPS was designed using a hardware/software co-design approach, focusing on parallelism and reducing computational bottlenecks in the database search algorithm.
- Testing on the Intel Stratix 10 FPGA showed FiCOPS achieved a 3.5x speed-up over CPU solutions and reduced power consumption by 3x and 5x compared to CPU and GPU solutions, respectively.
Abstract
Improving the speed and efficiency of database search algorithms that deduce peptides from mass spectrometry (MS) data has been an active area of research for more than three decades. The significance of the need for faster database search methods has rapidly increased due to the growing interest in studying non-model organisms, meta-proteomics, and proteogenomic data, which are notorious for their enormous search space. Poor scalability of serial algorithms with the growing size of the database and increasing parameters of post-translational modifications is a widely recognized problem. While high-performance computing techniques can be used on supercomputing machines, the need for real-time, on-the-instrument solutions necessitates the development of an efficient system-on-chip that optimizes design constraints such as cost, performance, and power of the system. To showcase that such a system can work, we present an FPGA-based computational framework called FiCOPS to accelerate database search using a hardware/software co-design methodology. First, we theoretically analyze the database-search algorithm (closed-search) to reveal opportunities for parallelism and uncover computational bottlenecks. We then design an FPGA-based architectural template to exploit parallelism inherent in the search workload. We also formulate an analytical performance model for the architecture template to perform rapid design space exploration and find a near-optimal accelerator configuration. Finally, we implement our design on the Intel Stratix 10 FPGA platform and evaluate it using real-world datasets. Our experiments demonstrate that FiCOPS achieves 3.5 times speed-up over existing CPU solutions and 3 times and 5 times reduction in power consumption compared to existing CPU and GPU solutions.
bioinformatics2026-02-17v1Evaluating Single-Cell Perturbation Response Models Is Far from Straightforward
Heidari, M.; Karimpour, M.; Srivatsa, S.; Montazeri, H.AI Summary
- This study evaluates the performance of single-cell perturbation response models, highlighting the limitations of current evaluation metrics like correlation-based measures and distributional distances.
- Using cross-splitting, controlled noise experiments, and synthetic data, the research shows that metrics such as Wasserstein and Energy distances can misrepresent model performance due to issues like scale and dimensionality.
- The findings indicate that complex deep learning models often do not outperform simple baselines, suggesting a need for improved evaluation standards in developing reliable virtual-cell models.
Abstract
Predicting cellular responses to genetic and chemical perturbations remains a central challenge in single-cell biology and a key step toward building in silico virtual cells. The rapid growth of perturbation datasets and advances in deep-learning models have raised expectations for accurate and generalizable prediction. We show that these expectations are overly optimistic, largely due to the failure modes of existing evaluation metrics. In this study, using cross-splitting, controlled noise experiments, and synthetic data, we systematically evaluate both prediction models and evaluation metrics. We demonstrate that widely used metrics, including correlation-based measures and common distributional distances, are strongly influenced by scale, sparsity, and dimensionality, often misrepresenting model performance. In particular, the Wasserstein distance fails in high-dimensional gene expression spaces under variance scaling, while the Energy distance can overlook disruptions in gene-gene dependencies. Our analyses further reveal that complex deep learning models often underperform simple baselines and remain far from empirical performance bounds across multiple chemical perturbation datasets. Together, our framework exposes critical pitfalls, establishes robust evaluation guidelines, and provides a foundation for trustworthy benchmarking toward reliable virtual-cell models.
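For context on the metrics being critiqued, the sketch below shows how the energy distance between two samples of cells is commonly computed from pairwise Euclidean distances. The data are toy Gaussians and the helper name is an assumption; it does not attempt to reproduce the paper's high-dimensional failure-mode analysis, only the formula 2E|X-Y| - E|X-X'| - E|Y-Y'|.

```python
# Minimal sketch (assumed): standard energy distance between two cell samples,
# with a shifted "perturbed" sample versus a fresh draw from the same distribution.
import numpy as np
from scipy.spatial.distance import cdist

def energy_distance(x, y):
    """x: (n, genes), y: (m, genes) expression matrices."""
    return 2 * cdist(x, y).mean() - cdist(x, x).mean() - cdist(y, y).mean()

rng = np.random.default_rng(0)
control = rng.normal(size=(200, 50))
perturbed = control + rng.normal(scale=0.1, size=(200, 50)) + 0.5   # mean-shifted response
replicate = rng.normal(size=(200, 50))                              # same distribution as control

print("control vs perturbed:", round(energy_distance(control, perturbed), 3))   # clearly > 0
print("control vs replicate:", round(energy_distance(control, replicate), 3))   # near 0
```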
bioinformatics2026-02-17v1ProtFlow: Flow Matching-based Protein Sequence Design with Comprehensive Protein Semantic Distribution Learning and High-quality Generation
Kong, Z.; Zhu, Y.; Xu, Y.; Yin, M.; Hou, T.; Wu, J.; Xu, H.; Hsieh, C.-Y.AI Summary
- ProtFlow uses a flow matching algorithm to learn the comprehensive semantic distribution of protein sequences, addressing the limitations of existing models that focus on local statistics.
- The model incorporates a semantic integration network to reorganize protein representation space, enhancing global semantic capture.
- Experiments showed ProtFlow excels in generating high-quality peptides, particularly antimicrobial peptides with effective activity against various pathogens, including underrepresented species.
Abstract
Designing protein sequences with desired properties is a fundamental task in protein engineering. Recent advances in deep generative models have greatly accelerated this design process. However, most existing models face the issue of distribution centralization and focus on local compositional statistics of natural sequences instead of the global semantic organization of protein space, which confines their generation to specific regions of the distribution. These problems are amplified for functional proteins, whose sequence patterns strongly correlate with semantic representations and exhibit a long-tailed functional distribution, causing existing models to miss semantic regions associated with rare but essential functions. Here, we propose ProtFlow, a generative model designed for comprehensive semantic distribution learning of protein sequences, enabling high-quality sequence generation. ProtFlow employs a rectified flow matching algorithm to efficiently capture the underlying semantic distribution of the protein design manifold, and introduces a reflow technique enabling one-step sequence generation. We construct a semantic integration network to reorganize the representation space of large protein language models, facilitating stable and compact incorporation of global protein semantics. We pretrain ProtFlow on 2.6M peptide sequences and finetune it on antimicrobial peptides (AMPs), a representative class of therapeutic proteins exhibiting unevenly distributed activities across pathogen targets. Experiments show that ProtFlow outperforms state-of-the-art methods in generating high-quality peptides, and AMPs with desirable activity profiles across a range of pathogens, particularly against underrepresented bacterial species. These results demonstrate the effectiveness of ProtFlow in capturing the full training distribution and its potential as a general framework for computational protein design.
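The core of rectified flow matching is compact enough to sketch: points from a noise distribution and from the data are joined by straight lines, and a network is regressed onto the constant velocity along those lines; one Euler step then gives a rough one-step sample in the spirit of reflow. The network, dimensions, and toy "data" below are assumptions; ProtFlow's semantic integration network and protein representation space are not reproduced.

```python
# Minimal sketch (assumed): rectified flow-matching training step on toy vectors.
import torch
import torch.nn as nn

dim = 32
velocity_net = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(), nn.Linear(128, dim))
opt = torch.optim.Adam(velocity_net.parameters(), lr=1e-3)

def rectified_flow_step(x1):
    x0 = torch.randn_like(x1)                      # noise endpoint
    t = torch.rand(x1.shape[0], 1)                 # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1                     # point on the straight path
    target = x1 - x0                               # constant velocity along the path
    loss = nn.functional.mse_loss(velocity_net(torch.cat([xt, t], dim=1)), target)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

data = torch.randn(64, dim) + 3.0                  # toy stand-in for protein representations
for _ in range(100):
    rectified_flow_step(data)

# One-step generation: integrate the learned velocity from noise with a single Euler step.
z = torch.randn(8, dim)
sample = z + velocity_net(torch.cat([z, torch.zeros(8, 1)], dim=1))
```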
bioinformatics2026-02-17v1Diffusion Probabilistic Models for Missing-Wedge Correction in Cryo-Electron Tomography
Hasan, N.; Bertin, A.; Jonic, S.AI Summary
- The study addresses the missing-wedge (MW) distortion in cryo-electron tomography by proposing MW-RaMViD, a method for generating unacquired 2D tilt images based on the RaMViD approach.
- MW-RaMViD was adapted for cryo-ET by incorporating MRC image format, floating-point pixel intensity, and a controlled inference protocol for MW correction.
- Evaluations on synthetic datasets showed that smaller step sizes and larger conditioning windows in MW-RaMViD reduce error accumulation and improve reconstruction fidelity, as measured by RMSE and Fourier Shell Correlation.
Abstract
Interpretation of 3D cryo-electron tomography (cryo-ET) reconstructions (tomograms) is hampered by the so-called missing-wedge (MW) distortions, which arise because tilt image series used for the reconstructions are acquired in a limited angular range. While many deep-learning approaches address the correction of the MW artifacts on the level of tomograms (3D volumes), the correction at the level of 2D tilt images (generation of unacquired images) remains underexplored. We propose MW-RaMViD, a 2D tilt-image generation method for MW correction, based on Random-Mask Video Diffusion (RaMViD) method for prediction of frames in natural videos. To adapt RaMViD for cryo-ET, we add MRC image-format support, floating-point pixel intensity representation, and a controlled inference protocol enabling both one-run and progressive MW completion (generating a small number of missing tilts per step using a sliding window). We evaluate the method on a synthetic noisy tilt-series dataset and study the effects of MW completion step size and conditioning sequence length. Results show that smaller step sizes and larger conditioning windows reduce error accumulation at higher tilt angles and improve reconstruction fidelity, which was measured by Root Mean Square Error on the image level and by Fourier Shell Correlation on the tomogram level.
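The progressive completion protocol can be summarized as a loop that generates a few missing tilts per step while conditioning on a sliding window of known or already-generated tilts. The sketch below is illustrative pseudocode under stated assumptions: `generate_tilts` stands in for one diffusion inference call and is not part of the paper, and the window/ordering details are simplified.

```python
# Minimal sketch (assumed): progressive missing-wedge completion with a sliding
# conditioning window; generated tilts are fed forward as context for later steps.
def progressive_mw_completion(tilts, missing_angles, step_size, window, generate_tilts):
    """tilts: dict angle -> 2D image for the acquired range; missing_angles: sorted list."""
    completed = dict(tilts)
    for start in range(0, len(missing_angles), step_size):
        batch = missing_angles[start:start + step_size]       # next few missing tilt angles
        context_angles = sorted(completed)[-window:]           # sliding conditioning window
        context = [completed[a] for a in context_angles]
        for angle, image in zip(batch, generate_tilts(context, batch)):
            completed[angle] = image                            # reuse generated tilts as context
    return completed
```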
bioinformatics2026-02-17v1RNAiSpline: A Deep learning model for siRNA efficacy prediction
Surkanti, S. R.; Kasturi, V. V.; Saligram, S. S.; Basangari, B. C.; Kondaparthi, V.AI Summary
- The study aimed to develop a computational model, RNAiSpline, for predicting siRNA efficacy in silencing mRNA.
- RNAiSpline uses self-supervised pretraining and fine-tuning with KAN, CNN, and Transformer Encoder to address data scarcity and bias.
- On an independent test dataset, RNAiSpline achieved an ROC-AUC of 0.8175, F1 score of 0.7717, and Pearson correlation of 0.6032.
Abstract
RNA interference (RNAi) is a crucial biological post-transcriptional gene silencing mechanism where small interfering RNA (siRNA) guides RNA-induced silencing complex (RISC) to bind with messenger RNA (mRNA) thereby silencing it and stopping protein formation. We exploit this process to prevent the formation of harmful proteins by silencing mRNA before it is translated into protein through an effective siRNA. There exists a need to develop a computational model that predicts the effectiveness of siRNA on a given mRNA. Designing a model is challenging, as the available data are either scarce or biased, and existing models lack generalization ability, even though the ratio of parameters to training samples is very high. To overcome these challenges, we introduce RNAiSpline, which incorporates self-supervised pretraining and fine-tuning with Kolmogorov-Arnold Network (KAN), Convolutional Neural Network (CNN), and Transformer Encoder. Evaluation on the independent test dataset yields an ROC-AUC of 0.8175, an F1 score of 0.7717, and Pearson correlation of 0.6032, making RNAiSpline a robust model for siRNA efficacy prediction.
bioinformatics2026-02-17v1A Robust Framework for Predicting Mutation Effects on Transcription Factor Binding: Insights from Mutational Signatures in 560 Breast CancerGenomes
Kilinc, H. H.; Otlu, B.AI Summary
- This study developed a k-mer-based linear regression model to predict the effects of 3.5 million somatic mutations from 560 breast cancer genomes on transcription factor (TF) binding affinity.
- The framework identified that specific mutational signatures like APOBEC (SBS2, SBS13) and aging (SBS1) correlate with gain- or loss-of-function (GOF/LOF) in TF families, affecting gene regulation.
- Analysis showed subtype-specific effects, with basal-like TNBC showing SBS3-driven GOF in CXXC family linked to MYC targets, and SBS39-driven LOF linked to DNA repair pathways.
Abstract
Background: A vast majority of somatic mutations in cancer reside in noncoding regions, yet systematically predicting their functional impact on gene regulation remains a significant challenge. These variants often enforce their effects by altering the binding affinity of transcription factors (TFs) to cis-regulatory elements. However, a critical gap exists in linking specific mutational processes to the disruption of gene regulatory networks at a systems level. Results: In this study, we present a comprehensive in silico pipeline centered on k-mer-based linear regression models to quantify TF binding affinity. Our framework produced 403 high-confidence TF models trained on high-throughput ChIP-seq and PBM datasets. We applied this pipeline to 3.5 million somatic mutations from 560 breast cancer whole genomes to predict gain- or loss-of-function (GOF/LOF) binding perturbations. These predictions were integrated with mutational signature analysis and curated gene sets, utilizing Activity-by-Contact model-based enhancer-gene maps to link variants to their target genes. Our analysis revealed that distinct mutational processes exert non-random, directional effects on specific TF families. The APOBEC-associated signatures (SBS2 and SBS13) were strongly enriched for GOF events in the Myb/SANT and FOX families, while the aging-associated signature SBS1 was enriched for LOF events in the Ets family members. Furthermore, predicted perturbations at putative enhancers were significantly linked to key oncogenes and tumor suppressor genes, with GOF and LOF events affecting, for example, FOXA1 and BRCA1/2, respectively. In breast cancer samples, the basal-like TNBC subtype showed SBS3-driven GOF enrichments for the CXXC family that converged on MYC target gene programs, while SBS39-driven LOF events for the same family converged on DNA repair pathways. Conclusions: Our framework provides a robust and scalable approach for prioritizing and interpreting the functional consequences of somatic mutations in terms of TF perturbations. We demonstrate that specific mutational processes systematically rewire the gene regulatory landscape in a subtype-specific manner, offering novel mechanisms for transcriptional deregulation in breast cancer.
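The basic scoring step of a k-mer linear model is easy to illustrate: a mutation's effect on predicted binding is the difference between the summed k-mer coefficients of the alternate and reference sequence windows. The weights below are random stand-ins (not coefficients trained on ChIP-seq/PBM data) and the function names are assumptions; a positive delta would be read as GOF, a negative one as LOF.

```python
# Minimal sketch (assumed): delta in a k-mer-based linear binding score between
# reference and mutated sequence windows.
import itertools
import numpy as np

K = 6
KMERS = {"".join(p): i for i, p in enumerate(itertools.product("ACGT", repeat=K))}
rng = np.random.default_rng(0)
weights = rng.normal(scale=0.1, size=len(KMERS))    # stand-in for learned k-mer coefficients

def binding_score(seq):
    """Sum of k-mer coefficients over all k-mers in a sequence window."""
    return sum(weights[KMERS[seq[i:i + K]]] for i in range(len(seq) - K + 1))

def mutation_delta(window, pos, alt):
    """Score change when the base at `pos` (within the window) is mutated to `alt`."""
    alt_seq = window[:pos] + alt + window[pos + 1:]
    return binding_score(alt_seq) - binding_score(window)   # > 0: GOF-like, < 0: LOF-like

window = "ACGTTGACCATGGCATTAACG"
print(mutation_delta(window, pos=10, alt="A"))
```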
bioinformatics2026-02-17v1Ancestry-specific performance of variant effect predictors in clinical variant classification
Hoffing, R.; Zeiberg, D.; Stenton, S. L.; Mort, M.; Cooper, D. N.; Hahn, M. W.; O'Donnell-Luria, A.; Ward, L. D.; Radivojac, P.AI Summary
- The study assessed the ancestry-specific performance of variant effect predictors in classifying clinical variants, focusing on accuracy and evidence strength per ACMG/AMP guidelines.
- Key confounders identified were the count of rare variants and their allele frequency distribution across ancestries.
- Results showed that after stratifying by allele frequency, predictors had comparable performance across major ancestry groups, supporting their broad application in genetic diagnosis.
Abstract
Predicting the effects of genetic variants and assessing prediction performance are key computational tasks in genomic medicine. It has been shown that well-calibrated variant effect predictors can be reliably used as evidence towards establishing pathogenicity (or benignity) of missense variants, thereby rendering these variants suitable for use in (or exclusion from) the genetic diagnosis of rare Mendelian conditions. However, most predictors have been trained or calibrated on data that may not be sufficiently representative to lead to similar performance across all genetic ancestries. This raises questions about the responsible deployment of these tools to improve human health. To better understand the utility of computational predictors, we set out to assess their ancestry-specific performance in terms of accuracy and evidence strength according to the ACMG/AMP guidelines. First, we determined that the expected count of rare variants in an individual's genome and the allele frequency distribution of these variants are the key confounders when evaluating a predictor's performance across different genetic ancestries. Second, we found that a predictor's accuracy itself inversely correlates with the allele frequency of the rare variant. After stratifying according to allele frequency, we show that established methods for predicting the pathogenicity of missense variants have comparable performance levels across major ancestry groups. Our results therefore support the wide deployment of such models in the context of genetic diagnosis and related applications.
bioinformatics2026-02-17v1rbio1-training scientific reasoning LLMs with biological world models as soft verifiers
Istrate, A.-M.; Milletari, F.; Castrotorres, F.; Tomczak, J. M.; Torkar, M.; Li, D.; Karaletsos, T.AI Summary
- The study explores training reasoning models in biology using biological world models as soft verifiers, introducing two paradigms: RLEMF and RLPK.
- rbio1, a model post-trained from a pretrained LLM with reinforcement learning, uses these paradigms to achieve state-of-the-art performance on the PerturbQA benchmark.
- The approach demonstrates that soft verification can enhance model performance and enable zero-shot transfer to tasks like disease-state prediction.
Abstract
Reasoning models are typically trained against verification mechanisms in formally specified systems such as code or symbolic math. In open domains like biology, however, we lack exact rules to enable large-scale formal verification and instead often rely on lab experiments to test predictions. Such experiments are slow, costly, and cannot scale with computation. In this work, we show that world models of biology or other prior knowledge can serve as approximate oracles for soft verification, allowing reasoning systems to be trained without additional experimental data. We present two paradigms of training models with approximate verifiers: RLEMF: reinforcement learning with experimental model feedback and RLPK: reinforcement learning from prior knowledge. Using these paradigms, we introduce rbio1, a reasoning model for biology post-trained from a pretrained LLM with reinforcement learning, using learned biological models for verification during training. We demonstrate that soft verification can distill biological world models into rbio1, enabling it to achieve state-of-the-art performance on perturbation prediction in the PerturbQA benchmark. We further show that composing multiple AI-verifiers improves performance and that models trained with soft biological rewards transfer zero-shot to cross-domain tasks such as disease-state prediction. We present rbio1 as a proof of concept that predictions from biological models can train powerful reasoning systems using simulations rather than experimental data, offering a new paradigm for model training.
bioinformatics2026-02-16v4SMECT: a framework for benchmarking post-GWAS methods for spatial mapping of cells associated with human complex traits
Liu, M.; Xue, C.; Luo, Y.; Peng, W.; Ye, L.; Zhang, L.; Wei, W.; Li, M.AI Summary
- SMECT is a framework designed to benchmark post-GWAS methods for mapping cells associated with complex human traits using spatial transcriptomics.
- It uses a simulation engine, 21 real-world datasets, and an assessment toolkit to evaluate methods like DESE, S-LDSC, and scDRS.
- Findings show DESE excels in both sensitivity and specificity, while S-LDSC has high sensitivity but low specificity, and scDRS is specific but less sensitive.
Abstract
Spatially resolving the cellular basis of complex human traits is essential for elucidating disease mechanisms, yet the comparative performance of computational methods for this task has not been systematically evaluated. Here, we present SMECT (Spatial Mapping Evaluation of Complex Traits), the first comprehensive framework for systematically evaluating methods that integrate genetic data with spatial transcriptomics. SMECT combines a biologically realistic simulation engine, a curated resource of 21 diverse real-world datasets, and a multi-faceted assessment toolkit. Using this framework, we benchmarked three state-of-the-art methods, DESE, S-LDSC, and scDRS, across 19 complex traits. Our analysis reveals a fundamental trade-off between detection sensitivity and biological specificity. We demonstrate that while S-LDSC identifies extensive spatial signals, it suffers from inflated non-specific significant associations. Conversely, scDRS is highly specific but conservative, performing well only in tissues with strong biological signals while missing subtle associations in sparser datasets. DESE overcomes these limitations, consistently achieving high power and robust specificity across both simulated and real-world scenarios. SMECT provides critical guidelines for method selection and serves as a foundational resource for developing robust spatial analyses of human complex traits. The framework is publicly available at https://github.com/pmglab/smect.
bioinformatics2026-02-16v1Empty drops in scRNA-seq uncover the surprising prevalence of sequestered neuropeptide mRNA and pervasive sequencing artifacts
Gorin, G.; Goodman, L.AI Summary
- This study utilized empty drops from single-cell RNA sequencing to explore sequencing artifacts and biological phenomena.
- A simple procedure was developed to detect sequencing artifacts, providing recommendations to minimize quantification errors.
- Surprisingly, empty drops showed a high prevalence of mRNA for neuropeptide-related genes, suggesting potential physiological relevance.
Abstract
The empty drops in single-cell sequencing experiments are an underexplored resource. As such, they present a substrate to ask questions orthogonal to standard single-cell sequencing workflows, calibrate statistical models using simple internal controls, and detect technical outliers which would be otherwise challenging to distinguish from real biology. In this case study, we report a relatively simple procedure to detect sequencing artifacts and make recommendations to reduce the risk of erroneous quantifications. In addition, we report the surprising abundance and co-expression of mRNA coding for neuropeptide-related genes in the empty drops, possibly reflecting underlying physiology.
bioinformatics2026-02-15v1GATSBI: Improving context-aware protein embeddingsthrough biologically motivated data splits
Nayar, G.; Altman, R. B.AI Summary
- GATSBI introduces a graph attention framework to create context-aware protein embeddings by integrating various biological data types.
- The study uses task-aligned evaluation protocols, showing that models trained with biologically relevant data partitions generalize better.
- GATSBI outperforms existing embeddings in predicting interactions, functions, and functional sets, especially for understudied proteins under inductive settings.
Abstract
Motivation: Understanding protein function requires integrating diverse biological evidence while accounting for strong contextual dependence. Recent protein embedding methods increasingly leverage heterogeneous biological networks, yet their evaluation protocols often fail to reflect the specific biological tasks for which the embeddings are intended. Prediction of missing interactions, annotation of new proteins, and discovery of functional modules require fundamentally different data partitions, such as edge-masked versus node-held-out splits. Moreover, most approaches report performance primarily on well-studied proteins, where computational predictions are least needed, risking substantial overestimation of real-world utility. Results: We introduce a graph attention-based framework (GATSBI) to construct context-aware protein embeddings from integrated protein-protein interactions, co-expression, sequence representations, and tissue-specific associations. Using task-aligned evaluation protocols, we show that models trained with biologically appropriate partitions achieve markedly better generalization. Across interaction, function, and functional set prediction, GATSBI consistently outperforms existing pretrained embeddings for both well-studied and understudied proteins, with the largest gains observed for the understudied regime and under inductive node-held-out evaluation. To enable broad reuse, we provide the learned embeddings for download for application to other protein prediction tasks. Availability and Implementation: Code and models for our experiments are available at https://github.com/Helix-Research-Lab/GATSBI-embedding
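The distinction between the two data partitions is worth making concrete. In an edge-masked split, a fraction of interactions among known proteins is hidden (link prediction); in a node-held-out split, whole proteins and all of their edges are removed (the inductive setting for new or understudied proteins). The sketch below is a simplified illustration with assumed helper names, not the paper's evaluation code.

```python
# Minimal sketch (assumed): edge-masked versus node-held-out splits of a PPI edge list.
import random

def edge_masked_split(edges, test_frac=0.2, seed=0):
    edges = list(edges)
    random.Random(seed).shuffle(edges)
    n_test = int(len(edges) * test_frac)
    return edges[n_test:], edges[:n_test]                 # train edges, held-out edges

def node_held_out_split(edges, nodes, test_frac=0.2, seed=0):
    nodes = list(nodes)
    random.Random(seed).shuffle(nodes)
    held_out = set(nodes[:int(len(nodes) * test_frac)])
    train = [e for e in edges if e[0] not in held_out and e[1] not in held_out]
    test = [e for e in edges if e[0] in held_out or e[1] in held_out]
    return train, test                                     # test edges all touch unseen proteins

edges = [("P1", "P2"), ("P2", "P3"), ("P3", "P4"), ("P1", "P4"), ("P4", "P5")]
print(edge_masked_split(edges))
print(node_held_out_split(edges, nodes={"P1", "P2", "P3", "P4", "P5"}))
```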
bioinformatics2026-02-15v1DiCoLo: Integration-free and cluster-free detection of localized differential gene co-expression in single-cell data
Li, R.; Yang, J.; Su, P.-C.; Jaffe, A.; Lindenbaum, O.; Kluger, Y.AI Summary
- DiCoLo is introduced to detect localized differential gene co-expression in single-cell data without relying on cell clustering or cross-condition alignment.
- It uses Optimal Transport distances to construct gene graphs and identify changes in gene connectivity patterns between conditions.
- DiCoLo effectively identifies differential gene co-localization in complex scenarios, revealing new insights in mouse hair follicle development related to dermal condensate differentiation.
Abstract
Detecting changes in gene coordination patterns between biological conditions and identifying the cell populations in which these changes occur are key challenges in single-cell analysis. Existing approaches often compare gene co-expression between predefined cell clusters or rely on aligning cells across conditions. These strategies can be suboptimal when changes occur within small subpopulations or when batch effects obscure the underlying biological signal. To address these challenges, we introduce DiCoLo, a framework that identifies genes exhibiting differential co-localization, defined as changes in coordinated expression within localized cell neighborhoods - subsets of highly similar cells in the transcriptomic space. Importantly, DiCoLo does not rely on cell clustering or cross-condition alignment. For each condition, DiCoLo constructs a gene graph using Optimal Transport distances that reflect gene co-localization patterns across the cell manifold. Then, it identifies differential gene programs by detecting changes in connectivity patterns between the gene graphs. We show that DiCoLo robustly identifies differential gene co-localization even under weak signals or complex batch effects, outperforming existing methods across multiple benchmark datasets. When applied to mouse hair follicle development data, DiCoLo reveals coordinated gene programs and emerging cell populations driven by perturbations in morphogen signaling that underlie dermal condensate differentiation. Overall, these results establish DiCoLo as a powerful framework for uncovering localized differential transcriptional coordinated patterns in single-cell data.
bioinformatics2026-02-14v2EMReady2: improvement of cryo-EM and cryo-ET maps by local quality-aware deep learning with Mamba
Cao, H.; Zhu, Y.; Li, T.; Chen, J.; He, J.; Wang, X.; Huang, S.-Y.AI Summary
- The study addresses the challenge of improving cryo-EM map quality by introducing EMReady_mamba, a deep learning model using a Mamba-based dual-branch UNet architecture.
- EMReady_mamba employs a local resolution-guided learning strategy to handle map heterogeneity, extending its applicability to various map types including nucleic acids and cryo-ET maps.
- Evaluated on 136 maps, EMReady_mamba demonstrated superior performance in enhancing map quality and interpretation compared to existing methods.
Abstract
Cryo-electron microscopy (cryo-EM) has emerged as a leading technology for determining the structures of biological macromolecules. However, map quality issues such as noise and loss of contrast hinder accurate map interpretation. Traditional and deep learning-based post-processing methods offer improvements but face limitations particularly in handling map heterogeneity. Here, we present a generalist Mamba-based deep learning model for improving cryo-EM maps, named EMReady_mamba. EMReady_mamba introduces a fast Mamba-based dual-branch UNet architecture to jointly capture local and global features. In addition, EMReady_mamba also uses a local resolution-guided learning strategy to address map heterogeneity, and significantly extends the training set. These advances render EMReady_mamba applicable to a broader range of cryo-EM maps, including those containing nucleic acids, medium-resolution maps, and cryo-electron tomography (cryo-ET) maps, while substantially reducing computational cost. EMReady_mamba is extensively evaluated on 136 diverse maps at 2.0-10.0 Å resolutions, and compared with existing map post-processing methods. It is shown that EMReady_mamba exhibits state-of-the-art performance in both map quality and map interpretation improvement. EMReady2 is freely available at https://github.com/huang-laboratory/EMReady2/.
bioinformatics2026-02-14v2RNApdbee 3.0: A unified web server for comprehensive RNA secondary structure annotation from 3D coordinates
Pielesiak, J.; Niznik, K.; Snioszek, P.; Wachowski, G.; Zurawski, M.; Antczak, M.; Szachniuk, M.; Zok, T.AI Summary
- RNApdbee 3.0 is a web server that integrates 2D and 3D data to annotate RNA secondary structures, classifying base pairs and identifying various nucleotide interactions.
- It handles incomplete or modified residues, providing results in standard formats like dot-bracket notation, BPSEQ, and CT, along with graphical visualizations.
- The tool standardizes inputs to PDBx/mmCIF, integrates seven annotation tools, and decomposes structures into stems, loops, and single strands, ensuring comprehensive RNA structural analysis.
Abstract
RNApdbee 3.0 (publicly available at https://rnapdbee.cs.put.poznan.pl/) offers an advanced pipeline for comprehensive RNA structural annotation, integrating 2D and 3D data to build detailed nucleotide interaction networks. It classifies base pairs as canonical or noncanonical using the Leontis-Westhof and Saenger schemes and identifies stacking, base-ribose, base-phosphate, and base-triple interactions. The tool handles incomplete or modified residues, marking missing nucleotides and distinguishing noncanonical base pairs for accurate and effective visualization. Results are provided in standard formats - namely, extended dot-bracket notation, BPSEQ, and CT - and in highly valuable graphical visualizations. RNApdbee decomposes secondary structures into stems, loops, and single strands and offers flexible pseudoknot encoding. Its unified framework addresses inconsistencies across structural data formats by standardizing all inputs to PDBx/mmCIF and integrating seven widely used annotation tools. Finally, RNApdbee ensures reliable, format-independent, and comprehensive RNA structural annotation and interpretation.
bioinformatics2026-02-14v2DVPNet: A New XAI-Based Interpretable Genetic Profiling Framework Using Nucleotide Transformer and Probabilistic Circuits
Kusumoto, T.AI Summary
- The study introduces DVPNet, an XAI-based framework for genetic profiling that uses a Nucleotide Transformer and probabilistic circuits to classify cancer vs. normal cells.
- Using the GSE131907 dataset, 900 genes per sample were selected, transformed into embeddings, and used to train the model, which then provided probabilistic contributions for classification.
- Key findings include identification of 1,524 genes with unexpected contribution scores, highlighting genes like ITGA5 and TP73, offering new insights beyond traditional statistical methods.
Abstract
In this study, we present an XAI-based genetic profiling framework that quantifies gene importance for distinguishing cancer cells from normal cells based on an interpretable AI decision process. We propose a new explainable AI (XAI) classification model that combines probabilistic circuits with the Nucleotide Transformer. By leveraging the strong feature-extraction capability of the Nucleotide Transformer, we design a tractable classification framework based on probabilistic circuits while preserving probabilistic interpretability. To demonstrate the capability of this framework, we used the GSE131907 single-cell lung cancer atlas and constructed a dataset consisting of cancer-cell and normal-cell classes. From each sample, 900 gene types were randomly selected and converted into embedding vectors using the Nucleotide Transformer, after which the classification model was trained. We then extracted class-specific probabilistic contributions from the tractable model and defined a contribution score for the cancer-cell class. Genetic profiling was performed based on these scores, providing insights into which genes and biological pathways are most important for the classification task. Notably, 1,524 of the 9,540 observed genes showed contribution scores that contradicted what would be expected from their class-wise occurrence frequencies, suggesting that the profiling goes beyond simple statistics by leveraging biological feature representations encoded by the Nucleotide Transformer. The top-ranked genes among these contradictory cases include several well-studied genes in cancer research (e.g., ITGA5, SIGLEC9, NOTUM, and TP73). Overall, these analyses go beyond traditional statistical or gene-expression-level approaches and provide new academic insights for genetic research.
bioinformatics2026-02-14v2IntelliFold-2: Surpassing AlphaFold 3 via Architectural Refinement and Structural Consistency
Qiao, L.; Yan, H.; Liu, G.; Guo, G.; Sun, S.AI Summary
- IntelliFold-2 enhances biomolecular structure prediction through architectural refinements like latent space scaling and atom-attention, improving over AlphaFold 3.
- Key improvements include better performance in therapeutic contexts, especially for antibody-antigen interactions and protein-ligand co-folding.
- Three variants (Flash, v2, Pro) are released to cater to different needs from efficient fine-tuning to high-precision inference.
Abstract
IntelliFold-2 is an open-source biomolecular structure prediction model that improves accuracy and robustness through architectural refinement and multiscale structural consistency. We introduce latent space scaling in Pairformer blocks, a principled atom-attention formulation with stochastic atomization, policy-guided optimization for diffusion sampling and difficulty-aware loss reweighting. On Foldbench, IntelliFold-2 improves performance in therapeutically relevant settings, with particularly strong gains for antibody-antigen interactions and protein-ligand co-folding relative to AlphaFold 3. We release three variants (Flash, v2, and Pro) to cover efficient fine-tuning through high-precision server-side inference.
bioinformatics2026-02-14v2Analysis of Age-Specific Dysregulation of miRNAs in Lung Cancer Via Machine learning: Biomarker Identification and Therapeutic Implications in Patients Aged 60 and Above.
Hasan, A.; Muzaffar, A.AI Summary
- This study analyzed miRNA dysregulation in lung cancer patients aged 60 and above using RNA sequencing data from TCGA.
- Differential expression analysis identified 25 significant miRNAs, with hsa-mir-1911 upregulated and others like hsa-mir-196a downregulated.
- Machine learning highlighted key miRNAs involved in lung cancer biology, suggesting their potential as biomarkers for early diagnosis and personalized therapy targets.
Abstract
Lung cancer is the leading cause of cancer-related mortality worldwide and predominantly affects older individuals, with non-small cell lung cancer (NSCLC) comprising 85% of cases. Despite advancements in diagnosis and treatment, prognosis for elderly patients remains poor. This study investigates the role of microRNAs (miRNAs) involved in lung cancer, focusing on individuals aged 60 and above. RNA sequencing data from The Cancer Genome Atlas (TCGA) was used to conduct differential expression analysis of miRNA profiles from elderly and senile patient groups. Results showed that out of 1,881 miRNA profiles, 801 were found to be differentially expressed. Filtering for significance identified 25 miRNAs, with hsa-mir-1911 upregulated and 24, including hsa-mir-196a and hsa-mir-323b, downregulated. Studies showed that these miRNAs play roles in apoptosis, senescence, and inflammation. A complementary machine learning analysis highlighted key miRNAs, including hsa-mir-181b, hsa-mir-542, hsa-mir-450b, hsa-mir-584, and hsa-mir-21, as crucial in lung cancer biology. Moreover, functional enrichment analysis revealed their involvement in gene silencing, translational repression, and RNA-induced silencing complex (RISC) regulation. This research identifies associations between miRNAs and aging in lung cancer and highlights potential biomarkers for early diagnosis and targets for personalized therapies.
bioinformatics2026-02-14v1Cell phenotypes in the biomedical literature: a systematic analysis and text mining corpus
Rotenberg, N. H.; Leaman, R.; Islamaj, R.; Kuivaniemi, H.; Tromp, G.; Fluharty, B.; Richardson, S.; Eastwood, C.; Diller, M.; Xu, B.; Pankajam, A. V.; Osumi-Sutherland, D.; Lu, Z.; Scheuermann, R. H.AI Summary
- The study introduces CellLink, a corpus of over 22,000 annotated mentions of human and mouse cell populations from recent literature, linked to Cell Ontology terms.
- Analysis showed lineage-specific patterns in cell naming based on various attributes.
- Fine-tuning transformer models on CellLink improved named entity recognition, and embedding approaches enhanced zero-shot entity linking, with applications in refining the chondrocyte branch of Cell Ontology.
Abstract
The variety of cell phenotypes identified by single-cell technologies is rapidly expanding, yet this knowledge is dispersed across the scientific literature and incompletely represented in structured resources. We present the CellLink corpus, a manually annotated collection of over 22,000 mentions of human and mouse cell populations in recent journal articles, distinguishing specific cell phenotypes, heterogeneous cell populations, and vague cell populations, and linking to Cell Ontology (CL) terms as either exact or related matches, covering nearly half of the terms in the current CL. A systematic analysis reveals lineage-specific patterns in how authors utilize anatomical context, molecular signatures, functional roles, developmental stage, and other attributes in cell naming. We show that fine-tuning transformer-based models on CellLink yields strong performance for named entity recognition, while embedding-based approaches support zero-shot entity linking and distinguishing exact from related matches. We further demonstrate the utility of CellLink to expand and refine the chondrocyte branch of CL.
bioinformatics2026-02-14v1Feature-based in-silico model to predict the Mycobacterium tuberculosis bedaquiline phenotype associated with Rv0678 variants
Quispe Rojas, W.; de Diego Fuertes, M.; Rennie, V.; Riviere, E.; Safarpour, M.; Van Rie, A.AI Summary
- The study developed an in-silico model to predict bedaquiline resistance in Mycobacterium tuberculosis based on 13 features of Rv0678 variants.
- Key features included evolutionary conservation and proximity to functional sites, with the model achieving high accuracy (ROC-AUC 0.826, sensitivity 87.1%, specificity 88.2%).
- External validation showed reduced performance, likely due to varied phenotypic measurement methods.
Abstract
Bedaquiline resistance is emerging globally and threatens the effectiveness of the novel short all-oral regimens for rifampicin-resistant tuberculosis. Following a systematic literature review, we quantified 13 sequence, biochemical, and structural features of 62 Rv0678 missense variants reported in 136 Mycobacterium tuberculosis isolates. Using rigorous machine learning methods, we show that the strongest contributing features were the evolutionary conservation score and the shortest atomic distance to key functional sites. The final 5-feature model had good performance (ROC-AUC 0.826) and classified the bedaquiline phenotype with high accuracy [sensitivity 87.1% (95% CI, 78.3-92.6) and specificity 88.2% (95% CI, 76.6-94.5)]. Performance was lower in external validation, likely due to the measurement error introduced when using diverse phenotypic methods. These features capture the effects of missense variants on the mmpR5 protein structure and function. Integrating the five-feature in-silico model into variant interpretation software could improve the prediction of the effect of Rv0678 variants and guide clinical management of rifampicin-resistant tuberculosis.
bioinformatics2026-02-14v1CodonRL: Multi-Objective Codon Sequence Optimization Using Demonstration-Guided Reinforcement Learning
Du, S.; Kaynar, G.; Li, J.; You, Z.; Tang, S.; Kingsford, C.AI Summary
- CodonRL uses reinforcement learning to optimize codon sequences for translation efficiency, RNA stability, and compositional properties, addressing challenges like large action spaces and delayed rewards.
- It employs LinearFold for training and ViennaRNA for evaluation, with expert sequences to guide learning and milestone rewards to manage long-range optimization.
- On a benchmark of 55 human proteins, CodonRL outperformed GEMORNA, showing improvements in CAI by 9.5%, MFE by 25.4 kcal/mol, and reducing uridine content by 3.4%, enhancing translation efficiency, stability, and reducing immunogenicity.
Abstract
Optimizing synonymous codon sequences to improve translation efficiency, RNA stability, and compositional properties is challenging because the search space grows exponentially with protein length and objectives interact through long range RNA structure. Dynamic programming-based methods can provide strong solutions for fixed objective combinations but are difficult to extend to additional constraints. Deep generative models require large-scale, high-quality mRNA sequence datasets for training, limiting applicability when such data are scarce. Reinforcement learning naturally handles sequential decision-making but faces challenges in codon optimization due to delayed rewards, large action spaces, and expensive structural evaluation. We present CodonRL, a reinforcement learning framework that learns a structural prior for mRNA design from efficient folding feedback and demonstration-guided replay, and then enables user-controlled multi-objective trade-offs during inference. CodonRL uses LinearFold for fast intermediate reward computation during training and ViennaRNA for final evaluation, warms up learning with expert sequences to accelerate convergence for global structure objectives, and introduces milestone-based intermediate rewards to address delayed feedback in long range optimization. On a benchmark of 55 human proteins, CodonRL outperforms GEMORNA, a state-of-the-art codon optimization method, across multiple metrics, achieving 9.5% higher codon adaptation index (CAI), 25.4 kcal/mol more favorable minimum free energy (MFE), and 3.4% lower uridine content on average, while improving codon stabilization coefficient (CSC) in over 90% of benchmark proteins under matched constraints. These gains translate into designs that are predicted to be more efficiently translated, more structurally stable, and less immunogenic, while supporting continuous objective reweighting at inference time.
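Two of the reported objectives, CAI and uridine content, are simple enough to compute directly for a candidate coding sequence, as sketched below. The codon weight table is a tiny hypothetical stand-in (real CAI uses relative-adaptiveness weights derived from highly expressed genes), and MFE or CSC would come from a folding tool such as ViennaRNA rather than from this snippet.

```python
# Minimal sketch (assumed weights): codon adaptation index and uridine content
# for two synonymous designs of the same short peptide.
import math

def codon_adaptation_index(cds, weights):
    """Geometric mean of per-codon relative-adaptiveness weights."""
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    return math.exp(sum(math.log(weights[c]) for c in codons) / len(codons))

def uridine_content(cds):
    return cds.count("U") / len(cds)

# Hypothetical weights covering only the codons used below (Leu: CUG, UUA; Gly: GGC).
weights = {"CUG": 1.0, "UUA": 0.2, "GGC": 1.0}
for cds in ("CUGCUGGGC", "UUAUUAGGC"):              # two synonymous Leu-Leu-Gly designs
    print(cds, round(codon_adaptation_index(cds, weights), 3), round(uridine_content(cds), 3))
```

The first design scores a higher CAI and lower uridine fraction than the second, illustrating the kind of trade-off the optimizer navigates jointly with structure-based objectives.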
bioinformatics2026-02-14v1