Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
TPCAV: Interpreting deep learning genomics models via concept attribution
Yang, J.; Mahony, S.
Interpreting genomics deep learning models remains challenging. Existing feature attribution methods are largely restricted to one-hot DNA inputs and therefore cannot assess the influence of more general genomic features such as chromatin states or genomic repeats. Concept attribution methods offer an input-agnostic global interpretation framework, yet they have not been systematically applied to interpret neural network applications in genomics. We present the first application of concept attribution to interpret genomics deep learning models by adapting the Testing with Concept Activation Vectors (TCAV) method. We improve upon the original TCAV method by incorporating a PCA-based decorrelation transformation to address correlated and redundant embedding features commonly observed in genomics deep learning models, resulting in the Testing with PCA-projected Concept Activation Vectors (TPCAV) approach. We also introduce a strategy for extracting concept-specific input attribution maps. We evaluate our approach by interpreting influential biological concepts across a diverse set of genomics models spanning multiple input representations and prediction tasks. We demonstrate that TPCAV provides comparable motif feature interpretation to TF-MoDISco on one-hot encoded DNA-based transcription factor binding prediction models. TPCAV also enables robust interpretive analysis of how more general biological concepts such as repetitive elements and chromatin state annotations contribute towards predictions. TPCAV uniquely generalizes to interpret features learned by tokenized foundation models as well as models incorporating chromatin signals as inputs. We further show that TPCAV can identify representative regions associated with specific concepts, motivating downstream investigation of distinct regulatory mechanisms. TPCAV provides a flexible and robust complement to existing model interpretation techniques.
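The concept-attribution idea above can be sketched in a few lines. This is a minimal, illustrative stand-in, not the paper's implementation: the activations and gradients are random placeholders, and the CAV is taken as a normalized difference of class means in PCA space (classic TCAV fits a linear classifier instead).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical activations from one model layer: 100 "concept" examples
# (e.g. regions overlapping a repeat class) vs 100 random examples, each a
# 64-dim embedding. All names, shapes, and values here are illustrative.
concept_acts = rng.normal(loc=0.5, size=(100, 64))
random_acts = rng.normal(loc=0.0, size=(100, 64))

# PCA-based decorrelation (the "P" in TPCAV): rotate activations onto the
# principal axes of the pooled data before fitting the CAV.
pooled = np.vstack([concept_acts, random_acts])
mean = pooled.mean(axis=0)
_, _, components = np.linalg.svd(pooled - mean, full_matrices=False)

proj_concept = (concept_acts - mean) @ components.T
proj_random = (random_acts - mean) @ components.T

# Simplified CAV: normalized difference of class means in the decorrelated
# space (TCAV proper trains a linear classifier; this is a stand-in).
cav = proj_concept.mean(axis=0) - proj_random.mean(axis=0)
cav /= np.linalg.norm(cav)

# TCAV-style score: fraction of per-input gradients (random placeholders
# here) whose directional derivative along the CAV is positive.
grads = rng.normal(size=(200, 64)) @ components.T
tcav_score = float(np.mean(grads @ cav > 0))
```

With real model gradients, a score far from 0.5 (checked against random concepts) would indicate that the concept direction systematically influences predictions.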
bioinformatics · 2026-04-14 · v4
TSvelo: Comprehensive RNA velocity by modeling cascade of gene regulation, transcription and splicing
Li, J.; Wang, Z.; Shen, H.-B.; Yuan, Y.
RNA velocity approaches fit gene dynamics and infer cell fate by modeling the splicing process using single-cell RNA sequencing (scRNA-seq) data. However, due to the short time scale of splicing and the high noise and complexity of the data, existing RNA velocity methods often fail to precisely capture the complex velocity dynamics of individual genes and single cells, which makes downstream analyses less reliable and less robust. We propose TSvelo, a comprehensive RNA velocity mathematical framework that models the cascade of gene regulation, Transcription and Splicing using highly interpretable neural Ordinary Differential Equations (ODEs). TSvelo can precisely capture the transcription-unspliced-spliced 3D dynamics of all genes simultaneously, infer a unified latent time shared by genes within each single cell, and be applied to multi-lineage datasets. Experiments on six scRNA-seq datasets, including two multi-lineage datasets, demonstrate TSvelo's superiority.
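The transcription-splicing cascade underlying RNA velocity follows the standard ODE pair du/dt = α − βu, ds/dt = βu − γs (u unspliced, s spliced). A toy Euler integration illustrates the dynamics; TSvelo additionally makes transcription a learned function of regulators via neural ODEs, which this sketch omits, and the rate constants below are arbitrary.

```python
# Standard transcription/splicing ODEs behind RNA velocity:
#   du/dt = alpha - beta * u   (transcription minus splicing)
#   ds/dt = beta * u - gamma * s   (splicing minus degradation)
alpha, beta, gamma = 2.0, 1.0, 0.5   # arbitrary toy rate constants

u, s = 0.0, 0.0
dt = 0.01
for _ in range(2000):                # Euler steps out to t = 20
    du = alpha - beta * u
    ds = beta * u - gamma * s
    u, s = u + dt * du, s + dt * ds

# Analytical steady state: u* = alpha/beta = 2, s* = beta*u*/gamma = 4
```

Near steady state the trajectory approaches (u*, s*) = (α/β, βu*/γ), the attractor around which velocity vectors are interpreted.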
bioinformatics · 2026-04-14 · v4
Beyond Single Algorithms: A Framework for Validating and Aggregating Active Modules in Genetic Interaction Networks
Liu, J.; Xu, M.; Xing, J.
High-throughput sequencing methods have generated vast amounts of genetic data for candidate gene studies. However, the complexity of disease genetic structure often results in a large number of candidate genes and poses a significant challenge for these studies. To explore multi-gene interactions and elucidate genetic mechanisms, candidate genes are often analyzed through Gene-Gene Interaction (GGI) networks. These networks can become very large, necessitating efficient methods to reduce their complexity. Active Module Identification (AMI) is a common method for analyzing GGI networks by identifying enriched subnetworks representing relevant biological processes. Multiple AMI algorithms have been developed for biological datasets, and a comparative analysis of their behaviors across a variety of datasets is crucial to their application. In this study, we introduce a framework to compare and aggregate the modules produced by multiple AMI algorithms. We first used a modified Empirical Pipeline to validate the output of four AMI algorithms -- PAPER, DOMINO, FDRnet, and HotNet2 -- and find that no single algorithm performs well across the different datasets. Using the Earth Mover's Distance to measure pairwise module similarity, we find that the outputs of different algorithms are structurally distinct, suggesting that each captures different aspects of the underlying biology. These findings suggest that a comprehensive analysis requires aggregating the outputs of multiple algorithms. We propose two methods to this end: a spectral clustering approach for module aggregation, and an algorithm that combines modules with similar network structures, called Greedy Conductance-based Merging (GCM). The merging algorithm not only allows researchers to obtain a set of cohesive modules from multiple algorithms, it also has the potential to identify "hidden" genes that are not present in the original input data from the network.
Overall, our results advance our understanding of AMI algorithms and how they should be applied. Tools and workflows developed in this study will facilitate researchers working with GGI and AMI algorithms to enhance their analyses. Our code is freely available at https://github.com/LiuJ0/AMI-Benchmark/.
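The Earth Mover's Distance used above for pairwise module similarity is easy to compute in one dimension. The sketch below is illustrative only: it summarizes each module by the degree sequence of its subnetwork (the paper's actual ground distance is not specified here), and uses the fact that for equal-size 1-D samples the EMD (1-Wasserstein distance) reduces to the mean absolute difference of sorted values.

```python
import numpy as np

# Toy module signatures from two AMI algorithms: each module is summarized
# by the degree sequence of its subnetwork (an illustrative choice).
module_a = np.array([1., 2., 2., 3., 5., 8.])
module_b = np.array([1., 1., 2., 4., 4., 9.])

# For equal-size 1-D samples, the Earth Mover's Distance (1-Wasserstein)
# is the mean absolute difference of the sorted values.
emd = float(np.abs(np.sort(module_a) - np.sort(module_b)).mean())
```

A matrix of such pairwise distances is what a spectral-clustering aggregation step would then operate on.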
bioinformatics · 2026-04-14 · v2
GRASP: Gene-relation adaptive soft prompt for scalable and generalizable gene network inference with large language models
Feng, Y.; Deng, K.; Guan, Y.
Gene networks (GNs) encode diverse molecular relationships and are central to interpreting cellular function and disease. The heterogeneity of interaction types has led to computational methods specialized for particular network contexts. Large language models (LLMs) offer a unified, language-based formulation of GN inference by leveraging biological knowledge from large-scale text corpora, yet their effectiveness remains sensitive to prompt design. Here, we introduce Gene-Relation Adaptive Soft Prompt (GRASP), a parameter-efficient and trainable framework that conditions inference on each gene pair through only three virtual tokens. Using factorized gene-specific and relation-aware components, GRASP learns to map each pair's biological context into compact soft prompts that combine pair-specific signals with shared interaction patterns. Across diverse GN inference tasks, GRASP consistently outperforms alternative prompting strategies. It also shows a stronger ability to recover unannotated interactions from synthetic negative sets, suggesting its capacity to identify biologically meaningful relationships beyond existing databases. Together, these results establish GRASP as a scalable and generalizable prompting framework for LLM-based GN inference.
bioinformatics · 2026-04-14 · v2
From Movement to METs: A Validation of ActTrust(R) for Energy Expenditure Estimation and Physical Activity Classification in Young Adults
dos Santos Batista, E.; Basilio Gomes, S. R.; Bruno de Morais Ferreira, A.; Franca, L. G. S.; Fontenele Araujo, J.; Mortatti, A. L.; Leocadio-Miguel, M. A.
Estimating physical activity (PA) levels is a challenging and expensive task. An alternative is to use actigraphy devices to estimate PA, as has previously been done for a number of devices, including the ActiGraph(R) GT3X+. In this study, we validated ActTrust(R) against the widely used GT3X+ and compared activity counts to metabolic equivalents (METs) derived from indirect calorimetry during treadmill walking and running. Fifty-six young adults (34 men, 22 women) participated in controlled effort exercises spanning light, moderate, vigorous, and very vigorous activity intensities. We developed a linear model to estimate energy expenditure (EE) from the movement counts of combinations of devices placed at the hip or wrist. We then estimated cut-off points for each intensity range. Our results showed correlations between treadmill speed and both METs (r = 0.95, p < 0.05) and movement counts from both GT3X+ and ActTrust devices placed either on the hip (r = 0.94, p < 0.05; r = 0.93, p < 0.05) or on the wrist (r = 0.88, p < 0.05; r = 0.88, p < 0.05), respectively. Our proposed model performed well, with balanced accuracies above 0.77 for all intensity ranges and over 0.9 for light and moderate activity. This is the first study to model, estimate, and validate PA intensity thresholds on ActTrust(R) devices. Our findings support the use of ActTrust(R) devices as a simple, cost-effective tool for 24-hour assessments of EE.
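The two-step procedure above (fit a linear counts-to-METs model, then invert it at MET thresholds to get count cut-offs) can be sketched as follows. The counts and MET values are made up and do not reproduce the study's coefficients or cut-off points; only the mechanics are illustrated.

```python
import numpy as np

# Illustrative data only: activity counts per minute and measured METs.
counts = np.array([200., 1200., 3500., 6000., 9000.])
mets = np.array([1.8, 3.2, 5.5, 7.9, 10.4])

# Ordinary least squares fit: METs ~ b0 + b1 * counts
X = np.column_stack([np.ones_like(counts), counts])
(b0, b1), *_ = np.linalg.lstsq(X, mets, rcond=None)

# Invert the model at conventional MET thresholds
# (moderate >= 3 METs, vigorous >= 6, very vigorous >= 9)
# to obtain count cut-offs for intensity classification.
cutoffs = {met: float((met - b0) / b1) for met in (3.0, 6.0, 9.0)}
```

In the study itself the cut-offs were then evaluated by balanced accuracy per intensity range against calorimetry-derived labels.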
bioinformatics · 2026-04-14 · v2
Harnessing AI to Build Virtual Cells
Cheng, X.; Li, P.; Guo, H.; Liang, Y.; Gong, J.; de Vazelhes, W.; Gou, C.; Xie, P.; Song, L.; Xing, E. P.
A virtual cell is a world model of a cell: a computational system that predicts, simulates and programs cellular processes across modalities and scales. An important path toward this goal is to model how genetic and chemical perturbations give rise to transcriptional responses, a core capability for disease understanding and drug discovery. However, current approaches remain expert-intensive, relying on iterative manual model design, training and debugging over months. Here we present VCHarness, an autonomous AI system that constructs perturbation-response models by combining an AI coding agent with multimodal biological foundation models. The system explores large spaces of architectures and training pipelines with minimal human intervention, iteratively generating, evaluating and refining candidate models. Across multiple perturbation-response benchmarks, VCHarness identifies architectures that outperform expert-designed approaches while reducing development time from months to days. It further uncovers non-obvious architectural patterns associated with improved performance, indicating that automated search can extend beyond conventional design strategies. These results suggest a shift from manually engineered models toward autonomous systems for constructing components of virtual cell world models, enabling scalable and data-driven exploration of cellular systems.
bioinformatics · 2026-04-14 · v1
Reconstructing intra-tumor fitness landscapes from scSeq CNA genotypes via simulation-based Bayesian inference and Deep Learning
KafiKang, M.; Skums, P.
Inferring the selective effects of copy-number alterations (CNAs) from clonal tumor data is essential for understanding tumor evolution. In practice, intra-tumor evolutionary parameters are typically estimated by fitting population genetic models to observed data using maximum likelihood or Bayesian methods. However, realistic mechanistic models often lead to intractable likelihoods, limiting the applicability of conventional inference approaches. Here, we introduce a likelihood-free, simulation-based framework for inferring intra-tumor selection coefficients directly from clonal CNA profiles. Our approach employs neural posterior estimation to amortize inference across simulated tumors and uses normalizing flows to flexibly parameterize high-dimensional posterior distributions while enabling robust uncertainty quantification. Our primary model, CloneMLP-NPE, learns representations of whole-tumor CNA genotypes using a multilayer perceptron (MLP)-based encoder. We compare this model against two baselines: (i) a Set Transformer encoder applied to the same whole-tumor CNA profiles, and (ii) a consensus-based approach that relies only on the CNA profile of the most abundant clone. On held-out simulations, CloneMLP-NPE achieves the strongest overall performance, yielding well-calibrated posterior distributions and more accurate posterior mean estimates than both baselines.
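To make "likelihood-free inference" concrete, the sketch below uses rejection ABC, the simplest simulation-based method, as a conceptual stand-in for neural posterior estimation (NPE trains a normalizing flow on simulations instead of rejecting them). The simulator, prior, and observed value are all toys invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate(s):
    # Toy "tumor" simulator: a selection coefficient s sets a clone's
    # expected frequency via a logistic map, observed with sampling noise.
    return rng.binomial(100, 1.0 / (1.0 + np.exp(-s))) / 100.0

observed = 0.73                                  # pretend observed clone frequency
prior_draws = rng.normal(0.0, 2.0, size=20000)   # prior over selection coefficient
sims = np.array([simulate(s) for s in prior_draws])

# Rejection step: keep prior draws whose simulations land near the data.
posterior = prior_draws[np.abs(sims - observed) < 0.02]
```

NPE amortizes this: rather than re-running rejection per dataset, a neural density estimator learns p(s | data) from the same kind of (parameter, simulation) pairs.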
bioinformatics · 2026-04-14 · v1
Multi-Agent Orchestration for Knowledge Extraction and Retrieval: AI Expert System for GPCRs
Spieser, J. C.; Kogan, P.; Yang, J.; Meller, J.; Patra, K.; Shamsaei, B.
We present GPCR-Nexus, an AI-driven platform for integrated exploration of G protein coupled receptor (GPCR) biology that unifies structured databases with unstructured scientific literature. The system combines a GPCR/ligand knowledge graph with vector-based semantic retrieval to enable comprehensive, up-to-date information access. Central to GPCR-Nexus is a multi-agent architecture in which specialized components coordinate query planning, evidence retrieval, validation, and synthesis. This design ensures that generated responses are grounded in verifiable sources while maintaining coherence across heterogeneous data modalities. By jointly leveraging curated databases and primary literature, GPCR-Nexus enables context-aware reasoning over molecular interactions, functional mechanisms, and disease associations. The platform produces citation-backed outputs with traceable evidence, addressing limitations of conventional database queries and standalone language models. We detail the system architecture, data integration strategy, and agent orchestration framework, and demonstrate its utility through representative query scenarios. GPCR-Nexus provides a scalable approach to combining structured and unstructured biomedical knowledge using agent-based AI, offering improved accuracy, interpretability, and coverage. This work establishes a foundation for trustworthy, AI-assisted knowledge synthesis in GPCR research and drug discovery.
bioinformatics · 2026-04-14 · v1
A Hierarchy-aware Gene Exploration Platform for Multi-layered Toxicogenomic Analysis: A Case Study on Acetaminophen-induced Hepatotoxicity
Kim, M.; Cui, Y.; Kim, M. G.
Background: The interpretation of high-dimensional transcriptomic data remains a major challenge in mechanistic toxicology and drug safety assessment. Conventional clustering approaches based solely on expression profiles often fail to capture intrinsic biological relationships among genes, limiting interpretability and downstream analysis. Methods: We developed a hierarchy-aware gene exploration platform that integrates structured biological knowledge from the HUGO Gene Nomenclature Committee (HGNC). The core of the framework is a similarity kernel based on a single-step hyperdiffusion formulation (H K H^T), which embeds gene family hierarchy into the similarity space. The platform is implemented as an interactive web application supporting Uniform Manifold Approximation and Projection (UMAP) visualization, Leiden clustering, functional enrichment analysis, and hierarchy-based gene recommendation. Results: Applied to a transcriptomic dataset of acetaminophen-induced acute liver failure (APAP ALF), the proposed approach achieved a 33.8-fold improvement in functional coherence compared to an expression-only baseline. The hierarchy-aware embedding produced compact and biologically consistent clusters, enabling identification of key toxicological modules, including disruption of RNA processing, extracellular matrix remodeling, and impairment of lipid transport. In addition, the framework detected small but highly significant regulatory modules associated with epigenetic reprogramming. Conclusion: By incorporating biological hierarchy into gene similarity, the proposed platform enhances the interpretability of transcriptomic analysis and enables structured exploration of functional relationships. This approach provides a practical framework for mechanistic insight generation and supports more transparent and reproducible analysis in toxicogenomics.
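The kernel form H K H^T can be illustrated with invented shapes: take H as a gene-by-family membership matrix and K as a family-by-family similarity (identity here, i.e. families treated as unrelated). The paper's actual H, K, and single-step hyperdiffusion details are not specified above, so this is only a structural sketch.

```python
import numpy as np

# H: gene x family membership (4 toy genes, 2 toy families)
H = np.array([[1., 0.],     # gene 0 in family A
              [1., 0.],     # gene 1 in family A
              [0., 1.],     # gene 2 in family B
              [0., 1.]])    # gene 3 in family B

# K: family x family similarity; identity means no cross-family relations.
K = np.eye(2)

# Hierarchy-induced gene x gene similarity.
S = H @ K @ H.T
```

With identity K, genes in the same family get similarity 1 and cross-family pairs get 0; a richer K (encoding family hierarchy) would diffuse similarity across related families.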
bioinformatics · 2026-04-14 · v1
found: Inferring cell-level perturbation from structured label noise in single-cell data
Afanasiev, E.; Goeva, A.
Recent work by Goeva et al. introduced HiDDEN, a method for refining batch-level labels to infer cell-level perturbation without prior knowledge of affected populations, addressing the mismatch between sample-level labels and heterogeneous perturbation effects across cells. Here, we present found, a Python and R implementation of HiDDEN, supporting pipeline customization, by-factor grouping, hyperparameter selection, and visualization. Through benchmarking across diverse datasets, we show that performance depends strongly on modeling choices, particularly regression, grouping, and embedding dimensionality. found provides a practical, flexible, and accessible framework for robust cell-level perturbation analysis.
bioinformatics · 2026-04-14 · v1
Identification of the novel inhibitors against M. tuberculosis ESX-1 secretion system EccA1 enzyme using virtual screening, docking and dynamics simulation techniques
Kumar, R.; Saxena, A. K.
The M. tuberculosis ESX-1 secretion system EccA1 enzyme is involved in the secretion of virulence factors and is essential for virulence and bacterial survival within the phagosome. Development of small-molecule inhibitors abolishing EccA1 function could yield new antivirulence drugs. In this study, we modeled the full-length EccA1 structure (573 residues, Mw ~62.4 kDa), which contains an N-terminal TPR domain and a C-terminal CbxX/CfqX-type ATPase domain. Through virtual screening of ZINC compounds targeting the C-terminal ATPase pocket of EccA1, we identified five compounds with favorable binding energies: Z1 (ZINC000004513760, -43.45 kcal/mol), Z2 (ZINC000000001793, -49.56 kcal/mol), Z3 (ZINC000005390388, -55.83 kcal/mol), Z4 (ZINC000257294577, -52.33 kcal/mol), and Z5 (ZINC000004824264, -44.44 kcal/mol). The Z1-Z5 compounds were compared with the ADP substrate (adenosine diphosphate, -35.00 kcal/mol) and the p97 ATPase inhibitors NMS873 (3-[3-cyclopentylsulfanyl-5-[[3-methyl-4-(4-methylsulfonylphenyl)phenoxy]methyl]-1,2,4-triazol-4-yl]pyridine, -48.68 kcal/mol) and CB5083 (1-[4-(benzylamino)-5H,7H,8H-pyrano[4,3-d]pyrimidin-2-yl]-2-methyl-1H-indole-4-carboxamide, -50.88 kcal/mol) against EccA1. The Z1-Z5 compounds exhibited good absorption, distribution, metabolism, and excretion (ADME) properties. Pharmacokinetic profiling and Lipinski's rule of five indicated drug-like properties for the Z1-Z5 compounds. A 100 ns dynamics simulation analysis of EccA1 complexed with (i) the Z1-Z5 compounds, (ii) the ADP substrate, and (iii) the NMS873 and CB5083 inhibitors showed high stability and biologically relevant conformations. These data indicate that the Z1-Z5 compounds may act as potential inhibitors of EccA1 and provide avenues for new antivirulence drug development, pending in vitro, in vivo, and clinical evaluation.
bioinformatics · 2026-04-14 · v1
A Machine Learning Approach for Physiological Role Prediction in Protein Contact Networks: a large-scale analysis on the human proteome
Cervellini, M.; Martino, A.
Proteins are fundamental macromolecules involved in virtually all biological processes. Their physiological roles are tightly linked to their three-dimensional structure, which can be naturally abstracted as Protein Contact Networks (PCNs), i.e., graphs where residues are nodes and edges encode spatial proximity. This representation enables the application of Graph Machine Learning to address the protein functional annotation gap at proteome scale. In this work, protein function prediction is studied on the majority of the human proteome, focusing on enzymatic activity and enzyme class assignment as well-defined and biologically meaningful targets. A large-scale supervised analysis was conducted on PCNs derived from experimentally resolved human protein structures. Multiple graph-based learning paradigms were systematically compared under a unified evaluation protocol, including handcrafted graph embeddings, kernel methods, and end-to-end Graph Neural Networks (GNNs). Feature engineering approaches comprised (i) spectral density embeddings of the normalized graph Laplacian and (ii) higher-order topological representations based on simplicial complexes, with optional INDVAL-based feature selection. These representations were paired with linear, ensemble, and kernel classifiers, while GNNs were trained directly on raw PCNs exploiting a diverse set of message-passing architectures. Two tasks were considered: binary classification of enzymatic versus non-enzymatic proteins and multiclass prediction of first-level Enzyme Commission (EC) classes. Performance was assessed using repeated stratified splits to ensure robust and variance-aware evaluation. In the binary enzymatic classification task, the Jaccard-based graph kernel achieved the best performance with an adjusted balanced accuracy of 0.90, closely followed by GNNs trained end-to-end on PCNs. 
In the multiclass EC prediction task, GNNs demonstrated superior discriminative power, reaching an adjusted balanced accuracy of 0.92 and outperforming all explicit embedding and kernel-based approaches. Overall, results indicate that EC class prediction is intrinsically more complex than binary enzymatic discrimination and benefits from the higher expressivity of deep message-passing architectures. The findings demonstrate that graph-based representations of protein structure support competitive functional prediction at proteome scale, with classical kernel methods and modern GNNs offering complementary strengths in terms of accuracy, scalability, and flexibility.
bioinformatics · 2026-04-14 · v1
Predicting Pre-treatment Resistance or Post-treatment Effect? A Systematic Benchmarking of Single-Cell Drug Response Models
Shen, L.; Sun, X.; Zheng, S.; Hashmi, A.; Eriksson, J.; Mustonen, H.; Seppänen, H.; Shen, B.; Li, M.; Vähä-Koskela, M.; Tang, J.
Intratumoral heterogeneity is a major driver of variable drug responses in cancer. Single-cell RNA sequencing (scRNA-seq) enables the characterization of such heterogeneity, providing an opportunity to predict drug response at single-cell resolution. As a result, a growing number of computational models have been developed to infer drug response from scRNA-seq datasets. However, their performance, robustness, and generalizability across different biological contexts have not been systematically evaluated. To address this gap, we conducted a comprehensive benchmarking of representative single-cell drug response prediction models. Using 26 curated datasets comprising over 760,000 cells across 12 cancer types and 21 therapeutic agents, we constructed balanced and imbalanced scenarios to reflect more realistic distributions of drug response labels. To address the lack of ground-truth drug-response labels in conventional scRNA-seq datasets, we further incorporated lineage-tracing data with experimentally validated drug-response annotations, enabling model evaluation in a clinically relevant pre-treatment prediction setting. Our results show that, across the tested methods, prediction performance is markedly higher in cell lines than in tissue samples. Under imbalanced conditions, most methods exhibited sharp performance declines, whereas scDEAL demonstrated the highest robustness. Independent validation using an in-house pancreatic ductal adenocarcinoma dataset further confirms the robustness of scDEAL and its ability to capture biologically meaningful state transitions. A label-substitution experiment revealed that this robust performance is partially driven by the model's specific training-label construction. However, benchmarking with lineage-tracing data reveals a fundamental limitation: most models capture drug-induced transcriptional changes but struggle to predict a cell's intrinsic resistance state prior to treatment.
In summary, our study not only defines the performance boundaries of current approaches but also highlights their limitations in addressing intratumoral heterogeneity, extreme class imbalance, and the prediction of intrinsic cellular resistance, emphasizing the need for the development of next-generation single-cell drug response models with stronger clinical relevance.
bioinformatics · 2026-04-14 · v1
SpaceExpander: Automated Drafting and Evaluation of Markush Claims for Chemical Space Expansion
Wu, R.; Mao, L.; Diao, Y.; Li, H.
Drafting Markush claims for chemical patents remains difficult because manual claim writing is slow, error-prone, and often fails to capture related chemical space in a systematic manner. We developed SpaceExpander, a computational method that converts disclosed compounds into generalized Markush claims by extracting core scaffolds, defining variable positions, decomposing complex substituents, and expanding substituent space through fragment matching. We evaluated the method on 24 publicly available chemical patents and compared its performance with IntelliPatent. SpaceExpander achieved a mean atom-level scaffold accuracy of 0.92 and exactly recovered the reference scaffold in 19 of 24 patents. By contrast, IntelliPatent could process only 2 patents from the same set, indicating more limited applicability to structurally diverse cases. We further examined practical claim coverage in a case study based on the Osimertinib patent. Using representative disclosed compounds as input, SpaceExpander drafted a Markush claim that covered 5 of 7 additional approved third-generation EGFR inhibitors beyond Osimertinib. These results show that SpaceExpander is a validated method for automated Markush claim drafting and chemical space expansion.
bioinformatics · 2026-04-14 · v1
SPEAR: Predicting Gene Expression from Single-Cell Chromatin Accessibility
Walter-Angelo, T.; Uzun, Y.
Single-cell multiome assays enable direct measurement of chromatin accessibility and gene expression within the same cell. Still, most experimental designs remain constrained to two (and, less commonly, three) modalities per cell. This limitation motivates computational models that can predict unmeasured layers and, simultaneously, help dissect how cis-regulatory accessibility relates to transcription at gene resolution. Existing cross-modal methods often prioritize latent alignment or modality reconstruction, making it difficult to isolate the impact of model inductive bias under a shared cis-regulatory feature definition. We present SPEAR, a configuration-driven framework for gene-centric regression of single-cell gene expression from chromatin accessibility using a fixed transcription-start-site-centered representation shared across model families. Here we show that, under identical features, splits, and evaluation, model performance stratifies reproducibly across two multiome systems (mouse embryonic development and human hemogenic endothelium), with transformer encoders achieving the strongest mean test correlations (0.546 and 0.470, respectively). Per-gene performance distributions reveal substantial heterogeneity in predictability, indicating that accessibility-driven signal is concentrated in a subset of genes across contexts. Shapley value-based feature attribution further localizes predictive signal to promoter-proximal bins, with feature importance decaying with distance from the transcription start site, supporting a promoter-centered regime of cis-regulatory control within the modeled window. Together, these results provide a controlled comparison of inductive biases for chromatin-to-expression prediction and deliver analysis-ready outputs for gene-level interpretation. SPEAR is open source and publicly available at https://github.com/UzunLab/SPEAR.
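A TSS-centered feature representation of the kind described above can be sketched as fixed-width bins of accessibility signal around a gene's transcription start site. The window size, bin size, TSS position, and signal below are all invented for illustration; SPEAR's actual configuration may differ.

```python
import numpy as np

# Assumed representation: +/- 2 kb around the TSS, summed into 200-bp bins,
# giving 20 features per gene (all sizes are illustrative).
window, bin_size = 2000, 200
tss = 5000

# Toy per-base accessibility track (e.g. ATAC insertion counts).
track = np.random.default_rng(2).poisson(1.0, size=10000).astype(float)

region = track[tss - window : tss + window]
features = region.reshape(-1, bin_size).sum(axis=1)   # shape (20,)
```

Because the same binning is applied for every model family, differences in test correlation can then be attributed to the encoder's inductive bias rather than to feature engineering.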
bioinformatics · 2026-04-14 · v1
A correlational study of ABCA3 and SCN4B as exercise-related biomarkers of patients with Stanford type A aortic dissection
Qiao, S.; Chen, T.; Xie, B.; Han, Y.; Wang, B.; Li, Y.; Jia, B.; Wu, N.
Background: Accumulating evidence indicates that moderate exercise may reduce the incidence of Stanford type A aortic dissection (TAAD), but the specific mechanisms remain unclear. This study aims to identify exercise-related biomarkers in TAAD patients and to investigate their underlying mechanisms. Methods: Transcriptome data related to TAAD and exercise-related genes were obtained from publicly available databases. Candidate biomarkers for TAAD were identified through an integrative approach incorporating differential expression analysis, machine learning, and expression level assessment, leading to the construction of a diagnostic model. Subsequently, functional enrichment, immune infiltration, regulatory network analysis, and computational drug prediction were conducted to systematically investigate the pathological mechanisms and translational potential of the identified biomarkers. Results: ABCA3 and SCN4B were identified as exercise-related biomarkers in TAAD progression. A nomogram incorporating these two biomarkers exhibited strong diagnostic performance for identifying the disease. Functional enrichment analysis revealed potential involvement of these biomarkers in disease progression through pathways including circadian rhythm regulation and ribosome biosynthesis. Additionally, immune cells such as M1 macrophages and naive B cells, as well as regulatory factors including hsa-miR-1343-3p and XIST, were found to be involved in this process. Finally, zonisamide and MRS1097 were identified through computational prediction as potential therapeutic drugs. Conclusion: ABCA3 and SCN4B were identified as exercise-related biomarkers associated with TAAD and represent potentially valuable targets for both diagnosis and treatment strategies.
bioinformatics · 2026-04-14 · v1
MAJEC: unified gene, isoform, and locus-level transposable element quantification from RNA-seq
Lim, T.-Y.; Firestone, A. J.
Background: The study of transposable elements (TEs) has become increasingly central to fields such as cancer biology, immunology, and aging. Accurately quantifying disease- or laboratory-mediated perturbations in these elements is critical to support this expanding research, yet current RNA-seq pipelines struggle with the pervasive overlap between TEs and protein-coding genes. Existing tools either aggregate to the subfamily level with no locus resolution (TEtranscripts), or provide locus-level quantification without modeling gene overlap (Telescope), with the latter attributing over 40% of TE signal to the 1.1% of loci that overlap gene exons. Results: We present MAJEC (Momentum Accelerated Junction Enhanced Counting), a unified Expectation-Maximization (EM) framework that jointly quantifies genes, transcript isoforms, and individual TE loci from BAM alignments in a single pass. Splice junction evidence informs transcript-level priors, enabling MAJEC to probabilistically distinguish genic from TE-derived reads. This approach was independently validated against Salmon and RSEM on isoform quantification benchmarks. The joint feature space reduces exon-overlap contamination of locus-level TE estimates from 43% of total signal (Telescope) to 5% (MAJEC), while preserving subfamily-level accuracy (differential expression r = 0.987 vs TEtranscripts). Using paired biological vignettes, we demonstrate that MAJEC correctly resolves both the false TE reactivation artifacts endemic to TE-only models, and the false gene upregulation artifacts that occur when heuristic rules misassign genuine intragenic TE transcription. Conclusion: MAJEC simultaneously produces the isoform and locus-level resolution that TEtranscripts lacks, with greater accuracy than Telescope, and runs faster than either.
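The Expectation-Maximization idea behind such read assignment can be illustrated with a drastically simplified toy (this is not MAJEC's actual model, which operates jointly over genes, isoforms, and TE loci with junction-informed priors): one gene and one overlapping TE locus share two ambiguous reads.

```python
import numpy as np

# comp[r, f] = 1 if read r is compatible with feature f (0: gene, 1: TE).
comp = np.array([[1., 1.],   # ambiguous read
                 [1., 1.],   # ambiguous read
                 [1., 0.],   # gene-unique read
                 [1., 0.],
                 [1., 0.],
                 [0., 1.]])  # TE-unique read

theta = np.full(2, 0.5)                   # initial relative abundances
for _ in range(100):
    w = comp * theta                      # E-step: responsibilities
    w /= w.sum(axis=1, keepdims=True)
    theta = w.sum(axis=0) / len(comp)     # M-step: expected read fractions
```

The unique reads anchor the abundances (3 gene-unique vs 1 TE-unique), so EM apportions the two ambiguous reads mostly to the gene; here theta converges to (0.75, 0.25).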
bioinformatics · 2026-04-14 · v1
BrainPET Studio: An Atlas-Based, User-Friendly Desktop Tool for Quantitative PET Neuroimaging Analysis
Nabizadeh, F.
Quantitative analysis of positron emission tomography (PET) neuroimaging data is essential for studying neurodegenerative diseases, yet existing processing pipelines often rely on computationally intensive software packages such as FreeSurfer, limiting accessibility for many research groups. Here I introduce BrainPET Studio, an open-source desktop application for atlas-based regional PET quantification that operates entirely in Montreal Neurological Institute (MNI) standard space. BrainPET Studio integrates affine registration, optional Müller-Gärtner (MG) partial volume correction (PVC), interactive quality control (QC), and standardized uptake value ratio (SUVR) calculation into a single graphical user interface (GUI), eliminating the requirement for FreeSurfer-based cortical reconstruction. I validated BrainPET Studio against two established pipelines: (1) the UC Berkeley Alzheimer's Disease Neuroimaging Initiative (ADNI) AV1451 (flortaucipir) pipeline, which employs FreeSurfer v7.1.1 parcellation, SPM-based coregistration, and Geometric Transfer Matrix (GTM) PVC in native subject space, and (2) the volBrain/petBrain online platform. Region-of-interest (ROI) SUVR values were compared across 322 subjects. Overall Pearson correlation coefficients for meta-ROI composites ranged from r = 0.83–0.96 versus ADNI and r = 0.86–0.94 versus volBrain/petBrain. Detailed per-subject validation on four representative cases across 112 FreeSurfer-defined regions demonstrated strong agreement for large cortical composites and acceptable variability for smaller medial temporal structures. These results establish BrainPET Studio as a reliable, accessible, and extensible tool for multi-site PET research, educational applications, and studies where FreeSurfer-based processing is impractical.
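The SUVR computed above is, at its core, a simple ratio: mean tracer uptake in a target ROI divided by mean uptake in a reference region. The voxel values below are toys; the tool's actual pipeline adds registration, partial volume correction, and atlas-defined ROIs before this step.

```python
import numpy as np

# Toy voxel intensities for a target ROI and a reference region
# (e.g. cerebellar gray); values are illustrative only.
roi_voxels = np.array([1.8, 2.1, 1.9, 2.0])
reference_voxels = np.array([1.0, 1.1, 0.9, 1.0])

# Standardized uptake value ratio: target mean over reference mean.
suvr = float(roi_voxels.mean() / reference_voxels.mean())
```

An SUVR above 1 indicates higher tracer retention in the target region than in the reference; meta-ROI composites average such ratios over predefined region sets.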
bioinformatics · 2026-04-13 · v1
Introducing the digital PCR data essentials standard to harmonize data structure for clinical and research use
Trypsteen, W.; Vynck, M.; Untergrasser, A.; Whale, A. S.; Rodiger, S.; Dobnik, D.; Bogozalec Kosir, A.; Milavec, M.; Kubista, M.; Pfaffl, M. W.; Nour, A. A.; Young-Kyung, B.; Bustin, S. A.; Calin, G.; Chen, Y.; Cleveland, M. H.; De Falco, A.; Forootan, A.; O'Sullivan, D. M.; Devonshire, A. S.; Foy, C. A.; Fraley, S. I.; Gleerup, D. G.; He, H.-J.; Hellemans, J.; Lievens, A.; Lind, G. E.; Porco, D.; Romsos, E. L.; Thas, O.; Drandi, D.; de Tayrac, M.; Taly, V.; Huggett, J. F.; Vandesompele, J.; De Spiegelaere, W.
Digital PCR (dPCR) is a powerful technology for absolute quantification of nucleic acids, valued for its accuracy, sensitivity, and repeatability. Yet, the commercialization of different instruments with proprietary software has introduced challenges to data analysis, interoperability, and comparability. Therefore, we present the Digital PCR Data Essentials Standard (DDES), a lightweight, human- and machine-readable, and cross-platform data standard developed in collaboration with the dPCR community. The standard consists of three file types designed to enable both manual inspection and automated analysis: (i) a main file summarizing experiment and reaction-level (meta-)data; (ii) an assay file describing targets and detection chemistry; and (iii) intensity files capturing partition-level raw fluorescence data per reaction. DDES supports a wide range of current dPCR applications, including singleplex and multiplex assays, endpoint and real-time readouts, and will be curated to implement future dPCR developments. By harmonizing the data structure, DDES lays the foundation for FAIR dPCR data practices and supports improved software compatibility, collaborative and reproducible research, and future dPCR data repositories.
bioinformatics · 2026-04-13 · v1
VeloTrace Reconciles Divergent Velocity and Trajectory in Single-cell Transcriptomics with Deep Neural ODE
Cheng, H.; Qiao, Y.; Feng, Y.; Wei, Y.; Li, J.; Cai, J.; Zheng, S.; Chen, S.; Li, G.; Simons, B. D.; Lian, Q.; Xin, H.
Cellular identity and fate transitions are governed by continuous molecular processes that form dynamic trajectories within a high-dimensional transcriptomic landscape. Existing methods attempt to model these dynamics from two complementary perspectives: trajectory inference and velocity modeling. Ideally, velocity and trajectory are dual aspects of transcriptomic dynamics, where velocity is tangent to the trajectory everywhere. This inherent connection between velocity and trajectory is currently absent in transcriptomic analysis. Splicing-based velocity estimates are precision-limited for inadequately sequenced genes, while trajectory inference prioritizes the modeling of global trends while omitting local dynamics. This divergence breaks the geometric continuity between local velocities and global trajectories, hindering the reliable interpretation of developmental dynamics. To reconcile trajectory inference and RNA velocity, we introduce VeloTrace, a framework that unifies them through Neural Ordinary Differential Equations (NeuralODEs). VeloTrace learns a continuous-time velocity field whose integral curves constitute the trajectory itself, while ensuring that velocities are tangent to integral paths everywhere. Leveraging a splicing quality score, VeloTrace incorporates high-quality splicing velocity as partial supervision for velocity orientation and grounding. During optimization, VeloTrace employs a Monte Carlo multi-time-frame supervision strategy to ensure coherence between local and global trajectories and suppress sequencing-induced stochastic diffusion. Through refining the velocity field and cell-specific parameters for pseudo-time, expression, and velocity, VeloTrace reconstructs a smooth, locally and globally coherent velocity-vector-guided flow in the transcriptomic latent space.
This strategy ensures a complementary integration of velocity and trajectory, imputing the transcriptional kinetics of genes with insufficient sequencing signal, whose kinetics cannot be accurately portrayed by splicing velocity. In simulation benchmarks, VeloTrace captured the transcriptional dynamics of all expressed genes, even those with inadequate sequencing coverage, producing velocity directions that were most consistent with the true direction and everywhere tangential across the entire process, outperforming state-of-the-art methods including scVelo, UniTVelo, VeloVI, and scTour. VeloTrace uniquely reconciles RNA velocity and trajectory inference, creating a velocity field in which each cell's past and future transitions can be inferred from its current state. Moreover, VeloTrace extends reliable velocity estimation to a broader set of genes. When applied to mouse neural stem cell differentiation data, it successfully recovers the dynamics of driver genes for two developmental lineages, including those with low expression, shedding light on their regulatory roles during differentiation. This unified framework lays the foundation for more accurate modeling of gene regulation and cell fate decisions in complex biological systems.
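The tangency property described above, that integral curves of a velocity field are by construction tangent to it everywhere, can be seen in a minimal numerical sketch. This is plain Euler integration of a hand-written linear field, not VeloTrace's NeuralODE:

```python
import numpy as np

def integrate(v, x0, t0, t1, steps=100):
    """Euler integration of dx/dt = v(x); the resulting integral curve is,
    by construction, tangent to the velocity field at every step."""
    x = np.asarray(x0, dtype=float)
    dt = (t1 - t0) / steps
    path = [x]
    for _ in range(steps):
        x = x + dt * v(x)       # step along the local velocity
        path.append(x)
    return np.array(path)

# Toy linear field v(x) = -x: states relax toward a fixed point at the origin
path = integrate(lambda x: -x, x0=[1.0, 0.5], t0=0.0, t1=3.0)
```

A learned field replaces the lambda in a NeuralODE setting; the integration loop (with a better solver) is otherwise the same.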
bioinformatics · 2026-04-13 · v1
IMAS enables target-aware integration of tumour multiomics to resolve communication-guided regulatory mechanisms
Deyang, W.; Yamashiro, T.; Inubushi, T.
Tumour multiomic datasets are often sparse, heterogeneous and limited in size, hindering robust and interpretable discovery of regulatory mechanisms. Here we present IMAS, a target-aware integrative framework for multiomic data augmentation and mechanism prioritization that leverages a pan-cancer single-cell multiomic resource to contextualize new tumour datasets and identify reliable sample-specific mechanistic hypotheses. IMAS combines shared latent-space modelling with target-domain adaptation to improve correspondence between predicted and observed RNA and TF profiles while concentrating explanatory predictive supports within the target dataset. Building on this adapted representation, IMAS reconstructs structured RNA-TF coupling networks, refines intercellular signaling through ligand-informed communication modelling, and organizes regulatory programs along communication-associated ordering. In independent colon cancer data, IMAS improved cluster-resolved correspondence and revealed communication-guided regulatory cascades across malignant epithelial states. A LAMB1-centred analysis further demonstrates how the framework supports progressive reinforcement of local regulatory structure and enables perturbation-based probing of context-specific dependencies. Rather than exhaustively predicting all possible outcomes, IMAS provides a target-aware strategy to construct consistent and interpretable mechanism-discovery scaffolds and prioritize regulatory dependencies in data-limited tumour systems.
bioinformatics · 2026-04-13 · v1
TB-Bench: A Systematic Benchmark of Machine Learning and Deep Learning Methods for Second-Line TB Drug Resistance Prediction
VP, B.; Jaiswal, S.; Meshram, A.; PVS, D.; S C, S.; Narayanan, M.
Drug-resistant tuberculosis (TB), characterized by prolonged treatment regimens and suboptimal treatment outcomes, remains a major obstacle to global TB elimination. Advances in sequencing technologies have enabled the development of machine-learning (ML) approaches, including deep-learning (DL) methods, to predict drug resistance directly from genomic data. However, a significant gap remains in translating these advances into clinical practice. While current approaches reliably predict resistance to first-line drugs, they show consistently lower and more variable performance for second-line drugs compared with traditional drug-susceptibility testing. To characterize these limitations and assess practical utility, we conducted a comprehensive survey and standardized benchmarking of current approaches for predicting TB drug resistance using whole-genome sequencing (WGS) data. Using systematic selection criteria, we identified 20 traditional ML and DL models from 8 studies and evaluated drug-specific versions across 14 second-line drugs within a unified framework. To account for methodological heterogeneity, the models were evaluated using three distinct feature sets reflecting variability in input representations. We trained and evaluated the models on different subsets of the WHO dataset, comprising 50,801 samples, and assessed generalizability using an external validation dataset comprising 1,199 samples. In the internal evaluation on the held-out WHO test dataset, traditional ML models using binary features achieved higher predictive performance than DL models. For example, XGBoost achieved the highest area under the precision-recall curve (PRAUC) scores (46%-93%) for 10 of the 14 drugs. However, performance varied substantially across drugs. Notably, the superior performance of traditional ML models - even with limited feature sets - highlights their applicability in low-resource settings. 
When evaluated on the external validation dataset, the performance of traditional ML and DL models was comparable, and neither class of models demonstrated substantial improvement over catalogue-based approaches, underscoring challenges in cross-dataset generalization. Overall, this benchmarking study provides a comprehensive and systematic evaluation of current approaches, establishes a rigorous evaluation framework for future comparisons, and identifies key methodological considerations necessary to advance robust drug resistance prediction in clinical settings. To enhance reproducibility and facilitate the application of TB-Bench to additional datasets and models, we have made the source code publicly available at https://github.com/BIRDSgroup/TB-Bench.
bioinformatics · 2026-04-13 · v1
HEIMDALL: Disentangling tokenizer design for robust transfer in single-cell foundation models
Haber, E.; Alam, S.; Ho, N.; Liu, R.; Trop, E.; Liang, S.; Yang, M.; Krieger, S.; Ma, J.
Foundation models for single-cell RNA-sequencing (scRNA-seq) data are emerging as powerful tools for single-cell analysis, yet their performance depends critically on how cells are tokenized into model inputs. Single-cell data lack a canonical tokenization scheme, and many design choices in current single-cell foundation models (scFMs) remain heuristic, entangled, and difficult to evaluate. Here, we introduce HEIMDALL, a unified framework for dissecting and redesigning tokenizers in scFMs. By decomposing existing tokenization strategies into individual design choices, HEIMDALL enables attribution of the components that underlie robust generalization, allowing more principled design of improved tokenizers. Combining HEIMDALL with a minimal transformer backbone, we find that tokenizer design is instrumental for generalization in challenging distribution-shift settings such as cross-tissue, cross-species, and cross-gene-panel cell type classification, as well as reverse perturbation prediction. We show that, while tokenizer choice has little effect in scenarios with matched train and test data, it becomes decisive under distribution shift. Rather than identifying a single globally optimal tokenizer, HEIMDALL reveals that robust transfer depends on a small number of tokenization design axes - especially gene identity, expression encoding, and ordering - that expose different biological priors to the model. In this sense, universal transferability in scFMs still depends on a non-universal tokenizer interface. Together, these findings establish tokenization as a critical design axis in scFMs and provide design principles and reusable infrastructure for more robust scFMs.
bioinformatics · 2026-04-12 · v3
CRIS: A Centralized Resource for High-Quality RNA Structure and Interaction Data in the AI Era
Lee, W. H.; Dharmawan, C.; Li, K.; Bai, J.; Solanki, P.; Sharma, A.; Zhang, M.; Lu, Z.
As interest in RNA-based therapeutics expands, there is a growing demand for RNA structure elucidation and RNA-RNA interaction mapping in both academic and clinical settings. Despite rapid advances in methods for RNA structure determination, the field faces persistent challenges in data reproducibility, quality control, and accessibility, largely due to inconsistencies in data processing and analysis workflows. Concurrently, methodological improvements have generated increasingly complex datasets, which necessitate a standardized framework. Here, we present the Crosslinking-based RNA Interactomes and Structuromes (CRIS) database, a comprehensive resource designed to address these limitations. Among existing experimental and computational approaches for RNA structure characterization, crosslinking-based technologies offer superior reliability, high throughput, and high resolution. CRIS provides rigorously curated datasets, standardized workflows, and user-friendly tools, together with built-in quality metrics and detailed visualization guidance to ensure reproducibility and transparency while pairing seamlessly with existing experimental pipelines. By delivering high-complexity RNA datasets alongside accessible computational tools, CRIS serves as a standardized reference for both new and existing data, facilitating investigation through comparative analyses and providing a training resource for deep learning-based computational exploration. This will enable integration into machine learning workflows for large-scale discovery of novel RNA structures.
bioinformatics · 2026-04-12 · v2
Spectral Graph Features for Reference-free RNA 3D Quality Assessment
Zhu, Y.; Zhang, H.; Calhoun, V. D.; Bi, Y.
Motivation: Existing RNA 3D structure quality assessment (QA) methods rely on local geometric descriptors or statistical potentials that evaluate atomic-level contacts but are blind to global topological coherence. This creates a critical failure mode: structures that are "locally correct but globally wrong", where well-formed local helices mask misplaced domains and incorrect overall packing.
Results: We introduce SpecRNA-QA, a lightweight method that scores RNA 3D models using multi-scale spectral features derived from the graph Laplacian of inter-nucleotide contact networks. By computing eigenvalue distributions, heat-kernel traces, and spectral entropy across four distance scales with binary and Gaussian kernels, SpecRNA-QA captures global structural coherence inaccessible to conventional descriptors. In leave-one-out cross-validation on CASP16 (42 targets, 7,368 models), spectral features achieve a median per-target Spearman ρ = 0.69 [95% CI: 0.64-0.73], significantly outperforming an internal geometry baseline (ρ = 0.47, Δρ = +0.22, Wilcoxon p = 1.2×10⁻¹⁰). Compared against established unsupervised statistical potentials (which require no labeled data, unlike the supervised spectral model), rsRNASP performs better on small-to-medium RNAs (ρ = 0.67 vs. 0.57, ≤200 nt). However, rsRNASP times out on most large RNAs (>200 nt), where SpecRNA-QA provides the strongest available quality signal (ρ = 0.72 vs. DFIRE 0.52), revealing clear complementarity between global-topological and local-energy scoring. A training-free heuristic using only three spectral statistics enables quality estimation without any labeled data.
Availability: SpecRNA-QA is available as a Python package at https://github.com/yudabitrends/specrnaq
Contact: ybi3@gsu.edu
Supplementary information: Supplementary data are available online.
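The descriptors named above (Laplacian eigenvalue distributions, heat-kernel traces, spectral entropy) are straightforward to compute from a contact network. The sketch below uses a single binary kernel at one distance cutoff and toy coordinates; SpecRNA-QA's actual multi-scale settings and Gaussian kernels are not reproduced:

```python
import numpy as np

def spectral_features(coords, cutoff=8.0, t=1.0):
    """Laplacian spectral descriptors of an inter-nucleotide contact graph."""
    d = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    adj = ((d < cutoff) & (d > 0)).astype(float)   # binary contact kernel
    lap = np.diag(adj.sum(axis=1)) - adj           # graph Laplacian L = D - A
    eigvals = np.linalg.eigvalsh(lap)              # eigenvalue distribution
    heat_trace = np.exp(-t * eigvals).sum()        # heat-kernel trace Tr(e^{-tL})
    p = eigvals / eigvals.sum()                    # assumes at least one contact
    p = p[p > 0]
    entropy = -(p * np.log(p)).sum()               # spectral entropy
    return eigvals, heat_trace, entropy

# Toy "RNA": 5 nucleotides spaced 6 Angstroms apart along a line
coords = np.arange(5)[:, None] * np.array([6.0, 0.0, 0.0])
eigvals, heat_trace, entropy = spectral_features(coords)
```

For a connected contact graph the smallest Laplacian eigenvalue is zero and the eigenvalue sum equals twice the number of contacts, which is what makes these quantities sensitive to global topology rather than local geometry.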
bioinformatics · 2026-04-12 · v2
On the correctness of gene tree tagging under a unified model of gene duplication, loss, and coalescence
Parsons, R.; Liu, Y.; Dua, P.; Markin, A.; Molloy, E.
ASTRAL-pro is the leading method for reconstructing species trees under complex evolutionary scenarios involving gene duplication, loss, and coalescence, commonly modeled by DLCoal. A unique aspect of ASTRAL-pro (A-pro) is that it utilizes rooted gene trees, with internal vertices labeled as duplications or speciations, to modify its objective function compared to the traditional ASTRAL method. Although there is a natural event-based definition of correct tagging when genes evolve with only gene duplications and losses, it cannot be applied when there is deep coalescence. Here, we introduce a definition of correct tagging that is broadly applicable, proposing that a gene tree vertex is correctly tagged as a duplication if it is the most recent common ancestor of at least one pair of gene copies related via a duplication event. Using this definition, we study some statistical properties of ASTRAL-pro's objective function under the DLCoal model and evaluate the accuracy of ASTRAL-pro's tagging algorithm in simulations.
bioinformatics · 2026-04-12 · v2
KNexPHENIX: A PHENIX-Based Workflow for Improving Cryo-EM and Crystallographic Structural Models
Nandi, S.; Conn, G. L.
New and improved methods for visualizing complex macromolecules in atomic detail continue to expand structural information in the Protein Data Bank, but accurately refining atomic models from experimental maps remains a challenge due to efficiency limitations of current refinement approaches. Standard PHENIX refinement can partially address these limitations with its speed and accessibility but often fails to yield the best model compared to more computationally demanding approaches. We therefore developed KNexPHENIX, a customized PHENIX-based workflow, to support optimal macromolecular model building. KNexPHENIX can be used to refine macromolecular structures obtained via cryo-electron microscopy (cryo-EM) or X-ray crystallography, regardless of molecular size or composition. KNexPHENIX was evaluated on deposited structures and de novo models and consistently produced models with lower MolProbity scores, indicating improved model stereochemistry, compared to default PHENIX, REFMAC Servalcat, REFMAC, or CERES refinement. Importantly, this was accomplished while maintaining model-to-map correlation for cryo-EM datasets and maintaining or reducing the Rfree-Rwork difference below accepted thresholds for X-ray crystallographic structures, thus limiting overfitting while preserving refinement accuracy. These results establish the KNexPHENIX workflow as a practical, accessible approach for refining both cryo-EM and crystallographic structures, enabling the generation of high-quality models for deposition and guiding further experimental studies.
bioinformatics · 2026-04-12 · v2
rnaends: an R package to study exact RNA ends at nucleotide resolution
Caetano, T.; Redder, P.; Fichant, G.; Barriot, R.
5' and 3' RNA-end sequencing protocols have unlocked new opportunities to study aspects of RNA metabolism such as synthesis, maturation and degradation, by enabling the quantification of exact ends of RNA molecules in vivo. From RNA-Seq data that have been generated with one of the specialized protocols, it is possible to identify transcription start sites (TSS) and/or endoribonucleolytic cleavage sites, and even, in some cases, co-translational 5' to 3' degradation dynamics. Furthermore, post-transcriptional addition of ribonucleotides at the 3' end of RNA can be studied at nucleotide resolution. While different RNA-end sequencing library protocols exist that have been adapted to a specific organism (prokaryote or eukaryote) or specific biological question, the generated RNA-Seq data are very similar and share common processing steps. Most importantly, the major aspect of RNA-end sequencing is that only the 5' or 3' end mapped location is of interest, contrary to conventional RNA sequencing that considers genomic ranges for gene expression analysis. This translates to a simple representation of the quantitative data as a count matrix of RNA-end locations on the reference sequences. This representation seems under-exploited and is, to our knowledge, not available in a generic package focused on analyses of exact transcriptome ends. Here, we present the rnaends R package, which is dedicated to RNA-end sequencing analysis. It offers functions for raw read pre-processing, RNA-end mapping and quantification, RNA-end count matrix post-processing, and further downstream count matrix analyses such as TSS identification, fast Fourier transform for signal periodic pattern analysis, or differential proportion of RNA-end analysis.
The use of rnaends is illustrated here with applications in RNA metabolism studies through selected rnaends workflows on published RNA-end datasets: (i) TSS identification, (ii) ribosome translation speed and co-translational degradation, (iii) post-transcriptional modification analysis and differential proportion analysis.
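rnaends is an R package; purely to illustrate its central representation, counting only the exact 5'-end position of each read rather than genomic ranges, here is a minimal Python sketch (the function name and the (start, end, strand) tuple layout are invented for illustration):

```python
def five_prime_counts(alignments, ref_len):
    """Count exact 5'-end positions of mapped reads on one reference.

    alignments: iterable of (start, end, strand) with 0-based half-open
    coordinates. Only the 5'-most mapped base of each read contributes,
    unlike range-based counting used for gene expression analysis.
    """
    counts = [0] * ref_len
    for start, end, strand in alignments:
        pos = start if strand == "+" else end - 1   # 5' end depends on strand
        counts[pos] += 1
    return counts

# Three toy reads: two sharing a 5' end (a putative TSS), one on the minus strand
reads = [(10, 60, "+"), (10, 55, "+"), (99, 150, "-")]
counts = five_prime_counts(reads, ref_len=200)
```

A pile-up of identical 5'-end positions across many reads is the raw signal behind TSS or cleavage-site calls in this kind of analysis.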
bioinformatics · 2026-04-11 · v4
Coherent Cross-modal Generation of Synthetic Biomedical Data to Advance Multimodal Precision Medicine
Marchesi, R.; Lazzaro, N.; Endrizzi, W.; Leonardi, G.; Pozzi, M.; Ragni, F.; Bovo, S.; Moroni, M.; Osmani, V.; Jurman, G.
Integration of multimodal, multi-omics data is critical for advancing precision medicine, yet its application is frequently limited by incomplete datasets where one or more modalities are missing. To address this challenge, we developed a generative framework capable of synthesizing any missing modality from an arbitrary subset of available modalities. We introduce Coherent Denoising, a novel ensemble-based generative diffusion method that aggregates predictions from multiple specialized, single-condition models and enforces consensus during the sampling process. We compare this approach against a multi-condition generative model that uses a flexible masking strategy to handle arbitrary subsets of inputs. The results show that our architectures successfully generate high-fidelity data that preserve the complex biological signals required for downstream tasks. We demonstrate that the generated synthetic data can be used to maintain the performance of predictive models on incomplete patient profiles and can leverage counterfactual analysis to guide the prioritization of diagnostic tests. We validated the framework's efficacy on a large-scale multimodal, multi-omics cohort from The Cancer Genome Atlas (TCGA) of over 10,000 samples spanning 20 tumor types, using data modalities such as copy-number alterations (CNA), transcriptomics (RNA-Seq), proteomics (RPPA), and histopathology (WSI). This work establishes a robust and flexible generative framework to address sparsity in multimodal datasets, providing a key step toward improving precision oncology.
bioinformatics · 2026-04-11 · v3
COMPASS: A Web-Based COMPosite Activity Scoring System to Navigate Health and Disease Through Deterministic Digital Biomarkers
Sinha, S.; Ghosh, P.
Quantifying pathway activity in a reproducible and interpretable manner remains a central challenge in systems biology and precision medicine. Here, we introduce COMPASS (COMPosite Activity Scoring System), a deterministic, ontology-free, threshold-based framework that converts gene expression into per-sample pathway activity scores without reliance on permutation or reference cohorts. COMPASS derives gene-specific activation thresholds directly from data, standardizes deviations from these boundaries, and integrates directionally opposing genes into a single composite score using closed-form logic. Implemented as an accessible web application, COMPASS enables users to upload expression matrices, define gene signatures, and perform activity scoring, statistical comparisons, and survival analyses without coding. Across diverse biological and clinical datasets, COMPASS generates stable and transferable digital biomarkers that quantify cellular states, benchmark the humanness and relevance of model systems, and enable outcome stratification. In head-to-head comparisons with widely used single-sample enrichment methods (GSVA and ssGSEA), COMPASS shows consistent performance across multi-cohort datasets, with improved discrimination when integrating bidirectional gene programs. Stratified bootstrap analyses further demonstrate reduced variability and increased robustness. By directly linking expression thresholds, deviation, and gene directionality, COMPASS provides a transparent and generalizable framework for ontology-free pathway activity quantification and outcome modeling.
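The scoring recipe described above (data-derived per-gene thresholds, standardized deviations, signed combination of directionally opposing genes) can be sketched as follows. The quantile threshold, gene names, and normalization are placeholders; COMPASS's actual closed-form logic is not reproduced here:

```python
import numpy as np

def composite_score(expr, up_genes, down_genes, q=0.5):
    """Illustrative threshold-based per-sample composite activity score.

    expr maps gene -> expression values across samples. A per-gene
    threshold (here simply a quantile) is derived from the data, and
    standardized deviations from it are combined so that up-genes add
    to the score and down-genes subtract from it.
    """
    n = len(next(iter(expr.values())))
    score = np.zeros(n)
    signed = [(g, +1) for g in up_genes] + [(g, -1) for g in down_genes]
    for gene, sign in signed:
        x = np.asarray(expr[gene], dtype=float)
        thr = np.quantile(x, q)                 # data-derived threshold
        sd = x.std() if x.std() > 0 else 1.0    # guard constant genes
        score += sign * (x - thr) / sd          # standardized deviation
    return score / len(signed)

# Hypothetical two-gene signature over three samples
expr = {"GENE_UP": [1.0, 2.0, 3.0], "GENE_DN": [3.0, 2.0, 1.0]}
s = composite_score(expr, up_genes=["GENE_UP"], down_genes=["GENE_DN"])
```

Because every step is a closed-form function of the input matrix, the score is deterministic: no permutations or reference cohorts enter the computation.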
bioinformatics · 2026-04-11 · v3
scLongTree: an accurate computational tool to infer the longitudinal tree for scDNAseq data
Khan, R.; Bhattarai, P.; Zhang, L.; Zhou, X. M.; Mallory, X.
Longitudinal single-cell DNA sequencing (scDNA-seq) refers to single-cell data sequenced at different time points, providing more knowledge of the order of mutations than scDNA-seq taken at only one time point. The technique can facilitate the inference of subclonal trees that depict the evolution of cancer cells and facilitate understanding of how cancer grows, with implications for prognosis and treatment. There is currently a scarcity of tools that can infer subclonal trees based on longitudinal scDNA-seq, and existing tools are limited in accuracy and scale. We therefore introduce scLongTree, a computational tool that can accurately infer a subclonal tree based on longitudinal scDNA-seq. ScLongTree is scalable to hundreds of mutations, and outperforms state-of-the-art tools such as LACE, SCITE, and SiCloneFit on a comprehensive simulated dataset. Tests on a real dataset, SA501, showed that scLongTree can more accurately interpret the progressive growth of the tumor than LACE, and is more robust to different numbers of mutations being used. Tests on a large AML dataset, AML107, which has 4,617 cells, show that scLongTree is scalable to thousands of cells. ScLongTree is freely available at https://github.com/compbio-mallory/sc_longitudinal_infer.
bioinformatics · 2026-04-11 · v2
A structure-informed deep learning framework for modeling TCR-peptide-HLA interactions
Cao, K.; Li, R.; Strazar, M.; Brown, E. M.; Nguyen, P. N. U.; Pust, M.-M.; Park, J.; Graham, D. B.; Ashenberg, O.; Uhler, C.; Xavier, R.
The interaction between T cell receptors (TCRs), peptides, and human leukocyte antigens (HLAs) underlies antigen-specific T cell immunity. Despite substantial advances in peptide-HLA presentation prediction, accurate modeling of coupled TCR-peptide-HLA recognition remains underdeveloped, limiting applications such as TCR and neoepitope prioritization in cancer and antigen identification in autoimmunity. Here we present StriMap, a unified framework for predicting TCR-peptide-HLA interactions by integrating physicochemical, sequence-context, and structural features at recognition interfaces. StriMap achieves state-of-the-art performance with improved generalizability and enables applications in both cancer and autoimmunity. As a case study in ankylosing spondylitis (AS), we screened 13 million peptides derived from 43,241 bacterial proteins and identified candidate molecular mimics that were experimentally validated to activate T cells expressing an AS-associated TCR. Notably, a top validated peptide was enriched in patients with inflammatory bowel disease (IBD), suggesting potential shared microbial triggers between AS and IBD. Overall, StriMap provides a generalizable framework for rational immunotherapy design and for dissecting antigenic drivers of autoimmunity.
bioinformatics · 2026-04-11 · v2
Large-Scale Statistical Dissection of Sequence-Derived Biochemical Features Distinguishing Soluble and Insoluble Proteins
Vu, N. H. H.; Nguyen Bao, L.
Protein solubility critically influences recombinant expression efficiency and downstream biotechnological applications. While deep learning models have improved predictive accuracy, the intrinsic magnitude, redundancy, and interpretability of classical sequence-derived determinants remain insufficiently characterized. We performed a statistically rigorous large-scale univariate analysis on a curated dataset of 78,031 proteins (46,450 soluble; 31,581 insoluble). Thirty-six biochemical descriptors were evaluated using Mann-Whitney U tests with Benjamini-Hochberg false discovery rate correction. Effect sizes were quantified using Cliff's δ, and discriminative performance was assessed by ROC-AUC. Although 34 features remained significant after correction, most exhibited small effect sizes and substantial class overlap, consistent with a weak-signal regime. The strongest effects were associated with size-related features (sequence length and molecular weight; δ ≈ -0.21), whereas charge-related descriptors, particularly the proportion of negatively charged residues (δ = 0.150; AUC = 0.575), showed consistent but modest shifts. Spearman correlation analysis revealed near-complete redundancy among major size-related variables (ρ up to 0.998). Applying a redundancy threshold (|ρ| ≥ 0.85), we derived a parsimonious composite integrating sequence length and negative charge proportion, achieving AUC = 0.624 (MCC = 0.1746). These findings demonstrate that sequence-level solubility information is intrinsically low-dimensional and governed by coordinated weak effects, establishing a transparent statistical baseline for large-scale solubility characterization.
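The analysis pattern used in the study, univariate tests with Benjamini-Hochberg correction and Cliff's δ effect sizes, is easy to reproduce in outline. This numpy-only sketch computes Cliff's δ by direct pair counting (rather than from the Mann-Whitney U statistic) and implements the BH step-up adjustment:

```python
import numpy as np

def cliffs_delta(a, b):
    """Cliff's delta by direct pair counting: P(a > b) - P(a < b)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    gt = (a[:, None] > b[None, :]).sum()
    lt = (a[:, None] < b[None, :]).sum()
    return (gt - lt) / (len(a) * len(b))

def bh_adjust(pvals):
    """Benjamini-Hochberg step-up adjusted p-values."""
    p = np.asarray(pvals, float)
    order = np.argsort(p)
    scaled = p[order] * len(p) / (np.arange(len(p)) + 1)
    # enforce monotonicity from the largest p-value downward
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty_like(p)
    out[order] = np.clip(adjusted, 0.0, 1.0)
    return out

# Toy "soluble vs insoluble" comparison for one descriptor
soluble, insoluble = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
delta = cliffs_delta(soluble, insoluble)   # -1.0: complete separation
qvals = bh_adjust([0.04, 0.01])
```

δ ranges from -1 to +1, with values near zero indicating heavy class overlap, which is why effect sizes of |δ| ≈ 0.2 are described above as a weak-signal regime despite statistical significance.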
bioinformatics · 2026-04-11 · v2
PRIZM: Combining Low-N Data and Zero-shot Models to Design Enhanced Protein Variants
Harding-Larsen, D.; Lax, B. M.; Garcia, M. E.; Mendonca, C.; Mejia-Otalvaro, F.; Welner, D. H.; Mazurenko, S.
Machine learning has repeatedly shown the ability to accelerate protein engineering, but many approaches demand large amounts of robust, high-quality training data as well as substantial computational expertise. While large pre-trained models can function as zero-shot proxies for predicting variant effects, selecting the best model for a given protein property is often non-trivial. Here, we introduce Protein Ranking using Informed Zero-shot Modelling (PRIZM), a two-phase workflow that first uses a high-quality low-N dataset to identify the most suitable pre-trained zero-shot model for a target protein property and then applies that model to rank and prioritize an in silico variant library for experimental testing. Across diverse benchmark datasets spanning multiple protein properties, PRIZM reliably separated low- from high-performing models using datasets of ~20 labelled variants. We further demonstrate PRIZM in enzyme engineering case studies targeting sucrose synthase thermostability and glycosyltransferase activity, where PRIZM-guided selection identified improved variants, including gains of ~3 °C in apparent melting temperature and ~20% higher relative activity. PRIZM provides an accessible, data-efficient route to leverage foundation models for protein design while requiring minimal experimental data.
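PRIZM's first phase, using a low-N labelled set to pick the best zero-shot model, reduces to computing a rank correlation per candidate model. A minimal sketch with synthetic data and hypothetical model names:

```python
import numpy as np

def spearman(x, y):
    """Spearman correlation as Pearson correlation of ranks (no tie handling)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return np.corrcoef(rx, ry)[0, 1]

def select_model(labels, model_scores):
    """Pick the zero-shot model whose scores best rank the low-N labelled set."""
    return max(model_scores, key=lambda m: spearman(model_scores[m], labels))

# Synthetic low-N dataset of ~20 labelled variants and two hypothetical models
rng = np.random.default_rng(0)
labels = rng.normal(size=20)
scores = {
    "model_a": labels + rng.normal(scale=0.1, size=20),  # tracks the property
    "model_b": rng.normal(size=20),                      # uninformative
}
best = select_model(labels, scores)
```

The selected model is then applied zero-shot to the full in silico variant library; the labelled set is used only for model selection, not for training.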
bioinformatics · 2026-04-11 · v2
DyGraphTrans: A temporal graph representation learning framework for modeling disease progression from Electronic Health Records
Rahman, M. T.; Al Olaimat, M.; Bozdag, S.; Alzheimer's Disease Neuroimaging Initiative
Motivation: Electronic Health Records (EHRs) contain vast amounts of longitudinal patient medical history data, making them highly informative for early disease prediction. Numerous computational methods have been developed to leverage EHR data; however, many process multiple patient records simultaneously, resulting in high memory consumption and computational cost. Moreover, these models often lack interpretability, limiting insight into the factors driving their predictions. Efficiently handling large-scale EHR data while maintaining predictive accuracy and interpretability therefore remains a critical challenge. To address this gap, we propose DyGraphTrans, a dynamic graph representation learning framework that represents patient EHR data as a sequence of temporal graphs. In this representation, nodes correspond to patients, node features encode temporal clinical attributes, and edges capture patient similarity. DyGraphTrans models both local temporal dependencies and long-range global trends, while a sliding-window mechanism reduces memory consumption without sacrificing essential temporal context. Unlike existing dynamic graph models, DyGraphTrans jointly captures patient similarity and temporal evolution in a memory-efficient and interpretable manner. Results: We evaluated DyGraphTrans on Alzheimer's Disease Neuroimaging Initiative (ADNI) and National Alzheimer's Coordinating Center (NACC) data for disease progression prediction, as well as on the Medical Information Mart for Intensive Care (MIMIC-IV) dataset for early mortality prediction. We further assessed the model on multiple benchmark dynamic graph datasets to evaluate its generalizability. DyGraphTrans achieved strong predictive performance across diverse datasets. We also demonstrated that the interpretations produced by DyGraphTrans align with known clinical risk factors.
bioinformatics · 2026-04-11 · v2 · Identification, evolutionary history and characteristics of orphan genes in root-knot nematodes
Seckin, E.; Colinet, D.; Bailly-Bechet, M.; Seassau, A.; Bottini, S.; Sarti, E.; Danchin, E. G.
Abstract:
Orphan genes, lacking homologs in other species, are systematically found across genomes. Their presence may result from extensive divergence from pre-existing genes or from de novo gene birth, which occurs when a gene emerges from a previously non-genic region. In this study, we identified orphan genes in the genomes of globally distributed plant-parasitic nematodes of the genus Meloidogyne and investigated their origins, evolution, and characteristics. Using a comparative genomics framework across 85 nematode species, we found that 18% of Meloidogyne genes are genus-specific, transcriptionally supported orphans. By combining ancestral sequence reconstruction and synteny-based approaches, we inferred that 20% of these orphan genes originated through high divergence, while 18% likely emerged de novo. Proteomic and translatomic evidence confirmed the translation of a subset of these genes, and feature analyses revealed distinctive molecular signatures, including shorter length, signal peptide enrichment, and a tendency for extracellular localization. These findings highlight orphan genes as a substantial and previously underexplored component of the Meloidogyne genome, with potential roles in their worldwide parasitism.
bioinformatics · 2026-04-11 · v2 · AEGIS: an annotation extraction and genomic integration resource
Navarro-Paya, D.; Santiago, A.; Velt, A.; Moretto, M.; Rustenholz, C.; Matus, J. T.
Abstract:
The GTF/GFF3 formats are the standard for storing and exchanging genome annotations. However, their flexibility often results in inconsistent and poorly formatted files across different sources, creating a major bottleneck for downstream bioinformatics analyses. Here, we present Annotation Extraction and Genomic Integration Suite (AEGIS), a comprehensive and user-friendly command-line toolkit designed to parse, validate and standardise genome annotation files. AEGIS robustly corrects common structural and formatting errors, ensuring interoperability with downstream tools. Beyond standardisation, the suite provides advanced modules for analysis, such as flexible sequence extraction (e.g. genes, CDS, proteins) with isoform handling, customisable promoter region definitions and targeted DNA motif searches. A key feature of AEGIS is its integrated workflow for comparative genomics, which combines multiple lines of evidence (i.e., sequence homology, synteny and coordinate-based lift-overs) to enable a robust gene ID correspondence and orthology assessment. We demonstrate the utility of AEGIS by comparing two major Arabidopsis thaliana annotations (TAIR10 vs. Araport11), successfully identifying and quantifying complex structural changes such as gene splits and fusions. AEGIS provides a unified solution for annotation quality control, feature extraction and comparative genomic analysis, simplifying complex workflows and enhancing reliability in bioinformatic research. The software is open-source, implemented in Python and is available on GitHub, PyPI, and as a Docker container to ensure accessibility and reproducibility.
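The kind of structural validation a GFF3 standardiser must perform can be sketched in a few lines. This is an illustration of the file format's rules, not AEGIS code; the function name, error messages, and the specific checks chosen are assumptions.

```python
GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")

def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict, raising on common
    structural errors (wrong column count, bad coordinates, bad strand)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    rec = dict(zip(GFF3_COLUMNS, fields))
    rec["start"], rec["end"] = int(rec["start"]), int(rec["end"])
    if rec["start"] < 1 or rec["start"] > rec["end"]:
        raise ValueError("invalid coordinates: start must be >= 1 and <= end")
    if rec["strand"] not in ("+", "-", ".", "?"):
        raise ValueError(f"invalid strand {rec['strand']!r}")
    # Attributes are semicolon-separated key=value pairs.
    rec["attributes"] = dict(kv.split("=", 1)
                             for kv in rec["attributes"].split(";") if kv)
    return rec
```

A real standardiser additionally has to repair inconsistencies across records (orphan Parent IDs, overlapping CDS phases, and so on), which is where tools like AEGIS go well beyond per-line checks.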
bioinformatics · 2026-04-11 · v2 · scMultiPreDICT: A single-cell predictive framework with transcriptomic and epigenetic signatures
Manful, E.-E.; Uzun, Y.
Abstract:
Cellular responses to genetic perturbations depend on both transcriptional programs and the epigenetic landscape. While single-cell multiomics technologies enable simultaneous profiling of gene expression and chromatin accessibility, the relative contribution of each regulatory layer to gene expression remains unclear. Existing computational approaches focus on data integration and gene regulatory network inference but do not systematically compare the predictive performance of transcriptional versus epigenetic features on a gene-by-gene basis. We present scMultiPreDICT, a computational framework for comparative predictive modeling of gene expression using single-cell multiomics data. scMultiPreDICT benchmarks RNA-only, ATAC-only and multimodal feature sets across six machine learning models including regression, tree-based learning and deep learning using multiple biological datasets. We show that RNA-derived features generally provide strong predictive power, whereas chromatin accessibility alone yields modest performance. Surprisingly, multimodal integration does not uniformly improve prediction accuracy; instead, its benefit is gene-specific and context-dependent. Feature importance analysis reveals that transcriptional features dominate for most genes, whereas chromatin accessibility contributes meaningfully for a subset of genes in specific cellular contexts. Overall, the results demonstrate that regulatory layers contribute differently to gene expression. scMultiPreDICT provides a systematic framework for identifying the relative contributions of transcriptional and epigenetic regulation across genes and cellular contexts, guiding the design of targeted perturbation studies and the prioritization of regulatory layers for therapeutic interventions. scMultiPreDICT is implemented in R and available at https://github.com/UzunLab/scMultiPreDICT/.
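The gene-by-gene comparison of feature sets can be sketched as follows. This is an illustrative stand-in, not scMultiPreDICT (which is implemented in R): here a single ordinary-least-squares fit replaces the paper's six-model benchmark, and the data shapes are invented. The point is the comparison structure — fit the same target gene from RNA-only, ATAC-only, and concatenated features, then compare goodness of fit.

```python
import numpy as np

def r_squared(X, y):
    """Coefficient of determination of an ordinary least-squares fit."""
    X1 = np.column_stack([np.ones(len(X)), X])      # add an intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot

def compare_feature_sets(rna, atac, y):
    """Per-gene predictive power of each regulatory layer for target y."""
    return {"rna_only": r_squared(rna, y),
            "atac_only": r_squared(atac, y),
            "multimodal": r_squared(np.hstack([rna, atac]), y)}
```

Because the multimodal design matrix contains the RNA-only columns, its training R² can never fall below the RNA-only value; any *generalization* benefit of integration, as the abstract notes, has to be assessed per gene on held-out data.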
bioinformatics · 2026-04-11 · v1 · Structural Connectome Analysis using a Graph-based Deep Model for Age and Dementia Prediction
Kazi, A.; Mora, J.; Fischl, B.; Dalca, A.; Aganj, I.
Abstract:
We address the prediction of non-imaging variables based on structural brain connectivity derived from diffusion magnetic resonance images, using graph-based machine learning. We predict age and the mini-mental state examination (MMSE) score as examples of a demographic and a clinical variable. We propose a machine-learning model inspired by graph convolutional networks (GCNs), which takes a brain connectivity graph as input and processes the data separately through a parallel GCN mechanism with multiple branches. The novelty of our work lies in the model architecture, especially the Connectivity Attention Block, which learns an embedding representation of brain graphs while providing graph-level attention. We show experiments on the publicly available PREVENT-AD and OASIS3 datasets. The proposed network is a simple design that employs different heads involving graph convolutions focused on edges and nodes, a linear branch, and skip connections, capturing representations from the input data thoroughly. To test the ability of our model to extract complementary and representative features from brain connectivity data, we chose the task of sex classification. We validate our model by comparing it to existing methods and via ablations. This quantifies the degree to which the connectome varies depending on the task, which is important for improving our understanding of health and disease across the population. The proposed model generally demonstrates higher performance, especially for age prediction, compared to the existing machine-learning algorithms we tested, including classical methods and (graph and non-graph) deep learning.
bioinformatics · 2026-04-10 · v3 · Structure-Based and Stability-Validated Prioritization of BACE1 Inhibitors Integrating Meta-Ensemble QSAR and Molecular Dynamics
Chowdhury, T. D.; Shafoyat, M. U.; Hemel, N. H.; Nizam, D.; Sajib, J. H.; Toha, T. I.; Nyeem, T. A.; Farzana, M.; Haque, S. R.; Hasan, M.; Siddiquee, K. N. e. A.; Mannoor, K.
Abstract:
Alzheimer's disease remains an unmet therapeutic challenge, and no β-secretase (BACE1) inhibitor has achieved clinical approval. A major limitation of prior discovery efforts is reliance on single-parameter optimization, often yielding computational hits with poor translational potential. Here, we present a stability-validated, biology-informed computational framework that integrates meta-ensemble QSAR (five tree-based classifiers with ECFP4 fingerprints), structure-based docking, Protein Language Model (ESM-1b)-guided hybrid residue interaction weighting, and comprehensive ADMET profiling within a normalized composite ranking scheme. Model robustness was confirmed through external validation and Y-randomization (n = 100; empirical p = 0.009). Heuristic weighting was quantitatively stress-tested using global ±10% perturbation analysis (mean Spearman's ρ = 0.998; mean Kendall's τ = 0.970), demonstrating exceptional ranking stability under controlled parameter uncertainty. Screening of 16,196 structurally diverse compounds, including CNS-active molecules, phytochemicals, approved drugs, and investigational agents, identified 153 predicted actives (accuracy 0.852; ROC-AUC 0.920), which were refined to 111 drug-like candidates and seven prioritized leads. Two-hundred-nanosecond molecular dynamics simulations confirmed stable binding within the BACE1 catalytic pocket and sustained interaction networks over time. Mol-2 exhibited the most favorable profile, characterized by low ligand RMSD (1.2–1.6 Å), persistent catalytic dyad interactions (ASP32 98%, ASP228 99%), predicted BBB permeability, acceptable efflux profile, and balanced ADMET characteristics consistent with CNS drug-like space.
Collectively, this integrative, interpretable, and robustness-validated framework provides a systematic strategy for multi-criteria lead prioritization and may serve as a transferable platform for structure-guided discovery of therapeutics targeting complex neurodegenerative pathways.
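The weight-perturbation stability check this abstract reports (random ±10% jitter of the composite-ranking weights, then rank correlation against the baseline ranking) can be sketched generically. Everything concrete here is an assumption — the candidate scores, weights, and trial count are invented, and Spearman's ρ is computed from scratch assuming untied scores — but the procedure is the standard one.

```python
import random

def ranks(xs):
    """Rank positions of each value (assumes no ties among floats)."""
    order = sorted(range(len(xs)), key=xs.__getitem__)
    r = [0.0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = float(rank)
    return r

def spearman(a, b):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra) ** 0.5
    vb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (va * vb)

def stability(scores, weights, n_trials=200, jitter=0.10, seed=1):
    """Mean Spearman correlation between the baseline composite ranking
    and rankings under random +/-`jitter` weight perturbations."""
    rng = random.Random(seed)
    def composite(w):
        return [sum(wi * si for wi, si in zip(w, row)) for row in scores]
    base = composite(weights)
    total = 0.0
    for _ in range(n_trials):
        w = [wi * (1 + rng.uniform(-jitter, jitter)) for wi in weights]
        total += spearman(base, composite(w))
    return total / n_trials
```

A mean ρ near 1 under this jitter, as the paper reports, indicates that the lead ordering is not an artifact of the particular heuristic weights chosen.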
bioinformatics · 2026-04-10 · v1 · PERREO: An integrated pipeline for repetitive elements analysis enables the repeatome expression profiling in cancer
Rodriguez-Martin, F.; Masero-Leon, M.; Gomez-Cabello, D.
Abstract:
Transcriptome-wide profiling of repetitive element expression reveals transposable element-derived transcripts that are deregulated in diverse biological contexts including cancer. However, most RNA-seq pipelines are optimized for annotated genes and substantially undercount repeat RNA molecules, limiting their discovery and characterization. Here we present PERREO, a comprehensive, user-friendly pipeline for analyzing repetitive RNA elements from short- and long-read sequencing data. PERREO performs quality control, repeat-aware alignment and quantification, differential expression analysis, co-expression network analysis, and de novo transcript assembly with minimal computational expertise required. We validate PERREO across cell lines, tumor tissues and liquid biopsies, demonstrating superior sensitivity to repetitive RNA signatures compared with standard RNA-seq approaches. PERREO integrates predictive modelling to identify biological associations and generates publication-ready visualizations. By removing the bioinformatic barrier to repetitive RNA discovery, this pipeline enables broader investigation of the repeatome's role in cellular biology and disease, yielding results that, for specific analytical objectives, outperform certain existing tools and pipelines.
bioinformatics · 2026-04-10 · v1 · BrightEyes-FFS: an open-source platform for comprehensive analysis of fluorescence fluctuation spectroscopy experiments with small detector arrays
Slenders, E.; Perego, E.; Zappone, S.; Vicidomini, G.
Abstract:
Fluorescence fluctuation spectroscopy (FFS) is an ensemble of techniques for quantitative measurement of molecular dynamics and interactions. Recently, the introduction of small-format array detectors has opened up a new range of spatiotemporal information, allowing for more detailed analysis of system kinetics. However, there is currently no open-source software available for analyzing these high-dimensional FFS data sets. We present BrightEyes-FFS, an open-source Python-based environment for FFS analysis with array detectors. The environment includes a Python package for reading raw FFS data, computing auto- and cross-correlations using various algorithms, and fitting the correlations to several models. A graphical user interface (GUI), available as a standalone executable, makes the analysis fast and user-friendly. An automated Jupyter Notebook writing tool enables a smooth transition from the GUI to a Jupyter Notebook for custom analysis. We believe that BrightEyes-FFS will enable a wider community to study diffusion, flow, and interaction dynamics.
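The central quantity such correlation analysis fits is the normalized intensity autocorrelation, G(τ) = ⟨I(t)·I(t+τ)⟩ / ⟨I⟩² − 1. The direct-summation definition below is a textbook sketch for illustration only; BrightEyes-FFS itself offers multiple, faster correlation algorithms and computes correlations per detector element and across element pairs.

```python
def autocorrelation(intensity, max_lag):
    """Normalized fluorescence autocorrelation
    G(tau) = <I(t) * I(t+tau)> / <I>^2 - 1, by direct summation."""
    n = len(intensity)
    mean = sum(intensity) / n
    g = []
    for lag in range(1, max_lag + 1):
        pairs = n - lag
        corr = sum(intensity[t] * intensity[t + lag] for t in range(pairs)) / pairs
        g.append(corr / (mean * mean) - 1.0)
    return g
```

For a constant trace G(τ) is zero at every lag; fluctuations from diffusing emitters produce a positive G at short lags that decays with the characteristic diffusion time, which is what the model-fitting step extracts.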
bioinformatics · 2026-04-10 · v1 · Statistical Principles Define an Open-Source Differential Analysis Workflow for Mass Spectrometry Imaging Experiments with Complex Designs
Rogers, E. B. T.; Lakkimsetty, S. S.; Bemis, K. A.; Schurman, C. A.; Angel, P. A.; Schilling, B.; Vitek, O.
Abstract:
Mass spectrometry imaging (MSI) characterizes the spatial heterogeneity of molecular abundances in biological samples. Experiments with complex designs, involving multiple conditions and multiple samples, provide particularly useful insight into differential abundance of analytes. However, analyses of these experiments require attention to details such as signal processing, selection of regions of interest, and statistical methodology. This manuscript contributes a statistical analysis workflow for detecting differentially abundant analytes in MSI experiments with complex designs. Using a case study of histologic samples of human tibial plateaus from knees of osteoarthritis patients and cadaveric controls, as well as simulated datasets, we illustrate the impact of the analysis decisions. We illustrate the importance of signal processing and feature aggregation for preserving biological relevance and alleviating the stringency of multiple testing. We further demonstrate the importance of selecting regions of interest in ways that are compatible with differential analysis. Finally, we contrast several common statistical models for differential analysis, showcase the appropriate use of replication, and demonstrate model-based calculation of sample size for follow-up investigations. The discussion is accompanied by detailed recommendations and an open-source R-based implementation that can be followed by other investigations.
bioinformatics · 2026-04-10 · v1 · Deep learning enables direct HLA typing from immunopeptidomics data
Pilz, M.; Scheid, J.; Bauer, A.; Lemke, S.; Sachsenberg, T.; Bauer, J.; Nelde, A.; Stadelmaier, J.; Walter, A.; Rammensee, H.-G.; Nahnsen, S.; Kohlbacher, O.; Walz, J. S.
Abstract:
The immune system eliminates malignant and infected cells through T-cell-mediated recognition of peptides presented by human leukocyte antigen molecules. Mass spectrometry-based immunopeptidomics enables unbiased identification of naturally presented HLA-restricted peptides and has become central to the development of T-cell-based immunotherapies. However, immunopeptidomics data reflects the combined peptide presentation of multiple HLA alleles, and determining which allotypes are represented in this multi-allelic complexity remains an unmet computational challenge. Here, we introduce immunotype, a deep learning-based ensemble predictor for HLA class I allotyping directly from immunopeptidomics data. Immunotype integrates peptide and HLA sequence information through transformer encoders and a graph neural network, complemented by a curated mono-allelic reference of known peptide-HLA binding preferences. Immunotype achieves an overall accuracy of 87.2% at protein-level resolution across diverse tissues and thereby enables rapid, cost-effective HLA typing of large-scale immunopeptidomics datasets.
bioinformatics · 2026-04-10 · v1 · A computational model for quantifying instability of tandem repeats across the genome
Dolzhenko, E.; English, A.; Mokveld, T.; de Sena Brandine, G.; Kronenberg, Z.; Wright, G.; Drogemoller, B.; Rowell, W. J.; Wenger, A. M.; Bennett, M. F.; Weisburd, B.; Erwin, G. S.; Jin, P.; Nelson, D. L.; Dashnow, H.; Sedlazeck, F.; Eberle, M. A.
Abstract:
Tandem repeats (TRs) exhibit high levels of somatic mosaicism, which is increasingly recognized as an important modifier of repeat expansion disorders. Long-read sequencing can capture full-length repeat alleles, yet robust frameworks for quantifying instability across TRs genome-wide are still needed. Here, we introduce a general-purpose model for quantifying TR instability in a given long-read sequencing dataset, without explicitly distinguishing biological mosaicism from technical noise, and which is broadly applicable to both simple and structurally complex loci. This model accurately characterizes allelic instability at each TR locus by representing the distribution of read-to-consensus deviations for each allele. Using HiFi sequencing data from 256 HPRC cell line samples, we fitted models for 617,007 TR loci, including known pathogenic repeats. We observe that instability levels are generally low, but vary substantially across individual TRs, and are driven more strongly by repeat composition than overall repeat length. Furthermore, we applied our method to targeted PureTarget long-read data from samples with known repeat expansions and identified significant mosaicism in the majority of expanded alleles. Our model offers a practical way to quantify instability of tandem repeats across the genome and to detect unusually unstable repeat alleles.
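The read-to-consensus deviation idea can be pictured with a deliberately simplified stand-in. The paper's model represents the full deviation distribution per allele; the sketch below collapses that to a single robust dispersion statistic (median absolute deviation), and both the consensus rule and the instability threshold are invented for illustration.

```python
from statistics import median

def allele_instability(read_lengths):
    """Summarize instability of one repeat allele as the spread of
    read-to-consensus deviations (consensus taken as the median length)."""
    consensus = median(read_lengths)
    deviations = [r - consensus for r in read_lengths]
    mad = median(abs(d) for d in deviations)  # robust dispersion statistic
    return {"consensus": consensus, "deviations": deviations, "mad": mad}

def is_unstable(read_lengths, mad_threshold=2.0):
    """Flag an allele whose reads scatter widely around the consensus
    (threshold in repeat-length units is an illustrative choice)."""
    return allele_instability(read_lengths)["mad"] > mad_threshold
```

In practice, as the abstract notes, deviations mix biological mosaicism with sequencing noise, so a per-locus model of the deviation distribution (rather than a single cutoff) is needed to call unusually unstable alleles.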
bioinformatics · 2026-04-10 · v1 · Structure-aware geometric graph learning for modeling protease-substrate specificity at scale
Guo, X.; Bi, Y.; Ran, Z.; Pan, T.; Sun, H.; Hao, Y.; Jia, R.; Wang, C.; Zhang, Q.; Kurgan, L.; Song, J.; Li, F.
Abstract:
Protease-substrate specificity is central to cellular regulation and disease pathogenesis, and accurately modeling its structural determinants remains challenging. Substrate recognition is governed by spatial constraints and higher-order relationships that extend beyond local sequence motifs. Most computational approaches rely predominantly on motif-centric or sequence-based representations, limiting their ability to capture the geometric and relational structure underlying enzymatic specificity. Here, we introduce OmniCleave, a structure-aware geometric graph learning framework for modeling protease-substrate specificity at scale. OmniCleave is trained on 57,278 structure-informed protease-substrate pairs derived from 9,651 substrates spanning over 100 proteases across six distinct families. The framework integrates multi-scale structural graphs with higher-order protease relational topology, explicitly encoding spatial context and inter-protease dependencies within a unified geometric representation. This formulation moves beyond local pattern recognition and enables transferable modelling across six protease families. Across large-scale benchmarks, the framework consistently outperforms existing approaches and reveals interpretable geometric determinants underlying substrate recognition. Experimental validation confirms three novel caspase-3 substrates and 21 cleavage sites predicted by OmniCleave, supporting the biological relevance of the learned representations. Together, OmniCleave provides a scalable geometric framework for modeling protease-substrate specificity, with practical utility for systematic analysis of protease biology.
bioinformatics · 2026-04-10 · v1 · SimpleFold-Turbo: Adaptive Inference Caching Yields 14-fold Acceleration of Flow-Matching Protein Structure Prediction
Taghon, G.
Abstract:
We apply TeaCache, an adaptive caching technique from video diffusion, to SimpleFold's flow-matching protein structure prediction and achieve 9- to 14-fold inference speedups with negligible quality loss. We determine that flow matching's near-linear generative trajectories make consecutive neural-network evaluations highly redundant. At a low redundancy threshold, SimpleFold-Turbo (SF-T) skips ≈93% of forward passes while preserving near-baseline template modeling (TM) scores across 300 structurally diverse CATH domains and all six SimpleFold model sizes (100 million to 3 billion parameters), at compute budgets where log-uniform step-skipping collapses. Speedup scales with model size because caching overhead is constant while per-step cost grows, and a general three-phase skip pattern emerges independent of protein size or fold. SF-T requires no retraining, no weight modification, and no MSA server dependencies. We release SF-T as fully open-source software enabling thousands of structure predictions per hour on commodity hardware.
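The caching idea can be reduced to a toy one-dimensional sampler. This is not the SimpleFold or TeaCache code: the velocity function, step count, step size, and skip threshold below are all illustrative. The mechanism is simply to reuse the last computed velocity while the state has drifted little since the last real model evaluation, which pays off when the generative trajectory is near-linear.

```python
def flow_sample(v, x0, n_steps=50, dt=0.02, skip_threshold=0.05):
    """Euler sampler that reuses the cached velocity while the state has
    changed by less than `skip_threshold` (relative change) since the
    last real model call -- a toy version of adaptive inference caching."""
    x = x0
    cached_v, x_at_call = None, None
    calls = 0
    for _ in range(n_steps):
        drift = (abs(x - x_at_call) / (abs(x_at_call) + 1e-12)
                 if x_at_call is not None else float("inf"))
        if cached_v is None or drift > skip_threshold:
            cached_v = v(x)          # the "expensive" model evaluation
            x_at_call = x
            calls += 1
        x = x + dt * cached_v        # Euler update with (possibly cached) v
    return x, calls
```

With a threshold of zero every step is evaluated; with a small positive threshold most evaluations are skipped while the trajectory endpoint stays close to the fully evaluated one. TeaCache additionally estimates the change from cheap model inputs rather than the raw state, but the skip-versus-recompute decision has this shape.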
bioinformatics · 2026-04-10 · v1 · Generating, curating, and evaluating trnL reference sequence databases: Benchmarking OBITools3/ecoPCR, RESCRIPt, and MetaCurator
Kuddar, O. S.; Meiklejohn, K. A.; Callahan, B. J.
Abstract:
Plant DNA metabarcoding enables the identification of plant taxa in mixed samples, with the trnL (UAA) intron and its P6 loop mini-barcode region performing as well as or better than other commonly used markers. Reliable metabarcoding requires high-quality reference databases, yet a regularly maintained trnL resource is currently lacking. Consequently, most studies use uncurated sequences downloaded directly from public repositories without essential validation. We address these gaps by systematically comparing three database curation tools (OBITools3/ecoPCR, RESCRIPt, and MetaCurator), using each to generate a trnL reference sequence database and evaluating classification performance across commonly sequenced trnL regions (CD, CH, and GH). Reference trnL sequences and taxonomy files were retrieved from public sequence repositories and curated using standardized filtering steps to reduce taxonomic errors, sequence ambiguity, and redundancy. Four simulated query datasets (two base sets and their mutated counterparts) were constructed to assess classification performance of the databases using the Naive Bayesian Classifier implemented in DADA2. The evaluation showed that performance differed by trnL region: MetaCurator and RESCRIPt yielded higher and similar metrics for trnL CD; OBITools3/ecoPCR and RESCRIPt were comparable for trnL CH; and MetaCurator attained the highest performance for the trnL GH region. All reference databases, taxonomy, and evaluation files are available at Zenodo (https://doi.org/10.5281/zenodo.17969450). The complete computational workflow and scripts are available on GitHub (https://github.com/oskuddar/trnL_DB). Although the evaluation focused on plant taxa in the United States, the resulting databases are suitable for use as global trnL reference databases.
bioinformatics · 2026-04-10 · v1 · Synolog: A Scalable Synteny-Based Framework for Genome Architecture Characterization
Madrigal, G.; Catchen, J. M.
Abstract:
Characterizing genome architecture across multiple organisms has been an ongoing task for decades. The continuing growth of genomic datasets not only serves as a resource for studying genome evolution but also warrants the availability of scalable and user-friendly software for processing these datasets. Here, we present Synolog, a bioinformatic toolkit that can automatically identify orthologs for both protein-coding and non-coding genes, synteny clusters across two or more genomes, retrogenes, and segmental duplications. Applying Synolog, we illustrate cases of local gene expansions in ecologically disparate turtle species, identify synteny clusters across hundreds of millions of years of metazoan evolution, and reconstruct chromosome-level assemblies in teleosts using the inferred synteny clusters, all using its integrated visual features. In parallel, we compare our orthogroup method to that of commonly used software and note the tradeoffs of making inferences solely based on sequence similarity versus a synteny-based approach.
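At its simplest, synteny detection reduces to finding runs of genes that are consecutive and collinear in two genomes. The sketch below is a toy illustration, far from Synolog's full orthology-aware clustering across many genomes: gene IDs are placeholders, orientation and gaps are ignored, and the minimum block length is an arbitrary choice.

```python
def synteny_blocks(genome_a, genome_b, min_len=2):
    """Find maximal collinear blocks: runs of shared genes that are
    consecutive, in the same order, in both genomes."""
    pos_b = {g: i for i, g in enumerate(genome_b)}
    blocks, i = [], 0
    while i < len(genome_a):
        if genome_a[i] not in pos_b:
            i += 1
            continue
        j = pos_b[genome_a[i]]
        k = 1  # extend the block while collinearity holds
        while (i + k < len(genome_a)
               and genome_a[i + k] in pos_b
               and pos_b[genome_a[i + k]] == j + k):
            k += 1
        if k >= min_len:
            blocks.append((i, j, k))   # (start in A, start in B, length)
        i += k
    return blocks
```

Real synteny tools must also tolerate inversions, insertions of unrelated genes within a block, and many-to-many orthology, which is exactly what distinguishes a synteny-based framework from this naive exact-run scan.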
bioinformatics · 2026-04-10 · v1 · Impact of Regularization Methods and Outlier Removal on Unsupervised Sample Classification
Heckman, C. A.
Abstract:
Background: High-content assays have problems distinguishing biologically significant effects from the incidental effects of non-repeatable technical factors. Non-repeatable results are attributed to variations in the cell culture environment and the numerous, heterogeneous descriptors evaluated. The aim here was to determine whether preprocessing operations impacted the reproducibility of class assignments of experimental data. Methods: Batch effects that could affect reproducibility (signal-to-noise ratio, instrumental conditions, and segmentation) were controlled variables. The remaining batch effects (variations in materials, personnel, and culture environment) could not be controlled. The values of descriptors were measured directly from images. Exploratory factor analysis was used to resolve an identifiable and interpretable feature, factor 4. In each of five trials, one sample was treated with the same chemical mixture (EXP) and another with the solvent vehicle alone (CON). Results: Repeated CON and EXP samples showed significant differences among factor 4 means in data regularized within each trial. The mean of Trial 3 CON differed significantly from all other CON samples. These differences disappeared upon regularization to comprehensive databases. Among repeated EXPs, the Trial 2 mean differed from three other EXPs, but regularization to comprehensive databases had little effect. However, classification patterns were unchanged after regularization to any comprehensive database derived by the same protocol. After regularization to datasets derived by two different protocols, the classification pattern differed but only reflected elevation of differences that had been marginal to statistical significance. Outlier removal was deleterious. Even with the most sparing definition of outliers, over 3% of the contents of a single sample were removed from most trials. Elimination based on the overall within-trial distributions caused type I and type II errors.
Conclusions: Non-repeatable factor 4 means in repeated trials had negligible influence on classification outcomes, so repeatability may not be a good indicator of assay quality. Irreducible batch effects, combined with small sample sizes and skewed distributions of the descriptor values, may account for non-repeatability. As the current results are based on real-world data, they suggest that non-repeatability is an uncorrectable feature of these assays. Classification patterns are not affected by several irreducible technical factors, namely materials, personnel, and non-repeatable environmental variables.