Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
Permute-match tests detect significant correlations between time series despite nonstationarity and limited replicates
Yuan, A. E.; Shou, W.
Abstract
Researchers frequently analyze correlations between pairs of time series by determining whether an observed correlation is stronger than expected under the null hypothesis of independence. However, the time series are often nonstationary, with statistical properties that change over time, thereby making standard tests invalid. If sufficient replicates exist, a trial-swapping permutation test can be performed that handles nonstationarity by comparing within-replicate correlations to between-replicate correlations. Although largely assumption-free, this test is fundamentally limited by the number of replicates (n) because its minimum p-value is 1/n!. With n=3, this minimum is 1/6, rendering thresholds like 0.05 unattainable. This limits its use considerably in animal experiments, where n may be as low as 3. We propose permute-match tests -- modified permutation tests that can report lower p-values of 2/n^n or 1/n^n under strong evidence of dependence. Permute-match tests guarantee a false positive rate at or below the significance level when replicates are independent and identically distributed. The bound of 1/n^n is not gratuitously conservative, since it cannot be further lowered without additional assumptions. We demonstrate our approach using synthetic data and apply it to an existing dataset with 3 independent groups of zebrafish, confirming the observation that zebrafish swim faster when directionally aligned.
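As an illustration of the arithmetic in this abstract, the following is a minimal Python sketch of an exact trial-swapping permutation test. It assumes the test statistic is the mean within-replicate Pearson correlation, which the abstract does not specify. The names `trial_swap_pvalue` and `p_value_floors` are hypothetical, and the permute-match procedure itself is not implemented; only its reported floor of 1/n^n is tabulated.

```python
import itertools
import math

import numpy as np


def trial_swap_pvalue(xs, ys):
    """Exact trial-swapping permutation p-value (illustrative sketch).

    xs, ys: arrays of shape (n_replicates, n_timepoints).
    Assumed statistic: mean Pearson correlation between x_i and y_perm(i).
    """
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    n = xs.shape[0]
    # corr[i, j] = correlation between x from replicate i and y from replicate j
    corr = np.array(
        [[np.corrcoef(xs[i], ys[j])[0, 1] for j in range(n)] for i in range(n)]
    )

    def stat(perm):
        return sum(corr[i, perm[i]] for i in range(n)) / n

    observed = stat(tuple(range(n)))  # identity = true within-replicate pairing
    perms = list(itertools.permutations(range(n)))
    # Count pairings at least as extreme as the observed one (identity included),
    # with a small tolerance for floating-point ties.
    count = sum(stat(p) >= observed - 1e-12 for p in perms)
    return count / len(perms)


def p_value_floors(n):
    """Smallest attainable p-values quoted in the abstract for n replicates."""
    return {"trial_swap": 1 / math.factorial(n), "permute_match": 1 / n**n}
```

With n = 3 the trial-swapping floor is 1/6 ≈ 0.167, whereas 1/n^n = 1/27 ≈ 0.037 falls below the 0.05 threshold.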
bioinformatics 2026-07-01 v5
Age-related erosion of X chromosome inactivation in human tissues
Rocca, C.; Gylemo, B.; Edwards, M.; Cing, Z.; Gibbs, J. R.; Nestor, C. E.; DeCasien, A. R.
Abstract
Age-related diseases often show sex differences, yet their molecular bases remain unclear. Animal models suggest that age-related disruption of X-chromosome inactivation (XCI) occurs in female mice. We test whether this phenomenon extends to humans using bulk and single-cell datasets. We find that age-dependent escape from XCI also occurs in human females, particularly among genes at the distal ends of the X-chromosome and those involved in genome stability. These findings provide preliminary evidence that XCI erosion represents a human female-specific aging process.
bioinformatics 2026-07-01 v2
MORPH Predicts the Single-Cell Outcome of Genetic Perturbations Across Conditions and Data Modalities
He, C.; Zhang, J.; Dahleh, M. A.; Uhler, C.
Abstract
Modeling cellular responses to genetic perturbations is a significant challenge in computational biology. Measuring all gene perturbations and their combinations across cell types and conditions is experimentally challenging, highlighting the need for predictive models that generalize across data types to support this task. Here we present MORPH, a MOdular framework for predicting Responses to Perturbational cHanges. MORPH combines a discrepancy-based variational autoencoder with an attention mechanism to predict cellular responses to unseen perturbations. It supports both single-cell transcriptomics and imaging outputs and can generalize to unseen perturbations, combinations of perturbations, and perturbations in new cellular contexts. The attention-based framework enables inference of gene interactions and regulatory networks, while the learned gene embeddings can guide the design of informative perturbations, as demonstrated in two applications. Overall, MORPH is a flexible tool for optimizing perturbation experiments, enabling efficient exploration of the perturbation space to advance understanding of cellular programs for fundamental research and therapeutic applications.
bioinformatics 2026-07-01 v2
Generative design of antigen-specific T-cell receptor sequences with a conditional diffusion model
Zhang, Y.; Liang, W.; Xu, S.; Witney, M.; Su, X.; Andrews, M. C.; Rossjohn, J.; Purcell, A. W.; Wang, F.; Song, J.
Abstract
T cell receptor (TCR)-based immunotherapy holds immense potential for treating cancers, autoimmunity, and infectious diseases, where antigen-specific TCR recognition is crucial for adaptive immune responses. Engineering or de novo generation of the complementarity-determining region 3 (CDR3) loops of TCRs using artificial intelligence offers a powerful alternative to laborious experimental screening for designing antigen-specific TCRs. However, current in silico approaches are constrained by weak conditional guidance, limited flexibility, and a lack of rigorous functional validation. To address these limitations, we introduce TCRDiff, a generative diffusion framework for designing antigen-specific TCRs conditioned on peptide-MHC (pMHC) targets and germline-encoded TCR variable genes. By leveraging pre-trained knowledge from massive T-cell repertoires and TCR-pMHC recognition data, TCRDiff generates CDR3β sequences that closely resemble native-binding TCRs via a denoising diffusion process. Furthermore, incorporating interface geometry features generated TCR-pMHC complexes with greater structural plausibility than models relying solely on sequence-based diffusion or structure-based modeling. As a proof of concept, we deployed TCRDiff in a systematic pipeline to design candidate TCRs against a clinically validated cancer antigen. In vitro activation assays validated that TCRDiff-generated TCRs efficiently recognize the MAGE-A3 epitope with minimal off-target reactivity. Thus, TCRDiff establishes a powerful, validated computational paradigm to accelerate the development of TCR-based immunotherapies.
bioinformatics 2026-07-01 v2
AI-guided discovery for low-resource peptide engineering using evolutionary scale modeling
Andrekson, L.; Rydbergh, R.; Mercado, R.; Wenzel, M.
Abstract
Reliable estimation of downstream performance in low-data peptide machine learning is critical for guiding early-stage AI-driven peptide engineering. Yet, it is often unclear how to assess whether a model will be effective in iterative discovery settings. Here, we show that the cross validation R² score can serve as a simple and robust proxy for predicting active learning workflow performance, enabling early-stage evaluation of model suitability for sequential peptide optimization. To support this, we introduce SCARSE, a machine learning framework combining ESM-2 protein language model embeddings with Gaussian process regression and extremely randomized trees classification, designed for low-resource peptide property prediction (20-500 training samples). We benchmark SCARSE across 23 peptide and small-protein datasets covering substitution and indel variants, antimicrobial peptides, cell-penetrating peptides, and toxic/non-toxic peptides. SCARSE significantly outperforms a hand-engineered descriptor baseline on substitution and indel tasks, while comparable performance was achieved on shorter peptide non-mutant datasets where simpler descriptors capture enough of the signal. In simulated active learning workflows, SCARSE consistently outperforms baseline and random sampling strategies. Notably, we demonstrate that CV R² computed from as few as 50 labeled peptides can be sufficient to estimate final active learning end-point performance, providing a practical, data-efficient criterion for deciding whether a given dataset combined with SCARSE is suitable for iterative peptide discovery. SCARSE is released as a pip package and is available via HuggingFace Spaces to facilitate integration into peptide engineering workflows.
bioinformatics 2026-07-01 v1
MintCNA: A Unified Framework for Integrative Copy Number Profiling with Single-Cell Multi-Omics Data
Bao, W.; Qin, F.; Xiao, F.
Abstract
Chromosomal copy number alterations (CNAs) are key drivers of tumor evolution, disease progression and therapeutic resistance, and identifying them is an important step in delineating tumor clonal structure. However, accurately resolving CNA landscapes from single-cell data remains challenging. Most existing tools analyze one omics layer at a time and are susceptible to assay-specific noise, limiting their ability to recover shared or modality-specific CNAs. Recent single-cell multi-omics techniques enable joint sequencing of multiple molecular layers in the same cells, yet in silico methods that fully exploit such complementary multi-modal data for CNA analysis are still missing. Here we present MintCNA, a unified single-cell multi-omics integration framework for CNA detection from paired multi-omics data. MintCNA integrates traditional statistical modeling with an embedded deep learning structure to enhance CNA profiling across multi-omics. We use an attention-guided convolutional autoencoder for data denoising and perform multivariate change-point detection utilizing a sliding-window screening and ranking procedure. Missingness-adjusted CUSUM statistics are constructed which jointly aggregate omics features by a data-adaptive projection to detect genome-wide chromosomal breakpoints. Across various simulations and applications to a colorectal cancer multi-omics dataset, MintCNA consistently outperforms existing single-omics CNA callers in detection accuracy. MintCNA provides a single-cell CNA tool that integrates paired scDNA-seq and scRNA-seq, supporting the study of intra-tumor heterogeneity and tumor evolution.
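For readers unfamiliar with CUSUM-based change-point detection, the sketch below shows the textbook univariate form for a single shift in mean. It is not MintCNA's method: the missingness adjustment, multivariate projection, and sliding-window screening described in the abstract are omitted, and the function name `cusum_changepoint` is hypothetical.

```python
import numpy as np


def cusum_changepoint(x):
    """Locate a single mean shift with the standard scaled CUSUM statistic.

    Returns (k, stat), where the split is x[:k] | x[k:] and stat is
    sqrt(k * (n - k) / n) * |mean(x[:k]) - mean(x[k:])| at the best k.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    if n < 2:
        raise ValueError("need at least two observations")
    k = np.arange(1, n)                      # candidate split positions
    csum = np.cumsum(x)
    left_mean = csum[:-1] / k                # mean of x[:k]
    right_mean = (csum[-1] - csum[:-1]) / (n - k)  # mean of x[k:]
    stat = np.sqrt(k * (n - k) / n) * np.abs(left_mean - right_mean)
    best = int(np.argmax(stat))
    return int(k[best]), float(stat[best])
```

In a copy-number setting, `x` could stand for a denoised per-bin signal along one chromosome, with the returned `k` marking a candidate breakpoint; that mapping is an assumption made for illustration.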
bioinformatics 2026-07-01 v1
Direct probabilistic quantification of mosaic loss of chromosome Y from sequencing data
Lin, J.-R.; Chang, Y.-C.; Maslov, A. Y.; Song, Y.; Gao, T.; Shan, J.; Bennett, D. A.; Milman, S.; Barzilai, N.; Vijg, J.; Montagna, C.; Zhang, Z.
Abstract
Loss of chromosome Y (LOY) is the most common aneuploidy in aging men and is increasingly recognized as a marker of aging and genomic instability. Because LOY occurs in mosaic form, its degree reflects the fraction of cells lacking the Y chromosome. Existing SNP-array- and sequencing-based methods rely largely on single genomic features and indirect transformations to estimate this fraction. We developed BaySeq-Y, a Bayesian method that directly estimates LOY mosaicism from sequencing data using VCF files with read depth (DP) and allelic depth (AD). Within a rigorous Bayesian framework, BaySeq-Y integrates complementary LOY-associated genomic features, including decreased read depth and allelic imbalance, and can additionally leverage haplotype phasing to improve precision. In simulations and fluorescence in situ hybridization (FISH) validation, BaySeq-Y provided accurate estimates and outperformed existing methods. Applications to ROSMAP and GTEx supported its biological relevance through transcriptomic validation, demonstrating its utility for quantifying LOY across diverse sequencing datasets.
bioinformatics 2026-07-01 v1
MCD Stitcher: An open-source tool for whole-slide stitching and conversion of Imaging Mass Cytometry data
Chaurasia, P.
Abstract
Imaging Mass Cytometry (IMC) combines metal-tagged antibody labelling with laser ablation mass spectrometry to generate highly multiplexed spatial images of tissue sections. However, the area that can be acquired within a single region of interest (ROI) is limited by hardware and software constraints, requiring large tissues to be imaged as multiple tiled ROIs. Reconstructing these ROIs into whole-slide images requires additional processing, while the proprietary .mcd file format can hinder integration with standard bioimage analysis workflows. Here, we present MCD Stitcher, an open-source Python package for converting .mcd files into OME-TIFF images with automated whole-slide stitching. The tool supports rectangular and polygonal ROIs, accommodates variable pixel sizes between ROIs, and uses memory-aware chunked reading during data ingestion to process large datasets on standard workstations. The generated OME-TIFF outputs preserve spatial, channel, and acquisition metadata for downstream analysis in tools such as QuPath, napari, and ImageJ/Fiji. MCD Stitcher provides a reproducible workflow for converting raw IMC data into interoperable image formats, enabling whole-slide spatial analysis without reliance on vendor-specific software.
bioinformatics 2026-07-01 v1
Phenotypic inference from sparse tumor genomes informs an explainable deep-learning model for cancer prognosis
Grant, S.; Nath, A.
Abstract
Somatic genomic alterations are widely profiled in cancer and remain the primary source for personalized therapy, yet their clinical utility is limited to a few actionable targets. AI/ML models offer opportunities to capture genome-wide complexities, but clinical translation is hindered by poor interpretability, which is often limited to single-gene effects and overlooks higher-order phenotypic interactions. To address this, we developed PhenoMap, a machine-learning framework that infers tumor phenotypic states from somatic variants. Trained on 9,000 pan-cancer genomes and transcriptomes, PhenoMap accurately reconstructs expression-based pathway enrichment scores and consolidated hallmark cancer phenotypes, enabling multilevel interpretation at phenotype, pathway, and gene scales. PhenoMap captured molecular subtypes and key resistance pathways across breast, lung, and brain cancers. We leveraged these features in PhenoSurv, a deep survival model integrating phenotypic reconstruction loss, Kullback-Leibler divergence, and survival loss to learn biologically-grounded predictors. PhenoSurv outperformed state-of-the-art survival models while providing robust mechanistic explanations. NOTCH1 signaling and SMARCA4 mutations emerged as major prognostic factors in hormone receptor-positive breast cancer. TGF-β signaling and inflammasomes, potentially modulated by FAT1, predicted lung adenocarcinoma outcomes, while inositol metabolism and PI3K signaling were key drivers in brain cancer. Together, PhenoMap and PhenoSurv provide accurate, interpretable, and clinically actionable models for precision oncology.
bioinformatics 2026-07-01 v1
Tabular Foundation Models Are Competitive Cellular Perturbation Predictors Across Biological Scales
Palla, G.; Hillsley, A.; Kim, Y.-J.; Royer, L. A.
Abstract
Predicting how cells respond to genetic and chemical perturbations is a central challenge in drug discovery and functional genomics. A growing ecosystem of specialized single-cell foundation models has been developed to address this problem, yet their practical advantage over domain-agnostic approaches remains unclear. Here we evaluate the power of Tabular Foundation Models such as TabICL and TabPFN, general-purpose pre-trained regression models, against domain-specific architectures including PRESAGE, scGPT, scLAMBDA, STACK and Prophet across four complementary evaluation settings: cell-level in-context cross-cell-type prediction, pseudobulk perturbation prediction on five Perturb-seq datasets of cell lines, a genome-wide CRISPR screen in primary human CD4+ T cells, and embryo-level cell-type composition prediction in a zebrafish developmental perturbation atlas. In the cell-level cross-cell-type perturbation prediction, Tabular Foundation Models perform on par with or better than specialized models. On pseudobulk perturbation prediction, Tabular Foundation Models consistently outperform specialized baselines across multiple evaluation metrics and datasets. On whole-embryo cell-type composition prediction, Tabular Foundation Models are competitive with specialized baselines. These results demonstrate that general-purpose tabular in-context learning provides a strong and scalable alternative to bespoke biological architectures for perturbation response modeling across cell systems and scales.
bioinformatics 2026-07-01 v1
Penumbria: Advanced 3D cell segmentation for biomedical imaging
Stockert, L.; Donovan, J.; Baier, H.
Abstract
Quantitative analysis of three-dimensional cellular architecture is fundamental to understanding tissue organization, disease progression, and drug response. Yet 3D cell segmentation remains a critical bottleneck due to diverse cell morphologies, low signal-to-noise ratios, and data scarcity. We introduce Penumbria, a general-purpose 3D cell segmentation framework that achieves state-of-the-art accuracy across morphologically distinct cell populations and imaging conditions in volumetric microscopy. Penumbria formulates segmentation as a regression problem on distances to cell boundaries, supporting instance reconstruction without shape priors and permitting end-to-end GPU inference. A U-Net-based architecture with xLSTM bottleneck blocks and patch embeddings enables multi-scale feature extraction, long-range modeling of spatial context, and convolutional feature-volume tokenization. The model is extended with two modules: a Global Zernike Phase Layer, which learns Zernike-parameterized phase corrections in the frequency domain to undo optical aberrations such as defocus and tilt, and a Scaled Geocaps Layer, which samples features at fixed grid locations across multiple spatial scales, routing evidence between them such that a detection is only confident where concordance holds across scales simultaneously. Across four diverse 3D datasets selected to probe the limits of existing methods, Penumbria outperforms Cellpose-SAM across all evaluation thresholds and surpasses StarDist-3D on most datasets while matching it on Parhyale hawaiensis. Trained entirely from scratch, Penumbria achieves up to a 38% improvement in mean average precision over the second-best method. Strong boundary accuracy further supports downstream analyses such as quantifying membrane dynamics or protein localization.
bioinformatics 2026-07-01 v1
BOSE: A Bayesian Order Statistics-Based Estimator for Recovering the Sample Mean and Standard Deviation
Pan, W.; Lu, Z.; Jiang, W.; Lim, J.; Xu, L.; Wang, X.
Abstract
In meta-analyses of continuous outcomes, the sample mean and standard deviation (SD) are essential for synthesizing effect sizes across studies. However, clinical studies frequently report alternative summary statistics, such as the median, quartiles, and range. To enable inclusion of such studies, various methods have been proposed to estimate the sample mean and SD from these reported summaries. We propose the Bayesian Order Statistics-based Estimator (BOSE), which leverages the joint likelihood of observed order statistics together with weakly informative priors to obtain the full posterior distribution for the mean and SD without relying on computationally intensive iterative procedures such as Markov chain Monte Carlo algorithms. Our numerical studies demonstrate that BOSE performs competitively with existing approaches in estimating the mean, while achieving superior performance for estimating the SD across all evaluated scenarios, particularly in small-sample settings. Under non-normal distributions including skewed, heavy-tailed, and bimodal settings with mild or moderate deviations from normality, BOSE remains robust and stable, whereas methods specifically designed for skewed distributions may become unstable or even inapplicable. Beyond point estimation, BOSE naturally provides empirically validated posterior credible intervals, enabling researchers to formally quantify uncertainty for study-level estimates and make reliable, evidence-based decisions in meta-analytic research synthesis. A publicly accessible web application implementing BOSE and competing methods is also provided to facilitate practical use in meta-analytic research.
bioinformatics 2026-07-01 v1
mirCCC: Repression-aware graph learning for miRNA-mediated cell-cell communication inference
Chen, Y.; Cui, J.; Zhang, S.; Liu, E.; Xie, L.; Feng, C.; Chen, M.
Abstract
Cell-cell communication analyses usually focus on protein ligands and receptors and therefore miss the extracellular vesicle-mediated transfer of microRNAs, an important route of signalling in cancer. Here, we show that microRNA-mediated communication can be inferred from standard single-cell RNA sequencing by detecting coordinated decreases in the expression of validated miRNA target genes. We developed mirCCC, a computational framework that estimates cell-specific microRNA activity, models cellular sending and receiving capacities for extracellular vesicle transfer, and learns microRNA-resolved communication graphs from transcriptomic data. In synthetic benchmarks with strong confounding signals, mirCCC improved, whereas all comparison methods declined. Applied to a human colorectal cancer atlas, mirCCC recovered known colorectal cancer-associated microRNAs and identified stromal- and myeloid-to-epithelial communication converging on a plasticity program linked to TGF-β and Wnt/β-catenin signalling. These results provide a practical route for studying extracellular vesicle-mediated communication in existing single-cell atlases.
bioinformatics 2026-07-01 v1
Impacts of batch effects on the performance of machine learning classifiers across multiple studies
Raab, P.; Johnson, W. E.; Piccolo, S. R.
Abstract
Precision medicine relies on accurate and generalizable predictions for patients across the spectrum of human diversity. Because capturing biological heterogeneity requires large sample sizes, researchers must often aggregate data from several experimental batches or independent studies. This integration allows for greater statistical power and diversity than a single study could provide, while avoiding the costs of generating massive new -omics datasets. Predictive models trained on these aggregated data are theoretically better equipped to detect subtle patterns that generalize to new data. However, this potential is frequently undermined by "batch effects"--systematic technical artifacts that can bias model training to predict experimental batches and shadow meaningful biological conditions. Models trained on data with batch effects can exhibit substantially degraded performance when applied to data from new batches. Statistical adjustment methods can mitigate these artifacts while preserving biological signals. To ensure these adjustments actually facilitate generalization, we emphasize the use of external, independent cohorts for rigorous validation. This chapter examines how batch effects impact predictions and compares various adjustment methods.
bioinformatics 2026-06-30 v1
GeneBench-Pro: Evaluating Multistage Statistical Reasoning in Genomics, Quantitative Biology, and Translational Biomedicine
Li, J. H.; Ho, A. J.
Abstract
We introduce GeneBench-Pro, an expanded and improved version of GeneBench that comprises harder problems across a wider breadth of domains. GeneBench-Pro is a benchmark for AI agents performing realistic multi-stage scientific analyses in genomics, quantitative biology, and translational biomedicine which seeks to capture the complexity of real-world problems that computational life scientists face when tasked with producing a conclusion upon which a downstream scientific or translational decision is contingent. The benchmark comprises 129 evaluations targeting quantities of direct practical relevance across 10 primary domains and 21 terminal subdomains, with a genomics-centered core. Similarly to GeneBench, each problem provides the agent with brief context, a target estimand, and minimal guidance otherwise; the agent must then navigate multiple dependent decision points; i.e., substantive inferential forks where a plausible wrong choice changes the downstream analysis, to identify and execute the correct analysis workflow and arrive at the correct answer. Relative to GeneBench, GeneBench-Pro adds 29 new problems, drops three, and introduces significantly redesigned versions of 54 of the remaining 100 overlapping problems. 82 of the 129 problems were reviewed by external domain experts, whose findings led to prompt/data modifications and redesign of those problems whose targets were not sufficiently identifiable. Ten externally reviewed problems are released publicly, 50 held-out problems were provided to Artificial Analysis for independent third-party model benchmarking, and the remainder are retained as an internal holdout. In evaluations over the full 129-problem suite, GPT-5.6 Sol reaches an eval-level pass rate of 28.7% at the max reasoning level, and GPT-5.6 Sol Pro reaches 31.5% in separately reported GPT Pro runs. GPT-5.5 reaches 12.0%, GPT-5.4 reaches 8.9%, and the strongest non-GPT baseline, Claude Opus 4.8, reaches 16.0%. 
As with GeneBench, models often complete substantial portions of the workflow but exhibit a consistent gap between noticing and acting by identifying local diagnostic signals but failing to propagate the implications to the corresponding analysis decision. As a result, models often select wrong estimators or persist on initially plausible but incorrect analysis paths. GeneBench-Pro therefore measures an emerging capability of long-horizon biological reasoning that remains unreliable.
bioinformatics 2026-06-30 v1
Structural Bioinformatics of Four Human Aquaporins and Their Water-Soluble QTY Analogs
Zhang, S.; Xiao, E.
Abstract
Human aquaporins (AQPs) are essential membrane channels, yet their inherent hydrophobicity complicates structural and functional studies. We present the systematic application of the QTY code to human AQPs, integrating it with AlphaFold 3 structure prediction to design water-soluble analogs of four representative human AQPs (AQP1, AQP3, AQP4, AQP7) and validate that they maintain their conformation. This approach provides a novel platform for editing challenging membrane proteins. The QTY code was applied to the transmembrane regions of the selected four AQPs. Subsequently, the structures of the water-soluble QTY analogs of the four AQPs were predicted using AlphaFold 3. The predicted structures were superposed with cryo-EM- or X-ray-determined native structures in PyMOL. Further analyses included root-mean-square deviation (RMSD) calculations, visualization of hydrophobic surface reduction, and inspection of conserved protein-ligand binding ability. After applying the QTY code, sequence changes between native AQPs and their QTY analogs were significant (42.86-48.80%). Nevertheless, their structures superposed well in analyses, with only slight deviations (RMSD < 0.6 Å). In addition, the surface hydrophobicity of all QTY-edited AQPs was significantly reduced. Importantly, molecular contacts between the cholesterol ligand and protein were largely preserved for both native AQP1 and its QTY analog. Finally, all AlphaFold 3-predicted structures for AQPs have high confidence values (pLDDT > 90; pTM ~0.83), supporting the reliability of the predicted structures. The findings demonstrate that membrane protein hydrophobicity can be edited and reduced without compromising fold integrity or functional architecture. Integration of the QTY code with AlphaFold 3 affords a high-throughput platform for designing water-soluble, structurally faithful analogs of challenging membrane proteins. 
Such a strategy can provide a potent platform for detergent-free biochemical studies and water-soluble analogs for therapeutic monoclonal antibody discoveries, thus advancing research of this pharmacologically important protein family.
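The RMSD-after-superposition measure quoted in this abstract can be illustrated with the Kabsch algorithm. This is a minimal NumPy sketch, assuming two already-matched sets of atomic coordinates (for example Cα atoms); it is not the authors' PyMOL workflow, and the function name `kabsch_rmsd` is hypothetical.

```python
import numpy as np


def kabsch_rmsd(P, Q):
    """RMSD between point sets P and Q (each N x 3) after optimal superposition.

    Rows of P and Q are assumed to correspond one-to-one (same atom order).
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    # Remove translation by centering both sets on their centroids.
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    # Covariance matrix and its SVD give the optimal rotation.
    H = Pc.T @ Qc
    U, _, Vt = np.linalg.svd(H)
    # Guard against an improper rotation (reflection).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T
    P_rot = Pc @ R.T  # apply R to each centered row vector of P
    return float(np.sqrt(np.mean(np.sum((P_rot - Qc) ** 2, axis=1))))
```

With coordinates in Å, the returned value is directly comparable to a threshold such as the 0.6 Å reported above, provided the same atom selection is used for both structures.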
bioinformatics 2026-06-30 v1
Time-resolved inference of gene regulatory networks underlying human cranial neural crest development suggests novel risk genes for orofacial clefting.
Eibl, M.; Theiss, S.; Einarsson, H.; Vaagenso, C. S.; Krautz, R.; Gehringer, M.; Siewert, A.; Zhang, Y.; Rada-Iglesias, A.; Saez-Rodriguez, J.; Herrmann, C.; Ludwig, K. U.; Andersson, R.; Laugsch, M.
Abstract
Cranial neural crest cells (CNCCs) play a central role in shaping the human head and face. Aberrant CNCC differentiation contributes to craniofacial birth defects, particularly non-syndromic cleft lip with or without cleft palate (nsCL/P), one of the most common congenital disorders. Although the number of genetic variants associated with this condition is steadily increasing, it remains challenging to determine if and how these variants may contribute to disease development. The majority of these variants lie within non-coding regulatory elements that govern cell-type and stage-specific gene expression, which is orchestrated by dynamic gene regulatory networks (GRNs). Despite extensive work in model organisms, a time-resolved, multi-omics perspective of GRNs controlling CNCC differentiation in a human system is still lacking. To fill this gap, we generated paired transcriptomic and chromatin accessibility data at four timepoints during in vitro differentiation of CNCCs derived from human induced pluripotent stem cells. Integrating these two modalities enabled time-resolved inference of GRNs and identification of dynamic regulatory relationships, including stage-specific roles of core transcription factors. Leveraging these time-resolved GRNs, we mapped 29 nsCL/P associated variants linked to 70 putative target genes, with 40 located outside the associated genomic loci, suggesting novel distal regulatory relationships. Integration of these data with complementary time-course scRNA-seq data revealed an ectomesenchymal-biased subpopulation of CNCCs as particularly sensitive to genetic variants associated with nsCL/P. We provide a time-resolved inference of GRN in human CNCC differentiation, allowing us to determine the dynamics of stage-specific core regulatory programs that are otherwise missed in analyses based on a single time snapshot. 
To our knowledge, the data represent the first multi-omics map of human CNCC with temporal resolution, which expands the understanding of early human craniofacial development, refines variant-to-gene assignment, prioritizes candidate risk genes and cell states relevant to nsCL/P. Our findings demonstrate the relevance of studying the dynamics upon differentiation rather than just one fixed timepoint and offer a valuable basis for further investigation of non-coding variation in CNCC-related disorders.
bioinformatics 2026-06-30 v1
A pan-cancer benchmark of integrated ferroptosis, cuproptosis and disulfidptosis prognostic signatures
Demir, A. Y.; Yasar, E.
Abstract
Integrated prognostic signatures combining ferroptosis, cuproptosis, and disulfidptosis are increasingly reported in oncology as advances in risk stratification, yet their added value over simpler pathway-specific or proliferation-related models remains unclear. Here, we developed an integrated regulated cell-death signature and evaluated it through an adversarial pan-cancer benchmark. Using the TCGA pan-cancer cohort comprising 9,808 tumours across 33 cancer types, we curated 118 genes associated with the three cell-death programmes, characterised inter-pathway crosstalk, and derived a 26-gene LASSO-Cox risk signature. The model showed reproducible prognostic performance across cancers, with a pan-cancer concordance index of 0.573 (95% CI, 0.552-0.594), and was independently validated in METABRIC and CGGA cohorts, remaining significant after adjustment for standard clinical variables. However, benchmarking revealed that the integrated signature, although superior to size-matched random gene sets (empirical p < 0.001), did not outperform a ferroptosis-only model (DeLong p = 0.81), indicating no measurable gain from pathway integration. Moreover, much of the prognostic signal reflected tumour proliferation rather than regulated cell death. After adjustment for the proliferation meta-signature (meta-PCNA), ferroptosis performance declined from 0.573 to 0.504, while the integrated model decreased to 0.554. High-risk tumours were more sensitive to anti-proliferative drugs, and the risk score was most strongly associated with E2F, MYC, and G2M target programmes. The signature stratified prognosis but did not predict immune-checkpoint blockade response in IMvigor210 (AUC ≈ 0.50). Importantly, the underlying biology was not merely a modelling artefact. Signature genes showed concordance with protein abundance in CPTAC cohorts, and the three cell-death programmes co-varied within individual malignant cells, with correlations ranging from ρ = 0.46 to 0.66. 
Overall, our findings indicate that integrated multi-death signatures are reproducible and biologically grounded, yet prognostically redundant and substantially confounded by proliferation. This study provides a cautionary benchmark for the rapidly expanding use of composite regulated cell-death signatures in cancer prognosis.
bioinformatics2026-06-30v1Integrating Semantic Retrieval, LLM-based Refinement, and Structured Expert Curation for Scalable AOP Gene Mapping
Schaffert, A.; Fratello, M.; Kangas, K.; Torres Maia, M.; del Giudice, G.; Mobus, L.; Accardi, C.; Al-Abdulraheem, Z.; Campini, L.; Galardo, F.; Federico, A.; Ciancaleoni, G.; Juppi, H.-K.; Paparella, M.; Serra, A.; Greco, D.Abstract
Toxicogenomics can support regulatory toxicology, but its use is limited by the difficulty of translating molecular responses into mechanistic, decision-relevant interpretations. Adverse Outcome Pathways (AOPs) provide a framework for this translation, yet omics applications require scalable mapping of Key Events (KEs) to molecular features. Here, we present an AI-assisted, multi-step workflow for KE-to-gene mapping that uses embedding-based semantic retrieval to identify candidate ontology/pathway terms, large language model-assisted refinement to filter these candidates, and double-independent expert group curation with rule-based consolidation to finalize mappings and derive confidence scores. Compared with earlier NLP-based approaches, the workflow improves KE-to-ontology/pathway mapping performance and generates candidate annotations that better align with expert judgment while substantially reducing the need for manual augmentation. Explicit gene and protein mentions in KE titles were additionally grounded to improve specificity, and each curated mapping was assigned curator reason codes to support transparent, traceable, and confidence-aware reuse. Applied across AOP-Wiki, the workflow produced a comprehensive KE-to-gene set resource covering 1,254 KEs across 523 AOPs and linking 15,833 human genes. Utility is demonstrated through CTD-based AOP fingerprinting of curated reference chemical groups, highlighting expanded coverage and confidence-informed interpretation of chemical-associated gene signatures in an AOP context. The workflow and resulting resource provide a practical bridge between toxicogenomics and AOP-based mechanistic interpretation and support routine updating and future extension to additional omics layers within OECD Omics2AOP.
bioinformatics2026-06-30v1A High-Quality Acetylation Dataset Reveals Modest Data Requirements for Transfer Learning to Identify Little Studied Post-Translational Modifications
Hartmaring, Y.; Wang, S.; Jones, A. R.; Vizcaino, J. A.; Schlaffner, C. N.; Renard, B. Y.Abstract
Dysregulation of post-translational modifications (PTMs) is associated with severe pathologies, including cancers and Alzheimer's disease. Despite their biological importance, identifying modified peptides remains challenging due to the immense combinatorial search space. While searches benefit from prior knowledge of a peptide's modification status, the data scarcity for most PTMs hinders the development of accurate deep learning classifiers like AHLF (ad hoc learning of peptide fragmentation). Here, we overcome this data bottleneck for acetylation and ubiquitination. We harmonised a dataset with about 500,000 high quality acetylated peptide-spectrum matches (PSMs) from nine publicly available acetylation-enriched datasets. We fine-tuned AHLF with the acetylation and a 2-million spectra strong ubiquitination dataset separately and assessed the minimum data requirement for training by iteratively downsampling. Training separate models on SILAC and label-free subsets also assessed the impact of data diversity. The resulting acetylation and ubiquitination models achieve an AUC of 0.87 and 0.90 respectively. Beyond 28,500 acetylated spectra, corresponding to roughly 0.3% of the original model's training data, additional data just provides minor performance gains. Finally, we show that data diversity is beneficial for generalizability, while models trained on homogeneous data sources tend to overfit to their respective data type. All code, and model weights are available at https://gitlab.com/dacs-hpi/ahlf-ptmai.
bioinformatics2026-06-30v1Real-World Progression-Free Survival with Erlotinib versus Osimertinib in EGFR L858R+T790M Compound Mutation Non-Small Cell Lung Cancer: An Exploratory Analysis of the MSK-CHORD Dataset
Dalloul, Z.; Abboud, A.; Dalloul, I.; Abdelsalam, M.Abstract
Background: Osimertinib is the standard first-line treatment for EGFR-mutant non-small cell lung cancer (NSCLC) harboring common activating mutations, including exon 19 deletions and L858R. It is also active against tumors with acquired T790M resistance. However, the EGFR L858R+T790M compound mutation, where both variants co-occur within the same tumor, may confer distinct drug-sensitivity profiles not predicted by either mutation alone. Limited data exist on comparative treatment outcomes in this rare genotype. Methods: Using the MSK-CHORD clinicogenomic dataset (n=24,950), we identified patients with concurrent EGFR L858R and T790M mutations receiving erlotinib (Erlo) or osimertinib (Osi) monotherapy. Real-world progression-free survival (rwPFS) per treatment line was calculated using a strict definition requiring confirmed radiological progression events (rwPFS-strict), excluding lines with null endpoint data. Kaplan-Meier analysis, log-rank testing, Cox proportional hazards regression, and cross-cohort heterogeneity testing (Cochran's Q statistic) were performed. Two control cohorts, L858R-only (n=372) and T790M-only (n=76), were analyzed in parallel to assess mutation-context specificity of treatment response. Results: Thirty-one patients with EGFR L858R+T790M were identified; 21 contributed evaluable monotherapy lines, yielding 23 Erlo and 15 Osi treatment lines (14 unique patients per treatment group, 7 contributing to both). Median rwPFS numerically favored Erlo over Osi (7.10 vs 5.32 months; HR 1.29, 95% CI 0.66-2.52; log-rank p=0.46). This directional trend was reversed in the L858R-only control cohort, where Osi demonstrated significant superiority (9.03 vs 5.75 months; HR 0.70, 95% CI 0.55-0.89; p=0.003). The T790M-only cohort showed no significant difference (HR 1.32, p=0.12). An exploratory post-hoc heterogeneity test confirmed a significant cross-cohort interaction (Q=9.94, df=2, p=0.007).
Conclusions: The expected osimertinib advantage was absent in L858R+T790M compound-mutant NSCLC. The opposing hazard ratio directions across mutation contexts (HR 1.29 vs 0.70), with a significant exploratory cross-cohort interaction (p=0.007), suggest that the EGFR L858R+T790M compound mutation may represent a pharmacologically distinct entity with differential TKI sensitivity. These hypothesis-generating findings warrant prospective validation.
bioinformatics2026-06-30v1A robot model of compass cue calibration in the insect brain
Mitchell, R.; Dacke, M.; Webb, B.Abstract
Dung beetles can use a variety of orientation cues to maintain a consistent bearing during ball-rolling. Where several cues are available, they appear to learn the spatial relationship between them, providing redundancy if some cues are removed. Mounting evidence indicates that such a learning process is implemented in the insect head direction circuit; specifically, in the plastic substrate between sensory input neurons and compass neurons in the central complex. This plasticity appears to be driven by rotational movements, providing a clear link with observed beetle 'dance' behaviour. Here, we extend our functional model of this circuit and use it on a robot platform, to test it in the same behavioural assay as was used for the beetles. The robot was able to replicate the beetle's ability to substitute a directional wind cue for a point source light cue in guiding straight-line movement. However, it also revealed significant biasing coupled to dance direction. This biasing appears to be caused by inherent conflict between recurrent and instantaneous inputs to the compass circuit. We predict that the real insect should experience similar issues unless it has evolved a neural mechanism to compensate.
bioinformatics2026-06-30v1Svirlpool: structural variant detection from long read sequencing by local assembly
May, V.; Hartmann, T.; Beule, D.; Holtgrewe, M.Abstract
Motivation: Long-Read Sequencing (LRS), and Oxford Nanopore Technologies (ONT) in particular, has greatly improved the detection of structural genome variants (SVs). Fast alignment-based ONT callers achieve strong benchmark performance, but they necessarily reduce the read sequence to alignment-derived signals when deciding whether variants are shared across samples. This can be limiting for cohort and clinical analyses, especially for insertions and repeat regions where sequence representation matters. We present Svirlpool, a multi-sample SV caller for ONT data that builds local consensus assemblies of candidate SV regions and retains the assembled sequence up to the final joint-calling step, where merging tolerances are scaled by a reference-independent noise estimate derived from the reads. Results: We validated Svirlpool on two ONT family datasets: the recent high-quality HG002 Ashkenazi trio and the older Platinum Pedigree family, using the Genome in a Bottle and T2TQ100 benchmarks on the GRCh38, GRCh37, and CHM13v2 references and the Mendelian consistency of native multi-sample calls. We compare against current native joint callers and post-hoc merging workflows. Svirlpool produces highly Mendelian-consistent insertion calls in trio analyses (95.2% on GRCh38 and 95.1% on CHM13v2 at 30x), and on CHM13v2 it reaches the highest insertion and deletion consistency among all tested approaches. Sawfish and Sniffles achieve the highest SV benchmark F1 scores on recent high-quality ONT data, whereas Svirlpool enters the competition with more conservative SV calls. Svirlpool features native, sequence-aware joint calling with retained local consensus sequences and shows a very high Mendelian consistency with sequencing data from different batches and chemistries, which is a common situation in clinical application. Availability and Implementation: Source code, container images, and documentation available at https://github.com/bihealth/svirlpool
bioinformatics2026-06-29v3NPTX2-Centered Cognitive Resilience Mechanisms in the Context of AD Pathology
Lao, Y.; Xiao, M.-F.; Ji, S.; Piras, I. S.; Kim, K.; Bonfitto, A.; Song, S.; Aldabergenova, A.; Sloan, J.; Trejo, A.; Geula, C.; Na, C.-H.; Rogalski, E. J.; Kawas, C. H.; Corrada, M. M.; Serrano, G. E.; Beach, T. G.; Troncoso, J. C.; Huentelman, M. J.; Barnes, C. A.; Worley, P. F.; Colantuoni, C.Abstract
Background Cognitive resilience to Alzheimer's disease (AD) pathology is associated with preserved expression of NPTX2, an activity-regulated synaptic protein involved in circuit plasticity, excitation-inhibition balance, and complement-linked synapse regulation. However, the broader molecular programs coordinated with NPTX2 in resilient individuals remain unclear. Methods We analyzed postmortem middle temporal gyrus tissue using targeted PRM-MS proteomics in 135 individuals and bulk RNA-seq in an expanded 575-sample cohort. NPTX2-associated molecular coordination was assessed within cognitively normal low-pathology controls (CN-Lo), cognitively normal high-pathology controls (CN-Hi), mild cognitive impairment (MCI), and AD. Correlation-based approaches were applied using NPTX2 protein and NPTX2 mRNA expression as anchors to define resilience mechanisms in CN-Hi subjects. Results NPTX2 protein abundance was preserved across all controls regardless of age and pathology but reduced in MCI and AD. NPTX2 mRNA expression was also invariant across pathology within controls and reduced in MCI and AD but decreased markedly with age. Targeted proteomics identified NPTX2 relationships with synaptic and inhibitory-circuit proteins that were preserved across control groups, alongside CN-Hi-specific recruitment of trafficking, lysosomal, metabolic, and proteostasis-associated proteins. Transcriptome-wide correlations with NPTX2 revealed differences in gene co-expression between groups, identifying a prominent activity-dependent program including BDNF, VGF, SCG2, SST, SERTM1, DUSP4, and EGR4, that was preserved in both CN-Lo and CN-Hi subjects, while genes recruited to the NPTX2 network specifically in CN-Hi implicated immune, neuroprotective, translation, and proteostasis-related pathways. 
Coupling differential gene expression analysis with co-expression, we further identified five candidate resilience genes whose expression and NPTX2 correlation were preserved across controls, but lost in MCI and AD: SST, MAL2, TAC1, SERTM1, and RFK. Expression of genes in distinct NPTX2 co-expression classes can be freely explored in our bulk RNA-seq data and other public AD transcriptomic datasets at NeMO Analytics. Conclusion Findings suggest that cognitive resilience in the context of AD neuropathology engages a coordinated molecular state distinct from both preserved cognition without pathology and MCI/AD, which is organized around preserved and selectively remodeled NPTX2 associations. Rather than reflecting broad transcript abundance changes, resilience was characterized by maintained synaptic and inhibitory programs, and adaptive proteostasis and trafficking pathways that distinguish resilient high-pathology individuals from low-pathology controls or symptomatic AD.
bioinformatics2026-06-29v3The structural context of mutations in proteins predicts their effect on antibiotic resistance
Green, A. G.; Tasmin, M.; Vargas, R.; Farhat, M. R.Abstract
In Mycobacterium tuberculosis, a prevalent and deadly pathogen, resistance to antibiotics evolves primarily through non-synonymous mutations in proteins. Sequence-based analyses are currently used to understand the genetic basis of antibiotic resistance, either via genotype-phenotype association, or via signals of convergent evolution. These methods focus on primary sequence and often neglect other biological signals such as protein structural information. We hypothesize that integrating the structural context of mutations improves the prediction of effects on function and phenotype. We curate high confidence structural annotations for the M. tuberculosis proteome from 1,371 crystallography and 2,316 AlphaFold predictions, and combine the structures with mutations from over 31,000 clinical M. tuberculosis isolates. We demonstrate that mutations in proteins known to cause resistance are clustered in 3D space, even in proteins where inactivating mutations at any position are thought to cause resistance. We develop a statistic to search the M. tuberculosis proteome for signal of clustered mutations, finding over 450 proteins that display this signal, many of which have a known relationship with antibiotic resistance. We show that a supervised classifier trained on 3D distance to known resistance sites alone has an F1 score of 94.6% at classifying mutations as resistance-conferring across proteins. This work demonstrates that protein structure provides useful information for categorizing which variants may cause antibiotic resistance, even when the majority of structures are AI-predicted.
bioinformatics2026-06-29v2eRNAformer enables genome-wide de novo mapping of enhancer-derived RNA loci
Yu, H.; Li, W.; Li, W.; Liu, Y.; Chen, Y.; Zhang, X.; He, S.; Chen, Z.; Wang, H.; Ni, J.; Gao, T.; Li, F.; Lu, L.Abstract
Enhancer-derived RNAs (eRNAs) are critical regulators of gene transcription, yet their genome-wide annotation remains challenging. Here, we present eRNAformer, a multi-modal deep learning framework that integrates convolutional neural networks with transformers, specifically designed to capture long-range genetic features associated with bidirectional transcription. This approach enables de novo mapping of eRNA loci using DNA sequence and aggregated conventional RNA-seq data. When evaluated on ENCODE datasets, eRNAformer demonstrated high sensitivity and specificity in discriminating known eRNA loci from non-eRNA loci. Notably, the newly identified eRNA loci were enriched with evolutionarily constrained variants and genetic risk factors for complex diseases, and exhibit potential relevance for cancer therapy. Applied to GEO datasets, eRNAformer identified a range from 14,219 to 56,451 eRNA loci across multiple hematologic malignancies, facilitating the construction of a comprehensive eRNA database for blood cancers. We further identified and experimentally validated FOXO1e, a cluster of eRNAs located approximately 120 kb upstream of FOXO1, a known oncogene that drives t(8;21) acute myeloid leukemia (AML) preleukemic program. Together, these findings establish eRNAformer as a powerful tool for genome-wide eRNA annotation, provide a valuable resource for eRNA studies in hematologic cancers, and underscore the functional importance of eRNAs in AML pathogenesis.
bioinformatics2026-06-29v2SCiMS: Sex Calling in Metagenomic Sequences
Tran, H. N.; Kirven, K. J.; Davenport, E. R.Abstract
Background: Host sex is a critical determinant of microbial community structure across many host species, influenced by hormonal profiles, physiology, and sex-stratified behaviors. Despite its importance, sex metadata is frequently missing in microbiome studies, including for animal-associated samples. Host chromosomal sex can be inferred from the host-derived reads present in metagenomic data, but existing genomic sex prediction tools rely on fixed coverage thresholds calibrated for human XY chromosomes and require relatively high host read counts, limiting their use on low host-biomass samples such as stool and on organisms with other sex-determination systems. Results: Here, we present SCiMS (Sex Calling in Metagenomic Sequences), a bioinformatic tool that leverages host-derived DNA within shotgun metagenomic data to predict host chromosomal sex, even at low host coverage. SCiMS uses a multinomial likelihood computed from observed read counts under each sex and reports chromosomal sex calls. Because the expected read distribution is derived directly from chromosome lengths and ploidy under each candidate karyotype, SCiMS applies to any organism with a heterogametic sex-determination system. We benchmarked SCiMS against existing tools on simulated metagenomic data, human metagenomic samples spanning multiple body sites, and metagenomic samples from seven animal species. SCiMS matched or outperformed existing tools, with a noticeable advantage under low host-read conditions. Conclusions: SCiMS provides an accurate, scalable, and cross-species generalizable solution for host chromosomal sex classification, even when host DNA is minimal. By enabling recovery of missing sex metadata, it serves as a quality-control tool for analyses in microbiome research. SCiMS is freely available at http://github.com/davenport-lab/SCiMS.
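The likelihood comparison described in the SCiMS abstract can be illustrated with a minimal sketch. This is not the SCiMS implementation: the chromosome lengths are rounded illustrative values, the function names are hypothetical, and the small `background` term (added so that a few stray Y-mapped reads do not give the XX karyotype zero likelihood) is an assumption of this sketch rather than something stated in the abstract.

```python
import math

# Illustrative, rounded human chromosome lengths in megabases (assumed values,
# not taken from SCiMS): all autosomes pooled, X, and Y.
LENGTHS_MB = {"autosomes": 2875.0, "X": 156.0, "Y": 57.0}

# Copies of each chromosome class under the two candidate karyotypes.
KARYOTYPES = {
    "XX": {"autosomes": 2, "X": 2, "Y": 0},
    "XY": {"autosomes": 2, "X": 1, "Y": 1},
}


def expected_fractions(lengths, ploidy, background=0.01):
    """Expected share of host reads per chromosome class.

    Shares are proportional to length * copy number. A small length-proportional
    background (hypothetical parameter) keeps every share strictly positive.
    """
    signal = {c: lengths[c] * ploidy[c] for c in lengths}
    signal_total = sum(signal.values())
    length_total = sum(lengths.values())
    return {
        c: (1 - background) * signal[c] / signal_total
        + background * lengths[c] / length_total
        for c in lengths
    }


def log_likelihood(counts, fractions):
    """Multinomial log-likelihood, dropping the coefficient shared by all karyotypes."""
    return sum(n * math.log(fractions[c]) for c, n in counts.items())


def call_sex(counts, lengths=LENGTHS_MB, karyotypes=KARYOTYPES):
    """Return the best-scoring karyotype and the log-likelihood of each candidate."""
    scores = {
        name: log_likelihood(counts, expected_fractions(lengths, ploidy))
        for name, ploidy in karyotypes.items()
    }
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    # Made-up read counts for two samples.
    print(call_sex({"autosomes": 5750, "X": 312, "Y": 0})[0])
    print(call_sex({"autosomes": 5750, "X": 156, "Y": 57})[0])
```

Because the expected fractions come only from lengths and copy numbers, swapping in another species' chromosome lengths and karyotypes (for example a ZW system) would require no change to the scoring code, which is the property the abstract highlights.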
bioinformatics2026-06-29v2Anatomy-Guided 3D Graph Networks for Couinaud Segmentation in Tumor Affected Livers
You, L.; Dang, H.; Wang, H.; Matta, E.; zhou, X.Abstract
Abstract: Image-based liver Couinaud segmentation is designed to automatically provide the locations of suspicious objects in liver CT/MR images. Once achieved, the physicians will be guided to the target slice and area where the suspicious node is located. However, conventional algorithms trained primarily on healthy liver images often fail to generalize to Hepatocellular Carcinoma (HCC) cases due to pathological structural distortions. In this work, we propose a robust two-stage framework that integrates a 3D Unet with a 3D Anatomical Structure-Guided Graph Convolutional Network (3D GCN). This two-stage strategy effectively isolates the liver volume to eliminate structural noise from neighboring organs, such as the spleen, allowing the framework to focus exclusively on the complex 3D anatomical relationships among the eight segments. To ensure the topological consistency required for global spatial reasoning, we implement a standardized preprocessing pipeline that normalizes liver-only volumes to exactly 50 frames along the z-axis. By combining a lightweight 3D UNet backbone with the 3D GCN for refined boundary reasoning, our model demonstrates superior generalization performance on unseen clinical datasets, achieving a mean Dice score of 0.828 in blind testing. By releasing our code and pretrained weights, we aim to provide the first publicly available deep learning resource for robust Couinaud segmentation.
bioinformatics2026-06-29v2Confounding effects of inferring gene co-expression networks from pooled data from different biological populations
Runghen, R.; Eliassi-Rad, T.; Bolnick, D. I.Abstract
Weighted Gene Co-expression Network Analysis (WGCNA) is routinely applied to pooled datasets from multiple biological populations, genotypes, or treatment groups, implicitly assuming a shared module structure across groups. While the distortion of pairwise correlations by pooling heterogeneous groups is well established statistically, three aspects of this problem have received little systematic attention in the context of co-expression network analysis: the extent to which pooling disrupts the discrete module-level community structure inferred by WGCNA; whether this disruption is detectable from the global topology metrics researchers routinely report; and how prevalent the pooling practice is in published multi-group WGCNA studies. Using analytical toy examples and a four-scenario simulation framework, we address all three questions. Module preservation Zsummary scores declined progressively with between-population divergence, from full preservation under identical populations (mean median Zsummary = 25.2 ± 3.3, 95% interval 19.0–30.7 across 20 simulation replicates) to substantial disruption when both network structure and mean expression differed (mean median Zsummary = 11.9 ± 1.0, 95% interval 10.2–13.5). This disruption was undetectable from global topology metrics: modularity and clustering coefficient remained stable across all scenarios, while edge density was sensitive but non-specific. These findings were corroborated in an empirical reanalysis of divergent lake and stream stickleback transcriptomes, where merged analysis collapsed 26 lake-specific and 59 stream-specific modules into only 19 merged modules. A survey of 100 publications found that 78.7% (95% CI 69.4–87.9%) of multi-group WGCNA studies with sufficient methodological reporting used a single merged analysis. Results were robust across network sizes of 250–1,000 genes and rewiring rates of 10–50%.
We provide concrete recommendations including module preservation testing in both directions, population-specific baseline networks, and consensus WGCNA as a principled alternative.
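The statistical effect behind this abstract, that pooling groups with different mean expression distorts pairwise correlations, can be shown with a toy simulation. The sketch below is not the authors' simulation framework: the group names, sample sizes, and mean shift are invented for illustration, and it covers only a single gene pair rather than WGCNA modules.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate_population(rng, n_samples, mean_shift):
    """Two genes drawn independently, so their true within-group correlation is 0."""
    gene_a = [rng.gauss(mean_shift, 1.0) for _ in range(n_samples)]
    gene_b = [rng.gauss(mean_shift, 1.0) for _ in range(n_samples)]
    return gene_a, gene_b


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical groups: same (absent) co-expression, different mean expression.
lake_a, lake_b = simulate_population(rng, 500, mean_shift=0.0)
stream_a, stream_b = simulate_population(rng, 500, mean_shift=4.0)

r_lake = pearson(lake_a, lake_b)
r_stream = pearson(stream_a, stream_b)
r_pooled = pearson(lake_a + stream_a, lake_b + stream_b)

if __name__ == "__main__":
    print(f"within lake:   r = {r_lake:+.2f}")
    print(f"within stream: r = {r_stream:+.2f}")
    print(f"pooled:        r = {r_pooled:+.2f}")
```

With these settings the within-group correlations stay near zero while the pooled correlation is large, since the between-group mean difference contributes variance shared by both genes. In a co-expression network this would appear as an edge that exists in neither population.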
bioinformatics2026-06-29v1Can a Tissue-derived Progression Signature Accurately Predict Colorectal Cancer Stage Transitions in Blood?
Sarkar, P.; Sarkar, P.Abstract
Abstract. Colorectal cancer (CRC) is challenging to track because its molecular changes are very complex as the disease progresses, creating significant challenges for robust biomarker discovery. In this study, we developed a machine learning framework by integrating monotonic progression and the StepMiner approach. We conducted external validation to identify reproducible, consistent transcriptomic biomarkers associated with CRC progression. Publicly available GEO gene expression datasets were analyzed across four disease states: normal colon, adenoma, primary colorectal cancer, and metastasis. First, we identified genes with monotonic expression, then used the StepMiner approach to identify genes that act as 'switches' between stages. A balanced 74-gene signature was used for machine-learning classification with a Random Forest. External validation showed strong performance in tissue-based datasets. However, tissue-derived signatures performed poorly on plasma and blood-based datasets, highlighting biological differences between transcriptomic profiles. Cross-filtering between tissue-derived genes and blood expression datasets was performed, which resulted in the selection of 62 blood-compatible signature genes. Leakage-free retraining on GSE164191 achieved a mean AUC of 0.868 with balanced precision. Functional enrichment analysis showed that these genes are highly active in cancer growth. Specifically, genes CBX3, S100A11, PDK4, NCOR1, and SOX4 demonstrated stable and reliable performance across the validation folds. Overall, our study presents a progression-aware transcriptomic framework for CRC biomarker discovery and demonstrates the importance of external validation. Additionally, we evaluate whether tissue-derived signatures can predict blood profiles. This proposed approach may help the future development of tissue-based diagnostics and minimally invasive liquid-biopsy strategies for CRC.
To ensure reproducibility, our proposed workflow was automated as a Nextflow pipeline. The tissue-derived model was deployed as an application utilizing Angular, ASP.NET Core, and Plumber (R).
bioinformatics2026-06-29v1Placental pathology, circadian biology, and pathogenesis of spontaneous preterm birth: a pilot study of human placental gene expression profiling using a targeted HTG transcriptome panel
Zhou, G.; Hoffmann, H.; Yamamoto, H. S.; Woods, K.; Adkins, M.; Barbieri, R.; Fichorova, R. N.Abstract
BACKGROUND: Spontaneous preterm birth (sPTB) remains the foremost cause of neonatal morbidity and mortality worldwide. Although histologic chorioamnionitis (HCA) and placental vascular abnormalities are frequently observed in sPTB, the molecular cascades linking these lesions to labor initiation remain poorly understood. Emerging evidence implicates circadian dysregulation and trophoblast dysfunction as additional drivers of sPTB. OBJECTIVE: This study aims to map placental pathology to distinct transcriptomic functional signatures that may precipitate sPTB, delineate the contribution of circadian regulation - both core-clock genes and circadian transcription-factor target sets (TFTs) - to sPTB, and identify placental cell-type-enriched and developmental pathway signatures that differ between sPTB and term deliveries. STUDY DESIGN: We performed bulk RNA sequencing on 32 formalin-fixed, paraffin-embedded placental specimens from 12 selected women (9 sPTB and 3 Term) in the POUCH Study cohort. Samples were selected for white ethnicity, maternal age 23-33 years, and parity 1-4 to reduce heterogeneity within groups. An extraction-free HTG transcriptome panel assayed 19,398 protein-coding genes. Log2-fold changes of all genes were computed with limma adjusted for maternal age, gestational age, parity, placental region, placental pathology, and POUCHID (a clustering variable) for sPTB vs. Term and HCA/vascular lesion vs. no pathology (no placental pathology adjustment). Gene-set enrichment used 50 Hallmark sets (MSigDB) plus curated placental circadian, circadian TFT, cell-type, and developmental pathways or gene sets. RESULTS: sPTB placentas displayed a global suppression of metabolic, secretory, and immune pathways (e.g., protein secretion, oxidative phosphorylation, Interferon responses, Complement, ROS, MYC Targets, TGF-β, mTORC1, and Coagulation) while KRAS Signaling Down and EMT were up-regulated.
HCA-enriched sets (TNF/NF-κB, ROS, KRAS Up, IL-2/STAT5, Hypoxia, Interferon-γ) were up-regulated, with EMT and Notch remaining down. Vascular abnormalities alone showed up-regulation of 12 Hallmark sets - including TGF-β, TNF/NF-κB, ROS, pancreatic β-cell stress, Hypoxia, Oxidative Phosphorylation, EMT, and mTORC1 - while Notch was down-regulated. When HCA co-exists with vascular abnormalities, the Hallmark profile becomes more inflammatory, highlighting a synergistic exacerbation of innate immunity, oxidative stress, and programmed cell death with the 12 up-regulated sets (Complement, Interferon /γ, TNF, ROS, Apoptosis, and Heme Metabolism). The exclusive downregulation of DNA Repair suggests compromised genomic integrity. Circadian gene-set analysis revealed an up-regulated Regulation of Circadian Sleep Wake Cycle in sPTB but down-regulation of the core clock pathway and suppressed circadian TF targets. Cell-type enrichment reveals increased trophoblast giant cells and IGFBP1-DKK1 positive fetal cells, with marked suppression of extravillous trophoblasts, syncytiotrophoblasts, villous cytotrophoblasts, and fetal myeloid cells. Placental developmental pathways were downregulated, indicating arrested trophoblast maturation. CONCLUSION: Our pilot analysis demonstrates that sPTB placentas exhibit a global suppression of metabolic, secretory, and immune-modulatory programs and maladaptive trophoblast remodeling, whereas HCA and vascular abnormalities drove distinct inflammatory or hypoxic signatures. The shared and opposing Hallmark pathways across phenotypes highlight distinct yet overlapping pathogenic mechanisms. Dysregulated circadian pathways, consistently downregulated transcription factor target gene sets, and trophoblast-specific signatures implicate circadian misalignment and impaired placental maturation as key contributors to preterm parturition.
These findings provide a mechanistic atlas linking placental pathology to sPTB and highlight potential targets for chronotherapeutic and cell type specific interventions.
bioinformatics2026-06-29v1EnzyKAN: Protein Language Model Embeddings and Kolmogorov-Arnold Network Variants for Enzyme Commission Classification with a Proposed Electron-Transfer Physics Feature Framework
R, S.; Reddy, B. R. R.Abstract
Motivation: Computational enzyme classification has previously utilised sequence homology features and protein language model embeddings. The Kolmogorov-Arnold Network (KAN) paradigm, which uses learnable edge functions rather than fixed ones, has shown promising results in biological sequence tasks. Results: A fully reproducible investigation of KAN variants for seven-class EC classification on up to 9,516 labelled sequences from the CLEAN benchmark (9,386 for language model experiments). In the sequence only settings, fixed basis KAN variants outperformed an MLP baseline moderately (macro F1 = 0.17-0.29). Utilisation of ESM-2 650M embeddings greatly improved results via 5-fold cross-validation: MLP macro F1 = 0.750 +/- 0.009, accuracy = 0.823 +/- 0.009; learnable SineKAN macro F1 = 0.716 +/- 0.023, accuracy = 0.788 +/- 0.019. MLP performed comparably but did not exceed conventional baselines. As an aside, we introduce but do not investigate an approach to EC oxidoreductase sub-classification through the use of a Marcus theory-based electron transfer feature framework. Availability: Code and result files are available at https://github.com/sanjuz-cas/ENZYKAN.
bioinformatics2026-06-29v1Metrics for Distinguishing Biological and Interventional Change in AI Models
Ewing, M. A.Abstract
Statistical and machine-learning models of longitudinal biological data evaluate change by comparing each new observation against the trajectory implied by prior observations, assuming the process generating that trajectory is stable. We use data substrate to mean the underlying structure of the longitudinal data that determines what any such model can recover, independent of its architecture or capacity. When the generating process changes, whether through a biological transition or through an external intervention, the prior trajectory ceases to be a valid reference, and extrapolated predictions can be confidently wrong with no internal signal that the reference has failed. A distinct and recognised difficulty is that biological change and interventional change, observed only through serial intertemporal comparison under an assumed trajectory, are readily conflated; existing approaches address this through causal assumptions or hidden-confounder models rather than from the data substrate itself. Here we ask whether the two can be distinguished at the substrate level, and we introduce two subject-level metrics that quantify the geometric signature an interventional change leaves in the data: Curvature Shift, the change in trajectory slope across the event, and Deformation Risk, the departure of post-event observations from the prior-trajectory reference. We evaluate the condition on longitudinal cognitive measurements from 309 human subjects in the Alzheimer Disease Neuroimaging Initiative (ADNI), a large longitudinal dataset containing two distinct, ex-ante-defined regime-change events in the same subjects: a biological transition and an intervention. 
A model extrapolating the pre-event trajectory assigned the wrong direction of change to roughly two-thirds of post-event observations (post-event sign accuracy 0.341 after the biological event and 0.350 after the intervention, against a chance value of 0.50); only 11% of post-biological-event and 12% of post-intervention readings remained concordant with prior dynamics, and a higher-capacity multilayer perceptron reproduced rather than resolved the error. Curvature Shift was 2.23-fold higher after the biological event (p = 4.4e-8) and 2.26-fold higher after the intervention (p = 7.4e-8), and the two metrics were coupled (rho = 0.500; 95% CI, 0.407 to 0.587). Findings replicated on an independent endpoint and survived propensity matching, permutation, and leave-one-out. The metrics detect, per subject, when the reference of a fitted model has stopped governing the data and whether the departure carries the geometric signature of an interventional change.
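The two metrics and the sign-accuracy check described above can be illustrated with a minimal sketch. The abstract gives only verbal definitions, so everything concrete here is an assumption: trajectories are fitted as straight lines, Curvature Shift is taken as the absolute difference between post-event and pre-event slopes, Deformation Risk as the root-mean-square gap between post-event observations and the extrapolated pre-event line, and the function names are hypothetical rather than the author's.

```python
import numpy as np

def _split(t, y, event_time):
    """Return (t_pre, y_pre, t_post, y_post) as float arrays."""
    t, y = np.asarray(t, dtype=float), np.asarray(y, dtype=float)
    pre = t < event_time
    return t[pre], y[pre], t[~pre], y[~pre]

def curvature_shift(t, y, event_time):
    """Assumed definition: |post-event slope - pre-event slope| from linear fits."""
    t_pre, y_pre, t_post, y_post = _split(t, y, event_time)
    slope_pre = np.polyfit(t_pre, y_pre, deg=1)[0]
    slope_post = np.polyfit(t_post, y_post, deg=1)[0]
    return float(abs(slope_post - slope_pre))

def deformation_risk(t, y, event_time):
    """Assumed definition: RMS gap between post-event data and the extrapolated pre-event line."""
    t_pre, y_pre, t_post, y_post = _split(t, y, event_time)
    slope, intercept = np.polyfit(t_pre, y_pre, deg=1)
    predicted = slope * t_post + intercept
    return float(np.sqrt(np.mean((y_post - predicted) ** 2)))

def sign_accuracy(t, y, event_time):
    """Share of post-event steps whose direction matches the pre-event slope."""
    t_pre, y_pre, t_post, y_post = _split(t, y, event_time)
    slope_pre = np.polyfit(t_pre, y_pre, deg=1)[0]
    steps = np.diff(np.concatenate(([y_pre[-1]], y_post)))
    return float(np.mean(np.sign(steps) == np.sign(slope_pre)))

# Toy subject: rising before the event at t = 4, falling afterwards.
t = [0, 1, 2, 3, 4, 5, 6, 7]
y = [0, 1, 2, 3, 2, 1, 0, -1]
shift = curvature_shift(t, y, event_time=4)
risk = deformation_risk(t, y, event_time=4)
accuracy = sign_accuracy(t, y, event_time=4)
```

In this toy case the extrapolated reference predicts the wrong direction for every post-event step, which is the failure mode the abstract reports for roughly two-thirds of real post-event observations.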
bioinformatics 2026-06-29 v1
MxSure: a mixture model for inferring within-host substitution rates and transmission SNP thresholds
Khurram, Z.; Chaguza, C.; Kwambana-Adams, B. A.; Shao, Y.; Lawley, T.; Yong, M.; Davies, M. R.; Zarebski, A. E.; Tonkin-Hill, G.
Abstract
Quantifying short-term evolutionary rates of microbial genomes is essential for understanding the processes that shape within-host evolution and for establishing thresholds needed to track transmission. In studies of short-term evolutionary rates, samples are often collected from closely related clusters (e.g. longitudinally from the same host or from transmission pairs), with substantial time intervals separating genomes between clusters. Distinguishing strain replacement from persistence presents an additional challenge in these studies. In addition, many public health and metagenomic bacterial strain tracking pipelines output pairwise SNP distances rather than the multiple sequence alignments required by common substitution rate estimation pipelines. This makes it challenging to estimate within-host evolutionary rates in many commensal bacterial species that are difficult to culture and isolate. To address these challenges, we introduce MxSure, a tool for estimating substitution rates and transmission thresholds while accounting for strain replacement from pairwise SNP distance data, as commonly generated by transmission tracking and metagenomic analysis pipelines. We demonstrate the accuracy of MxSure through extensive simulations and by analysing species with previously estimated substitution rates from longitudinal metagenomic datasets. Using MxSure, we estimated within-host substitution rates and transmission SNP thresholds for multiple commensal bacterial species including Bifidobacterium longum and Bifidobacterium bifidum from a longitudinal study of the infant gut microbiome.
bioinformatics 2026-06-29 v1
Single-cell transcriptomics reveals chondrocyte state transitions and ECM remodeling in osteoarthritic knee cartilage
Bo, Z.; Xu, H.; Liang, Y.
Abstract
Osteoarthritic cartilage contains heterogeneous chondrocyte states, yet the transitions between them remain unresolved in public single-cell data. We retrospectively reanalyzed a public knee cartilage single-cell RNA-seq dataset (GSE255460) from 8 osteoarthritis and 3 non-osteoarthritis donors, totaling 19 samples. After sample-wise quality control and doublet removal, we performed batch-corrected clustering, chondrocyte subclustering with marker-based annotation, and trajectory inference using Slingshot. Regulatory chondrocytes were tested for osteoarthritis versus control differential expression, followed by Gene Ontology and KEGG enrichment with Benjamini-Hochberg false discovery rate <0.05, and protein-protein interaction hub screening. We retained 27,036 cells. Chondrocytes exhibited branching continuous states; regulatory cells localized near the main manifold and adjacent to inferred branches, suggesting a transition-adjacent state. In regulatory cells, osteoarthritis-upregulated genes were enriched for collagen-containing extracellular matrix organization, endoplasmic reticulum secretory/proteostasis, cell-matrix adhesion including focal adhesion, and TGFbeta/SMAD signaling. Protein-protein interaction analysis identified five high-connectivity hubs: COL5A1, COL5A2, COL6A1, COL1A2, and COL3A1. Our findings support a transition-adjacent regulatory program in OA with coordinated extracellular matrix remodeling and secretory/adhesion/TGFbeta signatures, nominating collagen hubs for validation.
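The enrichment step above filters terms at a Benjamini-Hochberg false discovery rate below 0.05. As a minimal sketch of that standard adjustment (not the authors' pipeline, which the abstract does not specify; the function name and the p-values are illustrative assumptions):

```python
import numpy as np

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)                      # ranks from smallest to largest p
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity: each adjusted value is the minimum of itself
    # and all adjusted values at higher ranks.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted

# Illustrative raw p-values for four hypothetical enrichment terms.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
significant = adjusted < 0.05
```

With these made-up inputs, all four terms pass the 0.05 threshold after adjustment, since the largest adjusted value is 0.04.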
bioinformatics 2026-06-29 v1
Context-dependent correlations mislead transcriptomic network inference in bulk and single-cell data
Asiaee, A.; Bombina, P.; McGee, R. L.; Reed, J.; Abrams, Z. B.; Abruzzo, L. V.; Coombes, K. R.
Abstract
Background. Correlation is the dominant input to co-expression module discovery and miRNA-target inference. Both rely on an implicit assumption: a Pearson coefficient pooled across heterogeneous samples, whether tissues, cancer types, or cell types, estimates one biologically meaningful quantity. Simpson's paradox makes this assumption fragile in principle, since between-group mean shifts can dominate or reverse within-group associations. How often this happens in real transcriptomic data has not been quantified. Results. Across 8,890 TCGA tumors from 31 cancer cohorts and 23,170,038 miRNA-mRNA pairs, 94.8% of pairs showed both positive and negative within-cohort correlations. Restricting to the high-variance domain of one million pairs, 13.3% of pooled correlations with |r_global| >= 0.2 reversed against the within-cohort majority at sign tolerance epsilon = 0.05. Heterogeneity was the rule rather than the exception (median I2 = 0.86, IQR 0.80-0.90), and 99.5% of pairs rejected equal correlation across cohorts at FDR < 0.05. Of 692,770 experimentally validated miRTarBase v10 targets measurable in our data, only 0.9% were uniformly negative across cohorts. The pattern recurred across modalities. In GTEx, 21.0% of pooled signs disagreed with the tissue majority, and 23.5% of pairs flipped sign after tissue-mean removal. In 10x PBMC scRNA-seq, 13.1% of gene-gene correlations flipped after cell-type-mean removal; in CITE-seq, 37.9% of protein-RNA pairs flipped under a joint WNN partition of cells. Refining context reduced reversal, though by how much depended on the partition: within BRCA, 5.5% of pairs reversed under molecular PAM50 subtypes versus 0.35% under clinical IHC receptor status, and refining T cells into transcriptome-defined subtypes cut PBMC reversal from 11.8% to 0.13%. Conclusions. A single pooled correlation coefficient can invert direction relative to its within-context constituents at rates that are not negligible. 
Correlations should be reported with their context: the within-context distribution, a heterogeneity statistic, and a diagnostic that separates between-context mean shifts from within-context association. We provide a small R interface that computes these summaries.
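The sign reversal described above (Simpson's paradox under between-group mean shifts) can be reproduced on toy data. This is a minimal Python sketch, not the R interface the authors mention; the function name and the data are illustrative assumptions. It compares the pooled Pearson correlation with the correlation after removing each group's mean, and also reports the per-group coefficients.

```python
import numpy as np

def pooled_and_within_correlation(x, y, groups):
    """Return (pooled r, r after group-mean removal, per-group r)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    groups = np.asarray(groups)
    pooled = float(np.corrcoef(x, y)[0, 1])
    x_centered, y_centered = x.copy(), y.copy()
    per_group = {}
    for g in np.unique(groups):
        mask = groups == g
        per_group[str(g)] = float(np.corrcoef(x[mask], y[mask])[0, 1])
        # Subtract the group mean so only within-group variation remains.
        x_centered[mask] -= x[mask].mean()
        y_centered[mask] -= y[mask].mean()
    centered = float(np.corrcoef(x_centered, y_centered)[0, 1])
    return pooled, centered, per_group

# Two hypothetical contexts: negative association inside each,
# but a large mean shift between them.
x = [0, 1, 2, 10, 11, 12]
y = [2, 1, 0, 12, 11, 10]
groups = ["A", "A", "A", "B", "B", "B"]
pooled, centered, per_group = pooled_and_within_correlation(x, y, groups)
```

Here the pooled coefficient is strongly positive while every within-group coefficient is negative, which is the pattern a context-aware report is meant to expose.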
bioinformatics 2026-06-29 v1
Retention, not flux: endpoint confounding caps computational prediction of peptide skin penetration, with a delivery-aware reframing
Komianos, N.; Prakash, P.
Abstract
Bioactive peptides are now central to cosmetic and dermatological actives, yet predicting whether a given sequence will reach its site of action in skin remains unsolved. We contend that the dominant framing, predicting a single binary "skin permeability" label from sequence, is ill-posed, and that this, rather than a shortage of modelling power, explains the field's stalled predictive performance. The scope of the claim is narrow: barrier-crossing propensity is a legitimate, learnable function of molecular structure, whereas the vehicle- and endpoint-agnostic binary label that the literature supplies is not. We support this with a first-principles analysis and a study of public-source data. First, the experimental endpoint most commonly reported, transdermal flux into a diffusion-cell receptor compartment (OECD Test Guideline 428), conflates two opposite outcomes (genuine deep delivery and undesired systemic transport) and is, for a cosmetic active, frequently a failure signal rather than a success signal. That receptor flux is an imperfect measure of cutaneous bioavailability is long established in dermatopharmacokinetics; our contribution is to show that the same confound, inherited through scraped labels, is what caps machine learning from sequence. Second, reported "permeability" is a property of the sequence x delivery-vehicle x measurement-compartment triad, two terms of which are usually unrecorded. Third, on public-source data, a physicochemical intrinsic-permeability estimate (Potts-Guy) carries no positive predictive signal for scraped penetration labels (grouped AUC 0.45, 95% CI 0.40-0.51); sequence-only classifiers plateau in the mid-0.70s with diminishing returns as labels accumulate (AUC 0.70-0.77); and the same descriptor pipeline on a clean single-endpoint membrane dataset scores materially higher (AUC 0.83, non-overlapping CI). 
Our proposed reframing separates barrier-crossing (data-driven, sequence-level) from depth-and-retention (physics-driven, delivery-aware) and treats intrinsic transdermal flux as a regulatory risk axis; we close by proposing a triad-annotated reporting schema and a seed benchmark.
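For readers unfamiliar with the Potts-Guy estimate named above, the published relation predicts skin permeability from lipophilicity and molecular weight: log10 Kp (cm/s) = -6.3 + 0.71 log Kow - 0.0061 MW. The sketch below only evaluates that formula; how the authors apply it to peptides is not described in the abstract, and the function names and input values are illustrative assumptions.

```python
def potts_guy_log_kp(log_kow, molecular_weight):
    """log10 of the permeability coefficient Kp in cm/s (Potts-Guy relation)."""
    return -6.3 + 0.71 * log_kow - 0.0061 * molecular_weight

def potts_guy_kp_cm_per_hour(log_kow, molecular_weight):
    """Permeability coefficient converted from cm/s to cm/h."""
    return (10.0 ** potts_guy_log_kp(log_kow, molecular_weight)) * 3600.0

# Hypothetical inputs: a small lipophilic molecule and a hydrophilic peptide.
small_molecule = potts_guy_log_kp(log_kow=2.0, molecular_weight=300.0)
peptide = potts_guy_log_kp(log_kow=-2.0, molecular_weight=1000.0)
```

The molecular-weight penalty dominates for peptide-sized inputs, so the estimate falls by several orders of magnitude relative to the small molecule.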
bioinformatics 2026-06-29 v1
Causally measuring aging and rejuvenation through transcriptomic damage
Zhang, S.; Iqbal, S.; Tyshkovskiy, A.; Gladyshev, V. N.
Abstract
Aging is caused, fully or in large part, by the progressive accumulation of damage, yet quantifying age-related damage across tissues and conditions remains a challenge. Here, we present a computational framework to quantify damage from standard RNA-sequencing data. It captures four classes of aberrant transcript structures, including premature termination upon intron retention, domain-disrupting splice variants, repeat elements, and gene fusion events, each reflecting distinct forms of RNA integrity loss. Using this method, we revealed a robust age-associated increase in transcriptomic damage across tissues. To integrate these measurements into a unified biomarker, we constructed a transcriptomic damage-based aging (tDamAge) clock using machine learning models trained across mouse tissues or human peripheral blood. It could predict age and detect transcriptomic shifts under both pro-aging and anti-aging conditions. Progeroid models exhibited accelerated tDamAge, whereas interventions such as caloric restriction, rapamycin, and methionine restriction lowered tDamAge. Cross-dataset analysis showed that diverse anti-aging interventions converge on shared transcriptomic signatures, particularly RNA processing and chromatin organization pathways, and these age-associated patterns could be reversed by interventions. We further identified elevated damage age acceleration in Alzheimer's disease and observed rejuvenation-like reductions during embryonic development. Together, our findings establish transcriptomic damage as a causal, quantifiable and biologically interpretable feature of aging and demonstrate that tDamAge could detect age progression, acceleration, deceleration, and reversal.
bioinformatics 2026-06-29 v1
G-LATO: Inference of Spatial Latent Ordering via Deep Gaussian Processes
Zago, M.; Mukherjee, S.; Schleicher, J. T.; Bürkner, P.; Tabatabai, G.; Claassen, M.
Abstract
Spatial transcriptomics enables the study of cells within their native tissue context, yet identifying gradients of cellular development remains challenging. We introduce a deep Gaussian process model to address this gap. Our method recovers spatially smooth gradients explaining observed gene expression. We illustrate our method on healthy liver and glioblastoma data in reconstructing known spatial organisation and uncovering new pathological gradients, thus providing robust inference for spatial biology.
bioinformatics 2026-06-29 v1
SentryPath: a mechanistic protocol-ranking simulator with leave-one-trial-out cross-validation across 13 phase-III oncology randomised controlled trials and a pre-registered prospective forecast
Kumar, M. D.; Kumar, M.
Abstract
Background. Pivotal trials cost a median of ~$19 million each (oncology trials often $45 million or more) and contribute to a capitalised cost of ~$2.6 billion per approved drug, yet most candidate protocols never reach trial. Existing in-silico screening tools either rely on closed proprietary PK/PD modelling or require patient-level data; a transparent, cohort-level, cross-validated mechanistic alternative is missing. Methods. SentryPath is a physics-based stochastic differential equation simulator built on a Gompertzian tumour-growth term with Emax pharmacodynamic kill and Bliss-independence combination modelling, scored at the cohort level. Validation against 13 published phase-3 randomised controlled trials covering six cancer types uses the 2-year overall-survival (OS) rate ratio as the primary endpoint, cross-checked against ClinicalTrials.gov posted results. For cancer types with >=2 trials we apply leave-one-trial-out cross-validation: two shared efficacy scalars per cancer type are fit on training trials and used to predict the held-out trial cold. Results. With the per-drug efficacy proxies held fixed from the literature, two shared cancer-type scalars fit on the training trials transfer to the held-out trial with a mean held-out error of 3.7 % (range 0.7-7.3 %) on 2-year OS rate ratios across three NSCLC trials; extending the same method to RCC, HCC, and ESCC yields a 5.4 % aggregate across nine folds (per-fold range 0.2-11.2 %), reported with per-cancer stratification. We are explicit that only the two scalars are held out - the per-protocol efficacy proxies underneath are literature-anchored to drug classes that include the benchmark trials, so this is a test of scalar transfer, not of the whole engine cold.
Cross-validation improves on the same engine without it (16.4 % with production cancer priors; 21.9 % with no efficacy modifiers); a matched in-sample fit of the same two-scalar model gives 4.4 %, slightly below the 5.4 % held-out, the expected direction. Two prospective forecasts are pre-registered on the Open Science Framework with falsification envelopes and pre-readout bias disclosure. The first forecast (NCT04770896) reaches its primary data cutoff on 2026-06-30; the observed outcome and its mapping to the pre-committed interpretation will be reported in a versioned update to this preprint. Conclusion. A transparent mechanistic simulator, with a literature-anchored efficacy library and only two cross-validated scalars per cancer type, transfers those scalars across held-out NSCLC trials at 3.7 % mean error (range 0.7-7.3 %) and extends to other cancers with documented per-cancer stratification. The validation is pilot-scale (3-9 folds) and the scalars sit on a fixed, trial-informed substrate; its distinguishing contribution is less the error magnitude than the public predict-verify-disclose cycle that goes beyond retrospective fit. Keywords: mechanistic simulation; oncology; clinical trial design; cross-validation; pre-registration; protocol prioritisation.
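The three modelling ingredients named in the Methods (Gompertzian growth, Emax kill, Bliss-independence combination) have standard textbook forms, sketched below. This is a deterministic Euler toy, not SentryPath: the source describes a stochastic differential equation simulator whose parameters and scoring are not given in the abstract, so all function names, parameter values, and the way the kill term enters the growth equation are assumptions.

```python
import math

def emax_effect(concentration, emax, ec50):
    """Emax model: fractional effect rising hyperbolically with concentration."""
    return emax * concentration / (ec50 + concentration)

def bliss_combine(effect_a, effect_b):
    """Bliss independence: combined fractional effect of two independent agents."""
    return effect_a + effect_b - effect_a * effect_b

def simulate_gompertz(v0, growth_rate, capacity, kill_rate, days, dt=0.1):
    """Euler integration of dV/dt = growth_rate * V * ln(capacity / V) - kill_rate * V."""
    volume = v0
    for _ in range(int(round(days / dt))):
        dv = growth_rate * volume * math.log(capacity / volume) - kill_rate * volume
        volume = max(volume + dv * dt, 1e-12)  # keep the volume strictly positive
    return volume

# Hypothetical two-drug protocol: each drug at its EC50, combined under Bliss.
effect_a = emax_effect(concentration=1.0, emax=0.8, ec50=1.0)
effect_b = emax_effect(concentration=2.0, emax=0.6, ec50=2.0)
combined = bliss_combine(effect_a, effect_b)

# Assumed mapping from combined effect to a per-day kill rate (illustrative only).
untreated = simulate_gompertz(v0=1.0, growth_rate=0.1, capacity=100.0, kill_rate=0.0, days=10)
treated = simulate_gompertz(v0=1.0, growth_rate=0.1, capacity=100.0, kill_rate=0.1 * combined, days=10)
```

A cohort-level score such as the 2-year OS rate ratio would need a further mapping from tumour burden to survival, which the abstract does not specify and this sketch does not attempt.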
bioinformatics 2026-06-29 v1
A hyperbolic topological atlas reveals polyamine steering of a shared developmental manifold in Arabidopsis
Zdrazil, J.; Kong, L.; Flores-Hernandez, E.; Rodriguez Kessler, M.; Klimes, P.; Spichal, L.; De Diego, N.; Snasel, V.
Abstract
High-throughput plant phenotyping captures development at scale, yet image-rich screens are still often reduced to static trait summaries. We tested whether nutrient availability, polyamine priming, polyamine concentration, and polyamine transport reshape Arabidopsis rosette development by generating distinct morphologies or by changing residence along a common trajectory. We analyzed 138,223 time-resolved rosette images from Col-0 and five polyamine-transport mutants (put1-5), primed with putrescine, spermidine, or spermine across dose and nutrient regimes, using a self-supervised vision backbone, Poincaré embedding, hyperbolic Mapper, and manifold straightening. The data form a single connected developmental manifold with 410 nodes and 746 edges, organized from an early, low-nutrient-biased hub through high-betweenness transition corridors to two late, nutrient-enriched terminal regions. Polyamine identity stratifies this manifold by developmental phase: putrescine enriches early states, spermidine occupies transition corridors, and spermine marks late compact rosettes. Nutrient richness and dose change distal occupancy, whereas put genotypes alter dwell time within shared regions rather than producing separate topologies. Manifold straightening resolves these effects into a short early lateral deflection followed by convergence, yielding two scalar readouts, early transverse offset and distal occupancy, that summarize treatment action on a common morphodynamic scale. The framework converts large image screens into interpretable developmental geometry for image-based phenomics.
bioinformatics 2026-06-29 v1
Practical Use of Advanced AI Frameworks on Real-Life Scientific Problems: Three Case Studies
Gulluoglu, H. S. A.; Baby, J.; Bagul, K. M.; Basangari, B. R.; Bathini, S. A.; Chalamalla, N. K. R.; Dcunha, J.; Gupta, O.; Huang, L.; Jiang, X.; Naidu, Y. R.; Sathishkumar, G.; Sehrawat, M.; Thota, S. L.; Thuvara, D.; Vanguri, M. B.; Yin, J.; Jugder, B.-E.; Lusky, I. E.; Li, J.; Sinitskiy, A.
Abstract
Agentic artificial intelligence (AI) systems increasingly claim to automate scientific research, yet independent evaluations report persistent gaps between those claims and demonstrated capability. We tested frontier agentic AI systems on three practical problems: prediction of treatment non-response in immune-mediated inflammatory diseases, optical chemical structure recognition for literature mining, and prediction of drug-design-related properties from small datasets. Each problem was first assigned to autonomous frameworks and then reattempted as human-led, AI-assisted work. Autonomous runs failed in most cases, while human-led work produced reusable resources and modest but defensible performance, including new evidence for possible mechanisms of treatment resistance and a more practical benchmark for mining chemical structures from scientific papers. Property prediction was the single task on which one autonomous AI framework matched the human expert. We conclude that current frameworks can carry out engineering and analysis once a human expert leads the project, but cannot yet engineer a novel solution without oversight. The use of AI on real-life scientific problems remains an art rather than a routine technology.
bioinformatics 2026-06-29 v1
Learning Fragmentation Physics or Exploiting Sequence Priors? Benchmarking Bias in Deep Learning Models for De Novo Peptide Sequencing
Li, J.; Rost, H.
Abstract
Deep learning models have advanced de novo peptide sequencing, but their predictions may reflect both physics-based spectral evidence and learned peptide-sequence priors. Systematically measuring such prior-associated behavior is important for benchmarking model robustness beyond conventional proteomics data. Here, we introduce the Prior Bias Index (PBI), a general framework for measuring the extent to which model behavior shifts toward prior-associated reference patterns under controlled conditions, and implement it as DeNovo-PBI, a benchmark for quantifying prior bias in de novo peptide sequencing models. DeNovo-PBI combines benchmark dataset construction, in silico sequence and spectral perturbation workflows, PBI-based metrics, and analysis algorithms to evaluate three forms of prior-associated behavior: sequence-distribution dependence, database amino-acid-pair order preference, and mutation-group prediction consistency under shared sequence context. In addition to experimentally acquired peptide spectra, we generated in silico spectra from random, natural, and mutated peptide sequences and selectively removed fragment ions that distinguish N-terminal residue orders. Across these assays, deep learning models showed peptide-sequence-distribution-dependent performance and strong directional amino-acid-pair order preferences even when order-diagnostic spectral evidence was removed. DeNovo-PBI provides a quantitative benchmark for measuring, comparing, and interpreting learned bias in de novo peptide sequencing models.
bioinformatics 2026-06-29 v1
Short-Read Sequencing Benchmarking with Donor-Specific Assemblies
McGee, S. R.; Smith, J. D.; Frazar, C. D.; Ryke, E.; Vollger, M. R.; Kwon, Y.; Bennett, J. T.; Eichler, E. E.; Stergachis, A.; Wei, C.-L.
Abstract
Background High-throughput short-read sequencing has become a core technology for genomics, but the rapid expansion of available platforms has made it increasingly important to benchmark them under standardized conditions. A major challenge is that conventional reference-based comparisons confound true sequencing errors with inherited variation and reference bias, making it difficult to isolate platform-intrinsic performance. Results We benchmarked nine short-read chemistries across seven DNA sequencers using two highly characterized benchmark samples, HG002 and COLO829BL, together with donor-specific assemblies to measure sequencing errors against sample-matched genomic references. This strategy separated authentic platform errors from biological divergence and revealed substantial differences in substitution, indel, read-position, and sequence-context error profiles. Element AVITI UltraQ and Roche SBX-D showed the lowest substitution error rates, whereas Ultima and Roche chemistries exhibited the strongest indel-associated biases. We also found pronounced platform-specific effects in low-complexity regions and trinucleotide contexts, including homopolymer-associated errors and context-dependent substitution skews that are directly relevant to rare-variant detection. In addition, we show that donor-specific references are essential for unbiased base-quality recalibration because they minimize reference bias and more faithfully support cross-platform comparison and low-frequency variant-calling thresholds. Conclusions Donor-specific assembly-based benchmarking provides a robust framework for measuring true short-read sequencing errors and comparing platforms on a common, sample-matched basis. Our results establish a comprehensive reference for the community and show that authentic error profiles can guide platform selection, quality filtering, and improved detection of rare somatic variation.
bioinformatics 2026-06-28 v1
Lineage-aware stochastic modeling reveals gene-expression dynamics in development and disease
Xing, J.; Staklinski, S. J.; Liu, Z.; Nowak, D.; Siepel, A.
Abstract
Gene expression evolves dynamically along cell lineages, yet most analysis methods treat single-cell RNA-seq (scRNA-seq) data as static snapshots and fail to exploit phylogenetic relationships among cells. Recent advances in cell-lineage tracing now enable the reconstruction of high-resolution lineage phylogenies, providing a natural framework for identifying when and where transcriptional changes arise during development, differentiation, and disease progression. Some models of gene expression have begun to consider phylogenetic structure, but they generally rely on imprecise Gaussian assumptions, focus on endpoint-level comparisons, or fail to consider sparse and overdispersed scRNA-seq read counts. Here, we present LaVOUS (Lineage-aware Variational Ornstein-Uhlenbeck Single-cell RNA-seq analysis), a probabilistic framework that couples lineage-based models of latent dynamics derived from the Brownian motion and Ornstein-Uhlenbeck stochastic processes with negative-binomial observation models and scalable variational inference. LaVOUS enables likelihood-based tests for cellular heritability and branch-specific shifts in gene expression, as well as phylogenetic reconstruction of latent expression histories. In simulations, LaVOUS outperformed a Gaussian method in detecting lineage-associated expression changes and produced accurate reconstructions of expression histories across expression levels. We additionally applied LaVOUS to paired single-cell lineage and transcriptomic data from metastatic lung cancer, class-switching B cells, and the developing brain. Across these settings, LaVOUS identified lineage-associated expression changes related to metastatic progression, B-cell isotype switching, and dopaminergic and glutamatergic neuron differentiation.
By providing an expressive framework for modeling sparse count data on lineage trees, LaVOUS establishes a foundation for studying single-cell expression dynamics across developmental and disease contexts, with natural extensions to multi-gene regulation, lineage uncertainty, and multi-modal integration.
bioinformatics 2026-06-28 v1
Client-server interfaces enable efficient agent-driven variant calling
Yu, X.; Zheng, Z.; CHEN, L.; QIn, Z.; Guo, X.; He, M.; Luo, R.
Abstract
Background: Large language model (LLM) agents increasingly automate bioinformatics analyses, but most existing bioinformatics tools were built for standalone use by human experts. An agent driving such a tool must reason about its installation, configuration, and execution from documentation written for humans, spending many turns, tokens, and tool calls per result. How a method is exposed to an agent can therefore matter as much as the method itself. By designing agentic interfaces for these tools, developers can reduce such overhead and improve the reliability of agent-driven analyses. Findings: To test this design, we re-architected Clair3, a widely used deep-learning-based long-read variant caller, into a client-server system, Clair3-Connect. The client performs all genomics-related processing and holds the identifiable data. The server runs only neural-network inference, and the client sends only feature tensors to the server, while sample identifiers and genomic context remain on the client. The client exposes schema-defined agent-facing tools that an agent invokes through single structured calls. On an APOE diplotyping task, all 60 agent runs were correct. The agentic tools used 12K tokens in 3 turns, 6.8 to 14 times fewer tokens than the shell-driven baselines (81K-163K tokens), at about a quarter the wall-clock time and far more stably (4% versus 35% token usage variation). Dropping the pileup and phasing stages to keep the client light left SNP F1 within 0.1-0.3 points of standard Clair3 by 50x coverage, while mutual TLS and AES-256-GCM encryption added 7.2% to end-to-end runtime. Conclusions: Recasting an established algorithm as developer-built, agentic tools behind a secure client-server boundary makes it more efficient, reliable, and easier to deploy for an LLM agent than a third-party wrapper, which cannot recover the defaults and conventions only its developers know. Agentic interfaces should be a first-class deliverable of bioinformatics tool development.
bioinformatics 2026-06-28 v1
PARROT: Phase-Altering Regulatory Rewiring Over Time
Chen, C.; Padi, M.; Quackenbush, J.
Abstract
Motivation: Gene regulatory networks undergo dynamic restructuring during development and disease. Identifying when and how these networks change is crucial for understanding developmental and disease transitions, yet existing change-point detection methods often ignore network structure or lack interpretable community assignments. Results: We present PARROT (Phase-Altering Regulatory Rewiring Over Time), a framework for detecting change-points in dynamic networks using Stochastic Block Models. PARROT jointly estimates change-point locations and community structure across four network classes: unipartite and bipartite with either Gaussian or Bernoulli edge models. Simulations demonstrate improved performance and community recovery compared to other methods. Applications to human cardiac differentiation and mouse lung development data successfully recovered known phase boundaries. PARROT identifies both which genes are reassigned across modules and how the connections change between states.
bioinformatics 2026-06-28 v1
Spatial co-expression and cell-cell communication inference from spatially resolved transcriptomics with CONCISE
Zhao, J.; Shan, X.; Wang, G.; Chu, T.; Lin, C.; Chang, R.; Zhao, H.
Abstract
Cell-cell communication is fundamental to tissue organization, homeostasis, and disease progression. Recent advances in spatial transcriptomics provide unprecedented opportunities to systematically characterize ligand-receptor interactions directly within intact tissues. However, robust inference of spatial ligand-receptor interactions remains challenging because intrinsic features of spatial transcriptomics data, including spatial autocorrelation, variation in total molecular counts, and measurement errors, can induce spurious spatial co-expression and lead to inflated false-positive results. Most existing methods do not adequately account for these confounding factors, limiting the reliability of inferred cellular communication. Here, we present CONCISE, a statistical method for spatially constrained co-expression and ligand-receptor interaction inference that jointly models spatial autocorrelation, variation in total molecular counts, measurement errors, and spatial proximity constraints. CONCISE combines efficient moment-based parameter estimation with analytical hypothesis testing, enabling fast and statistically rigorous inference without restrictive distributional assumptions. Through extensive simulations, real-data permutation experiments, and biologically motivated negative-control analyses across different spatial transcriptomics platforms, we show that most existing methods presented inflated false-positive rates, whereas CONCISE achieved well-calibrated inference, robust false-positive control, and improved detection power. Application of CONCISE to high-resolution MERFISH and CosMx datasets from intestinal inflammation and non-small cell lung cancer further highlights its biological utility in disease contexts. CONCISE uncovered inflammation-associated fibroblast-specific interactions during intestinal inflammation and delineated complex tumor-immune and tumor-stromal signaling networks within the tumor microenvironment.
bioinformatics 2026-06-28 v1
LOESS and DE-SWAN can induce artifactual "waves" of molecular aging
Carbonneau, M.; Shutta, K. H.; Miller, J.; Shen, X.; Snyder, M.; Quackenbush, J.
Abstract
A growing literature has investigated the relationship between age and biomolecular changes, leading to conclusions that aging occurs in discrete molecular "waves." Data summary tools such as LOESS and sliding window analyses like DE-SWAN are common approaches that have gained acceptance in recent years. We demonstrate via simple simulations that these tools can identify non-linear patterns of aging where they do not exist. Specifically, we show that (i) clustering of molecular trajectories using LOESS can lead to artifactual characteristic patterns of molecular aging, (ii) "waves" of aging identified using the combination of LOESS and DE-SWAN in real data are not robust to changes in the underlying age distribution and are not supported by valid permutation testing, and (iii) DE-SWAN alone can generate pronounced "waves" of nonlinear molecular aging in linear data due to differences in statistical power along the age continuum. Our results specifically challenge the statistical support for discrete aging crests inferred in the literature, but do not rule out nonlinear molecular aging or age-associated transitions that may be detectable using other cohorts and statistical models.
bioinformatics 2026-06-28 v1
CoLa-VAE: A Cell-Cell Communication-Aware Variational Autoencoder for Representation Learning and Expression Denoising
Chen, Y.; Qi, C.; Fang, H.; Luan, F.; Zhang, Z.; Arya, S.; Wei, Z.
Abstract
Single-cell RNA sequencing provides a powerful view of cellular heterogeneity, but its sparsity and dropout noise remain major obstacles for recovering biologically meaningful gene expression programs and for downstream analyses that depend on reliable expression measurements. Ligand-receptor-based cell-cell communication inference is one such analysis: missing ligand or receptor expression can cause substantial false negatives in sparse single-cell data. Here, we present CoLa-VAE, a cell-cell communication-aware variational autoencoder that jointly learns latent representations and denoised expression profiles by incorporating ligand-receptor-derived communication topology through dynamic graph Laplacian regularization. Rather than treating denoising as a secondary output of representation learning, CoLa-VAE uses denoised expression to iteratively refine communication estimates and uses the resulting communication structure to guide both latent organization and expression reconstruction. In addition to improving latent space organization and producing robust denoised expression matrices, CoLa-VAE-denoised matrices also improved downstream biological analyses, including the detection of robust differential cell-cell communication programs, mitigation of batch-associated variation, and enhanced spatial transcriptomic deconvolution when spatially constrained communication structure was incorporated. Together, these results establish CoLa-VAE as a communication-guided denoising and representation learning framework that recovers biologically meaningful expression signals from sparse single-cell and spatial transcriptomic data, enabling more sensitive and reliable downstream analysis.
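Graph Laplacian regularization, as named above, penalises latent representations that differ between strongly connected cells. A minimal sketch of the generic penalty follows; it is not CoLa-VAE's implementation. The abstract does not describe how the communication graph is built or updated, so the static weight matrix, the function name, and the toy values are assumptions.

```python
import numpy as np

def laplacian_penalty(latent, weights):
    """Return trace(Z^T L Z) with L = D - W, for a symmetrised weight matrix W.

    For symmetric W this equals 0.5 * sum_ij W_ij * ||z_i - z_j||^2,
    so strongly connected cells are pulled toward similar latent codes.
    """
    z = np.asarray(latent, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = 0.5 * (w + w.T)                       # symmetrise the graph
    laplacian = np.diag(w.sum(axis=1)) - w    # unnormalised graph Laplacian
    return float(np.trace(z.T @ laplacian @ z))

# Three hypothetical cells in a 2-D latent space with two weighted edges.
latent = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
weights = np.array([[0.0, 1.0, 0.0],
                    [1.0, 0.0, 0.5],
                    [0.0, 0.5, 0.0]])
penalty = laplacian_penalty(latent, weights)
```

In a training loop this term would be added to the reconstruction and KL losses with a tunable coefficient, and a "dynamic" variant would recompute the weight matrix as the denoised expression changes.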
bioinformatics 2026-06-26 v2