Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
Time-resolved inference of gene regulatory networks underlying human cranial neural crest development suggests novel risk genes for orofacial clefting.
Eibl, M.; Theiss, S.; Einarsson, H.; Vaagenso, C. S.; Krautz, R.; Gehringer, M.; Siewert, A.; Zhang, Y.; Rada-Iglesias, A.; Saez-Rodriguez, J.; Herrmann, C.; Ludwig, K. U.; Andersson, R.; Laugsch, M.
Cranial neural crest cells (CNCCs) play a central role in shaping the human head and face. Aberrant CNCC differentiation contributes to craniofacial birth defects, particularly non-syndromic cleft lip with or without cleft palate (nsCL/P), one of the most common congenital disorders. Although the number of genetic variants associated with this condition is steadily increasing, it remains challenging to determine if and how these variants may contribute to disease development. The majority of these variants lie within non-coding regulatory elements that govern cell-type and stage-specific gene expression, which is orchestrated by dynamic gene regulatory networks (GRNs). Despite extensive work in model organisms, a time-resolved, multi-omics perspective of GRNs controlling CNCC differentiation in a human system is still lacking. To fill this gap, we generated paired transcriptomic and chromatin accessibility data at four timepoints during in vitro differentiation of CNCCs derived from human induced pluripotent stem cells. Integrating these two modalities enabled time-resolved inference of GRNs and identification of dynamic regulatory relationships, including stage-specific roles of core transcription factors. Leveraging these time-resolved GRNs, we mapped 29 nsCL/P associated variants linked to 70 putative target genes, with 40 located outside the associated genomic loci, suggesting novel distal regulatory relationships. Integration of these data with complementary time-course scRNA-seq data revealed an ectomesenchymal-biased subpopulation of CNCCs as particularly sensitive to genetic variants associated with nsCL/P. We provide a time-resolved inference of GRN in human CNCC differentiation, allowing us to determine the dynamics of stage-specific core regulatory programs that are otherwise missed in analyses based on a single time snapshot. 
To our knowledge, the data represent the first multi-omics map of human CNCC with temporal resolution, which expands the understanding of early human craniofacial development, refines variant-to-gene assignment, prioritizes candidate risk genes and cell states relevant to nsCL/P. Our findings demonstrate the relevance of studying the dynamics upon differentiation rather than just one fixed timepoint and offer a valuable basis for further investigation of non-coding variation in CNCC-related disorders.
bioinformatics 2026-06-30 v1
Impacts of batch effects on the performance of machine learning classifiers across multiple studies
Raab, P.; Johnson, W. E.; Piccolo, S. R.
Precision medicine relies on accurate and generalizable predictions for patients across the spectrum of human diversity. Because capturing biological heterogeneity requires large sample sizes, researchers must often aggregate data from several experimental batches or independent studies. This integration allows for greater statistical power and diversity than a single study could provide, while avoiding the costs of generating massive new -omics datasets. Predictive models trained on these aggregated data are theoretically better equipped to detect subtle patterns that generalize to new data. However, this potential is frequently undermined by "batch effects"--systematic technical artifacts that can bias model training to predict experimental batches and shadow meaningful biological conditions. Models trained on data with batch effects can exhibit substantially degraded performance when applied to data from new batches. Statistical adjustment methods can mitigate these artifacts while preserving biological signals. To ensure these adjustments actually facilitate generalization, we emphasize the use of external, independent cohorts for rigorous validation. This chapter examines how batch effects impact predictions and compares various adjustment methods.
bioinformatics 2026-06-30 v1
Real-World Progression-Free Survival with Erlotinib versus Osimertinib in EGFR L858R+T790M Compound Mutation Non-Small Cell Lung Cancer: An Exploratory Analysis of the MSK-CHORD Dataset
Dalloul, Z.; Abboud, A.; Dalloul, I.; Abdelsalam, M.
Background: Osimertinib is the standard first-line treatment for EGFR-mutant non-small cell lung cancer (NSCLC) harboring common activating mutations, including exon 19 deletions and L858R. It is also active against tumors with acquired T790M resistance. However, the EGFR L858R+T790M compound mutation, where both variants co-occur within the same tumor, may confer distinct drug-sensitivity profiles not predicted by either mutation alone. Limited data exist on comparative treatment outcomes in this rare genotype. Methods: Using the MSK-CHORD clinicogenomic dataset (n=24,950), we identified patients with concurrent EGFR L858R and T790M mutations receiving erlotinib (Erlo) or osimertinib (Osi) monotherapy. Real-world progression-free survival (rwPFS) per treatment line was calculated using a strict definition requiring confirmed radiological progression events (rwPFS-strict), excluding lines with null endpoint data. Kaplan-Meier analysis, log-rank testing, Cox proportional hazards regression, and cross-cohort heterogeneity testing (Cochran's Q statistic) were performed. Two control cohorts, L858R-only (n=372) and T790M-only (n=76), were analyzed in parallel to assess mutation-context specificity of treatment response. Results: Thirty-one patients with EGFR L858R+T790M were identified; 21 contributed evaluable monotherapy lines, yielding 23 Erlo and 15 Osi treatment lines (14 unique patients per treatment group, 7 contributing to both). Median rwPFS numerically favored Erlo over Osi (7.10 vs 5.32 months; HR 1.29, 95% CI 0.66-2.52; log-rank p=0.46). This directional trend was reversed in the L858R-only control cohort, where Osi demonstrated significant superiority (9.03 vs 5.75 months; HR 0.70, 95% CI 0.55-0.89; p=0.003). The T790M-only cohort showed no significant difference (HR 1.32, p=0.12). An exploratory post-hoc heterogeneity test confirmed a significant cross-cohort interaction (Q=9.94, df=2, p=0.007).
Conclusions: The expected osimertinib advantage was absent in L858R+T790M compound-mutant NSCLC. The opposing hazard ratio directions across mutation contexts (HR 1.29 vs 0.70), with a significant exploratory cross-cohort interaction (p=0.007), suggest that the EGFR L858R+T790M compound mutation may represent a pharmacologically distinct entity with differential TKI sensitivity. These hypothesis-generating findings warrant prospective validation.
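The cross-cohort heterogeneity test named in this abstract can be illustrated with a minimal sketch of Cochran's Q on log hazard ratios. This is not the authors' code. The helper names are hypothetical, standard errors are back-derived from the reported 95% CIs, and the T790M-only standard error is back-derived from its p-value under the assumption of a two-sided Wald test (the abstract reports no CI for that cohort). Because the inputs are rounded and partly assumed, the result need not match the reported statistic exactly.

```python
import math
from statistics import NormalDist


def se_from_ci(lower, upper, level=0.95):
    """Standard error of log(HR) recovered from a confidence interval."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return (math.log(upper) - math.log(lower)) / (2 * z)


def se_from_p(hr, p):
    """Standard error of log(HR) recovered from a two-sided p-value.

    Assumption: the p-value comes from a Wald test on log(HR).
    """
    z = NormalDist().inv_cdf(1 - p / 2)
    return abs(math.log(hr)) / z


def cochran_q(log_effects, ses):
    """Cochran's Q with inverse-variance weights; returns (Q, df)."""
    weights = [1.0 / se ** 2 for se in ses]
    pooled = sum(w * b for w, b in zip(weights, log_effects)) / sum(weights)
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, log_effects))
    return q, len(log_effects) - 1


# Hazard ratios as reported: compound, L858R-only, T790M-only.
hrs = [1.29, 0.70, 1.32]
ses = [
    se_from_ci(0.66, 2.52),
    se_from_ci(0.55, 0.89),
    se_from_p(1.32, 0.12),  # assumption, see lead-in
]
q, df = cochran_q([math.log(h) for h in hrs], ses)

# For df == 2 the chi-square survival function has the closed form exp(-Q/2).
p_value = math.exp(-q / 2)
print(f"Q = {q:.2f}, df = {df}, p = {p_value:.4f}")
```

The closed-form p-value holds only for two degrees of freedom; for other cohort counts a chi-square survival function from a statistics library would be needed.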
bioinformatics 2026-06-30 v1
GeneBench-Pro: Evaluating Multistage Statistical Reasoning in Genomics, Quantitative Biology, and Translational Biomedicine
Li, J. H.; Ho, A. J.
We introduce GeneBench-Pro, an expanded and improved version of GeneBench that comprises harder problems across a wider breadth of domains. GeneBench-Pro is a benchmark for AI agents performing realistic multi-stage scientific analyses in genomics, quantitative biology, and translational biomedicine which seeks to capture the complexity of real-world problems that computational life scientists face when tasked with producing a conclusion upon which a downstream scientific or translational decision is contingent. The benchmark comprises 129 evaluations targeting quantities of direct practical relevance across 10 primary domains and 21 terminal subdomains, with a genomics-centered core. Similarly to GeneBench, each problem provides the agent with brief context, a target estimand, and minimal guidance otherwise; the agent must then navigate multiple dependent decision points; i.e., substantive inferential forks where a plausible wrong choice changes the downstream analysis, to identify and execute the correct analysis workflow and arrive at the correct answer. Relative to GeneBench, GeneBench-Pro adds 29 new problems, drops three, and introduces significantly redesigned versions of 54 of the remaining 100 overlapping problems. 82 of the 129 problems were reviewed by external domain experts, whose findings led to prompt/data modifications and redesign of those problems whose targets were not sufficiently identifiable. Ten externally reviewed problems are released publicly, 50 held-out problems were provided to Artificial Analysis for independent third-party model benchmarking, and the remainder are retained as an internal holdout. In evaluations over the full 129-problem suite, GPT-5.6 Sol reaches an eval-level pass rate of 28.7% at the max reasoning level, and GPT-5.6 Sol Pro reaches 31.5% in separately reported GPT Pro runs. GPT-5.5 reaches 12.0%, GPT-5.4 reaches 8.9%, and the strongest non-GPT baseline, Claude Opus 4.8, reaches 16.0%. 
As with GeneBench, models often complete substantial portions of the workflow but exhibit a consistent gap between noticing and acting by identifying local diagnostic signals but failing to propagate the implications to the corresponding analysis decision. As a result, models often select wrong estimators or persist on initially plausible but incorrect analysis paths. GeneBench-Pro therefore measures an emerging capability of long-horizon biological reasoning that remains unreliable.
bioinformatics 2026-06-30 v1
A robot model of compass cue calibration in the insect brain
Mitchell, R.; Dacke, M.; Webb, B.
Dung beetles can use a variety of orientation cues to maintain a consistent bearing during ball-rolling. Where several cues are available, they appear to learn the spatial relationship between them, providing redundancy if some cues are removed. Mounting evidence indicates that such a learning process is implemented in the insect head direction circuit; specifically, in the plastic substrate between sensory input neurons and compass neurons in the central complex. This plasticity appears to be driven by rotational movements, providing a clear link with observed beetle 'dance' behaviour. Here, we extend our functional model of this circuit and use it on a robot platform, to test it in the same behavioural assay as was used for the beetles. The robot was able to replicate the beetle's ability to substitute a directional wind cue for a point source light cue in guiding straight-line movement. However, it also revealed significant biasing coupled to dance direction. This biasing appears to be caused by inherent conflict between recurrent and instantaneous inputs to the compass circuit. We predict that the real insect should experience similar issues unless it has evolved a neural mechanism to compensate.
bioinformatics 2026-06-30 v1
A High-Quality Acetylation Dataset Reveals Modest Data Requirements for Transfer Learning to Identify Little Studied Post-Translational Modifications
Hartmaring, Y.; Wang, S.; Jones, A. R.; Vizcaino, J. A.; Schlaffner, C. N.; Renard, B. Y.
Dysregulation of post-translational modifications (PTMs) is associated with severe pathologies, including cancers and Alzheimer's disease. Despite their biological importance, identifying modified peptides remains challenging due to the immense combinatorial search space. While searches benefit from prior knowledge of a peptide's modification status, the data scarcity for most PTMs hinders the development of accurate deep learning classifiers like AHLF (ad hoc learning of peptide fragmentation). Here, we overcome this data bottleneck for acetylation and ubiquitination. We harmonised a dataset with about 500,000 high quality acetylated peptide-spectrum matches (PSMs) from nine publicly available acetylation-enriched datasets. We fine-tuned AHLF with the acetylation and a 2-million spectra strong ubiquitination dataset separately and assessed the minimum data requirement for training by iteratively downsampling. Training separate models on SILAC and label-free subsets also assessed the impact of data diversity. The resulting acetylation and ubiquitination models achieve an AUC of 0.87 and 0.90 respectively. Beyond 28,500 acetylated spectra, corresponding to roughly 0.3% of the original model's training data, additional data just provides minor performance gains. Finally, we show that data diversity is beneficial for generalizability, while models trained on homogeneous data sources tend to overfit to their respective data type. All code, and model weights are available at https://gitlab.com/dacs-hpi/ahlf-ptmai.
bioinformatics 2026-06-30 v1
Integrating Semantic Retrieval, LLM-based Refinement, and Structured Expert Curation for Scalable AOP Gene Mapping
Schaffert, A.; Fratello, M.; Kangas, K.; Torres Maia, M.; del Giudice, G.; Mobus, L.; Accardi, C.; Al-Abdulraheem, Z.; Campini, L.; Galardo, F.; Federico, A.; Ciancaleoni, G.; Juppi, H.-K.; Paparella, M.; Serra, A.; Greco, D.
Toxicogenomics can support regulatory toxicology, but its use is limited by the difficulty of translating molecular responses into mechanistic, decision-relevant interpretations. Adverse Outcome Pathways (AOPs) provide a framework for this translation, yet omics applications require scalable mapping of Key Events (KEs) to molecular features. Here, we present an AI-assisted, multi-step workflow for KE-to-gene mapping that uses embedding-based semantic retrieval to identify candidate ontology/pathway terms, large language model-assisted refinement to filter these candidates, and double-independent expert group curation with rule-based consolidation to finalize mappings and derive confidence scores. Compared with earlier NLP-based approaches, the workflow improves KE-to-ontology/pathway mapping performance and generates candidate annotations that better align with expert judgment while substantially reducing the need for manual augmentation. Explicit gene and protein mentions in KE titles were additionally grounded to improve specificity, and each curated mapping was assigned curator reason codes to support transparent, traceable, and confidence-aware reuse. Applied across AOP-Wiki, the workflow produced a comprehensive KE-to-gene set resource covering 1,254 KEs across 523 AOPs and linking 15,833 human genes. Utility is demonstrated through CTD-based AOP fingerprinting of curated reference chemical groups, highlighting expanded coverage and confidence-informed interpretation of chemical-associated gene signatures in an AOP context. The workflow and resulting resource provide a practical bridge between toxicogenomics and AOP-based mechanistic interpretation and support routine updating and future extension to additional omics layers within OECD Omics2AOP.
bioinformatics 2026-06-30 v1
A pan-cancer benchmark of integrated ferroptosis, cuproptosis and disulfidptosis prognostic signatures
Demir, A. Y.; Yasar, E.
Integrated prognostic signatures combining ferroptosis, cuproptosis, and disulfidptosis are increasingly reported in oncology as advances in risk stratification, yet their added value over simpler pathway-specific or proliferation-related models remains unclear. Here, we developed an integrated regulated cell-death signature and evaluated it through an adversarial pan-cancer benchmark. Using the TCGA pan-cancer cohort comprising 9,808 tumours across 33 cancer types, we curated 118 genes associated with the three cell-death programmes, characterised inter-pathway crosstalk, and derived a 26-gene LASSO-Cox risk signature. The model showed reproducible prognostic performance across cancers, with a pan-cancer concordance index of 0.573 (95% CI, 0.552-0.594), and was independently validated in METABRIC and CGGA cohorts, remaining significant after adjustment for standard clinical variables. However, benchmarking revealed that the integrated signature, although superior to size-matched random gene sets (empirical p < 0.001), did not outperform a ferroptosis-only model (DeLong p = 0.81), indicating no measurable gain from pathway integration. Moreover, much of the prognostic signal reflected tumour proliferation rather than regulated cell death. After adjustment for the proliferation meta-signature (meta-PCNA), ferroptosis performance declined from 0.573 to 0.504, while the integrated model decreased to 0.554. High-risk tumours were more sensitive to anti-proliferative drugs, and the risk score was most strongly associated with E2F, MYC, and G2M target programmes. The signature stratified prognosis but did not predict immune-checkpoint blockade response in IMvigor210 (AUC ≈ 0.50). Importantly, the underlying biology was not merely a modelling artefact. Signature genes showed concordance with protein abundance in CPTAC cohorts, and the three cell-death programmes co-varied within individual malignant cells, with correlations ranging from ρ = 0.46 to 0.66.
Overall, our findings indicate that integrated multi-death signatures are reproducible and biologically grounded, yet prognostically redundant and substantially confounded by proliferation. This study provides a cautionary benchmark for the rapidly expanding use of composite regulated cell-death signatures in cancer prognosis.
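The concordance index reported for this signature can be made concrete with a minimal sketch of Harrell's C for right-censored survival data. This is an illustrative implementation, not the authors' pipeline. The function name and toy data are hypothetical, and pairs with tied survival times are simply skipped, which is a simplification.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable pairs that the risk score orders correctly.

    times       -- follow-up time per patient
    events      -- 1 if the event was observed, 0 if censored
    risk_scores -- higher score means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` has the shorter follow-up time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # Comparable only if times differ and the earlier time is an observed event.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (hypothetical): four patients, two censored.
times = [5, 8, 12, 20]
events = [1, 0, 1, 0]
risk = [0.9, 0.4, 0.2, 0.5]
c_index = concordance_index(times, events, risk)
print(c_index)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which is why a pan-cancer value of 0.573 indicates modest discrimination.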
bioinformatics 2026-06-30 v1
Structural Bioinformatics of Four Human Aquaporins and Their Water-Soluble QTY Analogs
Zhang, S.; Xiao, E.
Human aquaporins (AQPs) are essential membrane channels, yet their inherent hydrophobicity complicates structural and functional studies. We present the systematic application of the QTY code to human AQPs, integrating it with AlphaFold 3 structure prediction to design water-soluble analogs of four representative human AQPs (AQP1, AQP3, AQP4, AQP7) and to validate that these analogs maintain the native conformation. This approach features a novel platform for editing challenging membrane proteins. The QTY code was applied to the transmembrane regions of the selected four AQPs. Subsequently, the structures of the water-soluble QTY analogs of the four AQPs were predicted using AlphaFold 3. The predicted structures were superposed with cryo-EM- or X-ray-determined native structures in PyMOL. Further analyses included root-mean-square deviation (RMSD) calculations, visualization of hydrophobic surface reduction, and inspection of conserved protein-ligand binding ability. After applying the QTY code, sequence changes between native AQPs and their QTY analogs were significant (42.86-48.80%). Nevertheless, their structures superposed well in analyses, with only slight deviations (RMSD < 0.6 Å). In addition, the surface hydrophobicity of all QTY-edited AQPs was significantly reduced. Importantly, molecular contacts between the cholesterol ligand and protein were largely preserved for both native AQP1 and its QTY analog. Finally, all AlphaFold 3-predicted structures for AQPs have high confidence values (pLDDT > 90; pTM ~0.83), supporting the reliability of the predicted structures. The findings demonstrate that membrane protein hydrophobicity can be edited and reduced without compromising fold integrity or functional architecture. Integration of the QTY code with AlphaFold 3 affords a high-throughput platform for designing water-soluble, structurally faithful analogs of challenging membrane proteins.
Such a strategy can provide a potent platform for detergent-free biochemical studies and water-soluble analogs for therapeutic monoclonal antibody discoveries, thus advancing research of this pharmacologically important protein family.
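The RMSD figure quoted in this abstract is a simple per-atom distance summary, sketched below. The sketch assumes the two structures are already superposed (the abstract states that superposition was done in PyMOL) and that atoms are paired one to one. The function name and toy coordinates are hypothetical and not taken from the paper.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two paired, pre-superposed coordinate sets.

    Each argument is a sequence of (x, y, z) tuples in angstroms.
    """
    if len(coords_a) != len(coords_b) or len(coords_a) == 0:
        raise ValueError("coordinate sets must be non-empty and of equal length")
    squared = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        squared += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared / len(coords_a))


# Toy example (hypothetical): three C-alpha atoms, analog displaced 0.3 angstroms in y.
native = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
analog = [(0.0, 0.3, 0.0), (1.5, 0.3, 0.0), (3.0, 0.3, 0.0)]
deviation = rmsd(native, analog)
print(round(deviation, 3))
```

This computes the metric only; finding the optimal superposition itself (for example with the Kabsch algorithm) is a separate step not shown here.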
bioinformatics 2026-06-30 v1
Svirlpool: structural variant detection from long read sequencing by local assembly
May, V.; Hartmann, T.; Beule, D.; Holtgrewe, M.
Motivation: Long-Read Sequencing (LRS), and Oxford Nanopore Technologies (ONT) in particular, has greatly improved the detection of structural genome variants (SVs). Fast alignment-based ONT callers achieve strong benchmark performance, but they necessarily reduce the read sequence to alignment-derived signals when deciding whether variants are shared across samples. This can be limiting for cohort and clinical analyses, especially for insertions and repeat regions where sequence representation matters. We present Svirlpool, a multi-sample SV caller for ONT data that builds local consensus assemblies of candidate SV regions and retains the assembled sequence up to the final joint-calling step, where merging tolerances are scaled by a reference-independent noise estimate derived from the reads. Results: We validated Svirlpool on two ONT family datasets: the recent high-quality HG002 Ashkenazi trio and the older Platinum Pedigree family, using the Genome in a Bottle and T2TQ100 benchmarks on the GRCh38, GRCh37, and CHM13v2 references and the Mendelian consistency of native multi-sample calls. We compare against current native joint callers and post-hoc merging workflows. Svirlpool produces highly Mendelian-consistent insertion calls in trio analyses (95.2% on GRCh38 and 95.1% on CHM13v2 at 30x), and on CHM13v2 it reaches the highest insertion and deletion consistency among all tested approaches. Sawfish and Sniffles achieve the highest SV benchmark F1 scores on recent high-quality ONT data, whereas Svirlpool enters the competition with more conservative SV calls. Svirlpool features native, sequence-aware joint calling with retained local consensus sequences and shows a very high Mendelian consistency with sequencing data from different batches and chemistries, which is a common situation in clinical application. Availability and Implementation: Source code, container images, and documentation available at https://github.com/bihealth/svirlpool
bioinformatics 2026-06-29 v3
NPTX2-Centered Cognitive Resilience Mechanisms in the Context of AD Pathology
Lao, Y.; Xiao, M.-F.; Ji, S.; Piras, I. S.; Kim, K.; Bonfitto, A.; Song, S.; Aldabergenova, A.; Sloan, J.; Trejo, A.; Geula, C.; Na, C.-H.; Rogalski, E. J.; Kawas, C. H.; Corrada, M. M.; Serrano, G. E.; Beach, T. G.; Troncoso, J. C.; Huentelman, M. J.; Barnes, C. A.; Worley, P. F.; Colantuoni, C.
Background: Cognitive resilience to Alzheimer's disease (AD) pathology is associated with preserved expression of NPTX2, an activity-regulated synaptic protein involved in circuit plasticity, excitation-inhibition balance, and complement-linked synapse regulation. However, the broader molecular programs coordinated with NPTX2 in resilient individuals remain unclear. Methods: We analyzed postmortem middle temporal gyrus tissue using targeted PRM-MS proteomics in 135 individuals and bulk RNA-seq in an expanded 575-sample cohort. NPTX2-associated molecular coordination was assessed within cognitively normal low-pathology controls (CN-Lo), cognitively normal high-pathology controls (CN-Hi), mild cognitive impairment (MCI), and AD. Correlation-based approaches were applied using NPTX2 protein and NPTX2 mRNA expression as anchors to define resilience mechanisms in CN-Hi subjects. Results: NPTX2 protein abundance was preserved across all controls regardless of age and pathology but reduced in MCI and AD. NPTX2 mRNA expression was also invariant across pathology within controls and reduced in MCI and AD but decreased markedly with age. Targeted proteomics identified NPTX2 relationships with synaptic and inhibitory-circuit proteins that were preserved across control groups, alongside CN-Hi-specific recruitment of trafficking, lysosomal, metabolic, and proteostasis-associated proteins. Transcriptome-wide correlations with NPTX2 revealed differences in gene co-expression between groups, identifying a prominent activity-dependent program including BDNF, VGF, SCG2, SST, SERTM1, DUSP4, and EGR4, that was preserved in both CN-Lo and CN-Hi subjects, while genes recruited to the NPTX2 network specifically in CN-Hi implicated immune, neuroprotective, translation, and proteostasis-related pathways.
Coupling differential gene expression analysis with co-expression, we further identified five candidate resilience genes whose expression and NPTX2 correlation were preserved across controls but lost in MCI and AD: SST, MAL2, TAC1, SERTM1, and RFK. Expression of genes in distinct NPTX2 co-expression classes can be freely explored in our bulk RNA-seq data and other public AD transcriptomic datasets at NeMO Analytics. Conclusion: Findings suggest that cognitive resilience in the context of AD neuropathology engages a coordinated molecular state distinct from both preserved cognition without pathology and MCI/AD, which is organized around preserved and selectively remodeled NPTX2 associations. Rather than reflecting broad transcript abundance changes, resilience was characterized by maintained synaptic and inhibitory programs, and adaptive proteostasis and trafficking pathways that distinguish resilient high-pathology individuals from low-pathology controls or symptomatic AD.
bioinformatics 2026-06-29 v3
SCiMS: Sex Calling in Metagenomic Sequences
Tran, H. N.; Kirven, K. J.; Davenport, E. R.
Background: Host sex is a critical determinant of microbial community structure across many host species, influenced by hormonal profiles, physiology, and sex-stratified behaviors. Despite its importance, sex metadata is frequently missing in microbiome studies, including for animal-associated samples. Host chromosomal sex can be inferred from the host-derived reads present in metagenomic data, but existing genomic sex prediction tools rely on fixed coverage thresholds calibrated for human XY chromosomes and require relatively high numbers of host reads, limiting their use on low host-biomass samples such as stool and on organisms with other sex-determination systems. Results: Here, we present SCiMS (Sex Calling in Metagenomic Sequences), a bioinformatic tool that leverages host-derived DNA within shotgun metagenomic data to predict host chromosomal sex, even at low host coverage. SCiMS uses a multinomial likelihood computed from observed read counts under each sex and reports chromosomal sex calls. Because the expected read distribution is derived directly from chromosome lengths and ploidy under each candidate karyotype, SCiMS applies to any organism with a heterogametic sex-determination system. We benchmarked SCiMS against existing tools on simulated metagenomic data, human metagenomic samples spanning multiple body sites, and metagenomic samples from seven animal species. SCiMS matched or outperformed existing tools, with a noticeable advantage under low host-read conditions. Conclusions: SCiMS provides an accurate, scalable, and cross-species generalizable solution for host chromosomal sex classification, even when host DNA is minimal. By enabling recovery of missing sex metadata, it serves as a quality-control tool for analyses in microbiome research. SCiMS is freely available at http://github.com/davenport-lab/SCiMS.
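The idea of scoring read counts under each candidate karyotype can be sketched as a small multinomial likelihood comparison. This is an illustration of the general technique, not the SCiMS implementation or its API. All function names are hypothetical, the chromosome lengths are rough human values in megabases used purely for illustration, and the probability floor that absorbs mismapped reads is an assumption.

```python
import math

# Approximate human chromosome lengths in Mb (illustrative values only).
LENGTHS = {"autosomes": 2875.0, "X": 156.0, "Y": 57.0}

# Copies of each chromosome class under each candidate karyotype.
KARYOTYPES = {
    "XX": {"autosomes": 2, "X": 2, "Y": 0},
    "XY": {"autosomes": 2, "X": 1, "Y": 1},
}


def expected_fractions(lengths, ploidy):
    """Expected share of reads per chromosome class: length times copy number, normalized."""
    mass = {c: lengths[c] * ploidy[c] for c in lengths}
    total = sum(mass.values())
    return {c: m / total for c, m in mass.items()}


def log_likelihood(counts, fractions, floor=1e-6):
    """Multinomial log-likelihood up to a constant shared by all karyotypes.

    `floor` (an assumption) keeps a few mismapped reads from forcing -inf.
    """
    return sum(n * math.log(max(fractions[c], floor)) for c, n in counts.items())


def call_sex(counts, lengths=LENGTHS, karyotypes=KARYOTYPES):
    """Return the best-scoring karyotype and the per-karyotype scores."""
    scores = {
        name: log_likelihood(counts, expected_fractions(lengths, ploidy))
        for name, ploidy in karyotypes.items()
    }
    return max(scores, key=scores.get), scores


# Toy read counts (hypothetical): about 1,000 host reads per sample.
female_like = {"autosomes": 949, "X": 51, "Y": 0}
male_like = {"autosomes": 965, "X": 26, "Y": 9}
print(call_sex(female_like)[0], call_sex(male_like)[0])
```

Because the expected fractions come only from lengths and copy numbers, swapping in a ZW karyotype table would extend the same comparison to other heterogametic systems, which is the property the abstract highlights.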
bioinformatics 2026-06-29 v2
Anatomy-Guided 3D Graph Networks for Couinaud Segmentation in Tumor Affected Livers
You, L.; Dang, H.; Wang, H.; Matta, E.; Zhou, X.
Image-based liver Couinaud segmentation is designed to automatically provide the locations of suspicious objects in liver CT/MR images. Once achieved, the physicians will be guided to the target slice and area where the suspicious node is located. However, conventional algorithms trained primarily on healthy liver images often fail to generalize to Hepatocellular Carcinoma (HCC) cases due to pathological structural distortions. In this work, we propose a robust two-stage framework that integrates a 3D UNet with a 3D Anatomical Structure-Guided Graph Convolutional Network (3D GCN). This two-stage strategy effectively isolates the liver volume to eliminate structural noise from neighboring organs, such as the spleen, allowing the framework to focus exclusively on the complex 3D anatomical relationships among the eight segments. To ensure the topological consistency required for global spatial reasoning, we implement a standardized preprocessing pipeline that normalizes liver-only volumes to exactly 50 frames along the z-axis. By combining a lightweight 3D UNet backbone with the 3D GCN for refined boundary reasoning, our model demonstrates superior generalization performance on unseen clinical datasets, achieving a mean Dice score of 0.828 in blind testing. By releasing our code and pretrained weights, we aim to provide the first publicly available deep learning resource for robust Couinaud segmentation.
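The preprocessing step that fixes every liver-only volume at 50 frames along the z-axis could look roughly like the sketch below. The abstract does not say how the resampling is performed, so nearest-slice selection is an assumption here (interpolation is an equally plausible choice), and the function name is hypothetical.

```python
def resample_slices(slices, target=50):
    """Pick `target` slices at evenly spaced positions along the z-axis.

    `slices` is any sequence of 2D frames ordered along z. Shorter stacks are
    upsampled by repeating slices, longer stacks are downsampled by skipping.
    """
    n = len(slices)
    if n == 0:
        raise ValueError("empty volume")
    if target < 1:
        raise ValueError("target must be at least 1")
    if target == 1:
        return [slices[n // 2]]
    # Evenly spaced positions from the first to the last slice, rounded to the nearest index.
    indices = [round(k * (n - 1) / (target - 1)) for k in range(target)]
    return [slices[i] for i in indices]


# Toy volumes (hypothetical): integers stand in for 2D frames.
long_volume = list(range(120))
short_volume = list(range(20))
down = resample_slices(long_volume)
up = resample_slices(short_volume)
print(len(down), len(up))
```

A fixed depth gives every case the same number of z positions, which is what lets a graph built over the volume keep a consistent topology across patients.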
bioinformatics 2026-06-29 v2
eRNAformer enables genome-wide de novo mapping of enhancer-derived RNA loci
Yu, H.; Li, W.; Li, W.; Liu, Y.; Chen, Y.; Zhang, X.; He, S.; Chen, Z.; Wang, H.; Ni, J.; Gao, T.; Li, F.; Lu, L.
Enhancer-derived RNAs (eRNAs) are critical regulators of gene transcription, yet their genome-wide annotation remains challenging. Here, we present eRNAformer, a multi-modal deep learning framework that integrates convolutional neural networks with transformers, specifically designed to capture long-range genetic features associated with bidirectional transcription. This approach enables de novo mapping of eRNA loci using DNA sequence and aggregated conventional RNA-seq data. When evaluated on ENCODE datasets, eRNAformer demonstrated high sensitivity and specificity in discriminating known eRNA loci from non-eRNA loci. Notably, the newly identified eRNA loci were enriched with evolutionarily constrained variants and genetic risk factors for complex diseases, and exhibited potential relevance for cancer therapy. Applied to GEO datasets, eRNAformer identified between 14,219 and 56,451 eRNA loci across multiple hematologic malignancies, facilitating the construction of a comprehensive eRNA database for blood cancers. We further identified and experimentally validated FOXO1e, a cluster of eRNAs located approximately 120 kb upstream of FOXO1, a known oncogene that drives the t(8;21) acute myeloid leukemia (AML) preleukemic program. Together, these findings establish eRNAformer as a powerful tool for genome-wide eRNA annotation, provide a valuable resource for eRNA studies in hematologic cancers, and underscore the functional importance of eRNAs in AML pathogenesis.
bioinformatics 2026-06-29 v2
The structural context of mutations in proteins predicts their effect on antibiotic resistance
Green, A. G.; Tasmin, M.; Vargas, R.; Farhat, M. R.
In Mycobacterium tuberculosis, a prevalent and deadly pathogen, resistance to antibiotics evolves primarily through non-synonymous mutations in proteins. Sequence-based analyses are currently used to understand the genetic basis of antibiotic resistance, either via genotype-phenotype association, or via signals of convergent evolution. These methods focus on primary sequence and often neglect other biological signals such as protein structural information. We hypothesize that integrating the structural context of mutations improves the prediction of effects on function and phenotype. We curate high confidence structural annotations for the M. tuberculosis proteome from 1,371 crystallography and 2,316 AlphaFold predictions, and combine the structures with mutations from over 31,000 clinical M. tuberculosis isolates. We demonstrate that mutations in proteins known to cause resistance are clustered in 3D space, even in proteins where inactivating mutations at any position are thought to cause resistance. We develop a statistic to search the M. tuberculosis proteome for signal of clustered mutations, finding over 450 proteins that display this signal, many of which have a known relationship with antibiotic resistance. We show that a supervised classifier trained on 3D distance to known resistance sites alone has an F1 score of 94.6% at classifying mutations as resistance-conferring across proteins. This work demonstrates that protein structure provides useful information for categorizing which variants may cause antibiotic resistance, even when the majority of structures are AI-predicted.
bioinformatics 2026-06-29 v2
EnzyKAN: Protein Language Model Embeddings and Kolmogorov-Arnold Network Variants for Enzyme Commission Classification with a Proposed Electron-Transfer Physics Feature Framework
R, S.; Reddy, B. R. R.
Abstract
Motivation: Computational enzyme classification has previously utilised sequence homology features and protein language model embeddings. The Kolmogorov-Arnold Network (KAN) paradigm, which uses learnable edge functions rather than fixed ones, has shown promising results in biological sequence tasks. Results: We report a fully reproducible investigation of KAN variants for seven-class EC classification on up to 9,516 labelled sequences from the CLEAN benchmark (9,386 for language model experiments). In the sequence-only setting, fixed-basis KAN variants moderately outperformed an MLP baseline (macro F1 = 0.17-0.29). Using ESM-2 650M embeddings greatly improved results under 5-fold cross-validation: MLP macro F1 = 0.750 +/- 0.009, accuracy = 0.823 +/- 0.009; learnable SineKAN macro F1 = 0.716 +/- 0.023, accuracy = 0.788 +/- 0.019. MLP performed comparably but did not exceed conventional baselines. As an aside, we introduce but do not investigate an approach to EC oxidoreductase sub-classification through the use of a Marcus theory-based electron transfer feature framework. Availability: Code and result files are available at https://github.com/sanjuz-cas/ENZYKAN.
bioinformatics 2026-06-29 v1
A hyperbolic topological atlas reveals polyamine steering of a shared developmental manifold in Arabidopsis
Zdrazil, J.; Kong, L.; Flores-Hernandez, E.; Rodriguez Kessler, M.; Klimes, P.; Spichal, L.; De Diego, N.; Snasel, V.
Abstract
High-throughput plant phenotyping captures development at scale, yet image-rich screens are still often reduced to static trait summaries. We tested whether nutrient availability, polyamine priming, concentration, and their transport reshape Arabidopsis rosette development by generating distinct morphologies or by changing residence along a common trajectory. We analyzed 138,223 time-resolved rosette images from Col-0 and five mutants involved in polyamine transport (put1-5), primed with putrescine, spermidine, or spermine across dose and nutrient regimes, using a self-supervised vision backbone, Poincare embedding, hyperbolic Mapper, and manifold straightening. The data form a single connected developmental manifold with 410 nodes and 746 edges, organized from an early, low-nutrient-biased hub through high-betweenness transition corridors to two late, nutrient-enriched terminal regions. Polyamine identity stratifies this manifold by developmental phase: putrescine enriches early states, spermidine occupies transition corridors, and spermine marks late compact rosettes. Nutrient richness and dose change distal occupancy, whereas put genotypes alter dwell time within shared regions rather than producing separate topologies. Manifold straightening resolves these effects into a short early lateral deflection followed by convergence, yielding two scalar readouts, early transverse offset and distal occupancy, that summarize treatment action on a common morphodynamic scale. The framework converts large image screens into interpretable developmental geometry for image-based phenomics.
bioinformatics 2026-06-29 v1
MxSure: a mixture model for inferring within-host substitution rates and transmission SNP thresholds
Khurram, Z.; Chaguza, C.; Kwambana-Adams, B. A.; Shao, Y.; Lawley, T.; Yong, M.; Davies, M. R.; Zarebski, A. E.; Tonkin-Hill, G.
Abstract
Quantifying short-term evolutionary rates of microbial genomes is essential for understanding the processes that shape within-host evolution and for establishing thresholds needed to track transmission. In studies of short-term evolutionary rates, samples are often collected from closely related clusters (e.g. longitudinally from the same host or from transmission pairs), with substantial time intervals separating genomes between clusters. Distinguishing strain replacement from persistence presents an additional challenge in these studies. In addition, many public health and metagenomic bacterial strain tracking pipelines output pairwise SNP distances rather than the multiple sequence alignments required by common substitution rate estimation pipelines. This makes it challenging to estimate within-host evolutionary rates in many commensal bacterial species that are difficult to culture and isolate. To address these challenges, we introduce MxSure, a tool for estimating substitution rates and transmission thresholds while accounting for strain replacement from pairwise SNP distance data, as commonly generated by transmission tracking and metagenomic analysis pipelines. We demonstrate the accuracy of MxSure through extensive simulations and by analysing species with previously estimated substitution rates from longitudinal metagenomic datasets. Using MxSure, we estimated within-host substitution rates and transmission SNP thresholds for multiple commensal bacterial species including Bifidobacterium longum and Bifidobacterium bifidum from a longitudinal study of the infant gut microbiome.
bioinformatics 2026-06-29 v1
Practical Use of Advanced AI Frameworks on Real-Life Scientific Problems: Three Case Studies
Gulluoglu, H. S. A.; Baby, J.; Bagul, K. M.; Basangari, B. R.; Bathini, S. A.; Chalamalla, N. K. R.; Dcunha, J.; Gupta, O.; Huang, L.; Jiang, X.; Naidu, Y. R.; Sathishkumar, G.; Sehrawat, M.; Thota, S. L.; Thuvara, D.; Vanguri, M. B.; Yin, J.; Jugder, B.-E.; Lusky, I. E.; Li, J.; Sinitskiy, A.
Abstract
Agentic artificial intelligence (AI) systems increasingly claim to automate scientific research, yet independent evaluations report persistent gaps between those claims and demonstrated capability. We tested frontier agentic AI systems on three practical problems: prediction of treatment non-response in immune-mediated inflammatory diseases, optical chemical structure recognition for literature mining, and prediction of drug-design-related properties from small datasets. Each problem was first assigned to autonomous frameworks and then reattempted as human-led, AI-assisted work. Autonomous runs failed in most cases, while human-led work produced reusable resources and modest but defensible performance, including new evidence for possible mechanisms of treatment resistance and a more practical benchmark for mining chemical structures from scientific papers. Property prediction was the single task on which one autonomous AI framework matched the human expert. We conclude that current frameworks can carry out engineering and analysis once a human expert leads the project, but cannot yet engineer a novel solution without oversight. The use of AI on real-life scientific problems remains an art rather than a routine technology.
bioinformatics 2026-06-29 v1
Learning Fragmentation Physics or Exploiting Sequence Priors? Benchmarking Bias in Deep Learning Models for De Novo Peptide Sequencing
Li, J.; Rost, H.
Abstract
Deep learning models have advanced de novo peptide sequencing, but their predictions may reflect both physics-based spectral evidence and learned peptide-sequence priors. Systematically measuring such prior-associated behavior is important for benchmarking model robustness beyond conventional proteomics data. Here, we introduce the Prior Bias Index (PBI), a general framework for measuring the extent to which model behavior shifts toward prior-associated reference patterns under controlled conditions, and implement it as DeNovo-PBI, a benchmark for quantifying prior bias in de novo peptide sequencing models. DeNovo-PBI combines benchmark dataset construction, in silico sequence and spectral perturbation workflows, PBI-based metrics, and analysis algorithms to evaluate three forms of prior-associated behavior: sequence-distribution dependence, database amino-acid-pair order preference, and mutation-group prediction consistency under shared sequence context. In addition to experimentally acquired peptide spectra, we generated in silico spectra from random, natural, and mutated peptide sequences and selectively removed fragment ions that distinguish N-terminal residue orders. Across these assays, deep learning models showed peptide-sequence-distribution-dependent performance and strong directional amino-acid-pair order preferences even when order-diagnostic spectral evidence was removed. DeNovo-PBI provides a quantitative benchmark for measuring, comparing, and interpreting learned bias in de novo peptide sequencing models.
bioinformatics 2026-06-29 v1
Confounding effects of inferring gene co-expression networks from pooled data from different biological populations
Runghen, R.; Eliassi-Rad, T.; Bolnick, D. I.
Abstract
Weighted Gene Co-expression Network Analysis (WGCNA) is routinely applied to pooled datasets from multiple biological populations, genotypes, or treatment groups, implicitly assuming a shared module structure across groups. While the distortion of pairwise correlations by pooling heterogeneous groups is well established statistically, three aspects of this problem have received little systematic attention in the context of co-expression network analysis: the extent to which pooling disrupts the discrete module-level community structure inferred by WGCNA; whether this disruption is detectable from the global topology metrics researchers routinely report; and how prevalent the pooling practice is in published multi-group WGCNA studies. Using analytical toy examples and a four-scenario simulation framework, we address all three questions. Module preservation Zsummary scores declined progressively with between-population divergence, from full preservation under identical populations (mean median Zsummary = 25.2 ± 3.3, 95% interval 19.0-30.7 across 20 simulation replicates) to substantial disruption when both network structure and mean expression differed (mean median Zsummary = 11.9 ± 1.0, 95% interval 10.2-13.5). This disruption was undetectable from global topology metrics: modularity and clustering coefficient remained stable across all scenarios, while edge density was sensitive but non-specific. These findings were corroborated in an empirical reanalysis of divergent lake and stream stickleback transcriptomes, where merged analysis collapsed 26 lake-specific and 59 stream-specific modules into only 19 merged modules. A survey of 100 publications found that 78.7% (95% CI 69.4-87.9%) of multi-group WGCNA studies with sufficient methodological reporting used a single merged analysis. Results were robust across network sizes of 250-1,000 genes and rewiring rates of 10-50%.
We provide concrete recommendations including module preservation testing in both directions, population-specific baseline networks, and consensus WGCNA as a principled alternative.
bioinformatics 2026-06-29 v1
Can a Tissue-derived Progression Signature Accurately Predict Colorectal Cancer Stage Transitions in Blood?
Sarkar, P.; Sarkar, P.
Abstract
Colorectal cancer (CRC) is difficult to track because its molecular changes become highly complex as the disease progresses, which complicates robust biomarker discovery. In this study, we developed a machine learning framework by integrating monotonic progression and the StepMiner approach. We conducted external validation to identify reproducible, consistent transcriptomic biomarkers associated with CRC progression. Gene expression datasets from the publicly available GEO repository were analyzed across four disease states: normal colon, adenoma, primary colorectal cancer, and metastasis. First, we identified genes with monotonic expression, then used the StepMiner approach to identify genes that act as 'switches' between stages. A balanced 74-gene signature was used for machine-learning classification with a Random Forest. External validation showed strong performance in tissue-based datasets. However, the tissue-derived signature performed poorly on plasma- and blood-based datasets, highlighting biological differences between transcriptomic profiles. Cross-filtering between tissue-derived genes and blood expression datasets was performed, which resulted in the selection of 62 blood-compatible signature genes. Leakage-free retraining on GSE164191 achieved a mean AUC of 0.868 with balanced precision. Functional enrichment analysis showed that these genes are highly active in cancer growth. Specifically, the genes CBX3, S100A11, PDK4, NCOR1, and SOX4 demonstrated stable and reliable performance across the validation folds. Overall, our study presents a progression-aware transcriptomic framework for CRC biomarker discovery and demonstrates the importance of external validation. Additionally, we evaluate whether tissue-derived signatures can predict blood profiles. This proposed approach may help the future development of tissue-based diagnostics and minimally invasive liquid-biopsy strategies for CRC.
To ensure reproducibility, our proposed workflow was automated as a Nextflow pipeline. The tissue-derived model was deployed as an application utilizing Angular, ASP.NET Core, and Plumber (R).
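The 'switch' gene idea in this abstract rests on fitting a step function to expression values ordered by disease stage. A minimal sketch of a single-step least-squares fit follows. The function name `best_single_step` and the toy values are hypothetical, and this is not the authors' pipeline: the published StepMiner method also assesses the fit statistically, which is omitted here.

```python
import numpy as np

def best_single_step(values):
    """Find the split that best fits a one-step function (least squares).

    values: expression values already ordered by stage or time.
    Returns (split_index, sse, left_mean, right_mean), where samples
    before split_index belong to the first level.
    """
    v = np.asarray(values, dtype=float)
    best = None
    for i in range(1, len(v)):
        left, right = v[:i], v[i:]
        # Sum of squared errors when each side is replaced by its mean
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[1]:
            best = (i, float(sse), float(left.mean()), float(right.mean()))
    return best

# Toy gene: low in the first three samples, high in the last three
expression = [1.0, 1.2, 0.9, 5.1, 4.9, 5.0]
step_index, step_sse, low_level, high_level = best_single_step(expression)
print(step_index, low_level, high_level)
```

A gene whose best step lies at a stage boundary and whose two levels differ strongly would be a candidate switch between those stages under this simplified view.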
bioinformatics 2026-06-29 v1
Placental pathology, circadian biology, and pathogenesis of spontaneous preterm birth: a pilot study of human placental gene expression profiling using a targeted HTG transcriptome panel
Zhou, G.; Hoffmann, H.; Yamamoto, H. S.; Woods, K.; Adkins, M.; Barbieri, R.; Fichorova, R. N.
Abstract
BACKGROUND: Spontaneous preterm birth (sPTB) remains the foremost cause of neonatal morbidity and mortality worldwide. Although histologic chorioamnionitis (HCA) and placental vascular abnormalities are frequently observed in sPTB, the molecular cascades linking these lesions to labor initiation remain poorly understood. Emerging evidence implicates circadian dysregulation and trophoblast dysfunction as additional drivers of sPTB. OBJECTIVE: This study aims to map placental pathology to distinct transcriptomic functional signatures that may precipitate sPTB, delineate the contribution of circadian regulation - both core-clock genes and circadian transcription-factor target sets (TFTs) - to sPTB, and identify placental cell-type-enriched and developmental pathway signatures that differ between sPTB and term deliveries. STUDY DESIGN: We performed bulk RNA sequencing on 32 formalin-fixed, paraffin-embedded placental specimens from 12 selected women (9 sPTB and 3 Term) in the POUCH Study cohort. Samples were selected for white ethnicity, maternal age 23-33 years, and parity 1-4 to reduce heterogeneity within groups. An extraction-free HTG transcriptome panel assayed 19,398 protein-coding genes. Log2-fold changes of all genes were computed with limma adjusted for maternal age, gestational age, parity, placental region, placental pathology, and POUCHID (a clustering variable) for sPTB vs. Term and HCA/vascular lesion vs. no pathology (no placental pathology adjustment). Gene-set enrichment used 50 Hallmark sets (MSigDB) plus curated placental circadian, circadian TFT, cell-type, and developmental pathways or gene sets. RESULTS: sPTB placentas displayed a global suppression of metabolic, secretory, and immune pathways (e.g., protein secretion, oxidative phosphorylation, Interferon responses, Complement, ROS, MYC Targets, TGF-β, mTORC1, and Coagulation) while KRAS Signaling Down and EMT were up-regulated.
HCA-enriched sets (TNF/NF-κB, ROS, KRAS Up, IL-2/STAT5, Hypoxia, Interferon-γ) were up-regulated, with EMT and Notch remaining down. Vascular abnormalities alone showed up-regulation of 12 Hallmark sets - including TGF-β, TNF/NF-κB, ROS, pancreatic β-cell stress, Hypoxia, Oxidative Phosphorylation, EMT, and mTORC1 - while Notch was down-regulated. When HCA co-exists with vascular abnormalities, the Hallmark profile becomes more inflammatory, highlighting a synergistic exacerbation of innate immunity, oxidative stress, and programmed cell death with the 12 up-regulated sets (Complement, Interferon /γ, TNF, ROS, Apoptosis, and Heme Metabolism). The exclusive downregulation of DNA Repair suggests compromised genomic integrity. Circadian gene-set analysis revealed an up-regulated Regulation of Circadian Sleep Wake Cycle in sPTB but down-regulation of the core clock pathway and suppressed circadian TF targets. Cell-type enrichment reveals increased trophoblast giant cells and IGFBP1-DKK1 positive fetal cells, with marked suppression of extravillous trophoblasts, syncytiotrophoblasts, villous cytotrophoblasts, and fetal myeloid cells. Placental developmental pathways were downregulated, indicating arrested trophoblast maturation. CONCLUSION: Our pilot analysis demonstrates that sPTB placentas exhibit a global suppression of metabolic, secretory, and immune-modulatory programs and maladaptive trophoblast remodeling, whereas HCA and vascular abnormalities drove distinct inflammatory or hypoxic signatures. The shared and opposing Hallmark pathways across phenotypes highlight distinct yet overlapping pathogenic mechanisms. Dysregulated circadian pathways, consistently downregulated transcription factor target gene sets, and trophoblast-specific signatures implicate circadian misalignment and impaired placental maturation as key contributors to preterm parturition.
These findings provide a mechanistic atlas linking placental pathology to sPTB and highlight potential targets for chronotherapeutic and cell type specific interventions.
bioinformatics 2026-06-29 v1
Metrics for Distinguishing Biological and Interventional Change in AI Models
Ewing, M. A.
Abstract
Statistical and machine-learning models of longitudinal biological data evaluate change by comparing each new observation against the trajectory implied by prior observations, assuming the process generating that trajectory is stable. We use data substrate to mean the underlying structure of the longitudinal data that determines what any such model can recover, independent of its architecture or capacity. When the generating process changes, whether through a biological transition or through an external intervention, the prior trajectory ceases to be a valid reference, and extrapolated predictions can be confidently wrong with no internal signal that the reference has failed. A distinct and recognised difficulty is that biological change and interventional change, observed only through serial intertemporal comparison under an assumed trajectory, are readily conflated; existing approaches address this through causal assumptions or hidden-confounder models rather than from the data substrate itself. Here we ask whether the two can be distinguished at the substrate level, and we introduce two subject-level metrics that quantify the geometric signature an interventional change leaves in the data: Curvature Shift, the change in trajectory slope across the event, and Deformation Risk, the departure of post-event observations from the prior-trajectory reference. We evaluate the condition on longitudinal cognitive measurements from 309 human subjects in the Alzheimer's Disease Neuroimaging Initiative (ADNI), a large longitudinal dataset containing two distinct, ex-ante-defined regime-change events in the same subjects: a biological transition and an intervention.
A model extrapolating the pre-event trajectory assigned the wrong direction of change to roughly two-thirds of post-event observations (post-event sign accuracy 0.341 after the biological event and 0.350 after the intervention, against a chance value of 0.50); only 11% of post-biological-event and 12% of post-intervention readings remained concordant with prior dynamics, and a higher-capacity multilayer perceptron reproduced rather than resolved the error. Curvature Shift was 2.23-fold higher after the biological event (p = 4.4e-8) and 2.26-fold higher after the intervention (p = 7.4e-8), and the two metrics were coupled (rho = 0.500; 95% CI, 0.407 to 0.587). Findings replicated on an independent endpoint and survived propensity matching, permutation, and leave-one-out. The metrics detect, per subject, when the reference of a fitted model has stopped governing the data and whether the departure carries the geometric signature of an interventional change.
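To make the two metrics concrete, here is a minimal sketch under stated assumptions. The abstract defines Curvature Shift as the change in trajectory slope across the event and Deformation Risk as the departure of post-event observations from the prior-trajectory reference, but it gives no formulas. The linear fits, the absolute slope difference, and the mean absolute residual used below are therefore illustrative choices, and `trajectory_metrics` is a hypothetical name.

```python
import numpy as np

def trajectory_metrics(times, values, event_time):
    """Illustrative per-subject metrics around a single event.

    Assumptions (not from the source): straight-line fits before and
    after the event, slope change taken as an absolute difference, and
    departure measured as mean absolute residual from the extrapolated
    pre-event line.
    """
    t = np.asarray(times, dtype=float)
    y = np.asarray(values, dtype=float)
    pre = t < event_time
    post = ~pre
    slope_pre, intercept_pre = np.polyfit(t[pre], y[pre], 1)
    slope_post, _ = np.polyfit(t[post], y[post], 1)
    curvature_shift = abs(slope_post - slope_pre)
    # Extrapolate the pre-event trajectory into the post-event window
    reference = slope_pre * t[post] + intercept_pre
    deformation_risk = float(np.mean(np.abs(y[post] - reference)))
    return float(curvature_shift), deformation_risk

# Toy subject: score rises by 1 per visit, then falls by 1 per visit after the event
visit_times = [0, 1, 2, 3, 4, 5]
scores = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0]
shift, risk = trajectory_metrics(visit_times, scores, event_time=3.5)
print(shift, risk)
```

In this toy case the pre-event line keeps predicting an increase after the event, which is the kind of wrong-direction extrapolation the abstract reports.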
bioinformatics 2026-06-29 v1
Single-cell transcriptomics reveals chondrocyte state transitions and ECM remodeling in osteoarthritic knee cartilage
Bo, Z.; Xu, H.; Liang, Y.
Abstract
Osteoarthritis cartilage has heterogeneous chondrocyte states, yet their transitions remain unresolved from public single-cell data. We retrospectively reanalyzed a public knee cartilage single-cell RNA-seq dataset GSE255460 from 8 osteoarthritis and 3 non-osteoarthritis donors totaling 19 samples. After sample-wise quality control and doublet removal we performed batch-corrected clustering, chondrocyte subclustering with marker-based annotation, and trajectory inference using Slingshot. Regulatory chondrocytes were tested for osteoarthritis versus control differential expression, followed by Gene Ontology and KEGG enrichment with Benjamini-Hochberg false discovery rate <0.05, and protein-protein interaction hub screening. We retained 27,036 cells. Chondrocytes exhibited branching continuous states; regulatory cells localized near the main manifold and adjacent to inferred branches, suggesting a transition-adjacent state. In regulatory cells, osteoarthritis-upregulated genes were enriched for collagen-containing extracellular matrix organization, endoplasmic reticulum secretory/proteostasis, cell-matrix adhesion including focal adhesion, and TGFbeta/SMAD signaling. Protein-protein interaction analysis identified five high-connectivity hubs: COL5A1, COL5A2, COL6A1, COL1A2, and COL3A1. Our findings support a transition-adjacent regulatory program in OA with coordinated extracellular matrix remodeling and secretory/adhesion/TGFbeta signatures, nominating collagen hubs for validation.
bioinformatics 2026-06-29 v1
Context-dependent correlations mislead transcriptomic network inference in bulk and single-cell data
Asiaee, A.; Bombina, P.; McGee, R. L.; Reed, J.; Abrams, Z. B.; Abruzzo, L. V.; Coombes, K. R.
Abstract
Background. Correlation is the dominant input to co-expression module discovery and miRNA-target inference. Both rely on an implicit assumption: a Pearson coefficient pooled across heterogeneous samples, whether tissues, cancer types, or cell types, estimates one biologically meaningful quantity. Simpson's paradox makes this assumption fragile in principle, since between-group mean shifts can dominate or reverse within-group associations. How often this happens in real transcriptomic data has not been quantified. Results. Across 8,890 TCGA tumors from 31 cancer cohorts and 23,170,038 miRNA-mRNA pairs, 94.8% of pairs showed both positive and negative within-cohort correlations. Restricting to the high-variance domain of one million pairs, 13.3% of pooled correlations with |r_global| >= 0.2 reversed against the within-cohort majority at sign tolerance epsilon = 0.05. Heterogeneity was the rule rather than the exception (median I2 = 0.86, IQR 0.80-0.90), and 99.5% of pairs rejected equal correlation across cohorts at FDR < 0.05. Of 692,770 experimentally validated miRTarBase v10 targets measurable in our data, only 0.9% were uniformly negative across cohorts. The pattern recurred across modalities. In GTEx, 21.0% of pooled signs disagreed with the tissue majority, and 23.5% of pairs flipped sign after tissue-mean removal. In 10x PBMC scRNA-seq, 13.1% of gene-gene correlations flipped after cell-type-mean removal; in CITE-seq, 37.9% of protein-RNA pairs flipped under a joint WNN partition of cells. Refining context reduced reversal, though by how much depended on the partition: within BRCA, 5.5% of pairs reversed under molecular PAM50 subtypes versus 0.35% under clinical IHC receptor status, and refining T cells into transcriptome-defined subtypes cut PBMC reversal from 11.8% to 0.13%. Conclusions. A single pooled correlation coefficient can invert direction relative to its within-context constituents at rates that are not negligible. 
Correlations should be reported with their context: the within-context distribution, a heterogeneity statistic, and a diagnostic that separates between-context mean shifts from within-context association. We provide a small R interface that computes these summaries.
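The sign reversal described in this abstract can be reproduced on toy data. The sketch below compares a pooled Pearson correlation with the correlation obtained after removing per-group means, which is one simple way to separate between-context mean shifts from within-context association. It is an illustration only: `pooled_and_within_correlation` is a hypothetical name, the data are invented, and this is not the authors' R interface.

```python
import numpy as np

def pooled_and_within_correlation(x, y, groups):
    """Return (pooled r, r after subtracting each group's mean from x and y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    groups = np.asarray(groups)
    pooled = np.corrcoef(x, y)[0, 1]
    x_centered, y_centered = x.copy(), y.copy()
    for g in np.unique(groups):
        mask = groups == g
        x_centered[mask] -= x[mask].mean()
        y_centered[mask] -= y[mask].mean()
    within = np.corrcoef(x_centered, y_centered)[0, 1]
    return float(pooled), float(within)

# Toy data: within each group y falls as x rises, but both group means shift upward
x = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
y = [2.0, 1.0, 0.0, 12.0, 11.0, 10.0]
groups = ["A", "A", "A", "B", "B", "B"]
pooled_r, within_r = pooled_and_within_correlation(x, y, groups)
print(round(pooled_r, 3), round(within_r, 3))
```

Here the pooled coefficient is strongly positive while the within-group association is perfectly negative, so reporting only the pooled value would give the wrong direction.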
bioinformatics 2026-06-29 v1
Retention, not flux: endpoint confounding caps computational prediction of peptide skin penetration, with a delivery-aware reframing
Komianos, N.; Prakash, P.
Abstract
Bioactive peptides are now central to cosmetic and dermatological actives, yet predicting whether a given sequence will reach its site of action in skin remains unsolved. We contend that the dominant framing, predicting a single binary "skin permeability" label from sequence, is ill-posed, and that this, rather than a shortage of modelling power, explains the field's stalled predictive performance. The scope of the claim is narrow: barrier-crossing propensity is a legitimate, learnable function of molecular structure, whereas the vehicle- and endpoint-agnostic binary label that the literature supplies is not. We support this with a first-principles analysis and a study of public-source data. First, the experimental endpoint most commonly reported, transdermal flux into a diffusion-cell receptor compartment (OECD Test Guideline 428), conflates two opposite outcomes (genuine deep delivery and undesired systemic transport) and is, for a cosmetic active, frequently a failure signal rather than a success signal. That receptor flux is an imperfect measure of cutaneous bioavailability is long established in dermatopharmacokinetics; our contribution is to show that the same confound, inherited through scraped labels, is what caps machine learning from sequence. Second, reported "permeability" is a property of the sequence x delivery-vehicle x measurement-compartment triad, two terms of which are usually unrecorded. Third, on public-source data, a physicochemical intrinsic-permeability estimate (Potts-Guy) carries no positive predictive signal for scraped penetration labels (grouped AUC 0.45, 95% CI 0.40-0.51); sequence-only classifiers plateau in the mid-0.70s with diminishing returns as labels accumulate (AUC 0.70-0.77); and the same descriptor pipeline on a clean single-endpoint membrane dataset scores materially higher (AUC 0.83, non-overlapping CI). 
Our proposed reframing separates barrier-crossing (data-driven, sequence-level) from depth-and-retention (physics-driven, delivery-aware) and treats intrinsic transdermal flux as a regulatory risk axis; we close by proposing a triad-annotated reporting schema and a seed benchmark.
bioinformatics 2026-06-29 v1
Causally measuring aging and rejuvenation through transcriptomic damage
Zhang, S.; Iqbal, S.; Tyshkovskiy, A.; Gladyshev, V. N.
Abstract
Aging is caused, fully or in large part, by the progressive accumulation of damage, yet quantifying age-related damage across tissues and conditions remains a challenge. Here, we present a computational framework to quantify damage from standard RNA-sequencing data. It captures four classes of aberrant transcript structures, including premature termination upon intron retention, domain-disrupting splice variants, repeat elements, and gene fusion events, each reflecting distinct forms of RNA integrity loss. Using this method, we revealed a robust age-associated increase in transcriptomic damage across tissues. To integrate these measurements into a unified biomarker, we constructed a transcriptomic damage-based aging (tDamAge) clock using machine learning models trained across mouse tissues or human peripheral blood. It could predict age and detect transcriptomic shifts under both pro-aging and anti-aging conditions. Progeroid models exhibited accelerated tDamAge, whereas interventions such as caloric restriction, rapamycin, and methionine restriction lowered tDamAge. Cross-dataset analysis showed that diverse anti-aging interventions converge on shared transcriptomic signatures, particularly RNA processing and chromatin organization pathways, and these age-associated patterns could be reversed by interventions. We further identified elevated damage age acceleration in Alzheimer's disease and observed rejuvenation-like reductions during embryonic development. Together, our findings establish transcriptomic damage as a causal, quantifiable and biologically interpretable feature of aging and demonstrate that tDamAge could detect age progression, acceleration, deceleration, and reversal.
bioinformatics 2026-06-29 v1
SentryPath: a mechanistic protocol-ranking simulator with leave-one-trial-out cross-validation across 13 phase-III oncology randomised controlled trials and a pre-registered prospective forecast
Kumar, M. D.; Kumar, M.
Abstract
Background. Pivotal oncology trials cost a median of ~$19 million each (oncology often $45 million or more) and contribute to a capitalised cost of ~$2.6 billion per approved drug, yet most candidate protocols never reach trial. Existing in-silico screening tools either rely on closed proprietary PK/PD modelling or require patient-level data; a transparent, cohort-level, cross-validated mechanistic alternative is missing. Methods. SentryPath is a physics-based stochastic differential equation simulator built on a Gompertzian tumour-growth term with Emax pharmacodynamic kill and Bliss-independence combination modelling, scored at the cohort level. Validation against 13 published phase-3 randomised controlled trials covering six cancer types uses the 2-year overall-survival (OS) rate ratio as the primary endpoint, cross-checked against ClinicalTrials.gov posted results. For cancer types with >=2 trials we apply leave-one-trial-out cross-validation: two shared efficacy scalars per cancer type are fit on training trials and used to predict the held-out trial cold. Results. With the per-drug efficacy proxies held fixed from the literature, two shared cancer-type scalars fit on the training trials transfer to the held-out trial with a mean held-out error of 3.7 % (range 0.7-7.3 %) on 2-year OS rate ratios across three NSCLC trials; extending the same method to RCC, HCC, and ESCC yields a 5.4 % aggregate across nine folds (per-fold range 0.2-11.2 %), reported with per-cancer stratification. We are explicit that only the two scalars are held out - the per-protocol efficacy proxies underneath are literature-anchored to drug classes that include the benchmark trials, so this is a test of scalar transfer, not of the whole engine cold. 
Cross-validation improves on the same engine without it (16.4 % with production cancer priors; 21.9 % with no efficacy modifiers); a matched in-sample fit of the same two-scalar model gives 4.4 %, slightly below the 5.4 % held-out, the expected direction. Two prospective forecasts are pre-registered on the Open Science Framework with falsification envelopes and pre-readout bias disclosure. The first forecast (NCT04770896) reaches its primary data cutoff on 2026-06-30; the observed outcome and its mapping to the pre-committed interpretation will be reported in a versioned update to this preprint. Conclusion. A transparent mechanistic simulator, with a literature-anchored efficacy library and only two cross-validated scalars per cancer type, transfers those scalars across held-out NSCLC trials at 3.7 % mean error (range 0.7-7.3 %) and extends to other cancers with documented per-cancer stratification. The validation is pilot-scale (3-9 folds) and the scalars sit on a fixed, trial-informed substrate; its distinguishing contribution is less the error magnitude than the public predict-verify-disclose cycle that goes beyond retrospective fit. Keywords: mechanistic simulation; oncology; clinical trial design; cross-validation; pre-registration; protocol prioritisation.
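The three model components named in this abstract (Gompertzian growth, Emax pharmacodynamics, Bliss independence) have standard textbook forms, sketched below. This is a simplified illustration, not SentryPath itself: it is deterministic rather than a stochastic differential equation, the way the kill term enters the growth equation (a first-order term scaled by the combined effect) is an assumption, and all function names and parameter values are hypothetical.

```python
import math

def emax_effect(conc, emax, ec50):
    """Emax model: fractional effect at drug concentration `conc`."""
    return emax * conc / (ec50 + conc)

def bliss_combination(effect_a, effect_b):
    """Bliss independence: combined fractional effect of two independent drugs."""
    return effect_a + effect_b - effect_a * effect_b

def simulate_gompertz(v0, growth_rate, capacity, kill_rate, days, dt=0.01):
    """Forward-Euler integration of dV/dt = r*V*ln(K/V) - kill_rate*V.

    The first-order kill term is an assumption made for this sketch.
    """
    v = v0
    for _ in range(int(round(days / dt))):
        dv = growth_rate * v * math.log(capacity / v) - kill_rate * v
        v = max(v + dv * dt, 1e-12)  # keep volume positive for the logarithm
    return v

# Hypothetical parameters, chosen only for illustration
effect_a = emax_effect(conc=1.0, emax=0.8, ec50=1.0)
effect_b = emax_effect(conc=2.0, emax=0.75, ec50=1.0)
combined = bliss_combination(effect_a, effect_b)
untreated = simulate_gompertz(1.0, 0.1, 100.0, kill_rate=0.0, days=100)
treated = simulate_gompertz(1.0, 0.1, 100.0, kill_rate=0.1 * combined, days=100)
print(combined, untreated, treated)
```

Under these assumptions the untreated tumour approaches the carrying capacity, while the treated tumour settles at a lower volume; a cohort-level simulator would repeat such runs over sampled parameters and summarise survival-related outcomes.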
bioinformatics 2026-06-29 v1
G-LATO: Inference of Spatial Latent Ordering via Deep Gaussian Processes
Zago, M.; Mukherjee, S.; Schleicher, J. T.; Bürkner, P.; Tabatabai, G.; Claassen, M.
Abstract
Spatial transcriptomics enables the study of cells within their native tissue context, yet identifying gradients of cellular development remains challenging. We introduce a deep Gaussian process model to address this gap. Our method recovers spatially smooth gradients explaining observed gene expression. We illustrate our method on healthy liver and glioblastoma data in reconstructing known spatial organisation and uncovering new pathological gradients, thus providing robust inference for spatial biology.
bioinformatics2026-06-29v1Lineage-aware stochastic modeling reveals gene-expression dynamics in development and disease
Xing, J.; Staklinski, S. J.; Liu, Z.; Nowak, D.; Siepel, A.Abstract
Gene expression evolves dynamically along cell lineages, yet most analysis methods treat single-cell RNA-seq (scRNA-seq) data as static snapshots and fail to exploit phylogenetic relationships among cells. Recent advances in cell-lineage tracing now enable the reconstruction of high-resolution lineage phylogenies, providing a natural framework for identifying when and where transcriptional changes arise during development, differentiation, and disease progression. Some models of gene expression have begun to consider phylogenetic structure, but they generally rely on imprecise Gaussian assumptions, focus on endpoint-level comparisons, or fail to consider sparse and overdispersed scRNA-seq read counts. Here, we present LaVOUS (Lineage-aware Variational Ornstein-Uhlenbeck Single-cell RNA-seq analysis), a probabilistic framework that couples lineage-based models of latent dynamics derived from the Brownian motion and Ornstein-Uhlenbeck stochastic processes with negative-binomial observation models and scalable variational inference. LaVOUS enables likelihood-based tests for cellular heritability and branch-specific shifts in gene expression, as well as phylogenetic reconstruction of latent expression histories. In simulations, LaVOUS outperformed a Gaussian method in detecting lineage-associated expression changes and produced accurate reconstructions of expression histories across expression levels. We additionally applied LaVOUS to paired single-cell lineage and transcriptomic data from metastatic lung cancer, class-switching B cells, and the developing brain. Across these settings, LaVOUS identified lineage-associated expression changes related to metastatic progression, B-cell isotype switching, and dopaminergic and glutamatergic neuron differentiation. 
By providing an expressive framework for modeling sparse count data on lineage trees, LaVOUS establishes a foundation for studying single-cell expression dynamics across developmental and disease contexts, with natural extensions to multi-gene regulation, lineage uncertainty, and multi-modal integration.
bioinformatics2026-06-28v1Short-Read Sequencing Benchmarking with Donor-Specific Assemblies
McGee, S. R.; Smith, J. D.; Frazar, C. D.; Ryke, E.; Vollger, M. R.; Kwon, Y.; Bennett, J. T.; Eichler, E. E.; Stergachis, A.; Wei, C.-L.Abstract
Background: High-throughput short-read sequencing has become a core technology for genomics, but the rapid expansion of available platforms has made it increasingly important to benchmark them under standardized conditions. A major challenge is that conventional reference-based comparisons confound true sequencing errors with inherited variation and reference bias, making it difficult to isolate platform-intrinsic performance. Results: We benchmarked nine short-read chemistries across seven DNA sequencers using two highly characterized benchmark samples, HG002 and COLO829BL, together with donor-specific assemblies to measure sequencing errors against sample-matched genomic references. This strategy separated authentic platform errors from biological divergence and revealed substantial differences in substitution, indel, read-position, and sequence-context error profiles. Element AVITI UltraQ and Roche SBX-D showed the lowest substitution error rates, whereas Ultima and Roche chemistries exhibited the strongest indel-associated biases. We also found pronounced platform-specific effects in low-complexity regions and trinucleotide contexts, including homopolymer-associated errors and context-dependent substitution skews that are directly relevant to rare-variant detection. In addition, we show that donor-specific references are essential for unbiased base-quality recalibration because they minimize reference bias and more faithfully support cross-platform comparison and low-frequency variant-calling thresholds. Conclusions: Donor-specific assembly-based benchmarking provides a robust framework for measuring true short-read sequencing errors and comparing platforms on a common, sample-matched basis. Our results establish a comprehensive reference for the community and show that authentic error profiles can guide platform selection, quality filtering, and improved detection of rare somatic variation.
bioinformatics2026-06-28v1LOESS and DE-SWAN can induce artifactual "waves" of molecular aging
Carbonneau, M.; Shutta, K. H.; Miller, J.; Shen, X.; Snyder, M.; Quackenbush, J.Abstract
A growing literature has investigated the relationship between age and biomolecular changes, leading to conclusions that aging occurs in discrete molecular "waves." Data summary tools such as LOESS and sliding window analyses like DE-SWAN are common approaches that have gained acceptance in recent years. We demonstrate via simple simulations that these tools can identify non-linear patterns of aging where they do not exist. Specifically, we show that (i) clustering of molecular trajectories using LOESS can lead to artifactual characteristic patterns of molecular aging, (ii) "waves" of aging identified using the combination of LOESS and DE-SWAN in real data are not robust to changes in the underlying age distribution and are not supported by valid permutation testing, and (iii) DE-SWAN alone can generate pronounced "waves" of nonlinear molecular aging in linear data due to differences in statistical power along the age continuum. Our results specifically challenge the statistical support for discrete aging crests inferred in the literature, but do not rule out nonlinear molecular aging or age-associated transitions that may be detectable using other cohorts and statistical models.
bioinformatics2026-06-28v1Client-server interfaces enable efficient agent-driven variant calling
Yu, X.; Zheng, Z.; Chen, L.; Qin, Z.; Guo, X.; He, M.; Luo, R.Abstract
Background: Large language model (LLM) agents increasingly automate bioinformatics analyses, but most existing bioinformatics tools were built for standalone use by human experts. An agent driving such a tool must reason about its installation, configuration, and execution from documentation written for humans, spending many turns, tokens, and tool calls per result. How a method is exposed to an agent can therefore matter as much as the method itself. Designing agentic interfaces for these tools can reduce such overhead and improve the reliability of agent-driven analyses. Findings: To test this design, we re-architected Clair3, a widely used deep-learning-based long-read variant caller, into a client-server system, Clair3-Connect. The client performs all genomics-related processing and holds the identifiable data. The server runs only neural-network inference, and the client sends only feature tensors to the server, while sample identifiers and genomic context remain on the client. The client exposes schema-defined agent-facing tools that an agent invokes through single structured calls. On an APOE diplotyping task, all 60 agent runs were correct. The agentic tools used 12K tokens in 3 turns, 6.8 to 14 times fewer tokens than the shell-driven baselines (81K-163K tokens), at about a quarter the wall-clock time and far more stably (4% versus 35% token usage variation). Dropping the pileup and phasing stages to keep the client light left SNP F1 within 0.1-0.3 points of standard Clair3 by 50x coverage, while mutual TLS and AES-256-GCM encryption added 7.2% to end-to-end runtime. Conclusions: Recasting an established algorithm as developer-built, agentic tools behind a secure client-server boundary makes it more efficient, reliable, and easier to deploy for an LLM agent than a third-party wrapper, which cannot recover the defaults and conventions only its developers know. Agentic interfaces should be a first-class deliverable of bioinformatics tool development.
bioinformatics2026-06-28v1PARROT: Phase-Altering Regulatory Rewiring Over Time
Chen, C.; Padi, M.; Quackenbush, J.Abstract
Motivation: Gene regulatory networks undergo dynamic restructuring during development and disease. Identifying when and how these networks change is crucial for understanding developmental and disease transitions, yet existing change-point detection methods often ignore network structure or lack interpretable community assignments. Results: We present PARROT (Phase-Altering Regulatory Rewiring Over Time), a framework for detecting change-points in dynamic networks using Stochastic Block Models. PARROT jointly estimates change-point locations and community structure across four network classes: unipartite and bipartite with either Gaussian or Bernoulli edge models. Simulations demonstrate improved performance and community recovery compared to other methods. Applications to human cardiac differentiation and mouse lung development data successfully recovered known phase boundaries. PARROT identifies both which genes are reassigned across modules and how the connections change between states.
bioinformatics2026-06-28v1Spatial co-expression and cell-cell communication inference from spatially resolved transcriptomics with CONCISE
Zhao, J.; Shan, X.; Wang, G.; Chu, T.; Lin, C.; Chang, R.; Zhao, H.Abstract
Cell-cell communication is fundamental to tissue organization, homeostasis, and disease progression. Recent advances in spatial transcriptomics provide unprecedented opportunities to systematically characterize ligand-receptor interactions directly within intact tissues. However, robust inference of spatial ligand-receptor interactions remains challenging because intrinsic features of spatial transcriptomics data, including spatial autocorrelation, variation in total molecular counts, and measurement errors, can induce spurious spatial co-expression and lead to inflated false-positive results. Most existing methods do not adequately account for these confounding factors, limiting the reliability of inferred cellular communication. Here, we present CONCISE, a statistical method for spatially constrained co-expression and ligand-receptor interaction inference that jointly models spatial autocorrelation, variation in total molecular counts, measurement errors, and spatial proximity constraints. CONCISE combines efficient moment-based parameter estimation with analytical hypothesis testing, enabling fast and statistically rigorous inference without restrictive distributional assumptions. Through extensive simulations, real-data permutation experiments, and biologically motivated negative-control analyses across different spatial transcriptomics platforms, we show that most existing methods presented inflated false-positive rates, whereas CONCISE achieved well-calibrated inference, robust false-positive control, and improved detection power. Application of CONCISE to high-resolution MERFISH and CosMx datasets from intestinal inflammation and non-small cell lung cancer further highlights its biological utility in disease contexts. CONCISE uncovered inflammation-associated fibroblast-specific interactions during intestinal inflammation and delineated complex tumor-immune and tumor-stromal signaling networks within the tumor microenvironment.
bioinformatics2026-06-28v1Development of Deep-Learning Models that Predict Quantitative Protein-Ligand Interactions in Glycobiology as a part of a Capstone Course
Yin, H.; Liu, W.; Zhou, W.; Chang, Z.; Carpenter, E. J.; Satyajith, A.; Haregu, S.; Greiner, R.; Derda, R.Abstract
Glycans coat the surface of all cells, and every glycan is recognised by specific glycan-binding proteins (GBPs). There are no general tools that can accurately estimate the binding strength between glycan and GBP from the amino acid sequence of the GBP and the molecular structure of the glycan, represented as a SMILES string. We describe models for predicting such binding strengths developed as a part of a Capstone Course at the University of Alberta. The models are trained on a dataset that combines BindingDB, a published database of small-molecule-protein interactions, and data from glycan arrays measured by the Consortium for Functional Glycomics (CFG). In this hybrid dataset of protein-ligand interactions the ligands are both glycans from CFG and small molecules from BindingDB; similarly, proteins include GBPs and proteins from BindingDB. Three models are presented: (i) ProMax, which fuses ESM-2, MolFormer, and MolCLR features; (ii) APEX, which constrains learning to a predetermined form, a physical model of binding; and (iii) UltraMax, which adds inter-atomic distances for the ligands. To address the dataset's severe long-tail distribution, the models employ tail-aware losses for rare high-binding instances. Trained and evaluated on approximately one million protein-ligand pairs using hold-out splits for unseen molecules, the three models provide a unified framework for quantitative glycan-protein binding prediction. We observed that learning glycan-protein binding is harder than the similar task of learning small-molecule-protein interactions. Simple mirror-inversion tests led us to postulate that insufficient use of chiral features is an important source of difficulty in learning these interactions.
bioinformatics2026-06-26v2CoLa-VAE: A Cell-Cell Communication-Aware Variational Autoencoder for Representation Learning and Expression Denoising
Chen, Y.; Qi, C.; Fang, H.; Luan, F.; Zhang, Z.; Arya, S.; Wei, Z.Abstract
Single-cell RNA sequencing provides a powerful view of cellular heterogeneity, but its sparsity and dropout noise remain major obstacles for recovering biologically meaningful gene expression programs and for downstream analyses that depend on reliable expression measurements. Ligand-receptor-based cell-cell communication inference is one such analysis: missing ligand or receptor expression can cause substantial false negatives in sparse single-cell data. Here, we present CoLa-VAE, a cell-cell communication-aware variational autoencoder that jointly learns latent representations and denoised expression profiles by incorporating ligand-receptor-derived communication topology through dynamic graph Laplacian regularization. Rather than treating denoising as a secondary output of representation learning, CoLa-VAE uses denoised expression to iteratively refine communication estimates and uses the resulting communication structure to guide both latent organization and expression reconstruction. In addition to improving latent space organization and producing robust denoised expression matrices, CoLa-VAE-denoised matrices also improved downstream biological analyses, including the detection of robust differential cell-cell communication programs, mitigation of batch-associated variation and enhanced spatial transcriptomic deconvolution when spatially constrained communication structure was incorporated. Together, these results establish CoLa-VAE as a communication-guided denoising and representation learning framework that recovers biologically meaningful expression signals from sparse single-cell and spatial transcriptomic data, enabling more sensitive and reliable downstream analysis.
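Graph Laplacian regularization, as named in the abstract, penalises latent embeddings that differ between strongly connected cells. The sketch below shows only the generic penalty term, written in pure Python. It is not CoLa-VAE's implementation: the function name, the list-of-lists inputs, and the assumption of a symmetric communication-weight matrix are illustrative, and the abstract does not say how the weights are built or updated.

```python
def laplacian_penalty(embeddings, weights):
    """Return 0.5 * sum_ij W[i][j] * ||z_i - z_j||^2.

    For a symmetric weight matrix W this equals tr(Z^T L Z) with L = D - W,
    where D is the diagonal degree matrix. embeddings is a list of latent
    vectors (one per cell); weights is a hypothetical cell-by-cell
    communication-strength matrix.
    """
    n = len(embeddings)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if weights[i][j] == 0.0:
                continue  # unconnected cells contribute nothing
            sq_dist = sum((a - b) ** 2 for a, b in zip(embeddings[i], embeddings[j]))
            total += weights[i][j] * sq_dist
    return 0.5 * total


# Toy example: three cells in a 2-D latent space, cell 0 linked to cells 1 and 2.
z = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
w = [[0.0, 1.0, 0.5],
     [1.0, 0.0, 0.0],
     [0.5, 0.0, 0.0]]
penalty = laplacian_penalty(z, w)
```

In a training loop a term like this would be added to the reconstruction and KL losses with some weight. That weight, and the "dynamic" re-estimation of `w` from denoised expression, are left out of the sketch.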
bioinformatics2026-06-26v2Cell-free DNA Fragmentation Profiling at Transcription Start Sites Improves upon Cancer-Type-Specific Region Selection for Cancer Detection
Pronk, B.; Makrodimitris, S.; Wilting, S.; Reinders, M.Abstract
Accurate discrimination between healthy individuals and patients with cancer using minimally invasive liquid biopsies could improve cancer diagnosis and monitoring. Circulating cell-free DNA (cfDNA) is a promising biomarker, since fragmentation patterns reflect chromatin organization and have been used to interrogate regulatory regions such as transcription start sites (TSSs). Classification approaches typically rely on hypothesis-driven selection of genomic regions based on literature or external tissue data. Therefore, they assume that tumor-derived cfDNA constitutes the dominant diagnostic signal, potentially overlooking a systemic, genome-wide shift in the cfDNA pool. We present a data-driven framework that identifies discriminative genomic loci directly from cfDNA whole-genome sequencing data. Using fragmentomic features captured at TSSs within a nested cross-validation framework, the model outperforms ichorCNA and hypothesis-driven baselines in distinguishing healthy from colorectal and breast cancer samples (AUROC 0.95 ± 0.039). Performance was maintained in a pan-cancer setting across seven malignancies (AUROC 0.946 ± 0.032) and generalized to previously unseen cancer types within the same cohorts (AUROC 0.934 ± 0.006). While validation in an independent external cohort showed a performance gap (AUROC 0.694), the data-driven model was consistently competitive with baseline methods. These results indicate that robust cancer detection is enabled by integrating distributed genome-wide fragmentation patterns rather than restricting analysis to predefined regions.
bioinformatics2026-06-26v1Comp2GPR: A Sequence-Driven Framework for Gene-Protein-Reaction Rule Reconstruction
Castillo, S.Abstract
Accurate gene-protein-reaction (GPR) associations are essential for the predictive performance of genome-scale metabolic models (GEMs), as they define the mapping between genes, enzymes, and metabolic reactions. However, GPR rules are often incomplete or inconsistent due to limitations in annotation transfer and the ambiguous representation of multi-subunit protein complexes, leading to errors in downstream analyses such as gene essentiality prediction. Here, I introduce Comp2GPR, an automated pipeline for reconstructing GPR rules that integrates curated protein complex information with sequence-level evidence. Protein complexes were sourced from the Complex Portal and subjected to an AI-assisted curation workflow to retain only metabolically relevant assemblies. Comp2GPR combines deterministic sequence similarity mapping with explicit rule construction to generate Boolean GPR expressions that accurately represent obligate subunit relationships and isoenzyme redundancy. I evaluated the impact of the reconstructed GPR rules by integrating them into the Yeast9 metabolic model and comparing gene essentiality predictions with the original model. While global performance metrics remained largely unchanged, the updated model achieved a net improvement in prediction accuracy through gene-level corrections. Overall, Comp2GPR demonstrates that combining curated protein complex data with sequence-based validation improves the accuracy, interpretability, and reproducibility of GPR rules. The method provides a robust framework for enhancing metabolic model annotations and supports more reliable simulation-based analyses.
bioinformatics2026-06-26v1MYC and RNA Polymerase II Binding Near Transcriptional End Sites Regulate the Expression of Functionally-Related Genes
Prochownik, E. V.; Henchy, C. M.; Wang, H.Abstract
MYC oncoprotein binding at promoters and enhancers influences RNA polymerase II (RNAPII)-driven gene expression. Numerous genes also bind MYC near their transcriptional end sites (TESs). This often allows direct promoter-TES contact via looping and further regulates total and 'read-through' transcription that extends beyond standard termination sites. We aimed here to better clarify the rules governing TES-associated MYC and/or RNAPII binding cross-talk in human and murine cells. Using ChIPseq and RNAseq datasets from the ENCODE portal and elsewhere, MYC and RNAPII binding profiles were found to differ around TESs and transcriptional start sites (TSSs). Variations in E box flanking sequences likely accounted for the somewhat lower affinities of MYC for TES-associated sites. Motifs for numerous other transcription factors were also observed to cluster non-randomly and in close proximity to MYC and RNAPII binding site peak summits. On average, genes with TES-proximal MYC or RNAPII sites were more highly expressed than those without, although co-binding tended to be suppressive. Both normal and neoplastic proliferative stimuli altered the MYC and RNAPII binding patterns of many genes, indicating that 'category switching' was common, subject to disparate external signals and often reversible. Functionally related gene sets with high levels of read-through transcription were uniformly marked by significant amounts of TES-associated MYC and/or RNAPII binding. These findings indicate that, both independently and together, MYC and RNAPII binding near TESs dynamically impact total and read-through transcription while also coordinating the expression of many common-purpose gene sets.
bioinformatics2026-06-26v1PlantGeneAnn: a strand-specific genome foundation model for ab initio gene structure annotation of plant genomes
Qizhe, Z.; Zhengyang, Z.; Kepeng, L.; Wang, J.; Kaixuan, D.; Xianglei, X.; Wei, X.; Xuehai, H.Abstract
High-quality plant genome assemblies are rapidly increasing, but accurate structural annotation remains reliant on transcript and homology evidence, limiting applications in newly sequenced and non-model species. Here, we present PlantGeneAnn, a plant-optimized, strand-specific genome foundation model for ab initio gene structure annotation. Fine-tuned on only nine high-quality model plant annotations, PlantGeneAnn outperformed a multi-species model trained on 42 species, showing that annotation quality is more important than token volume. On a stringent 13-species benchmark covering rosids, asterids, and monocots, PlantGeneAnn surpassed four state-of-the-art baselines across five evaluation levels, from base-level classification to complete transcript recovery. It achieved higher intron precision and better captured complex gene structures. In zero-shot variant effect prediction, PlantGeneAnn identified cryptic splice donors and premature stop codons in maize and rice, with saturation mutagenesis confirming single-nucleotide, context-dependent sensitivity. It also retained generalizability for epigenomic track prediction, highlighting its value for pan-genomics, crop improvement, and non-model plant research.
bioinformatics2026-06-26v1Consistent consensus-based annotation of spatial adaptive immune receptor repertoires from long-read sequencing using LongAIRR
Schuck, J.; Ortega Iannazzo, S.; Mahmoud, Z.; Gwellem Anchang, C.; Hasse, L. M.; Weber, K.; Imkeller, K.Abstract
The combination of spatial transcriptomics with long-read sequencing enables spatial characterization of full-length transcripts within solid tissue sections. However, standardized computational analysis frameworks are lacking, and it remains unclear whether available long-read sequencing platforms from Oxford Nanopore Technologies and Pacific Biosciences yield comparable results. Here, we present a computational strategy for spatial full-length transcript analysis, focusing on the spatial profiling of adaptive immune receptor repertoires (AIRR). Our approach introduces an adaptive filtering strategy that dynamically refines read selection and significantly improves consensus accuracy, enabling high-confidence sequence reconstruction independent of platform-specific sequencing error profiles. We further derive evidence-based guidelines tailored to the consistent and robust analysis of spatial AIRR data. The resulting software LongAIRR is modular and interoperable with existing spatial transcriptomics and AIRR analysis frameworks. This work establishes a methodological foundation for spatial immunology, enabling precise mapping of immune repertoires within their native tissue microenvironments.
bioinformatics2026-06-26v1Learning Perturbation Effects Through Contrastive Alignment of Multimodal Biological Embeddings
Long, W.; Liu, T.; Szalata, A.; Theis, F. J.; Xue, L.; Zhao, H.Abstract
Multimodal single-cell perturbation screens offer a scalable approach for characterizing the effects of genetic and chemical interventions on cellular state. However, most existing representation learning methods are tailored to a single perturbation modality and fail to explicitly incorporate external semantic knowledge, which limits their ability to generalize across datasets and perturbation types. Here, we introduce PertOmni, a CLIP-style multimodal representation learning framework that aligns transcriptomic perturbation signatures with text-derived embeddings of curated genes and compound descriptions, as well as image-derived embeddings from cell paintings. PertOmni jointly trains a shared transcriptomic encoder and dataset-specific text encoders using a masked contrastive objective that emphasizes within-cell-type discrimination while mitigating confounding effects arising from cell-type heterogeneity. We evaluate the produced joint embedding space on bidirectional retrieval, drug-gene interaction inference, and perturbation prediction across both small-molecule and CRISPRi perturbation datasets, and demonstrate consistent improvements over strong baseline methods.
bioinformatics2026-06-26v1Computational reconstruction of hierarchical cis-regulatory networks reveals synergistic transcription control and disease-associated rewiring
Zhu, X.; Zhou, X.; Zhang, Y.; Cai, G.; Zhao, W.; Zhou, B.; Zhou, J.; Tang, Z.; Liu, J.; Zhu, Q.; Cao, J.; Yang, B.; Gu, X.; Zhou, Z.Abstract
Gene regulation emerges from coordinated interactions among dispersed cis-regulatory elements, yet how these elements integrate into functional regulatory networks and collectively regulate gene transcription remains poorly understood. Here, we present ORIGAMI, a multi-omics, gene-centric deep learning framework that reconstructs functional cis-regulatory networks constrained by transcriptional output. ORIGAMI formulates cis-regulatory modeling as a latent graph inference task, which integrates DNA sequence, epigenomic signals, and three-dimensional chromatin priors to infer denoised regulatory graphs that capture functional interactions rather than structural proximity alone. The inferred regulatory graphs exhibit distinct topological regimes, where hierarchical and modular organization encodes cell-state-specific functional demands and enables synergistic transcriptional control. Furthermore, we show that these regulatory architectures undergo measurable state-dependent rewiring across disease contexts. Finally, ORIGAMI accurately predicts the transcriptional consequences of both cis- and trans-regulatory perturbations and links the rearrangement of regulatory architecture to perturbation response. Together, ORIGAMI advances a network-based view of gene regulation and establishes a foundation for virtual cell modeling of regulatory dynamics.
bioinformatics2026-06-26v1A Generalised Epigenetic Clock Reveals Therapeutic Vulnerabilities Linked to Ageing in Cancer Cells
Fernandez-Rebollo, I.; Digilio, A.; Oikonomou, A.; Trastulla, L.; Esteller, M.; Iorio, F.Abstract
Epigenetic clocks estimate biological age from DNA methylation patterns but perform poorly in cancer due to extensive epigenetic reprogramming, limiting the study of ageing in tumour biology. Here, we develop GepiClock, an epigenetic clock trained on DNA methylation data from 32 cancer types in The Cancer Genome Atlas. Based on 4,862 CpG sites, GepiClock accurately predicts age across both tumour and normal samples, indicating that core ageing-associated methylation programmes remain detectable despite malignant transformation. Applying GepiClock to molecularly profiled cancer cell lines with matched drug response and CRISPR screening data revealed age-associated vulnerabilities. Younger-predicted cell lines were more sensitive to mTOR, MEK1/2 and HSP90 inhibitors, whereas older lines showed increased sensitivity to AKT and PI3K inhibitors. Additional cancer-type-specific patterns and age-associated genetic dependencies were identified. These findings establish a framework to quantify biological age in cancer and link ageing-associated states to therapeutic vulnerabilities.
bioinformatics2026-06-26v1Efficient evidence-based genome annotation with EviAnn
Zimin, A. V.; Puiu, D.; Pertea, M.; Yorke, J.; Salzberg, S.Abstract
For many years, machine learning-based ab initio gene finding approaches have been central components of eukaryotic genome annotation pipelines, and they remain so today. The reliance on these approaches was originally sustained by the high cost and low availability of gene expression data, a primary source of evidence for gene annotation along with protein homology. However, innovations in modern sequencing technologies have revolutionized the acquisition of gene expression data, allowing scientists to rely more heavily on this class of evidence. In addition, proteins found in a multitude of well-annotated genomes represent another invaluable resource for gene annotation. Existing annotation packages often underutilize these data sources, which prompted us to develop EviAnn (Evidence-based Annotator), a novel evidence-based eukaryotic gene annotation system. EviAnn takes a strongly data-driven approach, building the exon-intron structure of genes from transcript alignments or protein-sequence homology rather than from purely ab initio gene finding techniques. We show that when provided with the same input data, EviAnn consistently outperforms current state-of-the-art packages including BRAKER3, MAKER2, and FINDER, while utilizing considerably less computer time. Annotation of a mammalian genome can be completed in less than an hour on a single multi-core server. EviAnn is freely available under an open-source license from https://github.com/alekseyzimin/EviAnn_release and from Bioconda as "eviann".
bioinformatics2026-06-25v3CellOS: Learning a World Model of Cellular State through Joint Embedding Prediction
Zhou, Q.; Le, Y.; Qi, X.; Chang, S.; Lu, H.; Wu, Y.; Wang, H.; Ran, R.; li, x.Abstract
Foundation models learned from single-cell transcriptomes are central to the prospect of an AI virtual cell that can represent, query and predict cellular state. However, most current single-cell foundation models learn from a single view of gene expression and are optimized primarily through reconstruction or next-token prediction. As a result, they capture expression abundance but cannot explicitly reconcile complementary views of cellular state. Here we present CellOS, a multi-view foundation model that learns cellular representations from paired expression and perception views. CellOS integrates complementary views through a scalable three-stage training strategy that combines causal cell-sentence language modelling, function-preserving dense-to-mixture-of-experts expansion and latent-space alignment via an LLM-JEPA objective. Using this framework, we trained a 12-billion-parameter model on 390.5 million single-cell transcriptomes. Across diverse benchmarks spanning cell-state annotation, batch integration and perturbation-response prediction, CellOS consistently outperformed state-of-the-art single-cell foundation models. Together, these results suggest that predictive alignment between complementary cellular views provides a scalable path toward representation-centric cellular world models and transferable AI virtual cells.
bioinformatics2026-06-25v2Dynamic genomic constraints reveal fitness trade-offs underlying bacterial resistance evolution.
Dillon, L.; McInerney, J. O.; Creevey, C. J.Abstract
Antimicrobial resistance (AMR) is often modelled as the accumulation of resistance genes leading to multidrug resistance (MDR). We show that gene co-occurrence patterns in two opportunistic pathogens are consistent with fitness trade-offs that constrain which combinations of resistance mechanisms coexist. We applied a combined pangenomic and machine-learning analysis to 9,584 Escherichia coli genomes (99.2% phylogroup B2) and 7,057 Pseudomonas aeruginosa genomes. In E. coli, we identified eight cases of mutually exclusive gene pairs that independently predicted the same MDR phenotype, suggesting alternative routes to resistance whose components are typically not co-inherited. In a separate dataset of 352 strains with paired minimum inhibitory concentration (MIC) data, these dissociated combinations co-occurred more often in resistant than susceptible strains, consistent with the constraints being conditional on antibiotic selection. Thirty-three gene pairs showed opposing association patterns between the two species, with combinations significantly associated in one species and significantly dissociated in the other (e.g. associated in E. coli and dissociated in P. aeruginosa, or vice versa). This indicates that genomic context modifies the contribution of individual genes to resistance phenotypes, and offers one explanation for the observation that 106 ARGs are present in >95% of strains yet do not predict resistance phenotype on their own. The findings are consistent with resistance evolution being shaped by fitness trade-offs and suggest that the dissociation patterns we identify could be targets for follow-up experimental work on resistance-associated fitness costs.
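As an illustration of what "dissociated" gene pairs could mean in practice, the sketch below tests whether two genes co-occur less often than expected across a set of genomes. The abstract does not name the statistic the authors used, so the one-sided Fisher's exact test here is an assumption, and the gene names and the `contingency` and `dissociation_p_value` helpers are hypothetical.

```python
from math import comb
from typing import List, Set, Tuple


def contingency(genomes: List[Set[str]], gene_a: str, gene_b: str) -> Tuple[int, int, int, int]:
    """Count genomes carrying both genes, only A, only B, or neither."""
    both = sum(1 for g in genomes if gene_a in g and gene_b in g)
    only_a = sum(1 for g in genomes if gene_a in g and gene_b not in g)
    only_b = sum(1 for g in genomes if gene_a not in g and gene_b in g)
    neither = len(genomes) - both - only_a - only_b
    return both, only_a, only_b, neither


def dissociation_p_value(both: int, only_a: int, only_b: int, neither: int) -> float:
    """One-sided Fisher's exact test for depleted co-occurrence.

    Returns the hypergeometric probability of seeing `both` or fewer
    co-occurrences when the gene totals are fixed and the two genes
    are distributed independently across genomes.
    """
    n = both + only_a + only_b + neither
    n_a = both + only_a  # genomes carrying gene A
    n_b = both + only_b  # genomes carrying gene B
    lowest = max(0, n_a + n_b - n)  # smallest overlap the margins allow
    favourable = sum(
        comb(n_a, k) * comb(n - n_a, n_b - k) for k in range(lowest, both + 1)
    )
    return favourable / comb(n, n_b)


# Toy presence/absence data (assumed): the two genes never share a genome.
genomes = [{"geneA"}] * 4 + [{"geneB"}] * 4
table = contingency(genomes, "geneA", "geneB")
p_value = dissociation_p_value(*table)
print(table, p_value)
```

A small p-value marks a pair as a candidate for mutual exclusivity; at the scale described in the abstract, the many pairwise tests would also need a multiple-testing correction, which this sketch omits.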
bioinformatics 2026-06-25 v2
FoldARE, an RNA secondary structure analysis and prediction tool via generative pseudo-SHAPE modeling
Marino, S. M.; Husak, V.; Tebaldi, T.
Abstract
RNA secondary structure prediction is limited by conformational heterogeneity and the scarcity of experimental data, as many RNAs populate ensembles of near-isoenergetic folds and SHAPE data are often unavailable. We present FoldARE (Folding and Analysis of RNA Ensembles), a two-step framework that derives pseudo-SHAPE constraints from in silico structural ensembles and uses them to guide SHAPE-aware secondary structure prediction. In the first step, an ensemble is generated and parsed nucleotide by nucleotide to estimate single-strandedness frequencies, which are converted into a pseudo-SHAPE reactivity profile using a weight-and-threshold scheme. In the second step, this profile is provided as a constraint to a SHAPE-compatible folding algorithm to improve the final prediction. We systematically evaluated all combinations of four ensemble-capable predictors, ViennaRNA, RNAstructure, LinearFold and EternaFold. After parameter optimization on a structurally diverse 25-RNA training set and validation using multiple scoring schemes, the best configuration combined EternaFold as ensembler and RNAstructure as predictor. Across external benchmark datasets (RNAstrand, ArchiveII and bpRNA) and the experimentally derived eFold dataset, FoldARE achieved the highest accuracy. Beyond prediction, FoldARE provides modules for ensemble-focused comparative analysis, including pairwise and multi-tool consensus assessment, per-nucleotide variability metrics, and interactive visualizations. Notably, it also supports the evaluation of m6A modification effects on structural ensembles. FoldARE is freely available on GitHub (https://github.com/TebaldiLab/FoldARE) and as a web-accessible version (https://rdds.it/foldare/).
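The first step described above can be sketched roughly as follows: parse an ensemble of dot-bracket structures position by position, estimate how often each nucleotide is unpaired, and map those frequencies to a pseudo-reactivity profile. The abstract does not give FoldARE's actual weight-and-threshold rule, so the `pseudo_shape` mapping, its `weight` and `threshold` defaults, and the toy ensemble are assumptions for illustration only, not the tool's implementation.

```python
from typing import List


def unpaired_frequencies(ensemble: List[str]) -> List[float]:
    """Fraction of ensemble structures in which each position is unpaired ('.')."""
    if not ensemble:
        raise ValueError("ensemble must contain at least one structure")
    length = len(ensemble[0])
    if any(len(structure) != length for structure in ensemble):
        raise ValueError("all structures must have the same length")
    counts = [0] * length
    for structure in ensemble:
        for i, symbol in enumerate(structure):
            if symbol == ".":
                counts[i] += 1
    return [count / len(ensemble) for count in counts]


def pseudo_shape(freqs: List[float], weight: float = 2.0, threshold: float = 0.5) -> List[float]:
    """Hypothetical weight-and-threshold rule (not FoldARE's actual scheme).

    Positions unpaired in at least `threshold` of the ensemble receive a
    reactivity of weight * frequency; all other positions receive 0.0.
    """
    return [weight * f if f >= threshold else 0.0 for f in freqs]


# Toy ensemble of four dot-bracket structures for a 6-nt RNA (assumed data).
ensemble = ["((..))", "(....)", "((..))", "......"]
freqs = unpaired_frequencies(ensemble)
profile = pseudo_shape(freqs)
print(freqs)    # [0.25, 0.5, 1.0, 1.0, 0.5, 0.25]
print(profile)  # [0.0, 1.0, 2.0, 2.0, 1.0, 0.0]
```

In the second step, a profile like this would be written in whatever reactivity format the chosen SHAPE-compatible folding tool expects and passed to it as a soft constraint.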
bioinformatics 2026-06-25 v2