Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
ECLIPSE: Exploring the dark proteome of ESKAPE pathogens through the sequence similarity network of the Protein Universe Atlas
Lata, S.; Heinz, D. W.
Abstract
The accelerating crisis of antimicrobial resistance among the critical, so-called ESKAPE bacterial pathogens demands the urgent identification of novel molecular targets. However, a substantial fraction of ESKAPE proteomes remains functionally uncharacterized, with many genes annotated as encoding hypothetical proteins. These protein sequences often lack significant similarity to known protein families when using conventional homology-based annotation methods and thus remain "dark". This limits our ability to explore their role in pathogenicity, and it is thus crucial to bridge this substantial gap in pathogen biology by developing novel strategies to illuminate these "dark" regions of the ESKAPE pan-proteomes. We introduce ECLIPSE (ESKAPE Connectome Linkage and Inference for Proteome Sequence Exploration), a network-based computational framework that systematically identifies and prioritises functionally dark protein families in ESKAPE pan-proteomes. ECLIPSE embeds target ESKAPE pathogen proteomes within the global sequence similarity network of the Protein Universe Atlas (Durairaj et al. 2023). It detects connected components composed entirely of unannotated proteins, called the "dark proteome". As a case study, we applied ECLIPSE to a pan-proteome of 3,460,657 protein sequences from 635 strains of Pseudomonas aeruginosa (PA). ECLIPSE identified 120,985 proteins (4%) residing in completely dark connected components. Furthermore, we performed a taxonomic diversity analysis using normalized Shannon indices to characterize each dark component by its enrichment in ESKAPE pathogens. The analysis utilized the evenness (E) value (see Methods 2.1), which distinguishes Pseudomonas-specific (target-specific) from ESKAPE-enriched dark components. We then developed the Dark Proteome Prioritization Score (DPPS), a composite multi-dimensional scoring framework (see Methods 2.5).
It ranks these dark components by biological relevance across four orthogonal axes: (i) functional darkness, (ii) P. aeruginosa proportion in the Atlas, (iii) AMR-clade taxonomic restriction, and (iv) conservation across the 635 P. aeruginosa strains. This framework outputs a robust four-tier scoring system; the prioritized Tier I components were validated by weight sensitivity analysis and remained stable across 500 Monte Carlo weight perturbations. Structural characterization of one of the top-ranked ESKAPE-enriched dark components revealed that it belongs to the beta-barrel fold DUF1302 (PF06980) family, for which no experimentally solved three-dimensional structure exists in the PDB. Genomic context analysis indicates that it is co-localized with a LuxR-type transcriptional regulator. Collectively, ECLIPSE identifies evolutionarily conserved, structurally defined, and functionally dark proteins enriched across ESKAPE pathogens; these candidates can facilitate the experimental characterization of dark proteins as alternative antimicrobial targets.
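The evenness value mentioned in the abstract above can be illustrated with the standard normalized Shannon (Pielou) evenness, E = H / ln(S), where H is the Shannon entropy of the taxon frequencies and S is the number of distinct taxa. The sketch below assumes this textbook definition; the exact formula in the paper's Methods 2.1 is not reproduced in this listing, and the genus labels and the function name `shannon_evenness` are hypothetical.

```python
import math
from collections import Counter


def shannon_evenness(labels):
    """Normalized Shannon (Pielou) evenness E = H / ln(S), in [0, 1].

    Assumption: this is the textbook definition, not necessarily the
    exact variant used by ECLIPSE.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("labels must not be empty")
    richness = len(counts)
    if richness == 1:
        return 0.0  # convention: a single taxon carries no diversity
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    return entropy / math.log(richness)


# Hypothetical genus labels for the member proteins of two dark components.
target_specific = ["Pseudomonas"] * 9 + ["Acinetobacter"]
eskape_spread = ["Pseudomonas", "Acinetobacter", "Klebsiella", "Enterobacter"] * 3

print(round(shannon_evenness(target_specific), 3))  # low: dominated by one genus
print(round(shannon_evenness(eskape_spread), 3))    # high: evenly spread
```

A component dominated by one genus scores well below 1, while a component spread evenly over several genera scores close to 1, which is the kind of contrast the abstract uses to separate target-specific from ESKAPE-enriched components.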
bioinformatics 2026-07-03 v2
Artificial intelligence virtual cell immune recovery model for screening traditional Chinese medicine ingredients
Hu, C.; Xiao, B.; Chen, C. Y.-C.
Abstract
Screening therapeutic candidates from single-cell transcriptomes requires a target that is closer to treatment response than disease-signature reversal. In immune diseases, post-treatment recovery may follow patient- and lineage-specific trajectories rather than a simple return along the pretreatment disease axis. We developed ImmuneNavi, an artificial intelligence virtual cell (AIVC) immune recovery model for ranking traditional Chinese medicine ingredients from paired PBMC data. The model maps heterogeneous PBMC cohorts to a common healthy immune coordinate system, constructs patient-lineage disease and recovery states, and processes ITCM treated-control profiles into a fixed ingredient perturbation bank. Patient and ingredient states are represented in matched gene, pathway and transcription-factor views, allowing the model to combine local transcriptional direction with more stable program-level features. A matcher trained on one paired treatment cohort preserved recovery-aligned ingredient rankings in independent PBMC cohorts without redefining the feature space, candidate set or preprocessing procedure. ImmuneNavi provides an AIVC model that uses paired immune-state measurements to screen natural-product candidates for experimental follow-up.
bioinformatics 2026-07-03 v2
Multiscale Analysis of Cellular Senescence through Ripley's Functions and Functional Statistics
Verrier, C.; Dabo-Niang, S.; Dehennaut, V.
Abstract
Cellular senescence is a heterogeneous and evolving process involved in development, tissue repair, aging, and age-related diseases. Although senescence burden in tissues has been widely studied, its spatial organization remains poorly understood, particularly in vivo. Senescence encompasses a spectrum of distinct states, with cells differing in molecular signatures, secretory activity, persistence, and interactions with their microenvironment depending on the inducing stimulus and tissue context. This heterogeneity suggests that spatial organization may reflect underlying processes such as tissue repair, regeneration, or maladaptive remodeling, providing insight into senescence function and its pathological roles. Here, we propose a quantitative, multi-scale framework to characterize the spatial organization of senescent cell populations in post-infarction mouse hearts. By combining a senescence-signature scoring strategy with spatial statistical methods and functional data analysis, we assess whether senescent cells exhibit clustered or dispersed patterns, and how these spatial distributions evolve over time following infarction. This approach aims to provide new insights into the spatiotemporal dynamics of senescence in vivo and to identify spatial features that may inform therapeutic strategies targeting age-related and tissue repair-associated pathologies.
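Whether a set of cell positions is clustered or dispersed, as the abstract above asks, is commonly assessed with Ripley's K function. The sketch below is a naive estimator without edge correction, K(r) = A / (n(n-1)) times the number of ordered pairs within distance r, together with Besag's L transform. It is an illustration of the general technique only; the coordinates, window area, and function names are hypothetical, and the paper's own estimator and functional-data analysis are not shown here.

```python
import math


def ripley_k(points, radius, area):
    """Naive Ripley's K estimate (no edge correction) for 2D points."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    pairs = 0
    for i, (xi, yi) in enumerate(points):
        for j, (xj, yj) in enumerate(points):
            # Count ordered pairs (i, j), i != j, closer than the radius.
            if i != j and math.hypot(xi - xj, yi - yj) <= radius:
                pairs += 1
    return area * pairs / (n * (n - 1))


def besag_l(points, radius, area):
    """Besag's L(r) = sqrt(K(r) / pi); L(r) - r > 0 suggests clustering."""
    return math.sqrt(ripley_k(points, radius, area) / math.pi)


# Hypothetical cell centroids in a 10 x 10 window (area = 100).
clustered = [(1.0, 1.0), (1.2, 1.0), (1.0, 1.2), (1.2, 1.2)]
dispersed = [(2.5, 2.5), (7.5, 2.5), (2.5, 7.5), (7.5, 7.5)]

for name, pts in (("clustered", clustered), ("dispersed", dispersed)):
    print(name, besag_l(pts, radius=0.5, area=100.0) - 0.5)
```

Under complete spatial randomness L(r) is approximately r, so a positive difference points to clustering at that scale and a negative one to dispersion. Evaluating the function over a range of radii gives the multi-scale curve that functional statistics can then compare across time points.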
bioinformatics 2026-07-03 v1
GLproxScape reconstructs spatial chromatin occupancy landscapes from tiled genomic locus proteomics
Ozcan, S. C.; Sergi, B.; Yildirim, B.; Cagiral, U.; Gonen, M.; Acilan Ayhan, C.
Abstract
Genomic locus proteomics combines proximity labeling with mass spectrometry to identify the proteins associated with user-defined genomic loci. However, per-region enrichment values from tiled guide designs are typically pooled before hit calling, collapsing the latent spatial structure encoded by overlapping measurements. Here, we describe GLproxScape, an R package that treats per-region enrichments as indirect spatial measurements and reconstructs latent chromatin occupancy landscapes through a Gaussian labeling-kernel forward model. Sequence-specific transcription factors are resolved by motif-anchored non-negative least-squares deconvolution against JASPAR or HOCOMOCO position weight matrices, while chromatin regulators, which lack defined DNA-binding motifs, are inferred as broad occupancy zones, enabling recovery of overlapping members of multi-subunit complexes. Applied to published genomic locus proteomics datasets at the human TERT, MYC, FOXP2, and FOXQ1 loci and the mouse Ripk3 locus, GLproxScape recovered known regulators with predicted positions independently supported by ChIP-Atlas peaks, reconstructed candidate co-binding relationships, and identified chromatin complexes inaccessible to pooled analyses. Systematic sgRNA-ablation experiments further showed that densely tiled designs improve event recovery and positional stability, providing concrete experimental guidance for future genomic locus proteomics studies.
bioinformatics 2026-07-03 v1
Multimodal computational framework identifies B cell convergence in autoimmunity and ageing
Lou, H.; Zhang, M.; Zhang, B.; Lu, Q.; Zheng, J.; Cao, X.
Abstract
Identifying the origin of pathogenic immune cells is crucial for therapeutic intervention and diagnosis, but pseudotime methods struggle to trace immune cells accurately. Current trajectory inference methods for B cell development and response in health and disease either ignore or underutilize antigen receptor sequence information, limiting their ability to resolve developmental pathways, particularly for pathogenic populations. Widely used methods such as Monocle 3 reconstruct developmental paths from transcriptomic similarity alone, discarding the features from immune receptors. Dandelion combines immune receptor features with transcriptomics, but it struggles to simulate the trajectory paths of B cells. Here we present ClonoTrace, a computational framework that integrates BCR sequence features with transcriptomic trajectory inference through gated fusion of multimodal embeddings. In fetal B cell development and germinal centre development, ClonoTrace achieves higher trajectory inference accuracy than Monocle 3 and Dandelion. Applied to systemic lupus erythematosus (SLE), ClonoTrace identifies a memory B cell extrafollicular maturation pathway, in addition to the naive B cell pathway, as an alternative origin of pathogenic double negative 2 (DN2) B cells in SLE patients; this pathway is accompanied by induction of ZEB2 with a concomitant decline of BACH2 along the trajectory. In healthy ageing, ClonoTrace identified three pathways in which naive B cells, IgM+ memory B cells and switched-memory B cells mature through a DN2-associated transcriptional state that precedes age-associated B cells (ABCs). ClonoTrace's fate probability algorithm indicated that the IgM+ memory B cell to ABC transition is the leading candidate age-associated transition, a process distinct from SLE DN2 maturation.
ClonoTrace provides a generalizable framework for receptor-informed trajectory inference, revealing developmental pathways of pathogenic B cell populations in autoimmunity and ageing that are untraceable by single-modality approaches.
bioinformatics 2026-07-03 v1
AART enables fast and accurate cross-platform proteomic translation
Chen, Y.; Zhang, S.
Abstract
Plasma proteomic profiling has been widely used for biomarker discovery, disease prediction and diagnosis, and patient stratification. However, technical differences across assay platforms often result in low-to-moderate agreement, limiting study reproducibility, data integration, and model transferability. Here we present AART, a cross-platform proteomic translation framework that integrates matched-protein ridge regression with proteome-wide residual learning. We benchmarked AART across three independent cohorts profiled on three major platforms: Olink, SomaScan, and mass spectrometry. Across all six translation directions, AART achieved the best performance compared with baseline methods for both overlapping and non-overlapping protein translations, with a relative improvement of 92.0% on average over direct mapping and of up to 31.6% over cpiVAE, the strongest baseline. Proteins that were accurately translated and improved by AART were enriched for extracellular, vesicle-associated, and tissue-restricted plasma biology. In downstream applications, AART improved the reproducibility of proteomic association analyses relative to direct cross-platform comparison by 75.5% for type 2 diabetes and 370.6% for Alzheimer's disease. AART-enabled cohort integration enhanced diagnostic accuracy for amyotrophic lateral sclerosis by 92.6% compared with non-integration analysis. AART was overall one to three orders of magnitude faster than cpiVAE, facilitating biobank-scale applications. Together, these results establish AART as a fast, accurate, and scalable framework for cross-platform proteomic translation, enabling more reproducible, transferable, and integrated proteomic research.
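To make the "matched-protein ridge regression" idea in the abstract above concrete, the sketch below fits a one-predictor ridge model that maps one protein's measurements on a source platform to the same protein on a target platform. This is only an illustration of ridge regression under stated assumptions, not AART's implementation: the proteome-wide residual learning step is omitted, and the function names, the penalty value, and the toy measurements are all hypothetical.

```python
def fit_ridge_1d(x, y, lam=1.0):
    """Fit y ~ intercept + slope * x with an L2 penalty lam on the slope.

    Closed form on centered data: slope = Sxy / (Sxx + lam).
    The intercept is not penalized.
    """
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("x and y must be non-empty and of equal length")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    if sxx + lam == 0:
        raise ValueError("constant predictor needs lam > 0")
    slope = sxy / (sxx + lam)
    intercept = mean_y - slope * mean_x
    return slope, intercept


def translate(x, slope, intercept):
    """Apply the fitted mapping to source-platform values."""
    return [intercept + slope * v for v in x]


# Hypothetical paired measurements of one protein on two platforms.
source = [0.0, 1.0, 2.0, 3.0]
target = [1.0, 3.0, 5.0, 7.0]

slope, intercept = fit_ridge_1d(source, target, lam=5.0)
print(slope, intercept)
print(translate([4.0], slope, intercept))
```

The penalty shrinks the slope toward zero, which trades some bias for stability when paired samples are few or noisy. With `lam=0.0` the fit reduces to ordinary least squares.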
bioinformatics 2026-07-03 v1
Scalable and rare-variant aware genome inference across the 1kGP cohort
Ebler, J.; Prodanov, T.; Blair, A.; Lee, S. K.; Ebert, P.; Human Pangenome Reference Consortium; Paten, B.; Marschall, T.
Abstract
Pangenome graphs built from haplotype-resolved de novo assemblies enable accurate analysis of genetic variation. The short-read-based tool PanGenie efficiently genotypes variants discovered in a pangenome across large cohorts and outperforms linear reference-based methods for structural variants (SVs). However, it cannot detect novel variants absent from the graph, so it misses many rare SVs (allele frequency <1%), and it was limited to graphs with 254 haplotypes. First, we introduce a haplotype sampling step that reduces the number of haplotypes using sample-specific k-mers before genotyping, decreasing runtime twelvefold and memory usage 1.4-fold at 30x coverage. Second, we present a polishing workflow that corrects residual errors in haplotypes inferred from PanGenie genotypes and incorporates rare and private mutations. We genotype 3,202 samples from the 1000 Genomes Project and use low-coverage ONT data (967 samples) for polishing. We achieve a median QV of 46 and provide the 1,934 polished haplotype sequences as a community resource.
bioinformatics 2026-07-03 v1
Disease Stage- and Risk-Associated RNA Editing Signatures in Acute Myeloid Leukemia and Their Utility for Peripheral Blood-Based Assessment
Gu, T.; Bui, D.; Lee, J.-H.
Abstract
RNA editing is a widespread post-transcriptional regulatory mechanism, but its role in acute myeloid leukemia (AML) remains incompletely understood. We analyzed RNA editing in 59 paired diagnosis-relapse AML samples and eight age-matched healthy controls using a stringent discovery pipeline and beta-binomial regression framework accounting for overdispersion and repeated measurements. A total of 166,323 high-confidence RNA editing sites mapping to 5,917 genes were identified. Of tested sites, 1.2%-3.6% varied significantly by disease stage or ELN-2022 risk group. Disease stage-specific editing signatures distinguished healthy controls, diagnosis, and relapse samples, with relapse-associated signals validated in an independent AML cohort. ELN-2022 risk-specific editing signatures showed substantial overlap between intermediate- and adverse-risk groups. Cross-cohort analyses identified four bone marrow (BM) editing sites in TMEM165, COQ4, TIMM17A, and PLXDC2 reproducibly associated with relapse and one peripheral blood (PB) editing site in ABHD18 elevated in higher-risk ELN-2022 groups. Most editing sites were shared between BM and PB; only 2.1%-2.3% exhibited tissue-specific differences. Higher global editing levels were correlated with leukemic state, white blood cell count, and selected clinical features. These findings identify reproducible RNA editing signatures linked to AML disease stage and risk and support the use of RNA editing biomarkers for PB disease assessment.
bioinformatics 2026-07-03 v1
Location dependence of protein intrinsic disorder in Drosophila melanogaster
Abdulla Daanaa, H. S.; Kuraku, S.; Akashi, H.; Saito, K.
Abstract
The relevance of protein structural flexibility in function remains contested, but experimental and computational evidence continues to accumulate. Many efforts to address this investigate intrinsic disorder, which commonly refers to peptide segments or entire protein sequences that presumably lack structure and exhibit high flexibility/conformational heterogeneity under physiological conditions. These efforts face challenges such as conflicting computational predictions and ambiguous relationships among intrinsic disorder locations and other protein properties. We address these challenges at a genome-wide scale in Drosophila melanogaster using residue-level predictions for various protein properties. We employ single and consensus approaches to quantify the prevalence of intrinsic disorder and attempt to infer function by testing for differences along protein sequences. Intrinsic disorder is likely more common at terminals than internal regions, and amino acid frequencies can vary substantially between regions in a manner that plausibly reflects functions of intrinsic disorder, rather than only proteome-wide effects. Tertiary structure potentially underlies the prevalence of intrinsic disorder along protein sequences; this prevalence varies more in a putatively solvent-exposed context than a solvent-buried one. Protein-binding appears to be a main function of intrinsic disorder, and we find support consistent with the notion that structural flexibility fosters binding plasticity, and show that location and protein length are factors in this relationship. Nucleic acid-binding and linker are ostensibly less common disorder functions than protein-binding, but nucleic acid-binding seems more localized at terminals. Residue-level estimates of selection pressure indicate that disordered regions generally evolve under weaker sequence constraints than structured regions, except at the N-terminal region. 
Biases in disorder prediction are a considerable factor in many of the observations, but are unlikely to be a full explanation. The findings strengthen support for functional relevance of flexibility, offer insight into protein architecture and function, and lend impetus for experimental inquiry.
bioinformatics 2026-07-03 v1
Raw-count embeddings improve single-cell foundation models
Schlede, S.; Muruganandan, T. P.; Gojjam Kantharaju, S.; Kisis, I.; Boecker, M.; Kim Alves Carpinteiro, M.; Schmitz, A.; Buchwald, L. M.; Sakthivelu, V.; Gülcüler Balta, G. S.; Anstötz, M.; Rueger, M. A.; Thomas, R. K.; Beleggia, F.
Abstract
Single-cell transformer foundation models have grown to hundreds of millions of parameters, yet the preprocessing choices that underlie them, including gene ranking and library-size normalisation, have not been systematically benchmarked. Testing seven strategies, we find these elaborations are largely unnecessary: non-normalised, log-transformed counts give the best performance, and gene order barely matters, with even random ordering outperforming sophisticated rank-based schemes. The resulting model, Gene Intelligence, projects log1p-transformed raw counts directly onto each token embedding and jointly predicts masked tokens and counts, using no normalisation, positional encoding, or read-depth tokens. Despite this simplicity, it achieves state-of-the-art performance in the tested gene-level tasks and in doublet detection, and matches large current foundation models on cell-classification tasks while using 10- to 200-fold fewer parameters.
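The preprocessing contrast described in the abstract above, log1p of raw counts versus library-size normalisation followed by log1p, can be sketched in a few lines. This is an illustration of the two transforms only, not the Gene Intelligence model; the function names, the `target_sum` value, and the toy count vectors are assumptions made for the example.

```python
import math


def log1p_raw(counts):
    """log(1 + count) on raw counts; sequencing depth is preserved."""
    return [math.log1p(c) for c in counts]


def log1p_library_normalized(counts, target_sum=10_000):
    """Scale each cell to a fixed total before log1p (depth is removed)."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    return [math.log1p(c * target_sum / total) for c in counts]


# Two hypothetical cells with identical gene proportions but 10x different depth.
shallow = [1, 2, 7]
deep = [10, 20, 70]

print(log1p_raw(shallow), log1p_raw(deep))
print(log1p_library_normalized(shallow), log1p_library_normalized(deep))
```

After library-size normalisation the two cells become indistinguishable, whereas the raw log1p values still carry the depth difference. That retained signal is one plausible reading of why the abstract reports no need for separate read-depth tokens.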
bioinformatics 2026-07-03 v1
GenPerturb: sequence-grounded interpretation of perturbation transcriptomes using pretrained genomic models
Nikaido, I.; Shiihashi, T.
Abstract
Background: Perturb-seq captures transcriptional responses to thousands of genetic and chemical perturbations, but does not directly resolve the cis-regulatory elements or transcription factor motifs underlying those responses. Existing approaches rely on indirect post hoc analyses or external epigenomic annotations, making it difficult to connect gene-level responses to specific regulatory elements. Results: We present GenPerturb, a framework that leverages pretrained sequence-to-expression models to link perturbation-induced expression changes to candidate cis-regulatory elements. By contrasting perturbation and control states, GenPerturb prioritizes regulatory regions and transcription factor motifs associated with each perturbation. The model recapitulates perturbation-dependent gene expression patterns and enables sequence-level interpretation without requiring matched chromatin data. Across multiple perturbation types, GenPerturb identifies biologically meaningful regulatory programs, including lineage-specific and signaling-associated motif activities, even when corresponding transcription factor expression changes are limited. Conclusions: GenPerturb converts gene-level expression responses from Perturb-seq into perturbation-specific, sequence-grounded cis-regulatory hypotheses. By prioritizing candidate regulatory elements and transcription factor motifs responsive to each perturbation without requiring matched chromatin data, GenPerturb enables mechanistic interpretation of transcriptional regulation and guides downstream experimental validation.
bioinformatics 2026-07-03 v1
Replication fork directionality reveals how structural variants arise under replication stress
Glodzik, D.; Rigby, M.; Andreopoulos, M.; Crawford, J.; Ehmsen, S.; Tapinos, A.; Cornish, A.; Houlston, R.; Wedge, D. C.; Scully, R.; Park, P. J.
Abstract
Structural variants (SVs) in cancer are associated with defects in DNA repair and replication stress, but the mechanisms generating common SV types remain unresolved. We propose that large (>100 kb) tandem duplications originate through a novel sister-fork breakage-fusion mechanism. To capture replication-related context beyond breakpoints, we developed an algorithm to characterize replication timing, origin density, and fork direction across SV-spanned regions, features that refine and differentiate previously defined SV signatures. Large tandem duplications frequently overlap replication origins from which forks proceed bidirectionally; combined with independent evidence from APOBEC strand asymmetry, this pattern is uniquely compatible with the proposed mechanism. Although tandem duplications in CCNE1-amplified and CDK12-mutant cancers also concentrate around origins and highly transcribed genes, they display distinct contexts: CDK12-mutant SVs arise near later-firing origins, whereas those in CCNE1-amplified tumors often coincide with genes in specific strand configurations, suggesting different causes of fork stalling. Incorporating replication features into signature analysis enabled the discovery of new SV signatures, which we used to build SVIG, a multi-class classifier of SV phenotypes. SV signatures attributed to replication stress may help guide therapies targeting this vulnerability.
bioinformatics 2026-07-03 v1
RD-OMICS: An Integrative Multi-Omics Data Inventory in Rare Diseases
Sun, S.; Wang, H.; Mathe, E. A.; Zhu, Q.
Abstract
Rare diseases (RD) impact over 30 million individuals in the United States, yet fewer than 5% of the identified conditions have FDA-approved treatments. Progress in RD research is hindered by small patient cohorts, biological heterogeneity, and fragmented, inconsistently annotated publicly available omics data, which limit integrative analysis and translational discovery. Here, we present RD-OMICS, a data inventory with integrated and structured RD omics data from Gene Expression Omnibus (GEO), in the form of a knowledge graph. We developed a metadata harmonization pipeline that combines rule-based mapping and large language model (LLM)-assisted semantic categorization. The graph-based data model was defined to integrate different types of data, including disease conditions, experiments, samples, platforms, projects, and publications, into a centralized inventory graph. In this preliminary study, 11,049 GEO series for 126 rare diseases were processed and integrated into RD-OMICS, which includes 375,930 individual biospecimen samples, 1,578 sequencing and array platforms, and 10,938 biological projects. Case studies demonstrate the use of RD-OMICS in supporting rare disease research, omics cohort construction, and transcriptome-based drug repurposing for amyotrophic lateral sclerosis (ALS). RD-OMICS provides a scalable foundation for transforming fragmented omics data into a structured, harmonized and interoperable resource, facilitating therapeutic development and other translational discoveries in rare diseases.
bioinformatics 2026-07-03 v1
Structural Organization of the Nvj3-Mdm1 Complex Reveals a Conserved Lipid-Compatible Contact Site Module
Aboumourad, M.; Hariri, H.
Abstract
Membrane contact sites are organized by protein assemblies that physically couple organelles and coordinate lipid metabolism, yet the structural principles that enable lipid exchange across these junctions remain poorly defined. At the nuclear-vacuolar junction (NVJ) in budding yeast, the tethering protein Mdm1 and its binding partner Nvj3 form a complex that regulates lipid metabolic pathways, but the structural features underlying their interaction have not been resolved. Here, we use AlphaFold-based complex prediction and comparative structural analysis to define the organization of Nvj3-Mdm1 complex assembly. We identify a high-confidence heterodimer in which conserved PXA and PXC domains generate an extended tunnel spanning both proteins. Tunnel analysis predicts a core hydrophobic conduit traversing the Nvj3-Mdm1 interface, consistent with a lipid-compatible architecture. Evolutionary conservation is enriched at the Nvj3-Mdm1 interface. The predicted conduit shares geometric and physicochemical properties with bridge-like lipid transfer proteins, including Atg2, Fmp27, and Hob2, suggesting that heteromeric tether assemblies may contribute directly to inter-organelle lipid transfer. Cophylogenetic analysis reveals coordinated coevolution of Nvj3 and Mdm1 across Saccharomycetes. Together, these findings define Nvj3 as a structural partner of Mdm1 and support a conduit-based model of lipid transfer at the NVJ.
bioinformatics 2026-07-03 v1
RegulomeXplorer: Interactive exploration of drug effects on subcellularly resolved proteomes
Uiberacker, M.; Iellici, T.; Afanaseva, E.; Meier-Menches, S.; Zanghellini, J.
Abstract
Mass spectrometry-based proteomics allows the quantification of drug-induced changes in protein abundance. However, the integration of perturbation data across subcellular compartments remains a challenging bottleneck. Here, we present RegulomeXplorer, a web-based tool for automated processing and interactive exploration of subcellular compartment-resolved proteomics data. RegulomeXplorer employs MaxQuant output files to determine differential protein regulations upon drug perturbation, performs functional enrichment analysis, and visualizes enriched terms on a two-dimensional cytoplasmic-nuclear plane, called a regulome. Visualizing the data as regulomes allows simultaneous assessment of the magnitude of drug perturbation effects within separate subcellular compartments and of the contribution of regulated proteins to the position of each enriched term in the regulome plane. We validated RegulomeXplorer against previously published, manually curated regulome analyses. It was then applied to subcellular compartment-resolved breast cancer cell line proteomes, revealing drug- and cell-line-specific responses to Doxorubicin and Taxol, both in line with their described modes of action. RegulomeXplorer provides an accessible workflow for interpreting compartment-resolved perturbation proteomics and generating mode-of-action hypotheses in drug-response studies. RegulomeXplorer is freely available without registration at https://chemnettools.anc.univie.ac.at/RegulomeExplorer/.
bioinformatics 2026-07-03 v1
Segmentation and classification of retinal pigment granules in fluorescence lifetime imaging microscopy (FLIM) data
Ali, M.; Ahmad, H. A.; Alderzy, H.; Hammer, M.; Heintzmann, R.; Stranik, O.
Abstract
Alterations of fluorescence properties in retinal pigment epithelium (RPE) cells caused by diseases such as age-related macular degeneration (AMD) highlight the need for detailed analysis of the fluorescent RPE granules at the individual level. Precise segmentation and classification of these granules remain challenging due to their limited visual separability. In this study, we present Classi4RPE, a computational algorithm designed to accurately segment RPE granules and classify them into three categories -- lipofuscin (L), melanolipofuscin (ML), and melanin (M) -- based on fluorescence lifetime imaging data, which provide distinctive contrast. The method is implemented in a custom Python framework and employs seeded watershed segmentation to isolate individual granules. Lipofuscin granules are identified as hyperfluorescent structures with longer lifetimes, while granules with shorter lifetimes are further analyzed based on their spatial lifetime distribution from the center to edge, enabling discrimination of ML from other melanin-rich granules. Our approach achieves high performance, with mean sensitivities of 0.99 for L granules and 0.90 for ML granules, and corresponding specificities of 0.93 and 0.98, respectively, compared to manually annotated ground truth. These results demonstrate the potential of Classi4RPE to surpass human visual limitations and provide a robust tool for quantitative RPE analysis.
bioinformatics 2026-07-03 v1
Quantifying Asymmetric Coevolutionary Dynamics using Normalized Phylogenetic Costs
Wagle, S.; Markin, A.; Sherman, T. J.; Mayo, C.; Dunham, T. J.; Brelsfoard, C.; Cohnstaedt, L. W.; Wilson, W. C.; Anderson, T. K.; Eulenstein, O.
Abstract
Coevolutionary studies aim to characterize associations, such as virus-host relationships, by using phylogenetic distances to quantify the topological concordance between the phylogenies of interacting taxa. However, phylogenetic distances cannot capture asymmetrical relationships that arise from differences in sampling, evolutionary rates, or characterizations between datasets. Furthermore, a lack of accurate normalization complicates the interpretation and validation of coevolutionary analyses. To address these limitations, we employed the Asymmetric Cluster Affinity and Cluster Support costs as a general framework to quantify coevolutionary patterns across multiple biological scales. We benchmarked the precision of these costs by reanalyzing a curated dataset documenting interspecies transmission frequencies across nineteen virus-host phylogenies. Our results corroborate prior findings showing that all virus families under study can cross species boundaries; however, the asymmetric costs provide a more granular representation, demonstrating that the frequency of such events varies significantly across families. We then applied the Asymmetric Cluster Support cost to quantify preferential gene segment pairings within the Bluetongue virus genome. This analysis revealed a close phylogenetic association between the outer capsid proteins VP2 and VP5, likely reflecting shared selective pressures due to their critical roles in cell entry and exit. In contrast, gene segments encoding nonstructural proteins exhibited discordant evolutionary histories relative to other segments. Finally, we demonstrated that the Asymmetric Cluster Support cost can detect coevolutionary dynamics in swine influenza A virus, identifying novel gene pairings indicative of major viral reassortment events. 
Overall, our approach demonstrates that normalized asymmetric phylogenetic costs accurately capture complex biological relationships and provide a robust framework for quantifying fine-scale coevolutionary dynamics in rapidly evolving pathogens.
bioinformatics 2026-07-03 v1
Simulating population pangenomes under coalescent demographic models with MSpangenome
Piat, L.; Denni, S.; Dubois, S.; Linard, B.; Duvaux, L.
Abstract
Motivation: Pangenome variation graphs (PVGs) are increasingly used to represent genomic diversity, yet there is currently no general framework for generating population pangenomes directly from explicit evolutionary histories. Existing simulators typically focus on individual classes of variation and do not integrate these variations within a genealogy-aware framework driven by explicit demographic histories. As a result, evaluating pangenome methods in realistic population-genetic settings remains challenging, and benchmark datasets with known evolutionary ground truth are scarce. Results: We present MSpangenome, a genealogy-aware framework that bridges coalescent population genetic simulations and pangenome graph analyses. The pipeline combines ancestry simulation with msprime and a de novo graph construction algorithm to generate PVGs directly from simulated genealogies. By explicitly modeling recombination, demographic history and incomplete lineage sorting, MSpangenome produces structurally complex pangenomes in which nested and overlapping structural variants emerge naturally from the underlying genealogies, while their evolutionary history and graph topology remain known by construction. This provides a general framework for generating realistic population pangenomes and establishing ground-truth datasets for methodological evaluation. We demonstrate its utility by generating population-scale pangenomes and using them as controlled references to benchmark the widely used graph construction tools, PGGB and Minigraph-Cactus. Our analyses reveal contrasting performance regimes across levels of sequence diversity, sample sizes and classes of structural variation, highlighting the value of simulation-based benchmarking for identifying reconstruction errors that are hard to detect using empirical datasets alone.
Availability and implementation: MSpangenome is implemented in Python, fully containerized, freely available at https://forge.inrae.fr/pangepop/MSpangepop and mirrored at https://github.com/inrae/MSpangepop.
bioinformatics2026-07-03v1ViralEpiBase: a manually curated repository of epitranscriptomic modification sites across viral RNA genomes and virus-encoded transcripts
Srinivasan, S.; Chande, A.Abstract
Post-transcriptional chemical modifications of RNA, collectively termed the epitranscriptome, have emerged as critical regulatory layers governing viral replication, pathogenicity, and host-virus interactions. Despite the rapid accumulation of experimental data on viral RNA modifications, no dedicated, freely accessible resource existed for systematically cataloguing these sites across diverse viral species. Here we present ViralEpiBase, a manually curated database of epitranscriptomic modification sites identified in viral RNA genomes and virus-encoded transcripts at single-nucleotide resolution. ViralEpiBase currently integrates seven chemically distinct RNA modification types, namely N6-methyladenosine (m6A), N1-methyladenosine (m1A), pseudouridine (Ψ), 5-methylcytosine (m5C), 2'-O-methylation (2'OMe), inosine and N4-acetylcytidine (ac4C), across 12 viral species encompassing both DNA and RNA viruses of clinical and biological significance. Each entry is linked to its primary literature source or deposited dataset and is retrievable by modification type, genomic coordinates, or viral taxonomy. The database is freely accessible through an intuitive web interface and is updated continuously as new experimental evidence becomes available. ViralEpiBase thus provides the first unified platform dedicated exclusively to viral epitranscriptomics and is designed to facilitate mechanistic investigation of RNA modification functions in viral biology.
bioinformatics2026-07-03v1Multi-modality Graph Representation Learning for Malignant Cell Identification from scRNA-seq using DeepMalignant
Bhattarrai, P.; Yuan, W.; Chi, H.; Zhou, X. M.; Mallory, X.Abstract
Distinguishing malignant from normal cells in single-cell RNA sequencing data remains a critical yet challenging task in cancer genomics. Existing methods often suffer from poor precision, limited generalizability across cancer types, and reduced robustness across different sequencing platforms. We developed DeepMalignant, an unsupervised multimodal graph attention autoencoder for malignant cell identification that jointly integrates gene expression and copy number alteration (CNA) information. We applied DeepMalignant to five datasets covering 26 samples and four cancer types (breast, colorectal, pancreatic, and ovarian cancers), generated by three platforms (10x Genomics, inDrop, and Drop-seq) for benchmarking and compared it with existing state-of-the-art methods including scMalignantFinder, PreCanCell, CopyKAT, ikarus, and Cancer-Finder. DeepMalignant achieved the best overall balance of precision and recall and consistently outperformed the existing methods that used either gene expression or CNA in F1 scores. Ablation studies showed that both CNA-based edge weighting and graph attention aggregation contribute independently to performance, and attribution analysis further indicated that the learned embeddings capture biologically meaningful malignant programs. We further applied DeepMalignant to two ductal carcinoma in situ (DCIS) samples, DCIS2 and DCIS1, that have matched spatial transcriptomics and scRNA-seq data. DeepMalignant identified tumor-enriched regions that were highly consistent with the matched histological image. 
The downstream cell-cell communications analysis revealed that fibroblast-derived C3 and MIF both directed signaling more toward normal epithelial cells than tumor epithelial cells, demonstrating that accurate tumor-normal cell classification by DeepMalignant enables biologically meaningful interrogation of the tumor microenvironment and revealing how stromal cells differentially communicate with malignant versus normal epithelial populations.
bioinformatics2026-07-03v1SpatialFuser: a unified framework for integrative analysis of unpaired spatial multi-omics data
Cai, W.; Li, W.Abstract
Recent advances in spatial multi-omics technologies provide unprecedented opportunities to interpret molecular features in tissue microenvironments, but integrative analysis across heterogeneous datasets remains challenging. Here we present SpatialFuser, a deep learning framework for integrative analysis of unpaired spatial multi-omics data across epigenomics, transcriptomics, proteomics, and metabolomics. SpatialFuser consists of three coordinated modules: MCGATE, a Multi-head Collaborative Graph Attention auToEncoder that learns multi-scale spatial representations to decipher fine-grained spatial heterogeneity beyond predefined spatial neighbourhoods; an optional geometric pre-matching module that provides coarse initialization under tissue geometry mismatch; and an iterative matching-fusion module that couples geometry-constrained optimal transport matching with contrastive-learning-guided modality fusion for cross-slice alignment and integration. Systematic benchmarks demonstrate superior performance and reliability compared with existing state-of-the-art methods in spatial domain identification, cross-slice alignment, and multi-omics integration. Applications to real datasets illustrate that SpatialFuser resolves precise spatial molecular patterns, reveals developmental dynamics, and recovers complementary signals across modalities. Cross-resolution integration of weakly correlated modalities by our method further uncovers previously obscured biological variation. The generalizability and versatility of our framework enable customized analytical scenarios and potential extension for emerging omics.
bioinformatics2026-07-02v3Computational Binding Affinities of Disheveled PDZ Protein-Ligand Complexes
Singh, A.; Jubintoro, A.; Kancharla, H.; Blankenberg, P.; Zheng, J.Abstract
Wnt/β-catenin signaling is critical for cell growth and development, with its hyperactive dysregulation implicated in the development of cancer. Current therapeutic research on inhibition of Wnt/β-catenin signaling is impeded by the high cost of experimentally determining binding affinities. Consequently, interest has risen in screening potential inhibitors' binding affinities with computational tools to reduce costs. Here, we test the validity of a computational molecular dynamics simulator, Binding Free Energy Estimator 2 (BFEE2), for determining peptide ligand affinity for Wnt/β-catenin signaling. We focus on the Dishevelled (DVL) PDZ domain, a key mediator in Wnt signaling through its ability to bind to various peptide ligands. We analyze the binding affinities of several DVL PDZ domain-peptide and domain-ligand complexes against previously established results to determine the validity of computational analysis. We conclude that computational molecular dynamics simulations were successful for peptide-ligand complexes, with mixed results for small-molecule scenarios.
bioinformatics2026-07-02v3Evidence for post-allopolyploidy genetic exchanges between duplicated regions in three ancient polyploidies
Dhillon, A. K.; Pasagadugula, H.; Pitts, I.; Rohilla, M.; Conant, G. C.Abstract
Many successful lineages, including flowering plants and vertebrates, owe some of their evolutionary prosperity to whole genome duplications (WGD). However, in the immediate aftermath of a WGD, the new polyploid species that is formed often experiences multivalent pairings during meiosis, which can produce inviable gametes. To mitigate the potential harm caused by such pairings, most lineages eventually undergo "diploidization" to restore typical bivalent pairing. A key component of this process is the loss of duplicated genes. While diploidization was once thought to be rapid, recent analyses of polyploidies suggest the process may be more drawn out, with multivalent pairing persisting long after the initial WGD event. Here, we assess evidence for "late" diploidization after three different polyploidies: the teleost genome duplication (TGD), nested polyploidies in Paramecium lineages, and the ancient WGD in baker's yeast. Using our tool POInT (the Polyploidy Orthology Inference Tool), we model the resolution of these events. By analyzing discordance between expected species trees and observed gene trees, we argue that late diploidization was a likely feature in the resolution of all three polyploidies.
bioinformatics2026-07-02v2Mechanisms Matter: Transportability of Cellular Perturbation Effects
Qi, S.-a.; Chapfuwa, P.Abstract
Predicting cellular responses to genetic or chemical perturbations across biological contexts is central to drug development and disease understanding. Despite increases in data and model scale, deep learning models have not consistently outperformed simple baselines. Leveraging causal transportability theory, we show that cross-context generalization is governed by shared causal mechanisms, not merely distributional similarity. To enable controlled evaluation, we develop a causal simulator that generates realistic semi-synthetic Perturb-seq datasets with tunable mechanistic divergence, providing benchmarks with known ground-truth causal structure. Further, we adapt the Vendi diversity score to the perturbation setting as a diagnostic for mode collapse, a failure mode invisible to standard per-perturbation metrics. Extensive experiments across four deep learning models and six simple baselines on semi-synthetic and real Perturb-seq datasets reveal a cross-context generalization gap: performance under cross-context splits drops substantially, often to simple baseline levels. Notably, even on synthetic data with fully specified causal structure, no model generalized across contexts with different causal mechanisms. These results underscore the need for cross-context evaluation, diversity-aware metrics, and mechanistically grounded inductive biases.
bioinformatics2026-07-02v2Tabular Foundation Models Are Competitive Cellular Perturbation Predictors Across Biological Scales
Palla, G.; Hillsley, A.; Kim, Y.-J.; Royer, L. A.Abstract
Predicting how cells respond to genetic and chemical perturbations is a central challenge in drug discovery and functional genomics. A growing ecosystem of specialized single-cell foundation models has been developed to address this problem, yet their practical advantage over domain-agnostic approaches remains unclear. Here we evaluate the power of Tabular Foundation Models such as TabICL and TabPFN, general-purpose pre-trained regression models, against domain-specific architectures including PRESAGE, scGPT, scLAMBDA, STACK and Prophet across four complementary evaluation settings: cell-level in-context cross-cell-type prediction, pseudobulk perturbation prediction on five Perturb-seq datasets of cell lines, a genome-wide CRISPR screen in primary human CD4+ T cells, and embryo-level cell-type composition prediction in a zebrafish developmental perturbation atlas. In the cell-level cross-cell-type perturbation prediction, Tabular Foundation Models perform on par with or better than specialized models. On pseudobulk perturbation prediction, Tabular Foundation Models consistently outperform specialized baselines across multiple evaluation metrics and datasets. On whole-embryo cell-type composition prediction, Tabular Foundation Models are competitive with specialized baselines. These results demonstrate that general-purpose tabular in-context learning provides a strong and scalable alternative to bespoke biological architectures for perturbation response modeling across cell systems and scales.
bioinformatics2026-07-02v2ProLoc: Text-guided Localization of Protein Functional Regions
Liu, P.; Fan, J.; Pan, M.; Zhang, J.Abstract
Motivation: Protein function is often mediated by specific sequence regions, such as domains, motifs and functional sites. Identifying these regions is important for understanding protein mechanisms, annotating newly sequenced proteins and prioritizing residues for experimental validation. However, existing protein function prediction and protein-text models mainly capture global protein-level associations, making it difficult to determine which residues support a given textual functional description. This limits their use for mechanistic interpretation and residue-level experimental prioritization. Results: We introduce text-guided protein functional region localization, a span-level grounding task that identifies residue regions corresponding to natural-language functional descriptions. We construct an InterPro-derived localization benchmark of explicit protein-text-region examples, covering both domain-level and functional-site annotations with sequence-similarity-aware splits and a unified span-level evaluation protocol. We further propose ProLoc, a text-conditioned localization model built on raw ESM2-650M and PubMedBERT with direct residue-level localization and anchor-free span proposal generation. On the held-out test set, ProLoc substantially outperforms window-based adaptations of representative protein and protein-text models. Its direct output achieves the strongest single-region localization performance, reaching 0.7730 IoU@1, while its anchor-free proposal output improves visible multi-site recovery, reaching 0.9671 VM R@10 IoU50 and 0.9489 VM All-Hit@50. Availability and Implementation: Source code and evaluation scripts are available at https://github.com/ShiDeng7rz/Proloc. The processed benchmark and data splits are archived at Zenodo: https://doi.org/10.5281/zenodo.20729714. Contact: liupeishuo@nju.edu.cn
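The span-level metrics quoted in this abstract rest on intersection-over-union between predicted and annotated residue regions. The abstract does not define IoU@1 or the IoU50 threshold precisely, so the following is only a generic sketch under stated assumptions: spans are inclusive 1-based (start, end) residue indices, and the function names `span_iou` and `hit_at_k` are hypothetical, not taken from the ProLoc code.

```python
def span_iou(pred, gold):
    """Intersection-over-union of two inclusive residue spans (start, end)."""
    ps, pe = pred
    gs, ge = gold
    # Overlap length is zero when the spans do not intersect.
    inter = max(0, min(pe, ge) - max(ps, gs) + 1)
    union = (pe - ps + 1) + (ge - gs + 1) - inter
    return inter / union


def hit_at_k(proposals, gold, k, threshold=0.5):
    """True if any of the first k ranked proposals reaches the IoU threshold.

    Assumption: 'IoU50' is read here as a 0.5 IoU cut-off.
    """
    return any(span_iou(p, gold) >= threshold for p in proposals[:k])


# Toy example with made-up spans (not benchmark data).
gold_region = (20, 60)
ranked = [(1, 5), (18, 58), (100, 120)]
top1_hit = hit_at_k(ranked, gold_region, k=1)
top2_hit = hit_at_k(ranked, gold_region, k=2)
```

In this toy case the first proposal misses the region entirely, while the second overlaps 39 of 43 residues in the union, so only the top-2 check succeeds.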
bioinformatics2026-07-02v2BacNeMu: neutral mutation spectra reconstruction pipeline for bacteria
Skudnov, A.; Badamshin, E.; Efimenko, B.; Popadin, K.; Gunbin, K.; Denisov, S.Abstract
The mutational spectrum is an increasingly important molecular phenotype that quantitatively describes mutagenesis in a given gene and species, enabling future comparative analyses to reveal differences in underlying mutagenic processes, whether internal, such as DNA repair processes, or external, such as ecological niches and conditions. Mutation accumulation experiments, although time-consuming and costly, remain the standard approach for reconstructing bacterial neutral mutation spectra. Here, we present BacNeMu, a phylogenetically informed pipeline that reconstructs neutral mutational spectra of bacterial genomes using the open databases GTDB, AnnoTree and KEGG Orthology, building on the previously developed NeMu pipeline. BacNeMu reconstructs mutation spectra that closely match the results of mutation accumulation experiments while requiring substantially less time, enabling comparative analyses across diverse bacterial taxa. Applied to obligate aerobes and anaerobes, BacNeMu recovered the expected excess of T:A>C:G transitions, consistent with oxidative-damage-associated mutational patterns previously described in mitochondrial genomes and yeast single-strand. We further asked whether any other ecological factors influence the mutational spectrum. As a pilot, we compared three species living under different temperatures: one strong thermophile (Thermotoga maritima), one psychrophile (Clostridium algidicarnis), and one with intermediate temperature tolerance (Psychrobacter sanguinis). In the thermophile, the relative frequency of T:A>C:G substitutions was higher than in the psychrophile, consistent with the hypothesis that GC-biased mutagenesis contributes to thermal adaptation, although C:G>T:A transitions predominate across all three species. BacNeMu provides a rapid, phylogenetically informed framework for generating biologically meaningful mutation spectra from open databases.
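The notation T:A>C:G used in this abstract denotes a strand-collapsed substitution class: a T>C change on one strand is the same event as A>G on the other. The sketch below illustrates only that bookkeeping on a made-up list of substitutions. It is not the BacNeMu pipeline: it performs no phylogenetic reconstruction, no restriction to neutral sites and no normalization by nucleotide composition, and the helper names `collapse` and `spectrum` are hypothetical.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def collapse(ref, alt):
    """Return the strand-collapsed class, written with a pyrimidine reference."""
    if ref in "AG":  # purine reference: report the complementary-strand change
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}:{COMPLEMENT[ref]}>{alt}:{COMPLEMENT[alt]}"


def spectrum(substitutions):
    """Relative frequency of each of the six collapsed substitution classes."""
    counts = Counter(collapse(r, a) for r, a in substitutions if r != a)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


# Made-up (reference, alternative) base pairs, for illustration only.
observed = [("T", "C"), ("A", "G"), ("C", "T"), ("G", "A"), ("C", "A"), ("T", "C")]
spec = spectrum(observed)
```

Here T>C and A>G both fall into T:A>C:G, so that class accounts for three of the six toy substitutions.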
bioinformatics2026-07-02v1Knowledge-guided Bayesian optimization using pre-trained LLMs speeds up the identification of superior genotypes from germplasm collection
Hamazaki, K.; Tsuda, K.Abstract
Background: Germplasm collections contain wide genetic diversity that is valuable for plant breeding, but conducting phenotypic evaluation for all genotypes in field trials is rarely feasible. Bayesian optimization offers a way to decide, season by season, which genotypes to cultivate in order to identify superior genotypes with fewer evaluations. However, standard Bayesian optimization commonly starts from randomly selected genotypes and mainly relies on surrogate models built from marker genotype information, while the text-based passport information that accompanies germplasm is not fully used. We examined whether pre-trained large language models can provide prior knowledge that improves these decisions in germplasm evaluation. Results: We constructed a large-language-model-guided Bayesian optimization framework that introduces large language models into two parts of the Bayesian optimization workflow. In zero-shot warmstarting, a large language model proposes initial genotypes using passport information such as cultivar name, country of origin, and subpopulation, optionally together with principal component scores derived from genome-wide single-nucleotide-polymorphism markers. In addition, we evaluated a large-language-model-based surrogate model that predicts phenotypic values for untested genotypes using in-context learning from previously evaluated genotypes. Using a rice germplasm panel and two target traits (seed number per panicle for maximization and protein content for minimization), we compared strategies. For seed number per panicle, zero-shot warmstarting with a general-purpose instruction-following model reduced the number of evaluated genotypes needed to reach the best genotype, whereas improvements were small for protein content. 
When genomic information was available, Gaussian-process-based Bayesian optimization was the strongest overall approach, while the large-language-model-based surrogate model outperformed random baselines and was competitive in some settings. When genomic information was not available, predictions based on passport information improved efficiency compared with fully random strategies. Conclusions: Pre-trained large language models can inject useful agronomic knowledge into Bayesian optimization for germplasm evaluation, particularly by improving early-stage genotype selection, and can also support optimization when genomic information is unavailable. As models better handle long genomic sequences together with passport information, large-language-model-guided Bayesian optimization may become a practical and explainable decision-support approach for agricultural optimization.
bioinformatics2026-07-02v1Bridging Gene Expression and Morphology: A Cell Size Score and Its Applications Across Multiple Diseases and Physiological Contexts
Ji, X.; Cui, Q.Abstract
Cell size is a critical morphological parameter determining cellular functional homeostasis, yet existing large-scale transcriptomic databases lack direct cell size measurement data. By integrating high-resolution immunofluorescence images with transcriptomics, we identified 457 genes significantly correlated with cell area. Based on these findings, we developed an algorithm, Cell Size Score (CSS), to predict cell size from gene expression profiles. Validation across multiple independent datasets, including human cell lines, mouse models, and single-cell spatial transcriptomics, confirmed that CSS accurately predicts cell size. Furthermore, we observed a significant positive correlation between CSS and broad-spectrum chemotherapy drug resistance, suggesting that increased cell volume confers survival advantages to cancer cells. Moreover, CSS analysis of aging revealed sex-dependent, tissue-specific patterns of change, wherein male adipose and cardiac tissues exhibited progressive hypertrophy with age, while female reproductive organs showed significant atrophy. Additionally, CSS significantly increased in skeletal muscle after exercise, indicating that this metric can capture dynamic physiological adaptation processes. This study establishes a bridge between transcriptomics and cell morphology, providing novel insights into retrospectively analyzing the role of cell size in pathological and physiological processes such as cancer and aging using existing omics data, as well as understanding the molecular mechanisms underlying cell size regulation.
bioinformatics2026-07-02v1Permute-match tests detect significant correlations between time series despite nonstationarity and limited replicates
Yuan, A. E.; Shou, W.Abstract
Researchers frequently analyze correlations between pairs of time series by determining whether an observed correlation is stronger than expected under the null hypothesis of independence. However, the time series are often nonstationary, with statistical properties that change over time, thereby making standard tests invalid. If sufficient replicates exist, a trial-swapping permutation test can be performed that handles nonstationarity by comparing within-replicate correlations to between-replicate correlations. Although largely assumption-free, this test is fundamentally limited by the number of replicates (n) because its minimum p-value is 1/n!. With n=3, this minimum is 1/6, rendering thresholds like 0.05 unattainable. This limits its use considerably in animal experiments, where n may be as low as 3. We propose permute-match tests -- modified permutation tests that can report lower p-values of 2/n^n or 1/n^n under strong evidence of dependence. Permute-match tests guarantee a false positive rate at or below the significance level when replicates are independent and identically distributed. The bound of 1/n^n is not gratuitously conservative, since it cannot be further lowered without additional assumptions. We demonstrate our approach using synthetic data and apply it to an existing dataset with 3 independent groups of zebrafish, confirming the observation that zebrafish swim faster when directionally aligned.
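The 1/n! floor on the trial-swapping test follows directly from enumerating every way of re-pairing the replicates. A minimal sketch, assuming the test statistic is the summed Pearson correlation over paired replicates (the abstract does not specify the statistic, and the function names are hypothetical); it illustrates the standard trial-swapping test only, not the proposed permute-match tests.

```python
from itertools import permutations
from math import factorial, sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / sqrt(va * vb)


def trial_swap_pvalue(xs, ys):
    """Exact p-value over all n! re-pairings of the x and y replicates."""
    n = len(xs)
    observed = sum(pearson(xs[i], ys[i]) for i in range(n))
    at_least = 0
    for perm in permutations(range(n)):
        stat = sum(pearson(xs[i], ys[perm[i]]) for i in range(n))
        # The identity pairing is always counted, so p can never fall below 1/n!.
        if stat >= observed - 1e-12:
            at_least += 1
    return at_least / factorial(n)


# Toy data: three replicates in which y tracks x perfectly within each replicate.
xs = [[1, 2, 3, 4, 5], [5, 1, 4, 2, 3], [2, 5, 1, 3, 4]]
ys = [list(x) for x in xs]
p = trial_swap_pvalue(xs, ys)
floor = 1 / factorial(len(xs))
```

Even with perfect within-replicate correlation, the smallest attainable p-value with three replicates is 1/6, which is the limitation the abstract describes.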
bioinformatics2026-07-01v5Age-related erosion of X chromosome inactivation in human tissues
Rocca, C.; Gylemo, B.; Edwards, M.; Cing, Z.; Gibbs, J. R.; Nestor, C. E.; DeCasien, A. R.Abstract
Age-related diseases often show sex differences, yet their molecular bases remain unclear. Animal models suggest that age-related disruption of X-chromosome inactivation (XCI) occurs in female mice. We test whether this phenomenon extends to humans using bulk and single-cell datasets. We find that age-dependent escape from XCI also occurs in human females, particularly among genes at the distal ends of the X-chromosome and those involved in genome stability. These findings provide preliminary evidence that XCI erosion represents a human female-specific aging process.
bioinformatics2026-07-01v2Generative design of antigen-specific T-cell receptor sequences with a conditional diffusion model
Zhang, Y.; Liang, W.; Xu, S.; Witney, M.; Su, X.; Andrews, M. C.; Rossjohn, J.; Purcell, A. W.; Wang, F.; Song, J.Abstract
T cell receptor (TCR)-based immunotherapy holds immense potential for treating cancers, autoimmunity, and infectious diseases, where antigen-specific TCR recognition is crucial for adaptive immune responses. Engineering or de novo generation of the complementarity-determining region 3 (CDR3) loops of TCRs using artificial intelligence offers a powerful alternative to laborious experimental screening for designing antigen-specific TCRs. However, current in silico approaches are constrained by weak conditional guidance, limited flexibility, and a lack of rigorous functional validation. To address these limitations, we introduce TCRDiff, a generative diffusion framework for designing antigen-specific TCRs conditioned on peptide-MHC (pMHC) targets and germline-encoded TCR variable genes. By leveraging pre-trained knowledge from massive T-cell repertoires and TCR-pMHC recognition data, TCRDiff generates CDR3β sequences that closely resemble native-binding TCRs via a denoising diffusion process. Furthermore, incorporating interface geometry features produced TCR-pMHC complexes with greater structural plausibility than those from models relying solely on sequence-based diffusion or structure-based modeling. As a proof of concept, we deployed TCRDiff in a systematic pipeline to design candidate TCRs against a clinically validated cancer antigen. In vitro activation assays validated that TCRDiff-generated TCRs efficiently recognize the MAGE-A3 epitope with minimal off-target reactivity. Thus, TCRDiff establishes a powerful, validated computational paradigm to accelerate the development of TCR-based immunotherapies.
bioinformatics2026-07-01v2MORPH Predicts the Single-Cell Outcome of Genetic Perturbations Across Conditions and Data Modalities
He, C.; Zhang, J.; Dahleh, M. A.; Uhler, C.Abstract
Modeling cellular responses to genetic perturbations is a significant challenge in computational biology. Measuring all gene perturbations and their combinations across cell types and conditions is experimentally challenging, highlighting the need for predictive models that generalize across data types to support this task. Here we present MORPH, a MOdular framework for predicting Responses to Perturbational cHanges. MORPH combines a discrepancy-based variational autoencoder with an attention mechanism to predict cellular responses to unseen perturbations. It supports both single-cell transcriptomics and imaging outputs and can generalize to unseen perturbations, combinations of perturbations, and perturbations in new cellular contexts. The attention-based framework enables inference of gene interactions and regulatory networks, while the learned gene embeddings can guide the design of informative perturbations, as demonstrated in two applications. Overall, MORPH is a flexible tool for optimizing perturbation experiments, enabling efficient exploration of the perturbation space to advance understanding of cellular programs for fundamental research and therapeutic applications.
bioinformatics2026-07-01v2AI-guided discovery for low-resource peptide engineering using evolutionary scale modeling
Andrekson, L.; Rydbergh, R.; Mercado, R.; Wenzel, M.Abstract
Reliable estimation of downstream performance in low-data peptide machine learning is critical for guiding early-stage AI-driven peptide engineering. Yet, it is often unclear how to assess whether a model will be effective in iterative discovery settings. Here, we show that the cross-validation R² score can serve as a simple and robust proxy for predicting active learning workflow performance, enabling early-stage evaluation of model suitability for sequential peptide optimization. To support this, we introduce SCARSE, a machine learning framework combining ESM-2 protein language model embeddings with Gaussian process regression and extremely randomized trees classification, designed for low-resource peptide property prediction (20-500 training samples). We benchmark SCARSE across 23 peptide and small-protein datasets covering substitution and indel variants, antimicrobial peptides, cell-penetrating peptides, and toxic/non-toxic peptides. SCARSE significantly outperforms a hand-engineered descriptor baseline on substitution and indel tasks, while comparable performance was achieved on shorter peptide non-mutant datasets where simpler descriptors capture enough of the signal. In simulated active learning workflows, SCARSE consistently outperforms baseline and random sampling strategies. Notably, we demonstrate that CV R² computed from as few as 50 labeled peptides can be sufficient to estimate final active learning end-point performance, providing a practical, data-efficient criterion for deciding whether a given dataset combined with SCARSE is suitable for iterative peptide discovery. SCARSE is released as a pip package and is available via HuggingFace Spaces to facilitate integration into peptide engineering workflows.
bioinformatics2026-07-01v1MintCNA: A Unified Framework for Integrative Copy Number Profiling with Single-Cell Multi-Omics Data
Bao, W.; Qin, F.; Xiao, F.Abstract
Chromosomal copy number alterations (CNAs) are key drivers of tumor evolution, disease progression and therapeutic resistance, and identifying them is an important step in delineating tumor clonal structure. However, accurately resolving CNA landscapes from single-cell data remains challenging. Most existing tools analyze one omics layer at a time and are susceptible to assay-specific noise, limiting their ability to recover shared or modality-specific CNAs. Recent single-cell multi-omics techniques enable joint sequencing of multiple molecular layers in the same cells, yet in silico methods that fully exploit such complementary multi-modal data for CNA analysis are still missing. Here we present MintCNA, a unified single-cell multi-omics integration framework for CNA detection from paired multi-omics data. MintCNA integrates traditional statistical modeling with an embedded deep learning structure to enhance CNA profiling across multi-omics. We use an attention-guided convolutional autoencoder for data denoising and perform multivariate change-point detection using a sliding-window screening and ranking procedure. Missingness-adjusted CUSUM statistics are constructed that jointly aggregate omics features through a data-adaptive projection to detect genome-wide chromosomal breakpoints. Across various simulations and applications to a colorectal cancer multi-omics dataset, MintCNA consistently outperforms existing single-omics CNA callers in detection accuracy. MintCNA provides a single-cell CNA tool that integrates paired scDNA-seq and scRNA-seq, supporting the study of intra-tumor heterogeneity and tumor evolution.
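The CUSUM idea behind breakpoint detection can be shown on a single noiseless track. The following is a minimal sketch of the textbook univariate mean-shift CUSUM statistic, under the assumption of one change point in one signal; it is not MintCNA's missingness-adjusted, multivariate statistic, and the function name `cusum_changepoint` is hypothetical.

```python
from math import sqrt


def cusum_changepoint(x):
    """Return (t, statistic) for the split that best separates two mean levels.

    For each split t, the statistic is sqrt(t * (n - t) / n) * |mean(x[:t]) - mean(x[t:])|;
    the change point is taken as the t that maximizes it.
    """
    n = len(x)
    best_t, best_stat = None, 0.0
    for t in range(1, n):
        left_mean = sum(x[:t]) / t
        right_mean = sum(x[t:]) / (n - t)
        stat = sqrt(t * (n - t) / n) * abs(left_mean - right_mean)
        if stat > best_stat:
            best_t, best_stat = t, stat
    return best_t, best_stat


# Toy copy-number-like track: six bins at level 2, then four bins at level 3.
signal = [2.0] * 6 + [3.0] * 4
breakpoint_index, statistic = cusum_changepoint(signal)
```

On this toy track the maximum falls at the sixth bin boundary, where the level actually changes; real data would additionally need noise handling and a significance threshold.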
bioinformatics2026-07-01v1Direct probabilistic quantification of mosaic loss of chromosome Y from sequencing data
Lin, J.-R.; Chang, Y.-C.; Maslov, A. Y.; Song, Y.; Gao, T.; Shan, J.; Bennett, D. A.; Milman, S.; Barzilai, N.; Vijg, J.; Montagna, C.; Zhang, Z.Abstract
Loss of chromosome Y (LOY) is the most common aneuploidy in aging men and is increasingly recognized as a marker of aging and genomic instability. Because LOY occurs in mosaic form, its degree reflects the fraction of cells lacking the Y chromosome. Existing SNP-array- and sequencing-based methods rely largely on single genomic features and indirect transformations to estimate this fraction. We developed BaySeq-Y, a Bayesian method that directly estimates LOY mosaicism from sequencing data using VCF files with read depth (DP) and allelic depth (AD). Within a rigorous Bayesian framework, BaySeq-Y integrates complementary LOY-associated genomic features, including decreased read depth and allelic imbalance, and can additionally leverage haplotype phasing to improve precision. In simulations and fluorescence in situ hybridization (FISH) validation, BaySeq-Y provided accurate estimates and outperformed existing methods. Applications to ROSMAP and GTEx supported its biological relevance through transcriptomic validation, demonstrating its utility for quantifying LOY across diverse sequencing datasets.
bioinformatics2026-07-01v1MCD Stitcher: An open-source tool for whole-slide stitching and conversion of Imaging Mass Cytometry data
Chaurasia, P.Abstract
Imaging Mass Cytometry (IMC) combines metal-tagged antibody labelling with laser ablation mass spectrometry to generate highly multiplexed spatial images of tissue sections. However, the area that can be acquired within a single region of interest (ROI) is limited by hardware and software constraints, requiring large tissues to be imaged as multiple tiled ROIs. Reconstructing these ROIs into whole-slide images requires additional processing, while the proprietary .mcd file format can hinder integration with standard bioimage analysis workflows. Here, we present MCD Stitcher, an open-source Python package for converting .mcd files into OME-TIFF images with automated whole-slide stitching. The tool supports rectangular and polygonal ROIs, accommodates variable pixel sizes between ROIs, and uses memory-aware chunked reading during data ingestion to process large datasets on standard workstations. The generated OME-TIFF outputs preserve spatial, channel, and acquisition metadata for downstream analysis in tools such as QuPath, napari, and ImageJ/Fiji. MCD Stitcher provides a reproducible workflow for converting raw IMC data into interoperable image formats, enabling whole-slide spatial analysis without reliance on vendor-specific software.
bioinformatics2026-07-01v1Phenotypic inference from sparse tumor genomes informs an explainable deep-learning model for cancer prognosis
Grant, S.; Nath, A.Abstract
Somatic genomic alterations are widely profiled in cancer and remain the primary source for personalized therapy, yet their clinical utility is limited to a few actionable targets. AI/ML models offer opportunities to capture genome-wide complexities, but clinical translation is hindered by poor interpretability: explanations are often limited to single-gene effects and overlook higher-order phenotypic interactions. To address this, we developed PhenoMap, a machine-learning framework that infers tumor phenotypic states from somatic variants. Trained on 9,000 pan-cancer genomes and transcriptomes, PhenoMap accurately reconstructs expression-based pathway enrichment scores and consolidated hallmark cancer phenotypes, enabling multilevel interpretation at phenotype, pathway, and gene scales. PhenoMap captured molecular subtypes and key resistance pathways across breast, lung, and brain cancers. We leveraged these features in PhenoSurv, a deep survival model integrating phenotypic reconstruction loss, Kullback-Leibler divergence, and survival loss to learn biologically grounded predictors. PhenoSurv outperformed state-of-the-art survival models while providing robust mechanistic explanations. NOTCH1 signaling and SMARCA4 mutations emerged as major prognostic factors in hormone receptor-positive breast cancer. TGF-β signaling and inflammasomes, potentially modulated by FAT1, predicted lung adenocarcinoma outcomes, while inositol metabolism and PI3K signaling were key drivers in brain cancer. Together, PhenoMap and PhenoSurv provide accurate, interpretable, and clinically actionable models for precision oncology.
bioinformatics2026-07-01v1Tabular Foundation Models Are Competitive Cellular Perturbation Predictors Across Biological Scales
Palla, G.; Hillsley, A.; Kim, Y.-J.; Royer, L. A.Abstract
Predicting how cells respond to genetic and chemical perturbations is a central challenge in drug discovery and functional genomics. A growing ecosystem of specialized single-cell foundation models has been developed to address this problem, yet their practical advantage over domain-agnostic approaches remains unclear. Here we evaluate the power of Tabular Foundation Models such as TabICL and TabPFN, general-purpose pre-trained regression models, against domain-specific architectures including PRESAGE, scGPT, scLAMBDA, STACK and Prophet across four complementary evaluation settings: cell-level in-context cross-cell-type prediction, pseudobulk perturbation prediction on five Perturb-seq datasets of cell lines, a genome-wide CRISPR screen in primary human CD4+ T cells, and embryo-level cell-type composition prediction in a zebrafish developmental perturbation atlas. In the cell-level cross-cell-type perturbation prediction, Tabular Foundation Models perform on par with or better than specialized models. On pseudobulk perturbation prediction, Tabular Foundation Models consistently outperform specialized baselines across multiple evaluation metrics and datasets. On whole-embryo cell-type composition prediction, Tabular Foundation Models are competitive with specialized baselines. These results demonstrate that general-purpose tabular in-context learning provides a strong and scalable alternative to bespoke biological architectures for perturbation response modeling across cell systems and scales.
bioinformatics2026-07-01v1Penumbria: Advanced 3D cell segmentation for biomedical imaging
Stockert, L.; Donovan, J.; Baier, H.Abstract
Quantitative analysis of three-dimensional cellular architecture is fundamental to understanding tissue organization, disease progression, and drug response. Yet 3D cell segmentation remains a critical bottleneck due to diverse cell morphologies, low signal-to-noise ratios, and data scarcity. We introduce Penumbria, a general-purpose 3D cell segmentation framework that achieves state-of-the-art accuracy across morphologically distinct cell populations and imaging conditions in volumetric microscopy. Penumbria formulates segmentation as a regression problem on distances to cell boundaries, supporting instance reconstruction without shape priors and permitting end-to-end GPU inference. A U-Net-based architecture with xLSTM bottleneck blocks and patch embeddings enables multi-scale feature extraction, long-range modeling of spatial context, and convolutional feature-volume tokenization. The model is extended with two modules: a Global Zernike Phase Layer, which learns Zernike-parameterized phase corrections in the frequency domain to undo optical aberrations such as defocus and tilt, and a Scaled Geocaps Layer, which samples features at fixed grid locations across multiple spatial scales, routing evidence between them such that a detection is only confident where concordance holds across scales simultaneously. Across four diverse 3D datasets selected to probe the limits of existing methods, Penumbria outperforms Cellpose-SAM across all evaluation thresholds and surpasses StarDist-3D on most datasets while matching it on Parhyale hawaiensis. Trained entirely from scratch, Penumbria achieves up to a 38% improvement in mean average precision over the second-best method. Strong boundary accuracy further supports downstream analyses such as quantifying membrane dynamics or protein localization.
bioinformatics2026-07-01v1BOSE: A Bayesian Order Statistics-Based Estimator for Recovering the Sample Mean and Standard Deviation
Pan, W.; Lu, Z.; Jiang, W.; Lim, J.; Xu, L.; Wang, X.Abstract
In meta-analyses of continuous outcomes, the sample mean and standard deviation (SD) are essential for synthesizing effect sizes across studies. However, clinical studies frequently report alternative summary statistics, such as the median, quartiles, and range. To enable inclusion of such studies, various methods have been proposed to estimate the sample mean and SD from these reported summaries. We propose the Bayesian Order Statistics-based Estimator (BOSE), which leverages the joint likelihood of observed order statistics together with weakly informative priors to obtain the full posterior distribution for the mean and SD without relying on computationally intensive iterative procedures such as Markov chain Monte Carlo algorithms. Our numerical studies demonstrate that BOSE performs competitively with existing approaches in estimating the mean, while achieving superior performance for estimating the SD across all evaluated scenarios, particularly in small-sample settings. Under non-normal distributions including skewed, heavy-tailed, and bimodal settings with mild or moderate deviations from normality, BOSE remains robust and stable, whereas methods specifically designed for skewed distributions may become unstable or even inapplicable. Beyond point estimation, BOSE naturally provides empirically validated posterior credible intervals, enabling researchers to formally quantify uncertainty for study-level estimates and make reliable, evidence-based decisions in meta-analytic research synthesis. A publicly accessible web application implementing BOSE and competing methods is also provided to facilitate practical use in meta-analytic research.
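For context on the kind of conversion the abstract describes, here is a minimal sketch of a simple normal-theory baseline for recovering a mean and SD from reported quartiles. This is not BOSE: it assumes approximately normal data and a large sample, uses the average of the three quartiles for the mean, and rescales the interquartile range by the normal quantile for the SD. The function name is hypothetical.

```python
from statistics import NormalDist


def mean_sd_from_quartiles(q1, median, q3):
    """Approximate the sample mean and SD from the three quartiles.

    Assumption: data are roughly normal and the sample is large, so that
    the interquartile range is about 2 * z_0.75 standard deviations wide.
    """
    mean_est = (q1 + median + q3) / 3.0
    z75 = NormalDist().inv_cdf(0.75)  # standard normal 75th percentile, about 0.6745
    sd_est = (q3 - q1) / (2.0 * z75)
    return mean_est, sd_est


if __name__ == "__main__":
    # Hypothetical reported summary: Q1 = 8.65, median = 10.0, Q3 = 11.35.
    print(mean_sd_from_quartiles(8.65, 10.0, 11.35))
```

Point estimators of this type return no measure of uncertainty, which is the gap the posterior credible intervals described above are meant to address.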
bioinformatics2026-07-01v1mirCCC: Repression-aware graph learning for miRNA-mediated cell-cell communication inference
Chen, Y.; Cui, J.; Zhang, S.; Liu, E.; Xie, L.; Feng, C.; Chen, M.Abstract
Cell-cell communication analyses usually focus on protein ligands and receptors and therefore miss the extracellular vesicle-mediated transfer of microRNAs, an important route of signalling in cancer. Here, we show that microRNA-mediated communication can be inferred from standard single-cell RNA sequencing by detecting coordinated decreases in the expression of validated miRNA target genes. We developed mirCCC, a computational framework that estimates cell-specific microRNA activity, models cellular sending and receiving capacities for extracellular vesicle transfer, and learns microRNA-resolved communication graphs from transcriptomic data. In synthetic benchmarks with strong confounding signals, mirCCC improved, whereas all comparison methods declined. Applied to a human colorectal cancer atlas, mirCCC recovered known colorectal cancer-associated microRNAs and identified stromal- and myeloid-to-epithelial communication converging on a plasticity program linked to TGF-β and Wnt/β-catenin signalling. These results provide a practical route for studying extracellular vesicle-mediated communication in existing single-cell atlases.
bioinformatics2026-07-01v1GeneBench-Pro: Evaluating Multistage Statistical Reasoning in Genomics, Quantitative Biology, and Translational Biomedicine
Li, J. H.; Ho, A. J.Abstract
We introduce GeneBench-Pro, an expanded and improved version of GeneBench that comprises harder problems across a wider breadth of domains. GeneBench-Pro is a benchmark for AI agents performing realistic multi-stage scientific analyses in genomics, quantitative biology, and translational biomedicine. It seeks to capture the complexity of real-world problems that computational life scientists face when tasked with producing a conclusion upon which a downstream scientific or translational decision is contingent. The benchmark comprises 129 evaluations targeting quantities of direct practical relevance across 10 primary domains and 21 terminal subdomains, with a genomics-centered core. Similarly to GeneBench, each problem provides the agent with brief context, a target estimand, and minimal guidance otherwise; the agent must then navigate multiple dependent decision points (i.e., substantive inferential forks where a plausible wrong choice changes the downstream analysis) to identify and execute the correct analysis workflow and arrive at the correct answer. Relative to GeneBench, GeneBench-Pro adds 29 new problems, drops three, and introduces significantly redesigned versions of 54 of the remaining 100 overlapping problems. 82 of the 129 problems were reviewed by external domain experts, whose findings led to prompt/data modifications and redesign of those problems whose targets were not sufficiently identifiable. Ten externally reviewed problems are released publicly, 50 held-out problems were provided to Artificial Analysis for independent third-party model benchmarking, and the remainder are retained as an internal holdout. In evaluations over the full 129-problem suite, GPT-5.6 Sol reaches an eval-level pass rate of 28.7% at the max reasoning level, and GPT-5.6 Sol Pro reaches 31.5% in separately reported GPT Pro runs. GPT-5.5 reaches 12.0%, GPT-5.4 reaches 8.9%, and the strongest non-GPT baseline, Claude Opus 4.8, reaches 16.0%.
As with GeneBench, models often complete substantial portions of the workflow but exhibit a consistent gap between noticing and acting: they identify local diagnostic signals but fail to propagate the implications to the corresponding analysis decision. As a result, models often select wrong estimators or persist on initially plausible but incorrect analysis paths. GeneBench-Pro therefore measures an emerging capability of long-horizon biological reasoning that remains unreliable.
bioinformatics2026-06-30v1Structural Bioinformatics of Four Human Aquaporins and Their Water-Soluble QTY Analogs
Zhang, S.; Xiao, E.Abstract
Human aquaporins (AQPs) are essential membrane channels, yet their inherent hydrophobicity complicates structural and functional studies. We present the systematic application of the QTY code to human AQPs, integrating it with AlphaFold 3 structure prediction to show that four representative human AQPs (AQP1, AQP3, AQP4, AQP7) can be converted into water-soluble analogs while maintaining their conformation. This approach provides a novel platform for editing challenging membrane proteins. The QTY code was applied to the transmembrane regions of the four selected AQPs. Subsequently, the structures of the water-soluble QTY analogs of the four AQPs were predicted using AlphaFold 3. The predicted structures were superposed with cryo-EM- or X-ray-determined native structures in PyMOL. Further analyses included root-mean-square deviation (RMSD) calculations, visualization of hydrophobic surface reduction, and inspection of conserved protein-ligand binding ability. After applying the QTY code, sequence changes between native AQPs and their QTY analogs were substantial (42.86-48.80%). Nevertheless, their structures superposed well in analyses, with only slight deviations (RMSD < 0.6 Å). In addition, the surface hydrophobicity of all QTY-edited AQPs was significantly reduced. Importantly, molecular contacts between the cholesterol ligand and protein were largely preserved for both native AQP1 and its QTY analog. Finally, all AlphaFold 3-predicted structures for AQPs have high confidence values (pLDDT > 90; pTM ~0.83), supporting the reliability of the predicted structures. The findings demonstrate that membrane protein hydrophobicity can be edited and reduced without compromising fold integrity or functional architecture. Integration of the QTY code with AlphaFold 3 affords a high-throughput platform for designing water-soluble, structurally faithful analogs of challenging membrane proteins.
Such a strategy can provide a potent platform for detergent-free biochemical studies and water-soluble analogs for therapeutic monoclonal antibody discovery, thus advancing research on this pharmacologically important protein family.
bioinformatics2026-06-30v1Time-resolved inference of gene regulatory networks underlying human cranial neural crest development suggests novel risk genes for orofacial clefting.
Eibl, M.; Theiss, S.; Einarsson, H.; Vaagenso, C. S.; Krautz, R.; Gehringer, M.; Siewert, A.; Zhang, Y.; Rada-Iglesias, A.; Saez-Rodriguez, J.; Herrmann, C.; Ludwig, K. U.; Andersson, R.; Laugsch, M.Abstract
Cranial neural crest cells (CNCCs) play a central role in shaping the human head and face. Aberrant CNCC differentiation contributes to craniofacial birth defects, particularly non-syndromic cleft lip with or without cleft palate (nsCL/P), one of the most common congenital disorders. Although the number of genetic variants associated with this condition is steadily increasing, it remains challenging to determine if and how these variants may contribute to disease development. The majority of these variants lie within non-coding regulatory elements that govern cell-type and stage-specific gene expression, which is orchestrated by dynamic gene regulatory networks (GRNs). Despite extensive work in model organisms, a time-resolved, multi-omics perspective of GRNs controlling CNCC differentiation in a human system is still lacking. To fill this gap, we generated paired transcriptomic and chromatin accessibility data at four timepoints during in vitro differentiation of CNCCs derived from human induced pluripotent stem cells. Integrating these two modalities enabled time-resolved inference of GRNs and identification of dynamic regulatory relationships, including stage-specific roles of core transcription factors. Leveraging these time-resolved GRNs, we mapped 29 nsCL/P associated variants linked to 70 putative target genes, with 40 located outside the associated genomic loci, suggesting novel distal regulatory relationships. Integration of these data with complementary time-course scRNA-seq data revealed an ectomesenchymal-biased subpopulation of CNCCs as particularly sensitive to genetic variants associated with nsCL/P. We provide a time-resolved inference of GRN in human CNCC differentiation, allowing us to determine the dynamics of stage-specific core regulatory programs that are otherwise missed in analyses based on a single time snapshot. 
To our knowledge, the data represent the first multi-omics map of human CNCCs with temporal resolution, which expands the understanding of early human craniofacial development, refines variant-to-gene assignment, and prioritizes candidate risk genes and cell states relevant to nsCL/P. Our findings demonstrate the relevance of studying the dynamics during differentiation rather than just one fixed timepoint and offer a valuable basis for further investigation of non-coding variation in CNCC-related disorders.
bioinformatics2026-06-30v1Impacts of batch effects on the performance of machine learning classifiers across multiple studies
Raab, P.; Johnson, W. E.; Piccolo, S. R.Abstract
Precision medicine relies on accurate and generalizable predictions for patients across the spectrum of human diversity. Because capturing biological heterogeneity requires large sample sizes, researchers must often aggregate data from several experimental batches or independent studies. This integration allows for greater statistical power and diversity than a single study could provide, while avoiding the costs of generating massive new -omics datasets. Predictive models trained on these aggregated data are theoretically better equipped to detect subtle patterns that generalize to new data. However, this potential is frequently undermined by "batch effects"--systematic technical artifacts that can bias model training to predict experimental batches and shadow meaningful biological conditions. Models trained on data with batch effects can exhibit substantially degraded performance when applied to data from new batches. Statistical adjustment methods can mitigate these artifacts while preserving biological signals. To ensure these adjustments actually facilitate generalization, we emphasize the use of external, independent cohorts for rigorous validation. This chapter examines how batch effects impact predictions and compares various adjustment methods.
bioinformatics2026-06-30v1A pan-cancer benchmark of integrated ferroptosis, cuproptosis and disulfidptosis prognostic signatures
Demir, A. Y.; Yasar, E.Abstract
Integrated prognostic signatures combining ferroptosis, cuproptosis, and disulfidptosis are increasingly reported in oncology as advances in risk stratification, yet their added value over simpler pathway-specific or proliferation-related models remains unclear. Here, we developed an integrated regulated cell-death signature and evaluated it through an adversarial pan-cancer benchmark. Using the TCGA pan-cancer cohort comprising 9,808 tumours across 33 cancer types, we curated 118 genes associated with the three cell-death programmes, characterised inter-pathway crosstalk, and derived a 26-gene LASSO-Cox risk signature. The model showed reproducible prognostic performance across cancers, with a pan-cancer concordance index of 0.573 (95% CI, 0.552-0.594), and was independently validated in METABRIC and CGGA cohorts, remaining significant after adjustment for standard clinical variables. However, benchmarking revealed that the integrated signature, although superior to size-matched random gene sets (empirical p < 0.001), did not outperform a ferroptosis-only model (DeLong p = 0.81), indicating no measurable gain from pathway integration. Moreover, much of the prognostic signal reflected tumour proliferation rather than regulated cell death. After adjustment for the proliferation meta-signature (meta-PCNA), ferroptosis performance declined from 0.573 to 0.504, while the integrated model decreased to 0.554. High-risk tumours were more sensitive to anti-proliferative drugs, and the risk score was most strongly associated with E2F, MYC, and G2M target programmes. The signature stratified prognosis but did not predict immune-checkpoint blockade response in IMvigor210 (AUC ≈ 0.50). Importantly, the underlying biology was not merely a modelling artefact. Signature genes showed concordance with protein abundance in CPTAC cohorts, and the three cell-death programmes co-varied within individual malignant cells, with correlations ranging from ρ = 0.46 to 0.66.
Overall, our findings indicate that integrated multi-death signatures are reproducible and biologically grounded, yet prognostically redundant and substantially confounded by proliferation. This study provides a cautionary benchmark for the rapidly expanding use of composite regulated cell-death signatures in cancer prognosis.
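The concordance index quoted in this abstract can be read as the fraction of comparable patient pairs that a risk score orders correctly, which is why 0.5 corresponds to chance. The following is a minimal sketch of Harrell's C-index for right-censored data. It is a generic illustration rather than the authors' pipeline; the function name and the toy inputs are hypothetical.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored survival data.

    times: follow-up times; events: 1 if the event was observed, 0 if censored;
    risk_scores: higher score means higher predicted risk (shorter survival).
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i had an observed event
            # strictly before subject j's follow-up time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0   # earlier event has higher risk: correct order
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5   # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


if __name__ == "__main__":
    # Toy example: risk decreases as survival time increases, so ordering is perfect.
    print(concordance_index([1, 2, 3, 4], [1, 1, 1, 1], [4, 3, 2, 1]))
```

On this scale, the adjusted ferroptosis value of 0.504 reported above sits essentially at chance.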
bioinformatics2026-06-30v1Integrating Semantic Retrieval, LLM-based Refinement, and Structured Expert Curation for Scalable AOP Gene Mapping
Schaffert, A.; Fratello, M.; Kangas, K.; Torres Maia, M.; del Giudice, G.; Mobus, L.; Accardi, C.; Al-Abdulraheem, Z.; Campini, L.; Galardo, F.; Federico, A.; Ciancaleoni, G.; Juppi, H.-K.; Paparella, M.; Serra, A.; Greco, D.Abstract
Toxicogenomics can support regulatory toxicology, but its use is limited by the difficulty of translating molecular responses into mechanistic, decision-relevant interpretations. Adverse Outcome Pathways (AOPs) provide a framework for this translation, yet omics applications require scalable mapping of Key Events (KEs) to molecular features. Here, we present an AI-assisted, multi-step workflow for KE-to-gene mapping that uses embedding-based semantic retrieval to identify candidate ontology/pathway terms, large language model-assisted refinement to filter these candidates, and double-independent expert group curation with rule-based consolidation to finalize mappings and derive confidence scores. Compared with earlier NLP-based approaches, the workflow improves KE-to-ontology/pathway mapping performance and generates candidate annotations that better align with expert judgment while substantially reducing the need for manual augmentation. Explicit gene and protein mentions in KE titles were additionally grounded to improve specificity, and each curated mapping was assigned curator reason codes to support transparent, traceable, and confidence-aware reuse. Applied across AOP-Wiki, the workflow produced a comprehensive KE-to-gene set resource covering 1,254 KEs across 523 AOPs and linking 15,833 human genes. Utility is demonstrated through CTD-based AOP fingerprinting of curated reference chemical groups, highlighting expanded coverage and confidence-informed interpretation of chemical-associated gene signatures in an AOP context. The workflow and resulting resource provide a practical bridge between toxicogenomics and AOP-based mechanistic interpretation and support routine updating and future extension to additional omics layers within OECD Omics2AOP.
bioinformatics2026-06-30v1A High-Quality Acetylation Dataset Reveals Modest Data Requirements for Transfer Learning to Identify Little Studied Post-Translational Modifications
Hartmaring, Y.; Wang, S.; Jones, A. R.; Vizcaino, J. A.; Schlaffner, C. N.; Renard, B. Y.Abstract
Dysregulation of post-translational modifications (PTMs) is associated with severe pathologies, including cancers and Alzheimer's disease. Despite their biological importance, identifying modified peptides remains challenging due to the immense combinatorial search space. While searches benefit from prior knowledge of a peptide's modification status, the data scarcity for most PTMs hinders the development of accurate deep learning classifiers like AHLF (ad hoc learning of peptide fragmentation). Here, we overcome this data bottleneck for acetylation and ubiquitination. We harmonised a dataset with about 500,000 high-quality acetylated peptide-spectrum matches (PSMs) from nine publicly available acetylation-enriched datasets. We fine-tuned AHLF separately on the acetylation dataset and on a ubiquitination dataset of 2 million spectra, and assessed the minimum data requirement for training by iterative downsampling. Training separate models on SILAC and label-free subsets also assessed the impact of data diversity. The resulting acetylation and ubiquitination models achieve an AUC of 0.87 and 0.90, respectively. Beyond 28,500 acetylated spectra, corresponding to roughly 0.3% of the original model's training data, additional data provide only minor performance gains. Finally, we show that data diversity is beneficial for generalizability, while models trained on homogeneous data sources tend to overfit to their respective data type. All code and model weights are available at https://gitlab.com/dacs-hpi/ahlf-ptmai.
bioinformatics2026-06-30v1Real-World Progression-Free Survival with Erlotinib versus Osimertinib in EGFR L858R+T790M Compound Mutation Non-Small Cell Lung Cancer: An Exploratory Analysis of the MSK-CHORD Dataset
Dalloul, Z.; Abboud, A.; Dalloul, I.; Abdelsalam, M.Abstract
Background: Osimertinib is the standard first-line treatment for EGFR-mutant non-small cell lung cancer (NSCLC) harboring common activating mutations, including exon 19 deletions and L858R. It is also active against tumors with acquired T790M resistance. However, the EGFR L858R+T790M compound mutation, where both variants co-occur within the same tumor, may confer distinct drug-sensitivity profiles not predicted by either mutation alone. Limited data exist on comparative treatment outcomes in this rare genotype. Methods: Using the MSK-CHORD clinicogenomic dataset (n=24,950), we identified patients with concurrent EGFR L858R and T790M mutations receiving erlotinib (Erlo) or osimertinib (Osi) monotherapy. Real-world progression-free survival (rwPFS) per treatment line was calculated using a strict definition requiring confirmed radiological progression events (rwPFS-strict), excluding lines with null endpoint data. Kaplan-Meier analysis, log-rank testing, Cox proportional hazards regression, and cross-cohort heterogeneity testing (Cochran's Q statistic) were performed. Two control cohorts, L858R-only (n=372) and T790M-only (n=76), were analyzed in parallel to assess mutation-context specificity of treatment response. Results: Thirty-one patients with EGFR L858R+T790M were identified; 21 contributed evaluable monotherapy lines, yielding 23 Erlo and 15 Osi treatment lines (14 unique patients per treatment group, 7 contributing to both). Median rwPFS numerically favored Erlo over Osi (7.10 vs 5.32 months; HR 1.29, 95% CI 0.66-2.52; log-rank p=0.46). This directional trend was reversed in the L858R-only control cohort, where Osi demonstrated significant superiority (9.03 vs 5.75 months; HR 0.70, 95% CI 0.55-0.89; p=0.003). The T790M-only cohort showed no significant difference (HR 1.32, p=0.12). An exploratory post-hoc heterogeneity test confirmed a significant cross-cohort interaction (Q=9.94, df=2, p=0.007).
Conclusions: The expected osimertinib advantage was absent in L858R+T790M compound-mutant NSCLC. The opposing hazard ratio directions across mutation contexts (HR 1.29 vs 0.70), with a significant exploratory cross-cohort interaction (p=0.007), suggest that the EGFR L858R+T790M compound mutation may represent a pharmacologically distinct entity with differential TKI sensitivity. These hypothesis-generating findings warrant prospective validation.
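The cross-cohort heterogeneity test named in this abstract can be sketched as follows. This is a generic inverse-variance Cochran's Q on log hazard ratios, not the authors' code. It assumes standard errors are recovered from 95% confidence intervals on the log scale, and it uses the closed-form chi-square tail probability that holds only for 2 degrees of freedom. All function names are hypothetical.

```python
import math


def se_from_ci(lower, upper, z=1.96):
    """Standard error of a log hazard ratio recovered from its confidence interval."""
    return (math.log(upper) - math.log(lower)) / (2.0 * z)


def cochran_q(log_effects, std_errors):
    """Cochran's Q statistic and its degrees of freedom (inverse-variance weights)."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * y for w, y in zip(weights, log_effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, log_effects))
    return q, len(log_effects) - 1


def chi2_sf_2df(q):
    """Upper-tail probability of a chi-square variable with exactly 2 degrees of freedom."""
    return math.exp(-q / 2.0)


if __name__ == "__main__":
    # Standard error for the compound-mutation cohort, from the reported 95% CI.
    print(se_from_ci(0.66, 2.52))
    # p-value implied by the reported Q = 9.94 with df = 2.
    print(chi2_sf_2df(9.94))
```

With the reported Q = 9.94 and df = 2, the tail probability is exp(-9.94 / 2), approximately 0.007, which matches the p-value stated in the abstract. Recomputing Q itself would also require a standard error for the T790M-only cohort, whose confidence interval is not reported here.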
bioinformatics2026-06-30v1