
Breaking: Deep learning multi-dataset training improves CRISPR base-editing predictions

Researchers in Denmark, reporting today in Nature Communications, describe novel deep learning models that significantly improve the accuracy of predicting CRISPR base-editing outcomes by training simultaneously on multiple experimental datasets whilst tracking their origins. The approach allows users to tailor predictions to specific base editors and experimental conditions, addressing a longstanding challenge in the precise design of genome-editing experiments.

By: Gorm Palmgren - Nov. 7, 2025

»In previous Cas9-efficiency work, we could calibrate datasets using overlapping gRNAs, but here we had too few overlaps,« explains Jan Gorodkin at the University of Copenhagen. He is co-senior author of the paper that was published today in Nature Communications. He adds: »By labelling each gRNA by its dataset of origin, we could train effectively on all datasets at once without forcing them into a single unified scale.«

CRISPR base editors represent a refinement of classical CRISPR-Cas9 technology, enabling researchers to change single DNA letters without cutting both strands of the double helix. These molecular tools combine a modified Cas9 enzyme with a deaminase that chemically converts one DNA base into another. Adenine base editors convert A·T base pairs to G·C, whilst cytosine base editors convert C·G to T·A. However, the technology presents a complication: not only the intended base but also nearby "bystander" bases within an approximately eight-nucleotide window may be edited, creating multiple possible outcomes for any given gRNA.
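
As a concrete illustration of why a single gRNA can yield many outcomes, the short Python sketch below enumerates the possible ABE products of one protospacer, assuming that every adenine inside a fixed editing window may independently be converted to guanine. The window coordinates and the example sequence are illustrative assumptions, not values from the paper.

```python
from itertools import product

def abe_outcomes(protospacer, window=(3, 11)):
    """Enumerate hypothetical ABE outcomes: each adenine inside the
    editing window (0-based, half-open interval) may independently be
    converted to guanine. Window coordinates are illustrative."""
    start, end = window
    a_positions = [i for i in range(start, min(end, len(protospacer)))
                   if protospacer[i] == "A"]
    outcomes = []
    for edits in product([False, True], repeat=len(a_positions)):
        seq = list(protospacer)
        for pos, edited in zip(a_positions, edits):
            if edited:
                seq[pos] = "G"
        outcomes.append("".join(seq))
    return outcomes

# Two adenines in the window give 2**2 = 4 possible outcome sequences.
print(abe_outcomes("GACTAGCATGCTTACGGTAC"))
```

Real outcome frequencies are of course far from uniform, which is precisely what prediction models of this kind aim to capture.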

»We were surprised to find that the base editors evaluated in this study still have a large editing window, which can introduce unintended bystander edits,« notes Yonglun Luo, co-senior author of the paper, at Aarhus University, Denmark. »That underlines the importance of prediction tools that capture both efficiency and the full spectrum of editing outcomes.«

Predicting which outcomes will occur, and how efficiently, has proven difficult. Existing prediction tools have been trained on individual datasets, each with inherent limitations, such as platform-specific experimental biases, deaminase variants with distinct sequence preferences, and an abundance of low-efficiency gRNAs that skew model training. This fragmentation of the available data has meant that models perform well on data similar to their training set but struggle to generalise across different experimental conditions.

»Using datasets generated by different studies for training a better prediction model is challenging due to data incompatibility issues, caused by factors such as expression level of the base editor, different versions of base editors, and cell-type differences,« says Yonglun Luo. »Our dataset-aware training approach overcomes this challenge and enables more accurate prediction of base-editing gRNAs.«

Figure 1. High throughput generation of ABE and CBE editing data in cells. A) Experimental workflow for SURRO-seq library screening in HEK293T cells. B) Histogram distribution of base editing efficiency for ABE7.10 (top) and BE4 (bottom). Used under the Creative Commons license from Figure 1 in Sun et al. (2025) Nature Communications, https://doi.org/

The research team addressed this by generating substantial new experimental data and developing a novel training strategy. Using their previously established SURRO-seq technology (see Figure 1), which creates libraries pairing gRNAs with their target sequences integrated into the genome, they measured base-editing efficiency for approximately 11,500 gRNAs each for ABE7.10 and BE4-Gam base editors in HEK293T cells. After quality filtering, they obtained robust measurements for over 11,000 gRNAs per editor.

Analysis revealed that ABE7.10 exhibited highly specific adenine-to-guanine transitions at 97%, whilst BE4 showed 92% cytosine-to-thymine specificity. Both editors displayed peak activity at positions four through eight in the protospacer sequence. Notably, the team found positive correlations between base-editing efficiency and standard Cas9-induced indel frequency at the same sites, suggesting that Cas9 activity predictions could inform base-editing models.

The key innovation lies in how they integrated their data with existing published datasets. Rather than simply pooling data, they developed an architecture that trains simultaneously on multiple datasets whilst explicitly labelling each data point's origin. For adenine base editors, they incorporated five datasets: their own SURRO-seq data plus published data from the Song and Arbab studies and two Kissling datasets, covering both ABE7.10 and the engineered ABE8e variant. For cytosine base editors, they used three datasets, from the SURRO-seq, Song, and Arbab studies. The model architecture employs deep convolutional neural networks with multiple filter sizes to process the 30-nucleotide target sequence, alongside molecular features including gRNA-DNA binding energy and predicted Cas9 efficiency.
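
The article summarises the architecture only at this level of detail, but its general shape can be sketched. The PyTorch code below is a minimal illustration of a multi-filter convolutional model that combines a one-hot 30-nt sequence with binding energy, predicted Cas9 efficiency, and a dataset-origin vector; all layer sizes, the pooling choice, and the single-output head are assumptions, not the published CRISPRon-ABE/CBE architecture (which also predicts outcome frequencies).

```python
import torch
import torch.nn as nn

class BaseEditorNet(nn.Module):
    """Illustrative multi-filter CNN in the spirit of the described
    models; hyperparameters are placeholders, not published values."""
    def __init__(self, n_datasets=5, filter_sizes=(3, 5, 7), n_filters=64):
        super().__init__()
        # One convolutional branch per filter width over the (4 x 30) one-hot sequence.
        self.convs = nn.ModuleList(
            nn.Conv1d(4, n_filters, k, padding="same") for k in filter_sizes
        )
        # Concatenated conv features + binding energy + Cas9 efficiency + dataset vector.
        in_dim = n_filters * len(filter_sizes) + 2 + n_datasets
        self.head = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, 1)
        )

    def forward(self, seq_onehot, delta_gb, cas9_eff, dataset_vec):
        # seq_onehot: (batch, 4, 30); pool each branch across sequence positions.
        feats = [conv(seq_onehot).relu().mean(dim=2) for conv in self.convs]
        x = torch.cat(feats + [delta_gb, cas9_eff, dataset_vec], dim=1)
        return self.head(x)  # predicted overall editing efficiency

model = BaseEditorNet()
pred = model(torch.randn(8, 4, 30), torch.randn(8, 1),
             torch.randn(8, 1), torch.zeros(8, 5))
```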

The dataset labelling proves crucial. The team observed substantial heterogeneity across datasets: the Kissling ABE8e data showed predominantly high editing efficiencies of 60% to 80%, whilst the Kissling ABE7.10 data for identical gRNA sequences showed lower efficiencies of 30% to 50%, reflecting the engineered enzyme's three- to eleven-fold increase in activity. By encoding the dataset origin as a feature vector, the model learns these systematic differences during training. Critically, users can assign weights to the different datasets when making predictions, essentially tuning the model to match their own experimental conditions.
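
Continuing the hypothetical sketch above, dataset weighting at prediction time could look like the following; the dataset ordering and the weights are invented for illustration.

```python
import torch

# Hypothetical: reuse BaseEditorNet from the sketch above. At prediction
# time the dataset-origin input need not be a hard one-hot vector; users
# can supply weights reflecting how closely each training dataset matches
# their own experimental conditions.
model = BaseEditorNet(n_datasets=5)
# Order (illustrative): SURRO-seq, Song, Arbab, Kissling ABE7.10, Kissling ABE8e.
weights = torch.tensor([[3.0, 1.0, 1.0, 0.0, 0.0]])
weights = weights / weights.sum()  # normalise to a convex combination

pred = model(torch.randn(1, 4, 30),  # stand-in for a one-hot sequence
             torch.randn(1, 1),      # binding energy
             torch.randn(1, 1),      # predicted Cas9 efficiency
             weights)
```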

The models, termed CRISPRon-ABE and CRISPRon-CBE, predict both overall gRNA editing efficiency and the frequency of specific editing outcomes simultaneously. To jointly evaluate performance on both metrics, the researchers introduce two-dimensional correlation coefficients that extend the standard Pearson and Spearman measures. Testing on independent datasets demonstrated consistent superiority over existing methods, including DeepABE/CBE, BE-HIVE, BE-DICT, BE_Endo, and BEDICT2.0. Feature analysis confirmed that predicted Cas9 efficiency plays an important role in base-editor predictions, and omitting dataset labels during training decreased performance by approximately 10%.
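
The paper's exact definition of these two-dimensional coefficients is not reproduced in this article. As a rough stand-in for the idea of scoring both metrics with a single number, the sketch below standardises (or rank-transforms, for a Spearman-like variant) each metric and computes one Pearson coefficient over the concatenation; this construction is an assumption for illustration, not the published measure.

```python
import numpy as np
from scipy.stats import pearsonr, rankdata

def joint_corr(eff_true, eff_pred, out_true, out_pred, rank=False):
    """Illustrative joint score over efficiency and outcome-frequency
    predictions; NOT the paper's two-dimensional correlation."""
    def z(x):
        x = rankdata(x) if rank else np.asarray(x, dtype=float)
        return (x - x.mean()) / x.std()
    true = np.concatenate([z(eff_true), z(out_true)])
    pred = np.concatenate([z(eff_pred), z(out_pred)])
    return pearsonr(true, pred)[0]

rng = np.random.default_rng(0)
eff, out = rng.random(100), rng.random(300)
print(joint_corr(eff, eff + 0.1 * rng.standard_normal(100),
                 out, out + 0.1 * rng.standard_normal(300)))
```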

Figure 2. Schematic representation of the CRISPRon-ABE and CRISPRon-CBE prediction algorithm. The inputs of the models are the one-hot encoded 30nt target DNA sequence, labeling of the edited nucleotide positions within the editing window from the outcome sequence, the binding energy (ΔGB), the predicted Cas9 efficiency, and the labeling of the dataset source. Used under the Creative Commons license from Figure 3 in Sun et al. (2025) Nature Communications, https://doi.org/

The researchers acknowledge limitations in their current models. They have incorporated only three base editor variants (ABE7.10, ABE8e, and BE4), and the training data derive largely from HEK293T cells, supplemented only by datasets that include mouse embryonic stem cells and U2OS cells. Because different deaminases exhibit distinct sequence motif preferences, predictions for base editors not represented in the training data remain challenging. Nevertheless, the dataset integration strategy could apply to other CRISPR technologies where data heterogeneity poses challenges. The models are available both as a web server and as standalone software under an academic-use licence.

»We hope CRISPRon-ABE and CRISPRon-CBE will help other researchers select gRNAs giving the highest editing efficiency and the intended outcome,« says Yonglun Luo.

The study was led by Jan Gorodkin and Yonglun Luo at the University of Copenhagen and Aarhus University, respectively, in Denmark. It was published today, 7 November 2025, in Nature Communications under the title "Deep learning models simultaneously trained on multiple datasets improve base-editing activity prediction".

The CRISPRon-ABE and CRISPRon-CBE tools can be accessed via https://rth.dk/resources/crispr/. Sequencing data are deposited in the China National GeneBank Database under accession CNP0001031 and in the NCBI Sequence Read Archive under accession PRJNA1321836.

Tags

Article, CMN Highlights, Interview, News, Off-target, Base editors, Safety

CLINICAL TRIALS

Non-small Cell Lung Cancer, NSCLC (NCT06097962)
Sponsor: Base Therapeutics (Shanghai) Co., Ltd.

Gastric Cancer and Colorectal Cancer, CRC (NCT07166263)
Sponsor: Base Therapeutics (Shanghai) Co., Ltd.

Relapsed or Refractory Acute Myeloid Leukemia, AML (NCT06541444)
Sponsor: Base Therapeutics (Shanghai) Co., Ltd.