Introduction

A strategy of single-cell low-coverage whole genome sequencing (SLWGS) is suited for the detection of chromosomal aberrations1. Typically, next-generation sequencing (NGS) requires nanogram amounts of DNA to construct a library for sequencing2, whereas a single cell only contains 6–7 pg of genomic DNA (gDNA). Therefore, a critical step for single-cell sequencing is whole-genome amplification (WGA) to generate sufficient DNA for library construction.

Three WGA methods are widely used for SLWGS, namely, degenerate-oligonucleotide-primed polymerase chain reaction (DOP-PCR) (marketed as WGA4 kit; Sigma-Aldrich, St. Louis, MO, US)2, multiple displacement amplification (MDA) (marketed as REPLI-g Single Cell Kit; QIAGEN, Germantown, MD, US)3, and a combination of displacement pre-amplification and PCR amplification (marketed as PicoPLEX WGA Kit; Rubicon Genomics, Ann Arbor, MI, US)4. Many comparisons have evaluated the efficiency among these WGA kits5, [...] (http://github.com/iontorrent/tmap) was employed to perform the alignment and produce output in BAM format. The mapping parameters were tmap mapall -v -Y -u -o 2 -a 0 -n 6 stage1 map4; that is, the mapping method and the alignment output mode (-a) were set to “mapall” and “0”, respectively. “mapall” indicates a multi-mapping procedure, whereas “0” indicates that only the unique best-hit reads are output. After duplicates were removed based on POS (the mapping position), the unique non-duplication reads were used for further analysis.
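The duplicate-removal step can be illustrated with a minimal sketch. It assumes the alignments have already been parsed into simple records; the `Alignment` layout, the function name `remove_position_duplicates`, and the choice to include the strand in the duplicate key are hypothetical, because the text states only that duplicates were removed on POS and does not name the tool used.

```python
from typing import Iterable, List, NamedTuple


class Alignment(NamedTuple):
    """Minimal stand-in for one BAM record (hypothetical layout)."""
    name: str
    chrom: str
    pos: int        # 1-based leftmost mapped position (the SAM POS field)
    reverse: bool   # True if the read maps to the reverse strand


def remove_position_duplicates(reads: Iterable[Alignment]) -> List[Alignment]:
    """Keep the first read seen at each (chrom, pos, strand) key."""
    seen = set()
    kept = []
    for read in reads:
        key = (read.chrom, read.pos, read.reverse)  # assumed duplicate key
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept


reads = [
    Alignment("r1", "chr1", 100, False),
    Alignment("r2", "chr1", 100, False),  # same POS and strand as r1: dropped
    Alignment("r3", "chr1", 100, True),   # same POS, opposite strand: kept
    Alignment("r4", "chr2", 100, False),
]
unique_reads = remove_position_duplicates(reads)
print([r.name for r in unique_reads])
```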

GC-bias calculation

GC content is the proportion of G and C bases in a specific region, and GC-bias describes the bias in read depth resulting from the GC content, as reported previously12. The bias leads to abnormal sequencing depth in a specific genomic region, which potentially influences the uniformity of read distribution. Moreover, NGS-based CNV detection methods fall into two primary categories: paired-end mapping (PEM) and depth of coverage (DOC)13,14. Most CNV detection tools are designed based on the DOC method14. A dependence of coverage depth on GC content can therefore reduce the accuracy of CNV detection. To describe the GC-bias in WGA, we followed the method of Rieber et al.15.

Let R1, R2, …, RW represent the numbers of unique non-duplication mapped reads that align to each of the W windows.

$${\rm{Total}}\,{\rm{variance}}:\,TV=\frac{1}{W}\sum _{w=1}^{W}{({R}_{w}-M)}^{2}$$
(1)
$${\rm{Variance}}\,{\rm{after}}\,{\rm{G}}+{\rm{C}}\,{\rm{loess}}\,{\rm{fit}}:\,LV=\frac{1}{W}\sum _{w=1}^{W}{({R}_{w}-{L}_{w})}^{2}$$
(2)
$${\rm{Contribution}}\,{\rm{of}}\,{\rm{G}}+{\rm{C}}\,{\rm{bias}}\,{\rm{to}}\,{\rm{total}}\,{\rm{variance}}:\,{\rm{\Delta }}{R}_{GC}=1-\frac{LV}{TV}$$
(3)

where M represents the average number of unique non-duplication mapped reads per autosomal window, Lw is the value obtained for window w from a loess local regression fit of the unique non-duplication mapped reads against the G + C content, and ΔRGC is the quantitative measure of GC-bias. A small ΔRGC value indicates less severe GC-bias. However, ΔRGC is a relative measure and can be influenced by WGA uniformity.
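Equations (1) to (3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `loess_fit` is a simplified, hypothetical stand-in for a loess fit (tricube-weighted local linear regression without robustness iterations), the span `frac = 0.5` is arbitrary, and the read counts are synthetic. None of this is taken from the pipeline used in the study.

```python
import math
import random
from typing import List, Sequence


def loess_fit(x: Sequence[float], y: Sequence[float], frac: float = 0.5) -> List[float]:
    """Tricube-weighted local linear regression evaluated at every x[i]."""
    n = len(x)
    k = max(3, int(math.ceil(frac * n)))           # neighbours per local fit
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[min(k, n) - 1] or 1e-12          # local bandwidth
        w = [max(0.0, 1 - (abs(xi - x0) / h) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


def delta_r_gc(reads: Sequence[float], gc: Sequence[float], frac: float = 0.5) -> float:
    """Return 1 - LV / TV, following Eqs. (1)-(3)."""
    n_windows = len(reads)
    mean_reads = sum(reads) / n_windows                                  # M
    tv = sum((r - mean_reads) ** 2 for r in reads) / n_windows           # Eq. (1)
    fit = loess_fit(gc, reads, frac)                                     # L_w
    lv = sum((r - l) ** 2 for r, l in zip(reads, fit)) / n_windows       # Eq. (2)
    return 1 - lv / tv                                                   # Eq. (3)


# Synthetic windows: one sample whose depth depends on GC, one that does not.
rng = random.Random(1)
gc = [rng.uniform(0.30, 0.60) for _ in range(200)]
biased = [100 + 400 * (g - 0.45) + rng.gauss(0, 10) for g in gc]
unbiased = [100 + rng.gauss(0, 10) for _ in gc]
biased_score = delta_r_gc(biased, gc)
unbiased_score = delta_r_gc(unbiased, gc)
print(f"GC-dependent sample: {biased_score:.2f}; GC-independent sample: {unbiased_score:.2f}")
```

In this toy case most of the variance of the GC-dependent sample is explained by the fit, so its score is close to 1, whereas the GC-independent sample scores close to 0.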

Data analyses

Window selection, GC-bias correction and copy number analysis were performed as reported previously12. In brief, the reference genome (GRCh37, UCSC release hg19) was cut into sliding simulated SE50 (single-end 50 bp) reads, which were mapped back to the original reference genome with a maximum of two mismatches. Each window contained 100 K simulated uniquely mapped reads, and consecutive windows were allowed to overlap by 20 K reads. The GC content of each window was calculated and used for the GC-bias correction. The normalized depth ratio (NDR) is the number of unique mapped non-duplication reads in each window divided by the average number of unique mapped non-duplication reads per window; it was used to calculate the coverage and to evaluate the reproducibility and uniformity. Additionally, we used the algorithm of Zhang et al.12 to detect CNVs. To remain as close to the characteristics of the human reference genome as possible, we used the optimized dynamic window size to call CNVs. After the GC-bias correction and binary segmentation, we identified the CNV breakpoints. Sensitivity and specificity were calculated as follows:

$${Sensitivity}=\frac{{TPR}}{({TPR}+{FNR})}$$
(4)
$${Specificity}=\frac{{TNR}}{({TNR}+{FPR})}$$
(5)

where FNR is the false negative rate, which equals the false negative signal number divided by the total true positive signal number; FPR is the false positive rate, which equals the false positive signal number divided by the total true negative signal number; TNR is the true negative rate, which equals the true negative signal number divided by the total true negative signal number; and TPR is the true positive rate, which equals the true positive signal number divided by the total true positive signal number. Differences among groups were analysed by one-way ANOVA16. We also performed the Mann–Whitney-Wilcoxon test to assess the variation between two groups. Differences yielding P-values below or equal to 0.05 were considered significant. Numbers given before the ‘±’ symbol in the results indicate the average value, and numbers given after the ‘±’ symbol indicate the standard deviation.
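Equations (4) and (5) can be illustrated with a short sketch. It assumes the standard convention that TPR and FNR are computed over the truly positive signals (TP + FN) and that TNR and FPR are computed over the truly negative signals (TN + FP); the function name `sensitivity_specificity` and the counts in the example are hypothetical and are not results from this study.

```python
from typing import Tuple


def sensitivity_specificity(tp: int, fn: int, tn: int, fp: int) -> Tuple[float, float]:
    """Compute Eqs. (4) and (5) from raw signal counts."""
    positives = tp + fn          # all truly positive signals
    negatives = tn + fp          # all truly negative signals
    tpr = tp / positives
    fnr = fn / positives
    tnr = tn / negatives
    fpr = fp / negatives
    sensitivity = tpr / (tpr + fnr)   # Eq. (4)
    specificity = tnr / (tnr + fpr)   # Eq. (5)
    return sensitivity, specificity


# Illustrative counts only.
sens, spec = sensitivity_specificity(tp=45, fn=5, tn=90, fp=10)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

Under this convention TPR + FNR = 1 and TNR + FPR = 1, so the two expressions reduce to TP / (TP + FN) and TN / (TN + FP).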

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Results

Comparison of amplification time and yield

The amplification yield of the two WGA kits was compared in a final volume of 75 μL of amplification product. The WGA4 kit yielded product at a concentration of 72.98 ± 17.81 ng/μL, whereas the PicoPLEX WGA Kit yielded product at a concentration of 37.56 ± 4.96 ng/μL. With the same WGA kit, the yield did not differ between cell numbers, but a significant difference was detected between the two WGA kits. Additionally, the WGA procedure took approximately 4.5 h with the WGA4 kit and 2.5 h with the PicoPLEX WGA Kit. Thus, the PicoPLEX WGA Kit required less time to obtain sufficient yield for library construction.

Data production

To reduce the effect of sequencing depth on the comparison of each combination, we randomly extracted 2 million clean reads from the total data of each sample (Supplementary Table II, HiSeq2000; Supplementary Table III, Proton). The extraction strategy and its rationale were described previously7. Table 1 shows the mean basic statistics of both platforms. On the HiSeq2000 platform, the mean unique mapping rate of the PicoPLEX WGA Kit (58.72%) was lower than that of the WGA4 kit (62.43%) (Supplementary Fig. S1). On the Proton platform, the average unique mapping rate was 91.23% for the WGA4 kit and 91.36% for the PicoPLEX WGA Kit (Supplementary Fig. S1). Thus, the WGA4 kit had the higher mapping rate on the HiSeq2000 platform, whereas the two kits were nearly equal on the Proton platform.
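One way to draw a fixed number of reads uniformly at random from a file of unknown length is reservoir sampling. The sketch below uses it purely as an illustration; the text does not state which sampling procedure was used, and the function name `reservoir_sample` and the seed are hypothetical.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return k items chosen uniformly at random in one pass (Algorithm R)."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            reservoir.append(item)       # fill the reservoir first
        else:
            j = rng.randint(0, i)        # inclusive on both ends
            if j < k:
                reservoir[j] = item      # replace with probability k / (i + 1)
    return reservoir


# Toy example: 5 "reads" out of 1,000; the study extracted 2 million per sample.
sample = reservoir_sample(range(1000), 5, seed=42)
print(len(sample))
```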

Table 1 Global Average Statistics of Sequencing and Mapping of different Platforms and Kits.

To gain further insights into the data quality, we investigated the discordantly mapped reads derived from different libraries and sequencing processes. The mismatch rate, deletion rate and insertion rate are important parameters to consider for calling single-nucleotide variants (SNVs). Based on the alignment results and the Compact Idiosyncratic Gapped Alignment Report (CIGAR), we encoded matches and mismatches with an ‘M’, insertions with an ‘I’ and deletions with a ‘D’. Subsequently, we defined ErrorRate as the sum of the mismatch rate, deletion rate and insertion rate (Table 1). The results of variance analysis (Supplementary Fig. S2) suggested that the PicoPLEX WGA Kit had a lower ErrorRate (P < 0.01) than the WGA4 kit on the HiSeq2000 platform, independent of cell number. The results were reversed on the Ion Proton platform. Furthermore, the ErrorRate of HiSeq2000 was lower than that of Ion Proton with the same WGA kit.

However, we could not determine whether the mapping rate of Ion Proton was truly higher than that of HiSeq2000, or whether the differences in mismatch rate, insertion rate and deletion rate were significant, because the two sequencing platforms rely on different sequencing principles and were analysed with different alignment methods and are therefore not directly comparable17.
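The ErrorRate defined above can be sketched from CIGAR strings as follows. The sketch makes several assumptions that are not stated in the text: each alignment carries an NM tag (edit distance, which counts mismatched, inserted and deleted bases), so mismatches can be derived as NM minus the inserted and deleted bases; and each rate is normalized by the number of alignment columns (M + I + D). The function name `error_rates` is hypothetical.

```python
import re
from typing import Dict, Iterable, Tuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def error_rates(alignments: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    """Aggregate mismatch, insertion and deletion rates over (CIGAR, NM) pairs."""
    matched = inserted = deleted = mismatched = 0
    for cigar, nm in alignments:
        ins = dels = 0
        for length, op in CIGAR_OP.findall(cigar):
            n = int(length)
            if op == "I":
                ins += n
            elif op == "D":
                dels += n
            elif op in "M=X":
                matched += n             # aligned columns (match or mismatch)
        inserted += ins
        deleted += dels
        mismatched += nm - ins - dels    # NM = mismatches + inserted + deleted bases
    columns = matched + inserted + deleted   # assumed denominator
    rates = {
        "mismatch": mismatched / columns,
        "insertion": inserted / columns,
        "deletion": deleted / columns,
    }
    rates["ErrorRate"] = rates["mismatch"] + rates["insertion"] + rates["deletion"]
    return rates


# One toy alignment: 48 aligned bases, 2 inserted, 1 deleted, edit distance 5.
toy = error_rates([("30M2I15M1D3M", 5)])
print(toy)
```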

GC-bias of four combinations

Generally, GC-bias is considered an important factor that complicates data analysis. The plot of the NDR at various genomic regions versus the GC content showed that, with the WGA4 kit, the average GC content was 39.70% on HiSeq2000 and 41.86% on Ion Proton, both very close to the value for the reference genome (41.9%). By contrast, the average GC content was 44.10% on HiSeq2000 and 45.22% on Ion Proton with the PicoPLEX WGA Kit (Fig. 2). These results demonstrated the amplification preference of the PicoPLEX WGA Kit for GC-rich regions.

Figure 2

GC plots for HiSeq2000 (a) and Proton (b) platforms. A heat map describes rates for each (GC, Original copy ratio) pair. Smoothed loess curves (black line) are fitted to represent the local original copy ratio trend. RS, RM, SS, SM are four combinations. RS is short for Rubicon PicoPLEX WGA Kit and single cell, RM is short for Rubicon PicoPLEX WGA Kit and multiple cells, SS is short for Sigma-Aldrich WGA4 kit and single cell, SM is short for Sigma-Aldrich WGA4 kit and multiple cells.

ΔRGC is used to quantify GC-bias, and a small ΔRGC value indicates less GC-bias. We analysed the ΔRGC value for the four combinations on the two platforms (Fig. 3). On the HiSeq2000 platform, the ΔRGC values of PicoPLEX WGA Kit amplified data were 0.25 ± 0.08 and 0.29 ± 0.05 for single cell and multiple cells, respectively, whereas those of WGA4 kit amplified data were 0.08 ± 0.04 and 0.14 ± 0.03 for single cell and multiple cells, respectively. Accordingly, SS had significantly less GC-bias than RS (P < 0.05), and SM had less GC-bias than RM (P < 0.05). Thus, data generated with the WGA4 kit had less GC-bias than the data generated with the PicoPLEX WGA Kit on the HiSeq2000 platform. On the Ion Proton platform, the ΔRGC values of PicoPLEX WGA Kit amplified data were 0.13 ± 0.04 for RM and 0.15 ± 0.08 for RS, and those of WGA4 kit amplified data were 0.04 ± 0.01 for SM and 0.03 ± 0.01 for SS. On this platform too, data generated with the WGA4 kit had less GC-bias than data generated with the PicoPLEX WGA Kit for single cell (P < 0.05) and multiple cells (P < 0.05).

Figure 3

Values of ΔRGC for the four combinations between Hiseq2000 and Proton platforms. The box-plot represents the correlation of 11 cell lines used in this study for HiSeq2000 and Proton platforms. RS, RM, SS, SM are four combinations. RS is short for Rubicon PicoPLEX WGA Kit and single cell, RM is short for Rubicon PicoPLEX WGA Kit and multiple cells, SS is short for Sigma-Aldrich WGA4 kit and single cell, SM is short for Sigma-Aldrich WGA4 kit and multiple cells.

Based on this discovery, a weighted correction strategy could be used to remove the GC-bias (Fig. 4), which was reported to correct more than 99.9% of the GC-bias12.

Figure 4

Distribution of NDR values for the four combinations across the whole genome on HiSeq2000 (a) and Proton (b) platforms. Box plot represents NDR values in 124,011 windows for the same sample. x-axis is Chromosome number; y-axis is NDR values. The left and right represent the comparison without GC-correction and after GC-correction, respectively, for the same combination. The CV is the coefficient of variation of NDR across the whole genome. RS, RM, SS, SM are four combinations. RS is short for Rubicon PicoPLEX WGA Kit and single cell, RM is short for Rubicon PicoPLEX WGA Kit and multiple cells, SS is short for Sigma-Aldrich WGA4 kit and single cell, SM is short for Sigma-Aldrich WGA4 kit and multiple cells.

Reproducibility evaluation

Reproducibility is the ability to reproduce experimental results, either by the sample type or experimental combination, and is particularly important when the amount of DNA is at the picogram level. In this study, we used Pearson’s correlation coefficient of the NDR across the selected windows along the autosomes to quantify the reproducibility within each combination. The correlation value matrix was calculated between any two cell lines among the 11 cell lines.
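The reproducibility measure can be sketched as follows: per-window read counts are converted to NDR values (counts divided by the mean count per window), and Pearson's correlation coefficient is computed for every pair of samples. The function names and the toy counts are hypothetical; real data would have one value per autosomal window.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def normalized_depth_ratio(window_counts: Sequence[float]) -> List[float]:
    """Reads per window divided by the mean number of reads per window."""
    mean = sum(window_counts) / len(window_counts)
    return [c / mean for c in window_counts]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation coefficient of two equal-length sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def pairwise_correlations(counts: Dict[str, Sequence[float]]) -> Dict[Tuple[str, str], float]:
    """Correlation of NDR profiles for every pair of samples."""
    ndr = {name: normalized_depth_ratio(c) for name, c in counts.items()}
    return {(a, b): pearson(ndr[a], ndr[b]) for a, b in combinations(sorted(ndr), 2)}


toy_counts = {
    "cell_A": [10, 20, 30, 20],
    "cell_B": [20, 40, 60, 40],   # same profile at twice the depth
    "cell_C": [30, 20, 10, 20],   # mirrored profile
}
correlations = pairwise_correlations(toy_counts)
for pair, r in correlations.items():
    print(pair, round(r, 3))
```

Because the NDR removes differences in total depth, cell_A and cell_B correlate perfectly in this toy example even though their raw counts differ twofold.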

On the HiSeq2000 platform, the correlation values of PicoPLEX WGA Kit amplification data were 0.62 ± 0.18 and 0.79 ± 0.03 for single cell and multiple cells, respectively; whereas the values were 0.28 ± 0.08 and 0.57 ± 0.06 for single cell and multiple cells, respectively, when using the WGA4 kit. RS had significantly better reproducibility than that of SS (P < 0.05), and RM also had better reproducibility than that of SM (P < 0.05).

On the Proton platform, the correlation values of PicoPLEX WGA Kit amplification data were 0.76 ± 0.15 and 0.91 ± 0.02 for single cell and multiple cells, respectively; whereas the values were 0.69 ± 0.08 and 0.86 ± 0.03 for single cell and multiple cells, respectively, when using the WGA4 kit (Fig. 5). RS had significantly better reproducibility than SS (P < 0.05), and RM had significantly better reproducibility than SM (P < 0.05). These results demonstrated that the PicoPLEX WGA Kit outperformed the WGA4 kit in reproducibility for the corresponding cell number on both the HiSeq2000 and Ion Proton platforms.

Figure 5

Reproducibility of the four combinations between HiSeq2000 and Proton platforms. The box-plot represents the correlation of 11 cell lines used in this study for HiSeq2000 and Proton platforms. RS, RM, SS, SM are four combinations. RS is short for Rubicon PicoPLEX WGA Kit and single cell, RM is short for Rubicon PicoPLEX WGA Kit and multiple cells, SS is short for Sigma-Aldrich WGA4 kit and single cell, SM is short for Sigma-Aldrich WGA4 kit and multiple cells.

Genome coverage uniformity

Coverage depth has been widely employed in different CNV calling algorithms, and the uniformity of the WGA product is important to coverage depth and CNV detection. Therefore, we characterized the uniformity by comparing the read distributions using the extracted data mentioned above. We simulated the theoretical sequencing depth distribution, which followed the Poisson distribution (124,011 dots, λ = 30), and normalized it by dividing by 30. Previously, we found that the distribution of data from the WGA4 kit was close to the theoretical one on the two sequencing platforms, whereas bias was observed in the data from the PicoPLEX WGA Kit (Fig. 6). The CV value effectively described the relative variance of chromosomal depth, uniformity, and overall GC-bias in previous studies21.
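The theoretical depth distribution can be reproduced with a short simulation: Poisson-distributed depths with λ = 30 are drawn for each window, divided by 30, and summarized by the coefficient of variation (CV). This is a sketch under assumptions: the sampler (Knuth's multiplication method), the seed and the function names are hypothetical, and the example uses 20,000 windows rather than the 124,011 used in the study to keep the run short.

```python
import math
import random
import statistics
from typing import List


def poisson_sample(lam: float, rng: random.Random) -> int:
    """Draw one Poisson variate with Knuth's method (suits moderate lam)."""
    threshold = math.exp(-lam)
    k = 0
    p = rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def simulate_normalized_depth(n_windows: int, lam: float = 30.0, seed: int = 0) -> List[float]:
    """Simulated depth per window, normalized by dividing by lam."""
    rng = random.Random(seed)
    return [poisson_sample(lam, rng) / lam for _ in range(n_windows)]


def coefficient_of_variation(values: List[float]) -> float:
    """Population standard deviation divided by the mean."""
    return statistics.pstdev(values) / statistics.fmean(values)


depth = simulate_normalized_depth(n_windows=20000, lam=30.0, seed=7)
cv = coefficient_of_variation(depth)
print(f"mean = {statistics.fmean(depth):.3f}, CV = {cv:.3f}")
```

For a Poisson distribution the CV equals 1/√λ, so a perfectly uniform library at this depth would have a CV of about 0.18; larger observed values indicate additional non-uniformity.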

In this study, we did not consider sample processing time, reagents consumption, labour costs or sample size. Those parameters might have an important role in technology selection, particularly in the scenario of clinical use. However, rapid advances in sequencing technology are likely to change those parameters in the future. Researchers within the expanding field of single cell research can obtain various experimental parameters from the cell lines before managing a multitude of clinical samples from large trials. In pre-implantation genetic screening (PGS) research, those advantages become more obvious because PGS involves a screening process before implantation for one or more nuclei from oocytes [a polar body or bodies (PBs)] or embryos (blastomere or trophectoderm cells) to detect the chromosomal CNVs23, and therefore, SLWGS for identifying CNVs has become common practice in PGS24.