Introduction

Tea [Camellia sinensis (L.) O. Kuntze, 2n = 30] is one of the most important and traditional economic crops in many developing countries in Asia, Africa, and Latin America, and is consumed as a beverage by more than two-thirds of the world’s population3). After performing five filtering steps (described in the Methods section), we identified a total of 218.87 million SNPs among the tea populations, with a density of approximately 67 SNPs per kb (Fig. 1a; Supplementary Tables 14 and 15). We anticipate that this extensive whole-profile SNP dataset will be valuable for further tea genomics research and marker-assisted breeding.

Fig. 2: Distribution and evolution of tea.

a The distribution of tea accessions assessed in the present study. The teas within the black oval had the highest nucleotide polymorphism. b Principal component analysis of the tea populations. PC1 and PC2 split the tea populations into three clusters. The Camellia sinensis var. sinensis (CSS) samples were found to cluster more tightly than the C. sinensis var. assamica (CSA) samples. c A phylogenetic tree of tea. Camellia sasanqua Thunb. was used as the outgroup, and the tea samples closest to the outgroup were C. sinensis-related species (CSR). d Structure of the tea populations. Green, blue, and yellow represent the CSS, CSA, and CSR populations, respectively. Yellow and orange are marked with arrows. Source data are provided as a Source Data file.

To further investigate the phylogenetic relationships among these accessions, we constructed a maximum likelihood-based phylogenetic tree with SNPs filtered from the total SNP dataset (see the Methods section for details), using Camellia sasanqua as an outgroup (Fig. 2c). We found that all samples were clustered into one of three independent clades (Fig. 2c; Supplementary Data 4) corresponding to the CSR, CSS, and CSA populations, which is consistent with the morphology-based classical taxonomy of CSA and CSS.

Principal component analysis (PCA) was used to investigate the relationships and differentiation among populations and consistently revealed the presence of three clusters corresponding to CSA, CSS, and CSR (Fig. 2b). The first two principal components accounted for 13.08% of the total variance, with PC1 reflecting the variability of the CSA and CSS groups and PC2 differentiating CSR plants from CSA and CSS plants. We found that CSS showed better aggregation than CSA and CSR, whereas the juncture accessions of CSA and CSS were also close to CSR in the phylogenetic tree. At a K value of 3, CSA, CSS, and CSR could be readily distinguished (Fig. 2d; Supplementary Fig. 12; Supplementary Note 4), which is consistent with the PCA results (Fig. 2b). At a K value of 3 or 4, most new accessions collected from outside China appeared to have originated from CSA and CSS (yellow color, marked with an arrow in Fig. 2d), indicating their high diversity.

On the basis of the phylogenetic and population structure results (Fig. 2c; Supplementary Data 46), we further investigated individual- and population-level heterozygosity among the populations (Supplementary Data 3). We accordingly found the heterozygosity of CSR (6.37E-3) to be significantly higher than that of CSA (6.29E-3) and CSS (5.69E-3) (both P values < 0.05; Supplementary Fig. 13). We also calculated linkage disequilibrium (LD) decay values based on the squared correlation coefficient (r²) of pairwise SNPs in two groups, which revealed that for the CSA and CSS groups, the average r² among SNPs decayed to ~50% of its maximum value at ~41 and ~59 kb, respectively. These values thus indicate that the tea genomes have relatively long LD distances and slow LD decay (Supplementary Fig. 14).
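The LD statistic and the half-decay distance described above can be illustrated with a minimal Python sketch. It assumes genotypes coded as alternate-allele counts (0, 1, or 2) and a genotype-based squared correlation; the text does not state which software or which estimator was used, and the function names are hypothetical.

```python
def genotype_r2(g1, g2):
    """Squared Pearson correlation between two SNPs coded as 0/1/2.

    Missing genotypes are given as None and skipped pairwise.
    """
    pairs = [(a, b) for a, b in zip(g1, g2) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return float("nan")
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    if var_a == 0 or var_b == 0:
        return float("nan")  # a monomorphic site carries no LD information
    return cov * cov / (var_a * var_b)


def half_decay_distance(bins):
    """Return the first distance at which mean r2 is <= half of its maximum.

    `bins` is a list of (distance_kb, mean_r2) tuples sorted by distance.
    """
    max_r2 = max(r2 for _, r2 in bins)
    for distance, r2 in bins:
        if r2 <= max_r2 / 2:
            return distance
    return None
```

In practice, the mean of `genotype_r2` would be taken over all SNP pairs falling in each distance bin before `half_decay_distance` is applied to each population separately.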

Selective sweeps in the two major tea populations

It is generally thought that the differences between CSS and CSA teas lie primarily in their flavor, leaf and tree types, cold tolerance, and processing suitability. Among the accessions assessed in the present study, the CSA population comprised three green tea accessions and 34 black tea accessions, whereas the CSS population contained 45 green tea accessions, 19 oolong tea accessions, and 11 black tea accessions (Fig. 3a). To determine the potential genetic bases of these differences, we used SweepFinder2 (version 1.0) to scan for selective sweep regions and selected regions with the top 1% of composite likelihood ratio (CLR) scores and the genes overlapping with the final sweep regions (≥300 bp). On the basis of this analysis, we identified a total of 1336 and 1028 genes bearing selection signatures in the CSA and CSS populations, respectively (Supplementary Data 7 and 8; Supplementary Fig. 15).

Fig. 3: Sweep genesets in CSA and CSS show different directions of domestication.

a The tea types used for SweepFinder2 analysis of CSS (Camellia sinensis var. sinensis) and CSA (C. sinensis var. assamica). b The pathway of terpene metabolism. The selective sweep genes are indicated by stars. The arrows bisected by equal symbols indicate hidden processes. c The expression of terpene-related genes in different tea tissues. d The expression of NBS-ARC genes in different tea tissues. Source data underlying Fig. 3c, d are provided as a Source Data file.

Using the data generated from GO analysis, we selected enriched genes (P value < 0.05, FDR < 0.05) from the candidate selective sweep genes of the CSA and CSS populations (Supplementary Tables 16 and 17; Supplementary Fig. 16) and accordingly found that volatile terpene metabolism genes, such as cytochrome P450s (e.g., geraniol 8-hydroxylase) and terpene synthases, including alpha-terpineol synthase (ATESY), (−)-germacrene D synthase (TPSGD), and strictosidine synthase (STSY), were significantly selected in the CSS population but not the CSA population (Fig. 3b; Supplementary Tables 16 and 17). The functionalization of core terpene molecules requires cytochrome P450s32, among which geraniol 8-hydroxylase catalyzes the conversion of geraniol to (6E)-8-hydroxygeraniol (Fig. 3b), which may affect the accumulation level of geraniol. Alpha-terpineol, a monoterpene found in tea, is generated by the ATESY-mediated catalysis of geranyl-PP, whereas TPSGD catalyzes the conversion of farnesyl-PP to the sesquiterpene germacrene D. Strictosidine is the precursor of terpenoid indole alkaloids, and STSY is a key enzyme in the synthesis of these alkaloids (Fig. 3b). Moreover, we found that 80% of the selected terpene-related genes showed relatively high expression in buds or leaves, whereas 33% of these genes showed significantly high expression in buds or leaves (Fig. 3c; Supplementary Table 18).

Compared with the CSA accessions, the CSS accessions were characterized by the selection of a larger number of NBS-ARC (nucleotide-binding site domain in apoptotic protease-activating factor-1, R proteins and Caenorhabditis elegans death-4 protein) genes, the Arabidopsis homologs of which, including RPS3 (also known as RPM1)33, RPS534, and SUMM235, have been shown to be involved in resistance to Pseudomonas syringae (RPS) (Supplementary Tables 16 and 17). The expression profiles of these genes revealed that 69% of the NBS-ARC genes subject to selection are highly expressed in spring, autumn, or winter, whereas 24% of these genes are significantly highly expressed in spring, autumn, or winter (Fig. 3d; Supplementary Table 19). However, among the 214 genes under selection in both the CSS and CSA populations, we were unable to detect enrichment of any genes related to flavor synthesis or abiotic and biotic stress resistance in the CSA population (Supplementary Data 7 and 8).

Discussion

This study presents a chromosome-scale genome sequence of tea and resequencing data for 139 tea accessions collected from around the world. According to our analyses, these genomic resources will be valuable for future genomics research and molecular breeding of tea. The data reveal the genome-wide phylogeny of tea and the directions of divergent selection between the two main tea varieties, namely, CSS and CSA. Compared with CSA, in CSS, genes involved in flavor metabolism and cold tolerance have been subjected to stronger selection, which is consistent with the fact that tea accessions from eastern and northern China, such as green and oolong tea, have a distinct aroma and are cold tolerant. Our data also indicate that the CSR population is an ancestor of CSS and CSA. However, although these findings represent an important step in unravelling details of the origin and domestication of CSS and CSA, it remains necessary to identify the closest ancestor of tea and to examine a larger number of CSR accessions in the future. Due to the limitations of sampling in India, we cannot rule out the possibility of other evolutionary scenarios, an evaluation of which will require a more comprehensive collection of samples. Although several studies related to tea genomics have recently been published7), and the transcript reads were assembled using Cufflinks (version 2.2.1). All of the predicted gene structures were integrated using EVidenceModeler (version 1.1.1). Protein-coding genes with a coding sequence length shorter than 300 nt and with stop codons were filtered (with the exception of stop codons at the end of a sequence). We then mapped RNA-seq reads to the predicted coding regions using SOAP2 and selected the predicted gene regions based on RNA-seq data (regions with >50% coverage). The methods used for gene and functional annotation are described in detail in Supplementary Note 2. 
The sequences of LJ43 and Actinidia chinensis21 proteins were analyzed using blastp with the parameters -evalue 1e-5 -num_alignments 5. Thereafter, syntenic blocks were identified using MCScanX with the parameters -e 1e-20. SCZ and YK10 were analyzed using the same pipeline and parameters. We also analyzed the genome synteny between Theobroma cacao54 and LJ43, SCZ, and YK10 (Supplementary Note 3).

Analysis of positive Darwinian selection

A species tree was constructed as described in Supplementary Note 3, without SCZ and YK10. We identified 1031 single-copy gene families. The protein sequences of single-copy genes were aligned using ClustalW255, and the protein alignments were then converted to the corresponding nucleotide (codon) alignments using an in-house Perl script. Gblocks56 was used to trim the nucleotide alignments with the t = c (codon) parameter. The branch-site model A (Test 2) was used to assess positive selection with codeml in the PAML package. Significant sites were discarded if the 5-bp sequences flanking the site had been removed by Gblocks. A false discovery rate (FDR) value of ≤0.05 was used to filter the results.
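The significance test and FDR filter described above can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline used here: the likelihood ratio statistic is compared against a chi-square distribution with one degree of freedom, and the FDR is controlled with the Benjamini-Hochberg procedure, which the text does not name explicitly. All function names are hypothetical.

```python
from math import erfc, sqrt


def lrt_pvalue(lnl_null, lnl_alt):
    """P value of the likelihood ratio test, assuming chi-square with 1 df.

    For one degree of freedom the survival function is erfc(sqrt(x / 2)).
    """
    statistic = max(0.0, 2.0 * (lnl_alt - lnl_null))
    return erfc(sqrt(statistic / 2.0))


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so that monotonicity is enforced.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted
```

Gene families whose adjusted value is ≤0.05 would then be retained as candidates for positive selection.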

SNP calling and filtering

Quality-controlled reads were mapped to the unmasked tea genome using bwa (version 0.7.15)57 with the default parameters. SAMtools (version 1.4)58 was used for sorting, and Picard (v.2.17.0) was used to remove duplicates. The HaplotypeCaller of GATK (version 3.8.0)58 was used to construct genomic variant call format (gVCF) files for the tea group (139 accessions) and outgroup (C. sasanqua, CM-1) by invoking -ERC GVCF. gVCF files in the tea group were combined using GenotypeGVCFs in GATK to form a single-variant calling file, whereas the gVCF file for the outgroup was called using the option ‘-allSites’ to include all sites. The final single-variant calling file was merged using BCFtools (version 1.6), with only the consistent positions retained in both groups. To obtain high-quality SNPs, we initially used the GATK hard filter to filter the merged VCF data with the options (QD ≥ 2.0 && FS ≤ 60.0 && MQ ≥ 40.0 && MQRankSum ≥ −12.5 && ReadPosRankSum ≥ −8.0). Thereafter, we performed strict filtering of the SNP calls based on the following criteria: (1) sites were located at a distance of at least 5 bp from a predicted insertion/deletion; (2) the consensus quality was ≥40; (3) the sites were not triallelic and did not contain InDels; (4) the depth fell between the 2.5th and 97.5th percentiles of the depth distribution; and (5) SNPs had minor allele frequencies (MAFs) ≥ 0.01.
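The filtering criteria listed above can be expressed as a simple per-site predicate. The following Python sketch is illustrative only: the dictionary keys (`ann`, `indel_distance`, `qual`, `n_alleles`, `is_indel`, `depth`, `alt_count`, `allele_total`) are hypothetical, mapping "consensus quality" to a single `qual` field is an assumption, and so is treating a missing rank-sum annotation as a pass.

```python
def passes_hard_filter(ann):
    """Apply the hard-filter thresholds quoted in the text to one site."""
    def at_least(key, bound):
        value = ann.get(key)
        # Assumption: a site lacking this annotation is not rejected.
        return value is None or value >= bound

    return (
        ann["QD"] >= 2.0
        and ann["FS"] <= 60.0
        and ann["MQ"] >= 40.0
        and at_least("MQRankSum", -12.5)
        and at_least("ReadPosRankSum", -8.0)
    )


def depth_bounds(depths, lower=2.5, upper=97.5):
    """Return approximate (lower, upper) percentile cut-offs of site depth."""
    ordered = sorted(depths)

    def pick(q):
        return ordered[round(q / 100 * (len(ordered) - 1))]

    return pick(lower), pick(upper)


def minor_allele_frequency(alt_count, allele_total):
    """Frequency of the rarer allele at a biallelic site."""
    freq = alt_count / allele_total
    return min(freq, 1.0 - freq)


def keep_snp(site, depth_lo, depth_hi):
    """Return True if a site satisfies the hard filter and criteria (1)-(5)."""
    return (
        passes_hard_filter(site["ann"])
        and site["indel_distance"] >= 5       # (1) at least 5 bp from an InDel
        and site["qual"] >= 40                # (2) consensus quality
        and site["n_alleles"] == 2            # (3) biallelic only
        and not site["is_indel"]              # (3) not an InDel
        and depth_lo <= site["depth"] <= depth_hi  # (4) depth percentiles
        and minor_allele_frequency(site["alt_count"], site["allele_total"]) >= 0.01  # (5)
    )
```

The depth cut-offs would be computed once from all candidate sites with `depth_bounds` and then passed to `keep_snp` for each site.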

Population genetic analyses

We selected high-quality SNPs with a maximum of 20% missing data, and to eliminate the potential effects of physical linkage among variants, the sites were thinned such that no two sites were within the same 2000-bp region. Phylogenetic analysis was conducted with the final SNP set using IQ-TREE (version 1.6.9)59,60,61. A maximum likelihood (ML)-based phylogenetic tree was constructed using the GTR + F + R5 model, with 1000 rapid bootstrap replicates conducted to determine branch confidence values. The best-fitting model was estimated using ModelFinder implemented in IQ-TREE after evaluating 286 DNA models. GTR + F + R5 was selected based on the Bayesian information criterion. The ML phylogenetic tree was constructed based on intergenic region SNPs using the final SNP set and 4DTV SNPs. Principal component analysis (PCA) of the final SNP set was performed using PLINK (version 1.90), with the principal components plotted against one another using R 3.4 to visualize patterns of genetic variation. We also used the final SNP set for population structure analysis using ADMIXTURE (version 1.3)62, which was run with K values (the number of assumed ancestral components) ranging from 1 to 10.
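The thinning step mentioned above can be sketched in a few lines of Python. The text does not specify whether thinning was distance based or bin based; this sketch assumes a greedy, distance-based rule applied separately to each chromosome, and the function name is hypothetical.

```python
def thin_sites(positions, min_gap=2000):
    """Greedily keep sites so that kept neighbours are >= min_gap bp apart.

    `positions` holds the coordinates of SNPs on a single chromosome.
    """
    kept = []
    last = None
    for pos in sorted(positions):
        if last is None or pos - last >= min_gap:
            kept.append(pos)
            last = pos
    return kept
```

For example, `thin_sites([100, 500, 2100])` keeps the sites at 100 and 2100 and drops the site at 500.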

Population heterozygosity at a given locus was computed as the fraction of heterozygous individuals among all individuals in a given population. The average heterozygosity was then calculated for each 40-kb sliding window, with a step size of 20 kb. Individual heterozygosity was computed as the fraction of loci that were heterozygous in an individual. Average heterozygosity was also calculated using the same method. Windows with an average depth <1 were filtered out.
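The two heterozygosity measures and the sliding-window averaging described above can be sketched as follows. This is a minimal illustration: genotypes are assumed to be tuples of two allele codes with `None` for missing calls, windows are averaged over the sites supplied (whether invariant sites were included is not stated in the text), and all names are hypothetical.

```python
def fraction_heterozygous(genotypes):
    """Fraction of called genotypes whose two alleles differ."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    return sum(1 for a, b in called if a != b) / len(called)


def population_heterozygosity(locus_genotypes):
    """Heterozygosity at one locus: input is one genotype per individual."""
    return fraction_heterozygous(locus_genotypes)


def individual_heterozygosity(individual_genotypes):
    """Heterozygosity of one individual: input is one genotype per locus."""
    return fraction_heterozygous(individual_genotypes)


def sliding_window_means(site_values, chrom_length, window=40_000, step=20_000):
    """Average per-site values in overlapping windows along one chromosome.

    `site_values` is a list of (position, value) pairs with 1-based positions.
    Returns (start, end, mean) tuples; mean is None for windows without sites.
    """
    results = []
    start = 1
    while start <= chrom_length:
        end = min(start + window - 1, chrom_length)
        values = [v for p, v in site_values if start <= p <= end]
        mean = sum(values) / len(values) if values else None
        results.append((start, end, mean))
        start += step
    return results
```

The same window function can be reused for either measure by passing per-locus population values or per-locus indicator values for a single individual.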

To eliminate the influence of differences in sample number, eight samples of the CSR/CSA/CSS populations were randomly selected to calculate nucleotide diversity. To reduce the sampling error, we performed 20 repeat calculations for each population using VCFtools (version 0.1.16) with a window size of 50 kb and a step size of 10 kb. The data for each population are presented as boxplots created using R.
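The subsampling scheme described above can be illustrated with a short Python sketch for a single window. It assumes that a fresh set of eight samples is drawn in each of the 20 repeats, that sites are biallelic with genotypes coded as alternate-allele counts (0, 1, or 2), and that nucleotide diversity is the sum of per-site pairwise differences divided by the window length. This is a simplification and is not claimed to reproduce the VCFtools calculation exactly; all names are hypothetical.

```python
import random


def site_pi(alt_count, n_alleles):
    """Expected pairwise difference at one biallelic site."""
    if n_alleles < 2:
        return 0.0
    return 2.0 * alt_count * (n_alleles - alt_count) / (n_alleles * (n_alleles - 1))


def subsampled_pi(genotypes, window_bp, n_pick=8, repeats=20, seed=0):
    """Estimate pi for one window from repeated random subsamples.

    `genotypes` maps a sample name to its list of alternate-allele counts,
    one entry per variant site in the window (no missing data assumed).
    """
    rng = random.Random(seed)
    samples = sorted(genotypes)
    estimates = []
    for _ in range(repeats):
        chosen = rng.sample(samples, n_pick)
        n_sites = len(genotypes[chosen[0]])
        total = 0.0
        for i in range(n_sites):
            alt = sum(genotypes[s][i] for s in chosen)
            total += site_pi(alt, 2 * n_pick)
        estimates.append(total / window_bp)
    return estimates
```

The list returned for each population corresponds to the values that would be summarized in one boxplot.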

Selective sweep analysis

TreeTime (version 0.5.3)63 was used to infer the ancestral state based on ML using the generated evolutionary tree. Sites lacking a reconstructed ancestral state in a population were folded in the SweepFinder2 analysis. We excluded sites that were neither polymorphic nor substitutions, as recommended by the SweepFinder2 manual64. To reduce the likelihood of false positives, the chromosome-wide frequency spectrum was calculated as the background for each chromosome and population. SweepFinder2 was run with a grid size of 100. The CLR scores from the SweepFinder2 results were extracted and merged into sweep regions when the neighboring score(s) exceeded a certain threshold, which was set as the top 1% of CLR scores. To obtain regions with greater continuity, we merged regions into a single region with a certain size threshold between regions, with the threshold being set to 50% of the size in the adjacent sweep regions. The final score for each sweep region was the sum of the CLR scores of the sites in the sweep region. The final sweep regions were filtered based on a minimum size of 300 bp. Genes overlapping within the sweep regions were extracted as candidate selective sweep genes. The GO-enriched (P value < 0.05, FDR < 0.05) candidate selective sweep genes were chosen, and Fst, θπ and Tajima’s D values were calculated using VCFtools with a window size of 50,000 bp and a step size of 10,000 bp.
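The post-processing of CLR scores into sweep regions described above can be sketched as follows. Several details are assumptions of this illustration rather than statements from the text: consecutive grid points at or above the threshold form a raw region, two neighbouring regions are merged when the gap between them is no larger than 50% of their combined length, and region size is measured from the first to the last grid point. The function names are hypothetical.

```python
def top_fraction_threshold(scores, fraction=0.01):
    """Return the lowest score that still belongs to the top `fraction`."""
    ordered = sorted(scores, reverse=True)
    k = max(1, int(len(ordered) * fraction))
    return ordered[k - 1]


def call_sweep_regions(grid, threshold, gap_fraction=0.5, min_size=300):
    """Turn (position, clr) grid points into (start, end, summed_clr) regions.

    `grid` must be sorted by position and describe a single chromosome.
    """
    # Step 1: runs of consecutive grid points at or above the threshold.
    raw = []
    current = None
    for pos, clr in grid:
        if clr >= threshold:
            if current is None:
                current = [pos, pos, clr]
            else:
                current[1] = pos
                current[2] += clr
        elif current is not None:
            raw.append(current)
            current = None
    if current is not None:
        raw.append(current)

    # Step 2: merge neighbours separated by a relatively small gap.
    merged = []
    for region in raw:
        if merged:
            prev = merged[-1]
            gap = region[0] - prev[1]
            flank = (prev[1] - prev[0]) + (region[1] - region[0])
            if gap <= gap_fraction * flank:
                prev[1] = region[1]
                prev[2] += region[2]
                continue
        merged.append(list(region))

    # Step 3: drop regions shorter than the minimum size.
    return [tuple(r) for r in merged if r[1] - r[0] >= min_size]
```

Candidate genes would then be those whose coordinates overlap any returned region.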

Gene expression

Transcript-level expression was calculated using HISAT2, StringTie, and Ballgown with the default parameters65. The genes identified among the selection results were selected for expression analysis, and an expression heatmap was plotted using the heatmap package in R 3.4. The average expression of selected genes shown in Fig. 3d was calculated according to season, whereas the average expression of selected genes shown in Fig. 3c was calculated according to tissue. Student’s t-test was used to identify the significantly differentially expressed genes (P value < 0.05).

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.