Background

Tea, produced from the leaves of the tea plant, Camellia sinensis (L.), which belongs to the family Theaceae, is one of the most popular non-alcoholic beverages consumed worldwide. To date, nearly 4000 bioactive compounds have been identified in tea [1], including catechins, caffeine, theanine and volatile oils [2]. Tea catechins generally comprise six monomers, namely catechin (C), gallocatechin (GC), epicatechin (EC), epigallocatechin (EGC), epicatechin gallate (ECG) and epigallocatechin gallate (EGCG) [3]. Catechins, caffeine and theanine are the three main characteristic biologically active compounds in tea [4]. They are not only important contributors to flavour, but also have beneficial effects on human health due to their antioxidant and anticancer activity [5] and their ability to lower blood pressure [6], prevent cardiovascular diseases [7], and assist weight loss [8].

Gene co-expression network analysis (GCNA) is an approach for analysing correlations between genes using large-scale gene expression profiling data, and it is especially useful for investigating relationships between functional modules and phenotypic traits [9, 10]. Weighted GCNA (WGCNA) is one of the most popular GCNA-based approaches; this correlation-based technique describes and visualises co-expression networks between genes using transcriptomic data [11]. It has been used successfully to identify gene modules in Arabidopsis and rice that are related to drought and bacterial stress [12]. Module assignment in WGCNA is a flexible process that reduces the complexity of a dataset from thousands of genes to a smaller number of modules.

Researchers have focused on the molecular mechanisms involved in plant growth and development [13, 14] and in the production of secondary metabolites [15] in tea plants. Regulatory mechanisms underlying secondary metabolite biosynthesis, particularly those related to catechins, theanine and caffeine, have been explored at the molecular level. Recent advances in next-generation sequencing of RNA [16] have been accompanied by an increase in the amount of available transcriptomic data from different tissues of tea plants [17], from different species of the genus Camellia [18], and from plants grown under different stress conditions [55], which has helped to elucidate the complex biological functions of genes.

Based on the BLAST results from the NR database, GO annotation was carried out using the Blast2GO program (version 2.3.4) [56].

Identification of gene expression and DEGs

Expression levels of unigenes were calculated using the FPKM (fragments per kilobase of transcript per million mapped fragments) method. First, reads were mapped to the unigene dataset using Bowtie2 (version 2.1.0, http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) at a sensitive setting. Based on the Bowtie2 results, FPKM values for each unigene were then calculated using RSEM (version 1.2.29) [57] with default parameters. DEGs were identified based on the method described by Audic et al. [58]. Genes with |log2 ratio| ≥ 1 and a false discovery rate (FDR) < 0.05 were identified as DEGs.
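The normalisation and filtering steps can be illustrated with a minimal Python sketch. It uses the textbook FPKM formula rather than the statistical model implemented in RSEM, and it takes the FDR value as given instead of deriving it with the Audic et al. test. The function names and the pseudocount are assumptions introduced here for illustration only.

```python
import math


def fpkm(fragments, length_bp, total_mapped):
    """Textbook FPKM: fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (length_bp * total_mapped)


def is_deg(expr_a, expr_b, fdr, min_abs_log2=1.0, max_fdr=0.05, pseudo=0.01):
    """Apply the |log2 ratio| >= 1 and FDR < 0.05 thresholds to one unigene.

    `pseudo` is a hypothetical pseudocount that avoids division by zero
    when a unigene is not expressed in one of the two samples.
    """
    log2_ratio = math.log2((expr_b + pseudo) / (expr_a + pseudo))
    return abs(log2_ratio) >= min_abs_log2 and fdr < max_fdr


# Hypothetical unigene: 500 fragments, 2000 bp long, 10 million mapped fragments.
example_fpkm = fpkm(500, 2000, 10_000_000)
```

A unigene passes the filter only when both conditions hold, so a large fold change with a poor FDR is rejected in the same way as a well-supported but small change.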

Construction of gene co-expression networks

Gene co-expression networks were constructed using the WGCNA package in R (version 3.2.2). DEGs identified in at least one pairwise comparison among the ten tissues were retained for co-expression network construction by WGCNA [11]. All tissue samples were first clustered to examine the sample height. Following application of the scale-free topology criterion described previously, a soft threshold of 30 was chosen. Based on the topological overlap-based dissimilarity measure [59], unigenes were first hierarchically clustered, and the gene dendrogram was used for module detection by the dynamic tree cut method (mergeCutHeight = 0.25, minModuleSize = 30). In the weighted gene co-expression network, gene connectivity was based on the edge weight (ranging from 0 to 1) determined by the topological overlap measure, which reflects the strength of the communication between two genes. The weights across all edges of a node were summed to define its level of connectivity, and nodes with high connectivity were considered hub genes.
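To make the edge weights and connectivity concrete, the following is a minimal sketch of the underlying arithmetic in plain Python. It is not the WGCNA R code. It assumes an unsigned network (the network type is not stated above), in which the adjacency is a_ij = |cor(x_i, x_j)|^β and the topological overlap is (Σ_u a_iu a_uj + a_ij) / (min(k_i, k_j) + 1 − a_ij), with k_i the summed adjacency of gene i. All function names and the toy data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (neither may be constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def adjacency(profiles, beta):
    """Unsigned soft-thresholded adjacency: |cor|^beta, with a zero diagonal."""
    n = len(profiles)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            adj[i][j] = adj[j][i] = abs(pearson(profiles[i], profiles[j])) ** beta
    return adj


def topological_overlap(adj):
    """Topological overlap matrix; every entry lies between 0 and 1."""
    n = len(adj)
    k = [sum(row) for row in adj]  # summed adjacency per gene
    tom = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            shared = sum(adj[i][u] * adj[u][j] for u in range(n) if u not in (i, j))
            tom[i][j] = tom[j][i] = (shared + adj[i][j]) / (min(k[i], k[j]) + 1 - adj[i][j])
    return tom


def connectivity(weights):
    """Sum the edge weights of each node."""
    return [sum(row) for row in weights]


# Three hypothetical unigenes measured in four samples; beta = 2 keeps the toy numbers readable.
profiles = [[1, 2, 3, 4], [2, 4, 6, 8], [1, 3, 2, 4]]
adj = adjacency(profiles, beta=2)
tom = topological_overlap(adj)
conn = connectivity(tom)
```

In this toy example the first two unigenes are perfectly correlated, so they share the highest connectivity, while the third is less strongly connected. A larger β suppresses weak correlations more strongly.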

Identification of content-related modules

To identify modules associated with catechins, theanine and caffeine, we first calculated the module eigengene of each module, then correlated these eigengenes with the catechin, theanine and caffeine content using Pearson’s correlation coefficients and an asymptotic confidence interval based on Fisher’s Z transformation. Modules with p-values < 0.05 were identified as content-related modules. To further characterise these modules, enrichment of annotated unigenes in each content-related module was investigated using the phyper function within the R platform based on KEGG pathway annotation, and q-value or FDR corrections were applied for multiple testing [60]. We defined KEGG pathways with a q-value or FDR < 0.05 as significantly enriched [61].
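The statistics in this step can be sketched in plain Python. This is an illustrative reconstruction under stated assumptions, not the R code used in the study. It assumes a two-sided p-value from the normal approximation to Fisher's Z, an upper-tail hypergeometric probability for enrichment (the quantity that phyper returns with lower.tail = FALSE when called with the overlap minus one), and the Benjamini-Hochberg procedure as the FDR correction. The exact correction is not specified above, and all function names are hypothetical.

```python
import math


def fisher_z_pvalue(r, n):
    """Two-sided p-value for a Pearson r from n samples via Fisher's Z.

    Requires -1 < r < 1 and n > 3.
    """
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def hypergeom_upper_tail(overlap, in_pathway, in_module, background):
    """P(X >= overlap) when drawing `in_module` unigenes from `background`
    annotated unigenes, of which `in_pathway` belong to the pathway."""
    total = math.comb(background, in_module)
    upper = min(in_pathway, in_module)
    favourable = sum(
        math.comb(in_pathway, i) * math.comb(background - in_pathway, in_module - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / total


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical module-trait correlation of 0.9 across ten tissues.
p_module = fisher_z_pvalue(0.9, 10)
# Hypothetical pathway p-values for one module.
q_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

With ten tissues, a correlation of 0.9 gives a p-value well below 0.05, so the hypothetical module would be retained as content-related. Pathways would then be kept when their adjusted value is below 0.05.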

Module hub gene selection and visualisation

The most central and connected genes, involved in numerous interactions, were considered hub genes [62], which are likely to play a more important role in a given module than other genes in the overall co-expression network. In this study, we categorised the top 2% of the most highly connected genes in a module as hub genes based on the size of the module. Co-expression interactions and patterns of hub genes were visualised using Cytoscape [63].
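The selection rule can be sketched as follows. How the 2% cut-off is rounded for a given module size is not specified above, so rounding up and keeping at least one gene per module are assumptions of this sketch. The function name and the toy data are hypothetical.

```python
import math


def select_hub_genes(connectivity, fraction=0.02):
    """Return the most highly connected genes of one module.

    `connectivity` maps gene ID to summed edge weight within the module.
    The cut-off is rounded up and at least one gene is kept; both choices
    are assumptions of this sketch.
    """
    n_hubs = max(1, math.ceil(len(connectivity) * fraction))
    ranked = sorted(connectivity, key=connectivity.get, reverse=True)
    return ranked[:n_hubs]


# Hypothetical module of 100 unigenes with increasing connectivity.
module = {f"unigene_{i}": float(i) for i in range(100)}
hubs = select_hub_genes(module)
```

Because the cut-off scales with module size, a module of 100 unigenes yields two hub genes, whereas a module of 1000 unigenes would yield twenty.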

qPCR validation of selected unigenes

To evaluate the quality of the RNA-seq data, the expression patterns of eight selected transcripts were monitored by qPCR. Total RNA was isolated from samples using the CTAB method [45] and reverse-transcribed into single-stranded cDNA using a reverse transcription kit for real-time PCR (TaKaRa). Detailed information (unigene IDs and primer sequences) for the selected transcripts used for qPCR is listed in Additional file 5. PCR amplification was performed according to the manufacturer’s instructions using a CFX96™ real-time PCR system (Bio-Rad) with an annealing temperature of 60 °C. The housekeeping gene glyceraldehyde-3-phosphate dehydrogenase (GAPDH) was used as an internal reference gene, and relative expression levels of target genes were calculated using the 2^(−ΔCt) method [64]. All qPCRs were analysed using three technical and three biological replicates.
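The relative expression calculation can be illustrated with a short sketch, assuming the conventional form in which ΔCt is the Ct of the target gene minus the Ct of the reference gene (GAPDH here) and relative expression equals 2 raised to the power of −ΔCt. Averaging technical replicates before taking the difference is also an assumption, and the Ct values shown are hypothetical.

```python
from statistics import mean


def relative_expression(ct_target, ct_reference):
    """Relative expression as 2^(-dCt), with dCt = mean target Ct - mean reference Ct."""
    delta_ct = mean(ct_target) - mean(ct_reference)
    return 2 ** (-delta_ct)


# Hypothetical technical replicates for one biological sample.
target_ct = [24.0, 24.2, 23.8]
gapdh_ct = [20.0, 20.1, 19.9]
expression = relative_expression(target_ct, gapdh_ct)
```

A target that crosses the threshold four cycles later than the reference is therefore expressed at one sixteenth of the reference level.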