Background

Gene co-expression analyses can reveal functional relationships between gene products. These types of relationships are typically explored using some type of similarity measure, e.g. Pearson’s correlation coefficient (PCC), to quantify the association between two genes in the genome. The pairwise relationships can be represented as a network structure, in which edges (co-expression relationships) connect nodes (genes) that generally include the majority of genes in a given organism’s genome [1]. Based on these associations, it is possible to predict functional gene clusters, or groups of genes, that participate in common biological pathways [2, 3]. Moreover, this approach may also be used to find the conserved orthologous gene clusters across several species [4, 5], with the implication that the clusters are involved in similar biological processes.

Many co-expression networks have been constructed in plants, such as Arabidopsis [1, 3, 6–11], barley [12], rice [13, 14], poplar [15], tobacco [16], and maize [17]. Several of these efforts have been implemented as web-based tools, e.g. the Arabidopsis Co-expression Toolkit (ACT) [18], ATTED-II [19], AtCOECis [20], RiceArrayNet (PlantArraynet) [14], the Co-expressed biological processes (CoP) database [15], the Gene Co-expression Network Browser [13], two AraNets [1, 9], and PlaNet [21].

While co-expression-based approaches have proved successful for several biological processes, far from all cellular aspects can be resolved with this type of metric. Instead, integrative approaches are increasingly applied to extend the knowledge gleaned from a single type of dataset. These studies typically rely on functional and structural genomics data, such as high-throughput microarray assays, next-generation sequencing, and metabolomic and proteomic technologies [22].

Plant cell walls are mainly composed of cellulose, non-cellulosic polysaccharides (hemicelluloses and pectin) and lignin, and represent the most abundant renewable biomass on earth [23]. The primary and secondary cell walls are typically distinct structures in plant growth and development [23]: the primary wall is a flexible matrix that allows directed cell growth, whereas the secondary wall is a robust structure surrounding cells that need extra support for their functions. In general, cellulose makes up approximately 25–30% of dry matter in grasses [24] and 40–45% in woody plants [25]. Hemicelluloses are polysaccharides that include xyloglucans, xylans, mannans and glucomannans, and β-(1 → 3,1 → 4)-glucans, whereas pectins are diverse compounds that are mainly present in primary walls [26]. Lastly, the polyphenolic molecule lignin is an amorphous polymer of phenylpropane units with three monomers, p-hydroxyphenyl (H), guaiacyl (G), and syringyl (S) [27, 28], and is laid down during secondary wall formation.

More than one thousand gene products have been proposed to be dedicated to plant cell wall biogenesis and modification [291), we summarized the information using the R function collapseRows [48, 49]. The resulting expression matrix contained 33,204 genes or probe sets. To compare the expression matrix statistically with the cell wall data, we constructed a weighted correlation network [50] based on the 33,204 probes for the 29 tissues that we also used for cell wall analyses. Here, the edge weights of the co-expression network correspond to the degree of similarity between the expression profiles of two adjacent nodes/genes.

Subsequently, the weighted correlation network was clustered, which resulted in 56 groups of highly co-expressed genes, also referred to as gene modules (Additional file 2). Modules were defined as groups of genes that exhibit a high intra-module topological overlap [51]. The modules were numbered from zero to 55 and prefixed with “ME”, referring to “module eigengene” [50]. The number of genes (probe sets) per module differed, and more than half of the modules contained fewer than 500 genes (probe sets) (Additional file 3A). To explore the co-expression relationships between modules, each module’s representative expression pattern (its eigengene) was summarized as the first principal component of the expression profiles of all the module’s gene members. All module eigengenes were then clustered using the complete linkage method (Additional file 3B), which characterizes the similarity structure between the modules.

Biological relevance and connectivity scores of network modules

To assess the functional relevance of the gene modules, and to make sure that the co-expressed modules reflect biologically relevant information, we next examined whether certain ontology terms were over-represented in the modules. Gene ontology (GO) enrichment analysis was therefore performed using a weighted method and Fisher’s exact test [52] (Additional file 4). The analysis identified a total of 4,014 enriched terms (1,175 unique terms) among the modules at p < 0.05. Notably, a significant over-representation of the terms cellulose and non-cellulosic polysaccharide biosynthesis was observed for Module 24 (with 406 genes or probe sets) and Module 44 (with 136 genes or probe sets) (Additional file 4). Based on the representation of KEGG reference pathway maps and BRITE functional hierarchies [53], we furthermore performed a functional enrichment analysis of KEGG Orthology for each module using hypergeometric tests. Module 24 and Module 44 were enriched in glycan biosynthesis and metabolism, consistent with the GO enrichment result that genes in these two modules may participate in cellulose and non-cellulosic polysaccharide biosynthesis. Detailed significant associations for each module are supplied in Additional file 5.

Highly connected genes, or hubs, are thought to play a central role in biological networks. Connectivity has been found to be an important complementary gene screening variable for finding biologically significant genes in particular biological processes [54]. Intramodular connectivity (kWithin) is defined as the gene connectivity inside a given module. In weighted networks, intramodular connectivity equals the sum of the connection weights of a node with all other nodes inside the module. In this study, we defined the whole-network connectivity (kTotal) and the external module connectivity (kOut = kTotal - kWithin) for any given node. To find genes of high connectivity (i.e. ‘hubs’) in consensus modules, we evaluated the module eigengene-based connectivity (kME) as the correlation between the gene expression and the module eigengene [55]. We also calculated all connectivity types in all modules, and the genes, sorted by their kME, are listed in Additional file 6.

Cell wall composition analysis

In an attempt to assign certain cell wall-related functions to the modules, we harvested material from the 29 tissues that corresponded to the microarray data sets above. We sequentially extracted wall polysaccharides including pectin with ammonium oxalate, hemicelluloses with KOH, and cellulose in the remaining pellet [56, 57]. Pectin was present at very low levels, or absent, in most rice tissues, and we therefore did not use the pectin data for any further investigation in this work. In summary, the cell wall composition varied greatly across the 29 tissues (Figure 1; Additional file 7). Cellulose content ranged from 0.29% (endosperm1) to 31.33% of dry matter (palea/lemma) (Figure 1A). Three major monosaccharides of hemicelluloses also varied significantly [60, 81] was used to align the nucleotide sequences of the remaining probe sets to the Michigan State University (MSU) Rice Genome Annotation version 6.0 [82] which currently contains 56,797 protein-coding gene models (BLAT parameters used: minIdentity = 100, minMatch = 1, stepSize = 5). Subsequently, 31,574 probe sets could be mapped to a unique genomic location with at least six perfect match probes (more than 50% of the 11 spotted probe-pairs per sequence). The probe sets in the expression matrix were annotated with the corresponding gene names; probe sets that could not be mapped to genes remained annotated with their original probe names. Further, to obtain a single expression level estimate per gene based on multiple probes, the collapseRows function implemented in the WGCNA R package [48, 50, 83] was used to summarize the probe intensities. The resulting microarray expression matrix contained 33,204 genes or probe sets (i.e. where no mapping to a genomic location was found).

To construct the genome-wide rice co-expression network, the following approach was taken. Initially, the pairwise similarities of all 33,204 genes or probe sets, based on their expression profiles across the 29 tissues, were quantified using PCC. The approach developed by Langfelder and Horvath [50] was then used to derive a weighted co-expression network. More specifically, a similarity matrix S was constructed in which the entry Sij corresponds to the absolute value of the pairwise PCC:

$$S_{ij} = \left| \mathrm{cor}(x_i, x_j) \right| \qquad (1)$$

where xi and xj represent the expression profiles of genes or probe sets i and j, respectively.

Furthermore, the similarity matrix S was transformed into a weighted adjacency matrix, denoted by A. Here, the entry Aij is obtained by raising the previously derived co-expression similarity Sij to a power β, with β ≥ 1:

$$A_{ij} = S_{ij}^{\beta} \qquad (2)$$
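Equations (1) and (2) can be sketched in a few lines of Python with NumPy. This is a minimal illustration under stated assumptions, not the WGCNA implementation: the function name `weighted_adjacency`, the toy data, and the convention of setting the diagonal of A to zero are choices made here for the example.

```python
import numpy as np

def weighted_adjacency(expr, beta=7):
    """expr: genes x samples array. Returns (S, A) following Eqs. (1) and (2)."""
    S = np.abs(np.corrcoef(expr))   # Eq. (1): absolute pairwise PCC between rows
    A = S ** beta                   # Eq. (2): raise each similarity to the power beta
    np.fill_diagonal(A, 0.0)        # assumption of this sketch: no self-connections
    return S, A

# Toy data (hypothetical): 5 genes measured in 29 tissues
rng = np.random.default_rng(0)
expr = rng.normal(size=(5, 29))
expr[1] = expr[0] + 0.1 * rng.normal(size=29)   # gene 1 closely tracks gene 0

S, A = weighted_adjacency(expr, beta=7)
print(np.round(A, 3))
```

Raising the similarity to a power keeps strong correlations close to 1 while pushing weak ones towards 0, so that no hard cut-off on the correlation is needed.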

The power β used to transform the similarity matrix is selected such that the resulting network (described by its adjacency matrix) best approximates a scale-free topology, a defining network property of biological networks [84, 85]. In the case of the rice genome-wide co-expression network, the parameter β = 7 was chosen (Additional file 10).
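One way to compare candidate powers is to measure how well the connectivity distribution follows a power law, i.e. how linear log p(k) is in log k. The following Python sketch illustrates the idea only; the helper `scale_free_fit`, the number of bins and the random toy data are assumptions of this example and do not reproduce the procedure behind Additional file 10.

```python
import numpy as np

def scale_free_fit(A, n_bins=10):
    """Slope and R^2 of the linear fit log10 p(k) ~ log10 k."""
    k = A.sum(axis=1)                                # connectivity of every node
    counts, edges = np.histogram(k, bins=n_bins)     # binned connectivity distribution
    centers = 0.5 * (edges[:-1] + edges[1:])
    keep = (counts > 0) & (centers > 0)              # log10 is undefined for empty bins
    x = np.log10(centers[keep])
    y = np.log10(counts[keep] / counts.sum())
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    r2 = 1.0 - (residuals ** 2).sum() / ((y - y.mean()) ** 2).sum()
    return slope, r2

# Toy data (hypothetical): 200 random expression profiles across 29 tissues
rng = np.random.default_rng(1)
expr = rng.normal(size=(200, 29))
S = np.abs(np.corrcoef(expr))

fits = {}
for beta in (1, 3, 5, 7, 9):
    A = S ** beta
    np.fill_diagonal(A, 0.0)
    fits[beta] = scale_free_fit(A)
    print(beta, fits[beta])
```

Random profiles are not expected to yield a scale-free network, so the loop only shows the mechanics. With real data one would look for a power at which the slope is negative and the fit is close to linear.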

Gene modules were defined as sets of nodes in the co-expression network, i.e. genes and probe sets, with a high topological overlap [50, 51]. The topological overlap measure (TOM) between the ith and jth node is defined as

$$\mathrm{TOM}_{ij} = \frac{\sum_{u \neq i,j} A_{iu} A_{uj} + A_{ij}}{\min(K_i, K_j) + 1 - A_{ij}} \qquad (3)$$

where $\sum_{u \neq i,j} A_{iu} A_{uj}$ sums the connection strengths that nodes i and j share through common neighbors (in an unweighted network, the number of nodes to which both i and j are connected by an edge), and $K_i = \sum_{j \neq i} A_{ij}$ denotes the sum of edge weights, i.e. the connection strengths, between the ith gene and all other genes. The TOM-based dissimilarity measure, 1-TOM, was used for hierarchical clustering. Finally, gene modules were obtained by applying the dynamic tree cutting algorithm to the resulting dendrogram. The outlined procedure was carried out using the blockwiseModules function implemented in the WGCNA R package (parameters: maxBlockSize = 20000, power = 7, minModuleSize = 50, reassignThreshold = 0, mergeCutHeight = 0.20) [83].
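The topological overlap of Eq. (3) can be computed for all node pairs at once with matrix operations. The sketch below is a minimal NumPy illustration that assumes a symmetric adjacency matrix with entries in [0, 1] and a zero diagonal; the function name and the three-node toy network are hypothetical, and the hierarchical clustering and dynamic tree cutting steps are not reproduced.

```python
import numpy as np

def topological_overlap(A):
    """TOM of Eq. (3) for a symmetric adjacency matrix A with zero diagonal."""
    shared = A @ A                              # sum_u A_iu * A_uj; u = i and u = j add 0
    K = A.sum(axis=1)                           # connectivity K_i of every node
    denom = np.minimum.outer(K, K) + 1.0 - A    # min(K_i, K_j) + 1 - A_ij
    tom = (shared + A) / denom
    np.fill_diagonal(tom, 1.0)                  # a node overlaps completely with itself
    return tom

# Toy network (hypothetical): a path 0 - 1 - 2 with unit edge weights
path = np.array([[0.0, 1.0, 0.0],
                 [1.0, 0.0, 1.0],
                 [0.0, 1.0, 0.0]])
tom = topological_overlap(path)
dissimilarity = 1.0 - tom                       # input for hierarchical clustering
print(tom)
```

In the toy network, nodes 0 and 2 are not directly connected but share node 1 as a neighbor, so their overlap is non-zero. This is the sense in which TOM looks beyond direct correlation.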

Connectivity scores of rice genes

Highly connected nodes in a network, commonly termed hubs, are thought to play a central role in biological networks. The connectivity of a node has been used as a defining property for finding biologically relevant genes in co-expression networks [54]. Here, the intra-modular connectivity (kWithin) was used as a measure of the centrality of genes. It is defined as the degree of the node corresponding to a gene inside a given module of the genome-wide rice co-expression network [54]. The parameter kTotal was defined as the whole-network connectivity of a gene, reported as the sum of its connection strengths with all other genes in the network. A module’s eigengene-based connectivity (kME) was defined as the correlation between a gene’s expression profile and the module eigengene (the first principal component of the module’s expression profiles), which can be derived using the R function consensusKME in the WGCNA package [50, 55].
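The three adjacency-based connectivity measures follow directly from row sums of the weighted adjacency matrix. The Python sketch below is illustrative only: the function name `connectivity`, the four-gene adjacency matrix and the module labels are hypothetical, and the matrix is assumed to be symmetric with a zero diagonal.

```python
import numpy as np

def connectivity(A, labels):
    """Return kTotal, kWithin and kOut for every node."""
    labels = np.asarray(labels)
    same_module = labels[:, None] == labels[None, :]   # True where i and j share a module
    k_total = A.sum(axis=1)                            # strength to all other nodes
    k_within = (A * same_module).sum(axis=1)           # strength to nodes of the same module
    k_out = k_total - k_within                         # kOut = kTotal - kWithin
    return k_total, k_within, k_out

# Toy network (hypothetical): genes 0-1 form one module, genes 2-3 another
A = np.array([[0.0, 0.8, 0.1, 0.2],
              [0.8, 0.0, 0.3, 0.1],
              [0.1, 0.3, 0.0, 0.9],
              [0.2, 0.1, 0.9, 0.0]])
k_total, k_within, k_out = connectivity(A, labels=[0, 0, 1, 1])
print(k_total, k_within, k_out)
```

Genes whose kWithin is large relative to the other members of their module are the hub candidates.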

Plant material collection and cell wall composition determination

The 29 tissues, or organs, of the indica variety Zhenshan 97 were harvested between 16:00 and 18:00 according to Wang et al. [44]. All samples were dried at 50°C after inactivation at 105°C for 5 min. The dried tissues were ground through a 40 mesh screen and stored in a dry container until use.

The plant cell wall fractionation procedure and cell wall composition analysis followed Peng et al. [56], as modified by Li et al. [58]. The crude cell wall material was suspended in 0.5% (w/v) ammonium oxalate and heated for 1 h in a boiling water bath (supernatant referred to as pectins). The remaining pellet was re-suspended in 4 M KOH containing 1.0 mg/mL sodium borohydride for 1 h at 25°C, and the combined supernatant was then neutralized, dialyzed and lyophilized (referred to as hemicelluloses). The non-KOH-extractable residue, defined as crude cellulose, was further extracted with acetic acid:nitric acid:water (8:1:2) for 1 h at 100°C, and the remaining pellet was defined as cellulose. Cellulose was analyzed by the anthrone/H2SO4 method. Monosaccharides (xylose, arabinose, galactose) of hemicelluloses were determined by GC-MS [58].

Three monolignols were detected by HPLC [57]. All samples were extracted with benzene:ethanol (2:1, v/v) in a Soxhlet for 4 h, and the remaining pellet was collected as cell wall residue (CWR). Nitrobenzene oxidation of lignin was carried out as follows. First, 0.05 g CWR was combined with 5 mL 2 M NaOH and 0.5 mL nitrobenzene, and a stir bar was put into a 25 mL Teflon gasket in a stainless steel bomb; the bomb was sealed tightly and heated at 170°C (oil bath) for 3.5 h with stirring at 20 rpm. Then the bomb was cooled with cold water, and the chromatographic internal standard (ethyl vanillin) was added to the oxidation mixture. To remove nitrobenzene and its reduction byproducts, the alkaline oxidation mixture was washed 3 times with 30 mL CH2Cl2/ethyl acetate mixture (1/1, v/v). The alkaline solution was acidified to pH 3.0-4.0 with 6 M HCl and then extracted with CH2Cl2/ethyl acetate (3 × 30 mL) to obtain the lignin oxidation products, which were in the organic phase. The organic extracts were evaporated to dryness under reduced pressure at 40°C. Finally, the oxidation products were dissolved in 10 mL chromatographic pure methanol. All experiments were carried out in triplicate. Standard chemicals: p-hydroxybenzaldehyde (H), vanillin (G) and syringaldehyde (S) were purchased from Sinopharm Chemical Reagent Co., Ltd.

Identification of the cell wall-related modules through functional enrichment

GO terms of probes and genes were derived from agriGO [86]. To elucidate key biological processes, rather than particular conserved molecular functions, the GO sub-ontology ‘biological process’ (GO-BP) was used for the gene-set enrichment analysis [87]. The enrichment analysis of particular GO-BP terms was performed using a weighted method in combination with Fisher’s exact test, as provided by the topGO package [52]. KEGG ontology (KO) annotations were additionally obtained from the KEGG database (http://www.genome.jp/kegg/) [53], and RAP IDs were converted to TIGR IDs using the RAP-DB ID converter tool (http://rapdb.dna.affrc.go.jp/tools/converter) [88]. KO enrichment was calculated using the hypergeometric test [89].
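The hypergeometric test asks how likely it is to observe at least k annotated genes in a module of n genes when K of the N genes in the background carry the annotation. The upper tail computed below is also the p-value of a one-sided Fisher’s exact test on the corresponding 2 × 2 table. This is a minimal standard-library sketch: the function name and the toy counts are hypothetical, and the weighting scheme of topGO is not reproduced.

```python
from math import comb

def hypergeom_enrichment_p(N, K, n, k):
    """P(X >= k) for n genes drawn without replacement from N genes, K of them annotated."""
    upper = min(K, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)

# Toy counts (hypothetical): 10 background genes, 4 annotated, module of 3 genes, 2 hits
p_value = hypergeom_enrichment_p(N=10, K=4, n=3, k=2)
print(p_value)
```

Because one test is run per term and per module, the resulting p-values would normally be considered together with the number of tests performed.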

Analysis of the cell wall-related modules through physiologic traits

For each gene module, the module eigengene, i.e. the first principal component of the expression profiles of all the module’s members, was derived as a representative expression profile. Module eigengenes were calculated with the WGCNA R package [50, 83]. Subsequently, the association between module eigengenes and the measured physiological traits was determined as follows: for each module, the eigengene (ME) was tested for significant associations with the external traits. Where such an association was present, a correlation analysis was then performed between each of the module’s genes and the individual cell wall components to study the finer substructure of particular gene/external trait relationships. In addition to the degree of correlation with the trait, each gene’s intra-modular connectivity was used to rank putative candidate genes.
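The eigengene and trait-correlation steps can be sketched with NumPy as follows. The example is a simplified illustration on synthetic data: the helper names, the simulated trait and the module of 20 genes are hypothetical, significance testing is omitted, and genes are ranked by trait correlation only.

```python
import numpy as np

def module_eigengene(expr_module):
    """First principal component of a standardized genes x samples block."""
    Z = expr_module - expr_module.mean(axis=1, keepdims=True)
    Z = Z / Z.std(axis=1, keepdims=True)
    _, _, vt = np.linalg.svd(Z, full_matrices=False)
    eigengene = vt[0]                       # sample scores on the first component
    # The sign of a principal component is arbitrary; align it with the mean profile.
    if np.corrcoef(eigengene, Z.mean(axis=0))[0, 1] < 0:
        eigengene = -eigengene
    return eigengene

def correlate_rows(expr, vector):
    """Pearson correlation of every row of expr with vector."""
    return np.array([np.corrcoef(row, vector)[0, 1] for row in expr])

# Synthetic data (hypothetical): one trait and 20 module genes across 29 tissues
rng = np.random.default_rng(2)
trait = rng.normal(size=29)                          # e.g. a cell wall measurement
module = trait + 0.3 * rng.normal(size=(20, 29))     # genes that follow the trait

me = module_eigengene(module)
me_trait_cor = np.corrcoef(me, trait)[0, 1]          # module-trait association
gene_trait_cor = correlate_rows(module, trait)       # per-gene correlation with the trait
kme = correlate_rows(module, me)                     # eigengene-based connectivity
ranking = np.argsort(-np.abs(gene_trait_cor))        # candidate genes, strongest first
print(me_trait_cor, ranking[:5])
```

Summarizing each module by a single profile reduces the number of tests against the traits from one per gene to one per module. The per-gene correlations are then inspected only within modules of interest.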

Canonical correlations of cell wall traits and modules eigengenes

The set of cell wall features is represented by the matrix X, in which rows correspond to the 29 tissues and columns correspond to the 7 cell wall measurements. Likewise, the matrix Y denotes the set of eigengenes, in which rows also correspond to the 29 tissues and columns correspond to the 56 eigengenes. In CCA, $a^1 = (a^1_1, \ldots, a^1_p)^T$ and $b^1 = (b^1_1, \ldots, b^1_q)^T$ denote the two basis vectors chosen such that the correlation between the projections of the variables (the columns of X and Y) onto these basis vectors,

$$U^1 = X a^1 = a^1_1 X_1 + a^1_2 X_2 + \cdots + a^1_p X_p, \qquad V^1 = Y b^1 = b^1_1 Y_1 + b^1_2 Y_2 + \cdots + b^1_q Y_q,$$

is maximized:

$$\rho_1 = \mathrm{cor}(U^1, V^1) = \max_{a^1, b^1} \mathrm{cor}(X a^1, Y b^1).$$

The derived linear projections U1 and V1 are called the first canonical variates. To investigate associations between individual variables, i.e. eigengenes and cell wall features, the similarity between variables in X and Y was quantified based on the Pearson correlations between their initial representation and the determined canonical variates. These correlations are known as canonical structure correlations [90] and can be further visualized by means of a relevance network [91]. Both the CCA and the network were derived using the mixOmics package (http://www.math.univ-toulouse.fr/~biostat/mixOmics/) [92]. As a threshold for deriving edges between eigengenes and cell wall features in the relevance network, τCCA ≥ 0.5 was chosen for the absolute values of association between variables, which ensured that none of the 7 cell wall parameters was isolated in the network.
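The first canonical variates and their structure correlations can be illustrated with classical CCA, computed here through QR decompositions and a singular value decomposition. This sketch does not reproduce the mixOmics implementation. The classical formulation needs fewer variables than samples, so the toy matrices are deliberately small. The product of structure correlations used as an association score is a simplified, single-component stand-in assumed for this example, and all function and variable names are hypothetical.

```python
import numpy as np

def first_canonical_pair(X, Y):
    """Classical CCA via QR and SVD; needs more samples than columns in X and Y."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Qx, Rx = np.linalg.qr(Xc)
    Qy, Ry = np.linalg.qr(Yc)
    U, s, Vt = np.linalg.svd(Qx.T @ Qy)     # singular values are canonical correlations
    a = np.linalg.solve(Rx, U[:, 0])        # basis vector for X
    b = np.linalg.solve(Ry, Vt[0])          # basis vector for Y
    return Xc @ a, Yc @ b, s[0]             # first canonical variates and their correlation

def structure_correlations(M, variate):
    """Pearson correlation of every column of M with a canonical variate."""
    return np.array([np.corrcoef(col, variate)[0, 1] for col in M.T])

# Synthetic data (hypothetical): 29 samples, 3 traits and 4 eigengenes sharing one signal
rng = np.random.default_rng(3)
latent = rng.normal(size=29)
X = np.column_stack([latent + 0.5 * rng.normal(size=29) for _ in range(3)])
Y = np.column_stack([latent + 0.5 * rng.normal(size=29) for _ in range(4)])

U1, V1, rho1 = first_canonical_pair(X, Y)
x_cor = structure_correlations(X, U1)       # traits versus their canonical variate
y_cor = structure_correlations(Y, V1)       # eigengenes versus their canonical variate
assoc = np.outer(x_cor, y_cor)              # simplified pairwise association score
edges = np.argwhere(np.abs(assoc) >= 0.5)   # pairs passing the 0.5 threshold
print(rho1, len(edges))
```

Each row of `edges` holds the indices of one trait and one eigengene whose association passes the threshold, which corresponds to one edge of a bipartite relevance network.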

Availability of supporting data

All data sets supporting the results of this article are included within the article and also provided in the repository hosted by LabArchives, LLC (http://www.labarchives.com/) with DOI: http://dx.doi.org/10.6070/H4NV9G6V.