Background

DNA methylation refers to the methylation of the carbon atom at position 5 of a cytosine (m5C), which mostly happens within CpG, CpHpG and CpHpH nucleotide patterns in eukaryotes [1]-[4]. In differentiated cells of mammals, methylation appears predominantly at CpG dinucleotides, with about 60% to 90% of all CpG sites methylated [4]-[6]. DNA methylation is a stable epigenetic modification involved in many cellular processes, including cellular differentiation, suppression of transposable elements, embryogenesis, X-inactivation and genomic imprinting [4]. DNA methylation around the 5’ terminus of a gene is well-recognized to be associated with low gene expression, by actively repressing transcription or marking already silenced genes [7],[8]. Different models have been proposed for the molecular mechanisms of DNA methylation in transcriptional repression, including the blockage of transcription factor binding, and the recruitment of transcriptional repressors involved in methylation-dependent chromatin remodeling and gene repression [1],[9]. The important roles of DNA methylation are also evidenced by the association of aberrant DNA methylation with various human diseases [10],[11].

Previous findings obtained by high-throughput methods

To systematically study DNA methylation at the genomic scale, it is necessary to identify many, ideally all, methylated sites in a genome. Various high-throughput methods have been invented for large-scale detection of methylation events [8],[12]-[14]. These methods differ in the way genomic regions enriched for methylated or unmethylated DNA are identified, and how genomic locations of these regions or their sequences are determined. The former includes the use of methylation-sensitive restriction enzyme digestion [15],[16], immunoprecipitation [17]-[19], affinity capture [20],[21], and bisulfite conversion of unmethylated cytosines to uracils [2]-[4],[22]. The identities of the collected regions are determined by microarray [15]-[19] or sequencing [2]-[4],[20]-[22]. These methods have been extensively compared in terms of their genomic coverage, resolution, cost, consistency and context-specific bias [23],[

Figure 3

Sub-regions defined for each gene. The transcribed region (body) of a gene is divided into 6 variable-length sub-regions according to its exons and introns, namely first exon (FirstEx), first intron (FirstIn), last exon (LastEx), last intron (LastIn), internal exons (IntnEx) and internal introns (IntnIn). The 2 kb upstream region is divided into 5 fixed-length sub-regions Up1-Up5, each of 400 bp. Downstream sub-regions Dw1-Dw5 are defined analogously. In some analyses these sub-regions are further grouped into meta sub-regions, such as Upstream (Up1-Up5), Body (all the exonic and intronic sub-regions) and Downstream (Dw1-Dw5).
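As a rough illustration of the fixed-length flanking sub-regions described in this caption, the sketch below splits the 2 kb flanks of a gene into five 400 bp windows. The function name, the 0-based half-open coordinates, the strand handling and the numbering (Up1 and Dw1 taken as the windows closest to the gene) are assumptions made for illustration, not details confirmed by the text.

```python
def flanking_subregions(tx_start, tx_end, strand="+", flank=2000, n_bins=5):
    """Split the upstream and downstream flanks of a gene into equal windows.

    Coordinates are 0-based, half-open [start, end) on the forward strand.
    Assumption: Up1 and Dw1 are the windows closest to the gene; the
    numbering used in the paper may differ.
    """
    size = flank // n_bins  # 400 bp with the default arguments
    # Windows to the left and right of the transcribed region, nearest first.
    left = [(tx_start - (i + 1) * size, tx_start - i * size) for i in range(n_bins)]
    right = [(tx_end + i * size, tx_end + (i + 1) * size) for i in range(n_bins)]
    # On the minus strand, upstream lies to the right of the gene.
    up, down = (left, right) if strand == "+" else (right, left)
    regions = {f"Up{i + 1}": w for i, w in enumerate(up)}
    regions.update({f"Dw{i + 1}": w for i, w in enumerate(down)})
    return regions


regions = flanking_subregions(10000, 15000)
```

The variable-length exon and intron sub-regions would additionally require a gene annotation and are left out of this sketch.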

DNA methylation is partially indicative of expression class

We first constructed models with all DNA methylation features from the 16 sub-regions of each gene, using the mCG methylation measure. We tried 11 different model construction methods, and found that the Random Forest method [43] produced models with the highest cross-validation accuracy, regardless of the exact way model accuracy was computed (Additional file 1: Figure S17). We thus used the modeling accuracy of this method as a proxy for how indicative the methylation features are of gene expression. Based on the AUC measure (area under the receiver operating characteristic curve), the accuracy of the one-class-against-all models for the four expression classes ranged from 0.63 to 0.82 (Additional file 1: Figure S18). Since a random assignment of genes to expression classes would result in an AUC value of 0.5, this indicates that the methylation features were able to partially separate genes from different expression classes. Among the four expression classes, the Lowest expression class had the highest accuracy, followed by the Highest, Medium-high and Medium-low classes. These results are consistent with what we observed from the scatterplots: many genes with the lowest expression levels have very high methylation levels, which separate them from genes with higher expression levels. The genes with the highest expression levels are slightly more difficult to identify, since their signature of low methylation is also shared by many genes from other expression classes. Lacking clear signatures from DNA methylation levels alone, genes in the two medium expression classes are the most difficult to identify. The same trends were observed when we repeated the analysis with all four DNA methylation quantification measures and a wide range of expression class numbers (from 2 to 64, Additional file 1: Figures S19–S22).
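To make the one-class-against-all AUC evaluation concrete, the following minimal sketch computes one AUC value per expression class from per-gene scores. It covers only the evaluation step, not the Random Forest training; the function names and the toy scores are hypothetical, and the pairwise formulation is just one standard way to compute AUC.

```python
def auc(scores, labels):
    """AUC as the probability that a random positive outranks a random negative.

    A tie between a positive and a negative counts as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(class_scores, true_classes):
    """Compute one AUC per class, treating that class as the positive class.

    class_scores maps a class name to one score per gene (hypothetical model
    outputs); true_classes lists the actual class of each gene.
    """
    return {
        c: auc(s, [t == c for t in true_classes])
        for c, s in class_scores.items()
    }


# Toy example with made-up scores for four genes and two classes.
true_classes = ["Lowest", "Highest", "Lowest", "Highest"]
class_scores = {
    "Lowest": [0.9, 0.2, 0.4, 0.6],
    "Highest": [0.1, 0.8, 0.6, 0.4],
}
per_class = one_vs_rest_auc(class_scores, true_classes)
```

Under this definition a model that scores genes at random has an expected AUC of 0.5, which is the baseline referred to in the text.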

Gene body methylation is a stronger indicator of expression class than promoter methylation

We then compared the models constructed using features from the upstream regions, gene bodies or downstream regions alone (Figure 4). Methylation levels at gene bodies were more indicative of the expression class of a gene than those at upstream and downstream regions, for all four expression classes. Combining features from all sub-regions gave the best modeling accuracy, which shows that the features from the different sub-regions are not totally redundant, and may play different roles in gene regulation. These observations hold for all four methylation quantification measures (Figure 5 and Additional file 1: Figure S23). Comparing the modeling accuracy of the four methylation measures, none of them is clearly better than the others, although on average mCG/CG/len had a slightly higher accuracy.

Figure 4

Accuracy of Random Forest expression models based on DNA methylation features quantified by mCG from three individual sub-regions or their combination. The accuracy values of genes from the four expression classes are shown in the first four bar groups, while the last bar group shows the average accuracy of the four expression classes.

Figure 5

Comparison of the modeling accuracy based on different DNA methylation measures. The Random Forest expression models based on the four quantification measures of DNA methylation are shown in different colors. The modeling accuracy values for genes from each of the four expression classes are shown in the first four rows, while the last row shows the average accuracy over the four expression classes. Within each row, the four bar groups show the accuracy values of the models constructed from only downstream features, only upstream features, only gene body features, and all of them, respectively. The four quantification measures are ordered according to the average accuracy of their corresponding models when features from all three sub-regions are considered.

A potential confounding factor of the above analyses is that the upstream and downstream regions of a transcript could overlap with the body of another transcript [1: Figure S25), but again the modeling accuracy was higher when both types of features were considered than when either one was used alone.

To test whether the above observations are sensitive to the way we define expression classes, we also used a second way to divide genes into four expression classes, each covering an equal range of log-expression values. The results (Additional file 1: Figure S26) show that all the main observations discussed above remain unaffected.
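The equal-range definition of expression classes can be sketched as follows. This is a minimal illustration only: the function name, the base-2 logarithm and the pseudocount of 1 are assumptions, since the text does not state how zero expression values or the log base were handled.

```python
import math


def equal_range_classes(expression, n_classes=4, pseudocount=1.0):
    """Assign each gene to one of n_classes bins of equal log-expression width.

    The pseudocount and the log base are assumptions made for illustration.
    Returns class indices from 0 (lowest) to n_classes - 1 (highest).
    """
    logs = [math.log2(x + pseudocount) for x in expression]
    lo, hi = min(logs), max(logs)
    width = (hi - lo) / n_classes
    if width == 0:
        # All genes have the same expression level, so use a single class.
        return [0] * len(logs)
    # The maximum value would land in bin n_classes, so clamp it to the top bin.
    return [min(int((v - lo) / width), n_classes - 1) for v in logs]


classes = equal_range_classes([0, 1, 3, 7, 15])
```

Unlike a division into classes with equal numbers of genes, equal-range bins can contain very different numbers of genes, which is why checking both definitions is informative.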

Quantitative relationships between promoter and gene body methylation

Since both promoter and gene body methylation are indicative of gene expression to a certain extent, we next explored whether they carry redundant information. When plotting the DNA methylation levels at these two regions for all genes, the distributions based on the four quantification measures were found to be very different (Additional file 1: Figure S27). An L-shaped pattern was observed for mCG (Additional file 1: Figure S27a) and less obviously for mCG/len (Additional file 1: Figure S27c), but not for the other two measures (Additional file 1: Figure S27b and d). Notably, when mCG/CG was used for quantification, the genes were divided into two large clusters (Additional file 1: Figure S27b). Both clusters display very high levels of gene body methylation, but one has very high and the other very low promoter methylation. We also created scatterplots for studying the relationships between the length, the number of CpGs, and the number of methylated CpGs in each sub-region, for each of the 16 types of sub-regions (Additional file 1: Figures S28–S30). The scatterplots of the number of CpGs against the number of methylated CpGs reveal some interesting patterns about the two clusters in the mCG/CG plot (Additional file 1: Figure S29). For most gene body sub-regions except FirstEx and to some degree LastEx, the genes form a straight line along the diagonal CG = mCG, showing that although different genes have different absolute numbers of CpGs in their gene bodies, most of their internal exons and internal introns are fully methylated. In contrast, for the upstream and downstream sub-regions, as well as the first exon, the genes form a tilted V-shaped pattern, with one group of genes lying close to the diagonal CG = mCG and another group lying close to the axis mCG = 0, which correspond to the extreme cases of fully methylated and fully unmethylated CpGs.
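To make the four quantification measures compared here concrete, the sketch below computes them for a single sub-region from its length, its number of CpGs and its number of methylated CpGs. The definitions are inferred from the measure names and should be read as assumptions; the function name and the handling of regions without any CpG are likewise illustrative.

```python
def methylation_measures(n_methylated_cpg, n_cpg, length):
    """Return the four quantification measures for one sub-region.

    Assumed definitions: mCG is the count of methylated CpGs, mCG/CG the
    methylated fraction, mCG/len the count per base pair, and mCG/CG/len the
    fraction per base pair. Regions without CpGs get a fraction of 0.0 here,
    which is an arbitrary choice.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    fraction = n_methylated_cpg / n_cpg if n_cpg else 0.0
    return {
        "mCG": n_methylated_cpg,
        "mCG/CG": fraction,
        "mCG/len": n_methylated_cpg / length,
        "mCG/CG/len": fraction / length,
    }


# Two hypothetical 400 bp sub-regions, both fully methylated.
cpg_rich = methylation_measures(40, 40, 400)
cpg_poor = methylation_measures(2, 2, 400)
```

In this toy example both sub-regions have mCG/CG equal to 1, yet their mCG values differ twentyfold, which illustrates how the same genes can look very different under the count-based and the ratio-based measures.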

To gain more insights into the relationships between promoter and gene body methylation, we included in our analysis the expression levels of the genes (Additional file 1: Figure S31). The three-dimensional scatterplot based on the mCG measure displays the sharpest pattern among the four plots (Additional file 1: Figure S31a), which shows a “triple-inverse” relationship between promoter methylation, gene body methylation and gene expression. This triple-inverse relationship indicates that a gene can either have a high promoter mCG level, a high gene body mCG level, or a high expression level, but not two or three of them simultaneously. This relationship between the three quantities is consistent with the L-shaped patterns we previously observed in the 2D plots (Additional file 1: Figures S8a, S9a and S27a). These results suggest that in terms of the absolute number of methylated CpG sites, either strong promoter methylation or strong gene body methylation alone is sufficient to indicate low expression, and it is not required for a gene to redundantly have both indicators.

Potential role of gene body methylation for genes with CpG-poor promoters

It has been proposed that for CpG island promoters, DNA methylation is a sufficient but not necessary condition for gene inactivation, while for CpG-poor promoters, DNA methylation does not preclude expression [19]. To check whether the same observations could be made in our data, we plotted the expression levels of different groups of genes according to their promoter CpG levels (Figure 6A and B). Indeed, the expression levels of genes with a large number of CpG dinucleotides in their promoter regions were more strongly associated with the DNA methylation in these regions. Specifically, for both the mCG and mCG/CG measures, promoter methylation was more strongly anti-correlated with gene expression for genes with the highest or medium promoter CpG levels (first two bar sets of the figures) than for those with the lowest promoter CpG levels (last bar sets of the figures). Genes with the lowest promoter CpG levels were largely insensitive to promoter methylation, and had low expression levels in general.

Figure 6

Relationship between DNA methylation and gene expression for genes with different promoter CpG levels. The four panels show the expression levels of different subsets of genes and their corresponding methylation levels at upstream (A and B) or transcribed regions (C and D). Panels A and C use the mCG methylation measure, while panels B and D use the mCG/CG measure. Within each panel, the genes are first divided into three subsets according to their promoter CpG levels, which correspond to the three bar groups. Each subset is further divided into three subsets based on methylation level. Finally, for each resulting subset of genes, the distribution of expression levels is shown by a box-and-whisker plot.

For this group of genes with CpG-poor promoters, can gene body methylation indicate their expression levels? To answer this question, we again divided genes into three groups according to their promoter CpG counts, but this time we studied the correlation between gene body methylation and expression levels of each group instead (Figure 6C and D). For both mCG and mCG/CG, the genes with CpG-poor promoters do exhibit some weak differential expression patterns as gene body methylation level varies, but the correlation between gene body methylation and expression was positive for mCG and negative for mCG/CG. These results suggest a potential role of gene body methylation in regulating genes with CpG-poor promoters, although the exact mode of regulation is yet to be understood.

Generality of the quantitative models

All the results above were based on quantitative models both constructed and tested on the same individuals (albeit on different subsets of genes), using data from one single cell type (PBMC). To test if these models are generally useful for signifying expression classes, we collected single-base resolution bisulfite sequencing and RNA-seq data for two cell lines, H1 human embryonic stem cells (hESC) and the human lung fibroblast line IMR90, from the Roadmap Epigenomics Project [45] (Additional file 1: Table S3). We constructed models using DNA methylation and expression data from one individual/cell line, and applied the models to predict the expression class of genes in another individual/cell line based on its DNA methylation profile alone. To ensure the generality of the models, the genes used for training in the first individual/cell line and the genes used for testing in the second individual/cell line were mutually exclusive.
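The requirement that training and testing genes be mutually exclusive across samples can be sketched as below. The function name, the 50/50 split and the fixed random seed are assumptions for illustration; the text does not state how the genes were actually partitioned.

```python
import random


def disjoint_gene_split(genes, train_fraction=0.5, seed=0):
    """Split gene IDs into non-overlapping training and testing sets.

    The training set is meant to be paired with data from the first sample
    and the testing set with data from the second sample, so that no gene is
    used in both roles. The fraction and the seed are illustrative choices.
    """
    shuffled = sorted(set(genes))  # sort first so the split is reproducible
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return set(shuffled[:cut]), set(shuffled[cut:])


all_genes = [f"gene{i}" for i in range(10)]
train_genes, test_genes = disjoint_gene_split(all_genes)
```

A model would then be fitted on the methylation and expression values of `train_genes` in one individual/cell line and evaluated on the methylation values of `test_genes` in another.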

The results (Figure 7) show that, for all combinations of training and testing individuals/cell lines, the prediction accuracy was much higher than that of random predictions (which would have an AUC value of 0.5). Models constructed from any one of the three individuals were able to predict the expression classes of genes in another individual with an average AUC of about 0.9, which is expected as these samples all contained PBMC from individuals in the same family. More interestingly, the other data set combinations also had an average AUC of about 0.75, which demonstrates the generality of the constructed models. These cross-sample results reconfirm our earlier finding that the more extreme expression classes are better indicated by methylation patterns. Moreover, among the four methylation quantification measures used, mCG, mCG/len and mCG/CG/len consistently provided better modeling accuracy than mCG/CG (Figure 7), which indicates that the commonly used quantification measure of DNA methylation, mCG/CG, is not necessarily the best in signifying gene expression classes.

Figure 7

Generality of the quantitative models. Random Forest expression models were constructed using methylation and expression data from one of the individuals or cell lines, indicated by the different columns. The methylation level of a gene is defined as the average level over its upstream, transcribed and downstream regions. These models were used to predict the expression levels of genes in another individual/cell line, based on the DNA methylation levels measured in that individual/cell line. For each of these model training/testing combinations, the prediction accuracy values of the genes in different expression classes, and their overall average, are shown in different bar groups. Within each bar group, the accuracy values based on the four DNA methylation measures are shown.

Quantitative relationship with histone modifications

Our quantitative models based on DNA methylation were able to achieve reasonable accuracy in identifying the expression class of a gene, but they also show that DNA methylation alone is not informative enough to signify precise expression levels. We have previously shown that histone modifications are strong indicators of expression levels [46],[47]. Therefore, we next explored the relationship between DNA methylation and histone modifications in terms of indicating gene expression, and tested whether information on gene expression conveyed by DNA methylation is totally subsumed by that of histone modifications. It was previously shown that promoter methylation was negatively correlated with H3K4me3 (histone 3 lysine 4 trimethylation) in the human brain [32], and gene body methylation was positively correlated with H3K36me3 and negatively correlated with H3K27me3 in a B-lymphocyte cell line [28]. To study the quantitative relationships between DNA methylation and histone modifications in the context of indicating expression levels, we compared statistical models that involve either only DNA methylation features, only histone modification features, or both.

We collected ChIP-seq data for 26 types of histone modification in the H1 embryonic stem cell line from the Roadmap Epigenomics Project (Additional file 1: Table S3). As with DNA methylation, we computed the average signal of each type of histone modification in the same 16 sub-regions for each gene. Although some histone marks are known to be enriched in particular sub-regions, this knowledge is limited to some well-studied types of histone modification. We therefore considered all sub-regions and let the Random Forest method identify the features most useful for indicating expression levels.

As expected, some of the models constructed from histone modification features alone had high cross-validation accuracy (Figure 8). Consistent with previous findings, the two strongest feature sets were H3K36me3 and H3K4me3, which mark actively transcribed regions and active promoters, respectively [48]. Models based on DNA methylation features alone were not as accurate as those constructed from these histone modification features well-known for their roles in marking gene activities, but were more accurate than many other types of histone modification such as H3K9me3 and H3K4me1 (Figure 8).

Figure 8

Joint effects of DNA methylation and histone modifications on gene expression. The four panels compare Random Forest expression models with only DNA methylation features (straight line with triangle markers), only histone modification features (orange bars), or both (blue bars). The four panels involve DNA methylation levels computed by different quantification measures. For DNA methylation and for each type of histone modification, the signal level is computed as the average over the upstream, transcribed and downstream regions of a gene. In each panel, the first 26 bar groups correspond to models involving one of the 26 types of histone modification, while the last bar group corresponds to the model involving all 26 types of histone modification.

DNA methylation and histone modifications contain non-redundant information about gene expression

Interestingly, regardless of the type of histone modification and the DNA methylation measure used, combining both types of features consistently gave higher accuracy than the corresponding models involving only histone modification features or only DNA methylation features. Even for the strongest histone modification feature set, derived from H3K36me3, incorporating DNA methylation features still led to a relative improvement in modeling accuracy of about 6%, from an AUC value of 0.83 to 0.88 for mCG/CG/len, which indicates that the two types of signals were not completely redundant in terms of signifying gene expression.

To better understand how DNA methylation complements histone modification in indicating expression classes, we examined the DNA methylation and H3K36me3 signal levels of two types of genes, namely (1) those with expression classes correctly identified by the model involving only mCG/CG/len features but not by the model involving only H3K36me3 features, and (2) the converse, i.e., those with expression classes correctly identified by the H3K36me3 model but not by the mCG/CG/len model. The genes with expression classes correctly identified only by the mCG/CG/len model displayed higher mCG/CG/len levels (Figure 9A, blue lines and areas) and lower H3K36me3 levels (Figure 9B), indicating that in general they were the less transcribed genes. Among the different sub-regions, as expected, the ones best separating the two groups of genes in terms of H3K36me3 signals were those within the gene bodies, and to a lesser extent those at downstream regions (Figure 9B). Interestingly, in terms of mCG/CG/len levels, the sub-regions that best separated the two groups of genes were the exonic regions, especially the first exon (Figure 9A), indicating that methylation levels at exonic regions not only play crucial roles in models involving DNA methylation features alone, but could also be important in complementing histone modifications in indicating the expression class of a gene.

Figure 9

DNA methylation and H3K36me3 levels of genes whose expression classes were correctly identified by the mCG/CG/len model but not the H3K36me3 model, or vice versa. In the figures, the solid lines represent the median signal value of all genes in the group, and the shaded area of the same color tone marks the 25th to 75th percentile range.

As in the case of DNA methylation, histone modification features were most successful in identifying genes with lowest expression levels (Additional file 1: Figure S32). However, even the strongest histone modification features were not significantly better than DNA methylation in identifying these genes. In contrast, some of them were much better in identifying genes with medium expression levels, suggesting that DNA methylation mainly indicates the coarse on/off status of a gene, while some histone marks provide more fine-grained details about the precise expression levels.

We examined the relationships between DNA methylation and histone modifications in more detail by plotting their values in different sub-regions of genes (Additional file 1: Figures S33–S34). In particular, we reconfirmed previous findings that DNA methylation and H3K4me3 are negatively correlated at the upstream region (Figure 10). However, whether gene body methylation correlates positively or negatively with H3K36me3 depends on the DNA methylation quantification measure (Figure 11), with the correlation being most positive for mCG/len and most negative for mCG/CG.

Figure 10

Relationships between DNA methylation (y-axis) and H3K4me3 (x-axis) at the upstream regions of genes, based on the four DNA methylation measures.

Figure 11

Relationships between DNA methylation (y-axis) and H3K36me3 (x-axis) at the transcribed regions of genes, based on the four DNA methylation measures.

A small number of DNA methylation and histone modification features are sufficient to maximally indicate gene expression

When we combined features derived from DNA methylation and all 26 types of histone modifications, the resulting model had a higher accuracy than all the models involving single histone modification and/or DNA methylation features (Figure 8). To test if it is possible to achieve the same accuracy with a smaller number of feature sets, we applied a forward feature selection procedure. Specifically, we started with either an empty set of features, or all DNA methylation features based on one quantification measure. We then iteratively added the set of features for the type of histone modification that could maximize the accuracy gain, until no more sets could lead to any further improvements. Depending on the DNA methylation features included in the first step, maximal accuracy was achieved by 6-8 feature sets in total (Additional file 1: Figure S45).
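The forward feature selection procedure described above can be sketched as a simple greedy loop. This is a minimal illustration under stated assumptions: the function name is hypothetical, `score` stands in for the cross-validated accuracy of a model trained on the given feature sets, and the toy score values below are made up rather than taken from the paper.

```python
def forward_select(candidates, score, start=()):
    """Greedy forward selection over named feature sets.

    candidates: iterable of feature-set names (e.g. histone marks).
    score: callable taking a tuple of names and returning model accuracy;
           here it stands in for cross-validated AUC (an assumption).
    start: feature sets included from the beginning (e.g. DNA methylation).
    """
    selected = list(start)
    remaining = [c for c in candidates if c not in selected]
    # With an empty start there is no model yet, so any first set is accepted.
    best = score(tuple(selected)) if selected else float("-inf")
    while remaining:
        # Score every one-step extension and keep the most accurate one.
        top_score, top = max((score(tuple(selected + [c])), c) for c in remaining)
        if top_score <= best:
            break  # no remaining feature set improves accuracy any further
        selected.append(top)
        remaining.remove(top)
        best = top_score
    return selected, best


# Toy additive score with made-up integer gains, for demonstration only.
toy_gain = {"mCG": 1, "H3K36me3": 3, "H3K4me3": 2, "H3K9me3": 0}


def toy_score(names):
    return 5 + sum(toy_gain[n] for n in names)


marks = ["H3K36me3", "H3K4me3", "H3K9me3"]
from_empty = forward_select(marks, toy_score)
from_mcg = forward_select(marks, toy_score, start=("mCG",))
```

Because the loop stops as soon as no candidate raises the score, the number of selected feature sets depends on the starting set, in line with the 6-8 feature sets reported for the different starting conditions.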

Consistent with the single-feature-set results, H3K36me3 and H3K4me3 were always the features first incorporated into the models. The features next incorporated include those that involve H3K79, and the repressive mark H3K27me3. For the DNA methylation measures mCG, mCG/CG and mCG/CG/len, including DNA methylation features resulted in final models with higher accuracy than the one involving histone modification features alone, indicating that DNA methylation has non-negligible roles in these models with maximal modeling accuracy.

Since the AUC values were increased most by H3K36me3 and H3K4me3, and these two marks are well-known to be most indicative of expression levels, we believe similar results would be obtained if we had applied other feature selection methods.