Background

5-methylcytosine (5mC) is a key epigenetic modification involved in biological processes such as regulation of gene expression, DNA structure and control of transposable elements. 5mC exists in most eukaryotic groups, including plants, fungi, and invertebrate and vertebrate animals [1]. It is, however, absent in certain model organisms such as the budding yeast Saccharomyces cerevisiae, the nematode worm Caenorhabditis elegans and the fly Drosophila melanogaster. Furthermore, the levels and genomic patterns of 5mC are evolutionarily labile. While invertebrate genomes display sparse methylation, with most methylation accumulating in transcribed genes, vertebrate genomes are extensively methylated [2].

In vertebrate genomes, 5mC occurs predominantly in a CpG sequence context [3]. 5mC can be converted to thymine by spontaneous or enzymatic deamination, which is thought to lead to an evolutionary depletion of CpGs in methylated vertebrate genomes [4], except at CpG-rich regions known as CpG islands (CGIs) that remain mostly unmethylated in somatic cells and the germline [5].

In the well-studied mouse and human genomes, DNA methylation silences transposable elements and prevents them from disturbing expression of neighboring genes [6, 7]. In the mouse, CGI methylation is infrequent and occurs mostly in gene bodies [8]. At transcription start sites (TSS), most CGIs remain constitutively unmethylated, except a minor fraction undergoing long-term silencing by DNA methylation associated with X-chromosome inactivation (XCI), parental genomic imprinting and developmental genes. In particular, promoter CGI DNA methylation is targeted to a small number of germline genes during mouse development and is required to maintain these genes repressed in somatic lineages [6, 8, 9]. More recently, a new class of large unmethylated regions covering developmental genes, termed DNA methylation valleys or canyons, has been described in mouse, human and zebrafish [10, …]

To ensure that this observation is not biased by CGI annotations and to further explore the relationship between CpG density and methylation, we correlated DNA methylation with the CpG observed/expected ratio (hereafter CpG ratio) in 0.5 kb genomic windows for each species. In the mouse, most of the windows with a CpG ratio above 0.7 are hypomethylated. In contrast, in all the other species the probability of methylation decreases only at a higher CpG ratio (Fig. 2d). This is consistent with a previous study showing that experimentally defined hypomethylated islands have a much lower CpG ratio in the mouse compared to the human genome [19]. Altogether, this shows that the CpG ratio threshold that protects against methylation varies between species and is lower in the mouse than in other vertebrates.
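As an illustration of the quantity used here, the CpG observed/expected ratio of a window can be computed from simple base counts. The sketch below assumes the commonly used definition, (number of CpG dinucleotides × window length) / (number of C × number of G). The exact formula and windowing applied in this study are given in its Methods and may differ, and the function names are hypothetical.

```python
from typing import List, Tuple


def cpg_obs_exp(seq: str) -> float:
    """CpG observed/expected ratio of one sequence (assumed definition)."""
    seq = seq.upper()
    n_c = seq.count("C")
    n_g = seq.count("G")
    n_cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    if n_c == 0 or n_g == 0:
        return 0.0  # ratio is undefined without C or G; report 0 by convention
    return n_cpg * len(seq) / (n_c * n_g)


def window_ratios(sequence: str, window: int = 500) -> List[Tuple[int, float]]:
    """Ratio for consecutive non-overlapping windows (0.5 kb by default).

    A trailing partial window is ignored.
    """
    return [
        (start, cpg_obs_exp(sequence[start:start + window]))
        for start in range(0, len(sequence) - window + 1, window)
    ]
```

Each window's ratio could then be paired with its mean methylation to reproduce the kind of comparison described for Fig. 2d.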

Having identified a higher fraction of methylated CGIs in somatic cells in all the studied species compared to the mouse, we wondered whether CGIs also show an increased frequency of methylation in the germline. We used public WGBS and Reduced Representation Bisulfite Sequencing (RRBS) data of sperm from human, mouse, dog, cow and chicken and complemented this set by performing RRBS on sperm from human, pig and chicken (Table S1 and S2). As in somatic tissues, this revealed that the fraction of methylated CGIs in sperm is higher in other species compared to the mouse (Fig. 2e, Additional file 1: Fig. S3a-b). Additionally, we analyzed public WGBS profiles in oocytes of human, mouse and cow and found again an increased frequency of methylation of annotated CGIs in human and cow compared to the mouse (Additional file 1: Fig. S3c). In each species, the CGI methylation status in fibroblasts positively correlates with CGI methylation in gametes (Additional file 1: Fig. S3d), suggesting a consistent pattern of CGI methylation between gametes and somatic cells. In summary, this shows that CG-rich sequences are more frequently methylated in germ and somatic cells of other vertebrates compared to the mouse.

Large unmethylated valleys are conserved among vertebrates

Large unmethylated regions covering several kilobases have been previously described in human and mouse and named DNA Methylation Valleys (DMVs) or canyons [10, …] […] (Fig. 3d). These results show that large unmethylated valleys covering transcription factor genes and developmental genes are highly conserved among vertebrates.

Fig. 3

Conservation of DNA Methylation Valleys (DMVs) across vertebrates. a Bar graphs showing the number of identified PMDs, UMRs and DMVs (UMRs ≥ 5 kb) and the percentage of the genome covered by each feature in dermal fibroblasts of the seven studied species. b Gene ontology analysis of genes located in DMVs. The graph shows the enrichments of ontology terms 'regulation of transcription' and 'developmental process' in DMV-associated genes compared to all genes (p value: hypergeometric test). c Genome browser snapshots of WGBS methylation scores showing a conserved DMV overlapping the MEIS1 gene in dermal fibroblasts of the seven species. Each WGBS track shows the percent methylation of individual CpGs between 0 and 100%. CpG islands (green rectangles) and Ensembl gene annotations are shown below the tracks. d Analysis of the overlap of genes located in DMVs in fibroblasts across the seven vertebrate species. Each square in the heatmap represents the percentage of common genes associated with DMVs between two species. Details about the calculation are provided in the Methods section

Prediction of allele-specific methylation reveals a conserved set of imprinted genes in mammals

Imprinted genes are under the control of germline differentially methylated regions (gDMRs), which acquire differential methylation in the parental gametes and can also direct the establishment of somatic DMRs in the embryo. Imprinted DMRs are CpG-rich and present ~50% methylation because either the maternal or the paternal allele is highly methylated and the other one is unmethylated. Furthermore, they are generally maintained in all somatic tissues and thus can be used to comprehensively identify imprinted genes irrespective of whether they are expressed. The catalog of imprinted DMRs is well described in mouse and human, but to what extent imprinted genes are conserved in all mammals remains elusive. We therefore wished to use the WGBS data to predict imprinted DMRs and investigate their conservation across mammals. We developed a pipeline to predict regions of allelic methylation that have a mean methylation between 30 and 60%, more than 90% of reads that are either fully methylated or fully unmethylated, and a maximum of 40% difference between fully methylated and fully unmethylated reads (Fig. 4a, see Methods). We excluded the regions on the X chromosome, as they can be subjected to X chromosome inactivation in females, and those overlapping developmental genes (such as HOX and TBX) previously known to have variable allele-specific methylation [21, 22]. Finally, we added stringent criteria by selecting only regions with more than 20 CpGs and longer than 350 bp (stringent mode), while keeping a lenient prediction mode that only selects for regions with more than 10 CpGs (Fig. 4a).
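The selection criteria above can be summarized in a short sketch. This is not the pipeline used in the study: it assumes that each read covering a candidate region has already been reduced to a methylation fraction, that the regional mean is taken as the mean of these per-read fractions, and that "fully methylated" and "fully unmethylated" mean fractions of exactly 1 and 0. The function name and the input format are hypothetical, and the X chromosome and developmental gene exclusions are not modelled.

```python
from typing import Optional, Sequence


def classify_allelic_region(
    read_meth: Sequence[float], n_cpg: int, length_bp: int
) -> Optional[str]:
    """Return 'stringent', 'lenient' or None for one candidate region.

    read_meth: per-read methylation fractions in [0, 1] (assumed input format).
    """
    if not read_meth:
        return None
    n_reads = len(read_meth)
    mean_meth = sum(read_meth) / n_reads
    frac_full = sum(1 for m in read_meth if m == 1.0) / n_reads
    frac_none = sum(1 for m in read_meth if m == 0.0) / n_reads

    is_allelic = (
        0.30 <= mean_meth <= 0.60                 # mean methylation 30-60%
        and frac_full + frac_none > 0.90          # >90% fully (un)methylated reads
        and abs(frac_full - frac_none) <= 0.40    # at most 40% imbalance
    )
    if not is_allelic:
        return None
    if n_cpg > 20 and length_bp > 350:            # stringent mode
        return "stringent"
    if n_cpg > 10:                                # lenient mode
        return "lenient"
    return None
```

A region with an even mixture of fully methylated and fully unmethylated reads passes, whereas a region in which every read is about half methylated has the same mean but is rejected, which is the distinction between allelic and random partial methylation that the pipeline is designed to make.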

Fig. 4

Prediction of imprinted DMRs in mammals using WGBS data. a Description of the pipeline used to detect potential imprinted DMRs using WGBS data. To differentiate allelic from partial random methylation, we use single read methylation scores to identify regions that contain a mixture of fully methylated and fully unmethylated reads. We applied a stringent mode to identify regions larger than 350 bp with a minimum of 20 CpGs, and a lenient mode to identify regions with a minimum of 10 CpGs. b Top ranked genes associated with predicted allelic DMRs in fibroblasts in stringent and lenient mode in at least 2 species. Asterisks indicate genes previously known to be imprinted in human or mouse. c Genome browser snapshots of WGBS profiles over the PLAGL1 (left panel) and KBTBD6 (right panel) genes in dermal fibroblasts of 6 mammalian species. KBTBD6 is not shown in the rabbit because of lack of gene annotation. CpG islands (green rectangles), predicted DMRs (purple rectangles) and Ensembl or RefSeq gene annotations are shown below the tracks

When we applied this prediction pipeline to mouse fibroblasts, 18 out of the 20 known mouse gDMRs were identified and 30 out of the 33 regions identified under stringent criteria were close to a known imprinted gene (Table S4), which demonstrates the reliability of the pipeline. Applying this pipeline to the six mammals led to the identification of 29 genes close to regions with allelic methylation in at least 2 species (Fig. 4b). The top 16 ranked genes, predicted in at least 4 species, were known imprinted genes in the mouse; among them, MEST, GNAS, PEG10, KCNQ1 and PLAGL1 were predicted to be imprinted in all tested mammals (Fig. 4b). None of these DMRs were identified in the chicken, which is known to lack genomic imprinting (Fig. 4b). Thus, this analysis reveals a conserved core set of genes predicted to carry imprinted methylation in mammals.

Conversely, we also predicted novel DMRs occurring in mammals other than mouse and human. One example is KBTBD6, a gene not previously described as imprinted in mouse or human (Fig. 4b, c). In our pipeline, this gene is predicted to carry allelic methylation in dog, cow and pig. Interestingly, it has recently been identified as an imprinted gene in pig using an allelic expression screening strategy [23].

In the mouse, ZFP57 interacts with a CpG-methylated hexanucleotide (TGCCGC) in gDMRs and is required for the maintenance of allele-specific methylation during development [24, 25]. To investigate whether conserved mechanisms take place in mammals, we performed an enrichment analysis of transcription factor (TF) motifs from the JASPAR database in the predicted stringent allelic DMRs of each species. We selected motifs present in more than 50% of DMRs with a p-value < 0.01 compared to random regions with similar GC content. The ZFP57 motif showed a significant enrichment in DMRs of all mammalian species except the dog, suggesting a conserved role in maintaining imprinted DMRs across mammals (Table S5). Interestingly, another zinc finger protein ZBTB14 showed a motif enrichment in five mammalian species, suggesting a potential role in regulating imprinted DMRs (Table S5). These data suggest potential conservation of the molecular mechanisms regulating imprinted allelic methylation across mammals.

Reconfiguration of DNA methylation is a hallmark of X-chromosome inactivation in all mammals

DNA methylation is reconfigured on the inactive X chromosome in human and mouse. Promoter CGIs are usually unmethylated on the active X chromosome (Xa) and highly methylated on the inactive X chromosome, leading to an average methylation level of 30-40% [26, 27]. In humans, an early study also showed that the active X is more methylated than the inactive X chromosome in gene bodies [28]. Because we used female dermal fibroblasts in all species, we could study the conservation of DNA methylation changes associated with X-chromosome inactivation across placental mammals. For each species, we compared the mean CG methylation of promoter-CGIs and non-CGI regions (1 kb tiles) on the X chromosome and autosomes. As expected, in human and mouse, promoter-CGIs on autosomes were unmethylated while the major fraction of promoter-CGIs on the X chromosome had a mean methylation around 30% (Fig. 5a). This pattern was recapitulated in all the analyzed mammalian species (Fig. 5a). Conversely, the mean methylation of non-CGI regions was significantly lower on the X chromosome compared to autosomes in all mammals (Fig. 5b). Interestingly, the global hypomethylation of non-CGI regions is more pronounced in all other mammals than in the mouse (Fig. 5b). This indicates that the DNA methylation signature of X chromosome inactivation is conserved in mammals.

Fig. 5

Conserved DNA methylation signature of X chromosome inactivation in mammals. a Violin plots of CG methylation scores of promoter-CGIs measured by WGBS in dermal fibroblasts across autosomes (A) or the X chromosome (X) in mammalian species. Median values are indicated by white circles. ***: p-value < 0.001 (Wilcoxon test). b Violin plots of CG methylation scores of 1 kb genomic tiles (excluding CGIs) measured by WGBS in dermal fibroblasts across autosomes (A) or the X chromosome (X) in mammalian species. Median values are indicated by white circles. ***: p-value < 0.001 (Wilcoxon test). c Number of X-linked genes with an unmethylated promoter CGI (methylation < 10%), predicted to escape XCI, identified in each species. d Table of genes predicted to escape XCI in at least 3 species. Asterisks indicate genes previously shown as escapees in human and mouse

Promoter CGI (pCGI) methylation is strongly predictive of the XCI status, and unmethylated pCGIs can be used to predict genes that escape XCI [26, 29]. For each species, we determined X-linked genes with an unmethylated pCGI (<10%) that presumably escape XCI in order to investigate the conservation of XCI escape calls across species. We refined this analysis by manually checking on the genome browser potential promoter CGIs that could not be identified due to incorrect gene annotation. Mouse showed the lowest number of XCI escapees (Fig. 5c), which is in agreement with a recent study [30]. It is important to note that the number of genes escaping XCI in rabbit is underestimated due to poor gene annotation in this species. Indeed, we identified in rabbit several unmethylated CGIs that colocalized with a transcription start site but had no gene annotation. Overall, we identified 22 genes escaping XCI in at least three mammalian species (Fig. 5d). DDX3X, KDM6A and EIF2S3 were predicted XCI escapees in all the studied mammals, while most other genes were predicted as XCI escapees in mammals other than the mouse.
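The escape prediction described above amounts to a threshold on pCGI methylation followed by a count across species. The following minimal sketch illustrates that logic only: the input layout (one dictionary of pCGI methylation percentages per species, keyed by ortholog name) and the function names are assumptions, and the manual genome browser curation is not represented.

```python
from collections import Counter
from typing import Dict, Set


def predict_escapees(pcgi_meth: Dict[str, float], max_meth: float = 10.0) -> Set[str]:
    """X-linked genes whose promoter CGI methylation (%) is below max_meth."""
    return {gene for gene, meth in pcgi_meth.items() if meth < max_meth}


def shared_escapees(
    per_species: Dict[str, Dict[str, float]], min_species: int = 3
) -> Set[str]:
    """Genes predicted to escape XCI in at least min_species species."""
    counts: Counter = Counter()
    for pcgi_meth in per_species.values():
        counts.update(predict_escapees(pcgi_meth))
    return {gene for gene, n in counts.items() if n >= min_species}


# Toy values for illustration only (not data from the study).
toy = {
    "species_1": {"geneA": 2.0, "geneB": 35.0},
    "species_2": {"geneA": 5.0, "geneB": 8.0},
    "species_3": {"geneA": 1.0, "geneB": 40.0},
}
shared = shared_escapees(toy)  # only geneA is below 10% in all three species
```

Because the comparison relies on ortholog names, genes missing from a species annotation are silently undercounted, which is the limitation noted above for the rabbit.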

Altogether, these results reveal conservation of the DNA methylation patterns associated with XCI in mammals, with the mouse being an outlier in terms of hypomethylation of non-CGI regions and the number of XCI escapees.

Correlation between DNA methylation and gene expression

Next, we focused on the relationship between DNA methylation and gene transcription. Gene bodies represent the most conserved targets of DNA methylation in eukaryotes [1, 13] and in mouse and human, high gene body methylation has been associated with expressed genes [31,32,33]. To test whether this applies to other species, we quantified gene expression in the primary fibroblasts by RNA-seq (Table S6). In all mammals, genes with high expression (log2 rpkm > 0) were more likely to have high gene body methylation (Additional file 1: Fig. S4a-b). Compared to the other mammals, the mouse was again an exception with a lower difference in gene body methylation between highly expressed and lowly expressed genes. Surprisingly, we did not observe the same tendency in the chicken (Additional file 1: Fig. S4a-b).

To investigate the relationship between gene expression and promoter DNA methylation, we classified gene promoters into three groups based on their CG ratio: low (LCP), intermediate (ICP) and high (HCP) CG ratio promoters, with an adjustment of the CG ratio for each species (Additional file 1: Fig. S5a). In all species, HCP promoters were mostly hypomethylated, whereas most LCP promoters were highly methylated and ICP promoters showed intermediate levels of methylation (Additional file 1: Fig. S5b). In line with our above CGI methylation data, we noted that the mouse had the lowest proportion of highly methylated HCPs and ICPs (Additional file 1: Fig. S5b). Comparing RNA-seq expression and promoter DNA methylation revealed a significant anticorrelation between gene expression and promoter methylation for ICPs and HCPs in all species (Fig. 6a). This anticorrelation was less marked in some species such as rabbit and dog, probably due to more frequent inaccurate gene annotations in these species. In contrast, LCPs showed an anticorrelation in human, mouse and pig but not in the other species. Altogether, these results demonstrate that methylation of CpG-rich promoters correlates with low gene expression across vertebrates.

Fig. 6

Impact of promoter DNA methylation on gene expression in vertebrates. a Boxplots showing gene expression scores (rpkm) depending on the level of promoter DNA methylation for genes with LCP, ICP or HCP promoters in each species. b Boxplots of promoter DNA methylation scores in fibroblasts for the previously identified list of germline genes upregulated in Dnmt3a/3b double knockout embryos (termed 'gg dko' genes). For the species other than mouse, orthologs of mouse 'gg dko' genes are shown. c Enrichment of 'gg dko' orthologs among genes with methylated CG-rich promoters in fibroblasts for each species. The graph shows the associated adjusted p-values (-log10) calculated by hypergeometric tests. d Boxplots of the fold change (FC) of gene expression of 'gg dko' orthologs compared to all genes after 5-azadC treatment in fibroblasts. e Enrichment of 'gg dko' orthologs among genes upregulated by 5-azadC in each species. The graph shows adjusted p-values (-log10) calculated by hypergeometric tests. f Table showing germline genes upregulated by 5-azadC in at least 3 vertebrate species. The stringent mode corresponds to genes with a methylated promoter in control condition (> 50%), a fold change upon 5-azadC treatment > 3 and an adjusted p-value < 0.01. The lenient mode corresponds to less stringent cut-offs on promoter DNA methylation (> 25%) or fold change upon 5-azadC treatment (> 2). Genes in white did not pass the previous criteria. g RT-qPCR quantification of the expression of the DAZL gene in dermal fibroblasts treated with 5-azadC for 72 h compared to untreated fibroblasts (NT). The expression was normalized to two housekeeping genes (Gusb and Mrpl32) (mean ± SEM, n=3 independent experiments). In the boxplots, the line indicates the median, the box limits indicate the upper and lower quartiles, and the whiskers extend to 1.5 IQR from the quartiles in a, d or to the data extremes in b

Repression of germline genes and ERVs are conserved functions of DNA methylation across vertebrates

Having shown that methylation of CpG-rich promoters correlates with gene silencing, we investigated which genes are the principal targets of DNA methylation-mediated repression. In the mouse, repression by DNA methylation of CpG-rich promoters occurs predominantly at germline genes, but it is unknown whether this function is conserved in other vertebrates. Interestingly, GO enrichment analysis showed that most of the top ranked biological process terms associated with highly methylated (methylation > 50%) CpG-rich promoters (ICPs and HCPs) relate to germline functions (reproduction, meiosis, piRNA process, gamete generation, ...) in all mammals (Additional file 1: Fig. S5c, Table S7). In chicken, although the top ranked terms were not related to germline functions, many germline gene orthologs (such as DAZL, MEIOC, MAEL, DMRTB1, PNLDC1, RBM46, ...) were highly methylated but listed in different enriched terms (Table S7). Compared to the mouse, germline GO terms were less enriched in the other species, which is consistent with our above results showing more frequent methylation of CpG-rich promoters in other species. To avoid biases due to incomplete gene ontology annotation, we focused on a subset of germline genes that we previously identified as the first targets of DNA methylation in mouse double knockout (dko) embryos lacking DNMT3A and DNMT3B (hereafter termed 'gg dko' for 'germline genes dko') [6]. In all the studied species, we found that the orthologs of these genes (Table S8) tend to have methylated promoters (Fig. 6b) and are significantly enriched among genes with highly methylated CpG-rich promoters (Fig. 6c). These data demonstrate that these germline gene promoters are conserved targets of DNA methylation in vertebrates.
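The enrichment of 'gg dko' orthologs among genes with methylated CpG-rich promoters is assessed with hypergeometric tests (Fig. 6c). As a minimal sketch of such a test, the function below computes the one-sided upper-tail probability with the Python standard library. The argument names are hypothetical, the choice of gene universe is an assumption left to the caller, and the multiple-testing adjustment behind the reported adjusted p-values is not included.

```python
from math import comb


def hypergeom_enrichment_p(
    n_overlap: int, n_selected: int, n_marked: int, n_total: int
) -> float:
    """P(X >= n_overlap) for a hypergeometric draw.

    n_total:    size of the gene universe (background)
    n_marked:   genes of interest in the universe (e.g. 'gg dko' orthologs)
    n_selected: genes drawn (e.g. genes with a methylated CpG-rich promoter)
    n_overlap:  genes of interest observed among the selected genes
    """
    upper = min(n_selected, n_marked)
    favourable = sum(
        comb(n_marked, i) * comb(n_total - n_marked, n_selected - i)
        for i in range(n_overlap, upper + 1)
    )
    # Exact integer arithmetic; the ratio is converted to float only at the end.
    return favourable / comb(n_total, n_selected)
```

The same function applies to the enrichment among 5-azadC-upregulated genes, including the variant in which only genes with a methylated CpG-rich promoter form the background, by changing `n_total` and `n_marked` accordingly.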

To test for a causal link between promoter DNA methylation and repression of germline genes, we treated proliferating primary dermal fibroblasts of each species with the DNA methylation inhibitor 5-azadeoxycytidine (5-azadC) for 72 hours. To validate the effect of 5-azadC, we performed RRBS (Table S2) and confirmed reduced global DNA methylation in human, mouse, rabbit, cow, pig and chicken 5-azadC-treated cells (Additional file 1: Fig. S6a). In contrast, we could not achieve DNA methylation inhibition by 5-azadC treatment in the dog fibroblasts because they immediately stopped dividing upon treatment. Transcriptomic analysis by RNA-seq revealed that 5-azadC treatment led to more upregulated than downregulated genes in the six species (absolute fold change > 3 and adjusted p-value < 0.01, Table S9), which is consistent with a DNA methylation inhibition effect (Additional file 1: Fig. S6b). Interestingly, we found that 'gg dko' orthologs have high fold changes of expression compared to the whole gene population in all species (Fig. 6d), and were significantly enriched among genes upregulated by 5-azadC treatment in the 6 studied species (Fig. 6e). Furthermore, 'gg dko' orthologs were also enriched among upregulated genes when we used only genes with a methylated CpG-rich promoter as background in the 6 studied species (Fig. S6c), indicating that germline genes are preferentially affected by 5-azadC among all the methylated genes. We checked whether a common set of germline genes is repressed by DNA methylation and found 16 germline genes upregulated by 5-azadC treatment in at least 3 species (Fig. 6f). Among these genes, DAZL was found upregulated in 5 tested species (Additional file 1: Fig. S6d), which was validated by RT-qPCR (Fig. 6g). Although DAZL is not annotated in rabbit, we designed primers for a region supposed to be the ortholog of the DAZL 3' UTR and observed an induction of this potential transcript upon 5-azadC treatment (Fig. 6g). These results show that germline genes are conserved targets of DNA methylation-mediated repression in vertebrates.
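The differential expression cut-offs quoted above (absolute fold change > 3 and adjusted p-value < 0.01) can be expressed as a simple call per gene. This sketch assumes that "absolute fold change > 3" is symmetric, that is, more than threefold up or more than threefold down, and that fold changes are supplied on a log2 scale. The function names and the input layout are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, Tuple

FC_CUTOFF = 3.0      # fold change cut-off from the text
PADJ_CUTOFF = 0.01   # adjusted p-value cut-off from the text


def call_direction(log2_fc: float, padj: float) -> str:
    """Classify one gene as 'up', 'down' or 'unchanged' after treatment."""
    if padj >= PADJ_CUTOFF:
        return "unchanged"
    if log2_fc > math.log2(FC_CUTOFF):
        return "up"
    if log2_fc < -math.log2(FC_CUTOFF):
        return "down"
    return "unchanged"


def count_directions(genes: Iterable[Tuple[float, float]]) -> Counter:
    """Tally calls over (log2 fold change, adjusted p-value) pairs."""
    return Counter(call_direction(log2_fc, padj) for log2_fc, padj in genes)
```

Comparing the "up" and "down" tallies per species gives the kind of summary reported in Additional file 1: Fig. S6b.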

Finally, we investigated whether DNA methylation has a conserved role in repressing transposable elements (TEs). To this end, we counted unique and multiple-mapping reads in RepeatMasker annotations to evaluate the expression of TE families upon 5-azadC treatment in each species. As expected, we found a high number of upregulated TEs in mouse fibroblasts, including numerous intracisternal A particle (IAP) families and other LTR-containing endogenous retroviruses (ERVs) (Additional file 1: Fig. S7). Several TE families, mostly belonging to LTR-containing ERV classes, were also significantly upregulated upon 5-azadC treatment in all other species (Additional file 1: Fig. S7). The number of upregulated TE families was higher in mouse compared to the other species. This might be attributed to the presence of more evolutionarily young ERVs (such as IAPs) in the mouse genome, or to differences in the response to 5-azadC treatment or in the quality of genome annotations. Interestingly, this analysis revealed a high number of upregulated ERV families in chicken, indicating that although DNA methylation is globally reduced in the chicken genome, it is nevertheless involved in the maintenance of ERV repression.

Altogether, these results show that repression of germline genes and ERVs are evolutionarily conserved functions of DNA methylation among vertebrates.