Introduction

Transposable elements (TEs) are repetitive sequences that comprise approximately half of mammalian genomes due to historic proliferation within the host. Although the vast majority of human TEs have lost the ability to transpose, many encode functional regulatory elements that can disrupt gene regulatory networks if left unchecked. In consequence, most TEs are epigenetically silenced by DNA methylation and/or repressive histone modifications in normal somatic tissues.

The aggregate epigenetic profile of TEs also differs by class, which includes DNA transposons, LINE (long interspersed nuclear element), SINE (short interspersed nuclear element), and LTR (long terminal repeat) retrotransposons, the primate-specific SVA class (SINE-VNTR-Alu), and Other class TEs, older elements whose sequence is too degraded to accurately classify (Fig. 1b; Supplementary Fig. 3a–c). For instance, the SINE class has a higher proportion of bases in the 4_Tx transcribed state and most active regulatory states than other classes, while the SVA class has more CpGs missing methylation data. The latter is likely due to SVA length (median 1,151.5 bp vs. 232 bp for all TEs; Supplementary Fig. 3d) and recent propagation in the genome, which decreases mappability (i.e., the likelihood that a read is correctly mapped). Indeed, the number of epigenomes in which individual SVA elements are missing methylation data is negatively correlated with mappability (Spearman correlation, ρ = −0.38, P-value < 0.001). Interestingly, a higher proportion of the LTR class is annotated with the 9_Het heterochromatin state, but the SINE class is more hypermethylated than other classes, suggesting that the two classes are subject to different mechanisms of epigenetic repression. Regression analysis on individual TEs confirms that this difference is driven primarily by TE class, not CpG density (Supplementary Discussion; Supplementary Table 2).

Despite their depletion in active epigenetic states, TEs have enormous representation in the genome, encompassing ~45% of its length and ~50% of CpGs. As a result, TEs comprise 26% of active regulatory and 44% of transcribed chromHMM states across all Roadmap epigenomes, as well as 22% of DHS peaks and 24% of H3K27ac peaks (Fig. 1c). Some TE classes contribute a disproportionate amount of each epigenetic state (Fig. 1d). For instance, the SINE class contributes 37–53% of the active and poised (10_TssBiv, 11_BivFlnk, 12_EnhBiv) chromHMM states and 41% of H3K27ac peaks within TEs, although it encompasses only 28% of TE bases. Therefore, although TEs are depleted in active epigenetic states, their contribution to them is substantial, and the extent of the contribution varies by TE class.

Potential for individual TEs to be epigenetically active

We next investigated the likelihood of each of the 4,430,788 human TE fragments to be annotated with an active epigenetic state in a Roadmap tissue. Within each epigenome, only a small fraction of individual TEs are in each active epigenetic state (see Methods; median ≤3% for all active regulatory states and DHS/H3K27ac peaks; Fig. 2a). However, the majority of TEs are annotated with an active epigenetic state in a Roadmap epigenome: 47% of individual TEs are in an active regulatory chromHMM state and 75% are in a transcribed state at least once. Specifically, 4% of TEs are in the 1_TssA state (characterized by H3K4 tri-methylation, H3K4me3), 44% in the 7_Enh state, and 33% and 24% overlap a DHS or H3K27ac peak, respectively. In contrast, 82% of TEs are in a repressed state (heterochromatin, 9_Het, or Polycomb repressed, 13_ReprPC and 14_ReprPCWk) in at least one Roadmap epigenome.
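
The contrast between the small per-epigenome fraction and the large "at least one epigenome" fraction can be illustrated with a minimal sketch. The input layout (a dictionary mapping each epigenome ID to the set of TE identifiers annotated with the state) and the function name are assumptions made for illustration, not the pipeline used in the study.

```python
from statistics import median


def per_epigenome_and_ever(state_calls, n_tes):
    """Summarise how often TEs carry a state.

    state_calls: dict mapping epigenome ID -> set of TE ids annotated with
                 the state in that epigenome (hypothetical layout).
    n_tes:       total number of TEs considered.
    Returns (median per-epigenome fraction, fraction annotated at least once).
    """
    per_epigenome = [len(te_ids) / n_tes for te_ids in state_calls.values()]
    # Union over epigenomes: TEs in the state in at least one epigenome.
    ever = set().union(*state_calls.values())
    return median(per_epigenome), len(ever) / n_tes


# Toy data: each epigenome marks only 10-20% of TEs, but different ones.
toy_calls = {"E001": {1, 2}, "E002": {2, 3}, "E003": {4}, "E004": {5, 6}}
median_fraction, ever_fraction = per_epigenome_and_ever(toy_calls, n_tes=10)
print(median_fraction, ever_fraction)
```

Because different epigenomes mark largely different TEs, the union grows much faster than any single per-epigenome fraction, which is the pattern described above.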

Fig. 2: Individual TE potential to be annotated with an epigenetic state.

a Boxplots indicate the proportion of all TEs (4,430,788 TEs) annotated with the state per epigenome (chromHMM states n = 127 epigenomes, methylation states n = 37, DHS n = 53, H3K27ac n = 98, expression RPKM >1 n = 56; see Methods). Red dots are the fraction of TEs annotated with the state in at least one epigenome. For WGBS states, only TEs with CpGs are included (3,200,428 TEs, 72% of all TEs). b For TEs annotated with the state in at least one epigenome (a, red dots), the proportion in each TE class. The color legend is located below f. c For TEs annotated with the state in at least one epigenome (a, red dots), the proportion of all Roadmap epigenomes in which the TE is annotated with the state. The b and c y-axes are shared with a. d Number of TEs annotated with the state in 100% of epigenomes. e Number of TEs annotated with the state in ≥90% of epigenomes. f For TEs annotated with the state in ≥90% of epigenomes, the proportion in each TE class. The e and f y-axes are shared with d. a, c–e Boxplots, bars, and distributions are colored by state (y-axis). a, c Boxplot elements: center line, median; box limits, first and third quartiles; whiskers, maximum value ≤1.5*IQR from box limits; points, outliers.

Despite the extensive quality control performed on the Roadmap datasets, our results are dependent on variability between similar epigenomes, and some fraction of the 47% of TEs may be noise. The method of annotating TEs with chromHMM states also has a moderate impact on this result. When TEs are required to overlap the center of a 200 bp chromHMM annotation window, the statistic decreases to 37%, dropping further to 26% when TEs are required to overlap the center of a chromHMM annotation block (Supplementary Fig. 4a, b; Supplementary Discussion). Furthermore, although a TE must overlap the summit of a DHS or H3K27ac peak (the predicted binding location24) to be annotated with that state, we cannot say with complete certainty that the TE is responsible for the peak.

However, the results are robust to the chromHMM model used. With the 18-state model generated by the Roadmap Project, which includes H3K27ac as well as the modifications included in the 15-state model, the number of TEs in an active regulatory state (states 1–4 and 7–11) in at least one epigenome increases slightly to 48%, despite a 23% reduction in the number of epigenomes under consideration (n = 98; Supplementary Fig. 5a, b). The Roadmap Project also trained separate 50-state chromHMM models on seven deeply profiled reference epigenomes (embryonic stem cells (ESCs), ES-derived cells, and IMR90) using all available epigenetic marks. Using those state calls, a median of 8% of TEs are in an active regulatory state in each epigenome (based on the corresponding 18-state model state; Supplementary Fig. 5c), as opposed to 3% with the 15-state model. Therefore, more nuanced annotation models increase the observed regulatory role of TEs.

The fraction of individual TEs ever in an active regulatory state is slightly lower than expectation, as estimated by shuffling the locations of all TEs ten times. On average, 49% (standard deviation, 0%) of shuffled TEs are in an active regulatory state in at least one epigenome. However, the particular TEs in an active regulatory state in each epigenome are not random, and we observe subfamily-specific patterns that are lost with shuffling (see below).
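
A shuffled background of this kind can be sketched as follows. This is a simplified illustration under stated assumptions: a single linear toy genome, uniform random placement that preserves each TE's length, and no exclusion of assembly gaps or constraint to the original chromosome. The function names are hypothetical, and the shuffling procedure actually used in the study may differ.

```python
import random


def overlaps(start, end, intervals):
    """True if the half-open interval [start, end) overlaps any (s, e) interval."""
    return any(start < e and s < end for s, e in intervals)


def fraction_in_state(tes, state_intervals):
    """Fraction of TEs overlapping at least one interval carrying the state."""
    hits = sum(overlaps(s, e, state_intervals) for s, e in tes)
    return hits / len(tes)


def shuffle_tes(tes, genome_length, rng):
    """Place each TE at a uniformly random position, preserving its length."""
    shuffled = []
    for s, e in tes:
        length = e - s
        new_start = rng.randrange(0, genome_length - length + 1)
        shuffled.append((new_start, new_start + length))
    return shuffled


def shuffled_expectation(tes, state_intervals, genome_length, n_iter=10, seed=0):
    """Mean fraction of shuffled TEs in the state over n_iter shuffles."""
    rng = random.Random(seed)
    fractions = [
        fraction_in_state(shuffle_tes(tes, genome_length, rng), state_intervals)
        for _ in range(n_iter)
    ]
    return sum(fractions) / n_iter, fractions
```

Comparing `fraction_in_state` on the true TE coordinates with the mean returned by `shuffled_expectation` gives the observed-versus-expected contrast described in the paragraph above.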

Although LTR elements comprise 16% of TEs, they account for 27% of TEs ever in the 8_ZNF/Rpts state and 23% of those ever in the 9_Het heterochromatin state, as well as 21–22% of TEs overlapping DHS or H3K27ac peaks (Fig. 2b; Supplementary Fig. 6a). SINE elements (40% of all TEs) represent 50–51% of TEs ever in the transcribed 4_Tx state, the transcribed flanking 3_TxFlnk state, the genic enhancer 6_EnhG state, or expressed RPKM >1 (reads per kilobase per million reads), likely due to greater overlap with protein-coding introns (Supplementary Fig. 6b).

TE regulatory signatures in the Roadmap epigenomes are restricted rather than universal. A TE in an active regulatory state or overlapping a DHS or H3K27ac peak remains in that state in only 1–6% of epigenomes (median; Fig. 2c), while a TE remains in a transcribed chromHMM state in 6–9% of epigenomes. The degree of restriction also exhibits class-specific variation: for example, SINE and SVA elements remain in the 5_TxWk state longer than other TEs (Supplementary Fig. 6c). In contrast, RefSeq promoters and exons exhibit a much more universal activity profile (Supplementary Fig. 7a–d). 78% of promoters are in the 1_TssA state and >92% overlap a DHS and H3K27ac peak summit in at least one epigenome, while 93% of RefSeq exons are expressed RPKM >1 at least once (vs. 32% of TEs). A substantial proportion of promoters and exons (26–48%) also remain in those states in ≥90% of epigenomes. These results demonstrate that the ability of individual TEs to be regulatory elements, while greater than previously expected, is dramatically less than that of dedicated regulatory and genic elements.

To ensure that biological outliers did not skew our results, we repeated the analyses excluding five cancer cell lines and IMR90, which was identified as an epigenetic outlier in ref. 19 (Supplementary Fig. 7e, f). The results are very similar, except that the number of TEs hypo- or intermediately methylated in any epigenome decreases by 6–8% with the exclusion of IMR90, which has a much higher proportion of lowly methylated CpGs (Supplementary Fig. 1).

A small fraction of TEs is consistently annotated with the same epigenetic state in all epigenomes, and most are repressed (Fig. 2d). Of the 2% of TEs consistently in the same chromHMM state (n = 106,228), 99.9% are in the 15_Quies quiescent state, which lacks ChIP-seq signal for the five constituent histone modifications. However, there are a few examples of TEs in an active epigenetic state in all epigenomes (Supplementary Data 1; Supplementary Discussion), and even more in 90% of epigenomes (Fig. 2e). The latter are frequently enriched for SINE elements, although LTR elements are enriched among TEs overlapping DHS peaks (Fig. 2f). Many of these TEs overlap RefSeq promoters and genes, including 92% of TEs expressed RPKM >1 in all epigenomes (79% of which overlap protein-coding exons), but others are intergenic. This includes 13 TEs overlapping a DHS peak in all epigenomes that are >50 kb from the nearest RefSeq gene and do not overlap a GENCODE annotation. They are frequently in the 1_TssA promoter state, the 7_Enh enhancer state, or the 8_ZNF/Rpts state and may represent uncharacterized transcripts or enhancers.

Epigenetic state dynamics of TEs

Because most TEs exhibit restricted regulatory activity, we next examined the dynamics of their epigenetic profiles across Roadmap epigenomes. Each TE is annotated with a median of four chromHMM and two methylation states across all epigenomes, although nine TEs are annotated with all 15 chromHMM states (Fig. 3a, b). However, the set of states with which each TE is annotated varies (Fig. 3c, d). For example, TEs in the poised promoter states (10_TssBiv and 11_BivFlnk) in any epigenome (<1% of TEs) spend an average of 10–16% of epigenomes in the corresponding active promoter and promoter flanking states (1_TssA and 2_TssAFlnk) and 13–15% of epigenomes in the Polycomb repressed state (13_ReprPC), a pattern typical of a poised regulatory element becoming active in a differentiating cell lineage. In contrast, the average TE spends ≤1% of epigenomes in those states.

Fig. 3: Epigenetic state dynamics of TEs.

a For all individual TEs in each class, the total number of chromHMM states the TE is annotated with across all epigenomes (DNA n = 456,948 TEs, LINE n = 1,480,369, LTR n = 708,210, SINE n = 1,769,839, SVA n = 3,608, Other n = 11,814). b For all individual TEs in each class, the total number of methylation states the TE is annotated with across all epigenomes (TEs overlapping CpGs only: DNA n = 275,140 TEs, LINE n = 952,459, LTR n = 532,571, SINE n = 1,430,171, SVA n = 3,519, Other n = 6,568). a, b Histograms are colored by class (facet). c For TEs in chromHMM State 1 in at least one Roadmap epigenome (see Fig. 2a, red dots), the average proportion of epigenomes in which they are annotated with chromHMM State 2 (represented by color scale below c, d, 0 to 100% of epigenomes). d For TEs in methylation State 1 in at least one Roadmap epigenome (see Fig. 2a, red dots), the average proportion of epigenomes in which they are annotated with methylation State 2 (represented by color scale below c, d, 0 to 100% of epigenomes).

Interestingly, TEs ever in repressed states are less dynamic than those ever in active states (Supplementary Fig. 8a, b). However, this may be partially due to long TEs that overlap multiple chromHMM states within a single epigenome (Supplementary Discussion; Supplementary Fig. 8c, d).

Although all TEs exhibit similar chromHMM dynamics regardless of class (Fig. 3a), SINE and SVA elements have less dynamic methylation profiles (Fig. 3b; Supplementary Fig. 8e). They are less likely to be hypo- or intermediately methylated in any epigenome (Supplementary Fig. 6a), and they remain intermediately methylated in fewer epigenomes (median 3–5% of epigenomes vs. 8–14% for other classes). This may be due to the tendency of SINE and SVA elements to overlap CpG islands (1% of SINE and 35% of SVA elements vs. <0.5% for other classes), which have greater coordination of CpG methylation levels and may result in a more bimodal distribution of average methylation. Taken together, these results suggest that some TEs have biochemical activity specific to particular tissues or developmental states, and that TE phylogeny contributes to this profile.

Compared to background (ten iterations of shuffled TEs), true TEs are slightly less dynamic than expected, although the results are almost identical (Supplementary Fig. 9a, b; Supplementary Discussion).

Finally, we confirmed the concordance of the five experimental techniques used to query TE epigenetic status (Supplementary Fig. 10a–e). TEs are significantly more likely to overlap a DHS or H3K27ac peak, be in an active regulatory chromHMM state, or be hypo- or intermediately methylated when also in one of the other states (pairwise by-technique Chi-squared tests, P-value = 0). In fact, although only 2% of TEs overlap a DHS peak and 0.7% overlap an H3K27ac peak (of the TE-by-epigenome instances with data for both metrics), 0.2% are annotated with both peaks, ten times the expected number if the peaks were independent. Furthermore, the proportion of TEs overlapping both peaks increases to 3–4% when the TE is also in the 1_TssA or 7_Enh state. However, because the epigenetic marks are not completely redundant, we have included all in our analyses.
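
The independence comparison can be reproduced as a short worked calculation using only the rounded percentages quoted in the paragraph above. Under independence, the expected joint fraction is the product of the two marginal fractions. With these rounded inputs the ratio comes out at roughly 14, the same order of magnitude as the ten-fold enrichment reported, which was presumably computed from unrounded counts.

```python
# Rounded marginal fractions quoted in the text (TE-by-epigenome instances).
p_dhs = 0.02            # overlap a DHS peak
p_h3k27ac = 0.007       # overlap an H3K27ac peak
p_both_observed = 0.002  # overlap both peaks

# Expected joint fraction if the two peak types were independent.
p_both_expected = p_dhs * p_h3k27ac

# Fold enrichment of the observed co-occurrence over the independent expectation.
fold_enrichment = p_both_observed / p_both_expected

print(f"expected: {p_both_expected:.4%}")
print(f"observed: {p_both_observed:.1%}")
print(f"fold enrichment: {fold_enrichment:.1f}")
```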

Differences in TE activity by tissue classification

In our analyses across all Roadmap epigenomes, we established that TEs contribute different proportions of each epigenetic state to the genome (Fig. 1c), and that this pattern differs by TE class (Fig. 1d). Figure 4a demonstrates that there is also variation between epigenomes in the contribution of TEs to epigenetic states. Besides CpGs missing methylation data, the 8_ZNF/Rpts state exhibits the largest variation, ranging from 21% of the state within TEs (E002, ES-WA7 Cells) to 71% (E051, Primary hematopoietic stem cells G-CSF-mobilized Male). Indeed, TEs can be either enriched or depleted in this state compared to their genomic representation (dashed line) depending on the epigenome.

Fig. 4: Variation in TE contribution to epigenetic states by epigenome classification.

a Proportion of the epigenetic state within TEs by epigenome. Each epigenome is represented by a circle, colored by Roadmap group. Solid black lines represent the proportion of the epigenetic state within TEs across all epigenomes (contribution; see Fig. 1c). Dashed and dot-dashed lines represent the proportion of genomic bases and CpGs within TEs, respectively. Red stars represent Bonferroni-corrected Kruskal–Wallis test P-value < 0.05 across epigenome groups (chromHMM states n = 127 epigenomes, methylation states n = 37, DHS n = 53, H3K27ac n = 98; see Supplementary Data 3 for group assignments). b Proportion of the epigenetic state within each TE class by epigenome, colored by Roadmap groups. Solid black lines represent the proportion of the epigenetic state within the class across all epigenomes. Dashed and dot-dashed lines represent the proportion of genomic bases and CpGs within the TE class, respectively. c The ratio of the average RPKM over TEs to the average RPKM over the entire genome, colored by Roadmap group. d The ratio of the average RPKM over each TE class to the average RPKM over the entire genome, colored by Roadmap group. c, d The solid black bar is the median for all epigenomes (n = 56 epigenomes with RNA-seq data).

This variation across epigenomes is driven in part by tissue. The proportion of most epigenetic states within TEs is significantly different across the group classifications assigned by the Roadmap Project (P-value < 0.05, Kruskal–Wallis test, Bonferroni correction; Supplementary Discussion). As expected, ESCs are enriched among the samples with a higher contribution of TEs to the 1_TssA state than the total across epigenomes (solid black lines; ESC mean 16%; P-value < 0.1, permutation test with 1,000 permutations, FDR (false discovery rate) correction; Supplementary Data 2)25,26,27. Interestingly, blood groups (Blood & T-cell and HSC (hematopoietic stem cell) & B-cell) are also enriched among epigenomes with a higher contribution from TEs to the weakly transcribed (5_TxWk) and enhancer (6_EnhG and 7_Enh) states, while Brain epigenomes are depleted (mean 5_TxWk: 48–49% vs. 45%; 6_EnhG: 30–32% vs. 21%; 7_Enh: 37–38% vs. 25%). Other trends can be observed using additional Roadmap-assigned classifications (Anatomy and Type) or other epigenome metadata (e.g., donor age, germ layer of origin, and cancer cell lines; Supplementary Data 3), including that cancer cell lines are enriched among epigenomes with a higher proportion of many active and poised regulatory states in TEs (P-value < 0.05, Kruskal–Wallis test, Bonferroni correction; Supplementary Fig. 11).

The proportion of each epigenetic state within TE classes also shows tissue-specific patterns (Fig. 4b). For instance, only the LTR class shows enrichment of ESCs among epigenomes with a higher contribution of TEs to the 1_TssA, 2_TssAFlnk, 3_TxFlnk, 7_Enh, and H3K27ac states (P-value < 0.1, permutation test with 1,000 permutations, FDR correction). Similarly, only the SINE class shows enrichment of both HSC & B-cell and Blood & T-cell epigenomes among those with a higher contribution of TEs to the enhancer states (6_EnhG and 7_Enh). Thus, the patterns observed in Fig. 4a are not evenly distributed across TE classes.

Finally, we compared the average genome-wide expression level to the TE expression level, which recapitulates the results observed with epigenetic marks (Fig. 4c). The median ratio of TE-to-genome expression is 14% across all epigenomes, but the range is 2–45%, with blood, brain, and ESC epigenomes exhibiting a much higher TE expression level relative to the rest of the genome. All classes exhibit higher relative expression in blood and brain, while the ESC pattern is observed only for LTRs (Fig. 4d). Together, these results point to a potential biological role for specific TE subgroups in specific organs, which may be facilitated by tissue-specific expression of transcription factors whose binding sites are encoded by various phylogenetic lineages of TEs.

Tissue-specific subfamily enrichment in epigenetic states

We next asked whether the tissue specificity we observed at the class level was present at the finer phylogenetic resolution of the TE subfamily. The vast majority of TE subfamilies overlap all 15 chromHMM states (759 subfamilies, 78%), all four methylation states (100% of subfamilies with CpGs), and DHS and H3K27ac peak summits (965 and 963 subfamilies, respectively) in at least one Roadmap epigenome (Fig. 5a; Supplementary Fig. 12a), and most subfamilies are in most states in ≥75% of epigenomes (Supplementary Fig. 12b). These numbers generally match expectation (ten iterations of shuffled TEs; Supplementary Fig. 12a), except that 92% of shuffled subfamilies overlap all 15 chromHMM states (n = 894, standard deviation = 6). Supplementary Fig. 12b suggests that this is due to more overlap with the small active and poised regulatory states, possibly due to exclusion of true TEs from developmentally important gene promoters.

Fig. 5: Tissue-specific enrichment of TE subfamilies in epigenetic states.

a The total number of epigenetic states with which each TE subfamily is annotated across all epigenomes (see Methods; n = 968 TE subfamilies, n = 965 for methylation states). b PCA on Roadmap epigenomes (n = 127 epigenomes), using the LOR enrichment of each TE subfamily in each chromHMM state (13,716 subfamily-by-state combinations) as variables. Color is based on group (see legend above Fig. 5b), and shape is based on epigenome age (non-fetal and unknown age are not distinguished). c PCA on TE subfamilies (n = 937 subfamilies), using the LOR enrichment of the subfamily in each chromHMM state in each Roadmap epigenome (1,904 state-by-epigenome combinations) as variables. Color is based on class (see legend below Fig. 5d), and shape is based on TE family (only Alu and L1 are highlighted). b, c The amount of variation explained by each PC is listed in parentheses. d The proportion of epigenomes each TE subfamily is enriched LOR > 1.5 in the epigenetic state (chromHMM states 127 epigenomes, methylation states 37, DHS 53, H3K27ac 98). Each subfamily is represented by a circle and is colored by TE class (see Methods). The n below each state on the y-axis indicates the number of subfamilies enriched in the state in at least one epigenome. e The percentage of LTR22A elements that contain a binding motif for the transcription factor. Active TEs: elements annotated with the 7_Enh enhancer state in epigenomes where the subfamily is enriched in the state (n = 103 TEs); Inactive TEs: elements never in the state (n = 68 TEs). The top 5 most significant transcription factors as predicted by HOMER are shown (binomial FDR-corrected P-value < 0.0001 for each).

To identify subfamilies exhibiting coordinated epigenetic profiles, we calculated the log odds ratio (LOR) enrichment of each TE subfamily in each epigenetic state in each epigenome compared to genomic background (Equation (1)). In total, there are 32,947 enrichments with LOR > 1.5 (approximately a three-fold enrichment over genomic background), 70% of which are in the 8_ZNF/Rpts and 9_Het states. Shuffling TE locations abrogates the vast majority of enrichments and enriched subfamilies (Supplementary Fig. 12c, d), confirming that many subfamily elements have coordinated epigenetic profiles.
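
Equation (1) is not reproduced in this section, so the following is only a sketch of one plausible form of the enrichment score. It assumes a base-2 logarithm of the ratio between the proportion of a subfamily's bases in a state and the proportion of the genome in that state. The base-2 assumption is consistent with the statement that LOR > 1.5 corresponds to approximately three-fold enrichment, since 2^1.5 ≈ 2.83. The function and argument names are hypothetical.

```python
import math


def log_odds_ratio(subfamily_in_state, subfamily_total, state_total, genome_total):
    """Log2 enrichment of a TE subfamily in a state relative to the genome.

    Assumed form (not taken from the source):
        log2( (subfamily bases in state / subfamily bases)
              / (genome bases in state / genome bases) )
    """
    subfamily_fraction = subfamily_in_state / subfamily_total
    genome_fraction = state_total / genome_total
    return math.log2(subfamily_fraction / genome_fraction)


# Toy example: 30% of the subfamily is in the state vs. 10% of the genome.
lor = log_odds_ratio(30, 100, 10, 100)
is_enriched = lor > 1.5  # threshold used in the text
print(round(lor, 3), is_enriched)
```

A three-fold enrichment gives log2(3) ≈ 1.585, just above the 1.5 threshold, whereas a two-fold enrichment (LOR = 1) would not pass.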

Subfamily enrichment in epigenetic states can differentiate epigenome categories. Principal component analysis (PCA) performed on chromHMM state enrichments clearly separates Roadmap groups, including ESCs/iPSCs (induced pluripotent stem cells) and blood epigenomes along the second principal component (PC2) (Fig. 5b). Brain and digestive epigenomes and other organs form a large cluster in the center of the plot. This analysis closely recapitulates the results observed using genome-wide epigenetic profiles of the Enh and ReprPC states, which also separated these major epigenome groups19. Interestingly, in our analysis, fetal epigenomes form a subcluster within the tissue-based clusters, suggesting that developmental stage influences the epigenetic profile of TE subfamilies. Similar results are observed for different epigenetic assays (Supplementary Fig. 13a–d) and epigenome classifications (Supplementary Fig. 13e–h), including that cancer cell lines cluster to one side of PC1. Subfamilies can also be distinguished by their epigenetic profiles (Fig. 5c; Supplementary Fig. 13i–k).

Fig. 6: SINE and Alu methylation as a function of age.

a Scaled density plot of Jukes-Cantor evolutionary distance (age) for SINE elements (n = 1,430,171 SINE elements with CpGs). b Generalized additive models of the proportion of epigenomes (total 37 epigenomes) in which individual SINE elements are annotated with each methylation state vs. age, smoothed using cubic regression splines. c Generalized additive models of SINE element characteristics vs. age, smoothed using cubic regression splines. The pink line represents a logistic regression model of the likelihood a SINE element overlaps a CpG island vs. age. d Scaled density plot of Jukes-Cantor evolutionary distance (age) for Alu elements (n = 1,105,597 Alu elements with CpGs). e Generalized additive models of the proportion of epigenomes (total 37 epigenomes) in which individual Alu elements are annotated with in each methylation state vs. age, smoothed using cubic regression splines. f Generalized additive models of Alu element characteristics vs. age, smoothed using cubic regression splines. The pink line represents a logistic regression model of the likelihood an Alu element overlaps a CpG island vs. age.

To identify TE subfamilies that may perform tissue-specific coordinated regulatory functions, we next identified subfamilies that are enriched LOR > 1.5 more often than expected in an epigenome category in an active state (active regulatory or transcribed chromHMM state, hypo- or intermediately methylated states, or DHS or H3K27ac peak overlap; FDR-corrected P-value < 0.05, permutation test; Fig. 5d; Supplementary Data 4; Supplementary Discussion). As in ref. 19 [...] a–d). This pattern is most striking for SINE elements (Fig. 6a, b). Interestingly, SINE TEs decrease not just in length but in CpG density with age, which would not be expected if CpGs were evenly distributed across the TE (Fig. 6c). Indeed, older SINE TEs are significantly less likely to overlap CpG islands (logistic regression, Jukes-Cantor evolutionary distance predictor coefficient P-value < 0.001), and the median age of SINE elements that overlap CpG islands is much lower than that of those that do not (0.04 vs. 0.17). Taken together, these results indicate that SINE elements with CpG islands are subject to greater levels of repressive DNA methylation, but that the CpG islands are eliminated from the TEs as they age.

To confirm that the trends observed for all SINE elements are not the result of bimodal age distribution between the older SINE families (MIR, Deu, tRNA, and SINE, median Jukes-Cantor evolutionary distance 0.37 ± interquartile range (IQR) 0.10; Fig. 6a) and the younger Alu subfamilies (median distance 0.13 ± IQR 0.07; Fig. 6d), we repeated the analysis with only Alu elements (66% of SINE TEs), which recapitulated the results (Fig. 6e, f; Supplementary Table 3).
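
For reference, the Jukes-Cantor evolutionary distance used as the age measure here follows the standard formula d = −(3/4) ln(1 − (4/3) p), where p is the proportion of nucleotide sites that differ from the consensus. The sketch below implements that textbook formula only. How p was obtained for each TE in the study (for example, from RepeatMasker divergence values) is not stated in this section and is not assumed here.

```python
import math


def jukes_cantor_distance(p):
    """Standard Jukes-Cantor distance for a proportion p of differing sites.

    d = -(3/4) * ln(1 - (4/3) * p), defined for 0 <= p < 0.75.
    """
    if not 0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1 - (4.0 / 3.0) * p)


# The correction is small for young elements and grows with divergence.
for p in (0.04, 0.13, 0.30):
    print(p, round(jukes_cantor_distance(p), 3))
```

The distance is always at least as large as the raw mismatch proportion because it corrects for multiple substitutions at the same site.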

Evolutionary conservation of TE regulatory signatures

Finally, we investigated whether the regulatory signatures we observed in human TEs are evolutionarily conserved by comparing them to orthologous TEs in the mouse genome. Although human and mouse diverged approximately 90 million years ago34, 6% of hg19 TEs have an identifiable corresponding region in the mm10 genome (n = 269,096) and overlap an mm10 TE from the same subfamily (n = 269,801), which we consider orthologous pairs (Supplementary Fig. 20a).

We profiled the methylation level and chromHMM state of the orthologous TEs in twelve samples interrogated by the mouseENCODE project21,22,23 that anatomically matched human Roadmap epigenomes (Supplementary Table 4). Although only a small fraction of orthologous TEs is hypomethylated in either human or mouse, those hypomethylated in one species are significantly more likely to be hypomethylated in the other in the corresponding tissue than expected by random chance (P-value < 0.001, Chi-squared test for seven human-mouse epigenome pairs with WGBS data; Cramer’s V 0.10 to 0.14; Fig. 7a, b). This result holds true when all methylation states are considered (Cramer’s V 0.16 to 0.23). Thus, the DNA methylation level of orthologous TEs is more conserved than expected between human and mouse.
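
The Chi-squared statistic and Cramér's V effect size reported above can be computed from a contingency table of orthologous TE states, as in this minimal sketch. The counts are invented toy values, not data from the study, and no continuity correction is applied. The function name is hypothetical.

```python
import math


def chi_squared_and_cramers_v(table):
    """Pearson chi-squared statistic and Cramér's V for an r x c table.

    table: list of rows of observed counts, e.g. rows = human state,
           columns = mouse state (toy layout assumed for illustration).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0])) - 1
    cramers_v = math.sqrt(chi2 / (n * k))
    return chi2, cramers_v


# Toy 2 x 2 table: hypomethylated vs. not, in human (rows) and mouse (columns).
toy_table = [[10, 20], [20, 10]]
chi2, v = chi_squared_and_cramers_v(toy_table)
print(round(chi2, 3), round(v, 3))
```

Cramér's V rescales the Chi-squared statistic to the range 0 to 1, so with very large sample sizes a highly significant P-value can still correspond to a modest association, as with the values of 0.10 to 0.23 reported here.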

Fig. 7: Conserved epigenetic profiles of orthologous TEs in human and mouse.

a 2D density plot of the methylation level of orthologous TEs (n = 140,185 TEs) in seven anatomy-matched human-mouse epigenome pairs. TE pairs missing methylation in either epigenome or lacking CpGs are excluded. b Methylation state of orthologous TEs in human and mouse across the seven epigenome pairs. Percentages to the right of each x-axis state indicate the proportion of TE ortholog instances across the epigenome pairs that are annotated with that state. c chromHMM state of orthologous TEs (n = 296,457 TEs) in human and mouse across twelve anatomy-matched epigenome pairs. Percentages to the right of each x-axis state indicate the proportion of TE ortholog instances across the epigenome pairs that are annotated with that state. d Tissue-specificity profiles in mouse for orthologous TEs that exhibit tissue specificity in that tissue in human. TEs are considered tissue-specific if they are annotated with a state in two epigenomes from the same tissue but fewer than 5 epigenomes overall. Active: any active regulatory state (Brain n = 1,812 TEs, Intestine n = 1,691, Stomach n = 205, Heart n = 3,081, Lung n = 653); Promoter: 1_TssA/TssA (Brain n = 54 TEs, Intestine n = 27, Stomach n = 16, Heart n = 49, Lung n = 16). TEs in the 2–8 epigenomes category are not specific to the tissue.

The histone modification profile of orthologous TEs is also conserved across species. Although the human and mouse epigenomes were annotated with different chromHMM models, there is a non-random distribution of orthologous TEs in each human/mouse chromHMM state combination (P-value < 0.001, Chi-squared test for twelve epigenome pairs; Cramer’s V 0.22 to 0.26). Figure 7c demonstrates a clear enrichment of the promoter state (TssA) in mouse for TE orthologs annotated with the promoter state (1_TssA) in human, as well as an enrichment in multiple active regulatory mouse chromHMM states (mouse: TssA, TssAFlnk1, TssAFlnk2, Enh, EnhLo1, and EnhLo2) for orthologs in human active regulatory states (human: 1_TssA, 2_TssAFlnk, 3_TxFlnk, 6_EnhG, 7_Enh). The epigenetic profile of shared TE subfamilies (578 of 968 human subfamilies; Supplementary Fig. 20b) is also conserved across species (Supplementary Fig. 20c–e).

We next identified TEs that exhibit conserved tissue-specific epigenetic profiles (Fig. 7d). Of the TEs with tissue-specific promoter state (1_TssA) annotation in humans (n = 162), 0–29% also exhibit tissue-specific promoter state (TssA) annotation in mouse in the same tissue. When considering all active regulatory states (n = 7,442), 3–14% of tissue-specific TEs exhibit the same profile in mouse. All TEs with conserved tissue-specific promoter annotations (n = 19) are located within or near human RefSeq genes, and in many cases, their conserved epigenetic profile likely stems from overlap with the promoter region of a gene whose tissue-specific expression profile is conserved between the species. For example, two conserved MIR elements (chr12:57443327–57443493 and chr19:39302694-39302815) overlap MYO1A and LGALS4 or their promoters, which have high intestinal expression. However, 54 of the TEs with conserved active regulatory annotations are >50 kb from the nearest RefSeq gene, including 28 that are >100 kb. Four such elements are displayed in Supplementary Fig. 21, including three brain-specific enhancers and an intestine-specific element that is an enhancer in human and a promoter in mouse. These include a UCON29 element, a subfamily that has been shown to have brain-specific enhancer activity in this paper and others39.
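
The tissue-specificity criterion given in the Fig. 7 caption (annotated with a state in two epigenomes from the same tissue but fewer than 5 epigenomes overall) can be sketched as below. Reading "two epigenomes" as "at least two" is an assumption, as are the function name and the input layout. The epigenome IDs and tissue labels are toy values.

```python
def is_tissue_specific(epigenomes_with_state, tissue_of, tissue,
                       min_in_tissue=2, max_total=5):
    """Apply the caption's rule to one TE and one tissue.

    epigenomes_with_state: set of epigenome IDs in which the TE has the state.
    tissue_of:             dict mapping epigenome ID -> tissue label.
    Assumption: 'two epigenomes from the same tissue' means at least two.
    """
    in_tissue = sum(1 for e in epigenomes_with_state if tissue_of[e] == tissue)
    return in_tissue >= min_in_tissue and len(epigenomes_with_state) < max_total


# Toy metadata (hypothetical IDs and labels).
tissue_of = {"E1": "Brain", "E2": "Brain", "E3": "Heart",
             "E4": "Lung", "E5": "Lung", "E6": "Brain"}

print(is_tissue_specific({"E1", "E2", "E3"}, tissue_of, "Brain"))              # True
print(is_tissue_specific({"E1", "E2", "E3", "E4", "E5"}, tissue_of, "Brain"))  # False
```

The same predicate would be evaluated separately in human and in mouse, and a TE counted as conserved when it passes in both species for the same tissue.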

As mentioned above, Cao et al.20 used a machine learning model trained on ENCODE histone modification data to demonstrate that 35% of TEs have signatures of cis-regulatory elements. Although this is moderately lower than our estimate, they reveal via saturation analysis that more TEs may perform regulatory functions in additional cell types. They also confirmed that TE enhancer activity is highly tissue-specific and can be used to distinguish cell types, and they profiled the tissue-specific enrichment of TE families as enhancer- or promoter-like elements in Roadmap tissues. Interestingly, they found that TEs with ESC/iPSC enhancer activity had highly conserved epigenetic profiles in primates, but less so in less related species, and they uncovered widespread 3D interactions between MIR and L2 elements that modulate their enhancer activity.

An additional study that identified TE-derived promoters and enhancers using FANTOM5 CAGE-seq (cap analysis gene expression) data18 determined that 45% of enhancers and 5% of promoters overlap TEs, lower than random expectation, with more tissue-specific enhancers overlapping TEs, particularly those specific to blood and testis. Older TE families had more members overlapping enhancers, as did ERVs. Additionally, they found that although enhancers are biased toward overlap with mammal-specific and ancient TEs, enhancers that overlap younger TEs are more likely to be active in only a single tissue. Villar et al.40 also identified an enrichment of younger repeat families in recently evolved enhancers, although ancient ERV families are more likely to be exapted as recently evolved promoters.

An important caveat to our results is the exclusive use of uniquely mapped, short (36 bp) reads to generate chromHMM state assignments and peak calls. TEs present a mappability challenge due to their repetitive nature, particularly for short reads and young TE subfamilies that have not accumulated sequence changes over time, such as the SVA subfamilies, young Alu and L1 subfamilies, and some ERVs. Although we capture enrichment of very young subfamilies with low mappability in active regulatory states, it is possible that using multimapped reads assigned at the subfamily level would reveal additional TE activity and that our results represent a lower bound. In particular, many SVA and SINE elements have low mappability due to a combination of age, length, and prolific expansion (Supplementary Fig. 3d), and these classes may be disproportionately affected by mappability. However, mappability is not correlated with the number of epigenomes a TE is annotated with the 15_Quies state (Spearman correlation, rho = 0.01, P-value < 0.001), and the TEs in each chromHMM state exhibit histone modification, chromatin accessibility, and DNA methylation profiles characteristic of that epigenetic state (Supplementary Fig. 22a).

In cancer, the genome undergoes global DNA hypomethylation and exhibits dysregulated chromatin41. The enormous number of TEs with narrowly restricted regulatory profiles suggests that they are subject to tight epigenetic control and could be rapidly de-repressed during malignant transformation. In line with this hypothesis, we observe greater contribution of TEs to active regulatory states in cancer cell lines. Aberrantly activated TEs can alter the expression of nearby genes, serve as alternative promoters that form chimeric or immunorestricted transcripts, or even drive oncogene expression36. However, epigenetic therapies may also potentiate cancer immunotherapy through the activation of TEs, which increase tumor immunogenicity by forming double-stranded RNA and immunogenic proteins42,43,44,45,46. Resources such as this study could inform predictions of which TEs are activated in cancer and in response to epigenetic therapies, including in normal tissues to prevent potential off-target effects.

In conclusion, this study represents an important synthesis of epigenetic data in the context of TEs that could serve as a resource for investigations of this underexplored aspect of the human genome in healthy and diseased states.