Introduction

In eutherian mammals, one of the two X chromosomes (X) is epigenetically inactivated in XX females in order to achieve dosage compensation with XY males through a process known as X-chromosome inactivation (XCI) (see Balaton, 2018 for a review [1]). This inactivation is incomplete, as approximately 12% of genes consistently escape from XCI in humans [2], here defined as having at least 10% expression from the inactive X (** from XCI [2]. The short arm of the X near PAR1 is enriched in genes escaping from XCI, while the long arm that contains XIST, the gene responsible for initiating XCI, is enriched in genes subject to XCI [3]. Genes escaping from XCI are often found clustered together, with some convergence with topologically associated domains (TADs) [9]. In addition to genes that consistently escape from XCI (sometimes called constitutive escape), a further 8% of genes have been found to vary their XCI status between different tissues or individuals (termed variable or facultative escape [2], reviewed in [5]), and another 7% of genes were found to be discordant between the studies identifying them [2]. Variably escaping and discordant genes were found to be enriched at boundaries between clusters of genes with opposite XCI statuses [2]. The factors determining XCI status remain unresolved, with the above evidence suggesting regional control, but there are also lone genes that escape XCI while flanked by genes subject to XCI [2] and even genes with two transcription start sites (TSSs) with opposite XCI status [10, 11]. Furthermore, these solo escape genes are able to recapitulate escape when integrated elsewhere on the X [12, 13].

Many methods have been used to identify which genes escape from XCI (reviewed in [14]). The gold-standard approach is to compare expression levels between the ** from XCI, while inactive marks such as H3K9me3, H4K20me3, H3K27me3 and macroH2A are enriched at genes subject to XCI [14, 22, 23], reviewed in [14]. A predictive model using many epigenetic as well as genetic features in mice was able to predict a gene’s XCI status accurately 78% of the time [24] and in humans a model obtained over 80% accuracy using only genomic repeats [25]. These and additional studies have found L1 repeats enriched near genes that are subject to XCI, while ALU elements are more frequent at genes escaping XCI [25, 26, 27, 29]. Another study found many genes where ** Technologies (CEMT) as these samples were derived from cancer and thus were anticipated to have a high frequency of skewed XCI, allowing us to use allelic expression to determine XCI status in each sample [11]. As cancer is known to have epigenetic changes, we additionally examined data from Core Research for Evolutional Science and Technology (CREST), another group within IHEC, thus allowing us to determine whether any trends that we observed in the CEMT data were due to the samples being cancer-derived. However, the CREST samples had less sequencing depth, fewer females (only nine), and could only be examined for DNAme and histone marks. Samples are listed in Additional file 2: Table S1. In our analyses, genes in the PAR were not included with genes escaping from XCI as they may be epigenetically distinct, especially when comparisons with males are included.

Histone marks differ with sex and XCI status

We compared the levels of histone modifications with sex and published XCI status calls derived from a synthesis of various approaches (hereafter referred to as meta-status) [2]. We used levels within 500 bp upstream of a gene’s TSS, except for H3K36me3, which is associated with gene bodies and so was examined at exons [32], and H3K4me1, which is associated with enhancers and so was examined at annotated enhancer sites [33]. We found that most marks had a significant difference (p value < 0.01) for the median level per transcript between males and females, at genes escaping and subject to XCI in both datasets (Fig. 1a, Additional file 3: Table S2). Fewer marks showed significant differences between genes escaping XCI and those subject to XCI within each sex. The euchromatic marks (H3K4me3, H3K27ac and H3K36me3) were significantly different between transcripts subject to XCI and those escaping from XCI in both CEMT and CREST females, while the heterochromatic marks (H3K9me3 and H3K27me3) were only significantly different within the CREST dataset. Comparing XCI statuses within males gave the fewest significantly different marks, as was expected. Overall, the X chromosome of males and females differs in both heterochromatic and euchromatic marks, and the observable differences between XCI statuses implicate inactivation-related differences in addition to copy number (XX or XY) differences.

Fig. 1

The ** XCI, which partially explains the stronger p-values at transcripts subject to XCI. H3K27me3 has a higher ** XCI, and lower for transcripts subject to XCI. H3K36me3 is reduced on the ** from XCI the differences were more variable between the datasets.

H3K27me3 showed the largest change between the **, subject to XCI and variably escaping categories being significantly different between the sexes. We analyzed chromosome 7 as an example autosome and saw a much lower percentage of transcripts with significant male–female differences for H3K9me3 and H3K27me3 than for transcripts escaping from XCI, validating that transcripts that escape from XCI have a significant increase of heterochromatic marks in females relative to males. Metagene plots extending 50 kb upstream and downstream of genes escaping or subject to XCI, in females and males (Additional file 1: Figure S2), confirm the predominance of marks at the TSSs, with higher H3K4me3 and H3K27ac TSS peaks observed for genes escaping XCI in females. For the heterochromatic H3K9me3 and particularly H3K27me3, we observe both a reduced TSS peak and lower gene body levels for escape genes in females. For all marks, the standard deviation across genes with each XCI status was large, calling into question whether the differences could be predictive for individual genes, as has been found for DNAme (see Additional file 3: Table S2).

In addition to our promoter and gene-based analysis, we also compared histone marks at enhancers annotated to genes on the X [33] and found that all marks showed significant, although small, differences between males and females, for both XCI statuses (Fig. 1d, Additional file 3: Table S2 for values). We further considered whether the enhancer was found within the gene to ensure that differences were not arising simply due to expression of the gene altering chromatin; however, most marks remained significant regardless of location. Looking at the ** and genes subject to XCI. In CREST, the ** from XCI. Overall, it appears that enhancers gain heterochromatic marks on the ** from XCI, while five were previously designated escaping XCI and one subject to XCI.

Fig. 2

Epigenetic marks do not change consistently with XCI status for variably escaping genes. a The number of genes with each XCI status call across all samples as assigned by ** genes that had significant differences in the histone mark between samples that were subject to or escaping from XCI. For each gene, on the left is a comparison of each epigenetic mark vs the ** genes.

Genes that variably escape from XCI provide a unique opportunity to study differences between genes escaping vs subject to XCI in the same genomic context. All of the marks available except for H3K4me1 were significantly different (p-value < 0.05) between samples escaping XCI vs those subject to XCI in at least one of the eight variably escaping genes, but never for the majority of genes (Fig. 2b, Additional file 1: Table S5). Consistent with the associations seen for genes subject to or escaping from XCI, when active marks were significantly different, they tended to be higher in samples escaping XCI, while inactive marks were lower in samples escaping XCI (Additional file 1: Table S6). The exception to this is H3K36me3 in gene bodies.

DNAme was the most consistent mark differentiating samples escaping from those subject to XCI, being significantly different in four of the eight variably escaping genes. The samples subject to XCI in PRKX had significantly higher DNAme, but were not above the DNAme thresholds for XCI status calls that we established previously [11]. The other three genes with significant DNAme differences showed a clear switch from a DNAme pattern matching genes escaping XCI to a pattern matching genes subject to XCI. TIMP1, one of the four genes that was not significant, has low CpG density and high male DNAme so was not expected to differ with XCI status. For the other three genes, the limited informative samples reduced the power to detect differences, although they may have had incorrect XCI status calls or there may be more complicated epigenetic processes involved. Interestingly, the two genes found to be variably escaping by both ** genes did not show significant differences at any of the examined marks; increasing the sample size might give us the power to see more consistent differences across variably escaping genes, as some of these genes only had two informative samples per XCI status. Two genes showed significant expression differences between samples that escaped XCI versus those subject to XCI (Additional file 1: Figure S4). In BCOR, samples escaping XCI had higher expression across all exons, while in EIF2S3 some exons were higher in samples subject to XCI while other exons were higher in samples escaping XCI. XCI status and expression per exon may be linked by different TSSs having different XCI status or possibly different tissues having different XCI status and dominant splicing variants. To test whether variable escape may be tissue-specific, XCI status per sample was compared with tissue of origin; only one of the eight genes, EIF2S3, showed tissue-specificity.
However, with only eight samples in three tissue types and being limited by heterozygous polymorphisms, there are likely other variable escape genes that were not identified here as many genes did not have the required number of informative samples.

Expanding sample-specific XCI status by using DNA methylation

To increase our sample size, we used promoter DNAme levels to determine XCI status across all genes within the larger 45-sample CEMT dataset, regardless of skewed XCI. Only TSSs with high CpG density and low male methylation were considered informative, and within this group we found 47 genes escaping XCI, 393 subject to XCI and 17 variably escaping across samples (Fig. 3a, Additional file 4: Table S4 for XCI status calls). Our DNAme-based calls had strong concordance with meta-status; there were no genes called as escaping XCI here that were previously called as subject to XCI, while only one of the genes called as subject to XCI here was previously called as escaping XCI. We included genes in the variably escaping from XCI category if at least one of their TSSs had 33% or more of its samples escaping XCI and another 33% or more samples subject to XCI. Additionally, one gene had opposing XCI statuses at separate TSSs and 36 had opposite XCI statuses across tissues (examples of genes with these variable escape scenarios are shown in Fig. 3b). An additional 67 genes were found variably escaping in at least one tissue, but were not identified as variably escaping from XCI in the larger dataset. Only BCOR was found variably escaping from XCI in the ** here. In addition, 96% of genes escaping and 87% of genes subject to XCI identified by ** in only one of the datasets.
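The 33% rule described above can be sketched in a few lines of Python. This is only an illustration of the stated threshold, not the pipeline used in the study: the function names, the `"escape"`/`"subject"` labels and the majority fallback for non-variable TSSs are assumptions introduced here.

```python
def classify_tss(calls, min_fraction=0.33):
    """Classify one TSS from its per-sample XCI calls.

    calls: list of "escape" / "subject" strings, one per informative sample
    (hypothetical labels). A TSS is "variable" when at least `min_fraction`
    of samples escape AND at least `min_fraction` are subject to XCI.
    """
    n = len(calls)
    if n == 0:
        return "uninformative"
    escape_frac = calls.count("escape") / n
    subject_frac = calls.count("subject") / n
    if escape_frac >= min_fraction and subject_frac >= min_fraction:
        return "variable"
    # Assumption: otherwise fall back to the more common status.
    return "escape" if escape_frac > subject_frac else "subject"


def gene_is_variable(tss_calls, min_fraction=0.33):
    """A gene counts as variably escaping if at least one TSS is variable.

    tss_calls: dict mapping a TSS identifier to its list of per-sample calls.
    """
    return any(
        classify_tss(calls, min_fraction) == "variable"
        for calls in tss_calls.values()
    )
```

For example, a TSS with four escaping and six subject samples is classified as variable, whereas one with a single escaping sample out of ten is classified as subject to XCI.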

Fig. 3

DNAme varies at genes variably escaping from XCI. a The number of genes with each XCI status call by DNAme, with their call by meta-status underneath. b From left to right: An example of a gene that variably escapes XCI across individuals (and within multiple tissues), a gene that variably escapes from XCI between tissues, and a gene that variably escapes from XCI between TSSs. c The percent DNAme per read for genes, binned together by their mean DNAme across the CpG island. Only reads overlapping the CpG island were included here. d The distribution of genes with each XCI status across the bins of mean DNAme per island. e Allelic DNAme, shown as the percent DNAme per read by allele. The mean DNAme across all reads per allele in each bin is shown underneath

Comparing epigenetic marks to DNAme-based XCI status calls, all marks (H3K4me3, H3K9me3, H3K27me3 and H3K27ac) except H3K4me1 and H3K36me3 were significantly different between genes with opposite XCI status calls, with increased prominence of H3K9me3 (Additional file 1: Table S7). We again compared epigenetic marks at variably escaping genes to see if they differed between samples in which the gene escaped XCI vs those in which it was subject to XCI. We categorized variable escape genes as those variably escaping across the dataset, across TSSs, across tissues or within specific tissues. For variable escape from XCI between individuals across the dataset, every mark examined was found to be significant (adjusted p value < 0.01) in at least one gene; however, across all categories of variable escape from XCI, only expression and H3K4me3 were significant in more than 25% of genes in any type of variable escape category (Table 1). The direction of histone mark changes was less consistent than for ** XCI and higher inactive marks in genes subject to XCI, but with many genes showing the opposite results (Additional file 1: Figure S5).
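The adjusted p values used for these per-gene comparisons are described in the caption of Table 1 as BH corrected. As a reminder of what that adjustment does, the following is a minimal pure-Python sketch of the standard Benjamini-Hochberg procedure; the function name is hypothetical and this is not the code used in the study.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p values in the input order.

    Each p value is scaled by m / rank (rank 1 = smallest p value), and a
    running minimum from the largest rank downwards keeps the adjusted
    values monotone.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# A gene would be reported as significant for a mark when its adjusted
# p value falls below the 0.01 cut-off used in the text.
raw = [0.001, 0.02, 0.03, 0.5]
significant = [p < 0.01 for p in benjamini_hochberg(raw)]
```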

Table 1 The percentage of variably escaping (VE) genes found by DNAme that have significant differences in epigenetic marks (BH corrected p value < 0.01)

We have previously seen that the average DNAme at genes subject to XCI was 38%, less than expected if the ** genes were found distributed in the range where genes escaping and subject to XCI were found; however, genes with intermediate 20–30% DNAme had more variably escaping genes than genes with a consistent XCI status.

While the bimodal appearance of the DNAme reads reflects that the ** heterozygous SNPs within 2 kb of TSSs. In addition to the usual limitations of mapping allelic reads, we had to exclude C<>T and G<>A polymorphisms as the bisulfite conversion step in WGBS converts unmethylated C to T and on the opposite strand this appears as a G to A conversion. Separating genes into the same 10% bins of mean DNAme as earlier (Fig. 3e), we see that the intermediately methylated reads tend to be on the hypermethylated allele (the presumed ** from XCI and those having one allele below 25% and one above 75% being called as subject to XCI. These calls for SNPs within CpG islands had good agreement with previous calls with all 28 of the loci called escaping and 50/51 of the loci called subject to XCI being concordant. To explain the prevalence of intermediately methylated reads, we examined the DNAme per CpG across some of these islands where we observed that the DNAme level was not consistent (Additional file 1: Figure S6 for browser tracks across islands, Additional file 1: Figure S7 for DNAme differences between adjacent CpGs). We observe an average difference between adjacent CpG sites of 24% in cancer and 13% in healthy samples, which is likely a major contributor to the intermediately methylated WGBS reads and CpG island DNAme averages we observe for the ** XCI, using the remainder to test accuracy, and used twice as many genes subject to XCI for training. Using this predictor, we could predict escape from XCI with accuracies ranging from 42% with H3K9me3 to 69% with H3K4me3 and for genes subject to XCI with accuracies ranging from 85% with genebody H3K36me3 to 99% with H3K27ac. In contrast, a similar model using CpG island DNAme data obtained a much better accuracy of 87% for predicting genes as escaping XCI and 99% for predicting genes as subject to XCI, showing the higher predictive ability of DNAme.
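Returning to the allelic analysis at the start of this section, the exclusion of polymorphisms that bisulfite conversion would make ambiguous can be illustrated with a short filter. This is a sketch under the assumption that each SNP is represented by its two alleles as single bases; the function name is hypothetical.

```python
# Allele pairs that cannot be told apart from bisulfite conversion:
# unmethylated C reads as T, and on the opposite strand as G to A.
BISULFITE_AMBIGUOUS = {frozenset("CT"), frozenset("GA")}


def is_usable_snp(allele_1, allele_2):
    """Return True if a heterozygous SNP can assign WGBS reads to alleles."""
    pair = frozenset((allele_1.upper(), allele_2.upper()))
    return pair not in BISULFITE_AMBIGUOUS


snps = [("C", "T"), ("A", "G"), ("A", "T"), ("C", "G")]
usable = [snp for snp in snps if is_usable_snp(*snp)]
```

Only the A/T and C/G polymorphisms survive in this toy list, which illustrates why the number of informative loci is lower for allelic DNAme than for allelic expression.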

Fig. 4

XCI status predictions with an epigenetic model expand the number of genes examinable. a ROC curves for each random forest predictor trained using single marks, along with the combined predictor using all of the epigenetic marks. An example sample, CEMT28, is shown. See Additional file 1: Figure S8 for all samples. b Accuracy of our epigenetic predictor using DNAme and all six histone marks. Each point is one of the 20 models per sample. This accuracy is tested on genes outside of the training set. c The number of genes with each XCI status as predicted by our model, with their call by meta-status underneath. d, e As (c), but further split by the presence of a CpG island (d) or by an expression threshold of 0.1 RPKM (e). f The predictive ability of each mark. Each mark was ranked per model on how important it was to the model, with the most important mark being ranked 14 and the least important being ranked first. We used the marks within each female sample paired with the mean mark in similar male samples for the predictor, so both the female and male marks are featured here

To get XCI status calls from histone mark data with an improved accuracy, we combined data from all of the histone marks and DNAme data from CEMT and trained a new random forest model [35]. This combined epigenetic XCI predictor was trained using XCI meta-status and was able to accurately predict genes escaping vs subject to XCI, with a median accuracy for genes outside the training set of 75% for genes escaping from XCI and 90% for genes subject to XCI (Fig. 4b). We trained the model 20 separate times per sample and were confident in a prediction if at least 75% of the models agreed. A separate epigenetic XCI predictor was trained and used within each sample; however, the models are capable of being used across samples within the same tissue with reduced accuracy and even across tissues (Additional file 1: Figure S9 for a summary of accuracies). Models in some tissues tended to overcall genes as subject to XCI while others overcalled genes as escaping from XCI; however, the number of escape genes called per sample had no correlation with XIST expression (Additional file 1: Figure S10). Across all samples, the model called 46 genes as escaping XCI, 780 genes as subject to XCI and seven genes as variably escaping from XCI (Fig. 4c, Additional file 4: Table S4 for XCI status calls). While none of the genes predicted to escape XCI here have a meta-status of subject to XCI, 11 of the genes predicted to be subject to XCI have a meta-status of escaping XCI and an additional six genes are located in the PAR1 and are expected to escape XCI [2]. Comparing these predictions to our ** XCI by ** by ** XCI across samples, we predicted 48 genes having tissue-specific escape from XCI, and one gene with separate TSSs with opposite XCI status. To investigate which marks are driving this variability in XCI status predictions we compared our epigenetic marks across samples, tissues and TSSs with opposite XCI status predictions (Additional file 1: Table S12).
At genes predicted to variably escape across samples we found that very few marks had significant (t-test, adjusted p value < 0.01) differences between samples found escaping and those subject to XCI. DNAme was the exception to this, with four of seven genes having significant DNAme differences. For the genes found variably escaping across tissues, all of the marks had multiple genes significantly different between tissues subject to XCI vs tissues escaping from XCI, but many of the genes that did not variably escape also had significant differences across tissues. Tissue-specific variable escape genes had significant enrichment (Chi-square test, adjusted p value < 0.01) for genes with tissue-specific H3K27me3, H3K4me3, DNAme and expression over genes that did not variably escape from XCI. There was only one gene found to variably escape between TSSs so no statistical tests were possible; however, there were differences between TSSs for H3K27ac, H3K4me1 and DNAme for the different exons used.
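The confidence rule described above for the combined predictor (20 models per sample, with a call accepted when at least 75% of the models agree) amounts to a simple vote. The sketch below shows only that voting step on hypothetical labels; it does not reproduce the random forest training, and the function name is an assumption.

```python
from collections import Counter


def consensus_call(predictions, min_agreement=0.75):
    """Return the majority label if enough models agree, else None.

    predictions: list of labels, one per trained model for a given gene
    in a given sample (for example 20 entries).
    """
    if not predictions:
        return None
    label, votes = Counter(predictions).most_common(1)[0]
    if votes / len(predictions) >= min_agreement:
        return label
    return None  # not confident; the gene is left uncalled in this sample


confident = consensus_call(["escape"] * 15 + ["subject"] * 5)
uncalled = consensus_call(["escape"] * 14 + ["subject"] * 6)
```

With 20 models, 15 agreeing votes meet the 75% requirement, while 14 do not.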

Our initial thresholds to call variable escape across samples were arbitrary, so we varied the percentage of samples with each XCI status required to classify a gene as variably escaping from XCI in order to determine the effects of different variable escape thresholds. At our threshold requiring 33% of samples to have each XCI status in order to be called as variably escaping from XCI, we found 7 of 1155 genes to be variably escaping. Lowering this threshold to 25% found 35 variably escaping genes, at 10% we found 304 genes and at 5% we found 476 genes. This shows that there is no natural threshold at which genes become variable in their expression from the ** decreased, the percentage of these genes with significant DNAme differences between samples with opposite XCI statuses decreased down to 20% and the percentage of genes with H3K27me3 differences rose to 27% (Additional file 1: Table S13); however, we must also consider that the cancer origin of these samples may contribute to rare epigenetic misregulation.
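The threshold sweep described above can be illustrated on toy data. The gene names and per-sample calls below are invented purely for illustration, and the counting function is a hypothetical sketch rather than the analysis code; it simply shows how the number of variably escaping genes can only grow as the required fraction is lowered.

```python
def count_variable(genes, threshold):
    """Count genes with at least `threshold` of samples in each XCI status."""
    count = 0
    for calls in genes.values():
        n = len(calls)
        if n == 0:
            continue
        escape_frac = calls.count("escape") / n
        subject_frac = calls.count("subject") / n
        if escape_frac >= threshold and subject_frac >= threshold:
            count += 1
    return count


# Invented example calls: ten samples per gene.
toy_genes = {
    "geneA": ["escape"] * 5 + ["subject"] * 5,  # 50% / 50%
    "geneB": ["escape"] * 3 + ["subject"] * 7,  # 30% / 70%
    "geneC": ["escape"] * 1 + ["subject"] * 9,  # 10% / 90%
    "geneD": ["subject"] * 10,                  # never escapes
}

sweep = {t: count_variable(toy_genes, t) for t in (0.33, 0.25, 0.10, 0.05)}
```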

To validate our conclusions from this model on healthy samples, we trained our overall epigenetic predictor on the CREST dataset. The CREST dataset contains nine samples for which we were able to obtain all of the required epigenetic data for our predictor. We predicted 88 genes escaping from XCI, 802 subject to XCI, 40 variably escaping across samples, ten across tissues and six across TSSs. These calls are similar to those in the CEMT data, with 95% of genes with calls from both datasets agreeing (Additional file 1: Table S14). The genes variably escaping from XCI in the CEMT dataset tended to be escaping XCI in CREST while genes variably escaping in CREST tended to be subject to XCI in the CEMT dataset. The number of genes variably escaping from XCI is increased in CREST, possibly due to how few samples were required for variable escape (three with each XCI status) decreasing stringency. Another possibility is that having random ** across individuals in CREST had significant differences between samples subject to XCI and those escaping from XCI (Additional file 1: Table S15). CREST tissue-specific genes had significant differences in H3K27me3, DNAme and expression between tissues, all three of which were also significant in CEMT samples. CREST had enough genes variably escaping across TSSs to see that H3K4me3, H3K27me3 and DNAme were significantly different between TSSs escaping and TSSs subject to XCI in females. Males had significant differences in H3K4me3, H3K27ac, H3K27me3, H3K36me3 and DNAme between TSSs escaping vs subject to XCI in females, which suggests that these TSSs also differ significantly on the Xa. These TSSs may be predisposed to have different XCI statuses based on their epigenetic landscape prior to XCI, or the Xa differences may be misleading the predictor, causing it to predict different XCI statuses.
The results from our cancer and healthy samples are similar overall, with both datasets finding few genes with significant epigenetic differences among genes variably escaping across individuals, and finding H3K27me3, DNAme and expression differences more commonly between tissues at genes with a tissue-specific XCI status than at other genes.

Independent regulation of variable escape across a region

As an application of our epigenetic XCI predictor and to understand the scale at which variably escaping genes are regulated, we examined XCI status calls per sample across a region that is enriched in genes variably escaping from XCI according to their meta-status (Fig. 5a). We found that many of the genes in this region that are annotated as variably escaping from XCI had low levels of variable escape, with few samples differing from the most common XCI status. The genes that vary in XCI status across samples change their XCI status independent of the XCI status of neighboring genes, suggesting that regulation of variably escaping genes happens at the single gene level and not at the domain level. Additionally, we saw genes that had multiple TSSs with different XCI statuses and genes that are bidirectional from the same promoter with opposite XCI status, showing that the scale of regulation could be narrowed even further. All of the genes in this region that showed variable escape here, except for IRAK1, had significant differences for some combination of marks including H3K9me3, H3K27me3 and DNAme between samples escaping vs subject to XCI (p value < 0.05, Fig. 5b, Additional file 1: Figure S11 for which marks were significant per TSS). Euchromatic marks were less frequently seen to be significantly different.

Fig. 5

XCI status calls are independent between neighboring variably escaping genes. a A map of a variably escaping region, with genes colored by their XCI status as predicted per sample, by our random forest model using all epigenetic marks available. The samples were clustered based on their XCI status calls within the region. Arrows indicate where each TSS is located, and they point in the direction of transcription. Genes which are colored as variably escaping here are variably escaping between transcripts and TSSs within a sample. b Metagene plots for the epigenetic marks that were most commonly significantly different between samples subject to XCI vs those escaping from XCI at the above variably escaping genes. Genes were chosen to show every combination of which mark is significant per gene, that we saw in this region. Marks that were significant at a gene are marked with a star

Genetic contribution to variable escape from XCI

To identify any genetic differences at variably escaping genes between samples that escape and those subject to XCI, we obtained existing exome-seq, RNA-seq, Illumina Infinium Human Methylation450 BeadChip array (450k array) and Affymetrix Genome-Wide Human SNP Array 6.0 (SNP6) data for 5817 samples from cancers where clonality should lead to skewed ** XCI and 377 genes subject to XCI by ** XCI and 397 genes subject to XCI (Fig. 6a). Of the 25 genes called as escaping from XCI by DNAme that were informative by ** by ** from XCI and 20 variably escaping from XCI by ** genes. a The number of genes with each XCI status call in the TCGA dataset made using ** XCI by DNAme became variably escaping genes when the threshold for variable escape was lowered to 100 or more samples with each XCI status. b–i The percent of samples with each allele that were found with each XCI status at the most significant loci for our association analyses. The chromosomal location below the gene name is for the locus associated with the XCI status of the gene and is the location in hg38. The top row of graphs are the most significant loci associated with ** from XCI, we found 45 genes variably escaping by ** genes by DNAme. For our genetic tests, we decided to use a less stringent measure of variable escape for our DNAme calls, as there are so many informative samples here we called any gene with over 100 samples with each XCI status instead of the usual 33% of all samples. This gave us 126 variably escaping genes to test. Of these new variably escaping genes, 26 were previously called as escaping XCI, 59 were called as subject to XCI and 36 did not meet the thresholds for either call previously as too many of the samples were outside of the thresholds to be called as escaping or subject to XCI.

We tested association between XCI status of these variably escaping genes and all SNPs on the SNP6 array and did not find any loci significantly associated with our ** genes but with our DNAme-based XCI status calls we found 610 significant combinations of gene and genetic locus across all chromosomes (Additional file 5: Table S16). Only seven of these were X-linked, with the closest being 9 Mb away from the affected gene. There were significant loci for 75 of the genes found to variably escape by DNAme, and most of these genes had multiple significant loci, with a maximum of 26 significant loci for SLC16A2 (Additional file 6: Table S17). Many of the loci were also significantly associated with XCI status for multiple genes, with only 372 unique loci appearing in the 610 significant gene:locus associations. The largest number of genes affected by a single locus was 18, for chr4:130533697 (Additional file 6: Table S17). However, none of these significant polymorphisms showed 100% correlation with XCI status calls and so they are not the causative or sole-causative polymorphism responsible for the change in XCI status, but may be part of a complex mechanism or be in incomplete linkage disequilibrium with a causative polymorphism (Fig. 6b-i). We examined attributable risk per significant locus and found that the allele with the highest contribution to XCI status had an attributable risk of 28%, but 90% of the loci had attributable risk under 10% (Additional file 5: Table S16). This suggests to us that there are alleles which allow for a change in XCI status and give an increased chance of changing the XCI status, but are not sufficient for the change by themselves.

To test the strength of the effect of SNPs that were significantly associated with XCI status as determined by DNAme, we compared the genotype of samples to their DNAme to find significant DNAme-quantitative trait loci (DNAmeQTLs). Testing our 610 significantly associated gene:locus combinations with DNAme-based XCI status calls, we found 38 loci were also significant DNAmeQTLs (Fig. 6j-k, Additional file 7: Table S18). We also tested these DNAmeQTLs in males and all 38 loci were found to only be significant in females. Three of these significant DNAmeQTLs (for the genes EIF2S3, PNPLA4 and NLGN4X) had their median DNAme with one allele in the range to be called as escaping from XCI, with the median DNAme of their other allele in the range to be called as subject to XCI, while the others did not (Fig. 6l). Overall, it appears that there are multiple X-linked and autosomal loci contributing to the variability observed in escape from XCI; however, these are not major contributors and the effect of a single DNAmeQTL is not sufficient for a change in XCI status.

Discussion

XCI is a classic paradigm for studying epigenetic regulation, yet how some genes are resistant to silencing (or the maintenance of silencing) and escape XCI remains unresolved. Here, we have examined the genetic and epigenetic differences between genes escaping and those subject to XCI. Overall, epigenetic marks differed more between males and females than between genes escaping vs subject to XCI, suggesting an influence of the ** XCI have similar epigenetic marks between the ** XCI may be why escape genes can have as low as 10% expression from the ** XCI could also contribute to lower expression from the ** from those subject to XCI, while the heterochromatic mark H3K27me3 had the largest **, variably escaping or subject to XCI across our DNAme analyses as our previous ** and variably escaping from XCI. A large proportion of the additional genes found subject to XCI by our epigenetic predictor may in fact be silenced on both the Xa and ** from XCI and the number of epigenetic marks that were significant in at least one gene, but decreased the percentage of genes significant for DNAme that was the only mark ever significant for over 50% of genes in a dataset.

We observed that variable escape from XCI was regulated at the level of single genes, with adjacent genes varying their XCI status independently. In contrast, a study in mice found clusters of genes that variably escape across their three cell lines, with adjacent genes often having the same XCI status across lines [9]. They also found that these clusters colocalize with TADs, with one line having the majority of a TAD escaping XCI and another line having only part of it escaping. An interesting candidate regulator of regional control is SMCHD1. In mice with SMCHD1 knocked out, regions enriched with variably escaping genes were upregulated, while genes that constitutively escaped from XCI were not affected; however, no impact was seen on variable escape genes in human patients with heterozygous SMCHD1 mutations [36]. Nonetheless, another study found variants with low expression of SMCHD1, ZSCAN9 and HBG2/TRIM6 associated with hypomethylation of X-linked CpG islands, with affected islands enriched near genes that variably escape from XCI [29]. Additionally, there are individual genes which are susceptible to reactivation under certain conditions, such as how some genes are reliant on XIST expression and H3K27me3 deacetylation to remain silent, while others continue to be silenced when XIST expression is disrupted [47]. This is also consistent with our observation that variably escaping genes did not have consistent epigenetic differences between samples which escaped XCI and those which were subject to XCI. Overall, there is evidence for both domain-level and gene-specific regulation of escape. We suggest that for some domains the former predominates, while for other genes the latter predominates. Additionally, the domain featured in Fig. 5 (and other variably escaping domains) is at a threshold where individual genes within the domain can have either XCI status based on local factors.

We thus asked whether variable escape from XCI could be controlled by local sequence variants. Here, we found an association between numerous genetic variants and sample-specific XCI status at variably escaping genes. However, we did not find any local genetic effect, as none of the loci were within 5 Mb of the affected genes and only 10 of 610 significant loci were located on the X. None of the SNPs we identified were completely correlated with a gene’s XCI status, so other factors must be involved. Additionally, all of the significant loci we found were based on DNAme for XCI status calls. These loci could have been affecting just DNAme instead of XCI status; however, 38 of 610 significant loci were female-specific DNAmeQTLs while only one locus was a significant DNAmeQTL in males. With more samples with skewed XCI, we may have found loci associated with our ** from XCI in the CEMT cancer dataset as in the healthy CREST dataset. Nonetheless, we used the CEMT dataset because it had a standardized set of epigenetic marks across many samples and the clonality of cancer allowed us to examine expression and DNAme allelically. We found that other datasets did not always have all the marks from the same samples, lacked females or sex labels, or had mislabeled sex.

The use of different methods and sample sizes to call XCI status can result in discordant calls, generally because one approach calls a gene as variably escaping while other studies do not. A previous meta-analysis saw 7% of genes having discordant calls between studies [2]. Many of these discordances between studies may be due to the different samples and tissues used, but here we see differences in XCI status called using different approaches with the same samples. Genes could be falsely called as subject to XCI in the ** XCI tended to have equal levels of marks on the Xa and ** vs subject to XCI at variably escaping genes, but which marks were significant was not consistent between genes and no mark was significant across all of the variably escaping genes, likely reflecting that variably escaping genes have multiple ways in which they are regulated. DNAme intermediate to what is expected for genes escaping vs subject to XCI is enriched at variably escaping genes and is mostly due to inconsistent DNAme on the ** genes were seen to regulate their XCI status independently from each other, suggesting local regulatory elements. Additionally, we searched for polymorphisms which may control variable escape from XCI and found non-syntenic loci, some with a strong correlation, but none were completely correlated, further suggesting complex regulation. Overall, we see that escape from XCI is influenced by both local regulatory elements and trans-acting factors and chromatin modifications that can be independent of each other. Understanding how genes escape from XCI will further our understanding of epigenetics in general and may allow us to control which genes are escaping from XCI and rescue X-linked mutations in females.

Methods

Previous XCI status calls

We used XCI meta-status calls from [2] for all comparisons with past XCI statuses and to train our models. Genes that escape and genes that mostly escape were combined due to the small size of these categories, with genes in PAR1 either left out or given their own separate category depending on the analysis. Genes that were mostly subject to XCI were combined with genes subject to XCI for comparisons between studies, but were left out when training models. Genes that were annotated as variably escaping, mostly variably escaping and discordant across studies were combined as variably escaping genes for comparisons here.

Histone ChIP-seq analysis

Histone ChIP-seq bigwig files were downloaded from the IHEC data portal [38] and their mean signal quantified with bigWigAverageOverBed [39] for a region 500 bp upstream of TSSs as annotated by Gencode [40]. We normalized the data across samples by scaling samples to have the same total depth (including all chromosomes). The ** genes, requiring at least two samples with each XCI status. This narrows the number of variably escaping genes and increases the chance that those found would have enough samples to reach significance. The overall expression level of genes was calculated using bigwig files downloaded from the CEEHRC data portal [42] and quantified as RPKM using VisRseq [43].

DNAme analysis

WGBS bigwig files were downloaded from the IHEC data portal [38] and quantified with bigWigAverageOverBed [39] for a region 500 bp upstream of TSSs as annotated by Gencode [40]. DNAme thresholds established in [11] were used to determine which genes were escaping XCI and which were subject to XCI. These thresholds are: DNAme < 10%, escapes XCI; 15% < DNAme < 60%, subject to XCI; and DNAme > 60%, hypermethylated. A threshold of DNAme < 15% in males was used to filter out TSSs that were methylated on the Xa and therefore not informative for this analysis. To see the differences between adjacent CpGs, we converted bigWig files to bedGraphs and, for each island, used R to find the mean absolute difference between adjacent CpGs.
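The threshold rules above can be illustrated with a minimal Python sketch (the analyses themselves were done in R). The function names and the labels "uncalled" and "uninformative" are hypothetical. The thresholds given do not assign a status to values between 10% and 15% or to exactly 60%, so the sketch leaves those uncalled as an assumption.

```python
from typing import Optional, Sequence


def classify_dname(female_pct: float, male_pct: Optional[float] = None) -> str:
    """Classify a TSS from promoter DNAme (in percent) using the stated thresholds."""
    # TSSs with male DNAme of 15% or more are methylated on the Xa and not informative.
    if male_pct is not None and male_pct >= 15:
        return "uninformative"
    if female_pct < 10:
        return "escape"
    if 15 < female_pct < 60:
        return "subject"
    if female_pct > 60:
        return "hypermethylated"
    # Assumption: values in the 10-15% gap (or exactly 60%) receive no call.
    return "uncalled"


def mean_abs_adjacent_diff(cpg_values: Sequence[float]) -> float:
    """Mean absolute difference between each pair of adjacent CpGs in one island."""
    if len(cpg_values) < 2:
        raise ValueError("need at least two CpGs")
    diffs = [abs(b - a) for a, b in zip(cpg_values, cpg_values[1:])]
    return sum(diffs) / len(diffs)
```

For example, `classify_dname(30, male_pct=5)` gives a subject call, while the same female value with `male_pct=40` is flagged as uninformative.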

DNAme per read was calculated by downloading WGBS bam files and using a script to count the number of unmethylated and methylated CG dinucleotides per read within CpG islands within 2 kb of TSSs. For allelic DNAme, we proceeded similarly but only examined reads that overlapped heterozygous SNPs identified in our ** from XCI and both alleles higher than 0.75 being called as hypermethylated. Polymorphisms with one allele above 0.75 and the other allele below 0.25 were called as subject to XCI. The DNAme per read per polymorphism was binned as above, but instead of using the mean DNAme across all reads, we determined the mean DNAme per allele and used the mean of that; this was done so that we get the mean between the ** XCI were often within two standard deviations of each other, the average of these two means was often used as a threshold instead.

For our random forest models, we wanted to include both male and female data, and breast did not have any male data, so we used the kmeans function in R to cluster all of our samples based on autosomal levels of all seven epigenetic marks used herein. With three clusters, we had multiple male and female samples in each cluster. As input for our models, we used individual female data per sample and matched it with the mean values per gene across males in the same cluster.

Random forest models were trained using the R package caret [35] with the trainControl method cv and the train method rf. We trained the model on genes known to escape or be subject to XCI [2]. The training metric was ROC, tuneLength was 5 and ntree was 1500. Three genes escaping and subject to XCI were left out of the training set and used to check accuracy of overall calls. We trained twenty models per sample, with each model being trained on a random sample of 75% of the genes escaping XCI and twice as many genes subject to XCI, with each iteration of the model using 75% of the number of input escaping genes. Accuracy per model was tested on the remaining genes with known XCI status. Genes were considered as escaping or subject to XCI if 15 or more of the 20 models predicted them as escaping or subject to XCI, respectively. Separate categories were made for genes where only 12–14 of the models agreed on the gene’s XCI status, which were annotated as leaning subject or leaning escape. Overall calls were made across samples: genes with 66% or more of samples agreeing on a gene’s XCI status were called as subject to or escaping from XCI, genes with at least 33% of all samples having each XCI status were called as variably escaping from XCI, and genes that required the leaning categories to reach 66% of samples having a status were annotated with a similar leaning status.
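The voting scheme described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code used in the study: the function names and the "uncalled" label are hypothetical, genes where neither status reaches 12 of 20 models are left uncalled, and the order in which the across-sample rules are checked is assumed.

```python
from collections import Counter
from typing import Sequence


def per_sample_call(escape_votes: int, n_models: int = 20,
                    strong: int = 15, leaning: int = 12) -> str:
    """Turn the number of models predicting escape into a per-sample call."""
    if not 0 <= escape_votes <= n_models:
        raise ValueError("escape_votes must be between 0 and n_models")
    subject_votes = n_models - escape_votes
    if escape_votes >= strong:
        return "escape"
    if subject_votes >= strong:
        return "subject"
    if escape_votes >= leaning:
        return "leaning escape"
    if subject_votes >= leaning:
        return "leaning subject"
    # Assumption: the text does not name this case, so no call is made.
    return "uncalled"


def overall_call(sample_calls: Sequence[str]) -> str:
    """Combine per-sample calls for one gene into an overall call."""
    n = len(sample_calls)
    if n == 0:
        raise ValueError("no samples")
    counts = Counter(sample_calls)
    esc, sub = counts["escape"] / n, counts["subject"] / n
    if esc >= 0.66:
        return "escape"
    if sub >= 0.66:
        return "subject"
    if esc >= 0.33 and sub >= 0.33:
        return "variable escape"
    # Leaning calls are only counted when needed to reach 66% of samples.
    if esc + counts["leaning escape"] / n >= 0.66:
        return "leaning escape"
    if sub + counts["leaning subject"] / n >= 0.66:
        return "leaning subject"
    return "uncalled"
```

With ten samples, seven escape calls yield an overall escape call, whereas four escape and four subject calls yield a variable escape call.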

Statistical comparisons

All statistical comparisons were done in R [44]. The majority were t-tests with a Benjamini–Hochberg (BH) multiple testing correction [45] with results deemed significant if they had an adjusted p value < 0.01. The one test with a different threshold was for comparing genes variably escaping XCI as determined by ** from XCI, assuming the reason for this is that they did not have skewed XCI.
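As an illustration of the BH correction mentioned above, the following Python sketch computes adjusted p values, in the manner of R's `p.adjust(p, method = "BH")`, and applies the 0.01 cutoff. The function name and the example p values are hypothetical and are not taken from the study.

```python
from typing import List, Sequence


def bh_adjust(pvalues: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    m = len(pvalues)
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0  # adjusted values are capped at 1
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p values for four tests.
raw = [0.001, 0.008, 0.039, 0.041]
adj = bh_adjust(raw)
significant = [p < 0.01 for p in adj]
```

In this made-up example only the first test remains below the adjusted threshold of 0.01.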

For the TCGA DNAme-based XCI status calls, we downloaded methylation beta-values from the Genomic Data Commons data portal for females and males from the TCGA dataset. Probes were removed if the average male DNAme was over 15%, and female samples were removed if their average DNAme was two standard deviations below the female average, as we presume that they were mislabeled males or had lost their ** gene’s XCI status (Chi-square test, significant if BH adjusted p value < 0.01). We tested against all SNPs on the array, and again with just the SNPs on the X. For samples which had multiple SNP array datasets, we used a consensus allele across all of the arrays. We did not include heterozygous samples as we were testing for a cis-effect and had no way of knowing which allele was on the **. DNAmeQTLs were examined by using the lm function in R to make a linear model for every combination of SNP and CpG island.
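The probe and sample filters described at the start of this paragraph can be sketched in Python as follows. This is a minimal illustration with hypothetical function names, probe identifiers and sample identifiers; it assumes beta-values on a 0 to 1 scale (so 15% is 0.15) and uses the sample standard deviation, neither of which is specified in the text.

```python
from statistics import mean, stdev
from typing import Dict, List, Sequence


def keep_probes(male_beta: Dict[str, Sequence[float]],
                max_male_mean: float = 0.15) -> List[str]:
    """Keep probes whose average male DNAme is not over the threshold."""
    return [probe for probe, values in male_beta.items()
            if mean(values) <= max_male_mean]


def keep_female_samples(mean_beta: Dict[str, float],
                        n_sd: float = 2.0) -> List[str]:
    """Drop female samples whose average DNAme is n_sd SDs below the female mean."""
    values = list(mean_beta.values())
    cutoff = mean(values) - n_sd * stdev(values)
    return [sample for sample, value in mean_beta.items() if value >= cutoff]


# Hypothetical data: two probes in males, ten female samples.
males = {"probe_a": [0.05, 0.10], "probe_b": [0.40, 0.50]}
females = {f"F{i}": 0.30 for i in range(1, 10)}
females["F10"] = 0.05  # unusually low average DNAme

probes_kept = keep_probes(males)
females_kept = keep_female_samples(females)
```

Here `probe_b` is dropped because it is methylated in males, and `F10` is dropped as a possible mislabeled male.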