
Human and mouse differ in both the initiation and completeness of X-chromosome inactivation (XCI) [1, 2]. In contrast to human, mouse has imprinted XCI early in development, which is maintained in extraembryonic (placental) tissues [3,4,5]. In placenta, rat [6] and vole [15], while in mouse the proportion of genes escaping from XCI is only 3–7% [16]. In human, an additional 15% of genes variably escape from XCI, differing in their XCI status between different tissues, populations, individuals or studies [15, 17]. Large-scale studies have not been reported in species outside of human and mouse, and the studies in mouse generally report only on the genes escaping from XCI. The variation between species highlights the importance of studying XCI across a range of species, particularly as the most common model organism, mouse, appears quite different from human.

There are various methods to examine the XCI status of genes, with the above numbers being determined using a combination of allelic expression and DNA methylation (DNAme). Additional methods to assess XCI status are reviewed in [18]. For allelic expression to be used to examine XCI escape status, the samples analyzed must be skewed so that the majority of cells in the sample have the same ** genes [24]. Males have low DNAme of promoter CpG islands on the X chromosome, while females, with one Xa and one ** from and subject to XCI, but these differences are subtler and may be tissue-specific [23, 25,26,27]).

Knowing the XCI status of genes is important, as genes that escape from XCI often have sex-biased expression, being higher in males if a gametolog is also present on the Y, and higher in females if not [17]. Furthermore, having two active copies of a gene has been argued to protect females from cancers as both copies will need to be mutated in order to have loss of function [28]. In individual species, knowing which genes escape from XCI will be useful for mapping the effect of X-linked genes to various traits, and understanding XCI within a species is important for genomic selection strategies in breeding for agriculture [29]. Additionally, the knowledge of which genes escape from XCI across species can further our understanding of the underlying mechanism allowing some genes to escape XCI and give insight into the evolutionary development of XCI.

Here, we compared the XCI status of human and mouse, first examining allelic expression and DNAme in human and mouse to establish robust thresholds of DNAme as an indicator of XCI. We then used DNAme data across two separate groups, one of nine different mammalian species, and one of five different primate species, to examine conservation of XCI escape status across species. Finally, we performed an analysis testing elements previously seen enriched at genes with various XCI statuses (repetitive elements, CTCF and ATAC-seq) for enrichment with our XCI status calls across species.


XCI status calls from allelic expression

To obtain DNAme thresholds separating genes escaping XCI from genes subject to XCI, we first needed to establish which genes were escaping versus subject to XCI using allelic expression data. Allelic expression data requires skewed ** XCI and those with ** XCI, 262 genes subject to XCI and 21 genes variably escaping from XCI in them (Additional file 2: Table S1). We called genes as variably escaping if they had at least 33% of informative samples with each XCI status. The majority of these XCI status calls agreed with previous studies, with discordance for only 53 genes (17% of genes with an XCI status call in both), 39 of which were reported to variably escape from XCI here or previously [15]. We attribute the low number of genes variably escaping in our current study to the limited number of samples available and the frequency of informative, heterozygous SNPs per sample, resulting in a mean of 3.5 informative samples per gene. With more samples, we would expect to observe more variably escaping genes.

Fig. 1

Using ** XCI, 662 genes subject to XCI and 10 genes variably escaping from XCI (Additional file 3: Table S2). We used three different mouse expression datasets (Keown et al., Berletch et al. and Wu et al.) and results were 97%, 90% and 87% concordant when datasets were compared with each other [16, 21, 26]. Most of the discordance in our results arises from identifying more genes variably escaping in the Wu dataset than the other two datasets. Additionally, our use of a threshold of 0.1 rather than 0 to call escape from XCI and the inclusion of a variable escape category resulted in more discordant calls relative to those assigned by Berletch [16]. Figure 1 shows a clear DNAme difference between genes with an ** from XCI, which occasionally crossed the threshold to being called subject to XCI, particularly in chimp, likely due to the low sample size in WGBS (only one sample). Many genes were not assigned a call in one of the datasets as they were hypermethylated. XCI status calls made using our DNAme thresholds were generally consistent, so we did not discard the 450k array datasets.

Fig. 2

The number and type of XCI status calls per species. The number of XCI status calls per dataset (a) and the percentage of calls with each XCI status per dataset (b) are shown. Datasets (columns) were sorted by technique used to generate the data. Species names are colored by the type of data used to generate XCI status calls

Horse had elevated numbers of variably escaping genes (10%), which were close to that seen previously in human, while other species (including human) only had 0–5% of genes found variably escaping from XCI. The variation in proportion of variable escape genes seen here could be due to low sample size (in everything except human WGBS), or from our methods of calling variable escape genes being more stringent than previous studies. We required at least 33% of informative samples to have each XCI status before calling a gene as variably escaping from XCI, similar to the initial survey of human XCI status by Carrel and Willard [14]. Reducing this requirement to only 10% of samples increased the number of variably escaping genes found in human to 63, almost a quarter of informative genes. These include 37 newly called genes that did not have enough informative samples to be called as escaping or subject to XCI with our initial thresholds, as well as 15 genes which changed from an initial call of escaping XCI (12 genes) or subject to XCI (three genes). Although this lower threshold called more genes, we used our 33% threshold of variable escape calls for subsequent studies as we wished to focus on genes that we were confident changed their XCI status between species, rather than differing levels of variable escape from XCI.
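The variable escape rule described above can be illustrated with a minimal sketch. It assumes per-sample calls are already available as the strings "escape" and "subject" (with None for uninformative samples); the function name `gene_status` and the `min_fraction` parameter are hypothetical, and any minimum number of informative samples required by the study is not modelled here.

```python
def gene_status(sample_calls, min_fraction=0.33):
    """Summarise per-sample XCI calls into one call for a gene.

    sample_calls: list of "escape", "subject" or None (uninformative).
    min_fraction: fraction of informative samples needed for EACH status
                  before the gene is called variably escaping
                  (0.33 in the text, 0.10 in the relaxed comparison).
    """
    informative = [c for c in sample_calls if c in ("escape", "subject")]
    if not informative:
        return None  # no informative samples, so no call
    n = len(informative)
    frac_escape = informative.count("escape") / n
    frac_subject = informative.count("subject") / n
    if frac_escape >= min_fraction and frac_subject >= min_fraction:
        return "variable"
    # Otherwise one status dominates; report the majority status.
    return "escape" if frac_escape > frac_subject else "subject"


if __name__ == "__main__":
    calls = ["escape"] * 9 + ["subject"]
    print(gene_status(calls))                     # stricter 33% rule
    print(gene_status(calls, min_fraction=0.10))  # relaxed 10% rule
```

Lowering `min_fraction` moves genes with a small minority of discordant samples into the variable category, which is the direction of the change reported for human when the requirement was reduced from 33% to 10%.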

Overall, we saw that calls of XCI status using DNAme agreed well with those made using allelic expression, and provided an opportunity to examine XCI across multiple species. While WGBS resulted in the most XCI status calls, 450k array DNAme-based calls were generally concordant. These studies showed an average of 11% of genes escaping from XCI across 12 different species, with mouse being an outlier with only 5% of genes escaping from XCI.

Conservation of XCI status calls across species

XCI status calls per gene were compared across species, focusing on genes that were informative in four or more species. We observed 267 genes being completely conserved across all informative species, with only eight of these genes escaping from XCI and the rest being subject to XCI. Of the eight conserved XCI escapees, two (DDX3X and KDM6A) have Y homologues across eutherian mammals [31], five have Y pseudogenes in human (ARSD, STS, PNPLA4, EIF2S3 and MED14) [32], and one has no known Y homology (CTPS2) (Fig. 3a). To avoid biasing the analysis with the more conserved primates, the species were grouped into two groups: primates with 450k array data, and other datasets (including the human and chimp WGBS data). A clear difference in conservation of status was seen between these two groups, with 97% of genes having completely conserved XCI status across primates, while only 75% of genes had conserved XCI status across all mammals (Additional file 2: Table S1). Of the genes which were usually subject to XCI (> 75% of informative species subject to XCI), 79% had all informative species subject to XCI. Genes that usually escaped from XCI were less concordant, with only 61% of these genes having entirely conserved XCI status across all informative species. A similar trend was seen in the all primates group.

Fig. 3

Concordant and discordant escape genes across species. Eight genes escape XCI in all informative species (a), while 259 genes were subject to XCI in all informative species (not shown). Discordant genes in two different groups of species were examined, only primates (b) and all mammals (c, limited to only two primate species). The intersection of a gene and species is colored based on that gene’s XCI status call in that species. Genes that did not have an XCI status call in a species are colored grey. Only escape genes informative in at least four species were selected for a. Genes were selected for b if they had at least one discordant primate species, while genes in c required two XCI statuses with two or more species each. To match best across species within groups, 450k array data were prioritized in b and WGBS data were prioritized in c. Genes are organized based on their position on the human X chromosome with a horizontal black line denoting the centromere. Green boxes highlight domains of adjacent genes with similar changes to XCI statuses across species

There were 16 genes that varied frequently (two or more species escaping XCI and two or more species subject to XCI) in the all mammals group and none that varied greatly across primates, again showing the higher similarity in XCI status across closely related species (Fig. 3). Of these 16 genes, four showed primate-specific escape from XCI (RPS4X, CDK16, EIF1AX and GEMIN8) and one showed artiodactyla-specific (cow, sheep, goat, pig) XCI (KDM5C). The pattern of conservation of the other genes variably escaping across species did not match any phylogenetic patterns. The primate-specific escape genes RPS4X and EIF1AX have been shown to have primate-specific retention of their Y homolog, while KDM5C, the gene that is subject to XCI only in artiodactyla, has lost its Y homolog in bulls while retaining it in mouse and primates [31]. We show the WGBS data surrounding the CpG island at the transcription start site (TSS) of the ubiquitous escape gene KDM6A, the artiodactyla-specific subject gene KDM5C and the primate-specific escape gene RPS4X (Fig. 4).

Fig. 4

Featured genes compared across species. Male and female DNAme values are graphed by gene and dataset. KDM6A is featured as it is concordantly escaping across species (a). KDM5C is featured because it is known to escape XCI across species but is here shown to be subject to XCI in artiodactyla (cow, sheep, pig and goat) (b). RPS4X is featured because it is a well-known primate-specific escape gene (c). Male methylation is shown in blue and female in red. Annotated CpG islands are shown under the methylation data in purple. Genes are shown colored by their XCI status with arrows at the TSS pointing in the direction of transcription. All of the methylation data shown are from WGBS. Pig did not have KDM6A annotated, but predictions from other species show it located at this CpG island. Goat did not have a CpG island or hypomethylated region at the annotated KDM6A

CDKL5 was the only gene seen to have more than one discordant species in primates (Fig. 3b), being subject to XCI in the human WGBS data, variable in orangutan and the human 450k array data, and escaping in chimp and bonobo. In gorilla, CDKL5 appeared subject to XCI, but half of the data were in the uncallable region between 10 and 15% DNAme, so it was not called as subject to XCI. Other genes had only one species of primates discordant from the rest, usually gorilla or bonobo.

Role for alternative promoter usage in escape from XCI

UBA1 was particularly interesting as it has been shown previously in human to have two different TSSs with differing XCI statuses [33]. This pattern of multiple TSSs with differing XCI status was seen also in chimp and horse (although data are sparse in horse) (Fig. 5). In cow, the upstream TSS and CpG island are not annotated, but the region homologous to the human upstream TSS showed a DNAme pattern consistent with a promoter subject to XCI, and in pig the CpG islands are annotated but the gene is not. Similarly, in mouse both TSSs (which are annotated but lack CpG island definition) had female-specific DNAme. Mouse has been shown to have fewer CpG islands than human, with CpG island loss from the ancestral genome being four times as high in mouse as human [34]. The island is still large enough to see hypomethylation on the Xa, so the cutoff for minimum island size may be too high in some species. Overall, the alternative TSSs are conserved across species; however, the XCI status of the downstream TSS changes from escaping from XCI in human, chimp and horse to being subject to XCI in mouse and cow. In humans, both TSSs were always found within the same topologically associated domain (TAD) and sub-TAD. Examining TSS usage in the other genes featured in Fig. 3c, we were able to map the TSS and CpG islands using either the University of California Santa Cruz Genome Browser (UCSC) [35] for that species or using the UCSC liftover tool across species, suggesting that the change in XCI status across species was not due to differences in TSS usage between species.

Fig. 5

DNAme across the variably escaping gene UBA1. UBA1 is featured as it has multiple different TSSs with CpG islands that have different XCI statuses. Male methylation is shown in blue and female in red. Annotated CpG islands are shown under the methylation data in orange. Genes are shown colored by their XCI status with arrows at the TSS pointing in the direction of transcription. All of the methylation data shown, except for horse, are from WGBS. Horse used RRBS data, which is why the data are so sparse

Domains of escape from XCI across species

Looking at the position of genes escaping XCI along the human X chromosome, we saw that most genes escaping XCI clustered into domains on the short arm of the X chromosome, similar to what has been described previously [14]. Ten of the 23 transitions between clusters of genes escaping or variably escaping from XCI and genes subject to XCI fell near TAD boundaries in human [36], again similar to what has been seen previously [37]. These clusters of genes escaping from XCI often matched across species. Genes discordant in more than one species were also often clustered, while the genes discordant in only one species were generally scattered by themselves. Some of the genes within discordant clusters were not featured in Fig. 3 as they were missing data in some species. Only two of the strongly discordant genes featured in Fig. 3 are located on the long arm of the X chromosome and they did not form a cluster.

We investigated these domains of changing XCI status further by examining whether the discordant species had altered the chromosomal arrangement of these genes. For the primate-specific region of genes escaping XCI spanning the genes TCEANC to GEMIN8, most species had the same gene order, orientation and flanking genes as observed for human (Additional file 1: Figure S6), although some small changes were observed in gorilla, mouse, cow and sheep. In human and mouse, the two species with Hi-C data, there is a TAD spanning from EGFL6 (which neighbors TCEANC) to GEMIN8, which may coordinate the regulation of this region, although if regulated as a domain, EGFL6 would be expected to also escape XCI in primates. There were no data here giving an XCI status for EGFL6, but a previous study had seen it as subject to XCI in human [38]. Gorilla was the only primate that did not demonstrate escape from XCI across this domain, with only the gene GEMIN8 escaping XCI. A small insertion was present in gorilla, but it was outside of the TAD, which casts doubt on whether it could be the cause of this discordance from the other primates. None of the structural differences in this region were conserved across species with concordant XCI status; thus, we found no detectable genomic correlate underpinning the change in XCI status. Similar results were found for the other discordant regions.

Correlation of features with XCI status across species

These genes that transition their inactivation status across species provided a dataset to interrogate for factors underlying establishment of silencing or escape from silencing. We considered various factors pertaining to CpG islands in addition to enrichment of various classes of DNA repeats. No differences were seen in CpG island size, nor CpG and GC content between species with discordant XCI status at specific genes. Differences in islands between all genes escaping from versus subject to XCI per species were seen in some species, but no characteristic was seen to be significant after multiple testing correction or in more than one species.

Different classes of repeats were tested for correlation with genes escaping from versus subject to XCI in human, chimp, mouse, cow, sheep, pig and horse. There were significantly more LINE repeats within 15 kb upstream of genes subject to XCI than for genes escaping from XCI in chimp, mouse, sheep and horse (Fig. 6a, Additional file 5: Table S4, t-test, corrected p-values < 0.01). Other repeat classes found enriched across multiple species include LTR, DNA and snRNA repeats, which were enriched at genes escaping XCI in three species (Additional file 1: Figure S7). SINE repeats, which have previously been seen enriched at genes escaping from XCI [39], were only found significant in horse, which unexpectedly had more SINE repeats near genes subject to XCI than at genes escaping from XCI. Human still had more SINE repeats near genes escaping XCI than subject to XCI on average, but this difference failed to reach significance in this study.

Fig. 6

Enrichment of elements which may be related to XCI status. a The number of repetitive elements of each class within 15 kb of each CpG island, sorted by XCI status. See Figure S7 for the repeat classes not shown here. b CTCF binding in overlapping 200-bp bins was predicted using a DanQ model [5: Table S4, along with the number of CpG islands or TSSs per XCI status in each species used for each analysis

We compared CTCF-binding signal between genes found escaping vs subject to XCI across species. For this, we predicted the probability of CTCF binding across species by using a DanQ model [6b, Additional file 5: Table S4). All of the species with significant differences had more CTCF-binding signal near genes escaping XCI. We also examined whether there were significant regions in the TCEANC to GEMIN8 cluster of discordant genes which correlated with a change in XCI status across species, but did not find any differences consistent across species (Additional file 6: Table S5).

ATAC-seq is an assay for accessible chromatin [42]. Comparing ATAC-seq signal 250 bp up and downstream of TSSs across species revealed significant differences in the mean female/male ratio across genes that were escaping vs subject to XCI in human, mouse and pig but not in cow or goat (Fig. 6c, Additional file 5: Table S4). ATAC-seq signal had a higher female/male ratio in genes escaping XCI than genes subject to XCI, as seen previously in human [43], and the same trend existed in species where the differences failed to reach significance. In the species with significant differences in ATAC-seq signal with XCI status, we did not see all tissues showing significant differences (Additional file 1: Figure S9). The differences were significant in the only tissue examined in human, two of the three examined in pig, and one out of ten examined in mouse.

Across all species examined, mouse genes appeared uniquely well-silenced. We clustered all species based on their XCI status calls (Additional file 1: Figure S10). The bovids (cow, sheep and goat) as a group clustered together, although mouse clusters with them for an unknown reason. Dog has very sparse data which may explain it clustering as an outlier, but we are unsure of the reason why pig clustered with dog instead of with the more closely related bovids. We observed clear separation of the primates from most other species due to the large number of primate-specific escape genes.


Escape from XCI is an important contributor to sex differences in expression and has even been argued to underlie a male predisposition to cancer [17, 28]. In addition, genes subject to XCI can also have unique effects on phenotype, with some mutations having phenotypic effects only when separate cell populations are expressing two different alleles [44, 45]. Mutations that are deleterious at the cellular level or affect the region controlling choice of ** [47]. Knowing the XCI status of genes is also important for estimating the effect of an X-linked allele in genome- or epigenome-wide association studies [48, 49] and is important for genetic selection of X-linked genes in agriculture [29].

To validate our use of DNAme to call XCI status, we compared expression-based calls with DNAme in human and mouse. The human ** from XCI [15]. As cancer samples were used to allow ** in one consortium and subject to XCI in another (Additional file 7: Table S6). Our study was further limited by the need for heterozygous polymorphisms; thus, with only eight samples, any mis-regulation may not have been noticeable, or may have led to false or missed calls of variable escape from XCI. Our human DNAme calls were 94% (WGBS) and 91% (450k array) concordant with previous XCI calls, and the two datasets analyzed here gave calls that were 97% concordant with each other. Of the few XCI status calls that were inconsistent with previous studies, 80% were in genes called as variably escaping from XCI, and are likely due to differences in the population or tissues sampled. While our mouse ** XCI, but there were differences in which genes were informative [26].

In this study, we have made an average of 342 XCI status calls per species, for 12 different species. The proportion of genes subject to XCI differs, with most species having 80–90% of genes subject to XCI. The only species with more genes subject to XCI is mouse at 95%, and the only species with fewer was horse at 76%. Additionally, horse had elevated numbers of genes variably escaping from XCI (10%), while other species only had 0–5% of genes variably escaping from XCI. A meta-analysis in human found 8% of genes variably escaping from XCI and a further 7% as varying between studies [15], while our current study identified 6% variable escape in human by expression and only 2% by DNAme. Our study is consistent with a previous study using DNAme to make XCI status calls that did not see many genes consistently variably escaping from XCI [23]. Of the genes previously predicted to variably escape from XCI [15], 69% had no data in this study due to lack of a CpG island and another 10% were hypermethylated in males or females and therefore XCI status could not be determined.

Our DNAme analysis found that human genes subject to XCI have promoter CpG DNAme between 38% (in WGBS) and 41% (in 450k array analysis), which agrees with a previous analysis using the 450k DNAme array which showed genes subject to XCI having an average DNAme around 40% [23] (Table 1). Mouse had a lower 27% DNAme average for genes subject to XCI; other mouse studies have not examined genes which are subject to XCI. Other species had DNAme averages in a range between human and mouse, but most were closer to human than mouse. Our DNAme thresholds to call genes as escaping from or subject to XCI were consistent across human and mouse WGBS, but as our data were from different studies using different techniques on different tissues in different species, there may be variation unaccounted for with our thresholds. However, WGBS and 450k array-based XCI status calls were consistent in both human and chimp and, with a few notable exceptions, genes had concordant XCI status calls across species. Past studies of XCI status calls using DNAme in human did not see many differences in DNAme-based XCI status across tissues [23], so different tissues analyzed may not cause many discordances. Having male DNAme as a control and an upper threshold for calling genes as subject to XCI should reduce the chance of calling a gene as subject to XCI if it is instead silenced on both copies of the X in a tissue-specific manner. For the primate and dog samples which used the human 450k DNAme array, only probes which mapped consistently between the species were kept by the source publications [50, 51], and so these species may be enriched for genes with a conserved XCI status. Utilizing datasets from different studies confounds the species differences with other experimental differences including sample size as well as inclusion of male samples. The lack of male samples in some species prohibited us from filtering out genes that are methylated on the Xa and therefore would never be seen to escape XCI by DNAme.

Many of the genes escaping from XCI have previously been seen grouped in domains [37], and here we see these domains conserved across species. Furthermore, we see that many of the genes that change XCI status across species are clustered into domains and many of these domains coincide with TADs in human. These domains suggest escape from XCI may be regulated at a domain level; however, we also see some genes being regulated individually and even separate TSSs for the same gene can have opposite XCI statuses. Individual escape genes are often discordant in a few species. Coincidence of changes in XCI status with loss of Y homology emphasizes the importance of dosage for determining genes whose escape from XCI is vital to survival. Generally, the TSS is seen to be conserved, even when a gene changes XCI status. Previous studies have suggested that CTCF and YY1 may be enriched near genes escaping from XCI [16, 53, 54]. CTCF has also been seen enriched at boundaries between domains of genes with opposite XCI statuses [56]. Repeat elements (SINEs for genes escaping XCI and LINEs for genes subject to XCI) have also been seen enriched in 100-kb windows around TSSs as well as windows 15 kb upstream [39, 52].

Our XCI status calls across species also allow us to check conservation of elements that may control XCI. A region escaping XCI in human was still able to escape from XCI when inserted at a mouse region which is normally subject to XCI, showing that the mechanisms controlling escape from XCI are conserved and functional across species [55]. We suspect that any elements found to be important in human or mouse research will be conserved across species with the same XCI status; having a variety of mammalian species with XCI status calls gives us a platform to test this hypothesis.

We compared DNA repeats and CpG island characteristics with XCI status within and across species and found none varied significantly across species per discordant gene, few varied between XCI statuses within a species and none varied between XCI statuses in all species. Previous studies have examined enrichment of repetitive elements across differently sized regions ranging from 15 to 100 kb. The enrichment closer to the promoter may reflect gene-specific control, whereas enrichment across a broader range suggests regulation at the level of domains. These studies have seen enrichment of LINE and LTR MLT1K repeats at genes subject to XCI and SINE and MER33 repeats at genes escaping from XCI [39, 52]. Here, with a window of 15 kb, we replicated the enrichment for LINE repeats, with SINE repeats failing to reach significance and LTR and DNA repeats (which MLT1K and MER33 belong to) showing the opposite trend of previous studies. However, no element was consistently found across all species. We also predicted CTCF binding and observed that some species have more CTCF-binding signal around genes escaping XCI than genes subject to XCI, as has been seen previously [16, 53, 54]. ATAC-seq signal, which has previously been seen enriched at genes escaping XCI, was also seen enriched here, but again, only in some species [43]. A deeper bioinformatic analysis comparing our XCI status calls to features which differ across species with differing XCI status but are conserved in species with conserved XCI status might identify important regulatory features which control the XCI status of nearby genes or control XCI in general.

These XCI status calls may be improved in the future through new techniques such as single-cell RNA-seq (scRNA-seq) which can make expression-based XCI status calls without the need for samples with skewed ** from XCI [25]. These numbers are higher than other reports of escape, likely due to many of these genes variably escaping from XCI and only escaping from XCI in brain.

Improved gene and genome annotations in some of the less well-studied species would enhance our XCI status calls across species. Many of the species examined here had their gene annotations generated bioinformatically using CESAR [59] mapping of human genes instead of being annotated with mRNA from that species. This may not have captured the correct TSS, and if transcription was no longer close to the same CpG island these XCI status calls would be invalid. With better annotations in the future, these datasets could be reprocessed to provide more up-to-date XCI status calls with improved confidence.

As mouse has considerably fewer genes escaping from XCI than other species, there may be a better species to use as a model for research related to which genes escape from XCI. Unfortunately, none of the species other than mouse examined here are small or make affordable model systems. Rabbit, for which there was no DNAme data available, has been shown to be more similar to human than mouse in aspects of XCI and may be a good species for further examination [1].


Our study has created reference XCI status calls for 12 species, so that labs working with diverse mammalian species will have improved understanding of how their genes of interest are expressed in their species of interest. We have again confirmed that mouse has substantially fewer genes escaping from XCI than human, and shown that other mammals are more similar to human in this regard. Additionally, we have shown conservation of XCI status across the majority of X-linked genes and highlighted some genes of interest which are discordant across species. Interestingly, many of these discordant genes occur in domains of similarly regulated genes. In the future, we hope to use these XCI status calls to identify elements which are controlling escape from XCI and which are conserved across species, and these discordant genes are ideal candidate regions to investigate.


** Technologies. These data are from cancer samples, and because cancer has a clonal origin, we anticipated they would show skewing of XCI. Eight of the samples had skewed ** XCI, but most mouse studies do not call genes which are subject to XCI, so they were reanalyzed here.

The different species were processed differently due to different starting file types. The human data were pre-aligned, starting as DNA VCF files and RNA bam files. The DNA VCF files were indexed and then filtered to only heterozygous SNPs in exons using the bcftools view tool [60]. A BCF file was made for the expression data using samtools mpileup with the -t DP,AD options, followed by bcftools filter to filter for depth 30 or higher [61]. The RNA BCF file was then indexed, bcftools call was used to find indels and bcftools view was used to filter for calls with quality 30 or higher. In mouse, the data were available as fastq files and were aligned using the MEA pipeline [62]. The resulting unnormalized bigWig files were then quantified at known polymorphisms to determine the number of reads on the ** and subject to XCI and not giving an XCI status for genes that cross this threshold with their error rates.

SNPs were mapped to the splice variants that include the SNP, and the closest TSS of these was used to connect DNAme and ** XCI and islands with between 15 and 60% DNAme being called as subject to XCI. Islands for which over half of males had 15% DNAme or higher were discarded as having male hypermethylation and being uninformative. The mean DNAme within each sex was also calculated and compared per CpG island. The lack of mapped TSSs within each species precluded robust examination of non-CpG island promoter regions, as the exact location of each TSS was uncertain.
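The two DNAme rules that survive in the text above (the 15 to 60% range for subject calls and the male hypermethylation filter) could be expressed as in the following sketch. The function names are hypothetical, the bounds are treated as inclusive by assumption, and applying the rule to a single female mean per island is also an assumption. Thresholds for other XCI status categories are not shown because they are not recoverable from the passage.

```python
# Hypothetical helper names; only the 15%, 60% and "over half of males"
# values come from the text. Inclusive bounds are an assumption.

SUBJECT_MIN = 15.0  # percent DNAme
SUBJECT_MAX = 60.0  # percent DNAme
MALE_CUTOFF = 15.0  # percent DNAme


def male_hypermethylated(male_dname):
    """True if over half of the male samples have DNAme at or above the cutoff."""
    n_high = sum(1 for value in male_dname if value >= MALE_CUTOFF)
    return n_high > len(male_dname) / 2


def call_island(female_dname, male_dname):
    """Return a status for one CpG island from percent-DNAme values."""
    if male_hypermethylated(male_dname):
        return "uninformative"
    if SUBJECT_MIN <= female_dname <= SUBJECT_MAX:
        return "subject"
    # Other categories are outside the scope of this sketch.
    return "not called in this sketch"
```

For example, `call_island(40.0, [2, 3, 5, 20])` returns `"subject"`, whereas `call_island(40.0, [20, 30, 5])` returns `"uninformative"` because two of the three males are at or above 15%.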

For datasets generated on the human 450k DNAme array, data were downloaded and filtered for promoter-associated probes. The mean DNAme of probes sharing an annotated CpG island was matched to their annotated genes, and this was used for making XCI status calls as above.


XCI calls per species were transformed into numeric values, with escape as 0, variable escape as 0.5 and subject to XCI as 1. The daisy function from the cluster package in R was used to compute distances with the Gower metric, and hclust with the complete method was then used to perform the clustering. The phylogenetic tree was generated using the online Interactive Tree of Life tool [67].
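As a rough illustration of this encoding and clustering, the sketch below uses plain Python instead of the R functions named above. It implements a simplified Gower-style distance for numeric data (range-scaled absolute differences averaged over genes called in both species) and a naive complete-linkage agglomeration. How missing calls and zero-range genes are handled here is an assumption and may differ from daisy, and the species and calls are toy placeholders.

```python
from itertools import combinations

# Numeric encoding stated in the text.
ENCODING = {"escape": 0.0, "variable": 0.5, "subject": 1.0}


def column_ranges(rows):
    """Range (max - min) of each gene across species, ignoring missing calls."""
    ranges = []
    for column in zip(*rows):
        present = [v for v in column if v is not None]
        ranges.append(max(present) - min(present) if present else 0.0)
    return ranges


def gower_distance(x, y, ranges):
    """Mean range-scaled absolute difference over genes called in both rows."""
    total, n = 0.0, 0
    for a, b, r in zip(x, y, ranges):
        if a is None or b is None:
            continue  # assumption: skip genes missing in either species
        n += 1
        if r > 0:  # assumption: zero-range genes contribute no distance
            total += abs(a - b) / r
    return total / n if n else float("nan")


def complete_linkage(names, dist):
    """Repeatedly merge the two clusters whose farthest members are closest."""
    def cluster_dist(pair):
        c1, c2 = pair
        return max(dist[frozenset((a, b))] for a in c1 for b in c2)

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=cluster_dist)
        merges.append((c1, c2, cluster_dist((c1, c2))))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# Toy calls (placeholders, not study results); None marks a missing call.
calls = {
    "speciesA": ["subject", "escape", "subject"],
    "speciesB": ["subject", "escape", "variable"],
    "speciesC": ["escape", "subject", None],
}
encoded = {s: [ENCODING.get(c) for c in row] for s, row in calls.items()}
ranges = column_ranges(list(encoded.values()))
dist = {
    frozenset((a, b)): gower_distance(encoded[a], encoded[b], ranges)
    for a, b in combinations(encoded, 2)
}
merges = complete_linkage(list(encoded), dist)
```

In this toy input, speciesA and speciesB differ only at the third gene, so they merge first; speciesC joins last at the maximum distance of 1.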

Conservation analysis

R was used to collect and match all the XCI status calls across species. Genes were matched based on their name, controlling only for capitalization changes across species. Genes with XCI status calls in four or more species were included in further analysis. Datasets analyzed were split into two different groups: all mammals (human, chimp, mouse, cow, pig, sheep, and goat WGBS data, with horse RRBS and dog 450k array data) and primates (human, chimp, bonobo, gorilla and orangutan 450k array data). The two separate groups allowed us to examine conservation of genes without our analyses being biased toward primate-specific calls.
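The matching step can be sketched as follows, in Python for illustration although the analysis itself was done in R. Gene names are compared after upper-casing, which is one simple way to control for capitalization, and genes are kept only if they have calls in four or more species. The species labels, gene names and statuses are placeholders rather than study results.

```python
from collections import defaultdict

MIN_SPECIES = 4  # genes need calls in four or more species (from the text)


def match_calls(calls_by_species):
    """Merge per-species calls by case-insensitive gene name and filter by coverage."""
    merged = defaultdict(dict)
    for species, calls in calls_by_species.items():
        for gene, status in calls.items():
            merged[gene.upper()][species] = status
    return {gene: calls for gene, calls in merged.items() if len(calls) >= MIN_SPECIES}


# Placeholder input: "GeneA" appears under different capitalizations.
calls_by_species = {
    "species1": {"GENEA": "subject", "GENEB": "escape"},
    "species2": {"Genea": "subject", "Geneb": "escape"},
    "species3": {"GeneA": "escape"},
    "species4": {"genea": "subject"},
    "species5": {"GENEC": "subject"},
}
matched = match_calls(calls_by_species)
```

Here only `GENEA` is retained, because it has calls in four species, while `GENEB` and `GENEC` fall below the cutoff.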

Statistical tests

Statistical tests comparing enrichment of CpG island statistics and various repeat classes between genes subject to or escaping from XCI were done using R. We used a t-test with the Benjamini–Hochberg method for multiple testing correction [68].
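For readers unfamiliar with the correction, the Benjamini–Hochberg adjustment can be written in a few lines. This is a generic sketch of the standard procedure in Python, not the R code used in the study, and the input p-values are arbitrary illustrative numbers.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Arbitrary example p-values (not from the study).
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

Each p-value is scaled by the number of tests divided by its rank, and the running minimum ensures that an adjusted value never exceeds the adjusted value of a larger raw p-value.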

Domain analysis

Domains were identified based on the conservation calls above and examined using the UCSC browser to compare the arrangement of genes. TAD boundaries were taken from Dixon, 2012 [36]; a boundary was annotated to a gene if it lay within the gene body or between that gene and the next gene. Additionally, to confirm that UBA1 TSSs were within the same TAD, we used a larger set of TADs in the 3D genome browser [69].

ATAC-seq analysis

ATAC-seq data were downloaded (see Additional file 4: Table S3 for data sources). If bigwig files were available they were used; if not, we downloaded the raw data and aligned them using HISAT2 [70]. The bamCoverage tool from the deepTools package [71] was used to generate bigwig files (normalized using RPKM), and bigWigAverageOverBed from the UCSC utilities was used to determine the mean coverage in the 250 bp upstream and downstream of each TSS. Each TSS was matched to the closest CpG island within 2 kb, and any XCI status call from that island was used for the TSS.
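The final matching step, assigning each TSS the closest CpG island within 2 kb, might look like the sketch below. The island records and function names are hypothetical, all features are assumed to be on the same chromosome, and measuring distance to the nearest island edge (zero if the TSS lies inside the island) is an assumption about how closeness was defined.

```python
MAX_DIST = 2000  # 2 kb limit from the text


def distance_to_island(tss, start, end):
    """Distance from a TSS to the nearest island edge; 0 if the TSS is inside."""
    if start <= tss <= end:
        return 0
    return start - tss if tss < start else tss - end


def closest_island(tss, islands):
    """Return the nearest island within MAX_DIST of the TSS, or None."""
    best = None
    for island in islands:
        d = distance_to_island(tss, island["start"], island["end"])
        if d <= MAX_DIST and (best is None or d < best[0]):
            best = (d, island)
    return best[1] if best else None


# Toy islands with placeholder calls (not study data).
islands = [
    {"name": "cgi1", "start": 1000, "end": 1500, "xci": "subject"},
    {"name": "cgi2", "start": 5000, "end": 5600, "xci": "escape"},
]
hit = closest_island(1800, islands)
status = hit["xci"] if hit else None
```

A TSS at position 8000 would receive no call in this example, since the nearest island ends 2.4 kb away.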

CTCF predictions

CTCF binding was predicted using a strand-specific DanQ model [ To quantify CTCF-binding signal per TSS, we counted the number of bins with a predicted probability over 50% of being a CTCF-bound region within 4 kb of each TSS. For our analysis of the TCEANC to GEMIN8 region, we counted the number of bins with over 50% probability of CTCF within each region.
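The per-TSS count can be sketched as below. The bin representation (half-open `(start, end, probability)` tuples) and the function name are hypothetical, and counting any bin that overlaps the window of 4 kb either side of the TSS is an assumption about how "within 4 kb" was applied. The model predictions themselves are taken as given.

```python
def count_ctcf_bins(tss, bins, window=4000, threshold=0.5):
    """Count bins overlapping [tss - window, tss + window) with probability > threshold."""
    lo, hi = tss - window, tss + window
    return sum(
        1
        for start, end, prob in bins
        if prob > threshold and start < hi and end > lo
    )


# Toy predictions (made-up probabilities for illustration).
bins = [
    (500, 1500, 0.9),      # overlaps the window, high probability
    (3000, 4000, 0.2),     # inside the window, low probability
    (6000, 7000, 0.7),     # inside the window, high probability
    (12000, 13000, 0.95),  # outside the window
]
n_bound = count_ctcf_bins(5000, bins)
```

The same function can be reused for a fixed region, such as the regional counts described above, by passing the region midpoint and half-width in place of the TSS and window.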