Introduction

Cell identity is determined by the precise execution of lineage-specific gene expression programs1. These programs are controlled by coordinated signals from regulatory DNA sequences, transcription factors, histone modifications and variants, and 3D genome organization. The role of RNAs in modulating these programs is increasingly appreciated2,3. Many classes of RNAs bind chromatin; we collectively term them chromatin-associated RNAs (caRNAs). These include long non-coding RNAs (lncRNAs)4,5, heterogeneous nuclear RNAs (hnRNAs)6,7, enhancer RNAs (eRNAs)8,9,10, transposable element (TE)-derived RNAs11,12,13,14, and other chromatin-enriched RNAs (cheRNAs)15,16. Yet the function of these RNAs on chromatin remains largely unknown.

LncRNAs can orchestrate complex regulatory circuits, exemplified by XIST, which acts as a core regulator of X-chromosome inactivation19. In addition to lncRNAs, other classes of caRNAs have genome regulatory functions. For example, eRNAs can affect the expression of neighboring genes through modulation of RNA Pol II elongation20,21 or recruitment of transcriptional coregulators22,23. Nascent pre-mRNAs can interact with chromatin-binding proteins and locally regulate chromatin compaction6,24, and TE-derived RNAs can silence immune response genes and hamper T-cell effector functions25. Furthermore, many proteins involved in controlling chromatin state27,28,29,30 and topology23,31 have RNA-binding activity, suggesting additional roles for caRNAs in chromatin regulation. Despite these examples, which caRNAs have gene regulatory roles, and by what mechanisms they act, remain to be determined32.

With the exception of a small number of caRNAs, we do not know the genomic loci where these RNAs act. As a result, we do not understand the network of interactions between caRNAs and genes or its complexity. Transcription of both lncRNAs33,34 and regulatory elements9,35,36,37 exhibits strong tissue specificity such that the ncRNA-gene interaction network is also likely cell-state dependent, although this remains to be experimentally tested. Characterization of the network of human caRNA-gene interactions at the full transcriptome scale represents an important goal25,38,39,40,41.

Here, we used chromatin-associated RNA sequencing (ChAR-seq) to map the RNA-chromatin interactome in H9 embryonic stem cells and definitive endoderm42,43,44. From these data, we characterize the global architecture of this interactome, present a predictive model for most RNA-DNA chromatin interactions, and identify RNAs deviating from this model. We generate a detailed caRNA-gene interaction network that defines the set of caRNAs that interact with each gene based on physical proximity. These interactions encompass lncRNAs and many unannotated intergenic RNAs that may help prioritize specific caRNAs for future functional validation. Through analysis of the dynamics of the interactome during differentiation, we find that regulation of gene expression by individual caRNAs is very rare.

Results

To detect and map caRNA interactions with the genome, we performed ChAR-seq42,43,44, a proximity-ligation method that captures and sequences RNA-DNA contacts genome-wide (Fig. 1a). We performed ChAR-seq in human H9 embryonic stem cells (ES) before and after differentiation into definitive endoderm (DE) to understand how changes in the caRNA-chromatin interaction network might relate to activation or repression of cell state-specific genes. We validated our cell differentiation system by qPCR against cell-state marker genes and immunostaining, which revealed pure (>99%) ES and DE cell populations (Supplementary Fig. 1a, b, Supplementary Data 10)45.

Fig. 1: Global mapping of RNA-chromatin interactions during stem-cell differentiation.

a Schematic of the strategy used to map RNA-DNA contacts across the transcriptome and genome using ChAR-seq, highlighting the key steps of the workflow. b, c Composition of the caRNAs identified by ChAR-seq compared to the total RNA population determined by total RNA sequencing. d Scatter plots showing the chromatin association scores for individual RNAs originating from annotated exons as a function of the RNA level in the caRNA population. Chromatin-enriched and depleted RNAs were determined using DESeq2 (FDR 0.05, fold change threshold 3x). Pie charts summarize the fraction of chromatin-enriched and chromatin-depleted RNA in each functional RNA type. The numbers within each pie chart indicate the total number of RNAs in that category. e RNA-DNA contact maps in ES and DE cells for the top 200 most abundant caRNAs (according to their mean expression in ES and DE cells) on Chr7 and Chr8. Maps are displayed at a resolution of 1 RNA per row and 1 Mbp of genome space per column. Color represents contact density, defined as the number of contacts between an RNA and a genomic bin, normalized for sequencing-depth and size of the genomic bin (CPKM: Contacts Per Kb in target genomic region per Million reads). Contacts made by exonic and intronic RNAs are shown in the left and right maps, respectively. f Interaction profiles along the genome for SOX17, PVT1, MALAT1 and XIST exons, and for SOX17, PVT1 and SLC26A3 introns, illustrating 3 major classes of interaction profiles: RNAs localized predominantly near their transcription locus (SOX17, PVT1 exons and introns), spreading across a single chromosome (XIST), and across the genome (MALAT1, SLC26A3 introns).

We sequenced ChAR-seq libraries to obtain over 900 million reads per cell state. We computationally split each read into uniquely mapping RNA-derived and DNA-derived sequences (Supplementary Note 1, Supplementary Figs. 2 and 3) and thereby obtained nearly 200 million unique RNA-DNA contacts (Supplementary Fig. 1c).

We first analyzed the global composition of the caRNA population and found that caRNAs were enriched for non-coding RNAs, including introns, long non-coding RNAs (lncRNAs) and other functionally heterogeneous non-coding RNAs (referred to here as ncRNAs) such as small nuclear RNAs (snRNAs) and small nucleolar RNAs (snoRNAs; Fig. 1c, Supplementary Fig. 1d), consistent with previous studies4,46,47,48. We normalized the caRNA population to expression levels by assigning each RNA a chromatin association score, defined as its relative abundance in the ChAR-seq versus total RNA-seq data (“Methods”). We found that nearly all introns and half of all non-coding RNAs had over 3-fold enrichment on chromatin, in agreement with prior characterizations of caRNA16,49, indicating that ncRNAs tend to have nuclear or chromatin localization (Fig. 1d, Supplementary Fig. 1e, Supplementary Data 2). LncRNAs are considered potential chromatin regulatory RNAs3,50, yet our data indicate that non-intronic regions of lncRNAs constitute approximately 3% of the caRNA population and less than 1% when excluding the top 10 most abundant lncRNAs. This result prompted us to perform a broad analysis of RNA-DNA interactions, including all caRNAs, rather than focus exclusively on lncRNAs.
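As a rough illustration of the chromatin association score, the sketch below computes a log2 ratio of depth-normalized abundance in ChAR-seq versus total RNA-seq. It is a simplified stand-in and not the DESeq2-based procedure described in “Methods”; the function names, the pseudocount, and the input layout (one dictionary of raw counts per assay) are assumptions made for illustration.

```python
import math

def fpm(counts):
    """Convert raw counts per RNA into fragments per million (FPM)."""
    total = sum(counts.values())
    return {rna: 1e6 * c / total for rna, c in counts.items()}

def chromatin_association_scores(char_counts, total_counts, pseudocount=0.5):
    """Hypothetical score: log2(FPM in ChAR-seq / FPM in total RNA-seq).

    Only RNAs present in both assays are scored. The pseudocount is an
    assumption that keeps the ratio finite for lowly expressed RNAs.
    """
    char_fpm = fpm(char_counts)
    total_fpm = fpm(total_counts)
    scores = {}
    for rna in char_fpm.keys() & total_fpm.keys():
        ratio = (char_fpm[rna] + pseudocount) / (total_fpm[rna] + pseudocount)
        scores[rna] = math.log2(ratio)
    return scores
```

Under this convention, a score above log2(3) would correspond to the 3-fold chromatin enrichment threshold used in the text.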

To compare the chromatin association patterns of exon- and intron-derived RNAs, we generated RNA-DNA contact maps for exons and introns (Fig. 1e). Our RNA-DNA contact maps were highly reproducible (Supplementary Fig. 1f) and showed high correlation between replicates and lower correlation between cell states, indicating that the interactome is dynamic during differentiation (Supplementary Fig. 1g). Across exons and introns, we uncovered several features of the RNA-DNA interactome mirroring those described in our prior work on Drosophila melanogaster and by others43,49,51,52,53. First, we noted a higher density of intrachromosomal compared to interchromosomal RNA-DNA contacts, reminiscent of the properties observed at the DNA level by Hi-C54, reflecting the chromatin organization into chromosome territories55. Second, most RNA-DNA contacts occur close to the RNA transcription locus, with, on average, ~100-fold lower contact density 50–100 kb away from the transcription locus than at the locus itself (Supplementary Fig. 1h). Finally, we observed three classes of RNA-chromatin association patterns (Fig. 1f): (1) RNAs localizing predominantly at or near their transcription locus; (2) RNAs localizing across the genome, as previously observed52,56; and (3) RNAs such as XIST57 localizing across a single chromosome. We confirmed by RNA fluorescence in situ hybridization microscopy that the nuclear localization of select RNAs from these classes was consistent with their classification by ChAR-seq (Supplementary Fig. 4, Supplementary Data 10) and with previous studies classifying non-coding RNAs by in situ hybridization58,59,60,61,62. Altogether, these RNA-chromatin interactomes identify numerous RNAs in different functional classes that dynamically reorganize depending on cell state and demonstrate that most caRNAs remain associated with chromatin near their sites of synthesis.
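The contact density displayed in the maps of Fig. 1e is defined in the figure legend as CPKM (Contacts Per Kb in target genomic region per Million reads). A minimal sketch of that normalization follows; the function and parameter names are hypothetical, and the choice of the total number of contacts in the sample as the depth term is an assumption.

```python
def cpkm(contacts, bin_size_bp, total_contacts):
    """Contacts Per Kb of target genomic region per Million reads.

    contacts       -- contacts between one RNA and one genomic bin
    bin_size_bp    -- size of the genomic bin in base pairs
    total_contacts -- sequencing depth used for normalization (assumed
                      here to be the total number of contacts in the sample)
    """
    bin_size_kb = bin_size_bp / 1e3
    depth_millions = total_contacts / 1e6
    return contacts / bin_size_kb / depth_millions
```

For example, 50 contacts in a 1 Mbp bin of a sample with 200 million contacts would give 50 / 1000 / 200 CPKM.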

ChAR-seq identifies previously unannotated RNAs that bind chromatin dependent on cell state

We identified previously unannotated RNAs that did not overlap with any known genes (as of Gencode v29) in 14% of all RNA-DNA contacts, a proportion similar to that of exons for annotated RNAs (Fig. 1b). To characterize the nature of these unannotated transcripts, we used the StringTie de novo transcriptome assembler to identify individual transcription units (Fig. 2a)63. We uncovered 30,442 loci with significant expression in ES or DE cells (FPM > 0.1), which we hereafter refer to as unannotated transcribed loci (UTLs) (Supplementary Fig. 5b, Supplementary Data 1, Supplementary Data 3). Thus, the number of identified UTLs exceeds the number of known transcripts expressed at similar levels (22,475). We found that UTLs originated from functionally diverse chromatin loci (Fig. 2b). (1) Some UTLs were immediately continuous with the 3’ end of active genes (e.g., UTL69162) and were possibly the result of transcriptional readthrough, as reported in prior studies64,65. (2) Some UTLs overlapped with regulatory signals, such as high ATAC-seq or H3K27ac levels (e.g., UTL69163). (3) Some UTLs overlapped with TEs (e.g., UTL69657), in agreement with prior studies showing that TEs are a source of RNAs that are associated with chromatin11,12,25. (4) Finally, some UTLs did not have any of the above features but had sequence similarity with known transfer RNAs (tRNAs), snRNAs and other small RNAs66. Guided by these observations, we classified the UTLs based on their proximity to the 3’ or 5’ ends of genes, their overlap with transposable elements, snRNAs, or tRNAs, and their overlap with cis-regulatory elements annotated in the Encode Registry of Regulatory Elements67, yielding seven categories of unannotated RNAs (“Methods”, Supplementary Data 3). Approximately 32% of the reads coming from UTLs were classified as readthrough RNAs and ~27% as cis-regulatory element-derived (Fig. 2c). Over 60% of the CRE-derived RNAs were from enhancer elements (Supplementary Fig. 5a). 
Four percent of the UTL reads were repeat-derived transcripts, roughly evenly distributed between LTR, SINE, and LINE elements (Fig. 2c, Supplementary Fig. 5a). Overall, the expression levels of UTLs were low, but similar to those of lncRNAs (Supplementary Fig. 5c).

Fig. 2: Cell-state-specific unannotated RNAs make up a large fraction of the caRNAs.

a Schematic of the method used to catalog unannotated RNAs by identifying transcription units using StringTie2. b Genome tracks showing the chromatin context of 3 representative unannotated transcribed loci (UTLs). Left panel: UTL69162 and UTL69163, respectively downstream of and antisense to RB1CC1, are classified as readthrough and CRE-derived RNAs, respectively. Right panel: UTL69657 is classified as a repeat-derived RNA due to its overlap with a LINE element. In both left and right panels, the top 2 tracks display the strand-specific genome coverage of the RNA-derived side of the ChAR-seq reads in ES and DE replicate 1 (+ strand ES in dark blue, − strand ES in light blue, + strand DE in dark yellow, − strand DE in light yellow). The next two tracks display the strand-specific genome coverage of the total RNA-seq data. c Relative composition of the chromatin-associated UTLs in the 7 annotation classes. d Scatter plots showing the chromatin association scores for individual UTLs and their abundance in the caRNA population. Chromatin-enriched and depleted UTLs were determined using DESeq2 (FDR 0.05, fold change threshold 3x). Pie charts summarize the fraction of chromatin-enriched and chromatin-depleted UTLs in each category. Numbers within each pie chart indicate the total number of RNAs in that category. e Percentage of genes upregulated and downregulated in DE vs ES cells in the caRNA transcriptome and for each RNA category. Up- and downregulated RNAs were identified using DESeq2 (FDR 0.05, fold change threshold 3x). f RNA-DNA contact maps in ES and DE cells for the top 200 most abundant UTLs on Chr7 and Chr8, displayed at a resolution of 1 RNA per row and 1 Mbp of genome space per column. g Genome-scale chromatin interaction profiles of 4 UTLs showing similar localization patterns as annotated RNAs.

Although these RNAs were present in the total RNA population, we found that all categories of UTLs were enriched on chromatin (Fig. 2d, Supplementary Data 2) and were highly cell-state-specific, with 15–49% of UTLs up- or downregulated in the caRNA and total RNA populations compared to only ~12% for mRNAs and lncRNAs (Fig. 2e). We examined the cell-state specificity and chromatin localization of two UTLs by fluorescence in situ hybridization and found that their localization was consistent with their ChAR-seq signal (Supplementary Fig. 5d, Supplementary Data 10). We generated RNA-DNA contact maps specifically for UTLs, which showed patterns similar to those observed for exonic and intronic RNAs (Fig. 2f). We found both UTLs that were locally restricted near their locus and UTLs that spread across the whole genome (Fig. 2g). This result prompted us to perform a broad analysis of all RNA-DNA interactions, including all caRNAs.

RNA-DNA interactome dynamics are driven by caRNA transcription dynamics rather than relocalization of caRNAs

We next quantified the dynamics of the RNA-chromatin interactome during ES-DE cell differentiation. To identify cell-state dependent interactions, we binned the DNA contacts of each RNA into 100 kb or 1 Mb intervals and performed a quantitative analysis analogous to differential expression analysis to obtain the fold change of each contact in ES versus DE cells and its associated statistical significance (“Methods”). We filtered the data to only include contacts with at least 10 counts in at least two samples and tested ~100,000 exon-chromatin contacts, ~300,000 UTL-chromatin contacts, and 1.6 million intron-chromatin contacts (all at 100 kb resolution) for differential representation in ES vs DE cells (Supplementary Fig. 6a). The corresponding maps are shown in Fig. 3a. While we observed few dynamic RNA-chromatin interactions far from the RNA transcription locus (TL) in the exon and UTL maps, zooming in on a 10 Mb window around each RNA TL at 100 kb resolution revealed widespread changes in the interactome for all categories of RNAs. At 100 kb resolution, ~2% of interactions involving exons and ~7% of interactions involving introns were up- or downregulated in DE versus ES cells (Fig. 3b). More substantial changes were observed at a lower resolution of 1 Mb per genomic bin (Supplementary Fig. 6b). Consistent with the high cell state specificity of UTL expression discussed previously, UTLs also had the most dynamic RNA-DNA contact maps, with very low correlation between the ES and DE contact maps (Fig. 3b, Supplementary Fig. 6c).
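The binning and filtering step described above can be sketched as follows. This is an illustrative outline only: the input layout (one list of `(rna, chrom, pos)` tuples per sample) and the function names are assumptions, and the differential test itself (performed with DESeq2 in this work) is not reproduced here.

```python
from collections import Counter

def bin_contacts(contacts, bin_size=100_000):
    """Count contacts per (RNA, chromosome, genomic bin) for one sample.

    contacts -- iterable of (rna, chrom, pos) tuples, pos in base pairs
    """
    return Counter((rna, chrom, pos // bin_size) for rna, chrom, pos in contacts)

def filter_contacts(binned_by_sample, min_count=10, min_samples=2):
    """Keep contacts with at least min_count counts in at least min_samples samples.

    binned_by_sample -- dict mapping sample name to the Counter from bin_contacts
    Returns a dict mapping each retained contact to its per-sample counts.
    """
    keys = set().union(*binned_by_sample.values())
    kept = {}
    for key in keys:
        counts = {sample: c.get(key, 0) for sample, c in binned_by_sample.items()}
        if sum(v >= min_count for v in counts.values()) >= min_samples:
            kept[key] = counts
    return kept
```

The retained count table would then be the input to a differential analysis; switching `bin_size` to 1,000,000 gives the lower-resolution 1 Mb view.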

Fig. 3: The RNA-DNA interactome dynamics are controlled at the transcription level.

a Differential contact maps showing the changes in the RNA-DNA interactome on Chr8 and Chr11 during cellular differentiation for the same top 200 most abundant exonic RNAs, intronic RNAs, and UTLs as those shown in Figs. 1e and 2f. For each RNA category, the left map shows the log2 fold change (LFC) in the frequency of each RNA-DNA contact, as computed by DESeq2 (shrunken LFC estimates, see “Methods”). x-axis resolution is 1 Mb as in Figs. 1e and 2f. The right map shows a zoom-in of the left differential map in a 10 Mb window centered at the Transcription Locus (TL) of each caRNA and displayed with an x-axis resolution of 100 kb. b Quantification by RNA class of the percentage of interactions upregulated in DE or ES cells among all interactions tested in that class (interactions with >10 counts in at least one replicate in ES or DE) at 100 kb resolution (bottom panel). c Schematic of 3 models that can explain changes in the DNA contact profile of an RNA during differentiation. d Scatter plot showing the chromatin association score for individual lncRNA exons (left panel) and UTLs (right panel) in ES versus DE cells. All of the caRNAs with an expression level above 0.1 FPM in both ES and DE cells are shown. Pie charts summarize, for each RNA class, the fraction of RNAs with significantly higher chromatin association in ES or DE cells (fold change >3, FDR 0.05). Numbers within the pie charts indicate the total number of RNAs in that class (FPM > 0.1) and the number of RNAs with differential chromatin association. e Differential contact maps observed versus those explained by transcription dynamics only for the 50 most abundant lncRNAs (left) and UTLs (right) on ChrX. Labeled genes are the top 12 most abundant genes. x-axis resolution is 100 kb and a 10 Mb window centered around each RNA TL is shown.

The interactome dynamics during differentiation may be driven by three non-mutually exclusive effects (Fig. 3c). First, an RNA may increase or decrease in overall abundance, resulting in proportionally increased or decreased binding levels on chromatin. Second, an RNA may modulate its affinity for chromatin, for instance, through RNA modifications or through changes in affinity with RNA-binding proteins mediating its interaction with chromatin. Third, an RNA may relocalize from one genomic site to another. The first two modes of dynamics would result in similar binding profiles in ES vs DE cells, albeit with an overall scale shift in binding levels. In contrast, the third mode implies changes in the RNA-binding pattern to chromatin.

To test these models, we first compared the chromatin association score of each RNA in ES versus DE cells. Remarkably, the chromatin association scores remained mostly unchanged during differentiation, particularly for lncRNAs, with only 35 lncRNAs showing evidence of changes in their chromatin affinity (Fig. 3d, left panel, Supplementary Data 2). Surprisingly, a larger fraction of UTLs, when compared to annotated non-coding RNAs (~8% of CRE-derived UTLs and ~5% of intergenic and antisense UTLs), showed significant changes in their chromatin association score between ES and DE cells (Fig. 3d, right panel). Thus, while individual RNAs show different propensities for chromatin interaction, this propensity does not change during differentiation and seems to be a property of the RNA itself. This result rules out model 2 for the majority of caRNAs.

Next, we examined whether the dynamics of specific interactions between an RNA and a chromatin locus can be explained by the transcriptional dynamics of the RNA itself. We compared the true differential contact maps to differential contact maps that would be observed if the frequency of each RNA-DNA contact was proportional to the total abundance of the corresponding RNAs in the caRNA population (“Methods”). These two differential interaction maps were highly similar (Fig. 3e). We further quantified the differences between these maps by identifying specific RNA-DNA contacts whose frequency changes between ES vs DE cells at a greater level than explained by the changes in RNA expression (“Methods”). We found no such contacts in the exon-DNA interactome and a negligible number of them in the UTL-DNA interactome (Supplementary Fig. 6d). Thus, the bulk of the changes in the RNA-DNA interactome appear to rely on transcription level regulation and expression differences in ES vs DE, rather than on modulation of an RNA’s affinity for chromatin or changes in an RNA’s contacts to different DNA binding sites.
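The comparison above can be pictured with a simple calculation: if each contact scaled only with the abundance of its RNA, the log2 fold change (LFC) of the contact would equal the LFC of the RNA itself, so the difference between the two isolates changes not explained by expression. The sketch below illustrates this idea under stated assumptions; it uses plain pseudocount-stabilized ratios on depth-normalized values rather than the DESeq2-based test described in “Methods”, and all names are hypothetical.

```python
import math

def log2_fold_change(de_value, es_value, pseudocount=1.0):
    """LFC of DE over ES, with a pseudocount to avoid division by zero."""
    return math.log2((de_value + pseudocount) / (es_value + pseudocount))

def residual_lfc(contact_de, contact_es, rna_total_de, rna_total_es):
    """Observed contact LFC minus the LFC expected from RNA abundance alone.

    Inputs are assumed to be depth-normalized. A residual near zero means
    the contact simply follows the abundance of its RNA; a large residual
    would suggest relocalization or a change in chromatin affinity.
    """
    observed = log2_fold_change(contact_de, contact_es)
    expected = log2_fold_change(rna_total_de, rna_total_es)
    return observed - expected
```

A contact that rises 4-fold while its RNA also rises 4-fold would give a residual close to zero, consistent with the transcription-driven model.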

A select number of RNAs interact broadly with the genome

We hypothesized that the dynamic RNA-DNA interactome contains a mixture of (1) functional interactions linked to the regulatory activity of the RNA on chromatin and (2) coincidental interactions due to transient proximity of the RNA to chromatin, for instance, during nascent transcription or diffusion within the nucleus. We thus analyzed the contact patterns of individual RNAs to detect features consistent with functional interaction, beginning with features at the chromosome scale. The nuclear speckle-associated lncRNA, MALAT1, and the XIST RNA are two well-studied lncRNAs that act to regulate gene expression broadly across the genome or throughout the X chromosome56,62,68. Yet, it is not known which other RNAs have similar widespread interaction patterns on chromatin.

To systematically identify all RNAs with genome- or chromosome-wide associations, which we termed type I and type II RNAs (Fig. 4a), respectively, we developed two metrics, a trans-delocalization and a cis-delocalization score (Fig. 4b and “Methods”). The trans-delocalization score quantifies the tendency for an RNA to be found on chromosomes other than its source chromosome. Similarly, the cis-delocalization score assesses the tendency for an RNA to spread far (over 10 Mb away) from its locus on its source chromosome. To account for expression, chromosome of origin and sample biases, these scores were calibrated using mRNAs as a reference (“Methods”, Supplementary Note 2, Supplementary Fig. 7). We reasoned that type I RNAs must have high trans- and cis-delocalization scores, while type II RNAs must have a high cis-delocalization score but a low trans-delocalization score. Thus, although other patterns may yield high delocalization scores (e.g., an RNA that targets a single locus on a trans-chromosome may have a large trans-delocalization score), we can use these metrics to screen for candidate RNAs with type I and type II patterns. We found that lncRNAs with large trans-delocalization scores (Fig. 4e, left panel) included MALAT1, the pTEFb-associated RNA, 7SK, and the telomerase RNA component, TERC, which all have established genome-wide chromatin regulatory functions, thus validating our approach69,70,71.
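To make the two metrics concrete, the sketch below computes uncalibrated versions of both scores as log2 ratios of contact counts, following the schematic definition in Fig. 4b. The calibration against mRNAs described in “Methods” and Supplementary Note 2 is deliberately omitted, and the pseudocount, function name, and input layout are assumptions for illustration.

```python
import math

def delocalization_scores(contacts, source_chrom, tl_pos,
                          cis_window=10_000_000, pseudocount=1.0):
    """Return (trans_score, cis_score) for one RNA, without calibration.

    contacts     -- iterable of (chrom, pos) DNA contact positions
    source_chrom -- chromosome carrying the RNA transcription locus (TL)
    tl_pos       -- position of the TL in base pairs
    trans_score  -- log2(trans contacts / cis contacts)
    cis_score    -- log2(cis contacts beyond cis_window / cis contacts within it)
    """
    trans = cis_far = cis_near = 0
    for chrom, pos in contacts:
        if chrom != source_chrom:
            trans += 1
        elif abs(pos - tl_pos) > cis_window:
            cis_far += 1
        else:
            cis_near += 1
    cis_total = cis_far + cis_near
    trans_score = math.log2((trans + pseudocount) / (cis_total + pseudocount))
    cis_score = math.log2((cis_far + pseudocount) / (cis_near + pseudocount))
    return trans_score, cis_score
```

In this simplified form, a type I candidate would show two high scores, whereas a type II candidate such as XIST would combine a high cis score with a low trans score.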

Fig. 4: A select population of caRNAs interacts with the genome broadly.

a Schematic of the two types of binding patterns identified in this analysis: type I RNAs localized across the genome (trans-delocalized RNAs), type II RNAs localized throughout their source chromosome but absent on other chromosomes (cis-delocalized RNAs). b Schematic definition of the trans- and cis-delocalization scores. The trans-delocalization score quantifies the number of DNA contacts an RNA makes on chromosomes other than its source chromosome (trans-contacts) relative to the number of contacts on its source chromosome (cis-contacts). The cis-delocalization score quantifies the number of DNA contacts an RNA makes over 10 Mb away from its transcription locus (TL) relative to the number of contacts within 10 Mb of its TL. c Distribution of trans- (left) and cis- (right) delocalization scores (geometric mean over 2 independent replicates per cell state) and by class of RNA for exons (n = 23,436 RNAs) and UTLs (n = 19,069 RNAs). Error bars represent the median and 25–75% quartiles. d Fraction of RNAs within each class identified as either delocalized or ultralocalized in regard to its trans- (left) or cis-chromosomal contacts (right). e List of all lncRNAs identified as cis or trans-delocalized in either ES or DE cells and candidate RNAs for type I or type II patterns. Heat maps show the RNA cis and trans-delocalization scores in ES and DE cells and their abundance in the caRNA population. f Chromatin interaction profiles for two examples of trans-delocalized RNAs (RMRP, VTRNA1-1), one example of cis-delocalized RNAs (AP000915.2), and one non-delocalized RNA (CASC15). The yellow track shows the observed ChAR-seq signal. The gray track shows the predicted interaction profile based on the generative model with trans-contact rate prediction, as described in Fig. 5 and Supplementary Note 4. g Scatter plot showing the cis- versus trans-delocalization score for individual lncRNAs in ES cells (left) and UTLs in DE cells (right, excludes tRNA-derived and snRNA-derived UTLs). Colored data points indicate RNAs classified as delocalized (in either cis or trans), ultralocalized (in both cis and trans), and RNAs with XIST-like behavior. The black line shows the linear regression output.

We found that functionally distinct classes of RNAs had different distributions of delocalization scores (Fig. 4c, Supplementary Data 4, Supplementary Data 8, Supplementary Data 9). LncRNAs had a wide range of delocalization scores, with a distribution of scores that mirrored those of mRNAs. In contrast, snRNAs, snoRNAs, tRNA-derived and snRNA-derived UTLs had globally high cis- and trans-delocalization scores, indicating that RNAs in these classes interact with loci throughout their source chromosome and across the whole genome. We observed the opposite behavior for CRE-derived RNAs and, to an even greater extent, for readthrough RNAs, which had mostly negative cis- and trans-delocalization scores, demonstrating that these RNAs tend to remain near their locus of origin. We also noted a negative-shifted distribution of delocalization scores for introns of both mRNAs and lncRNAs (Supplementary Fig. 8a). In ES cells, for ~77% of individual lncRNAs and 96% of individual mRNAs, the trans-delocalization scores of their introns were lower than those of their exons (Supplementary Fig. 8b). Thus, introns tend to remain in closer proximity to their source locus.

Interestingly, repeat-derived RNAs had globally high cis- and trans-delocalization scores in ES cells and low cis- and trans-delocalization scores in DE cells (Fig. 4c). Thus, in ES cells specifically, many repeat-derived RNAs tend to localize away from their transcription locus. To identify RNAs with extreme delocalization scores, we applied an empirical Bayes method using mRNAs as a training set, which essentially identified RNAs in the 5% right-tail or the 5% left-tail of the mRNA score distribution (“Methods”, Supplementary Note 3). We thus created a complete catalog of RNAs with candidate chromosome- or genome-wide association patterns and another catalog of RNAs that remain localized within a 10 Mb window around their transcription locus or on their own chromosome, which we termed ultralocalized RNAs (from a cis- or trans-chromosomal perspective, Supplementary Data 5). As expected, >50% of snRNAs, snoRNAs, and tRNA-derived and snRNA-derived UTLs were classified as trans-delocalized and >70% of readthrough RNAs were classified as ultralocalized (Fig. 4d). Surprisingly, out of 1289 lncRNAs above 1 FPM with sufficient signal to compute delocalization scores (“Methods”), we detected only 22 lncRNAs (1.7%) with cis- or trans-delocalized patterns in either ES or DE cells (Fig. 4d, Supplementary Fig. 8c). In contrast, we found (excluding tRNA-derived and snRNA-derived UTLs) 60 UTLs in DE cells and 836 UTLs in ES cells with cis- or trans-delocalization patterns, including 349 repeat-derived RNAs and several hundred intergenic or CRE-derived UTLs (Supplementary Fig. 8c). The lncRNAs we characterized contained the known broadly acting RNAs discussed above.
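The text notes that the empirical Bayes procedure essentially flags RNAs falling in the 5% tails of the mRNA score distribution. The sketch below implements only that simplified tail-cutoff reading, using nearest-rank quantiles of a reference set of mRNA scores; it is not the empirical Bayes method of Supplementary Note 3, and the function names and labels are hypothetical.

```python
import math

def tail_thresholds(reference_scores, tail=0.05):
    """Nearest-rank lower and upper cutoffs of the reference distribution.

    reference_scores -- delocalization scores of the reference RNAs (mRNAs)
    tail             -- fraction of the distribution in each tail
    """
    ordered = sorted(reference_scores)
    n = len(ordered)
    low = ordered[math.floor(tail * (n - 1))]
    high = ordered[math.ceil((1 - tail) * (n - 1))]
    return low, high

def classify(score, low, high):
    """Label an RNA by where its score falls relative to the reference tails."""
    if score > high:
        return "delocalized"
    if score < low:
        return "ultralocalized"
    return "neutral"
```

Applying `classify` separately to the cis and trans scores would yield the two perspectives (cis- or trans-chromosomal) mentioned above.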

Importantly, we discovered candidate lncRNAs with potential genome-wide regulatory functions, including the mitochondrial RNA processing endoribonuclease RNA, RMRP, which is implicated in rRNA maturation41,72,73, the Ribonuclease P RNA Component H1, RPPH1, which is involved in tRNA processing74,75, two isoforms of the Vault RNA, VTRNA1-1 and VTRNA1-3, and a large number of UTLs. We validated the delocalization score analysis by directly examining the ChAR-seq signal of these RNAs, which revealed their association across the genome (Fig. 4f). The delocalization of these RNAs was not explained by their abundance. Although MALAT1, 7SK, and RMRP were highly abundant, other delocalized RNAs were all below 10 FPM. Furthermore, many abundant ncRNAs had low delocalization scores (Supplementary Fig. 8d). To confirm that the broad patterns detected by our delocalization score approach were not random or due to non-specific interactions, we performed metagene analysis centered on select genomic features. We detected enrichment of snRNAs at RNAPII occupancy loci (Supplementary Fig. 8e), where MALAT1 and 7SK were also enriched, consistent with the role of these RNAs in cotranscriptional splicing and transcriptional elongation62,69. In contrast, VTRNA1-1 was found at background levels at RNAPII-occupied loci, and RMRP was depleted at these loci. Together, our data show that broadly localized RNAs are rare among annotated lncRNAs, but we discovered a large repertoire of UTLs with potential global chromatin regulatory roles, specifically in ES cells.

While our characterized RNAs were identified as significantly delocalized in cis but not in trans, we noted that among these RNAs, all but XIST also had a high trans-delocalization score, albeit below the FDR threshold for classification as trans-delocalized. Generally, across all RNAs, the cis- and trans-delocalization scores were strongly correlated, indicating that RNAs that localize broadly on their own chromosomes also interact broadly with the rest of the genome (Fig. 4g). Remarkably, XIST was the only exception to this rule and was the only RNA which was simultaneously delocalized in cis and ultralocalized in trans, consistent with its known localization throughout its source chromosome X (Fig. 4g). We concluded that XIST is unique in these cell types in its ability to interact with an entire chromosome while being excluded from other chromosomes.

We next examined changes in RNA delocalization in different cell states. We found that the delocalization scores were highly correlated between ES and DE cells, even for RNAs that were differentially abundant across cell states (Supplementary Fig. 8f). We thus concluded that the extent to which an RNA interacts with chromatin far from its transcription locus or on trans chromosomes is encoded in the RNA itself or the position of its transcription locus relative to other genomic features, rather than post-transcriptionally regulated.

RNA-DNA contacts occur in the vicinity of the transcription locus

Engreitz et al. proposed a dichotomization of RNA-chromatin interactions into proximity-driven and affinity-driven interactions2. The former describes interactions occurring in a 2D or 3D distance-bounded region around the transcription locus without specificity for particular loci within that region. The latter describes RNA targeting well-defined loci, irrespective of their distance to the RNA locus. Some ncRNAs have been proposed to have affinity-driven interactions and regulate transcription or 3D organization of chromatin at their target loci3,76,77,78. These data motivated us to search the interactome for contact patterns in which an RNA shows discrete peaks in its localization profile that are not explained by proximity to its locus (Fig. 5a, top panel, hereafter referred to as Type III patterns). Because standard genomic peak finding tools like MACS279 are not appropriate for ChAR-seq data, we instead developed a generative model, which predicts the RNA-DNA interactome based on three features: (1) the total abundance of each RNA on chromatin, (2) a DNA-locus bias which models the propensity for an RNA to be captured at this locus, independently of the identity of that RNA, and (3) the distance between each RNA transcription site and its DNA target loci (Fig. 5b, “Methods” and Supplementary Note 4). As anticipated, the DNA-locus bias correlated with ATAC-seq, likely due to a combination of biological factors such as fewer RNA-DNA interactions existing in compact chromatin and technical biases related to the accessibility of the ChAR-seq bridge molecule. The DNA-locus bias also correlated with nuclear speckle proximity as measured by TSA-seq80, revealing a possible increased affinity for diffusing RNAs towards nuclear speckles. We trained our generative model on mRNAs, as we reasoned that most mRNAs should not have defined chromatin targets. We then used our final model to generate a predicted contact pattern for each RNA, which effectively provides a null hypothesis representing neutral patterns, where an RNA interacts exclusively and non-specifically with neighboring loci due to diffusion (Fig. 5a, model track). Thus, positive deviations from the prediction (more contacts in the observed data compared with the model prediction) provide evidence for peak-like interactions in type III patterns.
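
As an illustration only, the three components of such a model can be combined multiplicatively in a few lines of Python. This is a minimal sketch and not the fitted model described in the Methods: the function name `predict_contacts`, the power-law distance decay, its exponent, and the `min_dist` floor are all assumptions made for the example, and only contacts within a single chromosome are considered.

```python
import numpy as np

def predict_contacts(abundance, dna_bias, rna_pos, bin_pos,
                     decay_exponent=1.0, min_dist=1e4):
    """Expected contact counts under a neutral model, shape (n_rnas, n_bins).

    abundance : (n_rnas,) total chromatin-associated counts of each RNA
    dna_bias  : (n_bins,) RNA-independent capture propensity of each DNA bin
    rna_pos   : (n_rnas,) coordinate of each RNA's transcription site (bp)
    bin_pos   : (n_bins,) coordinate of each DNA bin center (bp)
    """
    abundance = np.asarray(abundance, dtype=float)
    dna_bias = np.asarray(dna_bias, dtype=float)
    rna_pos = np.asarray(rna_pos, dtype=float)
    bin_pos = np.asarray(bin_pos, dtype=float)

    # Linear genomic distance between every RNA locus and every DNA bin.
    dist = np.abs(rna_pos[:, None] - bin_pos[None, :])
    # ASSUMPTION: power-law decay, floored at min_dist to avoid dividing by zero.
    decay = np.maximum(dist, min_dist) ** (-decay_exponent)

    # Combine the distance term with the locus bias, then rescale each row
    # so that an RNA's predicted contacts sum to its chromatin abundance.
    weights = decay * dna_bias[None, :]
    weights /= weights.sum(axis=1, keepdims=True)
    return abundance[:, None] * weights

# Toy example: ten 100 kb bins, two RNAs transcribed at 250 kb and 750 kb.
bin_pos = np.arange(50_000, 1_000_000, 100_000)
pred = predict_contacts([1000, 50], np.ones(10), [250_000, 750_000], bin_pos)
```

In this sketch, each row of `pred` plays the role of the model track in Fig. 5a: it peaks at the transcription locus and decays with distance, so any observed peak elsewhere would appear as a positive deviation.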

Fig. 5: RNA expression and genomic distance determine the RNA-DNA interactome.

a Schematic of the type of binding patterns identified in this analysis. An RNA may localize at one or more discrete loci distinct from its transcription site (Pattern type III, top track) or remain in a diffusion-constrained region around its locus (neutral RNA, bottom track). b Components of the generative model used to predict the ChAR-seq maps. The number of contacts observed for an RNA at a DNA locus is proportional to (1) an RNA-DNA distance-dependent contact frequency, (2) the abundance of the RNA on chromatin, (3) a target locus-dependent bias (DNA-bias, yellow track), which correlates with both ATAC-seq signal (purple track) and nuclear speckle proximity signal (TSA-seq, red track). c Example of a type III pattern with a candidate affinity-driven interaction for the lncRNA JPX in DE cells. The observed and predicted localization of JPX (top two tracks) at 10 kb resolution are compared using DESeq2, yielding a log2 fold change (LFC; observed over model) and an adjusted p-value track (bottom two tracks). Interactions with an LFC greater than 1.3 and an adjusted p-value smaller than 0.05 are labeled as candidate affinity-driven interactions. d Observed contact maps, predicted contact maps, and observed over model LFC maps computed using DESeq2 for the top 200 most abundant RNAs originating from exons (top), introns (middle) and UTLs (bottom). x-axis resolution is 100 kb per bin; y-axis resolution is 1 RNA per bin. Only interactions with at least 10 counts in at least two samples were tested for differences with the model and are shown in the LFC maps. e Number of interactions tested for enrichment over model and proportion of identified candidate affinity-driven interactions by RNA class, in relation to the total number of tested interactions in that RNA class. f Distribution of the RNA-DNA travel distance for interactions significantly above model (n = 33,653 interactions). Error bars represent the median and 25–75% quartiles. The RNA-DNA travel distance is calculated using the mapping coordinates of the RNA and DNA side of the ChAR-seq read (“Methods”).

In both ES and DE cells and for exons, introns, and UTLs, our simple generative model produced RNA-DNA contact maps highly similar to experimentally generated ChAR-seq RNA-DNA contact maps (Fig. 5d, Supplementary Fig. 9a). At 100 kb DNA locus resolution and excluding RNAs previously identified as cis- or trans-delocalized, we identified only ~0.2% of exon and ~0.7% of intron contacts that were not explained by the model, irrespective of whether the RNAs were mRNAs, lncRNAs, or ncRNAs (Fig. 5e and Supplementary Fig. 9b, c). We detected only 11 and 9 lncRNAs in ES and DE cells, respectively, with exons making contacts in the genome at loci not predicted by our model (Supplementary Data 6). Our model also accurately predicted changes in contact rates during differentiation (Supplementary Fig. 9d). Thus, in contrast with prior studies76,77,78, we found no evidence for type III patterns, where individual RNAs target distinct loci away from their transcription site among the entire lncRNA population.
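
The selection criteria stated in the Fig. 5 legend (at least 10 counts in at least two samples to be tested; LFC above 1.3 and adjusted p-value below 0.05 to be called) can be sketched as a simple filter. The snippet assumes that per-interaction LFC and adjusted p-values have already been computed elsewhere, for example by DESeq2; the function name `call_candidates` and the array layout are hypothetical.

```python
import numpy as np

def call_candidates(counts, lfc, padj, lfc_min=1.3, padj_max=0.05,
                    min_count=10, min_samples=2):
    """Flag tested interactions and candidate affinity-driven interactions.

    counts : (n_interactions, n_samples) observed counts per sample
    lfc    : (n_interactions,) log2 fold change, observed over model
    padj   : (n_interactions,) adjusted p-value (NaN is treated as not significant)
    """
    counts = np.asarray(counts)
    lfc = np.asarray(lfc, dtype=float)
    padj = np.asarray(padj, dtype=float)

    # An interaction is tested only if enough samples reach the count floor.
    tested = (counts >= min_count).sum(axis=1) >= min_samples
    # Candidates exceed the model prediction in both effect size and significance.
    candidate = tested & (lfc > lfc_min) & (padj < padj_max)
    return tested, candidate

# Toy example with four interactions and two samples.
tested, candidate = call_candidates(
    counts=[[12, 15], [3, 40], [20, 11], [30, 25]],
    lfc=[2.0, 3.0, 0.5, 1.5],
    padj=[0.01, 0.001, 0.01, 0.2],
)
```

The proportion reported per RNA class in Fig. 5e would then correspond to `candidate.sum() / tested.sum()` computed within each class.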

Interestingly, in contrast with that of lncRNAs, the interactome of the UTLs differed more substantially from its prediction. Over 1% of contacts involving 2283 distinct RNAs in ES cells and 2597 in DE cells showed statistical evidence for affinity-driven interactions (Fig. 5e). Readthrough RNAs had the largest number of such contacts followed by CRE-derived RNAs (Supplementary Fig. 9c). This result suggests that many unannotated RNAs, in particular regulatory element-derived RNAs, engage in genomic contacts that cannot be explained by a diffusion process around the transcription locus.

To better understand the nature of these contacts, we examined how far from the RNA transcription locus these contacts occurred (Fig. 5f). We found that most of the significant contacts made by UTLs occurred within 100 kb of their locus (51% of all contacts), particularly for readthrough RNAs, which made over 69% of their contacts within 100 kb of their locus (Supplementary Fig. 9e). In contrast, introns of annotated RNAs showed deviations from the predicted patterns at larger distances. Indeed, only 17% of contacts from introns that were not predicted by the model occurred within 100 kb of their locus, whereas 88% occurred between 100 kb and 10 Mb. The difference in distances between RNA loci and their significant DNA contacts between annotated intron RNAs and unannotated RNAs suggests different types of interactions might be regulating RNA spread across chromosomes. Because these length scales are reminiscent of those involved in genome organization at the levels of TADs and A/B compartments81,82,83, we examined the relationship between the RNA localization patterns and the 3D organization of the genome.
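
For readers who wish to reproduce this kind of summary, the fraction of contacts falling in each distance range can be computed directly from the per-contact travel distances. The sketch below is illustrative; the function name `distance_fractions` and the convention that bin edges are inclusive on the lower side are assumptions.

```python
import numpy as np

def distance_fractions(distances_bp, near=100_000, far=10_000_000):
    """Fraction of contacts within `near`, between `near` and `far`, and beyond `far`."""
    d = np.asarray(distances_bp, dtype=float)
    return {
        "within_near": float(np.mean(d < near)),
        "near_to_far": float(np.mean((d >= near) & (d < far))),
        "beyond_far": float(np.mean(d >= far)),
    }

# Toy example: five contacts at 5 kb, 50 kb, 200 kb, 3 Mb and 50 Mb.
fractions = distance_fractions([5e3, 5e4, 2e5, 3e6, 5e7])
```

Because the three ranges are mutually exclusive and exhaustive, the returned fractions always sum to one.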

The 3D genome organization enables contacts between RNAs and distal chromatin loci

To examine how the 3D organization of the genome affects the localization patterns of individual RNAs on chromatin, we focused on a small ~50 kb TAD on chr4q25, which is nested inside a larger 100 kb TAD (Fig. 6a). Two genes are located at the inner boundary of the small and large TADs: AC106864, an uncharacterized lncRNA, and the LARP7 gene, which is antisense to AC106864 and is highly transcribed in ES cells. We examined the binding profile of AC106864 on chr4 and found that most of the contacts of this RNA were within a few kb of its locus. We also observed two side peaks, labeled L1 and L2, that coincided with the other edge of the small and large TAD. In contrast, our generative model predicted a small peak at L1 (likely due to the high accessibility of this locus as revealed by ATAC-seq) and no signal at L2. The fold difference signal of the observed data over the model confirmed that the two peaks at L1 and L2 were not explained by simple diffusion of AC106864 or by accessibility biases. Interestingly, Hi-C data showed two corner peaks characteristic of a chromatin loop linking the LARP7 locus with both L1 and L2. This result suggests that AC106864 localization at L1 and L2 might be mediated by the chromatin loop. It is also possible that AC106864 targets these loci through other mechanisms, such as base pairing or association with RBPs, that are independent of genome folding. Yet this biochemically targeted interaction is unlikely given that the introns of the overlapping mRNA LARP7 also have contact peaks at L1 and L2. Together, these data suggest that TAD organization influences the contact patterns of RNAs and that chromatin looping enables distal RNA-DNA interactions.

Fig. 6: The 3D genome organization enables long-distance RNA-DNA contacts.

a Example of long-range RNA-DNA contacts across a chromatin loop at the LARP7 & AC106864 locus in ES cells. ICE normalized Hi-C map (2 kb resolution) is shown at the top. Transcription of LARP7 (expressed from the positive strand) and AC106864 (expressed from the negative strand, shown as negative values) are detected by ChAR-seq (top 2 tracks). The observed (dark orange) and predicted localization pattern (dark gray) of AC106864 on chromatin are shown with the log fold difference between observed and predicted (purple). The observed and predicted localization patterns for LARP7 are shown in light orange and light gray. ATAC-seq, H3K27ac and H3K4me3 tracks are also shown and indicate that L2 has enhancer-like chromatin properties. b Comparison between ChAR-seq and Hi-C at the chromosome scale. Dashed boxes highlight two example regions where the A/B compartments plaid pattern is clearly visible in both Hi-C and ChAR-seq maps.

This observation prompted us to ask whether the larger-scale topological organization of the chromosome also influences RNA-DNA contacts (Fig. 6b). ChAR-seq contact maps are naturally asymmetric in that the y-axis maps each row to an individual RNA and the x-axis maps each column to a genomic bin. To compare ChAR-seq to Hi-C data at the chromosome scale, we collapsed one dimension of the Hi-C maps into genes while keeping the other dimension as genomic bins. In these transformed Hi-C maps, each pixel represents the contact frequency between the gene and a cognate DNA bin. We detected in the ChAR-seq maps the same plaid pattern found in Hi-C data resulting from the 3D partitioning of the genome into two major compartments, the A and B compartments, also associated with active and inactive chromatin, respectively83. This pattern indicates that any individual caRNA tends to have a specific compartment (either A or B) with which it interacts preferentially. Equivalently, when one caRNA contacts a locus in, say, the A compartment, it has a higher likelihood of contacting other loci in the A compartment rather than in the B compartment. It was not surprising that this pattern was not produced by our generative model since only linear distance is encoded in the model. We concluded that A/B compartments also modulate the long-range interactions of individual RNAs with chromatin.
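
The transformation of a bin-by-bin Hi-C matrix into a gene-by-bin matrix can be sketched as follows. This is one plausible implementation rather than the procedure used for Fig. 6b: the function name `collapse_hic_rows_to_genes`, the gene tuple format, and the choice to sum (rather than average) the rows of all bins a gene overlaps are assumptions.

```python
import numpy as np

def collapse_hic_rows_to_genes(hic, bin_size, genes):
    """Collapse the row dimension of a Hi-C matrix into genes.

    hic      : (n_bins, n_bins) contact matrix for one chromosome
    bin_size : bin width in bp
    genes    : iterable of (name, start, end) in bp, end exclusive
    Returns gene names and a (n_genes, n_bins) gene-by-bin matrix.
    """
    hic = np.asarray(hic, dtype=float)
    n_bins = hic.shape[0]
    names, rows = [], []
    for name, start, end in genes:
        first = start // bin_size
        last = min((end - 1) // bin_size, n_bins - 1)
        # ASSUMPTION: a gene's profile is the sum of the Hi-C rows it overlaps.
        rows.append(hic[first:last + 1].sum(axis=0))
        names.append(name)
    return names, np.vstack(rows)

# Toy example: a symmetric 4 x 4 map with 100 bp bins and two hypothetical genes.
raw = np.arange(16).reshape(4, 4)
hic = raw + raw.T
names, gene_map = collapse_hic_rows_to_genes(
    hic, bin_size=100, genes=[("geneA", 0, 150), ("geneB", 300, 400)]
)
```

The resulting matrix has the same layout as a ChAR-seq map (one row per gene, one column per genomic bin), which is what makes a side-by-side comparison of the two data types possible.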

The caRNA-gene interactome preferentially links upregulated caRNAs to upregulated proximal genes

Our results point to a model where RNA-chromatin association patterns and their dynamics are restricted by (1) the caRNA expression level, (2) the genomic distance from the RNA locus to the DNA target and (3) the 3D chromatin topology. We wanted to determine whether this result is compatible with the hypothesis that ncRNAs participate in the regulation of cell-state-specific protein-coding genes. We reasoned that RNAs with transcriptional regulatory roles are likely to be found near their cognate gene, where they could modulate local chromatin state, TF binding, RNA polymerase, or the activity of gene-proximal regulatory elements. This colocalization hypothesis is consistent with the better-studied ncRNAs with gene regulatory activity, including XIST.

Consistent with the dynamics of the genome-wide RNA-DNA interactome (Fig. 3a–d), the caRNA-gene interactome of >1 million contacts was dynamic across differentiation. We detected most of the differential contacts at genes near the RNA locus (Fig. 7b). For lncRNAs only, we detected 340 differential contacts (~1% of all lncRNA-gene contacts), but these involved only 57 distinct lncRNAs, indicating that a typical single lncRNA differentially contacts multiple genes (Fig. 7c, d). The caRNA-gene interactome involving UTLs was more dynamic than that involving annotated RNAs, consistent with the global interactome dynamics, with up to 20% differential UTL-gene contacts between ES and DE (Fig. 7d).

To identify potential regulatory caRNAs and their putative gene targets, we classified each caRNA and each protein-coding gene as an ES, DE, or stable caRNA or gene based on its differential expression between ES and DE cells (FDR cutoff 0.05, fold change cutoff 3). We then examined the statistical associations between the class (ES/DE/stable) of a caRNA, its cognate gene, and their interaction. Figure 7e shows the top 20 most upregulated contacts involving a lncRNA along with the cognate lncRNA-gene pair. We noted that all the top 20 upregulated contacts in a given cell state involved ncRNAs upregulated in the same state. This result is consistent with our findings that the RNA-DNA interactome dynamics are globally driven by transcriptional dynamics. Yet most of the nearby genes for these differential contacts were not differentially expressed in ES vs DE, suggesting that changes in the caRNA levels at these genes do not affect their expression. Furthermore, the fold change in contact rate during the ES to DE transition correlated with the fold change of the expression of the source caRNA (Fig. 7f, left panel) but not with that of the contacting protein-coding gene (Fig. 7f, right panel).
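
The three-way classification can be written as a small decision rule using the stated cutoffs (FDR 0.05, fold change 3). The sketch below assumes that the fold change is expressed as log2(DE/ES), so that positive values indicate upregulation in DE cells; this sign convention and the function name `classify_state` are assumptions.

```python
import math

def classify_state(log2_fc_de_over_es, fdr, fc_cutoff=3.0, fdr_cutoff=0.05):
    """Label a caRNA, gene, or contact as 'ES', 'DE', or 'stable'."""
    significant = fdr < fdr_cutoff
    # The fold change cutoff applies in both directions on the log2 scale.
    large_effect = abs(log2_fc_de_over_es) >= math.log2(fc_cutoff)
    if significant and large_effect:
        return "DE" if log2_fc_de_over_es > 0 else "ES"
    return "stable"

# Toy examples: strong DE gain, strong ES gain, small change, non-significant change.
labels = [
    classify_state(2.0, 0.01),
    classify_state(-2.0, 0.01),
    classify_state(1.0, 0.01),
    classify_state(3.0, 0.2),
]
```

Applying the same rule to caRNAs, genes, and contacts yields labels that can be compared directly across the three.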

To further understand the relationship between gene expression and the presence of a caRNA in the PRR of a gene, we examined how many cell-state-specific contacts are made at cell-state-specific genes. This analysis revealed that >97% of cell-state-specific genes are not contacted by lncRNAs in a cell-state-specific manner (Fig. 7g, left panel). Interestingly, however, over 50% of these genes are contacted by at least one, and sometimes several, UTLs specifically in one cell state (and 15% by a CRE-derived RNA). In contrast, only ~25% of genes that are not cell-state-specific were contacted by cell-state-specific UTLs. Thus, most genes do not require cell-state-specific localization of a particular lncRNA in their PRR to alter their expression, but genes whose expression is altered are likely to be contacted by a UTL in a cell-state-specific manner. Together, our findings indicate that the presence of an individual ncRNA near the gene TSS does not correlate with the gene’s transcription. This result does not rule out a regulatory activity of ncRNAs at protein-coding genes. It remains possible that multiple inputs gate the target gene’s expression, including chromatin state, transcription factors, and possibly several RNAs, which could wash out average correlations between caRNA-gene interactions and gene transcription.

To identify patterns in the interactome that could reveal a regulatory structure, we compared the observed interactome dynamics to that which would be expected should it be independent of the gene expression dynamics (null model). We binned differential contacts in three categories: (1) positive edges, where the contact dynamics were positively correlated with the proximal gene dynamics (contacts that increased in ES to genes that increased in ES, or contacts that increased in DE to genes that increased in DE), (2) negative edges (contacts that increased in ES to genes that increased in DE, or contacts that increased in DE to genes that increased in ES), (3) neutral edges (contacts that increased in ES or DE to genes that were neither ES nor DE genes).
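
The three edge categories follow directly from the state labels of the contact and of the proximal gene, as the sketch below shows. The function name `classify_edge` and the string labels are hypothetical choices for the example.

```python
def classify_edge(contact_state, gene_state):
    """Return 'positive', 'negative', or 'neutral' for a differential contact.

    contact_state : 'ES' or 'DE', the state in which the contact increased
    gene_state    : 'ES', 'DE', or 'stable', the class of the proximal gene
    """
    if contact_state not in ("ES", "DE"):
        raise ValueError("only differential contacts (ES or DE) are classified")
    if gene_state not in ("ES", "DE", "stable"):
        raise ValueError("gene_state must be 'ES', 'DE', or 'stable'")
    if gene_state == "stable":
        return "neutral"
    return "positive" if contact_state == gene_state else "negative"

# Toy examples covering the three categories.
edges = [
    classify_edge("ES", "ES"),
    classify_edge("DE", "ES"),
    classify_edge("DE", "stable"),
]
```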

We found that across all categories of caRNAs, the interactome contained up to 1.8 times more positive edges (p-value < 0.05 by bootstrap) and up to 1.3 times fewer negative edges (p-value < 0.05 by bootstrap) than would be expected for a random interactome under the null model (Fig. 7h, i). Thus, we conclude that although specific RNAs are not the sole drivers of transcription activation or silencing at any gene, the architecture of the interactome is consistent with an overall positive regulation, where the presence of caRNAs is generally associated with higher expression of the contacted genes.
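
One way to build such a null distribution is to shuffle the gene-state labels across differential contacts, which breaks any association between contact dynamics and gene dynamics while preserving the number of contacts and genes in each class. The sketch below uses this label permutation as a stand-in and is not necessarily the bootstrap procedure used for Fig. 7h, i; the function name `positive_edge_enrichment`, the number of permutations, and the one-sided p-value definition are assumptions.

```python
import numpy as np

def positive_edge_enrichment(contact_states, gene_states, n_perm=2000, seed=0):
    """Compare the observed number of positive edges with a shuffled null.

    contact_states : sequence of 'ES' / 'DE' (state in which each contact increased)
    gene_states    : sequence of 'ES' / 'DE' / 'stable' for the contacted genes
    Returns (observed count, fold enrichment over null mean, one-sided p-value).
    Assumes at least some positive edges occur under the null (null mean > 0).
    """
    rng = np.random.default_rng(seed)
    contact_states = np.asarray(contact_states)
    gene_states = np.asarray(gene_states)

    # A positive edge is a contact and a gene that increase in the same state.
    observed = int(np.sum(contact_states == gene_states))

    null = np.empty(n_perm)
    for i in range(n_perm):
        # ASSUMPTION: independence is modeled by permuting gene labels.
        null[i] = np.sum(contact_states == rng.permutation(gene_states))

    fold = observed / null.mean()
    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, fold, p_value

# Toy example: 60 differential contacts, 40 of which are positive edges.
contacts = ["ES"] * 30 + ["DE"] * 30
genes = ["ES"] * 20 + ["stable"] * 10 + ["DE"] * 20 + ["stable"] * 10
observed, fold, p_value = positive_edge_enrichment(contacts, genes)
```

The same function applied to negative edges (replacing the equality test with a test for opposite states) would give the corresponding depletion estimate.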