Background

RNA can act as a carrier of information from the nucleus to the cytoplasm in the processing of protein-coding genes, as a regulatory molecule that can control gene expression, and even as an extracellular signal to coordinate trans-generational inheritance [1,2,3]. RNA binding proteins (RBPs) interact with RNA through a wide variety of primary sequence motifs and RNA structural elements to control all processing steps [3]. Furthermore, with the increase in the number of RBPs that are becoming associated with human diseases, identifying their RNA targets and how they are regulated has become an unmet, urgent need.

To identify direct RNA targets of RBPs, RNA immunoprecipitation (RIP) and crosslinking and immunoprecipitation (CLIP) methods are frequently used. CLIP-based methods utilize UV crosslinking to covalently link an RBP with its bound RNA in live cells, enabling both stringent immunoprecipitation washes and denaturing SDS-PAGE protein gel electrophoresis and nitrocellulose membrane transfer which serves to remove background unbound RNA [4]. Analyses of single RBP binding profiles by CLIP have provided unique insights into basic mechanisms of RNA processing, as well as identified downstream effectors that drive human diseases [5,6,7]. Further efforts to profile multiple human RBPs in the same family or regulatory function by CLIP illustrated coordinated and complex auto- and cross-regulatory interactions among RBPs and their targets [8,9,10]. Rising interest in organizing public deeply sequenced CLIP datasets to enable the community to extract novel RNA biology is apparent from newly available computational databases and integrative methods [11, 12]. However, methodological differences between CLIP approaches, combined with simple experimental variability between labs and variation in acceptable quality control metrics, add significant challenges to interpretation of differences observed.

The field of transcription regulation observed similar challenges and opportunities in integrating transcription factor target profiles [13]. To address this challenge, the ENCODE consortium piloted large-scale profiling of transcription factor targets using a single standardized chromatin immunoprecipitation (ChIP-seq) protocol [14]. The initial effort to profile 119 factors generated a unified dataset for creating and assaying robust quality assessment standards [15], and led to insights into modeling transcription factor complexes, binding modalities, and regulatory networks [16]. More critically, however, this has served as an invaluable resource for researchers to annotate potential functional variants [17] and generate hypotheses across a variety of fields of interest. This success suggested that a similar effort to profile RBP targets using a standardized methodology could similarly drive significant insights in RNA biology.

To this end, we introduced the enhanced CLIP (eCLIP) methodology featuring a size-matched input control [18] and characterized hundreds of immunoprecipitation-grade antibodies with a standardized workflow [19] to generate 223 eCLIP datasets profiling targets for 150 RBPs in K562 and HepG2 cell lines (https://www.encodeproject.org) [20].

Fig. 1

Two hundred twenty-three eCLIP datasets profile targets for 150 RNA binding proteins. a Colors indicate RBPs profiled by eCLIP, with manually annotated RBP functions, subcellular localization patterns from immunofluorescence imaging, and predicted RNA binding domains indicated (Additional file 1). b Schematic overview of eCLIP as performed in the datasets described here. Two biological replicates (defined as biosamples from separate cell thaws and crosslinked more than a week apart) were performed for each RBP, along with one size-matched input taken from one of the two biosamples prior to immunoprecipitation

Many CLIP methods included radioactive labeling of the 5′ end of RNA fragments with 32P to visualize protein-RNA complexes after SDS-PAGE electrophoresis and membrane transfer in order to query whether RNA bound to co-purified RBPs of different size is present [4]. However, the eCLIP protocol we utilized above did not include this direct visualization of protein-associated RNA due to the complexity of incorporating radioactive labeling at this scale, preferring validation of eCLIP signal with orthogonal approaches (such as comparison with in vitro-derived motifs or overlap with knockdown/RNA-seq changes). To address this question for future large-scale eCLIP profiling, we pursued alternative labeling approaches. We found that ligation of biotinylated cytidine (instead of the normal RNA adapter) enabled visualization similar to that observed with 32P while using commercially available chemiluminescent detection reagents for biotin-labeled nucleic acids (Additional file 3: Fig. S1a-c) [21]. We note that unlike 32P labeling (which is done as a 5′ phosphorylation reaction with T4 Polynucleotide Kinase), this labeling uses the standard eCLIP RNA adapter ligation reaction and thus may more accurately reflect true protein-coupled RNA positioning.

Surprisingly, when expanding this approach across RBPs, we observed detectable transfer of RNA from non-crosslinked cells to nitrocellulose membranes in a supplier-dependent manner (Additional file 3: Fig. S1d-f). We had previously noted that nitrocellulose membranes from certain suppliers contained greater amounts of RNA, which would then be recovered during library preparation (particularly in input libraries, which lack adapter addition prior to membrane transfer) [22]. However, we now observed that the membrane recommended from that effort (lower contaminant, membrane I) showed greater transfer of RNA than the membrane from our previous supplier (membrane G) (Additional file 3: Fig. S1d-f). The signal observed in crosslinked samples was typically substantially higher than in non-crosslinked samples (median 12.5-fold across 17 RBPs tested), with 88% (15 out of 17) of RBPs showing greater than 5-fold higher signal (Additional file 3: Fig. S1d). For the remaining 2 out of 17, however, RNA transfer in non-crosslinked samples was within 5-fold of the crosslinked signal (Additional file 3: Fig. S1d,f).

To directly query whether this led to artifactual eCLIP peak identification, we chose seven eCLIP experiments performed with membrane I and performed replicate experiments with membrane G. Using MATR3 as an example, we observed that peak fold-enrichment compared across membranes was similar to that observed for within-membrane replicates (Additional file 3: Fig. S1g). Extending this to all seven RBPs, only one (FXR2) out of seven showed notably lower replication of peak significance using membrane G (Additional file 3: Fig. S1h), and even in that case, we observed high overall correlation in peak fold-enrichment (Additional file 3: Fig. S1i). Conservation of signal was not limited to peak calls, as we observed similar enrichments for retrotransposable and other RNA elements as well (Additional file 3: Fig. S1j). Thus, although our data indicate that transfer of non-crosslinked RNA to nitrocellulose membranes is supplier- and product-dependent, such transfer does not generally appear to add significant background to the eCLIP profiles studied here.

Recovering RNA binding protein association to retrotransposons and other multicopy RNAs

Standard peak analysis revealed a wide variety of binding modes to mRNAs, with RBPs enriched for coding sequences, 3′ and 5′ untranslated regions, proximal and distal intronic regions, and non-coding RNAs (Additional file 3: Fig. S2a) [23]. We found that simply including non-uniquely mapped reads in standard analysis created thousands of peaks in introns, in intergenic regions, and at pseudogenes that typically lacked standard peak shapes (likely reflecting sequencing errors relative to the main expressed transcript), indicating the need for improved methods to properly quantify RBP binding to such loci.

In order to include these RNA types in eCLIP analysis, we developed a “family-aware mapping” approach in which adapter-trimmed reads are first mapped against a database of sequences for primary transcripts and pseudogenes for 82 families (Fig. 2a) (Additional file 4). Reads mapping to reference transcripts contained within a family (e.g., LINE, YRNA, or 18S rRNA) are used for quantitation, but reads that map to multiple families are masked (discarding an average of 1.1% of reads). These results are then integrated with standard unique genomic mapping in order to incorporate reads that uniquely map to regions annotated as repetitive elements by RepeatMasker [24] into the final family quantitation (Fig. 2a). Confirming the success of this approach, we observed that in eCLIP replicates of YRNA-associating factor TROVE2/RO60 in K562, only 3.7% and 6.8% (replicate 1 and 2, respectively) of usable reads uniquely mapped to YRNA transcripts with standard processing (2.9% and 5.1% to RNY1/2/4/5, with another 0.7% and 1.8% to YRNA pseudogenes) (Fig. 2b). In contrast, for these same datasets, 14.2% and 21.7% of reads mapped uniquely to the YRNA family using the family-aware mapping approach, making use of hundreds of thousands of additional reads that did not uniquely map to individual transcripts (Fig. 2b). Performing this analysis for all RBPs, we observed a wide range of read recovery and enrichment for particular elements (Fig. 2c, Additional file 5). For some RBPs such as RPS11 (K562), an average of 95.2% of reads were only recovered using family mapping (68.1% mapping to RNA18S with an additional 24.1% to RNA28S). In contrast, only 10.4% of reads in KHSRP (K562) eCLIP mapped to multicopy family elements, with 58.9% uniquely mapping to the genome (including 41.1% uniquely mapping to introns outside of RepeatMasker elements) (Fig. 2c).
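The family-level assignment rule described above can be illustrated with a minimal sketch. It is not the pipeline used to generate these data: the function names, the tuple layout of an alignment, and the use of a raw mismatch count as the score are all assumptions made for illustration. A read that hits several transcripts within one family is kept, and a read whose best score is tied between two or more families is masked.

```python
from collections import Counter

def assign_family(alignments):
    """Assign one read to an element family, or return None if it is masked.

    alignments: list of (transcript, family, mismatches) tuples for one read.
    This layout is a hypothetical stand-in for real aligner output.
    """
    # Collapse multiple transcripts of the same family to the best (lowest) score.
    best_per_family = {}
    for _transcript, family, mismatches in alignments:
        if family not in best_per_family or mismatches < best_per_family[family]:
            best_per_family[family] = mismatches
    if not best_per_family:
        return None
    best = min(best_per_family.values())
    winners = [fam for fam, score in best_per_family.items() if score == best]
    # A tie between different families means the read is discarded.
    return winners[0] if len(winners) == 1 else None

def quantify(reads):
    """Count reads per family and the number of masked reads."""
    counts = Counter()
    discarded = 0
    for alignments in reads:
        family = assign_family(alignments)
        if family is None:
            discarded += 1
        else:
            counts[family] += 1
    return counts, discarded

# Toy reads (invented for illustration, not real data).
reads = [
    [("RNY1", "YRNA", 0), ("RNY4", "YRNA", 0)],     # multi-transcript, one family: kept
    [("RNY5", "YRNA", 1), ("RNA18S", "RNA18S", 3)],  # clear best family: kept
    [("L1HS", "L1", 2), ("AluSx", "Alu", 2)],        # tie across families: masked
]
counts, discarded = quantify(reads)
print(dict(counts), discarded)
```

The first toy read is the case that standard unique mapping would lose: it matches two Y RNA transcripts equally well, yet it is unambiguous at the family level.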

Fig. 2

Quantification of repetitive elements and other non-uniquely mapped reads. a Graphical representation of repetitive element mapping. Reads are mapped to human genome (requiring unique mapping) and a database of repetitive element families. Reads are then associated with RNA element families based on mismatch score, with (red) reads discarded if mapping equally well to more than one family. b Stacked bars indicate the number of reads from TROVE2 eCLIP in K562 that map either uniquely to one of four primary Y RNA transcripts, map uniquely to Y RNA pseudogenes (identified by RepeatMasker), or (for family-aware mapping) map to multiple Y RNA transcripts but not uniquely to the genome or to other repetitive element families. c Stacked bars indicate the fraction of reads (averaged between replicates) of all 223 eCLIP experiments, separated by whether they map (red) uniquely to the genome, (purple) uniquely to the genome but within a repetitive element identified by RepeatMasker, or (gray) to repetitive element families. Datasets are sorted by the fraction of unique genomic reads. d Heatmap indicates the relative information for 26 elements and 168 eCLIP datasets, requiring elements and datasets to have at least one entry meeting a 0.2 relative information cutoff (based on Additional file 3: Fig. S2d). See Table 1 for RBP:element enrichments meeting these criteria and Additional file 5 for all enrichments

At the element level, our family-aware mapping strategy recovers many known processing or interacting factors, including RBPs enriched for the mature 18S (RPS3, RPS11) and 28S rRNA (DDX21, NOL12) as well as the 45S rRNA precursor (UTP18, WDR43), tRNAs (NSUN2), RN7SK (LARP7), YRNA (TROVE2), and others (Fig. 2d). To validate this approach, we considered 17 RNA elements with well-studied direct links to either RBP function (such as snoRNA binding with rRNA processing and snRNA binding with snRNA processing and the spliceosome) or specific RBP regulators (e.g., snRNA RN7SK with LARP7 [25] and YRNAs with TROVE2/Ro60 [26]) (Additional file 3: Fig. S2d). We observed that 140 eCLIP datasets had one of these 17 elements as the most highly enriched (by relative information, which we observed to better enable comparison across elements versus fold-enrichment), and in 84 (60%) of these cases, the RBP was previously characterized as having the element-paired RBP function, indicating that this approach is highly successful at recovering targets that reflect annotated functions of profiled RBPs. To set a cutoff for analysis, we found that an information cutoff of 0.2 maximized predictive accuracy, at which 70% (74 out of 105 RBPs with the most enriched RNA element meeting this cutoff) had annotated functions matching the known role for this element (Additional file 3: Fig. S2e). Using this cutoff, 235 RBP-element pairings were identified with large numbers of RBPs associated with mRNA regions (42 with CDS, 24 with 3′UTR, 40 with distal intronic, and 23 with proximal intronic regions) and rRNA (24 with RNA28S and 15 with RNA18S, as well as 12 with precursor 45S rRNA), and smaller numbers associated with other specific RNA classes (Fig. 2d, Table 1).
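To make the contrast between fold-enrichment and relative information concrete, the sketch below assumes the relative-entropy form p × log2(p / q), where p and q are the fractions of usable reads assigned to an element in the immunoprecipitation and in the size-matched input. This section does not state the formula, so that definition, the function names, and the toy read counts are assumptions for illustration only.

```python
import math

def fold_enrichment(ip_reads, ip_total, input_reads, input_total):
    """Ratio of the element's read fraction in IP versus input."""
    p = ip_reads / ip_total
    q = input_reads / input_total
    return p / q

def relative_information(ip_reads, ip_total, input_reads, input_total):
    """Assumed definition: p * log2(p / q). Requires input_reads > 0."""
    p = ip_reads / ip_total
    q = input_reads / input_total
    if p == 0:
        return 0.0
    return p * math.log2(p / q)

# Toy numbers (invented): an abundant element with modest enrichment
# versus a rare element with a large fold-enrichment.
abundant = (500, 1000, 250, 1000)   # 50% of IP reads, 25% of input reads
rare = (1, 1000, 1, 10000)          # 0.1% of IP reads, 0.01% of input reads

for name, args in (("abundant", abundant), ("rare", rare)):
    print(name, fold_enrichment(*args), relative_information(*args))
```

Under this assumed definition the rare element has the larger fold-enrichment but the smaller relative information, because the log ratio is weighted by the share of IP reads the element actually accounts for.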

Table 1 Predominant RNA element for each eCLIP dataset

Characterization of ribosomal RNA interactors and processing factors

Ribosomal RNA (rRNA) is the most abundant RNA found in eukaryotic cells and plays essential roles in defining the structure and activity of the ribosome. In humans, the 5S rRNA is separately transcribed, whereas the 18S, 28S, and 5.8S rRNAs are transcribed as one 45S precursor transcript that then undergoes a complex series of cleavage and RNA modification steps to process the mature rRNAs, which then form complex structures that scaffold the assembly of ~ 80 proteins to create the functional ribosome [27]. Unbiased approaches have characterized over 250 additional factors as playing critical roles in processing pre-rRNA, indicating that rRNA processing and function represent a major function of RBPs in humans [28].

Considering the 150 RBPs profiled, we observed that different subsets of RBPs showed enrichment to specific rRNAs (Fig. 3a), suggesting that the incorporation of normalization against paired input was successful in removing general background at abundant transcripts. Although we are unable to distinguish between mapping to mature 18S, 28S, and 5.8S transcripts versus those regions in the precursor, the ~ 10-fold lower read density we observe for 45S (median 281 reads per million (RPM)) versus 18S (2715 RPM) or 28S (1983 RPM) in eCLIP input samples (Additional file 3: Fig. S3a-c) suggests that the majority of 18S and 28S reads reflect mature rRNA transcripts. Considering 30 RBPs previously shown to affect pre-rRNA processing [28], we found that 16 had enrichment for one of the three (18S, 28S, or 45S) rRNAs (42.1% of RBPs meeting a 0.101 position-wise information cutoff) relative to 12.5% of others (3.4-fold enriched, p = 0.00025 by Fisher’s exact test) (Additional file 3: Fig. S3d). Despite high and relatively even read density overall on the abundant rRNA transcripts (Additional file 3: Fig. S3a-c), we observed that these rRNA-enriched RBPs showed a number of specific enrichment patterns: two on the 45S precursor (one situated around the 01 and A0 early processing sites, and a second located ~ 2000 nt further downstream that is discussed below), a cluster at position ~ 4200 of the 28S, and a cluster at ~ 1150 of the 18S, along with other profiles unique to individual RBPs (Fig. 3a). Distinct ribosomal components RPS3 and RPS11 had different positional enrichments, as expected given their different positioning within the 18S ribosome (Additional file 3: Fig. S3e).
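The enrichment comparison above can be reproduced in outline with a small Fisher's exact test. The 2 × 2 counts below are not given explicitly in the text. They are reconstructed here, as an assumption, from the stated figures: 16 of 38 RBPs meeting the cutoff (42.1%) and 14 of the remaining 112 RBPs (12.5%) belong to the set of 30 pre-rRNA processing factors, out of 150 RBPs in total. The two-sided convention used (summing all tables no more probable than the observed one) is also an assumption, so the result may differ slightly from the reported p value.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Sums hypergeometric probabilities of all tables with the same margins
    that are no more probable than the observed table.
    """
    row1, col1, n = a + b, a + c, a + b + c + d

    def pmf(x):
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    lowest = max(0, row1 - (n - col1))
    highest = min(row1, col1)
    p_obs = pmf(a)
    # Small relative tolerance guards against floating-point ties.
    return sum(pmf(x) for x in range(lowest, highest + 1)
               if pmf(x) <= p_obs * (1 + 1e-7))

# Reconstructed counts (assumption, see lead-in):
# rows: meets rRNA information cutoff / does not
# cols: known pre-rRNA processing factor / other
a, b = 16, 22
c, d = 14, 98

fold = (a / (a + b)) / (c / (c + d))
p_value = fisher_exact_two_sided(a, b, c, d)
print(f"fold enrichment = {fold:.2f}, p = {p_value:.2g}")
```

The reconstructed margins are internally consistent with the text (16 + 14 = 30 processing factors, 38 + 112 = 150 RBPs), which is why they were chosen, but the authors' actual contingency table may have been defined differently.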

Fig. 3

eCLIP enrichment for rRNA links RBPs with ribosomal RNA processing. a Heatmap indicates relative information at each position along (top) the ribosomal RNA precursor 45S polycistronic transcript and (bottom) within the mature 18S and 28S transcripts. Reads mapping equally to the 45S and mature 18S or 28S are assigned to the mature for quantitation. Purple asterisk indicates RBPs for which knockdown showed rRNA processing defects in Tafforeau et al. [28]. b Lines indicate fold-enrichment in DDX51 eCLIP in K562 cells at the 3′ end of the 28S and 45S transcript. For this and further plots, black line indicates mean and gray region indicates 10th to 90th percentile across all 223 eCLIP datasets. c, d Lines indicate relative information for c UTP18 in K562 and d WDR3 in K562 across the 45S precursor. e Lines indicate fold-enrichment for indicated RBPs within a region flanking putative ribosomal-encoded microRNA rmiR-663. f Red indicates mismatch positions relative to ribosomal rmiR-663 (and 100 nt flanking regions) for genomic-encoded miR-663a, miR-663b, and two additional homologous regions containing putative microRNAs. g Pie chart indicates the fraction of reads in ILF3 HepG2 eCLIP mapping (green) with fewer mismatches to rmiR-663, or (gray) mapping equally well to rmiR-663 and other miR-663 family members as indicated. See Additional file 3: Fig. S3j-k for LIN28B (HepG2) and SSB (HepG2). h, i Points indicate fold-enrichment in each eCLIP dataset for h C/D-box snoRNAs versus 45S precursor RNA, and i H/ACA-box snoRNAs versus C/D-box snoRNAs. Pearson’s correlation and significance were calculated in MATLAB

Our data on rRNA precursor position-specific enrichment confirms and provides further resolution to proteins previously characterized to play roles in ribosomal RNA processing. Some factors had specific positioning, including DDX51 which had specific enrichment at the 3′ end of 28S as well as the 3′-ETS precursor region, consistent with previous characterization of the role of DDX51 in 3′ end maturation of 28S [29], and UTP18 which had specific enrichment at the 5′ end, matching its roles in early cleavages at the 01, A0, and 1 sites suggested from large-scale screening data [28] (Fig. 3b, c, Additional file 3: Fig. S3f-g). Others, such as WDR3, had broader enrichment patterns that suggest participation in multiple maturation steps (Fig. 3d, Additional file 3: Fig. S3h).

Surprisingly, we observe a cluster of RBP association in the 45S precursor around position 2100, a region located between the A0 and 1 processing sites which lacks a well-defined processing role (Fig. 3a) [27]. Two of these factors have previous links to nucleolar activity, as ILF3 (also known as NF90) was previously shown to associate with pre-60S ribosomal particles in the nucleolus and knockdown of ILF3 gives defects in rRNA biogenesis [28, 30], and LIN28B has been shown to repress let-7 processing by sequestering pri-let-7 in the nucleolus [31]. In this region, multiple sites of ILF3 and SSB enrichment flank a more specific region enriched in LIN28B eCLIP (Fig. 3e, Additional file 3: Fig. S3i) which has previously been described to contain a potential rRNA-encoded microRNA, rmiR-663a [32]. As rmiR-663a shares similar sequence to genomic-encoded miR-663a on chromosome 20 (and would have the same mature miRNA sequence), it has been challenging to isolate expression of the ribosomal-encoded transcript [33], and indeed, the majority of LIN28B eCLIP reads mapping to the pri-miRNA map equally to both variants (Additional file 3: Fig. S3j). However, when we used sequence variants in the pri-miR sequence as well as the more variable flanking sequence to estimate their separate expression (Fig. 3f), we observed that reads unique to the rmiR outnumbered those unique to genomic homologs by more than 400-fold (Fig. 3g and Additional file 3: Fig. S3j-k), indicating that the observed signal is likely derived from 45S rather than other genomic homologs.

Finally, we considered binding to snoRNAs, a class of highly structured small RNAs that play essential roles in guiding modification of ribosomal RNAs. We found that enrichment for C/D-box snoRNAs, which canonically guide methylation of RNA, was highly correlated to enrichment for the 45S precursor (R² = 0.67, p = 1.6 × 10⁻⁵⁴) (Fig. 3h), providing further confirmation that these 45S-enriched RBPs are likely playing key roles in rRNA processing. Surprisingly, however, we observed that enrichment for H/ACA-box snoRNAs showed far lower correlation with enrichment for either C/D-box snoRNAs (R² = 0.42) or the 45S precursor (R² = 0.17) (Fig. 3i, Additional file 3: Fig. S3l). Thus, these data confirm the ability of eCLIP with input normalization to specifically isolate enrichment between abundant snoRNA classes, and suggest that (at least for the RBPs profiled to date here) we see stronger overlap between rRNA precursor and C/D-box versus H/ACA-box snoRNAs.

Repetitive elements define a significant fraction of the RBP target landscape

Repetitive elements constitute a large fraction of the non-coding genome [34], and elements annotated by RepBase constitute an average of 12.2% of reads observed in eCLIP input experiments (Additional file 3: Fig. S4a). In particular, as retrotransposable L1/LINE and Alu elements constitute 10.8% and 0.4% of intronic sequences, respectively (Additional file 3: Fig. S4b), they represent a significant fraction of the pool of nuclear transcribed pre-mRNAs available for RBP interactions. Although some RBPs have been shown to play roles in regulation of active retrotransposition [35], the majority of intronic elements have accumulated mutations or deletions and are no longer capable of active retrotransposition, leaving the question of their function relatively poorly understood. However, recent analyses of RBP targets identified by CLIP (including early releases of the eCLIP data considered here) have shown that both antisense Alu and antisense LINE elements contain cryptic splice sites that can lead to improper splicing and polyadenylation, suggesting that a major yet unappreciated role for many RBPs may be to suppress the emergence of inappropriate cryptic RNA processing sites introduced upon retrotransposition [36, 37].

Querying for RBPs with enriched eCLIP signal at retrotransposable and other repetitive elements, we surprisingly observed that only a small subset of elements (notably including L1 and Alu elements both in sense and antisense orientation) showed high RBP specificity, whereas most elements showed very highly correlated enrichments across RBPs (Fig. 4a, Additional file 3: Fig. S4c). This group of elements showed enrichment in a small subset of eCLIP experiments, notably including multiple members of the highly abundant HNRNP family (HNRNPA1, HNRNPU, HNRNPC, and HNRNPL), indicating that they may be coordinately regulated to prevent inappropriate RNA processing.

Fig. 4

RBP association at retrotransposable and other repetitive elements. a (left) Heatmap indicates fold-enrichment in eCLIP versus paired input, averaged across two biological replicates. Shown are 30 RepBase elements which had average RPM > 100 in input experiments and at least one RBP with greater than 5-fold enrichment and 65 eCLIP experiments with greater than 5-fold enrichment for at least one element. (right) Color indicates correlation in fold-enrichment between elements across the 65 experiments. b, c Points indicate fold-enrichment for b Alu elements and c L1 LINE elements in individual biological replicates. Shown are all RBPs with average enrichment of at least 2 (for Alu elements) or 5 (for L1 elements). d Bars indicate L1 retrotransposition casTLE effect score (positive score indicates increased retrotransposition upon RBP knockout), with error bars indicating 95% minimum and maximum credible interval estimates (data from Liu et al. [38]). e (left) Each point indicates significance (from two-sided Kolmogorov-Smirnov test) between fold changes observed in RNA-seq of RBP knockdown for the set of genes with one or more RBP-bound L1 (or antisense L1) elements versus the set of genes containing one or more L1 (or antisense L1) elements but lacking RBP binding (defined as overlap with an IDR peak). RBPs were separated based on requiring 5-fold enrichment for L1 elements as in c. (right) Cumulative distribution plots for (top) MATR3 in HepG2 and (bottom) SUGP2 in HepG2. Significance shown is versus the set of genes containing one or more L1 (or antisense L1) elements but lacking RBP binding (red line). f Points indicate the fraction of antisense L1-assigned reads that map to canonical (RepBase) elements for six expression-altering antisense L1-enriched eCLIP datasets (from e), five other antisense-L1 enriched eCLIP datasets, and 11 paired input samples. Significance is from the two-sided non-parametric Kolmogorov-Smirnov test. See Additional file 3: Fig. S4g for the full distribution of read assignments

Analysis of Alu elements recapitulated a previously described interaction of HNRNPC with antisense Alu elements [36], but additionally revealed two RBPs with more than 5-fold enrichment: ILF3 (enriched for both sense and antisense Alu elements) and RNA Polymerase II component POLR2G (antisense) (Fig. 4b, Additional file 3: Fig. S4d). Both of these factors have previous links to RNA processing through Alu elements, as ILF3 association was suggested to repress RNA editing in Alu elements [39] and Alu elements have been shown to affect RNA Polymerase II elongation rates [40]. In total, 19 datasets showed more than 2-fold enrichment for either Alu or antisense Alu elements (Fig. 4b).

Considering L1/LINE elements, we observed enrichment with far more RBPs, with 26 datasets showing 5-fold enrichment (Fig. 4c). Interestingly, we observed generally distinct sets for sense versus antisense L1 enrichment, with only HNRNPC (in K562, but not HepG2) and ZC3H8 showing enrichment for both (Fig. 4c, Additional file 3: Fig. S4e). The RBPs identified here align well with those identified in an independent analysis of L1-associated RBPs which used a subset of these datasets along with independent iCLIP and other datasets, confirming robustness of this analysis across different approaches to quantify enrichment to L1 elements [37]. To query the role of L1 association, we first considered whether binding could specifically act to repress L1 retrotransposition itself. Of the 15 RBPs with more than 5-fold enrichment at sense L1 elements, SAFB (p = 0.002), PPIL4 (p = 0.06), and TRA2A (p = 0.05) were all identified as candidate suppressors of L1 retrotransposition in a recent genome-wide CRISPR screening assay [38], suggesting that this eCLIP enrichment approach identifies functional regulators of retrotransposition (Fig. 4d).

However, we observed that while enriched signal was centered at L1 sense and antisense elements, the signal often extended for multiple kilobases on either side (Additional file 3: Fig. S4f), indicating that despite the overlap with functional regulators of active LINEs, the majority of eCLIP signal is likely coming from inactive L1 elements contained within pre-mRNAs rather than independently transcribed active L1 elements in the cell lines studied here. Thus, we next assayed whether these RBPs showed evidence for silencing cryptic RNA processing sites created upon retrotransposition, as previously described [36, 37]. To do this, we hypothesized that knockdown of such RBPs would lead to inclusion of premature stop codons that signal nonsense-mediated decay, ultimately decreasing abundance of target mRNA transcripts. For MATR3, we indeed observed that genes containing one or more antisense L1 elements overlapped by peaks showed significantly decreased expression upon RBP knockdown (Fig. 4e), consistent with recent findings that MATR3 binding blocks both cryptic poly(A)-sites and splice sites within LINEs [37]. Interestingly, we observed a similar pattern for 3 other RBPs with antisense L1 enrichment, HNRNPM (which has been identified in complexes with MATR3 [4f, Additional file 3: Fig. S4g).

Meta-gene binding profiles reveal RBP functions

Next, we turned to the question of whether eCLIP peak distributions could reveal RBP roles in mRNA processing. To better separate RBP association patterns, we considered the distribution of peaks across a meta-gene generated by size-normalizing binding across all protein-coding transcripts relative to transcription start and stop sites and start and stop codons, and then averaging across all expressed genes (Fig. 5a). Considering binding relative to the coding region (CDS) and 5′ and 3′ untranslated regions of spliced mRNA, we observed an overall average of approximately one peak per gene across the entire mRNA (Additional file 3: Fig. S5a), with a variety of patterns of individual RBP association (Fig. 5b).
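The size-normalization step can be sketched as follows, using the bin counts given in the Fig. 5 legend (13 5′UTR, 100 CDS, and 49 3′UTR bins). The per-nucleotide 0/1 input format, the function names, and the rule that a bin counts as occupied if any nucleotide in it overlaps a peak are assumptions made for illustration, not a description of the actual implementation.

```python
# Bin counts per region, taken from the Fig. 5 legend.
REGION_BINS = (13, 100, 49)  # 5'UTR, CDS, 3'UTR

def bin_region(covered, n_bins):
    """Rescale a per-nucleotide 0/1 peak vector to a fixed number of bins.

    A bin is set to 1 if any nucleotide falling in it is covered by a peak
    (an assumed rule; averaging within the bin would be another option).
    """
    length = len(covered)
    binned = []
    for i in range(n_bins):
        start = i * length // n_bins
        end = max(start + 1, (i + 1) * length // n_bins)
        binned.append(1 if any(covered[start:end]) else 0)
    return binned

def normalize_transcript(utr5, cds, utr3):
    """Concatenate the binned 5'UTR, CDS, and 3'UTR into one 162-bin row."""
    row = []
    for region, n_bins in zip((utr5, cds, utr3), REGION_BINS):
        row.extend(bin_region(region, n_bins))
    return row

def meta_gene(transcripts):
    """Average the normalized rows across all transcripts."""
    rows = [normalize_transcript(*t) for t in transcripts]
    return [sum(col) / len(rows) for col in zip(*rows)]

# Toy transcripts (invented): one with a peak at the very 5' end, one with no peaks.
with_peak = ([1, 1] + [0] * 24, [0] * 300, [0] * 98)
no_peak = ([0] * 40, [0] * 900, [0] * 500)
profile = meta_gene([with_peak, no_peak])
print(len(profile), profile[:3])
```

Because every transcript is mapped onto the same 162 positions, transcripts of very different lengths contribute equally to each position of the averaged profile.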

Fig. 5

mRNA meta-gene profiles from eCLIP correspond to RBP regulatory roles. a (left) Each line indicates the presence (orange) of a reproducible DDX3X K562 eCLIP peak for 9162 mRNAs that are expressed (TPM > 1) in K562. Each gene was normalized to 13 5′UTR, 100 CDS, and 49 3′UTR bins (based on average lengths among expressed transcripts in K562 cells). (right) A meta-mRNA plot is generated by averaging across all expressed genes, with shaded region indicating 5th to 95th percentile observed in 100 bootstrap samplings. b Heatmap indicates peak coverage for 104 datasets (requiring at least 100 reproducible peaks and at least one meta-mRNA position with 5th percentile greater than 0.002). Color indicates the average occupancy, normalized by setting (blue) minimum value to zero and (yellow) maximum to one. Meta-mRNA profiles were hierarchically clustered and manually labeled. c Heatmap indicates pairwise correlation (Pearson’s R) between each pair of positions along the meta-mRNA in b. d Lines indicate average normalized peaks per bin for all RBPs in the indicated class. Shaded region indicates one standard deviation. e Heatmap indicates odds ratio of overlap between eCLIP datasets in (x-axis) indicated meta-mRNA cluster versus (y-axis) annotated RBP functions. See Additional file 3: Fig. S5d for significance

At a global level, the most striking observation was clear delineation points at the start and stop codon positions (Fig. 5b, c), likely reflecting the fact that translation initiation is unique to the 5′UTR whereas the 3′UTR is the only region where bound RBPs will not be removed by translating ribosomes. However, more subtle clustering revealed distinct subgroups within the broader 5′UTR-, CDS-, and 3′UTR-enriched classes (Fig. 5b, d). For example, we observed two distinct classes of 5′UTR binding that appear to correlate with distinct RBP functions. The first (5UTR.TSS) showed greater enrichment closer to the transcription start site and included nuclear 5′ end processing factors such as cap-binding protein NCBP2 (Fig. 5b, d). In addition to 5′ end enrichment, this class also contained RBPs with substantial 3′UTR signal, such as 3′ end processing factor CSTF2T (which also showed significant signal extending past annotated transcription termination sites (Additional file 3: Fig. S5b), consistent with previous CLIP studies [42]). A second set (5UTR.SC) showed biased peak presence closer to the start codon and included both canonical translational initiation factors (such as EIF3G, EIF3D, and EIF3H) as well as RBPs previously shown to play translational regulatory roles (including DDX3X, SRSF1, and FMR1) (Fig. 5b).

Similarly, we also observed distinctions within CDS binding, with either uniform (CDS.UN) density or biased towards the 5′ (CDS.5P) or 3′ (CDS.3P) end. We observed that 13 out of 15 spliceosomal RBPs showed CDS enrichment (10 of which fell into the CDS.UN category), likely reflecting the general lack of introns in 5′UTRs (due to their small size) and 3′UTRs (as they would create targets for nonsense-mediated decay) (Fig. 5b, d).

Finally, we observed multiple modalities of 3′UTR peak distribution. The 3UTR.Un class showed relatively uniform density and contained many well-characterized 3′UTR binding proteins, including NMD factor UPF1 and stress granule factor TIA1. In contrast, RBPs in the 3UTR.5P class had peak density enriched closer to (and continuing 5′ of) the stop codon, including the well-studied IGF2BP family of RBPs (Additional file 3: Fig. S5c). Finally, we observed a number of RBPs with increased enrichment towards the transcription termination site (3UTR.TTS).

Next, we considered whether these patterns corresponded to different RNA processing functions. Although the number of RBPs is limited for some functions, we observed that many clusters had significant overlaps with distinct RBP functional annotations (Fig. 5e, Additional file 3: Fig. S5d). In particular, RBPs associated with nuclear RNA processing steps showed little change (median 1.2-fold decrease in peak density around the stop codon), whereas RBPs with cytoplasmic roles showed a significant 1.6-fold increase (Additional file 3: Fig. S5e), consistent with a stronger role for the stop codon as a delineation point for cytoplasmic RBP association. In all, our results suggest that the pattern of relative enrichment in different gene regions is predictive of the regulatory role that the RBPs play.

Splicing regulatory roles revealed by intronic meta-gene profiles

Next, we performed regional analysis to query binding to exons (specifically 50 nt bordering the splice sites) and 500 nt of proximal introns flanking both the 3′ and 5′ splice sites. As an example, we observed that out of 89,265 introns present in highly expressed transcripts (TPM > 1), 2699 had a significant IDR peak from eCLIP of U2AF2 in K562 cells (Additional file 3: Fig. S6a). These peaks had a stereotypical positioning at the 3′ splice site (extending into the downstream exon due to the use of full reads rather than just read 5′ ends for analysis), matching the well-characterized role of U2AF2 in 3′ splice site recognition (Fig. 6a). These matrices were then summed across all introns to calculate a meta-intron plot representing the average peak coverage at each position, with confidence intervals estimated by bootstrapping (Fig. 6b).

Fig. 6

Meta-exon plots reveal intronic regulatory roles. a Each line indicates the presence (in blue) of a reproducible U2AF2 K562 eCLIP peak for 2699 introns that contain at least one peak within the displayed region (500 nt of proximal intron and 50 nt of exon flanking the 5′ and 3′ splice sites). See Additional file 3: Fig. S6a for all 89,265 introns. b Meta-exon plot for data shown in a, with line indicating average and shaded region indicating 5th to 95th percent confidence interval (derived by 100 bootstrap samplings). c (left) Heatmap indicates average peak coverage across all introns for 130 RBPs with at least 100 peaks and 5th percentile confidence interval at least 0.0005 (for heatmap visualization, the maximum value for each dataset was set to one to calculate normalized coverage). (right) Lines show individual RBP examples for five clusters identified based on similar meta-exon profiles. Y-axis indicates fraction of introns with peak

Performing this analysis for 130 RBPs with sufficient peaks (see the “Methods” section), we observed that the profiles recapitulated many known binding patterns, including U2AF1 and U2AF2 at the 3′ splice site, SF3B4 and SF3A3 at the branch point, PRPF8 at the 5′ splice site, and RBFOX2 and PTBP1 at proximal introns (Fig. 6c). Clustering analysis indicated a number of distinct RBP association patterns. In addition to a large group of exclusively exonic datasets, we observed clusters for the canonical splicing features (5′ splice site, 3′ splice site, and branch point), and two additional clusters: one where RBPs showed enrichment for peaks at proximal introns flanking both the 5′ and 3′ splice sites, and one with dominant enrichment in the 5′ splice site proximal intron only (Fig. 6c, right). We also observed a wide range of peak frequency; canonical splicing machinery components such as U2AF2, SF3B4, and PRPF8 had significantly enriched peaks at many introns (with a position maximum of 3.6%, 7.8%, and 5.3% of queried abundant introns respectively in K562), whereas factors such as PTBP1 and RBFOX2 were less commonly enriched at specific positions (0.1% and 0.5%, respectively) (Fig. 6c).

Insights into spliceosomal association and core splicing regulation

The breadth of RBPs profiled provided a unique opportunity to explore their interactions with the spliceosome and their impacts on splicing regulation. In addition to contacting the intron, many spliceosomal and splicing regulatory proteins also interact with the spliceosomal small nuclear RNAs (snRNAs). The overall snRNA family includes five specific RNA families (U1, U2, U4, U5, and U6, which also have variant isoforms that differ slightly in sequence) that play essential roles in canonical GT-AG RNA splicing, as well as four (U11, U12, U4atac, U6atac) specific to the minor AT-AC spliceosome, each of which plays specific mechanistic roles during splicing [43]. Thus, RBP association with a particular snRNA can help to map its function to a particular step in splicing. Quantitating snRNA enrichment using the family-aware mapping described above, we recapitulated many known associations between RBPs and the spliceosome, including interactions of SF3B4 with U2 snRNA (47- and 32-fold enriched in HepG2 and K562, respectively) [44] and GEMIN5 with U1 (11.2-fold enriched in K562) [45] (Fig. 7a). In some cases, these dominated overall RNA recovery; for example, an average of 41% of reads from SF3A3 eCLIP and 17% and 20% of SF3B4 eCLIP reads in HepG2 and K562 respectively mapped to the U2 snRNA, whereas U2 reads averaged only 0.7% in input samples.

Fig. 7

Insights from eCLIP of spliceosome-associated RBPs. a Heatmap indicates fold-enrichment for individual snRNAs within eCLIP datasets. Shown are all RBPs with greater than 5-fold enrichment for at least one snRNA. b Browser shows read density for eCLIP of AQR (K562), SF3B4 (K562), and SF3A3 (HepG2) for the NARF exon 11 3′ splice site region. Dotted line indicates position of enriched reverse transcription termination at crosslink sites. c (left) Pie chart shows all (n = 2475) introns with > 20 reads in the − 50 to − 15 (branch point) region in AQR K562 eCLIP. Blue indicates putative branch points (the subset with more than 50% of read 5′ ends at one position). (right) Motif information content for 11-mers centered on the putative branch points. Image generated with seqLogo package in R. d Lines indicate mean normalized eCLIP enrichment in IP versus input for SF3B4 and SF3A3 at (red/purple/green) alternative 3′ splice site extensions in RBP knockdown or (black) alternative 3′ splice site events in control HepG2 or K562 cells. The region shown extends 50 nt into exons and 300 nt into introns

Interestingly, while many factors showed similar association between analogous snRNAs in the major and minor spliceosomes (such as PRPF8 and SMNDC1 with U6 and U6atac, and SF3B1 and SF3B4 with U2 and U12), some RBPs were specifically associated with either the major (SF3A3, which was 29.5-fold enriched for U2 but 1.2-fold depleted for U12 in HepG2, and QKI, 118.6-fold enriched for U6 but 2.4-fold depleted for U6atac) or minor spliceosome (HNRNPM, which was 8.1-fold enriched in K562 and 7.6-fold in HepG2 for U11 but 5.3- and 4.2-fold depleted for U1) (Fig. 7a, Additional file 3: Fig. S7a-d). Although preliminary analysis did not show altered splicing upon HNRNPM knockdown specifically at U11/U12 introns, previous studies have suggested that HNRNPM may contribute to minor intron splicing through interactions with FUS [46].

In the first catalytic step of intron splicing, a transesterification step joins the 5′ splice site with the branch point to create an intron lariat structure (Additional file 3: Fig. S7e). This is an essential step in splicing and helps to define 3′ splice site choice, but identification of branch points has remained challenging due to variable positioning (ranging from 20 to 40 nucleotides upstream of the 3′ splice site) and a degenerate sequence motif [47]. Recent efforts to use either specialized library preparation protocols or focused analysis of deep sequencing to identify branch points via lariat junction-spanning reads have enabled the identification of tens of thousands of branch points, but the regulation of branch point recognition and its role in splicing regulation remains poorly understood. Considering the RBPs profiled here, we observe multiple RBPs showing specific enrichment at branch points, including both known regulators (such as SF3 complex components SF3B4 and SF3A3), as well as novel factors (including RBM5). Indeed, analysis of these datasets coupled with focused iCLIP profiling of purified spliceosomes recently indicated distinct patterns of RBP association at branch points and 5′ and 3′ splice sites, which yielded unique insights into how branch point strength defines RBP association and spliceosomal assembly dynamics [48].

However, we were particularly intrigued by the observation of a striking pattern of both 5′ splice site and branch point enrichment for the RBP AQR (Fig. 7b). Knockdown of AQR yielded over 30,000 altered alternative splicing events, by far the most of any knockdown performed by the ENCODE consortium to date (including canonical splicing components such as U2AF1/2 and SF3B4) [… 7c). Motif analysis of these positions yielded the canonical branch point motif signal (with 92% containing an A at the base prior to read starts) (Fig. 7c). Thus, these results suggest that AQR eCLIP signal is derived from introns after lariat formation, where reverse transcription is incapable of reading through the branch point adenosine (Additional file 3: Fig. S7e), and that deeper sequencing of AQR eCLIP (potentially with improved methodology to enrich reads at the 3′ rather than 5′ splice site) will provide direct identification of branch points in human.

Next, we considered eCLIP signal at alternatively spliced cassette exons. Considering “native” cassette exons in wild-type K562 and HepG2 cells, we observed that branch point factors SF3B4 and SF3A3 showed decreased signal at alternative exons relative to constitutive exons, consistent with U2AF2 and other spliceosomal components and potentially reflecting overall lower spliceosomal occupancy (Additional file 3: Fig. S7f). However, at alternative 3′ splice sites with the proximal site increased upon knockdown of branch point components SF3B4 and SF3A3, we observed that average eCLIP enrichment for SF3B4 and SF3A3 was decreased at the typical branch point location but increased towards the 3′ splice site (compared to eCLIP signal at native A3SS events which utilize both distal (upstream) and proximal 3′ splice sites in control shRNA datasets) (Fig. 7d, Additional file 3: Fig. S7g). Consistent with previous mini-gene studies showing that 3′ splice site scanning and recognition originates from the branch point and can be blocked if the branch point is moved too close to the 3′ splice site AG [50], these results provide further evidence that use of branch point complex association to restrict recognition by the 3′ splice site machinery may be a common regulatory mechanism [51] (Additional file 3: Fig. S7h).

Clustering of RBP binding identifies known and novel co-associating factors

Large-scale RBP target profiling using a consistent methodology enables cross-comparison between datasets. Considering simple overlap between peak sets for all profiled RBPs, we observed significant overlap for many pairs of RBPs, which often formed co-associating groups (Fig. 8a, left). These groups of RBPs with highly overlapping peaks generally segregated into four major categories. First, we observe high similarity between the same RBP profiled in HepG2 and K562 (including QKI, PTBP1, and LIN28B) (Fig. 8a, green). Indeed, we observe an average peak overlap of 30.0% between the same RBP in K562 and HepG2 versus 4.9% for random RBP pairings (6.1-fold increased), confirming the broad reproducibility of binding across cell types (Fig. 8b). Second, we observe many cases of high overlap between eCLIP for homologous RBPs within the same family, including TIA1 and TIAL1, IGF2BP1/2/3, and fragile X-related FMRP, FXR1, and FXR2 (Fig. 8a, yellow). Third, we observe clusters containing known co-regulating RBPs, including recognition and processing machinery for the 3′ splice site (U2AF1 and U2AF2), branch point (SF3B4 and SF3A3), and 5′ splice site (EFTUD2, RBM22, PRPF8, and others), as well as a group of RBPs that play general roles in binding the 5′UTR of nearly all genes to regulate translation (DDX3X, EIF3G, and NCBP2) (Fig. 8a, red).

Fig. 8

RBP co-association predicts known and novel RNP complexes. a Heatmap indicates the pairwise fraction of eCLIP peaks overlapping between datasets. Callout examples are shown for known complexes, RBP families, same RBP profiled across cell types, and putative novel complexes. b GSEA analysis comparing the fraction overlap observed profiling the same RBP in both K562 and HepG2, compared against random pairings of RBPs (with one profiled in K562 and the other in HepG2). c As in b, but using the set of RBPs with interactions reported in the BioPlex IP-mass spectrometry database [52]

Interestingly, we observe unexpected clusters that suggested potential novel complexes or co-interacting partners (Fig. 8a, blue). Some clusters likely reflect overlapping targeting to specific types of RNAs: for example, one cluster contains three RBPs we described above to show specific enrichment at antisense L1/LINE elements (HNRNPM, BCCIP, and EXOSC5). The patterns of other clusters are often less clear, with some containing both well-studied RBPs as well as those with no known RNA processing roles (for example, high overlap between HNRNPL and AGGF1 across both cell types). To consider whether these likely reflected true instances of RBP co-interaction, we asked whether RBPs that had higher peak overlap were more likely to have interactions from large-scale IP-mass spectrometry experiments. Using the BioPlex 2.0 database of ~ 56,000 interactions [52], we observed that RBPs with IP-MS interactions showed an average 2.3-fold increase in eCLIP peak overlap (11.4% versus 4.9% for RBPs without interactions), suggesting that there is a general correlation between peak overlap and RBP interactions (Fig. 8c).

Finally, we performed co-immunoprecipitation (co-IP) studies focusing on one predicted novel interaction group involving HNRNPL and AGGF1. We observed that AGGF1 co-immunoprecipitated HNRNPL, unlike unrelated factors RBFOX2 or FMR1 (Additional file 3: Fig. S8a). We note that this co-IP was observed using less stringent co-IP wash buffers, but was not observed using the high-salt wash buffers present in eCLIP (Additional file 3: Fig. S8b), indicating that the overlap in eCLIP binding likely reflects independent crosslinking events to the distinct RBPs. Thus, these results indicate that the eCLIP data resource reveals many novel RBP interactions that are likely to reflect previously unidentified regulatory complexes.

Discussion

The ENCODE RNA binding protein resource contains 1223 replicated datasets for 356 RBPs, including in vivo targets by eCLIP, in vitro binding motifs by RNA Bind-N-Seq, subcellular localization by immunofluorescence, factor-responsive expression and splicing changes by knockdown/RNA-seq, and DNA associations by ChIP-seq [… 71], suggesting that RPS3 eCLIP may capture ribosome association on translating mRNAs and could be used as a general approach to assay translation. Similarly, our meta-exon analysis of AQR (followed by further analysis of crosslink-induced termination sites) showed that AQR eCLIP could identify branch points for a set of highly abundant introns, suggesting that further development of profiling of AQR binding targeted to 3′ splice site regions could yield a highly specific approach to identification of branch points transcriptome-wide. Recent work using iCLIP to specifically purify spliceosome-associated RNAs further showed that other eCLIP datasets analyzed here also showed highly stereotypical crosslinking patterns around branch points, which could also broadly map branch point locations and reveal unique insights into the combinatorial effect of branch point and splice site strength on spliceosomal assembly and dynamics [48].

The diversity of distinct RBP association patterns can also be flipped to predict features of a queried RNA. For example, recent work used the ENCODE eCLIP resource to identify UPF1 as one of many RBPs with specific enrichment at 3′UTRs [56]. This finding enabled improved prediction of whether a queried transcript was a protein coding versus long non-coding RNA by incorporating presence (or absence) of UPF1 eCLIP signal as a biomarker for translation [56]. Similarly, our unbiased analysis of foci of enrichment on the 45S rRNA precursor suggested two regions as notably highly enriched across multiple RBPs, one of which matches a well-characterized region (between the canonical 01 and A0 processing sites) with another suggesting interesting regulatory mechanisms linking ribosomal RNA and microRNA processing. Similar analysis identifying eCLIP datasets with enrichment on regulatory non-coding RNAs [… 73].

Family-aware mapping to multicopy elements

The software pipeline used to quantify enrichment for retrotransposable and other multicopy elements is available at https://github.com/YeoLab/repetitive-element-mapping, and was initially described in [75]. Within each family, transcripts were given a priority value, with primary transcripts prioritized over pseudogenes. Mapping to the reverse strand of a transcript was counted separately from forward strand mapping, creating a second “antisense” family for each RNA family above (which utilized the same element priority order), with the exception of simple repeats (which were all combined into one family).

To quantify eCLIP signal, paired-end sequencing reads were first adapter trimmed as previously described [18]. Next, reads were mapped against the repetitive element database using bowtie2 (v. 2.2.6) with options “-q --sensitive -a -p 3 --no-mixed --reorder” to output all mappings. Read mappings were then processed as follows. First, for each paired-end read pair, only mappings with the lowest alignment scores summing both mismatch penalties (defined as MN + floor((MX − MN)(MIN(Q, 40.0)/40.0)) where Q is the Phred quality value, and default values MX = 6, MN = 2, as described in bowtie2 reference material) and gap penalties (defined as GO + N × GE, where GO = gap open = 5, GE = gap extend = 3, N = gap length) were kept. Next, the mapping to the transcript with the highest priority within a RNA family (as listed above) was identified as the “primary” match mapping. At this stage, read pairs which had equal best alignments to multiple repeat families were discarded, with only reads mapping to a single repeat family considered for further quantification.
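The penalty definitions quoted above can be written out directly. The following is a minimal sketch that only restates the two formulas with the default values given in the text; the function names and the idea of passing mismatch qualities and gap lengths as plain lists are illustrative assumptions, not part of the published pipeline.

```python
import math

def mismatch_penalty(q, mx=6, mn=2):
    """MN + floor((MX - MN) * (MIN(Q, 40.0) / 40.0)) for one mismatched base
    with Phred quality q."""
    return mn + math.floor((mx - mn) * (min(q, 40.0) / 40.0))

def gap_penalty(length, gap_open=5, gap_extend=3):
    """GO + N * GE for a gap of N bases."""
    return gap_open + length * gap_extend

def pair_penalty(mismatch_quals, gap_lengths):
    """Summed penalty for one read pair (hypothetical helper): one entry per
    mismatched base and one entry per gap."""
    return (sum(mismatch_penalty(q) for q in mismatch_quals)
            + sum(gap_penalty(n) for n in gap_lengths))

# Two high-quality mismatches plus a 1-nt gap: 6 + 6 + (5 + 3) = 20
print(pair_penalty([40, 40], [1]))
```

Under this reading, the mappings retained for a read pair are those with the smallest summed penalty.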

Next, these RNA family mappings were integrated with unique genomic mapping from the standard eCLIP processing pipeline (using read mapping prior to PCR duplicate removal). For read pairs that mapped both to an RNA family above as well as uniquely to the genome, the mapping scores (as defined above) were compared. If the unique genome mapping was more than 2 mismatches per read (24 alignment score for the read pair) better than to the repeat element, the unique genomic mapping was used; otherwise, it was discarded and only the repeat mapping was kept. Next, PCR duplicates were removed by comparing all read pairs based on their mapping start and stop position (either within the genome or within the mapped primary repeat) and unique molecular identifier sequence, and all but one read pair for read pairs sharing these three values were defined as PCR duplicates and removed. At this stage, RepeatMasker-predicted repetitive elements in the hg19 genome were additionally obtained from the UCSC Genome Browser [24]. Element counts for RepBase elements were therefore determined as the sum of repeat family-mapped read pairs (described above) plus the number of reads that mapped uniquely to the genome at positions which overlapped (by at least one base) RepeatMasked RepBase elements. Reads uniquely mapping to non-RepBase genomic regions were then annotated into one of 11 additional classes in the following priority order (based on GENCODE v19 annotations): CDS, 5′UTR and 3′UTR, 3′UTR, 5′UTR, proximal intronic (within 500 nt of splice sites), distal intronic (remaining intronic regions), non-coding exonic, non-coding proximal intronic, non-coding distal intronic, antisense to GENCODE transcripts, and intergenic.
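The genome-versus-repeat decision and the duplicate-removal rule described above can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline's code: scores are treated as summed penalties (lower is better), the record layout is hypothetical, and the reference name is added to the duplicate key on the assumption that start and stop positions are only comparable within one chromosome or repeat element.

```python
def choose_mapping(genome_penalty, repeat_penalty, margin=24):
    """Keep the unique genomic mapping only if it is more than `margin`
    penalty points (2 mismatches per read, 24 per pair) better than the
    repeat mapping; otherwise keep the repeat mapping."""
    return "genome" if repeat_penalty - genome_penalty > margin else "repeat"

def remove_pcr_duplicates(read_pairs):
    """Keep one read pair per (reference, start, stop, UMI) combination.
    `read_pairs` is a list of dicts with hypothetical keys
    'ref', 'start', 'stop', and 'umi'."""
    seen = set()
    kept = []
    for pair in read_pairs:
        key = (pair["ref"], pair["start"], pair["stop"], pair["umi"])
        if key not in seen:
            seen.add(key)
            kept.append(pair)
    return kept

pairs = [
    {"ref": "L1", "start": 10, "stop": 90, "umi": "ACGTACGTAC"},
    {"ref": "L1", "start": 10, "stop": 90, "umi": "ACGTACGTAC"},  # duplicate
    {"ref": "L1", "start": 10, "stop": 90, "umi": "TTGTACGTAC"},  # different UMI
]
print(len(remove_pcr_duplicates(pairs)))
```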

Finally, the number of post-PCR duplicate removal read pairs mapping to each class was counted in both IP and paired input sample and normalized for sequencing depth (using the total number of post-PCR duplicate read pairs from both unique genomic mapping as well as repeat mapping as the denominator to calculate fraction of reads). Significance was determined by Fisher’s exact test or Pearson’s chi-square test if all expected and observed values were five or more. Relative information content of each element in each replicate was calculated as \( p_i \times \log_2\left(\frac{p_i}{q_i}\right) \), where \(p_i\) and \(q_i\) are the fraction of total reads in IP and input respectively that map to element \(i\). To combine two biological replicates, the average reads per million (RPM) was calculated across two IP samples and compared against the paired input experiment to calculate one overall fold-enrichment and relative information value per dataset.
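As a worked illustration of the fold-enrichment and relative information calculations, the sketch below computes both from raw read counts. The function names and example counts are hypothetical, and the sketch assumes non-zero counts in both IP and input (the text does not say how zero counts were handled).

```python
import math

def fold_enrichment(ip_reads, ip_total, input_reads, input_total):
    """Ratio of the fraction of IP reads to the fraction of input reads
    assigned to one element."""
    return (ip_reads / ip_total) / (input_reads / input_total)

def relative_information(ip_reads, ip_total, input_reads, input_total):
    """p_i * log2(p_i / q_i), with p_i and q_i the IP and input read fractions."""
    p = ip_reads / ip_total
    q = input_reads / input_total
    return p * math.log2(p / q)

# Hypothetical element holding 20% of IP reads and 5% of input reads:
# 4-fold enriched, relative information 0.2 * log2(4) = 0.4
print(fold_enrichment(200, 1000, 50, 1000))
print(relative_information(200, 1000, 50, 1000))
```

Because the log ratio is weighted by the IP read fraction, an element that is strongly enriched but accounts for very few reads receives a low relative information value.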

Validation of RNA element links with RBP functional annotations

To quantify whether RNA element enrichment matched with RBP functions, a set of positive control pairings were generated between RNA elements with known links to either RBP function or known RBPs contained within a well-characterized ribonucleoprotein complex (Additional file 3: Fig. S2a). One hundred forty datasets for which the RBP had at least one of these annotated functions were selected, and datasets were sorted by relative information of the most-enriched class. Accuracy (defined as (TP + TN)/(TP + TN + FP + FN)) was then calculated, where true positives (TP) were RBPs for which the most-enriched RNA element was greater than the cutoff value and the RBP has published evidence for the function associated with the most-enriched RNA element, false positives (FP) were RBPs that had an RNA element meeting the relative information cutoff but the RBP lacked publication evidence for the linked function, false negatives (FN) were RBPs lacking an RNA element meeting the relative information cutoff but the RBP had published evidence for functions associated with at least one RNA element class, and true negatives (TN) were RBPs lacking annotated functions or RNA elements meeting the relative information cutoff. Accuracy was calculated for each possible relative information cutoff, and the maximum point (0.2) was chosen.
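The cutoff scan described above can be sketched as a simple search over candidate thresholds. This is one reading of the definitions, not the authors' code: each dataset is reduced to a hypothetical tuple of (relative information of the most-enriched element, whether that element matches a published function of the RBP, whether the RBP has any published function linked to an element class).

```python
def confusion_counts(datasets, cutoff):
    """Count TP, FP, FN, TN at one relative information cutoff."""
    tp = fp = fn = tn = 0
    for score, top_matches_known, has_any_known in datasets:
        if score > cutoff:
            if top_matches_known:
                tp += 1
            else:
                fp += 1
        elif has_any_known:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def accuracy(datasets, cutoff):
    """(TP + TN) / (TP + TN + FP + FN)."""
    tp, fp, fn, tn = confusion_counts(datasets, cutoff)
    return (tp + tn) / (tp + tn + fp + fn)

def best_cutoff(datasets):
    """Try every observed score as a cutoff and return the most accurate one."""
    candidates = sorted({score for score, _, _ in datasets})
    return max(candidates, key=lambda c: accuracy(datasets, c))

# Hypothetical toy data
toy = [(0.9, True, True), (0.5, True, True), (0.3, False, False),
       (0.1, False, False), (0.05, False, False)]
print(best_cutoff(toy), accuracy(toy, best_cutoff(toy)))
```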

Ribosomal RNA analysis

RBPs with roles in ribosomal RNA processing were obtained from [28]. Position-wise relative information was calculated as above, using the number of reads overlapping the position in IP versus input for each dataset (using paired-end read 2 only, as was done for genomic mapping). To obtain a cutoff for further analysis, RBPs were sorted by the maximum position-wise relative information on the 45S rRNA precursor, and at each value, the F1 score was calculated (defined as (2 × TP)/(2 × TP + FP + FN)) using the definitions described above. The maximum point at 0.101 was used for further analysis.

To quantify enrichment at the rmiR-663 ribosomal versus genomic paralog loci, sequences of rmiR-663 and four genomic-encoded paralogs (miR-663a, miR-663b, AC010970.1, and AC136932.1) were obtained from the UCSC Genome Browser, along with 100 nt of flanking sequence. Only reads that perfectly aligned (with zero mismatches or gaps) to these sequences were counted for further analysis.

Retrotransposable element analysis

L1 retrotransposition genome-wide CRISPR screening data was obtained from Liu et al. [38], using Combo casTLE Effect scores from K562 cells. Bonferroni correction was performed on uncorrected casTLE p values using n = 15 (the number of L1 (sense)-enriched RBPs queried).

To calculate change in expression of L1-containing bound genes, DESeq-calculated gene expression fold changes for RBP knockdown/RNA-seq data were obtained from the ENCODE DCC (http://www.encodeproject.org) for all RBPs with both eCLIP and RNA-seq performed in the same cell type. L1 sense and anti-sense elements were taken from RepeatMasker-predicted repetitive elements in the hg19 genome obtained from the UCSC Genome Browser [24]. For each gene in GENCODE v19, the transcript with the highest abundance in rRNA-depleted total RNA-seq in HepG2 (ENCODE accession ENCFF533XPJ, ENCFF321JIT) and K562 (ENCFF286GLL, ENCFF986DBN) was chosen as the representative transcript, and the set of expressed genes (10,247 in HepG2 and 9162 in K562 with TPM ≥ 1) were considered. Next, genes were separated into three classes: “≥ 1 bound L1(as)” genes with at least one antisense L1 element that overlapped a significant peak identified in eCLIP, “bgd with ≥ 1 L1(as)” genes with at least 1 antisense L1 element but did not have an element that overlapped with an eCLIP peak, or “Bgd” which contained all expressed genes. Significance was determined by the Kolmogorov-Smirnov test with no multiple hypothesis testing correction.

To compare reference versus divergent L1 elements, we defined “canonical” reads as those which mapped best (and were assigned) to sequences present in RepBase, whereas “divergent” reads mapped better to unique genomic loci than to the reference sequence.

Calculation of overall element coverage (Additional file 3: Fig. S4b) was based on the above set of 9162 reference transcripts in K562 expressed with TPM ≥ 1.

Meta-gene and meta-exon peak density maps

To generate meta-gene and meta-exon maps, for each gene in GENCODE v19, the transcript with the highest abundance in rRNA-depleted total RNA-seq in HepG2 (ENCODE accession ENCFF533XPJ, ENCFF321JIT) and K562 (ENCFF286GLL, ENCFF986DBN) was chosen as the representative transcript, and the set of expressed genes (10,247 in HepG2 and 9162 in K562 with TPM ≥ 1) were considered. Datasets with fewer than 100 mRNA-overlapping peaks were discarded, leaving 205 datasets. Next, each gene was split into 162 bins (13 for 5′UTR, 100 for CDS, 49 for 3′UTR), based on the median 5′UTR, CDS, and 3′UTR lengths of highly expressed (TPM ≥ 10) GENCODE v19 transcripts in K562 cells. For each eCLIP dataset, the average peak coverage for each bin was calculated for each gene and then averaged over all genes to generate the final meta-gene plot. To generate confidence intervals, bootstrapping was performed by randomly selecting (with replacement) the same number of transcripts and calculating the average position-level peak coverage as above, with the 5th and 95th percentiles (out of 100 permutations) shown. For further visualization and analysis, only 104 RBPs where the 5th percentile was at least 0.002 peaks per gene (~ 20 peaks in at least one bin) were considered. Normalized coverage was then calculated by setting the maximum position to one and minimum position to zero for each eCLIP dataset. Cross-position correlations were calculated using normalized coverage across all 104 RBPs at each position. Odds ratios and significance (determined by Fisher’s exact test or Yates’ chi-square test if observed and expected values were greater than five) utilized RBP annotations (Additional file 3) from [… 6), an additional normalization was performed by dividing each position by the maximum meta-exon value for that dataset, in order to scale the meta-exon profiles between 0 and 1.
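The averaging, bootstrap, and normalization steps above can be sketched in a few lines. The sketch assumes each gene has already been reduced to a list of per-bin peak coverage values (162 bins in the text, 2 in the toy example), uses a simple nearest-rank percentile, and assumes a non-flat profile for min-max scaling; all function names are illustrative.

```python
import random

def meta_profile(gene_profiles):
    """Average per-bin coverage over all genes."""
    n = len(gene_profiles)
    return [sum(column) / n for column in zip(*gene_profiles)]

def bootstrap_interval(gene_profiles, n_boot=100, lower=5, upper=95, seed=0):
    """Resample genes with replacement n_boot times and return per-bin
    lower and upper percentile profiles (nearest-rank percentiles)."""
    rng = random.Random(seed)
    n = len(gene_profiles)
    samples = []
    for _ in range(n_boot):
        resampled = [gene_profiles[rng.randrange(n)] for _ in range(n)]
        samples.append(meta_profile(resampled))
    lo, hi = [], []
    for values in zip(*samples):
        ordered = sorted(values)
        lo.append(ordered[int(lower / 100 * (n_boot - 1))])
        hi.append(ordered[int(upper / 100 * (n_boot - 1))])
    return lo, hi

def min_max_normalize(profile):
    """Scale a profile so its minimum is 0 and its maximum is 1."""
    low, high = min(profile), max(profile)
    return [(v - low) / (high - low) for v in profile]

genes = [[0, 1], [1, 1]]  # two toy genes, two bins each
print(meta_profile(genes))
print(bootstrap_interval(genes))
print(min_max_normalize([1, 2, 3]))
```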

Analysis of AQR enrichment at branch points

To identify points of enriched read termination in AQR eCLIP, regions from − 50 nt to − 15 nt from annotated 3′ splice sites were obtained from GENCODE v19, and the subset of regions with at least 20 overlapping reads in AQR eCLIP in K562 cells were taken for further analysis. Points of enrichment were identified as those where more than half of reads overlapping the overall region terminated at the same position. Motif analysis was performed by counting the frequency of 11-mers centered on the read start position with 5 nt flanking on either side. Motif logos were generated with seqLogo (R).
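The majority rule for calling a point of enriched termination can be sketched as below. The sketch assumes the read start (5′ end) positions within one region have already been collected relative to the 3′ splice site, and that sequences are plain 0-based strings; both helper names are hypothetical.

```python
from collections import Counter

def putative_branch_point(read_starts, min_reads=20):
    """Return the position where more than half of the region's reads start,
    or None if the region has too few reads or no majority position."""
    if len(read_starts) < min_reads:
        return None
    position, count = Counter(read_starts).most_common(1)[0]
    return position if count > len(read_starts) / 2 else None

def centered_kmer(sequence, index, flank=5):
    """11-mer (for flank=5) centered on a 0-based index; None near the ends."""
    if index - flank < 0 or index + flank >= len(sequence):
        return None
    return sequence[index - flank:index + flank + 1]

starts = [-25] * 15 + [-30] * 10          # 15 of 25 reads start at -25
print(putative_branch_point(starts))
print(centered_kmer("ACGTACGTACGTA", 6))
```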

Enrichment of branch point factors at alternative 3′ splice site events

Splicing maps profiling normalized enrichment for SF3B4 and SF3A3 at RBP knockdown-responsive alternative 3′ splice site events were generated as previously described [20, 76]. In brief, the set of differential 3′ splice site events for RBP-knockdown/RNA-seq was identified from rMATS analysis between RBP knockdown and paired non-target control. Normalized read density in eCLIP was then calculated for each differential event by subtracting input read density from IP read density (each normalized per million mapped reads). To weigh each event equally, position-wise subtracted read density was then normalized to sum to one across the entire event region (composed of 50 nt of exonic and 300 nt of flanking intron), including a pseudocount of one read (normalized by total mapped read density) at each position. The highest 2.5% and lowest 2.5% values at each position across all events were then removed, and the mean was then calculated across all other events to define the final splicing map. As a control, a set of “native” alternative 3′ splice site events was defined as those which showed alternative usage (0.05 < inclusion < 0.95) in control K562 or HepG2 cells, respectively. Confidence intervals were generated by randomly sampling the number of events in the RBP-responsive class from the native alternative 3′ splice site set 1000 times, processing this sampled set as described above, and plotting the 0.5th to 99.5th percentiles.
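The final outlier-trimming step can be illustrated with a position-wise trimmed mean. This sketch covers only that step and assumes each event has already been converted to a normalized per-position profile; how the number of removed values is rounded is not stated in the text, so truncation is used here as an assumption.

```python
def trimmed_mean(values, trim_fraction=0.025):
    """Mean after dropping the highest and lowest trim_fraction of values."""
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)  # values removed from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

def splicing_map(event_profiles, trim_fraction=0.025):
    """Position-wise trimmed mean across events (one profile per event)."""
    return [trimmed_mean(column, trim_fraction) for column in zip(*event_profiles)]

# 40 hypothetical events, one position: a single extreme value at each end
# is removed before averaging
column = [1.0] * 38 + [1000.0, -1000.0]
print(trimmed_mean(column))
```

Trimming both tails keeps a handful of very highly covered events from dominating the average at any one position.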

Co-occurrence of RBP eCLIP peaks and validation of subcomplexes of RBPs

Overlap between eCLIP datasets A and B was determined by calculating the fraction of significant and reproducible peaks in dataset A that overlapped (by at least one base) a peak in dataset B, and vice versa the fraction of peaks in B that overlapped a peak in A, and taking the maximum of those fractions as the overall pairwise fraction overlap. Only datasets with at least 100 reproducible and significant peaks were used for this analysis. Gene Set Enrichment Analysis was performed using the GSEA software package [77]. RBP interaction data was obtained from the BioPlex 2.0 dataset [52].
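The pairwise overlap definition can be sketched as follows. Peaks are represented as hypothetical (chromosome, strand, start, end) tuples with an exclusive end coordinate, and strand matching is an assumption of this sketch; the quadratic comparison is for clarity only, and an interval index would be preferable for full peak sets.

```python
def peaks_overlap(a, b):
    """True if two peaks on the same chromosome and strand share >= 1 base."""
    return a[0] == b[0] and a[1] == b[1] and a[2] < b[3] and b[2] < a[3]

def fraction_overlapping(peaks_a, peaks_b):
    """Fraction of peaks in A that overlap at least one peak in B."""
    hits = sum(1 for a in peaks_a if any(peaks_overlap(a, b) for b in peaks_b))
    return hits / len(peaks_a)

def pairwise_overlap(peaks_a, peaks_b):
    """Maximum of the two directional overlap fractions."""
    return max(fraction_overlapping(peaks_a, peaks_b),
               fraction_overlapping(peaks_b, peaks_a))

a = [("chr1", "+", 0, 10), ("chr1", "+", 100, 110)]
b = [("chr1", "+", 5, 15), ("chr2", "+", 0, 10),
     ("chr1", "-", 100, 110), ("chr1", "+", 200, 210)]
print(pairwise_overlap(a, b))
```

Taking the maximum of the two directions means a dataset with few peaks that sit largely inside a much larger peak set still scores as highly overlapping.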

IP-western validation was performed using HNRNPL (ab6106, Abcam), RBFOX2 (A300-864A, Bethyl), FMR1 (RN016P, Bethyl), AGGF1 (A303-634A, Bethyl), and TNRC6A (RN033P, MBLI) antibodies in UV crosslinked K562 cells. Immunoprecipitation in high-salt wash conditions was performed using standard eCLIP wash buffers, beads, and other reagents [18]. Low-salt co-immunoprecipitation conditions used identical conditions, except for lysis buffer (50 mM Tris-HCl pH 7.5, 150 mM NaCl, 1% Triton X-100, 0.1% Sodium deoxycholate, and Protease Inhibitor cocktail (Promega)) and wash buffer (5 washes total in TBS + 0.05% NP-40). Westerns were probed with HNRNPL (ab6106, Abcam) primary antibody and TrueBlot secondary (Rockland).