Background

A major challenge in biology is to understand how complex regulatory networks emerge during evolution. An important mechanism for expanding complexity is alternative pre-mRNA splicing (AS), the process by which exonic regions are differentially excised to create multiple transcripts from a single gene locus. Recent surveys of organ transcriptomes across several vertebrate species have revealed AS profiles have diverged rapidly during vertebrate evolution, whereas organ mRNA expression profiles have remained relatively conserved [1, 2]. Moreover, several studies have described examples of how the emergence of lineage-specific isoforms can create novel phenotypes [3,4,5].

However, the emergence of AS-dependent complexity comes at a cost. On the one hand, AS confers flexibility to gene function by altering reading frame and tuning transcript stability [1, 4, 6,7,8]. On the other hand, the inappropriate recognition of intronic sequences resembling splice sites can give rise to non-canonical regulatory events that disrupt gene expression [9]. These deleterious events often manifest in human disease [10]. Despite the importance of AS in evolution, the mechanisms and genomic features that control the balance between its promotion of novel functionality and its propensity to cause disease are poorly understood.

In this study, we surveyed the human transcriptome to identify thousands of novel exonization events, the process by which non-canonical intronic sequences are incorporated into mRNA transcripts. We show that these events do not occur randomly within the genome but are enriched within cell cycle and cell signaling genes. Exonization events occur within m6a (N6-methyladenosine)-modified long introns close to the transcription start site and often overlap Alu and L1 transposon events. The inclusion of these novel exons is promoted by regulatory events that expand the “window of opportunity” for spliceosome recognition, such as the rate of RNA polymerase II elongation and splicing efficiency dynamics. This multilayered system can be actively regulated by exogenous agents, permitting the emergence of novel regulation, as exemplified by UV irradiation, which promotes exonization within cell cycle genes to suppress their ribosomal engagement. We further provide evidence that exonization is suppressed in hematological cancers. Thus, we identify a highly evolvable mechanism that can expand the regulatory complexity of cells.

Results

Exonization events occur in introns enriched with new transposons

To investigate the potential for novel exonization events to occur within the human transcriptome, we analyzed over 400 shRNA knockdown RNA-seq datasets from HepG2 cell lines [11] to identify reads mapping between known exons and novel intronic sequences (Fig. 1a and the “Methods” section). We only considered reads mapping to exon-exon junctions (EEJs) supported by at least 5 reads and a percent spliced in (PSI) value of at least 5%. Novel exons were defined as those absent from annotation databases [13, 14] and all non-perturbed control datasets (Fig. 1a). Confirming the validity of this approach, the knockdown of the RNA binding protein (RBP) heterogeneous nuclear ribonucleoprotein C (hnRNPC) created the most Alu-derived exonization events (Additional file 1: Figure S1), in line with previous observations [15]. In total, we detected 13,103 novel exonic events within 4774 genes, or 30.6% of evaluated human protein-coding genes, under the perturbations we surveyed.

Fig. 1

Genomic features of introns with exonization events. a Workflow to identify novel exonization events (see the “Methods” section). Briefly, RNA-seq from shRNA knockdown of RNA binding proteins in HepG2 is analyzed by 2-pass enabled STAR, and then novel junctions are incorporated into index files analyzed by Whippet [12]. Identified exons are filtered to remove exon-exon junctions and events occurring in any of the matched control samples, as well as those annotated in genome databases. Only events supported by > 5 reads mapping over exon-exon junctions and a percent spliced in (PSI) greater than 5% are included. b Plot showing the results from a logistic linear regression analysis aimed at identifying features important in discriminating introns prone to exonization events from all other expressed introns. Features in bold significantly contribute to the model (p < 0.01, Student t test). TSS, transcription start site; ppt_len, polypyrimidine tract length; 5′ss, 5′-splice site; 3′ss, 3′-splice site; bp_scr, branchpoint score; SS_dist, splice site distance; BP_num, branchpoint number; AGEZ, AG dinucleotide Exclusion Zone length; TAD, topologically associating domain; ppt_scr, polypyrimidine tract score (n = 13,103). c Plot showing the results from a logistic linear regression analysis aimed at identifying the type of transposable elements that most effectively discriminate introns prone to exonization events compared to all other expressed introns. Features in bold significantly contribute to the model (p < 0.01, Student t test). Nodes are colored by average estimated age of when transposable elements arose (n = 13,103). d Enrichment map for GO, REACTOME, and KEGG functional categories of genes that contain Alu-exonization events, with representative GO terms shown for each sub-network (see Additional file 1: Figure S1 for annotated version). Node size is proportional to the number of genes associated with the GO category, and edge width is proportional to the number of genes shared between GO categories.

To investigate the mechanisms underlying these exonization events, we collated a list of features classically associated with alternative splicing, including splice site strength, GC content, and polypyrimidine tract length (Additional file 2: Table S1). Logistic linear regression was then used to compare these events with a “background” group of expressed introns lacking any evidence of exonization (Fig. 1b). Validating our choice of genomic features, our model achieves a high average area under the receiver operating characteristic (ROC) curve (AUC) of 75.2% (Additional file 1: Figure S1). Moreover, we were able to confirm previous results that exonization events tend to occur in long introns with a high GC content [16] (Additional file 1: Figure S1, intron length: p < 3.53 × 10−59, GC content: p < 1.01 × 10−73, Student t test). Notably, we find exonization events often overlap nucleosome-binding sites (Additional file 1: Figure S1, p < 2.37 × 10−23, Wilcoxon test), rarely occur at the 3′-end of the gene body (p < 5.90 × 10−21, Student t test) and show a significant tendency to occur within 5′-UTRs (Additional file 1: Figure S1, p < 7.13 × 10−166, Fisher exact test). Importantly, we also observe that the strongest predictor for exonization was the occurrence of transposable elements overlapping the novel exon (Fig. 1b and Additional file 1: Figure S1, p < 6.46 × 10−127, Student t test). In comparison to these strong predictors, no cis-regulatory splicing elements contribute significantly to the model or show significant differences between the datasets (Additional file 1: Figure S1, p > 0.05, Student t test).

To evaluate the conservation of novel exon usage across species, we analyzed the extent of exonization across multiple matched tissue types within four primate species spanning 30 million years of primate evolution. To explore exonization usage between samples, genes with events occurring in all four species were identified and sorted using affinity propagation clustering. In line with canonical alternative splicing [1, 2], samples from the same species invariably clustered together (Additional file 1: Figure S2). The notable exception to this trend was observed in samples from the testes, which showed tissue-specific clustering. This suggests within the testes there is a strong exonization signature conserved across primate species in multiple genes (Additional file 1: Figure S2).

To further investigate the influence of retrotransposons on exonization, we sub-divided the transposable elements into their major sub-families. In line with the lineage-specificity of exonization events, the most significant contributors to the model are transposons younger than 70 million years. In particular, the Alu element sub-family members AluJ and AluS, as well as the highly mobile L1 elements, are strong predictors (Fig. 1c). Interestingly, an exception to this rule is the AluY sub-family [17], which shows a pattern reminiscent of much older transposon events (Fig. 1c). This difference is potentially due to its relative depletion within gene bodies (3% for AluY vs. 19% for AluS; occurrence in expressed introns). Finally, we wished to assess which types of genes contain Alu-exonization events. Interestingly, we find a strong enrichment for functions related to cell signaling and cell cycle regulation (Fig. 1d, Additional file 1: Figure S1 and Additional file 3: Table S2). Alongside previous examples [8, 9, 15, 18], our observation of an extensive number of exonization events overlapping new transposons suggests a novel source of transcriptomic complexity.

m6a RNA binding proteins suppress exonization

An evaluation of the trans-factors promoting exonization (Additional file 1: Figure S2 and Additional file 4: Table S3) revealed an enrichment of m6a (N6-methyladenosine) binding RBPs, especially among Alu-containing novel exons (Fig. 2a, p < 0.05, hypergeometric test). This included hnRNPC, which has been previously shown to induce a large number of Alu-specific exonization events [15], as well as DiGeorge syndrome critical region 8 (DGCR8) and YTH domain-containing protein 2 (YTHDC2). To examine the potential role of m6a marks in exonization, we analyzed knockdown data of the m6a modification enzyme N6-adenosine-methyltransferase subunit (METTL3) [18]. This analysis revealed a significant increase in the number of detectable exonization events upon METTL3 knockdown (Fig. 2b, p < 3.57 × 10−03, Wilcoxon rank-sum test), consistent with m6a regulating the inclusion of novel exons. Further analysis of these METTL3-dependent exonization events revealed a functional enrichment of genes associated with DNA damage (p < 2.68 × 10−02, FDR-corrected p value). Next, we analyzed data from HeLa cells consisting of two knockdowns of known m6a regulators (serine/arginine-rich splicing factor 3 (SRSF3) and YTH domain-containing protein 1 (YTHDC1)) and two knockdowns of RBPs not known to directly recognize m6a (serine/arginine-rich splicing factors 9 and 10 (SRSF9 and SRSF10)) [19] (Fig. 2c, p < 3.23 × 10−123, Wilcoxon rank-sum test; Fig. 2d, p < 2.13 × 10−63, Wilcoxon rank-sum test; Additional file 1: Figure S2). Altogether, this suggests m6a modification and binding proteins are key regulators of (Alu-containing) exonization events.

Mechanisms increasing the “window of opportunity” for spliceosome recruitment promote exonization

Our initial analysis revealed the prediction of exonization is strongly enhanced by repeat elements, intron length and GC content. These features are known to negatively correlate with RNA polymerase II (RNAPII) elongation rate [23]. We therefore hypothesized that changes in RNAPII elongation rate may promote exonization. To test this, we analyzed RNA-seq data from human cells expressing mutations that increase (E1126G) or decrease (R749H) the elongation rate of RNA polymerase II (RNAPII) [24]. To determine the elongation rate for each mutant, we analyzed data from genome-wide nuclear run-on sequencing (GRO-seq) assay combined with transcription elongation inhibitor DRB (see the “Methods” section and [24]). We observe that mutations slowing the rate of RNAPII elongation (R749H) strongly induced exonization events (Fig. 3a, ALL: p < 3.18 × 10−45, Fisher’s exact test), with this change especially strong in Alu-containing novel exons (Fig. 3a, ALU: p < 5.62 × 10−53, Fisher’s exact test). In contrast, mutations that speed up elongation had negligible effects on the number of exonization events detected (Fig. 3a; p > 0.05, Fisher’s exact test).

Fig. 3

Reduced rate of RNA polymerase II elongation and poor splicing efficiency promote exonization. a Dot plot showing the impact of RNA polymerase mutations on exonization of Alu-containing exons. WT, wildtype; Fast, E1126G mutation; Slow, R749H mutation. Each point represents an individual dataset. KB, kilobase; min, minutes. Elongation, rate of transcriptional elongation (see the “Methods” section and [24]). b Boxplot of splicing efficiency for introns with exonization events vs all expressed introns with no evidence of exonization. Splicing efficiency is a metric describing the speed of intron excision, as measured by nascent RNA-seq using BrU-chase at 0, 15, 30, and 60 min. See Fig. 2d for the description of boxplots. ***p < 1 × 10−10; p value calculated using the Wilcoxon rank-sum test (n = 4,011). c Stacked bar plot showing distributions of introns for splicing efficiency identified by BrU-chase. Groups were assigned by K-means clustering (k = 5) (see the “Methods” section); see Additional file 1: Figure S3a for distributions of splicing efficiencies (n = 83,972).

This result supports the competition model of alternative splicing [24, 25], wherein the regulation of exon inclusion is associated with a “window of opportunity” for spliceosome recognition. If this connection between exonization and opportunity is valid, an independent mechanism for the emergence of novel exons should occur within introns that are slowly processed by the splicing machinery. To evaluate this hypothesis, we analyzed nascent RNA data from BrU-Chase-seq [21], in which cells are labeled with a 15-min BrU pulse and chased for 0, 15, 30, and 60 min. To determine splicing kinetics for each intron, we calculated the splicing efficiency dynamics (SEDs), or the rate of intron excision [21, 26] (see the “Methods” section). K-means clustering was then used to identify five groups of introns with SEDs ranging from very fast to very slow (Additional file 1: Figure S3). Introns containing exonization events were then compared to a background set of all expressed introns. Strikingly, this analysis reveals that introns containing exonization events are strongly enriched within the slowest SED cluster (Fig. 3b, p < 3.23 × 10−239, Wilcoxon rank-sum test, and Additional file 1: Figure S3). Moreover, these introns display a highly significant reduction in SED compared to background groups (Fig. 3c and Additional file 1: Figure S3, p < 2.2 × 10−160, Wilcoxon rank-sum test). Together, these observations suggest that mechanisms that expand the “window of opportunity” will increase the likelihood of recognition by the splicing machinery and thereby increase the rate of exonization.

DNA damage induces exonization within cell cycle genes

Exogenous processes can also promote alterations in transcription elongation and therefore may alter rates of exonization. To investigate this, we focused on UV irradiation, as previous studies have demonstrated that it promotes both the hyperphosphorylation of RNAPII leading to the subsequent inhibition of transcription elongation [27, 28] and the recruitment of the m6a machinery to sites of DNA damage [25, 38] for the spliceosome to recognize novel exons. We highlight that this can occur by decreasing the rate of RNAPII elongation and is associated with slow splicing efficiency dynamics. These novel exons are also marked by m6a RNA modifications. This multi-layered system permits exogenous forces to regulate exonization (Fig. 6). We demonstrate that UV irradiation increases the rate of exonization within cell cycle genes, potentially by slowing RNAPII elongation [27, 28], and observe that exonization within these genes coincides with reduced polysome engagement. Furthermore, we describe how this “window of opportunity” mechanism is repressed in cancer and link this suppression to particular cancer mutations [34] within RNA binding splicing factors. Collectively, these results provide new insights into the control and dynamics of exonization in different biological and disease contexts, as well as highlighting an evolutionary process with the potential to expand regulatory complexity within cells.

Fig. 6

A model summarizing results from this paper, contrasting regulatory mechanisms associated with opening (or facilitating) and closing (or inhibiting) the window of opportunity for exonization. RBP, RNA binding protein; RNAP II, RNA polymerase II; m6a, N6-methyladenosine

Previous work has shown that RBPs and nucleosome occupancy underlie the regulatory control of exonization. For example, competition between hnRNPC and the 3′-splicing factor U2AF2 [18], in tandem with nonsense-mediated decay [15], has been shown to restrict the inclusion of Alu-containing exons. On the other hand, high nucleosome occupancy is associated with the emergence of new exons [7, 16] and proposed to promote RNA polymerase II pausing [7, 16, 39]. Our observations expand these findings by identifying a “window of opportunity” model [25, 38] for exonization controlled by a multi-layered regulatory program, including m6A-associated RBPs that suppress the emergence of new exons. The regulatory networks controlling exonization events are highly interconnected, as RNAPII facilitates the deposition of m6a onto actively transcribing nascent transcripts [40], which is known to tune splicing efficiency [21]. Given our observations that exonization is subject to multi-layered regulatory control (i.e., RNAP-II, RBPs and m6A; see Fig. 6), it is also interesting to consider how this mechanism may influence the life cycle of a transcript. Our results show exonization is associated with decreases in polysome association of genes containing exonization events. An explanation for this observation is that short sequences derived from Alu elements [30] and transcripts with other repeats [41, 42] have increased nuclear accumulation, which would restrict the ribosomal accessibility of transcripts with exonization events (Additional file 1: Figure S4).

It is also noteworthy that a system likely evolved to suppress the aberrant impact of transposon inclusion on functional transcripts [43] may have been co-opted to create a novel regulatory mechanism. In support of this proposal, we find that UV irradiation is accompanied by an increase in exonization within cell cycle genes, potentially restricting the expression of key checkpoint regulators until the DNA damage process is complete. These changes may be the result of perturbations to the multi-layered regulatory network controlling exonization; for example, UV irradiation slows the elongation rate of RNA polymerase II [28], and DNA damage up-regulates the m6a-regulatory machinery [48].

Further quality checks were done using Trimmomatic [49] to remove adaptors, low-quality reads, and all reads less than 50 nucleotides in length.

Datasets

All datasets used are described in Additional file 5: Table S4 [2, 11, 21, 24, 28, 31, 50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65].

Alternative splicing RNA-seq analysis

Whippet [12] was used to analyze the RNA-seq data employed for the identification of exonization events. Whippet quantifies all combinations of EEJs, including cassette, mutually exclusive, and microexon events. Whippet (v1.0) was run using default settings with the “--biascorrect” option enabled to correct for 5′-sequence and GC content biases (https://github.com/timbitz/Whippet.jl).

To create the splice graphs required for Whippet splicing quantification, genome annotation files were extracted from Ensembl (Hg38, Release 93) [69]. For each dataset, this was supplemented with novel exon-exon junctions derived from whole-genome alignment by STAR [66] with the 2-pass setting enabled and outFilterMultimapNmax set to 10. Whippet index was run for each dataset with the “--bam” setting enabled and “--suppress-low-tsl.” Whippet quantification was run with the bias correction function enabled to correct 5′ sequence and GC content bias. Otherwise, the default settings were used, so that only reads mapping to exon-exon junctions were used to quantify splicing. Bedtools (intersect -v) [67] was used to remove all exons overlapping with annotation from the UCSC genome browser [13]. UCSC Liftover [13] was used to convert the coordinates of all non-human exons in the conservation analysis.

Gene expression RNA-seq analysis

Kallisto [68] was used with default settings, with an index constructed from data extracted from Ensembl Hg38 Release 93 [69].

Identification of exonization events

All events identified by Whippet were only considered as novel exonization events if they passed the following criteria: (1) Exon inclusion was supported by at least 5 corrected reads, as assigned by expectation maximization by Whippet. (2) Event must be CE (core exon) event. (3) Exon was not identified in any of the control/matched datasets. (4) Exon was not previously annotated in Ensembl Hg38 Gene transfer format (GTF) file or UCSC GTF file. (5) Exon had a percent-spliced in (PSI) value of at least 0.05 (i.e., 5%). (6) Exon-exon junction reads must occur between novel exon and known exon.

An exon was considered previously annotated if either exon-exon junction was annotated in the Hg38 GTF file (from Ensembl or UCSC). The “number of exonization events” refers to all events identified in this manner.
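The six criteria above can be read as a single boolean filter. The following is a minimal sketch under stated assumptions, not the authors' pipeline: the `Event` record, its field names, and the use of exon identifiers in plain sets for the control and annotation lookups are all hypothetical, chosen only to make the logic concrete.

```python
from dataclasses import dataclass

MIN_READS = 5      # criterion 1: at least 5 corrected reads
MIN_PSI = 0.05     # criterion 5: PSI of at least 0.05 (5%)


@dataclass
class Event:
    """Hypothetical record for one quantified splicing event."""
    exon_id: str                 # e.g. "chr1:100-200" (illustrative format)
    event_type: str              # event class, "CE" for core exon
    inclusion_reads: float       # EM-corrected reads supporting inclusion
    psi: float                   # percent spliced in, on a 0-1 scale
    flanked_by_known_exon: bool  # junction reads link novel exon to a known exon


def is_novel_exonization(event, control_exons, annotated_exons):
    """Return True only if the event passes all six criteria."""
    return (
        event.inclusion_reads >= MIN_READS            # (1) read support
        and event.event_type == "CE"                  # (2) core exon event
        and event.exon_id not in control_exons        # (3) absent from controls
        and event.exon_id not in annotated_exons      # (4) not annotated
        and event.psi >= MIN_PSI                      # (5) PSI threshold
        and event.flanked_by_known_exon               # (6) junction to known exon
    )
```

Counting the events for which this function returns True would then give the number of exonization events for a dataset.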

Overlap with known repeat elements

Repeat elements identified by RepeatMasker were downloaded from the UCSC table browser [13] in bed format. Bedtools intersect (-wao -f 0.2) was used to identify overlap of transposons with novel exons.

The frequency of Alu-transposable events is calculated as the proportion of exonization events overlapping transposons that are identified as Alu events. All Alu events identified by RepeatMasker (containing the annotation “Alu”) were grouped together.
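As an illustration of the calculation described above, the sketch below computes the Alu frequency from exon and repeat intervals. It is not bedtools itself: it assumes half-open coordinates, treats the novel exon as the feature to which the 0.2 minimum overlap fraction applies, and uses hypothetical tuple layouts for the inputs.

```python
def overlap_fraction(exon, repeat):
    """Fraction of the exon (start, end; half-open) covered by the repeat."""
    exon_start, exon_end = exon
    rep_start, rep_end = repeat
    overlap = max(0, min(exon_end, rep_end) - max(exon_start, rep_start))
    return overlap / (exon_end - exon_start)


def alu_frequency(exons, repeats, min_fraction=0.2):
    """Proportion of transposon-overlapping exons whose overlap includes an Alu.

    exons:   list of (chrom, start, end)
    repeats: list of (chrom, start, end, name), name as given by RepeatMasker
    """
    with_transposon = 0
    with_alu = 0
    for chrom, start, end in exons:
        # Names of all repeats covering at least min_fraction of this exon.
        hits = [
            name
            for (rep_chrom, rep_start, rep_end, name) in repeats
            if rep_chrom == chrom
            and overlap_fraction((start, end), (rep_start, rep_end)) >= min_fraction
        ]
        if hits:
            with_transposon += 1
            # All repeats whose annotation contains "Alu" are grouped together.
            if any("Alu" in name for name in hits):
                with_alu += 1
    return with_alu / with_transposon if with_transposon else 0.0
```

The quadratic scan over repeats is only suitable for small examples; an interval index would be needed at genome scale.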

Visualization of events

Splicing events were visualized using the --bam setting for whippet quant and displayed as Sashimi plots in the IGV browser (-DenableSashimi = “true”) [70].

Functional analysis

Functional enrichment analysis was performed using the g:Profiler (https://biit.cs.ut.ee/gprofiler/gost) tool [71]. Genes identified as containing novel splicing events were compared to a background of genes expressed in the sample (cRPKM or TPM > 1). Structured controlled vocabularies from the Gene Ontology organization, as well as information from the curated KEGG and Reactome databases, were included in the analysis. Only functional categories with more than five members and fewer than 2000 members were included in the analysis. Significance was assessed using the hypergeometric test. p values were corrected for multiple testing using the method of Benjamini-Hochberg. The Cytoscape application EnrichmentMap (baderlab.org/Software/EnrichmentMap) was used to visualize functional enrichment [72].
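To make the statistics in this paragraph concrete, the sketch below shows a one-sided hypergeometric enrichment test, the category size filter, and Benjamini-Hochberg adjustment. It is a standard-library illustration only and does not reproduce g:Profiler's implementation; the function names and argument order are assumptions.

```python
from math import comb


def keep_category(size):
    """Size filter: more than five and fewer than 2000 members."""
    return 5 < size < 2000


def hypergeom_pvalue(k, K, n, N):
    """P(X >= k): chance of seeing at least k category genes among n hits,
    when K of the N background genes belong to the category."""
    upper = min(K, n)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

Here `N` would be the number of expressed background genes and `n` the number of genes with novel splicing events.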

General logistic regression

All continuous data were normalized to ensure fair comparison between features. The R function glm was used with default settings except family = binomial(). Data were split into training and test sets with a 90:10 split. The ROC curve was calculated on the test data using the ROCR library.
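The workflow in this paragraph (normalize, split 90:10, fit a binomial model, score the held-out set) can be sketched as follows. This is a Python stand-in rather than the R glm and ROCR code used in the study: it assumes z-score normalization, fits the logistic model by plain gradient descent instead of glm's iteratively reweighted least squares, and computes the AUC by pairwise ranking. All function names and the learning-rate settings are hypothetical.

```python
import math
import random


def zscore(column):
    """Scale one feature column to zero mean and unit variance."""
    mean = sum(column) / len(column)
    var = sum((x - mean) ** 2 for x in column) / len(column)
    sd = math.sqrt(var) or 1.0  # guard against constant columns
    return [(x - mean) / sd for x in column]


def split_90_10(rows, seed=0):
    """Shuffle rows reproducibly and return (train, test) with a 90:10 split."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(round(0.9 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def predict(weights, x):
    """Logistic probability; the last weight is the intercept."""
    z = weights[-1] + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.1, epochs=500):
    """Fit logistic regression by batch gradient descent on the log-loss."""
    weights = [0.0] * (len(X[0]) + 1)
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for xi, yi in zip(X, y):
            err = predict(weights, xi) - yi
            for j, xij in enumerate(xi):
                grad[j] += err * xij
            grad[-1] += err
        weights = [w - lr * g / n for w, g in zip(weights, grad)]
    return weights


def auc(scores, labels):
    """Area under the ROC curve as the fraction of correctly ranked pairs."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

In use, the model would be fitted on the training rows and `auc` evaluated on the predicted probabilities for the test rows.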

Exonic features

MaxEntScan [73] was used to estimate the strength of 3′ and 5′ splice sites. 5′ splice site strength was assessed using a sequence including 3 nt of the exon and 6 nt of the adjacent intron. 3′ splice site strength was assessed using a sequence including 20 nt of the flanking intron and 3 nt of the exon. SVM-BPfinder [74] was used to estimate branchpoint and polypyrimidine tract strength and other statistics. Scores were estimated using the intronic sequence between 20 and 500 nt upstream of the 3′ end of the intron.
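The two sequence windows described above (a 9-mer for the 5′ splice site and a 23-mer for the 3′ splice site) can be extracted as in the sketch below. It only prepares the input sequences and does not compute MaxEntScan scores; the 0-based coordinate convention and function names are assumptions made for illustration.

```python
def five_prime_window(seq, intron_start):
    """Return the 9-mer around a 5' splice site: the last 3 nt of the exon
    followed by the first 6 nt of the intron.

    intron_start is the 0-based index of the first intronic base.
    """
    start, end = intron_start - 3, intron_start + 6
    if start < 0 or end > len(seq):
        raise ValueError("sequence too short for a 9-mer window")
    return seq[start:end]


def three_prime_window(seq, exon_start):
    """Return the 23-mer around a 3' splice site: the last 20 nt of the
    intron followed by the first 3 nt of the exon.

    exon_start is the 0-based index of the first exonic base.
    """
    start, end = exon_start - 20, exon_start + 3
    if start < 0 or end > len(seq):
        raise ValueError("sequence too short for a 23-mer window")
    return seq[start:end]
```

For exons on the minus strand, the sequence would need to be reverse-complemented before the windows are taken.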

Transcription start sites (TSS) were downloaded from Biomart. TAD boundaries for HepG2 were extracted from ENCODE [11] pre-processed data and converted to Hg38 by liftover. GC content was calculated using a Python script. Transposon information was downloaded from RepeatMasker as described above.
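The GC content script itself is not given; a minimal sketch of such a calculation is shown below. Excluding ambiguous bases such as N from the denominator is an assumption, not something stated in the text.

```python
def gc_content(seq):
    """Fraction of G and C among unambiguous bases (A, C, G, T) of a sequence."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0  # no unambiguous bases to evaluate
    return (seq.count("G") + seq.count("C")) / counted
```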

Nucleosome occupancy for HepG2 cells was calculated using data from Enroth et al. [75]. Colorspace read data were aligned using Bowtie [76] (-S -C -p 4 -m 3 --best --strata) using an index file constructed from Ensembl Hg38. Nuctools (with default settings) was used to calculate occupancy profiles and calculate occupancy at individual regions [77].

In feature analysis, only exonization events within introns detected in this analysis were used.

Splicing efficiency dynamics

Splicing efficiency dynamics were calculated using the approach described previously [26]. Briefly, reads were mapped to the Ensembl Hg38 assembly using STAR (2-pass enabled), and only uniquely mapped reads were kept for downstream analysis. Splicing index values were first calculated; these represent the ratio of the split reads mapping to the 5′ and 3′ splice junctions (SJ) of an intron divided by the sum of split plus non-split reads. The θ value (representing splicing efficiency, SE) was extracted from all pulse-chase time points for introns with at least five reads of coverage at both the 5′ and 3′ SJ. K-means clustering was used to identify five groups of distinct splicing efficiency (very fast, fast, medium, slow, and very slow). The splicing efficiency dynamics metric was calculated as SED = 1 / ((1.001 − θ_0min) × (1.001 − θ_60min)), where θ_0min and θ_60min denote θ at the 0- and 60-min chase time points.
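The splicing index and the SED formula above translate directly into code. The sketch below is illustrative only: the function names are hypothetical, and it takes pre-counted split and non-split reads rather than parsing alignments.

```python
def splicing_index(split_reads, unsplit_reads):
    """Split reads at an intron's splice junctions divided by all
    (split plus non-split) reads at those junctions."""
    total = split_reads + unsplit_reads
    if total == 0:
        raise ValueError("no reads cover the splice junctions")
    return split_reads / total


def splicing_efficiency_dynamics(theta_0min, theta_60min):
    """SED = 1 / ((1.001 - theta_0min) * (1.001 - theta_60min))."""
    return 1.0 / ((1.001 - theta_0min) * (1.001 - theta_60min))
```

The 1.001 offset keeps the denominator positive when θ reaches 1, so a fully spliced intron at both time points gives an SED of about 10^6, while an unspliced intron gives an SED just below 1.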

Identification of m6a reads

After genome-wide mapping to Hg38 assembly using STAR (2-pass enabled) [66], CLAM (CLiP-seq Analysis of Multi-mapped reads) […]

Nascent RNA-seq analysis (including GRO-seq)

Wavefront and elongation speeds were extracted from supplementary data of relevant papers [23, 28].

Investigation of influence of read depth on detection of novel exons

Reads were randomly sampled from a 100 M single-end HeLa RNA-seq dataset using the program “fastq-sample” from the “fastq-tools” (v0.8) pipeline with randomized seeds and no replacement. The identical pipeline (see the “Identification of exonization events” section of the Methods) was run on every subsample, and the percentage of Alu elements detected was recorded.
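Seeded sampling without replacement, as described above, can be illustrated with the sketch below. It is not the fastq-sample implementation: it assumes well-formed four-line FASTQ records and holds all records in memory, which would not scale to a 100 M read file.

```python
import random


def read_fastq_records(lines):
    """Group FASTQ lines into 4-line records (header, sequence, '+', quality)."""
    stripped = [line.rstrip("\n") for line in lines]
    usable = len(stripped) - len(stripped) % 4  # ignore a trailing partial record
    return [tuple(stripped[i:i + 4]) for i in range(0, usable, 4)]


def subsample(records, n, seed):
    """Draw n records without replacement using a reproducible seed."""
    rng = random.Random(seed)
    return rng.sample(records, n)
```

Repeating the draw at increasing values of `n`, each with a different seed, gives a series of subsamples on which detection can be compared against read depth.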