Background

It is well established that alternative splicing (AS) is a widespread mechanism responsible for increased protein diversity and complexity among eukaryotes. The importance of this mechanism in the regulation of gene function has raised the question of its role in the context of evolution. Recent studies separating exons by evolutionary ages have shown that species-specific exons are mostly alternatively spliced [1, 2] and previous analyses have shown that the converse seems to be the case, that is, many alternative exons are species-specific [3, 4]. Moreover, evolutionary rate measurements show differences between alternatively and constitutively spliced regions [5, 12]. These exons may occur as fortuitous additions to existing transcripts, but they confer an opportunity to explore new functions with negligible disruption of the usual protein function [3]. The study of the mechanisms by which these species-specific exons can appear and become established is therefore key for the understanding of splicing regulation.

Three main mechanisms have been identified as being responsible for the appearance of new exons: gene duplication events, tandem exon duplication events [13], and exaptation, whereby a genomic sequence that did not function as an exon becomes exonized. This last mechanism is mostly driven by transposable elements (TEs) in mammals [14-18]. In particular, Alu elements play a prominent role in exon creation in primates [19-21]. These elements have motifs that resemble splice sites as part of their consensus sequence, especially in the opposite orientation, which can become functional through specific mutations [22-24], allowing exonization of part of the element. RNA editing has also been identified as a mechanism triggering exon creation from Alu elements [25]. In this case, however, the splice site is not in the genomic sequence, but it is instead created during the RNA editing process.

The fact that species-specific exons are, in general, poorly included suggests that they mainly appear with weakly recognized splicing signals. In particular, this is the case for some examples of exonized Alu elements [20], for which the strength of the base pairing between the U1 snRNA and the functional 5' splice site of the Alu determines the level of inclusion [23]. Although alternative exons are generally associated with weaker splice sites compared to constitutive exons [26], the distributions of splice site scores for both types of exon greatly overlap, suggesting that the strength of the splice site alone cannot explain the observed differences in inclusion levels between species-specific and evolutionarily conserved exons. Indeed, splice sites are not the only signals governing the recognition of an exon. There are also splicing enhancers and silencers, which function as activators and repressors of the splicing mechanism, respectively. These can occur in exons as exonic splicing enhancers (ESEs) or silencers (ESSs), and in introns as intronic splicing enhancers or silencers. Many of these regulators have been identified using experimental [27] and computational [28, 29] methods, and recent analyses have recognized their changing role depending on their position along the exon or the intron [30, ...].

Exon creation from repetitive sequences

Along with tandem duplication events [13], exonization of TEs is one of the most important mechanisms of exon creation [17, 35, 39, 40]. Therefore, we assessed the overlap between exons from the three age sets and TEs, considering as overlap the cases in which the TE covers at least one of the splice sites. We found that PS exons have a high density of TEs in their flanking intronic regions (Figure 2a) and about 43% of the cases overlap TEs (Table 1). On the other hand, MS and VO exons have a very low density of TEs in the proximal adjacent intronic regions (Figure 2b, c) and show negligible overlap of TEs with their splice sites. Additionally, excluding the eight cases in which the exon overlaps more than one TE, we found that for 116 (79.5%) of the PS exons overlapping TEs, the TE is in the opposite strand of the exon. Although Alu elements, unlike other TEs such as L1 and Long Terminal Repeat (LTR) retrotransposons [41], were not found to have a bias in the strand of insertion in human introns [40], we find that most of the Alu elements (88.3%) overlapping a PS exon occur in the strand opposite to the gene (anti-sense). In only 9 out of the 77 cases (11.7%) did we find sense Alu elements, and in only 4 of these is the overlap complete. Moreover, the percentage of anti-sense cases for non-Alu TEs is 69.6%. This suggests that for TEs and, especially, for Alu elements, although insertion can potentially occur in either strand, exonization occurs mainly in the opposite strand. Interestingly, although we found no overlap in the MS set, we found 19 cases (less than 0.15%) in the VO set; many of these were simple repeats (Table 1). More details about the type of TE overlap are given in Table A1 in Additional data file 1. Remarkably, more than 50% of PS exons do not overlap a TE and cannot be explained by tandem duplication, as those cases were discarded during the exon classification.
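The overlap criterion and the strand comparison described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Interval` type, the function names, and the use of the first and last exonic bases as stand-ins for the two splice sites are hypothetical choices, not the pipeline used in the study.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Interval:
    """Hypothetical genomic interval: 0-based start, exclusive end."""
    start: int
    end: int
    strand: str  # "+" or "-"


def covers_splice_site(exon: Interval, te: Interval) -> bool:
    """True if the TE covers the first or the last base of the exon.

    Assumption: the exon boundary bases stand in for the splice sites.
    """
    first, last = exon.start, exon.end - 1
    return te.start <= first < te.end or te.start <= last < te.end


def orientation(exon: Interval, te: Interval) -> str:
    """Label the TE as 'sense' or 'antisense' relative to the exon."""
    return "sense" if exon.strand == te.strand else "antisense"


def antisense_fraction(pairs: Iterable[Tuple[Interval, Interval]]) -> Optional[float]:
    """Fraction of splice-site-overlapping TEs lying on the opposite strand."""
    hits = [(e, t) for e, t in pairs if covers_splice_site(e, t)]
    if not hits:
        return None
    anti = sum(1 for e, t in hits if orientation(e, t) == "antisense")
    return anti / len(hits)
```

With counts such as those reported above, 9 sense cases out of 77 correspond to 11.7%, leaving 68 anti-sense cases (88.3%).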

Table 1 Overlap with repetitive elements
Figure 2

Intronic densities for the main classes of repetitive elements. (a) Primate specific, (b) mammalian specific and (c) vertebrate and older. At each intronic position, the density was calculated as the proportion of cases in which the base was covered by a given type of repetitive element. We give on the x-axis the relative position from the splice junctions as negative if upstream of the acceptor site or positive if downstream of the donor site.
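The per-position density described in this caption can be illustrated as follows. This is a minimal sketch: the input layout (one list of half-open repeat intervals per exon, expressed as offsets from the splice junction) and the function name are assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def positional_density(
    cases: Sequence[Sequence[Tuple[int, int]]], window: int
) -> List[float]:
    """Proportion of cases in which each intronic offset is repeat-covered.

    cases  -- one entry per exon; each entry lists (start, end) half-open
              intervals of repeat coverage, as offsets 0..window-1 from the
              splice junction (hypothetical input layout).
    window -- number of intronic positions examined.
    """
    counts = [0] * window
    for intervals in cases:
        covered = [False] * window
        for start, end in intervals:
            # Clip each interval to the examined window.
            for pos in range(max(start, 0), min(end, window)):
                covered[pos] = True
        for pos, flag in enumerate(covered):
            counts[pos] += flag
    n = len(cases)
    return [c / n for c in counts] if n else [0.0] * window
```

Marking coverage per case before counting ensures that two overlapping repeat annotations in the same intron are counted once at a given position.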

Analysis of the splicing regulatory content of exons

In order to understand the properties of the splicing regulatory content that determine the observed differences in inclusion between exon sets, we conducted an analysis of splicing cis-regulatory elements in exons and their flanking introns. For this analysis we used three sets of splicing regulatory elements (SREs): 666 ESE hexamers [4 shows that pseudo-exons have ERA values distributed around zero for all SREs tested (ESEcomb, -0.029; SRall, -0.006; and ESScomb, -0.055). On the other hand, all real exons show positive values for ESEs and negative for ESSs. In particular, PS exons show the closest ERA values to pseudo-INT, but they are still significantly different (Mann-Whitney, ESEcomb p = 1.45 × 10⁻²⁰, SRall p = 2.88 × 10⁻⁸, and ESScomb p ≈ 0). Interestingly, we also observe differences between PS exons and MS/VO for two out of the three SRE sets used. For ESEcomb and ESScomb, PS exons show lower absolute ERA values (0.164 and -0.302, respectively) than MS (0.284 and -0.499) and VO (0.258 and -0.387) (see Table A3 in Additional data file 1). Relative to SRall, no significant differences between age groups were observed. ERA was also calculated for the individual SRE sets (see Materials and methods for details). These results can be found in Figure A4 in Additional data file 1.
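To make the idea of an exonic relative abundance more concrete, the sketch below contrasts a motif density in an exon with the density in its flanking introns. Both definitions here are assumptions introduced for illustration: density is taken as the fraction of nucleotides covered by at least one hexamer from the set, and the contrast is taken as a normalized difference bounded between -1 and 1. The exact ERA formula is given in Materials and methods and may differ.

```python
from typing import Set


def motif_density(seq: str, motifs: Set[str], k: int = 6) -> float:
    """Fraction of nucleotides covered by at least one k-mer from `motifs`.

    Assumption: density is measured as coverage, not as motif count.
    """
    seq = seq.upper()
    if len(seq) < k:
        return 0.0
    covered = [False] * len(seq)
    for i in range(len(seq) - k + 1):
        if seq[i:i + k] in motifs:
            for j in range(i, i + k):
                covered[j] = True
    return sum(covered) / len(seq)


def relative_abundance(exon_density: float, intron_density: float) -> float:
    """Illustrative ERA-like contrast (assumed form, not the paper's formula).

    Positive when the motif set is denser in the exon than in the flanking
    introns, negative when it is denser in the introns, zero when equal.
    """
    total = exon_density + intron_density
    if total == 0:
        return 0.0
    return (exon_density - intron_density) / total
```

Under this assumed form, a sequence with no local enrichment gives a value near zero, which is the behavior reported for pseudo-exons, while enhancers enriched in exons give positive values and silencers depleted from exons give negative ones.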

Figure 4

SRE ERA changes with age. Mean exonic relative abundance values for the three age groups (PS, MS and VO) and a set of pseudo-exons not overlapping any repeats (pseudo-INT) calculated for the three motif sets (ESEcomb, ESScomb and SRall). Exons overlapping Alu elements were excluded from the PS set. The standard error is also shown.

Focusing on MS and VO exons, we observe a surprising difference in the content of ESScomb motifs. VO exons present lower absolute ERA values than MS (Mann-Whitney, p = 3.06 × 10⁻¹⁰). This result derives from the fact that VO exons show relatively higher exonic densities of ESSs (0.272) compared to MS (0.213), while for intronic content no significant differences were found (Table A3 in Additional data file 1). Also, VO exons show slightly lower exonic densities for ESEcomb with respect to MS (MS 0.665, and VO 0.633; Mann-Whitney, p = 4.56 × 10⁻⁶). These results can be partially explained by the fact that VO exons have stronger splice sites. On the other hand, they also suggest that AS of VO exons may be more dependent on ESS content.

In order to understand if these regulatory elements were under different, possibly functional, constraints depending on the exon age, we investigated their conservation in the mouse orthologous exons (Figure 5). For this purpose, we calculated the functional conservation score (FCS; see Materials and methods for detailed description) for all three SRE sets on both MS and VO exon sets. This measure reflects the fraction of nucleotides that are covered by motifs from the same SRE set in both human and mouse. It correlates with the percentage of sequence conservation but also takes into account cases where a substitution does not change the regulatory character of a region. In general, VO exons have higher FCS values compared to MS exons for ESEcomb (Mann-Whitney p = 8.42 × 10⁻¹³), SRall (p = 4.64 × 10⁻¹⁴) and ESScomb (p = 2.99 × 10⁻¹⁶). Additionally, FCS is higher for ESEcomb than for ESScomb for both MS and VO exons (Mann-Whitney, p ≈ 0), which might reflect the importance of the conservation of the amount and position of ESEs in exons. In summary, although VO exons have lower density of ESEs, these are more conserved than in MS exons, indicating that ESE turnover is more frequent in MS compared to VO exons, in agreement with recent analyses [45]. Moreover, VO exons present a larger fraction of ESSs that are highly conserved, suggesting possible constraints due to AS regulation.
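The following sketch illustrates how a score of this kind can reward a substitution that preserves regulatory character. It rests on assumptions that are not taken from the paper: the two orthologous exons are supplied as an ungapped alignment of equal length, and the score is normalized by the number of human positions covered by the motif set. The formula actually used is the one in Materials and methods.

```python
from typing import Set


def covered_positions(seq: str, motifs: Set[str], k: int = 6) -> Set[int]:
    """Indices of nucleotides covered by at least one k-mer from `motifs`."""
    seq = seq.upper()
    covered: Set[int] = set()
    for i in range(len(seq) - k + 1):
        if seq[i:i + k] in motifs:
            covered.update(range(i, i + k))
    return covered


def functional_conservation(
    human: str, mouse: str, motifs: Set[str], k: int = 6
) -> float:
    """Illustrative FCS-like score for two aligned orthologous exons.

    Assumptions: the alignment is ungapped and of equal length, and the
    denominator is the number of human positions covered by the motif set.
    """
    if len(human) != len(mouse):
        raise ValueError("sequences must be aligned to the same length")
    in_human = covered_positions(human, motifs, k)
    if not in_human:
        return 0.0
    in_mouse = covered_positions(mouse, motifs, k)
    return len(in_human & in_mouse) / len(in_human)
```

For example, with a hypothetical motif set containing both GAAGAA and GAGGAA, a human GAAGAA aligned to a mouse GAGGAA scores as fully conserved even though one base differs, whereas plain sequence identity would penalize the substitution.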

Figure 5

SRE functional conservation between human and mouse. SRE FCS between human and mouse of exonic regions covered by ESEcomb, SRall and ESScomb motifs for mammalian specific and vertebrate or older exons. See Materials and methods section for formula.

Interestingly, considering all exons from the three age groups, ERA values tend to increase with EST inclusion level for ESEs (ESEcomb and SRall) and to decrease for ESSs (ESScomb). Figure 6 shows the mean ERA values plotted for bins of increasing EST inclusion levels. For ESEcomb (Figure 6a) and SRall (Figure 6b) we observe a consistent increase except at high EST inclusion levels, where SRall values slightly decrease. On the other hand, there is a consistent decrease for ESScomb at all EST inclusion levels (Figure 6c). Exonic and intronic densities do not show such gradients with EST inclusion levels (data not shown). Thus, inclusion levels seem to be determined by the local differences in the densities of motifs.

Figure 6

SRE exonic relative abundance and EST inclusion levels. Cumulative plot of ERA variation (y-axis) for bins of increasing maximum EST inclusion levels (x-axis) for (a) ESEcomb, (b) SRall and (c) ESScomb. The standard errors are also shown.
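A summary of this kind, a mean and standard error of ERA per inclusion bin, can be sketched as below. The bin edges, the record layout, and the use of simple non-overlapping bins are assumptions for illustration; the caption describes the plot as cumulative, and the binning scheme used for the figure is not reproduced here.

```python
import math
from typing import List, Optional, Sequence, Tuple


def mean_and_stderr(values: Sequence[float]) -> Tuple[float, float]:
    """Sample mean and standard error (0.0 when fewer than two values)."""
    n = len(values)
    mean = sum(values) / n
    if n < 2:
        return mean, 0.0
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(var / n)


def mean_era_per_bin(
    records: Sequence[Tuple[float, float]], edges: Sequence[float]
) -> List[Optional[Tuple[float, float]]]:
    """Group (inclusion, era) records into bins and summarize each bin.

    Bins are half-open [lo, hi), except the last one, which also includes
    its upper edge. Empty bins are reported as None.
    """
    bins: List[List[float]] = [[] for _ in range(len(edges) - 1)]
    for inclusion, era in records:
        for i in range(len(bins)):
            lo, hi = edges[i], edges[i + 1]
            is_last = i == len(bins) - 1
            if lo <= inclusion < hi or (is_last and inclusion == hi):
                bins[i].append(era)
                break
    return [mean_and_stderr(b) if b else None for b in bins]
```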

Case study: why Alu elements are a good substrate for exonization

It has been recently reported that all TEs have approximately the same exonization levels with the exception of Alu elements, whose exonization levels are almost three times higher than those of other TE families [40]. Additionally, the high number of Alu copies in the human genome and their propensity to accumulate in intronic regions [40] make this element the main source of new exons originating from TEs. It has been shown that in some cases, cryptic splice sites are enough to incorporate part of an Alu element in the mature transcript [22, 23] and that in other cases, specific splicing enhancers are needed for their inclusion [34]. We thus applied the ERA measure in order to understand which regulatory features, besides the presence of splice sites, may be responsible for the increased Alu exonization rate.

We compared the SRE densities between the subset of PS exons overlapped by Alu elements (PS-Alu) and a set of Alu pseudo-exons longer than 80 bp (pseudo-Alu) (see Materials and methods for details). Figure 7a, b show the mean exonic and intronic densities of the two ESE sets considered (ESEcomb and SRall) for PS-Alu and pseudo-Alu. The mean exonic densities of ESEcomb and SRall for PS-Alu (0.597 and 0.649, respectively) were significantly higher (Mann-Whitney, p = 4.89 × 10⁻¹² and p = 9.78 × 10⁻⁶) than the mean exonic densities for pseudo-Alu (0.514 and 0.593). Relative to ESScomb (Figure 7c), PS-Alu shows a mean value of exonic density of 0.150 while pseudo-Alu shows a mean value of 0.190 (Mann-Whitney, p = 1.09 × 10⁻⁴).
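The comparisons of density distributions reported here rely on the Mann-Whitney test. The sketch below shows one common variant, a two-sided test using midranks for ties and the normal approximation without continuity correction. The text does not state which variant was used, so this choice is an assumption, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def midranks(values: Sequence[float]) -> Tuple[List[float], List[int]]:
    """Ranks starting at 1, with tied values sharing their average rank.

    Also returns the size of each group of equal values.
    """
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    tie_sizes: List[int] = []
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        tie_sizes.append(j - i + 1)
        i = j + 1
    return ranks, tie_sizes


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (U statistic of x, two-sided p-value by normal approximation)."""
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    ranks, ties = midranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    n = n1 + n2
    # Variance of U with the standard correction for ties.
    tie_term = sum(t ** 3 - t for t in ties) / (n * (n - 1))
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term)
    if var_u == 0:
        return u1, 1.0
    z = (u1 - n1 * n2 / 2) / math.sqrt(var_u)
    # Two-sided tail probability of the standard normal distribution.
    return u1, math.erfc(abs(z) / math.sqrt(2))
```

The normal approximation is reasonable for sample sizes like those in this study (dozens to thousands of exons per set); for very small samples an exact test would be preferable.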

Surprisingly, we observe the opposite behavior when considering adjacent intronic regions. The mean values of the intronic density of ESEs are significantly lower for PS-Alu when compared to pseudo-Alu (Mann-Whitney, ESEcomb p = 3.64 × 10⁻⁴ and SRall p = 2.02 × 10⁻⁵), while for ESScomb the mean density values are higher (Mann-Whitney, p = 1.12 × 10⁻¹¹). All these results suggest that ESEs and ESSs play a role in Alu exonization. In Figure 7d we can observe that for PS-Alu, the mean ERA values for ESEcomb and SRall distribute around positive values (0.276 and 0.177) while the ESScomb values tend to distribute around a negative mean (-0.625). The absolute values are significantly greater than those obtained for pseudo-Alu (Mann-Whitney, p = 8.26 × 10⁻¹⁰, p = 1.31 × 10⁻⁷ and p = 3.75 × 10⁻¹⁰). Furthermore, the fact that ESScomb produces the greatest difference of means suggests that this sequence feature might be the main determinant in the exonization of Alu elements. Comparing PS exons overlapped and non-overlapped by Alus, we observe that the latter have higher exonic (0.247) and lower intronic (0.383) densities for ESScomb (Mann-Whitney, p = 6.29 × 10⁻⁸ and p = 1.83 × 10⁻⁴, respectively). Consequently, their absolute ERA mean values (-0.302) are lower than those observed for Alu overlapped exons and, surprisingly, lower than those observed for pseudo-Alu (-0.407) (Mann-Whitney, p = 3.94 × 10⁻¹⁰ and p = 6.03 × 10⁻⁵).

Figure 7

Alu's unique cis-regulatory context. Exonic and intronic densities of (a) ESEcomb, (b) SRall and (c) ESScomb motifs on primate specific exons overlapping Alu elements (PS-Alu) and on Alu pseudo-exons (pseudo-Alu). (d) Exonic relative abundance of ESEcomb, SRall and ESScomb motifs for primate specific exons overlapping Alu elements (PS-Alu) and for Alu pseudo-exons (pseudo-Alu). (e) Exonic relative abundance for the same sets of motifs in pseudo-exons overlapping other classes of repeats, namely DNA, LTR, LINE and SINE non-Alu (MIR) repeats. The error bars represent the standard error.

Finally, in order to test whether the properties found are Alu specific, we analyzed sets of pseudo-exons overlapping the other major families of mobile elements in the human genome: Long Interspersed Nuclear Elements (LINEs), LTRs, DNA transposons and non-Alu Short Interspersed Nuclear Elements (SINEs) (see Materials and methods for details). For each of these sets, we calculated the ERA distributions for the same SRE sets as before. As can be seen in Figure 7e, all the pseudo-exon sets show absolute ERA values close to zero. Moreover, they do not present the ERA pattern expected to favor exonization. Indeed, pseudo-exons overlapping DNA transposons and LINEs have negative ERA mean values for ESEcomb. The exception seems to be for LTR pseudo-exons, which have positive ERA values for ESEcomb and negative for ESScomb, but with very low absolute values. This suggests that the high rate of Alu exonization may simply be due to their lack of silencers.

Although Alu elements do not seem to have a strand bias when inserting within introns in human genes, protein-coding exons are mostly created from anti-sense Alu elements [40]. In fact, we could only find 64 cases of sense Alu pseudo-exons. In comparison, we could find more than 30,000 Alu pseudo-exons with the Alu in anti-sense. This difference can be explained by the efficiency of the splice sites [22, 23], as sense Alu exons do not contain the strong poly-pyrimidine tract typical of anti-sense ones. Furthermore, most PS exons overlapping anti-sense Alu elements are normally 80 bp long or greater. These lengths correspond, in most cases, with the most commonly used splice sites created by the anti-sense Alu [46] (data not shown). In order to understand the differences in exonization levels, we compared the properties of these two under-represented cases, sense Alu exons and anti-sense Alu exons shorter than 80 bp, making use of pseudo-exons overlapping these elements: pseudo-exons overlapping an Alu in the same orientation (pseudoSS-Alu) and pseudo-exons shorter than 80 bp that overlap an Alu in the opposite strand (pseudoSH-Alu) (see Materials and methods for details). Interestingly, both sets have a different content of splicing regulatory motifs with respect to anti-sense Alu pseudo-exons (pseudo-Alu) longer than 80 bp (Figure A5 in Additional data file 1). Even though pseudoSS-Alu shows for both sets of ESEs higher exonic densities with respect to the adjacent intronic regions (Figure A5A and A5B in Additional data file 1), no differences are observed for ESSs (Figure A5C in Additional data file 1). This leads to positive ERA values for ESEs (0.091 and 0.086) but close to zero values for ESSs (-0.023). On the other hand, pseudoSH-Alu shows negative ERA values for ESEs (-0.167 and -0.168) and close to zero mean ERA values (-0.040) for ESSs (Figure A5D in Additional data file 1). Thus, both pseudoSS-Alu and pseudoSH-Alu exons have ERA values for ESSs close to zero, as opposed to anti-sense Alu pseudo-exons and PS exons overlapping Alus, which have very large negative ERA values for ESSs. This suggests that the higher content of ESSs makes sense Alus and regions shorter than 80 bp within anti-sense Alus less prone to exonization.