Background

Alternative splicing (AS) is a post-transcriptional, mRNA modification process that allows a single gene to give rise to multiple protein isoforms [1, 2]. These spliced isoforms can have distinct molecular functions and biological roles and may be differentially expressed among tissues, life cycle stages or environmental conditions [3], resulting in involvement in more genetic interactions and biochemical pathways compared to non-AS genes [4]. Therefore, AS provides a significant boost to genomic complexity without necessitating a proportional increase in genome size. AS takes place to some extent in most eukaryotic organisms [57], and has been studied extensively in humans and model species, including Caenorhabditis elegans[812], but has not received much attention in parasitic nematodes. In fact, information on AS in parasitic nematodes is extremely sparse, and existing reports have focused on a few or single representative gene(s) [1316].

High throughput cDNA sequencing is the preferred method for detecting and quantifying AS. Today’s most prevalent sequencing protocols (e.g., 454, Illumina, Ion Torrent, etc.) involve fragmentation of nucleic acid molecules, construction of sequencing libraries, and generation of many thousands to millions or even billions of short reads. In the absence of a well-curated genome for comparison, these reads must be re-assembled de novo into contiguous sequences (contigs) that faithfully represent the full-length transcripts from which they were derived. Graph-based assembly algorithms have been developed to maintain associations between transcripts with shared contigs, making it possible to identify different isoforms of the same gene [17]. However, this procedure is computationally challenging, and various studies have shown that de novo cDNA assemblers typically overestimate the number of isoforms associated with a given locus and that many of the predicted isoforms are illegitimate [42] (SRA numbers: SRR868958, SRR868932, SRR868957, SRR868939, SRR868942), and the resulting raw reads were mapped to assembled isotigs using Bowtie2 (version 2.1.0, default parameters [39]). The coverage of each isotig, as assessed using RefCov version 0.3 (http://gmt.genome.wustl.edu/gmt-refcov/), is reported in Additional file 3: Table S3.

Table 1 Caenorhabditis elegans test assemblies 1

Parasitic nematode transcriptomes were assembled with parameters that showed the optimal performance on C. elegans data (minimum overlap of 100 bp and 95% identity, minimum contig length of 30, and heterozygosity specified). Large or complex datasets were reduced to a manageable size by digital read normalization prior to assembly using khmer with a word size of 31 bp (http://ged.msu.edu/papers/2012-diginorm/). N. americanus isotigs were compared with transcript isoforms reported along with the genome of N. americanus[43] by BLASTN (cutoff of ≥90% sequence identity over ≥75% length of the isotig in a single high-scoring segment pair).

Annotation of parasitic nematode transcriptomes

Parasite isotigs were searched against the GenBank non-redundant protein database (downloaded July 9, 2013), and non-overlap** top hits with e-value ≤1e-5 were recorded (Additional files 4, 5, 6, 7, 8, 9, 10, 11 and 12: Tables S4-S12). Prot4EST [44] was used to generate translations from the parasite isotigs, and InterPro protein domains and gene ontology terms were predicted from translated proteins using InterProScan [45, 46]. Transcript sequences, peptide translations, and annotations are reported in Additional files 4, 5, 6, 7, 8, 9, 10, 11 and 12: Tables S4-S12 and are available from Nematode.net [22].

Enrichment of AS isogroups associated with functional domains

The number of alternatively spliced and non-alternatively spliced isogroups associated with each InterPro domain was counted (Additional file 13: Table S13), and a non-parametric binomial distribution test was applied to each InterPro domain to test for enrichment of AS isogroups using the following input parameters: (i) the background frequency of AS isogroups across all species (40.5%); (ii) the number of AS isogroups associated with the InterPro domain across all species (i.e., number of “successes”); (iii) the total number of isogroups associated with the InterPro domain across all species (i.e., number of “trials”). In order to reduce false positives resulting from poorly represented domains, domains represented by fewer than ten isogroups or fewer than four species were ignored, reducing total number of domains considered from 5,190 to 3,141 (a 39.5% reduction; Additional file 14: Figure S1). P values calculated for each domain were population corrected using False Discovery Rate (FDR) correction [47], and a significance threshold of 0.01 on the corrected P values was used to determine which InterPro domains were significantly more often associated with AS isogroups than non-AS isogroups.

Results and discussion

Optimization of assembly parameters

cDNA libraries were generated from mixed stage C. elegans worms and sequenced using Roche/454 technology. Our workflow, from the processing of raw reads to the annotation of transcript isoforms and isogroups, is outlined in Figure 1. After trimming, filtering, and contaminant removal, 1,746,642 high-quality reads were mapped to C. elegans CDSs to assess the scope of the dataset prior to assembly, and 8,391 CDS isoforms from 7,487 of the 20,515C. elegans genes (36.5% of all genes) showed ≥50% breadth of coverage. This level of coverage is comparable to the level seen in previous transcriptomic surveys of non-model organisms [26, 28, 29] and is sufficient for testing assembly protocols. It was not our intention to perform a thorough study of AS in C. elegans; studies of this nature have been reported elsewhere [11].

Figure 1
figure 1

Roche/454 read processing, decontamination, assembly and annotation. Raw Roche/454 reads were converted from sff to fastq format for editing and assembly. Relevant adapter sequences were trimmed, and reads failing to meet quality and complexity thresholds were removed. Reads that successfully map to rRNA, bacterial, human or host sequences were also eliminated. The remaining, high-quality, species-specific reads were assembled with Newbler’s cDNA specific protocol using our optimized parameter combination, translated using Prot4EST [44], and annotated using InterProScan [45, 46]. Statistical analyses can be carried out at the level of isotigs (unique transcripts) or isogroups (unique genetic loci) depending on the nature of the investigation.

The Newbler assembler, distributed by 454 Life Sciences, is considered the gold standard for Roche/454 read assembly. Using the cDNA option, Newbler identifies regions of shared sequence, termed contigs, and compiles them into full-length transcripts, termed isotigs. Isotigs with shared contigs, theoretically derived from AS of the same gene, are clustered into isogroups representing distinct genetic loci. We tested various combinations of program parameters in order to reduce assembly errors and increase the percentage of isotigs and isogroups that accurately represent known C. elegans sequences (Table 1). The best results were obtained with a contig length of 30 bp, minimum read overlap of 100 bp, minimum sequence identity of 95% (Table 1, last column). The heterozygous mode had little effect on our C. elegans assemblies, but we chose to invoke this option to accommodate the genetic heterogeneity of our parasitic worm datasets. Using these parameters, 96.7% of the clean reads were assembled into 15,940 isogroups containing 16,772 isotigs. Some 691 (4.3%) of these isogroups are associated with more than one isotig, with an average of 2.2 isotigs per AS isogroup (Table 1). Approximately 17% of the C. elegans genes reported in WormBase build WS236 are associated with more than one CDS isoform with an average of 2.6 isoforms per AS gene [41], and a 2011 study reported that at least 25% of all C. elegans genes undergo AS [11]. The relatively low rate of AS detected in our test assembly is probably a reflection of the clonal worm population that, despite being mixed-stage, was dominated at the tissue level by the relatively large adult hermaphrodites. Sampling each sex and life cycle stage independently could have provided greater resolution of AS events; however, the aim of this exercise was to optimize assembly protocols, not to explore AS in the model worm.

By adjusting assembly parameters, we were able to increase the number and percentage of isotigs and isogroups that accurately reflected known CDSs, increase the coverage of the gene set, and reduce the rates of misassembled transcripts (i.e., cis- and trans-chimeras). In the best version of our transcriptome assembly, 38.07% of the isotigs were matched to 7,027 distinct CDS isoforms from 5,727C. elegans genes (Table 1, last column). Match rates increased when isotigs were compared to coding transcripts (CDSs plus untranslated regions) rather than CDSs, indicating that a portion of our sequence data corresponds to untranslated regions at the extreme ends of the transcript.

Despite careful control of assembly parameters, approximately 30% of the assembled isotigs failed to find a BLAST match to a coding transcript isoform (≥90% sequence identity over ≥75% of the length of the isotig in a single high-scoring segment pair). The identification of novel AS isoforms and genes is to be expected given that the reported rates of AS in C. elegans have steadily increased over time (Table 2). Therefore, we sought to verify the remaining isotigs using data from another sequencing platform. Of the 4,994 un-matched isotigs, 478 showed 100% breath of coverage with Illumina RNAseq reads, a strong indication that they reflect real, expressed transcript isoforms, not sequence misassembles (see Additional file 3: Table S3). An example of this is illustrated in Figure 2. Isogroup00600 contains two distinct isotigs derived from C. elegans gene C18E3.6. One isotig corresponds perfectly to the gene model while the other is missing a 50 base pair segment of the gene’s fourth exon. Sequence reads generated on the Roche/454 and Illumina platforms both support the sequence gap despite the fact that this isoform is not represented in WormBase release WS236.

Table 2 Caenorhabditis elegans assembly statistics 1
Figure 2
figure 2

Alternative splicing of C. elegans gene C05B5.5. (A) Isogroup00600 from our de novo cDNA assembly contains two isotigs derived from C. elegans gene C18E3.6 (exons depicted as blue bars in top track). Alignment of Roche/454 reads (green bars with arrowheads indicating directionality) gave rise to three distinct contigs (dark, medium and light orange bars). These contigs were pieced together to form isotigs 01225 and 01226 based on read support displayed in the contig graph. Isotig01225, which contains all three contigs, corresponds perfectly to the gene model (blue bars). However, isotig01226 includes only the first (light orange) and third (dark orange), which results in a 50 bp gap with respect to isotig01225 and the gene model. (B) Illumina RNAseq reads (dark purple, horizontal bars) mapped to isotig01226 further verifies the junction between the first (light orange) and third (dark orange) contigs, with proportional coverage indicated (light purple, vertical bars). This figure was adapted from alignments visualized using the Integrated Genomics Viewer [48, 49].

Altogether, 12,265 of the 16,772 isotigs included in our best transcript assembly (73.1%) were verified either by a match to previously reported transcript isoforms included in WormBase or by our orthologous sequencing chemistries. Given the limitations presented by today’s sequencing technologies and assembly software, no combination of parameters will provide a perfect assembly. Some rate of error is to be expected given the challenges presented by complex, dynamic eukaryotic transcriptomes (e.g., varying expression rates, RNA half-life, secondary structure, AS, etc.). However, the error rates we detect are lower than those reported in other studies (particularly those involving shorter reads and deBruijn graph assemblers [50, 51]). Clearly, we were able to show improvement over default program parameters using our test dataset, and we expect that the impact of parameter optimization could prove even more vital as the size and complexity of the dataset increases.

The Newbler assembler relies on overlap-layout-consensus (OLC) algorithms for read assembly. These OLC algorithms may be less likely to overestimate the number of isoforms associated with a given gene compared to de Bruijn graph assemblers [1820, 52], but they are computationally intensive and sensitive to the size and complexity of a dataset. Large datasets with many millions of reads from multiple life cycle stages must be reduced prior to assembly. Our optimized protocol performed well with both randomly down-sampled and digitally normalized read sets (Table 3). Interestingly, digital read normalization eliminated nearly half of the reads without much impact on the quality of the assembly, so we have adopted this as our preferred method for dataset reduction prior to assembly.

Table 3 Assembly of down-sampled Caenorhabditis elegans read sets

Parasitic nematode transcript assemblies

We re-visited previously published Roche/454 data from C. oncophora[27], O. flexuosa[28], O. ostertagi[27], T. circumcincta[29], and T. colubriformis[25], re-screening and re-assembling with up-to-date, cDNA specific assembly software and our optimized parameters. Additional life cycle stages were sequenced and added to available datasets from A. caninum[30], D. viviparus[23], N. americanus[26], and O. dentatum[24] prior to assembly (see Additional file 1: Table S1). Together, these nine species represent a diverse array of parasitic nematodes, in terms of biology as well as phylogeny. Necator americanus, one of the two human hookworms, is thought to infect hundreds of millions of people across the Americas, Sub-Saharan Africa and Asia, and is a leading cause of morbidity in children. A. caninum, the canine hookworm, is an important pathogen in domestic dogs and a model for the study of human hookworm infections. O. flexuosa is a filarial nematode and a close relative of Onchocerca volvulus, the causative agent of African river blindness. O. flexuosa is unique among the Onchocercids in that it is devoid of the bacterial endosymbiont required for development and reproduction in its sister taxa. D. viviparus, the bovine lungworm, is the only Trichostrongylid nematode that resides in the lung during its adulthood. C. oncophora, O. dentatum, O. ostertagi, T. circumcincta, and T. colubriformus are all intestinal worms of livestock animals and are responsible for significant financial losses in the beef, dairy, sheep, goat and pork industries. Projects have been initiated to sequence the genomes of these species (see Table 4 for BioProject ID numbers), but N. americanus is the only species for which a draft genome is presently available [43].

Table 4 Parasitic nematode transcript assemblies

Parasitic nematode transcriptome assembly statistics are reported in Table 4. The datasets range in size and complexity from approximately one million reads derived from adult O. flexuosa to upwards of 7.5 million reads derived from eggs, larvae and adults of two geographically distinct strains of O. ostertagia. As previously discussed, there is a limit to the amount of data that can be processed by OLC assembly algorithms like those implemented by Newbler, so several datasets had to be reduced by digital read normalization prior to assembly (Table 4). Our tests seem to indicate that the complexity of the transcriptome has a greater impact on assembly efficiency than the absolute number of reads. For instance, we were able to assemble some 2.5 million reads from mixed sex adult T. colubriformis, whereas a full assembly of the 1.5 million reads derived from L3 and mixed sex adult N. americanus was not possible.

The number of isogroups obtained from each assembly ranged from 15,828 from O. flexuosa to 42,785 from the more thoroughly covered transcriptome of C. oncophora. Detected rates of AS, as measured by the number of isogroups associated with multiple isotigs, mostly fell within the 20-30% range, with a maximum AS rate of 34.65% in D. viviparus (Table 4). The AS rates seen in the parasitic nematodes were expected to be similar, as previous studies have shown that splice events are highly conserved among Caenorhabditis species despite hundreds of millions of years of evolutionary separation [53, 54]. It is also reasonable to expect that the parasitic nematodes, especially those with extremely complex life cycles like N. americanus and D. viviparus, would have higher rates of AS than free-living worms like C. elegans due to the increased genomic complexity that may be required to interact with multiple hosts/vectors, host/vector tissues, and environmental conditions. We did not make an effort to classify or compare the nature of these AS events (e.g., alternative starts and/or stops, intron retention, exon skip**, etc.), but we expect that this will be possible in future studies aimed at exploring AS profiles of particular species in greater detail.

It stands to reason that sampling and sequencing more life cycle stages would lead to increased resolution of AS events. Indeed, including more stages tended to increase the number of isogroups (i.e., genetic loci) identified, but overall AS rates and the average number of isotigs associated with each isogroup remained relatively consistent with the notable exception of T. colubriformis. The AS rate reported for T. colubriformis (11.68%) was much lower than AS rates reported for other species represented by a single cDNA library derived from mixed-sex adults (24.44% AS in O. flexuosa and 21.14% AS in T. circumcincta). This disparity may be due to decreased transcriptomic complexity in T. colubriformis, but there may be other explanations. In the case of T. colubriformis, material was obtained from an inbred laboratory strain [25], while O. flexuosa and T. circumcinta material were collected in the field [28, 29]. O. flexuosa nodules tend to be dominated by large, adult females [55]. Likewise, T. circumcinta is a polymorphic species with sex ratios biased towards females [56]. This is significant due to the fact that a patent female represents a broad survey of adult female tissues, embryos in various stages of development, and even stored sperm from males, all of which contribute to diversity in the transcript population. Sequencing additional life cycle stages of T. colubriformis or specifically studying adults of other species would provide additional data needed to better understand the obtained results.

The assembled sequences from these nine species, as well as their functional annotations and predicted translations are available from Nematode.net [22] for use by the wider community. Although genome sequencing projects are currently underway, the transcriptomes presented here represent a significant proportion of the sequence data available from these species at this time. These datasets are, therefore, a vital source of information on the genetic content and complexity of these parasites and will remain so even after draft genomes are published as genome sequencing does not, in and of itself, provide any information on AS. Historically, initial reports of draft genomes rarely comment on AS [5764]. For instance, the draft genome of the well-studied filarial nematode B. malayi was published in 2007, but the report made no mention of AS [59]. The most recent dataset available from WormBase (B. malayi WS236) includes multiple isoforms for 16% of the reported genes, but no comprehensive studies on the subject of AS in B. malayi have been reported despite an abundance of representative RNAseq data [65]. The recently published N. americanus genome paper was unique in that it included an estimate of AS based on Illumina RNAseq data generated from L3 and adult worms. Multiple isoforms were identified from approximately 25% of the 19,151 predicted protein coding genes. Some 1,209 of the 3,354 AS isogroups from N. americanus match 1,114 genes reported as AS in the genome study (≥90% nucleotide sequence identity over ≥75% of the length of the isotigs in a single high-scoring segment pair) [43], while another 65 AS isogroups matched genes that previously lacked evidence for AS. Clearly our assemblies, performed with a special emphasis on AS, will be a useful complement to genome sequencing studies and transcriptome studies performed using orthologous sequencing and assembly approaches.

Protein domains associated with alternative splicing

Given the fact that in AS, both the patterns of splicing as well as the spliced exons themselves, tend to be evolutionarily conserved [53, 54], we wanted to explore potential links between AS and genetic function. Coding sequences were predicted for each of the transcript assemblies and these were searched for similarity to InterPro protein domains. A total of 5,692 unique InterPro domains were identified from all species included in this study, with counts ranging from 4,904 domains in O. ostertagi to 2,212 in O. flexuosa (Table 4). Some 40.5% of all isogroups associated with an InterPro domain are AS, and 21 InterPro protein domains were significantly correlated with AS isogroups (Table 5). Functions related to nucleotide binding are prevalent in this list. Nucleic acid binding proteins have a wide variety of functions, localization patterns, and binding preferences that can certainly be affected by AS [6668]. For example, AS of the UNC-62 transcription factor produces two distinct isoforms in C. elegans. These isoforms localize to different tissues, exhibit different temporal expression patterns, and seem to bind different DNA consensus sequences due to alterations in the DNA binding domain [69, 70]. Chromo (CHRomatin Organization Modifier) and chromo-like domains interact with histones and nucleic acids, and studies have shown that AS of these proteins can have major implications on function, which, in turn have implications on gene expression and organismal development [71]. Future studies will be required to further explore the link between AS and protein function in parasitic nematodes, as well as to elucidate its specific biological consequences.

Table 5 Enrichment of InterPro protein domains among alternatively spliced isogroups

Conclusions

De novo transcriptome assembly is a complicated procedure that is confounded by varied gene expression patterns, such AS of mRNA. Transcriptome assemblies benefit from the use of optimized parameters designed to increase accurate coverage of the gene set and minimize assembly error. The set of parameters we described was thoroughly tested with C. elegans data and verified using well-curated sequences available from WormBase as well as data from an unrelated sequencing chemistry. Our optimized parameters are offered as a guide to assist in the assembly of other nematode transcriptomes, and updated, annotated transcript assemblies from nine species of parasitic worms are offered as a resource to the research community. Rates of AS seem to be similar among the species studied, and 21 InterPro protein domains appear to be enriched among AS transcripts. This represents a first step in exploring AS among parasitic nematodes, an important and relevant topic that should be further investigated in future sequencing studies.