Background

RNA sequencing (RNA-Seq) technology is revolutionizing the study of species whose genomes have not yet been sequenced by enabling large-scale analysis of their transcriptomes. To study such a transcriptome, one must first determine a set of transcript sequences via de novo transcriptome assembly: the reconstruction of transcript sequences from RNA-Seq reads without the aid of genome sequence information. A number of de novo transcriptome assemblers are currently available, many designed for Illumina platform data [1].

RSEM-EVAL is a novel reference-free transcriptome assembly evaluation measure

RSEM-EVAL, our primary contribution, is a reference-free evaluation measure based on a novel probabilistic model that depends only on an assembly and the RNA-Seq reads used to construct it. In short, the RSEM-EVAL score of an assembly is defined as the log joint probability of the assembly A and the reads D used to construct it, under a model that we have devised. In symbols:

$$\mathrm{score}_{\text{RSEM-EVAL}}(A) = \log P(A, D)$$

RSEM-EVAL’s intended use is to compare several assemblies of the same set of reads. Under this scenario, the joint probability is proportional to the posterior probability of the assembly given the reads, since $P(A \mid D) = P(A, D) / P(D)$ and $P(D)$ is constant for a fixed read set. Although the posterior probability is the more natural measure for this application, we use the joint probability because it is more efficient to compute.

The details of the RSEM-EVAL model are provided in the Materials and methods. In summary, the RSEM-EVAL score consists of three components: a likelihood, an assembly prior and a Bayesian information criterion (BIC) penalty. That is:

$$\log P(A, D) = \log \int_{\Lambda} P(D \mid A, \Lambda)\, P(A \mid \Lambda)\, P(\Lambda)\, d\Lambda \approx \underbrace{\log P(D \mid A, \Lambda_{\mathrm{MLE}})}_{\text{likelihood}} + \underbrace{\log P(A \mid \Lambda_{\mathrm{MLE}})}_{\text{assembly prior}} - \underbrace{\tfrac{1}{2}(M + 1)\log N}_{\text{BIC penalty}},$$
(1)

where $N$ is the total number of reads (or read pairs, for paired-end data), $M$ is the number of contigs (scaffolds) in the assembly, and $\Lambda_{\mathrm{MLE}}$ is the maximum likelihood (ML) estimate of the expected read coverage under $P(A, D \mid \Lambda)$. For typical sizes of RNA-Seq data sets used for transcriptome assembly, the likelihood is generally the dominant component of the RSEM-EVAL score in the above equation; it assesses how well the assembly explains the RNA-Seq reads. However, as we show later, this component alone is not sufficient. We therefore use the assembly prior and BIC components to assess an assembly’s complexity. These two components penalize assemblies with too many bases or contigs (scaffolds), or with an unusual distribution of contig (scaffold) lengths relative to the expected read coverage; together they impose a parsimony preference on the RSEM-EVAL score. The three components thus enable the RSEM-EVAL score to favor simple assemblies that explain the RNA-Seq reads well.
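To make equation (1) concrete, here is a minimal Python sketch that combines the three components into a single score; the likelihood and prior values are hypothetical placeholders, since computing them requires the full RSEM-EVAL model described in the Materials and methods.

```python
import math

def rsem_eval_score(log_likelihood, log_prior, num_contigs, num_reads):
    """Combine the three components of equation (1).

    log_likelihood -- log P(D | A, Lambda_MLE)  (hypothetical input here)
    log_prior      -- log P(A | Lambda_MLE)     (hypothetical input here)
    num_contigs    -- M, number of contigs (scaffolds) in the assembly
    num_reads      -- N, number of reads (or read pairs)
    """
    bic_penalty = 0.5 * (num_contigs + 1) * math.log(num_reads)
    return log_likelihood + log_prior - bic_penalty

# Two assemblies of the same reads can then be compared by their scores:
print(rsem_eval_score(-1.0e7, -2.0e5, num_contigs=15000, num_reads=5000000))
```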

The ground truth is an approximate local maximum of the RSEM-EVAL score

As we have discussed, the true assembly at minimum overlap length 0 is the best possible assembly that can be constructed solely from RNA-Seq data. Therefore, we consider it to be the ground truth assembly. A good evaluation measure should rank the ground truth among its highest-scoring assemblies. Ideally, we would have explored the entire space of assemblies and shown that the RSEM-EVAL score for the ground truth assembly is among the highest scores. However, such a global search of assembly space is computationally infeasible. Thus, we instead performed experiments that assess whether, within the local space of assemblies around the ground truth, the ground truth is among the best-scoring assemblies. In other words, we tested whether the ground truth assembly is approximately a local maximum of the RSEM-EVAL function.

We explored the local space of assemblies around that of the ground truth by generating assemblies that were slightly perturbed from it. We performed experiments with two kinds of perturbations: random perturbations and guided perturbations. In our random perturbation experiments, assemblies were generated by randomly mutating the ground truth assembly. Since the minimum overlap length is a critical parameter for constructing assemblies, we also assessed the RSEM-EVAL scores for true assemblies with different minimum overlap lengths in guided perturbation experiments. A good evaluation score should generally prefer true assemblies with small minimum overlap lengths, which are closest to the ground truth.

For these experiments, it was critical that the ground truth assembly be known, and therefore we primarily used a simulated set of RNA-Seq data, in which the true origin of each read is known. In addition, for our guided perturbation experiments, we used the real mouse data set on which the simulated data were based, and we estimated the true origin of each read. For details about these real and simulated data, see the Materials and methods.

Random perturbation

With our random perturbation experiment, we wished to determine how the ground truth compares, in terms of the RSEM-EVAL score, to assemblies in the local space surrounding it. To explore the space of assemblies centered at the ground truth thoroughly, we used four types of mutations (substitution, fusion, fission and indel), each applied at five different strength levels (see Materials and methods), for a total of 20 classes of randomly perturbed assemblies. For each class, we generated 1,000 independent randomly perturbed assemblies to estimate the population mean of the RSEM-EVAL score and its 95% confidence interval for assemblies of that class.

On average, the perturbed assemblies had RSEM-EVAL scores that were worse than that of the ground truth (Figure 3A). In addition, the higher the mutation strength, the worse the mean score of the perturbed assemblies. This suggests that the ground truth assembly behaves similarly to a local maximum of the RSEM-EVAL function. Even though the population mean scores of the perturbed assemblies were estimated to be always worse than the score of the ground truth, individual perturbed assemblies could have had higher scores. Therefore, for each class of perturbed assemblies, we calculated the fraction of assemblies with RSEM-EVAL scores larger than that of the ground truth, which we refer to as the error rate. Error rates decreased dramatically with increasing mutation strength and, for all mutation types except fusion, error rates were only non-zero for the weakest mutation strength level (Figure 3B). RSEM-EVAL had the most trouble with the fusion-perturbed assemblies, with more than half of such assemblies at the weakest mutation strength having a score above that of the ground truth. From individual examinations of these assemblies, we observed that in many of these cases, the assemblies contained fusions of contigs with low abundances, which are difficult to distinguish from true contigs, especially with the ground truth defined as the true assembly with minimum overlap length w=0.
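The per-class summary statistics used above (population mean, 95% confidence interval and error rate) are straightforward to compute; a minimal sketch, assuming the perturbed-assembly scores and the ground-truth score are already available from RSEM-EVAL:

```python
import numpy as np

def summarize_class(perturbed_scores, ground_truth_score):
    """Summarize one class of 1,000 randomly perturbed assemblies."""
    scores = np.asarray(perturbed_scores, dtype=float)
    mean = scores.mean()
    # Normal-approximation 95% confidence interval for the population mean.
    half_width = 1.96 * scores.std(ddof=1) / np.sqrt(len(scores))
    # Error rate: fraction of perturbed assemblies scoring above the ground truth.
    error_rate = float(np.mean(scores > ground_truth_score))
    return mean, (mean - half_width, mean + half_width), error_rate
```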

Figure 3

Random perturbation results. Comparison of the RSEM-EVAL score of the ground truth assembly to those of randomly perturbed versions of that assembly. (A) Changes in the relative scores of the perturbed assemblies with increasing mutation strength. For each class of perturbed assemblies, we computed the mean percentage change in the normalized RSEM-EVAL score for the 1,000 randomly perturbed assemblies in that class. The normalized RSEM-EVAL score is the RSEM-EVAL score of the assembly minus the RSEM-EVAL score one would obtain for the null assembly with no contigs and is useful when positive scores are necessary. For each mutation type, the mean percentage change in the normalized RSEM-EVAL score is plotted as a function of the mutation strength, with error bars corresponding to 95% confidence intervals. (B) RSEM-EVAL error rates for each perturbed assembly class. Error bars correspond to the 95% confidence intervals for the mean error rates.
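In code, the normalization described in the caption is a simple shift, and one plausible reading of the plotted quantity is the percentage change relative to the ground truth’s normalized score:

```python
def normalized_score(score, null_score):
    # RSEM-EVAL score minus the score of the null assembly with no contigs.
    return score - null_score

def percent_change(perturbed_norm, truth_norm):
    # Percentage change of a perturbed assembly's normalized score
    # relative to that of the ground truth (assumed positive).
    return 100.0 * (perturbed_norm - truth_norm) / truth_norm
```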

Guided perturbation

With the guided perturbation experiments, we measured the RSEM-EVAL scores of assemblies constructed with different values of the minimum overlap length, which is a common parameter in assemblers. Since the true assembly at minimum overlap length 0 is the best achievable assembly, a good evaluation score should be maximized at small minimum overlap lengths. As described before, we used one simulated and one real mouse RNA-Seq data set for these experiments. For each data set, we constructed the true assemblies with minimum overlap lengths ranging from 0 to 75. The true assembly at minimum overlap length 76 (the read length) was not constructed because of prohibitive running times. For the real RNA-Seq data, true assemblies were estimated using REF-EVAL’s procedure, described below. We then calculated the RSEM-EVAL scores for all of these assemblies.

As we had hoped, we found that the RSEM-EVAL score was maximized at small minimum overlap lengths, for both the simulated and real data sets (Figure 4). In contrast, the ML score increased with minimum overlap length and peaked at minimum overlap length w=75. These results support the necessity of the prior of the RSEM-EVAL model, which takes into account the complexity of an assembly.

Figure 4

Guided perturbation results. RSEM-EVAL (top row) and maximum likelihood (bottom row) scores of true assemblies for different values of the minimum overlap length w on both simulated (left column) and real (right column) data sets. The maximizing values (red circles) are achieved at w=0, w=2, w=75 and w=75 in a top-down, left-right order. For better visualization of the maximizing values of w, RSEM-EVAL scores for the local regions around the maximal values are shown in Additional file 1: Figure S2.

To explore the effects of the minimum overlap length parameter, w, in the RSEM-EVAL model, we also performed the guided perturbation experiments with w=50 for the RSEM-EVAL model. We did not observe any major differences between these results and those for w=0 (Additional file 1: Figures S3 and S4). To explain this result, note that although the RSEM-EVAL model uses w in both the prior and likelihood correction components, our estimation procedure for the uncorrected likelihood component (see Materials and methods) does not explicitly check for violations of the minimum overlap length by the assembly (i.e., regions that are not covered by reads that overlap each other by at least w bases). Thus, the minimum overlap length does not play a role in the uncorrected likelihood, which is the dominant term of the RSEM-EVAL score.

REF-EVAL is a refined toolset for computing reference-based evaluation measures

Our first experiment, above, shows that RSEM-EVAL has an approximate local maximum at the true assembly. However, this does not necessarily imply that RSEM-EVAL induces a useful ranking of assemblies away from this local maximum. Thus, to assess the usefulness of RSEM-EVAL’s reference-free score, it is of interest to compare the ranking RSEM-EVAL assigns to a collection of assemblies to the ranking assigned by comparing each assembly to a reference. This raises two questions: (1) What reference do we compare against? (2) How do we perform the comparison? REF-EVAL constitutes an answer to both questions. The tools REF-EVAL provides are also of independent interest for reference-based evaluation of transcriptome assemblies.

In answer to question (1), REF-EVAL provides a method to estimate the true assembly of a set of reads, relative to a collection of full-length reference transcript sequences. The estimate is based on alignments of reads to reference transcripts, as described in the Materials and methods. As we have previously discussed, we wish to compare assemblies against the set of true contigs or scaffolds instead of full-length reference sequences because the latter cannot, in general, be fully reconstructed from the data and we want to reward assemblies for recovering read-supported subsequences of the references.

In answer to question (2), REF-EVAL provides two kinds of reference-based measures. First, REF-EVAL provides assembly recall, precision and F1 scores at two different granularities: contig (scaffold) and nucleotide. Recall is the fraction of reference elements (contigs, scaffolds or nucleotides) that are correctly recovered by an assembly. Precision is the fraction of assembly elements that correctly recover a reference element. The F1 score is the harmonic mean of recall and precision. For precise definitions and computational details, see Materials and methods.
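A minimal sketch of these definitions (the element counts are hypothetical inputs; REF-EVAL’s actual matching criteria are given in the Materials and methods):

```python
def recall_precision_f1(ref_recovered, ref_total, asm_correct, asm_total):
    """Recall, precision and F1 from element counts (assumed nonzero)."""
    recall = ref_recovered / ref_total    # fraction of reference elements recovered
    precision = asm_correct / asm_total   # fraction of assembly elements correct
    f1 = 2 * recall * precision / (recall + precision)  # harmonic mean
    return recall, precision, f1
```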

Although the contig- (scaffold-) and nucleotide-level measures are straightforward and intuitive, both have drawbacks, and the two can be quite dissimilar (Figure 5). For example, if two contigs align perfectly to a single reference sequence but neither covers at least 99% of that sequence, the nucleotide-level measure will count them as correct, whereas the contig-level measure will not (Figure 5B). In general, the contig- and scaffold-level measures can fail to give a fair assessment of an assembly’s overall quality, since they use very stringent criteria and normally only a small fraction of the reference sequences is correctly recovered. Whereas the nucleotide-level measure arguably gives a more detailed picture of an assembly’s quality, it fails to take into account connectivity between nucleotides: in Figure 5B, for instance, it does not register that the correctly predicted nucleotides of the reference sequence are split across two different contigs.

Figure 5

Different granularities of reference-based measures computed by REF-EVAL. (A) The contig-level measure requires at least 99% alignment between a matched contig and reference sequence in a one-to-one mapping between an assembly and the reference. (B) The nucleotide-level measure counts the number of correctly recovered nucleotides without requiring a one-to-one mapping. Unlike the contig-level measure, it gives full credit to the two short contigs. The table on the right gives both the contig-level and nucleotide-level recall values for (A) and (B).

Motivated, in part, by the shortcomings of the contig-, scaffold-, and nucleotide-level measures, REF-EVAL also provides a novel transcriptome assembly reference-based accuracy measure, the k-mer compression score (KC score). In devising the KC score, our goals were to define a measure that would (1) address some of the limitations of the other measures, (2) provide further intuition for what the RSEM-EVAL score optimizes and (3) be relatively simple. The KC score is a combination of two measures, weighted k-mer recall (WKR) and inverse compression rate (ICR), and is simply defined as:

$$\mathrm{score}_{\mathrm{KC}} = \mathrm{WKR} - \mathrm{ICR}$$
(2)

WKR measures the fidelity with which a particular assembly represents the k-mer content of the reference sequences. Balancing WKR, ICR measures the degree to which the assembly compresses the RNA-Seq data. WKR and ICR are defined and further motivated in the Materials and methods.
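A simplified sketch of the KC score is given below. It assumes the reference k-mer weights are normalized to sum to one and, for illustration only, takes ICR to be the total assembly length divided by the total number of read bases; the exact definitions of WKR and ICR are in the Materials and methods.

```python
def kc_score(ref_kmer_weights, assembly_kmers, assembly_length, total_read_bases):
    """Sketch of score_KC = WKR - ICR (equation (2)).

    ref_kmer_weights -- dict mapping each reference k-mer to its
                        abundance-derived weight (assumed to sum to 1)
    assembly_kmers   -- set of k-mers present in the assembly
    """
    # Weighted k-mer recall: total weight of reference k-mers the assembly contains.
    wkr = sum(w for kmer, w in ref_kmer_weights.items() if kmer in assembly_kmers)
    # Inverse compression rate (sketched as assembly bases over read bases).
    icr = assembly_length / total_read_bases
    return wkr - icr
```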

The RSEM-EVAL score correlates highly with reference-based measures

Having specified a framework for reference-based transcriptome assembly evaluation via REF-EVAL, we then sought to test whether the RSEM-EVAL score ranks assemblies similarly to REF-EVAL’s reference-based measures. To test this, we constructed a large number of assemblies on several RNA-Seq data sets from organisms for which reference transcript sequences were available, and we computed both the RSEM-EVAL score and reference-based measures for each assembly. The RNA-Seq data sets used were the simulated and real mouse strand non-specific data from the perturbation experiments, a real strand-specific mouse data set and a real strand-specific yeast data set. Four publicly available assemblers, Trinity [4], Oases [6], SOAPdenovo-Trans [8] and Trans-ABySS [2], were applied to assemble these data sets using a wide variety of parameter settings.

Overall correlation

For each data set, we computed Spearman’s rank correlation between the reference-based measure values and the RSEM-EVAL scores to measure the similarity of the rankings they imply. For single-end data, RSEM-EVAL scores correlated reasonably well with the contig- and nucleotide-level F1 measures on the strand non-specific (Figure 6) and strand-specific (Additional file 1: Figure S5) data sets. Specifically, the correlation of the contig- and nucleotide-level F1 measures with the RSEM-EVAL score is comparable to their correlation with each other. RSEM-EVAL performed similarly well on the paired-end strand non-specific data set (Additional file 1: Figure S6).
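Such rank correlations are easy to reproduce with standard tools; for example, with SciPy (the score arrays below are hypothetical):

```python
from scipy.stats import spearmanr

rsem_eval_scores = [-9.1e7, -8.8e7, -9.4e7, -8.9e7]  # hypothetical, one per assembly
contig_f1_scores = [0.31, 0.42, 0.25, 0.40]          # hypothetical, same assemblies

rho, _pvalue = spearmanr(rsem_eval_scores, contig_f1_scores)
print(f"Spearman's rho = {rho:.2f}")
```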

Figure 6

Correlation of RSEM-EVAL score with reference-based measures for strand non-specific single-end data sets. Scatter plots are shown for the simulated (top row) and real mouse (bottom row) data sets and for both the nucleotide-level F1 (left column) and contig-level F1 (center column) measures. For comparison, scatter plots of the nucleotide-level F1 against the contig-level F1 are shown (right column). Spearman’s rank correlation coefficient (bottom-right corner of each plot) was computed for each combination of data set and reference-based measure.

The RSEM-EVAL scores had markedly higher correlations with the KC score (k=L, the read length) for both the strand non-specific (Figure 7) and strand-specific (Additional file 1: Figure S7) single-end data sets, as well as for the paired-end data set (Additional file 1: Figure S6), which confirmed our expectations given the mathematical connections between these scores. To assess the impact of the k-mer size on the KC score, we also computed correlations between the RSEM-EVAL score and the KC score at half (k=36) and double (k=152) the read length for the strand non-specific single-end data. We found that these correlation values were not sensitive to the value of k (Additional file 1: Figures S8 and S9). These results provide some intuition for what the RSEM-EVAL score assesses and indicate that the RSEM-EVAL score could be used as a proxy for the KC score when reference sequences are not known.

Figure 7

Correlation of the RSEM-EVAL and KC scores on the strand non-specific single-end data sets. Spearman’s rank correlation coefficient (bottom-right corner of each plot) was computed for each data set. KC score, k-mer compression score.

Although this experiment was not designed as a comprehensive evaluation, some features of these results are suggestive of the relative accuracies of the assemblers. First, given the selected assembler versions and parameter settings, Trinity produced the most accurate assemblies for all data sets with respect to the contig-, scaffold- and nucleotide-level F1 scores and the KC score. The RSEM-EVAL score supported this result, with the Trinity assemblies also obtaining the highest RSEM-EVAL scores. Second, varying the Trinity parameters had a relatively small effect on the accuracy of the resulting assemblies, compared to Oases and SOAPdenovo-Trans, which produced assemblies that spanned a large range of accuracies. From the assemblies of the mouse strand non-specific single-end data produced by the assemblers with their default parameters, we identified a case that exemplifies Trinity’s accuracy and demonstrates how RSEM-EVAL selects the best assembly (Figure 8).

Figure 8

RSEM-EVAL correctly selects the Trinity assembly of reads originating from a transcript of mouse gene Rpl24 as the best among the default assemblies from Trinity, Oases and SOAPdenovo-Trans. Reads from the mouse strand non-specific single-end data set aligning to transcript 1 of Rpl24 were extracted and assembled by Trinity, Oases and SOAPdenovo-Trans with default parameters. Contigs (filled rectangles) from each assembly were aligned against the true transcript with BLAT to establish their positional identities (blue-yellow fill, with only the segment of a contig from its highest-scoring local alignment shown). RSEM-EVAL was run on each assembly and the likelihood, prior, BIC and total RSEM-EVAL scores were recorded. Although the SOAPdenovo-Trans assembly was smaller (as reflected by the higher prior score), the Trinity assembly had a much higher likelihood score, which is generally the dominant term in the RSEM-EVAL score, and was thus correctly selected as the most accurate assembly. BIC, Bayesian information criterion.

Comparison to other measures

As we mentioned in the introduction, there are a wide variety of other measures that have been proposed and used for the evaluation of assemblies. We selected a representative set of such measures for comparison with RSEM-EVAL. From the simple reference-free measures, we selected N50 because of its popularity and the number of bases in (non-singleton) contigs because this measure was determined to be ‘strong’ and ‘fully consistent’ for evaluating de novo transcriptome assemblies [20]. Genovo [24] and ALE [25] both provide model-based reference-free scores for evaluating metagenome assemblies, which are highly similar to transcriptome assemblies, and thus we also included these scores for comparison. Lastly, we compared RSEM-EVAL to two comparative-reference-based measures that may be used if a protein set from a closely related species is available: the ortholog hit ratio [26] and the number of unique proteins matched by assembly elements, both of which were determined to be effective for transcriptome assembly evaluation [20].
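For reference, N50, used above as a simple reference-free measure, is the largest length L such that contigs of length at least L contain half of the total assembled bases; a minimal implementation:

```python
def n50(contig_lengths):
    """Smallest contig length L (scanning from the longest contig down)
    such that contigs of length >= L contain at least half the total bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2.0
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length

print(n50([500, 400, 300, 200, 100]))  # prints 400 (500 + 400 >= 750)
```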

Because some of these measures were computationally costly to compute, we evaluated them with a smaller data set than that used in the previous section. Specifically, we used the set of reads from the real mouse strand non-specific single-end data set that mapped to genes on chromosome 1. As in the previous section, we assembled these reads with a variety of assemblers and parameter settings and computed the selected set of measures along with the RSEM-EVAL and REF-EVAL measures on the resulting assemblies.

In terms of Spearman’s rank correlation, RSEM-EVAL outperformed all other measures with respect to the contig-level F1 and KC scores, but had lower correlation with the nucleotide-level F1 score than the Genovo, ALE and number-of-unique-proteins-matched measures (Table 1, Additional file 1: Figures S10 and S11). RSEM-EVAL, Genovo and ALE were the only reference-free measures to have positive correlations with all three REF-EVAL measures. The N50 and number-of-bases-in-contigs measures had negative correlation with the nucleotide-level F1 score and positive but poor correlation with the other REF-EVAL measures. Unsurprisingly, because of the similarity of the models used by RSEM-EVAL and Genovo, the scores produced by the two methods were also similar (Additional file 1: Figure S10). Although ALE is also a model-based reference-free measure, it behaved noticeably differently from RSEM-EVAL and Genovo, particularly for the Oases assemblies, which were generally larger than the other assemblies. Of the two comparative-reference-based measures, the number of unique proteins matched performed best, achieving good correlation with all REF-EVAL measures and the highest correlation (0.73) with the nucleotide-level F1 score. The ortholog hit ratio did not fare as well and, in fact, had negative correlation with the nucleotide-level F1 score.

Table 1 Spearman’s rank correlation coefficient of the scores assigned by several alternative transcriptome assembly evaluation measures to the reference-based scores from REF-EVAL

Given the similarity of RSEM-EVAL to Genovo and ALE, both in terms of their underlying methodology and their performance on the selected data set, we sought to differentiate RSEM-EVAL further from these methods. First, we note that unlike RSEM-EVAL, the Genovo and ALE scores do not explicitly take into account transcript abundances and only use one alignment per read, even if a read has multiple equally good alignments. To demonstrate the necessity of modeling transcript abundance and read mapping uncertainty, we constructed a simple realistic example in which only RSEM-EVAL correctly scores the true assembly as the best (Figure 9). Second, we measured the runtime and memory usage of each of these software packages on the full mouse strand non-specific single-end data set and found that RSEM-EVAL is substantially faster than both Genovo and ALE, which have arguably prohibitive runtimes for this realistic data set (Table 2). Lastly, RSEM-EVAL and ALE have richer models than Genovo, both supporting paired-end data, quality scores and strand specificity.

Figure 9

Example scenario in which RSEM-EVAL correctly selects the true assembly whereas Genovo and ALE select suboptimal assemblies. Because Genovo and ALE do not explicitly take into account transcript abundance and read mapping uncertainty, scenarios in which multiple isoforms of the same gene are present in an RNA-Seq sample can confuse these methods. In this example, a gene has two isoforms, the first isoform (with a length of 1,000 bases) corresponding to the first half of the second isoform (with a length of 2,000 bases). We simulated 5,000 single-end RNA-Seq reads of length 100 bases with 0.01% sequencing error from these transcripts and with a 90:10 abundance ratio between the first and second isoforms, respectively. Because RSEM-EVAL models transcript abundances and takes into account read mapping uncertainty, it correctly scores the true assembly the highest. In contrast, Genovo selects the assembly containing only the long isoform and ALE selects the assembly containing only the short isoform.
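The read-simulation setup described in the caption can be sketched as follows. This is a deliberately simplified simulator (it draws reads in proportion to the given weights and ignores length effects on sampling); the figure’s reads were generated with the study’s own simulation pipeline.

```python
import random

def simulate_reads(isoforms, weights, num_reads=5000, read_len=100,
                   error_rate=0.0001, seed=0):
    """Sample error-prone single-end reads from weighted isoform sequences."""
    rng = random.Random(seed)
    reads = []
    for _ in range(num_reads):
        tx = rng.choices(isoforms, weights=weights)[0]
        start = rng.randrange(len(tx) - read_len + 1)
        read = list(tx[start:start + read_len])
        for i in range(read_len):
            if rng.random() < error_rate:  # 0.01% per-base error
                read[i] = rng.choice([b for b in "ACGT" if b != read[i]])
        reads.append("".join(read))
    return reads

# Two isoforms, the short one being the first half of the long one,
# sampled at a 90:10 ratio as in Figure 9:
long_iso = "".join(random.Random(1).choices("ACGT", k=2000))
reads = simulate_reads([long_iso[:1000], long_iso], weights=[90, 10])
```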

Table 2 Wall-clock runtimes (in hours, minutes and seconds) and memory usage (as measured by the maximum resident set size, in gigabytes) for several assembly evaluation tools

Within-assembler correlation

One important potential application of RSEM-EVAL is the optimization of the parameters of an assembler. Thus, it is of interest whether the RSEM-EVAL score correlates well with reference-based measures for assemblies generated by a single assembler. In a previous subsection, we showed that the RSEM-EVAL score has high correlation with the KC score for real and simulated data when several different assemblers are used. Looking at each assembler separately, we also find that the RSEM-EVAL score has high correlation with the KC score when only the assembler’s parameters are changed, for both strand non-specific (Figure 10) and strand-specific (Additional file 1: Figure S12) single-end data sets, as well as for the paired-end data set (Additional file 1: Figure S13). This suggests that RSEM-EVAL can be used to optimize the parameters of an assembler for a given data set when the KC score is of interest for measuring the accuracy of an assembly.

Figure 10

Within-assembler correlation of the RSEM-EVAL and KC scores on the strand non-specific single-end data sets. Scatter plots are shown for the simulated (top row) and real mouse (bottom row) data sets and for the Trinity (left column), Oases (center column) and SOAPdenovo-Trans (right column) assemblers. Trans-ABySS was omitted because it had only one assembly. Spearman’s rank correlation coefficient (bottom-right corner of each plot) was computed for each combination of data set and assembler. KC score, k-mer compression score.

Assessing the relative impact of individual contigs or scaffolds within an assembly

The RSEM-EVAL score is an assembly-level measure that allows one to compare different assemblies constructed from the same data set. It is also of interest to compute scores for individual contigs or scaffolds within an assembly that reflect their relative impacts. In this section we describe and assess a contig-level score based on RSEM-EVAL for single-end data. RSEM-EVAL can analogously produce scaffold-level scores when paired-end data are available.

One natural statistical approach for assessing the explanatory power of a contig is to compare the hypothesis that a particular contig is a true contig with the null hypothesis that the reads composing the contig are actually from the background noise. For each contig, we use the log of the ratio between the probabilities for these two hypotheses as its contig impact score. Through a decomposition of the RSEM-EVAL score logP(A,D) into contig-level components, we are able to calculate these contig impact scores efficiently (Additional file 1: Section 5).

RSEM-EVAL’s contig impact score measures the relative contribution of each contig to explaining the assembled RNA-Seq data. This suggests a strategy to improve the accuracy of an assembly: trim it by removing contigs that contribute little to explaining the data. To evaluate this strategy (and by extension the contig impact score itself), we trimmed assemblies of the simulated data using the RSEM-EVAL contig impact scores and computed the resulting changes in the evaluation measures. Assemblies were trimmed by removing all contigs with negative scores.
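A minimal sketch of this trimming strategy, assuming the per-contig impact scores (the log-ratios described above) have already been computed by RSEM-EVAL:

```python
def trim_assembly(contigs, impact_scores):
    """Remove contigs with negative RSEM-EVAL impact scores.

    contigs       -- dict mapping contig name to sequence
    impact_scores -- dict mapping contig name to its impact score
    """
    return {name: seq for name, seq in contigs.items()
            if impact_scores[name] >= 0}
```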

In general, the trimmed assemblies had better evaluation scores than their untrimmed counterparts (Additional file 1: Table S2 and Figure S14). The largest improvements were seen for assemblies produced by Oases and Trans-ABySS, which tend to produce large numbers of contigs. In fact, for both the nucleotide- and contig-level F1 scores, the trimmed Oases assemblies were the most accurate of all assemblies (both trimmed and untrimmed), supporting the usefulness of the RSEM-EVAL contig impact score. This suggests that the RSEM-EVAL contig impact scores are correctly identifying contigs that are either erroneous or redundant within these assemblies.

RSEM-EVAL guides creation of an improved axolotl assembly

The axolotl (Ambystoma mexicanum) is a neotenic salamander whose regenerative abilities have piqued the interest of scientists. In particular, there is significant interest in studying the molecular basis of axolotl limb regeneration [27]. Although the axolotl is an important model organism, its genome is large and repetitive and, as a result, has not yet been sequenced. In addition, no publicly available, complete and high-quality set of axolotl transcript sequences exists, which makes it challenging to study the axolotl’s transcriptome.

To demonstrate the use of RSEM-EVAL, we employed it to select an assembler and parameter values for a set of previously published RNA-Seq data from a time-course study of the regenerating axolotl limb blastema [27]. This data set consisted of samples taken at 0 hours, 3 hours, 6 hours, 12 hours, 1 day, 3 days, 5 days, 7 days, 10 days, 14 days, 21 days and 28 days after the start of regeneration and had a total of 345,702,776 strand non-specific single-end reads. Because of the large size of this data set and our goal of testing many different assemblers and parameter settings, we first restricted our analysis to data from three of the time points (6 hours, 14 days and 28 days), which made up a total of 55,559,405 reads. We ran Trinity, Oases and SOAPdenovo-Trans on these data to produce over 100 different assemblies, each of which we scored using RSEM-EVAL. Trans-ABySS was not included due to some difficulties in running it.

Since we did not have a known axolotl transcript set, we were unable to use the reference-based measures we have discussed thus far to assess the RSEM-EVAL score’s effectiveness for these data. Therefore, to obtain an orthogonal measure of accuracy with which to validate the RSEM-EVAL score for this data set, we instead used alignments of the assembly contigs to the known protein sequences of the frog species Xenopus tropicalis. Specifically, we aligned the assemblies against the frog protein sequences with BLASTX [28] and calculated the number of frog proteins that were recovered to various percentages of length by an axolotl contig (Additional file 1: Section 9). We found that, in general, the assemblies with higher RSEM-EVAL scores were those that were also considered better by comparison with the Xenopus protein set (Figure 11). Thus, the RSEM-EVAL score appears to be selecting the highest-quality axolotl assemblies.
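A sketch of this protein-recovery computation from BLASTX tabular output is shown below. It assumes a custom tabular format (-outfmt '6 qseqid sseqid sstart send slen') and, as a simplification, scores each protein by its single best-covering alignment rather than merging intervals across alignments; the 80% threshold is an arbitrary example (Figure 11 sweeps this threshold along the x-axis), and the actual procedure is given in Additional file 1, Section 9.

```python
from collections import defaultdict

def protein_recovery(blastx_lines, min_fraction=0.8):
    """Count proteins with at least `min_fraction` of their length
    covered by a single aligned contig segment."""
    best = defaultdict(float)  # protein -> best coverage fraction seen
    for line in blastx_lines:
        contig, protein, sstart, send, slen = line.rstrip("\n").split("\t")
        covered = abs(int(send) - int(sstart)) + 1
        best[protein] = max(best[protein], covered / int(slen))
    return sum(1 for frac in best.values() if frac >= min_fraction)
```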

Figure 11

RSEM-EVAL scores and Xenopus protein recovery for the axolotl blastema transcriptome assemblies. The y-axis represents the percentage of proteins with at least x percent of their length (x-axis) recovered by an axolotl contig. The curve for each assembly is colored according to its RSEM-EVAL score, with red representing the highest RSEM-EVAL score. The assembly with the curve closest to the upper-right corner is the best in terms of its comparison with the Xenopus protein set.

We then pooled all time points of the time course and built an assembly using the assembler (Trinity) and parameter set (--glue_factor 0.01 --min_iso_ratio 0.1) that maximized the RSEM-EVAL score on the subset described above. This assembly is publicly available from the DETONATE website. Compared to a published axolotl assembly [27], the new assembly is longer overall and has a larger N50 score (Additional file 1: Table S3).

As length-based measures are not necessarily indicative of higher assembly quality, we also evaluated the assemblies based on the number of expressed genes and the number of differentially expressed up-regulated (DE UP) genes at each time point in the axolotl RNA-Seq time course. To enable direct comparisons with the published assembly, we used data and methods identical to those in [27], which used a comparative technique that analyzes the axolotl transcripts in terms of their orthologous human genes. With the new RSEM-EVAL-guided assembly, we identify more expressed genes and more DE UP genes at each time point (Additional file 1: Figure S15). The majority of the DE UP genes found in the published assembly are captured in the new assembly (608 of 888 = 68%), while only 39% (608 of 1,576) of the DE UP genes found in the new assembly are captured in the published assembly. The new assembly thus identifies many DE UP genes (968) not captured in the published assembly.
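The overlap percentages above reduce to simple set arithmetic; with the published and new DE UP gene sets (888 and 1,576 genes, sharing 608), the following sketch reproduces the 68%, 39% and 968 figures:

```python
def overlap_summary(new_genes, published_genes):
    """Summarize how the DE UP gene sets of two assemblies overlap."""
    shared = new_genes & published_genes
    pct_published_captured = 100 * len(shared) / len(published_genes)  # 68%
    pct_new_captured = 100 * len(shared) / len(new_genes)              # 39%
    novel_in_new = len(new_genes - published_genes)                    # 968
    return pct_published_captured, pct_new_captured, novel_in_new
```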

Because transcription factors are important for setting and establishing cell state [29], we further examined the transcription factors found in the new assembly but not in the published assembly across the axolotl RNA-Seq time course. Prior results indicate that oncogenes are up-regulated early in the time course [27]. Using the new assembly, we identify two additional DE UP oncogenes (FOSL1 and JUNB) that are not identified using the published assembly [27]. The prior assembly identified many genes associated with limb development and limb regeneration as being DE UP during the middle phase (3 to 14 days) of the time course [27]. The new assembly identifies additional limb development and regeneration genes during this middle phase, such as HOXA11, HOXA13, MSX1, MSX2 and SHOX. HOXA11 and HOXA13 are important specifiers or markers of positional information along the proximal/distal and anterior/posterior axes of the limb [30]. MSX1 and MSX2 have been shown to be expressed in the axolotl blastema [31]. SHOX mutants in humans exhibit short limbs and short overall stature [32]. The identification of many more expressed and DE UP genes, a number of which have prior support for involvement in limb regeneration, suggests that the new assembly gives a more comprehensive view of the genes expressed in the axolotl.