Background

Polyploidy (often referred to as whole genome duplication, WGD) describes the presence of more than two sets of homologous chromosomes in a cell or an organism, is very common in higher plants and plays an important role in plant evolution, speciation and adaptation. It has been discovered that all flowering plants experienced at least two ancient polyploidization events [1] that led to new genes with novel functions [2]. In addition, recent polyploidization events in ferns, lycophytes, and many flowering plants resulted in the formation of neopolyploids, that partly established themselves as novel species [3]. Polyploidization is not only a key process happening in natural populations and species but plays a major role in crop breeding too. Important crops like potato, wheat, cotton, peanut or strawberry are polyploid organisms [4,5,6,7]. Two main categories of polyploidy are recognized: auto- and allopolyploidy. Whilst the first is the outcome of WGD within a species where a genome with multiple sets of homologous chromosomes is generated (e.g. AAAA in the case of an autotetraploid), allopolyploids originate through WGD that is based on the hybridization between species resulting in a genome with multiple sets of homoeologous chromosomes (each from a separate parental subgenome, e.g. AABB in allotetraploids) [8].

Besides the genomic repatterning that comes with WGD, it is known, that past WGD events and a subsequent high rate of maintaining pairs of duplicated genes throughout evolution led to a stable higher rate of duplicated genes in plant genomes, thereby changing the concentration of gene products resulting in gene dosage imbalances [9, 10]. Recent polyploidization events can have immediate phenotypic effects, such as increased cell size leading to an increase in biomass. Especially in allopolyploids, recent gene duplications can induce additional positive effects that are beneficial for plant breeding, such as heterosis and gene redundancy [11, 12]. The first effect causes more vigorous individuals while the latter protects polyploids from the deleterious effect of mutations [12]. But, many more mechanisms are known to be affected by WGD, as such it is well described, that the per-cell gene expression levels are increased in polyploids [13] and that stress related genes can change their expression pattern in polyploid species in comparison to their diploid counterparts [14, 15]. Additional hypotheses about transcriptional changes with regard to polyploidization are well reviewed in Doyle and Coate (2019) [16]. A further mechanism that is also influenced by polyploidization is alternative splicing (AS) [17]. In plants, more than 60% of intron-containing genes undergo AS [18, 19], whereby it is known that environmental stresses can cause even more splicing events [20]. As a modulator of gene expression, AS plays a crucial role in multiple biological processes during plant growth and development.

The analysis of gene expression through RNA sequencing (RNA-seq) is a well-established, commonly used method in both, basic and applied research to interpret functional elements of the genome and understand the formation of phenotypes, traits and the reaction to diseases and a changing climate [21]. The above described effects of polyploidization (high genomic complexity, gene duplications, dosage imbalances, affected AS) bring major challenges especially for the de novo transcriptome assembly that is applied commonly in non-model organisms when no reference genome is at hand. Besides the fact that the de novo assembly is already a complex task in diploids, due to the sequence similarity of transcripts that are isoforms, or are a product of allelic variants, close paralogs or homologs [22], this gets even more challenging in polyploids. While in allopolyploids an additional complexity level is given through the presence of homoeologous genes [23], autopolyploids usually have a high heterozygosity due to the nature of polysomic inheritance where e.g. four different alleles at a given locus with random pairing between each of the four chromosomes can result in nineteen genotypes. In contrast, allopolyploids usually show disomic inheritance that lead to bivalent chromosome formation resulting in a maximum of nine combinations for the given locus in the offspring [24, 25]. All these configurations (e.g. duplications, multiple alleles) cause extra branches and bubbles in the de Bruijn graph that is nowadays predominantly used to build the de novo transcriptome assemblies. Therefore, the graph structure can be ambiguous, and the represented isoforms can be challenging to resolve. As a result, a collapse of transcripts from genes belonging to one gene family (homologs), chimerism (the concatenation of two or more transcripts that may or may not be related) or redundancy (e.g. allelic sequences as separate loci) might occur more frequently [26, 27].

State of the art transcriptome assemblers were developed and tested in model organisms that lack high gene duplication rates or polyploidy levels [28,29,30] and thus, their evaluation in polyploids is scarce. Only a few studies focused on the comparison of transcriptome assembly strategies in polyploid species, among them only one including autotetraploids [31,32,33]. Despite those studies, there is a lack of cross-species analyses comparing the performances of these tools on multiple di- and polyploid species. To fill this gap and to provide a basic guideline for choosing the optimal de novo assembly strategy, we performed a comparison of two common (SOAPdenovo-Trans, Trinity) and one recently released transcriptome assembler (TransLiG) on diploid and autotetraploid non-model plant species, as the scientific interest in this type of polyploidy is increasing [34], but still, “studies about the regulation of genes on the four homologous chromosomes of autopolyploids have received little attention” [16].

As study organisms of choice, we focused on di- and autotetraploids from the plant genera Acer and Vaccinium. The genus Acer is an extremely diverse group containing over 120 species of various size, habit and ploidy level. Our Acer species of choice were sycamore maple (Acer pseudoplatanus L., 4×) and Norway maple (Acer platanoides L., 2×). Both species show a similar distribution pattern across Europe and are valuable hardwood species [35, 36]. Further, Vaccinium is a young and widespread genus with elevated rates of speciation in recent decades that led to the formation of about 450 species [37]. The genus includes blueberries, cranberries or lingonberries and consists of very complex polyploid species like Vaccinium corymbosum L., a highly economically relevant species in the food sector [37, 38]. In addition, to have a proven reference, a di- and an autotetraploid Arabidopsis thaliana L. genotype was included in our study. Among the tested tools, SOAPdenovo-Trans is a transcriptome assembler built on a genome assembler [29, 39], while Trinity [28, 39] was specifically developed for transcriptome assembly. The first was implemented and tested on transcriptome data of rice and mouse, the latter was established using transcriptome data of fission yeast. TransLiG is the most recently developed assembler, released in 2019, reviewed on human transcriptome data, with a special consideration to integrate the sequence depth and paired-end information to retrieve all the transcript-representing paths in splicing graphs [30]. To our knowledge, TransLiG has not been tested on any plant data so far.

Methods

A schematic workflow of the data and tools used in this study is shown in Fig. 1.

Fig. 1
figure 1

Pipeline for de novo transcriptome assembly evaluation

Acer sampling, RNA extraction, library preparation, and sequencing

For each of the two Acer species under investigation, A. platanoides L. (Norway maple, diploid = 2×) and A. pseudoplatanus L. (sycamore maple, tetraploid = 4×), three mature individuals were chosen for selection. The individuals are part of the living collection of woody plants of the Botanical Garden of the University of Vienna, Austria (Hortus Botanicus Vindobonensis, HBV), and can be identified through the following individual accession numbers: Norway maple tree IDs 37006, 30044, and IGF024; sycamore maple tree IDs PP001, 34011, and 32074 (cf. Additional file 1). Leaf material (comprising around five randomly selected small leaves per individual) was collected and immediately frozen in liquid nitrogen. Frozen leaf tissue was ground to a fine powder and from about 50–60 mg total RNA was extracted using TRIzol Reagent as described in Meng and Feldman (2010) [40]. Total RNA was sent on dry ice to the Next Generation Sequencing Facility at Vienna BioCenter Core Facilities (VBCF), Austria. There, the RNA was quality and quantity checked using Agilent’s Bioanalyzer. Library preparation was done using the NEB polyA enrichment kit, including stranded information and a cutout size between 300-800 bp, resulting in an individual median size of each library between 388 and 423 bp. All six mRNA libraries were sequenced on one lane on the HiSeq2500 PE150 in rapid mode. Sample information and sequence data are available at NCBI under the BioProject PRJNA662197.

Additional data

Raw RNA sequence reads of three V. arboreum and three V. corymbosum individuals of the control group (pH 4.5) from the study by Payá-Milans et al. (2018) [32] were downloaded from https://www.ebi.ac.uk/ena, PRJNA353989. In that study, libraries were prepared using the Ribo-Zero™ rRNA Removal Kit on total RNA and ScriptSeq v2 RNA-Seq library preparation kit, and further sequencing was done in paired-end mode with a length of 101 bp and fr-strandness. The A. thaliana RNA-seq data generated by Zhang et al. (2019) [14] was downloaded from https://www.ebi.ac.uk/ena, PRJNA473317. In that case, total RNA was used for sequencing with standard Illumina protocols. A description of all the species used in this study is shown in Table 1, detailed meta data of each individual is provided in Additional file 1.

Table 1 Sample description

De novo transcriptome assembly

Raw sequence reads were pre-processed for base quality (Q20 from left and right) and adapter content using BBDuk package from the software BBMap version 37.68 [41] as well as rRNA filtered using SortMeRNA version 3.0.3 (Kopylova 2012).

De novo transcriptome assemblies of all five species were performed with Trinity version 2.6.5 [28, 42], SOAPdenovo-Trans version 1.04 [Full size image

The N50 sizes of the Arabidopsis assemblies were around 2000 bp while the N50 sizes of the other two genera (Acer and Vaccinium) were less, ranging from 513 to 995 bp, except for the TransLiG assemblies, were the N50 sizes were between 1539 bp (V. arboreum) and 2025 bp (A. pseudoplatanus) (Additional file 2). In general, smaller N50 sizes came with a high proportion of very short transcripts with a length of less than 300 bp. While the number of small transcripts in A. thaliana was between 3% (TransLiG, autotetraploid, 4×) and 10% (SOAPdenovo-Trans, di- and autotetraploid, 2× and 4×), the number of transcripts less than 300 bp was up to 46% in the Vaccinium assemblies and up to 57% in A. platanoides (Trinity and SOAPdenovo-Trans) (Fig. 3, Additional file 7). Additionally, the proportion of contigs that have >  = 50% estimated chance of being segmented (p_segmented) is higher in the Acer and Vaccinium assemblies (15–21%) compared to Arabidopsis (13–15%), with lower proportions in the autotetraploids (Fig. 3, Additional file 2).

According the number of reads (fragments) that mapped back to the assemblies and the number of good map**s (i.e. both of the reads mapped on the same contig, with same orientation and without overlap** the ends of the contig), the highest proportion was seen for Trinity and TransLiG assemblies in all species (Fig. 3 and Additional file 2). Especially for the autotetraploid species, TransLiG (AC: 0.96, VA: 0.90, AT: 0.99) outperformed Trinity (AC: 0.91, VA: 0.82, AT: 0.96) in the proportion of fragments that mapped and in the proportion of good map**s (TL: AC 0.91, VA 0.8, AT 0.96; TR: AC 0.79, VA 0.63, AT 0.89). The proportion of contigs uncovered was rather small for most of the assembly results. More than 5% of uncovered contigs (mean per-base read coverage of < 1) was only seen in the TransLiG assemblies of the autotetraploid species (Fig. 3).

Assembly completeness

With regard to the completeness, we saw most complete assemblies (complete single plus complete duplicated BUSCOs) with TransLiG in both, diploid (AC: 1,613, VA: 1,368, AT: 2,115) and tetraploid (AC: 2,044, VA: 1,558, AT: 2,147) species (Additional file 3). Fewest complete BUSCOs were assembled for the SOAPdenovo-Trans assemblies in autotetraploid Acer (833) and autotetraploid Vaccinium (669). The completeness of A. thaliana assemblies was rather similar for all the assemblers, ranging from 1,932 to 2,147 complete BUSCOs. Focusing on the proportion of complete duplicated BUSCOs compared to the number of all complete BUSCOs, we saw the highest proportion in the TransLiG assemblies of tetraploid species (from 0.73 to 0.83, depending on species) and the least proportion in SOAPdenovo-Trans assemblies (from 0.08 to 0.29). On the other hand, most fragmented and missing BUSCOs were seen for SOAPdenovo-Trans assemblies in Acer and Vaccinium species.

Comparison to the cDNA or protein reference

The comparison of A. thaliana assemblies to the reference cDNA showed that the proportion of transcripts that have a CRB-Blast hit with the reference ranged from 0.75 for SOAPdenovo-Trans up to 0.93 for TransLiG (Fig. 4). The proportion of the reference with a transcript hit was between 0.47 and 0.57 for all assemblies, with the highest values for the Trinity assemblies, 0.56 for diploid and 0.57 for tetraploid A. thaliana. The proportion of transcripts with a CRB-Blast hit and the proportion of reference cDNA with a transcript hit did not differ between di- and tetraploid A. thaliana. This was different for the per base reference coverage. In the diploid A. thaliana assemblies, the highest coverage of 0.25 was seen in the Trinity assembly, compared to 0.18 and 0.21 in SOAPdenovo-Trans and TransLiG, respectively. In the tetraploid A. thaliana, the far highest coverage was seen in the TransLiG assembly with 0.38 compared to 0.18 and 0.25 (Fig. 4 and Additional file 2).

Fig. 4
figure 4

Comparison of Arabidopsis thaliana assemblies to the reference cDNA set. P contigs with CRBB—the proportion of contigs with a CRB-BLAST hit; P reference with CRBB—the proportion of references with a CRB-BLAST hit; Reference coverage—the proportion of reference bases covered by a CRB-BLAST hit

The comparison of the Arabidopsis assemblies with the reference protein set showed that the proportion of contigs with a CRB-Blast hit for SOAPdenovo-Trans is significantly decreased to less than 0.4 for both di- and tetraploid A. thaliana (Additional files 4 and 2). The difference between the proportion of reference with a transcript hit and the per base reference coverage was small for all assemblers, in contrast to the comparison of Acer assemblies to the reference A. yangbiense protein set. For Acer, the proportion of contigs with a CRBB hit was between 0.09 and 0.28 with the highest values in the TransLiG assemblies (Additional files 4 and 2). A comparison for Vaccinium to a reference protein set was not conducive due to the lack of a reasonable protein set for any Vaccinium species.

Genetic variants

AS events, SNPs and short indels were called with a local transcriptome assembler. The number of SNPs was similar in both di- and tetraploid A. thaliana samples with around 23,000 SNPs, the number for diploid Acer and Vaccinium was more than 100,000 SNPs, and the number for autotetraploid Acer and Vaccinium 571,648 and 351,211 SNPs, respectively (Additional file 5). The number of detected genetic variants in diploid A. thaliana is comparable with the number in autotetraploid A. thaliana. Regarding AS events and short indels (< 3nt), the least were found in diploid V. arboreum (8,706 and 6,700, respectively), and the most in autotetraploid A. pseudoplatanus (60,467 and 88,689, respectively) (Additional file 5).

Transcript clustering

To further investigate AS events and redundant transcripts, a clustering of the assembled transcripts was performed with cd-hit-est with a sequence identity threshold of 95%. The proportion of resulting representative transcripts was high in the SOAPdenovo-Trans assemblies (0.95–0.99) and very low in the TransLiG assemblies of autotetraploids (0.60–0.67) (Additional file 6). The further analysis of the completeness with BUSCO showed that in the Vaccinium assemblies the completeness even got little higher in most cases (up to + 6 complete BUSCOs) while the reduction in Arabidopsis assemblies was the highest (-4 to -36 complete BUSCOs) (Fig. 5 and Additional file 3). In general, the number of duplicated BUSCOs decreased in the clustered assemblies compared to the non-clustered ones. The proportion of duplicated BUSCOs in the A. pseudoplatanus TransLiG assembly decreased the most from 0.83 to 0.52 and in the autotetraploid A. thaliana from 0.75 to 0.31 (Fig. 5 and Additional file 3).

Fig. 5
figure 5

Assembled BUSCOs for clustered assemblies. The number of complete BUSCOs for each assembler clustered with cd-hit-est, 95% sequence identity threshold (SO = SOAPdenovo-Trans, TL = TransLiG, TR = Trinity), is shown for the assembled transcriptomes with genera Acer, Vaccinium and Arabidopsis in diploids (2×) and autotetraploids (4×)

Alternative splicing estimation

To estimate the amount of AS forms in the assemblies, both, Trinity’s and TransLiG’s integrated information of the gene ID within the transcript IDs showed that in general more isoforms per genes were present in the autotetraploids (1.5–1.8) compared to diploids (1.2–1.7) with the highest values in A. thaliana. In general, Trinity resulted in a stricter clustering than TransLiG (Additional file 6).

To investigate the number of transcripts that represent different alleles rather than true AS forms, cd-hit-est was run with stricter parameters, integrating transcript and alignment length information. Here, SOAPdenovo-Trans assemblies had the highest proportion of representative transcripts with 99% to 100% in di- and autotetraploid species, respectively, while the proportion in the TransLiG assemblies of the autotetraploids was between 83 and 85% (Additional file 7).

Key findings

The key findings of this study were summarized for each assembler, averaged for all investigated species and provided in Table 2. TransLiG produced for both, di- and autotetraploid species, assemblies with the highest amount of reads that mapped back to the assembly in a sufficient way (0.88 and 0.90, respectively, SO: 0.68; 0.57, TR: 0.82; 0.77). Further, TransLiG had the lowest proportion of short transcripts (0.24 and 0.14), the highest amount of complete BUSCOs (1,699 and 1,916, SO: 1,266; 1,133 TR: 1,615; 1,705), and the lowest number of fragmented BUSCOs (188 and 132, SO: 328; 419, TR: 240; 276). Trinity assemblies on the other hand showed the highest protein reference coverage (0.45 and 0.51, respectively), but only slightly better than TransLiG (0.42 and 0.50). Comparing the A. thaliana assemblies to the complete cDNA reference set, highest reference coverage was seen for diploid A. thaliana assembled with Trinity (0.25 vs 0.18 and 0.21) while the reference coverage of tetraploid A. thaliana was the highest for the TransLiG assembly (0.38 vs 0.18 and 0.25). SOAPdenovo-Trans produced assemblies with the least proportion of uncovered bases (0.02) and the highest proportion of representative transcripts (0.97 and 1.00) using two different parameters for clustering. The lowest number of representative transcripts was seen for TransLiG in the autotetraploids (0.63 or 0.84 with stricter parameters). The number of assembled transcripts was similar for SOAPdenovo-Trans (0.13 and 0.20 million transcripts) and TransLiG (0.13 and 0.25) but significantly higher for Trinity (0.21 and 0.46). In general, the differences between the assembler diverged more within autotetraploid species compared to the diploid species.

Table 2 Summary of the key findings shown for each assembler