Background

As the largest genus in the tribe Triticeae of family Poaceae, Elymus L. contains approximately 150 species occurring in most temperate regions of the world [1]. Given its excellent resistance to drought, cold and disease, the Elymus species can provide important gene pool for improvement and breeding of related cereal crops [2]. Siberian wildrye (Elymus sibiricus L.), a model species of Elymus genus, is a perennial, cool-season and self-pollinated forage grass [1]. It widely spreads in northern Eurasia, and is especially used in the Qinghai-Tibet Plateau (QTP) for grassland establishment and restoration because of its strong adaptability, cold-resistance, high nutritional value, good palatability and ease for cultivation [3, 4]. Up to now, increasing attention has been paid to its germplasm characterization, intraspecific genetic diversity, phylogenic evolution and linkage map construction [1, 4, 5]. However, the lack of genomic sequences hinders the genetics study and breeding progress of this species compared to other Triticeae cereals such as wheat and barley.

E. sibiricus has an allotetraploid genome (StStHH, 2n = 4x = 28). The mean nuclear DNA content (C-value) of E. sibiricus was determined by flow cytometry (FCM) as being 2C = 16.61 pg, approximately twice as large as the possible diploid progenitors from genus Pseudoroegneria (StSt) and Hordeum (HH) [6]. This complexity and huge genome size pose a great challenge to whole genome sequencing of E. sibiricus. Genome survey using next-generation sequencing (NGS) with a given uniform sequencing depth is an alternative and cost-effective strategy in obtaining genome size, heterozygosity, GC and repeat contents. Furthermore, development of molecular markers based on genome survey sequencing and in silico analysis has become a practical tool for genetic study [7].

Among the various types of molecular markers, SSRs (simple sequence repeats) have many advantages including high polymorphism, codominant heredity, good reproducibility and extensive distribution in the genome [8]. These properties have proven to be of great interest for diverse genetic studies including genetic map and fingerprint construction, genetic diversity characterization, and molecular marker assisted breeding, etc. SSR markers can be developed using homology searches from either genomic libraries or transcriptome sequences and expressed sequence tags (ESTs) databases. Often, G-SSR (genomic DNA-derived SSRs) markers are considered to possess higher polymorphism than EST-SSR markers (EST-derived SSRs) due to conservation of the transcribed portions of genome. The ongoing development of next-generation sequencing (NGS) techniques e.g. genomic survey sequencing has economically allowed to access large amounts of genomic data, and can further identify genomic SSR loci by in silico searching SSR motifs in massive scaffold datasets [9,10,15]. On the contrary, compared to multi-loci SSRs, single-locus SSRs primers target a unique location in the genome and could provide more reliable scoring of genotypes in genetic study and breeding programs.

In this study genomic survey sequencing was applied in a E. sibiricus cultivar ‘Chuancao No.2’ using Illumina Hiseq X-ten platform. The first draft genome of E. sibiricus was constructed and some single-locus SSRs and multi-loci SSRs were developed based on genomic survey data. Furthermore, genetic diversity and structure of 27 wild E. sibiricus accessions were characterized using these new SSR markers plus previously published ones. Effectiveness of those markers was also compared and evaluated.

Results

Genome sequencing and characterization

Fourteen 270-bp libraries with pair-end reads of E. sibiricus were constructed and sequenced. The randomly selected 10,000 pairs of reads were then analyzed by BLAST using the NCBI databases. The BLAST result showed that Triticum aestivum and Hordeum vulgare, as the closely-related species to Elymus, were the best matching species in all libraries. In addition, reads of each library aligned with chloroplast genome of E. sibiricus showed lower than 5% matching rate, which indicated that the libraries were established without contamination (data not shown).

From fourteen libraries with 270 bp insertion size, totally 469.17 Gb of clean data were produced using an Illumina HiSeq X-ten platform. The estimated genome size of E. sibiricus was approximately 6.86 Gb with a total sequencing depth of 58.64-fold (Table 1). High quality scores of the filtered sequences were calculated, and the percentages of Q20 and Q30 (the sequencing error rate 1 and 0.1% respectively) was greater than 95.81 and 90.50% respectively, indicating the high accuracy of the sequencing process (Table 1). Then all of the clean data were subjected to 25-mer (k = 25) frequency distribution analysis. It showed that the peak value of the k-mer depth distribution emerged at 54 (Fig. 1). The heterozygosity rate and the proportion of repeat sequence was calculated to be 0.01 and 73.23% respectively according to the k-mer curve distribution.

Table 1 Summary of the fourteen libraries with 270 bp short-inserts
Fig. 1
figure 1

K-mer (k = 25) frequency distribution. The x-axis is depth and the y-axis represent the frequency at that depth. The genome size was calculated following ‘genome size = K-mer num/peak depth’. The two subpeaks were caused by certain repeats rate in the genome of E. sibiricus

After de novo assembly using the SOAP de novo program, all of the clean reads produced a total of 4,841,088 contigs with an N50 length of 2510 bp, which were subsequently assembled into scaffolds (Table 2). Among all those contigs, 683,040 contigs were longer than 500 bp, 352,851 contigs were longer than 1 kb and 7536 contigs were longer than 10 kb. Scaffolds larger than 100 bp were selected for further analysis. Totally 4,763,904 scaffolds were generated with an N50 of 2648, and the number of scaffolds longer than 500 bp, 1 kb and 10 kb were 684,597, 352,659 and 8040, respectively (Table 2). The assembled draft genome was approximately 4.34 Gb, accounting for 63.27% of the estimated 6.86 Gb genome. The calculated GC content of the assembled genome was 45.68% (Table 2), that was consistent with the scatter plot graph built with scaffolds larger than 500 bp (Fig. 2). A total of 25,993 genes were annotated in the E. sibiricus draft genome with an average transcript length per gene of 2632.11 bp and an average coding sequence length of 737.36 bp. The predicted average exons number per gene was 4.72 and the average exon length per transcript was 311.32 bp (Table S1).

Table 2 Statistics of de novo assembly
Fig. 2
figure 2

Guanine plus cytosine (GC) content and depth correlation analysis. The x-axis represents the GC content and the y-axis is the sequence depth

Using RepeatMasker software [16], we annotated the repeat regions of the draft genome, in which the most abundant repetitive elements was retroelements (16.45%), followed by DNA transposons (3.26%). The most common element in retroelements and transposons were long terminal repeat (LTR, 15.90%) and Tc1-IS630-Pogo (0.25%), respectively (Table S2).

Development of genome-wide SSR markers

About 4.34 Gb genome sequences were searched for SSR loci and resulted in 315,446 SSRs from 507,162 (20.23%) scaffolds. 22,611 (0.90%) sequences contained a single-locus SSR which was not found on other scaffolds (Table S3). In these SSRs, the mono-, di- and tri-nucleotide motifs were the most enriched, which accounted for 61.81, 21.59 and 14.55% of the total identified SSRs, respectively (Table 3). Nucleotide composition characteristics indicated that A/T (64.21%), AG/CT (27.35%), CTC/GAG (7.6%), CATG/CATG (3.38%), AAAAT/TTTTA (2.96%) and CTTTTT/GAAAAA (2.91%) were the most abundant motifs corresponding to mono- to hexa-nucleotide repeats, respectively (Fig. 3). Generally, motif abundance decreased as the motif repeat number increased for each motif type (Fig. 4). The top three abundant motif repeat number were 10, 6 and 11, whose total number of SSR motifs were 77,930, 31,980 and 31,575 respectively (Table S4). Using Primer3 software [17], a total of 67,833 SSR primer pairs were designed for above-mentioned 22,611 single-locus SSR-containing sequences. In silico analysis was then used to verify the reliability and polymorphism of these primers by aligning the flanking sequences of them to the genome sequences. Four hundred SSR markers were randomly selected to amplify the genomic DNA of eight E. sibiricus accessions (Table S5) and 30 of them displayed polymorphisms and ten showed single-locus amplicons as expected (ESGA-SL) and 20 displayed multi-loci amplicons (ESGA-ML) (Table S6 and Fig. S1).

Table 3 Statistics of SSR motif
Fig. 3
figure 3

The most abundant motifs (red portion) corresponding to mono- to hexa-nucleotide repeats

Fig. 4
figure 4

Motif frequency distributions of mono- to hexanucleotide motif types (y-axis) with different repeat numbers (from 5 to > 20, x-axis) in the de novo assembled genomic sequences of E. sibiricus. The z-axis represents the number of each type of motif

Polymorphism of novel SSR markers in E. sibiricus

Marker polymorphism evaluation

A total of 60 SSR markers (30 ESGA, 15 ES and 15 ESGS, Table S6) were used to genotype 27 E. sibiricus accessions, which were collected from Mongolia (MG), eastern Qinghai-Tibet Plateau (QTP) and Siberia (SI) respectively. A total number of 29 alleles (Na) were produced by ten ESGA-SL markers, with an average of 2.9 (Table 4). The average polymorphism information content (PIC) value of all ESGA-SL markers was 0.391, among which the ESGA-SL-7 had the highest PIC value of 0.595. The mean value of observed heterozygosity (Ho) calculated based on ESGA-SL markers was 0.480, among which ESGA-SL-10 had the highest Ho of 0.963.

Table 4 Genetic indexes of ESGA-SL markers

For multi-loci markers, 105, 97 and 67 polymorphic bands were amplified by ESGA-ML, ESGS and ES primers, and their corresponding average polymorphism percentage was 90.52, 82.49 and 73.84%, respectively (Table S7). Newly developed ESGA-ML markers had the highest average value of PIC (0.4059) and band informativeness (BI, 0.6494) compared to other multi-loci markers (ESGS and ES). Mann Whitney test [18] indicated that there was no significant difference between ESGA-SL and multi-loci markers in view of PIC values (Table S8). For MI, BI and Rp values, the significant difference (P < 0.01) was observed between transcriptome-developed ES marker and genomic-developed ESGA-ML and ESGS markers.

Cluster and STRUCTURE analysis

Genetic memberships of the 27 tested E. sibiricus accessions based on ESGA-SL and ESGA-ML datasets were acquired via STRUCTURE 2.3.4 software (Fig. 5 and Fig. S2). The result revealed an optimal K value of 2 (K = 2), implying the tested accessions belonged to two potential genetic memberships, which was consistent with the UPGMA dendrogram and principal coordinates analysis (PCoA) using ESGA-SL markers. The 27 accessions could be divided into two clusters (Cluster I and Cluster II, Figs. 5 and 6). Cluster I included all the Mongolia (MG) and Siberia (SI) accessions, and Cluster II included all eastern QTP originated ones. Thus, the wild germplasms from different regions could be notably identified through the UPGMA and PCoA analysis, which revealed the powerful discriminability for wild accessions based on ESGA-SL markers. However, in spite of distinct characterization of tested accessions from different regions characterized via PCoA analysis based on ESGA-ML markers (Fig. S3), the topological structure of UMPGA dendrogram was ambiguous and couldn’t distinguish the 27 wild accessions clearly (Fig. S2). Furthermore, the similar result of UPGMA cluster and PCoA analysis based on ESGA-ML markers was also found in ESGS markers (Fig. S4 and Fig. S5). Besides, compared to genomic-developed markers, the transcriptome-developed ES markers could not explain their structure membership well (Fig. S6 and Fig. S7). This indicates that G-SSR markers has the superior discriminability than the EST-SSR markers for intraspecific diversity analysis.

Fig. 5
figure 5

The UPGMA dendrogram and genetic structure of 27 studied E. sibiricus based on ESGA-SL markers. MGL, Mongolia; QTP, eastern Qinghai-Tibet Plateau; SI, Siberia

Fig. 6
figure 6

The PcoA analysis of 27 studied E. sibiricus based on ESGA-SL markers

The genetic diversity pattern of E. sibiricus based on different types of markers

Based on geographical origin, all tested wild accessions were divided into three geo-groups: MGL (Mongolia), SI (Siberia) and QTP (eastern Qinghai-Tibet Plateau). The observed heterozygosity (Ho) of each geo-group based on the ESGA-SL dataset was changing from 0.467 to 0.492 (Table 5). In general, the values of Na, Ne (effective alleles number), I (Shannon information index), He (expected heterozygosity) and PP (Percentage of polymorphic loci) of each geo-group calculated by ESGA-SL marker were higher than that of ESGA-ML (Table 5). AMOVA analysis was carried out based on both the ESGA-SL and ESGA-ML markers and coefficient of genetic differentiation (Fst) were calculated. The results showed that genetic variation of tested germplasms was mainly distributed among geo-groups with moderate genetic differentiation (Fst = 0.553 for ESGA-SL and Fst = 0.573 for ESGA-ML) (Table 6).

Table 5 Genetic diversity of studied E. sibiricus geo-groups based on ESGA-SL and ESGA-ML markers
Table 6 Genetic variation of E. sibiricus geo-groups

Discussion

Characteristics of E. sibiricus draft genome

The development of next-generation sequencing (NGS) provided researchers with an attainable and cheaper method to access the plant genomes, especially for the non-model grass species like Elymus sibiricus. Based on Illumina sequencing with fourteen 270 bp insertion size libraries, surveyed draft genome of E. sibiricus was de novo sequenced and assembled in this study. The moderate GC content (45.68%) of the draft genome indicated that the Illumina platform was excellently avoided sequencing bias. The final assembly had a N50 of 2510 bp for contigs, far less than that of Lolium perenne L. (contig N50 = 16,370 bp) [19], while slightly larger than that of allotetraploid Arachis hypogaea L. (contig N50 = 696.6 bp) [20]. This may be caused by the relatively big estimated genome size (6.86 Gb), high repetitiveness (73.23%) of E. sibiricus and the short insertion size of library. The estimated genome size of E. sibiricus (6.86 Gb) was smaller than that of related allohexaploid Triticum aestivum (17 Gb) [21], while larger than that of many other important species in Gramineae, such as Hordeum vulgare (5.1 Gb) [22], Aegilops tauschii (4.5 Gb) [23], Triticum urartu (5.0 Gb) [24], Brachypodium distachyon (260 Mb) [25], Oryza sativa (466 Mb) [26], Sorghum bicolor (730 Mb) [27], Lolium perenne (2 Gb) [28] and Zea mays (2.3 Gb) [29]. The low level of heterozygosity for E. sibiricus (0.01%) obtained via the k-mer analysis was probably caused by the self-pollinating mating system of E. sibiricus, and indicated its feasibility for genome sequencing. This is the first draft genome of E. sibiricus and it is useful in the molecular marker development and functional gene mining. This work also provided the basis for further whole-genome sequencing using larger insert libraries and new sequencing technique like the single-molecule real-time sequencing.

SSR marker development

SSR markers have been widely applied in genetic study and molecular breeding. Among all of the identified 293,362 SSRs, the vast of SSRs (97.95%) belonged to mono-, di- and tri-nucleotide motifs, which was similar to the result of restriction site associated DNA-Seq (RAD) in E. nutans [30]. However, in the transcriptome sequencing study of E. sibiricus, the tri-nucleotide motifs had the largest number [37], which is difficult for genotype scoring. In this case, single-locus SSR markers are considered as the best choice, and development of single-locus SSRs has been reported in barley, peanut and Luffa by genome survey [9, 12, 38]. In this study, 10 single-locus SSR markers were developed via the genome survey of E. sibiricus with great potential use in genetic variation study and linkage map construction.

Effectiveness comparison between single- and multi-loci markers

Genetic diversity of 27 wild E. sibiricus accessions was evaluated by 30 markers developed in this study and other 30 ones reported before. We found that the expected single-locus SSRs screened by in silico analysis still exhibited multi-loci amplicons when separated by polyacrylamide gel. This may be caused by their non-conservatism of flanking sequences [9]. Finally, only 10 single-locus (ESGA-SL) markers and 20 multi-locus (ESGA-ML) markers were obtained in this study for genetic diversity analysis of 27 wild E. sibiricus accessions.

The average amplified alleles of the 10 ESGA-SL markers was 2.9, which was close to the allotetraploid species Arachis hypogaea (3.85) and Brassica napus (3.23) [9, 37]. The PIC value of the 10 ESGA-SL markers varied from 0.069 to 0.595 with an average of 0.391, that indicated its abundant polymorphism and high application value [39]. There was no significant difference of PIC detected between ESGA-SL and other three marker systems (ESGA-ML, ES and ESGS), which may be caused by the different calculation criteria between single-locus and multi-loci marker or the limited amplification loci of single-locus markers. According to the Mann Whitney test, G-SSR (ESGA and ESGS markers) was more efficient and polymorphic than EST-SSR (ES markers) in view of PIC, MI and Rp, that may be driven by the more conservative flanking sequences of EST-SSR [40, 41]. In addition, significantly (P < 0.05) higher PIC values of ESGA-ML markers vs. ESGS markers were calculated, which demonstrate the superiority of SSR markers development method by sequencing over traditional method.

The UPGMA and PCoA derived cluster analysis based on ESGA-SL markers divided the 27 wild E. sibiricus accessions into two groups, and the structure analysis based on Bayesian algorithm also revealed the same pattern. However, the other three types of multi-loci markers exhibited inferior ability than ESGA-SL marker in revealing actual genetic relationships. One should note that all the genetic diversity parameters (Na, Ne, I, He and PP) of each geo-group calculated based on ESGA-SL markers were higher than that of ESGA-ML, which suggested that the single-locus marker reveals more accurate genetic information, so it is more suitable for further genetic analysis [37]. However, slightly higher pairwise Fst values were observed among each geo-group based on ESGA-ML markers. Given that multi-loci SSRs possesses characteristic like multiple amplification sites in the genome location, a part of genetic information was unavoidably covered. The advantage of single-locus markers over multi-loci markers was manifested in this study, however, vast number of single-locus markers that covering the entire genome of E. sibiricus are required further be identified or developed. In this case, higher quality genome-wide sequencing and assembling for E. sibiricus are necessary.

Conclusions

In this study, the de novo whole genomic survey of E. sibiricus was performed and a 4.34 Gb reference genome sequence was obtained with 73.23% repetitive elements, 0.01% heterozygosity and 45.68% GC content. Totally 293,362 SSR markers were identified from the draft genome and 67,833 potential markers were screened by in silico analysis. Subsequently, ten single-locus (ESGA-SL) markers and 20 multi-locus (ESGA-ML) markers were verified by gel electrophoresis and exhibited polymorphism in 27 E. sibiricus accessions. The single-locus marker was proved more efficient and informativeness in genetic study than multi-loci marker. This survey of the genome and the developed SSR markers will facilitate further whole genome sequencing, molecular breeding and phylogenetic study of E. sibiricus and related Triticeae species.

Methods

Plant materials

E. sibiricus cultivar ‘Chuancao No.2’ provided by Sichuan Academy of Grassland Sciences (Chengdu, China) was adopted after identification as tetraploid by flow cytometry (Fig. S8) and planted in the growth chamber (25 °C, 300 μmols·m2·s− 1, 16-h photoperiod). The total genomic DNA of ‘Chuancao No.2’ was isolated from fresh young and clean leaves using a DNA extraction kit (Tiangen, Bei**g, China). DNA concentration and purity were checked on a BioPhotometer (Eppendorf, Germany) and the quality was detected by 1% agarose gel electrophoresis.

Library construction and Illumina sequencing

Fourteen genomic paired-end (PE) libraries with 270-bp insertions were prepared following the manufacturer’s instructions and then sequenced on an Illumina HiSeq X-ten platform. Clean reads were obtained abide by the following filtration and correction criterion [9]: less than 10% unidentified nucleotides (N); no more than 10 nt aligned to the adaptor, allowing for at most 10% mismatches; with at most 50% bases having a phred quality of < 5. Putative PCR duplicates generated during PCR amplification in the library construction process was excluded. In addition, to investigate the potential contaminating effect, 10,000 pairs of clean reads were randomly selected and searched against the NCBI database using BLAST [42]. Finally, to evaluate the content of extra-nuclear DNA in the aforementioned fourteen libraries, BLAST was performed using SOAP [43] with the chloroplast genome of E. sibiricus (MK775250, 135,075 bp).

Genome assembly, annotation and guanine plus cytosine (GC) content analysis

The filtered high-quality sequences were assembled by SOAPdenovo2 [49].

SSRs evaluation based on the genetic diversity of E. sibiricus germplasm

Four hundred pairs of SSR markers were randomly selected for synthesis, then PCR and electrophoresis were performed for screening and validation. Primers with only 0–2 amplified bands were recognized as single-locus SSR and those possessing polymorphism was named ‘E. sibiricus genome assembly single locus’ (ESGA-SL) marker. Analogously, primers with more than 2 amplified bands simultaneously polymorphic was called ‘E. sibiricus genome assembly multi loci’ (ESGA-ML) marker. In addition, 30 pairs of multi loci markers including fifteen G-SSR markers (ESGS [40]) developed based on magnetic bead enrichment, and fifteen pairs of EST-SSR markers (ES [31]) based on E. sibiricus transcriptome were also selected to amplified the same 27 E. sibiricus accessions DNA (Table S5, Supplementary file).

Amplified bands were recorded following genotypes (single locus marker) or 0/1 binary matrix (multi loci marker). Nei’s genetic distance (GD) matrix among 27 accessions was calculated by Freetree [50] with 10,000 bootstrap value, and the UPGMA dendrogram was constructed then visualized in Figtree [50]. The principal coordinate analysis (PCoA) was performed via NTSYS v2.2 [51]. STRUCTURE v2.3.4 [52] was performed based on a Bayesian model to the illustration of genetic membership. The parameters were set to 50,000 burn-in and 100,000 Monte Carlo Markov chain (MCMC) with an admixture model. The STRUCTURE HARVESTER [53] was then applied to estimate the “optimum K”. The hierarchical analysis of molecular variance (AMOVA) was carried out using GenAlEx [54] to calculate the number of alleles (Na), effective alleles (Ne), Shannon diversity index (I), expected heterozygosity (He), polymorphic site proportion (PP) and other genetic diversity parameters. Finally, the observed heterozygosity (Ho) and gene flow (Nm) of single-locus SSRs were calculated.