Background

Until recently, strategies for improvement of many global food legume species have been hindered by a lack of genetic and genomic resources. During the early 1990s, barrel medic (Medicago truncatula Gaertn.) and Lotus japonicus L. were selected as candidate model legume species, due to their relatively small genome sizes, inbreeding reproductive habits and short life-cycles [1, 2]. Whole genome sequencing projects have been undertaken for these species, providing the opportunity to identify putative orthologous gene sequence resources in other crop legume species, especially those located within the Galegoid clade of the Fabaceae sub-family Papilionoideae [3]. In addition, a draft genome sequence has been completed for the warm-season food legume soybean (Glycine max), which is located in the other major (Phaseoloid) clade [http://www.phytozome.net/soybean], providing further insights into comparative genomics within the Fabaceae family.

The current dearth of genomic resources for many crop legume species prohibits effective exploitation, through comparative or translational studies, of molecular genetic tools generated from the three genome draft sequences. There is consequently a pressing need for significant efforts to either develop markers capable of cross-species transfer, in order to enrich existing genetic maps, or generate more informative species-specific genetic and genomic tools which can enable the identification of orthologous genes through genome synteny analysis [3].

Lentil (Lens culinaris ssp. culinaris) is an important grain legume species cultivated throughout Western Asia, the Middle East, North Africa, the Indian subcontinent, North America and Australia, providing a vital source of protein in human diets and straw for animal feed. Lentil is a diploid (2n = 2x = 14), annual flowering self-pollinating crop with a genome size of c. 4 Gbp [4]. Lentil shares the ability to fix atmospheric nitrogen with other legumes, making it important in the management of soil fertility in cereal-based cropping systems. Lentil also provides rotational benefits for management of weeds, diseases and pests, and in many cases offers a profitable, high-value crop option for farmers [5]. However, relatively few genomic resources are currently available for lentil, with a total of 9,513 EST sequences present in the public domain as of 3rd February 2011.

Estimation of elapsed times since species divergence from a common ancestor is important for plant comparative genomics. The Galegoid sub-family Vicieae, which contains M. truncatula, lentil, field pea (Pisum sativum L.) and faba bean (Vicia faba L.), diverged from the Loteae sub-family (which contains Lotus japonicus) c. 25 million years ago [6]. Despite this extended period of divergence, high levels of macrosynteny are observed between the various Galegoid species [2]. Close genomic relationships have also been observed for more distant comparisons, such as between Glycine and Medicago [7], for which substantial regions of almost perfect colinearity have been reported. Comparative sequence studies with genome-sequenced legumes are hence of potentially high value for underdeveloped species such as lentil.

Due to recent advances in sequencing technology it has become possible to rapidly generate large datasets with significantly reduced time and labour requirements [8, 9]. These methods offer a cost-effective means to access the gene space of a target organism through in-depth sequencing of the transcriptome. Initial transcriptome sequencing studies were largely exploratory, and failed to exploit the potential for next-generation transcriptome sequencing at different scales [10]. However, many recent reports have been published on massively parallel approaches to transcriptome sequencing [11-13], largely using model organisms with available draft genomes to assist assembly [19]. Functionally-associated EST-SSRs provide an effective means of molecular marker development that targets nucleotide diversity in genic regions, allowing the possibility of 'perfect' marker development for the molecular breeding of crop plants. In addition, due to location in conserved genic regions, EST-SSR markers frequently display a high degree of operational transferability between related species [4, sheet 4.2). Base substitution rate was determined through comparison of consensus sequences of orthologues from Lens culinaris, Medicago truncatula and Arabidopsis thaliana for four randomly selected genes (each being > 0.5 kb in length) (data not shown). The average nucleotide substitution rate in the lentil and M. truncatula comparison was found to be 9 per 100 bp (stdev = 1.1), but the equivalent value when comparing lentil to A. thaliana was 23 per 100 bp (stdev = 2.6). Due to the relatively high degree of sequence conservation, data from the present study is also applicable to model species such as M. truncatula and A. thaliana, providing an opportunity to study comparative genomics and evolutionary relationships between dicotyledonous plant species.
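The per-100 bp substitution rates quoted above can be illustrated with a minimal sketch. The underlying alignments are not shown in the source, so the calculation below rests on assumptions: the orthologous sequences are taken to be already pairwise aligned, gap positions are ignored, and the spread is summarised with the sample standard deviation. The function names are hypothetical and this is not necessarily the procedure used in the study.

```python
from statistics import mean, stdev


def substitutions_per_100bp(aligned_a, aligned_b, gap="-"):
    """Mismatches per 100 compared bases between two pre-aligned sequences.

    Assumption: both inputs come from the same pairwise alignment and so
    have equal length; columns containing a gap in either sequence are skipped.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    mismatches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue  # indels are not counted as base substitutions
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * mismatches / compared


def summarise_rates(aligned_pairs):
    """Return (mean, sample stdev) of per-gene rates; needs at least two genes."""
    rates = [substitutions_per_100bp(a, b) for a, b in aligned_pairs]
    return mean(rates), stdev(rates)
```

Applied to four aligned gene pairs, `summarise_rates` would return a mean and standard deviation in the same form as the values reported for the lentil and M. truncatula comparison.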

The consensus sequences were also compared against the Arabidopsis thaliana database, and 7,476 unique matches were identified, including 3,941 contigs and 3,535 singletons (Additional file 5). All unique matches were annotated and gene ontology (GO) terms were further assigned, corresponding to a total of 34,034 gene counts and 44,734 annotation counts (Figures 4-6). The intracellular component category of the cellular component classification class contributed the largest proportion of all annotations (17%), followed by the cytoplasmic component (13%), chloroplast component (11%), membrane component (10%), other cellular component (10%), nuclear component (8%) and plasma membrane component (8%) categories. Other components such as plastid, cytosol, mitochondria, ER, Golgi apparatus, cell wall, ribosome and extracellular components were each represented at proportions of less than 5% of the total (Figure 4). Among the molecular function classification class, the enzyme activity, transferase activity, binding activity, hydrolase activity, molecular function, nucleotide binding and protein binding categories included the majority of detected matches (Figure 5). In the biological processes classification class, cellular (25%) and metabolic processes (22%) constituted the major categories, followed by protein metabolism (9%), unknown biological processes (9%), developmental processes (5%), stress response (5%), transport (5%) and cell organization and biogenesis (4%) (Figure 6).

Figure 4

Pie-chart representation of GO annotation results from lentil consensus sequences for Cellular Component, with a total number of gene counts of 11,446. Since one gene product can be assigned to more than one GO term, the total percentage in each category could exceed 100%.

Figure 5

Pie-chart representation of GO annotation results from lentil consensus sequences for Molecular Function, with a total number of gene counts of 119,316. Details are as for Figure 4.

Figure 6

Pie-chart representation of GO annotation results from lentil consensus sequences for Biological Process, with a total number of gene counts of 13,270. Details are as for Figure 4.

Finally, the lentil unigene set was compared against the Glycine max EST sequence database [http://www.phytozome.net/soybean, 25], which identified 20,419 unique matches (Additional file 6).

Frequency and distribution of EST-SSRs in lentil transcriptome

EST-SSR discovery was performed on assembled contig templates, and a total of 2,929 distinct loci were identified, a frequency of 16% (2,415 SSR-containing contigs/15,354 total contigs). A total of 2,393 SSR primer pairs were designed from these loci, with 412 template contigs containing at least two SSR loci eligible for primer pair design (Additional file 7). Incidences of different repeat types were determined (Table 2), the most abundant being trinucleotide arrays (1,424: 60.6%). Frequencies for each array type according to repeat unit number were also evaluated (Table 2), the most common class being n = 5 (1,037 loci: 44.1%), while only 1.1% of loci contained more than 9 repeat units.
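As an illustration of how perfect SSR arrays can be located in contig sequences, the following minimal sketch uses regular expressions. It is not the software or parameter set used in the study: the minimum repeat thresholds in `MIN_REPEATS`, the exclusion of mononucleotide runs, and the function names are all assumptions chosen for the example.

```python
import re

# Hypothetical thresholds: motif length -> minimum number of repeat units.
# These are illustrative values, not the parameters used in the study.
MIN_REPEATS = {2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return (start, motif, repeat_count) tuples for perfect SSRs in seq."""
    seq = seq.upper()
    hits = []
    for unit_len, min_n in sorted(min_repeats.items()):
        # One motif copy followed by at least (min_n - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_n - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are repeats of a shorter unit ("AAA", "ATAT"),
            # so each array is reported once under its shortest motif.
            if any(
                unit_len % k == 0 and motif == motif[:k] * (unit_len // k)
                for k in range(1, unit_len)
            ):
                continue
            hits.append((match.start(), motif, len(match.group(0)) // unit_len))
    return hits


def ssr_contig_fraction(contigs):
    """Fraction of contigs that contain at least one detected SSR."""
    if not contigs:
        raise ValueError("no contigs supplied")
    with_ssr = sum(1 for contig in contigs if find_ssrs(contig))
    return with_ssr / len(contigs)
```

For the counts given above, 2,415 SSR-containing contigs out of 15,354 is a fraction of about 0.157, which rounds to the reported 16%. Different threshold choices will change the number of loci detected, so results from a sketch like this are not directly comparable with the reported totals.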

Table 2 Summary information on frequencies of different SSR repeat motif types related to variation of repeat unit numbers in lentil EST-SSR loci

Validation of a subset of EST-SSRs

A subset of 192 EST-SSR primer pairs was selected for validation of marker assay performance. A total of 166 primer pairs successfully obtained amplification products from one or more template genotypes, of which 51 (30.7%) revealed polymorphism between 12 L. culinaris genotypes. Inclusion of the non-domesticated species L. nigricans permitted polymorphism detection by 28 additional primer pairs (an increase to 47.5% of total) (Additional file 8).

Discussion

Assembly and annotation

A limited amount of genomic data pertaining to the cool-season food legume lentil (as well as related species such as chickpea, field pea and faba bean) was publicly available prior to commencement of this study. Relatively few activities have previously been performed to address this deficit. The 454 Life Sciences GS-FLX Titanium second-generation sequencing technology provides a rapid, efficient and cost-effective method for genomic resource enrichment through generation of large numbers of ESTs with individual read lengths of up to 500 bp. The technology has previously been used to perform de novo bacterial genome sequencing, whole genome shotgun sequencing, metagenomic studies, transcriptome characterisation and small RNA sequence determination [26].

In order to produce a maximally informative lentil transcriptome sequence resource, cDNA from leaf and stem tissue was normalised prior to sequence analysis. This process reduces oversampling of abundant transcripts such as those derived from the chloroplast or nuclear genes involved in photosynthetic processes (e.g. rbcL, rbcS, cab), and permits more efficient detection of transcripts expressed at low levels in specific tissues. A preliminary study (data not shown) suggested that normalisation of lentil cDNA could enhance rare transcript detection by c. 10%. A similar approach was effective for detection of lowly-expressed genes from cDNA sequencing in both M. truncatula and Artemisia annua [38]. Traditional methods for development of genomic DNA-derived SSRs are expensive, laborious and time-consuming. However, ESTs generated by large-scale transcriptome sequencing programs are a potentially rich source for SSR discovery. EST-SSRs exhibit potential advantages when compared to SSRs located in non-transcribed regions, due to generally more consistent efficiency of amplification and enhanced cross-species transferability [39, 40]. Substantial numbers of SSR loci were identified in the present lentil EST collection, supporting high quality PCR primer design in most instances. As EST-SSR loci are genic and have been derived from a transcriptome database, the majority of the EST-SSR loci should occur in the protein-coding sequences of annotated contigs, representing genes of known or predicted identity and function.

The frequency of EST-SSR loci detected in cultivated lentil is similar to estimates from other broad-leaved plant species (2% to 17%) [41]. These prevalence values are influenced by the type of software (such as SPUTNIK, SSRIT, BatchPrimer3 and FastPCR) and the default parameters used for detection, which would be expected to produce minor influences on the efficiency of detection [42-46]. In the present study, trinucleotide units are the most abundant form of SSR repeat structure, consistent with results from other plant species [47, 48].

The results of EST-SSR validation using cultivated and non-domesticated Lens genotypes suggest that polymorphism frequency is substantially enhanced by inclusion of germplasm from outside the pool of contemporary germplasm, as observed for other crop plant species [49-51]. The ability to amplify across species boundaries is a common property of EST-SSRs, as previously described, and has been demonstrated for other legume taxa through efficient detection of polymorphisms in other Medicago species such as alfalfa by M. truncatula-derived EST-SSRs [46]. L. culinaris-derived SSRs may hence be implemented for study of the genus Lens in the broader sense, to fully access exotic gene pools.

Conclusions

Generation of a substantial EST-derived dataset from cultivated lentil is described in this study, comprising 84,074 unigenes, of which c. 25,000 have been sequence annotated. A set of EST-SSR primer pairs has been designed using unigene templates and demonstrated to be effective for polymorphism detection within cultivated germplasm and across the genus Lens.