Introduction

Argania spinosa L. (Fig. 1), belonging to the family Sapotaceae in the order Ericales, is an endemic species distributed over more than 800,000 hectares in the southwestern area of Morocco. Argan oil production has considerably eased the poverty of around 2.5 million members of the rural population living nearby (Lybbert et al. 2002, 2011; Guillaume and Charrouf 2011; Mateille et al. 2016). With the ever-growing population and the booming market, demand for the species has increased greatly in recent years owing to the oil extracted from its kernels, which is marketed worldwide and highly appreciated for its medicinal, cosmetic, and pharmacological value (Gonzálvez et al. 2010; Guillaume and Charrouf 2011; Cabrera-Vique et al. 2012). In addition to its ecological and socio-economic value, many investigations have revealed that the species has a high level of stress tolerance (Chakhchar et al. 2015; Ain-Lhout et al. 2016), which makes it a strong candidate for environmentally friendly agriculture (Le Polain and Waroux 2013; Mateille et al. 2016). However, despite the tree’s well-known tolerance, there is evidence of an alarming regression of the argan forest, which is altering the ecosystem to a large extent. Morocco, as the main argan oil producer, is also highly vulnerable to land degradation due to its geographic location within a drought-prone part of the Mediterranean area (Gonzálvez et al. 2010; Cabrera-Vique et al. 2012). This situation has led to a decline in the green forest landscape and to the erosion of genetic diversity (Alados and Aich 2008; Díaz-Barradas et al. 2010; Alba Sánchez et al. 2015; Zhao et al. 2019). Since the late 1900s, the forest has shrunk significantly in milder areas, with a subsequent decrease in fruit production and oil yield (Díaz-Barradas et al. 2010; Zunzunegui et al. 2010; Ait Aabd et al. 2014).

Fig. 1

Argan (Argania spinosa L.) tree in its natural habitat in the southwestern area of Morocco

Many investigations have identified climate change, along with other anthropogenic factors, as the key elements leading to the observed loss (Alados and Aich 2008; Charrouf et al. 2008; Alba Sánchez et al. 2015). Therefore, under the incessant threat of global warming and the thriving oil market (Le Polain and Waroux 2013), interventions are required to restore the natural forest, which is in danger of extinction, especially as additional contractions are predicted under future climate change scenarios (Alba Sánchez et al. 2015). The implementation of new strategies that address the issues related to forest loss is urgently needed for the sustainable development of the agroforestry system.

Many gaps remain to be filled in terms of research in this area, and further inputs are required regarding the argan genomic basis to improve our knowledge of the bottlenecks that limit argan domestication and breeding. Over the past several years, efforts have mainly been oriented toward the use of DNA-marker-based technology to characterize the high level of intra-species genetic diversity across a broad geographical range of forests (Ait Aabd et al. 2019). Understanding this diversity will allow for the incorporation of superior economic traits through breeding programs. Moreover, research on the argan genetic and genomic bases will greatly boost these efforts. Prior to 2018, these efforts were impeded by the lack of nucleotide sequences in the public databases. Recently, a draft genome assembly of A. spinosa was generated by a hybrid de novo assembly method that combines short and long sequencing reads (Khayi et al. 2018). The 671 Mb assembly was estimated to cover 89% of the genome of the highly heterozygous diploid species, 2n = 2x = 20, with x = basic chromosome number and n = gametic number of chromosomes (Majourhat et al. 2007; Khayi et al. 2018; El Boukhari et al. 2023). However, an annotation was lacking and the genome was not easily accessible. Here, we provide an annotation for the draft genome and easy access for the scientific community to perform further research. We generated a reference transcriptome for argan and used it to annotate the argan genome (Version May 2019). Further, we identified the argan homologs of biosynthetic pathways relevant to its agronomical value, with a special focus on tocopherol biosynthesis.

Materials and methods

Gene and transposable element prediction and genome annotation

To annotate the A. spinosa genome, the draft genome generated in a previous project (Khayi et al. 2018) was used as a resource to execute a homology-based gene prediction program that combines gene models from multiple reference organisms. The obtained sequences and annotations from the evolutionarily related reference species kiwi (Actinidia chinensis; GCA_003024255.1), tea (Camellia sinensis L.; GCF_004153795.1), Rhododendron williamsianum (GCA_009746105.1), and blueberry (Vaccinium corymbosum L.) (www.vaccinium.org, Draper v1.0) were integrated to build preliminary gene models using Gene Model Mapper (GeMoMa) v1.6.3 (Keilwagen et al. 2016). The models were subsequently combined and filtered to generate a set of complete gene models. From these models, 2000 genes were randomly selected and split into two sets.
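The random selection and split described above can be sketched as follows. This is a minimal illustration and not the authors' script: the function name, the fixed seed, and the equal split into two halves (for example, one for training and one for evaluation) are assumptions, since the text only states that 2000 genes were selected and divided into two sets.

```python
import random


def select_and_split(gene_ids, n_select=2000, seed=1):
    """Randomly pick n_select gene IDs and split them into two equal halves.

    The 50/50 split and the seed are assumptions made for illustration.
    """
    unique_ids = sorted(set(gene_ids))  # sorted so the sampling is reproducible
    if n_select > len(unique_ids):
        raise ValueError("not enough gene models to sample from")
    rng = random.Random(seed)
    chosen = rng.sample(unique_ids, n_select)
    half = n_select // 2
    return chosen[:half], chosen[half:]


# Hypothetical identifiers standing in for the filtered GeMoMa gene models.
models = [f"gene_{i:05d}" for i in range(5000)]
train_set, test_set = select_and_split(models)
```

Sampling without replacement keeps the two sets disjoint, so genes used for training never reappear in the evaluation set.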

Genome annotation was subsequently performed using AUGUSTUS software (v. 3.4.0). The selected subset of predicted genes was used during the training step and the evaluation of the AUGUSTUS gene prediction tool (Stanke et al. 2006) according to Hoff and Stanke (2019). All unfiltered GeMoMa models were subsequently converted into AUGUSTUS “hints” as “CDSpart” and “intron”.

RepeatModeler (Smit and Hubley 2015) was used to identify repeats and transposable elements in the genome. The resulting sequences were applied to the genome using RepeatMasker (Smit et al. 2015). The RepeatMasker results were converted into AUGUSTUS “hints” as “nonexonpart”. The final gene model was built using AUGUSTUS with default parameters and allowing for alternative splice variants based on “hints”.
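As an illustration of the conversion step, the sketch below formats masked intervals as GFF-style hint lines with the feature type “nonexonpart”. It is a hedged example rather than the script used in this study: the source label `repmask`, the attribute `src=RM`, and the interval tuples are assumptions, and the source tag must match whatever the AUGUSTUS extrinsic configuration file expects.

```python
def repeat_to_hint(seqid, start, end, source="repmask", src_tag="RM"):
    """Format one masked interval (1-based, inclusive) as a GFF hint line.

    The source label and src tag are illustrative assumptions.
    """
    if start < 1 or end < start:
        raise ValueError("invalid interval")
    fields = [
        seqid,            # sequence (scaffold) name
        source,           # program that produced the hint
        "nonexonpart",    # hint type named in the text
        str(start),
        str(end),
        "0",              # score
        ".",              # strand is irrelevant for repeats
        ".",              # frame
        f"src={src_tag}",
    ]
    return "\t".join(fields)


# Hypothetical masked intervals: (sequence name, start, end).
repeats = [("scaffold_1", 1200, 1850), ("scaffold_1", 4000, 4320)]
hints = [repeat_to_hint(*interval) for interval in repeats]
```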

Genome completeness and gene function prediction

The gene prediction quality was evaluated using the software Benchmarking Universal Single-Copy Orthologs (BUSCO) v.4 (Seppey et al. 2019). BUSCO analyses the coverage of single-copy orthologs, designated BUSCOs, among the gene models predicted with the AUGUSTUS (v3.4.0) gene prediction tool (Stanke et al. 2004). The completeness of the gene set was evaluated using the “viridiplantae” and “eudicots” datasets.

Functional descriptions were assigned to the predicted models based on homology. The predicted genes were searched with BLAST against the UniProt/Swiss-Prot (http://beta.uniprot.org/uniprot) and UniProt/TrEMBL (http://www.uniprot.org/) databases. The functional annotation was derived from the reference protein with the best BLAST hit (the highest score). The final genome annotation results were uploaded to the GenDBE annotation platform (Meyer et al. 2003).
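The best-hit rule can be sketched as below. The example assumes the standard 12-column tabular BLAST output, in which the bit score is the last column; the output format actually used is not stated in the text, and the accessions shown are hypothetical.

```python
def best_hits(blast_lines):
    """Return {query: (subject, bitscore)}, keeping the highest bit score per query."""
    best = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.split("\t")
        query, subject, bitscore = cols[0], cols[1], float(cols[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best


# Hypothetical hits: query, subject, %identity, length, mismatches, gap opens,
# q.start, q.end, s.start, s.end, e-value, bit score.
example = [
    "g1.t1\tsp|P00001|AAA_ARATH\t78.5\t200\t43\t0\t1\t200\t1\t200\t1e-90\t310",
    "g1.t1\tsp|P00002|BBB_ARATH\t65.0\t180\t63\t0\t5\t184\t3\t182\t1e-60\t220",
    "g2.t1\ttr|A0A000|A0A000_CAMSI\t90.1\t150\t15\t0\t1\t150\t1\t150\t1e-80\t280",
]
hits = best_hits(example)
```

The description of the retained subject would then be transferred to the query gene; transcripts without any reported hit simply do not appear in the result.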

RNA extraction and PacBio IsoSeq library construction

To improve the annotated A. spinosa genome assembly with full-length transcripts, an IsoSeq analysis of A. spinosa RNA was performed. The gene prediction was subsequently refined using these RNA sequencing data.

Sample material and RNA extraction

Plant material for RNA-seq was collected from the Bonn University Botanic Gardens, Germany (IPEN numbers: XX-0-BONN-19646 procured in 1980, and MA-0-BONN-24662 and MA-0-BONN-40479 acquired in 2005 from Agadir, Morocco). Due to the high amounts of interfering phenolic compounds in the leaf tissues, 100–500 mg of root and flower tissues were collected separately and used for our experiment. Extraction of the total RNA was performed with the NucleoSpin® RNA Plant and Fungi Kit (Macherey–Nagel GmbH & Co. KG, Düren, Germany) according to the manufacturer’s instructions. The cDNA libraries were prepared, and strand-specific sequencing was conducted by the Biological and Medical Research Center (BMFZ), Düsseldorf, Germany.

To avoid bias toward specific transcripts, the RNAs from the different tissues were pooled in equal amounts and sequenced using PacBio’s Iso-Seq method. The A. spinosa RNA pool of root and flower tissues was assessed using the Qubit RNA HS Assay (Thermo Fisher Scientific, Massachusetts, U.S.A.) and the NanoDrop (Thermo Fisher Scientific, Massachusetts, U.S.A.) to check the concentration, while the FragmentAnalyzer (Agilent Technologies, California, U.S.A.) was used to check the RNA integrity. Subsequently, the intact poly(A) RNA was captured and purified using the NEBNext Poly(A) mRNA Magnetic Isolation Module (New England BioLabs, Massachusetts, U.S.A.) following the manufacturer’s instructions. Next, the cDNA was synthesized using the TeloPrime Kit Version 2 (Lexogen, Vienna, Austria). The optimal number of cycles was determined by qPCR (QuantStudio 3, Thermo Fisher Scientific, Massachusetts, U.S.A.) with 0.1 × SYBR Green (Merck Millipore, Darmstadt, Germany), TeloPrime kit chemistry and 10% of the cDNA as input material. Based on the qPCR, the number of cycles was adjusted to yield appropriate amounts of amplified cDNA (500 ng to 2 µg) in the endpoint PCR. The quality of the mass-amplified full-length cDNA was assessed according to the manufacturer’s instructions, and the cDNA was column-purified (TeloPrime Kit Version 2, Lexogen, Vienna, Austria), followed by the IsoSeq PacBio library preparation using the SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences, California, U.S.A.). Subsequently, the quality (FragmentAnalyzer, Agilent Technologies, California, U.S.A.) and quantity (Qubit, Thermo Fisher Scientific, Massachusetts, U.S.A.) of the library were determined. The long-read sequencing was executed on a Sequel IIe system with a single SMRT cell (8 M), the Sequel II Binding kit 2.1, and the Sequel II Sequencing kit 2.0 (Pacific Biosciences, California, U.S.A.). The SMRT cell was loaded with 120 pM of material. The movie time of 24 h started after a 2 h immobilization step and a 2 h pre-extension step.

Processing of raw PacBio reads and Iso-Seq analysis

The raw subread dataset was processed with SMRT Link v10.1 (Pacific Biosciences, California, U.S.A.) to generate the circular consensus sequences (CCS) reads, internally using ccs v.6.0.0. cDNA barcodes introduced with the TeloPrime Kit Version 2 (Lexogen, Vienna, Austria) were trimmed via lima (v 2.1.0). Subsequently, the high-quality full-length transcripts were processed with isoseq3 (v 3.4.0).

The RNA-seq reads were mapped to the genome assembly in the GenDBE annotation platform (Meyer et al. 2003).

Comparative genome analysis and ortholog identification

Protein sequences from Arabidopsis thaliana (https://www.arabidopsis.org), A. chinensis (https://kiwifruitgenome.org/ftp/A_chinensis/Red5/v1.0/), A. spinosa, Brassica napus (http://www.genoscope.cns.fr/brassicanapus), Helianthus annuus (https://sunflowergenome.org), C. sinensis (https://www.ncbi.nlm.nih.gov/assembly/GCF_004153795.1), Medicago truncatula (http://www.medicagogenome.org/), and Triticum aestivum (https://wheat-urgi.versailles.inra.fr/Tools) were used to perform a quantitative analysis of the different protein sets, their intersections, and aggregates of intersections through the visualization of intersections among multiple sets (Lex et al. 2014). Only the longest splice-variant of each gene was used; the sequences with internal stop-codons or internal gaps were removed. The number of orthologous groups, both shared and unique, is presented for each genome.
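The filtering rule for the protein sets can be sketched as follows. This is an illustrative example under stated assumptions: a stop codon is represented by `*`, a gap by `-`, flawed sequences are discarded before the longest variant is chosen (the text does not specify the order), and the record tuples are hypothetical.

```python
def longest_valid_isoforms(records, gap_chars="-"):
    """Keep the longest protein per gene after dropping flawed sequences.

    records: iterable of (gene_id, transcript_id, protein_sequence).
    A sequence is dropped if '*' occurs before its last position or if it
    contains any gap character.
    """
    best = {}
    for gene_id, transcript_id, seq in records:
        body = seq[:-1] if seq.endswith("*") else seq  # ignore terminal stop
        if "*" in body or any(ch in body for ch in gap_chars):
            continue
        if gene_id not in best or len(body) > len(best[gene_id][1]):
            best[gene_id] = (transcript_id, body)
    return best


# Hypothetical splice variants of three genes.
records = [
    ("g1", "g1.t1", "MKTAYIAKQR*"),
    ("g1", "g1.t2", "MKTAYIAKQRQISFVK*"),
    ("g2", "g2.t1", "MAL*KV*"),      # internal stop codon: removed
    ("g3", "g3.t1", "MSS--LLP*"),    # internal gap: removed
]
kept = longest_valid_isoforms(records)
```

Reducing each gene to one representative protein prevents genes with many splice variants from inflating the size of their orthologous groups.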

To proceed with the analysis, we used a set of the main genes related to argan oil production, namely tocopherol- and phytosterol-pathway-related genes and genes implicated in fatty acid (FA) and triacylglycerol (TAG) biosynthesis. Tocopherol, phytosterol, FA, and TAG are the four main components of argan oil. A manual examination of the genes present in the argan genome was performed using the A. thaliana protein sequences and the GenDBE blast tool. To investigate the contraction of the gene families in argan and to determine lineage-specific genes, data on the genomes of A. spinosa, C. sinensis, A. chinensis, M. truncatula, H. annuus, B. napus and A. thaliana were used to perform a comparative analysis using OrthoFinder (v.2.4.0) to identify the orthogroups (Emms and Kelly 2015).

Sequence analysis and construction of phylogenetic trees

Sequences were retrieved using BLAST (Altschul et al. 1990) searches in JGI Phytozome (Goodstein et al. 2012) (Oryza sativa, Zea mays, Hordeum vulgare, T. aestivum, Solanum lycopersicum, H. annuus, Glycine max, A. thaliana, Capsella rubella, Brassica rapa and Amborella trichopoda), NCBI (Sayers et al. 2022) (A. chinensis and C. sinensis) and an in-house database for A. spinosa. Full-length protein sequences were aligned using MAFFT (Katoh and Standley 2013) (--localpair --maxiterate 10,000 --reorder); the resulting alignments were then used to generate consensus trees (maximum-likelihood, 100 bootstraps) with iqtree (Minh et al. 2020) (-T AUTO -m TEST -b 100 -allnni -con).

Results

Gene prediction, transposable elements and genome completeness

Herein, we present an annotated A. spinosa genome assembly that represents 99.76% of the predicted 671 Mb genome. A total of 287,123 genes were obtained from four reference genomes: kiwi fruit (A. chinensis), tea (C. sinensis L.), R. williamsianum, and blueberry (V. corymbosum L.). The AUGUSTUS software (Stanke et al. 2006) identified 62,590 predicted genes encoding 82,286 transcripts. Of these transcripts, 49,575 were completely supported by homologous proteins from at least one of the reference annotations and 10,295 were novel gene models. The detailed characterization of these predicted genes suggests that the mean number of exons per gene is 4, with average exon and intron lengths of 220 bp and 776 bp, respectively. Transcript (mRNA) lengths vary from 9 to 136,337 bp, with an average of 3717 bp. Furthermore, based on the RepeatModeler analysis used to process the genomic sequence and identify repeated sequences, a total of 2390 repeat family sequences were identified in the complete genome and 56.33% of all sequences were masked by RepeatMasker.

To ensure that the predicted gene set is complete and accurate, the annotation was evaluated by BUSCO (Simao et al. 2015). The analysis reveals that the adopted pipeline recovered 92% of the complete data (0.2% missing) for “viridiplantae” and 91.7% (2.4% missing) for “eudicots”. The functional annotation resulted in a reliable assignment for 25,436 transcripts, with 14,292 transcripts not having a significant hit based on their homology to the protein databases used.
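The BUSCO figures above leave a remainder that can be worked out directly. The short calculation below assumes that the reported 92% and 91.7% refer to complete BUSCOs and that the complete, fragmented, and missing categories sum to 100%; under that assumption the fragmented share would be about 7.8% for “viridiplantae” and 5.9% for “eudicots”.

```python
def busco_remainder(complete_pct, missing_pct):
    """Percentage left once complete and missing BUSCOs are accounted for.

    Interpreting the remainder as fragmented BUSCOs is an assumption.
    """
    return round(100.0 - complete_pct - missing_pct, 1)


# (complete %, missing %) as reported in the text.
reported = {"viridiplantae": (92.0, 0.2), "eudicots": (91.7, 2.4)}
fragmented = {name: busco_remainder(c, m) for name, (c, m) in reported.items()}
```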

Reference transcriptome

The argan reference transcriptome comprises 128,978 PacBio IsoSeq HQ transcripts extracted from 2,907,823 PacBio IsoSeq reads. Of those, 21.21% mapped fully to the genome, while 59.35% were unmapped. Of the latter, a large fraction mapped to fungi (21.27%), suggesting a strong association of the roots with fungi or putative fungal infections of the flower material collected from the botanical garden. Around 6% of the transcripts were identified as putative transposons.

Comparative genome analysis and gene identification

Comparative analyses of the whole protein sets (Table 1) from A. thaliana, A. chinensis (kiwi), A. spinosa, B. napus (canola), H. annuus (sunflower), C. sinensis (tea), M. truncatula (barrel clover), and T. aestivum (wheat) were performed. The argan genome includes 1,140 ortholog groups not identified in the other species (Fig. 2). A total of 7,753 groups were shared among all species and 603 groups were shared with its closest relative, C. sinensis. Among those groups, 4,534 genes were identified as singletons specific to argan compared with the other species.

Table 1 Comparative analysis of orthologous groups involved in oil production and vitamin E biosynthesis identified in Argania spinosa, Arabidopsis thaliana, Actinidia chinensis, Brassica napus, Helianthus annuus and Camellia sinensis
Fig. 2

Analysis of orthogroups in Arabidopsis thaliana, Actinidia chinensis, Argania spinosa, Brassica napus, Helianthus annuus, Camellia sinensis, Medicago truncatula, and Triticum aestivum, created with UpSetR (Lex et al. 2014). The bars represent the number of orthologous groups (intersection size). The dots symbolize the species included in the intersection. The plot was restricted to intersections with more than 100 orthologous groups
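The intersection sizes shown in such a plot can be computed as in the sketch below. It assumes that each bar counts the orthologous groups found in exactly the marked species and in no others (exclusive intersections); the orthogroup names and memberships are toy data, and `min_size` mimics the size threshold applied to the plot.

```python
from collections import Counter


def exclusive_intersections(orthogroups, min_size=1):
    """Count orthogroups by the exact set of species they contain."""
    counts = Counter(
        frozenset(species) for species in orthogroups.values() if species
    )
    return {combo: n for combo, n in counts.items() if n >= min_size}


# Toy orthogroups: group name -> species with at least one member gene.
orthogroups = {
    "OG1": {"A. spinosa", "C. sinensis", "A. thaliana"},
    "OG2": {"A. spinosa", "C. sinensis"},
    "OG3": {"A. spinosa"},
    "OG4": {"A. spinosa", "C. sinensis"},
    "OG5": {"A. thaliana"},
}
sizes = exclusive_intersections(orthogroups)
large_only = exclusive_intersections(orthogroups, min_size=2)
```

In this toy example the combination of A. spinosa and C. sinensis has size 2, while the group found only in A. spinosa corresponds to a species-specific bar.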

Further analysis of the conserved functional domains was performed through Pfam database (http://pfam.xfam.org/) to further characterize the hypothetical argan-specific genes. Among the other plant genomes of members of Ericales that we used in our study, 244 orthologue groups were shared in the intersection with the A. chinensis genome, while 603 orthologue groups were shared with the tea genome, which has the closest evolutionary relationship to the argan genome (Table 1).

A total of 664 genes, which consists mostly of transcription factors, were designated as specific to the A. chinensis genome, which is estimated to be 637.99 Mb (Wu et al. 2019). Regarding the genome of C. sinensis, more than 8,000 singletons were assigned to the 3.8–4.0 Gb genome (Sharma et al. 2018) (Fig. 2). The argan genome shares 3,510 and 831 orthologs with the tea genome and the kiwi genome, respectively (Fig. 2).

Sequence analysis and construction of phylogenetic trees

To gain insight into the agronomically important components of the argan genetic inventory, we analyzed the copy number of genes related to the major traits in oil crops, namely oil content and quality. We identified the argan orthologs of oil-biosynthesis-related genes and their regulators known from other oil seed plants from the Ericales and Asterids using the OrthoFinder program. Biosynthetic pathways of specific interest were the tocopherol, fatty acid, triacylglycerol (TAG), and phytosterol biosynthesis pathways. Members of these pathways were identified in the genomes of A. spinosa, A. chinensis, A. thaliana, B. napus, C. sinensis, and H. annuus and are summarized in Table 1. Due to the allopolyploid nature of its genome (Lu et al. 2019), B. napus often encodes more than double the number of genes found in the other plant species included in this analysis. Generally, gene numbers of those families are quite similar between A. thaliana and A. spinosa, and we mention here only those genes that deviated in number from A. thaliana. In the phytosterol pathway, the most important candidate gene, coding for the sterol methyltransferase II (SMT2) enzyme, has three orthologs in A. spinosa and only two in A. thaliana. Furthermore, SQE1 is encoded by three genes in A. thaliana but by seven in A. spinosa.

In the TAG biosynthesis pathway, the genes encoding for LPAT2, DGAT3, and PDAT are single copy in A. thaliana and have two orthologs in A. spinosa. A single gene in A. thaliana also encodes DGAT3 but five orthologs are found in A. spinosa. However, LPCAT1 and LPCAT2 together have only a single ortholog in A. spinosa.

The fatty acid biosynthesis pathway includes a larger number of enzymes, most of them encoded by several genes. The enzymes α-CT, FatB, LACS9, and KASIII are encoded by a single gene in A. thaliana and by two in A. spinosa. The enzymes KASII, KAR, and FAD2 are also encoded by a single gene in A. thaliana but by three in A. spinosa. The enzymes BCCP, HAD and FAB2 are encoded by several genes in A. thaliana, and by even more copies in A. spinosa.

Tocopherol biosynthesis (Table 1) includes mainly single-copy genes in A. thaliana, with the exception of GGPS (11 genes) and TAT (2 genes). Most of these single-copy genes have two orthologs in A. spinosa (VTE3, VTE5, MCT and HMBPP). However, two genes encode TAT in A. thaliana and three in A. spinosa, and the GGPS enzyme in A. spinosa is encoded by only 5 genes. Taken together, the gene numbers of the oil crop A. spinosa do not deviate substantially from those of the non-oil plant A. thaliana.
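The copy-number comparison for the tocopherol pathway can be tabulated in a few lines. The counts below are taken from the paragraph above; the function name and the category labels are illustrative choices, not part of the original analysis.

```python
# (A. thaliana copies, A. spinosa copies) as stated in the text.
tocopherol_copies = {
    "VTE3": (1, 2),
    "VTE5": (1, 2),
    "MCT": (1, 2),
    "HMBPP": (1, 2),
    "TAT": (2, 3),
    "GGPS": (11, 5),
}


def classify(at_copies, as_copies):
    """Label a gene family by comparing argan with A. thaliana."""
    if as_copies > at_copies:
        return "more copies in argan"
    if as_copies < at_copies:
        return "fewer copies in argan"
    return "same copy number"


summary = {gene: classify(*counts) for gene, counts in tocopherol_copies.items()}
for gene, label in summary.items():
    at_copies, as_copies = tocopherol_copies[gene]
    print(f"{gene}: A. thaliana {at_copies}, A. spinosa {as_copies} ({label})")
```

Only GGPS has fewer copies in argan than in A. thaliana among the families listed here; the others differ by a single additional copy.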

Evolutionary relationships of the tocopherol biosynthesis gene families

One of the main factors behind the growing demand for argan oil is its high total tocopherol content in comparison to other oil seed crops. Tocopherols, substances of the vitamin E family, are lipophilic antioxidants that prevent the oxidation of unsaturated fatty acids. They come in four isoforms of differing biological activity, with α-tocopherol (α-Toc) showing the strongest antioxidant properties. The key genes implicated in the downstream stages of the biosynthesis pathway encode homogentisate phytyltransferase (HPT, encoded by the VTE2 gene), 2-methyl-6-phytylbenzoquinol methyltransferase (MPBQ MT, encoded by the VTE3 gene), tocopherol cyclase (TC, encoded by the VTE1 gene), and γ-tocopherol methyltransferase (γ-TMT, encoded by the VTE4 gene) (Fig. 3).

Fig. 3

The tocopherol biosynthetic pathway. Enzymes are shown in blue and metabolites in red. The initial prenylation reaction of HGA and PDP is catalyzed by HPT enzyme to form MPBQ. MPBQ is then methylated by MPBQ methyltransferase to form DMPBQ, which can be cyclized by TC and further methylated by γ-TMT to form α-Tocopherol (Savidge et al. 2002). Arabidopsis thaliana genes, encoding the enzymes, are emphasized in yellow. The key biosynthetic genes identified in Argania spinosa are highlighted in green. PDP: phytyldiphosphate; HGA: homogentisic acid; MPBQ: 2-methyl-6-phytylbenzoquinol; DMPBQ: 2,3-dimethyl-6-phytyl-1,4-benzoquinol; HPT: homogentisate phytyltransferase; MPBQ MT: MPBQ methyltransferase; TC: tocopherol cyclase; γ-TMT: γ-tocopherol methyltransferase

We identified four key genes involved in tocopherol biosynthesis in the argan genome, encoding tocopherol cyclase (TC, encoded by the VTE1 gene), homogentisate phytyltransferase (HPT, encoded by VTE2), 2-methyl-6-phytylbenzoquinol methyltransferase (MPBQ MT, encoded by VTE3), and γ-tocopherol methyltransferase (TMT, encoded by VTE4). Their evolutionary histories were analyzed separately by phylogeny reconstructions (Fig. 4).

Fig. 4

Maximum likelihood phylogeny of VTE1 (a), VTE2 (b), VTE3 (c), and VTE4 (d)-like genes, including angiosperm representatives with sequenced genomes. Numbers next to branches denote bootstrap support of the respective lineage split

The Maximum likelihood (ML) phylogeny of VTE1, encoding for TC enzyme, shows the A. trichopoda VTE1 sequence as sister group to all other VTE1-like proteins. Monocot sequences form a clade with 100% bootstrap support and all dicot sequences form a group with 86% support. Except for G. max with two VTE1 genes and T. aestivum with three, all species included in the analysis have only a single VTE1 gene. Ericales genes including the argan sequence form a subclade with 95% bootstrap support, and argan codes for a single VTE1 ortholog.

The ML phylogeny of VTE2, encoding HPT enzymes, supports the monocot representatives with a 100% bootstrap value and the dicot sequences with an 83% bootstrap value. Several independent gene duplications occurred within the VTE2-like genes, resulting in three VTE2 homologs in T. aestivum, two in G. max, two in H. annuus and two in A. chinensis, and a single one in argan.

The VTE3-like genes cluster in a monocot clade with 100% bootstrap support and a dicot clade with 79% bootstrap support. Interestingly, a monocot-specific gene duplication led to two monocot subclades, one supported by 95% and the other by 98% bootstrap support. Thus, most species analyzed encode at least two homologs of VTE3, except for Z. mays, with only a single gene. Independently, gene duplication occurred within the dicots, leading to a more complex situation. The lineage including the A. thaliana VTE3 gene is supported by a bootstrap value of 50% and includes sequences from argan, B. napus, A. thaliana and other Brassicaceae, and H. annuus. The other subfamily is supported by 86% bootstrap support and includes sequences from all analyzed Ericales and B. napus. The argan genome encodes two VTE3 homologs.

The protein family of VTE4, which encodes the TMT enzyme, shows support of 100% for the monocot sequences and 75% for the dicot members. Within the dicots, the best-supported subclade includes Brassicaceae members with a support of 100%; all other subclades show support below 50%, suggesting an extensive duplication history and rapid sequence evolution, unlike that observed in the other VTE gene families. A single VTE4 copy was identified in argan.

Discussion

Owing to the lack of annotated genomic data on A. spinosa, little is known about this economically important plant. In this work, we annotated the draft genome assembly generated by Khayi et al. (2018) using a model-based prediction approach to generate a data resource, available in a public database, to advance research on the argan tree. Ultimately, the draft genome was predicted to encode 42,095 genes. The number of genes predicted for A. spinosa is high compared with the numbers predicted for the closely related and annotated species Primula veris (19,507 predicted genes; Nowak et al. 2015) and Primula vulgaris (24,599 predicted genes; Cocker et al. 2018), similar to that of A. chinensis (kiwi, 39,040 genes; Huang et al. 2013), and higher than the gene number of the well-annotated species A. thaliana (27,029 genes; Swarbreck et al. 2007). This suggests extensive genome evolution in the Ericales.

The proportion of repetitive sequences in the argan genome (56.33% of the assembly) appears to be higher than that reported for the related species kiwi (36% of the assembly; Huang et al. 2013), which is mainly made up of retrotransposons, with the long terminal repeat (LTR) family being the most abundant. For comparison, the A. thaliana genome includes only 14% repetitive sequences, suggesting that many of the genes predicted from the argan genome may be transposons or retrotransposons.

To the best of our knowledge, this is the seventh whole genome in the order Ericales to be sequenced and annotated, after A. chinensis, P. vulgaris, P. veris, C. sinensis, R. williamsianum, and V. corymbosum. The results presented in this study represent the first whole-genome annotation of a member of the under-explored Sapotaceae family, which includes other economically important species such as Vitellaria paradoxa (shea tree) and Chrysophyllum cainito (star apple or cainito) (Vallée et al. 2016).

In this study, a total of 45 putative sequences that can be phylogenetically clustered were identified in the argan genome, including the membrane-bound FA desaturases (FADs). FADs play a primary role in regulating FA composition in plants, which directly affects oil quality and is involved in tolerance to abiotic stresses. Therefore, FADs should be added to the list of the main candidate genes for crop improvement (Xu et al. 2019). Moreover, one of the main factors underlying the thriving demand for argan oil is its high total tocopherol content compared with other oil seed crops. Beyond its antioxidant activity and implications for human health (Stampfer et al. 1993; Brigelius-Flohe and Traber 1999), analyses in A. thaliana indicate that genes involved in tocopherol biosynthesis are important for the biotic/abiotic stress response (Falk and Munné-Bosch 2010). An example is homogentisate phytyltransferase (HPT), which has an impact on seed longevity and adaptation to low temperatures (Hunter and Cahoon 2007; Maeda et al. 2008; Ren et al. 2011; Ji et al. 2016). The characterization of these genes in highly drought-resistant species such as argan may pave the way to improving crop plant responses to abiotic/biotic stresses (Munné-Bosch 2005; Ji et al. 2016). However, more effort is still needed to investigate their functions in biochemical assays.

The genetic diversity of the tocopherol pathway genes, and the intra- and inter-species variations in the α-Toc content, in particular between different wild argan genotypes, are also unknown. Based on sequence homology, the enzymes TC (VTE1) and TMT (VTE4) are the products of single-copy genes in argan, kiwi and tea, whereas MPBQ-MT (VTE3) is encoded by two genes in the argan genome but by single-copy genes in tea and kiwi (Fig. 4). This finding contradicts previous reports claiming that the tocopherol biosynthetic pathway enzymes are represented by single-copy genes only (DellaPenna 2005).

Interestingly, HPT (VTE2) is encoded by two genes in kiwi and by a single gene in argan, while it is absent from the tea genome. This might be the reason for the richness of kiwi fruit in new tocopherol analogues, such as δ-tocomonoenol (Fiorentino et al. 2009).

The various gene duplication scenarios revealed in oilseed plants (Table 1), in particular wheat and soybean, could explain the reported variations in their tocopherol contents. The final step of the pathway, which consists of the conversion of the large γ-Toc pool to α-Toc, is variable and not always effective in all plants. This could be the limiting step leading to intra- and inter-species variations, particularly in α-Toc content (Grusak and DellaPenna 1999; Kumari et al. 2019).

Differences in gene copy number among the species are also evident for VTE4, which encodes the γ-TMT enzyme catalyzing this critical step of the pathway. These inter-species variations could explain the variable efficiency of the conversion to α-Toc, which is not always effective in plants. VTE4 is represented by a single copy in the argan genome, while gene duplications are found in H. annuus and B. napus. The latter two contain predominantly α-Toc, with more than 95% and 56.4% of the total tocopherol amount, respectively (Cao et al. 2015), while γ-Toc represents the major part in argan seeds, with 85% (Marfil et al. 2011).

Conclusion

In summary, we present the argan genome annotated with the support of a reference transcriptome, together with an analysis of homologs of the genes important for oil production and tocopherol biosynthesis in selected species, including oil crops, with a special focus on argan. The annotated genome generated in this work could be used as a reference for comprehensive genome-assisted cultivar breeding and genome editing, and to revisit any trait in order to elucidate the molecular mechanisms that trigger the relevant metabolic pathways, including the genes involved in cosmetic and pharmacological products. Such analyses could contribute to the improvement of argan oil traits and facilitate the development of strategies to breed and domesticate the argan tree through the manipulation of the identified genes. The identified genes associated with the pathways of the main oil components could be used as candidate genes for improving both seed content and quality and for improving plant responses to abiotic/biotic stresses to help them cope with climate change.