Abstract
Background
Maize (Zea mays L.) is at the vanguard facing the upcoming breeding challenges. However, both a super pan-genome for the Zea genus and a comprehensive genetic variation map for maize breeding are still lacking.
Results
Here, we construct an approximately 6.71-Gb pan-Zea genome that contains around 4.57-Gb non-B73 reference sequences from fragmented de novo assemblies of 721 pan-Zea individuals. We annotate a total of 58,944 pan-Zea genes and find around 44.34% of them are dispensable in the pan-Zea population. Moreover, 255,821 common structural variations are identified and genotyped in a maize association mapping panel. Further analyses reveal gene presence/absence variants and their potential roles during domestication of maize. Combining genetic analyses with multi-omics data, we demonstrate how structural variants are associated with complex agronomic traits.
Conclusions
Our results highlight the underexplored role of the pan-Zea genome and structural variations to further understand domestication of maize and explore their potential utilization in crop improvement.
Background
The increasing population and unpredictability evoked by global climate change have brought new demands to increase the productivity and quality of our crops [1]. Food production must increase by 70% by 2050 to feed the growing world population [2]. The past few decades have witnessed a rapid evolution of sequencing and marker technologies alongside the widespread adoption of genome-based breeding approaches [3]. These technological revolutions have promoted innovations in crop breeding from conventional phenotype-based selection to genomics-assisted breeding and genetic engineering [4, 5].
While they harbor great potential, the development of breeding technologies and the explosive growth of biological information have also highlighted the insufficiencies in conventional genomics-assisted breeding strategies. The first of these insufficiencies is the use of a single reference genome. A growing body of evidence shows that mapping reads onto a single reference genome can result in reference bias and missing information in highly polymorphic regions and regions that are not present in the genome [6,7,8]. Thus, a more comprehensive approach is to replace the single reference genome with a pan-genome, which represents the complete genetic repertoire of a species. With reduced sequencing costs in recent years, the construction of pan-genomes has spread from Streptococcus agalactiae [9] to eukaryotic species [10,11,12], including many major crops, such as rice, bread wheat, soybean, and tomato [13,14,15,16]. Second, conventional genomics-assisted breeding strategies rely mainly on single nucleotide polymorphisms (SNPs) and short insertions/deletions (InDels, hereafter representing insertions/deletions < 50 bp) because they can be easily acquired from low-depth resequencing of cultivated lines. However, SNPs/InDels do not represent the complete genetic repertoire of a species [17, 18]. Other genetic variations, such as structural variations (SVs), also play important roles in plant genetics [19, 20], and their potential should be harnessed for crop breeding and improvement. In addition, applying multi-omic (e.g., transcriptomic, proteomic, metabolomic, and epigenetic) data to reveal genetic mechanisms is becoming more practical [21]. It is highly conceivable that systematic integration of multi-omics data could accelerate crop breeding and improvement [22, 23].
Given these considerations, it follows that to aid in increasing the productivity and quality of crops from the perspectives of genomics and genetics, we should (i) construct a genus-level crop pan-genome, or “super-pan-genome” [24], that includes both cultivated and wild accessions within a genus; (ii) include more genetic variations (e.g., SVs) in addition to SNPs/InDels into genomics-assisted crop breeding, and (iii) systematically integrate multi-omics evidence to accelerate crop breeding.
Maize is a staple crop and a model organism for genetic research [25]. Since the first release of the maize B73 reference genome in 2009 [26], more than 40 maize genomes have been released to date. Moreover, multi-omics maize data, including DNA resequencing [27,28,29,30,31], transcriptomic [32, 33], metabolomic [34, 39] and the population-level transcripts of hundreds of diverse lines [1: Fig. S8D–E. G Schematic of the variant graph genome representation for AGPv4 Chr2:171064-171220, with the SNP paths, short InDels, and a large deletion. H The identity and mapping rate distribution of the simulated short reads from the genomes of the 26 NAM founders against the variant graph. Dark blue individuals are presented on the variant graph, whereas light blue individuals are not
The gene models and functional annotations for the pan-Zea genome were next generated by merging the AGPv4 reference gene annotations with the non-reference genes that were annotated based on a combination of transcript evidence, homologous protein evidence, and ab initio gene predictions (Additional file 1: Fig. S4 and Additional file 12: Supplementary Materials and Methods), resulting in 58,944 genes (39,591 AGPv4 genes and 19,353 non-reference genes, see Additional file 4: Table S3) and 21,167 orthologous groups (Additional file 12: Supplementary Materials and Methods). About 85.82% of the genes were assigned to at least one functional annotation (Additional file 1: Fig. S5).
The gene presence/absence (gPAV) patterns for each maize inbred and teosinte individual (landrace individuals were excluded from the downstream analysis to avoid bias, leaving 691 genotypes for subsequent analysis, see Additional file 12: Supplementary Materials and Methods) were estimated using a read-mapping-based method that maintained robustness among different read depths (Additional file 1: Fig. S6A). The resulting gPAV patterns were consistent with previous reports (Additional file 1: Fig. S6B–D), with an estimated genotyping accuracy of ~99.71% and ~95.84% for true presence and true absence, respectively (Additional file 1: Fig. S6E). Principal component analysis and linkage disequilibrium (LD) rank analyses of the gPAVs revealed that the gPAVs were related to the population structure and were well represented by SNPs, with ~97.37% of the gPAVs displaying high LD with nearby SNPs (Additional file 1: Fig. S7 and Additional file 12: Supplementary Materials and Methods). Next, to investigate the PAVs of genes and the orthologous groups from a population perspective, we identified the “core” (with population-level loss rate not significantly greater than 1%) and “dispensable” (with population-level loss rate significantly greater than 1%) genes and orthologous groups based on the gPAV and the derived orthologous group PAV (oPAV) matrices (Additional file 1: Fig. S8A, Additional file 12: Supplementary Materials and Methods). The results of these analyses revealed that ~44.34% of the pan-Zea genes were dispensable, while only ~7.42% of the pan-Zea orthologous groups were dispensable (Fig. 1E and Additional file 1: Fig. S8B). An average of 6020 genes displayed PAV patterns between two individuals, with larger differences between two inter-subspecies individuals (~6779 between one teosinte and one maize) than between two intra-subspecies individuals (~5520 between two maize and ~5635 between two teosintes) (Additional file 1: Fig. S8C).
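The core/dispensable rule described above (a population-level loss rate significantly greater than 1%) can be illustrated with a one-sided exact binomial test. This is a minimal sketch, not the procedure from the Supplementary Materials and Methods: the choice of test, the `alpha` of 0.05, the function names, and the toy absence counts are all assumptions made for illustration.

```python
from math import comb


def binom_upper_tail(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def classify_gene(n_absent, n_individuals, max_loss_rate=0.01, alpha=0.05):
    """Label a gene 'dispensable' when its observed loss rate is
    significantly greater than max_loss_rate, otherwise 'core'.

    The one-sided binomial test and alpha are illustrative assumptions.
    """
    p_value = binom_upper_tail(n_absent, n_individuals, max_loss_rate)
    return "dispensable" if p_value < alpha else "core"


# Toy absence counts over 691 individuals (the counts are made up).
for absent in (0, 7, 100):
    print(absent, classify_gene(absent, 691))
```

Under this sketch, a gene absent in about 1% of individuals is still called core, because the observed loss rate is not significantly above the threshold.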
Given our knowledge of the core and dispensable gene/orthologous groups, we estimated the gene/orthologue-group set size for the pan-Zea genome and the core genome. The in silico simulation showed that the pan-Zea genome (Fig. 1F), as well as the subspecies pan-genomes (pan-maize genome and pan-teosinte genome, Additional file 1: Fig. S8D–E), displayed characteristics of a “closed pan-genome” [45] with plateaus in the size curves, suggesting that we identified almost all of the genes in maize and teosinte. The results show that 27 individuals represented an average of 95% of the pan-Zea gene-set (range ~91.9 to ~97.3%), while 99 individuals covered 99% of the pan-Zea genes (Fig. 1F).
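The idea behind such an in silico size simulation can be sketched as follows: individuals are added in random order, and the union (pan) and intersection (core) gene counts are averaged over many orderings. This is a simplified illustration that assumes a strict intersection for the core set; the function name, the permutation count, and the toy data are hypothetical and do not reproduce the authors' simulation.

```python
import random


def accumulation_curves(pav, n_permutations=100, seed=0):
    """Average pan and core gene-set sizes as individuals are added.

    pav: list of sets, each holding the genes present in one individual.
    Returns two lists (pan, core); index i is the mean size after
    sampling i + 1 individuals in random order.
    """
    rng = random.Random(seed)
    n = len(pav)
    pan_sums = [0] * n
    core_sums = [0] * n
    for _ in range(n_permutations):
        order = rng.sample(range(n), n)  # one random ordering of individuals
        pan, core = set(), None
        for step, idx in enumerate(order):
            pan |= pav[idx]
            core = set(pav[idx]) if core is None else core & pav[idx]
            pan_sums[step] += len(pan)
            core_sums[step] += len(core)
    pan_curve = [s / n_permutations for s in pan_sums]
    core_curve = [s / n_permutations for s in core_sums]
    return pan_curve, core_curve


# Toy presence sets for three individuals (gene names are made up).
toy = [{"g1", "g2"}, {"g1", "g3"}, {"g1", "g2", "g3"}]
pan_curve, core_curve = accumulation_curves(toy)
print(pan_curve, core_curve)
```

A pan curve that flattens as more individuals are added is the pattern described in the text as a closed pan-genome.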
In addition to the linear representation of the pan-Zea genome, we also constructed a graph-based genome (Fig. 1G), including the SNPs, short InDels (<50 bp in length), and SVs (>50 bp in length) (see sections below and Additional file 12: Supplementary Materials and Methods for details of the SVs). Furthermore, we evaluated the representativeness of the variant graph genome using reads simulated from the 26 NAM founder genome assemblies. The results showed that reads simulated from the “not-in-graph” NAM founders (23/26) had comparable mapping rates (99.40% on average) and alignment identities (91.00% on average) to those of the “within-graph” NAM founders (3/26, including B73, with 99.44% and 91.14% for the average mapping rate and alignment identity, respectively) (Fig. 1H), indicating that the variant graph represented the vast majority of the maize genetic repertoire.
The presence/absence patterns of pan-Zea genes and the orthologous groups
An interesting question is which genes are more likely to show gPAVs in the genus Zea. To address this question, we investigated the associations between the gPAVs and genic features, including the sub-genome origin, gene age, gene length, orthologue group size, expression levels, and selective constraints (Additional file 12: Supplementary Materials and Methods and Additional file 5: Table S4). The results of these analyses revealed that the gPAV was significantly associated with the genic features under investigation (Fig. 2A). Specifically, they showed that the absence of pan-Zea genes was more prevalent among genes that were newly derived, that belonged to large orthologous groups, and that were either minimally or tissue-specifically expressed. The dispensable genes were more likely to be evolving under relaxed selective constraint than the core genes. These findings are consistent with previous findings that older genes are more essential [46] and associated with higher expression levels and stronger purifying selection [47].
The 691 pan-Zea individuals were divided into three maize sub-populations and eight teosinte sub-populations (Fig. 2B and C, [42, 44]). To investigate whether these sub-populations lost genes and orthologous groups evenly, we clustered the dispensable genes into 16 clusters (CLS1–16 in Fig. 2B and Additional file 5: Table S4) and the orthologous groups into 13 clusters (oCLS1–13 in Fig. 2C). The distributions of the gPAVs and oPAVs were associated with the topology of the species tree. The distal-to-maize teosinte subspecies (Zea nicaraguensis, Zea luxurians, Zea diploperennis, Zea perennis, and Zea mays ssp. huehuetenangensis) had more subspecies-enriched genes and orthologous groups than close relatives of maize (Zea mays ssp. mexicana and Zea mays ssp. parviglumis).
Enrichment analysis suggested that the gPAV and oPAV clusters may reflect distinct molecular functions among the sub-populations (Fig. 2D–E, Additional file 6: Table S5, Additional file 7: Table S6 and Additional file 1: Fig. S9). Specifically, the teosinte concentrated genes (CLS5, 8, 9, 10, 11, and 12 in Fig. 2A) had enrichment signals, including plant-pathogen interactions, bacterial toxins, biosynthesis of flavonoids, di-/tri- or sesqui-terpenoids, cutin, suberin and wax, and the mitogen-activated protein kinase signaling pathway (Fig. 2D). These enriched pathways are all related to abiotic and biotic stress responses [48]. These findings corroborate the previous finding that reduced genetic diversity during crop domestication leads to the loss of several loci related to the stress response [49], suggesting the potential role of gene loss in the stress-susceptible changes in crops. Comparatively, maize-enriched genes (CLS6, 13, 15, and 16 in Fig. 2A) and orthologous groups (oCLS10-13 in Fig. 2B) were more likely to be related to germination, nutrition, and flavor-related pathways [50, 51]. For example, the maize concentrated orthologous group oCLS7 was enriched in amino acid and protein-related pathways when compared to the teosinte concentrated orthologous group oCLS6 (Additional file 1: Fig. S10), and the orthologous groups that are rarely present in teosintes (oCLS10–13) were enriched in metabolite pathways related to folate, fatty acids, ascorbate, carotenoids, biotin, and various carbohydrates (Fig. 2E).
To further address the contents of teosinte-specific genes and maize-specific genes, we investigated the sub-population gPAV distribution differences in teosinte versus maize (TEO-Maize) and tropical maize versus temperate maize (TST-TEM) in more detail. As a result, 51.09% (3452/6756) of the maize-enriched genes were absent in all teosinte individuals (hereafter referred to as maize-specific genes), while 17.83% (2189/12,278) of the teosinte-enriched genes were lost in maize (teosinte-specific genes) (Fig. 2F). We detected 3543 TST-TEM unbalanced (enriched or diminished) genes, with only 1.93% of the sub-group-specific items. A total of 4042 genes were found with no specific distribution preference in any sub-group (random genes, Fig. 2F). Further analysis revealed that the random genes were significantly under-represented in the domestication and adaptation selective sweep regions (Fig. 2G), suggesting that sub-group unbalanced genes were selected during maize domestication and adaptation. Thus, we compared the proportion of different sub-group-enriched genes between selected and background regions (Fig. 2H). This analysis revealed that although the distribution of sub-group-enriched genes was not significantly different between TEM and TST maize, the teosinte-enriched genes were more likely to appear in domestication regions (Fig. 2H), suggesting that some teosinte genes were selected to be lost during domestication.
Another notable question is the pattern of gain-or-loss of teosinte/maize-specific genes, considering that a gene can become group-specific either through (i) loss in all members of the other group or (ii) gain from exogenous sources that were not available to the other group. While the gPAVs showed a predominance of teosinte-enriched genes, the oPAV clusters showed the opposite trend in that more orthologous groups were enriched in maize than in teosinte (Additional file 1: Fig. S11A and B), and the maize-enriched orthologous groups, particularly the maize-specific groups, tended to be smaller (Fig. 2I and Additional file 1: Fig. S11C). The comparisons of group-specific gene features indicated that the teosinte-specific genes showed a typical “easy-to-lose” pattern within larger families of shorter and newer genes. However, the maize-specific genes showed an opposite pattern of smaller family size and longer and considerably older genes (Fig. 2I). These analyses suggest that the teosinte-specific genes most likely resulted from gene loss, while at least a subset of the maize-specific genes were derived from resources outside of the Zea genus, perhaps through horizontal gene transfer from bacteria [44].
To complement the genetic variation map and estimate the impact of SVs on the maize phenotypic variations, we constructed a comprehensive SV map by integrating evidence from variant graphs, comparative genomics, and short-read alignments. We filtered the SVs with a set of strict conditions, and only the common (MAF > 0.05) SVs within the maize population were retained (Additional file 1: Fig. S12 and Additional file 12: Supplementary Materials and Methods). We detected 255,821 common SVs, including 181,874 deletions (DELs), 19,628 insertions (INSs), 26,894 translocations (TRAs), 7020 duplications (DUPs), and 1577 inversions (INVs), as well as 18,828 common gPAVs from the aforementioned gPAV matrix.
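The common-variant criterion (MAF > 0.05) can be sketched as a simple filter over a genotype table. The sketch below assumes diploid genotypes coded as alternate-allele counts (0, 1, 2) with `None` for missing calls; the encoding, function names, and variant IDs are hypothetical and are not taken from the authors' pipeline.

```python
def minor_allele_frequency(genotypes):
    """Minor allele frequency from alt-allele counts (0, 1, 2); None = missing."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq)


def filter_common(variants, min_maf=0.05):
    """Keep only variants whose MAF is strictly greater than min_maf."""
    return {
        vid: calls
        for vid, calls in variants.items()
        if minor_allele_frequency(calls) > min_maf
    }


# Toy genotype table; IDs and calls are made up for illustration.
toy_variants = {
    "sv_common": [0, 0, 2, 2, None, 2],
    "sv_rare": [0] * 99 + [2],
}
print(sorted(filter_common(toy_variants)))
```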
A total of 11,208,912 SNPs and 2,015,663 InDels (1,045,218 short-insertions and 970,445 short-deletions) that were common in the maize population were extracted from the Zea haplotype map and were combined with the common SVs identified in the current study to form a maize common genetic variation map for downstream analyses (Fig. 3A and B and Additional file 1: Fig. S13A-B). Considering the repeat-rich nature of maize, we also assigned the SVs to their closest transposable elements (TEs) according to physical overlap and sequence similarity (Additional file 12: Supplementary Materials and Methods) and found that ~60.03% of the SVs were TE-related (Additional file 1: Fig. S13B), indicating the TE origin of a sizeable proportion of SVs in maize. The size of most of the genotyped common SVs was smaller than 5 Kb (Additional file 1: Fig. S13C), which may have been caused by the limitation in WGS short reads [56]. The SNPs, InDels, and SVs showed similar MAF distribution patterns that skewed toward rare variants (Additional file 1: Fig. S13D). The estimate of the representativeness of each SV by nearby SNPs revealed that 37.36% of the SVs showed low LD levels with nearby SNPs (Fig. 3C and Additional file 12: Supplementary Materials and Methods), suggesting that these SVs could harbor information that cannot be represented by nearby SNPs.
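One simple way to ask how well nearby SNPs represent an SV is to take the largest squared genotype correlation (r²) between the SV and each nearby SNP. The sketch below uses plain squared Pearson correlation on 0/1/2 genotype vectors as a stand-in; it is not the LD rank procedure referenced above, and the function names and toy genotypes are assumptions.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length genotype vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a monomorphic site carries no LD information
    return cov * cov / (var_x * var_y)


def max_ld_with_snps(sv_genotypes, snp_genotypes):
    """Largest r^2 between one SV and any of its nearby SNPs."""
    return max((r_squared(sv_genotypes, snp) for snp in snp_genotypes), default=0.0)


# Toy genotypes for four individuals (values are made up).
sv = [0, 0, 2, 2]
nearby_snps = [[0, 2, 0, 2], [0, 0, 2, 0]]
print(max_ld_with_snps(sv, nearby_snps))
```

An SV whose best r² stays low across all nearby SNPs would not be tagged well by a SNP-only analysis, which is the situation the paragraph above describes.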
With the representative genetic variation map that covered the typical genetic variation types, we conducted comprehensive investigations on the differences in the partitioning of narrow-sense heritability (h2) among the different genetic variation types. To address this, we estimated the phenotypic variance explained by the genetic variations from a collection of complex agronomic phenotypes and multi-omics molecular trait data, including gene expression, metabolites, protein contents, and DNA methylation (Additional file 8: Table S7 and Additional file 12: Supplementary Materials and Methods). This analysis revealed that each class of the multi-omics traits displayed high heritability (average values of 0.72, 0.66, 0.67, 0.70, and 0.85 for agronomic traits, expression, metabolites, DNA methylation, and proteins respectively; see also Fig. 3D), indicating their capability to characterize the heritability patterns among the genetic features. The h2 values of the SVs were higher than those of the SNPs and InDels in the multi-omics molecular traits (Fig. 3E) even though there were significantly more SNPs (43.8 times) and InDels (7.9 times) than SVs. To estimate the h2 differences in an unbiased manner, we binned, or partitioned and randomized, the genetic variations to keep each of the compared features, including MAF, the SNP LD rank, variant types/sub-types, genomic locations, and impacts on the genes, in the same volume (see Additional file 12: Supplementary Materials and Methods). As a result, the SVs showed a more clearly decreasing h2 pattern with the increase in the MAF than SNPs/InDels. The h2 values of the SVs were negatively correlated with the SNP LD rank values, in contrast to those of InDels (Additional file 1: Fig. S13E). 
When compared at the same volume, SVs (with an average h2 of 0.71, range 0.70–0.72 for each randomization) explained an average of 14.19% and 1.48% more phenotypic variance than the SNPs (with an average h2 of 0.63, range 0.60–0.65 for each randomization) and InDels (with an average h2 of 0.70, range 0.68–0.72 for each randomization), respectively (Fig. 3F). Higher h2 values were found in the intergenic variants, genic variants with a high impact on genes, gPAV, and TRA type of SVs, as well as the LTR and helitron-related SVs (Fig. 3F and Additional file 1: Fig. S13F). These findings suggest that SVs are more likely to lead to functional changes than other variants.
The impact of the pan-Zea genome and structural variations on maize phenotypic variations
The comprehensive genetic variation map and the multi-level phenotypes provided opportunities to further investigate the potential effects of different genetic and genomic features on the phenotypes. To this end, we performed genome-wide association analyses for the complex traits (agronomic traits, metabolites, and protein contents) and local association analyses for the molecular traits (gene expression and DNA methylation) (see Additional file 12: Supplementary Materials and Methods for details). A total of 21,255 non-redundant QTLs with a median QTL interval of ~152.77 Kb were identified for 21,206 different traits (Fig. 4A, Additional file 1: Fig. S14 and Fig. S15). About 32.78% of the identified QTLs were SV-QTLs (Fig. 4B). Among them, 459 QTLs were SV-specific (could only be identified by SVs, Fig. 4C and Additional file 9: Table S8). The proportion of SV-QTLs was much higher than the proportion of SVs among all of the variants, indicating that SVs are more likely to lead to functional changes.
About 47.54% (8950) of the tested gPAVs had associated signals (Fig. 4D and Additional file 12: Supplementary Materials and Methods), suggesting the functional importance of these gPAVs. An excellent example is shown in Fig. 4E–H in which a premature pectin methylesterase (PME) gene (Zm00001d048936) at the maize gametophyte factor1 (Ga1) locus was reported to be causative of unilateral cross-incompatibility [146]. A total of 807,787 genetic variations (3.25% SVs, 15.73% InDels, and 81.02% SNPs) were kept as CCVs, as they were within the 95% confidence interval of the causal variant set for at least one trait. On average, the statistical fine-mapping kept ~18 variants as CCVs from ~229 nominally associated variants (with P < 0.001, see Additional file 12: Supplementary Materials and Methods) for each QTL. The number of CCVs was poorly correlated with the QTL quality score, the significance of the leading variant, and the number of genetic variations within the QTL (Additional file 1: Fig. S18A), indicating that CCVs could reflect additional information that cannot be represented by using the leading variants alone. The estimate of the effect sizes of the CCVs showed that SVs and gPAVs had a larger effect size than that of the SNPs/InDels (Additional file 1: Fig. S18B). The general feature enrichment analyses between the CCVs and the nominally associated variations (see Additional file 12: Supplementary Materials and Methods) showed that the SVs, particularly INSs and gPAVs, were more likely to be enriched in the causal variant sets than SNPs or InDels (Additional file 1: Fig. S18C). Specifically, INSs were enriched in expression, metabolites, and methylation, while gPAVs were only found enriched in expression (Additional file 1: Fig. S19). For SVs related to different TE classes, the helitron and TIR-related SVs were more likely to be causal than the LTR-related SVs.
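Selecting a 95% causal variant set for one QTL is commonly done by ranking variants by their posterior probability of being causal and keeping the smallest top-ranked group whose cumulative probability reaches the target. The sketch below illustrates that general idea only; it assumes per-variant posterior probabilities are already available from some fine-mapping method, and the function name, variant IDs, and probabilities are hypothetical rather than the authors' procedure.

```python
def credible_set(posteriors, coverage=0.95):
    """Smallest set of top-ranked variants whose normalized posterior
    probabilities sum to at least `coverage`.

    posteriors: dict mapping variant ID -> posterior probability (or any
    non-negative weight proportional to it) within one QTL.
    """
    total = sum(posteriors.values())
    ranked = sorted(posteriors.items(), key=lambda item: item[1], reverse=True)
    selected, cumulative = [], 0.0
    for variant_id, prob in ranked:
        selected.append(variant_id)
        cumulative += prob / total
        if cumulative >= coverage:
            break
    return selected


# Toy posteriors for four variants in one QTL (values are made up).
toy_posteriors = {"snp_a": 0.60, "sv_b": 0.30, "indel_c": 0.06, "snp_d": 0.04}
print(credible_set(toy_posteriors))
```

In this toy case, three of the four nominally associated variants are retained, mirroring how fine-mapping reduces the candidate list per QTL.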
Genic variants, particularly those with a high impact on genes, were more likely to be causal (Additional file 1: Fig. S18C).
To further investigate the effect of different genetic variations on gene expression, we estimated the enrichment of CCVs in the cis-eQTLs along their distance to the transcription start site (TSS). The results showed that the CCVs were enriched in TSS-nearby regions (“Causal variants” track in Additional file 1: Fig. S20), consistent with previous results [164]. Further investigation revealed that the SVs displayed a wavier trend in the fold enrichment changes along distances to the TSS than InDels and SNPs (“SNP,” “INDEL,” and “SV” tracks in Additional file 1: Fig. S20). This pattern indicated that the TE-related SVs were more likely to be enriched in the upstream regions of TSSs (“TE-related SVs” track in Additional file 1: Fig. S20). These findings suggest that rather than directly affecting TSSs or gene body regions, SVs (particularly TE-related SVs) would be more likely to affect gene expression by affecting nearby upstream regions of the gene, where most cis-regulatory elements (CREs) are located [165].
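The enrichment along distance to the TSS can be sketched as a per-bin ratio: the fraction of CCVs falling in a distance bin divided by the fraction of background variants in the same bin. This is a minimal illustration under assumed conventions (signed distances in bp, negative meaning upstream, fixed-width bins); the function name, bin size, and toy distances are hypothetical and the authors' exact enrichment statistic may differ.

```python
from collections import Counter


def fold_enrichment_by_bin(ccv_distances, background_distances, bin_size=1000):
    """Fold enrichment of CCVs over background variants per TSS-distance bin.

    Distances are signed offsets from the TSS in bp (negative = upstream).
    Returns a dict keyed by the bin start coordinate.
    """

    def bin_fractions(distances):
        counts = Counter(d // bin_size for d in distances)
        total = len(distances)
        return {b: c / total for b, c in counts.items()}

    ccv = bin_fractions(ccv_distances)
    background = bin_fractions(background_distances)
    return {
        b * bin_size: ccv.get(b, 0.0) / fraction
        for b, fraction in sorted(background.items())
    }


# Toy signed distances to the TSS (values are made up).
toy_ccv = [-500, -200, 100, 1500]
toy_background = [-500, 100, 200, 300, 1500, 1600, 1700, 1800]
print(fold_enrichment_by_bin(toy_ccv, toy_background))
```

Values above 1 in the bins just upstream of the TSS would correspond to the upstream enrichment described in the text.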
A detailed example is illustrated in Fig. 5, which shows the impact of an SV on the expression of Zm00001d023299. This gene encodes a zinc finger CCCH domain-containing protein (ZEAMAP [57]), has previously been proposed as a candidate QTL (marked as IDP103 in MaizeGDB) related to drought and ultraviolet stressors [58], and was highly expressed in response to various stressors in maize (Additional file 1: Fig. S21). In the current study, an SV-specific eQTL related to the expression of Zm00001d023299 in drought-stressed leaves harbored a CCV named PZ00001aSV02097079INS (Fig. 5A). The expression of Zm00001d023299 in leaves [33] responded to different drought treatment levels, and the presence of PZ00001aSV02097079INS suppressed gene expression in leaves (Fig. 5B). The presence of PZ00001aSV02097079INS could also increase the survival rate of maize under drought stress [59] (Fig. 5C). Comparisons among maize genomes indicated that PZ00001aSV02097079INS is a 1947-bp Harbinger-transposon-like sequence (Additional file 1: Fig. S22) inserted 2269 bp upstream of Zm00001d023299 (Fig. 5D). This evidence suggests that PZ00001aSV02097079INS could be one of the causes of maize drought resistance by suppressing the expression of Zm00001d023299 in leaves. An investigation into the expression patterns in different tissues of four maize founder individuals from the Complete-diallel design plus Unbalanced Breeding-like Inter-Cross (CUBIC) population [1: Fig. S23. The P-values were calculated from 10,000 permutations of the t-test