A pan-Zea genome map for enhancing maize improvement

Gui, Songtao; Wei, Wenjie; Jiang, Chenglin; Luo, **gyun; Chen, Lu; Wu, Shenshen; Li, Wenqiang; Wang, Yuebin; Li, Shuyan; Yang, Ning; Li, Qing; Fernie, Alisdair R.; Yan, Jianbing

doi:10.1186/s13059-022-02742-7

A pan-Zea genome map for enhancing maize improvement

Research
Open access
Published: 23 August 2022

Volume 23, article number 178, (2022)
Cite this article

Download PDF

You have full access to this open access article

Genome Biology Aims and scope Submit manuscript

A pan-Zea genome map for enhancing maize improvement

Download PDF

Songtao Gui¹,
Wenjie Wei¹,
Chenglin Jiang¹,
**gyun Luo¹,
Lu Chen¹,
Shenshen Wu¹,
Wenqiang Li¹,
Yuebin Wang¹,
Shuyan Li¹,
Ning Yang^1,2,
Qing Li^1,2,
Alisdair R. Fernie³ &
…
Jianbing Yan ORCID: orcid.org/0000-0001-8650-7811^1,2

11k Accesses
27 Citations
12 Altmetric
1 Mention
Explore all metrics

Abstract

Background

Maize (Zea mays L.) is at the vanguard facing the upcoming breeding challenges. However, both a super pan-genome for the Zea genus and a comprehensive genetic variation map for maize breeding are still lacking.

Results

Here, we construct an approximately 6.71-Gb pan-Zea genome that contains around 4.57-Gb non-B73 reference sequences from fragmented de novo assemblies of 721 pan-Zea individuals. We annotate a total of 58,944 pan-Zea genes and find around 44.34% of them are dispensable in the pan-Zea population. Moreover, 255,821 common structural variations are identified and genotyped in a maize association map** panel. Further analyses reveal gene presence/absence variants and their potential roles during domestication of maize. Combining genetic analyses with multi-omics data, we demonstrate how structural variants are associated with complex agronomic traits.

Conclusions

Our results highlight the underexplored role of the pan-Zea genome and structural variations to further understand domestication of maize and explore their potential utilization in crop improvement.

Genome assembly of a tropical maize inbred line provides insights into structural variation and crop improvement

Article Open access 31 May 2019

De novo genome assembly and analyses of 12 founder inbred lines provide insights into maize heterosis

Article 16 January 2023

Genetic properties of the MAGIC maize population: a new platform for high definition QTL map** in Zea mays

Article Open access 11 September 2015

Background

The increasing population and unpredictability evoked by global climate change have brought new demands to increase the productivity and quality of our crops [1]. Food production must increase 70% by 2050 to feed the increase in the world’s population [2]. The past few decades have witnessed a rapid evolution of sequencing and marker technologies alongside the widespread adoption of genome-based breeding approaches [3]. These technological revolutions have promoted innovations in crop breeding from conventional phenotype-based selection to genomics-assisted breeding and genetic engineering [4, 5].

While they harbor great potential, the development of breeding technologies and the explosive growth of biological information have also highlighted the insufficiencies in conventional genomics-assisted breeding strategies. The first of these insufficiencies is the use of a single reference genome. More and more evidences have shown that map** reads onto a single reference genome can result in reference bias and missing information in highly polymorphic regions and regions that are not present in the genome [6,7,8]. Thus, a more comprehensive way is to replace the single reference genome with a pan-genome, which represents the complete genetic repertoire of a species. With reduced sequencing costs in recent years, the desirability to construct pan-genomes has spread from Streptococcus agalactiae [9] to eukaryotic species [10,11,12], including many major crops, such as rice, bread wheat, soybean, and tomato [13,14,15,16]. Secondly, the conventional genomics-assisted breeding strategies majorly rely on single nucleotide polymorphisms (SNPs) and short insertions/deletions (InDels, hereafter representing insertions/deletions < 50 bp) because they could be easily acquired from low-depth resequencing of cultivated lines. However, SNPs/InDels do not represent the complete genetic repertoire of a species [17, 18]. Other genetic variations, such as structural variations (SVs), also play important roles in plant genetics [19, 20], and their potential should be harnessed for crop breeding and improvement. Besides, applying multi-omic (e.g., transcriptomic, proteomic, metabolomic, and epigenetic) bio-data to reveal genetic mechanisms is becoming more practical [21]. It is highly conceivable that systematic integration of multi-omics data could accelerate crop breeding and improvement [22, 23]. Given these considerations, it follows that to aid in increasing the productivity and quality of crops from the perspectives of genomics and genetics, we should (i) construct a genus-level crop pan-genome, or “super-pan-genome” [24], that includes both cultivated and wild accessions within a genus; (ii) include more genetic variations (e.g., SVs) in addition to SNPs/InDels into genomics-assisted crop breeding, and (iii) systematically integrate multi-omics evidence to accelerate crop breeding.

Maize is a staple crop and a model organism for genetic research [25]. Since the first release of the maize B73 reference genome in 2009 [26], more than 40 maize genomes have been released to date. Moreover, multi-omics maize data, including DNA resequencing [27,28,29,30,31], transcriptomic [32, 33], metabolomic [34, 39] and the population-level transcripts of hundreds of diverse lines [1: Fig. S8D–E. G Schematic of the variant graph genome representation for AGPv4 Chr2:171064-171220, with the SNP paths, short InDels, and a large deletion. H The identity and map** rate distribution of the simulated short reads from the genomes of the 26 NAM founders against the variant graph. Dark blue individuals are presented on the variant graph, whereas light blue individuals are not

Full size image

The gene models and functional annotations for the pan-Zea genome were next generated by merging the AGPv4 reference gene annotations with the non-reference genes that were annotated based on a combination of transcript evidence, homologous protein evidence, and ab initio gene predictions (Additional file 1: Fig. S4 and Additional file 12: Supplementary Materials and Methods), resulting in 58,944 genes (39,591 AGPv4 genes and 19,353 non-reference genes, see Additional file 4: Table S3) and 21,167 orthologous groups (Additional file 12: Supplementary Materials and Methods). About 85.82% of the genes were assigned to at least one functional annotation (Additional file 1: Fig. S5).

The gene presence/absence (gPAV) patterns for each maize inbred and teosinte individual (landrace individuals were excluded from the downstream analysis to avoid bias, leaving 691 genotypes for subsequent analysis, see Additional file 12: Supplementary Materials and Methods) were estimated using a read-map**-based method that maintained robustness among different read depths (Additional file 1: Fig. S6A). The resulting gPAV patterns followed previous reports (Additional file 1: Fig. S6B–D), with an estimated genoty** accuracy of ~99.71% and ~95.84% for true presence and true absence, respectively (Additional file 1: Fig. S6E). Principal component analysis and linkage disequilibrium (LD) rank analyses of the gPAVs revealed that the gPAVs were related to the population structure and were well represented by SNPs with ~97.37% gPAVs displaying high LD with nearby SNPs (Additional file 1: Fig. S7 and Additional file 12: Supplementary Materials and Methods). Next, to investigate the PAVs of genes and the orthologous groups from a population perspective, we identified the “core” (with population-level loss rate not significantly greater than 1%) and “dispensable” (with population-level loss rate significantly greater than 1%) genes and orthologous groups based on the gPAV and the derived orthologous group PAV (oPAV) matrices (Additional file 1: Fig. S8A, Additional file 12: Supplementary Materials and Methods). The results of these analyses revealed that ~44.34% of the pan-Zea genes were dispensable, while only ~7.42% of the pan-Zea orthologous groups were dispensable (Fig. 1E and Additional file 1: Fig. S8B). An average of 6020 genes displayed PAV patterns between two individuals, with larger differences for two inter-subspecies individuals (~6779 between one teosinte and one maize) than inner-subspecies (~5520 between two maize and ~5635 between two teosintes) (Additional file 1: Fig. S8C). Given our knowledge of the core and dispensable gene/orthologous groups, we estimated the gene/orthologue-group set size for the pan-Zea genome and the core genome. The in silico simulation showed that the pan-Zea genome (Fig. 1F), as well as the subspecies pan-genomes (pan-maize genome and pan-teosinte genome, Additional file 1: Fig. S8D–E), displayed characteristics of a “closed pan-genome” [45] with plateaus in the size curves, suggesting that we identified almost all of the genes in maize and teosinte. The results show that 27 individuals represented an average of 95% of the pan-Zea gene-set (range ~91.9 to ~97.3%), while 99 individuals covered 99% of the pan-Zea genes (Fig. 1F).

In addition to the linear representation of the pan-Zea genome, we also constructed a graph-based genome (Fig. 1G), including the SNPs, short InDels (<50 bp in length), and SVs (>50 bp in length) (see sections below and Additional file 12: Supplementary Materials and Methods for details of the SVs). Furthermore, we estimated a representation of the variant graph genome using reads simulated from the 26 NAM founder genome assemblies. The results showed that reads simulated from the “not-in-graph” NAM founders (23/26) had compatible map** rates (99.40% on average) and alignment identities (91.00% on average) with those of the “within-graph” NAM founders (3/26, including B73, with 99.44% and 91.14% for the average map** rate and alignment identity, respectively) (Fig. 1H), indicating that the variant graph represented the vast majority of the maize genetic repertoire.

The presence/absence patterns of pan-Zea genes and the orthologous groups

An interesting question is which genes are more likely to show gPAVs in the genus Zea. To address this question, we investigated the associations between the gPAVs and genic features, including the sub-genome origin, gene age, gene length, orthologue group size, expression levels, and selective constraints (Additional file 12: Supplementary Materials and Methods and Additional file 5: Table S4). The results of these analyses revealed that the gPAV was significantly associated with the genic features under investigation (Fig. 2A). Specifically, they showed that the absence of pan-Zea genes was more prevalent among genes that were newly derived, within large orthologous groups, and with genes that were either minimally or tissue specifically expressed. The dispensable genes were more likely to be evolving under relaxed selective constraint than the core genes. These findings are consistent with previous findings that older genes are more essential [46] and associated with higher expression levels and stronger purifying selection [47].

The 691 pan-Zea individuals were divided into three maize sub-populations and eight teosinte sub-populations (Fig. 2B and C, [42, 44]). To investigate whether these sub-populations lost genes and orthologous groups evenly, we clustered the dispensable genes into 16 clusters (CLS1–16 in Fig. 2B and Additional file 5: Table S4) and the orthologous groups into 13 clusters (oCLS1–13 in Fig. 2C). The distributions of the gPAVs and oPAVs were associated with the topology of the species tree. The distal-to-maize teosinte subspecies (Zea nicaraguensis, Zea luxurians, Zea diploperennis, Zea perennis, and Zea mays ssp. huehuetenangensis) had more subspecies-enriched genes and orthologous groups than close relatives of maize (Zea mays ssp. mexicana and Zea mays ssp. parviglumis).

Enrichment analysis suggested that the gPAV and oPAV clusters may reflect distinct molecular functions among the sub-populations (Fig. 2D–E, Additional file 6: Table S5, Additional file 7: Table S6 and Additional file 1: Fig. S9). Specifically, the teosinte concentrated genes (CLS5, 8, 9, 10, 11, and 12 in Fig. 2A) had enrichment signals, including plant-pathogen interactions, bacterial toxins, biosynthesis of flavonoids, di-/tri- or sesqui-terpenoids, cutin, suberin and wax, and the mitogen-activated protein kinase signaling pathway (Fig. 2D). These enriched pathways are all related to abiotic and biotic stress responses [48]. These findings corroborate the previous finding that reduced genetic diversity during crop domestication leads to the loss of several loci related to the stress response [49], suggesting the potential role of gene loss in the stress-susceptible changes in crops. Comparatively, maize-enriched genes (CLS6, 13, 15, and 16 in Fig. 2A) and orthologous groups (oCLS10-13 in Fig. 2B) were more likely to be related to germination, nutrition, and flavor-related pathways [50, 51]. For example, the maize concentrated orthologous group oCLS7 was enriched in amino acid and protein-related pathways when compared to the teosinte concentrated orthologous group oCLS6 (Additional file 1: Fig. S10), and the orthologous groups that are rarely present in teosintes (oCLS10–13) were enriched in metabolite pathways related to folate, fatty acids, ascorbate, carotenoids, biotin, and various carbohydrates (Fig. 2E).

To further address the contents of teosinte-specific genes and maize-specific genes, we investigated the sub-population gPAV distribution differences in teosinte versus maize (TEO-Maize) and tropical maize versus temperate maize (TST-TEM) in more detail. As results, 51.09% (3452/6756) of the maize-enriched genes were absent in all teosinte individuals (hereafter referred to as maize-specific genes), while 17.83% (2189/12,278) of the teosinte-enriched genes were lost in maize (teosinte-specific genes) (Fig. 2F). We detected 3543 TST-TEM unbalanced (enriched or diminished) genes, with only 1.93% of the sub-group-specific items. A total of 4042 genes were found with no specific distribution preference in any sub-group (random genes, Fig. 2F). Further analysis revealed that the random genes were significantly under-represented in the domestication and adaptation selective sweep regions (Fig. 2G), suggesting that sub-group unbalanced genes were selected during maize domestication and adaptation. Thus, we compared the proportion of different sub-group-enriched genes between selected and background regions (Fig. 2H). This analysis revealed that although the distribution of sub-group-enriched genes was not significantly different between TEM and TST maize, the teosinte-enriched genes were more likely to appear in domestication regions (Fig. 2H), suggesting that some teosinte genes were selected to be lost during domestication.

Another notable question is the pattern of gain-or-loss of teosinte/maize-specific genes, considering that a gene can become group-specific either through (i) loss of all members of the other group or (ii) gained from exogenous sources that were not available to the other group. While the gPAVs showed a predominance of teosinte-enriched genes, the oPAV clusters showed the opposite trend in that more orthologous groups were enriched in maize than in teosinte (Additional file 1: Fig. S11A and B), and the maize-enriched orthologous groups, particularly the maize-specific groups, tended to be smaller (Fig. 2I and Additional file 1: Fig. S11C). The comparisons of group-specific gene features indicated that the teosinte-specific genes showed a typical “easy-to-lose” pattern within larger families of shorter and newer. However, the maize-specific genes showed an opposite pattern of smaller family size and longer and considerably older genes (Fig. 2I). These analyses suggest that the teosinte-specific genes most likely resulted from gene loss, while at least a subset of the maize-specific genes were derived from resources outside of the Zea genus, perhaps through horizontal gene transfer from bacteria [44]. To complement the genetic variation map and estimate the impact of SVs on the maize phenotypic variations, we constructed a comprehensive SV map by integrating evidences from variant graphs, comparative genomics, and short-read alignments. We filtered the SVs with a set of strict conditions, and only the common (MAF > 0.05) SVs within the maize population were retained (Additional file 1: Fig. S12 and Additional file 12: Supplementary Materials and Methods). We detected 274,649 common SVs, including 181,874 deletions (DELs), 19,628 insertions (INSs), 26,894 translocations (TRAs), 7020 duplications (DUPs), and 1577 inversions (INVs), as well as 18,828 common gPAVs from the aforementioned gPAV matrix. A total of 11,208,912 SNPs and 2,015,663 InDels (1,045,218 short-insertions and 970,445 short-deletions) that were common in the maize population were extracted from the Zea haplotype map and were combined with the common SVs identified in the current study to form a maize common genetic variation map for downstream analyses (Fig. 3A and B and Additional file 1: Fig. S13A-B). Considering the repeat-rich nature of maize, we also assigned the SVs to their closest transposable elements (TEs) according to physical overlap and sequence similarity (Additional file 12: Supplementary Materials and Methods) and found that ~60.03% of the SVs were TE-related (Additional file 1: Fig. S13B), indicating the TE origin of a sizeable proportion of SVs in maize. The size of most of the genotyped common SVs was smaller than 5 Kb (Additional file 1: Fig. S13C), which may have been caused by the limitation in WGS short reads [56]. The SNPs, InDels, and SVs showed similar MAF distribution patterns that skewed toward rare variants (Additional file 1: Fig. S13D). The estimate of the representativeness of each SV by nearby SNPs revealed that 37.36% of the SVs showed low LD levels with nearby SNPs (Fig. 3C and Additional file 12: Supplementary Materials and Methods), suggesting that these SVs could harbor information that cannot be represented by nearby SNPs.

With the representative genetic variation map that covered the typical genetic variation types, we conducted comprehensive investigations on the differences in the partitioning of narrow-sense heritability (h²) among the different genetic variation types. To address this, we estimated the phenotypic variance explained by the genetic variations from a collection of complex agronomic phenotypes and multi-omics molecular trait data, including gene expression, metabolites, protein contents, and DNA methylation (Additional file 8: Table S7 and Additional file 12: Supplementary Materials and Methods). This analysis revealed that each class of the multi-omics traits displayed high heritability (average values of 0.72, 0.66, 0.67, 0.70, and 0.85 for agronomic traits, expression, metabolites, DNA methylation, and proteins respectively; see also Fig. 3D), indicating their capability to characterize the heritability patterns among the genetic features. The h² values of the SVs were higher than those of the SNPs and InDels in the multi-omics molecular traits (Fig. 3E) even though there were significantly more SNPs (43.8 times) and InDels (7.9 times) than SVs. To estimate the h² differences in an unbiased manner, we binned, or partitioned and randomized, the genetic variations to keep each of the compared features, including MAF, the SNP LD rank, variant types/sub-types, genomic locations, and impacts on the genes, in the same volume (see Additional file 12: Supplementary Materials and Methods). As a result, the SVs showed a more clearly decreasing h² pattern with the increase in the MAF than SNPs/InDels. The h² values of the SVs were negatively correlated with the SNP LD rank values, in contrast to those of InDels (Additional file 1: Fig. S13E). When came into the same volume, SVs (with an average h² of 0.71, range 0.70–0.72 for each randomization) explained an average of 14.19% and 1.48% more phenotypic variance than the SNPs (with average h² of 0.63, range 0.60–0.65 for each randomization) and InDels (with an average h² of 0.70, range 0.68–0.72 for each randomization), respectively (Fig. 3F). Higher h² values were found in the intergenic variants, genic variants with a high impact on genes, gPAV, and TRA type of SVs, as well as the LTR and helitron-related SVs (Fig. 3F and Additional file 1: Fig. S13F). These findings suggest that SVs are more likely to lead to functional changes than other variants.

The impact of the pan-Zea genome and structural variations on maize phenotypic variations

The comprehensive genetic variation map and the multi-level phenotypes provided opportunities to further investigate the potential effects of different genetic and genomic features on the phenotypes. To this end, we performed genome-wide association analyses for the complex traits (agronomic traits, metabolites, and protein contents) and local association analyses for the molecular traits (gene expression and DNA methylation) (see Additional file 12: Supplementary Materials and Methods for details). A total of 21,255 non-redundant QTLs with a median QTL interval of ~152.77 Kb were identified for 21,206 different traits (Fig. 4A, Additional file 1: Fig. S14 and Fig. S15). About 32.78% of the identified QTLs were SV-QTL (Fig. 4B). Among them, 459 QTLs were SV-specific (could only be identified by SVs, Fig. 4C and Additional file 9: Table S8). The proportion of SV-QTLs was much higher than the proportion of SVs in all of the variants, indicating that SVs are more likely to lead to functional changes.

About 47.54% (8950) of the tested gPAVs had associated signals (Fig. 4D and Additional file 12: Supplementary Materials and Methods), suggesting the functional importance of these gPAVs. An excellent example is shown in Fig. 4E–H in which a premature pectin methylesterase (PME) gene (Zm00001d048936) at the maize gametophyte factor1 (Ga1) locus was reported to be causative of unilateral cross-incompatibility [146]. A total of 807,787 genetic variations (3.25% SVs, 15.73% InDels, and 81.02% SNPs) were kept as CCVs, as they were within the 95% confidence interval of the causal variant set for at least one trait. On average, the statistical fine-map** kept ~18 variants as CCVs from ~229 nominally associated variants (with P < 0.001, see Additional file 12: Supplementary Materials and Methods) for each QTL. The number of CCVs was poorly correlated with the QTL quality score, the significance of the leading variant, and the number of genetic variations within the QTL (Additional file 1: Fig. S18A), indicating that CCVs could reflect additional information that cannot be represented by using the leading variants alone. The estimate of the effect sizes of the CCVs showed that SVs and gPAVs had a larger effect size than that of the SNPs/InDels (Additional file 1: Fig. S18B). The general feature enrichment analyses between the CCVs and the nominally associated variations (see Additional file 12: Supplementary Materials and Methods) showed that the SVs, particularly INSs and gPAVs, were more likely to be enriched in the causal variant sets than SNPs or InDels (Additional file 1: Fig. S18C). Specifically, INSs were enriched in expression, metabolites, and methylation, while gPAVs were only found enriched in expression (Additional file 1: Fig. S19). For SVs related to different TE classes, the helitron and TIR-related SVs were more likely to be causal than the LTR-related SVs. Genic variants, particularly those with a high impact on genes, were more likely to be causal (Additional file 1: Fig. S18C).

To further investigate the effect of different genetic variations on gene expression, we estimated the enrichment of CCVs in the cis-eQTLs along their distance to the transcription start site (TSS). The results showed that the CCVs were enriched in TSS-nearby regions (“Causal variants” track in Additional file 1: Fig. S20), following previous results [164]. Further investigation revealed that the SVs displayed the waviest trend for the fold enrichment changes along distances to the TSS than INDELs and SNPs (“SNP,” “INDEL,” and “SV” tracks in Additional file 1: Fig. S20). This pattern indicated that the TE-related SVs were more likely to be enriched in the upstream regions of TSSs (“TE-related SVs” track in Additional file 1: Fig. S20). These findings suggest that rather than directly affect TSSs or gene body regions, SVs (particularly TE-related SVs) would be more likely to affect gene expression by affecting nearby upstream regions of the gene, where most cis-regulatory elements (CREs) are located [165].

A detailed example is illustrated in Fig. 5. The impact of a SV on the expression of Zm00001d023299, a zinc finger CCCH domain-containing protein (ZEAMAP [57]) that has been previously proposed to be a candidate QTL (marked as IDP103 in MaizeGDB), related to drought and ultraviolet stressors [58] and was highly expressed in response to various stressors in maize (Additional file 1: Fig. S21). In the current study, an SV-specific eQTL related to the expression of Zm00001d023299 in drought-stressed leaves harbored a CCV named PZ00001aSV02097079INS (Fig. 5A). The expression of Zm00001d023299 in leaves [33] responded to different drought treatment levels, and the presence of PZ00001aSV02097079INS suppressed gene expression in leaves (Fig. 5B). The presence of PZ00001aSV02097079INS could also increase the survival rate of maize under drought stress [59] (Fig. 5C). Comparisons among maize genomes indicated that PZ00001aSV02097079INS is a 1947-bp Harbinger-transposon-like sequence (Additional file 1: Fig. S22) inserted 2269 bp upstream of Zm00001d023299 (Fig. 5D). This evidence suggests that PZ00001aSV02097079INS could be one of the causes of maize drought resistance by suppressing the expression of Zm00001d023299 in leaves. An investigation into the expression patterns in different tissues of four maize founder individuals from the Complete-diallele design plus Unbalanced Breeding-like Inter-Cross (CUBIC) population [1: Fig. S23. The P-values were calculated from 10,000 permutations of the t-test

Full size image

In conclusion, we revealed the genomic feature priorities that were more causatively associated with multi-omics-level phenotype variations. We showed that leveraging SVs and pan-Zea genome-based gPAVs can be used to detect causative associations related to important agronomic phenotypic variations that could not be directly identified using single genome-based SNPs.

Discussion

Community resources to accelerate maize molecular breeding

To meet the upcoming food production challenge, considerable efforts are required in maize genomics and genetics to further understand the genetics of agronomic traits and increase the efficiency of molecular breeding. In this study, we constructed a genus-level pan-genome for Zea, identified the pan-Zea gene and orthologous group presence/absence patterns, complemented the maize genetic variation map by including common SV types, and systematically investigated the potential genetic and genomic feature priorities for multi-omics traits by identifying the trait-associated QTLs and candidate causal variants from GWAS summary statistics data. These resources will provide useful information for maize breeding in the following three ways.

First, the wild relatives of maize have shown their potential for increasing stress tolerance and overall yield [62,63,65]. While GWAS have been successful in decoding genotype-phenotype associations in maize [42, 70,

Conclusions

In summary, we constructed a pan-Zea genome, analyzed the gene presence/absence patterns, and investigated the impact of the pan-Zea genome and different genetic variant features on maize biochemical and phenotypic variations. These findings will provide useful information for unraveling the genetic architecture of maize complex agronomic traits, accelerate maize molecular breeding, and improve our understanding of maize domestication and adaptation procedure. Still, some limitations have hampered more informative pan-Zea genomic and genetic results in the current study.

Perhaps the most obvious holding-back of the pan-Zea genome in the current study is the lack of teosinte reference genomes and population-level deep long-read sequencing data for teosintes, landraces, and elite maize. With the fragmental assemblies generated from deep-WGS reads, we had to perform additional data polishing steps to ensure credibility, which inevitably affected the volume of the results, and led to an underestimate of the teosinte sequences and genes within the pan-Zea genome, the large and rare SVs, and the untangled variations in the highly repetitive regions.

Foreseeable actions that could address these limitations and boost our understanding of maize breeding and improvement would be to (i) enrich the pan-Zea genomic information pool with reference genomes and population-level long-read sequences of teosintes and maize landraces, (ii) enlarge the pan-Zea genetic variation matrices and their associations with more biochemicals, (iii) refine the genetic interactions by investigating causation and pleiotropy, and (iv) decrypt the regulatory network of maize phenotypic variation by combining the genetic interactions with other bio-networks, such as interactomes [72] and cell-cell communications [73]. The concept of the pan-genome has expanded from the whole gene and sequence set to the whole set of genomic and genetic variations within a genus. Moreover, with the development of reference-free whole-genome alignments [74] and genome graphic representations [8], it may be time to change the maize reference genomes [7]. We have many reasons to be optimistic when facing the imminent challenges of increasing the productivity and quality of crops with our current resources and the anticipated upcoming progress in crop genomics and genetics.

Methods

The detailed materials and methods, including (i) collection of genomic and transcriptomic data, (ii) de novo assembly, whole-genome comparison, and pan-genome construction, (iii) pan-genome gene annotation, (iv) pan-Zea gene and ortholog group analyses, (v) characterizing gene features, (vi) genoty** and characterizing the maize genetic variation map, (vii) variant graph constructing, reads simulation, and map**, (viii) phenotype data collection and normalization, (ix) estimating of narrow-sense heritability, (x) identifying trait-associated QTLs, genes, and variants, (xi) analyses of gPAV and SV cases, and (xii) miscellaneous statistical analyses and visualizations, are available in Additional file 12: Supplementary Materials and Methods.

Availability of data and materials

The accessions of the whole-genome sequencing reads of the pan-Zea individuals could be retrieved from GenBank: the maize association map** panel lines [75], the teosinte lines [76], and the maize landrace lines [77]. The accessions of the 11 publicly available genome assemblies are listed in Additional file 2: Table S1. The pan-Zea RNA sequencing data (RNA-Seq) were retrieved from GenBank: the RNA-Seq of the kernels of the AMP lines [78] and the RNA-Seq of the leaves of teosinte lines [79]. The genome assemblies of the maize NAM founders were downloaded from MaizeGDB [80]. The datasets generated in this paper, including the fragmental de novo assemblies of the 701 pan-Zea individuals, the pan-Zea genome and annotations, the maize structural variation map, and the gene PAV matrices, have been deposited in the China National GeneBank Sequence Archive (CNSA) [81] and figshare [82]. The above data, along with the graphic representation of pan-Zea genome and the GWAS summary data for QTL regions, could also be accessed in the download page of the ZEAMAP database [83] and the China National Center for Bioinformation (CNCB) [84]. The related analysis scripts, pipelines, and source codes are available at Github under MIT license and Zenodo: pan-Zea genome construction pipeline [85, 160], SV calling and genoty** scripts [86, 161], and miscellaneous analysis scripts [87, 162].

References

Hickey LT, Hafeez AN, Robinson H, Jackson SA, Leal-Bertioli SC, Tester M, et al. Breeding crops to feed 10 billion. Nat Biotechnol. 2019;37:744–54.
Article CAS PubMed Google Scholar
Hunter MC, Smith RG, Schipanski ME, Atwood LW, Mortensen DA. Agriculture in 2050: recalibrating targets for sustainable intensification. Bioscience. 2017;67:386–91.
Article Google Scholar
Varshney RK, Nayak SN, May GD, Jackson SA. Next-generation sequencing technologies and their implications for crop genetics and breeding. Trends Biotechnol. 2009;27:522–30.
Article CAS PubMed Google Scholar
Wallace JG, Rodgers-Melnick E, Buckler ES. On the road to breeding 4.0: unraveling the good, the bad, and the boring of crop quantitative genomics. Annu Rev Genet. 2018;52:421–44.
Article CAS PubMed Google Scholar
Wei X, Qiu J, Yong K, Fan J, Zhang Q, Hua H, et al. A quantitative genomics map of rice provides genetic insights and guides breeding. Nat Genet. 2021;53:243–53.
Article CAS PubMed Google Scholar
Yang X, Lee W-P, Ye K, Lee C. One reference genome is not enough. Genome Biol. 2019;20:104.
Article PubMed PubMed Central Google Scholar
Ballouz S, Dobin A, Gillis JA. Is it time to change the reference genome? Genome Biol. 2019;20:1–9.
Article Google Scholar
Eizenga JM, Novak AM, Sibbesen JA, Heumos S, Ghaffaari A, Hickey G, et al. Pangenome graphs. Annu Rev Genomics Hum Genet. 2020;21:139–62.
Article CAS PubMed PubMed Central Google Scholar
Tettelin H, Masignani V, Cieslewicz MJ, Donati C, Medini D, Ward NL, et al. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial “pan-genome”. Proc Natl Acad Sci. 2005;102:13950–5.
Article CAS PubMed PubMed Central Google Scholar
Peter J, De Chiara M, Friedrich A, Yue JX, Pflieger D, Bergstrom A, et al. Genome evolution across 1,011 Saccharomyces cerevisiae isolates. Nature. 2018;556:339–44.
Article CAS PubMed PubMed Central Google Scholar
Sherman RM, Forman J, Antonescu V, Puiu D, Daya M, Rafaels N, et al. Assembly of a pan-genome from deep sequencing of 910 humans of African descent. Nat Genet. 2019;51:30–5.
Article CAS PubMed Google Scholar
1001 Genomes Consortium. 1,135 genomes reveal the global pattern of polymorphism in Arabidopsis thaliana. Cell. 2016;166:481–91.
Article CAS Google Scholar
Alonge M, Wang X, Benoit M, Soyk S, Pereira L, Zhang L, et al. Major impacts of widespread structural variation on gene expression and crop improvement in tomato. Cell. 2020;182:145–161 e123.
Article CAS PubMed PubMed Central Google Scholar
Liu Y, Du H, Li P, Shen Y, Peng H, Liu S, et al. Pan-genome of wild and cultivated soybeans. Cell. 2020;182:162–176 e113.
Article CAS PubMed Google Scholar
Montenegro JD, Golicz AA, Bayer PE, Hurgobin B, Lee H, Chan CK, et al. The pangenome of hexaploid bread wheat. Plant J. 2017;90:1007–13.
Article CAS PubMed Google Scholar
Wang W, Mauleon R, Hu Z, Chebotarov D, Tai S, Wu Z, et al. Genomic variation in 3,010 diverse accessions of Asian cultivated rice. Nature. 2018;557:43–9.
Article CAS PubMed PubMed Central Google Scholar
Springer NM, Ying K, Fu Y, Ji T, Yeh CT, Jia Y, et al. Maize inbreds exhibit high levels of copy number variation (CNV) and presence/absence variation (PAV) in genome content. PLoS Genet. 2009;5:e1000734.
Article PubMed PubMed Central CAS Google Scholar
Chiang C, Scott AJ, Davis JR, Tsang EK, Li X, Kim Y, et al. The impact of structural variation on human gene expression. Nat Genet. 2017;49:692–9.
Article CAS PubMed PubMed Central Google Scholar
Fuentes RR, Chebotarov D, Duitama J, Smith S, De la Hoz JF, Mohiyuddin M, et al. Structural variants in 3000 rice genomes. Genome Res. 2019;29:870–80.
Article CAS PubMed PubMed Central Google Scholar
Yang N, Liu J, Gao Q, Gui S, Chen L, Yang L, et al. Genome assembly of a tropical maize inbred line provides insights into structural variation and crop improvement. Nat Genet. 2019;51:1052–9.
Article CAS PubMed Google Scholar
Subramanian I, Verma S, Kumar S, Jere A, Anamika K. Multi-omics data integration, interpretation, and its application. Bioinform Biol Insights. 2020;14:1177932219899051.
Article PubMed PubMed Central Google Scholar
Scossa F, Alseekh S, Fernie AR. Integrating multi-omics data for crop improvement. J Plant Physiol. 2021;257:153352.
Article CAS PubMed Google Scholar
Li Q, Yan J. Sustainable agriculture in the era of omics: knowledge-driven crop breeding. Genome Biol. 2020;21:154.
Article PubMed PubMed Central Google Scholar
Khan AW, Garg V, Roorkiwal M, Golicz AA, Edwards D, Varshney RK. Super-pangenome by integrating the wild side of a species for accelerated crop improvement. Trends Plant Sci. 2020;25:148–58.
Article CAS PubMed PubMed Central Google Scholar
Wallace J, Larsson S, Buckler E. Entering the second century of maize quantitative genetics. Heredity. 2014;112:30–8.
Article CAS PubMed Google Scholar
Schnable PS, Ware D, Fulton RS, Stein JC, Wei F, Pasternak S, et al. The B73 maize genome: complexity, diversity, and dynamics. Science. 2009;326:1112–5.
Article CAS PubMed Google Scholar
Gore MA, Chia J-M, Elshire RJ, Sun Q, Ersoz ES, Hurwitz BL, et al. A first-generation haplotype map of maize. Science. 2009;326:1115–7.
Article CAS PubMed Google Scholar
Bukowski R, Guo X, Lu Y, Zou C, He B, Rong Z, et al. Construction of the third-generation Zea mays haplotype map. Gigascience. 2017;7:gix134.
PubMed Central Google Scholar
Chia JM, Song C, Bradbury PJ, Costich D, de Leon N, Doebley J, et al. Maize HapMap2 identifies extant variation from a genome in flux. Nat Genet. 2012;44:803–7.
Article CAS PubMed Google Scholar
Jiao Y, Zhao H, Ren L, Song W, Zeng B, Guo J, et al. Genome-wide genetic changes during modern breeding of maize. Nat Genet. 2014;46:1039–40.
Article CAS Google Scholar
Wang B, Lin Z, Li X, Zhao Y, Zhao B, Wu G, et al. Genome-wide selection and genetic improvement during modern maize breeding. Nat Genet. 2020;52:565–71.
Article PubMed CAS Google Scholar
Li H, Peng Z, Yang X, Wang W, Fu J, Wang J, et al. Genome-wide association study dissects the genetic architecture of oil biosynthesis in maize kernels. Nat Genet. 2013;45:43–50.
Article CAS PubMed Google Scholar
Liu S, Li C, Wang H, Wang S, Yang S, Liu X, et al. Map** regulatory variants controlling gene expression in drought response and tolerance in maize. Genome Biol. 2020;21:163.
Article CAS PubMed PubMed Central Google Scholar
Wen W, Li D, Li X, Gao Y, Li W, Li H, et al. Metabolome-based genome-wide association study of maize kernel leads to novel biochemical insights. Nat Commun. 2014;5:3438.
Article PubMed CAS Google Scholar
Wen W, ** M, Li K, Liu H, **ao Y, Zhao M, et al. An integrated multi-layered analysis of the metabolic networks of different tissues uncovers key genetic components of primary metabolism in maize. Plant J. 2018;93:1116–28.
Article CAS PubMed Google Scholar
Jiang L-G, Li B, Liu S-X, Wang H-W, Li C-P, Song S-H, et al. Characterization of proteome variation during modern maize breeding. Mol Cell Proteomics. 2019;18:263–76.
Article CAS PubMed Google Scholar
Walley JW, Sartor RC, Shen Z, Schmitz RJ, Wu KJ, Urich MA, et al. Integration of omic networks in a developmental atlas of maize. Science. 2016;353:814–8.
Article CAS PubMed PubMed Central Google Scholar
Xu J, Chen G, Hermanson PJ, Xu Q, Sun C, Chen W, et al. Population-level analysis reveals the widespread occurrence and phenotypic consequence of DNA methylation variation not tagged by genetic variation in maize. Genome Biol. 2019;20:243.
Article PubMed PubMed Central CAS Google Scholar
Hufford MB, Seetharam AS, Woodhouse MR, Chougule KM, Ou S, Liu J, et al. De novo assembly, annotation, and comparative analysis of 26 diverse maize genomes. Science. 2021;373:655–62.
Article CAS PubMed PubMed Central Google Scholar
** M, Liu H, He C, Fu J, **ao Y, Wang Y, et al. Maize pan-transcriptome provides novel insights into genome complexity and quantitative trait variation. Sci Rep. 2016;6:18936.
Article CAS PubMed PubMed Central Google Scholar
Hirsch CN, Foerster JM, Johnson JM, Sekhon RS, Muttoni G, Vaillancourt B, et al. Insights into the maize pan-genome and pan-transcriptome. Plant Cell. 2014;26:121–35.
Article CAS PubMed PubMed Central Google Scholar
Yang X, Gao S, Xu S, Zhang Z, Prasanna BM, Li L, et al. Characterization of a global germplasm collection and its potential utilization for analysis of complex quantitative traits in maize. Mol Breed. 2011;28:511–26.
Article Google Scholar
Wang L, Beissinger TM, Lorant A, Ross-Ibarra C, Ross-Ibarra J, Hufford MB. The interplay of demography and selection during maize domestication and expansion. Genome Biol. 2017;18:1–13.
Article CAS Google Scholar
Chen L, Luo J, ** M, Yang N, Liu X, Peng Y, et al. Portrait of a genus: the genetic diversity of Zea. bioRxiv. 2021:2021.2004.2007.438828. https://doi.org/10.1101/2021.04.07.438828.
Vernikos G, Medini D, Riley DR, Tettelin H. Ten years of pan-genome analyses. Curr Opin Microbiol. 2015;23:148–54.
Article CAS PubMed Google Scholar
Chen W-H, Trachana K, Lercher MJ, Bork P. Younger genes are less likely to be essential than older genes, and duplicates are less likely to be essential than singletons of the same age. Mol Biol Evol. 2012;29:1703–6.
Article CAS PubMed PubMed Central Google Scholar
Wolf YI, Novichkov PS, Karev GP, Koonin EV, Lipman DJ. The universal distribution of evolutionary rates of genes and distinct characteristics of eukaryotic genes of different apparent ages. Proc Natl Acad Sci. 2009;106:7273–80.
Article CAS PubMed PubMed Central Google Scholar
Tohge T, de Souza LP, Fernie AR. Current understanding of the pathways of flavonoid biosynthesis in model and crop plants. J Exp Bot. 2017;68:4013–28.
Article CAS PubMed Google Scholar
Warschefsky E, Penmetsa RV, Cook DR, Von Wettberg EJ. Back to the wilds: tap** evolutionary adaptations for resilient crops through systematic hybridization with crop wild relatives. Am J Bot. 2014;101:1791–800.
Article PubMed Google Scholar
Shi J, Wang H, Hazebroek J, Ertl DS, Harp T. The maize low-phytic acid 3 encodes a myo-inositol kinase that plays a role in phytic acid biosynthesis in develo** seeds. Plant J. 2005;42:708–19.
Article CAS PubMed Google Scholar
Phillippy BQ. Identification of inositol 1,3,4-trisphosphate 5-kinase and inositol 1,3,4,5-tetrakisphosphate 6-kinase in immature soybean seeds. Plant Physiol. 1998;116:291.
Article CAS PubMed Central Google Scholar
Cheng S, **an W, Fu Y, Marin B, Keller J, Wu T, et al. Genomes of subaerial Zygnematophyceae provide insights into land plant evolution. Cell. 2019;179:1057–1067 e1014.
Article CAS PubMed Google Scholar
Wang H, Sun S, Ge W, Zhao L, Hou B, Wang K, et al. Horizontal gene transfer of Fhb7 from fungus underlies Fusarium head blight resistance in wheat. Science. 2020;368:eaba5435.
Article CAS PubMed Google Scholar
**a J, Guo Z, Yang Z, Han H, Wang S, Xu H, et al. Whitefly hijacks a plant detoxification gene that neutralizes plant toxins. Cell. 2021;184:1693–1705 e1617.
Article CAS PubMed Google Scholar
Hibdige SGS, Raimondeau P, Christin PA, Dunning LT. Widespread lateral gene transfer among grasses. New Phytol. 2021;230:2474–86.
Article CAS PubMed Google Scholar
Ho SS, Urban AE, Mills RE. Structural variation in the sequencing era. Nat Rev Genet. 2020;21:171–89.
Article CAS PubMed Google Scholar
Gui S, Yang L, Li J, Luo J, Xu X, Yuan J, et al. ZEAMAP, a comprehensive database adapted to the maize multi-omics era. iScience. 2020;23:101241.
Article CAS PubMed PubMed Central Google Scholar
Richbourg HL. QTL analysis for genes conferring tolerance to drought stress and damage from UV-B radiationUniversity of North Carolina Wilmington; 2008.
Google Scholar
Wang X, Wang H, Liu S, Ferjani A, Li J, Yan J, et al. Genetic variation in ZmVPP1 contributes to drought tolerance in maize seedlings. Nat Genet. 2016;48:1233–41.
Article CAS PubMed Google Scholar
Liu H-J, Wang X, **ao Y, Luo J, Qiao F, Yang W, et al. CUBIC: an atlas of genetic architecture promises directed maize improvement. Genome Biol. 2020;21:20.
Article PubMed PubMed Central Google Scholar
Tu X, Mejía-Guerra MK, Valdes Franco JA, Tzeng D, Chu P-Y, Shen W, et al. Reconstructing the maize leaf regulatory network using ChIP-seq data of 104 transcription factors. Nat Commun. 2020;11:5089.
Article CAS PubMed PubMed Central Google Scholar
Maxted N, Kell S. Establishment of a global network for the in situ conservation of crop wild relatives: status and needs. Rome: FAO commission on genetic resources for food and agriculture; 2009. p. 266.
Google Scholar
Prischmann D, Dashiell K, Schneider D, Eubanks M. Evaluating Tripsacum-introgressed maize germplasm after infestation with western corn rootworms (Coleoptera: Chrysomelidae). J Appl Entomol. 2009;133:10–20.
Article Google Scholar
Tian J, Wang C, **a J, Wu L, Xu G, Wu W, et al. Teosinte ligule allele narrows plant architecture and enhances high-density maize yields. Science. 2019;365:658–64.
Article CAS PubMed Google Scholar
Pasaniuc B, Price AL. Dissecting the genetics of complex traits using summary association statistics. Nat Rev Genet. 2017;18:117–27.
Article CAS PubMed Google Scholar
**ao Y, Liu H, Wu L, Warburton M, Yan J. Genome-wide association studies in maize: praise and stargaze. Mol Plant. 2017;10:359–74.
Article CAS PubMed Google Scholar
van Dijk ADJ, Kootstra G, Kruijer W, de Ridder D. Machine learning in plant science and plant breeding. iScience. 2021;24:101890.
Article PubMed Google Scholar
Ren D, An L, Li B, Qiao L, Liu W. Efficient weighting methods for genomic best linear-unbiased prediction (BLUP) adapted to the genetic architectures of quantitative traits. Heredity. 2021;126:320–34.
Article CAS PubMed Google Scholar
Swarts K, Gutaker RM, Benz B, Blake M, Bukowski R, Holland J, et al. Genomic estimation of complex traits reveals ancient maize adaptation to temperate North America. Science. 2017;357:512–5.
Article CAS PubMed Google Scholar
Riedelsheimer C, Lisec J, Czedik-Eysenberg A, Sulpice R, Flis A, Grieder C, et al. Genome-wide association map** of leaf metabolic profiles for dissecting complex traits in maize. Proc Natl Acad Sci. 2012;109:8872–7.
Article CAS PubMed PubMed Central Google Scholar
Deng M, Li D, Luo J, **ao Y, Liu H, Pan Q, et al. The genetic architecture of amino acids dissection by association and linkage analysis in maize. Plant Biotechnol J. 2017;15:1250–63.
Article CAS PubMed PubMed Central Google Scholar
Wu L, Han L, Li Q, Wang G, Zhang H, Li L. Using interactome big data to crack genetic mysteries and enhance future crop breeding. Mol Plant. 2021;14:77–94.
Article CAS PubMed Google Scholar
Shao X, Lu X, Liao J, Chen H, Fan X. New avenues for systematically inferring cell-cell communication: through single-cell transcriptomics data. Protein Cell. 2020;11:866–80.
Article CAS PubMed PubMed Central Google Scholar
Armstrong J, Hickey G, Diekhans M, Fiddes IT, Novak AM, Deran A, et al. Progressive cactus is a multiple-genome aligner for the thousand-genome era. Nature. 2020;587:246–51.
Article CAS PubMed PubMed Central Google Scholar
BIG: Deep DNA resequencing of the association map** panel. NCBI. https://www.ncbi.nlm.nih.gov/bioproject/PRJNA531553 (2019).
HZAU: Zea mays raw sequence reads (teosinte). NCBI. https://www.ncbi.nlm.nih.gov/bioproject/PRJNA641489 (2020).
ISU: Zea mays subsp. mays Raw sequence reads. NCBI. https://www.ncbi.nlm.nih.gov/bioproject/PRJNA300309 (2015).
CAAS: Zea mays transcriptome or gene expression. NCBI. https://www.ncbi.nlm.nih.gov/bioproject/PRJNA208608 (2013).
HZAU: Zea mays raw sequence reads (RNA-seq). NCBI. https://www.ncbi.nlm.nih.gov/bioproject/PRJNA645739 (2020).
MaizeGDB: The NAM parents sequencing project. MaizeGDB. https://maizegdb.org/NAM_project (2021).
Gui S: The pan-Zea genome assembly and annotation CNGBdb. (2022). https://db.cngb.org/search/project/CNP0002007.
Gui S: The pan-Zea_genome supporting data. FigShare. https://doi.org/10.6084/m9.figshare.19097447.v1 (2022).
Gui S: The ZEAMAP database: Pan-Zea genomic data and GWAS summary data. ZEAMAP. https://ftp.cngb.org/pub/CNSA/data3/CNP0001565/zeamap/06_Pangenome/ (2022).
Gui S: The pan-Zea genome project. CNCB. https://ngdc.cncb.ac.cn/bioproject/browse/PRJCA010654 (2022).
Gui S: pan-Zea genome construction pipeline. Github. https://github.com/songtaogui/pan-Zea_construct/tree/v1.0.0 (2021).
Gui S: pan-Zea SV calling and genoty** scripts. Github. https://github.com/songtaogui/pan-Zea_construct/tree/v1.0.0 (2021).
Gui S: pan-Zea miscellaneous analysis scripts. Github. https://github.com/songtaogui/pan-Zea_utilities/tree/v1.0.0 (2021).
Bushnell B. BBTools: a suite of fast, multithreaded bioinformatics tools designed for analysis of DNA and RNA sequence data. Joint Genome Institute. 2018.
Peng Y, Leung HC, Yiu S-M, Chin FY. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics. 2012;28:1420–8.
Article CAS PubMed Google Scholar
Mikheenko A, Prjibelski A, Saveliev V, Antipov D, Gurevich A. Versatile genome assembly evaluation with QUAST-LG. Bioinformatics. 2018;34:i142–50.
Article CAS PubMed PubMed Central Google Scholar
Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34:3094–100.
Article CAS PubMed PubMed Central Google Scholar
Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. ar**v 2013:1303.3997. https://ui.adsabs.harvard.edu/abs/2013ar**v1303.3997L.
Kehr B, Melsted P, Halldórsson BV. PopIns: population-scale detection of novel sequence insertions. Bioinformatics. 2016;32:961–7.
Article CAS PubMed Google Scholar
Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012;28:3150–2.
Article CAS PubMed PubMed Central Google Scholar
Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol. 2011;29:644–52.
Article CAS PubMed PubMed Central Google Scholar
Consortium TU. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res. 2018;47:D506–15.
Article CAS Google Scholar
Buchfink B, **e C, Huson DH. Fast and sensitive protein alignment using DIAMOND. Nat Methods. 2015;12:59–60.
Article CAS PubMed Google Scholar
Slater GSC, Birney E. Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics. 2005;6:1–11.
Article CAS Google Scholar
Haas BJ, Delcher AL, Mount SM, Wortman JR, Smith RK Jr, Hannick LI, et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 2003;31:5654–66.
Article CAS PubMed PubMed Central Google Scholar
Bray NL, Pimentel H, Melsted P, Pachter L. Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol. 2016;34:525–7.
Article CAS PubMed Google Scholar
Stanke M, Diekhans M, Baertsch R, Haussler D. Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics. 2008;24:637–44.
Article CAS PubMed Google Scholar
Schattner P, Brooks AN, Lowe TM. The tRNAscan-SE, snoscan and snoGPS web servers for the detection of tRNAs and snoRNAs. Nucleic Acids Res. 2005;33:W686–9.
Article CAS PubMed PubMed Central Google Scholar
Haas BJ, Salzberg SL, Zhu W, Pertea M, Allen JE, Orvis J, et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the program to assemble spliced alignments. Genome Biol. 2008;9:1–22.
Article CAS Google Scholar
Jones P, Binns D, Chang H-Y, Fraser M, Li W, McAnulla C, et al. InterProScan 5: genome-scale protein function classification. Bioinformatics. 2014;30:1236–40.
Article CAS PubMed PubMed Central Google Scholar
Conesa A, Götz S, García-Gómez JM, Terol J, Talón M, Robles M. Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics. 2005;21:3674–6.
Article CAS PubMed Google Scholar
Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, et al. Pfam: the protein families database. Nucleic Acids Res. 2014;42:D222–30.
Article CAS PubMed Google Scholar
Potter SC, Luciani A, Eddy SR, Park Y, Lopez R, Finn RD. HMMER web server: 2018 update. Nucleic Acids Res. 2018;46:W200–4.
Article CAS PubMed PubMed Central Google Scholar
Huerta-Cepas J, Szklarczyk D, Forslund K, Cook H, Heller D, Walter MC, et al. eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences. Nucleic Acids Res. 2015;44:D286–93.
Article PubMed PubMed Central CAS Google Scholar
Huerta-Cepas J, Forslund K, Coelho LP, Szklarczyk D, Jensen LJ, von Mering C, et al. Fast genome-wide functional annotation through orthology assignment by eggNOG-mapper. Mol Biol Evol. 2017;34:2115–22.
Article CAS PubMed PubMed Central Google Scholar
Rawlings ND, Barrett AJ, Thomas PD, Huang X, Bateman A, Finn RD. The MEROPS database of proteolytic enzymes, their substrates and inhibitors in 2017 and a comparison with peptidases in the PANTHER database. Nucleic Acids Res. 2017;46:D624–32.
Article PubMed Central CAS Google Scholar
Simao FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 2015;31:3210–2.
Article CAS PubMed Google Scholar
Golicz AA, Martinez PA, Zander M, Patel DA, Van De Wouw AP, Visendi P, et al. Gene loss in the fungal canola pathogen Leptosphaeria maculans. Funct Integr Genom. 2015;15:189–96.
Article CAS Google Scholar
Hu Z, Sun C. Lu K-c, Chu X, Zhao Y, Lu J, Shi J, Wei C: EUPAN enables pan-genome studies of a large number of eukaryotic genomes. Bioinformatics. 2017;33:2408–9.
Article CAS PubMed Google Scholar
Ihaka R, Gentleman R. R: a language for data analysis and graphics. J Comput Graph Stat. 1996;5:299–314.
Google Scholar
Tang J, Liu J, Zhang M, Mei Q. Visualizing large-scale and high-dimensional data. In proceedings of the 25th international conference on world wide web; 2016. p. 287–97.
Book Google Scholar
Zhang YE, Vibranovski MD, Landback P, Marais GA, Long M. Chromosomal redistribution of male-biased genes in mammalian evolution with two bursts of gene gain on the X chromosome. PLoS Biol. 2010;8:e1000494.
Article PubMed PubMed Central CAS Google Scholar
Li H. A statistical framework for SNP calling, mutation discovery, association map** and population genetical parameter estimation from sequencing data. Bioinformatics. 2011;27:2987–93.
Article CAS PubMed PubMed Central Google Scholar
Nelson CW, Moncla LH, Hughes AL. SNPGenie: estimating evolutionary parameters to detect natural selection using pooled next-generation sequencing data. Bioinformatics. 2015;31:3709–11.
CAS PubMed PubMed Central Google Scholar
Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, et al. The variant call format and VCFtools. Bioinformatics. 2011;27:2156–8.
Article CAS PubMed PubMed Central Google Scholar
Freed D, Aldana R, Weber JA, Edwards JS. The Sentieon genomics tools-a fast and accurate solution to variant calling from next-generation sequence data. bioRxiv. 2017:115717. http://biorxiv.org/content/early/2017/05/12/115717.abstract.
Jeffares DC, Jolly C, Hoti M, Speed D, Shaw L, Rallis C, et al. Transient structural variations have strong effects on quantitative traits and reproductive isolation in fission yeast. Nat Commun. 2017;8:14061.
Article CAS PubMed PubMed Central Google Scholar
Tarasov A, Vilella AJ, Cuppen E, Nijman IJ, Prins P. Sambamba: fast processing of NGS alignment formats. Bioinformatics. 2015;31:2032–4.
Article CAS PubMed PubMed Central Google Scholar
Rausch T, Zichner T, Schlattl A, Stütz AM, Benes V, Korbel JO. DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics. 2012;28:i333–9.
Article CAS PubMed PubMed Central Google Scholar
Cameron DL, Schröder J, Penington JS, Do H, Molania R, Dobrovic A, et al. GRIDSS: sensitive and specific genomic rearrangement detection using positional de Bruijn graph assembly. Genome Res. 2017;27:2050–60.
Article CAS PubMed PubMed Central Google Scholar
Chen X, Schulz-Trieglaff O, Shaw R, Barnes B, Schlesinger F, Källberg M, et al. Manta: rapid detection of structural variants and indels for germline and cancer sequencing applications. Bioinformatics. 2016;32:1220–2.
Article CAS PubMed Google Scholar
Sibbesen JA, Maretty L, Krogh A. The Danish Pan-genome C: accurate genoty** across variant classes and lengths using variant graphs. Nat Genet. 2018;50:1054–9.
Article CAS PubMed Google Scholar
Chiang C, Layer RM, Faust GG, Lindberg MR, Rose DB, Garrison EP, et al. SpeedSeq: ultra-fast personal genome analysis and interpretation. Nat Methods. 2015;12:966–8.
Article CAS PubMed PubMed Central Google Scholar
Kokot M, Długosz M, Deorowicz S. KMC 3: counting and manipulating k-mer statistics. Bioinformatics. 2017;33:2759–61.
Article CAS PubMed Google Scholar
McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GR, Thormann A, et al. The ensembl variant effect predictor. Genome Biol. 2016;17:122.
Article PubMed PubMed Central CAS Google Scholar
Anderson SN, Stitzer MC, Brohammer AB, Zhou P, Noshay JM, O'Connor CH, et al. Transposable elements contribute to dynamic genome content in maize. Plant J. 2019;100:1052–65.
Article CAS PubMed Google Scholar
Stuart T, Eichten SR, Cahn J, Karpievitch YV, Borevitz JO, Lister R. Population scale map** of transposable element diversity reveals links to gene regulation and epigenomic variation. Elife. 2016;5:e20777.
Article PubMed PubMed Central Google Scholar
Garrison E, Sirén J, Novak AM, Hickey G, Eizenga JM, Dawson ET, et al. Variation graph toolkit improves read map** by representing genetic variation in the reference. Nat Biotechnol. 2018;36:875–9.
Article CAS PubMed PubMed Central Google Scholar
Rautiainen M, Marschall T. GraphAligner: rapid and versatile sequence-to-graph alignment. Genome Biol. 2020;21:1–28.
Article Google Scholar
Yang N, Lu Y, Yang X, Huang J, Zhou Y, Ali F, et al. Genome wide association studies using a new nonparametric model reveal the genetic architecture of 17 agronomic traits in an enlarged maize association panel. PLoS Genet. 2014;10:e1004573.
Article PubMed PubMed Central CAS Google Scholar
Liu J, Huang J, Guo H, Lan L, Wang H, Xu Y, et al. The conserved and unique genetic architecture of kernel size and weight in maize and rice. Plant Physiol. 2017;175:774–85.
Article PubMed PubMed Central CAS Google Scholar
Li W, Yu Y, Wang L, Luo Y, Peng Y, Xu Y, et al. The genetic architecture of the dynamic changes in grain moisture in maize. Plant Biotechnol J. 2021;19:1195–205.
Article CAS PubMed PubMed Central Google Scholar
** M, Liu X, Jia W, Liu H, Li W, Peng Y, et al. ZmCOL3, a CCT gene represses flowering in maize by interfering with the circadian clock and activating expression of ZmCCT. J Integr Plant Biol. 2018;60:465–80.
Article CAS PubMed Google Scholar
Li Z, Chen J, Han L, Wen J, Chen G, Li H, et al. Association map** resolving the major loci Scmv2 conferring resistance to sugarcane mosaic virus in maize. Eur J Plant Pathol. 2016;145:385–91.
Article CAS Google Scholar
Chen G, Wang X, Hao J, Yan J, Ding J. Genome-wide association implicates candidate genes conferring resistance to maize rough dwarf disease in maize. PLoS One. 2015;10:e0142001.
Article PubMed PubMed Central CAS Google Scholar
Kim S, Chen J, Cheng T, Gindulyte A, He J, He S, et al. PubChem in 2021: new data content and improved web interfaces. Nucleic Acids Res. 2020;49:D1388–95.
Article PubMed Central CAS Google Scholar
Browning SR, Browning BL. Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am J Hum Genet. 2007;81:1084–97.
Article CAS PubMed PubMed Central Google Scholar
Meyer PE, Meyer MPE. Package ‘infotheo’. R Package. 2009;1.
Speed D, Hemani G, Johnson MR, Balding DJ. Improved heritability estimation from genome-wide SNPs. Am J Hum Genet. 2012;91:1011–21.
Article CAS PubMed PubMed Central Google Scholar
Bradbury PJ, Zhang Z, Kroon DE, Casstevens TM, Ramdoss Y, Buckler ES. TASSEL: software for association map** of complex traits in diverse samples. Bioinformatics. 2007;23:2633–5.
Article CAS PubMed Google Scholar
Haller T, Tasa T, Metspalu A. Manhattan harvester and cropper: a system for GWAS peak detection. BMC Bioinformatics. 2019;20:1–8.
Article Google Scholar
Lee Y, Luca F, Pique-Regi R, Wen X. Bayesian multi-SNP genetic association analysis: control of FDR and use of summary statistics. bioRxiv. 2018:316471. http://biorxiv.org/content/early/2018/05/08/316471.abstract.
Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81:559–75.
Article CAS PubMed PubMed Central Google Scholar
Wang Y, Li W, Wang L, Yan J, Lu G, Yang N, et al. Three types of genes underlying the Gametophyte factor1 locus cause unilateral cross incompatibility in maize. Nat Commun. 2022;13(1):4498. https://doi.org/10.1038/s41467-022-32180-9.
Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32:1792–7.
Article CAS PubMed PubMed Central Google Scholar
Letunic I, Bork P. Interactive tree of life (iTOL) v5: an online tool for phylogenetic tree display and annotation. Nucleic Acids Res. 2021;49:W293–6.
Article CAS PubMed PubMed Central Google Scholar
Luo J, Wei C, Liu H, Cheng S, **ao Y, Wang X, et al. MaizeCUBIC: a comprehensive variation database for a maize synthetic population. Database. 2020;2020.
Lescot M, Déhais P, Thijs G, Marchal K, Moreau Y, Van de Peer Y, et al. PlantCARE, a database of plant cis-acting regulatory elements and a portal to tools for in silico analysis of promoter sequences. Nucleic Acids Res. 2002;30:325–7.
Article CAS PubMed PubMed Central Google Scholar
Luo Y, Zhang M, Liu Y, Liu J, Li W, Chen G, et al. Genetic variation in YIGE1 contributes to ear length and grain yield in maize. New Phytol. 2022;234(2):513–26. https://doi.org/10.1111/nph.17882.
Klopfenstein DV, Zhang L, Pedersen BS, Ramírez F, Warwick Vesztrocy A, Naldi A, et al. GOATOOLS: a python library for gene ontology analyses. Sci Rep. 2018;8:10872.
Article CAS PubMed PubMed Central Google Scholar
Hothorn T, Hornik K, Van De Wiel MA, Zeileis A. A lego system for conditional inference. Am Stat. 2006;60:257–63.
Article Google Scholar
Gu Z, Gu L, Eils R, Schlesner M, Brors B. Circlize implements and enhances circular visualization in R. Bioinformatics. 2014;30:2811–2.
Article CAS PubMed Google Scholar
Wickham H. ggplot2. Wiley interdisciplinary reviews. Comput Stat. 2011;3:180–5.
Article Google Scholar
Kolde R, Kolde MR. Package ‘pheatmap’. R package. 2015;1:790.
Google Scholar
Wilke CO, Wickham H, Wilke MCO: Package ‘cowplot’. R Package. 2019.
Gui S: pan-Zea genome construction pipeline. Github. https://doi.org/10.5281/zenodo.6864803 (2021).
Gui S: pan-Zea SV calling and genoty** scripts. Github. https://doi.org/10.5281/zenodo.6864803 (2021).
Gui S. Pan-Zea miscellaneous analysis scripts. Github. https://doi.org/10.5281/zenodo.6864805 (2021).
Zhang Z, Zhang B, Chen Z, Zhang D, Zhang H, Wang H, Zhang Ye, Cai D, Liu J, ** studies. PLoS Genet. 2014;10:e1004722.
Biłas, Szafran K, Hnatuszko-Konka K, Kononowicz AK. Cis-regulatory elements used to control gene expression in plants. Plant Cell Tissue Organ Culture (PCTOC). 2016;127:269–87.

Download references

Acknowledgements

The authors would thank Dr. **n Yang and Mr. Yabing Zhu (Bei**g Genomics Institute-Shenzhen) for their help in the functional annotations of the pan-Zea genome. The authors would also thank Dr. Yingjie **ao, Dr. Yuanyan Chen, Dr. Gengshen Chen, and Mr. Gang Lu (National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University) for providing valuable advices. The computations in this paper were run on the bioinformatics computing platform of the National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University.

Review history

The review history is available as Additional file 13.

Peer review information

Wen**g She was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Funding

This research was supported by the National Key Research and Development Program of China (2020YFE0202300) and National Natural Science Foundation of China (U1901201; 32000396).

Author information

Authors and Affiliations

National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan, 430070, China
Songtao Gui, Wenjie Wei, Chenglin Jiang, **gyun Luo, Lu Chen, Shenshen Wu, Wenqiang Li, Yuebin Wang, Shuyan Li, Ning Yang, Qing Li & Jianbing Yan
Hubei Hongshan Laboratory, Wuhan, 430070, China
Ning Yang, Qing Li & Jianbing Yan
Department of Molecular Physiology, Max-Planck-Institute of Molecular Plant Physiology, Am Mühlenberg 1, 14476, Potsdam, Golm, Germany
Alisdair R. Fernie

Authors

Songtao Gui
View author publications
You can also search for this author in PubMed Google Scholar
Wenjie Wei
View author publications
You can also search for this author in PubMed Google Scholar
Chenglin Jiang
View author publications
You can also search for this author in PubMed Google Scholar
**gyun Luo
View author publications
You can also search for this author in PubMed Google Scholar
Lu Chen
View author publications
You can also search for this author in PubMed Google Scholar
Shenshen Wu
View author publications
You can also search for this author in PubMed Google Scholar
Wenqiang Li
View author publications
You can also search for this author in PubMed Google Scholar
Yuebin Wang
View author publications
You can also search for this author in PubMed Google Scholar
Shuyan Li
View author publications
You can also search for this author in PubMed Google Scholar
Ning Yang
View author publications
You can also search for this author in PubMed Google Scholar
Qing Li
View author publications
You can also search for this author in PubMed Google Scholar
Alisdair R. Fernie
View author publications
You can also search for this author in PubMed Google Scholar
Jianbing Yan
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

Conceptualization, S.G., N. Y., and J. Y.; methodology, S.G., W.W., and C.J.; software, S.G. and W.W.; resources, J.L., L.C., W.L., and Y.W.; formal analysis, S.G., W.W., C.J., S.W., and S.L.; writing — original draft, S.G.; writing — review and editing, S.G., N.Y., Q.L., A.R.F., and J.Y.; funding acquisition, S.G. and J.Y.; supervision, J.Y. The authors read and approved the final manuscript.

Corresponding author

Correspondence to Jianbing Yan.

Ethics declarations

Ethics approval and consent to participate

Not applicable

Consent for publication

Not applicable

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

13059_2022_2742_MOESM1_ESM.pdf

Additional file 1: Figure S1. Statistics of pan-Zea NGS de novo assemblies. Figure S2. Schematic diagram illustrated the filtering of assembly-to-assembly alignment. Figure S3. Sketch of the non-reference sequence anchoring and clustering pipeline. Figure S4. Pan-Zea gene annotation and functional annotation pipeline. Figure S5. Statistics of the proportions of functionally annotated genes. Figure S6. Validation of the gene present and absent variations (gPAVs). Figure S7. PCA and SNP LD rank analysis of gPAV. Figure S8. Core and dispensable genes and ortholog groups. Figure S9. KEGG pathways of each dispensable ortholog group cluster. Figure S10. The KEGG pathways enriched in maize concentrated ortholog groups when compared with teosinte concentrated ortholog groups. Figure S11. Distribution of sub-population enriched genes and ortholog groups. Figure S12. The sketch of SV calling and genoty** pipeline. Figure S13. Additional features of the maize genetic variation map. Figure S14. Distribution of the associated QTLs and causal variations along the chromosome. Figure S15. Distribution of the features of associated QTLs, genes and causal variations. Figure S16. Multiple sequence alignment of the PME genes in the Ga1 locus. Figure S17. Multiple sequence alignments of the protein sequences of pan-Zea PME genes. Figure S18. Statistics of candidate causal variants. Figure S19. Enrichment of causal variants features within different omics trait classes. Figure S20. Enrichment of cis-eQTL causal variants along different distance to TSS. Figure S21. Expression patterns of the candidate genes of the example SV-QTLs. Figure S22. Blast graphic summary of PZ00001aSV02097079INS. Figure S23. Distribution of the Zm00001d023299 expression patterns in the tenth leaf of V9 stage in the CUBIC offspring population related to the haplotypes flanking 500Kb of Zm00001d023299. Figure S24. The diagram of vectors used in the luciferase reporter assays of Zm00001d023299 promoter with (A) or without (B) the predicted ABRE.

13059_2022_2742_MOESM2_ESM.xlsx

Additional file 2: Table S1. Estimating WGS de novo assemblies by whole genome comparison between WGS de novo assemblies and reference genome sequences of B73, SK, Mo17 and HZS.

Additional file 3: Table S2. Public genome assemblies used in the pan-Zea genome construction.

Additional file 4: Table S3. Statistics of pan-Zea gene annotations

Additional file 5: Tables S4. Pan Zea gene features.

Additional file 6: Table S5. Enrichment of molecular function GO items for each gene PAV cluster.

Additional file 7: Table S6. Enrichment of KEGG ontology for each gene PAV cluster.

Additional file 8: Table S7. Statistics of the complex and molecular trait variation data in this study.

Additional file 9: Table S8. Information of QTLs that cannot be detected by SNPs.

13059_2022_2742_MOESM10_ESM.xlsx

Additional file 10: Table S9. Primers used in the luciferase reporter assay of the activity of the predicted ABRE positioned upstream of Zm00001d023299.

Additional file 11: Supplementary Text S1.

13059_2022_2742_MOESM12_ESM.pdf

Additional file 12: Supplementary Materials and Methods [88,89,90,91,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159].

Additional file 13. Review history.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article

Gui, S., Wei, W., Jiang, C. et al. A pan-Zea genome map for enhancing maize improvement. Genome Biol 23, 178 (2022). https://doi.org/10.1186/s13059-022-02742-7

Download citation

Received: 11 March 2022
Accepted: 27 July 2022
Published: 23 August 2022
DOI: https://doi.org/10.1186/s13059-022-02742-7

A pan-Zea genome map for enhancing maize improvement

Abstract

Background

Results

Conclusions

Similar content being viewed by others

Background

The presence/absence patterns of pan-Zea genes and the orthologous groups

The impact of the pan-Zea genome and structural variations on maize phenotypic variations

Discussion

Community resources to accelerate maize molecular breeding

Methods

Availability of data and materials

References

Acknowledgements

Review history

Peer review information

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

Additional information

Publisher’s Note

Supplementary Information

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation