Main

The Brassica genus contains a diverse range of oilseed and vegetable crops important for human nutrition1. Crops of particular agricultural importance include three diploid species, Brassica rapa (AA), Brassica nigra (BB) and Brassica oleracea (CC), and three allopolyploid species, B. napus (AACC), B. juncea (AABB) and Brassica carinata (BBCC). The evolutionary relationships among these Brassica species are described by what is called the 'triangle of U' model2, which proposes how the genomes of the three ancestral Brassica species, B. rapa, B. nigra and Brassica oleracae, combined to give rise to the allopolyploid species of this genus. B. juncea formed by hybridization between the diploid ancestors of B. rapa and B. nigra, followed by spontaneous chromosome doubling. Subsequent diversifying selection then gave rise to the vegetable- and oil-use subvarieties of B. juncea. These subvarieties include vegetable and oilseed mustard in China, oilseed crops in India, canola crops in Canada and Australia, and condiment crops in Europe and other regions3. Cultivation of B. juncea began in China about 6,000 to 7,000 years ago

Results

Genome assembly, scaffold anchoring and annotation

To distinguish among subgenomes in Brassica species, we redesignated the subgenomes in Brassica23 as follows: B. rapa as BraA; B. nigra as BniB; B. oleracea as BolC; B. juncea A subgenome as BjuA and B subgenome as BjuB; and B. napus A subgenome as BnaA and C subgenome as BnaC.

We selected an advanced generation inbred line of B. juncea var. tumida (variety T84−66) for whole-genome sequencing. We estimated the size of the T84-66 genome at 922 Mb by flow cytometry (Supplementary Fig. 1 and Supplementary Table 1). We assembled the T84-66 genome using 176× Illumina shotgun reads and 12× PacBio single-molecule long reads (Supplementary Table 2a,b and Supplementary Fig. 2). The assembly spanned 784 Mb, 85% of the 922 Mb estimated by flow cytometry (Supplementary Table 3). The contig N50 value was 61 kb, and the scaffold N50 was 855 kb (Supplementary Table 3).

We collected 996,648 BioNano DNA molecules over 150 kb, which corresponds to 222 equivalents of the genome, the average of which exceeded 2 Mb in size (Supplementary Table 4). The genome map assembled de novo consisted of 922 constituent genome maps with average length of 1.19 Mb and N50 of 1.84 M (Supplementary Table 4). We used these assemblies to correct the genome assembly above (Supplementary Fig. 3). The final assembly by the BioNano approach spanned 955 Mb, and the scaffold N50 was 1.5 Mb (Supplementary Table 3). We constructed a high-resolution genetic map with 5,333 bin markers and 18 pseudo-chromosomes (10A and 8B subgenomes; Supplementary Tables 5 and 6). We then integrated a published B. juncea genetic map24 (Supplementary Table 7). Finally, we anchored 91.5% and 72.3% of A- and B-subgenome assembly sequences onto the 10 and the 8 pseudo-chromosomes, respectively (Supplementary Table 8a and Supplementary Fig. 4). We sorted the B. juncea chromosomes into the 402.1 Mb BjuA and 547.5 Mb BjuB subgenomes based on this assembly (Supplementary Table 9).

We also sequenced the genome of a doubled haploid line of B. nigra (YZ12151) for comparative genomic study. We assembled a collection of 96× Illumina shotgun reads to generate a 396.9 Mb genome sequence for B. nigra, with a scaffold N50 of 557.3 kb, and 68% of the estimated 591 Mb B. nigra genome (Supplementary Tables 10 and 11, and Supplementary Fig. 5). We anchored the 66% scaffolds into pseudo-chromosomes for B. nigra, referring to the BjuB genetic map (Supplementary Table 8b).

To validate the genome assembly, we used subreads from PacBio, of which 10 subreads had more than 99.4% coverage and 92.3% identity, on average, with the assembled genome (Supplementary Table 12). We aligned 15 published bacterial artificial chromosomes (BACs) from B. nigra to the B. nigra genome assembly, and observed over 98.5% coverage and 99.8% identity on average to BAC clones (Supplementary Table 13 and Supplementary Fig. 6). We BLAST-aligned 458 core eukaryotic genes (Cluster of Essential Genes (CEG) database)25 to the genome assembly with core eukaryotic genes map** approach (CEGMA) pipeline26, which showed high-confidence hits of 453 (98.8%) and 458 (100%) CEG proteins for all 458 essential genes in CEG with full length (>70% alignment) in the genome of B. juncea and B. nigra, respectively (Supplementary Table 14a). We validated the assembled genomes by matching expressed sequence tags (ESTs) downloaded from the US National Center for Biotechnology Information (NCBI) database, which indicated that 98.9% and 98.2% ESTs were supported by the assembled genomes of B. juncea and B. nigra (>50% alignments), respectively (Supplementary Table 14b).

We identified and compared repetitive sequences from syntenic regions of these genomes. We identified 316.1 Mb of repetitive sequence from the B. juncea genome, 131.2 Mb from BjuA and 216.5 Mb from BjuB (Supplementary Table 15). Long terminal repeats (LTRs) are the predominant transposable element (TE) family identified in all sequenced Brassica genomes6,7. Copia- and Gypsy-type LTRs represent the two most abundant TE subfamilies. Using repetitive sequence from syntenic regions, we found that they constituted a similar percentage of all TEs in the BjuA and BjuB, and their respective ancestral genomes (Supplementary Fig. 7a). We observed similar repetitive sequence contributions in B. napus (Supplementary Fig. 7b). We identified TEs in the B. juncea and B. napus subgenomes that were newly formed after divergence from each ancestral genome (Supplementary Fig. 8 and Supplementary Table 16a). We confirmed five randomly selected newly formed TEs by PCR amplification from B. rapa, B. nigra and B. juncea (Supplementary Fig. 9). These newly formed TEs showed similar distribution and percentage between the B. juncea and B. napus subgenomes, and their respective ancestral genomes (Supplementary Table 16b and Supplementary Fig. 10a,b). We observed 310 newly formed TEs to be active between the subgenomes of B. juncea, a much larger number than the 41 newly formed TEs being found active between the subgenomes of B. napus (Supplementary Table 17).

We annotated 80,050 and 49,826 protein-coding genes in the B. juncea and B. nigra genomes, respectively (Supplementary Table 18). Approximately 97.8% of B. juncea genes and 94.7% of B. nigra genes could be annotated by non-redundant nucleotide and protein sequences in the NCBI, Cluster of Orthologous Groups (COG), Gene Ontology (GO), SWISS-PROT and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases (Supplementary Table 19). Using transcriptomes of B. juncea we validated gene predictions of assembled genomes, verifying all predictions (Supplementary Table 20a,b). Additionally, we identified noncoding RNAs, consisting of 21 long noncoding RNAs, 3,725 small RNAs, 2,638 tRNAs, 511 rRNAs, 1,402 microRNAs and 15,418 small nuclear RNAs, from the B. juncea genome (Supplementary Table 21).

We extracted 28,228 and 28,917 syntenic ortholog gene pairs from the B. juncea subgenomes and their ancestral genomes to identify gene loss during the speciation process8. In total, we identified 562 and 545 genes lost from BjuA and BjuB, respectively, relative to their common ancestral genomes. This represents a higher percentage than the gene loss estimates for BnaA and BnaC, relative to their common ancestral genomes (Supplementary Table 22). We validated gene loss using PCR amplifications (Supplementary Fig. 11). Gene loss numbers of B. juncea and B. napus were consistent with their formation times. The identified genes lost in the B. juncea subgenomes of BjuA and BjuB are involved in different functions based on Gene Ontology (Supplementary Fig. 12a,b). We mapped the distributions of genes, repetitive sequences, gene loss, pseudogenes, genome markers and genetic markers of the B. juncea subgenomes (Fig. 1).

Figure 1: The genome of B. juncea vegetable-use variety T84-66.
figure 1

The B. juncea genome comprises 10 chromosomes belonging to BjuA (J01−J10; right semicircle) and 8 chromosomes belonging to BjuB (J11−J18; left semicircle), scaled on the basis of their assembled length. Homeologous relationships between BjuA and BjuB chromosomes are displayed with connecting lines colored according to the BjuB subgenome. The tracks, from outer to inner, show gene density (non-overlap**, window size = 500 kb), repetitive sequence density (window size = 500 kb), the location of gene loss (blue solid point for gene loss), the location of pseudogenes (solid line for pseudogenes), genome (optical) marker density (window size = 500 kb) and genetic marker density (window size = 500 kb).

Comparison of A subgenomes in Brassica

Synteny analysis among three A subgenomes of Brassica showed strong co-linearity, although chromosomal rearrangements have occurred between BjuA and BraA after their divergence from the common B. rapa ancestor (Fig. 2a and Supplementary Table 23). To study the divergence of BjuA and BnaA, we assayed single-nucleotide polymorphisms (SNPs) from the resequencing of A subgenomes from eighteen B. juncea accessions including the B. juncea reference sequence (Supplementary Table 24), five B. napus accessions including the B. napus reference sequence8, and 27 B. rapa accessions, including the B. rapa reference sequence6 that covers most subspecies of B. rapa (Supplementary Table 25). We constructed a neighbor-joining tree for A subgenomes in Brassica, and discovered that BjuA and BnaA had divergent origins (Fig. 2b). BjuA might derive from B. rapa ssp. tricolaris, which is distributed in Asia, whereas BnaA might derive from B. rapa ssp. rapa (European turnip), which is widely distributed in Europe (Fig. 2b). This discovery indicates that allopolyploids B. juncea and B. napus have independent geographical origins, deriving from Asian and European regions, respectively.

Figure 2: Synteny and phylogenetic evolution analysis of three A-subgenome in Brassica species.
figure 2

(a) Schematic representation of synteny among B. rapa (BraA), A subgenome of B. juncea (BjuA) and B. napus (BnaA). Each line connects a pair of orthologous genes between genomes or subgenomes. (b) A phylogenetic neighbor-joining tree constructed from BjuA, BnaA, and resequencing of A subgenomes of 17 B. juncea, 5 B. napus and 27 B. rapa accessions. Vegetable- and oil-use subvarieties of B. juncea are marked with black and rose lines. (c) An ultrametric tree constructed from A subgenomes of B. juncea (BjuA) and B. napus (BnaA); two A subgenomes from the resequencing of B. juncea (BjuA_R16) and B. rapa ssp. tricolaris; and two A subgenomes from the resequencing of B. napus (BnaA) and B. rapa ssp. rapa (Europe). B. oleracea (BolC) is considered as an outgroup. Numbers in parentheses indicate the divergence time interval (Mya). The two A subgenomes from the resequencing of B. juncea (BjuA) and B. rapa (ssp. tricolaris) correspond to the two red and bold branches in b. The two A subgenomes from the resequencing of B. napus (BnaA_R1) and B. rapa ssp. rapa (Europe) correspond to the two blue bold branches in b.

Furthermore, we found that all A subgenomes from B. juncea were rooted in the common ancestor, and evolved into different subvarieties for vegetable or oil use (Fig. 2b). Principal component analysis displayed that vegetable- and oil-use subvarieties of B. juncea were distributed nearby B. rapa ssp. tricolaris group and far from other subspecies of B. rapa, supporting the ancestor being closer to B. rapa ssp. tricolaris (Supplementary Fig. 13). Using the independent origin of A subgenomes in B. napus and B. juncea as a control, we compared the SNP variation characteristics between BjuA and BnaA, and that between A subgenomes of vegetable- and oil-use subvarieties of B. juncea. We found typical SNP polyphyletic origin pattern between BjuA and BnaA, and typical SNP monophyletic origin pattern for A subgenomes of vegetable- and oil-use subvarieties in B. juncea (Supplementary Fig. 14). In total, the results drawn from the phylogenetic tree, principal component analysis and SNP variation patterns point to a monophyletic origin and evolution into vegetable- and oil-use subvarieties for A subgenomes of B. juncea.

To estimate when B. juncea formed, we found that the synonymous nucleotide substitution rate was not accurate for estimating formation time of the post-neopolyploid species (Supplementary Fig. 15 and Supplementary Table 26a,b). We therefore used phylogenetic analysis and Bayesian method38 (Supplementary Tables 12 and 14a,b, and Supplementary Fig. 6). Additional details are available in the Supplementary Note.

Genetic map and pseudo-chromosome construction.

We constructed a reference genetic map of B. juncea based on genoty** by resequencing of 100 individuals of F2 population39 (Supplementary Table 5). After resequencing reads alignment with BWA40, potential SNPs were identified by GATK v3.4 (ref. 41). Pairwise recombination of this marker set on each scaffold was calculated, of which adjacent SNPs with pairwise recombination rate less than 0.001 were lumped into a genetic bin, excluding bins showing significantly distorted segregation (chi-squared test, P < 0.01). A final set of bin markers was grouped to 18 linkage groups using Highmap42 (Supplementary Table 8a).

ALLMAPS43 was used to construct the initial pseudo-chromosomes of B. juncea from scaffolds using the genetic map (T84/DTC) constructed in the present study being integrated with a published genetic map (SY/PM)23. We sorted BjuA and BjuB subgenomes of B. juncea referred to the final genetic map (Supplementary Table 9). Additional details are available in the Supplementary Note.

Genome annotation.

The repetitive sequences of the B. juncea genome were identified with a combination of de novo and homolog strategies. Four de novo programs including RepeatScout44, LTR-FINDER45, MITE46 and PILER47 were used to generate the initial repeat library. The initial repeat database was classified into classes, subclasses, superfamilies and families by the PASTEClassifier with REPET48. We then merged transposable element (TE) sequences of Brassica species and the Repbase database49 together to construct a new repeat database and distinguish the genome assembly repeat sequences through RepeatMasker50 (Supplementary Table 15).

Genes were annotated iteratively using three main approaches: homology-based, de novo and EST/unigenes-based. Results of these three methods were integrated by GLEAN51 to get a high-confidence gene model. An RNA-seq based method map** transcriptome data to the reference genome using TopHat and assembling transcripts with Cufflinks was adopted to obtain the gene structures and new genes52 (Supplementary Tables 18, 19 and 20a,b).

tRNAscan-SEM (version 1.23)53 was used to detect reliable tRNA positions. Noncoding RNAs were predicted by the Infernal program using default parameters54. Through comparing the similarity of secondary structure between the B. juncea sequence and Rfam (v12.0) database55, the noncoding RNAs were classified into different families (Supplementary Table 21).

Stringent criteria and strategy were used to identify new TEs for the BjuA subgenome (Supplementary Fig. 8). The same strategy was used to identify new TEs in the subgenomes of B. juncea and B. napus compared to their corresponding ancestral genome after divergence from a common ancestor (Supplementary Table 16a,b and Supplementary Note).

We performed all-against-all BLASTP (E = 1 × 10−5)56 and chained the BLASTP hits by QUOTA-ALIGN (cscore = 0.5)57 with '1:1 synteny screen' to call synteny blocks. The '1:3 synteny screen' model was used to identify synteny blocks between A. thaliana and Brassica because of whole genome triplication in Brassica evolution history6 by QUOTA-ALIGN (cscore = 0.5). All gene losses were calculated based on the Brassica ancestor common gene sets of each species. Meanwhile, we identified gene loss from other subgenomes (BniB, BjuB, BolC, BnaA and BnaC) of Brassica (Supplementary Tables 22 and 34). Additional details are available in the Supplementary Note.

Comparison of A subgenomes in Brassica.

We called SNPs from A subgenomes by resequencing of B. juncea, B. napus and B. rapa and referring to the B. rapa reference genome using BWA40, GATK41 and SAMtools58 (Supplementary Table 25). Ungenotyped SNPs were imputed by the KNN algorithms59. SNPs with minor allele frequency (MAF) > 0.05 were picked for further analysis. Only non-heterozygous SNPs with integrity > 0.6 were kept for phylogenetic tree construction. The neighbor-joining phylogenetic tree for A subgenomes in Brassica was constructed by MEGA v6.0 using the Kimura 2-parameter model with 1,000 bootstraps and default parameters60.

We selected high quality SNPs with integrity ≥ 0.8 and MAF ≥ 0.05 from all SNPs above for principal component analysis using STRATPCA program from the EIGENSOFT package61.

To compare the characteristics of the SNPs of B. juncea and B. napus, we selected six B. juncea varieties including three vegetable- and three oil-use subvarieties (CN53, CN58, CN04 and CN02, EU07, AU213, respectively) and five B. napus varieties. We only retained SNPs with full integrity (integrity = 1) for further analysis. Fixed SNPs were defined as the frequency of alleles ≥ 60% and were different from their reference genome in B. juncea and B. napus populations. Polymorphic SNPs in B. juncea population were defined as the frequency of alleles ≥ 60%, and their genoty**s were distinct from B. napus. Polymorphic SNPs in B. napus population were defined as the frequency of alleles ≥ 60% and their genoty**s were dissimilar to B. juncea. We identified fixed and polymorphic SNPs in B. juncea and B. napus populations and those between B. juncea and B. napus population based on different frequencies of alleles scaled at 60%, 70%, 80% and 90%. We used the same strategies to identify fixed and polymorphic SNPs in and between vegetable- and oil-use B. juncea. Additional details are available in the Supplementary Note.

Formation time estimation for B. juncea.

To estimate the formation time of B. juncea, we first selected BjuA, its closest relative genome from B. rapa and the earliest divergent B. juncea accession based on the phylogenetic tree of A subgenomes in Brassica. Then we reconstructed the coding region sequences for selected varieties from the resequencing data. After multiple sequence alignments by MUSCLE v3.3 (ref. 62), a phylogenetic tree was constructed and divergence time was estimated by Bayesian MCMC analyses in BEASTv1.8 (ref. 41 for 17 B. juncea accessions with default parameters and filtered out with depth < 3×. Coding region sequence sets were then reconstructed based on high quality SNPs for each sample. To detect selective pressure of each coding gene, the rates of nonsynonymous (dN) and synonymous (dS) (ω = dN/dS) substitutions were estimated site-by-site using the YN00 program with default parameters from the PAML 4.2b package58. Each paired gene set of 17 samples was estimated repeatedly. All Ka/Ks of gene pairs were classified to three categories (dominant genes, subordinate genes and neutral genes). Meanwhile All Ka/Ks of gene pairs were separated into BjuA/BjuB subgenomes. To test statistical significance of different data sets, we performed a permutation test on them with 1,000 permutations (Supplementary Table 26a,b). Additional details are available in the Supplementary Note.

Detection of selective sweep signals.

Average pairwise diversity (π) and population differentiation statistic (FST) were calculated through Bio::PopGen of bioperl package65. Selective sweep regions were identified in the 10 vegetable- and 7 oil-use B. juncea subvarieties by combining FST outliers and π ratio outliers (θπ (vegetable-use/oil-use)) with 100 kb sliding windows and 10 kb steps. Adjacent windows extended to 10 kb likely represent the effect of a single divergence region and thus were linked to define a 'candidate gene region' (Supplementary Table 31). Additional details are available in the Supplementary Note.

URLs.

PBjelly, https://sourceforge.net/projects/pb-jelly/; A. thaliana and B. rapa protein sequences, http://genome.jgi-psf.org/; PASA, http://pasapipeline.github.io/; KEGG Automatic Annotation Server; http://www.genome.jp/kegg/kaas/.

Accession codes.

The genome assemblies of B. juncea and B. nigra have been deposited at GenBank: LFQT00000000 and LFLV00000000, respectively (BioProject PRJNA285130).