Background & Summary

Common wheat (Triticum aestivum L., 2n = 6x = 42, AABBDD) is a natural amphiploid derived from the intergeneric cross between T. turgidum subsp. durum (Desf.) Husn., a cultivated allotetraploid (2n = 4x = 28, AABB), and Aegilops tauschii Coss., a diploid goat grass (2n = 2x = 14, DD). The genetic diversity among common wheat cultivars has been drastically reduced owing to bottlenecks resulting from polyploidy, domestication, and modern plant breeding. The decline in genetic diversity can be counteracted by direct hybridization between common wheat and A. tauschii1,2, or through hybridization between common wheat and synthetic hexaploid wheats (SHWs) developed by crossing tetraploid wheat and A. tauschii3.

The primary objective of the direct hybridization method is to augment genetic diversity specifically for the D genome in common wheat, addressing a crucial concern in wheat breeding, because significantly lower genetic diversity values characterize this genome compared with the A and B genomes4. However, the diminished genetic diversity resulting from the bottlenecks also affects the A and B genomes. Consequently, the utilization of SHW lines enables the diversity of all three subgenomes of common wheat to be enhanced. This approach facilitates the direct transfer of genes/loci for traits of interest from diploid and tetraploid to hexaploid wheat.

To date, the International Maize and Wheat Improvement Centre (CIMMYT) has developed more than 1200 SHW lines3. Since the introduction of more than 200 SHW accessions from CIMMYT in 1995, four SHW-derived cultivars, namely Chuanmai 38, 42, 43, and 47, have been developed and cultivated, and these have been widely used as elite parents in wheat breeding in China. Subsequently, a number of secondary SHW-derived cultivars have been developed and released, including Chuanmai 104, developed from the cross of Chuanmai 42 and Chuannong 16. Chuanmai 104 is an important high-yielding wheat cultivar grown in Southwest China in recent years. The maximum yield of Chuanmai 104 reaches 10,947 kg/ha under the humid and predominantly cloudy climate of the Sichuan Basin in Southwest China5. Chuanmai 104 is becoming a cornerstone breeding parent of wheat in China. Furthermore, China is among the main countries that are exploiting the advantages of SHW lines as genetic resources, especially in Southwest China3. The increasing utilization of SHW worldwide is indicative of the success of such an approach, which will gradually become an effective means of overcoming the bottleneck of wheat breeding.

In previous studies based on SHWs, a major limiting factor has been the scarcity of genetic resources and the lack of reference-quality pseudomolecule assemblies (RQAs)6. Chapman et al. integrated whole-genome sequencing and genetic mapping to assemble and order contigs of the SHW cultivar W79847. However, given the short reads generated by next-generation sequencing (NGS), and the lack of chromosome conformation capture sequencing or chromosome isolation via flow sorting, the assembly was only 9.1 Gb, which was substantially less than the estimated 15 Gb size of the hexaploid wheat genome6. Although single-nucleotide polymorphism (SNP) genotyping arrays are relatively simple and inexpensive, a limitation is that only the variants pre-selected for inclusion on the array can be analyzed. Consequently, if the SNP panels were designed using common wheat genome assemblies, they would lack sufficient representation of variants in the target gene pools, and thus assessment of useful variation in SHW and derivative germplasm would be challenging. More recently, reductions in cost have made RQAs and large-scale whole-genome resequencing feasible and affordable for SHWs.

In the current study, we first generated a chromosome-level assembly for Chuanmai 104 (Fig. 1), based on an integrated approach combining PacBio HiFi sequencing reads and chromosome conformation capture sequencing. The final Chuanmai 104 genome assembly consisted of 14.81 Gb with a contig N50 of 58.25 Mb, a contig N90 of 8.41 Mb, and a longest contig of 422.27 Mb (Table 1). Compared with previously published hexaploid wheat assemblies, seven of the 21 chromosomes in the Chuanmai 104 assembly were the longest (Table 2). The long terminal repeat (LTR) Assembly Index (LAI)8 of the Chuanmai 104 genome assembly was 15.17, 14.64, and 10.85 for the A, B, and D subgenomes, respectively, and the LAI values of individual chromosomes ranged from 10.21 to 15.71 (Table 3). Benchmarking universal single-copy orthologs (BUSCO) analysis yielded a completeness score of 99.30%, which was comparable with that of common wheat genomes and notably higher than that of the SHW cultivar W7984 (Table 1). Repeats comprised 81.36% of the sequences with a predominance of retrotransposons, which accounted for 62.96% of the sequences (Table 4). In total, 122,554 high-confidence and 136,431 low-confidence protein-coding gene models were predicted (Table 5); this number was similar to that for the common wheat Chinese Spring (Table 1). The high-quality Chuanmai 104 genome assembly generated in this study provides a reference genome for SHW-derived cultivars, and offers a promising outlook for the study and utilization of SHW genetic resources in wheat improvement, which is essential to meet the global food demand.
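The contiguity statistics quoted above (contig N50 and N90) follow a standard definition that can be sketched in a few lines of Python. This is an illustrative sketch only, not the software used in this study; the function name `nx` and the toy contig lengths are hypothetical. Passing an estimated genome size instead of the assembly total gives the corresponding NGx value.

```python
def nx(lengths, x, genome_size=None):
    """Return the Nx (or NGx) value for a list of contig lengths.

    Nx is the length L such that contigs of length >= L together cover
    at least x percent of the total assembly length. If genome_size is
    given, x percent of that size is used instead, which yields NGx.
    Returns None when the threshold is never reached.
    """
    if not lengths:
        raise ValueError("no contig lengths given")
    ordered = sorted(lengths, reverse=True)
    reference = genome_size if genome_size is not None else sum(ordered)
    threshold = reference * x / 100.0
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length
    return None


# Toy contig lengths (hypothetical values, not Chuanmai 104 data).
toy = [100, 80, 60, 40, 20]
print(nx(toy, 50), nx(toy, 90), nx(toy, 50, genome_size=400))
```

Because NGx is scaled to a common genome size, it is the fairer statistic when assemblies of different total lengths are compared.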

Fig. 1

Overview of Chuanmai 104 chromosome-scale assembly. (a) Distribution of the A. tauschii clone A6-10 subtelomeric tandem repeat sequence (GenBank Accession AY249980.1). (b) Distribution of the A. tauschii clone 6C6-3 (GenBank Accession AY249981.1) and 6C6-4 (GenBank Accession AY249982.1) and T. monococcum ssp. aegilopoides clone BAC TbBAC5 (GenBank Accession DQ904440.1) and TbBAC30 (GenBank Accession EF624064.1) centromere-specific tandem repeat sequences. (c) Distribution of the non-coding gene density. (d) Distribution of the transposable element density. (e) Distribution of the tandem repeat density. (f) Distribution of the long terminal repeat density. (g) Distribution of the high-confidence protein-coding gene density. (h) Distribution of the significant enrichment of subgenome-specific k-mers identified by SubPhaser (gold colour for A, blue for B, and orange for D). (i) Density distribution of the D subgenome-specific k-mer set. (j) Density distribution of the B subgenome-specific k-mer set. (k) Density distribution of the A subgenome-specific k-mer set. Links between chromosomes are collinearity blocks, which are coloured according to the homologous chromosomes. All densities were calculated using sliding windows (window size: 1 Mbp, step size: 1 Mbp), except the density of the non-coding genes, which uses a window size of 10 Mbp and a step size of 1 Mbp for smoother visualization.
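The sliding-window densities described in the caption can be illustrated with a short Python sketch. It is not the plotting pipeline used for Fig. 1; the function name `window_counts` is hypothetical, and density is simplified here to the number of features whose start coordinate falls inside each window. With equal window and step sizes the windows tile the chromosome, whereas a window larger than the step (as for the non-coding genes) produces overlapping, smoother windows.

```python
from bisect import bisect_left


def window_counts(starts, chrom_length, window=1_000_000, step=1_000_000):
    """Count features whose start falls in each sliding window.

    starts: 0-based start positions of features on one chromosome.
    Returns a list of (window_start, window_end, count) tuples, where
    window_end is exclusive and clipped to the chromosome length.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    ordered = sorted(starts)
    results = []
    for w_start in range(0, chrom_length, step):
        w_end = min(w_start + window, chrom_length)
        lo = bisect_left(ordered, w_start)  # first feature at or after w_start
        hi = bisect_left(ordered, w_end)    # first feature at or after w_end
        results.append((w_start, w_end, hi - lo))
    return results


# Toy example on a 100 bp "chromosome" (hypothetical coordinates).
features = [5, 15, 25, 95]
tiled = window_counts(features, 100, window=10, step=10)
smoothed = window_counts(features, 100, window=30, step=10)
print([c for _, _, c in tiled])
print([c for _, _, c in smoothed])
```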

Table 1 The summary results of the SHWs (CM104 and W7984) and common wheats (Fielder, Kariega, Attraktion, Renan, CS) genome assemblies.
Table 2 The chromosome lengths of selected representative hexaploid wheat genomes.
Table 3 Statistics of number of contigs, LAI and non-coding RNAs on each chromosome in Chuanmai 104.
Table 4 The statistics for the repeats in the Chuanmai 104 genome.
Table 5 Statistics of gene structural and functional annotation.

Methods

Plant material, DNA extraction, and sequencing

The SHW-derived cultivar Chuanmai 104 was kindly provided by Wuyun Yang (Crop Research Institute, Sichuan Academy of Agricultural Sciences). The plants used for sequencing were grown in a growth chamber with a controlled environment of 20 °C under a 12 h light/12 h dark photoperiod for 2 weeks. Genomic DNA (gDNA) was extracted from seedling leaf tissues using the cetyltrimethylammonium bromide method. Three methods were applied for DNA quantification and quality testing: (i) a NanoDrop 2000 spectrophotometer (Thermo Fisher Scientific), (ii) gel electrophoresis, and (iii) a Qubit fluorometer (Invitrogen). Total DNA was purified with AMPure PB beads (Pacific Biosciences, CA, USA; PN 100-265-900). High-quality gDNA (≥10 μg, ≥100 ng/μl) was prepared for the next step of library construction. PacBio single-molecule real-time (SMRT) SMRTbell library preparation was performed using the SMRTbell® Express Template Prep Kit 2.0 (Pacific Biosciences, CA, USA; PN 101-853-100) in accordance with the manufacturer’s instructions. The library was prepared for sequencing with a 30 h movie on the Sequel IIe system (Pacific Biosciences) by the Berry Genomics Corporation (Beijing, China). In total, we generated 668.43 Gb of sequence (~45X) comprising 40,999,150 CCS reads from 20 SMRT cells.

Chromosome conformation capture (Hi-C) sequencing of Chuanmai 104 was performed using the protocol of Peng et al.9. In brief, 2–4 g of tender leaves from the plants used for genome sequencing were harvested and stored in liquid nitrogen, and then the Hi-C libraries were prepared and sequenced on the MGISEQ-2000 platform by BGI (Wuhan, China). Samples were cut into pieces of ca. 2 cm², and transferred to 50 ml tubes containing 15 ml of ice-cold nuclear isolation buffer (NBE) with 2% formaldehyde, followed by vacuum infiltration (400 mbar) and incubation with a supplemented cross-linking agent for 1 h. Cross-linking was quenched by adding 2 M glycine to a final concentration of 0.125 M with incubation for 5 min under vacuum, followed by fixation on ice. Then, the fixed leaf pieces were washed three times with sterile Milli-Q water, ground in liquid nitrogen and subjected to nucleus isolation. The isolated nuclei were purified, checked for quality and quantity and digested with 100 units of DpnII. The next steps were Hi-C specific, including marking the DNA ends with biotin-14-dATP and performing blunt-end ligation of the cross-linked fragments. After ligation, cross-linking was reversed by overnight incubation with proteinase K at 65 °C. Biotin-14-dATP was further removed from non-ligated DNA ends using the exonuclease activity of T4 DNA polymerase. DNA was purified by phenol:chloroform (1:1) extraction, precipitated and washed as previously described. The purified DNA was physically sheared by sonication and size-fractionated using standard 2% agarose gel electrophoresis to obtain fragments in the range of 300–600 bp. The fragmented ends were blunt-end repaired and A-tailed, followed by purification through biotin-streptavidin-mediated pulldown. PCR amplification was conducted using 12–15 cycles to enrich the ligation products. In total, we generated more than 2 Tb of sequence (>135X) comprising 6.69 Gb read pairs.
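The sequencing depths quoted for the HiFi and Hi-C data can be checked with simple arithmetic. The Python sketch below is illustrative only. It assumes that depth is computed against the 14.81 Gb assembly size reported in this study (the denominator actually used is not stated here and could instead be an estimated genome size), it treats the ">2 Tb" Hi-C yield as a lower bound of 2000 Gb, and the function names are hypothetical.

```python
def sequencing_depth(total_bases, genome_size):
    """Average depth = sequenced bases / genome size (same units for both)."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


def mean_read_length(total_bases, n_reads):
    """Mean read length in bases."""
    if n_reads <= 0:
        raise ValueError("n_reads must be positive")
    return total_bases / n_reads


ASSEMBLY_GB = 14.81  # assumption: assembly size used as the denominator

hifi_depth = sequencing_depth(668.43, ASSEMBLY_GB)     # HiFi yield in Gb
hic_depth = sequencing_depth(2000.0, ASSEMBLY_GB)      # ">2 Tb" as a lower bound
hifi_mean_len = mean_read_length(668.43e9, 40_999_150)  # bases per CCS read

print(f"HiFi depth: about {hifi_depth:.1f}X")
print(f"Hi-C depth: more than {hic_depth:.0f}X")
print(f"Mean HiFi read length: about {hifi_mean_len:.0f} bp")
```

Under these assumptions the results agree with the approximate 45X and >135X values given in the text.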

For full-length transcriptome sequencing, we collected a pooled sample of Chuanmai 104, which comprised whole plant organs except for roots from seed germination to the three-leaf stage, shoots at the seedling stage, and leaves, stems, ears, and seeds from the heading to the late-filling stages. Total RNA was isolated using TRIzol Reagent (Thermo Fisher Scientific) in accordance with the manufacturer’s instructions. The RNA purity and contamination were first assessed with a NanoDrop 2000 (Thermo Fisher Scientific), and then the RNA Integrity Number (RIN) and concentration were further assessed with an Agilent 4200 (Agilent Technologies). High-quality RNA (2 μg, 300 ng/μl) was prepared for the next step of library construction. PacBio SMRTbell library preparation was performed using the SMRTbell® Express Template Prep Kit 2.0 (Pacific Biosciences) in accordance with the manufacturer’s instructions. The library was prepared for sequencing with a 30 h movie on the Sequel IIe system (Pacific Biosciences) by the Berry Genomics Corporation. In total, we generated 186.35 Gb of sequence comprising 2,283,790 polymerase reads from one SMRT cell. The final 46,130,981 subreads ranged from 51 bp to 241,082 bp, with a mean and N50 value of 4,039.55 bp and 4,561 bp, respectively.

Genome assembly

The PacBio HiFi CCS reads were assembled using hifiasm10 (v0.16.1, with default parameters). The Hi-C reads were incorporated using Juicer tools11 (v1.6) and EndHiC12. In brief, preprocessing of the Hi-C reads was performed with juicer.sh11 (parameter: -s DpnII). The output file containing the Hi-C contacts with duplicates removed and mapping quality values larger than 30 was used as input for EndHiC12. The resulting files were plotted to visualize the Hi-C map and for manual curation, and were used to generate the final assembly (21 pseudomolecules and one unanchored pseudomolecule). The NCBI Foreign Contamination Screen (FCS)13 was used to identify and remove contaminant sequences (adaptors and organelles) in the genome assembly. In total, the FCS identified 754 contaminant fragments, comprising one adaptor fragment and 753 mitochondrial fragments; all of these contaminants were located on the unanchored pseudomolecule and were masked.

Validation of genome assemblies

Genome sizes were estimated using three algorithms (gce14, GenomeScope215, and findGSE16) with different k-mer sizes. The quality and completeness of the genome assemblies were assessed by merqury17, which uses a reference-free, k-mer-based approach, and BUSCO18 (v5, poales_odb10), which is based on evolutionarily informed expectations of the near-universal single-copy orthologous gene content. The LTR assembly index (LAI)8, which evaluates assembly continuity using LTR retrotransposons (LTR-RTs), was also calculated.
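For orientation, a k-mer-based consensus quality value (QV) of the kind reported by merqury can be derived from two k-mer counts. The Python sketch below follows the published Merqury definition as we understand it: the per-base error probability is estimated from the fraction of assembly k-mers that are also found in the reads, and the QV is its Phred-scaled value. It is not a replacement for running merqury; the function names and toy counts are hypothetical, and the k-mer size used in this study is not stated here, so k = 21 is an assumption.

```python
import math


def kmer_error_rate(k_asm_only, k_total, k):
    """Per-base error estimate from k-mer counts.

    k_asm_only: k-mers present in the assembly but absent from the reads.
    k_total:    all k-mers in the assembly.
    Error = 1 - (shared fraction) ** (1 / k), computed with log1p/expm1
    so that very small error rates keep their precision.
    """
    if k <= 0 or k_total <= 0 or not 0 <= k_asm_only <= k_total:
        raise ValueError("invalid k-mer counts or k-mer size")
    ratio = k_asm_only / k_total
    if ratio == 1.0:
        return 1.0
    return -math.expm1(math.log1p(-ratio) / k)


def phred_qv(error_rate):
    """Phred-scaled quality: QV = -10 * log10(error rate)."""
    if error_rate <= 0:
        return math.inf
    return -10.0 * math.log10(error_rate)


def phred_to_error(qv):
    """Inverse of phred_qv: expected per-base error probability."""
    return 10.0 ** (-qv / 10.0)


# Toy counts (hypothetical, not taken from the Chuanmai 104 analysis).
toy_qv = phred_qv(kmer_error_rate(1_000, 1_000_000_000, 21))
print(f"toy QV: {toy_qv:.2f}")
print(f"error rate at QV 65.86: {phred_to_error(65.86):.2e}")
```

Read this way, a QV in the mid-60s corresponds to an expected error rate well below one per million bases.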

Subgenome assignment, validation, and nomenclature

To assign each chromosome to a linkage group and apply the corresponding nomenclature in Chinese Spring, SubPhaser19, a robust allopolyploid subgenome phasing method based on subgenome-specific k-mers, was used. To validate the correctness of the subgenome assignment, a reference-guided strategy based on subgenome homology was also used to distinguish the subgenomes. We mapped the Chuanmai 104 genome to the Chinese Spring genome using mashmap20 (-f map --perc_identity 90 -s 1000000). Then, the alignments were plotted and manually checked. This procedure successfully categorized the 21 chromosomes into three homologous groups. The nomenclature system for Chinese Spring chromosomes was adopted for naming of the homologous groups (1–7) of the Chuanmai 104 genome.
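The reference-guided check described above relied on plotting and manual inspection. As a complementary illustration, the Python sketch below summarizes which reference chromosome receives the most mapped sequence from each query chromosome. It is a hypothetical helper, not part of the authors' procedure, and it assumes the ten-column, whitespace-delimited output layout documented for MashMap v2 (query name, length, start, end, strand, target name, length, start, end, identity); other MashMap versions may write a different format. All sequence names in the toy input are invented.

```python
from collections import defaultdict


def best_reference_match(lines):
    """Map each query sequence to the target with the largest mapped span.

    lines: iterable of MashMap-style records (assumed ten columns).
    Returns a dict {query_name: target_name}.
    """
    span = defaultdict(lambda: defaultdict(int))
    for line in lines:
        fields = line.split()
        if len(fields) < 10:
            continue  # skip blank or malformed records
        query, q_start, q_end, target = fields[0], int(fields[2]), int(fields[3]), fields[5]
        span[query][target] += q_end - q_start  # approximate mapped length
    return {query: max(targets, key=targets.get) for query, targets in span.items()}


# Toy records with hypothetical sequence names and coordinates.
toy_records = [
    "chrA 1000000 0 499999 + CS_1A 2000000 10 500010 98.5",
    "chrA 1000000 500000 999999 + CS_1A 2000000 500100 1000100 97.9",
    "chrA 1000000 100000 199999 - CS_1B 2100000 5 100005 91.2",
    "chrB 900000 0 899999 + CS_1B 2100000 0 900000 96.0",
]
assignment = best_reference_match(toy_records)
print(assignment)
```

A summary of this kind can flag unexpected assignments quickly, but it does not replace visual inspection of the alignments for rearrangements.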

Repeat annotation

Tandem repeats of all lengths were annotated with TandemRepeatsFinder21 v4.09 using the default parameters (Match Mismatch Delta PM PI Minscore MaxPeriod: 2 7 7 80 10 50 500). LTR_FINDER22 (v1.05) and LTR_harvest23 (v1.5.10) were used for long terminal repeat (LTR) identification, and the results were processed with LTR_retriever24 (v2.8) to generate a species-specific LTR library. The species-specific LTR libraries, wheat transposable element (TE) sequences from ClariTeRep (https://github.com/jdaron/CLARI-TE), and plant TE sequences from Repbase25 were merged to generate the TE library. Transposons were detected and classified by a homology search against the combined TE library. The program Vmatch (http://www.vmatch.de/), a fast and efficient matching tool suitable for large and highly repetitive genomes, was used for this computationally intensive task with the following parameters: identity ≥ 70%, minimal hit length 75 bp, and seed length 12 bp (exact command line: -d -p -l 75 -identity 70 -seedlength 12 -exdrop 5).

Non-coding gene annotation

Noncoding RNAs (ncRNAs), including miRNAs, small nuclear RNAs, rRNAs and regulatory elements, were identified using the Infernal26 (version 1.1.2) program to search against the Rfam27 database (v14.8). The rRNAs and tRNAs were further identified using RNAmmer28 (version 1.2) and tRNAscan-SE29 (v1.3.1), respectively.

Protein-coding gene annotation

Gene model prediction was performed following the method described by Mascher et al.30, with minor modifications, which integrated transcriptomic data, protein homology, and ab initio prediction. (1) First, isoform sequencing (Iso-Seq) data were mapped to the genome using minimap231 (v2.17-r941; parameters: -ax splice -uf --secondary=no -C5). The redundant isoforms were further collapsed into transcript loci using cDNA_Cupcake (https://github.com/Magdoll/cDNA_Cupcake) (parameter: --dun-merge-5-shorter). TransDecoder (v5.5.0, https://github.com/TransDecoder/TransDecoder) was used to predict protein sequences among the transcripts. (2) For protein homology evidence, we projected the gene structures from Triticeae species, comprising Ae. tauschii, T. turgidum subsp. dicoccoides, T. turgidum subsp. durum, T. aestivum Chinese Spring, T. urartu, Ae. speltoides, and Hordeum vulgare Morex, onto the Chuanmai 104 genome using liftoff32 with default parameters. (3) We produced ab initio gene predictions using AUGUSTUS33 (v3.4.0), GeneMark-ET34 (v4.38), and GeneID35 (v1.4). In brief, AUGUSTUS33 gene prediction was performed using a model specifically trained from the software and a hints file generated using the previously mentioned Iso-Seq predictions. GeneMark-ET34 was used with the option -ET, and the intron coordinates were calculated using the above-mentioned Iso-Seq alignments. GeneID35 was run with a model specifically trained from the software (-GP taestivum.param). We used EVidenceModeler36 (EVM; v1.1.1) to integrate all of the gene evidence from transcriptomics, protein alignments, and ab initio predictions.

Protein-coding gene models from EVM were classified as high-confidence or low-confidence according to criteria used by the International Wheat Genome Sequencing Consortium, with minor modifications37. In brief, protein-coding gene models were considered as ‘complete’ when start and stop codons were present. A comparison with PTREP38 (the database of hypothetical proteins deduced from the nonredundant database of TEs within the TREP database), UniPoa39 (Poaceae database of annotated proteins from the UniProt database), and UniMag40 (validated Magnoliophyta proteins from SwissProt) was performed using DIAMOND41 (v2.0.9; parameters: -e 1e-10 --query-cover 80 --subject-cover 80). Gene candidates were classified using the following criteria: a high-confidence gene model was ‘complete’ with a hit in the UniMag40 database and/or in UniPoa39 but not PTREP38; the remaining gene models were classified as low-confidence genes.
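The confidence criteria above amount to a small decision rule, sketched below in Python. This is an illustrative reading rather than the authors' script: the function and argument names are hypothetical, and because the wording leaves open whether a PTREP hit also disqualifies models with a UniMag hit, that choice is exposed as an explicit option. The default assumes the PTREP exclusion applies only to models supported by UniPoa alone.

```python
def classify_gene_model(complete, hit_unimag, hit_unipoa, hit_ptrep,
                        ptrep_excludes_unimag=False):
    """Return 'high-confidence' or 'low-confidence' for one gene model.

    complete:   start and stop codons are both present.
    hit_unimag: protein has a qualifying hit in UniMag.
    hit_unipoa: protein has a qualifying hit in UniPoa.
    hit_ptrep:  protein has a qualifying hit in PTREP (TE-derived proteins).
    ptrep_excludes_unimag: if True, a PTREP hit demotes the model even when
        it has a UniMag hit (stricter reading; an assumption, not confirmed).
    """
    if not complete:
        return "low-confidence"
    if ptrep_excludes_unimag:
        supported = (hit_unimag or hit_unipoa) and not hit_ptrep
    else:
        supported = hit_unimag or (hit_unipoa and not hit_ptrep)
    return "high-confidence" if supported else "low-confidence"


# Toy gene models (hypothetical evidence combinations).
examples = {
    "complete, UniMag hit": classify_gene_model(True, True, False, False),
    "complete, UniPoa and PTREP hits": classify_gene_model(True, False, True, True),
    "incomplete, UniMag hit": classify_gene_model(False, True, True, False),
}
for label, result in examples.items():
    print(f"{label}: {result}")
```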

Functional assignments of the predicted protein-coding genes were obtained with BLAST42 by aligning the coding regions to sequences in public protein databases, including the trEMBL40, RefSeq43, and SwissProt40 databases. The putative domains and GO44 terms of the predicted proteins were identified using the InterProScan45 program. The putative orthologs in the KEGG46 database were identified using KoFamScan47.

Data Records

The HiFi reads, Iso-seq reads, and Hi-C reads that were used for the Chuanmai 104 genome assembly have been deposited in the NCBI Sequence Read Archive with accession number SRP488123 and under BioProject number PRJNA107040948. The HiFi reads, Iso-seq reads, and Hi-C reads were also deposited in the National Genomics Data Centre (NGDC) with BioProject ID PRJCA022052 (https://ngdc.cncb.ac.cn/bioproject/browse/PRJCA022052). The genome assembly has been deposited at GenBank under the accession JBBIFV00000000049. The genome assemblies and annotations have also been deposited at FigShare50 with doi number https://doi.org/10.6084/m9.figshare.25282654.

Technical Validation

The assembled genome size is similar to the size estimated by different algorithms14,15,16 (Fig. 2a–c), and is significantly larger than that published previously for the SHW cultivar W7984 (Table 1). The base-level accuracy QV (consensus quality value) and k-mer completeness scores evaluated with merqury17 are 65.86 and 97.59%, respectively. The long terminal repeat (LTR) Assembly Index (LAI)8 of the Chuanmai 104 genome assembly was 15.17, 14.64, and 10.85 for the A, B, and D subgenomes, respectively, which are higher than the LAI values obtained for Chinese Spring (11.88, 12.51, and 9.97 for the A, B, and D subgenomes, respectively). The BUSCO18 score is 99.3% and only 0.7% of BUSCO genes are missing (Fig. 2d). These results indicate a high completeness of the Chuanmai 104 assembly. Comparison with other common wheat genome assemblies revealed that the Chuanmai 104 NG50 value was significantly larger, implying high connectivity (Fig. 2e). The GC-depth plot (Fig. 2f) of the Chuanmai 104 genome across every 2 kb nonoverlapping sliding window showed no distinct secondary peaks, indicating that haplotype homology was adequately recognized during assembly.

Fig. 2

Validation of the Chuanmai 104 genome assembly. (a) Genome sizes estimated using different algorithms with different k-mer sizes. (b,c) Examples of genome size estimated by findGSE (K = 181, b) and GenomeScope2 (K = 181, c), respectively. (d) Gene completeness assessed by BUSCO using the Poales dataset with a total of 4896 groups. (e) NGx plots for the Chuanmai 104 and other common wheat genomes. (f) GC content and average sequencing depth (GC-depth) plot of the Chuanmai 104 genome across every 2-kb nonoverlapping sliding window.

The Hi-C contact map was manually curated and assessed with Juicebox and revealed a dense pattern along the diagonal, indicating no evident mis-assemblies (Fig. 3). The anti-diagonals are typical for Triticeae genomes51 (Fig. 3). The distribution of the A. tauschii subtelomeric tandem repeat sequences (NCBI GenBank accessions: AY249980.1, AY249981.1, and AY249982.1) and T. monococcum subsp. aegilopoides centromere-specific tandem repeat sequences (NCBI GenBank accessions: DQ904440.1 and EF624064.1) indicates the completeness of these complex regions (Fig. 1a–c).

Fig. 3

Hi-C contact maps of chromosomes. The dashed lines indicate chromosome boundaries.

Using SubPhaser19, a robust allopolyploid subgenome phasing method based on subgenome-specific k-mers, the 21 chromosomes of the Chuanmai 104 genome were aggregated into three linkage groups (Fig. 1h–k). These groups show high synteny to chromosomes of Chinese Spring at both the nucleotide and protein levels (Fig. 4), indicating the correctness of the chromosome assembly. Moreover, these synteny results show the relative conservation of the common wheat and SHW genomes, although the sources of the subgenomes and their evolutionary history differ.

Fig. 4

Nucleotide-level (a) and protein-level (b) synteny between the 21 chromosomes of Chuanmai 104 and Chinese Spring.