Background & Summary

Tomato (Solanum lycopersicum) is one of the most valuable vegetable crops worldwide. It also serves as a classic model system for studying plant-pathogen interactions and fruit development1,2. Fruit size increased gradually during tomato domestication; however, continued selection reduced genetic diversity, causing the loss of multiple disease resistances in cultivated tomato3,4. Thus, wild tomato species have frequently been used as important germplasm donors in modern tomato breeding programs5,6. S. pimpinellifolium, the wild progenitor of the cultivated tomato7, possesses genes that confer resistance to biotic and abiotic stresses8,9. For example, Sm from S. pimpinellifolium PI79532 confers high resistance against gray leaf spot in tomato10; the I gene, also derived from PI79532, confers resistance against Fusarium oxysporum f. sp. lycopersici races 111; Rx4 from S. pimpinellifolium PI128216 confers hypersensitive resistance to bacterial spot race T312; and Ph-3, derived from S. pimpinellifolium L3708, confers resistance to Phytophthora infestans13. These findings indicate the considerable potential of S. pimpinellifolium for use in breeding programs to develop disease-resistant varieties.

Whole-genome sequencing improves molecular breeding because high-quality plant genomes facilitate the identification of genetic diversity among different germplasms14,15,16,17. Currently, chromosome-level genome assemblies are available for cultivated tomatoes, such as S. lycopersicum cv. M8218 and Heinz 170619,20, and for wild tomatoes, such as S. pennellii LA071621 and S. galapagense LA043622. These genome assemblies support the discovery, through comparative genomic analysis, of causal genetic variations underlying major tomato traits. S. pimpinellifolium LA1589 is a wild tomato accession with small, red, round fruits (Fig. 1a) that is widely used for trait mapping23,24,25,26. In particular, the well-established introgression line population derived from a cross between S. lycopersicum cv. E6203 and LA1589 represents one of the widest crosses and serves as an important resource for scientists and breeders27. Although a draft genome assembly of this accession was published 10 years ago28, a chromosome-level genome sequence has not yet been published; thus, the vast majority of sequence variations remain poorly characterized and their impact on important traits is largely unknown.

Fig. 1

Overview of the S. pimpinellifolium LA1589 genome assembly and features. (a) Morphology of the root, stem, leaf, flower, and fruit of LA1589. (b) GenomeScope profile for 21-mers based on Illumina short reads. (c) Hi-C contact map of the chromosome-level assembly of LA1589. (d) Genome features of LA1589. For the Circos map, the tracks from outside to inside are: (i) GC content (%); (ii) density of protein-coding genes; (iii) TE density; (iv) LTR density.

In this study, we assembled the chromosome-level genome of S. pimpinellifolium using a combination of short-read sequencing, PacBio sequencing, Hi-C scaffolding, and Bionano optical mapping technologies. The resulting assembly has a total length of 833 Mb, with a contig N50 of 31 Mb, a BUSCO completeness of 98.3%, and an LAI score of 14.49. This high-quality S. pimpinellifolium genome provides a valuable genetic resource for future efforts to study tomato domestication and promote genome-scale breeding.

Methods

Library construction and genome sequencing

Seeds of S. pimpinellifolium LA1589 were acquired from TGRC (https://tgrc.ucdavis.edu/) and planted in the greenhouse at the Institute of Genetics and Developmental Biology, Chinese Academy of Sciences (Beijing, China). Total genomic DNA was extracted from fresh young leaves using the CTAB method29. A Pacific Biosciences (PacBio) SMRT library was constructed from high-molecular-weight DNA following the standard SMRTbell library preparation protocol. A total of five SMRT cells were run on the PacBio Sequel system. For short-read sequencing, paired-end libraries with a 350-bp insert length were constructed and sequenced on the BGISEQ-500 platform. A high-throughput chromosome conformation capture (Hi-C) library was prepared following the Proximo Hi-C plant protocol (Phase Genomics) and sequenced on an Illumina NovaSeq 6000 platform in paired-end mode. For BioNano optical mapping, genomic DNA was isolated using a BioNano Plant Tissue DNA Isolation Kit. Labelled genomic DNA was then loaded onto the BioNano Saphyr System.

Genome survey

The k-mer frequency method was employed to estimate the genome size. Short-read sequencing produced 104.7 Gb of clean data after low-quality reads were filtered out. Jellyfish v2.2.1030 (count -C -m 21; histo -h 40000) was used to compute a histogram of 21-mer frequencies. The heterozygosity level was calculated using GenomeScope v1.031. The estimated genome size of S. pimpinellifolium was 835.55 Mb, with a heterozygosity rate of 0.08% (Fig. 1b).
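The idea behind a k-mer based size estimate can be illustrated with a minimal sketch. It is not the GenomeScope model used above, which fits a mixture model to the histogram; it only applies the simpler rule of thumb that genome size is roughly the total number of k-mers divided by the depth of the main coverage peak. The function name, the `min_depth` error cutoff, and the toy histogram are assumptions made for illustration.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Rough genome size from a k-mer histogram.

    histogram: iterable of (depth, count) pairs, the two columns that a
    `jellyfish histo` output contains. min_depth is a hypothetical cutoff
    for discarding low-depth k-mers, which mostly arise from read errors.
    """
    solid = [(depth, count) for depth, count in histogram if depth >= min_depth]
    if not solid:
        raise ValueError("no k-mers at or above min_depth")
    # The depth with the most distinct k-mers approximates the k-mer coverage.
    peak_depth = max(solid, key=lambda pair: pair[1])[0]
    # Total k-mer occurrences divided by coverage gives the genome size.
    total_kmers = sum(depth * count for depth, count in solid)
    return total_kmers / peak_depth, peak_depth


# Toy histogram (not real data): an error tail at depth 1-2 and a peak at 20.
toy_histogram = [(1, 5000), (2, 900), (5, 10), (18, 40), (19, 70),
                 (20, 100), (21, 70), (22, 40), (40, 5)]
toy_size, toy_peak = estimate_genome_size(toy_histogram)
print(toy_peak, toy_size)
```

A model-based tool remains preferable for real data, because it also separates heterozygous and repeat peaks, which this sketch ignores.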

Genome assembly and quality assessment

PacBio sequencing produced 282.3 Gb of long reads. Canu v1.832 (genomeSize=800m minOverlapLength=600 minReadLength=1000) was used to assemble the PacBio subreads into contigs. BioNano optical maps were assembled into consensus physical maps using BioNano Solve v3.1 (https://bionanogenomics.com/). HERA v1.033 was used to extend and connect the contigs, and to fill in gaps in the BioNano hybrid scaffolds. The 128.5 Gb of Hi-C reads were mapped to the scaffolds with Bowtie234. Then, HiC-Pro35 was employed to align the paired-end reads and Juicebox36 was used to build the interaction map (Fig. 1c). The scaffolds were further clustered and assigned to chromosomes. To increase the accuracy of the assembly, Illumina short reads were mapped to the genome using BWA v0.7.1537, and the assembly was corrected with three rounds of Pilon v1.2438. The final 833.19-Mb assembly had a contig N50 length of 31.2 Mb, and approximately 98.87% of the assembled sequence was anchored onto 12 pseudo-chromosomes (Fig. 1d). This represents a substantial improvement over the previous LA1589 genome assembly released in 2012. The assembly also compares favorably with the reference assemblies of S. pennellii LA0716 and S. lycopersicum cv. Heinz 1706 (Table 1).
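For readers unfamiliar with the contig N50 statistic reported above, the following sketch shows how it is commonly defined: the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly length. The function name and the toy lengths are illustrative assumptions, not part of the assembly pipeline.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no contig lengths supplied")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that brings the running sum to half the total
        # assembly length defines the N50.
        if running >= half_total:
            return length


# Toy example (not real data): total length 30, so half is 15.
toy_lengths = [2, 10, 4, 8, 6]
print(n50(toy_lengths))
```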

Table 1 Comparison of tomato genome assemblies.

The completeness of the genome was evaluated using BUSCO (Benchmarking Universal Single-Copy Orthologs) v5.4.539 with the Solanales odb10 dataset, which showed that 98.3% of the Solanales BUSCOs were captured in this assembly (Table 2). Furthermore, the contiguity of the genome was evaluated by calculating the LTR Assembly Index (LAI)40 using LTR_retriever v2.9.941 with default parameters. The LAI value of the genome assembly was 14.49. Collectively, these results indicate that the S. pimpinellifolium genome assembly is of high quality.

Table 2 BUSCO analysis of the genome assembly.

Repeat annotation

The transposable element (TE) libraries were obtained by running the EDTA pipeline42. In addition, short interspersed nuclear element (SINE) candidates were predicted by the SINE-Finder program v1.043 and integrated into the TE library. RepeatMasker v4.0.744 was used for homology-based repeat identification against the consensus TE library. Approximately 74.47% of the genome was composed of repetitive sequences (Table 3). Long terminal repeat (LTR) retrotransposons represented the largest proportion (47.45%) of repetitive elements in the genome, of which Gypsy (28.12%) was the most abundant. The insertion time of LTR retrotransposons was estimated as described previously45. In brief, the 5′ and 3′ terminal repeat sequences of each LTR retrotransposon were extracted and aligned using MUSCLE v3.8.155146. Next, the insertion time was calculated as T = K/2r, where K is the sequence divergence between the two terminal repeats and r is the neutral mutation rate. The results showed that the main burst of Gypsy elements occurred about 0.75 million years ago (MYA), whereas the main burst of Copia elements occurred about 0.6 MYA (Fig. 2), suggesting that the amplification of Gypsy elements preceded that of Copia elements and that Gypsy expansion had a major effect on the expansion of the S. pimpinellifolium genome.
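The insertion-time formula T = K/2r can be illustrated with a small sketch. It assumes that K is taken as the uncorrected proportion of mismatched sites between the aligned 5′ and 3′ terminal repeats; the study may have used a corrected distance. The mutation rate passed in below is a placeholder chosen only to make the arithmetic easy to follow, not the value used in the study, and the function names and toy sequences are likewise hypothetical.

```python
def divergence(ltr5, ltr3):
    """Proportion of mismatches between two aligned LTR sequences.

    Both inputs must come from the same pairwise alignment (equal length,
    '-' for gaps). Columns containing a gap are ignored. This uncorrected
    distance is an assumption of the sketch.
    """
    if len(ltr5) != len(ltr3):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(ltr5.upper(), ltr3.upper())
               if a != "-" and b != "-"]
    if not columns:
        raise ValueError("no ungapped alignment columns")
    mismatches = sum(a != b for a, b in columns)
    return mismatches / len(columns)


def insertion_time(k, r):
    """Insertion time in years from T = K / (2r)."""
    return k / (2 * r)


# Toy alignment (not real data): 20 ungapped columns, 1 mismatch.
toy_5prime = "ACGTACGTACGTACGTACG-T"
toy_3prime = "ACGTACGTACGTACGTACCAT"
toy_k = divergence(toy_5prime, toy_3prime)
# PLACEHOLDER rate (substitutions per site per year), for illustration only.
toy_time = insertion_time(toy_k, r=1e-8)
print(toy_k, toy_time)
```

The factor of two reflects that both terminal repeats accumulate substitutions independently after insertion, so the divergence between them grows at twice the per-lineage rate.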

Table 3 Classification of transposable elements in the S. pimpinellifolium genome.
Fig. 2

Overall insertion time distribution of LTR elements in the S. pimpinellifolium genome.

Gene prediction and annotation

Protein-coding genes (PCGs) in the S. pimpinellifolium genome were annotated using the MAKER pipeline v3.01.0447. Nucleotide and protein sequences from Heinz 1706 v4.0 (https://solgenomics.net/) were used as queries for homology-based predictions. The ab initio gene predictors used within MAKER were SNAP v2006-07-2848 and AUGUSTUS v2.5.549. Combining homology-based and ab initio predictions identified 41,449 PCGs, 6,722 more than in the previous version of the genome. Functional annotation of the PCGs was performed using Hayai-Annotation Plants v1.0.250 and KOBAS51. The predicted protein sequences were searched against the InterPro52, Swiss-Prot53, and NR (https://www.ncbi.nlm.nih.gov/protein) databases. In total, 36,960 (89.17%) genes were assigned specific functions (Table 4). Orthologous genes were identified using MCScanX54 and OrthoMCL v2.0.955. A total of 29,542 LA1589/Heinz 1706 orthologs were identified.

Table 4 Function annotation of predicted protein-coding genes.

Non-coding RNA genes were annotated by type. tRNAs were annotated using tRNAscan-SE v2.0.356, and rRNAs were annotated using RNAmmer v1.257. miRNAs and snRNAs were predicted using the cmscan module in INFERNAL v1.1.258 (--cut_ga --rfam --fmt 2) with searches against the Rfam database v14.959. In total, four types of non-coding RNA were identified in the genome: 1,073 tRNAs, 698 rRNAs, 582 snRNAs, and 405 miRNAs.

Data Records

The raw sequencing data generated in this study have been deposited in the NCBI Sequence Read Archive under accession number SRP47117760 and in the NGDC Genome Sequence Archive under accession number CRA01244661. The final genome assembly has been deposited in GenBank under accession GCA_034621305.162. The genome annotations are available from Figshare63.

Technical Validation

The quality of the S. pimpinellifolium assembly was evaluated using three approaches. First, the completeness of the genome assembly was assessed using BUSCO v5.4.5, and 98.30% of the BUSCO genes were complete. Second, assembly continuity was assessed using the LTR Assembly Index (LAI); the LAI score (14.49) met the quality standard for reference genomes. Third, to assess the correctness of the genome assembly, we re-aligned the clean Illumina DNA sequencing data against the assembly using BWA v0.7.15, and 99.77% of the reads were successfully mapped. Together, these statistics indicate that this S. pimpinellifolium genome assembly is of high accuracy and completeness.