Background

Rice (Oryza sativa L.) is a staple food for more than half of the world's population, providing about 19 percent of the world's and 29 percent of Asia's caloric supply (IRRI [2009]). Although demands on the nutritional and industrial functionality of rice are increasing, especially to improve human health and quality of life, improving the yield potential of rice is still a major challenge for rice breeders, who must address the rapid growth of the world population along with dramatic reductions in the amount of cultivated land (Khush [1999]), as well as environmental challenges (Nelson, International Food Policy Research Institute [2009]). Asian varieties of cultivated rice include two major subspecies, O. sativa indica and O. s. japonica, which are differentiated based on morphological and physiological characteristics and geographical distribution (Morishima and Oka [1981]; Sano and Morishima [1992]). O. s. indica cultivars have higher genetic diversity (Lu et al. [2002]), a broader cultivation range, and stronger resistance to prominent diseases and insect pests compared to O. s. japonica cultivars (Chung and Heu [1991]). Inter-subspecific hybridization between indica and japonica rice cultivars may enrich allelic variation and facilitate hybrid vigor by creating new genetic recombinations (Cheng et al. [2007]). In spite of these advantages, the introduction of desirable indica traits into the japonica variety has not been successful due to reproductive barriers and the incorporation of undesirable characteristics, such as low eating quality for people who prefer the taste of japonica rice (Chung and Heu [1991]).

Tongil rice (IR667-98-1-2) is the first semi-dwarf variety obtained by a three-way cross of indica/japonica varieties as part of a collaborative research project between the International Rice Research Institute (IRRI) and the government of South Korea (Figure 1). The development of Tongil rice resulted in a significant yield increase from 4 to 5 t ha⁻¹, corresponding to a 30% yield increase relative to the leading japonica varieties grown in Korea (Chung and Heu [1980]). After the introduction of Tongil rice in 1972, Korean rice production increased significantly, and in 1977 the South Korean government announced the achievement of agricultural self-sufficiency (the so-called 'Green Revolution'). However, the genome structure of Tongil rice has never been characterized.

Figure 1

Morphological comparison of Tongil and parental lines. From left to right: Tongil, Yukara, IR8, and TN1. (A) The plant architecture of Tongil, its japonica parent (Yukara), and its indica parents (IR8 and TN1). (B) The panicle phenotype of Tongil and its parents. (C) The grain shapes and brown rice shapes of Tongil and its parents. Scale bars are included in each panel.

Rice is a useful model crop for studying genome structure due to its relatively small genome. Furthermore, its genetic and physical data have been extensively analyzed by the International Rice Genome Sequencing Project (IRGSP) (International Rice Genome Sequencing P [2005]). Recent improvements in next-generation sequencing (NGS) technology have enabled high-throughput genotyping and elucidation of the genome structures of various rice cultivars (Huang et al. [2009]; Huang et al. [2012]). Most sequence-based rice genome analyses rely on two types of DNA polymorphism: single nucleotide polymorphisms (SNPs) and insertion-deletions (InDels). SNP detection is the first step in comparing DNA variation and is an effective tool for elucidating genome structure and composition (Feltus et al. [2004]; McNally et al. [2009]; Shen et al. [2010]; Chen et al. [2014]).

In this study, we sequenced the whole genomes of Tongil rice (Oryza sativa L.) and its parental varieties to analyze the genome structure of Tongil in detail and to identify regions of the indica and japonica parental genomes that introgressed in the Tongil genome. In addition, we analyzed previously reported yield-related genes (Gn1a, Ghd7, sd1, GS3 and qSW5), SSRs, GO annotation, and other genetic characteristics of the Tongil genome.

Results

Genome structure of Tongil

The whole genomes of Tongil and its three parental varieties, Yukara, IR8, and TN1 (Taichung Native 1), were sequenced on the Illumina-GAII platform. A large number of short reads were mapped onto the reference Nipponbare genome and then assembled into a consensus sequence. A total of 199,543,820 reads of the Tongil genome, corresponding to 17,339,883,560 bp (17.3 Gb), were generated, representing a 47-fold sequence depth and covering 88.8% of the Nipponbare pseudomolecules (Table 1 and Additional file 1: Table S1). We detected a total of 2,149,991 SNPs between Tongil and Nipponbare sequences (Additional file 2: Table S2). The two indica parents of Tongil, IR8 and TN1, had 6.22 and 6.04 SNPs per kb, respectively, whereas the japonica parent of Tongil, Yukara, had only 0.49 SNP per kb (Additional file 2: Table S2). Using the SNP data sets from Tongil and its parents, we defined the genomic origins of regions of the Tongil genome by SNP calling (Additional file 3: Figure S1; Additional file 4: Table S3; see also the SNP calling section in the Materials and Methods), and then performed a SEG-Map analysis (Zhao et al. [2010b]) of Tongil (Figure 2). The whole genome of Tongil consisted of an average contribution of 91.8% from indica, 7.9% from japonica, and 0.3% unknown (i.e., not defined as indica or japonica regions) (Figure 2 and Table 2). The contribution of indica to the Tongil genome varied across chromosomes, from 74% (Chr. 2) to 100% (Chr. 12). A relatively high proportion of the japonica genome was found on chromosomes 1, 2, and 3, whereas the japonica sequences were barely detectable on chromosomes 8 and 12. In addition, there were no differences in gene density between the indica- and japonica-derived genome regions of Tongil (Figure 2 and Table 2).
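The read and base counts reported above can be cross-checked with simple arithmetic. The following is a minimal sketch, not part of the original analysis pipeline: the function names are hypothetical, and the "implied reference length" is only what the reported 47-fold depth implies, not an independently stated genome size.

```python
def mean_read_length(total_bases: int, n_reads: int) -> float:
    """Average read length in bp (total sequenced bases / number of reads)."""
    return total_bases / n_reads


def implied_reference_length(total_bases: int, depth: float) -> float:
    """Reference length in bp implied by a reported fold depth.

    Assumes depth = total sequenced bases / reference length.
    """
    return total_bases / depth


# Values reported in the text for the Tongil genome.
TONGIL_READS = 199_543_820
TONGIL_BASES = 17_339_883_560
TONGIL_DEPTH = 47  # reported fold depth

read_len = mean_read_length(TONGIL_BASES, TONGIL_READS)
ref_len = implied_reference_length(TONGIL_BASES, TONGIL_DEPTH)
print(f"mean read length: {read_len:.1f} bp")
print(f"implied reference length: {ref_len / 1e6:.1f} Mb")
```

Under these assumptions the mean read length comes out just under 87 bp, and the implied reference length is roughly 369 Mb.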

Table 1 General sequencing statistics for Tongil and its parental genomes

Figure 2

Indica / japonica genome organization on the 12 chromosomes of Tongil. Blue indicates the indica genome (TN1 and IR8); red indicates the japonica genome (Yukara); and yellow indicates a region from an unknown genome. The percentages describe the proportion of indica contribution on each chromosome.

Table 2 Determination of the indica / japonica genome origin of Tongil, based on a window size of 9

Gene distribution and gene ontology analysis of Tongil

We analyzed the gene content of Tongil to understand the relationship between genome composition and genes (open reading frames: ORFs), and to elucidate the distribution of indica- and japonica-derived genes (alleles) within the Tongil genome. The distribution of genes by indica or japonica origin was similar to the genome distribution ratio (Table 2 and Additional file 5: Table S4). Genes of indica and japonica origin accounted for 88.3% and 11.4% of the total, respectively, suggesting that the average gene composition was similar to the genome composition ratio of Tongil, although the distribution of parental origin varied across chromosomes. To identify biological patterns, we performed gene ontology (GO) analysis on the lists of genes derived from indica, japonica, and unknown genomes, using three categories: cellular components, molecular functions, and biological processes (Additional file 6: Figure S2; Additional file 7: Figure S3; Additional file 8: Figure S4). The GO analysis revealed that the average contribution of the indica or japonica genome to each GO category was almost identical to the gene and genome distribution ratios. O. s. indica and O. s. japonica contributed 86.8% and 12.7% of the cellular components, 87.4% and 12.2% of the molecular functions, and 87.3% and 12.2% of the biological processes, respectively, to the Tongil genome. However, in the 'molecular functions' category, all 17 genes related to channel regulator activity were derived from indica regions, whereas all adhesion-related genes in the 'biological processes' category were derived solely from japonica regions.

Simple sequence repeats (SSRs) in the Tongil genome

A total of 177 distinctive motif families were annotated on the Tongil genome (Additional file 9: Figure S5; Additional file 10: Figure S6). Di-nucleotide repeats were predominant among the classified repeats, and AT/TA repeats were the most abundant motifs in both indica-derived (29.09%) and japonica-derived (21.8%) regions within the Tongil genome. The next most abundant motif after AT/TA was CT/GA, and CGC was the most abundant motif among tri-nucleotide repeats. The di-, tri-, and tetra-nucleotide repeat patterns differed from those of the reference Nipponbare genome (McCouch et al. [2002]; Zhou et al. [2005]) and from those of wheat (Weng et al. [2005]). A total of 90.1% of SSR motifs in the Tongil genome were from indica, 9.6% were from japonica, and 0.3% were from an unknown genome (Additional file 10: Figure S6).

Distribution of yield-related genes in the Tongil genome

One of the most important aims of this study was to explore which regions of the indica and japonica parental genomes have introgressed into the Tongil variety to provide its high-yield potential. Tongil is morphologically characterized by short plant height, lodging resistance, open plant architecture, medium-long erect leaves, thick leaf sheaths and culms, relatively long panicles, and easily shattered grain (Chung and Heu [1980]) (Figure 1). Although these phenotypic characteristics affect Tongil's high-yield potential, to date we have no molecular genetic evidence regarding the nature of these traits, with the exception of semi-dwarf gene 1 (sd1) (Chung and Heu [1980]). Therefore, we analyzed several well-characterized genes associated with high yield potential in the Tongil genome: sd1 (Nagano et al. [2005]; Sasaki et al. [2002]; Monna et al. [2002]), Ghd7 (Liu et al. [2009]).

In each window, the proportion of SNPs originating from each parent was examined for genotype calling. Huang et al. ([2009]) determined the optimum window size by calculating the probability of finding a specific number of japonica SNPs in a window based on SNP error rates. Recent improvements in sequencing technology, however, have resulted in fewer errors in SNP identification. Thus, the method suggested by Huang et al. ([2009]) was not directly applicable in this study. Even with a window size of 2, for example, calling accuracy could reach 99.99%. Instead of calculating this probability, the optimum window size was determined iteratively by comparing the proportion of japonica SNPs (O) with the proportion of the genome originating from japonica (P). Tongil was resequenced to obtain SNPs originating from its parents and to calculate the percentage of japonica SNPs in each chromosome. SEG-Map software (Huang et al. [2009]) was also used for genotype calling on each chromosome. Because the optimum window size was unknown, a range of window sizes from 1 to 199 was used. Then, the Nash-Sutcliffe efficiency (E) between O and P was calculated as follows:

$$E = 1 - \frac{\sum_{i=1}^{n} \left(O_i - P_i\right)^2}{\sum_{i=1}^{n} \left(O_i - O_m\right)^2} \tag{1}$$

Here, an individual chromosome is denoted by i and the number of chromosomes by n. The percentage of japonica SNPs averaged across chromosomes is denoted by O_m. The optimal window size was defined as that with the maximum value of E; values of E ranged from -29 to 0.963. This maximum value of E occurred with a window size of 9. The E value for indica SNPs was at its second highest (0.966) with a window size of 9. At a window size of 10, the E value dropped rapidly for both japonica SNPs (0.037) and indica SNPs (-0.018). Thus, a window size of 9 was selected as the optimum for data analysis (Additional file 7: Figure S3).
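The selection rule described above (compute E for every candidate window size and keep the maximum) can be sketched in a few lines. This is an illustrative sketch only: the function names and the toy per-chromosome percentages are hypothetical and are not data from this study.

```python
from typing import Dict, Sequence, Tuple


def nash_sutcliffe(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Nash-Sutcliffe efficiency E between observed (O) and predicted (P) values.

    E = 1 - sum((O_i - P_i)^2) / sum((O_i - O_m)^2), where O_m is the mean of O.
    """
    n = len(observed)
    if n == 0 or n != len(predicted):
        raise ValueError("observed and predicted must be non-empty and equal in length")
    mean_obs = sum(observed) / n
    numerator = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    denominator = sum((o - mean_obs) ** 2 for o in observed)
    if denominator == 0:
        raise ValueError("observed values have zero variance; E is undefined")
    return 1 - numerator / denominator


def best_window_size(
    observed: Sequence[float],
    predicted_by_window: Dict[int, Sequence[float]],
) -> Tuple[int, Dict[int, float]]:
    """Return the window size with the highest E, plus E for every window size."""
    scores = {w: nash_sutcliffe(observed, p) for w, p in predicted_by_window.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical toy data: percentage of japonica SNPs (O) on three chromosomes,
# and the percentage of japonica genome (P) called at two window sizes.
toy_observed = [10.0, 20.0, 30.0]
toy_predicted = {3: [12.0, 18.0, 33.0], 9: [10.0, 21.0, 30.0]}
toy_best, toy_scores = best_window_size(toy_observed, toy_predicted)
```

E equals 1 for a perfect match and 0 when the prediction is no better than the mean of the observations; it can become strongly negative for poor predictions, which is consistent with the wide range of E values reported above.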

Parental genome composition of Tongil

We compared DNA variation between the parental and Tongil genomes. Genomic regions originating from the japonica (Yukara) and indica (TN1 or IR8) parents were identified by comparing the Tongil genome sequence to the parental sequences. Indica and japonica regions in the Tongil genome sequence were estimated using the method of Zhao et al. ([2010a]).
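The idea of assigning parental origin by allele comparison and then calling each window can be illustrated with a simplified sketch. It is not the SEG-Map algorithm or the method of Zhao et al.: the decision rule (a Tongil allele is informative only if it matches one subspecies but not the other), the use of non-overlapping windows, and the majority vote are all assumptions made for illustration, and the function names are hypothetical.

```python
from collections import Counter
from typing import List


def snp_origin(tongil: str, yukara: str, ir8: str, tn1: str) -> str:
    """Classify one SNP site by which parent the Tongil allele matches."""
    matches_japonica = tongil == yukara
    matches_indica = tongil in (ir8, tn1)
    if matches_indica and not matches_japonica:
        return "indica"
    if matches_japonica and not matches_indica:
        return "japonica"
    return "unknown"  # matches both subspecies or neither


def call_windows(origins: List[str], window_size: int) -> List[str]:
    """Call each non-overlapping window of consecutive SNPs by majority vote."""
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    calls = []
    for start in range(0, len(origins), window_size):
        window = origins[start:start + window_size]
        counts = Counter(o for o in window if o != "unknown")
        if counts["indica"] > counts["japonica"]:
            calls.append("indica")
        elif counts["japonica"] > counts["indica"]:
            calls.append("japonica")
        else:
            calls.append("unknown")  # tie, or no informative SNPs
    return calls
```

For example, `call_windows(["indica", "indica", "japonica", "japonica", "japonica", "unknown", "unknown"], 3)` yields one indica window, one japonica window, and one unknown window. Larger windows smooth over isolated miscalled SNPs at the cost of resolution at introgression boundaries.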

Gene ontology and classification

Annotated Nipponbare reference genes were classified based on parental origin in the Tongil genome and assigned to the three main GO-term categories (cellular component, molecular function, and biological process) using BLAST2GO software (www.blast2go.com) (Conesa et al. [2005]).

Simple sequence repeats (SSRs)

SSR loci were searched using SSR search software (Initiative [2000]) and classified with respect to their parental origin.
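As an illustration of what an SSR search involves, the following sketch finds perfect di-, tri-, and tetra-nucleotide repeats with regular expressions. It is not the SSR search software cited above: the function name and the minimum repeat counts (6, 5, and 4 units) are assumptions chosen only for the example.

```python
import re
from typing import Dict, List, Optional, Tuple


def find_ssrs(
    sequence: str,
    min_repeats: Optional[Dict[int, int]] = None,
) -> List[Tuple[int, str, int]]:
    """Find perfect SSRs; return (0-based start, motif, repeat count) tuples.

    min_repeats maps motif length to the minimum number of tandem copies.
    The default thresholds are illustrative assumptions.
    """
    if min_repeats is None:
        min_repeats = {2: 6, 3: 5, 4: 4}
    seq = sequence.upper()
    hits = []
    for unit_len, min_rep in min_repeats.items():
        # One captured motif followed by at least (min_rep - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are themselves repeats of a shorter unit
            # (e.g. "AA" or "ATAT"), so each locus is reported once.
            is_compound = any(
                motif == motif[:k] * (unit_len // k)
                for k in range(1, unit_len)
                if unit_len % k == 0
            )
            if is_compound:
                continue
            hits.append((match.start(), motif, len(match.group(0)) // unit_len))
    return hits


# Hypothetical toy sequence: an (AT)7 repeat and a (CGC)5 repeat with flanks.
toy_sequence = "GG" + "AT" * 7 + "GGTT" + "CGC" * 5 + "AA"
toy_hits = find_ssrs(toy_sequence)
```

Each hit could then be classified by parental origin by looking up which called region (indica, japonica, or unknown) contains its start coordinate.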

Authors' contributions

BK and HK conceived of the study and participated in its design. IC and BC performed bioinformatic analysis and data processing. BK and JL collected samples and phenotype data. DK, BK, GL, and JS analyzed the data and helped to draft the manuscript. TY, KK, DK, and JC helped to revise the manuscript. All authors read and approved the final manuscript.

Additional files