Background

Rice (Oryza sativa L.) is the most important cereal and the staple food for more than 50% of the global population, and 90% of the world’s rice is grown in Asia [1]. The world population is expected to rise to 10 billion by 2050, and agricultural production needs to expand by 70% to feed this population. It was estimated that the world population would require 763 million tons of rice in 2020, and 852 million tons by 2035 [2, 3]. Harnessing heterosis is one of the major approaches to increasing rice yield and has made a great contribution to food security in China and many other countries [2, 7]. The photo or temperature sensitive genic male sterile (PTGMS) lines have occupied millions of hectares of rice field in China for more than a decade [11,12,13,16].

It enables researchers to readily select sub-sets of informative SNPs for use on smaller, in-house platforms for immediate applications in various genomic and molecular approaches including marker-assisted or genomic selection, genome-wide polymorphisms in high throughput QTL and association mapping. It is also used to select highly targeted sets of SNPs for high-resolution haplotype analysis and gene discovery. Several medium-density and high-density chip arrays have been developed for rice [17]. These SNP assays have been developed at different densities, for example the 50 K-SNP chip [18], C6AIR [19], the RICE6K [21], and the 700 K-SNP High Density Rice Array [22]. The SNP density required to meet these criteria in rice is ~6000–7000 markers, due to the significant differences in SNP distribution and frequency that characterize the deeply differentiated subpopulations of O. sativa [19]. The availability of high-density SNP chips for rice makes it possible to undertake large-scale, high-throughput germplasm characterization, enhancing the value of the genetic resources available in the world’s major germplasm repositories.

In this study, 131 two-line PTGMS rice lines were clustered on the basis of sterility genes. The clusters were further genotyped with a 56 K whole-genome rice SNP chip and the genetic relationship among the materials was analyzed. The core breeding collection of these materials was screened and the selective sweeps that occurred during the breeding process were investigated. The results provide important insights into the narrow genetic basis of the available PTGMS gene pool and serve as a reference for future two-line hybrid breeding and research programs in rice.

Results

Screening for photo- and temperature-sensitive nuclear genes

The collection of 131 two-line male sterile lines was screened for the photosensitive male sterility (pms3) and temperature-sensitive male sterility (tms5) genes to identify the similarity among them. The total gene pool was divided into four major populations. Population 1 (P1) contained 12 lines (9.16%) possessing pms3 and was denoted as the photosensitive genic male sterile population. Population 2 (P2) contained 101 lines (77%) possessing tms5 and was denoted as the thermo-sensitive genic male sterile population. Only 5 lines (3.82%) possessed both pms3 and tms5; they were classified as population 3 (P3) and denoted as the photo/thermo-sensitive genic male sterile population. The remaining 13 lines (9.92%) were normal breeding lines without any PTGMS gene; they were classified as population 4 (P4) and denoted as the conventional breeding population (Additional Table 1).
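The grouping rule described above can be written as a small decision function. The following is a minimal sketch: the function names and the boolean marker calls are hypothetical, and the input list is simply rebuilt from the line counts reported in this section rather than from the actual marker data.

```python
from collections import Counter

def classify_line(has_pms3: bool, has_tms5: bool) -> str:
    """Assign a line to P1-P4 from the presence of the two sterility genes."""
    if has_pms3 and has_tms5:
        return "P3"  # photo/thermo-sensitive genic male sterile
    if has_pms3:
        return "P1"  # photosensitive genic male sterile
    if has_tms5:
        return "P2"  # thermo-sensitive genic male sterile
    return "P4"      # conventional breeding line, no PTGMS gene

def population_shares(calls):
    """Return {population: (number of lines, percent of all lines)}."""
    counts = Counter(classify_line(pms3, tms5) for pms3, tms5 in calls)
    total = sum(counts.values())
    return {pop: (n, round(100 * n / total, 2)) for pop, n in sorted(counts.items())}

# Hypothetical calls rebuilt from the reported counts (12, 101, 5 and 13 lines).
calls = (
    [(True, False)] * 12
    + [(False, True)] * 101
    + [(True, True)] * 5
    + [(False, False)] * 13
)
shares = population_shares(calls)
for pop, (n, pct) in shares.items():
    print(pop, n, pct)
```

Checking the double-positive case first keeps the P3 lines from being counted in P1 or P2.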

Genotyping and evaluation of genomic variants

In the present study, we genotyped the 131 lines with a 56 K SNP marker chip. The average SNP density was 1.52 SNPs per 10 kb genomic region, ranging from 1.32 on chromosome 12 to 1.74 on chromosome 3. The average filtered SNP density was one SNP per 15.5 kb genomic region, which varied from one SNP per 9.7 kb for P4 to one SNP per 26 kb for P3 (Fig. 1). The genome coverage ranged from 91% for chromosome 1 to 100% for most of the other chromosomes. The marker density was much higher than the 12 SNPs per Mb of the intergenic SNP based assay [19] but relatively lower than the previously reported 0.745 per kb 50 kb density in gene single copy-based chip [23]. Hence, the marker density in this study is suitable for investigating the genetic diversity among genotypes.
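The densities in this paragraph are quoted in different units (SNPs per 10 kb, kb per SNP, SNPs per Mb), so converting them to a common scale makes the comparison easier to check. The sketch below does only that arithmetic; the helper names are hypothetical and are not part of the original analysis.

```python
def per_mb_from_per_10kb(snps_per_10kb: float) -> float:
    """Convert 'SNPs per 10 kb' to 'SNPs per Mb' (1 Mb = 100 x 10 kb)."""
    return snps_per_10kb * 100.0

def per_mb_from_kb_per_snp(kb_per_snp: float) -> float:
    """Convert 'one SNP per N kb' to 'SNPs per Mb' (1 Mb = 1000 kb)."""
    return 1000.0 / kb_per_snp

# Values taken from the text; the dictionary labels are only for display.
densities_per_mb = {
    "chip average (1.52 SNPs per 10 kb)": per_mb_from_per_10kb(1.52),
    "filtered average (1 SNP per 15.5 kb)": per_mb_from_kb_per_snp(15.5),
    "filtered, P4 (1 SNP per 9.7 kb)": per_mb_from_kb_per_snp(9.7),
    "filtered, P3 (1 SNP per 26 kb)": per_mb_from_kb_per_snp(26.0),
    "intergenic assay [19]": 12.0,
}

for label, value in densities_per_mb.items():
    print(f"{label}: {value:.1f} SNPs per Mb")
```

On this scale the chip average is about 152 SNPs per Mb and the filtered average about 65 SNPs per Mb, both well above the 12 SNPs per Mb of the intergenic assay.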

Fig. 1

SNP marker density per Mb on the 12 chromosomes of the rice genome, genotyped using the 56 K SNP chip for 118 PTGMS lines and 13 conventional breeding lines

Genetic purity

Genetic purity among the genotypes can be evaluated by the available homozygous or heterozygous alleles. In the current study, we assessed the frequency of homozygous and heterozygous alleles per locus and estimated the nucleotide diversity and Shannon’s index (I). The genome of all four populations was highly homozygous with 93 to 97% homozygous alleles. The lowest population heterozygosity among alleles was observed in P4 (2.66%) followed by P3 (3.14%), while the maximum heterozygosity was observed for P1 (6.80%) followed by P2 (3.70%) (Fig. 2, Additional Table 1).
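As an illustration of the per-locus quantities mentioned above, the following minimal sketch computes observed heterozygosity and a biallelic Shannon's index from genotype calls. It assumes genotypes coded as the number of alternate alleles (0, 1 or 2) with None for missing calls; this coding, the function names, and the use of the natural logarithm are assumptions, since the text does not state how the index was calculated.

```python
import math

def observed_heterozygosity(genotypes):
    """Fraction of called genotypes that are heterozygous (coded 1)."""
    called = [g for g in genotypes if g is not None]
    return sum(1 for g in called if g == 1) / len(called)

def shannon_index(genotypes):
    """Shannon's index I = -sum(p_i * ln p_i) over the two alleles of one SNP."""
    called = [g for g in genotypes if g is not None]
    p_alt = sum(called) / (2 * len(called))  # alternate allele frequency
    freqs = (p_alt, 1.0 - p_alt)
    return -sum(p * math.log(p) for p in freqs if p > 0)

# Hypothetical locus: three homozygous-1, three homozygous-2, one heterozygous, one missing.
locus = [0, 0, 2, 2, 1, None, 0, 2]
print(observed_heterozygosity(locus))  # 1 of 7 called genotypes
print(shannon_index(locus))            # equals ln(2) because both alleles are at 0.5
```

Averaging the first quantity over all loci within a population would give a population-level heterozygosity comparable to the percentages reported here.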

Fig. 2

The trend and relationship among four populations for various genetic diversity parameters, including Number homozygous-1 alleles, Number of heterozygous alleles, Number homozygous-2 alleles, P-values for the Hardy Weinberg equilibrium, Genetic diversity parameter (π), and Shannon’s diversity index (I). The populations (P1-P4) were defined based on the presence of photosensitive male sterility (pms3) and temperature sensitive male sterility (tms5). P1 has pms3, P2 has tms5, P3 has both pms3 and tms5, and P4 is the conventional breeding lines without male sterility genes

Genome wide nucleotide diversity

The genetic diversity among 118 male sterile lines and 13 conventional breeding lines was evaluated. The population’s nucleotide diversity (π) was low, ranging from 7.54 × 10⁻⁸ to 4.29 × 10⁻⁴. This indicates close relatedness among the genotypes and suggests a limited number of available signature genes in the germplasm. Among the four populations, we observed the lowest genome-wide π value in the P4 (male fertile) genotypes (π = 0.0000339), while the highest genetic diversity (π = 0.0000503) was in P1. The values for P2 (π = 0.0000348) and P3 (π = 0.0000347) were approximately the same (Fig. 2, Additional Table 1). Within each population, the widest range of genetic diversity was observed for P3, while the narrowest was in P2.
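To make the scale of these π values easier to interpret, the sketch below shows one common estimator of nucleotide diversity for biallelic SNPs: the per-site value is the probability that two randomly drawn chromosomes differ, and the window value divides the sum over SNPs by the window length in base pairs. This is a sketch of the general technique under those assumptions, not the exact routine of the software used in this study, and the function names are hypothetical.

```python
def site_pi(alt_count: int, n_chrom: int) -> float:
    """Unbiased per-site diversity: n/(n-1) * 2p(1-p) for a biallelic SNP."""
    p = alt_count / n_chrom
    return n_chrom / (n_chrom - 1) * 2.0 * p * (1.0 - p)

def window_pi(site_alt_counts, n_chrom: int, window_bp: int) -> float:
    """Average diversity per base pair over a window of window_bp bases."""
    return sum(site_pi(c, n_chrom) for c in site_alt_counts) / window_bp

# Hypothetical 100 kb window with three chip SNPs scored on 10 chromosomes:
# one at intermediate frequency and two that are fixed in this sample.
print(site_pi(5, 10))                       # 25 differing pairs out of 45
print(window_pi([5, 0, 10], 10, 100_000))   # per-bp value for the window
```

Because only chip SNPs contribute to the numerator while the full window length enters the denominator, per-base values of the order of 10⁻⁵ are expected from array data.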

Genome-wide genetic differentiation

Although the genetic differentiation between the whole male sterile population (P1, P2, P3) and the conventional lines was not very high (weighted Fst = 0.078), the highest levels (mean weighted Fst > 0.2) of genetic differentiation were revealed between P1 and the other populations. It was maximum with P4 (weighted Fst = 0.252) followed by P3 (weighted Fst = 0.178) and P2 (weighted Fst = 0.108) (Fig. 3, Table 1). The level of genetic differentiation between P3 and P4 was similar to that of P1 from P2. On the other hand, the genetic differentiation of P2 from P3 and P4 was not very high (weighted Fst = 0.045 and 0.099, respectively). This suggests a driving force of the pms3 and tms5 genes in shaping the genetic variation pattern and indicates the genetic similarity of P2 and P3. The linkage disequilibrium (LD) was estimated by r² for the distance classes of < 1 kb in 30 kb distance around the loci pms3 and tms5. The average r² value fell below the threshold of r² = 0.5 at a distance of 3.9 kb for markers around the tms5 locus, while it remained above the threshold for the pms3 locus (Additional Fig. 5).
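The difference between a "weighted" and a "mean" Fst can be illustrated with a short example. The sketch below uses Hudson's estimator, chosen here only because it is compact; it is an assumption for illustration and not necessarily the estimator behind the values reported above. Each site is described by a hypothetical tuple (p1, n1, p2, n2), where p is the allele frequency and n the number of sampled chromosomes in each population.

```python
def hudson_terms(p1: float, n1: int, p2: float, n2: int):
    """Numerator and denominator of Hudson's Fst for one biallelic site."""
    num = (p1 - p2) ** 2 - p1 * (1 - p1) / (n1 - 1) - p2 * (1 - p2) / (n2 - 1)
    den = p1 * (1 - p2) + p2 * (1 - p1)
    return num, den

def weighted_fst(sites) -> float:
    """Ratio of summed numerators to summed denominators across sites."""
    terms = [hudson_terms(*site) for site in sites]
    return sum(num for num, _ in terms) / sum(den for _, den in terms)

def mean_fst(sites) -> float:
    """Simple average of the per-site ratios (sites with den == 0 are skipped)."""
    terms = [hudson_terms(*site) for site in sites]
    ratios = [num / den for num, den in terms if den > 0]
    return sum(ratios) / len(ratios)

# Two hypothetical sites scored in 12 lines (24 chromosomes) and 13 lines (26 chromosomes):
# one fixed for different alleles, one with identical frequencies in both groups.
sites = [(1.0, 24, 0.0, 26), (0.5, 24, 0.5, 26)]
print(weighted_fst(sites), mean_fst(sites))
```

The weighted form lets informative sites dominate the estimate, whereas the simple mean treats every site equally, which is why the two values can differ noticeably for the same window.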

Fig. 3

The genetic differentiation evaluated by Fst values and the ratio of π-values for P4 versus the male sterile populations (P1, P2, P3). The red boxes on chromosomes (Chr) 2, 4, and 6 indicate the top selective sweeps

Table 1 The Fst values for pairwise comparison among populations

Phylogenetic cluster analysis of the four populations of PTGMS lines of rice

The phylogenetic analysis was performed with all selected SNP markers to reveal the ancestral relation among the 118 male sterile lines and 13 conventional breeding lines. According to the neighbor joining (NJ) tree, all the lines could be divided into four clades at the 0.05 genetic distances (Fig. 4). Among them, the first clade consisted of five lines (N5088S, Nongken58S, Wan2304S, 7001S, and N95076S). These lines were at the maximum genetic distance of 0.29 and were the root of the phylogenetic tree. These lines could be the ancestors of other male sterile lines. The second clade was composed of six genotypes, including five genotypes (GD-1S, H03S, S242, S240, and 1103S) belonging to P1, and one genotype (11Fan17S) belonging to P2. Four genotypes (H03S, S242, S240, and 1103S) in this clade grouped together and showed a relatively high distance from other genotypes, while the two other genotypes from P1 (GD1S) and P2 (11Fan17S) made the root of remaining genotypes. These genotypes might be the progenitors of clade 1 lines and the ancestors of the remaining as they showed close relation to other genotypes. The remaining genotypes from P2 and P3 were grouped with the genotypes from P4, indicating their genetic resemblance.

Fig. 4

Neighbor joining (NJ) phylogenetic tree of 131 selected genotypes indicating the genetic distances based on SNP markers. The different clusters were indicated in different branch colors. The cluster with green and blue branch color belongs to P1 indicating the root of tree, the cluster with red branch color possessed the conventional breeding lines. The clusters with black and pink branch colors belong to P2 and P3, respectively

Principal component analysis of the four populations of PTGMS lines of rice

The results were further supported by the principal component analysis (PCA). The top two principal components, PC1 and PC2, explained 19.77% and 7.62% of the total variation, respectively, and divided the germplasm into two major categories. The next two components, PC3 and PC4, explained 6.23% and 3.82% of the total variation, respectively (Additional Fig. 1). One of the two clusters grouped the five breeding lines of P1 (clade I of the phylogenetic tree) at a wide genetic distance from the other clades, while the other cluster contained all of the genotypes from P2, P3 and P4 at a narrow genetic distance. Being independent from the others, Cluster 1 harbored the highest level of genetic differentiation (Table 1, Fig. 5, Additional Fig. 1). Its unique genetic variation pattern was also evident in the top two PCs. The second cluster could be further classified into three overlapping groups (Fig. 5).

Fig. 5

Principal component analysis of 131 rice genotypes, with PC1 and PC2 classifying the whole germplasm into clusters

Admixture cluster analysis of the four populations of PTGMS lines of rice

To infer the admixture degree across the 131 samples, we further performed an unsupervised admixture analysis with 56 K SNP markers based on K run from 2 to 4. We found that at K = 2, a genetic divergence occurred between the P1 genotypes and their close relatives, while K = 3 and K = 4 sub-divided the groups. At K = 4, the whole germplasm was divided into four (S1, S2, S3, S4) groups of 28, 5, 23 and 75 genotypes, respectively. Except for the genetically distant five genotypes from P1 grouped as S2 in the structure analysis, a potential widespread genetic introgression from conventional breeding lines of P4 to other populations was observed across K = 2 to K = 4 (Fig. 6). There were 10, 10 and 13 genetically pure lines in S1, S3 and S4, and the remaining showed genomic introgression. These results reinforce the previous analysis with pure lines and mixed genomic lines [21, 24]. Among the populations, 5, 31, and 1 genotype from P1, P2, and P4, respectively, were observed to be pure lines. In contrast, all the conventional breeding lines in P4 had the introgressed genomic components. The pure lines specifically the genotypes in S2 are likely the ancestors of the remaining germplasm.

Fig. 6

Population admixture analysis of 131 genotypes up to K = 4, indicating the genetically stable and mixed genotypes

Genome-wide selective sweep signals, their molecular function and validation

To better detect genome-wide selection signals related to male sterility in the genotypes, we divided the populations into male sterile (P1, P2, P3) and male fertile (P4) groups. High Fst values (top 1%, Fst > 0.34) were used as the criterion for classifying selective sweeps. There was no selective sweep on chromosomes 3, 5, and 8. A total of 1044 candidate genes were found within the sweeps detected on the other chromosomes. Some of these genes could be associated with sterility (Additional Table 3). Five sweeps located on chromosomes 1, 2, 4 and 6 exhibited high Fst values (0.888, 0.718, 0.652, 0.650, 0.643), indicating obvious genetic differentiation between the male sterile and fertile populations. The largest genomic region, of 2 Mb and containing 376 candidate genes, was observed on chromosome 2, followed by a 1 Mb region on chromosome 4 containing 174 candidate genes (Additional Table 2). Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis revealed that the candidate genes in the selective sweeps were mainly involved in the ‘Alanine, aspartate and glutamate metabolism’ and ‘ABC transporters’ pathways (Fig. 7A). Gene ontology (GO) analysis revealed 113 GO terms, of which molecular binding was identified as the top enriched ‘Molecular function’ (Fig. 7B).
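The top 1% criterion can be sketched as an empirical cutoff on the window-level Fst values, followed by merging of overlapping outlier windows into candidate regions. This is a minimal sketch: the tuple layout (chromosome, start, end, Fst), the function names, and the merging rule are assumptions made for illustration.

```python
def top_fraction_cutoff(values, fraction=0.01):
    """Smallest value that still belongs to the top `fraction` of values."""
    ranked = sorted(values, reverse=True)
    k = max(1, round(len(ranked) * fraction))
    return ranked[k - 1]

def outlier_windows(windows, fraction=0.01):
    """Keep windows whose Fst is at or above the empirical cutoff."""
    cutoff = top_fraction_cutoff([w[3] for w in windows], fraction)
    return [w for w in windows if w[3] >= cutoff]

def merge_windows(hits):
    """Merge overlapping or adjacent windows on the same chromosome, keeping the peak Fst."""
    merged = []
    for chrom, start, end, fst in sorted(hits):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2] + 1:
            _, prev_start, prev_end, prev_fst = merged[-1]
            merged[-1] = (chrom, prev_start, max(prev_end, end), max(prev_fst, fst))
        else:
            merged.append((chrom, start, end, fst))
    return merged

# Hypothetical outlier windows: two overlapping windows on chromosome 4, one on chromosome 2.
hits = [
    ("4", 7_900_001, 8_000_000, 0.70),
    ("4", 7_920_001, 8_020_000, 0.888),
    ("2", 4_860_001, 4_960_000, 0.65),
]
regions = merge_windows(hits)
print(regions)
```

Genes whose coordinates fall inside the merged regions would then form the candidate gene list passed to the enrichment analyses.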

Fig. 7

Functional analysis of the candidate genes based on the (A) KEGG and (B) GO databases

To further validate the genome-wide selective sweep signals, three pedigree groups A, B, and C were obtained from the rice breeding database according to their breeding history (Additional Fig. 2). In pedigree group A, there were 7, 15 and 1 genotypes containing the pms3, tms5 and pms3 + tms5 genes, respectively, while the genotypes in groups B and C possessed the tms5 gene. The genome-wide diversity of all three groups was investigated and their genetic differentiation from the conventional genotypes (P4) was studied. A total of 185, 182 and 181 selective sweep signals were observed for pedigree groups A, B and C, respectively, in comparison with the conventional lines at the Fst threshold of the top 5% of signals (Additional Fig. 3, Additional Table 4). Among the top 1% of selective sweeps, we found the same genomic regions as identified in the top genetic differentiation hits between the male sterile (P1, P2, and P3) and conventional breeding lines (P4). The genes in the candidate regions were subjected to GO analysis. The ‘Binding’ type of molecular functions involved in ‘metabolic’ and ‘signal transduction’ processes in ‘Nucleus’ and ‘membrane complexes’ were the top hits (Additional Fig. 4).

Selection of core germplasm

All the genotypes were ranked on the basis of genetic diversity, and the top 30% were selected in the first step. Separately, the cluster analyses and the evaluation of male sterility alleles grouped the germplasm into four groups, and the genotypes in each group were ranked on the basis of their sterility allele and genetic diversity. The top 10% of genotypes commonly selected by both procedures resulted in 13 genotypes for the core collection. Among them, 2 (GD1S and N5088S) and 11 (ZhunS, Biao506S, Shen08S, **ShanS, 2301S, S204, Ke8S, Long605S, 66S, Longke638S, and 99S) genotypes were selected from P1 and P2, respectively. Furthermore, one genotype (S239) from P3 and one conventional genotype (Shuhui881) were also included in the core collection.

Discussion

Crop breeding programs aim to harness genetic diversity for desirable phenotypes to meet human demands [25]. In the pursuit of the ideal genotype, the bottleneck effect of phenotypic selection in elite varieties during rice breeding programs has dramatically narrowed their genetic diversity [26]. However, information about the genes that generated the changes in desirable phenotypes in elite rice varieties is limited. Some genes may even cause the transition of a PGMS line to a TGMS line [27]. In this study, the photosensitive (pms3) and temperature-sensitive male sterility (tms5) genes classified the collection of 131 two-line male sterile lines into four major populations. The information about available PTGMS genes and markers will help not only the direct selection of genotypes in hybrid breeding but also future research programs. Based on these genetic markers, other germplasm resources could also be evaluated and manipulated to enhance genetic diversity. In addition, genome-wide marker analysis of various rice populations has demonstrated that shifts in genetic population structures have occurred multiple times in history [28]. The shift of genetic diversity in the local gene pool may be managed through the use of germplasm for human demands in rice breeding programs [28].

Genetic purity

Maintenance of genetic purity in inbred lines by minimizing residual heterozygosity (heterogeneity) is important for quality seed production [29]. The threshold value may vary depending on the purpose of the line development program and the level of inbreeding. In the current study, the genome of all four populations revealed a high level of homozygosity with 93 to 97% homozygous alleles (Fig. 2, Additional Table 2). Generally, male sterile lines are bred by backcrossing for many generations. Therefore, they may show morphological differences but have minor to negligible differences in the nuclear genome. This level of heterogeneity may be due to the use of different methods for line maintenance or to natural variation through crossing over. Currently, there is more demand for developing uniform hybrids using genetically pure parental lines, especially doubled haploid lines [29]. As a result, rice breeders are using fixed lines in their new pedigree starts and advance each generation through selfing rather than sib-mating. As a long-term solution to improve homogeneity, it is recommended to use doubled haploid (DH) technology for developing genetically pure DH lines, which can be derived in a short period of time [29].

Genome wide nucleotide diversity

Nucleotide diversity (π) was low in all four populations (Fig. 2, Additional Table 2), which indicated a close relation among the genotypes and suggested a limited number of available signature genes in the germplasm. The lowest genetic diversity in P4 revealed that the conventional breeding lines are facing a more intensive selection pressure than the PTGMS lines. The widest range of genetic diversity was observed for P3, while it was the narrowest in P2. Among the male sterile populations, P2 and P3 were selected at similar levels and showed a higher selective sweep than P1. In general, average genetic diversity values of 0.67 and 0.90 have been observed with SSR markers within Asian cultivated rice and common wild rice, respectively [30]. The analysis of 4408 accessions of Chinese cultivated rice germplasm with 12 isozyme loci reported an average gene diversity ranging from 0.012 to 0.547 [31]. In our experiment, the genetic diversity of the tested materials was very low. This indicates genetic similarity among the genotypes and a similar pedigree, resulting in decreased nuclear genome diversity.

Genetic relationship and population structure among the two-line hybrid rice lines

To understand the genetic admixture, the relative kinship coefficients are used as indicators of the genetic relationship among pairs of genotypes. The values of admixture ranged from zero for a lack of relation to higher values for stronger relationships. The results from the admixture and PCA analyses in this study revealed that the genetic divergence first occurred between the P1 genotypes and their close relatives. The presence of pure lines in each subgroup reinforced the population admixture. The admixture results were reinforced by the phylogenetic analysis, which revealed the ancestral relationship among the 131 lines of rice (Fig. 4). As a whole, the majority of the genotypes could not be well grouped into specific clusters, as indicated by the phylogenetic and admixture analyses, which is suggestive of a complex genetic structure. Besides, it was hard to separate the rice breeding lines of different populations from each other, indicating high admixture among them.

Genome-wide genetic differentiation

The accumulation of differences in allelic frequencies between completely or partially isolated populations due to evolutionary forces such as selection or genetic drift can be evaluated by population differentiation analysis. The population differentiation was studied by Fst values. The P1 population was genetically differentiated from the other populations. The differentiation of P3 from P4 did not differ significantly from that of P1 from P2. This suggests a driving force of the pms3 and tms5 genes in shaping the genetic variation pattern, and indicates the genetic similarity of P2 with P3. The results may not totally support the arguments given by [32, 33] that a critical role might have been played by distance-based isolation in shaping the genomic variation. Hence, strong artificial selection can be proposed as the main driving force in shaping rice genomic variation.

Genome-wide selective sweep signals and the role of candidate genes

The genome-wide selection signals related to male sterility in the genotypes were also observed by comparing the male sterile (P1, P2, P3) and male fertile (P4) groups. Various traits in plants, such as plant height, seed color, and stem angle, can be influenced by selective sweeps [29, 34]. The study on distinct phenotypic evaluation of rice revealed the physiological and morphological effects of selective sweeps on rice breeding [29, 34]. The candidate genes identified in this study can be functionally characterized for their roles in genetic differentiation and induction of male sterility.

As shown in the Manhattan plot (Fig. 3, Additional Table 3), the highest Fst score (0.888) was observed within a 120 kb interval on chromosome 4 (4:7,920,001–8,040,000). This region harbored 19 protein-coding genes, including (i) the RPM1 disease resistance protein, which facilitates a rapid and sustained increase in cytosolic calcium that is necessary for the oxidative burst and hypersensitive cell death [35]; (ii) a TRAF-type zinc finger family protein, whose functions are extraordinarily diverse and include DNA recognition, RNA packaging, transcriptional activation, regulation of apoptosis, protein folding and assembly, and lipid binding [36]; and (iii) a wall-associated receptor-like kinase (OsWAK) protein, which plays a central role in resistance to a range of fungal and bacterial diseases [37].

Two of the selective sweeps on chromosome 2 (2:4,860,001–5,080,000 and 2:6,080,001–6,600,000) consisted of 48 and 103 candidate genes, respectively. Among the candidate genes, we have (i) AP2 domain containing genes which are necessary for flower development, stem cell maintenance, seed development, and abiotic stress resistance [1).

The germplasm was obtained from Hunan Hybrid Rice Research Center, Hunan, China, and grown on the experimental farm of the same institute. All the standard cultivation practices were adopted. Fresh leaves of at least six plants of each male sterile line were collected and total DNA was extracted by the simple CTAB method with minor modifications [45]. The DNA samples (50–100 ng/μL per sample) were prepared in a 96-well plate format for genotyping with the high-density Illumina 56 K Infinium SNP chip at Huazhi Biotech Co. Ltd., Changsha, China. Monomorphic markers, with missing values < 20%, with a minor allele frequency < 5%, and/or showing unclear SNPs were excluded from the analysis. The filtered genotypic data were used in subsequent analysis.

PCR reaction and genomic screening

Two male sterility genes, pms3 and tms5, were used to characterize the germplasm. The polymerase chain reaction (PCR) was conducted in a 20 μL volume containing 3 μL primers, 2 μL 10 × PCR buffer (containing 15 mmol/L MgCl2), 0.3 μL dNTPs (10 mmol/L), 50–100 ng template DNA, and 1 U Taq enzyme. The reaction program consisted of pre-denaturation at 94 °C for 5 min, followed by 35 cycles of 1 min at 94 °C, 1 min at 55 °C and 1 min at 72 °C, and a final extension for 5 min at 72 °C. The amplified products were then electrophoresed on a 6% denaturing polyacrylamide gel. On the basis of the male sterility genes, the germplasm was clustered into four classes: genotypes carrying “pms3”, genotypes carrying “tms5”, genotypes carrying both “pms3 and tms5”, and the conventional breeding strains without any male sterility allele (Additional Table 1).

Statistical analysis

To investigate the relationship among genotypes, the population structure was calculated using all of the filtered SNPs in the model-based program Structure v2.4.2 [46]. Ten independent simulations were carried out for each K (the number of populations) ranging from 1 to 6. For each simulation, 10,000 iterations before a burn-in length of 50,000 Markov Chain Monte Carlo replications were performed with the selection of admixture and related frequency models. The LnP(D) values and the optimal K value were estimated using Evanno’s ΔK method [46] with the online tool Structure Harvester [47]. Furthermore, principal component analysis was performed. Genome-wide diversity (π) within each population and the pairwise genetic differentiation between populations were computed using VCFtools version 0.1.14 [48], with a window size of 100 kb and a step size of 20 kb [49]. The linkage disequilibrium decay rate was estimated by r² values with distance across the loci. The < 1 kb distance window for SNP pairs in 30 kb distance was used. The NJ method based on Nei’s genetic distances among genotypes [50], as implemented in MEGA X [51], was followed for the phylogenetic cluster analysis of the germplasm. The tree was visualized and edited with the Evolview online tool [52]. PowerMarker v3.25 [53] was used to identify the core germplasm for breeding. The PTGMS gene-based clusters of the germplasm were compared in seven possible combinations, viz. 1) ‘pms3’ vs ‘tms5’, 2) ‘pms3’ vs ‘pms3 + tms5’, 3) ‘pms3’ vs Normal, 4) ‘tms5’ vs ‘pms3 + tms5’, 5) ‘tms5’ vs Normal, 6) ‘pms3 + tms5’ vs Normal, and 7) ‘pms3, tms5, pms3 + tms5’ vs Normal. The results from the comparison of normal genotypes with the other (PTGMS gene-containing) lines were used to identify the candidate genes in selective sweeps and for their GO [54] and KEGG enrichment analyses [55]. Finally, the top 10% of germplasm repeatedly discovered by both methods was identified as the core germplasm for future rice breeding.
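The windowing scheme and the list of group comparisons described above can be sketched as follows. This is a minimal illustration, assuming 1-based inclusive window coordinates and a hypothetical chromosome length; the function and variable names are not taken from the original analysis, which relied on the software cited in the text.

```python
from itertools import combinations

def sliding_windows(chrom_length: int, window: int = 100_000, step: int = 20_000):
    """Yield 1-based inclusive (start, end) windows, truncated at the chromosome end."""
    start = 1
    while start <= chrom_length:
        yield start, min(start + window - 1, chrom_length)
        start += step

# Six pairwise comparisons among the four gene-based groups, plus the pooled
# comparison of all male sterile groups against the normal lines.
groups = ["pms3", "tms5", "pms3 + tms5", "Normal"]
comparisons = list(combinations(groups, 2))
comparisons.append(("pms3, tms5, pms3 + tms5", "Normal"))

# Hypothetical 250 kb chromosome, used only to show the window layout.
windows = list(sliding_windows(250_000))
print(windows[:2], len(windows))
for number, (left, right) in enumerate(comparisons, start=1):
    print(f"{number}) {left} vs {right}")
```

With a 20 kb step, consecutive 100 kb windows overlap by 80 kb, so a single strongly differentiated region is usually reported by several adjacent windows.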