Hibiscus genus plants belong to the Malvaceae family of angiosperms to which other genera such as Sterculia, Dombeya, and Pavonia also belong. Kenaf (Hibiscus cannabinus L., 2n = 2x = 36) is a member of the Hibiscus genus with potential industrial and commercial interests and identified to be an excellent source of cellulosic fiber originated from bast or stalk for paper industries1,2. In addition, uses of kenaf fiber are not only limited to textile, but also equally important for new materials industries, such as building materials, adsorbents, and composites using new and recycled plastics, etc.3. In the recent past, kenaf seeds have been proved to be potential uses in chemical and bio-energy industries4,5.

Chloroplasts are present in photosynthetically active green tissues6,7 and display a conserved structure of a circular double-stranded DNA molecule8. Cytoplasmic male sterility (CMS) is an important agronomic character, which is widely utilized for F1 hybrid breeding9. Since the discovery of kenaf CMS, achievements have been made on the mechanism of CMS in kenaf. Up to now, several studies are performed on kenaf CMS mechanism10. However, the exact mechanism of kenaf CMS has not been fully elucidated. It is generally believed that cytoplasmic male sterility is closely related to mitochondria. Nevertheless, studies also demonstrated that the chloroplasts might be associated with plant CMS6,11,12. Therefore, in the analyses of the molecular mechanism of kenaf CMS, we should pay attention to the chloroplast genome. However, at present, little is known about the chloroplast genome information of kenaf CMS line and its maintainer line as well as the relationship between the chloroplast genome and kenaf CMS.

Here, we reported the complete cp genome sequences of the kenaf CMS line P3A and its maintainer line P3B by employing the Illumina Hiseq and PacBio platforms. The comparative analysis of the chloroplast genomes among Malvales was performed. SNP and Indel between the two lines were also detected, analyzed, and validated. This study characterized the chloroplast genomes of P3A and P3B and unveiled their discrepancies, which provided basic data for further study of kenaf CMS mechanism.

Results

Genome sequencing and assembly

The DNA bands of P3A and P3B were clear and the DNA was free of protein, pigment, and other impurities (Supplementary Fig. S1). In addition, OD260/280 and OD260/230 were about 1.8 and 2.0, respectively. It was indicated that the DNA quality, concentration, and total amount of DNA met the requirements of subsequent experiments (Supplementary Table S1). Then the chloroplast genomes of kenaf CMS line P3A and its maintainer line P3B were sequenced using Illumina Hiseq and PacBio platforms. About 6.3G and 3.7G raw data were generated while 6.1G and 3.5G clean data were produced in P3B and P3A, respectively. Q30 ratio reached 95.31% and 94.12% in P3B and P3A, respectively, indicating that the data was reliable (Supplementary Table S2).

The chloroplast genomes of P3B and P3A were assembled into circular molecule with a total length of 163,597 bp (Fig. 1) and 163,360 bp (Fig. 2), respectively. The chloroplast genome of P3B was 237 bp longer than that of P3A. The assembled genome sequences were deposited in GenBank with accession number MW446503 (P3B) and MW446504 (P3A).

Figure 1
figure 1

Circular gene map of Hibiscus cannabinus maintainer line (P3B). Genes drawn within the circle are transcribed clockwise, while those drawn outside are transcribed counter clockwise. Genes are colour-coded according to their functional groups.

Figure 2
figure 2

Circular Gene map of Hibiscus cannabinus CMS line (P3A).

General features of kenaf P3B and P3A chloroplast genomes

In this study, the gene number, gene total length, gene average length, gene length/genome, and GC content were noted as 85, 79151 bp, 931 bp, 48.38%, and 36.55% in P3B, respectively. However, in P3A all those were recorded as 83, 87032 bp, 1049 bp, 53.28, and 36.57%, respectively (Table 1; Supplementary Table S3). The chloroplast genomes of P3B and P3A were observed to contain 38 and 41 tRNA genes, respectively. Each chloroplast genome had 8 rRNA genes (Supplementary Table S4). Two protein coding genes, rps19-D2 and petN, were absent and three additional tRNA genes were observed in P3A.

Table 1 List of protein-coding genes present in Hibiscus cannabinus. L (P3B) chloroplast genome. “*”represented genes were absent in cp genome of P3A.

Comparative chloroplast genome analysis

Although the coding region was found to be more highly conserved than the non-coding region, the coding region of kenaf chloroplast genome was still different from that of other three genomes (Fig. 3). The LSC-IRB-SSC-IRA boundary regions were compared within four closely related chloroplast genomes, P3B, P3A, Abelmoschus esculentus, and Gossypium hirsutum (Fig. 4). The rps19 gene of P3B was located in IRa and IRb regions, while it was shifted to LSC region in P3A. The ycf1 gene was extended from the IRa to the LSC region in P3B, P3A, and Gossypium hirsutum. However, ycf1 was found in the junction of IRb/SSC in Abelmoschus esculentus. At the IRb/SSC boundary, the ndhF was observed with 107 bp, 229 bp, and 164 bp gap, respectively in P3B, P3A, and Gossypium hirsutum, which was located in SSC region of Abelmoschus esculentus. trnN-GUU gene was identified in IRb region with 1276 bp and 1273 bp apart from the IRb/LSC junctions of P3B and P3A, respectively, while was shifted to IRa region in Abelmoschus esculentus. trnH-GUG gene was at the junction of IRb/LSC in P3B and 79 bp apart from the IRb/LSC junction in P3A.

Figure 3
figure 3

Comparison of the cp genome sequence of P3B, P3A, Hibiscus syriacus Linn, Abelmoschus esculentus, and Gossypium hirsutum.

Figure 4
figure 4

Comparison of the borders of the LSC, SSC, and IR regions of P3B, P3A, Abelmoschus esculentus, and Gossypium hirsutum cp genomes.

Phylogenetic analysis

To analyze the phylogenetic position of kenaf within Malvales, 19 species of Malvales derived from four families, Malvaceae, Sterculiacaea, Bombacacaea, and Tiliaceae were aligned (Fig. 5). All 17 nodes were resolved well and reliable based on bootstrap value: 16 had bootstrap support of 100% and only 1 harbored the support of 99%. The phylogenetic tree showed that all the 19 species were classified into two clades. One clade included Sterculiacaea, Bombacacaea, and Tiliaceae families. Kenaf, Abelmoschus esculentus, and Hibiscus syriacus were clustered into the other clade, indicating that kenaf was more closely related to Hibiscus syriacus and Abelmoschus esculentus than cotton.

Figure 5
figure 5

Molecular phylogenetic tree of 19 species of Malvales.

SNPs analysis between P3A and P3B

To detect the cpDNA differences between the CMS line P3A and its maintainer line P3B, setting P3B as a reference sequence, the two cp genomes were aligned for SNPs analysis. Although the chloroplast genomes of P3A and P3B showed high homology, structural variation still existed (Supplementary Fig. S2). Further analysis showed that a total of 22 SNPs were detected and located in gene coding region, gene intron region, and intergenic region (Fig. 6). As shown in Table 2, a total of 9 SNPs in gene coding region were located in psbK, atpA, rpoC2, atpB, rpl20, clpP, rpoA, and ycf1, including 6 synonymous SNPs and 3 nonsynonymous SNPs (Fig. 7A). Among these 3 nonsynonymous SNPs, the mutation of atpB, rpl20, and ycf1 lead to amino acid changes. Moreover, 13 synonymous SNPs were found in intron and gene intergenic regions of P3A (Table 3). Of these, 2 SNPs were located in intron and 11 were in intergenic region.

Figure 6
figure 6

SNPs in gene coding region, gene intron and intergenic region between P3A and P3B. Note: DNA bases of P3A and P3B, guanosine (purple), thymine (red), adenine (light green), cytosine (blue). Position (red), nonsynonymous SNP in the CDS; Position (light green), synonymous SNP in the CDS; Position (yellow), SNP in the intron; Position (gray), SNP in the intergenic. Excel software was used for data processing and graph analysis.

Table 2 SNPs in gene coding region of P3A.
Figure 7
figure 7

Statistical analysis of SNPs and Indels in P3A. (A) SNPs statistical analysis of P3A. (B) Indels statistical analysis. Excel and PPT software were used for data processing and graph analysis.

Table 3 SNPs in gene intron and intergenic region of P3A.

Indel analysis between P3A and P3B

As for Indels, there were 53 Indel events, including 24 insertions and 29 deletions (Fig. 8) with different insertion and deletion size in P3A (Fig. 7B). All the 53 Indels were absent in the gene coding region and were distributed in the intergenic region of P3A (Table 4).

Figure 8
figure 8

Indels in gene intron and intergenic regions between P3A and P3B. Excel software was used for data processing and graph analysis.

Table 4 Indel statistics between P3B and P3A.

Verification of the high-throughput sequencing results

To verify the accuracy of high-throughput sequencing results, two genes, atpB and rpl20 were randomly selected for cloning and Sanger sequencing. Results showed that the sequences of atpB and rpl20 existed differences between P3A and P3B at the DNA level. The differential bases were at the position of 151th bp of atpB and 260th bp of rpl20 (Fig. 9A,C). Further analysis of the sequencing peak map exhibited that G-A and T-C mutation types were detected in atpB and rpl20, respectively (Fig. 9B,D). It was the same with the results listed in Table 2, inferring that the high-throughput sequencing results were reliable.

Figure 9
figure 9

Cloning and sequencing results of atpB and rpl20. DNAMAN (V6.0.3.99 version) and PPT software were used for graph analysis.

Discussion

The chloroplast is a very important plant organelle that has its genome and plays a crucial role in generating energy through photosynthesis9. Chloroplast genome has been used as ideal research models, particularly for phylogeny13, DNA barcoding14,15, species conservation, and genome evolution16 because of the highly conservative structure. In the present study, we presented the complete nucleotide sequence of kenaf chloroplast genomes using the Illumina Hiseq and PacBio sequencing platforms. The chloroplast genomes of P3A and P3B were fully characterized. As shown in Figs. 1 and 2, the kenaf chloroplast genome was a typical circle DNA, similar to those from Malvaceae13,17,18. Moreover, the length of the chloroplast genome of kenaf P3B and P3A was 163,597 bp (Fig. 1) and 163,360 bp (Fig. 2), respectively. They were larger than those of Malvaceae plants13,17,18. In addition, a total of 131 and 132 genes, including 85 and 83 protein coding genes, 38 and 41 tRNA genes, and 8 rRNA and 8 rRNA were detected in P3B and P3A, respectively (Table 1). The gene number of kenaf was more than Hibiscus syriacus that also belonged to the Hibiscus genus and contained 81 protein-coding genes, 29 tRNA genes, and 4 rRNA genes17. The genome size differences within the species mentioned above might be due to the species differences.

Although the overall structure, genome size, gene number, and gene order of the chloroplast genome were conserved, the junctions between SSC and IR regions were usually different in the chloroplast genomes of higher plant. The border regions of LSC/IRa, IRa/SSC, SSC/IRb, and IRb/LSC were highly variable with many nucleotide variations in the chloroplast genomes of closely related species19. In this study, we compared the IR boundary regions of the chloroplast genome from three species. The organization of the kenaf chloroplast genome with a pair of IR regions separated by the SSC and LSC regions was the same with most sequenced angiosperm chloroplast genomes, emphasizing the highly conserved nature of plant chloroplast genomes20. However, the border of the kenaf chloroplast genome was a little different from that of other chloroplast genomes (Fig. 4), which probably contributed to the chloroplast genome size differences within Malvaceae species.

It is generally believed that mitochondrial genome rearrangement and generation of new open reading frames (ORFs) changed the transcription and translation products of mitochondrial DNA, resulting in CMS21,22. Unlike mitochondria, little attention has been paid to the relationship between plant chloroplast and CMS, especially in kenaf. Chloroplast, a special subcellular organelle, is closely linked to heterosis23 and may be involved in plant CMS6,11,12. Li and Liu reported that there were some relations between cpDNA and CMS in maize, rape, and wheat24. In the rice CMS line, Hou et al. found differential fragments between the CMS line and its maintainer using AFLP molecular marker technology25. Tang et al. revealed different SNPs and Indels models in the rice CMS line6. Recently, the chloroplast genome size and component between CMS-C cytoplasm and normal cytoplasm were highly consistent, but Indels or SNPs were also detected between the male sterile lines and maintainer lines of maize26. In our investigation, 22 SNPs and 53 Indels were found between the cp genomes of P3B and P3A, which were located in gene coding region, gene intron, and intergenic region (Tables 2, 3, 4). It was consistent with the previous studies mentioned above6,26. In particular, there were a total of 9 SNPs in the gene coding region, which were located in psbK, atpA, rpoC2, rpl20, ycf1, atpB, clpP, and rpoA, respectively. It was found that most of these genes were related to the photosynthetic system or photosynthesis. Furthermore, within the nonsynonymous SNPs, phenylalanine mutated to leucine in ycf1, serine changed to glycine in atpB, and arginine altered to lysine in rpl20 (Table 2). Therefore, the cpDNA or chloroplast protein discrepancy might affect photosynthesis and energy metabolism and it was inferred that there might be some relationship between the chloroplast and kenaf CMS. CMS is the pollen abortion caused by nuclear–cytoplasm interaction27. Cytoplasmic genetic system included chloroplast and mitochondria. Nucleus, chloroplast, and mitochondria were not only independent, but also interrelated, infiltrated and influenced each other28. In the long-term evolution process, plants formed a coordinated relationship among the nucleus, chloroplast, and mitochondria, thus ensuring the normal growth and development of plants. However, once the coordination was broken during the pollen development, the normal information exchange between the nuclear and cytoplasm changed, then probably resulting in pollen abortion29. In other words, the coordination among the nucleus, chloroplast, and mitochondria of kenaf pollen cells might be broken due to the deviant cpDNA thus leading to the CMS of kenaf.

Conclusions

We sequenced and characterized the chloroplast genomes of kenaf CMS line P3A and its maintainer line P3B. The bio-informatics comparison analysis of chloroplast genomes among Malvales was performed. SNP and Indel between the two lines were also detected, analyzed, and validated. Our findings revealed the differences in cpDNA between P3B and P3A, which provided basic information for the further study of kenaf CMS mechsnism.

Materials and methods

Sample collection

Kenaf CMS line P3A and its maintainer line P3B were used in the present study. Seeds of both the cultivars were sown and cultivated in half-strength Hoagland solution as described in our previous study30. Leaves from 25-day-old seedlings were collected and frozen with liquid nitrogen immediately.

Chloroplast DNA (cpDNA) sequencing and genome assembly

Approximately 5 g of fresh leaves were harvested for DNA isolation using an improved extraction method31. After DNA isolation, 1 μg of purified DNA was fragmented and used to construct short-insert libraries according to the manufacturer’s instructions (Illumina), then sequenced on the Illumina Hiseq 4000 and PacBio platforms32.

Prior to assembly, raw reads were filtered. This filtering step was performed to remove the reads with adaptors, the reads showing a quality score below 20 (Q < 20), the reads containing a percentage of uncalled based (“N” characters) equal or greater than 10%, and the duplicated sequences. The chloroplast genome was reconstructed using a combination of de novo and reference-guided assemblies, and the following three steps were used to assemble chloroplast genomes33. First, the filtered reads were assembled into contigs using software SOAPdenovo2.0434. Second, contigs were aligned to the reference genome of Hibiscus syriacus (Accession: NC_026909.1) using BLAST, and aligned contigs (≥ 80% similarity and query coverage) were ordered according to the reference genome. Third, clean reads were mapped to the assembled draft chloroplast genome to correct the wrong bases, and the majority of gaps were filled through the local assembly.

Genome annotation

The online DOGMA tool35 with default parameters was used to predict protein-coding genes, transfer RNA (tRNA) genes, and ribosome RNA (rRNA) genes. A whole chloroplast genome blast search (E-value ≤ 1e−5, minimal alignment length percentage ≥ 40%)36 was performed against 5 databases, including Kyoto Encyclopedia of Genes and Genomes (KEGG)37,38,39, Clusters of Orthologous Groups (COG)40,41, Non-Redundant Protein Database (NR), Swiss-Prot42, and Gene Ontology (GO)43 databases. The circular chloroplast genome maps of P3A and P3B were drawn using OrganellarGenomeDRAWv1.244.

Comparative chloroplast genome analysis

The complete chloroplast genomes of Hibiscus cannabinus (P3B and P3A) were compared with those of three other species, Hibiscus syriacus Linn, Abelmoschus esculentus, and Gossypium hirsutum using the mVISTA program in a Shuffle-LAGAN mode45. The contraction/expansion regions of the inverted repeat (IR) were compared among P3B, P3A, Abelmoschus esculentus, and Gossypium hirsutum. P3B was set as a reference for SNP and indel analysis between P3B and P3A.

Phylogenetic analysis

The 19 completed chloroplast genome sequences representing Malvales plants were downloaded from the NCBI (Supplementary Table S5). Phylogenetic analysis was conducted based on maximum likelihood (ML) analysis using the general time-reversible invariant-sites (GTR + I) nucleotide substitution model with the default parameters in PhyML v3.0 (http://www.atgc-montpellier.fr/phyml/). The bootstrap probability of each branch was calculated by 1000 replications.

SNP detection

The different sites between the sample sequence (P3A) and the reference sequence (P3B) were detected using MUMmer46 software. The potential SNP sites were checked through preliminary filtering. After that, the 100 bp sequences on both sides of the SNP sites of the reference sequence were extracted, and then BLAT47 (version: 35, http://genome.ucsc.edu) software was used and verified by comparing the extracted sequences with the assembly results. If the alignment length is less than 101 bp, it is considered as untrusted SNP, which will be removed. If the alignment is repeated many times, it is considered as repeated SNP, which will also be removed. Subsequently, BLAST36, RepeatMasker48, and TRF49 software were used to detect the repeated sequence area of the reference sequence, filter the SNP located in the repeated area, and finally obtain reliable SNP.

Indel detection

Using LASTZ50,51 software, we compared the sample and reference sequences and the best alignment results were selected through the processing of axt_correction, axtsort and axtbest programs, and the preliminary Indel results were obtained. Then, 150 bp upstream and downstream of the Indel site of the reference sequence were compared with the sequencing reads of the sample by BWA52 (http://bio-bwa.sourceforge.net/) software and SAMtools53,54 (http://samtools.sourceforge.net/). Finally, the reliable Indel was obtained by filtration.

DNA isolation and PCR validation

The leaves used for chloroplast sequencing were used for DNA extraction and PCR validation. DNA isolation was performed according to the CTAB protocol55 with minor modifications. PCR amplification was conducted according to the following procedures: initial denaturation at 95 °C for 3 min; 30 cycles of denaturation at 95 °C for 30 s, annealing at 50–60 °C for 1–2 min, extension at 72 °C for 1 min; and final extension at 72 °C for 5 min. The primer information was listed in Supplementary Table S6. For PCR reactions, each 25 μL reaction mixture contained 12.5 μL of 2 × Phanta Max Master Mix (Vazyme, China), 1.5 μL of primer (10 ppmol, forward primer and reverse primer each 0.75 μL), 1.5 μL of genomic DNA, and 9.5 μL of ddH2O. Then, PCR products were recycled, cloned, and sequenced.